
Machine Learning
Stanford Scaled Machine Learning Conference, August 2nd 2016
Qi Lu, Applications & Services Group, Microsoft

Agenda
§ What We Do
  § History
  § Going forward
§ How We Scale
  § CNTK
  § FPGA
  § Open Mind
§ Q&A

What We Do

ML @ Microsoft: History
Answering questions with experience

§ 1991: Microsoft Research formed
§ 1997: Hotmail launches ("Which email is junk?")
§ 2008: Bing Maps launches ("What's the best way home?")
§ 2009: Bing search launches ("Which URLs are most relevant?")
§ 2010: Translator launches ("What is that person saying?")
§ 2014: Azure Machine Learning GA ("What will happen next?")
§ 2015: Office 365 Substrate, HoloLens ("What does that moon 'mean'?")

Machine learning is pervasive throughout Microsoft products.

ML @ Microsoft: Going Forward
§ Data => Model => Intelligence => fuels of innovation
§ Applications & Services
  § Office 365, Dynamics 365 (Biz SaaS), Skype, Bing, …
  § Digital Work & Digital Life
  § Models for: World, Organizations, Users, Languages, Context, …
§ Computing Devices
  § PC, Tablet, Phone, Wearable, HoloLens (AR/VR), …
  § Models for: Natural User Interactions, Reality, …
§ Cloud
  § Azure Infrastructure and Platform
  § Azure ML Tools & Services
  § Intelligence Services

Machine Learning Building Blocks

Azure ML (Cloud): ease of use through visual workflows
§ Single-click operationalization
§ Expand reach with Gallery and marketplace
§ Integration with Jupyter Notebook
§ Integration with R/Python

Microsoft R (On-Prem & Cloud): enterprise scale & performance
§ Write once, deploy anywhere
§ R Tools for Visual Studio IDE
§ Secure/scalable operationalization
§ Works with open-source R
§ Train on TBs of data
§ Run large, massively parallel compute and data jobs

Computational Network Toolkit (CNTK): designed for peak performance
§ Works on CPU and GPU (single/multi)
§ Supports popular network types (FNN, CNN, LSTM, RNN)
§ Highly flexible network-description language
§ Used to build the Cognitive APIs

Cognitive Services (Cloud): see, hear, interpret, and interact
§ Prebuilt APIs, built with CNTK and by experts
§ Vision, Speech, Language, Knowledge
§ Build and connect intelligent bots
§ Interact with your users on SMS, text, email, Slack, Skype

HDInsight/Spark (Cloud Services): open-source Hadoop and Spark
§ Use Spark ML or MLlib from Java, Python, Scala, or R (see the sketch below)
§ Support for Zeppelin and Jupyter notebooks
§ Includes MRS over Hadoop or over Spark
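As a taste of the HDInsight/Spark entry above, here is a minimal PySpark sketch that trains a logistic regression with Spark ML. The toy DataFrame is invented for illustration; only the Spark ML calls themselves are standard.

```python
# Minimal Spark ML sketch: fit a logistic regression on a toy DataFrame.
# Runs on any Spark 2.x+ cluster (e.g., HDInsight); the data is made up.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("hdinsight-sketch").getOrCreate()

train = spark.createDataFrame(
    [(1.0, Vectors.dense(0.0, 1.1)),
     (0.0, Vectors.dense(2.0, 1.0)),
     (1.0, Vectors.dense(0.1, 1.2))],
    ["label", "features"])

model = LogisticRegression(maxIter=10, regParam=0.01).fit(train)
model.transform(train).select("label", "prediction").show()
```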

Azure Machine Learning Services
§ Ease-of-use tools with a drag/drop paradigm; single-click operationalization
§ Built-in support for statistical functions; data ingest, transform, feature generation/selection, train, score, and evaluate for tabular data and text, across classification, clustering, recommendation, and anomaly detection
§ Seamless R/Python integration, along with SQLite support to filter and transform
§ Jupyter Notebooks for data exploration and Gallery extensions for quick starts
§ Modules for text preprocessing, key phrase extraction, language detection, n-gram generation, LDA, compressed feature hashing, and statistics-based anomaly detection (n-grams and hashing are sketched below)
§ Spark/HDInsight/MRS integration
§ GPU support
§ New geographies
§ Compute reservation
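To make the text modules concrete, here is what n-gram generation plus a compressed (hashed) feature representation compute. This is plain Python for exposition only; it is not the Azure ML module API.

```python
# Illustrative only: word n-grams plus the hashing trick, the idea behind
# the n-gram generation and compressed-feature-hash modules.
import hashlib

def ngrams(tokens, n_max=2):
    """Yield word n-grams of length 1..n_max."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def hashed_features(text, n_buckets=2**10):
    """Map n-grams into a fixed-size count vector via hashing."""
    vec = [0] * n_buckets
    for gram in ngrams(text.lower().split()):
        h = int(hashlib.md5(gram.encode()).hexdigest(), 16)
        vec[h % n_buckets] += 1
    return vec

vec = hashed_features("machine learning is pervasive throughout Microsoft")
print(sum(vec), "n-grams hashed into", len(vec), "buckets")
```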

Intelligence Suite

Information Management
§ Data Factory
§ Data Catalog
§ Event Hubs

Big Data Stores
§ Data Lake Store
§ SQL Data Warehouse

Machine Learning and Analytics
§ Machine Learning
§ Data Lake Analytics
§ HDInsight (Hadoop and Spark)
§ Stream Analytics

Intelligence
§ Cognitive Services
§ Bot Framework
§ Cortana

Dashboards & Visualizations
§ Power BI

Delivered through Web, Mobile, and Bots.

Data => Intelligence => Action

Cognitive Services

How We Scale

Key Dimensions of Scaling
§ Data volume / dimension
§ Model / algorithm complexity
§ Training / evaluation time
§ Deployment / update velocity
§ Developer productivity / innovation agility
§ Infrastructure / platform
§ Software framework / tool
§ Data set / algorithm

How We Scale Example: CNTK

CNTK: Computational Network Toolkit
§ CNTK is Microsoft's open-source, cross-platform toolkit for learning and evaluating models, especially deep neural networks
§ CNTK expresses (nearly) arbitrary neural networks by composing simple building blocks into complex computational networks (illustrated below), supporting common network types and applications
§ CNTK is production-deployed: accurate, efficient, and scales to multi-GPU/multi-server
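To picture "composing simple building blocks", here is a toy numpy sketch that wires a two-layer network out of three primitive operations. It illustrates the composition idea only; it is not CNTK's actual API.

```python
# Toy computational network: compose primitive blocks (times, plus,
# sigmoid) into a two-layer network and run it forward. Not CNTK's API.
import numpy as np

def times(W, x):
    return W @ x

def plus(a, b):
    return a + b

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

x = np.array([0.5, -1.0, 2.0])
h = sigmoid(plus(times(W1, x), b1))   # hidden layer
y = sigmoid(plus(times(W2, h), b2))   # output layer
print("network output:", y)
```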

CNTK Development
§ Open-source development model inside and outside the company
  § Created by Microsoft Speech researchers 4 years ago; open-sourced in early 2015
  § On GitHub since Jan 2016 under a permissive license
  § Nearly all development is out in the open
§ Driving applications: Speech, Bing, HoloLens, MSR research
  § Each team has full-time employees actively contributing to CNTK
  § CNTK-trained models are tested and deployed in production environments
§ External contributions
  § e.g., from MIT and Stanford
§ Platforms and runtimes
  § Linux, Windows, .NET, Docker, cuDNN v5
  § Python, C++, and C# APIs coming soon

CNTK Design Goals & Approach
§ A deep learning framework that balances
  § Efficiency: can train production systems as fast as possible
  § Performance: can achieve best-in-class performance on benchmark tasks for production systems
  § Flexibility: can support a growing and wide variety of tasks such as speech, vision, and text; can try out new ideas very quickly
§ Lego-like composability
  § Supports a wide range of networks, e.g., feed-forward DNN, RNN, CNN, LSTM, DSSM, sequence-to-sequence
§ Evolve and adapt
  § Design for emerging prevailing patterns

Key Functionalities & Capabilities
§ Supports
  § CPU and GPU, with a focus on GPU clusters
  § Automatic numerical differentiation
  § Efficient static and recurrent network training through batching
  § Data parallelization within and across machines, e.g., 1-bit quantized SGD (sketched below)
  § Memory sharing during execution planning
§ Modularization, with separation of
  § Computational networks
  § Execution engine
  § Learning algorithms
  § Model description
  § Data readers
§ Model descriptions via
  § Network definition language (NDL) and model editing language (MEL)
  § BrainScript (beta) with easy-to-understand syntax
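A minimal numpy sketch of the idea behind 1-bit quantized SGD (Seide et al., 2014): each worker sends one bit per gradient element plus a scale, and folds the quantization error into the next step (error feedback). This simplifies the published algorithm, which quantizes per column, and it is not CNTK's implementation.

```python
# Sketch of 1-bit gradient quantization with error feedback. Simplified:
# the published algorithm quantizes per column, with separate
# reconstruction values for the 0 and 1 bits.
import numpy as np

def one_bit_quantize(grad, residual):
    """Quantize grad + residual to one bit per element; return the
    reconstructed gradient and the new residual (error feedback)."""
    g = grad + residual             # fold in last step's quantization error
    scale = np.mean(np.abs(g))      # one reconstruction value for all bits
    q = np.where(g >= 0.0, scale, -scale)
    return q, g - q                 # residual carries what the bits lost

rng = np.random.default_rng(1)
grad = rng.normal(size=8)
residual = np.zeros_like(grad)
q, residual = one_bit_quantize(grad, residual)
print("bits on the wire:", (q > 0).astype(int))  # 1 bit per element
print("residual norm:", np.linalg.norm(residual))
```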

Architecture Roadmap
§ CNTK as a library
  § More language support: Python/C++/C#/.NET
§ More expressiveness
  § Nested loops, sparse support
§ Finer control of the learner
  § SGD with non-standard loops, e.g., RL
§ Larger models
  § Model parallelism, memory swapping, 16-bit floats
§ More powerful CNTK service on Azure
  § GPUs soon; longer term with clusters, containers, new HW (e.g., FPGA)

How We Scale Example: FPGA

Catapult v2 Architecture
[Diagram: the Catapult WCS mezzanine card ("Pike's Peak") in a WCS 2.0 server blade ("Mt. Hood"). Two CPUs are linked by QPI; the FPGA, with its own DRAM, attaches to the CPUs over PCIe Gen3 2x8 and to the NIC over Gen3 x8; QSFP ports run at 40Gb/s to the NIC and to the switch. Shown as a WCS Gen4.1 blade with a Mellanox NIC and the Catapult FPGA, seated on the WCS tray backplane via mezzanine option-card connectors.]
§ Gives substantial acceleration flexibility
  § Can act as a local compute accelerator
  § Can act as a network/storage accelerator
  § Can act as a remote compute accelerator

Configurable Clouds

§ The cloud becomes network + FPGAs attached to servers
§ Can continuously upgrade/change datacenter HW protocols (network, storage, security)
§ Can also be used as an application acceleration plane: Hardware Acceleration as a Service (HaaS)
  § Services communicate with no SW intervention (LTL)
  § Single workloads (including deep learning) can grab 10s, 100s, or 1000s of FPGAs
  § Can create service pools as well, for high throughput
[Diagram: servers (CS) under ToR switches, with FPGA pools providing text-to-speech acceleration, large-scale deep learning, and Bing ranking (HW and SW).]

Scalable Deep Learning on FPGAs

[Diagram: the Scale ML Engine on an FPGA: an instruction decoder & control block with L0/L1 instruction paths feeding an array of neural functional units (FUs), serving an NN model over FPGAs via HaaS.]

§ Scale ML Engine: a flexible DNN accelerator on FPGA
  § Fully programmable via software and a customizable ISA
  § Over 10X improvement in energy efficiency, cost, and latency versus CPU
§ Deployable as large-scale DNN service pools via HaaS
  § Low-latency communication: a few microseconds per hop (see the back-of-envelope sketch below)
  § Large-scale models at ultra-low latencies
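Why microsecond hops matter: a back-of-envelope sketch of serving latency when a model is split across a pool of FPGAs. The compute and hop costs below are illustrative assumptions, not measured Catapult numbers.

```python
# Back-of-envelope: latency of one request through a model partitioned
# across n FPGAs in a chain. All numbers are illustrative assumptions.
TOTAL_COMPUTE_US = 500.0  # assumed total model compute on a single FPGA
HOP_US = 2.0              # assumed communication cost per hop

for n in (1, 4, 16, 64):
    latency_us = TOTAL_COMPUTE_US / n + (n - 1) * HOP_US
    print(f"{n:3d} FPGAs: {latency_us:7.1f} us end to end")
```

With hops this cheap, splitting the model 16 ways cuts latency by roughly an order of magnitude; with millisecond network hops, the same split would make latency worse.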

How We Scale Example: Open Mind

Open Mind Studio: the "Visual Studio" for Machine Learning
A layered stack:
§ Data, Model, Algorithm, Pipeline, Experiment, and Life Cycle Management
§ Programming Abstractions for Machine Learning / Deep Learning
§ Frameworks: CNTK; other deep learning frameworks (e.g., Caffe, MXNet, TensorFlow, Theano, Torch); open-source computation frameworks (e.g., Hadoop, Spark); specialized, optimized computation frameworks (e.g., SCOPE, ChaNa); the next new framework …
§ Federated Infrastructure: data storage, compliance, resource management, scheduling, and deployment
§ Heterogeneous Computing Platform (CPU, GPU, FPGA, RDMA; Cloud, Client/Device)

ChaNa: RDMA-Optimized Computation Framework
§ Focus on a faster network
  § Compact memory representation
  § Balanced parallelism
  § Highly optimized RDMA-aware communication primitives
  § Overlapping communication and computation (sketched below)
§ An order of magnitude improvement in early results
  § Over existing computation frameworks (with TCP)
  § Against several large-scale workloads in production
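Overlapping communication and computation can be pictured with double buffering: fetch chunk n+1 while computing on chunk n. The sketch below uses a Python thread as a stand-in for an asynchronous RDMA transfer; it is illustrative only, not ChaNa code.

```python
# Double buffering: overlap the fetch of the next chunk with computation
# on the current one. A thread stands in for an asynchronous RDMA read.
import threading
import queue

def fetch(chunk_id, out):
    """Stand-in for an asynchronous RDMA read of one data chunk."""
    out.put(f"data-{chunk_id}")

def compute(data):
    """Stand-in for per-chunk computation."""
    return len(data)

buf = queue.Queue(maxsize=1)
threading.Thread(target=fetch, args=(0, buf)).start()
results = []
for nxt in range(1, 4):
    data = buf.get()                       # current chunk is ready
    t = threading.Thread(target=fetch, args=(nxt, buf))
    t.start()                              # prefetch the next chunk...
    results.append(compute(data))          # ...while computing on this one
    t.join()
results.append(compute(buf.get()))         # last chunk
print("processed", len(results), "chunks with fetch/compute overlap")
```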

Programming Abstraction for Machine Learning
§ Graph Engines for Distributed Machine Learning
  § Automatic system-level optimizations
    § Parallelization and distribution
    § Layout for efficient data access
    § Partitioning for balanced parallelism
§ Promising early results
  § Simplification of distributed ML programs via high-level abstractions
    § About 70-80% reduction in code, relative to ML systems such as Petuum and Parameter Server
  § Matrix Factorization for recommendation systems
  § Latent Dirichlet Allocation for topic modeling

Q&A

Thank You!