DL ON SATURNV CLUSTER

The World’s Greenest Supercomputer
HPC Advisory Council, Lugano 2017
Gunter Roeth, SA

A DECADE OF SCIENTIFIC COMPUTING WITH GPUS

Timeline 2006 to 2016, milestones shown on the slide:
- CUDA Launched
- World’s First GPU Top500 System
- Fermi: World’s First HPC GPU
- Discovered How H1N1 Mutates to Resist Drugs
- World’s First 3-D Mapping of Human Genome
- Oak Ridge Deploys World’s Fastest Supercomputer w/ GPUs
- Stanford Builds AI Machine using GPUs
- AlexNet beats expert code by huge margin using GPUs
- Google Outperforms Humans in ImageNet
- World’s First Atomic Model of HIV Capsid
- GPU-Trained AI Machine Beats World Champion in Go

DEEP LEARNING is the Next Frontier for HPC

“…around 2008 my group at Stanford started advocating shifting deep learning to GPUs (this was really controversial at that time; but now everyone does it); and I'm now advocating shifting to HPC (High Performance Computing/Supercomputing) tactics for scaling up deep learning. Machine learning should embrace HPC. These methods will make researchers more efficient and help accelerate the progress of our whole field.” – Andrew Ng, Feb 2016

Neural Network Design Expertise + HPC Design Expertise + Image & Video Recognition Expertise + GPU Programming Expertise

BIG DATA – FROM HPC TO HYPERSCALE TO ...

Two different workloads
- HPC is parallel simulation that produces huge datasets
- Hyperscale collects data and produces analytics

Can they converge? One possible future:
- HPC feeds Hyperscale
- Hyperscale analysis affects HPC model/data
- Move workloads onto same system (stacks side by side)
- Eventually merge stack

Diagram: HPC (parallel simulations produce huge data) next to Hyperscale (collects and analyzes data), each drawn as data in, compute, storage, and data out. Size labels in the diagram: small and big data in, huge and big compute and storage, small and huge data out.

Diagram labels: Discovery, Prediction, Similar engine, Insight, Autonomy

Create data | Analyze data

Source: MCBDA2016, Louis Capps, [email protected]

MICROSOFT AND NVIDIA HGX-1 ANNOUNCEMENT
http://nvidianews.nvidia.com/news/nvidia-and-microsoft-boost-ai-cloud-computing-with-launch-of-industry-standard-hyperscale-gpu-accelerator

PROJECT OLYMPUS HGX-1 HYPERSCALE GPU ACCELERATOR
PARTNERSHIP + INTEROPERABILITY

CLOUD CHALLENGES
- 1 SKU, Multiple Instances
- Integration into Existing Datacenter

INSTANCES
- Granular, Latency Sensitive
- High Throughput Batch
- HPC: different CPU:GPU ratios
- DevOps / Development
- Production Deployment

Project Olympus HGX-1 Hyperscale GPU Accelerator
- Configurable PCIe Cable to host + Expansion slots
- NVIDIA P100 GPU
- NVLink Hybrid Cube Mesh Fabric, 20 Gbyte/sec per link, duplex
- Adapters for other GPUs

DEEP LEARNING

- 2 CPU : 8 GPU, 8x P100 SXM2 | 4x x16 PCIe
- 2 CPU : 16 GPU, 16x P100 SXM2 | 4x x16 PCIe

HPC

- 4 CPU : 8 GPU, 8x P100 SXM2 | 4x x16 PCIe
- 8 CPU : 8 GPU, 8x P100 SXM2 | 8x x16 PCIe

INFERENCE

- 2 CPU : 8 GPU, 8x P100 SXM2 | 4x x16 PCIe
- 8 CPU : 32 GPU, 32x P4 PCIe | 8x x16 PCIe
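The HGX-1 configurations listed for deep learning, HPC, and inference differ mainly in their CPU:GPU ratio. A minimal Python sketch that tabulates those ratios is shown here. The CPU, GPU, and PCIe counts are transcribed from the slides; the class name `Hgx1Config` and the helper `ratio_table` are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hgx1Config:
    """One HGX-1 configuration as listed on the slides (hypothetical type)."""
    workload: str
    cpus: int
    gpus: int
    gpu_type: str
    pcie_x16_links: int

    @property
    def gpus_per_cpu(self) -> float:
        return self.gpus / self.cpus


# Values transcribed from the DEEP LEARNING, HPC, and INFERENCE slides.
CONFIGS = [
    Hgx1Config("deep learning", 2, 8, "P100 SXM2", 4),
    Hgx1Config("deep learning", 2, 16, "P100 SXM2", 4),
    Hgx1Config("hpc", 4, 8, "P100 SXM2", 4),
    Hgx1Config("hpc", 8, 8, "P100 SXM2", 8),
    Hgx1Config("inference", 2, 8, "P100 SXM2", 4),
    Hgx1Config("inference", 8, 32, "P4 PCIe", 8),
]


def ratio_table(configs):
    """Return (workload, 'N CPU : M GPU', GPUs per CPU) for each configuration."""
    return [
        (c.workload, f"{c.cpus} CPU : {c.gpus} GPU", c.gpus_per_cpu)
        for c in configs
    ]


if __name__ == "__main__":
    for workload, label, ratio in ratio_table(CONFIGS):
        print(f"{workload:14s} {label:18s} {ratio:.1f} GPUs per CPU")
```

In this listing the HPC configurations carry the most CPU per GPU (1 to 2 GPUs per CPU), while the deep learning configurations go up to 8 GPUs per CPU.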

WHY THE EXCITEMENT?

GPUs as Enablers of Breakthrough Results

AlexNet Training Performance (chart): relative training speed by year, 2013 to 2016, for K40, K80 + cuDNN, M40 + cuDNN4, and P100 + cuDNN5.

65x in 3 Years
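To put "65x in 3 Years" in per-year terms, the short Python sketch below computes the equivalent constant yearly factor. It assumes the speedup compounds evenly across the three years, which the slide does not state; the function name `annual_factor` is hypothetical.

```python
def annual_factor(total_speedup: float, years: float) -> float:
    """Constant per-year factor that compounds to total_speedup over years."""
    return total_speedup ** (1.0 / years)


# 65x over 3 years, as quoted on the slide.
factor = annual_factor(65.0, 3.0)
print(f"{factor:.2f}x per year")  # prints "4.02x per year"
```

Under that assumption, AlexNet training throughput grew by roughly a factor of four each year.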

- We can generate photorealistic images from textual descriptions now!
- Achieve super-human accuracy in classification
- And we are getting faster fast

Paper: H. Zhang et al., StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks, arXiv:1612.03242

“SUPERHUMAN” RESULTS SPARK HYPERSCALE ADOPTION

ImageNet accuracy (chart), 2010 to 2015. Series: Hand-coded CV, Deep Learning, Human. Data labels: 72%, 74%, 74%, 76%, 84%, 88%, 93%, 96%.

Cloud Services with AI Powered by NVIDIA: Alibaba/Aliyun, Amazon, Baidu, eBay, Facebook, Flickr, Google, iFLYTEK, iQIYI, JD.com, Orange, Periscope, Pinterest, Qihoo 360, Shazam, Skype, Sogou, Twitter, Yahoo Supermarket, Yandex, Yelp

DEEP LEARNING EVERYWHERE

- INTERNET & CLOUD: Image Classification, Speech Recognition, Language Translation, Language Processing, Sentiment Analysis, Recommendation
- MEDICINE & BIOLOGY: Cancer Cell Detection, Diabetic Grading, Drug Discovery
- MEDIA & ENTERTAINMENT: Video Captioning, Video Search, Real Time Translation
- SECURITY & DEFENSE: Face Detection, Video Surveillance, Satellite Imagery
- AUTONOMOUS MACHINES: Pedestrian Detection, Lane Tracking, Recognize Traffic Sign

GPUS IN ARTIFICIAL INTELLIGENCE

Diagram: Deep Learning nested inside Neural Networks, nested inside Machine Learning.

- Deep learning replaces the hand-tuned parameters of the feature extraction steps (e.g. in voice and image recognition).
- Deep learning is a subset of machine learning that refers to artificial neural networks composed of many layers.
- Artificial neural networks are inspired by the human brain and need lots of training data (ideal for Big Data).

NVIDIA GPUs and cuDNN software are broadly adopted for machine learning.

NVIDIA DEEP LEARNING SDK
High Performance GPU-Acceleration for Deep Learning

APPLICATIONS
- Areas: COMPUTER VISION, SPEECH AND AUDIO, BEHAVIOR
- Examples: Image Classification, Object Detection, Voice Recognition, Translation, Recommendation Engines, Sentiment Analysis

FRAMEWORKS
- Mocha.jl

DEEP LEARNING SDK
- DEEP LEARNING: cuDNN
- MATH LIBRARIES: cuBLAS, cuSPARSE, cuFFT
- MULTI-GPU: NCCL
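The multi-GPU entry in the SDK is NCCL, a collective-communication library whose operations include all-reduce. The following is a pure-Python sketch of what an all-reduce contributes to synchronous data-parallel training: each worker computes a gradient on its own data shard, the gradients are averaged, and every worker applies the same update. It does not use NCCL's API; the toy model `y = w * x`, the data, and all function names are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Shard = Sequence[Tuple[float, float]]


def all_reduce_mean(worker_grads: List[List[float]]) -> List[float]:
    """Element-wise mean across workers (the result an all-reduce sum
    followed by a divide would leave on every worker)."""
    n_workers = len(worker_grads)
    return [sum(vals) / n_workers for vals in zip(*worker_grads)]


def local_gradient(w: float, shard: Shard) -> List[float]:
    """Gradient of mean squared error for the toy model y = w * x on one shard."""
    return [sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)]


def sync_sgd_step(w: float, shards: List[Shard], lr: float) -> float:
    """One synchronous step: local gradients, all-reduce, identical update."""
    grads = [local_gradient(w, shard) for shard in shards]  # one per simulated GPU
    avg = all_reduce_mean(grads)
    return w - lr * avg[0]


# Toy data generated by y = 2 * x, split evenly across two simulated workers.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shards = [data[:2], data[2:]]

w = 0.0
for _ in range(50):
    w = sync_sgd_step(w, shards, lr=0.05)
print(round(w, 3))
```

With equally sized shards, the averaged gradient equals the full-batch gradient, so the simulated workers follow the same trajectory as single-device training and `w` approaches 2.0.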

Platform | Tensorflow | CNTK | MXNet | Caffe | Theano | Torch
Release Date | 2016 | 2016 | 2015 | 2014 | 2010 | 2011
Core Language | C++ | C++ | C++ | C++ | C++ | C
API | C++, Python | NDL | C++, Python, R, Scala, Matlab, Javascript, Go, Julia | Python, Matlab | Python | Lua
Synchronisation Model | Sync or async | Sync | Sync or async | Sync | Async | Sync
Communication Mode | Parameter server | MPI | Parameter server | N/A | N/A | N/A
Multi-GPU | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Multi-node | ✓ | ✓ | ✓ | ✗ | ✗ | ✗
Data Parallelism | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Model Parallelism | ✓ | N/A | ✓ | ✗ | ✓ | ✓
Fault Tolerance | Checkpoint and recovery | Checkpoint and resume | Checkpoint and resume | N/A | Checkpoint and resume | Checkpoint and resume
Visualisation | Graph | Graph (static) | None | Summary Statistics | Graph (static) | Plots

Further qualifiers in the table:
- Communication Mode: (Spark/Custom)
- Visualisation: (interactive), (custom), training monitor, debugging tools

Source: Fox, James, Yiming Zou, and Judy Qiu. "Software Frameworks for Deep Learning at Scale."

NVIDIA DEEP LEARNING PERFORMANCE GUIDE

TensorFlow Deep Learning Framework
Training on 8x P100 GPU Server vs 8x K80 GPU Server

TensorFlow Deep Learning Training: an open-source software library for numerical computation using data flow graphs.
- VERSION: 1.0
- ACCELERATED FEATURES: Full framework accelerated
- SCALABILITY: Multi-GPU and multi-node
- More Information: https://www.tensorflow.org/

Chart: speedup vs. server with 8x K80 for AlexNet, GoogleNet, ResNet-50, ResNet-152, VGG16
- Server with 8x P100 PCIe 16GB: 2.5x avg. speedup
- Server with 8x P100 16GB NVLink: 3x avg. speedup

GPU Servers: Single Xeon E5-2690 [email protected] with GPUs configs as shown; Ubuntu 14.04.5, CUDA 8.0.42, cuDNN 6.0.5; NCCL 1.6.1; data set: ImageNet; batch sizes: AlexNet (128), GoogleNet (256), ResNet-50 (64), ResNet-152 (32), VGG-16 (32)

CNTK Deep Learning Framework
Training on 8x P100 GPU Server vs 8x K80 GPU Server

CNTK Deep Learning Training: a free, easy-to-use, open-source, commercial-grade toolkit that trains deep learning algorithms to learn like the human brain.
- VERSION: 1.0
- ACCELERATED FEATURES: Full framework accelerated
- SCALABILITY: Multi-GPU and multi-node
- More Information: www.microsoft.com/en-us/research/product/cognitive-toolkit/

Chart: speedup vs. server with 8x K80 for AlexNet, ResNet-50
- Server with 8x P100 PCIe 16GB: 1.6x avg. speedup
- Server with 8x P100 16GB NVLink: 2.7x avg. speedup

GPU Servers: Single Xeon E5-2690 [email protected] with GPUs configs as shown; Ubuntu 14.04.5, CUDA 8.0.42, cuDNN 6.0.5; NCCL 1.6.1; data set: ImageNet; batch sizes: AlexNet (128), ResNet-50 (64)

Deep Learning Performance

INFERENCE

Deep Learning Inference
1x GPU Server Throughput Performance vs. Single-Socket CPU Server

Deep Learning Inference using TensorRT on CAFFE: popular GPU-accelerated framework using NVIDIA’s TensorRT 1.0 Inference Engine.
- VERSION: CAFFE 1.0, TensorRT 1.0
- ACCELERATED FEATURES: CPU & GPU versions available
- SCALABILITY: Multi-GPU
- More Information: http://caffe.berkeleyvision.org/ and https://developer.nvidia.com/tensorrt

Chart: speedup vs. single-socket CPU server for AlexNet, GoogleNet, ResNet-152, VGG-19. Configurations: 1x P4 (FP32), 1x P4 (INT8), 1x P100 (FP32), 1x P100 (FP16). Avg. speedup labels on the chart: 6x, 13x, 20x, 25x.

CPU Server: Single Xeon E5-2690 [email protected]
GPU Servers: Single Xeon E5-2690 [email protected] with x1 P100 16GB or x1 P4 GPU; Ubuntu 14.04.5, TensorRT 2.0, CUDA 8.0.42, cuDNN 6.0.5; NCCL 1.6.1, batch size 128, precision as indicated.

GPU DEEP LEARNING IS A NEW COMPUTING MODEL

TRAINING
- Billions of Trillions of Operations
- GPU train larger models, accelerate time to market

DATACENTER INFERENCING
- 10s of billions of image, voice, video queries per day
- GPU inference for fast response, maximize datacenter throughput

Diagram labels: Training, Datacenter, Device

WHAT IS SATURNV?

NVIDIA DGX SATURNV
Giant Leap Towards Exascale AI

Fastest AI Supercomputer in TOP500
- 4.9 Petaflops Peak FP64
- 19.6 Petaflops Peak FP16
- 13 DGX-1 to get into Top500

Most Energy Efficient Supercomputer
- #1 Green500
- 9.5 GFLOPS per Watt

Rocket for Cancer Moonshot
- CANDLE Development Platform
- Common platform with DOE labs: ANL, LLNL, ORNL, LANL

HOW DID WE BUILD SATURNV?

WHY NVIDIA DGX SATURNV
124 Node Supercomputing Cluster

Innovation needs a deep learning supercomputer!

Deep Learning scalability; move outside the box

Focus on research
- Used internally for Deep Learning applied research
- Multiple users, algorithms, networks, new approaches
- Embedded, robotic, auto, hyperscale, HPC

Partner with university research, government and industry collaborations

Study convergence of data science and HPC

NVIDIA DGX-1 DEEP LEARNING SYSTEM

ONE ARCHITECTURE BUILT FOR BOTH DATA SCIENCE & COMPUTATIONAL SCIENCE

Charts: speed-up vs 1x KNL server, comparing increasing numbers of Knights Landing servers with 1x NVIDIA DGX-1 (GPU-accelerated server).
- AlexNet Training (1 to 128 Knights Landing servers): DGX-1 Faster than 128 Knights Landing Servers
- GTC-P: Plasma Turbulence (1 to 64 Knights Landing servers): DGX-1 Faster than 64 Knights Landing Servers

AlexNet: batch size 256, weak scaling up to 32 KNL servers; 64 & 128 estimated based on ideal scaling; Xeon Phi 7250 nodes.
GTC-P: Grid Size A. Systems: NVIDIA DGX-1, 8x P100; Intel KNL 7250 68 core, Flat-Quadrant mode, Omnipath.

NVIDIA DGX SATURNV 124 node Cluster

nvidia.com/dgx1

124 NVIDIA DGX-1 Nodes, 992 P100 GPUs. Per node:
- 8x NVIDIA Tesla P100 SXM GPUs, NVLINK CubeMesh
- 2x Intel Xeon 20 core CPUs
- 512 GB DDR4 System Memory
- SSD: 7 TB scratch + 0.5 TB OS

Mellanox 36 port EDR L1 and L2 switches
- 4 ports per system
- Partial Fat tree topology

Software
- Ubuntu 14.04, CUDA 8, OpenMPI 1.10.3
- NVIDIA GPU BLAS + Intel MKL (NVIDIA GPU HPL)

Deep Learning applied research
- Many users, frameworks, algorithms, networks, new approaches
- Embedded, robotic, auto, hyperscale, HPC
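The cluster figures quoted in the slides can be cross-checked with a few lines of arithmetic. The Python sketch below uses only numbers from the deck (124 nodes, 8 GPUs per node, 4.9 and 19.6 petaflops peak). The per-node value is simply the cluster peak divided by the node count, so it does not separate GPU and CPU contributions. Variable names are hypothetical.

```python
# Figures quoted on the SATURNV slides.
NODES = 124
GPUS_PER_NODE = 8
PEAK_FP64_PFLOPS = 4.9
PEAK_FP16_PFLOPS = 19.6

# 124 nodes x 8 GPUs should match the "992 P100 GPUs" on the slide.
total_gpus = NODES * GPUS_PER_NODE

# Ratio between the quoted FP16 and FP64 peaks.
fp16_to_fp64 = PEAK_FP16_PFLOPS / PEAK_FP64_PFLOPS

# Cluster FP64 peak shared evenly across nodes, in teraflops.
peak_fp64_per_node_tflops = PEAK_FP64_PFLOPS * 1000 / NODES

print(total_gpus)
print(round(fp16_to_fp64, 2))
print(round(peak_fp64_per_node_tflops, 1))
```

The GPU count reproduces the 992 quoted on the slide, the two peak figures differ by a factor of four, and the FP64 peak works out to about 39.5 teraflops per DGX-1 node.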

DEEP LEARNING CLUSTER REFERENCE ARCHITECTURE

NOV 2016 TOP GREEN500 SYSTEM

Green500.org | Top500.org

SATURNV produced a groundbreaking 9.4 GF/W at full scale, which sets the stage for future Exascale class computing

DEEP LEARNING IS VITAL TO HPC

- 92% believe AI will impact their work
- 93% of those using deep learning are seeing positive results

Source: insideHPC.com Survey, November 2016

Examples:
- Monitoring Effects of Carbon and Greenhouse Gas Emissions
- Minute-by-minute AI Weather Forecasting

RESOURCES
For Executives, Developers and Data Scientists

- INTRO MATERIALS
- CASE STUDIES
- SELF-PACED LABS
- ON-SITE WORKSHOPS
- PARTNER COURSES
- TECHNICAL BLOGS

NVIDIA DEEP LEARNING INSTITUTE
Hands-on Training for Data Scientists and Software Engineers

Training organizations and individuals to solve challenging problems using Deep Learning

On-site workshops and online courses presented by certified experts

Covering complete workflows for proven application use cases: self-driving cars, recommendation engines, medical image classification, intelligent video analytics and more

www.nvidia.com/dli

https://www.nvidia.com/en-us/deep-learning-ai/education/

QUESTIONS?