DL ON SATURNV CLUSTER
The World’s Greenest Supercomputer
HPC Advisory Council Lugano 2017
Gunter Roeth, SA

A DECADE OF SCIENTIFIC COMPUTING WITH GPUS
Milestones on the 2006 to 2016 timeline:
- GPU-Trained AI Machine Beats World Champion in Go
- World’s First Atomic Model of HIV Capsid
- Oak Ridge Deploys World’s Fastest Supercomputer w/ GPUs
- Google Outperforms Humans in ImageNet
- Stanford Builds AI Machine using GPUs
- Fermi: World’s First HPC GPU
- AlexNet beats expert code by huge margin using GPUs
- World’s First GPU Top500 System
- Discovered How H1N1 Mutates to Resist Drugs
- World’s First 3-D Mapping of Human Genome
- CUDA Launched

DEEP LEARNING is the Next Frontier for HPC
“…around 2008 my group at Stanford started advocating shifting deep learning to GPUs (this was really controversial at that time; but now everyone does it); and I'm now advocating shifting to HPC (High Performance Computing/Supercomputing) tactics for scaling up deep learning. Machine learning should embrace HPC. These methods will make researchers more efficient and help accelerate the progress of our whole field.” – Andrew Ng, Feb 2016
Neural Network Design Expertise + HPC Design Expertise + Image & Video Recognition Expertise + GPU Programming Expertise
BIG DATA – FROM HPC TO HYPERSCALE TO ...

Two different workloads
- HPC is parallel simulation that produces huge datasets
- Hyperscale collects data and produces analytics

Can they converge?
- One possible future
  - HPC feeds Hyperscale
  - Hyperscale analysis affects HPC model/data
  - Move workloads onto same system (stacks side by side)
  - Eventually merge stack

[Diagram: HPC (create data; parallel simulations produce huge data) beside Hyperscale (analyze data; collects and analyzes data). Each side shows data in, compute, storage and data out, sized Small, Big or Huge, feeding a similar engine. Outcome labels: Discovery, Prediction, Insight, Autonomy.]

MCBDA2016 – Louis Capps, [email protected]

MICROSOFT AND NVIDIA HGX-1 ANNOUNCEMENT
http://nvidianews.nvidia.com/news/nvidia-and-microsoft-boost-ai-cloud-computing-with-launch-of-industry-standard-hyperscale-gpu-accelerator
PROJECT OLYMPUS HGX-1 HYPERSCALE GPU ACCELERATOR
PARTNERSHIP + INTEROPERABILITY

CLOUD CHALLENGES
- 1 SKU, Multiple Instances
- Integration into Existing Datacenter

INSTANCES
- Granular, Latency Sensitive
- High Throughput Batch
- HPC: different CPU:GPU ratios
- DevOps / Development
- Production Deployment

Project Olympus HGX-1 Hyperscale GPU Accelerator
- Configurable PCIe Cable to host + Expansion slots
- NVIDIA P100 GPU
- NVLink Hybrid Cube Mesh Fabric, 20 Gbyte/sec per link, Duplex
- Adapters for other GPUs
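The link arithmetic behind the NVLink Hybrid Cube Mesh can be sketched in a few lines of Python. Only the 20 Gbyte/sec per-link figure comes from the slide. The wiring used here (two fully connected groups of four GPUs, plus one link between matching GPUs in the two groups) is an assumption for illustration, and the names `build_hybrid_cube_mesh` and `LINK_GBYTE_PER_SEC` are hypothetical.

```python
from itertools import combinations

LINK_GBYTE_PER_SEC = 20  # per link, as quoted on the slide


def build_hybrid_cube_mesh(num_gpus=8):
    """Return the set of GPU pairs joined by a link (assumed wiring)."""
    half = num_gpus // 2
    links = set()
    # Assumption: each group of four GPUs is fully connected.
    for group in (range(0, half), range(half, num_gpus)):
        links.update(combinations(group, 2))
    # Assumption: GPU i also has one link to GPU i + half in the other group.
    links.update((i, i + half) for i in range(half))
    return links


links = build_hybrid_cube_mesh()
# Count how many links touch each GPU.
links_per_gpu = {g: sum(g in pair for pair in links) for g in range(8)}
# Aggregate link bandwidth available to each GPU under this wiring.
per_gpu_bandwidth = {g: n * LINK_GBYTE_PER_SEC for g, n in links_per_gpu.items()}

print(len(links), "links in total")
print(links_per_gpu[0], "links per GPU,", per_gpu_bandwidth[0], "Gbyte/sec across them")
```

Under these assumptions every GPU ends up with the same number of links, which is what makes the mesh symmetric for collective operations.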
DEEP LEARNING
- 2 CPU : 8 GPU (8x P100 SXM2 | 4x x16 PCIe)
- 2 CPU : 16 GPU (16x P100 SXM2 | 4x x16 PCIe)

HPC
- 4 CPU : 8 GPU (8x P100 SXM2 | 4x x16 PCIe)
- 8 CPU : 8 GPU (8x P100 SXM2 | 8x x16 PCIe)

INFERENCE
- 2 CPU : 8 GPU (8x P100 SXM2 | 4x x16 PCIe)
- 8 CPU : 32 GPU (32x P4 PCIe | 8x x16 PCIe)
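The CPU:GPU ratios of the deep learning, HPC and inference configurations can be compared directly. The sketch below only restates the numbers from the three configuration slides; the names `CONFIGS` and `gpus_per_cpu` are hypothetical.

```python
from fractions import Fraction

# (workload, CPUs, GPUs, GPU type) as listed on the configuration slides.
CONFIGS = [
    ("DEEP LEARNING", 2, 8, "P100 SXM2"),
    ("DEEP LEARNING", 2, 16, "P100 SXM2"),
    ("HPC", 4, 8, "P100 SXM2"),
    ("HPC", 8, 8, "P100 SXM2"),
    ("INFERENCE", 2, 8, "P100 SXM2"),
    ("INFERENCE", 8, 32, "P4 PCIe"),
]


def gpus_per_cpu(cpus, gpus):
    """Return the exact GPU-to-CPU ratio of one configuration."""
    return Fraction(gpus, cpus)


ratios = [gpus_per_cpu(cpus, gpus) for _, cpus, gpus, _ in CONFIGS]

for (workload, cpus, gpus, gpu_type), ratio in zip(CONFIGS, ratios):
    print(f"{workload:13s} {cpus} CPU : {gpus:2d} GPU ({gpu_type}) -> {ratio} GPUs per CPU")
```

The HPC configurations carry the fewest GPUs per CPU, which matches the "HPC: different CPU:GPU ratios" point on the cloud challenges slide.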
WHY THE EXCITEMENT?
GPUs as Enablers of Breakthrough Results

AlexNet Training Performance: 65x in 3 Years
[Chart: training speedup by year, 2013 to 2016, on a 0x to 70x axis. Data points: K40, K80 + cuDNN1, M40 + cuDNN4, P100 + cuDNN5.]

- Achieve super-human accuracy in classification
- We can generate photorealistic images from textual descriptions now!
- And we are getting faster fast
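To put "65x in 3 Years" in per-year terms, the total can be converted into a compound annual factor. This is a minimal sketch that assumes the gain compounds evenly each year, which the slide does not claim; the variable names are hypothetical.

```python
total_speedup = 65.0  # AlexNet training speedup quoted on the slide
years = 3             # period quoted on the slide

# Assumption: the same multiplicative gain every year.
annual_factor = total_speedup ** (1.0 / years)

print(f"about {annual_factor:.2f}x per year")
```

Under that assumption, training throughput roughly quadrupled each year over the period.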
Paper: H. Zhang et al., StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks, arXiv:1612.03242

“SUPERHUMAN” RESULTS SPARK HYPERSCALE ADOPTION

[Chart: ImageNet accuracy (%), 2010 to 2015. Series: Hand-coded CV, Deep Learning, Human. Values shown: 72%, 74%, 74%, 76%, 84%, 88%, 93%, 96%.]

Cloud Services with AI Powered by NVIDIA:
Alibaba/Aliyun, Amazon, Baidu, eBay, Facebook, Flickr, Google, iFLYTEK, iQIYI, JD.com, Orange, Periscope, Pinterest, Qihoo 360, Shazam, Skype, Sogou, Twitter, Yahoo Supermarket, Yandex, Yelp
DEEP LEARNING EVERYWHERE

INTERNET & CLOUD
- Image Classification
- Speech Recognition
- Language Translation
- Language Processing
- Sentiment Analysis
- Recommendation

MEDICINE & BIOLOGY
- Cancer Cell Detection
- Diabetic Grading
- Drug Discovery

MEDIA & ENTERTAINMENT
- Video Captioning
- Video Search
- Real Time Translation

SECURITY & DEFENSE
- Face Detection
- Video Surveillance
- Satellite Imagery

AUTONOMOUS MACHINES
- Pedestrian Detection
- Lane Tracking
- Recognize Traffic Sign
GPUS IN ARTIFICIAL INTELLIGENCE

Machine Learning
Neural Networks
Deep Learning

- Replaces hand-tuned parameters of the feature extraction steps (e.g. in voice and image recognition).
- Deep learning is a subset of machine learning that refers to artificial neural networks composed of many layers.
- Artificial neural networks are inspired by the human brain and need lots of training data (ideal for Big Data).
- NVIDIA GPUs and cuDNN software are broadly adopted for machine learning.

NVIDIA DEEP LEARNING SDK
High Performance GPU-Acceleration for Deep Learning
APPLICATIONS
- COMPUTER VISION: Image Classification, Object Detection
- SPEECH AND AUDIO: Voice Recognition, Translation
- BEHAVIOR: Recommendation Engines, Sentiment Analysis

FRAMEWORKS
- Mocha.jl

DEEP LEARNING SDK
- DEEP LEARNING: cuDNN
- MATH LIBRARIES: cuBLAS, cuSPARSE, cuFFT
- MULTI-GPU: NCCL
Platform | Tensorflow | CNTK | MXNet | Caffe | Theano | Torch
Release Date | 2016 | 2016 | 2015 | 2014 | 2010 | 2011
Core Language | C++ | C++ | C++ | C++ | C++ | C
API | C++, Python | NDL | C++, Python, R, Scala, Matlab, Javascript, Go, Julia | Python, Matlab | Python | Lua
Synchronisation Model | Sync or async | Sync | Sync or async | Sync | Async | Sync
Communication Mode | Parameter server | MPI | Parameter server | N/A | N/A | N/A
Multi-GPU | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Multi-node | ✓ | ✓ | ✓ | ✗ | ✗ | ✗
Data Parallelism | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Model Parallelism | ✓ | N/A | ✓ | ✗ | ✓ | ✓
Fault Tolerance | Checkpoint and recovery | Checkpoint and resume | Checkpoint and resume | N/A | Checkpoint and resume | Checkpoint and resume
Visualisation | Graph (interactive), training monitor, debugging tools | Graph (static) | None | Summary Statistics | Graph (static) | Plots

Two cell qualifiers could not be assigned to a column: "(Spark/Custom)" in the Communication Mode row and "(custom)" in the Visualisation row.
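The scaling rows of the comparison table can be turned into a small lookup, for example to shortlist frameworks for a multi-node cluster. This is a sketch that copies only the Multi-GPU, Multi-node, Data Parallelism and Model Parallelism rows; the names `Support`, `FRAMEWORKS` and `frameworks_with` are hypothetical, and `None` stands for an "N/A" cell.

```python
from typing import NamedTuple, Optional


class Support(NamedTuple):
    multi_gpu: bool
    multi_node: bool
    data_parallel: bool
    model_parallel: Optional[bool]  # None stands for "N/A" in the table


# Values copied from the table rows, in the table's column order.
FRAMEWORKS = {
    "Tensorflow": Support(True, True, True, True),
    "CNTK": Support(True, True, True, None),
    "MXNet": Support(True, True, True, True),
    "Caffe": Support(True, False, True, False),
    "Theano": Support(True, False, True, True),
    "Torch": Support(True, False, True, True),
}


def frameworks_with(*features):
    """Return the frameworks whose table entry is a check mark for every feature."""
    return [
        name
        for name, support in FRAMEWORKS.items()
        if all(getattr(support, feature) is True for feature in features)
    ]


print(frameworks_with("multi_node"))
print(frameworks_with("multi_node", "model_parallel"))
```

Treating "N/A" as not supported is a choice made for this sketch; a real selection would check the current documentation of each framework, since the table reflects their state at the time of the cited survey.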
Fox, James, Yiming Zou, and Judy Qiu. "Software Frameworks for Deep Learning at Scale."

NVIDIA DEEP LEARNING PERFORMANCE GUIDE
TensorFlow Deep Learning Framework
Training on 8x P100 GPU Server vs 8x K80 GPU Server

TensorFlow Deep Learning Training
An open-source software library for numerical computation using data flow graphs.

- VERSION: 1.0
- ACCELERATED FEATURES: Full framework accelerated
- SCALABILITY: Multi-GPU and multi-node
- More Information: https://www.tensorflow.org/

[Chart: speedup vs. server with 8x K80 for AlexNet, GoogleNet, ResNet-50, ResNet-152 and VGG16. Two configurations: server with 8x P100 PCIe 16GB and server with 8x P100 16GB NVLink. Average speedups shown: 2.5x and 3x.]

GPU Servers: Single Xeon E5-2690 [email protected] with GPUs configs as shown. Ubuntu 14.04.5, CUDA 8.0.42, cuDNN 6.0.5; NCCL 1.6.1, data set: ImageNet; batch sizes: AlexNet (128), GoogleNet (256), ResNet-50 (64), ResNet-152 (32), VGG-16 (32)

CNTK Deep Learning Framework
Training on 8x P100 GPU Server vs 8x K80 GPU Server
CNTK Deep Learning Training
A free, easy-to-use, open-source, commercial-grade toolkit that trains deep learning algorithms to learn like the human brain.

- VERSION: 1.0
- ACCELERATED FEATURES: Full framework accelerated
- SCALABILITY: Multi-GPU and multi-node
- More Information: www.microsoft.com/en-us/research/product/cognitive-toolkit/

[Chart: speedup vs. server with 8x K80 for AlexNet and ResNet-50. Two configurations: server with 8x P100 PCIe 16GB and server with 8x P100 16GB NVLink. Average speedups shown: 1.6x and 2.7x.]

GPU Servers: Single Xeon E5-2690 [email protected] with GPUs configs as shown. Ubuntu 14.04.5, CUDA 8.0.42, cuDNN 6.0.5; NCCL 1.6.1, data set: ImageNet; batch sizes: AlexNet (128), ResNet-50 (64)

Deep Learning Performance
INFERENCE
Deep Learning Inference
1x GPU Server Throughput Performance vs. Single-Socket CPU Server

Deep Learning Inference using TensorRT on CAFFE
Popular GPU-accelerated framework using NVIDIA’s TensorRT 1.0 Inference Engine

- VERSION: CAFFE 1.0, TensorRT 1.0
- ACCELERATED FEATURES: CPU & GPU versions available
- SCALABILITY: Multi-GPU
- More Information: http://caffe.berkeleyvision.org/ and https://developer.nvidia.com/tensorrt

[Chart: speedup vs. single-socket CPU server for AlexNet, GoogleNet, ResNet-152 and VGG-19. Configurations: 1x P4 (FP32), 1x P4 (INT8), 1x P100 (FP32), 1x P100 (FP16). Average speedups shown: 6x, 13x, 20x and 25x.]

CPU Server: Single Xeon E5-2690 [email protected]
GPU Servers: Single Xeon E5-2690 [email protected] with x1 P100 16GB or x1 P4 GPU
Ubuntu 14.04.5, TensorRT 2.0, CUDA 8.0.42, cuDNN 6.0.5; NCCL 1.6.1, batch size 128, precision as indicated.

GPU DEEP LEARNING IS A NEW COMPUTING MODEL
TRAINING
- Billions of Trillions of Operations
- GPU train larger models, accelerate time to market

DATACENTER INFERENCING
- 10s of billions of image, voice, video queries per day
- GPU inference for fast response, maximize datacenter throughput

[Diagram labels: Training, Datacenter, Device]
WHAT IS SATURNV?

NVIDIA DGX SATURNV
Giant Leap Towards Exascale AI

Fastest AI Supercomputer in TOP500
- 4.9 Petaflops Peak FP64
- 19.6 Petaflops Peak FP16
- 13 DGX-1 to get into Top500

Most Energy Efficient Supercomputer
- #1 Green500
- 9.5 GFLOPS per Watt

Rocket for Cancer Moonshot
- CANDLE Development Platform
- Common platform with DOE labs – ANL, LLNL, ORNL, LANL
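The peak figures above can be broken down per node with simple arithmetic. The sketch uses the 4.9 and 19.6 petaflops values from this slide and the 124-node count stated elsewhere in the deck; variable names are hypothetical. The 13-node value is a share of peak only, given for scale: TOP500 ranks systems by measured HPL performance, which the slide does not give.

```python
PEAK_FP64_PFLOPS = 4.9   # from the slide
PEAK_FP16_PFLOPS = 19.6  # from the slide
NODES = 124              # node count stated elsewhere in the deck

# Ratio between the two quoted peak precisions.
fp16_to_fp64 = PEAK_FP16_PFLOPS / PEAK_FP64_PFLOPS

# Peak FP64 per DGX-1 node, in teraflops.
per_node_fp64_tflops = PEAK_FP64_PFLOPS * 1000 / NODES

# Peak FP64 of a 13-node slice ("13 DGX-1 to get into Top500"), for scale only.
thirteen_node_peak_tflops = 13 * per_node_fp64_tflops

print(f"FP16 peak is {fp16_to_fp64:.1f}x the FP64 peak")
print(f"{per_node_fp64_tflops:.1f} TFLOPS peak FP64 per node")
print(f"{thirteen_node_peak_tflops:.0f} TFLOPS peak FP64 for 13 nodes")
```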
HOW DID WE BUILD SATURNV?

WHY NVIDIA DGX SATURNV
124 Node Supercomputing Cluster

- Innovation needs a deep learning supercomputer!
- Deep Learning scalability; move outside the box
- Focus on research
- Used internally for Deep Learning applied research
  - Multiple users, algorithms, networks, new approaches
  - Embedded, robotic, auto, hyperscale, HPC
- Partner with university research, government and industry collaborations
- Study convergence of data science and HPC

NVIDIA DGX-1 DEEP LEARNING SYSTEM

ONE ARCHITECTURE BUILT FOR BOTH DATA SCIENCE & COMPUTATIONAL SCIENCE
[Chart: AlexNet Training. Speed-up vs 1x KNL server (0x to 40x) for 1, 4, 8, 16, 32, 64 and 128 Knights Landing servers and for 1x NVIDIA DGX-1 (GPU-accelerated server). DGX-1 Faster than 128 Knights Landing Servers.]

[Chart: GTC-P: Plasma Turbulence. Speed-up vs 1x KNL server (0x to 9x) for 1, 4, 8, 16, 32 and 64 Knights Landing servers and for 1x NVIDIA DGX-1. DGX-1 Faster than 64 Knights Landing Servers.]

AlexNet: based on AlexNet batch size 256, weak scaling up to 32 KNL servers, 64 & 128 estimated based on ideal scaling, Xeon Phi 7250 nodes.
GTC-P: Grid Size A. Systems: NVIDIA DGX-1, 8xP100; Intel KNL 7250 68 core, Flat-Quadrant mode, Omnipath.

NVIDIA DGX SATURNV 124 node Cluster
nvidia.com/dgx1

124 NVIDIA DGX-1 Nodes – 992 P100 GPUs
- 8x NVIDIA Tesla P100 SXM GPUs per node – NVLINK CubeMesh
- 2x Intel Xeon 20 core CPUs per node
- 512 GB DDR4 System Memory per node
- SSD – 7 TB scratch + 0.5 TB OS

- Mellanox 36 port EDR L1 and L2 switches
- 4 ports per system
- Partial Fat tree topology

- Ubuntu 14.04, CUDA 8, OpenMPI 1.10.3
- NVIDIA GPU BLAS + Intel MKL (NVIDIA GPU HPL)

Deep Learning applied research
- Many users, frameworks, algorithms, networks, new approaches
- Embedded, robotic, auto, hyperscale, HPC
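The cluster-wide totals follow from the per-node figures on this slide. The sketch below reads the "2x Intel Xeon 20 core" line as two 20-core CPUs per node and uses hypothetical constant names. The switch count is only a lower bound that assumes every port of a 36-port switch faces a node, which a real fat tree does not do because some ports are uplinks.

```python
import math

NODES = 124
GPUS_PER_NODE = 8
CPUS_PER_NODE = 2
CORES_PER_CPU = 20
IB_PORTS_PER_NODE = 4     # "4 ports per system"
SWITCH_PORTS = 36         # Mellanox 36 port EDR switches
SCRATCH_TB_PER_NODE = 7   # SSD scratch per node

totals = {
    "gpus": NODES * GPUS_PER_NODE,
    "cpu_cores": NODES * CPUS_PER_NODE * CORES_PER_CPU,
    "ib_ports": NODES * IB_PORTS_PER_NODE,
    "scratch_tb": NODES * SCRATCH_TB_PER_NODE,
}

# Lower bound only: assumes all 36 ports of each L1 switch connect to nodes.
min_l1_switches = math.ceil(totals["ib_ports"] / SWITCH_PORTS)

for name, value in totals.items():
    print(f"{name}: {value}")
print(f"at least {min_l1_switches} L1 switches")
```

The GPU total reproduces the 992 P100 GPUs quoted on the slide.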
DEEP LEARNING CLUSTER REFERENCE ARCHITECTURE

NOV2016 TOP GREEN500 SYSTEM

Green500.org Top500.org

SATURNV produced a groundbreaking 9.4 GF/W at full scale. This sets the stage for future Exascale class computing.
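To see what 9.4 GF/W means for exascale, the efficiency can be converted into the power a one-exaflop system would draw. This is a back-of-the-envelope sketch that assumes the same efficiency holds at that scale, which the slide does not claim; the variable names are hypothetical.

```python
GFLOPS_PER_WATT = 9.4     # efficiency quoted on the slide
EXAFLOP_IN_GFLOPS = 1e9   # 1 exaflop/s = 1e18 flop/s = 1e9 GFLOPS

# Assumption: efficiency stays constant when scaling to one exaflop/s.
power_watts = EXAFLOP_IN_GFLOPS / GFLOPS_PER_WATT
power_megawatts = power_watts / 1e6

print(f"about {power_megawatts:.1f} MW for one exaflop/s")
```

At this efficiency an exaflop machine would still need on the order of a hundred megawatts, which is why further gains in GFLOPS per watt matter for exascale designs.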
DEEP LEARNING IS VITAL TO HPC
- 92% believe AI will impact their work
- 93% using deep learning seeing positive results
insideHPC.com Survey, November 2016

- Monitoring Effects of Carbon and Greenhouse Gas Emissions
- Minute-by-minute AI Weather Forecasting

RESOURCES
For Executives, Developers and Data Scientists
- INTRO MATERIALS
- CASE STUDIES
- SELF-PACED LABS
- ON-SITE WORKSHOPS
- PARTNER COURSES
- TECHNICAL BLOGS

NVIDIA DEEP LEARNING INSTITUTE
Hands-on Training for Data Scientists and Software Engineers
Training organizations and individuals to solve challenging problems using Deep Learning
On-site workshops and online courses presented by certified experts
Covering complete workflows for proven application use cases: self-driving cars, recommendation engines, medical image classification, intelligent video analytics and more
www.nvidia.com/dli
https://www.nvidia.com/en-us/deep-learning-ai/education/

QUESTIONS?