GPUs IN SCIENTIFIC COMPUTING
Training session of the LAAS-CNRS Affiliates Club (Club des Affiliés du LAAS-CNRS), Toulouse, March 22, 2016
Frédéric Parienté, Tesla Accelerated Computing

THE WORLD LEADER IN VISUAL COMPUTING
GAMING | PRO VISUALIZATION | ENTERPRISE | DATA CENTER | AUTO

2 FIVE THINGS TO REMEMBER

• The time of accelerators has come
• NVIDIA is focused on co-design from top to bottom
• Accelerators are surging in supercomputing
• Machine learning is the next killer application for HPC
• Tesla platform leads in every way

3 “It’s time to start planning for the end of Moore’s Law, and it’s worth pondering how it will end, not just when.”

Robert Colwell Director, Microsystems Technology Office, DARPA

4 TESLA ACCELERATED COMPUTING PLATFORM
Focused on Co-Design from Top to Bottom

• Fast GPU: Engineered for High Throughput
• Productive Programming Model & Tools
• Expert Co-Design
• Accessibility

[Chart: TFLOPS, NVIDIA GPU vs. x86 CPU, 2008 to 2014, axis 0.0 to 3.0; GPU data points M1060, M2090, K20, K40, K80]
[Diagram: Fast GPU + Strong CPU; co-design layers APPLICATION, MIDDLEWARE, SYS SW, LARGE SYSTEMS, PROCESSOR]

5 ACCELERATORS SURGE IN WORLD’S TOP SUPERCOMPUTERS

• 100+ accelerated systems now on Top500 list
• 1/3 of total FLOPS powered by accelerators
• GPUs sweep 23 of 24 new accelerated supercomputers
• Tesla supercomputers growing at 50% CAGR over past five years

[Chart: Top500: # of Accelerated Supercomputers, 2013 to 2015, axis 0 to 125]

6 70% OF TOP HPC APPS ACCELERATED

INTERSECT360 SURVEY OF TOP APPS

• Top 10 HPC Apps: 90% Accelerated
• Top 50 HPC Apps: 70% Accelerated

TOP 25 APPS IN SURVEY
GROMACS, LAMMPS, SIMULIA Abaqus, NWChem, NAMD, LS-DYNA, AMBER, Schrodinger, ANSYS Mechanical, MSC NASTRAN, Gaussian, SPECFEM3D, GAMESS, ANSYS Fluent, ANSYS CFX, WRF, Star-CD, VASP, CCSM, OpenFOAM, COMSOL, CHARMM, Star-CCM+, Quantum Espresso, BLAST

[Legend: All popular functions accelerated; Some popular functions accelerated; In development; Not supported]

Source: Intersect360, Nov 2015, “HPC Application Support for GPU Computing”

7 370 GPU-Accelerated Applications
www.nvidia.com/appscatalog

8 TESLA BOOSTS DATACENTER THROUGHPUT
$500M Datacenter, 4x increase in ROI

• CPU-only datacenter: 100% CPU Nodes, 1000 Jobs Per Day
• Accelerated datacenter: 30% CPU Nodes + 70% GPU-Accelerated Nodes, 3800 Jobs Per Day

70% of Applications 5x Faster with GPU
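The 3800 jobs-per-day figure can be reproduced with a simple throughput model. The sketch below is an assumption, not the method behind the slide: it supposes that every node completes the same number of jobs per day on a CPU, and that each GPU-accelerated node completes its jobs 5x faster. The function name `jobs_per_day` and its parameters are hypothetical.

```c
/* Hypothetical helper, not taken from the slides.
 *
 * baseline_jobs     : jobs per day when every node is a CPU node
 * gpu_node_fraction : share of nodes that are GPU-accelerated (0.0 to 1.0)
 * gpu_speedup       : assumed per-node throughput gain on accelerated nodes
 */
double jobs_per_day(double baseline_jobs, double gpu_node_fraction,
                    double gpu_speedup)
{
    /* Jobs still served by the remaining CPU-only nodes. */
    double cpu_jobs = (1.0 - gpu_node_fraction) * baseline_jobs;

    /* Jobs served by the accelerated nodes, each assumed gpu_speedup times faster. */
    double gpu_jobs = gpu_node_fraction * baseline_jobs * gpu_speedup;

    return cpu_jobs + gpu_jobs;
}
```

With the slide's inputs, `jobs_per_day(1000.0, 0.70, 5.0)` evaluates 0.30 x 1000 + 0.70 x 1000 x 5 = 300 + 3500, which matches the 3800 jobs per day shown.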

9 NEXT-GEN SUPERCOMPUTERS ARE GPU-ACCELERATED

• U.S. Dept. of Energy (SUMMIT, SIERRA): Pre-Exascale Supercomputers for Science
• NOAA: New Supercomputer for Next-Gen Weather Forecasting
• IBM Watson: Breakthrough Natural Language Processing for Cognitive Computing

10 MACHINE LEARNING: HPC’S 1ST CONSUMER KILLER-APP

• GOOGLE OPEN-SOURCE TENSORFLOW
• FACEBOOK MESSENGER FACIAL RECOGNITION
• MICROSOFT CORTANA
• MICROSOFT OPEN-SOURCE DMTK
• YOUTUBE CLICK-TO-BUY ADS
• GOOGLE PHOTO

11 TESLA PLATFORM LEADS IN EVERY WAY

PROCESSOR INTERCONNECT

SOFTWARE ECOSYSTEM

12 TESLA PLATFORM FOR HPC

13 “Approximately a third of HPC systems operating today are equipped with accelerators and nearly half of all newly deployed systems have them.”

Source: “Accelerated Computing: A Tipping Point for HPC”, Intersect360, Nov 2015

14 TESLA FOR SIMULATION

LIBRARIES DIRECTIVES LANGUAGES

ACCELERATED COMPUTING TOOLKIT

TESLA ACCELERATED COMPUTING

15 Tesla Accelerates Discoveries

Using a supercomputer powered by the Tesla Platform with over 3,000 Tesla accelerators, University of Illinois scientists performed the first all-atom simulation of the HIV virus and discovered the chemical structure of its capsid — “the perfect target for fighting the infection.”

Without GPU, the supercomputer would need to be 5x larger for similar performance.

16 5x Faster AMBER Performance

TESLA K80: World’s Fastest Accelerator for HPC & Data Analytics
Simulation Time from 1 Month to 1 Week

[Chart: # of Days, 0 to 30; Dual CPU Server vs. Tesla K80 Server]

CUDA Cores: 4992
Peak DP: 1.9 TFLOPS
Peak DP w/ Boost: 2.9 TFLOPS
GDDR5 Memory: 24 GB
Bandwidth: 480 GB/s
Power: 300 W
GPU Boost: Dynamic

AMBER Benchmark: PME-JAC-NVE Simulation for 1 microsecond. CPU: E5-2698v3 @ 2.3GHz, 64GB System Memory, CentOS 6.2

17 TESLA K80: 10X FASTER ON REAL-WORLD APPS

[Chart: K80 vs. CPU speed-up, axis 0x to 15x; categories Benchmarks, Molecular Dynamics, Quantum Chemistry, Physics]

CPU: 12 cores, E5-2697v2 @ 2.70GHz, 64GB System Memory, CentOS 6.2
GPU: Single Tesla K80, Boost enabled

18 TESLA K80 BOOSTS DATA CENTER THROUGHPUT

ACCELERATING KEY APPS
[Chart: Speed-up vs Dual CPU, K80 vs. CPU, axis 0x to 15x; apps QMCPACK, LAMMPS, CHROMA, NAMD, AMBER]

1/3 OF NODES ACCELERATED, 2X SYSTEM THROUGHPUT
• CPU-only System: 100 Jobs Per Day
• Accelerated System: 220 Jobs Per Day

CPU: Dual E5-2698 [email protected] 3.6GHz, 64GB System Memory, CentOS 6.2
GPU: Single Tesla K80, Boost enabled

19 TESLA FOR VISUALIZATION

IRAY OPTIX INDEX

VISUALIZATION TOOLS FOR HPC

TESLA ACCELERATED COMPUTING

20 VISUALIZE DATA INSTANTLY FOR FASTER SCIENCE

Traditional: Slower Time to Discovery
• CPU Supercomputer, then Data Transfer, then Viz Cluster
• Simulation: 1 Week; Data Transfer: Days; Viz: 1 Day
• Multiple Iterations
• Time to Discovery = Months

Tesla Platform: Faster Time to Discovery
• GPU-Accelerated Supercomputer
• Visualize while you simulate, without data transfers
• Restart Simulation Instantly
• Interactive, Scalable, Flexible
• Multiple Iterations
• Time to Discovery = Weeks

21 VISUALIZATION-ENABLED SUPERCOMPUTERS
Simulation + Visualization

• CSCS Piz Daint: Galaxy Formation
• NCSA Blue Waters: Molecular Dynamics
• ORNL: Cosmology

22 GROWING ADOPTION IN CLIMATE & WEATHER

MeteoSwiss Deploys World’s First Accelerated Weather Supercomputer
• 2x higher resolution for daily forecasts
• 14x more simulation with ensemble approach for medium-range forecasts

NOAA Chooses Tesla To Improve Weather Forecast Research
• Develop global model with 3km resolution, five-fold increase from today’s resolution
• Improved resolution requires 100x computational complexity

23 U.S. TO BUILD TWO FLAGSHIP SUPERCOMPUTERS Powered by the Tesla Platform

100-300 PFLOPS Peak

10x in Scientific App Performance

IBM POWER9 CPU + NVIDIA Volta GPU

NVLink High Speed Interconnect

40 TFLOPS per Node, >3,400 Nodes

2017

Major Step Forward on the Path to Exascale

24 ACCELERATED COMPUTING DELIVERS 5X HIGHER ENERGY EFFICIENCY

• IBM POWER CPU: Most Powerful Serial Processor
• NVIDIA NVLink: Fastest CPU-GPU Interconnect, 80-200 GB/s
• NVIDIA Volta GPU: Most Powerful Parallel Processor

25 CORAL: BUILT FOR GRAND SCIENTIFIC CHALLENGES

• Fusion Energy: Role of material disorder, statistics, and fluctuations in nanoscale materials and systems
• Climate Change: Study climate change adaptation and mitigation scenarios; realistically represent detailed features
• Biofuels: Search for renewable and more efficient energy sources
• Astrophysics: Radiation transport, critical to astrophysics, laser fusion, atmospheric dynamics, and medical imaging
• Combustion: Combustion simulations to enable the next gen diesel/bio-fuels to burn more efficiently
• Nuclear Energy: Unprecedented high-fidelity radiation transport calculations for nuclear energy applications

26 TESLA PLATFORM FOR MACHINE LEARNING

27 THE BIG BANG IN MACHINE LEARNING

DNN BIG DATA GPU

“Google’s AI engine also reflects how the world of computer hardware is changing. (It) depends on machines equipped with GPUs… And it depends on these chips more than the larger tech universe realizes.”

28 Tesla Revolutionizes Machine Learning

GOOGLE BRAIN APPLICATION: DEEP LEARNING

              BEFORE TESLA     AFTER TESLA
Cost          $5,000K          $200K
Servers       1,000 Servers    16 Tesla Servers
Energy        600 kW           4 kW
Performance   1x               6x

29 THE AI RACE IS ON

30 NVIDIA GPU THE ENGINE OF DEEP LEARNING

WATSON CHAINER THEANO MATCONVNET

TENSORFLOW CNTK TORCH CAFFE

NVIDIA CUDA ACCELERATED COMPUTING PLATFORM

31 CUDA BOOSTS DEEP LEARNING 5X IN 2 YEARS

[Chart: Caffe Performance, relative scale 0 to 6; data points K40 (11/2013), K40+cuDNN1 (9/2014), M40+cuDNN3 (7/2015), M40+cuDNN4 (12/2015)]

AlexNet training throughput based on 20 iterations. CPU: 1x E5-2680v3 12 Core 2.5GHz, 128GB System Memory, Ubuntu 14.04

32 AMAZING RATE OF IMPROVEMENT

[Charts: accuracy over time in three panels: Image Recognition (IMAGENET, CV-based vs. DNN-based), Pedestrian Detection (CALTECH), Object Detection (KITTI, Top Score vs. NVIDIA DRIVENet); time axes 2010 to 2015 and 11/2013 to 1/2016]

33 CUDA FOR DEEP LEARNING DEVELOPMENT

DEEP LEARNING SDK

DIGITS cuDNN cuSPARSE cuBLAS NCCL

TITAN X DEVBOX GPU CLOUD

34 FACEBOOK’S DEEP LEARNING MACHINE Purpose-Built for Deep Learning Training

2x Faster Training for Faster Deployment

2x Larger Networks for Higher Accuracy

Powered by Eight Tesla M40 GPUs

Open Rack Compliant

“Most of the major advances in machine learning and AI in the past few years have been contingent on tapping into powerful GPUs and huge data sets to build and train advanced models”

Serkan Piantino, Engineering Director of Facebook AI Research

35 DESIGNED FOR AI COMPUTING AT LARGE SCALE

Built on the NVIDIA Tesla Platform
• 8 Tesla M40s deliver aggregate 96 GB GDDR5 memory and 56 teraflops of SP performance
• Leverages world’s leading deep learning platform to tap into frameworks such as Torch and libraries such as cuDNN

Operational Efficiency and Serviceability
• Free-air Cooled Design Optimizes Thermal and Power Efficiency
• Components swappable without tools
• Configurable PCI-e for versatility

36 13x Faster Training (Caffe)

TESLA M40: World’s Fastest Accelerator for Deep Learning Training
Reduce Training Time from 5 Days to less than 10 Hours

[Chart: Number of Days, 0 to 5; Dual CPU Server vs. GPU Server with 4x TESLA M40]

CUDA Cores: 3072
Peak SP: 7 TFLOPS
GDDR5 Memory: 12 GB
Bandwidth: 288 GB/s
Power: 250 W

Note: Caffe benchmark with AlexNet, training 1.3M images with 90 epochs. CPU server uses 2x Xeon E5-2699v3 CPU, 128GB System Memory, Ubuntu 14.04

37 TESLA M4: Highest Throughput Hyperscale Workload Acceleration

• Video Processing: Stabilization and Enhancements (4x)
• Image Processing: Resize, Filter, Search, Auto-Enhance (5x)
• Video Transcode: H.264 & H.265, SD & HD (2x)
• Machine Learning Inference (2x)

CUDA Cores: 1024
Peak SP: 2.2 TFLOPS
GDDR5 Memory: 4 GB
Bandwidth: 88 GB/s
Form Factor: PCIe Low Profile
Power: 50 to 75 W

Preliminary specifications. Subject to change.

38 TESLA PLATFORM FOR DEVELOPERS

39 10X GROWTH IN ACCELERATED COMPUTING

                            2008       2015
CUDA Downloads              150,000    3 Million
CUDA Apps                   27         370
Universities Teaching       60         800
Academic Papers             4,000      60,000
Tesla GPUs                  6,000      450,000
Supercomputing Teraflops    77         54,000

40 HOW GPU ACCELERATION WORKS

Application Code
• Compute-Intensive Functions (5% of Code): run on the GPU
• Rest of Sequential CPU Code: runs on the CPU
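A small share of the source code can account for most of the runtime, which is why offloading only the compute-intensive functions can pay off. As a rough guide, Amdahl's law gives the overall gain when part of the runtime is accelerated. The sketch below is illustrative only: the function name `amdahl_speedup` is hypothetical, and the runtime fraction and kernel speedup are inputs to be measured for a given application, not figures from the slides.

```c
/* Hypothetical helper illustrating Amdahl's law.
 *
 * accelerated_fraction : share of the original runtime spent in the
 *                        functions moved to the GPU (0.0 to 1.0)
 * kernel_speedup       : how much faster those functions run on the GPU
 *
 * Returns the overall application speedup.
 */
double amdahl_speedup(double accelerated_fraction, double kernel_speedup)
{
    /* Time left on the CPU plus the shortened time of the offloaded part,
       both relative to an original runtime of 1.0. */
    double remaining_time = (1.0 - accelerated_fraction)
                          + accelerated_fraction / kernel_speedup;
    return 1.0 / remaining_time;
}
```

For example, if the offloaded functions accounted for 90% of the runtime and ran 10x faster on the GPU, `amdahl_speedup(0.9, 10.0)` would return about 5.26: the remaining sequential CPU code limits the overall gain.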

41 COMMON PROGRAMMING MODELS ACROSS MULTIPLE CPUS

• Libraries (AmgX, cuBLAS)
• Compiler Directives
• Programming Languages

CPU shown: x86

42 GPU ACCELERATED LIBRARIES
“Drop-in” Acceleration for Your Applications

• Domain-specific (Deep Learning, GIS, EDA, Bioinformatics, Fluids): NVBIO, Triton Ocean SDK
• Visual Processing (Image & Video): NVIDIA CODEC SDK, NVIDIA NPP
• Linear Algebra (Dense, Sparse, Matrix): NVIDIA cuBLAS, cuSPARSE
• Math Algorithms (AMG, Templates, Solvers): NVIDIA cuRAND, AmgX, cuSOLVER

developer.nvidia.com/gpu-accelerated-libraries

43 OpenACC: Simple | Powerful | Portable
Fueling the Next Wave of Scientific Discoveries in HPC

    main()
    {
      #pragma acc kernels   //automatically runs on GPU
      {
      }
    }

• University of Illinois, PowerGrid (MRI Reconstruction): 70x Speed-Up, 2 Days of Effort
• RIKEN Japan, NICAM (Climate Modeling): 7-8x Speed-Up, 5% of Code Modified
• 8000+ Developers using OpenACC
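To make the directive approach concrete, here is a minimal sketch of a loop annotated with `#pragma acc kernels`. The `saxpy` routine and its parameter names are illustrative choices, not code from the projects cited above. A compiler without OpenACC support ignores the pragma and runs the loop on the CPU, so the same source serves both targets; with an OpenACC-capable compiler (for example GCC's `-fopenacc` or the PGI compiler's `-acc` option, depending on your toolchain) the loop can be offloaded to the GPU.

```c
/* Illustrative example: y[i] = a * x[i] + y[i] for i in [0, n).
 *
 * The data clauses state what must be copied to the accelerator (x is
 * read-only, y is read and written back). They are ignored together
 * with the pragma when OpenACC is not enabled.
 */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Because the CPU code path is untouched, the result can be checked against the unaccelerated build before any performance tuning.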

http://www.cray.com/sites/default/files/resources/OpenACC_213462.12_OpenACC_Cosmo_CS_FNL.pdf
http://www.hpcwire.com/off-the-wire/first-round-of-2015-hackathons-gets-underway
http://on-demand.gputechconf.com/gtc/2015/presentation/S529744-Hisashi-Yashiro.pdf
http://www.openacc.org/content/experiences-porting-molecular-dynamics-code-gpus-cray-xk7

LS-DALTON
Large-scale Application for Calculating High-accuracy Molecular Energies

Minimal Effort
• Lines of Code Modified: <100 Lines
• # of Weeks Required: 1 Week
• # of Codes to Maintain: 1 Source

Big Performance
[Chart: LS-DALTON CCSD(T) Module Benchmarked on Titan Supercomputer (AMD CPU vs Tesla K20X); Speedup vs CPU, axis 0.0x to 12.0x; cases Alanine-1 (13 Atoms), Alanine-2 (23 Atoms), Alanine-3 (33 Atoms)]

“OpenACC makes GPU computing approachable for domain scientists. Initial OpenACC implementation required only minor effort, and more importantly, no modifications of our existing CPU implementation.”

Janus Juul Eriksen, PhD Fellow, qLEAP Center for Theoretical Chemistry, Aarhus University

45 OPENACC DELIVERS TRUE PERFORMANCE PORTABILITY
Paving the Path Forward: Single Code for All HPC Processors

[Chart: Application Performance Benchmark, Speedup vs Single CPU Core, axis 0x to 35x; benchmarks 359.MINIGHOST (MANTEVO), NEMO (CLIMATE & OCEAN), CLOVERLEAF (PHYSICS); series CPU: MPI + OpenMP, CPU: MPI + OpenACC, CPU + GPU: MPI + OpenACC; bar labels in extraction order: 30.3x, 11.9x, 7.6x, 7.1x, 7.1x, 5.3x, 4.1x, 4.3x, 5.2x]

359.miniGhost: CPU: Intel Xeon E5-2698 v3, 2 sockets, 32 cores total; GPU: Tesla K80, single GPU
NEMO: Each socket CPU: Intel Xeon E5-2698 v3, 16 cores; GPU: NVIDIA K80, both GPUs
CLOVERLEAF: CPU: Dual socket Intel Xeon CPU E5-2690 v2, 20 cores total; GPU: Tesla K80, both GPUs

46 CUDA Super Simplified Memory Management Code

CPU Code

    void sortfile(FILE *fp, int N) {
      char *data;
      data = (char *)malloc(N);

      fread(data, 1, N, fp);

      qsort(data, N, 1, compare);

      use_data(data);

      free(data);
    }

CUDA 6 Code with Unified Memory

    void sortfile(FILE *fp, int N) {
      char *data;
      cudaMallocManaged(&data, N);

      fread(data, 1, N, fp);

      qsort<<<...>>>(data,N,1,compare);
      cudaDeviceSynchronize();

      use_data(data);

      cudaFree(data);
    }

47 GPU DEVELOPER ECO-SYSTEM

• Debuggers & Profilers: CUDA-GDB, NV Visual Profiler, NVIDIA Nsight Visual Studio, Allinea, TotalView
• Languages & Directives: C, C++, Fortran, Java, Python, OpenACC, OpenMP
• Numerical Packages: MATLAB, Mathematica, LabView
• Libraries: FFT, BLAS, SPARSE, LAPACK, NPP, Video, Imaging
• Cluster Tools: GPUDirect RDMA, Datacenter GPU Manager
• Consultants & Training: ANEO, GPU Tech
• OEM Solution Providers

48 DEVELOP ON GEFORCE, DEPLOY ON TESLA

GeForce: Designed for Developers & Gamers
• Available Everywhere
• developer.nvidia.com/cuda-gpus
• developer.nvidia.com/devbox

Tesla: Designed for the Data Center
• ECC
• 24x7 Runtime
• GPU Monitoring
• Cluster Management
• GPUDirect-RDMA
• Hyper-Q for MPI
• 3 Year Warranty
• Integrated OEM Systems, Professional Support

49

EUROPE’S BRIGHTEST MINDS & BEST IDEAS Sep 28-29, 2016 | Amsterdam www.gputechconf.eu #GTC16

DEEP LEARNING & ARTIFICIAL INTELLIGENCE | SELF-DRIVING CARS | VIRTUAL REALITY & AUGMENTED REALITY | SUPERCOMPUTING & HPC

GTC Europe is a two-day conference designed to expose the innovative ways developers, businesses and academics are using parallel computing to transform our world.

2 Days | 800 Attendees | 50+ Exhibitors | 50+ Speakers | 15+ Tracks | 15+ Workshops | 1-to-1 Meetings
