GPUS IN SCIENTIFIC COMPUTING
Training session of the Club des Affiliés du LAAS-CNRS, Toulouse, March 22, 2016
Frédéric Parienté, Tesla Accelerated Computing, NVIDIA
GAMING | PRO VISUALIZATION | ENTERPRISE | DATA CENTER | AUTO
THE WORLD LEADER IN VISUAL COMPUTING
2 FIVE THINGS TO REMEMBER
Time of accelerators has come
NVIDIA is focused on co-design from top to bottom
Accelerators are surging in supercomputing
Machine learning is the next killer application for HPC
Tesla platform leads in every way
3 “It’s time to start planning for the end of Moore’s Law, and it’s worth pondering how it will end, not just when.”
Robert Colwell Director, Microsystems Technology Office, DARPA
4 TESLA ACCELERATED COMPUTING PLATFORM
Focused on Co-Design from Top to Bottom
Fast GPU: Engineered for High Throughput
Productive Programming Model & Tools
Expert Co-Design
Accessibility
Fast GPU + Strong CPU
Co-design layers: APPLICATION, MIDDLEWARE, SYS SW, LARGE SYSTEMS, PROCESSOR
[Chart: peak TFLOPS (0.0 to 3.0), NVIDIA GPU vs x86 CPU, 2008-2014; GPUs shown: M1060, M2090, K20, K40, K80]
5 ACCELERATORS SURGE IN WORLD’S TOP SUPERCOMPUTERS
100+ accelerated systems now on Top500 list
1/3 of total FLOPS powered by accelerators
NVIDIA Tesla GPUs sweep 23 of 24 new accelerated supercomputers
Tesla supercomputers growing at 50% CAGR over past five years
[Chart: Top500: # of Accelerated Supercomputers (0 to 125), 2013-2015]
6 70% OF TOP HPC APPS ACCELERATED
INTERSECT360 SURVEY OF TOP APPS
Top 10 HPC Apps: 90% Accelerated
Top 50 HPC Apps: 70% Accelerated
TOP 25 APPS IN SURVEY
GROMACS, LAMMPS, SIMULIA Abaqus, NWChem, NAMD, LS-DYNA, AMBER, Schrodinger, ANSYS Mechanical, MSC NASTRAN, Gaussian, SPECFEM3D, GAMESS, ANSYS Fluent, ANSYS CFX, WRF, Star-CD, VASP, CCSM, OpenFOAM, COMSOL, CHARMM, Star-CCM+, Quantum Espresso, BLAST
Legend: All popular functions accelerated / Some popular functions accelerated / In development / Not supported
Intersect360, Nov 2015 “HPC Application Support for GPU Computing”
7 370 GPU-Accelerated Applications
www.nvidia.com/appscatalog
8 TESLA BOOSTS DATACENTER THROUGHPUT
$500M Datacenter, 4x increase in ROI
70% of Applications 5x Faster with GPU
CPU-only datacenter: 100% CPU Nodes, 1000 Jobs Per Day
Accelerated datacenter: 30% CPU Nodes + 70% GPU-Accelerated Nodes, 3800 Jobs Per Day
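The 1000 and 3800 jobs-per-day figures are consistent with a simple mix model. The sketch below is an assumption for illustration, not a method stated on the slide: every node contributes equally to the CPU-only baseline, and each GPU-accelerated node completes jobs `speedup` times faster. The function name `jobs_per_day` and its parameters are hypothetical.

```c
#include <assert.h>

/* Jobs per day for a datacenter in which a share of the nodes is GPU-accelerated.
 * Assumed model (not from the slide): throughput scales linearly with node count,
 * and an accelerated node runs `speedup` times as many jobs as a CPU-only node. */
double jobs_per_day(double baseline_jobs, double gpu_share, double speedup)
{
    assert(gpu_share >= 0.0 && gpu_share <= 1.0);
    assert(speedup > 0.0);

    /* Work still done by the CPU-only nodes. */
    double cpu_part = (1.0 - gpu_share) * baseline_jobs;
    /* Work done by the accelerated nodes, scaled by their speed-up. */
    double gpu_part = gpu_share * baseline_jobs * speedup;

    return cpu_part + gpu_part;
}
```

With the slide's inputs, `jobs_per_day(1000.0, 0.70, 5.0)` gives 300 + 3500 = 3800 jobs per day, up to floating-point rounding.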
9 NEXT-GEN SUPERCOMPUTERS ARE GPU-ACCELERATED
SUMMIT and SIERRA (U.S. Dept. of Energy): Pre-Exascale Supercomputers for Science
NOAA: New Supercomputer for Next-Gen Weather Forecasting
IBM Watson: Breakthrough Natural Language Processing for Cognitive Computing
10 MACHINE LEARNING: HPC 1ST CONSUMER KILLER-APP
GOOGLE OPEN-SOURCE TENSORFLOW
FACEBOOK MESSENGER FACIAL RECOGNITION
MICROSOFT CORTANA
MICROSOFT OPEN-SOURCE DMTK
YOUTUBE CLICK-TO-BUY ADS
GOOGLE PHOTO
11 TESLA PLATFORM LEADS IN EVERY WAY
PROCESSOR INTERCONNECT
SOFTWARE ECOSYSTEM
12 TESLA PLATFORM FOR HPC
13 “Approximately a third of HPC systems operating today are equipped with accelerators and nearly half of all newly deployed systems have them.”
Source: ACCELERATED COMPUTING: A TIPPING POINT FOR HPC Intersect360 Nov 2015
14 TESLA FOR SIMULATION
LIBRARIES DIRECTIVES LANGUAGES
ACCELERATED COMPUTING TOOLKIT
TESLA ACCELERATED COMPUTING
15 Tesla Accelerates Discoveries
Using a supercomputer powered by the Tesla Platform with over 3,000 Tesla accelerators, University of Illinois scientists performed the first all-atom simulation of the HIV virus and discovered the chemical structure of its capsid — “the perfect target for fighting the infection.”
Without GPU, the supercomputer would need to be 5x larger for similar performance.
16 TESLA K80
World’s Fastest Accelerator for HPC & Data Analytics
5x Faster AMBER Performance
Simulation Time from 1 Month to 1 Week
[Chart: # of Days (0 to 30), Dual CPU Server vs Tesla K80 Server]
CUDA Cores: 4992
Peak DP: 1.9 TFLOPS
Peak DP w/ Boost: 2.9 TFLOPS
GDDR5 Memory: 24 GB
Bandwidth: 480 GB/s
Power: 300 W
GPU Boost: Dynamic
AMBER Benchmark: PME-JAC-NVE Simulation for 1 microsecond. CPU: E5-2698v3 @ 2.3GHz, 64GB System Memory, CentOS 6.2
17 TESLA K80: 10X FASTER ON REAL-WORLD APPS
[Chart: speed-up of K80 over CPU (0x to 15x) for Benchmarks, Molecular Dynamics, Quantum Chemistry, Physics]
CPU: 12 cores, E5-2697v2 @ 2.70GHz, 64GB System Memory, CentOS 6.2. GPU: Single Tesla K80, Boost enabled
18 TESLA K80 BOOSTS DATA CENTER THROUGHPUT
ACCELERATING KEY APPS
[Chart: Speed-up vs Dual CPU (0x to 15x), K80 vs CPU, for QMCPACK, LAMMPS, CHROMA, NAMD, AMBER]
1/3 OF NODES ACCELERATED, 2X SYSTEM THROUGHPUT
CPU-only System: 100 Jobs Per Day
Accelerated System: 220 Jobs Per Day
CPU: Dual E5-2698 [email protected] 3.6GHz, 64GB System Memory, CentOS 6.2. GPU: Single Tesla K80, Boost enabled
19 TESLA FOR VISUALIZATION
IRAY OPTIX INDEX
VISUALIZATION TOOLS FOR HPC
TESLA ACCELERATED COMPUTING
20 VISUALIZE DATA INSTANTLY FOR FASTER SCIENCE
Traditional: Slower Time to Discovery
CPU Supercomputer (Simulation: 1 Week), Data Transfer (Days), Viz Cluster (Viz: 1 Day)
Multiple Iterations
Time to Discovery = Months
Tesla Platform: Faster Time to Discovery
GPU-Accelerated Supercomputer
Visualize while you simulate, without data transfers
Restart Simulation Instantly
Interactive, Scalable, Flexible
Multiple Iterations
Time to Discovery = Weeks
21 VISUALIZATION-ENABLED SUPERCOMPUTERS
Simulation + Visualization
CSCS Piz Daint: Galaxy Formation
NCSA Blue Waters: Molecular Dynamics
ORNL Titan: Cosmology
22 GROWING ADOPTION IN CLIMATE & WEATHER
MeteoSwiss Deploys World’s First Accelerated Weather Supercomputer
2x higher resolution for daily forecasts
14x more simulation with ensemble approach for medium-range forecasts
NOAA Chooses Tesla To Improve Weather Forecast Research
Develop global model with 3km resolution, five-fold increase from today’s resolution
Improved resolution requires 100x computational complexity
23 U.S. TO BUILD TWO FLAGSHIP SUPERCOMPUTERS Powered by the Tesla Platform
100-300 PFLOPS Peak
10x in Scientific App Performance
IBM POWER9 CPU + NVIDIA Volta GPU
NVLink High Speed Interconnect
40 TFLOPS per Node, >3,400 Nodes
2017
Major Step Forward on the Path to Exascale
24 ACCELERATED COMPUTING DELIVERS 5X HIGHER ENERGY EFFICIENCY
IBM POWER CPU: Most Powerful Serial Processor
NVIDIA NVLink: Fastest CPU-GPU Interconnect, 80-200 GB/s
NVIDIA Volta GPU: Most Powerful Parallel Processor
25 CORAL: BUILT FOR GRAND SCIENTIFIC CHALLENGES
Fusion Energy: Role of material disorder, statistics, and fluctuations in nanoscale materials and systems
Climate Change: Study climate change adaptation and mitigation scenarios; realistically represent detailed features
Biofuels: Search for renewable and more efficient energy sources
Astrophysics: Radiation transport, critical to astrophysics, laser fusion, atmospheric dynamics, and medical imaging
Combustion: Combustion simulations to enable the next gen diesel/bio-fuels to burn more efficiently
Nuclear Energy: Unprecedented high-fidelity radiation transport calculations for nuclear energy applications
26 TESLA PLATFORM FOR MACHINE LEARNING
27 THE BIG BANG IN MACHINE LEARNING
DNN BIG DATA GPU
“ Google’s AI engine also reflects how the world of computer hardware is changing. (It) depends on machines equipped with GPUs… And it depends on these chips more than the larger tech universe realizes.”
28 Tesla Revolutionizes Machine Learning
GOOGLE BRAIN APPLICATION: DEEP LEARNING
              BEFORE TESLA     AFTER TESLA
Cost          $5,000K          $200K
Servers       1,000 Servers    16 Tesla Servers
Energy        600 KW           4 KW
Performance   1x               6x
29 THE AI RACE IS ON
30 NVIDIA GPU THE ENGINE OF DEEP LEARNING
WATSON CHAINER THEANO MATCONVNET
TENSORFLOW CNTK TORCH CAFFE
NVIDIA CUDA ACCELERATED COMPUTING PLATFORM
31 Caffe Performance
CUDA BOOSTS DEEP LEARNING 5X IN 2 YEARS
[Chart: Performance (0 to 6) over time (11/2013, 9/2014, 7/2015, 12/2015) for K40, K40+cuDNN1, M40+cuDNN3, M40+cuDNN4]
AlexNet training throughput based on 20 iterations. CPU: 1x E5-2680v3 12 Core 2.5GHz, 128GB System Memory, Ubuntu 14.04
32 AMAZING RATE OF IMPROVEMENT
[Charts: accuracy over time for Image Recognition (ImageNet, CV-based vs DNN-based), Pedestrian Detection (CALTECH, NVIDIA GPU), and Object Detection (KITTI, Top Score vs NVIDIA DRIVENet); x axes 2010-2015 and 11/2013-1/2016]
33 CUDA FOR DEEP LEARNING DEVELOPMENT
DEEP LEARNING SDK
DIGITS cuDNN cuSPARSE cuBLAS NCCL
TITAN X DEVBOX GPU CLOUD
34 FACEBOOK’S DEEP LEARNING MACHINE Purpose-Built for Deep Learning Training
2x Faster Training for Faster Deployment
2x Larger Networks for Higher Accuracy
Powered by Eight Tesla M40 GPUs
Open Rack Compliant
“Most of the major advances in machine learning and AI in the past few years have been contingent on tapping into powerful GPUs and huge data sets to build and train advanced models”
Serkan Piantino, Engineering Director of Facebook AI Research
35 DESIGNED FOR AI COMPUTING AT LARGE SCALE
Built on the NVIDIA Tesla Platform
• 8 Tesla M40s deliver aggregate 96 GB GDDR5 memory and 56 teraflops of SP performance
• Leverages world’s leading deep learning platform to tap into frameworks such as Torch and libraries such as cuDNN
Operational Efficiency and Serviceability
• Free-air Cooled Design Optimizes Thermal and Power Efficiency
• Components swappable without tools
• Configurable PCI-e for versatility
36 TESLA M40
World’s Fastest Accelerator for Deep Learning Training
13x Faster Training (Caffe)
Reduce Training Time from 5 Days to less than 10 Hours
[Chart: Number of Days (0 to 5), Dual CPU Server vs GPU Server with 4x TESLA M40]
CUDA Cores: 3072
Peak SP: 7 TFLOPS
GDDR5 Memory: 12 GB
Bandwidth: 288 GB/s
Power: 250W
Note: Caffe benchmark with AlexNet, training 1.3M images with 90 epochs. CPU server uses 2x Xeon E5-2699v3 CPU, 128GB System Memory, Ubuntu 14.04
37 TESLA M4
Highest Throughput Hyperscale Workload Acceleration
Video Processing (Stabilization and Enhancements): 4x
Image Processing (Resize, Filter, Search, Auto-Enhance): 5x
Video Transcode (H.264 & H.265, SD & HD): 2x
Machine Learning Inference: 2x
CUDA Cores: 1024
Peak SP: 2.2 TFLOPS
GDDR5 Memory: 4 GB
Bandwidth: 88 GB/s
Form Factor: PCIe Low Profile
Power: 50-75 W
Preliminary specifications. Subject to change.
38 TESLA PLATFORM FOR DEVELOPERS
39 10X GROWTH IN ACCELERATED COMPUTING
                           2008       2015
CUDA Downloads             150,000    3 Million
CUDA Apps                  27         370
Universities Teaching      60         800
Academic Papers            4,000      60,000
Tesla GPUs                 6,000      450,000
Supercomputing Teraflops   77         54,000
40 HOW GPU ACCELERATION WORKS
Application Code
Compute-Intensive Functions (5% of Code): GPU
Rest of Sequential CPU Code: CPU
GPU + CPU
41 COMMON PROGRAMMING MODELS ACROSS MULTIPLE CPUS
Libraries: AmgX, cuBLAS
Compiler Directives
Programming Languages
x86
42 GPU ACCELERATED LIBRARIES
“Drop-in” Acceleration for Your Applications
Domain-specific (Deep Learning, GIS, EDA, Bioinformatics, Fluids): NVBIO, Triton Ocean SDK
Visual Processing (Image & Video): NVIDIA CODEC SDK, NVIDIA NPP
Linear Algebra (Dense, Sparse, Matrix): NVIDIA cuBLAS, cuSPARSE
Math Algorithms (AMG, Templates, Solvers): NVIDIA cuRAND, AmgX, cuSOLVER
developer.nvidia.com/gpu-accelerated-libraries
43 Fueling the Next Wave of Scientific Discoveries in HPC
8000+ Developers using OpenACC
University of Illinois: PowerGrid, MRI Reconstruction
RIKEN Japan: NICAM, Climate Modeling
7-8x Speed-Up, 5% of Code Modified
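To make the "small fraction of code modified" idea concrete, here is a minimal sketch of a directive-based port. It assumes a SAXPY loop as the compute-intensive function; the name `saxpy` and its parameters are illustrative and do not come from the slides or the cited projects. The only OpenACC-specific line is the `#pragma acc parallel loop` directive. A compiler without OpenACC support ignores the unrecognized pragma and builds the same source for the CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Computes y[i] = a * x[i] + y[i] for i in [0, n).
 * Illustrative routine only; not taken from any application named on the slide. */
void saxpy(int n, float a, const float *x, float *y)
{
    assert(n >= 0);
    assert(x != NULL && y != NULL);

    /* With an OpenACC compiler this offloads the loop to the accelerator:
     * copyin sends x to the device, copy sends y there and back.
     * Without OpenACC support the pragma is ignored and the loop runs on the CPU. */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Calling `saxpy(4, 2.0f, x, y)` with `x = {1, 2, 3, 4}` and `y = {10, 20, 30, 40}` leaves `y = {12, 24, 36, 48}` on either build, since the directive changes where the loop runs, not what it computes.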
http://www.cray.com/sites/default/files/resources/OpenACC_213462.12_OpenACC_Cosmo_CS_FNL.pdf
http://www.hpcwire.com/off-the-wire/first-round-of-2015-hackathons-gets-underway
http://on-demand.gputechconf.com/gtc/2015/presentation/S529744-Hisashi-Yashiro.pdf
http://www.openacc.org/content/experiences-porting-molecular-dynamics-code-gpus-cray-xk7
LS-DALTON
Large-scale Application for Calculating High-accuracy Molecular Energies
Minimal Effort
Lines of Code Modified: <100 Lines
# of Weeks Required: 1 Week
# of Codes to Maintain: 1 Source
Big Performance
LS-DALTON CCSD(T) Module Benchmarked on Titan Supercomputer (AMD CPU vs Tesla K20X)
[Chart: Speedup vs CPU (0.0x to 12.0x) for Alanine-1 (13 Atoms), Alanine-2 (23 Atoms), Alanine-3 (33 Atoms)]
“OpenACC makes GPU computing approachable for domain scientists. Initial OpenACC implementation required only minor effort, and more importantly, no modifications of our existing CPU implementation.”
Janus Juul Eriksen, PhD Fellow, qLEAP Center for Theoretical Chemistry, Aarhus University
45 OPENACC DELIVERS TRUE PERFORMANCE PORTABILITY
Paving the Path Forward: Single Code for All HPC Processors
[Chart: Application Performance Benchmark, Speedup vs Single CPU Core (0x to 35x), for 359.MINIGHOST (MANTEVO), NEMO (CLIMATE & OCEAN), CLOVERLEAF (PHYSICS); series: CPU: MPI + OpenMP, CPU: MPI + OpenACC, CPU + GPU: MPI + OpenACC; bar labels: 30.3x, 11.9x, 7.6x, 7.1x, 7.1x, 5.3x, 4.1x, 4.3x, 5.2x]
359.miniGhost: CPU: Intel Xeon E5-2698 v3, 2 sockets, 32 cores total; GPU: Tesla K80, single GPU
NEMO: Each socket CPU: Intel Xeon E5-2698 v3, 16 cores; GPU: NVIDIA K80, both GPUs
CLOVERLEAF: CPU: Dual socket Intel Xeon CPU E5-2690 v2, 20 cores total; GPU: Tesla K80, both GPUs
46 CUDA Super Simplified Memory Management Code
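CPU Code

    void sortfile(FILE *fp, int N) {
      char *data;
      data = (char *)malloc(N);          // host-only allocation

      fread(data, 1, N, fp);

      qsort(data, N, 1, compare);        // sort on the CPU

      use_data(data);

      free(data);
    }

CUDA 6 Code with Unified Memory

    void sortfile(FILE *fp, int N) {
      char *data;
      cudaMallocManaged(&data, N);       // one pointer, usable from CPU and GPU

      fread(data, 1, N, fp);

      qsort<<<...>>>(data,N,1,compare);  // sort as a GPU kernel (launch configuration elided on the slide)
      cudaDeviceSynchronize();           // wait for the GPU before the CPU touches data

      use_data(data);

      cudaFree(data);
    }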
47 GPU DEVELOPER ECO-SYSTEM
Debuggers & Profilers: CUDA-GDB, NV Visual Profiler, NVIDIA Nsight Visual Studio, Allinea, TotalView
Languages & Directives: C, C++, Fortran, Java, Python, OpenACC, OpenMP
Numerical Packages: MATLAB, Mathematica, LabView
Libraries: FFT, BLAS, SPARSE, LAPACK, NPP, Video, Imaging
Cluster Tools: GPUDirect RDMA, Datacenter GPU Manager
Consultants & Training: ANEO, GPU Tech
OEM Solution Providers
48 DEVELOP ON GEFORCE, DEPLOY ON TESLA
GeForce: Designed for Developers & Gamers
Available Everywhere
developer.nvidia.com/cuda-gpus
developer.nvidia.com/devbox
Tesla: Designed for the Data Center
ECC
24x7 Runtime
GPU Monitoring
Cluster Management
GPUDirect-RDMA
Hyper-Q for MPI
3 Year Warranty
Integrated OEM Systems, Professional Support
49
EUROPE’S BRIGHTEST MINDS & BEST IDEAS Sep 28-29, 2016 | Amsterdam www.gputechconf.eu #GTC16
DEEP LEARNING & ARTIFICIAL INTELLIGENCE | SELF-DRIVING CARS | VIRTUAL REALITY & AUGMENTED REALITY | SUPERCOMPUTING & HPC
GTC Europe is a two-day conference designed to expose the innovative ways developers, businesses and academics are using parallel computing to transform our world.
2 Days | 800 Attendees | 50+ Exhibitors | 50+ Speakers | 15+ Tracks | 15+ Workshops | 1-to-1 Meetings
51