GPUs in Scientific Computing (GPU en Calcul Scientifique)

Slide 1: GPUS IN SCIENTIFIC COMPUTING (GPU EN CALCUL SCIENTIFIQUE)
Training session of the LAAS-CNRS Affiliates Club (Formation du Club des Affiliés du LAAS-CNRS), Toulouse, 22 March 2016
Frédéric Parienté, Tesla Accelerated Computing, NVIDIA

Slide 2: THE WORLD LEADER IN VISUAL COMPUTING
Gaming, Pro/Enterprise Visualization, Data Center, Auto

Slide 3: FIVE THINGS TO REMEMBER
• Time of accelerators has come
• NVIDIA is focused on co-design from top-to-bottom
• Accelerators are surging in supercomputing
• Machine learning is the next killer application for HPC
• Tesla platform leads in every way

Slide 4
“It’s time to start planning for the end of Moore’s Law, and it’s worth pondering how it will end, not just when.”
Robert Colwell, Director, Microsystems Technology Office, DARPA

Slide 5: TESLA ACCELERATED COMPUTING PLATFORM
Focused on Co-Design from Top to Bottom
• Fast GPU: Engineered for High Throughput
• Productive Programming Model & Tools
• Expert Co-Design
• Accessibility
Co-design stack: Application, Middleware, Sys SW, Large Systems, Processor
Fast GPU + Strong CPU
[Chart: peak TFLOPS, NVIDIA GPU vs. x86 CPU, 2008 to 2014. GPUs shown: M1060, M2090, K20, K40, K80.]

Slide 6: ACCELERATORS SURGE IN WORLD’S TOP SUPERCOMPUTERS
• 100+ accelerated systems now on Top500 list
• 1/3 of total FLOPS powered by accelerators
• NVIDIA Tesla GPUs sweep 23 of 24 new accelerated supercomputers
• Tesla supercomputers growing at 50% CAGR over past five years
[Chart: Top500, number of accelerated supercomputers, 2013 to 2015.]

Slide 7: 70% OF TOP HPC APPS ACCELERATED
Intersect360 survey of top apps
• Top 10 HPC apps: 90% accelerated
• Top 50 HPC apps: 70% accelerated
Top 25 apps in survey: GROMACS, LAMMPS, SIMULIA Abaqus, NWChem, NAMD, LS-DYNA, AMBER, Schrodinger, ANSYS Mechanical, MSC NASTRAN, Gaussian, SPECFEM3D, GAMESS, ANSYS Fluent, ANSYS CFX, WRF, Star-CD, VASP, CCSM, OpenFOAM, COMSOL, CHARMM, Star-CCM+, Quantum Espresso, BLAST
Legend: all popular functions accelerated; some popular functions accelerated; in development; not supported
Source: Intersect360, Nov 2015, “HPC Application Support for GPU Computing”

Slide 8: 370 GPU-Accelerated Applications
www.nvidia.com/appscatalog

Slide 9: TESLA BOOSTS DATACENTER THROUGHPUT
$500M Datacenter, 4x increase in ROI
• 100% CPU nodes: 1000 jobs per day
• 30% CPU nodes + 70% GPU-accelerated nodes: 3800 jobs per day
• 70% of applications 5x faster with GPU

Slide 10: NEXT-GEN SUPERCOMPUTERS ARE GPU-ACCELERATED
• SUMMIT, SIERRA (U.S. Dept. of Energy): Pre-Exascale Supercomputers for Science
• NOAA: New Supercomputer for Next-Gen Weather Forecasting
• IBM Watson: Breakthrough Natural Language Processing for Cognitive Computing

Slide 11: MACHINE LEARNING: HPC 1ST CONSUMER KILLER-APP
Google open-source TensorFlow, Facebook Messenger, Microsoft Cortana, facial recognition, Microsoft open-source DMTK, YouTube click-to-buy ads, Google Photo

Slide 12: TESLA PLATFORM LEADS IN EVERY WAY
Processor, Interconnect, Software, Ecosystem

Slide 13: TESLA PLATFORM FOR HPC

Slide 14
“Approximately a third of HPC systems operating today are equipped with accelerators and nearly half of all newly deployed systems have them.”
Source: Accelerated Computing: A Tipping Point for HPC, Intersect360, Nov 2015

Slide 15: TESLA FOR SIMULATION
Libraries, Directives, Languages
Accelerated Computing Toolkit
Tesla Accelerated Computing

Slide 16: Tesla Accelerates Discoveries
Using a supercomputer powered by the Tesla Platform with over 3,000 Tesla accelerators, University of Illinois scientists performed the first all-atom simulation of the HIV virus and discovered the chemical structure of its capsid, “the perfect target for fighting the infection.” Without GPU, the supercomputer would need to be 5x larger for similar performance.

Slide 17: TESLA K80
World’s Fastest Accelerator for HPC & Data Analytics
5x Faster AMBER Performance: Simulation Time from 1 Month to 1 Week
[Chart: Dual CPU Server vs. Tesla K80 Server, number of days, 0 to 30.]
CUDA Cores: 4992
Peak DP: 1.9 TFLOPS
Peak DP w/ Boost: 2.9 TFLOPS
GDDR5 Memory: 24 GB
Bandwidth: 480 GB/s
Power: 300 W
GPU Boost: Dynamic
AMBER Benchmark: PME-JAC-NVE, simulation for 1 microsecond. CPU: E5-2698v3 @ 2.3GHz, 64GB System Memory, CentOS 6.2

Slide 18: TESLA K80: 10X FASTER ON REAL-WORLD APPS
[Chart: speed-up of K80 vs. CPU, 0x to 15x, for benchmarks in Molecular Dynamics, Quantum Chemistry, and Physics.]
CPU: 12 cores, E5-2697v2 @ 2.70GHz, 64GB System Memory, CentOS 6.2
GPU: Single Tesla K80, Boost enabled

Slide 19: TESLA K80 BOOSTS DATA CENTER THROUGHPUT
Accelerating key apps
[Chart: speed-up vs. dual CPU, K80 vs. CPU, 0x to 15x, for QMCPACK, LAMMPS, CHROMA, NAMD, AMBER.]
1/3 of nodes accelerated, 2x system throughput
• CPU-only system: 100 jobs per day
• Accelerated system: 220 jobs per day
CPU: Dual E5-2698 […] 3.6GHz, 64GB System Memory, CentOS 6.2
GPU: Single Tesla K80, Boost enabled

Slide 20: TESLA FOR VISUALIZATION
IRAY, OPTIX, INDEX
Visualization Tools for HPC
Tesla Accelerated Computing

Slide 21: VISUALIZE DATA INSTANTLY FOR FASTER SCIENCE
Traditional: Slower Time to Discovery
• CPU supercomputer, then data transfer to a viz cluster
• Simulation: 1 week; data transfer: days; viz: 1 day
• Multiple iterations
• Time to discovery = months
Tesla Platform: Faster Time to Discovery
• GPU-accelerated supercomputer
• Visualize while you simulate, without data transfers
• Restart simulation instantly
• Multiple iterations
• Interactive, scalable, flexible
• Time to discovery = weeks

Slide 22: VISUALIZATION-ENABLED SUPERCOMPUTERS
Simulation + Visualization
• CSCS Piz Daint: Galaxy Formation
• NCSA Blue Waters: Molecular Dynamics
• ORNL Titan: Cosmology

Slide 23: GROWING ADOPTION IN CLIMATE & WEATHER
MeteoSwiss Deploys World’s First Accelerated Weather Supercomputer
• 2x higher resolution for daily forecasts
• 14x more simulation with ensemble approach for medium-range forecasts
NOAA Chooses Tesla To Improve Weather Forecast Research
• Develop global model with 3km resolution, five-fold increase from today’s resolution
• Improved resolution requires 100x computational complexity

Slide 24: U.S. TO BUILD TWO FLAGSHIP SUPERCOMPUTERS
Powered by the Tesla Platform
• 100-300 PFLOPS Peak
• 10x in Scientific App Performance
• IBM POWER9 CPU + NVIDIA Volta GPU
• NVLink High Speed Interconnect
• 40 TFLOPS per Node, >3,400 Nodes
• 2017
Major Step Forward on the Path to Exascale

Slide 25: ACCELERATED COMPUTING DELIVERS 5X HIGHER ENERGY EFFICIENCY
• IBM POWER CPU: Most Powerful Serial Processor
• NVIDIA NVLink: Fastest CPU-GPU Interconnect, 80-200 GB/s
• NVIDIA Volta GPU: Most Powerful Parallel Processor

Slide 26: CORAL: BUILT FOR GRAND SCIENTIFIC CHALLENGES
• Fusion Energy: Role of material disorder, statistics, and fluctuations in nanoscale materials and systems
• Climate Change: Study climate change adaptation and mitigation scenarios; realistically represent detailed features
• Biofuels: Search for renewable and more efficient energy sources
• Astrophysics: Radiation transport, critical to astrophysics, laser fusion, atmospheric dynamics, and medical imaging
• Combustion: Combustion simulations to enable the next gen diesel/bio-fuels to burn more efficiently
• Nuclear Energy: Unprecedented high-fidelity radiation transport calculations for nuclear energy applications

Slide 27: TESLA PLATFORM FOR MACHINE LEARNING

Slide 28: THE BIG BANG IN MACHINE LEARNING
DNN, Big Data, GPU
“Google’s AI engine also reflects how the world of computer hardware is changing. (It) depends on machines equipped with GPUs… And it depends on these chips more than the larger tech universe realizes.”

Slide 29: Tesla Revolutionizes Machine Learning
Google Brain application: deep learning

              BEFORE TESLA     AFTER TESLA
Cost          $5,000K          $200K
Servers       1,000 servers    16 Tesla servers
Energy        600 kW           4 kW
Performance   1x               6x

Slide 30: THE AI RACE IS ON

Slide 31: NVIDIA GPU: THE ENGINE OF DEEP LEARNING
Frameworks: Watson, Chainer, Theano, MatConvNet, TensorFlow, CNTK, Torch, Caffe
NVIDIA CUDA Accelerated Computing Platform

Slide 32: CUDA BOOSTS DEEP LEARNING: 5X IN 2 YEARS
[Chart: Caffe performance, scale 0 to 6. Configurations: K40, K40+cuDNN1, M40+cuDNN3, M40+cuDNN4. Dates: 11/2013, 9/2014, 7/2015, 12/2015.]
AlexNet training throughput based on 20 iterations. CPU: 1x E5-2680v3 12 Core 2.5GHz, 128GB System Memory, Ubuntu 14.04

Slide 33: AMAZING RATE OF IMPROVEMENT
• Image Recognition: ImageNet
• Pedestrian Detection: Caltech
• Object Detection: KITTI
[Charts: accuracy (%) over time, 2010 to 2015 and 11/2013 to 1/2016. Series labels: CV-based, DNN-based, Top Score, NVIDIA GPU, NVIDIA DRIVENet.]

Slide 34: CUDA FOR DEEP LEARNING DEVELOPMENT
Deep Learning SDK: DIGITS, cuDNN, cuSPARSE, cuBLAS, NCCL
Titan X, DevBox, GPU Cloud

Slide 35: FACEBOOK’S DEEP LEARNING MACHINE
• Purpose-Built for Deep Learning Training
• 2x Faster Training for Faster Deployment
• 2x Larger Networks for Higher Accuracy
• Powered by Eight Tesla M40 GPUs
• Open Rack Compliant
“Most of the major advances in machine learning and AI in the past few years have been contingent on tapping into powerful GPUs and huge data sets to build and train advanced models”
Serkan Piantino, Engineering Director of Facebook AI Research

Slide 36: DESIGNED FOR AI COMPUTING AT LARGE SCALE
Built on the NVIDIA Tesla Platform
• 8 Tesla M40s deliver aggregate 96 GB GDDR5 memory and 56 teraflops of SP performance
• Leverages world’s leading deep learning platform to tap into frameworks such as Torch and libraries such as cuDNN
Operational Efficiency and Serviceability
• Free-air Cooled Design Optimizes Thermal and Power Efficiency
• Components swappable without tools
• Configurable PCI-e for versatility

Slide 37: TESLA M40
World’s Fastest Accelerator for Deep Learning Training
13x Faster Training (Caffe): Reduce Training Time from 5 Days to less than 10 Hours
[Chart: Dual CPU Server vs. GPU Server with 4x Tesla M40, number of days, 0 to 5.]
CUDA Cores: 3072
Peak SP: 7 TFLOPS
GDDR5 Memory: 12 GB
Bandwidth: 288 GB/s
Power: 250 W
Note: Caffe benchmark with AlexNet, training 1.3M images with 90 epochs. CPU server uses 2x Xeon E5-2699v3 CPU, 128GB System Memory, Ubuntu 14.04

Slide 38: TESLA M4
Highest Throughput Hyperscale Workload Acceleration
• Video Processing (Stabilization and Enhancements): 4x
• Image Processing (Resize, Filter, Search, Auto-Enhance): 5x
• Video Transcode (H.264 & H.265, SD & HD): 2x
• Machine Learning Inference: 2x
CUDA Cores: 1024
Peak SP: 2.2 TFLOPS
GDDR5 Memory: 4 GB
Bandwidth: 88 GB/s
Form Factor: PCIe Low Profile
Power: 50 to 75 W
Preliminary specifications.
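
The two throughput claims in the deck (1000 to 3800 jobs per day when 70% of applications run 5x faster, and 100 to 220 jobs per day when 1/3 of nodes are accelerated) can be sanity-checked with a simple linear model. The sketch below is one possible reading of those slides, not NVIDIA's stated methodology. It assumes that daily throughput scales linearly, that the accelerated share of the workload equals the stated fraction, and that the rest of the workload runs at the CPU-only rate. The function names `jobs_per_day` and `implied_speedup` are hypothetical.

```python
def jobs_per_day(baseline_jobs, accelerated_fraction, speedup):
    """Daily throughput when a fraction of the work runs `speedup` times faster.

    Assumption (not from the slides): the non-accelerated share keeps its
    CPU-only rate and the accelerated share scales linearly with `speedup`.
    """
    cpu_part = baseline_jobs * (1.0 - accelerated_fraction)
    gpu_part = baseline_jobs * accelerated_fraction * speedup
    return cpu_part + gpu_part


def implied_speedup(baseline_jobs, accelerated_fraction, target_jobs):
    """Average speedup the accelerated share needs to reach `target_jobs`."""
    cpu_part = baseline_jobs * (1.0 - accelerated_fraction)
    return (target_jobs - cpu_part) / (baseline_jobs * accelerated_fraction)


if __name__ == "__main__":
    # Datacenter slide: 70% of applications, 5x faster, 1000 jobs per day baseline.
    print(round(jobs_per_day(1000, 0.7, 5), 1))

    # K80 slide: 1/3 of nodes accelerated, 100 jobs per day rising to 220.
    print(round(implied_speedup(100, 1 / 3, 220), 2))
```

Under these assumptions the first call reproduces the 3800 jobs per day quoted on the datacenter slide, and the second suggests that the 220 jobs per day figure corresponds to an average speedup of about 4.6x on the accelerated third of the nodes.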
