CUDA & Tesla GPU Computing Accelerating Scientific Discovery
March 2009
1 What is GPU Computing?
[Diagram: a multicore CPU (4 cores) alongside a many-core GPU]
Heterogeneous computing: computing with CPU + GPU
2 “Homemade” Supercomputer Revolution
16 GPUs 8 GPUs 4 GPUs 3 GPUs MIT, Harvard University of Antwerp TU Braunschweig, University of Illinois Belgium Germany
3 GPUs 3 GPUs 3 GPUs 2 GPUs Yonsei University, Rice University University of Georgia Tech Korea Cambridge, UK
3 GPUs: Turning Point in Supercomputing
Desktop beats Cluster
CalcUA cluster: $5 million; Tesla Personal Supercomputer: $10,000
Source: University of Antwerp, Belgium
4 Introducing the Tesla Personal Supercomputer
Supercomputing performance: massively parallel CUDA architecture; 960 cores, 4 teraflops; 250x the performance of a desktop
Personal: one researcher, one supercomputer; plugs into a standard power strip
Accessible: program in C for Windows, Linux; available now worldwide for under $10,000
5 New GPU-Based High-Performance Computing Landscape
[Chart: performance vs. accessibility]
Tesla cluster (supercomputing cluster range, $100K - $1M): 5000x
Tesla Personal Supercomputer (< $10K): 250x
Standard workstation: 1x
6 Not 2x or 3x: Speedups are 20x to 150x
146X Medical Imaging (U of Utah)
36X Molecular Dynamics (U of Illinois, Urbana)
18X Video Transcoding (Elemental Tech)
50X Matlab Computing (AccelerEyes)
100X Astrophysics (RIKEN)
149X Financial Simulation (Oxford)
47X Linear Algebra (Universidad Jaime)
20X 3D Ultrasound (Techniscan)
130X Quantum Chemistry (U of Illinois, Urbana)
30X Gene Sequencing (U of Maryland)
7 Parallel vs Sequential Architecture Evolution
Many-core (high-performance computing architectures): ILLIAC IV, Maspar, Cray-1, Thinking Machines, Blue Gene, GPUs
Multi-core (sequential architectures: database, operating system): DEC PDP-1, IBM System 360, Intel 4004, VAX, IBM POWER4, x86
8 [Diagram: a many-core processor, each core with its own L1 cache]
9 Tesla T10 GPU: 240 Processor Cores
Thread Processor Array (TPA): thread processors (TP), TP array shared memory, special function units (SFU), double precision unit
Thread Processor (TP): multi-banked register file; FP / integer unit; special ops; move, compare, logic, branch unit
IEEE 754 floating point, single and double precision
102 GB/sec, 512-bit high-speed interface to GDDR3 main memory
Thread manager
30 TPAs = 240 processors
10 CUDA Parallel Computing Architecture
Parallel computing architecture and programming model
Includes a C compiler plus support for OpenCL and DX11 Compute
Architected to natively support all computational interfaces (standard languages and APIs)
11 CUDA Facts
750+ Research Papers http://scholar.google.com/scholar?hl=en&q=cuda+gpu
100+ universities teaching CUDA http://www.nvidia.com/object/cuda_university_courses.html
100 Million CUDA‐Enabled GPUs
25,000 Active Developers
www.NVIDIA.com/CUDA
12 100+ Universities Teaching CUDA
Available on Google Maps at http://maps.google.com/maps/ms?ie=UTF8&hl=en&t=h&msa=0&msid=118373856608078972372.00046298fc59576f7741a&ll=31.052934,10.546875&spn=161.641612,284.765625&z=2
13 Parallel Computing on All GPUs: 100+ Million CUDA GPUs Deployed
GeForce® (Entertainment), Tesla™ (High-Performance Computing), Quadro® (Design & Creation)
14 Tesla GPU Computing Products
Tesla S1070 1U System: 4 Tesla GPUs; 4.14 teraflops single precision; 346 gigaflops double precision; 4 GB / GPU
Tesla C1060 Computing Board: 1 Tesla GPU; 933 gigaflops single precision; 78 gigaflops double precision; 4 GB
Tesla Personal Supercomputer (4 Tesla C1060s): 4 Tesla GPUs; 3.7 teraflops single precision; 312 gigaflops double precision; 4 GB / GPU
15 Tesla S1070: Green Supercomputing
20X Better Performance / Watt
Customers include: Hess, Chevron, Petrobras, NCSA, CEA, Tokyo Tech, JFCOM, SAIC, Motorola, Kodak, University of Heidelberg, University of Illinois, University of North Carolina, Max Planck Institute, Rice University, University of Maryland, Eotvas University, University of Wuppertal, Chinese Academy of Sciences, National Taiwan University
16 A $5 Million Datacenter
CPU 1U server (2 quad-core Xeons): 8 CPU cores; 0.17 teraflop single precision, 0.08 teraflop double; $3,000; 700 W
CPU 1U server + Tesla 1U system: 8 CPU cores + 4 GPUs = 968 cores; 4.14 teraflops single precision, 0.346 teraflop double; $11,000; 1,500 W
For $5 million, CPU only: 1819 CPU servers; 310 teraflops single, 155 teraflops double; total area 16K sq feet; total power 1273 KW
For $5 million, heterogeneous: 455 CPU servers + 455 Tesla systems; 1961 teraflops single (6x more performance), 196 teraflops double; total area 9K sq feet (40% smaller); total power 682 KW (half the power)
17 5000+ Customers / ISVs
Categories: Life Sciences, Medical Equipment, Productivity / Misc, Oil and Gas, CAE / EDA, Finance, Mathematical, Communication
Customers and ISVs include: Max Planck, GE Healthcare, CEA, Hess, Synopsys, Symcor, AccelerEyes, Nokia, FDA, Siemens, NCSA, TOTAL, Nascentric, Level 3, MathWorks, RIM, Robarts Research, Techniscan, WRF Weather Modeling, CGG/Veritas, Gauda, SciComp, Wolfram, Philips, Medtronic, Boston Scientific, Chevron, CST, Hanweck, National Instruments, Samsung, AGC, Eli Lilly, OptiTex, Headwave, Agilent, LG, Evolved Machines, Silicon Informatics, Tech-X, Acceleware, Catalyst, Ansys, Sony, Stockholm Research, Elemental Technologies, Seismic City, RogueWave, Access Analytics, Ericsson, BNP Paribas, NTT DoCoMo, Dimensional Imaging, P-Wave Seismic Imaging, Harvard, RIKEN, Mitsubishi, Delaware, Manifold, Hitachi, Pittsburg, Digisens, Mercury Computer, Renault, Howard Hughes Medical Institute, ETH Zurich, General Mills, Boeing, ffA, CRIBI Genomics, US Air Force, Rapidmind, Rhythm & Hues, Elcomsoft, LINZIK
Applications mentioned include: Smith-Waterman DNA sequencing, AutoDock, NAMD/VMD, SOFA, Folding@Home, xNormal
18 More Information
Tesla main page (product information, industry solutions): http://www.nvidia.com/tesla
Hear from developers: http://www.youtube.com/nvidiatesla
CUDA Zone (applications, papers, videos): http://www.nvidia.com/cuda
19 CUDA Parallel Programming Architecture and Model Programming the GPU in High-Level Languages
20 NVIDIA C for CUDA and OpenCL
C for CUDA: entry point for developers who prefer high-level C
OpenCL: entry point for developers who want a low-level API
Both share back-end compiler and optimization technology (PTX), targeting the GPU
21 Different Programming Styles
C for CUDA: C with parallel keywords; C runtime that abstracts the driver API; memory managed by the C runtime; generates PTX
OpenCL: hardware API, similar to OpenGL and the CUDA driver API; programmer has complete access to the hardware device; memory managed by the programmer; generates PTX
22 Simple “C” Description For Parallelism

Standard C code:

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}
// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

Parallel C for CUDA code:

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}
// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
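The slide shows only the kernel and its launch. As slide 21 notes, the C runtime also manages device memory; a minimal sketch of the host-side driver for this kernel might look as follows (the function name `saxpy_on_gpu` and the omission of error handling are my own, not from the slides):

```cuda
#include <cuda_runtime.h>

// Host-side driver for the saxpy_parallel kernel above, using the
// C-for-CUDA runtime API. x and y are host arrays of n floats.
void saxpy_on_gpu(int n, float a, float *x, float *y)
{
    float *d_x, *d_y;
    size_t bytes = n * sizeof(float);

    // Allocate device memory and copy inputs host -> device
    cudaMalloc((void **)&d_x, bytes);
    cudaMalloc((void **)&d_y, bytes);
    cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

    // Launch with 256 threads per block, enough blocks to cover n elements
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, a, d_x, d_y);

    // Copy the result back and release device memory
    cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
}
```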
23 Compiling C for CUDA Applications
[Diagram: a C for CUDA application is split; key kernels go through NVCC to CUDA object files, the rest of the C application goes through the CPU compiler to CPU object files, and the linker produces a combined CPU-GPU executable]
24 Application Domains
25 Accelerating Time to Discovery
[Chart: CPU-only runtimes of 4.6 days, 2.7 days, 8 hours, and 3 hours drop with GPUs to 27, 30, 13, and 16 minutes]
26 Molecular Dynamics
Available MD software: NAMD / VMD (alpha release), HOOMD, ACE-MD, MD-GPU
Ongoing work: LAMMPS, CHARMM, GROMACS, AMBER
Sources: Stone, Phillips, Hardy, Schulten; Anderson, Lorenz, Travesset
27 Quantum Chemistry
Ongoing work: Q-Chem, Gaussian
Sources: Ufimtsev, Martinez; Yasuda
28 Computational Fluid Dynamics (CFD)
Ongoing work: Navier-Stokes, Lattice Boltzmann, 3D Euler solver, weather and ocean modeling
Sources: Thibault, Senocak; Tolke, Krafczyk
29 Electromagnetics / Electrodynamics
FDTD solvers: Acceleware, EM Photonics, CUDA Tutorial
Ongoing work: Maxwell equation solver, ring oscillator (FDTD), particle beam dynamics simulator
FDTD acceleration using GPUs (Source: Acceleware)
30 Weather, Atmospheric, & Ocean Modeling
CUDA-accelerated WRF available; other kernels in WRF being ported
Ongoing work: tsunami modeling, ocean modeling, several CFD codes
Sources: Michalakes, Vachharajani; Takayuki Aoki, Tokyo Tech
31 Bio-Informatics
62x Speedup
102x Speedup
32 Encryption
11x Speedup
Source: S. Manavski Source: http://code.google.com/p/pyrit/
33 Signal Processing
Source: CUDA FFT library (cuFFT) Source: http://gpu-vsipl.gtri.gatech.edu/
34 Libraries
35 FFT Performance: CPU vs GPU (8-Series)
• Intel FFT numbers calculated by repeating the same FFT plan
• Real FFT performance is ~10 GFlops
Source for Intel data: http://www.intel.com/cd/software/products/asmo-na/eng/266852.htm
36 2D FFT: GPU Performance
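The cuFFT library benchmarked here exposes a plan/execute interface. As a hedged sketch (not from the slides), a 1D complex-to-complex transform looks roughly like this; the function name and the assumption that `d_data` already holds `N` values in device memory are mine:

```cuda
#include <cufft.h>

// Minimal 1D complex-to-complex FFT with cuFFT: plan once, execute, destroy.
void fft_forward(cufftComplex *d_data, int N)
{
    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);               // one N-point transform
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD); // in-place forward FFT
    cufftDestroy(plan);
}
```

Reusing one plan across many executions, as the Intel comparison above does, amortizes the planning cost.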
37 BLAS: CPU vs GPU (10-series)
CUBLAS: CUDA 2.0 on Tesla C1060 (10-series GPU); ATLAS 3.81 on dual 2.8 GHz dual-core Opterons
38 GPU + CPU DGEMM Performance
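CUBLAS at the CUDA 2.0 level exposes BLAS-style entry points plus helpers for device memory. A hedged sketch of a single-precision GEMM call (the square-matrix setup and function name are my own):

```cuda
#include <cublas.h>

// C = A * B for n x n single-precision matrices in column-major
// (BLAS) order, using the CUDA 2.0-era CUBLAS helper API.
void sgemm_on_gpu(int n, const float *A, const float *B, float *C)
{
    float *dA, *dB, *dC;
    cublasInit();
    cublasAlloc(n * n, sizeof(float), (void **)&dA);
    cublasAlloc(n * n, sizeof(float), (void **)&dB);
    cublasAlloc(n * n, sizeof(float), (void **)&dC);
    cublasSetMatrix(n, n, sizeof(float), A, n, dA, n);
    cublasSetMatrix(n, n, sizeof(float), B, n, dB, n);

    // 'n', 'n': no transpose; alpha = 1, beta = 0
    cublasSgemm('n', 'n', n, n, n, 1.0f, dA, n, dB, n, 0.0f, dC, n);

    cublasGetMatrix(n, n, sizeof(float), dC, n, C, n);
    cublasFree(dA); cublasFree(dB); cublasFree(dC);
    cublasShutdown();
}
```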
[Chart: DGEMM GFLOPs vs. matrix size (128 to 6080) for GPU + CPU, GPU only, and CPU only; Xeon quad-core 2.8 GHz with MKL 10.3, Tesla C1060 GPU (1.296 GHz)]
39 Results: Sparse Matrix-Vector Multiplication (SpMV) on CUDA
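The slides do not show the kernels themselves; as a hedged illustration, the simplest common CUDA formulation of SpMV over a CSR matrix assigns one thread per row (the names and signature below are my own):

```cuda
// "Scalar" CSR sparse matrix-vector multiply, y = A*x:
// one thread per matrix row.
__global__ void spmv_csr_scalar(int num_rows, const int *row_ptr,
                                const int *col_idx, const float *val,
                                const float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float dot = 0.0f;
        // Accumulate this row's nonzeros against x
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            dot += val[j] * x[col_idx[j]];
        y[row] = dot;
    }
}
```

Tuned SpMV codes typically go further (e.g., one warp per row for long rows, or blocked formats), since the scalar kernel's memory accesses are poorly coalesced.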
CPU results from “Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms”, Williams et al., Supercomputing 2007
40 GPUs: Better Architecture for Computing
41 CPUs: Memory Bandwidth Bottleneck
CPU (2 cores, L2 cache): 64-bit FSB, 12.8 GB/sec to main memory
GPU (240 cores, thread manager): 512-bit GDDR3 interface, 102 GB/sec to main memory (8x faster interface)
42 NVIDIA’s GPUs: Ever Increasing Performance
[Chart legend] T10 = Tesla 10-series (double precision debut), G9x = GeForce 9800 GTX, G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800
43 CPU vs GPU 1U Comparison
CPU 1U server: 2x quad-core Xeon, 3 GHz; 8 CPU cores; 0.192 teraflop single precision; 96 GFlops double precision; 670 W typical 1U system power
Tesla 1U system: quad-GPU; 4 GPUs, 960 cores; 4.14 teraflops single precision (21x higher GFlops); 346 GFlops double precision (3.6x higher GFlops); 700 W typical 1U system power
Note: current top Intel GFlops: Xeon Harpertown X5482 @ 3.2 GHz is 102.4 Gflops (a >$1000 CPU)
44 Final Thoughts
GPUs and heterogeneous parallel architectures will revolutionize computing
Parallel computing is key to solving some of the most interesting and important human challenges ahead
Learning parallel programming is an imperative for students in computing and the sciences
45