CUDA & Tesla GPU Computing Accelerating Scientific Discovery

March 2009

1 What is GPU Computing?

Heterogeneous computing: computing with the CPU (4 cores) plus the GPU

2 “Homemade” Revolution

16 GPUs: MIT, Harvard
8 GPUs: University of Antwerp, Belgium
4 GPUs: TU Braunschweig, Germany
3 GPUs: University of Illinois
3 GPUs: Yonsei University, Korea
3 GPUs: Rice University
3 GPUs: University of Cambridge, UK
2 GPUs: Georgia Tech

3 GPUs: Turning Point in Supercomputing

Desktop beats Cluster

CalcUA supercomputer: $5 Million
Tesla Personal Supercomputer: $10,000

Source: University of Antwerp, Belgium

4 Introducing the Tesla Personal Supercomputer

Supercomputing performance: massively parallel CUDA architecture, 960 cores, 4 Teraflops, 250x the performance of a desktop

Personal: one researcher, one supercomputer; plugs into a standard power strip

Accessible: program in C for Windows and Linux; available now worldwide for under $10,000

5 New GPU-based High-Performance Computing Landscape

Standard supercomputing cluster: 1x performance, $100K - $1M
Tesla Personal Supercomputer: 250x performance, < $10K
Tesla cluster: 5000x performance

6 Not 2x or 3x: Speedups are 20x to 150x

146x Medical Imaging, U of Utah
36x Molecular Dynamics, U of Illinois, Urbana
18x Video Transcoding, Elemental Tech
50x Matlab Computing, AccelerEyes
100x Astrophysics, RIKEN
149x Financial Simulation, Oxford
47x Linear Algebra, Universidad Jaime
20x 3D Ultrasound, Techniscan
130x Quantum Chemistry, U of Illinois, Urbana
30x Gene Sequencing, U of Maryland

7 Parallel vs Sequential Architecture Evolution

Many-core (High Performance Computing architectures): ILLIAC IV, Cray-1, Maspar, Thinking Machines, Blue Gene, GPUs

Multi-core (sequential architectures for database and operating system workloads): DEC PDP-1, IBM System 360, Intel 4004, VAX, IBM POWER4, x86

8 [Diagram: a many-core processor, each core with its own L1 cache]

9 Tesla T10 GPU: 240 Processor Cores

Thread Processor Array (TPA): Thread Processors (TPs), Special Function Units (SFUs), a double precision unit, and shared memory

Each processor core has:
- Multi-banked register file
- Floating point / integer unit and special ops
- Move, compare, logic, branch unit

IEEE 754 floating point, single and double precision

102 GB/sec high-speed interface to GDDR3 main memory over a 512-bit bus, managed alongside a hardware thread manager

30 TPAs = 240 processors

10 CUDA Parallel Computing Architecture

Parallel computing architecture and programming model

Includes a C compiler plus support for OpenCL and DX11 Compute

Architected to natively support all computational interfaces (standard languages and APIs)

11 CUDA Facts

750+ Research Papers http://scholar.google.com/scholar?hl=en&q=cuda+gpu

100+ universities teaching CUDA http://www.nvidia.com/object/cuda_university_courses.html

100 Million CUDA‐Enabled GPUs

25,000 Active Developers

www.NVIDIA.com/CUDA

12 100+ Universities Teaching CUDA

Available on Google Maps at http://maps.google.com/maps/ms?ie=UTF8&hl=en&t=h&msa=0&msid=118373856608078972372.00046298fc59576f7741a&ll=31.052934,10.546875&spn=161.641612,284.765625&z=2

13 Parallel Computing on All GPUs: 100+ Million CUDA GPUs Deployed

GeForce® (Entertainment), Tesla™ (High-Performance Computing), Quadro® (Design & Creation)

14 Tesla GPU Computing Products

Tesla S1070 1U System: 4 Tesla GPUs; 4.14 Teraflops single precision; 346 Gigaflops double precision; 4 GB / GPU
Tesla C1060 Computing Board: 1 Tesla GPU; 933 Gigaflops single precision; 78 Gigaflops double precision; 4 GB
Tesla Personal Supercomputer (4 Tesla C1060s): 4 Tesla GPUs; 3.7 Teraflops single precision; 312 Gigaflops double precision; 4 GB / GPU

15 Tesla S1070: Green Supercomputing

20X Better Performance / Watt

Hess, University of Heidelberg, Chevron, University of Illinois, Petrobras, University of North Carolina, NCSA, Max Planck Institute, CEA, Rice University, Tokyo Tech, University of Maryland, JFCOM, Eotvos University, SAIC, University of Wuppertal, Federal, Chinese Academy of Sciences, Motorola, National Taiwan University, Kodak

16 A $5 Million Datacenter

Per 1U node:
CPU 1U server (2 quad-core CPUs): 8 CPU cores; 0.17 Teraflop single precision; 0.08 Teraflop double precision; $3,000; 700 W
CPU 1U server + Tesla 1U system: 8 CPU cores + 4 GPUs = 968 cores; 4.14 Teraflops single precision; 0.346 Teraflop double precision; $11,000; 1500 W

For a $5 Million datacenter:
1819 CPU servers: 310 Teraflops single precision; 155 Teraflops double precision; 16K sq ft total area; 1273 kW total power
455 CPU servers + 455 Tesla systems: 1961 Teraflops single precision (6x more performance); 196 Teraflops double precision; 9K sq ft total area (40% smaller); 682 kW total power (half the power)

17 5000+ Customers / ISVs

Categories: Life Sciences & Medical Equipment; Productivity / Misc; Oil and Gas; CAE / EDA; Finance; Mathematical; Communication

Max Planck GE Healthcare CEA Hess Synopsys Symcor AccelerEyes Nokia FDA Siemens NCSA TOTAL Nascentric Level 3 MathWorks RIM Robarts Research Techniscan WRF Weather CGG/Veritas Gauda SciComp Wolfram Philips Medtronic Boston Scientific Modeling Chevron CST Hanweck National Samsung AGC Eli Lilly OptiTex Headwave Agilent Quant Instruments LG Evolved machines Silicon Informatics Tech-X Acceleware Catalyst Ansys Sony Smith-Waterman Stockholm Elemental Seismic City RogueWave Access Analytics Ericsson Technologies DNA sequencing Research BNP Paribas Tech-x NTT DoCoMo Dimensional P-Wave AutoDock Harvard Imaging Seismic RIKEN Mitsubishi NAMD/VMD Delaware Manifold Imaging SOFA Hitachi Folding@Home Pittsburg Digisens Mercury Renault Radio Howard Hughes ETH Zurich General Mills Boeing Research Medical ffA Laboratory Institute Atomic Rapidmind CRIBI Genomics US Air Force Physics Rhythm & Hues xNormal Elcomsoft LINZIK

18 More Information

Tesla main page: http://www.nvidia.com/tesla (product information, industry solutions)
Hear from developers: http://www.youtube.com/nvidiatesla

CUDA Zone: http://www.nvidia.com/cuda (applications, papers, videos)

19 CUDA Parallel Programming Architecture and Model: Programming the GPU in High-Level Languages

20 NVIDIA C for CUDA and OpenCL

C for CUDA: entry point for developers who prefer high-level C

OpenCL: entry point for developers who want a low-level API

Both share back-end compiler and optimization technology (PTX), targeting the GPU

21 Different Programming Styles

C for CUDA: C with parallel keywords; C runtime that abstracts the driver API; memory managed by the C runtime; generates PTX

OpenCL: hardware API, similar to OpenGL and the CUDA driver API; programmer has complete access to the hardware device; memory managed by the programmer; generates PTX

22 Simple “C” Description For Parallelism

Standard C code:

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

Parallel C for CUDA code:

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}

// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
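The slide shows only the kernel and its launch. As a minimal host-side sketch (not part of the original deck; n and the host arrays x and y are assumed placeholders), the CUDA runtime can manage the device memory around that launch like this:

// Hypothetical host-side setup for saxpy_parallel using the CUDA runtime API
float *d_x, *d_y;
size_t bytes = n * sizeof(float);

cudaMalloc((void**)&d_x, bytes);                      // allocate device memory
cudaMalloc((void**)&d_y, bytes);
cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);    // copy inputs to the GPU
cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, d_x, d_y);  // launch the kernel

cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);    // copy the result back
cudaFree(d_x);                                        // release device memory
cudaFree(d_y);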

23 Compiling C for CUDA Applications

C for CUDA application: CUDA key kernels + rest of the C application

NVCC compiles the CUDA kernels into CUDA object files; the CPU compiler compiles the rest of the application into CPU object files

The linker combines them into a single CPU-GPU executable
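As a rough illustration of that flow (the file names and CUDA library path are assumptions, not from the deck, and may need adjusting for the installation), a build might look like:

nvcc -c saxpy_kernels.cu -o saxpy_kernels.o                          # NVCC compiles the CUDA kernels
gcc -c main.c -o main.o                                              # host compiler builds the rest of the application
gcc main.o saxpy_kernels.o -L/usr/local/cuda/lib -lcudart -o app    # link into one CPU-GPU executable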

24 Application Domains

25 Accelerating Time to Discovery

[Chart: example runs that take 4.6 days, 2.7 days, 8 hours, or 3 hours on a CPU alone finish in 13 to 30 minutes with a GPU]

26 Molecular Dynamics

Available MD software: NAMD / VMD (alpha release), HOOMD, ACE-MD, MD-GPU
Ongoing work: LAMMPS, CHARMM, GROMACS, AMBER
Sources: Stone, Phillips, Hardy, Schulten; Anderson, Lorenz, Travesset

27 Quantum Chemistry

Ongoing work: Q-Chem, Gaussian
Sources: Ufimtsev, Martinez; Yasuda

28 Computational Fluid Dynamics (CFD)

Ongoing work: Navier-Stokes, Lattice Boltzmann, 3D Euler solver, weather and ocean modeling
Sources: Thibault, Senocak; Tolke, Krafczyk

29 Electromagnetics / Electrodynamics

FDTD solvers: Acceleware, EM Photonics, CUDA tutorial

Ongoing work: Maxwell equation solver, ring oscillator (FDTD), particle beam dynamics simulator

FDTD acceleration using GPUs (Source: Acceleware)

30 Weather, Atmospheric, & Ocean Modeling

CUDA-accelerated WRF available; other WRF kernels being ported (Source: Michalakes, Vachharajani)

Ongoing work: tsunami modeling, ocean modeling, several CFD codes (Source: Takayuki Aoki, Tokyo Tech)

31 Bio-Informatics

62x Speedup

102x Speedup

32 Encryption

11x Speedup

Source: S. Manavski Source: http://code.google.com/p/pyrit/

33 Signal Processing

Sources: CUDA FFT library (cuFFT); GPU VSIPL (http://gpu-vsipl.gtri.gatech.edu/)
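As a rough sketch of how the cuFFT library referenced above is typically driven (the transform length NX and device buffer are placeholder assumptions, not from the deck):

#include <cufft.h>

// Plan and execute an in-place 1D complex-to-complex FFT on the GPU
int NX = 256 * 1024;
cufftComplex *d_signal;
cudaMalloc((void**)&d_signal, sizeof(cufftComplex) * NX);
// ... copy the input signal into d_signal with cudaMemcpy ...

cufftHandle plan;
cufftPlan1d(&plan, NX, CUFFT_C2C, 1);                    // one 1D transform of length NX
cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);   // forward transform, in place

cufftDestroy(plan);
cudaFree(d_signal);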

34 Libraries

35 FFT Performance: CPU vs GPU (8-Series)

• Intel FFT numbers calculated by repeating the same FFT plan
• Real FFT performance is ~10 GFlops

Source for Intel data: http://www.intel.com/cd/software/products/asmo-na/eng/266852.htm

36 2D FFT: GPU Performance

37 BLAS: CPU vs GPU (10-series)

CUBLAS: CUDA 2.0 on a Tesla C1060 (10-series GPU); ATLAS 3.81 on a dual 2.8 GHz dual-core Opteron system
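A minimal sketch (not from the deck; the matrix dimension N and host arrays A, B, C are placeholder assumptions) of calling single-precision matrix multiply through the CUBLAS library of that era:

#include <cublas.h>

// C = alpha*A*B + beta*C with column-major N x N matrices
float *dA, *dB, *dC;
cublasInit();                                         // initialize CUBLAS
cublasAlloc(N*N, sizeof(float), (void**)&dA);         // allocate device matrices
cublasAlloc(N*N, sizeof(float), (void**)&dB);
cublasAlloc(N*N, sizeof(float), (void**)&dC);
cublasSetMatrix(N, N, sizeof(float), A, N, dA, N);    // copy inputs to the GPU
cublasSetMatrix(N, N, sizeof(float), B, N, dB, N);

cublasSgemm('n', 'n', N, N, N, 1.0f, dA, N, dB, N, 0.0f, dC, N);

cublasGetMatrix(N, N, sizeof(float), dC, N, C, N);    // copy the result back
cublasFree(dA); cublasFree(dB); cublasFree(dC);
cublasShutdown();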

38 GPU + CPU DGEMM Performance

[Chart: DGEMM GFLOPS vs. matrix size (128 to 6080) for CPU only, GPU only, and GPU + CPU; the combined GPU + CPU configuration sustains the highest GFLOPS]

System: Xeon quad-core 2.8 GHz with MKL 10.3; Tesla C1060 GPU (1.296 GHz)
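One common way to obtain a combined GPU + CPU DGEMM like the one charted here is to split the output matrix between the two processors. The sketch below is an illustration under assumed names, not code from the deck: the host BLAS is reached through cblas_dgemm, the device matrices dA/dB/dC are already resident on the GPU, and the operands are column-major N x N. Part of C's columns go to CUBLAS and the rest to the host BLAS; the two parts can overlap because the CUBLAS call returns before the GPU finishes.

#include <cublas.h>
#include <cblas.h>

// Split C's columns: the first n_gpu go to the GPU, the remaining n_cpu to the CPU
int n_gpu = (int)(N * 0.8);          // fraction tuned to relative GPU/CPU speed (assumption)
int n_cpu = N - n_gpu;

// GPU computes C[:, 0:n_gpu]; the call queues the work and returns immediately
cublasDgemm('n', 'n', N, n_gpu, N, 1.0, dA, N, dB, N, 0.0, dC, N);

// CPU computes C[:, n_gpu:N] concurrently using the host BLAS
cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
            N, n_cpu, N, 1.0,
            A, N, B + (size_t)n_gpu * N, N,
            0.0, C + (size_t)n_gpu * N, N);

// Retrieving the GPU part blocks until the device work is done
cublasGetMatrix(N, n_gpu, sizeof(double), dC, N, C, N);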

39 Results: Sparse Matrix-Vector Multiplication (SpMV) on CUDA

CPU results from "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Williams et al., Supercomputing 2007

40 GPUs: Better Architecture for Computing

41 CPUs: Memory Bandwidth Bottleneck

CPU: 2 cores sharing an L2 cache, connected to main memory over a 64-bit FSB at 12.8 GB/sec
GPU: 240 cores with a hardware thread manager, connected to GDDR3 main memory over a 512-bit interface at 102 GB/sec, an 8x faster memory interface

42 NVIDIA's GPUs: Ever Increasing Performance

T10 = Tesla 10-series (double precision debut), G9x = GeForce 9800 GTX, G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800

43 CPU vs GPU 1U Comparison

CPU 1U Server vs. Tesla 1U System

Product: 2x quad-core Xeon, 3 GHz vs. quad-GPU Tesla 1U
Number of cores: 8 CPU cores vs. 4 GPUs (960 cores)
Single precision: 0.192 Teraflop vs. 4.14 Teraflops (21x higher Gflops)
Double precision: 96 Gflops vs. 346 Gflops (3.6x higher Gflops)
Typical 1U system power: 670 W vs. 700 W

Note: current top Intel Gflops: Xeon Harpertown X5482 @ 3.2 GHz is 102.4 Gflops (a > $1000 CPU)

44 Final Thoughts

GPUs and heterogeneous parallel architectures will revolutionize computing

Parallel computing is key to solving some of the most interesting and important human challenges ahead

Learning parallel programming is imperative for students in computing and the sciences
