The Evolution of GPU Computing

Steve Scott, CTO Tesla
SC12, Salt Lake City

Early 3D

[Figure: early 3D pipeline. Application and scene management run on the host; geometry, rasterization, and pixel processing run on the GPU, with ROP/FBI/display units writing to frame buffer memory.]

(Pipeline figure © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign)

Moving Toward Programmability

Timeline, 1998–2004:
• DirectX 5: Riva 128
• DirectX 6 (multitexturing): Riva TNT
• DirectX 7 (hardware T&L, TextureStageState): GeForce 256
• DirectX 8 (SM 1.x): GeForce 3
• DirectX 9 (SM 2.0) and Cg: GeForceFX
• DirectX 9.0c (SM 3.0): GeForce 6

Showcase titles across the period: Half-Life, Quake 3, Giants, Halo, Far Cry, UE3, progressing from no lighting, to per-vertex lighting, to per-pixel lighting. (All images © their respective owners.)

(© NVIDIA Corporation 2006; Unreal image © Epic)

Programmable: GeForceFX (2002)

. Vertex and fragment operations specified in a small (macro) assembly language (separate processors)
. User-specified mapping of input data to operations
. Limited ability to use intermediate computed values to index input data (textures and vertex uniforms)

ADDR R0.xyz, eyePosition.xyzx, -f[TEX0].xyzx;
DP3R R0.w, R0.xyzx, R0.xyzx;
RSQR R0.w, R0.w;
MULR R0.xyz, R0.w, R0.xyzx;
ADDR R1.xyz, lightPosition.xyzx, -f[TEX0].xyzx;
DP3R R0.w, R1.xyzx, R1.xyzx;
RSQR R0.w, R0.w;
MADR R0.xyz, R0.w, R1.xyzx, R0.xyzx;
MULR R1.xyz, R0.w, R1.xyzx;
DP3R R0.w, R1.xyzx, f[TEX1].xyzx;
MAXR R0.w, R0.w, {0}.x;

[Figure: instruction dataflow, with Input 0/1/2 feeding an OP that writes Temp 0/1/2.]

The Pioneers: Early GPGPU (2002)
www.gpgpu.org

. Ray Tracing on Programmable Graphics Hardware, Purcell et al.
. PDEs in Graphics Hardware, Strzodka, Rumpf
. Fast Matrix Multiplies using Graphics Hardware, Larsen, McAllister
. Using Modern Graphics Architectures for General-Purpose Computing: A Framework and Analysis, Thompson et al.

Early Raytracing

This was not easy…

Challenges with Early GPGPU Programming

. HW challenges
  • Limited addressing modes
  • Limited communication: inter-pixel, scatter
  • Lack of integer & bit ops
  • No branching
. SW challenges
  • Graphics APIs (DirectX, OpenGL)
  • Very limited GPU computing ecosystem
  • Distinct vertex and fragment processors

[Figure: early fragment program model. Input registers, constants, and textures feed the fragment program, which uses temp registers and writes to output registers.]

Software (DirectX, OpenGL) and hardware slowly became more general purpose...

GeForce 8800 (G80): Unified Compute Architecture
• Unified processor types
• Unified access to memory structures
• DirectX 10 & SM 4.0

[Figure: G80 block diagram. The host feeds an input assembler and setup/raster/ZCull unit; vertex, geometry, and pixel thread issue units dispatch work onto a unified array of streaming processors (SP) with texture (TF) units and L1 caches, backed by L2 caches and frame buffer (FB) partitions, all managed by a thread processor.]

CUDA C Programming

Serial C code:

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

Parallel C code:

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
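To show how the kernel above is typically driven, here is a minimal host-side sketch; the wrapper name saxpy_on_gpu is hypothetical and error checking is omitted, so treat it as an illustration rather than code from the talk.

#include <cuda_runtime.h>

// Hypothetical host driver for the saxpy_parallel kernel shown above:
// allocate device buffers, copy inputs over, launch, and copy the result back.
void saxpy_on_gpu(int n, float a, const float *x_h, float *y_h)
{
    float *x_d, *y_d;
    cudaMalloc(&x_d, n * sizeof(float));
    cudaMalloc(&y_d, n * sizeof(float));
    cudaMemcpy(x_d, x_h, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(y_d, y_h, n * sizeof(float), cudaMemcpyHostToDevice);

    int nblocks = (n + 255) / 256;                 // one thread per element
    saxpy_parallel<<<nblocks, 256>>>(n, a, x_d, y_d);

    cudaMemcpy(y_h, y_d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(x_d);
    cudaFree(y_d);
}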

GPU Computing Comes of Age

• 2006 – G80: unified graphics/compute architecture, IEEE math, CUDA introduced at SC'06
• Tesla: double-precision FP
• Fermi: ECC, high DP FP performance, C++ support
• Kepler: large perf/W gains, programmability enhancements

[Chart: number of GPU-accelerated systems in the Top500, June 2007 through June 2012 (y-axis 0–60).]

KEPLER

• SMX (3x power efficiency)
• Hyper-Q and Dynamic Parallelism (programmability and application coverage)

Hyper-Q: Easier Speedup of Legacy MPI Apps

FERMI: 1 work queue
KEPLER: 32 concurrent work queues

[Chart: CP2K quantum chemistry, strong scaling on 864 water molecules. Speedup vs. dual CPU (0x–20x) against number of GPUs (0–20), comparing K20 with Hyper-Q (16 MPI ranks per node) to K20 without Hyper-Q (1 MPI rank per node); with Hyper-Q the speedup is roughly 2.5x higher.]
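To make the Hyper-Q comparison concrete, a hedged sketch of the pattern it helps: many independent streams (standing in for MPI ranks sharing a GPU) each submitting their own kernels. The function name and per-stream buffers are illustrative; it reuses the saxpy_parallel kernel from the CUDA example above.

#include <cuda_runtime.h>

// Illustrative only: nstreams independent streams each launch SAXPY on their own chunk.
// On Fermi, all launches funnel through one hardware work queue and can falsely
// serialize; on Kepler, Hyper-Q's 32 queues let independent streams run concurrently.
void saxpy_multi_stream(int nstreams, int chunk, float a,
                        float **x_d, float **y_d)   // per-stream device buffers (assumed allocated)
{
    cudaStream_t streams[32];                       // Kepler exposes up to 32 work queues
    if (nstreams > 32) nstreams = 32;

    for (int s = 0; s < nstreams; ++s)
        cudaStreamCreate(&streams[s]);

    int nblocks = (chunk + 255) / 256;
    for (int s = 0; s < nstreams; ++s)
        saxpy_parallel<<<nblocks, 256, 0, streams[s]>>>(chunk, a, x_d[s], y_d[s]);

    for (int s = 0; s < nstreams; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}

This is the same effect that lets 16 MPI ranks per node share a K20 productively in the CP2K result above.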

Dynamic Parallelism: Simpler Code, More General, Higher Performance

[Diagram: CPU + Fermi GPU vs. CPU + Kepler GPU. On Fermi, every kernel launch comes from the CPU; on Kepler, the GPU can generate its own work.]

Code size cut by 2x, 2x performance
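A minimal sketch of the pattern dynamic parallelism enables; the kernel names and the refinement test are illustrative, and it assumes a device of compute capability 3.5+ compiled with relocatable device code (-rdc=true).

// Illustrative child kernel: do extra work on one region of the data.
__global__ void refine_region(float *data, int offset, int n)
{
    int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < offset + n)
        data[i] *= 0.5f;                 // stand-in for real refinement work
}

// Parent kernel decides, on the GPU, where more work is needed and launches it
// directly, instead of returning to the CPU to issue the follow-up kernels.
__global__ void refine_where_needed(float *data, const float *error,
                                    int nregions, int region_size)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < nregions && error[r] > 1e-3f) {
        int nblocks = (region_size + 255) / 256;
        refine_region<<<nblocks, 256>>>(data, r * region_size, region_size);
    }
}

Child grids launched this way finish before the parent grid completes, so the follow-up work no longer needs a round trip through the CPU.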

Titan: World's #1 Supercomputer
18,688 Tesla K20X GPUs, 27 petaflops, world-leading efficiency

Kepler GPU Performance Results

Dual-socket comparison: CPU/GPU node vs. dual-CPU node (CPU = 8-core SandyBridge E5-2687W, 3.10 GHz); speedup relative to the dual-CPU node:

Application   Dual-CPU   Single-CPU + M2090   Single-CPU + K20X
WS-LSMS       1.00       7.00                 16.70
Chroma        1.00       8.00                 10.20
SPECFEM3D     1.00       5.41                 8.85
AMBER         1.00       3.46                 7.17
NAMD          1.00       1.71                 2.73

The Future

The Future of HPC Programming

Computers are not getting faster… just wider

→ Need to structure all HPC apps as throughput problems

Locality within nodes much more important
→ Need to expose locality (programming model) & explicitly manage memory hierarchy (compiler, runtime, autotuner); a CUDA sketch of explicit locality management follows below

How can we enable programmers to code for future processors in a portable way?
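The CUDA sketch referenced above: a hedged illustration of exposing locality by explicitly staging a block's working set into on-chip shared memory. The kernel name, the block size of 256, and the stencil weights are all illustrative choices, not anything from the talk.

// Each block stages its elements (plus a one-element halo) into shared memory,
// then every thread reads its neighbors from on-chip storage instead of DRAM.
__global__ void stencil3(const float *in, float *out, int n)
{
    __shared__ float tile[256 + 2];                    // block's data plus halo cells
    int g = blockIdx.x * blockDim.x + threadIdx.x;     // global index
    int l = threadIdx.x + 1;                           // local index within the tile

    tile[l] = (g < n) ? in[g] : 0.0f;                  // stage this thread's element
    if (threadIdx.x == 0)
        tile[0] = (g > 0) ? in[g - 1] : 0.0f;          // left halo
    if (threadIdx.x == blockDim.x - 1)
        tile[l + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;  // right halo
    __syncthreads();                                   // the block's data now lives on chip

    if (g < n)
        out[g] = 0.25f * tile[l - 1] + 0.5f * tile[l] + 0.25f * tile[l + 1];
}

Launch with 256 threads per block, e.g. stencil3<<<(n + 255) / 256, 256>>>(in_d, out_d, n).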

OpenACC Directives: Portable Parallelism

OpenMP (CPU):

main() {
    double pi = 0.0; long i;
#pragma omp parallel for reduction(+:pi)
    for (i = 0; i < …

OpenACC directives (CPU + GPU):

main() {
    double pi = 0.0; long i;
#pragma acc parallel loop reduction(+:pi)
    for (i = 0; i < …
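A complete, compilable version of the OpenACC variant, under the assumption that the truncated loop computes pi by midpoint-rule integration of 4/(1+x*x), which is consistent with the reduction over pi; the value of N and the final printf are also filled in here as assumptions.

#include <stdio.h>

#define N 1000000                        /* assumed problem size */

int main(void)
{
    double pi = 0.0;
    long i;

    /* Swap this directive for "#pragma omp parallel for reduction(+:pi)"
       to target the CPU instead; the loop itself is unchanged. */
    #pragma acc parallel loop reduction(+:pi)
    for (i = 0; i < N; i++) {
        double t = (double)((i + 0.5) / N);   /* midpoint of the i-th subinterval */
        pi += 4.0 / (1.0 + t * t);            /* integrate 4/(1+x^2) over [0,1] */
    }

    printf("pi = %f\n", pi / N);
    return 0;
}

The only difference between the two versions is the directive line, which is the portability argument being made.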

. Integration
. Further concentration on locality (both HW and SW)
. Reducing overheads
. Continued convergence with consumer technology

Echelon: NVIDIA's Extreme-Scale Computing Project
DARPA UHPC Program, targeting 2018

Fast Forward Program

Echelon Compute Node & System (2018 vision)

Key architectural features:
• Malleable memory hierarchy
• Hierarchical register files
• Hierarchical thread scheduling
• Place coherency/consistency
• Temporal SIMT & scalarization
• PGAS memory
• HW accelerated queues
• Active messages
• AMOs everywhere
• Collective engines
• Streamlined LOC/TOC interaction

[Diagram: Echelon node with throughput-optimized SMs (SM0–SM255, each with cores C0–C7) and latency-optimized cores (LOC0–LOC7) on a NoC, L2 banks L2_0–L2_1023 of 256 KB each, memory controller (MC) and NIC, DRAM stacks, DRAM DIMMs, and NV RAM, connected by a system interconnect. Node 0: 16 TF, 2 TB/s, 512+ GB; Nodes 0–255 per cabinet; Cabinet 0: 4 PF, 128 TB; Echelon system: up to 1 EF.]

Summary

. GPU accelerated computing has come a long way in a very short time
. Has become much easier to program and more general purpose
. Aligned with technology trends, supported by consumer markets
. Future evolution is about:
  • Integration
  • Increased generality – efficient on any code with high parallelism
  • Reducing energy, reducing overheads

. This is simply how computers will be built

Thank You