The Evolution of GPU Computing
Steve Scott, CTO Tesla, NVIDIA
SC12, Salt Lake City

Early 3D Graphics Pipeline
Host: Application -> Scene Management
GPU:  Geometry -> Rasterization -> Pixel Processing -> ROP/FBI/Display
      Frame Buffer Memory
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign

Moving Toward Programmability
1998: DirectX 5/6 - Multitexturing - Riva 128, Riva TNT (Half-Life)
1999: DirectX 7 - T&L, TextureStageState - GeForce 256 (Quake 3)
2001: DirectX 8 - SM 1.x - GeForce 3 (Giants, Halo)
2002: DirectX 9 - SM 2.0, Cg - GeForceFX (Far Cry)
2004: DirectX 9.0c - SM 3.0 - GeForce 6 (UE3)

Lighting evolution: No Lighting -> Per-Vertex Lighting -> Per-Pixel Lighting
All images © their respective owners
Copyright © NVIDIA Corporation 2006; Unreal © Epic

Programmable Shaders: GeForceFX (2002)
. Vertex and fragment operations specified in a small (macro) assembly language (separate processors)
. User-specified mapping of input data to operations
. Limited ability to use intermediate computed values to index input data (textures and vertex uniforms)
    ADDR R0.xyz, eyePosition.xyzx, -f[TEX0].xyzx;
    DP3R R0.w, R0.xyzx, R0.xyzx;
    RSQR R0.w, R0.w;
    MULR R0.xyz, R0.w, R0.xyzx;
    ADDR R1.xyz, lightPosition.xyzx, -f[TEX0].xyzx;
    DP3R R0.w, R1.xyzx, R1.xyzx;
    RSQR R0.w, R0.w;
    MADR R0.xyz, R0.w, R1.xyzx, R0.xyzx;
    MULR R1.xyz, R0.w, R1.xyzx;
    DP3R R0.w, R1.xyzx, f[TEX1].xyzx;
    MAXR R0.w, R0.w, {0}.x;

The Pioneers: Early GPGPU (2002)
www.gpgpu.org
. Ray Tracing on Programmable Graphics Hardware, Purcell et al.
. PDEs in Graphics Hardware, Strzodka, Rumpf
. Fast Matrix Multiplies using Graphics Hardware, Larsen, McAllister
. Using Modern Graphics Architectures for General-Purpose Computing: A Framework and Analysis, Thompson et al.
Early Raytracing

This was not easy…

Challenges with Early GPGPU Programming
. HW challenges
  • Limited addressing modes
  • Limited communication: inter-pixel, scatter
  • Lack of integer & bit ops
  • No branching
. SW challenges
  • Graphics APIs (DirectX, OpenGL)
  • Very limited GPU computing ecosystem
  • Distinct vertex and fragment processors
[Diagram: fragment program fed by input registers, constants, and textures, using temp registers and writing output registers]
Software (DirectX, OpenGL) and hardware slowly became more general purpose…

GeForce 8800 (G80): Unified Compute Architecture
• Unified processor types
• Unified access to memory structures
• DirectX 10 & SM 4.0
[Diagram: G80 block diagram. Host feeds the Input Assembler and Setup/Rstr/ZCull; Vtx, Geom, and Pixel thread issue units dispatch work to a unified array of thread processors (SPs) grouped with texture (TF) units and L1 caches, backed by L2 caches and frame buffer (FB) partitions]

CUDA C Programming

Serial C Code:

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    // Invoke serial SAXPY kernel
    saxpy_serial(n, 2.0, x, y);
Parallel C Code:

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }

    // Invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
[Chart: peak double-precision GFLOPS per GPU generation, June 2007 to June 2012]

2006: G80
• Unified graphics/compute architecture
• IEEE math
• CUDA introduced at SC'06

Tesla
• Double precision FP

Fermi
• ECC
• High DP FP perf
• C++ support
KEPLER
• SMX (3x power efficiency)
• Hyper-Q (programmability and application coverage)
• Dynamic Parallelism

Hyper-Q: Easier Speedup of Legacy MPI Apps
FERMI: 1 work queue
KEPLER: 32 concurrent work queues

[Chart: CP2K quantum chemistry, strong scaling with 864 water molecules; speedup vs. dual-CPU node against number of GPUs (0-20). K20 with Hyper-Q (16 MPI ranks per node) reaches roughly 2.5x the speedup of K20 without Hyper-Q (1 MPI rank per node)]

Dynamic Parallelism: Simpler Code, More General, Higher Performance

[Diagram: with Fermi, only the CPU can launch GPU work; with Kepler, kernels running on the GPU can launch further kernels]
Code Size Cut by 2x, 2x Performance

Titan: World's #1 Supercomputer
• 18,688 Tesla K20X GPUs
• World leading efficiency
• 27 Petaflops

Kepler GPU Performance Results
Dual-socket comparison: CPU+GPU node vs. dual-CPU node
CPU = 8-core Sandy Bridge E5-2687W, 3.10 GHz

Speedup vs. dual-CPU node:
Application   Single-CPU + K20X   Single-CPU + M2090   Dual-CPU
WS-LSMS       16.70               7.00                 1.00
Chroma        10.20               8.00                 1.00
SPECFEM3D     8.85                5.41                 1.00
AMBER         7.17                3.46                 1.00
NAMD          2.73                1.71                 1.00

The Future

The Future of HPC Programming
Computers are not getting faster… just wider
Need to structure all HPC apps as throughput problems
Locality within nodes much more important
Need to expose locality (programming model) & explicitly manage memory hierarchy (compiler, runtime, autotuner)
How can we enable programmers to code for future processors in a portable way?

OpenACC Directives: Portable Parallelism
OpenMP (CPU):

    main() {
        double pi = 0.0; long i;
    #pragma omp parallel for reduction(+:pi)
        for (i = 0; i < N; i++) {
            double t = (double)((i + 0.5) / N);
            pi += 4.0 / (1.0 + t*t);
        }
    }

OpenACC Directives (CPU or GPU):

    main() {
        double pi = 0.0; long i;
    #pragma acc parallel loop reduction(+:pi)
        for (i = 0; i < N; i++) {
            double t = (double)((i + 0.5) / N);
            pi += 4.0 / (1.0 + t*t);
        }
    }

. Integration
. Further concentration on locality (both HW and SW)
. Reducing overheads
. Continued convergence with consumer technology

Echelon: NVIDIA's Extreme-Scale Computing Project
DARPA UHPC Program, Targeting 2018; Fast Forward Program

Echelon Compute Node & System

Key architectural features:
• Malleable memory hierarchy
• Hierarchical register files
• Hierarchical thread scheduling
• Place coherency/consistency
• Temporal SIMT & scalarization
• PGAS memory
• HW accelerated queues
• Active messages
• AMOs everywhere
• Collective engines
• Streamlined LOC/TOC interaction

[Diagram: node with DRAM DIMMs, NV RAM, and DRAM stacks; 1024 x 256KB L2 banks on a NoC with MC and NIC; latency-optimized cores (LOC0-LOC7) alongside throughput-optimized SMs (SM0-SM255); system interconnect. Node: 16 TF, 2 TB/s, 512+ GB. Cabinet: 4 PF, 128 TB. Echelon system: up to 1 EF]

Summary
. GPU accelerated computing has come a long way in a very short time
. Has become much easier to program and more general purpose
. Aligned with technology trends, supported by consumer markets
. Future evolution is about:
  • Integration
  • Increased generality – efficient on any code with high parallelism
  • Reducing energy, reducing overheads
. This is simply how computers will be built

Thank You