The Evolution of GPU Computing

Steve Scott, CTO Tesla
SC12, Salt Lake City

Early 3D

[Figure: early 3D pipeline. Application and scene management run on the host; geometry, rasterization, and pixel processing run on the GPU, with ROP/FBI/display units writing to frame buffer memory.]

(Pipeline figure © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign)

Moving Toward Programmability

Timeline, 1998–2004:
• DirectX 5: Riva 128
• DirectX 6 (multitexturing): Riva TNT
• DirectX 7 (hardware T&L, TextureStageState): GeForce 256
• DirectX 8 (SM 1.x): GeForce 3
• DirectX 9 (SM 2.0) and Cg: GeForceFX
• DirectX 9.0c (SM 3.0): GeForce 6

Showcase titles across the period: Half-Life, Quake 3, Giants, Halo, Far Cry, UE3, progressing from no lighting, to per-vertex lighting, to per-pixel lighting. (All images © their respective owners.)

(© NVIDIA Corporation 2006; Unreal image © Epic)

Programmable: GeForceFX (2002)

. Vertex and fragment operations specified in a small (macro) assembly language (separate processors)
. User-specified mapping of input data to operations
. Limited ability to use intermediate computed values to index input data (textures and vertex uniforms)

ADDR R0.xyz, eyePosition.xyzx, -f[TEX0].xyzx;
DP3R R0.w, R0.xyzx, R0.xyzx;
RSQR R0.w, R0.w;
MULR R0.xyz, R0.w, R0.xyzx;
ADDR R1.xyz, lightPosition.xyzx, -f[TEX0].xyzx;
DP3R R0.w, R1.xyzx, R1.xyzx;
RSQR R0.w, R0.w;
MADR R0.xyz, R0.w, R1.xyzx, R0.xyzx;
MULR R1.xyz, R0.w, R1.xyzx;
DP3R R0.w, R1.xyzx, f[TEX1].xyzx;
MAXR R0.w, R0.w, {0}.x;

[Figure: instruction dataflow, with Input 0/1/2 feeding an OP that writes Temp 0/1/2.]

The Pioneers: Early GPGPU (2002)
www.gpgpu.org

. Ray Tracing on Programmable Graphics Hardware, Purcell et al.
. PDEs in Graphics Hardware, Strzodka, Rumpf
. Fast Matrix Multiplies using Graphics Hardware, Larsen, McAllister
. Using Modern Graphics Architectures for General-Purpose Computing: A Framework and Analysis, Thompson et al.

Early Raytracing

This was not easy…

Challenges with Early GPGPU Programming

. HW challenges
  • Limited addressing modes
  • Limited communication: inter-pixel, scatter
  • Lack of integer & bit ops
  • No branching
. SW challenges
  • Graphics APIs (DirectX, OpenGL)
  • Very limited GPU computing ecosystem
  • Distinct vertex and fragment processors

[Figure: early fragment program model. Input registers, constants, and textures feed the fragment program, which uses temp registers and writes to output registers.]

Software (DirectX, OpenGL) and hardware slowly became more general purpose...

GeForce 8800 (G80): Unified Compute Architecture
• Unified processor types
• Unified access to memory structures
• DirectX 10 & SM 4.0

[Figure: G80 block diagram. The host feeds an input assembler and setup/raster/ZCull unit; vertex, geometry, and pixel thread issue units dispatch work onto a unified array of streaming processors (SP) with texture (TF) units and L1 caches, backed by L2 caches and frame buffer (FB) partitions, all managed by a thread processor.]

CUDA C Programming

Serial C code:

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

Parallel C code:

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
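To show how the kernel above is typically driven, here is a minimal host-side sketch; the wrapper name saxpy_on_gpu is hypothetical and error checking is omitted, so treat it as an illustration rather than code from the talk.

#include <cuda_runtime.h>

// Hypothetical host driver for the saxpy_parallel kernel shown above:
// allocate device buffers, copy inputs over, launch, and copy the result back.
void saxpy_on_gpu(int n, float a, const float *x_h, float *y_h)
{
    float *x_d, *y_d;
    cudaMalloc(&x_d, n * sizeof(float));
    cudaMalloc(&y_d, n * sizeof(float));
    cudaMemcpy(x_d, x_h, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(y_d, y_h, n * sizeof(float), cudaMemcpyHostToDevice);

    int nblocks = (n + 255) / 256;                 // one thread per element
    saxpy_parallel<<<nblocks, 256>>>(n, a, x_d, y_d);

    cudaMemcpy(y_h, y_d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(x_d);
    cudaFree(y_d);
}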

GPU Computing Comes of Age

• 2006 – G80: unified graphics/compute architecture, IEEE math, CUDA introduced at SC'06
• Tesla: double-precision FP
• Fermi: ECC, high DP FP performance, C++ support
• Kepler: large perf/W gains, programmability enhancements

[Chart: number of GPU-accelerated systems in the Top500, June 2007 through June 2012 (y-axis 0–60).]

KEPLER

• SMX (3x power efficiency)
• Hyper-Q and Dynamic Parallelism (programmability and application coverage)

Hyper-Q: Easier Speedup of Legacy MPI Apps

FERMI: 1 work queue
KEPLER: 32 concurrent work queues

[Chart: CP2K quantum chemistry, strong scaling on 864 water molecules. Speedup vs. dual CPU (0x–20x) against number of GPUs (0–20), comparing K20 with Hyper-Q (16 MPI ranks per node) to K20 without Hyper-Q (1 MPI rank per node); with Hyper-Q the speedup is roughly 2.5x higher.]
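To make the Hyper-Q comparison concrete, a hedged sketch of the pattern it helps: many independent streams (standing in for MPI ranks sharing a GPU) each submitting their own kernels. The function name and per-stream buffers are illustrative; it reuses the saxpy_parallel kernel from the CUDA example above.

#include <cuda_runtime.h>

// Illustrative only: nstreams independent streams each launch SAXPY on their own chunk.
// On Fermi, all launches funnel through one hardware work queue and can falsely
// serialize; on Kepler, Hyper-Q's 32 queues let independent streams run concurrently.
void saxpy_multi_stream(int nstreams, int chunk, float a,
                        float **x_d, float **y_d)   // per-stream device buffers (assumed allocated)
{
    cudaStream_t streams[32];                       // Kepler exposes up to 32 work queues
    if (nstreams > 32) nstreams = 32;

    for (int s = 0; s < nstreams; ++s)
        cudaStreamCreate(&streams[s]);

    int nblocks = (chunk + 255) / 256;
    for (int s = 0; s < nstreams; ++s)
        saxpy_parallel<<<nblocks, 256, 0, streams[s]>>>(chunk, a, x_d[s], y_d[s]);

    for (int s = 0; s < nstreams; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}

This is the same effect that lets 16 MPI ranks per node share a K20 productively in the CP2K result above.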

Dynamic Parallelism: Simpler Code, More General, Higher Performance

[Diagram: CPU + Fermi GPU vs. CPU + Kepler GPU. On Fermi, every kernel launch comes from the CPU; on Kepler, the GPU can generate its own work.]

Code size cut by 2x, 2x performance
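A minimal sketch of the pattern dynamic parallelism enables; the kernel names and the refinement test are illustrative, and it assumes a device of compute capability 3.5+ compiled with relocatable device code (-rdc=true).

// Illustrative child kernel: do extra work on one region of the data.
__global__ void refine_region(float *data, int offset, int n)
{
    int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < offset + n)
        data[i] *= 0.5f;                 // stand-in for real refinement work
}

// Parent kernel decides, on the GPU, where more work is needed and launches it
// directly, instead of returning to the CPU to issue the follow-up kernels.
__global__ void refine_where_needed(float *data, const float *error,
                                    int nregions, int region_size)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < nregions && error[r] > 1e-3f) {
        int nblocks = (region_size + 255) / 256;
        refine_region<<<nblocks, 256>>>(data, r * region_size, region_size);
    }
}

Child grids launched this way finish before the parent grid completes, so the follow-up work no longer needs a round trip through the CPU.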

Titan: World's #1 Supercomputer
18,688 Tesla K20X GPUs, 27 petaflops, world-leading efficiency

Kepler GPU Performance Results

Dual-socket comparison: CPU/GPU node vs. dual-CPU node (CPU = 8-core SandyBridge E5-2687W, 3.10 GHz); speedup relative to the dual-CPU node:

Application   Dual-CPU   Single-CPU + M2090   Single-CPU + K20X
WS-LSMS       1.00       7.00                 16.70
Chroma        1.00       8.00                 10.20
SPECFEM3D     1.00       5.41                 8.85
AMBER         1.00       3.46                 7.17
NAMD          1.00       1.71                 2.73

The Future

The Future of HPC Programming

Computers are not getting faster… just wider

→ Need to structure all HPC apps as throughput problems

Locality within nodes much more important
→ Need to expose locality (programming model) & explicitly manage memory hierarchy (compiler, runtime, autotuner); a CUDA sketch of explicit locality management follows below

How can we enable programmers to code for future processors in a portable way?
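The CUDA sketch referenced above: a hedged illustration of exposing locality by explicitly staging a block's working set into on-chip shared memory. The kernel name, the block size of 256, and the stencil weights are all illustrative choices, not anything from the talk.

// Each block stages its elements (plus a one-element halo) into shared memory,
// then every thread reads its neighbors from on-chip storage instead of DRAM.
__global__ void stencil3(const float *in, float *out, int n)
{
    __shared__ float tile[256 + 2];                    // block's data plus halo cells
    int g = blockIdx.x * blockDim.x + threadIdx.x;     // global index
    int l = threadIdx.x + 1;                           // local index within the tile

    tile[l] = (g < n) ? in[g] : 0.0f;                  // stage this thread's element
    if (threadIdx.x == 0)
        tile[0] = (g > 0) ? in[g - 1] : 0.0f;          // left halo
    if (threadIdx.x == blockDim.x - 1)
        tile[l + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;  // right halo
    __syncthreads();                                   // the block's data now lives on chip

    if (g < n)
        out[g] = 0.25f * tile[l - 1] + 0.5f * tile[l] + 0.25f * tile[l + 1];
}

Launch with 256 threads per block, e.g. stencil3<<<(n + 255) / 256, 256>>>(in_d, out_d, n).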

OpenACC Directives: Portable Parallelism

OpenMP (CPU):

main() {
    double pi = 0.0; long i;
#pragma omp parallel for reduction(+:pi)
    for (i = 0; i < …

OpenACC directives (CPU + GPU):

main() {
    double pi = 0.0; long i;
#pragma acc parallel loop reduction(+:pi)
    for (i = 0; i < …
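A complete, compilable version of the OpenACC variant, under the assumption that the truncated loop computes pi by midpoint-rule integration of 4/(1+x*x), which is consistent with the reduction over pi; the value of N and the final printf are also filled in here as assumptions.

#include <stdio.h>

#define N 1000000                        /* assumed problem size */

int main(void)
{
    double pi = 0.0;
    long i;

    /* Swap this directive for "#pragma omp parallel for reduction(+:pi)"
       to target the CPU instead; the loop itself is unchanged. */
    #pragma acc parallel loop reduction(+:pi)
    for (i = 0; i < N; i++) {
        double t = (double)((i + 0.5) / N);   /* midpoint of the i-th subinterval */
        pi += 4.0 / (1.0 + t * t);            /* integrate 4/(1+x^2) over [0,1] */
    }

    printf("pi = %f\n", pi / N);
    return 0;
}

The only difference between the two versions is the directive line, which is the portability argument being made.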

. Integration
. Further concentration on locality (both HW and SW)
. Reducing overheads
. Continued convergence with consumer technology

Echelon: NVIDIA's Extreme-Scale Computing Project
DARPA UHPC Program, targeting 2018

Fast Forward Program

Echelon Compute Node & System (2018 vision)

Key architectural features:
• Malleable memory hierarchy
• Hierarchical register files
• Hierarchical thread scheduling
• Place coherency/consistency
• Temporal SIMT & scalarization
• PGAS memory
• HW accelerated queues
• Active messages
• AMOs everywhere
• Collective engines
• Streamlined LOC/TOC interaction

[Diagram: Echelon node with throughput-optimized SMs (SM0–SM255, each with cores C0–C7) and latency-optimized cores (LOC0–LOC7) on a NoC, L2 banks L2_0–L2_1023 of 256 KB each, memory controller (MC) and NIC, DRAM stacks, DRAM DIMMs, and NV RAM, connected by a system interconnect. Node 0: 16 TF, 2 TB/s, 512+ GB; Nodes 0–255 per cabinet; Cabinet 0: 4 PF, 128 TB; Echelon system: up to 1 EF.]

Summary

. GPU accelerated computing has come a long way in a very short time
. Has become much easier to program and more general purpose
. Aligned with technology trends, supported by consumer markets
. Future evolution is about:
  • Integration
  • Increased generality – efficient on any code with high parallelism
  • Reducing energy, reducing overheads

. This is simply how computers will be built

Thank You