How Powerful are GPUs?
Pat Hanrahan Computer Science Department Stanford University
Computer Forum 2007
Modern Graphics Pipeline
Application
Command
Geometry
Rasterization
Texture
Fragment
Display
A Pitch from 5 Years Ago …
Cinematic games and media drive GPU market
Current GPUs faster than CPUs (at graphics)
Gap between the GPU and the CPU increasing
Why? Efficiently use VLSI resources
Programmable GPUs ≈ Stream processors
Many applications map to stream processing
Therefore, a $50 high-performance, massively parallel computer will soon ship with every PC
Pat Hanrahan, circa 2002-2005
What Happened?
Now
  AMD and Intel gave up on sequential CPUs with high clock rates and went multi-core (2-4)
  Gap between GPU and CPU stabilized
  GPUs are data parallel (64-128 cores)
  DX10 mandates unified graphics pipeline
  GPGPU – many algorithms implemented
Future
  Two main types of processors
    CPU – fast sequential processor
    GPU – fast data parallel processor
  Hybrid CPU/GPU
Overview
Current programmable GPUs
Performance
Programming model: Stream abstraction
Applications
How General?
Programmable GPUs
ATI R600 (X2X00)
80 nm process
~700 million transistors
64 4-wide unified shaders
~700 MHz clock
512-bit GDDR memory
GDDR3 @ 900 MHz = 115 GB/s
GDDR4 @ 1100 MHz = 140 GB/s
230 Watts
(Image: R300, not R600)
NVIDIA G80 (8800)
90 nm TSMC process
681 million transistors
480 mm^2
128 scalar processors
1.3 GHz clock rate
384-bit GDDR memory
GDDR3 @ 900 MHz = 86.4 GB/s
130 Watts
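The bandwidth figures on the two slides above follow from bus width and memory clock. A minimal sketch of that arithmetic, assuming GDDR moves data on both clock edges (two transfers per clock) and 1 GB = 10^9 bytes; the function name `peak_bandwidth_gb_s` is illustrative and not from the talk:

```python
def peak_bandwidth_gb_s(bus_width_bits, mem_clock_mhz, transfers_per_clock=2):
    """Peak memory bandwidth in GB/s (1 GB = 1e9 bytes).

    transfers_per_clock=2 is an assumption: double data rate signalling.
    """
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = mem_clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9


r600_gddr3 = peak_bandwidth_gb_s(512, 900)    # ATI R600 with GDDR3
r600_gddr4 = peak_bandwidth_gb_s(512, 1100)   # ATI R600 with GDDR4
g80_gddr3 = peak_bandwidth_gb_s(384, 900)     # NVIDIA G80 with GDDR3

for label, value in [("R600 GDDR3", r600_gddr3),
                     ("R600 GDDR4", r600_gddr4),
                     ("G80 GDDR3", g80_gddr3)]:
    print(f"{label}: {value:.1f} GB/s")
```

Under these assumptions the results are 115.2, 140.8, and 86.4 GB/s, which match the slide values after rounding.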
GeForce 8800 Series GPU
[Block diagram: Host; Input Assembler; Rasterization; Vertex Thread, Geometry Thread, Pixel Thread; Thread Processor with SP ×16, TF ×8, L1 ×8; L2 ×6; FB ×6]
Shader Model 4.0 Architecture
Input: 32 4×32-bit
Parameters: 64K 32-bit
Registers: 32 4×32-bit
Program: 64K insts
Textures
Output: 8 4×32-bit
Simple Graphics Pipeline
# c[0-3]   = modelview projection (composite) matrix
# c[4-7]   = modelview inverse transpose
# c[32]    = eye-space light direction
# c[33]    = constant eye-space half-angle vector
# c[35].x  = pre-multiplied diffuse light color & diffuse mat.
# c[35].y  = pre-multiplied ambient light color & diffuse mat.
# c[36]    = specular color; c[38].x = specular power
DP4 o[HPOS].x, c[0], v[OPOS];        # Transform position.
DP4 o[HPOS].y, c[1], v[OPOS];
DP4 o[HPOS].z, c[2], v[OPOS];
DP4 o[HPOS].w, c[3], v[OPOS];
DP3 R0.x, c[4], v[NRML];             # Transform normal.
DP3 R0.y, c[5], v[NRML];
DP3 R0.z, c[6], v[NRML];
DP3 R1.x, c[32], R0;                 # R1.x = L DOT N'
DP3 R1.y, c[33], R0;                 # R1.y = H DOT N'
MOV R1.w, c[38].x;                   # R1.w = specular power
LIT R2, R1;                          # Compute lighting
MAD R3, c[35].x, R2.y, c[35].y;      # diffuse + ambient
MAD o[COL0].xyz, c[36], R2.z, R3;    # + specular
END
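For readers unfamiliar with vertex-program assembly, the program above can be sketched in Python. This is an illustrative reading, not the hardware semantics: `lit` is a simplified model of the LIT instruction (it omits the clamping of the specular exponent), and the names `dot`, `lit`, and `shade_vertex`, along with representing the constant bank `c` as a dictionary of 4-tuples, are assumptions made for this example.

```python
def dot(a, b, n):
    """n-component dot product (DP3 uses n=3, DP4 uses n=4)."""
    return sum(a[i] * b[i] for i in range(n))


def lit(n_dot_l, n_dot_h, power):
    """Simplified LIT: returns (1, diffuse, specular, 1)."""
    diffuse = max(n_dot_l, 0.0)
    # Specular is only lit when the surface faces the light.
    specular = max(n_dot_h, 0.0) ** power if n_dot_l > 0.0 else 0.0
    return (1.0, diffuse, specular, 1.0)


def shade_vertex(c, opos, nrml):
    """Mirror the assembly: transform position and normal, then light."""
    hpos = tuple(dot(c[i], opos, 4) for i in range(4))      # four DP4s
    n = tuple(dot(c[4 + i], nrml, 3) for i in range(3))     # three DP3s
    n_dot_l = dot(c[32], n, 3)                               # R1.x
    n_dot_h = dot(c[33], n, 3)                               # R1.y
    _, diffuse, specular, _ = lit(n_dot_l, n_dot_h, c[38][0])
    r3 = c[35][0] * diffuse + c[35][1]                       # diffuse + ambient
    col0 = tuple(c[36][k] * specular + r3 for k in range(3)) # + specular
    return hpos, col0


# Hypothetical constants: identity transforms, light and half vector along +z.
identity = [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)]
c = {i: identity[i] for i in range(4)}
c.update({4 + i: identity[i] for i in range(4)})
c.update({32: (0, 0, 1, 0), 33: (0, 0, 1, 0),
          35: (0.5, 0.1, 0, 0), 36: (1, 1, 1, 1), 38: (8.0, 0, 0, 0)})

print(shade_vertex(c, (1.0, 2.0, 3.0, 1.0), (0.0, 0.0, 1.0)))
```

Each vertex is shaded independently of every other vertex, which is the property that lets the hardware run many copies of the program in parallel.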
G80 = Data Parallel Computer
Host
Input Assembler
Thread Execution Manager
8 × SIMD Core, each with a Parallel Data Cache
Load/store
Global Memory
G80 “core”
Each core
8 functional units
SIMD 16/32 “warp”
8-10 stage pipeline
Thread scheduler
128-512 threads/core
16 KB shared memory
Parallel Data Cache
Total #threads/chip 16 * 512 = 8K
GPU Multi-threading (version 1)
Change threads each cycle (round robin)
frag1 frag2 frag3 frag4
instr1
instr2
instr3
GPU Multi-threading (version 2)
Change thread after texture fetch/stall
frag1 frag2 frag3 frag4
Run until stall at texture fetch (multiple instructions)
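The two multi-threading policies differ only in when the scheduler moves to the next fragment. A minimal sketch of the resulting issue order, assuming a toy program given as a list of instruction names in which `"tex"` marks a texture fetch; it models ordering only, not latency or pipeline timing, and both function names are made up for this example:

```python
def round_robin(num_frags, program):
    """Version 1: issue one instruction per fragment, then advance."""
    return [(frag, instr) for instr in program for frag in range(num_frags)]


def switch_on_stall(num_frags, program, stall_op="tex"):
    """Version 2: a fragment runs until it issues a texture fetch."""
    pc = [0] * num_frags              # per-fragment program counter
    order = []
    while any(p < len(program) for p in pc):
        for frag in range(num_frags):
            while pc[frag] < len(program):
                instr = program[pc[frag]]
                order.append((frag, instr))
                pc[frag] += 1
                if instr == stall_op:
                    break             # stalled on the fetch: switch threads
    return order


program = ["mul", "tex", "add"]
print(round_robin(2, program))
print(switch_on_stall(2, program))
```

In version 2 each fragment issues several instructions back to back, so fewer thread switches are needed to cover the same texture latency.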
8800GTX Peak Performance
1.3 GHz * 128 processors * 2 flop/inst
MAD instruction
= 332.8 GFLOPS
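The 332.8 GFLOPS figure can be reproduced from the G80 numbers listed earlier in the talk. A minimal sketch, assuming the 1.3 GHz clock, 128 scalar processors, and one MAD (counted as two floating-point operations) issued per processor per clock; `peak_gflops` is an illustrative name:

```python
def peak_gflops(clock_ghz, processors, flops_per_instr=2):
    """Peak rate when every processor issues one instruction per clock."""
    return clock_ghz * processors * flops_per_instr


g80_peak = peak_gflops(1.3, 128)   # MAD = multiply + add = 2 flops
print(f"G80 peak: {g80_peak:.1f} GFLOPS")
```

This is an upper bound: it assumes every issued instruction is a MAD and that no cycle is lost to memory stalls.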
Instructions Issue Rate
http://graphics.stanford.edu/projects/gpubench/
ATI X1900XTX NVIDIA 7900GTX
Instructions Issue Rate
http://graphics.stanford.edu/projects/gpubench/
NVIDIA 7900GTX NVIDIA 8800GTX
Measured BLAS Performance
         X1900 (DX9)   X1900 (CTM)   8800GTX (DX9)
SAXPY    6 GFlops      6 GFlops      12 GFlops
SGEMV    4 GFlops      6 GFlops      14 GFlops
SGEMM    30 GFlops     120 GFlops    105 GFlops
3 GHz Core 2: 40 GFlops
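SAXPY's low rate relative to SGEMM is consistent with a memory-bandwidth limit. A rough sketch, assuming each element costs two 4-byte reads, one 4-byte write, and two flops, with no cache reuse, and using the 86.4 GB/s G80 bandwidth quoted earlier; the helper names are illustrative:

```python
def saxpy(alpha, x, y):
    """y <- alpha * x + y, returned as a new list."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def bandwidth_bound_gflops(bandwidth_gb_s, bytes_per_elem=12, flops_per_elem=2):
    """Upper bound on GFLOPS when every operand streams from memory."""
    return bandwidth_gb_s / bytes_per_elem * flops_per_elem


print(saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
print(f"{bandwidth_bound_gflops(86.4):.1f} GFLOPS")
```

Under these assumptions the bound is about 14.4 GFLOPS, close to the 12 GFlops measured on the 8800GTX. SGEMM reuses each loaded value many times, which is why it can run far above this bound.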
Programming Abstractions
Approach I
Run application using graphics library
Graphics library-based programming models
  NVIDIA’s Cg
  Microsoft’s HLSL
  OpenGL Shading Language
  RapidMind Sh [McCool et al. 2004]
Approach II
Map application to parallel computer
Communicating sequential processes (CSP)
  Threads: pthreads, Occam, UPC, …
  Message passing: MPI
Data parallel programming
  APL, SETL, S, Fortran90, …
  C* (*Lisp), NESL, …
Stream languages
  StreamIt, StreamC/KernelC
  MS Accelerator, CUDA, DPVM, PeakStream
Stream Programming Environment
Collections stored in memory
  Multidimensional arrays (stencils)
  Graphs and meshes (topology)
Data parallel operators
  Application: map
  Reductions: scan, reduce (fold)
  Communication: send, sort, gather, scatter
  Filter (|O|<|I|) and generate (|O|>|I|)
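The operators listed above have simple sequential definitions. A minimal sketch in Python using only the standard library; the `gather` and `scatter` helpers and all variable names are illustrative and are not taken from any stream language in the talk:

```python
from functools import reduce
from itertools import accumulate
import operator

a = [3, 1, 4, 1, 5]

# Application: map applies a function to every element independently.
doubled = list(map(lambda v: v * 2, a))

# Reductions: reduce folds to one value; scan keeps the running results.
total = reduce(operator.add, a)
prefix_sums = list(accumulate(a, operator.add))   # inclusive scan


def gather(src, indices):
    """Indexed read: out[i] = src[indices[i]]."""
    return [src[i] for i in indices]


def scatter(src, indices, size, fill=0):
    """Indexed write: out[indices[i]] = src[i]."""
    out = [fill] * size
    for value, i in zip(src, indices):
        out[i] = value
    return out


# Filter produces fewer outputs than inputs; generate produces more.
filtered = [v for v in a if v > 2]
generated = [w for v in a for w in (v, v)]

print(doubled, total, prefix_sums)
print(gather(a, [4, 0, 2]), scatter([7, 8], [3, 1], 5))
print(filtered, generated)
```

A data parallel implementation computes the same results, but map, gather, and scatter run one element per thread, and reduce and scan are built as tree-structured passes.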
Brook
Ian Buck PhD Thesis Stanford University
Brook for GPUs: Stream computing on graphics hardware, I. Buck, T. Foley, D. Horn, J. Sugarman, K. Fatahalian, M. Houston, P. Hanrahan, SIGGRAPH 2004
Brook Example
kernel void foo ( float a<>, float b<>, out float result<> ) {
  result = a + b;
}

float a<100>;
float b<100>;
float c<100>;

foo(a,b,c);

// equivalent C loop:
for (i=0; i<100; i++)
  c[i] = a[i]+b[i];
Classical N-Body Simulation
Stellar dynamics
  Gravitational acceleration
  Gravitational accel. + jerk
Molecular dynamics
  Implicit solvent models
  Lennard-Jones
  Coulomb
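The gravitational case reduces to an all-pairs sum in which each body's acceleration can be computed independently, which is what makes it map well to a data parallel machine. A minimal O(N²) sketch in Python, not the Brook kernel used in the talk; the gravitational constant `G`, the softening length `eps`, and the function name are assumptions for illustration:

```python
import math


def accelerations(pos, mass, G=1.0, eps=1e-3):
    """Direct-sum gravitational acceleration on every body.

    pos  : list of (x, y, z) positions
    mass : list of masses, same length as pos
    eps  : softening length (an assumption) that avoids division by zero
    """
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):               # one independent task per body
        for j in range(n):
            if i == j:
                continue
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = sum(c * c for c in d) + eps * eps
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += G * mass[j] * d[k] * inv_r3
    return acc


bodies = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
print(accelerations(bodies, [1.0, 1.0, 1.0]))
```

On a GPU the outer loop becomes the parallel dimension: one thread per body, each running the inner loop over all other bodies.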
Folding@Home Performance
Vijay Pande Group
GROMACS on Brook
GPU : CPU core = 40:1
CPU: 3.0 GHz P4
GPU: ATI X1900X
Current Statistics: March 19, 2007
Client type      Current TFLOPS*   Current Processors
Windows          150               157457
Mac OS X/PPC     7                 8710
Mac OS X/Intel   7                 2520
Linux            34                24639
GPU              40                682
PS/3             26                877
Total            223               1824132
*TFLOPs is actual flops from software cores, not peak values
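Dividing each client type's throughput by its processor count makes the per-device difference explicit. A small sketch using only the per-client rows of the table above (the Total row is left out); the dictionary layout and function name are illustrative:

```python
# (current TFLOPS, current processors) per client type, copied from the table
STATS = {
    "Windows": (150, 157457),
    "Mac OS X/PPC": (7, 8710),
    "Mac OS X/Intel": (7, 2520),
    "Linux": (34, 24639),
    "GPU": (40, 682),
    "PS/3": (26, 877),
}


def gflops_per_processor(tflops, processors):
    """Average sustained GFLOPS contributed by one processor."""
    return tflops * 1000.0 / processors


per_client = {name: gflops_per_processor(t, p) for name, (t, p) in STATS.items()}

for name, g in sorted(per_client.items(), key=lambda kv: -kv[1]):
    print(f"{name:15s} {g:8.2f} GFLOPS per processor")
```

On these figures a GPU client averages roughly 59 GFLOPS, against just under 1 GFLOPS for a Windows client.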
Folding@Home GPU Cluster
25 nodes
Nforce4 SLI
Dual core Opteron
2x ATI X1900XTX
Linux
5 TFlops of folding “power”
Not actual machine
Future
Summary
Cinematic games and media drive GPU market
GPU evolving into a high throughput processor
  “Data parallel multi-threaded machine”
Many applications map to GPUs
Processor of the future likely to be a CPU/GPU
  Small number of traditional CPU cores
  Large number of GPU cores
Opportunities
Current hardware not optimal
  Incredible opportunity for architectural innovation
Current software environment immature
  Incredible opportunity for reinventing parallel computing software, programming environments and languages
Acknowledgements
Bill Dally Ian Buck Eric Darve Mattan Erez Vijay Pande Kayvon Fatahalian Bill Mark Tim Foley John Owens Daniel Horn Kurt Akeley Michael Houston Mark Horowitz Jeremy Sugarman
Funding: DARPA, DOE, ATI, IBM, NVIDIA, SONY
Questions?