How Powerful are GPUs?

Pat Hanrahan, Computer Science Department, Stanford University

Computer Forum 2007

Modern

Application

Command

Geometry

Rasterization

Texture

Fragment

Display

A Pitch from 5 Years Ago …

Cinematic games and media drive GPU market
Current GPUs faster than CPUs (at graphics)
Gap between the GPU and the CPU increasing
Why?
Efficiently use VLSI resources
Programmable GPUs ≈ Stream processors
Many applications map to
Therefore, a $50 high-performance, massively parallel computer will soon ship with every PC

Pat Hanrahan, circa 2002-2005

What Happened?

Now
AMD and Intel gave up on sequential CPUs with high clock rates and went multi-core (2-4)
Gap between GPU and CPU stabilized
GPUs are data parallel (64-128 cores)
DX10 mandates unified graphics pipeline
GPGPU – many algorithms implemented

Future
Two main types of processors
CPU – fast sequential processor
GPU – fast data parallel processor
Hybrid CPU/GPU

Overview

Current programmable GPUs
Performance
Programming model: Stream abstraction
Applications
How General?

Programmable GPUs

ATI R600 (X2X00)

• 80 nm process
• ~700 million transistors
• 64 4-wide unified
  ~700 MHz clock
• 512-bit GDDR memory
  GDDR3 @ 900 MHz = 115 GB/s
  GDDR4 @ 1100 MHz = 140 GB/s
• 230 Watt

R300 not R600

NVIDIA G80 (8800)

• 90 nm TSMC process
• 681 million transistors
• 480 mm^2
• 128 scalar processors
  1.3 GHz
• 384-bit GDDR memory
  GDDR3 @ 900 MHz = 86.4 GB/s
• 130 Watts
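The memory bandwidth figures on the two slides above follow from bus width times effective data rate. A minimal sketch of that arithmetic, assuming GDDR moves two transfers per pin per clock (double data rate); the function name `peak_bandwidth_gb_s` is made up for illustration:

```python
def peak_bandwidth_gb_s(bus_width_bits, clock_mhz, transfers_per_clock=2):
    """Peak memory bandwidth in GB/s, with 1 GB taken as 1e9 bytes."""
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9


# Bus widths and memory clocks as listed on the slides.
r600_gddr3 = peak_bandwidth_gb_s(512, 900)
r600_gddr4 = peak_bandwidth_gb_s(512, 1100)
g80_gddr3 = peak_bandwidth_gb_s(384, 900)

print(f"R600 GDDR3: {r600_gddr3:.1f} GB/s")
print(f"R600 GDDR4: {r600_gddr4:.1f} GB/s")
print(f"G80  GDDR3: {g80_gddr3:.1f} GB/s")
```

Under that assumption the results agree with the slide figures up to rounding.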

GeForce 8800 Series GPU

[Block diagram: Host; Input Assembler; Rasterization; Vertex Thread, Geometry Thread, and Pixel Thread issue; Thread Processor containing rows of SP, TF, and L1 blocks; L2 blocks; FB blocks]

Shader Model 4.0 Architecture

32 4-32-bit Input Parameters 64K 32-bit

Registers 32 4-32-bit

64K insts Program Textures

8 4-32-bit Output

Simple Graphics Pipeline

# c[0-3] = modelview projection (composite) matrix
# c[4-7] = modelview inverse transpose
# c[32] = eye-space light direction
# c[33] = constant eye-space half-angle vector
# c[35].x = pre-multiplied diffuse light color & diffuse mat.
# c[35].y = pre-multiplied ambient light color & diffuse mat.
# c[36] = specular color; c[38].x = specular power

DP4 o[HPOS].x, c[0], v[OPOS];          # Transform position.
DP4 o[HPOS].y, c[1], v[OPOS];
DP4 o[HPOS].z, c[2], v[OPOS];
DP4 o[HPOS].w, c[3], v[OPOS];
DP3 R0.x, c[4], v[NRML];               # Transform normal.
DP3 R0.y, c[5], v[NRML];
DP3 R0.z, c[6], v[NRML];
DP3 R1.x, c[32], R0;                   # R1.x = L DOT N'
DP3 R1.y, c[33], R0;                   # R1.y = H DOT N'
MOV R1.w, c[38].x;                     # R1.w = specular power
LIT R2, R1;                            # Compute lighting
MAD R3, c[35].x, R2.y, c[35].y;        # diffuse + ambient
MAD o[COL0].xyz, c[36], R2.z, R3;      # + specular
END
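The vertex program above can be read as three steps: transform the position, transform the normal, then evaluate a diffuse-plus-specular lighting term. A rough Python sketch of the same arithmetic follows. All function and parameter names are invented for illustration, and the `lit` helper is an assumed reading of the LIT instruction that leaves out the clamping of the specular exponent.

```python
def dot(a, b):
    """Dot product over the shorter of the two vectors (DP3 / DP4)."""
    return sum(x * y for x, y in zip(a, b))


def lit(src):
    """Assumed LIT semantics: (1, max(N.L, 0), specular term, 1)."""
    n_dot_l, n_dot_h, power = src[0], src[1], src[3]
    diffuse = max(n_dot_l, 0.0)
    specular = max(n_dot_h, 0.0) ** power if n_dot_l > 0.0 else 0.0
    return (1.0, diffuse, specular, 1.0)


def shade_vertex(mvp, normal_matrix, light_dir, half_vec,
                 diffuse, ambient, specular_color, power,
                 position, normal):
    # Four DP4s: homogeneous position = MVP * object-space position.
    hpos = tuple(dot(row, position) for row in mvp)
    # Three DP3s: eye-space normal from the inverse-transpose matrix rows.
    n = tuple(dot(row[:3], normal) for row in normal_matrix[:3])
    # DP3, DP3, MOV: pack L.N, H.N and the specular power for LIT.
    r1 = (dot(light_dir, n), dot(half_vec, n), 0.0, power)
    r2 = lit(r1)
    # MAD: diffuse * max(L.N, 0) + ambient.
    r3 = diffuse * r2[1] + ambient
    # MAD: specular color * specular term + (diffuse + ambient).
    color = tuple(c * r2[2] + r3 for c in specular_color)
    return hpos, color


identity = ((1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1))
hpos, color = shade_vertex(identity, identity, (0, 0, 1), (0, 0, 1),
                           0.5, 0.25, (1.0, 1.0, 1.0), 8.0,
                           (1.0, 2.0, 3.0, 1.0), (0.0, 0.0, 1.0))
print(hpos, color)
```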

G80 = Data Parallel Computer

[Block diagram: Host; Input Assembler; Thread Execution Manager; a row of SIMD Cores, each with its own Parallel Data Cache; Load/store; Global Memory]

G80 “core”

Each core
• 8 functional units
• SIMD 16/32 “warp”
• 8-10 stage pipeline
• Thread scheduler
• 128-512 threads/core
• 16 KB shared memory (Parallel Data Cache)

Total #threads/chip 16 * 512 = 8K

GPU Multi-threading (version 1)

Change threads each cycle (round robin)

frag1 frag2 frag3 frag4

instr1

instr2

instr3
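The round-robin scheme can be sketched as a schedule in which the core issues one instruction per cycle and moves to the next fragment thread every cycle. This is an illustrative model only; the function name and the tuple layout are assumptions, not part of the original slides.

```python
def round_robin_schedule(num_threads, num_instrs):
    """Return the issue order as (cycle, thread, instruction) tuples."""
    schedule = []
    cycle = 0
    for instr in range(1, num_instrs + 1):
        # Every thread issues the same instruction before any thread
        # moves on to the next one.
        for thread in range(1, num_threads + 1):
            schedule.append((cycle, f"frag{thread}", f"instr{instr}"))
            cycle += 1
    return schedule


schedule = round_robin_schedule(num_threads=4, num_instrs=3)
for entry in schedule:
    print(entry)
```

In this model a thread's consecutive instructions are `num_threads` cycles apart, so an operation whose latency is no longer than that gap never stalls the core.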

GPU Multi-threading (version 2)

Change thread after texture fetch/stall

frag1 frag2 frag3 frag4

Run until stall at texture fetch (multiple instructions)
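A small simulation can show how switching only at a texture fetch hides fetch latency when enough threads are resident. The model below is a sketch under simple assumptions: one instruction issues per cycle, every texture fetch takes a fixed `fetch_latency`, and thread switches are free. The function name and the op labels are hypothetical.

```python
from collections import deque


def run_until_stall(programs, fetch_latency):
    """Simulate one core that switches threads only at a texture fetch.

    programs: one list of ops per thread, each op "alu" or "tex".
    Returns (total_cycles, busy_cycles).
    """
    pcs = [0] * len(programs)        # next op index per thread
    ready_at = [0] * len(programs)   # cycle at which a thread may run again
    waiting = deque(range(len(programs)))
    cycle = busy = 0
    while waiting:
        # Pick the first queued thread whose outstanding fetch has returned.
        runnable = [t for t in waiting if ready_at[t] <= cycle]
        if not runnable:
            cycle = min(ready_at[t] for t in waiting)  # core sits idle
            continue
        t = runnable[0]
        waiting.remove(t)
        # Issue one op per cycle until the thread stalls or finishes.
        while pcs[t] < len(programs[t]):
            op = programs[t][pcs[t]]
            pcs[t] += 1
            cycle += 1
            busy += 1
            if op == "tex":
                ready_at[t] = cycle + fetch_latency
                break
        if pcs[t] < len(programs[t]):
            waiting.append(t)
    return cycle, busy


program = ["alu", "alu", "tex", "alu"]
one_thread = run_until_stall([program], fetch_latency=6)
four_threads = run_until_stall([program] * 4, fetch_latency=6)
print(one_thread, four_threads)
```

With one thread the core is busy 4 of 10 cycles; with four threads it is busy 16 of 19, because other fragments run while a fetch is outstanding.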

8800GTX Peak Performance

1.3 GHz * 128 processors * 2 flop/inst (MAD instruction)

= 332.8 GFLOPS
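A peak figure of this kind is a product of clock rate, processor count, and flops per instruction. The sketch below assumes the 1.3 GHz processor clock listed on the G80 slide and one MAD (2 flops) per processor per clock, which reproduces 332.8 GFLOPS; the helper name `peak_gflops` is hypothetical.

```python
def peak_gflops(clock_ghz, processors, flops_per_inst, insts_per_clock=1):
    """Theoretical peak throughput in GFLOPS."""
    return clock_ghz * processors * flops_per_inst * insts_per_clock


# 128 scalar processors at 1.3 GHz, each retiring one MAD per clock.
g80_mad_peak = peak_gflops(clock_ghz=1.3, processors=128, flops_per_inst=2)
print(f"{g80_mad_peak:.1f} GFLOPS")
```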

Instructions Issue Rate

http://graphics.stanford.edu/projects/gpubench/

ATI X1900XTX NVIDIA 7900GTX

Instructions Issue Rate

http://graphics.stanford.edu/projects/gpubench/

NVIDIA 7900GTX NVIDIA 8800GTX

Measured BLAS Performance

SAXPY
• X1900 (DX9): 6 GFlops
• X1900 (CTM): 6 GFlops
• 8800GTX (DX9): 12 GFlops

SGEMV
• X1900 (DX9): 4 GFlops
• X1900 (CTM): 6 GFlops
• 8800GTX (DX9): 14 GFlops

SGEMM
• X1900 (DX9): 30 GFlops
• X1900 (CTM): 120 GFlops
• 8800GTX (DX9): 105 GFlops
• 3 GHz Core 2: 40 GFlops
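For reference, the three BLAS routines measured above can be written in a few lines each. The sketch below uses plain Python lists; the `flops_per_element` helper is a hypothetical, rough measure of arithmetic intensity (flops per data element touched) for square problems of size n. One likely reading of the numbers is that SGEMM reaches much higher rates because it does far more arithmetic per element than SAXPY or SGEMV, but that interpretation is an assumption here, not a claim from the slides.

```python
def saxpy(alpha, x, y):
    """y := alpha * x + y (2n flops over about 3n element accesses)."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def sgemv(a, x):
    """y := A x (2n^2 flops over about n^2 matrix elements)."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def sgemm(a, b):
    """C := A B (2n^3 flops over about 3n^2 matrix elements)."""
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols]
            for row in a]


def flops_per_element(routine, n):
    """Rough arithmetic intensity for an n-sized square problem."""
    flops = {"saxpy": 2 * n, "sgemv": 2 * n * n, "sgemm": 2 * n ** 3}
    data = {"saxpy": 3 * n, "sgemv": n * n + 2 * n, "sgemm": 3 * n * n}
    return flops[routine] / data[routine]


for name in ("saxpy", "sgemv", "sgemm"):
    print(name, round(flops_per_element(name, 1024), 2))
```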

Programming Abstractions

Approach I

Run application using graphics library

Graphics library-based programming models
• NVIDIA’s Cg
• Microsoft’s HLSL
• OpenGL Shading Language
• RapidMind Sh [McCool et al. 2004]

Approach II

Map application to parallel computer

Communicating sequential processes (CSP)
• Threads: pthreads, Occam, UPC, …
• Message passing: MPI

Data parallel programming
• APL, SETL, S, Fortran90, …
• C* (*Lisp), NESL, …

Stream languages
• StreamIt, StreamC/KernelC
• MS Accelerator, CUDA, DPVM, PeakStream

Stream Programming Environment

Collections stored in memory
• Multidimensional arrays (stencils)
• Graphs and meshes (topology)

Data parallel operators
• Application: map
• Reductions: scan, reduce (fold)
• Communication: send, sort, gather, scatter
• Filter (|O|<|I|) and generate (|O|>|I|)
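The data parallel operators listed above can be illustrated with sequential Python stand-ins. This is only a sketch of what each operator computes, not how a GPU would execute it; every function name here is chosen for illustration.

```python
from functools import reduce
from itertools import accumulate
import operator


def stream_map(kernel, *streams):
    """Application: apply a kernel to corresponding elements."""
    return [kernel(*elems) for elems in zip(*streams)]


def stream_reduce(op, stream, init):
    """Reduction (fold): combine all elements into one value."""
    return reduce(op, stream, init)


def stream_scan(op, stream):
    """Inclusive prefix scan: running result of the reduction."""
    return list(accumulate(stream, op))


def gather(stream, indices):
    """Indexed read: out[k] = stream[indices[k]]."""
    return [stream[i] for i in indices]


def scatter(values, indices, size, fill=0):
    """Indexed write: out[indices[k]] = values[k]."""
    out = [fill] * size
    for value, i in zip(values, indices):
        out[i] = value
    return out


def stream_filter(pred, stream):
    """Filter: output is no longer than the input."""
    return [x for x in stream if pred(x)]


def generate(expand, stream):
    """Generate: each input may produce several outputs."""
    return [y for x in stream for y in expand(x)]


data = [1, 2, 3]
print(stream_map(operator.add, data, [10, 20, 30]))
print(stream_reduce(operator.add, data, 0), stream_scan(operator.add, data))
print(gather(data, [2, 0]), scatter([7, 8], [2, 0], size=3))
print(stream_filter(lambda x: x % 2 == 1, data), generate(lambda x: [x, x], data))
```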

Brook

Ian Buck PhD Thesis Stanford University

Brook for GPUs: Stream computing on graphics hardware, I. Buck, T. Foley, D. Horn, J. Sugarman, K. Fatahalian, M. Houston, P. Hanrahan, SIGGRAPH 2004

Brook Example

kernel void foo ( float a<>, float b<>, out float result<> ) {
  result = a + b;
}

float a<100>;
float b<100>;
float c<100>;

foo(a,b,c);

// Equivalent to:
for (i=0; i<100; i++)
  c[i] = a[i]+b[i];

Classical N-Body Simulation

Stellar dynamics
• Gravitational acceleration
• Gravitational accel. + jerk

Molecular dynamics
• Implicit solvent models
• Lennard-Jones, Coulomb
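The simplest of these kernels, gravitational acceleration, is an all-pairs sum: each body accumulates the pull of every other body. A minimal sequential sketch follows, assuming units where the gravitational constant defaults to 1 and an optional softening length; the names are illustrative and this is not the implementation used in the work described here.

```python
import math


def gravitational_accelerations(positions, masses, g=1.0, softening=0.0):
    """All-pairs (O(N^2)) acceleration on each body.

    positions: list of (x, y, z) tuples; masses: list of floats.
    """
    accels = []
    for i, pi in enumerate(positions):
        ax = ay = az = 0.0
        for j, pj in enumerate(positions):
            if i == j:
                continue
            dx = pj[0] - pi[0]
            dy = pj[1] - pi[1]
            dz = pj[2] - pi[2]
            r2 = dx * dx + dy * dy + dz * dz + softening * softening
            # a_i += G * m_j * (r_j - r_i) / |r_j - r_i|^3
            scale = g * masses[j] / (r2 * math.sqrt(r2))
            ax += scale * dx
            ay += scale * dy
            az += scale * dz
        accels.append((ax, ay, az))
    return accels


accels = gravitational_accelerations([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
                                     [1.0, 4.0])
print(accels)
```

The outer loop is independent per body, which is what makes this computation a natural fit for a data parallel map over bodies.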

Folding@Home Performance

Vijay Pande Group

GROMACS on Brook

GPU : CPU core = 40:1

CPU: 3.0 GHz P4
GPU: ATI X1900X

Current Statistics: March 19, 2007

Client type       Current TFLOPS*   Current Processors
Windows           150               157457
Mac OS X/PPC      7                 8710
Mac OS X/Intel    7                 2520
Linux             34                24639
GPU               40                682
PS/3              26                877
Total             223               1824132

*TFLOPS figures are actual values from cores, not peak values
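Dividing each row's TFLOPS by its processor count gives a rough per-processor throughput, which makes the difference between client types easier to see. The sketch below copies the per-client rows from the table above; the dictionary layout and function name are illustrative choices.

```python
# client type: (TFLOPS, processors), copied from the statistics table
stats = {
    "Windows": (150, 157457),
    "Mac OS X/PPC": (7, 8710),
    "Mac OS X/Intel": (7, 2520),
    "Linux": (34, 24639),
    "GPU": (40, 682),
    "PS/3": (26, 877),
}


def gflops_per_processor(tflops, processors):
    """Average sustained GFLOPS contributed by one processor."""
    return tflops * 1000.0 / processors


per_client = {name: gflops_per_processor(*row) for name, row in stats.items()}
for name, value in per_client.items():
    print(f"{name:15s} {value:8.2f} GFLOPS/processor")
```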

Folding@Home GPU Cluster

25 nodes
• Nforce4 SLI
• Dual core Opteron
• 2x ATI X1900XTX
• Linux

5 TFlops of folding “power”

Not actual machine

Future

Summary

Cinematic games and media drive GPU market
GPU evolving into a high throughput processor
• “Data parallel multi-threaded machine”
Many applications map to GPUs
• Processor of the future likely to be a CPU/GPU
• Small number of traditional CPU cores
• Large number of GPU cores

Opportunities

Current hardware not optimal
• Incredible opportunity for architectural innovation

Current software environment immature
• Incredible opportunity for reinventing software, programming environments and languages

Acknowledgements

Bill Dally, Ian Buck, Eric Darve, Mattan Erez, Vijay Pande, Kayvon Fatahalian, Bill Mark, Tim Foley, John Owens, Daniel Horn, Kurt Akeley, Michael Houston, Mark Horowitz, Jeremy Sugarman

Funding: DARPA, DOE, ATI, IBM, NVIDIA, SONY

Questions?
