Work factorization for efficient throughput architectures

Sylvain Collange Departamento de Ciência da Computação, ICEx Universidade Federal de Minas Gerais [email protected]

February 01, 2012

GPGPU in HPC, today

Graphics Processing Unit (GPU)
Made for video games: mass market, low unit price, amortized R&D
Inexpensive, high-performance parallel processor
2002: General-Purpose computation on GPU (GPGPU)
2012: 3 out of top 5 supercomputers use GPUs

#4 Dawning Nebulae (China)

#2 Tianhe-1A (China)

#5 Tsubame 2.0 (Japan)

GPGPU in the future?

Yesterday (2000-2010): homogeneous multi-core, discrete components: Central Processing Unit (CPU) and Graphics Processing Unit (GPU)

Today (2011-...): chip-level integration: AMD Fusion, NVIDIA Denver/Maxwell project, many embedded SoCs

Tomorrow: heterogeneous multi-core, combining latency-optimized cores, throughput-optimized cores and hardware accelerators. GPUs to blend into heterogeneous multi-core chips as throughput-optimized cores?

Outline

Background: GPU architecture requirements
From a sequential processor to a GPU
Parallel control regularity
Parallel memory locality
Parallel value locality

What do we need GPUs for?

1. 3D graphics rendering for games: complex lighting computations…

2. Computer-Aided Design workstations: complex geometry

3. GPGPU: complex synchronization, data movements

One chip to rule them all: find the common denominator

The (simplest) graphics rendering pipeline

[Pipeline diagram: Vertices → Vertex shader → Clipping, Rasterization, Attribute interpolation → Fragments → Fragment shader (reads Textures) → Z-Compare (Z-Buffer), Blending → Pixels. Shaders are programmable stages; the other stages are parametrizable.]

How much performance do we need

… to run 3DMark 11 at 50 frames/second?

Element        Per frame   Per second
Vertices       12.0M       600M
Primitives     12.6M       630M
Fragments      180M        9.0G
Instructions   14.4G       720G

Intel Core i7 2700K: 56 Ginsn/s peak.
We need to go 13x faster (720 Ginsn/s / 56 Ginsn/s ≈ 13).

Source: Damien Triolet, Hardware.fr

Constraints

Memory wall: memory speed does not increase as fast as computing speed; it is more and more difficult to hide memory latency.
Power wall: power consumption of transistors does not decrease as fast as density increases; performance is now limited by power consumption.

Latency vs. throughput

Latency: time to solution. CPUs minimize time, at the expense of power.

Throughput: quantity of tasks processed per unit of time. GPUs assume unlimited parallelism and minimize energy per operation.

Amdahl's law

Bounds speedup attainable on a parallel machine

S = 1 / ((1 − P) + P / N)

where S is the speedup, P the ratio of parallel portions and N the number of processors; (1 − P) is the time to run the sequential portions and P/N the time to run the parallel portions.
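For example, with P = 0.95 and N = 100 processors, S = 1 / (0.05 + 0.95/100) ≈ 16.8: the 5% sequential portion caps the speedup far below 100.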

[Plot: speedup S as a function of the number of available processors N.]

G. Amdahl. Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities. AFIPS 1967.

Why heterogeneous architectures?

S = 1 / ((1 − P) + P / N), with (1 − P) the time to run the sequential portions and P/N the time to run the parallel portions.

Latency-optimized multi-core (CPU): low efficiency on parallel portions, spends too many resources.

Throughput-optimized multi-core (GPU): low performance on sequential portions.

Heterogeneous multi-core (CPU+GPU): use the right tool for the right job; allows aggressive optimization for latency or for throughput.

M. Hill, M. Marty. Amdahl's law in the multicore era. IEEE Computer, 2008.

Programming model: multi-threading

1 vertex = 1 thread: computes spatial coordinates, texture coordinates…

1 fragment = 1 thread: computes color, lighting…

GPGPU: Bulk-Synchronous Parallel (BSP) model (NVIDIA CUDA, OpenCL…); threads synchronize at barriers.

The program describes the operation to apply to each element: SPMD (Single Program, Multiple Data).

L. Valiant. A bridging model for parallel computation. Comm. ACM 1990.
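As a concrete illustration of the SPMD style, here is a minimal CUDA sketch (kernel name and launch configuration are illustrative, not taken from the slides) for the scalar-vector multiplication X ← a·X used as the running example later:

// One thread per element: every thread runs the same program on its own data.
__global__ void scale(float *X, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        X[i] = a * X[i];
}

// Host-side launch: one thread per element, grouped in blocks of 256 threads.
// scale<<<(n + 255) / 256, 256>>>(d_X, a, n);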

Threading granularity

Coarse-grained threading: decouple tasks to reduce conflicts and inter-thread communication (e.g. MPI, OpenMP). Thread T0 gets X[0..3], T1 gets X[4..7], T2 gets X[8..11], T3 gets X[12..15].

Fine-grained threading: interleave tasks (e.g. OpenCL, CUDA). Thread T0 gets X[0], X[4], X[8], X[12]; T1 gets X[1], X[5], X[9], X[13]; and so on.
Exhibit locality: neighbor threads share memory.
Exhibit regularity: neighbor threads have a similar behavior.
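A hedged CUDA sketch of the two granularities applied to X ← a·X (illustrative code, not from the slides):

// Coarse-grained: thread tid owns the contiguous slice X[tid*chunk .. (tid+1)*chunk).
__global__ void scale_coarse(float *X, float a, int n) {
    int tid   = blockIdx.x * blockDim.x + threadIdx.x;
    int tcnt  = gridDim.x * blockDim.x;
    int chunk = (n + tcnt - 1) / tcnt;
    for (int i = tid * chunk; i < n && i < (tid + 1) * chunk; ++i)
        X[i] = a * X[i];
}

// Fine-grained: threads are interleaved, so neighbor threads touch neighbor
// elements (the locality and regularity the slide mentions).
__global__ void scale_fine(float *X, float a, int n) {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int tcnt = gridDim.x * blockDim.x;
    for (int i = tid; i < n; i += tcnt)
        X[i] = a * X[i];
}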

Parallel regularity

Similarity in behavior between threads.

Control regularity (example: switch(i) { case 2:… case 17:… case 21:… }): regular when all threads of the warp have i=17; irregular when they have i=21, 4, 17, 2.
Memory regularity (example: r=A[i]): regular when threads load A[8], A[9], A[10], A[11]; irregular when they load A[8], A[0], A[11], A[3].
Data regularity (example: r=a*b): regular when all threads have a=32, b=52; irregular when a=17, -5, 11, 42 and b=15, 0, -2, 52.

Outline

Background: GPU architecture requirements
From a sequential processor to a GPU
Parallel control regularity
Parallel memory locality
Parallel value locality

First step: sequential, pipelined processor

Let's build a GPU.
Our application: scalar-vector multiplication: X ← a∙X.
First idea: run each thread sequentially.

Source code:
  for i = 0 to n-1
      X[i] ← a * X[i]

Assembly:
  move i ← 0
loop:
  load t ← X[i]
  mul t ← a×t
  store X[i] ← t
  add i ← i+1
  branch i<n? loop

[Pipeline diagram: Fetch, Decode, Execute, Memory stages, with add i ← 18 in Fetch, store X[17] in Decode and mul in Execute.]

Homogeneous multi-core

Replication of the complete execution engine

Code (each thread works on its own slice of X):
  move i ← slice_begin
loop:
  load t ← X[i]
  mul t ← a×t
  store X[i] ← t
  add i ← i+1
  branch i<slice_end? loop

[Diagram: two replicated pipelines (IF, ID, EX, Memory); one runs thread T0 (add i ← 18 in fetch, store X[17] in decode), the other runs thread T1 (add i ← 50 in fetch, store X[49] in decode).]

Threads: T0 T1

Improves throughput thanks to explicit parallelism

Interleaved multi-threading

Time-multiplexing of the processing units; same software view.

Code:
  move i ← slice_begin
loop:
  load t ← X[i]
  mul t ← a×t
  store X[i] ← t
  add i ← i+1
  branch i<slice_end? loop

[Diagram: a single pipeline (Fetch, Decode, Execute, Memory) with instructions from different threads in flight: mul in Fetch and Decode, add i ← 73 in Execute, add i ← 50 in Memory.]

Threads: T0 T1 T2 T3

Hides latency thanks to explicit parallelism

Single Instruction, Multiple Threads (SIMT)

Factorization of the fetch/decode and load-store units:
Fetch 1 instruction on behalf of several threads.
Read 1 memory location and broadcast it to several registers.

[Diagram: one Fetch/Decode unit and one Memory unit shared by threads T0-T3; a load (threads 0-3) in fetch, a store (threads 0-3) in memory access, and four mul instances, one per thread, in execute.]

In NVIDIA-speak:
SIMT: Single Instruction, Multiple Threads.
Convoy of synchronized threads: warp.
Improves area/power efficiency thanks to regularity.
Consolidates memory transactions: less memory pressure.

What about SIMD?

Single Instruction Multiple Data

Source code:
  for i = 0 to n-1 step 4
      X[i..i+3] ← a * X[i..i+3]

Assembly:
loop:
  vload T ← X[i]
  vmul T ← a×T
  vstore X[i] ← T
  add i ← i+4
  branch i<n? loop

[Pipeline diagram: add i ← 20 in Fetch, vstore X[16..19] in Memory, vmul in Execute.]

Vectors, not threads: no “true” thread divergence allowed.

Flynn's taxonomy

Classification of parallel architectures

                Single Instruction   Multiple Instruction
Single Data     SISD                 MISD
Multiple Data   SIMD                 MIMD

[Diagrams: each class drawn with its Fetch (F) and Execute (X) units.]

M. Flynn. Some Computer Organizations and Their Effectiveness. IEEE TC 1972.

Flynn's taxonomy revisited
…to account for multi-threading

Resources, one per pipeline stage: Fetch F (Instruction), RF + Execute X (Data), Memory M (Address), shared among threads T0-T3.

Single resource shared by all threads:  SIMT (one F), SDMT (one X), SAMT (one M).
Multiple resources, one per thread:     MIMT, MDMT, MAMT.

Mostly orthogonal: mix and match to build your own _I_D_A_T pipeline!

Examples: conventional design points

Multi-core (most CPUs of today): MIMD(MAMT), each thread has its own F, X and M.

Short-vector SIMD (multimedia instruction set extensions in CPUs): SIMD(SAST), a single thread issuing vector instructions through one F, X and M.

GPU (NVIDIA, AMD, Intel... GPUs): SI(MDSA)MT, one Fetch and one Memory access unit shared by many threads, one Execute unit per thread.

GPU design space: not just SIMT

How can we run SPMD threads?

Spatial / horizontal factorization: from MIMD (multi-core) to SIMT.
Temporal / vertical factorization: from fine-grained multi-threading to switch-on-event multi-threading.

Examples: NVIDIA GeForce GTX 280 (2008), NVIDIA GTX 480 (2010), AMD 5870 (2011), AMD Radeon 7870 (2012), NVIDIA Echelon project (2017?).

Programmer's point of view: only threads.

Example GPU: NVIDIA GeForce GTX 580

SIMT: warps of 32 threads
16 SMs / chip
2×16 cores / SM, 48 warps / SM

[Diagram: within each SM (SM1 … SM16), warps 1 to 48 are time-multiplexed over the cores.]

1580 Gflop/s
Up to 24576 threads in flight

Outline

Background: GPU architecture requirements
From a sequential processor to a GPU
Parallel control regularity
Parallel memory locality
Parallel value locality

Divergence statistics

50% to 85% of branches are uniform: inside a warp, all threads take the branch or none do.
Easy case; need to avoid costly predication on uniform branches.
Fully dynamic, hardware implementation.

How to keep threads synchronized?

Issue: control divergence

Rules of the game:
One thread per Processing Element (PE).
All PEs execute the same instruction.
PEs can be individually disabled.

Example code:
  x = 0;
  // Uniform condition
  if(tid > 17) {
      x = 1;
  }
  // Divergent conditions
  if(tid < 2) {
      if(tid == 0) {
          x = 2;
      }
      else {
          x = 3;
      }
  }

[Diagram: 1 instruction broadcast to PE 0 - PE 3, running Thread 0 - Thread 3.]

The standard way: mask stack

1 activity bit per thread (threads tid = 0 to 3).

Code                            Hardware mask stack
x = 0;                          1111
// Uniform condition
if(tid > 17) {                  1111   (uniform: no thread takes it, skip)
    x = 1;
}                               1111
// Divergent conditions
if(tid < 2) {                   1111 1100   (push)
    if(tid == 0) {              1111 1100 1000   (push)
        x = 2;
    }                           1111 1100   (pop)
    else {                      1111 1100 0100   (push)
        x = 3;
    }                           1111 1100   (pop)
}                               1111   (pop)

A. Levinthal and T. Porter. Chap - a SIMD graphics processor. SIGGRAPH'84, 1984.
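A minimal host-side sketch (plain C++, illustrative only, not the actual hardware) of how the activity mask and its stack evolve for the divergent part of the code above; here bit t of the mask is the activity bit of thread tid = t:

#include <cstdio>
#include <vector>

int main() {
    unsigned mask = 0xF;              // 4-thread warp, all threads active
    std::vector<unsigned> stack;      // mask stack

    // if (tid < 2): divergent, push the current mask, keep threads 0 and 1
    stack.push_back(mask);
    mask &= 0x3;
    // if (tid == 0): divergent, push again, keep thread 0 for "x = 2;"
    stack.push_back(mask);
    unsigned then_mask = mask & 0x1;
    // else: threads of the enclosing mask that did not take the "then" path
    unsigned else_mask = stack.back() & ~then_mask;
    printf("then %x, else %x\n", then_mask, else_mask);  // prints "then 1, else 2"
    mask = stack.back(); stack.pop_back();   // pop inner if: threads 0 and 1 again
    mask = stack.back(); stack.pop_back();   // pop outer if: all 4 threads active
    printf("after reconvergence: %x\n", mask);           // prints "f"
    return 0;
}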

Goto considered harmful?

[Table: control-flow instructions in several CPU and GPU instruction sets: MIPS, NVIDIA Tesla (2007), NVIDIA Fermi (2010), Intel GMA Gen4 (2006), Intel GMA SB (2011), AMD R500 (2005), AMD R600 (2007), AMD Cayman (2011). GPU sets expose instructions such as bra, brk, cont, ssy, cal/ret, if/else/endif, loop_start/loop_end, push/pop, alu_push_before…]

Control flow structure is explicit in GPU-specific instruction sets.
No support for arbitrary control flow.

Alternative: 1 PC / thread

Code:
  x = 0;
  if(tid > 17) {
      x = 1;
  }
  if(tid < 2) {
      if(tid == 0) {
          x = 2;
      }
      else {
          x = 3;
      }
  }

[Diagram: one Program Counter (PC) per thread, tid = 0, 1, 2, 3, compared against a Master PC: match means the thread is active, no match means it is inactive.]

31 Scheduling policy: min(SP:PC)

Which PC to choose as master PC?

Conditionals, loops: follow the order of code addresses → min(PC).
Functions: favor the maximum nesting depth → min(SP), with compiler support.
Unstructured control flow is supported too, with no code duplication.

[Table: source / assembly / execution-order examples for an if/else, a while loop, and a call to a function f().]

G. Diamos et al. SIMD re-convergence at thread frontiers. MICRO 44, 2011.
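A hypothetical sketch (plain C++, illustrating the policy rather than any actual hardware) of the min(SP:PC) election among the threads of a warp, assuming stacks grow downward so that a smaller SP means a deeper nesting:

struct ThreadCtx {
    unsigned sp;   // stack pointer: smaller value = deeper function nesting
    unsigned pc;   // program counter
};

// Elect the master context: deepest nesting first (min SP), then earliest
// code address (min PC). Threads whose PC matches the elected PC execute
// the fetched instruction; the others wait.
ThreadCtx electMaster(const ThreadCtx *t, int n) {
    ThreadCtx best = t[0];
    for (int i = 1; i < n; ++i)
        if (t[i].sp < best.sp || (t[i].sp == best.sp && t[i].pc < best.pc))
            best = t[i];
    return best;
}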

Multiple PC arbitration: functional model

[Diagram: a vote/broadcast stage computes the master PC (MPC) from the per-thread PCs 0..n; the instruction at MPC is fetched and broadcast to all threads; each thread compares its PC with MPC: on a match it executes the instruction and updates its PC, otherwise it discards the instruction.]

S. Collange. Stack-less SIMT reconvergence at low cost. Tech report, ENS Lyon, 2011.

Benefits of multiple-PC arbitration

Before (stack, counters)                        After (multiple PCs)
O(d) or O(log d) memory, d = nesting depth      O(1) memory
1 R/W port to memory                            No shared state
Exceptions: stack overflow, underflow           Allows thread suspension, restart, migration
Partial SIMD semantics (Bougé-Levaire)          Full SPMD semantics (multi-thread)
C-style structured control flow only            Arbitrary control flow
Specific instruction sets                       Traditional instruction sets, languages, compilers

Enables many new architecture ideas.

Outline

Background: GPU architecture requirements
From a sequential processor to a GPU
Parallel control regularity
Parallel memory locality
Parallel value locality

Memory access patterns
In traditional vector processing

Easy: scalar load & broadcast, reduction & scalar store.
Easy: unit-strided load, unit-strided store.
Hard: (non-unit) strided load, (non-unit) strided store.
Hardest: gather, scatter.

[Diagrams: register-to-memory access patterns across threads T1..Tn for each case.]

In SIMT: every load is a gather, every store is a scatter.

Breakdown of memory access patterns

Vast majority: uniform or unit-strided, and even aligned vectors.

[Charts: breakdown of load and store access patterns.]

“In making a design trade-off, favor the frequent case over the infrequent case.”
J. Hennessy, D. Patterson. Computer architecture: a quantitative approach. MKP, 2007.

Memory coalescing

In hardware: compare the address of each vector element; coalesce memory accesses that fall within the same segment.

Unit-strided requests → one transaction.
Irregular requests → multiple transactions.

Dynamically detects parallel memory regularity
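A hedged CUDA sketch of the two cases (illustrative kernels; the exact coalescing rules depend on the GPU generation):

// Unit-strided: consecutive threads access consecutive addresses.
// The warp's accesses fall within one (or a few) segments: one transaction.
__global__ void copy_unit(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Irregular (here, large-stride) accesses: consecutive threads hit different
// segments, so the hardware splits the request into multiple transactions.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    long long j = (long long)i * stride;
    if (j < n) out[i] = in[j];
}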

Array of structures (AoS)

Programmer-friendly memory layout: group data logically.

struct Pixel {
    float r, g, b;
};
Pixel image_AoS[480][640];

Access pattern in SIMT: (non-unit) strided load; not as efficient as a unit-strided load.

kernel void luminance(Pixel img[][], float luma[][]) {
    int x = tid.x;
    int y = tid.y;
    luma[y][x] = .59*img[y][x].r + .11*img[y][x].g + .30*img[y][x].b;
}

Need to rethink data structures for fine-grained threading.

Structure of Arrays (SoA)

struct Image {
    float R[480][640];
    float G[480][640];
    float B[480][640];
};
Image image_SoA;

Access pattern in SIMT: unit-strided load; efficient thanks to coalescing.

With fine-grained threading: value regularity.
Homogeneous data in memory: variables close in value are close in space.

kernel void luminance(Image img, float luma[][]) {
    int x = tid.x;
    int y = tid.y;
    luma[y][x] = .59*img.R[y][x] + .11*img.G[y][x] + .30*img.B[y][x];
}

Example: call stack

Coarse-grained (…): split stacks, one per thread. No false sharing; good for split, coherent caches (multi-cores).
Fine-grained (Fermi): interleaved stacks, with the frames of threads T0..T3 interleaved within a cache line. False sharing, but good for spatial locality!

[Diagrams: per-thread stack pointers (SP) and cache-line layout in memory for both organizations.]

Outline

Background: GPU architecture requirements
From a sequential processor to a GPU
Parallel control regularity
Parallel memory locality
Parallel value locality

On-chip memory

Conventional wisdom: cache area in CPU vs. GPU according to the NVIDIA CUDA Programming Guide.

But... if we include registers (register files + caches):
NVIDIA GF110 (GPU): 3.9 MB
AMD Tahiti (GPU): 11.9 MB
Intel i7 (CPU): 9.3 MB

GPUs now have more internal memory than desktop CPUs.

Little's law: data = throughput × latency

[Plot: throughput (GB/s) vs. latency (ns) for the memory hierarchy of an Intel Core i7 920 (levels from about 1.25 ns to 50 ns latency) and an NVIDIA GeForce GTX 580 (L1 about 1500 GB/s at 30 ns, L2 about 320 GB/s at 210 ns, DRAM about 190 GB/s at 350 ns).]
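A rough worked example with the (approximate) GTX 580 DRAM figures above: sustaining 190 GB/s at 350 ns latency requires about 190e9 B/s × 350e-9 s ≈ 66 KB of data in flight; at 4 bytes per access, that is on the order of 16,000 concurrent memory accesses, which is why the chip keeps tens of thousands of threads in flight.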

J. Little. A proof for the queuing formula L = λW. JSTOR 1961.

The cost of SIMT: register redundancy

SIMD                        SIMT
  mov i ← 0                   mov i ← tid
loop:                       loop:
  vload T ← X[i]              load t ← X[i]
  vmul T ← a×T                mul t ← a×t
  vstore X[i] ← T             store X[i] ← t
  add i ← i+16                add i ← i+tcnt
  branch i<n? loop            branch i<n? loop

[Table: in SIMD, one scalar instruction stream (vload, vmul, vstore, add, branch) and mostly scalar registers (a = 17, i, n = 51 held once, T as a vector); in SIMT, every thread 0, 1, 2, 3, … repeats the same instructions (load, mul, store, add, branch) and holds its own copy of the registers: a = 17 in every thread, i = 0, 1, 2, 3, …, 15, n = 51 in every thread.]

What are we computing on?

Uniform data: in a warp, v[tid] = c. Example: 5 5 5 5 5 5 5 5 (c = 5).
Affine data: in a warp, v[tid] = b + tid × s, with base b and stride s. Example: 8 9 10 11 12 13 14 15 (b = 8, s = 1).

Average frequency in GPGPU applications

[Chart: breakdown of operations and register file (RF) reads into uniform, affine and other, from 0% to 100%.]

Scalarization

Factor out across threads: common calculations, common registers.

Example (compact the vectors to scalar form, compute, then expand):
  vector: 8 9 10 11 12 13 14 15     compact: 8 + 1×tid
  +       5 5 5 5 5 5 5 5                    5 + 0×tid
  =       13 14 15 16 17 18 19 20   expand:  13 + 1×tid

Dynamic scalarization: tagged register file.
Static scalarization: compiler analysis.

Dynamic scalarization: tagged registers

Associate a tag with each vector register: Uniform, Affine, unKnown.
Propagate tags across arithmetic instructions.

Trace (instructions and resulting tags):
  mov i ← tid          A ← A
loop:
  load t ← X[i]        K ← U[A]
  mul t ← a×t          K ← U×K
  store X[i] ← t       U[A] ← K
  add i ← i+tcnt       A ← A+U
  branch i<n? loop

Register file with tags (thread lanes 0, 1, 2, 3, …; X = unused):
  t: tag K, one value per thread
  a: tag U, value 17 stored once (17 X X X X …)
  i: tag A, base 0 and stride 1 stored (0 1 X X X …)
  n: tag U, value 51 stored once (51 X X X X …)
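A hypothetical sketch (plain C++, illustrating the idea rather than any specific hardware) of how such tags can be combined when arithmetic instructions execute:

// Tags attached to each vector register, as on the slide.
enum Tag { U /* Uniform */, A /* Affine */, K /* unKnown */ };

// Tag of x + y: the sum of two affine (or uniform) vectors is still affine;
// uniform is the special case of affine with stride 0.
Tag addTag(Tag x, Tag y) {
    if (x == K || y == K) return K;
    if (x == A || y == A) return A;
    return U;
}

// Tag of x * y: uniform times uniform stays uniform, uniform times affine
// scales the stride and stays affine, but affine times affine is no longer
// affine in general (it contains a tid² term).
Tag mulTag(Tag x, Tag y) {
    if (x == U && y == U) return U;
    if ((x == U && y == A) || (x == A && y == U)) return A;
    return K;
}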

A scalarizing compiler?

[Diagram: programming models vs. architectures. Sequential programming (scalar-only) and SPMD programming (vector-only: CUDA, OpenCL) map onto three architecture points: model CPU (scalar), actual CPU (scalar + SIMD), GPU (SIMT); the arrows are compiler transformations, a vectorizing compiler toward SIMD and a scalarizing compiler toward scalar execution.]

Compile SPMD programs to SIMD

                         SIMD                                SIMT
Instruction regularity   Vectorization at compile-time       Vectorization at runtime
Control regularity       Software-managed: bit-masking,      Hardware-managed: stack, counters,
                         predication                         multiple PCs
Memory regularity        Compiler selects vector load-store  Hardware-managed: gather-scatter
                         or gather-scatter                   with hardware coalescing

Same set of optimizations; perform them at compile time rather than at runtime.

Identify scalar features?

Scalar registers: uniform vectors, and affine vectors with known stride.
  Example c = a + b: a = 2 2 2 2 2 2 2 2, b = 3 3 3 3 3 3 3 3, c = 5 5 5 5 5 5 5 5.
Scalar operations: uniform inputs give uniform outputs.
Uniform branches: uniform conditions.
  Example: if(c) {…} else {…} with c = 0 in every thread.
Vector load/store (vs. gather/scatter): affine, aligned addresses.
  Example: x = t[i] with i = 8 9 10 11 12 13 14 15 reads t[8]…t[15] contiguously in memory.

In all cases: divergence analysis.
Teaser: be sure to attend the second part of Fernando's course.

Conclusion

SIMT bridges the gap between superscalar and SIMD.
Smooth, dynamic tradeoff between regularity and efficiency.

[Diagram: efficiency (per W) vs. regularity of the application (transaction processing, computer graphics, dense linear algebra), with superscalar, SIMT and SIMD design points, and techniques bridging them: dynamic scalarization, decoupled scalar-SIMT, affine caches… on one side, dynamic warp formation, dynamic warp subdivision, NIMT… on the other.]

Work factorization for efficient throughput architectures

Sylvain Collange Departamento de Ciência da Computação, ICEx Universidade Federal de Minas Gerais [email protected]

February 01, 2012