Multi-threading or SIMD? How GPU architectures exploit regularity

Sylvain Collange Arénaire, LIP, ENS de Lyon [email protected]

ARCHI'11, June 14, 2011

From GPU to integrated many-core

Yesterday (2000–2010): homogeneous multi-core, built from discrete components: Central Processing Unit (CPU), Graphics Processing Unit (GPU).

Today (2011–…): heterogeneous multi-core: AMD Fusion, the NVIDIA Denver/Maxwell project… A single chip combines latency-optimized cores, throughput-optimized cores and hardware accelerators.

Focus on the throughput-optimized part: similarities? differences? possible improvements?

Outline

Performance or efficiency?
  Latency architecture
  Throughput architecture
Execution units: efficiency through regularity
  Traditional divergence control
  Towards more flexibility
Memory access: locality and regularity
  Some memory organizations
  Dealing with variable latency

The 1980s: pipelined processor

Example: scalar-vector multiplication: X ← a∙X

Source code:
  for i = 0 to n-1
      X[i] ← a * X[i]

Machine code:
  move i ← 0
  loop:
      load t ← X[i]
      mul t ← a×t
      store X[i] ← t
      add i ← i+1
      branch i<n? loop

[Figure: a single pipeline (Fetch, Decode, Execute, L/S Unit, Memory) executes the loop one instruction per stage, e.g. add i←18 in Fetch while store X[17] is in Decode.]

The 1990s: superscalar processor

Goal: improve the performance of sequential applications.
  Latency: the time to get the result.
Exploits Instruction-Level Parallelism (ILP).
Lots of tricks: branch prediction, out-of-order execution, register renaming, data prefetching, memory disambiguation…
Basis: speculation. Take a bet on future events:
  If right: time gain.
  If wrong, roll back: energy loss.

What makes speculation work: regularity

Application behavior is likely to follow regular patterns.

  for(i…) {
      if(f(i)) { … }    // control regularity
      j = g(i);
      x = a[j];         // memory regularity
  }

Control regularity: in the regular case the branch goes the same way across iterations (taken, taken, taken, taken); in the irregular case it varies (not taken, taken, taken, not taken).
Memory regularity: in the regular case consecutive iterations access consecutive addresses (j = 17, 18, 19, 20, 21); in the irregular case the addresses are scattered (j = 4, 17, 2).

Exploited across applications by caches, branch prediction, instruction prefetch, data prefetch, write combining…

The 2000s: going multi-threaded

Obstacles to a continuous CPU performance increase: power wall, memory wall, ILP wall.
2000–2010: gradual transition from latency-oriented to throughput-oriented designs:
  Homogeneous multi-core
  Interleaved multi-threading
  Clustered multi-threading

Homogeneous multi-core

Replication of the complete execution engine; multi-threaded software.

Per-thread code (each thread works on its own slice of X):
  move i ← slice_begin
  loop:
      load t ← X[i]
      mul t ← a×t
      store X[i] ← t
      add i ← i+1
      branch i<slice_end? loop

[Figure: two complete pipelines (IF, ID, EX, Memory) run threads T0 and T1 side by side, e.g. T0 fetches add i←18 and stores X[17] while T1 fetches add i←50 and stores X[49].]

Threads: T0, T1.

Improves throughput thanks to explicit parallelism.

Interleaved multi-threading

Time-multiplexing of the processing units; same software view.

[Figure: a single pipeline (Fetch, Decode, Execute, Memory) interleaves instructions from threads T0–T3: mul instructions from some threads are being fetched and decoded while add i←73 and add i←50 from other threads occupy Execute and Memory.]

Threads: T0, T1, T2, T3.

Hides latency thanks to explicit parallelism.

Clustered multi-core

For each individual unit, select between replication and time-multiplexing.
Examples: Sun UltraSPARC T2, AMD Bulldozer.

[Figure: threads T0–T3 are split into two clusters. Fetch and Decode are time-multiplexed (br, mul, store in flight), while each cluster has its own EX unit (add i←73, add i←50) and L/S unit (load X[89], store X[72], load X[17], store X[49]).]

An area-efficient tradeoff.

Outline

Performance or efficiency?
  Latency architecture
  Throughput architecture
Execution units: efficiency through regularity
  Traditional divergence control
  Towards more flexibility
Memory access: locality and regularity
  Some memory organizations
  Dealing with variable latency

Heterogeneity: causes and consequences

Amdahl's law:
  S = 1 / ((1 − P) + P/N)
where (1 − P) is the time to run the sequential portions and P/N the time to run the parallel portions on N cores.

Latency-optimized multi-core: low efficiency on the parallel portions, spends too many resources.
Throughput-optimized multi-core: low performance on the sequential portions.
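A rough worked example (numbers chosen here for illustration, not taken from the talk): with P = 90% of the work parallelizable and N = 100 cores,

  S = 1 / (0.1 + 0.9/100) ≈ 9.2

so the 10% sequential part caps the speedup at roughly 9×, however many throughput cores we add. This is what makes a few latency-optimized cores worth their silicon.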

Heterogeneous multi-core: power-constrained, so it can afford idle transistors. Suggests more radical specialization.

Threading granularity

Coarse-grained threading: decouple tasks to reduce conflicts and inter-thread communication. Example: T0 works on X[0..3], T1 on X[4..7], T2 on X[8..11], T3 on X[12..15].

Fine-grained threading: interleave tasks. Example: T0 works on X[0], X[4], X[8], X[12]; T1 on X[1], X[5], X[9], X[13]; and so on.
  Exhibits locality: neighbor threads share memory.
  Exhibits regularity: neighbor threads have a similar behavior.

Parallel regularity

Similarity in behavior between threads:

  switch(i) {
      case 2: ...
      case 17: ...
      case 21: ...
  }
  r = A[i]
  r = a*b

Control regularity: in the regular case every thread takes the same path (i = 17 in all threads); in the irregular case they diverge (i = 21, 4, 17, 2).
Memory regularity: in the regular case threads load neighboring locations (A[8], A[9], A[10], A[11]); in the irregular case scattered locations (A[8], A[0], A[11], A[3]).
Data regularity: in the regular case threads operate on identical values (a = 32, b = 52 everywhere); in the irregular case the values differ (a = 17, −5, 11, 42; b = 15, 0, −2, 52).

Single Instruction, Multiple Threads (SIMT)

Cooperative sharing of the fetch/decode and load-store units:
  Fetch 1 instruction on behalf of several threads.
  Read 1 memory location and broadcast it to several registers.

[Figure: a SIMT pipeline shared by four warps T0–T3 of four threads each: IF fetches a load for warp T0 and a store for warp T1, the four lanes execute mul for threads (0)–(3) of warp T2, and the LSU serves warp T3.]

In NVIDIA-speak:
  SIMT: Single Instruction, Multiple Threads.
  A convoy of synchronized threads is called a warp.
Improves area- and power-efficiency thanks to regularity.
Consolidates memory transactions: less memory pressure.
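As a concrete CUDA illustration (a hypothetical kernel, not from the talk; warpSize is the CUDA built-in, equal to 32 on this hardware), each thread can compute which warp and which lane it belongs to; all threads of one warp share a single fetched instruction, and their stores to consecutive words can be consolidated into one transaction:

  // Hypothetical kernel: record each thread's warp and lane index.
  // Assumes blockDim.x is a multiple of warpSize and the output arrays
  // hold one element per launched thread.
  __global__ void whoami(int *warp_of, int *lane_of)
  {
      int tid  = blockIdx.x * blockDim.x + threadIdx.x;
      int lane = threadIdx.x % warpSize;   // position inside the warp (0..31)
      int wid  = tid / warpSize;           // index of the convoy of synchronized threads
      warp_of[tid] = wid;                  // neighboring threads write neighboring words:
      lane_of[tid] = lane;                 // one coalesced transaction per warp
  }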

Example GPU: NVIDIA GeForce GTX 580

SIMT: warps of 32 threads.
16 SMs per chip; 2×16 cores per SM; 48 warps per SM.

[Figure: inside one SM, warps 1–48 are time-multiplexed over the cores; SM1 through SM16 run in parallel.]

1580 Gflop/s. Up to 24,576 threads in flight (16 SMs × 48 warps × 32 threads).

Outline

Performance or efficiency?
  Latency architecture
  Throughput architecture
Execution units: efficiency through regularity
  Traditional divergence control
  Towards more flexibility
Memory access: locality and regularity
  Some memory organizations
  Dealing with variable latency

Capturing instruction regularity

How to handle control divergence? With techniques from Single Instruction, Multiple Data (SIMD) architectures.

Rules of the game:
  One thread per Processing Element (PE).
  All PEs execute the same instruction.
  PEs can be individually disabled.

Running example (threads 0–3 mapped to PEs 0–3, 1 instruction at a time):
  x = 0;
  // Uniform condition
  if(tid > 17) {
      x = 1;
  }
  // Divergent conditions
  if(tid < 2) {
      if(tid == 0) {
          x = 2;
      } else {
          x = 3;
      }
  }

Most common: mask stack

One activity bit per thread. On the running example (tid = 0..3), the uniform condition tid > 17 is simply skipped; the divergent conditions push and pop masks:

  Code                      Active mask   Stack operation
  x = 0;                    1111
  if(tid > 17) { x = 1; }   1111          skip (uniform)
  if(tid < 2) {             1100          push 1111
      if(tid == 0) {        1000          push 1100
          x = 2;
      } else {              0100
          x = 3;
      }                     1100          pop
  }                         1111          pop

A. Levinthal and T. Porter. Chap - a SIMD graphics processor. SIGGRAPH '84, 1984.

Curiosity: activity counters

One (in)activity counter per thread: a thread is active when its counter is 0; a non-taken path increments it, leaving the block decrements it. On the same example (the uniform condition is again skipped):

  Code                      Counters (tid = 0..3)
  x = 0;                    0 0 0 0
  if(tid < 2) {             0 0 1 1   inc on the threads that fail the test
      if(tid == 0) {        0 1 2 2   inc
          x = 2;
      }                     0 0 1 1   dec
      else {                1 0 2 2   inc on the threads that took the if
          x = 3;
      }                     0 0 1 1   dec
  }                         0 0 0 0   dec

R. Keryell and N. Paris. Activity counter: New optimization for the dynamic scheduling of SIMD control flow. ICPP '93, 1993.
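To make the mask-stack bookkeeping concrete, here is a small plain C++ sketch (illustrative only, written for this text, not code or hardware from the talk) that replays the divergent part of the example on a 4-thread warp: each nested branch pushes the current mask and restricts it, and each pop restores the previous one.

  #include <bitset>
  #include <cstdio>
  #include <stack>

  using Mask = std::bitset<4>;            // one activity bit per thread of the warp

  int main() {
      int x[4] = {0, 0, 0, 0};            // x = 0; executed by all threads
      Mask active; active.set();          // all four threads active
      std::stack<Mask> mstack;            // the reconvergence (mask) stack

      // if (tid > 17): uniformly false for tid = 0..3, skipped, no divergence.

      // if (tid < 2): divergent -> push the current mask, keep only taken threads.
      mstack.push(active);
      Mask outer = active;
      for (int tid = 0; tid < 4; ++tid)
          if (!(tid < 2)) outer[tid] = 0; // threads 0 and 1 remain active
      active = outer;

      //   if (tid == 0): nested divergence, handled the same way.
      mstack.push(active);
      Mask inner = active;
      for (int tid = 0; tid < 4; ++tid)
          if (tid != 0) inner[tid] = 0;   // only thread 0 active
      active = inner;
      for (int tid = 0; tid < 4; ++tid)
          if (active[tid]) x[tid] = 2;    // x = 2;

      //   else: complement of the taken threads within the mask saved at the branch.
      active = mstack.top() & ~inner;     // only thread 1 active
      for (int tid = 0; tid < 4; ++tid)
          if (active[tid]) x[tid] = 3;    // x = 3;
      active = mstack.top(); mstack.pop(); // pop: threads 0 and 1 active again

      active = mstack.top(); mstack.pop(); // pop: all threads active, warp reconverged
      for (int tid = 0; tid < 4; ++tid)
          printf("thread %d: x = %d\n", tid, x[tid]);  // prints 2, 3, 0, 0
      return 0;
  }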

Brute-force: 1 PC / thread

Keep one Program Counter (PC) per thread (tid = 0..3). A thread whose PC matches the Master PC is active; a thread whose PC does not match is inactive. On the running example, after the divergent branches, PC0 points at x = 2, PC1 at x = 3, and PC2 and PC3 already point past the outer if.

P. Hatcher et al. A production-quality C* compiler for Hypercube multicomputers. PPOPP '91, 1991.

Traditional SIMT pipeline

[Figure: an instruction sequencer holding the master PC and the activity mask stack drives instruction fetch; the instruction and activity mask are broadcast to the execution lanes; a lane whose activity bit is 0 discards the instruction.]

Used in virtually every modern GPU

Outline

Performance or efficiency?
  Latency architecture
  Throughput architecture
Execution units: efficiency through regularity
  Traditional divergence control
  Towards more flexibility
Memory access: locality and regularity
  Some memory organizations
  Maximizing throughput

Goto considered harmful?

[Table: control instructions in some CPU and GPU instruction sets: MIPS; NVIDIA Tesla (2007); NVIDIA Fermi (2010); Intel GMA Gen4 (2006); Intel GMA SB (2011); AMD R500 (2005); AMD R600 (2007); AMD Cayman (2011). MIPS gets by with a handful of jumps and calls (j, jal, jr, syscall…), while the GPU instruction sets add dozens of structured control-flow instructions: push/pop, if/else/endif, loop_start/loop_end, break/continue, call/return, kill, ssy, brx…]

Why so many? To expose the control-flow structure to the instruction sequencer.

SIMD is so last century

MasPar MP-1 (1990): 1 instruction for 16,384 PEs; each PE ≈ 1 mm² in a 1.6 µm process; SIMD programming model.
NVIDIA Fermi (2010): 1 instruction for 16 PEs (roughly 1000× fewer), each PE ≈ 0.03 mm² in a 40 nm process (roughly 50× bigger once normalized for process); threaded programming model, more divergence.

From centralized control to flexible distributed control.

A democratic instruction sequencer

Maintain one PC per thread. Vote: select one of the individual PCs as the Master PC. Which one? Various policies (two of them are sketched in code below):
  Majority: the most common PC.
  Minimum: favor the threads which are late.
  Deepest control-flow nesting level.
  Deepest function-call nesting level.
  Various combinations of the former.

W. Fung, I. Sham, G. Yuan, and T. Aamodt. Dynamic warp formation and scheduling for efficient GPU control flow. MICRO '07, 2007.
J. Meng, D. Tarjan and K. Skadron. Dynamic warp subdivision for integrated branch and memory divergence tolerance. ISCA 2010.
S. Collange. Une architecture unifiée pour traiter la divergence de contrôle et la divergence mémoire en SIMT. SympA'14, 2011.
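A plain C++ sketch of the first two policies (illustrative only; the function names are made up and the per-thread PCs are passed in as a vector):

  #include <cstdint>
  #include <map>
  #include <vector>

  // Majority policy: pick the most common PC among the active threads.
  // Assumes pcs is non-empty.
  uint32_t master_pc_majority(const std::vector<uint32_t>& pcs) {
      std::map<uint32_t, int> count;
      for (uint32_t pc : pcs) ++count[pc];
      uint32_t best = pcs.front();
      int best_n = 0;
      for (const auto& kv : count)
          if (kv.second > best_n) { best = kv.first; best_n = kv.second; }
      return best;
  }

  // Minimum policy: favor the threads which are late (smallest PC).
  uint32_t master_pc_minimum(const std::vector<uint32_t>& pcs) {
      uint32_t best = pcs.front();
      for (uint32_t pc : pcs)
          if (pc < best) best = pc;
      return best;
  }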

Our new SIMT pipeline

[Figure: every thread keeps its own PC. A vote unit selects the master PC (MPC); instruction fetch broadcasts the instruction and the MPC to all lanes; each lane compares its thread's PC against the MPC, executes on a match, discards the instruction otherwise, and updates its PC.]

Benefits of multiple-PC arbitration

Before (stack or counters):
  O(n) or O(log n) state, where n is the nesting depth; 1 R/W port to that memory.
  Exceptions: stack overflow, underflow.
  Still SIMD semantics (Bougé-Levaire).
  Structured control flow only; specific instruction sets.

After (multiple PCs):
  O(1) state and no shared state: allows thread suspension, restart and migration.
  True SPMD (multi-thread) semantics.
  Traditional languages, compilers and instruction sets.

Enables many new architecture ideas.

With multiple warps

Two-stage scheduling:
  Select one warp.
  Select one instruction (MPC) for this warp.

[Figure: a warp table with a ready bit and a next instruction per warp (warp 0: sub r4,r0; warp 1: add r1,r3; warp 2: load r3,[r1]; warp 3: mul r5,r2); the selected instruction is broadcast to PEs 0–3, which read their registers and execute it on the active lanes.]

Dual Instruction, Multiple Threads (DIMT)

Two-stage scheduling:
  Select one warp.
  Select two instructions (MPC1, MPC2) for this warp.

[Figure: the warp table now exposes two candidate next instructions per warp (e.g. warp 0: sub r4,r0 and add r1,r3); lanes that do not match the first master PC execute the second instruction, so the PEs execute add, mul, add, add in the same cycle.]

More than 2 instructions: NIMT

A. Glew. Coherent vector lane threading. Berkeley ParLab Seminar, 2009.

Why DIMT?

[Figure: the divergent if/else example with one PC per thread. The two divergent paths, x = 2 and x = 3, have no overlap, so two master PCs (MPC0 and MPC1) can be serviced at once, one per path.]

"Fills holes" using parallelism between execution paths.

Dynamic Warp Formation (DWF)

Why do we need warps at all?
  Select the master PC from a global thread pool.
  On each PE, select one thread from a local thread pool.

[Figure: per-PE thread pools with ready bits; each PE independently picks a ready thread whose next instruction matches, reads its registers, and all four PEs execute add in the same cycle.]

W. Fung, I. Sham, G. Yuan, and T. Aamodt. Dynamic warp formation and scheduling for efficient GPU control flow. MICRO '07, 2007.

New DIMT+DWF pipeline

[Figure: the combined DIMT+DWF pipeline: per-thread PCs feed a vote unit that produces two master PCs (MPC0, MPC1); instruction fetch broadcasts both instructions; each lane matches the PC of its selected thread against one of the MPCs, executes the corresponding instruction and updates that PC.]

A radical departure from classical SIMD.

Avoiding redundancy

Goal: keep execution units busy?

Keep execution units busy doing real work!

What are we computing on?

Uniform data: within a warp, v[i] = c. Example: c = 5 gives 5 5 5 5 5 5 5 5.
Affine data: within a warp, v[i] = b + i·s, with base b and stride s. Example: b = 8, s = 1 gives 8 9 10 11 12 13 14 15.

[Chart: average frequency of uniform, affine and other values on GPGPU applications, for instruction inputs and for operations.]

Tagging registers

Associate a tag with each vector register: Uniform (U), Affine (A) or unKnown (K). Propagate the tags across arithmetic instructions.

  Instruction            Tag transfer
  mov i ← tid            A ← A
  loop:
    load t ← X[i]        K ← U[A]
    mul t ← a×t          K ← U×K
    store X[i] ← t       U[A] ← K
    add i ← i+tcnt       A ← A+U
    branch i<n? loop
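A minimal sketch of such a tag algebra in C++ (the rules below follow the mathematical properties of uniform and affine vectors; the function names are made up and the real hardware policy may differ):

  enum Tag { U, A, K };   // Uniform, Affine, unKnown

  // Addition: uniform+uniform is uniform; affine+uniform is affine (same stride);
  // affine+affine is still affine (the strides add); anything else is unknown.
  Tag add_tag(Tag x, Tag y) {
      if (x == U && y == U) return U;
      if ((x == A || y == A) && x != K && y != K) return A;
      return K;
  }

  // Multiplication: uniform*uniform is uniform; affine*uniform is affine
  // (the stride is scaled); affine*affine is not affine in general.
  Tag mul_tag(Tag x, Tag y) {
      if (x == U && y == U) return U;
      if ((x == A && y == U) || (x == U && y == A)) return A;
      return K;
  }

  // Loads from memory produce unknown values, even when the address is affine.
  Tag load_tag(Tag /*address*/) { return K; }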

Dynamic Work Factorization (DWF)

[Figure: the pipeline extended with a scalar register file and a tag array alongside the vector register file. After Decode, the register IDs and their tags drive de-duplication when operands are read; the annotations report the vector structures as inactive for 38% of operands and 24% of instructions.]

S. Collange, D. Defour, Y. Zhang. Dynamic detection of uniform and affine vectors in GPGPU computations. Euro-Par HPPC '09, 2009.

Catch-22

[Figure: the DIMT+DWF pipeline of the previous slides, now also carrying tags through the vote and broadcast logic.]

Control logic needs to stay much smaller / simpler / less power-hungry than Execution logic

Is execution unit utilization such an issue anyway?

Outline

Performance or efficiency?
  Latency architecture
  Throughput architecture
Execution units: efficiency through regularity
  Traditional divergence control
  Towards more flexibility
Memory access: locality and regularity
  Some memory organizations
  Dealing with variable latency

It's the memory, stupid!

Our primary constraint: power. Power measurements on NVIDIA GT200:

  Operation                        Energy/op (nJ)   Total power (W)
  Instruction control              1.8              18
  Multiply-add on a 32-wide warp   3.6              36
  Load 128 B from DRAM             80               90

With the same amount of energy, we can load 1 word from DRAM or compute about 44 flops. Memory traffic is what matters (most).
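The 44-flop figure can be recovered from the table above (assuming the 128-byte load is amortized over 32 four-byte words and the 32-wide multiply-add counts as 64 flops):

  3.6 nJ / 64 ≈ 0.056 nJ per flop
  80 nJ / 32 = 2.5 nJ per word
  2.5 / 0.056 ≈ 44 flops per word loaded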

S. Collange, D. Defour, A. Tisserand. Power consumption of GPUs from a software perspective. ICCS 2009.

Memory access patterns in traditional vector processing

Scalar load & broadcast / reduction & scalar store: all threads T1…Tn share a single memory location.
Unit-strided load / unit-strided store: threads access consecutive memory locations.
(Non-unit) strided load / strided store: threads access locations separated by a constant stride.
Gather / scatter: threads access arbitrary locations.

In SIMT, every load is a gather and every store is a scatter.
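Two hypothetical CUDA kernels (written for this text, not taken from the talk) that make the contrast concrete: the first issues unit-strided, easily coalesced accesses, while the second turns every load into a true gather through an index array and only coalesces well when idx happens to be regular.

  // Unit-strided: thread tid reads X[tid]; a warp touches 32 consecutive words,
  // which the hardware can consolidate into one wide transaction.
  __global__ void copy_strided(float *dst, const float *X)
  {
      int tid = blockIdx.x * blockDim.x + threadIdx.x;
      dst[tid] = X[tid];
  }

  // Gather: the address depends on idx[tid]; in the irregular case the warp's
  // accesses are scattered and must be split into many transactions.
  __global__ void copy_gather(float *dst, const float *X, const int *idx)
  {
      int tid = blockIdx.x * blockDim.x + threadIdx.x;
      dst[tid] = X[idx[tid]];
  }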

The memory we want

[Figure: PEs 0–3 each send their own address to an independently decoded memory and each receive 4 bytes of data.]

Many independent R/W ports; supports lots of small transactions, 4 or 8 bytes wide.

The memory we have

DRAMs:
  Wide bus, burst mode: use wide transactions (≥ 32 B).
  Switching DRAM pages is expensive: group accesses by page (1 page ≈ 2 KB).
  One shared bus with a read/write turnaround penalty: group accesses by direction.
Caches:
  Wide cache lines (128–256 B).
  Few R/W ports.

Breakdown of memory access patterns

The vast majority of accesses are uniform or unit-strided, and even aligned vectors.

[Charts: breakdown of memory access patterns, separately for loads and for stores.]

"In making a design trade-off, favor the frequent case over the infrequent case." [HP06]

Coalescing concurrent requests

Unit-strided detection (NVIDIA CC 1.0–1.1 coalescing):
  Unit-strided and aligned requests → one transaction.
  Non-unit-strided or unaligned requests → multiple transactions.

Minimal coverage (NVIDIA CC 1.2 coalescing), sketched in code below:
  1. Select one request and consider the maximal aligned transaction around it.
  2. Identify the requests that fall in the same memory segment.
  3. Reduce the transaction size when possible and issue the transaction.
  4. Repeat with the remaining requests.
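A plain C++ sketch of the minimal-coverage loop (illustrative only: the 128-byte segment size is an assumption and step 3, the transaction-size reduction, is left out):

  #include <cstdint>
  #include <vector>

  // Group per-thread byte addresses into aligned segments, one transaction each.
  std::vector<uint64_t> coalesce(std::vector<uint64_t> addrs, uint64_t seg = 128) {
      std::vector<uint64_t> transactions;
      while (!addrs.empty()) {
          // 1. Select one pending request, take the aligned segment around it.
          uint64_t segment = addrs.front() / seg;
          // 2. Requests in the same segment are covered; keep the others.
          std::vector<uint64_t> remaining;
          for (uint64_t a : addrs)
              if (a / seg != segment) remaining.push_back(a);
          // 3. (Size reduction omitted.) Issue one transaction for the segment.
          transactions.push_back(segment * seg);
          // 4. Repeat with the remaining requests.
          addrs.swap(remaining);
      }
      return transactions;
  }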

Banked shared memory

[Figure: PEs 0–3 send their addresses through a bank arbiter to per-bank decoders; each bank delivers one word per cycle.]

Software-managed memory, interleaved on a word-by-word basis. Used in NVIDIA Tesla (2007).
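A hypothetical CUDA fragment showing the practical consequence, assuming the word-interleaved, 16-bank organization of that hardware generation (the kernel itself is made up): stride-1 accesses from a half-warp hit 16 different banks, while a stride of 16 words sends every thread to the same bank and serializes the accesses.

  __global__ void bank_demo(float *out)
  {
      __shared__ float tile[16 * 16];       // software-managed shared memory
      int t = threadIdx.x;                  // assume blockDim.x == 16 here

      tile[t] = (float)t;                   // stride 1: one bank per thread
      __syncthreads();

      float no_conflict = tile[t];          // stride 1: conflict-free
      float conflict    = tile[t * 16];     // stride 16 words: every thread hits bank 0
      out[t] = no_conflict + conflict;
  }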

Hardware-managed cache

[Figure: PEs 0–3 send cache-line addresses through an arbiter to the L1D cache tags and data array; one line-wide port returns the data.]

The lanes share one wide port to the L1 cache; multiple lanes can read from the same cache line.
Bottleneck: the single-ported cache tags. Used in NVIDIA Fermi (2010).

Outline

Performance or efficiency?
  Latency architecture
  Throughput architecture
Execution units: efficiency through regularity
  Traditional divergence control
  Towards more flexibility
Memory access: locality and regularity
  Some memory organizations
  Dealing with variable latency

Dealing with pipeline hazards

Bank conflicts, lost arbitration, cache misses.

Conventional solution: stall execution pipeline until resolved

Preferred solution: in-order replay

Instruction replay:
  Keep the pipeline running.
  Put the offending instruction back into the instruction queue,
  with an updated predication mask: only replay the threads that failed.

[Figure: the per-lane Match/Exec stages send addresses through a bank arbiter to the data cache; on a conflict or miss, the (instruction, mask) pair is fed back into the instruction queue for replay.]

Used in NVIDIA Tesla (2007).

Dynamic Warp Subdivision

Consider replay as a control-flow operation (or a no-op):
  Threads that miss are turned inactive until their data arrives.
  Threads that hit ask for the next instruction.
Memory divergence = branch divergence: both are handled the same way. When one thread misses, there is no need to block the whole warp.
Tradeoff: more latency hiding, but lower ALU utilization. Could the utilization loss be counteracted with DIMT/NIMT?

J. Meng, D. Tarjan and K. Skadron. Dynamic warp subdivision for integrated branch and memory divergence tolerance. ISCA 2010.

Linked list traversal: without DWS

1: while(i != -1) {
2:     i = l[i];
3: }
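The same traversal written as a hypothetical CUDA kernel (the array names and the node-count output are made up): each thread chases its own chain, so on any given load some lanes hit in the cache while others miss. This is exactly the memory divergence that DWS targets.

  __global__ void walk_lists(const int *l, const int *head, int *length)
  {
      int tid = blockIdx.x * blockDim.x + threadIdx.x;
      int i = head[tid];          // each thread starts from its own list head
      int n = 0;
      while (i != -1) {           // line 1: divergent loop exit
          i = l[i];               // line 2: data-dependent load, hit or miss per thread
          ++n;
      }
      length[tid] = n;
  }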

Without DWS, the whole warp advances in lockstep and waits on every miss:

  Thread:                0       1       2       3
  MPC=2   2: i = l[i];   hit     miss    hit     miss
  MPC=1   1: i != -1?    true    true    true    false
  MPC=2   2: i = l[i];   miss    hit     hit     (PC=3, done)
  MPC=1   1: i != -1?    true    true    false   (PC=3, done)

Linked list traversal: with DWS

1: while(i != -1) {
2:     i = l[i];
3: }

With DWS, the threads that hit keep running while the threads that missed are marked not ready and rejoin once their data arrives:

  Thread:                0               1               2             3
  MPC=2   2: i = l[i];   hit             miss, ready=0   hit           miss, ready=0
  MPC=1   1: i != -1?    true                            true
  MPC=2   2: i = l[i];   miss, ready=0   ready=1         hit           ready=1
  MPC=1   1: i != -1?                    true            false         false
  MPC=2   2: i = l[i];   ready=1         hit             (PC=3, done)  (PC=3, done)
  MPC=1   1: i != -1?    true            true

SIMT pipeline – memory instruction

[Figure: SIMT pipeline executing a memory instruction. Each lane carries a (PC, valid bit) pair through its Match/Exec/Update-PC stages; addresses go through a bank arbiter to the data cache, which returns data together with per-thread hit/miss and ack/nack signals that are written back along with the valid bits.]

Hazards (divergence, bank conflicts, cache misses) all cause the PC and valid bit to be updated accordingly.

Conclusion: the missing link

[Figure: a design spectrum running from SIMD through stack-based SIMT and PC-based SIMT to CMP, CMT and SMT. Moving toward the multi-threaded side allows more flexibility; moving toward SIMD optimizes for regularity. The left side exposes a SIMD programming model, the right side a multi-thread programming model.]

A new range of architecture options between Simultaneous Multi-Threading, Chip Multi-Threading and SIMD. It exploits parallel regularity for higher perf/W.

Perspectives: next challenges

Instruction fetch policy and thread scheduling policy: objectives to balance:
  Instruction throughput
  Memory-level parallelism
  Fairness
  Regularity (coherence)
Detect control-flow reconvergence points.
Cross-fertilization with ideas from "classical" multi-threaded microarchitecture?

Multi-threading or SIMD? How GPU architectures exploit regularity

Sylvain Collange Arénaire, LIP, ENS de Lyon [email protected]

ARCHI'11, June 14, 2011

Conclusion

SIMT bridges the gap between superscalar and SIMD: a smooth, dynamic tradeoff between regularity and efficiency.

[Figure: efficiency (flops/W) versus regularity of the application, placing superscalar, SIMT and SIMD on a spectrum. Annotations: dynamic warp formation, dynamic warp subdivision, NIMT…; dynamic work factorization, decoupled scalar-SIMT, affine caches… Example workloads along the regularity axis: transaction processing, computer graphics, dense linear algebra.]

Software view

Programming model: threads.
  SPMD: Single Program, Multiple Data: one kernel code, many threads.
  Unspecified execution order between explicit synchronization barriers.
Languages:
  Graphics: HLSL, Cg, GLSL.
  GPGPU: C for CUDA, OpenCL.

Kernel, for n threads: X[tid] ← a * X[tid]

Out-of-order microarchitecture

Software:
  void scale(float a, float * X, int n)
  {
      for(int i = 0; i != n; ++i)
          X[i] = a * X[i];
  }

Compiled x86-64 code:
  scale:
          test    esi, esi
          je      .L4
          sub     esi, 1
          xor     eax, eax
          lea     rdx, [4+rsi*4]
  .L3:
          movss   xmm1, DWORD PTR [rdi+rax]
          mulss   xmm1, xmm0
          movss   DWORD PTR [rdi+rax], xmm1
          add     rax, 4
          cmp     rax, rdx
          jne     .L3
  .L4:
          rep ret

Architecture: sequential programming model

Out-of-order superscalar microarchitecture

Hardware datapaths: dataflow execution model

Hardware

GPU microarchitecture

Software:
  __global__ void scale(float a, float * X)
  {
      unsigned int tid;
      tid = blockIdx.x * blockDim.x + threadIdx.x;
      X[tid] = a * X[tid];
  }
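A possible host-side launch for this kernel (a sketch with assumed sizes: the grid is chosen so that blocks × threads exactly covers n, since the kernel has no bounds check):

  int n = 1 << 20;                                // assumed problem size
  float *d_X;
  cudaMalloc(&d_X, n * sizeof(float));            // allocate and then fill d_X (omitted)

  int threadsPerBlock = 256;
  int blocks = n / threadsPerBlock;               // n is a multiple of 256 here
  scale<<<blocks, threadsPerBlock>>>(2.0f, d_X);  // one thread per element
  cudaDeviceSynchronize();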

Architecture: multi-thread programming model

SIMT microarchitecture

Hardware datapaths: SIMD execution units

Hardware

The old way: centralized control

The scalar unit takes all control decisions and issues the instructions; the SIMD/vector units act as a coprocessor and report hazards back.

[Figure: the master scalar unit sends instructions and control to the slave SIMD or vector units, the shared resource; feedback returns activity and hazard information.]

The new way: piggyback model

Multiple threads, one shared resource (instruction fetch, memory bank port, cache tag port…):
  Select one thread and give it access to the resource (vote/arbitration).
  Let the other threads opportunistically share the resource: same instruction, same cache line…

Variations: multiple resources, grouping strategies, arbitration policies…

Sequential case: cache and prefetch

Code:
  for i = 0 to N-1
      load X[i]
      i++

Trace: the first load X[i] misses and fetches a cache line (prefetching the next line); the following loads X[i] hit in the cache.

"Vertical" temporal or spatial locality: sequential reuse, in time.

Parallel case: coalescing, multi-threading

Code (kernel): load X[tid], for threads t0…t11 grouped into warps 0–2.

Trace:
  w0: load X[{0,1,2,3}]    (fetch one cache line)
  w1: load X[{4,5,6,7}]
  w2: load X[{8,9,10,11}]

"Horizontal" temporal or spatial locality: parallel reuse, in space.

How to compute the Master PC?

Early SIMD machines (1980–1985): in software or hardware. The master runs through all conditional blocks, whether taken or not, and runs through loop bodies n+1 times.
Golden SIMD era (1990–1995): mostly software (with hardware support). Only runs through the branches actually taken.
Current GPUs (2005–2010): in hardware, with structured control flow in the instruction set.

On-chip memory

Conventional wisdom: cache area in CPUs vs. GPUs according to the NVIDIA CUDA Programming Guide [figure].

Actual data (register files + caches):
  NVIDIA GF110: 3.9 MB
  AMD Cayman: 7.7 MB

At this rate, GPUs will catch up with CPUs by 2012…

The cost of SIMT: register wastage

SIMD code:
  mov i ← 0
  loop:
      vload T ← X[i]
      vmul T ← a×T
      vstore X[i] ← T
      add i ← i+16
      branch i<n? loop

SIMT code (one copy per thread):
  mov i ← tid
  loop:
      load t ← X[i]
      mul t ← a×t
      store X[i] ← t
      add i ← i+tcnt
      branch i<n? loop

Registers: in SIMD, the scalars a = 17, i and n = 51 live once in scalar registers next to one vector register T; in SIMT, every thread holds its own copy of a (17 17 17 17 …), i (0 1 2 3 …) and n (51 51 51 51 …), wasting register file space.

Focus on GPGPU

Graphics Processing Unit (GPU):
  Video game industry: mass market; low unit price, amortized R&D.
  An inexpensive, high-performance parallel processor.
2002: General-Purpose computation on GPU (GPGPU).
2010: #1 supercomputer: Tianhe-1A, 7168 GPUs (NVIDIA Tesla M2050), 2.57 Pflops, 4.04 MW. "Only" #1 in the Top500, #11 in the Green500.
Credits: NVIDIA.

SIMD

Single Instruction, Multiple Data.

Source code:
  for i = 0 to n-1 step 4
      X[i..i+3] ← a * X[i..i+3]

Machine code:
  loop:
      vload T ← X[i]
      vmul T ← a×T
      vstore X[i] ← T
      add i ← i+4
      branch i<n? loop

[Figure: one pipeline (IF, ID, EX, Memory) executes the vector loop, e.g. add i←20 in IF while vstore X[16..19] is in Memory.]

Challenging to program (semi-regular apps?)

So how does it differ from SIMD?

                          SIMD                             SIMT
  Instruction regularity  Vectorization at compile time    Vectorization at runtime
  Control regularity      Software-managed:                Hardware-managed:
                          bit-masking, predication         stack, counters, multiple PCs
  Memory regularity       Compiler selects vector          Hardware-managed:
                          load-store or gather-scatter     gather-scatter with hardware coalescing

SIMD is static, SIMT is dynamic: a similar opposition as VLIW vs. superscalar.

Dynamic data deduplication: accuracy

Detects 79% of affine input operands and 75% of affine computations.

[Chart: fraction of uniform, affine and unknown values among instruction inputs and outputs, comparing the totals against what the dynamic detection identifies.]