Work factorization for efficient throughput architectures
Sylvain Collange Departamento de Ciência da Computação, ICEx Universidade Federal de Minas Gerais [email protected]
February 01, 2012

GPGPU in HPC, today
Graphics Processing Unit (GPU)
Made for video games: a mass market
Low unit price, amortized R&D
Inexpensive, high-performance parallel processor
2002: General-Purpose computation on GPU (GPGPU)
2012: 3 out of the top 5 supercomputers use GPUs
#2 Tianhe-1A (China)
#4 Dawning Nebulae (China)
#5 Tsubame 2.0 (Japan)

GPGPU in the future?
Yesterday (2000-2010)
  Discrete components: Central Processing Unit (CPU) + Graphics Processing Unit (GPU)
  Homogeneous multi-core
Today (2011-...)
  Chip-level integration: Intel Sandy Bridge, AMD Fusion, NVIDIA Denver/Maxwell project, many embedded SoCs
Tomorrow
  Heterogeneous multi-core: latency-optimized cores, throughput-optimized cores, hardware accelerators
  GPUs to blend into heterogeneous multi-core chips?

Outline
Background: GPU architecture requirements From a sequential processor to a GPU Parallel control regularity Parallel memory locality Parallel value locality
What do we need GPUs for?

1. 3D graphics rendering for games: complex texture mapping, lighting computations…
2. Computer-Aided Design workstations: complex geometry
3. GPGPU: complex synchronization, data movements

One chip to rule them all: find the common denominator

The (simplest) graphics rendering pipeline
[Pipeline diagram: Vertices → Vertex shader → Clipping, Rasterization, Attribute interpolation → Fragments → Fragment shader (reads Textures) → Z-Compare, Blending (reads/writes Z-Buffer) → Pixels → Framebuffer. Inputs include primitives (triangles…). Each stage is either programmable (shaders) or parametrizable (fixed function).]

How much performance do we need
… to run 3DMark 11 at 50 frames/second?
Element        Per frame   Per second
Vertices       12.0M       600M
Primitives     12.6M       630M
Fragments      180M        9.0G
Instructions   14.4G       720G

Intel Core i7 2700K: 56 Ginsn/s peak. We need to go 13x faster.
Source: Damien Triolet, Hardware.fr

Constraints
Memory wall
  Memory speed does not increase as fast as computing speed
  It is more and more difficult to hide memory latency
Power wall
  Power consumption of transistors does not decrease as fast as density increases
  Performance is now limited by power consumption
Latency vs. throughput

Latency: time to solution
  CPUs: minimize time, at the expense of power
Throughput: quantity of tasks processed per unit of time
  GPUs: assume unlimited parallelism; minimize energy per operation
Amdahl's law

Bounds the speedup attainable on a parallel machine:

    S = 1 / ((1 − P) + P/N)

  S: speedup
  P: ratio of parallel portions
  N: number of processors
  (1 − P): time to run the sequential portions
  P/N: time to run the parallel portions

[Plot: speedup S versus available processors N]

G. Amdahl. Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities. AFIPS 1967.

Why heterogeneous architectures?

    S = 1 / ((1 − P) + P/N)
Latency-optimized multi-core (CPU)
  Low efficiency on parallel portions: spends too many resources
Throughput-optimized multi-core (GPU)
  Low performance on sequential portions
Heterogeneous multi-core (CPU+GPU)
  Use the right tool for the right job
  Allows aggressive optimization for latency or for throughput
M. Hill, M. Marty. Amdahl's law in the multicore era. IEEE Computer, 2008.

Programming model: multi-threading
1 vertex = 1 thread: computes spatial coordinates, texture coordinates…
1 fragment = 1 thread: computes color, lighting…

GPGPU: Bulk-Synchronous Parallel (BSP) model
  NVIDIA CUDA, OpenCL…
  The program describes the operation each thread applies; threads synchronize at barriers
  SPMD: Single Program, Multiple Data

L. Valiant. A bridging model for parallel computation. Comm. ACM 1990.

Threading granularity
Coarse-grained threading
  Decouple tasks to reduce conflicts and inter-thread communication
  e.g. MPI, OpenMP
  T0: X[0..3], T1: X[4..7], T2: X[8..11], T3: X[12..15]

Fine-grained threading
  Interleave tasks
  Exhibit locality: neighbor threads share memory
  Exhibit regularity: neighbor threads have similar behavior
  e.g. OpenCL, CUDA
  T0: X[0] X[4] X[8] X[12], T1: X[1] X[5] X[9] X[13], T2: X[2] X[6] X[10] X[14], T3: X[3] X[7] X[11] X[15]

Parallel regularity

Similarity in behavior between threads

Control regularity (switch(i) { case 2: … case 17: … case 21: … })
  Regular:   i = 17 17 17 17 (all threads take the same case)
  Irregular: i = 21 4 17 2

Memory regularity (r = A[i])
  Regular:   load A[8] A[9] A[10] A[11] (contiguous)
  Irregular: load A[8] A[0] A[11] A[3] (scattered)

Data regularity (r = a×b)
  Regular:   a = 32 32 32 32, b = 52 52 52 52
  Irregular: a = 17 -5 11 42, b = 15 0 -2 52

Outline
Background: GPU architecture requirements From a sequential processor to a GPU Parallel control regularity Parallel memory locality Parallel value locality
First step: sequential, pipelined processor
Let's build a GPU
Our application: scalar-vector multiplication, X ← a∙X
First idea: run each thread sequentially

for i = 0 to n-1
    X[i] ← a * X[i]

Source code:
    move i ← 0
loop:
    load t ← X[i]
    mul t ← a×t
    store X[i] ← t
    add i ← i+1
    branch i<n? loop

[Pipeline: Fetch → Decode → Execute → Memory; in flight: add i ← 18 in Fetch, store X[17] in Decode, mul in Execute…]

Homogeneous multi-core

Replication of the complete execution engine

    move i ← slice_begin
loop:
    load t ← X[i]
    mul t ← a×t
    store X[i] ← t
    add i ← i+1
    branch i<slice_end? loop

[Two replicated pipelines: T0 with add i ← 18 in Fetch and store X[17] in Decode; T1 with add i ← 50 in Fetch and store X[49] in Decode]

Threads: T0 T1
Improves throughput thanks to explicit parallelism

Interleaved multi-threading

Time-multiplexing of the processing units; same software view

[One pipeline interleaves instructions from four threads: mul in Fetch, mul in Decode, add i ← 73 in Execute, add i ← 50 in Memory]

Threads: T0 T1 T2 T3
Hides latency thanks to explicit parallelism

Single Instruction, Multiple Threads (SIMT)

Factorization of the fetch/decode and load-store units
  Fetch 1 instruction on behalf of several threads
  Read 1 memory location and broadcast it to several registers

[One shared Fetch/Decode front-end and load-store unit; four lanes execute (0) mul, (1) mul, (2) mul, (3) mul in lockstep]

In NVIDIA-speak
  SIMT: Single Instruction, Multiple Threads
  Convoy of synchronized threads: warp
Improves area/power efficiency thanks to regularity
Consolidates memory transactions: less memory pressure

What about SIMD?

Single Instruction, Multiple Data

for i = 0 to n-1 step 4
    X[i..i+3] ← a * X[i..i+3]

Source code:
loop:
    vload T ← X[i]
    vmul T ← a×T
    vstore X[i] ← T
    add i ← i+4
    branch i<n? loop

Vectors, not threads: no “true” thread divergence allowed

Flynn's taxonomy

Classification of parallel architectures
  SISD: Single Instruction, Single Data
  MISD: Multiple Instruction, Single Data
  SIMD: Single Instruction, Multiple Data
  MIMD: Multiple Instruction, Multiple Data

M. Flynn. Some Computer Organizations and Their Effectiveness. IEEE TC 1972.
Flynn's taxonomy revisited

…to account for multi-threading
Resources, each shared by threads T0…T3 or replicated per thread:
  Fetch (F): Instruction
  Execute (X): Data
  Register file / memory access (M): Address
Single resource:    SIMT (shared F), SDMT (shared X), SAMT (shared M)
Multiple resources: MIMT, MDMT, MAMT
Mostly orthogonal: mix and match to build your own _I_D_A_T pipeline!

Examples: conventional design points

Multi-core (most CPUs of today): MIMD = MI MD MA MT, each thread with its own F, X, M
Short-vector SIMD (multimedia instruction-set extensions in CPUs): SIMD = SI MD SA ST, one F shared by the vector lanes
GPU (NVIDIA, AMD, Intel… GPUs): SI(MDSA)MT, one F shared by many threads

GPU design space: not just SIMT

How can we run SPMD threads?
  MIMD (multi-core)
  Spatial / horizontal SIMT
  Temporal / vertical SIMT
  Fine-grained multi-threading
  Switch-on-event multi-threading
Examples: NVIDIA GeForce GTX 280 (2008), NVIDIA GTX 480 (2010), AMD Radeon 5870 (2011), AMD Radeon 7870 (2012), NVIDIA Echelon project (2017?)
Programmer's point of view: only threads

Example GPU: NVIDIA GeForce GTX 580

SIMT: warps of 32 threads
16 SMs / chip
2×16 cores / SM, 48 warps / SM
1580 Gflop/s
Up to 24576 threads in flight

Outline

Background: GPU architecture requirements
From a sequential processor to a GPU
Parallel control regularity
Parallel memory locality
Parallel value locality

Divergence statistics

50% to 85% of branches are uniform: inside a warp, all threads take the branch or none do
This is the easy case; costly predication must still be avoided on uniform branches
Fully dynamic, hardware implementation

How to keep threads synchronized?
Issue: control divergence

Rules of the game
  One thread per Processing Element (PE)
  All PEs execute the same instruction
  PEs can be individually disabled

x = 0;
// Uniform condition
if(tid > 17) {
    x = 1;
}
// Divergent conditions
if(tid < 2) {
    if(tid == 0) {
        x = 2;
    } else {
        x = 3;
    }
}

The standard way: mask stack

Hardware mask stack: 1 activity bit / thread (threads tid = 0…3)

  x = 0;                             1111
  if(tid > 17) {   // uniform: skip
      x = 1;
  }                                  1111
  if(tid < 2) {          push  →    1111 1100
      if(tid == 0) {     push  →    1111 1100 1000
          x = 2;
      }                  pop   →    1111 1100
      else {             push  →    1111 1100 0100
          x = 3;
      }                  pop   →    1111 1100
  }                      pop   →    1111

A. Levinthal and T. Porter. Chap - a SIMD graphics processor. SIGGRAPH ’84, 1984.

Goto considered harmful?

[Table: control instructions in some CPU and GPU instruction sets: MIPS vs. NVIDIA Tesla (2007), NVIDIA Fermi (2010), Intel GMA Gen4 (2006), Intel GMA SB (2011), AMD R500 (2005), AMD R600 (2007), AMD Cayman (2011). The CPU exposes a handful of jumps (j, jal, jr, syscall…); the GPU ISAs expose dozens of structured-control instructions (if/else/endif, loop_start/loop_end, push/pop, call/return…).]

Control flow structure is explicit in GPU-specific instruction sets
No support for arbitrary control flow

Alternative: 1 PC / thread
One Program Counter (PC) per thread: PC0…PC3 for tid = 0…3
The master PC is matched against each thread PC
  Match → thread active (e.g. activity 1 0 0 0 when only tid = 0 matches, inside the x = 2 branch)
  No match → thread inactive

x = 0;
if(tid > 17) { x = 1; }
if(tid < 2) {
    if(tid == 0) { x = 2; }
    else { x = 3; }
}

Scheduling policy: min(SP:PC)

Which PC to choose as the master PC?
Conditionals, loops: follow the order of code addresses → min(PC)
    Source              Assembly                 Order
    if(…) { … }         p? br else; …            1
    else { … }          br endif; else: …        2
    (after if)          endif: …                 3
    while(…) { … }      start: …; p? br start    1, 2, 3, then 4
Functions: favor maximum nesting depth → min(SP)
    f(); …  void f() { … }    call f (1); f: … (2); ret; … (3)
With compiler support: unstructured control flow too, with no code duplication

G. Diamos et al. SIMD re-convergence at thread frontiers. MICRO 44, 2011.

Multiple PC arbitration: functional model

Each thread holds its own PC. Every cycle:
  1. Vote: elect a master PC (MPC) among the thread PCs
  2. Fetch the instruction at MPC and broadcast it
  3. Each thread matches its PC against MPC: match → execute the instruction and update the PC; no match → discard the instruction

S. Collange. Stack-less SIMT reconvergence at low cost. Tech report, ENS Lyon, 2011.
Benefits of multiple-PC arbitration

Before (stack, counters):
  O(d) or O(log d) memory, with d = nesting depth
  1 R/W port to the mask memory; stack overflow and underflow exceptions
  Partial SIMD semantics (Bougé-Levaire)
  C-style structured control flow only
  Specific instruction sets
After (multiple PCs):
  O(1) memory, no shared state
  Allows thread suspension, restart, migration
  Full SPMD semantics (multi-thread)
  Arbitrary control flow
  Traditional instruction sets, languages, compilers
Enables many new architecture ideas

Outline

Background: GPU architecture requirements
From a sequential processor to a GPU
Parallel control regularity
Parallel memory locality
Parallel value locality

Memory access patterns

In traditional vector processing:
  Easy:    scalar load & broadcast / reduction & scalar store
  Easy:    unit-strided load / unit-strided store
  Hard:    (non-unit) strided load / (non-unit) strided store
  Hardest: gather / scatter
In SIMT: every load is a gather, every store is a scatter

Breakdown of memory access patterns

The vast majority of loads and stores are uniform or unit-strided, and even aligned vectors

“In making a design trade-off, favor the frequent case over the infrequent case.”
J. Hennessy, D. Patterson. Computer architecture: a quantitative approach.
MKP, 2007.

Memory coalescing

In hardware: compare the address of each vector element
Coalesce memory accesses that fall within the same segment
  Unit-strided requests → one transaction
  Irregular requests → multiple transactions
Dynamically detects parallel memory regularity

Array of Structures (AoS)

Programmer-friendly memory layout: group data logically

struct Pixel {
    float r, g, b;
};
Pixel image_AoS[480][640];

kernel void luminance(Pixel img[][], float luma[][]) {
    int x = tid.x;
    int y = tid.y;
    luma[y][x] = .59*img[y][x].r + .11*img[y][x].g + .30*img[y][x].b;
}

Access pattern in SIMT: (non-unit) strided load, not as efficient as a unit-strided load
Need to rethink data structures for fine-grained threading

Structure of Arrays (SoA)

struct Image {
    float R[480][640];
    float G[480][640];
    float B[480][640];
};
Image image_SoA;

kernel void luminance(Image img, float luma[][]) {
    int x = tid.x;
    int y = tid.y;
    luma[y][x] = .59*img.R[y][x] + .11*img.G[y][x] + .30*img.B[y][x];
}

Access pattern in SIMT: unit-strided load, efficient thanks to coalescing
With fine-grained threading: value regularity
  Homogeneous data in memory
  Variables close in value are close in space

Example: call stack

Coarse-grained (x86…): split stacks, one per thread
  No false sharing: good for split, coherent caches (multi-cores)
Fine-grained (Fermi): interleaved stacks
  False sharing, but good for spatial locality!

Outline

Background: GPU architecture requirements
From a sequential processor to a GPU
Parallel control regularity
Parallel memory locality
Parallel value locality

On-chip memory

Conventional wisdom: cache area in CPU vs. GPU, according to the NVIDIA CUDA Programming Guide
But…
…if we include registers (register files + caches):
  NVIDIA GF110 GPU:   3.9 MB
  AMD Tahiti GPU:    11.9 MB
  Intel Core i7 CPU:  9.3 MB
GPUs now have more internal memory than desktop CPUs

Little's law: data = throughput × latency

[Figure: throughput (GB/s) versus latency (ns) for the Intel Core i7 920 and the NVIDIA GeForce GTX 580, from L1 through L2 to DRAM. At higher throughput and longer latency, the GPU must keep far more data in flight.]

J. Little. A proof for the queuing formula L = λW. JSTOR 1961.

The cost of SIMT: register redundancy

SIMD                        SIMT
mov i ← 0                   mov i ← tid
loop:                       loop:
    vload T ← X[i]              load t ← X[i]
    vmul T ← a×T                mul t ← a×t
    vstore X[i] ← T             store X[i] ← t
    add i ← i+16                add i ← i+tcnt
    branch i<n? loop            branch i<n? loop

In SIMD, the scalar instructions (add, branch) and scalar registers (i = 0, a = 17, n = 51) exist once per vector;
in SIMT, every thread executes and stores them redundantly (a = 17 17 17 17…, i = 0 1 2 3…, n = 51 51 51 51…)

What are we computing on?

Uniform data: in a warp, v[tid] = c (e.g. 5 5 5 5 5 5 5 5, with c = 5)
Affine data: in a warp, v[tid] = b + tid × s, base b, stride s (e.g. 8 9 10 11 12 13 14 15, with b = 8, s = 1)

[Chart: average frequency of uniform, affine, and other values in GPGPU applications, for operations and register-file reads]

Scalarization

Factor out common calculations and common registers across threads
  The affine vector 8 9 10 11 12 13 14 15 compacts to 8 + 1×tid
  The uniform vector 5 5 5 5 5 5 5 5 compacts to 5 + 0×tid
  Their sum, 13 14 15 16 17 18 19 20, expands from 13 + 1×tid
Dynamic scalarization: tagged register file
Static scalarization: compiler analysis

Dynamic scalarization: tagged registers

Associate a tag to each vector register: Uniform, Affine, or unKnown
Propagate tags across arithmetic instructions

Instructions               Tags
mov i ← tid                A ← A
loop:
    load t ← X[i]          K ← U[A]
    mul t ← a×t            K ← U×K
    store X[i] ← t         U[A] ← K
    add i ← i+tcnt         A ← A+U
    branch i<n? loop

In the register file, a Uniform register (e.g. a = 17) stores one element and an Affine register (e.g. i) stores base and stride, instead of one element per thread

A scalarizing compiler?
[Diagram: scalar-only sequential programming models target scalar CPUs; a vectorizing compiler maps them onto actual scalar+SIMD CPUs; vector-only SPMD models (CUDA, OpenCL) map directly onto SIMT GPUs; a scalarizing compiler would map SPMD programs onto scalar+SIMD architectures.]

Compile SPMD programs to SIMD

                        SIMD                             SIMT
Instruction regularity  Vectorization at compile time    Vectorization at runtime
Control regularity      Software-managed: bit-masking,   Hardware-managed: stack,
                        predication                      counters, multiple PCs
Memory regularity       Compiler selects vector          Hardware-managed: gather-scatter
                        load-store or gather-scatter     with hardware coalescing

Same set of optimizations: perform them at compile time rather than at runtime

Identify scalar features?

Scalar registers: uniform vectors (e.g. a = 2 2 … 2, b = 3 3 … 3, c = a+b = 5 5 … 5) and affine vectors with known stride
Scalar operations: uniform inputs, uniform outputs
Uniform branches: uniform conditions (e.g. c = 0 0 … 0)
Vector load/store (vs. gather/scatter): affine aligned addresses (e.g. x = t[i] with i = 8 9 … 15 loads t[8]…t[15])
In all cases: divergence analysis

Teaser: be sure to attend the second part of Fernando's course

Conclusion

SIMT bridges the gap between superscalar and SIMD
Smooth, dynamic tradeoff between regularity and efficiency

[Chart: efficiency (flops/W) versus regularity of the application, from superscalar through SIMT to SIMD. Research directions along the way: dynamic warp formation, dynamic warp subdivision, NIMT, dynamic scalarization, decoupled scalar-SIMT, affine caches. Example workloads, from irregular to regular: transaction processing, computer graphics, dense linear algebra.]