GPU Architecture Part 2: SIMT Control Flow Management
Caroline Collange
Inria Rennes – Bretagne Atlantique
https://team.inria.fr/pacap/members/collange/
Master 2 SIF ADA - 2019

Outline

Running SPMD software on SIMD hardware
    Context: software and hardware
    The control flow divergence problem
Stack-based control flow tracking
    Stacks
    Counters
Path-based control flow tracking
    The idea: use PCs
    Implementation: path list
    Applications
Software approaches
    Use cases and principle
    Scalarization
Research directions

Analogy: Out-of-order microarchitecture

Software:

    void scale(float a, float * X, int n)
    {
        for(int i = 0; i != n; ++i)
            X[i] = a * X[i];
    }

    scale:
        test esi, esi
        je .L4
        sub esi, 1
        xor eax, eax
        lea rdx, [4+rsi*4]
    .L3:
        movss xmm1, DWORD PTR [rdi+rax]
        mulss xmm1, xmm0
        movss DWORD PTR [rdi+rax], xmm1
        add rax, 4
        cmp rax, rdx
        jne .L3
    .L4:
        rep ret

Architecture: sequential programming model
Dark magic!
Hardware datapaths: dataflow execution model
Hardware
The "dark magic" between the sequential programming model and the dataflow hardware is the out-of-order superscalar microarchitecture.

GPU microarchitecture: where it fits

Software:

    __global__ void scale(float a, float * X)
    {
        unsigned int tid;
        tid = blockIdx.x * blockDim.x + threadIdx.x;
        X[tid] = a * X[tid];
    }

Architecture: multi-thread programming model
Dark magic! SIMT microarchitecture
Hardware datapaths: SIMD execution units
Hardware

Programmer's view

Programming model
    SPMD: Single program, multiple data
    One kernel code, many threads
    Unspecified execution order between explicit synchronization barriers
Languages
    Graphics shaders: HLSL, Cg, GLSL
    GPGPU: C for CUDA, OpenCL
Kernel, for n threads: X[tid] ← a * X[tid]
(Diagram: threads running the kernel, separated by a barrier.)

Types of control flow

Structured control flow: single-entry, single-exit, properly nested
    Conditionals: if-then-else
    Single-entry, single-exit loops: while, do-while, for…
    Function call-return…
Unstructured control flow
    break, continue
    && || short-circuit evaluation
    Exceptions
    Coroutines
    goto, comefrom
    Code that is hard to indent!

Warps

Threads are grouped into warps of fixed size
    Warp 0: T0 T1 T2 T3
    Warp 1: T4 T5 T6 T7

Image: An early SIMT architecture.
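As a rough illustration of the warp grouping above, the sketch below maps a flat thread index to a warp index and a position inside the warp using integer division and remainder. The warp size of 4 matches the diagram only; WARP_SIZE, warp_of, lane_of and thread_id are names made up for this example, and real GPUs use larger warps (32 threads on NVIDIA hardware, for instance).

```c
#include <assert.h>

/* Assumption: 4 threads per warp, matching the diagram (T0..T3, T4..T7). */
#define WARP_SIZE 4u

/* Hypothetical helper: warp that a flat thread index belongs to. */
static inline unsigned warp_of(unsigned tid) { return tid / WARP_SIZE; }

/* Hypothetical helper: position of the thread inside its warp. */
static inline unsigned lane_of(unsigned tid) { return tid % WARP_SIZE; }

/* Inverse mapping: flat thread index from a (warp, position) pair. */
static inline unsigned thread_id(unsigned warp, unsigned lane)
{
    assert(lane < WARP_SIZE);   /* a position must fit inside one warp */
    return warp * WARP_SIZE + lane;
}
```

With these definitions T5 falls in warp 1 at position 1, and thread_id(1, 3) gives back T7, the last thread of warp 1 in the diagram.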
(Image: Musée Gallo-Romain de St-Romain-en-Gal, Vienne)

Control flow: uniform or divergent

Control is uniform when all threads in the warp follow the same path
Control is divergent when different threads follow different paths

                                  Outcome per thread
                                  Warp: T0 T1 T2 T3
    x = 0;
    // Uniform condition
    if(a[x] > 17) {                     1  1  1  1
        x = 1;
    }
    // Divergent condition
    if(tid < 2) {                       1  1  0  0
        x = 2;
    }

Computer architect view

SIMD execution inside a warp
    One SIMD lane per thread
    All SIMD lanes see the same instruction
Control-flow differentiation using execution mask
    All instructions controlled with 1 bit per lane
    1 → perform instruction
    0 → do nothing

Running independent threads in SIMD

How to keep threads synchronized?
    Challenge: divergent control flow
Rules of the game
    One thread per SIMD lane
    Same instruction on all lanes
    Lanes can be individually disabled with execution mask
Which instruction?
How to compute execution mask?

    x = 0;
    // Uniform condition
    if(tid > 17) {
        x = 1;
    }
    // Divergent conditions
    if(tid < 2) {
        if(tid == 0) {
            x = 2;
        }
        else {
            x = 3;
        }
    }

(Diagram: Thread 0 to Thread 3 mapped to Lane 0 to Lane 3, all receiving 1 instruction.)

Example: if

Execution mask inside if statement = if condition

                                                      Warp: T0 T1 T2 T3
    x = 0;
    // Uniform condition
    if(a[x] > 17) {       if condition: a[x] > 17?          1  1  1  1
        x = 1;            Execution mask:                   1  1  1  1
    }
    // Divergent condition
    if(tid < 2) {         if condition: tid < 2?            1  1  0  0
        x = 2;            Execution mask:                   1  1  0  0
    }

State of the art in 1804

The Jacquard loom is a GPU
    Framebuffer
    Shaders
    Multiple warp threads with per-thread conditional execution
    Execution mask given by punched cards
    Supports 600 to 800 parallel threads

Conditionals that no thread executes

Do not waste energy fetching instructions not executed
    Skip instructions when execution mask is all-zeroes

                                                      Warp: T0 T1 T2 T3
    x = 0;
    // Uniform condition
    if(a[x] > 17) {       if condition: a[x] > 17?          1  1  1  1
        x = 1;            Execution mask:                   1  1  1  1
    }
    else {                Execution mask:                   0  0  0  0
        x = 0;            Jump to end-if
    }
    // Divergent condition
    if(tid < 2) {         if condition: tid < 2?            1  1  0  0
        x = 2;            Execution mask:                   1  1  0  0
    }

Uniform branches are just usual scalar branches

What about loops?

Keep looping until all threads exit
    Mask out threads that have exited the loop

    i = 0;
    while(i < tid) {
        i++;
    }
    print(i);

Execution trace (time runs downward):

    Instruction             Warp: T0 T1 T2 T3
    i = 0;       i =?             0  0  0  0
    i < tid?                      0  1  1  1
    i++;         i =?             0  1  1  1
    i < tid?                      0  0  1  1
    i++;         i =?             0  1  2  2
    i < tid?                      0  0  0  1
    i++;         i =?             0  1  2  3
    i < tid?                      0  0  0  0   No active thread left → restore mask and exit loop
    print(i);    i =?             0  1  2  3

What about nested control flow?

    x = 0;
    // Uniform condition
    if(tid > 17) {
        x = 1;
    }
    // Divergent conditions
    if(tid < 2) {
        if(tid == 0) {
            x = 2;
        }
        else {
            x = 3;
        }
    }

We need a generic solution!

Quiz

What do Star Wars and GPUs have in common?

Answer: Pixar!

In the early 1980s, the Computer Division of Lucasfilm was designing custom hardware for computer graphics
Acquired by Steve Jobs in 1986 and became Pixar
Their core product: the Pixar Image Computer
This early GPU handles nested divergent control flow!

Pixar Image Computer: architecture overview

(Diagram: a control unit sends each instruction to ALU 0, ALU 1, ALU 2 and ALU 3, and push/pop commands to the mask stack, which supplies the execution mask.)

The mask stack of the Pixar Image Computer

1 activity bit / thread, written in the order tid=0 tid=1 tid=2 tid=3

    Code                        Operation   Mask stack (bottom → top)
    x = 0;
    // Uniform condition
    if(tid > 17) {                          1111
        x = 1;                  skip
    }                                       1111
    // Divergent conditions
    if(tid < 2) {               push        1111 1100
        if(tid == 0) {          push        1111 1100 1000
            x = 2;
        }                       pop         1111 1100
        else {                  push        1111 1100 0100
            x = 3;
        }                       pop         1111 1100
    }                           pop         1111

A. Levinthal and T. Porter. Chap - a SIMD graphics processor. SIGGRAPH '84, 1984.

Observation: stack content is a histogram

Successive stack contents, one column per step (bottom level on the last row):

                  1000          0100
          1100    1100   1100   1100   1100
          1111    1111   1111   1111   1111   1111

On structured control flow: columns of 1s
    A thread active at level n is active at all levels i<n
    Conversely: no "zombie" thread gets revived at level i>n if inactive at n
The height of each column of 1s is enough
    Alternative implementation: maintain an activity counter for each thread

R. Keryell and N. Paris. Activity counter: New optimization for the dynamic scheduling of SIMD control flow. ICPP '93, 1993.

With activity counters

1 (in)activity counter / thread, written in the order tid=0 tid=1 tid=2 tid=3

    Code                        Operation   Counters
    x = 0;
    // Uniform condition
    if(tid > 17) {
        x = 1;                  skip
    }                                       0 0 0 0
    // Divergent conditions
    if(tid < 2) {               inc         0 0 1 1
        if(tid == 0) {          inc         0 1 2 2
            x = 2;
        }                       dec         0 0 1 1
        else {                  inc         1 0 2 2
            x = 3;
        }                       dec         0 0 1 1
    }                           dec         0 0 0 0

Activity counters in use

In an SIMD prototype from École des Mines de Paris in 1993
In Intel integrated graphics since 2004
Credits: Ronan Keryell

Outline

Running SPMD software on SIMD hardware
    Context: software and hardware
    The control flow divergence problem
Stack-based control flow tracking
    Stacks, counters
    A compiler perspective
Path-based control flow tracking
    The idea: use PCs
    Implementation: path list
    Applications
Software approaches
    Use cases and principle
    Scalarization
Research directions

Back to ancient history

(Timeline figure, 2000 to 2010.)
    Microsoft DirectX: 7.x, 8.0, 8.1, 9.0a, 9.0b, 9.0c, 10.0, 10.1, 11
    NVIDIA: NV10, NV20, NV30, NV40, G70, G80-G90, GT200, GF100
    ATI/AMD: R100, R200, R300, R400, R500, R600, R700, Evergreen
    Milestone labels on the figure: Unified shaders, FP 16, Programmable shaders, FP 32, Dynamic control flow, SIMT, CUDA, FP 24, CTM, FP 64, CAL, GPGPU traction

Early days of programmable shaders

It is the 21st century!
Graphics cards now look and sound like hair dryers
    The infamous GeForce FX 5800 Ultra
Graphics shaders are programmed in assembly-like language
    Direct3D shader assembly, OpenGL ARB Vertex/Fragment Program..
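Returning to the two stack-based schemes from the earlier slides (the mask stack of the Pixar Image Computer and the activity counters of Keryell and Paris), the C sketch below replays the nested if/else example on both and checks that they select the same active threads at every step. It is a software model under stated assumptions, not a description of either machine: the 4-thread warp, the depth limit, and every name (mask_stack, activity_counters, ms_*, ac_*, replay_nested_example) are made up for the illustration.

```c
#include <assert.h>
#include <string.h>

enum { NTHREADS = 4, MAX_DEPTH = 8 };   /* assumed sizes for this sketch */

/* Bit t of a mask stands for thread t, so the slides' "1100"
   (tid=0 and tid=1 active) is 0x3 here. */
typedef struct {
    unsigned level[MAX_DEPTH];  /* one activity mask per nesting level */
    int top;                    /* index of the innermost level */
} mask_stack;

typedef struct {
    unsigned count[NTHREADS];   /* 0 = active, n > 0 = inactive for n levels */
} activity_counters;

static void ms_init(mask_stack *s)
{
    s->level[0] = (1u << NTHREADS) - 1;   /* all threads active */
    s->top = 0;
}

/* push: keep only the currently active threads whose condition holds */
static void ms_push(mask_stack *s, unsigned cond)
{
    assert(s->top + 1 < MAX_DEPTH);
    s->level[s->top + 1] = s->level[s->top] & cond;
    s->top++;
}

static void ms_pop(mask_stack *s) { assert(s->top > 0); s->top--; }

static unsigned ms_active(const mask_stack *s) { return s->level[s->top]; }

static void ac_init(activity_counters *c) { memset(c, 0, sizeof *c); }

/* inc: a thread that is already inactive, or whose condition fails,
   moves one level deeper into inactivity */
static void ac_enter(activity_counters *c, unsigned cond)
{
    for (int t = 0; t < NTHREADS; ++t)
        if (c->count[t] > 0 || !((cond >> t) & 1u))
            c->count[t]++;
}

/* dec: every inactive thread moves one level back toward activity */
static void ac_leave(activity_counters *c)
{
    for (int t = 0; t < NTHREADS; ++t)
        if (c->count[t] > 0)
            c->count[t]--;
}

static unsigned ac_active(const activity_counters *c)
{
    unsigned m = 0;
    for (int t = 0; t < NTHREADS; ++t)
        if (c->count[t] == 0)
            m |= 1u << t;
    return m;
}

/* Stores the stack's active mask and reports whether the counters agree. */
static int record(const mask_stack *s, const activity_counters *c, unsigned *out)
{
    *out = ms_active(s);
    return *out == ac_active(c);
}

/* Replays if(tid < 2) { if(tid == 0) {..} else {..} } on both structures.
   Returns how many of the 6 steps yield identical active masks. */
int replay_nested_example(unsigned trace[6])
{
    const unsigned lt2 = 0x3u;  /* tid < 2 holds for threads 0 and 1 */
    const unsigned eq0 = 0x1u;  /* tid == 0 holds for thread 0 only  */
    mask_stack s;
    activity_counters c;
    int agree = 0;

    ms_init(&s);
    ac_init(&c);
    ms_push(&s, lt2);  ac_enter(&c, lt2);  agree += record(&s, &c, &trace[0]);
    ms_push(&s, eq0);  ac_enter(&c, eq0);  agree += record(&s, &c, &trace[1]);
    ms_pop(&s);        ac_leave(&c);       agree += record(&s, &c, &trace[2]);
    ms_push(&s, ~eq0); ac_enter(&c, ~eq0); agree += record(&s, &c, &trace[3]);
    ms_pop(&s);        ac_leave(&c);       agree += record(&s, &c, &trace[4]);
    ms_pop(&s);        ac_leave(&c);       agree += record(&s, &c, &trace[5]);
    return agree;
}
```

The agreement at every step is the "histogram" observation in executable form: on structured control flow each thread's bits form a column of 1s, so a single small counter per thread carries the same information as the whole stack of masks.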