Work factorization for efficient throughput architectures

Sylvain Collange Departamento de Ciência da Computação, ICEx Universidade Federal de Minas Gerais [email protected]

February 01, 2012

GPGPU in HPC, today

Graphics Processing Unit (GPU)
Made for video games: mass market, low unit price, amortized R&D
Inexpensive, high-performance parallel processor
2002: General-Purpose computation on GPU (GPGPU)
2012: 3 out of top 5 supercomputers use GPUs

#4 Dawning Nebulae (China)

#2 Tianhe-1A (China)

#5 Tsubame 2.0 (Japan)

GPGPU in the future?

Yesterday (2000-2010): homogeneous multi-core, discrete components: Central Processing Unit (CPU) and Graphics Processing Unit (GPU)

Today (2011-...): chip-level integration: AMD Fusion, NVIDIA Denver/Maxwell project, many embedded SoCs

Tomorrow: heterogeneous multi-core, combining latency-optimized cores, throughput-optimized cores and hardware accelerators. GPUs to blend into heterogeneous multi-core chips as throughput-optimized cores?

Outline

Background: GPU architecture requirements
From a sequential processor to a GPU
Parallel control regularity
Parallel memory locality
Parallel value locality

What do we need GPUs for?

1. 3D graphics rendering for games: complex lighting computations…

2. Computer-Aided Design workstations: complex geometry

3. GPGPU: complex synchronization, data movements

One chip to rule them all: find the common denominator

The (simplest) graphics rendering pipeline

[Pipeline diagram: Vertices → Vertex shader → Clipping, Rasterization, Attribute interpolation → Fragments → Fragment shader (reads Textures) → Z-Compare (Z-Buffer), Blending → Pixels. Shaders are programmable stages; the other stages are parametrizable.]

How much performance do we need

… to run 3DMark 11 at 50 frames/second?

Element        Per frame   Per second
Vertices       12.0M       600M
Primitives     12.6M       630M
Fragments      180M        9.0G
Instructions   14.4G       720G

Intel Core i7 2700K: 56 Ginsn/s peak.
We need to go 13x faster (720 Ginsn/s / 56 Ginsn/s ≈ 13).

Source: Damien Triolet, Hardware.fr

Constraints

Memory wall: memory speed does not increase as fast as computing speed; it is more and more difficult to hide memory latency.
Power wall: power consumption of transistors does not decrease as fast as density increases; performance is now limited by power consumption.

Latency vs. throughput

Latency: time to solution. CPUs minimize time, at the expense of power.

Throughput: quantity of tasks processed per unit of time. GPUs assume unlimited parallelism and minimize energy per operation.

Amdahl's law

Bounds speedup attainable on a parallel machine

S = 1 / ((1 − P) + P / N)

where S is the speedup, P the ratio of parallel portions and N the number of processors; (1 − P) is the time to run the sequential portions and P/N the time to run the parallel portions.
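For example, with P = 0.95 and N = 100 processors, S = 1 / (0.05 + 0.95/100) ≈ 16.8: the 5% sequential portion caps the speedup far below 100.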

[Plot: speedup S as a function of the number of available processors N.]

G. Amdahl. Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities. AFIPS 1967.

Why heterogeneous architectures?

S = 1 / ((1 − P) + P / N), with (1 − P) the time to run the sequential portions and P/N the time to run the parallel portions.

Latency-optimized multi-core (CPU): low efficiency on parallel portions, spends too many resources.

Throughput-optimized multi-core (GPU): low performance on sequential portions.

Heterogeneous multi-core (CPU+GPU): use the right tool for the right job; allows aggressive optimization for latency or for throughput.

M. Hill, M. Marty. Amdahl's law in the multicore era. IEEE Computer, 2008.

Programming model: multi-threading

1 vertex = 1 thread: computes spatial coordinates, texture coordinates…

1 fragment = 1 thread: computes color, lighting…

GPGPU: Bulk-Synchronous Parallel (BSP) model (NVIDIA CUDA, OpenCL…); threads synchronize at barriers.

The program describes the operation to apply to each element: SPMD (Single Program, Multiple Data).

L. Valiant. A bridging model for parallel computation. Comm. ACM 1990.
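As a concrete illustration of the SPMD style, here is a minimal CUDA sketch (kernel name and launch configuration are illustrative, not taken from the slides) for the scalar-vector multiplication X ← a·X used as the running example later:

// One thread per element: every thread runs the same program on its own data.
__global__ void scale(float *X, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        X[i] = a * X[i];
}

// Host-side launch: one thread per element, grouped in blocks of 256 threads.
// scale<<<(n + 255) / 256, 256>>>(d_X, a, n);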

Threading granularity

Coarse-grained threading: decouple tasks to reduce conflicts and inter-thread communication (e.g. MPI, OpenMP). Thread T0 gets X[0..3], T1 gets X[4..7], T2 gets X[8..11], T3 gets X[12..15].

Fine-grained threading: interleave tasks (e.g. OpenCL, CUDA). Thread T0 gets X[0], X[4], X[8], X[12]; T1 gets X[1], X[5], X[9], X[13]; and so on.
Exhibit locality: neighbor threads share memory.
Exhibit regularity: neighbor threads have a similar behavior.
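A hedged CUDA sketch of the two granularities applied to X ← a·X (illustrative code, not from the slides):

// Coarse-grained: thread tid owns the contiguous slice X[tid*chunk .. (tid+1)*chunk).
__global__ void scale_coarse(float *X, float a, int n) {
    int tid   = blockIdx.x * blockDim.x + threadIdx.x;
    int tcnt  = gridDim.x * blockDim.x;
    int chunk = (n + tcnt - 1) / tcnt;
    for (int i = tid * chunk; i < n && i < (tid + 1) * chunk; ++i)
        X[i] = a * X[i];
}

// Fine-grained: threads are interleaved, so neighbor threads touch neighbor
// elements (the locality and regularity the slide mentions).
__global__ void scale_fine(float *X, float a, int n) {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int tcnt = gridDim.x * blockDim.x;
    for (int i = tid; i < n; i += tcnt)
        X[i] = a * X[i];
}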

Parallel regularity

Similarity in behavior between threads.

Control regularity (example: switch(i) { case 2:… case 17:… case 21:… }): regular when all threads of the warp have i=17; irregular when they have i=21, 4, 17, 2.
Memory regularity (example: r=A[i]): regular when threads load A[8], A[9], A[10], A[11]; irregular when they load A[8], A[0], A[11], A[3].
Data regularity (example: r=a*b): regular when all threads have a=32, b=52; irregular when a=17, -5, 11, 42 and b=15, 0, -2, 52.

Outline

Background: GPU architecture requirements
From a sequential processor to a GPU
Parallel control regularity
Parallel memory locality
Parallel value locality

First step: sequential, pipelined processor

Let's build a GPU.
Our application: scalar-vector multiplication: X ← a∙X.
First idea: run each thread sequentially.

Source code:
  for i = 0 to n-1
      X[i] ← a * X[i]

Assembly:
  move i ← 0
loop:
  load t ← X[i]
  mul t ← a×t
  store X[i] ← t
  add i ← i+1
  branch i<n? loop

[Pipeline diagram: Fetch, Decode, Execute, Memory stages, with add i ← 18 in Fetch, store X[17] in Decode and mul in Execute.]

Homogeneous multi-core

Replication of the complete execution engine

Code (each thread works on its own slice of X):
  move i ← slice_begin
loop:
  load t ← X[i]
  mul t ← a×t
  store X[i] ← t
  add i ← i+1
  branch i<slice_end? loop

[Diagram: two replicated pipelines (IF, ID, EX, Memory); one runs thread T0 (add i ← 18 in fetch, store X[17] in decode), the other runs thread T1 (add i ← 50 in fetch, store X[49] in decode).]

Threads: T0 T1

Improves throughput thanks to explicit parallelism

Interleaved multi-threading

Time-multiplexing of the processing units; same software view.

Code:
  move i ← slice_begin
loop:
  load t ← X[i]
  mul t ← a×t
  store X[i] ← t
  add i ← i+1
  branch i<slice_end? loop

[Diagram: a single pipeline (Fetch, Decode, Execute, Memory) with instructions from different threads in flight: mul in Fetch and Decode, add i ← 73 in Execute, add i ← 50 in Memory.]

Threads: T0 T1 T2 T3

Hides latency thanks to explicit parallelism

Single Instruction, Multiple Threads (SIMT)

Factorization of the fetch/decode and load-store units:
Fetch 1 instruction on behalf of several threads.
Read 1 memory location and broadcast it to several registers.

[Diagram: one Fetch/Decode unit and one Memory unit shared by threads T0-T3; a load (threads 0-3) in fetch, a store (threads 0-3) in memory access, and four mul instances, one per thread, in execute.]

In NVIDIA-speak:
SIMT: Single Instruction, Multiple Threads.
Convoy of synchronized threads: warp.
Improves area/power efficiency thanks to regularity.
Consolidates memory transactions: less memory pressure.

What about SIMD?

Single Instruction Multiple Data

Source code:
  for i = 0 to n-1 step 4
      X[i..i+3] ← a * X[i..i+3]

Assembly:
loop:
  vload T ← X[i]
  vmul T ← a×T
  vstore X[i] ← T
  add i ← i+4
  branch i<n? loop

[Pipeline diagram: add i ← 20 in Fetch, vstore X[16..19] in Memory, vmul in Execute.]

Vectors, not threads: no “true” thread divergence allowed.

Flynn's taxonomy

Classification of parallel architectures

                Single Instruction   Multiple Instruction
Single Data     SISD                 MISD
Multiple Data   SIMD                 MIMD

[Diagrams: each class drawn with its Fetch (F) and Execute (X) units.]

M. Flynn. Some Computer Organizations and Their Effectiveness. IEEE TC 1972.

Flynn's taxonomy revisited
…to account for multi-threading

Resources, one per pipeline stage: Fetch F (Instruction), RF + Execute X (Data), Memory M (Address), shared among threads T0-T3.

Single resource shared by all threads:  SIMT (one F), SDMT (one X), SAMT (one M).
Multiple resources, one per thread:     MIMT, MDMT, MAMT.

Mostly orthogonal: mix and match to build your own _I_D_A_T pipeline!

Examples: conventional design points

Multi-core (most CPUs of today): MIMD(MAMT), each thread has its own F, X and M.

Short-vector SIMD (multimedia instruction set extensions in CPUs): SIMD(SAST), a single thread issuing vector instructions through one F, X and M.

GPU (NVIDIA, AMD, Intel... GPUs): SI(MDSA)MT, one Fetch and one Memory access unit shared by many threads, one Execute unit per thread.

GPU design space: not just SIMT

How can we run SPMD threads?

Spatial / horizontal factorization: from MIMD (multi-core) to SIMT.
Temporal / vertical factorization: from fine-grained multi-threading to switch-on-event multi-threading.

Examples: NVIDIA GeForce GTX 280 (2008), NVIDIA GTX 480 (2010), AMD 5870 (2011), AMD Radeon 7870 (2012), NVIDIA Echelon project (2017?).

Programmer's point of view: only threads.

Example GPU: NVIDIA GeForce GTX 580

SIMT: warps of 32 threads
16 SMs / chip
2×16 cores / SM, 48 warps / SM

[Diagram: within each SM (SM1 … SM16), warps 1 to 48 are time-multiplexed over the cores.]

1580 Gflop/s
Up to 24576 threads in flight

Outline

Background: GPU architecture requirements
From a sequential processor to a GPU
Parallel control regularity
Parallel memory locality
Parallel value locality

Divergence statistics

50% to 85% of branches are uniform: inside a warp, all threads take the branch or none do.
Easy case; need to avoid costly predication on uniform branches.
Fully dynamic, hardware implementation.

How to keep threads synchronized?

Issue: control divergence

Rules of the game:
One thread per Processing Element (PE).
All PEs execute the same instruction.
PEs can be individually disabled.

Example code:
  x = 0;
  // Uniform condition
  if(tid > 17) {
      x = 1;
  }
  // Divergent conditions
  if(tid < 2) {
      if(tid == 0) {
          x = 2;
      }
      else {
          x = 3;
      }
  }

[Diagram: 1 instruction broadcast to PE 0 - PE 3, running Thread 0 - Thread 3.]

The standard way: mask stack

1 activity bit per thread (threads tid = 0 to 3).

Code                            Hardware mask stack
x = 0;                          1111
// Uniform condition
if(tid > 17) {                  1111   (uniform: no thread takes it, skip)
    x = 1;
}                               1111
// Divergent conditions
if(tid < 2) {                   1111 1100   (push)
    if(tid == 0) {              1111 1100 1000   (push)
        x = 2;
    }                           1111 1100   (pop)
    else {                      1111 1100 0100   (push)
        x = 3;
    }                           1111 1100   (pop)
}                               1111   (pop)

A. Levinthal and T. Porter. Chap - a SIMD graphics processor. SIGGRAPH'84, 1984.
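A minimal host-side sketch (plain C++, illustrative only, not the actual hardware) of how the activity mask and its stack evolve for the divergent part of the code above; here bit t of the mask is the activity bit of thread tid = t:

#include <cstdio>
#include <vector>

int main() {
    unsigned mask = 0xF;              // 4-thread warp, all threads active
    std::vector<unsigned> stack;      // mask stack

    // if (tid < 2): divergent, push the current mask, keep threads 0 and 1
    stack.push_back(mask);
    mask &= 0x3;
    // if (tid == 0): divergent, push again, keep thread 0 for "x = 2;"
    stack.push_back(mask);
    unsigned then_mask = mask & 0x1;
    // else: threads of the enclosing mask that did not take the "then" path
    unsigned else_mask = stack.back() & ~then_mask;
    printf("then %x, else %x\n", then_mask, else_mask);  // prints "then 1, else 2"
    mask = stack.back(); stack.pop_back();   // pop inner if: threads 0 and 1 again
    mask = stack.back(); stack.pop_back();   // pop outer if: all 4 threads active
    printf("after reconvergence: %x\n", mask);           // prints "f"
    return 0;
}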

Goto considered harmful?

[Table: control-flow instructions in several CPU and GPU instruction sets: MIPS, NVIDIA Tesla (2007), NVIDIA Fermi (2010), Intel GMA Gen4 (2006), Intel GMA SB (2011), AMD R500 (2005), AMD R600 (2007), AMD Cayman (2011). GPU sets expose instructions such as bra, brk, cont, ssy, cal/ret, if/else/endif, loop_start/loop_end, push/pop, alu_push_before…]

Control flow structure is explicit in GPU-specific instruction sets.
No support for arbitrary control flow.

Alternative: 1 PC / thread

Code:
  x = 0;
  if(tid > 17) {
      x = 1;
  }
  if(tid < 2) {
      if(tid == 0) {
          x = 2;
      }
      else {
          x = 3;
      }
  }

[Diagram: one Program Counter (PC) per thread, tid = 0, 1, 2, 3, compared against a Master PC: match means the thread is active, no match means it is inactive.]

31 Scheduling policy: min(SP:PC)

Which PC to choose as master PC?

Conditionals, loops: follow the order of code addresses → min(PC).
Functions: favor the maximum nesting depth → min(SP), with compiler support.
Unstructured control flow is supported too, with no code duplication.

[Table: source / assembly / execution-order examples for an if/else, a while loop, and a call to a function f().]

G. Diamos et al. SIMD re-convergence at thread frontiers. MICRO 44, 2011.
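A hypothetical sketch (plain C++, illustrating the policy rather than any actual hardware) of the min(SP:PC) election among the threads of a warp, assuming stacks grow downward so that a smaller SP means a deeper nesting:

struct ThreadCtx {
    unsigned sp;   // stack pointer: smaller value = deeper function nesting
    unsigned pc;   // program counter
};

// Elect the master context: deepest nesting first (min SP), then earliest
// code address (min PC). Threads whose PC matches the elected PC execute
// the fetched instruction; the others wait.
ThreadCtx electMaster(const ThreadCtx *t, int n) {
    ThreadCtx best = t[0];
    for (int i = 1; i < n; ++i)
        if (t[i].sp < best.sp || (t[i].sp == best.sp && t[i].pc < best.pc))
            best = t[i];
    return best;
}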

Multiple PC arbitration: functional model

[Diagram: a vote/broadcast stage computes the master PC (MPC) from the per-thread PCs 0..n; the instruction at MPC is fetched and broadcast to all threads; each thread compares its PC with MPC: on a match it executes the instruction and updates its PC, otherwise it discards the instruction.]

S. Collange. Stack-less SIMT reconvergence at low cost. Tech report, ENS Lyon, 2011.

Benefits of multiple-PC arbitration

Before (stack, counters)                        After (multiple PCs)
O(d) or O(log d) memory, d = nesting depth      O(1) memory
1 R/W port to memory                            No shared state
Exceptions: stack overflow, underflow           Allows thread suspension, restart, migration
Partial SIMD semantics (Bougé-Levaire)          Full SPMD semantics (multi-thread)
C-style structured control flow only            Arbitrary control flow
Specific instruction sets                       Traditional instruction sets, languages, compilers

Enables many new architecture ideas.

Outline

Background: GPU architecture requirements
From a sequential processor to a GPU
Parallel control regularity
Parallel memory locality
Parallel value locality

Memory access patterns
In traditional vector processing

Easy: scalar load & broadcast, reduction & scalar store.
Easy: unit-strided load, unit-strided store.
Hard: (non-unit) strided load, (non-unit) strided store.
Hardest: gather, scatter.

[Diagrams: register-to-memory access patterns across threads T1..Tn for each case.]

In SIMT: every load is a gather, every store is a scatter.

Breakdown of memory access patterns

Vast majority: uniform or unit-strided, and even aligned vectors.

[Charts: breakdown of load and store access patterns.]

“In making a design trade-off, favor the frequent case over the infrequent case.”
J. Hennessy, D. Patterson. Computer architecture: a quantitative approach. MKP, 2007.

Memory coalescing

In hardware: compare the address of each vector element; coalesce memory accesses that fall within the same segment.

Unit-strided requests → one transaction.
Irregular requests → multiple transactions.

Dynamically detects parallel memory regularity
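A hedged CUDA sketch of the two cases (illustrative kernels; the exact coalescing rules depend on the GPU generation):

// Unit-strided: consecutive threads access consecutive addresses.
// The warp's accesses fall within one (or a few) segments: one transaction.
__global__ void copy_unit(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Irregular (here, large-stride) accesses: consecutive threads hit different
// segments, so the hardware splits the request into multiple transactions.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    long long j = (long long)i * stride;
    if (j < n) out[i] = in[j];
}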

Array of structures (AoS)

Programmer-friendly memory layout: group data logically.

struct Pixel {
    float r, g, b;
};
Pixel image_AoS[480][640];

Access pattern in SIMT: (non-unit) strided load; not as efficient as a unit-strided load.

kernel void luminance(Pixel img[][], float luma[][]) {
    int x = tid.x;
    int y = tid.y;
    luma[y][x] = .59*img[y][x].r + .11*img[y][x].g + .30*img[y][x].b;
}

Need to rethink data structures for fine-grained threading.

Structure of Arrays (SoA)

struct Image {
    float R[480][640];
    float G[480][640];
    float B[480][640];
};
Image image_SoA;

Access pattern in SIMT: unit-strided load; efficient thanks to coalescing.

With fine-grained threading: value regularity.
Homogeneous data in memory: variables close in value are close in space.

kernel void luminance(Image img, float luma[][]) {
    int x = tid.x;
    int y = tid.y;
    luma[y][x] = .59*img.R[y][x] + .11*img.G[y][x] + .30*img.B[y][x];
}

Example: call stack

Coarse-grained (…): split stacks, one per thread. No false sharing; good for split, coherent caches (multi-cores).
Fine-grained (Fermi): interleaved stacks, with the frames of threads T0..T3 interleaved within a cache line. False sharing, but good for spatial locality!

[Diagrams: per-thread stack pointers (SP) and cache-line layout in memory for both organizations.]

Outline

Background: GPU architecture requirements
From a sequential processor to a GPU
Parallel control regularity
Parallel memory locality
Parallel value locality

On-chip memory

Conventional wisdom: cache area in CPU vs. GPU according to the NVIDIA CUDA Programming Guide.

But... if we include registers (register files + caches):
NVIDIA GF110 (GPU): 3.9 MB
AMD Tahiti (GPU): 11.9 MB
Intel i7 (CPU): 9.3 MB

GPUs now have more internal memory than desktop CPUs.

Little's law: data = throughput × latency

[Plot: throughput (GB/s) vs. latency (ns) for the memory hierarchy of an Intel Core i7 920 (levels from about 1.25 ns to 50 ns latency) and an NVIDIA GeForce GTX 580 (L1 about 1500 GB/s at 30 ns, L2 about 320 GB/s at 210 ns, DRAM about 190 GB/s at 350 ns).]
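A rough worked example with the (approximate) GTX 580 DRAM figures above: sustaining 190 GB/s at 350 ns latency requires about 190e9 B/s × 350e-9 s ≈ 66 KB of data in flight; at 4 bytes per access, that is on the order of 16,000 concurrent memory accesses, which is why the chip keeps tens of thousands of threads in flight.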

J. Little. A proof for the queuing formula L = λW. JSTOR 1961.

The cost of SIMT: register redundancy

SIMD                        SIMT
  mov i ← 0                   mov i ← tid
loop:                       loop:
  vload T ← X[i]              load t ← X[i]
  vmul T ← a×T                mul t ← a×t
  vstore X[i] ← T             store X[i] ← t
  add i ← i+16                add i ← i+tcnt
  branch i<n? loop            branch i<n? loop

[Table: in SIMD, one scalar instruction stream (vload, vmul, vstore, add, branch) and mostly scalar registers (a = 17, i, n = 51 held once, T as a vector); in SIMT, every thread 0, 1, 2, 3, … repeats the same instructions (load, mul, store, add, branch) and holds its own copy of the registers: a = 17 in every thread, i = 0, 1, 2, 3, …, 15, n = 51 in every thread.]

What are we computing on?

Uniform data: in a warp, v[tid] = c. Example: 5 5 5 5 5 5 5 5 (c = 5).
Affine data: in a warp, v[tid] = b + tid × s, with base b and stride s. Example: 8 9 10 11 12 13 14 15 (b = 8, s = 1).

Average frequency in GPGPU applications

[Chart: breakdown of operations and register file (RF) reads into uniform, affine and other, from 0% to 100%.]

Scalarization

Factor out across threads: common calculations, common registers.

Example (compact the vectors to scalar form, compute, then expand):
  vector: 8 9 10 11 12 13 14 15     compact: 8 + 1×tid
  +       5 5 5 5 5 5 5 5                    5 + 0×tid
  =       13 14 15 16 17 18 19 20   expand:  13 + 1×tid

Dynamic scalarization: tagged register file.
Static scalarization: compiler analysis.

Dynamic scalarization: tagged registers

Associate a tag with each vector register: Uniform, Affine, unKnown.
Propagate tags across arithmetic instructions.

Trace (instructions and resulting tags):
  mov i ← tid          A ← A
loop:
  load t ← X[i]        K ← U[A]
  mul t ← a×t          K ← U×K
  store X[i] ← t       U[A] ← K
  add i ← i+tcnt       A ← A+U
  branch i<n? loop

Register file with tags (thread lanes 0, 1, 2, 3, …; X = unused):
  t: tag K, one value per thread
  a: tag U, value 17 stored once (17 X X X X …)
  i: tag A, base 0 and stride 1 stored (0 1 X X X …)
  n: tag U, value 51 stored once (51 X X X X …)
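A hypothetical sketch (plain C++, illustrating the idea rather than any specific hardware) of how such tags can be combined when arithmetic instructions execute:

// Tags attached to each vector register, as on the slide.
enum Tag { U /* Uniform */, A /* Affine */, K /* unKnown */ };

// Tag of x + y: the sum of two affine (or uniform) vectors is still affine;
// uniform is the special case of affine with stride 0.
Tag addTag(Tag x, Tag y) {
    if (x == K || y == K) return K;
    if (x == A || y == A) return A;
    return U;
}

// Tag of x * y: uniform times uniform stays uniform, uniform times affine
// scales the stride and stays affine, but affine times affine is no longer
// affine in general (it contains a tid² term).
Tag mulTag(Tag x, Tag y) {
    if (x == U && y == U) return U;
    if ((x == U && y == A) || (x == A && y == U)) return A;
    return K;
}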

A scalarizing compiler?

[Diagram: programming models vs. architectures. Sequential programming (scalar-only) and SPMD programming (vector-only: CUDA, OpenCL) map onto three architecture points: model CPU (scalar), actual CPU (scalar + SIMD), GPU (SIMT); the arrows are compiler transformations, a vectorizing compiler toward SIMD and a scalarizing compiler toward scalar execution.]

Compile SPMD programs to SIMD

                         SIMD                                SIMT
Instruction regularity   Vectorization at compile-time       Vectorization at runtime
Control regularity       Software-managed: bit-masking,      Hardware-managed: stack, counters,
                         predication                         multiple PCs
Memory regularity        Compiler selects vector load-store  Hardware-managed: gather-scatter
                         or gather-scatter                   with hardware coalescing

Same set of optimizations; perform them at compile time rather than at runtime.

Identify scalar features?

Scalar registers: uniform vectors, and affine vectors with known stride.
  Example c = a + b: a = 2 2 2 2 2 2 2 2, b = 3 3 3 3 3 3 3 3, c = 5 5 5 5 5 5 5 5.
Scalar operations: uniform inputs give uniform outputs.
Uniform branches: uniform conditions.
  Example: if(c) {…} else {…} with c = 0 in every thread.
Vector load/store (vs. gather/scatter): affine, aligned addresses.
  Example: x = t[i] with i = 8 9 10 11 12 13 14 15 reads t[8]…t[15] contiguously in memory.

In all cases: divergence analysis.
Teaser: be sure to attend the second part of Fernando's course.

Conclusion

SIMT bridges the gap between superscalar and SIMD.
Smooth, dynamic tradeoff between regularity and efficiency.

[Diagram: efficiency (per W) vs. regularity of the application (transaction processing, computer graphics, dense linear algebra), with superscalar, SIMT and SIMD design points, and techniques bridging them: dynamic scalarization, decoupled scalar-SIMT, affine caches… on one side, dynamic warp formation, dynamic warp subdivision, NIMT… on the other.]

Work factorization for efficient throughput architectures

Sylvain Collange Departamento de Ciência da Computação, ICEx Universidade Federal de Minas Gerais [email protected]

February 01, 2012