Threading Hardware in G80

1 Sources

• Slides from ECE 498 AL: Programming Massively Parallel Processors, Wen-Mei Hwu
• John Nickolls, NVIDIA

2 Fixed-function pipeline

[Pipeline diagram: a 3D application or game issues 3D API commands (OpenGL or Direct3D) across the CPU-GPU boundary (AGP/PCIe); the GPU front end receives the command and data stream, primitive assembly builds assembled primitives from the vertex index stream, rasterization and interpolation produce the pixel stream, and raster operations write pixel locations and updates to the frame buffer]

3 Programmable pipeline

[Pipeline diagram: same flow as above, with a programmable vertex processor turning pre-transformed vertices into transformed vertices and a programmable fragment processor turning pre-transformed fragments into transformed fragments]

4 Unified Programmable pipeline

[Pipeline diagram: same flow as above, with a single unified vertex, fragment, and geometry processor handling both pre-transformed vertices and pre-transformed fragments]

5 General Diagram (6800/NV40)

6 TurboCache

• Uses PCI-Express bandwidth to render directly to system memory
• Card needs less memory
• Performance boost while lowering cost
• TurboCache Manager dynamically allocates from main memory
• Local memory used to cache data and to deliver peak performance when needed

7 NV40 Vertex Processor

An NV40 vertex processor is able to execute one vector operation (up to four FP32 components), one scalar FP32 operation, and one texture access per clock cycle.

8 NV40 Fragment Processors

Early termination from mini z buffer and z buffer checks; resulting sets of 4 pixels (quads) passed on to fragment units

9 Why NV40 series was better

• Massive parallelism
• Scalability
 – Lower-end products have fewer pixel pipes and fewer vertex units
• Computation power
 – 222 million transistors
 – First to comply with Microsoft's DirectX 9 spec
• Dynamic branching in the pixel shader

10 Dynamic Branching

• Helps detect if a pixel needs shading
• Instruction flow handled in groups of pixels
• Specify branch granularity (the number of consecutive pixels that take the same branch)
• Better distribution of blocks of pixels between the different quad engines

11 General Diagram (7800/G70)

12 General Diagram (7800/G70) vs. General Diagram (6800/NV40)

14 GeForce Go 7800 – Power Issues

• Power consumption and package are the same as the 6800 Ultra chip, meaning notebook designers do not have to change very much about their thermal designs
• Dynamic clock scaling can run as slow as 16 MHz
 – This is true for the engine, memory, and pixel clocks
• Heavier use of clock gating than the desktop version
• Runs at voltages lower than any other mobile performance part
• Regardless, you won't get much battery-based runtime for a 3D game

15 GeForce 7800 GTX Parallelism

[Block diagram: 8 vertex engines → Z-cull → triangle setup/raster → shader instruction dispatch → 24 pixel shaders → fragment crossbar → 16 raster operation pipelines → 4 memory partitions]

16 G80 – Graphics Mode
• The future of GPUs is programmable processing
• So – build the architecture around the processor

[G80 graphics-mode block diagram: host → input assembler → setup / raster / ZCull → vertex, geometry, and pixel thread issue → streaming processors (SP) with texture filter (TF) units and L1 caches, coordinated by the thread processor → L2 caches and frame buffer (FB) partitions]

17 G80 CUDA mode – A Device Example
• Processors execute computing threads
• New operating mode/HW interface for computing

[G80 CUDA-mode block diagram: host → input assembler → thread execution manager → parallel data caches with texture units and load/store paths → global memory]

18 Why Use the GPU for Computing?
• The GPU has evolved into a very flexible and powerful processor:
 – It's programmable using high-level languages
 – It supports 32-bit floating point precision
 – It offers lots of GFLOPS:

[GFLOPS-over-time chart; legend: G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800]

• GPU in every PC and workstation

19 What is Behind such an Evolution?
• The GPU is specialized for compute-intensive, highly data-parallel computation (exactly what graphics rendering is about)
 – So, more transistors can be devoted to data processing rather than data caching and flow control

[Diagram: the CPU devotes much of its die to control logic and cache with a few ALUs; the GPU devotes most of its die to ALUs; both are backed by DRAM]

• The fast-growing video game industry exerts strong economic pressure that forces constant innovation

20 What is (Historical) GPGPU?

• General-purpose computation using the GPU and graphics API in applications other than 3D graphics
 – GPU accelerates critical path of application

• Data-parallel algorithms leverage GPU attributes
 – Large data arrays, streaming throughput
 – Fine-grain SIMD parallelism
 – Low-latency floating point (FP) computation
• Applications – see GPGPU.org
 – Game effects (FX) physics, image processing
 – Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting

21 Previous GPGPU Constraints
• Dealing with graphics API
 – Working with the corner cases of the graphics API
• Addressing modes
 – Limited texture size/dimension
• Shader capabilities
 – Limited outputs
• Instruction sets
 – Lack of integer & bit ops
• Communication limited
 – Between pixels
 – Scatter: a[i] = p
[Diagram: the per-thread, per-shader, per-context fragment program model — input registers and texture constants feed a fragment program with temp registers; results go to output registers and FB memory]

22 An Example of Physical Reality Behind CUDA
[Diagram: CPU (host) connected to GPU w/ local DRAM (device)]

23 Arrays of Parallel Threads

• A CUDA kernel is executed by an array of threads
 – All threads run the same code (SPMD)
 – Each thread has an ID that it uses to compute memory addresses and make control decisions

threadID: 0 1 2 3 4 5 6 7

…
float x = input[threadID];
float y = func(x);
output[threadID] = y;
…
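As a concrete sketch of this pattern, the snippet below wraps the slide's three lines in a complete kernel. The names scale_kernel and the body of func, and the use of threadIdx.x as the thread ID for a single-block launch, are illustrative assumptions rather than code from the slides.

__device__ float func(float x) { return 2.0f * x; }  // placeholder for the slide's func()

__global__ void scale_kernel(const float *input, float *output)
{
    int threadID = threadIdx.x;          // thread ID within the (single) block
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
}

// Launched, e.g., as: scale_kernel<<<1, 8>>>(d_in, d_out);  // 8 threads, IDs 0-7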

24 Thread Blocks: Scalable Cooperation

• Divide the monolithic thread array into multiple blocks
 – Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization
 – Threads in different blocks cannot cooperate

[Diagram: Thread Block 0, Thread Block 1, …, Thread Block N-1; each block has its own threadIDs 0–7 and runs the same code: float x = input[threadID]; float y = func(x); output[threadID] = y;]
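Once the array is split into blocks, each thread typically derives a global index from its block and thread IDs. A minimal sketch, reusing the hypothetical kernel above and assuming a 1D launch:

__global__ void scale_kernel(const float *input, float *output, int n)
{
    // Global thread ID across all blocks of a 1D grid
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadID < n) {                   // guard against a partial last block
        float x = input[threadID];
        output[threadID] = 2.0f * x;      // inlined stand-in for func(x)
    }
}
// e.g. scale_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);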

25 Thread Batching: Grids and Blocks

• A kernel is executed as a grid of thread blocks
 – All threads share data memory space
• A thread block is a batch of threads that can cooperate with each other by:
 – Synchronizing their execution
  • For hazard-free shared memory accesses
 – Efficiently sharing data through a low-latency shared memory
• Two threads from two different blocks cannot cooperate
[Diagram: the host launches Kernel 1 on Grid 1 (a 3×2 array of blocks) and Kernel 2 on Grid 2; Block (1,1) expands into a 5×3 array of threads] Courtesy: NVIDIA

26 Block and Thread IDs

• Threads and blocks have IDs
 – So each thread can decide what data to work on
 – Block ID: 1D or 2D
 – Thread ID: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
 – Image processing
 – Solving PDEs on volumes
 – …
[Diagram: Grid 1 as a 3×2 array of blocks; Block (1,1) expands into a 5×3 array of threads] Courtesy: NVIDIA
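A hedged sketch of how 2D block IDs and 2D thread IDs map onto a kernel launch; the kernel name, image dimensions, and the 16×16 block shape are illustrative assumptions.

__global__ void blur_kernel(const float *in, float *out, int width, int height)
{
    // 2D pixel coordinates from the 2D block ID and 2D thread ID
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = in[y * width + x];   // identity copy; real filter omitted
}

// Host side (inside some function, with width/height and d_in/d_out already set up):
// dim3 block(16, 16);                               // 256 threads per block, 2D shape
// dim3 grid((width + 15) / 16, (height + 15) / 16); // enough blocks to cover the image
// blur_kernel<<<grid, block>>>(d_in, d_out, width, height);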

27 CUDA Device Memory Space Overview
• Each thread can:
 – R/W per-thread registers
 – R/W per-thread local memory
 – R/W per-block shared memory
 – R/W per-grid global memory
 – Read only per-grid constant memory
 – Read only per-grid texture memory
• The host can R/W global, constant, and texture memories
[Diagram: a (Device) grid containing Block (0,0) and Block (1,0); each block has its own shared memory, each thread its own registers and local memory; all blocks and the host access global, constant, and texture memory]

28 Global, Constant, and Texture Memories (Long Latency Accesses)

• Global memory
 – Main means of communicating R/W data between host and device
 – Contents visible to all threads
• Texture and constant memories
 – Constants initialized by host
 – Contents visible to all threads
[Diagram: same device memory layout as above, highlighting global, constant, and texture memory] Courtesy: NVIDIA
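A minimal sketch of how these spaces appear in CUDA source. The qualifiers (__device__, __constant__, __shared__) and cudaMemcpyToSymbol are standard CUDA; all names and sizes are made up for the example.

__constant__ float coeffs[16];        // per-grid constant memory, read-only in kernels
__device__   float scratch[1024];     // per-grid global memory, R/W by all threads

__global__ void use_memories(const float *g_in, float *g_out)
{
    __shared__ float tile[256];       // per-block shared memory
    float local = g_in[threadIdx.x];  // 'local' lives in a register (or local memory if spilled)
    tile[threadIdx.x] = local * coeffs[0];   // read per-grid constant memory
    scratch[threadIdx.x] = local;            // write per-grid global memory
    __syncthreads();                  // hazard-free shared memory access
    g_out[threadIdx.x] = tile[threadIdx.x];
}

// Host initializes constant memory, e.g.:
// cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(float) * 16);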

29 Block IDs and Thread IDs
• Each thread uses IDs to decide what data to work on
 – Block ID: 1D or 2D
 – Thread ID: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
 – Image processing
 – Solving PDEs on volumes
 – …
[Diagram: the host launches Kernel 1 on Grid 1 (a 2×2 array of blocks) and Kernel 2 on Grid 2; Block (1,1) expands into a 4×2×2 array of threads] Courtesy: NVIDIA

30 CUDA Memory Model Overview
• Global memory
 – Main means of communicating R/W data between host and device
 – Contents visible to all threads
 – Long latency access
• We will focus on global memory for now
 – Constant and texture memory will come later
[Diagram: grid with per-block shared memory and per-thread registers; the host and all threads access global memory]
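A hedged host-side sketch of working with global memory; the array name and sizes are illustrative, while cudaMalloc, cudaMemcpy, and cudaFree are the standard CUDA API calls.

int n = 1 << 20;
size_t bytes = n * sizeof(float);
float *h_data = (float *)malloc(bytes);   // host (CPU) memory
float *d_data = NULL;                     // device global memory

cudaMalloc((void **)&d_data, bytes);                          // allocate global memory
cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);    // host -> device
// ... launch kernels that read/write d_data ...
cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);    // device -> host
cudaFree(d_data);
free(h_data);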

31 Parallel Computing on a GPU
• 8-series GPUs deliver 25 to 200+ GFLOPS on compiled parallel C applications
 – Available in laptops, desktops, and clusters
• GPU parallelism is doubling every year
• Programming model scales transparently
• Programmable in C with CUDA tools
• Multithreaded SPMD model uses application data parallelism and thread parallelism
[Product photos: GeForce 8800, Tesla D870, Tesla S870]

32 Single-Program Multiple-Data (SPMD)
• CUDA integrated CPU + GPU application C program
 – Serial C code executes on CPU
 – Parallel kernel C code executes on GPU thread blocks

CPU serial code
GPU parallel kernel (Grid 0):  KernelA<<< nBlk, nTid >>>(args);
CPU serial code
GPU parallel kernel (Grid 1):  KernelB<<< nBlk, nTid >>>(args);
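A hedged sketch of this alternation in a single CUDA C source file; the kernel bodies, the launch configuration, and the data pointer are placeholders, not code from the slides.

__global__ void KernelA(float *data) { /* parallel work on Grid 0 */ }
__global__ void KernelB(float *data) { /* parallel work on Grid 1 */ }

int main(void)
{
    float *d_data;
    cudaMalloc((void **)&d_data, 1024 * sizeof(float));
    int nBlk = 4, nTid = 256;          // illustrative launch configuration

    // ... serial C code on the CPU ...
    KernelA<<<nBlk, nTid>>>(d_data);   // Grid 0: parallel kernel on the GPU
    // ... more serial C code on the CPU ...
    KernelB<<<nBlk, nTid>>>(d_data);   // Grid 1: parallel kernel on the GPU

    cudaFree(d_data);
    return 0;
}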

33 Grids and Blocks
• A kernel is executed as a grid of thread blocks
 – All threads share global memory space
• A thread block is a batch of threads that can cooperate with each other by:
 – Synchronizing their execution using barrier
 – Efficiently sharing data through a low-latency shared memory
 – Two threads from two different blocks cannot cooperate
[Diagram: the host launches Kernel 1 on Grid 1 (a 2×2 array of blocks) and Kernel 2 on Grid 2; Block (1,1) expands into a 4×2×2 array of threads] Courtesy: NVIDIA

34 CUDA Thread Block

• Programmer declares (Thread) Block:
 – Block size: 1 to 512 concurrent threads
 – Block shape: 1D, 2D, or 3D
 – Block dimensions in threads
• All threads in a Block execute the same thread program
• Threads share data and synchronize while doing their share of the work
• Threads have thread id numbers within the Block
• Thread program uses thread id to select work and address shared data
[Diagram: a CUDA Thread Block with thread IDs 0, 1, 2, 3, …, m all running the same thread program]
Courtesy: John Nickolls, NVIDIA
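A minimal sketch of threads in one block sharing data and synchronizing; the reversal kernel and the fixed 256-thread block size are illustrative assumptions.

__global__ void reverse_in_block(float *data)
{
    __shared__ float buf[256];            // shared by all threads of this block
    int tid = threadIdx.x;                // thread id within the block

    buf[tid] = data[tid];                 // each thread stages one element
    __syncthreads();                      // barrier: all writes visible to all threads

    data[tid] = buf[255 - tid];           // read an element staged by another thread
}
// Launch with exactly one 256-thread block: reverse_in_block<<<1, 256>>>(d_data);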

35 GeForce-8 Series HW Overview

[Diagram: the Streaming Processor Array (SPA) is built from Texture Processor Clusters (TPC); each TPC contains a texture unit (TEX) and two Streaming Multiprocessors (SM); each SM has an instruction L1, a data L1, instruction fetch/dispatch, shared memory, 8 streaming processors (SP), and 2 super function units (SFU)]

36 CUDA Processor Terminology
• SPA – Streaming Processor Array (variable across the GeForce 8-series, 8 in the GeForce 8800)
• TPC – Texture Processor Cluster (2 SM + TEX)
• SM – Streaming Multiprocessor (8 SP)
 – Multi-threaded processor core
 – Fundamental processing unit for a CUDA thread block
• SP – Streaming Processor
 – Scalar ALU for a single CUDA thread

37 Streaming Multiprocessor (SM)

• Streaming Multiprocessor (SM)
 – 8 Streaming Processors (SP)
 – 2 Super Function Units (SFU)
• Multi-threaded instruction dispatch
 – 1 to 512 threads active
 – Shared instruction fetch per 32 threads
 – Cover latency of texture/memory loads
• 20+ GFLOPS
• 16 KB shared memory
• Texture and global memory access
[Diagram: SM with instruction L1, data L1, instruction fetch/dispatch, shared memory, 8 SPs, and 2 SFUs]

38 G80 Thread Computing Pipeline
• Processors execute computing threads
• Alternative operating mode specifically for computing
• Host generates thread grids based on kernel calls
[Diagram: the graphics-mode pipeline (input assembler, setup/raster/ZCull, vertex/geometry/pixel thread issue, SPs, TF/L1, L2, FB) overlaid with the compute-mode pipeline (input assembler, thread execution manager, parallel data caches, texture units, load/store paths, global memory)]

39 Thread Life Cycle in HW

• Grid is launched on the SPA
• Thread Blocks are serially distributed to all the SMs
 – Potentially >1 Thread Block per SM
• Each SM launches Warps of Threads
 – 2 levels of parallelism
• SM schedules and executes Warps that are ready to run
• As Warps and Thread Blocks complete, resources are freed
 – SPA can distribute more Thread Blocks
[Diagram: the host launches Grid 1 (a 3×2 array of blocks) and Grid 2; Block (1,1) expands into a 5×3 array of threads]

40 SM Executes Blocks

[Diagram: SM 0 and SM 1, each with MT IU, SPs, and shared memory, running blocks of threads t0 t1 t2 … tm; both share a texture unit (TF), texture L1, L2, and memory]
• Threads are assigned to SMs at Block granularity
 – Up to 8 Blocks to each SM as resource allows
 – An SM in G80 can take up to 768 threads
  • Could be 256 (threads/block) * 3 blocks
  • Or 128 (threads/block) * 6 blocks, etc.
• Threads run concurrently
 – SM assigns/maintains thread id #s
 – SM manages/schedules thread execution

41 Thread Scheduling/Execution

• Each Thread Block is divided into 32-thread Warps
 – This is an implementation decision, not part of the CUDA programming model
• Warps are the scheduling units in an SM
• If 3 blocks are assigned to an SM and each Block has 256 threads, how many Warps are there in an SM?
 – Each Block is divided into 256/32 = 8 Warps
 – There are 8 * 3 = 24 Warps
 – At any point in time, only one of the 24 Warps will be selected for instruction fetch and execution.
[Diagram: Block 1 and Block 2 Warps (t0 t1 t2 … t31); SM with instruction L1, data L1, instruction fetch/dispatch, shared memory, SPs, and SFUs]

42 SM Warp Scheduling

• SM hardware implements zero-overhead Warp scheduling
 – Warps whose next instruction has its operands ready for consumption are eligible for execution
 – Eligible Warps are selected for execution on a prioritized scheduling policy
 – All threads in a Warp execute the same instruction when selected
• 4 clock cycles are needed to dispatch the same instruction for all threads in a Warp in G80
 – If one global memory access is needed for every 4 instructions
 – A minimum of 13 Warps is needed to fully tolerate 200-cycle memory latency
[Diagram: SM multithreaded warp scheduler interleaving warps over time, e.g., warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, …, warp 8 instruction 12, warp 3 instruction 96]
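To spell the arithmetic out (a reading of the slide's numbers, not an exact hardware model): each warp issues one instruction per 4 cycles, and with one global load per 4 instructions a warp covers about 4 × 4 = 16 cycles of useful issue between loads, so 200 / 16 ≈ 12.5, i.e. at least 13 warps are needed to hide a 200-cycle memory latency.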

43 SM Instruction Buffer – Warp Scheduling
• Fetch one warp instruction/cycle
 – from instruction L1 cache
 – into any instruction buffer slot
• Issue one "ready-to-go" warp instruction/cycle
 – from any warp - instruction buffer slot
 – operand scoreboarding used to prevent hazards
• Issue selection based on round-robin/age of warp
• SM broadcasts the same instruction to 32 Threads of a Warp
[Diagram: I$ L1 → multithreaded instruction buffer → operand select → MAD and SFU units, alongside the register file (RF), constant cache (C$ L1), and shared memory]

44 Scoreboarding

• All register operands of all instructions in the Instruction Buffer are scoreboarded
 – Instruction becomes ready after the needed values are deposited
 – Prevents hazards
 – Cleared instructions are eligible for issue
• Decoupled Memory/Processor pipelines
 – Any thread can continue to issue instructions until scoreboarding prevents issue
 – Allows Memory/Processor ops to proceed in the shadow of other waiting Memory/Processor ops

45 Granularity Considerations

• For Matrix Multiplication, should I use 4X4, 8X8, 16X16, or 32X32 tiles?
 – For 4X4, we have 16 threads per block. Since each SM can take up to 768 threads, the thread capacity allows 48 blocks. However, each SM can only take up to 8 blocks, so there will be only 128 threads in each SM!
  • There are 8 warps, but each warp is only half full.
 – For 8X8, we have 64 threads per Block. Since each SM can take up to 768 threads, it could take up to 12 Blocks. However, each SM can only take up to 8 Blocks, so only 512 threads will go into each SM!
  • There are 16 warps available for scheduling in each SM
  • Each warp spans four slices in the y dimension
 – For 16X16, we have 256 threads per Block. Since each SM can take up to 768 threads, it can take up to 3 Blocks and achieve full capacity unless other resource considerations overrule.
  • There are 24 warps available for scheduling in each SM
  • Each warp spans two slices in the y dimension
 – For 32X32, we have 1024 threads per Block. Not even one can fit into an SM!
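A small host-side sketch of this bookkeeping; the 768-thread and 8-block limits are the G80 values quoted above, and the helper function is illustrative.

#include <stdio.h>

// Per-SM limits quoted for G80 on this slide
#define MAX_THREADS_PER_SM 768
#define MAX_BLOCKS_PER_SM  8

static void report(int tile)
{
    int threadsPerBlock = tile * tile;
    int blocksByThreads = MAX_THREADS_PER_SM / threadsPerBlock;   // 0 if a block is too big
    int blocks = blocksByThreads < MAX_BLOCKS_PER_SM ? blocksByThreads : MAX_BLOCKS_PER_SM;
    printf("%2dx%-2d tiles: %4d threads/block, %d blocks/SM, %4d threads/SM\n",
           tile, tile, threadsPerBlock, blocks, blocks * threadsPerBlock);
}

int main(void)
{
    int tiles[] = {4, 8, 16, 32};
    for (int i = 0; i < 4; ++i) report(tiles[i]);   // reproduces the slide's four cases
    return 0;
}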

46 Memory Hardware in G80

47 CUDA Device Memory Space: Review

• Each thread can:
 – R/W per-thread registers
 – R/W per-thread local memory
 – R/W per-block shared memory
 – R/W per-grid global memory
 – Read only per-grid constant memory
 – Read only per-grid texture memory
• The host can R/W global, constant, and texture memories
[Diagram: same device memory layout as in slide 27]

48 Parallel Memory Sharing

• Local Memory: per-thread
 – Private per thread
 – Auto variables, register spill
• Shared Memory: per-block
 – Shared by threads of the same block
 – Inter-thread communication
• Global Memory: per-application
 – Shared by all threads
 – Inter-grid communication
[Diagram: each thread has local memory; each block has shared memory; sequential grids in time (Grid 0, Grid 1, …) share global memory]
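A sketch of the three scopes in source form; all names are illustrative, and the per-application __device__ variable persisting across sequential kernel launches illustrates inter-grid communication.

__device__ float grid_result;            // per-application: global memory, visible to all grids

__global__ void producer(const float *in)
{
    __shared__ float partial[128];        // per-block: shared by threads of the same block
    float v = in[threadIdx.x] * 2.0f;     // auto variable: per-thread (register, or local memory if spilled)
    partial[threadIdx.x] = v;
    __syncthreads();
    if (threadIdx.x == 0 && blockIdx.x == 0)
        grid_result = partial[0];         // one thread publishes a per-application value
}

__global__ void consumer(float *out)
{
    out[threadIdx.x] = grid_result;       // a later grid reads what an earlier grid wrote
}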

49 SM Memory Architecture
• Threads in a block share data & results
 – In Memory and Shared Memory
 – Synchronize at barrier instruction
• Per-Block Shared Memory Allocation
 – Keeps data close to processor
 – Minimizes trips to global memory
 – Shared Memory is dynamically allocated to blocks, one of the limiting resources
[Diagram: SM 0 and SM 1 each running blocks of threads t0 t1 t2 … tm with their own shared memory; both share TF, texture L1, L2, and memory] Courtesy: John Nickolls, NVIDIA

50 SM Register File

• Register File (RF)
 – 32 KB (8K entries) for each SM in G80
• TEX pipe can also read/write RF
 – 2 SMs share 1 TEX
• Load/Store pipe can also read/write RF
[Diagram: I$ L1 → multithreaded instruction buffer → operand select → MAD and SFU, alongside the RF, constant cache (C$ L1), and shared memory]

51 Programmer View of Register File
• There are 8192 registers in each SM in G80
 – This is an implementation decision, not part of CUDA
 – Registers are dynamically partitioned across all blocks assigned to the SM
 – Once assigned to a block, a register is NOT accessible by threads in other blocks
 – Each thread in the same block can only access registers assigned to itself
[Diagram: the register file partitioned across 4 blocks vs. 3 blocks]

52 Matrix Multiplication Example
• If each Block has 16X16 threads and each thread uses 10 registers, how many threads can run on each SM?
 – Each Block requires 10*256 = 2560 registers
 – 8192 = 3 * 2560 + change
 – So, three Blocks can run on an SM as far as registers are concerned
• How about if each thread increases its use of registers by 1?
 – Each Block now requires 11*256 = 2816 registers
 – 8192 < 2816 * 3
 – Only two Blocks can run on an SM, a 1/3 reduction of parallelism!
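A hedged illustration of how this register budget relates to block size in source code. __launch_bounds__ is a standard CUDA compiler hint (added in later toolkits than the one these slides target); the kernel itself is illustrative and assumes a 16×16 block and a width that is a multiple of 16.

// Hint to the compiler: 256-thread blocks, at least 3 resident blocks per SM,
// matching the 16x16-tile analysis above; the compiler will limit register use accordingly.
__global__ void __launch_bounds__(256, 3) matmul_tile(const float *A, const float *B,
                                                      float *C, int width)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;                          // accumulator lives in a register
    for (int k = 0; k < width; ++k)
        acc += A[row * width + k] * B[k * width + col];
    C[row * width + col] = acc;
}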

53 More on Dynamic Partitioning

• Dynamic partitioning gives more flexibility to compilers/programmers
 – One can run a smaller number of threads that require many registers each, or a large number of threads that require few registers each
 – This allows for finer-grain threading than traditional CPU threading models
 – The compiler can trade off between instruction-level parallelism and thread-level parallelism
