
L2: Hardware Execution Model and Overview
January 11, 2012

Administrative
• First assignment out, due next Friday at 5PM
  – Use handin on CADE machines to submit: “handin cs6235 lab1 ”
  – The file should be a gzipped tar file of the CUDA program and output
  – Can implement on CADE machines, see provided makefile
  – Grad lab is MEB 3161, must be sitting at machine
• Partners for people who have project ideas
• TA: Preethi Kotari, [email protected]
• Mailing lists now visible:
  – [email protected]
  – Please use for all questions suitable for the whole class
  – Feel free to answer your classmates' questions!


Outline
• Execution Model
• Host Synchronization
• Single Instruction Multiple Data (SIMD)
• Multithreading
• Scheduling instructions for SIMD, multithreaded multiprocessor
• How it all comes together
• Reading:
  – Ch 3 in Kirk and Hwu, http://courses.ece.illinois.edu/ece498/al/textbook/Chapter3-CudaThreadingModel.pdf
  – Ch 4 in NVIDIA CUDA Programming Guide

What is an Execution Model?
• Parallel programming model
  – Software technology for expressing parallel algorithms that target parallel hardware
  – Consists of programming languages, libraries, annotations, …
  – Defines the semantics of software constructs running on parallel hardware
• Parallel execution model
  – Exposes an abstract view of hardware execution, generalized to a class of architectures
  – Answers the broad question of how to structure and name data and instructions and how to interrelate the two
  – Allows humans to reason about harnessing, distributing, and controlling concurrency
• Today's lecture will help you reason about the target architecture while you are developing your code
  – How will code constructs be mapped to the hardware?


NVIDIA GPU Execution Model
I. SIMD execution of warpsize=M threads (from a single block)
  – Result is a set of instruction streams roughly equal to # threads in a block divided by warpsize
II. Multithreaded execution across different instruction streams within a block
  – Also possibly across different blocks if there are more blocks than SMs
III. Each block mapped to a single SM
  – No direct interaction across SMs
[Figure: Device with Multiprocessor 1..N; each multiprocessor contains Shared Memory, a Data Cache (Fermi only), per-processor Registers, Processors 1..M, an Instruction Unit, Constant Cache, and Texture Cache, all backed by Device memory]

SIMT = Single-Instruction Multiple Threads
• Coined by NVIDIA
• Combines SIMD execution within a Block (on an SM) with SPMD execution across Blocks (distributed across SMs)
• Terms to be defined…
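As a concrete, tiny illustration of the mapping above, here is a sketch of a kernel in which each thread reports which SIMD instruction stream (warp) it belongs to; d_warp and d_lane are assumed output arrays (not from the slides), and warpSize is the built-in CUDA device constant (32 on this hardware):

    __global__ void whoami(int *d_warp, int *d_lane) {
        int tid = threadIdx.x;              // thread id within the block
        d_warp[tid] = tid / warpSize;       // which instruction stream (warp) this thread is in
        d_lane[tid] = tid % warpSize;       // position within that warp
    }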


CUDA Thread Block Overview
• All threads in a block execute the same kernel program (SPMD)
• Programmer declares block:
  – Block size 1 to 512 concurrent threads
  – Block shape 1D, 2D, or 3D
  – Block dimensions in threads
• Threads have thread id numbers within block
  – Thread program uses thread id to select work and address shared data
• Threads in the same block share data and synchronize while doing their share of the work
• Threads in different blocks cannot cooperate
  – Each block can execute in any order relative to other blocks!
[Figure: a CUDA Thread Block with Thread Id # 0 1 2 3 … m, each thread running the thread program]
(Courtesy: John Nickolls, NVIDIA)

Calling a Kernel Function – Thread Creation in Detail
• A kernel function must be called with an execution configuration:

    __global__ void KernelFunc(...);
    dim3 DimGrid(100, 50);       // 5000 thread blocks
    dim3 DimBlock(4, 8, 8);      // 256 threads per block
    size_t SharedMemBytes = 64;  // 64 bytes of shared memory
    KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

  – The shared-memory argument is only for data that is not statically allocated
• Any call to a kernel function is asynchronous from CUDA 1.0 on
  – Explicit synchronization needed for blocking continued host execution (next slide)
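Putting the configuration syntax above into one self-contained host program: a minimal sketch with a hypothetical kernel, sizes, and scale factor (not from the slides); error checking omitted for brevity.

    #include <cuda_runtime.h>

    __global__ void ScaleKernel(float *d_a, float s) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        d_a[i] = d_a[i] * s;
    }

    int main() {
        const int N = 1024;
        float *d_a;
        cudaMalloc((void **)&d_a, N * sizeof(float));
        cudaMemset(d_a, 0, N * sizeof(float));

        dim3 DimGrid(N / 256);                           // 4 thread blocks
        dim3 DimBlock(256);                              // 256 threads per block
        ScaleKernel<<< DimGrid, DimBlock >>>(d_a, 2.0f); // launch is asynchronous

        cudaThreadSynchronize();                         // block the host until the kernel finishes
        cudaFree(d_a);
        return 0;
    }

The launch returns immediately; the cudaThreadSynchronize() call is what blocks the host, as discussed on the next slide.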


Host Blocking: Common Examples
• How do you guarantee the GPU is done and results are ready?
• Timing example (excerpt from simpleStreams in CUDA SDK):

    cudaEvent_t start_event, stop_event;
    cudaEventCreate(&start_event);
    cudaEventCreate(&stop_event);
    cudaEventRecord(start_event, 0);
    init_array<<<...>>>(d_a, d_c, niterations);
    cudaEventRecord(stop_event, 0);

• A bunch of runs in a row example (excerpt from transpose in CUDA SDK):

    for (int i = 0; i < numIterations; ++i) {
        transpose<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);
    }
    cudaThreadSynchronize();

Predominant Control Mechanisms: Some Definitions
• Single Instruction, Multiple Data (SIMD)
  – Meaning: a single thread of control, same computation applied across "vector" elts
  – Examples: array notation as in Fortran 95: A[1:n] = A[1:n] + B[1:n]; kernel fns w/in block: compute<<<...>>>(...)
• Multiple Instruction, Multiple Data (MIMD)
  – Meaning: multiple threads of control, processors …
  – Example: OpenMP parallel loop: forall (i=0; i<n; i++) …
• Single Program, Multiple Data (SPMD)
  – Meaning: multiple threads of control, but each processor executes same code
  – Example: processor-specific code: if ($threadIdx.x == 0) { ... }
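The simpleStreams excerpt above stops at the second cudaEventRecord. The usual completion of that timing pattern (a sketch, not shown on the slide) blocks the host on the stop event and then reads the elapsed time:

    cudaEventSynchronize(stop_event);                            // block host until stop_event has been recorded
    float elapsed_ms = 0.0f;
    cudaEventElapsedTime(&elapsed_ms, start_event, stop_event);  // elapsed GPU time in milliseconds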


Streaming Multiprocessor (SM)
• Streaming Multiprocessor (SM)
  – 8 Streaming Processors (SP)
  – 2 Super Function Units (SFU)
• Multi-threaded instruction dispatch
  – 1 to 512 threads active
  – Shared instruction fetch per 32 threads
  – Cover latency of texture/memory loads
• 20+ GFLOPS
• 16 KB shared memory
• DRAM texture and memory access
[Figure: Streaming Multiprocessor with Instruction L1, Data L1, Instruction Fetch/Dispatch, Shared Memory, eight SPs, and two SFUs]

I. SIMD
• Motivation:
  – Data-parallel computations map well to architectures that apply the same computation repeatedly to different data
  – Conserve control units and simplify coordination
• Analogy to light switch


Example SIMD Execution

"Count 6" kernel function (beginning; the loop body is cut off in these notes):

    d_out[threadIdx.x] = 0;
    for (int i=0; i< …

[Figure: the processors of one multiprocessor execute each instruction of the kernel in lockstep against memory]


Example SIMD Execution (continued)

[Figure: the remaining instructions of the "Count 6" kernel are issued in lockstep across the processors, one instruction at a time, etc.]

A reconstructed version of the full kernel is sketched below.
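Since the loop body is truncated above, here is a plausible reconstruction of a "count 6" kernel in the spirit of the example; d_in, SIZE, and BLOCKSIZE are assumed names and constants (not taken from the slides), and a single block of BLOCKSIZE threads is assumed:

    #define SIZE      256   // assumed total input size
    #define BLOCKSIZE 32    // assumed number of threads per block

    __global__ void count6(int *d_out, int *d_in) {
        d_out[threadIdx.x] = 0;
        // each thread walks a strided slice of d_in and counts the elements equal to 6
        for (int i = 0; i < SIZE / BLOCKSIZE; i++) {
            int val = d_in[i * BLOCKSIZE + threadIdx.x];
            if (val == 6)
                d_out[threadIdx.x] += 1;
        }
    }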


Overview of SIMD Programming
• Vector architectures
• Early examples of SIMD supercomputers
• TODAY mostly:
  – Multimedia extensions such as SSE-3
  – Graphics and games processors (example, IBM Cell)
  – Accelerators (e.g., ClearSpeed)
• Is there a dominant SIMD programming model?
  – Unfortunately, NO!!!
• Why not?
  – Vector architectures were programmed by scientists
  – Multimedia extension architectures are programmed by systems programmers (almost assembly language!) or code is automatically generated by a compiler
  – GPUs are programmed by games developers (domain-specific)
  – Accelerators typically use their own proprietary tools

Aside: Multimedia Extensions like SSE
• COMPLETELY DIFFERENT ARCHITECTURE!
• At the core of multimedia extensions
  – SIMD parallelism
  – Variable-sized data fields: Vector length = register width / type size
• Example: PowerPC AltiVec, with wide registers V0 to V31, each 128 bits (bits 0 to 127)
  – One register holds sixteen 8-bit operands, eight 16-bit operands, or four 32-bit operands


Aside: Multimedia Extensions
Scalar vs. SIMD Operation

    Scalar: add r1,r2,r3      r2 = 2, r3 = 1                        ->  r1 = 3
    SIMD:   vadd v1,v2,v3     v2 = [1 2 3 4], v3 = [1 2 3 4]        ->  v1 = [2 4 6 8]   (one add per lane)

II. Multithreading: Motivation
• Each arithmetic instruction includes the following sequence:

    Activity        | Cost                      | Note
    Load operands   | As much as O(100) cycles  | Depends on location
    Compute         | O(1) cycles               | Accesses registers
    Store result    | As much as O(100) cycles  | Depends on location

• Memory latency, the time in cycles to access memory, limits utilization of compute engines
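For comparison with the vadd example above, here is a minimal CUDA sketch (hypothetical kernel and array names) in which each thread plays the role of one SIMD lane, so a 32-thread warp performs 32 additions in lockstep:

    __global__ void vadd(float *v1, const float *v2, const float *v3, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's "lane" in the grid
        if (i < n)
            v1[i] = v2[i] + v3[i];                       // one element-wise add per thread
    }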


Thread-Level Parallelism
• Motivation:
  – a single thread leaves a processor under-utilized for most of the time
  – by doubling processor area, single thread performance barely improves
• Strategies for thread-level parallelism:
  – multiple threads share the same large processor: reduces under-utilization, efficient resource allocation (Multi-Threading)
  – each thread executes on its own mini processor: simple design, low interference between threads (Multi-Processing)
(Slide source: Al Davis)

What Resources are Shared?
• Multiple threads are simultaneously active (in other words, a new thread can start without a context switch)
• For correctness, each thread needs its own PC, and its own logical regs (on this hardware, each thread w/in block gets its own physical regs)
• Functional units, instruction unit, i-cache shared by all threads
[Figure: warps (instruction streams) feeding instructions issued to the shared functional units]


Aside: Multithreading
• Historically, supercomputers targeting non-numeric computation
  – HEP, Tera MTA, Cray XMT
• Now common in commodity microprocessors
  – Simultaneous multithreading: multiple threads may come from different streams, can issue from multiple streams in single instruction issue
  – Alpha 21464 and Pentium 4 are examples
• CUDA somewhat simplified:
  – A full warp scheduled at a time

How is context switching so efficient?
[Figure: the register file partitioned into per-thread windows, e.g. Block 0 Thread 0, Block 0 Thread 1, … Block 0 Thread 256, …, Block 8 Thread 0 … Block 8 Thread 256]
• Large register file (16K registers/block)
  – Each thread assigned a "window" of physical registers
  – Works if entire thread block's registers do not exceed capacity (otherwise, compiler fails)
  – May be able to schedule from multiple blocks simultaneously
• Similarly, shared memory requirements must not exceed capacity for all blocks simultaneously scheduled
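A rough back-of-the-envelope illustration (hypothetical per-thread register counts; the 16K figure is the one quoted above): a 256-thread block whose threads each use 10 registers needs 256 * 10 = 2560 physical registers for its windows, so up to 6 such blocks (6 * 2560 = 15360 <= 16384) can have their windows resident at once; at 32 registers per thread one block already needs 8192 registers, leaving room for only 2 blocks.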


Example: Thread Scheduling on G80
• Each Block is executed as 32-thread Warps
  – An implementation decision, not part of the CUDA programming model
  – Warps are scheduling units in SM
• If 3 blocks are assigned to an SM and each block has 256 threads, how many Warps are there in an SM?
  – Each Block is divided into 256/32 = 8 Warps
  – There are 8 * 3 = 24 Warps
[Figure: Block 1 and Block 2 Warps (t0 t1 t2 … t31) feeding the Streaming Multiprocessor (Instruction L1, Instruction Fetch/Dispatch, Shared Memory, SPs, SFUs)]

SM Warp Scheduling
• SM hardware implements zero-overhead Warp scheduling
  – Warps whose next instruction has its operands ready for consumption are eligible for execution
  – Eligible Warps are selected for execution on a prioritized scheduling policy
  – All threads in a Warp execute the same instruction when selected
• 4 clock cycles needed to dispatch the same instruction for all threads in a Warp in G80
  – If one global memory access is needed for every 4 instructions
  – A minimum of 13 Warps are needed to fully tolerate 200-cycle memory latency
[Figure: the SM multithreaded Warp scheduler issuing, over time, warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, …, warp 8 instruction 12, warp 3 instruction 96]
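To sketch where the 13 comes from (arithmetic assumed, not spelled out on the slide): with one global memory access every 4 instructions and 4 cycles to issue each instruction Warp-wide, a Warp keeps the SM busy for about 4 * 4 = 16 cycles before it stalls; covering a 200-cycle memory latency therefore needs roughly 200 / 16 = 12.5, i.e. at least 13, Warps ready to run.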

SM Instruction Buffer – Warp Scheduling
• Fetch one warp instruction/cycle
  – from instruction cache
  – into any instruction buffer slot
• Issue one "ready-to-go" warp instruction/cycle
  – from any warp - instruction buffer slot
  – operand scoreboarding used to prevent hazards
• Issue selection based on round-robin/age of warp
• SM broadcasts the same instruction to 32 Threads of a Warp
[Figure: I$ feeding a Multithreaded Instruction Buffer, then the register file (R F), constant cache (C$ L1), and Shared Mem, through Operand Select into the MAD and SFU units]

Scoreboarding
• How to determine if a thread is ready to execute?
• A scoreboard is a table in hardware that tracks
  – instructions being fetched, issued, executed
  – resources (functional units and operands) they need
  – which instructions modify which registers
• Old concept from CDC 6600 (1960s) to separate memory and computation


Scoreboarding
• All register operands of all instructions in the Instruction Buffer are scoreboarded
  – Status becomes ready after the needed values are deposited
  – prevents hazards
  – cleared instructions are eligible for issue
• Decoupled Memory/Processor pipelines
  – any thread can continue to issue instructions until scoreboarding prevents issue
  – allows Memory/Processor ops to proceed in shadow of Memory/Processor ops

Scoreboarding from Example
• Consider three separate instruction streams: warp1, warp3 and warp8
• Schedule at time k:

    warp 8 instruction 11    t=k
    warp 1 instruction 42    t=k+1
    warp 3 instruction 95    t=k+2
    …
    warp 8 instruction 12    t=l>k
    warp 3 instruction 96    t=l+1

    Warp    | Current Instruction | Instruction State
    Warp 1  | 42                  | Computing
    Warp 3  | 95                  | Computing
    Warp 8  | 11                  | Operands ready to go at time k


Scoreboarding from Example (continued)
• Consider three separate instruction streams: warp1, warp3 and warp8
• Schedule at time k+1:

    warp 8 instruction 11    t=k
    warp 1 instruction 42    t=k+1
    warp 3 instruction 95    t=k+2
    …
    warp 8 instruction 12    t=l>k
    warp 3 instruction 96    t=l+1

    Warp    | Current Instruction | Instruction State
    Warp 1  | 42                  | Ready to write result
    Warp 3  | 95                  | Computing
    Warp 8  | 11                  | Computing

III. How it Comes Together
G80 Example: Executing Thread Blocks
• Threads are assigned to Streaming Multiprocessors in block granularity
  – Up to 8 blocks to each SM as resource allows
  – SM in G80 can take up to 768 threads
    • Could be 256 (threads/block) * 3 blocks
    • Or 128 (threads/block) * 6 blocks, etc.
• Threads run concurrently
  – SM maintains thread/block id #s
  – SM manages/schedules thread execution
[Figure: SM 0 and SM 1, each with MT IU, SPs, and Shared Memory; thread blocks (t0 t1 t2 … tm) assigned to each SM]


Details of Mapping
• If #blocks in a grid exceeds number of SMs,
  – multiple blocks mapped to an SM
  – treated independently
  – provides more warps to scheduler, so good as long as resources not exceeded
  – Possibly context switching overhead when scheduling between blocks (registers and shared memory)
• Thread Synchronization (more next time)
  – Within a block, threads observe SIMD model, and synchronize using __syncthreads() (a minimal sketch follows below)
  – Across blocks, interaction through global memory

Transparent Scalability
• Hardware is free to assign blocks to any processor at any time
  – A kernel scales across any number of parallel processors
[Figure: a Kernel grid of Block 0 through Block 7 runs two blocks at a time on a 2-SM device and four blocks at a time on a 4-SM device; each block can execute in any order relative to other blocks.]
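A minimal sketch of intra-block synchronization with __syncthreads() (hypothetical kernel, assumes a 1D block of 256 threads): each thread stages one element in shared memory, the block synchronizes, and then every thread reads a value written by a different thread.

    __global__ void reverse_block(int *d_data) {
        __shared__ int buf[256];                     // assumes blockDim.x == 256
        int t = threadIdx.x;
        int base = blockIdx.x * blockDim.x;
        buf[t] = d_data[base + t];                   // stage this thread's element
        __syncthreads();                             // wait until every thread in the block has written buf
        d_data[base + t] = buf[blockDim.x - 1 - t];  // safe: another thread's value is now visible
    }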


Summary of Lecture
• SIMT = SIMD+SPMD
• SIMD execution model within a warp, and conceptually within a block
• MIMD execution model across blocks
• Multithreading of SMs used to hide memory latency
  – Motivation for lots of threads to be concurrently active
• Scoreboarding used to track warps ready to execute

What's Coming
• Next few lectures:
  – Correctness of parallelization
  – Managing the memory hierarchy
  – Next assignment
