Administrative
• First assignment out, due next Friday at 5PM
  – Use handin on CADE machines to submit: “handin cs6235 lab1 …”
L2: Hardware Overview
Outline
• Execution Model
• Host Synchronization
• Single Instruction Multiple Data (SIMD)
• Multithreading
• Scheduling instructions for SIMD, multithreaded multiprocessor
• How it all comes together
• Reading:
  – Ch 3 in Kirk and Hwu, http://courses.ece.illinois.edu/ece498/al/textbook/Chapter3-CudaThreadingModel.pdf
  – Ch 4 in Nvidia CUDA Programming Guide

What is an Execution Model?
• Parallel programming model
  – Software technology for expressing parallel algorithms that target parallel hardware
  – Consists of programming languages, libraries, annotations, …
  – Defines the semantics of software constructs running on parallel hardware
• Parallel execution model
  – Exposes an abstract view of hardware execution, generalized to a class of architectures
  – Answers the broad question of how to structure and name data and instructions and how to interrelate the two
  – Allows humans to reason about harnessing, distributing, and controlling concurrency
• Today’s lecture will help you reason about the target architecture while you are developing your code
  – How will code constructs be mapped to the hardware?
NVIDIA GPU Execution Model
I. SIMD Execution of warpsize=M threads (from single block)
  – Result is a set of instruction streams roughly equal to # threads in block divided by warpsize
II. Multithreaded Execution across different instruction streams within block
  – Also possibly across different blocks if there are more blocks than SMs
III. Each block mapped to single SM
  – No direct interaction across SMs
[Figure: device block diagram — multiprocessors 1..N, each with processors 1..M, per-processor registers, shared memory, an instruction unit, a data cache (Fermi only), constant cache, and texture cache, all connected to device memory]

SIMT = Single-Instruction Multiple Threads
• Coined by Nvidia
• Combines SIMD execution within a Block (on an SM) with SPMD execution across Blocks (distributed across SMs)
• Terms to be defined…
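The warp decomposition can be made concrete with a small, hypothetical kernel (not from the slides): each thread records which warp it falls into, so a 64-thread block yields two instruction streams of warpSize (32) threads each.

    // Hypothetical illustration: how a block's threads are grouped into warps.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void warp_ids(int *out) {
        int tid = threadIdx.x;                                // thread index within the block
        out[blockIdx.x * blockDim.x + tid] = tid / warpSize;  // warp this thread executes in
    }

    int main() {
        const int blocks = 2, threads = 64;                   // 64 threads/block -> 2 warps per block
        int *d_out;
        cudaMalloc(&d_out, blocks * threads * sizeof(int));
        warp_ids<<<blocks, threads>>>(d_out);                 // SIMD within each warp, SPMD across blocks
        int h_out[blocks * threads];
        cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);  // blocks until the kernel is done
        printf("thread 0 -> warp %d, thread 63 -> warp %d\n", h_out[0], h_out[63]);
        cudaFree(d_out);
        return 0;
    }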
CUDA Thread Block Overview
• All threads in a block execute the same kernel program (SPMD)
• Programmer declares block:
  – Block size 1 to 512 concurrent threads
  – Block shape 1D, 2D, or 3D
  – Block dimensions in threads
• Threads have thread id numbers within block
  – Thread program uses thread id to select work and address shared data
• Threads in the same block share data and synchronize while doing their share of the work
• Threads in different blocks cannot cooperate
  – Each block can execute in any order relative to other blocks!
[Figure: CUDA Thread Block — Thread Id #: 0 1 2 3 … m, thread program. Courtesy: John Nickolls, NVIDIA]

Calling a Kernel Function – Thread Creation in Detail
• A kernel function must be called with an execution configuration:
    __global__ void KernelFunc(...);
    dim3 DimGrid(100, 50);       // 5000 thread blocks
    dim3 DimBlock(4, 8, 8);      // 256 threads per block
    size_t SharedMemBytes = 64;  // 64 bytes of shared memory (only for data that is not statically allocated)
    KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);
• Any call to a kernel function is asynchronous from CUDA 1.0 on
• Explicit synchronization needed for blocking continued host execution (next slide)
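A runnable version of this configuration might look like the sketch below; the kernel body and the flattened indexing are assumptions for illustration, not part of the slide.

    #include <cuda_runtime.h>

    // Assumed kernel body: write each thread's flattened global index into memory.
    __global__ void KernelFunc(float *data) {
        int t = threadIdx.z * blockDim.y * blockDim.x
              + threadIdx.y * blockDim.x + threadIdx.x;      // flatten the 3D thread index
        int b = blockIdx.y * gridDim.x + blockIdx.x;          // flatten the 2D block index
        data[b * blockDim.x * blockDim.y * blockDim.z + t] = (float)t;
    }

    int main() {
        dim3 DimGrid(100, 50);       // 5000 thread blocks
        dim3 DimBlock(4, 8, 8);      // 256 threads per block
        size_t SharedMemBytes = 64;  // 64 bytes of dynamically allocated shared memory
        float *d_data;
        cudaMalloc(&d_data, 5000 * 256 * sizeof(float));
        KernelFunc<<<DimGrid, DimBlock, SharedMemBytes>>>(d_data);  // asynchronous launch
        cudaDeviceSynchronize();     // block the host until the kernel completes
        cudaFree(d_data);
        return 0;
    }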
Host Blocking: Common Examples
• How do you guarantee the GPU is done and results are ready?
• Timing example (excerpt from simpleStreams in CUDA SDK):
    cudaEvent_t start_event, stop_event;
    cudaEventCreate(&start_event);
    cudaEventCreate(&stop_event);
    cudaEventRecord(start_event, 0);
    compute<<…

Predominant Control Mechanisms: Some definitions
• Name: Single Instruction, Multiple Data (SIMD)
  – Meaning: a single thread of control, the same computation applied across “vector” elts
  – Examples: array notation as in Fortran 95: A[1:n] = A[1:n] + B[1:n]; kernel fns w/in block: compute<<…
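The excerpt is cut off in these notes; a sketch of how such an event-based timing sequence typically finishes (the placeholder compute kernel and the closing calls are assumptions, not quoted from the SDK sample):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void compute(float *d, int n) {           // assumed placeholder kernel
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *d;
        cudaMalloc(&d, n * sizeof(float));

        cudaEvent_t start_event, stop_event;
        cudaEventCreate(&start_event);
        cudaEventCreate(&stop_event);

        cudaEventRecord(start_event, 0);      // enqueue start marker on stream 0
        compute<<<n / 256, 256>>>(d, n);      // asynchronous kernel launch
        cudaEventRecord(stop_event, 0);       // enqueue stop marker after the kernel
        cudaEventSynchronize(stop_event);     // host blocks here until the GPU reaches stop_event

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start_event, stop_event);
        printf("kernel time: %f ms\n", ms);

        cudaEventDestroy(start_event);
        cudaEventDestroy(stop_event);
        cudaFree(d);
        return 0;
    }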
Streaming Multiprocessor (SM)
• Streaming Multiprocessor (SM)
  – 8 Streaming Processors (SP)
  – 2 Super Function Units (SFU)
• Multi-threaded instruction dispatch
  – 1 to 512 threads active
  – Shared instruction fetch per 32 threads
  – Cover latency of texture/memory loads
• 20+ GFLOPS
• 16 KB shared memory
• DRAM texture and memory access
[Figure: Streaming Multiprocessor — Instruction L1, Data L1, Instruction Fetch/Dispatch, Shared Memory, 8 SPs, 2 SFUs]

I. SIMD
• Motivation:
  – Data-parallel computations map well to architectures that apply the same computation repeatedly to different data
  – Conserve control units and simplify coordination
• Analogy to light switch
Example SIMD Execution
“Count 6” kernel function
    d_out[threadIdx.x] = 0;
    for (int i=0; i< …
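The kernel text is truncated above; a minimal sketch of a “count 6” kernel in the same spirit follows, where BLOCKSIZE, SIZE, and the compare() helper are assumptions rather than the slide’s exact definitions.

    #define BLOCKSIZE 64
    #define SIZE      1024

    __device__ int compare(int a, int b) {
        return (a == b) ? 1 : 0;                  // 1 if the element matches the target
    }

    // Each thread accumulates, in its own d_out slot, how many of its
    // assigned elements equal 6.
    __global__ void count6(int *d_out, int *d_in) {
        d_out[threadIdx.x] = 0;
        for (int i = 0; i < SIZE / BLOCKSIZE; i++) {
            int val = d_in[i * BLOCKSIZE + threadIdx.x];
            d_out[threadIdx.x] += compare(val, 6);
        }
    }

Launched as count6<<<1, BLOCKSIZE>>>(d_out, d_in), every thread in the block executes the same instruction stream on different data, which is the SIMD behavior these example slides step through.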