
L2: Hardware Execution Model and Overview
January 11, 2012

Administrative
• First assignment out, due next Friday at 5PM
  – Use handin on CADE machines to submit: “handin cs6235 lab1 ”
  – The file should be a gzipped tar file of the CUDA program and output
  – Can implement on CADE machines, see provided makefile
  – Grad lab is MEB 3161, must be sitting at machine
• Partners for people who have project ideas
• TA: Preethi Kotari, [email protected]
• Mailing lists now visible:
  – [email protected]
  – Please use for all questions suitable for the whole class
  – Feel free to answer your classmates' questions!


Outline
• Execution Model
• Host Synchronization
• Single Instruction Multiple Data (SIMD)
• Multithreading
• Scheduling instructions for SIMD, multithreaded multiprocessor
• How it all comes together
• Reading:
  – Ch 3 in Kirk and Hwu, http://courses.ece.illinois.edu/ece498/al/textbook/Chapter3-CudaThreadingModel.pdf
  – Ch 4 in NVIDIA CUDA Programming Guide

What is an Execution Model?
• Parallel programming model
  – Software technology for expressing parallel algorithms that target parallel hardware
  – Consists of programming languages, libraries, annotations, …
  – Defines the semantics of software constructs running on parallel hardware
• Parallel execution model
  – Exposes an abstract view of hardware execution, generalized to a class of architectures
  – Answers the broad question of how to structure and name data and instructions and how to interrelate the two
  – Allows humans to reason about harnessing, distributing, and controlling concurrency
• Today's lecture will help you reason about the target architecture while you are developing your code
  – How will code constructs be mapped to the hardware?


NVIDIA GPU Execution Model
I. SIMD execution of warpsize=M threads (from a single block)
  – Result is a set of instruction streams roughly equal to # threads in a block divided by warpsize
II. Multithreaded execution across different instruction streams within a block
  – Also possibly across different blocks if there are more blocks than SMs
III. Each block mapped to a single SM
  – No direct interaction across SMs
[Figure: Device with Multiprocessor 1..N; each multiprocessor contains Shared Memory, a Data Cache (Fermi only), per-processor Registers, Processors 1..M, an Instruction Unit, Constant Cache, and Texture Cache, all backed by Device memory]

SIMT = Single-Instruction Multiple Threads
• Coined by NVIDIA
• Combines SIMD execution within a Block (on an SM) with SPMD execution across Blocks (distributed across SMs)
• Terms to be defined…
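As a concrete, tiny illustration of the mapping above, here is a sketch of a kernel in which each thread reports which SIMD instruction stream (warp) it belongs to; d_warp and d_lane are assumed output arrays (not from the slides), and warpSize is the built-in CUDA device constant (32 on this hardware):

    __global__ void whoami(int *d_warp, int *d_lane) {
        int tid = threadIdx.x;              // thread id within the block
        d_warp[tid] = tid / warpSize;       // which instruction stream (warp) this thread is in
        d_lane[tid] = tid % warpSize;       // position within that warp
    }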


CUDA Thread Block Overview
• All threads in a block execute the same kernel program (SPMD)
• Programmer declares block:
  – Block size 1 to 512 concurrent threads
  – Block shape 1D, 2D, or 3D
  – Block dimensions in threads
• Threads have thread id numbers within block
  – Thread program uses thread id to select work and address shared data
• Threads in the same block share data and synchronize while doing their share of the work
• Threads in different blocks cannot cooperate
  – Each block can execute in any order relative to other blocks!
[Figure: a CUDA Thread Block with Thread Id # 0 1 2 3 … m, each thread running the thread program]
(Courtesy: John Nickolls, NVIDIA)

Calling a Kernel Function – Thread Creation in Detail
• A kernel function must be called with an execution configuration:

    __global__ void KernelFunc(...);
    dim3 DimGrid(100, 50);       // 5000 thread blocks
    dim3 DimBlock(4, 8, 8);      // 256 threads per block
    size_t SharedMemBytes = 64;  // 64 bytes of shared memory
    KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

  – The shared-memory argument is only for data that is not statically allocated
• Any call to a kernel function is asynchronous from CUDA 1.0 on
  – Explicit synchronization needed for blocking continued host execution (next slide)
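Putting the configuration syntax above into one self-contained host program: a minimal sketch with a hypothetical kernel, sizes, and scale factor (not from the slides); error checking omitted for brevity.

    #include <cuda_runtime.h>

    __global__ void ScaleKernel(float *d_a, float s) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        d_a[i] = d_a[i] * s;
    }

    int main() {
        const int N = 1024;
        float *d_a;
        cudaMalloc((void **)&d_a, N * sizeof(float));
        cudaMemset(d_a, 0, N * sizeof(float));

        dim3 DimGrid(N / 256);                           // 4 thread blocks
        dim3 DimBlock(256);                              // 256 threads per block
        ScaleKernel<<< DimGrid, DimBlock >>>(d_a, 2.0f); // launch is asynchronous

        cudaThreadSynchronize();                         // block the host until the kernel finishes
        cudaFree(d_a);
        return 0;
    }

The launch returns immediately; the cudaThreadSynchronize() call is what blocks the host, as discussed on the next slide.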


Host Blocking: Common Examples
• How do you guarantee the GPU is done and results are ready?
• Timing example (excerpt from simpleStreams in CUDA SDK):

    cudaEvent_t start_event, stop_event;
    cudaEventCreate(&start_event);
    cudaEventCreate(&stop_event);
    cudaEventRecord(start_event, 0);
    init_array<<<...>>>(d_a, d_c, niterations);
    cudaEventRecord(stop_event, 0);

• A bunch of runs in a row example (excerpt from transpose in CUDA SDK):

    for (int i = 0; i < numIterations; ++i) {
        transpose<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);
    }
    cudaThreadSynchronize();

Predominant Control Mechanisms: Some Definitions
• Single Instruction, Multiple Data (SIMD)
  – Meaning: a single thread of control, same computation applied across "vector" elts
  – Examples: array notation as in Fortran 95: A[1:n] = A[1:n] + B[1:n]; kernel fns w/in block: compute<<<...>>>(...)
• Multiple Instruction, Multiple Data (MIMD)
  – Meaning: multiple threads of control, processors …
  – Example: OpenMP parallel loop: forall (i=0; i<n; i++) …
• Single Program, Multiple Data (SPMD)
  – Meaning: multiple threads of control, but each processor executes same code
  – Example: processor-specific code: if ($threadIdx.x == 0) { ... }
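The simpleStreams excerpt above stops at the second cudaEventRecord. The usual completion of that timing pattern (a sketch, not shown on the slide) blocks the host on the stop event and then reads the elapsed time:

    cudaEventSynchronize(stop_event);                            // block host until stop_event has been recorded
    float elapsed_ms = 0.0f;
    cudaEventElapsedTime(&elapsed_ms, start_event, stop_event);  // elapsed GPU time in milliseconds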


Streaming Multiprocessor (SM)
• Streaming Multiprocessor (SM)
  – 8 Streaming Processors (SP)
  – 2 Super Function Units (SFU)
• Multi-threaded instruction dispatch
  – 1 to 512 threads active
  – Shared instruction fetch per 32 threads
  – Cover latency of texture/memory loads
• 20+ GFLOPS
• 16 KB shared memory
• DRAM texture and memory access
[Figure: Streaming Multiprocessor with Instruction L1, Data L1, Instruction Fetch/Dispatch, Shared Memory, eight SPs, and two SFUs]

I. SIMD
• Motivation:
  – Data-parallel computations map well to architectures that apply the same computation repeatedly to different data
  – Conserve control units and simplify coordination
• Analogy to light switch


Example SIMD Execution

"Count 6" kernel function (beginning; the loop body is cut off in these notes):

    d_out[threadIdx.x] = 0;
    for (int i=0; i< …

[Figure: the processors of one multiprocessor execute each instruction of the kernel in lockstep against memory]


Example SIMD Execution (continued)

[Figure: the remaining instructions of the "Count 6" kernel are issued in lockstep across the processors, one instruction at a time, etc.]

A reconstructed version of the full kernel is sketched below.
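Since the loop body is truncated above, here is a plausible reconstruction of a "count 6" kernel in the spirit of the example; d_in, SIZE, and BLOCKSIZE are assumed names and constants (not taken from the slides), and a single block of BLOCKSIZE threads is assumed:

    #define SIZE      256   // assumed total input size
    #define BLOCKSIZE 32    // assumed number of threads per block

    __global__ void count6(int *d_out, int *d_in) {
        d_out[threadIdx.x] = 0;
        // each thread walks a strided slice of d_in and counts the elements equal to 6
        for (int i = 0; i < SIZE / BLOCKSIZE; i++) {
            int val = d_in[i * BLOCKSIZE + threadIdx.x];
            if (val == 6)
                d_out[threadIdx.x] += 1;
        }
    }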


Overview of SIMD Programming
• Vector architectures
• Early examples of SIMD supercomputers
• TODAY mostly:
  – Multimedia extensions such as SSE-3
  – Graphics and games processors (example, IBM Cell)
  – Accelerators (e.g., ClearSpeed)
• Is there a dominant SIMD programming model?
  – Unfortunately, NO!!!
• Why not?
  – Vector architectures were programmed by scientists
  – Multimedia extension architectures are programmed by systems programmers (almost assembly language!) or code is automatically generated by a compiler
  – GPUs are programmed by games developers (domain-specific)
  – Accelerators typically use their own proprietary tools

Aside: Multimedia Extensions like SSE
• COMPLETELY DIFFERENT ARCHITECTURE!
• At the core of multimedia extensions
  – SIMD parallelism
  – Variable-sized data fields: Vector length = register width / type size
• Example: PowerPC AltiVec, with wide registers V0 to V31, each 128 bits (bits 0 to 127)
  – One register holds sixteen 8-bit operands, eight 16-bit operands, or four 32-bit operands


Aside: Multimedia Extensions
Scalar vs. SIMD Operation

    Scalar: add r1,r2,r3      r2 = 2, r3 = 1                        ->  r1 = 3
    SIMD:   vadd v1,v2,v3     v2 = [1 2 3 4], v3 = [1 2 3 4]        ->  v1 = [2 4 6 8]   (one add per lane)

II. Multithreading: Motivation
• Each arithmetic instruction includes the following sequence:

    Activity        | Cost                      | Note
    Load operands   | As much as O(100) cycles  | Depends on location
    Compute         | O(1) cycles               | Accesses registers
    Store result    | As much as O(100) cycles  | Depends on location

• Memory latency, the time in cycles to access memory, limits utilization of compute engines
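For comparison with the vadd example above, here is a minimal CUDA sketch (hypothetical kernel and array names) in which each thread plays the role of one SIMD lane, so a 32-thread warp performs 32 additions in lockstep:

    __global__ void vadd(float *v1, const float *v2, const float *v3, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's "lane" in the grid
        if (i < n)
            v1[i] = v2[i] + v3[i];                       // one element-wise add per thread
    }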


Thread-Level Parallelism
• Motivation:
  – a single thread leaves a processor under-utilized for most of the time
  – by doubling processor area, single thread performance barely improves
• Strategies for thread-level parallelism:
  – multiple threads share the same large processor: reduces under-utilization, efficient resource allocation (Multi-Threading)
  – each thread executes on its own mini processor: simple design, low interference between threads (Multi-Processing)
(Slide source: Al Davis)

What Resources are Shared?
• Multiple threads are simultaneously active (in other words, a new thread can start without a context switch)
• For correctness, each thread needs its own PC, and its own logical regs (on this hardware, each thread w/in block gets its own physical regs)
• Functional units, instruction unit, i-cache shared by all threads
[Figure: warps (instruction streams) feeding instructions issued to the shared functional units]


Aside: Multithreading
• Historically, supercomputers targeting non-numeric computation
  – HEP, Tera MTA, Cray XMT
• Now common in commodity microprocessors
  – Simultaneous multithreading: multiple threads may come from different streams, can issue from multiple streams in single instruction issue
  – Alpha 21464 and Pentium 4 are examples
• CUDA somewhat simplified:
  – A full warp scheduled at a time

How is context switching so efficient?
[Figure: the register file partitioned into per-thread windows, e.g. Block 0 Thread 0, Block 0 Thread 1, … Block 0 Thread 256, …, Block 8 Thread 0 … Block 8 Thread 256]
• Large register file (16K registers/block)
  – Each thread assigned a "window" of physical registers
  – Works if entire thread block's registers do not exceed capacity (otherwise, compiler fails)
  – May be able to schedule from multiple blocks simultaneously
• Similarly, shared memory requirements must not exceed capacity for all blocks simultaneously scheduled
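A rough back-of-the-envelope illustration (hypothetical per-thread register counts; the 16K figure is the one quoted above): a 256-thread block whose threads each use 10 registers needs 256 * 10 = 2560 physical registers for its windows, so up to 6 such blocks (6 * 2560 = 15360 <= 16384) can have their windows resident at once; at 32 registers per thread one block already needs 8192 registers, leaving room for only 2 blocks.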


Example: Thread Scheduling on G80
• Each Block is executed as 32-thread Warps
  – An implementation decision, not part of the CUDA programming model
  – Warps are scheduling units in SM
• If 3 blocks are assigned to an SM and each block has 256 threads, how many Warps are there in an SM?
  – Each Block is divided into 256/32 = 8 Warps
  – There are 8 * 3 = 24 Warps
[Figure: Block 1 and Block 2 Warps (t0 t1 t2 … t31) feeding the Streaming Multiprocessor (Instruction L1, Instruction Fetch/Dispatch, Shared Memory, SPs, SFUs)]

SM Warp Scheduling
• SM hardware implements zero-overhead Warp scheduling
  – Warps whose next instruction has its operands ready for consumption are eligible for execution
  – Eligible Warps are selected for execution on a prioritized scheduling policy
  – All threads in a Warp execute the same instruction when selected
• 4 clock cycles needed to dispatch the same instruction for all threads in a Warp in G80
  – If one global memory access is needed for every 4 instructions
  – A minimum of 13 Warps are needed to fully tolerate 200-cycle memory latency
[Figure: the SM multithreaded Warp scheduler issuing, over time, warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, …, warp 8 instruction 12, warp 3 instruction 96]
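To sketch where the 13 comes from (arithmetic assumed, not spelled out on the slide): with one global memory access every 4 instructions and 4 cycles to issue each instruction Warp-wide, a Warp keeps the SM busy for about 4 * 4 = 16 cycles before it stalls; covering a 200-cycle memory latency therefore needs roughly 200 / 16 = 12.5, i.e. at least 13, Warps ready to run.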

SM Instruction Buffer – Warp Scheduling
• Fetch one warp instruction/cycle
  – from instruction cache
  – into any instruction buffer slot
• Issue one "ready-to-go" warp instruction/cycle
  – from any warp - instruction buffer slot
  – operand scoreboarding used to prevent hazards
• Issue selection based on round-robin/age of warp
• SM broadcasts the same instruction to 32 Threads of a Warp
[Figure: I$ feeding a Multithreaded Instruction Buffer, then the register file (R F), constant cache (C$ L1), and Shared Mem, through Operand Select into the MAD and SFU units]

Scoreboarding
• How to determine if a thread is ready to execute?
• A scoreboard is a table in hardware that tracks
  – instructions being fetched, issued, executed
  – resources (functional units and operands) they need
  – which instructions modify which registers
• Old concept from CDC 6600 (1960s) to separate memory and computation


Scoreboarding
• All register operands of all instructions in the Instruction Buffer are scoreboarded
  – Status becomes ready after the needed values are deposited
  – prevents hazards
  – cleared instructions are eligible for issue
• Decoupled Memory/Processor pipelines
  – any thread can continue to issue instructions until scoreboarding prevents issue
  – allows Memory/Processor ops to proceed in shadow of Memory/Processor ops

Scoreboarding from Example
• Consider three separate instruction streams: warp1, warp3 and warp8
• Schedule at time k:

    warp 8 instruction 11    t=k
    warp 1 instruction 42    t=k+1
    warp 3 instruction 95    t=k+2
    …
    warp 8 instruction 12    t=l>k
    warp 3 instruction 96    t=l+1

    Warp    | Current Instruction | Instruction State
    Warp 1  | 42                  | Computing
    Warp 3  | 95                  | Computing
    Warp 8  | 11                  | Operands ready to go at time k


Scoreboarding from Example (continued)
• Consider three separate instruction streams: warp1, warp3 and warp8
• Schedule at time k+1:

    warp 8 instruction 11    t=k
    warp 1 instruction 42    t=k+1
    warp 3 instruction 95    t=k+2
    …
    warp 8 instruction 12    t=l>k
    warp 3 instruction 96    t=l+1

    Warp    | Current Instruction | Instruction State
    Warp 1  | 42                  | Ready to write result
    Warp 3  | 95                  | Computing
    Warp 8  | 11                  | Computing

III. How it Comes Together
G80 Example: Executing Thread Blocks
• Threads are assigned to Streaming Multiprocessors in block granularity
  – Up to 8 blocks to each SM as resource allows
  – SM in G80 can take up to 768 threads
    • Could be 256 (threads/block) * 3 blocks
    • Or 128 (threads/block) * 6 blocks, etc.
• Threads run concurrently
  – SM maintains thread/block id #s
  – SM manages/schedules thread execution
[Figure: SM 0 and SM 1, each with MT IU, SPs, and Shared Memory; thread blocks (t0 t1 t2 … tm) assigned to each SM]


Details of Mapping
• If #blocks in a grid exceeds number of SMs,
  – multiple blocks mapped to an SM
  – treated independently
  – provides more warps to scheduler, so good as long as resources not exceeded
  – Possibly context switching overhead when scheduling between blocks (registers and shared memory)
• Thread Synchronization (more next time)
  – Within a block, threads observe SIMD model, and synchronize using __syncthreads() (a minimal sketch follows below)
  – Across blocks, interaction through global memory

Transparent Scalability
• Hardware is free to assign blocks to any processor at any time
  – A kernel scales across any number of parallel processors
[Figure: a Kernel grid of Block 0 through Block 7 runs two blocks at a time on a 2-SM device and four blocks at a time on a 4-SM device; each block can execute in any order relative to other blocks.]
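A minimal sketch of intra-block synchronization with __syncthreads() (hypothetical kernel, assumes a 1D block of 256 threads): each thread stages one element in shared memory, the block synchronizes, and then every thread reads a value written by a different thread.

    __global__ void reverse_block(int *d_data) {
        __shared__ int buf[256];                     // assumes blockDim.x == 256
        int t = threadIdx.x;
        int base = blockIdx.x * blockDim.x;
        buf[t] = d_data[base + t];                   // stage this thread's element
        __syncthreads();                             // wait until every thread in the block has written buf
        d_data[base + t] = buf[blockDim.x - 1 - t];  // safe: another thread's value is now visible
    }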


Summary of Lecture
• SIMT = SIMD+SPMD
• SIMD execution model within a warp, and conceptually within a block
• MIMD execution model across blocks
• Multithreading of SMs used to hide memory latency
  – Motivation for lots of threads to be concurrently active
• Scoreboarding used to track warps ready to execute

What's Coming
• Next few lectures:
  – Correctness of parallelization
  – Managing the memory hierarchy
  – Next assignment
