
What is GPGPU?

• General Purpose computation using GPU in applications other than 3D graphics
  – GPU accelerates critical path of application
• Data parallel algorithms leverage GPU attributes
  – Large data arrays, streaming throughput
  – Fine-grain SIMD parallelism
  – Low-latency floating point (FP) computation
• Applications – see //GPGPU.org
  – Game effects (FX) physics, image processing
  – Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting

(Slides by David Kirk)

Previous GPGPU Constraints

• Dealing with graphics API
  – Working with the corner cases of the graphics API
• Addressing modes
  – Limited texture size/dimension
• Shader capabilities
  – Limited outputs
• Instruction sets
  – Lack of Integer & bit ops
• Communication limited
  – Between pixels
  – Scatter a[i] = p
(Figure: legacy fragment-shader pipeline showing Input Registers, Fragment Program, Texture, Constants, Temp Registers, Output Registers, and FB Memory, with per-thread, per-Shader, and per-Context scopes.)

CUDA

• "Compute Unified Device Architecture"
• General purpose programming model
  – User kicks off batches of threads on the GPU
  – GPU = dedicated super-threaded, massively data parallel co-processor
• Targeted software stack
  – Compute oriented drivers, language, and tools
• Driver for loading computation programs into GPU
  – Standalone Driver – Optimized for computation
  – Interface designed for compute – graphics-free API
  – Data sharing with OpenGL buffer objects
  – Guaranteed maximum download & readback speeds
  – Explicit GPU memory management

Parallel Computing on a GPU

• NVIDIA GPU Computing Architecture
  – Via a separate HW interface
  – In laptops, desktops, workstations, servers
• 8-series GPUs deliver 50 to 200 GFLOPS on compiled parallel C applications
• GPU parallelism is doubling every year
• Programming model scales transparently
• Programmable in C with CUDA tools
• Multithreaded SPMD model uses application data parallelism and thread parallelism
(Product images: GeForce 8800, Tesla D870, Tesla S870.)

Extended C

• Declspecs
  – global, device, shared, local, constant
• Keywords
  – threadIdx, blockIdx
• Intrinsics
  – __syncthreads
• Runtime API
  – Memory, symbol, execution management
• Function launch

__device__ float filter[N];

__global__ void convolve (float *image) {
    __shared__ float region[M];
    ...
    region[threadIdx.x] = image[i];
    __syncthreads();
    ...
    image[j] = result;
}

// Allocate GPU memory
float *myimage;
cudaMalloc((void**)&myimage, bytes);

// 100 blocks, 10 threads per block
convolve<<<100, 10>>> (myimage);
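Not from the original slides: a minimal, self-contained sketch that puts the extensions above together (the kernel name scale, the array size, and the doubling operation are illustrative assumptions):

#include <cstdio>
#include <cuda_runtime.h>

// Each thread stages one element in per-block shared memory, synchronizes,
// then writes its result back to global memory.
__global__ void scale(float *data) {
    __shared__ float tile[10];                         // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;     // built-in variables
    tile[threadIdx.x] = data[i];                       // load from global memory
    __syncthreads();                                   // barrier across the block
    data[i] = 2.0f * tile[threadIdx.x];                // write result
}

int main() {
    const int n = 1000, bytes = n * sizeof(float);
    float h[1000];
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc((void**)&d, bytes);                     // device allocation
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // host -> device
    scale<<<100, 10>>>(d);                             // 100 blocks, 10 threads per block
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);   // device -> host
    cudaFree(d);

    printf("h[0] = %f\n", h[0]);                       // expect 2.000000
    return 0;
}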

CUDA Programming Model: A Highly Multithreaded Coprocessor

• The GPU is viewed as a compute device that:
  – Is a coprocessor to the CPU or host
  – Has its own DRAM (device memory)
  – Runs many threads in parallel
• Data-parallel portions of an application are executed on the device as kernels which run in parallel on many threads
• Differences between GPU and CPU threads
  – GPU threads are extremely lightweight
    • Very little creation overhead
  – GPU needs 1000s of threads for full efficiency
    • Multi-core CPU needs only a few

Thread Batching: Grids and Blocks

• A kernel is executed as a grid of thread blocks
  – All threads share data memory space
• A thread block is a batch of threads that can cooperate with each other by:
  – Synchronizing their execution
    • For hazard-free shared memory accesses
  – Efficiently sharing data through a low latency shared memory
• Two threads from two different blocks cannot cooperate
(Figure: the Host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid contains blocks such as Block (1, 1), which contains threads Thread (0, 0) through Thread (4, 2). Courtesy: NVIDIA)

Block and Thread IDs

• Threads and blocks have IDs
  – So each thread can decide what data to work on
  – Block ID: 1D or 2D
  – Thread ID: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
  – Image processing
  – Solving PDEs on volumes
  – …
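Not from the original slides: a small sketch of how 2D block and thread IDs might map onto the pixels of a row-major W x H image (the kernel name brighten and the per-pixel operation are illustrative):

// Each thread handles one pixel; the 2D IDs give its (x, y) coordinates.
__global__ void brighten(float *img, int W, int H) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;     // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;     // row
    if (x < W && y < H)
        img[y * W + x] += 0.1f;                        // per-pixel work
}

// Launch: a 2D grid of 2D blocks covering the image.
// dim3 block(16, 16);
// dim3 grid((W + 15) / 16, (H + 15) / 16);
// brighten<<<grid, block>>>(d_img, W, H);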

CUDA Device Memory Space Overview

• Each thread can:
  – R/W per-thread registers
  – R/W per-thread local memory
  – R/W per-block shared memory
  – R/W per-grid global memory
  – Read only per-grid constant memory
  – Read only per-grid texture memory
• The host can R/W global, constant, and texture memories
(Figure: each block on the device grid has its own shared memory; each thread has registers and local memory; global, constant, and texture memory are accessible to the host. Courtesy: NVIDIA)

Global, Constant, and Texture Memories (Long Latency Accesses)

• Global memory
  – Main means of communicating R/W data between host and device
  – Contents visible to all threads
• Texture and Constant Memories
  – Constants initialized by host
  – Contents visible to all threads
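Not from the original slides: illustrative declarations showing where each memory space appears in CUDA source (names and sizes are made up):

__constant__ float coeffs[16];     // per-grid constant memory (read-only on the device)
__device__   float table[256];     // per-grid global memory, statically allocated

__global__ void read_spaces(float *out) {      // out points into global memory
    __shared__ float buf[128];                 // per-block shared memory (assumes blockDim.x <= 128)
    float tmp = coeffs[0];                     // scalar locals live in per-thread registers
    buf[threadIdx.x] = tmp;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = buf[threadIdx.x] + table[threadIdx.x % 256];
}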

CUDA Highlights: Easy and Lightweight

• The API is an extension to the ANSI C programming language
  – Low learning curve
• The hardware is designed to enable lightweight runtime and driver
  – High performance

CUDA – API

CUDA Device Memory Allocation

• cudaMalloc()
  – Allocates object in the device Global Memory
  – Requires two parameters
    • Address of a pointer to the allocated object
    • Size of allocated object
• cudaFree()
  – Frees object from device Global Memory
    • Pointer to freed object

CUDA Device Memory Allocation (cont.)

• Code example:
  – Allocate a 64 * 64 single precision float array
  – Attach the allocated storage to the pointer d_matrix
  – "d" is often used to indicate a device data structure

int BLOCK_SIZE = 64;
float *d_matrix;
int size = BLOCK_SIZE * BLOCK_SIZE * sizeof(float);

cudaMalloc((void**)&d_matrix, size);
...
cudaFree(d_matrix);

CUDA Host-Device Data Transfer

• cudaMemcpy()
  – memory data transfer
  – Requires four parameters
    • Pointer to destination
    • Pointer to source
    • Number of bytes copied
    • Type of transfer
      – Host to Host
      – Host to Device
      – Device to Host
      – Device to Device
• Asynchronous in CUDA 1.1

CUDA Host-Device Data Transfer (cont.)

• Code example:
  – Transfer a 64 * 64 single precision float array
  – h_matrix is in host memory and d_matrix is in device memory
  – cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants

cudaMemcpy(d_matrix, h_matrix, size, cudaMemcpyHostToDevice);
...
cudaMemcpy(h_matrix, d_matrix, size, cudaMemcpyDeviceToHost);
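Not on the slides, but worth noting: each runtime call returns a status that can be checked. A minimal fragment, reusing the h_matrix/d_matrix names from the example above:

cudaError_t err = cudaMemcpy(d_matrix, h_matrix, size, cudaMemcpyHostToDevice);
if (err != cudaSuccess)
    fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));   // needs <cstdio>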

Calling a Kernel Function – Thread Creation

• A kernel function must be called with an execution configuration:

__global__ void KernelFunc(...);
dim3 DimGrid(100, 50);           // 5000 thread blocks
dim3 DimBlock(4, 8, 8);          // 256 threads per block
size_t SharedMemBytes = 64;      // 64 bytes of shared memory
KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

• Any call to a kernel function is asynchronous from CUDA 1.1 on; explicit synchronization is needed for blocking
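A short sketch (not from the slides) of the explicit synchronization mentioned above, reusing the convolve launch from earlier; cudaThreadSynchronize() blocks the host until all previously launched device work has finished (newer toolkits spell it cudaDeviceSynchronize()). The h_result buffer is an illustrative assumption:

convolve<<<100, 10>>>(myimage);     // returns to the host immediately
cudaThreadSynchronize();            // host blocks here until the kernel completes
cudaMemcpy(h_result, myimage, bytes, cudaMemcpyDeviceToHost);   // now safe to read results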

Why Use the GPU for Computing?

• The GPU has evolved into a very flexible and powerful processor:
  – It's programmable using high-level languages
  – It supports 32-bit floating point precision
  – It offers lots of GFLOPS
• GPU in every PC and workstation
(Chart: GFLOPS over time for G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800.)

What is Behind such an Evolution?

• The GPU is specialized for compute-intensive, highly data parallel computation (exactly what graphics rendering is about)
  – So, more transistors can be devoted to data processing rather than data caching and flow control
(Figure: CPU vs. GPU die area – the CPU devotes large areas to control logic and cache; the GPU devotes most of its transistors to ALUs. Both sit atop DRAM.)
• The fast-growing video game industry exerts strong economic pressure that forces constant innovation

G80 Thread Computing Pipeline

• The future of GPUs is programmable processing
• So – build the architecture around the processor
• Processors execute computing threads
• Alternative operating mode specifically for computing

split personality n. Two distinct personalities in the same entity, each of which prevails at a particular time.

(Figure: G80 block diagram – Host, Input Assembler, Thread Execution Manager with Setup / Rstr / ZCull and Vtx, Geom, and Pixel Thread Issue; arrays of streaming processors (SP) with Parallel Data Caches, texture fetch (TF) and Texture L1 units; L2 caches and Load/store units; FB (frame buffer) / Global Memory partitions; Thread Processor.)

GeForce 8800 Series Technical Specs

• Maximum number of threads per block: 512
• Maximum size of each dimension of a grid: 65,535
• Number of streaming multiprocessors (SM):
  – GeForce 8800 GTX: 16 @ 675 MHz
  – GeForce 8800 GTS: 12 @ 600 MHz
• Device memory:
  – GeForce 8800 GTX: 768 MB
  – GeForce 8800 GTS: 640 MB
• Shared memory per multiprocessor: 16 KB divided into 16 banks
• Constant memory: 64 KB
• Warp size: 32 threads (16 Warps/Block)

What is the GPU Good at?

• The GPU is good at data-parallel processing
  – The same computation executed on many data elements in parallel – low control flow overhead
  – High SP floating point arithmetic intensity
    • Many calculations per memory access
    • Currently also need high floating point to integer ratio
• High floating-point arithmetic intensity and many data elements mean that memory access latency can be hidden with calculations instead of big data caches
  – Still need to avoid bandwidth saturation!

Drawbacks of (legacy) GPGPU Model: Hardware Limitations

• Memory accesses are done as pixels
  – Only gather: can read data from other pixels
  – No scatter: can only write to one pixel
(Figure: shader ALUs behind a cache can read any DRAM location d0 … d7, but each can write only its own output pixel.)

Less programming flexibility

Drawbacks of (legacy) GPGPU Model: Hardware Limitations

• Applications can easily be limited by DRAM memory bandwidth
(Figure: many ALUs sharing small caches all contend for DRAM locations d0 … d7.)

Waste of computation power due to data starvation

CUDA Highlights: Scatter

• CUDA provides generic DRAM memory addressing
  – Gather: read data from any DRAM location
  – And scatter: no longer limited to write one pixel
(Figure: ALUs reading from and writing to arbitrary DRAM locations d0 … d7; a code sketch follows the next slide.)

More programming flexibility

CUDA Highlights: On-Chip Shared Memory

• CUDA enables access to a parallel on-chip shared memory for efficient inter-thread data sharing
(Figure: each group of ALUs has its own on-chip shared memory staging d0 … d3 and d4 … d7 next to the DRAM.)

Big memory bandwidth savings
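Not from the original slides: a minimal sketch of scatter in CUDA – each thread computes its own destination index and writes there, which the legacy pixel-shader model could not express (the kernel name and index array are illustrative):

// Scatter: out[idx[i]] = in[i]
__global__ void scatter(float *out, const float *in, const int *idx, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[idx[i]] = in[i];     // arbitrary write location
                                 // (assumes idx[] holds distinct destinations, so threads do not race)
}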

A Common Programming Pattern

• Local and global memory reside in device memory (DRAM) – much slower access than shared memory
• So, a profitable way of performing computation on the device is to block data to take advantage of fast shared memory:
  – Partition data into data subsets that fit into shared memory
  – Handle each data subset with one thread block by:
    • Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
    • Performing the computation on the subset from shared memory; each thread can efficiently multi-pass over any data element
    • Copying results from shared memory to global memory
(A code sketch of this pattern follows the next slide.)

A Common Programming Pattern (cont.)

• Texture and Constant memory also reside in device memory (DRAM) – much slower access than shared memory
  – But… cached!
  – Highly efficient access for read-only data
• Carefully divide data according to access patterns
  – R/O no structure → constant memory
  – R/O array structured → texture memory
  – R/W shared within Block → shared memory
  – R/W registers spill to local memory
  – R/W inputs/results → global memory
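A minimal sketch of the blocking pattern above, assuming a simple 3-point moving average over a 1-D array (the kernel name avg3, the tile size, and the halo handling are illustrative, not from the slides):

#define TILE 128     // threads per block; launch as avg3<<<(n + TILE - 1) / TILE, TILE>>>(d_out, d_in, n)

__global__ void avg3(float *out, const float *in, int n) {
    __shared__ float tile[TILE + 2];                   // block's subset plus a one-element halo on each side
    int gid = blockIdx.x * blockDim.x + threadIdx.x;   // global index
    int lid = threadIdx.x + 1;                         // local index (slot 0 is the left halo)

    // 1. Load the subset from global memory to shared memory, one element per thread
    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x == 0)
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        tile[blockDim.x + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();

    // 2. Compute from fast shared memory; 3. copy the result back to global memory
    if (gid < n)
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}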

G80 Hardware Implementation: A Set of SIMD Multiprocessors

• The device is a set of 16 multiprocessors
• Each multiprocessor is a set of 32-bit processors with a Single Instruction Multiple Data architecture
  – Shared instruction unit
• At each clock cycle, a multiprocessor executes the same instruction on a group of threads called a warp
• The number of threads in a warp is the warp size
(Figure: the Device contains Multiprocessor 1 … Multiprocessor N; each holds Processor 1 … Processor M with per-processor Registers and a shared Instruction Unit.)

Hardware Implementation: Memory Architecture

• The local, global, constant, and texture spaces are regions of device memory
• Each multiprocessor has:
  – A set of 32-bit registers per processor
  – On-chip shared memory
    • Where the shared memory space resides
  – A read-only constant cache
    • To speed up access to the constant memory space
  – A read-only texture cache
    • To speed up access to the texture memory space
(Figure: each multiprocessor contains Shared Memory, Registers, a Constant Cache, and a Texture Cache; below them sits device memory holding the global, constant, and texture memories.)

Hardware Implementation: Execution Model (review)

• Each thread block of a grid is split into warps, each gets executed by one multiprocessor (SM)
  – The device processes only one grid at a time
• Each thread block is executed by one multiprocessor
  – So that the shared memory space resides in the on-chip shared memory
• A multiprocessor can execute multiple blocks concurrently
  – Shared memory and registers are partitioned among the threads of all concurrent blocks
  – So, decreasing shared memory usage (per block) and register usage (per thread) increases the number of blocks that can run concurrently

Threads, Warps, Blocks

• There are (up to) 32 threads in a Warp
  – Only <32 when there are fewer than 32 total threads
• There are (up to) 16 Warps in a Block
• Each Block (and thus, each Warp) executes on a single SM
• G80 has 16 SMs
• At least 16 Blocks required to "fill" the device
• More is better
  – If resources (registers, thread space, shared memory) allow, more than 1 Block can occupy each SM

Language Extensions: Built-in Variables

• dim3 gridDim;
  – Dimensions of the grid in blocks (gridDim.z unused)
• dim3 blockDim;
  – Dimensions of the block in threads
• dim3 blockIdx;
  – Block index within the grid
• dim3 threadIdx;
  – Thread index within the block
(A usage sketch follows the next slide.)

Common Runtime Component

• Provides:
  – Built-in vector types
  – A subset of the C runtime library supported in both host and device codes
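Not from the original slides: a grid-stride sketch showing the built-in variables in use – each thread derives a unique starting index and the total thread count, so one launch can cover an array larger than gridDim.x * blockDim.x elements (the kernel name saxpy is illustrative):

__global__ void saxpy(float a, const float *x, float *y, int n) {
    int start  = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's first element
    int stride = gridDim.x * blockDim.x;                  // total threads in the grid
    for (int i = start; i < n; i += stride)
        y[i] = a * x[i] + y[i];
}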

Common Runtime Component: Built-in Vector Types

• [u]char[1..4], [u]short[1..4], [u]int[1..4], [u]long[1..4], float[1..4]
  – Structures accessed with x, y, z, w fields:
      uint4 param;
      int y = param.y;
• dim3
  – Based on uint3
  – Used to specify dimensions

Common Runtime Component: Mathematical Functions

• pow, sqrt, cbrt, hypot
• exp, exp2, expm1
• log, log2, log10, log1p
• sin, cos, tan, asin, acos, atan, atan2
• sinh, cosh, tanh, asinh, acosh, atanh
• ceil, floor, trunc, round
• Etc.
  – When executed on the host, a given function uses the C runtime implementation if available
  – These functions are only supported for scalar types, not vector types
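Not from the original slides: a small sketch combining a built-in vector type with the scalar math functions listed above (the kernel and its purpose are illustrative):

// float4 packs four components, accessed as .x .y .z .w;
// sqrtf is the single-precision form of sqrt.
__global__ void lengths(const float4 *pts, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 p = pts[i];
        out[i] = sqrtf(p.x * p.x + p.y * p.y + p.z * p.z + p.w * p.w);
    }
}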

Host Runtime Component: Memory Management

• Device memory allocation
  – cudaMalloc(), cudaFree()
• Memory copy from host to device, device to host, device to device
  – cudaMemcpy(), cudaMemcpy2D(), cudaMemcpyToSymbol(), cudaMemcpyFromSymbol()
• Memory addressing
  – cudaGetSymbolAddress()
(A small code sketch follows the next slide.)

Device Runtime Component: Mathematical Functions

• Some mathematical functions (e.g. sin(x)) have a less accurate, but faster device-only version (e.g. __sin(x))
  – __pow
  – __log, __log2, __log10
  – __exp
  – __sin, __cos, __tan
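Not from the original slides: a minimal sketch of cudaMemcpyToSymbol initializing a __constant__ array from host code (the names and values are made up):

__constant__ float c_filter[4];    // device constant-memory symbol

void setup_filter() {
    float h_filter[4] = {0.25f, 0.25f, 0.25f, 0.25f};
    // Copies four floats from host memory into the constant-memory symbol.
    cudaMemcpyToSymbol(c_filter, h_filter, sizeof(h_filter));
}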

Device Runtime Component: Synchronization Function

• void __syncthreads();
• Synchronizes all threads in a block
• Once all threads have reached this point, execution resumes normally
• Used to avoid RAW/WAR/WAW hazards when accessing shared or global memory
• Allowed in conditional constructs only if the conditional is uniform across the entire thread block
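Not from the original slides: a short sketch of __syncthreads() guarding a read-after-write hazard in shared memory (the kernel name reverse_block is illustrative; it assumes the array length is a multiple of blockDim.x and blockDim.x <= 256):

__global__ void reverse_block(float *d) {
    __shared__ float s[256];
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    s[t] = d[base + t];                   // every thread writes its own slot
    __syncthreads();                      // barrier: all writes to s[] are now visible
    d[base + t] = s[blockDim.x - 1 - t];  // safe to read another thread's slot
}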

Some Useful Information on Tools

Compilation

• Any source file containing CUDA language extensions must be compiled with nvcc
• nvcc is a compiler driver
  – Works by invoking all the necessary tools and compilers like cudacc, g++, cl, ...
• nvcc can output:
  – Either C code
    • That must then be compiled with the rest of the application using another tool
  – Or object code directly

Linking

• Any executable with CUDA code requires two dynamic libraries:
  – The CUDA runtime library (cudart)
  – The CUDA core library (cuda)

Debugging Using the Device Emulation Mode

• An executable compiled in device emulation mode (nvcc -deviceemu) runs completely on the host using the CUDA runtime
  – No need of any device and CUDA driver
  – Each device thread is emulated with a host thread
• When running in device emulation mode, one can:
  – Use host native debug support (breakpoints, inspection, etc.)
  – Access any device-specific data from host code and vice-versa
  – Call any host function from device code (e.g. printf) and vice-versa
  – Detect deadlock situations caused by improper usage of __syncthreads

Device Emulation Mode Pitfalls

• Emulated device threads execute sequentially, so simultaneous accesses of the same memory location by multiple threads could produce different results
• Dereferencing device pointers on the host or host pointers on the device can produce correct results in device emulation mode, but will generate an error in device execution mode
• Results of floating-point computations will slightly differ because of:
  – Different compiler outputs, instruction sets
  – Use of extended precision for intermediate results
    • There are various options to force strict single precision on the host
