With Thanks To…
Michael Garland, Mark Harris, John Danskin, Ignacio Castaño, David Kirk, and many others at NVIDIA
Christian Eisenacher & Charles Loop, Eric Haines, Eric Sintorn & Ulf Assarsson, Anjul Patney & John Owens, DICE, CAPCOM, CryTek, Bungie, and many others not at NVIDIA

Early 3D Graphics
Perspective study of a chalice, Paolo Uccello, circa 1450

Early Graphics Hardware
Artist using a perspective machine, Albrecht Dürer, 1525

Early Electronic Graphics Hardware
SKETCHPAD: A Man-Machine Graphical Communication System, Ivan Sutherland, 1963

The Graphics Pipeline
The Geometry Engine: A VLSI Geometry System for Graphics, Jim Clark, 1982

The Graphics Pipeline
Vertex Transform & Lighting
Triangle Setup & Rasterization
Texturing & Pixel Shading
Depth Test & Blending
Framebuffer

The Graphics Pipeline
Vertex → Rasterize → Pixel → Test & Blend → Framebuffer
Key abstraction of real-time graphics
Hardware used to look like this:
  One chip/board per stage
  Fixed data flow through the pipeline

SGI RealityEngine (1993)
Kurt Akeley. RealityEngine Graphics, SIGGRAPH 93.

SGI InfiniteReality (1997)
Montrym et al. InfiniteReality: A Real-Time Graphics System, SIGGRAPH 97.

The Graphics Pipeline
Remains a useful abstraction
Vertex and pixel processing became programmable
New stages added: Geometry, Tessellation
GPU architecture increasingly centers around shader execution

    // Each thread performs one pair-wise addition
    __global__ void vecAdd(float* A, float* B, float* C)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        C[i] = A[i] + B[i];
    }
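The deck shows the vecAdd kernel above without its host-side launch. The following is a minimal, hypothetical sketch of how such a kernel is typically driven; the problem size N, block size, initialization values, and the repeated kernel definition are illustrative assumptions, not part of the slides.

    // Hypothetical host driver for the vecAdd kernel shown above.
    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    // Kernel repeated from the slide so the sketch is self-contained.
    __global__ void vecAdd(float* A, float* B, float* C)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        C[i] = A[i] + B[i];
    }

    int main(void)
    {
        const int N = 1 << 20;               // assumed problem size (multiple of block size,
                                             // since the kernel has no bounds check)
        size_t bytes = N * sizeof(float);

        // Host buffers with illustrative contents
        float *hA = (float*)malloc(bytes), *hB = (float*)malloc(bytes), *hC = (float*)malloc(bytes);
        for (int i = 0; i < N; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

        // Device buffers
        float *dA, *dB, *dC;
        cudaMalloc((void**)&dA, bytes);
        cudaMalloc((void**)&dB, bytes);
        cudaMalloc((void**)&dC, bytes);
        cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

        // One thread per element
        int threadsPerBlock = 256;
        int blocks = N / threadsPerBlock;
        vecAdd<<<blocks, threadsPerBlock>>>(dA, dB, dC);

        cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
        printf("C[0] = %f\n", hC[0]);        // expect 3.0

        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        free(hA); free(hB); free(hC);
        return 0;
    }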
Modern GPUs: Unified Design
Vertex shaders, pixel shaders, etc. become threads running different programs on a flexible core.

GeForce 8: Modern GPU Architecture
[Block diagram: Host → Input Assembler → Setup & Rasterize; vertex, geometry, and pixel thread issue feeding an array of streaming processors (SP) with texture filtering units (TF), a thread processor, per-cluster L1 caches, L2 caches, and framebuffer partitions]

GPUs Today
1995: NV1 – 1 Million Transistors
1999: GeForce 256 – 22 Million Transistors
2002: GeForce4 – 63 Million Transistors
2003: GeForce FX – 130 Million Transistors
2004: GeForce 6 – 222 Million Transistors
2005: GeForce 7 – 302 Million Transistors
2006-2007: GeForce 8 – 754 Million Transistors
2008: GeForce GTX 200 – 1.4 Billion Transistors

Lessons from Graphics Pipeline
Throughput is paramount
Create, run, & retire lots of threads very rapidly
Use multithreading to hide latency

Why GPU Computing?
Throughput architecture → massive parallelism
Massive parallelism → computational horsepower
Fact: nobody cares about theoretical peak
Challenge: harness GPU power for real application performance
[Chart: Peak GFLOP/s, NVIDIA GPU vs. Intel CPU, Sep-02 through Feb-08]

GPU Computing 1.0
(Ignoring prehistory: Ikonas, Pixel Machine, Pixel-Planes…)
GPU Computing 1.0: compute pretending to be graphics
Disguise data as textures or geometry
Disguise algorithm as render passes
→ Trick the graphics pipeline into doing your computation!

GPU Computing 1.0 – the GPGPU era
Geometric computation
  E.g. form factors, Voronoi diagrams, proximity calculations
  Primarily vertex & rasterization hardware
Domain computations
  E.g. boiling, reaction-diffusion, fluids, image processing
  Primarily texture & pixel processing hardware
Term GPGPU coined by Mark Harris

GPGPU Hardware & Algorithms
GPUs get progressively more capable
  Fixed-function – fancy depth + texture compositing
  Register combiners → full programmable shading
  Floating-point pixel hardware greatly extends reach
Algorithms get more sophisticated
  Cellular automata & relatives (CML, LBM)
  PDEs & numeric solvers – multigrid, conjugate gradient, …
  Clever graphics tricks – occlusion culling, depth/stencil, …
High-level shading languages emerge
  Cg, GLSL, HLSL

Typical GPGPU Constructs

GPGPU Evolves
Emergence of libraries & frameworks, e.g. Glift
Comparison to optimized CPU codes
Ian Buck's BrookGPU closes out the GPGPU era
  First attempt at a complete high-level GPGPU language
  Streams, shapes, kernels + restricted C syntax
  Many proofs-of-concept; limited uptake outside Stanford

GPU Computing 2.0
GPU Computing 2.0: direct compute
Program the GPU directly, with no graphics-based restrictions
Provide scatter, inter-thread communication, pointers
Enable standard languages like C
GPU Computing supplants graphics-based GPGPU
November 2006: NVIDIA introduces CUDA

CUDA In One Slide
Thread: per-thread local memory
Block: per-block shared memory, local barrier across the block's threads
Kernel foo() … Kernel bar(): per-device global memory, global barrier between successive kernels
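As a rough illustration of that execution model, here is a hypothetical sketch in C for CUDA. The kernel names foo and bar follow the slide; the array size, block size, and the work the kernels do are invented for illustration. It shows per-thread locals, per-block __shared__ memory with __syncthreads() as the local barrier, per-device global memory, and the implicit global barrier between successive kernel launches on the same stream.

    #include <cuda_runtime.h>

    __global__ void foo(float* data)
    {
        __shared__ float tile[256];            // per-block shared memory (block size assumed 256)
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        float x = data[i];                     // per-thread local variable
        tile[threadIdx.x] = x;
        __syncthreads();                       // local barrier: all threads in this block

        // read a neighbour's value from shared memory, write to global memory
        data[i] = tile[(threadIdx.x + 1) % blockDim.x];
    }

    __global__ void bar(float* data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] *= 2.0f;                       // consumes what foo() left in global memory
    }

    int main(void)
    {
        const int N = 1024;                    // assumed size, multiple of the block size
        float* d;                              // per-device global memory
        cudaMalloc((void**)&d, N * sizeof(float));
        cudaMemset(d, 0, N * sizeof(float));

        foo<<<N / 256, 256>>>(d);
        // Launching bar() after foo() on the same (default) stream acts as a global barrier:
        // bar() does not start until every block of foo() has finished.
        bar<<<N / 256, 256>>>(d);

        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }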
C-for-CUDA Example

Standard C code:

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    // Invoke serial SAXPY kernel
    saxpy_serial(n, 2.0, x, y);

Parallel C code:

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a*x[i] + y[i];
    }

    // Invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);

CUDA Successes
146X – Medical Imaging (U of Utah)
36X – Molecular Dynamics (U of Illinois)
18X – Video Transcoding (Elemental Tech)
50X – Matlab Computing (AccelerEyes)
100X – Astrophysics (RIKEN)
149X – Financial simulation (Oxford)
47X – Linear Algebra (Universidad Jaime)
20X – 3D Ultrasound (Techniscan)
130X – Quantum Chemistry (U of Illinois)
30X – Gene Sequencing (U of Maryland)

Now Entering: GPU Computing 3.0
GPU Computing 3.0: an emerging ecosystem
  Hardware & product lines
  Education & research
  Algorithmic sophistication
  Consumer applications
  Cross-platform standards
  High-level languages
The GPU begins to "vanish" behind codes, platforms, and apps

Hardware & Products
GT200: double precision, coalescing, shmem atomics
Product line from laptops to supercomputers

Algorithmic Sophistication
Sort
Sparse matrix
Hash tables
Fast multipole method
Ray tracing (parallel tree traversal)
CPU results from "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Williams et al., Supercomputing 2007

Cross-Platform Standards

ATI's Compute "Solution"

OpenCL vs. C for CUDA
C for CUDA: entry point for developers who prefer high-level C
OpenCL: entry point for developers who want a low-level API
Shared back-end compiler & optimization technology → PTX → GPU

Different Host Code Styles

Calling a C function in nvcc:

    extern "C" void FFT1024( float2 *work, int batch )
    {
        FFT1024_device<<< grid2D(batch), 64 >>>( work, work );
    }

nvcc compiler, CUDA runtime, and the PTX layer manage kernel resources and execution.

OpenCL API-style programming (source: SIGGRAPH sneak preview):

    // create a compute context with GPU device
    context = clCreateContextFromType(CL_DEVICE_TYPE_GPU);
    // create a work-queue
    queue = clCreateWorkQueue(context, NULL, NULL, 0);
    // allocate the buffer memory objects
    memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                sizeof(float)*2*num_entries, srcA);
    memobjs[1] = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                sizeof(float)*2*num_entries, NULL);
    // create the compute program
    program = clCreateProgramFromSource(context, 1, &fft1D_1024_kernel_src, NULL);
    // build the compute program executable
    clBuildProgramExecutable(program, false, NULL, NULL);
    // create the compute kernel
    kernel = clCreateKernel(program, "fft1D_1024");
    // create N-D range object with work-item dimensions
    global_work_size[0] = n;
    local_work_size[0] = 64;
    range = clCreateNDRangeContainer(context, 0, 1, global_work_size, local_work_size);
    // set the args values
    clSetKernelArg(kernel,