CUDA C PROGRAMMING GUIDE
PG-02829-001_v5.0 | October 2012
Design Guide

CHANGES FROM VERSION 4.2
‣ Updated Texture Memory and Texture Functions with the new texture object API.
‣ Updated Surface Memory and Surface Functions with the new surface object API.
‣ Updated Concurrent Kernel Execution, Implicit Synchronization, and Overlapping Behavior for devices of compute capability 3.5.
‣ Removed from Table 2 Throughput of Native Arithmetic Instructions the possible code optimization of omitting __syncthreads() when synchronizing threads within a warp.
‣ Removed Synchronization Instruction for devices of compute capability 3.5.
‣ Updated __global__ to mention that __global__ functions are callable from the device for devices of compute capability 3.x (see the CUDA Dynamic Parallelism programming guide for more details).
‣ Mentioned in Device Memory Qualifiers that __device__, __shared__, and __constant__ variables can be declared as external variables when compiling in the separate compilation mode (see the nvcc user manual for a description of this mode).
‣ Mentioned memcpy and memset in Dynamic Global Memory Allocation and Operations.
‣ Added the new functions sincospi(), sincospif(), normcdf(), normcdfinv(), normcdff(), and normcdfinvf() in Standard Functions.
‣ Updated the maximum ULP error for erfcinvf(), sin(), sinpi(), cos(), cospi(), and sincos() in Standard Functions.
‣ Added the new intrinsic __frsqrt_rn() in Intrinsic Functions.
‣ Added a new section, Callbacks, on stream callbacks.

TABLE OF CONTENTS
Chapter 1. Introduction
  1.1 From Graphics Processing to General Purpose Parallel Computing
  1.2 CUDA™: A General-Purpose Parallel Computing Platform and Programming Model
  1.3 A Scalable Programming Model
  1.4 Document Structure
Chapter 2. Programming Model
  2.1 Kernels
  2.2 Thread Hierarchy
  2.3 Memory Hierarchy
  2.4 Heterogeneous Programming
  2.5 Compute Capability
Chapter 3. Programming Interface
  3.1 Compilation with NVCC
    3.1.1 Compilation Workflow
      3.1.1.1 Offline Compilation
      3.1.1.2 Just-in-Time Compilation
    3.1.2 Binary Compatibility
    3.1.3 PTX Compatibility
    3.1.4 Application Compatibility
    3.1.5 C/C++ Compatibility
    3.1.6 64-Bit Compatibility
  3.2 CUDA C Runtime
    3.2.1 Initialization
    3.2.2 Device Memory
    3.2.3 Shared Memory
    3.2.4 Page-Locked Host Memory
      3.2.4.1 Portable Memory
      3.2.4.2 Write-Combining Memory
      3.2.4.3 Mapped Memory
    3.2.5 Asynchronous Concurrent Execution
      3.2.5.1 Concurrent Execution between Host and Device
      3.2.5.2 Overlap of Data Transfer and Kernel Execution
      3.2.5.3 Concurrent Kernel Execution
      3.2.5.4 Concurrent Data Transfers
      3.2.5.5 Streams
      3.2.5.6 Events
      3.2.5.7 Synchronous Calls
    3.2.6 Multi-Device System
      3.2.6.1 Device Enumeration
      3.2.6.2 Device Selection
      3.2.6.3 Stream and Event Behavior
      3.2.6.4 Peer-to-Peer Memory Access
      3.2.6.5 Peer-to-Peer Memory Copy
    3.2.7 Unified Virtual Address Space
    3.2.8 Error Checking
    3.2.9 Call Stack
    3.2.10 Texture and Surface Memory
      3.2.10.1 Texture Memory
      3.2.10.2 Surface Memory
      3.2.10.3 CUDA Arrays
      3.2.10.4 Read/Write Coherency
    3.2.11 Graphics Interoperability
      3.2.11.1 OpenGL Interoperability
      3.2.11.2 Direct3D Interoperability
      3.2.11.3 SLI Interoperability
  3.3 Versioning and Compatibility
  3.4 Compute Modes
  3.5 Mode Switches
  3.6 Tesla Compute Cluster Mode for Windows
Chapter 4. Hardware Implementation
  4.1 SIMT Architecture
  4.2 Hardware Multithreading
Chapter 5. Performance Guidelines
  5.1 Overall Performance Optimization Strategies
  5.2 Maximize Utilization
    5.2.1 Application Level
    5.2.2 Device Level
    5.2.3 Multiprocessor Level
  5.3 Maximize Memory Throughput
    5.3.1 Data Transfer between Host and Device
    5.3.2 Device Memory Accesses
  5.4 Maximize Instruction Throughput
    5.4.1 Arithmetic Instructions
    5.4.2 Control Flow Instructions
    5.4.3 Synchronization Instruction
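The changes list above introduces several new Standard Functions: sincospi()/sincospif(), normcdf()/normcdff(), and normcdfinv()/normcdfinvf(). As a rough host-side sketch of what these names denote, assuming the standard mathematical definitions (this is NOT NVIDIA's device code, which is documented separately with per-function ULP error bounds): sincospi(x) returns sin(πx) and cos(πx) in one call, normcdf is the standard normal CDF, Φ(x) = ½·erfc(−x/√2), and normcdfinv is its inverse.

```python
import math
from statistics import NormalDist

def sincospi(x):
    """sin(pi*x) and cos(pi*x) in one call -- the assumed semantics of
    CUDA's sincospi()/sincospif(), computed with the host math library."""
    return math.sin(math.pi * x), math.cos(math.pi * x)

def normcdf(x):
    """Standard normal CDF -- the assumed semantics of CUDA's
    normcdf()/normcdff(): Phi(x) = 0.5 * erfc(-x / sqrt(2))."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def normcdfinv(p):
    """Inverse standard normal CDF -- the assumed semantics of CUDA's
    normcdfinv()/normcdfinvf(), delegated here to the standard library."""
    return NormalDist().inv_cdf(p)

if __name__ == "__main__":
    s, c = sincospi(0.5)   # sin(pi/2) = 1, cos(pi/2) = 0
    print(round(s, 6), round(c, 6))
    print(normcdf(0.0))    # 0.5 by symmetry of the normal distribution
    print(round(normcdfinv(normcdf(1.0)), 6))  # round-trips back to 1.0
```

Passing x through sin(πx) rather than sin(x·π_approx) is the point of the π-scaled variants: the scaling by π happens inside the function, which avoids the argument-rounding error of multiplying by an approximation of π first.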