CUDA C Programming Guide

CUDA C PROGRAMMING GUIDE
PG-02829-001_v5.0 | October 2012
Design Guide
www.nvidia.com

CHANGES FROM VERSION 4.2

‣ Updated Texture Memory and Texture Functions with the new texture object API.
‣ Updated Surface Memory and Surface Functions with the new surface object API.
‣ Updated Concurrent Kernel Execution, Implicit Synchronization, and Overlapping Behavior for devices of compute capability 3.5.
‣ Removed from Table 2 Throughput of Native Arithmetic Instructions the possible code optimization of omitting __syncthreads() when synchronizing threads within a warp.
‣ Removed Synchronization Instruction for devices of compute capability 3.5.
‣ Updated __global__ to mention that __global__ functions are callable from the device for devices of compute capability 3.x (see the CUDA Dynamic Parallelism programming guide for more details).
‣ Mentioned in Device Memory Qualifiers that __device__, __shared__, and __constant__ variables can be declared as external variables when compiling in the separate compilation mode (see the nvcc user manual for a description of this mode).
‣ Mentioned memcpy and memset in Dynamic Global Memory Allocation and Operations.
‣ Added new functions sincospi(), sincospif(), normcdf(), normcdfinv(), normcdff(), and normcdfinvf() in Standard Functions.
‣ Updated the maximum ULP error for erfcinvf(), sin(), sinpi(), cos(), cospi(), and sincos() in Standard Functions.
‣ Added new intrinsic __frsqrt_rn() in Intrinsic Functions.
‣ Added new Section Callbacks on stream callbacks.

TABLE OF CONTENTS

Chapter 1. Introduction
  1.1 From Graphics Processing to General Purpose Parallel Computing
  1.2 CUDA™: A General-Purpose Parallel Computing Platform and Programming Model
  1.3 A Scalable Programming Model
  1.4 Document Structure
Chapter 2. Programming Model
  2.1 Kernels
  2.2 Thread Hierarchy
  2.3 Memory Hierarchy
  2.4 Heterogeneous Programming
  2.5 Compute Capability
Chapter 3. Programming Interface
  3.1 Compilation with NVCC
    3.1.1 Compilation Workflow
      3.1.1.1 Offline Compilation
      3.1.1.2 Just-in-Time Compilation
    3.1.2 Binary Compatibility
    3.1.3 PTX Compatibility
    3.1.4 Application Compatibility
    3.1.5 C/C++ Compatibility
    3.1.6 64-Bit Compatibility
  3.2 CUDA C Runtime
    3.2.1 Initialization
    3.2.2 Device Memory
    3.2.3 Shared Memory
    3.2.4 Page-Locked Host Memory
      3.2.4.1 Portable Memory
      3.2.4.2 Write-Combining Memory
      3.2.4.3 Mapped Memory
    3.2.5 Asynchronous Concurrent Execution
      3.2.5.1 Concurrent Execution between Host and Device
      3.2.5.2 Overlap of Data Transfer and Kernel Execution
      3.2.5.3 Concurrent Kernel Execution
      3.2.5.4 Concurrent Data Transfers
      3.2.5.5 Streams
      3.2.5.6 Events
      3.2.5.7 Synchronous Calls
    3.2.6 Multi-Device System
      3.2.6.1 Device Enumeration
      3.2.6.2 Device Selection
      3.2.6.3 Stream and Event Behavior
      3.2.6.4 Peer-to-Peer Memory Access
      3.2.6.5 Peer-to-Peer Memory Copy
    3.2.7 Unified Virtual Address Space
    3.2.8 Error Checking
    3.2.9 Call Stack
    3.2.10 Texture and Surface Memory
      3.2.10.1 Texture Memory
      3.2.10.2 Surface Memory
      3.2.10.3 CUDA Arrays
      3.2.10.4 Read/Write Coherency
    3.2.11 Graphics Interoperability
      3.2.11.1 OpenGL Interoperability
      3.2.11.2 Direct3D Interoperability
      3.2.11.3 SLI Interoperability
  3.3 Versioning and Compatibility
  3.4 Compute Modes
  3.5 Mode Switches
  3.6 Tesla Compute Cluster Mode for Windows
Chapter 4. Hardware Implementation
  4.1 SIMT Architecture
  4.2 Hardware Multithreading
Chapter 5. Performance Guidelines
  5.1 Overall Performance Optimization Strategies
  5.2 Maximize Utilization
    5.2.1 Application Level
    5.2.2 Device Level
    5.2.3 Multiprocessor Level
  5.3 Maximize Memory Throughput
    5.3.1 Data Transfer between Host and Device
    5.3.2 Device Memory Accesses
  5.4 Maximize Instruction Throughput
    5.4.1 Arithmetic Instructions
    5.4.2 Control Flow Instructions
    5.4.3 Synchronization Instruction
