CUDA C Programming Guide
CUDA C PROGRAMMING GUIDE
PG-02829-001_v7.5 | September 2015
Design Guide

CHANGES FROM VERSION 7.0
‣ Updated C/C++ Language Support:
  ‣ Added new section C++11 Language Features.
  ‣ Clarified that values of const-qualified variables with builtin floating-point types cannot be used directly in device code when the Microsoft compiler is used as the host compiler.
  ‣ Documented the extended lambda feature.
  ‣ Documented that typeid, std::type_info, and dynamic_cast are only supported in host code.
  ‣ Documented the restrictions on trigraphs and digraphs.
  ‣ Clarified the conditions under which layout mismatch can occur on Windows.
‣ Updated Table 12 to mention support of half-precision floating-point operations on devices of compute capability 5.3.
‣ Updated Table 2 with throughput for half-precision floating-point instructions.
‣ Added compute capability 5.3 to Table 13.
‣ Added the maximum number of resident grids per device to Table 13.
‣ Clarified the definition of __threadfence() in Memory Fence Functions.
‣ Mentioned in Atomic Functions that atomic functions do not act as memory fences.

TABLE OF CONTENTS
Chapter 1. Introduction
1.1. From Graphics Processing to General Purpose Parallel Computing
1.2. CUDA®: A General-Purpose Parallel Computing Platform and Programming Model
1.3. A Scalable Programming Model
1.4. Document Structure
Chapter 2. Programming Model
2.1. Kernels
2.2. Thread Hierarchy
2.3. Memory Hierarchy
2.4. Heterogeneous Programming
2.5. Compute Capability
Chapter 3. Programming Interface
3.1. Compilation with NVCC
3.1.1. Compilation Workflow
3.1.1.1. Offline Compilation
3.1.1.2. Just-in-Time Compilation
3.1.2. Binary Compatibility
3.1.3. PTX Compatibility
3.1.4. Application Compatibility
3.1.5. C/C++ Compatibility
3.1.6. 64-Bit Compatibility
3.2. CUDA C Runtime
3.2.1. Initialization
3.2.2. Device Memory
3.2.3. Shared Memory
3.2.4. Page-Locked Host Memory
3.2.4.1. Portable Memory
3.2.4.2. Write-Combining Memory
3.2.4.3. Mapped Memory
3.2.5. Asynchronous Concurrent Execution
3.2.5.1. Concurrent Execution between Host and Device
3.2.5.2. Concurrent Kernel Execution
3.2.5.3. Overlap of Data Transfer and Kernel Execution
3.2.5.4. Concurrent Data Transfers
3.2.5.5. Streams
3.2.5.6. Events
3.2.5.7. Synchronous Calls
3.2.6. Multi-Device System
3.2.6.1. Device Enumeration
3.2.6.2. Device Selection
3.2.6.3. Stream and Event Behavior
3.2.6.4. Peer-to-Peer Memory Access
3.2.6.5. Peer-to-Peer Memory Copy
3.2.7. Unified Virtual Address Space
3.2.8. Interprocess Communication
3.2.9. Error Checking
3.2.10. Call Stack
3.2.11. Texture and Surface Memory
3.2.11.1. Texture Memory
3.2.11.2. Surface Memory
3.2.11.3. CUDA Arrays
3.2.11.4. Read/Write Coherency
3.2.12. Graphics Interoperability
3.2.12.1. OpenGL Interoperability
3.2.12.2. Direct3D Interoperability
3.2.12.3. SLI Interoperability
3.3. Versioning and Compatibility
3.4. Compute Modes
3.5. Mode Switches
3.6. Tesla Compute Cluster Mode for Windows
Chapter 4. Hardware Implementation
4.1. SIMT Architecture
4.2. Hardware Multithreading
Chapter 5. Performance Guidelines
5.1. Overall Performance Optimization Strategies
5.2. Maximize Utilization
5.2.1. Application Level
5.2.2. Device Level
5.2.3. Multiprocessor Level
5.2.3.1. Occupancy Calculator
5.3. Maximize Memory Throughput
5.3.1. Data Transfer between Host and Device
5.3.2. Device Memory Accesses
5.4. Maximize Instruction Throughput
5.4.1. Arithmetic Instructions
5.4.2. Control Flow Instructions
5.4.3. Synchronization Instruction
Appendix A. CUDA-Enabled GPUs