CUDA C Programming Guide

PG-02829-001_v10.1 | August 2019 | Design Guide

Changes from Version 9.0
‣ Documented the restriction that operator overloads cannot be __global__ functions, in Operator Function.
‣ Removed the guidance to break 8-byte shuffles into two 4-byte instructions; 8-byte shuffle variants are provided since CUDA 9.0. See Warp Shuffle Functions (and the sketch after the table of contents).
‣ Passing __restrict__ references to __global__ functions is now supported. Updated the comment in __global__ functions and function templates.
‣ Documented CUDA_ENABLE_CRC_CHECK in CUDA Environment Variables.
‣ Warp matrix functions now support matrix products with m=32, n=8, k=16 and m=8, n=32, k=16 in addition to m=n=k=16 (see the sketch after the table of contents).
‣ Added new Unified Memory sections: System Allocator, Hardware Coherency, Access Counters.

Table of Contents

Chapter 1. Introduction
  1.1. From Graphics Processing to General Purpose Parallel Computing
  1.2. CUDA®: A General-Purpose Parallel Computing Platform and Programming Model
  1.3. A Scalable Programming Model
  1.4. Document Structure
Chapter 2. Programming Model
  2.1. Kernels
  2.2. Thread Hierarchy
  2.3. Memory Hierarchy
  2.4. Heterogeneous Programming
  2.5. Compute Capability
Chapter 3. Programming Interface
  3.1. Compilation with NVCC
    3.1.1. Compilation Workflow
      3.1.1.1. Offline Compilation
      3.1.1.2. Just-in-Time Compilation
    3.1.2. Binary Compatibility
    3.1.3. PTX Compatibility
    3.1.4. Application Compatibility
    3.1.5. C/C++ Compatibility
    3.1.6. 64-Bit Compatibility
  3.2. CUDA C Runtime
    3.2.1. Initialization
    3.2.2. Device Memory
    3.2.3. Shared Memory
    3.2.4. Page-Locked Host Memory
      3.2.4.1. Portable Memory
      3.2.4.2. Write-Combining Memory
      3.2.4.3. Mapped Memory
    3.2.5. Asynchronous Concurrent Execution
      3.2.5.1. Concurrent Execution between Host and Device
      3.2.5.2. Concurrent Kernel Execution
      3.2.5.3. Overlap of Data Transfer and Kernel Execution
      3.2.5.4. Concurrent Data Transfers
      3.2.5.5. Streams
      3.2.5.6. Graphs
      3.2.5.7. Events
      3.2.5.8. Synchronous Calls
    3.2.6. Multi-Device System
      3.2.6.1. Device Enumeration
      3.2.6.2. Device Selection
      3.2.6.3. Stream and Event Behavior
      3.2.6.4. Peer-to-Peer Memory Access
      3.2.6.5. Peer-to-Peer Memory Copy
    3.2.7. Unified Virtual Address Space
    3.2.8. Interprocess Communication
    3.2.9. Error Checking
    3.2.10. Call Stack
    3.2.11. Texture and Surface Memory
      3.2.11.1. Texture Memory
      3.2.11.2. Surface Memory
      3.2.11.3. CUDA Arrays
      3.2.11.4. Read/Write Coherency
    3.2.12. Graphics Interoperability
      3.2.12.1. OpenGL Interoperability
      3.2.12.2. Direct3D Interoperability
      3.2.12.3. SLI Interoperability
  3.3. Versioning and Compatibility
  3.4. Compute Modes
  3.5. Mode Switches
  3.6. Tesla Compute Cluster Mode for Windows
Chapter 4. Hardware Implementation
  4.1. SIMT Architecture
  4.2. Hardware Multithreading
Chapter 5. Performance Guidelines
  5.1. Overall Performance Optimization Strategies
  5.2. Maximize Utilization
    5.2.1. Application Level
    5.2.2. Device Level
    5.2.3. Multiprocessor Level
      5.2.3.1. Occupancy Calculator
  5.3. Maximize Memory Throughput
    5.3.1. Data Transfer between Host and Device
    5.3.2. Device Memory Accesses
  5.4. Maximize Instruction Throughput
    5.4.1. Arithmetic Instructions
    5.4.2. Control Flow Instructions
    5.4.3. Synchronization Instruction
Appendix A. CUDA-Enabled GPUs
Appendix B. C Language Extensions
  B.1. Function Execution Space Specifiers
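
Two of the changes listed above lend themselves to brief illustration. The sketches below are written for this summary and are not excerpts from the guide itself; the kernel names, launch assumptions, and values are hypothetical, and both assume compilation with nvcc for a suitable architecture (compute capability 7.0 or higher for the warp matrix example).

The 8-byte warp shuffle item means a double can be shuffled directly with __shfl_sync, with no manual split into two 4-byte shuffles. A minimal sketch, assuming the kernel is launched with a block size that is a multiple of 32 so the full-warp mask is valid:

    // Hypothetical kernel: broadcast lane 0's double value to every lane of its warp.
    __global__ void broadcastDouble(const double *in, double *out)
    {
        unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
        double v = in[i];
        // 8-byte shuffle variant, available directly since CUDA 9.0.
        double fromLane0 = __shfl_sync(0xffffffff, v, 0);
        out[i] = fromLane0;
    }

For the new warp matrix shapes, the fragment dimensions and leading dimensions follow from the chosen m, n, k. A minimal sketch for m=32, n=8, k=16, in which one warp multiplies a 32x16 row-major half-precision A by a 16x8 column-major half-precision B into a 32x8 single-precision accumulator:

    #include <mma.h>
    using namespace nvcuda;

    // Hypothetical kernel: a single warp computes C = A * B for one 32x8 tile.
    __global__ void wmma_32x8x16(const half *a, const half *b, float *c)
    {
        wmma::fragment<wmma::matrix_a, 32, 8, 16, half, wmma::row_major> aFrag;
        wmma::fragment<wmma::matrix_b, 32, 8, 16, half, wmma::col_major> bFrag;
        wmma::fragment<wmma::accumulator, 32, 8, 16, float> cFrag;

        wmma::fill_fragment(cFrag, 0.0f);
        wmma::load_matrix_sync(aFrag, a, 16);   // leading dimension = k for row-major A
        wmma::load_matrix_sync(bFrag, b, 16);   // leading dimension = k for column-major B
        wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);
        wmma::store_matrix_sync(c, cFrag, 8, wmma::mem_row_major);  // leading dimension = n
    }

The other added shape, m=8, n=32, k=16, follows the same pattern with the 32 and 8 swapped in the fragment template parameters and the leading dimension of the stored C tile adjusted to 32.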

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous (not logged in)
  • File Pages
    316 pages
  • File Size
    -

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not to be used for commercial purposes outside of approved use cases.
  • Not to be used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For questions, suggestions, or problems, please contact us.