CUDA C++ Programming Guide

Design Guide PG-02829-001_v11.4 | September 2021

Changes from Version 11.3
‣ Added Graph Memory Nodes.
‣ Formalized Asynchronous SIMT Programming Model.

Table of Contents

Chapter 1. Introduction
  1.1. The Benefits of Using GPUs
  1.2. CUDA®: A General-Purpose Parallel Computing Platform and Programming Model
  1.3. A Scalable Programming Model
  1.4. Document Structure

Chapter 2. Programming Model
  2.1. Kernels
  2.2. Thread Hierarchy
  2.3. Memory Hierarchy
  2.4. Heterogeneous Programming
  2.5. Asynchronous SIMT Programming Model
    2.5.1. Asynchronous Operations
  2.6. Compute Capability

Chapter 3. Programming Interface
  3.1. Compilation with NVCC
    3.1.1. Compilation Workflow
      3.1.1.1. Offline Compilation
      3.1.1.2. Just-in-Time Compilation
    3.1.2. Binary Compatibility
    3.1.3. PTX Compatibility
    3.1.4. Application Compatibility
    3.1.5. C++ Compatibility
    3.1.6. 64-Bit Compatibility
  3.2. CUDA Runtime
    3.2.1. Initialization
    3.2.2. Device Memory
    3.2.3. Device Memory L2 Access Management
      3.2.3.1. L2 cache Set-Aside for Persisting Accesses
      3.2.3.2. L2 Policy for Persisting Accesses
      3.2.3.3. L2 Access Properties
      3.2.3.4. L2 Persistence Example
      3.2.3.5. Reset L2 Access to Normal
      3.2.3.6. Manage Utilization of L2 set-aside cache
      3.2.3.7. Query L2 cache Properties
      3.2.3.8. Control L2 Cache Set-Aside Size for Persisting Memory Access
    3.2.4. Shared Memory
    3.2.5. Page-Locked Host Memory
      3.2.5.1. Portable Memory
      3.2.5.2. Write-Combining Memory
      3.2.5.3. Mapped Memory
    3.2.6. Asynchronous Concurrent Execution
      3.2.6.1. Concurrent Execution between Host and Device
      3.2.6.2. Concurrent Kernel Execution
      3.2.6.3. Overlap of Data Transfer and Kernel Execution
      3.2.6.4. Concurrent Data Transfers
      3.2.6.5. Streams
      3.2.6.6. CUDA Graphs
      3.2.6.7. Events
      3.2.6.8. Synchronous Calls
    3.2.7. Multi-Device System
      3.2.7.1. Device Enumeration
      3.2.7.2. Device Selection
      3.2.7.3. Stream and Event Behavior
      3.2.7.4. Peer-to-Peer Memory Access
      3.2.7.5. Peer-to-Peer Memory Copy
    3.2.8. Unified Virtual Address Space
    3.2.9. Interprocess Communication
    3.2.10. Error Checking
    3.2.11. Call Stack
    3.2.12. Texture and Surface Memory
      3.2.12.1. Texture Memory
      3.2.12.2. Surface Memory
      3.2.12.3. CUDA Arrays
      3.2.12.4. Read/Write Coherency
    3.2.13. Graphics Interoperability
      3.2.13.1. OpenGL Interoperability
      3.2.13.2. Direct3D Interoperability
      3.2.13.3. SLI Interoperability
    3.2.14. External Resource Interoperability
      3.2.14.1. Vulkan Interoperability
      3.2.14.2. OpenGL Interoperability
      3.2.14.3. Direct3D 12 Interoperability
      3.2.14.4. Direct3D 11 Interoperability
