CUDA C Programming Guide
CUDA C PROGRAMMING GUIDE
PG-02829-001_v10.1 | August 2019
Design Guide

CHANGES FROM VERSION 9.0
‣ Documented the restriction that operator overloads cannot be __global__ functions, in Operator Function.
‣ Removed the guidance to break 8-byte shuffles into two 4-byte instructions; 8-byte shuffle variants have been provided since CUDA 9.0. See Warp Shuffle Functions.
‣ Passing __restrict__ references to __global__ functions is now supported. Updated the comment in __global__ functions and function templates.
‣ Documented CUDA_ENABLE_CRC_CHECK in CUDA Environment Variables.
‣ Warp matrix functions now support matrix products with m=32, n=8, k=16 and m=8, n=32, k=16 in addition to m=n=k=16.
‣ Added new Unified Memory sections: System Allocator, Hardware Coherency, Access Counters.

TABLE OF CONTENTS
Chapter 1. Introduction
  1.1. From Graphics Processing to General Purpose Parallel Computing
  1.2. CUDA®: A General-Purpose Parallel Computing Platform and Programming Model
  1.3. A Scalable Programming Model
  1.4. Document Structure
Chapter 2. Programming Model
  2.1. Kernels
  2.2. Thread Hierarchy
  2.3. Memory Hierarchy
  2.4. Heterogeneous Programming
  2.5. Compute Capability
Chapter 3. Programming Interface
  3.1. Compilation with NVCC
    3.1.1. Compilation Workflow
      3.1.1.1. Offline Compilation
      3.1.1.2. Just-in-Time Compilation
    3.1.2. Binary Compatibility
    3.1.3. PTX Compatibility
    3.1.4. Application Compatibility
    3.1.5. C/C++ Compatibility
    3.1.6. 64-Bit Compatibility
  3.2. CUDA C Runtime
    3.2.1. Initialization
    3.2.2. Device Memory
    3.2.3. Shared Memory
    3.2.4. Page-Locked Host Memory
      3.2.4.1. Portable Memory
      3.2.4.2. Write-Combining Memory
      3.2.4.3. Mapped Memory
    3.2.5. Asynchronous Concurrent Execution
      3.2.5.1. Concurrent Execution between Host and Device
      3.2.5.2. Concurrent Kernel Execution
      3.2.5.3. Overlap of Data Transfer and Kernel Execution
      3.2.5.4. Concurrent Data Transfers
      3.2.5.5. Streams
      3.2.5.6. Graphs
      3.2.5.7. Events
      3.2.5.8. Synchronous Calls
    3.2.6. Multi-Device System
      3.2.6.1. Device Enumeration
      3.2.6.2. Device Selection
      3.2.6.3. Stream and Event Behavior
      3.2.6.4. Peer-to-Peer Memory Access
      3.2.6.5. Peer-to-Peer Memory Copy
    3.2.7. Unified Virtual Address Space
    3.2.8. Interprocess Communication
    3.2.9. Error Checking
    3.2.10. Call Stack
    3.2.11. Texture and Surface Memory
      3.2.11.1. Texture Memory
      3.2.11.2. Surface Memory
      3.2.11.3. CUDA Arrays
      3.2.11.4. Read/Write Coherency
    3.2.12. Graphics Interoperability
      3.2.12.1. OpenGL Interoperability
      3.2.12.2. Direct3D Interoperability
      3.2.12.3. SLI Interoperability
  3.3. Versioning and Compatibility
  3.4. Compute Modes
  3.5. Mode Switches
  3.6. Tesla Compute Cluster Mode for Windows
Chapter 4. Hardware Implementation
  4.1. SIMT Architecture
  4.2. Hardware Multithreading
Chapter 5. Performance Guidelines
  5.1. Overall Performance Optimization Strategies
  5.2. Maximize Utilization
    5.2.1. Application Level
    5.2.2. Device Level
    5.2.3. Multiprocessor Level
      5.2.3.1. Occupancy Calculator
  5.3. Maximize Memory Throughput
    5.3.1. Data Transfer between Host and Device
    5.3.2. Device Memory Accesses
  5.4. Maximize Instruction Throughput
    5.4.1. Arithmetic Instructions
    5.4.2. Control Flow Instructions
    5.4.3. Synchronization Instruction
Appendix A. CUDA-Enabled GPUs
Appendix B. C Language Extensions
  B.1. Function Execution Space Specifiers
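
The change note above about 8-byte shuffles can be illustrated with a short sketch: since CUDA 9.0 the warp shuffle intrinsics accept 64-bit types directly, so a double no longer needs to be split into two 4-byte shuffles. This is a minimal illustrative example, not code from the guide; the kernel and variable names are invented for illustration.

```cuda
// Sketch: warp-level sum reduction of doubles using the 8-byte
// __shfl_down_sync overload available since CUDA 9.0. Older guidance
// (now removed) was to split the 64-bit value into two 4-byte shuffles.
// Assumes one full warp of 32 threads; names are illustrative.
__global__ void warpReduceSum(const double *in, double *out) {
    double v = in[threadIdx.x];
    const unsigned mask = 0xffffffffu;            // all 32 lanes participate
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(mask, v, offset);   // 8-byte shuffle variant
    if (threadIdx.x == 0)
        *out = v;                                 // lane 0 holds the warp sum
}
```

Note the mandatory lane mask on the `_sync` variants: since CUDA 9.0, the participating lanes must be named explicitly rather than assumed from implicit warp-synchronous execution.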