CUDA C Programming Guide

CUDA C PROGRAMMING GUIDE
PG-02829-001_v10.0 | October 2018
Design Guide

CHANGES FROM VERSION 9.0

‣ Documented restriction that operator-overloads cannot be __global__ functions in Operator Function.
‣ Removed guidance to break 8-byte shuffles into two 4-byte instructions. 8-byte shuffle variants are provided since CUDA 9.0. See Warp Shuffle Functions.
‣ Passing __restrict__ references to __global__ functions is now supported. Updated comment in __global__ functions and function templates.
‣ Documented CUDA_ENABLE_CRC_CHECK in CUDA Environment Variables.
‣ Warp matrix functions now support matrix products with m=32, n=8, k=16 and m=8, n=32, k=16 in addition to m=n=k=16.

www.nvidia.com

TABLE OF CONTENTS

Chapter 1. Introduction
  1.1. From Graphics Processing to General Purpose Parallel Computing
  1.2. CUDA®: A General-Purpose Parallel Computing Platform and Programming Model
  1.3. A Scalable Programming Model
  1.4. Document Structure
Chapter 2. Programming Model
  2.1. Kernels
  2.2. Thread Hierarchy
  2.3. Memory Hierarchy
  2.4. Heterogeneous Programming
  2.5. Compute Capability
Chapter 3. Programming Interface
  3.1. Compilation with NVCC
    3.1.1. Compilation Workflow
      3.1.1.1. Offline Compilation
      3.1.1.2. Just-in-Time Compilation
    3.1.2. Binary Compatibility
    3.1.3. PTX Compatibility
    3.1.4. Application Compatibility
    3.1.5. C/C++ Compatibility
    3.1.6. 64-Bit Compatibility
  3.2. CUDA C Runtime
    3.2.1. Initialization
    3.2.2. Device Memory
    3.2.3. Shared Memory
    3.2.4. Page-Locked Host Memory
      3.2.4.1. Portable Memory
      3.2.4.2. Write-Combining Memory
      3.2.4.3. Mapped Memory
    3.2.5. Asynchronous Concurrent Execution
      3.2.5.1. Concurrent Execution between Host and Device
      3.2.5.2. Concurrent Kernel Execution
      3.2.5.3. Overlap of Data Transfer and Kernel Execution
      3.2.5.4. Concurrent Data Transfers
      3.2.5.5. Streams
      3.2.5.6. Graphs
      3.2.5.7. Events
      3.2.5.8. Synchronous Calls
    3.2.6. Multi-Device System
      3.2.6.1. Device Enumeration
      3.2.6.2. Device Selection
      3.2.6.3. Stream and Event Behavior
      3.2.6.4. Peer-to-Peer Memory Access
      3.2.6.5. Peer-to-Peer Memory Copy
    3.2.7. Unified Virtual Address Space
    3.2.8. Interprocess Communication
    3.2.9. Error Checking
    3.2.10. Call Stack
    3.2.11. Texture and Surface Memory
      3.2.11.1. Texture Memory
      3.2.11.2. Surface Memory
      3.2.11.3. CUDA Arrays
      3.2.11.4. Read/Write Coherency
    3.2.12. Graphics Interoperability
      3.2.12.1. OpenGL Interoperability
      3.2.12.2. Direct3D Interoperability
      3.2.12.3. SLI Interoperability
  3.3. Versioning and Compatibility
  3.4. Compute Modes
  3.5. Mode Switches
  3.6. Tesla Compute Cluster Mode for Windows
Chapter 4. Hardware Implementation
  4.1. SIMT Architecture
  4.2. Hardware Multithreading
Chapter 5. Performance Guidelines
  5.1. Overall Performance Optimization Strategies
  5.2. Maximize Utilization
    5.2.1. Application Level
    5.2.2. Device Level
    5.2.3. Multiprocessor Level
      5.2.3.1. Occupancy Calculator
  5.3. Maximize Memory Throughput
    5.3.1. Data Transfer between Host and Device
    5.3.2. Device Memory Accesses
  5.4. Maximize Instruction Throughput
    5.4.1. Arithmetic Instructions
    5.4.2. Control Flow Instructions
    5.4.3. Synchronization Instruction
Appendix A. CUDA-Enabled GPUs
Appendix B. C Language Extensions
  B.1. Function Execution Space Specifiers
    B.1.1. __device__
