Cuda C Best Practices Guide

CUDA C BEST PRACTICES GUIDE DG-05603-001_v9.0 | June 2018 Design Guide TABLE OF CONTENTS Preface............................................................................................................ vii What Is This Document?..................................................................................... vii Who Should Read This Guide?...............................................................................vii Assess, Parallelize, Optimize, Deploy.....................................................................viii Assess........................................................................................................ viii Parallelize.................................................................................................... ix Optimize...................................................................................................... ix Deploy.........................................................................................................ix Recommendations and Best Practices.......................................................................x Chapter 1. Assessing Your Application.......................................................................1 Chapter 2. Heterogeneous Computing.......................................................................2 2.1. Differences between Host and Device................................................................ 2 2.2. What Runs on a CUDA-Enabled Device?...............................................................3 Chapter 3. Application Profiling.............................................................................. 5 3.1. Profile.......................................................................................................5 3.1.1. Creating the Profile.................................................................................5 3.1.2. Identifying Hotspots.................................................................................6 3.1.3. Understanding Scaling.............................................................................. 6 3.1.3.1. Strong Scaling and Amdahl's Law............................................................7 3.1.3.2. Weak Scaling and Gustafson's Law.......................................................... 7 3.1.3.3. Applying Strong and Weak Scaling.......................................................... 8 Chapter 4. Parallelizing Your Application................................................................... 9 Chapter 5. Getting Started................................................................................... 10 5.1. Parallel Libraries........................................................................................ 10 5.2. Parallelizing Compilers................................................................................. 10 5.3. Coding to Expose Parallelism..........................................................................11 Chapter 6. Getting the Right Answer...................................................................... 12 6.1. Verification............................................................................................... 12 6.1.1. Reference Comparison............................................................................ 12 6.1.2. Unit Testing......................................................................................... 12 6.2. Debugging.................................................................................................13 6.3. Numerical Accuracy and Precision....................................................................13 6.3.1. Single vs. Double Precision.......................................................................13 6.3.2. Floating Point Math Is not Associative..........................................................14 6.3.3. Promotions to Doubles and Truncations to Floats............................................ 14 6.3.4. IEEE 754 Compliance.............................................................................. 14 6.3.5. x86 80-bit Computations..........................................................................14 Chapter 7. Optimizing CUDA Applications.................................................................16 Chapter 8. Performance Metrics............................................................................ 17 8.1. Timing..................................................................................................... 17 www.nvidia.com CUDA C Best Practices Guide DG-05603-001_v9.0 | ii 8.1.1. Using CPU Timers.................................................................................. 17 8.1.2. Using CUDA GPU Timers.......................................................................... 18 8.2. Bandwidth................................................................................................ 18 8.2.1. Theoretical Bandwidth Calculation............................................................. 19 8.2.2. Effective Bandwidth Calculation................................................................ 19 8.2.3. Throughput Reported by Visual Profiler........................................................20 Chapter 9. Memory Optimizations.......................................................................... 21 9.1. Data Transfer Between Host and Device............................................................ 21 9.1.1. Pinned Memory..................................................................................... 22 9.1.2. Asynchronous and Overlapping Transfers with Computation................................ 22 9.1.3. Zero Copy........................................................................................... 25 9.1.4. Unified Virtual Addressing........................................................................26 9.2. Device Memory Spaces................................................................................. 26 9.2.1. Coalesced Access to Global Memory............................................................28 9.2.1.1. A Simple Access Pattern.....................................................................28 9.2.1.2. A Sequential but Misaligned Access Pattern..............................................29 9.2.1.3. Effects of Misaligned Accesses............................................................. 30 9.2.1.4. Strided Accesses.............................................................................. 31 9.2.2. Shared Memory..................................................................................... 33 9.2.2.1. Shared Memory and Memory Banks........................................................33 9.2.2.2. Shared Memory in Matrix Multiplication (C=AB)......................................... 34 9.2.2.3. Shared Memory in Matrix Multiplication (C=AAT)........................................37 9.2.3. Local Memory.......................................................................................39 9.2.4. Texture Memory.................................................................................... 40 9.2.4.1. Additional Texture Capabilities.............................................................40 9.2.5. Constant Memory...................................................................................41 9.2.6. Registers............................................................................................. 41 9.2.6.1. Register Pressure............................................................................. 41 9.3. Allocation................................................................................................. 41 Chapter 10. Execution Configuration Optimizations.................................................... 42 10.1. Occupancy...............................................................................................42 10.1.1. Calculating Occupancy...........................................................................43 10.2. Concurrent Kernel Execution........................................................................ 44 10.3. Multiple contexts...................................................................................... 45 10.4. Hiding Register Dependencies....................................................................... 46 10.5. Thread and Block Heuristics......................................................................... 46 10.6. Effects of Shared Memory............................................................................ 47 Chapter 11. Instruction Optimization...................................................................... 49 11.1. Arithmetic Instructions................................................................................49 11.1.1. Division Modulo Operations..................................................................... 49 11.1.2. Reciprocal Square Root.......................................................................... 49 11.1.3. Other Arithmetic Instructions.................................................................. 50 11.1.4. Exponentiation With Small Fractional Arguments........................................... 50 www.nvidia.com CUDA C Best Practices Guide DG-05603-001_v9.0 | iii 11.1.5. Math Libraries..................................................................................... 51 11.1.6. Precision-related Compiler Flags...............................................................53 11.2. Memory Instructions................................................................................... 53 Chapter 12. Control Flow.....................................................................................54

Cuda C Best Practices Guide

45-Year CPU Evolution: One Law and Two Equations

Multi-Cycle Datapathoperation

CS2504: Computer Organization

Analysis of Body Bias Control Using Overhead Conditions for Real Time Systems: a Practical Approach∗

A Performance Analysis Tool for Intel SGX Enclaves

ESC-470: ARM 9 Instruction Set Architecture with Performance

A Characterization of Processor Performance in the VAX-1 L/780

Autotuning GPU Kernels Via Static and Predictive Analysis

The Anatomy of the ARM Cortex-M0+ Processor

Microarchitecture-Level Power-Performance Simulators: Modeling, Validation, and Impact on Design

Assembly Language Programming (Part 1) 2 7

V850e/Ms1 , V850e/Ms2