CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES
Stephen Jones, GTC 2017
The art of doing more with less
RULE #1: DON'T TRY TOO HARD

[Figure: performance vs. time. Expecting a straight climb to peak performance is an unrealistic effort/reward trade-off. The real curve ramps up slowly, rises steeply, then flattens out well below peak: reduce the slow ramp-up time, get on the steep part of the curve, and don't waste time grinding along the flat tail. Annotations along the way: premature excitement; the trough of despair ("wait, it's 4 weeks and going slower??"), where most people give up; and the point of diminishing returns ("this is it?"). Before that point, hire an intern; beyond it, here be ninjas.]
PERFORMANCE CONSTRAINTS

[Chart: where GPU performance typically goes]
▪ Memory: 75%
▪ Compute Intensity: 10%
▪ Occupancy: 10%
▪ Divergence: 3%
▪ Instruction: 2%
PERFORMANCE CONSTRAINTS

[Chart: the same breakdown with the memory slice expanded]
▪ CPU <> GPU Transfer
▪ Coalescence
▪ Divergent Access
▪ Cache Inefficiency
▪ Register Spilling
(alongside Compute Intensity, Divergence, Instruction and Occupancy as before)

MEMORY ORDERS OF MAGNITUDE
[Diagram: the path from SM registers and shared memory, through L1 and L2 cache, to GDRAM, then over the PCIe bus to CPU DRAM.]

Level                          Bandwidth
SM registers / shared memory   20,000 GB/sec
L2 cache                       2,000 GB/sec
GDRAM (GPU memory)             300 GB/sec
PCIe bus                       16 GB/sec
CPU DRAM                       150 GB/sec
TALK BREAKDOWN
In no particular order

1. Why Didn't I Think Of That?
2. CPU Memory to GPU Memory (the PCIe Bus)
3. GPU Memory to the SM
4. Registers & Shared Memory
5. Occupancy, Divergence & Latency
6. Weird Things You Never Thought Of (and probably shouldn't try)
WHERE TO BEGIN?
THE OBVIOUS

Start with the NVIDIA Visual Profiler.

CPU <> GPU DATA MOVEMENT
PCIe ISSUES

[Diagram: GPU registers and shared memory on one side, CPU memory on the other, joined by the PCIe bus at 16 GB/sec.]

Moving data over the PCIe bus.
PIN YOUR CPU MEMORY

[Diagram sequence: copying data from CPU memory to GPU memory via the DMA controller.]
▪ Transfers are performed by a DMA controller that reads CPU memory directly.
▪ Ordinary (pageable) CPU memory can be swapped out by the OS at any time, so the DMA engine cannot safely read it in place.
▪ The CPU therefore allocates and pins a page, copies the data locally into that pinned buffer, and only then starts the DMA: an extra host-side copy on every transfer.
▪ If you pin your memory yourself, the DMA controller reads the user's data directly and the staging copy disappears:

    cudaHostAlloc( &data, size, cudaHostAllocMapped );
    cudaHostRegister( data, size, cudaHostRegisterDefault );

[Chart: transfer performance with pinned memory.]
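A minimal sketch of the pinned-copy pattern in practice, assuming a float buffer and a caller-provided stream (the names are illustrative):

    #include <cuda_runtime.h>

    void copyWithPinnedMemory(size_t bytes, cudaStream_t stream) {
        float *hostData, *deviceData;

        // Page-locked host allocation: the DMA engine can read it in place,
        // so no hidden staging copy is needed.
        cudaHostAlloc((void **)&hostData, bytes, cudaHostAllocDefault);
        cudaMalloc((void **)&deviceData, bytes);

        // cudaMemcpyAsync is only truly asynchronous from pinned memory.
        cudaMemcpyAsync(deviceData, hostData, bytes,
                        cudaMemcpyHostToDevice, stream);
        cudaStreamSynchronize(stream);

        cudaFree(deviceData);
        cudaFreeHost(hostData);   // pinned memory is freed with cudaFreeHost
    }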
REMEMBER: PCIe GOES BOTH WAYS

PCIe is full duplex: transfers to and from the GPU can run at the same time, in different streams.
STREAMS & CONCURRENCY
Hiding the cost of data transfer

Operations in a single stream are ordered, but the hardware can copy and compute at the same time.

[Timeline: a single stream runs copy-to-GPU, compute, copy-to-host strictly one after another.]

With two streams, the copy-up and copy-back of one chunk overlap the compute of another chunk, saving time.

[Timeline: two streams, each running copy up / work / copy back; the overlap between them is the saved time.]

You can keep breaking the work into smaller chunks across more streams and keep saving time.

[Chart: total time for 1, 2 and 8 streams: more streams, shorter total time.]
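A sketch of the chunking pattern, assuming pinned host buffers hIn/hOut, device buffers dIn/dOut, and a hypothetical process kernel:

    const int NSTREAMS = 8;
    cudaStream_t streams[NSTREAMS];
    for (int i = 0; i < NSTREAMS; i++)
        cudaStreamCreate(&streams[i]);

    size_t chunk = totalElems / NSTREAMS;   // assume it divides evenly
    for (int i = 0; i < NSTREAMS; i++) {
        size_t off = i * chunk;
        // Each stream copies its chunk up, processes it, and copies it back;
        // the copies of one chunk overlap the compute of another.
        cudaMemcpyAsync(dIn + off, hIn + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[i]);
        process<<<(int)((chunk + 255) / 256), 256, 0, streams[i]>>>(
            dIn + off, dOut + off, (int)chunk);
        cudaMemcpyAsync(hOut + off, dOut + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[i]);
    }
    cudaDeviceSynchronize();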
SMALL PCIe TRANSFERS

PCIe is designed for large data transfers, but fine-grained copy/compute overlap prefers small transfers. So how small can we go?

[Chart: 1, 2, 8 streams, and then too many: beyond a point, smaller chunks stop helping.]

APPARENTLY NOT THAT SMALL
FROM GPU MEMORY TO GPU THREADS

FEEDING THE MACHINE

[Diagram: the same memory hierarchy, now focusing on the path from GPU memory to the SMs.]
USE THE PARALLEL ARCHITECTURE
Hardware is optimized to use all SIMT threads at once

▪ Threads run in groups of 32 (warps).
▪ The L2 cache line is sized to service 32 requests at a time.
▪ High-speed GPU memory works best with linear access.
VECTORIZE MEMORY LOADS
Multi-word as well as multi-thread

[Diagram: with int loads, threads T0-T31 together fill one cache line per fetch.]

[Diagram: with int2 loads, T0-T15 and T16-T31 fill two cache lines in a single fetch.]

[Diagram: with int4 loads, T0-T7, T8-T15, T16-T23 and T24-T31 fill four cache lines in a single fetch.]

[Chart: achieved bandwidth rises with wider loads.]
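A sketch of the 4-wide version, assuming the element count is a multiple of 4 and the pointers are 16-byte aligned (cudaMalloc allocations are):

    __global__ void copyInt4(const int4 *input, int4 *output, int maxVec) {
        int id = threadIdx.x + blockDim.x * blockIdx.x;
        if (id < maxVec) {
            // One 16-byte load and one 16-byte store per thread:
            // a quarter of the memory instructions of the scalar version.
            output[id] = input[id];
        }
    }

    // Launch with a quarter of the threads of the int version:
    // copyInt4<<<(n/4 + 255)/256, 256>>>((const int4 *)in, (int4 *)out, n / 4);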
DO MULTIPLE LOADS PER THREAD
Multi-thread, multi-word AND multi-iteration

One copy per thread (maximum overhead):

    __global__ void copy(int2 *input, int2 *output, int max) {
        int id = threadIdx.x + blockDim.x * blockIdx.x;
        if( id < max ) {
            output[id] = input[id];
        }
    }

Multiple copies per thread (amortize the overhead):

    __global__ void copy(int2 *input, int2 *output, int max, int loadsPerThread) {
        int id = threadIdx.x + blockDim.x * blockIdx.x;
        for( int n = 0; n < loadsPerThread; n++ ) {
            if( id < max ) {
                output[id] = input[id];
            }
            id += blockDim.x * gridDim.x;   // stride to this thread's next element
        }
    }
"MAXIMAL" LAUNCHES ARE BEST
COALESCED MEMORY ACCESS
It's not just good enough to use all SIMT threads

▪ Coalesced: sequential memory accesses are adjacent (1 2 3 4).
▪ Uncoalesced: sequential memory accesses are unassociated (3 1 4 2).
SIMT PENALTIES WHEN NOT COALESCED

    x = data[threadIdx.x];   // one single 32-wide operation
    x = data[rand()];        // 32 separate one-wide operations
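For illustration, a coalesced and a strided read side by side (hypothetical kernels; the stride parameter stands in for any scattered pattern):

    __global__ void coalescedRead(const float *data, float *out) {
        int id = threadIdx.x + blockDim.x * blockIdx.x;
        out[id] = data[id];            // warp reads 32 adjacent words: one transaction
    }

    __global__ void stridedRead(const float *data, float *out, int stride) {
        int id = threadIdx.x + blockDim.x * blockIdx.x;
        out[id] = data[id * stride];   // stride > 1 fragments the warp's request
    }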
SCATTER & GATHER

▪ Gathering: reading randomly, writing sequentially.
▪ Scattering: reading sequentially, writing randomly.

[Diagram: a gather pulls scattered elements 3 1 4 2 into sequential order 1 2 3 4; a scatter pushes sequential 1 2 3 4 out to scattered positions.]
AVOID SCATTER/GATHER IF YOU CAN
SORTING MIGHT BE AN OPTION

If reading non-sequential data is expensive, is it worth sorting it first to make it sequential?

[Diagram: gathering 3 1 4 2 is slow; sorting 2 4 1 3 into 1 2 3 4 allows a fast, coalesced read.]
Even if you're only going to read the data twice, then yes!

PRE-SORTING TURNS OUT TO BE GOOD
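One way to do the sort on-device is Thrust, which ships with the CUDA toolkit. A sketch, assuming integer keys paired with float values:

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>

    void sortForCoalescing(thrust::device_vector<int> &keys,
                           thrust::device_vector<float> &values) {
        // After sorting, passes that walk the data in key order
        // make sequential, coalesced reads of both arrays.
        thrust::sort_by_key(keys.begin(), keys.end(), values.begin());
    }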
DATA LAYOUT: "AOS vs. SOA"
Sometimes you can't just sort your data

Array-of-Structures:

    #define NPTS 1024 * 1024
    struct Coefficients_AOS {
        double u[3];
        double x[3][3];
        double p;
        double rho;
        double eta;
    };
    Coefficients_AOS gridData[NPTS];

Structure-of-Arrays:

    #define NPTS 1024 * 1024
    struct Coefficients_SOA {
        double u[3][NPTS];
        double x[3][3][NPTS];
        double p[NPTS];
        double rho[NPTS];
        double eta[NPTS];
    };
    Coefficients_SOA gridData;

Single-thread code prefers arrays of structures, for cache efficiency; SIMT code prefers structures of arrays, for execution & memory efficiency.
DATA LAYOUT: "AOS vs. SOA"

[Diagram: the Coefficients_AOS structure definition next to its conceptual layout: u0 u1 u2, x00 x01 x02, x10 x11 x12, x20 x21 x22, p, rho, eta.]
AOS: STRIDED ARRAY ACCESS

The GPU reads data one element at a time, but in parallel across the 32 threads of a warp:

    double u0 = gridData[threadIdx.x].u[0];

[Diagram: in the Array-of-Structures memory layout (u0 u1 u2 x00 ... eta, repeated per point), the u0 values read by adjacent threads are a full structure apart: strided, uncoalesced access.]
SOA: COALESCED BUT COMPLEX

The GPU reads data one element at a time, but in parallel across the 32 threads of a warp:

    double u0 = gridData.u[0][threadIdx.x];

[Diagram: in the Structure-of-Arrays memory layout, the u0 values read by adjacent threads sit next to each other: coalesced access, at the cost of a more awkward data structure.]
BLOCK-WIDE LOAD VIA SHARED MEMORY
Read data linearly as bytes; use shared memory to convert to structs

[Diagram, step 1: the block copies its data from device memory to shared memory as a linear, coalesced stream.]

[Diagram, step 2: the threads which own the data then grab their structs from shared memory.]
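A sketch of the two-phase load, using the Coefficients_AOS struct from earlier; the kernel and its output step are illustrative:

    __global__ void loadViaShared(const Coefficients_AOS *input, double *result) {
        extern __shared__ Coefficients_AOS tile[];   // one struct per thread
        const int wordsPerStruct = sizeof(Coefficients_AOS) / sizeof(double);
        const double *src = (const double *)&input[blockIdx.x * blockDim.x];
        double *dst = (double *)tile;

        // Phase 1: the whole block copies its structs as a linear stream
        // of doubles: fully coalesced.
        for (int i = threadIdx.x; i < blockDim.x * wordsPerStruct; i += blockDim.x)
            dst[i] = src[i];
        __syncthreads();

        // Phase 2: each thread grabs its own struct from fast shared memory.
        Coefficients_AOS c = tile[threadIdx.x];
        result[blockIdx.x * blockDim.x + threadIdx.x] = c.p * c.rho;
    }

    // Launch with shared memory sized to one tile of structs:
    // loadViaShared<<<nblocks, 256, 256 * sizeof(Coefficients_AOS)>>>(gridData, result);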
CLEVER AOS/SOA TRICKS
Helps for any data size
HANDY LIBRARY TO HELP YOU

Trove: a utility library for fast AOS/SOA access and transposition.
https://github.com/bryancatanzaro/trove
(AB)USING THE CACHE

MAKING THE MOST OF L2 CACHE

L2 cache is fast (2,000 GB/sec, vs. 300 GB/sec for GDRAM) but small:

Architecture  L2 Cache Size  Total Threads  Cache Bytes per Thread
Kepler        1536 KB        30,720         51
Maxwell       3072 KB        49,152         64
Pascal        4096 KB        114,688        36
TRAINING DEEP NEURAL NETWORKS

LOTS OF PASSES OVER DATA

[Diagram: an input image runs through 3x3, 5x5 and 7x7 convolutions (weights W1, W2, W3, with FFT-based convolution), and the results combine into a prediction: "Cat!"]
MULTI-RESOLUTION CONVOLUTIONS

[Diagram: three separate full-image passes; pass 1: 3x3, pass 2: 5x5, pass 3: 7x7.]

TILED, MULTI-RESOLUTION CONVOLUTION

[Diagram: the image is split into tiles, and each tile gets all three passes in turn.]
▪ Do all 3 passes per-tile.
▪ Size each tile to fit in the L2 cache.

LAUNCHING FEWER THAN MAXIMUM THREADS
SHARED MEMORY: DEFINITELY WORTH IT
USING SHARED MEMORY WISELY

Shared memory is arranged into "banks" to allow concurrent SIMT access:
▪ 32 threads can read simultaneously, so long as they read from separate banks.
▪ Shared memory supports 4-byte and 8-byte bank sizes; see the sketch below.
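A sketch of selecting the bank width with the runtime API (this call takes effect on Kepler-class GPUs; later architectures may ignore it):

    // With double-typed shared memory data, the 8-byte bank mode lets each
    // thread's 8-byte read map to a single bank instead of spanning two.
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);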
STENCIL ALGORITHM

Many algorithms have high data re-use: potentially good for shared memory.

"Stencil" algorithms accumulate data from neighbours onto a central point:
▪ The stencil has width "W" (in the case illustrated, W = 5).
▪ Adjacent threads share (W-1) items of data: good potential for data re-use.
STENCILS IN SHARED MEMORY
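A minimal sketch of a W=5 (radius 2) 1D stencil staged through shared memory, assuming 256-thread blocks:

    #define RADIUS 2   // stencil width W = 2*RADIUS + 1 = 5

    __global__ void stencil1D(const float *in, float *out, int n) {
        __shared__ float tile[256 + 2 * RADIUS];
        int gid = threadIdx.x + blockDim.x * blockIdx.x;
        int lid = threadIdx.x + RADIUS;

        // Load the tile, plus a RADIUS-wide halo on each side, exactly once.
        tile[lid] = (gid < n) ? in[gid] : 0.0f;
        if (threadIdx.x < RADIUS) {
            tile[lid - RADIUS] = (gid >= RADIUS) ? in[gid - RADIUS] : 0.0f;
            tile[lid + blockDim.x] =
                (gid + blockDim.x < n) ? in[gid + blockDim.x] : 0.0f;
        }
        __syncthreads();

        // Each input value is re-used W times from shared memory.
        if (gid < n) {
            float sum = 0.0f;
            for (int k = -RADIUS; k <= RADIUS; k++)
                sum += tile[lid + k];
            out[gid] = sum;
        }
    }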
SIZE MATTERS
PERSISTENT KERNELS
Revisiting the tiled convolutions

Avoid multiple kernel launches by caching in shared memory instead of L2.

Separate kernel launches, with L2 re-use:

    void tiledConvolution() {
        convolution<3><<< numblocks, blockdim, 0, s >>>(ptr, chunkSize);
        convolution<5><<< numblocks, blockdim, 0, s >>>(ptr, chunkSize);
        convolution<7><<< numblocks, blockdim, 0, s >>>(ptr, chunkSize);
    }

A single launch of a persistent kernel, with the data held in shared memory:

    __global__ void convolutionShared(int *data, int count, int sharedelems) {
        extern __shared__ int shdata[];
        shdata[threadIdx.x] = data[threadIdx.x + blockDim.x*blockIdx.x];
        __syncthreads();
        convolve<3>(threadIdx.x, shdata, sharedelems);
        __syncthreads();
        convolve<5>(threadIdx.x, shdata, sharedelems);
        __syncthreads();
        convolve<7>(threadIdx.x, shdata, sharedelems);
    }
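The launch would then pass the tile size as the dynamic shared-memory argument (a sketch, with numblocks, blockdim and s as above):

    int sharedBytes = sharedelems * sizeof(int);
    convolutionShared<<< numblocks, blockdim, sharedBytes, s >>>(ptr, chunkSize, sharedelems);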
OPERATING DIRECTLY FROM CPU MEMORY

Can save memory copies. It's obvious when you think about it...

Copy-then-compute: compute only begins when the first copy has finished, and the task only ends when the second copy (back to the host) has finished.

[Timeline: copy data to GPU, compute, copy data to host, strictly in sequence.]

Direct access: compute begins after the first fetch. Use lots of threads to cover the CPU host-memory access latency, and take advantage of bi-directional PCIe by reading from and writing to CPU memory at the same time.

[Timeline: reads from CPU memory, compute, and writes back to CPU memory all overlap.]
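A sketch of the zero-copy setup, assuming a hypothetical process kernel; the GPU then reads and writes CPU memory directly over PCIe:

    float *hostData, *devPtr;

    cudaSetDeviceFlags(cudaDeviceMapHost);          // enable mapping, before the first CUDA call
    cudaHostAlloc((void **)&hostData, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&devPtr, hostData, 0);

    // Launch lots of threads so memory-level parallelism covers the
    // PCIe latency of each individual access.
    process<<<blocks, threads>>>(devPtr, n);
    cudaDeviceSynchronize();                        // results are now visible in hostData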
OCCUPANCY AND REGISTER LIMITATIONS

▪ The register file is bigger than shared memory and L1 cache!
▪ Occupancy can kill you if you use too many registers.
▪ It is often worth forcing fewer registers (with __launch_bounds__, or the nvcc -maxrregcount flag) to allow more blocks per SM.
▪ But watch out for math functions!

Registers used per math function:

Function     float   double
log          7       18
cos          16      28
acos         6       18
cosh         7       10
tan          15      28
erfc         14      22
exp          7       10
log10        6       18
normcdf      16      26
cbrt         8       20
sqrt         6       12
rsqrt        5       12
y0           20      30
y1           22      30
fdivide      11      20
pow          11      24
grad. desc.  14      22

    __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
    __global__ void compute() {
        y = acos(pow(log(fdivide(tan(cosh(erfc(x))), 2)), 3));
    }

THANK YOU!