
CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES
Stephen Jones, GTC 2017
The art of doing more with less

2 RULE #1: DON’T TRY TOO HARD

[Chart: performance vs. time, a curve rising toward peak performance]

3 RULE #1: DON’T TRY TOO HARD

[Chart: the same curve, annotated to mark the unrealistic extremes of the effort/reward trade-off]

4 RULE #1: DON’T TRY TOO HARD

[Chart: the performance vs. time curve again]

5 RULE #1: DON’T TRY TOO HARD

[Chart: the curve annotated "Reduce this time" (the slow start), "Get on this curve" (the steep rise), and "Don't waste this time" (the long climb toward peak)]

6 RULE #1: DON’T TRY TOO HARD

[Chart: the curve annotated "Premature excitement", "Wait, it's going slower??", "Trough of despair" / "Most people give up here" / "4 weeks and this is it?", "Point of diminishing returns", "Hire an intern", and "Here be ninjas" near peak performance]

7 PERFORMANCE CONSTRAINTS

[Pie chart of typical performance limiters:]
Memory: 75%
Occupancy: 10%
Compute Intensity: 10%
Divergence: 3%
Instruction: 2%

8 PERFORMANCE CONSTRAINTS

[Chart: the same breakdown with the memory slice expanded into its causes:]
Compute Intensity
Divergence
Instruction
Occupancy
CPU <> GPU Transfer
Coalescence
Divergent Access
Cache Inefficiency
Register Spilling

9 MEMORY ORDERS OF MAGNITUDE

[Diagram: the memory hierarchy from SM registers and shared memory, through L1/L2 cache and GPU memory, across the PCIe bus to CPU DRAM]

SM registers / shared memory: ~20,000 GB/sec
L2 cache: ~2,000 GB/sec
GPU memory (GDRAM): ~300 GB/sec
PCIe bus: ~16 GB/sec
CPU memory (DRAM): ~150 GB/sec

10 TALK BREAKDOWN In no particular order

1. Why Didn't I Think Of That?
2. CPU Memory to GPU Memory (the PCIe Bus)
3. GPU Memory to the SM
4. Registers & Shared Memory
5. Occupancy, Divergence & Latency
6. Weird Things You Never Thought Of (and probably shouldn't try)

11 WHERE TO BEGIN?

12 THE OBVIOUS

NVIDIA Visual Profiler

Start with the Visual Profiler

13 CPU <> GPU DATA MOVEMENT

14 PCI ISSUES

[Diagram: GPU (registers, shared memory) connected to CPU memory over the PCIe bus at 16 GB/sec]

Moving data over the PCIe bus

15 PIN YOUR CPU MEMORY

[Diagram: data being copied from CPU memory to GPU memory]

16 PIN YOUR CPU MEMORY

[Diagram: the DMA controller performs the copy from CPU memory to GPU memory]

17 PIN YOUR CPU MEMORY

[Diagram: the OS can swap out the page holding the data while the DMA controller is still transferring it]

18 PIN YOUR CPU MEMORY

[Diagram: the CPU allocates and pins a page, then copies the data into it locally, before the DMA controller transfers the pinned copy to GPU memory]

19 PIN YOUR CPU MEMORY

[Diagram: with user-pinned data, the DMA controller transfers it directly, with no intermediate pinned copy]

cudaHostAlloc( &data, size, cudaHostAllocMapped );
cudaHostRegister( data, size, cudaHostRegisterDefault );

20 PIN YOUR CPU MEMORY
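A minimal sketch (not from the deck) tying the calls above together: allocate page-locked host memory with cudaHostAlloc, then use an asynchronous copy, which only truly overlaps with other work when the host buffer is pinned. Buffer names and sizes are illustrative.

#include <cuda_runtime.h>

int main() {
    const size_t N = 1 << 24;
    const size_t bytes = N * sizeof(float);

    float *h_data = nullptr;
    float *d_data = nullptr;

    // Pinned (page-locked) host allocation: the DMA engine can read it directly
    cudaHostAlloc(&h_data, bytes, cudaHostAllocDefault);
    cudaMalloc(&d_data, bytes);

    for (size_t i = 0; i < N; i++) h_data[i] = (float)i;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Async copy; with a pageable buffer this would fall back to a staged copy
    cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}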

21 REMEMBER: PCIe GOES BOTH WAYS

22 STREAMS & CONCURRENCY Hiding the cost of data transfer

Operations in a single stream are ordered, but the hardware can copy and compute at the same time

[Timeline: a single stream runs "Copy data to GPU", then "Compute", then "Copy data to Host", strictly in sequence]

23 STREAMS & CONCURRENCY

[Timeline: with two streams, each stream runs copy up, work, copy back; stream 2's copies overlap stream 1's compute, saving time versus the single-stream case]

24 STREAMS & CONCURRENCY

Can keep on breaking work into smaller chunks and saving time

[Chart: timelines for 1, 2, and 8 streams; more streams means smaller chunks and more overlap]
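A minimal sketch (not from the deck) of the chunking pattern, assuming a pinned host buffer and a simple illustrative kernel named process; each stream copies a chunk up, processes it, and copies it back, so copies in one stream overlap compute in another.

#include <cuda_runtime.h>

__global__ void process(float *data, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) data[i] *= 2.0f;
}

void runChunked(float *h_data, float *d_data, int n, int nStreams) {
    cudaStream_t streams[8];                    // assumes nStreams <= 8
    int chunk = (n + nStreams - 1) / nStreams;

    for (int s = 0; s < nStreams; s++) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < nStreams; s++) {
        int offset = s * chunk;
        if (offset >= n) break;
        int count = (offset + chunk <= n) ? chunk : (n - offset);
        size_t bytes = count * sizeof(float);

        // Each stream handles one chunk end-to-end; the hardware overlaps
        // one stream's copies with another stream's compute.
        cudaMemcpyAsync(d_data + offset, h_data + offset, bytes,
                        cudaMemcpyHostToDevice, streams[s]);
        process<<<(count + 255) / 256, 256, 0, streams[s]>>>(d_data + offset, count);
        cudaMemcpyAsync(h_data + offset, d_data + offset, bytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    for (int s = 0; s < nStreams; s++) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}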

25 SMALL PCIe TRANSFERS

PCIe is designed for large data transfers. But fine-grained copy/compute overlap prefers small transfers. So how small can we go?

[Chart: the 1-, 2-, and 8-stream timelines, plus a "too many" case where the chunks become too small]

26 APPARENTLY NOT THAT SMALL

27 FROM GPU MEMORY TO GPU THREADS

28 FEEDING THE MACHINE

[Diagram: GPU memory feeding the SMs (registers, shared memory)]

From GPU Memory to the SMs

29 USE THE PARALLEL ARCHITECTURE Hardware is optimized to use all SIMT threads at once

[Diagram: an L2 cache line]

Threads run in groups of 32. The cache is sized to service sets of 32 requests at a time. High-speed GPU memory works best with linear access.

30 VECTORIZE MEMORY LOADS Multi-Word as well as Multi-Thread

[Diagram: threads T0-T31 each load one int, together filling one cache line per fetch]

31 VECTORIZE MEMORY LOADS

[Diagram: with int2 loads, threads T0-T15 and T16-T31 together span two cache lines]

Fill multiple cache lines in a single fetch

32 VECTORIZE MEMORY LOADS

[Diagram: with int4 loads, threads T0-T7, T8-T15, T16-T23 and T24-T31 span four cache lines]

Fill multiple cache lines in a single fetch

33 VECTORIZE MEMORY LOADS
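A minimal sketch (not from the deck) of the technique: the same copy written with scalar int loads and with int4 loads, where the vectorized version issues one 16-byte load and store per thread.

__global__ void copyScalar(const int *in, int *out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) out[i] = in[i];          // one 4-byte load/store per thread
}

__global__ void copyVec4(const int4 *in, int4 *out, int n4) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n4) out[i] = in[i];         // one 128-bit load/store per thread
}

// Launch (assumes n is a multiple of 4 and the pointers are 16-byte aligned,
// which cudaMalloc guarantees):
//   copyVec4<<<blocks, threads>>>((const int4 *)d_in, (int4 *)d_out, n / 4);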

34 DO MULTIPLE LOADS PER THREAD Multi-Thread, Multi-Word AND Multi-Iteration

// One copy per thread: maximum overhead
__global__ void copy(int2 *input, int2 *output, int max) {
    int id = threadIdx.x + blockDim.x * blockIdx.x;
    if( id < max ) {
        output[id] = input[id];
    }
}

// Multiple copies per thread: amortize overhead
__global__ void copy(int2 *input, int2 *output, int max, int loadsPerThread) {
    int id = threadIdx.x + blockDim.x * blockIdx.x;
    for(int n = 0; n < loadsPerThread; n++) {
        if( id >= max ) {
            break;
        }
        output[id] = input[id];
        id += blockDim.x * gridDim.x;
    }
}

35 “MAXIMAL” LAUNCHES ARE BEST

36 COALESCED MEMORY ACCESS It's not enough just to use all SIMT threads

Coalesced: sequential memory accesses are adjacent
Uncoalesced: sequential memory accesses are unassociated

37 SIMT PENALTIES WHEN NOT COALESCED

x = data[threadIdx.x]: a single 32-wide operation
x = data[rand()]: 32 separate one-wide operations

38 SCATTER & GATHER

Gathering: reading randomly, writing sequentially
Scattering: reading sequentially, writing randomly

39 AVOID SCATTER/GATHER IF YOU CAN

40 AVOID SCATTER/GATHER IF YOU CAN

41 SORTING MIGHT BE AN OPTION

If reading non-sequential data is expensive, is it worth sorting it to make it sequential?

[Diagram: gathering from unsorted data (2 4 1 3) is slow; after a sort (1 2 3 4) the read is coalesced and fast]

42 SORTING MIGHT BE AN OPTION

Even if you’re only going to read it twice, then yes!

43 PRE-SORTING TURNS OUT TO BE GOOD

44 DATA LAYOUT: “AOS vs. SOA” Sometimes you can’t just sort your data

Array-of-Structures:

#define NPTS 1024 * 1024
struct Coefficients_AOS {
    double u[3];
    double x[3][3];
    double p;
    double rho;
    double eta;
};
Coefficients_AOS gridData[NPTS];

Structure-of-Arrays:

#define NPTS 1024 * 1024
struct Coefficients_SOA {
    double u[3][NPTS];
    double x[3][3][NPTS];
    double p[NPTS];
    double rho[NPTS];
    double eta[NPTS];
};
Coefficients_SOA gridData;

Single-thread code prefers arrays of structures, for cache efficiency. SIMT code prefers structures of arrays, for execution & memory efficiency.

45 DATA LAYOUT: “AOS vs. SOA”

#define NPTS 1024 * 1024
struct Coefficients_AOS {
    double u[3];
    double x[3][3];
    double p;
    double rho;
    double eta;
};
Coefficients_AOS gridData[NPTS];

[Diagram: the structure definition alongside its conceptual layout: u0 u1 u2, x00 x01 x02, x10 x11 x12, x20 x21 x22, p, rho, eta]

46 AOS: STRIDED ARRAY ACCESS

GPU reads data one element at a time, but in parallel by 32 threads in a warp

[Diagram: the Array-of-Structures memory layout, u0 u1 u2 x00 x01 x02 x10 x11 x12 x20 x21 x22 p rho eta repeated per point]

double u0 = gridData[threadIdx.x].u[0];

47 SOA: COALESCED BUT COMPLEX

GPU reads data one element at a time, but in parallel by 32 threads in a warp

[Diagram: Array-of-Structures layout vs. Structure-of-Arrays layout, where all u0 values are contiguous, then all u1 values, and so on]

double u0 = gridData.u[0][threadIdx.x];
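A minimal sketch (not from the deck) contrasting the two reads in kernel form, using the structs from slide 44 passed in as pointers; bounds checks and launch configuration are illustrative.

#define NPTS (1024 * 1024)

struct Coefficients_AOS { double u[3]; double x[3][3]; double p, rho, eta; };
struct Coefficients_SOA { double u[3][NPTS]; double x[3][3][NPTS]; double p[NPTS], rho[NPTS], eta[NPTS]; };

__global__ void readAOS(const Coefficients_AOS *gridData, double *out) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < NPTS)
        out[i] = gridData[i].u[0];    // stride between threads = sizeof(Coefficients_AOS)
}

__global__ void readSOA(const Coefficients_SOA *gridData, double *out) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < NPTS)
        out[i] = gridData->u[0][i];   // consecutive threads read consecutive doubles
}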

48 BLOCK-WIDE LOAD VIA SHARED MEMORY

Read data linearly as bytes. Use shared memory to convert to struct

Device Memory

Block copies data to shared memory

Shared Memory

49 BLOCK-WIDE LOAD VIA SHARED MEMORY

Read data linearly as bytes. Use shared memory to convert to struct

Device Memory

Shared Memory

Threads which own the data grab it from shared memory
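A minimal sketch (not from the deck) of the block-wide load, assuming the Coefficients_AOS struct from slide 44 and a launch with exactly 128 threads per block covering the data.

struct Coefficients_AOS { double u[3]; double x[3][3]; double p, rho, eta; };

__global__ void loadViaShared(const Coefficients_AOS *in, double *out) {
    // One struct per thread in the block, staged in shared memory.
    __shared__ Coefficients_AOS sh[128];

    // Stage 1: the whole block copies its structs as raw doubles, so that
    // consecutive threads read consecutive words (coalesced).
    const double *src = (const double *)(in + blockIdx.x * blockDim.x);
    double *dst = (double *)sh;
    int words = blockDim.x * sizeof(Coefficients_AOS) / sizeof(double);
    for (int w = threadIdx.x; w < words; w += blockDim.x)
        dst[w] = src[w];
    __syncthreads();

    // Stage 2: each thread grabs "its" struct from fast shared memory.
    out[blockIdx.x * blockDim.x + threadIdx.x] = sh[threadIdx.x].u[0];
}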

50 CLEVER AOS/SOA TRICKS

51 CLEVER AOS/SOA TRICKS Helps for any data size

52 HANDY TO HELP YOU

Trove – A utility library for fast AOS/SOA access and transposition https://github.com/bryancatanzaro/trove

53 (AB)USING THE CACHE

54 MAKING THE MOST OF L2-CACHE

L2 cache is fast but small:

[Diagram: L2 cache at 2,000 GB/sec in front of GDRAM at 300 GB/sec]

Architecture   L2 Cache Size   Total Threads   Cache Bytes per Thread
Kepler         1536 KB         30,720          51
Maxwell        3072 KB         49,152          64
Pascal         4096 KB         114,688         36
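A minimal sketch (not from the deck): the L2 size can be queried at runtime, which is handy when sizing the tiles used in the tiled convolution that follows; the halving of the budget is an illustrative choice.

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int device = 0;
    int l2bytes = 0;
    cudaDeviceGetAttribute(&l2bytes, cudaDevAttrL2CacheSize, device);
    printf("L2 cache: %d bytes\n", l2bytes);

    // Keep each tile's working set comfortably inside L2
    int tileBytes = l2bytes / 2;
    printf("Suggested tile size: %d bytes\n", tileBytes);
    return 0;
}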

55 TRAINING DEEP NEURAL NETWORKS

56 LOTS OF PASSES OVER DATA

[Diagram: a layer applies 3x3 (W1), 5x5 (W2) and 7x7 (W3) convolutions (via FFT), sums the results, and the network classifies the image: "Cat!"]

57 MULTI-RESOLUTION CONVOLUTIONS

Pass 1: 3x3

Pass 2: 5x5

Pass 3: 7x7

58 TILED, MULTI-RESOLUTION CONVOLUTION

Pass 1: 3x3

Pass 2: 5x5

Pass 3: 7x7

Do 3 passes per tile. Each tile is sized to fit in the L2 cache.

59 LAUNCHING FEWER THAN MAXIMUM THREADS

60 SHARED MEMORY: DEFINITELY WORTH IT

61 USING SHARED MEMORY WISELY

Shared memory is arranged into "banks" for concurrent SIMT access ▪ 32 threads can read simultaneously so long as they read from separate banks

Shared memory has 4-byte and 8-byte “bank” sizes
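A minimal sketch (not from the deck): on Kepler-era architectures that expose the setting, the bank size can be selected with cudaDeviceSetSharedMemConfig; the 8-byte configuration helps double-precision shared-memory access avoid bank conflicts.

#include <cuda_runtime.h>

void configureBanks() {
    // Device-wide hint; cudaFuncSetSharedMemConfig() can set it per kernel.
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
}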

62 STENCIL ALGORITHM

Many algorithms have high data re-use: potentially good for shared memory

"Stencil" algorithms accumulate data from neighbours onto a central point ▪ The stencil has a width "W" (for example, W=5)

Adjacent threads will share (W-1) items of data – good potential for data re-use

63 STENCILS IN SHARED MEMORY
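A minimal sketch (not from the deck) of a width-5 (W=5, radius 2) 1D stencil staged through shared memory; RADIUS, the plain sum, and the launch configuration are illustrative.

#define RADIUS 2

__global__ void stencil1d(const float *in, float *out, int n) {
    extern __shared__ float tile[];                  // blockDim.x + 2*RADIUS floats
    int g = blockIdx.x * blockDim.x + threadIdx.x;   // global index
    int s = threadIdx.x + RADIUS;                    // index into the tile

    // Each thread loads its own element; zero-pad past the ends.
    tile[s] = (g < n) ? in[g] : 0.0f;

    // The first RADIUS threads also load the left and right halos.
    if (threadIdx.x < RADIUS) {
        int left  = g - RADIUS;
        int right = g + blockDim.x;
        tile[s - RADIUS]     = (left >= 0 && left < n) ? in[left]  : 0.0f;
        tile[s + blockDim.x] = (right < n)             ? in[right] : 0.0f;
    }
    __syncthreads();

    if (g < n) {
        float sum = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; k++)      // each loaded element is
            sum += tile[s + k];                      // re-used by W-1 neighbours
        out[g] = sum;
    }
}

// Launch with the halo included in the dynamic shared memory size:
//   stencil1d<<<blocks, threads, (threads + 2 * RADIUS) * sizeof(float)>>>(d_in, d_out, n);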

64 SIZE MATTERS

65 PERSISTENT KERNELS Revisiting the tiled convolutions

Avoid multiple kernel launches by caching in shared memory instead of L2

void tiledConvolution() {
    // Separate kernel launches, with re-use through the L2 cache
    convolution<3><<< numblocks, blockdim, 0, s >>>(ptr, chunkSize);
    convolution<5><<< numblocks, blockdim, 0, s >>>(ptr, chunkSize);
    convolution<7><<< numblocks, blockdim, 0, s >>>(ptr, chunkSize);
}

// Single kernel launch: the tile persists in shared memory across all three convolutions
__global__ void convolutionShared(int *data, int count, int sharedelems) {
    extern __shared__ int shdata[];
    shdata[threadIdx.x] = data[threadIdx.x + blockDim.x * blockIdx.x];
    __syncthreads();
    convolve<3>(threadIdx.x, shdata, sharedelems);
    __syncthreads();
    convolve<5>(threadIdx.x, shdata, sharedelems);
    __syncthreads();
    convolve<7>(threadIdx.x, shdata, sharedelems);
}

66 PERSISTENT KERNELS

67 OPERATING DIRECTLY FROM CPU MEMORY

Can save memory copies. It’s obvious when you think about it ...

Copy-then-compute (Copy data to GPU, Compute, Copy data to Host): compute only begins when the first copy has finished, and the task only ends when the second copy has finished.

Operating directly from CPU memory: compute begins after the first fetch. Uses lots of threads to cover the host-memory access latency, and takes advantage of bi-directional PCIe (reads from and writes to the CPU overlap).
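A minimal sketch (not from the deck) of operating directly on mapped, pinned host memory: the kernel's loads and stores travel over PCIe, so lots of threads are needed to hide that latency. The kernel and sizes are illustrative.

#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float factor) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) data[i] *= factor;      // read and write go straight to host RAM
}

int main() {
    const int n = 1 << 20;
    float *h_data = nullptr, *d_alias = nullptr;

    // Allow mapped host allocations, then allocate pinned memory visible to the GPU.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc(&h_data, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; i++) h_data[i] = 1.0f;

    // Device-side pointer aliasing the same host memory.
    cudaHostGetDevicePointer((void **)&d_alias, h_data, 0);

    scale<<<(n + 255) / 256, 256>>>(d_alias, n, 2.0f);
    cudaDeviceSynchronize();           // results are already in host memory

    cudaFreeHost(h_data);
    return 0;
}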

68 OPERATING DIRECTLY FROM CPU MEMORY

69 OCCUPANCY AND REGISTER LIMITATIONS

The register file is bigger than shared memory and L1 cache!

Occupancy can kill you if you use too many registers.

It is often worth forcing fewer registers to allow more blocks per SM, but watch out for math functions!

Registers used per math function (float / double):
log          7 / 18
cos         16 / 28
acos         6 / 18
cosh         7 / 10
tan         15 / 28
erfc        14 / 22
exp          7 / 10
log10        6 / 18
normcdf     16 / 26
cbrt         8 / 20
sqrt         6 / 12
rsqrt        5 / 12
y0          20 / 30
y1          22 / 30
fdivide     11 / 20
pow         11 / 24
grad. desc. 14 / 22

__launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
__global__ void compute() {
    y = acos(pow(log(fdivide(tan(cosh(erfc(x))), 2)), 3));
}

70 THANK YOU!