
Optimization

OpenCL Best Practices Guide

May 27, 2010 OpenCL Best Practices Guide


Table of Contents

Preface
 What Is This Document?
 Who Should Read This Guide?
 Recommendations and Best Practices
 Contents Summary
Chapter 1. Heterogeneous Computing with OpenCL
 1.1 Differences Between Host and Device
 1.2 What Runs on an OpenCL-Enabled Device?
 1.3 Maximum Performance Benefit
Chapter 2. Performance Metrics
 2.1 Timing
  2.1.1 Using CPU Timers
  2.1.2 Using OpenCL GPU Timers
 2.2 Bandwidth
  2.2.1 Theoretical Bandwidth Calculation
  2.2.2 Effective Bandwidth Calculation
  2.2.3 Throughput Reported by the OpenCL Visual Profiler
Chapter 3. Memory Optimizations
 3.1 Data Transfer Between Host and Device
  3.1.1 Pinned Memory
  3.1.2 Asynchronous Transfers
  3.1.3 Overlapping Transfers and Device Computation
 3.2 Device Memory Spaces
  3.2.1 Coalesced Access to Global Memory
   3.2.1.1 A Simple Access Pattern
   3.2.1.2 A Sequential but Misaligned Access Pattern
   3.2.1.3 Effects of Misaligned Accesses
   3.2.1.4 Strided Accesses
  3.2.2 Shared Memory
   3.2.2.1 Shared Memory and Memory Banks
   3.2.2.2 Shared Memory in Matrix Multiplication (C = AB)
   3.2.2.3 Shared Memory in Matrix Multiplication (C = AAᵀ)

April 30, 2010 iii OpenCL Best Practices Guide

   3.2.2.4 Shared Memory Use by Kernel Arguments
  3.2.3 Local Memory
  3.2.4 Texture Memory
   3.2.4.1 Textured Fetch vs. Global Memory Read
   3.2.4.2 Additional Texture Capabilities
  3.2.5 Constant Memory
  3.2.6 Registers
   3.2.6.1 Register Pressure
Chapter 4. NDRange Optimizations
 4.1 Occupancy
 4.2 Calculating Occupancy
 4.3 Hiding Register Dependencies
 4.4 Thread and Block Heuristics
 4.5 Effects of Shared Memory
Chapter 5. Instruction Optimizations
 5.1 Arithmetic Instructions
  5.1.1 Division and Modulo Operations
  5.1.2 Reciprocal Square Root
  5.1.3 Other Arithmetic Instructions
  5.1.4 Math Libraries
 5.2 Memory Instructions
Chapter 6. Control Flow
 6.1 Branching and Divergence
 6.2 Branch Predication
Appendix A. Recommendations and Best Practices
 A.1 Overall Performance Optimization Strategies
 A.2 High-Priority Recommendations
 A.3 Medium-Priority Recommendations
 A.4 Low-Priority Recommendations


Preface

What Is This Document?

This Best Practices Guide is a manual to help developers obtain the best performance from NVIDIA OpenCL-enabled GPUs. It presents established parallelization and optimization techniques and explains coding idioms that can greatly simplify programming for the OpenCL architecture.

Who Should Read This Guide?

You should already be familiar with the following documents:

 OpenCL Getting Started Guide
 NVIDIA OpenCL Programming Guide
 OpenCL Specification

Recommendations and Best Practices


Contents Summary

 Introduction to Heterogeneous Computing with OpenCL
 Performance Metrics
 Memory Optimizations
 NDRange Optimizations
 Instruction Optimizations
 Control Flow
 Getting the Right Answer


Chapter 1. Heterogeneous Computing with OpenCL

OpenCL applications execute across two platforms: a host system built around one or more CPUs, and one or more OpenCL-enabled NVIDIA GPU devices. This chapter contrasts the host and the device.

1.1 Differences Between Host and Device

 Threading resources. Execution pipelines on host systems can support only a limited number of concurrent threads, whereas an OpenCL-enabled device can have thousands of threads active simultaneously, scheduled in groups of 32 called warps.
 Threads. Threads on a CPU are heavyweight entities with expensive context switches; threads on the device are extremely lightweight, and the hardware switches among them at essentially no cost.
 RAM. Both the host and the device have their own RAM, but the two are separate memory spaces connected by the PCI Express bus, so data must be transferred between them explicitly.


These are the primary hardware differences between CPU hosts and GPU devices; for more detail, see the NVIDIA OpenCL Programming Guide.

1.2 What Runs on an OpenCL-Enabled Device?





1.3 Maximum Performance Benefit

High Priority: To get the maximum benefit from OpenCL, focus first on finding ways to parallelize sequential code.

The maximum speedup S of a program is given by Amdahl's law:

S = 1 / ((1 − P) + P/N)

where P is the fraction of the total serial execution time taken by the portion of code that can be parallelized, and N is the number of processors over which the parallel portion runs. The larger N is, the smaller the P/N term becomes, so the speedup is bounded above by S = 1 / (1 − P). For example, if ¾ of a program is parallelized, the maximum speedup is 1 / (1 − ¾) = 4, no matter how many processors are used. It is therefore worthwhile to spend effort increasing P before increasing N.
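The formula can be checked numerically; the helper below is only an illustration (the function name amdahl_speedup is ours, not from the guide).

```c
#include <assert.h>
#include <math.h>

/* Amdahl's law: S = 1 / ((1 - P) + P / N). */
static double amdahl_speedup(double P, double N)
{
    return 1.0 / ((1.0 - P) + P / N);
}
```

With P = 3/4, four processors yield a speedup of 16/7 (about 2.3), and no processor count can push the speedup past 4.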


Chapter 2. Performance Metrics

2.1 Timing

2.1.1 Using CPU Timers

Because many OpenCL API calls are asynchronous with respect to the host, call clFinish() on the command queue immediately before starting and immediately before stopping a CPU timer, so that all outstanding device work has completed when the timer is read. Alternatively, clWaitForEvents() can be used to block on the events associated with specific commands.


2.1.2 Using OpenCL GPU Timers

cl_ulong start, end;

clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
                        sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                        sizeof(cl_ulong), &end, NULL);

float executionTimeInMilliseconds = (end - start) * 1.0e-6f;

Listing 2.1 How to time code using OpenCL events
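The only arithmetic in Listing 2.1 is the nanosecond-to-millisecond conversion, which can be verified off-device (the helper name elapsed_ms is ours; a plain unsigned 64-bit integer stands in for cl_ulong):

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/* OpenCL profiling timestamps are in nanoseconds; Listing 2.1 scales
   the difference by 1.0e-6f to obtain milliseconds. */
static float elapsed_ms(uint64_t start, uint64_t end)
{
    return (float)(end - start) * 1.0e-6f;
}
```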

2.2 Bandwidth

High Priority: Use the effective bandwidth of your computation as a metric when measuring performance and optimization benefits.

2.2.1 Theoretical Bandwidth Calculation

For example, a device with a 1107 MHz memory clock and a 512-bit-wide memory interface using double-data-rate RAM has a theoretical peak bandwidth of

( 1107 x 10^6 x ( 512/8 ) x 2 ) / 10^9 = 141.6 GB/sec

That is, the memory clock in Hz is multiplied by the interface width in bytes and by 2 (for the double data rate), then divided by 10^9 to convert to GB/sec.
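That arithmetic is easy to wrap in a checkable helper (the function name is ours; the 1107 MHz and 512-bit figures are the example device above):

```c
#include <assert.h>
#include <math.h>

/* Theoretical bandwidth in GB/s: memory clock in Hz, times the bus
   width in bytes, times 2 for double-data-rate RAM, divided by 1e9. */
static double theoretical_bw_gbs(double mem_clock_hz, int bus_width_bits)
{
    return (mem_clock_hz * (bus_width_bits / 8.0) * 2.0) / 1.0e9;
}
```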


2.2.2 Effective Bandwidth Calculation

Effective bandwidth = (( Br + Bw ) / 10^9 ) / time

where Br is the number of bytes read and Bw the number of bytes written per kernel, and time is measured in seconds. For example, copying a 2048 x 2048 matrix of floats reads and writes 2048^2 x 4 bytes each, so

Effective bandwidth = (( 2048^2 x 4 x 2 ) / 10^9 ) / time
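As a sketch (function name ours), the 2048 x 2048 example becomes:

```c
#include <assert.h>
#include <math.h>

/* Effective bandwidth in GB/s from bytes read, bytes written, and
   elapsed seconds. */
static double effective_bw_gbs(double bytes_read, double bytes_written,
                               double seconds)
{
    return ((bytes_read + bytes_written) / 1.0e9) / seconds;
}
```

If the copy takes 10 ms, the effective bandwidth works out to about 3.36 GB/s.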

2.2.3 Throughput Reported by the OpenCL Visual Profiler


Chapter 3. Memory Optimizations

3.1 Data Transfer Between Host and Device

High Priority: Minimize data transfer between the host and the device, even if it means running some kernels on the device that do not show performance gains when compared with running them on the host CPU.

Transfers between the host and device should use page-locked (also called pinned) host memory, which attains the highest bandwidth across the PCI Express bus. See the NVIDIA OpenCL Programming Guide for details.

3.1.1 Pinned Memory


Pinned memory is obtained by creating a buffer with the CL_MEM_ALLOC_HOST_PTR flag; the technique is demonstrated in the oclBandwidthTest, oclSobelFilter, oclMedianFilter, and oclCopyComputeOverlap samples.

cl_context cxGPUContext;
cl_mem cmPinnedBufIn = NULL;
cl_mem cmPinnedBufOut = NULL;
cl_mem cmDevBufIn = NULL;
cl_mem cmDevBufOut = NULL;
unsigned char* cDataIn = NULL;
unsigned char* cDataOut = NULL;

cmPinnedBufIn = clCreateBuffer(cxGPUContext,
    CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
    memSize, NULL, NULL);
cmPinnedBufOut = clCreateBuffer(cxGPUContext,
    CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR,
    memSize, NULL, NULL);

cmDevBufIn = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY,
    memSize, NULL, NULL);
cmDevBufOut = clCreateBuffer(cxGPUContext, CL_MEM_WRITE_ONLY,
    memSize, NULL, NULL);


cDataIn = (unsigned char*) clEnqueueMapBuffer(cqCommandQue,
    cmPinnedBufIn, CL_TRUE, CL_MAP_WRITE, 0, memSize,
    0, NULL, NULL, NULL);
cDataOut = (unsigned char*) clEnqueueMapBuffer(cqCommandQue,
    cmPinnedBufOut, CL_TRUE, CL_MAP_READ, 0, memSize,
    0, NULL, NULL, NULL);

for (unsigned int i = 0; i < memSize; i++) {
    cDataIn[i] = (unsigned char)(i & 0xff);
}

clEnqueueWriteBuffer(cqCommandQue, cmDevBufIn, CL_FALSE, 0,
    szBuffBytes, cDataIn, 0, NULL, NULL);
clEnqueueNDRangeKernel(cqCommandQue, …);
clEnqueueReadBuffer(cqCommandQue, cmDevBufOut, CL_TRUE, 0,
    szBuffBytes, cDataOut, 0, NULL, NULL);


3.1.2 Asynchronous Transfers

The transfer functions clEnqueueReadBuffer() and clEnqueueWriteBuffer() can be either blocking or non-blocking, controlled by their third argument. With a non-blocking call, host execution continues as soon as the transfer is enqueued, so host work can overlap the transfer. In the following example, the printf executes while the write may still be in flight, and clFinish() then forces the host to wait for completion:

clEnqueueWriteBuffer(cqCommandQue, cmDevBufIn, CL_FALSE, 0,
    szBuffBytes, cDataIn, 0, NULL, NULL);
printf("Asynchronous printf after non-blocking write\n");
clFinish(cqCommandQue);

By contrast, a blocking read does not return until the data has arrived, so this printf runs only after the transfer is complete:

clEnqueueReadBuffer(cqCommandQue, cmDevBufOut, CL_TRUE, 0,
    szBuffBytes, cDataOut, 0, NULL, NULL);
printf("Synchronous printf after blocking read\n");

3.1.3 Overlapping Transfers and Device Computation

On devices that can overlap kernel execution with data transfers, indicated by the CL_DEVICE_GPU_OVERLAP_NV attribute returning CL_TRUE from clGetDeviceInfo(), the work can be split in half and staged through two command queues, so that the copy of one half proceeds while the kernel computes on the other half, as in the following phased example.


// Nonblocking Write of 1st half of input data, queue 0 clEnqueueWriteBuffer (cqCommandQueue[0], cmDevSrcA, CL_FALSE, 0, szHalfBuffer, (void*)&fSourceA[0], 0, NULL, NULL); clEnqueueWriteBuffer (cqCommandQueue[0], cmDevSrcB, CL_FALSE, 0, szHalfBuffer, (void*)&fSourceB[0], 0, NULL, NULL);

// Push the write for queue 0 clFlush (cqCommandQueue[0]);

// Start new phase ***********************************

// Launch kernel computation, queue 0 clEnqueueNDRangeKernel (cqCommandQueue[0], ckKernel[0], 1, NULL, &szGlobalWorkSize, &szLocalWorkSize, 0, NULL, NULL);

// Nonblocking Write of 2nd half of input data, queue 1
clEnqueueWriteBuffer (cqCommandQueue[1], cmDevSrcA, CL_FALSE, szHalfBuffer, szHalfBuffer,


(void*)&fSourceA[szHalfOffset], 0, NULL, NULL); clEnqueueWriteBuffer (cqCommandQueue[1], cmDevSrcB, CL_FALSE, szHalfBuffer, szHalfBuffer, (void*)&fSourceB[szHalfOffset], 0, NULL, NULL);

// Push the compute for queue 0 & write for queue 1 clFlush (cqCommandQueue[0]); clFlush (cqCommandQueue[1]);

// Start new phase **********************************

// Launch kernel computation, queue 1 clEnqueueNDRangeKernel (cqCommandQueue[1], ckKernel[1], 1, NULL, &szGlobalWorkSize, &szLocalWorkSize, 0, NULL, NULL);

// Non Blocking Read of 1st half of output data, queue 0 clEnqueueReadBuffer (cqCommandQueue[0], cmDevResult, CL_FALSE, 0, szHalfBuffer, (void*)&fResult[0], 0, NULL, NULL);

// Push the compute for queue 1 & the read for queue 0 clFlush (cqCommandQueue[0]); clFlush (cqCommandQueue[1]);

// Start new phase ******************************

// Non Blocking Read of 2nd half of output data, queue 1 clEnqueueReadBuffer (cqCommandQueue[1], cmDevResult, CL_FALSE, szHalfBuffer, szHalfBuffer, (void*)&fResult[szHalfOffset], 0, NULL, NULL);


3.2 Device Memory Spaces

Figure 3.2 The various memory spaces on a CUDA device

Table 3.1 Salient features of device memory

Memory    Location on/off chip   Cached   Access   Scope                   Lifetime
Register  On                     n/a      R/W      1 thread                Thread
Local     Off                    No       R/W      1 thread                Thread
Shared    On                     n/a      R/W      All threads in block    Block
Global    Off                    No       R/W      All threads + host      Host allocation
Constant  Off                    Yes      R        All threads + host      Host allocation
Texture   Off                    Yes      R        All threads + host      Host allocation


3.2.1 Coalesced Access to Global Memory

High Priority: Ensure global memory accesses are coalesced whenever possible.

Figure 3.3 Linear memory segments and threads in a half warp

 On devices of compute capability 1.0 or 1.1, the k-th thread of a half warp must access the k-th word of an aligned segment for the access to be coalesced; not all threads need to participate.
 On devices of compute capability 1.2 or higher, coalescing is achieved for any pattern of addresses that fits into segments of the required size, and alignment within the segment is no longer required.


On these later devices, a memory transaction is issued for each half warp as follows:

 Find the memory segment containing the address requested by the lowest-numbered active thread: 32 bytes for 8-bit data, 64 bytes for 16-bit data, or 128 bytes for 32-, 64-, and 128-bit data.
 Find all other active threads whose requested address lies in the same segment.
 Reduce the transaction size if possible: to 64 bytes if all threads fall within one half of a 128-byte segment, and again to 32 bytes if they fall within one quarter.
 Carry out the transaction and mark the serviced threads as inactive.
 Repeat until all threads in the half warp have been serviced.

3.2.1.1 A Simple Access Pattern

In the simplest coalescing case, the k-th thread accesses the k-th word in a segment; not every thread needs to participate (see Figure 3.4).

Figure 3.4 Coalesced access in which all threads but one access the corresponding word in a segment

3.2.1.2 A Sequential but Misaligned Access Pattern


Figure 3.5 Unaligned sequential addresses that fit within a single 128- byte segment

Figure 3.6 Misaligned sequential addresses that fall within two 128-byte segments
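The two situations in Figures 3.5 and 3.6 can be reproduced with a little address arithmetic. The sketch below is our own, not from the guide: it counts the aligned 128-byte segments spanned by a half warp of 16 work-items reading consecutive 4-byte words starting at a given word offset.

```c
#include <assert.h>

/* Aligned 128-byte segments spanned by a half warp (16 threads)
   reading consecutive 4-byte words starting at word `offset`.
   An illustrative model, not a hardware simulation. */
static int segments_spanned(int offset)
{
    int first_byte = offset * 4;
    int last_byte  = (offset + 15) * 4 + 3;
    return last_byte / 128 - first_byte / 128 + 1;
}
```

An offset of 1 still fits within one segment, while an offset of 17 straddles two, matching the figures.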

Data of the appropriate size can be kept on these boundaries with the __attribute__ ((aligned(8))) and __attribute__ ((aligned(16))) type qualifiers.

3.2.1.3 Effects of Misaligned Accesses

__kernel void offsetCopy(__global float *odata,
                         __global float *idata,
                         int offset)
{
    int xid = get_global_id(0) + offset;
    odata[xid] = idata[xid];
}

Listing 3.5 A copy kernel that illustrates misaligned accesses

In offsetCopy, data is copied from the input array idata to the output array odata.


The kernel was run over a range of offset values; Figure 3.7 shows the resulting effective bandwidth.

Figure 3.7 Performance of offsetCopy kernel


3.2.1.4 Strided Accesses

The kernel strideCopy() copies data from idata to odata with a stride of stride elements between threads:

__kernel void strideCopy(__global float *odata,
                         __global float *idata,
                         int stride)
{
    int xid = get_global_id(0) * stride;
    odata[xid] = idata[xid];
}

Listing 3.6 A kernel to illustrate non-unit stride data copy

Figure 3.8 A half warp accessing memory with a stride of 2


Figure 3.9 Performance of strideCopy kernel

3.2.2 Shared Memory

3.2.2.1 Shared Memory and Memory Banks

To achieve high bandwidth, shared memory is divided into equally sized modules, called banks, that can be accessed simultaneously. If multiple addresses of a memory request map to the same bank, the accesses are serialized: a request for n addresses in n distinct banks proceeds at full speed, whereas n requests to the same bank are serviced as n separate conflict-free requests, reducing bandwidth by a factor of n.
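The serialization rule can be made concrete with a small model (ours, assuming 16 banks of 4-byte words and a 16-thread half warp, as on compute capability 1.x devices):

```c
#include <assert.h>

/* Worst-case number of threads colliding on one bank when a half
   warp of 16 threads accesses 4-byte words with element stride s. */
static int conflict_degree(int stride)
{
    int hits[16] = {0};
    int worst = 0;
    for (int t = 0; t < 16; t++) {
        int bank = (t * stride) % 16;
        if (++hits[bank] > worst)
            worst = hits[bank];
    }
    return worst;
}
```

Unit stride is conflict-free, stride 2 halves the bandwidth, and stride 16 serializes the entire half warp.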


Medium Priority: Accesses to shared memory should be designed to avoid serializing requests due to bank conflicts.

For details on the bank organization of shared memory, see the NVIDIA OpenCL Programming Guide.

3.2.2.2 Shared Memory in Matrix Multiplication (C = AB)


Figure 3.10 A block-column matrix (A) multiplied by a block-row matrix (B) and the resulting product matrix (C)

The kernel simpleMultiply computes the output elements of a tile of C directly from global memory:

__kernel void simpleMultiply(__global float *a,
                             __global float *b,
                             __global float *c,
                             int N)
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    float sum = 0.0f;
    for (int i = 0; i < TILE_DIM; i++) {
        sum += a[row*TILE_DIM+i] * b[i*N+col];
    }
    c[row*N+col] = sum;
}

Listing 3.7 Unoptimized matrix multiplication

In Listing 3.7, a, b, and c are pointers to global memory for the matrices A, B, and C; both work-group dimensions equal TILE_DIM. Each work-item computes one element of C: row and col identify that element, and each iteration i of the for loop reads one element of A's row and one element of B's column.


Figure 3.11 Computing a row (half warp) of a tile in C using one row of A and an entire tile of B

On each iteration i of the for loop in Listing 3.7, every work-item in a row loads the same element of A (the address row*TILE_DIM+i does not depend on the work-item's column), so the same global memory values are fetched repeatedly. The following kernel stages the tile of A in shared memory first:

__kernel void coalescedMultiply(__global float *a,
                                __global float *b,
                                __global float *c,
                                int N,
                                __local float aTile[TILE_DIM][TILE_DIM])
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    float sum = 0.0f;
    int x = get_local_id(0);
    int y = get_local_id(1);
    aTile[y][x] = a[row*TILE_DIM+x];
    for (int i = 0; i < TILE_DIM; i++) {
        sum += aTile[y][i] * b[i*N+col];
    }
    c[row*N+col] = sum;
}


Listing 3.8 Using shared memory to improve the global memory efficiency in matrix multiplication

A further improvement also stages a tile of B in shared memory, so that the repeated reads inside the for loop come entirely from shared memory:

__kernel void sharedABMultiply(__global float *a,
                               __global float *b,
                               __global float *c,
                               int N,
                               __local float aTile[TILE_DIM][TILE_DIM],
                               __local float bTile[TILE_DIM][TILE_DIM])
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    float sum = 0.0f;
    int x = get_local_id(0);
    int y = get_local_id(1);
    aTile[y][x] = a[row*TILE_DIM+x];
    bTile[y][x] = b[y*N+col];
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int i = 0; i < TILE_DIM; i++) {
        sum += aTile[y][i] * bTile[i][x];
    }
    c[row*N+col] = sum;
}

Listing 3.9 Improvement by reading additional data into shared memory

Note the barrier() call after the shared memory writes: it is required because work-items subsequently read locations written by other work-items in the work-group. The measured effect of each optimization is summarized in Table 3.2, Performance improvements optimizing C = AB matrix multiply.


Medium Priority: Use shared memory to avoid redundant transfers from global memory.

3.2.2.3 Shared Memory in Matrix Multiplication (C = AAᵀ)

__kernel void simpleMultiply(__global float *a,
                             __global float *c,
                             int M)
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    float sum = 0.0f;
    for (int i = 0; i < TILE_DIM; i++) {
        sum += a[row*TILE_DIM+i] * a[col*TILE_DIM+i];
    }
    c[row*M+col] = sum;
}

Listing 3.10 Unoptimized handling of strided accesses to global memory

In Listing 3.10, row and col again identify the element of C being computed. The read a[col*TILE_DIM+i] is the problem: for each iteration i, neighboring work-items in a half warp read addresses separated by col*TILE_DIM elements, a large stride that cannot be coalesced. The remedy is to stage the transposed tile through shared memory using coalesced reads:

__kernel void coalescedMultiply(__global float *a,
                                __global float *c,
                                int M,
                                __local float aTile[TILE_DIM][TILE_DIM],
                                __local float transposedTile[TILE_DIM][TILE_DIM])
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    float sum = 0.0f;
    int x = get_local_id(0);
    int y = get_local_id(1);
    aTile[y][x] = a[row*TILE_DIM+x];
    transposedTile[x][y] =
        a[(get_group_id(0) * get_local_size(0) + y) * TILE_DIM + x];
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int i = 0; i < TILE_DIM; i++) {
        sum += aTile[y][i] * transposedTile[i][x];
    }
    c[row*M+col] = sum;
}

Listing 3.11 An optimized version of Listing 3.10 using coalesced reads from global memory

The writes into transposedTile, however, introduce a shared memory bank conflict that the reads into aTile do not have: each half warp writes a column of the tile, and with a [TILE_DIM][TILE_DIM] array every element of a column maps to the same bank, producing a 16-way conflict. The fix is to pad the inner dimension of the shared memory array:

__local float transposedTile[TILE_DIM][TILE_DIM+1];

The padding shifts each row by one bank, so column accesses fall in distinct banks. The results of these optimizations are summarized in Table 3.3, Performance improvements optimizing C = AAᵀ matrix multiplication.
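Why the one-element pad works can be checked with the same bank arithmetic (a sketch of ours, assuming 16 banks of 4-byte words and TILE = 16):

```c
#include <assert.h>

#define TILE 16

/* Worst-case bank collisions when a half warp walks down one column
   of a shared-memory tile whose rows are `row_pitch` floats apart. */
static int column_conflicts(int row_pitch)
{
    int hits[16] = {0};
    int worst = 0;
    for (int row = 0; row < TILE; row++) {
        int bank = (row * row_pitch) % 16;
        if (++hits[bank] > worst)
            worst = hits[bank];
    }
    return worst;
}
```

A pitch of TILE (16) lands every element of the column in the same bank, a 16-way conflict; a pitch of TILE+1 spreads the column across all 16 banks.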



3.2.2.4 Shared Memory Use by Kernel Arguments

Low Priority: For kernels with long argument lists, place some arguments into constant memory to save shared memory.

3.2.3 Local Memory

Whether the compiler has placed a variable in this off-chip local memory can be seen in the generated PTX: such variables are declared with the .local mnemonic and accessed with the ld.local and st.local instructions.


3.2.4 Texture Memory

3.2.4.1 Textured Fetch vs. Global Memory Read

3.2.4.2 Additional Texture Capabilities

Table 3.4 Useful image-processing features for image objects

Feature                          Use                                                Caveat
Filtering                        Fast, low-precision interpolation between texels   Valid only for floating-point image data
Normalized texture coordinates   Resolution-independent coding                      None
Addressing modes ¹               Automatic handling of out-of-range coordinates     Valid only with normalized coordinates


¹ The clamp and repeat addressing modes determine how a normalized coordinate x outside the range [0.0, 1.0) is resolved for a texture of N texels. With clamp, x is replaced by 0 if x < 0 and by 1 − 1/N if 1 ≤ x. With repeat, x is replaced by frac(x), its fractional part, so the texture tiles: a coordinate of 1.3, for example, resolves to 0.3.

3.2.5 Constant Memory

3.2.6 Registers

Registers are the fastest memory space on the device. Note that vector types such as float4 and int4 consume multiple registers, one per 32-bit component, which should be kept in mind when gauging register usage.

3.2.6.1 Register Pressure


Chapter 4. NDRange Optimizations

4.1 Occupancy

Executing other warps while one warp is paused or stalled is the only way the device hides latency and keeps the hardware busy, so the number of active warps per multiprocessor matters. This metric is occupancy: the ratio of active warps per multiprocessor to the maximum possible number of active warps. The maximums for each compute capability are listed in the NVIDIA OpenCL Programming Guide.

4.2 Calculating Occupancy


NVIDIA provides a spreadsheet, CUDA_Occupancy_calculator.xls, in the tools directory of the SDK, which computes occupancy from the number of threads per block, the registers used per thread, and the shared memory used per block (see Figure 4.1).

Figure 4.1 Use the CUDA GPU Occupancy Calculator to project occupancy


4.3 Hiding Register Dependencies

Medium Priority: To hide latency arising from register dependencies, maintain at least 25 percent occupancy on devices with compute capability 1.1 and lower, and 18.75 percent occupancy on later devices.
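The occupancy calculation itself is a min-over-limits computation. Everything in the sketch below is illustrative: the limits (24 resident warps, 16 KB of shared memory, 16,384 registers, 8 blocks per multiprocessor) are compute 1.x-class values, and real hardware applies allocation-granularity rules this sketch omits.

```c
#include <assert.h>
#include <math.h>

/* Occupancy = resident warps per multiprocessor / maximum warps.
   Resident blocks are limited by warps, registers, shared memory,
   and a per-multiprocessor block cap. */
static double occupancy(int threads_per_block, int regs_per_thread,
                        int shared_per_block)
{
    const int max_warps = 24, warp_size = 32, max_blocks = 8;
    const int regs_per_sm = 16384, shared_per_sm = 16384;

    int warps_per_block = (threads_per_block + warp_size - 1) / warp_size;
    int blocks = max_warps / warps_per_block;

    int by_regs = regs_per_sm / (regs_per_thread * threads_per_block);
    if (by_regs < blocks) blocks = by_regs;

    if (shared_per_block > 0) {
        int by_shared = shared_per_sm / shared_per_block;
        if (by_shared < blocks) blocks = by_shared;
    }
    if (max_blocks < blocks) blocks = max_blocks;

    return (double)(blocks * warps_per_block) / max_warps;
}
```

With these example limits, 256 threads per block at 8 registers per thread fills the multiprocessor; raising register use to 32 per thread drops it to two resident blocks and two-thirds occupancy.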

4.4 Thread and Block Heuristics

Medium Priority: The number of threads per block should be a multiple of 32 threads, because this provides optimal computing efficiency and facilitates coalescing.


Kernels that use barrier() benefit from having multiple concurrent work-groups per multiprocessor, so that one work-group can execute while another waits at the barrier. Several heuristics apply when choosing the work-group size:

 Threads per block should be a multiple of the warp size (32) to avoid wasting computation on under-populated warps and to facilitate coalescing.
 A minimum of 64 threads per block should be used, and only when there are multiple concurrent blocks per multiprocessor.
 Between 128 and 256 threads per block is a good starting point for experimenting with different sizes.

4.5 Effects of Shared Memory


Chapter 5. Instruction Optimizations

Instruction throughput in this chapter is given in operations per clock cycle per multiprocessor. For a warp size of 32, one instruction corresponds to 32 operations, so if T is the number of operations per clock cycle, the instruction throughput is one instruction every 32/T clock cycles.

5.1 Arithmetic Instructions

Single-precision floats provide the best performance, and their use is highly encouraged; the throughput of individual arithmetic operations is detailed in the NVIDIA OpenCL Programming Guide. Two points deserve mention. First, the native_ math built-ins, such as native_logf(x), trade accuracy for speed (see Section 5.1.4). Second, 24-bit integer multiplication via mul24 is faster than full 32-bit multiplication on these devices, so mul24 should be used when the operands are known to fit in 24 bits; see the NVIDIA OpenCL Programming Guide for details on mul24.


5.1.1 Division and Modulo Operations

Low Priority: Use shift operations to avoid expensive division and modulo calculations.

Integer division and modulo are particularly costly and should be replaced with bitwise operations whenever possible: if n is a power of 2, then (i/n) is equivalent to (i ≫ log2(n)) and (i % n) is equivalent to (i & (n − 1)). The compiler performs these conversions itself when n is a literal power of 2; see the NVIDIA OpenCL Programming Guide.
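A quick host-side check of the identities (helper names ours):

```c
#include <assert.h>

/* For power-of-two n and non-negative i:
   i / n == i >> log2(n)  and  i % n == i & (n - 1). */
static unsigned div_pow2(unsigned i, unsigned log2n) { return i >> log2n; }
static unsigned mod_pow2(unsigned i, unsigned n)     { return i & (n - 1u); }
```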

5.1.2 Reciprocal Square Root

Use rsqrt() for reciprocal square roots rather than the two-step expression 1.0f/sqrt(x); the single built-in maps directly to a hardware instruction.

5.1.3 Other Arithmetic Instructions

Low Priority: Avoid automatic conversion of doubles to floats.

Conversion instructions are inserted by the compiler in two common cases:

 Functions operating on char or short whose operands generally need to be converted to an int.
 Double-precision floating-point constants (written without a type suffix) used as input to single-precision computations.

The second case can be avoided by writing single-precision constants with the f suffix, such as 3.141592653589793f, 1.0f, or 0.5f. Note that erfc(), for example, has a fast single-precision implementation.
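The promotion rule is visible from host C code, which shares it with OpenCL C. Using C11's _Generic (a host-side illustration only):

```c
#include <assert.h>

/* An unsuffixed constant is a double, so float * double arithmetic
   is carried out (and must be converted back) in double precision. */
#define IS_DOUBLE(expr) _Generic((expr), double: 1, default: 0)
#define IS_FLOAT(expr)  _Generic((expr), float: 1, default: 0)
```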


5.1.4 Math Libraries

Medium Priority: Use the native math library whenever speed trumps precision.

Two types of runtime math functions are supported: native_functionName() and functionName(). Functions following the native_functionName() convention, such as native_sin(x) and native_exp(x), map directly to the hardware; they are faster but provide somewhat lower accuracy. Functions using functionName(), such as sin(x) and exp(x), are slower but more accurate. The throughput of native_sin(x), native_cos(x), and native_exp(x) is much greater than that of sin(x), cos(x), and tan(x), and the advantage grows when the magnitude of the argument x must be reduced. Details are in the NVIDIA OpenCL Programming Guide.

5.2 Memory Instructions

High Priority: Minimize the use of global memory. Prefer shared memory access where possible.

In the following example, the assignment combines a read from global memory, which takes hundreds of clock cycles to return, with a write to shared memory; the latency is hidden as long as the scheduler has independent work from other warps to issue in the meantime:

__local float shared[32];
__global float* device;
shared[get_local_id(0)] = device[get_local_id(0)];


Chapter 6. Control Flow

6.1 Branching and Divergence

High Priority: Avoid different paths within the same warp.

Flow-control instructions (if, switch, do, for, while) can significantly affect throughput by causing threads of the same warp to diverge: the different execution paths must be serialized, increasing the total number of instructions the warp executes (see the NVIDIA OpenCL Programming Guide). Divergence is avoided when the branch condition is a function of the warp index rather than the work-item index; for example, a condition based on (get_local_id(0) / WSIZE), where WSIZE is the warp size, gives every thread of a warp the same value.
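The difference between a warp-uniform and a divergent condition can be modeled on the host (WSIZE = 32; helper names ours):

```c
#include <assert.h>

#define WSIZE 32

/* Same value for every work-item of a warp: no divergence. */
static int warp_uniform_branch(int tid) { return (tid / WSIZE) % 2; }

/* Alternates within a warp: every warp diverges. */
static int divergent_branch(int tid) { return tid % 2; }
```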

6.2 Branch Predication

Low Priority: Make it easy for the compiler to use branch predication in lieu of loops or control statements.

With branch predication, instructions whose execution depends on an if or switch condition are not skipped; instead, each is associated with a per-thread predicate, and instructions whose predicate is false are scheduled but write no results. The compiler replaces a branch with predicated instructions only when the number of instructions governed by the condition is small; unrolling loops with #pragma unroll can help expose such cases. See the NVIDIA OpenCL Programming Guide for details.


Appendix A. Recommendations and Best Practices

A.1 Overall Performance Optimization Strategies

 Maximize parallel execution.
 Optimize memory usage to achieve maximum memory bandwidth.
 Optimize instruction usage to achieve maximum instruction throughput.


A.2 High-Priority Recommendations

 To get the maximum benefit from OpenCL, focus first on finding ways to parallelize sequential code.
 Use the effective bandwidth of your computation as a metric when measuring performance and optimization benefits.
 Minimize data transfer between the host and the device, even if it means running some kernels on the device that do not show performance gains compared with running them on the host CPU.
 Ensure global memory accesses are coalesced whenever possible.
 Minimize the use of global memory; prefer shared memory access where possible.
 Avoid different execution paths within the same warp.

A.3 Medium-Priority Recommendations

 Accesses to shared memory should be designed to avoid serializing requests due to bank conflicts.
 Use shared memory to avoid redundant transfers from global memory.
 To hide latency arising from register dependencies, maintain at least 25 percent occupancy on devices with compute capability 1.1 and lower, and 18.75 percent occupancy on later devices.
 The number of threads per block should be a multiple of 32, because this provides optimal computing efficiency and facilitates coalescing.
 Use the native math library whenever speed trumps precision.


A.4 Low-Priority Recommendations

 For kernels with long argument lists, place some arguments into constant memory to save shared memory.
 Use shift operations to avoid expensive division and modulo calculations.
 Avoid automatic conversion of doubles to floats.
 Make it easy for the compiler to use branch predication in lieu of loops or control statements.


Notice ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent or patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all information previously supplied. NVIDIA Corporation products are not authorized for use as critical components in life support devices or systems without express written approval of NVIDIA Corporation.

Trademarks NVIDIA, the NVIDIA logo, CUDA, GeForce, NVIDIA , and Tesla are trademarks or registered trademarks of NVIDIA Corporation. Other company and product names may be trademarks of the respective companies with which they are associated.

Copyright © 2010 NVIDIA Corporation. All rights reserved.