Peter Messmer (NVIDIA), Sami Saarinen (CSC), Sami Ilvonen (CSC)

Advanced CUDA and OpenACC
Jan 21-23, 2014
PRACE Advanced Training Centre
CSC – IT Center for Science Ltd, Finland

(continuation of the MPI + CUDA vector addition example, worker branch:)

        free(hA); free(hB); free(hC);
    } else { // Worker
        domainDecomposition(rank, nprocs, N, &localN, &offset);

        // Allocate host and device arrays
        hA = (double*)malloc(sizeof(double) * localN);
        hB = (double*)malloc(sizeof(double) * localN);
        hC = (double*)malloc(sizeof(double) * localN);
        CUDA_CHECK( cudaMalloc((void**)&dA, sizeof(double) * localN) );
        CUDA_CHECK( cudaMalloc((void**)&dB, sizeof(double) * localN) );
        CUDA_CHECK( cudaMalloc((void**)&dC, sizeof(double) * localN) );

        // Receive the input data from the root and copy it to the device
        MPI_Recv(hA, localN, MPI_DOUBLE, 0, 11, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(hB, localN, MPI_DOUBLE, 0, 12, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        CUDA_CHECK( cudaMemcpy(dA, hA, sizeof(double) * localN, cudaMemcpyHostToDevice) );
        CUDA_CHECK( cudaMemcpy(dB, hB, sizeof(double) * localN, cudaMemcpyHostToDevice) );
        vector_add(dC, dA, dB, localN, CUDA);
        CUDA_CHECK( cudaMemcpy(hC, dC, sizeof(double) * localN, cudaMemcpyDeviceToHost) );

        // Copy the results back to root process
        MPI_Send(hC, localN, MPI_DOUBLE, 0, 13, MPI_COMM_WORLD);

        // Release the allocated memory
        CUDA_CHECK( cudaFree((void*)dA) );

CUDA recap

Sami Ilvonen
CSC – IT Center for Science Ltd
Espoo, Finland

Material on "CUDA recap", (c) 2014 by CSC – IT Center for Science Ltd.
Material from CSC is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License,
http://creativecommons.org/licenses/by-nc-sa/3.0/

Outline
– Programming model and memory hierarchy
– Kernels
– Synchronization, streams, events
– Summary

CUDA
Compute Unified Device Architecture
CUDA C is a C/C++ language extension for GPU programming
CUDA API is the most up-to-date programming interface for NVIDIA GPUs
– Other options: OpenCL (Khronos Group), DirectCompute (Microsoft)
– Two APIs: runtime and driver

Unified Device Architecture
CUDA provides an abstraction layer between different GPUs
– Parallel Thread Execution (PTX) ISA and virtual machine
– PTX can be compiled to device binary code either at compile time or by the driver using JIT at runtime
Important concept of compute capability
– Original G80 architecture supported CC 1.0
– CC 3.5 only supported by K20/K20x and K40
  . Dynamic parallelism
– Many features only supported with the most recent GPUs

CUDA C Code Consists of
Qualifiers: __global__, __device__, __shared__, __local__, __constant__, ...
Built-in variables: threadIdx, blockIdx, ...
Intrinsics: __syncthreads, __fmul_rn, ...
Runtime API calls: memory, device, execution management
Kernel launches: kern<<<1024, 128>>>(d_data);

    __device__ float array[128];

    __global__ void kern(float *data)
    {
        __shared__ float buffer[32];
        ...
        buffer[threadIdx.x] = data[i];
        ...
        __syncthreads();
        ...
    }

    float *d_data;
    cudaMalloc((void**)&d_data, bytes);
    kern<<<1024, 128>>>(d_data);


CUDA APIs
CUDA API includes functions for
– Memory control (allocation, copying)
– Synchronization and execution control
– Hardware control and query
– etc.

CUDA Runtime API
– User-friendlier interface for application developers
– Requires the nvcc compiler
CUDA Driver API
– Low-level interface for more detailed control
– Much more complicated and verbose
– Can be used with other C compilers
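For reference, a minimal sketch of the kind of error-checking wrapper used around runtime API calls (here called CUDA_CHECK, as in the earlier example; the exact macro is course-specific and this version is only an illustration):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Minimal error-checking macro: aborts on the first failing runtime API call.
    #define CUDA_CHECK(call)                                              \
        do {                                                              \
            cudaError_t err_ = (call);                                    \
            if (err_ != cudaSuccess) {                                    \
                fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                        cudaGetErrorString(err_), __FILE__, __LINE__);    \
                exit(EXIT_FAILURE);                                       \
            }                                                             \
        } while (0)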

PROGRAMMING MODEL, THREADS

CUDA Programming Model
GPU accelerator is called device, CPU is host
Parallel code (kernel) is launched by the host and executed on the device by several threads
Threads are grouped into thread blocks
Program code is written from a single thread's point of view
– Each thread has a unique id
– Each thread can diverge and execute a unique code path (can cause performance issues)


Thread Hierarchy
Threads:
– 3D IDs, unique within a block
Blocks:
– 3D IDs, unique within a grid
Dimensions are set at kernel launch
Built-in variables for device code:
– threadIdx, blockIdx
– blockDim, gridDim
[Figure: a grid of thread blocks; each block is a 2D array of threads]

Hardware Implementation, SIMT Architecture
Maximum number of threads in a block depends on the compute capability (1024 on Fermi/Kepler)
GPU multiprocessor creates, manages, schedules and executes threads in warps of 32
Warp executes one common instruction at a time
– Threads are allowed to branch, but each branch is executed serially
Context switch is extremely fast; the warp scheduler selects warps that are ready to execute → can hide latencies
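For illustration, a minimal sketch of how the built-in variables combine into a global index (the kernel, its arguments and the 16x16 block shape are made up for this example):

    __global__ void scale2d(float *data, int nx, int ny, float factor)
    {
        // Global 2D index from block and thread coordinates
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < nx && y < ny)
            data[y * nx + x] *= factor;
    }

    // Launch: 16x16 threads per block, enough blocks to cover the nx x ny domain
    dim3 block(16, 16);
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
    scale2d<<<grid, block>>>(d_data, nx, ny, 2.0f);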

Hardware Implementation (cont.)
The actual number of simultaneously executing threads depends on the device
– For example, Fermi can execute 512 threads simultaneously
For maximum throughput many more threads are needed
– Many threads will be waiting for memory requests or register values
– The thread scheduler can hide latencies efficiently only if there are enough threads waiting for execution

PROGRAMMING MODEL, MEMORY


CPU and GPU Memories
Host and device have separate memories
Host manages the GPU memory
Usually one has to
1. Copy (explicitly) data from the host to the device
2. Execute the GPU kernel
3. Copy (explicitly) the results back to the host
Data copies between host and device use the PCI bus with very limited bandwidth → minimize the transfers!

Device Memory Hierarchy
Per-thread local memory: registers are fast, off-chip local memory has high latency
Per-block shared memory: tens of kB per block, on-chip, very fast
Global memory (grid scope): size up to 6 GB, high latency
– Random access very expensive! Coalesced access much more efficient
Constant memory (64 kB)
Texture memory

Memory Hierarchy
Another view of the memory hierarchy of a CUDA device
[Figure: device grid with per-thread registers and local memory, per-block shared memory, and global, constant and texture memory; arrows show the read and write permissions]
Host can only access global, texture and constant memory
Note that the global, constant and texture memory spaces are persistent between kernel calls!

Allocating Device Memory
cudaMalloc()
– Allocates device global memory
cudaFree()
– Frees the allocated memory
Note that the host code can not dereference device memory pointers!


Device-Host Data Transfer
cudaMemcpy() transfers data
– Host to Host
– Host to Device
– Device to Host
– Device to Device
This call blocks the execution of the host code. There is also an asynchronous copy function.

Coalesced Memory Access
Global memory access has very high latency
Threads are executed in warps; memory operations are grouped in a similar fashion
– Memory access is optimized for coalesced access where threads read from / write to successive memory locations
– Exact alignment rules and performance issues depend on the compute capability
Shared memory is better suited for more complicated data access patterns
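A minimal sketch of the asynchronous copy mentioned above (it assumes a page-locked host buffer h_data and a device buffer d_data already exist; the names are illustrative):

    cudaStream_t stream;
    CUDA_CHECK( cudaStreamCreate(&stream) );

    // Returns immediately; needs page-locked host memory to overlap
    // with host work or with kernels running in other streams.
    CUDA_CHECK( cudaMemcpyAsync(d_data, h_data, bytes,
                                cudaMemcpyHostToDevice, stream) );

    // ... launch kernels into the same stream; they run after the copy ...

    CUDA_CHECK( cudaStreamSynchronize(stream) );   // wait for copy (and kernels)
    CUDA_CHECK( cudaStreamDestroy(stream) );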

Access Patterns

    __global__ void offsetDemo(float *out, float *in, int oset)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x + oset;
        out[tid] = in[tid];
    }

oset = 0: coalesced access, within a 128B aligned segment
oset = 1: unaligned (but sequential) access
[Figure: a (half) warp of threads accessing a 128B aligned segment, with and without the offset]
Note that access would be coalesced also when all threads do not copy

Coalesced Access (cont.)
Performance penalties for non-coalesced access were very high for CC 1.0/1.1 devices
CC 2.0 has relaxed the requirements considerably
– For best performance portability it is safest (and hardest) to comply with the older requirements
Refer to the NVIDIA documentation for details


Shared Memory
Non-coalesced global memory access is expensive; how to avoid it?
– Load data using coalesced operations from global to shared memory
– Access shared memory (avoid bank conflicts) and do the needed manipulations
– Save the output back to global memory using coalesced writes

Shared Memory
__shared__ qualifier declares a variable that
– Resides in the shared memory space of the thread block
– Has the lifetime of the block
– Is accessible only from threads within the block
Beware of synchronization issues, such as write-after-write, write-after-read
– Synchronize the execution of threads with __syncthreads() when needed
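A minimal sketch of this staging pattern (the kernel reverses each block-sized chunk of an array; the kernel name and block size are made up, and the array length is assumed to be a multiple of the block size):

    #define BLOCK 256

    __global__ void reverseBlock(float *out, const float *in)
    {
        __shared__ float tile[BLOCK];
        int gid = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = in[gid];     // coalesced read into shared memory
        __syncthreads();                 // all loads done before any reuse

        // The reordered access hits fast shared memory,
        // while the global write stays coalesced.
        out[gid] = tile[blockDim.x - 1 - threadIdx.x];
    }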

Allocating Shared Memory
Inside device code (static size):

    __global__ void kern(float *in)
    {
        __shared__ float sdata[NX][NY];
        ...
    }

External variable, array size determined by a kernel call parameter:

    extern __shared__ float sdata[];

    __global__ void kern(float *in)
    {
        float *arr = (float *)sdata;
        ...
    }

    kern<<<128, 10, 1024, 0>>>( ... );

Example of Memory Operations

    int main(void)
    {
        float *A = (float *) malloc(N*sizeof(float));
        float *d_A;
        cudaMalloc((void**)&d_A, N*sizeof(float));
        cudaMemcpy(d_A, A, N*sizeof(float), cudaMemcpyHostToDevice);
        ...
        float A0 = d_A[0];   // WRONG: can not dereference device pointers in host code!
        ...
        cudaMemcpy(A, d_A, N*sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_A);
        free(A);
        return 0;
    }


DEVICE CODE, KERNELS

Device Code
C++ function with restrictions:
– Can only dereference pointers to device memory
– No static variables, no recursion
– No variable number of arguments
Functions must be declared with a qualifier
– __global__: kernel, called from the CPU
  . Cannot be called from the GPU (except CUDA 5+ with CC 3.5)
  . Must return void
– __device__: called from __device__ and __global__ functions
  . Cannot be called from the CPU
– __host__: can only be called by the CPU
  . Can be combined with the __device__ qualifier
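A small sketch of how the qualifiers combine (the function names are made up for the example):

    // Callable from both host and device code
    __host__ __device__ float square(float x) { return x * x; }

    // Device-only helper, callable from kernels and other __device__ functions
    __device__ float scaled(float x, float a) { return a * square(x); }

    // Kernel: launched from the host, must return void
    __global__ void apply(float *data, float a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = scaled(data[i], a);
    }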

Calling GPU Kernel
Special syntax:
– kname<<<grid, block>>>(args)
– kname is the name of the kernel function
– grid determines the block hierarchy
– block determines the thread hierarchy in a block
– args is the list of arguments of the kernel
grid and block can be either integers or structs (classes) of type dim3
There are two additional parameters, more about them later

Kernel Call Example
3 blocks with 4 threads in each:

    __global__ void kern(int *A)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        A[idx] = idx;
    }

    void main()
    {
        // Allocate memories, copy values
        dim3 grid, block;
        block.x = 4;
        grid.x = 12 / block.x;
        kern<<<grid, block>>>(d_A);
        // Copy results back
    }

Result: A = {0,1,2,3,4,5,6,7,8,9,10,11}


THREAD BRANCHING AND SYNCHRONIZATION

Thread Branching
Thread execution can branch according to e.g. the thread index
– All threads in the warp execute the same command
– All threads do not have to participate
Execution of different code paths is serialized
Performance issues
– Depend on the code paths and type of divergence; the general suggestion is not to branch if possible

Thread Branching
Example of divergence between thread 0 and threads 1-31 of a warp:

    int tid = threadIdx.x;
    if (tid == 0) { ++var1; }
    else          { var1 = var1 + 2; }
    var2 = 3 * var1;

Important issue when combined with memory access!

Thread Synchronization
Kernel-level synchronization: blocks must be independent
– Can run in any order, concurrently or sequentially
Some level of coordination can be achieved using atomic intrinsics → can cause performance issues
Threads in a block can synchronize using the __syncthreads() intrinsic
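A minimal sketch of the atomic coordination mentioned above (a histogram kernel; the name, bin count and non-negative keys are assumptions made for the example):

    #define NBINS 64

    __global__ void histogram(const int *keys, int *bins, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(&bins[keys[i] % NBINS], 1);  // safe concurrent update across all threads
    }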


PAGE-LOCKED MEMORY, STREAMS, EVENTS

Virtual Memory System
Modern operating systems utilize virtual memory
– Memory is organized into memory pages
– Memory pages can reside in the swap area on the disk
[Figure: per-process virtual memory mapped to physical memory and disk]

Streams Example

    cudaStream_t streams[2];
    for (int i = 0; i < 2; ++i)
        cudaStreamCreate(&streams[i]);

    float* h_Data;
    cudaMallocHost(&h_Data, 2*dsize);

    for (int i = 0; i < 2; ++i) {
        cudaMemcpyAsync(in_d_Data + i*dsize, h_Data + i*dsize, dsize,
                        cudaMemcpyHostToDevice, streams[i]);
        MyKernel<<<100, 1024, 0, streams[i]>>>(out_d_Data + i*dsize,
                                               in_d_Data + i*dsize, dsize);
        cudaMemcpyAsync(h_Data + i*dsize, out_d_Data + i*dsize, dsize,
                        cudaMemcpyDeviceToHost, streams[i]);
    }

    for (int i = 0; i < 2; ++i)
        cudaStreamDestroy(streams[i]);

CUDA Events
Runtime API provides events to monitor the device's progress and perform accurate timing
Events are recorded asynchronously at any point of the program execution
– An event has completed when all tasks or commands in a given stream have completed
– Events in stream 0 are completed after all preceding tasks and commands in all streams are completed


CPU/GPU Synchronization
One can synchronize the host thread and the GPU with
– cudaDeviceSynchronize()
  . Blocks until all previous CUDA calls in all streams of all host threads have completed
– cudaStreamSynchronize(stream)
  . Blocks until all CUDA calls associated with the stream are completed
– cudaEventSynchronize(event)
  . Blocks until the event is recorded
All CUDA calls to stream 0 block until the previous call is completed

Timing Using CUDA Events

    float time_diff_ms = 0.0;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    kernel_call<<<...>>>(...);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    cudaEventElapsedTime(&time_diff_ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);

Links to NVIDIA Material
NVIDIA CUDA main page
http://developer.nvidia.com/category/zone/cuda-zone
CUDA documentation main page
http://developer.nvidia.com/nvidia-gpu-computing-documentation

CSC cluster introduction

Sami Ilvonen
CSC – IT Center for Science Ltd
Espoo, Finland

Material on "CSC cluster introduction", (C) 2014 by CSC – IT Center for Science Ltd.
Material from CSC is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License,
http://creativecommons.org/licenses/by-nc-sa/3.0/

Scalable hybrid prototype
Part of the PRACE Technology Evaluation
Objectives
– Enabling key applications on new architectures
– Familiarizing users and providing a research platform
– Whole-system benchmarking: energy efficiency, productivity and performance
Located at CSC – IT Center for Science Ltd
– Espoo, Finland
Documentation of the system
https://confluence.csc.fi/display/HPCproto/HPC+Prototypes

Current configuration
master: head node (frontend)
– Users login here
– No GPUs available
node[02-10]: compute nodes
– Accessible via the batch job queue system
– node[02-05]: Xeon Phi
– node[06-10]: Nvidia Keplers
  . node[06-09]: K20
  . node10: K20x

Diagram of a Kepler Node
[Figure: host with two CPUs (CPU0/CPU1, 6 cores each, 15 MB L2 cache, QPI 96 GB/s), 2 x 16 GB DDR3 host RAM at 51.2 GB/s per bank, PCIe2 at 8 GB/s to the NVIDIA Kepler K20(x)m with 5 (6) GB GDDR5 at 208 (250) GB/s, SMX1 … SMX13(14), 1280 kB (1536 kB) L2 cache, plus a PCIe2-attached FDR InfiniBand HCA at 7 GB/s]

Upcoming system
Larger accelerated system from Bull
– 44 Xeon Phi 7120X nodes
– Similar amount of Nvidia K40 nodes in 1Q14
Extreme energy efficiency
– Latest and greatest versions of Phi and Atlas
– Direct Liquid Cooling
– Located in CSC's Kajaani datacenter


First login and modules
ssh to hybrid.csc.fi with your training account
$ ssh -Y hybrid.csc.fi -l trngNN
Familiarize yourself with environment modules
– To load CUDA 5.5, use:
$ module load cuda/5.5
– To see all available modules:
$ module avail
– To see what modules are currently loaded:
$ module list

Modules continued
Unloading and switching loaded modules
– To unload a module
$ module unload
– To remove all modules
$ module purge
– To switch the version of a module
$ module swap cuda/5.5 cuda/5.0

Custom configuration on Hybrid
NFS mounts
– /home, /share, /usr/local
Additional native support libraries and programs
– Python, HDF5, gcc etc.
– Small libraries and utilities (strace etc.)
SLURM batch job queuing system

SLURM batch job queue system
Reserves and allocates nodes to jobs
At CSC we are moving to use SLURM on all systems
– Designed for HPC from the ground up
– Open source, extendable, lightweight
– Becoming increasingly popular in the HPC community


SLURM commands
Checking the status of queues
$ squeue
Checking node status
$ sinfo [-r]
Running a job interactively
$ srun [command]
Sending a batch job
$ sbatch [job script]
For simplicity all of the following examples use interactive execution (srun). However, for "real" work you should run batch jobs.

SLURM options
Here is a short list of common SLURM options:
-p (--partition)      Partition name (queue)
-n (--ntasks)         Number of MPI tasks
--ntasks-per-node     Number of MPI tasks per node
-c (--cpus-per-task)  Number of CPUs per MPI task

Partitions
Default partition does not have GPUs
GPU type specific partitions k20 and k20x, and one overlapping partition for all GPUs

    [master:~ ] sinfo
    PARTITION AVAIL TIMELIMIT  NODES STATE NODELIST
    michost*  up    1-00:00:00     4 idle  node[02-05]
    gpu       up    1-00:00:00     5 idle  node[06-10]
    k20       up    1-00:00:00     4 idle  node[06-09]
    k20x      up    1-00:00:00     1 idle  node10
    all       up    1-00:00:00     9 idle  node[02-10]

Running a job (one liner)
Running a program using srun:

    [master:~ ] srun -n 1 -p k20 hostname
    node06

More complicated example:

    [master:~ ] srun -n 2 -p k20 --ntasks-per-node=1 hostname
    node07
    node06


Running a GPU Job
Requires the GRES parameter to be used
– If you don't use it, you won't get access to the GPU
$ srun --gres=gpu:1 ./hello_cuda
MPI GPU job
$ srun -n 2 --ntasks-per-node=1 --gres=gpu:1 ./mpihello_cuda

Batch job
Generate a batch job file (gputest.sh):

    #!/bin/bash
    #SBATCH -p gpu
    #SBATCH -n 2
    #SBATCH --ntasks-per-node=1
    #SBATCH --gres=gpu:1
    srun ./gputest 100000

Submit the job using
$ sbatch gputest.sh

Overview: CUDA 5.5 and Kepler
Peter Messmer
© NVIDIA Corporation 2014

GPGPU Revolutionizes Computing
Latency Processor + Throughput Processor
[Figure: CPU (latency processor) + GPU (throughput processor)]


Low Latency or High Throughput?
CPU
. Optimized for low-latency access to cached data sets
. Control logic for out-of-order and speculative execution
GPU
. Optimized for data-parallel, throughput computation
. Architecture tolerant of memory latency
. More transistors dedicated to computation

Low Latency or High Throughput?
CPU architecture must minimize latency within each thread
GPU architecture hides latency with computation from other thread warps
[Figure: CPU core (low-latency processor) processes threads T1..Tn one after another; GPU streaming multiprocessor (high-throughput processor) context-switches between warps W1..W4 — computation, waiting for data, ready to be processed]

GPU Architecture: Two Main Components
Global memory
– Analogous to RAM in a CPU server
– Accessible by both GPU and CPU
– Currently up to 12 GB per GPU
– Bandwidth currently up to ~288 GB/s (Tesla products)
– ECC on/off (Quadro and Tesla products)
Streaming Multiprocessors (SMs)
– Perform the actual computations
– Each SM has its own: control units, registers, execution pipelines, caches
[Figure: GPU die with host interface, GigaThread engine, L2 cache and DRAM interfaces]


The Kepler GK110 GPU
Performance
Efficiency
Programmability

New High-Performance SMX Instructions
SHFL (shuffle) -- intra-warp data exchange
ATOM -- broader functionality, faster
Compiler-generated, high-performance instructions:
– bit shift
– bit rotate
– fp32 division
– read-only cache

New Instruction: SHFL
Data exchange between threads within a warp
Avoids use of shared memory
One 32-bit value per exchange
4 variants:
– __shfl(): indexed any-to-any
– __shfl_up(): shift right to n-th neighbour
– __shfl_down(): shift left to n-th neighbour
– __shfl_xor(): butterfly (XOR) exchange

SHFL Example: Warp Prefix-Sum

    __global__ void shfl_prefix_sum(int *data)
    {
        int id = threadIdx.x;
        int value = data[id];
        int lane_id = threadIdx.x & (warpSize - 1);

        // Now accumulate in log2(32) steps (width is the warp width, 32)
        for (int i = 1; i <= width; i *= 2) {
            int n = __shfl_up(value, i);
            if (lane_id >= i)
                value += n;
        }

        // Write out our result
        data[id] = value;
    }

Step by step on one warp: starting from {3 8 2 6 3 9 1 4}, after __shfl_up(value, 1) and add: {3 11 10 8 9 12 10 5}; after shift by 2: {3 11 13 19 19 20 19 17}; after shift by 4: {3 11 13 19 21 31 32 36}.


ATOM instruction enhancements
Added int64 functions to match existing int32 (add, cas, exch, min/max, and/or/xor)
2 – 10x performance gains
– Shorter processing
– More atomic processors
– Slowest 10x faster
– Fastest 2x faster

High Speed Atomics Enable New Uses
Atomics are now fast enough to use within inner loops
Example: data reduction (sum of all values)
Without atomics:
1. Divide input data array into N sections
2. Launch N blocks, each reduces one section
3. Output is N values
4. Second launch of N threads reduces the outputs to a single value

High Speed Atomics Enable New Uses
Atomics are now fast enough to use within inner loops
Example: data reduction (sum of all values)
With atomics:
1. Divide input data array into N sections
2. Launch N blocks, each reduces one section
3. Write output directly via atomic. No need for a second kernel launch.

Improved Texture Performance
Texture:
– Provides hardware-accelerated filtered sampling of data (1D, 2D, 3D)
– Read-only data cache holds fetched samples
– Backed up by the L2 cache
SMX vs Fermi SM: 4x filter ops per clock, 4x cache capacity
[Figure: SMX with texture units and read-only data cache backed by L2]
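A minimal sketch of the atomics-based reduction pattern described above (block-level tree reduction in shared memory plus one atomicAdd per block; it assumes a 256-thread block, float atomics (CC 2.0+), and a result initialized to zero — names are illustrative):

    __global__ void reduceSum(const float *in, float *result, int n)
    {
        __shared__ float partial[256];
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        partial[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();

        // Tree reduction within the block
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                partial[threadIdx.x] += partial[threadIdx.x + s];
            __syncthreads();
        }

        // One atomic per block combines the partial sums; no second kernel launch needed
        if (threadIdx.x == 0)
            atomicAdd(result, partial[0]);
    }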


Texture Cache Unlocked
Added a new path for compute
– Avoids the texture unit
– Allows a global address to be fetched and cached
– Eliminates texture setup
Why use it?
– Separate pipeline from shared/L1
– Highest miss bandwidth
– Flexible, e.g. unaligned accesses
Managed automatically by the compiler; "const __restrict__" indicates eligibility
[Figure: SMX read-only data cache backed by L2]

const __restrict__ Example
Annotate eligible kernel parameters with const __restrict__
The compiler will automatically map loads to the read-only data cache path

    __global__ void saxpy(float x, float y,
                          const float * __restrict__ input,
                          float *output)
    {
        size_t offset = threadIdx.x + (blockIdx.x * blockDim.x);

        // Compiler will automatically use the read-only data cache for "input"
        output[offset] = (input[offset] * x) + y;
    }

An even easier way: __ldg()
Issues a load through the texture unit
For all built-in types: int, double, float4, double2, etc.
Example:

    float a = array[i];

becomes

    float a = __ldg(&array[i]);   // Note: pass an address.

Without Bindless Textures

    #define N 1024
    texture<float> tex;

    __global__ void kernel()
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float x = tex1D(tex, i);
        // do some work using x...
    }

    void call_kernel(float *buffer)
    {
        // bind texture to buffer
        cudaBindTexture(0, tex, buffer, N*sizeof(float));
        kernel<<<grid, block>>>();
    }


With Bindless Textures

    #define N 1024

    __global__ void kernel(cudaTextureObject_t tex)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float x = tex1Dfetch<float>(tex, i);
        // do some work using x ...
    }

    void call_kernel(cudaTextureObject_t tex)
    {
        kernel<<<grid, block>>>(tex);
    }

    void main()
    {
        ...
        cudaTextureObject_t tex;
        cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
    }

Bindless Texture
Requires Compute Capability 3.x and CUDA 5.x or later
Texture reference objects are passable
Direct access to memory via the texture unit using __ldg()
For read-only data

Optimizing for Kepler
Fermi code runs on Kepler as is
Better results – recompile code for Kepler
Best performance – tune code for Kepler


Fermi – Concurrent Kernels
Streams 1-3 issue kernels A–B–C, P–Q–R and X–Y–Z, but they multiplex into a single hardware work queue
– Depth-first issue order: Z–Y–X, R–Q–P, C–B–A
– Breadth-first issue order: Z – R – C – Y – Q – B – X – P – A
Fermi allows 16-way concurrency
But CUDA kernels multiplex into a single queue
Issue order matters for concurrency
https://developer.nvidia.com/gpu-computing-webinars
http://www.stanford.edu/group/ttsdocs/cgi-bin/techbriefingvideos/2013/01/18/cuda-programming-your-gpu

K20 Improved Concurrency
Each stream (A–B–C, P–Q–R, X–Y–Z) maps to its own hardware work queue
Kepler allows 32-way concurrency
One kernel queue per stream
No inter-stream dependencies, whether issued depth-first or breadth-first

Grid Management Unit
CUDA-generated work passes through stream queue management into the Grid Management Unit
– Pending & suspended grids: 1000s of pending grids
Work Distributor
– Fermi: 16 active grids
– Kepler GK110: 32 active grids
[Figure: Fermi (stream queues → work distributor → SMs) vs Kepler GK110 (stream queues → grid management unit → work distributor → SMXs)]


Hyper-Q Enables Efficient Scheduling
Grid management unit can select the most appropriate grid from 32 streams
Improves scheduling of concurrently executed grids
Particularly interesting for MPI applications

Strong Scaling of MPI Application
[Figure: execution time split into serial part, CPU parallel part and GPU parallelizable part; multicore CPU only, N = 1]


Strong Scaling of MPI Application
[Figure: as the rank count grows (N = 1, 2, 4, multicore CPU only), the CPU parallel part shrinks while the serial and GPU-parallelizable parts remain]


Strong Scaling of MPI Application
[Figure: at N = 8 (multicore CPU only) the runtime is dominated by the GPU-parallelizable part]

GPU Accelerated MPI Application
[Figure: with a GPU-accelerated CPU at N = 1, the GPU-parallelizable part is offloaded to the GPU and the total time drops sharply]

GPU Accelerated Strong Scaling
With Hyper-Q/MPS (available in K20, K40) multiple MPI ranks can share the GPU, so the GPU-accelerated code keeps scaling as ranks are added
[Figure: serial, CPU parallel and GPU-parallelizable parts for multicore CPU only vs GPU-accelerated CPU, N = 1, 2, 4, 8]


Example: Hyper-Q/MPS for CP2K
[Figure: CP2K performance with Hyper-Q/MPS]

How to use MPS
- No application modifications necessary
- Proxy process between user processes and GPU
  nvidia_cuda_mps_control -d
- Set environment variable to use the proxy
  export CUDA_MPS_CLIENT=1


Don't Forget Large-Scale Behavior
Profile in a realistic environment
Get a profile at scale
– Tau, Scalasca, VampirTrace+Vampir, Craypat, ...
[Figure: per-rank timeline at Nrank = 384 split into compute, work and waste time]
Fix messaging problems first!
– GPUs will accelerate your compute and amplify messaging problems
– Will also help CPU-only code


Improving Programmability with Dynamic Parallelism
– Library calls from kernels
– Simplify CPU/GPU divide
– Batching to help fill the GPU
– Dynamic occupancy
– Dynamic load balancing
– Data-dependent execution
– Recursive parallel algorithms

What is Dynamic Parallelism?
The ability to launch new grids from the GPU
– Dynamically
– Simultaneously
– Independently

What Does It Mean?
Fermi: only the CPU can generate GPU work (GPU as co-processor)
Kepler: the GPU can generate work for itself (autonomous, dynamic parallelism)


Dynamic Work Generation
Fixed grid: statically assign a conservative worst-case grid
Dynamic parallelism: dynamically assign performance where accuracy is required
[Figure: initial grid vs dynamically refined grid]

CUDA Dynamic Parallelism
Kernel launches grids
Identical syntax as on the host
CUDA runtime functions are in the cudadevrt library
Enabled via the nvcc flag -rdc=true

    __global__ void childKernel()
    {
        printf("Hello %d", threadIdx.x);
    }

    __global__ void parentKernel()
    {
        childKernel<<<1,10>>>();
        cudaDeviceSynchronize();
        printf("World!\n");
    }

    int main(int argc, char *argv[])
    {
        parentKernel<<<1,1>>>();
        cudaDeviceSynchronize();
        return 0;
    }

Offline Static Linker
Link and externally call device code
[Figure: a.cu + b.cu are compiled into a device library (ab.culib), which is linked together with main1.cpp/foo.cu and main2.cpp/bar.cu object files into program1.exe and program2.exe]


Compile Trajectory / CUDA 5 Introduces Device Code Linker
Separation of host and device code
Device code translates into a device-specific binary (.cubin) or device-independent assembly (.ptx)
Device code is embedded in the host object file
[Figure: a.cu/b.cu split into host code (a.c, b.c) and device code (a.ptx/a.cubin, b.ptx/b.cubin); the device linker and host linker combine a.o and b.o into a.out]


Device Linker Invocation

Introduction of an optional link step for device code

    nvcc -arch=sm_20 -dc a.cu b.cu
    nvcc -arch=sm_20 -dlink a.o b.o -o link.o
    g++ a.o b.o link.o -L -lcudart

Link device-runtime library for dynamic parallelism

    nvcc -arch=sm_35 -dc a.cu b.cu
    nvcc -arch=sm_35 -dlink a.o b.o -lcudadevrt -o link.o
    g++ a.o b.o link.o -L -lcudadevrt -lcudart

Currently, link occurs at cubin level (PTX not supported)



Stream Priorities for Kernels
Without priorities: kernel X launched into stream 2 waits until kernels A, B and C in stream 1 have finished
With priorities: a high-priority kernel X in stream 2 runs as soon as it is launched, ahead of the remaining work in stream 1
[Figure: timelines of stream 1 (kernels A, B, C) and stream 2 (kernel X) with and without priorities]

CUDA Tools
The CUDA Toolkit: libraries, profilers, debuggers
. Development tools: NVCC, PTXAS, cuobjdump, Nsight
. Libraries: CUBLAS, CUFFT, CURAND, CUSPARSE, NPP, THRUST
. Debugging tools: CUDA-GDB, CUDA-MEMCHECK
. Performance tools / profilers: NVVP, NVPROF
. System management: NVIDIA-SMI, NVML

NVCC

. NVCC is the main CUDA compiler driver
. Uses an LLVM-based compiler to build device code
. Host code is compiled with g++
. Supports most common features
. Useful options:
  . Pass args to GCC: -Xcompiler, to PTXAS: -Xptxas
  . --lineinfo (for debugger and profiler), -G (kernel debugging code)
  . -gencode arch=compute_35,code=sm_35
. Fun trick: nvcc --cuda
  . Generates code which can be built with another compiler *at your own risk*

PTXAS
. Assembles PTX to native binary code
. Invoked automatically by NVCC
. PTX is an abstract virtual ISA (like LLVM, the language)
. Useful options (usually passed via nvcc -Xptxas=[options]):
  . -v : print out information on each kernel compiled for each architecture
  . -maxrregcount : limit registers to increase occupancy
  . -dlcm=cg : shut off L1 caching of globals (Fermi)

cuobjdump
. Similar to objdump (binutils) for host code
. Allows you to extract compiled information, including GPU hardware instructions (SASS)
. Useful options:
  . -ptx : extract PTX from a compiled file
  . -sass : disassemble compiled GPU hardware instructions
  . -symbols : dump ELF symbol names


NSIGHT IDE

. Eclipse-based IDE

CUBLAS
. Full BLAS library, levels 1, 2 & 3
. Extremely well tuned for every NVIDIA architecture
. Full host-side API for use without any other kernels
. Device-callable API (newer) for calling BLAS from other kernels

CURAND
. Random number generator
. Host API for using the GPU to generate huge arrays of random numbers
. Device API for generating random numbers within kernels
. Huge number of random number generator algorithms implemented
  . Pseudorandom: XORWOW, MRG32K3A, MTGP32
  . Quasirandom: SOBOL32, SOBOL64 & scrambled variants
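A minimal sketch of the CURAND host API (fills a device array with uniform floats; the buffer name is illustrative and error checking is omitted):

    #include <curand.h>

    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_XORWOW);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);

    // d_rand is a device buffer of n floats allocated with cudaMalloc
    curandGenerateUniform(gen, d_rand, n);

    curandDestroyGenerator(gen);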


CUFFT
. Multi-dimensional FFT on GPU
  . 1D, 2D, 3D
. Host-side API, interoperable with other CUDA code
. FFTW compatibility mode
  . Uses data layouts compatible with FFTW

THRUST
. STL-like template library (C++)
. Masks host<->device data copies
. Includes many useful algorithms:
  . Sort
  . Reordering
  . Reductions
  . Prefix-sums
  . Transformations (foreach)
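A minimal sketch of the cuFFT host API (in-place 1D complex-to-complex transform of a device buffer; the buffer name is illustrative and error checking is omitted):

    #include <cufft.h>

    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);   // single 1D transform of length N

    // d_signal is a device buffer of N cufftComplex values
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);

    cufftDestroy(plan);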

CUSPARSE
. Tuned sparse linear algebra library
. Supports a variety of sparse formats:
  . COO, CSR, CSC, ELL, HYB, BSR, BSRX
. Supports 3 types of operations:
  . Sparse matrix, dense vector
  . Sparse vector, dense vector
  . Sparse matrix, set of dense vectors

NVIDIA Performance Primitives (NPP)
. Large signal and image/video processing library
. Primitive operations from which signal processing algorithms are constructed


Printf

. First form of “debugging” most programmers try

. Compute 2.0 and above allow printf from kernels

. Caveats: . Prints to a buffer which is copied out at kernel end - cudaDeviceSetLimit(cudaLimitPrintfFifoSize, size);

. Can be explosive
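A small sketch of kernel printf (the kernel name is made up; the buffer resize is only needed if the default FIFO is too small):

    #include <cstdio>

    __global__ void hello()
    {
        printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
    }

    int main()
    {
        cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 1 << 20);  // optional: enlarge printf buffer
        hello<<<2, 4>>>();
        cudaDeviceSynchronize();   // output is flushed when the kernel completes / at sync
        return 0;
    }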

cuda-memcheck
. Checks for memory management / access problems:
  . Out-of-bounds read/write
  . Shared memory race detection
  . Stack overflow
  . Misaligned access
  . Device memory leaks
. Notes:
  . Build with nvcc -G -lineinfo

cuda-gdb
. GDB-based debugger for kernel code
. Requires SM 2.0 or later
. Breakpoints: kernel launch, source line, function entry
. Change thread/block focus
. Trap into kernels on illegal instruction
. Step by instruction or source line
. Print and change variables
. GUI interface through Nsight Eclipse Edition
. Notes:
  . Build with nvcc -G to debug kernel code


Nvidia Visual Profiler (nvvp)

. Runs full applications and gathers a whole host of metrics and counters . Visual timeline display of kernels

. New in 5.5: Guided Optimization Analysis tools

. Caveats: . Limited counter storage, app may be rerun many times for profile generation . Some profile counters are not entirely intuitive without deep hardware knowledge.

Nvidia Visual Profiler (nvvp)
[Screenshots: timeline views showing what the host is doing, which copies are happening, how much overlap there is, stream activity, and multi-GPU activity]

NVProf

. “backend” to NVVP which collects counters . Run from command line, no dependence on X . Useful for profiling on cluster/ nodes

. New in 5.5: NVVP counters . nvprof --analysis-metrics -o profile ./appname . Generate profile on cluster, review on your laptop.

. On : export PMI_NO_FORK=1

“Command Line Profiler”

. Built into the driver . Always present . Activated by environment variables / config files . outputs to file

. Notes: . Predates NVVP and NVPROF . Today, largely used when there is a reason NVVP or NVPROF cannot be used.


NVIDIA-SMI
. Utility included in the driver for controlling and configuring Tesla and Quadro products
  . Set GPU access mode
  . Read, reset ECC state
  . Reset
  . Change clocks / power limits, accounting mode
. Silly feature: dasBlinkenlights
  . nvidia-smi -i 0 -t 0; sleep 1; nvidia-smi -i 0 -t 1;

NVML
. NVIDIA Management Library
. C library exposes driver operations for system management tasks
. Largely makes NVIDIA-SMI functionality accessible to an application in library form

Other ways to unleash GPU power
. OpenACC
  . PGI, Cray, and CAPS compilers support directive-based acceleration
. CUDA Fortran
  . PGI compiler specific, allows you to write kernels in Fortran
. 3rd party libs:
  . Many 3rd party libs take advantage of GPU acceleration: PETSc, Trilinos, Magma, ...
. 3rd party apps:
  . Many apps from Photoshop to Matlab already employ GPU acceleration in one form or another
. Play games:
  . It's where GPUs started and they're still very good at it!


New Features in CUDA 6
1 Unified Memory
2 XT and Drop-in Libraries
3 GPUDirect RDMA in MPI
4 Developer Tools

Unified Memory: Dramatically Lower Developer Effort
Developer view today: separate system memory and GPU memory
Developer view with Unified Memory: a single unified memory
[Figure: CPU and GPU each with their own memory vs a single unified memory]

Super Simplified Memory Management Code

CPU code:

    void sortfile(FILE *fp, int N) {
        char *data;
        data = (char *)malloc(N);

        fread(data, 1, N, fp);

        qsort(data, N, 1, compare);

        use_data(data);

        free(data);
    }

CUDA 6 code with Unified Memory:

    void sortfile(FILE *fp, int N) {
        char *data;
        cudaMallocManaged(&data, N);

        fread(data, 1, N, fp);

        qsort<<<...>>>(data, N, 1, compare);
        cudaDeviceSynchronize();

        use_data(data);

        cudaFree(data);
    }

Unified Memory Delivers
1. Simpler programming & memory model
   . Single pointer to data, accessible anywhere
   . Tight language integration
   . Greatly simplifies code porting
2. Performance through data locality
   . Migrate data to the accessing processor
   . Guarantee global coherency
   . Still allows cudaMemcpyAsync() hand tuning


Simpler Memory Model: Eliminate Deep Copies

    struct dataElem
    {
        int prop1;
        int prop2;
        char *text;
    };

Without unified memory, two copies are required — both the struct and the string it points to must be copied, and the embedded pointer patched:

    void launch(dataElem *elem) {
        dataElem *g_elem;
        char *g_text;

        int textlen = strlen(elem->text);

        // Allocate storage for struct and text
        cudaMalloc(&g_elem, sizeof(dataElem));
        cudaMalloc(&g_text, textlen);

        // Copy up each piece separately, including new "text" pointer value
        cudaMemcpy(g_elem, elem, sizeof(dataElem), cudaMemcpyHostToDevice);
        cudaMemcpy(g_text, elem->text, textlen, cudaMemcpyHostToDevice);
        cudaMemcpy(&(g_elem->text), &g_text, sizeof(g_text), cudaMemcpyHostToDevice);

        // Finally we can launch our kernel, but
        // CPU & GPU use different copies of "elem"
        kernel<<< ... >>>(g_elem);
    }

With Unified Memory, the same structure can be passed directly:

    void launch(dataElem *elem) {
        kernel<<< ... >>>(elem);
    }


Simpler Memory Model
Example: GPU & CPU shared linked lists
[Figure: linked list nodes (key, data, next) residing in CPU memory, accessed by the GPU over PCIe]
Without Unified Memory the only practical option is to use zero-copy (pinned system) memory
– GPU accesses at PCIe bandwidth
– All GPU accesses at very high latency

Simpler Memory Model
Example: GPU & CPU shared linked lists with Unified Memory
– Can pass list elements between host & device
– Can insert and delete elements from host or device*
– Single list – no complex synchronization
*Program must still ensure no race conditions.
*Data is coherent between CPU & GPU at kernel launch & sync only.

Unified Memory with C++: A Powerful Combination
C++ objects migrate easily when allocated on the managed heap
Overload the new operator to use C++ in the unified memory region
Deep copies, pass-by-value, pass-by-reference: JUST WORKS

    class Managed {
        void *operator new(size_t len) {
            void *ptr;
            cudaMallocManaged(&ptr, len);
            return ptr;
        }

        void operator delete(void *ptr) {
            cudaFree(ptr);
        }
    };

    // Inherit from "Managed",
    // C++ now handles our deep copies
    class dataElem : public Managed {
        int prop1;
        int prop2;
        String text;
    };

    dataElem *data = new dataElem;


Unified Memory Roadmap
CUDA 6: Ease of Use
– Single pointer to data
– No memcopy required
– Coherence @ launch & sync
– Shared C/C++ data structures
Next (Maxwell): Optimizations
– Prefetching
– Migration hints
– System allocator unified
– Stack memory unified
– Additional OS support
– HW-accelerated coherence

Extended (XT) Library Interfaces
Automatic scaling to >1 GPU per node
– cuFFT and cuBLAS level 3
– Tuned for multi-GPU cards (K10)
– Out-of-core operations: e.g. very large GEMM
– BLAS 3 host interfaces: automatically overlap memory transfers

Multi-GPU cuFFT
Single & batch transforms across multiple GPUs (max 2 in CUDA 6)
Better scaling for larger transforms
[Chart: cuFFT 3D execution time (ms) on K10, single vs dual GPU, for 256x256x256 and 512x512x512 transforms; does not include memcpy time]


Multi-GPU cuBLAS
Single function call automatically spreads work across two GPUs
Source and result data in system memory
Supports matrices > size of memory (out-of-core)
All BLAS Level-3 routines
[Chart: cuBLAS ZGEMM performance (GFLOPS) on 1 vs 2 K20c as a function of matrix size (NxN), with the in-core limit marked]

New Drop-in NVBLAS Library
Drop-in replacement for CPU-only BLAS
Automatically routes standard BLAS3 calls to cuBLAS
Optionally configure which routines and matrix sizes are accelerated
User provides the CPU-only BLAS dynamic library location
Simply re-link or change the library load order:

    gcc myapp.c -lnvblas -lmkl_rt -o myapp
        - or -
    env LD_PRELOAD=libnvblas.so myapp

New Drop-in NVBLAS Library
Drop-in replacement for CPU-only BLAS
Automatically routes BLAS3 calls to cuBLAS
Example: drop-in for R

    > LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so
    > A <- matrix(rnorm(4096*4096), nrow=4096, ncol=4096)
    > B <- matrix(rnorm(4096*4096), nrow=4096, ncol=4096)
    > system.time(C <- A %*% B)
       user  system elapsed
      0.348   0.142   0.289

Use in any app that uses standard BLAS3: Octave, Scilab, etc.
[Chart: matrix-matrix multiplication in R, GFlop/s vs matrix dimension, nvBLAS on 4x K20X GPUs vs fp64 MKL on a 6-core Xeon E5-2667 CPU]


GPUDirect RDMA in MVAPICH2 & OpenMPI
Reduced inter-node latency
Better MPI application scaling
[Chart: preliminary performance of MVAPICH2 with GPU-Direct-RDMA; execution time of the HSG application with 2 GPU nodes, 4 MPI processes / GPU node; MVAPICH2-1.9 vs MVAPICH2-1.9-GDR, 36-67% improvement across side numbers 4-256. Based on MVAPICH2-1.9, Sandy Bridge (E5-2670) node with 16 cores, NVIDIA Tesla K20c GPU, Mellanox ConnectX-3 FDR HCA, CUDA 5.0, OFED 1.5.4.1 with GPU-Direct-RDMA patch]

Remote Development with Nsight Eclipse Edition
Local IDE, remote application
– Edit locally, build & run remotely
– Automatic sync via ssh
– Cross-compilation to ARM
– Full debugging & profiling via remote connection

CUDA tools for MPS (Multi-Process Server)
– Profile MPI apps on MPS using nvprof
– Import multi-process MPI ranks into Visual Profiler
– Run CUDA-MEMCHECK on apps running on MPS


Detailed Kernel Profiling (Visual Profiler and Nsight EE)
Detailed Instruction Mix Visualization (Visual Profiler and Nsight EE)
Instruction counts automatically locate hot spots in your code
[Screenshots: detected hot spot in the source view with the corresponding assembly]

CUDA 6

Dramatically Simplifies Parallel Programming with Unified Memory

Sign up for CUDA Registered Developer Program https://developer.nvidia.com/cuda-toolkit

GPU Optimization, Part 1
Peter Messmer

Drinking from the Firehose
Lots of information here
Interrupt me with questions
– No, seriously. Please do this.
Tell me if you're lost before we move on
– Be brave, you're likely not alone
Discussion is good for learning


GPU OPTIMIZATION FUNDAMENTALS

Main Requirements for GPU Performance
Expose sufficient parallelism
Utilize parallel execution resources efficiently
Use the memory system efficiently
– Coalesce global memory accesses
– Use shared memory where possible
Have coherent execution within warps of threads

APOD: A Systematic Path to Performance
Assess → Parallelize → Optimize → Deploy (repeat)

Assess
Identify hotspots (total time, number of calls)
Understand scaling (strong and weak)


Parallelize
Applications can be parallelized with libraries, compiler directives, or programming languages

Optimize
Profile-driven optimization
Tools:
– nsight: Visual Studio Edition or Eclipse Edition
– nvvp: NVIDIA Visual Profiler
– nvprof: command-line profiling

Deploy
Productize
– Check API return values
– Run cuda-memcheck tools
Library distribution
Cluster management
Early gains
Subsequent changes are evolutionary

ASSESS


Assess
Profile the code, find the hotspot(s)
Focus your attention where it will give the most benefit

Assess
We've found a hotspot to work on!
– What percent of our total time does this represent?
– How much can we improve it? What is the "speed of light"?
– How much will this improve our overall performance?

Assess
Let's investigate…
– Strong scaling and Amdahl's Law
– Weak scaling and Gustafson's Law
– Expected perf limiters: bandwidth? computation? latency?

Assess: Understanding Scaling
Strong scaling
– A measure of how, for fixed overall problem size, the time to solution decreases as more processors are added to a system
– Linear strong scaling: the speedup achieved is equal to the number of processors used
Amdahl's Law:

    S = 1 / ((1 - P) + P/N)  ≈  1 / (1 - P)


Assess: Understanding Scaling
Weak scaling
– A measure of how the time to solution changes as more processors are added with fixed problem size per processor
– Linear weak scaling: the overall problem size increases as the number of processors increases, but execution time remains constant
Gustafson's Law:

    S = N + (1 - P)(1 - N)

Assess: Applying Strong and Weak Scaling
Understanding which type of scaling is most applicable is an important part of estimating speedup:
– Sometimes the problem size will remain constant
– Other times the problem size will grow to fill the available processors
Apply either Amdahl's or Gustafson's Law to determine an upper bound for the speedup

Assess: Applying Strong Scaling
Recall that in this case we want to optimize an existing kernel with a pre-determined workload
That's strong scaling, so Amdahl's Law will determine the maximum speedup

Assess: Applying Strong Scaling
Say, for example, our kernel is ~93% of total time:

    Speedup S = 1 / ((1 - P) + P/S_P)     (S_P = speedup of the parallel part)

In the limit when S_P is huge, S will approach 1 / (1 - 0.93) ≈ 14.3
In practice, it will be less than that, depending on the S_P achieved
Getting S_P to be high is the goal of optimizing, of course


Assess: Speed of Light
What's the limiting factor?
– Memory bandwidth?
– Compute throughput?
– Latency?
Not sure?
– Get a rough estimate by counting bytes per instruction and compare it to the "balanced" peak ratio GBytes/sec : Ginsns/sec
– The profiler will help you determine this

Assess: Limiting Factor
Comparing bytes per instruction will give you a guess as to whether you're likely to be bandwidth-bound or instruction-bound
Comparing actual achieved GB/s vs. theory and achieved Ginstr/s vs. theory will give you an idea of how well you're doing
– If both are low, then you're probably latency-bound and need to expose more (concurrent) parallelism

Assess: Limiting Factor
What's the limiting factor? Memory bandwidth? Compute throughput? Latency?
Consider SpMV: intuitively we expect it to be bandwidth-limited

Assess: Speed of Light
Say we discover we're getting only ~38% of peak bandwidth
If we aim to get this up to ~65% of peak, that's a 1.7x speedup for this kernel
1.7x for this kernel translates into 1.6x overall due to Amdahl:

    S = 1 / ((1 - 0.93) + 0.93/1.7) ≈ 1.6


Assess: Limiting Factor

For our example SpMV kernel, our first discovery was that we’re latency-limited, not bandwidth, since utilization was so low

This tells us our first "optimization" step actually needs to be related to how we expose (memory-level) parallelism

PARALLELIZE


Parallelize

PARALLELIZE Computation

Applications can be parallelized with libraries, compiler directives, or programming languages
Pick the best tool for the job


Parallelize: e.g., with GPU Accelerated Libraries
NVIDIA cuBLAS (linear algebra), cuSPARSE, cuFFT, cuRAND, NPP (vector signal / image processing), Thrust (building-block C++ templated parallel algorithms), Matrix Algebra on GPU and Multicore, IMSL Library, CenterSpace NMath, ...

Parallelize: e.g., with Thrust
Similar to the C++ STL
High-level interface
– Enhances developer productivity
– Enables performance portability between GPUs and multicore CPUs
Flexible
– Backends for CUDA, OpenMP, TBB
– Extensible and customizable
– Integrates with existing software
Open source
thrust.github.com or developer.nvidia.com/thrust

    // generate 32M random numbers on host
    thrust::host_vector<int> h_vec(32 << 20);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // transfer data to device (GPU)
    thrust::device_vector<int> d_vec = h_vec;

    // sort data on device
    thrust::sort(d_vec.begin(), d_vec.end());

    // transfer data back to host
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());

Parallelize: e.g., with OpenACC
Directives-based approach
Compiler parallelizes code
Works on many-core GPUs & multicore CPUs
www.nvidia.com/gpudirectives

    Program myscience
      ... serial code ...
    !$acc kernels
      do k = 1,n1
        do i = 1,n2
          ... parallel code ...
        enddo
      enddo
    !$acc end kernels
      ...
    End Program myscience

Parallelize: e.g., with CUDA C
developer.nvidia.com/cuda-toolkit

Standard C code:

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    // Perform SAXPY on 1M elements
    saxpy_serial(4096*256, 2.0, x, y);

CUDA C code:

    __global__
    void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }

    // Perform SAXPY on 1M elements
    saxpy_parallel<<<4096,256>>>(n, 2.0, x, y);


Parallelism Needed
GPU is a parallel machine
– Lots of arithmetic pipelines
– Multiple memory banks
To get good performance, your code must expose sufficient parallelism for 2 reasons:
– To actually give work to all the pipelines
– To hide latency of the pipelines
Rough rule of thumb for Tesla K20X: you want to have 14K or more threads running concurrently

Case Study: Matrix Transpose

    void transpose(float in[][], float out[][], int N)
    {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                out[j][i] = in[i][j];
    }

An Initial CUDA Version

    __global__ void transpose(float in[], float out[], int N)
    {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                out[i*N+j] = in[j*N+i];
    }

    float in[N*N], out[N*N];
    ...
    transpose<<<1,1>>>(in, out, N);

+ Quickly implemented    - Performance weak
Need to expose parallelism!

Parallelize across matrix elements
Process elements independently: each thread (tid) in each block (bid) handles one element

    __global__ void transpose(float in[], float out[])
    {
        int tid = threadIdx.x;
        int bid = blockIdx.x;

        out[tid*N+bid] = in[bid*N+tid];
    }

    float in[], out[];
    ...
    transpose<<<N,N>>>(in, out);


PARALLELIZE Data Transfer

Asynchronicity = Overlap = Parallelism
Heterogeneous system: overlap work and data movement
[Figure: CPU, DMA engines and GPU working concurrently]

Asynchronicity
This is the kind of case we would be concerned about:
– Found the top kernel, but the GPU is mostly idle – that is our bottleneck
– Need to overlap CPU/GPU computation and PCIe transfers

Parallelize: Achieve Asynchronicity
What we want to see is maximum overlap of all engines


Main Requirements for GPU Performance
Expose sufficient parallelism
Utilize parallel execution resources efficiently
Use the memory system efficiently
– Coalesce global memory accesses
– Use shared memory where possible
Have coherent execution within warps of threads

OPTIMIZE

GPU Optimization Fundamentals
Find ways to parallelize sequential code
Adjust kernel launch configuration to maximize device utilization
Ensure global memory accesses are coalesced
Minimize redundant accesses to global memory
Avoid different execution paths within the same warp
Minimize data transfers between the host and the device
http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/

GPU Optimization Fundamentals
Kernel optimizations
– Launch configuration
– Global memory throughput
– Shared memory access
– Instruction throughput / control flow
Optimization of CPU-GPU interaction
– Maximizing PCIe throughput
– Overlapping kernel execution with memory copies


OPTIMIZE: Kernel Optimizations: Kernel Launch Configuration

Kernel Launch Configuration
A kernel is a function that runs on the GPU
A kernel is launched as a grid of blocks of threads
Launch configuration is the number of blocks and number of threads per block, expressed in CUDA with the <<< >>> notation:

    mykernel<<<blocks, threads_per_block>>>(…);

What values should we pick for these?
– Need enough total threads to process the entire input
– Need enough threads to keep the GPU busy
– Selection of block size is an optimization step involving warp occupancy
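A common sketch for sizing the launch (round the block count up so every input element is covered; the kernel name and sizes are illustrative):

    int N = 1 << 20;                 // problem size
    int threadsPerBlock = 256;
    int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;   // ceiling division

    mykernel<<<blocks, threadsPerBlock>>>(d_data, N);
    // inside the kernel, guard the extra threads with: if (idx < N) { ... }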

High-level view of GPU Architecture
Several Streaming Multiprocessors
– E.g., Kepler GK110 has up to 15 SMs
L2 cache shared among SMs
Multiple channels to DRAM

Kepler Streaming Multiprocessor (SMX)
Per SMX:
– 192 SP CUDA cores
– 64 DP CUDA cores
– 4 warp schedulers
  . Up to 2048 concurrent threads
  . One or two instructions issued per scheduler per clock from a single warp
– Register file (256 KB)
– Shared memory (48 KB)


CUDA Execution Model (software ↔ hardware)
Thread: sequential execution unit
– All threads execute the same sequential program
– Threads execute in parallel
– Threads are executed by scalar CUDA cores
Block: a group of threads
– Executes on a single Streaming Multiprocessor (SM); thread blocks do not migrate
– Threads within a block can cooperate: light-weight synchronization, data exchange
– Several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources (shared memory and register file)
Grid: a collection of thread blocks
– A kernel is launched as a grid of thread blocks
– Thread blocks of a grid execute across multiple SMs
– Thread blocks do not synchronize with each other; communication between blocks is expensive

Launch Configuration: General Guidelines
How many blocks should we use?
– 1,000 or more thread blocks is best
– Rule of thumb: enough blocks to fill the GPU at least 10s of times over
– Makes your code ready for several generations of future GPUs

Launch Configuration: General Guidelines
How many threads per block should we choose?
– The really short answer: 128, 256, or 512 are often good choices
– The slightly longer answer:
  . Pick a size that suits the problem well
  . Multiples of 32 threads are best
  . Pick a number of threads per block (and a number of blocks) that is sufficient to keep the SM busy


Warps
A thread block consists of warps of 32 threads
A warp is executed physically in parallel on some multiprocessor
Threads of a warp issue instructions in lock-step (as with SIMD)

Hardware Levels of Parallelism
– SIMD: Single Instruction, Multiple Data — in-core parallelism
– SMT: Simultaneous Multithreading — cross-core, cross-socket; tightly-coupled single computer; OpenMP, pthreads
– SIMT: Single Instruction, Multiple Threads — in-processor parallelism; many threads on many cores
– MPI: multiple "computers"; supercomputing apps
These form a continuum. Best performance is achieved with a mix.

Occupancy
Need enough concurrent warps per SM to hide latencies:
– Instruction latencies
– Memory access latencies
Hardware resources determine the number of warps that fit per SM
Occupancy = N_actual / N_max

Low Latency or High Throughput?
CPU architecture must minimize latency within each thread
GPU architecture hides latency with computation from other (warps of) threads
[Figure: CPU core (low-latency processor) vs GPU streaming multiprocessor (high-throughput processor) context-switching between warps]

227 228

Latency Hiding
- Instruction latencies: roughly 10-20 cycles for arithmetic operations; DRAM accesses have higher latencies (400-800 cycles)
- Instruction Level Parallelism (ILP): independent instructions between two dependent ones; ILP depends on the code, extracted by the compiler. Example (ILP=2):
    FFMA R0, R43, R0, R4;
    FFMA R1, R43, R4, R5;
    FMUL R7, R9, R0;
    FMUL R8, R9, R1;
    ST.E [R2], R7;
- Switching to a different warp: if a warp must stall for N cycles due to dependencies, having N other warps with eligible instructions keeps the SM going
- Switching among concurrently resident warps has no overhead: state (registers, shared memory) is partitioned, not stored/restored

Occupancy
- Occupancy: number of concurrent warps per SM, expressed either as the absolute number of warps that fit concurrently (e.g., 1..64) or as the ratio of warps that fit concurrently to the architectural maximum (0..100%)
- The number of warps that fit is determined by resource availability: threads per thread block, registers per thread, shared memory per thread block
- Kepler SM resources: 64K 32-bit registers, up to 48 KB of shared memory, up to 2048 concurrent threads, up to 16 concurrent thread blocks
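A back-of-the-envelope illustration of how per-thread register use limits occupancy on such an SM (the 40 registers per thread is a made-up example; real hardware also allocates registers with per-warp granularity, so treat this only as an estimate):

    int regsPerThread   = 40;          // hypothetical, e.g. from -Xptxas -v output
    int regsPerSM       = 64 * 1024;   // Kepler SM: 64K 32-bit registers
    int maxThreadsPerSM = 2048;

    int threadsByRegs = regsPerSM / regsPerThread;                  // 1638 threads
    int threads = (threadsByRegs < maxThreadsPerSM) ? threadsByRegs
                                                    : maxThreadsPerSM;
    int warps = threads / 32;                                       // ~51 warps
    float occupancy = (float)warps / (maxThreadsPerSM / 32);        // ~0.8, i.e. ~80%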

229 230 Occupancy and Performance Thread Block Size and Occupancy

Occupancy and Performance
- Note that 100% occupancy isn't needed to reach maximum performance: once the "needed" occupancy (enough warps to switch among to cover latencies) is reached, further increases won't improve performance
- The level of occupancy needed depends on the code: more independent work per thread -> less occupancy is needed; memory-bound codes tend to need more occupancy (higher latency than for arithmetic, so more work is needed to hide it)

Thread Block Size and Occupancy
- Thread block size is a multiple of warp size (32); even if you request fewer threads, the hardware rounds up
- Thread blocks can be too small: a Kepler SM can run up to 16 thread blocks concurrently, so the SM can reach the block count limit before reaching good occupancy (e.g., 1-warp blocks = 16 warps/SM on Kepler, 25% occupancy – probably not enough)
- Thread blocks can be too big: there may be enough SM resources for more threads, but not enough for a whole block; a thread block isn't started until resources are available for all of its threads

231 232

Thread Block Sizing CUDA Occupancy Calculator

Number of warps allowed by SM resources Too few SM resources: threads per block Registers Shared memory Analyze effect of resource consumption on occupancy

Too many threads per block

233 234 Occupancy Analysis in NVIDIA Visual Profiler Kepler: Level of Parallelism Needed

Occupancy analysis in the NVIDIA Visual Profiler: in the example shown, occupancy is limited by grid size and the number of threads per block.

Kepler: level of parallelism needed
- To saturate instruction bandwidth: ~1.7K independent fp32 math instructions per SM (fewer for lower-throughput instructions); keep in mind that Kepler can track up to 2048 threads per SM
- To saturate memory bandwidth: 100+ concurrent independent 128-byte lines per SM

235 236 GPU Optimization Part 2 OPTIMIZE

Peter Messmer Kernel Optimizations: Global Memory Throughput

237 238

CUDA Memory Architecture Optimizing Memory Throughput

CUDA Memory Architecture
[Figure: host (CPU, system DRAM, chipset) connected over PCIe to the GPU; each multiprocessor has registers, shared memory and L1; all multiprocessors share the L2 cache, the constant and texture caches, and device DRAM (local, global, constant, texture memory)]

Optimizing Memory Throughput
- Goal: utilize all available memory bandwidth
- Little's Law: # bytes in flight = latency * bandwidth
- Increase parallelism (bytes in flight), and/or reduce latency (time between requests)

239 240 Illustration: Little’s Law for Escalators Memory-Level Parallelism = Bandwidth

Illustration: Little's Law for Escalators
- Say the parameters of our escalator are: 1 person fits on each step; a step arrives every 2 seconds (bandwidth = 0.5 persons/s); it is 20 steps tall (latency = 40 seconds)
- 1 person in flight: 0.025 persons/s achieved
- To saturate bandwidth: need 1 person arriving every 2 s, which means we need 20 persons in flight
- The idea: bandwidth × latency. It takes latency time units for the first person to arrive, and we need bandwidth persons to get on the escalator every time unit

Memory-Level Parallelism = Bandwidth
- In order to saturate memory bandwidth, the SM must have enough independent memory requests in flight concurrently
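Translating the escalator picture back to GPU memory, with assumed order-of-magnitude numbers (not measured values from the slides):

    /* Little's Law with illustrative numbers: bytes in flight = latency * bandwidth */
    double bandwidth = 200e9;      // ~200 GB/s device memory bandwidth (assumption)
    double latency   = 500e-9;     // ~500 ns access latency (assumption)

    double bytesInFlight = bandwidth * latency;      // ~100 KB for the whole GPU
    double linesInFlight = bytesInFlight / 128.0;    // ~800 concurrent 128-byte lines
    // Spread over ~13-15 SMs that is several tens to ~100 lines per SM,
    // consistent in magnitude with the figure quoted a few slides later.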

241 242

Memory-Level Parallelism: Requests in flight Requests per Thread and Performance

Memory-level parallelism: requests in flight – achieved Kepler memory throughput, shown as a function of the number of concurrent requests per SM with 128-byte lines; the memcopy kernel has 2 concurrent requests per warp (one write and the read following it).
Requests per thread and performance – experiment: vary the size of the accesses made by the threads of a warp and check performance. Accesses by a warp: 4-byte words need 1 line, 8-byte words 2 lines, 16-byte words 4 lines.

To achieve same throughput at lower occupancy or with smaller words, need more independent requests per warp

243 244 Optimizing Access Concurrency

Ways to increase concurrent accesses:
- Increase occupancy (run more warps concurrently): adjust block dimensions to maximize occupancy; if occupancy is limited by registers per thread, try to reduce the register count (-maxrregcount option or __launch_bounds__)
- Modify the code to process several elements per thread: doubling the elements per thread doubles the independent accesses per thread

OPTIMIZE Kernel Optimizations: Global Memory Access Coalescing
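Both ideas can be sketched in one kernel (illustrative, not from the slides): a grid-stride loop gives each thread several independent accesses, and __launch_bounds__ tells the compiler the maximum block size so it can budget registers accordingly:

    // Sketch: each thread strides over the whole grid, issuing several
    // independent, fully coalesced loads and stores.
    __global__ void __launch_bounds__(256)      // assumed maximum block size
    scale_kernel(double *dst, const double *src, double a, int n)
    {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += gridDim.x * blockDim.x)       // stride = total thread count
        {
            dst[i] = a * src[i];
        }
    }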

245 246

Mechanics of a Memory Access Memory Access Efficiency Analysis

Mechanics of a Memory Access
- Memory operations are issued per warp, just like all other instructions
- Operation: the threads in a warp provide memory addresses; hardware determines which lines/segments are needed and fetches them; memory is accessed at 32-byte granularity
- Broadcast: the same small transaction serves many threads in a warp

Memory Access Efficiency Analysis
- Two perspectives on the throughput: the application's point of view (count only the bytes requested by the application) and the HW point of view (count all bytes moved by the hardware)
- The two views can be different: with a scattered or offset pattern, the application doesn't use all the bytes the hardware actually transferred

247 248 Access Patterns vs. Memory Throughput Access Patterns vs. Memory Throughput

Scenario: warp requests 32 aligned, consecutive 4-byte words. Addresses fall within 4 segments; the warp needs 128 bytes and 128 bytes move across the bus. Bus utilization: 100%.
Scenario: warp requests 32 aligned, permuted 4-byte words. Addresses still fall within 4 segments; the warp needs 128 bytes and 128 bytes move across the bus. Bus utilization: 100%.

addresses from a warp addresses from a warp ......

0 32 64 96 128 160 192 224 256 288 320 352 384 416 448 0 32 64 96 128 160 192 224 256 288 320 352 384 416 448 Memory addresses Memory addresses

249 250

Access Patterns vs. Memory Throughput Access Patterns vs. Memory Throughput

Scenario: warp requests 32 misaligned, consecutive 4-byte words. Addresses fall within at most 5 segments; the warp needs 128 bytes and at most 160 bytes move across the bus. Bus utilization: at least 80% (some misaligned patterns still fall within 4 segments, giving 100% utilization).
Scenario: all threads in a warp request the same 4-byte word. Addresses fall within a single segment; the warp needs 4 bytes but 32 bytes move across the bus. Bus utilization: 12.5%.

addresses from a warp addresses from a warp ......

0 32 64 96 128 160 192 224 256 288 320 352 384 416 448 0 32 64 96 128 160 192 224 256 288 320 352 384 416 448 Memory addresses Memory addresses

251 252 Access Patterns vs. Memory Throughput Parallelizing SAXPY

Scenario: warp requests 32 scattered 4-byte words; the more segments the addresses fall into, the more bytes move across the bus for the same 128 requested bytes, and the lower the bus utilization.

Parallelizing SAXPY – start from the serial CPU loop and divide the work equally among T threads:

    void saxpy(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    }

0 32 64 96 128 160 192 224 256 288 320 352 384 416 448 Memory addresses

253 254

Parallelizing SAXPY Parallelizing SAXPY

Parallelizing SAXPY – first attempt: each thread computes one contiguous region of the array.

    __global__ void saxpy1(int n, float a, float *x, float *y)
    {
        int workPerThread = 1 + n / blockDim.x;
        int base = threadIdx.x * workPerThread;
        for (int i = 0; i < workPerThread; i++) {
            y[base + i] += a * x[base + i];   // (bounds check against n omitted on the slide)
        }
    }

In SIMT, the 32 threads of a warp issue the x[base+i] load simultaneously. Each thread has a different value of base, so when workPerThread > 1 this becomes a strided load: thread 0, thread 1, … thread 31 touch widely separated addresses.

255 256 Parallelizing SAXPY A Better Way to Parallelize SAXPY

(The strided-load problem of saxpy1 is repeated on the next slide.)

A Better Way to Parallelize SAXPY: divide the work up so that on each pass through the loop the thread block computes one "contiguous region" of the array. This achieves memory coalescing:

    __global__ void saxpy2(int n, float a, float *x, float *y)
    {
        int loopCount = 0;
        int id = threadIdx.x;                      // first contiguous region
        while (id < n) {
            y[id] += a * x[id];
            loopCount++;
            id = loopCount * blockDim.x + threadIdx.x;
        }
    }

257 258

A Better Way to Parallelize SAXPY Structures of Non-Native Size

A Better Way to Parallelize SAXPY (cont'd): the area of x addressed by each warp is contiguous in global memory, so the number of global memory transactions is minimized; the same effect applies to the loads and stores of y.

Structures of Non-Native Size: say we are reading a 12-byte structure per thread:

    struct Position
    {
        float x, y, z;
    };
    ...
    __global__ void kernel( Position *data, ... )
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        Position temp = data[idx];
        ...
    }

259 260 Structure of Non-Native Size First Load Instruction

First Load Instruction: the compiler converts temp = data[idx] into 3 loads, each loading 4 bytes. It can't do an 8-byte and a 4-byte load: 12 bytes per element means every other element wouldn't align the 8-byte load on an 8-byte boundary. For each of the three loads, successive threads in the warp read 4 bytes at a 12-byte stride.
[Figure: addresses from a warp spread over memory addresses 0-64]

261 262

Second Load Instruction Third Load Instruction

addresses from a warp addresses from a warp

......

0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60 64 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60 64

263 264 Performance and Solutions Global Memory Access Patterns

Performance and Solutions
- Because of the address pattern, we end up moving 3x more bytes than the application requests; we waste a lot of bandwidth, leaving performance on the table
- Potential solutions: change the data layout from array of structures to structure of arrays (in this case: 3 separate arrays of floats) – the most reliable approach (also ideal for both CPUs and GPUs); use loads via the read-only cache (as long as lines survive in the cache, performance will be nearly optimal); stage loads via shared memory

Global Memory Access Patterns
- SoA vs AoS: good: point.x[i]; not so good: point[i].x
- Strided array access: ~OK: x[i] = a[i+1] – a[i]; slower: x[i] = a[64*i] – a[i]
- Random array access: slower: a[rand(i)]
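As a sketch of the structure-of-arrays layout for the 12-byte Position example above (names are illustrative):

    /* AoS -> SoA: three separate float arrays instead of an array of structs,
       so each of the three loads below is a coalesced 4-byte-per-thread access
       rather than a 12-byte-strided one. */
    struct PositionSoA { float *x, *y, *z; };   // hypothetical container

    __global__ void kernel_soa(PositionSoA data, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n) {
            float px = data.x[idx];   // consecutive threads -> consecutive addresses
            float py = data.y[idx];
            float pz = data.z[idx];
            /* ... use px, py, pz ... */
        }
    }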

265 266

Summary: GMEM Optimization A note about caches

Summary: GMEM Optimization
- Strive for perfect address coalescing per warp: align the starting address (may require padding); a warp should ideally access within a contiguous region; avoid scattered address patterns or patterns with large strides between threads
- Analyze and optimize address patterns: use profiling tools (included with the CUDA toolkit download); compare the transactions per request to the ideal ratio
- Choose an appropriate data layout (prefer SoA); if needed, try read-only loads or staging accesses via SMEM

A note about caches
- L1 and L2 caches: ignore in software design; with thousands of concurrent threads, cache blocking is difficult at best
- Read-only data cache: shared with the texture pipeline; useful for uncoalesced reads; handled by the compiler when const __restrict__ is used, or use the __ldg() primitive

267 268 Blocking for GPU Memory Caches Read-only Data Cache

Blocking for GPU Memory Caches
- Short answer: DON'T
- GPU caches are not intended for the same use as CPU caches: smaller size (especially per thread), so not aimed at temporal reuse; intended to smooth out some access patterns, help with spilled registers, etc.
- Usually not worth trying to cache-block like you would on a CPU: 100s to 1,000s of run-time scheduled threads compete for the cache; if it is possible to block for L1 then it's possible to block for SMEM – same size, same or higher bandwidth, and guaranteed locality (hw will not evict behind your back)

Read-only Data Cache
- Go through the read-only cache; it is not coherent with writes, thus the addresses must not be written by the same kernel
- Two ways to enable: decorate pointer arguments as hints to the compiler (pointer of interest: const __restrict__; all other pointer arguments: __restrict__ – conveys to the compiler that no aliasing will occur), or use the __ldg() intrinsic, which requires no pointer decoration

269 270

Read-only Data Cache Read-only Data Cache

The two ways to enable the read-only cache, side by side.

Using the __ldg() intrinsic – requires no pointer decoration:

    __global__ void kernel( int *output, int *input )
    {
        ...
        output[idx] = __ldg( &input[idx] );
    }

Decorating pointer arguments as hints to the compiler (pointer of interest: const __restrict__, all other pointer arguments: __restrict__, conveying that no aliasing will occur):

    __global__ void kernel( int* __restrict__ output,
                            const int* __restrict__ input )
    {
        ...
        output[idx] = input[idx];
    }

271 272 Texture and Constant Memory Texture

Texture and Constant Memory: read-only; data resides in global memory; read via special-purpose caches.

Texture
- Separate cache; dedicated texture cache hardware provides:
- Out-of-bounds index handling: clamp or wrap-around
- Optional interpolation (think: using fp indices for arrays): linear, bilinear, trilinear; interpolation weights are 9-bit
- Optional format conversion: {char, short, int} -> float
- All of these are "free"

273 274

Examples of Texture Object Indexing

Examples of Texture Object Indexing
[Figure: a 2D texture indexed with floating-point coordinates such as (2.5, 0.5) and (1.0, 1.0); integer indices fall between elements, and the optional interpolation weights are determined by the coordinate distance. Out-of-range coordinates such as (5.5, 1.5) are handled by index wrap or index clamp.]

OPTIMIZE Kernel Optimizations: Shared Memory Accesses
275 276 Shared Memory Shared Memory Organization

Shared Memory
- Fast, on-chip memory within the SM (alongside registers and L1)
- Accessible by all threads within a thread block; common allocation for the entire thread block
- Variety of uses: software-managed cache (e.g., tiled DGEMM), global memory coalescing (e.g., transpose), communication within a thread block (e.g., FFT, reductions)
- Limited resource: use of shared memory affects occupancy

Shared Memory Organization
- Organized in 32 independent banks
- Any 1:1 or multicast pattern is conflict-free; optimal access: no two words from the same bank; banks can multicast
- Multiple words from the same bank serialize

277 278

Bank Addressing Examples Bank Addressing Examples

. No Bank Conflicts . No Bank Conflicts . 2-way Bank Conflicts . 8-way Bank Conflicts

x8 Thread 0 Bank 0 Thread 0 Bank 0 Thread 0 Bank 0 Thread 0 Bank 0 Thread 1 Bank 1 Thread 1 Bank 1 Thread 1 Bank 1 Thread 1 Bank 1 Thread 2 Bank 2 Thread 2 Bank 2 Thread 2 Bank 2 Thread 2 Bank 2 Thread 3 Bank 3 Thread 3 Bank 3 Thread 3 Bank 3 Thread 3 Thread 4 Bank 4 Thread 4 Bank 4 Thread 4 Bank 4 Thread 4 Thread 5 Bank 5 Thread 5 Bank 5 Bank 5 Thread 5 Bank 7 Bank 6 Thread 6 Bank 6 Thread 6 Bank 6 Thread 6 Bank 8 Thread 7 Bank 7 Thread 7 Bank 7 Bank 7 Thread 7 Thread 28 x8 Bank 9 Thread 29 Thread 30 Thread 31 Bank 31 Thread 31 Bank 31 Thread 31 Bank 31 Thread 31 Bank 31

279 280 Motivating Example: Matrix Transpose Transposing with Shared Memory

Motivating Example: Matrix Transpose

    __global__ void gpuTranspose_kernel(int rows, int cols,
                                        float *in, float *out)
    {
        int i, j;
        i = blockIdx.x * blockDim.x + threadIdx.x;
        j = blockIdx.y * blockDim.y + threadIdx.y;
        out[i * rows + j] = in[j * cols + i];
    }

Either the write or the read is strided in gmem and therefore uncoalesced. Solution: tile in shared memory.

Transposing with Shared Memory
1. Read block_ij into shared memory – reads are coalesced
2. Transpose the shared memory indices
3. Write the transposed block to global memory – writes are coalesced
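A commonly used tiled-transpose kernel along these lines (a sketch, not the course's reference code; TILE_DIM = 32 and the +1 padding are conventional choices, and the padding avoids the shared-memory bank conflicts discussed on the following slides):

    #define TILE_DIM 32

    __global__ void transpose_tiled(int rows, int cols,
                                    const float *in, float *out)
    {
        __shared__ float tile[TILE_DIM][TILE_DIM + 1];   // +1 column: no bank conflicts

        int x = blockIdx.x * TILE_DIM + threadIdx.x;     // column index in 'in'
        int y = blockIdx.y * TILE_DIM + threadIdx.y;     // row index in 'in'
        if (x < cols && y < rows)
            tile[threadIdx.y][threadIdx.x] = in[y * cols + x];    // coalesced read

        __syncthreads();

        x = blockIdx.y * TILE_DIM + threadIdx.x;         // column index in 'out'
        y = blockIdx.x * TILE_DIM + threadIdx.y;         // row index in 'out'
        if (x < rows && y < cols)
            out[y * rows + x] = tile[threadIdx.x][threadIdx.y];   // coalesced write
    }

Launched with dim3 block(TILE_DIM, TILE_DIM) and a grid of (cols/TILE_DIM) x (rows/TILE_DIM) blocks, rounded up.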

281 282

Shared Memory Organization Shared Memory: Avoiding Bank Conflicts

Shared Memory Organization (recap)
- Organized in 32 independent banks – note: the same as the warp size, not a coincidence
- Successive 32-bit words are assigned to successive banks, modulo 32
- Any 1:1 or multicast pattern is fine; optimal access: no two words from the same bank; separate banks per thread; banks can multicast
- Multiple words from the same bank serialize: called a bank conflict, causes instruction replay

Shared Memory: Avoiding Bank Conflicts
- Example: 32x32 SMEM array; a warp accesses a column: 32-way bank conflicts (all threads in the warp access the same bank)

283 284 Shared Memory: Avoiding Bank Conflicts Shared Memory: Avoiding Bank Conflicts

Example: 32x32 SMEM array
- Accesses along a row produce no bank conflicts
- Accesses along a column produce 32-way bank conflicts (replays), since all threads of the warp hit the same bank

Add a column for padding: 32x33 SMEM array
- Accesses along a row still produce no bank conflicts
- Accesses along a column now hit 32 different banks, so no bank conflicts

285 286

Shared Memory/L1 Sizing Final Notes on Shared Memory

Shared Memory/L1 Sizing
- Shared memory and L1 use the same 64 KB of physical memory; the split is program-configurable: Fermi: 48:16 or 16:48; Kepler: 48:16, 16:48 or 32:32
- CUDA API: cudaDeviceSetCacheConfig(), cudaFuncSetCacheConfig()
- 16k L1 / 48k shared (default on both Fermi and Kepler), 48k L1 / 16k shared, 32k L1 / 32k shared (Kepler only)
- A large L1 can improve performance when spilling registers (more lines in the cache -> fewer evictions); a large SMEM can improve performance when occupancy is limited by SMEM

Final Notes on Shared Memory
- Fast: high bandwidth, low latency
- Useful as a user-managed cache for coalescing, caching, and communication within a thread block
- Shared memory size / L1 cache size is API-configurable
- Be careful of: overuse (excessive allocation can hurt occupancy) and access patterns (lots of bank conflicts can hurt performance)
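A minimal sketch of selecting the split through the runtime API (the kernel name is hypothetical):

    /* Device-wide preference: favour shared memory (48 KB SMEM / 16 KB L1). */
    CUDA_CHECK( cudaDeviceSetCacheConfig(cudaFuncCachePreferShared) );

    /* Per-kernel override for a register-heavy kernel that benefits from more L1. */
    CUDA_CHECK( cudaFuncSetCacheConfig(my_spilling_kernel, cudaFuncCachePreferL1) );

    /* On Kepler, cudaFuncCachePreferEqual selects the 32 KB / 32 KB split. */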

287 288 Exposing Sufficient Parallelism

What SMX ultimately needs: Sufficient number of independent instructions Kepler GK110 is “wider” than Fermi or GK104; needs more parallelism

Two ways to increase parallelism: More independent instructions (ILP) within a thread (warp) More concurrent threads (warps) OPTIMIZE Kernel Optimizations: Instruction Throughput / Control Flow

289 290

Independent Instructions: ILP vs. TLP Control Flow

Independent Instructions: ILP vs. TLP
- The SMX can leverage available Instruction-Level Parallelism more or less interchangeably with Thread-Level Parallelism
- Sometimes it is easier to increase ILP than to increase TLP, e.g., when the number of threads is limited by the algorithm or by HW resource limits
- If each thread has some degree of independent operations to do, the Kepler SMX can leverage that (e.g., a small loop that is unrolled)
- In fact, some degree of ILP is actually required to approach the theoretical maximum Instructions Per Clock (IPC)

Control Flow
- Instructions are issued per 32 threads (warp)
- Divergent branches: threads within a single warp take different paths (if-else, ...); the different execution paths within a warp are serialized
- Different warps can execute different code with no impact on performance

291 292 Control Flow Control Flow

Control Flow
- Avoid diverging within a warp; note: some divergence is not necessarily a problem, but large amounts impact execution efficiency
- Example with divergence: if (threadIdx.x > 2) {...} else {...} – branch granularity < warp size
- Example without divergence: if (threadIdx.x / warpSize > 2) {...} else {...} – branch granularity is a whole multiple of warp size

    if ( ... )
    {   // then-clause instructions
    }
    else
    {   // else-clause instructions
    }

293 294

Execution within warps is coherent Execution diverges within a warp

0 1 2 3 30 31 32 33 34 35 62 63 0 1 2 3 30 31 32 33 34 35 62 63 instructions / time / instructions time / instructions

Warp Warp (“vector” of threads) (“vector” of threads)

295 296 Execution diverges within a warp Runtime Math Library and Intrinsics

[Figure: when execution diverges within a warp, the then- and else-instructions are serialized over time for the two halves of the warp.] Solution: group threads with similar control flow.

Runtime Math Library and Intrinsics
- Two types of runtime math library functions:
- __func(): many map directly to the hardware ISA; fast but lower accuracy (see the CUDA Programming Guide for full details); examples: __sinf(x), __expf(x), __powf(x, y)
- func(): compiles to multiple instructions; slower but higher accuracy (5 ulp or less); examples: sin(x), exp(x), pow(x, y)
- A number of additional intrinsics: __sincosf(), __frcp_rz(), ...; explicit IEEE rounding modes (rz, rn, ru, rd)
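A tiny sketch contrasting the two flavours (the kernel is illustrative; nvcc's --use_fast_math switch makes the compiler substitute the intrinsic forms automatically):

    __global__ void compare_sin(float *diff, const float *angle, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float accurate = sinf(angle[i]);    // multiple instructions, higher accuracy
            float fast     = __sinf(angle[i]);  // hardware intrinsic, lower accuracy
            diff[i] = accurate - fast;          // inspect the error if interested
        }
    }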

297 298

Maximizing PCIe Throughput

Use transfers that are of reasonable size (a few MB, at least) Use pinned system memory Overlap memcopies with useful computation
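A sketch combining the three recommendations (buffer size, kernel and grid/block values are illustrative placeholders):

    /* Pinned host memory plus an asynchronous copy overlapped with
       independent work running in another stream. */
    float *h_buf, *d_buf;
    size_t bytes = 64 * 1024 * 1024;                       // "a few MB, at least"
    cudaStream_t copyStream, computeStream;
    dim3 grid(1024), block(256);                           // assumed launch config

    CUDA_CHECK( cudaMallocHost((void**)&h_buf, bytes) );   // pinned (page-locked)
    CUDA_CHECK( cudaMalloc((void**)&d_buf, bytes) );
    CUDA_CHECK( cudaStreamCreate(&copyStream) );
    CUDA_CHECK( cudaStreamCreate(&computeStream) );

    CUDA_CHECK( cudaMemcpyAsync(d_buf, h_buf, bytes,
                                cudaMemcpyHostToDevice, copyStream) );
    other_kernel<<<grid, block, 0, computeStream>>>();     // placeholder: independent work
    CUDA_CHECK( cudaStreamSynchronize(copyStream) );       // before d_buf is used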

OPTIMIZE Optimizing CPU-GPU Interaction: Maximizing PCIe Throughput

299 300 Deploy GPU Optimization Fundamentals

Deploy
- We've removed (or reduced) some bottleneck; our app is now faster while remaining fully functional – let's take advantage of that!
- Don't forget to check correctness at every step

GPU Optimization Fundamentals – recap:
- Develop systematically with APOD
- Expose sufficient parallelism
- Utilize parallel processing resources efficiently

Assess

Deploy Parallelize

Optimize

301 302

Online Resources

www.udacity.com

devtalk.nvidia.com

developer.nvidia.com docs.nvidia.com www.stackoverflow.com

303 MPI+CUDA

[Figure: a cluster of nodes 0 .. n-1; each node has a GPU with GDDR5 memory and a CPU with system memory, connected via PCI-e, plus a network card linking the nodes.]

CUDA Aware MPI

2

304 305

MPI+CUDA MPI+CUDA

[Figure: the same cluster of nodes 0 .. n-1 as above.]

    //MPI rank 0
    MPI_Send(s_buf_d,size,MPI_CHAR,n-1,tag,MPI_COMM_WORLD);

    //MPI rank n-1
    MPI_Recv(r_buf_d,size,MPI_CHAR,0,tag,MPI_COMM_WORLD,&stat);

With a CUDA-aware MPI, the device buffers (s_buf_d, r_buf_d) are passed directly to the MPI calls.

3 4

306 307 Outline Message Passing Interface - MPI

Outline
. Short Introduction to MPI
. Unified Virtual Addressing and GPUDirect
. How CUDA-aware MPI works
. Performance Results
. Wrap-up and conclusions

Message Passing Interface - MPI
. Standard to exchange data between processes via messages
— Defines an API to exchange messages: point-to-point (e.g. MPI_Send, MPI_Recv) and collectives (e.g. MPI_Reduce)
. Multiple implementations (open source and commercial)
— Bindings for C/C++, Fortran, Python, …

5 6

308 309

MPI – How to launch a MPI program MPI – A minimal program

mpirun –np 4 ./myapp    (starts four instances of myapp)

    #include <mpi.h>
    int main(int argc, char *argv[])
    {
        int myrank;
        /* Initialize the MPI library */
        MPI_Init(&argc, &argv);
        /* Determine the calling process rank */
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
        /* Call MPI routines like MPI_Send, MPI_Recv, ... */
        /* Shutdown MPI library */
        MPI_Finalize();
        return 0;
    }

7 8

Unified Virtual Addressing
[Figure: without UVA there are multiple memory spaces (host 0x0000–0xFFFF and GPU 0x0000–0xFFFF); with UVA, CPU and GPU memory share a single address space.]
. One address space for all CPU and GPU memory
— Determine the physical memory location from a pointer value
— Enables libraries to simplify their interfaces (e.g. MPI and cudaMemcpy)
. Supported on devices with compute capability 2.0, for 64-bit applications on Linux and on Windows also in TCC mode

9 10

312 313

MPI+CUDA
With UVA and a CUDA-aware MPI, device buffers are passed directly to MPI:

    //MPI rank 0
    MPI_Send(s_buf_d,size,…);

    //MPI rank n-1
    MPI_Recv(r_buf_d,size,…);

With no UVA and a regular MPI, data must be staged through host buffers:

    //MPI rank 0
    cudaMemcpy(s_buf_h,s_buf_d,size,…);
    MPI_Send(s_buf_h,size,…);

    //MPI rank n-1
    MPI_Recv(r_buf_h,size,…);
    cudaMemcpy(r_buf_d,r_buf_h,size,…);

CUDA-aware MPI makes MPI+CUDA easier.

NVIDIA GPUDirect™: Accelerated communication with network & storage devices
[Figure: node with CPU, GPU1, GPU2, system memory, PCI-e chipset and InfiniBand adapter.]

314 315 NVIDIA GPUDirectTM NVIDIA GPUDirectTM Accelerated communication with network & storage devices Peer to Peer Transfers

GPU1 GPU2 GPU1 GPU2 Memory Memory Memory Memory

System System Memory Memory

CPU CPU GPU1 GPU2 GPU1 GPU2

PCI-e Chip IB PCI-e Chip IB set set

13 14

316 317

NVIDIA GPUDirectTM NVIDIA GPUDirectTM Peer to Peer Transfers Support for RDMA

GPU1 GPU2 GPU1 GPU2 Memory Memory Memory Memory

System System Memory Memory

CPU CPU GPU1 GPU2 GPU1 GPU2

PCI-e Chip IB PCI-e Chip IB set set

15 16

318 319 TM NVIDIA GPUDirect CUDA-Aware MPI Support for RDMA

GPU1 GPU2 Memory Memory Example:

System MPI Rank 0 MPI_Send from GPU Buffer Memory MPI Rank 1 MPI_Recv to GPU Buffer

. Show how CUDA+MPI works in principle CPU — Depending on the MPI implementation, message size, system GPU1 GPU2 setup, … situation might be different . Two GPUs in two nodes PCI-e Chip IB set

17 18

320 321

MPI GPU to Remote GPU – GPUDirect support for RDMA
[Figures: the CUDA-aware MPI transfer pipeline (GPU buffer, pinned CUDA buffer, pinned fabric buffer, memcpy and RDMA stages), and the GPUDirect RDMA case where MPI rank 0 sends directly from a GPU buffer over PCI-E DMA to MPI rank 1.]

    MPI_Send(s_buf_d,size,MPI_CHAR,1,tag,MPI_COMM_WORLD);

    MPI_Recv(r_buf_d,size,MPI_CHAR,0,tag,MPI_COMM_WORLD,&stat);

19 20

MPI GPU to Remote GPU with GPUDirect support for RDMA: on the timeline the transfer appears as a single MPI_Sendrecv.

Regular MPI GPU to Remote GPU: the device buffers are staged through the host explicitly.

    cudaMemcpy(s_buf_h,s_buf_d,size,cudaMemcpyDeviceToHost);
    MPI_Send(s_buf_h,size,MPI_CHAR,1,tag,MPI_COMM_WORLD);

    MPI_Recv(r_buf_h,size,MPI_CHAR,0,tag,MPI_COMM_WORLD,&stat);
    cudaMemcpy(r_buf_d,r_buf_h,size,cudaMemcpyHostToDevice);

324 325

Regular MPI GPU to Remote GPU – timeline: memcpy D->H, MPI_Sendrecv, memcpy H->D in sequence.

MPI GPU to Remote GPU without GPUDirect
[Figure: MPI rank 0 and MPI rank 1, GPU and host timelines.]

    MPI_Send(s_buf_h,size,MPI_CHAR,1,tag,MPI_COMM_WORLD);

    MPI_Recv(r_buf_h,size,MPI_CHAR,0,tag,MPI_COMM_WORLD,&stat);

23 24

MPI GPU to Remote GPU without GPUDirect – timeline: the MPI_Sendrecv itself dominates.

Performance Results, two nodes
[Figure: bandwidth in GB/s versus message size for MVAPICH2-1.9b, comparing D-to-D with host memory staging, D-to-D with GPUDirect, and MPI H-to-H transfers.]
Latency (1 byte): 19.00 µs / 18.34 µs / 1.11 µs

328 329

Example: Jacobi
. Solves the 2D-Poisson equation on a rectangle: Δu(x,y) = 0 ∀ (x,y) ∈ Ω\δΩ, with Dirichlet boundary conditions u(x,y) = f(x,y) on δΩ
. 2D domain decomposition into n x k domains, with ranks (0,0), (0,1), … (0,n-1) down to (k-1,0), (k-1,1), … (k-1,n-1)

While not converged:
. Do the Jacobi step:
    for (int i=1; i < n-1; i++)
      for (int j=1; j < m-1; j++)
        u_new[i][j] = 0.0f - 0.25f*(u[i-1][j] + u[i+1][j]
                                    + u[i][j-1] + u[i][j+1]);
. Exchange halo with 2 – 4 neighbours
. Swap u_new and u
. Next iteration

27 28
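The halo exchange itself can be handed device pointers directly when the MPI is CUDA-aware. A sketch for the top/bottom direction (row-major storage with m columns and localN interior rows per rank; top and bottom are the neighbour ranks, or MPI_PROC_NULL at the domain edge – all names here are illustrative, not the course's reference code):

    /* u_d points to a device array of (localN+2) x m doubles: rows 1..localN
       are the interior, rows 0 and localN+1 are the halos. */
    MPI_Sendrecv(u_d + 1*m,          m, MPI_DOUBLE, top,    0,  /* send first interior row */
                 u_d + (localN+1)*m, m, MPI_DOUBLE, bottom, 0,  /* recv into bottom halo   */
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(u_d + localN*m,     m, MPI_DOUBLE, bottom, 1,  /* send last interior row  */
                 u_d,                m, MPI_DOUBLE, top,    1,  /* recv into top halo      */
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

With a regular MPI the same exchange would additionally need cudaMemcpy staging of the boundary rows.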

330 331 Jacobi Results (1000 steps) Jacobi Results (1000 steps) weak scaling 4k x 4k per process weak scaling 4k x 4k per process

9 8.00 Communication with top/bottom neighbor 8 7.80 needed 2x1 topology 7 Communication with 7.60 top/bottom and left right 6 7.40 neighbor needed 2x2 topology 5 7.20 4 regular MPI regular MPI 7.00 3 CUDA-aware MPI CUDA-aware MPI 2 6.80 1 6.60

Performance per GPU GLU/s GPU per in Performance 0 GLU/s GPU per in Performance 6.40 1 2 4 8 1 2 4 8 #Processes #Processes

29 30

332 333

LBM D2Q37
. Lattice Boltzmann Method (LBM), D2Q37 model
. Application developed at U Rome Tor Vergata/INFN, U Ferrara/INFN, TU Eindhoven
. Reproduces the dynamics of a fluid by simulating virtual particles which collide and propagate, e.g. to solve the Rayleigh-Taylor instability
. Simulation of large problems requires double precision and many GPUs
. Implementation and benchmarks: F. Schifano (U Ferrara)

Results: strong scaling on 8192x1024 cells
[Figure: MLups versus number of processes (2, 4, 8, 16) for MVAPICH2-1.9b with host memory staging and for CUDA-aware MVAPICH2-1.9b; roughly a 1.25x speed-up.]

334 335 CUDA-Aware MPI Implementations CUDA-Aware Caveats Integrated Support for GPU Computing . MVAPICH2 1.8/1.9b . cudaSetDevice needs to be called before MPI_Init — http://mvapich.cse.ohio-state.edu/overview/mvapich2/ . MPI Environment vars. can be used to set GPU affinity . OpenMPI 1.7 (beta) — MVAPICH2: MV2_COMM_WORLD_LOCAL_RANK — http://www.open-mpi.org/ — OpenMPI: OMPI_COMM_WORLD_LOCAL_RANK . CRAY MPI (MPT 5.6.2) . MV2_USE_CUDA needs to be set for MVAPICH . IBM Platform MPI (8.3) . MPICH_RDMA_ENABLED_CUDA for MPT on Cray . PMPI_GPU_AWARE for Platform MPI . Lib needs to be build with CUDA-awarenes enabled

33 34

336 337

CUDA-Aware MPI + OpenACC Profiling MPI+CUDA applications

. To use CUDA-aware MPI with OpenACC, pass the device address of the buffer to MPI:

    #pragma acc host_data use_device(s_buf)
    MPI_Send(s_buf,size,MPI_CHAR,1,tag,MPI_COMM_WORLD);

. To use a regular MPI with OpenACC, update the host copy first:

    #pragma acc update host(s_buf[0:size])
    MPI_Send(s_buf,size,MPI_CHAR,1,tag,MPI_COMM_WORLD);

Profiling MPI+CUDA applications
. Use nvprof:
    mpirun –np n nvprof --output-profile out.$MV2_COMM_WORLD_RANK ./app
  see docs.nvidia.com for details
. Use CUDA-aware tracing libraries like Score-P or VampirTrace and tools like Vampir

35 36

338 339 Conclusions

Conclusions
. Use CUDA-aware MPI when possible
. Depending on the CUDA version, hardware setup, …, a CUDA-aware MPI gives you
— Ease of programming
— Pipelined data transfer, which automatically provides optimizations when available
. Overlap of CUDA copy and RDMA transfer
— Utilization of the best GPUDirect technology available

Thank you
http://developer.nvidia.com/content/introduction-cuda-aware-mpi

. Examples are available for download at github: https://github.com/parallel-forall/code-samples/tree/master/posts/cuda-aware-mpi-example 37 38

340 341 1 : Introduction to OpenACC

Sami Saarinen Material on “CSC cluster introduction”, (C) 2014 by CSC – IT Center for Science Ltd. CSC – IT Center for Science Ltd Material from CSC is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Espoo, Finland Unported License, http://creativecommons.org/licenses/by-nc-sa/3.0/ 342 343

Three lectures on OpenACC What is OpenACC ?

– Introduction to OpenACC: general introduction with simple examples
– Tuning OpenACC programs: useful tips for better performance, code profiling
– Advanced OpenACC topics: interfacing with CUDA codes and libraries, hybrid host CPU + GPU programming with OpenACC, OpenACC with MPI and OpenMP

The OpenACC API defines a set of compiler directives that allow parallel loops and code regions to be offloaded from the host CPU to the attached GPU(s)
– The directives are very similar to the OpenMP directives
The OpenACC directives allow creation of high-level host CPU + GPU programs without the explicit need to manage data transfers between host CPU and GPU – or to write CUDA code
The API supports both C/C++ and Fortran bindings
– Compilers from PGI, Cray and CAPS – and soon from GNU (4.9?)
More about the OpenACC standard: http://www.openacc.org

344 345 OpenACC directive syntax in a nutshell Execution model in OpenACC

C/C++:
    #pragma acc directive [clause [,] clause] ...]
    { /* A structured C-code block to be executed on a GPU */ }

Fortran:
    !$acc directive [clause [,] clause] ...]
    ... Fortran code section to be executed on a GPU ...
    !$acc end directive

Important directives: parallel, kernels, data, loop, update, host_data, wait, ...
Often used clauses: if (condition), async(handle), ...

Execution model: the application program runs on the host CPU(s); code sections marked with, e.g., !$acc parallel ... !$acc end parallel are offloaded to the GPU.
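A small C sketch of the syntax above (the array names, sizes and the threshold are illustrative; the if and async clauses are the ones listed as often used, and the compiler handles the data transfers implicitly here):

    /* Offload the loop only for large vectors, and launch it asynchronously
       on queue 1 so the host can do other work before waiting. */
    #pragma acc parallel loop if (n > 10000) async(1)
    for (int j = 0; j < n; j++)
        y[j] = y[j] + a * x[j];

    /* ... other host work ... */
    #pragma acc wait(1)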

346 347

Just before getting serious … some info Systems used during the lectures Host Ivy Brigde (alias IvB) nodes Reference systems and compilers used – Intel® Xeon® CPU E5-2640 0 @ 2.50GHz, 15360KB L2-cache, memory 32GB/node – In these lectures we compare GPU performance against – 2 sockets/node x 6-cores/socket, 2-way hyperthreading (HT) enabled multi-threaded host CPUs (Xeon) and Intel XeonPhi (MIC) GPU NVIDIA Tesla Kepler K20m – currently one per non-MIC IvB node – – We use PGI compilers except for some comparisons Intel Architecture sm_35 @ 705MHz, memory 5GB, with UVA, ECC on, 1280KB L2-cache Intel KNC (alias MIC) cards – currently one card per non-GPU IvB node pgaccelinfo – Intel® XeonPhi® 5110 @ 1.053GHz, 512KB L2-cache, memory 8GB/card – Provides a nice hardware information on GPU used – 60-cores, 4-way HT Compilers nvidia-smi – PGI 13.10 with OpenACC – Displays f.ex. whether our GPU is in a sane state or not . CUDA-aware MPI (MVAPICH2-2.0b) – Intel compilers v14.0.1 built on 20131008 Showing timer functions that were used BLAS libraries – We are interested in wall-clock timings only – Intel MKL 11.1.1.106 (also used with PGI except when doing MPI-runs) – NVIDIA cuBLAS 5.0 348 349 pgaccelinfo

CUDA Driver Version: 6000 NVRM version: NVIDIA UNIX nvidia-smi x86_64 Kernel Module 331.20 Device Number: 0 Execution Timeout Device Name: Tesla K20m Device Revision Number: 3.5 +------+ Global Memory Size: 5032706048 | NVIDIA-SMI 331.20 Driver Version: 331.20 | Number of Multiprocessors: 13 Execution Timeout: No |------+------+------+ Number of SP Cores: 2496 Integrated Device: No Number of DP Cores: 832 Can Map Host Memory: Yes | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | Concurrent Copy and Execution: Yes Compute Mode: exclusive-process | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | Total Constant Memory: 65536 Concurrent Kernels: Yes |======+======+======| Total Shared Memory per Block: 49152 ECC Enabled: Yes Registers per Block: 65536 Memory Clock Rate: 2600 MHz | 0 Tesla K20m On | 0000:01:00.0 Off | 0 | Warp Size: 32 Memory Bus Width: 320 bits | N/A 21C P0 22W / 225W | 13MiB / 4799MiB | 3% E. Process | Maximum Threads per Block: 1024 L2 Cache Size: 1310720 bytes +------+------+------+ Maximum Block Dimensions: 1024, 1024, 64 Max Threads Per SMP: 2048 Maximum Grid Dimensions: 2147483647 x Async Engines: 2 65535 x 65535 Unified Addressing: Yes +------+ Maximum Memory Pitch: 2147483647B Initialization time: 35756 microseconds | Compute processes: GPU Memory | Texture Alignment: 512B Current free memory: 4951715840 Clock Rate: 705 MHz Upload time (4MB): 2467 microseconds ( 935 ms pinned) | GPU PID Process name Usage | Download time: 2639 microseconds (1043 ms pinned) |======| Upload bandwidth: 1700 MB/sec (4485 MB/sec pinned) | No running compute processes found | Download bandwidth: 1589 MB/sec (4021 MB/sec pinned) PGI Compiler Option: -ta=nvidia,cc35 +------+

350 351

Recommended portable wall-clock timers DAXPY with OpenMP : y[j] = y[j] + a * x[j] – alternatives to omp_get_wtime() on host CPUs

C/C++ & Fortran callable Fortran-only In CUDA A simple vector-kernel with OpenMP for the host CPU

#include FUNCTION ftimer() #include void daxpy(int n, double a, SUBROUTINE daxpy(n, a, x, y)

#include const double *restrict x, INTEGER :: n, j implicit none cudaEvent_t start, stop; double ctimer_() { real(kind=8) :: ftimer cudaEventCreate(&start); double *restrict y) REAL(kind=8) :: a, x(n), y(n) cudaEventCreate(&stop); struct timeval tbuf; integer :: t, rate { !$omp parallel do cudaEventRecord(start, 0); gettimeofday(&tbuf,NULL); CALL SYSTEM_CLOCK(t,count_rate=rate) #pragma omp parallel for DO j = 1,n

kernel<<>>(args); for (int j=0; j

352 353 DAXPY performance (GFlops/s) : Vector length N=128M DAXPY with OpenACC : y[j] = y[j] + a * x[j] Tesla K20m (CUDA) A simple vector-kernel with OpenACC for the GPU Tesla K20m (OpenACC) void daxpy(int n, double a, SUBROUTINE daxpy(n, a, x, y) const double *restrict x, INTEGER :: n, j double *restrict y) REAL(kind=8) :: a, x(n), y(n) MIC omp 120 { !$acc parallel loop Bigger the #pragma acc parallel loop DO j = 1,n IvB omp 6 better ! for (int j=0; j

354 355

DAXPY with CUDA : y[j] = y[j] + a * x[j] Defining OpenACC parallel regions

    // daxpy_cuda.cu
    __global__ void daxpy(int n, double a, const double *x, double *y)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = blockDim.x * gridDim.x;
        while (tid < n) {
            y[tid] += a * x[tid];
            tid += stride;
        }
    }

    int main() {
        ...
        int n = 1<<27; // 128M
        int vlen = 256;
        dim3 blockdim = dim3(vlen,1,1);
        // dim3 griddim = dim3(65536,1,1);
        dim3 griddim = dim3((n+blockdim.x-1)/blockdim.x,1,1);
        ...
        daxpy <<<griddim, blockdim>>> (n, a, d_x, d_y);

Defining OpenACC parallel regions:
Two approaches to define parallel regions for GPU code:
– PARALLEL and KERNELS
Both are equally valid and can perform equally well
PARALLEL requires careful analysis by the programmer to make sure the expressed parallelism is safe
– Often preferred when translating from OpenMP to OpenACC
KERNELS relies on the compiler's parallel analysis and produces a safely parallelized region
– Even large code blocks can be covered under this directive

356 357 PARALLEL vs. KERNELS (1) PARALLEL vs. KERNELS (2) w/o ”acc loop” in PARALLEL the loop would run redundantly on every thread ACC PARALLEL merged with LOOP ; ACC KERNELS does not require LOOP

!$acc parallel !$acc kernels !$acc parallel loop !$acc kernels loop ! “loop” optional !$acc loop ! Required !$acc loop ! Optional do j=1,n do j=1,n do j=1,n do j=1,n a(j) = b(j) * c(j) + d(j) a(j) = b(j) * c(j) + d(j) a(j) = b(j) * c(j) + d(j) a(j) = b(j) * c(j) + d(j) enddo enddo enddo enddo !$acc end parallel ! Optional here !$acc end kernels ! Always prefer !$acc end parallel !$acc end kernels

#pragma acc parallel #pragma acc kernels #pragma acc parallel loop #pragma acc kernels // loop optional

{ { for (int j=0; j

358 359

Traps & pitfalls with PARALLEL (1) : OpenMP vs. OpenACC Traps & pitfalls with PARALLEL (2) : A correction LOOPs within PARALLEL executed independently aka “NOWAIT” Add LOOP VECTOR, or use separate PARALLEL or use KERNELS

// Normalization : OpenMP // OpenACC PARALLEL (wrong) // Behaves like OpenMP NOWAIT // OpenACC PARALLEL (ok) // Separate PARALLEL loops // or use OpenACC KERNELS sum = 0; sum = 0; sum = 0; sum = 0; sum = 0; sum = 0; #pragma omp parallel #pragma acc parallel #pragma omp parallel #pragma acc parallel #pragma acc parallel loop #pragma acc kernels { { { { for (j=0; j

} }

360 361 OpenACC PARALLEL –directive : summary OpenACC KERNELS –directive : summary

First of the two methods to define parallel region An alternative definition of parallel OpenACC region – May contain one or more parallel loops – As with PARALLEL may also contain one or more loops The programmer is responsible for making sure there are KERNELS will be safely parallelized since compiler, not no dependencies in the consecutive loops programmer is held responsible – Enables highly parallel execution w/o checks – Compilers do make a parallel dependency analysis for a Thus there is no implicit barrier between loops KERNELS-region to decide what is safe – unlike PARALLEL – Consecutive loops are allowed to start while previous ones Generated kernels (e.g. separate loops) could still be run are still in progress : strong likelihood for incorrect results independently if dependencies are not preventing this, e.g. . Behaves a bit like OpenMP NOWAIT would do #pragma acc kernels loop independent – Using vector-loops (where applicable) seems(!) to imply sync

362 363

What happens to sequential code in OpenACC ? Three levels of parallelism in OpenACC

What happens to sequential code in OpenACC?
– Usually safer to create it via a KERNELS region, if applicable (e.g., a small sequential loop wrapped in #pragma acc kernels)
– Single statements are handled as loops with a vector length of one – both in KERNELS and PARALLEL
– One could also use "acc loop seq" to enforce sequential execution

Three levels of parallelism in OpenACC
gang – equivalent to a CUDA thread block; the highest level of parallelism, where "gangs" work independently of each other without synchronization
worker – alias CUDA warp
vector – the SIMT lanes within a worker, i.e. the CUDA threads of a warp (the finest level)

Three levels of parallelism in OpenACC (cont'd)
We will cover in the exercises how to suggest different parallelisms and chunk sizes for PARALLEL and KERNELS. Briefly, with a PARALLEL construct this is done with clauses such as num_gangs(100) num_workers(16) vector_length(64); with a KERNELS construct, with loop gang / loop worker / loop vector clauses on the individual loops (see the sketch below).
This is especially suitable for 3D finite-difference stencils, or heavy triple loops with truly independent operations. The OpenACC compiler usually decides (also for PARALLEL regions) how to split the work across these levels; you can provide suggestions to the compiler for the mapping it should use.
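A cleaned-up sketch of mapping a triple loop onto the three levels (the array, the loop bounds and process() are illustrative; without the clauses the compiler picks its own mapping):

    #pragma acc parallel num_gangs(100) num_workers(16) vector_length(64)
    {
      #pragma acc loop gang
      for (int i = 0; i < nx; i++)
        #pragma acc loop worker
        for (int j = 0; j < ny; j++)
          #pragma acc loop vector
          for (int k = 0; k < nz; k++)
            arr3d[i][j][k] = process(i, j, k);   // hypothetical independent work
    }

With KERNELS the same hints would be given as #pragma acc kernels plus loop gang, loop worker and loop vector on the individual loops.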

2D-Laplace solver with Jacobi 5-point stencil update The driver loop for both OpenMP and OpenACC

2 Laplace equation ∇ u = 0 with fixed boundary conditions allocate (u(0:nx+1,0:ny+1), unew(0:nx+1,0:ny+1)) Discretized in a square rectangle using 5-point stencil call init(u) ; call init(unew) – Test case uses 2400 x 2400 grid for interior points norm = eps + 1 Solver uses slow Jacobi iteration because it’s iter = 0 – Easy to parallelize – and explain do while (iter <= maxiter .and. norm >= eps) Analogical cases found in many finite difference schemes call update (unew, u) therefore this technique is useful to understand call update (u, unew, norm) A good starting point is an OpenMP implementation iter = iter + 2 enddo

368 369 Stencil update loop with OpenMP Stencil update loop with OpenACC

norm = 0 norm = 0 !$omp parallel do reduction(max:norm) private(i,j) !$acc parallel loop reduction(max:norm) private(i,j) do j=1,ny do j=1,ny do i=1,nx do i=1,nx U_new(i,j) = 0.25*( & U_new(i,j) = 0.25*( & & U_old(i-1,j) + U_old(i+1,j) + & & U_old(i-1,j) + U_old(i+1,j) + & & U_old(i,j-1) + U_old(i,j+1)) & U_old(i,j-1) + U_old(i,j+1)) norm = max(norm,abs(U_new(i,j) – U_old(i,j))) norm = max(norm,abs(U_new(i,j) – U_old(i,j))) enddo enddo enddo enddo

370 371

Version#1 of Jacobi 5-point stencil update : N=2400 Performance of Jacobi update : poor GPU results

Bigger the 4500 better ! Tesla K20m 4000 3500

IvB serial

3000 ]

/s 2500 IvB omp 12

2000 MLups Smaller the [ Performance 1500 Poor GPU better ! MIC omp 240 performance 1000 here ?! 500 0 5 10 15 20 25 30 0 MIC omp 240 IvB omp 12 IvB serial Tesla K20m Tesla K20m MIC omp 240 IvB omp 12 IvB serial Time(s) 0,961 2,1 10,6 28,1 MLups/s 148 4326 1976 393 372 373 pgireport pgfortran -O4 -fast -Mvect -Minline -Minfo=all -acc file.F90

So what’s going on then ? 91: !$acc parallel loop reduction(max:norm) private(i,j)

VVVV-- [update] : Accelerator kernel generated Compile with pgireport in front of PGI-compiler command VVVV-- [update] : Generating present_or_copyin(old(:nx+1,:ny+1)) VVVV-- [update] : Generating present_or_copyout(new(1:nx,1:ny)) – Developed at CSC – produces annotated listing file .lst VVVV-- [update] : Generating NVIDIA code ... To quickly reveal performance bottlenecks, use nvprof –profiler provided by NVIDIA for GPU regions 92: do j=1,ny

– Easy to use : nvprof /my/path/stencil.x 2400 2400 VVVV-- [update] : !$acc loop gang ! blockidx%x VVVV-- [update] : Loop not fused: no successor loop – By default ( -s ) produces very useful summary output 93: do i=1,nx – Option --print-gpu-trace provides timeline breakdown VVVV-- [update] : !$acc loop vector(256) ! threadidx%x – A file for NVIDIA visual profiler (nvvp) with option -o file VVVV-- [update] : Generated 4 alternate versions of the loop VVVV-- [update] : Generated vector sse code for the loop Additional information with PGI’s environment variables: VVVV-- [update] : Generated 3 prefetch instructions for the loop export PGI_ACC_NOTIFY=2 # verbosity on each data transfer 94: new(i,j) = factor*(old(i-1,j) + old(i+1,j) + old(i,j-1) + old(i,j+1)) export PGI_ACC_TIME=1 # timing summary for each kernel 95: norm = max(norm,abs(new(i,j) - old(i,j)))

VVVV-- [update] : Max reduction generated for norm

374 375

nvprof performance profile output (version#1) PGI_ACC_TIME=1 (version#1) Data xfers > 10s in total Time > 11 sec Time(%) Time Calls Avg Min Max Name /share/sbs/research/perfstat/openacc/testing/stencil/openacc_basic/PGIACC_stencil.F90 update NVIDIA devicenum=0 time(us): 11,350,717 48.52 5.38s 2533 2.13ms 928ns 2.73ms [CUDA memcpy HtoD] 91: compute region reached 361 times Source file 45.92 5.09s 2539 2.01ms 2.53us 2.60ms [CUDA memcpy DtoH] 91: data copyin reached 1083 times device time(us): total=2,754,043 max=2,827 min=2,013 avg=2,542 91: kernel launched 361 times 2.95 326.90ms 361 905.54us 901.64us 910.15us update_91_gpu grid: [2400] block: [256] Data in : 2,7s 2.58 286.36ms 361 793.25us 784.87us 800.26us update_99_gpu device time(us): total=341,368 max=1,043 min=941 avg=945 0.03 3.26ms 361 9.02us 8.83us 9.25us update_95_gpu_red elapsed time(us): total=346,142 max=1,055 min=953 avg=958 91: reduction kernel launched 361 times Line grid: [1] block: [256] Data out : 2,6s 0.01 686.69us 2 343.35us 343.11us 343.58us init_70_gpu device time(us): total=8,685 max=79 min=22 avg=24 numbers 0.00 8.96us 2 4.48us 4.45us 4.51us init_78_gpu elapsed time(us): total=13,088 max=91 min=33 avg=36 0.00 5.31us 2 2.66us 2.59us 2.72us init_77_gpu 98: data copyout reached 1083 times device time(us): total=2,595,456 max=2,822 min=1,931 avg=2,396

Actual computation ~ 0,6s

376 377 Important directives & clauses touched in this lecture

OpenACC – kernels and parallel But we can do better – much, much better !! – loop sometimes with gang, worker, vector, seq – kernels loop with independent … to be continued … – reduction([+|max]:var) –clause OpenMP – parallel – for/DO sometimes with reduction([+|max]:var) and nowait

378 379

Summary

OpenACC provides a portable way to harness computing power of GPU accelerators without need to learn CUDA Simple and directive based syntax resembles OpenMP and enables very rapid deployment of GPUs Try to identify the main bottlenecks of your code on the host CPUs and translate the program to use OpenACC gradually i.e. not everything has to be changed instantly As with any parallel programming model, there are good practices in place for OpenACC in order to obtain better performance – some of them covered in the next lectures

380 2 : Tuning OpenACC programs

Sami Saarinen Material on “CSC cluster introduction”, (C) 2014 by CSC – IT Center for Science Ltd. CSC – IT Center for Science Ltd Material from CSC is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Espoo, Finland Unported License, http://creativecommons.org/licenses/by-nc-sa/3.0/ 381 382

Three lectures on OpenACC

Introduction to OpenACC – General introduction with simple examples Quote from the previous lecture : Tuning OpenACC programs – Useful tips for better performance “But we can do better – much, much better !!” – Code profiling Advanced OpenACC topics – Interfacing with CUDA codes and libraries – Hybrid host CPU + GPU programming with OpenACC – OpenACC with MPI and OpenMP

383 384 PGI_ACC_TIME=1 (version#2) nvprof performance profile output (version#2) Time ~ 0,6 sec

/share/sbs/research/perfstat/openacc/testing/stencil/openacc_data/PGIACC_stencil.F90 Time(%) Time Calls Avg Min Max Name update NVIDIA devicenum=0 time(us): 632,050 52.83 326.86ms 361 905.42us 901.76us 908.74us update_95_gpu No data transfers !! 95: compute region reached 361 times 46.35 286.72ms 361 794.25us 786.60us 801.06us update_103_gpu 95: kernel launched 361 times 0.53 3.26ms 361 9.02us 8.80us 9.28us update_99_gpu_red grid: [2400] block: [256] device time(us): total=332,326 max=990 min=916 avg=920 0.13 782.98us 361 2.17us 2.11us 2.62us [CUDA memcpy DtoH] elapsed time(us): total=336,283 max=1,001 min=926 avg=931 0.11 688.10us 2 344.05us 342.72us 345.38us init_74_gpu 95: reduction kernel launched 361 times 0.06 346.34us 361 959ns 928ns 1.70us [CUDA memcpy HtoD] grid: [1] block: [256] device time(us): total=8,083 max=70 min=21 avg=22 0.00 5.60us 2 2.80us 2.72us 2.88us init_82_gpu elapsed time(us): total=12,078 max=81 min=31 avg=33 0.00 4.90us 2 2.45us 2.27us 2.62us init_81_gpu

Computation still ~ 0,6s Data transfer cost now ~ 1ms

385 386

Better results achieved through DATA –directive Driving loop for GPU-only data with OpenACC

Probably the single most important directive in OpenACC allocate (u(0:nx+1,0:ny+1), unew(0:nx+1,0:ny+1)) to guarantee GPU program performance Controls data creation on GPUs and can lead to radically call init(u) ; call init(unew) reduced data transfers between GPU and host norm = eps + 1 – The aim is to keep data on GPUs as long as possible iter = 0 do while (iter <= maxiter .and. norm >= eps) float A[100], B[10], C[20], D[3][3]; REAL A(100), B(10), C(20), D(3,3) call update (unew, u) #pragma acc data create(A) copy(B[3:7]) \ !$acc data create(A) copy(B(4:10)) & copyin(C[0:20]) copyout(D) call update (u, unew, norm) !$acc& copyout(C(1:20)) copyout(D) { iter = iter + 2 #pragma acc kernels ... !$acc kernels ... !$acc end kernels #pragma acc parallel ... !$acc parallel ... !$acc end parallel enddo #pragma acc data ... // nesting ok !$acc data ... !$acc end data } !$acc end data 387 388 Stencil update loop with OpenACC : pgireport pgfortran -O4 -fast -Mvect -Minline -Minfo=all -acc file.F90

95: !$acc parallel loop reduction(max:norm) private(i,j) present(new,old) Data already present on GPU VVVV-- [update] : Generating present(old(:,:)) VVVV-- [update] : Generating present(new(:,:)) norm = 0 VVVV-- [update] : Accelerator kernel generated VVVV-- [update] : Generating NVIDIA code !$acc parallel loop reduction(max:norm) private(i,j) & ... 96: do j=1,ny !$acc& present(U_new, U_old) do j=1,ny VVVV-- [update] : !$acc loop gang ! blockidx%x VVVV-- [update] : Loop not fused: no successor loop do i=1,nx 97: do i=1,nx U_new(i,j) = 0.25*( & & U_old(i-1,j) + U_old(i+1,j) + & VVVV-- [update] : !$acc loop vector(256) ! threadidx%x VVVV-- [update] : Generated 4 alternate versions of the loop & U_old(i,j-1) + U_old(i,j+1)) VVVV-- [update] : Generated vector sse code for the loop VVVV-- [update] : Generated 3 prefetch instructions for the loop norm = max(norm,abs(U_new(i,j) – U_old(i,j))) enddo 98: new(i,j) = factor*(old(i-1,j) + old(i+1,j) + old(i,j-1) + old(i,j+1)) 99: norm = max(norm,abs(new(i,j) - old(i,j))) enddo ...

389 390

Version#2 of Jacobi 5-point stencil update : N=2400 Performance of Jacobi update – with !$acc data

Smaller the Bigger the 7000 better ! better ! Tesla K20m 6000

5000

IvB serial ]

/s 4000

IvB omp 12 3000

MLups [ Performance 2000 MIC omp 240 1000

0 5 10 15 20 25 30 0 Tesla MIC omp IvB omp 12 IvB serial MIC omp 240 IvB omp 12 IvB serial Tesla K20m K20m 240 Time(s) V1 0,961 2,1 10,6 28,1 MLups/s 297 4327 1976 393 Time(s) V2 0,622 With $acc data 6280 391 392 Cornerstones for accelerator performance But what happens without DATA –directive ? the DATA –directive plays a major role here Every time upon entering either PARALLEL or KERNELS Always give GPUs lots of computational work to do regions input data gets copied into GPU from host CPU, – Independent work in triple loops or 3D-stencils etc. and output data returned back when leaving the regions Pay close attention to data transfers between host and This creates a lot of unnecessary traffic between GPU GPU and usage of OpenACC data on devices (host/GPU) and host CPU – Prefer to create data directly on the GPU and keep it there DATA –directive creates a scope within which data can be Focus on data re-use whilst it resides on the GPU to avoid declared to stay on GPU, potentially never touching CPU memory bandwidth bottlenecks due to transfers back It is possible to control data movement between host and forth between host and GPUs and GPU, or even have nested DATA –directives (scopes)

393 394

Explicit control of data movement Explicit control of data movement (cont’d)

You can add data clauses to PARALLEL and KERNELS create – data array gets allocated on GPU device without directives for providing finer and more explicit control of copying data between host and device what needs to be transferred between host and GPU present – assumes data array is already present on GPU For arrays (vectors, matrices, multidimensional data), you copy – allocates data array on GPU and upon entering can provide arrays with a data range, which currently has the PARALLEL/KERNELS region copies it from host to GPU to be contiguous and back to host when leaving the region copyin – allocates data array on GPU and only copies it In Fortran the range is given using Fortran array notation from host to GPU upon start of the parallel region i.e. ARRAY(start_index:end_index) copyout – allocates data array on GPU and copies it back In C/C++ the range is given by ARRAY[start_index:length] to host when leaving the parallel region

395 396 Explicit control of data movement (cont’d) Examples of controlling data movement

int a[100], *b, d[3][3]; In C/C++ INTEGER A(0:99), D(3,3) pcopy (or present_and_copy) – same as present and start : b = (int *)malloc(sizeof(*b)*10) INTEGER, ALLOCATABLE :: B(:) In Fortran length start : end copy together, but checks presence of an array in GPU #pragma acc data create(a) ALLOCATE(B(0:9)) { !$acc data create(A) before attempting to copy data to GPU when entering #pragma acc kernels copyin(b[0:1]) !$acc kernels copyin(B(0:0)) the parallel region. Data gets automatically copied back { a[0] = b[0]; } A(0) = B(0) #pragma acc parallel copyout(d) !$acc end kernels to host upon leaving the region { !$acc parallel copyout(D) pcopyin (or present_and_copyin) – similar as in the #pragma acc loop collapse(2) !$acc loop collapse(2) for (int i=0; i<3; ++i) DO J=1,3; DO I=1,3; D(I,J) = (I-1)*3 + J previous merges features of present and copyin for (int j=0; j<3; ++j) !$acc end parallel d[i][j] = i*3 + j + 1; !$acc data copy(B) pcopyout (or present_and_copyout) – merges features } !$acc kernels of present and copyout #pragma acc data copy(b) B(:) = B(:) + 1 #pragma acc kernels !$acc end kernels { for (int i; i<10; ++i) b[i] += 1; } !$acc end data } !$acc end data 397 398

An example of creating snapshots Using UPDATE –directive !$acc data create(u, unew) In addition to explicit data clauses in PARALLEL and do while (iter <= maxiter .and. norm >= eps) KERNELS -directives, you can also UPDATE HOST and/or if (mod(iter,100) == 0) then ! Snapshot GPU device arrays whilst in parallel or DATA region !$acc update host(u) This comes handy if you have to produce periodical call visualize(u) ! Runs on HOST snapshots of data whilst in highly parallel GPU endif call update(unew, u) processing: e.g. pass data to host for visualization call update(u, unew, norm) – Often done conditionally (under IF -clause) and iter = iter + 2 asynchronously (ASYNC -clause, later followed by WAIT) enddo Reserve true as well a device array can also receive an !$acc end data UPDATE during simulation (via UPDATE DEVICE) call visualize(u) ! The final, converged field

399 400 In a more asynchronous manner h = 1 ! A handle for async operations, >= 0 !$acc data create(u, unew) do while (iter <= maxiter .and. norm >= eps) !$acc update host(u) async(h) if (mod(iter,100)==0) call update (unew, u) if (mod(iter,100)==0) then Start asynchronous A more serious tuning example !$acc wait(h) transfer to host call visualize(u) endif Wait for async call update (u, unew, norm) operation to iter = iter + 2 complete and enddo visualize again !$acc end data call visualize(u)

401 402

QR-decomposition by Modified Gram-Schmidt (MGS) The basic QR-decomp algorithm with MGS

Inspiration from Ron Farber’s DrDobbs article in 2012 A pseudo code for A  Q * R with MGS, approx. 2mn2 flp. opers for k = 1 to n Starting off from an OpenMP version on the host R(k,k) = tmp = NORM2 ( A(1:m,k) ) Q(1:m,k) = A(1:m,k) / tmp Mapping the OpenMP directives to OpenACC for j = k+1 to n Simple GPU profiling to identify unexpected bottlenecks R(k,j) = DOT_PRODUCT( Q(1:m,k), A(1:m,j) ) A(1:m,j) = A(1:m,j) – Q(1:m,k) * R(k,j) Transposed arrays also beneficial on host CPU side end # for j = k+1 to n end # for k = 1 to n Performance with OpenACC – n/a in OpenMP Computational kernels involve Also making MatMul (MM) much faster with OpenACC – Creating a near singular (ill-conditioned) A-matrix . Not time critical, but probes MGS-algorithm’s accuracy brilliantly The final profile and performance results – MGS sweep for creating Q and R from A, and directly on GPU device – Run MatMul to check that Q * R is close enough to the original A-matrix

403 404 A  Q * R with MGS : OpenMP & OpenACC Loop Difference between row and column –major orders indexing (version#1) “wrong way 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 #pragma omp parallel reduction(+:tmp) #pragma acc kernels present(wrk,Q,R) around” – for (int k = 0; k

405 406

Version#1 of Modified Gram-Schmidt : N=2400

[Figure: run-time bar chart (smaller the better) comparing IvB serial, IvB omp 12 and MIC omp 240.]

MM : QR → Q * R : must be close enough to A (version#1)

// -- OpenMP --
double Q[nrows][nmin], R[nmin][ncols];
double QR[nrows][ncols];
double sum;
#pragma omp parallel for \
        reduction(+:sum)
for (int i = 0; i < ...

// -- OpenACC --
// The same declarations as in OpenMP
#pragma acc data present(QR,Q,R)
{
  #pragma acc kernels
  for (int i = 0; i < ...

407 408 nvprof profile output – version#1 Lessons learned from version#1

Time(%)      Time  Calls       Avg       Min       Max  Name
  58.42   171.03s      1   171.03s   171.03s   171.03s  MGS_247_gpu
  41.58   121.72s      1   121.72s   121.72s   121.72s  MatMul_303_gpu
   0.00    8.59ms      1    8.59ms    8.59ms    8.59ms  Check_282_gpu
   0.00  930.79us      1  930.79us  930.79us  930.79us  InitMat_201_gpu
   0.00  660.58us      1  660.58us  660.58us  660.58us  MGS_232_gpu
   0.00    8.96us      1    8.96us    8.96us    8.96us  Check_282_gpu_red
   0.00    2.53us      1    2.53us    2.53us    2.53us  [CUDA memcpy DtoH]
   0.00    1.66us      1    1.66us    1.66us    1.66us  [CUDA memcpy HtoD]

The GPU version’s performance is far below what was expected, and it was beaten even by a single host CPU core
– Both MGS and MM fail to deliver performance
And yet we do have the DATA directives in place …
Looking solely into the OpenACC version, and perhaps nvvp (NVIDIA Visual Profiler) charts, it can be figured out that
– in MGS the work ("wrk") and Q arrays need to be transposed
– more parallelism is needed in MGS – no matter what
– MM needs revisiting (but no cuBLAS/DGEMM in this lecture!)

409 410

A → Q * R with MGS : Optimized OpenACC version#2

double wrkT[ncols][nrows], Qt[nmin][nrows]; // Transposes
#pragma acc data create(wrkT,Qt) present(A,Q,R)
{
  ...
  R[k][k] = tmp = sqrt(tmp);
  for (int i = 0; i < ...

// The heaviest kernel (’MGS_266_gpu’):
#pragma acc parallel loop
for (int j = k+1; j < ...

nvprof profile output – version#2

[Table: nvprof output for version#2; the heaviest kernel is MGS_266_gpu.]
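Relating to the version#2 kernel above, here is a minimal sketch of what the transposed inner update can look like; the exact array shapes (wrkT[ncols][nrows], Qt[ncols][nrows], R[ncols][ncols]) and loop bounds are assumptions for illustration, not the slide’s full listing. The point is that with the work matrix stored transposed, the outer j-loop supplies the parallelism and the inner i-loops are unit-stride.

/* One k-step of the transposed MGS update: column j of the work matrix is
 * the contiguous row wrkT[j][..], so each gang streams through memory. */
void mgs_update_k(int k, int nrows, int ncols,
                  double wrkT[ncols][nrows],
                  double Qt[ncols][nrows],
                  double R[ncols][ncols])
{
  #pragma acc parallel loop present(wrkT, Qt, R)
  for (int j = k + 1; j < ncols; ++j) {
    double dot = 0.0;
    #pragma acc loop vector reduction(+:dot)
    for (int i = 0; i < nrows; ++i)        /* R(k,j) = Q(:,k) . A(:,j)  */
      dot += Qt[k][i] * wrkT[j][i];
    R[k][j] = dot;
    #pragma acc loop vector
    for (int i = 0; i < nrows; ++i)        /* A(:,j) -= Q(:,k) * R(k,j) */
      wrkT[j][i] -= Qt[k][i] * dot;
  }
}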

Version#2 of Modified Gram-Schmidt : N=2400

[Figure: run-time bar chart for version#2, with IvB serial and IvB omp 12 among the entries; annotations "Dzjeez!" and "Ain’t there any BLAS??".]

MM : QR → Q * R : must be close enough to A (version#2 – OpenMP blocked)

// -- OpenMP (version#1) --
double Q[nrows][nmin], R[nmin][ncols];
double QR[nrows][ncols];
double sum;
...

// -- OpenMP (version#2) --
int bs = 32;                  // block size
int iimax, jjmax, kkmax;
#pragma omp parallel
{
  #pragma omp for
  for (int i=0; i < ...

415 416 Performance of the MGS-routine (A  Q * R) Performance of the MatMul-routine (Q * R  A)

MGS-routine (A → Q * R), performance in GFlops/s, bigger the better:

                   Tesla K20m   MIC omp 240   IvB omp 12   IvB serial
  Version#1             0.162         0.755        4.056        0.352
  Version#2            19.063        30.373       18.186        2.809

MatMul-routine (Q * R → A), performance in GFlops/s, bigger the better:

                   Tesla K20m   MIC omp 240   IvB omp 12   IvB serial
  Version#1             0.227         1.952        5.922        0.533
  Version#2            36.427         6.724       23.303        2.189

417 418

Important directives & clauses touched in this lecture

OpenACC
– data with
  . create, present, copy[in|out], pcopy[in|out]
– update host [and update device] with
  . async(handle) and if(cond)
  . wait(handle)
– loop collapse(2), loop vector
– reduction([+|max]:var) clauses
OpenMP host
– parallel
– for / DO, sometimes with reduction(+:var) and nowait

Summary

The OpenACC DATA directive provides an essential framework for efficient GPU programming with OpenACC.
A straightforward conversion from OpenMP directives to OpenACC does not always give the best performance – as we have seen in the Modified Gram-Schmidt QR-decomposition program.
Sometimes the speedups with OpenACC compared to the OpenMP versions are just phenomenal.
Some more excitement to come in the next lecture!

419 420 3 : Advanced OpenACC topics

Sami Saarinen, CSC – IT Center for Science Ltd, Espoo, Finland
Material on "Advanced OpenACC topics", (C) 2014 by CSC – IT Center for Science Ltd.
Material from CSC is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License, http://creativecommons.org/licenses/by-nc-sa/3.0/
421 422

Three lectures on OpenACC

Introduction to OpenACC
– General introduction with simple examples
Tuning OpenACC programs
– Useful tips for better performance
– Code profiling
Advanced OpenACC topics
– Interfacing with CUDA codes and libraries
– Hybrid host CPU + GPU programming with OpenACC
– OpenACC with MPI and OpenMP

CUDA and OpenACC interoperability

In the next couple of slides we learn how to access CUDA kernels from OpenACC code – and vice versa.
The new OpenACC directive host_data and the data-directive clause deviceptr play key roles in making interoperability with CUDA possible.
As a test example we will go back to our very first example from lecture 1 – DAXPY.
– In addition we show data initialization and vector summation – written in OpenACC, but called from CUDA !!

423 424 The host_data –directive Calling CUDA-kernel from OpenACC-program

Makes the addresses of device-resident data arrays accessible to the host CPU inside a valid DATA scope.
– CUDA kernels must be called with the help of OpenACC host_data.

// C
extern void some_CUDA_kernel_wrapper(int m, int *ia, float *x, float *y);

void calc(const int m, float y[m], float x[m], int ia[2*m])
{
  #pragma acc data present(ia[0:2*m]) \
              copy(y[0:m]) copyout(x[0:m])
  {
    #pragma acc host_data use_device(ia,x,y)
    {
      some_CUDA_kernel_wrapper(m,ia,x,y);
    } // #pragma acc host_data
  } // #pragma acc data
}

! Fortran
SUBROUTINE CALC(m, Y, X, IA)
  INTEGER :: m, ia(2*m)
  REAL :: Y(m), X(m)
  !$acc data present(ia) copy(y) &
  !$acc& copyout(x)
  !$acc host_data use_device(ia,x,y)
  CALL some_CUDA_kernel_wrapper(m,ia,x,y)
  !$acc end host_data
  !$acc end data
END SUBROUTINE CALC

Calling a CUDA kernel from an OpenACC program:
In this scenario we have a (main) program written in C/C++ (or Fortran), and this driver uses OpenACC directives.
The interface function in the CUDA file must have extern "C" void func(…) linkage.
The CUDA codes are compiled with the NVIDIA nvcc compiler, e.g.
  nvcc -c -O4 --restrict -arch=sm_35 daxpy_cuda.cu
The OpenACC codes are compiled with the PGI compiler, e.g.
  pgcc -c -acc -O4 call_cuda_from_openacc.c
Linking with the PGI compiler must also include -acc -Mcuda, e.g.
  pgcc -acc -Mcuda call_cuda_from_openacc.o daxpy_cuda.o
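For completeness, the extern "C" contract between the nvcc-compiled file and the OpenACC C code can live in a small shared header; a minimal sketch (the header name and include guard are hypothetical, only the wrapper prototype comes from the slide):

/* cuda_kernels.h – hypothetical shared header, included both by the OpenACC
 * C source (compiled with pgcc) and by the .cu file (compiled with nvcc),
 * so the wrapper is seen with plain C linkage on both sides. */
#ifndef CUDA_KERNELS_H
#define CUDA_KERNELS_H

#ifdef __cplusplus
extern "C" {
#endif

/* Implemented in the .cu file: launches the CUDA kernel on the device
 * pointers handed over by the OpenACC host_data use_device region. */
void some_CUDA_kernel_wrapper(int m, int *ia, float *x, float *y);

#ifdef __cplusplus
}
#endif

#endif /* CUDA_KERNELS_H */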

425 426

Calling a CUDA kernel from an OpenACC program

// call_cuda_from_openacc.c
extern void daxpy(int n, double a, const double *x, double *y);

int main(int argc, char *argv[])
{
  int n = (argc > 1) ? atoi(argv[1]) : (1 << 27);
  const double a = 2.0;
  double *x = malloc(n * sizeof(*x));
  double *y = malloc(n * sizeof(*y));
  #pragma acc data create(x[0:n], y[0:n])
  {
    // Initialize x & y
    ...
  } // #pragma acc data
}

// daxpy_cuda.cu
__global__
void daxpy_kernel(int n, double a, const double *x, double *y)
{ // The actual CUDA-kernel
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  while (tid < n) {
    y[tid] += a * x[tid];
    tid += stride;
  }
}

extern "C" void daxpy(int n, double a, const double *x, double *y)
{
  ...
  daxpy_kernel<<<...>>>(n, a, x, y);
}

The deviceptr data clause

In DATA, PARALLEL or KERNELS constructs it tells that the array pointers belong to the device side – thus the following routines become CUDA-callable (requires extern "C" void prototypes):

// C
void daxpy(int n, double a, const double *x, double *y)
{
  #pragma acc parallel deviceptr(x,y)
  {
    #pragma acc loop
    for (int j=0; j<n; ++j)
      y[j] += a * x[j];
  }
}

! Fortran
SUBROUTINE daxpy(n, a, X, Y)
  INTEGER :: n
  REAL(8) :: a, X(n), Y(n)
  !$acc kernels deviceptr(X,Y)
  Y(:) = Y(:) + a * X(:)
  !$acc end kernels
END SUBROUTINE daxpy

427 428 Calling OpenACC-routines from CUDA-programs

Calling OpenACC routines from CUDA programs

In this scenario we have a (main) program written in CUDA, and it calls functions written in C/C++/Fortran with OpenACC extensions.
Linking must still be done with PGI using -acc -Mcuda, e.g.
  pgcc -acc -Mcuda call_openacc_from_cuda.o daxpy_openacc.o

// call_openacc_from_cuda.cu
#include <stdio.h>
#include <stdlib.h>

extern "C" void daxpy(int n, double a, const double *x, double *y);
extern "C" void init(int n, double scaling, double *v);

int main(int argc, char *argv[])
{
  int n = (argc > 1) ? atoi(argv[1]) : (1 << 27);
  double *x, *y, *s, tmp;
  ...
  cudaMemcpy(&tmp, s, (size_t)1*sizeof(*s),
             cudaMemcpyDeviceToHost);  // chksum
  cudaFree(x); cudaFree(y); cudaFree(s);
}

// daxpy_openacc.c
void daxpy(int n, double a,
           const double *restrict x, double *restrict y)
{
  #pragma acc parallel loop deviceptr(x,y)
  for (int j=0; j<n; ++j)
    y[j] += a * x[j];
}

void init(int n, double scaling, double *v)
{
  #pragma acc parallel loop deviceptr(v)
  for (int j=0; j<n; ++j)
    ...
}

// Inside the checksum routine of daxpy_openacc.c:
  // *res = s; // not supported for deviceptr
  #pragma acc loop seq
  for (int j=0; j<1; ++j) res[j] = s;

429 430

MatMul (MM) using the cuBLAS library

We already got familiar with MM in the previous lecture, when verifying the correctness of our MGS QR-decomposition.
Pushing a little further and introducing calls to the xGEMM library functions from the BLAS library
– The "x" stands for D and S, as in double and single precision math.
For GPUs this function resides in NVIDIA’s cuBLAS library
– The OpenACC routine needs a host_data use_device(...) wrap.
We compare results – double and single precision math – against
– the host CPU using the Intel MKL library with the PGI and Intel compilers
– the Intel MIC card, where we use MKL with the Intel compiler (native).

The target problem

Matrix-matrix multiply updates the input/output matrix C (M x N) using the input matrices A (M x K) and B (K x N) in the following way:
  [C] = α[A][B] + β[C]
Traditionally the xGEMM routines are catered for Fortran access, i.e. matrices stored in column-major order.
The same routines can also be used from C
– using Fortran access by storing the matrices in a vector, or
– using row-major matrices and calculating a "fake" transpose:
  [C]^T = α[B]^T[A]^T + β[C]^T
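To see why the fake transpose works, here is a tiny self-contained check. The naive gemm_colmajor routine below is only a stand-in for a Fortran-style, column-major xGEMM (it is not a BLAS call), and the matrix values are arbitrary: passing the row-major arrays as if they were the column-major transposes, with the operands and the M/N dimensions swapped, produces the row-major product directly.

#include <stdio.h>

/* Naive column-major GEMM reference: C = alpha*A*B + beta*C, where A is
 * m x k, B is k x n, C is m x n, all column-major with leading dimensions
 * lda, ldb, ldc (like Fortran xGEMM with "N","N"). */
static void gemm_colmajor(int m, int n, int k, double alpha,
                          const double *A, int lda,
                          const double *B, int ldb,
                          double beta, double *C, int ldc)
{
  for (int j = 0; j < n; ++j)
    for (int i = 0; i < m; ++i) {
      double s = 0.0;
      for (int p = 0; p < k; ++p)
        s += A[i + p*lda] * B[p + j*ldb];
      C[i + j*ldc] = alpha*s + beta*C[i + j*ldc];
    }
}

int main(void)
{
  /* Row-major 2x3 A and 3x2 B; we want the row-major 2x2 C = A*B. */
  double A[2][3] = {{1,2,3},{4,5,6}};
  double B[3][2] = {{7,8},{9,10},{11,12}};
  double C[2][2] = {{0,0},{0,0}};
  int m = 2, n = 2, k = 3;

  /* Fake transpose: ask the column-major routine for C^T = B^T * A^T by
   * swapping the operands and the m/n dimensions; the row-major arrays
   * are reinterpreted as the column-major transposes "for free". */
  gemm_colmajor(n, m, k, 1.0,
                &B[0][0], n,       /* B row-major == B^T column-major, ld = n */
                &A[0][0], k,       /* A row-major == A^T column-major, ld = k */
                0.0, &C[0][0], n); /* C row-major == C^T column-major, ld = n */

  printf("%g %g\n%g %g\n", C[0][0], C[0][1], C[1][0], C[1][1]);
  /* Expected: 58 64 / 139 154 */
  return 0;
}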

431 432 Calling double precision DGEMM on host CPU Calling single precision SGEMM on host CPU

! Fortran (column-major)
INTEGER :: M,N,K,LDA,LDB,LDC
! LDA ≥ M, LDB ≥ K, LDC ≥ M
REAL(8) :: A(LDA,K) ! M-by-K
REAL(8) :: B(LDB,N) ! K-by-N
REAL(8) :: C(LDC,N) ! M-by-N
REAL(8) :: ALPHA, BETA
! ...
CALL DGEMM ("N","N",  &
            M, N, K,  &
            ALPHA,    &
            A, LDA,   &
            B, LDB,   &
            BETA,     &
            C, LDC )

// C, imitating Fortran-order
#include <mkl.h> // with MKL
int m,n,k,lda,ldb,ldc;
double *A; // minlen lda * k
double *B; // minlen ldb * n
double *C; // minlen ldc * n
double alpha, beta;
/* ... */
DGEMM ("N","N",
       &m, &n, &k,
       &alpha,
       A, &lda,
       B, &ldb,
       &beta,
       C, &ldc );

// With C row-major matrices
#include <mkl.h> // with MKL
int m,n,k,lda,ldb,ldc;
double A[m][lda]; // lda ≥ k
double B[k][ldb]; // ldb ≥ n
double C[m][ldc]; // ldc ≥ n
double alpha, beta;
/* ... */
DGEMM ("N","N",
       &n, &m, &k,
       &alpha,
       (const double *)B, &ldb,
       (const double *)A, &lda,
       &beta,
       (double *)C, &ldc );

Calling single precision SGEMM is identical, with REAL(4)/float in place of REAL(8)/double and SGEMM in place of DGEMM.

433 434

With cuBLAS the prototype is different (DGEMM) With cuBLAS the prototype is different (SGEMM)

// #include
// Cannot include in PGI → manual interface

#define DGEMM cublasDgemm

extern void
DGEMM (char tr_a, char tr_b,
       int m, int n, int k,
       double alpha,
       const double *A, int lda,
       const double *B, int ldb,
       double beta,
       double *C, int ldc);

// C, imitating Fortran-order
int m,n,k,lda,ldb,ldc;
double *A; // minlen lda * k
double *B; // minlen ldb * n
double *C; // minlen ldc * n
double alpha, beta;
/* ... */
#pragma acc data present(A,B,C)
#pragma acc host_data \
            use_device(A,B,C)
DGEMM ('N', 'N',
       m, n, k,
       alpha,
       A, lda,
       B, ldb,
       beta,
       C, ldc );

// With C row-major matrices
int m,n,k,lda,ldb,ldc;
double A[m][lda]; // lda ≥ k
double B[k][ldb]; // ldb ≥ n
double C[m][ldc]; // ldc ≥ n
double alpha, beta;
/* ... */
#pragma acc data present(A,B,C)
#pragma acc host_data \
            use_device(A,B,C)
DGEMM ('N', 'N',
       n, m, k,
       alpha,
       (const double *)B, ldb,
       (const double *)A, lda,
       beta,
       (double *)C, ldc );

The SGEMM version is identical, with float in place of double and #define SGEMM cublasSgemm.

435 436 Accessing DGEMM/SGEMM

Using the PGI/OpenACC compiler, cuBLAS (for the NVIDIA Tesla K20m) can be accessed by providing the following library to the linker:
  -L$CUDA_INSTALL_PATH/lib64 -lcublas
Using the PGI compiler on the host (Intel Xeon, IvB) – to get linked with the multithreaded Intel MKL – the magic string gets pretty complex:
  -mp=numa,bind,allcores -L$MKLROOT/lib/intel64 -Wl,-rpath=$MKLROOT/lib/intel64
  -lmkl_intel_lp64 -lmkl_pgi_thread -lmkl_core -lpgftnrtl -lrt
  -pgf90libs   # Needed for C main programs
Using the Intel compiler for the host, just add -mkl=parallel, and for the native Intel Xeon Phi 5110 (MIC) card also add -mmic.

437 438

Can OpenMP and OpenACC co-exist?

OK, as long as they are used consistently and compiler support exists.
– One (… or more) GPU(s) per host CPU can be considered
– Another scope: hybrid GPU + host CPU matrix multiplication
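As a C counterpart to the Fortran examples shown right after this, here is a minimal multi-GPU sketch; the routine names gpu_work and process and the chunked data layout are assumptions for illustration.

#include <omp.h>
#include <openacc.h>

/* Illustrative device worker: scales its chunk in place.  It receives a
 * device pointer (via host_data use_device below), hence deviceptr. */
void gpu_work(int n, double *restrict chunk)
{
  #pragma acc parallel loop deviceptr(chunk)
  for (int i = 0; i < n; ++i)
    chunk[i] *= 2.0;
}

/* One OpenMP thread per GPU: each thread selects its own device and runs
 * its own OpenACC data region on its slice of the array. */
void process(int ngpus, int n, double *data)   /* data holds ngpus*n values */
{
  #pragma omp parallel num_threads(ngpus)
  {
    int tid = omp_get_thread_num();
    acc_set_device_num(tid, acc_device_nvidia);

    double *chunk = data + (long)tid * n;
    #pragma acc data copy(chunk[0:n])
    {
      #pragma acc host_data use_device(chunk)
      gpu_work(n, chunk);
    }
  }
}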

! Single GPU-device per host CPU node
USE OMP_LIB
INTEGER :: TID
REAL(8), INTENT(INOUT) :: A(:), B(:)
!$OMP PARALLEL SHARED(A,B) PRIVATE(TID)
TID = OMP_GET_THREAD_NUM() ! ≥ 0
IF (TID == 0) THEN
  !$ACC DATA PRESENT(A)
  !$ACC HOST_DATA USE_DEVICE(A)
  CALL some_CUDA_or_GPU_code(SIZE(A),A)
  !$ACC END HOST_DATA
  !$ACC END DATA
ELSE
  CALL DO_SOMETHING_on_CPUs_with(B, ...)
ENDIF
!$OMP END PARALLEL

! Up to NGPUs devices per host CPU node
USE OMP_LIB
USE OPENACC
INTEGER :: TID, IDX, N, NGPUs
REAL(8) :: IN(N * NGPUs), OUT(N * NGPUs)
!$OMP PARALLEL SHARED(IN,OUT) &
!$OMP& PRIVATE(TID, IDX) NUM_THREADS(NGPUs)
TID = OMP_GET_THREAD_NUM() ! ≥ 0
CALL ACC_SET_DEVICE_NUM(TID,ACC_DEVICE_NVIDIA)
IDX = TID * N
!$ACC DATA COPYIN(IN(IDX+1:IDX+N)) &
!$ACC& COPYOUT(OUT(IDX+1:IDX+N))
!$ACC HOST_DATA USE_DEVICE(IN,OUT)
CALL some_CUDA_or_GPU_code(IDX,N,IN,OUT)
!$ACC END HOST_DATA
!$ACC END DATA
!$OMP END PARALLEL

439 440

Hybrid MatMul using host CPU(s) and one GPU

Assuming GPU-resident matrices C (M x N), A (M x K) and B (K x N).
The matrix-matrix multiplication
  [C] = α[A][B] + β[C]
can be split up into a GPU (=1) and a host CPU (=2) part:
  [C1] = α[A1][B] + β[C1] , where C1 is M1 x N, A1 is M1 x K and B is K x N
  [C2] = α[A2][B] + β[C2] , where C2 is M2 x N, A2 is M2 x K and B is K x N
The transposed form of the same split can be written as:
  [C1]^T = α[B]^T[A1]^T + β[C1]^T
  [C2]^T = α[B]^T[A2]^T + β[C2]^T

Hybrid MatMul (cont’d)

Showing only the row-major version (C) and just DGEMM, we notice that we have to glue together two code sections
– the 1st one calling the GPU’s cuBLAS (cublasDgemm)
– the 2nd one calling the CPU’s multithreaded DGEMM
The split-up (M = M1 + M2) is a question that needs testing.
To gain any benefit our hybrid approach must run faster than, say, cuBLAS or the host CPU version alone
– We must keep the host CPU’s DGEMM using all of its cores.

441 442

Hybrid MatMul (cont’d)

// With C row-major matrices (GPU)
int m,n,k,lda,ldb,ldc;
int m1, m2;            // m1 + m2 = m
double A[m][lda];      // lda ≥ k
double B[k][ldb];      // ldb ≥ n
double C[m][ldc];      // ldc ≥ n
double alpha, beta;
#pragma acc data present(A,B,C)
{
  #pragma acc host_data \
              use_device(A,B,C)
  cublasDgemm ('N', 'N',
               n, m1, k,
               alpha,
               (const double *)B, ldb,
               (const double *)A, lda,
               beta,
               (double *)C, ldc );
}

// With C row-major matrices (CPU)
#include <mkl.h> // with MKL
#pragma acc data present(A,B,C)
{
  const int ah = 1, bh = 2, ch = 3;
  #pragma acc update host(B[0:k][0:ldb]) async(bh)
  #pragma acc update host(A[m1:m2][0:lda]) async(ah)
  #pragma acc update host(C[m1:m2][0:ldc]) async(ch) \
              if (beta != 0)
  #pragma acc wait
  DGEMM ("N","N",
         &n, &m2, &k,
         &alpha,
         (const double *)B, &ldb,
         (const double *)&A[m1][0], &lda,
         &beta,
         (double *)&C[m1][0], &ldc );
  #pragma acc update device(C[m1:m2][0:ldc])
}

Hybrid MatMul (cont’d)

A nice way to do the split-up is to launch an OpenMP PARALLEL SECTIONS region with two (2) threads on top of the GPU and host CPU sections.
Each OpenMP section does its own MatMul call in an OpenACC data region
– The GPU calls cuBLAS, the CPU calls DGEMM (from, say, MKL).
Before the host CPU can do its portion of MatMul, it needs copies of sub-matrix A2, the whole B and possibly C2.
But can we keep the CPU DGEMM multithreaded, since we call MKL already from an OpenMP parallel region? (Yes) No

443 444

Plugging in the outer OpenMP parallel region

int maxth = omp_get_max_threads();
#pragma omp parallel sections num_threads(2) default(shared) if (m1 > 0)
{
  #pragma omp section
  {
    #pragma acc data present(A,B,C) // This must be *inside* the OMP-section
    if (m1 > 0) {
      // GPU-call to cublasDgemm( ... )
    }
  } // end of OMP-section 1

  #pragma omp section
  {
    #pragma acc data present(A,B,C) // This must be *inside* the OMP-section
    if (m2 > 0) {
      // Xfer B, A2 & possibly C2 from GPU to CPU – asynchronously + wait
      // omp_set_num_threads(maxth);
      // Host CPU-call to DGEMM( ... ) – tried to keep this multithreaded
      // Update C-matrix portion C2 back to GPU
    }
  } // end of OMP-section 2
} // #pragma omp parallel sections

[Figure: run-time comparison of the variants – 100% GPU; 75% GPU + 25% host CPU (non-hybrid MKL in single-threaded mode!); host CPU, 12 cores; hybrid 100% host CPU, 12 cores (data from GPU) – run with MKL_DYNAMIC=false and OMP_NESTED=true.]

445 446

Explaining poor performance in hybrid xGEMMs

Why was the 75% GPU + 25% host CPU version much slower than anticipated? [Data transfer plays only a small role]
. The CPU was not using MKL multithreading, since its xGEMM call was launched from within an OpenMP region (nesting). The limitation only applies to non-Intel (e.g. PGI) compiled programs that call Intel MKL through a non-Intel threading layer whilst already in a threaded region !!

Why was the pure 100% host CPU hybrid version slightly slower than a non-hybrid host CPU version with 12 threads?
. The CPU needs to transfer the A2 and B matrices from the GPU and return C2 back.
. It indeed used MKL multithreading, since it was launched from a non-parallel OpenMP region (because #pragma omp ... if (m1 > 0) was specified).
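One possible way to express the intended threading setup in code rather than through environment variables; this is only a sketch, assuming MKL’s service functions (mkl_set_dynamic, mkl_set_num_threads) and omp_set_nested are available, and, as noted above, it may still not help a PGI-built binary that reaches MKL through a non-Intel threading layer.

#include <omp.h>
#include <mkl.h>

/* Roughly equivalent to exporting OMP_NESTED=true and MKL_DYNAMIC=false
 * before the run: allow the MKL thread team to nest inside our OpenMP
 * section and stop MKL from throttling its own thread count. */
void setup_hybrid_threading(int cpu_threads)
{
  omp_set_nested(1);                 /* OMP_NESTED=true                 */
  mkl_set_dynamic(0);                /* MKL_DYNAMIC=false               */
  mkl_set_num_threads(cpu_threads);  /* e.g. all 12 IvB cores for DGEMM */
}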

447 448 Halo-data (nearest neighbor) exchange, Y-dir MPI and OpenACC interoperability

[Figure: Y-direction domain decomposition on a numbered grid (cells 1–18); the parenthesized rows (7)(8)(9) and (10)(11)(12) mark the halo rows exchanged between the neighboring subdomains; axes X and Y.]

Consider a situation where there is one GPU per host node.
The aim is to write an MPI-parallel program where each MPI task uses OpenACC to drive its GPU device.
To be efficient, and much easier to program, the MPI communication between tasks should be GPU-to-GPU
– We talk about a CUDA-aware MPI implementation, e.g.
  . MVAPICH2 v2.0b has just been built at CSC for use with the PGI compilers
With standard MPI the communication goes via host-to-host.
An example: halo data exchange for GPU-resident data.

449 450

Using standard, non-CUDA aware MPI Using standard host-to-host MPI (not CUDA-aware) South-to-North communication only shown here

Our computational regions are GPU-resident, with one GPU attached to each MPI task
– Before the stencil updates, the halo data needs to be exchanged
On each MPI task we need to copy the halo data from the GPU (using OpenACC) into a (send) buffer residing on the host
– Then send the buffer from this host to the neighboring host
  . Once received, the neighboring host transfers the received buffer to its own GPU using the OpenACC UPDATE directive o aaaaargh !!
This obviously sounds very, very complicated …

real(kind=8) :: u(0:nx+1,0:ny+1)         ! Local region with halos, GPU-resident
real(kind=8) :: sendbuf(nx), recvbuf(nx) ! msg-buffers
integer :: rst(mpi_status_size)          ! Receive status
prevproc = me - 1 ; nextproc = me + 1 ; itag = 1000 ! prev & next procs, tag
!$acc data present(u) create(sendbuf)
if (nextproc < npes) then ! Send halo to north (buffered send)
  !$acc kernels
  sendbuf(1:nx) = u(1:nx,ny)
  !$acc end kernels
  !$acc update host(sendbuf(1:nx))    !! Don't like this ... a "baddie"
  call mpi_bsend(sendbuf,nx,MPI_REAL8,nextproc,itag+me,icomm,ierr)
endif
if (prevproc >= 0) then ! Receive halo from south
  call mpi_recv(recvbuf,nx,MPI_REAL8,prevproc,itag+prevproc,icomm,rst,ierr)
  !$acc kernels copyin(recvbuf(1:nx)) !! Don't like this ... a "baddie"
  u(1:nx,0) = recvbuf(1:nx)
  !$acc end kernels
endif
!$acc end data

451 452

Using CUDA-aware MPI – South-to-North communication only shown here

real(kind=8) :: u(0:nx+1,0:ny+1)         ! Local region with halos, GPU-resident
real(kind=8) :: sendbuf(nx), recvbuf(nx) ! msg-buffers
integer :: rst(mpi_status_size)          ! Receive status
prevproc = me - 1 ; nextproc = me + 1 ; itag = 1000 ! prev & next procs, tag
!$acc data present(u) create(sendbuf, recvbuf)
!$acc host_data use_device(sendbuf, recvbuf) ! Affects both MPI_bsend & MPI_recv
if (nextproc < npes) then ! Send halo to north (buffered send)
  !$acc kernels
  sendbuf(1:nx) = u(1:nx,ny)
  !$acc end kernels
  call mpi_bsend(sendbuf,nx,MPI_REAL8,nextproc,itag+me,icomm,ierr)
endif
if (prevproc >= 0) then ! Receive halo from south
  call mpi_recv(recvbuf,nx,MPI_REAL8,prevproc,itag+prevproc,icomm,rst,ierr)
  !$acc kernels
  u(1:nx,0) = recvbuf(1:nx)
  !$acc end kernels
endif
!$acc end host_data
!$acc end data

Using CUDA-aware MPI – South-to-North communication made even simpler !!

real(kind=8) :: u(0:nx+1,0:ny+1)         ! Local region with halos, GPU-resident
integer :: rst(mpi_status_size)          ! Receive status
prevproc = me - 1 ; nextproc = me + 1 ; itag = 1000
!$acc data present(u)
!$acc host_data use_device(u) ! Affects both MPI_bsend & MPI_recv
if (nextproc < npes) then ! Send halo to north (buffered send)
  call mpi_bsend(u(1,ny),nx,MPI_REAL8,nextproc,itag+me,icomm,ierr)
endif
if (prevproc >= 0) then ! Receive halo from south
  call mpi_recv(u(1,0),nx,MPI_REAL8,prevproc,itag+prevproc,icomm,rst,ierr)
endif
!$acc end host_data
!$acc end data
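The same CUDA-aware exchange can be written in C; a minimal sketch, where the function name, the row-major layout and the use of plain MPI_Send instead of a buffered send are assumptions for illustration, and which requires a CUDA-aware MPI library just like the Fortran version:

#include <mpi.h>

/* South-to-North exchange: u is the local (ny+2) x (nx+2) region with halos,
 * stored row by row (u[j*(nx+2)+i]) and already present on the GPU.  With a
 * CUDA-aware MPI the device addresses exposed by host_data use_device can be
 * passed straight to MPI_Send/MPI_Recv. */
void exchange_north(double *u, int nx, int ny, int rank, int npes)
{
  int north = rank + 1, south = rank - 1, tag = 1000;
  MPI_Status st;

  #pragma acc data present(u[0:(nx+2)*(ny+2)])
  {
    #pragma acc host_data use_device(u)
    {
      if (north < npes)   /* send my top interior row to the northern neighbor */
        MPI_Send(&u[ny*(nx+2) + 1], nx, MPI_DOUBLE,
                 north, tag + rank, MPI_COMM_WORLD);
      if (south >= 0)     /* receive my bottom halo row from the south         */
        MPI_Recv(&u[1], nx, MPI_DOUBLE,
                 south, tag + south, MPI_COMM_WORLD, &st);
    }
  }
}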

453 454

Performance on 2 hosts, 1 MPI-task/host, 1 GPU/task
Performance [MLups/s], bigger the better:

                     Tesla K20m          Tesla K20m            IvB omp 12
                     with standard MPI   with CUDA-aware MPI   with MPI
  w/o MPI, 1 host          6280                6280                1976
  with MPI, 2 hosts        6097                9077                4048

Important directives & clauses touched in this lecture

OpenACC
– host_data use_device(…)
– deviceptr(…) in data/parallel/kernels constructs
– async(handle) and if(cond) clauses
– wait clause
OpenMP
– parallel sections … if (cond)
– section

455 456

Summary

Calling CUDA kernels from within OpenACC – and vice versa – is not a very complicated task once certain conventions are adhered to and compiler/linker support is there.
Calling NVIDIA cuBLAS from OpenACC is also feasible when the interface and the matrix row/column ordering are known.
Using OpenMP & OpenACC in concert is not too complex either, and may help some hybrid codes see the light of day.
Efficient GPU-to-GPU MPI communication requires a CUDA-aware MPI implementation – which is also transparent for host-to-host MPI communication.

457