TAMING THE HYDRA: MULTI-GPU PROGRAMMING WITH OPENACC
Jeff Larkin, GTC 2019, March 2019

WHY USE MULTIPLE DEVICES?

Because 2 > 1 : Effectively using multiple devices gives a faster time to solution and/or lets you solve more complex problems

Because they’re there : If running on a machine with multiple devices, you’re wasting resources by not using them all

Because the code already runs on multiple nodes using MPI : little additional change is necessary

SUMMIT NODE
(2) IBM POWER9 + (6) VOLTA V100

[Node diagram: each POWER9 CPU has 256 GB of DDR4 at 135 GB/s and 21 cores with 4 hardware threads each (cores 0-41 across the node); the two CPUs are linked by a 64 GB/s bus; each V100 has 16 GB of HBM2 at 900 GB/s; CPUs and GPUs are connected by NVLink2 at 50 GB/s.]

DGX-1 NODE

[Node topology diagram]

DGX-2 NODE

[Node topology diagram]

MULTI-GPU PROGRAMMING STRATEGIES

MULTI-GPU PROGRAMMING MODELS

1 CPU : Many GPUs

• A single thread will change devices as needed to send data and kernels to different GPUs

Many Threads : 1 GPU (Each)

• Using OpenMP, Pthreads, or similar, each thread can manage its own GPU

1+ CPU Process : 1 GPU

• Each rank acts as if there’s just 1 GPU, but multiple ranks per node use all GPUs

Many Processes : Many GPUs

• Each rank manages multiple GPUs, with multiple ranks per node. Gets complicated quickly!

MULTI-GPU PROGRAMMING MODELS: Trade-offs Between Approaches

1 CPU : Many GPUs
• Conceptually Simple
• Requires additional loops
• CPU can become a bottleneck
• Remaining CPU cores often underutilized

Many Threads : 1 GPU (Each)
• Conceptually Very Simple
• Relies on external Threading API
• Can see improved utilization
• Watch affinity

1+ CPU : 1 GPU
• Little to no code changes required
• Set and forget the device numbers
• Re-uses existing domain decomposition
• Probably already using MPI
• Watch affinity

Many Processes : Many GPUs
• Easily share data between peer devices
• Coordinating between GPUs extremely tricky

MULTI-DEVICE OPENACC

OpenACC presents devices numbered 0 – (N-1) for each device type available.

The order of the devices comes from the runtime and is almost certainly the same as CUDA’s.

By default all data and work go to the current device. Each device has its own async queues.

Developers must change the current device, and possibly the current device type, using an API (sketched below).
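A minimal sketch of the OpenACC device-management routines used throughout these slides (acc_get_num_devices, acc_set_device_num, acc_get_device_num); the acc_device_default constant follows the slides’ own usage:

#include <openacc.h>
#include <stdio.h>

int main(void)
{
    int ndevices = acc_get_num_devices(acc_device_default);   /* devices of the default type */
    for (int d = 0; d < ndevices; d++) {
        acc_set_device_num(d, acc_device_default);             /* make device d current */
        /* data directives, kernels, and async queues issued here target device d */
        printf("current device: %d\n", acc_get_device_num(acc_device_default));
    }
    return 0;
}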

MULTI-DEVICE CUDA

CUDA by default exposes all devices, numbered 0 – (N-1). If the devices are not all the same, it will reorder them so that the “best” one is device 0.

Each device has its own pool of streams.

If you do nothing, all work will go to Device #0.

Developers must change the current device explicitly using an API (sketched below).
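The CUDA runtime equivalent, as a minimal sketch (error checking omitted for brevity):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int ndevices = 0;
    cudaGetDeviceCount(&ndevices);      /* number of visible devices */
    for (int d = 0; d < ndevices; d++) {
        cudaSetDevice(d);               /* make device d current for this thread */
        /* allocations, streams, and kernel launches issued here target device d */
    }
    int current = -1;
    cudaGetDevice(&current);            /* query the current device */
    printf("current device: %d\n", current);
    return 0;
}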

SAMPLE CODE

SAMPLE CODE: Embarrassingly Parallel Image Filter

[Several slides illustrate the example image and the filter applied to it.]

SAMPLE FILTER CODE

for(y = 0; y < h; y++) {
  for(x = 0; x < w; x++) {                                   // For each pixel
    double blue = 0.0, green = 0.0, red = 0.0;
    for(int fy = 0; fy < filtersize; fy++) {
      long iy = y - (filtersize/2) + fy;
      for (int fx = 0; fx < filtersize; fx++) {
        long ix = x - (filtersize/2) + fx;
        blue  += filter[fy][fx] * (double)imgData[iy * step + ix * ch];      // Apply filter
        green += filter[fy][fx] * (double)imgData[iy * step + ix * ch + 1];
        red   += filter[fy][fx] * (double)imgData[iy * step + ix * ch + 2];
      }
    }
    out[y * step + x * ch]     = 255 - (scale * blue);       // Store the values
    out[y * step + x * ch + 1] = 255 - (scale * green);
    out[y * step + x * ch + 2] = 255 - (scale * red);
  }
}
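For reference, a single-device OpenACC version of this loop nest (a sketch that is not on the slides; it reuses the variable names above and assumes imgData and out each hold h*step elements):

#pragma acc parallel loop copyin(imgData[0:h*step], filter[:5][:5]) \
                          copyout(out[0:h*step])
for (y = 0; y < h; y++) {
    #pragma acc loop
    for (x = 0; x < w; x++) {
        /* Apply the filter and store the values, exactly as in the serial loop body */
    }
}

The multi-GPU versions below split the y loop across devices and widen each device’s copy of imgData by filtersize/2 halo rows at the block boundaries (the copyLower/copyUpper bounds).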

1:MANY

1:MANY – THE PLAN

1. Introduce loop to initialize each device with needed arrays

2. Send each device its block of the image

3. Introduce loop to copy data back from and tear down each device.

SAMPLE FILTER CODE (1:MANY)

// 1. Initialize each device with the needed arrays
for(int device = 0; device < ndevices; device++) {
  acc_set_device_num(device, acc_device_default);
  long lower = device*rows_per_device;
  long upper = MIN(lower + rows_per_device, h);
  long copyLower = MAX(lower-(filtersize/2), 0);
  long copyUpper = MIN(upper+(filtersize/2), h);
  #pragma acc enter data \
    create(imgData[copyLower*step:(copyUpper-copyLower)*step], \
           out[lower*step:(upper-lower)*step]) \
    copyin(filter[:5][:5])
}

// 2. Send each device its block of the image and launch its work asynchronously
for(int device = 0; device < ndevices; device++) {
  acc_set_device_num(device, acc_device_default);
  printf("Launching device %d\n", device);
  long lower = device*rows_per_device;
  long upper = MIN(lower + rows_per_device, h);
  long copyLower = MAX(lower-(filtersize/2), 0);
  long copyUpper = MIN(upper+(filtersize/2), h);
  #pragma acc update device(imgData[copyLower*step:(copyUpper-copyLower)*step]) async
  #pragma acc parallel loop present(filter, \
      imgData[copyLower*step:(copyUpper-copyLower)*step], \
      out[lower*step:(upper-lower)*step]) async
  for(y = lower; y < upper; y++) {
    #pragma acc loop
    for(x = 0; x < w; x++) {
      double blue = 0.0, green = 0.0, red = 0.0;
      #pragma acc loop seq
      for(int fy = 0; fy < filtersize; fy++) {
        long iy = y - (filtersize/2) + fy;
        #pragma acc loop seq
        for (int fx = 0; fx < filtersize; fx++) {
          long ix = x - (filtersize/2) + fx;
          if( (iy<0) || (ix<0) || (iy>=h) || (ix>=w) ) continue;
          blue  += filter[fy][fx] * (double)imgData[iy * step + ix * ch];
          green += filter[fy][fx] * (double)imgData[iy * step + ix * ch + 1];
          red   += filter[fy][fx] * (double)imgData[iy * step + ix * ch + 2];
        }
      }
      out[y * step + x * ch]     = 255 - (scale * blue);
      out[y * step + x * ch + 1] = 255 - (scale * green);
      out[y * step + x * ch + 2] = 255 - (scale * red);
    }
  }
  #pragma acc update self(out[lower*step:(upper-lower)*step]) async
}

// 3. Wait for each device to finish (completing the copies back) and tear it down
for(int device = 0; device < ndevices; device++) {
  acc_set_device_num(device, acc_device_default);
  #pragma acc wait
  long lower = device*rows_per_device;
  long upper = MIN(lower + rows_per_device, h);
  long copyLower = MAX(lower-(filtersize/2), 0);
  long copyUpper = MIN(upper+(filtersize/2), h);
  #pragma acc exit data delete(out[lower*step:(upper-lower)*step], \
      imgData[copyLower*step:(copyUpper-copyLower)*step], filter)
}

SAMPLE FILTER CODE (1:MANY)

[Same code as above, zoomed in on the per-device data movement and compute: update device(...) async, parallel loop ... async, update self(...) async.]

SAMPLE FILTER CODE (1:MANY)

[Same code as above, zoomed in on the final per-device loop: wait, then exit data delete(...).]

1:MANY – THE GOOD & BAD

The Good
• Relatively simple to understand

The Bad
• Adds code that will be superfluous with single or no device
• One thread will eventually become a bottleneck

[Chart: time (s) and speed-up vs. serial for 1, 2, 3, 4, 6, and 8 GPUs; series: 1:Many, Serial; speed-up axis up to 1.80X.]

MANY (THREADS) : 1

MANY (THREADS) : 1 – THE PLAN

1. Spawn 1 thread per GPU using OpenMP (any threading model will do)

2. Set the device within each thread.

3. Assign an equal portion of the work to each thread.

4. The threads will automatically synchronize when all threads are complete.

SAMPLE FILTER CODE (OPENMP)

long rows_per_device = (h+(ndevices-1))/ndevices;
#pragma omp parallel for num_threads(acc_get_num_devices(acc_device_default))
for(int device = 0; device < ndevices; device++) {
  long lower = device*rows_per_device;
  long upper = MIN(lower + rows_per_device, h);
  long copyLower = MAX(lower-(filtersize/2), 0);
  long copyUpper = MIN(upper+(filtersize/2), h);

  acc_set_device_num(device, acc_device_default);

  #pragma acc declare copyin(filter)
  #pragma acc parallel loop \
      copyin(imgData[copyLower*step:(copyUpper-copyLower)*step]) \
      copyout(out[lower*step:(upper-lower)*step]) present(filter)
  for(y = lower; y < upper; y++) {
    for(x = 0; x < w; x++) {
      // Apply Filter
    }
  }
} // end omp parallel for

MANY (THREADS) : 1 – THE GOOD & BAD

The Good
• Relatively simple to understand
• Fewer code changes

The Bad
• Introduces an additional threading API
• Need to be conscious of affinity (OMP_PLACES)

[Chart: time (s) and speed-up vs. serial for 1, 2, 3, 4, 6, and 8 GPUs; series: Many:1 (affinity), Serial; speed-up axis up to 4.00X.]

ABOUT AFFINITY

$ nvidia-smi topo -m
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  mlx5_0  mlx5_2  mlx5_1  mlx5_3  CPU Affinity
GPU0     X    NV1   NV1   NV2   NV2   SYS   SYS   SYS   PIX     SYS     PHB     SYS     0-19,40-59
GPU1    NV1    X    NV2   NV1   SYS   NV2   SYS   SYS   PIX     SYS     PHB     SYS     0-19,40-59
GPU2    NV1   NV2    X    NV2   SYS   SYS   NV1   SYS   PHB     SYS     PIX     SYS     0-19,40-59
GPU3    NV2   NV1   NV2    X    SYS   SYS   SYS   NV1   PHB     SYS     PIX     SYS     0-19,40-59
GPU4    NV2   SYS   SYS   SYS    X    NV1   NV1   NV2   SYS     PIX     SYS     PHB     20-39,60-79
GPU5    SYS   NV2   SYS   SYS   NV1    X    NV2   NV1   SYS     PIX     SYS     PHB     20-39,60-79
GPU6    SYS   SYS   NV1   SYS   NV1   NV2    X    NV2   SYS     PHB     SYS     PIX     20-39,60-79
GPU7    SYS   SYS   SYS   NV1   NV2   NV1   NV2    X    SYS     PHB     SYS     PIX     20-39,60-79
mlx5_0  PIX   PIX   PHB   PHB   SYS   SYS   SYS   SYS    X      SYS     PHB     SYS
mlx5_2  SYS   SYS   SYS   SYS   PIX   PIX   PHB   PHB   SYS      X      SYS     PHB
mlx5_1  PHB   PHB   PIX   PIX   SYS   SYS   SYS   SYS   PHB     SYS      X      SYS
mlx5_3  SYS   SYS   SYS   SYS   PHB   PHB   PIX   PIX   SYS     PHB     SYS      X

• GPUs have affinity to certain CPUs, meaning they connect directly rather than through an inter-CPU bus

• OMP_PLACES, numactl, or your job launcher can help place your threads close to each GPU
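For example (a sketch; ./filter is a hypothetical executable name for the sample code):

# Spread OpenMP threads across sockets so each thread lands near the GPU it manages
OMP_PLACES=sockets OMP_PROC_BIND=spread ./filter

# Or pin a run to one NUMA node explicitly with numactl
numactl --cpunodebind=0 --membind=0 ./filter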

MANY (THREADS) : 1 (IMPROVED)

MANY (THREADS) : 1 IMPROVED – THE PLAN

1. Spawn 1 thread per GPU using OpenMP (any threading model will do)

2. Set the device within each thread.

3. Assign an equal portion of the work to each thread.

4. Break the work further into blocks and use asynchronous queues to overlap copies and computation

5. Synchronize when all threads are complete.

SAMPLE FILTER CODE

int ndevices = acc_get_num_devices(acc_device_default);
long rows_per_device = (h+(ndevices-1))/ndevices;
long rows_per_block  = (h+(ndevices*3-1))/(ndevices*3);    // 3 blocks per device
#pragma omp parallel num_threads(ndevices)
{
  int tid = omp_get_thread_num();
  acc_set_device_num(tid, acc_device_default);             // one thread per device
  #pragma acc declare copyin(filter)
  long lower = tid*rows_per_device;
  long upper = MIN(lower + rows_per_device, h);
  long copyLower = MAX(lower-(filtersize/2), 0);
  long copyUpper = MIN(upper+(filtersize/2), h);
  #pragma acc data create(imgData[copyLower*step:(copyUpper-copyLower)*step]) create(out[lower*step:(upper-lower)*step])
  #pragma omp for
  for(int block = 0; block < ndevices*3; block++) {
    lower = block*rows_per_block;
    upper = MIN(lower + rows_per_block, h);
    copyLower = MAX(lower-(filtersize/2), 0);
    copyUpper = MIN(upper+(filtersize/2), h);

    // Each block uses its own async queue, so copies and kernels
    // from different blocks can overlap on the same device.
    #pragma acc update device(imgData[copyLower*step:(copyUpper-copyLower)*step]) async(block)
    #pragma acc parallel loop default(present) async(block)
    for(y = lower; y < upper; y++) {
      #pragma acc loop
      for(x = 0; x < w; x++) {
        // Apply Filter
      }
    }
    #pragma acc update self(out[lower*step:(upper-lower)*step]) async(block)
  } // end omp for
  #pragma acc wait   // drain all queues on this thread's device
} // end omp parallel

MANY (THREADS) : 1 (IMPROVED) – THE GOOD & BAD

The Good
• Relatively simple to understand
• Fewer code changes
• Overlaps PCIe copies

The Bad
• Introduces an additional threading API
• Need to be conscious of affinity (OMP_PLACES)

[Chart: time (s) and speed-up vs. serial for 1, 2, 3, 4, 6, and 8 GPUs; series: Many:1 (async), Serial; speed-up axis up to 4.50X.]

1+ (RANK) : 1 (GPU)

1+ (RANK) : 1 (GPU) – THE PLAN

1. Use MPI to decompose the work among ranks

2. Assign each rank a GPU

3. Reassemble the result

SAMPLE FILTER CODE (MPI)

int ndevices = acc_get_num_devices(acc_device_default);
acc_set_device_num(rank%ndevices, acc_device_default);

MPI_Scatterv((void*) imgData, sendcounts, senddispls, MPI_UNSIGNED_CHAR,
             (void*) (imgData + copyLower*step), (int) copySize, MPI_UNSIGNED_CHAR,
             0, MPI_COMM_WORLD);

#pragma acc parallel loop copyin(imgData[copyLower*step:copySize]) \
                          copy(out[lower*step:size]) \
                          copyin(filter[:5][:5])
for(y = lower; y < upper; y++) {
  for(x = 0; x < w; x++) {
    // Apply Filter
  }
}

MPI_Gatherv((void*) (out+lower*step), (int) size, MPI_UNSIGNED_CHAR,
            (void*) out, recvcounts, recvdispls, MPI_UNSIGNED_CHAR,
            0, MPI_COMM_WORLD);
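On a multi-node run, the rank%ndevices mapping assumes ranks are numbered consecutively within each node. A more robust variant (a sketch using the standard MPI-3 shared-memory communicator split; not on the slides) derives a node-local rank first:

#include <mpi.h>
#include <openacc.h>

/* Called after MPI_Init(): pick a GPU based on the node-local rank. */
static void select_device_by_local_rank(void)
{
    int world_rank, local_rank;
    MPI_Comm local_comm;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    /* Ranks that share a node land in the same communicator */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, world_rank,
                        MPI_INFO_NULL, &local_comm);
    MPI_Comm_rank(local_comm, &local_rank);

    int ndevices = acc_get_num_devices(acc_device_default);
    acc_set_device_num(local_rank % ndevices, acc_device_default);
    MPI_Comm_free(&local_comm);
}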

1:1 – THE GOOD & BAD

The Good
• Little/no code change if you already use MPI
• Affinity handled by MPI launcher

The Bad
• Dependence on MPI (if not already using)
• Need “enough” work to overcome communication cost

[Chart: time (s) and speed-up vs. serial for 1, 2, 3, 4, 6, and 8 GPUs; series: MPI (No Comm), Serial, MPI (w/ Comm); speed-up axis up to 7.00X.]

1+ (RANK) : 1 (GPU) (IMPROVED)

1+ (RANK) : 1 (GPU) (IMPROVED) – THE PLAN

1. Use MPI to decompose the work among ranks

2. Start CUDA MPS & launch multiple ranks per GPU (see the launch sketch after this list)

3. Assign each rank a GPU

4. Reassemble the result
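A minimal sketch of step 2 on a single node (the launcher, rank count, and the ./filter binary name are assumptions; exact commands vary by system):

nvidia-cuda-mps-control -d            # start the MPS control daemon
mpirun -np 16 ./filter                # e.g. 16 ranks sharing the node's GPUs
echo quit | nvidia-cuda-mps-control   # shut MPS down afterwards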

SAMPLE FILTER CODE (MPI)

int ndevices = acc_get_num_devices(acc_device_default);
int sharing  = nranks / ndevices;
acc_set_device_num(rank/sharing, acc_device_default);

MPI_Scatterv((void*) imgData, sendcounts, senddispls, MPI_UNSIGNED_CHAR,
             (void*) (imgData + copyLower*step), (int) copySize, MPI_UNSIGNED_CHAR,
             0, MPI_COMM_WORLD);

#pragma acc parallel loop copyin(imgData[copyLower*step:copySize]) \
                          copy(out[lower*step:size]) \
                          copyin(filter[:5][:5])
for(y = lower; y < upper; y++) {
  for(x = 0; x < w; x++) {
    // Apply Filter
  }
}

MPI_Gatherv((void*) (out+lower*step), (int) size, MPI_UNSIGNED_CHAR,
            (void*) out, recvcounts, recvdispls, MPI_UNSIGNED_CHAR,
            0, MPI_COMM_WORLD);

1:1 (IMPROVED) – THE GOOD & BAD

The Good
• Little/no code change if you already use MPI
• Affinity handled by MPI launcher
• Memcopies overlapped automatically

The Bad
• Dependence on MPI (if not already using)
• Need “enough” work to overcome communication cost

[Chart: time (s) and speed-up vs. serial for 1, 2, 3, 4, 6, and 8 GPUs; series: MPI (No Comm), Serial, MPI (w/ Comm); speed-up axis up to 16.00X.]

PERFORMANCE COMPARISON

[Chart: average time (s) vs. number of GPUs for Serial, 1:Many, OpenMP, OpenMP (Async), MPI, and MPI+MPS. Measured with PGI 19.1 on an NVIDIA DGX-1V.]

WHAT ABOUT COMMUNICATION?

MULTI-GPU PROGRAMMING: With MPI and OpenACC

[Diagram: Node 0 through Node N-1, each with a CPU and a GPU (each with its own memory) connected through a PCIe switch, and an InfiniBand adapter per node linking the nodes.]

MPI+OPENACC

//MPI rank 0
#pragma acc host_data use_device( sbuf )
MPI_Send(sbuf, size, MPI_DOUBLE, n-1, tag, MPI_COMM_WORLD);

//MPI rank n-1
#pragma acc host_data use_device( rbuf )
MPI_Recv(rbuf, size, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
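host_data use_device passes the device addresses of sbuf and rbuf to MPI, so a CUDA-aware MPI can move the data directly between GPUs. A minimal sketch of how the two snippets fit inside an OpenACC data region (the surrounding program and buffer setup are assumptions):

#pragma acc data create(sbuf[0:size], rbuf[0:size])
{
    /* ... produce sbuf on the device ... */
    #pragma acc host_data use_device(sbuf, rbuf)
    {
        if (rank == 0)
            MPI_Send(sbuf, size, MPI_DOUBLE, n-1, tag, MPI_COMM_WORLD);
        else if (rank == n-1)
            MPI_Recv(rbuf, size, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    /* ... consume rbuf on the device ... */
}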

MPI+OPENACC (IMPROVED)

//MPI rank 0
#pragma acc update self( sbuf[0:size] )
MPI_Send(sbuf, size, MPI_DOUBLE, n-1, tag, MPI_COMM_WORLD);

//MPI rank n-1
MPI_Recv(rbuf, size, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
#pragma acc update device( rbuf[0:size] )

CONCLUSIONS

CONCLUSIONS: The Hydra has been Tamed.

OpenACC provides a simple API for using multiple devices.

A single CPU thread is not sufficient to feed multiple GPUs.

Multiple OpenMP threads are a simple way to control multiple devices.

MPI is an even simpler way to control multiple devices, if you can overcome the communication cost.

Overlapping data copies is still a benefit in multi-device codes.
