TAMING THE HYDRA: MULTI-GPU PROGRAMMING WITH OPENACC
Jeff Larkin, GTC 2019, March 2019

WHY USE MULTIPLE DEVICES?
Because 2 > 1 : Effectively using multiple devices means faster time to solution and/or the ability to solve more complex problems
Because they’re there : If you’re running on a machine with multiple devices, you’re wasting resources by not using them all
Because the code already runs on multiple nodes using MPI, so little change is necessary
SUMMIT NODE: (2) IBM POWER9 + (6) NVIDIA VOLTA V100
[Node diagram: two POWER9 CPUs (21 cores each, 4 hardware threads per core), each with 256 GB DDR4 at 135 GB/s, linked by a 64 GB/s inter-CPU bus; each CPU drives three V100 GPUs (16 GB HBM2 at 900 GB/s each) over NVLink2 at 50 GB/s]

DGX-1 NODE

[Node diagram]

DGX-2 NODE

[Node diagram]
MULTI-GPU PROGRAMMING STRATEGIES

MULTI-GPU PROGRAMMING MODELS
1 CPU : Many GPUs
• A single thread will change devices as-needed to send data and kernels to different GPUs
Many Threads : 1 GPU (Each)
• Using OpenMP, Pthreads, or similar, each thread can manage its own GPU
1+ CPU Process : 1 GPU
• Each rank acts as if there’s just 1 GPU, but multiple ranks per node use all GPUs
Many Processes : Many GPUs
• Each rank manages multiple GPUs, multiple ranks/node. Gets complicated quickly!
MULTI-GPU PROGRAMMING MODELS: Trade-offs Between Approaches

1 CPU : Many GPUs
• Conceptually simple
• Easily share data between peer devices
• Requires additional loops
• CPU can become a bottleneck
• Remaining CPU cores often underutilized

Many Threads : 1 GPU (Each)
• Conceptually very simple
• Relies on external threading API
• Watch affinity

1+ CPU : 1 GPU
• Little to no code changes required
• Set and forget the device numbers
• Re-uses existing domain decomposition
• Probably already using MPI
• Watch affinity

Many Processes : Many GPUs
• Can see improved utilization
• Coordinating between GPUs extremely tricky
MULTI-DEVICE OPENACC

OpenACC numbers the available devices 0 to N-1 for each device type.

The device order comes from the runtime and is almost certainly the same as CUDA’s.

By default, all data and work go to the current device. Each device has its own async queues.

Developers must change the current device, and possibly the current device type, using the API.
MULTI-DEVICE CUDA

CUDA exposes all devices by default, numbered 0 to N-1. If the devices are not all the same, it reorders the “best” one to device 0.

Each device has its own pool of streams.

If you do nothing, all work will go to device 0.

Developers must change the current device explicitly using an API.
SAMPLE CODE

SAMPLE CODE: Embarrassingly Parallel Image Filter
SAMPLE FILTER CODE

    // For each pixel
    for (y = 0; y < h; y++) {
      for (x = 0; x < w; x++) {
        double blue = 0.0, green = 0.0, red = 0.0;
        // Apply the filter
        for (int fy = 0; fy < filtersize; fy++) {
          long iy = y - (filtersize/2) + fy;
          for (int fx = 0; fx < filtersize; fx++) {
            long ix = x - (filtersize/2) + fx;
            blue  += filter[fy][fx] * (double)imgData[iy * step + ix * ch];
            green += filter[fy][fx] * (double)imgData[iy * step + ix * ch + 1];
            red   += filter[fy][fx] * (double)imgData[iy * step + ix * ch + 2];
          }
        }
        // Store the values
        out[y * step + x * ch]     = 255 - (scale * blue);
        out[y * step + x * ch + 1] = 255 - (scale * green);
        out[y * step + x * ch + 2] = 255 - (scale * red);
      }
    }
1:MANY

1:MANY – THE PLAN
1. Introduce loop to initialize each device with needed arrays
2. Send each device its block of the image
3. Introduce loop to copy data back from and tear down each device.
SAMPLE FILTER CODE (1:MANY)

    // Set up each device
    for (int device = 0; device < ndevices; device++) {
      acc_set_device_num(device, acc_device_default);
      long lower = device * rows_per_device;
      long upper = MIN(lower + rows_per_device, h);
      long copyLower = MAX(lower - (filtersize/2), 0);
      long copyUpper = MIN(upper + (filtersize/2), h);
      #pragma acc enter data \
        create(imgData[copyLower*step:(copyUpper-copyLower)*step], \
               out[lower*step:(upper-lower)*step]) \
        copyin(filter[:5][:5])
    }

    // Launch work on each device asynchronously
    for (int device = 0; device < ndevices; device++) {
      acc_set_device_num(device, acc_device_default);
      printf("Launching device %d\n", device);
      long lower = device * rows_per_device;
      long upper = MIN(lower + rows_per_device, h);
      long copyLower = MAX(lower - (filtersize/2), 0);
      long copyUpper = MIN(upper + (filtersize/2), h);
      #pragma acc update device(imgData[copyLower*step:(copyUpper-copyLower)*step]) async
      #pragma acc parallel loop present(filter, \
          imgData[copyLower*step:(copyUpper-copyLower)*step], \
          out[lower*step:(upper-lower)*step]) async
      for (y = lower; y < upper; y++) {
        #pragma acc loop
        for (x = 0; x < w; x++) {
          double blue = 0.0, green = 0.0, red = 0.0;
          #pragma acc loop seq
          for (int fy = 0; fy < filtersize; fy++) {
            long iy = y - (filtersize/2) + fy;
            #pragma acc loop seq
            for (int fx = 0; fx < filtersize; fx++) {
              long ix = x - (filtersize/2) + fx;
              if ((iy < 0) || (ix < 0) || (iy >= h) || (ix >= w)) continue;
              blue  += filter[fy][fx] * (double)imgData[iy * step + ix * ch];
              green += filter[fy][fx] * (double)imgData[iy * step + ix * ch + 1];
              red   += filter[fy][fx] * (double)imgData[iy * step + ix * ch + 2];
            }
          }
          out[y * step + x * ch]     = 255 - (scale * blue);
          out[y * step + x * ch + 1] = 255 - (scale * green);
          out[y * step + x * ch + 2] = 255 - (scale * red);
        }
      }
      #pragma acc update self(out[lower*step:(upper-lower)*step]) async
    }

    // Wait for each device and tear down
    for (int device = 0; device < ndevices; device++) {
      acc_set_device_num(device, acc_device_default);
      #pragma acc wait
      long lower = device * rows_per_device;
      long upper = MIN(lower + rows_per_device, h);
      long copyLower = MAX(lower - (filtersize/2), 0);
      long copyUpper = MIN(upper + (filtersize/2), h);
      #pragma acc exit data delete(out[lower*step:(upper-lower)*step], \
          imgData[copyLower*step:(copyUpper-copyLower)*step], filter)
    }
1:MANY – THE GOOD & BAD

The Good
• Relatively simple to understand

The Bad
• Adds code that will be superfluous with a single device or no device
• One thread will eventually become a bottleneck

[Chart: average time (s) and speed-up over serial for 1, 2, 3, 4, 6, and 8 GPUs]
MANY (THREADS) : 1

MANY (THREADS) : 1 – THE PLAN
1. Spawn 1 thread per GPU using OpenMP (any threading model will do)
2. Set the device within each thread.
3. Assign an equal portion of the work to each thread.
4. The threads will automatically synchronize when all threads are complete.
SAMPLE FILTER CODE (OPENMP)

    long rows_per_device = (h + (ndevices - 1)) / ndevices;
    #pragma omp parallel for num_threads(acc_get_num_devices(acc_device_default))
    for (int device = 0; device < ndevices; device++) {
      long lower = device * rows_per_device;
      long upper = MIN(lower + rows_per_device, h);
      long copyLower = MAX(lower - (filtersize/2), 0);
      long copyUpper = MIN(upper + (filtersize/2), h);
      acc_set_device_num(device, acc_device_default);
      #pragma acc declare copyin(filter)
      #pragma acc parallel loop \
          copyin(imgData[copyLower*step:(copyUpper-copyLower)*step]) \
          copyout(out[lower*step:(upper-lower)*step]) present(filter)
      for (y = lower; y < upper; y++) {
        for (x = 0; x < w; x++) {
          // Apply filter
        }
      }
    } // end omp parallel for
MANY (THREADS) : 1 – THE GOOD & BAD

The Good
• Relatively simple to understand
• Fewer code changes

The Bad
• Introduces an additional threading API
• Need to be conscious of affinity (OMP_PLACES)

[Chart: average time (s) and speed-up over serial for 1, 2, 3, 4, 6, and 8 GPUs, with thread affinity]
ABOUT AFFINITY

    $ nvidia-smi topo -m
            GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  mlx5_0 mlx5_2 mlx5_1 mlx5_3  CPU Affinity
    GPU0     X    NV1   NV1   NV2   NV2   SYS   SYS   SYS   PIX    SYS    PHB    SYS     0-19,40-59
    GPU1    NV1    X    NV2   NV1   SYS   NV2   SYS   SYS   PIX    SYS    PHB    SYS     0-19,40-59
    GPU2    NV1   NV2    X    NV2   SYS   SYS   NV1   SYS   PHB    SYS    PIX    SYS     0-19,40-59
    GPU3    NV2   NV1   NV2    X    SYS   SYS   SYS   NV1   PHB    SYS    PIX    SYS     0-19,40-59
    GPU4    NV2   SYS   SYS   SYS    X    NV1   NV1   NV2   SYS    PIX    SYS    PHB     20-39,60-79
    GPU5    SYS   NV2   SYS   SYS   NV1    X    NV2   NV1   SYS    PIX    SYS    PHB     20-39,60-79
    GPU6    SYS   SYS   NV1   SYS   NV1   NV2    X    NV2   SYS    PHB    SYS    PIX     20-39,60-79
    GPU7    SYS   SYS   SYS   NV1   NV2   NV1   NV2    X    SYS    PHB    SYS    PIX     20-39,60-79
    mlx5_0  PIX   PIX   PHB   PHB   SYS   SYS   SYS   SYS    X     SYS    PHB    SYS
    mlx5_2  SYS   SYS   SYS   SYS   PIX   PIX   PHB   PHB   SYS     X     SYS    PHB
    mlx5_1  PHB   PHB   PIX   PIX   SYS   SYS   SYS   SYS   PHB    SYS     X     SYS
    mlx5_3  SYS   SYS   SYS   SYS   PHB   PHB   PIX   PIX   SYS    PHB    SYS     X
• GPUs have affinity to certain CPUs, meaning they connect directly rather than through an inter-CPU bus
• OMP_PLACES, numactl, or your job launcher can help place your threads close to each GPU
MANY (THREADS) : 1 (IMPROVED)

MANY (THREADS) : 1 IMPROVED – THE PLAN
1. Spawn 1 thread per GPU using OpenMP (any threading model will do)
2. Set the device within each thread.
3. Assign an equal portion of the work to each thread.
4. Break the work further into blocks and use asynchronous queues to overlap copies and computation
5. Synchronize when all threads are complete.
SAMPLE FILTER CODE

    int ndevices = acc_get_num_devices(acc_device_default);
    long rows_per_device = (h + (ndevices - 1)) / ndevices;
    long rows_per_block  = (h + (ndevices*3 - 1)) / (ndevices*3);
    #pragma omp parallel num_threads(ndevices)
    {
      int tid = omp_get_thread_num();
      acc_set_device_num(tid, acc_device_default);
      #pragma acc declare copyin(filter)
      long lower = tid * rows_per_device;
      long upper = MIN(lower + rows_per_device, h);
      long copyLower = MAX(lower - (filtersize/2), 0);
      long copyUpper = MIN(upper + (filtersize/2), h);
      #pragma acc data create(imgData[copyLower*step:(copyUpper-copyLower)*step]) \
                       create(out[lower*step:(upper-lower)*step])
      #pragma omp for
      for (int block = 0; block < ndevices*3; block++) {
        lower = block * rows_per_block;
        upper = MIN(lower + rows_per_block, h);
        copyLower = MAX(lower - (filtersize/2), 0);
        copyUpper = MIN(upper + (filtersize/2), h);
        #pragma acc update device(imgData[copyLower*step:(copyUpper-copyLower)*step]) async(block)
        #pragma acc parallel loop default(present) async(block)
        for (y = lower; y < upper; y++) {
          #pragma acc loop
          for (x = 0; x < w; x++) {
            // Apply filter
          }
        }
        #pragma acc update self(out[lower*step:(upper-lower)*step]) async(block)
      } // end omp for
      #pragma acc wait
    } // end omp parallel
MANY (THREADS) : 1 (IMPROVED) – THE GOOD & BAD

The Good
• Relatively simple to understand
• Fewer code changes
• Overlaps PCIe copies

The Bad
• Introduces an additional threading API
• Need to be conscious of affinity (OMP_PLACES)

[Chart: average time (s) and speed-up over serial for 1, 2, 3, 4, 6, and 8 GPUs, async version]
1+ (RANK) : 1 (GPU)

1+ (RANK) : 1 (GPU) – THE PLAN
1. Use MPI to decompose the work among ranks
2. Assign each rank a GPU
3. Reassemble the result
SAMPLE FILTER CODE (MPI)

    int ndevices = acc_get_num_devices(acc_device_default);
    acc_set_device_num(rank % ndevices, acc_device_default);

    MPI_Scatterv((void*)imgData, sendcounts, senddispls, MPI_UNSIGNED_CHAR,
                 (void*)(imgData + copyLower*step), (int)copySize,
                 MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);

    #pragma acc parallel loop copyin(imgData[copyLower*step:copySize]) \
        copy(out[lower*step:size]) \
        copyin(filter[:5][:5])
    for (y = lower; y < upper; y++) {
      for (x = 0; x < w; x++) {
        // Apply filter
      }
    }

    MPI_Gatherv((void*)(out + lower*step), (int)size, MPI_UNSIGNED_CHAR,
                (void*)out, recvcounts, recvdispls, MPI_UNSIGNED_CHAR,
                0, MPI_COMM_WORLD);
1:1 – THE GOOD & BAD

The Good
• Little/no code change if you already use MPI
• Affinity handled by MPI launcher

The Bad
• Dependence on MPI (if not already using it)
• Need “enough” work to overcome communication cost

[Chart: average time (s) and speed-up over serial for 1, 2, 3, 4, 6, and 8 GPUs, shown with and without communication cost]
1+ (RANK) : 1 (GPU) (IMPROVED)

1+ (RANK) : 1 (GPU) IMPROVED – THE PLAN
1. Use MPI to decompose the work among ranks
2. Start CUDA MPS & launch multiple ranks / GPU
3. Assign each rank a GPU
4. Reassemble the result
SAMPLE FILTER CODE (MPI + MPS)

    int ndevices = acc_get_num_devices(acc_device_default);
    int sharing = nranks / ndevices;
    acc_set_device_num(rank / sharing, acc_device_default);

    MPI_Scatterv((void*)imgData, sendcounts, senddispls, MPI_UNSIGNED_CHAR,
                 (void*)(imgData + copyLower*step), (int)copySize,
                 MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);

    #pragma acc parallel loop copyin(imgData[copyLower*step:copySize]) \
        copy(out[lower*step:size]) \
        copyin(filter[:5][:5])
    for (y = lower; y < upper; y++) {
      for (x = 0; x < w; x++) {
        // Apply filter
      }
    }

    MPI_Gatherv((void*)(out + lower*step), (int)size, MPI_UNSIGNED_CHAR,
                (void*)out, recvcounts, recvdispls, MPI_UNSIGNED_CHAR,
                0, MPI_COMM_WORLD);
1:1 (IMPROVED) – THE GOOD & BAD

The Good
• Little/no code change if you already use MPI
• Affinity handled by MPI launcher
• Memcopies overlapped automatically

The Bad
• Dependence on MPI (if not already using it)
• Need “enough” work to overcome communication cost

[Chart: average time (s) and speed-up over serial for 1, 2, 3, 4, 6, and 8 GPUs, shown with and without communication cost]
PERFORMANCE COMPARISON

[Chart: average time (s) vs. number of GPUs (1–8) for Serial, 1:Many, OpenMP, OpenMP (Async), MPI, and MPI+MPS. PGI 19.1, NVIDIA DGX-1V.]

WHAT ABOUT COMMUNICATION?
MULTI-GPU PROGRAMMING WITH MPI AND OPENACC

[Diagram: N nodes, each with a GPU and CPU with their own memories, the GPU attached through a PCIe switch, and the nodes linked by InfiniBand]
MPI+OPENACC

    // MPI rank 0
    #pragma acc host_data use_device(sbuf)
    MPI_Send(sbuf, size, MPI_DOUBLE, n-1, tag, MPI_COMM_WORLD);

    // MPI rank n-1
    #pragma acc host_data use_device(rbuf)
    MPI_Recv(rbuf, size, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI+OPENACC (IMPROVED)

    // MPI rank 0
    #pragma acc update self(sbuf[0:size])
    MPI_Send(sbuf, size, MPI_DOUBLE, n-1, tag, MPI_COMM_WORLD);

    // MPI rank n-1
    MPI_Recv(rbuf, size, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    #pragma acc update device(rbuf[0:size])
CONCLUSIONS – The Hydra Has Been Tamed
OpenACC provides a simple API for using multiple devices.

A single CPU thread is often not sufficient to feed multiple GPUs.

Multiple OpenMP threads are a simple way to control multiple devices.

MPI is an even simpler way to control multiple devices, if you can overcome the communication cost.

Overlapping data copies is still a benefit in multi-device codes.