Hybrid Programming in CUDA, OpenMP and MPI

James E. McClure
Advanced Research Computing
22 October 2012

Course Contents

This is a talk about concurrency:
• Concurrency within an individual GPU
• Concurrency within multiple GPU
• Concurrency between GPU and CPU
• Concurrency using multiple CPU cores
• Concurrency across many nodes

Course Contents

Three programming models for achieving concurrency:
• CUDA: Single Instruction Multiple Data (SIMD) programming for GPU
• OpenMP: fork-and-join parallelism for shared memory programming
• MPI: message passing interface for distributed memory programs

Useful links

ARC website:
http://www.arc.vt.edu/

CUDA Programming Guide:
http://docs.nvidia.com/cuda/index.html

CUDA SDK code examples:
http://docs.nvidia.com/cuda/cuda-samples/index.html

OpenMP website:
http://www.openmp.org

LLNL MPI tutorial:
https://computing.llnl.gov/tutorials/mpi/

Hardware Overview

Modern supercomputing nodes are heterogeneous:
• Multiple CPU cores that share memory
• Multiple GPU or other accelerators

[Figure: node diagram — GPU with device memory and L2 cache connected via PCIe to the host CPU cores, L3 cache, and host memory]

Hardware Overview

• GPU and shared memory CPU cores will be programmed with CUDA and OpenMP
• MPI will be used to pass messages between nodes


Examples to Download

ARC HokieSpeed examples:
www.arc.vt.edu/resources/hpc/hokiespeed.php
• wget
• Simple matrix-matrix multiplication example code
• Example code for OpenMP and MPI
• Example code for CUDA and MPI
• Makefiles for the example cases
• Example submission script for HokieSpeed

Compiling with CUDA

To view the modules you have loaded:
• module list

To see a list of available modules:
• module avail

Load the CUDA module by typing:
• module swap gcc
• module load cuda

Compiling with CUDA

The CUDA compiler nvcc compiles by:
• identifying device (i.e. GPU) functions and compiling them itself
• passing host (i.e. CPU) functions through to gcc/g++

To compile, type:
• nvcc -o runme program.cu

To compile with double precision support, type:
• nvcc -o runme -arch=sm_13 program.cu

Compiling CUDA with OpenMP and/or MPI

Compiling with OpenMP:
• nvcc -Xcompiler -fopenmp -lcuda -lcudart -lgomp -o runme program.cu

Compiling with MPI:
• Identify the paths of the MPI include and library directories
• module show openmpi
• -I$(CUDA_INC) -I$(OMPI_INC)
• -L$(OMPI_LIB) -lmpi -L$(CUDA_LIB64) -lcuda -lcudart

Running on HokieSpeed

Use the example runscript to submit a job to the queue.

Request 4 nodes with one MPI process per node:
• #PBS -l nodes=4:ppn=12
• mpiexec -npernode 1 ./run-cuda-mpi

Submit the job to the queue:
• qsub hokiespeed_qsub_example.sh

Monitor the job:
• qstat 1234 (monitor job with id 1234)
• qdel 1234 (kill job with id 1234)

View the output:
• more hokiespeed_qsub_example.sh.o1234

The CUDA Programming Model


Memory Management in CUDA

Example

// Multiplication for an NxN matrix
int N = K*BLOCK_SIZE;
// amount of memory in bytes
int size = N*N*sizeof(float);
float *hA,*hB,*hC; // host (cpu)
float *dA,*dB,*dC; // device (gpu)
hA = new float[N*N];
cudaMalloc(&dA,size);
//.... Initialize arrays on the host
cudaMemcpy(dA,hA,size,cudaMemcpyHostToDevice);
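The matching cleanup is elided on the slide; for reference, a sketch assuming dB, dC, hB, and hC are allocated the same way as dA and hA:

// Release device and host memory when finished
cudaFree(dA); cudaFree(dB); cudaFree(dC);
delete [] hA; delete [] hB; delete [] hC;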

Hierarchy

Example

// Set up threadblocks
dim3 threadBlock(BLOCK_SIZE,BLOCK_SIZE);
dim3 numBlocks(K,K);
// Launch the matrix multiplication kernel
gpuMM<<<numBlocks,threadBlock>>>(dA,dB,dC,N);
//...
// Copy the result back to the CPU
cudaMemcpy(hC,dC,size,cudaMemcpyDeviceToHost);

CUDA kernels

Parts of your program that run on the GPU must be provided as CUDA kernels:

Example

__global__ void gpuMM(float *A, float *B, float *C, int N)
{
    int row,col;
    row = blockIdx.x*blockDim.x+threadIdx.x;
    col = blockIdx.y*blockDim.y+threadIdx.y;
    float sum = 0.f;
    for (int n=0; n<N; ++n)
        sum += A[row*N+n]*B[n*N+col];
    C[row*N+col] = sum;
}

Measuring GPU performance

Example

float time;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start,stream);
/* Do some GPU work and time it */
cudaEventRecord(stop,stream);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time,start,stop);
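A short follow-up to the snippet above: cudaEventElapsedTime reports the elapsed time in milliseconds, and the events should eventually be destroyed.

// time is reported in milliseconds
printf("GPU time: %f ms\n", time);
cudaEventDestroy(start);
cudaEventDestroy(stop);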

Occupancy Considerations for GPU

• Fermi GPU can have up to 48 active warps per SM
• Instructions are issued per warp
• If a warp is not ready, the hardware switches warps (context switching)
• Shared memory use can limit occupancy!

Goal: always have enough active warps to saturate the memory bandwidth of the device

Increasing Occupancy with Multiple Kernels

Suppose you are going to perform multiple matrix-matrix multiplications.

Example

gpuMM<<<numBlocks,threadBlock>>>(dA1,dB1,dC1,N);
gpuMM<<<numBlocks,threadBlock>>>(dA2,dB2,dC2,N);

However, kernels launched from the same stream (in this case the default stream) will execute serially. Kernels launched from different streams can execute concurrently on a single device, if there is room.

CUDA Streams and Events

The CUDA driver API provides streams and events as a way to manage GPU synchronization:

• Synchronization is implied for events within a stream (including the default stream)
• Streams belong to a particular GPU
• More than one stream can be associated with a GPU
• Streams are required if you want to perform asynchronous communication
• Streams are critical if you want concurrency with multiple GPU or multiple kernels on any single GPU

CUDA Streams

Example

// Create a pair of streams
cudaStream_t stream[2];
for (int i=0; i<2; ++i)
    cudaStreamCreate(&stream[i]);
// Launch a kernel from each stream
KernelOne<<<100,512,0,stream[0]>>>(..);
KernelTwo<<<100,512,0,stream[1]>>>(..);
// Destroy the streams
for (int i=0; i<2; ++i)
    cudaStreamDestroy(stream[i]);

Synchronization of Streams and Events

Streams can be synchronized explicitly:

• cudaDeviceSynchronize(): wait for all preceding commands in all streams for a device to complete
• cudaStreamSynchronize(): wait for all preceding events in a specified stream to complete
• cudaStreamWaitEvent(): synchronize a stream with an event (both must be specified)
• cudaStreamQuery(): ask the system if preceding commands in a stream have completed
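For example, cudaStreamWaitEvent can chain work across two streams; a minimal sketch reusing the stream array and the gpuMM arguments from the surrounding examples (the second multiplication consumes the first result):

cudaEvent_t evt;
cudaEventCreate(&evt);
gpuMM<<<numBlocks,threadBlock,0,stream[0]>>>(dA1,dB1,dC1,N);
cudaEventRecord(evt,stream[0]);
// stream[1] will not run work queued after this call
// until evt completes in stream[0]
cudaStreamWaitEvent(stream[1],evt,0);
gpuMM<<<numBlocks,threadBlock,0,stream[1]>>>(dC1,dB2,dC2,N);
cudaEventDestroy(evt);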

CUDA Streams

Two streams will be synchronized implicitly if any of the following operations are issued in between them:
• a page-locked memory allocation (using cudaMallocHost)
• a device memory allocation (using cudaMalloc)
• a memory copy between two devices
• any CUDA command issued to the default stream
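A sketch of the pitfall, reusing names from the earlier examples: allocate before launching, not in between, or the kernels will no longer overlap.

float *dTmp;
gpuMM<<<numBlocks,threadBlock,0,stream[0]>>>(dA1,dB1,dC1,N);
// The allocation below is an implicit synchronization point,
// so the two kernels can no longer execute concurrently
cudaMalloc(&dTmp,size);
gpuMM<<<numBlocks,threadBlock,0,stream[1]>>>(dA2,dB2,dC2,N);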

Using Multiple GPU

Example

int deviceCount;
// How many devices?
cudaGetDeviceCount(&deviceCount);
// Get the properties of all devices
for (int dvc = 0; dvc < deviceCount; ++dvc){
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dvc);
}

Using Multiple GPU

GPU can be controlled by:
• a single CPU thread
• multiple CPU threads belonging to the same process
• multiple CPU threads belonging to different processes

All CUDA calls are issued to the current GPU:

Example

cudaSetDevice(0);
gpuMM<<<numBlocks,threadBlock>>>(dA1,dB1,dC1,N);
cudaSetDevice(1);
gpuMM<<<numBlocks,threadBlock>>>(dA2,dB2,dC2,N);

Streams and Multiple GPU

CUDA streams belong to a device:
• Each device has its own default stream
• Streams belong to the GPU that was current when they were created
• You cannot issue calls to a stream if the associated GPU is not active

Example

cudaStream_t streamA, streamB;
cudaSetDevice(0);
cudaStreamCreate(&streamA);
cudaSetDevice(1);
cudaStreamCreate(&streamB);
// Launch kernels (make each device current before using its stream)
cudaSetDevice(0);
gpuMM<<<numBlocks,threadBlock,0,streamA>>>(dA1,dB1,dC1,N);
cudaSetDevice(1);
gpuMM<<<numBlocks,threadBlock,0,streamB>>>(dA2,dB2,dC2,N);

Peer to Peer Memory Copies

Suppose you want multiple GPU to work together to solve the same problem.

You'll probably need to transfer some data between GPU to do this:

Example

cudaMemcpyPeerAsync(void *dst, int dst_dev,
                    void *src, int src_dev,
                    size_t size, cudaStream_t stream)

• Copies data between two devices
• stream must belong to the source GPU
• A blocking version (cudaMemcpyPeer) exists too!
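A hedged usage sketch, assuming dA0 was allocated on device 0, dA1 on device 1 (both illustrative names), and streamB is the device-1 stream from the previous example:

// Copy size bytes from dA1 (device 1) into dA0 (device 0),
// queued in a stream belonging to the source device
cudaMemcpyPeerAsync(dA0, 0, dA1, 1, size, streamB);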

Exercises: GPU Concurrency

• Implement a GPU timer using streams
• Study the performance of gpuMM as a function of the matrix size N
• Increase the occupancy for small matrices by launching multiple kernels simultaneously
• Perform simultaneous matrix-matrix multiplication using two GPU
• How big do the matrices have to be for using multiple GPU to provide a significant advantage?

Concurrency using CPU and GPU

Kernel launches are asynchronous with respect to the CPU, even within the default stream.

Example

gpuMM<<<numBlocks,threadBlock>>>(dA,dB,dC,N);
// CPU work proceeds while the kernel runs
for (row=0; row<N; row++){
    //... matrix multiply on the CPU
}

Concurrency using CPU and GPU

We don't just pay for the computation, we also pay for CPU ↔ GPU data transfers.

Example

cudaMemcpy(dA,hA,size,cudaMemcpyHostToDevice);
cudaMemcpy(dB,hB,size,cudaMemcpyHostToDevice);
gpuMM<<<numBlocks,threadBlock>>>(dA,dB,dC,N);
cudaMemcpy(hC,dC,size,cudaMemcpyDeviceToHost);

Asynchronous Memory Transfers in CUDA

• We know that CPU ↔ GPU memory transfers are expensive
• We also know that PCIe can perform simultaneous, bi-directional transfers:
    • One cudaMemcpy for Host → Device
    • One cudaMemcpy for Device → Host
• If the memory transfers belong to the same stream they will be synchronized
• We need a way to perform transfers asynchronously to get the full advantage of PCIe

Asynchronous Memory Transfers in CUDA

Example

// Host data MUST be pinned!!!
cudaMallocHost(&cpuData,size);
cudaMalloc(&gpuData,size);
// Bi-directional memory transfer
cudaMemcpyAsync(gpuData,cpuData,size,
                cudaMemcpyHostToDevice,stream[0]);
cudaMemcpyAsync(cpuData,gpuData,size,
                cudaMemcpyDeviceToHost,stream[1]);
// Clean up
cudaFreeHost(cpuData);
cudaFree(gpuData);

Exercises: CPU+GPU Concurrency

• How does simultaneously executing CPU and GPU matrix multiplication affect performance?
• How does the performance of each depend on the problem size?
• How does the performance change if you include memory transfers in the timings?
• Is it possible to overlap memory copies for multiple GPU kernels?

Adding Multiple CPU Cores Using OpenMP

OpenMP uses a fork-and-join model of parallelism to target multi-core CPU:
• The master thread is initiated at run-time and persists throughout
• Worker threads are created within parallel regions

Adding Multiple CPU Cores Using OpenMP

Example

#include <omp.h>
int main (void){
    int omp_threads = 12;
    omp_set_num_threads(omp_threads);
    double starttime = omp_get_wtime();
    //...
    #pragma omp parallel
    {
        // CPU & GPU work within a parallel region
    }
    cudaDeviceSynchronize();
    double stoptime = omp_get_wtime();
    double CPU_time = stoptime-starttime;
}

Adding Multiple CPU Cores Using OpenMP

Example

#pragma omp parallel
{   // Work inside a parallel region
    #pragma omp master
    {   // Manage GPU from the master thread
        cudaMemcpy(...);
        gpuMM<<<numBlocks,threadBlock>>>(dA,dB,dC,N);
    }
    // Use all threads for CPU work
}

Adding Multiple CPU Cores Using OpenMP

• Need to break up the work between threads
• Partition by rows:

[Figure: the NxN matrix divided into contiguous blocks of rows, one block per OpenMP thread]

Adding Multiple CPU Cores Using OpenMP

Example

int partRow = N/omp_threads;
#pragma omp parallel
{   // work within a parallel region
    //... GPU calls from master thread
    int row,col,n; // iterators for each thread
    #pragma omp for schedule(guided,partRow)
    for (row=0; row<N; row++){
        //... matrix multiply on CPU
    }
}
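For concreteness, a sketch of the elided CPU loop body, using the host arrays from the memory-management example (the omp for directive hands each thread a share of the row loop):

for (row=0; row<N; row++){
    for (col=0; col<N; col++){
        float sum = 0.f;
        for (n=0; n<N; n++)
            sum += hA[row*N+n]*hB[n*N+col];
        hC[row*N+col] = sum;
    }
}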

Exercises: OpenMP and CUDA

• Implement the matrix-matrix multiplication using OpenMP
• Show that the OpenMP implementation scales (i.e. set omp_threads=1,2,...,12 and measure the wall time)
• Does parallel initialization of hA affect the performance?
• What happens if you make CUDA calls from worker threads?

Using CUDA with MPI

Example

#include <mpi.h>
#include "cuda.h"
int main(int argc, char **argv){
    int np,rank;
    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD,&np);
    MPI_Comm_rank(MPI_COMM_WORLD,&rank);
    //...
    // Each MPI process assigns work to CPU
    // and GPU using CUDA and/or OpenMP
    //...
    MPI_Finalize();
}
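A common idiom (not shown on the slide) is to bind each MPI rank to a GPU; a sketch assuming the ranks on a node share that node's devices:

int deviceCount;
cudaGetDeviceCount(&deviceCount);
// Map this rank to one of the node's GPUs
cudaSetDevice(rank % deviceCount);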

Using CUDA with MPI

Whether or not MPI calls can reference GPU memory depends on the CUDA version and hardware compute capability:

Example

// pointers to host memory always work
float *buf;
buf = new float[N];
MPI_Send(buf,N,MPI_FLOAT,recvID,tag,
         MPI_COMM_WORLD);

// pointers to device memory only work
// with the newest hardware/CUDA
size = N*sizeof(float);
cudaMalloc(&buf,size);
MPI_Send(buf,N,MPI_FLOAT,recvID,tag,
         MPI_COMM_WORLD);
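On hardware without device-pointer support, the portable pattern (the one exercise 2 on the next slide times) is to stage through the host; a sketch with hbuf as an illustrative host buffer and buf the device pointer from the example above:

// Stage device data through a host buffer before sending
float *hbuf = new float[N];
cudaMemcpy(hbuf,buf,size,cudaMemcpyDeviceToHost);
MPI_Send(hbuf,N,MPI_FLOAT,recvID,tag,MPI_COMM_WORLD);
delete [] hbuf;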

Exercises: MPI and CUDA

Compare the performance of the following for bi-directional communication between two nodes:

1. Use MPI_Sendrecv to send data from a source process to a destination process
2. The following sequence:
    • Copy the data to be sent from the device to the host at the source process
    • Use MPI_Sendrecv to send the data from the source process to the destination process
    • Copy the received data from host to device at the destination process

Modify the example code available from:
http://mpi.deino.net/mpi_functions/MPI_Sendrecv.html

Questions?

Be sure to fill out the FDI evaluation forms:
http://www.fdi.vt.edu/training/evals/index.html
