Hybrid Programming in CUDA, OpenMP and MPI

James E. McClure
Advanced Research Computing
22 October 2012

Introduction
Heterogeneous Computing
CUDA Overview
CPU + GPU
CUDA and OpenMP
CUDA and MPI

Course Contents

This is a talk about concurrency:
• Concurrency within an individual GPU
• Concurrency across multiple GPU
• Concurrency between GPU and CPU
• Concurrency using shared memory CPU
• Concurrency across many nodes in distributed memory

Course Contents

Three programming models for achieving concurrency:
• CUDA: Single Instruction Multiple Data (SIMD) style programming for GPU (NVIDIA's own term is SIMT, Single Instruction Multiple Thread)
• OpenMP: fork-and-join parallelism for shared memory programming
• MPI: message passing interface for distributed memory programs

Useful links

ARC website:
http://www.arc.vt.edu/

CUDA Programming Guide:
http://docs.nvidia.com/cuda/index.html

CUDA SDK code examples:
http://docs.nvidia.com/cuda/cuda-samples/index.html

OpenMP website:
http://www.openmp.org

LLNL MPI tutorial:
https://computing.llnl.gov/tutorials/mpi/

Hardware Overview

Modern supercomputing nodes are heterogeneous:
• Multiple CPU cores that share memory
• Multiple GPU or other accelerators

[Figure: node diagram with labels L3, L2, Device Memory, PCIe, Host Memory]

Hardware Overview

• GPU and shared memory CPU cores will be programmed with CUDA and OpenMP
• MPI will be used to pass messages between nodes

Examples to Download

ARC HokieSpeed Examples:
www.arc.vt.edu/resources/hpc/hokiespeed.php
• wget

Compiling with CUDA

To view the modules you have loaded:
• module list

To see a list of available modules:
• module avail

Load the compiler by typing:
• module swap intel gcc
• module load cuda

Compiling with CUDA

The CUDA compiler nvcc compiles by:
• identifying device (i.e. GPU) functions and compiling them
• passing host (i.e. CPU) functions to gcc/g++

To compile, type:
• nvcc -o runme program.cu

To compile with double precision support, type:
• nvcc -o runme -arch sm_13 program.cu
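
As a rough illustration of the device/host split described above, here is a minimal sketch of what a file such as program.cu might contain. The kernel addOne, the host functions hostSum and runAddOne, and the array size are all assumptions made for this example; they are not taken from the course materials.

```cuda
// program.cu: hypothetical file, not part of the course examples
#include <cstdio>
#include <cuda_runtime.h>

// Device code: nvcc compiles __global__ functions for the GPU
__global__ void addOne(int *x, int n)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1;
}

// Host code: nvcc passes ordinary functions on to gcc/g++
int hostSum(const int *x, int n)
{
    int s = 0;
    for (int i = 0; i < n; ++i) s += x[i];
    return s;
}

// Returns 0 if the GPU result matches the expected sum
int runAddOne()
{
    const int n = 256;
    int h[n];
    for (int i = 0; i < n; ++i) h[i] = i;

    int *d = NULL;
    if (cudaMalloc(&d, n*sizeof(int)) != cudaSuccess) return 1;
    cudaMemcpy(d, h, n*sizeof(int), cudaMemcpyHostToDevice);
    addOne<<<(n + 127)/128, 128>>>(d, n);
    // This blocking copy waits for the kernel to finish
    cudaError_t err = cudaMemcpy(h, d, n*sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return 1;

    // Every entry grew by one, so the sum is 0+1+...+(n-1) plus n
    int expected = n*(n - 1)/2 + n;
    return hostSum(h, n) == expected ? 0 : 1;
}

int main()
{
    int status = runAddOne();
    printf("%s\n", status == 0 ? "ok" : "failed");
    return status;
}
```

Building this with the first nvcc command shown above should produce an executable named runme; the program exits with status 0 when the device and host results agree.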

Compiling CUDA with OpenMP and/or MPI

Compiling with OpenMP:
• nvcc -Xcompiler -fopenmp -lcuda -lcudart -lgomp -o runme program.cu

Compiling with MPI:
• Identify the path of the MPI library and include directories
• module show openmpi
• -I$(CUDA_INC) -I$(OMPI_INC)
• -L$(OMPI_LIB) -lmpi -L$(CUDA_LIB64) -lcuda -lcudart
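
To try out the OpenMP build line, a small test file can help confirm that both OpenMP and the CUDA runtime are linked correctly. The sketch below is one possibility; the function countThreads and the printed messages are assumptions for illustration only.

```cuda
// Hypothetical program.cu for checking an OpenMP + CUDA build
#include <cstdio>
#include <omp.h>
#include <cuda_runtime.h>

// Count the threads that execute an OpenMP parallel region
int countThreads()
{
    int count = 0;
    #pragma omp parallel reduction(+:count)
    {
        count += 1;   // each thread contributes one
    }
    return count;
}

int main()
{
    int deviceCount = 0;
    // On failure (for example, no driver installed) report zero devices
    if (cudaGetDeviceCount(&deviceCount) != cudaSuccess) deviceCount = 0;

    printf("OpenMP threads: %d (max %d)\n",
           countThreads(), omp_get_max_threads());
    printf("CUDA devices:   %d\n", deviceCount);
    return 0;
}
```

If the host compiler is not given -fopenmp, the pragma is ignored and the omp_ call fails to link, which makes a missing flag easy to spot.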

Running on HokieSpeed

Use the example runscript to submit a job to the queue.
Request 4 nodes and 1 MPI process for each node:
• #PBS -l nodes=4:ppn=12
• mpiexec -npernode 1 ./run-cuda-mpi

Submit the job to the queue:
• qsub hokiespeed_qsub_example.sh

Monitor the job:
• qstat 1234 (monitor job with id 1234)
• qdel 1234 (kill job with id 1234)

View the output:
• more hokiespeed_qsub_example.sh.o1234

The CUDA Programming Model

[Figure: the CUDA programming model]

Memory Management in CUDA

Example

    // Multiplication for an NxN matrix
    int N = K*BLOCK_SIZE;
    // amount of memory in bytes (size_t avoids overflow for large N)
    size_t size = (size_t)N*N*sizeof(float);
    float *hA,*hB,*hC; // host (cpu)
    float *dA,*dB,*dC; // device (gpu)
    hA = new float[N*N];
    cudaMalloc(&dA,size);
    //.... Initialize arrays on the host
    cudaMemcpy(dA,hA,size,cudaMemcpyHostToDevice);
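
The allocation and copy calls above omit error handling and cleanup. A minimal sketch of the full lifecycle, assuming a hypothetical CHECK macro and a hypothetical roundTrip helper (neither comes from the talk), and assumed values for K and BLOCK_SIZE, might look like this:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA call fails
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t e = (call);                                   \
        if (e != cudaSuccess) {                                   \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(e));                       \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Copy an NxN host array to the device and back again.
// Returns the number of entries that did not survive the round trip.
int roundTrip(int N)
{
    size_t count = (size_t)N*N;
    size_t size = count*sizeof(float);
    float *hA = new float[count];   // source on the host
    float *hB = new float[count];   // destination on the host
    for (size_t i = 0; i < count; ++i) { hA[i] = (float)i; hB[i] = -1.f; }

    float *dA = NULL;
    CHECK(cudaMalloc(&dA, size));
    CHECK(cudaMemcpy(dA, hA, size, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(hB, dA, size, cudaMemcpyDeviceToHost));
    CHECK(cudaFree(dA));            // release device memory

    int mismatches = 0;
    for (size_t i = 0; i < count; ++i)
        if (hA[i] != hB[i]) ++mismatches;

    delete[] hA;                    // release host memory
    delete[] hB;
    return mismatches;
}

int main()
{
    const int K = 4, BLOCK_SIZE = 16;   // assumed values
    int bad = roundTrip(K*BLOCK_SIZE);
    printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

Checking every return code is a design choice rather than a requirement, but it tends to surface problems such as a missing device or an out-of-memory condition at the call that caused them.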

Thread Hierarchy

Example

    // Set up threadblocks
    dim3 threadBlock(BLOCK_SIZE,BLOCK_SIZE);
    dim3 numBlocks(K,K);
    // Launch the matrix multiplication kernel
    gpuMM<<<numBlocks,threadBlock>>>(...);
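
The block and grid setup above can be sketched end to end as follows. This is an illustration under stated assumptions, not the talk's implementation: the kernel mmSketch, the helper runSketch, the row-major storage layout, and BLOCK_SIZE = 16 are all hypothetical choices made for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK_SIZE 16  // assumed value; the talk does not state one

// Hypothetical kernel (not the talk's gpuMM): C = A*B, row-major NxN
__global__ void mmSketch(const float *A, const float *B, float *C, int N)
{
    // One thread per output entry, located by block and thread index
    int row = blockIdx.x*blockDim.x + threadIdx.x;
    int col = blockIdx.y*blockDim.y + threadIdx.y;
    if (row >= N || col >= N) return;
    float sum = 0.f;
    for (int n = 0; n < N; ++n)
        sum += A[row*N + n]*B[n*N + col];
    C[row*N + col] = sum;
}

// Squares an NxN matrix of ones; every entry of the result should equal N.
// Returns the number of wrong entries, or -1 on a CUDA error.
int runSketch(int K)
{
    int N = K*BLOCK_SIZE;
    size_t size = (size_t)N*N*sizeof(float);
    float *hA = new float[(size_t)N*N];
    float *hC = new float[(size_t)N*N];
    for (int i = 0; i < N*N; ++i) { hA[i] = 1.f; hC[i] = 0.f; }

    float *dA = NULL, *dC = NULL;
    cudaMalloc(&dA, size);
    cudaMalloc(&dC, size);
    cudaMemcpy(dA, hA, size, cudaMemcpyHostToDevice);

    // K x K blocks of BLOCK_SIZE x BLOCK_SIZE threads cover the N x N output
    dim3 threadBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 numBlocks(K, K);
    mmSketch<<<numBlocks, threadBlock>>>(dA, dA, dC, N);

    // The blocking copy waits for the kernel and reports earlier errors
    cudaError_t err = cudaMemcpy(hC, dC, size, cudaMemcpyDeviceToHost);

    int wrong = 0;
    for (int i = 0; i < N*N; ++i)
        if (hC[i] != (float)N) ++wrong;

    cudaFree(dA);
    cudaFree(dC);
    delete[] hA;
    delete[] hC;
    return err == cudaSuccess ? wrong : -1;
}

int main()
{
    int wrong = runSketch(4);   // N = 64
    printf("wrong entries: %d\n", wrong);
    return wrong == 0 ? 0 : 1;
}
```

Because N is an exact multiple of BLOCK_SIZE here, the bounds check in the kernel never triggers; it is kept so the same kernel remains safe if the grid is rounded up for other sizes.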

CUDA kernels

Parts of your program that run on GPU must be provided as CUDA kernels:

Example

    __global__ void gpuMM(float *A, float *B, ...)
    {
        int row,col;
        row = blockIdx.x*blockDim.x+threadIdx.x;
        col = blockIdx.y*blockDim.y+threadIdx.y;
        float sum = 0.f;
        for(int n=0; n ...

Measuring GPU performance

Example

    float time; // elapsed time in milliseconds
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start,stream);
    /* Do some GPU work and time it */
    cudaEventRecord(stop,stream);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time,start,stop);

Occupancy Considerations for GPU

• Fermi GPU can have up to 48 active warps per SM
• Instructions are issued per warp
• If a warp is not ready, the hardware switches warps (context switching)
• Shared memory can limit occupancy!

[Figure: GPU diagram with labels L2 and Device Memory]

Goal: always have enough active warps to saturate the memory bandwidth of the device

Increasing Occupancy with Multiple Kernels

Suppose you are going to perform multiple matrix-matrix multiplications

Example

    gpuMM<<<...>>>(...);
    ...

However, kernels launched from the same stream (in this case the default stream) will execute serially.

Kernels launched from different streams can execute concurrently on a single device, if there is room.

CUDA Streams and Events

CUDA provides streams and events as a way to manage GPU synchronization:
• Synchronization is implied for events within a stream (including the default stream)
• Streams belong to a particular GPU
• More than one stream can be associated with a GPU
• Streams are required if you want to perform asynchronous communication
• Streams are critical if you want concurrency with multiple GPU or multiple kernels on any single GPU

CUDA Streams

Example

    // Create a pair of streams
    cudaStream_t stream[2];
    for(int i=0; i<2; ++i)
        cudaStreamCreate(&stream[i]);
    // Launch a kernel from each stream
    KernelOne<<<100,512,0,stream[0]>>>(..);
    KernelTwo<<<100,512,0,stream[1]>>>(..);
    // Destroy the streams
    for(int i=0; i<2; ++i)
        cudaStreamDestroy(stream[i]);

Synchronization of Streams and Events

Streams can be synchronized explicitly:
• cudaDeviceSynchronize(): wait for all preceding commands in all streams for a device to complete
• cudaStreamSynchronize(): wait for all preceding commands in a specified stream to complete
• cudaStreamWaitEvent(): synchronize a stream with an event (both must be specified)
• cudaStreamQuery(): ask the system if preceding commands in a stream have completed
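
The explicit synchronization calls listed above can be combined as in the following sketch, where a kernel in one stream is made to wait for a kernel in another. The kernels fillValue and scaleBy and the helper runStreams are hypothetical names introduced only for this illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernels used only for this illustration
__global__ void fillValue(float *x, int n, float v)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) x[i] = v;
}

__global__ void scaleBy(float *x, int n, float s)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

// Returns 0 if scaleBy ran after fillValue, so that x[0] == 2*3
int runStreams()
{
    const int n = 1 << 16;
    const int threads = 256, blocks = (n + threads - 1)/threads;
    float *d = NULL;
    if (cudaMalloc(&d, n*sizeof(float)) != cudaSuccess) return 1;

    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&stream[i]);
    cudaEvent_t filled;
    cudaEventCreate(&filled);

    fillValue<<<blocks, threads, 0, stream[0]>>>(d, n, 2.f);
    cudaEventRecord(filled, stream[0]);   // marks the end of fillValue

    // Work queued in stream[1] after this call waits for the event
    cudaStreamWaitEvent(stream[1], filled, 0);
    scaleBy<<<blocks, threads, 0, stream[1]>>>(d, n, 3.f);

    // Non-blocking check: cudaSuccess once all work in the stream is done
    cudaError_t q = cudaStreamQuery(stream[1]);
    printf("stream[1] %s\n",
           q == cudaSuccess ? "already finished" : "not finished yet");

    // Block the host until everything queued in stream[1] has completed
    cudaStreamSynchronize(stream[1]);

    float first = 0.f;
    cudaMemcpy(&first, d, sizeof(float), cudaMemcpyDeviceToHost);

    cudaEventDestroy(filled);
    for (int i = 0; i < 2; ++i) cudaStreamDestroy(stream[i]);
    cudaFree(d);
    return first == 6.f ? 0 : 1;
}

int main()
{
    int status = runStreams();
    printf("%s\n", status == 0 ? "ok" : "failed");
    return status;
}
```

Without the cudaStreamWaitEvent call the two kernels would be free to run in either order, so the event is what makes the final value well defined.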

CUDA Streams

Two streams will be synchronized implicitly if any of the following operations are issued in between them:
• a page-locked memory allocation (using cudaMallocHost)
• a device memory allocation (using cudaMalloc)
• a memory copy between two devices
• any CUDA command to the default stream

Using Multiple GPU

Example

    int deviceCount;
    // How many devices?
    cudaGetDeviceCount(&deviceCount);
    // Get the properties of all devices
    for (int dvc=0; dvc ...

Using Multiple GPU

GPU can be controlled by:
• a single CPU thread
• multiple CPU threads belonging to the same process
• multiple CPU threads belonging to different processes

All CUDA calls are issued to the current GPU:

Example

    cudaSetDevice(0);
    gpuMM<<<...>>>(...);
    ...

Streams and Multiple GPU

CUDA streams belong to a device:
• Each device has its own default stream
• A stream belongs to the GPU that was current when it was created
• You cannot issue calls to a stream if the associated GPU is not active

Example

    cudaSetDevice(0);
    cudaStreamCreate(&streamA);
    cudaSetDevice(1);
    cudaStreamCreate(&streamB);
    // Launch kernels, making each stream's GPU current first
    cudaSetDevice(0);
    gpuMM<<<...,streamA>>>(dA1,dB1,dC1,N);
    cudaSetDevice(1);
    gpuMM<<<...,streamB>>>(dA2,dB2,dC2,N);

Peer to Peer Memory Copies

Suppose you want to use multiple GPU to work together and solve the same problem
You'll probably need to transfer some data between GPU to do this:

Example

    cudaMemcpyPeerAsync(void *dst, int dst_dev,
                        const void *src, int src_dev, size_t size,
                        cudaStream_t stream)

• Copies data between two devices
• stream must belong to the source GPU
• A blocking version (cudaMemcpyPeer) exists too!

Exercises: GPU Concurrency

• Implement a GPU timer using streams
• Study the performance of gpuMM as a function of the matrix size N
• Increase the occupancy for small matrices by launching multiple kernels simultaneously
• Perform simultaneous matrix-matrix multiplication using two GPU
• How big do the matrices have to be for multiple GPU to provide a significant advantage?

Concurrency using CPU and GPU

Kernel launches are asynchronous with respect to the CPU, even within the default stream

Example

    gpuMM<<<...>>>(...);
    for (row=0; row ...

Concurrency using CPU and GPU

We don't just pay for the computation, we also pay for CPU ↔ GPU data transfers

Example

    cudaMemcpy(dA,hA,size,cudaMemcpyHostToDevice);
    cudaMemcpy(dB,hB,size,cudaMemcpyHostToDevice);
    gpuMM<<<...>>>(...);
    ...

Asynchronous Memory Transfers in CUDA

• We know that CPU ↔ GPU memory transfers are expensive
• We also know that PCIe can perform simultaneous, bi-directional transfers:
  • One cudaMemcpy for Host → Device
  • One cudaMemcpy for Device → Host
• If the memory transfers belong to the same stream they will be synchronized
• We need a way to perform transfers asynchronously to get the full advantage of PCIe

Asynchronous Memory Transfers in CUDA

Example

    // Host data MUST be pinned!!!
    cudaMallocHost(&cpuData,size);
    cudaMalloc(&gpuData,size);
    // Bi-directional memory transfer
    cudaMemcpyAsync(gpuData,cpuData,size,
                    cudaMemcpyHostToDevice,stream[0]);
    cudaMemcpyAsync(cpuData,gpuData,size,
                    cudaMemcpyDeviceToHost,stream[1]);
    // Clean up

Exercises: CPU+GPU Concurrency

• How does simultaneously executing CPU and GPU matrix multiplication affect performance?
• How does the performance of each depend on the problem size?
• How does the performance change if you include memory transfers in the timings?
• Is it possible to overlap memory copies for multiple GPU kernels?

Adding Multiple CPU Cores Using OpenMP

OpenMP uses a fork-and-join model of parallelism to target multi-core CPU
• The master thread is initiated at run-time and persists throughout
• Worker threads are created within parallel regions

Adding Multiple CPU Cores Using OpenMP

Example

    #include <omp.h>
    int omp_threads = 12;
    omp_set_num_threads(omp_threads);
    double starttime = omp_get_wtime();
    //...
    #pragma omp parallel
    {
        // CPU & GPU work within a parallel region
    }
    cudaDeviceSynchronize();
    double stoptime = omp_get_wtime();
    double CPU_time = stoptime-starttime;
    }

Adding Multiple CPU Cores Using OpenMP

Example

    #pragma omp parallel
    { // Work inside a parallel region
        #pragma omp master
        { // Manage GPU from master thread
            cudaMemcpy(...);
            gpuMM<<<...>>>(...);
            ...

Adding Multiple CPU Cores Using OpenMP

• Need to break up the work between threads
• Partition by rows:

Adding Multiple CPU Cores Using OpenMP

Example

    int partRow = N/omp_threads;
    #pragma omp parallel
    { // work within a parallel region
        //... GPU calls from master thread
        int row,col,n; // iterators for each thread
        #pragma omp for schedule(guided,partRow)
        for (row=0; row ...
            //... matrix multiply on CPU
        }
    }

Exercises: OpenMP and CUDA

• Implement the matrix-matrix multiplication using OpenMP
• Show that the OpenMP implementation scales (i.e. set omp_threads=1,2,...12 and measure wall time)
• Does parallel initialization of hA affect the performance?
• What happens if you make CUDA calls from worker threads?

Using CUDA with MPI

Example

    #include <mpi.h>
    int main(int argc, char**argv){
        int np,rank;
        MPI_Init(&argc,&argv);
        MPI_Comm_size(MPI_COMM_WORLD,&np);
        MPI_Comm_rank(MPI_COMM_WORLD,&rank);
        //...
        // Each MPI process assigns work to CPU
        // and GPU using CUDA and/or OpenMP
        //...
        MPI_Finalize();
    }

Using CUDA with MPI

Whether or not MPI calls can reference GPU memory depends on the CUDA version and hardware compute capability:

Example

    // pointers to host memory always work
    float *buf = new float[N];
    MPI_Send(buf,N,MPI_FLOAT,recvID,tag,
             MPI_COMM_WORLD);
    // pointers to device memory only work
    // with newest hardware/CUDA
    float *dbuf;
    size_t size = N*sizeof(float);
    cudaMalloc(&dbuf,size);
    MPI_Send(dbuf,N,MPI_FLOAT,recvID,tag,
             MPI_COMM_WORLD);

Exercises: MPI and CUDA

Compare the performance of the following for bi-directional communication between two nodes:
1 Use MPI_Sendrecv to send data from a source process to a destination process
2 The following sequence:
  • Copy data to be sent from the device to the host at the source process
  • Use MPI_Sendrecv to send data from the source process to a destination process
  • Copy received data from host to device at the destination process

Modify the example code available from:
http://mpi.deino.net/mpi_functions/MPI_Sendrecv.html

Questions?

Be sure to fill out the FDI evaluation forms:
http://www.fdi.vt.edu/training/evals/index.html