Hybrid Programming in CUDA, OpenMP and MPI

James E. McClure
Advanced Research Computing
22 October 2012

Outline
• Introduction
• Heterogeneous Computing
• CUDA Overview
• CPU + GPU
• CUDA and OpenMP
• CUDA and MPI

Course Contents

This is a talk about concurrency:
• Concurrency within individual GPU
• Concurrency within multiple GPU
• Concurrency between GPU and CPU
• Concurrency using shared memory CPU
• Concurrency across many nodes in distributed memory

Course Contents

Three programming models for achieving concurrency:
• CUDA: Single Instruction Multiple Data (SIMD) programming for GPU
• OpenMP: Fork-and-join parallelism for shared memory programming
• MPI: Message Passing Interface for distributed memory programs

Useful links

ARC website: http://www.arc.vt.edu/
CUDA Programming Guide: http://docs.nvidia.com/cuda/index.html
CUDA SDK code examples: http://docs.nvidia.com/cuda/cuda-samples/index.html
OpenMP website: http://www.openmp.org
LLNL MPI tutorial: https://computing.llnl.gov/tutorials/mpi/

Hardware Overview

Modern supercomputing nodes are heterogeneous:
• Multiple CPU cores that share memory
• Multiple GPU or other accelerators

[Figure: node hardware diagram; surviving labels: L3, L2, Device Memory, PCIe, Host Memory]

Hardware Overview

• GPU and shared memory CPU cores will be programmed with CUDA and OpenMP
• MPI will be used to pass messages between nodes

Examples to Download

ARC HokieSpeed Examples: www.arc.vt.edu/resources/hpc/hokiespeed.php
• wget <copy link>
• Simple matrix-matrix multiplication example code
• Example code for OpenMP and MPI
• Example code for CUDA and MPI
• Makefiles for example cases
• Example submission script for HokieSpeed

Compiling with CUDA

To view the modules you have loaded:
• module list

To see a list of available modules:
• module avail

Load the compiler by typing:
• module swap intel gcc
• module load cuda

Compiling with CUDA

The CUDA compiler nvcc compiles by:
• identifying device (i.e. GPU) functions and compiling them
• passing host (i.e. CPU) functions to gcc/g++

To compile, type:
• nvcc -o runme program.cu

To compile with double precision support, type:
• nvcc -o runme -arch sm_13 program.cu

Compiling CUDA with OpenMP and/or MPI

Compiling with OpenMP:
• nvcc -Xcompiler -fopenmp -lcuda -lcudart -lgomp -o runme program.cu

Compiling with MPI:
• Identify the path of the MPI library and include directories
• module show openmpi
• -I$(CUDA_INC) -I$(OMPI_INC)
• -L$(OMPI_LIB) -lmpi -L$(CUDA_LIB64) -lcuda -lcudart

Running on HokieSpeed

Use the example runscript to submit a job to the queue.

Request 4 nodes and 1 MPI process for each node:
• #PBS -l nodes=4:ppn=12
• mpiexec -npernode 1 ./run-cuda-mpi

Submit the job to the queue:
• qsub hokiespeed_qsub_example.sh

Monitor the job:
• qstat 1234 (monitor job with id 1234)
• qdel 1234 (kill job with id 1234)

View the output:
• more hokiespeed_qsub_example.sh.o1234

The CUDA Programming Model

Memory Management in CUDA

Example

    // Multiplication for an NxN matrix
    int N = K*BLOCK_SIZE;
    // amount of memory in bytes
    int size = N*N*sizeof(float);
    float *hA,*hB,*hC; // host (cpu)
    float *dA,*dB,*dC; // device (gpu)
    hA = new float[N*N];
    cudaMalloc(&dA,size);
    //.... Initialize arrays on the host
    cudaMemcpy(dA,hA,size,cudaMemcpyHostToDevice);

Thread Hierarchy

Example

    // Set up threadblocks
    dim3 threadBlock(BLOCK_SIZE,BLOCK_SIZE);
    dim3 numBlocks(K,K);
    // Launch the matrix multiplication kernel
    gpuMM<<<numBlocks,threadBlock>>>(dA,dB,dC,N);
    //...
    // Copy the result back to the CPU
    cudaMemcpy(hC,dC,size,cudaMemcpyDeviceToHost);

CUDA kernels

Parts of your program that run on the GPU must be provided as CUDA kernels:

Example

    __global__ void gpuMM(float *A,float *B,...)
    {
        int row,col;
        row = blockIdx.x*blockDim.x+threadIdx.x;
        col = blockIdx.y*blockDim.y+threadIdx.y;
        float sum = 0.f;
        for(int n=0; n<N; ++n)
            sum += A[row*N+n]*B[n*N+col];
        C[row*N+col] = sum;
    }

Measuring GPU performance

Example

    float time;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start,stream);
    /* Do some GPU work and time it */
    cudaEventRecord(stop,stream);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time,start,stop);

Occupancy Considerations for GPU

• Fermi GPU can have up to 48 active warps per SM
• Instructions are issued per warp
• If a warp is not ready, the hardware switches warps (context switching)
• Shared memory can limit occupancy!

[Figure: GPU diagram; surviving labels: L2, Device Memory]

Goal: always have enough active warps to saturate the memory bandwidth of the device

Increasing Occupancy with Multiple Kernels

Suppose you are going to perform multiple matrix-matrix multiplications

Example

    gpuMM<<<grid,nthreads>>>(dA1,dB1,dC1,N);
    gpuMM<<<grid,nthreads>>>(dA2,dB2,dC2,N);

However, kernels launched from the same stream (in this case the default stream) will execute serially.
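
As a rough sketch of the alternative, the two gpuMM launches above can be issued into two separate streams. The complete gpuMM signature (float *C, int N), the values BLOCK_SIZE = 16 and K = 8, and the test data are assumptions made for illustration; they are not taken from the slides. Whether the two kernels actually overlap depends on the device and on the resources that are free at launch time.

```cuda
// Sketch only: BLOCK_SIZE, K, the test data and the full gpuMM signature
// are assumed for illustration and do not come from the slides.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 16

// One thread computes one element of C = A*B (NxN, row-major),
// using the same row/col mapping as the kernel shown earlier.
__global__ void gpuMM(const float *A, const float *B, float *C, int N)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if (row >= N || col >= N) return;
    float sum = 0.f;
    for (int n = 0; n < N; ++n)
        sum += A[row * N + n] * B[n * N + col];
    C[row * N + col] = sum;
}

int main()
{
    const int K = 8;                      // blocks per grid dimension (assumed)
    const int N = K * BLOCK_SIZE;
    const size_t size = size_t(N) * N * sizeof(float);

    // A is all ones and B is all twos, so every entry of A*B and B*A is 2*N.
    std::vector<float> hA(N * N, 1.0f), hB(N * N, 2.0f);
    std::vector<float> hC1(N * N), hC2(N * N);

    float *dA, *dB, *dC1, *dC2;
    cudaMalloc(&dA, size);
    cudaMalloc(&dB, size);
    cudaMalloc(&dC1, size);
    cudaMalloc(&dC2, size);
    cudaMemcpy(dA, hA.data(), size, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), size, cudaMemcpyHostToDevice);

    // Create one stream per multiplication.
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i)
        cudaStreamCreate(&stream[i]);

    dim3 threadBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 numBlocks(K, K);
    // The fourth launch parameter selects the stream; the third is the
    // dynamic shared memory size (none is used here).
    gpuMM<<<numBlocks, threadBlock, 0, stream[0]>>>(dA, dB, dC1, N);
    gpuMM<<<numBlocks, threadBlock, 0, stream[1]>>>(dB, dA, dC2, N);

    // Wait for both streams before reading the results.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(hC1.data(), dC1, size, cudaMemcpyDeviceToHost);
    cudaMemcpy(hC2.data(), dC2, size, cudaMemcpyDeviceToHost);

    // Check both products against the known answer.
    bool ok = true;
    for (int i = 0; i < N * N; ++i)
        if (hC1[i] != 2.0f * N || hC2[i] != 2.0f * N) ok = false;
    std::printf("%s\n", ok ? "ok" : "mismatch");

    for (int i = 0; i < 2; ++i)
        cudaStreamDestroy(stream[i]);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC1);
    cudaFree(dC2);
    return ok ? 0 : 1;
}
```

The result check uses inputs whose product is exactly representable in single precision, so a direct comparison is safe here; real data would need a tolerance.
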
Kernels launched from different streams can execute concurrently on a single device, if there is room.

CUDA Streams and Events

The CUDA driver API provides streams and events as a way to manage GPU synchronization:
• Synchronization is implied for events within a stream (including the default stream)
• Streams belong to a particular GPU
• More than one stream can be associated with a GPU
• Streams are required if you want to perform asynchronous communication
• Streams are critical if you want concurrency with multiple GPU or multiple kernels on any single GPU

CUDA Streams

Example

    // Create a pair of streams
    cudaStream_t stream[2];
    for(int i=0; i<2; ++i)
        cudaStreamCreate(&stream[i]);
    // Launch a kernel from each stream
    KernelOne<<<100,512,0,stream[0]>>>(..);
    KernelTwo<<<100,512,0,stream[1]>>>(..);
    // Destroy the streams
    for(int i=0; i<2; ++i)
        cudaStreamDestroy(stream[i]);

Synchronization of Streams and Events

Streams can be synchronized explicitly:
• cudaDeviceSynchronize(): wait for all preceding commands in all streams for a device to complete
• cudaStreamSynchronize(): wait for all preceding commands in a specified stream to complete
• cudaStreamWaitEvent(): synchronize a stream with an event (both must be specified)
• cudaStreamQuery(): ask the system if preceding commands in a stream have completed

CUDA Streams

Two streams will be synchronized implicitly if any of the following operations are issued in between them:
• a page-locked memory allocation (using cudaMallocHost)
• a device memory allocation (using cudaMalloc)
• a memory copy between two devices
• any CUDA command to the default
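
The explicit synchronization calls listed under "Synchronization of Streams and Events" can be combined roughly as follows. This is a minimal sketch, not code from the course examples: the kernels fill and scale, the array size, and the stream and event names are all hypothetical. One stream produces data, an event marks the point where it is ready, and a second stream waits on that event before consuming it.

```cuda
// Sketch only: fill, scale, n, s0, s1 and filled are hypothetical names
// introduced for illustration.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Write the constant v into every element of x.
__global__ void fill(float *x, int n, float v)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = v;
}

// Multiply every element of x by s, in place.
__global__ void scale(float *x, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

int main()
{
    const int n = 1 << 20;
    float *dx;
    cudaMalloc(&dx, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaEvent_t filled;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);
    cudaEventCreate(&filled);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Producer in stream s0, followed by an event recorded in the same stream.
    fill<<<blocks, threads, 0, s0>>>(dx, n, 1.5f);
    cudaEventRecord(filled, s0);

    // Work queued in s1 after this call will not start until the event has completed.
    cudaStreamWaitEvent(s1, filled, 0);
    scale<<<blocks, threads, 0, s1>>>(dx, n, 2.0f);

    // Non-blocking check: the host could do other work inside this loop.
    long polls = 0;
    while (cudaStreamQuery(s1) == cudaErrorNotReady)
        ++polls;

    // Blocking wait on s1; it also reports any error from the queued work.
    cudaError_t err = cudaStreamSynchronize(s1);
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::vector<float> hx(n);
    cudaMemcpy(hx.data(), dx, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Every element should be 1.5 * 2.0 = 3.0 if scale ran after fill.
    bool ok = true;
    for (int i = 0; i < n; ++i)
        if (hx[i] != 3.0f) ok = false;
    std::printf("%s (polled %ld times)\n", ok ? "ok" : "mismatch", polls);

    cudaEventDestroy(filled);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(dx);
    return ok ? 0 : 1;
}
```

Without the cudaStreamWaitEvent call, the two kernels would sit in unrelated streams and scale could run before fill had finished.
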
