Extending Unified Parallel C for GPU Computing

Extending Unified Parallel C for GPU Computing Yili Zheng, Costin Iancu, Paul Hargrove, Seung-Jai Min, Katherine Yelick Lawrence Berkeley National Lab GPU Cluster with Hybrid Memory Network Computer Node Computer Node CPU CPU CPU CPU CPU Memory CPU Memory GPU GPU GPU GPU Memory Memory Memory Memory GPU GPU GPU GPU 2/24/2010 SIAM PP10 -- Extending Unified Parallel C for GPU Computing 2 Current Programming Model for GPU Clusters MPI + CUDA/OpenCL CPU CPU CPU GPU GPU GPU CPU CPU CPU MPI CUDA/OpenCL 2/24/2010 SIAM PP10 -- Extending Unified Parallel C for GPU Computing 3 PGAS Programming Model for Hybrid Multi-Core Systems Network Computer Node Computer Node CPU CPU CPU CPU CPU Memory CPU Memory PGAS GPU GPU GPU GPU Memory Memory Memory Memory GPU GPU GPU GPU 2/24/2010 SIAM PP10 -- Extending Unified Parallel C for GPU Computing 4 PGAS Example: Global Matrix Distribution Global Matrix View Distributed Matrix Storage 1 5 2 6 1 2 5 6 9 13 10 14 3 4 7 8 9 10 13 14 3 7 4 8 11 12 15 16 11 15 12 16 2/24/2010 SIAM PP10 -- Extending Unified Parallel C for GPU Computing 5 PGAS Example: Fast Fourier Transform Global Matrix View Distributed Matrix Storage 1 5 9 13 1 52 93 134 GPU I 2 6 10 14 25 6 107 148 GPU II 3 7 11 15 4 8 12 16 39 107 11 1512 GPU III Shared [4] A[4][4]; 1D_FFT(A); // row fft A’=AT; 134 148 1215 16 GPU IV 1D_FFT(A’); // col fft A= A’T 2/24/2010 SIAM PP10 -- Extending Unified Parallel C for GPU Computing 6 Hybrid Partitioned Global Address Space Shared Segment on Host Shared Segment on Host Shared Segment on GPU Shared Segment on GPU Memory Memory Memory Memory Private Private Private Private Private Private Private Private Segment Segment Segment Segment Segment Segment Segment Segment on Host on GPU on Host on GPU on Host on GPU on Host on GPU Memory Memory Memory Memory Memory Memory Memory Memory Thread 1 Thread 2 Thread 3 Thread 4 Each thread has only one shared segment, which can be either in host memory or in GPU memory, but not both. Decouple the memory model from execution models; therefore it supports various execution models. Backward compatible with current UPC and CUDA/OpenCL programs. 2/24/2010 SIAM PP10 -- Extending Unified Parallel C for GPU Computing 7 Execution Models • Synchronous model CPU CPU CPU CPU GPU GPU CPU CPU CPU CPU • Virtual GPU model CPU CPU CPU CPU GPU GPU GPU GPU CPU CPU CPU CPU • Hybrid model CPU CPU CPU CPU GPU GPU 2/24/2010 SIAM PP10 -- Extending Unified Parallel C for GPU Computing 8 UPC Overview • PGAS dialect of ISO C99 • Distributed shared arrays • Dynamic shared-memory allocation • One-sided shared-memory communication • Synchronization: barriers, locks, memory fences • Collective communication library • Parallel I/O library 2/24/2010 SIAM PP10 -- Extending Unified Parallel C for GPU Computing 9 Hybrid PGAS Example shared int *sp Shared Segment on Host Shared Segment on Host Shared Segment on GPU Shared Segment on GPU Memory Memory Memory Memory Private Private Private Private Private Private Private Private int *p Segment Segment Segment Segment Segment Segment Segment Segment on Host on GPU on Host on GPU on Host on GPU on Host on GPU Memory Memory Memory Memory Memory Memory Memory Memory Thread 1 Thread 2 Thread 3 Thread 4 Standard C p = malloc(4) *p *p *p UPC sp = upc_alloc(4) *sp *sp *sp UPC with GPU *sp *sp *sp sp = upc_alloc(4) shared heap 2/24/2010 SIAM PP10 -- Extending Unified Parallel C for GPU Computing 10 Data Transfer Example • MPI+CUDA copy nbytes data from src on GPU 1 to dst on GPU 2 Process 1: Process 2: send_buffer = malloc(nbytes); recv_buffer = malloc(nbytes); cudaMemcpy(send_buffer, src); MPI_Send(send_buffer); MPI_Recv(recv_buffer); free(send_buffer); cudaMemcpy(dst, recv_buffer); free(recv_buffer); • UPC with GPU extensions Thread 1: Thread 2: upc_memcpy(dst, src, nbytes); // no operation required Advantages of PGAS on GPU clusters Don’t need explicit buffer management by the user. Facilitate end-to-end optimizations such as data transfer pipelining. One-sided communications map well to DMA transfers for GPU devices. Concise code 2/24/2010 SIAM PP10 -- Extending Unified Parallel C for GPU Computing 11 PGAS GPU Code Example • Thread 1 • Thread 2 // use GPU memory for shared segment bupc_attach_gpu(gpu_id); shared [] int * shared sp; shared [] int * shared sp; // shared memory is allocatd on GPU sp = upc_alloc(sizeof(int)); *sp = 4; // write to GPU memory upc_barrier; upc_barrier; // read from remote GPU memory printf(“%d”, *sp); 2/24/2010 SIAM PP10 -- Extending Unified Parallel C for GPU Computing 12 Berkeley UPC Software Stack UPC Applications UPC-to-C Translator Translated C code with Runtime Calls UPC Runtime GASNet Communication Library Network Drivers and OS Libraries 2/24/2010 SIAM PP10 -- Extending Unified Parallel C for GPU Computing 13 Translation and Call Graph Example shared [] int * shared sp; *sp = a; UPC-to-C Translator UPCR_PUT_PSHARED_VAL(sp, a); UPC Runtime No Is *sp on Yes GPU? gasnet_put(sp, a); gasnet_put_to_gpu(sp, a); GASNet GASNet+CUDA 2/24/2010 SIAM PP10 -- Extending Unified Parallel C for GPU Computing 14 Active Messages A B • Active messages = Data + Action Request • Key enabling technology for both one-sided and two-sided communications Request handler – Software implementation of Put/Get – Eager and Rendezvous protocols Reply • Remote Procedural Calls – Facilitate “owner-computes” – Execute asynchronous tasks Reply handler 2/24/2010 SIAM PP10 -- Extending Unified Parallel C for GPU Computing 15 GASNet Extensions for GPU One-sided Communication • How to transfer to/from Active remote GPU device memory? Message – Active Messages (AM) Request – Need to execute CUDA Host1 Host2 operations outside of AM handler context because they Active may block Message CUDA CUDA – Solution: asynchronous GPU Reply Memcpy Memcpy task queue • How to know when the data transfer is done? GPU1 GPU2 – Send an ACK message after the GPU op is done on the GPU device – Solution: GPU task queue polling and callback support 2/24/2010 SIAM PP10 -- Extending Unified Parallel C for GPU Computing 16 Implementation • UPC-to-C translator – No change because the UPC runtime API is intact. – Compile UPC code and CUDA code separately and then link the object files with libs together. • UPC runtime extensions – Shared-heap management for GPU device memory – Accesses to shared data on GPU (via pointer-to-shared) – Interoperability of UPC and GPU (CUDA) • GASNet extensions – Put and Get operations for GPU – Asynchronous GPU task queue for running GPU operations outside of AM handler context 2/24/2010 SIAM PP10 -- Extending Unified Parallel C for GPU Computing 17 Summary • Runtime extensions for enabling PGAS on GPU clusters – Unified API for data management and communication – High-level expressions of data movement enabling end-to- end optimizations – Compatible with different execution models and existing GPU applications • Reusable modular components in the implementation – Task queue for asynchronous task execution – Communication protocols for heterogeneous processors – Portable to other GPU SDK, e.g., OpenCL. Platform (CUDA) specific codes are limited and encapsulated. • Work in progress 2/24/2010 SIAM PP10 -- Extending Unified Parallel C for GPU Computing 18 Thank You! • MS 31 UPC at Scale 1:20 PM - 3:20 PM Room: Leonesa II • MS 52 Getting Multicore Performance with UPC 1:20 PM - 3:20 PM Room: Eliza Anderson Amphitheater 2/24/2010 SIAM PP10 -- Extending Unified Parallel C for GPU Computing 19.

Extending Unified Parallel C for GPU Computing

Other Apis What’S Wrong with Openmp?

Opencl on Shared Memory Multicore Cpus

Heterogeneous Task Scheduling for Accelerated Openmp

AMD Accelerated Parallel Processing Opencl Programming Guide

L22: Parallel Programming Language Features (Chapel and Mapreduce)

Implementing FPGA Design with the Opencl Standard

Parallel Programming

Opencl SYCL 2.2 Specification

An Introduction to Gpus, CUDA and Opencl

Outro to Parallel Computing

Opencl in Scientific High Performance Computing

The Continuing Renaissance in Parallel Programming Languages