Tuned OpenMP to CUDA Translation
Rudi Eigenmann and Seyong Lee, Purdue University

Motivation
Many-core processors offer new opportunities for general-purpose high-performance computing:
• Multicore CPUs
• Cell BE
• GPGPUs
• Multicore FPGAs
• Etc.
[Figure: Floating-Point Operations per Second and Memory Bandwidth for the CPU and GPU — (a) Peak GFLOP/s, (b) Peak Memory Bandwidth. Courtesy: NVIDIA]

Tradeoff between Performance and Programmability in Many-Core Processors
[Figure: (a) CPU, (b) GPU. Courtesy: NVIDIA]
• The GPU provides more computing power but worse programmability than the CPU.
• GPU architectures are optimized for stream computations.
• General-Purpose GPUs (GPGPUs) provide better programmability for general applications.
• The CUDA programming model is more user-friendly than previous approaches, but it is still complex and error-prone.

Goal and Approach
Contribution:
• Propose a new API, called OpenMPC, for improved CUDA programming, which provides both programmability and tunability.
• Developed a fully automatic and parameterized reference compilation system to support OpenMPC.
• Developed several tools to assist users in performance tuning.
• Evaluation on 14 OpenMP programs: user-assisted tuning with optional manual modification of the input OpenMP programs improves performance 1.24 times on average (up to 7.71 times) over untuned versions, which is 75% (92% if one benchmark is excluded) of the performance of hand-written CUDA versions.

Related Work
[Figure: classification of related work along two axes — manual, semi-automatic, or automatic optimization versus manual, semi-automatic, or automatic GPU code generation. Works shown: BrookGPU [04], CUDA [07], OpenCL [08], CUDA-lite [08], Baskaran [08], Volkov [08], Ryoo [08], G-ADAPT [09], OpenMP-to-GPGPU [09], hiCUDA [09], JCUDA [09], PGI Accelerator [09], Nukada [09], Baskaran [10], Leung [10], and OpenMPC [10].]

Overview of the CUDA Programming Model
• A general-purpose multi-threaded SIMD model for GPGPU programming.
• A CUDA program consists of a series of sequential and parallel execution phases.
• Parallel regions are executed by a set of threads (thread batching).
Architectural limits:
• No global synchronization mechanism is supported.
• No H/W caching mechanism for global memory.
[Figure: CUDA programming model — the host launches kernels on the device; each kernel executes as a grid of thread blocks, and each block contains an array of threads. Courtesy: NVIDIA]

CUDA Memory Model
[Figure: (a) CUDA Memory Model, (b) CUDA Hardware Model — per-thread registers and local memory, per-block shared memory, and global, texture, and constant memory; global memory is off-chip device memory, and texture and constant memory have dedicated caches on each multiprocessor.]
• A host CPU and a GPU device have separate address spaces.
• The shared memory and register bank in a multiprocessor are dynamically partitioned among the active thread blocks running on that multiprocessor.
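A minimal sketch (not from the slides) tying these memory spaces together: a kernel stages a tile of a global array in per-block shared memory, reads a scale factor from constant memory, and synchronizes only within its block, while the host allocates device memory and copies data across the separate address spaces. All names and sizes (N, BLOCK, scale_kernel, the 0.25f factor) are illustrative assumptions.

#include <cuda_runtime.h>
#include <stdio.h>

#define N     1024            /* assumed array length (multiple of BLOCK) */
#define BLOCK  256            /* threads per thread block */

__constant__ float scale;     /* constant memory: read-only, cached on chip */

__global__ void scale_kernel(const float *in, float *out)
{
    __shared__ float tile[BLOCK];                      /* shared memory, partitioned per block */
    int gtid = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread id */
    tile[threadIdx.x] = in[gtid];                      /* global -> shared */
    __syncthreads();                                   /* block-level barrier; no global barrier exists */
    out[gtid] = scale * tile[threadIdx.x];             /* shared -> global */
}

int main(void)
{
    float h_in[N], h_out[N], h_scale = 0.25f;
    for (int i = 0; i < N; i++) h_in[i] = (float)i;

    float *d_in, *d_out;                               /* device pointers: separate address space */
    cudaMalloc((void **)&d_in,  N * sizeof(float));
    cudaMalloc((void **)&d_out, N * sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(scale, &h_scale, sizeof(float));

    scale_kernel<<<N / BLOCK, BLOCK>>>(d_in, d_out);   /* grid of N/BLOCK thread blocks */

    cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("out[4] = %f (expected 1.0)\n", h_out[4]);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}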
Programming Complexity Comparison (Jacobi)
[Side-by-side listing of the OpenMP version of Jacobi and the much longer CUDA version generated from it; both versions are reproduced in full on the following slides.]

OpenMP Version of Jacobi

float a[SIZE_2][SIZE_2];
float b[SIZE_2][SIZE_2];

int main(int argc, char *argv[]) {
    int i, j, k;
    for (k = 0; k < ITER; k++) {
        #pragma omp parallel for private(i, j)
        for (i = 1; i <= SIZE; i++)
            for (j = 1; j <= SIZE; j++)
                a[i][j] = (b[i-1][j] + b[i+1][j] + b[i][j-1] + b[i][j+1]) / 4.0f;
        #pragma omp parallel for private(i, j)
        for (i = 1; i <= SIZE; i++)
            for (j = 1; j <= SIZE; j++)
                b[i][j] = a[i][j];
    }
    return 0;
}

CUDA Version of Jacobi

int main(int argc, char *argv[]) {
    int i, j, k;
    float *gpu__a;
    float *gpu__b;
    /* GPU memory allocation and data transfer to the GPU */
    gpuBytes = (SIZE_2*SIZE_2)*sizeof(float);
    CUDA_SAFE_CALL(cudaMalloc((void **)(&gpu__a), gpuBytes));
    CUDA_SAFE_CALL(cudaMemcpy(gpu__a, a, gpuBytes, cudaMemcpyHostToDevice));
    gpuBytes = (SIZE_2*SIZE_2)*sizeof(float);
    CUDA_SAFE_CALL(cudaMalloc((void **)(&gpu__b), gpuBytes));
    CUDA_SAFE_CALL(cudaMemcpy(gpu__b, b, gpuBytes, cudaMemcpyHostToDevice));
    dim3 dimBlock0(BLOCK_SIZE, 1, 1);
    dim3 dimGrid0(NUM_BLOCKS, 1, 1);
    dim3 dimBlock1(BLOCK_SIZE, 1, 1);
    dim3 dimGrid1(NUM_BLOCKS, 1, 1);
    /* GPU kernel execution */
    for (k = 0; k < ITER; k++) {
        main_kernel0<<<dimGrid0, dimBlock0, 0, 0>>>((float (*)[SIZE_2])gpu__a, (float (*)[SIZE_2])gpu__b);
        main_kernel1<<<dimGrid1, dimBlock1, 0, 0>>>((float (*)[SIZE_2])gpu__a, (float (*)[SIZE_2])gpu__b);
    }
    /* Data transfer back to the CPU and GPU memory deallocation */
    gpuBytes = (SIZE_2*SIZE_2)*sizeof(float);
    CUDA_SAFE_CALL(cudaMemcpy(b, gpu__b, gpuBytes, cudaMemcpyDeviceToHost));
    gpuBytes = (SIZE_2*SIZE_2)*sizeof(float);
    CUDA_SAFE_CALL(cudaMemcpy(a, gpu__a, gpuBytes, cudaMemcpyDeviceToHost));
    CUDA_SAFE_CALL(cudaFree(gpu__b));
    CUDA_SAFE_CALL(cudaFree(gpu__a));
    fflush(stdout);
    fflush(stderr);
    return 0;
}
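The generated code above refers to symbols that the slides do not show (SIZE, SIZE_2, ITER, BLOCK_SIZE, NUM_BLOCKS, gpuBytes, and the CUDA SDK's CUDA_SAFE_CALL macro). A hypothetical set of supporting definitions, chosen here only to make the excerpt self-contained, might look like the following; the actual values are selected by the OpenMPC translator and its configuration.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define SIZE        1024                     /* assumed problem size                    */
#define SIZE_2      (SIZE + 2)               /* array dimension padded with boundaries  */
#define ITER        100                      /* assumed number of Jacobi iterations     */
#define BLOCK_SIZE  256                      /* threads per thread block                */
#define NUM_BLOCKS  ((SIZE + BLOCK_SIZE - 1) / BLOCK_SIZE)   /* one i-row per thread    */

/* Minimal stand-in for the CUDA SDK's CUDA_SAFE_CALL error-checking macro. */
#define CUDA_SAFE_CALL(call)                                                  \
    do {                                                                      \
        cudaError_t err = (call);                                             \
        if (err != cudaSuccess) {                                             \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));     \
            exit(EXIT_FAILURE);                                               \
        }                                                                     \
    } while (0)

static size_t gpuBytes;                      /* byte count used by the generated code   */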
CUDA Version of Jacobi (2)

__global__ void main_kernel0(float a[SIZE_2][SIZE_2], float b[SIZE_2][SIZE_2]) {
    int i, j;
    int _bid = (blockIdx.x + (blockIdx.y * gridDim.x));
    int _gtid = (threadIdx.x + (_bid * blockDim.x));
    i = (_gtid + 1);
    if (i <= SIZE) {
        for (j = 1; j <= SIZE; j++)
            a[i][j] = (b[(i-1)][j] + b[(i+1)][j] + b[i][(j-1)] + b[i][(j+1)]) / 4.0F;
    }
}

__global__ void main_kernel1(float a[SIZE_2][SIZE_2], float b[SIZE_2][SIZE_2]) {
    int i, j;
    int _bid = (blockIdx.x + (blockIdx.y * gridDim.x));
    int _gtid = (threadIdx.x + (_bid * blockDim.x));
    i = (_gtid + 1);
    if (i <= SIZE) {
        for (j = 1; j <= SIZE; j++)
            b[i][j] = a[i][j];
    }
}

Why OpenMP?
Advantages of OpenMP as a programming paradigm for GPGPUs:
• The loop-level parallelism of OpenMP is an ideal target for utilizing the GPU's highly parallel computing units.
• OpenMP's fork-join model well represents the relationship between a master thread on the CPU and a set of worker threads on the GPU.
• The incremental parallelization style of OpenMP brings the same benefit to GPGPU programming.

Baseline Translation of OpenMP into CUDA
Identify kernel regions (parallel work to be executed on the GPU):
1) Split an OpenMP parallel region at every synchronization point.
2) Select sub-parallel regions containing at least one omp-for loop and transform them into kernel functions (see the sketch below).

Input OpenMP program (serial region, parallel region, and implicit barrier annotated):

int a;                      /* serial region */
.....
#pragma omp parallel
{
    .....
    #pragma omp for         /* parallel region */
    for ( ... ) { ... }
                            /* implicit barrier */
    .....
}

Unoptimized Performance of OpenMP Programs on CUDA
[Chart: unoptimized speedups of the Kernel Benchmarks, NAS Parallel Benchmarks, and Rodinia Benchmarks translated to CUDA.]
• Speedups are over serial execution on the CPU, using the largest available input data.
• Experimental platform — CPU: two dual-core AMD Opterons at 3 GHz; GPU: NVIDIA Quadro FX 5600 with 16 multiprocessors at 1.35 GHz.

Differences between OpenMP and CUDA Programming Models
• Granularity: OpenMP uses coarse-grain parallelism (runs tens of threads); CUDA uses fine-grain parallelism (runs thousands of threads).
• Execution model: OpenMP supports a SIMD/MIMD model; CUDA supports a SIMD model.
• Memory access: OpenMP threads work well on both regular and irregular memory access patterns; CUDA threads strongly prefer regular memory access patterns.
• Target architecture: OpenMP is optimized for traditional shared-memory multiprocessors (SMPs); CUDA is optimized for stream architectures that operate on a large data space (or stream) in parallel and are tuned for fast access to regular, consecutive data.

Intra-Thread vs. Inter-Thread Locality
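Returning to the baseline translation above, the following hypothetical sketch (illustrative identifiers, not actual OpenMPC output) shows how one OpenMP parallel region with two omp-for loops is split at the implicit barrier into two kernel functions; the barrier semantics are preserved because consecutive kernel launches on the same stream execute in order.

/* Input: one parallel region, two omp-for loops, implicit barrier between them.
 *
 *   #pragma omp parallel
 *   {
 *       #pragma omp for
 *       for (i = 0; i < n; i++) x[i] = 2.0f * x[i];
 *       // implicit barrier -> split point
 *       #pragma omp for
 *       for (i = 0; i < n; i++) y[i] = x[i] + y[i];
 *   }
 *
 * Output: each sub-region containing an omp-for loop becomes its own kernel. */

__global__ void region1_kernel(float *x, int n)
{
    int gtid = blockIdx.x * blockDim.x + threadIdx.x;  /* one loop iteration per thread */
    if (gtid < n) x[gtid] = 2.0f * x[gtid];
}

__global__ void region2_kernel(float *x, float *y, int n)
{
    int gtid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gtid < n) y[gtid] = x[gtid] + y[gtid];
}

/* Host side: back-to-back launches on the default stream run in order,
 * which realizes the barrier between the two loops:
 *
 *   region1_kernel<<<numBlocks, blockSize>>>(gpu_x, n);
 *   region2_kernel<<<numBlocks, blockSize>>>(gpu_x, gpu_y, n);
 */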