Hybrid OpenMP + CUDA programming model
Podgainy D.V., Streltsova O.I., Zuev M.I.
Heterogeneous Computations team, HybriLIT
Laboratory of Information Technologies, Joint Institute for Nuclear Research

Dubna, Russia, from 15 February to 7 March 2017

Types of parallel machines

Distributed memory
• each processor has its own memory address space
• examples: clusters, Blue Gene/L

Shared memory
• single address space for all processors
• examples: IBM p-series, multi-core PC

Shared-Memory Parallel Computers

[Diagram: a uniform shared-memory machine (CPU0-CPU2 connected over a bus to one Memory) and a ccNUMA machine (groups of CPUs with local memories Mem0-Mem2 joined by an interconnect)]

Non-Uniform Memory Access (ccNUMA)

HETEROGENEOUS COMPUTATIONS TEAM, HybriLIT

OpenMP (Open specifications for Multi-Processing)

OpenMP (Open specifications for Multi-Processing) is an API that supports multi-platform shared-memory parallel programming in C, C++, and Fortran.

OpenMP consists of three components: compiler directives, runtime library routines, and environment variables.

OpenMP has been managed by the OpenMP Architecture Review Board (OpenMP ARB) consortium since 1997.

OpenMP website: http://openmp.org/

OpenMP version history:
• OpenMP 1.0: Fortran in 1997, C/C++ in 1998
• OpenMP 2.0: Fortran in 2000, C/C++ in 2002
• Version 3.0 was released in May 2008
• Version 4.0 was released in July 2013

What is OpenMP

• OpenMP (Open specifications for Multi-Processing) is one of the most popular technologies for multi-processor/core computers with a shared-memory architecture.
• OpenMP is based on traditional programming languages. The OpenMP standard is developed for the Fortran, C and C++ languages, and all basic constructs are similar across them. There are also known implementations of OpenMP for MATLAB and MATHEMATICA.
• An OpenMP-based program contains a number of threads interacting via shared memory. OpenMP provides special compiler directives, library functions and environment variables.
• Compiler directives mark segments of code that can be processed in parallel.
• Using the OpenMP constructs (compiler directives, procedures, environment variables), a user can organize parallelism in their serial code.
• "Partial parallelization" is possible by adding OpenMP directives step by step (OpenMP offers an incremental approach to parallelism).
• OpenMP directives are ignored by a standard (non-OpenMP) compiler, so the code stays workable on both single- and multi-processor platforms.

OpenMP compilers (full list at http://openmp.org/wp/openmp-compilers/)

GNU gcc
• Flag: -fopenmp
• gcc 4.2 – OpenMP 2.5; gcc 4.4 – OpenMP 3.0; gcc 4.7 – OpenMP 3.1; gcc 4.9 – OpenMP 4.0
• Example: gcc -fopenmp start_openmp.c -o test1

Intel C/C++ and Fortran
• Flag: -openmp on Linux or Mac OS X, -Qopenmp on Windows
• OpenMP 3.1 API Specification; support for most of the new features in the OpenMP 4.0 API Specification
• Example: icc -openmp start_openmp.c -o test1

Portland Group Compilers and Tools
• Flag: -mp
• Full support for OpenMP 3.1
• Example: pgcc -mp start_openmp.c -o test1

OpenMP Programming Model

• OpenMP is an explicit (not automatic) programming model, offering full control over parallelization.

Fork-Join Model:
• All OpenMP programs start with just one thread: the master thread (thread 0). The master thread executes sequentially until the first parallel region construct is encountered.
• FORK: the master thread then creates a team of parallel threads, which execute the statements in the parallel region concurrently.
• JOIN: when the team threads complete the statements in the parallel region construct, they synchronize and terminate, leaving only the master thread.

OpenMP: Construct parallel

C / C++, general code structure:

#include <omp.h>

main () {
    // serial code
    // ...
    // fork a team of threads:
    #pragma omp parallel
    {
        // structured block
        // ...
    }
    // resume serial code
    // ...
}

Fortran, general code structure:

PROGRAM Start
    ! serial code
    ! ...
    ! fork a team of threads:
    !$OMP PARALLEL
        ! structured block
        ! ...
    !$OMP END PARALLEL
    ! resume serial code
    ! ...
END

The parallelism has to be expressed explicitly.

OpenMP: Construct parallel

#pragma omp parallel
{
    structured block
}

Meaning:
• The entire code block following the parallel directive is executed by all threads concurrently.
• This includes:
  - creation of a team of "worker" threads;
  - each thread executes a copy of the code within the structured block;
  - barrier synchronization at the end of the region (implicit barrier);
  - termination of the worker threads.

Full form with clauses:

#pragma omp parallel [clause ...] newline
    if (scalar_expression)
    private (list)
    shared (list)
    default (shared | none)
    firstprivate (list)
    reduction (operator: list)
    copyin (list)
    num_threads (integer-expression)
{
    structured block
}

OpenMP: Classes of variables

There are two main classes of variables: shared and private ones.

• A shared variable always exists in a single instance for the whole program and is available to all threads under the same name.
• Declaring a variable private causes generation of a separate instance of that variable for each thread. A change of the value of a thread's private variable does not affect the variable of the same name in other threads.
• There are also "intermediate" types that provide a connection between parallel and sequential sections.
• Thus, if a variable that appears in the sequential code preceding a parallel section is declared firstprivate, its value is assigned to the private variable of the same name in each thread of the parallel section.
• Likewise, a variable of the lastprivate type keeps, after termination of the parallel block, the value produced by the sequentially last iteration or section.

Private variables and shared variables

Shared: the data within a parallel region is shared, which means visible and accessible by all threads simultaneously.

Private: the data within a parallel region is private to each thread, which means each thread will have a local copy and use it as a temporary variable.

What if we need to initialize a private variable? firstprivate: private variables with initial values copied from the master thread's copy.

int a;       // shared automatic
int j;
int k = 3;
#pragma omp parallel private(j, k)
{
    int b;   // private automatic
    b = j;   // b is not defined: the private j is uninitialized
    foo(j, b, k);
}

Specification of number of threads

Setting OpenMP environment variables is done the same way you set any other environment variables, and depends upon which shell you use:

csh/tcsh:   setenv OMP_NUM_THREADS 4
sh/bash:    export OMP_NUM_THREADS=4

Via runtime functions: omp_set_num_threads(4);

Other useful functions to get information about threads:

Runtime function omp_get_num_threads()
• Returns the number of threads in the parallel region
• Returns 1 if called outside a parallel region

Runtime function omp_get_thread_num()
• Returns the id of the thread in the team (a value in [0, Nthreads-1])
• The master thread always has id 0

Explicit (low-level) parallelism and high-level parallelism: the sections directive

Low-level parallelism: the work is distributed between threads by means of the functions omp_get_thread_num() (returns the thread number of the calling thread within its team) and omp_get_num_threads() (returns the number of threads in the parallel region).

EXAMPLE of low-level parallelism (explicit distribution by thread number):

#pragma omp parallel
{
    if (omp_get_thread_num() == 3) {
        < code for the thread number 3 >;
    } else {
        < code for all other threads >;
    }
}

EXAMPLE of high-level parallelism (parallel independent sections):

#pragma omp sections [...parameters...]
{
    #pragma omp section
    { < block1 > }
    #pragma omp section
    { < block2 > }
}

Each of block 1 and block 2 in this example will be carried out by one of the parallel threads.

Hardware: CPU and GPUs on one node

[Diagram: one node with two CPUs sharing host memory, and three GPUs (Tesla K40) attached via PCIe]

Node #1

Using Multiple GPUs

deviceInfo_CUDA.cu:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main() {
    int ngpus = 0;    // number of CUDA GPUs
    int device;

    // How many devices?
    cudaGetDeviceCount(&ngpus);
    if (ngpus < 1) {
        printf("no CUDA capable devices were detected\n");
        return 1;
    }
    printf("number of CUDA devices:\t %d \n", ngpus);
    for (device = 0; device < ngpus; device++) {
        cudaDeviceProp dprop;
        cudaGetDeviceProperties(&dprop, device);
        printf(" %d: %s\n", device, dprop.name);
    }
    return 0;
}

Compilation and running a CUDA+OpenMP program

Compilation:

$ nvcc -Xcompiler -fopenmp -lgomp -arch=compute_35 --gpu-code=sm_35,sm_37 deviceInfo_CUDA.cu -o cuda_app

script_multiCUDA:

#!/bin/sh
#SBATCH -p tut
#SBATCH -t 60
#SBATCH -n 1
#SBATCH -c 2
#SBATCH --gres=gpu:2
export OMP_NUM_THREADS=2
srun ./cuda_app

Using Multiple GPUs

A GPU can be controlled by:
• a single CPU thread
• multiple CPU threads belonging to the same process
• multiple CPU threads belonging to different processes

All CUDA calls are issued to the current GPU:

cudaSetDevice(0);
kernel_0<<<...>>>( ndata_per_kernel0, dev_A );

cudaSetDevice(1);
kernel_1<<<...>>>( ndata_per_kernel1, dev_B );

cudaSetDevice(id_device)
    cudaError_t cudaSetDevice(int id_device)
    Sets device as the current device for the calling host thread.

cudaGetDevice(&device)
    cudaError_t cudaGetDevice(int *device)
    Returns in *device the current device for the calling host thread.

Using Multiple GPUs

#pragma omp parallel
{
    unsigned int cpu_thread_id   = omp_get_thread_num();
    unsigned int num_cpu_threads = omp_get_num_threads();
    int gpu_id = 0;

    cudaSetDevice(cpu_thread_id % ngpus);   // map this CPU thread to a GPU
    cudaGetDevice(&gpu_id);
    // operations with the GPU in this CPU-bound stream
}

"cpu_thread_id % ngpus" allows more CPU threads than GPU devices: with 2 GPUs, thread 0 % 2 = 0, thread 1 % 2 = 1, thread 2 % 2 = 0, and so on.