Hybrid OpenMP + CUDA programming model
Podgainy D.V., Streltsova O.I., Zuev M.I.
Heterogeneous Computations team, HybriLIT
Laboratory of Information Technologies, Joint Institute for Nuclear Research

Dubna, Russia, from 15 February to 7 March 2017

Types of parallel machines

Distributed memory
• each processor has its own memory address space
• examples: clusters, Blue Gene/L

Shared memory
• single address space for all processors
• examples: IBM p-series, multi-core PC

Shared-Memory Parallel Computers

[Diagram: a uniform shared-memory machine (CPU0-CPU2 connected over a bus to one Memory) and a ccNUMA machine (groups of CPUs with local memories Mem0-Mem2 joined by an interconnect)]

Non-Uniform Memory Access (ccNUMA)

HETEROGENEOUS COMPUTATIONS TEAM, HybriLIT

OpenMP (Open specifications for Multi-Processing)

OpenMP (Open specifications for Multi-Processing) is an API that supports multi-platform shared-memory parallel programming in C, C++, and Fortran.

OpenMP consists of three components: compiler directives, runtime library routines, and environment variables.

OpenMP has been managed by the OpenMP Architecture Review Board (OpenMP ARB) consortium since 1997.

OpenMP website: http://openmp.org/

OpenMP version history:
• OpenMP 1.0: Fortran in 1997, C/C++ in 1998
• OpenMP 2.0: Fortran in 2000, C/C++ in 2002
• Version 3.0 was released in May 2008
• Version 4.0 was released in July 2013

What is OpenMP

• OpenMP (Open specifications for Multi-Processing) is one of the most popular technologies for multi-processor/core computers with a shared-memory architecture.
• OpenMP is based on traditional programming languages. The OpenMP standard is developed for the Fortran, C and C++ languages, and all basic constructs are similar across them. There are also known implementations of OpenMP for MATLAB and MATHEMATICA.
• An OpenMP-based program contains a number of threads interacting via shared memory. OpenMP provides special compiler directives, library functions and environment variables.
• Compiler directives mark segments of code that can be processed in parallel.
• Using the OpenMP constructs (compiler directives, procedures, environment variables), a user can organize parallelism in their serial code.
• "Partial parallelization" is possible by adding OpenMP directives step by step (OpenMP offers an incremental approach to parallelism).
• OpenMP directives are ignored by a standard (non-OpenMP) compiler, so the code stays workable on both single- and multi-processor platforms.

OpenMP compilers (full list at http://openmp.org/wp/openmp-compilers/)

GNU gcc
• Flag: -fopenmp
• gcc 4.2 – OpenMP 2.5; gcc 4.4 – OpenMP 3.0; gcc 4.7 – OpenMP 3.1; gcc 4.9 – OpenMP 4.0
• Example: gcc -fopenmp start_openmp.c -o test1

Intel C/C++ and Fortran
• Flag: -openmp on Linux or Mac OS X, -Qopenmp on Windows
• OpenMP 3.1 API Specification; support for most of the new features in the OpenMP 4.0 API Specification
• Example: icc -openmp start_openmp.c -o test1

Portland Group Compilers and Tools
• Flag: -mp
• Full support for OpenMP 3.1
• Example: pgcc -mp start_openmp.c -o test1

OpenMP Programming Model

• OpenMP is an explicit (not automatic) programming model, offering full control over parallelization.

Fork-Join Model:
• All OpenMP programs start with just one thread: the master thread (thread 0). The master thread executes sequentially until the first parallel region construct is encountered.
• FORK: the master thread then creates a team of parallel threads, which execute the statements in the parallel region concurrently.
• JOIN: when the team threads complete the statements in the parallel region construct, they synchronize and terminate, leaving only the master thread.

OpenMP: Construct parallel

C / C++, general code structure:

#include <omp.h>

main () {
    // serial code
    // ...
    // fork a team of threads:
    #pragma omp parallel
    {
        // structured block
        // ...
    }
    // resume serial code
    // ...
}

Fortran, general code structure:

PROGRAM Start
    ! serial code
    ! ...
    ! fork a team of threads:
    !$OMP PARALLEL
        ! structured block
        ! ...
    !$OMP END PARALLEL
    ! resume serial code
    ! ...
END

The parallelism has to be expressed explicitly.

OpenMP: Construct parallel

#pragma omp parallel
{
    structured block
}

Meaning:
• The entire code block following the parallel directive is executed by all threads concurrently.
• This includes:
  - creation of a team of "worker" threads;
  - each thread executes a copy of the code within the structured block;
  - barrier synchronization at the end of the region (implicit barrier);
  - termination of the worker threads.

Full form with clauses:

#pragma omp parallel [clause ...] newline
    if (scalar_expression)
    private (list)
    shared (list)
    default (shared | none)
    firstprivate (list)
    reduction (operator: list)
    copyin (list)
    num_threads (integer-expression)
{
    structured block
}

OpenMP: Classes of variables

There are two main classes of variables: shared and private ones.

• A shared variable always exists in a single instance for the whole program and is available to all threads under the same name.
• Declaring a variable private causes generation of a separate instance of that variable for each thread. A change of the value of a thread's private variable does not affect the variable of the same name in other threads.
• There are also "intermediate" types that provide a connection between parallel and sequential sections.
• Thus, if a variable that appears in the sequential code preceding a parallel section is declared firstprivate, its value is assigned to the private variable of the same name in each thread of the parallel section.
• Likewise, a variable of the lastprivate type keeps, after termination of the parallel block, the value produced by the sequentially last iteration or section.

Private variables and shared variables

Shared: the data within a parallel region is shared, which means visible and accessible by all threads simultaneously.

Private: the data within a parallel region is private to each thread, which means each thread will have a local copy and use it as a temporary variable.

What if we need to initialize a private variable? firstprivate: private variables with initial values copied from the master thread's copy.

int a;       // shared automatic
int j;
int k = 3;
#pragma omp parallel private(j, k)
{
    int b;   // private automatic
    b = j;   // b is not defined: the private j is uninitialized
    foo(j, b, k);
}

Specification of number of threads

Setting OpenMP environment variables is done the same way you set any other environment variables, and depends upon which shell you use:

csh/tcsh:   setenv OMP_NUM_THREADS 4
sh/bash:    export OMP_NUM_THREADS=4

Via runtime functions: omp_set_num_threads(4);

Other useful functions to get information about threads:

Runtime function omp_get_num_threads()
• Returns the number of threads in the parallel region
• Returns 1 if called outside a parallel region

Runtime function omp_get_thread_num()
• Returns the id of the thread in the team (a value in [0, Nthreads-1])
• The master thread always has id 0

Explicit (low-level) parallelism and high-level parallelism: the sections directive

Low-level parallelism: the work is distributed between threads by means of the functions omp_get_thread_num() (returns the thread number of the calling thread within its team) and omp_get_num_threads() (returns the number of threads in the parallel region).

EXAMPLE of low-level parallelism (explicit distribution by thread number):

#pragma omp parallel
{
    if (omp_get_thread_num() == 3) {
        < code for the thread number 3 >;
    } else {
        < code for all other threads >;
    }
}

EXAMPLE of high-level parallelism (parallel independent sections):

#pragma omp sections [...parameters...]
{
    #pragma omp section
    { < block1 > }
    #pragma omp section
    { < block2 > }
}

Each of block 1 and block 2 in this example will be carried out by one of the parallel threads.

Hardware: CPU and GPUs on one node

[Diagram: one node with two CPUs sharing host memory, and three GPUs (Tesla K40) attached via PCIe]

Node #1

Using Multiple GPUs

deviceInfo_CUDA.cu:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main() {
    int ngpus = 0;    // number of CUDA GPUs
    int device;

    // How many devices?
    cudaGetDeviceCount(&ngpus);
    if (ngpus < 1) {
        printf("no CUDA capable devices were detected\n");
        return 1;
    }
    printf("number of CUDA devices:\t %d \n", ngpus);
    for (device = 0; device < ngpus; device++) {
        cudaDeviceProp dprop;
        cudaGetDeviceProperties(&dprop, device);
        printf(" %d: %s\n", device, dprop.name);
    }
    return 0;
}

Compilation and running a CUDA+OpenMP program

Compilation:

$ nvcc -Xcompiler -fopenmp -lgomp -arch=compute_35 --gpu-code=sm_35,sm_37 deviceInfo_CUDA.cu -o cuda_app

script_multiCUDA:

#!/bin/sh
#SBATCH -p tut
#SBATCH -t 60
#SBATCH -n 1
#SBATCH -c 2
#SBATCH --gres=gpu:2
export OMP_NUM_THREADS=2
srun ./cuda_app

Using Multiple GPUs

A GPU can be controlled by:
• a single CPU thread
• multiple CPU threads belonging to the same process
• multiple CPU threads belonging to different processes

All CUDA calls are issued to the current GPU:

cudaSetDevice(0);
kernel_0<<<...>>>( ndata_per_kernel0, dev_A );

cudaSetDevice(1);
kernel_1<<<...>>>( ndata_per_kernel1, dev_B );

cudaSetDevice(id_device)
    cudaError_t cudaSetDevice(int id_device)
    Sets device as the current device for the calling host thread.

cudaGetDevice(&device)
    cudaError_t cudaGetDevice(int *device)
    Returns in *device the current device for the calling host thread.

Using Multiple GPUs

#pragma omp parallel
{
    unsigned int cpu_thread_id   = omp_get_thread_num();
    unsigned int num_cpu_threads = omp_get_num_threads();
    int gpu_id = 0;

    cudaSetDevice(cpu_thread_id % ngpus);   // map this CPU thread to a GPU
    cudaGetDevice(&gpu_id);
    // operations with the GPU in this CPU-bound stream
}

"cpu_thread_id % ngpus" allows more CPU threads than GPU devices: with 2 GPUs, thread 0 % 2 = 0, thread 1 % 2 = 1, thread 2 % 2 = 0, and so on.