
NUMERICAL PARALLEL COMPUTING
Lecture 3: Programming multicore processors with OpenMP
http://people.inf.ethz.ch/iyves/pnc12/
Peter Arbenz∗, Andreas Adelmann∗∗
∗ Dept., ETH Zürich, E-mail: [email protected]
∗∗ Paul Scherrer Institut, Villigen, E-mail: [email protected]

Review of last week

So far

I Moore’s law.

I Flynn’s taxonomy of parallel computers: SISD, SIMD, MIMD.

I Some terminology: work, speedup, efficiency, scalability.

I Amdahl’s and Gustafson’s laws.

I SIMD programming.

Today

I MIMD programming (Part 1)


MIMD: Multiple Instruction stream – Multiple Data stream

Each processor (core) can execute its own instruction stream on its own data independently from the other processors. Each processor is a full-fledged CPU with both control unit and ALU. MIMD systems are asynchronous.

Shared memory machines (multiprocessors)

I autonomous processors connected to memory system via interconnection network
I single address space accessible by all processors
I (implicit) communication by means of shared data
I data dependencies / race conditions possible

Fig. 2.3 in Pacheco (2011)

Distributed memory machines

I distributed memory machines (multicomputers)
I Each processor has its own local/private memory
I processor/memory pairs communicate via interconnection network
I all data are local to some processor
I (explicit) communication by message passing or some other means to access memory of a remote processor

Fig.2.4 in Pacheco (2011)

Typical architecture of a multicore processor

Multiple cores share multiple caches that are arranged in a tree-like structure. Three-level example:
I L1-cache in-core,

I 2 cores share L2-cache,

I all cores have access to all of the L3 cache and memory.

UMA (uniform memory access): each processor has a direct connection to (a block of) memory.

Typical architecture of a multicore processor (cont’d)

NUMA: non-uniform memory access

I Processors can access each other’s memory through special hardware built into the processors.
I Own memory is faster to access than remote memory.

Interconnection networks

Most widely used interconnects on shared memory machines

I bus (slow / cheap / not scalable)

I crossbar switch

Fig.2.7(a) in Pacheco (2011)


Execution of parallel programs

I Multitasking (time sharing)
In operating systems that support multitasking, several threads or processes are executed on the same processor in time slices (time sharing). In this way, latency due to, e.g., I/O operations can be hidden. This form of executing multiple tasks at the same time is called concurrency. Multiple tasks are executed at the same time, but only one of them has access to the compute resources at any given time. No simultaneous parallel execution takes place.


Execution of parallel programs (cont.)

I Using multiple physical processors admits the parallel execution of multiple tasks. The parallel hardware may cause some overhead, though.


Execution of parallel programs (cont.)

I Simultaneous Multithreading (SMT)
In simultaneous multithreading (SMT) or hyperthreading, multiple flows of control run concurrently on a processor (or a core). The processor switches among these so-called threads of control by means of dedicated hardware. If multiple logical processors are executed on one physical processor, the hardware resources can be employed better and task execution may be sped up. (With two logical processors, performance improvements of up to 30% have been observed.)

Multicore programming

I Multicore processors are programmed with multithreaded programs.

I Although many programs use multithreading, there are some notable differences between multicore programming and SMT.

I SMT is mainly used by the OS to hide I/O overhead. On multicore processors the work is actually distributed over the different cores.

I Cores have individual caches. False sharing may occur: two cores may work on different data items that are stored in the same cache line. Although there is no data dependence, the cache line of the other core is marked invalid when the first core writes its data item. (Massive) performance degradation is possible. A minimal sketch is given below.
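The following is a minimal sketch of how false sharing arises (not from the lecture slides; the array name counts and the iteration count are illustrative): two threads update adjacent array elements that typically live in the same cache line. Padding each counter to its own cache line is a common remedy.

#include <stdio.h>
#include <omp.h>

#define NITER 100000000L

int main(void)
{
    /* counts[0] and counts[1] are adjacent in memory and usually share
       a cache line -> false sharing when two threads update them.      */
    long counts[2] = {0, 0};

    #pragma omp parallel num_threads(2)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < NITER; i++)
            counts[id]++;     /* every write invalidates the cache line
                                 held by the other core                  */
    }
    printf("%ld %ld\n", counts[0], counts[1]);
    /* Remedy (sketch): pad each counter to a separate cache line,
       e.g. put it in a struct padded to 64 bytes.                       */
    return 0;
}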

Multicore programming (cont.)

I Priorities. If a multithreaded program is executed on a single-processor machine, always the thread with the highest priority is executed. On multicore processors, threads with different priorities can be executed simultaneously. This may lead to different results! Programming multicore machines therefore requires techniques, methods, and designs from parallel programming.

Parallel programming models

The design of a parallel program is always based on an abstract view of the parallel system on which the software shall be executed. This abstract view is called the parallel programming model. It does not only describe the underlying hardware; it describes the whole system as it is presented to a software developer:

I System software (operating system)

I parallel programming language

I parallel library

I compiler

I run time system

Parallel programming models (cont.)

Level of parallelism: On which level of the program do we have parallelism?

I Instruction level parallelism
The compiler can detect independent instructions and distribute them on different functional units of a (superscalar) processor.
I Data or loop level parallelism
Data structures, e.g., arrays, are partitioned into portions. The same operation is applied to all elements of the portions. SIMD.
I Function level parallelism
Functions in a program, e.g., in recursive calls, can be invoked in parallel, provided there are no dependences. (A sketch follows after this list.)
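As a sketch of function-level parallelism (not taken from the lecture; task_a and task_b are hypothetical independent functions), the OpenMP sections construct lets independent calls run on different threads:

#include <stdio.h>
#include <omp.h>

void task_a(void) { printf("task A on thread %d\n", omp_get_thread_num()); }
void task_b(void) { printf("task B on thread %d\n", omp_get_thread_num()); }

int main(void)
{
    /* The two independent function calls may be executed by
       different threads of the team.                          */
    #pragma omp parallel sections
    {
        #pragma omp section
        task_a();
        #pragma omp section
        task_b();
    }
    return 0;
}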

Parallel programming models (cont.)

Explicit vs. implicit parallelism: How is parallelism declared in the program?

I Implicit parallelism. Parallelizing compilers can detect regions/statements in the code that can be executed concurrently/in parallel. Parallelizing compilers are of limited success.
I Explicit parallelism with implicit partitioning. The programmer indicates to the compiler where there is potential to parallelize. The partitioning is done implicitly. OpenMP.
I Explicit partitioning. The programmer also indicates how to partition, but does not indicate where to execute the parts.
I Explicit communication and synchronization. MPI.

Parallel programming models (cont.)

There are two flavors of explicit parallel programming:

I Thread programming
A thread is a sequence of statements that can be executed in parallel with other sequences of statements (threads). Each thread has its own resources (program counter, status information, etc.), but the threads use a common address space. Suited for multicore processors.
I Message passing programming
In message passing programming, processes are used for the various pieces of the parallel program; they run on physical or logical processors. Each of the processes has its own (private) address space.


Thread programming

Programming multicore processors is tightly connected to parallel programming with a shared address space and to thread programming. There are a number of environments for thread programming:

I Pthreads (Posix threads)

I Java threads

I OpenMP

I Intel TBB (Threading Building Blocks)

In this lecture we deal with OpenMP, which is the most commonly used of these in HPC. For comparison, a minimal Pthreads sketch is shown below.
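Here is that minimal Pthreads sketch (illustrative only; it uses the standard POSIX calls pthread_create and pthread_join and a hypothetical worker function hello). Compile with gcc -pthread.

#include <pthread.h>
#include <stdio.h>

void* hello(void* arg)
{
    long id = (long)arg;
    printf("Hello from thread %ld\n", id);
    return NULL;
}

int main(void)
{
    pthread_t threads[4];
    for (long i = 0; i < 4; i++)
        pthread_create(&threads[i], NULL, hello, (void*)i);   /* fork */
    for (long i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);                       /* join */
    return 0;
}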

Processes vs. threads

Processes

I A process is an instance of a program that is executing more or less autonomously on a physical processor.

I A process contains all the information needed to execute the program.

I process ID
I program code
I actual value of the program counter
I actual content in registers
I data on run time stack
I global data
I data on heap

Each process has its own address space.

I This information changes dynamically during process execution.

Processes (cont.)

I If compute resources are assigned to another process, the status of the present (to be suspended) process has to be saved, so that the execution of the suspended process can be resumed at some later time.

I This (time-consuming) operation is called a context switch.

I It is the basis of multitasking, where processes are given time slices in a round-robin fashion.

I In contrast to single (scalar) processors, on multiprocessor systems the processes can actually run in parallel.

Threads

I The thread model is an extension of the process model.

I Each process consists of multiple independent instruction streams (called threads) that are assigned compute resources by some scheduling procedure.

I The threads of a process share the address space of this process. Global variables and all dynamically allocated data objects are accessible by all threads of a process.

I Each thread has its own run time stack, registers, program counter.

I Threads can communicate by reading / writing variables in the common address space. (This may require synchronization.)

I We consider system threads here, but not user threads.


Synchronization

I Threads communicate through shared variables. Uncoordinated access to these variables can lead to undesired effects.

I If, e.g., two threads T1 and T2 increment a variable by 1 or 2, the result of the parallel program depends on the order in which the shared variable is accessed. This is called a race condition.

I To prevent unexpected results, the access to shared variables must be synchronized.

I A barrier is set to synchronize all threads. All threads wait at the barrier until all of them have arrived there. (A sketch is given below.)
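A minimal sketch of a barrier (not from the slides; the array data and the neighbour computation are illustrative): each thread first writes its own slot, and only after the barrier may it safely read a slot written by another thread.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int data[64] = {0};                    /* assumes at most 64 threads */

    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nt = omp_get_num_threads();
        data[id] = id * id;                /* phase 1: each thread writes its slot */

        #pragma omp barrier                /* wait until all threads have written  */

        int right = data[(id + 1) % nt];   /* phase 2: read a neighbour's value    */
        printf("thread %d sees %d\n", id, right);
    }
    return 0;
}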

Synchronization (cont.)

I Mutual exclusion ensures that only one of multiple threads can access a critical section of the code (e.g., to increment a variable). Mutual exclusion serializes the access to the critical section. (See the sketch after this list.)

I Synchronization can be very time consuming. If it is not done right, much waiting time is spent. (E.g., if the load is not balanced well among the processors.)
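A sketch of the race described above and one way to repair it (not from the lecture; the loop count is arbitrary): unprotected increments of a shared counter can interleave and lose updates, whereas a critical section, or an atomic update, serializes the access.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    long counter = 0;

    #pragma omp parallel
    {
        for (int i = 0; i < 1000000; i++) {
            /* An unprotected "counter++;" would be a race condition: the
               final value would depend on how the threads interleave.    */
            #pragma omp critical
            counter++;            /* mutual exclusion: one thread at a time */
            /* Alternative: #pragma omp atomic before the increment.        */
        }
    }
    printf("counter = %ld\n", counter);   /* deterministic: nthreads * 1000000 */
    return 0;
}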

Parallel Programming Concepts

I The new generation of multicore processors gets its performance from the multitude of cores on a chip.

I Unlike the situation until recently, the programmer has to take action to be able to exploit the improved performance.

I Techniques of parallel programming have been known for years in scientific computing and elsewhere.

I What is new: parallel programming has reached the mainstream.

I The essential step in programming multicore processors is in providing multiple streams of execution that can be executed simultaneously (concurrently) on the multiple cores.

I Let us introduce a few concepts and notions.

Design of parallel programs: 1. Partitioning

I Basic idea of parallel programming: generate multiple instruction streams that can be executed in parallel. If we have a sequential code available, we may parallelize it.

I In order to generate independent instruction streams we partition the problem that we want to solve into tasks. Tasks are the smallest units of parallelism.

I The size of a task is called granularity (fine/coarse grain).

Design of parallel programs: 2. Communication

I Tasks may depend on each other in one way or another. They may access the same data concurrently (data dependence) or they may need to wait for another task to finish (flow dependence), as its results are needed.

I Both dependences may be translated into communication in a message passing environment.

Design of parallel programs: 3. Scheduling

I The tasks are mapped (aggregated) on threads or processes that are executed on the physical compute resources which can be processors of a parallel machine or cores of a multicore processor.

I The assignment of tasks to processes or threads is called scheduling. In static scheduling the assignment takes place before the actual computation. In dynamic scheduling the assignment takes place during program execution. (A sketch of both variants, using the OpenMP schedule clause discussed later in this lecture, follows below.)
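A sketch of the two variants in OpenMP terms (the schedule clause is treated in detail at the end of this lecture; expensive_work is a hypothetical function whose cost varies with i): with static scheduling the iteration-to-thread assignment is fixed before the loop starts, while with dynamic scheduling idle threads grab the next chunk at run time, which helps when iterations have unequal cost.

#include <omp.h>

extern double expensive_work(int i);   /* hypothetical, cost varies with i */

void demo(double *y, int n)
{
    /* static: iterations are assigned to threads before execution */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        y[i] = expensive_work(i);

    /* dynamic: each idle thread fetches the next chunk (here 16
       iterations) at run time -- better load balance for irregular work */
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < n; i++)
        y[i] = expensive_work(i);
}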

Design of parallel programs: 4. Mapping

I The processes or threads are mapped on the processors/cores. This mapping may be explicitly done in the program or (mostly) by the operating system.

I The mapping should be done in a way to minimize communication and balance the work load.


OpenMP

OpenMP is an application programming interface that provides a parallel programming model for shared memory and distributed shared memory multiprocessors. It extends programming languages (C/C++ and Fortran) by

I a set of compiler directives to express shared memory parallelism. (Compiler directives are called pragmas in C.)

I runtime library routines and environment variables that are used to examine and modify execution parameters.

There is a standard include file omp.h for C/C++ OpenMP programs. OpenMP is becoming the de facto standard for parallelizing applications for shared memory multiprocessors. OpenMP is independent of the underlying hardware or operating system.

OpenMP References

There are a number of good books on OpenMP:

I B. Chapman, G. Jost, R. van der Pas: Using OpenMP. MIT Press, 2008. Easy to read. Examples in both C and Fortran (but not C++).
I R. Chandra, R. Menon, L. Dagum, D. Kohr, D. Maydan, J. McDonald: Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, 2001. Easy to read. Much of the material is in Fortran.
I S. Hoffmann, R. Lienhart: OpenMP. Springer, Berlin, 2008. Easy to read. C and C++. In German. Available online via NEBIS.

There is an OpenMP organization with most of the major computer manufacturers and the U.S. DOE ASCI program on its board, see http://www.openmp.org.

Execution model: fork-join parallelism

I OpenMP is based on the fork-join execution model.

I At the start of an OpenMP program, a single thread (master thread) is executing.

I If the master thread arrives at a compiler directive #pragma omp parallel that indicates the beginning of a parallel section of the code, then it forks the execution into a certain number of threads.

I At the end of the parallel section the execution is joined again in the single master thread.

I At the end of a parallel section there is an implicit barrier. The program cannot proceed before all threads have reached the barrier.

Execution model: fork-join parallelism II

I Master thread spawns a team of threads as needed

I From a programmer’s perspective, parallelism is added incrementally: i.e., the sequential program evolves into a parallel program (in contrast to MPI distributed memory programming).

I However, parallelism is limited with respect to the number of processors.

Execution model: fork-join parallelism III

I This is (at least in theory) dynamic thread generation. The number of threads may vary from one parallel region to the other. The number of threads that are generated can be determined by functions from the runtime library or by environment variables, e.g.
setenv OMP_NUM_THREADS 4
(In static thread generation, the number of threads is fixed a priori from the start.) A sketch of the different ways to set the thread count is given after this list.

I In practice, a number of threads may be initiated at the start of the program. But (slave) threads become active only at the beginning of a parallel region.
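A sketch (using standard OpenMP runtime routines) of the three ways to choose the number of threads: the OMP_NUM_THREADS environment variable, the omp_set_num_threads() library call, and the num_threads clause on an individual parallel directive. Each thread prints one line per region.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* 1. Default taken from the environment, e.g. OMP_NUM_THREADS=4 */
    #pragma omp parallel
    printf("region 1: %d threads\n", omp_get_num_threads());

    /* 2. Runtime library call; affects subsequent parallel regions */
    omp_set_num_threads(2);
    #pragma omp parallel
    printf("region 2: %d threads\n", omp_get_num_threads());

    /* 3. Clause on the directive; affects only this region */
    #pragma omp parallel num_threads(3)
    printf("region 3: %d threads\n", omp_get_num_threads());

    return 0;
}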

Some OpenMP demo programs

OpenMP hello world

#include <stdio.h>
#include <omp.h>

int main()
{
#pragma omp parallel
  {
    printf("Hello world\n");
  }
}

gcc compiler

OpenMP programs can be compiled by the GNU compiler

gcc -o hello hello.c -fopenmp -lgomp

When -fopenmp is used, the compiler generates parallel code based on the OpenMP directives encountered. A macro _OPENMP is defined that can be checked with the preprocessor directive #ifdef. -lgomp links the library of the GNU OpenMP project.

A more complicated Hello world demo

The following little program calls functions from the OpenMP runtime library to get the number of threads and the id of the current thread.

/* Modified example 2.1 from Hoffmann-Lienhart */
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(int argc, char* argv[])
{
#ifdef _OPENMP
  printf("Number of processors: %d\n", omp_get_num_procs());

#pragma omp parallel
  {
    printf("Thread %d of %d says \"Hello World!\"\n",
           omp_get_thread_num(), omp_get_num_threads());
  }
#else
  printf("OpenMP not supported.\n");
#endif

  printf("Job completed.\n");

  return 0;
}

It is important to note that the order in which the threads actually execute is not always the same; it depends on the run. This does not matter here, of course. But if it did, there would be a race condition.

OpenMP parallel control structures

In the OpenMP fork/join model, the parallel control structures are those that fork (i.e., start) new threads. There are just two of these:

1. The parallel directive is used to create multiple threads of execution that execute concurrently. The parallel directive applies to a structured block, i.e., a block of code with one entry point at the top and one exit point at the bottom of the block. (Exception: exit().)

2. Further constructs are needed to divide work among an existing set of parallel threads. The for directive, e.g., is used to express loop-level parallelism.


An example for loop-level parallelism (back to axpy)

1. The sequential program

for (i=0; i<N; i++) {
  y[i] = alpha*x[i] + y[i];
}

2. OpenMP parallel region (note: here every thread redundantly executes the whole loop)

#pragma omp parallel
{
  for (i=0; i<N; i++) {
    y[i] = alpha*x[i] + y[i];
  }
}


3. OpenMP parallel region (assumes Nthrds divides N)

#pragma omp parallel
{
  int id, i, Nthrds, istart, iend;
  id = omp_get_thread_num();
  Nthrds = omp_get_num_threads();
  istart = id*N/Nthrds;
  iend = (id+1)*N/Nthrds;
  for (i=istart; i<iend; i++) {
    y[i] = alpha*x[i] + y[i];
  }
}


4. OpenMP parallel region combined with a for-directive

#pragma omp parallel
#pragma omp for
for (i=0; i<N; i++) {
  y[i] = alpha*x[i] + y[i];
}

5. OpenMP parallel region combined with a for-directive and a schedule clause

#pragma omp parallel for schedule(static)
for (i=0; i<N; i++) {
  y[i] = alpha*x[i] + y[i];
}

For-construct with the schedule clause

The schedule clause affects how loop iterations are mapped onto threads.

I schedule(static [,chunk])
Deal out blocks of iterations of size “chunk” to each thread.
I schedule(dynamic [,chunk])
Each thread grabs “chunk” iterations off a queue until all iterations have been handled.
I schedule(guided [,chunk])
Threads dynamically grab blocks of iterations. The size of the block starts large and shrinks down to size “chunk” as the calculation proceeds (guided self-scheduling).
I schedule(runtime)
Schedule and chunk size defined by the OMP_SCHEDULE environment variable.

Timings with varying chunk sizes in the for

#pragma omp parallel for schedule(static,chunk_size)
for (i=0; i<N; i++) {
  y[i] = alpha*x[i] + y[i];
}

chunk size   p = 1      2      4      6      8     12     16
N/p           1674    854    449    317    239    176     59
100           1694   1089    601    405    317    239    166
4             1934   2139   1606   1294    850    742    483
1             2593   2993   3159   2553   2334   2329   2129

Table 1: Some execution times in µsec for the saxpy operation with varying chunk size and processor number p.

Here, chunks of size chunk_size are cyclically assigned to the processors. In Table 1, timings on the HP Superdome are summarized for N = 1000000.

How to measure time (walltime.c)

double t0, tw;
...
t0 = walltime(&tw);
for (ict=0; ict
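The walltime() helper from walltime.c is not shown on the slide, and the loop above is truncated. A minimal alternative (a sketch, assuming the standard OpenMP routine omp_get_wtime() and the saxpy loop from the timing experiment) would be:

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double x[N], y[N];
    double alpha = 2.0;

    double t0 = omp_get_wtime();            /* wall-clock time in seconds */

    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        y[i] = alpha*x[i] + y[i];

    double t1 = omp_get_wtime();
    printf("saxpy took %g s\n", t1 - t0);
    return 0;
}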
