3. Programming Memory-Coupled Systems


Parallel and High-Performance Computing
3. Programming Memory-Coupled Systems
Hans-Joachim Bungartz

Appearances

• symmetric multiprocessors (SMP):
  – processors of the same type
  – connected to a shared global memory via a bus or a crossbar
• distributed (virtual) shared-memory systems (DSM/VSM):
  – a shared global address space
  – physically distributed memory
• properties:
  – simpler to program
  – granularity: from program to block level
• several models, depending on physical arrangement of memory


Appearances (cont’d)

• Uniform Memory Access (UMA):
  – same memory access behaviour of all processors to the one shared memory
  – same access times of all processors to all stored data
  – no distinction of local or remote memory
  – local caches frequent

  – examples: SGI Power Challenge, Sun SPARCstation 10 and 20
• Non-Uniform Memory Access (NUMA):
  – one shared global address space
  – but physically distributed memory units
  – different access times depending on location of the data (local or remote)
  – often even a hierarchy of access times due to network topology
  – example: CM-5 (fat-tree topology)


Appearances (cont’d)

• Cache-Coherent Non-Uniform Memory Access (CC-NUMA):
  – all access traffic done via local cache
  – cache coherence ensured system-wide
  – examples: SGI Origin, HP/Convex SPP series
• Cache-Only Memory Architecture (COMA):
  – special case of CC-NUMA
  – all memory treated as cache, data are migrating
  – examples: KSR-1, KSR-2
• Non-Cache-Coherent Non-Uniform Memory Access (NCC-NUMA):
  – remote access not done via cache
  – cache coherence has to be ensured explicitly
  – examples: Cray T3D and T3E


3.1. Cache Coherence

Definitions

• problem: independent accesses of different processors with local caches to shared data may cause validity problems (several simultaneous copies or instances of the data)
• cache coherence:
  Cache coherence is obtained if the results of a parallel program behave as if there were a total ordering of all memory accesses satisfying:
  1. This total order is consistent with the program order for accesses to that memory unit from any processor.
  2. The value returned by a ‘READ’ is the last value written in the total ordering – the system must not provide outdated values.
• consistency:
  A system is consistent if all existing copies of a memory word (in main memory and caches) are identical.
• How do inconsistencies occur?
  – A change in a cache is not immediately reflected in main memory (copy-back or write-back policy, in contrast to a write-through policy).
• system-wide permanent consistency is expensive
• temporary inconsistencies can be tolerated if cache coherence is ensured
• for that: cache-coherence protocols:

  – write-update protocol: modification of one copy leads to modifications of all other copies (before the next access at the latest)
  – write-invalidate protocol: modification of one copy causes all other copies to be declared ’invalid’

A Cache-Coherence Strategy: Bus Snooping

• bus snooping: a famous and widespread strategy
• field of application: SMP with local caches, connected to the shared main memory via a bus
• principle: all processors tap the bus; they snoop the bus for addresses put on the bus by the other processors
• If a processor notices an address that is also available in its local cache, the following steps are executed:
  – In case of a detected ‘WRITE’ and a non-modified local copy, the local copy is invalidated.
  – In case of a detected ‘READ’ or ‘WRITE’ and a modified local copy, the bus transfer is interrupted. First, the local copy is written to main memory (a direct cache-to-cache transfer is not that frequent); then, the interrupted transfer is continued.
• Hence, cache coherence (the temporal order) is ensured!
• suitable cache-coherence protocol for bus snooping: MESI protocol


MESI Protocol

• MESI protocol: standard protocol for bus snooping
• cache-coherence protocol of the write-invalidate type
• each block in each cache is assigned one of four possible states:
  – exclusive modified: there has been a ‘WRITE’ modification, but the block is the only copy in any of the caches
  – exclusive unmodified: there have been only ‘READ’ accesses to this block, and the block is the only copy in any of the caches
  – shared unmodified: there is more than one copy in the different caches, but only ‘READ’ accesses so far
  – invalid: the values in the local cache block have been declared invalid
• any kind of action may lead to a state transition:
  – data is needed locally (for a ‘READ’ or a ‘WRITE’) and may be available locally or not
  – a data transfer with local copies involved is snooped on the bus


MESI Protocol (cont’d)

[figure: state-transition diagram of the MESI protocol]


MESI Protocol (cont’d)

• legend:
  – RH (Read Hit): the data needed locally is available locally
  – RMS (Read Miss Shared): the data needed locally is not available locally, and there exist other copies
  – RME (Read Miss Exclusive): the data needed locally is not available locally, but there exist no other copies
  – WH (Write Hit): the data to be modified locally is available locally
  – WM (Write Miss): the data to be modified locally is not available locally
  – SHR (Snoop Hit on a Read): an address of the block is snooped on the bus in a ‘READ’ request
  – SHW (Snoop Hit on a Write or Read-with-intent-to-modify): an address of the block is snooped on the bus in a ‘WRITE’ request
  – dirty line copy back: interrupt bus transfer; the interrupting processor stores its copy to main memory before restarting the interrupted transfer
  – invalidate transaction: the other processors are informed of a modification and caused to invalidate their copies
  – Read-with-intent-to-modify: causes invalidation of potential other copies
  – cache line fill: fill cache with missing data


MESI Protocol: Example 1 (READ)

• scenario:
  – some processor wants to read some invalidated data in its local cache
  – hence, we have an RM (Read Miss), and the data is transferred from main memory
• three possible cases:
  – the cache block was not in any other processor’s cache: an RME is done, the cache block is loaded, and its new state is exclusive unmodified (as long as no other processor loads the block into its cache, and as long as we have only RH’s of the local processor)
  – the cache block is in another processor’s cache, with an exclusive unmodified or shared unmodified state: then, an RMS is done, and all involved cache memories switch this block’s state or attribute to shared unmodified (via the snooping action SHR)
  – if another cache memory owns this block as exclusive modified, the address is detected via bus snooping (SHR), the bus transaction is interrupted, the cache block is written to main memory (dirty line copy back), the state there is set to shared unmodified, and the read operation is repeated


MESI Protocol: Example 2 (WRITE)

• scenario:
  – a processor wants to write some data into its cache
  – if in state exclusive modified: WH, no further action necessary (due to the write-back policy, no immediate copying to main memory)
  – else: put the address on the bus to allow snooping; three possible cases:

   * block not yet in cache, state invalid: send a Read-with-intent-to-modify on the bus; all other caches snoop (SHW) and switch their state to invalid, if it was shared unmodified or exclusive unmodified before (no more direct READ possible); the block is loaded from main memory and gets the attribute exclusive modified; if there was an exclusive modified copy elsewhere, the bus transfer is interrupted and main memory is updated first
   * block in cache, state exclusive unmodified: change state to exclusive modified
   * block in cache, state shared unmodified: send an invalidate transaction on the bus to cause the respective caches to switch their state to invalid, then change state to exclusive modified
  – note that a cache block is never written back without need


Alternative: Cache Coherence by Tables

• bus snooping and the MESI protocol are based on the broadcast abilities of the bus
• typical for SMP, impossible for DSM
• for cache coherence in DSM (CC-NUMA): directory tables:
  – either in hardware, in software, or as a combination of both
  – directory tables are either stored centrally or distributed over the processors (standard)
  – one table for each block of local memory
  – table records whether block has been loaded to the local or to some remote cache
• states, state transitions, and transactions are typically similar to their MESI counterparts (for example, there are now explicit invalidate messages to the involved processors)
• examples: ALLCACHE engine (KSR), LimitLESS (MIT Alewife), DBCCP (Dash)


3.2. Memory Consistency

What Happens on a Monoprocessor?

• all data transfer between registers and cache is executed by the load&store unit of the processor
• only these instructions have an impact on cache coherence
• typical for modern microprocessors: reordering of load&store operations by the load&store unit:

  – load operations (READ) are done immediately where they occur in the program – as soon as possible
  – store operations (WRITE) are postponed and buffered in an internal write buffer (FIFO) – as late as possible
  – objective: improve pipelining properties (a load operation provides needed data, a store operation needs calculated data)
  – control mechanism (synchronization) in order to prevent a READ of outdated (since not yet updated) data
• concept of a non-blocking cache:
  – in case of a cache miss, execution of commands can continue (i.e., the next instruction(s) can be executed without waiting for the successful end of the load command)
  – of course: this is only possible if the next commands do not need the loaded value

• consequence of both strategies: modified order of execution (without impact on the computed result, if synchronized)

Problems on an SMP Multiprocessor

• more complicated:
  – SMP: simultaneous accesses to the same data are possible
  – in spite of cache coherence, the reordering on different processors may have an impact on the overall order of load&store operations
• example: slight modification of Dekker’s algorithm

x := 0; y := 0;
...
process P1:              process P2:
  x := 1;                  y := 1;
  if (y = 0)               if (x = 0)
    { do action A1; }        { do action A2; }

• four possibilities:
  – A1 is executed, A2 is not
  – A1 is not executed, A2 is
  – neither A1 nor A2 is executed
  – both A1 and A2 are executed (possible only with a changed order!)
• cache coherence is not sufficient here!

Problems on a DSM Multiprocessor

• another problem: store instructions are not atomic – the update takes effect earlier on one processor than on the other
• typical for DSM (NUMA) systems: different access times (local – remote)
• hence: a “race” of different memory accesses; a local READ might pass a remote WRITE – although started in the correct order
• directory tables for cache coherence can help here

• generally: a consistency model specifies the order in which memory accesses of one processor are seen by the others
  – sequential consistency: strongest restrictions
  – processor consistency
  – weak consistency
  – release consistency
  – entry consistency


Some Definitions

• let Pi and Pk denote two processors of the system

• a load access of Pi is called completed with respect to Pk at time t, if no store access of Pk can influence the result of Pi’s load access any more

• a store access of Pi is called completed with respect to Pk at time t, if a load access of Pk would result in the value written by Pi
• any memory access is called completed at time t, if it is completed with respect to all processors of the system
• a load access is called globally completed, if both itself and the store access that produced the value read are completed
• Note that a difference between a completed and a globally completed load access is possible only in systems with non-atomic store operations.


Sequential Consistency

• definition:
  A multiprocessor system is called sequentially consistent if the result of an arbitrary computation is always the same as the result obtained on a monoprocessor, with the (or perhaps better ’a’) sequential order of execution given by the program.
• consequences:
  – parallel execution with sequential consistency corresponds to an overlapping sequential execution

  – all memory accesses have to be atomic
  – reorderings of load&store commands are forbidden
  – non-blocking caches are forbidden
  – secure strategy, but often poor efficiency
• sufficient condition for sequential consistency:
  Before any memory access is allowed to be completed with respect to another processor, all preceding load accesses must have been globally completed and all preceding store accesses must have been completed.
• the order of instructions is unmodified on all processors, and this order is globally visible
• sequential consistency does not replace synchronization (for correct access to shared data spaces)!
• not all memory-coupled multiprocessors fulfil sequential consistency

Processor Consistency

• definition:
  A multiprocessor system is called processor consistent if the result of an arbitrary computation is always the same as the result obtained if the instructions on each single processor are executed in the order given by its program.
• weaker than sequential consistency: a globally unique order of all processors’ memory operations is no longer necessary here

• It is possible that Pi sees store operations of Pj and Pk in another order than Pj and Pk do.
• However, Pi’s store operations are seen by all processors in the same order.
• sufficient conditions for processor consistency:
  – Before any load access is allowed to be completed with respect to any other processor, all preceding store accesses must have been completed.
  – Before any store access is allowed to be completed with respect to any other processor, all preceding accesses (load or store) must have been completed.
• hence: no global completion required
• price: the sequentialization of memory accesses may be incorrect (a READ is possible before the WRITE is completed everywhere)
• realized in some multiprocessor systems


Weak Consistency

• so far: synchronization of parallel threads has been neglected
• in practice: the programmer defines and protects critical regions with some synchronization mechanisms (as studied in Chapter 2)
• example (once more): Dekker’s algorithm

mutex m;            process P1:             process P2:
...                   lock(m);                lock(m);
lock(m);              x := 1;                 y := 1;
x := 0;               if (y = 0)              if (x = 0)
y := 0;                 { do action A1; }       { do action A2; }
unlock(m);            unlock(m);              unlock(m);
...

• idea: consistency is ensured at synchronization points only – these are the entry and exit points of critical regions; within critical regions, consistency may be violated (since protected)
• weaker sufficient conditions than before
• however: for synchronization points, sequential consistency is required
• higher potential concerning performance, but also higher responsibility of the programmer

Release Consistency, Entry Consistency

• two more consistency models:
  – release consistency: further weakening
  – entry consistency: still further weakening
• both
  – are highly specialized models
  – have been designed for specific architectures
• there exist a lot of other models, typically defining further weakenings


Summary of Consistency Models

• experience shows: the biggest gain in performance is obtained by the transition from sequential to processor consistency (up to 40 per cent)
• state of the art:
  – some variant of weak consistency (typically not clearly stated by manufacturers)
  – sequential consistency offered by all systems as an option, at least

• hence: careful programming (manual synchronization, critical regions, protection mechanisms) is basically inevitable for explicit parallel programming
• then: no need to know the consistency model (except for hardware optimization purposes)
• if there is no explicit synchronization: there may be non-deterministic characteristics
  – hard to track and understand
  – the result may be false
  – there exist exceptions where the order of memory accesses is not that critical (example: relaxation schemes for the iterative solution of large systems of linear equations)


3.3. Variable Analysis

Analysis of Occurring Variables

• be aware of the different characteristics of the variables used!
• standard scenario: loops, i.e. parallel execution of loop iterations
• concerning the use of variables in loops, we distinguish:
  – local (private) variables: new initialization in each loop iteration
  – shared variables: all others, read or written by more than one loop iteration
• example: matrix-matrix product

  for (i=0; i<n; i++)
    for (j=0; j<n; j++)
      for (k=0; k<n; k++)
        c[i][j] += a[i][k] * b[k][j];

if the outer loop is parallelized, i, j, and k are private and all others are shared variables


Variable Analysis (cont’d)

• the shared variables are further subdivided:
  – independent variables:

   * either variables only read by the loop iterations
   * or variables of an array structure, where each component is accessed by one loop iteration only
  – dependent variables: can be read or written by all loop iterations, a protection mechanism is required

   * reduction variable: a scalar or array variable used at a single position of a single associative and commutative operation only (ADD, MULT, AND, OR, XOR)
     example:

     for (k=0; k<n; k++)
       s += a[k];

3.4. OpenMP

What is OpenMP?

• Techniques for programming distributed-memory systems such as MPI, which will be discussed in Ch. 4, can also be used for programming shared-memory systems. However, using programming language extensions tailored for a shared-memory environment can often improve performance.
• Recently, OpenMP has emerged as a shared-memory standard.
• OpenMP is an API (application programming interface), consisting of a set of compiler directives and a library of support functions that help the compiler to generate multithreaded code that can take advantage of a shared-memory multiprocessor system.
• OpenMP can be used together with Fortran, C, or C++.
• OpenMP can help
  – for working on a dual or quad system;
  – for working on classical SMP systems;
  – for working on hybrid systems, consisting of a (distributed-memory) cluster of (shared-memory) nodes; OpenMP is then most frequently used in combination with MPI.
• We will discuss the most important compiler directives, such as parallel, for, parallel for, sections, parallel sections, critical, and single, as well as the four important OpenMP functions omp_get_num_procs, omp_get_num_threads, omp_get_thread_num, and omp_set_num_threads.

Brief History of OpenMP

• October 1997: development of a standard API for shared-memory programming
• September 1998: C/C++ version 1.0
• April 1999: Fortran version 1.0
• October 1999: OpenMP Architecture Review Board installed
• November 2000: Fortran version 2.0
• April 2002: C/C++ version 2.0



OpenMP Compiler Directives and Functions – Overview

• important compiler directives:
  – parallel: precedes a block of code to be executed in parallel by multiple threads (without a work-sharing construct (see below), the block is executed by each thread)
  – for: precedes a for-loop with independent iterations that may be divided among the threads executing in parallel
  – parallel for: a combination of the above
  – sections: precedes a sequence of blocks that may be executed in parallel
  – parallel sections: a combination of the two single directives
  – single: precedes a code block to be executed sequentially (i.e. by a single thread)
  – critical: precedes a critical section
  for, sections, and single are so-called work-sharing constructs
• important OpenMP functions:
  – omp_get_num_procs: returns the number of CPUs of the system on which this thread is executing (i.e. the number of available CPUs)
  – omp_get_num_threads: returns the number of threads that are active in the current parallel region
  – omp_get_thread_num: returns the thread’s ID
  – omp_set_num_threads: allows one to prescribe the number of threads executing the parallel sections of code

The Underlying Shared-Memory Model

• model: a collection of processors, each with access to the same shared memory
• interaction and synchronization (only in the run-time library) via shared variables
• typical parallel program structure: fork-join model (cf. Sect. 2.4)
  – At the beginning, only a single process (here mostly called thread), the master thread, is active.
  – The master thread executes the algorithm’s sequential parts.

  – Where parallel execution is required, the master thread forks, i.e. creates or awakens additional threads; from then on, all created threads (called a team of threads) work in parallel.
  – At the end of the parallel section, there is a join – the created threads are killed or suspended.
• features:
  – there is dynamic parallelism: the number of active threads may change dynamically throughout the program
  – incremental parallelisation is supported: transforming a sequential program into a parallel one step by step (considered a big advantage)


Parallel for Loops

• for loops as a typical appearance of inherently parallel operations
• OpenMP’s support to indicate when a loop’s iterations may be executed in parallel (no data dependence):
  – principle: tell the compiler the loop’s character, and the compiler will do the rest (code generation for forking/joining the threads and assigning iterations to them)
  – compiler directive: pragmatic information or pragma (may be ignored by the compiler, but can be used by it for parallelisation)
  – general syntax of a pragma:

    #pragma omp <directive> [clause list]

  – simplest form of the parallel for pragma (to be inserted immediately before the loop):

#pragma omp parallel for

– example:

#pragma omp parallel for
for (i=0; i<100; i++)
  a[i] = 0.0;


• requirements to the for loop:
  – canonical shape of the control clause:

    for (index = start; index <op> end; <incr>)

    where <op> is one of <, <=, >=, >, and <incr> is one of

      index++, ++index, index--, --index,
      index += inc, index -= inc,
      index = index + inc, index = inc + index, index = index - inc

  – no early exits, i.e. no break/return/exit statements nor jumps outside the loop


Shared and Private Variables

• Each thread has its own execution context, but the threads share their execution environment (cf. Sect. 2.1).
• Basically, variables may be declared either shared or private.
  – shared variable: has the same address in the context of each thread and can be accessed by each thread
  – private variable: has a different address in each thread’s context and cannot be accessed by any other than the owning thread
• default in the parallel for pragma: shared (apart from the loop index, which is private)
• example on the previous slide: the array a is shared, the integer i is private
• The environment variable OMP_NUM_THREADS contains the default number of threads to be created for parallel execution of code. Under Unix, the printenv and setenv commands can be used to check or modify the current value.
• What’s the difference between setting OMP_NUM_THREADS and using omp_set_num_threads?


Declaring Private Variables

• Consider the following example:

for (i=0; i<m; i++)
  for (j=0; j<n; j++)
    ...   /* work on a[i][j]; the body is truncated in the source */

• Parallel execution of different iterations of the outer loop leads to parallel initialisations and increments of the same shared variable j; hence, some of the iterations of the inner loop might not be executed.
• remedy: declare j to be a private variable, too – with the help of the private clause
• A clause in general is an optional addition to a pragma.
• private clause:
  – directs the compiler to make the listed variables private (i.e., private copies will be allocated for each thread)
  – private copies are accessible only inside the loop
  – their values are undefined on entry and exit of the parallel section
  – i.e.: a value assigned to j before entering the loop is not accessible inside the loop, nor are changes to j during the loop execution visible afterwards outside
• in the above example:

#pragma omp parallel for private(j)
for (i=0; i<m; i++)
  for (j=0; j<n; j++)
    ...

• The firstprivate clause allows threads to inherit initial values: it directs the compiler to create private copies with initial values identical to the master thread’s variable value as the loop is entered.
• example:

x[0] = 0.0;
#pragma omp parallel for private(j) firstprivate(x)
for (i=0; i<m; i++)
  ...

• Conversely, the lastprivate clause copies the value of the private copy from the sequentially last loop iteration back to the master thread’s variable:

#pragma omp parallel for private(j) lastprivate(x)
for (i=0; i<m; i++)
  ...

• Consider the following code which approximates π by performing a suitable numerical integration:

double area, pi, x;
int i, n;
...
area = 0.0;
for (i = 0; i < n; i++) {
  x = (i+0.5)/n;
  area += 4.0/(1.0 + x*x);
}
pi = area/n;

• Simply inserting the pragma

#pragma omp parallel for private(x)

in front of the loop may lead to non-deterministic behaviour: area remains shared, and the unprotected updates of area by different threads may interfere.

The critical Pragma

• Such critical sections can be denoted with the help of the critical pragma. • This pragma directs the compiler to enforce mutual exclusion among the threads trying to execute the block of code following the pragma. • in our above example:

double area, pi, x;
int i, n;
...
area = 0.0;
#pragma omp parallel for private(x)
for (i = 0; i < n; i++) {
  x = (i+0.5)/n;
  #pragma omp critical
  area += 4.0/(1.0 + x*x);
}
pi = area/n;

• However, this will dramatically reduce the speed-up (remember Amdahl’s law: critical sections are pieces of sequential code).


Reductions

• A reduction operation reduces n operands to one single value by always applying the same operation (sum, ...).
• Reductions are so frequent that OpenMP allows one to add a reduction clause to the parallel for pragma.
• The user must specify both the reduction operation and the reduction variable, but everything else is done by OpenMP.
• available reduction operations: sum, product, bitwise AND, bitwise OR, bitwise exclusive OR, logical AND, logical OR
• in our above example:

double area, pi, x;
int i, n;
...
area = 0.0;
#pragma omp parallel for private(x) reduction(+:area)
for (i = 0; i < n; i++) {
  x = (i+0.5)/n;
  area += 4.0/(1.0 + x*x);
}
pi = area/n;

• Note that implementations using the reduction clause are by far faster than those using the critical pragma.

Performance Improvements

• Sometimes, the loops in the sequential code are not well-suited or not optimal for an efficient parallelisation. Then, several measures (manual or OpenMP-supported) can be taken.
• inverting loops (cf. Sect. 6.2): change the order (inner-outer relation) of the loops, if helpful

original loop nest:

for (i = 1; i < m; i++)
    for (j = 0; j < n; j++)
        a[i][j] = 2 * a[i-1][j];

inverted loop nest:

#pragma omp parallel for private(i)
for (j = 0; j < n; j++)
    for (i = 1; i < m; i++)
        a[i][j] = 2 * a[i-1][j];

• conditionally executing loops:
  – for too short loops (too small numbers of iterations), the parallel overhead may even lead to increasing runtimes with increasing numbers of threads involved
  – remedy: direct the compiler to insert code that decides at run-time whether or not the loop should be executed in parallel
  – solution in OpenMP: the if clause in the parallel for pragma
  – in our above example:

#pragma omp parallel for private(x) reduction(+:area) if (n > 5000)
for (i = 0; i < n; i++) {
    ...
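To see that the loop inversion preserves the result, a small self-contained check (array sizes and the helper name inversion_matches are illustrative) can run both loop orders and compare:

```c
#define M  8   /* number of rows (illustrative) */
#define NC 6   /* number of columns (illustrative) */

/* Runs the original (i outer) and the inverted, parallelised (j outer)
 * loop nest on identically initialised arrays and compares the results.
 * The dependence a[i][j] -> a[i-1][j] runs along i only, so distributing
 * the j iterations over threads is safe. */
int inversion_matches(void) {
    double a[M][NC], b[M][NC];
    int i, j;

    for (j = 0; j < NC; j++)
        a[0][j] = b[0][j] = 1.0;

    for (i = 1; i < M; i++)              /* original order */
        for (j = 0; j < NC; j++)
            a[i][j] = 2 * a[i-1][j];

    #pragma omp parallel for private(i)  /* inverted, parallel order */
    for (j = 0; j < NC; j++)
        for (i = 1; i < M; i++)
            b[i][j] = 2 * b[i-1][j];

    for (i = 0; i < M; i++)
        for (j = 0; j < NC; j++)
            if (a[i][j] != b[i][j])
                return 0;
    return 1;
}
```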

• scheduling loops:
  – scheduling: allocation of the iterations of the loop to threads (important if the lengths of the iterations differ much)
  – statically (once for all) or dynamically (successive allocation)
  – OpenMP: schedule clause with various variants
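As a minimal sketch of the schedule clause (the helper work and the chunk size 2 are illustrative), a loop whose iterations have strongly varying cost can be scheduled dynamically; the result is independent of the schedule chosen:

```c
#define N 16

/* Iteration i performs i+1 unit steps, so iteration lengths differ
 * strongly; schedule(dynamic, 2) hands out chunks of two iterations to
 * threads on demand instead of fixing the partition in advance. */
static int work(int i) {
    int s = 0, k;
    for (k = 0; k <= i; k++)
        s += 1;
    return s;
}

int scheduled_sum(void) {
    int i, total = 0;

    #pragma omp parallel for schedule(dynamic, 2) reduction(+:total)
    for (i = 0; i < N; i++)
        total += work(i);

    return total;  /* 1 + 2 + ... + 16 = 136, whatever the schedule */
}
```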



Data Parallelism Beyond the for Loop

• for loops are probably the most common opportunity for parallelism in programs.
• Nevertheless, there is data parallelism outside simple for loops.
• the parallel pragma:
  – precedes a block that should be executed in parallel by all of the threads (use curly braces if the block is not a simple statement)
  – directs the compiler to replicate the respective code block among the threads (unlike the parallel for pragma)
• the for pragma:
  – helps to deal with situations where the parallel for pragma cannot be used (e.g. due to additional statements within the parallel section)
  – often used in combination with the parallel pragma
• the single pragma:
  – tells the compiler that only a single thread should execute the following block


• example:

#pragma omp parallel private(i,j,low,high)
for (i = 0; i < m; i++) {
    low = a[i];
    high = b[i];
    if (low > high) {
        #pragma omp single
        printf("Exiting during iteration %d\n", i);
        break;
    }
    #pragma omp for nowait
    for (j = low; j < high; j++)
        c[j] = (c[j] - a[i])/b[i];
}


Functional Parallelism

• So far, we have discussed the exploitation of data parallelism only.
• However, OpenMP helps to exploit functional parallelism, too.
• The parallel sections pragma precedes a sequence of k blocks that may be executed in parallel by k threads.
• The section pragma precedes each of those k blocks within the surrounding parallel sections pragma.
• example:

#pragma omp parallel sections
{
    #pragma omp section
    x = f1();
    #pragma omp section
    y = f2();
    #pragma omp section
    z = f3();
}

• Sometimes, it is better to use the parallel pragma and the sections pragma instead of the parallel sections pragma (similar to the use of the parallel for, parallel, and for pragmas).
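The sections example above can be sketched as a compilable function (f1, f2, f3 are stand-ins for three independent computations, not from the slides):

```c
static double f1(void) { return 1.0; }
static double f2(void) { return 2.0; }
static double f3(void) { return 3.0; }

/* Each section writes a different shared variable, so the three blocks
 * are independent and may run on up to three threads concurrently. */
double sections_sum(void) {
    double x = 0.0, y = 0.0, z = 0.0;

    #pragma omp parallel sections
    {
        #pragma omp section
        x = f1();
        #pragma omp section
        y = f2();
        #pragma omp section
        z = f3();
    }
    return x + y + z;
}
```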


Combining MPI and OpenMP

• increasing attractiveness of combining MPI (to be discussed in the next chapter) and OpenMP
• reason: more and more architectures are hybrid ones, i.e. mixtures of shared memory and distributed memory, and the hybrid approach often leads to faster program execution
  – clusters: typically use dual- or even quad-processor nodes today
  – supercomputers: constellations, i.e. collections of multiprocessors, are widespread
• principle: create an MPI process (P) for every multiprocessor and create threads (t) to occupy each multiprocessor's CPUs


Discussion of the Hybrid Approach

• Why may the combination of MPI and OpenMP be advantageous?
  – lower communication overhead (only n actively communicating processes instead of nk)
  – sometimes a parallelisation via light-weight processes may be worthwhile, in contrast to heavy-weight processes
• example: let's parallelise a program of 100 sec sequential run-time on a cluster of eight dual-processor nodes:
  – 5% inherently sequential, 90% perfectly parallelisable, 5% in principle parallelisable, but with high communication overhead (hence: not feasible with MPI processes, feasible with OpenMP threads)
  – speed-up for pure MPI (16 processes):

    S(p) ≤ 1 / (0.1 + 0.9/16) = 6.4

  – speed-up for hybrid approach:

    S(p) ≤ 1 / (0.05 + 0.05/2 + 0.9/16) ≈ 7.6
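The two bounds follow directly from Amdahl's law; the small helpers below (function names are illustrative) reproduce the arithmetic:

```c
/* Fractions from the example: 0.05 sequential, 0.90 perfectly parallel,
 * 0.05 parallelisable only with threads; 16 CPUs in total. */

/* pure MPI: the thread-only 5% stays sequential (0.05 + 0.05 = 0.1) */
double speedup_pure_mpi(void) {
    return 1.0 / (0.1 + 0.9 / 16.0);
}

/* hybrid: the 5% runs on the two threads of one dual-processor node */
double speedup_hybrid(void) {
    return 1.0 / (0.05 + 0.05 / 2.0 + 0.9 / 16.0);
}
```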

• higher flexibility concerning load balancing (address situations with idle processes)


3.5. Examples of Memory-Coupled Systems

Sun Multiprocessor Workstations

• a classical UMA-SMP family:
  – SPARCstation 10/514 (4 SuperSPARC processors, since 1993)
  – SPARCstation 20 (2 SuperSPARC processors, since 1994)
  – Ultra1 (since 1995) and Ultra2 (up to 2 UltraSPARC processors, since 1997)
  – SPARCserver 1000E (up to 8 SuperSPARC processors)
  – SPARCcenter 2000E (up to 20 SuperSPARC processors)
  – Enterprise server series (since 1997, up to 64 UltraSPARC II processors)
  – Blade 2000 workstations, Fire server series (UltraSPARC III Cu processors, up to 106 CPUs, up to 1000 CPUs per system), V server series (up to 8 UltraSPARC III Cu processors)
• system bus (older systems) or crossbar interconnect (newer systems)
• write invalidate cache policy, bus snooping
• cache coherence: MOESI protocol (fifth state 'owned')


Threads in Solaris

• Solaris 2.0 came with the development of multiprocessor systems
• new: thread concept
  – first objective: support SMP access to shared memory
  – second objective: support development of parallel programs
  – user threads and kernel threads (operating system)
  – lightweight processes: virtual CPU, layer between CPUs and threads
• synchronization mechanisms for Solaris threads:
  – mutex variables: with functions init, destroy, lock, trylock, unlock
  – condition variables: with functions init, destroy, wait, timedwait, signal, broadcast
  – semaphores: with functions init, destroy, wait, trywait, post
  – R/W locks: protection mechanism for objects with frequent READ and rare WRITE accesses; allows for several simultaneous READs or (XOR) one and only one WRITE


Cray DSM Multiprocessors

• DSM: one logical address space that is physically distributed over the processors
• DSM systems are NUMA architectures
• Cray started its multiprocessor activities in 1989
• primary target application: numerical simulations
• generations:
  – first generation: Cray T3D (up to 2048 processors)
  – second generation: Cray T3E (up to 2048 processors, since 1995)
  – successors: influenced by the takeover of Cray Research by SGI
• core features:
  – NCC-NUMA
  – DEC Alpha processors
  – 3D torus topology
  – combination of virtual cut-through and wormhole routing
  – various synchronization mechanisms


Shared-Memory Computer Amadeus

• installed in 2004 at the Universität Stuttgart
• manufactured by HP
• Intel Itanium2 processors with 1.3 GHz clock speed and a 3 MB cache
• 4 processors with a theoretical peak performance of 20.8 GFlop/s and 32 GB shared memory
• contains an InfiniBand card


