Introduction to OpenMP

HW & SW for High Performance Computing

Master's Degree in Advanced Computing for Science and Engineering

FIM - 2012/13 Vicente Martín v0.0 Contents

● Overview: Architectures and Tools.

● OpenMP. – The place of OpenMP among other HPC programming paradigms. – Memory and execution model. – Environment. – Directives.

Taxonomy (figure): MPI, OpenMP, UPC, HPF.

Example: cc-NUMA

Silicon Graphics Origin 2000: two processors per node sharing a common hierarchical memory. Caches are kept coherent (cc). The global shared memory space is unique. Access to memory banks physically located on remote boards goes through an interconnection network. The worst-case access time to local, on-board memory is 318 ns. If the machine has 64 nodes, the worst-case access is 1067 ns.

Example: Multicomputer

Tarzan: Nodes 5-6: four-processor PPC 604e SMPs. Nodes 1-4: Power2 SC uniprocessors.

Each node of the IBM SP2 is a workstation with its own OS controlling its separate address space. Access to a memory position belonging to another node requires communication between two independent OSs through explicit message passing, in this case helped by specialised HW: the High Performance Switch.

Machines available for the practical sessions.

● Magerit. – Multicomputer. Cluster architecture made up of 245 nodes with 16 Power7 cores each. Linux SLE. 8 TB RAM / 200 TB disk under GPFS.

● Triqui 1-2-3-4. Intel SMP, 8 CPUs (2 quadcores). Linux (main machine for OpenMP).

CeSViMa: Magerit

● MPI practical sessions (limited OpenMP)
● Sustained performance > 70 Tflops
● Multicomputer: Linux SLE
  – Infiniband QDR: MPI interprocess communication.
  – Gigabit Ethernet: file system (GPFS) + management.
● 245 compute nodes
  – 16 Power7 cores, 32 GB RAM per node.
  – 3920 cores, ~8 TB of RAM and 147 TB of local disk.
● Disks
  – 192 TB under GPFS
  – 256 SATA HDs x 750 GB
  – Distributed (16 Power5 servers)
  – Fault tolerant: RAID5 + hot spare.
● Interactive nodes:
  – ssh [email protected]
● Computational nodes:
  – Under SLURM-Moab.
● Compilers:
  – IBM XL C/C++ (xlc, xlC), Fortran 77/90/95 (xlf, xlf90, xlf95) and thread-safe versions (_r commands).
  – GNU compilers (gcc, g++, g77).
  – MPI wrappers (mpicc, mpiCC, mpif77, mpif90). Backends are the IBM XL compilers by default.
  – MPI: Lamm, Glenn Messages.
● A job definition file is needed in order to submit a job.
  – (see Magerit docs, http://static.cesvima.upm.es/doc/manual/Magerit-GuiaUsuarios.pdf)
  – [email protected]
● SLURM: Simple Linux Utility for Resource Management
  – Scalable cluster management and job scheduling system for Linux clusters.
  – A SLURM daemon (slurmd) runs on every node, under the control of a central daemon (slurmctld).
  – Useful commands: sacct, salloc, sattach, sbatch, sbcast, scancel, scontrol, sinfo, smap, squeue, srun, strigger and sview.
  – QoS based.

SLURM-Moab basic commands

– jobcancel: Deletes a job from the execution queue.
– jobcheck: Shows detailed information about a queued job.
– jobq: Shows the state of the user's jobs in the system.
– jobstart: Gives an estimate of the starting time of the job.
– jobsubmit: Sends a job to the system for its execution.

● A job definition file is needed in order to submit a job (see Magerit docs):

  #!/bin/bash
  #------------ Start job description ------------
  #@ group = [project_id]
  #@ class = [class_name]
  #@ initialdir = //projects/[project_id]/[data_dir]
  #@ output = res/[programoutfile].out
  #@ error = res/[programerrfile].err
  #@ total_tasks = [number of tasks]
  #@ wall_clock_limit = [hh:mm:ss]
  #------------- End job description -------------
  #-------------- Start execution ----------------
  # Run our program
  srun ./[myprogram]
  #--------------- End execution -----------------

● Triqui 1-2-3-4:
  – 8 cores per node.
  – Intel compilers (icc, ifort): use the -openmp switch for OpenMP.
  – Located in /opt/intel/Compiler/11.1/069

● Look for the exact place: it changes with compiler versions.

● There is a Documentation directory.

● You have to source the files (with the source or . commands):

● iccvars.sh (C language, Bourne shell; there is also a .csh version) and ifortvars.sh (idem for Fortran), in /opt/intel/Compiler/11.1/069/bin, with the argument intel64. – Place the commands in .bash_profile if you don't want to repeat them each time. – GNU compilers: use the -fopenmp switch.

● Triqui 1-2-3-4:

● Examples: Intel compiler.

– ifort -free -openmp sourcefile.f -o compiled
– icc -openmp sourcefile.c -o compiled

● Examples: GNU compiler.

– gcc -fopenmp sourcefile.c -o compiled
– gfortran -fopenmp -ffree-form sourcefile.f -o compiled

Compilers available in:
● Magerit:
  – XL compilers (xlf, xlc): use the switch -qsmp=omp and the thread-safe libraries by calling the compilers through the scripts xlf_r and xlc_r.
  – GNU compilers: -fopenmp (check the OpenMP support level for each architecture).
  – NOTE: Magerit nodes are 16-core SMPs, hence the ideal speedup is a factor of 16.

● Examples: IBM Xl compilers.

– xlc_r -qsmp=omp sourcefile.c -o compiled
– xlf_r -qsmp=omp sourcefile.f -o compiled
– xlf77_r, xlf90_r ...
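As a quick sanity check of any of these toolchains, a minimal C program along these lines (a sketch, not one of the course listings; the file name is up to you) can be built with the switches shown above:

  #include <stdio.h>
  #include <omp.h>

  int main(void)
  {
      /* Every thread of the team executes this statement. */
      #pragma omp parallel
      printf("Hello from thread %d of %d\n",
             omp_get_thread_num(), omp_get_num_threads());
      return 0;
  }

For instance: gcc -fopenmp hello.c -o hello or icc -openmp hello.c -o hello on the Triqui nodes, xlc_r -qsmp=omp hello.c -o hello on Magerit.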

Multiprocessors examples: shared memory, NUMA access.

(Figures: SGI Origin 2000 [Laudon 98], Cray T3D.)

Parallel programs always have overheads:

● Communications and Synchronizations: Typically this is the most important overhead.

● Non-optimal algorithm: the parallel algorithm may not be as efficient as the sequential one.

● Parallel SW overhead: extra cost of the parallel implementation (e.g.: calculations associated with the domain decomposition).

● Load balancing: task migration, context switches, etc. needed to balance the load among different threads/processors/nodes. These contributions, correctly expressed, are additive.

● Speedup factor S: ratio between the time taken by a sequential machine and the time taken for the same calculation on a parallel one:

  S = T_seq / T_conc

● Define T_conc(N) as the time taken by a parallel machine with N nodes. Usually we will be interested in:

T conc 1 S N = T conc  N 

● Ideally, the value of S is N. – This is linear speedup. – There could be superlinear speedup... (usually an effect of the increased aggregated cache in a parallel machine.) ● Efficiency: Speedup per processor.

S = N ● The ideal value of  would be 1. Two limiting situations for the speedup factor.

(Figure: speedup saturation due to parallel overhead, usually communications.)

(Curves: fixed subdomain (grain) size vs. fixed problem size.)

Domain Decomposition and Communications Overhead.

● How we decompose the problem domain affects the parallel overhead through communications.

● Example: balanced first-neighbours algorithm in 2D.
  – Two-dimensional square domain with nN = √(nN) × √(nN) nodal points.
  – Sets of n nodal points (the grain size) are distributed to each of the N processors.
  – The algorithm only needs to know the values at each neighbouring point. How do we divide the problem domain?

● The amount of communications will be proportional to the length of the border between subdomains.

● Consider two possibilities:
  – N rectangles of size √(nN) × √(n/N)
  – N squares of size √n × √n

● The length of the border, for big enough N, will be:

  – Rectangular case: l_rec = N √(nN)
  – Square case: l_sqr = 2N √n

● The communications ratio between them:

  l_rec / l_sqr = √N / 2

– Using a rectangular domain decomposition we will perform approximately √N times more communications than in the square case: it will be less efficient and less scalable.
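A small C sketch of the border-length estimate above (the values of n and N are illustrative, not taken from any machine described here):

  #include <stdio.h>
  #include <math.h>

  int main(void)
  {
      double n = 10000.0;                   /* grain size: nodal points per processor */
      double N;                             /* number of processors                   */

      for (N = 4; N <= 1024; N *= 4) {
          double l_rec = N * sqrt(n * N);   /* strip (rectangular) decomposition */
          double l_sqr = 2.0 * N * sqrt(n); /* square subdomain decomposition    */
          printf("N=%5.0f  l_rec/l_sqr = %6.2f  (sqrt(N)/2 = %6.2f)\n",
                 N, l_rec / l_sqr, sqrt(N) / 2.0);
      }
      return 0;
  }

The printed ratio grows as √N/2, which is why the strip decomposition scales worse.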

Amdahl, Gustafson and Scalability

● In 1967 Amdahl argued that if a program has an intrinsically sequential part, s, and a parallel one, p (s + p = 1), then using N processors the maximum speedup would be:

s p 1 S= = p p s  s N N – According to this, if we have 1024 processors and the sequential part is just 0.5% the maximum speedup would be 168!!... A sequential part of 5% would limit it to less than 20... wikipedia ● In 1988 Gustafson argued that s and p, were not constant for just the program itself and that the correct values to use would be s' and p', the values measured in the parallel system. According to this, a sequential system that would substitute the parallel one, would take s'+p'N. Then, the speedup would be: s' p' N  S= =N1−N ×s' s' p' 

– Resembling much more the fixed subdomain size curves that we had...
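Both laws are easy to check numerically; the following C sketch only reproduces the figures quoted above (the 1024-processor case is just the example used in the text):

  #include <stdio.h>

  /* Amdahl: fixed problem size, s measured on the serial run.    */
  static double amdahl(double s, double N)     { return 1.0 / (s + (1.0 - s) / N); }

  /* Gustafson: s' measured on the parallel run (scaled speedup). */
  static double gustafson(double sp, double N) { return N + (1.0 - N) * sp; }

  int main(void)
  {
      double N = 1024.0;
      printf("Amdahl,    s  = 0.5%%: S = %.1f\n", amdahl(0.005, N));    /* ~168 */
      printf("Amdahl,    s  = 5%%  : S = %.1f\n", amdahl(0.05, N));     /* <20  */
      printf("Gustafson, s' = 5%%  : S = %.1f\n", gustafson(0.05, N));  /* ~973 */
      return 0;
  }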

● Scalability has to be studied for a given implementation in a given machine. MPI

● De facto standard for the Message Passing model. Agreed among companies, universities and other research centers.

● This is a multicomputer-oriented model (there are implementations in shared memory machines)

● Version 1.0 in 1994, 2.0 in 1997: parallel I/O, dynamic creation and destruction of tasks, one-sided communication, etc.

● The programmer has to worry about all the parallel details: MPI only solves the part closest to the HW communications, not the message-passing logic.

● Bindings for F77, C, C++, f90 (with caveats...), even Java, Python...

OpenMP

● First specification in 1998. Version 2.0 (f90) in 2000, C/C++ in 2002. Version 2.5 (f90/C/C++) in 2005. Agreed mainly by HW vendors but also users. Version 3.0 in May 2008 (major change: tasks).

● API for shared memory computers. Focused on getting reasonable efficiency with minimal investment (no parallel-specific source code).

● The original, serial, source code is “annotated” with compiler directives and conditional compilation statements. Also environment variables.

● F77, f90, C and C++ bindings.

● Allows for an "incremental" parallelization. An intermediate level between the HPF "high level" and the MPI "low level".

http://www.openmp.org, www.compunity.org

UPC

● ANSI C extension with PGAS ideas (Partitioned Global Address Space): shared memory with information about what is local and what is not.

● Again, a Consortium: Universities, laboratories and manufacturers. 1.0 Spec in 2001, 1.1 in 2003, 3.0 in 2008.

● Execution threads + thread-private memory + global memory divided into partitions + affinity relationships among threads and partitions (the partition resides in the logical memory space of the thread).

● Shared pointers with access to the whole global memory space. Possibility to specify relaxed or strict memory consistency when accessing positions shared among several threads.

HPF

● A complete data-parallel programming language (SPMD model).

● Aims: good performance, portability, compatibility with the base language (f95). Coping with both shared and distributed memory machines.

● It enters the source program through directives (except for some new intrinsics).

● Specification 1.0 in 1993, spec. 2.0 in 1997. Most implementations are subset 1.1 (nowadays it is a mostly abandoned language; we see it here very briefly in order to better understand parallel languages).

● Very complex compilation process. Performance difficult to predict... This killed the language.

Examples

● To give an idea of these paradigms, their underlying philosophy, complexity and performance, the following programs calculate the integral below using a Riemann sum:

1 4 dx =∫0 2 1x – In the HPF case, several function implementations -all of them correct- are used in order to study the different performance. (the example uses the inlined one) program main MPI include program "mpif.h" main double include precision "mpif.h" PI25DT double precision PI25DT parameter (PI25DT = 3.141592653589793238462643d0) do 20 i = myid+1, n, numprocs double parameter precision (PI25DT mypi =, pi, 3.141592653589793238462643d0) h, sum, x, f, a do 20 i = myid+1, n, numprocs double precision mypi, pi, h, sum, x, f, a x = h * (dble(i) - 0.5d0) double precision starttime, endtime sum x = = h sum* (dble(i) + f(x) - 0.5d0) integer double n, precision myid, numprocs starttime,, i, endtime ierr sum = sum + f(x) integer n, myid, numprocs, i, ierr 20 continue c function to integrate 20 mypi continue = h * sum c f(a) = 4.d0 / (1.d0 + function a*a) to integrate mypi = h * sum f(a) = 4.d0 / (1.d0 + a*a) c collect all the partial sums c call MPI_REDUCE(mypi,pi,1,MPI_DOUBLE_PRECISION, collect all the partial sums call MPI_INIT(ierr) call MPI_REDUCE(mypi,pi,1,MPI_DOUBLE_PRECISION, call MPI_INIT(ierr) & MPI_SUM,0, MPI_COMM_WORLD,ierr) call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr) c & node MPI_SUM,0, 0 prints the MPI_COMM_WORLD,ierr) answer. call call MPI_COMM_SIZE(MPI_COMM_WORLD, MPI_COMM_RANK(MPI_COMM_WORLD, numprocs, myid, ierr) ierr) c node 0 prints the answer. call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr) endtime = MPI_WTIME() if endtime (myid .eq. = MPI_WTIME() 0) then 10 if ( myid .eq. 0 ) then ifprint (myid *, 'pi.eq. is 0)', pi,then ' Error is', abs(pi - PI25DT) 10 print if ( myid *, 'Enter .eq. the0 ) thennumber of intervals: (0 quits) ' print *, 'pi is ', pi, ' Error is', abs(pi - PI25DT) print *, 'Enter the number of intervals: (0 quits) ' print *, 'time is ', endtime-starttime, ' seconds' read(*,*) n endif print *, 'time is ', endtime-starttime, ' seconds' endif read(*,*) n endif endif goto 10 c broadcast n 30 callgoto MPI_FINALIZE(ierr) 10 c starttime = MPI_WTIME() broadcast n 30 call MPI_FINALIZE(ierr) starttime = MPI_WTIME() stop call MPI_BCAST(n,1,MPI_INTEGER,0, end stop & call MPI_BCAST(n,1,MPI_INTEGER,0, MPI_COMM_WORLD,ierr) end c & check MPI_COMM_WORLD,ierr) for quit signal c if ( n .le. 0 ) goto 30 check for quit signal c if ( n .le. 0 ) goto calculate 30 the interval size c h = 1.0d0/n calculate the interval size sum h = =1.0d0/n 0.0d0 sum = 0.0d0 Program Integral Program Integral

OpenMP version:

  Program Integral
  ! Riemann sum, OpenMP version.
  Integer(Kind(1)) :: n,i
  Real( Kind(1.D0)) :: w, x, suma, pi, a
  Real( Kind(1.D0)) :: f              ! type of the statement function below
  Integer :: InitialClock, FinalClock, TicksPerSecond
  f(a) = 4.0D0/(1.0D0 + a*a)          ! function to integrate

  Print *,' Number of intervals='
  Read *,n
  w = 1.0d0/n
  suma = 0.0d0
  Call System_Clock(InitialClock)
  !$OMP PARALLEL DO PRIVATE(x), SHARED(w), REDUCTION(+: suma)
  Do i=1, n
     x = w * (i-0.5D0)
     suma = suma + f(x)
  End Do
  Call System_Clock(FinalClock,TicksPerSecond)
  Print *,' Segundos :', Float(FinalClock-InitialClock)/(Float(TicksPerSecond))
  Pi = w * suma
  Print *,' Pi= ' , Pi
  End

UPC version:

  /* Copyright (C) 2000 Chen Jianxun, Sebastien Chauvin, Tarek El-Ghazawi */
  #include <upc_relaxed.h>   /* default mode, not strictly necessary */
  #include <math.h>
  #include <stdio.h>
  #define N 32767
  #define f(x) 1/(1+x*x)

  upc_lock_t l;
  shared float pi;

  void main(void)
  {
     float local_pi=0.0;
     int i;
     upc_forall(i=0; i<N; i++; i)
        local_pi += (float) f((.5+i)/(N));
     local_pi *= (float) (4.0 / N);

     upc_lock(&l);
     pi += local_pi;
     upc_unlock(&l);
     upc_barrier;

     if (MYTHREAD == 0) printf("PI = %f\n", pi);
  }

HPF version (fragment):

  !HPF$ INDEPENDENT
  ForAll ( i=1:n ) funcion(i) = 4.0D0/(1.0D0+ (w * (i-0.5D0))**2)
  suma = SUM(funcion)

  Call SYSTEM_CLOCK(FinalClock)
  Print *,' Segundos :', &
     Float(FinalClock-InitialClock)/Float(TicksPerSecond)
  Pi = w * suma
  Print *,' Pi= ' , Pi
  End PROGRAM Integral

  PURE Function fv(a)
     Real(Kind(1.D0)), Dimension(:), Intent(IN) :: a
     Real(Kind(1.d0)), Dimension(Size(a)) :: fv
     fv = 4.0D0/(1.D0 + a*a)
  End Function fv

OpenMP Execution Model

● Execution threads. Each process can be executed by several threads, each one with its own control flow (including private data) but sharing the same address space.

● Fork/join. A master thread starts the execution. It can start other threads that, in turn, can start other threads, thus becoming the masters of a new team of threads inside a parallel region. The forked threads join at the end of the parallel region.
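A minimal C sketch of the fork/join model (illustrative only): the master thread runs alone before and after the parallel region, and a team exists only inside it. The team size depends on the environment (e.g. OMP_NUM_THREADS).

  #include <stdio.h>
  #include <omp.h>

  int main(void)
  {
      printf("Before the region: %d thread(s)\n", omp_get_num_threads()); /* 1 */

      #pragma omp parallel                      /* fork: a team is created */
      {
          printf("Thread %d of %d inside the parallel region\n",
                 omp_get_thread_num(), omp_get_num_threads());
      }                                         /* join: only the master continues */

      printf("After the region: %d thread(s)\n", omp_get_num_threads());  /* 1 */
      return 0;
  }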

Memory Model

● Shared memory, relaxed consistency. – Every thread has access to the shared memory, but its view is temporary: in order to improve speed, consistency among the views of all the threads at every moment is not enforced. Besides this, each thread has a private memory space (threadprivate) that is not accessible by other threads.

● If a thread becomes the master of a team, its private variables will be visible from all of the team members unless declared again as private. – Consistency between central memory and the thread's view is not guaranteed. It can be enforced with FLUSH.

OpenMP: Structure.

● OpenMP enters a program through directives, specific run-time functions (that can be conditionally compiled) and environment variables that can modify the program's behaviour at execution time.
  – !$OMP… Fortran sentinel (free form)
  – !$OMP, C$OMP, *$OMP … Fortran sentinel (fixed form)
  – !$ … Fortran sentinel for conditional compilation.
  – #pragma omp … (C language, case sensitive)
  – _OPENMP macro for conditional compilation using #ifdef.

OpenMP: Structure.

● Examples:
  – !$OMP PARALLEL DO PRIVATE
  – !$OMP END DO NOWAIT
  – !$ NT = OMP_GET_NUM_THREADS()
  – setenv OMP_NUM_THREADS 8
  – #pragma omp parallel for
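A small C sketch of conditional compilation with the _OPENMP macro (the Fortran analogue would use the !$ sentinel); this is an illustration, not one of the course listings:

  #include <stdio.h>
  #ifdef _OPENMP
  #include <omp.h>          /* only included when compiling with OpenMP */
  #endif

  int main(void)
  {
      int nthreads = 1;     /* sensible default for a serial build */

  #ifdef _OPENMP
      #pragma omp parallel
      {
          #pragma omp master
          nthreads = omp_get_num_threads();
      }
  #endif

      printf("Running with %d thread(s)\n", nthreads);
      return 0;
  }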

Environment Variables.

● Set in the shell that starts the execution; changes are ignored once the run has started. Always uppercase, the case of the argument is ignored. Calls to the equivalent library functions supersede the environment variable values.
  – OMP_SCHEDULE: controls the work distribution in loops. It can be static, dynamic or guided. A numerical parameter can be given to specify, for example, the block (chunk) size to share in a static distribution.

● The default value is implementation specific.

● e.g.: setenv OMP_SCHEDULE static,100
  – OMP_NUM_THREADS: controls the number of execution threads started in a parallel region.
  – OMP_DYNAMIC: allows the runtime to adjust dynamically the number of execution threads started in a parallel region in order to optimize resource usage. It can be set to true or false. The default value is implementation dependent.
  – OMP_NESTED: true or false. Controls the expansion of a thread into a team of threads within nested parallel regions. The default value is false.
  – OMP_MAX_ACTIVE_LEVELS: beyond this maximum number of active levels, nested parallel regions do not expand into a thread team.
  – OMP_THREAD_LIMIT: maximum number of threads allowed (e.g.: 1024).
  – OMP_STACKSIZE: stack size in Kbytes (typically 4-8 MB).
  – OMP_WAIT_POLICY: active or passive; whether waiting threads should go to sleep or spin, consuming cycles but doing nothing.
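As a sketch of how these variables reach a program (names and sizes are illustrative): the loop below uses schedule(runtime), so OMP_SCHEDULE selects the distribution at launch time, while OMP_NUM_THREADS fixes the team size.

  #include <stdio.h>
  #include <omp.h>

  #define N 1000000

  int main(void)
  {
      static double a[N];
      int i;

      /* Team size taken from OMP_NUM_THREADS (or omp_set_num_threads).  */
      /* Iteration distribution taken from OMP_SCHEDULE at run time.     */
      #pragma omp parallel for schedule(runtime)
      for (i = 0; i < N; i++)
          a[i] = 2.0 * i;

      printf("threads available: %d, a[N-1] = %f\n",
             omp_get_max_threads(), a[N - 1]);
      return 0;
  }

e.g.: setenv OMP_SCHEDULE "dynamic,1000" and setenv OMP_NUM_THREADS 8 before running, following the setenv examples above.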

● During execution time, this behaviour can be modified by using library calls: – i.e.:

● omp_set_dynamic ● omp_set_num_threads – Calls have preference over environment variables. OpenMP: Directives

● Parallel Region: PARALLEL

● Work Sharing: DO (Fortran), for (C/C++), SECTIONS, SINGLE, WORKSHARE (f90)

● Synchronization: MASTER, CRITICAL, BARRIER, ATOMIC, FLUSH, ORDERED.

● Clauses accepted for the directives: – Data environment: default, THREADPRIVATE – Data scoping: PRIVATE, SHARED, DEFAULT, FIRSTPRIVATE, LASTPRIVATE, REDUCTION, COPYIN, COPYPRIVATE

● Other: orphaning, tasks.

● Parallel Region:
  !$OMP PARALLEL [ clause[[,] clause] ...]
  [code block]
  !$OMP END PARALLEL

  #pragma omp parallel [ clause[[,] clause] ...]
  { [code block] }

– [code block] will be executed by a team of threads. The master thread ID is 0. There can't be jumps in the program from/to [code block], although parallel regions can be nested.
– There are qualifiers (clauses) that specify the behaviour of the variables inside a block:

● (FIRST)PRIVATE (list): Variables in (list) are private to each thread. They are initialized to the value of the original variable if FIRSTPRIVATE is used. – Do/for index variables are considered as private by default. ● SHARED(list): Variables in (list) are common to all of the threads.

● REDUCTION ((operator/intrinsic):list): Performs a reduction operation on the (list) variables using the specified binary operator/intrinsic. – Operator: it can be +,*,-,.and.,.or.,.eqv.,.neqv. – Intrinsic: it can be MAX, MIN, IAND, IOR, IEOR.

● When finishing a parallel region there is an implicit sync. Only the master thread continues the execution.

● An if clause can be used to actually parallelize or not the region: #pragma omp parallel if (n>threshold)... ● DEFAULT (private|firstprivate|shared|none): Variable default type.
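A hedged C sketch combining several of the clauses above (default(none) forces every variable to be scoped explicitly; n and threshold are made-up values):

  #include <stdio.h>
  #include <omp.h>

  int main(void)
  {
      int n = 100000, threshold = 1000, i;
      double sum = 0.0, x;

      /* The region is only run in parallel when the problem is big enough. */
      #pragma omp parallel for if(n > threshold) default(none) \
              shared(n, threshold) private(x) reduction(+:sum)
      for (i = 0; i < n; i++) {
          x = (i + 0.5) / n;
          sum += 4.0 / (1.0 + x * x);   /* same Riemann sum as in the examples */
      }
      printf("pi ~ %f\n", sum / n);
      return 0;
  }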

● COPYIN (list) Copies the value of the master's threadprivate (list) variables to the threadprivate variables of the other members of the team in a parallel region. – Executed after the team of threads has been created and prior to the execution of the parallel region (global variables are made private to each thread).

● COPYPRIVATE (list) Only for the SINGLE directive. Broadcasts, using a private variable, the (list) values from the data environment of one implicit task to the data environments of the other tasks in the same parallel region. – i.e.: use one thread to read data and then use this data to initialize the private data of the rest of the threads. – Cannot be used with NOWAIT.

● Work sharing: Do/for loops; a set of loop iterations (indexes) is assigned to each available thread.
  !$OMP DO [specifications]
  [Do Loop]
  !$OMP END DO [NOWAIT]

  #pragma omp for [specifications]
  [for Loop]

– The way in which the distribution is done can be modified with SCHEDULE. – (FIRST)(LAST)PRIVATE and REDUCTION can also be applied. – At the end of the distribution directive there is an implicit sync among threads that can be avoided with a NOWAIT clause.

● Nested loops can be executed in parallel, without nested parallel regions, using the collapse clause. – The compiler creates a single loop out of those (innermost) loops indicated by the specification in the clause:

  !$OMP parallel do collapse(2)
  Do i=1, ilim, ijump
     Do j=1, jlim, jjump
        Do k=1, klim, kjump
           ....
        End do
     End Do
  End Do

Example

  #include <stdio.h>
  void main(void)
  {
     int i;
     int n=1000;
     int a[1000];

  /* 1 #pragma omp parallel shared (n,a) private (i) */
  #pragma omp parallel shared (n,a)
     {
  /* 1 #pragma omp for */
  #pragma omp for lastprivate (i)
        for (i=0; i<n; i++)
           a[i] = i+1;
     }
     printf("a(%d) = %d\n", i, a[i-1]);   /* uses the lastprivate value of i */
  }

Result: a(1000)=1000. Is this correct? Why? Test and discuss the first set of pragmas.
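Separately, the SCHEDULE and NOWAIT clauses described earlier can be sketched in C as follows (arrays and bounds are illustrative; skipping the first barrier is safe only because the two loops are independent):

  #include <stdio.h>
  #include <omp.h>

  #define N 10000

  int main(void)
  {
      static double a[N], b[N];
      int i;

      #pragma omp parallel
      {
          /* No barrier at the end of this loop: threads that finish early */
          /* move on to the next work-sharing construct.                   */
          #pragma omp for schedule(static, 100) nowait
          for (i = 0; i < N; i++)
              a[i] = i * 0.5;

          /* Independent of a[], so skipping the implicit barrier is safe. */
          #pragma omp for schedule(dynamic, 100)
          for (i = 0; i < N; i++)
              b[i] = i * 2.0;
      }
      printf("a[N-1]=%f  b[N-1]=%f\n", a[N-1], b[N-1]);
      return 0;
  }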

Storage Association

● The state of private variables is undefined on entry to and exit from a parallel region.

● Private variables are created per thread at the start of the parallel region, there is no storage association with the variable of the same name outside of the region.

● Firstprivate and lastprivate clauses are used to overcome this.

  #include <stdio.h>
  void main(void)
  {
     int i,B,C;
     int A=5;
     int n=1000;

  #pragma omp parallel private (i,A,B)
  /* 2 #pragma omp parallel */
     {
  #pragma omp for
  /* 2 #pragma omp for private (i) firstprivate (A) lastprivate (C) */
        for (i=0; i

Sample results:
● Run exactly as written in the box:
  Inside parallel region:
    A=1
    B=500
  Inside parallel region:
    A=13686955
    B=13687954
  Outside parallel region:
    A=5
    B=134514507

● SECTIONS and DO are usually joined to a PARALLEL !$OMP PARALLEL SECTIONS [clauses] !$OMP PARALLEL DO [clauses] #pragma omp parallel for [clauses] ● There is certain degree of control on what is executed by each thread:

● The [code block] is executed by only one thread in the team (not necessarily the master. The assignment of work to thread number is non-deterministic. The rest of threads wait at a barrier.) : !$OMP SINGLE[(FIRST)PRIVATE(list)] [code block] !$OMP END SINGLE [COPYPRIVATE(LIST), NOWAIT] #pragma omp single [(first)private(list), copyprivate(list), nowait] [code block]

● [code block] is executed only by the master thread; the rest skip the block and continue. There is no implicit sync at the end.
  !$OMP MASTER
  [code block]
  !$OMP END MASTER

  #pragma omp master
  [code block]

● There is a certain degree of control on what is executed by each thread:

● Critical Region: [code block] is accessed by only one thread at a time. Each thread waits at the beginning of the block till no other thread is executing the block. !$OMP CRITICAL [code block] !$OMP END CRITICAL #pragma omp critical [code block]

● Atomic update ensures that the memory position marked as atomic will be accessed by only one thread at a time. It affects all of the threads executing the program and acts only on the statement immediately following the ATOMIC declaration.
  !$OMP ATOMIC
  Expression-statement

  #pragma omp atomic
  Expression-statement

● !$OMP BARRIER Establishes a barrier point.

– Only threads in the team where the barrier is located are affected. e.g.: a barrier within the innermost region of a nested parallel region affects only the team of threads executing that innermost parallel region.
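A C sketch contrasting these synchronization constructs (the variables are illustrative, not from the course examples):

  #include <stdio.h>
  #include <omp.h>

  int main(void)
  {
      int ticks = 0;
      double best = -1.0;

      #pragma omp parallel
      {
          double mine = (double) omp_get_thread_num();

          #pragma omp single
          printf("one thread announces the start\n");  /* implicit barrier after */

          #pragma omp atomic
          ticks++;                      /* one memory update, done atomically */

          #pragma omp critical
          {                             /* larger block, one thread at a time */
              if (mine > best)
                  best = mine;
          }

          #pragma omp barrier           /* everyone waits here */

          #pragma omp master
          printf("ticks=%d best=%f\n", ticks, best);   /* master only, no barrier */
      }
      return 0;
  }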

● !$OMP FLUSH(list) Ensures that each thread that reaches this statement syncs the values in (list) (if no list is specified, it refers to all of the non-private variables of the thread) in its temporary view with those in the shared memory.

More about data scoping.

● Private variables are undefined on entry to and exit from the parallel region. They have no storage association with the variable of the same name outside the parallel region.

● THREADPRIVATE[list] makes [list] private to each thread. It can be used on its own (without a Do/for, etc.) – Global variables (common blocks, statics) are replicated so that each thread has a private copy. By default threadprivate copies are not allocated or defined.

● Data copying clauses: – COPYIN[list]: at the start of a parallel region the values of the variables in [list] of the master thread are copied to the corresponding threadprivate copies of the rest of the threads. – COPYPRIVATE
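A short C sketch of THREADPRIVATE combined with COPYIN (the counter variable is made up): the global variable is replicated per thread and copyin initializes every copy from the master's value on entry to the region.

  #include <stdio.h>
  #include <omp.h>

  int counter = 100;                     /* global: one copy per thread */
  #pragma omp threadprivate(counter)

  int main(void)
  {
      counter = 42;                      /* master's copy */

      #pragma omp parallel copyin(counter)
      {
          /* Every thread starts from the master's value (42)... */
          counter += omp_get_thread_num();
          /* ...and keeps its own private copy from now on.      */
          printf("thread %d: counter = %d\n", omp_get_thread_num(), counter);
      }
      return 0;
  }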

More about data scoping.

● LASTPRIVATE[list] makes [list] behave as if declared PRIVATE, but the thread executing the last iteration (or last section) of the work-sharing construct updates the variable that existed before the construct with its private value.

Some functions in OMP_LIB

● Use OMP_LIB or include "omp_lib.h" (Fortran), or include <omp.h> (C, C++).

● Many of the functions set/get the corresponding variable in the Internal Control Variable (ICV) set.

– omp_set_num_threads: set the number of threads.
– omp_get_num_threads: get the number of threads in the team.
– omp_get_thread_num: thread identifier.
– omp_get_max_threads: maximum number of threads allowed in a parallel region.
– omp_get_num_procs: number of processors of the machine as reported by the OS.

Some functions in OMP_LIB

– omp_set(get)_dynamic: sets (reports) the capability of dynamic thread adjustment.
– omp_in_parallel: tests if it is inside a parallel region.
– omp_set(get)_nested: allows (tests) whether nested parallel regions are allowed (the actual expansion of the innermost parallel regions is implementation dependent; the maximum depth can also be controlled).

– omp_get_wtime: wall clock time.
– omp_get_wtick: clock resolution in seconds per tick.

Some functions in OMP_LIB

– omp_get_thread_limit: maximum number of threads allowed for the whole program.
– omp_set(get)_schedule: sets the loop scheduling used when the runtime schedule option is given; gets the scheduling in use.
– omp_set(get)_max_active_levels: sets (gets) the maximum number of active parallel regions allowed.
– omp_get_level: returns the number of nested parallel regions enclosing the task that contains the call.
– omp_get_active_level: returns the number of nested, active parallel regions enclosing the task that contains the call.
– omp_get_ancestor_thread_num: for a given nested level, returns the thread number of the ancestor of the current thread.
– omp_get_team_size(level): number of threads in the team at that level.

Some functions in OMP_LIB

● Lock Routines

– omp_init_lock – omp_destroy_lock – omp_set_lock / omp_unset_lock – omp_test_lock

● And the corresponding nested (nest_lock) versions.
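A brief C sketch of the lock routines (the shared total is illustrative); the nest_lock variants would be used when the same thread may need to acquire the lock more than once:

  #include <stdio.h>
  #include <omp.h>

  int main(void)
  {
      omp_lock_t lock;
      int total = 0;

      omp_init_lock(&lock);

      #pragma omp parallel
      {
          int mine = 1 + omp_get_thread_num();   /* some per-thread result */

          omp_set_lock(&lock);                   /* blocks until acquired  */
          total += mine;                         /* protected update       */
          omp_unset_lock(&lock);
      }

      omp_destroy_lock(&lock);
      printf("total = %d\n", total);
      return 0;
  }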

Parallelize in HPF or in OpenMP?

  Program vida
  Implicit none
  ! Integer :: i, n, numgen, vivos
  Integer, parameter :: n=100, numgen = 1000
  Integer :: i, vivos
  Integer :: tinicial, tfinal, numticks
  ! Integer, Allocatable :: tablero(:,:), vecinos(:,:)
  Integer, Dimension(n,n) :: Tablero, Vecinos
  !!HPF$ PROCESSORS red(2,2)
  !!HPF$ PROCESSORS linea(4)
  !!HPF$ Distribute tablero(BLOCK,BLOCK) ONTO RED
  !!HPF$ Distribute vecinos(BLOCK,BLOCK) ONTO RED
  !HPF$ Distribute (*,block) :: tablero, vecinos

  ! Print *,' Board size:'
  ! Read *, n
  ! Print *,' Number of generations:'
  ! Read *, numgen
  ! Allocate (tablero(n,n))
  ! Allocate (vecinos(n,n))

  ! Initial board:
  tablero = 0
  tablero(:,n/2) = 1
  tablero(n/2,:) = 1
  vivos = sum(tablero)

  Call System_Clock(tinicial, numticks)

  ! numgen generations are performed.
  Do i=1, numgen

     ! Compute the number of neighbours of each cell.
     vecinos = tablero + cshift(tablero,shift=-1,dim=1) &
             + cshift(tablero,shift=+1,dim=1)
     vecinos = vecinos + cshift(vecinos,shift=-1,dim=2) &
             + cshift(vecinos,shift=+1,dim=2)
     vecinos = vecinos - tablero

     ! Update the board and compute the total number of live cells.
     Where(vecinos==3) tablero=1
     Where(vecinos<2.or.vecinos>3) tablero=0

  End Do

  vivos = sum(tablero)
  Call System_Clock(tfinal)
  print *,'vivos =',vivos
  print *,' tiempo CPU (segs.)=', Float(tfinal-tinicial)/Float(numticks)
  End

OpenMP alternative for the board update and the count of live cells:

  !$OMP PARALLEL DO REDUCTION(+:vivos)
  Do j=1, n
     Do i=1, n
        if((vecinos(i,j).lt.2).or.(vecinos(i,j).gt.3)) then
           tablero(i,j) = 0
        else if (vecinos(i,j).eq.3) then
           tablero(i,j) = 1
        end If
        if (tablero(i,j).eq.1) vivos = vivos+1
     End Do
  End Do

  Call system_clock(tfinal)
  Call CPU_Time(rtfinal)
  Print *,' vivos al final =', vivos
  Print *,' segs =', float(tfinal-tinicial)/Float(nticks)
  Print *,' segs CPU=', rtfinal-rtinicial
  End

Memory Model in OpenMP

  Program NoWait
  ! Illustration of variables with an undefined state on exit from a parallel
  ! construct in OpenMP.
  ! NOTE: for the program to make sense, the accumulation variable n should be
  ! declared private to each thread and the per-thread contents added up at the
  ! end. Left undeclared, by default it is taken as SHARED.
  ! In contrast, nc is correctly declared as a reduction variable.

  Integer :: n, nc, i

  ! Loop 1: END DO + END PARALLEL
  n=0
  nc=0
  !$OMP PARALLEL
  !$OMP DO REDUCTION(+:nc)
  !!$OMP DO PRIVATE(nc)          ! also a wrong alternative
  DO i=1, 1000
  !!$OMP ATOMIC
     n = n+i
     nc = nc+i
  End Do
  !$OMP END DO
  !$OMP END PARALLEL
  Print *, 'n, nc in loop 1, after END PARALLEL=',n, nc

  ! Loop 2: END DO + END PARALLEL + FLUSH
  n=0
  nc=0
  !$OMP PARALLEL
  !$OMP DO REDUCTION(+:nc)
  DO i=1, 1000
     n = n+i
     nc = nc+i
  End Do
  !$OMP END DO
  !$OMP END PARALLEL
  !$OMP FLUSH
  Print *, 'n, nc in loop 2, after END PARALLEL+FLUSH=',n, nc

  ! Loop 3: printed after END DO, before END PARALLEL
  n=0
  nc=0
  !$OMP PARALLEL
  !$OMP DO REDUCTION(+:nc)
  DO i=1, 1000
     n = n+i
     nc = nc+i
  End Do
  !$OMP END DO
  Print *, 'n,nc in loop 3, after END DO, before END PARALLEL=',n,nc
  !$OMP END PARALLEL

  ! Loop 4: END DO NOWAIT, printed before and after END PARALLEL
  n=0
  nc=0
  !$OMP PARALLEL
  !$OMP DO REDUCTION(+:nc)
  DO i=1, 1000
     n = n+i
     nc = nc+i
  End Do
  !$OMP END DO NOWAIT
  Print *, 'n,nc in loop 4, after END DO NOWAIT, before END PARALLEL=',n,nc
  !$OMP END PARALLEL
  Print *, 'n,nc in loop 4, after END PARALLEL=',n,nc

  ! Loop 5: END DO NOWAIT + FLUSH, printed before and after END PARALLEL
  n=0
  nc=0
  !$OMP PARALLEL
  !$OMP DO REDUCTION(+:nc)
  DO i=1, 1000
     n = n+i
     nc = nc+i
  End Do
  !$OMP END DO NOWAIT
  !$OMP FLUSH
  Print *, 'n,nc in loop 5, after END DO NOWAIT+FLUSH, before END PARALLEL=',n,nc
  !$OMP END PARALLEL
  Print *, 'n,nc in loop 5, after END DO NOWAIT+FLUSH, after END PARALLEL=',n,nc

  ! Loop 6: END DO NOWAIT + BARRIER, printed before and after END PARALLEL
  n=0
  nc=0
  !$OMP PARALLEL
  !$OMP DO REDUCTION(+:nc)
  DO i=1, 1000
     n = n+i
     nc = nc+i
  End Do
  !$OMP END DO NOWAIT
  !$OMP BARRIER
  Print *, 'n,nc in loop 6, after END DO NOWAIT+BARRIER, before END PARALLEL=',n,nc
  !$OMP END PARALLEL
  Print *, 'n,nc in loop 6, after END DO NOWAIT+BARRIER, after END PARALLEL=',n,nc

  Stop
  End

● OpenMP is relatively easy to learn and to use, producing applications with the desired level of performance in shared memory computers...

– … but in some cases it is very difficult to get good performance. This is especially true for algorithms that require complex interactions among threads...
– … which are also difficult to debug.

● OpenMP also “hides” the machine from the user: The OpenMP support in a given platform, both from the HW architecture perspective and from the OS/Compiler/run-time environment, noticeably affects its performance.

● OpenMP is evolving: it has been very successful up to now and it is being adapted to new architectures, specifically HW accelerators (i.e.: GPUs).

Bibliography

● Architectures, general notions:

– B. Wilkinson. Computer Architecture, Design and Performance. Prentice Hall, 1996.
– Communications of the ACM vol. 35, No. 8, 1992 (special issue).
– R. W. Hockney and R.C. Jesshope. Parallel Computers II, Adam Hilger, 1988.
– G. Fox et al. Solving Problems on Concurrent Computers Vol. I. Prentice-Hall, 1988.
– I. Foster. Designing and Building Parallel Programs. Concepts and Tools for Parallel Software Engineering. Addison-Wesley, 1995 (web version at http://www.mcs.anl.gov/dbpp).
– J. Dongarra ed. Sourcebook of Parallel Computing. Morgan Kaufmann, 2003.
– I. Foster, C. Kesselman (Eds.). The GRID: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers. (Chapter 2 is at http://www.globus.org/research/papers/chapter2.pdf)
– The Globus Project. http://www.globus.org
– UK Grid Support. http://www.grid-support.ac.uk
– R. Thompson. Grid Networking. http://www.owenwalcher.com/grid_networking.htm
– D. Jones and S. Smith. Operating Systems. Central Queensland University, 1997. http://infocom.cqu.edu.au
– G. Wellein et al. Itanium1/Itanium2 First Experiences. HPC Services, Regionales Rechenzentrum Erlangen, 2003.
– J. Morris. Computer Architecture. 1998. http://ciips.ee.uwa.edu.au/~morris/Courses/CA406
– Cray X1 System Overview (S-2346-22), at http://www.cray.com
– A.J. van der Steen. Overview of Recent Supercomputers (13th edition, 2003). Netherlands National Computing Facilities Foundation, 2004.
– M.S. Schmalz. Organization of Computer Systems. University of FL, 2001. http://www.cise.ufl.edu/~mszz/CompOrg/
– J. Laudon and D. Lenoski. The SGI Origin: A cc-NUMA Highly Scalable Server. Silicon Graphics Inc., 1998.
– J. Ammon. Hypercube Connectivity within cc-NUMA Architecture. Silicon Graphics Inc., 1998.

● Others:

– Top500. http://www.top500.org

● OpenMP:
– The OpenMP Architecture Review Board. OpenMP Fortran Application Program Interface, Version 3.0, May 2008. http://www.openmp.org
– Chapman, Jost, van der Pas. Using OpenMP. MIT Press, 2008.
– AA. VV. OpenMP: A Proposed Industry Standard API for Shared Memory Programming, 1997. http://www.openmp.org
– AA. VV. OpenMP, a Parallel Programming Model for Shared Memory Architectures. Edinburgh Parallel Computing Centre, 1998. http://www.epcc.ed.ac.uk/epcc-tec/documents

● SHARED[list] marks the memory positions occupied by [list] as a shared area. The fact that all threads share a given set of memory positions does not guarantee that a change made by one would be seen instantly by all of the threads, since every thread can have a different temporary view. This is true only after a sync, which can be enforced by using a FLUSH.

● REDUCTION: variables entering this clause must be SHARED, although each thread treats them as PRIVATE, using a variable local to each thread that is initialized according to the operator: 0 if it is +, 1 if it is *, .TRUE. if it is .AND., etc.