University of Tartu Institute of Computer Science

Parallel Computing

MTAT.08.020 PARALLEELARVUTUSED

[email protected] Liivi 2 - 324

2012/13 Fall term

Practical information about the course

Lectures: Thursdays Liivi 2 - 612 16:15 – Eero Vainikko

Computer classes: Mondays Liivi 2 - 205 at 16:15 – Oleg Batrašev

• 6 EAP

• Exam

In theory, there is no difference between theory and practice. But, in practice, there is.

(Jan L. A. van de Snepscheut)

NB! To be allowed to take the exam, all computer class exercises must be submitted and the deadlines must be met!

The final grade consists of:

1. Computer class exercises 50%

2. Oral exam with the blackboard 40%

3. Active participation at lectures 10%

Course homepage: http://www.ut.ee/~eero/PC/

Course syllabus

Course content (preliminary, subject to change along the way)

1. Motivation for parallel computing

2. Classifications of parallel systems

3. Clos-networks, Beowulf clusters, Grid, Peer-to-peer networks

4. Benchmarks

5. Speedup, efficiency

6. Amdahl’s law

7. Scaled efficiency, ways for increasing the efficiency

8. Fortran95, HPF

9. OpenMP

10. MPI

11. Assessment of parallel algorithms, definition of optimality

12. Main techniques in parallel algorithms:

13. Balanced trees

14. Pointer jumping

15. Divide and conquer

16. Partitioning

17. Pipelining

18. Accelerated Cascading

19. Symmetry breaking

Computer class activities:

1. Fortran 90/95

2. OpenMP

3. Message Passing Interface

4. Solving various parallelisation problems

Literature:

1. A. Grama, A. Gupta, G. Karypis and V. Kumar. "Introduction to Parallel Computing", Second Edition, Addison-Wesley, 2003.

2. L. Ridgway Scott, Terry Clark and Babak Bagheri. "Scientific Parallel Computing", Princeton University Press, 2005.

3. Jack Dongarra, Ian Foster, Geoffrey Fox, William Gropp, Ken Kennedy, Linda Torczon and Andy White. "Sourcebook of Parallel Computing", Morgan Kaufmann Publishers, 2003.

4. I. Foster. "Designing and Building Parallel Programs", Reading, MA: Addison-Wesley, 1995; http://www.mcs.anl.gov/dbpp.

5. Scott, L. Ridgway, Clark, Terry, Bagheri, Babak. "Scientific Parallel Computing", Princeton University Press, 2005.

6. E.Vainikko. "Fortran95 ja MPI", TÜ Kirjastus, 2004.

7. Ed Akin. "Object-Oriented Programming via Fortran 90/95", Cambridge University Press, 2003.

8. A. C. Marshall. Fortran 90 1 Day Course (http://www.liv.ac.uk/HPC/F90Course1DayHomePage.html), [1997].

9. C.-K. Shene. "Fortran 90 Tutorial" (http://www.cs.mtu.edu/~shene/COURSES/cs201/NOTES/fortran.html), Department of Computer Science, Michigan Technological University, [2009].

10. P.S.Pacheco. "Parallel Programming with MPI", Morgan Kaufmann Publ, 1997.

11. Joseph Ja’Ja’. "An Introduction to Parallel Algorithms", Addison-Wesley Publ., 1992.

12. Ralf Steinmetz, Klaus Wehrle (Eds.). "Peer-to-Peer Systems and Applications", LNCS 3485, Springer, 2005.

Introduction

Why do we need parallel computing?

First, form groups of 2! Write down as many reasons as you can in 5 minutes!

Parallel Computing is a part of Computer Science and Computational Sciences (hardware, software, applications, programming technologies, algorithms, theory and practice) with special emphasis on parallel computing or supercomputing

1 Parallel Computing – motivation

The main questions in parallel computing:

• How is interprocessor communication organised?

• How to organise memory?

• Who is taking care of parallelisation?

◦ the CPU? ◦ the compiler? ◦ ... or the programmer?

1.1 History of computing

Pre-history

Parallel computing existed even before electronic computers:

? – (pyramids)
1929 – parallelisation of weather forecasts
≈1940 – parallel computations in the war industry using Felix-like devices
1950 – emergence of the first electronic computers, based on vacuum tubes (ENIAC and others)

1960 – Mainframe era (IBM)

1970 – Era of minicomputers
1980 – Era of PCs
1990 – Era of parallel computers
2000 – Clusters / Grids
2010 – Clouds

1.2 Expert's predictions

Much-cited legends:

1947: computer engineer Howard Aiken predicted that the USA would need at most 6 computers in the future
1950: Thomas J. Watson made a similar claim: 6 computers will be needed in the world
1977: Seymour Cray: the Cray-1 will attract only 100 customers worldwide
1980: a study by IBM: at most about 50 Cray-class computers could be sold per year worldwide

Reality: many of today's homes have more than the processing power of a Cray-*

Gordon Moore's (co-founder of Intel) law (1965): the number of switches doubles every second year. 1975 refinement: the number of switches on a CPU doubles every 18 months.

Continuing this way, by 2020 or 2030 we would reach the atomic level, or the quantum computer! But why? The reason is the speed-of-light limit (see Example 1.2 below!)

Flops:

first computers          10^2    100 Flops (hundred op/s)
desktop computers today  10^9    Gigaflops (GFlops) (billion op/s)
nowadays                 10^12   Teraflops (TFlops) (trillion op/s)
the aim of today         10^15   Petaflops (PFlops) (quadrillion op/s)
next step                10^18   Exaflops (EFlops) (quintillion op/s)

http://en.wikipedia.org/wiki/Orders_of_magnitude_(numbers)

1.3 Usage areas of a petaflop computer

Engineering applications

• Wind tunnels

• Combustion engine optimisation

• High-frequency circuits

• Buildings and mechanical structures (optimisation, cost, shape, safety etc)

• Effective and safe nuclear power station design

• Simulation of nuclear explosions 18 Introduction 1.3 Usage areas of a petaflop computer

Scientific applications

• Bioinformatics

• Computational physics and chemistry (quantum mechanics, macromolecules, composite materials, chemical reactions etc).

• Astrophysics (development of galaxies, thermonuclear reactions, postprocessing of large datasets produced by telescopes)

• Weather modelling

• Climate and environment modelling

• Looking for minerals

• Flood predictions 19 Introduction 1.3 Usage areas of a petaflop computer

Commercial applications

• Oil industry

• Market and monetary predictions

• Global economy modelling

• Car industry (like crash-tests)

• Aviation (design of modern aircrafts and space shuttles )

Applications in computer systems

• Computer security (discovery of network attacks)

• Cryptography 20 Introduction 1.3 Usage areas of a petaflop computer

• Embedded systems (for example, car)

• etc

1.4 Example 1.1

Why are speeds of the order of 10^12 flop/s not enough?

Weather forecast for Europe for the next 48 hours, from sea level up to 20 km: we need to solve a PDE in (x, y, z) and t. The volume of the region: 5000 × 4000 × 20 km³. Step size in the xy-plane: 1 km (Cartesian mesh); in the z-direction: 0.1 km. Timestep: 1 min. Around 1000 flop per mesh point per timestep. As a result, ca

5000 × 4000 × 20 km³ × 10 mesh points per km³ = 4 × 10^9 mesh points

and

4 × 10^9 mesh points × 48 × 60 timesteps × 10^3 flop ≈ 1.15 × 10^15 flop.

If a computer can do 10^9 flop/s (today's PCs), it will take around

1.15 × 10^15 flop / 10^9 flop/s = 1.15 × 10^6 seconds ≈ 13 days!!

But with 10^12 flop/s it takes only

1.15 × 10^3 seconds < 20 min.

It is not hard to imagine "small" changes in the given scheme such that 10^12 flop/s is no longer enough:

• Reduce the mesh step size to 0.1 km in each direction and the timestep to 10 seconds, and the total time would grow from

20 min to 8 days.

• We could replace Europe with a whole-world model (the area grows from 2 × 10^7 km² to 5 × 10^8 km²) and combine the model with an ocean model.

Therefore we must say: the needs of science and technology grow faster than the available computing power; we only need to change ε and h to become unsatisfied again!

But again, why do we need a parallel computer to achieve this goal?

1.5 Example 1.2

Have a look at the following piece of code:

do i = 1, 1000000000000
   z(i) = x(i) + y(i)    ! i.e. 3 * 10^12 memory accesses
end do

Assuming that data travels at the speed of light, 3 × 10^8 m/s, for the operation to finish in 1 second the average distance of a memory cell must be at most:

r = (3 × 10^8 m/s × 1 s) / (3 × 10^12) = 10^-4 m.

Typical computer memory is laid out on a rectangular mesh. The length of each edge would be

2 × 10^-4 m / sqrt(3 × 10^12) ≈ 10^-10 m,

which is the size of a small atom!

But why is parallel computing still not predominant?

Three main reasons: hardware, algorithms and software Hardware: speed of

• networking

• peripheral devices

• memory access

do not keep up with the growth in processor speed.

Algorithms: an enormous number of different algorithms exist, but

• problems start with their implementation on some real life application 26 Introduction 1.5 Example 1.2

Software: development in progress;

• no good enough automatic parallelisers

• much is still done by hand

• no easily portable libraries for different architectures

• does the parallelising effort pay off at all?

BUT (as explained above) parallel computing will take over 27 Introduction 1.6 Example 1.3 1.6 Example 1.3

Solving a sparse system of linear equations using MUMPS 4.3 (a multifrontal solver) on the SUN computer class processors, for a medium-size problem (262144 unknowns, 1308672 nonzeros):

#procs.   time (s)   speedup
 1         84.3
 2         63.0       1.34
 4         53.6       1.57
 8         27.0       3.12
12         59.3       1.47
16         29.5       2.86
20         57.5       1.47
23         74.0       1.14

1.7 Example 1.4

Approximation of π using Monte-Carlo method

#procs.   time (s)   speedup
 1        107.8
 2         53.5       2.01
 4         26.9       4.00
 8         13.5       7.98
12          9.3      11.59
16          6.8      15.85
20          5.5      19.59
21          9.0      11.97
23         29.2       3.69

2 Computer architectures and parallelism

Various ways to classify computers:

• Architecture

◦ Single processor computer ◦ Multicore processor ◦ distributed system ◦ system

• operating system

◦ UNIX, ◦ LINUX, ◦ (OpenMosix) ◦ Plan 9 30 Computer architectures and // 1.7 Example 1.4

◦ WIN* ◦ etc

• usage area

◦ supercomputing ◦ real-time systems ◦ mobile systems ◦ neural networks ◦ etc

But it is impossible to ignore (implicit or explicit) parallelism in a computer or a set of computers.

Part I: Parallel Architectures

2.1 Processor development

Instruction pipelining (similar to car production lines) - performing different sub- operations with moving objects

Instruction throughput (the number of instructions that can be executed in a unit of time) increase 32 // Architectures 2.1 Processor development

Operation overlap. RISC processor pipeline stages:

• Instruction fetch
• Instruction decode and register fetch
• Execute
• Memory access
• Register write back

[Figure: basic five-stage pipeline in a RISC machine (IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back). In the fourth clock cycle, the earliest instruction is in the MEM stage and the latest instruction has not yet entered the pipeline.]

Pipeline length?

Limitations:

• pipeline speed defined by the slowest component

• Usually, each 5-6 operation - branch

• The cost of a false prediction grows with the length of the pipeline (a larger number of sub-operations gets wasted)

One possibility: Multiple pipelines (super-pipelined processors, superscalar execu- tion) - in effect, execution of multiple commands in a single cycle 34 // Architectures 2.1 Processor development

Example 2.1: Two pipelines for adding 4 numbers: 35 // Architectures 2.1 Processor development

True Data Dependency - result of one operation being an input to another

Resource Dependency - two operations competing for the same resource (e.g. pro- cessor floating point unit)

Branch Dependency (or Procedural Dependency) - scheduling impossible in case of if-directive

Instruction Level Parallelism (ILP)

• possibility for out-of-order instruction execution

◦ factor: size of instruction window

• ILP is limited by:

◦ parallelism present in the instruction window ◦ hardware parameters – existence of different functional units 36 // Architectures 2.1 Processor development

· number of pipelines · pipeline lengths · pipeline properties · etc

◦ not possible to always utilise all Functional Units (FU) – if at certain timestep none of FUs is utilised – vertical waste – if only some FUs utilised – horisontal waste

• Typically, processors support 4-level superscalar execution 37 // Architectures 2.1 Processor development

Very Long Instruction Word (VLIW) Processors

• Main design concern with superscalar processors – complexity and high price

• VLIW – based on compile-time analysis - which commands to bind together for ILP

• These commands get packed together into one (long) instruction (giving the name)

• First used in Multiflow Trace machine (ca 1984)

• Intel IA64 - part of the concept being implemented 38 // Architectures 2.1 Processor development

VLIW properties

• Simpler hardware side

• Compiler has more context to find ILP

• But compiler lacks all the run-time information (e.g. data-misses in cache), therefore, only quite conservative approach possible (just-in-time compilers might have benefit here!)

• More difficult prediction of branching and memory usage

• VLIW performance depends highly on compiler abilities, like

◦ loop unrolling ◦ speculative execution ◦ branch prediction etc.

• 4-way to 8-way parallelism in VLIW processors 39 // Architectures 2.2 Memory performance effect 2.2 Memory performance effect

• Memory system, not processor speed appears often as a bottleneck

• 2 main parameters:

◦ latency – the time it takes from a memory request until the data is delivered from the memory
◦ bandwidth – the amount of data transferred per unit of time (between the processor and memory)

Example 2.2: Memory latency

• Simplified architecture: 1 GHz processor

• memory fetching time 100 ns (no cache), blocksize 1 word

• 2 multiply-add units, 4 multiplication or addition operations per clock cycle => peak performance – 4 Gflops.

• Suppose we need to perform a dot product. Each memory access takes 100 clock cycles ⟹ 1 flop per 100 ns ⟹ the actual performance is only 10 Mflops!
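For concreteness, a minimal Fortran sketch of the loop behind this estimate (the array size and names are illustrative assumptions, not from the course notes): every iteration needs two fresh operands from memory, so without a cache the 100 ns fetch time, not the 4 Gflops peak, dictates the speed.

program dot_latency
   implicit none
   integer, parameter :: n = 100000
   real, allocatable  :: a(:), b(:)
   real    :: s
   integer :: i
   allocate(a(n), b(n))
   call random_number(a)
   call random_number(b)
   s = 0.0
   do i = 1, n
      s = s + a(i)*b(i)   ! 2 flop per iteration, but 2 operands must come from memory
   end do
   print *, 's =', s
end program dot_latency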

Memory latency hiding with the help of cache

• Cache – a small but fast memory between the processor and the main (DRAM) memory

• works as low latency, high throughput storage

• helps to decrease real memory latency only if enough reuse of data is present

• part of data served by cache without main memory access – cache hit ratio

• often, the hit ratio is a major factor in performance

Example 2.3: Cache effect

• Cache size: 32 KB with latency 1 ns

• the operation C = AB, where A and B are 32 × 32 matrices (i.e. all three matrices A, B and C fit into the cache simultaneously)

• We observe that:

◦ reading the 2 input matrices into cache (2K words) takes ca 200 µs
◦ multiplying two n × n matrices takes 2n³ operations ⟹ 64K operations, which can be performed in 16K cycles (4 op/cycle), i.e. 16 µs
◦ ⟹ total time (data access + calculation) = 200 µs + 16 µs = 216 µs
◦ this corresponds to a computation rate of 64K operations / 216 µs, or about 303 Mflops

• Access to same data corresponds to temporal locality 43 // Architectures 2.2 Memory performance effect

• In the given example there are O(n²) data accesses and O(n³) computations. Such an asymptotic difference is very favourable for caching.

• Data reuse is of critical importance for performance!
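The same reasoning suggests organising a larger computation into blocks that fit into the cache and are reused; below is a minimal cache-blocking sketch in the spirit of Example 2.3 (the matrix size n and block size bs are illustrative assumptions, not prescribed by the course notes).

program blocked_matmul
   implicit none
   integer, parameter :: n = 256, bs = 32   ! bs chosen so that three bs x bs blocks fit in cache
   real    :: a(n,n), b(n,n), c(n,n)
   integer :: i, j, k, ii, jj, kk
   call random_number(a)
   call random_number(b)
   c = 0.0
   do jj = 1, n, bs
      do kk = 1, n, bs
         do ii = 1, n, bs
            ! work on one bs x bs block combination; the blocks stay in cache and are reused
            do j = jj, jj + bs - 1
               do k = kk, kk + bs - 1
                  do i = ii, ii + bs - 1
                     c(i,j) = c(i,j) + a(i,k)*b(k,j)
                  end do
               end do
            end do
         end do
      end do
   end do
   print *, 'c(1,1) =', c(1,1)
end program blocked_matmul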

2.3 Memory Bandwidth (MB) and performance

• Memory bandwidth is governed by the memory bus and the memory properties

• MB can be increased with enlarging memory block accessed

Let system spend l time units to move b data units, where

• l – system latency

• b – memory block size (also called cache line)

Example 2.4: effect of memory block size – dot product of two vectors

• In Example 2.2 we had b = 1 word, giving 10 Mflops

• Increasing to b = 4 and assuming the data is stored linearly in memory ⟹ 8 flops (4 multiplications and 4 additions) per 200 clock cycles, corresponding to 1 flop per 25 ns and resulting in a peak performance of 40 Mflops.

Observations:

• Block size does not affect system latency

• Wide memory bus (like in the example: 4) expensive to implement in hardware

• Usually, memory access is performed similarly to pipeline

• Therefore, with MB increase, system peak performance increases

• Locality of memory addresses IS important!

• It is good to reorder computations according to data location!

Example 2.5: Effect of memory ordering

real :: A(1000,1000), rowsum(1000)

Version A:

do i = 1, 1000
   rowsum(i) = 0.0
   do j = 1, 1000
      rowsum(i) = rowsum(i) + A(i,j)
   enddo
enddo

Version B:

do i = 1, 1000
   rowsum(i) = 0.0
enddo
do j = 1, 1000
   do i = 1, 1000
      rowsum(i) = rowsum(i) + A(i,j)
   enddo
enddo

• Which one is faster: A or B?

• FORTRAN rule: two-dimensional arrays (matrices) are stored columnwise!

• For example, matrix B(1:3,1:2) elements in memory ordered as: B(1,1), B(2,1), B(3,1), B(1,2), B(2,2), B(3,2). (fastest changing – the leftmost index)

To avoid spending time on optimising for memory ordering, use Fortran 95 array syntax (leaving the optimisation to the compiler)! See the sketch below.
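A minimal sketch of that advice for Example 2.5 (assuming the same 1000 × 1000 array): the explicit loops are replaced by the SUM intrinsic, and the traversal order is left entirely to the compiler.

program rowsum_arrays
   implicit none
   real :: A(1000,1000), rowsum(1000)
   call random_number(A)
   ! Array syntax: sum each row of A (reduction along the second dimension);
   ! the loop ordering is chosen by the compiler.
   rowsum = sum(A, dim=2)
   print *, 'rowsum(1) =', rowsum(1)
end program rowsum_arrays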

2.4 Other approaches for hiding memory latency

• Multithreading

• Prefetching 49 // Architectures 2.4 Other approaches for hiding memory latency Example 2.6: Multithreading

Preliminary code (C):

for (i = 0; i < n; i++) {
   c[i] = dot_product(get_row(a, i), b);
}


• each dot product appears to be independent, therefore it is possible to recode:

for (i = 0; i < n; i++) {
   c[i] = create_thread(dot_product, get_row(a, i), b);
}


• Possibility for performing computation in each clock cycle

◦ prerequisites:
   – a memory system supporting multiple outstanding requests
   – a processor capable of switching from one thread to another in every clock cycle (e.g. machines like HEP and Tera)
◦ the program should explicitly specify the independence in the form of threads

• Cache misses make programs waiting

• Why not to take care of data being in cache in advance?

• Draw-backs:

◦ need for larger caches ◦ if pre-fetched data happens to get overwritten, the situation is worse than before!

Drawbacks of latency-hiding techniques: in both cases better memory bandwidth between main memory and cache is needed!!! Each thread gets a proportionally smaller part of the cache to use ⟹

• while recovering from the latency problem, a bandwidth problem might appear!

• higher need for hardware properties (cache size) 52 // Architectures 2.5 Classification of Parallel Computers 2.5 Classification of Parallel Computers

2.5.1 Granularity

In parallel computing, granularity means the amount of computation in relation to communication or synchronisation Periods of computation are typically separated from periods of communication by synchronization events.

• fine level (same operations with different data)

◦ vector processors ◦ instruction level parallelism ◦ fine-grain parallelism: – Relatively small amounts of computational work are done between communication events – Low computation to communication ratio – Facilitates load balancing 53 // Architectures 2.5 Classification of Parallel Computers

– Implies high communication overhead and less opportunity for per- formance enhancement – If granularity is too fine it is possible that the overhead required for communications and synchronization between tasks takes longer than the computation.

• operation level (different operations simultaneously)

• problem level (independent subtasks)

◦ coarse-grain parallelism: – Relatively large amounts of computational work are done between communication/synchronization events – High computation to communication ratio – Implies more opportunity for performance increase – Harder to load balance efficiently 54 // Architectures 2.5 Classification of Parallel Computers

2.5.2 Hardware:

Pipelining

(used in supercomputers, e.g. Cray-1). With N elements in the pipeline and L clock cycles per element, the calculation takes about L + N cycles (more precisely L + N − 1); without the pipeline it takes L · N cycles.

Example of good code for pipelineing:

do i = 1, k
   z(i) = x(i) + y(i)
end do


Vector processors,

fast vector operations (operations on whole arrays). The previous example is also good for vector processors (vector addition), but e.g. recursion is hard to optimise for vector processors. Example: Intel MMX – a simple vector processor.

Processor arrays. Most often 2-dimensional arrays (for example MasPar MP-2, a massively parallel computer).

MasPar MP-2: 128 × 128 = 16384 processors, each with 64 Kbytes of memory. Each processor was connected to its neighbours and, on the edges, to the corresponding nodes on the opposite edge. The processors shared a common clock. Programming such a computer was quite specific: a special language, MPL, was used, and one needed to think about communication between neighbours, or to all processors at once (which was slower).

Shared memory computer

Distributed systems

(e.g.: clusters) Most spread today.

One of the main questions about parallel hardware: do the processors share a common clock or not?

2.6 Flynn's classification

                         Single Data   Multiple Data
Single Instruction       SISD          SIMD
Multiple Instruction     (MISD)        MIMD

Abbreviations: S – Single, M – Multiple, I – Instruction, D – Data

For example, SIMD reads as Single Instruction stream, Multiple Data stream:

SISD – single instruction, single data stream (e.g. a simple PC)
SIMD – the same instruction applied to multiple data (example: MasPar)
MISD – the same data used to perform multiple operations; vector processors have sometimes been considered to belong here, but most often it is said to be an empty class
MIMD – separate data and separate instructions (example: a cluster)

2.6.1 SISD

• A serial (non-parallel) computer

• Single Instruction: Only one instruction stream is being acted on by the CPU during any one clock cycle

• Single Data: Only one data stream is being used as input during any one clock cycle

• Deterministic execution

• This is the oldest and even today, the most common type of computer

• Examples: older generation mainframes, minicomputers and workstations; most modern day PCs. 60 // Architectures 2.6 Flynn’s classification

2.6.2 SIMD

• A type of parallel computer

• Single Instruction: All processing units execute the same instruction at any given clock cycle

• Multiple Data: Each processing unit can operate on a different data element

• Best suited for specialized problems characterized by a high degree of regular- ity, such as graphics/image processing.

• Synchronous (lockstep) and deterministic execution

• Two varieties: Processor Arrays and Vector Pipelines

• Examples:some early computers of this type:

◦ Processor Arrays: Connection Machine CM-2, MasPar MP-1 & MP-2, IL- LIAC IV 61 // Architectures 2.6 Flynn’s classification

◦ Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820, ETA10

• graphics cards

• Most modern computers, particularly those with graphics processor units (GPUs) employ SIMD instructions and execution units.

• possibility to switch off some processors (with mask arrays) 62 // Architectures 2.6 Flynn’s classification

2.6.3 (MISD):

• A type of parallel computer

• Multiple Instruction: Each processing unit operates on the data independently via separate instruction streams.

• Single Data: A single data stream is fed into multiple processing units.

• Few actual examples of this class of parallel computer have ever existed. One is the experimental Carnegie-Mellon C.mmp computer (1971).

• Some conceivable uses might be:

◦ multiple frequency filters operating on a single signal stream ◦ multiple cryptography algorithms attempting to crack a single coded mes- sage. 63 // Architectures 2.6 Flynn’s classification

2.6.4 MIMD

• A type of parallel computer

• Multiple Instruction: Every processor may be executing a different instruction stream

• Multiple Data: Every processor may be working with a different data stream

• Execution can be synchronous or asynchronous, deterministic or non- deterministic

• Currently, the most common type of parallel computer - most modern super- computers fall into this category.

• Examples: most current supercomputers, networked parallel computer clusters and "grids", multi-processor SMP computers, multi-core PCs.

• Note: many MIMD architectures also include SIMD execution sub-components 64 // Architectures 2.6 Flynn’s classification

2.6.5 Comparing SIMD with MIMD

• SIMD have less hardware units than MIMD (single instruction unit)

• Nevertheless, as SIMD computers are specially designed, they tend to be expensive and time-consuming to develop

• not all applications suitable for SIMD

• Platforms supporting SPMD can be built from cheaper components 65 // Architectures 2.6 Flynn’s classification

2.6.6 Flynn-Johnson classification

(picture by: Behrooz Parhami) 66 // Architectures 2.7 Type of memory access 2.7 Type of memory access

2.7.1 Shared Memory

• common shared memory

• Problem occurs when more than one want to write to (or read from) the same memory address

• Shared memory programming models do deal with these situations 67 // Architectures 2.7 Type of memory access

2.7.2 Distributed Memory

• Networked processors, each with its own private memory

2.7.3 hybrid memory models

• E.g. distributed shared memory, SGI Origin 2000 68 // Architectures 2.8 Communication model of parallel computers 2.8 Communication model of parallel computers

2.8.1 Communication through shared memory address space

• UMA (uniform memory access)

• NUMA (non-uniform memory access)

◦ SGI Origin 2000 ◦ Sun Ultra HPC 69 // Architectures 2.8 Communication model of parallel computers

Comparing UMA and NUMA:

[Figure: C – cache, P – processor, M – memory. Which are UMA, which NUMA? (a) and (b) are UMA, (c) is NUMA]

2.8.2 Communication through messages

• Using some messaging libraries like MPI, PVM.

• All processors are

◦ independent ◦ own private memory ◦ have unique ID

• Communication is performed through exchanging messages 71 // Architectures 2.9 Other classifications 2.9 Other classifications

2.9.1 realisation

• using only hardware modules

• mixed modules (hardware and software)

2.9.2 Control type

1. synchronous

2. dataflow-driven

3. asynchronous 72 // Architectures 2.9 Other classifications

2.9.3 Network connection speed

• network bandwidth

◦ can be increased e.g. through channel bonding

• network latency

◦ the time from the source sending a message to the destination receiving it

Which one is easier to improve: bandwidth or latency? It is easier to increase bandwidth!

2.9.4 Network topology

• Bus-based networks

• Ring - one of the simplest topologies

• Array topology

◦ Example: cube (in case of 3D array)

• Hypercube

◦ In a ring the longest route between 2 processors is P/2; in a hypercube it is log P

Ring: 74 // Architectures 2.9 Other classifications

Hypercube: 75 // Architectures 2.9 Other classifications

Figure: how to construct a hypercube – take a copy of the existing structure and connect the corresponding nodes (adding one bit to the node address). Problem: a large number of connections per node. A hypercube is easy to emulate on e.g. a MasPar array in log n time:

• Star topology

◦ The speed depends very much on the switch properties (e.g. latency, bandwidth, backplane frequency) and on its ability to cope with arbitrary communication patterns

• Clos-network

For example Myrinet. (quite popular around 2005)

• Cheaper but with higher latency: Gbit Ethernet., (10 Gbit Ethernet)

• Nowadays the most popular low-latency network type is InfiniBand

2.10 Clos networks

The theory of Clos networks originated with telephone networks – the paper: Charles Clos, "A Study of Non-Blocking Switching Networks", 1952.

2.10.1 Myrinet

Myrinet (http://www.myri.com) – a standardised network architecture. Aims:

• Maximal throughput even with a large number of processors

• Low latency

• eliminate the need to allocate processors according to network topology

◦ (historical remark: for the same reasons, RAM replaced sequential-access memory (disk, drum))

In general, a good network accounts for about half of a cluster's cost!

Myrinet card and switch

128-port Myrinet Clos network switch 80 // Architectures 2.10 Clos-networks

2.10.2 Topological concepts

• Minimum bisection (minimaalne poolitus) Definition. Minimal number of connections between arbitrary two equally sized sets of nodes in the network

• Describes minimal throughput, independently from communication pattern

• Upper bound in case of N-node network is N/2 connections

• Network topology achieving this upper bound is called full bisection (täispooli- tus) 81 // Architectures 2.10 Clos-networks

Example. 8 nodes, network switch has 5 ports:

What is the minimal bisection in this case? The minimal bisection is 1 – i.e. very bad

But, if to set the network up as follows:

Question: Does there exist full bisection? If yes, how? 83 // Architectures 2.10 Clos-networks

Answer: the minimal bisection is 4. But not for every possible communication pattern! (Find one!) ⟹ not a rearrangeable network

2.10.3 Rearrangeable networks (ümberjärjestatavad võrgud)

Definition. A network is called rearrangeable if it can find a route for an arbitrary permutation of communication patterns (i.e. whichever node wants to talk to whichever other node, there is always a way to do it simultaneously). Theorem. Every rearrangeable network has full bisection. Is the converse also true? No: a network can have full bisection without being rearrangeable. (The proof is based on Hall's theorem.)

Example: 8-node clos network with 4-port switches. "Leaf"-switches "Spine"-switches

Main rule: as many connections go to the leaves as go to the spine switches (this enables rearrangeability). Therefore, Clos networks are:

• scalable upto a large number of nodes

• rearrangeable (proved, that in each permutation there exists a routing table)

• There exist multiple routes 85 // Architectures 2.10 Clos-networks

In case of too few ports on a switch, it is possible to compose a larger one: 86 // Architectures 2.10 Clos-networks 87 // Architectures 2.11 Beowulf clusters 2.11 Beowulf clusters

• History: 1993-1994 Donald Becker (NASA) – 16 × 486DX4 66 MHz processors, Ethernet channel bonding, COTS (Commodity Off The Shelf) components – created at CESDIS (The Center of Excellence in Space Data and Information Sciences), 74 MFlops (4.6 MFlops/node).

• Beowulf was originally the project name, not the name of the system. Dr Thomas Sterling: "What the hell, call it 'Beowulf.' No one will ever hear of it anyway."

• Beowulf – the first epic poem in English, based on a Scandinavian legend about a hero who defeated the monster Grendel with his courage and strength

• Network choice: Ethernet 100 Mbps, Gigabit Ethernet, Infiniband, Myrinet (Scali, PathScale, NUMAlink, etc.)

• OS – most often linux

• Typical Beowulf cluster:

[Figure: Internet -> Frontend -> Switch -> Nodes]

• Building a cluster (rack/shelf)? 90 // Architectures 2.11 Beowulf clusters

• console switch?
• which software?
• Security: the part of the system needing security measures is the frontend, which faces the Internet (the switch and the nodes sit behind it)
• Cluster administration software:
   ◦ ClusterMonkey
   ◦ Rocks
   ◦ Clustermagic
   ◦ etc.

3 Parallel Performance

How do we measure parallel performance?

3.1 Speedup

S(N,P) := t_seq(N) / t_par(N,P)

where t_seq(N) is the time for solving the given problem with the best known sequential algorithm.

• In general this is a different algorithm than the parallel one
   ◦ if the same algorithm is used (while a faster sequential one exists), one speaks of relative speedup
• 0 < S(N,P) ≤ P
• If S(N,P) = P – linear or optimal speedup (example: embarrassingly parallel algorithms)

• May happen that S(N,P) > P, (swapping; cache effect) – superlinear speedup

• Often S(N,P) < 1 – is that still a speedup? No, a slowdown
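As a quick check of this definition against Example 1.3 above: taking the one-processor time 84.3 s as t_seq and t_par(N,8) = 27.0 s, S(N,8) = 84.3/27.0 ≈ 3.12, which is exactly the value in that table; the corresponding efficiency (defined next) would be 3.12/8 ≈ 0.39.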

3.2 Efficiency

E(N,P) := t_seq(N) / (P · t_par(N,P)).

Presumably, 0 < E(N,P) ≤ 1.

3.3 Amdahl's law

Every algorithm contains parts that cannot be parallelised.

• Let σ (0 < σ ≤ 1) be the sequential fraction
• Assume that the remaining 1 − σ is parallelised optimally

Then, in the best case:

S(N,P) = t_seq / ((σ + (1 − σ)/P) · t_seq) = 1 / (σ + (1 − σ)/P) ≤ 1/σ.

Example 1. Assume 5% of the algorithm is not parallelisable (i.e. σ = 0.05). Then:

P      max S(N,P)
2      1.9
4      3.5
10     6.9
20     10.3
100    16.8
∞      20

Therefore, there is not much to gain at all from a huge number of processors!

Example 2. If σ = 0.67 (33% parallelisable), with P = 10:

S(N,10) = 1 / (0.67 + 0.33/10) ≈ 1.42

3.4 Gustafson-Barsis' law

John Gustafson & Ed Barsis (Sandia Laboratory), 1988:

• On a 1024-processor nCube/10 they claimed: we beat Amdahl's law!
• Their σ ≈ 0.004...0.008, but they got S ≈ 1000
• (According to Amdahl's law, S should have been 125...250)

How was this possible? Does Amdahl's law hold? Mathematically – yes. But in practice it is not a very good idea to solve a problem of fixed size N on an arbitrarily large number of processors! In general σ = σ(N) ≠ const, and usually σ decreases as N grows. An algorithm is said to be effectively parallel if σ → 0 as N → ∞.

Scaled efficiency is used to avoid such misunderstandings:

3.5 Scaled efficiency

ES(N,P) := t_seq(N) / t_par(P · N, P)

• Problem size increasing accordingly with adding new processors – does time remain the same?

• 0 < ES(N,P) ≤ 1

• If ES(N,P) = 1 – linear speedup
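A small illustration with hypothetical numbers (not from the course material): if a problem of size N takes t_seq(N) = 10 s sequentially and a problem of size 4N takes t_par(4N, 4) = 11 s on 4 processors, then ES(N,4) = 10/11 ≈ 0.91, i.e. scaling the problem with the machine keeps the efficiency high even when the fixed-size speedup would be modest.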

3.6 Methods to increase efficiency

Factors influencing efficiency:

• communication time

• waiting time

• additional computations

• changing/improving the algorithm

Profiling parallel programs:

• MPE – jumpshot, LMPI, MpiP

• Totalview, Vampir, Allinea OPT

• Linux - gprof (compiler switch -pg)

• SUN - prof, gprof, prism

• Many other commercial applications 99 // Performance 3.6 Methods to increase efficiency

3.6.1 Overlapping communication and computations

Example: parallel Ax operation for sparse matrices, par_sparse_mat.f90 (http://www.ut.ee/~eero/F95jaMPI/Kood/mpi_CG/par_sparse_mat.f90.html) ("Fortran95 ja MPI", pp. 119-120). The matrix is partitioned and divided between processors. Communication is started first (non-blocking); computation proceeds on the interior parts of the region ⟹ savings in waiting time.

3.6.2 Extra computations instead of communication

• Computations in place instead of importing the results over the network

• Sometimes it pays off!

Example: random number generation. Broadcast only the seed and generate the numbers in parallel (a deterministic algorithm); see the sketch below.
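A minimal Fortran/MPI sketch of this idea (the seed value and the rank-based offset are illustrative assumptions, not from the course material): rank 0 chooses a seed, a single MPI_BCAST distributes it, and each process then generates its own random numbers locally instead of receiving them over the network.

program seed_bcast
   use mpi
   implicit none
   integer :: seed, myid, ierr, n
   integer, allocatable :: put(:)
   real :: r
   call MPI_INIT(ierr)
   call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
   if (myid == 0) seed = 12345              ! the root process picks the seed
   call MPI_BCAST(seed, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
   call RANDOM_SEED(size=n)
   allocate(put(n))
   put = seed + myid                        ! offset per rank -> a different stream on each process
   call RANDOM_SEED(put=put)
   call RANDOM_NUMBER(r)
   print *, 'rank', myid, ': first random number =', r
   call MPI_FINALIZE(ierr)
end program seed_bcast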

3.7 Benchmarks

The main HPC (High Performance Computing) benchmarks:

• NPB
• Perf
• IOzone
• Linpack
• Graph 500
• HINT

3.7.1 Numerical Aerodynamic Simulation (NAS) Parallel Benchmarks (NPB)

MPI, 8 programs (IS, FT, MG, CG, LU, SP, BT, EP) derived from Computational Fluid Dynamics (CFD) codes.

• Integer Sort (IS) – integer operations and communication speed; latency is of critical importance

• Fast Fourier Transform (FT). Remote communication performance; solution of a 3D partial differential equation

• Multigrid (MG). A test of a well-structured communication pattern. 3D Poisson problem with constant coefficients

• Conjugate Gradient (CG). Irregular communication on an unstructured discretisation grid. Finding the smallest eigenvalue of a large positive definite matrix

• Lower-Upper diagonal (LU). Blocking communication operations with small granularity, using SSOR (Symmetric Successive Over-Relaxation) to solve sparse 5x5 lower and upper diagonal systems. The message length varies a lot

• Scalar pentadiagonal (SP) and block tridiagonal (BT). A balance test between computation and communication. Needs an even number of processes. Solving a set of diagonally non-dominant SP and BT systems. (Although similar, they differ in the ratio of communication to computation – SP has more communication than BT.)

• Embarrassingly Parallel (EP). Gaussian random variate generation with some specifics; shows the performance of the slowest processor.

• + some new tests in version 3, good with networked multicore processors 102 // Performance 3.7 Benchmarks

3.7.2 Linpack

Jack Dongarra. HPL - High Performance Linpack, using MPI and BLAS. Solving systems of linear equations with dense matrices. The aim is to fit a problem with maximal size (advisably, utilising 80% of memory).

• Used for http://www.top500.org

◦ Also, http://www.bbc.co.uk/news/10187248

• Rpeak – theoretical peak performance in Gflops
• Rmax – maximal achieved performance in Gflops
• N – the matrix size giving the peak performance in Gflops (usually chosen so that the matrix fills <80% of memory)
• NB – block size. In general the smaller the better, but usually in the range 32...256.

3.7.3 HINT benchmark

The HINT (Hierarchical INTegration). Quite popular benchmark. Graphical view of:

• floating point performance

• integer operation performance

• performances with different memory hierarchies

3.7.4 Perf

Measures network latency and bandwidth between two nodes

3.7.5 IOzone

Tests I/O performance 104 // Performance 3.7 Benchmarks

3.7.6 Graph 500

• Graph 500 benchmark initiative (http://www.graph500.org/specifications)

Part II: Parallel Programming

4 Parallel Programming Models – an Overview

Von Neumann's model (the von Neumann bottleneck)

Process interaction models:

4.1 Model of Tasks and Channels

task (tegum) – logically a whole, an independent (part of a) program; a unit of work consuming what the operating system has to offer

channel (kanal) – a set of tools to pass signals and information (unidirectionally) between two points

In the model of tasks and channels:

• Parallel program consists of one or a number of tasks that work concurrently task: sequential program (virtual von Neumann’s machine), having input and output port and in addition is capable of:

• sending messages • creating subtasks

• receiving messages • termination 107 //-programming models 4.1 Model of Tasks and Channels

channel: queue of messages

• sender can put messages

• receiver can take messages

◦ blocking if no such message exist

Channels can be:

• dynamically created between tasks’ input and output ports

• deleted

Sending messages is asynchronous and non-blocking (Receiving is blocking) tasks can be placed differently on different processes without change in program semantics 108 //-programming models 4.1 Model of Tasks and Channels

Tasks and channels’ model properties:

• Performance usually depends also on channel properties

• Independence of task subdivision onto processors

• Modularity Similarity with object-oriented model adding channels

• Determinism achieved through:

◦ only one sender and receiver ◦ blocking on the receiver side 109 //-programming models 4.2 Message passing model 4.2 Message passing model

Similar to task-channel model Differences:

• Send message on channel k replaced by “Send message to process t”

• Preliminary, in the model there was no assumption that number of processors may change

• Does not assume multiple processes on a one processor

◦ still, if it should happen, communication only through message passing mechanisms

• Most often, SPMD (Single Program Multiple Data) programming model 110 //-programming models 4.3 Shared Memory model 4.3 Shared Memory model

• all tasks have common address space;

• asynchronous reading and writing

• (⟹) various locking mechanisms are needed...
• A bit similar to the previous model (small granularity)
• benefit: no data ownership exists

4.4 Implicit Parallelism

• no process interaction is visible to the programmer

• instead the compiler and/or runtime is responsible for performing it

• most common with domain-specific languages where the concurrency within a problem can be more prescribed

Problem decomposition

• Any parallel program is composed of simultaneously executing processes

• problem decomposition relates to the way in which these processes are formu- lated.

• This classification may also be referred to as:

◦ algorithmic skeletons, or ◦ parallel programming paradigms

4.1 Task parallel model

• focuses on processes, or threads of execution. These processes will often be behaviourally distinct, which emphasises the need for communication. Task parallelism is a natural way to express message-passing communication. It is usually classified as MIMD/MPMD (or MISD).

4.2 Data parallel model

• performing operations on a data set

◦ usually regularly structured in an array ◦ set of tasks will operate on this data – independently on separate partitions

• Data accessibility 114 //-programming models 4.2 Data parallel model

◦ shared memory system, the data accessible to all ◦ in a distributed-memory system the data divided between memories and worked on locally.

usually classified as SIMD/SPMD.

• Generally, granularity small

• same operation on multiple computing devices

• usually needs some clarification from the programmer, how data distributed between the processors

• communication/synchronisation is created automatically by compiler

Example: HPF (High Performance Fortran)

5 Shared memory programming model

The Dining Philosophers problem

5 plates, 5 forks. One can eat only with 2 forks. Eat and think! Avoid starvation (starving to death)! Not too fat? No deadlocks, no fights; fairness? ⇒ a locking mechanism of some kind is needed!

• consisting of P deterministic RAM processors (von Neumann machines)

• processors work in parallel

• communication and synchronisation through shared memory

Variations

• EREW PRAM (Exclusive Read Exclusive Write)

• CREW PRAM (Concurrent Read Exclusive Write)

• CRCW PRAM (Concurrent Read Concurrent Write):

◦ common CRCW PRAM – a write succeeds only if all write the same value
◦ arbitrary CRCW PRAM – an arbitrary process succeeds
◦ priority CRCW PRAM – with predefined priorities

5.2 OpenMP

An introduction to OpenMP (http://www.llnl.gov/computing/tutorials/openMP/)

Programming model:

• Thread-based parallelism

• A tool for the programmer, not an automatic paralleliser (explicit parallelism)

• Fork-Join model: execution starts with one process – the master thread; Fork, Join

• Based on compiler directives; C/C++, Fortran*

• Nested parallelism is supported (although not all implementations support it)

• Dynamic threads: the number of threads can be changed while the program is running

• Parallelising an existing sequential program often requires only small changes to the source code

• OpenMP has become a standard

OpenMP history:

• Standardisation started in 1994, restarted in 1997

• October 1997: Fortran version 1.0

• End of 1998: C/C++ version 1.0

• June 2000: Fortran version 2.0

• April 2002: C/C++ version 2.0

• May 2008: version 3.0 of the standard

• July 2011: version 3.1

Example (Fortran):

      PROGRAM HELLO
      INTEGER VAR1, VAR2, VAR3
!     Sequential part of the program
      ...
!     Parallel part of the program: new threads are created,
!     variable scopes are declared:
!$OMP PARALLEL PRIVATE(VAR1, VAR2) SHARED(VAR3)
!     Parallel section, executed by all threads:
      ...
!     All threads join the master thread; end of the parallel section:
!$OMP END PARALLEL
!     Final sequential part of the program:
      ...
      END

Example (C/C++):

#include <omp.h>
main () {
   int var1, var2, var3;
   /* Sequential part */
   ...
   /* Parallel part of the program: new threads are created,
      variable scopes are declared: */
   #pragma omp parallel private(var1, var2) shared(var3)
   {
      /* Parallel section, executed by all threads: */
      ...
      /* All threads join the master thread; end of the parallel section */
   }
   /* Final sequential part */
   ...
}

General rules:

• Comments cannot be put on the same line as an OpenMP directive

• In general, the Fortran compiler needs a switch to make it interpret the OpenMP directive comments

• Several OpenMP commands are paired directives:

!$OMP directive
   [ program block ]
!$OMP end directive

5.3 OpenMP directives: PARALLEL

!$OMP PARALLEL [clause ...]
      IF (scalar_logical_expression)
      PRIVATE (list)
      SHARED (list)
      DEFAULT (PRIVATE | FIRSTPRIVATE | SHARED | NONE)
      FIRSTPRIVATE (list)
      REDUCTION (operator : list)
      COPYIN (list)
      NUM_THREADS (scalar-integer-expression)

      block

!$OMP END PARALLEL

The number of threads can be set:

• with the NUM_THREADS clause

• with the library routine omp_set_num_threads()

• with the environment variable OMP_NUM_THREADS

Restrictions:

• A parallel region is a block in the program that does not cross subprogram or file boundaries

• Unsynchronised I/O gives unpredictable results

• Branching into or out of a parallel region is forbidden

• Only a single IF clause is allowed

Example: Hello World (F77)

      PROGRAM HELLO
      INTEGER NTHREADS, TID, OMP_GET_NUM_THREADS,
     +        OMP_GET_THREAD_NUM
C     Fork a team of threads, giving them their own copies of variables
!$OMP PARALLEL PRIVATE(TID)
C     Obtain and print the thread id
      TID = OMP_GET_THREAD_NUM()
      PRINT *, 'Hello World from thread = ', TID
C     Only the master thread does this
      IF (TID .EQ. 0) THEN
         NTHREADS = OMP_GET_NUM_THREADS()
         PRINT *, 'Number of threads = ', NTHREADS
      ENDIF
C     All threads join the master thread and disband
!$OMP END PARALLEL
      END

The OpenMP work-sharing directives are: DO, SECTIONS and SINGLE

5.4 OpenMP work-sharing directives: DO

!$OMP DO [clause ...]
      SCHEDULE (type [, chunk])
      ORDERED
      PRIVATE (list)
      FIRSTPRIVATE (list)
      LASTPRIVATE (list)
      SHARED (list)
      REDUCTION (operator | intrinsic : list)
      COLLAPSE (n)

      do_loop

!$OMP END DO [ NOWAIT ]

Clauses of the DO directive:

• SCHEDULE describes how the iterations are divided among the threads:

   ◦ STATIC – chunks of size chunk, or evenly if chunk is not given
   ◦ DYNAMIC – chunks of size chunk (default 1) are handed out dynamically; after finishing one chunk a thread dynamically gets a new one
   ◦ GUIDED – the chunk size is decreased exponentially
   ◦ RUNTIME – the schedule is decided at run time from the environment variable OMP_SCHEDULE
   ◦ AUTO – the schedule is decided by the compiler and/or at run time

• ORDERED – the loop is executed in the same order as the sequential loop

• NOWAIT – after the loop the threads do not synchronise but continue immediately with the following lines of the program

• COLLAPSE – how many nested loops are collapsed into a single iteration space

Restrictions:

• The DO loop cannot be a DO WHILE loop, and the iteration variable must be an integer

• The correctness of the program must not depend on which thread executes which part of the loop

• Branching into or out of the loop is not allowed

• The ORDERED and SCHEDULE clauses may appear only once

Example of the DO directive:

      PROGRAM VEC_ADD_DO
      INTEGER N, CHUNKSIZE, CHUNK, I
      PARAMETER (N=1000)
      PARAMETER (CHUNKSIZE=100)
      REAL A(N), B(N), C(N)
      DO I = 1, N
         A(I) = I * 1.0
         B(I) = A(I)
      ENDDO
      CHUNK = CHUNKSIZE
!$OMP PARALLEL SHARED(A,B,C) PRIVATE(I)
!     Arrays A, B, C will be shared by all threads
!     I -- private for each thread (each has a unique copy)
!$OMP DO SCHEDULE(DYNAMIC,CHUNK)
!     The iterations of the loop will be distributed
!     dynamically in CHUNK sized pieces
      DO I = 1, N
         C(I) = A(I) + B(I)
      ENDDO
!$OMP END DO NOWAIT
!$OMP END PARALLEL
      END

5.5 OpenMP work-sharing directives: SECTIONS

SECTIONS allows defining sections, each section going to a separate thread. The PRIVATE, FIRSTPRIVATE, LASTPRIVATE and REDUCTION clauses can be used to define how variables behave in the parallel part of the program. If the NOWAIT keyword is used at the end, a thread does not wait for the other threads to finish their sections after completing its own, but continues working immediately.

!$OMP SECTIONS [clause ...]
      PRIVATE (list)
      FIRSTPRIVATE (list)
      LASTPRIVATE (list)
      REDUCTION (operator | intrinsic : list)
!$OMP SECTION
      block
!$OMP SECTION
      block
!$OMP END SECTIONS [ NOWAIT ]

Example: SECTIONS:

      PROGRAM VEC_ADD_SECTIONS
      INTEGER N, I
      PARAMETER (N=1000)
      REAL A(N), B(N), C(N)
!     Some initializations
      DO I = 1, N
         A(I) = I * 1.0
         B(I) = A(I)
      ENDDO
!$OMP PARALLEL SHARED(A,B,C), PRIVATE(I)
!$OMP SECTIONS
!$OMP SECTION
      DO I = 1, N/2
         C(I) = A(I) + B(I)
      ENDDO
!$OMP SECTION
      DO I = 1+N/2, N
         C(I) = A(I) + B(I)
      ENDDO
!$OMP END SECTIONS NOWAIT
!$OMP END PARALLEL
      END

5.6 OpenMP work-sharing directives: SINGLE

Only one thread executes the given block; needed e.g. for I/O and the like.

!$OMP SINGLE [clause ...]
      PRIVATE (list)
      FIRSTPRIVATE (list)

      block

!$OMP END SINGLE [ NOWAIT ]

Restrictions: branching into or out of a SINGLE block is not allowed. A small usage sketch is given below.
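A minimal sketch of SINGLE inside a parallel region (the program name and message are illustrative assumptions, not from the course material): every thread would do its share of the work, but only one thread performs the I/O.

      PROGRAM SINGLE_DEMO
      INTEGER TID, OMP_GET_THREAD_NUM
!$OMP PARALLEL PRIVATE(TID)
      TID = OMP_GET_THREAD_NUM()
!     ... work executed by every thread would go here ...
!$OMP SINGLE
!     Only one (arbitrary) thread executes this I/O block
      PRINT *, 'Printed once, by thread ', TID
!$OMP END SINGLE
!$OMP END PARALLEL
      END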

5.7 Combining OpenMP work-sharing directives

PARALLEL SECTIONS, PARALLEL DO

NB! If an OpenMP directive becomes too long for one line, it can be split and continued:

!$OMP PARALLEL DO
!$OMP& SHARED(a)

Example: PARALLEL DO

      PROGRAM VECTOR_ADD
      INTEGER N, I, CHUNKSIZE, CHUNK
      PARAMETER (N=1000)
      PARAMETER (CHUNKSIZE=100)
      REAL A(N), B(N), C(N)
!     Some initializations
      DO I = 1, N
         A(I) = I * 1.0
         B(I) = A(I)
      ENDDO
      CHUNK = CHUNKSIZE
!$OMP PARALLEL DO
!$OMP& SHARED(A,B,C,CHUNK) PRIVATE(I)
!$OMP& SCHEDULE(STATIC,CHUNK)
      DO I = 1, N
         C(I) = A(I) + B(I)
      ENDDO
!$OMP END PARALLEL DO
      END

5.8 OpenMP synchronisation directives

Only the master thread:

!$OMP MASTER
      ( block )
!$OMP END MASTER

In the same order (only one thread at a time) as in the non-parallelised program:

!$OMP ORDERED
      ( block )
!$OMP END ORDERED

(usable only inside a DO or PARALLEL DO block)

Only one thread at a time (one after another):

!$OMP CRITICAL [ name ]
      block
!$OMP END CRITICAL

Example of CRITICAL: a program that counts how many threads are running. The value of X must not get corrupted, i.e. X may only be incremented in turn (this way the result is "protected"):

      PROGRAM CRITICAL
      INTEGER X
      X = 0
!$OMP PARALLEL SHARED(X)
!$OMP CRITICAL
      X = X + 1
!$OMP END CRITICAL
!$OMP END PARALLEL
      END

Barrier:

!$OMP BARRIER

The ATOMIC directive (a mini critical section for only the single statement that follows; it has its own restrictions, e.g. the statement cannot be a call, etc.):

!$OMP ATOMIC
      <statement>

The FLUSH directive – memory synchronisation:

!$OMP FLUSH (list)

(all threads write their local copies of the listed variables back to shared memory)
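A minimal ATOMIC sketch, mirroring the CRITICAL counter above (the program name is an illustrative assumption): the single update of X is protected without the cost of a full critical section.

      PROGRAM ATOMIC_DEMO
      INTEGER X
      X = 0
!$OMP PARALLEL SHARED(X)
!$OMP ATOMIC
      X = X + 1
!$OMP END PARALLEL
      PRINT *, 'X = ', X
      END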

5.9 OpenMP data scope clauses

PRIVATE – the variables are private to each thread

SHARED – the variables are shared by all threads

DEFAULT (PRIVATE | SHARED | NONE) – sets the default scope

FIRSTPRIVATE – private variables that are initialised with the value from before the region

LASTPRIVATE – after the loop the shared variable gets the value from the last section or the last loop iteration

COPYIN – on entering the parallel region the corresponding variables of all threads get the master thread's value

REDUCTION – performs the given operation on all the private copies of the given variable
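A minimal sketch of FIRSTPRIVATE and LASTPRIVATE together (the values are illustrative, not from the course material): each thread starts with its own copy of SCALE already set to 2.0, and after the loop LAST holds the value from the sequentially last iteration.

      PROGRAM SCOPE_DEMO
      INTEGER I, LAST
      REAL SCALE, A(100)
      SCALE = 2.0
!$OMP PARALLEL DO FIRSTPRIVATE(SCALE) LASTPRIVATE(LAST) SHARED(A)
      DO I = 1, 100
         A(I) = SCALE * I
         LAST = I
      ENDDO
!$OMP END PARALLEL DO
      PRINT *, 'LAST = ', LAST, ' A(100) = ', A(100)
      END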

Example of REDUCTION – the dot product of two large vectors. As we know, we need to multiply the corresponding components of the vectors and then add up the results. We use a DO loop that multiplies and adds the product to the result. Such a program is easy to parallelise; the results of the different parallel parts must be added together at the end – this is what REDUCTION(operation : variable) is for:

      PROGRAM DOT_PRODUCT
      INTEGER N, CHUNKSIZE, CHUNK, I
      PARAMETER (N=100)
      PARAMETER (CHUNKSIZE=10)
      REAL A(N), B(N), RESULT
!     Some initializations
      DO I = 1, N
         A(I) = I * 1.0
         B(I) = I * 2.0
      ENDDO
      RESULT = 0.0
      CHUNK = CHUNKSIZE
!$OMP PARALLEL DO
!$OMP& DEFAULT(SHARED) PRIVATE(I)
!$OMP& SCHEDULE(STATIC,CHUNK)
!$OMP& REDUCTION(+:RESULT)
      DO I = 1, N
         RESULT = RESULT + (A(I) * B(I))
      ENDDO
!$OMP END PARALLEL DO NOWAIT
      PRINT *, 'Final Result = ', RESULT
      END

Clause vs. OpenMP directive:

Clause         PARALLEL   DO   SECTIONS   SINGLE   PARALLEL DO   PARALLEL SECTIONS
IF                x                                     x               x
PRIVATE           x        x       x         x          x               x
SHARED            x        x                            x               x
DEFAULT           x                                     x               x
FIRSTPRIVATE      x        x       x         x          x               x
LASTPRIVATE                x       x                    x               x
REDUCTION         x        x       x                    x               x
COPYIN            x                                     x               x
SCHEDULE                   x                            x
ORDERED                    x                            x
NOWAIT                     x       x         x

The following directives have no clauses:

MASTER, CRITICAL, BARRIER, FLUSH, ORDERED, ATOMIC, THREADPRIVATE

5.10 OpenMP library routines

SUBROUTINE OMP_SET_NUM_THREADS(scalar_integer_expr)

INTEGER FUNCTION OMP_GET_NUM_THREADS()

INTEGER FUNCTION OMP_GET_MAX_THREADS()

INTEGER FUNCTION OMP_GET_THREAD_NUM()

INTEGER FUNCTION OMP_GET_NUM_PROCS()

LOGICAL FUNCTION OMP_IN_PARALLEL()

SUBROUTINE OMP_SET_DYNAMIC(scalar_logical_expression)

LOGICAL FUNCTION OMP_GET_DYNAMIC() 145 Shared Memory 5.10 OpenMP teegikäske

SUBROUTINE OMP_SET_NESTED(scalar_logical_expression)

LOGICAL FUNCTION OMP_GET_NESTED

SUBROUTINE OMP_INIT_LOCK(var)

SUBROUTINE OMP_INIT_NEST_LOCK(var)

SUBROUTINE OMP_DESTROY_LOCK(var)

SUBROUTINE OMP_DESTROY_NEST_LOCK(var)

SUBROUTINE OMP_SET_LOCK(var)

SUBROUTINE OMP_SET_NEST_LOCK(var)

SUBROUTINE OMP_UNSET_LOCK(var)

SUBROUTINE OMP_UNSET_NEST_LOCK(var) 146 Shared Memory 5.10 OpenMP teegikäske

LOGICAL FUNCTION OMP_TEST_LOCK(var)

INTEGER FUNCTION OMP_TEST_NEST_LOCK(var)

A lock usage sketch is given below.
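A minimal sketch using the lock routines listed above (the program name and the counter are illustrative assumptions): the lock protects a shared counter much like a named CRITICAL section would.

      PROGRAM LOCK_DEMO
      INCLUDE 'omp_lib.h'
      INTEGER (KIND=OMP_LOCK_KIND) LCK
      INTEGER COUNT
      COUNT = 0
      CALL OMP_INIT_LOCK(LCK)
!$OMP PARALLEL SHARED(LCK, COUNT)
      CALL OMP_SET_LOCK(LCK)
!     Only the thread holding the lock updates the shared counter
      COUNT = COUNT + 1
      CALL OMP_UNSET_LOCK(LCK)
!$OMP END PARALLEL
      CALL OMP_DESTROY_LOCK(LCK)
      PRINT *, 'COUNT = ', COUNT
      END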

5.11 OpenMP example programs

• mxmult.c (http://www.ut.ee/~eero/PC/naiteid/mxmult.c.html)

• omp_mm.f (http://www.ut.ee/~eero/PC/naiteid/omp_mm.f.html)

• omp_orphan.f (http://www.ut.ee/~eero/PC/naiteid/omp_orphan.f.html)

• omp_reduction.f (http://www.ut.ee/~eero/PC/naiteid/omp_reduction.f.html)

• omp_workshare1.f (http://www.ut.ee/~eero/PC/naiteid/omp_workshare1.f.html)

• omp_workshare2.f (http://www.ut.ee/~eero/PC/naiteid/omp_workshare2.f.html)

6 Fortran and HPF

6.1 Fortran Evolution

• FORmula TRANslation

• first compiler 1957

• 1st official standard 1972 – Fortran66

• 2nd standard 1980 – Fortran77

• 1991 – Fortran90

• Fortran95

• Fortran2003

• Fortran2008 149 Fortran and HPF 6.2 Concept

High Performance Fortran (HPF)

6.2 Concept

An extension of Fortran 90

• SPMD (Single Program Multiple Data) model

• each process operates with its own part of data

• HPF commands specify which processor gets which part of the data

• Concurrency is defined by HPF commands based on Fortran90

• HPF directives as comments: !HPF$

Most of the commands are declarative, concerning data distribution between processes. The command INDEPENDENT (and its attribute NEW) is an exception in that it affects execution.

6.3 PROCESSORS declaration

determines a conceptual processor array (which need not reflect the actual hardware structure). Example:

!HPF$ PROCESSORS, DIMENSION(4)     :: P1
!HPF$ PROCESSORS, DIMENSION(2,2)   :: P2
!HPF$ PROCESSORS, DIMENSION(2,1,2) :: P3

• Number of processors needs to remain the same in a program

• If 2 different processor arrays have a similar shape, one can assume that the corresponding processors map to the same physical processors

• !HPF$ PROCESSORS :: P – defines a scalar processor

6.4 DISTRIBUTE-directive

specifies how data is distributed between the processors. Example:

REAL, DIMENSION(50)    :: A
REAL, DIMENSION(10,10) :: B, C, D
!HPF$ DISTRIBUTE (BLOCK) ONTO P1 :: A            ! 1-D
!HPF$ DISTRIBUTE (CYCLIC,CYCLIC) ONTO P2 :: B,C  ! 2-D
!HPF$ DISTRIBUTE D(BLOCK,*) ONTO P1              ! alternative syntax

• array rank and processor array rank need to conform

• A(1:50) : P1(1:4) – each processor gets ⌈50/4⌉ = 13 elements, except P1(4) which gets the rest (11 elements)

• * – dimension to be ignored => in this example D is distributed row-wise

• CYCLIC – elements are dealt out one-by-one (like cards from a pile dealt between players)

• BLOCK – array elements are distributed block-wise (each block having elements from a close range)

Example (9-element array distributed onto 3 processors):

CYCLIC: 1 2 3 1 2 3 1 2 3
BLOCK:  1 1 1 2 2 2 3 3 3

Example: DISTRIBUTE block-wise:

§PROGRAM Chunks ¤ REAL, DIMENSION(20) : : A ! HPF$ PROCESSORS, DIMENSION(4)::P ! HPF$DISTRIBUTE(BLOCK) ONTOP::A

¦ Example: DISTRIBUTE cyclically:

§PROGRAM Round_Robin ¤ REAL, DIMENSION(20) : : A ! HPF$ PROCESSORS, DIMENSION(4)::P ! HPF$DISTRIBUTE(CYCLIC) ONTOP::A

¦ 154 Fortran and HPF 6.4 DISTRIBUTE-directive

Example: DISTRIBUTE 2-dimensional layout:

§PROGRAM Skwiffy ¤ IMPLICIT NONE REAL, DIMENSION( 4 , 4 ) : : A, B, C ! HPF$ PROCESSORS, DIMENSION(2,2)::P ! HPF$DISTRIBUTE(BLOCK, CYCLIC) ONTOP::A,B,C B = 1; C = 1; A = B + C END PROGRAM Skwiffy

¦ cyclic in one dimension; block-wise in other:

(11)(12)(11)(12) (11)(12)(11)(12) (21)(22)(21)(22) (21)(22)(21)(22) 155 Fortran and HPF 6.4 DISTRIBUTE-directive

Example: DISTRIBUTE *:

§PROGRAM Skwiffy ¤ IMPLICIT NONE REAL, DIMENSION( 4 , 4 ) : : A, B, C ! HPF$ PROCESSORS, DIMENSION(4)::Q ! HPF$DISTRIBUTE( ∗ ,BLOCK) ONTOQ::A,B,C B = 1; C = 1; A = B + C; PRINT∗ , A END PROGRAM Skwiffy

¦ Remarks about DISTRIBUTE • Without ONTO default structure (given with program arguments, e.g.)

• BLOCK – better if algorithm uses a lot of neighbouring elements in the array => less communication, faster

• CYCLIC – good for even distribution

• ignoring a dimension (with *) is good if the calculations are done on a whole row or column

• all scalar variables are by default replicated; updating – compiler task

6.5 Distribution of allocatable arrays

is similar, except that delivery happens right after the memory allocation

§REAL, ALLOCATABLE, DIMENSION (:,:):: A ¤ INTEGER :: i e r r ! HPF$ PROCESSORS, DIMENSION(10,10)::P ! HPF$DISTRIBUTE(BLOCK, CYCLIC)::A ... ALLOCATE(A(100 ,20) , stat = i e r r ) !A automatically distributed here ! block size in dim=1 is 10 elements ... DEALLOCATE(A) END

The block size is determined right after the ALLOCATE command.

6.6 HPF rule: Owner Calculates

The processor owning the left-hand side of an assignment performs the computations.

Example:

DO i = 1, n
   a(i-1) = b(i*6)/c(i+j) - a(i**i)
END DO

• the calculation is performed by the process owning a(i-1)

NOTE – the rule is not obligatory, but advisable to the compiler!

• the compiler may (in order to reduce communication in the whole program) leave only the assignment to the processor owning a(i-1)

6.7 Scalar variables

§REAL, DIMENSION(100,100) :: X ¤ REAL :: Scal ! HPF$DISTRIBUTE(BLOCK,BLOCK)::X .... Scal = X( i , j ) ....

¦ owner of X(i,j) assigns Scal value and sends to other processes (replica- tion) 159 Fortran and HPF 6.8 Examples of good DISTRIBUTE subdivision 6.8 Examples of good DISTRIBUTE subdivision

Example:

§A( 2 : 9 9 ) = (A( : 9 8 ) +A( 3 : ) ) /2 ! neighbour calculations ¤ B(22:56)= 4.0∗ATAN( 1 . 0 ) ! section ofB calculated C ( : ) = SUM(D, DIM=1) ! Sum downa column

¦ From the owner calculates rule we get:

§!HPF$ DISTRIBUTE (BLOCK) ONTOP :: A ¤ !HPF$ DISTRIBUTE (CYCLIC) ONTOP :: B !HPF$ DISTRIBUTE (BLOCK) ONTOP :: C ! or (CYCLIC) !HPF$ DISTRIBUTE (∗ ,BLOCK) ONTOP :: D ! or (∗ ,CYCLIC)

¦ 160 Fortran and HPF 6.8 Examples of good DISTRIBUTE subdivision

Example (SSOR):

DO j = 2, n-1
   DO i = 2, n-1
      a(i,j) = (omega/4)*(a(i,j-1) + a(i,j+1) + &
               a(i-1,j) + a(i+1,j)) + (1-omega)*a(i,j)
   END DO
END DO
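One possible set of distribution directives for this kernel, as a minimal sketch (the 2×2 processor grid and the declaration of a are assumptions for illustration, not part of the slide):

PROGRAM SSOR_dist
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 100          ! assumed problem size
  REAL, DIMENSION(n,n) :: a
!HPF$ PROCESSORS, DIMENSION(2,2) :: P    ! assumed processor grid
!HPF$ DISTRIBUTE (BLOCK,BLOCK) ONTO P :: a
  ! ... SSOR sweeps over a as in the loop above ...
END PROGRAM SSOR_dist

With (BLOCK,BLOCK) each processor owns a contiguous subblock, so the neighbour accesses a(i±1,j), a(i,j±1) cross processor boundaries only along the subblock edges.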

Best is BLOCK distribution in both dimensions.

6.9 HPF programming methodology

• Need to find balance between concurrency and communication

• the more processes the more communication

• aiming to

◦ find balanced load based from the owner calculates rule ◦ data locality

• use array syntax and intrinsic functions on arrays

• avoid deprecated Fortran features (like assumed size and memory associations)

It is easy to write a program in HPF but difficult to gain good efficiency. Programming in HPF goes more or less like this:

1. Write a correctly working serial program, test and debug it

2. Add distribution directives, introducing as little communication as possible

It is advisable to add INDEPENDENT directives (where semantically valid) and to perform data alignment (with the ALIGN directive). The first thing is to choose a good data distribution!

Issues reducing efficiency:

• Difficult indexing (may confuse the compiler where particular elements are situated)

• array syntax is very powerful

◦ but with complex constructions may make compiler to fail to optimise

• sequential cycles to be left sequential (or replicate)

• object redelivery is time consuming 163 Fortran and HPF 6.9 HPF programming methodology Additional advice:

• Use array syntax possibilities instead of cycles

• Use intrinsic functions where possible (for possibly better optimisation)

• Before parallelisation, think about whether the algorithm is parallelisable at all

• Use INDEPENDENT and ALIGN directives

• It is possible to give explicit block sizes in BLOCK( ) and CYCLIC( ) if you are sure the algorithm needs them for efficiency; otherwise it is better to leave the decision to the compiler

6.10 BLOCK(m) and CYCLIC(m)

predefining block size; in general make the code less efficient due to more com- plex accounting on ownership

§REAL, DIMENSION(20) : : A, B ¤ ! HPF$ PROCESSORS, DIMENSION(4)::P ! HPF$DISTRIBUTEA(BLOCK(9)) ONTOP ! HPF$DISTRIBUTEB(CYCLIC(2)) ONTOP

¦ 2D example:

§REAL, DIMENSION( 4 , 9 ) : : A ¤ ! HPF$ PROCESSORS, DIMENSION(2)::P ! HPF$DISTRIBUTE(BLOCK(3),CYCLIC(2)) ONTOP::A

¦ 165 Fortran and HPF 6.11 Array alignment 6.11 Array alignment

• improves data locality

• minimises communication

• distributes workload

Simplest example: A = B + C – With correct ALIGN’ment it is without communication 2 ways:

§! HPF$ ALIGN(:,:) WITHT(:,:)::A,B,C ¤

¦ equivalent to:

§! HPF$ ALIGNA(:,:) WITHT(:,:) ¤ ! HPF$ ALIGNB(:,:) WITHT(:,:) ! HPF$ ALIGNC(:,:) WITHT(:,:)

¦ 166 Fortran and HPF 6.11 Array alignment

Example:

§REAL, DIMENSION(10) : : A, B, C ¤ ! HPF$ ALIGN(:) WITHC(:)::A,B

¦ – need only C be in DISTRIBUTE-command. Example, symbol instead of “:”:

§! HPF$ ALIGN(j) WITHC(j)::A,B ¤

¦ means: for ∀j align arrays A and B with array C. (“:” instead of j – stronger requirement) 167 Fortran and HPF 6.11 Array alignment

§! HPF$ ALIGNA(i,j) WITHB(i,j) ¤

¦

Example 2 (2-dimensional case):

§REAL, DIMENSION(10 ,10) : : A, B ¤ ! HPF$ ALIGNA(:,:) WITHB(:,:)

is a stronger requirement than the element-wise form ALIGN A(i,j) WITH B(i,j) shown above, which does not assume arrays of the same size.

Good for performing operations like A = B + C + B*C (everything local!)

Transposed alignment:

§REAL, DIMENSION(10 ,10) : : A, B ¤ ! HPF$ ALIGNA(i,:) WITHB(:,i)

¦ (second dimension of A and first dimension of B with same length!) 168 Fortran and HPF 6.11 Array alignment

Similarly:

§REAL, DIMENSION(10 ,10) : : A, B ¤ ! HPF$ ALIGNA(:,j) WITHB(j,:)

¦ or:

§REAL, DIMENSION(10 ,10) : : A, B ¤ ! HPF$ ALIGNA(i,j) WITHB(j,i)

¦ good to perform operation:

A = A + TRANSPOSE(B)*A   ! everything local!

¦ 169 Fortran and HPF 6.12 Strided Alignment 6.12 Strided Alignment

Example: Align elements of matrix D with every second element in E:

§REAL, DIMENSION( 5 ) : : D ¤ REAL, DIMENSION(10) : : E ! HPF$ ALIGND(:) WITHE(1::2)

¦ could be written also:

§! HPF$ ALIGND(i) WITHE(i ∗2−1) ¤

¦ Operation:

§D = D + E ( : : 2 ) ! local ¤

¦ 170 Fortran and HPF 6.12 Strided Alignment

Example: reverse strided alignment:

§REAL, DIMENSION( 5 ) : : D ¤ REAL, DIMENSION(10) : : E ! HPF$ ALIGND(:) WITHE(UBOUND(E):: −2)

¦ could be written also:

§! HPF$ ALIGND(i) WITHE(2+UBOUND(E) −i ∗2) ¤

¦ 171 Fortran and HPF 6.13 Example on Alignment 6.13 Example on Alignment

§PROGRAM Warty ¤ IMPLICIT NONE REAL, DIMENSION( 4 ) : : C REAL, DIMENSION( 8 ) : : D REAL, DIMENSION( 2 ) : : E C = 1; D = 2 E = D ( : : 4 ) + C ( : : 2 ) END PROGRAM Warty

¦ minimal (0) communication is achieved with:

§! HPF$ ALIGNC(:) WITHD(::2) ¤ ! HPF$ ALIGNE(:) WITHD(::4) ! HPF$DISTRIBUTE(BLOCK)::D

¦ 172 Fortran and HPF 6.14 Alignment with Allocatable Arrays 6.14 Alignment with Allocatable Arrays

• alignment is performed together with memory allocation

• existing object cannot be aligned to unallocated object Example:

§REAL, DIMENSION (:), ALLOCATABLE :: A,B ¤ ! HPF$ ALIGNA(:) WITHB(:)

¦ then

§ALLOCATE (B(100) , stat = i e r r ) ¤ ALLOCATE (A(100) , stat = i e r r )

¦ is OK

§ALLOCATE (B(100) ,A(100) , stat = i e r r ) ¤

¦ also OK (allocation starts from left), 173 Fortran and HPF 6.14 Alignment with Allocatable Arrays

but,

§ALLOCATE (A(100) , stat = i e r r ) ¤ ALLOCATE (B(100) , stat = i e r r )

¦ or

§ALLOCATE (A(100) ,B(100) , stat = i e r r ) ¤

¦ give error! Simple array cannot be aligned with allocatable:

§REAL, DIMENSION (:):: X ¤ REAL, DIMENSION (:), ALLOCATABLE :: A ! HPF$ ALIGNX(:) WITHA(:)! ERROR

¦ 174 Fortran and HPF 6.14 Alignment with Allocatable Arrays

One more problem:

§REAL, DIMENSION (:), ALLOCATABLE :: A, B ¤ ! HPF$ ALIGNA(:) WITHB(:) ALLOCATE(B(100) , stat = i e r r ) ALLOCATE(A(50) , stat = i e r r )

¦ ”:” says that A and B should be with same length (but they are not!) But this is OK:

§REAL, DIMENSION (:), ALLOCATABLE :: A, B ¤ ! HPF$ ALIGNA(i) WITHB(i) ALLOCATE(B(100) , stat = i e r r ) ALLOCATE(A(50) , stat = i e r r )

¦ – still: A cannot be larger than B. 175 Fortran and HPF 6.16 Dimension replication 6.15 Dimension collapsing

One or several dimensions of an array can be collapsed onto a single element of the align target:

§! HPF$ ALIGN( ∗ ,:) WITHY(:)::X ¤

¦ – each element from Y aligned to a column from X (First dimension of matrix X being collapsed)

6.16 Dimension replication

§! HPF$ ALIGNY(:) WITHX( ∗ ,:) ¤

¦ – each processor getting arbitrary row X(:,i) gets also a copy of Y(i) 176 Fortran and HPF 6.16 Dimension replication

Example: 2D Gauss elimination kernel of the program:

§... ¤ DOj = i +1 , n A( j , i ) = A( j , i )/Swap( i ) A( j , i +1:n) = A( j , i +1:n) − A( j , i ) ∗Swap( i +1:n) Y( j ) = Y( j ) − A( j , i ) ∗Temp END DO

¦ Y(k) together with A(k,i) =>

§! HPF$ ALIGNY(:) WITHA(:, ∗ ) ¤

¦ Swap(k) together with A(i,k) =>

§! HPF$ ALIGN Swap(:) WITHA( ∗ ,:) ¤

No neighbouring elements of matrix A appear in the same expression => CYCLIC:

§! HPF$DISTRIBUTEA(CYCLIC, CYCLIC) ¤

¦ 178 Fortran and HPF 6.16 Dimension replication

Example: matrix multiplication

PROGRAM ABmult
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 100
  INTEGER, DIMENSION (N,N) :: A, B, C
  INTEGER :: i, j
!HPF$ PROCESSORS square(2,2)
!HPF$ DISTRIBUTE (BLOCK,BLOCK) ONTO square :: C
!HPF$ ALIGN A(i,*) WITH C(i,*)
!      replicate copies of row A(i,*) onto processors which compute C(i,j)
!HPF$ ALIGN B(*,j) WITH C(*,j)
!      replicate copies of column B(*,j) onto processors which compute C(i,j)
  A = 1
  B = 2
  C = 0
  DO i = 1, N
    DO j = 1, N
!     All the work is local due to ALIGNs
      C(i,j) = DOT_PRODUCT(A(i,:), B(:,j))
    END DO
  END DO
  WRITE(*,*) C
END

6.17 HPF Intrinsic Functions

NUMBER_OF_PROCESSORS and PROCESSORS_SHAPE – information about the physical hardware, needed for portability:

!HPF$ PROCESSORS P1(NUMBER_OF_PROCESSORS())
!HPF$ PROCESSORS P2(4,4,NUMBER_OF_PROCESSORS()/16)
!HPF$ PROCESSORS P3(0:NUMBER_OF_PROCESSORS(1)-1, &
!HPF$               0:NUMBER_OF_PROCESSORS(2)-1)

On a 2048-processor hypercube:

PRINT *, PROCESSORS_SHAPE()

would return:

2 2 2 2 2 2 2 2 2 2 2

6.18 HPF Template Syntax

TEMPLATE - conceptual object, does not use any RAM, defined statically (like 0- sized arrays being not assigned to)

• are declared

• distributed

• can be used to align arrays

Example:

§REAL, DIMENSION(10) : : A, B ¤ ! HPF$ TEMPLATE, DIMENSION(10)::T ! HPF$DISTRIBUTE(BLOCK)::T ! HPF$ ALIGN(:) WITHT(:)::A,B

¦ (here only T may be argument to DISTRIBUTE) Combined TEMPLATE directive: 181 Fortran and HPF 6.18 HPF Template Syntax

§! HPF$ TEMPLATE, DIMENSION(100,100),& ¤ ! HPF$DISTRIBUTE(BLOCK, CYCLIC) ONTOP::T ! HPF$ ALIGNA(:,:) WITHT(:,:)

¦ which is equivalent to:

§! HPF$ TEMPLATE, DIMENSION(100,100)::T ¤ ! HPF$ ALIGNA(:,:) WITHT(:,:) ! HPF$DISTRIBUTET(BLOCK, CYCLIC) ONTOP

¦ 182 Fortran and HPF 6.18 HPF Template Syntax

Example:

§ PROGRAM Warty ¤ IMPLICIT NONE REAL, DIMENSION( 4 ) : : C REAL, DIMENSION( 8 ) : : D REAL, DIMENSION( 2 ) : : E ! HPF$ TEMPLATE, DIMENSION(8)::T ! HPF$ ALIGND(:) WITHT(:) ! HPF$ ALIGNC(:) WITHT(::2) ! HPF$ ALIGNE(:) WITHT(::4) ! HPF$DISTRIBUTE(BLOCK)::T C = 1; D = 2 E = D ( : : 4 ) + C ( : : 2 ) END PROGRAM Warty

¦ (similar to the example of alignment with stride) 183 Fortran and HPF 6.18 HPF Template Syntax

More examples on using Templates:

• ALIGNA (:)WITH T1(:,∗)– ∀ i, element A(i) replicated according to row T1(i,:).

• ALIGNC (i,j) WITH T2(j,i) – transposed C aligned with T2

• ALIGN B(:,*) WITH T3(:) – ∀ i, the matrix row B(i,:) is aligned with template element T3(i),

• DISTRIBUTE (BLOCK,CYCLIC):: T1, T2

• DISTRIBUTE T1(CYCLIC,∗)ONTOP – T1 rows are distributed cyclically 184 Fortran and HPF 6.19 FORALL 6.19 FORALL

Syntax:

FORALL (&lt;forall-triplet-list&gt; [, &lt;scalar-mask-expression&gt;]) &lt;assignment&gt;

example:

§FORALL ( i =1:n , j =1:m,A( i , j ).NE. 0 ) A( i , j ) = 1/A( i , j ) ¤

¦ 185 Fortran and HPF 6.19 FORALL

Circumstances, where Fortran90 syntax is not enough, but FORALL makes it simple:

• index expressions:

§FORALL ( i =1:n , j =1:n , i /= j ) A( i , j ) = REAL( i + j ) ¤

¦

• intrinsic or PURE-functions (which have no side-effects):

§FORALL ( i =1:n : 3 , j =1:n : 5 ) A( i , j ) = SIN (A( j , i )) ¤

¦

• subindexing:

§FORALL ( i =1:n , j =1:n) A(VS( i ), j ) = i +VS( j ) ¤

¦ 186 Fortran and HPF 6.19 FORALL

• unusual parts can be accessed:

§FORALL ( i =1:n) A( i , i ) = B( i ) ! diagonal ¤ !... DOj = 1 , n FORALL ( i =1: j ) A( i , j ) = B( i ) ! triangular END DO

¦ To parallelise add before DO also:

§! HPF$ INDEPENDENT, NEW(i) ¤

¦ or write nested FORALL commands:

§FORALL ( j = 1:n) ¤ FORALL ( i =1: j ) A( i , j ) = B( i ) ! triangular END FORALL

¦ 187 Fortran and HPF 6.19 FORALL

FORALL-command execution:

1. triple-list evaluation

2. scalar mask evaluation

3. for each .TRUE. mask elements find right hand side value

4. assignment of right hand side value to the left hand side

In case of HPF synchronisation between the steps! 188 Fortran and HPF 6.20 PURE-procedures 6.20 PURE-procedures

§PURE REAL FUNCTIONF ( x , y ) ¤ PURE SUBROUTINEG ( x , y , z )

¦ without side effects, i.e.:

• no outer I/O nor ALLOCATE (communication still allowed)

• does not change global state of the program

• intrinsic functions are of type PURE!

• can be used in FORALL and in PURE-procedures

• no PAUSE or STOP

• FUNCTION formal parameters with attribute INTENT(IN) 189 Fortran and HPF 6.20 PURE-procedures

Example (function):

§PURE REAL FUNCTIONF ( x , y ) ¤ IMPLICIT NONE REAL, INTENT ( IN ):: x , y F = x∗x + y∗y + 2∗x∗y + ASIN (MIN( x / y , y / x )) END FUNCTIONF

¦ example of usage:

§FORALL ( i =1:n , j =1:n)& ¤ A( i , j ) = b( i ) + F(1.0∗ i , 1 . 0 ∗ j )

¦ 190 Fortran and HPF 6.20 PURE-procedures

Example (subroutine):

§PURE SUBROUTINEG ( x , y , z ) ¤ IMPLICIT NONE REAL, INTENT (OUT), DIMENSION (:):: z REAL, INTENT ( IN ), DIMENSION (:):: x , y INTEGERi INTERFACE REAL FUNCTIONF ( x , y ) REAL, INTENT ( IN ):: x , y END FUNCTIONF END INTERFACE !... FORALL( i =1: SIZE ( z )) z ( i ) = F( x ( i ),y ( i )) END SUBROUTINEG

¦ 191 Fortran and HPF 6.20 PURE-procedures

“MIMD” example:

§REAL FUNCTIONF ( x , i ) ! PURE ¤ IMPLICIT NONE REAL, INTENT ( IN ):: x ! element INTEGER, INTENT ( IN ):: i ! index IF ( x > 0 . 0 ) THEN F = x∗x ELSEIF ( i ==1 .OR. i ==n) THEN F = 0.0 ELSE F = x ENDIF END FUNCTIONF

¦ 192 Fortran and HPF 6.21 INDEPENDENT 6.21 INDEPENDENT

Directly in front of DO or FORALL

§! HPF$ INDEPENDENT ¤ DOi = 1 ,n x ( i ) = i ∗∗2 END DO

¦ In front of FORALL: no synchronisation needed between right hand side expres- sion evaluation and assignment If INDEPENDENT loop...

• ...assigns more than once to a same element, parallelisation is lost!

• ...includes EXIT, STOP or PAUSE command, iteration needs to execute se- quentially to be sure to end in right iteration

• ...has jumps out of the loop or I/O => sequential execution 193 Fortran and HPF 6.21 INDEPENDENT

Is independent:

§! HPF$ INDEPENDENT ¤ DOi = 1 , n b( i ) = b( i ) + b( i ) END DO

¦ Not independent:

§DOi = 1 , n ¤ b( i ) = b( i +1) + b( i ) END DO

¦ Not independent:

§DOi = 1 , n ¤ b( i ) = b( i −1) + b( i ) END DO

¦ 194 Fortran and HPF 6.21 INDEPENDENT

This is independent loop:

§! HPF$ INDEPENDENT ¤ DOi = 1 , n a ( i ) = b( i −1) + b( i ) END DO

¦ Question to ask: does a later iteration depend on a previous one? 195 Fortran and HPF 6.22 INDEPENDENT NEW command 6.22 INDEPENDENT NEW command

– create an independent variable on each process!

§! HPF$ INDEPENDENT, NEW(s1,s2) ¤ DOi =1 ,n s1 = Sin ( a ( 1 ) ) s2 = Cos( a ( 1 ) ) a ( 1 ) = s1∗s1−s2∗s2 END DO

¦ 196 Fortran and HPF 6.22 INDEPENDENT NEW command

Rules for NEW-variable:

• cannot be used outside cycle without redefinition

• cannot be used in FORALL

• not pointers or formal parameters

◦ cannot have the SAVE attribute

Disallowed:

§! HPF$ INDEPENDENT, NEW(s1,s2) ¤ DOi = 1 ,n s1 = SIN ( a ( i )) s2 = COS( a ( i )) a ( i ) = s1∗s1−s2∗s2 END DO k = s1+s2 !not allowed!

¦ 197 Fortran and HPF 6.23 EXTRINSIC

Example (only outer cycles can be executed independently):

§! HPF$ INDEPENDENT, NEW(i2) ¤ DO i1 = 1 , n1 ! HPF$ INDEPENDENT, NEW(i3) DO i2 = 1 , n2 ! HPF$ INDEPENDENT, NEW(i4) DO i3 = 1 , n3 DO i4 = 1 , n4 a ( i1 , i2 , i3 ) = a ( i1 , i2 , i3 )& + b( i1 , i2 , i4 ) ∗c ( i2 , i3 , i4 ) END DO END DO END DO END DO

¦ 6.23 EXTRINSIC

Used in INTERFACE-commands to declare that a routine does not belong to HPF 198 Fortran and HPF 6.23 EXTRINSIC 7 MPI

See separate slides on course homepage! 199 // Alg.Design Part III Parallel Algorithms

8 Parallel Algorithm Design Principles

• Identifying portions of the work that can be performed concurrently

• Mapping the concurrent pieces of work onto multiple processes running in parallel

• Distributing the input, output and also intermediate data associated with the program

• Access management of data shared by multiple processes

• Process synchronisation in various stages in parallel program execution 200 // Alg.Design 8.1 Decomposition, Tasks and Dependency Graphs 8.1 Decomposition, Tasks and Dependency Graphs

• Subdividing calculations into smaller components is called decomposition

Example: Dense matrix-vector multiplication y = Ab, i.e. $y[i] = \sum_{j=1}^{n} A[i,j]\, b[j]$

([1])

• Computational Problem × Decomposition →Tasks 201 // Alg.Design 8.1 Decomposition, Tasks and Dependency Graphs

◦ in the case of y = Ab the tasks – computation of the individual rows y[i] – are independent (a sketch follows below)
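A minimal OpenMP Fortran sketch of this row-wise decomposition (the subroutine name and argument shapes are assumptions for illustration):

SUBROUTINE matvec_rows(n, A, b, y)
  ! Row-wise decomposition: computing each y(i) is an independent task.
  INTEGER, INTENT(IN) :: n
  REAL, INTENT(IN)    :: A(n,n), b(n)
  REAL, INTENT(OUT)   :: y(n)
  INTEGER :: i, j
!$OMP PARALLEL DO PRIVATE(i,j)
  DO i = 1, n
     y(i) = 0.0
     DO j = 1, n
        y(i) = y(i) + A(i,j)*b(j)
     END DO
  END DO
!$OMP END PARALLEL DO
END SUBROUTINE matvec_rows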

([1]) Tasks and their relative order abstraction:

• Task Dependency Graph

◦ directed acyclic graph – nodes – tasks 202 // Alg.Design 8.1 Decomposition, Tasks and Dependency Graphs

– directed edges – dependences ◦ (can be disconnected) ◦ (edge-set can be empty)

• Mapping tasks onto processors

• Processors vs. processes 203 // Alg.Design 8.2 Decomposition Techniques 8.2 Decomposition Techniques

• Recursive decomposition

Example: Quicksort 204 // Alg.Design 8.2 Decomposition Techniques

• Data decomposition

◦ Input data ◦ Output data ◦ intermediate results ◦ Owner calculates rule

• Exploratory decomposition

Example: 15-square game

• Speculative decomposition

• Hybrid decomposition 205 // Alg.Design 8.3 Tasks and Interactions 8.3 Tasks and Interactions

• Task generation

◦ static ◦ dynamic

• Task sizes

◦ uniform ◦ non-uniform ◦ knowledge of task sizes ◦ size of data associated with tasks 206 // Alg.Design 8.3 Tasks and Interactions

• Characteristics of Inter-Task Interactions

◦ static versus dynamic ◦ regular versus irregular ◦ read-only versus read-write ◦ one-way versus two-way 207 // Alg.Design 8.4 Mapping Techniques for Load balancing 8.4 Mapping Techniques for Load balancing

• Static mapping

◦ Mappings based on data partitioning – array distribution schemes · block distribution · cyclic and block-cyclic distribution · randomised block distribution – Graph partitioning 208 // Alg.Design 8.4 Mapping Techniques for Load balancing

• Dynamic mapping schemes

◦ centralised schemes – master-slave – self scheduling · single task scheduling · chunk scheduling ◦ distributed schemes – each process can send/receive work from whichever process · how are sending receiving processes paired together? · initiator – sender or receiver? · how much work exchanged? · when the work transfer is performed? 209 // Alg.Design 8.4 Mapping Techniques for Load balancing

Methods for reducing interaction overheads

• maximising data locality

◦ minimise volume of data exchange ◦ minimise frequency of interactions

• overlapping computations with interactions

• replicating data or computation

• overlapping interactions with other interactions 210 // Alg.Design 8.5 Parallel Algorithm Models 8.5 Parallel Algorithm Models

• data-parallel model

• task graph model

◦ typically amount of data relatively large to the amount of computation

• work pool model

◦ or task pool model ◦ pool can be – centralised – distributed – statically/dynamically created,

• master-slave model 211 // Alg.Design 8.5 Parallel Algorithm Models

• pipeline (or producer-consumer model)

◦ stream parallelism
◦ a stream of data triggers computations
◦ a pipeline is a chain of producer-consumer processes
◦ the shape of the pipeline can be:
  – a linear or multidimensional array
  – trees
  – general graphs with or without cycles

• Bulk-synchronous parallel (BSP) model

◦ Synchronisation steps needed in a regular or irregular pattern – p2p synchronisation – and/or global synchronisation 212 // Alg.Design 9.1 EXAMPLE: 2D FEM for Poisson equation An example of BSP model based parallelisation: 9 Solving systems of linear equations with sparse matrices

9.1 EXAMPLE: 2D FEM for Poisson equation

Consider as an example Poisson equation

\[ -\Delta u(x) = f(x) \quad \forall x \in \Omega, \qquad u(x) = g(x) \quad \forall x \in \Gamma, \]

where the Laplacian $\Delta$ is defined by

\[ (\Delta u)(x) = \left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} \right)(x), \qquad x = \begin{pmatrix} x \\ y \end{pmatrix} \]

u can be e.g.:

• temperature
• electro-magnetic potential
• displacement of an elastic membrane fixed at the boundary under a transversal load of intensity f

Divergence theorem:
\[ \int_\Omega \operatorname{div} F \, dx = \int_\Gamma F \cdot n \, ds, \]
where $F = (F_1, F_2)$ is a vector-valued function defined on $\Omega$,
\[ \operatorname{div} F = \frac{\partial F_1}{\partial x_1} + \frac{\partial F_2}{\partial x_2}, \]
and $n = (n_1, n_2)$ is the outward unit normal to $\Gamma$. Here $dx$ is the element of area in $\mathbb{R}^2$; $ds$ is arc length along $\Gamma$.

Applying the divergence theorem to $F = (vw, 0)$ and $F = (0, vw)$ (details not shown here) gives Green's first identity
\[ \int_\Omega \nabla u \cdot \nabla v \, dx = \int_\Gamma v \frac{\partial u}{\partial n} \, ds - \int_\Omega v \, \Delta u \, dx \tag{1} \]

and we come to a variational formulation of the Poisson problem on V:

∂v ∂v V = {v : v is continuous on Ω, and are piecewise continuous on Ω and ∂x1 ∂x2 v = 0 on Γ}

Poisson problem in Variational Formulation: Find u ∈ V such that

a(u,v) = ( f ,v), ∀v ∈ V (2) where in case of Poisson equation Z a(u,v) = ∇u · ∇vdx Ω R and (u,v) = Ω u(x)v(x)dx. The gradient ∇ of a scalar function f (x,y) is defined by: ∂ f ∂ f  ∇ f = , ∂x ∂y 215 // Alg.Design 9.1 EXAMPLE: 2D FEM for Poisson equation

Again, the Variational formulation of Poisson equation is:

Z ∂u ∂v ∂u ∂v Z + dx = f vdx Ω ∂x ∂x ∂y ∂y Ω

With discretisation, we replace the space V with Vh – space of piecewise linear functions. Each function in Vh can be written as

vh(x,y) = ∑ηiϕi(x,y), where ϕi(x,y) – basis function (or ’hat’ functions)

[Figure: a piecewise linear function v_h and a 'hat' basis function ϕ_i on the triangulation T of Ω]

With 2D FEM we demand that the equation in the Variational formulation is satis- fied for M basis functions ϕi ∈ Vh i.e. Z Z ∇uh · ∇ϕidx = f ϕidx i = 1,...,M Ω Ω But we have: M uh = ∑ ξ jϕ j =⇒ j=1 M ∇uh = ∑ ξ j∇ϕ j j=1

M linear equations with respect to unknowns ξ j:

M Z  Z ∑ ξ j∇ϕ j · ∇ϕidx = f ϕidx i = 1,...,M Ω j=1 Ω 217 // Alg.Design 9.1 EXAMPLE: 2D FEM for Poisson equation

The stiffness matrix A = (ai j) elements and right-hand side b = (bi) calculation: Z ai j = a(ϕi,ϕ j) = ∇ϕi · ∇ϕ jdx Ω Z bi =( f ,ϕi) = f ϕidx Ω

Integrals computed only where the pairs ∇ϕi · ∇ϕ j get in touch (have mutual support)

Example

0 two basis functions ϕi and ϕ j for nodes Ni and Nj Their common support is τ ∪τ so that Z Z Z ai j = ∇ϕi · ∇ϕ jdx = ∇ϕi · ∇ϕ jdx + ∇ϕi · ∇ϕ jdx Ω τ τ0 218 // Alg.Design 9.1 EXAMPLE: 2D FEM for Poisson equation

[Figure: basis functions ϕ_i and ϕ_j on neighbouring triangles τ and τ′ with common nodes N_i and N_j]

Element matrices

Consider single element τ Pick two basis functionsϕi and ϕ j (out of three). ϕk – piecewise linear =⇒ denoting by pk(x,y) = ϕk|τ :

pi(x,y)= αix + βiy + γi, p j(x,y)= α jx + β jy + γ j.

∂ pi ∂ pi =⇒∇ϕi τ = ( ∂x , ∂y ) = (αi,βi)=⇒ their dot product Z Z ∇ϕi · ∇ϕ jdx = αiα j + βiβ jdx. τ τ 219 // Alg.Design 9.1 EXAMPLE: 2D FEM for Poisson equation

Finding coefficients α and β – put three points (xi,yi,1), (x j,y j,0), (xk,yk,0) to the plane equation and solve the system        αi xi yi 1 αi 1 D βi  =  x j y j 1  βi  =  0  γi xk yk 1 γi 0

1 The integral is computed by the multiplication with the triangle area 2 |detD|   Z xi yi 1 τ  1 ai j = ∇ϕi · ∇ϕ jdx = αiα j + βiβ j det x j y j 1  τ 2 xk yk 1

The element matrix for τ is  τ τ τ  aii ai j aik τ τ τ τ A =  a ji a j j a jk  τ τ τ aki ak j akk 220 // Alg.Design 9.1 EXAMPLE: 2D FEM for Poisson equation

Assembled stiffness matrix

• created by adding appropriately all the element matrices together

• Different types of boundary values used in the problem setup result in slightly different stiffness matrices

• Most typical boundary conditions:

◦ Dirichlet’ ◦ Neumann

• but also:

◦ free boundary condition ◦ Robin boundary condition 221 // Alg.Design 9.1 EXAMPLE: 2D FEM for Poisson equation

Right-hand side R Right-hand side b : bi = Ω f ϕidx. Approximate calculation of an integral through the f value in the middle of the triangle τ: x + x + x y + y + y f τ = f ( i j k i j k ) 3 3 • support – the area of the triangle

• ϕi is a pyramid (integrating ϕi over τ gives pyramid volume) Z τ τ 1 τ τ bi = f ϕidx = f |detD | τ 6

τ • Assembling procedure for the RHS b values from bi (somewhat similarly to matrix assembly, but more simple)

• Introduction of Dirichlet boundary condition just eliminates the rows. 222 // Alg.Design 9.1 EXAMPLE: 2D FEM for Poisson equation

EXAMPLE

Consider the unit square (figure: a uniform triangulation of the unit square Ω with boundary Γ).

V_h – the space of piecewise linear functions on Ω that are zero on Γ, with values defined at the inner nodes

Discretised problem in Variational Formulation: Find uh ∈ Vh such that

a(uh,v) = ( f ,v), ∀v ∈ Vh (3) where in case of Poisson equation Z a(u,v) = ∇u · ∇vdx Ω 223 // Alg.Design 9.1 EXAMPLE: 2D FEM for Poisson equation

• In FEM the equation (3) needs to be satisfied on a set of testfunctions ϕi = ϕi(x),

◦ which are defined such that  1, x = xi ϕi = 0 x = x j j 6= i

• and it is demanded that (3) is satisfied with each ϕi (i = 1,...,N)

• As a result, a matrix of the linear equations is obtained

In the general case the n² × n² Laplace matrix has the form

\[ A = \begin{pmatrix} B & -I & 0 & \cdots & 0 \\ -I & B & -I & \ddots & \vdots \\ 0 & -I & B & \ddots & 0 \\ \vdots & \ddots & \ddots & \ddots & -I \\ 0 & \cdots & 0 & -I & B \end{pmatrix}, \tag{4} \]

where the n × n matrix B is of the form

\[ B = \begin{pmatrix} 4 & -1 & 0 & \cdots & 0 \\ -1 & 4 & -1 & \ddots & \vdots \\ 0 & -1 & 4 & \ddots & 0 \\ \vdots & \ddots & \ddots & \ddots & -1 \\ 0 & \cdots & 0 & -1 & 4 \end{pmatrix}. \]

It means that most of the elements of matrix A are zero.

Poisson equation with a varying coefficient (different materials, varying conductivity, elasticity etc.):

−∇(λ(x,y)∇u) = f

• assume that λ – constant accross each element τ: λ(x,y) = λ τ (x,y) ∈ τ

• =⇒ Elemental matrices Aτ get multiplied by λ τ . 225 // Alg.Design 9.1 EXAMPLE: 2D FEM for Poisson equation

9.1.1 Sparse matrix storage schemes

9.1.2 Triple storage format

• n × m matrix A each nonzero with 3 values: integers i and j and (in most applications) real matrix element ai j. =⇒ three arrays:

§indi ( 1 : nz ), indj ( nz ), vals ( nz ) ¤

¦ of length nz, – number of matrix A nonzeroes Advantages of the scheme: • Easy to refer to a particular element

• Freedom to choose the order of the elements Disadvantages : • Nontrivial to find, for example, all nonzeroes of a particular row or column and their positions 226 // Alg.Design 9.1 EXAMPLE: 2D FEM for Poisson equation

9.1.3 Column-major storage format For each matrix A column j a vector col(j) – giving row numbers i for which ai j 6= 0.

• To store the whole matrix, each col(j)

◦ added into a 1-dimensional array cols(1:nz) ◦ introduce cptr(1:M) referring to the column starts of each column in cols.

§cols ( 1 : nz ), cptr ( 1 :M), vals ( 1 : nz ) ¤

¦ Advantages:

• Easy to find matrix column nonzeroes together with their positions 227 // Alg.Design 9.1 EXAMPLE: 2D FEM for Poisson equation

Disadvantages:

• Algorithms become more difficult to read

• Difficult to find nonzeroes in a particular row

9.1.4 Row-major storage format For each matrix A row i a vector row(i) giving column numbers j for which ai j 6= 0.

• To store the whole matrix, each row(i) is

◦ appended to a 1-dimensional array rows(1:nz) ◦ and rptr(1:N) is introduced, referring to the start of each row in rows.

rows(1:nz), rptr(1:N), vals(1:nz)

¦ Advantages: 228 // Alg.Design 9.1 EXAMPLE: 2D FEM for Poisson equation

• Easy to find matrix row nonzeroes together with their positions

Disadvantages:

• Algorithms become more difficult to read.

• Difficult to find nonzeroes in a particular column.
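For illustration, a minimal Fortran sketch of a matrix-vector product using the row-major arrays above; it assumes the common convention that rptr has N+1 entries with rptr(N+1) = nz+1 marking the end of the last row (this extra entry is an assumption, not stated on the slide):

SUBROUTINE rowmajor_matvec(N, nz, rptr, rows, vals, x, y)
  ! y = A*x for A stored row-wise: rows(k) holds the column index of vals(k).
  INTEGER, INTENT(IN) :: N, nz
  INTEGER, INTENT(IN) :: rptr(N+1), rows(nz)
  REAL, INTENT(IN)    :: vals(nz), x(*)
  REAL, INTENT(OUT)   :: y(N)
  INTEGER :: i, k
  DO i = 1, N
     y(i) = 0.0
     DO k = rptr(i), rptr(i+1) - 1     ! nonzeroes of row i
        y(i) = y(i) + vals(k) * x(rows(k))
     END DO
  END DO
END SUBROUTINE rowmajor_matvec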

9.1.5 Combined schemes

Triple format enhanced with cols(1:nz), cptr(1:M), rows(1:nz), rptr(1:N). Here cols and rows refer to corresponding matrix A refer to values in triple format. Advantages:

• All operations easy to perform

Disadvantages:

• More memory needed.

• Reference through indexing in all cases 229 // Alg.Design 9.2 Solution methods for sparse matrix systems 9.2 Solution methods for sparse matrix systems

9.2.1 Direct methods

LU-factorisation with minimal fill-in

• analysis – 100

• factorisation – 10

• back-substitution – 1

Discretisations of PDEs on 2D regions – still OK; for 3D problems – not so efficient.
Implementations: MUMPS, UMFPACK, ...

9.2.2 Krylov subspace method: Preconditioned Conjugate Gradient Method (PCG)

Calculate r(0) = b − A x(0) with a given starting vector x(0)
for i = 1, 2, ...
   solve M z(i−1) = r(i−1)
   rho(i−1) = r(i−1)^T z(i−1)
   if i == 1
      p(1) = z(0)
   else
      beta(i−1) = rho(i−1)/rho(i−2)
      p(i) = z(i−1) + beta(i−1) p(i−1)
   endif
   q(i) = A p(i) ;  alpha(i) = rho(i−1) / ( p(i)^T q(i) )
   x(i) = x(i−1) + alpha(i) p(i) ;  r(i) = r(i−1) − alpha(i) q(i)
   check convergence; continue if needed
end
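A minimal Fortran sketch of the same iteration for a dense SPD matrix, using the diagonal of A as the preconditioner M (Jacobi); the dense storage, the tolerance and the routine name are assumptions for illustration:

SUBROUTINE pcg_jacobi(n, A, b, x, tol, maxit)
  ! Preconditioned CG with M = diag(A) (Jacobi preconditioner).
  INTEGER, INTENT(IN) :: n, maxit
  REAL, INTENT(IN)    :: A(n,n), b(n), tol
  REAL, INTENT(INOUT) :: x(n)              ! starting vector in, solution out
  REAL :: r(n), z(n), p(n), q(n), d(n)
  REAL :: rho, rho_old, alpha, beta
  INTEGER :: it, i
  d = (/ (A(i,i), i = 1, n) /)             ! diagonal of A
  r = b - MATMUL(A, x)                     ! r(0) = b - A x(0)
  rho_old = 1.0
  DO it = 1, maxit
     z = r / d                             ! solve M z = r
     rho = DOT_PRODUCT(r, z)
     IF (it == 1) THEN
        p = z
     ELSE
        beta = rho / rho_old
        p = z + beta * p
     END IF
     q = MATMUL(A, p)
     alpha = rho / DOT_PRODUCT(p, q)
     x = x + alpha * p
     r = r - alpha * q
     rho_old = rho
     IF (SQRT(DOT_PRODUCT(r, r)) < tol) EXIT   ! convergence check
  END DO
END SUBROUTINE pcg_jacobi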

9.3 Preconditioning

Idea: replace Ax = b with the system M^{-1}Ax = M^{-1}b. Apply CG to

Bx = c, (5) where B = M−1A and c = M−1b. But how to choose M? Preconditioner M = MT to be chosen such that

(i) Problem Mz = r being easy to solve

(ii) Matrix B being better conditioned than A, meaning that κ2(B) < κ2(A) 232 // Alg.Design 9.3 Preconditioning

Then #IT(5) = O(√κ₂(B)) < O(√κ₂(A)) = #IT(solving Ax = b)

• (In extreme cases M = I or M = A)

• Preconditioned Conjugate Gradients (PCG) Method

◦ obtained by taking M ≠ I in the previous algorithm

9.4 Domain Decomposition Method

Preconditioning and parallelisation simultaneously Solution domain Ω divided into P overlapping subdomains • Substructuring methods

• Schwarz method

◦ Additive Schwarz
  \[ M^{-1} = \sum_{i=1}^{P} R_i^T A_i^{-1} R_i \]
  – R_i – restriction operator
  – A_i^{-1} – i-th subdomain solve
    · subdomain solvers – exact or inexact
    · in the case of CG, symmetry of M^{-1} is a prerequisite!
◦ Multiplicative Schwarz

• Two-level methods 234 // Alg.Design 9.5 Domain Decomposition on Unstructured Grids 9.5 Domain Decomposition on Unstructured Grids

Domain DecompositiononUnstructuredGrids DOUG (University of Bath, University of Tartu) 1997 - 2012 I.G.Graham, M.Haggers, R. Scheichl, L.Stals, E.Vainikko, M.Tehver, K. Sk- aburskas, O. Batrašev, C. Pöcher Parallel implementation based on:

• MPI • 2D & 3D problems

• UMFPACK • 2-level Additive Schwarz method • METIS • 2-level partitioning • BLAS • Automatic Coarse Grid generation • Automatic parallelisation and load- balancing • Adaptive refinement of the coarse • Systems of PDEs grid 235 // Alg.Design 9.5 Domain Decomposition on Unstructured Grids DOUG Strategies

• Iterative solver based on Krylov subspace methods PCG, MINRES, BICGSTAB, 2-layered FPGMRES with left or right preconditioning

• Non-blocking communication where at all possible

Ax-operation: y := Ax – :-) Dot-product: (x,y) – :-(

• Preconditioner based on Domain Decomposition with 2-level solvers Applying the preconditioner M−1: solve for z : Mz = r .

• Subproblems are solved with a direct, sparse multifrontal solver (UMFPACK) 236 // Alg.Design 9.5 Domain Decomposition on Unstructured Grids Aggregation-based Domain Decomposition methods

• Had been analysed only upto some • using rough aggregation before extent graph partitioner

• Major progress - development of • We are: making use of strong the theory: sharper bounds connections • Implementation based on For- • Second aggregation for creating tran95 http://dougdevel. subdomains, or org

1. R. Scheichl and E. Vainikko, Robust Aggregation-Based Coarsening for Additive Schwarz in the Case of Highly Vari- able Coefficients, Proceddings of the European Conference on Computational Fluid Dynamics, ECCOMAS CFD 2006 (P. Wesseling, E. ONate, J. Periaux, Eds.), TU Delft, 2006.

2. R. Scheichl and E. Vainikko, Additive Schwarz and Aggregation-Based Coarsening for Elliptic Problems with Highly Variable Coefficients, Computing, 2007, 80(4), pp 319-343. 237 // Alg.Design 9.5 Domain Decomposition on Unstructured Grids Aggregation – forming groups of neighbouring graph nodes

Key issues:

• how to find good aggregates?

• Smoothing step(s) for restriction and interpolation operators

Four (often conflicting) aims:

1. follow adequatly underlying physial properties of the domain

2. try to retain optimal aggregate size

3. keep the shape of aggregates regular

4. reduce communication => develop aggregates with smooth boundaries 238 // Alg.Design 9.5 Domain Decomposition on Unstructured Grids

[Figures: coefficient jump resolving property]

Algorithm: Shape-preserving aggregation
Input: matrix A, aggregation radius r, strong connection threshold α.
Output: aggregate number for each node in the domain.
1. Scale A to a unit-diagonal matrix (all ones on the diagonal)
2. Find the set S of strong connections of A: S = ∪_{i=1}^{n} S_i, where S_i ≡ { j ≠ i : |a_ij| ≥ α max_{k≠i} |a_ik| }; unscale A; aggr_num := 0
3. aggr_num := aggr_num + 1; choose a seed node x from G or, if G = ∅, choose the first nonaggregated node x; level := 0
4. If (level

Parallel Implementation

• Nearest neighbour communication after Ax-operations and after fine solves

• Coarse problem setup

◦ Interpolation (I = RT ), restriction (R) operators + smoothing operator – sparse matrix structure T ◦ Coarse matrix A0 = RAR , through sparse matrix multiplication (SMM) operations are key routines affecting the performance of initialisation stage

• SMM easy to parallelise (all data is available locally due to overlap)

• Coarse problem solution

◦ Coarse problem placement - who is responsible for solving it? ◦ Communication strategy: coarse vector distribution in background during fine solves (on a different thread) 242 // Alg.Design 9.5 Domain Decomposition on Unstructured Grids

DOUG numerical results – clipped random fields.

Time in seconds (# iterations) for different methods:

jump size   Aggregation   AMG         UMFPACK
1.5 · 10^1  2.12 (24)     1.35 (14)   1.85
2.2 · 10^2  2.14 (27)     2.27 (27)   1.70
3.3 · 10^3  2.34 (29)     3.31 (40)   1.33
4.9 · 10^4  2.41 (26)     6.23 (77)   4.88

n = 256 × 256

10 Parallel Algorithm Optimality Definitions

Joseph Ja’Ja’: An Introduction to Parallel Algorithms.

• Algorithms from the perspective of theory of parallel algorithms

• Parallel program complexity analysis

• Shared memory model ([C/E]R[C/E]W PRAM, see pt. 5.1)

• Unlimited number of processors

• Communication time not considered 244 // Alg.Optimality 10.1 Assessing Parallel Algorithms 10.1 Assessing Parallel Algorithms

W – Work T – Time

• T(n) = O( f (n)) – ∃ constants c > 0 and n0 > 0 such that T(n) ≤ c f (n) for each n ≥ n0

• T(n) = Ω( f (n)) – ∃ constants c > 0 and n0 > 0 such that T(n) ≥ c f (n) for each n ≥ n0

• T(n) = Θ( f (n)) – T(n) = O( f (n)) and T(n) = Ω( f (n)). 245 // Alg.Optimality 10.2 Definitions of Optimality 10.2 Definitions of Optimality

Consider computational problem Q. Denote T ∗(n) – Q sequential ie.: ∃ an algorithm for solution of the problem Q with time O(T ∗(n)) and it can be shown that the time cannot be improved Sequential algorithm with work time O(T ∗(n)) is called time optimal 2 types of parallel algorithm optimality definitions (1st weaker than 2nd):

1. Parallel algorithm for solution Q is called optimal or optimal in weak sense or weakly optimal or cost optimal, if W(n) = Θ(T ∗(n)).

2. Parallel algorithm is called work-time (WT ) optimal or optimal in the strong sense if it can be shown that the program parallel working time T(n) cannot be improved by any other optimal parallel algorithm 246 // Alg.Optimality 10.2 Definitions of Optimality

EXAMPLE: Algorithm Sum

Input: n = 2^k numbers in array A
Output: the sum S = ∑_{i=1}^{n} A(i)
begin
  1. for i = 1 : n pardo
        B(i) := A(i)
  2. for h = 1 : log n do
        for i = 1 : n/2^h pardo
           B(i) := B(2i − 1) + B(2i)
  3. S := B(1)
end

Number of time units (j): T(n) = log n + 2 = O(log n)
Step 1: n operations; step j (h = j − 1 in part 2 of the algorithm): n/2^{j−1} operations (j = 2 : log n + 1)
=> W(n) = n + ∑_{j=1}^{log n} (n/2^j) + 1 = O(n)
Since W(n) = O(n) and T*(n) = n, the algorithm is optimal.
It can be shown that optimal speedup is achieved with p = O(n/log n) processors (see the book).
(In Ja'Ja' Section 10 it is shown that the problem takes Ω(log n) time independently of the number of processors => the algorithm is optimal also in the strong sense.)

11 Basic Techniques in Parallel Algorithms

11.1 Balanced Trees

Prefix Sums

Find s_i = x_1 ∗ x_2 ∗ ... ∗ x_i, 1 ≤ i ≤ n, where ∗ is some (associative) operation, e.g. +, ×, ...

Easiest way: s_i = s_{i−1} ∗ x_i – sequential, T*(n) = O(n)

Figure: sequential and parallel 248 // Algorithm Basic Techniques 11.1 Balanced Trees

ALGORITHM Prefix Sums
Input: array of n = 2^k elements (x_1, x_2, ..., x_n), where k ≥ 0 is an integer
Output: prefix sums s_i, 1 ≤ i ≤ n
begin
  1. if n = 1 then { set s_1 := x_1; exit }
  2. for 1 ≤ i ≤ n/2 pardo
        set y_i := x_{2i−1} ∗ x_{2i}
  3. Recursively find the prefix sums of {y_1, y_2, ..., y_{n/2}} and store them in z_1, z_2, ..., z_{n/2}
  4. for 1 ≤ i ≤ n pardo
        { i even:    set s_i := z_{i/2}
          i = 1:     set s_1 := x_1
          i odd > 1: set s_i := z_{(i−1)/2} ∗ x_i }
end
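A sequential Fortran sketch of the same recursion ('pardo' loops become ordinary loops here; n is assumed to be a power of two and the operation ∗ is taken to be +):

RECURSIVE SUBROUTINE prefix_sums(n, x, s)
  INTEGER, INTENT(IN) :: n
  REAL, INTENT(IN)    :: x(n)
  REAL, INTENT(OUT)   :: s(n)
  REAL :: y(n/2), z(n/2)
  INTEGER :: i
  IF (n == 1) THEN
     s(1) = x(1)
     RETURN
  END IF
  DO i = 1, n/2                  ! step 2: pairwise sums (pardo in the PRAM version)
     y(i) = x(2*i-1) + x(2*i)
  END DO
  CALL prefix_sums(n/2, y, z)    ! step 3: recurse on the half-length sequence
  DO i = 1, n                    ! step 4: expand back to full length
     IF (i == 1) THEN
        s(1) = x(1)
     ELSE IF (MOD(i,2) == 0) THEN
        s(i) = z(i/2)
     ELSE
        s(i) = z((i-1)/2) + x(i)
     END IF
  END DO
END SUBROUTINE prefix_sums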

Program work time T(n) and work W(n) estimation:

T(n) = T(n/2) + a W(n) = W(n/2) + bn,

where a and b are constants. Solution of the recursion:

T(n) = O(logn) W(n) = O(n).

Algorithm is (WT -optimal) or optimal in strong sense (Proof Ja’Ja’ Sec 10), EREW and CREW PRAM 250 // Algorithm Basic Techniques 11.2 Pointer Jumping Method 11.2 Pointer Jumping Method

Rooted Directed tree T is a directed graph with a special node r such that

1. for each v ∈ V − {r} outdegree is 1, outdegree of r is 0;

2. for each v ∈ V − {r} there exists a directed path from node v to the node r.

The special node r is called root of tree T

With pointer jumping it is possible to quickly manage data stored in rooted directed trees 251 // Algorithm Basic Techniques 11.2 Pointer Jumping Method

Task: Find all roots of forest

Forest F:

• consists of rooted directed trees

• given by array P(1 : n) : P(i) = j => edge from node i to j

• If P(i) = i => i – root

Task is to find root S( j) for each j, ( j = 1,...,n).

Simple algorithm (for example, reversing all arrows and thereafter root finding) – linear time

Initially take S(i) – the successor of node i – to be P(i). In pointer jumping, each node's successor is repeatedly overwritten with its successor's successor => after k iterations the distance between i and S(i) is 2^k, if S(i) is not a root.

ALGORITHM Pointer Jumping
Input: forest of rooted directed trees, with each root pointing to itself; edges given as pairs (i, P(i)), i = 1, ..., n
Output: the root S(i) of the tree that node i belongs to
begin
  for i = 1, n pardo
     S(i) := P(i)
     while (S(i) ≠ S(S(i))) do
        S(i) := S(S(i))
end

CREW PRAM
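A compact Fortran sketch of the same idea (subroutine name is an assumption); the vector subscript S(S) evaluates to the array of values S(S(i)), so each assignment performs one jumping step over all nodes at once (executed sequentially here):

SUBROUTINE find_roots(n, P, S)
  INTEGER, INTENT(IN)  :: n, P(n)
  INTEGER, INTENT(OUT) :: S(n)
  S = P                               ! successor of i is initially P(i)
  DO WHILE (ANY(S /= S(S)))           ! some node does not yet point at a root
     S = S(S)                         ! pointer jumping: S(i) := S(S(i)) for all i
  END DO
END SUBROUTINE find_roots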

Example 254 // Algorithm Basic Techniques 11.2 Pointer Jumping Method

Time and work estimate for the given algorithm: h- maximal tree-length in forest

• Time: stepsize between i and S(i) doubles in each iteration until S(S(i)) is root to the tree conatining i

◦ => number of iterations= O(logh) ◦ Each iteration can be performed in O(1) parallel time and O(n) operations

◦ => T(n) = O(logh)

• W(n) = O(nlogh)

(W(n) non-optimal due to existence of O(n) sequential algorithm) 255 // Algorithm Basic Techniques 11.3 Divide and Conquer 11.3 Divide and Conquer

Divide and conquer strategy

• consists of 3 main steps:

1. Dividing input into multiple relatively equal parts

2. Solving recursively subproblems on given parts

3. Merging the found solutions of different subproblems into a solution of the over- all problem 256 // Algorithm Basic Techniques 11.3 Divide and Conquer

Finding convex hull of a given set

• Let a set of points S = {p1, p2,..., pn}, be given with coordinates pi = (xi,yi).

• Planar convex hull of set S is the smallest convex polygon, containing all n points of S

• Polygon Q is called convex if for arbitrary 2 polygon Q points p and q the line segment (p,q) lies entirely in polygon Q

Convex hull problem: Determine the ordered (say, clockwise) list CH(S) of the points of S, determining the boundary of the convex hull of S 257 // Algorithm Basic Techniques 11.3 Divide and Conquer 258 // Algorithm Basic Techniques 11.3 Divide and Conquer

• Can be shown that in case of sequential algorithm T ∗(n) = Θ(nlogn) (sorting can be reduced to the problem of solving convex-hull problem)

Let p and q be points from S with the smallest and largest x-coordinate

• => p and q ∈ CH(S).

• => CH(S) points p ... q form upper hull UH(S)

• q... p form lower LH(S)

Remark. Sorting n numbers can be done in O(logn) time on EREW PRAM using total of O(nlogn) operations. (Ja’Ja’ Chapter.4.) Assume for simplicity that no points have the same x or y coordinates and that n is power of 2 Sort the points according to x-coordinate (O(logn) time end O(nlogn) operations): x(p1) < x(p2) < ··· < x(pn) Let S = (p , p ,··· , p n ) and S = (p n ,··· , pn) and UH(S ) and UH(S ) have 1 1 2 2 2 2 +1 1 2 been found already. Upper common tangent for sets Si (i = 1,2) needs to be found. 259 // Algorithm Basic Techniques 11.3 Divide and Conquer 260 // Algorithm Basic Techniques 11.3 Divide and Conquer

The upper common tangent for UH(S1) and UH(S2) can be found in O(logn) sequential time using binary search (Ja’Ja’ Ch. 6)

0 0 Let UH(S1) = (q1,...,qs) and UH(S2) = (q1,...,qt)

0 • (q1 = p1 and qt = pn)

0 • Upper common tangent defined with points: (qi,q j). UH(S) consists of first iUH (S1) points and t − j + 1 UH(S2) points: 0 0 UH(S) = (q1,...,qi,q j,...,qt) – can be determined in O(1) time and O(n) operations 261 // Algorithm Basic Techniques 11.3 Divide and Conquer

ALGORITHM (Simple Upper Hull) Input: Set S, consisting of points in the plane, no two of which have the same x or y coordinates s.t. k x(p1) < x(p2) < ··· < x(pn), where n = 2 . Output: The upper hull of S begin 1. If n ≤ 4 use brute-force method to find UH(S) and exit 2. Let S = (p , p ,··· , p n ) and S = (p n ,··· , pn) 1 1 2 2 2 2 +1 Recursively compute UH(S1) and UH(S2) in parallel 3. Find the upper common tangent between UH(S1) and UH(S2) and deduce the upper hull of S end 262 // Algorithm Basic Techniques 11.3 Divide and Conquer

Theorem. Algorithm correctly computes the upper hull of n points in the plane taking O(log2 n) time and using total of O(nlogn) operations. Proof. Correctness follows from induction over n T(n) andW(n)? Step 1. – O(1) time Step 2. – T(n/2) time using 2W(n/2) operations Step 3. – Common tangent: O(logn) time, combining O(1) time and O(n) operations => recursions:

n • T(n) ≤ T( 2 ) + alogn, n • W(n) ≤ 2W( 2 ) + bn. with a and b positive constants. Solving the recursions gives: T(n) = O(log2 n) ja W(n) = O(nlogn). It follows that: Convex hull of a set of n points can be found in O(log2 n) time using total of O(nlogn) operations. => the algorithm is optimal. 263 // Algorithm Basic Techniques 11.4 Partitioning 11.4 Partitioning

Partitioning strategy:

1. breaking up the given problem into p independent subproblems of almost equal sizes

2. solving p problems concurrently

• On contrary to divide-and-concuer strategy – the main task is partitioning, not combining the solution 264 // Algorithm Basic Techniques 11.4 Partitioning Merging linearly ordered sets

Given set S and partial order relation ≤ • Set S is linarly (or totally) ordered if for every pair a,b ∈ S either a ≤ b or b ≤ a

Consider 2 nondecreasing sequences A = (a1,a2,...,an) and B = (b1,b2,...,bn) with elements from linearly ordered set S

• Task: merge the sequnces A and B into sorted sequence C = (c1,c2,...,c2n)

Let X = (x1,x2,...,xt) set with elements from set S. • Rank of element x ∈ S in set X, denoted with rank(x : X), is the number of set X elements less or equal to x.

Let Y = (y1,y2,...,ys) be arbitrary array of elements in set S.

• Ranking of set Y in set X is an array rank(Y : X) = (r_1, r_2, ..., r_s), where r_i = rank(y_i : X). Example: X = (26, −12, 27, 30, 53, 7); Y = (12, 28, −28). Then rank(Y : X) = (2, 4, 0).
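A tiny Fortran sketch of this definition (X need not be sorted here; the intrinsic COUNT does the work; the function name is an assumption):

INTEGER FUNCTION rank_in(y, n, X)
  INTEGER, INTENT(IN) :: n
  REAL, INTENT(IN)    :: y, X(n)
  rank_in = COUNT(X <= y)      ! number of elements of X that are <= y
END FUNCTION rank_in

! e.g. rank_in(12.0, 6, (/26.,-12.,27.,30.,53.,7./)) gives 2, as in the example above.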

For simplicity: all elements of both sorted sequences A and B to merge are dis- tinct => merging problem solvable finding for each element x in A and B its rank in set A ∪ B.

• If rank(x : A ∪ B) = i, then ci = x.

• As rank(x : A∪B) = rank(x : A)+rank(x : B), then the merging problem can be solved finding two integer arrays rank(A : B) and rank(B : A).

So, consider the problem rank(B : A) Let bi – an arbitrary element of B As A– sorted, finding rank(bi,A) can be performed with binary search. (Compare bi with middle element and go on with lower or upper part etc. until we have found a j(i) < bi < a j(i)+1 and rank(bi,A) = j(i)). Sequential time is O(logn) => perform in parallel with each element of B. Work - O(nlogn) , which is not optimal since there exists a linear sequential algorithm 266 // Algorithm Basic Techniques 11.4 Partitioning Optimal merging algorithm

Split sequences A and B into logn blocks => ca n/logn elements into each approxi- mately equal partition 267 // Algorithm Basic Techniques 11.4 Partitioning

Algorithm 2.7 (Partitioning) Input: Two arrays A = (a1,...,an) and B = (b1,...,bm) with elements in increasing order with logm and k(m) = m/logm being integers Output: k(m) pairs (Ai,Bi) of sequences A and B subsequences such that: (1) |Bi| = logm; (2) ∑i |Ai| = n and (3) each subsequence Ai and Bi element larger than all elements in subsequences Ai−1 and Bi−1 ∀ 1 ≤ i ≤ k(m) − 1. begin 1. j(0) := 0; j(k(m)) := n 2. for 1 ≤ i ≤ k(m) − 1 pardo 2.1 rank bilogm in sequence A using binary search and j(i) := rank(bilogm : A) 3. for 0 ≤ i ≤ k(m) − 1 pardo 3.1 Bi:=(bilogm+1,...,b(i+1)logm) 3.2 Ai:=(a j(i)+1,...,a j(i+1)) (Ai is empty if j(i) = j(i + 1)) 268 // Algorithm Basic Techniques 11.4 Partitioning

end

Example: A = (4,6,7,10,12,15,18,20) B = (3,9,16,21)

Here m = 4, k(m) = 2. As rank(9 : A) = 3 => A0 = (4,6,7), B0 = (3,9) A1 = (10,12,15,18,20), B1 = (16,21) 269 // Algorithm Basic Techniques 11.4 Partitioning

Lemma 2.1: Let C be a sorted sequence obtained with merging two sorted se- quences A and B (with lengths n and m). Then Algorithm 2.7 partitions sequences A and B into pairs (Ai,Bi) such that |Bi| = O(logm), ∑i |Ai| = n and C = (C0,C1,...), where Ci is sorted sequence obtained through merging subsequences Ai and Bi. In addition, the algorithm works with O(logn) time using O(n + m) operations. Proof. Ensure that each element from subsequences Ai and Bi larger than each element in Ai−1 and Bi−1. Step 1: O(1) time Step 2: O(logn) time (binary search for each one separately) Operation count: O((logn) × (m/logm)) = O(m + n), as (mlogn/logm) < (mlog(n + m)/logm) ≤ n + m, if n,m ≥ 4. Step 3: Time: O(1), linear number of opeartions 270 // Algorithm Basic Techniques 11.4 Partitioning

Theorem 2.4 Let A and B be two sorted sequences, each of length n. Merging sequences A and B can be performed in O(logn) time with O(n) operations. Proof. 1. Using algorithm 2.7 divide A and B into pairs of subsequences (Ai,Bi) (Time:O(logn), Work O(n)) 2. Could use a sequential algorithm for merging (Ai,Bi) but Ti = O(|Bi| + |Ai|) = O(logn+?) Therefore, if |Ai| > O(logn) – 2a) split Ai into pieces of length logn and apply algorithm 2.7, with B replaced with Ai and A replaced with Bi – O(loglogn) time using O(|Ai|) operations. |Ai| ≤ O(logn) 2b) use sequential algorithm for merging subsequences 271 // Algorithm Basic Techniques 11.5 Pipelining 11.5 Pipelining

In pipelining, process is divided into subprocesses in such a way that as soon the previous subprocess has finished an arbitrary algorithm phase, it can be followed by the next subprocess

2-3 Trees

2-3 tree is:

• rooted tree with each node having 2 or 3 children

• path length from root to all leaves is the same

Properties:

• number of leaves in 2-3 tree is between on 2h and 3h (h - tree height) => height of n-leave trees is Θ(logn)

• being used in sorted lists (element search, adding, subtracting an element) 272 // Algorithm Basic Techniques 11.5 Pipelining

Sorted list A = (a1,a2,...,an), where a1 < a2 < ··· < an as 2-3 tree leaves – elements ai with direction from left to right Inner nodes: L(ν) : M(ν) : R(ν) L(ν) - largest element of leften subtree M(ν) - middle subtree largest element R(ν) - largest element of right subtree (if it exists) Example 273 // Algorithm Basic Techniques 11.5 Pipelining

• 2-3-tree corresponding to a sorted list A is not defined uniformly

• After elemental operations (adding, deleting a leave) the structure must remain the same

Searching:

With given b find the leaf of T for which ai ≤ b < ai+1. Work is proportional to the tree height (O(logn)) 274 // Algorithm Basic Techniques 11.5 Pipelining

Adding a leaf:

First find the place for addition with previous algorithm

• (If b = ai – nothing needs to be done)

• add the leaf and repair the tree 275 // Algorithm Basic Techniques 11.5 Pipelining

• Add root on new level, if needed 276 // Algorithm Basic Techniques 11.5 Pipelining

Adding k sorted elements b1 < b2 < ··· < bk to 2-3 tree Sequential algorithm: Time O(klogn), Work: O(klogn)

• With sequential algorithm add b1 and bk to the tree (Time: O(logn))

• In parallel, search places for each element bi

• A. Simple case – between each pair ai and ai+1 at most 1 element b j:

◦ adding nodes in parallel (for repairing the tree) moving from bottom to top (never more than 6 subnodes!) ◦ Time: O(logn), ◦ Work: O(klogn).

• B. More difficult case – if for example between ai and ai+1 a sequence bl,bl+1,...,b f , repair the tree with middle element, then subsequence middle elements etc. in parallel

• Total time: O(logklogn) 277 // Algorithm Basic Techniques 11.5 Pipelining

Pipelining can be used – operations in parallel in waves (each level has exactly one repairment of the tree)!

• Time: O(logk + logn) = O(logn), Work: O(klogn) 278 // Algorithm Basic Techniques 11.6 Accelerated Cascading 11.6 Accelerated Cascading

Consider:

• fast but nonoptimal algorithm and

• slower but optimal algorithm

Question: Could these be combned into a fast and optimal algorithm? 279 // Algorithm Basic Techniques 11.6 Accelerated Cascading

11.6.1 Constant time maximum algorithm

Algorithm 2.8 (maximum in O(1) time)
Input: array A, with all p elements distinct
Output: array M of logical values such that M(i) = 1 exactly iff A(i) is the maximal element of A
begin
  1. for 1 ≤ i, j ≤ p pardo
        if (A(i) ≥ A(j)) then B(i, j) := 1 else B(i, j) := 0
  2. for 1 ≤ i ≤ p pardo
        M(i) := B(i,1) ∧ B(i,2) ∧ ··· ∧ B(i,p)
end

CRCW PRAM model, O(1) time, O(p²) operations
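A plain Fortran sketch of the same comparisons (both loop nests are the fully parallel 'pardo' steps of the CRCW PRAM algorithm, written here as ordinary loops; the subroutine name is an assumption):

SUBROUTINE max_flags(p, A, M)
  INTEGER, INTENT(IN)  :: p
  REAL, INTENT(IN)     :: A(p)
  LOGICAL, INTENT(OUT) :: M(p)
  LOGICAL :: B(p, p)
  INTEGER :: i, j
  DO i = 1, p                   ! step 1: all pairwise comparisons
     DO j = 1, p
        B(i, j) = (A(i) >= A(j))
     END DO
  END DO
  DO i = 1, p                   ! step 2: M(i) true iff A(i) >= every A(j)
     M(i) = ALL(B(i, :))
  END DO
END SUBROUTINE max_flags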

11.6.2 Doubly logarithmic-time algorithm

Assume $n = 2^{2^k}$ for some integer k.

Doubly logarithmic-depth n-leaf tree (DLT): the root has $2^{2^{k-1}}$ (= √n) children, each of which has $2^{2^{k-2}}$ children and, in general, each node on the i-th level has $2^{2^{k-i-1}}$ children, 0 ≤ i ≤ k − 1. Each node on level k has 2 children.

Example: a 16-leaf doubly logarithmic-depth tree (k = 2):

• by induction, it is easy to see that the number of nodes on the i-th level is $2^{2^k - 2^{k-i}}$,

• number of nodes on level k: $2^{2^k - 2^0} = 2^{2^k - 1} = n/2$

• depth of the tree: k + 1 = loglog n + 1

• the DLT can be used for finding the maximum of n elements: each node gets the maximum of its subtree, starting from the bottom, with O(1) time per level using Algorithm 2.8

• => T(n) = O(loglog n)

• W(n) = ? On the i-th level the number of operations is $O\big((2^{2^{k-i-1}})^2\big)$ per node, giving $O\big((2^{2^{k-i-1}})^2 \cdot 2^{2^k - 2^{k-i}}\big) = O(2^{2^k}) = O(n)$ operations on level i.

• => W(n) = O(nloglogn)

• => algorithm is not optimal! 282 // Algorithm Basic Techniques 11.6 Accelerated Cascading

11.6.3 Making fast algorithm optimal

logarithmic depth binary tree – optimal, time O(logn) Accelerated cascading

1. Starting with optimal algorithm until size of the problem smaller than a certain level

2. Switching over to fast but nonoptimal algorithm

Problem of finding maximum:

1. start with binary tree algorithm from leaves dlogloglogne levels up . as number of nodes reduces by factor of two each level, we have maximum within n0 = O(n/loglogn) nodes . Number of operations so far: O(n), Time: O(logloglogn).

2. DLT on remaining n0 elements Time: O(loglogn0) = O(loglogn) W(n) = O(n0 loglogn0) = O(n) operations.

=> hybrid algorithm optimal and works with O(loglogn) time. 283 // Algorithm Basic Techniques 11.7 Symmetry breaking 11.7 Symmetry breaking

Let G = (V,E) directed cycle (inbound and outbound rank=1, ∀u,v ∈ V ∃ directed path from node u to v.

Graph k-coloring is a mapping: c :V → 0,1,...,k − 1 such that c(i) 6= c( j) if hi, ji ∈ E. 284 // Algorithm Basic Techniques 11.7 Symmetry breaking

Consider the 3-coloring problem. A sequential algorithm is simple – alternate colors 0 and 1 along the cycle and, at the end, pick a color from {1, 2} for the last node if needed.
Let the edges be given by an array S(1:n) (there is an edge ⟨i, S(i)⟩ ∈ E); the previous edge: P(S(i)) = i.
Suppose there is a coloring c(1:n) (one can always be found by taking, for example, c(i) = i).
Consider the binary representation i = i_{t−1} ··· i_k ··· i_1 i_0; the k-th least significant bit is i_k.

Algorithm 2.9 (Basic coloring)
Input: directed cycle with edges given by array S(1:n) and a coloring c(1:n) of the nodes
Output: new coloring c'(1:n)
begin
  for 1 ≤ i ≤ n pardo
     1. find k – the least significant bit position where c(i) and c(S(i)) differ
     2. c'(i) := 2k + c(i)_k
end
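A Fortran sketch of one application of the algorithm (subroutine name is an assumption): TRAILZ (Fortran 2008) gives the position of the lowest set bit of the XOR, i.e. the least significant bit where the two colors differ, and IBITS extracts that bit of c(i). The example table below can be used to check the result:

SUBROUTINE basic_coloring(n, S, c, cnew)
  INTEGER, INTENT(IN)  :: n, S(n), c(n)
  INTEGER, INTENT(OUT) :: cnew(n)
  INTEGER :: i, k
  DO i = 1, n                               ! pardo in the PRAM formulation
     k = TRAILZ(IEOR(c(i), c(S(i))))        ! lowest bit where c(i) and c(S(i)) differ
     cnew(i) = 2*k + IBITS(c(i), k, 1)      ! 2k + (k-th bit of c(i))
  END DO
END SUBROUTINE basic_coloring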

 v    c     k   c'
 1   0001   1    2
 3   0011   2    4
 7   0111   0    1
14   1110   2    5
 2   0010   0    0
15   1111   0    1
 4   0100   0    0
 5   0101   0    1
 6   0110   1    3
 8   1000   1    2
10   1010   0    0
11   1011   0    1
12   1100   0    0
 9   1001   2    4
13   1101   2    5

Lemma 2.3: The output c′(1:n) is a valid coloring whenever the input c(1:n) is a valid coloring. T(n) = O(1), W(n) = O(n).
Proof:

• k can always be found if c is a valid coloring;

• suppose c′(i) = c′(j) for some edge i → j; then c′(i) = 2k + c(i)_k and c′(j) = 2l + c(j)_l, and from c′(i) = c′(j) it follows that k = l and c(i)_k = c(j)_k, which contradicts the definition of k.
=> c′(i) ≠ c′(j) whenever there is an edge i → j.

11.7.1 Superfast 3-coloring algorithm

Let t > 3 be the number of bits in c. Then c′(1:n) can be described with ⌈log t⌉ + 1 bits.
=> If the number of colors in c(1:n) is q, where 2^(t−1) < q ≤ 2^t, then the coloring c′ uses at most 2^(⌈log t⌉+1) = O(t) = O(log q) colors.
=> The number of colors decreases exponentially. A reduction takes place whenever t > ⌈log t⌉ + 1, i.e. t > 3.
Applying the basic coloring algorithm once more with t = 3 gives at most 6 colors, since c′(i) = 2k + c(i)_k and 0 ≤ k ≤ 2 imply 0 ≤ c′(i) ≤ 5.

Number of iterations?

Denote log^(i) x: log^(1) x = log x, log^(i) x = log(log^(i−1) x). Let log* x = min{ i | log^(i) x ≤ 1 }.
The function log* x grows extremely slowly, being < 5 for every x ≤ 2^65536!
Theorem 2.6: 3-coloring of a directed cycle can be performed in O(log* n) time with O(n log* n) operations.
Proof: Applying the basic coloring algorithm, we reach at most 6 colors in O(log* n) iterations. The colors 3, 4 and 5 are then removed in 3 further iterations: in each of them, all nodes of one such color are recolored in parallel with the smallest color from {0, 1, 2} that differs from both neighbours. This is correct because no two neighbouring nodes are recolored at the same time.
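A tiny illustration of the iterated logarithm just defined (an added sketch; the function name log_star is arbitrary):

program log_star_demo
  implicit none
  print *, log_star(16.0d0)      ! 16 -> 4 -> 2 -> 1          : log* 16    = 3
  print *, log_star(65536.0d0)   ! 65536 -> 16 -> 4 -> 2 -> 1 : log* 65536 = 4
contains
  function log_star(x) result(i)
    real(8), intent(in) :: x
    integer :: i
    real(8) :: y
    y = x
    i = 0
    do while (y > 1.0d0)
       y = log(y) / log(2.0d0)   ! binary logarithm
       i = i + 1
    end do
  end function log_star
end program log_star_demo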

11.7.2 Optimal 3-coloring algorithm

Remark (sorting small integers): given n integers in the interval [0, O(log n)], they can be sorted in O(log n) time using a linear number of operations (achievable through a combination of radix sort and prefix-sum algorithms).

Algorithm 2.10 (3-coloring of a cycle)
Input: Directed cycle of length n with edges defined by array S(1:n)
Output: 3-coloring of the cycle nodes
begin
  1. for 1 ≤ i ≤ n pardo
       C(i) := i
  2. Apply Algorithm 2.9 once
  3. Sort the nodes by their colors
  4. for i = 3 to 2⌈log n⌉ do
       for each node of color i pardo
         recolor it with the smallest color from {0, 1, 2} that differs from its neighbours
end

12 Parallel Algorithm Scalability

12.1 Isoefficiency Metric of Scalability

• For a given problem size, as we increase the number of processing elements, the overall efficiency of the parallel system goes down for a given algorithm.

• For some systems, the efficiency of a parallel system increases if the problem size is increased while keeping the number of processing elements constant.

(Chapter 12 is based on [1].)

Variation of efficiency: (a) as the number of processing elements is increased for a given problem size; and (b) as the problem size is increased for a given number of processing elements. The phenomenon illustrated in graph (b) is not common to all parallel systems

• What is the rate at which the problem size must increase with respect to the number of processing elements to keep the efficiency fixed?

• This rate determines the scalability of the system. The slower this rate, the better

• Before we formalize this rate, we define the problem size W as the asymptotic number of operations associated with the best serial algorithm to solve the problem

• We can write parallel runtime as:

T_P = (W + T_o(W, p)) / p

• The resulting expression for speedup is:

S = W / T_P = W p / (W + T_o(W, p))

• Finally, we write the expression for efficiency as

E = S / p = W / (W + T_o(W, p)) = 1 / (1 + T_o(W, p)/W)

• For scalable parallel systems, efficiency can be maintained at a fixed value (between 0 and 1) if the ratio T_o/W is maintained at a constant value.

• For a desired value E of efficiency,

E = 1 / (1 + T_o(W, p)/W)
T_o(W, p) / W = (1 − E) / E
W = (E / (1 − E)) · T_o(W, p)

• If K = E/(1 − E) is a constant depending on the efficiency to be maintained, then, since T_o is a function of W and p, we have

W = K T_o(W, p)

• The problem size W can usually be obtained as a function of p by algebraic manipulations to keep efficiency constant.

• This function is called the isoefficiency function.

• This function determines the ease with which a parallel system can maintain a constant efficiency and hence achieve speedups increasing in proportion to the number of processing elements.

Example: Adding n numbers on p processors

• The overhead function for the problem of adding n numbers on p processing elements is approximately 2plog p:

◦ The first stage of the algorithm (summing n/p numbers locally) takes n/p time.
◦ The second phase involves log p steps, each with one communication and one addition; assuming a single communication takes unit time, this phase takes 2 log p time.

T_P = n/p + 2 log p
S = n / (n/p + 2 log p)
E = 1 / (1 + (2p log p)/n)

• Proceeding as before, but substituting T_o by 2p log p, we get

W = K · 2p log p

• Thus, the asymptotic isoefficiency function for this parallel system is Θ(p log p)

• If the number of processing elements is increased from p to p′, the problem size (in this case, n) must be increased by a factor of (p′ log p′)/(p log p) to get the same efficiency as on p processing elements
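As a sanity check of the efficiency expression above, the following small sketch (an added example that uses the same unit-time cost model) simply evaluates E = 1/(1 + 2p log p / n) for the values tabulated below:

program iso_efficiency_table
  implicit none
  integer, parameter :: ns(4) = [64, 192, 320, 512]
  integer, parameter :: ps(4) = [4, 8, 16, 32]
  integer  :: i, j
  real(8)  :: e
  do i = 1, size(ns)
     do j = 1, size(ps)
        e = 1.0d0 / (1.0d0 + 2.0d0*ps(j)*log2(real(ps(j),8)) / ns(i))
        write (*,'(a,i4,a,i3,a,f5.2)') 'n =', ns(i), '   p =', ps(j), '   E =', e
     end do
  end do
contains
  pure function log2(x) result(y)
    real(8), intent(in) :: x
    real(8) :: y
    y = log(x) / log(2.0d0)
  end function log2
end program iso_efficiency_table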

Efficiencies for different n and p:

n p = 1 p = 4 p = 8 p = 16 p = 32

64 1.0 0.80 0.57 0.33 0.17

192 1.0 0.92 0.80 0.60 0.38

320 1.0 0.95 0.87 0.71 0.50

512 1.0 0.97 0.91 0.80 0.62

Example 2: Let T_o = p^(3/2) + p^(3/4) W^(3/4)

• Using only the first term of T_o in W = K T_o(W, p), we get

W = K p^(3/2)

• Using only the second term yields the following relation between W and p:

W = K p^(3/4) W^(3/4)
W^(1/4) = K p^(3/4)
W = K^4 p^3

• The larger of these two asymptotic rates determines the isoefficiency. Here it is given by Θ(p³).

12.2 Cost-Optimality and the Isoefficiency Function

• A parallel system is cost-optimal if and only if

p T_P = Θ(W)

• From this, we have:

W + T_o(W, p) = Θ(W)
T_o(W, p) = O(W)
W = Ω(T_o(W, p))

=> a parallel system is cost-optimal if and only if its overhead function does not asymptotically exceed the problem size

• If we have an isoefficiency function f(p), then it follows that the relation W = Ω(f(p)) must be satisfied to ensure the cost-optimality of a parallel system as it is scaled up.

Lower Bound on the Isoefficiency Function

• For a problem consisting of W units of work, no more than W processing elements can be used cost-optimally.

• The problem size must increase at least as fast as Θ(p) to maintain fixed efficiency; hence, Ω(p) is the asymptotic lower bound on the isoefficiency function.

Degree of Concurrency and the Isoefficiency Function

• The maximum number of tasks that can be executed simultaneously at any time in a parallel algorithm is called its degree of concurrency.

• If C(W) is the degree of concurrency of a parallel algorithm, then for a problem of size W, no more than C(W) processing elements can be employed effectively.

Degree of Concurrency and the Isoefficiency Function: Example

• Consider solving a system of n equations in n variables by using Gaussian elimination (W = Θ(n³))

• The n variables must be eliminated one after the other, and eliminating each variable requires Θ(n²) computations.

• At most Θ(n²) processing elements can be kept busy at any time.

• Since W = Θ(n³) for this problem, the degree of concurrency C(W) is Θ(W^(2/3))

• Given p processing elements, the problem size should be at least Ω(p^(3/2)) to use them all.
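Spelling out this last step (a short derivation added for completeness): keeping all p processing elements busy requires C(W) = Θ(W^(2/3)) ≥ p, which gives W = Ω(p^(3/2)).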

Scalability of Parallel Graph Algorithms

12.1 Definitions and Representation

12.2 Minimum Spanning Tree: Prim's Algorithm

12.3 Single-Source Shortest Paths: Dijkstra’s Algorithm

12.4 All-Pairs Shortest Paths

12.5 Connected Components

Algorithms for Sparse Graphs

12.1 Definitions and Representation

• An undirected graph G is a pair (V,E), where V is a finite set of points called vertices and E is a finite set of edges

• An edge e ∈ E is an unordered pair (u,v), where u,v ∈ V

• In a directed graph, the edge e is an ordered pair (u,v). An edge (u,v) is incident from vertex u and is incident to vertex v

• A path from a vertex v to a vertex u is a sequence ⟨v_0, v_1, v_2, ..., v_k⟩ of vertices where v_0 = v, v_k = u, and (v_i, v_(i+1)) ∈ E for i = 0, 1, ..., k − 1

• The length of a path is defined as the number of edges in the path

• An undirected graph is connected if every pair of vertices is connected by a path

• A forest is an acyclic graph, and a tree is a connected acyclic graph

• A graph that has weights associated with each edge is called a weighted graph

• Graphs can be represented by their adjacency matrix or an edge (or vertex) list

• Adjacency matrices have a value a_(i,j) = 1 if nodes i and j share an edge, and 0 otherwise. In the case of a weighted graph, a_(i,j) = w_(i,j), the weight of the edge

• The adjacency list representation of a graph G = (V,E) consists of an array Adj[1..|V|] of lists. Each list Adj[v] is a list of all vertices adjacent to v

• For a graph with n nodes, adjacency matrices take Θ(n²) space and adjacency lists take Θ(|E|) space.

• A spanning tree of an undirected graph G is a subgraph of G that is a tree containing all the vertices of G

• In a weighted graph, the weight of a subgraph is the sum of the weights of the edges in the subgraph

• A minimum spanning tree (MST) for a weighted undirected graph is a spanning tree with minimum weight

12.2 Minimum Spanning Tree: Prim's Algorithm

• Prim’s algorithm for finding an MST is a greedy algorithm

• Start by selecting an arbitrary vertex, include it into the current MST

• Grow the current MST by inserting into it the vertex closest to one of the vertices already in the current MST

procedure PRIM_MST(V, E, w, r)
begin
   V_T := {r};
   d[r] := 0;
   for all v ∈ (V − V_T) do
      if edge (r, v) exists then d[v] := w(r, v) else d[v] := ∞;
   while V_T ≠ V do
   begin
      find a vertex u such that d[u] := min{ d[v] | v ∈ (V − V_T) };
      V_T := V_T ∪ {u};
      for all v ∈ (V − V_T) do
         d[v] := min{ d[v], w(u, v) };
   endwhile
end PRIM_MST

Parallel formulation

• The algorithm works in n outer iterations – it is hard to execute these iterations concurrently.

• The inner loop is relatively easy to parallelize. Let p be the number of processes, and let n be the number of vertices.

• The adjacency matrix is partitioned in a 1-D block fashion, with the distance vector d partitioned accordingly.

• In each step, a processor selects the locally closest node, followed by a global reduction to select the globally closest node.

• This node is inserted into the MST, and the choice is broadcast to all processors.

• Each processor updates its part of the d vector locally.
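A minimal MPI sketch of the selection step just described – only the local minimum and the global reduction; the names nloc, d_local and first_global are illustrative assumptions, not part of a complete Prim implementation:

program prim_minloc_step
  use mpi
  implicit none
  integer, parameter :: nloc = 4                  ! local block size (assumed)
  real(8)  :: d_local(nloc), inbuf(2), outbuf(2)
  integer  :: rank, ierr, iloc(1), first_global

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  call random_number(d_local)                     ! stand-in for the current d values
  first_global = rank*nloc + 1                    ! first global vertex index of this block

  iloc = minloc(d_local)                          ! locally closest vertex
  inbuf(1) = d_local(iloc(1))                     ! value part of the (value, index) pair
  inbuf(2) = real(first_global + iloc(1) - 1, 8)  ! global index part

  ! global reduction: the smallest distance wins and its index travels along
  call MPI_Allreduce(inbuf, outbuf, 1, MPI_2DOUBLE_PRECISION, &
                     MPI_MINLOC, MPI_COMM_WORLD, ierr)

  if (rank == 0) print *, 'closest vertex:', nint(outbuf(2)), '  d =', outbuf(1)
  call MPI_Finalize(ierr)
end program prim_minloc_step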

• The cost to select the minimum entry is O(n/p + log p).

• The cost of a broadcast is O(log p).

• The cost of the local update of the d vector is O(n/p).

• The parallel time per iteration is O(n/p + log p).

• The total parallel time is O(n²/p + n log p):

T_P = Θ(n²/p) (computation) + Θ(n log p) (communication)

• Since the sequential run time is W = Θ(n²):

S = Θ(n²) / (Θ(n²/p) + Θ(n log p))
E = 1 / (1 + Θ((p log p)/n))

• => for a cost-optimal parallel formulation, (p log p)/n = O(1)

• => Prim's algorithm can use only p = O(n/log n) processes in this formulation

• The corresponding isoefficiency is O(p² log² p).

12.3 Single-Source Shortest Paths: Dijkstra's Algorithm

• For a weighted graph G = (V,E,w), the single-source shortest paths problem is to find the shortest paths from a vertex v ∈ V to all other vertices in V

• Dijkstra’s algorithm is similar to Prim’s algorithm. It maintains a set of nodes for which the shortest paths are known

• It grows this set by adding the node closest to the source via one of the nodes already in the current shortest-path set

procedure DIJKSTRA_SINGLE_SOURCE_SP(V, E, w, s)
begin
   V_T := {s};
   for all v ∈ (V − V_T) do
      if edge (s, v) exists then l[v] := w(s, v) else l[v] := ∞;
   while V_T ≠ V do
   begin
      find a vertex u such that l[u] := min{ l[v] | v ∈ (V − V_T) };
      V_T := V_T ∪ {u};
      for all v ∈ (V − V_T) do
         l[v] := min{ l[v], l[u] + w(u, v) };
   endwhile
end DIJKSTRA_SINGLE_SOURCE_SP

The run time of Dijkstra's algorithm is Θ(n²).

Parallel formulation

• Very similar to the parallel formulation of Prim's algorithm for minimum spanning trees

• The weighted adjacency matrix is partitioned using the 1-D block mapping

• Each process selects, locally, the node closest to the source, followed by a global reduction to select next node

• The node is broadcast to all processors and the l-vector updated.

• The parallel performance of Dijkstra’s algorithm is identical to that of Prim’s algorithm

12.4 All-Pairs Shortest Paths

• Given a weighted graph G = (V, E, w), the all-pairs shortest paths problem is to find the shortest paths between all pairs of vertices v_i, v_j ∈ V.

• A number of algorithms are known for solving this problem.

Matrix-Multiplication Based Algorithm

• Consider the multiplication of the weighted adjacency matrix with itself - except, in this case, we replace the multiplication operation in matrix multiplication by addition, and the addition operation by minimization

• Notice that the product of the weighted adjacency matrix with itself returns a matrix that contains shortest paths of length at most two (edges) between any pair of nodes

• It follows from this argument that A^n contains all shortest paths

• A^n is computed by repeated doubling of powers, i.e., as A, A², A⁴, A⁸, ...
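A small serial sketch of one such "multiplication" (the min-plus product; the example graph and the finite INF value standing in for ∞ are illustrative assumptions):

! One min-plus product C = A (min,+) A: + replaced by min, * replaced by +.
! With a zero diagonal, C holds shortest paths that use at most two edges.
program min_plus_square
  implicit none
  integer, parameter :: n = 4
  real(8), parameter :: inf = 1.0d6        ! "infinity" for missing edges (assumed convention)
  real(8) :: a(n,n), c(n,n)
  integer :: i, j, k

  a = inf
  do i = 1, n
     a(i,i) = 0.0d0                        ! zero diagonal
  end do
  a(1,2) = 1.0d0; a(2,3) = 2.0d0; a(3,4) = 1.0d0; a(1,3) = 5.0d0

  do j = 1, n
     do i = 1, n
        c(i,j) = a(i,j)
        do k = 1, n
           c(i,j) = min(c(i,j), a(i,k) + a(k,j))   ! "multiplication" step
        end do
     end do
  end do
  print '(4f10.1)', (c(i,:), i = 1, n)     ! e.g. c(1,3) drops from 5.0 to 3.0
end program min_plus_square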

• We need log n matrix multiplications, each taking time O(n³).

• The serial complexity of this procedure is O(n³ log n).

• This algorithm is not optimal, since the best known algorithms have complexity O(n³).

Parallel formulation

• Each of the log n matrix multiplications can be performed in parallel.

• We can use n³/log n processors to compute each matrix-matrix product in time log n.

• The entire process takes O(log² n) time.

Dijkstra's Algorithm

• Execute n instances of the single-source shortest path problem, one for each of the n source vertices.

• Complexity is O(n³).

Parallel formulation

Two parallelization strategies - execute each of the n shortest path problems on a different processor (source partitioned), or use a parallel formulation of the shortest path problem to increase concurrency (source parallel).

Dijkstra’s Algorithm: Source Partitioned Formulation

• Use n processors, each processor Pi finds the shortest paths from vertex vi to all other vertices by executing Dijkstra’s sequential single-source shortest paths algorithm.

• It requires no interprocess communication (provided that the adjacency matrix is replicated at all processes).

• The parallel run time of this formulation is Θ(n²).

• While the algorithm is cost-optimal, it can only use n processors. Therefore, the isoefficiency due to concurrency is Θ(p³).

Dijkstra’s Algorithm: Source Parallel Formulation

• In this case, each of the shortest path problems is further executed in parallel. We can therefore use up to n² processors.

• Given p processors (p > n), each single-source shortest path problem is executed by p/n processors.

• Using previous results, this takes time:

T_P = Θ(n³/p) (computation) + Θ(n log p) (communication)

• For cost optimality, we have p = O(n²/log n) and the isoefficiency is Θ((p log p)^1.5).

Floyd’s Algorithm

• Let G = (V,E,w) be the weighted graph with vertices V = {v1,v2,...,vn}.

• For any pair of vertices v_i, v_j ∈ V, consider all paths from v_i to v_j whose intermediate vertices belong to the subset {v_1, v_2, ..., v_k} (k ≤ n). Let p_(i,j)^(k) (of weight d_(i,j)^(k)) be the minimum-weight path among them.

• If vertex v_k is not in the shortest path from v_i to v_j, then p_(i,j)^(k) is the same as p_(i,j)^(k−1).

• If v_k is in p_(i,j)^(k), then we can break p_(i,j)^(k) into two paths – one from v_i to v_k and one from v_k to v_j. Each of these paths uses vertices from {v_1, v_2, ..., v_(k−1)}.

From our observations, the following recurrence relation follows:

d_(i,j)^(k) = w(v_i, v_j)                                          if k = 0
d_(i,j)^(k) = min{ d_(i,j)^(k−1), d_(i,k)^(k−1) + d_(k,j)^(k−1) }  if k ≥ 1

This recurrence must be computed for each pair of nodes and for k = 1, ..., n. The serial complexity is O(n³).

procedure FLOYD_ALL_PAIRS_SP(A)
begin
   D^(0) := A;
   for k := 1 to n do
      for i := 1 to n do
         for j := 1 to n do
            d_(i,j)^(k) := min( d_(i,j)^(k−1), d_(i,k)^(k−1) + d_(k,j)^(k−1) );
end FLOYD_ALL_PAIRS_SP

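For contrast with the distributed formulation that follows, here is a minimal shared-memory OpenMP sketch (an added alternative, not the 2-D block mapping): for a fixed k all (i, j) updates are independent, and – assuming no negative-weight cycles – row and column k do not change during iteration k, so the matrix can be updated in place:

subroutine floyd_omp(n, d)
  implicit none
  integer, intent(in)    :: n
  real(8), intent(inout) :: d(n,n)   ! in: weighted adjacency matrix, out: all-pairs distances
  integer :: i, j, k
  do k = 1, n                        ! the k loop must stay sequential
     !$omp parallel do private(i, j)
     do j = 1, n
        if (j == k) cycle            ! column k is unchanged in iteration k
        do i = 1, n
           if (i == k) cycle         ! row k is unchanged as well
           d(i,j) = min(d(i,j), d(i,k) + d(k,j))
        end do
     end do
     !$omp end parallel do
  end do
end subroutine floyd_omp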

Parallel formulation: 2D Block Mapping

• Matrix D^(k) is divided into p blocks of size (n/√p) × (n/√p).

• Each processor updates its part of the matrix during each iteration.

• To compute d_(l,r)^(k), processor P_(i,j) must get d_(l,k)^(k−1) and d_(k,r)^(k−1).

• In general, during the k-th iteration, each of the √p processes containing part of the k-th row sends it to the √p − 1 processes in the same column.

• Similarly, each of the √p processes containing part of the k-th column sends it to the √p − 1 processes in the same row.

procedure FLOYD_2DBLOCK(D^(0))
begin
   for k := 1 to n do
   begin
      each process P_(i,j) that has a segment of the k-th row of D^(k−1) broadcasts it to the P_(∗,j) processes;
      each process P_(i,j) that has a segment of the k-th column of D^(k−1) broadcasts it to the P_(i,∗) processes;
      each process waits to receive the needed segments;
      each process P_(i,j) computes its part of the D^(k) matrix;
   end
end FLOYD_2DBLOCK


• During each iteration of the algorithm, the k-th row and k-th column of processors perform a one-to-all broadcast along their rows/columns.

• The size of this broadcast is n/√p elements, taking time Θ((n log p)/√p).

• The synchronization step takes time Θ(log p).

• The computation time is Θ(n²/p).

• The parallel run time of the 2-D block mapping formulation of Floyd’s algorithm is

T_P = Θ(n³/p) (computation) + Θ((n²/√p) log p) (communication)

• The above formulation can use O(n²/log² n) processors cost-optimally.

• The isoefficiency of this formulation is Θ(p^1.5 log³ p).

• This algorithm can be further improved by relaxing the strict synchronization after each iteration.

Speeding things up by pipelining

• The synchronization step in parallel Floyd’s algorithm can be removed without affecting the correctness of the algorithm.

• A process starts working on the k-th iteration as soon as it has computed the (k−1)-th iteration and has the relevant parts of the D^(k−1) matrix.

Communication protocol followed in the pipelined 2-D block mapping formulation of Floyd's algorithm: assume that process 4 at time t has just computed a segment of the k-th column of the D^(k−1) matrix. It sends the segment to processes 3 and 5. These processes receive the segment at time t + 1 (where the time unit is the time it takes a matrix segment to travel over the communication link between adjacent processes). Similarly, processes farther away from process 4 receive the segment later. Process 1 (at the boundary) does not forward the segment after receiving it.

• In each step, n/√p elements of the first row are sent from process P_(i,j) to P_(i+1,j).

• Similarly, elements of the first column are sent from process P_(i,j) to process P_(i,j+1).

• Each such step takes time Θ(n/√p).

• After Θ(√p) steps, process P_(√p,√p) gets the relevant elements of the first row and first column in time Θ(n).

• The values of successive rows and columns follow after time Θ(n²/p) in a pipelined mode.

• Process P_(√p,√p) finishes its share of the shortest path computation in time Θ(n³/p) + Θ(n).

• When process P_(√p,√p) has finished the (n − 1)-th iteration, it sends the relevant values of the n-th row and column to the other processes.

• The overall parallel run time of this formulation is

T_P = Θ(n³/p) (computation) + Θ(n) (communication)

• The pipelined formulation of Floyd's algorithm uses up to O(n²) processes efficiently.

• The corresponding isoefficiency is Θ(p^1.5).
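The isoefficiency can be read off directly (a short derivation added for completeness, using W = K T_o from Section 12.1):

T_o = p T_P − W = Θ(np),   W = Θ(n³) = K Θ(np)  =>  n² = Θ(Kp)  =>  W = Θ(p^(3/2)).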

All-Pairs Shortest Paths: Comparison

12.5 Connected Components

• The connected components of an undirected graph are the equivalence classes of vertices under the “is reachable from” relation

• A graph with three connected components: {1,2,3,4}, {5,6,7}, and {8,9}:

Depth-First Search (DFS) Based Algorithm

• Perform DFS on the graph to get a forest - each tree in the forest corresponds to a separate connected component

• Part (b) is a depth-first forest obtained from depth-first traversal of the graph in part (a). Each of these trees is a connected component of the graph in part (a):

Parallel Formulation

• Partition the graph across processors and run independent connected-component algorithms on each processor. At this point, we have p spanning forests.

• In the second step, spanning forests are merged pairwise until only one spanning forest remains.

Computing connected components in parallel:

The adjacency matrix of the graph G in (a) is partitioned into two parts (b).

Each process gets a subgraph of G ((c) and (e)).

Each process then computes the spanning forest of the subgraph ((d) and (f)).

Finally, the two spanning trees are merged to form the solution.

• To merge pairs of spanning forests efficiently, the algorithm uses disjoint sets of edges.

• We define the following operations on the disjoint sets:

• find(x)

◦ returns a pointer to the representative element of the set containing x . Each set has its own unique representative.

• union(x, y)

◦ unites the sets containing the elements x and y. The two sets are assumed to be disjoint prior to the operation.

• For merging forest A into forest B, for each edge (u,v) of A, a find operation is performed to determine if the vertices are in the same tree of B.

• If not, then the two trees (sets) of B containing u and v are united by a union operation.

• Otherwise, no union operation is necessary.

• Hence, merging A and B requires at most 2(n − 1) find operations and (n − 1) union operations.
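A compact sketch of the two set operations (an array-based union–find with path compression; the module and names are illustrative, and only the data structure is shown, not the merging loop itself):

! Disjoint sets over vertices 1..n: parent(i) == i marks a representative.
module disjoint_sets
  implicit none
  integer, allocatable :: parent(:)
contains
  subroutine init_sets(n)
    integer, intent(in) :: n
    integer :: i
    allocate(parent(n))
    parent = [(i, i = 1, n)]           ! every vertex is its own set
  end subroutine init_sets

  recursive function find(x) result(root)
    integer, intent(in) :: x
    integer :: root
    if (parent(x) == x) then
       root = x
    else
       root = find(parent(x))
       parent(x) = root                ! path compression
    end if
  end function find

  subroutine union(x, y)
    integer, intent(in) :: x, y
    integer :: rx, ry
    rx = find(x)
    ry = find(y)
    if (rx /= ry) parent(rx) = ry      ! attach one representative to the other
  end subroutine union
end module disjoint_sets

Merging forest A into forest B then amounts to: for each edge (u, v) of A, if find(u) ≠ find(v), call union(u, v); otherwise do nothing.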

Parallel 1-D Block Mapping

• The n × n adjacency matrix is partitioned into p blocks.

• Each processor can compute its local spanning forest in time Θ(n²/p).

• Merging is done by embedding a logical tree into the topology. There are log p merging stages, and each takes time Θ(n). Thus, the cost due to merging is Θ(n log p).

• During each merging stage, spanning forests are sent between nearest neighbors. Recall that Θ(n) edges of the spanning forest are transmitted.

• The parallel run time of the connected-component algorithm is

T_P = Θ(n²/p) (local computation) + Θ(n log p) (forest merging)

• For a cost-optimal formulation p = O(n/log n). The corresponding isoefficiency is Θ(p² log² p).
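Here too the isoefficiency follows from W = K T_o (a short derivation added for completeness):

T_o = p T_P − W = Θ(n p log p),   W = Θ(n²) = K Θ(n p log p)  =>  n = Θ(K p log p)  =>  W = Θ(p² log² p).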

Contents

1 Parallel Computing – motivation  11
  1.1 History of computing  12
  1.2 Expert's predictions  15
  1.3 Usage areas of a petaflop computer  17
  1.4 Example 1.1  21
  1.5 Example 1.2  24
  1.6 Example 1.3  27
  1.7 Example 1.4  28

2 Computer architectures and parallelism  29

I Parallel Architectures  31
  2.1 Processor development  31
  2.2 Memory performance effect  39
  2.3 Memory Bandwidth (MB) and performance  44
  2.4 Other approaches for hiding memory latency  48
  2.5 Classification of Parallel Computers  52
  2.6 Flynn's classification  58
  2.7 Type of memory access  66
  2.8 Communication model of parallel computers  68
  2.9 Other classifications  71
  2.10 Clos-networks  78
  2.11 Beowulf clusters  87

3 Parallel Performance  91
  3.1 Speedup  91
  3.2 Efficiency  92
  3.3 Amdahl's law  93
  3.4 Gustafson-Barsis' law  96
  3.5 Scaled efficiency  97
  3.6 Methods to increase efficiency  98
  3.7 Benchmarks  100

II Parallel Programming  105

4 Parallel Programming Models – an Overview  105
  4.1 Model of Tasks and Channels  106
  4.2 Message passing model  109
  4.3 Shared Memory model  110
  4.4 Implicit Parallelism  111
  4.1 Task parallel model  113
  4.2 Data parallel model  113

5 Shared memory programming model  115
  5.1 PRAM (Parallel Random Access Machine) model  116
  5.2 OpenMP  117
  5.3 OpenMP directives: PARALLEL  124
  5.4 OpenMP work-sharing directives: DO  127
  5.5 OpenMP work-sharing directives: SECTIONS  131
  5.6 OpenMP work-sharing directives: SINGLE  134
  5.7 Combining OpenMP work-sharing directives  135
  5.8 OpenMP: synchronization constructs  137
  5.9 OpenMP scoping directives  140
  5.10 OpenMP library routines  144
  5.11 OpenMP example programs  147

6 Fortran and HPF  148
  6.1 Fortran Evolution  148
  6.2 Concept  149
  6.3 PROCESSORS declaration  150
  6.4 DISTRIBUTE-directive  151
  6.5 Distribution of allocatable arrays  156
  6.6 HPF rule: Owner Calculates  157
  6.7 Scalar variables  158
  6.8 Examples of good DISTRIBUTE subdivision  159
  6.9 HPF programming methodology  161
  6.10 BLOCK(m) and CYCLIC(m)  164
  6.11 Array alignment  165
  6.12 Strided Alignment  169
  6.13 Example on Alignment  171
  6.14 Alignment with Allocatable Arrays  172
  6.15 Dimension collapsing  175
  6.16 Dimension replication  175
  6.17 HPF Intrinsic Functions  179
  6.18 HPF Template Syntax  180
  6.19 FORALL  184
  6.20 PURE-procedures  188
  6.21 INDEPENDENT  192
  6.22 INDEPENDENT NEW command  195
  6.23 EXTRINSIC  197

7 MPI  198

III Parallel Algorithms  199

8 Parallel Algorithm Design Principles  199
  8.1 Decomposition, Tasks and Dependency Graphs  200
  8.2 Decomposition Techniques  203
  8.3 Tasks and Interactions  205
  8.4 Mapping Techniques for Load balancing  207
  8.5 Parallel Algorithm Models  210

9 Solving systems of linear equations with sparse matrices  212
  9.1 EXAMPLE: 2D FEM for Poisson equation  212
  9.2 Solution methods for sparse matrix systems  229
  9.3 Preconditioning  231
  9.4 Domain Decomposition Method  233
  9.5 Domain Decomposition on Unstructured Grids  234

10 Parallel Algorithm Optimality Definitions  243
  10.1 Assessing Parallel Algorithms  244
  10.2 Definitions of Optimality  245

11 Basic Techniques in Parallel Algorithms  247
  11.1 Balanced Trees  247
  11.2 Pointer Jumping Method  250
  11.3 Divide and Conquer  255
  11.4 Partitioning  263
  11.5 Pipelining  271
  11.6 Accelerated Cascading  278
  11.7 Symmetry breaking  283

12 Parallel Algorithm Scalability  290
  12.1 Isoefficiency Metric of Scalability  290
  12.2 Cost-Optimality and the Isoefficiency Function  298
  12.1 Definitions and Representation  302
  12.2 Minimum Spanning Tree: Prim's Algorithm  305
  12.3 Single-Source Shortest Paths: Dijkstra's Algorithm  310
  12.4 All-Pairs Shortest Paths  312
  12.5 Connected Components  328