
Technische Universität München

Parallel Programming and High-Performance Computing

Part 1: Introduction

Dr. Ralf-Peter Mundani, CeSIM / IGSSE, Technische Universität München

1 Introduction – General Remarks

• materials: http://www5.in.tum.de/lehre/vorlesungen/parhpp/SS08/

• Ralf-Peter Mundani – email [email protected], phone 289–25057, room 3181 (city centre) – consultation hours: Tuesday, 4:00–6:00 pm (room 02.05.058)
• Ioan Lucian Muntean – email [email protected], phone 289–18692, room 02.05.059

• lecture (2 SWS) – weekly – Tuesday, start at 12:15 pm, room 02.07.023

• exercises (1 SWS) – fortnightly – Wednesday, start at 4:45 pm, room 02.07.023


• content
– part 1: introduction
– part 2: high-performance networks
– part 3: foundations
– part 4: programming memory-coupled systems
– part 5: programming message-coupled systems
– part 6: dynamic load balancing
– part 7: examples of parallel

1 Introduction – Overview

• motivation
• classification of parallel computers
• levels of parallelism
• quantitative performance evaluation

“I think there is a world market for maybe five computers.” – Thomas Watson, chairman of IBM, 1943

1 Introduction – Motivation

• numerical simulation: from physical phenomena to predictions (the steps below involve the physical phenomenon, the technical discipline, mathematics, and the application)
1. modelling – determination of parameters, expression of relations
2. numerical treatment – model discretisation, development
3. implementation – software development, parallelisation
4. visualisation – illustration of abstract simulation results
5. validation – comparison of results with reality
6. embedding – insertion into the working process


• why parallel programming and HPC?
– complex problems (especially the so-called “grand challenges”) demand more computing power
• climate or geophysics simulations (e.g. tsunami)
• structure or flow simulations (e.g. crash tests)
• development systems (e.g. CAD)
• analysis of large data sets (e.g. the Large Hadron Collider at CERN)
• military applications (e.g. cryptanalysis)
• …
– performance increase due to
• faster hardware, more memory (“work harder”)
• more efficient algorithms, optimisation (“work smarter”)
• parallel processing on several processors (“get some help”)


• objectives (in case all resources were available N times)
– throughput: compute N problems simultaneously
• running N instances of a sequential program with different data sets (“embarrassing parallelism”); SETI@home, e.g. (see the sketch below)
• drawback: limited resources of single nodes
– response time: compute one problem in a fraction (1/N) of the time
• running one instance (i.e. N processes) of a parallel program for jointly solving a problem; finding prime numbers, e.g.
• drawback: writing a parallel program; communication
– problem size: compute one problem with N-times larger data
• running one instance (i.e. N processes) of a parallel program, using the sum of all local memories for computing larger problem sizes; iterative solution of a system of linear equations (SLE), e.g.
• drawback: writing a parallel program; communication
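The throughput objective above amounts to running independent instances of one sequential job side by side. A minimal sketch with Python's multiprocessing module; the job function and the data sets are invented for illustration:

```python
from multiprocessing import Pool

def sequential_job(data_set):
    # placeholder for any sequential program working on one data set
    return sum(x * x for x in data_set)

if __name__ == "__main__":
    # N independent data sets -> N independent runs ("embarrassing parallelism")
    data_sets = [range(i, i + 1000) for i in range(8)]
    with Pool(processes=4) as pool:      # 4 worker processes handle 8 jobs
        results = pool.map(sequential_job, data_sets)
    print(results)
```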



1 Introduction – Classification of Parallel Computers

• definition: “A collection of processing elements that communicate and cooperate to solve large problems” (ALMASI and GOTTLIEB, 1989)
• possible appearances of such processing elements
– specialised units (steps of a vector pipeline, e.g.)
– parallel features in modern monoprocessors (superscalar architectures, VLIW, multithreading, multicore, …)
– several uniform arithmetical units (processing elements of array computers, e.g.)
– processors of a multiprocessor computer (i.e. the actual parallel computers)
– complete stand-alone computers connected via LAN (workstation or PC clusters, so-called virtual parallel computers)
– parallel computers or clusters connected via WAN (so-called metacomputers)


• reminder: dual core, quad core, manycore, and multicore
– observation: increasing frequency (and thus core voltage) over the past years
– problem: thermal power dissipation increases linearly with the frequency and with the square of the core voltage


• reminder: dual core, quad core, manycore, and multicore (cont’d)
– a 25% reduction in frequency (and thus core voltage) leads to a 50% reduction in dissipation
– [chart: dissipation and performance of a normal CPU vs. a frequency-reduced CPU]


• reminder: dual core, quad core, manycore, and multicore (cont’d)
– idea: installation of two cores per die with the same dissipation as the single-core system
– [chart: dissipation and performance of a single-core vs. a dual-core system]
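A rough numerical check of this idea, assuming the model stated above (dissipation proportional to frequency times voltage squared) and assuming the core voltage is lowered by the same 25% as the frequency; under these assumptions a reduced core needs roughly half the original power, so two such cores stay roughly within the original power budget:

```python
def relative_dissipation(freq_factor, volt_factor):
    # stated model: dissipation grows linearly in frequency, quadratically in voltage
    return freq_factor * volt_factor ** 2

single_reduced = relative_dissipation(0.75, 0.75)   # one core at 75% frequency/voltage
dual_reduced = 2 * single_reduced                    # two such cores on one die
print(f"reduced core: {single_reduced:.2f} of the original dissipation")
print(f"dual core:    {dual_reduced:.2f} of the original dissipation, ~1.5x the throughput")
```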


• commercial parallel computers
– manufacturers: starting from 1983, big players and small start-ups (see the table below; “out of business” here means no longer in the parallel computer business)
– names have been coming and going rapidly
– in addition: several manufacturers of vector computers and non-standard architectures

company    country  year  status in 2003
Sequent    U.S.     1984  acquired by IBM
Intel      U.S.     1984  out of business
Meiko      U.K.     1985  bankrupt
nCUBE      U.S.     1985  out of business
Parsytec   Germany  1985  out of business
Alliant    U.S.     1985  bankrupt


• commercial parallel computers (cont’d)

company                  country  year  status in 2003
Encore                   U.S.     1986  out of business
Floating Point Systems   U.S.     1986  acquired by SUN
Myrias                   Canada   1987  out of business
Ametek                   U.S.     1987  out of business
                         U.S.     1988  active
C-DAC                    India    1991  active
Kendall Square Research  U.S.     1992  bankrupt
IBM                      U.S.     1993  active
NEC                      Japan    1993  active
SUN Microsystems         U.S.     1993  active
Cray Research            U.S.     1993  active


• arrival of clusters
– in the late eighties, PCs became a commodity market with rapidly increasing performance, mass production, and decreasing prices
– growing attractiveness of such commodity parts for building parallel computers
– 1994: Beowulf, the first parallel computer built completely out of commodity hardware
• NASA Goddard Space Flight Center
• 16 Intel DX4 processors
• multiple 10 Mbit Ethernet links
• Linux with GNU compilers
• MPI library
– 1996: a Beowulf cluster performing more than 1 GFlops
– 1997: a 140-node cluster performing more than 10 GFlops


• arrival of clusters (cont’d)
– 2005: InfiniBand cluster at TUM
• 36 Opteron nodes (quad boards)
• 4 nodes (quad boards)
• 4 Xeon nodes (dual boards) for interactive tasks
• InfiniBand 4× switch, 96 ports
• Linux (SuSE and Red Hat)


• supercomputers
– supercomputing or high-performance scientific computing as the most important application of the big number crunchers
– national initiatives due to huge budget requirements
• Accelerated Strategic Computing Initiative (ASCI) in the U.S.
– in the wake of the nuclear testing moratorium in 1992/93
– decision: develop, build, and install a series of five supercomputers of up to $100 million each in the U.S.
– start: ASCI Red (1997, Intel-based, Sandia National Laboratory, the world’s first TFlops computer)
– then: ASCI Blue Pacific (1998, LLNL), ASCI Blue Mountain, ASCI White, …
• meanwhile a new high-end computing memorandum (2004)


• supercomputers (cont’d)
– federal “Bundeshöchstleistungsrechner” initiative in Germany
• decision in the mid-nineties
• three federal supercomputing centres in Germany (Munich, Stuttgart, and Jülich)
• one new installation every second year (i.e. a six-year upgrade cycle for each centre)
• the newest one is supposed to be among the top 10 in the world
– overview and state of the art: the Top500 list (updated every six months), see http://www.top500.org


• MOORE’s law
– observation by Intel co-founder Gordon E. MOORE in 1965, describing an important trend in the history of computer hardware:
the number of transistors that can be placed on an integrated circuit increases exponentially, doubling approximately every two years


• some numbers: Top500 (statistics charts from the Top500 list; not reproduced here)

– cluster: #nodes > #processors/node; constellation: #nodes < #processors/node


• The Earth Simulator
– world’s #1 from 2002 to 2004
– installed in 2002 in Yokohama, Japan
– ES building (approx. 50 m × 65 m × 17 m)
– based on the NEC SX-6 architecture
– developed by three governmental agencies
– highly parallel vector computer
– consists of 640 nodes (plus 2 control and 128 data-switching nodes), each with
• 8 vector processors (8 GFlops each)
• 16 GB memory
⇒ 5120 processors (40.96 TFlops peak performance) and 10 TB memory in total; 35.86 TFlops sustained performance (Linpack)
– nodes connected by a 640×640 single-stage crossbar (83,200 cables with a total length of 2,400 km; 8 TB/s total bandwidth)
– further 700 TB disc space and 1.6 PB mass storage


• BlueGene/L
– world’s #1 since 2004
– installed in 2005 at LLNL, CA, USA (beta system in 2004 at IBM)
– cooperation of DoE, LLNL, and IBM
– massively parallel supercomputer
– consists of 65,536 nodes (plus 12 front-end and 1,024 I/O nodes), each with
• 2 PowerPC 440d processors (2.8 GFlops each)
• 512 MB memory
⇒ 131,072 processors (367 TFlops peak performance) and 33.5 TB memory in total; 280.6 TFlops sustained performance (Linpack)
– nodes configured as a 3D torus (32 × 32 × 64); global reduction tree for fast operations (global max / sum) within a few microseconds
– 1,024 Gbps link to the global parallel file system
– further 806 TB disc space; operating system SuSE SLES 9


• HLRB II (world’s #6 as of 04/2006)
– installed in 2006 at LRZ, Garching
– installation costs 38 M€
– monthly costs approx. 400,000 €
– upgrade in 2007 (finished)
– one of Germany’s three federal supercomputers
– SGI Altix 4700
– consists of 19 nodes (SGI NUMAlink 2D torus), each with
• 256 blades (ccNUMA link with partition fat tree)
– Intel Itanium2 Montecito dual core (12.8 GFlops per processor)
– 4 GB memory per core
⇒ 9,728 processor cores (62.3 TFlops peak performance) and 39 TB memory; 56.5 TFlops sustained performance (Linpack)
– footprint 24 m × 12 m; total weight 103 metric tons
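A quick arithmetic check of the peak-performance figures quoted for the three systems above (number of processing elements times per-element peak); only the small helper function is mine, the numbers are taken from the slides:

```python
def peak_tflops(num_processors, gflops_each):
    # peak performance = number of processing elements x per-element peak, in TFlops
    return num_processors * gflops_each / 1000.0

print(peak_tflops(5120, 8.0))     # Earth Simulator: ~40.96 TFlops
print(peak_tflops(131072, 2.8))   # BlueGene/L:      ~367 TFlops
print(peak_tflops(9728, 6.4))     # HLRB II: 9728 cores x 6.4 GFlops/core ~ 62.3 TFlops
```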


• standard classification according to FLYNN
– global data and instruction streams as criterion
• instruction stream: sequence of commands to be executed
• data stream: sequence of data subject to instruction streams
– two-dimensional subdivision according to
• the number of instructions a computer can execute per time unit
• the number of data elements a computer can process per time unit
– hence, FLYNN distinguishes four classes of architectures
• SISD: single instruction, single data
• SIMD: single instruction, multiple data
• MISD: multiple instruction, single data
• MIMD: multiple instruction, multiple data
– drawback: very different computers may belong to the same class


• standard classification according to FLYNN (cont’d)
– SISD
• one processing unit that has access to one data memory and one program memory
• classical monoprocessor following VON NEUMANN’s principle



• standard classification according to FLYNN (cont’d)
– SIMD
• several processing units, each with separate access to a (shared or distributed) data memory; one program memory
• synchronous execution of instructions
• examples: array computers, vector computers
• advantage: easy programming model due to a control flow with strictly synchronous-parallel execution of all instructions
• drawbacks: specialised hardware necessary, easily becomes outdated due to rapid developments on the commodity market



• standard classification according to FLYNN (cont’d)
– MISD
• several processing units that have access to one data memory; several program memories
• not a very popular class (mainly for special applications such as digital signal processing)
• operating on a single stream of data, forwarding results from one processing unit to the next
• example: systolic arrays (networks of primitive processing elements that “pump” data)



• standard classification according to FLYNN (cont’d)
– MIMD
• several processing units, each with separate access to a (shared or distributed) data memory; several program memories
• classification according to the (physical) memory organisation
– shared memory ⇒ shared (global) address space
– distributed memory ⇒ distributed (local) address space
• examples: multiprocessor systems, networks of computers



• processor coupling
– cooperation of processors / computers as well as their shared use of various resources requires communication and synchronisation
– the following types of processor coupling can be distinguished
• memory-coupled multiprocessor systems (MemMS)
• message-coupled multiprocessor systems (MesMS)

                            global memory    distributed memory
shared address space        MemMS, SMP       Mem-MesMS (hybrid)
distributed address space   ∅                MesMS


• processor coupling (cont’d)
– central issues
• scalability: costs for adding new nodes / processors
• programming model: costs for writing parallel programs
• portability: costs for porting (migration), i.e. the transfer from one system to another while preserving executability and flexibility
• load distribution: costs for obtaining a uniform load distribution among all nodes / processors
– MesMS are advantageous concerning scalability, MemMS are typically better concerning the rest
– hence, a combination of MemMS and MesMS for exploiting all advantages ⇒ distributed / virtual shared memory (DSM / VSM) – physically distributed memory with a global shared address space


• processor coupling (cont’d)
– uniform memory access (UMA)
• each processor P has direct access via the network to each memory module M, with the same access times to all data
• a standard programming model can be used (i.e. no explicit send / receive of messages necessary)
• communication and synchronisation via shared variables (inconsistencies (write conflicts, e.g.) have to be prevented by the programmer)


• processor coupling (cont’d)
– symmetric multiprocessor (SMP)
• only a small number of processors, in most cases a central bus, one address space (UMA), but bad scalability
• cache coherence implemented in hardware (i.e. a read always provides the value of a variable’s last write)
• examples: dual or quad boards, SGI Challenge
– [diagram: shared memory M, accessed by processors P through their caches C]


• processor coupling (cont’d)
– non-uniform memory access (NUMA)
• memory modules physically distributed among the processors
• shared address space, but access times depend on the location of the data (i.e. local addresses are faster than remote addresses)
• differences in access times are visible in the program
• examples: DSM / VSM, Cray T3E


• processor coupling (cont’d)
– cache-coherent non-uniform memory access (ccNUMA)
• caches for local and remote addresses; cache coherence implemented in hardware for the entire address space
• scalability problems due to frequent cache updates
• example: SGI Origin 2000


• processor coupling (cont’d)
– cache-only memory access (COMA)
• each processor has only cache memory
• the entirety of all cache memories forms the global shared memory
• cache coherence implemented in hardware
• example: Kendall Square Research KSR-1


• processor coupling (cont’d)
– no remote memory access (NORMA)
• each processor has direct access to its local memory only
• access to remote memory is possible only via explicit message exchange (due to the distributed address space), as sketched below
• synchronisation implicitly via the exchange of messages
• performance improvement between memory and I/O possible due to parallel data transfer (direct memory access (DMA), e.g.)
• examples: IBM SP2, ASCI Red / Blue / White
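A minimal sketch of the explicit message exchange that NORMA-style systems require, written with the mpi4py binding of MPI (mpi4py is only one possible choice and is not prescribed by the slide):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    local_data = [1, 2, 3]
    comm.send(local_data, dest=1, tag=0)    # remote memory is reached only via messages
elif rank == 1:
    received = comm.recv(source=0, tag=0)   # the blocking receive also synchronises
    print("process 1 received", received)
```

It would be started with two processes, e.g. via mpirun -np 2.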



1 Introduction – Levels of Parallelism

• the suitability of a parallel architecture for a given parallel program strongly depends on the granularity of parallelism
• some remarks on granularity
– quantitative meaning: ratio of computational effort to communication / synchronisation effort (≈ number of instructions between two necessary communication / synchronisation steps)
– qualitative meaning: the level on which work is done in parallel, from coarse-grain to fine-grain parallelism:
• program level (coarse-grain)
• process level
• block level
• instruction level
• sub-instruction level (fine-grain)


• program level
– parallel processing of different programs
– independent units without any shared data
– no or only a small amount of communication / synchronisation
– organised by the operating system
• process level
– a program is subdivided into processes to be executed in parallel
– each process consists of a larger number of sequential instructions and has a private address space
– synchronisation necessary (in case all processes belong to one program)
– communication in most cases necessary (data exchange, e.g.)
– support by the OS via routines for process management, process communication, and process synchronisation
– such processes are often referred to as heavy-weight processes


• block level
– blocks of instructions are executed in parallel
– each block consists of a smaller number of instructions and shares the address space with other blocks
– communication via shared variables; synchronisation mechanisms (see the sketch below)
– such blocks are often referred to as light-weight processes (threads)
• instruction level
– parallel execution of machine instructions
– optimising compilers can increase this potential by modifying the order of commands (better exploitation of superscalar architectures and pipelining mechanisms)
• sub-instruction level
– instructions are further subdivided into units to be executed in parallel or via overlapping (vector operations, e.g.)
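A minimal sketch of block-level parallelism as described above: two light-weight processes (threads) in one address space, communicating via a shared variable and synchronising with a lock; the counter example is invented:

```python
import threading

counter = 0                      # shared variable in the common address space
lock = threading.Lock()          # synchronisation mechanism

def block(n):
    global counter
    for _ in range(n):
        with lock:               # prevent write conflicts on the shared variable
            counter += 1

threads = [threading.Thread(target=block, args=(10000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                   # 20000: both blocks updated the same memory
```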



1 Introduction – Quantitative Performance Evaluation

• execution time
– time T of a parallel program between the start of the execution on one processor and the end of all computations on the last processor
– during execution, all processors are in one of the following states
• compute: computation time T_COMP – time spent for computations
• communicate: communication time T_COMM – time spent for send and receive operations
• idle: idle time T_IDLE – time spent for waiting (sending / receiving messages)
– hence T = T_COMP + T_COMM + T_IDLE


• parallel profile
– measures the amount of parallelism of a parallel program
– graphical representation
• x-axis shows time, y-axis shows the number of parallel activities
• identification of computation, communication, and idle periods
– example: [chart: activity profile of three processes (proc. A, B, C) over time, distinguishing compute, communicate, and idle phases; y-axis: number of processes (0–3)]


• parallel profile (cont’d)
– degree of parallelism
• P(t) indicates the number of processes (of one application) that can be executed in parallel at any point in time (i.e. the y-values of the previous example for any time t)
– average parallelism (often referred to as the parallel index)
• A(p) indicates the average number of processes that can be executed in parallel, hence

A(p) = ( Σ_{i=1..p} i · t_i ) / ( Σ_{i=1..p} t_i )   or   A(p) = 1/(t_2 − t_1) · ∫_{t_1}^{t_2} P(t) dt

where p is the number of processes and t_i is the time during which exactly i processes are busy


• parallel profile (cont’d)
– previous example: A(p) = (1⋅18 + 2⋅4 + 3⋅13) / 35 = 65/35 ≈ 1.86
– [plot: P(t) over time for the example, P(t) ∈ {1, 2, 3}, time axis from 5 to 45]
– for A(p), several theoretical (typically quite pessimistic) estimates exist, often used as arguments against parallel systems
– example: estimate of MINSKY (1971)
• problem: the number of used processors is halved in every step
• parallel summation of 2p numbers on p processors, e.g.
• result?
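The average parallelism of the example above, recomputed in a few lines (t1 = 18, t2 = 4, t3 = 13 are the time shares from the profile):

```python
def average_parallelism(busy_times):
    # busy_times[i-1] = time during which exactly i processes are busy
    total_time = sum(busy_times)
    weighted = sum(i * t for i, t in enumerate(busy_times, start=1))
    return weighted / total_time

print(average_parallelism([18, 4, 13]))   # (1*18 + 2*4 + 3*13) / 35 = 65/35 ≈ 1.86
```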


• comparison multiprocessor / monoprocessor
– correlation of the multi- and the monoprocessor system’s performance
– important: a program that can be executed on both systems
– definitions
• P(1): number of unit operations of the program on the monoprocessor system
• P(p): number of unit operations of the program on the multiprocessor system with p processors
• T(1): execution time of the program on the monoprocessor system (measured in steps or clock cycles)
• T(p): execution time of the program on the multiprocessor system with p processors (measured in steps or clock cycles)


• comparison multiprocessor / monoprocessor (cont’d)
– simplifying preconditions
• T(1) = P(1) – one operation to be executed in one step on the monoprocessor system
• T(p) ≤ P(p) – more than one operation to be executed in one step (for p ≥ 2) on the multiprocessor system with p processors


• comparison multiprocessor / monoprocessor (cont’d)
– speed-up
• S(p) indicates the improvement in processing speed:
S(p) = T(1) / T(p)
• in general, 1 ≤ S(p) ≤ p
– efficiency
• E(p) indicates the relative improvement in processing speed:
E(p) = S(p) / p
• the improvement is normalised by the number of processors p
• in general, 1/p ≤ E(p) ≤ 1


• comparison multiprocessor / monoprocessor (cont’d)
– speed-up and efficiency can be seen in two different ways
• algorithm-independent: the best known sequential algorithm for the monoprocessor system is compared to the respective parallel algorithm for the multiprocessor system
⇒ absolute speed-up, absolute efficiency
• algorithm-dependent: the parallel algorithm is treated as a sequential one to measure the execution time on the monoprocessor system; “unfair” due to communication and synchronisation overhead
⇒ relative speed-up, relative efficiency


• comparison multiprocessor / monoprocessor (cont’d)
– overhead
• O(p) indicates the necessary overhead of the multiprocessor system for organisation, communication, and synchronisation:
O(p) = P(p) / P(1)
• in general, 1 ≤ O(p)
– parallel index
• I(p) indicates the number of operations executed on average per time unit:
I(p) = P(p) / T(p)
• I(p) ≈ relative speed-up (taking the overhead into account)


• comparison multiprocessor / monoprocessor (cont’d)
– utilisation
• U(p) indicates the number of operations each processor executes on average per time unit:
U(p) = I(p) / p
• corresponds to the normalised parallel index
– conclusions
• all defined quantities have the value 1 for p = 1
• the parallel index is an upper bound for the speed-up: 1 ≤ S(p) ≤ I(p) ≤ p
• the utilisation is an upper bound for the efficiency: 1/p ≤ E(p) ≤ U(p) ≤ 1


• comparison multiprocessor / monoprocessor (cont’d)
– example (1)
• a monoprocessor system needs 6000 steps for the execution of 6000 operations to compute some result
• a multiprocessor system with five processors needs 6750 operations for the computation of the same result, but only 1500 steps for the execution
• thus P(1) = T(1) = 6000, P(5) = 6750, and T(5) = 1500
• speed-up and efficiency can be computed as

S(5) = 6000/1500 = 4 and E(5) = 4/5 = 0.8

⇒ there is an acceleration by a factor of 4 compared to the monoprocessor system, i.e. on average each processor of the multiprocessor system contributes 80% of its potential to this acceleration


• comparison multiprocessor / monoprocessor (cont’d)
– example (2)
• parallel index and utilisation can be computed as

I(5) = 6750/1500 = 4.5 and U(5) = 4.5/5 = 0.9

⇒ on average 4.5 processors are simultaneously busy, i.e. each processor is working for only 90% of the execution time
• the overhead can be computed as

O(5) = 6750/6000 = 1.125

⇒ there is an overhead of 12.5% on the multiprocessor system compared to the monoprocessor system
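The two examples above, recomputed in a few lines; P(1), T(1), P(5), T(5) are exactly the values from the slides:

```python
P1, T1 = 6000, 6000        # operations and steps on the monoprocessor
Pp, Tp, p = 6750, 1500, 5  # operations, steps, and processor count on the multiprocessor

S = T1 / Tp                # speed-up        -> 4.0
E = S / p                  # efficiency      -> 0.8
I = Pp / Tp                # parallel index  -> 4.5
U = I / p                  # utilisation     -> 0.9
O = Pp / P1                # overhead        -> 1.125
print(S, E, I, U, O)
```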


• scalability
– objective: adding further processing elements to the system shall reduce the execution time without any program modifications
– i.e. a linear performance increase with an efficiency close to 1
– a sufficient problem size is important for scalability
• one porter may carry one suitcase in a minute
• 60 porters won’t do it in a second
• but 60 porters may carry 60 suitcases in a minute
– in case of a fixed problem size and an increasing number of processors, saturation will occur for a certain value of p, hence scalability is limited
– when scaling the number of processors together with the problem size (so-called scaled problem analysis), this effect will not appear for well scalable hard- and software systems


• AMDAHL’s law
– probably the most important and most famous estimate for the speed-up (even if quite pessimistic)
– underlying model
• each program consists of a sequential part s, 0 ≤ s ≤ 1, that can only be executed sequentially; synchronisation or data I/O, e.g.
• furthermore, each program consists of a parallelisable part 1−s that can be executed in parallel by several processes; finding the maximum value within a set of numbers, e.g.
– hence, the execution time of the parallel program executed on p processors can be written as

T(p) = s · T(1) + ((1 − s) / p) · T(1)


• AMDAHL’s law (cont’d)
– the speed-up can thus be computed as

S(p) = T(1) / T(p) = T(1) / (s·T(1) + ((1−s)/p)·T(1)) = 1 / (s + (1−s)/p)

– letting p → ∞ we finally get AMDAHL’s law:

lim_{p→∞} S(p) = lim_{p→∞} 1 / (s + (1−s)/p) = 1/s

⇒ the speed-up is bounded: S(p) ≤ 1/s
– the sequential part can have a dramatic impact on the speed-up
– therefore a central effort of all (parallel) algorithms: keep s small
– many parallel programs have a small sequential part (s < 0.1)


• AMDAHL’s law (cont’d)
– example
• s = 0.1 and, thus, S(p) ≤ 10
• independent of p, the speed-up is bounded by this limit
• where’s the error?
– [plot: S(p) for s = 0.1 over p = 1…25, approaching the bound 10]
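A short sketch of AMDAHL's bound for this example: with s = 0.1 the speed-up approaches, but never exceeds, 1/s = 10:

```python
def amdahl_speedup(s, p):
    # S(p) = 1 / (s + (1 - s) / p)
    return 1.0 / (s + (1.0 - s) / p)

s = 0.1
for p in (1, 5, 10, 25, 100, 1000):
    print(p, round(amdahl_speedup(s, p), 2))   # tends towards the bound 1/s = 10
```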


• GUSTAFSON’s law
– addresses the shortcomings of AMDAHL’s law, as it states that any sufficiently large problem can be efficiently parallelised
– instead of a fixed problem size it supposes a fixed-time concept
– underlying model
• the execution time on the parallel machine is normalised to 1
• this contains a non-parallelisable part σ, 0 ≤ σ ≤ 1
– hence, the execution time of the sequential program on the monoprocessor can be written as

T(1) = σ + p · (1 − σ)

– the speed-up can thus be computed as

S(p) = σ + p · (1 − σ) = p + σ · (1 − p)


• GUSTAFSON’s law (cont’d)
– difference to AMDAHL
• the sequential part s(p) is not constant, but gets smaller with increasing p:

s(p) = σ / (σ + p · (1 − σ)),   s(p) ∈ ]0, 1[

• often more realistic, because more processors are used for a larger problem size, and there the parallelisable parts typically increase (more computations, fewer declarations, …)
• the speed-up is not bounded for increasing p
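A sketch of GUSTAFSON's scaled speed-up and of the shrinking sequential share s(p), using the two formulas above; σ = 0.1 is chosen only for illustration:

```python
def gustafson_speedup(sigma, p):
    # S(p) = p + sigma * (1 - p)
    return p + sigma * (1 - p)

def sequential_share(sigma, p):
    # s(p) = sigma / (sigma + p * (1 - sigma))
    return sigma / (sigma + p * (1 - sigma))

sigma = 0.1
for p in (1, 5, 10, 100, 1000):
    # speed-up grows without bound, while the sequential share shrinks towards 0
    print(p, gustafson_speedup(sigma, p), round(sequential_share(sigma, p), 4))
```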


• GUSTAFSON’s law (cont’d)
– some more thoughts about speed-up
• theory says: a superlinear speed-up does not exist
– each parallel algorithm can be simulated on a monoprocessor system by emulating, in a loop, always the next step of a processor of the multiprocessor system
• but a superlinear speed-up can be observed in practice
– when improving an inferior sequential algorithm
– when a parallel program (that does not fit into the main memory of the monoprocessor system) completely runs in the caches and main memories of the nodes of the multiprocessor system


• communication–computation ratio (CCR)
– an important quantity for measuring the success of a parallelisation
• gives the ratio of pure communication time to pure computing time
• a small CCR is favourable
• typically: the CCR decreases with increasing problem size
– example
• an N×N matrix is distributed among p processors (N/p rows each)
• iterative method: in each step, each matrix element is replaced by the average of its eight neighbour values
• hence, the two neighbouring rows are always necessary
• computation time: 8·N·N/p
• communication time: 2·N
• CCR: 2N / (8N²/p) = p/(4N)
– what does this mean?
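A quick check of the ratio derived above, CCR = 2N / (8N²/p) = p/(4N); the concrete values of N and p below are invented for illustration:

```python
def ccr(n, p):
    computation = 8 * n * n / p          # 8 neighbour operations per element, N*N/p elements
    communication = 2 * n                # two neighbouring rows of length N per step
    return communication / computation   # equals p / (4 * n)

for n, p in ((1000, 4), (1000, 64), (10000, 64)):
    print(n, p, ccr(n, p), p / (4 * n))  # the CCR shrinks as the problem size N grows
```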
