High Performance Computing – Programming Paradigms and Scalability
Part 1: Introduction

PD Dr. rer. nat. habil. Ralf-Peter Mundani
Computation in Engineering (CiE) / Scientific Computing (SCCS)
Technische Universität München
Summer Term 2015

General Remarks
. Ralf-Peter Mundani
  . email: [email protected], phone: 289–25057, room: 3181
  . consultation-hour: by appointment
  . lecture: Tuesday, 12:00—13:30, room 02.07.023
. Christoph Riesinger
  . email: [email protected]
  . exercise: Wednesday, 10:15—11:45, room 02.07.023 (fortnightly)
. examination
  . written, 90 minutes
  . all printed/written materials allowed (no electronic devices)

. materials: http://www5.in.tum.de


General Remarks
. content
  . part 1: introduction
  . part 2: high-performance networks
  . part 3: foundations
  . part 4: shared-memory programming
  . part 5: distributed-memory programming
  . part 6: examples of parallel algorithms

Overview
. motivation
. hardware excursion
. supercomputers
. classification of parallel computers
. quantitative performance evaluation

If one ox could not do the job they did not try to grow a bigger ox, but used two oxen. —Grace Murray Hopper

Motivation
. numerical simulation: from phenomena to predictions
  . starting point: physical phenomenon / technical process
  1. modelling: determination of parameters, expression of relations
  2. numerical treatment: model discretisation, algorithm development
  3. implementation: software development, parallelisation
  4. visualisation: illustration of abstract simulation results
  5. validation: comparison of results with reality
  6. embedding: insertion into working process
  . disciplines involved: mathematics, computer science, application

Motivation
. why numerical simulation?
  . because experiments are sometimes impossible
    . life cycle of galaxies, weather forecast, terror attacks, e.g.
    [image: bomb attack on WTC (1993)]
  . because experiments are sometimes not welcome
    . avalanches, nuclear tests, medicine, e.g.


Motivation
. why numerical simulation? (cont'd)
  . because experiments are sometimes very costly & time consuming
    . protein folding, material sciences, e.g.
    [image: Mississippi basin model (Jackson, MS)]
  . because experiments are sometimes more expensive
    . aerodynamics, crash test, e.g.

Motivation
. why parallel programming and HPC?
  . complex problems (especially the so-called "grand challenges") demand for more computing power
    . climate or geophysics simulation (tsunami, e.g.)
    . structure or flow simulation (crash test, e.g.)
    . development systems (CAD, e.g.)
    . large data analysis (Large Hadron Collider at CERN, e.g.)
    . military applications (crypto analysis, e.g.)
  . performance increase due to
    . faster hardware, more memory ("work harder")
    . more efficient algorithms, optimisation ("work smarter")
    . parallel computing ("get some help")

Motivation
. objectives (in case all resources would be available N-times)
  . throughput: compute N problems simultaneously
    . running N instances of a sequential program with different data sets ("embarrassing parallelism"); SETI@home, e.g.
    . drawback: limited resources of single nodes
  . response time: compute one problem in a fraction (1/N) of the time
    . running one instance (i.e. N processes) of a parallel program for jointly solving a problem; finding prime numbers, e.g.
    . drawback: writing a parallel program; communication
  . problem size: compute one problem with N-times larger data
    . running one instance (i.e. N processes) of a parallel program, using the sum of all local memories for computing larger problem sizes; iterative solution of SLE, e.g.
    . drawback: writing a parallel program; communication

Motivation
. levels of parallelism
  . qualitative meaning: level(s) on which work is done in parallel
  . from fine to coarse granularity:
    . sub-instruction level
    . instruction level
    . block level
    . process level
    . program level


Motivation
. levels of parallelism (cont'd)
  . program level
    . parallel processing of different programs
    . independent units without any shared data
    . organised by the OS
  . process level
    . a program is subdivided into processes to be executed in parallel
    . each process consists of a larger amount of sequential instructions and some private data
    . communication in most cases necessary (data exchange, e.g.)
    . term of process often referred to as heavy-weight process

Motivation
. levels of parallelism (cont'd)
  . block level (see the sketch below)
    . blocks of instructions are executed in parallel
    . each block consists of few instructions and shares data with others
    . communication via shared variables; synchronisation mechanisms
    . term of block often referred to as light-weight process (thread)
  . instruction level
    . parallel execution of machine instructions
    . optimising compilers can increase this potential by modifying the order of commands
  . sub-instruction level
    . instructions are further subdivided into units to be executed in parallel or via overlapping (vector operations, e.g.)
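A minimal OpenMP sketch (not from the original slides) of block-level parallelism: the threads of one program execute blocks of the loop in parallel and share the array data; the array size and values are illustrative, and the reduction clause is just one of several possible synchronisation mechanisms.

```c
/* block-level parallelism: threads (light-weight processes) of one program
 * execute chunks of the loop in parallel and share the array "data";
 * compile e.g. with: gcc -fopenmp -O2 block_level.c */
#include <stdio.h>
#include <omp.h>

#define N 1000000

static double data[N];

int main(void)
{
    double sum = 0.0;

    for (int i = 0; i < N; i++)          /* initialise shared data */
        data[i] = 1.0;

    /* each thread works on a chunk of iterations; the reduction clause
     * synchronises the concurrent updates of the shared variable "sum" */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += data[i];

    printf("sum = %.1f, threads available: %d\n", sum, omp_get_max_threads());
    return 0;
}
```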

Overview
. motivation
. hardware excursion
. supercomputers
. classification of parallel computers
. quantitative performance evaluation

Hardware Excursion
. definition of parallel computers
  "A collection of processing elements that communicate and cooperate to solve large problems" (ALMASI and GOTTLIEB, 1989)
. possible appearances of such processing elements
  . specialised units (steps of a vector pipeline, e.g.)
  . parallel features in modern monoprocessors (instruction pipelining, superscalar architectures, VLIW, multithreading, multicore, …)
  . several uniform arithmetical units (processing elements of array computers, GPGPUs, e.g.)
  . complete stand-alone computers connected via LAN (workstation or PC clusters, so-called virtual parallel computers)
  . parallel computers or clusters connected via WAN (so-called metacomputers)


Hardware Excursion
. instruction pipelining
  . instruction execution involves several operations
    1. instruction fetch (IF)
    2. decode (DE)
    3. fetch operands (OP)
    4. execute (EX)
    5. write back (WB)
    which are executed successively
  . hence, only one part of the CPU works at a given moment

    time →
    instruction N      IF DE OP EX WB
    instruction N+1                   IF DE OP EX WB
    …

Hardware Excursion
. instruction pipelining (cont'd)
  . observation: while processing a particular stage of an instruction, the other stages are idle
  . hence, multiple instructions are overlapped in execution → instruction pipelining (similar to assembly lines; see the estimate below)
  . advantage: no additional hardware necessary

    time →
    instruction N      IF DE OP EX WB
    instruction N+1       IF DE OP EX WB
    instruction N+2          IF DE OP EX WB
    instruction N+3             IF DE OP EX WB
    instruction N+4                IF DE OP EX WB
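A standard back-of-the-envelope estimate (not on the original slide), assuming k pipeline stages of equal length and n independent instructions:

```latex
% without pipelining: n \cdot k cycles; with pipelining: k + (n - 1) cycles
S_{\text{pipe}}(n) = \frac{n \cdot k}{k + n - 1} \xrightarrow{\;n \to \infty\;} k,
\qquad\text{e.g. } k = 5,\; n = 100:\; S_{\text{pipe}} = \frac{500}{104} \approx 4.8
```

So for long instruction streams the gain approaches the number of pipeline stages, which is why the five-stage diagrams above nearly quintuple the throughput.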

Hardware Excursion
. superscalar
  . faster CPU throughput due to simultaneous execution of instructions within one clock cycle via redundant functional units (ALU, multiplier, …)
  . dispatcher decides (during runtime) which instructions read from memory can be executed in parallel and dispatches them to different functional units
  . for instance, PowerPC 970 (4 × ALU, 2 × FPU)

    [figure: dispatcher assigning instr. 1–4 to four ALUs and instr. A, B to two FPUs]

  . but, performance improvement is limited (intrinsic parallelism)

Hardware Excursion
. superscalar (cont'd)
  . pipelining for superscalar architectures also possible

    [figure: two instructions issued per clock cycle, each flowing through the IF–DE–OP–EX–WB pipeline (instructions N … N+9)]


Hardware Excursion
. very long instruction word (VLIW)
  . in contrast to superscalar architectures, the compiler groups parallel executable instructions during compilation (pipelining still possible)
  . advantage: no additional hardware logic necessary
  . drawback: not always fully useable (→ dummy filling (NOP))

    [figure: one VLIW instruction containing instr. 1, instr. 2, instr. 3, instr. 4]

Hardware Excursion
. vector units
  . simultaneous execution of one instruction on a one-dimensional array of data (→ vector; see the C sketch below)
  . vector units first appeared in the 1970s and were the basis of most supercomputers in the 1980s and 1990s

    (A1 + B1, A2 + B2, A3 + B3, …, AN−1 + BN−1, AN + BN)^T = (C1, C2, C3, …, CN−1, CN)^T
    [figure: vector registers holding elements 1, 2, 3, …, N−1, N]

  . specialised hardware → very expensive
  . limited application areas (mostly CFD, CSD, …)
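As an illustration (not from the original slides), the element-wise vector addition C = A + B written as a plain C loop; on a vector unit or a CPU with SIMD extensions a compiler can map such a loop to vector instructions that process several elements per instruction (e.g. with gcc -O3).

```c
/* element-wise vector addition c[i] = a[i] + b[i]; the "restrict"
 * qualifiers tell the compiler the arrays do not overlap, which helps
 * it emit vector (SIMD) instructions for this loop */
#include <stddef.h>

void vector_add(const double *restrict a, const double *restrict b,
                double *restrict c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```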

Hardware Excursion
. dual core, quad core, many core, and multicore
  . observation: increasing frequency f (and thus core voltage v) over past years → problem: thermal power dissipation P ~ f·v²

Hardware Excursion
. dual core, quad core, many core, and multicore (cont'd)
  . a 25% reduction in performance (i.e. core voltage) leads to approx. 50% reduction in dissipation

    [chart: dissipation and performance of a normal CPU vs. a voltage-reduced CPU]


Hardware Excursion
. dual core, quad core, many core, and multicore (cont'd)
  . idea: installation of two cores per die with same dissipation as single core system

    [chart: dissipation and performance of a single core vs. a dual core system]

Hardware Excursion
. dual core, quad core, many core, and multicore (cont'd)
  . single vs. dual vs. quad core

    [figure: single core (core 0 with private L1 and L2), dual core (cores 0–1 with private L1 and a shared L2), quad core (cores 0–3 with private L1 and two shared L2 caches, one per core pair); each chip attached to the FSB]

  . FSB: front side bus (i.e. connection to memory (via north bridge))

Hardware Excursion
. INTEL Nehalem Core i7

    [figure: four cores (core 0–3), each with private L1/L2 cache, a shared L3 cache, and a QPI link; source: www.samrathacks.com]

  . QPI: QuickPath Interconnect replaces FSB (QPI is a point-to-point interconnection – with a memory controller now on-die – in order to allow both reduced latency and higher bandwidth → up to (theoretically) 25.6 GB/s data transfer, i.e. 2 × FSB)

Hardware Excursion
. Intel E5-2600 Sandy Bridge Series
  . 2 CPUs connected by 2 QPIs (Intel QuickPath Interconnect)
  . QuickPath Interconnect (1 sending and 1 receiving port)
    . 8 GT/s · 16 Bit/T payload · 2 directions / 8 Bit/Byte = 32 GB/s max bandwidth / QPI
    . 2 QPI links: 2 · 32 GB/s = 64 GB/s max bandwidth

    [figure: two-socket E5-2600 configuration; source: G. Wellein, RRZE]


Overview
. motivation
. hardware excursion
. supercomputers
. classification of parallel computers
. quantitative performance evaluation

Supercomputers
. arrival of clusters
  . in the late eighties, PCs became a commodity market with rapidly increasing performance, mass production, and decreasing prices
  . growing attractiveness for parallel computers
  . 1994: Beowulf, the first parallel computer built completely out of commodity hardware
    . NASA Goddard Space Flight Centre
    . 16 Intel DX4 processors
    . multiple 10 Mbit Ethernet links
    . Linux with GNU compilers
    . MPI library
  . 1996: Beowulf cluster performing more than 1 GFlops
  . 1997: a 140-node cluster performing more than 10 GFlops

Supercomputers
. supercomputers
  . supercomputing or high-performance scientific computing as the most important application of the big number crunchers
  . national initiatives due to huge budget requirements
    . Accelerated Strategic Computing Initiative (ASCI) in the U.S.
      . in the sequel of the nuclear testing moratorium in 1992/93
      . decision: develop, build, and install a series of five supercomputers of up to $100 million each in the U.S.
      . start: ASCI Red (1997, Intel-based, Sandia National Laboratory, the world's first TFlops computer)
      . then: ASCI Blue Pacific (1998, LLNL), ASCI Blue Mountain, ASCI White, …
      . meanwhile new high-end computing memorandum (2004)

Supercomputers
. supercomputers (cont'd)
  . federal "Bundeshöchstleistungsrechner" initiative in Germany
    . decision in the mid-nineties
    . three federal supercomputing centres in Germany (Munich, Stuttgart, and Jülich)
    . one new installation every second year (i.e. a six-year upgrade cycle for each centre)
    . the newest one to be among the top 10 of the world
  . overview and state of the art: Top500 list (updated every six months), see http://www.top500.org
  . finally (a somewhat different definition)
    "Supercomputer: Turns CPU-bound problems into I/O-bound problems." —Ken Batcher


Supercomputers
. MOORE's law
  . observation of Intel co-founder Gordon E. MOORE, describes an important trend in the history of computer hardware (1965)
  . the number of transistors that can be placed on an integrated circuit is increasing exponentially, doubling approximately every eighteen months

Supercomputers
. some numbers: Top500

Supercomputers
. some numbers: Top500 (cont'd)

    [charts: development of the Top500 list – "Citius, altius, fortius!"]

Supercomputers
. the 10 fastest supercomputers in the world (by November 2014)

    [table: Top 10 of the Top500 list, November 2014]

Supercomputers
. The Earth Simulator – world's #1 from 2002—04
  . installed in 2002 in Yokohama, Japan
  . based on NEC SX-6 architecture
  . developed by three governmental agencies
  . highly parallel vector supercomputer
  . ES-building (approx. 50 m × 65 m × 17 m)
  . consists of 640 nodes (plus 2 control & 128 data switching)
    . 8 vector processors (8 GFlops each)
    . 16 GB shared memory
    → 5120 processors (40.96 TFlops peak performance) and 10 TB memory; 35.86 TFlops sustained performance (Linpack)
  . nodes connected by a 640 × 640 single-stage crossbar (83,200 cables with a total extension of 2400 km; 8 TB/s total bandwidth)
  . further 700 TB disc space and 1.60 PB mass storage


. BlueGeneL – world’s #1 from 2004—08 . Roadrunner – world’s #1 from 2008—09 . installed in 2005 at LLNL, CA, USA . installed in 2008 at LANL, NM, USA (beta-system in 2004 at IBM) . installation costs about $120 million . cooperation of DoE, LLNL, and IBM . first “hybrid” supercomputer . massive parallel supercomputer . dual-core Opteron . consists of 65,536 nodes (plus 12 front-end and 1204 IO nodes) . Cell Broadband Engine . 2 PowerPC 440d processors (2.8 GFlops each)  129,600 cores (1456.70 TFlops peak performance) and . 512MB memory 98 TB memory; 1144.00 TFlops sustained performance (Linpack)  131,072 processors (367.00 TFlops peak performance) and . standard processing (file system IO, e. g.) handled by Opteron, 33.50 TB memory; 280.60 TFlops sustained performance (Linpack) while mathematically and CPU-intensive tasks are handled by Cell . nodes configured as 3D torus (32  32  64); global reduction tree for fast . 2.35 MW power consumption ( 437 MFlops per Watt ) operations (global max  sum) in a few microseconds . primarily usage: ensure safety and reliability of nation’s nuclear . 1024 Gbps link to global parallel file system weapons stockpile, real-time applications (cause & effect in capital . further 806 TB disc space; operating system SuSE SLES 9 markets, bone structures and tissues renderings as patients are being examined, e.g.)


Supercomputers
. HLRB II (world's #6 for 04/2006)
  . installed in 2006 at LRZ, Garching
  . installation costs 38 M€
  . monthly costs approx. 400,000 €
  . upgrade in 2007 (finished)
  . one of Germany's 3 supercomputers
  . SGI 4700
    . consists of 19 nodes (SGI NUMAlink 2D torus)
    . 256 blades (ccNUMA link with partition fat tree)
    . Intel Itanium2 Montecito Dual Core (12.80 GFlops)
    . 4 GB memory per core
    → 9728 cores (62.30 TFlops peak performance) and 39 TB memory; 56.50 TFlops sustained performance (Linpack)
  . footprint 24 m × 12 m; total weight 103 metric tons

Supercomputers
. SuperMUC (world's #4 for 06/2012)
  . installed in 2012 at LRZ, Garching
  . IBM System x iDataPlex
  . (still) one of Germany's 3 supercomputers
  . consists of 19 islands (Infiniband FDR10 pruned tree with 4:1 intra-island / inter-island ratio)
    . 18 thin islands with 512 nodes each (total 288 TB memory)
      . Sandy Bridge-EP Xeon E5 (2 CPUs (8 cores each) / node)
    . 1 fat island with 205 nodes (total 52 TB memory)
      . Westmere-EX Xeon E7 (4 CPUs (10 cores each) / node)
    → 147,456 cores (3.185 PFlops peak performance – thin islands only); 2.897 PFlops sustained performance (Linpack)
  . footprint 21 m × 26 m; warm water cooling


Overview
. motivation
. hardware excursion
. supercomputers
. classification of parallel computers
. quantitative performance evaluation

Classification of Parallel Computers
. standard classification according to FLYNN
  . global data and instruction streams as criterion
    . instruction stream: sequence of commands to be executed
    . data stream: sequence of data subject to instruction streams
  . two-dimensional subdivision according to
    . amount of instructions per time a computer can execute
    . amount of data elements per time a computer can process
  . hence, FLYNN distinguishes four classes of architectures
    . SISD: single instruction, single data
    . SIMD: single instruction, multiple data
    . MISD: multiple instruction, single data
    . MIMD: multiple instruction, multiple data
  . drawback: very different computers may belong to the same class


Classification of Parallel Computers
. standard classification according to FLYNN (cont'd)
  . SISD
    . one processing unit that has access to one data memory and to one program memory
    . classical monoprocessor following VON NEUMANN's principle

      [diagram: data memory — processor — program memory]

Classification of Parallel Computers
. standard classification according to FLYNN (cont'd)
  . SIMD
    . several processing units, each with separate access to a (shared or distributed) data memory; one program memory
    . synchronous execution of instructions
    . example: array computer, vector computer
    . advantages: easy programming model due to control flow with a strict synchronous-parallel execution of all instructions
    . drawbacks: specialised hardware necessary, easily becomes outdated due to recent developments at the commodity market

      [diagram: several processors, each attached to its own data memory, all fed by one program memory]


Classification of Parallel Computers
. standard classification according to FLYNN (cont'd)
  . MISD
    . several processing units that have access to one data memory; several program memories
    . not very popular class (mainly for special applications such as Digital Signal Processing)
    . operating on a single stream of data, forwarding results from one processing unit to the next
    . example: systolic array (network of primitive processing elements that "pump" data)

Classification of Parallel Computers
. standard classification according to FLYNN (cont'd)
  . MIMD
    . several processing units, each with separate access to a (shared or distributed) data memory; several program memories
    . classification according to (physical) memory organisation
      . shared memory → shared (global) address space
      . distributed memory → distributed (local) address space
    . example: multiprocessor systems, networks of computers

      [diagram: several processors, each with its own data memory and program memory]


Classification of Parallel Computers
. processor coupling
  . cooperation of processors / computers as well as their shared use of various resources require communication and synchronisation
  . the following types of processor coupling can be distinguished
    . memory-coupled multiprocessor systems (MemMS)
    . message-coupled multiprocessor systems (MesMS)

      [figure: classification by memory organisation (global vs. distributed memory) and address space (shared vs. distributed): MemMS / SMP, Mem-MesMS (hybrid), MesMS]

Classification of Parallel Computers
. processor coupling (cont'd)
  . uniform memory access (UMA)
    . each processor P has direct access via the network to each memory module M with same access times to all data
    . standard programming model can be used (i.e. no explicit send / receive of messages necessary)
    . communication and synchronisation via shared variables (inconsistencies (write conflicts, e.g.) have to be prevented in general by the programmer; see the sketch below)

      [diagram: memory modules M connected via a network to processors P]
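A minimal sketch (not from the original slides) of the write conflict mentioned above and one way to prevent it; the atomic directive is only one of several possible synchronisation mechanisms (critical sections, locks, reductions), and the loop count is illustrative.

```c
/* shared-variable communication on a UMA/SMP system: without
 * synchronisation the concurrent updates of "counter" form a write
 * conflict (race condition); the atomic directive serialises them;
 * compile e.g. with: gcc -fopenmp race_demo.c */
#include <stdio.h>

int main(void)
{
    long counter = 0;              /* shared variable */

    #pragma omp parallel for
    for (long i = 0; i < 1000000; i++) {
        #pragma omp atomic         /* remove this line to observe lost updates */
        counter++;
    }

    printf("counter = %ld (expected 1000000)\n", counter);
    return 0;
}
```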

Classification of Parallel Computers
. processor coupling (cont'd)
  . symmetric multiprocessor (SMP)
    . only a small amount of processors, in most cases a central bus, one address space (UMA), but bad scalability
    . cache-coherence implemented in hardware (i.e. a read always provides a variable's value from its last write)
    . example: double or quad boards, SGI Challenge

      [diagram: processors P with caches C attached to a central bus and one memory M]

Classification of Parallel Computers
. processor coupling (cont'd)
  . non-uniform memory access (NUMA)
    . memory modules physically distributed among processors
    . shared address space, but access times depend on the location of data (i.e. local addresses faster than remote addresses)
    . differences in access times are visible in the program
    . example: DSM / VSM, T3E

      [diagram: processors P, each with a local memory module M, connected via a network]


Classification of Parallel Computers
. processor coupling (cont'd)
  . cache-coherent non-uniform memory access (ccNUMA)
    . caches for local and remote addresses; cache-coherence implemented in hardware for the entire address space
    . problem with scalability due to frequent cache actualisations
    . example: SGI Origin 2000

      [diagram: processors P with caches C and local memory modules M, connected via a network]

Classification of Parallel Computers
. processor coupling (cont'd)
  . cache-only memory access (COMA)
    . each processor has only cache-memory
    . entirety of all cache-memories = global shared memory
    . cache-coherence implemented in hardware
    . example: Kendall Square Research KSR-1

      [diagram: processors P with caches C only, connected via a network]

Classification of Parallel Computers
. processor coupling (cont'd)
  . no remote memory access (NORMA)
    . each processor has direct access to its local memory only
    . access to remote memory only possible via explicit message exchange (due to distributed address space)
    . synchronisation implicitly via the exchange of messages (see the MPI sketch below)
    . performance improvement between memory and I/O due to parallel data transfer (Direct Memory Access, e.g.) possible
    . example: IBM SP2, ASCI Red / Blue / White

      [diagram: nodes, each consisting of a processor P and local memory M, exchanging messages via a network]

Classification of Parallel Computers
. difference between processes and threads

      [diagram: a program (*.exe, *.out, e.g.) executed either in the process model – several processes with separate address spaces, typical for NORMA – or in the thread model – several threads within one shared address space, typical for UMA, NUMA]
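A minimal MPI sketch (not from the original slides) of the NORMA programming model: data in a remote memory can only be reached via explicit message exchange, here an MPI_Send / MPI_Recv pair between two processes; the transferred value is illustrative.

```c
/* explicit message exchange between distributed address spaces:
 * rank 0 sends one value to rank 1; compile with mpicc and run with
 * e.g. "mpirun -np 2 ./norma_demo" */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                   /* local data of rank 0 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```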


Overview
. motivation
. hardware excursion
. supercomputers
. classification of parallel computers
. quantitative performance evaluation

Quantitative Performance Evaluation
. execution time
  . time T of a parallel program between start of the execution on one processor and end of all computations on the last processor
  . during execution all processors are in one of the following states
    . compute
      . TCOMP: time spent for computations
    . communicate
      . TCOMM: time spent for send and receive operations
    . idle
      . TIDLE: time spent for waiting (for sending / receiving messages)
  . hence T = TCOMP + TCOMM + TIDLE (see the sketch below)
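A minimal MPI sketch (not from the original slides) of how the three contributions to T could be measured per process; the split into compute / communicate / idle phases, the loop counts, and the use of a barrier as the "waiting" phase are illustrative assumptions.

```c
/* rough per-process breakdown of T = TCOMP + TCOMM + TIDLE using MPI_Wtime();
 * the local work loop stands in for computations, the Allreduce for the
 * communication phase, and the barrier time is counted as idle (waiting) time */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    double t0, t_comp = 0.0, t_comm = 0.0, t_idle = 0.0;
    double local = 0.0, global = 0.0;

    MPI_Init(&argc, &argv);

    for (int step = 0; step < 10; step++) {
        t0 = MPI_Wtime();                          /* compute phase */
        for (int i = 0; i < 1000000; i++)
            local += 1.0e-6 * i;
        t_comp += MPI_Wtime() - t0;

        t0 = MPI_Wtime();                          /* communication phase */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        t_comm += MPI_Wtime() - t0;

        t0 = MPI_Wtime();                          /* waiting for the other processes */
        MPI_Barrier(MPI_COMM_WORLD);
        t_idle += MPI_Wtime() - t0;
    }

    printf("T = %.4f s (comp %.4f, comm %.4f, idle %.4f)\n",
           t_comp + t_comm + t_idle, t_comp, t_comm, t_idle);

    MPI_Finalize();
    return 0;
}
```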


Quantitative Performance Evaluation
. comparison multiprocessor ↔ monoprocessor
  . correlation of multi- and monoprocessor systems' performance
  . important: program that can be executed on both systems
  . definitions
    . P(1): amount of unit operations of a program on the monoprocessor system
    . P(p): amount of unit operations of a program on the multiprocessor system with p processors (for p ≥ 2)
    . T(1): execution time of a program on the monoprocessor system (measured in steps or clock cycles)
    . T(p): execution time of a program on the multiprocessor system with p processors (measured in steps or clock cycles)

Quantitative Performance Evaluation
. comparison multiprocessor ↔ monoprocessor (cont'd)
  . simplifying preconditions
    . T(1) = P(1)
      . one operation to be executed in one step on the monoprocessor system
    . T(p) ≤ P(p)
      . more than one operation can be executed in one step on the multiprocessor system with p processors


Quantitative Performance Evaluation
. comparison multiprocessor ↔ monoprocessor (cont'd)
  . speed-up
    . S(p) indicates the improvement in processing speed (see the worked example below)

      S(p) = T(1) / T(p)   with 1 ≤ S(p) ≤ p

  . efficiency
    . E(p) indicates the relative improvement in processing speed
    . improvement is normalised by the amount of processors p

      E(p) = S(p) / p   with 1/p ≤ E(p) ≤ 1

Quantitative Performance Evaluation
. comparison multiprocessor ↔ monoprocessor (cont'd)
  . speed-up and efficiency can be seen in two different ways
    . algorithm-independent
      . best known sequential algorithm for the monoprocessor system is compared to the respective parallel algorithm for the multiprocessor system
      → absolute speed-up
      → absolute efficiency
    . algorithm-dependent
      . parallel algorithm is treated as sequential one to measure the execution time on the monoprocessor system; "unfair" due to communication and synchronisation overhead
      → relative speed-up
      → relative efficiency
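A small worked example (not on the original slide) with assumed timings, illustrating the two quantities just defined:

```latex
% assumed measurements: T(1) = 100\,\mathrm{s},\; p = 8,\; T(8) = 16\,\mathrm{s}
S(8) = \frac{T(1)}{T(8)} = \frac{100}{16} = 6.25,
\qquad
E(8) = \frac{S(8)}{8} \approx 0.78
```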


Quantitative Performance Evaluation
. scalability
  . objective: adding further processing elements to the system shall reduce the execution time without any program modifications
  . i.e. a linear performance increase with an efficiency close to 1
  . important for the scalability is a sufficient problem size
    . one porter may carry one suitcase in a minute
    . 60 porters won't do it in a second
    . but 60 porters may carry 60 suitcases in a minute
  . in case of a fixed problem size and an increasing amount of processors, saturation will occur for a certain value of p, hence scalability is limited
  . when scaling the amount of processors together with the problem size (so-called scaled problem analysis) this effect will not appear for well scalable hard- and software systems

Quantitative Performance Evaluation
. AMDAHL's law
  . probably the most important and most famous estimate for the speed-up (even if quite pessimistic)
  . underlying model
    . each program has a sequential part s, 0 ≤ s ≤ 1, that can only be executed in a sequential way: synchronisation, data I/O, …
    . furthermore, each program consists of a parallelisable part 1−s that can be executed in parallel by several processes; finding the maximum value within a set of numbers, e.g.
  . hence, the execution time for the parallel program executed on p processors can be written as

      T(p) = s·T(1) + ((1−s)/p)·T(1)



Quantitative Performance Evaluation
. AMDAHL's law (cont'd)
  . the speed-up can thus be computed as

      S(p) = T(1) / T(p) = T(1) / (s·T(1) + ((1−s)/p)·T(1)) = 1 / (s + (1−s)/p)

  . when increasing p we finally get

      lim (p→∞) S(p) = lim (p→∞) 1 / (s + (1−s)/p) = 1/s

    → speed-up is bounded: S(p) ≤ 1/s
  . the sequential part can have a dramatic impact on the speed-up
  . therefore central effort of all (parallel) algorithms: keep s small
  . many parallel programs have a small sequential part (s < 0.1)

Quantitative Performance Evaluation
. AMDAHL's law (cont'd)
  . example: s = 0.1 (numbers below)
  . independent from p the speed-up is bounded by this limit
  . where's the error?

      [plot: AMDAHL's law – speed-up S(p) over # processes (0…100) for s = 0.1; the curve saturates below the bound 1/s = 10]
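Plugging the example s = 0.1 into the formula above (added numbers, not on the original slide) shows how quickly the bound is approached:

```latex
S(10)  = \frac{1}{0.1 + 0.9/10}  = \frac{1}{0.19}  \approx 5.3,
\qquad
S(100) = \frac{1}{0.1 + 0.9/100} = \frac{1}{0.109} \approx 9.2,
\qquad
\lim_{p\to\infty} S(p) = \frac{1}{s} = 10
```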


Quantitative Performance Evaluation
. GUSTAFSON's law
  . addresses the shortcomings of AMDAHL's law as it states that any sufficiently large problem can be efficiently parallelised
  . instead of a fixed problem size it supposes a fixed time concept
  . underlying model
    . execution time on the parallel machine is normalised to 1
    . this contains a non-parallelisable part α, 0 ≤ α ≤ 1
  . hence, the execution time for the sequential program on the monoprocessor can be written as

      T(1) = α + p·(1−α)

  . the speed-up can thus be computed as

      S(p) = T(1) / T(p) = α + p·(1−α) = p + (1−p)·α

Quantitative Performance Evaluation
. GUSTAFSON's law (cont'd)
  . difference to AMDAHL
    . sequential part s(p) is not constant, but gets smaller with increasing p

        s(p) = α / (α + p·(1−α)),   s(p) ∈ (0, 1]

    . often more realistic, because more processors are used for a larger problem size, and here parallelisable parts typically increase (more computations, less declarations, …)
    . speed-up is not bounded for increasing p (see the worked example below)
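A small worked example (not on the original slide) with an assumed serial fraction α = 0.1, contrasting the scaled speed-up with Amdahl's bound for the same fraction:

```latex
% assumed \alpha = 0.1 and p = 100 processors
S(100) = 100 + (1 - 100)\cdot 0.1 = 90.1
\qquad\text{(Amdahl, fixed problem size: } S \le 1/s = 10\text{)}
```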



Quantitative Performance Evaluation
. GUSTAFSON's law (cont'd)
  . some more thoughts about speed-up
    . theory tells: a superlinear speed-up does not exist
      . each parallel algorithm can be simulated on a monoprocessor system by emulating in a loop always the next step of a processor from the multiprocessor system
    . but superlinear speed-up can be observed
      . when improving an inferior sequential algorithm
      . when a parallel program (that does not fit into the main memory of the monoprocessor system) completely runs in cache and main memory of the nodes of the multiprocessor system

Quantitative Performance Evaluation
. communication–computation ratio (CCR)
  . important quantity measuring the success of a parallelisation
  . relation of pure communication time and pure computing time
  . a small CCR is favourable
  . typically: CCR decreases with increasing problem size
  . example (worked through below)
    . N×N matrix distributed among p processors (N/p rows each)
    . iterative method: in each step, each matrix element is replaced by the average of its eight neighbour values
    . hence, the two neighbouring rows are always necessary
    . computation time: 8·N·N/p
    . communication time: 2·N
    . CCR: p/(4·N) – what does this mean?
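Evaluating the ratio for assumed values (not on the original slide) makes the statement concrete:

```latex
\mathrm{CCR} = \frac{2N}{8N^2/p} = \frac{p}{4N},
\qquad\text{e.g. } N = 1000,\ p = 16:\ \mathrm{CCR} = \frac{16}{4000} = 0.004
```

One way to read the result: for a fixed number of processors the CCR shrinks as the problem size N grows, i.e. communication becomes relatively cheaper, which matches the remark above that the CCR typically decreases with increasing problem size.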

Twelve ways…
…to fool the masses when giving performance results on parallel computers.
—David H. Bailey, NASA Ames Research Centre, 1991

1. Quote only 32-bit performance results, not 64-bit results.
2. Present performance figures for an inner kernel, and then represent these figures as the performance of the entire application.
3. Quietly employ assembly code and other low-level language constructs.
4. Scale up the problem size with the number of processors, but omit any mention of this fact.
5. Quote performance results projected to a full system.
6. Compare your results against scalar, unoptimised codes on Crays.
7. When direct run time comparisons are required, compare with an old code on an obsolete system.
8. If MFLOPS rates must be quoted, base the operation count on the parallel implementation, not on the best sequential implementation.
9. Quote performance in terms of processor utilisation, parallel speed-ups or MFLOPS per dollar.
10. Mutilate the algorithm used in the parallel implementation to match the architecture.
11. Measure parallel run times on a dedicated system, but measure conventional run times in a busy environment.
12. If all else fails, show pretty pictures and animated videos, and don't talk about performance.
