Technische Universität München
High Performance Computing – Programming Paradigms and Scalability
Part 1: Introduction

PD Dr. rer. nat. habil. Ralf-Peter Mundani
Computation in Engineering (CiE) / Scientific Computing (SCCS)
Summer Term 2015

General Remarks
- Ralf-Peter Mundani
  - email: [email protected], phone: 289–25057, room: 3181
  - consultation-hour: by appointment
  - lecture: Tuesday, 12:00—13:30, room 02.07.023
- Christoph Riesinger
  - email: [email protected]
  - exercise: Wednesday, 10:15—11:45, room 02.07.023 (fortnightly)
- examination
  - written, 90 minutes
  - all printed/written materials allowed (no electronic devices)
- materials: http://www5.in.tum.de
PD Dr. Ralf-Peter Mundani, High Performance Computing, Summer Term 2015
General Remarks
- content
  - part 1: introduction
  - part 2: high-performance networks
  - part 3: foundations
  - part 4: shared-memory programming
  - part 5: distributed-memory programming
  - part 6: examples of parallel algorithms

Overview
- motivation
- hardware excursion
- supercomputers
- classification of parallel computers
- quantitative performance evaluation
If one ox could not do the job they did not try to grow a bigger ox, but used two oxen. —Grace Murray Hopper
Motivation
- numerical simulation: from phenomena to predictions
  - starting point: physical phenomenon / technical process
  1. modelling: determination of parameters, expression of relations
  2. numerical treatment: model discretisation, algorithm development
  3. implementation: software development, parallelisation
  4. visualisation: illustration of abstract simulation results
  5. validation: comparison of results with reality
  6. embedding: insertion into working process
  - disciplines involved: mathematics, computer science, application

- why numerical simulation?
  - because experiments are sometimes impossible
    - life cycle of galaxies, weather forecast, terror attacks, e.g.
    - (picture: bomb attack on WTC, 1993)
  - because experiments are sometimes not welcome
    - avalanches, nuclear tests, medicine, e.g.
Motivation
- why numerical simulation? (cont’d)
  - because experiments are sometimes very costly & time consuming
    - protein folding, material sciences, e.g.
    - (picture: Mississippi basin model, Jackson, MS)
  - because experiments are sometimes more expensive
    - aerodynamics, crash test, e.g.

- why parallel programming and HPC?
  - complex problems (especially the so-called “grand challenges”) demand more computing power
    - climate or geophysics simulation (tsunami, e.g.)
    - structure or flow simulation (crash test, e.g.)
    - development systems (CAD, e.g.)
    - large data analysis (Large Hadron Collider at CERN, e.g.)
    - military applications (crypto analysis, e.g.)
  - performance increase due to
    - faster hardware, more memory (“work harder”)
    - more efficient algorithms, optimisation (“work smarter”)
    - parallel computing (“get some help”)
Motivation
- objectives (in case all resources were available N times)
  - throughput: compute N problems simultaneously
    - running N instances of a sequential program with different data sets (“embarrassing parallelism”); SETI@home, e.g.
    - drawback: limited resources of single nodes
  - response time: compute one problem in a fraction (1/N) of the time
    - running one instance (i.e. N processes) of a parallel program jointly solving a problem; finding prime numbers, e.g.
    - drawback: writing a parallel program; communication
  - problem size: compute one problem with N-times larger data
    - running one instance (i.e. N processes) of a parallel program, using the sum of all local memories for computing larger problem sizes; iterative solution of SLE, e.g.
    - drawback: writing a parallel program; communication

- levels of parallelism
  - qualitative meaning: level(s) on which work is done in parallel
  - ordered by granularity (coarse to fine): program level, process level, block level, instruction level, sub-instruction level
Motivation
- levels of parallelism (cont’d)
  - program level
    - parallel processing of different programs
    - independent units without any shared data
    - organised by the OS
  - block level
    - blocks of instructions are executed in parallel
    - each block consists of few instructions and shares data with others
    - communication via shared variables; synchronisation mechanisms
    - the term block often refers to a light-weight process (thread)
  - process level
    - a program is subdivided into processes to be executed in parallel
    - each process consists of a larger amount of sequential instructions and some private data
    - communication in most cases necessary (data exchange, e.g.)
    - the term process often refers to a heavy-weight process
  - instruction level
    - parallel execution of machine instructions
    - optimising compilers can increase this potential by modifying the order of commands
  - sub-instruction level
    - instructions are further subdivided into units to be executed in parallel or via overlapping (vector operations, e.g.)
Overview
- motivation
- hardware excursion
- supercomputers
- classification of parallel computers
- quantitative performance evaluation

Hardware Excursion
- definition of parallel computers
  - “A collection of processing elements that communicate and cooperate to solve large problems” (ALMASI and GOTTLIEB, 1989)
- possible appearances of such processing elements
  - specialised units (steps of a vector pipeline, e.g.)
  - parallel features in modern monoprocessors (instruction pipelining, superscalar architectures, VLIW, multithreading, multicore, …)
  - several uniform arithmetical units (processing elements of array computers, GPGPUs, e.g.)
  - complete stand-alone computers connected via LAN (workstation or PC clusters, so-called virtual parallel computers)
  - parallel computers or clusters connected via WAN (so-called metacomputers)
Hardware Excursion
- instruction pipelining
  - instruction execution involves several operations
    1. instruction fetch (IF)
    2. decode (DE)
    3. fetch operands (OP)
    4. execute (EX)
    5. write back (WB)
    which are executed successively
  - hence, only one part of the CPU works at a given moment

- instruction pipelining (cont’d)
  - observation: while processing a particular stage of an instruction, the other stages are idle
  - hence, multiple instructions can be overlapped in execution → instruction pipelining (similar to assembly lines)
  - advantage: no additional hardware necessary

        time →
        instruction N      IF DE OP EX WB
        instruction N+1       IF DE OP EX WB
        instruction N+2          IF DE OP EX WB
        instruction N+3             IF DE OP EX WB
        instruction N+4                IF DE OP EX WB
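The cycle counts behind this picture are easy to reproduce. The following Python sketch is not part of the slides; it assumes an ideal five-stage pipeline with one stage per clock cycle and no hazards or stalls, and compares strictly successive execution with pipelined execution:

```python
# Idealised model (assumptions for illustration): 5 stages, one stage
# per clock cycle, no hazards or stalls.

STAGES = ["IF", "DE", "OP", "EX", "WB"]

def cycles_sequential(n, k=len(STAGES)):
    """Each instruction runs all k stages before the next one starts."""
    return n * k

def cycles_pipelined(n, k=len(STAGES)):
    """Stages overlap: fill the pipeline once, then retire 1 instruction/cycle."""
    return k + (n - 1)

n = 100
print(cycles_sequential(n))  # 500
print(cycles_pipelined(n))   # 104
# the speed-up tends to k = 5 for large n
```

For large instruction counts the speed-up approaches the number of stages, which is why deeper pipelines were long an easy source of performance.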
Hardware Excursion
- superscalar
  - faster CPU throughput due to simultaneous execution of instructions within one clock cycle via redundant functional units (ALU, multiplier, …)
  - a dispatcher decides (during runtime) which instructions read from memory can be executed in parallel and dispatches them to different functional units
  - for instance, PowerPC 970 (4 ALU, 2 FPU)
    (diagram: instructions 1–4 dispatched to four ALUs, instructions A and B to two FPUs)

- superscalar (cont’d)
  - pipelining for superscalar architectures is also possible
    (diagram: instructions N to N+9 passing through IF DE OP EX WB, several per cycle)
  - but the performance improvement is limited (intrinsic parallelism)
Hardware Excursion
- very long instruction word (VLIW)
  - in contrast to superscalar architectures, the compiler groups parallel executable instructions during compilation (pipelining still possible)
  - advantage: no additional hardware logic necessary
  - drawback: not always fully usable (→ dummy filling (NOP))
    (diagram: one VLIW instruction containing instr. 1, instr. 2, instr. 3, instr. 4)

- vector units
  - simultaneous execution of one instruction on a one-dimensional array of data (→ vector)
  - VUs first appeared in the 1970s and were the basis of most supercomputers in the 1980s and 1990s
    (diagram: vector registers, element-wise operation (A1, A2, …, AN) ∘ (B1, B2, …, BN) = (C1, C2, …, CN))
  - specialised hardware is very expensive
  - limited application areas (mostly CFD, CSD, …)
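To make the idea concrete, here is a small Python sketch, illustrative only and not from the slides: one "vector instruction" applies the same operation to whole operand arrays, and a toy cycle model (with an invented start-up latency) shows why a pipelined vector unit beats element-by-element scalar execution:

```python
# Illustrative sketch: a "vector instruction" operating on whole arrays,
# plus a toy cycle model (assumed start-up latency of 4 cycles, then one
# result per cycle) vs. a scalar unit paying the full latency per element.

def vector_add(a, b):
    """C = A + B as one element-wise vector operation."""
    assert len(a) == len(b)
    return [x + y for x, y in zip(a, b)]

def scalar_cycles(n, per_op=4):
    return n * per_op          # full latency paid for every element

def vector_cycles(n, startup=4):
    return startup + n         # pipeline fills once, then 1 result/cycle

print(vector_add([1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))  # [11.0, 22.0, 33.0]
print(scalar_cycles(1000), vector_cycles(1000))          # 4000 1004
```

The longer the vector, the better the start-up latency is amortised, which is why vector machines favoured large, regular data sets (CFD, CSD, …).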
Hardware Excursion
- dual core, quad core, many core, and multicore
  - observation: increasing frequency f (and thus core voltage v) over past years
  - problem: thermal power dissipation P ~ f·v²

- dual core, quad core, many core, and multicore (cont’d)
  - a 25% reduction in performance (i.e. core voltage) leads to an approx. 50% reduction in dissipation
    (diagram: dissipation vs. performance for a normal CPU and a reduced CPU)
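This rule of thumb can be checked against the stated relation P ~ f·v². A small Python sketch, where the scaling factors are assumptions for illustration:

```python
# The slide states P ~ f * v^2 for thermal power dissipation; the
# scaling factors below are assumptions for illustration.

def relative_power(f_scale, v_scale):
    """P_new / P_old when frequency and core voltage are scaled."""
    return f_scale * v_scale ** 2

print(relative_power(1.0, 0.75))   # 0.5625: ~44% less, voltage only
print(relative_power(0.75, 0.75))  # 0.421875: ~58% less if f drops too
```

Both cases bracket the "approx. 50%" reduction claimed above, which is the headroom used to place a second core on the same die.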
Hardware Excursion
- dual core, quad core, many core, and multicore (cont’d)
  - idea: installation of two cores per die with the same dissipation as a single-core system
    (diagram: dissipation vs. performance, single core vs. dual core)

- single vs. dual / quad core
    (diagram:
     single core: core 0 — L1 — L2 — FSB
     dual core:   core 0, core 1 — L1 each — shared L2 — FSB
     quad core:   core 0 … core 3 — L1 each — shared L2 — FSB)
  - FSB: front side bus (i.e. connection to memory (via north bridge))
Hardware Excursion
- INTEL Nehalem Core i7
    (diagram: core 0 … core 3, each with L1 and L2 cache, shared L3, QPI; source: www.samrathacks.com)
  - QPI: QuickPath Interconnect replaces the FSB (QPI is a point-to-point interconnection – with a memory controller now on-die – in order to allow both reduced latency and higher bandwidth of up to (theoretically) 25.6 GByte/s data transfer, i.e. ≈ 2× FSB)

- Intel E5-2600 Sandy Bridge series
  - 2 CPUs connected by 2 QPIs (Intel QuickPath Interconnect)
  - QuickPath Interconnect (1 sending and 1 receiving port)
    8 GT/s · 16 bit/T payload · 2 directions / 8 bit/Byte = 32 GB/s max bandwidth per QPI
  - 2 QPI links: 2 · 32 GB/s = 64 GB/s max bandwidth
    (source: G. Wellein, RRZE)
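The bandwidth figure is plain arithmetic; a Python sketch reproducing it (unit handling is simplified to the factors given above):

```python
# Figures as given above: 8 GT/s transfer rate, 16 bit payload per
# transfer, 2 directions, 8 bit per byte.

def qpi_bandwidth_gb_s(gt_per_s=8, payload_bits=16, directions=2):
    """Max bandwidth of one QPI link in GB/s."""
    return gt_per_s * payload_bits * directions / 8

per_link = qpi_bandwidth_gb_s()
print(per_link)       # 32.0 GB/s per QPI link
print(2 * per_link)   # 64.0 GB/s with two links
```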
Overview
- motivation
- hardware excursion
- supercomputers
- classification of parallel computers
- quantitative performance evaluation

Supercomputers
- arrival of clusters
  - in the late eighties, PCs became a commodity market with rapidly increasing performance, mass production, and decreasing prices
  - growing attractiveness for parallel computers
  - 1994: Beowulf, the first parallel computer built completely out of commodity hardware
    - NASA Goddard Space Flight Centre
    - 16 Intel DX4 processors
    - multiple 10 Mbit Ethernet links
    - Linux with GNU compilers
    - MPI library
  - 1996: a Beowulf cluster performing more than 1 GFlops
  - 1997: a 140-node cluster performing more than 10 GFlops
Supercomputers
- supercomputers
  - supercomputing or high-performance scientific computing as the most important application of the big number crunchers
  - national initiatives due to huge budget requirements
    - Accelerated Strategic Computing Initiative (ASCI) in the U.S.
      - in the sequel of the nuclear testing moratorium in 1992/93
      - decision: develop, build, and install a series of five supercomputers of up to $100 million each in the U.S.
      - start: ASCI Red (1997, Intel-based, Sandia National Laboratory, the world’s first TFlops computer)
      - then: ASCI Blue Pacific (1998, LLNL), ASCI Blue Mountain, ASCI White, …
      - meanwhile a new high-end computing memorandum (2004)

- supercomputers (cont’d)
  - federal “Bundeshöchstleistungsrechner” initiative in Germany
    - decision in the mid-nineties
    - three federal supercomputing centres in Germany (Munich, Stuttgart, and Jülich)
    - one new installation every second year (i.e. a six-year upgrade cycle for each centre)
    - the newest one to be among the top 10 of the world
  - overview and state of the art: Top500 list (updated every six months), see http://www.top500.org
  - finally (a somewhat different definition):
    “Supercomputer: Turns CPU-bound problems into I/O-bound problems.” —Ken Batcher
Supercomputers
- MOORE’s law
  - observation by Intel co-founder Gordon E. MOORE (1965), describing an important trend in the history of computer hardware: the number of transistors that can be placed on an integrated circuit increases exponentially, doubling approximately every eighteen months

- some numbers: Top500
    (charts not reproduced)
Supercomputers
- some numbers: Top500 (cont’d)
    (charts not reproduced; caption: “Citius, altius, fortius!”)
Supercomputers
- the 10 fastest supercomputers in the world (by November 2014)
    (table not reproduced)

- The Earth Simulator – world’s #1 from 2002—04
  - installed in 2002 in Yokohama, Japan
  - ES building (approx. 50 m × 65 m × 17 m)
  - based on the NEC SX-6 architecture
  - developed by three governmental agencies
  - highly parallel vector supercomputer
  - consists of 640 nodes (plus 2 control & 128 data-switching nodes)
    - 8 vector processors (8 GFlops each) per node
    - 16 GB shared memory per node
    → 5120 processors (40.96 TFlops peak performance) and 10 TB memory; 35.86 TFlops sustained performance (Linpack)
  - nodes connected by a 640×640 single-stage crossbar (83,200 cables with a total extension of 2400 km; 8 TB/s total bandwidth)
  - further 700 TB disc space and 1.60 PB mass storage
Supercomputers
- BlueGene/L – world’s #1 from 2004—08
  - installed in 2005 at LLNL, CA, USA (beta system in 2004 at IBM)
  - cooperation of DoE, LLNL, and IBM
  - massively parallel supercomputer
  - consists of 65,536 nodes (plus 12 front-end and 1204 I/O nodes)
    - 2 PowerPC 440d processors (2.8 GFlops each) per node
    - 512 MB memory per node
    → 131,072 processors (367.00 TFlops peak performance) and 33.50 TB memory; 280.60 TFlops sustained performance (Linpack)
  - nodes configured as a 3D torus (32 × 32 × 64); global reduction tree for fast operations (global max / sum) in a few microseconds
  - 1024 Gbps link to the global parallel file system
  - further 806 TB disc space; operating system SuSE SLES 9

- Roadrunner – world’s #1 from 2008—09
  - installed in 2008 at LANL, NM, USA
  - installation costs about $120 million
  - first “hybrid” supercomputer
    - dual-core Opteron
    - Cell Broadband Engine
    → 129,600 cores (1456.70 TFlops peak performance) and 98 TB memory; 1144.00 TFlops sustained performance (Linpack)
  - standard processing (file system I/O, e.g.) handled by the Opterons, while mathematically and CPU-intensive tasks are handled by the Cells
  - 2.35 MW power consumption (≈ 437 MFlops per Watt)
  - primary usage: ensure safety and reliability of the nation’s nuclear weapons stockpile; real-time applications (cause & effect in capital markets, renderings of bone structures and tissues as patients are being examined, e.g.)
Supercomputers
- HLRB II (world’s #6 for 04/2006)
  - installed in 2006 at LRZ, Garching
  - installation costs 38 M€; monthly costs approx. 400,000 €
  - upgrade in 2007 (finished)
  - one of Germany’s 3 supercomputers
  - SGI Altix 4700
  - consists of 19 nodes (SGI NUMAlink 2D torus)
    - 256 blades per node (ccNUMA link with partition fat tree)
    - Intel Itanium2 Montecito Dual Core (12.80 GFlops)
    - 4 GB memory per core
    → 9728 cores (62.30 TFlops peak performance) and 39 TB memory; 56.50 TFlops sustained performance (Linpack)
  - footprint 24 m × 12 m; total weight 103 metric tons

- SuperMUC (world’s #4 for 06/2012)
  - installed in 2012 at LRZ, Garching
  - IBM System x iDataPlex
  - (still) one of Germany’s 3 supercomputers
  - consists of 19 islands (Infiniband FDR10 pruned tree with 4:1 intra-island / inter-island ratio)
    - 18 thin islands with 512 nodes each (total 288 TB memory); Sandy Bridge-EP Xeon E5 (2 CPUs (8 cores each) per node)
    - 1 fat island with 205 nodes (total 52 TB memory); Westmere-EX Xeon E7 (4 CPUs (10 cores each) per node)
    → 147,456 cores (3.185 PFlops peak performance – thin islands only); 2.897 PFlops sustained performance (Linpack)
  - footprint 21 m × 26 m; warm-water cooling
Overview
- motivation
- hardware excursion
- supercomputers
- classification of parallel computers
- quantitative performance evaluation

Classification of Parallel Computers
- standard classification according to FLYNN
  - global data and instruction streams as criterion
    - instruction stream: sequence of commands to be executed
    - data stream: sequence of data subject to instruction streams
  - two-dimensional subdivision according to
    - the amount of instructions per time a computer can execute
    - the amount of data elements per time a computer can process
  - hence, FLYNN distinguishes four classes of architectures
    - SISD: single instruction, single data
    - SIMD: single instruction, multiple data
    - MISD: multiple instruction, single data
    - MIMD: multiple instruction, multiple data
  - drawback: very different computers may belong to the same class
Classification of Parallel Computers
- standard classification according to FLYNN (cont’d)
  - SISD
    - one processing unit that has access to one data memory and to one program memory
    - classical monoprocessor following VON NEUMANN’s principle
      (diagram: data memory — processor — program memory)
  - SIMD
    - several processing units, each with separate access to a (shared or distributed) data memory; one program memory
    - synchronous execution of instructions
    - example: array computer, vector computer
    - advantage: easy programming model due to control flow with a strictly synchronous-parallel execution of all instructions
    - drawbacks: specialised hardware necessary; easily becomes outdated due to recent developments on the commodity market
      (diagram: several data memories and processors, one program memory)
Classification of Parallel Computers
- standard classification according to FLYNN (cont’d)
  - MISD
    - several processing units that have access to one data memory; several program memories
    - not a very popular class (mainly for special applications such as digital signal processing)
    - operating on a single stream of data, forwarding results from one processing unit to the next
    - example: systolic array (network of primitive processing elements that “pump” data)
  - MIMD
    - several processing units, each with separate access to a (shared or distributed) data memory; several program memories
    - classification according to (physical) memory organisation
      - shared memory → shared (global) address space
      - distributed memory → distributed (local) address space
    - example: multiprocessor systems, networks of computers
      (diagram: data memories, processors, program memories)
Classification of Parallel Computers
- processor coupling
  - cooperation of processors / computers as well as their shared use of various resources require communication and synchronisation
  - the following types of processor coupling can be distinguished

                                    global memory    distributed memory
        shared address space        MemMS, SMP       Mem-MesMS (hybrid)
        distributed address space                    MesMS

    - memory-coupled multiprocessor systems (MemMS)
    - message-coupled multiprocessor systems (MesMS)

- processor coupling (cont’d)
  - uniform memory access (UMA)
    - each processor P has direct access via the network to each memory module M, with the same access times to all data
    - the standard programming model can be used (i.e. no explicit send / receive of messages necessary)
    - communication and synchronisation via shared variables (inconsistencies (write conflicts, e.g.) have in general to be prevented by the programmer)
      (diagram: M M M — network — P P P)
Classification of Parallel Computers
- processor coupling (cont’d)
  - symmetric multiprocessor (SMP)
    - only a small amount of processors; in most cases a central bus; one address space (UMA), but bad scalability
    - cache coherence implemented in hardware (i.e. a read always provides a variable’s value from its last write)
    - example: double or quad boards, SGI Challenge
      (diagram: memory M — bus — caches C — processors P)
  - non-uniform memory access (NUMA)
    - memory modules physically distributed among processors
    - shared address space, but access times depend on the location of the data (i.e. local addresses faster than remote addresses)
    - differences in access times are visible in the program
    - example: DSM / VSM, Cray T3E
      (diagram: network connecting processor–memory pairs)
Classification of Parallel Computers
- processor coupling (cont’d)
  - cache-coherent non-uniform memory access (ccNUMA)
    - caches for local and remote addresses; cache coherence implemented in hardware for the entire address space
    - problem with scalability due to frequent cache actualisations
    - example: SGI Origin 2000
      (diagram: network connecting nodes of memory M, cache C, and processor P)
  - cache-only memory access (COMA)
    - each processor has only cache memory
    - the entirety of all cache memories = global shared memory
    - cache coherence implemented in hardware
    - example: Kendall Square Research KSR-1
      (diagram: network connecting cache–processor pairs)
Classification of Parallel Computers
- processor coupling (cont’d)
  - no remote memory access (NORMA)
    - each processor has direct access to its local memory only
    - access to remote memory only possible via explicit message exchange (due to the distributed address space)
    - synchronisation implicitly via the exchange of messages
    - performance improvement between memory and I/O possible due to parallel data transfer (direct memory access, e.g.)
    - example: IBM SP2, ASCI Red / Blue / White
      (diagram: P–M nodes exchanging messages via the network)

- difference between processes and threads
  - process model (NORMA) vs. thread model (UMA, NUMA)
    (diagram: one program (*.exe, *.out, e.g.) per model)
Overview
- motivation
- hardware excursion
- supercomputers
- classification of parallel computers
- quantitative performance evaluation

Quantitative Performance Evaluation
- execution time
  - time T of a parallel program between the start of the execution on one processor and the end of all computations on the last processor
  - during execution all processors are in one of the following states
    - compute: T_comp, time spent for computations
    - communicate: T_comm, time spent for send and receive operations
    - idle: T_idle, time spent waiting (for sending / receiving messages)
  - hence T = T_comp + T_comm + T_idle
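A minimal Python sketch of this decomposition (the per-process timings are invented; T is determined by the process that finishes last):

```python
# Invented per-process timings. T runs until the last processor finishes;
# idle time pads the faster processes, so each process's three components
# sum to the same total T.

def total_time(per_process):
    """per_process: list of (t_comp, t_comm, t_idle) tuples, one per process."""
    return max(t_comp + t_comm + t_idle
               for t_comp, t_comm, t_idle in per_process)

timings = [(8.0, 1.5, 0.5),   # process 0
           (7.0, 2.0, 1.0),   # process 1
           (9.0, 0.5, 0.5)]   # process 2
print(total_time(timings))    # 10.0
```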
Quantitative Performance Evaluation
- comparison multiprocessor vs. monoprocessor
  - correlation of multi- and monoprocessor systems’ performance
  - important: a program that can be executed on both systems
  - definitions
    - P(1): amount of unit operations of the program on the monoprocessor system
    - P(p): amount of unit operations of the program on the multiprocessor system with p processors
    - T(1): execution time of the program on the monoprocessor system (measured in steps or clock cycles)
    - T(p): execution time of the program on the multiprocessor system with p processors (measured in steps or clock cycles)

- comparison multiprocessor vs. monoprocessor (cont’d)
  - simplifying preconditions
    - T(1) = P(1): one operation is executed per step on the monoprocessor system
    - T(p) < P(p) for p ≥ 2: more than one operation can be executed per step on the multiprocessor system with p processors
Quantitative Performance Evaluation
- comparison multiprocessor vs. monoprocessor (cont’d)
  - speed-up: S(p) indicates the improvement in processing speed

        S(p) = T(1) / T(p)   with 1 ≤ S(p) ≤ p

  - efficiency: E(p) indicates the relative improvement in processing speed, normalised by the amount of processors p

        E(p) = S(p) / p   with 1/p ≤ E(p) ≤ 1

  - speed-up and efficiency can be seen in two different ways
    - algorithm-independent: the best known sequential algorithm for the monoprocessor system is compared to the respective parallel algorithm for the multiprocessor system → absolute speed-up / absolute efficiency
    - algorithm-dependent: the parallel algorithm is treated as a sequential one to measure the execution time on the monoprocessor system; “unfair” due to communication and synchronisation overhead → relative speed-up / relative efficiency
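Both quantities are straightforward to compute from measured run times; a Python sketch with invented numbers:

```python
# Invented measurements: T(1) = 100 time units on one processor,
# T(p) = 16 on p = 8 processors.

def speedup(t1, tp):
    """S(p) = T(1) / T(p), bounded by 1 <= S(p) <= p."""
    return t1 / tp

def efficiency(t1, tp, p):
    """E(p) = S(p) / p, bounded by 1/p <= E(p) <= 1."""
    return speedup(t1, tp) / p

print(speedup(100.0, 16.0))        # 6.25
print(efficiency(100.0, 16.0, 8))  # 0.78125
```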
Quantitative Performance Evaluation
- scalability
  - objective: adding further processing elements to the system shall reduce the execution time without any program modifications, i.e. a linear performance increase with an efficiency close to 1
  - important for scalability is a sufficient problem size
    - one porter may carry one suitcase in a minute
    - 60 porters won’t do it in a second
    - but 60 porters may carry 60 suitcases in a minute
  - in case of a fixed problem size and an increasing amount of processors, saturation will occur for a certain value of p; hence scalability is limited
  - when scaling the amount of processors together with the problem size (so-called scaled problem analysis) this effect will not appear for well-scalable hard- and software systems

- AMDAHL’s law
  - probably the most important and most famous estimate for the speed-up (even if quite pessimistic)
  - underlying model
    - each program has a sequential part s, 0 ≤ s ≤ 1, that can only be executed sequentially: synchronisation, data I/O, …
    - furthermore, each program consists of a parallelisable part 1−s that can be executed in parallel by several processes; finding the maximum value within a set of numbers, e.g.
  - hence, the execution time for the parallel program executed on p processors can be written as

        T(p) = s·T(1) + ((1−s)/p)·T(1)
Quantitative Performance Evaluation
- AMDAHL’s law (cont’d)
  - the speed-up can thus be computed as

        S(p) = T(1) / T(p) = T(1) / (s·T(1) + ((1−s)/p)·T(1)) = 1 / (s + (1−s)/p)

  - when increasing p we finally get

        lim (p→∞) S(p) = lim (p→∞) 1 / (s + (1−s)/p) = 1/s

  - the speed-up is bounded: S(p) ≤ 1/s
  - example: s = 0.1 → independent of p, the speed-up is bounded by 1/s = 10
    (figure: speed-up S(p) over the number of processes for AMDAHL’s law with s = 0.1, saturating below 10)
  - where’s the error?
  - the sequential part can have a dramatic impact on the speed-up
  - therefore the central effort of all (parallel) algorithms: keep s small
  - many parallel programs have a small sequential part (s < 0.1)
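Amdahl's bound is easy to tabulate; the following Python sketch reproduces the s = 0.1 example:

```python
# Amdahl's law, S(p) = 1 / (s + (1 - s)/p), for the s = 0.1 example.

def amdahl_speedup(s, p):
    """Speed-up on p processors for a program with sequential part s."""
    return 1.0 / (s + (1.0 - s) / p)

s = 0.1
for p in (1, 10, 100, 1000):
    print(p, round(amdahl_speedup(s, p), 2))
# approaches, but never reaches, the bound 1/s = 10:
# 1 -> 1.0, 10 -> 5.26, 100 -> 9.17, 1000 -> 9.91
```

Note how already at p = 10 more than half of the ideal speed-up is lost; this is the "dramatic impact" of the sequential part mentioned above.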
. GUSTAFSON’s law
  . addresses the shortcomings of AMDAHL’s law as it states that any sufficiently large problem can be efficiently parallelised
  . instead of a fixed problem size it supposes a fixed time concept
  . underlying model
    . the execution time on the parallel machine is normalised to 1
    . this contains a non-parallelisable part σ, 0 ≤ σ ≤ 1
  . hence, the execution time for the sequential program on the monoprocessor can be written as

      T(1) = σ + p·(1 − σ)

  . the speed-up can thus be computed as

      S(p) = T(1)/T(p) = σ + p·(1 − σ) = p + (1 − p)·σ

. GUSTAFSON’s law (cont’d)
  . difference to AMDAHL
    . the sequential part s(p) is not constant, but gets smaller with increasing p

        s(p) = σ / (σ + p·(1 − σ)),  s(p) → 0 for p → ∞

  . often more realistic, because more processors are used for a larger problem size, and here the parallelisable parts typically increase (more computations, less declarations, …)
  . the speed-up is not bounded for increasing p
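The contrast to Amdahl can be illustrated with a small Python sketch (function names and the value σ = 0.1 are assumed example choices, not from the lecture):

```python
# GUSTAFSON's law: the speed-up S(p) = p + (1 - p)*sigma grows without
# bound, while the effective sequential part
# s(p) = sigma/(sigma + p*(1 - sigma)) shrinks with increasing p.

def gustafson_speedup(sigma, p):
    return p + (1 - p) * sigma

def sequential_part(sigma, p):
    return sigma / (sigma + p * (1.0 - sigma))

sigma = 0.1  # assumed example value
for p in (1, 10, 100):
    print(p, round(gustafson_speedup(sigma, p), 2),
          round(sequential_part(sigma, p), 4))
```

For σ = 0.1, 100 processors already yield a speed-up above 90, and the effective sequential fraction has dropped to about 0.001.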
. GUSTAFSON’s law (cont’d)
  . some more thoughts about speed-up
    . theory tells: a superlinear speed-up does not exist
      . each parallel algorithm can be simulated on a monoprocessor system by emulating in a loop always the next step of one processor from the multiprocessor system
    . but a superlinear speed-up can be observed
      . when improving an inferior sequential algorithm
      . when a parallel program (that does not fit into the main memory of the monoprocessor system) completely runs in the caches and main memories of the nodes of the multiprocessor system

. communication-computation ratio (CCR)
  . important quantity measuring the success of a parallelisation
  . ratio of pure communication time to pure computing time
  . a small CCR is favourable
  . typically: the CCR decreases with increasing problem size
  . example
    . N×N matrix distributed among p processors (N/p rows each)
    . iterative method: in each step, each matrix element is replaced by the average of its eight neighbour values
    . hence, the two neighbouring rows are always necessary
    . computation time: 8·N·N/p
    . communication time: 2·N
    . CCR: p/(4·N)
    . what does this mean?
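The example’s numbers can be verified with a small Python sketch (the function name `ccr` and the concrete values N = 1000, p = 8 are ours):

```python
# CCR of the stencil example above: an N x N matrix split into row
# blocks of N/p rows per processor; computation time 8*N*N/p
# (eight-neighbour averaging), communication time 2*N (exchanging
# the two boundary rows); the ratio simplifies to p/(4*N).

def ccr(n, p):
    computation = 8 * n * n / p
    communication = 2 * n
    return communication / computation

print(ccr(1000, 8))  # equals p/(4*N) = 0.002
```

Doubling N halves the CCR, confirming that the ratio improves with growing problem size.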
Twelve ways…
…to fool the masses when giving performance results on parallel computers.
—David H. Bailey, NASA Ames Research Centre, 1991

1. Quote only 32-bit performance results, not 64-bit results.
2. Present performance figures for an inner kernel, and then represent these figures as the performance of the entire application.
3. Quietly employ assembly code and other low-level language constructs.
4. Scale up the problem size with the number of processors, but omit any mention of this fact.
5. Quote performance results projected to a full system.
6. Compare your results against scalar, unoptimised codes on Crays.
7. When direct run time comparisons are required, compare with an old code on an obsolete system.
8. If MFLOPS rates must be quoted, base the operation count on the parallel implementation, not on the best sequential implementation.
9. Quote performance in terms of processor utilisation, parallel speed-ups or MFLOPS per dollar.
10. Mutilate the algorithm used in the parallel implementation to match the architecture.
11. Measure parallel run times on a dedicated system, but measure conventional run times in a busy environment.
12. If all else fails, show pretty pictures and animated videos, and don’t talk about performance.