High Performance Computing – Programming Paradigms and Scalability
Part 1: Introduction

PD Dr. rer. nat. habil. Ralf-Peter Mundani
Computation in Engineering (CiE) / Scientific Computing (SCCS)
Technische Universität München
Summer Term 2015

General Remarks
. Ralf-Peter Mundani
  . email: [email protected], phone: 289–25057, room: 3181
  . consultation-hour: by appointment
  . lecture: Tuesday, 12:00—13:30, room 02.07.023
. Christoph Riesinger
  . email: [email protected]
  . exercise: Wednesday, 10:15—11:45, room 02.07.023 (fortnightly)
. examination
  . written, 90 minutes
  . all printed/written materials allowed (no electronic devices)

. materials: http://www5.in.tum.de


General Remarks
. content
  . part 1: introduction
  . part 2: high-performance networks
  . part 3: foundations
  . part 4: shared-memory programming
  . part 5: distributed-memory programming
  . part 6: examples of parallel algorithms

Overview
. motivation
. hardware excursion
. supercomputers
. classification of parallel computers
. quantitative performance evaluation

If one ox could not do the job they did not try to grow a bigger ox, but used two oxen. —Grace Murray Hopper

Motivation
. numerical simulation: from phenomena to predictions
  . starting point: physical phenomenon / technical process
  1. modelling: determination of parameters, expression of relations
  2. numerical treatment: model discretisation, algorithm development
  3. implementation: software development, parallelisation
  4. visualisation: illustration of abstract simulation results
  5. validation: comparison of results with reality
  6. embedding: insertion into working process
  . disciplines involved: mathematics, computer science, application

Motivation
. why numerical simulation?
  . because experiments are sometimes impossible
    . life cycle of galaxies, weather forecast, terror attacks, e.g.
    [image: bomb attack on WTC (1993)]
  . because experiments are sometimes not welcome
    . avalanches, nuclear tests, medicine, e.g.


Motivation
. why numerical simulation? (cont'd)
  . because experiments are sometimes very costly & time consuming
    . protein folding, material sciences, e.g.
    [image: Mississippi basin model (Jackson, MS)]
  . because experiments are sometimes more expensive
    . aerodynamics, crash test, e.g.

Motivation
. why parallel programming and HPC?
  . complex problems (especially the so-called "grand challenges") demand for more computing power
    . climate or geophysics simulation (tsunami, e.g.)
    . structure or flow simulation (crash test, e.g.)
    . development systems (CAD, e.g.)
    . large data analysis (Large Hadron Collider at CERN, e.g.)
    . military applications (crypto analysis, e.g.)
  . performance increase due to
    . faster hardware, more memory ("work harder")
    . more efficient algorithms, optimisation ("work smarter")
    . parallel computing ("get some help")

Motivation
. objectives (in case all resources would be available N-times)
  . throughput: compute N problems simultaneously
    . running N instances of a sequential program with different data sets ("embarrassing parallelism"); SETI@home, e.g.
    . drawback: limited resources of single nodes
  . response time: compute one problem in a fraction (1/N) of the time
    . running one instance (i.e. N processes) of a parallel program for jointly solving a problem; finding prime numbers, e.g.
    . drawback: writing a parallel program; communication
  . problem size: compute one problem with N-times larger data
    . running one instance (i.e. N processes) of a parallel program, using the sum of all local memories for computing larger problem sizes; iterative solution of SLE, e.g.
    . drawback: writing a parallel program; communication

Motivation
. levels of parallelism
  . qualitative meaning: level(s) on which work is done in parallel
  . from fine to coarse granularity:
    . sub-instruction level
    . instruction level
    . block level
    . process level
    . program level


Motivation
. levels of parallelism (cont'd)
  . program level
    . parallel processing of different programs
    . independent units without any shared data
    . organised by the OS
  . process level
    . a program is subdivided into processes to be executed in parallel
    . each process consists of a larger amount of sequential instructions and some private data
    . communication in most cases necessary (data exchange, e.g.)
    . term of process often referred to as heavy-weight process

Motivation
. levels of parallelism (cont'd)
  . block level (see the sketch below)
    . blocks of instructions are executed in parallel
    . each block consists of few instructions and shares data with others
    . communication via shared variables; synchronisation mechanisms
    . term of block often referred to as light-weight process (thread)
  . instruction level
    . parallel execution of machine instructions
    . optimising compilers can increase this potential by modifying the order of commands
  . sub-instruction level
    . instructions are further subdivided into units to be executed in parallel or via overlapping (vector operations, e.g.)
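A minimal OpenMP sketch (not from the original slides) of block-level parallelism: the threads of one program execute blocks of the loop in parallel and share the array data; the array size and values are illustrative, and the reduction clause is just one of several possible synchronisation mechanisms.

```c
/* block-level parallelism: threads (light-weight processes) of one program
 * execute chunks of the loop in parallel and share the array "data";
 * compile e.g. with: gcc -fopenmp -O2 block_level.c */
#include <stdio.h>
#include <omp.h>

#define N 1000000

static double data[N];

int main(void)
{
    double sum = 0.0;

    for (int i = 0; i < N; i++)          /* initialise shared data */
        data[i] = 1.0;

    /* each thread works on a chunk of iterations; the reduction clause
     * synchronises the concurrent updates of the shared variable "sum" */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += data[i];

    printf("sum = %.1f, threads available: %d\n", sum, omp_get_max_threads());
    return 0;
}
```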

Overview
. motivation
. hardware excursion
. supercomputers
. classification of parallel computers
. quantitative performance evaluation

Hardware Excursion
. definition of parallel computers
  "A collection of processing elements that communicate and cooperate to solve large problems" (ALMASI and GOTTLIEB, 1989)
. possible appearances of such processing elements
  . specialised units (steps of a vector pipeline, e.g.)
  . parallel features in modern monoprocessors (instruction pipelining, superscalar architectures, VLIW, multithreading, multicore, …)
  . several uniform arithmetical units (processing elements of array computers, GPGPUs, e.g.)
  . complete stand-alone computers connected via LAN (workstation or PC clusters, so-called virtual parallel computers)
  . parallel computers or clusters connected via WAN (so-called metacomputers)


Hardware Excursion
. instruction pipelining
  . instruction execution involves several operations
    1. instruction fetch (IF)
    2. decode (DE)
    3. fetch operands (OP)
    4. execute (EX)
    5. write back (WB)
    which are executed successively
  . hence, only one part of the CPU works at a given moment

    time →
    instruction N      IF DE OP EX WB
    instruction N+1                   IF DE OP EX WB
    …

Hardware Excursion
. instruction pipelining (cont'd)
  . observation: while processing a particular stage of an instruction, the other stages are idle
  . hence, multiple instructions are overlapped in execution → instruction pipelining (similar to assembly lines; see the estimate below)
  . advantage: no additional hardware necessary

    time →
    instruction N      IF DE OP EX WB
    instruction N+1       IF DE OP EX WB
    instruction N+2          IF DE OP EX WB
    instruction N+3             IF DE OP EX WB
    instruction N+4                IF DE OP EX WB
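A standard back-of-the-envelope estimate (not on the original slide), assuming k pipeline stages of equal length and n independent instructions:

```latex
% without pipelining: n \cdot k cycles; with pipelining: k + (n - 1) cycles
S_{\text{pipe}}(n) = \frac{n \cdot k}{k + n - 1} \xrightarrow{\;n \to \infty\;} k,
\qquad\text{e.g. } k = 5,\; n = 100:\; S_{\text{pipe}} = \frac{500}{104} \approx 4.8
```

So for long instruction streams the gain approaches the number of pipeline stages, which is why the five-stage diagrams above nearly quintuple the throughput.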

Hardware Excursion
. superscalar
  . faster CPU throughput due to simultaneous execution of instructions within one clock cycle via redundant functional units (ALU, multiplier, …)
  . dispatcher decides (during runtime) which instructions read from memory can be executed in parallel and dispatches them to different functional units
  . for instance, PowerPC 970 (4 × ALU, 2 × FPU)

    [figure: dispatcher assigning instr. 1–4 to four ALUs and instr. A, B to two FPUs]

  . but, performance improvement is limited (intrinsic parallelism)

Hardware Excursion
. superscalar (cont'd)
  . pipelining for superscalar architectures also possible

    [figure: two instructions issued per clock cycle, each flowing through the IF–DE–OP–EX–WB pipeline (instructions N … N+9)]


Hardware Excursion
. very long instruction word (VLIW)
  . in contrast to superscalar architectures, the compiler groups parallel executable instructions during compilation (pipelining still possible)
  . advantage: no additional hardware logic necessary
  . drawback: not always fully useable (→ dummy filling (NOP))

    [figure: one VLIW instruction containing instr. 1, instr. 2, instr. 3, instr. 4]

Hardware Excursion
. vector units
  . simultaneous execution of one instruction on a one-dimensional array of data (→ vector; see the C sketch below)
  . vector units first appeared in the 1970s and were the basis of most supercomputers in the 1980s and 1990s

    (A1 + B1, A2 + B2, A3 + B3, …, AN−1 + BN−1, AN + BN)^T = (C1, C2, C3, …, CN−1, CN)^T
    [figure: vector registers holding elements 1, 2, 3, …, N−1, N]

  . specialised hardware → very expensive
  . limited application areas (mostly CFD, CSD, …)
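As an illustration (not from the original slides), the element-wise vector addition C = A + B written as a plain C loop; on a vector unit or a CPU with SIMD extensions a compiler can map such a loop to vector instructions that process several elements per instruction (e.g. with gcc -O3).

```c
/* element-wise vector addition c[i] = a[i] + b[i]; the "restrict"
 * qualifiers tell the compiler the arrays do not overlap, which helps
 * it emit vector (SIMD) instructions for this loop */
#include <stddef.h>

void vector_add(const double *restrict a, const double *restrict b,
                double *restrict c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```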

Hardware Excursion
. dual core, quad core, many core, and multicore
  . observation: increasing frequency f (and thus core voltage v) over past years → problem: thermal power dissipation P ~ f·v²

Hardware Excursion
. dual core, quad core, many core, and multicore (cont'd)
  . a 25% reduction in performance (i.e. core voltage) leads to approx. 50% reduction in dissipation

    [chart: dissipation and performance of a normal CPU vs. a voltage-reduced CPU]


Hardware Excursion
. dual core, quad core, many core, and multicore (cont'd)
  . idea: installation of two cores per die with same dissipation as single core system

    [chart: dissipation and performance of a single core vs. a dual core system]

Hardware Excursion
. dual core, quad core, many core, and multicore (cont'd)
  . single vs. dual vs. quad core

    [figure: single core (core 0 with private L1 and L2), dual core (cores 0–1 with private L1 and a shared L2), quad core (cores 0–3 with private L1 and two shared L2 caches, one per core pair); each chip attached to the FSB]

  . FSB: front side bus (i.e. connection to memory (via north bridge))

Hardware Excursion
. INTEL Nehalem Core i7

    [figure: four cores (core 0–3), each with private L1/L2 cache, a shared L3 cache, and a QPI link; source: www.samrathacks.com]

  . QPI: QuickPath Interconnect replaces FSB (QPI is a point-to-point interconnection – with a memory controller now on-die – in order to allow both reduced latency and higher bandwidth → up to (theoretically) 25.6 GB/s data transfer, i.e. 2 × FSB)

Hardware Excursion
. Intel E5-2600 Sandy Bridge Series
  . 2 CPUs connected by 2 QPIs (Intel QuickPath Interconnect)
  . QuickPath Interconnect (1 sending and 1 receiving port)
    . 8 GT/s · 16 Bit/T payload · 2 directions / 8 Bit/Byte = 32 GB/s max bandwidth / QPI
    . 2 QPI links: 2 · 32 GB/s = 64 GB/s max bandwidth

    [figure: two-socket E5-2600 configuration; source: G. Wellein, RRZE]


Overview
. motivation
. hardware excursion
. supercomputers
. classification of parallel computers
. quantitative performance evaluation

Supercomputers
. arrival of clusters
  . in the late eighties, PCs became a commodity market with rapidly increasing performance, mass production, and decreasing prices
  . growing attractiveness for parallel computers
  . 1994: Beowulf, the first parallel computer built completely out of commodity hardware
    . NASA Goddard Space Flight Centre
    . 16 Intel DX4 processors
    . multiple 10 Mbit Ethernet links
    . Linux with GNU compilers
    . MPI library
  . 1996: Beowulf cluster performing more than 1 GFlops
  . 1997: a 140-node cluster performing more than 10 GFlops

Supercomputers
. supercomputers
  . supercomputing or high-performance scientific computing as the most important application of the big number crunchers
  . national initiatives due to huge budget requirements
    . Accelerated Strategic Computing Initiative (ASCI) in the U.S.
      . in the sequel of the nuclear testing moratorium in 1992/93
      . decision: develop, build, and install a series of five supercomputers of up to $100 million each in the U.S.
      . start: ASCI Red (1997, Intel-based, Sandia National Laboratory, the world's first TFlops computer)
      . then: ASCI Blue Pacific (1998, LLNL), ASCI Blue Mountain, ASCI White, …
      . meanwhile new high-end computing memorandum (2004)

Supercomputers
. supercomputers (cont'd)
  . federal "Bundeshöchstleistungsrechner" initiative in Germany
    . decision in the mid-nineties
    . three federal supercomputing centres in Germany (Munich, Stuttgart, and Jülich)
    . one new installation every second year (i.e. a six-year upgrade cycle for each centre)
    . the newest one to be among the top 10 of the world
  . overview and state of the art: Top500 list (updated every six months), see http://www.top500.org
  . finally (a somewhat different definition)
    "Supercomputer: Turns CPU-bound problems into I/O-bound problems." —Ken Batcher


Supercomputers
. MOORE's law
  . observation of Intel co-founder Gordon E. MOORE, describes an important trend in the history of computer hardware (1965)
  . the number of transistors that can be placed on an integrated circuit is increasing exponentially, doubling approximately every eighteen months

Supercomputers
. some numbers: Top500

Supercomputers
. some numbers: Top500 (cont'd)

    [charts: development of the Top500 list – "Citius, altius, fortius!"]

Supercomputers
. the 10 fastest supercomputers in the world (by November 2014)

    [table: Top 10 of the Top500 list, November 2014]

Supercomputers
. The Earth Simulator – world's #1 from 2002—04
  . installed in 2002 in Yokohama, Japan
  . based on NEC SX-6 architecture
  . developed by three governmental agencies
  . highly parallel vector supercomputer
  . ES-building (approx. 50 m × 65 m × 17 m)
  . consists of 640 nodes (plus 2 control & 128 data switching)
    . 8 vector processors (8 GFlops each)
    . 16 GB shared memory
    → 5120 processors (40.96 TFlops peak performance) and 10 TB memory; 35.86 TFlops sustained performance (Linpack)
  . nodes connected by a 640 × 640 single-stage crossbar (83,200 cables with a total extension of 2400 km; 8 TB/s total bandwidth)
  . further 700 TB disc space and 1.60 PB mass storage


. BlueGeneL – world’s #1 from 2004—08 . Roadrunner – world’s #1 from 2008—09 . installed in 2005 at LLNL, CA, USA . installed in 2008 at LANL, NM, USA (beta-system in 2004 at IBM) . installation costs about $120 million . cooperation of DoE, LLNL, and IBM . first “hybrid” supercomputer . massive parallel supercomputer . dual-core Opteron . consists of 65,536 nodes (plus 12 front-end and 1204 IO nodes) . Cell Broadband Engine . 2 PowerPC 440d processors (2.8 GFlops each)  129,600 cores (1456.70 TFlops peak performance) and . 512MB memory 98 TB memory; 1144.00 TFlops sustained performance (Linpack)  131,072 processors (367.00 TFlops peak performance) and . standard processing (file system IO, e. g.) handled by Opteron, 33.50 TB memory; 280.60 TFlops sustained performance (Linpack) while mathematically and CPU-intensive tasks are handled by Cell . nodes configured as 3D torus (32  32  64); global reduction tree for fast . 2.35 MW power consumption ( 437 MFlops per Watt ) operations (global max  sum) in a few microseconds . primarily usage: ensure safety and reliability of nation’s nuclear . 1024 Gbps link to global parallel file system weapons stockpile, real-time applications (cause & effect in capital . further 806 TB disc space; operating system SuSE SLES 9 markets, bone structures and tissues renderings as patients are being examined, e.g.)


Supercomputers
. HLRB II (world's #6 for 04/2006)
  . installed in 2006 at LRZ, Garching
  . installation costs 38 M€
  . monthly costs approx. 400,000 €
  . upgrade in 2007 (finished)
  . one of Germany's 3 supercomputers
  . SGI 4700
    . consists of 19 nodes (SGI NUMAlink 2D torus)
    . 256 blades (ccNUMA link with partition fat tree)
    . Intel Itanium2 Montecito Dual Core (12.80 GFlops)
    . 4 GB memory per core
    → 9728 cores (62.30 TFlops peak performance) and 39 TB memory; 56.50 TFlops sustained performance (Linpack)
  . footprint 24 m × 12 m; total weight 103 metric tons

Supercomputers
. SuperMUC (world's #4 for 06/2012)
  . installed in 2012 at LRZ, Garching
  . IBM System x iDataPlex
  . (still) one of Germany's 3 supercomputers
  . consists of 19 islands (Infiniband FDR10 pruned tree with 4:1 intra-island / inter-island ratio)
    . 18 thin islands with 512 nodes each (total 288 TB memory)
      . Sandy Bridge-EP Xeon E5 (2 CPUs (8 cores each) / node)
    . 1 fat island with 205 nodes (total 52 TB memory)
      . Westmere-EX Xeon E7 (4 CPUs (10 cores each) / node)
    → 147,456 cores (3.185 PFlops peak performance – thin islands only); 2.897 PFlops sustained performance (Linpack)
  . footprint 21 m × 26 m; warm water cooling


Overview
. motivation
. hardware excursion
. supercomputers
. classification of parallel computers
. quantitative performance evaluation

Classification of Parallel Computers
. standard classification according to FLYNN
  . global data and instruction streams as criterion
    . instruction stream: sequence of commands to be executed
    . data stream: sequence of data subject to instruction streams
  . two-dimensional subdivision according to
    . amount of instructions per time a computer can execute
    . amount of data elements per time a computer can process
  . hence, FLYNN distinguishes four classes of architectures
    . SISD: single instruction, single data
    . SIMD: single instruction, multiple data
    . MISD: multiple instruction, single data
    . MIMD: multiple instruction, multiple data
  . drawback: very different computers may belong to the same class


Classification of Parallel Computers
. standard classification according to FLYNN (cont'd)
  . SISD
    . one processing unit that has access to one data memory and to one program memory
    . classical monoprocessor following VON NEUMANN's principle

      [diagram: data memory — processor — program memory]

Classification of Parallel Computers
. standard classification according to FLYNN (cont'd)
  . SIMD
    . several processing units, each with separate access to a (shared or distributed) data memory; one program memory
    . synchronous execution of instructions
    . example: array computer, vector computer
    . advantages: easy programming model due to control flow with a strict synchronous-parallel execution of all instructions
    . drawbacks: specialised hardware necessary, easily becomes outdated due to recent developments at the commodity market

      [diagram: several processors, each attached to its own data memory, all fed by one program memory]


Classification of Parallel Computers
. standard classification according to FLYNN (cont'd)
  . MISD
    . several processing units that have access to one data memory; several program memories
    . not very popular class (mainly for special applications such as Digital Signal Processing)
    . operating on a single stream of data, forwarding results from one processing unit to the next
    . example: systolic array (network of primitive processing elements that "pump" data)

Classification of Parallel Computers
. standard classification according to FLYNN (cont'd)
  . MIMD
    . several processing units, each with separate access to a (shared or distributed) data memory; several program memories
    . classification according to (physical) memory organisation
      . shared memory → shared (global) address space
      . distributed memory → distributed (local) address space
    . example: multiprocessor systems, networks of computers

      [diagram: several processors, each with its own data memory and program memory]


Classification of Parallel Computers
. processor coupling
  . cooperation of processors / computers as well as their shared use of various resources require communication and synchronisation
  . the following types of processor coupling can be distinguished
    . memory-coupled multiprocessor systems (MemMS)
    . message-coupled multiprocessor systems (MesMS)

      [figure: classification by memory organisation (global vs. distributed memory) and address space (shared vs. distributed): MemMS / SMP, Mem-MesMS (hybrid), MesMS]

Classification of Parallel Computers
. processor coupling (cont'd)
  . uniform memory access (UMA)
    . each processor P has direct access via the network to each memory module M with same access times to all data
    . standard programming model can be used (i.e. no explicit send / receive of messages necessary)
    . communication and synchronisation via shared variables (inconsistencies (write conflicts, e.g.) have to be prevented in general by the programmer; see the sketch below)

      [diagram: memory modules M connected via a network to processors P]
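A minimal sketch (not from the original slides) of the write conflict mentioned above and one way to prevent it; the atomic directive is only one of several possible synchronisation mechanisms (critical sections, locks, reductions), and the loop count is illustrative.

```c
/* shared-variable communication on a UMA/SMP system: without
 * synchronisation the concurrent updates of "counter" form a write
 * conflict (race condition); the atomic directive serialises them;
 * compile e.g. with: gcc -fopenmp race_demo.c */
#include <stdio.h>

int main(void)
{
    long counter = 0;              /* shared variable */

    #pragma omp parallel for
    for (long i = 0; i < 1000000; i++) {
        #pragma omp atomic         /* remove this line to observe lost updates */
        counter++;
    }

    printf("counter = %ld (expected 1000000)\n", counter);
    return 0;
}
```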

Classification of Parallel Computers
. processor coupling (cont'd)
  . symmetric multiprocessor (SMP)
    . only a small amount of processors, in most cases a central bus, one address space (UMA), but bad scalability
    . cache-coherence implemented in hardware (i.e. a read always provides a variable's value from its last write)
    . example: double or quad boards, SGI Challenge

      [diagram: processors P with caches C attached to a central bus and one memory M]

Classification of Parallel Computers
. processor coupling (cont'd)
  . non-uniform memory access (NUMA)
    . memory modules physically distributed among processors
    . shared address space, but access times depend on the location of data (i.e. local addresses faster than remote addresses)
    . differences in access times are visible in the program
    . example: DSM / VSM, T3E

      [diagram: processors P, each with a local memory module M, connected via a network]


Classification of Parallel Computers
. processor coupling (cont'd)
  . cache-coherent non-uniform memory access (ccNUMA)
    . caches for local and remote addresses; cache-coherence implemented in hardware for the entire address space
    . problem with scalability due to frequent cache actualisations
    . example: SGI Origin 2000

      [diagram: processors P with caches C and local memory modules M, connected via a network]

Classification of Parallel Computers
. processor coupling (cont'd)
  . cache-only memory access (COMA)
    . each processor has only cache-memory
    . entirety of all cache-memories = global shared memory
    . cache-coherence implemented in hardware
    . example: Kendall Square Research KSR-1

      [diagram: processors P with caches C only, connected via a network]

Classification of Parallel Computers
. processor coupling (cont'd)
  . no remote memory access (NORMA)
    . each processor has direct access to its local memory only
    . access to remote memory only possible via explicit message exchange (due to distributed address space)
    . synchronisation implicitly via the exchange of messages (see the MPI sketch below)
    . performance improvement between memory and I/O due to parallel data transfer (Direct Memory Access, e.g.) possible
    . example: IBM SP2, ASCI Red / Blue / White

      [diagram: nodes, each consisting of a processor P and local memory M, exchanging messages via a network]

Classification of Parallel Computers
. difference between processes and threads

      [diagram: a program (*.exe, *.out, e.g.) executed either in the process model – several processes with separate address spaces, typical for NORMA – or in the thread model – several threads within one shared address space, typical for UMA, NUMA]
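A minimal MPI sketch (not from the original slides) of the NORMA programming model: data in a remote memory can only be reached via explicit message exchange, here an MPI_Send / MPI_Recv pair between two processes; the transferred value is illustrative.

```c
/* explicit message exchange between distributed address spaces:
 * rank 0 sends one value to rank 1; compile with mpicc and run with
 * e.g. "mpirun -np 2 ./norma_demo" */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                   /* local data of rank 0 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```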


Overview
. motivation
. hardware excursion
. supercomputers
. classification of parallel computers
. quantitative performance evaluation

Quantitative Performance Evaluation
. execution time
  . time T of a parallel program between start of the execution on one processor and end of all computations on the last processor
  . during execution all processors are in one of the following states
    . compute
      . TCOMP: time spent for computations
    . communicate
      . TCOMM: time spent for send and receive operations
    . idle
      . TIDLE: time spent for waiting (for sending / receiving messages)
  . hence T = TCOMP + TCOMM + TIDLE (see the sketch below)
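A minimal MPI sketch (not from the original slides) of how the three contributions to T could be measured per process; the split into compute / communicate / idle phases, the loop counts, and the use of a barrier as the "waiting" phase are illustrative assumptions.

```c
/* rough per-process breakdown of T = TCOMP + TCOMM + TIDLE using MPI_Wtime();
 * the local work loop stands in for computations, the Allreduce for the
 * communication phase, and the barrier time is counted as idle (waiting) time */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    double t0, t_comp = 0.0, t_comm = 0.0, t_idle = 0.0;
    double local = 0.0, global = 0.0;

    MPI_Init(&argc, &argv);

    for (int step = 0; step < 10; step++) {
        t0 = MPI_Wtime();                          /* compute phase */
        for (int i = 0; i < 1000000; i++)
            local += 1.0e-6 * i;
        t_comp += MPI_Wtime() - t0;

        t0 = MPI_Wtime();                          /* communication phase */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        t_comm += MPI_Wtime() - t0;

        t0 = MPI_Wtime();                          /* waiting for the other processes */
        MPI_Barrier(MPI_COMM_WORLD);
        t_idle += MPI_Wtime() - t0;
    }

    printf("T = %.4f s (comp %.4f, comm %.4f, idle %.4f)\n",
           t_comp + t_comm + t_idle, t_comp, t_comm, t_idle);

    MPI_Finalize();
    return 0;
}
```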


Quantitative Performance Evaluation
. comparison multiprocessor ↔ monoprocessor
  . correlation of multi- and monoprocessor systems' performance
  . important: program that can be executed on both systems
  . definitions
    . P(1): amount of unit operations of a program on the monoprocessor system
    . P(p): amount of unit operations of a program on the multiprocessor system with p processors (for p ≥ 2)
    . T(1): execution time of a program on the monoprocessor system (measured in steps or clock cycles)
    . T(p): execution time of a program on the multiprocessor system with p processors (measured in steps or clock cycles)

Quantitative Performance Evaluation
. comparison multiprocessor ↔ monoprocessor (cont'd)
  . simplifying preconditions
    . T(1) = P(1)
      . one operation to be executed in one step on the monoprocessor system
    . T(p) ≤ P(p)
      . more than one operation can be executed in one step on the multiprocessor system with p processors


Quantitative Performance Evaluation
. comparison multiprocessor ↔ monoprocessor (cont'd)
  . speed-up
    . S(p) indicates the improvement in processing speed (see the worked example below)

      S(p) = T(1) / T(p)   with 1 ≤ S(p) ≤ p

  . efficiency
    . E(p) indicates the relative improvement in processing speed
    . improvement is normalised by the amount of processors p

      E(p) = S(p) / p   with 1/p ≤ E(p) ≤ 1

Quantitative Performance Evaluation
. comparison multiprocessor ↔ monoprocessor (cont'd)
  . speed-up and efficiency can be seen in two different ways
    . algorithm-independent
      . best known sequential algorithm for the monoprocessor system is compared to the respective parallel algorithm for the multiprocessor system
      → absolute speed-up
      → absolute efficiency
    . algorithm-dependent
      . parallel algorithm is treated as sequential one to measure the execution time on the monoprocessor system; "unfair" due to communication and synchronisation overhead
      → relative speed-up
      → relative efficiency
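A small worked example (not on the original slide) with assumed timings, illustrating the two quantities just defined:

```latex
% assumed measurements: T(1) = 100\,\mathrm{s},\; p = 8,\; T(8) = 16\,\mathrm{s}
S(8) = \frac{T(1)}{T(8)} = \frac{100}{16} = 6.25,
\qquad
E(8) = \frac{S(8)}{8} \approx 0.78
```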


Quantitative Performance Evaluation
. scalability
  . objective: adding further processing elements to the system shall reduce the execution time without any program modifications
  . i.e. a linear performance increase with an efficiency close to 1
  . important for the scalability is a sufficient problem size
    . one porter may carry one suitcase in a minute
    . 60 porters won't do it in a second
    . but 60 porters may carry 60 suitcases in a minute
  . in case of a fixed problem size and an increasing amount of processors, saturation will occur for a certain value of p, hence scalability is limited
  . when scaling the amount of processors together with the problem size (so-called scaled problem analysis) this effect will not appear for well scalable hard- and software systems

Quantitative Performance Evaluation
. AMDAHL's law
  . probably the most important and most famous estimate for the speed-up (even if quite pessimistic)
  . underlying model
    . each program has a sequential part s, 0 ≤ s ≤ 1, that can only be executed in a sequential way: synchronisation, data I/O, …
    . furthermore, each program consists of a parallelisable part 1−s that can be executed in parallel by several processes; finding the maximum value within a set of numbers, e.g.
  . hence, the execution time for the parallel program executed on p processors can be written as

      T(p) = s·T(1) + ((1−s)/p)·T(1)



Quantitative Performance Evaluation
. AMDAHL's law (cont'd)
  . the speed-up can thus be computed as

      S(p) = T(1) / T(p) = T(1) / (s·T(1) + ((1−s)/p)·T(1)) = 1 / (s + (1−s)/p)

  . when increasing p we finally get

      lim (p→∞) S(p) = lim (p→∞) 1 / (s + (1−s)/p) = 1/s

    → speed-up is bounded: S(p) ≤ 1/s
  . the sequential part can have a dramatic impact on the speed-up
  . therefore central effort of all (parallel) algorithms: keep s small
  . many parallel programs have a small sequential part (s < 0.1)

Quantitative Performance Evaluation
. AMDAHL's law (cont'd)
  . example: s = 0.1 (numbers below)
  . independent from p the speed-up is bounded by this limit
  . where's the error?

      [plot: AMDAHL's law – speed-up S(p) over # processes (0…100) for s = 0.1; the curve saturates below the bound 1/s = 10]
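Plugging the example s = 0.1 into the formula above (added numbers, not on the original slide) shows how quickly the bound is approached:

```latex
S(10)  = \frac{1}{0.1 + 0.9/10}  = \frac{1}{0.19}  \approx 5.3,
\qquad
S(100) = \frac{1}{0.1 + 0.9/100} = \frac{1}{0.109} \approx 9.2,
\qquad
\lim_{p\to\infty} S(p) = \frac{1}{s} = 10
```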


Quantitative Performance Evaluation
. GUSTAFSON's law
  . addresses the shortcomings of AMDAHL's law as it states that any sufficiently large problem can be efficiently parallelised
  . instead of a fixed problem size it supposes a fixed time concept
  . underlying model
    . execution time on the parallel machine is normalised to 1
    . this contains a non-parallelisable part α, 0 ≤ α ≤ 1
  . hence, the execution time for the sequential program on the monoprocessor can be written as

      T(1) = α + p·(1−α)

  . the speed-up can thus be computed as

      S(p) = T(1) / T(p) = α + p·(1−α) = p + (1−p)·α

Quantitative Performance Evaluation
. GUSTAFSON's law (cont'd)
  . difference to AMDAHL
    . sequential part s(p) is not constant, but gets smaller with increasing p

        s(p) = α / (α + p·(1−α)),   s(p) ∈ (0, 1]

    . often more realistic, because more processors are used for a larger problem size, and here parallelisable parts typically increase (more computations, less declarations, …)
    . speed-up is not bounded for increasing p (see the worked example below)
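A small worked example (not on the original slide) with an assumed serial fraction α = 0.1, contrasting the scaled speed-up with Amdahl's bound for the same fraction:

```latex
% assumed \alpha = 0.1 and p = 100 processors
S(100) = 100 + (1 - 100)\cdot 0.1 = 90.1
\qquad\text{(Amdahl, fixed problem size: } S \le 1/s = 10\text{)}
```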



Quantitative Performance Evaluation
. GUSTAFSON's law (cont'd)
  . some more thoughts about speed-up
    . theory tells: a superlinear speed-up does not exist
      . each parallel algorithm can be simulated on a monoprocessor system by emulating in a loop always the next step of a processor from the multiprocessor system
    . but superlinear speed-up can be observed
      . when improving an inferior sequential algorithm
      . when a parallel program (that does not fit into the main memory of the monoprocessor system) completely runs in cache and main memory of the nodes of the multiprocessor system

Quantitative Performance Evaluation
. communication–computation ratio (CCR)
  . important quantity measuring the success of a parallelisation
  . relation of pure communication time and pure computing time
  . a small CCR is favourable
  . typically: CCR decreases with increasing problem size
  . example (worked through below)
    . N×N matrix distributed among p processors (N/p rows each)
    . iterative method: in each step, each matrix element is replaced by the average of its eight neighbour values
    . hence, the two neighbouring rows are always necessary
    . computation time: 8·N·N/p
    . communication time: 2·N
    . CCR: p/(4·N) – what does this mean?
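Evaluating the ratio for assumed values (not on the original slide) makes the statement concrete:

```latex
\mathrm{CCR} = \frac{2N}{8N^2/p} = \frac{p}{4N},
\qquad\text{e.g. } N = 1000,\ p = 16:\ \mathrm{CCR} = \frac{16}{4000} = 0.004
```

One way to read the result: for a fixed number of processors the CCR shrinks as the problem size N grows, i.e. communication becomes relatively cheaper, which matches the remark above that the CCR typically decreases with increasing problem size.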

Twelve ways…
…to fool the masses when giving performance results on parallel computers.
—David H. Bailey, NASA Ames Research Centre, 1991

1. Quote only 32-bit performance results, not 64-bit results.
2. Present performance figures for an inner kernel, and then represent these figures as the performance of the entire application.
3. Quietly employ assembly code and other low-level language constructs.
4. Scale up the problem size with the number of processors, but omit any mention of this fact.
5. Quote performance results projected to a full system.
6. Compare your results against scalar, unoptimised codes on Crays.
7. When direct run time comparisons are required, compare with an old code on an obsolete system.
8. If MFLOPS rates must be quoted, base the operation count on the parallel implementation, not on the best sequential implementation.
9. Quote performance in terms of processor utilisation, parallel speed-ups or MFLOPS per dollar.
10. Mutilate the algorithm used in the parallel implementation to match the architecture.
11. Measure parallel run times on a dedicated system, but measure conventional run times in a busy environment.
12. If all else fails, show pretty pictures and animated videos, and don't talk about performance.
