Design of Parallel and High-Performance Computing
Fall 2017
Lecture: Introduction

Instructor: Torsten Hoefler & Markus Püschel
TA: Salvatore Di Girolamo

Goals of this lecture

 Motivate you!
 Trends
 High-performance computing
 Programming models
 Course overview

Trends

 What doubles …?

[Figure: technology trend data. Source: Wikipedia]

How to increase the compute power?

 Clock Speed!

[Figure: power density (W/cm²) of Intel processors (4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium), 1970 to 2010, on a trajectory from hot plate past nuclear reactor and rocket nozzle toward the Sun's surface. Source: Intel]

How to increase the compute power?

 Clock Speed? Not an option anymore!

[Figure: the same power-density plot. Sources: Intel, Wikipedia]

Evolutions of Processors (Intel)

[Figure: clock frequency of Intel processors, from the 8008/8085/8086 through Pentium, Pentium Pro, Pentium II, Pentium III, and Pentium 4 to Core, Nehalem, Sandy Bridge, and Haswell: exponential growth ("free speedup") until the plateau at ~3 GHz. Source: Wikipedia/Intel/PCGuide]

Evolutions of Processors (Intel)

[Figure: the same frequency plot annotated with parallelism: cores (2, 2, 4, 4, 4, 8, i.e., 8x) and vector units (8x), together ~360 Gflop/s; unlike the "free speedup" of rising clock rates, exploiting this parallelism requires work. Source: Wikipedia/Intel/PCGuide]

[Figure: the same plot annotated with memory bandwidth (normalized). Source: Wikipedia/Intel/PCGuide]

A more complete view

[Figure: long-term growth of computing performance. Source: www.singularity.com]

 Can we do this today?

High-Performance Computing

High-Performance Computing (HPC)

 a.k.a. "Supercomputing"

 Question: define "Supercomputer"!
. "A supercomputer is a computer at the frontline of contemporary processing capacity--particularly speed of calculation." (Wikipedia)
. Usually quite expensive ($s and kWh) and big (space)

 HPC is a quickly growing niche market
. Not all "supercomputers", wide base
. Important enough for vendors to specialize
. Very important in research settings (up to 40% of university spending)

[Figure: performance milestones: Blue Waters, ~13 PF (2012); TaihuLight, ~125 PF (2016); 1 Exaflop ~2023?]

"Goodyear Puts the Rubber to the Road with High Performance Computing"
"High Performance Computing Helps Create New Treatment For Stroke Victims"
"Procter & Gamble: Supercomputers and the Secret Life of Coffee"
"Motorola: Driving the Cellular Revolution With the Help of High Performance Computing"
"Microsoft: Delivering High Performance Computing to the Masses"

Blue Waters in 2012

[Photo: Blue Waters. Source: extremetech.com]

The Top500 List (June 2017)

[Figure: the top of the June 2017 Top500 list]

The Top500 List

 A benchmark, solve Ax=b
. As fast as possible → as big as possible
. Reflects some applications, not all, not even many
. Very good historic data!

 Speed comparison for computing centers, states, countries, nations, continents
. Politicized (sometimes good, sometimes bad)
. Yet, fun to watch
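The benchmark behind the list is HPL, which solves the dense system Ax=b by LU factorization. A rough sketch of why "as big as possible" helps: the reported rate divides HPL's fixed operation count for an n×n system by the measured solve time,

$$ R_{\max} = \frac{\tfrac{2}{3}n^3 + 2n^2}{T_{\text{solve}}} \ \text{flop/s}, $$

so a larger n amortizes communication and other overheads and pushes R_max closer to the machine's theoretical peak R_peak.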


Green Top500 List (June 2017)

[Figure: the June 2017 Green500 list. Source: Jack Dongarra]

Top500: Trends

[Figure: Top500 performance trends, with a single GPU/MIC card marked for comparison. Source: Jack Dongarra]

Piz Daint @ CSCS

[Photo: Piz Daint. More pictures at: http://spcl.inf.ethz.ch/Teaching/2015-dphpc/]

HPC Applications: Scientific Computing

 Most natural sciences are simulation driven or are moving towards simulation
. Theoretical physics (solving the Schrödinger equation, QCD)
. Biology (gene sequencing)
. Chemistry (material science)
. Astronomy (colliding black holes)
. Medicine (protein folding for drug discovery)
. Meteorology (storm/tornado prediction)
. Geology (oil reservoir management, oil exploration)
. and many more … (even Pringles uses HPC)

HPC Applications: Commercial Computing

 Databases, data mining, search
. Amazon, Facebook, Google
 Transaction processing
. Visa, Mastercard
 Decision support
. Stock markets, Wall Street, military applications
 Parallelism in high-end systems and back-ends
. Often throughput-oriented
. Used equipment varies from COTS (Google) to high-end redundant mainframes (banks)

HPC Applications: Industrial Computing

 Aeronautics (airflow, engine, structural mechanics, electromagnetism)
 Automotive (crash, combustion, airflow)
 Computer-aided design (CAD)
 Pharmaceuticals (molecular modeling, protein folding, drug design)
 Petroleum (reservoir analysis)
 Visualization (all of the above, movies, 3D)

What can faster computers do for us?

 Solving bigger problems than we could solve before!
. E.g., gene sequencing and search, simulation of whole cells, mathematics of the brain, …
. The size of the problem grows with the machine power
. → Weak Scaling

 Solve today's problems faster!
. E.g., large (combinatorial) searches, mechanical simulations (aircraft, cars, weapons, …)
. The machine power grows with constant problem size
. → Strong Scaling
(Amdahl's and Gustafson's laws, sketched below after the Parallel vs. Concurrent slide, quantify these two regimes.)

Towards the age of massive parallelism

 Everything goes parallel
. Desktop computers get more cores: 2, 4, 8, soon dozens, hundreds?
. Supercomputers get more PEs (cores, nodes): >10 million today, >50 million on the horizon, 1 billion in a couple of years (after 2020)
 Parallelism is inevitable!

Parallel vs. Concurrent

 Concurrent activities may be executed in parallel
. Example: A1 starts at T1, ends at T2; A2 starts at T3, ends at T4
. Intervals (T1,T2) and (T3,T4) may overlap!
 Parallel activities: A1 is executed while A2 is running

. Usually requires separate resources!
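Here is the promised sketch of the two scaling regimes, in standard notation (p = parallelizable fraction of the work, N = number of processors): Amdahl's law bounds strong scaling, Gustafson's law describes weak scaling:

$$ S_{\text{strong}}(N) = \frac{1}{(1-p) + p/N}, \qquad S_{\text{weak}}(N) = (1-p) + p\,N. $$

For example, with p = 0.95 the strong-scaling speedup can never exceed 1/(1-p) = 20x no matter how many processors are added, while under weak scaling the same p yields 0.05 + 0.95N, which keeps growing with N.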

Programming Models

Flynn's Taxonomy

 SISD: Standard Serial Computer (nearly extinct)
 SIMD: Vector Machines or Extensions (very common)
 MISD: Redundant Execution (fault tolerance)
 MIMD: Multicore (ubiquitous)
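To make the SIMD quadrant concrete, here is a minimal sketch in C using x86 AVX intrinsics (an assumption about the target machine; compile with something like gcc -mavx): a single instruction operates on eight floats at once.

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];

    __m256 va = _mm256_loadu_ps(a);    // load 8 floats with one instruction
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_add_ps(va, vb); // single instruction, 8 additions: SIMD
    _mm256_storeu_ps(c, vc);

    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);         // prints "9" eight times
    printf("\n");
    return 0;
}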

Parallel Resources and Programming

Parallel Resource                        Programming
. Instruction-level parallelism          . Compiler
  (pipelining, VLIW, superscalar)        . (inline assembly)
                                         . Hardware scheduling
. SIMD operations                        . Compiler (inline assembly)
. Vector operations                      . Intrinsics
. Instruction sequences                  . Libraries
. Multiprocessors                        . Compilers (very limited)
. Multicores                             . Expert programmers
. Multithreading                         . Parallel languages
                                         . Parallel libraries
                                         . Hints

Historic Architecture Examples

 Systolic Array
. Data-stream driven (data counters)
. Multiple streams for parallelism
. Specialized for applications (reconfigurable)
[Image: systolic array. Source: ni.com]

 Dataflow Architectures
. No program counter, execute instructions when all input arguments are available
. Fine-grained, high overheads
. Example: compute f = (a+b) * (c+d) (see the C sketch below)
[Image: dataflow graph. Source: isi.edu]

 Both come back in FPGA computing
. Interesting research opportunities!
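A tiny sketch of the dataflow example in C: the two additions have no mutual dependencies, so a dataflow machine (or, on today's hardware, a superscalar core) can fire them in parallel; only the multiply must wait for both.

#include <stdio.h>

/* f = (a+b) * (c+d): the two sums are independent "instructions"
 * that may execute as soon as their inputs are available;
 * the multiply fires only when both sums are done. */
static double f(double a, double b, double c, double d) {
    double s1 = a + b;  // ready when a, b arrive
    double s2 = c + d;  // ready when c, d arrive; independent of s1
    return s1 * s2;     // waits for s1 AND s2
}

int main(void) {
    printf("%g\n", f(1, 2, 3, 4)); // (1+2)*(3+4) = 21
    return 0;
}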


Parallel Architectures 101

[Figure: four canonical memory architectures: uniform memory access (today's laptops); non-uniform memory access (today's servers); time-division multiplexing (yesterday's clusters); remote direct-memory access (today's clusters)]

 … and mixtures of those

Programming Models

 Shared Memory Programming (SM)
. Shared address space
. Implicit communication
. Hardware for cache-coherent remote memory access: cache-coherent Non-Uniform Memory Access (ccNUMA)
. Pthreads, OpenMP (sketch below)

 (Partitioned) Global Address Space (PGAS)
. Remote Memory Access
. Remote vs. local memory (cf. ncc-NUMA)

 Distributed Memory Programming (DM)
. Explicit communication (typically messages)
. Message Passing

MPI: de-facto large-scale prog. standard

 Basic MPI
 Advanced MPI, including MPI-3
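As the distributed-memory counterpart, a minimal MPI sketch in C (compile with mpicc, launch with mpirun -np 2); every exchanged byte is an explicit message:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int value = 42;
        // Explicit communication: the sender names the receiver (rank 1).
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int value;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}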

DPHPC Overview

[Figure: overview map of the course topics]

Schedule of Last Year

k  Monday                                      Thursday
0  09/19: no lecture                           09/22: MPI Tutorial
1  09/26: Organization - Introduction          09/29: Projects - Advanced MPI Tutorial
2  10/03: Cache Coherence & Memory Models      10/06: Cache Organization - Introduction to OpenMP
3  10/10: Memory Models                        10/13: Sequential Consistency + OpenMP Synchronization
4  10/17: Linearizability                      10/20: Linearizability
5  10/24: Languages and Locks                  10/27: Locks
6  10/31: Amdahl's Law                         11/03: Amdahl's Law
7  11/07: Project presentations                11/10: No recitation session
8  11/14: Roofline Model                       11/17: Roofline Model
9  11/21: Balance Principles / Scheduling      11/24: Balance Principles & Scheduling
10 11/28: Locks and Lock-Free                  12/01: SPIN Tutorial
11 12/05: Lock-Free and distributed memory     12/08: Benchmarking (paper)
12 12/12: Guest lecture - Dr. Tobias Grosser   12/15: Network Models
13 12/19: Final Presentations

Related classes in the SE/PL focus

 263-2910-00L Program Analysis
. http://www.srl.inf.ethz.ch/pa.php
. Spring 2017, Lecturer: Martin Vechev

 263-2300-00L How to Write Fast Numerical Code
. http://www.inf.ethz.ch/personal/markusp/teaching/263-2300-ETH-spring16/course.html
. Spring 2017, Lecturer: Markus Püschel

 This list is not exhaustive!
