Design of Parallel and High-Performance Computing
Fall 2017
Lecture: Introduction

Instructor: Torsten Hoefler & Markus Püschel
TA: Salvatore Di Girolamo

Goals of this lecture

 Motivate you!
 Trends
 High-performance computing
 Programming models
 Course overview

Trends

 What doubles …?

[Figure: technology trend data. Source: Wikipedia]

How to increase the compute power?

 Clock Speed!

[Figure: power density (W/cm²) of Intel processors (4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium), 1970 to 2010, on a trajectory from hot plate past nuclear reactor and rocket nozzle toward the Sun's surface. Source: Intel]

How to increase the compute power?

 Clock Speed? Not an option anymore!

[Figure: the same power-density plot. Sources: Intel, Wikipedia]

Evolutions of Processors (Intel)

[Figure: clock frequency of Intel processors, from the 8008/8085/8086 through Pentium, Pentium Pro, Pentium II, Pentium III, and Pentium 4 to Core, Nehalem, Sandy Bridge, and Haswell: exponential growth ("free speedup") until the plateau at ~3 GHz. Source: Wikipedia/Intel/PCGuide]

Evolutions of Processors (Intel)

[Figure: the same frequency plot annotated with parallelism: cores (2, 2, 4, 4, 4, 8, i.e., 8x) and vector units (8x), together ~360 Gflop/s; unlike the "free speedup" of rising clock rates, exploiting this parallelism requires work. Source: Wikipedia/Intel/PCGuide]

[Figure: the same plot annotated with memory bandwidth (normalized). Source: Wikipedia/Intel/PCGuide]

A more complete view

[Figure: long-term growth of computing performance. Source: www.singularity.com]

 Can we do this today?

High-Performance Computing

High-Performance Computing (HPC)

 a.k.a. "Supercomputing"

 Question: define "Supercomputer"!
. "A supercomputer is a computer at the frontline of contemporary processing capacity--particularly speed of calculation." (Wikipedia)
. Usually quite expensive ($s and kWh) and big (space)

 HPC is a quickly growing niche market
. Not all "supercomputers", wide base
. Important enough for vendors to specialize
. Very important in research settings (up to 40% of university spending)

[Figure: performance milestones: Blue Waters, ~13 PF (2012); TaihuLight, ~125 PF (2016); 1 Exaflop ~2023?]

"Goodyear Puts the Rubber to the Road with High Performance Computing"
"High Performance Computing Helps Create New Treatment For Stroke Victims"
"Procter & Gamble: Supercomputers and the Secret Life of Coffee"
"Motorola: Driving the Cellular Revolution With the Help of High Performance Computing"
"Microsoft: Delivering High Performance Computing to the Masses"

Blue Waters in 2012

[Photo: Blue Waters. Source: extremetech.com]

The Top500 List (June 2017)

[Figure: the top of the June 2017 Top500 list]

The Top500 List

 A benchmark, solve Ax=b
. As fast as possible → as big as possible
. Reflects some applications, not all, not even many
. Very good historic data!

 Speed comparison for computing centers, states, countries, nations, continents
. Politicized (sometimes good, sometimes bad)
. Yet, fun to watch
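The benchmark behind the list is HPL, which solves the dense system Ax=b by LU factorization. A rough sketch of why "as big as possible" helps: the reported rate divides HPL's fixed operation count for an n×n system by the measured solve time,

$$ R_{\max} = \frac{\tfrac{2}{3}n^3 + 2n^2}{T_{\text{solve}}} \ \text{flop/s}, $$

so a larger n amortizes communication and other overheads and pushes R_max closer to the machine's theoretical peak R_peak.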


Green Top500 List (June 2017)

[Figure: the June 2017 Green500 list. Source: Jack Dongarra]

Top500: Trends

[Figure: Top500 performance trends, with a single GPU/MIC card marked for comparison. Source: Jack Dongarra]

Piz Daint @ CSCS

[Photo: Piz Daint. More pictures at: http://spcl.inf.ethz.ch/Teaching/2015-dphpc/]

HPC Applications: Scientific Computing

 Most natural sciences are simulation driven or are moving towards simulation
. Theoretical physics (solving the Schrödinger equation, QCD)
. Biology (gene sequencing)
. Chemistry (material science)
. Astronomy (colliding black holes)
. Medicine (protein folding for drug discovery)
. Meteorology (storm/tornado prediction)
. Geology (oil reservoir management, oil exploration)
. and many more … (even Pringles uses HPC)

HPC Applications: Commercial Computing

 Databases, data mining, search
. Amazon, Facebook, Google
 Transaction processing
. Visa, Mastercard
 Decision support
. Stock markets, Wall Street, military applications
 Parallelism in high-end systems and back-ends
. Often throughput-oriented
. Used equipment varies from COTS (Google) to high-end redundant mainframes (banks)

HPC Applications: Industrial Computing

 Aeronautics (airflow, engine, structural mechanics, electromagnetism)
 Automotive (crash, combustion, airflow)
 Computer-aided design (CAD)
 Pharmaceuticals (molecular modeling, protein folding, drug design)
 Petroleum (reservoir analysis)
 Visualization (all of the above, movies, 3D)

What can faster computers do for us?

 Solving bigger problems than we could solve before!
. E.g., gene sequencing and search, simulation of whole cells, mathematics of the brain, …
. The size of the problem grows with the machine power
. → Weak Scaling

 Solve today's problems faster!
. E.g., large (combinatorial) searches, mechanical simulations (aircraft, cars, weapons, …)
. The machine power grows with constant problem size
. → Strong Scaling
(Amdahl's and Gustafson's laws, sketched below after the Parallel vs. Concurrent slide, quantify these two regimes.)

Towards the age of massive parallelism

 Everything goes parallel
. Desktop computers get more cores: 2, 4, 8, soon dozens, hundreds?
. Supercomputers get more PEs (cores, nodes): >10 million today, >50 million on the horizon, 1 billion in a couple of years (after 2020)
 Parallelism is inevitable!

Parallel vs. Concurrent

 Concurrent activities may be executed in parallel
. Example: A1 starts at T1, ends at T2; A2 starts at T3, ends at T4
. Intervals (T1,T2) and (T3,T4) may overlap!
 Parallel activities: A1 is executed while A2 is running

. Usually requires separate resources!
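Here is the promised sketch of the two scaling regimes, in standard notation (p = parallelizable fraction of the work, N = number of processors): Amdahl's law bounds strong scaling, Gustafson's law describes weak scaling:

$$ S_{\text{strong}}(N) = \frac{1}{(1-p) + p/N}, \qquad S_{\text{weak}}(N) = (1-p) + p\,N. $$

For example, with p = 0.95 the strong-scaling speedup can never exceed 1/(1-p) = 20x no matter how many processors are added, while under weak scaling the same p yields 0.05 + 0.95N, which keeps growing with N.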

Programming Models

Flynn's Taxonomy

 SISD: Standard Serial Computer (nearly extinct)
 SIMD: Vector Machines or Extensions (very common)
 MISD: Redundant Execution (fault tolerance)
 MIMD: Multicore (ubiquitous)
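To make the SIMD quadrant concrete, here is a minimal sketch in C using x86 AVX intrinsics (an assumption about the target machine; compile with something like gcc -mavx): a single instruction operates on eight floats at once.

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];

    __m256 va = _mm256_loadu_ps(a);    // load 8 floats with one instruction
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_add_ps(va, vb); // single instruction, 8 additions: SIMD
    _mm256_storeu_ps(c, vc);

    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);         // prints "9" eight times
    printf("\n");
    return 0;
}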

Parallel Resources and Programming

Parallel Resource                        Programming
. Instruction-level parallelism          . Compiler
  (pipelining, VLIW, superscalar)        . (inline assembly)
                                         . Hardware scheduling
. SIMD operations                        . Compiler (inline assembly)
. Vector operations                      . Intrinsics
. Instruction sequences                  . Libraries
. Multiprocessors                        . Compilers (very limited)
. Multicores                             . Expert programmers
. Multithreading                         . Parallel languages
                                         . Parallel libraries
                                         . Hints

Historic Architecture Examples

 Systolic Array
. Data-stream driven (data counters)
. Multiple streams for parallelism
. Specialized for applications (reconfigurable)
[Image: systolic array. Source: ni.com]

 Dataflow Architectures
. No program counter, execute instructions when all input arguments are available
. Fine-grained, high overheads
. Example: compute f = (a+b) * (c+d) (see the C sketch below)
[Image: dataflow graph. Source: isi.edu]

 Both come back in FPGA computing
. Interesting research opportunities!
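A tiny sketch of the dataflow example in C: the two additions have no mutual dependencies, so a dataflow machine (or, on today's hardware, a superscalar core) can fire them in parallel; only the multiply must wait for both.

#include <stdio.h>

/* f = (a+b) * (c+d): the two sums are independent "instructions"
 * that may execute as soon as their inputs are available;
 * the multiply fires only when both sums are done. */
static double f(double a, double b, double c, double d) {
    double s1 = a + b;  // ready when a, b arrive
    double s2 = c + d;  // ready when c, d arrive; independent of s1
    return s1 * s2;     // waits for s1 AND s2
}

int main(void) {
    printf("%g\n", f(1, 2, 3, 4)); // (1+2)*(3+4) = 21
    return 0;
}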


Parallel Architectures 101

[Figure: four canonical memory architectures: uniform memory access (today's laptops); non-uniform memory access (today's servers); time-division multiplexing (yesterday's clusters); remote direct-memory access (today's clusters)]

 … and mixtures of those

Programming Models

 Shared Memory Programming (SM)
. Shared address space
. Implicit communication
. Hardware for cache-coherent remote memory access: cache-coherent Non-Uniform Memory Access (ccNUMA)
. Pthreads, OpenMP (sketch below)

 (Partitioned) Global Address Space (PGAS)
. Remote Memory Access
. Remote vs. local memory (cf. ncc-NUMA)

 Distributed Memory Programming (DM)
. Explicit communication (typically messages)
. Message Passing

MPI: de-facto large-scale prog. standard

 Basic MPI
 Advanced MPI, including MPI-3
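As the distributed-memory counterpart, a minimal MPI sketch in C (compile with mpicc, launch with mpirun -np 2); every exchanged byte is an explicit message:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int value = 42;
        // Explicit communication: the sender names the receiver (rank 1).
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int value;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}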

DPHPC Overview

[Figure: overview map of the course topics]

Schedule of Last Year

k  Monday                                      Thursday
0  09/19: no lecture                           09/22: MPI Tutorial
1  09/26: Organization - Introduction          09/29: Projects - Advanced MPI Tutorial
2  10/03: Cache Coherence & Memory Models      10/06: Cache Organization - Introduction to OpenMP
3  10/10: Memory Models                        10/13: Sequential Consistency + OpenMP Synchronization
4  10/17: Linearizability                      10/20: Linearizability
5  10/24: Languages and Locks                  10/27: Locks
6  10/31: Amdahl's Law                         11/03: Amdahl's Law
7  11/07: Project presentations                11/10: No recitation session
8  11/14: Roofline Model                       11/17: Roofline Model
9  11/21: Balance Principles / Scheduling      11/24: Balance Principles & Scheduling
10 11/28: Locks and Lock-Free                  12/01: SPIN Tutorial
11 12/05: Lock-Free and distributed memory     12/08: Benchmarking (paper)
12 12/12: Guest lecture - Dr. Tobias Grosser   12/15: Network Models
13 12/19: Final Presentations

Related classes in the SE/PL focus

 263-2910-00L Program Analysis
. http://www.srl.inf.ethz.ch/pa.php
. Spring 2017, Lecturer: Martin Vechev

 263-2300-00L How to Write Fast Numerical Code
. http://www.inf.ethz.ch/personal/markusp/teaching/263-2300-ETH-spring16/course.html
. Spring 2017, Lecturer: Markus Püschel

 This list is not exhaustive!
