Introduction to High Performance Computing

- Why HPC
- Basic concepts
- How to program
- Technological trends

Why HPC?
- Many problems require more resources than available on a single computer
- "Grand Challenge" problems (en.wikipedia.org/wiki/Grand_Challenge) requiring PetaFLOPS and PetaBytes of computing resources
- Web search engines/databases processing millions of transactions per second

Uses of HPC
- Historically "the high end of computing"
  - Chemistry, Molecular Sciences
  - Atmosphere, Earth, Environment
  - Geology, Seismology
  - Physics - applied, nuclear, particle, condensed matter, high pressure, fusion, photonics
  - Mechanical Engineering - from prosthetics to spacecraft
  - Electrical Engineering, Circuit Design, Microelectronics
  - Bioscience, Biotechnology, Genetics
  - Computer Science, Mathematics

Uses of HPC

- Today, commercial applications provide an equal or greater driving force; they require processing of large amounts of data in sophisticated ways
  - Databases, data mining
  - Oil exploration
  - Web search engines, web based business services
  - Medical imaging and diagnosis
  - Pharmaceutical design
  - Financial and economic modeling
  - Management of national and multi-national corporations
  - Advanced graphics and virtual reality, particularly in the entertainment industry
  - Networked video and multi-media technologies
  - Collaborative work environments

Example: Weather Prediction

[Figure: global simulation at 10 km versus 1 km resolution.] Target for addressing key science challenges in weather & climate prediction: global 1-km Earth system simulations at a rate of ~1 year/day.
(Peter Bauer & Erwin Laure, ETP4HPC SRA-3 kick-off meeting, IBM IOT Munich, March 20th 2017)

Example: NOMAD Science and Data Handling Challenges
- Data is the raw material of the 21st century
- The NOMAD Archive: NOMAD supports all important codes in computational materials science. The code-independent Archive contains data from many million calculations (billions of CPU hours).
- The NOMAD challenge: build a map and fill the existing white spots
- Discovering interpretable patterns and correlations in this data will
  • create knowledge,
  • advance materials science,
  • identify new scientific phenomena, and
  • support industrial applications.
[Figure: number of geometries per composition versus number of compositions in the Archive, and a schematic map (Descriptor A vs. Descriptor B) locating photovoltaics, thermal-barrier materials, transparent metals and superconductors.]

The Airbus Challenge (David Hills, 2008) - ExaFLOW

- An Airbus 310 cruising at 250 m/s at 10,000 m
- On a Teraflops machine (10^12 FLOPS): 8·10^5 years
- Result in one week would require a 4·10^19 FLOPS machine (40 EFlops)
- (based on John Kim's estimate, TSFP-9, 2015)

Predicting interactomes by docking… a dream?

- ~20,000 human proteins
- Interactome prediction will require 20,000² docking runs
- This will require > 10 billion CPU hours and generate about 100 exabytes of data
- Interest in simulating/understanding the impact of disease-related mutations that affect/alter the interaction network

Molecular Dynamics on the exascale

- Understanding proteins and drugs
- A 1 μs simulation: 10 exaflop
- Many structural transitions: many simulations needed
- Study the effect of several bound drugs
- Study the effect of mutations
- All this multiplies to >> zettaflop
- Question: how far can we parallelize?

[Figure: example ion channel in a nerve cell, ~200,000 atoms. It opens and closes during signalling and is affected by e.g. alcohol and drugs.]

(bioexcel.eu)

FET: Human Brain Project

(F. Schürmann, H. Markram, Blue Brain Project, EPFL)

What is Parallel Computing?
- Traditionally, software has been written for serial computation:
  - To be run on a single computer having a single Central Processing Unit (CPU)
  - A problem is broken into a discrete series of instructions
  - Instructions are executed one after another
  - Only one instruction may execute at any moment in time

Parallel Computing
- In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:
  - A problem is broken into discrete parts that can be solved concurrently
  - Each part is further broken down into a series of instructions

Parallelism on different levels
- CPU
  - Instruction level parallelism, pipelining
  - Vector unit
  - Multiple cores
    • Multiple threads or processes
- Computer
  - Multiple CPUs
  - Co-processors (GPUs, FPGAs, …)
- Network
  - Tightly integrated network of computers (supercomputer)
  - Loosely integrated network of computers (distributed computing)

Flynn's taxonomy (1966)
- {Single, Multiple} {Instructions, Data}

  SISD - Single Instruction, Single Data
  SIMD - Single Instruction, Multiple Data
  MISD - Multiple Instruction, Single Data
  MIMD - Multiple Instruction, Multiple Data

Single Instruction Single Data
- A serial (non-parallel) computer
- Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle
- Single data: only one data stream is being used as input during any one clock cycle
- Deterministic execution
- This is the oldest and used to be the most common type of computer (up to the arrival of multicore CPUs)
- Examples: older generation mainframes, minicomputers and workstations; older generation PCs
- Attention: single-core CPUs exploit instruction level parallelism (pipelining, multiple issue, speculative execution) but are still classified as SISD

Single Instruction Multiple Data
- "Vector" computer
- Single instruction: all processing units execute the same instruction at any given clock cycle
- Multiple data: each processing unit can operate on a different data element
- Best suited for specialized problems characterized by a high degree of regularity, such as graphics/image processing
- Synchronous (lockstep) and deterministic execution
- Two varieties: processor arrays and vector pipelines
- Most modern computers, particularly those with graphics processing units (GPUs), employ SIMD instructions and execution units (see the sketch below)
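To make the data-parallel idea concrete, here is a small C loop (an illustrative sketch, not taken from the slides; the function and array names are made up) in which the same independent operation is applied to every element, so a vectorizing compiler can map it onto SIMD instructions.

```c
/* SIMD-friendly kernel: identical, independent work per element,
 * so several elements can be processed per vector instruction. */
#include <stddef.h>

void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];   /* same operation on every data element */
    }
}
```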

Multiple Instruction, Multiple Data
- Currently the most common type of parallel computer; most modern computers fall into this category
- Multiple instruction: every processor may be executing a different instruction stream
- Multiple data: every processor may be working with a different data stream
- Execution can be synchronous or asynchronous, deterministic or non-deterministic
- Examples: most current supercomputers, networked parallel computer clusters and "grids", multi-processor SMP computers, multi-core PCs
- Note: many MIMD architectures also include SIMD execution sub-components

Multiple Instruction, Single Data
- No examples exist today
- Potential uses might be:
  - Multiple cryptography algorithms attempting to crack a single coded message
  - Multiple frequency filters operating on a single signal

Single Program Multiple Data (SPMD)
- MIMDs are typically programmed following the SPMD model
- A single program is executed by all tasks simultaneously
- At any moment in time, tasks can be executing the same or different instructions within the same program. All tasks may use different data. (MIMD)
- SPMD programs usually have the necessary logic programmed into them to allow different tasks to branch or conditionally execute only those parts of the program they are designed to execute. That is, tasks do not necessarily have to execute the entire program - perhaps only a portion of it (see the sketch below).
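A minimal SPMD sketch in C with MPI (illustrative only, assuming an MPI installation): every process runs the same executable and uses its rank to branch into the part of the program it is meant to execute.

```c
/* SPMD: one program, many processes; behaviour selected by rank. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        printf("Rank 0 of %d: doing the coordination part\n", size);
    } else {
        printf("Rank %d of %d: doing a worker part\n", rank, size);
    }

    MPI_Finalize();
    return 0;
}
```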

Multiple Program Multiple Data (MPMD)
- MPMD applications typically have multiple executable object files (programs). While the application is being run in parallel, each task can be executing the same or a different program as other tasks.
- All tasks may use different data
- Workflow applications, multidisciplinary optimization, combination of different models

FLOPS
- FLoating Point Operations per Second
- Most commonly used performance indicator for parallel computers
- Typically measured using the Linpack benchmark
- Most useful for scientific applications
- Other benchmarks include SPEC, NAS, STREAM (memory)

  Name   FLOPS
  Yotta  10^24
  Zetta  10^21
  Exa    10^18
  Peta   10^15
  Tera   10^12
  Giga   10^9
  Mega   10^6

Moore's Law

- Gordon E. Moore, "Cramming more components onto integrated circuits", Electronics Magazine, 19 April 1965:

"The complexity for minimum component costs has increased at a rate of roughly a factor of two per year ... Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years. That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000. I believe that such a large circuit can be built on a single wafer."

- With later alterations: transistor density doubles every 18 months
- So far this law holds
- It has also been interpreted as doubling performance every 18 months
  - A little inaccurate - see later

[Figure: performance development over time, with successive 4-year spans marked.]

Top500 Nr 1: "TaihuLight" (Sunway)
- National Supercomputing Center in Wuxi, China
- Sunway SW26010 processors: 260 cores each, 1.45 GHz
- 10,649,600 cores in total
- 93 PF Linpack (125.5 PF theoretical peak)
- 15 MW

Communication Architecture

A parallel computer is

“a collection of processing elements that communicate and cooperate to solve large problems fast”

(Almasi and Gottlieb 1989)

Communication Architecture
- Defines basic communication and synchronization operations
- Addresses the organizational structures that realize these operations
- Communication: exchange of data between processing units
- Synchronization: coordination of parallel activities

Synchronization: Dining Philosophers
- Algorithm:
  - Think
  - Take left fork
  - Take right fork
  - Eat
  - Release right fork
  - Release left fork
- Synchronization problems:
  - Deadlock:
    • All philosophers hold their left fork
  - Starvation:
    • One philosopher can never get hold of two forks
    • Only in the modified algorithm: release the fork if the second fork cannot be obtained

Common Synchronization Patterns
- Barrier: hold activities until all processes have reached the same point
- Semaphore: finite resources; two operations: P - wait for a free resource and lock it; V - release the resource
- Mutex: only one process can access a shared resource at a time (see the sketch below)
- Events: a process waits until notified by another process
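The following POSIX-threads sketch (illustrative, not from the slides; thread count and variable names are made up) shows two of these patterns: a mutex protecting a shared counter and a barrier that holds all threads until everyone has finished its update.

```c
/* Mutex and barrier with POSIX threads. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_barrier_t barrier;
static int shared_counter = 0;

static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);      /* mutex: only one thread updates at a time */
    shared_counter++;
    pthread_mutex_unlock(&lock);

    pthread_barrier_wait(&barrier); /* barrier: wait until all threads got here */
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %d\n", shared_counter);  /* prints 4 */
    pthread_barrier_destroy(&barrier);
    return 0;
}
```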

Typical Communication Architectures
- Shared Memory
- Distributed Memory

Shared Memory

Shared Memory Multiprocessor
- Hardware provides a single physical address space for all processors
- Global physical address space and symmetric access to all of main memory (symmetric multiprocessor - SMP)
- All processors and memory modules are attached to the same interconnect (bus or switched network)

Differences in Memory Access
- Uniform Memory Access (UMA): memory access takes about the same time independent of data location and requesting processor
- Non-uniform Memory Access (NUMA): memory access can differ depending on where the data is located and which processor requests the data

Cache coherence
- While main memory is shared, caches are local to individual processors
- Client B's cache might have old data, since updates in client A's cache are not yet propagated
- Different cache coherency protocols exist to avoid this problem
- Subject of subsequent lectures

Synchronization
- Access to shared data needs to be protected
  - Mutual exclusion (mutex)
  - Point-to-point events
  - Global event synchronization (barrier)

SMP Pros and Cons
- Advantages:
  - Global address space provides a user-friendly programming perspective to memory
  - Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs
- Disadvantages:
  - The primary disadvantage is the lack of scalability between memory and CPUs. Adding more CPUs can geometrically increase traffic on the shared memory-CPU path and, for cache coherent systems, geometrically increase traffic associated with cache/memory management.
  - The programmer is responsible for synchronization constructs that ensure "correct" access to global memory.
  - Expense: it becomes increasingly difficult and expensive to design and produce shared memory machines with ever increasing numbers of processors.

Distributed Memory Multiprocessors

DMMPs
- Each processor has a private physical address space
- No cache coherence problem
- Hardware sends/receives messages between processors
  - Message passing (see the sketch below)
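A hedged sketch of message passing with MPI (illustrative; run with at least two ranks, e.g. mpirun -np 2): a blocking exchange with MPI_Send/MPI_Recv, followed by the non-blocking MPI_Isend/MPI_Irecv plus MPI_Wait variant discussed on the next slide.

```c
/* Blocking and non-blocking message passing between rank 0 and rank 1. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    double x = 0.0, y = 0.0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        x = 3.14;
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);           /* blocking send */
    } else if (rank == 1) {
        MPI_Recv(&y, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                                  /* blocking receive */
    }

    /* Non-blocking: initiate the transfer, then wait for completion. */
    MPI_Request req;
    if (rank == 0) {
        MPI_Isend(&x, 1, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Irecv(&y, 1, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```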

Synchronization
- Synchronization via exchange of messages
- Synchronous communication
  - Sender/receiver wait until data has been sent/received
- Asynchronous communication
  - Sender/receiver can proceed after sending/receiving has been initiated
  - Example: P1 does e = isend(x); wait(e) while P2 does e = irecv(y); wait(e), instead of blocking send(x)/recv(y)
- Higher-level concepts (barriers, semaphores, …) can be constructed using send/recv primitives
  - Message passing libraries typically provide them

DMMPs Pros and Cons
- Advantages:
  - Memory is scalable with the number of processors. Increase the number of processors and the size of memory increases proportionately.
  - Each processor can rapidly access its own memory without interference and without the overhead incurred with trying to maintain cache coherency.
  - Cost effectiveness: can use commodity, off-the-shelf processors and networking.
- Disadvantages:
  - The programmer is responsible for many of the details associated with data communication between processors.
  - It may be difficult to map existing data structures, based on global memory, to this memory organization.
  - Non-uniform memory access (NUMA) times
  - Administration and software overhead (essentially N systems vs. 1 SMP)

Hybrid Approaches

Combining SMPs and DMMPs
- Today, DMMPs are typically built with SMPs as building blocks
  - E.g. a Cray XC40 DMMP node has two CPUs with 16 cores each
  - Soon, systems with more CPUs and many more cores will appear
- This combines the advantages and disadvantages of both categories
- Programming is more complicated due to the combination of several different memory organizations that require different treatment (see the hybrid sketch below)
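A minimal hybrid sketch (illustrative, assuming both MPI and OpenMP are available; compile with an MPI compiler wrapper plus -fopenmp or equivalent): MPI ranks between nodes, OpenMP threads within each node's shared memory.

```c
/* Hybrid MPI + OpenMP "hello": one MPI process per node, several threads inside. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* each MPI process spawns a team of threads on its node */
        printf("rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```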

Moore's law revisited
- Doubling of transistor density every 18 months
- Often paraphrased as doubling of performance every 18 months

Reinterpreting Moore's law

- Moore's law is holding, in the number of transistors
  - Transistors on an ASIC still doubling every 18 months at constant cost
  - 15 years of exponential clock rate growth has ended
- Moore's Law reinterpreted:
  - Performance improvements are now coming from the increase in the number of cores on a processor (ASIC)
  - #cores per chip doubles every 18 months instead of clock frequency

[Figure (courtesy of Kunle Olukotun, Lance Hammond, Herb Sutter, and Burton Smith): transistor counts continue to grow, while clock frequency, thread performance and power (watts) have flattened; the number of cores per chip is now what increases.]

Computing Power Consumption

Power = Capacitive load × Voltage² × Frequency

- Capacitive load per transistor is a function of both the number of transistors connected to an output and the technology, which determines the capacitance of both wires and transistors
- Frequency switched is a function of the clock rate
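A rough worked instance of this formula (illustrative, using the factors shown on the next slide): if clock frequency grows by ×1000 while the supply voltage drops from 5 V to 1 V and the capacitive load stays fixed, then

    P_new / P_old = (1/5)² × 1000 = 40

so power would still grow by roughly ×40, in the same ballpark as the ~30× that was actually observed.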

Hitting the Power Wall

Power = Capacitive load × Voltage² × Frequency

[Figure: while clock frequency increased by about ×1000, power grew only about ×30, because the supply voltage dropped from 5 V to 1 V; with voltage near its practical lower limit, further frequency scaling runs into the power wall.]

Multiple cores deliver more performance per watt

[Figure: a big core with relative power 4 and performance 2 versus a small core with power 1 and performance 1 - i.e. roughly 1/4 of the power for 1/2 of the performance.]

- Many core is more power efficient:
  - Power ~ area
  - Single-core performance ~ area^0.5
[Figure: one big core replaced by four small cores (C1-C4) sharing a cache.]

Multicore CPUs
- Intel Xeon 6-core processor

What does this mean?
- The easy times are gone
- Updating to the next processor generation will not automatically increase performance anymore
- Parallel computing techniques are needed to fully exploit a new processor generation
- Parallel computing is going mainstream

GPUs

GPUs
- GPU = Graphics Processing Unit = a specialized microcircuit to accelerate the creation and manipulation of images in a video frame for display devices
- Excellent for processing large blocks of data in parallel
- GPUs are used in game consoles, embedded systems (such as systems in cars for automated driving), computers and supercomputers
- Since 2012, GPUs have been the main workforce for training deep-learning networks

The Rise of the GPU in HPC
- GPUs are a core technology in many of the world's fastest and most energy-efficient supercomputers
- GPUs compete well in terms of FLOPS/Watt
- Between 2012 and December 2013, the list of the ten most energy-efficient supercomputers (Green500) changed to 100% NVIDIA GPU based systems

- In the current Green500, the top 2 most energy-efficient supercomputers use NVIDIA P100 GPUs

GPU Design Motivation: Process Pixels in Parallel

- Data parallel
  - In 1080i and 1080p videos, 1920 × 1080 pixels = 2M pixels per video frame → compute intensive
  - Lots of parallelism at low clock speed → power efficient
- Computation on each pixel is independent from computation on other pixels
  - No need for synchronization
- Large data locality = access to data is regular
  - No need for large caches

CPU and GPU

- A CPU has tens of massive cores; the CPU excels at irregular control-intensive work
  - Lots of hardware for control, fewer ALUs
- A GPU has thousands of small cores; the GPU excels at regular math-intensive work
  - Lots of ALUs, little hardware for control

Weakness of GPU

GPU is very fast (huge parallelism) but getting data from/to GPU is slow

[Figure: NVIDIA Tesla K40 system diagram - GPU (base clock 745 MHz) with 12 GB of GDDR5 memory at 288 GB/s, connected via PCIe Gen3 (~32 GB/s) to the CPU, which has 64 GB of DDR4 DRAM at 80 GB/s.]

NVIDIA Tesla K40 = the most common GPU on supercomputers in the Nov. 2016 Top500 list

CPU vs GPU
- CPUs are latency-optimized
  - Reduce memory latency with big caches
  - Hide memory latencies with other instructions (instruction window, out-of-order execution)
  - Each thread runs as fast as possible, but there are fewer threads
- GPUs are throughput-optimized
  - Each thread might take a long time, but thousands of threads are used

Is a GPU good for my non-graphics application?

- It depends ...
  - Compute-intensive applications with little synchronization benefit the most from GPUs:
    • Deep-learning network training 8×-10×, GROMACS 2×-3×, LAMMPS 2×-8×, QMCPack 3×
  - Irregular applications, such as sorting and constraint solvers, are faster on the CPU*
- GPU applications are more difficult to program ...
  - CUDA is the de-facto standard for programming NVIDIA GPUs
  - OpenCL supports all accelerators, including non-NVIDIA ones
  - OpenACC and OpenMP 4 provide a higher-level programming interface (see the sketch below)
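As a hedged illustration of the directive-based approach (a sketch only; it assumes a compiler with OpenMP 4 offload support, and the array names are made up):

```c
/* Directive-based offload of a data-parallel loop to an accelerator. */
#include <stdio.h>

#define N 1000000

int main(void)
{
    static float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* Offload the loop to a device (GPU) if one is available. */
    #pragma omp target teams distribute parallel for map(to: x[0:N]) map(tofrom: y[0:N])
    for (int i = 0; i < N; i++)
        y[i] = 2.0f * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);  /* expect 4.0 */
    return 0;
}
```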

* Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU, Victor W. Lee et al.

Network Topologies

The Role of the Network
- The overall performance of DMMPs depends critically on the performance of the network used to connect the individual nodes
  - How fast can messages be transmitted, and how much data can be exchanged?
  - Also applies to networked SMPs
- Latency: time from the start of packet transmission to the start of packet reception (but typically measured as the round-trip time of zero-sized messages)
- Bandwidth: how much data can be transmitted over the network (bit/s)

Different Technologies
- Ethernet
- Myrinet
- InfiniBand
- Proprietary networks
  - Cray Aries
  - IBM BlueGene
- They differ in bandwidth and latency, but most notably in sustained performance through e.g. MPI

Network Topologies
- Networks can be arranged in a number of ways
- The typical design goal is to balance performance and cost
- Factors in addition to latency and bandwidth:
  - Fault tolerance
  - Power consumption
  - Number of switches
  - Number of links
  - Cable length
- Additional considerations:
  - Total network bandwidth (TB)
    • Bandwidth of each link multiplied by the number of links
  - Bisection bandwidth (BS)
    • Worst-case bandwidth if the nodes are divided into two disjoint sets

Common Topologies
- Bus
  - TB: the bandwidth of the (single) link
  - BS: the bandwidth of the link
- Ring
  - TB: P times the bandwidth of one link
  - BS: 2 times the bandwidth of one link
- Fully connected network
  - TB: P × (P-1)/2 times the bandwidth of one link
  - BS: (P/2)² times the bandwidth of one link
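A quick worked instance of these formulas (illustrative), counting bandwidth in units of one link, for P = 64 nodes:

    Bus:              TB = 1,                   BS = 1
    Ring:             TB = 64,                  BS = 2
    Fully connected:  TB = 64 × 63 / 2 = 2016,  BS = (64/2)² = 1024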

Common Topologies Cont'd
- Mesh
  - Typically 2D or 3D

- N-cube (hypercube)
  - 2^n nodes

- Fat tree
  - Common in InfiniBand-based systems

Summary
- An HPC system is a collection of "nodes" connected by some network
- Nodes consist of (multiple) many-core CPUs, accelerators (GPUs, FPGAs, etc.) and memory
- Memory is typically shared between all CPUs of a node
  • But not (yet) with accelerators

Performance

Why worry about Performance?
- Compare different systems
- Select the most appropriate system for a given problem
- Make efficient use of available resources
- Scaling
  - An increase in resources should result in faster results
  - How does an increase in resources affect overall runtime?
  - How does an increase of problem size affect overall runtime?

Optimization Goals
- Execution time
  - Minimize the time between start and completion of a task
  - Typical goal in HPC
- Throughput
  - Maximize the number of tasks completed in a given time
  - Typical goal of large data centers (HTC)

Performance Definitions

    Performance_X = 1 / Execution time_X

For two computers X and Y, if the performance of X is greater than the performance of Y, we have

    Performance_X > Performance_Y
    1 / Execution time_X > 1 / Execution time_Y
    Execution time_Y > Execution time_X

i.e. the execution time on Y is longer than on X.


Measuring Performance
- Performance is measured in time units
- Different ways to measure time:
  - Wall clock time or elapsed time
    • Time taken from start to end
    • Measures everything, including other tasks performed on multitasking systems
  - CPU time
    • Actual time the CPU spends computing for a specific task
    • Does not include time spent on other processes or I/O
    • CPU time < wall clock time
  - User CPU time
    • CPU time spent in the user program
    • User CPU time < CPU time < wall clock time
  - System CPU time
    • CPU time spent in the operating system on tasks for the user program
    • User/system CPU time can be difficult to measure
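A small C sketch (illustrative; the loop is just dummy work) of the difference between CPU time and wall-clock time, using the standard clock() and clock_gettime() calls:

```c
/* clock() accumulates CPU time used by the process; clock_gettime()
 * with CLOCK_MONOTONIC measures elapsed wall-clock time, which also
 * includes time spent waiting or descheduled. */
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec w0, w1;
    clock_t c0, c1;

    clock_gettime(CLOCK_MONOTONIC, &w0);
    c0 = clock();

    volatile double s = 0.0;
    for (long i = 0; i < 100000000L; i++)   /* some CPU-bound work */
        s += 1.0 / (double)(i + 1);

    c1 = clock();
    clock_gettime(CLOCK_MONOTONIC, &w1);

    double cpu  = (double)(c1 - c0) / CLOCKS_PER_SEC;
    double wall = (w1.tv_sec - w0.tv_sec) + (w1.tv_nsec - w0.tv_nsec) / 1e9;
    printf("CPU time:  %.3f s\n", cpu);
    printf("Wall time: %.3f s\n", wall);
    return 0;
}
```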

Factors of CPU performance

CPU execution time = CPU clock cycles / Clock rate

CPU clock cycles = Instructions × Average clock cycles per instruction (CPI)

CPU time = (Instruction count × CPI) / Clock rate
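As a worked instance of this formula (illustrative numbers): a program that executes 10^9 instructions with an average CPI of 2 on a 4 GHz processor needs

    CPU time = (10^9 × 2) / (4 × 10^9 cycles/s) = 0.5 s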

  Components of performance            Units of measure
  CPU execution time                   Seconds for the program
  Instruction count                    Instructions executed for the program
  Clock cycles per instruction (CPI)   Average number of clock cycles per instruction
  Clock cycle time                     Seconds per clock cycle

Other Performance Factors
- Memory subsystem
  - Cache misses
  - Amount and frequency of data to be moved
- I/O subsystem
  - Amount and frequency of data to be moved
- For parallel systems
  - Synchronization
  - Communication
  - Load balancing

Amdahl's Law

Amdahl's law

- Pitfall: expecting the improvement of one aspect of a computer to increase overall performance by an amount proportional to the size of the improvement

- Gene Amdahl (1967):

    Improved time = (time affected by improvement / amount of improvement) + time unaffected

- Example: Suppose a program runs for 100 seconds, with 80 seconds spent in multiply operations. Doubling the efficiency of multiply operations will result in a new runtime of 60 seconds and thus a performance improvement of 1.67. How much do we need to improve multiply to achieve a 5 times improvement?
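Working the question out with Amdahl's formula (using the numbers from the example above): doubling multiply gives 80/2 + 20 = 60 s, i.e. 100/60 ≈ 1.67, as stated. A 5× overall improvement would require a total runtime of 100/5 = 20 s:

    80/n + 20 = 20  ⇒  80/n = 0

so no finite improvement of multiply alone can deliver a 5× speedup - the 20 s spent outside multiply already use up the entire time budget.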

Speedup

- Speedup (S) is defined as the improvement in execution time when increasing the amount of parallelism, or sequential execution time (TS) over parallel execution time (TP):

    S = TS / TP

[Figure: speedup versus number of CPUs (1, 2, 4, 8, 16) compared against the perfect (linear) speedup line.]

Efficiency
- Speedup as percentage of the number of processors:

    E = (1/P) × S = TS / (TP × P)

- A speedup of 90 with 100 processors yields 90% efficiency.

Superlinear Speedup
- Sometimes speedup is larger than the number of processors
- Very rare
- Main reasons:
  - Different parallel and sequential algorithms
  - Changes in memory behavior
    • The smaller problem size per processor in the parallel version fits main memory while the sequential one doesn't
    • Changes in cache behavior

Typical Speedup Curves

Amdahl's Law and Parallel Processing
- According to Amdahl's law, speedup is limited by the non-parallelizable fraction of a program
- Assume rp is the parallelizable fraction of a program and rs the sequential one, with rp + rs = 1. The maximum theoretical speedup achievable on n processors is

    Smax = 1 / (rs + rp/n)

- If 20% of a program is sequential, the maximum achievable speedup is 5 for n → ∞

How to live with Amdahl's law
- Many real-world problems have significant parallel portions
- Yet, to use 100,000 cores with 90% efficiency, the sequential part needs to be limited to roughly 0.0001%!

- Conclusion: minimize rs and maximize rp
  - Increase the amount of work done in the parallel (typically compute-intensive) parts

Scaling Example
- Workload: sum of 10 scalars, and 10 × 10 matrix sum
- Speedup from 10 to 100 processors
- Single processor: Time = (10 + 100) × tadd
- 10 processors:
  - Time = 10 × tadd + 100/10 × tadd = 20 × tadd
  - Speedup = 110/20 = 5.5 (55% of potential)
- 100 processors:
  - Time = 10 × tadd + 100/100 × tadd = 11 × tadd
  - Speedup = 110/11 = 10 (10% of potential)
- Assumes the load can be balanced across the processors

Scaling Example (cont'd)
- What if the matrix size is 100 × 100?
- Single processor: Time = (10 + 10000) × tadd
- 10 processors:
  - Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd
  - Speedup = 10010/1010 = 9.9 (99% of potential)
- 100 processors:
  - Time = 10 × tadd + 10000/100 × tadd = 110 × tadd
  - Speedup = 10010/110 = 91 (91% of potential)
- Assuming the load is balanced

Strong and Weak Scaling
- Strong scaling: the speedup achieved without increasing the size of the problem
- Weak scaling: the speedup achieved while increasing the size of the problem proportionally to the increase in the number of processors

Weak Scaling Example
- 10 processors, 10 × 10 matrix:
  - Time = 10 × tadd + 100/10 × tadd = 20 × tadd
- 100 processors, 32 × 32 matrix:
  - Time = 10 × tadd + 1000/100 × tadd = 20 × tadd
- Constant performance in this example

Load Balancing
- Good speedup can only be achieved if the parallel workload is spread relatively evenly over the available processors
- If the workload is unevenly spread, overall performance is bound by the slowest processor (i.e. the processor with the most workload)

Example Continued
- 100 processors:
  - Time = 10 × tadd + 10000/100 × tadd = 110 × tadd
  - Speedup = 10010/110 = 91 (91% of potential)
  - Assumes each processor gets 1% of the workload
- Now assume one processor gets 2% (i.e. 200 matrix elements) and the remaining 9800 elements are distributed equally over the other 99 processors:

    Time = max(9800/99, 200/1) × tadd + 10 × tadd = 210 × tadd

  - Speedup = 10010/210 = 47.6

Load Balancing Examples

[Figure: different ways of distributing a 2-D domain over four processes P0-P3: 2 × 2 blocks, column blocks, and a cyclic column distribution.]

Synchronization and Communication

- Parallel programs need synchronization and communication to ensure correct program behavior
- Synchronization and communication add significant overhead and thus reduce parallel efficiency
- S = TS / TP can be refined as

    S = TS / (TPC + synchronization wait time + communication time)

  with TPC denoting the net parallel computation time

Synchronization and Communication Cont'd
- The goal is to avoid synchronization and communication
  - Not always possible
- Overlap communication with computation and optimize communication (see the sketch below)
  - Communication overhead is impacted by latency and bandwidth
    • Block communication
  - Use more efficient communication patterns
- Profiling tools can help identify synchronization and communication overhead
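A sketch of overlapping communication with computation using non-blocking MPI (illustrative; the buffer names and sizes are made up, run with at least two ranks): start the receive early, do independent work, and wait only when the data is actually needed.

```c
/* Overlap: communication is in flight while independent work proceeds. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double halo = 0.0, interior[1000];
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Irecv(&halo, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req); /* start communication */
        for (int i = 0; i < 1000; i++)                               /* work that does not   */
            interior[i] = i * 0.5;                                   /* depend on the message */
        MPI_Wait(&req, MPI_STATUS_IGNORE);                           /* block only when needed */
        printf("got halo value %f\n", halo);
    } else if (rank == 1) {
        double x = 42.0;
        MPI_Send(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```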

Example: Vampir traces

The Impact of Data
- Apart from communication, data affects performance at many levels
  - Memory hierarchy
  - I/O

Memory/Storage Hierarchies

Tomorrow
- How to program HPC systems
- Technological trends
