Design of Parallel and High-Performance Computing Fall 2014 Lecture: Introduction

Instructor: Torsten Hoefler & Markus Püschel
TA: Timo Schneider

Goals of this lecture

 Motivate you!

 What is parallel computing? . And why do we need it?

 What is high-performance computing? . What’s a Supercomputer and why do we care?

 Basic overview of
. Programming models (some examples)
. Architectures (some case studies)

 Provide context for coming lectures


Let us assume …

 … you were to build a machine like this …

 … we know how each part works
. There are just many of them! (Source: Wikipedia)
. Question: How many calculations per second are needed to emulate a brain? (Source: www.singularity.com)

Can we do this today?

1 Exaflop! ~2022?

Tianhe-2, ~55 PF (2013)

Source: www.singularity.com Blue Waters, ~13 PF (2012)

Human Brain – No Problem!

 … not so fast, we need to understand how to program those machines …

Human Brain – No Problem!

Scooped!

Source: extremetech.com

Under the Kei Computer

Other problem areas: Scientific Computing

 Most natural sciences are simulation driven or are moving towards simulation
. Theoretical physics (solving the Schrödinger equation, QCD)
. Biology (gene sequencing)
. Chemistry (material science)
. Astronomy (colliding black holes)
. Medicine (protein folding for drug discovery)
. Meteorology (storm/tornado prediction)
. Geology (oil reservoir management, oil exploration)
. and many more … (even Pringles uses HPC)

Other problem areas: Commercial Computing

 Databases, data mining, search . Amazon, Facebook, Google

 Transaction processing . Visa, Mastercard

 Decision support . Stock markets, Wall Street, Military applications

 Parallelism in high-end systems and back-ends
. Often throughput-oriented
. Used equipment varies from COTS (Google) to high-end redundant mainframes (banks)

Other problem areas: Industrial Computing

 Aeronautics (airflow, engine, structural mechanics, electromagnetism)

 Automotive (crash, combustion, airflow)

 Computer-aided design (CAD)

 Pharmaceuticals (molecular modeling, protein folding, drug design)

 Petroleum (Reservoir analysis)

 Visualization (all of the above, movies, 3d)

What can faster computers do for us?

 Solving bigger problems than we could solve before!
. E.g., gene sequencing and search, simulation of whole cells, mathematics of the brain, …
. The size of the problem grows with the machine power → Weak Scaling

 Solve today’s problems faster!
. E.g., large (combinatorial) searches, mechanical simulations (aircrafts, cars, weapons, …)
. The machine power grows with constant problem size → Strong Scaling
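Using the standard definitions: strong scaling reports the speedup $S_p = T_1 / T_p$ for a fixed total problem size, while weak scaling keeps the problem size per processor constant (total size grows with $p$) and reports the efficiency $E_p = T_1 / T_p$.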

High-Performance Computing (HPC)

 a.k.a. “Supercomputing”

 Question: define “Supercomputer”!

High-Performance Computing (HPC)

 a.k.a. “Supercomputing”

 Question: define “Supercomputer”!
. “A supercomputer is a computer at the frontline of contemporary processing capacity – particularly speed of calculation.” (Wikipedia)
. Usually quite expensive ($s and kWh) and big (space)

 HPC is a quickly growing niche market
. Not all “”, wide base
. Important enough for vendors to specialize
. Very important in research settings (up to 40% of university spending)

“Goodyear Puts the Rubber to the Road with High Performance Computing”
“High Performance Computing Helps Create New Treatment For Stroke Victims”
“Procter & Gamble: Supercomputers and the Secret Life of Coffee”
“Motorola: Driving the Cellular Revolution With the Help of High Performance Computing”
“Microsoft: Delivering High Performance Computing to the Masses”

The Top500 List

 A benchmark, solve Ax=b
. As fast as possible! → as big as possible
. Reflects some applications, not all, not even many
. Very good historic data!
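The benchmark behind the list is HPL (High-Performance Linpack): LU factorization with partial pivoting of a dense $n \times n$ system, roughly $\frac{2}{3} n^3$ floating-point operations, reported as the achieved rate $R_{max}$ next to the theoretical peak $R_{peak}$.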

 Speed comparison for computing centers, states, countries, nations, continents
. Politicized (sometimes good, sometimes bad)
. Yet, fun to watch

The Top500 List (June 2014)

Piz Daint @ CSCS

Blue Waters in 2009

Imagine you’re designing a $500 M supercomputer, and all you have is:

This is why you need to understand performance expectations well!

Blue Waters in 2012

History and Trends

Single GPU/MIC Card

Source: Jack Dongarra

High-Performance Computing grows quickly

 Computers are used to automate many tasks

 Still growing exponentially . New uses discovered continuously

IDC, 2007: “The overall HPC server market grew by 15.5 percent in 2007 to reach $11.6 billion […] while the same kinds of boxes that go into HPC machinery but are used for general purpose computing, rose by only 3.6 percent to $54.4”

IDC, 2009: “expects the HPC technical server market to grow at a healthy 7% to 8% yearly rate to reach revenues of $13.4 billion by 2015.”

“The non-HPC portion of the server market was actually down 20.5 per cent, to $34.6bn” (Source: The Economist)

How to increase the compute power?

Clock Speed:

[Figure: power density (W/cm²) of processors from the 4004, 8008, 8080, 8085, 8086, 286, 386, 486 to the Pentium, 1970–2010, approaching that of a hot plate, a nuclear reactor, a rocket nozzle, and the Sun’s surface. Source: Intel]

How to increase the compute power? Not an option anymore!

Clock Speed:

[Figure: the same power-density plot as on the previous slide. Source: Intel]


Source: Wikipedia

A more complete view

So how to invest the transistors?

 Architectural innovations
. Branch prediction, Tomasulo logic/rename registers, speculative execution, …
. Help only so much

 What else?
. Simplification is beneficial: fewer transistors per CPU, more CPUs, e.g., Cell B.E., GPUs, MIC
. We call these “cores” these days
. Also, more intelligent devices or higher bandwidths (e.g., DMA controllers, intelligent NICs)

Source: IBM Source: NVIDIA Source: Intel

Towards the age of massive parallelism

 Everything goes parallel
. Desktop computers get more cores: 2, 4, 8, soon dozens, hundreds?
. Supercomputers get more PEs (cores, nodes):
  > 3 million today
  > 50 million on the horizon
  1 billion in a couple of years (after 2020)

 Parallel Computing is inevitable!

Parallel vs. Concurrent computing
. Concurrent activities may be executed in parallel
. Example: A1 starts at T1, ends at T2; A2 starts at T3, ends at T4; intervals (T1,T2) and (T3,T4) may overlap!
. Parallel activities: A1 is executed while A2 is running

Usually requires separate resources!

Goals of this lecture

 Motivate you!

 What is parallel computing? . And why do we need it?

 What is high-performance computing? . What’s a Supercomputer and why do we care?

 Basic overview of
. Programming models (some examples)
. Architectures (some case studies)

 Provide context for coming lectures


Granularity and Resources

Activities → Parallel Resource
. Micro-code instruction → Instruction-level parallelism
. Machine-code instruction (complex or simple) → Pipelining, VLIW, Superscalar, SIMD operations, Vector operations
. Sequence of machine-code instructions (blocks, loops, loop nests, functions, function sequences) → Instruction sequences, Multiprocessors, Multicores, Multithreading

Resources and Programming

Parallel Resource → Programming
. Instruction-level parallelism, Pipelining, VLIW, Superscalar → Compiler, (inline assembly), Hardware scheduling
. SIMD operations, Vector operations → Compiler (inline assembly), Libraries
. Instruction sequences → Compilers (very limited), Expert programmers
. Multiprocessors, Multicores, Multithreading → Parallel languages, Parallel libraries, Hints

Historic Architecture Examples

. Data-stream driven (data counters)
. Multiple streams for parallelism
. Specialized for applications (reconfigurable)

Source: ni.com  Dataflow Architectures . No program counter, execute instructions when all input arguments are available . Fine-grained, high overheads Example: compute f = (a+b) * (+d)

Source: isi.edu

Von Neumann Architecture

 Program counter → inherently serial!
 Retrospectively define parallelism in instructions and data

. SISD: Standard serial computer (nearly extinct)
. SIMD: Vector machines or extensions (very common)
. MISD: Redundant execution (fault tolerance)
. MIMD: Multicore (ubiquitous)

Parallel Architectures 101

[Figure: architecture sketches labeled “Today’s laptops”, “Today’s servers”, “Yesterday’s clusters”, and “Today’s clusters”]

 … and mixtures of those

Programming Models

 Shared Memory Programming (SM/UMA)
. Shared address space
. Implicit communication
. Hardware for cache-coherent remote memory access
. Cache-coherent Non-Uniform Memory Access (ccNUMA)

 (Partitioned) Global Address Space (PGAS)
. Remote Memory Access
. Remote vs. local memory (cf. ncc-NUMA)

 Distributed Memory Programming (DM)
. Explicit communication (typically messages)
. Message Passing

Shared Memory Machines

 Two historical architectures:
. “Mainframe” – all-to-all connection between memory, I/O and PEs
  Often used if the PE is the most expensive part (Source: IBM)
  Bandwidth scales with P
  PE cost scales with P
  Question: what about network cost?

Shared Memory Machines

 Two historical architectures:
. “Mainframe” – all-to-all connection between memory, I/O and PEs
  Often used if the PE is the most expensive part (Source: IBM)
  Bandwidth scales with P
  PE cost scales with P
  Question: what about network cost?
  Answer: cost can be cut with multistage connections (butterfly)
. “Minicomputer” – bus-based connection
  All traditional SMP systems
  High latency, low bandwidth (cache is important)
  Tricky to achieve highest performance (contention)
  Low cost, extensible

Shared Memory Machine Abstractions

 Any PE can access all memory . Any I/O can access all memory (maybe limited)

 OS (resource management) can run on any PE
. Can run multiple threads in shared memory
. Used for 40+ years

 Communication through shared memory . Load/store commands to memory controller . Communication is implicit . Requires coordination

 Coordination through shared memory . Complex topic . Memory models
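To make “communication is implicit, but requires coordination” concrete, here is a minimal sketch (not from the slides) of two POSIX threads communicating through shared memory; the C11 release/acquire pair on the flag is what guarantees the consumer also sees the payload. The variable names are made up for the example.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

int data;               // payload, communicated implicitly through shared memory
atomic_int ready = 0;   // flag used for coordination

void *producer(void *arg) {
  data = 42;                                               // write the payload
  atomic_store_explicit(&ready, 1, memory_order_release);  // publish it
  return NULL;
}

void *consumer(void *arg) {
  while (!atomic_load_explicit(&ready, memory_order_acquire))
    ;                                                      // spin until published
  printf("consumer sees %d\n", data);                      // guaranteed to print 42
  return NULL;
}

int main(void) {
  pthread_t p, c;
  pthread_create(&c, NULL, consumer, NULL);
  pthread_create(&p, NULL, producer, NULL);
  pthread_join(p, NULL);
  pthread_join(c, NULL);
  return 0;
}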

Shared Memory Machine Programming

 Threads or processes . Communication through memory

 Synchronization through memory or OS objects (a small lock example follows below)
. Lock/mutex (protect critical region)
. Semaphore (generalization of a mutex, which is a binary semaphore)
. Barrier (synchronize a group of activities)
. Atomic Operations (CAS, Fetch-and-add)
. Transactional Memory (execute regions atomically)

 Practical Models:
. POSIX threads
. MPI-3
. OpenMP
. Others: Java Threads, Qthreads, …
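As a small illustration of the first primitive listed above (a lock/mutex protecting a critical region), a hedged POSIX-threads sketch; the shared counter and the thread count are invented for the example.

#include <pthread.h>
#include <stdio.h>

long counter = 0;                                  // shared state
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  // protects counter

void *worker(void *arg) {
  for (int i = 0; i < 100000; ++i) {
    pthread_mutex_lock(&lock);     // enter critical region
    counter++;                     // a data race without the lock
    pthread_mutex_unlock(&lock);   // leave critical region
  }
  return NULL;
}

int main(void) {
  pthread_t t[4];
  for (int i = 0; i < 4; ++i) pthread_create(&t[i], NULL, worker, NULL);
  for (int i = 0; i < 4; ++i) pthread_join(t[i], NULL);
  printf("counter = %ld\n", counter);   // always 400000
  return 0;
}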

An SMM Example: Compute Pi

 Using the Gregory-Leibniz series:

  $\pi = 4 \sum_{k=0}^{\infty} \frac{(-1)^k}{2k+1}$

. Iterations of the sum can be computed in parallel
. Needs to sum all contributions at the end

Source: mathworld.wolfram.com

Pthreads Compute Pi Example

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

int n = 10000;          // number of series terms
int nthreads;           // number of threads
double *resultarr;      // one partial sum per thread

// each thread sums every nthreads-th term of the Gregory-Leibniz series
void *compute_pi(void *data) {
  int id = *(int*)data;
  double sum = 0.0;
  for (int i = id; i < n; i += nthreads)
    sum += (i % 2 ? -4.0 : 4.0) / (2.0 * i + 1.0);
  resultarr[id] = sum;
  return NULL;
}

int main(int argc, char *argv[]) {
  nthreads = (argc > 1) ? atoi(argv[1]) : 4;
  pthread_t *thread_arr = (pthread_t*)malloc(nthreads * sizeof(pthread_t));
  resultarr = (double*)malloc(nthreads * sizeof(double));
  int *ids = (int*)malloc(nthreads * sizeof(int));

  double pi = 0.0;
  for (int i = 0; i < nthreads; ++i) {
    ids[i] = i;
    pthread_create(&thread_arr[i], NULL, compute_pi, &ids[i]);
  }
  for (int i = 0; i < nthreads; ++i) {
    pthread_join(thread_arr[i], NULL);
    pi += resultarr[i];            // sum all contributions at the end
  }
  printf("pi is approximately %.16f\n", pi);
  return 0;
}

Additional comments on SMM

 OpenMP would allow implementing this example much more simply (but has other issues); see the sketch below
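For comparison, a minimal OpenMP sketch of the same pi computation (illustrative only, not the course's reference code); the reduction clause lets the runtime combine per-thread partial sums and also avoids false sharing on a shared result array.

#include <omp.h>
#include <stdio.h>

int main(void) {
  const int n = 1000000;   // number of series terms (arbitrary choice)
  double pi = 0.0;

  // Gregory-Leibniz series; each thread accumulates a private partial sum
  #pragma omp parallel for reduction(+:pi)
  for (int i = 0; i < n; ++i)
    pi += (i % 2 ? -4.0 : 4.0) / (2.0 * i + 1.0);

  printf("pi is approximately %.16f\n", pi);
  return 0;
}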

 Transparent shared memory has some issues in practice:
. False sharing (e.g., resultarr[])
. Race conditions (complex mutual exclusion protocols)
. Little tool support (debuggers need some work)

 Achieving performance is harder than it seems!

Distributed Memory Machine Programming

 Explicit communication between PEs . Message passing or channels

 Only local memory access, no direct access to remote memory . No shared resources (well, the network)

 Programming model: Message Passing (MPI, PVM)
. Communication through messages or group operations (broadcast, reduce, etc.)
. Synchronization through messages (sometimes unwanted side effect) or group operations (barrier)
. Typically supports message matching and communication contexts

DMM Example: Message Passing

[Diagram: process P executes send(X, Q, t) from address X in its local address space; process Q posts a matching receive(Y, t, P) into address Y in its local address space. Source: John Mellor-Crummey]

 Send specifies buffer to be transmitted

 Recv specifies buffer to receive into

 Implies copy operation between named PEs

 Optional tag matching

 Pair-wise synchronization (cf. happens before)

DMM MPI Compute Pi Example

#include <mpi.h>
#include <stdio.h>
#include <math.h>

int main(int argc, char *argv[]) {
  const double PI25DT = 3.141592653589793238462643;
  int n = 10000, numprocs, myid, j;
  double mypi = 0.0, pi;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);

  double t = -MPI_Wtime();
  // each rank sums every numprocs-th term of the Gregory-Leibniz series
  for (j = myid; j < n; j += numprocs)
    mypi += (j % 2 ? -4.0 : 4.0) / (2.0 * j + 1.0);
  // combine the partial sums on rank 0
  MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  t += MPI_Wtime();

  if (!myid) {
    printf("pi is approximately %.16f, Error is %.16f\n", pi, fabs(pi - PI25DT));
    printf("time: %f\n", t);
  }

  MPI_Finalize();
}

DMM Example: PGAS

 Partitioned Global Address Space
. Shared memory emulation for DMM: usually non-coherent
. “Distributed Shared Memory”: usually coherent

 Simplifies shared access to distributed data
. Has similar problems as SMM programming
. Sometimes lacks performance transparency (local vs. remote accesses)

 Examples: . UPC, CAF, Titanium, X10, …

How to Tame the Beast?

 How to program large machines?

 No single approach, PMs are not converging yet
. MPI, PGAS, OpenMP, Hybrid (MPI+OpenMP, MPI+MPI, MPI+PGAS?), … (a small hybrid sketch follows below)

 Architectures converge
. General purpose nodes connected by general purpose or specialized networks
. Small scale often uses commodity networks
. Specialized networks become necessary at scale

 Even worse: accelerators (not covered in this class, yet)
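To illustrate the “MPI+OpenMP” hybrid style referenced above, a minimal sketch: OpenMP threads work on shared memory inside a node, MPI moves data between nodes. The placeholder loop and the FUNNELED thread level are assumptions of this example, not a recommendation.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  int provided, rank;
  // request an MPI library that tolerates threads (only the master thread calls MPI here)
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  double local = 0.0, global = 0.0;
  #pragma omp parallel for reduction(+:local)   // shared-memory parallelism within the node
  for (int i = 0; i < 1000000; ++i)
    local += 1.0;                               // placeholder computation

  // message passing between nodes
  MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0) printf("global sum = %f\n", global);

  MPI_Finalize();
  return 0;
}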

Practical SMM Programming: Pthreads

Covered in example, small set of functions for creation and management

[Diagram: user-level threads (managed in user space) vs. kernel-level threads (managed by the kernel), each mapped onto CPU 0 and CPU 1.]

Practical SMM Programming: OpenMP (Source: OpenMP.org)

 Fork-join model

 Types of constructs:

+ Tasks

Source: Blaise Barney, LLNL

OpenMP General Code Structure

#include <omp.h>

int main() {
  int var1, var2, var3;
  // Serial code

  // Beginning of parallel section. Fork a team of threads. Specify variable scoping
  #pragma omp parallel private(var1, var2) shared(var3)
  {
    // Parallel section executed by all threads
    // Other OpenMP directives
    // Run-time Library calls
    // All threads join master thread and disband
  }

  // Resume serial code
}

Source: Blaise Barney, LLNL

Practical PGAS Programming: UPC

 PGAS extension to the C99 language

 Many helper library functions . Collective and remote allocation . Collective operations

 Complex

Practical DMM Programming: MPI-1

Helper Functions, … many more (>600 total)

Collection of 1D address spaces (Source: Blaise Barney, LLNL)

Complete Six Function MPI-1 Example

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  int myrank, sbuf = 23, rbuf = 32;
  MPI_Status status;
  MPI_Init(&argc, &argv);

  /* Find out my identity in the default communicator */
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
  if (myrank == 0) {
    MPI_Send(&sbuf,           /* message buffer */
             1,               /* one data item */
             MPI_INT,         /* data item is an integer */
             1,               /* destination process rank */
             99,              /* user chosen message tag */
             MPI_COMM_WORLD); /* default communicator */
  } else {
    MPI_Recv(&rbuf, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);
    printf("received: %i\n", rbuf);
  }

  MPI_Finalize();
}

MPI-2/3: Greatly enhanced functionality

 Support for shared memory in SMM domains

 Support for Remote Memory Access programming
. Direct use of RDMA
. Essentially PGAS (a small sketch follows below)

 Enhanced support for message passing communication
. Scalable topologies
. More nonblocking features
. … many more
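A minimal sketch of the one-sided (RMA) style with MPI-3 windows, to show why it is “essentially PGAS”: every process exposes a slice of its memory, and remote processes read or write it directly. The window size and values are invented for the illustration; run with at least two processes.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  int rank, *buf;
  MPI_Win win;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // each process exposes one integer in a window (a partitioned global address space)
  MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &buf, &win);
  *buf = -1;

  MPI_Win_fence(0, win);                                 // open an access epoch
  if (rank == 0) {
    int val = 42;
    MPI_Put(&val, 1, MPI_INT, 1, 0, 1, MPI_INT, win);    // write into rank 1's memory
  }
  MPI_Win_fence(0, win);                                 // close the epoch: the Put is now visible

  if (rank == 1) printf("rank 1 sees %d\n", *buf);

  MPI_Win_free(&win);
  MPI_Finalize();
  return 0;
}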

MPI: de-facto large-scale programming standard

Basic MPI
Advanced MPI, including MPI-3 (to appear at SC14, 11/17/2014)

Accelerator example: CUDA

 Hierarchy of Threads

Source: NVIDIA

 Complex Memory Model

Simple Architecture

Accelerator example: CUDA

#define N 10

// The Kernel: each block handles the data at one index
__global__ void add(int *a, int *b, int *c) {
  int tid = blockIdx.x;          // handle the data at this index
  if (tid < N)
    c[tid] = a[tid] + b[tid];
}

// Host Code
int main(void) {
  int a[N], b[N], c[N];
  int *dev_a, *dev_b, *dev_c;

  // allocate the memory on the GPU
  cudaMalloc((void**)&dev_a, N * sizeof(int));
  cudaMalloc((void**)&dev_b, N * sizeof(int));
  cudaMalloc((void**)&dev_c, N * sizeof(int));

  // fill the arrays 'a' and 'b' on the CPU
  for (int i = 0; i < N; i++) { a[i] = i; b[i] = i * i; }

  // copy the arrays 'a' and 'b' to the GPU
  cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
  cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

  // launch one block per element
  add<<<N,1>>>(dev_a, dev_b, dev_c);

  // copy the array 'c' back from the GPU to the CPU
  cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);

  // free the memory allocated on the GPU
  cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
}

OpenACC / OpenMP 4.0

 Aims to simplify GPU programming

 Compiler support . Annotations!

#define N 10

int main(void) {
  int a[N], b[N], c[N];
  #pragma acc kernels
  for (int i = 0; i < N; ++i)
    c[i] = a[i] + b[i];
}

More programming models/frameworks

 Not covered:
. SMM: Intel / Cilk Plus, Intel TBB, …
. Directives: OpenHMPP, PVM, …
. PGAS: Coarray (Fortran 2008), …
. HPCS: IBM X10, Fortress, Chapel, …
. Accelerator: OpenCL, C++AMP, …

 This class will not describe any model in more detail! . There are too many and they will change quickly (only MPI made it >15 yrs)

 No consensus, but fundamental questions remain:
. Data movement
. Synchronization
. Memory Models
. Algorithmics
. Foundations

Goals of this lecture

 Motivate you!

 What is parallel computing? . And why do we need it?

 What is high-performance computing? . What’s a Supercomputer and why do we care?

 Basic overview of
. Programming models (some examples)
. Architectures (some case studies)

 Provide context for coming lectures


Architecture Developments

<1999: distributed memory machines communicating through messages
’00–’05: large cache-coherent multicore machines communicating through coherent memory access and messages
’06–’12: large cache-coherent multicore machines communicating through coherent memory access and remote direct memory access
’13–’20: coherent and non-coherent manycore accelerators and multicores communicating through memory access and remote direct memory access
>2020: largely non-coherent accelerators and multicores communicating through remote direct memory access

Sources: various vendors

Case Study 1: IBM POWER7 IH (BW)

NPCF Blue Waters System

[Figure: building-block hierarchy: P7 chip (8 cores) → SMP node (32 cores) → drawer (256 cores) → SuperNode (1,024 cores; 32 nodes / 4 CEC) → Blue Waters system, with on-line and near-line storage and L-Link cables.]

Source: IBM/NCSA, IBM

POWER7 Core

Source: IBM/NCSA, IBM

POWER7 Chip (8 cores)

 Base Technology
. 45 nm, 576 mm²
. 1.2 B transistors

 Chip
. 8 cores
. 4 FMAs/cycle/core
. 32 MB L3 (private/shared)
. Dual DDR3 memory: 128 GiB/s peak bandwidth (1/2 byte/flop)
. Clock range of 3.5 – 4 GHz
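A quick check of the peak numbers above: 4 FMAs/cycle/core are 8 flops/cycle, so one chip delivers roughly $8 \text{ cores} \times 8 \text{ flops/cycle} \times 4\,\text{GHz} \approx 256$ GF/s at the top of the clock range, and 128 GiB/s of memory bandwidth then corresponds to about 0.5 bytes per flop.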

Quad Chip Module (4 chips)
. 32 cores: 1 TF = 32 cores * 8 F/core * 4 GHz
. 4 threads per core (max), 128 threads per package
. 4 x 32 MiB L3 cache
. 512 GB/s RAM bandwidth (0.5 B/F)
. 800 W (0.8 W/F)

Source: IBM/NCSA, IBM

[Figure: POWER7 quad-chip module floorplan: four P7 chips with L3 caches, memory controllers (MC0/MC1), and W/X/Y/Z SMP links. Source: IBM]

Adding a Network Interface (Hub)

P7-IH quad block diagram: 32-way SMP with 2-tier SMP fabric, 4-chip processor MCM + hub SCM with on-board optics

 Connects the QCM to PCI-e
. Two 16x and one 8x PCI-e slot

 Connects 8 QCMs via a low-latency, high-bandwidth copper fabric
. Provides a message passing mechanism with very high bandwidth
. Provides the lowest possible latency between the 8 QCMs

[Figure: quad block diagram with the hub chip module (28x transmit/receive pairs @ 10 Gb/s) and inter-hub board-level L-buses (3.0 Gb/s @ 8B+8B, 90% sustained peak). Source: IBM/NCSA, IBM]

POWER7 IH Hub (1.1 TB/s)

 192 GB/s host connection
 336 GB/s to 7 other local nodes (LL0–LL6 buses, copper)
 240 GB/s to local-remote nodes (optical L-remote buses)
 320 GB/s to remote nodes (D0–D15 buses, optical)
 40 GB/s to general purpose I/O (PCI-e)
 cf. “The PERCS interconnect” @ HotI’10

[Figure: Torrent hub chip with HUB-to-QCM connections (address/data) and the L-local, L-remote, and D-bus link groups. Source: IBM/NCSA, IBM]

Source: IBM/NCSA, IBM

P7 IH Drawer
. 8 nodes (QCM 0–7), 32 chips, 256 cores
. First Level Interconnect (L-local): HUB-to-HUB copper wiring

[Figure: drawer layout: 8 QCMs (P7-0…P7-3 each), 8 hubs (HUB 0–7), memory, optical D-Link/L-Link connectors, and DCA power connectors. Source: IBM/NCSA, IBM]

POWER7 IH Drawer @ SC09 / P7-IH Board Layout

P7 IH Supernode

Second Level Interconnect (1,024 cores)
. Optical ‘L-Remote’ links from the HUB
. 4 drawers
. 1,024 cores

[Figure: supernode built from 4 drawers; labels include “4.6 TB/s BW”, “1150 bisection BW”, “Super Node (32 Nodes / 4 CEC)”, and 10G-E ports. Source: IBM/NCSA, IBM]

Case Study 2: IBM Blue Gene/Q packaging

1. Chip: 16 cores
2. Module: single chip
3. Compute Card: one single-chip module, 16 GB DDR3 memory
4. Node Card: 32 compute cards, optical modules, link chips, torus (512 cores)
5a. Midplane: 16 node cards (8,192 cores)
5b. I/O Drawer: 8 I/O cards, 8 PCIe Gen2 slots
6. Rack: 2 midplanes, 1, 2 or 4 I/O drawers (16,384 cores)
7. System: 20 PF/s (~2 Mio cores)

Source: IBM, SC10

Blue Gene/Q Compute Chip
 System-on-a-Chip design: integrates processors, memory and networking logic into a single chip
. 360 mm² Cu-45 technology (SOI)
. ~1.47 B transistors

 16 user + 1 service processors
. plus 1 redundant processor
. all processors are symmetric
. each 4-way multi-threaded
. 64-bit PowerISA™
. 1.6 GHz
. L1 I/D cache = 16 kB / 16 kB
. L1 prefetch engines
. each processor has a Quad FPU (4-wide double precision, SIMD)
. peak performance 204.8 GFLOPS @ 55 W
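The peak figure above follows from the core design: a 4-wide double-precision FMA is 8 flops/cycle, so $16 \text{ cores} \times 8 \text{ flops/cycle} \times 1.6\,\text{GHz} = 204.8$ GF/s per chip.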

 Central shared L2 cache: 32 MB
. eDRAM
. multiversioned cache / transactional memory / speculative execution
. supports atomic ops

 Dual memory controller
. 16 GB external DDR3 memory
. 1.33 Gb/s
. 2 * 16-byte-wide interface (+ECC)

 Chip-to-chip networking . Router logic integrated into BQC chip.

Source: IBM, PACT’11

Blue Gene/Q Network

 On-chip external network
 Message Unit
 Torus Switch
 Serdes
 Everything!
 Only 55-60 W per node
 Top of Green500 and GreenGraph500

Source: IBM, PACT’11

Case Study 3: Cray Cascade (XC30)

 Biggest current installation at CSCS!  . >2k nodes

 Standard Intel x86 Sandy Bridge Server-class CPUs

Source: Bob Alverson, Cray

Cray Cascade Network Topology

 All-to-all connection among groups (“blue network”)

Source: Bob Alverson, Cray

 What does that remind you of?

Goals of this lecture

 Motivate you!

 What is parallel computing? . And why do we need it?

 What is high-performance computing? . What’s a Supercomputer and why do we care?

 Basic overview of . Programming models Some examples . Architectures Some case-studies

 Provide context for coming lectures


DPHPC Lecture

 You will most likely not have access to the largest machines
. But your desktop/laptop will be a “large machine” soon
. HPC is often seen as the “Formula 1” of computing (architecture experiments)

 DPHPC will teach you concepts!
. Enables you to understand and use all parallel architectures
. From a quad-core mobile phone to the largest machine on the planet! (MCAPI vs. MPI – same concepts, different syntax)
. No particular language (but you should pick/learn one for your project!)

Parallelism is the future:

Related classes in the SE focus

 263-2910-00L Program Analysis http://www.srl.inf.ethz.ch/pa.php Spring 2015 Lecturer: Prof. M. Vechev

 263-2300-00L How to Write Fast Numerical Code http://www.inf.ethz.ch/personal/markusp/teaching/263-2300-ETH- spring14/course.html Spring 2015 Lecturer: Prof. M. Pueschel

 This list is not exhaustive!

DPHPC Overview
