
Computer Architectures

D. Pleiter

Jülich Supercomputing Centre and University of Regensburg

November 2018


Overview

Introduction

Principles of Computer Architectures

Processor Core Architecture

Memory Architecture

Processor Architectures

Network Architecture

Exascale Challenges


Content: Introduction


Computational Science

• Computational Science = multi-disciplinary field that uses advanced computing capabilities to understand and solve complex problems
• Key elements of computational science:
  • Application science areas
  • Algorithms and mathematical methods
  • Computer science, including areas like
    • Computer architectures
    • Performance modelling


Motivation

• Know-how on computer architectures is useful for application developers:
  • How much could I optimize my application?
  • Which performance can I expect on a given architecture?
  • Which performance can I expect for a given algorithm?
• Deep knowledge of computer architectures is required to develop application-optimized computing infrastructures:
  • Which hardware features are required by a given application?
  • Which are the optimal design parameters?


Top500 List

• Performance metric:
  • Floating-point operations per time unit while solving a dense linear set of equations
    → High-Performance LINPACK (HPL) benchmark
• Criticism:
  • Workload not representative
  • Problem size can be freely tuned
  • But: allows for long-term comparison
• First list published in June 1993
• New, additional metric being established: HPCG
  • Workload: sparse linear system solver based on CG (conjugate gradient)


Top500 Trend: Rmax

[Figure: Rmax trend of the Top 1, Top 1-10 and Top 1-100 systems in GFlop/s, June 1993 to June 2018]

Peak performance of the top 100 systems doubled every 13.5 months for many years; growth has now slowed down to doubling about every 26.8 months.


Content: Principles of Computer Architectures


Problem Solving Abstraction

• Numerical problem solving abstraction:

  problem → formal/mathematical solution → computational algorithm → sequence of tasks → computer code implementation

• Execution of computer code = computation
• What is computation?
  • A process of manipulating data according to simple rules
  • Let the data be represented by w ∈ {0, 1}^n:

    w^(0) → w^(1) → · · · → w^(i) → · · ·


Computer Model: Deterministic Finite Automata (DFA)

• Informal definition: perform automatic operations on some given sequence of inputs
• Example: automatic door
  • Inputs:
    • Sensor front pad
    • Sensor rear pad
  • States:
    • Door open
    • Door closed


Computer Model: DFA (cont.)

A Deterministic Finite Automaton (DFA) is defined by
• A finite set of states Q
  • Example: Q = {door open, door closed}
• An alphabet Γ
  • An alphabet is a finite set of letters
  • Example: Γ = {front, rear, both, neither}
• A transition function δ : Q × Γ → Q
• A start state q0 ∈ Q
• A set of final states F ⊆ Q
  • Allows to define when a computation is successful


Computer Model: DFA (cont.)

Deterministic Finite Automaton (DFA) for automatic door:

• Q = {open, closed}
• Γ = {front, rear, both, neither}
• Transitions:
  • δ(closed, front) = open
  • δ(closed, rear) = δ(closed, both) = δ(closed, neither) = closed
  • δ(open, front) = δ(open, rear) = δ(open, both) = open
  • δ(open, neither) = closed
• q0 = closed
• F = Q

Graphical representation: Finite State Machine (FSM) diagram
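As a sketch of how such a DFA maps to code, a transition-table implementation in C; the enum names and table encoding are illustrative, not part of the slides:

  #include <stdio.h>

  typedef enum { CLOSED, OPEN } State;
  typedef enum { FRONT, REAR, BOTH, NEITHER } Input;

  /* delta: Q x Gamma -> Q, encoded as a lookup table */
  static const State delta[2][4] = {
      /* CLOSED: */ { OPEN, CLOSED, CLOSED, CLOSED },
      /* OPEN:   */ { OPEN, OPEN,   OPEN,   CLOSED }
  };

  int main(void) {
      Input word[] = { FRONT, NEITHER, REAR, NEITHER };
      State q = CLOSED;                    /* start state q0 */
      for (unsigned i = 0; i < sizeof word / sizeof word[0]; i++)
          q = delta[q][word[i]];           /* one transition per input */
      printf("final state: %s\n", q == OPEN ? "open" : "closed");
      return 0;
  }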


Computer Model: Turing Machine

Turing Machine components: [Turing, 1936]
• Tape divided into cells
  • Each cell contains a symbol from some finite alphabet, which includes the special symbol "blank"
  • The tape is assumed to be infinite in both directions
• Head that can
  • Read and write symbols on the tape, and
  • Move along the tape left and right one cell at a time
• State register that stores the state of the Turing Machine
  • The number of states is finite
  • There is a special start state
• Table of instructions
  • The number of instructions is finite
  • The chosen instruction depends on the current state and input


Computer Model: Turing Machine (cont.)

• Notation: 1011 q3 0101
  • Left-hand side of the tape is 1011
  • Current state is q3
  • Head is immediately right of the state symbol, i.e. reading 0
  • Right-hand side of the tape is 0101
• Transition function δ : (Q \ F) × Γ → Q × Γ × {L, R}
  • Γ is the tape alphabet
• Example state table for Q = {A, B, C, HALT}:

       q = A               q = B               q = C
  Γ    Write  Move  State  Write  Move  State  Write  Move  State
  0    1      R     B      1      L     A      1      L     B
  1    1      L     C      1      R     B      1      R     HALT

• Differences compared to a DFA:
  • A Turing Machine performs write operations
  • The read-write head can move both to the left and to the right


Computer Model: Random-Access Machine (RAM)

• Components:
  • CPU with
    • PC: program counter
    • IR: instruction register
    • ACC: accumulator
  • Randomly addressable memory M
• Processing sequence:
  • Load MEM[PC] to IR
  • Increment PC
  • Execute the instruction in IR
  • Repeat


Computer Model: RAM (cont.)

Example set of instructions:

Instruction  Meaning
HALT         Finish computations
LD a         Load value from memory     ACC ← MEM[a]
LDI i        Load immediate value       ACC ← i
ST a         Store value                MEM[a] ← ACC
ADD a        Add value                  ACC ← ACC + MEM[a]
ADDI i       Add immediate value        ACC ← ACC + i
JUMP i       Update program counter     PC ← i
JZERO i      Update PC if ACC == 0      PC ← i


Computer Model: RAM (cont.)

Example: Add numbers stored in address range [0x100, 0x102]

PC    Instruction   Meaning
0x00  LD 0x100      Initialize accumulator
0x01  ADD 0x101     Add second number
0x02  ADD 0x102     Add third number
0x03  ST 0x103      Store result

Benefits and limitations of the RAM architecture:
• Memory is randomly addressable
• No direct support for memory address indirection
  • Work-around: modify the set of instructions during execution
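A minimal C sketch of the RAM processing sequence running the example program above; the opcode names match the slide, but the struct encoding is an illustrative assumption:

  #include <stdio.h>

  enum { HALT, LD, LDI, ST, ADD, ADDI, JUMP, JZERO };
  typedef struct { int op; int arg; } Insn;

  int main(void) {
      int mem[0x200] = {0};
      mem[0x100] = 1; mem[0x101] = 2; mem[0x102] = 3;

      Insn prog[] = {            /* the example from the slide: */
          { LD,   0x100 },       /* ACC <- MEM[0x100]           */
          { ADD,  0x101 },       /* ACC <- ACC + MEM[0x101]     */
          { ADD,  0x102 },       /* ACC <- ACC + MEM[0x102]     */
          { ST,   0x103 },       /* MEM[0x103] <- ACC           */
          { HALT, 0 }
      };

      int pc = 0, acc = 0;
      for (;;) {
          Insn ir = prog[pc++];  /* load MEM[PC] to IR, increment PC */
          switch (ir.op) {       /* execute the instruction in IR    */
          case HALT:  printf("result: %d\n", mem[0x103]); return 0;
          case LD:    acc = mem[ir.arg];  break;
          case LDI:   acc = ir.arg;       break;
          case ST:    mem[ir.arg] = acc;  break;
          case ADD:   acc += mem[ir.arg]; break;
          case ADDI:  acc += ir.arg;      break;
          case JUMP:  pc = ir.arg;        break;
          case JZERO: if (acc == 0) pc = ir.arg; break;
          }
      }
  }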


Von Neumann Architecture

[Neumann, 1945]
• Components defined by v. Neumann:
  • Central arithmetic (CA)
  • Central control (CC)
  • Memory (M)
  • Input (I)
  • Output (O)
• Simplified modern view:
  • Central Processing Unit
  • Memory
  • I/O chipset
  • Bus
• Memory is used for both instructions and data
  → Self-modifying code possible


Harvard Architecture

• “Von Neumann bottleneck”
  • Only a single path for loading instructions and data → bad processor utilization
• General purpose memory replaced by
  • Data memory
  • Instruction memory
• Independent data paths
  → Concurrent access possible
• Different address spaces


Modern (Simple) Processor Architecture

• Components:
  • Instruction decoder and scheduler
  • Execution pipelines
  • Register file
  • Caches (L1-I, L1-D, L2)
  • External memory
• Memory architecture:
  • From outside: von Neumann
    • General purpose memory
    • Common L2 cache
  • From inside: Harvard
    • Data L1 cache (L1-D)
    • Instruction L1 cache (L1-I)


Digression on Terminology

• Event = point of time t_i where something happens
  • An event has no duration
  • Examples:
    • Instruction execution completed
    • State changed
    • Counter reaches/exceeds a given value
• Latency = distance ∆t = t_j − t_i between two events with t_i < t_j
  • Also: response time
  • Examples:
    • Time between instruction issue and completion
    • Duration of a task execution


Digression on Terminology (cont.)

• Bandwidth = amount of items passing a particular interface per unit time
• If a transfer of l items has a latency ∆t, then the bandwidth is given by

  b = l / ∆t

• Equivalent to throughput = amount of work W completed per unit time
• Nominal bandwidth = the maximal rate B at which data can be transferred over a data bus


Pipelining

Optimize work using an assembly line:
• Weld car body
• Paint car
• Assemble motor
• Assemble wheels

This works well only if the time needed at all stations is similar.


Pipeline: Time Diagram

         P0  P1  P2  P3  P4
  t0     C0
  t0+1   C1  C0
  t0+2       C1  C0
  t0+3           C1  C0
  t0+4               C1  C0
  t0+5                   C1

Without pipeline: need 5 time units to finish 1 car
With pipeline: need 6 time units to finish 2 cars


Pipeline: Time Diagram (cont.)

         P0  P1  P2  P3  P4
  t0     C0
  t0+1   C1  C0
  t0+2   C2  C1  C0
  t0+3   C3  C2  C1  C0
  t0+4   C4  C3  C2  C1  C0
  t0+5   C5  C4  C3  C2  C1
  t0+6       C5  C4  C3  C2
  t0+7           C5  C4  C3
  t0+8               C5  C4
  t0+9                   C5

• Pipeline filling: t = t0 ... (t0 + 3)
• Pipeline draining: t = (t0 + 6) ... (t0 + 9)
• Time to perform n operations in k pipeline stages, assuming a nominal throughput B = 1:

  ∆tp(n, k) = n + (k − 1)

  Here: ∆tp(6, 5) = 10


Pipeline Costs

• In our example: nominal throughput B = 1
• Throughput:

  b(n, k) = n / ∆tp(n, k) = n / (n + (k − 1))

  Note: lim_{n→∞} b(n, k) = B

• Gain (or speed-up):

  s(n, k) = scalar execution time / pipelined execution time = n k / (n + (k − 1))

  Note: lim_{n→∞} s(n, k) = k

• Efficiency:

  ε(n, k) = s(n, k) / k = n / (n + (k − 1))
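A small C sketch evaluating these formulas for the car example above (n = 6 operations, k = 5 stages):

  #include <stdio.h>

  /* latency of n operations in a k-stage pipeline, B = 1 assumed */
  static double dt_p(int n, int k) { return n + (k - 1); }

  int main(void) {
      int n = 6, k = 5;
      double b   = n / dt_p(n, k);               /* throughput */
      double s   = (double)(n * k) / dt_p(n, k); /* speed-up   */
      double eff = s / k;                        /* efficiency */
      printf("dt=%.0f  b=%.3f  s=%.3f  eff=%.3f\n", dt_p(n, k), b, s, eff);
      return 0;   /* prints dt=10  b=0.600  s=3.000  eff=0.600 */
  }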


Pipelined Design

Digital pipeline with k stages, synchronized by a clock:

  input → [f1 | reg] → [f2 | reg] → · · · → [fk | reg] → output

• Each pipeline stage consists of a
  • Functional/combinatorial step
  • Register
• Execution of a pipeline stage takes a fixed time unit
• The pipeline is synchronized by a clock
  • Synchronous vs. asynchronous pipelines


Pipeline Design (2)

Waveform representation of the pipeline:

[Waveform: clock; input carrying A0/B0; register r1 holding A1/B1; register r2 holding A2/B2; output holding A3/B3]

• In each combinatorial step the values change: A0 → A1 → · · · → A3
• The content of the registers is sampled at the rising clock edge


Pipeline: Loop Unroll

Consider the copy of a vector: y_i ← x_i (i = 0, ..., n − 1)

Simple implementation:

  for (i = 0; i < n; i++) {
      y[i] = x[i];
  }

Implementation with unrolling: (assume n a multiple of 4)

  for (i = 0; i < n; i += 4) {
      y[i]   = x[i];
      y[i+1] = x[i+1];
      y[i+2] = x[i+2];
      y[i+3] = x[i+3];
  }

Note: leave unrolling to the compiler!


Pipelined Architecture Example

• Consider the following simple architecture: a register file feeding a pipelined ALU
• Components:
  • Register file = collection of registers with 2 output ports and 1 input port
  • Pipelined arithmetic and logic unit (ALU)
• Instructions are executed in order
• Performance parameters of the arithmetic unit:
  • Throughput B = 1
  • Start-up latency λ = 4 clock cycles


Digression: Pseudo-Assembler

• Pseudo → not an existing Instruction Set Architecture
• General syntax (to be extended later; the placeholder notation was lost in extraction and is reconstructed from the examples):

  instr imm|Rs [, Rs ...], Rd

  where
    instr ... instruction
    Rd    ... destination register
    imm   ... immediate value
    Rs    ... source register

• Examples:

  mov 0xd, R0      !! Move constant value to R0
  mov R0, R1       !! Move content of register R0 to R1
  add R0, R1, R2   !! Integer addition: R2 = R0 + R1
  mul R0, R1, R2   !! Integer multiplication: R2 = R0 * R1


Pipelined Architecture Example (2)

• Consider the following operation: z = a × b + c − d (a, b, c, d integer values)
• Pseudo-assembler
  • Assume a, b, c, d are already loaded into registers R0, R1, R2, R3

  imul R0, R1, R5   ! store a × b in register 5
  iadd R5, R2, R6   ! add c
  isub R6, R3, R7   ! subtract d

• Instructions are not independent: hazard due to data dependence
• Scheduling of instructions must be aware of hazards


Data hazards

• An instruction j is data dependent on instruction i if either
  • Instruction i produces a result that may be used by instruction j, or
  • Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i
• Read-after-write (RAW) hazard: instruction j tries to read a source before it is updated by instruction i
• Example of a wrong scheduling of the instructions:

  imul R0, R1, R5
  isub R6, R3, R7   ! register R6 not written, yet
  iadd R5, R2, R6


Data Hazards (2)

Other data hazards can occur in more complicated architectures:
• Write-after-write (WAW) hazard: instruction j tries to write to a destination before it is written by instruction i
  → The result of instruction i remains in the destination
  • Cannot occur in our simple architecture with a single pipeline of fixed depth
• Write-after-read (WAR) hazard: instruction j tries to write a destination before it is read by instruction i
  → Instruction i will read the already updated value
  • Cannot occur in our simple architecture with a single pipeline of fixed depth


Scheduling

• To resolve data hazards the processor may have to stall
• Assume the following hardware parameters: β_int = 1, λ_int = 4 clock cycles
• Consider the example z = a × b + c − d:

  1    imul R0, R1, R5   ! store a × b in register 5
  2-4  stall
  5    iadd R5, R2, R6   ! add c
  6-8  stall
  9    isub R6, R3, R7   ! subtract d

• Latency of the operation: 3 · 4 clock cycles


Data Hazards and Optimization

• Consider the following example:

  z = a × b + c − d

• Data dependence graph:

  imul —RAW→ iadd —RAW→ isub

• Latency of the operation: 3 · ∆tp(1, 4) = 12 clock cycles


Data Hazards and Optimization (cont.)

• Rewrite the example as follows:

  x = a × b
  y = c − d
  z = x + y

• Data dependence graph:

  imul —RAW→ iadd
  isub —RAW→ iadd

• Latency: ∆tp(2, 4) + ∆tp(1, 4) = (5 + 4) clock cycles = 9 clock cycles


Parallel Execution

• Taking advantage of parallelism is one of the most important methods for improving performance
• Today: several levels of parallelism
  • Instruction Level Parallelism (ILP): multiple instruction pipelines
  • Vector/SIMD instructions: Single Instruction performing the same operation on Multiple Data
  • Simultaneous Multi-Threading (SMT): concurrent execution of multiple instruction streams within the same processor core
  • Multiple cores per processor
  • Multiple processors: multiple processors per node, multiple nodes per system
• Problem: efficient exploitation of parallelism


Performance Metric

• Parallel speedup using P processes:

  S(P) = ∆t_sequential / ∆t_parallel(P)

• Parallel efficiency:

  ε = S(P) / P

• Example: S = 26 using P = 32 processing elements ⇒ ε ≈ 81%
• Super-linear speedup: S > P

[Figure: speedup S vs. number of processes P, with ideal scaling S = P as reference]


Amdahl’s Law

• Consider the following case:
  • A fraction f of the application can be parallelized, resulting in a speed-up by a factor S
  • The remaining fraction 1 − f stays serial
• Execution time after acceleration:

  ∆t′ = (1 − f) ∆t + (f/S) ∆t

• Total speed-up:

  S_tot = ∆t / ∆t′ = 1 / ((1 − f) + f/S)

[Figure: S_tot vs. f for S = 10, and S_tot vs. S for f = 80%]


Gustafson’s Law

• Amdahl’s law only considers scaling with a fixed workload → strong scaling
• Different ansatz by Gustafson:
  • Increase the workload proportionally to the number of processing elements P → weak scaling
  • Assume the non-parallelized code part to be independent of the workload
• Execution times:
  • Serial execution:
    • Execution time for the original problem: ∆t
    • Execution time for the scaled problem size: (1 − f) ∆t + f P ∆t
  • Parallel execution:
    • Execution time for the scaled problem size: (1 − f) ∆t + f ∆t = ∆t
• Speed-up:

  S = ((1 − f) ∆t + f P ∆t) / ((1 − f) ∆t + f ∆t) = P − (P − 1)(1 − f)


Parallel Efficiency: Amdahl vs. Gustafson

• Parallel efficiency according to Amdahl:

  ε_Amdahl(P) = S(P) / P = (1/P) · 1 / ((1 − f) + f/P) = 1 / (P (1 − f) + f)

• Parallel efficiency according to Gustafson:

  ε_Gustafson(P) = S / P = 1 − ((P − 1)/P) (1 − f) = f + (1 − f)/P

• Limit of a large number of processing elements P:

  lim_{P→∞} ε_Amdahl(P) = 0
  lim_{P→∞} ε_Gustafson(P) = f
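A short C sketch evaluating both efficiency formulas for f = 0.8 (an illustrative value), making the two limits visible as P grows:

  #include <stdio.h>

  int main(void) {
      double f = 0.8;
      for (int P = 1; P <= 100000; P *= 10) {
          double eps_a = 1.0 / (P * (1.0 - f) + f);   /* Amdahl    */
          double eps_g = f + (1.0 - f) / P;           /* Gustafson */
          printf("P=%6d  eps_Amdahl=%.4f  eps_Gustafson=%.4f\n",
                 P, eps_a, eps_g);
      }
      return 0;   /* eps_Amdahl -> 0, eps_Gustafson -> f */
  }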


Content: Processor Core Architecture


Processor Architecture: Core

• Multi-core architectures have become the default
• Typical components of a core:
  • Front-end
    • Instruction fetching
    • Instruction decoding
    • Branch prediction
  • Execution engine
    • Resource allocation
    • Instruction execution
    • Instruction retirement

[Figure: processor with cores #0 ... #N−1, each with L1-I/L1-D and an L2 cache, attached to L3 slices, memory controller (MC), I/O and inter-processor bus (IPB)]


Processor Architecture: Other

• Memory hierarchy
  • Caches
  • Memory controller (MC)
• I/O controller
  • Interface to I/O devices
  • Example: network controller
• Inter-Processor Bus (IPB)
  • Required for multi-socket designs only
• Intra-processor network
  • Ring or mesh network
  • Cross-bar switch


Instruction Set Architecture (ISA)

• Relevant ISA categories:
  • CISC: Complex Instruction Set Computer
  • RISC: Reduced Instruction Set Computer
  • VLIW: Very Long Instruction Word
• Classification based on internal memory architecture
  • Today's architectures are typically general-purpose register architectures
  • Load-store architectures
    • Explicit load-store operations only; operands use register addresses only
    • Examples: POWER ISA, Arm ISA
  • Register-memory architectures
    • Operands may be in memory
    • Example: x86 ISA


ISA: Operators

Selected categories of instruction operators:

Operator type           Examples
Data transfer           Load and store operations
Arithmetic and logic    Integer arithmetic, bitwise operations, compare operations
Control                 Branch, jump, procedure call and return
Floating point          Floating point arithmetic
Mathematical functions  Inverse, square root


ISA: Memory Addressing

Classification according to memory addressing features:
• Memory addressing
  • Granularity: today typically byte addressing
  • Alignment requirements
• Addressing modes (R9 ... destination register):

Mode               Pseudo-assembler     Description
Register           mov R0, R9           Value from register
Immediate          mov 0x3, R9          Constant value
Register indirect  ld (R1), R9          Value from location referenced by pointer
Displacement       ld 0x100(R1), R9     Pointer plus a fixed offset
Indexed            ld (R0+R1), R9       Pointer plus run-time offset
Scaled             ld 0x100(R0,2), R9   Pointer plus scaled run-time offset


Introduction to the x86 ISA

• Dominant ISA for desktop computers
• Introduced 1978 → very long-term compatibility
• Meanwhile various variants and extensions:
  • Addition of vector instructions for multimedia applications
  • Extension from 16- to 32- to 64-bit architecture
• General features:
  • CISC ISA
  • Register-memory architecture
  • Variable instruction size (8, 16, 24, 32, 40, 48 bits)
  • No data alignment required (few exceptions)


x86 ISA: Data Transfer Instructions

Instructions for explicit transfer of data between memory and registers: MOV and variants

Instruction      Description
mov m32, r32     Load 32-bit value to a register
mov r32, m32     Store 32-bit value to memory
mov r32, r32     Move 32-bit value within the register file
mov imm, r32     Move 32-bit immediate value to register
mov imm, m32     Move 32-bit immediate value to memory
mov m64, r64     Load 64-bit value to a register
mov r64, m64     Store 64-bit value to memory
movq m64, xmm    Load 64-bit value into an XMM register
movq xmm, m64    Store 64-bit value from an XMM register


x86 ISA: Addressing Modes

The x86 ISA supports 7 different addressing modes. General syntax:

  disp(base, index, scale)

Mode                                      Example
Absolute                                  mov (1024), %eax
Register indirect                         mov (%ebx), %eax
Base-indexed                              mov (%ebx,%ecx), %eax
Based indexed with displacement           mov 1024(%ebx), %eax
Based with scaled index                   mov (%ebx,%ecx,2), %eax
Based with scaled index and displacement  mov 1024(%ebx,%ecx,2), %eax


x86 ISA: Arithmetics and Logical Instructions

• Integer arithmetic
  • Integer addition (destination is overwritten):

    Operation                            Example
    Add immediate to register content    add $3, %eax
    Add two register values              add %ebx, %eax

  • Similar instructions:
    • Subtraction: sub
    • Signed/unsigned multiplication: imul / mul
    • Integer division: idiv
• Instructions for bitwise operations
  • Logical shift left/right: shl, shr
  • Arithmetic shift left/right: sal, sar
  • Bitwise AND/OR/XOR/NOT: and, or, xor, not


x86 ISA: Arithmetics and Logical Instructions (2)

• Comparison instructions
  • Compare 32-/64-bit values using the cmp instruction:

    Operation                            Example
    Compare register content with zero   cmp $0, %eax
    Compare contents of 2 registers      cmp %eax, %ebx

• Operations performed for a comparison instruction:
  • Subtraction of the operands
  • Update of the status flags in the EFLAGS register depending on the result
• Operations performed for the logical compare instruction test:
  • Bitwise AND of the operands
  • Update of the status flags in the EFLAGS register depending on the result
• The EFLAGS register is used by conditioned instructions


Processor Core Performance

• Program latency (= time-to-solution) may be written as

  ∆t = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)

• Options to reduce time-to-solution:
  • Reduce the number of instructions
  • Decrease the cycles per instruction (CPI)
  • Increase the clock rate


Optimising Pipelined Architectures

• CPI = Cycles Per Instruction
• Assume a pipelined architecture:

  Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls

• Structural stalls = resource conflicts in the functional units of a pipeline
• Possible causes for structural stalls:
  • Functional units with different throughput B
  • A memory bus used for both data and instructions → conflicting load operations
• CPI can be reduced by
  • Reducing the ideal pipeline CPI
  • Reducing stalls
  • Exploiting Instruction Level Parallelism (ILP)


Instruction Level Parallelism (ILP)

• ILP = parallelism among instructions from small code regions which are independent of one another
• Exploitation of ILP:
  • Overlapping of instructions in a pipeline
  • Parallel execution of instructions in multiple functional units
  • Vector instructions


ILP: Pipeline

• Consider the previous example z = a × b + c − d:

  1    mul R0, R1, R5   ! store a × b in register 5
  2-4  stall
  5    add R5, R2, R6   ! add c
  6-8  stall
  9    sub R6, R3, R7   ! subtract d

• Re-organisation to improve pipeline filling:

  1    mul R0, R1, R5   ! store a × b in register 5
  2    sub R2, R3, R6   ! store c − d in register 6
  3-5  stall
  6    add R5, R6, R7   ! add a × b and c − d


ILP: Pipeline (2)

Data dependence diagram for the re-ordered schedule, with the previously omitted load-store operations included:

  ld a  ld b    ld c  ld d
    \    /        \    /
     (*)           (−)
        \         /
           (+)
            |
          st z


ILP: Multiple-issue of Instructions

• Allow multiple instructions to be issued within the same clock cycle
• Flavours of multiple-issue processors:
  • Very Long Instruction Word (VLIW) processors
    • A fixed number of instructions is scheduled per clock cycle
    • Static scheduling at compile time
  • Statically scheduled superscalar processors
    • Superscalar processor = processor which can schedule ≥ 1 instruction per clock cycle
    • A variable number of instructions is scheduled per clock cycle and executed in-order
  • Dynamically scheduled superscalar processors
    • A variable number of instructions is scheduled per clock cycle and executed out-of-order


ILP: Vector instructions

• Vector instructions exploit data-level parallelism by operating on several data items in parallel
• E.g., vector add:

  (z0, z1, z2, z3) ← (x0, x1, x2, x3) + (y0, y1, y2, y3)

• Example ISA extensions:
  • Intel Streaming SIMD Extensions (SSE)
  • Intel Advanced Vector Extensions (AVX)
  • POWER ISA (VMX/AltiVec, VSX)
  • ARMv7 NEON, ARMv8, ARM SVE


AVX512 ISA

• 512-bit wide vector units processing packed or scalar operands
  • 16× 32-bit
  • 8× 64-bit
• Support of masks: merging (m) and zeroing (z) masking
• Vector registers zmm0, zmm1, ..., zmm31
• 3-4 operands
  • One source can be a memory reference
  • The destination can be a memory reference

[Figure: packed operation Zi = Xi op Yi on lanes 0-7 under mask bits mi, and the corresponding scalar operation updating only lane 0]


AVX512 ISA (2)

• Types of instructions:
  • Load/store instructions
  • Arithmetic and bitwise instructions
  • Shuffle and unpack instructions
  • Memory prefetch instructions
• Multiple sets of AVX512 instructions: F, CD, ER, PF, ...
  • On Linux see /proc/cpuinfo to identify the supported sets


AVX512 ISA: Floating Point Instructions

• In the following we assume all mask bits set to '1'
• Floating point instructions operate on
  • Single (S) or double (D) precision operands
  • Scalar (S) or packed (P) operands
• Scalar operations update only the lowest part of the destination register
  • VADDSD: result is (Z7, ..., Z1, X0+Y0)
• Packed operations update all parts
  • VADDPD: result is (X7+Y7, ..., X0+Y0)
  • VADDPS: result is (X15+Y15, ..., X1+Y1, X0+Y0)


AVX512 ISA: Shuffle Instructions

• VSHUFPD zmm1{k1}{z}, zmm2, zmm3/m512, imm8
  Re-orders double precision floats depending on the immediate value
• Each immediate bit selects the higher (1) or lower (0) element of a pair; the result interleaves selections from both sources:

  (Y6 or Y7, ..., Y2 or Y3, X2 or X3, Y0 or Y1, X0 or X1)

• Example: imm8 = 0x55 = b'01010101 will select (Y6, X7, Y4, X5, Y2, X3, Y0, X1)
• Use case: swap the two elements representing real and imaginary part of a double precision complex number


AVX512 ISA: Example

  #include <stdio.h>
  #include <immintrin.h>

  typedef union { __m512d v; double d[8]; } Pdouble;

  int main(void) {
      int i;
      Pdouble a, b, c;

      for (i = 0; i < 8; i++) {
          a.d[i] = b.d[i] = (double)i + 0.5;
      }

      c.v = _mm512_add_pd(a.v, b.v);

      printf("c = (%.2f,%.2f)\n", c.d[1], c.d[0]);

      return 0;
  }

Compile with: gcc -mavx512f -mavx512cd ...


Thread Level Parallelism

• Thread = smallest unit of processing that can be scheduled by the operating system
• Single core use case: hide pipeline stalls
  • Time-division multiplexing: if a stall occurs, schedule another thread
• Typical execution model: fork-join
  • A master thread forks threads #1 ... #N, which later join again
• Popular Application Programming Interfaces (APIs):
  • POSIX Threads
  • OpenMP


Simultaneous Multithreading (SMT)

• SMT architecture = architecture which allows > 1 thread per core to be executed simultaneously
  → TLP and ILP are exploited simultaneously
• Architectural requirements:
  • Dynamic management of resources, e.g.
    • Dynamic scheduling
    • Register renaming
  • Duplication of resources for each thread
    • Enlarged register file
    • Separate program counters (PC)
  • Capability for instructions from multiple threads to commit
• Operating system's view: multiple logical processors
• Also known as: hyperthreading, hardware threading


Control Flow

• Control flow refers to the ability to change the order in which instructions are executed
• Control flow instructions = jump or branch instructions
• Example loop:

  for (i = 0; i < 10; i++) {
      ... /* loop body */
  }

  • At the end of the loop, jump back to the beginning of the loop
  • If the condition i < 10 is false, jump out of the loop
• Types of control flow change:
  • Conditional branches
  • Jumps
  • Procedure calls
  • Procedure returns


Control Flow (2)

• Control flow instructions must have a destination address
  • Run-time address (e.g. return address)
  • Compile-time address label
• Jump instruction in the x86 ISA: jmp
  Example of an (infinite) loop:

      instrA
  .L2:
      instrB
      ...
      jmp .L2

• Branching typically causes performance degradation:
  • Pipelines must be drained and possibly re-filled
  • New instructions have to be loaded


Branch Prediction

• Techniques to minimize the costs of branching:
  • Avoid branching
    • Code re-organisation
    • Use conditioned instructions (e.g., x86 ISA: cmov)
    • Loop unrolling
  • Branch prediction
• Branch prediction = anticipate one branch being taken with higher probability than the other branch
  • Practical observation: an individual branch is often highly biased towards taken or untaken
• Static branch prediction
  • The compiler generates code assuming a particular branch to be taken (e.g. branch hint instructions)
  • Decision based on an "educated guess" or profiling information
• Dynamic branch prediction


Branch Prediction Buffer

• 1-bit prediction scheme:
  • Instantiate a memory buffer with n bits
  • Depending on the branch address select a bit
  • If the bit is '1' take the branch, otherwise not
  • Flip the bit in case of misprediction
• 2-bit prediction scheme: change from "predict taken" to "predict not taken" (and vice versa) only after 2 mispredictions
  • Four states: 11 and 10 predict taken, 01 and 00 predict not taken; each observed outcome moves the state towards the corresponding prediction

[FSM diagram: states 11, 10 (predict taken) and 01, 00 (predict not taken) with taken/not-taken transitions]
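A C sketch of one common variant of the 2-bit scheme, a saturating counter (the outcome sequence is illustrative):

  #include <stdio.h>

  typedef unsigned char Ctr;               /* counter value 0..3 */

  static int predict(Ctr c) { return c >= 2; }   /* 10, 11: taken */
  static Ctr update(Ctr c, int taken) {
      if (taken  && c < 3) c++;            /* move towards 11 */
      if (!taken && c > 0) c--;            /* move towards 00 */
      return c;
  }

  int main(void) {
      Ctr c = 0;                                  /* start: predict not taken */
      int outcome[] = { 1, 1, 0, 1, 1, 1, 0, 0 }; /* observed branch outcomes */
      for (unsigned i = 0; i < 8; i++) {
          printf("predict %d, actual %d\n", predict(c), outcome[i]);
          c = update(c, outcome[i]);
      }
      return 0;
  }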


Basic Block

• Basic block = sequence of instructions with no branches in except to the entry and no branches out except at the exit
• Examples:
  • Loop body without control flow instructions, e.g.

    for (i = 0; i < n; i++) {
        y[i] = x[i];
    }

  • Subroutine without control flow instructions, e.g.

    double pow2(double x) {
        double y;
        y = x * x;
        return y;
    }


x86 ISA: Conditioned Instructions

• Conditioned instruction = execution of the instruction depends on (run-time) flags
• Conditioned instructions in the x86 ISA:
  • Conditioned branch operations, e.g.

    Jump if equal           je
    Jump if greater         jg
    Jump if less            jl
    Jump if less or equal   jle

  • Conditional move cmov
    • The data transfer is only performed if the condition is true
    • Use case: k = (i == 0) ? 0 : k


Assembler Generation

• Example with the GNU C-compiler: gcc -S -O0 myprog.c
  • The command will generate the output file myprog.s
  • Usually better to start with zero-level optimization (-O0)
• Use cases:
  • Verification of the sequence of instructions generated by the compiler
  • Starting point for own assembler implementations


Assembler Generation Example

C program:

  void func()
  {
      int i;
      for (i=0; i<100; i++);
  }

Assembler:

          .file   "func.c"
          .text
          .globl  func
          .type   func, @function
  func:
          pushl   %ebp
          movl    %esp, %ebp
          subl    $16, %esp
          movl    $0, -4(%ebp)
          jmp     .L2
  .L3:
          addl    $1, -4(%ebp)
  .L2:
          cmpl    $99, -4(%ebp)
          jle     .L3
          leave
          ret


Content: Memory Architecture


Memory in Modern Node Architectures

• Today’s compute nodes combine different memory types and technologies:
  • Volatile memories
    • Main memory (DDR3 or DDR4)
    • Caches (SRAM)
    • Accelerator memory (GDDR5 or HBM)
  • Non-volatile memories
    • SSD (NAND flash)
• Different capabilities:
  • Bandwidth
  • Latency for a single transfer
  • Access time
  • Cycle time
  • Capacity


Memory: DRAM

• Data bit storage cell = 1 capacitor + 1 transistor
• Capacitors discharge → periodic refresh needed
  • Refresh required in intervals of O(10) ms
  • Possibly significant performance impact
  • Memory access latencies become variable
• Memory organisation:
  • Memory arrays with separate row and column addresses → fewer address pins
  • Read operation: assert row address strobe (/ras) with the row address, then column address strobe (/cas) with the column address; the data then appears on the data lines

[Timing diagram: /ras, /cas, addr (row, col), /we, data]


Memory: DRAM (2)

• Technology improvements:
  • Synchronization with the system bus: Synchronous dynamic random-access memory (SDRAM)
  • Double Data Rate (DDR)
• DRAM typically packaged on small boards: DIMM = Dual Inline Memory Module
• Typical DRAM channel width: 8 Byte
• Performance range for a selected set of data rates:

                            DDR3        DDR4
  Data rate [Mbps]          400-2333    2133-3200
  Bandwidth [GByte/s/DIMM]  3.2-18.6    17.1-25.6


DRAM Organisation

Organisation of DRAM memory connected to a processor:
• Multiple memory channels per processor
• Multiple memory ranks attached to a single channel
  • Chip-select signals enable the individual ranks
• Multiple memory banks per rank
• Memory cells arranged in rows and columns

Physical address layout:

  | ch ID | rank ID | bank ID | row | col |


DRAM Organisation (cont.)

Typical design as of 2014:
• 4 Gbit memory die with 8-bit memory cells
  • 8 banks
  • 64k rows
  • 1k columns
• 8 memory dies per channel
  • Channel data bus width of 64 bit
• 4 ranks per channel
• 4 channels

Total memory capacity: 4 channels × 4 ranks × 8 dies × 4 Gbit = 512 Gbit = 64 GiByte


Memory: SRAM

• Data bit storage cell = (4 + 2) transistors
• Comparison with DRAM:

                     DRAM     SRAM
  Refresh required?  yes      no
  Speed              slower   faster
  Density            higher   lower

• Typically used for on-chip memory


Memory Access Locality

• Empirical observation: programs tend to reuse data and instructions they have used recently
• This observation can be exploited to
  • Improve performance
  • Optimize the use of more/less expensive memory
• Types of locality:
  • Temporal locality: recently accessed items are likely to be accessed in the near future
  • Spatial locality: items whose addresses are near one another tend to be referenced close together in time


Memory Access Locality: Example

  double a[N][N], b[N][N];

  /* loop body reconstructed to match the localities noted below */
  for (i = 0; i < N; i++)
      for (j = 0; j < N-1; j++)
          a[i][j] = b[j][0] + a[i][j+1];

• Assume the right-most index to be the fastest running index
• Temporal locality: b[j][0]
• Spatial locality: a[i][j] and a[i][j+1] (j + 1 < N)
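To make the effect of spatial locality tangible, a small C sketch contrasting row-wise (stride-1) and column-wise traversal of the same array; the array size and timing method are illustrative:

  #include <stdio.h>
  #include <time.h>

  #define N 2048
  static double a[N][N];

  int main(void) {
      clock_t t0 = clock();
      for (int i = 0; i < N; i++)             /* row-wise: follows the   */
          for (int j = 0; j < N; j++)         /* memory layout, stride 1 */
              a[i][j] += 1.0;
      clock_t t1 = clock();
      for (int j = 0; j < N; j++)             /* column-wise: stride of  */
          for (int i = 0; i < N; i++)         /* N*8 bytes per access    */
              a[i][j] += 1.0;
      clock_t t2 = clock();
      printf("row-wise %.3fs  column-wise %.3fs\n",
             (double)(t1 - t0) / CLOCKS_PER_SEC,
             (double)(t2 - t1) / CLOCKS_PER_SEC);
      return 0;
  }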


Memory Cache: Motivation

• Observations:
  • DRAM based memories are large but slow
    • In particular, fetching small amounts of data is inefficient
  • SRAM based memories are small but fast
  • Typical applications exhibit significant data locality
• Exploit these observations by introducing caches:
  • Cache = fast memory unit that allows data to be stored for quick access in the near future
• A cache is transparent for the application
  • Application programs do not require explicit load/store to/from the cache
  • Alternative concept: software managed scratch-pad memory
• Caches are typically hardware managed


Memory/storage Hierarchy

Computers provide multiple levels of volatile memory and non-volatile storage
• Important because of the processor-memory performance gap
• Close to the processor: very fast memory, but less capacity

  Level      Access time   Capacity
  Registers  O(0.1) ns     O(1) kByte
  L1 cache   O(1) ns       O(10-100) kByte
  L2 cache   O(10) ns      O(100) kByte
  L3 cache   O(10) ns      O(1-10) MByte
  Memory     O(100) ns     O(1-100) GByte
  Disk       O(10) ms      O(100-...) GByte


Memory: Cache

Cache miss = word not found in the cache
• The word has to be fetched from the next memory level
• Typically a full cache line (= multiple words) is fetched

Cache organisation: set associative cache
• Match a line onto a set and then place the line within the set
• Choice of the set address: (address) mod (number of sets)
• Types:
  • Direct-mapped cache: 1 cache line per set
    • Each line has only one place it can appear in the cache
  • Fully associative cache: 1 set only
    • A line can be placed anywhere in the cache
  • n-way set associative cache: n cache lines per set
    • A line can be placed in a restricted set of places (1 out of n different places) in the cache
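A C sketch of how an address is split into tag, set index and line offset for a set associative cache; the parameters (64-byte lines, 64 sets, i.e. a 32 KiB 8-way cache) are illustrative assumptions:

  #include <stdio.h>
  #include <stdint.h>

  #define LINE_BITS 6    /* 64-byte cache line */
  #define SET_BITS  6    /* 64 sets            */

  int main(void) {
      uint64_t addr   = 0x7ffe12345678ULL;
      uint64_t offset = addr & ((1ULL << LINE_BITS) - 1);
      /* set index = (line address) mod (number of sets) */
      uint64_t set    = (addr >> LINE_BITS) & ((1ULL << SET_BITS) - 1);
      uint64_t tag    = addr >> (LINE_BITS + SET_BITS);
      printf("tag=0x%llx set=%llu offset=%llu\n",
             (unsigned long long)tag, (unsigned long long)set,
             (unsigned long long)offset);
      return 0;
  }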


Memory: Cache (2)

Direct-mapped vs. 4-way set associative cache:

[Figure: a direct-mapped cache (one line per set) compared to a 4-way set associative cache (4 lines per set)]


Memory: Cache (3)

Replacement policies:
• Random:
  • Randomly select any possible line to evict
• Least-recently used (LRU):
  • Assume that a line which has not been used for some time is not going to be used in the near future
  • Requires keeping track of all read accesses
• First in, first out (FIFO):
  • Bookkeeping simpler to implement than in the case of LRU


Cache Write Policies

Issue: how to keep cache and memory consistent?
• Cache write policy design options:
  • Write-through cache
    • Update cache and memory
    • Memory writes cause a significant delay if the pipeline must stall
  • Write-back cache
    • Update the cache only and post-pone the update of memory
      → memory and cache will become inconsistent
    • If a cache line is dirty, it must be committed before being replaced
      • Add a dirty bit to manage this task

Issue: how to manage partial cache line updates?
• Typical solution: write-allocate (fetch-on-write)
  • The cache line is read before being modified


Cache misses

• Categories of cache misses:
  • Compulsory: occurs when a cache line is accessed for the first time
  • Capacity: re-fetch of cache lines due to lack of cache capacity
  • Conflict: cache line was discarded due to lack of associativity
• Average memory access time =

  hit time + miss rate · miss penalty

  Hit time ... time to hit in the cache
  Miss penalty ... time to load a cache line from memory


Cache misses (2)

Strategies to reduce the miss rate:
• Larger cache line size
  • If spatial locality is large, then other parts of the cache line are likely to be re-used
• Bigger caches
  • Reduce capacity misses
• Higher associativity
  • Reduces conflict misses
  • Hit time may increase


Cache misses (3)

Strategies to reduce the miss penalty:
• Multilevel caches
  • Technical problem: a larger cache means a higher miss penalty
  • Mitigation: larger but slower L2 cache:

    Average memory access time = Hit time_L1 + Miss rate_L1 · (Hit time_L2 + Miss rate_L2 · Miss penalty_L2)

• Giving priority to read misses over writes
  • Goal: serve reads before writes have completed
  • Implementation using write buffers
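A tiny C sketch evaluating the two-level formula with assumed numbers (all values illustrative):

  #include <stdio.h>

  int main(void) {
      double hit_l1 = 4.0,  miss_l1 = 0.05;                 /* cycles, rate */
      double hit_l2 = 12.0, miss_l2 = 0.20, pen_l2 = 200.0; /* cycles       */
      double amat = hit_l1 + miss_l1 * (hit_l2 + miss_l2 * pen_l2);
      printf("average memory access time = %.2f cycles\n", amat);
      return 0;   /* 4 + 0.05 * (12 + 0.2 * 200) = 6.60 cycles */
  }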


Cache: Inclusive vs. Exclusive

• Policies:
  • Inclusive cache = data present in one cache level is also present in the higher cache level
    • Example: all data in the L1 cache is also present in the L2 cache
  • Exclusive cache = data is present in at most one cache level
  • Intermediate policies:
    • Data available at a higher cache level is allowed to be available at lower cache levels
• Advantages of an inclusive cache:
  • Management of cache coherence may be simpler
• Advantages of an exclusive cache:
  • Duplicate copies of data in several cache levels are avoided
• Example of an intermediate policy:
  • An L3 cache used as a victim cache: the cache holds data evicted from L2, but not necessarily data loaded to L2


Re-Reference Timing Patterns

• Re-reference time = the time between two occurrences of references to the same item
• Re-reference timing patterns can be analysed using memory traces plus simulation
  • The pattern only depends on the workload and the cache-line size
  • Example: on-line transaction processing application
• Phenomenological observation [Hartstein et al., 2006]:

  R(t) = R0 · t^(−β)   with 1.3 ≲ β ≲ 1.7


Cache Miss Patterns

• Using simulators it is possible to investigate cache miss patterns for different cache configurations
• Example:
  • On-line transaction processing application
  • Analysis of different associativity settings [Hartstein et al., 2006]


Memory Prefetching

• Prefetching = technique to hide cache miss latencies
• Idea: load data before the program issues the load instruction
• Hardware data prefetching
  • Generic implementation:
    • Monitor the memory access pattern
    • Predict memory addresses to be accessed
    • Speculatively issue a prefetch request
  • Prefetching schemes:
    • Stream prefetching
    • Global History Buffer prefetching
    • List prefetching
  • Prefetching at different cache levels


Memory Prefetching (2)

• Software data prefetching: the compiler or programmer inserts prefetch instructions
• Risk of negative performance impact:
  • Memory bus contention
  • Cache pollution
• Metrics to characterize prefetching algorithms:
  • Coverage: number of misses that can be prefetched
  • Timeliness: distance between prefetch and original miss
  • Accuracy: probability that a prefetched line is used before being replaced


Cache Coherence

• Cache coherence problem:
  • Processors/cores view memory through individual caches
  • Different processors could have different values for the same memory location
• Example with 2 processors P1 and P2:

  Time  Event            Cache P1  Cache P2  Memory location A
  0                                          123
  1     P1 reads from A  123                 123
  2     P2 reads from A  123       123       123
  3     P1 writes to A   567       123       567

• The value in processor P2 will only be updated after the cache line was evicted
  → Need for a cache-coherence mechanism


Cache Coherence (2)

Requirements to achieve coherency

• Program order preservation: if a read by processor P1 from memory location A follows a write by P1 to the same location, then the read returns the new value, unless another processor has performed a write operation in between
• Coherent memory view: if a read by processor P1 from A follows a write by processor P2 to A, then the new value is returned if both events are sufficiently separated in time
• Serialized writes: 2 writes to location A by one processor are seen in the same order by all processors


Cache Coherence (3)

• Classes of cache coherence protocols:
  • Directory based: the sharing status of a block of memory is kept in one central location (directory)
  • Snooping:
    • Protocols rely on the existence of a shared broadcast medium
    • Each cache keeps track of the cache line status
    • All caches listen to broadcast snoop messages and act when needed
    • Usually, on a write all other copies are invalidated (write invalidate protocol)
• Pros/cons:
  • Snoop protocols are relatively easy to implement
  • Directory based protocols have the potential to scale to larger processor counts


Cache Coherence: MSI Protocol

Each cache line can be in one of the following states:
• Modified
  • The cache line contains the current value
  • The copies in memory and in any other cache are dirty (stale)
• Shared
  • The cache line contains the current value
  • The copies in memory and in any other cache are clean
• Invalid
  • The cache line does not contain a valid value

Permitted combinations of states of a particular cache line in 2 caches:

      M  S  I
  M   ✗  ✗  ✓
  S   ✗  ✓  ✓
  I   ✓  ✓  ✓


Cache Coherence: MSI Protocol (2)

• Processor operations:
  • Read (PrRd)
  • Write (PrWr)
• Operations on the broadcast medium:
  • Bus Read (BusRd)
    • Triggered by a PrRd to a non-valid cache line, i.e. a line neither in S nor in M
  • Bus Read Exclusive (BusRdEx)
    • Triggered by a PrWr to a cache line which is not in M state
    • All other copies are marked invalid
  • Write-Back (BusWr)
    • The cache controller writes a modified cache line back to memory


Cache Coherence: MSI Protocol (3)

State transitions (processor-induced and bus-induced):

• I → S: PrRd (issue BusRd)
• I → M: PrWr (issue BusRdEx)
• S → M: PrWr (issue BusRdEx)
• S → I: observed BusRdEx
• S → S: PrRd
• M → S: observed BusRd
• M → I: observed BusRdEx
• M → M: PrRd or PrWr

When leaving the M state the cache controller puts the modified cache line on the bus.
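A C sketch of these MSI transitions as a lookup function; the encoding is illustrative, and the bus side effects are only noted in comments:

  #include <stdio.h>

  typedef enum { M, S, I } MsiState;
  typedef enum { PrRd, PrWr, BusRd, BusRdEx } Event;

  static MsiState msi_next(MsiState q, Event e) {
      switch (q) {
      case I:
          if (e == PrRd) return S;      /* issue BusRd            */
          if (e == PrWr) return M;      /* issue BusRdEx          */
          return I;
      case S:
          if (e == PrWr)    return M;   /* issue BusRdEx          */
          if (e == BusRdEx) return I;   /* another cache writes   */
          return S;                     /* PrRd: stay in S        */
      case M:
          if (e == BusRd)   return S;   /* write the line back    */
          if (e == BusRdEx) return I;   /* write the line back    */
          return M;                     /* PrRd/PrWr: stay in M   */
      }
      return I;
  }

  int main(void) {
      MsiState q = I;
      q = msi_next(q, PrRd);            /* I -> S */
      q = msi_next(q, PrWr);            /* S -> M */
      q = msi_next(q, BusRd);           /* M -> S, with write-back */
      printf("final state: %d (0=M,1=S,2=I)\n", q);
      return 0;
  }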


Node Architecture

• Dual-socket node
  • Processors interconnected via an inter-processor bus (IPB)
  • Each processor with its own memory controller (MC) and DDR memory
• Different I/O devices:
  • Storage (SSD)
  • Compute accelerator (GPU) with GDDR/HBM memory
  • Network (NET)


NUMA Architectures

• NUMA = Non-Uniform Memory Access
• Memory design for multi-processor systems where memory is shared:
  • A processor can access another processor’s memory

  MEM --- CPU === CPU --- MEM

• Memory access latency depends on which memory is accessed


NUMA: Hardware Locality Control

• Scheduling of processes to cores and placement of memory is controlled by the operating system
• NUMA memory allocation policies in Linux:

  Name        Description
  default     Allocate on the local node
  bind        Allocate on a specific set of nodes
  interleave  Interleave allocations on a set of nodes
  preferred   Try to allocate on a node first

• Linux tools to control the NUMA policy:
  • taskset: control of a process' CPU affinity
  • numactl: control of process scheduling and memory placement
  • Portable Hardware Locality (hwloc) library/tools


Content: Processor Architectures


x86 Architecture

• ISA based on the Intel 8086 CPU (launched 1978)
• The architecture dominates the PC mass market and is used in most Top500 systems [Top500, 06/2012]
  • Top500: ∼ 87% of the top500 systems use x86 CPUs
• Manufacturers: Intel, AMD, VIA
• Main features:
  • CISC-type ISA
  • One operand can be a memory location
• Extensions (selection):
  • x86-64
  • SSE2, SSE3, SSE4.x
  • AVX, AVX2, AVX-512


x86 Architecture: Xeon Platinum

Hardware parameters of the Intel Xeon Platinum 8168 (Skylake):

  Number of cores          24
  Core clock               2.7 GHz (base)
  L1 cache (data+instr)    32+32 kBytes/core
  L2 cache                 1024 kBytes/core
  L3 cache                 33 MBytes
  Max. memory bandwidth    128 GBytes/s
  Peak DP performance      32 Flop/cycle/core → 2074 GFlop/s
  Max. TDP                 205 Watt
  Lithography              14 nm
  Release date             Q3/2017                  [wikichip.org]

Nominal performance numbers as specified by the manufacturer.


Power Architecture

• Load-store RISC architecture
• Implementations currently used for HPC:
  • POWER8, POWER9
    • Multi-core and multi-chip processors
    • Very high memory bandwidth
    • Targets server market and HPC
  • BlueGene L/P/Q processors
    • System-on-a-Chip (SoC) design with network links on chip
    • Only used for BlueGene parallel computers
  • Cell Broadband Engine Architecture (CBEA)
    • Heterogeneous multi-core processor
    • Supercomputers based on CBEA: RoadRunner, QPACE
    • Also used for embedded technologies and network processors


IBM POWER9

• Introduced in 2017 using 14 nm technology
• POWER v3.0 ISA
• Up to 24 SMT4 cores, frequency up to 4 GHz
• Cache parameters:
  • L1-D/L1-I: 32/32 kiByte/core, private, 8-way associative
  • L2: 512 kiByte per 2 SMT4 cores, partially shared, 8-way associative
  • L3: 10 MiByte per 2 SMT4 cores, shared
  • Cache line size 128 Byte
• 8× memory controllers
• Performance figures:
  • 8 DP FMA/cycle/SMT4 core


IBM POWER9 (cont.)

[IBM, 2017]


IBM POWER9 (cont.)

[IBM, 2017]

Memory attachment options:
• 8 direct DDR4 channels with up to 120 GByte/s sustained bandwidth
• 8 buffered channels with up to 230 GByte/s sustained bandwidth


ARM Architecture

• ARM = Advanced RISC Machine
• Current main market: mobile and embedded electronics
• Emerging server processor market
• ISA features:
  • Load/store architecture
• Example: Marvell ThunderX2
  • ARMv8 ISA
  • Up to 32 cores, up to 2.5 GHz
  • Cache parameters:
    • L1-D/L1-I: 32/32 kiByte/core
    • L2: 256 kiByte/core, shared
    • L3: 32 MiByte, shared


Content: Network Architecture


Computer Network

• Computer network = interconnection between computing nodes
• Simplistic view:

  Source → Transmitter --- Transmission System --- Receiver → Destination
  (source system)                                  (destination system)


Protocol Architecture

• Protocol = agreement about communication between two or more entities
• Key elements:
  • Syntax
    • Data formats
    • Signal levels
  • Semantics
    • Control information
    • Error handling
  • Timing
    • Speed matching
    • Sequencing


Protocol Architecture

Human protocol vs. computer network protocol (messages exchanged over time):

  Human:        Network:
  Hi!           connect req
  Hi!           connect reply
  What time?    req file
  2pm           file


Message Communication Protocols

• Eager protocol (sender simply sends: Data →):
  • The sender pushes the message regardless of the receiver state
  • Low overhead
  • Useful for small messages
• Rendezvous protocol (Start →, ← Reply, Data →, Finish →):
  • A handshake happens between the sender and the receiver before sending data
  • High overhead due to the handshake involving control messages
  • Useful for large messages


Layering of Networks

• Most networks are organised as a series of layers
  • Physical communication takes place at the lowest layer
  • At higher layers: virtual communication
  • Lower levels are implemented in hardware, while top levels are implemented in software
  • APIs between layers are tightly defined
• Advantages of layering:
  • The design becomes less complex as the problem is broken down
  • Changes typically affect only one layer
    • Example: new physical layer, new protocols
  • For different layers, solutions from different providers can be used


OSI Network Layer Model

• OSI = Open Systems Interconnection
• Standard defined by the International Organization for Standardization (ISO)
  • 2nd edition published in 1994
• Seven layers: Application, Presentation, Session, Transport, Network, Data Link, Physical
  • End systems implement all seven layers; intermediate systems only Network, Data Link and Physical
  • The physical layers are connected via the physical media


OSI model: Physical Layer

• OSI Standard: “The Physical Layer provides the (...) means to activate, maintain, and de-activate physical-connections for bit transmission between data-link-entities.”
• Transmission of bits, not, e.g., packets
• Within a given network architecture a large range of physical media may be supported
• Example: Gigabit Ethernet physical layers (as defined in the IEEE 802.3 standard):

  Name         Description
  1000BASE-T   CAT5 copper cables
  1000BASE-SX  Short-wave laser
  1000BASE-LX  Long-wave laser


OSI Model: Data Link Layer

• Purpose as defined in the OSI standard:
  • Establish link connections
  • Detect and possibly correct errors which may occur in the physical layer
  • Connect the network layer to the physical layer
• The layer may implement flow control
  • Flow control = mechanism to control the data transmission rate
  • Required to avoid the transmitter outrunning the receiver
  • Example: credit mechanism
    • The receiver provides credits to the transmitter depending on the available buffer space (give credit)
    • The transmitter only sends data when credits are available (consume credit)


OSI Model: Network Layer

• OSI Standard: “It provides to the transport-entities independence from routing and relay considerations associated with the establishment and operation of a given network-connection.”
• The standard defines that transport-entities are known to this layer by means of network-addresses
• Routing can be implemented at this layer
  • The layer can receive packets via the interface to the data link layer of one link and forward them to another link depending on the network-address


OSI Model: Virtual View

Virtual communication between peer layers:

  Application  ← network service →       Application
  Presentation ← session →               Presentation
  Session      ← end-to-end messages →   Session
  Transport    ← end-to-end packet link → Transport
  Network      ← packet link →           Network
  Data Link    ← bit pipe →              Data Link
  Physical     ←————————————————————→    Physical

126 / 179 Introduction Principles Core Memory Processor Network Exascale

Digital Signal Encoding

• Digital signal = discrete, discontinuous voltage pulses
  • In practice, the signal may be only approximately digital
• Binary data is encoded into signal elements
• Signal encoding schemes (selection):
  • Return to Zero (RZ), Non-Return to Zero (NRZ)
  • Non-Return to Zero Inverted (NRZI)
  • Manchester
• Pre-requisites for receiving data:
  • Clock matching
  • Voltage level matching
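To make the schemes concrete, a small C sketch of NRZ versus Manchester encoding; the Manchester polarity chosen here (0 → high-to-low, 1 → low-to-high) follows the IEEE 802.3 convention, other conventions invert it:

/* Sketch: encode a byte with NRZ and with Manchester encoding. */
#include <stdio.h>

/* NRZ: one signal level per bit. */
void encode_nrz(unsigned char byte, int out[8]) {
    for (int i = 0; i < 8; i++)
        out[i] = (byte >> (7 - i)) & 1;
}

/* Manchester: two half-bit levels per bit; the mid-bit transition
   carries both data and clock. */
void encode_manchester(unsigned char byte, int out[16]) {
    for (int i = 0; i < 8; i++) {
        int bit = (byte >> (7 - i)) & 1;
        out[2*i]     = bit ? 0 : 1;   /* first half-bit level  */
        out[2*i + 1] = bit ? 1 : 0;   /* second half-bit level */
    }
}

int main(void) {
    int nrz[8], man[16];
    encode_nrz(0xA3, nrz);
    encode_manchester(0xA3, man);
    for (int i = 0; i < 16; i++) printf("%d", man[i]);
    printf("\n");
    return 0;
}

The mid-bit transition in Manchester encoding carries the clock along with the data, which eases clock matching at the receiver, at the cost of doubling the symbol rate.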

127 / 179 Introduction Principles Core Memory Processor Network Exascale

Digital Signal Encoding: Rates

• Symbol rate = number of times a signal in a communication channel changes state or varies per time unit
  • Unit: baud
• Bit rate = number of bits transmitted in a given time unit
• Symbol rate ≠ bit rate
  • One change of state on the physical link may correspond to one bit, or to more or less than one bit
• Bit rate [bits/s] = baud × bits per baud

128 / 179 Introduction Principles Core Memory Processor Network Exascale

Physical Layer: High-speed Transceivers

• Standard technology: Serialisation/Deserialisation (SerDes)
• Transmitter:
  • Wide parallel input (slow clock)
  • High-speed serial output (fast clock)
• Receiver: vice versa
• Example:
  • Parallel interface: 16 bit at 250 MHz
  • Serial interface: 1 bit at 5 GHz
  • Serial signalling rate higher than the parallel data rate due to encoding (here: 8b/10b)
• Clock is typically not transmitted → need clock recovery
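The slide's numbers, worked out in a short C snippet using the bit-rate formula from the previous slide:

/* Worked example: 16-bit parallel interface at 250 MHz vs. 1-bit serial
   line at 5 GHz with 8b/10b encoding (8 payload bits per 10 line bits). */
#include <stdio.h>

int main(void) {
    double parallel_bits = 16, parallel_clock = 250e6;
    double payload_rate = parallel_bits * parallel_clock;   /* 4 Gbit/s */

    double baud = 5e9;               /* serial symbol rate: 5 GBd */
    double bits_per_baud = 8.0/10.0; /* 8b/10b: 0.8 payload bits per symbol */
    double bit_rate = baud * bits_per_baud;   /* 4 Gbit/s payload */

    printf("parallel: %.1f Gbit/s, serial payload: %.1f Gbit/s\n",
           payload_rate / 1e9, bit_rate / 1e9);
    return 0;
}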

129 / 179 Introduction Principles Core Memory Processor Network Exascale

Link Errors

• Errors in networks will occur!
  • Error detection
  • Error recovery
• Error types:
  • Temporary errors (transient or intermittent)
    • Data corruption, e.g. bit flips in data characters due to random noise
    • Data loss, e.g. packet loss due to control character corruption
    • Shift in sampling point
  • Permanent errors
    • Link failure
• It may not be possible to guarantee detection of all errors
  • Strategy: make the probability of undetected errors sufficiently small

130 / 179 Introduction Principles Core Memory Processor Network Exascale

Link Errors (2)

• BER = Bit Error Rate

  BER = (number of flipped bits) / (number of received bits)

• PER = Packet Error Rate
  • Assume packets of equal length of N bits and bit errors to be independent; for small BER:

    PER = 1 − (1 − BER)^N ≈ N · BER

• Warning:
  • In practice, bit errors are not independent
  • Encoding results in a non-deterministic relation between BER and PER
• BER targets:
  • 10 Gbit Ethernet: 10⁻¹² (IEEE 802.3-2008, clause 44)
  • Infiniband: 10⁻¹² (IBA, rev. 1.2.1, C6-11, C8-4)
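A quick numerical check of the approximation in C (packet size chosen for illustration):

/* Compare PER = 1 - (1 - BER)^N with the small-BER approximation N*BER. */
#include <stdio.h>
#include <math.h>

int main(void) {
    double ber = 1e-12;      /* BER target of the standards cited above */
    double n_bits = 12000;   /* e.g., a 1500-byte packet */
    double per_exact  = 1.0 - pow(1.0 - ber, n_bits);
    double per_approx = n_bits * ber;
    printf("PER exact: %.3e, approx: %.3e\n", per_exact, per_approx);
    return 0;
}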

131 / 179 Introduction Principles Core Memory Processor Network Exascale

Network Error Detection

• Detection of data corruption:
  • Transmitter adds an error-detecting code
  • Code is recalculated and checked by the receiver
  • Simple example: parity bit
    • Parity bit is chosen such that the total number of '1' bits is even (even parity) or odd (odd parity)
    • An even number of bit flips remains undetected
  • More robust error codes: Cyclic Redundancy Check (CRC)
• Detection of data loss:
  • Robust protocols where each packet is acknowledged
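A small C example of the parity-bit scheme (even parity):

/* Even parity: the parity bit is chosen such that the total number of
   '1' bits (data + parity) is even. */
#include <stdio.h>

int even_parity(unsigned char data) {
    int ones = 0;
    for (int i = 0; i < 8; i++)
        ones += (data >> i) & 1;
    return ones & 1;    /* set the bit iff the data has an odd '1' count */
}

int main(void) {
    unsigned char d = 0xB5;                      /* 1011 0101 -> five '1' bits */
    printf("parity bit: %d\n", even_parity(d));  /* prints 1 */
    /* Note: a double bit flip in d leaves the parity unchanged: undetected. */
    return 0;
}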

132 / 179 Introduction Principles Core Memory Processor Network Exascale

Error Recovery: Packet Retransmission

[Diagram: transmitter (TX) sends a packet (header, data, error code); the receiver (RX) returns a NACK; TX restarts and retransmits the packet]

• Error case: packet payload is corrupted
• Transmitter adds an error code to the packet and keeps a copy of the packet
• Receiver checks the error code and returns feedback to the transmitter:
  • ACK if code matched
  • NACK if a mismatch was detected
• Transmitter action depending on feedback:
  • ACK: delete copy
  • NACK: retransmit packet

133 / 179 Introduction Principles Core Memory Processor Network Exascale

Message Routing Introduction

• Routing = Mechanism to allow messages to find a path from source to destination
• May be implemented in hardware or software
• Features relevant for HPC networks:
  • Performance
    • High bandwidth
    • Low latency
  • Scalability
    • Algorithm must be implemented in a distributed way

134 / 179 Introduction Principles Core Memory Processor Network Exascale

Network Routing Ingredients

• Network topology
  • Restricts the number of routes from source to destination
• Switching techniques
  • Switching = Mechanism that a router uses to move a packet from its input to its output ports
• Routing algorithm = The algorithm used to determine the path that a message will take from source to destination

135 / 179 Introduction Principles Core Memory Processor Network Exascale

Classification of Routing Algorithms

• Static vs. dynamic routing algorithms
  • Deterministic/oblivious routing
    • Routes based only on information that is available before the algorithm begins
  • Adaptive routing
    • Decisions are made at run-time
  • Quasi-static routing
    • Similar to deterministic routing, but allows routing tables to be updated periodically
• Path length
  • Minimal routing
    • Route packets along a minimal path, i.e. do not allow packets to move away from the destination
  • Non-minimal routing
    • Allow packets to move away from the destination

136 / 179 Introduction Principles Core Memory Processor Network Exascale

Classification of Routing Algorithms (2)

• Other performance relevant features:
  • Congestion management
    • Congestion = Situation where a link or node is carrying so much data that its quality of service deteriorates
  • Worst-case characteristics:
    • Deadlock = State where packets cannot progress in the network due to a cyclic wait dependency
    • Livelock = State where packets injected into the network are constantly routed and never reach their destination

137 / 179 Introduction Principles Core Memory Processor Network Exascale

Network Topology

• Network topology refers to the way in which the network of computing nodes is connected • Physical topology = topology of the physical network • Logical topology = logical view on the way network is connected • Not explored in this lecture • Simplest topology: point-to-point topology • Permanent link between two endpoints

138 / 179 Introduction Principles Core Memory Processor Network Exascale

Network Evaluation Metric

• Diameter
  • Maximum minimal distance between 2 nodes
  • Smaller is better
• Connectivity
  • Minimum number of arcs that must be removed to break the network into two disconnected networks
  • Larger is better
• Bi-section width
  • Minimum number of arcs that must be removed to partition the network into two equal halves
  • Larger is better
• Bi-section bandwidth
  • Aggregate bandwidth of all links connecting two equal halves
  • Larger is better
• Costs
  • Smaller is better

139 / 179 Introduction Principles Core Memory Processor Network Exascale

Network Evaluation Metric: Diameter and Connectivity

• Diameter is the maximum minimal distance between 2 nodes
• Connectivity gives the minimum number of edges to be removed to break the network into disconnected pieces
• Example:

• Diameter = 3 • Connectivity = 2
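The example graph itself is a figure that is not reproduced here; as a stand-in, the following C sketch computes the diameter by breadth-first search for a 6-node ring, which also has diameter 3 and connectivity 2:

/* Diameter = max over all node pairs of the minimal hop distance,
   computed via BFS from every node. Adjacency matrix: 6-node ring. */
#include <stdio.h>
#include <string.h>

#define P 6
static const int adj[P][P] = {
    {0,1,0,0,0,1},
    {1,0,1,0,0,0},
    {0,1,0,1,0,0},
    {0,0,1,0,1,0},
    {0,0,0,1,0,1},
    {1,0,0,0,1,0},
};

/* BFS hop distances from src to all nodes. */
static void bfs(int src, int dist[P]) {
    int queue[P], head = 0, tail = 0;
    memset(dist, -1, P * sizeof(int));
    dist[src] = 0;
    queue[tail++] = src;
    while (head < tail) {
        int u = queue[head++];
        for (int v = 0; v < P; v++)
            if (adj[u][v] && dist[v] < 0) {
                dist[v] = dist[u] + 1;
                queue[tail++] = v;
            }
    }
}

int main(void) {
    int dist[P], diameter = 0;
    for (int s = 0; s < P; s++) {
        bfs(s, dist);
        for (int v = 0; v < P; v++)
            if (dist[v] > diameter) diameter = dist[v];
    }
    printf("diameter = %d\n", diameter);   /* 3 for the 6-node ring */
    return 0;
}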

140 / 179 Introduction Principles Core Memory Processor Network Exascale

Network Evaluation Metric: Bi-section (Band-)Width

• Procedure to determine bi-section width and bandwidth:
  1. Determine all possible ways to create 2 equi-partitions
  2. Select the partitioning with the minimal number of edges removed
  3. Compute the aggregate bandwidth of the removed links
• Example:

• Bi-section width = 4

141 / 179 Introduction Principles Core Memory Processor Network Exascale

All-connected and Star Topology

All-connected network Star topology

Evaluation:

                    All-connected    Star
Diameter            1                2
Connectivity        P − 1            1
Bi-section width    (P/2)²           —
Costs               O(P²)            O(P)

142 / 179 Introduction Principles Core Memory Processor Network Exascale

Cartesian Network Topologies

• 1-dimensional linear array and ring:

• 2-dimensional mesh and torus:

143 / 179 Introduction Principles Core Memory Processor Network Exascale

Cartesian Network Topologies (2)

• Evaluation:

                      Mesh              Torus
  Diameter            d(P^(1/d) − 1)    d · P^(1/d)/2
  Connectivity        d                 2d
  Bi-section width    P^((d−1)/d)       2 · P^((d−1)/d)
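Evaluating these formulas for a concrete case, e.g. a 3-dimensional network with P = 4096 = 16³ nodes:

/* Evaluate the table's formulas for d = 3, P = 4096 (a 16x16x16 network). */
#include <stdio.h>
#include <math.h>

int main(void) {
    double d = 3, P = 4096;
    double side = pow(P, 1.0 / d);                 /* 16 nodes per dimension */
    printf("mesh  diameter:  %.0f\n", d * (side - 1));        /* 45 */
    printf("torus diameter:  %.0f\n", d * side / 2.0);        /* 24 */
    printf("mesh  bisection: %.0f\n", pow(P, (d - 1) / d));   /* 256 */
    printf("torus bisection: %.0f\n", 2 * pow(P, (d - 1) / d)); /* 512 */
    return 0;
}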

144 / 179 Introduction Principles Core Memory Processor Network Exascale

Network Topology: Tree

• Switch-based network
• Evaluation for a binary tree:
  • Diameter: O(log(P))
  • Connectivity: 1
  • Bi-section width: 1
  • Costs: O(P)
• Number of links between upper levels is smaller → blocking network topology

145 / 179 Introduction Principles Core Memory Processor Network Exascale

Network Topology: Clos Network

• N-stage switch-based network
  • Here: N = 3
• Network is characterised by
  • r: number of ingress stage switches
  • n: number of ingress ports at ingress stage
  • m: number of egress ports at ingress stage
• Evaluation:
  • Diameter: N + 1
  • Connectivity: m (at switch level)
  • Bi-section width: r · m/2
• A Clos network can be made non-blocking:
  • Non-blocking for m ≥ n if one allows for rearrangements
  • Strictly non-blocking if m ≥ 2n − 1

146 / 179 Introduction Principles Core Memory Processor Network Exascale

Network Topology: Folded Clos Network

• Topology obtained by folding ingress and egress stage
  • Example: folded 3-stage Clos
• Evaluation of Clos and folded Clos network:
  • Diameter = N + 1
  • Connectivity = m
  • Bi-section width = r · m/2
• Closely related to the fat-tree topology
  • Here: similar to a folded 5-stage Clos

147 / 179 Introduction Principles Core Memory Processor Network Exascale

Network Topology: Dragonfly

• Multi-layer network:
  • Switch layer
  • Switch-group layer
  • System layer
• All-to-all connectivity within each layer
• Network parameters:
  • p: number of nodes per switch
  • a: number of switches per group
  • h: number of links per switch to other groups
• Evaluation:
  • Diameter: 5
  • Connectivity: ah − 1 (at system layer)
  • Bi-section width: ((ah + 1)/2)²

148 / 179 Introduction Principles Core Memory Processor Network Exascale

Network Solutions for Top500 Systems

[Charts: system share and performance share of network families; www.top500.org, 11/2018]

149 / 179 Introduction Principles Core Memory Processor Network Exascale

Network Solutions: Ethernet

• Most popular data centre network technology
• Large class of technologies including several technology families like
  • 1 Gigabit Ethernet
  • 40 and 100 Gigabit Ethernet
• Limited uptake as compute node interconnect:
  • Relatively low bandwidth, high latencies
  • High costs of large high-bandwidth switch fabrics
• Growing interest in high-speed Ethernet → new protocols:
  • Internet Wide-Area RDMA Protocol (iWARP)
  • RDMA over Converged (Enhanced) Ethernet (RoCE)

150 / 179 Introduction Principles Core Memory Processor Network Exascale

Network Solutions: Infiniband

• Currently most popular network technology
• Fast evolution of (nominal) bandwidth:

  Technology            QDR       FDR       EDR
  First Top500          06/2009   11/2011   06/2015
  Data rate [Gbit/s]    8         13.6      24.2

• Low latency of O(1 µs)
• Key features:
  • OS kernel by-pass
  • Function offloading
    • Data transport managed by HCA
    • Offload of collective operations

151 / 179 Introduction Principles Core Memory Processor Network Exascale

Top10 Network Architectures

[www.top500.org, 11/2018]

    System              Technology        Topology
 1  Summit              Infiniband EDR    fat tree
 2  Sierra              Infiniband EDR    fat tree
 3  Sunway TaihuLight   Custom            partially tree
 4  Tianhe-2            Custom            fat tree
 5  Piz Daint           Cray Aries        dragonfly
 6  Trinity             Cray Aries        dragonfly
 7  ABCI                Infiniband EDR    fat tree?
 8  SuperMUC-NG         OPA               fat tree
 9  Titan               Cray Gemini       3-d torus
10  Sequoia             Custom            5-d torus

152 / 179 Introduction Principles Core Memory Processor Network Exascale

Content

Introduction

Principles of Computer Architectures

Processor Core Architecture

Memory Architecture

Processor Architectures

Network Architecture

Exascale Challenges

153 / 179 Introduction Principles Core Memory Processor Network Exascale

What is Exascale Computing?

• Answer 1: Running HPL at 1 EFlop/s • HPL = High-Performance LINPACK • HPL is the load used for Top500 • Answer 2: Future generation supercomputers enabling new science • Exascale application challenges: • Weather, climatology and solid earth sciences • Astrophysics, high-energy physics and plasma physics • Materials science, chemistry and nano-science • Engineering sciences and industrial applications • Life sciences and medicine • Answer 3: Exascale computer = 100 − 1000× of today’s petascale supercomputers • Flop/s metric will become less relevant

154 / 179 Introduction Principles Core Memory Processor Network Exascale

Canon of Exascale Challenges

• Reducing power requirements • Need energy-to-solution reduction by about 100× • Exploiting massive parallelism • More computational performance → more parallelism • Scalability challenge • Maintaining a balanced system • Floating-point operations are cheap, but data transport is expensive • Coping with run-time errors • More components → higher risk of system failures • Number of components in future systems is likely to increase

155 / 179 Introduction Principles Core Memory Processor Network Exascale

Technology Trends: CPUs

[Chart: microprocessor trend data — transistor count, clock frequency, power, number of cores; Karl Rupp, 2015]

156 / 179 Introduction Principles Core Memory Processor Network Exascale

Technology Trends: Moore’s Law

• Observation: the evolution of cost-optimal manufacturing of integrated circuits results in an exponential increase of the number of components per circuit [G. Moore, 1965]
• The term tends to be abused for any exponential growth

[M. Bohr, 2007]

157 / 179 Introduction Principles Core Memory Processor Network Exascale

Technology Trends: Dennard Scaling

• Dennard scaling allowed the following parameters to be changed at constant power:
  • Increase transistor density (Moore’s law)
  • Increase clock frequency
  • Reduce supply voltage
• Increased performance = increased number of transistors
  • Allows one to implement, e.g., more cores

[L. Chang et al., 2010] Trend towards increasing parallelism

158 / 179 Introduction Principles Core Memory Processor Network Exascale

Parallel Execution

• Taking advantage of parallelism is one of the most important methods for improving performance
• Today: several levels of parallelism:
  • Instruction Level Parallelism (ILP)
    • Multiple instruction pipelines
  • Vector/SIMD instructions
    • Single Instruction performing the same operation on Multiple Data
  • Simultaneous Multi-Threading (SMT)
    • Concurrent execution of multiple instruction streams within the same processor core
  • Multiple cores per processor
    • Processors with multiple cores
  • Multiple processors
    • Multiple processors per node, multiple nodes per system
• Problem: efficient exploitation of parallelism

159 / 179 Introduction Principles Core Memory Processor Network Exascale

Technology Trends: Flop/cycle

Evaluation of single-precision floating-point arithmetic:

Year   Intel CPU                 Intel Xeon Phi      NVIDIA Tesla
       Product     Flop/cycle    Product Flop/cycle  Product      Flop/cycle
1999   Pentium III      3
2001   Pentium IV       4
2006   Core2           16
2008   Nehalem         80
2009                                                 Fermi/M2050       896
2011   Sandybridge    192
2012                             KNC     2048        Kepler/K20       4992
2013   Haswell        576                            Kepler/K40       5760
2014                                                 Kepler/K80   2 × 4992
2016                             KNL     4608        Pascal           3584
2017   Skylake       1792                            Volta            5120

Intel figures partially taken from [Boyle et al., 2015]

160 / 179 Introduction Principles Core Memory Processor Network Exascale

Balance Conditions

• Assume that computation and data transport can be perfectly overlapped:

  Δt_compute(W) ≈ Δt_transport(W)

• Assume a latency-bandwidth model:

  λ_compute + I_compute(W)/β_compute ≈ λ_transport + I_transport(W)/β_transport

• Neglecting start-up latencies leads to the following balance condition:

  I_compute/I_transport = AI ≈ β_compute/β_transport

• The arithmetic intensity AI is an application property
• β_compute is cheap to increase
• β_transport is expensive to increase
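A small C sketch of how the balance condition is used in practice: compare a kernel's arithmetic intensity with the machine balance (all machine numbers are made up for illustration):

/* Compare a kernel's arithmetic intensity AI with the machine balance
   beta_compute / beta_transport (hypothetical machine numbers). */
#include <stdio.h>

int main(void) {
    double beta_compute   = 7.0e12;  /* Flop/s, assumed peak performance */
    double beta_transport = 9.0e11;  /* byte/s, assumed memory bandwidth */
    double machine_balance = beta_compute / beta_transport;  /* Flop/byte */

    /* Triad kernel a[i] = b[i] + s*c[i]: 2 Flop per 24 bytes moved */
    double ai = 2.0 / 24.0;

    if (ai < machine_balance)
        printf("AI = %.3f < %.1f Flop/byte: data transport limits performance\n",
               ai, machine_balance);
    return 0;
}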

161 / 179 Introduction Principles Core Memory Processor Network Exascale

Technology trends: Rent’s rule

• Rent’s Rule:

  T = k · G^p

  G ... number of logic elements (gates)
  T ... number of edge connections (terminals)
  k ... Rent’s coefficient
  p ... Rent’s exponent, where typically p < 1

• Difficult to balance communication and compute
• Strategy for mitigating the problem: memory hierarchy
  • Multiple levels of fast but small on-chip memory (cache)
  • Slower but larger off-chip memory
• Trend towards deeper memory hierarchies
  • Example: DRAM main memory + high-bandwidth memory

162 / 179 Introduction Principles Core Memory Processor Network Exascale

Balanced Architectures: Deeper Memory Hierarchy

• Next generation of Intel Xeon Phi: Knights Landing
• 2 layers of DRAM:
  • Fast but small MCDRAM
  • Large but slow DDR4
• Use models [Intel, 2015]:
  • Use MCDRAM as cache
  • Software-managed memory (flat address space)
  • Hybrid mode
• User challenge: optimise for MCDRAM data locality
  • Implicitly: cache-friendly data layout
  • Explicitly: transfer data between layers

163 / 179 Introduction Principles Core Memory Processor Network Exascale

Energy Efficiency

• Energy efficient = the amount of work W which can be performed for a given amount of energy E is maximised
• Energy efficiency ε_E:

  ε_E = W/E

• Important to reduce operational costs
• Consumed energy:

  E = ∫_t^(t+Δt) P(τ) dτ ≤ max(P) · Δt

  Δt ... time-to-solution
  E ... energy-to-solution
  P(τ) ... power consumed at time τ

164 / 179 Introduction Principles Core Memory Processor Network Exascale

Power Efficiency

• Power efficient = Best performance dW /dt for given power consumption P

• Power efficiency ε_P:

  ε_P = (dW/dt)/P

• Example: maximise floating-point performance over power consumption (Flop/Watt)
• Important if available power is limited
  • Examples: limitations in power distribution, costs for cooling

165 / 179 Introduction Principles Core Memory Processor Network Exascale

Energy vs. Power Efficiency

• Assume constant power consumption: E = P · Δt
• Assume constant throughput b = dW/dt = W/Δt
• Relation between energy and power efficiency under these assumptions:

  ε_E = W/E = (b · Δt)/(P · Δt) = (dW/dt)/P = ε_P

  → Special case in which energy and power efficiency coincide
• In general, maximising ε_P ≠ minimising E
  • E.g., W may change for different algorithms
• Popular power efficiency metric: floating-point throughput vs. power consumption:

  ε_P = b_fp/P

166 / 179 Introduction Principles Core Memory Processor Network Exascale

The Green500 List

• Green500: rank supercomputers according to power efficiency
  • Metric = floating-point performance vs. power consumption
  • Prerequisite: system listed in the Top500
  • Performance = HPL performance (as for the Top500)
• Current #1 (Nov’18): 17.6 GFlop/s/W
  • This is equivalent to 57 pJ/Flop
• Exascale goal: stay below 20 MW
  • This is equivalent to 20 pJ/Flop
• Criticism:
  • HPL load is not representative
  • Green500 does not cover the full system
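The unit conversion behind these equivalences, written out in C:

/* Convert GFlop/s/W to pJ/Flop and check the exascale power budget. */
#include <stdio.h>

int main(void) {
    double gflops_per_watt = 17.6;                   /* Green500 #1, Nov'18 */
    double pj_per_flop = 1e12 / (gflops_per_watt * 1e9);
    printf("%.1f GFlop/s/W = %.1f pJ/Flop\n", gflops_per_watt, pj_per_flop);

    double exaflops = 1e18, power_budget = 20e6;     /* 1 EFlop/s at 20 MW */
    printf("exascale goal: %.0f pJ/Flop\n", power_budget / exaflops * 1e12);
    return 0;
}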

167 / 179 Introduction Principles Core Memory Processor Network Exascale

The Green500 List (cont.)

[Chart: power efficiency (MFlop/s/W, 0–15000) of leading Green500 systems from Nov’07 to Nov’18; labelled milestones: PowerXCell 8i, BG/P, BG/Q, GPU, Xeon Phi, PEZY-SC, Zetta/ExaScaler]

168 / 179 Introduction Principles Core Memory Processor Network Exascale

Green500: November 2018 Results

     MFlop/s/W   Name     Year   Compute devices            Top500 position   Country
  1  17,604      RIKEN    2017   Intel Xeon, ZettaScaler    375               Japan
  2  15,113      NVIDIA   2017   Intel Xeon, NVIDIA V100    374               US
  3  14,668      Summit   2018   IBM POWER9, NVIDIA V100      1               US
  4  14,423      ABCI     2018   Intel Xeon, NVIDIA V100      7               Japan
 ...
 30   4,563      Lemhi    2018   Intel Xeon                 427               US

169 / 179 Introduction Principles Core Memory Processor Network Exascale

Costs of Data Transport

[W. Keckler et al., 2011] Comparison of • NVIDIA Fermi GPU as of 2010 • Projection NVIDIA GPU as of 2017

                                2010      2017
  Process technology            40 nm     10 nm
  Voltage (VDD)                 0.9 V     0.65 V
  Frequency                     1.6 GHz   2 GHz
  DP-FMA energy                 50 pJ     6.5 pJ
  Wire energy (256 bit, 10 mm)  310 pJ    150 pJ

Data movement costs will dominate even more: the wire-to-FMA energy ratio grows from about 6× (2010) to about 23× (2017)

170 / 179 Introduction Principles Core Memory Processor Network Exascale

Strategies for Reducing Energy-to-Solution

Selected strategies for improving energy-efficiency: • Use of power-efficient components • Focus on compute accelerators • Improve data centre efficiency • Avoid consuming energy except for executing instructions and data transport • Improve efficiency of cooling and power conversion • Use of energy-efficient algorithms

171 / 179 Introduction Principles Core Memory Processor Network Exascale

Energy-Efficient Compute Devices

• Power to switch CMOS transistors (dynamic power):

  P_dynamic ∝ C · V² · f · N_sw

  C ... capacitive load
  V ... voltage
  f ... frequency
  N_sw ... number of switching bits

• Voltage can be reduced for lower frequencies
• Strategies for improving power efficiency:
  • Simplify the architecture, i.e. reduce transistors per operation
    • Example: SIMD units, simple core architectures
  • Reduce clock frequency, but increase parallelism
    • Example: GPU, Xeon Phi
  • Dynamic Voltage and Frequency Scaling (DVFS)
    • Increase (decrease) frequency and voltage dynamically
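A numerical illustration of the formula (both operating points are assumed values): halving the clock and lowering the voltage reduces power far more than performance:

/* Relative dynamic power: (V2/V1)^2 * (f2/f1), other factors constant. */
#include <stdio.h>

int main(void) {
    double v1 = 1.0, f1 = 2.0e9;     /* baseline voltage and clock */
    double v2 = 0.8, f2 = 1.0e9;     /* scaled operating point */
    double rel = (v2 / v1) * (v2 / v1) * (f2 / f1);
    printf("relative power: %.2f (relative performance: %.2f)\n", rel, f2 / f1);
    /* 0.32x power at 0.5x clock: two such slow cores can beat one fast
       core in Flop/W, which is the rationale behind GPU-style designs. */
    return 0;
}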

172 / 179 Introduction Principles Core Memory Processor Network Exascale

Data Centre Efficiency: PUE

• PUE = Power Usage Effectiveness:

  PUE = (total data centre energy consumption or power) / (IT energy consumption or power)

• PUE ≥ 1 by definition
  • Option to re-use heat is not considered
• IT equipment input:
  • Power consumed at the connection of the IT device to the electrical power system
• Supporting infrastructure includes:
  • Power systems
    • Examples: UPS, power converters, lighting
  • HVAC systems (Heating, Ventilation and Air Conditioning)
    • Examples: chillers, water pumps, fans
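A trivial worked example with assumed numbers, reproducing the "typical" value quoted on the next slide:

/* PUE example: 2.0 MW IT load plus 0.8 MW for cooling and power
   conversion (numbers assumed for illustration). */
#include <stdio.h>

int main(void) {
    double it_power    = 2.0e6;   /* W, IT equipment */
    double infra_power = 0.8e6;   /* W, chillers, CRAC, UPS losses, ... */
    printf("PUE = %.2f\n", (it_power + infra_power) / it_power);  /* 1.40 */
    return 0;
}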

173 / 179 Introduction Principles Core Memory Processor Network Exascale

PUE (2)

• Consider an air-cooled data centre without free cooling:
  • Computer Room Air Conditioning (CRAC) units including fans
  • Chiller plant
  • UPS
  → Typical PUE ≳ 1.4
• Examples for small PUE:
  • Leibniz Supercomputing Centre/SuperMUC: PUE = 1.1
    • Hot-water cooling with directly liquid-cooled nodes
  • Yahoo data centre in Lockport: PUE = 1.08
    • Passively air-cooled
    • No chillers

[Diagram: power and cooling chain — chiller plant, CRAC unit, PDU, IT equipment]

174 / 179 Introduction Principles Core Memory Processor Network Exascale

Energy-aware Algorithm Design

• Goal: Classify algorithms in terms of energy efficiency • Allow for selecting most energy efficient algorithm • Example numerical problem:

AX = B

(A ... symmetric, positive definite matrix)
• Choice between different algorithms:
  • Cholesky decomposition A = RᵀR ⇒ X = R⁻¹ (Rᵀ)⁻¹ B
  • Conjugate gradient and variants

175 / 179 Introduction Principles Core Memory Processor Network Exascale

Energy-aware Algorithm Design (cont.)

[Pavel Klavik et al., 2014] • Comparison of time-to-solution and energy-to-solution

• Comparison of b_fp (Top500) and ε_P (Green500)

176 / 179 Introduction Principles Core Memory Processor Network Exascale

Resilience

• Resilience = set of techniques for keeping applications running to a correct solution in a timely and efficient manner despite underlying system faults
• Related to Reliability, Availability and Serviceability (RAS)
• MTTF = Mean time to failure
  • Time after system start until the first failure occurs
• MTTR = Mean time to repair
  • Time to identify and repair a failure
• MTBF = Mean time between failures
  • MTBF = MTTF + MTTR
• MTTF on today’s high-end systems:
  • MTTF of GPU double bit errors on Titan with about 18,000 GPUs is about 160 h [D. Tiwari et al., 2015]
  • MTTF on Sequoia with 98,304 Blue Gene/Q nodes is reported to be 3.5–7 days [J. Hsu, 2015]

177 / 179 Introduction Principles Core Memory Processor Network Exascale

Strategies for Improving Resilience

• Checkpoint / Restart
  • Write regular checkpoints to allow for a restart in case of faults
  • System-level versus application-level checkpointing
  • Challenges:
    • High costs of writing checkpoints
    • Performance impact of restarts
• Algorithm-level fault tolerance (ALFT)
  • Aim for algorithms that can recover from faults
  • Early ALFT approach for matrix operations [Huang/Abraham, 1984]
  • Strategy: encode data such that errors can be detected and corrected
  • Example: expand matrix using the sum of each column and row (see the sketch below)
    • N × N matrix ⇒ (N + 1) × (N + 1) matrix
    • Check each row and column for a mismatch ⇒ detect error
    • Exploit data redundancy to correct the error
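A toy C implementation of the row/column-checksum scheme (matrix size and the injected fault are made up; a real implementation would compare with a tolerance and also protect the checksums themselves):

/* Huang/Abraham-style checksums: extend an N x N matrix by a checksum
   row and column; a single corrupted element is located by the
   mismatching row r and column c and restored from the row checksum. */
#include <stdio.h>

#define N 3

void add_checksums(double a[N+1][N+1]) {
    for (int i = 0; i < N; i++) {
        a[i][N] = 0.0;
        for (int j = 0; j < N; j++) a[i][N] += a[i][j];
    }
    for (int j = 0; j <= N; j++) {
        a[N][j] = 0.0;
        for (int i = 0; i < N; i++) a[N][j] += a[i][j];
    }
}

int detect_and_correct(double a[N+1][N+1]) {
    int r = -1, c = -1;
    for (int i = 0; i < N; i++) {
        double s = 0.0;
        for (int j = 0; j < N; j++) s += a[i][j];
        if (s != a[i][N]) r = i;   /* exact compare: fine for this toy */
    }
    for (int j = 0; j < N; j++) {
        double s = 0.0;
        for (int i = 0; i < N; i++) s += a[i][j];
        if (s != a[N][j]) c = j;
    }
    if (r < 0 || c < 0) return 0;  /* no single-element error found */
    double s = 0.0;
    for (int j = 0; j < N; j++) if (j != c) s += a[r][j];
    a[r][c] = a[r][N] - s;         /* reconstruct from the row checksum */
    return 1;
}

int main(void) {
    double a[N+1][N+1] = {{1,2,3},{4,5,6},{7,8,9}};
    add_checksums(a);
    a[1][2] += 42.0;               /* inject a fault */
    printf("corrected: %d, a[1][2] = %g\n", detect_and_correct(a), a[1][2]);
    return 0;
}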

178 / 179 Introduction Principles Core Memory Processor Network Exascale

Finish with Simple Architecture: Leibniz’ Reckoner

[Museum Schloss Herrenhausen]

179 / 179