Parallel architectures

Denis Barthou

[email protected]

1 Parallel architectures 2014-2015 D. Barthou 1- Objectives of this lecture

● Analyze and understand how parallel machines work

● Study modern parallel architectures

● Use this knowledge to write better code

2 Parallel architectures 2014-2015 D. Barthou Outline

1. Introduction
2. Architecture: pipeline, OoO, superscalar, VLIW, branch prediction, ILP limits
3. Vectors: definition, vectorization
4. Memory and caches: principles, caches, multicores and optimization
5. New architectures and accelerators

3 Parallel architectures 2014-2015 D. Barthou 1- Parallelism

Many services and machines are already parallel

● Internet and server infrastructures

● Data bases

● Games

● Sensor networks (cars, embedded equipment, …)

● ... What's new ?


● Parallelism everywhere

● Dramatic increase of parallelism inside a compute node

5 Parallel architectures 2014-2015 D. Barthou 1- Multicore/manycore

Manycore is already there:
● Nvidia Kepler: 192 cores (per SMX), 7.1 billion transistors
● Intel Tera-scale chip, 2007 (80 cores)

Intel SCC, 2010 (48 cores)

Many Integrated Core (MIC) or Xeon Phi (60 cores) 6 Parallel architectures 2014-2015 D. Barthou 1- Why so many cores ? Moore's law

Every 18 months, the number of transistors doubles at constant cost (1965). This exponential law applies to:

● Processor performance,

● Memory & disk capacity

● Size of the wire

● Heat dissipated

7 Parallel architectures 2014-2015 D. Barthou 1- Moore's law, limiting factor: W

W = C·V²·f  (dynamic power: switched capacitance × supply voltage squared × frequency)

8 Parallel architectures 2014-2015 D. Barthou 1- Impacts

● No more increase in frequency

● Increase in core number

9 Parallel architectures 2014-2015 D. Barthou 1- Impacts

● Dissipated heat goes down

● Performance/core stalls

10 Parallel architectures 2014-2015 D. Barthou 1- Multicore strategy

● Technological choice by default

● Need for software improvements

– Hide HW complexity
– Find parallelism, a lot of it and efficiently!

● All applications will run on parallel machines

– Parallel machines for HPC: worth hand-tuning application codes for performance
– End-user machines: tuning does not pay off; is parallelism worth it?

11 Parallel architectures 2014-2015 D. Barthou 1- Don't forget: Amdahl's law

Measures

● Speed-up: T1 / Tp

● Efficiency: T1 / (p·Tp)

Amdahl's law: if f is the fraction of the code that runs in parallel, the maximum speed-up on p processors is 1 / ((1 – f) + f/p). For example, f = 0.9 and p = 8 give a speed-up of at most 1/(0.1 + 0.9/8) ≈ 4.7.
In practice scalability is not as good, because synchronization and communication costs grow when p increases.
12 Parallel architectures 2014-2015 D. Barthou 1- Don't forget: soft+hard interactions

● Performance is obtained through interactions between

– Compiler
– OS
– Libraries and runtime
– HW

13 Parallel architectures 2014-2015 D. Barthou Looking at the future of performance: performance of scientific codes on TOP500 machines

14 Parallel architectures 2014-2015 D. Barthou Looking at the future of performance

15 Parallel architectures 2014-2015 D. Barthou 2- Unicore

● Current processors: ~1 billion transistors

– Work in parallel

● How does the hardware organize/express this parallelism?

– Find parallelism between instructions (ILP)

● Goal of the architects until 2002

– Hide parallelism from everyone (user, compiler, runtime)

16 Parallel architectures 2014-2015 D. Barthou 2- Unicore

● Mechanisms for finding ILP

– Pipeline: slice execution into several stages
– Superscalar: execute multiple instructions in the same cycle
– VLIW: read bundles of instructions and execute them in the same cycle
– Vectors: one instruction, multiple data (SIMD)
– Out-of-order execution: independent instructions executed in parallel

17 Parallel architectures 2014-2015 D. Barthou 2-a Pipeline

Washing machines (D. Patterson): 30 min washing, 40 min drying, 20 min folding (3 stages)
Not pipelined: 90 min/person, bandwidth: 1 person/90 min
Pipelined: 120 min/person, bandwidth: 1 person/40 min
Each step takes the time of the longest step (stages run in lockstep)
Speed-up increases with the number of stages
18 Parallel architectures 2014-2015 D. Barthou 2-a Pipeline

● Pipeline with 5 stages, MIPS (IF/ID/Ex/Mem/WB)

19 Parallel architectures 2014-2015 D. Barthou 2-a Pipeline

● 1 cycle/instruction

● Superscalar: less than 1 cycle/instruction

20 Parallel architectures 2014-2015 D. Barthou 2-a Pipeline

Issues for pipelines (hazards)

● Data dependences: value computed by an instruction is used by another instruction...

● Branches

Solutions:

● Forwarding: pass the value to the unit as soon as it is available in the pipeline

● Stall

● Speculation

21 Parallel architectures 2014-2015 D. Barthou 2-a Pipeline

● Dependences, forwarding & stall

● Superscalar: increases the probability of hazards

22 Parallel architectures 2014-2015 D. Barthou 2-b Superscalar

Scalar pipeline

Superscalar

23 Parallel architectures 2014-2015 D. Barthou 2-b Superscalar architecture

Key Features

● Several instructions issued in the same cycle

● Multiple functional units

Adaptations

● Higher risk of dependences

● Everything becomes more complex

– High penalty on a stall

● HW mechanisms

– Register renaming, OoO
– Branch prediction
24 Parallel architectures 2014-2015 D. Barthou 2-b Out of order

● Main idea: enable instructions to execute even when one instruction stalls

Issues for OoO:

● Interrupts? Completion order of instructions and side effects?

25 Parallel architectures 2014-2015 D. Barthou 2-b Out of order: example of MIPS 10k

26 Parallel architectures 2014-2015 D. Barthou 2-b Out of Order: Tomasulo algorithm

5 steps:

● Dispatch: take an instruction from the queue and put it in the ROB. Update registers to write

● Issue: wait for operands to be ready

● Execute: instruction in the pipeline

● Write result (WB): write result on Common Data Bus to update its value and execute other instructions that depend on it.

● Commit: update the register with the ROB value. When the instruction at the head of the ROB (a queue) has completed, remove it.

27 Parallel architectures 2014-2015 D. Barthou 2-b Out of order: implementation

HW needs

● Buffer for pending instructions: the reorder buffer (ROB)

● Written registers are renamed (remove WAR,WAW dependences)

28 Parallel architectures 2014-2015 D. Barthou
(Slides 29-40: figures only.)
40 Parallel architectures 2014-2015 D. Barthou 2-b Out of order

● Currently

– Pipelines with more than 10 stages
– 6-8 instructions/cycle => many instructions in flight

● OoO & ILP: dependence computation, dynamic scheduling, register renaming, ROB and dispatch buffer: the result is as if instructions were executed sequentially.

● To avoid stalls:

– Speculation on memory dependences
– Speculative branches, delay slot

● Complexity of the OoO mechanism: quadratic in the number of in-flight instructions... 41 Parallel architectures 2014-2015 D. Barthou 2-b Out of order: performance impact

● Does it pay off ?

(Slides 42-62: figures showing the performance impact of out-of-order execution.)

63 Parallel architectures 2014-2015 D. Barthou 2-c Very Long Instruction Word

Key Features

● Instructions are packed statically into bundles in the asm code. All instructions of a bundle are issued in the same cycle

● The number of instructions per bundle is fixed

● The compiler creates the bundles

64 Parallel architectures 2014-2015 D. Barthou 2-c VLIW

● Compiler has to ensure that

– Instructions are only scheduled when their operands are ready
– The time separating two dependent instructions is long enough

65 Parallel architectures 2014-2015 D. Barthou 2-c VLIW

● VLIW require special support from compiler

– Bundles on IA-64, molecules on Transmeta
– The compiler finds the ILP
– Code must be recompiled whenever the architecture changes.

66 Parallel architectures 2014-2015 D. Barthou 2-c Itanium Example

67 Parallel architectures 2014-2015 D. Barthou 2-c Itanium Registers

68 Parallel architectures 2014-2015 D. Barthou 2-c Itanium Execution

Execution

● Bundles are read until a functional unit saturates

● Up to 2 bundles issued in parallel (max size of execution window)

69 Parallel architectures 2014-2015 D. Barthou 2-c Itanium Execution

Execution

● Every cycle: a new window

● Limits on parallelism:

– Number of functional units – Size of the window

70 Parallel architectures 2014-2015 D. Barthou 2-c Itanium Execution

Mechanisms to ensure a good level of parallelism. Not the same as superscalar: no OoO, no register renaming. The proposed mechanisms let the compiler express parallelism.

● Branch issues:

– Static branch prediction: different branch instructions depending on the probability that the branch is taken
– Predication

● Dependences (e.g. between memory locations, between a load and a store)

– Speculation
Other mechanisms complete the toolbox: rotating registers, a large number of registers, ...
71 Parallel architectures 2014-2015 D. Barthou 2-c Compiler optimizations: unroll & jam

● Unroll & jam to create parallelism (see the sketch below)

● Generalization: modulo scheduling
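As an illustration (not from the slides), a minimal unroll & jam sketch in C with hypothetical arrays a, b, c and an even trip count N: the outer loop is unrolled by 2 and the two copies of the inner loop are fused (jammed), exposing two independent accumulations per inner iteration for the VLIW scheduler to pack into bundles.

/* before: one accumulation per inner iteration */
for (i = 0; i < N; i++)
  for (j = 0; j < M; j++)
    c[i] += a[i][j] * b[j];

/* after unroll (outer loop by 2) & jam (the two inner loops are fused):
   two independent accumulations per iteration, more ILP for the scheduler */
for (i = 0; i < N; i += 2)
  for (j = 0; j < M; j++) {
    c[i]   += a[i][j]   * b[j];
    c[i+1] += a[i+1][j] * b[j];
  }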

72 Parallel architectures 2014-2015 D. Barthou 2-c Compiler optimizations: Predication

How to deal with branches ?

● Replace branches by conditional (predicated) instructions (see the sketch below)

● Both branches of the if..then..else are executed in parallel !
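A hedged sketch of if-conversion in plain C (variable names are illustrative): both sides of the conditional are computed and a predicate selects the result, which the compiler can map onto predicated instructions or conditional moves.

/* branchy version */
if (x > 0) y = a + b; else y = a - b;

/* if-converted version: no branch, both sides executed */
t1 = a + b;            /* "then" side */
t2 = a - b;            /* "else" side */
p  = (x > 0);          /* predicate */
y  = p ? t1 : t2;      /* selection, typically a conditional move */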

73 Parallel architectures 2014-2015 D. Barthou 2-c Compiler optimizations: Predication

● + No stall due to branches ! No branches !

● + Increases parallelism, necessary for VLIW

● - Many instructions are nops...

– Not efficient for if..then..else with many instructions

74 Parallel architectures 2014-2015 D. Barthou 2-c Compiler optimization: Speculation

How to deal with dependences?
store (r1), r2   ← stores r2 at address r1
load r3, (r4)    ← loads into r3 from address r4
Dependence if r1 == r4

● Instructions cannot be issued in parallel: stall...
Speculation: let's assume r1 and r4 are different!
load.a r3, (r4)   ← speculative load: no dependence assumed
store (r1), r2      (the load can be executed way before the store)
...
check.a r4          ← check that the speculation was indeed correct
75 Parallel architectures 2014-2015 D. Barthou 2-c Compiler optimization: Speculation

● + Removes costly dependences, enables more parallelism

● - If speculation erroneous, execute error handling code...Usually expensive.

● - Difficult for the compiler to tell when speculation pays off. Many times on Itanium there was actually no dependence, but the compiler had to assume one and speculate.

76 Parallel architectures 2014-2015 D. Barthou 2-d HW/SW optimization: branch prediction

Predict as soon as possible:

● Whether the instruction is a (taken) branch → branch predictor

● The target address (@) of the branch → branch target predictor (BTB cache)

Principle:

● Predict future based on past

● Possible to undo if prediction is not accurate

77 Parallel architectures 2014-2015 D. Barthou 2-d HW/SW optimization: branch prediction

Branch prediction algorithms

● Simplest or compiler driven:

– Always predict the same outcome (not taken if moving forward, taken if moving backward, ...)
– The compiler can select the strategy through ASM hints (profile-guided, ...)

● 1 bit predictor: only relies on latest execution of the same branch

– If the branch was taken in its previous execution, predict taken
– If the branch was previously not taken, predict not taken
→ Early prediction: at the fetch stage!

78 Parallel architectures 2014-2015 D. Barthou 2-d HW/SW optimization: branch prediction

Branch prediction algorithms

● 2-bit predictor: a 2-bit saturating counter per branch

● Prediction based on 2 bits

– 0 and 1: predict taken
– 2 and 3: predict not taken

● Update (saturating arithmetic)

– If taken, decrement the counter
– If not taken, increment the counter
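A minimal sketch of this 2-bit scheme in C, following the slide's convention (0-1 predict taken, 2-3 predict not taken); the table size and the indexing by the low bits of the PC are illustrative assumptions.

#define NENTRIES 1024
static unsigned char counter[NENTRIES];          /* 2-bit saturating counters, values 0..3 */

int predict_taken(unsigned pc) {
    return counter[pc % NENTRIES] < 2;           /* 0,1 -> taken ; 2,3 -> not taken */
}

void update(unsigned pc, int taken) {
    unsigned char *c = &counter[pc % NENTRIES];
    if (taken)  { if (*c > 0) (*c)--; }          /* taken: decrement, saturate at 0 */
    else        { if (*c < 3) (*c)++; }          /* not taken: increment, saturate at 3 */
}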

79 Parallel architectures 2014-2015 D. Barthou 2-d HW/SW optimization: branch prediction

Correlated branches (Yeh, Patt, 1992)
if (x[i] > 5) y = y + 4;
if (x[i] > 3) c = 2;

● If the first branch is taken, the second is taken too → the 2 branches are correlated
Branch history: keep the last h branch results

● Indexed by the PC of the branches

● Since the Pentium Pro: keep 2 bits of history for the 2 last branches

80 Parallel architectures 2014-2015 D. Barthou 2-d HW/SW optimization: branch prediction

(Figure: two-level predictor. The last k branch outcomes (e.g. 0010001) index a table of 2^k entries, each holding a small T/NT automaton. The history can be global, shared by all branches, or local, one history per branch.)

81 Parallel architectures 2014-2015 D. Barthou 2-d HW/SW optimization: branch prediction

Theoretical limits for Branch predictors

● It is not possible to predict noise (inherently random branches)
Practical limits:

● Size of the history

● Arbitration between different prediction methods

82 Parallel architectures 2014-2015 D. Barthou 2-e Limits for Instruction Level Parallelism

For an ideal machine

● Infinite number of registers

● Infinite size for execution window

● Infinite number of instructions/cycle

● Perfect branch prediction

83 Parallel architectures 2014-2015 D. Barthou 2-e Limits to Instruction Level Parallelism

A more realistic machine:

● Limited window (2048 instructions)

● Branch predictor

● 64 instructions/cycle max.

84 Parallel architectures 2014-2015 D. Barthou 2-e Limits to Instruction Level Parallelism

● To move from 6 to 12 instructions/cycle, estimate:

– 3-4 memory accesses/cycle
– 2-3 branches/cycle
– Rename more than 20 registers/cycle
– Fetch 12 to 24 instructions/cycle

● Increase complexity → limit frequency

● Examples

– Itanium 2: 6 instructions/cycle, large issue window. High power consumption, low frequency
– AMD VLIW5/VLIW4 shader architectures: 5/4 instructions/cycle; shaders reach 3.4 on average...
85 Parallel architectures 2014-2015 D. Barthou 2- Conclusion to Superscalar

Limits for instruction level parallelism (ILP)

● HW: for OoO, high complexity in terms of transistors when ILP increases

● SW: difficult to find more than 6 instructions in parallel in codes (even after optimization)
Performance gains from ILP

● “Out of order can buy you 5 cycles, not 200 cycles” (D.Levinthal, Intel)

● Important gains, main hindrance: dependences and branches

● Does not help to hide long memory latencies (see the memory section below)

86 Parallel architectures 2014-2015 D. Barthou 3- Vector/SIMD

a) What are vector instructions ?

● One instruction applies to multiple data (a vector)

● SIMD: single instruction, multiple data
b) How to use it?

87 Parallel architectures 2014-2015 D. Barthou 3-a SIMD

Features

● Not a fit for all codes or all instructions

● Instructions for SIMD: multimedia, 3D (GPU), scientific computation, physics simulations (games)
Multimedia = mass market (pays off HW development)

● SIMD extensions for all architectures

– Intel: MMX, SSE, AVX, AVX2
– IBM: Altivec
– ARM: Neon
– Sparc: Visual Instruction Set

● Some processors are vector processors: old Crays, GPUs, ... 88 Parallel architectures 2014-2015 D. Barthou 3-a SIMD

SIMDizable code

for (i=0; i<N; i++) A[i] = B[i] + C[i];

All iterations are parallel.
Adds elements of B and C into A, elementwise.

89 Parallel architectures 2014-2015 D. Barthou 3-a SIMD

SIMDizable code

for (i=0; i<N; i++) A[i] = B[i] + C[i];

● 4 successive iterations are parallel

● Add elements of B and C into A, elementwise.

90 Parallel architectures 2014-2015 D. Barthou 3-a SIMD

Usual vector kernels and size of vectors

91 Parallel architectures 2014-2015 D. Barthou 3-a SIMD: SSE example

Vector register (SSE) %xmm

● Can be seen as one 128-bit value, two 64-bit values, four 32-bit values, ...

● Length of the elements specified by the SIMD instruction

● Length of vector always 128 bits

● Load/store vectors in one instruction

92 Parallel architectures 2014-2015 D. Barthou 3-a SIMD: SSE example

SIMD Instructions (SSE)

● The operation is the same for all the elements
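As an illustration (not on the slide), the A = B + C loop seen earlier written with SSE intrinsics; N is assumed to be a multiple of 4 and the arrays are hypothetical float buffers.

#include <xmmintrin.h>

void add_sse(float *A, const float *B, const float *C, int N) {
    for (int i = 0; i < N; i += 4) {              /* 4 floats per 128-bit register */
        __m128 b = _mm_loadu_ps(&B[i]);           /* load 4 elements of B */
        __m128 c = _mm_loadu_ps(&C[i]);           /* load 4 elements of C */
        _mm_storeu_ps(&A[i], _mm_add_ps(b, c));   /* A[i..i+3] = B[i..i+3] + C[i..i+3] */
    }
}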

93 Parallel architectures 2014-2015 D. Barthou 3-a SIMD: AVX example

Extension of SSE to 256 bits

● Three-register instructions are possible: x = y op z

● Alignment issues with 256 bits (unaligned → slowdown)

To come next:

● AVX-512 (512 bits)

94 Parallel architectures 2014-2015 D. Barthou 3-a SIMD: Itanium example

Complex SIMD instruction: psad
for (i=0; i<7; i++) s += abs(A[i]-B[i]);

● Computes ∑ |A_i − B_i|

● Difficult for a compiler to select this instruction !

95 Parallel architectures 2014-2015 D. Barthou 3-a SIMDization

SIMD API

● Large API for SIMD (1400 pages of documentation for SSE)

● Who generates SIMD instructions ?

– Compiler: requires a vectorizer. Difficult to cover all the instructions of the API
– User: assembly-like coding (asm or intrinsics)

No automatic vectorizer in HW !

96 Parallel architectures 2014-2015 D. Barthou 3-b SIMDization

SIMDization methods

● Compiler optimization

● User vectorization

– Intrinsics
– Type attributes (gcc/icc)
– Code transformations to help compiler vectorization (pragmas, flags, ...)

97 Parallel architectures 2014-2015 D. Barthou 3-b SIMDization: difficult !

● Extract from the IBM developer manual for Blue Gene/L

98 Parallel architectures 2014-2015 D. Barthou 3-b Automatic SIMDization

Read the manual !

● Most compilers can vectorize

– GCC: http://gcc.gnu.org/projects/tree-ssa/vectorization.html
– Compilers provide a vectorization report explaining what is vectorized and, if not, why.

● Challenges for automatic vectorization

– Requires good dependence analysis

● Some code transformations can make this hard for the compiler
– Alignment issues
– Non-contiguous data in memory (strides > 1)
– Mixed scalar/vector code. Better if all computation uses SIMD.
99 Parallel architectures 2014-2015 D. Barthou 3-b Hand SIMDization

Hands-on vectorization

● In general, non portable code

● Avoid ASM as long as possible
Solution 1: intrinsics

● Non portable solution for using SIMD instructions in C/C++...

● One intrinsic function = one ASM instruction in C

● Compiler handles register allocation and other optimizations (few of them)

100 Parallel architectures 2014-2015 D. Barthou 3-b Hand SIMDization

Solution 1: intrinsics (Mandelbrot, Intel)

#include ...
__m256 ymm13 = _mm256_mul_ps(ymm11, ymm11);       // xi*xi
__m256 ymm14 = _mm256_mul_ps(ymm12, ymm12);       // yi*yi
__m256 ymm15 = _mm256_add_ps(ymm13, ymm14);       // xi*xi + yi*yi

// xi*xi + yi*yi < 4 in each slot
ymm15 = _mm256_cmp_ps(ymm15, ymm5, _CMP_LT_OQ);   // now ymm15 has all 1s in the non-overflowed slots
test  = _mm256_movemask_ps(ymm15) & 255;          // lower 8 bits are the comparisons
ymm15 = _mm256_and_ps(ymm15, ymm4);               // get 1.0f or 0.0f in each field as counters
ymm10 = _mm256_add_ps(ymm10, ymm15);              // iteration counters for each pixel

ymm15 = _mm256_mul_ps(ymm11, ymm12);              // xi*yi
ymm11 = _mm256_sub_ps(ymm13, ymm14);              // xi*xi - yi*yi
ymm11 = _mm256_add_ps(ymm11, ymm8);               // xi <- xi*xi - yi*yi + x0
ymm12 = _mm256_add_ps(ymm15, ymm15);              // 2*xi*yi
ymm12 = _mm256_add_ps(ymm12, ymm9);               // yi <- 2*xi*yi + y0

101 Parallel architectures 2014-2015 D. Barthou

3-b Hand SIMDization

Solution 2: type attributes

● GCC extension, supported by icc. Example:
typedef int vec __attribute__ ((vector_size (N)));

● Write computation with scalar operators +, -, *, / and compiler generates vector code.

● N has to be a constant (power of 2)
+ The compiler generates the code: vectorization, register allocation...
- Very limited set of operations on vectors; otherwise fall back on intrinsics.
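A small sketch of this extension (the type name v4f and the function are illustrative; vector_size is in bytes, so 16 bytes = 4 floats): the scalar operator + is applied elementwise and the compiler emits SIMD code.

typedef float v4f __attribute__((vector_size(16)));   /* 4 packed floats */

void add_vec(v4f *A, const v4f *B, const v4f *C, int nvec) {
    for (int i = 0; i < nvec; i++)
        A[i] = B[i] + C[i];                            /* elementwise vector addition */
}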

102 Parallel architectures 2014-2015 D. Barthou 3-b SIMDization

Speed-up example:

● On a fractal computation (embarrassingly parallel)

● Theoretical max speed-up: 8
Characteristics of the computation

● Computation-bound, no memory access

103 Parallel architectures 2014-2015 D. Barthou 3-b Hand SIMDization

Solution 3: Guide the compiler

● Use specific pragmas of the compiler/language: #pragma omp simd, #pragma ivdep, #pragma vector always (see the sketch after this list)

● Use dedicated compiler such as ISPC (Intel SPMD Program compiler), http://ispc.github.io

– SIMD and multithreaded programming
– Similar to GPU programming
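For instance, a minimal (illustrative) use of #pragma omp simd to assert that the iterations are independent so the compiler vectorizes the loop; the arrays are assumed not to alias.

void scale_add(float *a, const float *b, const float *c, int n) {
    #pragma omp simd
    for (int i = 0; i < n; i++)       /* iterations declared independent: vectorizable */
        a[i] = b[i] * c[i];
}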

104 Parallel architectures 2014-2015 D. Barthou 3-b Hand SIMDization

Solution 4: Modify your code to let the compiler vectorize

● Some transformations:

– Change the data layout to get contiguous data: A[B[i]] → copy into A'[i]; array of structures (AoS) ↔ structure of arrays (SoA)
– Avoid complex control (if...)
– Align data structures (valloc, memalign)
– Make loops with few instructions (loop fission)

105 Parallel architectures 2014-2015 D. Barthou 3-b SIMDization

Is it worth to SIMDize ?

● SIMDization reduces the number of compute instructions and of memory instructions → compute-bound codes are generally improved by SIMDization

● Vectorization does not change memory latency → for memory-bound code, SIMDization probably has no effect

106 Parallel architectures 2014-2015 D. Barthou 3- Vectorization conclusion

● Important performance factor now

● Make the best use of the compiler
– Check that loops are vectorized (asm or optimization reports)
– Transform the code to help the compiler
– Last resort: hand-vectorize using intrinsics

107 Parallel architectures 2014-2015 D. Barthou 4- Memory

CPU Memory

Performance limited by accesses to memory

● Bandwidth (byte/cycle)

– If m is the average fraction of instructions that access memory, 1+m memory accesses are required per instruction (instruction fetch + data)

● Latency (cycle for one access)

– Time for one access >> 1 cycle 108 Parallel architectures 2014-2015 D. Barthou 4- Memory

a) General principles
b) Caches
c) Caches in multicores
d) Code optimization for caches

109 Parallel architectures 2014-2015 D. Barthou 4-a DRAM technology

Dynamic RAM

Main memory: 1-T DRAM cell (one access transistor + a storage capacitor: FET gate, trench or stack)
+ low consumption
+ small size
- requires amplification (every read weakens the signal)
- non-square signal

110 Parallel architectures 2014-2015 D. Barthou 4-a SRAM technology

Static RAM

● - 6 transistors, takes room

● + Fast read on BL by applying voltage on WL (square signal)

● + Stable signal, no need for refresh

● - State (0 or 1) kept by constant voltage (Vdd)

111 Parallel architectures 2014-2015 D. Barthou 4-a NVRAM technology

Magnetoresistive RAM
+ High density
+ No need to keep power
- Not as fast as SRAM (but not far)
- Writing does not scale well to smaller feature sizes

112 Parallel architectures 2014-2015 D. Barthou 4-a NVRAM technology

Resistive RAM
+ High density
+ Works with very low voltage
+ Stackable
+ Fast (x20 SRAM in preliminary tests)
- Still early production/R&D

113 Parallel architectures 2014-2015 D. Barthou 4-a Memory technology

Organization as 2D arrays of bits (word lines × bit lines)
Address on N+M bits:
● N bits select the row (row decoder)
● M bits select the column (column decoder & sense amplifiers)
Each memory cell stores one bit.

114 Parallel architectures 2014-2015 D. Barthou 4-a Memory Issue

Processor/memory gap: +50% per year

● For a 2 GHz superscalar processor executing 4 instructions/cycle and a DRAM at 100 ns/access => 800 instructions per memory access!

115 Parallel architectures 2014-2015 D. Barthou 4-b Shape of accesses

(Figure: memory address vs. time, showing instruction fetches over n loop iterations, stack accesses around subroutine call/return, and data accesses to arguments.)
116 Parallel architectures 2014-2015 D. Barthou 4-b Locality

Memory accesses have a regularity that can be predicted.
Temporal locality: the same cell is accessed multiple times within a short amount of time.
Spatial locality: two neighboring cells are accessed within a short amount of time.

117 Parallel architectures 2014-2015 D. Barthou 4-b Cache

Caches take advantage of locality

● Temporal locality: keep elements recently accessed

● Spatial locality: fetch the elements next to the accessed elements
Uses static RAM

(Figure: CPU ↔ small, fast memory (RF, SRAM) holding frequently used data ↔ big, slow memory (DRAM).)
120 Parallel architectures 2014-2015 D. Barthou 4-b Cache

For each memory access:

● Check if the address is within the cache (cache hit),

● If not, this is a cache miss: fetch it from memory and keep it in the cache.
Expected gain: latency T1 for a hit, T2 for a miss (memory access)
h: fraction of memory accesses that hit (hit ratio)
Mean latency = h·T1 + (1 - h)·T2
If h = 0.5, at best latency ≈ ½·T2... only a factor of 2, whatever the speed of the cache!
=> The hit ratio must be close to 1

121 Parallel architectures 2014-2015 D. Barthou 4-b Cache behavior

Key steps

● Identification: how to find data in cache ?

● Placement: where are data in cache ?

● Replacement policy: How to make room for new data ?

● Write Policy : How to propagate data changes ?

● Which strategies to improve hit ratio ?

122 Parallel architectures 2014-2015 D. Barthou 4-b Cache structure and identification

Cache line: block of consecutive data in memory Cache lines are organized in sets Associativity: number of cache lines in one set

From memory address to cache location

● One part is just kept for identification, the tag.

● One part identifies the set

● One part corresponds to position in cache line

123 Parallel architectures 2014-2015 D. Barthou 4-b Cache structure and identification

Data stored at address 01001 01 011 (tag = 01001, set = 01, offset = 011)

(Figure: cache with 4 sets (00, 01, 10, 11) and 8 offsets per line (000 to 111); the data is stored in set 01, in a line tagged 01001, at offset 011.)
124 Parallel architectures 2014-2015 D. Barthou 4-b Placement

Associativity

● Data can be

– In any cache line of its set
– In one single set

● Extreme cases: fully associative (one single set) and direct-mapped (one cache line per set)
Data in cache?

● Extract its set from address

● For each cache line of the set (in parallel)

– Compare the tag of the cache line with the tag of the address
– Same? The data is found
125 Parallel architectures 2014-2015 D. Barthou 4-b Replacement: associative caches

When a set is full, which line should be evicted?

● Random

● Least Recently Used, LRU

– Keep a freshness indication, updated with each access to the set
– Used for low associativity (2, 4, 8)

● Round robin

– Used for high assoc.

● Least Frequently Used, LFU

– Keep a counter for updating frequency

126 Parallel architectures 2014-2015 D. Barthou 4-b Writing Policy

If data is written, which copies should be updated?
Cache hit:

● Write through: write to cache and memory. Simple but consumes bandwidth.

● Write back: write to the cache only. Data is tagged dirty. Write to memory when the cache line is evicted.
Cache miss (for a write):

● No write allocate: no data in cache

● Write allocate: bring the cache line into the cache (read the cache line!)
Usual combinations: write-through with no-write-allocate, write-back with write-allocate

127 Parallel architectures 2014-2015 D. Barthou 4-b Types of cache misses

Compulsory cache miss:

● First access to data Capacity cache miss:

● Data already accessed and was in cache, but evicted since then by other data due to lack of space in cache Conflict miss:

● Data already accessed and was in cache, but evicted since then by other data due to lack of space in set (placement policy issue)

128 Parallel architectures 2014-2015 D. Barthou 4-b Types of cache misses

Compulsory cache miss:

● Prefetch Capacity cache miss:

● Reorganize the schedule of accesses, reduce the number of accesses between 2 accesses to the same data
Conflict miss:

● Reorganize data alignment (complex), padding, restart the OS (VM fragmentation may introduce misses).

129 Parallel architectures 2014-2015 D. Barthou 4-b Memory hierarchy

130 Parallel architectures 2014-2015 D. Barthou 4-b Memory hierarchy

Influence of L2 on L1
Use a smaller L1

● Improve access time if hit

● Improves (mean) energy consumption
Use a write-through L1 (simpler) and a write-back L2

● The write-back L2 absorbs outgoing memory traffic

131 Parallel architectures 2014-2015 D. Barthou 4-b Memory hierarchy

Inclusion

● Inclusive Cache

– L2 has copies of the cache lines in L1
– External access: only L2 needs to be checked
– Frequent cache configuration

● Exclusive cache

– L1 may hold cache lines that are not in L2
– Swap cache lines if L1 misses and L2 hits

132 Parallel architectures 2014-2015 D. Barthou 4-b Reducing associativity cost

Associativity:

● Reduces number of conflict misses

● Expensive (area, energy, delay)
Optimizations for associativity:

● Victim cache

● Way prediction

133 Parallel architectures 2014-2015 D. Barthou 4-b Victim cache

(Figure, HP 7200: CPU/RF with a direct-mapped L1 data cache backed by a small fully associative victim cache of 4 blocks holding data evicted from L1, in front of a unified L2.)
Victim cache: small associative cache working as a back-up to a direct-mapped cache (Jouppi, 1990)

● First lookup in the DM cache; on a miss, lookup in the VC

● On a VC hit, swap the data between DM and VC. On a miss, evicted DM line → VC, evicted VC line → ?
134 Parallel architectures 2014-2015 D. Barthou 4-b Way Predicting Instruction Cache

(Figure, DEC Alpha 21264: the PC addresses the primary instruction cache; each fetch also reads a way prediction, and the next fetch follows either the sequential way (PC+4) or the branch-target way.)

135 Parallel architectures 2014-2015 D. Barthou 4-b Way Prediction Cache

Use prediction table based on address to guess the way to access (MIPS R10000 L2)

(Flow: on a predicted-way HIT, return the copy of the data from the cache; on a MISS, look in the other way; a SLOW HIT there updates the entry in the prediction table; if both ways miss, read the block from the next level of cache.)
136 Parallel architectures 2014-2015 D. Barthou 4-c Caches for multicores

L1: not shared, performance critical
Sharing L2 or L3:

● + Better communication between cores, through the cache

● - Memory contention (shared bandwidth, shared cache capacity)

No sharing:

● - Communication only through memory

● + No contention

In all cases, memory is shared:

● - Bandwidth shared (chip interface)

137 Parallel architectures 2014-2015 D. Barthou 4-c Write back/Write Through

Write Through

● Every write →

– Update the local cache
– Write on the bus between cores: updates memory and invalidates/updates the other caches

● Pro: simple to implement

● Cons: roughly ~15% of accesses are write, high BW demand

– Does not scale to a high number of cores...
– Requires dual-way tagging

138 Parallel architectures 2014-2015 D. Barthou 4-c Write back/ Write Through

Write back

● When cache owning data writes, no writing on bus

– Preserves bandwidth
– Used by most current multicores

139 Parallel architectures 2014-2015 D. Barthou 4-c Caches et shared memory

To limit traffic, multiple copies of the same data are possible in different caches.
Data read: no problem.
Data written while copies exist in multiple caches → all copies must be updated! This is the cache coherency issue, similar to the DMA & cache issue.

The atomic memory block in a cache is the cache line → two cores can share the same cache line even if the data they access are different! This can lead to false sharing.

140 Parallel architectures 2014-2015 D. Barthou 4-c Caches and DMA

(Figure: processor with its cache and a DMA engine, both connected through the memory bus (address, data, R/W) to physical memory; the DMA transfers pages from/to disk.)
Page transfers occur while the processor is running: either the cache or the DMA can be the bus master and effect transfers.

141 Parallel architectures 2014-2015 D. Barthou 4-c Caches and DMA

(Figure: portions of a page may be cached while the DMA transfers the same page between disk and physical memory.)

142 Parallel architectures 2014-2015 D. Barthou 4-c Cache coherency

Coherency:

● All cores must see same data at same address

● If one core writes to private cache,

– Inform all other caches (snooping)
– Inform only the caches that have the data (directory)

● After modification

– Invalidation: all other copies are invalidated
– Update: all copies are updated

143 Parallel architectures 2014-2015 D. Barthou 4-c Snoopy cache

Requires cache write through.

● Memory (or the LLC) has up-to-date values

● All updates are posted on bus Principle: All caches spy the bus (snoopy cache)

● If a write request on bus for an address in cache → invalidate or update

● If a read request appears on the bus (miss on an address) → check whether it has the data, and answer the request

144 Parallel architectures 2014-2015 D. Barthou 4-c Snoopy cache

(Figure: processors M1, M2 and M3, each with a snoopy cache, plus a DMA engine and disks, all connected to physical memory through the memory bus.)

145 Parallel architectures 2014-2015 D. Barthou 4-c Central directory

Principle: for a shared cache, the directory keeps track of which caches hold each piece of data.
The directory is updated after each access.
Optimizes cache accesses and coherency traffic.

146 Parallel architectures 2014-2015 D. Barthou 4-c Central directory

How: with each line in the directory, one bit per cache tags the caches possessing this line.

+ Centralized information
- Updates first need to consult the directory
- Contention on the directory
- Cost in terms of memory space

Possible to distribute directory among caches...

147 Parallel architectures 2014-2015 D. Barthou 4-c Distributed directory

On Xeon Phi:

● Each L3 contains information on directory for some cache lines

148 Parallel architectures 2014-2015 D. Barthou 4-c Coherency protocol: MSI

Each cache line has state bits stored with the address tag:
M: Modified
S: Shared
I: Invalid

(Figure: MSI state diagram for the cache of processor P1. A read miss brings the line in S; a write miss or an intent to write by P1 brings it to M; in M, P1 can read and write freely; a read by another processor forces a write-back and M → S; another processor's intent to write moves M or S to I.)
149 Parallel architectures 2014-2015 D. Barthou 4-c Coherency protocol: MESI

Each cache line has state bits stored with the address tag:
M: Modified (exclusive)
E: Exclusive, unmodified
S: Shared
I: Invalid

(Figure: MESI state diagram for processor P1. A read miss brings the line in E if no other cache holds it, in S otherwise; a write by P1 moves E or S to M; in M, P1 reads and writes freely; a read by another processor forces a write-back and M → S; another processor's intent to write moves any state to I.)
150 Parallel architectures 2014-2015 D. Barthou 4-c Benchmark results on Nehalem

151 Parallel architectures 2014-2015 D. Barthou 4-c Benchmark results on Nehalem

152 Parallel architectures 2014-2015 D. Barthou 4-c Benchmark results on Nehalem

For data in state M:

● Is a write-back necessary?

● If the data is in the L1/L2 of another core, it is written back to L3 and then read from L3 → performance is lower than when the data is in memory, and much lower than when it is in L3!!

153 Parallel architectures 2014-2015 D. Barthou 4-c Nehalem cache in details

L2 Cache:

● 256 KB, 8-way assoc.

● 1 L2 private for each of the 4 cores of the chip

154 Parallel architectures 2014-2015 D. Barthou 4-c Nehalem cache in details

L2 Cache (snoopy)

● Handles requests from other L2: invalidation, write back to L3

● State F (forward): an additional variant of S; only one copy is in state F

● The cache holding the line in F is the only one to answer requests

155 Parallel architectures 2014-2015 D. Barthou 4-c Nehalem cache in details

L3 Cache:

● Shared between 4 cores

● 8MB, 16 way, inclusive

● Arbitration: serializes L3 requests coming from L2

156 Parallel architectures 2014-2015 D. Barthou 4-c Nehalem cache in details

L3 Cache:

● Maintains directory

● Minimizes traffic

157 Parallel architectures 2014-2015 D. Barthou 4-c Nehalem cache in details

Coherency at L3 level

● Arbitration between L2

● Up-to-date list of the data held in the L2 caches

158 Parallel architectures 2014-2015 D. Barthou 4-c Benchmark results on Nehalem

Writing data in state E:

● Invalidation, transfers between L2 caches

● Use of snoop

159 Parallel architectures 2014-2015 D. Barthou 4-c Benchmark results on Nehalem

Writing data in state M:

● Same issue as for the read

● Performance very low when data in L1...

160 Parallel architectures 2014-2015 D. Barthou 4-c Benchmark results on Dunnington

Dunnington: IBM EX4 machine, 96 cores, grouped into 4 nodes of hexacore chips

● L1 private (32KB), L2 shared between 2 cores (3MB), L3 shared (16MB) between 6 cores

● L4 between chips of the same node

161 Parallel architectures 2014-2015 D. Barthou 4-c Benchmark results on Dunnington

Read access:

● Similar performance to Nehalem

● Loss of performance when reading data modified in another core's L1

162 Parallel architectures 2014-2015 D. Barthou 4-c Benchmark results on Dunnington

Write access:

● Very poor performance

● Worse when data is in local cache than when in distant cache...

● Same for states M and S
→ Better not to write on this machine...
→ High impact on performance for many codes

163 Parallel architectures 2014-2015 D. Barthou 4-c False sharing

Cache line: the minimal block of information stored in the cache. The whole line is invalidated, loaded, ...
It is possible to increase the line size without increasing the number of tags → decreases the relative cost of the tags.

Cons: coherency is at line granularity. If a line is shared between cores but the data is not, this is false sharing.
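A hedged sketch of the usual remedy: pad and align per-core data so that each counter occupies its own cache line (the 64-byte line and the 8 cores are illustrative assumptions); the invalidation ping-pong shown on the next slide then disappears.

#define CACHE_LINE 64                       /* assumed cache line size in bytes */
#define NCORES     8                        /* illustrative number of cores */

/* one counter per core, each aligned to and padded up to a full cache line */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
} __attribute__((aligned(CACHE_LINE)));

struct padded_counter counters[NCORES];     /* counters[i].value updated only by core i */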

164 Parallel architectures 2014-2015 D. Barthou 4-c False sharing

A[0] and A[1] are in the same cache line. Core 1 writes A[0], Core 2 writes A[1] → each write invalidates the cache line in the other core. High impact on performance.

(Figure: the line ping-pongs between the two L1 caches — store 1 to A[0] on core 1; store 2 to A[1] on core 2 invalidates core 1's copy; store 3 to A[0] on core 1 invalidates core 2's copy; store 0 to A[1] on core 2 invalidates core 1's copy again.)
165 Parallel architectures 2014-2015 D. Barthou 4-c Prefetching

The hardware can prefetch data in cache, based on regular accesses

● Speculates on future data/instruction addresses and prefetches them in cache

– Instruction prefetch easier

● Types of prefetch

– Hardware
– Software (compiler or user)

Which types of miss does prefetching avoid?

166 Parallel architectures 2014-2015 D. Barthou 4-c Prefetching hardware

● Prefetch on miss

– Prefetch block b+1 on a miss on block (line) b

● One Block Lookahead (OBL) mechanism

– Start prefetching block b+k when accessing block b (prefetch parameter k)

● Prefetch with stride

– Detects regular accesses: when blocks b, b+N, b+2N are accessed, prefetches block b+3N
– Example: IBM Power5: 8 independent stride-based prefetch streams, prefetching 12 lines in advance

167 Parallel architectures 2014-2015 D. Barthou 4-c Prefetching hardware

Instruction prefetch on architecture Alpha AXP 21064

– Prefetches 2 blocks on a miss (the missed block and the following one)
– The next block is put in a stream buffer
– On an L1 miss, the stream buffer is queried and the next block is prefetched
(Figure: CPU/RF with the L1 instruction cache backed by a stream buffer in front of the unified L2.)
168 Parallel architectures 2014-2015 D. Barthou 4-c Prefetching hardware

Limits

● How to distinguish regular accesses with a large stride from accesses to unrelated data?

● Strided prefetch has a size limit

– It is important to choose appropriate data structures
– It is important how the data is traversed, for instance the order of the array dimensions

169 Parallel architectures 2014-2015 D. Barthou 4-d Software Cache Optimizations

Transformations on source or assembly code

● Change code to enhance spacial/temporal locality

– Reorder instructions: tiling, fusion, fission, loop interchange, ... Check dependences!
– Change data structures: copy data or recompute redundantly (instead of loading data twice)
– Only load into the cache what is necessary: streaming or non-temporal accesses for data read/written only once.

170 Parallel architectures 2014-2015 D. Barthou 4-d Loop interchange

for(j=0; j < N; j++) { for(i=0; i < M; i++) { x[i][j] = 2 * x[i][j]; } }

for(i=0; i < M; i++) { for(j=0; j < N; j++) { x[i][j] = 2 * x[i][j]; } }

Loop interchange: which improvement ? 171 Parallel architectures 2014-2015 D. Barthou 4-d Loop Fusion

for(i=0; i < N; i++) a[i] = b[i] * c[i];

for(i=0; i < N; i++) d[i] = a[i] * c[i];

for(i=0; i < N; i++) { a[i] = b[i] * c[i]; d[i] = a[i] * c[i]; }

Loop fusion: which improvement ? 172 Parallel architectures 2014-2015 D. Barthou 4-d Loop Fission

for(i=0; i < N; i++) { a[i] = b[i] * c[i]; d[i] = e[i] * f[i]; }

for(i=0; i < N; i++) a[i] = b[i] * c[i]; for(i=0; i < N; i++) d[i] = e[i] * f[i];

Loop fission: which improvement ? 173 Parallel architectures 2014-2015 D. Barthou 4-d Tiling

for(i=0; i < N; i++) for(j=0; j < N; j++) for(k=0; k < N; k++) c[i][j] = c[i][j] + a[i][k]*b[k][j];

for(i=0; i < N; i=i+B)
  for(j=0; j < N; j=j+B)
    for(k=0; k < N; k++)
      for(ii=i; ii < min(i+B,N); ii++)
        for(jj=j; jj < min(j+B,N); jj++)
          c[ii][jj] = c[ii][jj] + a[ii][k]*b[k][jj];

Tiling: which improvement ? What is best size for B ?? 174 Parallel architectures 2014-2015 D. Barthou 4-d Tiling LU computation

Effects of coherency on Nehalem

● Impact of small block (fits in private L1)

● Tiling for L3

175 Parallel architectures 2014-2015 D. Barthou 4-d Tiling for LU computation

176 Parallel architectures 2014-2015 D. Barthou 4-d Software Prefetch

for(i=0; i < N; i++) {
  prefetch( &a[i + k] );
  prefetch( &b[i + k] );
  SUM = SUM + a[i] * b[i];
}
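As an aside (not on the slide), with GCC the generic prefetch() above can be written with the real builtin __builtin_prefetch; the distance k is a tuning parameter and prefetches are only hints, so addresses slightly past the end of the arrays are harmless.

double dot_prefetch(const double *a, const double *b, int N, int k) {
    double sum = 0.0;
    for (int i = 0; i < N; i++) {
        __builtin_prefetch(&a[i + k]);      /* hint: bring a[i+k] towards the cache */
        __builtin_prefetch(&b[i + k]);
        sum += a[i] * b[i];
    }
    return sum;
}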

Software prefetch: which prefetch distance?

● Too short distance: no advantage of prefetch

● Too far a distance:

– Page miss, cache pollution
– Should not exceed the cache size
177 Parallel architectures 2014-2015 D. Barthou 4-d Array of Structure (AoS) or Structure of Array (SoA) ?

int value[SIZE], key[SIZE];

struct merged { int value, key; };
struct merged array_merged[SIZE];

● Which layout is better for the cache?
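An illustrative comparison for a traversal that only reads value (function names are made up; struct merged is the one defined above): with the SoA layout every fetched cache line is full of useful value fields, while with the AoS layout part of each line is occupied by the unused key fields.

/* SoA: 'value' is contiguous, lines contain only useful data */
long sum_soa(const int *value, int n) {
    long s = 0;
    for (int i = 0; i < n; i++) s += value[i];
    return s;
}

/* AoS: each line also carries the unused 'key' fields */
long sum_aos(const struct merged *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++) s += a[i].value;
    return s;
}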

178 Parallel architectures 2014-2015 D. Barthou 4-d An exemple: Matrix product

for (i=0; i<N; i++) for (j=0; j<N; j++) for (k=0; k<N; k++) C[i][j] += A[i][k] * B[k][j];

Matrix product

● At the heart of many solvers

● Good characteristics for cache: 3·N² data, N³ operations

● Matrices too large to fit in cache

● Many opportunities for optimization 179 Parallel architectures 2014-2015 D. Barthou 4-d Matrix product

for (i=0; i<N; i++) for (j=0; j<N; j++) for (k=0; k<N; k++) C[i][j] += A[i][k] * B[k][j];

Optimizations (depend on matrix size)

● Tiling for different cache levels, for registers (blocking)

● Prefetching,

● Ordering instructions, unrolling

● Vectorization 180 Parallel architectures 2014-2015 D. Barthou 4-d Matrix product

for (i=0; i<N; i++) for (j=0; j<N; j++) for (k=0; k<N; k++) C[i][j] += A[i][k] * B[k][j];

ATLAS: Automatically Tuned Linear Algebra Software

● Generate linear algebra library

– Generates multiple versions of the code
– Benchmarks them and keeps the best (auto-tuning, iterative compilation)
181 Parallel architectures 2014-2015 D. Barthou 4-d ATLAS: Matrix product

Large cache scenario:

● Matrices are small enough to fit into cache

● Only cold misses, no capacity misses

● Miss ratio:

– Data size = 3·N²

– Each miss brings in b floating-point numbers

– Miss ratio = 3N² / (b·4N³) = 0.75/(b·N) ≈ 0.019 (b = 4, N = 10)

182 Parallel architectures 2014-2015 D. Barthou 4-d ATLAS: Matrix product

Small cache scenario:

● Matrices are large compared to cache

● Cold and capacity misses

● Miss ratio:

– C: N²/b misses (good temporal locality), A: N³/b misses (good spatial locality), B: N³ misses (poor locality)

– Miss ratio ≈ 0.25·(b+1)/b = 0.3125 (for b = 4)

183 Parallel architectures 2014-2015 D. Barthou 4-d ATLAS: Matrix product

Tiling in ATLAS:

● Only square tiles

● Objective: tile for L1 or L2

● Parameter: NB

Mini-MMM code

for (int j = 0; j < NB; j++)

for (int i = 0; i < NB; i++)

for (int k = 0; k < NB; k++)

C[i][j] += A[i][k] * B[k][j]

184 Parallel architectures 2014-2015 D. Barthou 4-d ATLAS: Matrix product

Tiling for registers

● Parameters: NU, MU, KU

● These blocks fit in registers

● Additional code if size not multiple of NU/MU/KU

for (int j = 0; j < NB; j += NU)

for (int i = 0; i < NB; i += MU)

load C[i..i+MU-1, j..j+NU-1] into registers

for (int k = 0; k < NB; k++)

load A[i..i+MU-1,k] into registers

load B[k,j..j+NU-1] into registers

multiply A’s and B’s and add to C’s

store C[i..i+MU-1, j..j+NU-1]
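A concrete, illustrative instance of this register tiling with MU = NU = 2 and KU = 1 (NB assumed even): the four C values are kept in scalar variables, i.e. registers, across the whole k loop.

/* mini-MMM with 2x2 register blocking; A, B, C are NB x NB arrays as above */
void mini_mmm_2x2(int NB, double A[NB][NB], double B[NB][NB], double C[NB][NB]) {
    for (int j = 0; j < NB; j += 2)
        for (int i = 0; i < NB; i += 2) {
            double c00 = C[i][j],   c01 = C[i][j+1];
            double c10 = C[i+1][j], c11 = C[i+1][j+1];
            for (int k = 0; k < NB; k++) {
                double a0 = A[i][k],  a1 = A[i+1][k];
                double b0 = B[k][j],  b1 = B[k][j+1];
                c00 += a0 * b0;  c01 += a0 * b1;
                c10 += a1 * b0;  c11 += a1 * b1;
            }
            C[i][j]   = c00;  C[i][j+1]   = c01;
            C[i+1][j] = c10;  C[i+1][j+1] = c11;
        }
}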

185 Parallel architectures 2014-2015 D. Barthou 4-d ATLAS: Matrix product

Search for parameters (depending on architecture)

– NU, MU, KU, NB
– Prefetch distance
– Scheduling of instructions in loops
Several methods:

● Exhaustive search, with a priori bounding of parameters

● Model cache and architecture

– Variant: perform local search around expected values

● Improve further the code by hand (unleashed version of ATLAS)

186 Parallel architectures 2014-2015 D. Barthou 4-d ATLAS: Matrix product

Model approach very close to exhaustive search

● Obtain good code in short amount of time

● Still a gap with hand-tuned code → requires expertise, possibly bridged through learning methods
Advantages of ATLAS:

● Fully automatic

● Adapts to new architectures

187 Parallel architectures 2014-2015 D. Barthou 4-d MAGMA: for heterogeneous architectures

Hybrid approach for library

● Task graph

● To schedule on CPU/GPU

● To adapt according to the CPU/GPU
Adapt the algorithms too

188 Parallel architectures 2014-2015 D. Barthou 4-d MAGMA: for heterogeneous architectures

190 Parallel architectures 2014-2015 D. Barthou 4-d MAGMA: scheduling with StarPU on CPUs+GPUs

190 Parallel architectures 2014-2015 D. Barthou 4-d Spiral: signal processing

191 Parallel architectures 2014-2015 D. Barthou 4-d Spiral: expressing code as formula

192 Parallel architectures 2014-2015 D. Barthou 4-d Iterative compilation and machine learning methods

Auto-tuning generalization

● Search for best optimization parameters, automatically

– Limit the search through machine learning or modeling

● Search for best algorithmic combination for an architecture

– SPIRAL and FFTW: library generators for signal processing
– MAGMA, PhiPAC: for linear algebra

● In compilers: Milepost for GCC: learning which optimizations are efficient for which types of code

– Pro: it works. Cons: blind shooting !

193 Parallel architectures 2014-2015 D. Barthou 5- Next Frontier: Exascale

194 Parallel architectures 2014-2015 D. Barthou 5- Next Frontier: Exascale

What is needed for Exascale? Which roadmap?
Grand-challenge applications: climate science, high-energy physics, nuclear physics, fusion energy, nuclear energy, biology, material science and chemistry, national security.
IESP: International Exascale Software project (http://www.exascale.org), gathering industry, universities and agencies.
Two possible scenarios:
● Large number of small chips: 1000 cores/chip, 1 million chips
● Hybrid processors: 1 GHz processors, 10K FPUs/socket, 100K sockets/system
→ 10^9 threads

195 Parallel architectures 2014-2015 D. Barthou 5- Next Frontier: Exascale

Changing scale: from Luxembourg to the EU
Challenges for Exascale (HW):
- power consumption
- efficiency
- scalability
Challenges for Exascale (SW):
- expressivity
- abstracting code from the architecture
- handling heterogeneity
- maintenance

196 Parallel architectures 2014-2015 D. Barthou 5- Last Frontier ?

(c) AMD 197 Parallel architectures 2014-2015 D. Barthou 5- Towards Exaflop

● Moore's law

– 22 nm: 37% gain in power consumption

198 Parallel architectures 2014-2015 D. Barthou 5- Towards Exaflop

< 5 nm: quantum effects become an issue

(c) Intel 199 Parallel architectures 2014-2015 D. Barthou 5- Towards Exaflop

Predictions from the DoE: 5 major changes required

● a- Memory hierarchies

● b- Fault Tolerance

● c- Energetic efficiency

● d- Heterogeneity

● e- Scaling

http://www.csm.ornl.gov/~engelman/publications/engelmann10facilitating.ppt.pdf

200 Parallel architectures 2014-2015 D. Barthou 5-a Memory hierarchy

Moving data around: consumes energy

● Computation is (almost) free, bandwidth is expensive
Memory hierarchies:

● 3D chips

● Photonic interconnects
Dataflow processors

201 Parallel architectures 2014-2015 D. Barthou 5-a Stacking memory in 3D

Project Eurocloud FP7, www.eurocloudserver.com 202 Parallel architectures 2014-2015 D. Barthou 5-a Stacking memory in 3D

Sandwich of many-core and DRAM → better bandwidth, better efficiency

Samsung 3D RAM 203 Parallel architectures 2014-2015 D. Barthou 5-a Photonic network on chip

204 Parallel architectures 2014-2015 D. Barthou 5-a Photonic network on chip

Board for a photonic supercomputer

Bandwidth: 10 TB/s, off- and on-chip 205 Parallel architectures 2014-2015 D. Barthou 5-a Dataflow architecture

Principle: reduce on-chip memory by connecting functional units directly → FPGA or grid-connected many-core → reduces consumption (5 W)

(c) Maxeler

(c) MPPA Kalray 206 Parallel architectures 2014-2015 D. Barthou 5-b Fault Tolerance

Causes: wire width, many units, very low voltage

(c) Intel 207 Parallel architectures 2014-2015 D. Barthou 5-b Fault Tolerance

Checkpoint/restart: save data regularly.
Requires global I/O operations.
The MTBF has to be greater than the save time...
To define: new algorithms with checkpointing integrated, easier restart.

(c) Intel — checkpointing and synchronization 208 Parallel architectures 2014-2015 D. Barthou 5-c Energetic efficiency

Hardware improvement alone is not enough:
- New low-power technologies
- Switch off some components
- Reduce frequency when the code is memory-bound
- Improve cache algorithms, make better use of the cache
- Trade off computation vs. data movement: change precision, redundant computation, ...

209 Parallel architectures 2014-2015 D. Barthou 5-c Energetic efficiency: BlueGene Q

210 Parallel architectures 2014-2015 D. Barthou 5-c Energetic efficiency: Blue Gene Q

211 Parallel architectures 2014-2015 D. Barthou 5-c Energetic efficiency: Blue Gene Q

212 Parallel architectures 2014-2015 D. Barthou 5-c Energetic efficiency: ARM

ARM Kal-El

● 4 fast cores

● 1 slower core

213 Parallel architectures 2014-2015 D. Barthou 5-c Energetic efficiency: Exynos Samsung

214 Parallel architectures 2014-2015 D. Barthou 5-c Energetic efficiency: Exynos Samsung

215 Parallel architectures 2014-2015 D. Barthou 5-d Heterogeneity

Causes:

● Multiple types of cores: accelerators, GPUs

● Interconnect network

● Type of memory, cache interconnect

● Software: programming models, OS, tools

– MPI, OpenMP, OpenCL, ...

216 Parallel architectures 2014-2015 D. Barthou 5-d Heterogeneity hardware

217 Parallel architectures 2014-2015 D. Barthou 5-d Heterogeneity software

An example of OpenMP/MPI/CUDA code

218 Parallel architectures 2014-2015 D. Barthou 5- Implications

New languages required

● Need higher abstractions for parallelism

● Need techniques for optimization / auto-tuning

– “portable” code

● Need runtime techniques

– Dynamic adaptation of code to data and hardware

● Need tools for performance analysis

● To be developed together with applications
A lot of work still to be done!

219 Parallel architectures 2014-2015 D. Barthou