Introduction Principles Core Memory Processor Network Exascale

Computer Architectures

D. Pleiter

Jülich Supercomputing Centre and University of Regensburg

January 2016

Overview

Introduction

Principles of Computer Architectures

Processor Core Architecture

Memory Architecture

Processor Architectures

Network Architecture

Exascale Challenges


Memory in Modern Node Architectures

• Today’s compute nodes combine different memory types and technologies
  • Volatile memories
    • Main memory (DDR3 or DDR4)
    • Caches (SRAM)
    • Accelerator memory (GDDR5 or HBM)
  • Non-volatile memories
    • SSD (NAND flash)
• The technologies differ in their capabilities
  • Bandwidth
  • Latency for a single transfer
  • Access time
  • Cycle time
  • Capacity

[Figure: node block diagram with cores (L1-I/L1-D, L2, L3 slice per core), memory controller and IO, DDR memory, GPU with GDDR5/HBM, and SSD]

Memory: DRAM

• Data bit storage cell = 1 capacitor + 1 transistor
• Capacitors discharge ⇒ periodic refresh needed
  • Refresh required in intervals of O(10) ms
  • Possibly significant performance impact
  • Memory access latencies become variable
• Memory organisation
  • Memory arrays with separate row and column addresses ⇒ fewer address pins
• Read operation: assert the row address strobe (/ras) together with the row address, then the column address strobe (/cas) together with the column address; the data then appears on the data pins

[Figure: DRAM read timing diagram showing the /ras, /cas, /we, addr (row, col) and data signals]

Memory: DRAM (2)

• Technology improvements
  • Synchronisation with the system bus: Synchronous dynamic random-access memory (SDRAM)
  • Double Data Rate (DDR)
• DRAM typically packaged on small boards: DIMM = Dual Inline Memory Module
• Typical DRAM channel width: 8 Byte
• Performance range for a selected set of data rates:

                                 DDR3        DDR4
  Data rate [Mbps]               400-2333    1600-3200
  Bandwidth per DIMM [GByte/s]   3.2-18.6    12.8-25.6
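The bandwidth row of the table follows from the 8-Byte channel width stated above: peak bandwidth per DIMM is the data rate times 8 Byte per transfer. A minimal sketch of this calculation (the function name is illustrative):

```c
#include <assert.h>

/* Peak DIMM bandwidth in MByte/s: data rate in mega-transfers per
 * second times the 8-Byte channel width quoted above. */
static double peak_bw_mbyte_s(double mega_transfers_per_s) {
    return mega_transfers_per_s * 8.0;
}
```

For example, DDR4-3200 yields 3200 × 8 = 25600 MByte/s = 25.6 GByte/s, matching the table.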

DRAM Organisation

Organisation of DRAM memory connected to a processor:
• Multiple memory channels per processor
• Multiple memory ranks attached to a single channel
  • Chip-select signals enable multiple ranks
• Multiple memory banks
• Memory cells arranged in rows and columns

Physical address: | ch ID | rank ID | bank ID | row | col |
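Address decoding along these fields can be sketched as follows. The field order matches the layout shown (ch ID most significant, column least significant); the bit widths assume an illustrative 4-channel, 4-rank, 8-bank, 64k-row, 1k-column configuration. Real memory controllers often interleave these bits differently.

```c
#include <assert.h>
#include <stdint.h>

/* Decode a DRAM physical address into channel/rank/bank/row/column
 * fields (illustrative field widths: 2+2+3+16+10 bits). */
typedef struct {
    unsigned ch, rank, bank, row, col;
} dram_addr_t;

static dram_addr_t dram_decode(uint64_t addr) {
    dram_addr_t d;
    d.col  = addr & 0x3FF;    /* 10 bits: 1k columns  */
    addr >>= 10;
    d.row  = addr & 0xFFFF;   /* 16 bits: 64k rows    */
    addr >>= 16;
    d.bank = addr & 0x7;      /*  3 bits: 8 banks     */
    addr >>= 3;
    d.rank = addr & 0x3;      /*  2 bits: 4 ranks     */
    addr >>= 2;
    d.ch   = addr & 0x3;      /*  2 bits: 4 channels  */
    return d;
}
```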

DRAM Organisation (cont.)

Typical design as of 2014:
• 4 Gbit memory die with 8-bit memory cells
• 8 banks
• 64k rows
• 1k columns
• 8 memory dies per rank (8 × 8 bit = 64-bit channel data bus)
• 4 ranks per channel
• 4 channels
Total memory capacity: 4 Gbit × 8 dies × 4 ranks × 4 channels = 64 GiByte

Memory: SRAM

• Data bit storage cell = (4 + 2) transistors
• Comparison with DRAM:

                      DRAM      SRAM
  Refresh required?   yes       no
  Speed               slower    faster
  Density             higher    lower

• Typically used for on-chip memory

Memory Access Locality

• Empirical observation: programs tend to reuse data and instructions they have used recently
• This observation can be exploited to
  • Improve performance
  • Optimise the use of more/less expensive memory
• Types of locality
  • Temporal locality: recently accessed items are likely to be accessed in the near future
  • Spatial locality: items whose addresses are near one another tend to be referenced close together in time

Memory Access Locality: Example

double a[N][N], b[N][N];

for (i = 0; i < N; i++)
    for (j = 0; j < N-1; j++)
        a[i][j] = b[j][0] + a[i][j+1];

• Assume the right-most index is the fastest running index (row-major layout)
• Temporal locality: b[j][0] is reused across iterations of i
• Spatial locality: a[i][j] and a[i][j+1] (j + 1 < N) are adjacent in memory

Memory: Motivation

• Observations
  • DRAM based memories are large but slow
    • In particular, fetching small amounts of data is inefficient
  • SRAM based memories are small but fast
  • Typical applications exhibit significant data locality
• Exploit these observations by introducing caches
  • Cache = fast memory unit that allows data to be stored for quick access in the near future
• A cache is transparent to the application
  • Application programs do not require explicit loads/stores to/from the cache
  • Alternative concept: software-managed scratch-pad memory
• Caches are typically hardware managed

[Figure: CPU, cache and memory hierarchy]

Memory/Storage Hierarchy

Computers provide multiple levels of volatile and non-volatile storage
• Important because of the processor-memory performance gap
• Close to the processor: very fast memory, but less capacity

  Level       Access time   Capacity
  Registers   O(0.1) ns     O(1) kByte
  L1 cache    O(1) ns       O(10-100) kByte
  L2 cache    O(10) ns      O(100) kByte
  L3 cache    O(10) ns      O(1-10) MByte
  Memory      O(100) ns     O(1-100) GByte
  Disk        O(10) ms      O(100-...) GByte

Memory: Cache

Cache miss = word not found in cache
• Word has to be fetched from the next memory level
• Typically a full cache line (= multiple words) is fetched

Cache organisation: set associative cache
• Match a line onto a set and then place the line within the set
• Choice of set address: (address) mod (number of sets)
• Types:
  • Direct-mapped cache: 1 cache line per set
    • Each line has only one place it can appear in the cache
  • Fully associative cache: 1 set only
    • A line can be placed anywhere in the cache
  • n-way set associative cache: n cache lines per set
    • A line can be placed in a restricted set of places (1 out of n different places) in the cache
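The set mapping can be written out directly. A small sketch; the 64-Byte line size and set counts in the comment are assumed example values, not figures from the slides:

```c
#include <assert.h>
#include <stdint.h>

/* Set selection for a set associative cache:
 * set = (line address) mod (number of sets),
 * where the line address is the byte address divided by the line
 * size.  E.g. a 32 kiByte cache with 64-Byte lines has 512 sets if
 * direct-mapped, or 64 sets if 8-way set associative. */
static unsigned cache_set(uint64_t addr, unsigned line_size,
                          unsigned num_sets) {
    return (unsigned)((addr / line_size) % num_sets);
}
```

With 512 sets and 64-Byte lines, addresses 0 and 64·512 = 32768 map to the same set, which is how conflict misses arise.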

Memory: Cache (2)

Direct-mapped vs. 2-way set associative cache:

[Figure: a direct-mapped cache (one line per set) compared with a 2-way set associative cache (the same line can reside in way #0 or way #1)]

Memory: Cache (3)

Replacement policies
• Random:
  • Randomly select any possible line to evict
• Least-recently used (LRU):
  • Assume that a line which has not been used for some time is not going to be used in the near future
  • Requires keeping track of all read accesses
• First in, first out (FIFO):
  • Bookkeeping simpler to implement than in the case of LRU
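The LRU bookkeeping can be sketched as follows: each way of a set records a timestamp of its last access, and the way with the oldest timestamp is evicted. This is an illustration of the policy only; real hardware typically uses cheaper approximations (e.g. pseudo-LRU):

```c
#include <assert.h>

/* LRU victim selection within one cache set: given the last-access
 * timestamps of all ways, pick the way with the oldest one. */
static int lru_victim(const unsigned long *last_used, int nways) {
    int victim = 0;
    for (int w = 1; w < nways; w++)
        if (last_used[w] < last_used[victim])
            victim = w;
    return victim;
}
```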

Cache Write Policies

Issue: How to keep cache and memory consistent?
• Cache write policy design options
  • Write-through cache
    • Update cache and memory
    • Memory write causes significant delay if the pipeline must stall
  • Write-back cache
    • Update cache only and postpone the update of memory ⇒ memory and cache become inconsistent
    • If a cache line is dirty then it must be committed before being replaced
    • Add a dirty bit to manage this task

Issue: How to manage partial cache line updates?
• Typical solution: write-allocate (fetch-on-write)
  • Cache line is read before being modified

Cache misses

• Categories of cache misses
  • Compulsory: occurs when a cache line is accessed for the first time
  • Capacity: re-fetch of cache lines due to lack of cache capacity
  • Conflict: cache line was discarded due to lack of associativity
• Average memory access time =

hit time + miss rate · miss penalty

Hit time ... Time to hit in the cache Miss penalty ... Time to load cache line from memory
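The average memory access time formula can be written directly as code (all times in cycles; the example values in the comment are illustrative):

```c
#include <assert.h>

/* Average memory access time (AMAT):
 * AMAT = hit time + miss rate * miss penalty
 * E.g. a 1-cycle hit time, 25% miss rate and 100-cycle miss
 * penalty give an average of 26 cycles per access. */
static double amat(double hit_time, double miss_rate,
                   double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}
```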

Cache misses (2)

Strategies to reduce the miss rate:
• Larger cache line size
  • If spatial locality is high, other parts of the cache line are likely to be reused
• Bigger caches
  • Reduce capacity misses
• Higher associativity
  • Reduce conflict misses
  • Hit time may increase

Cache misses (3)

Strategies to reduce the miss penalty:
• Multilevel caches
  • Technical problem: making the cache larger reduces the miss rate but increases its access time
  • Problem mitigation: small, fast L1 cache backed by a larger but slower L2 cache
  • Average memory access time =

    Hit time(L1) + Miss rate(L1) · (Hit time(L2) + Miss rate(L2) · Miss penalty(L2))

• Giving priority to read misses over writes
  • Goal: serve reads before writes have completed
  • Implementation using write buffers ⇒ risk of data hazards
  • Pseudo-assembler example, assuming a direct-mapped cache in which memory offsets 512 and 1024 map to the same set:

      store R1, 512(R0)   ! Write value to memory
      load 1024(R0), R2   ! Load value from memory
      load 512(R0), R3    ! Load previously stored value?
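The two-level access time formula can be written as a one-line helper (all times in cycles; the example values in the test are illustrative): an L1 miss costs the L2 hit time plus, with probability Miss rate(L2), the L2 miss penalty.

```c
#include <assert.h>

/* Two-level average memory access time:
 * AMAT = Hit(L1) + MissRate(L1) * (Hit(L2) + MissRate(L2) * Penalty(L2)) */
static double amat2(double hit_l1, double mr_l1,
                    double hit_l2, double mr_l2, double penalty_l2) {
    return hit_l1 + mr_l1 * (hit_l2 + mr_l2 * penalty_l2);
}
```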

Cache: Inclusive vs. Exclusive

• Policies:
  • Inclusive cache = data present in one cache level is also present in the higher cache levels
    • Example: all data in the L1 cache is also present in the L2 cache
  • Exclusive cache = data is present in at most one cache level
  • Intermediate policies:
    • Data available at a higher cache level is allowed to be available at lower cache levels
• Advantages of an inclusive cache
  • Management of cache coherence may be simpler
• Advantages of an exclusive cache
  • Duplicate copies of data in several cache levels are avoided
• Example of an intermediate policy:
  • L3 cache used as a victim cache: the cache holds data evicted from L2, but not necessarily data loaded to L2

Re-Reference Timing Patterns

• Re-reference time = the time between two occurrences of references to the same item
• Example: on-line transaction processing application
• Re-reference timing patterns can be analysed using memory traces plus simulation
• The pattern only depends on the workload and the cache-line size
• Phenomenological observation [Hartstein et al., 2006]:

  R(t) = R0 · t^(−β)   with 1.3 ≲ β ≲ 1.7

Cache Miss Patterns

• Using simulators it is possible to investigate cache miss patterns for different cache configurations
• Example:
  • On-line transaction processing application
  • Analysis of different associativity settings [Hartstein et al., 2006]

Memory Prefetching

• Prefetching = technique to hide cache miss latencies
• Idea: load data before the program issues the load instruction
• Hardware data prefetching
  • Generic implementation:
    • Monitor the memory access pattern
    • Predict memory addresses to be accessed
    • Speculatively issue a prefetch request
  • Prefetching schemes:
    • Stream prefetching
    • Global History Buffer prefetching
    • List prefetching
  • Prefetching at different cache levels

Memory Prefetching (2)

• Software data prefetching: compiler or programmer inserts prefetching instructions
• Risk of negative performance impact
  • Memory bus contention
  • Cache pollution
• Metrics to characterize prefetching algorithms
  • Coverage: number of misses that can be prefetched
  • Timeliness: distance between the prefetch and the original miss
  • Accuracy: probability that a prefetched line is used before being replaced

Cache Coherence

• Cache coherence problem
  • Processors/cores view memory through their individual caches
  • Different processors could observe different values for the same memory location

• Example with 2 processors P1 and P2:

  Time   Event             Cache P1   Cache P2   Memory location A
  0                                              123
  1      P1 reads from A   123                   123
  2      P2 reads from A   123        123        123
  3      P1 writes to A    567        123        567

• The value in processor P2's cache will only be updated after the cache line has been evicted ⇒ need for a cache-coherence mechanism

Cache Coherence (2)

Requirements to achieve coherency

• Program order preservation: If a read by processor P1 from memory location A follows a write by P1 to the same location, and no other processor wrote to A in between, then the read returns the value written by P1

• Coherent memory view: If a read by processor P1 from A follows a write by processor P2 to A, then the new value is returned, provided both events are sufficiently separated in time

• Serialized writes: 2 writes to location A by one processor are seen in the same order by all processors

Cache Coherence (3)

• Classes of cache coherence protocols
  • Directory based: the sharing status of a block of memory is kept in one central location (the directory)
  • Snooping:
    • Protocols rely on the existence of a shared broadcast medium
    • Each cache keeps track of the status of its cache lines
    • All caches listen to broadcast snoop messages and act when needed
    • Usually, on a write all other copies are invalidated (write invalidate protocol)
• Pros/cons
  • Snoop protocols are relatively easy to implement
  • Directory based protocols have the potential to scale to larger processor counts

Cache Coherence: MSI Protocol

Each cache line can be in one of the following states:
• Modified
  • Cache line contains the current value
  • The copies in memory and in any other cache are out of date
• Shared
  • Cache line contains the current value
  • The copies in memory and in any other cache are clean
• Invalid
  • Cache line does not contain valid data

Permitted combinations of states of a particular cache line in 2 caches:

         M     S     I
  M      no    no    yes
  S      no    yes   yes
  I      yes   yes   yes

Cache Coherence: MSI Protocol (2)

• Processor operations
  • Read (PrRd)
  • Write (PrWr)
• Operations on the broadcast medium
  • Bus Read (BusRd)
    • Triggered by a PrRd to a cache line that is not valid, i.e. not in S or M
  • Bus Read Exclusive (BusRdEx)
    • Triggered by a PrWr to a cache line which is not in M state
    • All other copies are marked invalid
  • Write-Back (BusWr)
    • Cache controller writes a modified cache line back to memory

[Figure: processor issuing PrRd/PrWr to the cache controller, which issues BusRd, BusRdEx and BusWr on the bus]

Cache Coherence: MSI Protocol (3)

State transitions (processor events may trigger bus operations; observed bus events in brackets):

  I → S : PrRd (issues BusRd)
  I → M : PrWr (issues BusRdEx)
  S → S : PrRd
  S → M : PrWr (issues BusRdEx)
  S → I : [BusRdEx]
  M → M : PrRd or PrWr
  M → S : [BusRd]
  M → I : [BusRdEx]

When leaving the M state the cache controller puts the modified cache line on the bus
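The MSI state transitions can be expressed as a small state machine. This is a minimal sketch for a single cache line in one cache; the type and function names are illustrative, and the write-back when leaving M is noted only in comments:

```c
#include <assert.h>

/* MSI state machine for one cache line: the state is updated on
 * local processor events (PrRd/PrWr) and on observed bus events
 * (BusRd/BusRdEx). */
typedef enum { INVALID, SHARED, MODIFIED } msi_state_t;
typedef enum { PR_RD, PR_WR, BUS_RD, BUS_RDX } msi_event_t;

static msi_state_t msi_next(msi_state_t s, msi_event_t e) {
    switch (s) {
    case INVALID:
        if (e == PR_RD) return SHARED;     /* issue BusRd       */
        if (e == PR_WR) return MODIFIED;   /* issue BusRdEx     */
        return INVALID;
    case SHARED:
        if (e == PR_WR)   return MODIFIED; /* issue BusRdEx     */
        if (e == BUS_RDX) return INVALID;
        return SHARED;                     /* PrRd, BusRd       */
    case MODIFIED:
        if (e == BUS_RD)  return SHARED;   /* write back line   */
        if (e == BUS_RDX) return INVALID;  /* write back line   */
        return MODIFIED;                   /* PrRd, PrWr        */
    }
    return INVALID;
}
```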

Content

Introduction

Principles of Computer Architectures

Processor Core Architecture

Memory Architecture

Processor Architectures

Network Architecture

Exascale Challenges

Intel Xeon E5-v3 (Haswell)

• Introduced in 2014 using 22nm technology
• x86 ISA + SSE + AVX2
• Up to 18 cores, frequency 1.6-3.7 GHz
• Cache parameters
  • L1-D/L1-I: 32 kiByte/core, private, 8-way associative
  • L2: 256 kiByte/core, private, 8-way associative
  • L3: 2.5 MiByte/core, shared
  • Cache line size: 64 Byte
• 4× DDR4 channels
• Performance figures
  • 8 DP FMA/cycle
  • L1: 2×32 Byte load / 1×32 Byte store per cycle
  • L2: 64 Byte load/store per cycle
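The peak floating-point throughput follows from these figures: flops per cycle = FMAs per cycle × 2 (each FMA counts as a multiply plus an add), times cores, times frequency. A sketch; the 2.5 GHz operating point in the example is an assumed value within the stated frequency range:

```c
#include <assert.h>

/* Peak double-precision GFlop/s: each FMA counts as 2 flops. */
static double peak_dp_gflops(int cores, int fma_per_cycle,
                             double freq_ghz) {
    return cores * fma_per_cycle * 2.0 * freq_ghz;
}
```

For example, 18 cores × 8 FMA/cycle × 2 × 2.5 GHz = 720 GFlop/s.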

Intel Xeon E5-v3: Core Pipeline Functionality

[Figure: core pipeline block diagram; Intel, 2015]

Intel Xeon E5-v3: Architecture with 12 Cores

[Figure: 12-core chip architecture diagram; Intel, 2015]

IBM POWER8

• Introduced in 2014 using 22nm technology
• POWER v2.07 ISA
• Up to 12 cores, frequency 2.5-5.0 GHz
• Cache parameters
  • L1-D/L1-I: 64/32 kiByte/core, private, 8-way associative
  • L2: 512 kiByte/core, private, 8-way associative
  • L3: 8 MiByte/core, shared
  • Cache line size: 128 Byte
• 8× memory controllers
• Performance figures
  • 4 DP FMA/cycle
  • L1: 4×16 Byte/cycle load / 16 Byte/cycle store
  • L2: 64 Byte/cycle load / 16 Byte/cycle store

IBM POWER8 (cont.)

[Figure: POWER8 chip diagram; IBM, 2015]

IBM POWER8 (cont.)

[Figure: POWER8 memory buffer chip; IBM, 2014]

Memory buffer features:
• 4 DRAM interfaces, 16 MiByte L4 buffer cache
• 2+1 Byte link to+from processor, 9.6 Gbit/s data rate

Applied Micro X-Gene 1

• Introduced in 2014 using 40nm technology
• ARMv8 ISA
• 8 cores, frequency 1.6-2.4 GHz
• Cache parameters
  • L1-D/L1-I: 32/32 kiByte/core, private, 8-way associative
  • L2: 256 kiByte/PMD (processor module of 2 cores), shared, 8-way associative
  • L3: 8 MiByte, shared
  • Cache line size: 128 Byte
• Up to 4 DDR memory channels
• Performance figures
  • 2 DP FMA/cycle

Applied Micro X-Gene 1: Architecture
