Introduction Principles Core Memory Processor Network Exascale
Computer Architectures
D. Pleiter
Jülich Supercomputing Centre and University of Regensburg
January 2016
Overview
Introduction
Principles of Computer Architectures
Processor Core Architecture
Memory Architecture
Processor Architectures
Network Architecture
Exascale Challenges
Content
Introduction
Principles of Computer Architectures
Processor Core Architecture
Memory Architecture
Processor Architectures
Network Architecture
Exascale Challenges
Memory in Modern Node Architectures
• Today’s compute nodes: different memory types and technologies
  • Volatile memories
    • Main memory (DDR3 or DDR4)
    • Caches (SRAM)
    • Accelerator memory (GDDR5 or HBM)
  • Non-volatile memories
    • SSD (NAND flash)
• Different capabilities
  • Bandwidth
  • Latency for single transfer
  • Access time
  • Cycle time
  • Capacity

[Figure: node diagram with cores (L1-I/L1-D, L2, L3 slice each), memory controller (MC) and IO, attached DDR memory, SSD, and GPU with GDDR/HBM]
Memory: DRAM
• Data bit storage cell = 1 capacitor + 1 transistor
• Capacitors discharge ⇒ periodic refresh needed
  • Refresh required in intervals of O(10) ms
  • Possibly significant performance impact
  • Memory access latencies become variable
• Memory organisation
  • Memory arrays with separate row and column addresses ⇒ fewer address pins
  • Read operation: row address strobe (/ras) latches the row address, then column address strobe (/cas) latches the column address; the data appears on the data pins

[Figure: read-cycle timing diagram with /ras, /cas, addr (row, col), /we, and data signals]
Memory: DRAM (2)
• Technology improvements
  • Synchronization with system bus: Synchronous dynamic random-access memory (SDRAM)
  • Double Data Rate (DDR): data transferred on both clock edges
• DRAM typically packaged on small boards: DIMM = Dual Inline Memory Module
• Typical DRAM channel width: 8 Byte
• Performance range for selected set of data rates:
                              DDR3        DDR4
  Data rate [Mbps]            400−2333    1600−3200
  Bandwidth [GByte/s/DIMM]    3.2−18.6    12.8−25.6
DRAM Organisation
Organisation of DRAM memory connected to processor:
• Multiple memory channels per processor
• Multiple memory ranks attached to a single channel
  • Chip-select signals enable individual ranks
• Multiple memory banks per rank
• Memory cells arranged in rows and columns

Physical address:

  | ch ID | rank ID | bank ID | row | col |

[Figure: CPU with 64-bit memory channels connecting to ranks, banks, and row/column-addressed cell arrays]
DRAM Organisation (cont.)
Typical design as of 2014:
• 4 Gbit memory die with 8-bit memory cells
  • 8 banks
  • 64k rows
  • 1k columns
• 8 memory dies per channel
• Channel data bus width of 64 bit
• 4 ranks per channel
• 4 channels

Total memory capacity: 64 GiByte
Memory: SRAM
• Data bit storage cell = (4 + 2) transistors
• Comparison with DRAM:

                      DRAM      SRAM
  Refresh required?   yes       no
  Speed               slower    faster
  Density             higher    lower

• Typically used for on-chip memory
Memory Access Locality
• Empirical observation: programs tend to reuse data and instructions they have used recently
• This observation can be exploited to
  • Improve performance
  • Optimize use of more/less expensive memory
• Types of locality
  • Temporal locality: recently accessed items are likely to be accessed in the near future
  • Spatial locality: items whose addresses are near one another tend to be referenced close together in time
Memory Access Locality: Example
  double a[N][N], b[N][N];

  for (i = 0; i < N; i++)
    for (j = 0; j < N-1; j++)
      a[i][j] = a[i][j+1] + b[j][0];

• Assume right-most index being fastest running index
• Temporal locality: b[j][0]
• Spatial locality: a[i][j] and a[i][j+1] (j + 1 < N)

Memory Cache: Motivation

• Observations
  • DRAM based memories are large but slow
    • In particular, fetching small amounts of data is inefficient
  • SRAM based memories are small but fast
  • Typical applications exhibit significant data locality
• Exploit these observations by introducing caches
  • Cache = fast memory unit that allows data to be stored for quick access in the near future
  • A cache is transparent to the application
    • Application programs do not require explicit load/store to/from cache
    • Alternative concept: software managed scratch-pad memory
  • Caches are typically hardware managed

[Figure: CPU – Cache – Memory hierarchy]

Memory/storage Hierarchy

Computers provide multiple levels of volatile memory and non-volatile storage
• Important because of processor-memory performance gap
• Close to processor: very fast memory, but less capacity

  Level       Access time   Capacity
  Registers   O(0.1) ns     O(1) kByte
  L1 cache    O(1) ns       O(10-100) kByte
  L2 cache    O(10) ns      O(100) kByte
  L3 cache    O(10) ns      O(1-10) MByte
  Memory      O(100) ns     O(1-100) GByte
  Disk        O(10) ms      O(100-...) GByte

Memory: Cache

Cache miss = word not found in cache
• Word has to be fetched from next memory level
• Typically a full cache line (= multiple words) is fetched

Cache organisation: set associative cache
• Match line onto a set and then place line within the set
• Choice of set address: (address) mod (number of sets)
• Types:
  • Direct-mapped cache: 1 cache line per set
    • Each line has only one place it can appear in the cache
  • Fully associative cache: 1 set only
    • A line can be placed anywhere in the cache
  • n-way set associative cache: n cache lines per set
    • A line can be placed in a restricted set of places (1 out of n different places) in the cache

Memory: Cache (2)

Direct-mapped vs. 2-way set associative cache:

[Figure: direct-mapped cache (one line per set) vs. 2-way set associative cache (two ways per set)]

Memory: Cache (3)

Replacement policies
• Random:
  • Randomly select any possible line to evict
• Least-recently used (LRU):
  • Assume that a line which has not been used for some time is not going to be used in the near future
  • Requires keeping track of all read accesses
• First in, first out (FIFO):
  • Bookkeeping simpler to implement than in case of LRU

Cache Write Policies

Issue: How to keep cache and memory consistent?
• Cache write policy design options
  • Write-through cache
    • Update cache and memory
    • Memory write causes significant delay if pipeline must stall
  • Write-back cache
    • Update cache only and postpone update of memory ⇒ memory and cache will become inconsistent
    • If a cache line is dirty then it must be committed before being replaced
    • Add dirty bit to manage this task

Issue: How to manage partial cache line updates?
• Typical solution: write-allocate (fetch-on-write)
  • Cache line is read before being modified

Cache misses

• Categories of cache misses
  • Compulsory: occurs when a cache line is accessed for the first time
  • Capacity: re-fetch of cache lines due to lack of cache capacity
  • Conflict: cache line was discarded due to lack of associativity
• Average memory access time = hit time + miss rate · miss penalty
  • Hit time ... time to hit in the cache
  • Miss penalty ... time to load cache line from memory

Cache misses (2)

Strategies to reduce miss rate:
• Larger cache line size
  • If spatial locality is large then other parts of the cache line are likely to be re-used
• Bigger caches
  • Reduce capacity misses
• Higher associativity
  • Reduce conflict misses
  • Hit time may increase

Cache misses (3)

Strategies to reduce miss penalty:
• Multilevel caches
  • Technical problem: larger cache means higher miss penalty
  • Problem mitigation: larger but slower L2 cache:

    Average memory access time =
      Hit time_L1 + Miss rate_L1 · (Hit time_L2 + Miss rate_L2 · Miss penalty_L2)

• Giving priority to read misses over writes
  • Goal: serve reads before writes have completed
  • Implementation using write buffers ⇒ risk of data hazards
  • Pseudo-assembler example assuming a direct-mapped cache and memory offsets 512 and 1024 mapped to the same set:

      store R1, 512(R0)   ! Write value to memory
      load 1024(R0), R2   ! Load value from memory
      load 512(R0), R3    ! Load previously stored value?

Cache: Inclusive vs. Exclusive

• Policies:
  • Inclusive cache = data present in one cache level is also present in higher cache levels
    • Example: all data in the L1 cache is also present in the L2 cache
  • Exclusive cache = data present in at most one cache level
  • Intermediate policies:
    • Data available at a higher cache level is allowed to be available at lower cache levels
• Advantages of an inclusive cache
  • Management of cache coherence may be simpler
• Advantages of an exclusive cache
  • Duplicate copies of data in several cache levels are avoided
• Example for an intermediate policy:
  • L3 cache used as a victim cache: the cache holds data evicted from L2, but not necessarily data loaded to L2

Re-Reference Timing Patterns

• Re-reference time is the time between two occurrences of references to the same item
• Re-reference timing patterns can be analysed using memory traces plus simulation
• The pattern only depends on the work-load and the cache-line size
• Phenomenological observation [Hartstein et al., 2006]:

    R(t) = R0 · t^(−β)   with 1.3 ≲ β ≲ 1.7

[Figure: re-reference timing pattern of an on-line transaction processing application]

Cache Miss Patterns

• Using simulators it is possible to investigate cache miss patterns for different cache configurations
• Example:
  • On-line transaction processing application
  • Analysis of different associativity settings [Hartstein et al., 2006]

Memory Prefetching

• Prefetching = technique to hide cache miss latencies
• Idea: load data before the program issues a load instruction
• Hardware data prefetching
  • Generic implementation
    • Monitor memory access pattern
    • Predict memory addresses to be accessed
    • Speculatively issue a prefetch request
  • Prefetching schemes
    • Stream prefetching
    • Global History Buffer prefetching
    • List prefetching
  • Prefetching at different cache levels

Memory Prefetching (2)

• Software data prefetching: compiler or programmer inserts prefetch instructions
• Risk of negative performance impact
  • Memory bus contention
  • Cache pollution
• Metrics to characterize prefetching algorithms
  • Coverage: number of misses that can be prefetched
  • Timeliness: distance between prefetch and original miss
  • Accuracy: probability that a prefetched line is used before being replaced

Cache Coherence

• Cache coherence problem
  • Processors/cores view memory through individual caches
  • Different processors could have different values for the same memory location
• Example with 2 processors P1 and P2:

  Time  Event            Cache P1  Cache P2  Memory location A
  0                                          123
  1     P1 reads from A  123                 123
  2     P2 reads from A  123       123       123
  3     P1 writes to A   567       123       567

• Value in processor P2 will only be updated after the cache line was evicted ⇒ need for a cache-coherence mechanism

Cache Coherence (2)

Requirements to achieve coherency:
• Program order preservation: if a read by processor P1 from memory location A follows a write by P1 to the same location, the read returns the new value unless another processor performed a write operation in between
• Coherent memory view: if a read by processor P1 from A follows a write by processor P2 to A then the new value is returned if both events are sufficiently separated in time
• Serialized writes: 2 writes to location A by one processor are seen in the same order by all processors

Cache Coherence (3)

• Classes of cache coherence protocols
  • Directory based: the sharing status of a block of memory is kept in one central location (directory)
  • Snooping:
    • Protocols rely on the existence of a shared broadcast medium
    • Each cache keeps track of cache line status
    • All caches listen to broadcasted snoop messages and act when needed
    • Usually other copies are invalidated on a write (write-invalidate protocol)
• Pros/cons
  • Snoop protocols are relatively easy to implement
  • Directory based protocols have the potential for scaling to larger processor counts

Cache Coherence: MSI Protocol

Each cache line can be in one of the following states:
• Modified
  • Cache line contains the current value
  • All copies in memory and any other cache are dirty
• Shared
  • Cache line contains the current value
  • All copies in memory and any other cache are clean
• Invalid
  • Cache line does not contain a valid value

[Figure: permitted combinations of states of a particular cache line in 2 caches: M pairs only with I; S pairs with S or I; I pairs with any state]

Cache Coherence: MSI Protocol (2)

• Processor operations
  • Read (PrRd)
  • Write (PrWr)
• Operations on the broadcast medium
  • Bus Read (BusRd)
    • Triggered by PrRd to a not valid cache line, i.e. not in S or M
  • Bus Read Exclusive (BusRdEx)
    • Triggered by PrWr to a cache line which is not in M state
    • All other copies are marked invalid
  • Write-Back (BusWr)
    • Cache controller writes a modified cache line back to memory

[Figure: processor and cache controller attached to the broadcast medium, exchanging BusRd, BusRdEx, and BusWr]

Cache Coherence: MSI Protocol (3)

[Figure: MSI state-transition diagram: I →(PrRd + BusRd)→ S, I →(PrWr + BusRdEx)→ M, S →(PrWr + BusRdEx)→ M, S →(observed BusRdEx)→ I, M →(observed BusRd)→ S, M →(observed BusRdEx)→ I; PrRd or PrWr keeps M, PrRd keeps S]

When leaving the M state the cache controller puts the modified cache line on the bus

Content

Introduction
Principles of Computer Architectures
Processor Core Architecture
Memory Architecture
Processor Architectures
Network Architecture
Exascale Challenges

Intel Xeon E5-v3 (Haswell)

• Introduced in 2014 using 22nm technology
• x86 ISA + SSE + AVX2
• Up to 18 cores, frequency 1.6-3.7 GHz
• Cache parameters
  • L1-D/L1-I: 32 kiByte/core, private, 8-way associative
  • L2: 256 kiByte/core, private, 8-way associative
  • L3: 2.5 MiByte/core, shared
  • Cache line size 64 Byte
• 4× DDR4 channels
• Performance figures
  • 8 DP FMA/cycle
  • L1: 2×32 Byte load / 1×32 Byte store
  • L2: 64 Byte load/store

Intel Xeon E5-v3: Core Pipeline Functionality

[Figure: core pipeline diagram, from [Intel, 2015]]

Intel Xeon E5-v3: Architecture with 12 Cores

[Figure: 12-core die layout, from [Intel, 2015]]

IBM POWER8

• Introduced in 2014 using 22nm technology
• POWER v2.07 ISA
• Up to 12 cores, frequency 2.5-5.0 GHz
• Cache parameters
  • L1-D/L1-I: 64/32 kiByte/core, private, 8-way associative
  • L2: 512 kiByte/core, private, 8-way associative
  • L3: 8 MiByte/core, shared
  • Cache line size 128 Byte
• 8× memory controllers
• Performance figures
  • 4 DP FMA/cycle
  • L1: 4×16 Byte/cycle load / 16 Byte store
  • L2: 64 Byte/cycle load / 16 Byte store

IBM POWER8 (cont.)

[Figure: POWER8 die layout, from [IBM, 2015]]

IBM POWER8 (cont.)

Memory buffer features [IBM, 2014]:
• 4 DRAM interfaces, 16 MiByte L4 buffer cache
• 2+1 Byte link to+from processor, 9.6 Gbit/s data rate

Applied Micro X-Gene 1

• Introduced in 2014 using 40nm technology
• ARMv8 ISA
• 8 cores, frequency 1.6-2.4 GHz
• Cache parameters
  • L1-D/L1-I: 32/32 kiByte/core, private, 8-way associative
  • L2: 256 kiByte/PMD, shared, 8-way associative
  • L3: 8 MiByte, shared
  • Cache line size 128 Byte
• Up to 4 DDR memory channels
• Performance figures
  • 2 DP FMA/cycle

Applied Micro X-Gene 1: Architecture

[Figure: X-Gene 1 architecture diagram]