Chapter 2: Memory Hierarchy Design

Introduction

● Goal: an unlimited amount of memory with low latency
● Fast memory technology is more expensive per bit than slower memory
  – Use the principle of locality (spatial and temporal)
● Solution: organize the memory system into a hierarchy
  – Entire addressable memory space available in the largest, slowest memory
  – Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor
● Temporal and spatial locality ensure that nearly all references can be found in the smaller memories
  – Gives the illusion of a large, fast memory being presented to the processor

Memory Hierarchies

              Server                       Personal Mobile
  Registers   1000 B,   300 ps             500 B,      500 ps        (on-die)
  L1 cache    64 kB,    1 ns               64 kB,      2 ns          (on-die)
  L2 cache    256 kB,   3-10 ns            256 kB,     10-20 ns      (on-die)
  L3 cache    2-4 MB,   10-20 ns           (none)                    (in package)
  Memory      4-16 GB,  50-100 ns          256-512 MB, 50-100 ns     (off-chip)
  Storage     Disk, 4-16 TB, 5-10 ms       Flash, 4-64 GB, 25-50 µs  (storage)

CPU vs. Memory: Performance vs. Latency

Design Considerations

● Memory hierarchy design becomes more crucial with recent multi-core processors:
  – Aggregate peak bandwidth grows with # cores:
    ● Intel Core i7 can generate two references per core per clock
    ● Four cores and a 3.2 GHz clock
      – 25.6 billion 64-bit data references/second
      – + 12.8 billion 128-bit instruction references
      – = 409.6 GB/s!
    ● DRAM bandwidth is only 6% of this (25 GB/s)
  – Requires:
    ● Multi-port, pipelined caches
    ● Two levels of cache per core
    ● Shared third-level cache on chip

Performance and Power for Caches

● High-end microprocessors have >10 MB of on-chip cache
  – Consumes a large amount of the area and power budget
  – Both static (idle) and dynamic power is an issue
● Personal/mobile devices have a 20-50x smaller power budget
  – 25%-50% of it is consumed by memory
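To make the Design Considerations arithmetic concrete, here is a minimal sketch in C that recomputes the Core i7 peak demand; the core count, clock rate, and reference widths come from the slide, while the variable names and the small program around them are only illustrative.

    #include <stdio.h>

    int main(void) {
        /* Figures from the "Design Considerations" slide (Intel Core i7 example). */
        double cores      = 4;
        double clock_hz   = 3.2e9;  /* 3.2 GHz                                      */
        double data_refs  = 2;      /* 64-bit data references per core per clock    */
        double instr_refs = 1;      /* 128-bit instruction reference per core/clock */

        double data_bw  = cores * clock_hz * data_refs  *  8;  /* 64 bits  = 8 bytes  */
        double instr_bw = cores * clock_hz * instr_refs * 16;  /* 128 bits = 16 bytes */

        printf("data:  %.1f GB/s\n", data_bw  / 1e9);             /* 204.8 GB/s */
        printf("instr: %.1f GB/s\n", instr_bw / 1e9);             /* 204.8 GB/s */
        printf("total: %.1f GB/s\n", (data_bw + instr_bw) / 1e9); /* 409.6 GB/s */
        return 0;
    }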

Handling of Cache Misses

● When a word is not found in the cache, a miss occurs:
  – Fetch the word from a lower level in the hierarchy, requiring a higher latency reference
  – The lower level may be another cache or the main memory
  – Also fetch the other words contained within the block
    ● Takes advantage of spatial locality
  – Place the block into the cache in any location within its set, determined by the address
    ● block address MODULO number of sets

Cache Associativity and Writing Strategies

Figure: the address is split into a tag and set field (upper bits, 64 … 7) and the location within the block (bits 6 … 0), which index the cache.

● n sets → n-way set associative
  – Direct-mapped cache → one block per set
  – Fully associative → one set
● Writing to cache: two strategies
  – Write-through
    ● Immediately update lower levels of the hierarchy
  – Write-back
    ● Only update lower levels of the hierarchy when an updated block is replaced
  – Both strategies use a write buffer to make writes asynchronous

Miss Rate and Types of Cache Misses

● Miss rate is...
  – Fraction of cache accesses that result in a miss
● Reasons for cache misses and their names
  – Compulsory
    ● Cache block is referenced for the first time
    ● Solution: hardware and software prefetchers
  – Capacity
    ● Cache block was discarded and later retrieved
    ● Solution: build a bigger cache or reorganize the code
  – Conflict
    ● Program makes repeated references to multiple addresses from different blocks that map to the same location in the cache
    ● Solution: add padding or stride to the code; change the size of the data
  – Coherency

Calculating Miss Rate

Misses per instruction = (Miss rate × Memory accesses) / Instruction count
                       = Miss rate × Memory accesses per instruction

Average memory access time = Hit time + Miss rate × Miss penalty

● Note that speculative and multithreaded processors may execute other instructions during a miss
  – This reduces the performance impact of misses
  – But complicates analysis and introduces runtime dependence

Cache and TLB Mapping Illustrated

Figure: the virtual address space (stack, heap, hardware registers, OS kernel) is translated by the TLB into the physical address space; the physical address is split into tag, set, and block fields that index the cache sets.

Cache Design Considerations

● Larger block size
  – Reduced compulsory misses
  – Slightly reduced static power (smaller tag)
  – Sometimes increased capacity and conflict misses
● Bigger cache
  – Reduced miss rate
  – Increased hit time
  – Increased static & dynamic power
● Higher associativity
  – Reduced conflict miss rate
  – Increased hit time
  – Increased power
● More levels of cache
  – Reduced miss penalty
  – Access time = Hit time(L1) + Miss rate(L1) × (Hit time(L2) + Miss rate(L2) × Miss penalty(L2))
● Prioritize read misses over write misses
  – Reduced miss penalty
  – Need for a write buffer and hazard resolution
● Avoid address translation during indexing of the cache
  – Reduced hit time
  – Limited size and structure of the cache
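A minimal sketch in C of the formulas on these slides: the set-index computation, the single-level average memory access time, and the two-level version from the "More levels of cache" bullet. The function names and the example numbers are mine, not from the slides.

    #include <stdio.h>

    /* Set index chosen by "block address MODULO number of sets". */
    unsigned set_index(unsigned long block_address, unsigned num_sets) {
        return (unsigned)(block_address % num_sets);
    }

    /* Average memory access time for a single cache level. */
    double amat(double hit_time, double miss_rate, double miss_penalty) {
        return hit_time + miss_rate * miss_penalty;
    }

    /* Two-level version: the L1 miss penalty is the access time of L2. */
    double amat2(double hit_l1, double mr_l1,
                 double hit_l2, double mr_l2, double penalty_l2) {
        return hit_l1 + mr_l1 * (hit_l2 + mr_l2 * penalty_l2);
    }

    int main(void) {
        /* Example numbers (in ns), chosen only for illustration. */
        printf("one level : %.2f ns\n", amat(1.0, 0.05, 100.0));             /* 6.00 ns */
        printf("two levels: %.2f ns\n", amat2(1.0, 0.05, 10.0, 0.2, 100.0)); /* 2.50 ns */
        printf("block 1234 with 128 sets -> set %u\n", set_index(1234, 128));
        return 0;
    }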

Categories of Metrics for Cache Optimization

● Reduce hit time
  – Small and simple first-level caches
  – “way-prediction”
  – Side effect: reduction in power consumption
● Increase cache bandwidth
  – Pipelined caches
  – Multibanked caches
  – Nonblocking caches
● Reduce miss penalty
  – “Critical word first”
  – Merging write buffers
● Reduce miss rate
  – Compiler optimization
  – Side effect: reduced power
● Reduce miss penalty and/or rate via parallelism
  – Hardware prefetching
  – Software and compiler prefetching
  – Side effect: increased power due to unused data

Optimization 1: Small/Simple 1st-level Caches

● The 1st-level cache should match the clock cycle of the CPU
● Cache addressing is a three-step process
  – Address the tag memory with the index portion of the address
  – Compare the found tag with the address tag
  – Choose the cache set
● High associativity helps with...
  – Address aliasing
  – Dealing with TLB and multiprocessing conflicts
● Low associativity...
  – Is faster
    ● Can overlap the tag check with data transmission
    ● 10% or more for each doubling of the set count
  – Consumes less power

Optimization 2: “way prediction”

● Idea: predict “the way”
  – which block within the set will be accessed next
● Index multiplexor (mux) starts working early
● Implemented as extra bits kept in the cache for each block
● Prediction accuracy (simulated)
  – more effective for instruction caches
  – > 90% for two-way associative
  – > 80% for four-way associative
● On misprediction
  – Try the other block
  – Change the prediction bits
  – Incur a penalty (commonly 1 cycle for slow CPUs)
● Examples: first used in the MIPS R10000 in the 1990s; ARM Cortex-A8

Optimization 3: Pipelined Cache Access

● Improves bandwidth
● History
  – Pentium (1993): 1 cycle
  – Pentium Pro (1995): 2 cycles
  – Pentium III (1999): 2 cycles
  – Pentium 4 (2000): 4 cycles
  – Intel Core i7 (2010): 4 cycles
● Interaction with branch prediction
  – Increased penalty for mispredicted branches
● Load instructions take longer
  – Waiting for the cache pipeline to finish
● Pipelined cache cycles make high degrees of associativity easier to implement

Optimization 4: Nonblocking Caches

● If one instruction stalls on a cache miss, should the following instruction stall if its data is in the cache?
  – NO
● But you have to design a nonblocking cache (lockup-free cache)
  – Call it “hit under miss”
● Why stop at two instructions?
  – Make it “hit under multiple miss”
● What about two outstanding misses?
  – “miss under miss”
  – The next-level cache has to be able to handle multiple misses
    ● Rarely the case in contemporary CPUs
● How long before our caches become
  – Out-of-order, superscalar, …
  – Moving the CPU innovation into the memory hierarchy?

Optimization 5: Multibanked Caches

● Main memory has been organized in banks for increased bandwidth
● Caches can do this too
● Cache blocks are spread evenly across the banks
  – Sequential interleaving
    ● Bank 0: blk[0], Bank 1: blk[1], Bank 2: blk[2], Bank 3: blk[3]
    ● Bank 0: blk[4], Bank 1: blk[5], Bank 2: blk[6], Bank 3: blk[7]
● Modern use
  – ARM Cortex-A8
    ● 1-4 banks in the L2 cache
  – Intel Core i7
    ● 4 banks in L1 (2 memory accesses/cycle)
    ● 8 banks in L2
● Reduced power usage
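A small C sketch of the sequential interleaving shown above: the bank holding a block is just the block address modulo the number of banks. The four-bank figure matches the slide; the code itself is only illustrative.

    #include <stdio.h>

    #define NUM_BANKS 4  /* four-bank example from the slide */

    /* Sequential interleaving: consecutive block addresses rotate through the banks. */
    unsigned bank_of(unsigned long block_address) {
        return (unsigned)(block_address % NUM_BANKS);
    }

    int main(void) {
        for (unsigned long blk = 0; blk < 8; ++blk)
            printf("blk[%lu] -> bank %u\n", blk, bank_of(blk));
        /* Prints blk[0]->bank 0 ... blk[3]->bank 3, then blk[4]->bank 0 again. */
        return 0;
    }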

Optimization 6: Critical Word First, Early Restart

● Forget cache blocks (lines) and deal with words
  – Start executing an instruction when its word arrives, not the entire block where the word resides
● No hardware changes required for the compiler techniques that follow; these two are hardware techniques for the miss itself
● Critical word first
  – What if the instruction needs the last word in the cache block?
    ● Go ahead and request that word from memory before requesting the others
  – Out-of-order loading of the cache block's words
    ● As soon as the word arrives, pass it on to the CPU
    ● Continue fetching the remaining words
● Early restart
  – Don't change the order of words, but supply the missing word as soon as it arrives
    ● It won't help if the last word in the block is needed
● Useful for large cache blocks; the benefit depends on the data stream

Optimization 7: Merging Write Buffer (Intro)

● Write buffer basics
  – The write buffer sits between the cache and memory
  – The write buffer stores both the data and its address
  – The write buffer allows the store instruction to finish immediately
    ● Unless the write buffer is full
  – Especially useful for write-through caches
    ● Write-back caches will benefit when a block is replaced

Optimization 7: Merging Write Buffer

● Merging write buffer: a buffer that merges write requests
● When storing to a block that is already pending in the write buffer, only update the write buffer
● Another way to look at it: a write buffer with merging is equivalent to a larger buffer without merging
● The merging buffer reduces stalls due to the buffer being full
● Should not be used for I/O addresses (special memory locations)

  Before merging:
    Addr   V  Data     V  Data     V  Data     V  Data
    100    1  M[100]   0  -        0  -        0  -
    108    1  M[108]   0  -        0  -        0  -
    116    1  M[116]   0  -        0  -        0  -
    124    1  M[124]   0  -        0  -        0  -
  After merging:
    100    1  M[100]   1  M[108]   1  M[116]   1  M[124]

Optimization 8: Compiler Optimizations

● No hardware changes required
● Two main techniques
  – Loop interchange
    ● Requires 2 or more nested loops
    ● The order of the loops is changed to walk through memory in a more cache-friendly manner
  – Loop blocking
    ● Additional loop nests are introduced to deal with a small portion of an array (called a block, but also a tile)

Optimization 8: Loop Interchange

/* Before: memory stride = 100 */
for (j = 0; j < 100; ++j)
    for (i = 0; i < 5000; ++i)
        x[i][j] = 2 * x[i][j];

/* After: memory stride = 1
 * Uses all words in a single cache block */
for (i = 0; i < 5000; ++i)
    for (j = 0; j < 100; ++j)
        x[i][j] = 2 * x[i][j];

Optimization 8: Blocking

/* Before: memory stride: A -> 1, B -> N */
for (i = 0; i < N; ++i)
    for (j = 0; j < N; ++j) {
        x = 0;
        for (k = 0; k < N; ++k)   // dot product
            x += A[i][k] * B[k][j];
        C[i][j] = x;
    }

/* After: blocking factor = BF (renamed from B to avoid clashing with matrix B) */
for (jj = 0; jj < N; jj += BF)
    for (kk = 0; kk < N; kk += BF)
        for (i = 0; i < N; ++i)
            for (j = jj; j < min(jj + BF, N); ++j) {
                x = 0;
                for (k = kk; k < min(kk + BF, N); ++k)
                    x += A[i][k] * B[k][j];
                C[i][j] += x;   // accumulate the partial dot product for this block
            }
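The merging write buffer table above can be read as four one-word entries collapsing into a single four-word entry. Below is a minimal C sketch of that merging logic, assuming 8-byte words, four word slots per entry, a four-entry buffer, and entries based at the first address written (a simplification; real hardware usually merges within an aligned cache block). All names and sizes are illustrative, not from the slides.

    #include <stdint.h>
    #include <string.h>

    #define WORD_BYTES      8   /* assumed word size                */
    #define WORDS_PER_ENTRY 4   /* four data slots, as in the table */
    #define BUFFER_ENTRIES  4

    struct wb_entry {
        int      used;
        uint64_t base_addr;              /* address of the first word written */
        int      valid[WORDS_PER_ENTRY]; /* per-word valid bits               */
        uint64_t data[WORDS_PER_ENTRY];
    };

    static struct wb_entry wb[BUFFER_ENTRIES];

    /* Returns 0 on success, -1 if the buffer is full (the CPU would stall). */
    int write_buffer_store(uint64_t addr, uint64_t value) {
        /* Merge: if this address falls inside a pending entry, just add the word. */
        for (int i = 0; i < BUFFER_ENTRIES; ++i)
            if (wb[i].used &&
                addr >= wb[i].base_addr &&
                addr <  wb[i].base_addr + (uint64_t)WORDS_PER_ENTRY * WORD_BYTES) {
                unsigned slot = (unsigned)((addr - wb[i].base_addr) / WORD_BYTES);
                wb[i].valid[slot] = 1;
                wb[i].data[slot]  = value;
                return 0;
            }

        /* Otherwise allocate a fresh entry based at this address. */
        for (int i = 0; i < BUFFER_ENTRIES; ++i)
            if (!wb[i].used) {
                memset(&wb[i], 0, sizeof wb[i]);
                wb[i].used      = 1;
                wb[i].base_addr = addr;
                wb[i].valid[0]  = 1;
                wb[i].data[0]   = value;
                return 0;
            }

        return -1;  /* full: without merging this happens far more often */
    }

    int main(void) {
        /* The four stores from the table merge into one entry based at 100. */
        write_buffer_store(100, 0);
        write_buffer_store(108, 0);
        write_buffer_store(116, 0);
        write_buffer_store(124, 0);
        return 0;
    }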

Optimization 9: Hardware Prefetching

● Prefetch asks for data before it is needed
  – Reduction of latency (miss penalty)
● Overlap the fetch from memory with
  – Execution of instructions
● Principle of locality
  – On a miss: request not just one (cache) block but also the next block
● The prefetched data might not end up in the cache
  – It might go to a special prefetch buffer that is cheaper to access than memory
● Intel Core i7 can prefetch into both L1 and L2
  – Pairwise prefetching (get the current and the next sequential block)
  – May be turned off in the BIOS

Optimization 9: Hardware Prefetching (contd.)

● Hardware prefetching may waste bandwidth
  – When the prediction accuracy is poor and most of the prefetched data goes unused
● Aggressive and sophisticated prefetchers
  – Use extra silicon area
  – Consume power
  – Increase chip complexity
● Compilers may reorganize the code to increase hardware prefetch accuracy

Optimization 10: Compiler Prefetching

● The compiler may insert (special) instructions that fetch data from memory but do not influence the current code
● Two flavors of prefetching instructions
  – Register prefetches
    ● Data loaded into a register
  – Cache prefetches
    ● Data loaded into the cache
● Interaction with the virtual memory system:
  – Faulting prefetches cause virtual address faults
  – Nonfaulting prefetches do not cause such faults
    ● Because the hardware ignores the prefetch instruction right before the fault
● The most common prefetch is into the cache and nonfaulting
  – It is called nonbinding

Optimization 10: Compiler Prefetch Example

/* Before */
for (i = 0; i < 5000; ++i)
    x[i] = 2 * x[i];

/* After */
for (i = 0; i < 5000; ++i) {
    // get ready for the next iteration
    prefetch(&x[i+1]);  // better: exploit the whole cache block (see below)
    x[i] = 2 * x[i];
}

/* Even better */
for (i = 0; i < 5000; ++i) {
    // prefetch the next cache block instead of the next element
    prefetch(&x[i+8]);
    x[i] = 2 * x[i];
}

Memory Technology Optimization

● Memory sits between the caches and the I/O devices
  – Bridges the gap between a few tens of cycles of cache latency and millions of cycles of latency to disk
● Memory performance is measured with
  – Latency
    ● Cache designers care about it because it is the main factor in the miss penalty
  – Bandwidth
    ● Ever faster I/O devices (think solid-state drives) and increasing core counts stress the memory bandwidth
● Faster and larger caches cannot eliminate the need for main memory
  – Even with the principle of locality in mind, there are codes that wait for memory
  – Use Amdahl's law to see how important faster memory is

Memory Technology Overview

● Memory responsiveness is measured with
  – Access time
    ● The time between a read request and data arrival
    ● This could also be called memory latency
  – Cycle time
    ● Minimum time between unrelated memory requests
    ● Comes from the way data is stored in a single memory cell
● Two types of memory
  – SRAM
    ● Used for caches (since 1975)
  – DRAM
    ● Used for main memory (since 1975)
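The prefetch() call in the compiler prefetch example above is pseudocode; with GCC or Clang, one concrete way to express the same idea is the __builtin_prefetch intrinsic, as in the sketch below. The array size and the distance of 8 elements mirror the slide; the function name and everything else are illustrative.

    #include <stddef.h>

    #define N 5000

    void double_all(long *x) {
        for (size_t i = 0; i < N; ++i) {
            /* Hint to bring the data needed ~8 iterations ahead into the cache.
             * Arguments: address, 0 = read / 1 = write, temporal locality 0-3. */
            if (i + 8 < N)
                __builtin_prefetch(&x[i + 8], 1, 1);
            x[i] = 2 * x[i];
        }
    }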

Memory Technology: SRAM

● SRAM = Static RAM
● No need for refresh (unlike DRAM)
  – Hence, access time = cycle time
● Typically, 6 transistors to store a single bit
  – Prevents loss of information after a read
● Minimum power required to retain the charge in standby mode
● Most SRAM (caches) is now integrated into the processor chip

Memory Technology: DRAM

● DRAM = Dynamic RAM
● Only one transistor used to store a bit
  – After a read, the information is gone
  – So a read must be followed by a write to restore the state!
  – Hence, cycle time is longer than access time
  – Using multiple banks helps hide the rewrite of the data
  – Also, over time (8 ms) the data is lost: periodic refresh required
    ● A whole row is refreshed at once (usually driven by the memory controller)
● The number of pins required to address a single item is an issue
  – Address lines are multiplexed
  – Row Access Strobe (RAS): the first half of the address
  – Column Access Strobe (CAS): the second half of the address
  – Internally, DRAM is a 2D matrix and each half of the address works in one of the dimensions

Memory Technology: DRAM (contd.)

● Amdahl suggested linear growth of capacity with CPU speed
  – 1000 MB for every 1000 MIPS
  – But Moore's law made that impossible: CPU speed grew faster
  – Capacity grew only about half as fast in recent years
● The expected rate (from the CPU designer's perspective) is 55% a year
  – Or a fourfold improvement in capacity every three years
● Memory latency (measured as row access time) improves only 5% per year
● Data transfer time (related to CAS) improves at over 10% a year
● DRAM packaging unit: DIMM
  – Dual Inline Memory Module
  – 8 bytes wide + ECC (Error-Correcting Code)

Improving Memory Performance

● The main pressure comes from Moore's law
  – Faster single-core CPUs need more data faster
  – More cores per chip need more data faster
● Multiple accesses to the same row
  – Utilize the row buffer that holds the data from the row while the column portion of the address arrives
  – Multiple column addresses can nowadays utilize the row buffer
● Synchronous DRAM
  – The memory controller and the SDRAM use a clock for synchronization (faster than asynchronous operation)
  – Also, SDRAM has a burst mode
    ● In this mode, SDRAM delivers multiple data items without new address requests
    ● Possible due to an extra length register in the SDRAM
    ● Supports the “Critical Word First” cache optimization

Improving Memory Performance (contd.)

● DRAM is getting wider
  – Required due to the ever increasing density (get more data out at once)
    ● Old DDR: 4-bit bus
    ● DDR2, DDR3: 16-bit bus (2010)
● DDR = Double Data Rate
  – Single data rate: data transferred on one edge of the clock signal
  – Double data rate: data transferred on both edges of the clock signal
  – Effectively doubles the bandwidth at the same clock
● Multiple banks
  – Help due to interleaving (one word spread across the banks)
  – Smaller power consumption
  – Add delay because a bank has to be opened before access

DDR Standards and Technologies

● DDR is now a standard
  – This helps interoperability and competition, and results in lower prices
● DDR (2000)
  – 2.5 V; 133, 150, 200 MHz; >100 nm
● DDR2 (2004)
  – 1.8 V; 266, 333, 400 MHz; 80-90 nm
● DDR3 (2010)
  – 1.5 V; 533, 666, 800 MHz; 32-40 nm
● DDR4 (expected late 2013?)
  – 1-1.2 V; 1066-1600 MHz; 2x nm
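As a worked example of what double data rate means for bandwidth, the sketch below computes the peak transfer rate of a DDR3 part running at the 800 MHz bus clock listed above on a standard 8-byte-wide DIMM; the specific module is my example, not one from the slides.

    #include <stdio.h>

    int main(void) {
        /* Example: DDR3 DIMM with an 800 MHz bus clock (illustrative numbers). */
        double clock_hz     = 800e6;
        double transfers    = 2;     /* double data rate: both clock edges */
        double dimm_width_b = 8;     /* standard DIMM is 8 bytes wide      */

        double peak = clock_hz * transfers * dimm_width_b;
        printf("peak transfer rate: %.1f GB/s\n", peak / 1e9);  /* 12.8 GB/s */
        return 0;
    }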

Flash Memory

● Flash memory must be erased before being overwritten
  – In NAND Flash (higher density Flash), erasing is done by blocks
    ● It becomes a software (OS kernel) problem to assemble data into blocks and clean up old blocks
● Flash memory is static
  – No continuous refreshes, almost no power draw when inactive
● Flash has a limited number of times (100k) each block can be written
● Cheaper than SDRAM, more expensive than disk
  – Flash: $2 / GB, SDRAM: $30 / GB, Magnetic disk: $0.09 / GB
● Much slower than SDRAM, much faster than disk
  – For reads: 4 times slower than SDRAM, 1000x faster than disk
  – Writes to flash are 2x-20x slower than reads

SDRAM Power Optimizations

● A higher clock rate means greater (static and dynamic) power draw of the SDRAM
● Optimizing power becomes important as SDRAMs grow
● Common techniques include
  – Reduce the operating voltage (from 1.5 V to 1.35 V)
  – Introduce multiple banks
    ● Only one bank opens at a time to deliver subsequent words
  – Recent SDRAMs enter a power-down mode
    ● In this mode the memory modules ignore the clock
    ● However, the data still needs to be refreshed
      – Usually implemented as internal refresh circuitry

Flash Memory Introduction

● Flash memory is a type of EEPROM
  – Electronically Erasable Programmable Read-Only Memory
  – Read-only under normal operation
  – Erasable when a special voltage is applied
    ● The memory module is flashed = erased
● Suitable for long term storage
  – Used in mobile form factors: phones, tablets, laptops
● May be used as another level of the memory hierarchy due to the small size of RAM in these devices

GPU Memory Properties

● GPU memory
  – GDRAM = Graphics DRAM
  – GSDRAM = Graphics Synchronous DRAM
● GDDR
  – GDDR5 is based on DDR3; earlier GDDRs were based on DDR2
● GPUs demand more bandwidth because of their higher performance due to greater parallelism
● Greater bandwidth is achieved with:
  – A wider interface: 32 bits (vs. 4, 8, 16 for CPUs)
  – A higher clock rate, allowed by soldering the memory onto the GPU board rather than using the snap-in sockets of CPU DIMMs
● In practice, GDDR is 2-5x faster than DDR

Memory Dependability

● Errors in memory systems
  – Permanent errors during fabrication
  – Dynamic errors during operation = soft errors
    ● Cosmic rays (high energy particles)
● Manufacturing defects are accommodated with spare rows
  – After fabrication, memory modules are configured
    ● Defective rows are disabled and spare rows replace them
● Dynamic errors are detected with parity bits and corrected with ECC (Error-Correcting Codes)
  – Instruction caches are read-only, so parity bits suffice
  – A parity bit does not detect multi-bit errors: 1 parity + 8 data bits
  – ECC detects 2 and corrects 1 error for 8-bit ECC + 64-bit data

Memory Dependability: Chipkill

● High-end servers and warehouse-scale clusters need Chipkill
  – Chipkill is like RAID for disks
  – Parity bits and ECC are not kept together with the data but are distributed
  – A complete failure of a single memory module can be handled
● Developed by IBM; Intel calls it SDDC
● According to an IBM analysis, a 10,000-processor server with 4 GB/CPU has this many unrecoverable failures in 3 years:
  – 90,000 when only parity is used
  – 3,500 with ECC only
  – 6 with Chipkill
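To illustrate the "1 parity + 8 data bits" scheme from the Memory Dependability slide above, here is a small C sketch that computes an even parity bit for a byte and uses it to detect a single-bit flip; the function names are mine.

    #include <stdint.h>
    #include <stdio.h>

    /* Even parity over the 8 data bits: the stored parity bit makes
     * the total number of 1s (data + parity) even.                  */
    uint8_t parity_bit(uint8_t data) {
        uint8_t p = 0;
        for (int i = 0; i < 8; ++i)
            p ^= (data >> i) & 1;
        return p;
    }

    int main(void) {
        uint8_t stored_data   = 0x5A;
        uint8_t stored_parity = parity_bit(stored_data);

        /* Simulate a single-bit soft error. */
        uint8_t corrupted = stored_data ^ 0x10;

        if (parity_bit(corrupted) != stored_parity)
            printf("single-bit error detected (but not correctable)\n");

        /* A double-bit flip would cancel out and go undetected,
         * which is why ECC is needed for correction.             */
        return 0;
    }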

Virtual Memory Basics

● Virtual memory is
  – A protection mechanism to keep process memory data private
    ● A context switch changes the process which the CPU is executing
    ● After a switch, the old process' instructions and data are not visible to the new process
  – Cooperation between hardware (TLB = translation look-aside buffer, segmented memory, segmentation fault mechanism) and software (OS kernel)
  – The same address space for each running process
● The virtual memory system is a security measure
  – It is harder to break an encrypted message than to snoop it from the cache right after a context switch
  – A context switch reloads TLBs and registers but not caches
    ● Flushing all caches takes too many cycles for each switch
    ● This becomes the starting point for side-channel crypto-attacks

Hardware Requirements for Virtual Memory

● Provide user and kernel modes of execution
  – Only the kernel is allowed certain operations
● Make some CPU state read-only
  – Read-only: "Am I in kernel mode?" vs. "Change to kernel mode."
● Provide a programmable way of changing modes
  – Entering the kernel usually occurs with a system call and/or a special assembly instruction
  – There is an instruction to return to the previous mode by restoring the register file saved before the system call
● Check memory protection for every memory access
  – Limit access to the memory of other processes
  – Execute kernel code if a segmentation violation occurs
● The TLB is the main component of any virtual memory system

Translation Look-aside Buffer (TLB)

● Without a TLB, each memory access would be
  – A load to check the translation table
  – The actual memory access (if it doesn't violate the protection)
● The TLB operates on the principle of locality
  – If memory accesses have locality, then the address translations must have locality
  – If the address accessed is not in the TLB, then the translation must involve main memory
● The TLB is like a cache
  – TLB tags store a portion of the virtual address
  – TLB data is the physical address, protection bit, valid bit, use bit, dirty bit

Virtual Memory Caveats

● The virtual memory system works if both hardware and software are flawless
  – In practice that is never the case, because of bugs
● Complexity of hardware increases through Moore's law
● Complexity of OS software (millions of lines of code) increases with hardware complexity and new features
● Finding a security hole is a matter of time and effort
● If complexity is the problem, then a simpler system has a better chance of being secure
  – Virtual machines have a smaller “code base” in terms of hardware and software
  – The OS kernel no longer has to be trusted
    ● In fact, it could be a malicious OS and the VM system will keep it isolated

Rationale Behind Virtual Machines

● Security becomes an important issue in modern systems connected to the Internet
● Security failures of standard OS kernels
● Users of modern systems are unrelated and have little in common
  – In cloud computing, users might request different OS's
● Performance gains in CPU speed made the VM overhead acceptable
● The first implementations appeared in the 1960s; wide acceptance came in the 2000s

Virtual Machine Types

● Every interpreted language could be considered a VM
  – Java VM, Dalvik (Android)
  – Python (CPython, Jython, IronPython), Ruby (Rubinius), Perl (Parrot)
  – .NET is a virtualization with the assumption that the code will be translated into assembly upon execution and never interpreted
  – Code inside the Chrome browser can run inside Native Client (NaCl)
  – Emscripten translates C/C++ into obscure JavaScript that runs very fast (asm.js project)
● Here, we are interested in VMs at the binary ISA and hardware levels
  – We will only consider VMs that export the same ISA as the underlying hardware (no on-the-fly instruction translation)
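Since the TLB slide above describes the TLB as a small cache of translations, here is a minimal C sketch of a direct-mapped TLB lookup, assuming 4 KB pages and a 64-entry table; the structure and field names are illustrative and do not describe any particular processor.

    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_BITS   12   /* assumed 4 KB pages    */
    #define TLB_ENTRIES 64   /* assumed direct-mapped */

    struct tlb_entry {
        bool     valid;
        uint64_t vpn;   /* virtual page number (acts as the tag) */
        uint64_t pfn;   /* physical frame number                 */
        bool     write; /* protection: is writing allowed?       */
        bool     dirty;
    };

    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Returns true on a TLB hit and fills *paddr; a miss would require
     * walking the page table in main memory (not shown here).          */
    bool tlb_lookup(uint64_t vaddr, bool is_write, uint64_t *paddr) {
        uint64_t vpn    = vaddr >> PAGE_BITS;
        uint64_t offset = vaddr & ((1u << PAGE_BITS) - 1);
        struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];

        if (!e->valid || e->vpn != vpn)
            return false;                 /* TLB miss                      */
        if (is_write && !e->write)
            return false;                 /* protection violation -> fault */
        if (is_write)
            e->dirty = true;
        *paddr = (e->pfn << PAGE_BITS) | offset;
        return true;
    }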

Virtual Machine Terminology

● Virtual Machine Monitor (or Hypervisor)
  – Software that supports VMs
  – Much smaller than a traditional OS
    ● Tens of thousands of lines of code for a VMM vs. many millions for an OS
● Host
  – The underlying hardware platform
● Guest (VM)
  – The software (OS with applications) that runs on the VM

Virtual Machine Overheads and Benefits

● VMs come with overhead
  – CPU-bound codes experience very little slow-down
  – I/O-intensive workloads end up calling the OS kernel frequently
    ● This results in system calls and privileged instruction execution
    ● The virtualization layer has to emulate or protect these code sections
    ● The overhead depends on the speed of emulation/protection and is often noticeable (the cost of I/O instructions surges compared with the rest of the instructions)
  – I/O-bound applications spend most of their time in I/O
    ● The overhead is large but small compared to the speed of disks
● Benefits of VMs
  – Software management: multiple OS's and versions simultaneously
  – Hardware management: multiplexing of server resources, migration on failure, load balancing of overloaded servers

Requirements for Virtual Machines

● Guest software should behave on a VM as it behaves on the hardware (bare metal)
  – Except for overhead and limitations on resources
● Guest software should not be able to change resource allocation directly
  – Security and isolation
● The hardware has to provide at least two modes: system (for the host) and user (for the guest VMs)
  – This is similar to the virtual memory system privileges
● Some instructions should only be available in system mode
  – This is similar to virtual memory system calls and TLB instructions

Impact of Virtual Machines on Virtual Memory

● Every guest OS maintains its own set of page tables
  – Virtualization introduces “real” memory
  – The guest OS maps virtual addresses to real memory addresses
  – The VMM maps real memory addresses to physical addresses
  – The mapping is done via a shadow page table
    ● No actual extra table lookup at run time, so there is no additional level of indirection
    ● Instead, changes to the guest OS tables occur through special instructions that trap and transfer control to the VMM
  – TLBs in some RISC processors include a process ID
    ● This eliminates the need for a TLB flush upon a guest VM switch

Impact of Virtual Machines on I/O

● The I/O subsystem has to be virtualized to allow sharing between VM guests
● The diversity of I/O devices poses a challenge
  – Supporting many device drivers introduces complexity into the VMM
  – Generic drivers are often used to provide a common interface
● Physical disks are partitioned by the VMM and each partition becomes a disk visible to a guest VM
● Network interfaces are shared in time slices
  – The VMM keeps track of virtual network addresses and delivers packets to the corresponding guest VM
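A minimal sketch of the two mappings described on the "Impact of Virtual Machines on Virtual Memory" slide and of the shadow table that composes them; the table sizes, array-based layout, and function names are purely illustrative and not how any real VMM stores its translations.

    #include <stdint.h>

    #define PAGES 1024   /* illustrative sizes only */

    /* Guest OS page table: guest virtual page -> guest "real" page. */
    static uint64_t guest_pt[PAGES];
    /* VMM table: guest "real" page -> host physical frame.          */
    static uint64_t real_to_phys[PAGES];
    /* Shadow page table kept by the VMM: guest virtual -> physical. */
    static uint64_t shadow_pt[PAGES];

    /* Without a shadow table, every translation would pay two lookups. */
    uint64_t translate_twice(uint64_t guest_vpn) {
        return real_to_phys[guest_pt[guest_vpn]];
    }

    /* The guest cannot write its page table directly: the write traps,
     * and the VMM updates both the guest's view and the shadow entry,
     * so the hardware can later use shadow_pt in a single lookup.      */
    void trap_guest_pt_write(uint64_t guest_vpn, uint64_t guest_rpn) {
        guest_pt[guest_vpn]  = guest_rpn;
        shadow_pt[guest_vpn] = real_to_phys[guest_rpn];
    }

    uint64_t translate_shadow(uint64_t guest_vpn) {
        return shadow_pt[guest_vpn];
    }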