
15-213/18-243: Introduction to Computer Systems
10th Lecture, Feb. 13, 2014

Instructors: Anthony Rowe, Seth Goldstein and Gregory Kesden

Today

¢ Storage technologies and trends
¢ Locality of reference
¢ Caching in the memory hierarchy


Random-Access Memory (RAM)

¢ Key features
§ RAM is traditionally packaged as a chip.
§ Basic storage unit is normally a cell (one bit per cell).
§ Multiple RAM chips form a memory.
¢ Static RAM (SRAM)
§ Each cell stores a bit with a four- or six-transistor circuit.
§ Retains value indefinitely, as long as it is kept powered.
§ Relatively insensitive to electrical noise (EMI), radiation, etc.
§ Faster and more expensive than DRAM.
¢ Dynamic RAM (DRAM)
§ Each cell stores a bit with a capacitor. One transistor is used for access.
§ Value must be refreshed every 10-100 ms.
§ More sensitive to disturbances (EMI, radiation, ...) than SRAM.
§ Slower and cheaper than SRAM.

SRAM vs DRAM Summary

        Trans.   Access  Needs     Needs
        per bit  time    refresh?  EDC?   Cost  Applications
SRAM    4 or 6   1X      No        Maybe  100x  Cache memories
DRAM    1        10X     Yes       Yes    1X    Main memories, frame buffers


Conventional DRAM Organization

¢ d x w DRAM:
§ dw total bits organized as d supercells of size w bits each

[Diagram: 16 x 8 DRAM chip organized as 4 rows x 4 cols of supercells, with a 2-bit addr bus and an 8-bit data bus to/from the memory controller, and an internal row buffer]

Reading DRAM Supercell (2,1)

Step 1(a): Row access strobe (RAS) selects row 2.
Step 1(b): Row 2 copied from DRAM array to row buffer.

[Diagram: the memory controller drives RAS = 2 on the 2-bit addr lines; all of row 2 is copied into the internal row buffer]
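The address of a supercell is delivered in two halves: the row part is sent first with the row access strobe (RAS, this step) and the column part follows with the column access strobe (CAS, next step). A minimal sketch of that split for the 16 x 8 chip above; the supercell_addr name and the row-major numbering of supercells are assumptions made for illustration, not part of the slide.

#include <stdio.h>

/* Hypothetical helper: split a linear supercell number into the (row, col)
 * pair sent as RAS and CAS for a 16 x 8 DRAM organized as a 4 x 4 array
 * of 8-bit supercells. */
static void supercell_addr(int index, int *row, int *col)
{
    *row = index / 4;   /* sent first, while RAS is asserted  */
    *col = index % 4;   /* sent second, while CAS is asserted */
}

int main(void)
{
    int row, col;
    supercell_addr(9, &row, &col);              /* supercell number 9      */
    printf("RAS = %d, CAS = %d\n", row, col);   /* RAS = 2, CAS = 1, i.e., */
    return 0;                                   /* supercell (2,1)         */
}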


Reading DRAM Supercell (2,1)

Step 2(a): Column access strobe (CAS) selects column 1.
Step 2(b): Supercell (2,1) copied from buffer to data lines, and eventually back to the CPU.

[Diagram: the memory controller drives CAS = 1; supercell (2,1) travels from the internal row buffer over the 8-bit data lines back to the CPU]

Memory Modules

[Diagram: a 64 MB memory module consisting of eight 8Mx8 DRAMs (DRAM 0 ... DRAM 7). For an address A that maps to supercell addr (row = i, col = j), every chip outputs its supercell (i,j); each of the eight DRAMs supplies one byte (bits 0-7, 8-15, ..., 56-63) of the 64-bit doubleword at main memory address A, and the memory controller assembles the bytes into the doubleword]
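Because each of the eight chips contributes one byte of the 64-bit doubleword, the controller's job is essentially concatenation. A minimal sketch of that assembly step, assuming chip k supplies bits 8k to 8k+7 (the function name and chip_byte array are illustrative, not from the slide):

#include <stdint.h>
#include <stdio.h>

/* Toy model: chip k returns the byte it stores at supercell (i, j);
 * the memory controller packs chip k's byte into bits 8k..8k+7. */
static uint64_t assemble_doubleword(const uint8_t chip_byte[8])
{
    uint64_t word = 0;
    for (int k = 0; k < 8; k++)
        word |= (uint64_t)chip_byte[k] << (8 * k);
    return word;
}

int main(void)
{
    uint8_t bytes[8] = {0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88};
    printf("0x%016llx\n", (unsigned long long)assemble_doubleword(bytes));
    return 0;   /* prints 0x8877665544332211 */
}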


Enhanced DRAMs

¢ Basic DRAM cell has not changed since its invention in 1966.
§ Commercialized by Intel in 1970.
¢ DRAM cores with better interface logic and faster I/O:
§ Synchronous DRAM (SDRAM)
– Uses a conventional clock signal instead of asynchronous control
– Allows reuse of the row addresses (e.g., RAS, CAS, CAS, CAS)
§ Double data-rate synchronous DRAM (DDR SDRAM)
– Double edge clocking sends two bits per cycle per pin
– Different types distinguished by size of small prefetch buffer: DDR (2 bits), DDR2 (4 bits), DDR3 (8 bits)
– By 2010, standard for most server and desktop systems
– Intel Core i7 supports only DDR3 SDRAM

Nonvolatile Memories

¢ DRAM and SRAM are volatile memories
§ Lose information if powered off.
¢ Nonvolatile memories retain value even if powered off
§ Read-only memory (ROM): programmed during production
§ Programmable ROM (PROM): can be programmed once
§ Erasable PROM (EPROM): can be bulk erased (UV, X-ray)
§ Electrically erasable PROM (EEPROM): electronic erase capability
§ Flash memory: EEPROMs with partial (sector) erase capability
– Wears out after about 100,000 erasings.
¢ Uses for Nonvolatile Memories
§ Firmware programs stored in a ROM (BIOS, controllers for disks, network cards, graphics accelerators, security subsystems, ...)
§ Solid state disks (replace rotating disks in thumb drives, smart phones, mp3 players, tablets, laptops, ...)
§ Disk caches


Traditional Bus Structure Connecting CPU and Memory

¢ A bus is a collection of parallel wires that carry address, data, and control signals.
¢ Buses are typically shared by multiple devices.

[Diagram: the CPU chip (register file, ALU, bus interface) is connected by the system bus to an I/O bridge, which is connected by the memory bus to main memory]

Memory Read Transaction (1)

¢ CPU places address A on the memory bus.

[Diagram: load operation movl A, %eax; the bus interface drives address A onto the system bus; main memory holds word x at address A; %eax is the destination register]


Memory Read Transaction (2)

¢ Main memory reads A from the memory bus, retrieves word x, and places it on the bus.

[Diagram: load operation movl A, %eax; the I/O bridge passes x from the memory bus onto the system bus]

Memory Read Transaction (3)

¢ CPU reads word x from the bus and copies it into register %eax.

[Diagram: load operation movl A, %eax; the bus interface delivers x to the register file, and %eax now holds x]


Memory Write Transaction (1)

¢ CPU places address A on the bus. Main memory reads it and waits for the corresponding data word to arrive.

[Diagram: store operation movl %eax, A; %eax holds y; address A travels over the system and memory buses to main memory]

Memory Write Transaction (2)

¢ CPU places data word y on the bus.

[Diagram: store operation movl %eax, A; y travels from the bus interface through the I/O bridge toward main memory]


Memory Write Transaction (3)

¢ Main memory reads data word y from the bus and stores it at address A.

[Diagram: store operation movl %eax, A; main memory now holds y at address A]

What's Inside A Disk Drive?

[Diagram: spindle, platters, arm, actuator, and electronics (including a processor and memory!) behind a SCSI connector. Image courtesy of Seagate Technology]


Disk Geometry

¢ Disks consist of platters, each with two surfaces.
¢ Each surface consists of concentric rings called tracks.
¢ Each track consists of sectors separated by gaps.

[Diagram: a surface with concentric tracks around the spindle; each track is divided into sectors separated by gaps]

Disk Geometry (Multiple-Platter View)

¢ Aligned tracks form a cylinder.

[Diagram: cylinder k is formed by track k on surfaces 0-5 of platters 0-2, all mounted on one spindle]


Disk Capacity

¢ Capacity: maximum number of bits that can be stored.
§ Vendors express capacity in units of gigabytes (GB), where 1 GB = 10^9 bytes (lawsuit pending! claims deceptive advertising).
¢ Capacity is determined by these technology factors:
§ Recording density (bits/in): number of bits that can be squeezed into a 1 inch segment of a track.
§ Track density (tracks/in): number of tracks that can be squeezed into a 1 inch radial segment.
§ Areal density (bits/in^2): product of recording and track density.
¢ Modern disks partition tracks into disjoint subsets called recording zones
§ Each track in a zone has the same number of sectors, determined by the circumference of the innermost track.
§ Each zone has a different number of sectors/track.

Computing Disk Capacity

Capacity = (# bytes/sector) x (avg. # sectors/track) x (# tracks/surface) x (# surfaces/platter) x (# platters/disk)

Example:
§ 512 bytes/sector
§ 300 sectors/track (on average)
§ 20,000 tracks/surface
§ 2 surfaces/platter
§ 5 platters/disk

Capacity = 512 x 300 x 20,000 x 2 x 5
         = 30,720,000,000 bytes
         = 30.72 GB
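The capacity formula translates directly into code. A minimal sketch using the example parameters above (the function and variable names are illustrative):

#include <stdio.h>

/* Capacity = (# bytes/sector) x (avg # sectors/track) x (# tracks/surface)
 *            x (# surfaces/platter) x (# platters/disk) */
static long long disk_capacity(long long bytes_per_sector,
                               long long sectors_per_track,
                               long long tracks_per_surface,
                               long long surfaces_per_platter,
                               long long platters_per_disk)
{
    return bytes_per_sector * sectors_per_track * tracks_per_surface
         * surfaces_per_platter * platters_per_disk;
}

int main(void)
{
    long long bytes = disk_capacity(512, 300, 20000, 2, 5);
    /* prints 30720000000 bytes = 30.72 GB (vendor GB = 10^9 bytes) */
    printf("%lld bytes = %.2f GB\n", bytes, bytes / 1e9);
    return 0;
}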


Disk Operation (Single-Platter View)

The disk surface spins at a fixed rotational rate.
The read/write head is attached to the end of the arm and flies over the disk surface on a thin cushion of air.
By moving radially, the arm can position the read/write head over any track.

Disk Operation (Multi-Platter View)

Read/write heads move in unison from cylinder to cylinder.

[Diagram: one arm and head per surface; all platters rotate on a single spindle]


Disk Structure - top view of single platter

Surface organized into tracks
Tracks divided into sectors

Disk Access

Head in position above a track


Disk Access

Rotation is counter-clockwise

Disk Access – Read

About to read blue sector


Disk Access – Read

After BLUE read: after reading blue sector

Disk Access – Read

After BLUE read: red request scheduled next


Disk Access – Seek

After BLUE read, seek for RED: seek to red's track

Disk Access – Rotational Latency

After BLUE read, seek for RED, rotational latency: wait for red sector to rotate around


Disk Access – Read

After BLUE read, seek for RED, rotational latency, after RED read: complete read of red

Disk Access – Service Time Components

After BLUE read, seek for RED, rotational latency, after RED read
Service time components: seek + rotational latency + data transfer


Disk Access Time

¢ Average time to access some target sector approximated by:
§ Taccess = Tavg seek + Tavg rotation + Tavg transfer
¢ Seek time (Tavg seek)
§ Time to position heads over cylinder containing target sector.
§ Typical Tavg seek is 3-9 ms
¢ Rotational latency (Tavg rotation)
§ Time waiting for first bit of target sector to pass under r/w head.
§ Tavg rotation = 1/2 x (1/RPM) x 60 secs/1 min
§ Typical rotational rate = 7,200 RPM
¢ Transfer time (Tavg transfer)
§ Time to read the bits in the target sector.
§ Tavg transfer = (1/RPM) x 1/(avg # sectors/track) x 60 secs/1 min

Disk Access Time Example

¢ Given:
§ Rotational rate = 7,200 RPM
§ Average seek time = 9 ms
§ Avg # sectors/track = 400
¢ Derived:
§ Tavg rotation = 1/2 x (60 secs / 7,200 RPM) x 1000 ms/sec = 4 ms
§ Tavg transfer = (60 / 7,200 RPM) x (1/400) secs/track x 1000 ms/sec = 0.02 ms
§ Taccess = 9 ms + 4 ms + 0.02 ms
¢ Important points:
§ Access time dominated by seek time and rotational latency.
§ First bit in a sector is the most expensive, the rest are free.
§ SRAM access time is about 4 ns/doubleword, DRAM about 60 ns
– Disk is about 40,000 times slower than SRAM, 2,500 times slower than DRAM.
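These formulas are easy to check numerically. A minimal sketch using the example parameters above; note the slide rounds Tavg rotation to 4 ms, while the formula gives 4.17 ms (variable names are illustrative):

#include <stdio.h>

/* Taccess = Tavg seek + Tavg rotation + Tavg transfer (all in ms) */
int main(void)
{
    double rpm = 7200.0;
    double avg_seek_ms = 9.0;
    double sectors_per_track = 400.0;

    double ms_per_rev = 60.0 / rpm * 1000.0;              /* one full rotation       */
    double t_rotation = 0.5 * ms_per_rev;                 /* wait half a turn on avg */
    double t_transfer = ms_per_rev / sectors_per_track;   /* time to pass one sector */
    double t_access   = avg_seek_ms + t_rotation + t_transfer;

    /* prints Trotation = 4.17 ms, Ttransfer = 0.02 ms, Taccess = 13.19 ms */
    printf("Trotation = %.2f ms, Ttransfer = %.2f ms, Taccess = %.2f ms\n",
           t_rotation, t_transfer, t_access);
    return 0;
}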


Logical Disk Blocks

¢ Modern disks present a simpler abstract view of the complex sector geometry:
§ The set of available sectors is modeled as a sequence of b-sized logical blocks (0, 1, 2, ...)
¢ Mapping between logical blocks and actual (physical) sectors
§ Maintained by a hardware/firmware device called the disk controller.
§ Converts requests for logical blocks into (surface, track, sector) triples.
¢ Allows controller to set aside spare cylinders for each zone.
§ Accounts for the difference in "formatted capacity" and "maximum capacity".

I/O Bus

[Diagram: CPU chip (register file, ALU, bus interface); system bus to the I/O bridge; memory bus to main memory; the I/O bus connects the USB controller (mouse, keyboard), the graphics adapter (monitor), and the disk controller (disk), with expansion slots for other devices such as network adapters]
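Returning to the logical-block abstraction above: the controller's translation from a logical block number to a (surface, track, sector) triple can be pictured with a toy mapping. This is only a sketch with a made-up uniform geometry; real controllers also account for recording zones and spare sectors.

#include <stdio.h>

/* Toy geometry: uniform sectors per track, no zones or spares (assumed). */
#define SECTORS_PER_TRACK 300
#define SURFACES          10          /* 5 platters x 2 surfaces */

struct chs { int surface, track, sector; };

/* Map a logical block number to a (surface, track, sector) triple,
 * filling a whole cylinder before moving the heads to the next track. */
static struct chs logical_to_physical(long block)
{
    struct chs p;
    p.sector  = block % SECTORS_PER_TRACK;
    p.surface = (block / SECTORS_PER_TRACK) % SURFACES;
    p.track   = block / (SECTORS_PER_TRACK * SURFACES);
    return p;
}

int main(void)
{
    struct chs p = logical_to_physical(123456);
    printf("surface %d, track %d, sector %d\n", p.surface, p.track, p.sector);
    return 0;
}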


Reading a Disk Sector (1)

¢ CPU initiates a disk read by writing a command, logical block number, and destination memory address to a port (address) associated with the disk controller.

[Diagram: CPU chip, main memory, and the USB, graphics, and disk controllers on the I/O bus]

Reading a Disk Sector (2)

¢ Disk controller reads the sector and performs a direct memory access (DMA) transfer into main memory.

[Diagram: same system; the sector flows from the disk controller over the I/O bus and memory bus into main memory]
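The command/DMA/interrupt sequence on these slides (the completion interrupt appears on the next one) can be summarized as a commented sketch. The "port" variables, constants, and function names below are hypothetical stand-ins that merely simulate the sequence; they are not a real controller interface.

#include <stdint.h>
#include <stdio.h>

/* Simulated controller "ports" (a real controller exposes registers at
 * fixed port or memory-mapped addresses; these names are made up). */
static uint32_t disk_cmd, disk_lba;
static void    *disk_dst;
static volatile int disk_done;          /* set by the interrupt handler */

#define CMD_READ 1

/* Step 1: CPU initiates the read by writing a command, logical block
 * number, and destination memory address to the controller's ports. */
static void start_disk_read(uint32_t lba, void *dst)
{
    disk_lba = lba;
    disk_dst = dst;
    disk_cmd = CMD_READ;
    /* Step 2: the controller reads the sector and DMAs it into main
     * memory at disk_dst; the CPU is free to do other work meanwhile. */
}

/* Step 3: when the DMA transfer completes, the controller interrupts
 * the CPU; the handler just records that the data has arrived. */
static void disk_interrupt_handler(void)
{
    disk_done = 1;
}

int main(void)
{
    char sector[512];
    start_disk_read(31337, sector);     /* hypothetical block number */
    disk_interrupt_handler();           /* pretend the DMA finished  */
    printf("read complete: %d\n", disk_done);
    return 0;
}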


Reading a Disk Sector (3)

¢ When the DMA transfer completes, the disk controller notifies the CPU with an interrupt (i.e., asserts a special "interrupt" pin on the CPU).

[Diagram: the same CPU/memory/I/O-bus system; the interrupt travels from the disk controller to the CPU]

Solid State Disks (SSDs)

[Diagram: requests to read and write logical disk blocks arrive at the SSD; a flash translation layer sits above the flash memory, which is organized as blocks 0 ... B-1, each containing pages 0 ... P-1]

¢ Pages: 512B to 4KB, Blocks: 32 to 128 pages
¢ Data read/written in units of pages.
¢ Page can be written only after its block has been erased
¢ A block wears out after 100,000 repeated writes.


SSD Performance Characteristics

Sequential read tput   250 MB/s    Sequential write tput   170 MB/s
Random read tput       140 MB/s    Random write tput        14 MB/s
Random read access      30 us      Random write access     300 us

¢ Why are random writes so slow?
§ Erasing a block is slow (around 1 ms)
§ Write to a page triggers a copy of all useful pages in the block
– Find an unused block (new block) and erase it
– Write the page into the new block
– Copy other pages from old block to the new block

SSD Tradeoffs vs Rotating Disks

¢ Advantages
§ No moving parts → faster, less power, more rugged
¢ Disadvantages
§ Have the potential to wear out
– Mitigated by "wear leveling logic" in flash translation layer
– E.g. Intel X25 guarantees 1 petabyte (10^15 bytes) of random writes before they wear out
§ In 2010, about 100 times more expensive per byte
¢ Applications
§ MP3 players, smart phones, laptops
§ Beginning to appear in desktops and servers
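The copy-and-erase sequence above is why a small random write costs so much more than a sequential one. A back-of-the-envelope cost model, a sketch only: the ~1 ms erase time and the 32-128 pages/block range come from the slides, while the per-page program time and the exact block size are assumed.

#include <stdio.h>

#define ERASE_MS        1.0     /* erasing a block is slow (~1 ms, per slide) */
#define PAGE_WRITE_MS   0.1     /* assumed time to program one page           */
#define PAGES_PER_BLOCK 64      /* within the 32-128 range on the slide       */

/* Cost of rewriting one page whose block already holds live data:
 * erase a fresh block, write the new page, copy the other live pages. */
static double random_page_write_ms(int live_pages_in_block)
{
    return ERASE_MS
         + PAGE_WRITE_MS                               /* the page being written */
         + PAGE_WRITE_MS * (live_pages_in_block - 1);  /* pages copied over      */
}

int main(void)
{
    printf("write to an already-erased page: %.1f ms\n", PAGE_WRITE_MS);
    printf("random write into a full block:  %.1f ms\n",
           random_page_write_ms(PAGES_PER_BLOCK));
    return 0;
}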


Storage Trends

SRAM
Metric             1980    1985   1990   1995    2000     2005      2010   2010:1980
$/MB             19,200   2,900    320    256     100       75        60         320
access (ns)         300     150     35     15       3        2       1.5         200

DRAM
Metric             1980    1985   1990   1995    2000     2005      2010   2010:1980
$/MB              8,000     880    100     30       1      0.1      0.06     130,000
access (ns)         375     200    100     70      60       50        40           9
typical size (MB) 0.064   0.256      4     16      64    2,000     8,000     125,000

Disk
Metric             1980    1985   1990   1995    2000     2005      2010   2010:1980
$/MB                500     100      8   0.30    0.01    0.005    0.0003   1,600,000
access (ms)          87      75     28     10       8        4         3          29
typical size (MB)     1      10    160  1,000  20,000  160,000 1,500,000   1,500,000

CPU Clock Rates

                           1980   1990     1995   2000   2003    2005     2010  2010:1980
CPU                        8080    386  Pentium  P-III    P-4  Core 2  Core i7        ---
Clock rate (MHz)              1     20      150    600   3300    2000     2500       2500
Cycle time (ns)            1000     50        6    1.6    0.3    0.50      0.4       2500
Cores                         1      1        1      1      1       2        4          4
Effective cycle time (ns)  1000     50        6    1.6    0.3    0.25      0.1     10,000

(Inflection point in computer history when designers hit the "Power Wall": between the 2003 and 2005 columns.)


The CPU-Memory Gap

The gap between DRAM, disk, and CPU speeds.

[Log-scale plot of time (ns) versus year, 1980-2010: disk seek time, SSD access time, DRAM access time, SRAM access time, CPU cycle time, and effective CPU cycle time]

Locality to the Rescue!

The key to bridging this CPU-Memory gap is a fundamental property of computer programs known as locality.


Today

¢ Storage technologies and trends
¢ Locality of reference
¢ Caching in the memory hierarchy

Locality

¢ Principle of Locality: Programs tend to use data and instructions with addresses near or equal to those they have used recently

¢ Temporal locality:
§ Recently referenced items are likely to be referenced again in the near future

¢ Spatial locality:
§ Items with nearby addresses tend to be referenced close together in time


Locality Example

sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
return sum;

¢ Data references
§ Reference array elements in succession (stride-1 reference pattern). Spatial locality
§ Reference variable sum each iteration. Temporal locality
¢ Instruction references
§ Reference instructions in sequence. Spatial locality
§ Cycle through loop repeatedly. Temporal locality

Qualitative Estimates of Locality

¢ Claim: Being able to look at code and get a qualitative sense of its locality is a key skill for a professional programmer.

¢ Question: Does this function have good locality with respect to array a?

int sum_array_rows(int a[M][N])
{
    int i, j, sum = 0;

    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}


Locality Example

¢ Question: Does this function have good locality with respect to array a?

int sum_array_cols(int a[M][N])
{
    int i, j, sum = 0;

    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

Locality Example

¢ Question: Can you permute the loops so that the function scans the 3-d array a with a stride-1 reference pattern (and thus has good spatial locality)?

int sum_array_3d(int a[M][N][N])
{
    int i, j, k, sum = 0;

    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                sum += a[k][i][j];
    return sum;
}
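One possible answer (not spelled out on the slide): C arrays are row-major, so the innermost loop should vary the last subscript of a[k][i][j]. A sketch of the permuted loops, with M and N as in the slide's code:

int sum_array_3d(int a[M][N][N])
{
    int i, j, k, sum = 0;

    /* k outermost, i in the middle, j innermost: consecutive iterations
     * now touch a[k][i][j], a[k][i][j+1], ... (stride-1 reference pattern) */
    for (k = 0; k < N; k++)
        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                sum += a[k][i][j];
    return sum;
}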


Memory Hierarchies

¢ Some fundamental and enduring properties of hardware and software:
§ Fast storage technologies cost more per byte, have less capacity, and require more power (heat!).
§ The gap between CPU and main memory speed is widening.
§ Well-written programs tend to exhibit good locality.

¢ These fundamental properties complement each other beautifully.

¢ They suggest an approach for organizing memory and storage systems known as a memory hierarchy.

Today

¢ Storage technologies and trends
¢ Locality of reference
¢ Caching in the memory hierarchy


An Example Memory Hierarchy

Smaller, faster, and costlier (per byte) storage devices sit at the top; larger, slower, and cheaper (per byte) devices sit at the bottom:

L0: Registers                    CPU registers hold words retrieved from the L1 cache
L1: L1 cache (SRAM)              L1 cache holds cache lines retrieved from the L2 cache
L2: L2 cache (SRAM)              L2 cache holds cache lines retrieved from main memory
L3: Main memory (DRAM)           Main memory holds disk blocks retrieved from local disks
L4: Local secondary storage      Local disks hold files retrieved from disks on remote
    (local disks)                network servers
L5: Remote secondary storage     (tapes, distributed file systems, Web servers)

Caches

¢ Cache: A smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device.
¢ Fundamental idea of a memory hierarchy:
§ For each k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1.
¢ Why do memory hierarchies work?
§ Because of locality, programs tend to access the data at level k more often than they access the data at level k+1.
§ Thus, the storage at level k+1 can be slower, and thus larger and cheaper per bit.
¢ Big Idea: The memory hierarchy creates a large pool of storage that costs as much as the cheap storage near the bottom, but that serves data to programs at the rate of the fast storage near the top.


General Cache Concepts

[Diagram: the cache is a smaller, faster, more expensive memory that caches a subset of the blocks (here blocks 8, 9, 14, 3); data is copied between cache and memory in block-sized transfer units (blocks 4 and 10 shown in transit); memory is a larger, slower, cheaper storage viewed as partitioned into "blocks" 0-15]

General Cache Concepts: Hit

Request: 14. Data in block b is needed.
Block b is in cache: Hit!
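The hit case above, and the miss case on the next slide, can be captured in a short simulation. This is a minimal sketch of the general idea rather than any particular cache design; the structure names, block sizes, and the mod-4 placement rule are assumptions for illustration.

#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE    4     /* bytes per block (toy value)            */
#define CACHE_BLOCKS  4     /* blocks the smaller, faster level holds */
#define MEM_BLOCKS   16     /* blocks in the larger, slower level     */

static char memory[MEM_BLOCKS][BLOCK_SIZE];   /* toy "main memory" */

struct cache_slot {
    int  valid;
    int  block_id;              /* which memory block this slot holds */
    char data[BLOCK_SIZE];
};

static struct cache_slot cache[CACHE_BLOCKS];

/* Return block b's data, fetching it from the slower level on a miss. */
static char *access_block(int b)
{
    int slot = b % CACHE_BLOCKS;                /* placement policy     */
    if (cache[slot].valid && cache[slot].block_id == b) {
        printf("block %2d: hit\n", b);          /* served from cache    */
        return cache[slot].data;
    }
    printf("block %2d: miss\n", b);             /* evict, then transfer */
    memcpy(cache[slot].data, memory[b], BLOCK_SIZE);   /* block-sized copy */
    cache[slot].valid = 1;
    cache[slot].block_id = b;
    return cache[slot].data;
}

int main(void)
{
    access_block(14);   /* miss (cold) */
    access_block(14);   /* hit         */
    access_block(12);   /* miss        */
    return 0;
}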


General Cache Concepts: Miss

Request: 12. Data in block b is needed.
Block b is not in cache: Miss!
Block b is fetched from memory.
Block b is stored in cache.
• Placement policy: determines where b goes
• Replacement policy: determines which block gets evicted (victim)

General Caching Concepts: Types of Cache Misses

¢ Cold (compulsory) miss
§ Cold misses occur because the cache is empty.
¢ Conflict miss
§ Most caches limit blocks at level k+1 to a small subset (sometimes a singleton) of the block positions at level k.
§ E.g. Block i at level k+1 must be placed in block (i mod 4) at level k.
§ Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k block.
§ E.g. Referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time.
¢ Capacity miss
§ Occurs when the set of active cache blocks (working set) is larger than the cache.
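The conflict-miss pattern above is easy to reproduce with the mod-4 placement rule. A self-contained sketch (the reference string and slot bookkeeping are just for illustration): blocks 0 and 8 both map to slot 0, so alternating between them misses every time even though the cache has room for four blocks.

#include <stdio.h>

/* Placement policy from the slide: block i must go in slot (i mod 4). */
int main(void)
{
    int refs[] = {0, 8, 0, 8, 0, 8};
    int n = sizeof refs / sizeof refs[0];
    int slot_holds[4] = {-1, -1, -1, -1};    /* which block each slot holds */

    for (int r = 0; r < n; r++) {
        int b = refs[r], slot = b % 4;
        int hit = (slot_holds[slot] == b);
        printf("reference %d -> slot %d: %s\n", b, slot, hit ? "hit" : "miss");
        slot_holds[slot] = b;                /* the new block evicts the old */
    }
    return 0;
}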


Examples of Caching in the Hierarchy

Cache Type            What is Cached?       Where is it Cached?   Latency (cycles)  Managed By
Registers             4-8 byte words        CPU core              0                 Compiler
TLB                   Address translations  On-Chip TLB           0                 Hardware
L1 cache              64-byte blocks        On-Chip L1            1                 Hardware
L2 cache              64-byte blocks        On/Off-Chip L2        10                Hardware
Virtual Memory        4-KB pages            Main memory           100               Hardware + OS
Buffer cache          Parts of files        Main memory           100               OS
Disk cache            Disk sectors          Disk controller       100,000           Disk firmware
Network buffer cache  Parts of files        Local disk            10,000,000        AFS/NFS client
Browser cache         Web pages             Local disk            10,000,000        Web browser
Web cache             Web pages             Remote server disks   1,000,000,000     Web proxy server

Summary

¢ The speed gap between CPU, memory and mass storage continues to widen.
¢ Well-written programs exhibit a property called locality.
¢ Memory hierarchies based on caching close the gap by exploiting locality.
