Memory Hierarchy Summary

Carnegie Mellon Carnegie Mellon Today DRAM as building block for main memory The Memory Hierarchy Locality of reference Caching in the memory hierarchy Storage technologies and trends 15‐213 / 18‐213: Introduction to Computer Systems 10th Lecture, Feb 14, 2013 Instructors: Seth Copen Goldstein, Anthony Rowe, Greg Kesden 1 2 Carnegie Mellon Carnegie Mellon Byte‐Oriented Memory Organization Simple Memory Addressing Modes Normal (R) Mem[Reg[R]] • • • . Register R specifies memory address . Aha! Pointer dereferencing in C Programs refer to data by address movl (%ecx),%eax . Conceptually, envision it as a very large array of bytes . In reality, it’s not, but can think of it that way . An address is like an index into that array Displacement D(R) Mem[Reg[R]+D] . and, a pointer variable stores an address . Register R specifies start of memory region . Constant displacement D specifies offset Note: system provides private address spaces to each “process” . Think of a process as a program being executed movl 8(%ebp),%edx . So, a program can clobber its own data, but not that of others nd th From 2 lecture 3 From 5 lecture 4 Carnegie Mellon Carnegie Mellon Traditional Bus Structure Connecting Memory Read Transaction (1) CPU and Memory CPU places address A on the memory bus. A bus is a collection of parallel wires that carry address, data, and control signals. Buses are typically shared by multiple devices. Register file Load operation: movl A, %eax ALU %eax CPU chip Main memory Register file I/O bridge A 0 Bus interface ALU x A System bus Memory bus I/O Main Bus interface bridge memory 5 6 Carnegie Mellon Carnegie Mellon Memory Read Transaction (2) Memory Read Transaction (3) Main memory reads A from the memory bus, retrieves CPU read word xfrom the bus and copies it into register word x, and places it on the bus. %eax. Register file Register file Load operation: movl A, %eax Load operation: movl A, %eax ALU ALU %eax %eax x Main memory Main memory I/O bridge x 0 I/O bridge 0 Bus interface x A Bus interface x A 7 8 Carnegie Mellon Carnegie Mellon Memory Write Transaction (1) Memory Write Transaction (2) CPU places address A on bus. Main memory reads it and CPU places data word yon the bus. waits for the corresponding data word to arrive. Register file Store operation: movl %eax, A Register file Store operation: movl %eax, A ALU ALU %eax y %eax y Main memory Main memory I/O bridge A 0 I/O bridge y 0 Bus interface A Bus interface A 9 10 Carnegie Mellon Carnegie Mellon Memory Write Transaction (3) Dynamic Random‐Access Memory (DRAM) Main memory reads data word yfrom the bus and stores Key features it at address A. DRAM is traditionally packaged as a chip . Basic storage unit is normally a cell (one bit per cell) register file Store operation: movl %eax, A . Multiple DRAM chips form main memory in most computers ALU %eax y Technical characteristics main memory . Organized in two dimensions (rows and columns) I/O bridge 0 . To access (within a DRAM chip): select row then select column nd bus interface y A . Consequence: 2 access to a row faster than different column/row . Each cell stores bit with a capacitor; one transistor is used for access . Value must be refreshed every 10‐100 ms . Done within the hardware 11 12 Carnegie Mellon Carnegie Mellon Conventional DRAM Organization Reading DRAM Supercell (2,1) Step 1(a): Row access strobe (RAS) selects row 2. dxwDRAM: Step 1(b): Row 2 copied from DRAM array to row buffer. dw total bits organized as d supercells of size wbits 16 x 8 DRAM chip 16 x 8 DRAM chip cols Cols 0 123 0 123 RAS = 2 2 2 bits 0 / 0 / addr addr 1 1 rows Rows Memory Memory 2 supercell 2 controller controller (to/from CPU) (2,1) 3 8 3 8 bits / / data data Internal row buffer Internal row buffer 13 14 Carnegie Mellon Carnegie Mellon Reading DRAM Supercell (2,1) Memory Modules Step 2(a): Column access strobe (CAS) selects column 1. addr (row = i, col = j) Step 2(b): Supercell (2,1) copied from buffer to data lines, and eventually : supercell (i,j) back to the CPU. DRAM 0 16 x 8 DRAM chip 64 MB memory module Cols DRAM 7 consisting of 0 123 CAS = 1 eight 8Mx8 DRAMs 2 / 0 addr To CPU 1 Rows bits bits bits bits bits bits bits bits Memory 56-63 48-55 40-47 32-39 24-31 16-23 8-15 0-7 controller 2 supercell 3 6356 55 48 47 40 39 32 3124 23 16 15 8 7 0 Memory (2,1) 8 / controller data 64-bit doubleword at main memory address A supercell 64-bit doubleword Internal row buffer (2,1) 15 16 Carnegie Mellon Carnegie Mellon Aside: Nonvolatile Memories Issue: memory access is slow DRAM and SRAM (caches, on Tuesday) are volatile memories . Lose information if powered off DRAM access is much slower than CPU cycle time Most common nonvolatile storage is the hard disk . A DRAM chip has access times of 30‐50ns . Rotating platters (like DVDs)… plentiful capacity, but very slow . and, transferring from main memory into register Nonvolatile memories retain value even if powered off . Read‐only memory (ROM): programmed during production can take 3X or more longer than that . Programmable ROM (PROM): can be programmed once . With sub‐nanosecond cycles times, . Eraseable PROM (EPROM): can be bulk erased (UV, X‐Ray) 100s of cycles per memory access . Electrically eraseable PROM (EEPROM): electronic erase capability . and, the gap grows over time . Flash memory: EEPROMs with partial (sector) erase capability . Wears out after about 100,000 erasings Consequence: memory access efficiency crucial to performance Uses for Nonvolatile Memories . approximately 1/3 of instructions are loads or stores . Firmware programs stored in a ROM (BIOS, controllers for disks, network . both hardware and programmer have to work at it cards, graphics accelerators, security subsystems,…) . Solid state disks (replace rotating disks in thumb drives, smart phones, mp3 players, tablets, laptops,…) . Disk caches 17 18 Carnegie Mellon Carnegie Mellon The CPU‐Memory Gap Locality to the Rescue! The gap between DRAM, disk, and CPU speeds. 100,000,000.0 Disk The key to bridging this CPU‐Memory gap is a fundamental 10,000,000.0 property of computer programs known as locality 1,000,000.0 SSD 100,000.0 10,000.0 Disk seek time Flash SSD access time DRAM access time 1,000.0 ns SRAM access time DRAM CPU cycle time 100.0 Effective CPU cycle time 10.0 1.0 0.1 CPU 0.0 1980 1985 1990 1995 2000 2003 2005 2010 Year 19 20 Carnegie Mellon Carnegie Mellon Today Locality DRAM as building block for main memory Principle of Locality: Programs tend to use data and Locality of reference instructions with addresses near or equal to those they Caching in the memory hierarchy have used recently Storage technologies and trends Temporal locality: . Recently referenced items are likely to be referenced again in the near future Spatial locality: . Items with nearby addresses tend to be referenced close together in time 21 22 Carnegie Mellon Carnegie Mellon Locality Example Qualitative Estimates of Locality Claim: Being able to look at code and get a qualitative sum = 0; for (i = 0; i < n; i++) sense of its locality is a key skill for a professional sum += a[i]; programmer. return sum; Data references Question: Does this function have good locality with . Reference array elements in succession respect to array a? (stride‐1 reference pattern). Spatial locality . Reference variable sum each iteration. Temporal locality int sum_array_rows(int a[M][N]) { Instruction references int i, j, sum = 0; . Reference instructions in sequence. Spatial locality . Cycle through loop repeatedly. Temporal locality for (i = 0; i < M; i++) for (j = 0; j < N; j++) sum += a[i][j]; return sum; } 23 24 Carnegie Mellon Carnegie Mellon Locality Example Today Question: Does this function have good locality with DRAM as building block for main memory respect to array a? Locality of reference Caching in the memory hierarchy int sum_array_cols(int a[M][N]) Storage technologies and trends { int i, j, sum = 0; for (j = 0; j < N; j++) for (i = 0; i < M; i++) sum += a[i][j]; return sum; } 25 26 Carnegie Mellon Carnegie Mellon Memory Hierarchies An Example Memory Hierarchy L0: Some fundamental and enduring properties of hardware and CPU registers hold words retrieved Registers from L1 cache software: L1: L1 cache . Fast storage technologies cost more per byte, have less Smaller, (SRAM) L1 cache holds cache lines retrieved from L2 cache capacity, and require more power (heat!) faster, costlier L2: L2 cache per byte . The gap between CPU and main memory speed is widening (SRAM) L2 cache holds cache lines retrieved from main memory . Well‐written programs tend to exhibit good locality L3: Larger, Main memory (DRAM) Main memory holds disk blocks slower, retrieved from local disks These fundamental properties complement each other cheaper per byte L4: Local secondary storage Local disks hold files beautifully (local disks) retrieved from disks on remote network servers They suggest an approach for organizing memory and Remote secondary storage L5: (tapes, distributed file systems, Web servers) storage systems known as a memory hierarchy 27 28 Carnegie Mellon Carnegie Mellon Caches General Cache Concepts Cache: A smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device. Smaller, faster, more expensive Fundamental idea of a memory hierarchy: Cache 84 9 1410 3 memory caches a subset of .

Memory Hierarchy Summary

Data Storage the CPU-Memory

Managing the Memory Hierarchy

United States Patent (10) Patent N0.: US 7,290,102 B2 Lubbers Et A]

Embedded DRAM

Cortex-A9 Single Core Microarchitecture

Chapter 6 : Memory System

The Memory Hierarchy

Lecture 10: Memory Hierarchy -- Memory Technology and Principal of Locality

Memory Systems

Chapter 2: Memory Hierarchy Design

A Novel Cache Architecture for Multicore Processor Using Vlsi Sujitha .S, Anitha .N

Problem 1 : Locality of Reference. [10 Points] Problem 2 : Virtual Memory