Carnegie Mellon
The Memory Hierarchy
15-213/18-243: Introduction to Computer Systems
10th Lecture, Feb. 13, 2014
Instructors: Anthony Rowe, Seth Goldstein and Gregory Kesden

Today
¢ Storage technologies and trends
¢ Locality of reference
¢ Caching in the memory hierarchy
Random-Access Memory (RAM)
¢ Key features
§ RAM is traditionally packaged as a chip.
§ Basic storage unit is normally a cell (one bit per cell).
§ Multiple RAM chips form a memory.
¢ Static RAM (SRAM)
§ Each cell stores a bit with a four- or six-transistor circuit.
§ Retains value indefinitely, as long as it is kept powered.
§ Relatively insensitive to electrical noise (EMI), radiation, etc.
§ Faster and more expensive than DRAM.
¢ Dynamic RAM (DRAM)
§ Each cell stores a bit with a capacitor. One transistor is used for access.
§ Value must be refreshed every 10-100 ms.
§ More sensitive to disturbances (EMI, radiation, ...) than SRAM.
§ Slower and cheaper than SRAM.

SRAM vs DRAM Summary

        Trans.    Access  Needs     Needs
        per bit   time    refresh?  EDC?   Cost  Applications
SRAM    4 or 6    1X      No        Maybe  100x  Cache memories
DRAM    1         10X     Yes       Yes    1X    Main memories, frame buffers
Conventional DRAM Organization
¢ d x w DRAM:
§ dw total bits organized as d supercells of size w bits
[Figure: a 16 x 8 DRAM chip — a 4 x 4 array of supercells (rows 0-3, cols 0-3), a 2-bit addr bus and an 8-bit data bus to the memory controller, and an internal row buffer]

Reading DRAM Supercell (2,1)
Step 1(a): Row access strobe (RAS) selects row 2.
Step 1(b): Row 2 copied from DRAM array to row buffer.
[Figure: the same 16 x 8 chip with RAS = 2 on the addr lines; row 2, containing supercell (2,1), is copied into the internal row buffer]
Reading DRAM Supercell (2,1)
Step 2(a): Column access strobe (CAS) selects column 1.
Step 2(b): Supercell (2,1) copied from buffer to data lines, and eventually back to the CPU.
[Figure: the 16 x 8 chip with CAS = 1 on the addr lines; supercell (2,1) moves from the internal row buffer over the data lines to the memory controller]

Memory Modules
[Figure: a 64 MB memory module consisting of eight 8M x 8 DRAMs. For an address (row = i, col = j), each DRAM contributes one supercell (i,j): DRAM 0 supplies bits 0-7, DRAM 1 bits 8-15, ..., DRAM 7 bits 56-63 of the 64-bit doubleword at main memory address A, assembled by the memory controller]
Enhanced DRAMs
¢ Basic DRAM cell has not changed since its invention in 1966.
§ Commercialized by Intel in 1970.
¢ DRAM cores with better interface logic and faster I/O:
§ Synchronous DRAM (SDRAM)
– Uses a conventional clock signal instead of asynchronous control
– Allows reuse of the row addresses (e.g., RAS, CAS, CAS, CAS)
§ Double data-rate synchronous DRAM (DDR SDRAM)
– Double edge clocking sends two bits per cycle per pin
– Different types distinguished by size of small prefetch buffer: DDR (2 bits), DDR2 (4 bits), DDR3 (8 bits)
– By 2010, standard for most server and desktop systems
– Intel Core i7 supports only DDR3 SDRAM

Nonvolatile Memories
¢ DRAM and SRAM are volatile memories
§ Lose information if powered off.
¢ Nonvolatile memories retain value even if powered off
§ Read-only memory (ROM): programmed during production
§ Programmable ROM (PROM): can be programmed once
§ Erasable PROM (EPROM): can be bulk erased (UV, X-ray)
§ Electrically erasable PROM (EEPROM): electronic erase capability
§ Flash memory: EEPROMs with partial (sector) erase capability
– Wears out after about 100,000 erasings.
¢ Uses for Nonvolatile Memories
§ Firmware programs stored in a ROM (BIOS, controllers for disks, network cards, graphics accelerators, security subsystems, ...)
§ Solid state disks (replace rotating disks in thumb drives, smart phones, MP3 players, tablets, laptops, ...)
§ Disk caches
Traditional Bus Structure Connecting CPU and Memory
¢ A bus is a collection of parallel wires that carry address, data, and control signals.
¢ Buses are typically shared by multiple devices.
[Figure: the CPU chip (register file, ALU, bus interface) connects over the system bus to an I/O bridge, which connects over the memory bus to main memory]

Memory Read Transaction (1)
¢ CPU places address A on the memory bus.
[Figure: load operation movl A, %eax — the bus interface puts A on the bus; main memory holds word x at address A]
Memory Read Transaction (2)
¢ Main memory reads A from the memory bus, retrieves word x, and places it on the bus.
[Figure: main memory drives x onto the bus; %eax is not yet updated]

Memory Read Transaction (3)
¢ CPU reads word x from the bus and copies it into register %eax.
[Figure: x is now in %eax, completing movl A, %eax]
Memory Write Transaction (1)
¢ CPU places address A on bus. Main memory reads it and waits for the corresponding data word to arrive.
[Figure: store operation movl %eax, A — %eax holds y; address A is on the bus]

Memory Write Transaction (2)
¢ CPU places data word y on the bus.
[Figure: y travels over the bus toward main memory]
Memory Write Transaction (3)
¢ Main memory reads data word y from the bus and stores it at address A.
[Figure: y is now stored at address A in main memory]

What's Inside A Disk Drive?
[Figure: labeled photo of an opened disk drive — spindle, arm, platters, actuator, electronics (including a processor and memory!), and SCSI connector. Image courtesy of Seagate Technology]
Disk Geometry
¢ Disks consist of platters, each with two surfaces.
¢ Each surface consists of concentric rings called tracks.
¢ Each track consists of sectors separated by gaps.
[Figure: one surface with its spindle, showing tracks, sectors, and gaps]

Disk Geometry (Multiple-Platter View)
¢ Aligned tracks form a cylinder.
[Figure: three platters (surfaces 0-5) on a common spindle; track k on each surface lines up to form cylinder k]
Disk Capacity
¢ Capacity: maximum number of bits that can be stored.
§ Vendors express capacity in units of gigabytes (GB), where 1 GB = 10^9 bytes (lawsuit pending! Claims deceptive advertising).
¢ Capacity is determined by these technology factors:
§ Recording density (bits/in): number of bits that can be squeezed into a 1 inch segment of a track.
§ Track density (tracks/in): number of tracks that can be squeezed into a 1 inch radial segment.
§ Areal density (bits/in^2): product of recording and track density.
¢ Modern disks partition tracks into disjoint subsets called recording zones
§ Each track in a zone has the same number of sectors, determined by the circumference of the innermost track.
§ Each zone has a different number of sectors/track.

Computing Disk Capacity
Capacity = (# bytes/sector) x (avg. # sectors/track) x (# tracks/surface) x (# surfaces/platter) x (# platters/disk)

Example:
§ 512 bytes/sector
§ 300 sectors/track (on average)
§ 20,000 tracks/surface
§ 2 surfaces/platter
§ 5 platters/disk

Capacity = 512 x 300 x 20,000 x 2 x 5
         = 30,720,000,000
         = 30.72 GB
Disk Operation (Single-Platter View)
¢ The disk surface spins at a fixed rotational rate.
¢ The read/write head is attached to the end of the arm and flies over the disk surface on a thin cushion of air.
¢ By moving radially, the arm can position the read/write head over any track.

Disk Operation (Multi-Platter View)
¢ Read/write heads move in unison from cylinder to cylinder.
[Figure: several platters on one spindle, with one arm per surface]
Disk Structure - top view of single platter
[Figure: surface organized into tracks; tracks divided into sectors]

Disk Access
[Figure: head in position above a track]
Disk Access
[Figure: rotation is counter-clockwise]

Disk Access - Read
[Figure: about to read the blue sector]
Disk Access - Read
[Figure: after BLUE read — the blue sector has been read]

Disk Access - Read
[Figure: after BLUE read — the red request is scheduled next]
Disk Access - Seek
[Figure: after BLUE read, seek for RED — the head seeks to red's track]

Disk Access - Rotational Latency
[Figure: after BLUE read, seek for RED, rotational latency — wait for the red sector to rotate around under the head]
Disk Access - Read
[Figure: after BLUE read, seek for RED, rotational latency, after RED read — the read of red is complete]

Disk Access - Service Time Components
[Figure: service time = seek time + rotational latency + data transfer time]
Disk Access Time
¢ Average time to access some target sector approximated by:
§ Taccess = Tavg seek + Tavg rotation + Tavg transfer
¢ Seek time (Tavg seek)
§ Time to position heads over cylinder containing target sector.
§ Typical Tavg seek is 3-9 ms
¢ Rotational latency (Tavg rotation)
§ Time waiting for first bit of target sector to pass under r/w head.
§ Tavg rotation = 1/2 x 1/RPM x 60 secs/1 min
§ Typical rotational rate is 7,200 RPM
¢ Transfer time (Tavg transfer)
§ Time to read the bits in the target sector.
§ Tavg transfer = 1/RPM x 1/(avg # sectors/track) x 60 secs/1 min

Disk Access Time Example
¢ Given:
§ Rotational rate = 7,200 RPM
§ Average seek time = 9 ms.
§ Avg # sectors/track = 400.
¢ Derived:
§ Tavg rotation = 1/2 x (60 secs/7,200 RPM) x 1000 ms/sec = 4 ms.
§ Tavg transfer = 60/7,200 RPM x 1/400 secs/track x 1000 ms/sec = 0.02 ms
§ Taccess = 9 ms + 4 ms + 0.02 ms
¢ Important points:
§ Access time dominated by seek time and rotational latency.
§ First bit in a sector is the most expensive, the rest are free.
§ SRAM access time is about 4 ns/doubleword, DRAM about 60 ns
– Disk is about 40,000 times slower than SRAM, 2,500 times slower than DRAM.
Logical Disk Blocks
¢ Modern disks present a simpler abstract view of the complex sector geometry:
§ The set of available sectors is modeled as a sequence of b-sized logical blocks (0, 1, 2, ...)
¢ Mapping between logical blocks and actual (physical) sectors
§ Maintained by hardware/firmware device called disk controller.
§ Converts requests for logical blocks into (surface, track, sector) triples.
¢ Allows controller to set aside spare cylinders for each zone.
§ Accounts for the difference in "formatted capacity" and "maximum capacity".

I/O Bus
[Figure: the CPU chip (register file, ALU) connects via the system bus and I/O bridge to the memory bus and main memory; the I/O bridge also attaches to the I/O bus, which hosts a USB controller (mouse, keyboard), a graphics adapter (monitor), a disk controller (disk), and expansion slots for other devices such as network adapters]
Reading a Disk Sector (1)
¢ CPU initiates a disk read by writing a command, logical block number, and destination memory address to a port (address) associated with the disk controller.
[Figure: the command travels from the CPU over the I/O bus to the disk controller]

Reading a Disk Sector (2)
¢ Disk controller reads the sector and performs a direct memory access (DMA) transfer into main memory.
[Figure: data moves from the disk controller over the I/O bus and through the bus interface into main memory, without CPU involvement]
Reading a Disk Sector (3)
¢ When the DMA transfer completes, the disk controller notifies the CPU with an interrupt (i.e., asserts a special "interrupt" pin on the CPU).

Solid State Disks (SSDs)
[Figure: requests to read and write logical disk blocks arrive over the I/O bus at the SSD's flash translation layer, which maps them onto flash memory organized as blocks 0 to B-1, each containing pages 0 to P-1]
¢ Pages: 512 bytes to 4 KB; blocks: 32 to 128 pages
¢ Data read/written in units of pages.
¢ Page can be written only after its block has been erased.
¢ A block wears out after 100,000 repeated writes.
SSD Performance Characteristics

Sequential read tput   250 MB/s    Sequential write tput   170 MB/s
Random read tput       140 MB/s    Random write tput        14 MB/s
Random read access      30 us      Random write access     300 us

¢ Why are random writes so slow?
§ Erasing a block is slow (around 1 ms)
§ Write to a page triggers a copy of all useful pages in the block
– Find an unused block (new block) and erase it
– Write the page into the new block
– Copy other pages from old block to the new block

SSD Tradeoffs vs Rotating Disks
¢ Advantages
§ No moving parts → faster, less power, more rugged
¢ Disadvantages
§ Have the potential to wear out
– Mitigated by "wear leveling logic" in flash translation layer
– E.g. Intel X25 guarantees 1 petabyte (10^15 bytes) of random writes before they wear out
§ In 2010, about 100 times more expensive per byte
¢ Applications
§ MP3 players, smart phones, laptops
§ Beginning to appear in desktops and servers
Storage Trends

SRAM
Metric             1980    1985   1990  1995   2000    2005     2010       2010:1980
$/MB               19,200  2,900  320   256    100     75       60         320
access (ns)        300     150    35    15     3       2        1.5        200

DRAM
Metric             1980    1985   1990  1995   2000    2005     2010       2010:1980
$/MB               8,000   880    100   30     1       0.1      0.06       130,000
access (ns)        375     200    100   70     60      50       40         9
typical size (MB)  0.064   0.256  4     16     64      2,000    8,000      125,000

Disk
Metric             1980    1985   1990  1995   2000    2005     2010       2010:1980
$/MB               500     100    8     0.30   0.01    0.005    0.0003     1,600,000
access (ms)        87      75     28    10     8       4        3          29
typical size (MB)  1       10     160   1,000  20,000  160,000  1,500,000  1,500,000

CPU Clock Rates
Inflection point in computer history when designers hit the "Power Wall"

Metric                     1980   1990  1995     2000   2003  2005    2010     2010:1980
CPU                        8080   386   Pentium  P-III  P-4   Core 2  Core i7  ---
Clock rate (MHz)           1      20    150      600    3300  2000    2500     2500
Cycle time (ns)            1000   50    6        1.6    0.3   0.50    0.4      2500
Cores                      1      1     1        1      1     2       4        4
Effective cycle time (ns)  1000   50    6        1.6    0.3   0.25    0.1      10,000
The CPU-Memory Gap
The gap between DRAM, disk, and CPU speeds.
[Figure: log-scale plot of time (ns) vs. year, 1980-2010, showing disk seek time, SSD access time, flash access time, DRAM access time, SRAM access time, CPU cycle time, and effective CPU cycle time; the gap between disk/DRAM speeds and CPU speed widens over time]

Locality to the Rescue!
The key to bridging this CPU-Memory gap is a fundamental property of computer programs known as locality.
Today
¢ Storage technologies and trends
¢ Locality of reference
¢ Caching in the memory hierarchy

Locality
¢ Principle of Locality: Programs tend to use data and instructions with addresses near or equal to those they have used recently
¢ Temporal locality:
§ Recently referenced items are likely to be referenced again in the near future
¢ Spatial locality:
§ Items with nearby addresses tend to be referenced close together in time
Locality Example

sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
return sum;

¢ Data references
§ Reference array elements in succession (stride-1 reference pattern). Spatial locality
§ Reference variable sum each iteration. Temporal locality
¢ Instruction references
§ Reference instructions in sequence. Spatial locality
§ Cycle through loop repeatedly. Temporal locality

Qualitative Estimates of Locality
¢ Claim: Being able to look at code and get a qualitative sense of its locality is a key skill for a professional programmer.
¢ Question: Does this function have good locality with respect to array a?

int sum_array_rows(int a[M][N])
{
    int i, j, sum = 0;

    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
Locality Example
¢ Question: Does this function have good locality with respect to array a?

int sum_array_cols(int a[M][N])
{
    int i, j, sum = 0;

    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

Locality Example
¢ Question: Can you permute the loops so that the function scans the 3-d array a with a stride-1 reference pattern (and thus has good spatial locality)?

int sum_array_3d(int a[M][N][N])
{
    int i, j, k, sum = 0;

    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                sum += a[k][i][j];
    return sum;
}
Memory Hierarchies
¢ Some fundamental and enduring properties of hardware and software:
§ Fast storage technologies cost more per byte, have less capacity, and require more power (heat!).
§ The gap between CPU and main memory speed is widening.
§ Well-written programs tend to exhibit good locality.
¢ These fundamental properties complement each other beautifully.
¢ They suggest an approach for organizing memory and storage systems known as a memory hierarchy.

Today
¢ Storage technologies and trends
¢ Locality of reference
¢ Caching in the memory hierarchy
An Example Memory Hierarchy
Smaller, faster, and costlier per byte toward the top; larger, slower, and cheaper per byte toward the bottom:
L0: Registers — CPU registers hold words retrieved from L1 cache
L1: L1 cache (SRAM) — holds cache lines retrieved from L2 cache
L2: L2 cache (SRAM) — holds cache lines retrieved from main memory
L3: Main memory (DRAM) — holds disk blocks retrieved from local disks
L4: Local secondary storage (local disks) — local disks hold files retrieved from disks on remote network servers
L5: Remote secondary storage (tapes, distributed file systems, Web servers)

Caches
¢ Cache: A smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device.
¢ Fundamental idea of a memory hierarchy:
§ For each k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1.
¢ Why do memory hierarchies work?
§ Because of locality, programs tend to access the data at level k more often than they access the data at level k+1.
§ Thus, the storage at level k+1 can be slower, and thus larger and cheaper per bit.
¢ Big Idea: The memory hierarchy creates a large pool of storage that costs as much as the cheap storage near the bottom, but that serves data to programs at the rate of the fast storage near the top.
General Cache Concepts
[Figure: a smaller, faster, more expensive cache holds a subset of the blocks (8, 9, 14, 3) of a larger, slower, cheaper memory viewed as partitioned into blocks 0-15; data is copied between the levels in block-sized transfer units]

General Cache Concepts: Hit
[Figure: data in block 14 is needed; a request for block 14 arrives, and block 14 is in the cache: hit!]
General Cache Concepts: Miss
[Figure: data in block 12 is needed; a request for block 12 arrives, and block 12 is not in the cache: miss. Block 12 is fetched from memory and stored in the cache, replacing block 8.
• Placement policy: determines where the fetched block goes
• Replacement policy: determines which block gets evicted (victim)]

General Caching Concepts: Types of Cache Misses
¢ Cold (compulsory) miss
§ Cold misses occur because the cache is empty.
¢ Conflict miss
§ Most caches limit blocks at level k+1 to a small subset (sometimes a singleton) of the block positions at level k.
– E.g. Block i at level k+1 must be placed in block (i mod 4) at level k.
§ Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k block.
– E.g. Referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time.
¢ Capacity miss
§ Occurs when the set of active cache blocks (working set) is larger than the cache.
Examples of Caching in the Hierarchy

Cache Type            What is Cached?       Where is it Cached?  Latency (cycles)  Managed By
Registers             4-8 byte words        CPU core             0                 Compiler
TLB                   Address translations  On-Chip TLB          0                 Hardware
L1 cache              64-byte blocks        On-Chip L1           1                 Hardware
L2 cache              64-byte blocks        On/Off-Chip L2       10                Hardware
Virtual Memory        4-KB pages            Main memory          100               Hardware + OS
Buffer cache          Parts of files        Main memory          100               OS
Disk cache            Disk sectors          Disk controller      100,000           Disk firmware
Network buffer cache  Parts of files        Local disk           10,000,000        AFS/NFS client
Browser cache         Web pages             Local disk           10,000,000        Web browser
Web cache             Web pages             Remote server disks  1,000,000,000     Web proxy server

Summary
¢ The speed gap between CPU, memory and mass storage continues to widen.
¢ Well-written programs exhibit a property called locality.
¢ Memory hierarchies based on caching close the gap by exploiting locality.