Graduate Institute of Electronics Engineering, NTU

Memory Hierarchy

Lecturer: Chih-hao Chao Advisor: Prof. An-Yeu Wu Date: 2009.4.29 Wednesday

Adapted from Prof. Wu's Computer Architecture (計算機結構) Lecture Notes, ACCESS IC LAB

Outline

v Review of memory basics
v Memory hierarchy
v Cache overview
v Measuring and improving cache performance


Review of Memory Basics

Memory Classification & Metrics

Memory classification:
v Read-Write Memory, Random Access: SRAM, DRAM
v Read-Write Memory, Non-Random Access: FIFO, LIFO
v Non-Volatile Read-Write Memory: EPROM, EEPROM, FLASH
v Read-Only Memory: Mask-Programmed

v Key Design Metrics
1. Memory Density (number of bits/µm²) and Size
2. Access Time (time to read or write) and Throughput
3. Power Dissipation

Memory Array Architecture

Latch and Register Based Memory

[Figure: positive latch, negative latch, and a register-based memory.]

v Works fine for small memory blocks v Simple memory model, simple timing v Inefficient in area for large memories v Density is the key metric in large memory circuits

Static RAM (SRAM) Cell (6-T Cell)

v Logic state held by cross-coupled inverters (M1, M2; M3, M4) v Retains state as long as the power supply is on v Feedback must be overdriven to write into the memory

Dynamic RAM (DRAM) Cell

Write: set Bit Line (BL) to 0 or VDD & enable Word Line (WL)

Read: set Bit Line (BL) to 0 or VDD/2 & enable Word Line (WL)

v DRAM relies on charge stored in a capacitor to hold logic state v Used in all high-density memories (one transistor per bit) v Must be “refreshed” or state will be lost - high overhead

Interacting with a Memory Device

v Address pins drive row and column decoders
v Data pins are bidirectional and shared by reads and writes
v Output Enable gates the chip’s tristate driver
v Write Enable sets the memory’s read/write mode
v Chip Enable/Chip Select acts as a master switch

Asynchronous SRAM

v Basic memory, e.g., MCM6264C 8K x 8 SRAM (shown on the outside and on the inside)
v Bidirectional data bus for read/write
v Chip Enables (E1 and E2): E1=1'b0, E2=1'b1 to enable the chip
v Write Enable (W): active-low when chip is enabled
v Output Enable (G): active-low when chip is enabled


Asynchronous SRAM Read Operation

v Read cycle begins when all enable signals are active (E1,E2,G) v Data is valid after read access time v Data bus is tristated shortly after G or E1 goes high (inactive)

Address Controlled Reads

v Perform multiple reads without disabling the chip (G=1'b0) v Data bus follows the address bus after some delay v Note the bus enable time, access time, contamination time, and bus tristate time


Asynchronous SRAM Write Operation

v Data is latched when W or E1 goes high v Data must be stable at this time v Address must be stable before W goes low v Write waveforms are very important v Glitches on the address can cause a write to an unexpected address

Synchronous SRAM

v Uses synchronization registers to provide synchronous inputs and more reliable operation at high speed

Asynchronous DRAM Operation

v Usually the address is separated into a row address and a column address v Manipulation of RAS and CAS can provide efficient operating modes such as early-write, read-write, hidden-refresh, etc.

Key Messages on Memory Devices

v DRAM vs. SRAM
v SRAM holds state as long as the power supply is on; DRAM must be refreshed → results in complicated control
v DRAM has much higher density, but requires special capacitor technology

v Handling memory operations
v Primary inputs of a memory should be registered for synchronization and to reduce glitches
v It's a bad idea to enable two tri-state drivers on the bus at the same time
v An SRAM doesn't need to be refreshed while a DRAM does
v A synchronous memory can result in higher throughput


Memory Hierarchy

Where are we now?

Technology Trends

          Capacity         Speed (latency)
Logic:    2x in 3 years    2x in 3 years
DRAM:     4x in 3 years    2x in 10 years
Disk:     4x in 3 years    2x in 10 years

DRAM generations (capacity improved 1000:1 while cycle time improved only 2:1):

Year    Size      Cycle Time
1980    64 Kb     250 ns
1983    256 Kb    220 ns
1986    1 Mb      190 ns
1989    4 Mb      165 ns
1992    16 Mb     145 ns
1995    64 Mb     120 ns
1998    256 Mb    100 ns
2001    1 Gb      80 ns

Processor-Memory Latency Gap

[Figure: processor-memory performance gap, 1980-2000. CPU performance grows ~60%/yr (2x/1.5 yr, "Moore's Law") while DRAM latency improves only ~9%/yr (2x/10 yrs); the processor-memory performance gap grows about 50% per year.]

What Does the Gap Mean?

v We use the pipelined MIPS as an example:

Clock period will be bounded by Memory, not Logic!!

Deep Pipeline in Modern Desktop uP

Memory Access Pattern

v Model the memory access address and access time v Not fully random (e.g., uniformly distributed) v Usually has some pattern → here comes the chance!


Memory Hierarchy

v A memory hierarchy consists of multiple levels of memory with different speeds and sizes
v Guideline: build memory as a hierarchy of levels, with the fastest memory close to the processor and the slower, less expensive memory below that
v Goal: present the user with as much memory as is available in the cheapest technology, while providing access at the speed offered by the fastest memory
v Three major technologies are used to construct the memory hierarchy:

Memory technology    Typical access time          $ per GB in 2004
SRAM                 0.5 - 5 ns                   $4000 - $10000
DRAM                 50 - 70 ns                   $100 - $200
Magnetic disk        5,000,000 - 20,000,000 ns    $0.5 - $2

General Principles of Memory

v Definitions v Upper: memory closer to processor v Block: minimum unit that is present or not present v Block address: location of block in memory

v Locality + smaller HW is faster = memory hierarchy v Levels: each smaller, faster, more expensive/byte than the level below v Inclusive: data found in the upper level is also found in the lower level


Memory Hierarchy: How Does it Work?

v Temporal Locality (Locality in Time): => Keep most recently accessed data items closer to the processor v Spatial Locality (Locality in Space): => Move blocks consisting of contiguous words to the upper levels

[Figure: blocks move between the upper-level memory (Blk X) and the lower-level memory (Blk Y); data flows to and from the processor through the upper level.]

Memory Hierarchy: Terminology

v Hit: data appears in some block in the upper level (example: Block X)
v Hit Rate: the fraction of memory accesses found in the upper level
v Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
v Miss: data needs to be retrieved from a block in the lower level (Block Y)
v Miss Rate = 1 - (Hit Rate)
v Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
v Hit Time << Miss Penalty


Memory Hierarchy of a Modern Computer System

[Figure: the processor (control, datapath, registers, and on-chip cache) is backed by a second-level cache (SRAM), main memory (DRAM), secondary storage (disk), and tertiary storage (tape).

Speed (ns):   1s | 10s | 100s | 10,000,000s (10s ms) | 10,000,000,000s (10s sec)
Size (bytes): 100s | Ks | Ms | Gs | Ts]

A Typical Memory Hierarchy of a Modern Computer System


Cache Overview

Inside a Cache

The Basics of Cache

The Basics of Cache (2)

v Cache: a safe place for hiding or storing things.
v Example: before the request, the cache contains a collection of recent references X1, X2, ..., Xn-1, and the processor requests a word Xn that is not in the cache. This request results in a miss, and the word Xn is brought from memory into the cache. → Replacement Policy

[Figure: cache contents before and after the reference to Xn - after the miss, Xn joins X1 ... Xn-1 in the cache.]

Four Questions for Memory Hierarchy Designers

v Q1: Where can a block be placed in the upper level? (Block placement)

v Q2: How is a block found if it is in the upper level? (Block identification)

v Q3: Which block should be replaced on a miss? (Block replacement)

v Q4: What happens on a write? (Write strategy)

Q1: Where can a block be placed?

v Direct Mapped: Each block has only one place that it can appear in the cache. v Fully associative: Each block can be placed anywhere in the cache. v Set associative: Each block can be placed in a restricted set of places in the cache. v If there are n blocks in a set, the cache placement is called n-way set associative

v What is the associativity of a direct mapped cache?

Placement Policy


Associative structures

Q2: How Is a Block Found?

v The address can be divided into two main parts
v Block offset: selects the data from the block; offset size = log2(block size)
v Block address: tag + index
Ø index: selects the set in the cache; index size = log2(#blocks/associativity)
Ø tag: compared to the tag in the cache to determine a hit; tag size = address size - index size - offset size
v Each block has a valid bit that tells if the block is valid - the block is in the cache if the tags match and the valid bit is set.
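The tag/index/offset split above can be sketched in Python (a hypothetical helper, not from the slides; it follows the log2 field-width formulas given here and assumes power-of-two sizes):

```python
import math

def split_address(addr, block_size_bytes, num_blocks, associativity=1, addr_bits=32):
    """Split a byte address into (tag, index, offset) per the slide's formulas.

    offset size = log2(block size); index size = log2(#blocks / associativity);
    tag = the remaining high-order bits of the address.
    """
    offset_bits = int(math.log2(block_size_bytes))
    index_bits = int(math.log2(num_blocks // associativity))
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset
```

For a direct-mapped cache with 2^11 blocks of 32 bytes (the configuration used in a later example), this yields a 5-bit offset, 11-bit index, and 16-bit tag.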

Valid Bit of Cache

v Add a “valid bit” to indicate whether an entry contains a valid address.

v Replacement policy: recently accessed words replace less-recently referenced words (exploits temporal locality).

Direct Mapped Cache Architecture


2-Way Set-Associative Cache Architecture

Fully Associative Cache Architecture

Q3: Which Block Should be Replaced on a Miss?

v Easy for direct mapped - only one candidate
v Set associative or fully associative:
v Random - easier to implement
v Least Recently Used (LRU) - harder to implement
Ø true implementation only feasible for small sets (2-way)
Ø cache state must be updated on every access
v First-In, First-Out (FIFO) or Round-Robin
Ø usually used in highly associative sets
v Not Least Recently Used
Ø FIFO with an exception for the most recently used blocks

Example

v Miss rates for caches with different size, associativity and replacement algorithm.

Size      2-way LRU   2-way Random   4-way LRU   4-way Random   8-way LRU   8-way Random
16 KB     5.18%       5.69%          4.67%       5.29%          4.39%       4.96%
64 KB     1.88%       2.01%          1.54%       1.66%          1.39%       1.53%
256 KB    1.15%       1.17%          1.13%       1.13%          1.12%       1.12%

For caches with low miss rates, random is almost as good as LRU.

Q4: What Happens on a Write ($Hit)?

v Write through: the information is written to both the block in the cache and the block in the lower-level memory.
v Write back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
v Is the block clean or dirty? (add a dirty bit to each block)
v Pros and cons of each:
v Write through
Ø Easier to implement
Ø Always combined with write buffers to avoid memory latency
Ø A read miss will not result in writes to memory (for the replaced block)
v Write back
Ø Fewer memory accesses
Ø Performs writes at the speed of the cache

Q4: What Happens on a Write ($Miss)?

v Since data does not have to be brought into the cache on a write miss, there are two options:

v Write allocate Ø The block is brought into the cache on a write miss Ø Hope subsequent writes to the block hit in cache Ø Low miss rate, complex control, on write-back caches

v No-write allocate Ø The block is modified in memory, but not brought into the cache Ø Writes have to go to memory anyway, so no need to bring the block into the cache Ø High miss rate, simple control, on write-through caches

Calculating Bits in Cache

v How many total bits are needed for a direct-mapped cache with 64 KBytes of data and 8-word blocks, assuming a 32-bit address?

v 64 KBytes = 2^14 words = 2^14 / 8 = 2^11 blocks
v block size = 32 bytes → offset size = 5 bits
v #sets = #blocks = 2^11 → index size = 11 bits
v tag size = address size - index size - offset size = 32 - 11 - 5 = 16 bits
v bits/block = data bits + tag bits + valid bit = 8x32 + 16 + 1 = 273 bits
v bits in cache = #blocks x bits/block = 2^11 x 273 = 559,104 bits ≈ 68.25 KBytes
v Increasing block size => fewer bits in cache
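The bookkeeping above can be checked with a short Python function (an illustrative sketch; the parameter names are ours, and 4-byte words and power-of-two sizes are assumed):

```python
import math

def cache_bits(data_kbytes, block_words, associativity=1, addr_bits=32):
    """Total cache bits = #blocks x (data bits + tag bits + 1 valid bit)."""
    word_bytes = 4
    num_blocks = data_kbytes * 1024 // (block_words * word_bytes)
    offset_bits = int(math.log2(block_words * word_bytes))
    index_bits = int(math.log2(num_blocks // associativity))
    tag_bits = addr_bits - index_bits - offset_bits
    return num_blocks * (block_words * 32 + tag_bits + 1)

total = cache_bits(64, 8)   # 2^11 blocks x 273 bits = 559,104 bits (68.25 KBytes)
```

The variants on the next slide come out the same way: cache_bits(64, 1) = 802,816 bits (98 KBytes) and cache_bits(64, 1, associativity=4) = 835,584 bits (102 KBytes).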

Calculating Bits in Cache

v How many total bits are needed for a direct-mapped cache with 64 KBytes of data and one-word blocks, assuming a 32-bit address?
v 64 KBytes = 16K words = 2^14 words = 2^14 blocks
v block size = 4 bytes → offset size = 2 bits
v #sets = #blocks = 2^14 → index size = 14 bits
v tag size = address size - index size - offset size = 32 - 14 - 2 = 16 bits
v bits/block = data bits + tag bits + valid bit = 32 + 16 + 1 = 49
v bits in cache = #blocks x bits/block = 2^14 x 49 = 802,816 bits = 98 KBytes
v How many total bits would be needed for a 4-way set-associative cache to store the same amount of data?
v block size and #blocks do not change
v #sets = #blocks/4 = 2^14 / 4 = 2^12 → index size = 12 bits
v tag size = address size - index size - offset = 32 - 12 - 2 = 18 bits
v bits/block = data bits + tag bits + valid bit = 32 + 18 + 1 = 51
v bits in cache = #blocks x bits/block = 2^14 x 51 = 835,584 bits = 102 KBytes
v Increasing associativity → more bits in cache


Measuring and Improving Cache Performance

Measuring Cache Performance

CPU time = (CPU execution clock cycles + Memory-stall clock cycles) * Clock cycle time

Memory-stall clock cycles come primarily from cache misses:

Memory-stall clock cycles = Read-stall cycles + Write-stall cycles
Read-stall cycles = (reads/program) * read miss rate * read miss penalty
Write-stall cycles = ((writes/program) * write miss rate * write miss penalty) + write buffer stalls

v For a write-back scheme:
v There are potential additional stalls arising from the need to write a cache block back to memory when the block is replaced.
v For a write-through scheme:
v A write miss requires that we fetch the block before continuing the write.
v Write buffer stalls: occur when the write buffer is full as a write arrives.


v In most write-through schemes, we assume:
v Read and write miss penalties are the same.
v Write buffer stalls are negligible.

(1) Memory-stall clock cycles = (Memory accesses/Program) * Miss rate * Miss penalty

(2) Memory-stall clock cycles = (Instructions/Program) * (Misses/Instruction) * Miss penalty


Calculating Cache Performance

v (question) How much faster would a processor run with a perfect cache that never missed?
v (answer) Assumptions:
v Instruction cache miss rate = 2%
v Data cache miss rate = 4%
v CPI = 2 without any memory stalls
v Miss penalty = 100 cycles for all misses
v Use the instruction frequencies for SPECint2000 from Chapter 3, Fig. 3.26 on page 228 (loads and stores are 36% of instructions)
v Instruction count = I


Calculating cache performance

Instruction miss cycles = I * 2% * 100 = 2.00 I
Data miss cycles = I * 36% * 4% * 100 = 1.44 I
Total memory-stall cycles = 2.00 I + 1.44 I = 3.44 I
The CPI with memory stalls is 2 + 3.44 = 5.44
Since there is no change in instruction count or clock rate, the ratio of the CPU execution times is:

CPU time with stalls / CPU time with perfect cache
= (I * CPI_stall * clock cycle) / (I * CPI_perfect * clock cycle)
= CPI_stall / CPI_perfect = 5.44 / 2 = 2.72
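The stall arithmetic above fits in a few lines of Python (a sketch, not from the slides; the 36% memory-reference frequency is the SPECint2000 figure quoted in the assumptions):

```python
def cpi_with_stalls(base_cpi, i_miss_rate, d_miss_rate, mem_refs_per_instr, miss_penalty):
    """CPI including instruction- and data-cache stall cycles per instruction."""
    i_stall = i_miss_rate * miss_penalty                       # 2% * 100 = 2.00
    d_stall = mem_refs_per_instr * d_miss_rate * miss_penalty  # 36% * 4% * 100 = 1.44
    return base_cpi + i_stall + d_stall

cpi = cpi_with_stalls(2.0, 0.02, 0.04, 0.36, 100)  # 5.44
speedup_of_perfect_cache = cpi / 2.0               # 2.72
```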


Calculating Cache Performance

v Example: suppose we speed up the computer in the previous example by reducing the CPI from 2 to 1.

v The system with cache misses has a CPI of 1 + 3.44 = 4.44.
v The system with the perfect cache is 4.44 / 1 = 4.44 times faster.
v The amount of execution time spent on memory stalls rises from 3.44 / 5.44 = 63% to 3.44 / 4.44 = 77%.

Calculating Cache Performance

v Example: cache performance with increased clock rate
(question) Suppose we double the clock rate. How much faster will the computer be with the faster clock, assuming the same miss rates as the previous example?
(answer)
v The new miss penalty = 200 clock cycles
v Total miss cycles per instruction = (2% * 200) + 36% * (4% * 200) = 6.88
v CPI = 2 + 6.88 = 8.88
v The relative performance = (performance with fast clock) / (performance with slow clock)
  = (IC * CPI_slow * clock cycle) / (IC * CPI_fast * (clock cycle / 2))
  = 5.44 / (8.88 / 2) = 1.23

The computer with the faster clock is about 1.2 times faster rather than 2 times faster, which it would have been if we ignored cache misses.

Reducing Cache Misses by More Flexible Placement of Blocks

v (1) Direct-mapped cache: a block can go in exactly one place in the cache.
v (2) Fully associative cache: a cache structure in which a block can be placed in any location in the cache.
v (3) Set-associative cache: a cache that has a fixed number of locations (at least two) where each block can be placed.

v In a direct-mapped cache, the position of a memory block is given by (block number) modulo (number of cache blocks)
v In a set-associative cache, the set containing a memory block is given by (block number) modulo (number of sets in the cache)

Block-Size Trade-Off

[Figure: miss rate vs. block size (bytes).]

Increasing block size tends to decrease miss rate.

Block-Size Trade-Off

v Larger block size → spatial locality (good) → reduces miss ratio
v But larger miss penalty
v Block too large → fewer blocks in the cache → miss rate goes up
v Avg access time = hit time * (1 - miss rate) + miss penalty * miss rate

Increase Memory Bandwidth to Reduce Miss Penalty

Misses and Associativity in Caches

v Example:
v Assume there are three small caches, each consisting of four one-word blocks: fully associative, two-way set associative, and direct-mapped. Find the number of misses for each organization given the following sequence of block addresses: 0, 8, 0, 6, 8.

Misses in Direct-Mapped Cache

Sequence of block addresses: 0, 8, 0, 6, 8.

Block address    Cache block
0                (0 modulo 4) = 0
6                (6 modulo 4) = 2
8                (8 modulo 4) = 0

Address of memory    Hit or    Contents of cache blocks after reference
block accessed       miss      Block 0      Block 1   Block 2      Block 3
0                    miss      Memory[0]
8                    miss      Memory[8]
0                    miss      Memory[0]
6                    miss      Memory[0]              Memory[6]
8                    miss      Memory[8]              Memory[6]

The direct-mapped cache generates 5 misses for the five accesses.


Misses in 2-Way Set-Associative Cache

Block address    Cache set
0                (0 modulo 2) = 0
6                (6 modulo 2) = 0
8                (8 modulo 2) = 0

Address of memory    Hit or    Contents of cache blocks after reference
block accessed       miss      Set 0        Set 0        Set 1   Set 1
0                    miss      Memory[0]
8                    miss      Memory[0]    Memory[8]
0                    hit       Memory[0]    Memory[8]
6                    miss      Memory[0]    Memory[6]
8                    miss      Memory[8]    Memory[6]

The two-way set-associative cache has 4 misses.

Misses in Fully Associative Cache

Address of memory    Hit or    Contents of cache blocks after reference
block accessed       miss      Block 0      Block 1      Block 2
0                    miss      Memory[0]
8                    miss      Memory[0]    Memory[8]
0                    hit       Memory[0]    Memory[8]
6                    miss      Memory[0]    Memory[8]    Memory[6]
8                    hit       Memory[0]    Memory[8]    Memory[6]

The fully associative cache only has 3 misses: the best one
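The three traces above can be reproduced with a small LRU cache simulator (an illustrative sketch, not from the slides; it models four one-word blocks at each associativity):

```python
def count_misses(block_addrs, num_blocks, associativity):
    """Count misses for an LRU cache of `num_blocks` one-word blocks."""
    num_sets = num_blocks // associativity
    sets = [[] for _ in range(num_sets)]      # each set ordered from LRU to MRU
    misses = 0
    for addr in block_addrs:
        s = sets[addr % num_sets]             # (block number) modulo (number of sets)
        if addr in s:
            s.remove(addr)                    # hit: refresh recency below
        else:
            misses += 1
            if len(s) >= associativity:
                s.pop(0)                      # evict the least recently used block
        s.append(addr)                        # mark as most recently used
    return misses

seq = [0, 8, 0, 6, 8]
# count_misses(seq, 4, 1) -> 5 (direct-mapped)
# count_misses(seq, 4, 2) -> 4 (two-way set associative)
# count_misses(seq, 4, 4) -> 3 (fully associative)
```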

Four-Way Set-Associative Cache


Size of Tags vs. Set Associativity

v Question:
v Assume a cache of 4K blocks, a four-word block size, and a 32-bit address. Find the total number of sets and the total number of tag bits for caches that are direct-mapped, two-way and four-way set associative, and fully associative.
v Answer:
v Direct-mapped:
Ø The bits for index and tag = 32 - 4 = 28 (block offset = 4 bits for 16-byte blocks)
Ø The number of sets = the number of blocks = 4K
Ø The bits for index = log2(4K) = 12
Ø The total number of tag bits = (28 - 12) * 4K = 64K bits


v Two-way set associative:
Ø The number of sets = (the number of blocks) / 2 = 2K
Ø The total number of tag bits = (28 - 11) * 2 * 2K = 68K bits

v Four-way set associative: Ø The number of sets = (the number of blocks) / 4 = 1K Ø The total number of tag bits = (28 - 10) * 4 * 1K = 72K bits

v Fully associative:
Ø The number of sets = 1
Ø The total number of tag bits = 28 * 4K * 1 = 112K bits
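The tag-bit totals can be checked the same way (a sketch; it assumes a power-of-two number of sets and the 28 block-address bits from the question):

```python
import math

def total_tag_bits(num_blocks, associativity, block_addr_bits=28):
    """Tag bits = (block address bits - index bits) * associativity * #sets."""
    sets = num_blocks // associativity
    index_bits = int(math.log2(sets))   # 0 when fully associative (one set)
    return (block_addr_bits - index_bits) * associativity * sets
```

For the 4K-block example: total_tag_bits(4096, 1) = 64K bits, (4096, 2) = 68K, (4096, 4) = 72K, and (4096, 4096) = 112K, matching the slide.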

v Least recently used (LRU): A replacement scheme in which the block replaced is the one that has been unused for the longest time.

Multilevel Cache

v Multilevel cache: a memory hierarchy with multiple levels of caches, rather than just a cache and main memory.
v Example:
v Suppose we have a processor with a base CPI of 1.0, assuming all references hit in the primary cache, and a clock rate of 5 GHz.
v Assume a main memory access time of 100 ns, including all the miss handling. Suppose the miss rate per instruction at the primary cache is 2%.
v How much faster will the processor be if we add a secondary cache that has a 5 ns access time for either a hit or a miss, and is large enough to reduce the miss rate to main memory to 0.5%?

Multilevel Cache (cont'd)

v For the processor with one level of cache:
The miss penalty to main memory = 100 ns / 0.2 ns (1 / 5 GHz) = 500 clock cycles.
Total CPI = base CPI + memory-stall cycles per instruction = 1.0 + 2% * 500 = 11.0

v For the processor with two levels of cache:
The miss penalty for an access satisfied by the second-level cache = 5 ns / 0.2 ns = 25 clock cycles.
Total CPI = base CPI + primary stalls per instruction + secondary stalls per instruction = 1.0 + 2% * 25 + 0.5% * 500 = 4.0
v The processor with the secondary cache is faster by 11.0 / 4.0 ≈ 2.8
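The two CPI computations can be sketched as follows (illustrative Python; the cycle counts follow the 5 GHz / 0.2 ns clock of the example, and the function name is ours):

```python
def total_cpi(base_cpi, l1_miss_rate, l1_miss_penalty,
              l2_global_miss_rate=0.0, l2_miss_penalty=0.0):
    """Base CPI + primary stalls + secondary stalls, per instruction."""
    return (base_cpi
            + l1_miss_rate * l1_miss_penalty
            + l2_global_miss_rate * l2_miss_penalty)

one_level = total_cpi(1.0, 0.02, 500)             # 1.0 + 10.0 = 11.0
two_level = total_cpi(1.0, 0.02, 25, 0.005, 500)  # 1.0 + 0.5 + 2.5 = 4.0
speedup = one_level / two_level                   # 2.75, which the slide rounds to 2.8
```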

Embed Cache Into Pipelined MIPS
