The Memory Hierarchy
CS 135: Computer Architecture
Instructor: Prof. Bhagi Narahari
Dept. of Computer Science
Course URL: www.seas.gwu.edu/~narahari/cs135/

Brief Overview of Memory Design
- What is the memory hierarchy? What are its components?
- Focus on cache design: how does cache memory work? How are addresses mapped to the cache?
- How to rewrite code to get better cache performance (code optimization)?
- How do disks work?
- Virtual memory: what is it?
CS 135: Computer Architecture, Bhagi Narahari
A Computer
- The computer is composed of input devices, a central processing unit, a memory unit, auxiliary storage devices, and output devices.
- The CPU has two components:
  - Arithmetic and Logic Unit (ALU): performs arithmetic operations and logical operations.
  - Control Unit: controls the actions of the other computer components so that instructions are executed in the correct sequence.
- Memory contains instructions and data.

Computer Organization - Recap
- The CPU requests data, and the data is fetched from memory.
- How long does it take to fetch from memory?
CPU + Memory
- Assumption thus far: the processor/CPU gets data or an instruction from some memory address (instruction fetch, or a Load/Store instruction).
- Example: PC = 200; the instruction at address 200, ADD r5,r1,r3, is fetched into the IR.
- But how is memory actually organized? Can everything we need fit into a memory that is close to the CPU?

Memory Unit
- An ordered sequence of storage cells, each capable of holding a piece of data.
- Address space: the size of memory. An N-bit address space = 2^N memory locations.
- Addressability: the size of each memory location, k bits.
- Total memory size = k * 2^N bits.
Where does the run-time stack fit into this?
- We looked at how variables in C are allocated memory addresses.
- Each function has an activation record; the compiler takes care of allocating memory addresses to variables.
- The question now is: where are these memory addresses, and how long does it take to fetch the contents into a processor register?
- LD R0, A: we know the address of A, but how long will it take to go to memory and fetch it into register R0?

Memory Technology
- Random access memory: can read from any location by supplying the address of the data.
- Memory comes in many flavors. Key features: RAM is packaged as a chip; the basic storage unit is a cell (one bit per cell); multiple RAM chips form a memory.
- Main RAM memory: SRAM (Static Random Access Memory) or DRAM (Dynamic Random Access Memory).
- ROM, EPROM, EEPROM, Flash, etc.: read-only memories, used for example to store the OS.
- "Secondary memory": disks, tapes, etc.
- These differ in speed, price, and "size": fast is small and/or expensive; large is slow and/or cheap.
How is memory really organized?
- There are many types of memory with different speeds.
- Processor speed and memory speed are mismatched.
- Data (instructions or data) is transferred between memory and the processor.
- What does the processor do while waiting for data to be transferred? It idles: the processor is stalled, leading to a slowdown and lower performance.
- Why can't we have memory as fast as the processor? Technology, cost, size.
- What is the solution then?

Memory Hierarchies
- Some fundamental and enduring properties of hardware and software:
  - Fast storage technologies cost more per byte and have less capacity.
  - The gap between CPU and main memory speed is widening.
  - Well-written programs tend to exhibit good locality.
- These fundamental properties complement each other beautifully. They suggest an approach for organizing memory and storage systems known as a memory hierarchy.
An Example Memory Hierarchy
Smaller, faster, and costlier (per byte) storage devices sit at the top; larger, slower, and cheaper (per byte) storage devices sit at the bottom:
- L0: CPU registers hold words retrieved from the L1 cache.
- L1: on-chip L1 cache (SRAM) holds cache lines retrieved from the L2 cache.
- L2: off-chip L2 cache (SRAM) holds cache lines retrieved from main memory.
- L3: main memory (DRAM) holds disk blocks retrieved from local disks.
- L4: local secondary storage (local disks) holds files retrieved from disks on remote network servers.
- L5: remote secondary storage (distributed file systems, Web servers).
Analogy: items on a desktop (registers) or in a drawer (cache) are more readily accessible than those in a file cabinet (main memory) or in a closet in another room (disk). Keep the most frequently accessed items on the desktop, the next most frequent in the drawer, and the things you need far less often in the closet in another room!
Memory Hierarchies
- Key principles:
  - Locality: most programs do not access code or data uniformly.
  - Smaller hardware is faster.
- Goal: design a memory hierarchy "with cost almost as low as the cheapest level of the hierarchy and speed almost as fast as the fastest level."
- This implies being clever about keeping the data most likely to be used as "close" to the CPU as possible.
- Levels provide subsets: anything (data) found in a particular level is also found in the next level below. Each level maps from a slower, larger memory to a smaller but faster memory.

Locality
- Principle of Locality: programs tend to reuse data and instructions near those they have used recently, or that were recently referenced themselves.
- Temporal locality: recently referenced items are likely to be referenced in the near future.
- Spatial locality: items with nearby addresses tend to be referenced close together in time.
Locality Example
    sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;
- Data:
  - Referencing array elements in succession (a stride-1 reference pattern): spatial locality.
  - Referencing sum each iteration: temporal locality.
- Instructions:
  - Referencing instructions in sequence: spatial locality.
  - Cycling through the loop repeatedly: temporal locality.

Memory Hierarchy: The Tradeoff
CPU registers <-> cache <-> memory <-> disk, with transfers of roughly 8 B between registers and cache, 16 B between cache and memory, and 4 KB between memory and disk (virtual memory).

                register    L1-cache    L2-cache       memory     disk memory
                reference   reference   reference      reference  reference
    size:       608 B       128 kB      512 kB - 4 MB  128 MB     27 GB
    speed:      1.4 ns      4.2 ns      16.8 ns        112 ns     9 ms
    $/Mbyte:                            $90/MB         $2-6/MB    $0.01/MB
    block size: 4 B         4 B         16 B           4-8 KB

Larger, slower, cheaper as you move right. (Numbers are for an Alpha 21264 at 700 MHz.)
Computer Organization with a Memory System
- Processor (with interrupts) connects to the cache; the cache connects to the memory-I/O bus.
- On the bus: a memory controller for main memory, and I/O controllers for disks, display, and network.

Memory Hierarchy: Key Concepts
We shall return to a detailed discussion of the various components of the memory hierarchy:
- What is a cache memory?
- How do memory access and disk access work?
- How should we organize the memory hierarchy?
- How can we rewrite our code to improve performance, based on memory access, type of instructions, etc.?
A Simple Model of Memory
- Memory is a sequence of addresses. How many?
- The CPU generates a request for a memory location, i.e., an address.
- How long does it take to get this data? It depends where it is in the memory hierarchy.
- Simplified model of the memory hierarchy:
  - a small amount of on-chip cache memory,
  - a larger amount of off-chip main memory,
  - a huge disk.

Memory Access Times
- On-chip cache takes 1 processor cycle.
- Main memory takes a number (10-50) of processor cycles.
- Disk takes a huge amount of time.
- Simple model we will use for now: Memory = Cache + Main memory.
- The small cache size means not everything fits in it.
- Simplified cache organization:
  - The cache consists of a set of blocks, each of some number of bytes.
  - Only a whole block can be fetched into and out of the cache.
  - E.g., if the block size is 16 bytes, then 16 bytes are loaded into the cache; you cannot load a single byte.
Memory Access Times Using the Simplified Model
- If data is found in the cache, then time = 1 cycle. This is called a cache hit.
- Else the time is the main memory access time. A cache miss means reading from the next level.
- Note: we need a 'control unit' to determine if a location is in the cache or not: the cache controller.
- Why does the concept of caching work? The Principle of Locality: programs access data nearby, or data/instructions that were used recently.

Summary: Memory Access Time Optimization
- If each access to memory leads to a cache hit, then the time to fetch from memory is one cycle. Program performance is good!
- If each access to memory leads to a cache miss, then the time to fetch from memory is much larger than one cycle. Program performance is bad!
- Design goal: arrange data/instructions so that we have as few cache misses as possible.
Some Details of Memory Organization

Random-Access Memory (RAM)
- Static RAM (SRAM):
  - This is what we saw before in the example of a memory built from latches (transistors).
  - Like a register: once written, it keeps its value until the next write.
  - Retains its value indefinitely, as long as it is kept powered.
  - Relatively insensitive to disturbances such as electrical noise.
  - Faster and more expensive than DRAM.
- Dynamic RAM (DRAM):
  - Each cell stores a bit with a capacitor and a transistor.
  - The value must be refreshed every 10-100 ms or the data is lost.
  - Sensitive to disturbances.
  - Slower and cheaper than SRAM.
  - Many types of DRAMs: SDRAM, DDR-DRAM, Video DRAM, etc.
Nonvolatile Memories
- DRAM and SRAM are volatile memories: they lose information if powered off.
- Nonvolatile memories retain their value even if powered off.
- The generic name is read-only memory (ROM). This is misleading because some ROMs can be read and modified.
- Types of ROMs: Programmable ROM (PROM), Erasable Programmable ROM (EPROM), Electrically Erasable PROM (EEPROM), Flash memory.
- Firmware: a program stored in a ROM. Boot-time code, BIOS (basic input/output system), graphics cards, disk controllers.

SRAM vs. DRAM Summary

           Transistors  Access                             Relative
           per bit      time   Persistent?  Sensitive?  cost      Applications
    SRAM   6            1X     Yes          No          100X      Cache memories
    DRAM   1            10X    No           Yes         1X        Main memories, frame buffers
Building the Memory Memory Modules for 64 bit Processor
addr (row = i, col = j) Build large memory using several : supercell (i,j)
smaller memory chips DRAM 0 How to organize a large memory using 64 MB memory module several “modules” consisting of DRAM 7 CPU generates an address request eight 8Mx8 DRAMs The address can be in any module
Need to figure out which one bits bits bits bits bits bits bits bits 56-63 48-55 40-47 32-39 24-31 16-23 8-15 0-7
6356 55 48 47 40 39 32 3124 23 16 15 8 7 0 Memory controller 64-bit doubleword at main memory address A
64-bit doubleword
Typical Bus Structure Connecting CPU and Memory
- A bus is a collection of parallel wires that carry address, data, and control signals.
- Buses are typically shared by multiple devices.
- The CPU chip contains the register file, the ALU, and a bus interface; a system bus connects the bus interface to an I/O bridge, and a memory bus connects the I/O bridge to main memory.

Memory Read Transaction (1)
- Load operation: Load R0, A (or in Intel: movl A, R0).
- The CPU places address A on the memory bus.
Memory Read Transaction (2)
- Load operation: Load R0, A (or in Intel: movl A, R0).
- Main memory reads A from the memory bus, retrieves word x, and places it on the bus.

Memory Read Transaction (3)
- The CPU reads word x from the bus and copies it into register R0.
Memory Write Transaction (1)
- Store operation: STORE R1, A (or in Intel: movl R1, A).
- The CPU places address A on the bus. Main memory reads it and waits for the corresponding data word to arrive.

Memory Write Transaction (2)
- The CPU places data word y on the bus.
Memory Write Transaction (3)
- Main memory reads data word y from the bus and stores it at address A.
Memory Access Time and Performance
- CPU time = Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)
- i.e., CPU time = IC * CPI * Clock cycle time.
- How does memory access time fit into the CPU time equation?

The CPU-Memory Gap
The gap between DRAM, disk, and CPU speeds is increasing. (A log-scale plot of access time in ns from 1980 to 2000 shows disk seek time, DRAM access time, SRAM access time, and CPU cycle time diverging, with CPU cycle time shrinking far faster than memory and disk access times.)
Performance
- Simplified model: the processor is either (1) executing or (2) waiting for memory.
- The "effective" (real) CPI increases:
  execution time = (execution cycles + memory stall cycles) x cycle time
- Improving performance = decreasing stall cycles, i.e., decreasing the time to access memory. How?

Memory Hierarchies
- Organize memory as a memory hierarchy.
- Key principles: locality (most programs do not access code or data uniformly); smaller hardware is faster.
- Goal: design a memory hierarchy "with cost almost as low as the cheapest level of the hierarchy and speed almost as fast as the fastest level."
- This implies being clever about keeping the data most likely to be used as "close" to the CPU as possible.
- Levels provide subsets: anything (data) found in a particular level is also found in the next level below; each level maps from a slower, larger memory to a smaller but faster memory.
Memory Hierarchy: The Tradeoff / An Example Memory Hierarchy
(These two slides repeat the tradeoff table and the memory-hierarchy pyramid, L0 registers through L5 remote secondary storage, shown earlier.)
Recap: Locality
- Principle of Locality: programs tend to reuse data and instructions near those they have used recently, or that were recently referenced themselves.
- Temporal locality: recently referenced items are likely to be referenced in the near future.
- Spatial locality: items with nearby addresses tend to be referenced close together in time.
- Locality example:
      sum = 0;
      for (i = 0; i < n; i++)
          sum += a[i];
      return sum;
  - Data: referencing array elements in succession gives spatial locality; referencing sum each iteration gives temporal locality.
  - Instructions: referencing instructions in sequence gives spatial locality; cycling through the loop repeatedly gives temporal locality.

Cache Memory
- Why?
- How does it work?
- How is the memory organized?
Why Cache?
- There is a gap between main memory speed and processor speed: reading data/instructions from memory takes more than one processor cycle, which increases the time to execute a program.
- Solution: place a small but fast memory close to the processor.
- Why does this work? The Principle of Locality.

Simple Model of the Memory Hierarchy
- Memory is a sequence of addresses; the CPU generates a request for a memory location, i.e., an address.
- How long does it take to get this data? It depends where it is in the memory hierarchy.
- Simplified model: a small amount of fast on-chip cache memory, a larger amount of off-chip main memory, and a huge disk.
How Does Cache Memory Work?
- Address space = 2^N words, each of some size k bits.
- With an N-bit address, memory addresses go from 0 to 2^N - 1.
- These are the addresses the processor requests in Load or Store instructions, or when fetching a new instruction (the value in the PC).
- Some of these memory locations are placed in the cache. If you see one in the cache, you don't need to go all the way to memory to read it: a faster time to read/write instructions and data!

Memory Access Times
- On-chip cache takes 1 processor cycle; main memory takes a number (10-50) of processor cycles; disk takes a huge amount of time.
- Simple model we will use: Memory = Cache + Main memory. The small cache size means not everything fits in it.
- Cache organization: the cache consists of a set of blocks, each of some number of bytes. Only a whole block can be fetched into and out of the cache; e.g., if the block size is 16 bytes, then 16 bytes are loaded into the cache; you cannot load a single byte.
Memory Access Times Using the Simplified Model
- If data is found in the cache, then time = 1: a cache hit.
- Else the time is the main memory access time: a cache miss means reading from the next level.
- Note: we need a 'control unit' (the cache controller) to determine if a location is in the cache or not.
- Why does the concept of caching work? The Principle of Locality: programs access data nearby, or data/instructions that were used recently.

Terminology Summary
- Hit: the data appears in some block in the upper level (e.g., block X in the cache).
  - Hit rate: the fraction of memory accesses found in the upper level.
  - Hit time: the time to access the upper level, which consists of the RAM access time plus the time to determine hit/miss.
- Miss: the data must be retrieved from a block in the lower level (e.g., block Y in memory).
  - Miss rate = 1 - (hit rate).
  - Miss penalty: the extra time to replace a block in the upper level plus the time to deliver the block to the processor.
- Hit time << miss penalty (the penalty can be on the order of 500 instructions on an Alpha 21264).
- (Diagram: the processor exchanges block X with the upper-level memory, which exchanges block Y with the lower-level memory.)
Memory Hierarchy: Performance
- Placing the fastest memory near the CPU can result in increases in performance.
- Consider the number of cycles the CPU is stalled waiting for a memory access: the memory stall cycles.
  - CPU execution time = (CPU clock cycles + memory stall cycles) * clock cycle time
  - Memory stall cycles = number of misses * miss penalty
- Fewer misses in the cache = better performance!

Cache Performance: Simplified Models
- Memory stall cycles = number of misses * miss penalty
  = IC * (memory accesses per instruction) * miss rate * miss penalty
- Hit rate/ratio r = (number of requests that are hits) / (total number of requests); miss rate = 1 - (hit rate).
- Cost of a memory access = r*Ch + (1-r)*Cm, where Ch is the cost/time of a cache hit and Cm is the cost/time on a miss (fetch from memory).
- Extend to multiple levels of cache, with hit ratios and access times for level 1, level 2, etc.:
  r1*Ch1 + r2*Ch2 + (1 - r1 - r2)*Cm
So How Well Does Your Memory Perform?
- What is the average time to access memory?
- The memory design gives us the hit time and the miss penalty; an application program has a miss rate.
- So what is the average memory access time (AMAT)?
  - Hit time, if found in the cache.
  - Hit time + miss penalty, if not in the cache.

Average Memory Access Time
- AMAT = HitTime + (1 - h) x MissPenalty
- Hit time: the basic time of every access.
- Hit rate (h): the fraction of accesses that hit; (1 - h) is usually substituted by the miss rate m.
- Miss penalty: the extra time to fetch a block from the lower level, including the time to replace it in the CPU.
Example 1
- System = processor, cache, main memory.
- Cache access time = 1 processor cycle; memory access time = 50 processor cycles.
- Suppose that out of 1000 memory accesses (due to Load/Store and instruction fetch), 40 miss in the cache and 960 hit in the cache.
- Miss ratio = 40/1000 = 4%.
- What is the average memory access time with and without the cache?
  - AMAT with cache = 1 + miss ratio * miss penalty = 1 + (0.04)*50 = 3 cycles.
  - AMAT without cache = 50 cycles.
- What happens if the miss ratio increases?
Example 2
- In reality we have an application with an instruction mix.
- CPU time = IC * CPI * Clock cycle time; effective CPI = CPI + average stalls per instruction.
- How many memory accesses per instruction? When is memory accessed? Memory is accessed once for each instruction (the fetch) PLUS once again for each Load/Store instruction.
- Example: suppose your program has 20% Load/Store operations. Out of 100 instructions, we will have 120 accesses to memory.
- Average memory accesses per instruction = 120/100 = 1.2.
- How many stall cycles (waiting for memory)? With the 4% miss rate and 50-cycle penalty from Example 1:
  1.2 * 0.04 * 50 = 1.2 * 2 = 2.4
- The processor is stalled 2.4 cycles per instruction.
Example 2: Complete Example
- Assume the "ideal" CPI (without memory stalls) is 1.5. What is the CPU time now?
- CPU time = IC * CPI * Clock cycle time; IC * CPI is the number of CPU cycles, where CPI is the number of cycles for CPU execution per instruction.
- Each instruction stalls for some number of cycles:
  CPU time = IC * (CPI + average stall cycles) * clock cycle time
           = IC * (1.5 + 2.4) * clock = IC * 3.9 clock cycles.

Cache Memory Hardware Design
- Main memory has 2^N locations; the cache has 2^k locations, smaller than main memory.
- How do we "organize" these cache locations?
- The processor generates an N-bit address. How do we look at this N-bit address and decide (a) if it is in the cache and (b) where to place it in the cache?
Cache Design Questions
- Q1: Where can a block be placed in the upper level? (block placement)
- Q2: How is a block found if it is in the upper level? (block identification: use the address bits to identify the block)
- Q3: Which block should be replaced on a miss? (block replacement)
- Q4: What happens on a write? (write strategy)

Unified or Separate I-Cache and D-Cache
- Two types of accesses: instruction fetch, and data fetch (load/store instructions).
- Unified cache: one large cache for both instructions and data.
  - Pros: simpler to manage, less hardware complexity.
  - Cons: how to divide the cache between data and instructions? It confuses the standard Harvard architecture model, and optimizations are difficult.
- Separate instruction and data caches: instruction fetches go to the I-cache; data accesses from loads/stores go to the D-cache.
  - Pros: easier to optimize each.
  - Cons: more expensive; how do we decide on the size of each?
Definitions
- The cache has a total size: the number of bytes in the cache.
- Transfers take place in blocks: a whole block is transferred between memory and cache.
- Locating a block requires two attributes: the size of the block, and the organization of blocks within the cache.
- Block size (also referred to as line size):
  - The granularity at which the cache operates.
  - Each block is a contiguous series of bytes in memory and begins on a naturally aligned boundary.
  - E.g., in a cache with 16-byte blocks, each block contains 16 bytes, with the first byte aligned to a 16-byte boundary in the address space: the low-order 4 bits of the address of the first byte are 0000.
  - The smallest usable block size is the natural word size of the processor; anything smaller would require splitting an access across blocks, which slows down translation.

Memory Viewed as Blocks
- If the cache block size is K bytes, then memory can be viewed as a contiguous set of blocks, each of size K.
- E.g., a 16-byte memory with 4-byte blocks is viewed as 4 blocks of memory.
Where Can a Block Be Placed in a Cache? Cache Organization
- If each block has only one place it can appear in the cache, the cache is said to be "direct mapped," and the mapping is usually:
  (Block address) MOD (Number of blocks in the cache)
- If a block can be placed anywhere in the cache, the cache is said to be fully associative.
- If a block can be placed in a restricted set of places in the cache, the cache is set associative. A set is a group of blocks in the cache. A block is first mapped onto a set, and then the block can be placed anywhere within that set:
  (Block address) MOD (Number of sets in the cache)
  If there are n blocks in a set, the cache is called n-way set associative.
- Example: an 8-block cache and memory block 12:
  - Fully associative: block 12 can go anywhere.
  - Direct mapped: block 12 can go only into cache block 4 (12 mod 8).
  - 2-way set associative (sets 0-3): block 12 can go anywhere in set 0 (12 mod 4).
Cache Organizations
- Direct mapped vs. fully associative:
  - Direct mapped is not flexible enough: if X mod K = Y mod K, then X and Y cannot both be located in the cache.
  - Fully associative allows any mapping, which implies all locations must be searched to find the right one: expensive hardware.
- Set associative: a compromise between direct mapped and fully associative.
  - Allow many-to-few mappings.
  - On lookup, a subset of the address bits is used to generate an index, BUT the index now corresponds to a set of entries that can be searched in parallel: a more efficient hardware implementation than fully associative, but due to the flexible mapping it behaves more like fully associative.

Cache Design: How Is Data Found in the Cache?
- The processor generates an address request: some N-bit address.
- Two issues: How do we know if a data item is in the cache? If it is, how do we find it, i.e., in which cache block do we look?
- "Direct mapped": for each item of data at the lower level, there is exactly one location in the cache where it might be; i.e., lots of items at the lower level share locations in the upper level.
Addressing
- The processor requests a memory address of N bits.
- We need a mapping from addresses to the cache: how do we break up these N bits into fields so that the cache controller can easily determine if the block containing the requested address is already stored in the cache?
- Focus on direct mapped (how to implement mod P, where P = 2^k):
  - index: determines which cache block the address maps to.
  - tag: determines which of the memory blocks that map to this cache block is the one actually stored there.
  - offset: since we are working with blocks, finds the specific word within the block.
- The N-bit address is broken into these 3 fields!

Example
- Memory blocks 1, 2, 3, 4, 5, ... map onto a cache with 4 blocks (cache block 0 through cache block 3); many memory blocks share each cache block.
Addressing Example
- 8-bit addresses, byte addressability.
- 4-byte cache blocks: each cache block has 4 bytes.
- Total size of cache = 16 bytes.
- Memory is viewed as 4-byte blocks: addresses (0,1,2,3), (4,5,6,7), (8,9,10,11), (12,13,14,15), (16,17,18,19), ...
- In which cache block do we place address 01101101?
- How do we know which block from memory is in the cache block: is it 01101101 or 11001101?
Addressing Example (continued)
- 8-bit address, byte addressability, 4-byte cache blocks, total cache size 16 bytes.
- Each cache block has 4 bytes, so we need 2 bits to specify which byte within the block.
- There are 4 blocks in the cache, so apply a modulo-4 mapping.
- Which cache block do we place 01101101 in?
  - The last two LSBs (01) are the offset within a block.
  - The next 2 LSBs (11) are the modulo-4 result and specify the cache block.
- How do we know which block from memory is in the cache block, i.e., whether it holds 01101101 or 11001101? The remaining high-order bits (the tag) are stored alongside the cache block and compared against the address.

So How Complex Is the Hardware Required?
- Given an N-bit address, the cache controller needs to determine if the requested word is in the cache or not. If not, it sends the address to the memory bus.
- This process has to be built into hardware.
- The complexity of the circuit determines the time taken to determine if the word is in the cache: the hit time is a function of the complexity of the circuit.
- For a direct-mapped cache, can you think of a hardware design?
Direct Mapped Caches
(Hardware sketch: the address is split into tag, index, and block offset fields; a decoder uses the index to select one cache entry, the stored tag is compared with the address tag to signal a tag match, and a multiplexor selects the requested word from the block.)

Cache Block Replacement
- When we run out of space in the cache and need to bring a new block in, we replace a block that is currently in the cache to create space.
- Which block should we replace?
  - Easy in a direct-mapped cache: there is no choice!
  - More complex in associative caches, since we have a choice.
What Happens on a Write?
- Write through: propagate each write through the cache and on to the next level; the information is written to both the block in the cache and the block in the lower-level memory.
- Write back: the information is written only to the block in the cache; memory is written only when the block is replaced. A dirty bit is kept in the block description area.
- Tradeoffs?

Performance
- Simplified model:
  execution time = (execution cycles + stall cycles) x cycle time
  stall cycles = (number of instructions) x (miss ratio) x (miss penalty)
- Two ways of improving performance: decreasing the miss ratio, and decreasing the miss penalty.
- What happens if we increase the block size?
Performance: Block Size
- Increasing the block size tends to decrease the miss rate. (A plot of miss rate, 0% to 40%, vs. block size, 4 to 256 bytes, for cache sizes from 1 KB to 256 KB shows miss rate falling as the block size grows, then rising again at 256-byte blocks for the smaller caches.)
- Use split caches, because there is more spatial locality in code:

    Program  Block size (words)  Instruction miss rate  Data miss rate  Effective combined miss rate
    gcc      1                   6.1%                   2.1%            5.4%
    gcc      4                   2.0%                   1.7%            1.9%
    spice    1                   1.2%                   1.3%            1.2%
    spice    4                   0.3%                   0.6%            0.4%

Decreasing Miss Penalty with Multilevel Caches
- Add a second-level cache:
  - Often the primary cache is on the same chip as the processor.
  - Use SRAMs to add another cache above primary memory (DRAM).
  - The miss penalty goes down if the data is in the 2nd-level cache.
- Using multilevel caches:
  - Try to optimize the hit time on the 1st-level cache.
  - Try to optimize the miss rate on the 2nd-level cache.
How about rewriting code to improve cache hit rates? (code optimization)

Summary: Memory access time
If each access to memory leads to a cache hit, then the time to fetch from memory is one cycle. Program performance is good!
If each access to memory leads to a cache miss, then the time to fetch from memory is much larger than 1 cycle. Program performance is bad!
Design goal: arrange data/instructions so that we have as few cache misses as possible.
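The classic rearrangement is loop interchange: traverse a 2D array in the order its rows are laid out in memory, so consecutive accesses fall in the same cache block. A sketch with a made-up matrix (the idea carries over directly to C arrays, where the speed difference is most visible):

```python
# Cache-friendly vs. cache-unfriendly traversal of a 2D array.
# Rows are stored contiguously (row-major order), so scanning row by row
# touches consecutive addresses and reuses each cache block; scanning
# column by column jumps a whole row ahead on every access.

N = 256
matrix = [[i * N + j for j in range(N)] for i in range(N)]

def sum_row_major(m):
    total = 0
    for i in range(N):          # good locality: inner loop walks along one row
        for j in range(N):
            total += m[i][j]
    return total

def sum_col_major(m):
    total = 0
    for j in range(N):          # poor locality: inner loop strides across rows
        for i in range(N):
            total += m[i][j]
    return total
```

Both functions compute the same sum; for large arrays in a compiled language the row-major version runs markedly faster because it misses in the cache far less often.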
Secondary memory
The CPU fetches from memory; memory is a sequence of 2^N locations.
What happens if main memory (chip capacity) is less than 2^N?
Data and programs may be stored in a non-volatile component: the MS-Word executable is on disk, as is your application data.
What happens if more than one process is running on the system? Multiprogramming. Need to use disks!

What is multiprogramming and why?
The processor overlaps execution of two processes/programs: when one is waiting for I/O, the other is "swapped" in to the processor; the "state" of the process being swapped out is saved.
Processes need to share memory; each has its own address space.
Leads to better throughput and utilization.
What is the address space available to each user?
The complete memory hierarchy. Why does the memory hierarchy work?
Principle of locality! Exploit it at each level.
The processor has a set of registers; processor instructions operate on the contents of the registers.
A small, fast cache memory is placed near the processor.
Main memory sits outside the chip: if data is not in the cache, fetch it from main memory; it takes longer to access main memory than cache.
The disk sits outside the motherboard: if the data/program is not in main memory, fetch/load it from disk; it takes much longer to access disk than main memory.
How do disks work ? Disk Geometry
Disks consist of platters, each with two surfaces. Each surface consists of concentric rings called tracks. Each track consists of sectors separated by gaps.
[Figure: one disk surface, showing the spindle, tracks (e.g., track k), sectors, and the gaps between them.]
Disk geometry (multiple-platter view)
Aligned tracks form a cylinder: cylinder k spans surfaces 0 through 5 of platters 0 through 2, all on one spindle.

Disk capacity
Capacity: the maximum number of bits that can be stored. Vendors express capacity in units of gigabytes (GB), where 1 GB = 10^9 bytes.
Capacity is determined by these technology factors:
Recording density (bits/in): the number of bits that can be squeezed into a 1-inch segment of a track.
Track density (tracks/in): the number of tracks that can be squeezed into a 1-inch radial segment.
Areal density (bits/in^2): the product of recording density and track density.
Modern disks partition tracks into disjoint subsets called recording zones: each track in a zone has the same number of sectors, determined by the circumference of the innermost track; each zone has a different number of sectors/track.
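Capacity multiplies out from the geometry. A sketch with hypothetical drive parameters (512-byte sectors, 300 sectors/track on average, and so on; none of these numbers come from the slides):

```python
# Disk capacity = bytes/sector * avg sectors/track * tracks/surface
#                 * surfaces/platter * platters/disk

def disk_capacity(bytes_per_sector, avg_sectors_per_track,
                  tracks_per_surface, surfaces_per_platter, platters):
    return (bytes_per_sector * avg_sectors_per_track *
            tracks_per_surface * surfaces_per_platter * platters)

cap = disk_capacity(512, 300, 20_000, 2, 5)
cap_gb = cap / 10**9   # vendors use 1 GB = 10^9 bytes
```

"Avg sectors/track" is an average precisely because of recording zones: outer zones pack more sectors per track than inner ones.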
Disk Operation (Single-Platter View)
The disk surface spins at a fixed rotational rate around the spindle.
The read/write head is attached to the end of the arm and flies over the disk surface on a thin cushion of air.
By moving radially, the arm can position the read/write head over any track.
Disk access time
The average time to access some target sector is approximated by:
Taccess = Tavg seek + Tavg rotation + Tavg transfer
Seek time (Tavg seek): the time to position the heads over the cylinder containing the target sector. Typical Tavg seek = 9 ms.
Rotational latency (Tavg rotation): the time spent waiting for the first bit of the target sector to pass under the r/w head.
Tavg rotation = 1/2 x (1/RPM) x 60 sec/1 min
Transfer time (Tavg transfer): the time to read the bits in the target sector.
Tavg transfer = (1/RPM) x 1/(avg # sectors/track) x 60 sec/1 min

Logical disk blocks
Modern disks present a simpler abstract view of the complex sector geometry: the set of available sectors is modeled as a sequence of b-sized logical blocks (0, 1, 2, ...).
The mapping between logical blocks and actual (physical) sectors is maintained by a hardware/firmware device called the disk controller, which converts requests for logical blocks into (surface, track, sector) triples.
This also allows the controller to set aside spare cylinders for each zone, which accounts for the difference between "formatted capacity" and "maximum capacity".
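Plugging representative numbers into the formulas above (9 ms seek and a hypothetical 7200 RPM drive with 400 sectors/track):

```python
# Taccess = Tavg seek + Tavg rotation + Tavg transfer, all in milliseconds.

def disk_access_ms(seek_ms, rpm, avg_sectors_per_track):
    rotation_ms = 0.5 * (1 / rpm) * 60 * 1000                          # half a revolution on average
    transfer_ms = (1 / rpm) * (1 / avg_sectors_per_track) * 60 * 1000  # one sector's worth of rotation
    return seek_ms + rotation_ms + transfer_ms

t = disk_access_ms(9, 7200, 400)
```

With these numbers the total is about 13.2 ms: seek and rotation dominate, while actually transferring the sector takes only about 0.02 ms. This is why disk-aware code tries to read large sequential runs rather than scattered sectors.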
I/O bus
The CPU chip (register file, ALU) connects through a bus interface to the system bus, then through an I/O bridge to the memory bus and main memory.
The I/O bus connects USB controllers (mouse, keyboard), graphics adapters (monitor), and disk controllers (disk), with expansion slots for other devices such as network adapters.

Reading a disk sector (1)
The CPU initiates a disk read by writing a command, logical block number, and destination memory address to a port (address) associated with the disk controller.
Reading a disk sector (2)
The disk controller reads the sector and performs a direct memory access (DMA) transfer into main memory, over the I/O bus and through the bus interface, without involving the CPU.

Reading a disk sector (3)
When the DMA transfer completes, the disk controller notifies the CPU with an interrupt (i.e., asserts a special "interrupt" pin on the CPU).
Storage trends

SRAM
metric             1980     1985    1990   1995    2000    2000:1980
$/MB               19,200   2,900   320    256     100     190
access (ns)        300      150     35     15      2       100

DRAM
metric             1980     1985    1990   1995    2000    2000:1980
$/MB               8,000    880     100    30      1       8,000
access (ns)        375      200     100    70      60      6
typical size (MB)  0.064    0.256   4      16      64      1,000

Disk
metric             1980     1985    1990   1995    2000    2000:1980
$/MB               500      100     8      0.30    0.05    10,000
access (ms)        87       75      28     10      8       11
typical size (MB)  1        10      160    1,000   9,000   9,000

CPU clock rates
metric             1980     1985    1990   1995    2000    2000:1980
processor          8080     286     386    Pent    P-III
clock rate (MHz)   1        6       20     150     750     750
cycle time (ns)    1,000    166     50     6       1.6     750
(Culled from back issues of Byte and PC Magazine)
Last step: Virtual memory
N-bit address space... If a word is not in main memory, then it is on disk; we need a controller to manage this, analogous to the cache controller: the virtual memory management system.
Note: transfers from disk take place in "blocks" of M bytes (a sector): a page of data. Memory consists of a number of pages.
Determine whether a page is in main memory or not: the N-bit address is broken into fields which determine the page number and whether the page is in memory or on disk.
If the page size is 1024 bytes... how should the N-bit address be organized?

Virtual memory
Main memory can act as a cache for the secondary storage (disk).
Virtual addresses go through address translation to become physical addresses; pages not in memory are identified by disk addresses.
Advantages: the illusion of having more physical memory; program relocation; protection.
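Organizing the N-bit address amounts to splitting it into a page number and a page offset. A sketch assuming (as in the page-table slides that follow) a 32-bit virtual address and 4 KB pages:

```python
# Split a virtual address into virtual page number (VPN) and page offset.
PAGE_SIZE = 4096      # 2**12 bytes, so 12 offset bits
OFFSET_BITS = 12

def split_address(vaddr):
    vpn = vaddr >> OFFSET_BITS        # upper 20 bits: virtual page number
    offset = vaddr & (PAGE_SIZE - 1)  # lower 12 bits: offset within the page
    return vpn, offset

vpn, off = split_address(0x12345678)
# 0x12345678 -> VPN 0x12345, offset 0x678
```

The page number selects which page; the offset is unchanged by translation, since a page is the unit that moves between disk and memory.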
Pages: virtual memory blocks
Page faults: the data is not in memory; retrieve it from disk.
The miss penalty is huge, thus pages should be fairly large (e.g., 4 KB).
Reducing page faults is important (LRU is worth the price).
The faults can be handled in software instead of hardware.
Using write-through is too expensive, so we use write-back.

Page tables
The page table maps each virtual page number to either a physical page in physical memory (valid bit = 1) or a disk address in disk storage (valid bit = 0).
Address translation:
Virtual address:  bits 31-12 = virtual page number, bits 11-0 = page offset
Physical address: bits 29-12 = physical page number, bits 11-0 = page offset
Page tables
A page table register points to the start of the page table.
The 32-bit virtual address is split into a 20-bit virtual page number (bits 31-12) and a 12-bit page offset (bits 11-0).
The virtual page number indexes the page table; each entry holds a valid bit and an 18-bit physical page number. If the valid bit is 0, the page is not present in memory.
The physical page number is concatenated with the page offset to form the 30-bit physical address (bits 29-0).

More on virtual memory
This is part of the operating system; CS 156's focus is the OS utilities.
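The page-table walk just described can be sketched in a few lines. The table contents and page numbers below are made up for illustration:

```python
# A minimal page-table lookup: index the table with the virtual page number,
# check the valid bit, then splice the physical page number onto the offset.

OFFSET_BITS = 12

# page table entries: VPN -> (valid bit, physical page number)
page_table = {0x00000: (1, 0x2A), 0x00001: (1, 0x07), 0x00002: (0, None)}

def translate(vaddr):
    vpn = vaddr >> OFFSET_BITS
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    valid, ppn = page_table.get(vpn, (0, None))
    if not valid:
        # a real system would trap to the OS, which fetches the page from disk
        raise LookupError("page fault")
    return (ppn << OFFSET_BITS) | offset

paddr = translate(0x00001ABC)   # VPN 1 maps to PPN 0x07, giving 0x07ABC
```

Accessing VPN 0x00002 (valid bit 0) raises the page-fault path, which is exactly the software-handled miss case described above.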