
CS 211: Computer Architecture
Cache Memory Design

Course Objectives: Where are we?

CS 211: Part 2
• Discussions thus far
  - Architectures to increase the processing speed
  - Focused entirely on how instructions can be executed faster
  - Have not addressed the other components that go into putting it all together
  - Other components: memory, I/O, …
• Next:
  - Memory design: how is memory organized to facilitate fast access
  - Compiler optimization: how does the compiler generate 'optimized' machine code
  - With a look at techniques for ILP
  - We have seen some techniques already, and will cover some more in memory design before getting to formal architecture of …
  - Quick look at I/O and …

CS 211: Part 3
• Multiprocessing concepts
  - Multi-core processors
  - Parallel processing
  - Cluster
• Embedded systems – another dimension
  - Focus on design
  - Challenges and what's different
• Reconfigurable architectures
  - What are they? And why?


Memory
• In our discussions (on MIPS pipeline, superscalar, …) we've constantly been assuming that we can access our operand from memory in 1 clock cycle…
  - This is possible, but it's complicated
  - We'll now discuss how this happens
• We'll talk about…
  - Memory technology
  - Caches
  - Memory
  - Virtual memory

Memory Technology
• Memory comes in many flavors
  - SRAM (Static Random Access Memory)
    - Like a …; once data is written to SRAM its contents stay valid – no need to refresh it
  - DRAM (Dynamic Random Access Memory)
    - Like leaky capacitors – data is stored into the DRAM chip by charging memory cells to max values; the charge slowly leaks and will eventually be too low to be valid – therefore refresh circuitry rewrites the data and charges the cells back to max
  - Static RAM is faster but more expensive
    - Cache uses static RAM
  - ROM, EPROM, EEPROM, Flash, etc.
    - Read-only memories – store OS
  - Disks, tapes, etc.
• Difference in speed, price and "size"
  - Fast is small and/or expensive
  - Large is slow and/or expensive


Is there a problem with DRAM?
• Processor-DRAM memory gap (figure, 1980-2000): processor performance grows ~60%/yr ("Moore's Law", 2X/1.5 yr) while DRAM performance grows ~9%/yr (2X/10 yrs); the processor-memory performance gap grows ~50% per year.

Why Not Only DRAM?
• Can lose data when no power
  - Volatile storage
• Not large enough for some things
  - Backed up by storage (disk)
  - Virtual memory, paging, etc.
• Not fast enough for processor accesses
  - Takes hundreds of cycles to return data
  - OK in very regular applications
    - Can use SW pipelining, vectors
  - Not OK in most other applications

The principle of locality…
• …says that most programs don't access all code or data uniformly
  - e.g. in a loop, a small subset of instructions might be executed over and over again…
  - …& a block of memory addresses might be accessed sequentially…
• This has led to "memory hierarchies"
• Some important things to note:
  - Fast memory is expensive
  - Levels of memory are usually smaller/faster than the previous
  - Levels of memory usually "subset" one another
  - All the stuff in a higher level is in some level below it

Levels in a typical memory hierarchy
• (figure: levels from CPU registers through caches and main memory down to disk and tape)


Memory Hierarchies
• Key principles
  - Locality – most programs do not access code or data uniformly
  - Smaller hardware is faster
• Goal
  - Design a memory hierarchy "with cost almost as low as the cheapest level of the hierarchy and speed almost as fast as the fastest level"
  - This implies that we be clever about keeping more likely used data as "close" to the CPU as possible
• Levels provide subsets
  - Anything (data) found in a particular level is also found in the next level below
  - Each level maps from a slower, larger memory to a smaller but faster memory

The Full Memory Hierarchy "always reuse a good idea"
(Upper level = smaller, faster; lower level = larger, slower)

Level                      | Capacity   | Access time          | Cost                    | Managed by       | Transfer unit (to level above)
CPU registers              | 100s bytes | <10s ns              | –                       | prog./compiler   | instr. operands, 1-8 bytes
Cache (our current focus)  | K bytes    | 10-100 ns            | 1-0.1 cents/bit         | cache controller | blocks, 8-128 bytes
Main memory                | M bytes    | 200-500 ns           | 0.0001-0.00001 cents/bit| OS               | pages, 4K-16K bytes
Disk                       | G bytes    | 10 ms (10,000,000 ns)| 10^-5 - 10^-6 cents/bit | user/operator    | files, Mbytes
Tape                       | infinite   | sec-min              | 10^-8 cents/bit         | –                | –

Propagation delay bounds where memory can be placed
• (figure: memory technologies and signaling rates, e.g. Double Data Rate (DDR) SDRAM; XDR planned at 3 to 6 GHz)

Cache: Terminology
• Cache is the name given to the first level of the memory hierarchy encountered once an address leaves the CPU
  - Takes advantage of the principle of locality
• The term cache is also now applied whenever buffering is employed to reuse items
• Cache controller
  - The HW that controls access to the cache or generates requests to memory
• No cache in 1980 PCs to 2-level cache by 1995..!


What is a cache?
• Small, fast storage used to improve average access time to slow memory.
• Exploits spatial and temporal locality
• In computer architecture, almost everything is a cache!
  - Registers: "a cache" on variables – software managed
  - First-level cache: a cache on second-level cache
  - Second-level cache: a cache on memory
  - Memory: a cache on disk (virtual memory)
  - TLB: a cache on page table
  - Branch prediction: a cache on prediction information?
• (figure: Proc/Regs → L1-Cache → L2-Cache → Memory → Disk, Tape, etc.; levels get bigger going down and faster going up)

Caches: multilevel
• (figure: CPU → L1 cache (16~32KB, 1~2 pclk latency) → L2 cache (~256KB, ~10 pclk latency) → L3 cache (~4MB, ~50 pclk latency) → Main Memory)

A brief description of a cache
• Cache = next level of memory hierarchy up from register file
  - All values in register file should be in cache
• Cache entries usually referred to as "blocks"
  - Block is the minimum amount of information that can be in the cache
  - Fixed size collection of data, retrieved from memory and placed into the cache
• Processor generates a request for data/inst; first look up the cache
• If we're looking for an item in the cache and find it, we have a cache hit; if not, a cache miss

Terminology Summary
• Hit: data appears in a block in the upper level (i.e. block X in cache)
  - Hit Rate: fraction of memory accesses found in the upper level
  - Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
• Miss: data needs to be retrieved from a block in the lower level (i.e. block Y in memory)
  - Miss Rate = 1 - (Hit Rate)
  - Miss Penalty: extra time to replace a block in the upper level + time to deliver the block to the processor
• Hit Time << Miss Penalty (500 instructions on Alpha 21264)
• (figure: processor ↔ upper level memory (Blk X) ↔ lower level memory (Blk Y))

Definitions
• Locating a block requires two attributes:
  - Size of block
  - Organization of blocks within the cache
• Block size (also referred to as line size)
  - Granularity at which the cache operates
  - Each block is a contiguous series of bytes in memory and begins on a naturally aligned boundary
  - E.g.: cache with 16 byte blocks
    - each block contains 16 bytes
    - first byte aligned to a 16 byte boundary in memory
    - low order 4 bits of the address of the first byte would be 0000
  - Smallest usable block size is the natural word size of the processor
    - else would require splitting an access across blocks and slows down translation

Cache Basics
• Cache consists of block-sized lines
  - Line size typically a power of two
  - Typically 16 to 128 bytes in size
• Example
  - Suppose block size is 128 bytes
  - Lowest seven bits determine offset within block
  - Read data at address A = 0x7fffa3f4
  - Address belongs to the block with base address 0x7fffa380
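As a concrete illustration of the 128-byte-block example above, here is a minimal C sketch (the address 0x7fffa3f4 and block size come from the slide; everything else is illustrative) that extracts the block offset and the naturally aligned block base:

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t addr       = 0x7fffa3f4;     /* address from the slide            */
    uint32_t block_size = 128;            /* 128-byte blocks -> 7 offset bits  */

    uint32_t offset = addr & (block_size - 1);   /* low-order 7 bits           */
    uint32_t base   = addr & ~(block_size - 1);  /* naturally aligned base     */

    printf("offset = 0x%02x, block base = 0x%08x\n", offset, base);
    /* prints: offset = 0x74, block base = 0x7fffa380 */
    return 0;
}
```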


Memory Hierarchy
• Placing the fastest memory near the CPU can result in increases in performance
• Consider the number of cycles the CPU is stalled waiting for a memory access – memory stall cycles
  - CPU execution time = (CPU clk cycles + Memory stall cycles) * clk cycle time
  - Memory stall cycles = number of misses * miss penalty = IC * (memory accesses/instruction) * miss rate * miss penalty

Unified or Separate I-Cache and D-Cache
• Two types of accesses:
  - Instruction fetch
  - Data fetch (load/store instructions)
• Unified cache
  - One large cache for both instructions and data
  - Pros: simpler to manage, less hardware complexity
  - Cons: how to divide the cache between data and instructions? Confuses the standard model; optimizations difficult
• Separate instruction and data cache
  - Instruction fetch goes to the I-cache
  - Data access from loads/stores goes to the D-cache
  - Pros: easier to optimize each
  - Cons: more expensive; how to decide on the size of each

Cache Design – Questions
• Q1: Where can a block be placed in the upper level? (block placement)
• Q2: How is a block found if it is in the upper level? (block identification)
• Q3: Which block should be replaced on a miss? (block replacement)
• Q4: What happens on a write? (write strategy)

Where can a block be placed in a cache?
• 3 schemes for block placement in a cache:
  - Direct mapped cache:
    - Block (or data to be stored) can go to only 1 place in the cache
    - Usually: (Block address) MOD (# of blocks in the cache)
  - Fully associative cache:
    - Block can be placed anywhere in the cache
  - Set associative cache:
    - "Set" = a group of blocks in the cache
    - Block is mapped onto a set & then the block can be placed anywhere within that set
    - Usually: (Block address) MOD (# of sets in the cache)
    - If n blocks in a set, we call it n-way set associative

Where can a block be placed in a cache?
• (figure) For a cache with 8 block frames and memory block 12:
  - Fully associative: block 12 can go anywhere
  - Direct mapped: block 12 can go only into block 4 (12 mod 8)
  - Set associative (4 sets: Set 0 … Set 3): block 12 can go anywhere in set 0 (12 mod 4)

Associativity
• If you have associativity > 1 you have to have a replacement policy
  - FIFO
  - LRU
  - Random
• "Full" or "full-map" associativity means you check every tag in parallel and a memory block can go into any cache block
  - Virtual memory is effectively fully associative
  - (But don't worry about virtual memory yet)
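A small C sketch of the placement rules in the figure above (the 8-block cache and memory block 12 come from the slide; the printout itself is just illustrative):

```c
#include <stdio.h>

int main(void) {
    int block_addr = 12;

    /* Direct mapped: 8 blocks, each its own one-block "set" */
    printf("direct mapped   : block %d -> cache block %d\n",
           block_addr, block_addr % 8);          /* 12 mod 8 = 4 */

    /* 2-way set associative: 8 blocks / 2 ways = 4 sets */
    printf("2-way set assoc : block %d -> set %d (either way)\n",
           block_addr, block_addr % 4);          /* 12 mod 4 = 0 */

    /* Fully associative: any of the 8 cache blocks may be used */
    printf("fully assoc     : block %d -> any cache block\n", block_addr);
    return 0;
}
```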


Cache Organizations
• Direct mapped vs fully associative
  - Direct mapped is not flexible enough; if X (mod K) = Y (mod K) then X and Y cannot both be located in the cache
  - Fully associative allows any mapping, which implies all locations must be searched to find the right one – expensive hardware
• Set associative
  - Compromise between direct mapped and fully associative
  - Allows many-to-few mappings
  - On lookup, a subset of the address bits is used to generate an index
  - BUT the index now corresponds to a set of entries which can be searched in parallel – a more efficient hardware implementation than fully associative, but due to the flexible mapping it behaves more like fully associative

Associativity
• If the total cache size is kept the same, increasing the associativity increases the number of blocks per set
  - Number of simultaneous compares needed to perform the search in parallel = number of blocks per set
  - Increasing associativity by a factor of 2 doubles the number of blocks per set and halves the number of sets
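The size/associativity relationship can be written down directly; a minimal C sketch (the 16 KB cache and 32-byte blocks are made-up parameters for illustration):

```c
#include <stdio.h>

int main(void) {
    unsigned cache_bytes = 16 * 1024;   /* hypothetical 16 KB cache  */
    unsigned block_bytes = 32;          /* hypothetical 32 B blocks  */

    for (unsigned assoc = 1; assoc <= 8; assoc *= 2) {
        unsigned blocks = cache_bytes / block_bytes;
        unsigned sets   = blocks / assoc;   /* doubling assoc halves the sets */
        printf("%u-way: %u blocks, %u sets, %u tag compares per lookup\n",
               assoc, blocks, sets, assoc);
    }
    return 0;
}
```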


Block Identification: How is a block found in the cache
• Since we have many-to-one mappings, we need a tag
• Caches have an address tag on each block that gives the block address.
  - E.g.: if slot zero in the cache contains tag K, the value in slot zero corresponds to block zero from the area of memory that has tag K
  - Address consists of ⟨t, b, o⟩ (tag, block/slot index, offset)
  - Examine the tag in slot b of the cache:
    - if it matches t, then extract the value from slot b in the cache
    - else use ⟨t, b⟩ to fetch the block from memory, place a copy in slot b of the cache, replace the tag with t, and use o to select the appropriate byte

Large Blocks and Subblocking
• Large cache blocks can take a long time to refill
  - refill cache line critical word first
  - restart cache access before complete refill
• Large cache blocks can waste bus bandwidth if the block size is larger than the area of spatial locality
  - divide a block into subblocks
  - associate separate valid bits for each subblock
  - (figure: [v | subblock] [v | subblock] [v | subblock] [tag])


How is a block found in the cache?
• Caches have an address tag on each block that provides the block address
  - The tag of every cache block that might have the entry is examined against the CPU address (in parallel! – why?)
• Each entry usually has a valid bit
  - Tells us if the cache data is useful/not garbage
  - If the bit is not set, there can't be a match…
• How does the address provided by the CPU relate to an entry in the cache?
  - Entry divided between block address & block offset…
  - …and further divided between tag field & index field

How is a block found in the cache?
• Address fields: Block Address = Tag | Index, followed by Block Offset
• Block offset field selects data from the block
  - (i.e. address of the desired data within the block)
• Index field selects a specific set
  - Fully associative caches have no index field
• Tag field is compared against it for a hit
• Could we compare on more of the address than the tag?
  - Not necessary; checking the index is redundant
    - Used to select the set to be checked
    - Ex.: an address stored in set 0 must have 0 in the index field
  - Offset not necessary in the comparison – the entire block is present or not, and all block offsets must match


Cache Memory Structures
• (figure) Three basic structures:
  - Indexed/direct mapped memory: a k-bit index drives a decoder that selects one of 2^k blocks; one tag compare per access
  - Associative memory (CAM): no index; the key is matched against every tag in parallel; unlimited blocks
  - N-way set-associative memory: a k-bit index selects a set; N tags are compared in parallel; N × 2^k blocks

Direct Mapped Caches
• (figure) Address split into tag | index | block offset; the index drives a decoder that selects one line, the stored tag is compared against the address tag, and a multiplexor selects the data on a match

Cache Block Size
• Each cache block (or cache line) has only one tag but can hold multiple "chunks" of data
  - Reduces tag storage overhead
  - In 32-bit addressing, a 1-MB direct-mapped cache has 12 bits of tag
  - The entire cache block is transferred to and from memory all at once: good for spatial locality, since if you access address i you will probably want i+1 as well (prefetching effect)
• Block size = 2^b; direct mapped cache size = 2^(B+b)
  - (address layout, MSB to LSB: tag | block index (B bits) | block offset (b bits))

Fully Associative Cache
• (figure) Address = tag | block offset; the tag is searched associatively against every entry's tag in parallel, and a multiplexor selects the matching block's data

N-Way Set Associative Cache
• (figure, two views) Address = tag | index | block offset; the index selects a set across N ways (banks), each way's tag is compared in parallel (match), and a multiplexor selects the data from the matching way
• Cache size = N × 2^(B+b)

Which block should be replaced on a cache miss?
• If we look something up in the cache and the entry is not there, we generally want to get the data from memory and put it in the cache
  - B/c the principle of locality says we'll probably use it again
• Direct mapped caches have 1 choice of what block to replace
• Fully associative or set associative offer more choices
• Usually 2 strategies:
  - Random – pick any possible block and replace it
  - LRU – stands for "Least Recently Used"
    - Why not throw out the block not used for the longest time
    - Usually approximated, not much better than random – i.e. 5.18% vs. 5.69% miss rate for a 16KB 2-way set associative cache

LRU Example
• 4-way set associative
  - Need 4 values (2 bits) for each counter; the entry with counter 0 is the one replaced
• Initial state:       0 0x00004000, 1 0x00003800, 2 0xffff8000, 3 0x00cd0800
• Access 0xffff8004:   0 0x00004000, 1 0x00003800, 3 0xffff8000, 2 0x00cd0800
• Access 0x00003840:   0 0x00004000, 3 0x00003800, 2 0xffff8000, 1 0x00cd0800
• Access 0x00d00008:   replace the entry with counter 0, then update counters:
                       3 0x00d00000, 2 0x00003800, 1 0xffff8000, 0 0x00cd0800
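A sketch of the 2-bit-counter bookkeeping behind the example (the data structure and function names are hypothetical; the rule matches the walkthrough above): on an access, the touched way's counter becomes the maximum and every way that was more recent is aged by one; on a miss, the way whose counter is 0 is the victim.

```c
#include <stdint.h>

#define WAYS 4

typedef struct {
    uint32_t tag;   /* block address stored in this way                */
    unsigned lru;   /* 2-bit counter: 3 = most recent, 0 = LRU victim  */
} way_t;

/* Called on a hit to way 'w' (or after refilling 'w' on a miss). */
void lru_touch(way_t set[WAYS], int w)
{
    unsigned old = set[w].lru;
    for (int i = 0; i < WAYS; i++)
        if (set[i].lru > old)      /* age every way that was more recent */
            set[i].lru--;
    set[w].lru = WAYS - 1;         /* touched way becomes MRU            */
}

/* On a miss, the victim is the way whose counter is 0. */
int lru_victim(const way_t set[WAYS])
{
    for (int i = 0; i < WAYS; i++)
        if (set[i].lru == 0)
            return i;
    return 0;   /* unreachable if the counters form a permutation of 0..3 */
}
```

Replaying the slide's three accesses with `lru_touch` reproduces the counter states shown above.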

Approximating LRU
• LRU is too complicated
  - Access and possibly update all counters in a set on every access (not just on replacement)
• Need something simpler and faster
  - But still close to LRU
• NMRU – Not Most Recently Used
  - The entire set has one MRU pointer
  - Points to the last-accessed line in the set
  - Replacement: randomly select a non-MRU line
  - Something like a FIFO will also work

What happens on a write?
• FYI most accesses to a cache are reads:
  - Used to fetch instructions (reads)
  - Most instructions don't write to memory
    - For DLX only about 7% of memory traffic involves writes
    - Translates to about 25% of cache data traffic
• Make the common case fast! Optimize the cache for reads!
  - Actually pretty easy to do…
  - Can read the block while comparing/reading the tag
  - Block read begins as soon as the address is available
  - If a hit, the data is just passed right on to the CPU
• Writes take longer. Any idea why?

What happens on a write?
• Generically, there are 2 kinds of write policies:
  - Write through (or store through)
    - With write through, information is written to the block in the cache and to the block in lower-level memory
  - Write back (or copy back)
    - With write back, information is written only to the cache. It will be written back to lower-level memory when the cache block is replaced
• The dirty bit:
  - Each cache entry usually has a bit that specifies if a write has occurred in that block or not…
  - Helps reduce the frequency of writes to lower-level memory upon block replacement

What happens on a write?
• Write back versus write through:
  - Write back advantageous because:
    - Writes occur at the speed of the cache and don't incur the delay of lower-level memory
    - Multiple writes to a cache block result in only 1 lower-level memory access
  - Write through advantageous because:
    - Lower levels of memory have the most recent copy of the data
• If the CPU has to wait for a write, we have a write stall
  - 1 way around this is a write buffer
  - Ideally, the CPU shouldn't have to stall during a write
  - Instead, data is written to the buffer, which sends it to lower levels of the memory hierarchy
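A rough sketch of how the two policies differ on a write hit (the `write_to_memory` helper and the structure layout are hypothetical; the policy logic follows the slide):

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool     valid, dirty;
    uint32_t tag;
    uint8_t  data[64];
} line_t;

/* Stand-in for the lower level of the hierarchy (assumed, stubbed out). */
void write_to_memory(uint32_t addr, uint8_t value) { (void)addr; (void)value; }

/* Write hit, write-through: update the cache AND the lower level
 * (typically via a write buffer so the CPU does not stall). */
void write_hit_through(line_t *line, uint32_t addr, unsigned off, uint8_t v)
{
    line->data[off] = v;
    write_to_memory(addr, v);
}

/* Write hit, write-back: update only the cache and mark the line dirty;
 * memory is updated later, when the dirty line is evicted. */
void write_hit_back(line_t *line, unsigned off, uint8_t v)
{
    line->data[off] = v;
    line->dirty = true;
}
```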

What happens on a write?
• What if we want to write and the block we want to write to isn't in the cache?
• There are 2 common policies:
  - Write allocate (or fetch on write):
    - The block is loaded on a write miss
    - The idea behind this is that subsequent writes will be captured by the cache (ideal for a write back cache)
  - No-write allocate (or write around):
    - Block is modified in the lower level and not loaded into the cache
    - Usually used for write-through caches (subsequent writes still have to go to memory)

Write Policies: Analysis
• Write-back
  - Implicit priority order to find the most up-to-date copy
  - Requires much less bandwidth
  - Careful about dropping updates due to losing track of dirty bits
• What about multiple levels of cache on the same chip?
  - Use write through for on-chip levels and write back for off-chip
  - SUN UltraSparc, PowerPC
• What about multi-processor caches?
  - Write back gives better performance but also leads to problems
  - Need separate protocols to handle this problem… later

Write Policies: Analysis
• Write through
  - Simple
  - Correctness easily maintained and no ambiguity about which copy of a block is current
  - Drawback is the bandwidth required; memory access time
  - Must also decide whether to fetch and allocate space for the block to be written
    - Write allocate: fetch such a block and put it in the cache
    - Write-no-allocate: avoid the fetch, and install blocks only on read misses
      - Good for cases of streaming writes which overwrite data

Modeling Cache Performance
• CPU time equation… again!
• CPU execution time = (CPU clk cycles + Memory stall cycles) * clk cycle time
• Memory stall cycles = number of misses * miss penalty = IC * (memory accesses/instruction) * miss rate * miss penalty


Cache Performance – Simplified Models
• Hit rate/ratio r = number of requests that are hits / total number of requests
• Cost of memory access = r*Ch + (1-r)*Cm
  - Ch is cost/time from cache, Cm is cost/time on a miss – fetch from memory
• Extend to multiple levels
  - Hit ratios for level 1, level 2, etc.
  - Access times for level 1, level 2, etc.
  - r1*Ch1 + r2*Ch2 + (1 - r1 - r2)*Cm

Average Memory Access Time

AMAT = HitTime + (1 - h) × MissPenalty

• Hit time: basic time of every access
• Hit rate (h): fraction of accesses that hit
• Miss penalty: extra time to fetch a block from the lower level, including the time to replace the block in the cache
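A minimal C sketch of these two models (the example numbers are made up, not from the slides):

```c
#include <stdio.h>

/* Single-level model: AMAT = HitTime + (1 - h) * MissPenalty */
double amat(double hit_time, double hit_rate, double miss_penalty)
{
    return hit_time + (1.0 - hit_rate) * miss_penalty;
}

/* Multi-level model from the slide: r1*Ch1 + r2*Ch2 + (1 - r1 - r2)*Cm */
double avg_cost_two_level(double r1, double c1, double r2, double c2, double cm)
{
    return r1 * c1 + r2 * c2 + (1.0 - r1 - r2) * cm;
}

int main(void) {
    /* hypothetical: 1-cycle hit, 95% hit rate, 50-cycle miss penalty */
    printf("AMAT = %.2f cycles\n", amat(1.0, 0.95, 50.0));
    /* hypothetical: 90% of refs cost 1 cycle, 8% cost 10, the rest cost 100 */
    printf("avg  = %.2f cycles\n",
           avg_cost_two_level(0.90, 1.0, 0.08, 10.0, 100.0));
    return 0;
}
```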

Memory stall cycles
• Memory stall cycles: number of cycles that the processor is stalled waiting for a memory access
• Performance in terms of memory stall cycles
  - CPU time = (CPU cycles + Mem stall cycles) * clk cycle time
  - Mem stall cycles = number of misses * miss penalty = IC * (Misses/Inst) * Miss Penalty = IC * (Mem accesses/Inst) * Miss Rate * Miss Penalty
  - Note: read and write misses are combined into one miss rate

Miss-oriented approach to memory access:

CPUtime = IC × (CPI_Execution + (MemAccess/Inst) × MissRate × MissPenalty) × CycleTime

CPUtime = IC × (CPI_Execution + (MemMisses/Inst) × MissPenalty) × CycleTime

• CPI_Execution includes ALU and memory instructions

When do we get a miss?
• Instruction
  - Fetch instruction – not found in cache
  - How many instructions?
• Data access
  - Load and store instructions
  - Data not found in cache
  - How many data accesses?

Separating out the memory component entirely
• AMAT = Average Memory Access Time
• CPI_ALUOps does not include memory instructions

CPUtime = IC × ((AluOps/Inst) × CPI_AluOps + (MemAccess/Inst) × AMAT) × CycleTime

AMAT = HitTime + MissRate × MissPenalty
     = (HitTime_Inst + MissRate_Inst × MissPenalty_Inst) + (HitTime_Data + MissRate_Data × MissPenalty_Data)


Impact on Performance
• Suppose a processor executes with
  - Clock = 200 MHz (5 ns per cycle)
  - Ideal (no misses) CPI = 1.1
  - Inst mix: 50% arith/logic, 30% ld/st, 20% control
• Suppose that 10% of memory operations get a 50 cycle miss penalty
• Suppose that 1% of instructions get the same miss penalty
• CPI = ideal CPI + average stalls per instruction

Impact on Performance… contd
• CPI = ideal CPI + average stalls per instruction
  = 1.1 (cycles/ins) + [0.30 (DataMops/ins) x 0.10 (miss/DataMop) x 50 (cycles/miss)] + [1 (InstMop/ins) x 0.01 (miss/InstMop) x 50 (cycles/miss)]
  = (1.1 + 1.5 + 0.5) cycles/ins = 3.1
• AMAT = (1/1.3) x [1 + 0.01 x 50] + (0.3/1.3) x [1 + 0.1 x 50] = 2.54
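The arithmetic on this slide can be checked with a few lines of C (all inputs are the slide's):

```c
#include <stdio.h>

int main(void) {
    double ideal_cpi    = 1.1;
    double ld_st_frac   = 0.30;   /* data memory ops per instruction     */
    double inst_refs    = 1.0;    /* instruction fetches per instruction */
    double data_miss    = 0.10, inst_miss = 0.01;
    double miss_penalty = 50.0;

    double cpi = ideal_cpi
               + ld_st_frac * data_miss * miss_penalty     /* 1.5 */
               + inst_refs  * inst_miss * miss_penalty;    /* 0.5 */
    printf("CPI  = %.1f\n", cpi);                          /* 3.1  */

    /* 1.3 memory accesses per instruction: 1 fetch + 0.3 data */
    double amat = (1.0 / 1.3) * (1.0 + inst_miss * miss_penalty)
                + (0.3 / 1.3) * (1.0 + data_miss * miss_penalty);
    printf("AMAT = %.2f cycles\n", amat);                  /* 2.54 */
    return 0;
}
```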


Cache Performance: System Performance
• CPU time = IC * CPI * clock cycle time
  - CPI depends on memory stall cycles
• CPU time = (CPU execution clock cycles + Memory stall clock cycles) * clock cycle time
• Average memory access time = hit time + miss rate * miss penalty
  - CPUs with a low CPI and high clock rates will be significantly impacted by cache miss rates (details in book)
• Improve performance
  - = decrease memory stall cycles
  - = decrease average memory access time/latency (AMAT)

Memory access equations
• Using what we defined previously, we can say:
  - Memory stall clock cycles = Reads x Read miss rate x Read miss penalty + Writes x Write miss rate x Write miss penalty
• Often, reads and writes are combined/averaged:
  - Memory stall cycles = Memory accesses x Miss rate x Miss penalty (approximation)
• Also possible to factor in instruction count to get a "complete" formula:
  - CPU time = IC x (CPI_exec + Mem stall cycles/instruction) x clk cycle time


Next:
• Appendix C: Basic Cache Concepts
• Chapter 5: Cache Optimizations
• Project 2: Study performance of benchmarks (project 1 benchmarks) using different cache organizations

How to Improve Cache Performance?

AMAT = HitTime + MissRate × MissPenalty

1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.


Cache Examples

A 4-entry direct mapped cache with 4 data words/block. The 10-bit physical address is divided into Tag (6 bits) | Index (2 bits) | Offset (2 bits). Each cache entry has a valid bit (V), a dirty bit (D), a tag, and four data word slots (offsets 00, 01, 10, 11). After the accesses below, the entry at index 10 holds tag 101010 and the data words 35, 24, 17, 25.

(1) Assume we want to read the following data words:

Address (Tag | Index | Offset)   Holds data
101010 | 10 | 00                 35
101010 | 10 | 01                 24
101010 | 10 | 10                 17
101010 | 10 | 11                 25

All of these physical addresses have the same tag, and all of them map to the same cache entry.

(2) If we read 101010 10 01 we want to bring data word 24 into the cache. Where would this data go? Well, the index is 10. Therefore, the data word will go somewhere into the 3rd block of the cache (make sure you understand the terminology). More specifically, the data word would go into the 2nd position within the block – because the offset is '01'.

(3) The principle of spatial locality says that if we use one data word, we'll probably use some data words that are close to it – that's why our block size is bigger than one data word. So we fill in the data word entries surrounding 101010 10 01 as well.

(4) Therefore, if we get this pattern of accesses when we start a new program:
1.) 101010 10 00
2.) 101010 10 01
3.) 101010 10 10
4.) 101010 10 11
after we do the read for 101010 10 00 (word #1), we will automatically get the data for words #2, 3 and 4. What does this mean? Accesses (2), (3), and (4) ARE NOT COMPULSORY MISSES.

(5) What happens if we get an access to location 100011 | 10 | 11 (holding data: 12)? The index bits tell us we need to look at cache block 10. So, we need to compare the tag of this address – 100011 – to the tag associated with the current entry in that cache block – 101010. These DO NOT match. Therefore, the data associated with address 100011 10 11 IS NOT VALID. What we have here could be:
• A compulsory miss (if this is the 1st time the data was accessed)
• A conflict miss (if the data for address 100011 10 11 was present, but kicked out by 101010 10 00 – for example)

(6) What if we change the way our cache is laid out – but so that it still has 16 data words? One way we could do this would be a 2-entry cache with 8 data words/block (offsets 000 … 111). All of the following are true:
• This cache still holds 16 words
• Our block size is bigger – therefore this should help with compulsory misses
• Our physical address will now be divided as follows: Tag (6 bits) | Index (1 bit) | Offset (3 bits)
• The number of cache blocks has DECREASED
• This will INCREASE the # of conflict misses

(7) What if we get the same pattern of accesses we had before? Pattern of accesses (note the different # of bits for offset and index now):
1.) 101010 1 000
2.) 101010 1 001
3.) 101010 1 010
4.) 101010 1 011
Assuming that all of these accesses occur for the 1st time (and occur sequentially), access (1) would result in a compulsory miss, and accesses (2)-(4) would result in hits because of spatial locality. (The final state of the cache after all 4 memory accesses: the entry at index 1 holds tag 101010 and the data 35, 24, 17, 25, A, B, C, D.) Note that there is now more data associated with a given cache block. However, we now have only 1 bit of index. Therefore, any address that has a tag different than '101010' and a 1 in the index position is going to result in a conflict miss.

But, we could also make our cache look like this: 8 entries with 2 data words/block. Again, let's assume we want to read the following data words:

Address (Tag | Index | Offset)   Holds data
1.) 101010 | 100 | 0             35
2.) 101010 | 100 | 1             24
3.) 101010 | 101 | 0             17
4.) 101010 | 101 | 1             25

There are now just 2 words associated with each cache block. Note that by organizing a cache in this way, conflict misses will be reduced: there are now more entries in the cache that the 10-bit physical address can map to.

(8) All of these caches hold exactly the same amount of data – 16 different word entries. As a general rule of thumb, "long and skinny" caches help to reduce conflict misses, "short and fat" caches help to reduce compulsory misses, but a cross between the two is probably what will give you the best (i.e. lowest) overall miss rate.

But what about capacity misses? What's a capacity miss?
• The cache is only so big. We won't be able to store every block accessed in a program – we must swap them out!
• Can avoid capacity misses by making the cache bigger

Thus, to avoid capacity misses, we'd need to make our cache physically bigger – i.e. there are now 32 word entries in it instead of 16. FYI, this will change the way the physical address is divided. Given our original pattern of accesses, we'd have (note the smaller tag and bigger index):
1.) 10101 | 010 | 00 = 35
2.) 10101 | 010 | 01 = 24
3.) 10101 | 010 | 10 = 17
4.) 10101 | 010 | 11 = 25

End of Examples

Next: Cache Optimization
• Techniques for minimizing AMAT
  - Miss rate
  - Miss penalty
  - Hit time
• Role of compiler?


Improving Cache Performance

Memory stall cycles
• Memory stall cycles: number of cycles that the processor is stalled waiting for a memory access
• Performance in terms of memory stall cycles
  - CPU time = (CPU cycles + Mem stall cycles) * clk cycle time
  - Mem stall cycles = number of misses * miss penalty = IC * (Misses/Inst) * Miss Penalty = IC * (Mem accesses/Inst) * Miss Rate * Miss Penalty
  - Note: read and write misses are combined into one miss rate


Miss-oriented approach to memory access:

CPUtime = IC × (CPI_Execution + (MemAccess/Inst) × MissRate × MissPenalty) × CycleTime

CPUtime = IC × (CPI_Execution + (MemMisses/Inst) × MissPenalty) × CycleTime

• CPI_Execution includes ALU and memory instructions

Separating out the memory component entirely:
• AMAT = Average Memory Access Time
• CPI_ALUOps does not include memory instructions

CPUtime = IC × ((AluOps/Inst) × CPI_AluOps + (MemAccess/Inst) × AMAT) × CycleTime

AMAT = HitTime + MissRate × MissPenalty
     = (HitTime_Inst + MissRate_Inst × MissPenalty_Inst) + (HitTime_Data + MissRate_Data × MissPenalty_Data)


How to Improve Cache Performance?

AMAT = HitTime + MissRate × MissPenalty

1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

Improving Cache Performance
• We will look at some basic techniques for 1, 2, 3 today
• Next class we look at some more advanced techniques


Cache Misses
• Latencies at each level are determined by technology
• Miss rates are a function of
  - Organization of the cache
  - Access patterns of the program/application
• Understanding the causes of misses enables the designer to realize shortcomings of the design and discover creative cost-effective solutions
  - The 3Cs model is a tool for classifying cache misses based on underlying cause

Reducing Miss Rate – 3C's Model
• Compulsory (or cold) misses
  - Due to the program's first reference to a block – not in cache, so it must be brought to cache (cold start misses)
  - These are fundamental misses – cannot do anything about them
• Capacity
  - Due to insufficient capacity – if the cache cannot contain all the blocks needed, capacity misses will occur because of blocks being discarded and later retrieved
  - Not fundamental; a by-product of cache organization
  - Increasing capacity can reduce some misses
• Conflict
  - Due to imperfect allocation; if the block placement strategy is set associative or direct mapped, conflict misses will occur because a block may be discarded and later retrieved if too many blocks map to its set. Also called interference or collision misses
  - Not fundamental; a by-product of cache organization; fully associative can eliminate conflict misses

Miss Rate Reduction Strategies
• Increase block size – reduce compulsory misses
• Larger caches
  - Larger size can reduce capacity, conflict misses
  - Larger block size for fixed total size can lead to more capacity misses
  - Can reduce conflict misses
• Higher associativity
  - Can reduce conflict misses
  - No effect on cold misses
• Compiler controlled pre-fetching (faulting/non-faulting)
• Code reorganization
  - Merging arrays
  - Loop interchange (row/column order)
  - Loop fusion (2 loops into 1)
  - Blocking

3Cs Absolute Miss Rate (SPEC92)
• (figure: absolute miss rate vs. cache size, 1 KB to 128 KB, for 1-way, 2-way, 4-way, and 8-way associativity; each bar is split into compulsory, capacity, and conflict components)

(1) Larger cache block size
• Easiest way to reduce the miss rate is to increase the cache block size
  - This will help eliminate what kind of misses?
• Helps improve miss rate b/c of the principle of locality:
  - Temporal locality says that if something is accessed once, it'll probably be accessed again soon
  - Spatial locality says that if something is accessed, something nearby it will probably be accessed
  - Larger block sizes help with spatial locality
• Be careful though!
  - Larger block sizes can increase miss penalty!
  - Generally, larger blocks reduce the # of total blocks in the cache

Larger cache block size (graph comparison)
• (figure: miss rate vs. block size, 16 to 256 bytes, one curve per cache size 1K, 4K, 16K, 64K, 256K, assuming total cache size stays constant for each curve) Why this trend?

Larger cache block size (example)
• Assume that to access the lower level of the memory hierarchy you:
  - Incur a 40 clock cycle overhead
  - Get 16 bytes of data every 2 clock cycles
  - i.e. get 16 bytes in 42 clock cycles, 32 in 44, etc…
• Using the data below, which block size has the minimum average memory access time?

Miss rates:
Block size | 1K     | 4K    | 16K   | 64K   | 256K
16         | 15.05% | 8.57% | 3.94% | 2.04% | 1.09%
32         | 13.34% | 7.24% | 2.87% | 1.35% | 0.70%
64         | 13.76% | 7.00% | 2.64% | 1.06% | 0.51%
128        | 16.64% | 7.78% | 2.77% | 1.02% | 0.49%
256        | 22.01% | 9.51% | 3.29% | 1.15% | 0.49%

Larger cache block size (ex. continued…)
• Recall that average memory access time = hit time + miss rate X miss penalty
• Assume a cache hit otherwise takes 1 clock cycle – independent of block size
• So, for a 16-byte block in a 1-KB cache…
  - Average memory access time = 1 + (15.05% X 42) = 7.321 clock cycles
• And for a 256-byte block in a 256-KB cache…
  - Average memory access time = 1 + (0.49% X 72) = 1.353 clock cycles
• The rest of the data is included on the next slide…
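The table on the next slide can be reproduced with a short loop; a sketch using the slide's miss rates and its 40-cycle + 16-bytes-per-2-cycles penalty model:

```c
#include <stdio.h>

int main(void) {
    int    block[5] = {16, 32, 64, 128, 256};
    /* miss rates (%) from the slide: rows = block size, cols = cache size */
    double miss[5][5] = {
        {15.05, 8.57, 3.94, 2.04, 1.09},
        {13.34, 7.24, 2.87, 1.35, 0.70},
        {13.76, 7.00, 2.64, 1.06, 0.51},
        {16.64, 7.78, 2.77, 1.02, 0.49},
        {22.01, 9.51, 3.29, 1.15, 0.49}};
    const char *size[5] = {"1K", "4K", "16K", "64K", "256K"};

    for (int b = 0; b < 5; b++) {
        double penalty = 40.0 + 2.0 * (block[b] / 16);  /* 42, 44, ..., 72 */
        printf("block %3d (penalty %2.0f):", block[b], penalty);
        for (int c = 0; c < 5; c++)
            printf("  %s=%.3f", size[c], 1.0 + miss[b][c] / 100.0 * penalty);
        printf("\n");
    }
    return 0;
}
```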

Larger cache block size (ex. continued)

Average memory access time (clock cycles):
Block size | Miss penalty | 1K     | 4K    | 16K   | 64K   | 256K
16         | 42           | 7.321  | 4.599 | 2.655 | 1.857 | 1.485
32         | 44           | 6.870  | 4.186 | 2.263 | 1.594 | 1.308
64         | 48           | 7.605  | 4.360 | 2.267 | 1.509 | 1.245
128        | 56           | 10.318 | 5.357 | 2.551 | 1.571 | 1.274
256        | 72           | 16.847 | 7.847 | 3.369 | 1.828 | 1.353

The lowest average time for each cache size: 32-byte blocks for the 1K, 4K, and 16K caches; 64-byte blocks for the 64K and 256K caches.
Note: all of these block sizes are common in processors today.
Note: the data for each cache size is in units of "clock cycles".

Larger cache block sizes (wrap-up)
• We want to minimize cache miss rate & cache miss penalty at the same time!
• Selection of block size depends on the latency and bandwidth of lower-level memory:
  - High latency, high bandwidth encourage large block size
    - Cache gets many more bytes per miss for a small increase in miss penalty
  - Low latency, low bandwidth encourage small block size
    - Twice the miss penalty of a small block may be close to the penalty of a block twice the size
    - A larger # of small blocks may reduce conflict misses

Higher associativity
• Higher associativity can improve cache miss rates…
• Note that an 8-way set associative cache is…
  - …essentially a fully-associative cache
• Helps lead to the 2:1 cache rule of thumb:
  - It says: a direct mapped cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2
• But, diminishing returns set in sooner or later…
  - Greater associativity can cause increased hit time

Associativity
• (figure: miss rate vs. cache size, 1 KB to 128 KB, for 1-way, 2-way, 4-way, and 8-way associativity, broken into compulsory, capacity, and conflict components)

Miss Rate Reduction Strategies
• Increase block size – reduce compulsory misses
• Larger caches
  - Larger size can reduce capacity, conflict misses
  - Larger block size for fixed total size can lead to more capacity misses
  - Can reduce conflict misses
• Higher associativity
  - Can reduce conflict misses
  - No effect on cold misses
• Compiler controlled pre-fetching (faulting/non-faulting)
• Code reorganization
  - Merging arrays
  - Loop interchange (row/column order)
  - Loop fusion (2 loops into 1)
  - Blocking

Larger Block Size (fixed size & assoc)
• (figure: miss rate vs. block size, 16 to 256 bytes, one curve per cache size 1K to 256K; larger blocks reduce compulsory misses but eventually increase conflict misses)
• What else drives up block size?

Cache Size
• (figure: miss rate vs. cache size, 1 KB to 128 KB, for 1-way to 8-way associativity, split into compulsory, capacity, and conflict components)
• Old rule of thumb: 2x size => 25% cut in miss rate
• What does it reduce?

How to Improve Cache Performance?

AMAT = HitTime + MissRate × MissPenalty

1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

Improving Performance: Reducing Cache Miss Penalty
• Multilevel caches –
  - second and subsequent level caches can be large enough to capture many accesses that would have gone to main memory, and are faster (therefore less penalty)
• Critical word first and early restart –
  - don't wait for the full cache block to be loaded; send the critical word first, restart the CPU and continue the load
• Priority to read misses over write misses
• Merging write buffer –
  - if the address of a new entry matches that of one in the write buffer, combine them
• Victim caches –
  - cache discarded blocks elsewhere
  - Remember what was discarded in case it is needed again
  - Insert a small fully associative cache between the cache and its refill path
  - This "victim cache" contains only blocks that were discarded as a result of a cache miss (replacement policy)
  - Check the victim cache in case of a miss before going to the next lower level of memory

Early restart and critical word 1st
• With this strategy we're going to be impatient
  - As soon as some of the block is loaded, see if the data is there and send it to the CPU
  - (i.e. we don't wait for the whole block to be loaded)
• There are 2 general strategies:
  - Early restart:
    - As soon as the word gets to the cache, send it to the CPU
  - Critical word first:
    - Specifically ask for the needed word 1st, make sure it gets to the CPU, then get the rest of the cache block data


Victim caches
• 1st of all, what is a "victim cache"?
  - A victim cache temporarily stores blocks that have been discarded from the main cache (usually not that big) – due to conflict misses
• 2nd of all, how does it help us?
  - If there's a cache miss, instead of immediately going down to the next level of the memory hierarchy we check the victim cache first
  - If the entry is there, we swap the victim cache block with the actual cache block
• Research shows:
  - Victim caches with 1-5 entries help reduce conflict misses
  - For a 4KB direct mapped cache, victim caches:
    - Removed 20% - 95% of conflict misses!
• (figure: CPU address → tag compare and data access in the main cache; on a miss, a small fully associative victim cache is checked, with a write buffer in front of lower level memory)
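A very rough C sketch of the lookup path the slide describes; all structure and helper names are hypothetical, and the helpers are only declared, not implemented (each lookup returns whether it hit and points at the line it selected):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct line { bool valid; uint32_t tag; /* ... data ... */ } line_t;

/* Assumed helpers (declarations only): each returns true on a hit and
 * sets *sel to the line it indexed/matched. */
bool main_cache_lookup(uint32_t addr, line_t **sel);
bool victim_cache_lookup(uint32_t addr, line_t **sel);
void swap_lines(line_t *a, line_t *b);
void fetch_from_next_level(uint32_t addr);

void cache_access(uint32_t addr)
{
    line_t *main_line, *victim_line;

    if (main_cache_lookup(addr, &main_line))
        return;                                   /* ordinary hit            */

    if (victim_cache_lookup(addr, &victim_line)) {
        /* Hit in the small fully associative victim cache: swap the victim
         * block with the conflicting block in the main cache. */
        swap_lines(main_line, victim_line);
        return;
    }

    fetch_from_next_level(addr);                  /* true miss               */
}
```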

Multi-Level caches
• The processor/memory performance gap makes us consider:
  - whether we should make caches faster to keep pace with CPUs
  - whether we should make caches larger to overcome the widening gap between CPU and main memory
• One solution is to do both:
  - Add another level of cache (L2) between the 1st level cache (L1) and main memory
  - Ideally L1 will be fast enough to match the speed of the CPU while L2 will be large enough to reduce the penalty of going to main memory

Second-level caches
• This will of course introduce a new definition for average memory access time:
  - Hit time(L1) + Miss rate(L1) * Miss penalty(L1)
  - where Miss penalty(L1) = Hit time(L2) + Miss rate(L2) * Miss penalty(L2)
  - So the 2nd level miss rate is measured on 1st level cache misses…
• A few definitions to avoid confusion:
  - Local miss rate:
    - # of misses in the cache divided by total # of memory accesses to that cache – specifically Miss rate(L2)
  - Global miss rate:
    - # of misses in the cache divided by total # of memory accesses generated by the CPU – specifically Miss rate(L1) * Miss rate(L2)

Second-level caches
• Example:
  - In 1000 memory references there are 40 misses in the L1 cache and 20 misses in the L2 cache. What are the various miss rates?
  - Miss rate L1 (local or global): 40/1000 = 4%
  - Miss rate L2 (local): 20/40 = 50%
  - Miss rate L2 (global): 20/1000 = 2%
• Note that the global miss rate is very similar to the single cache miss rate of the L2 cache
  - (if the L2 size >> L1 size)
• The local miss rate is not a good measure of secondary caches – it's a function of the L1 miss rate
  - Which can vary by changing the L1 cache
  - Use the global cache miss rate to evaluate 2nd level caches!

Second-level caches (some "odds and ends" comments)
• The speed of the L1 cache will affect the clock rate of the CPU, while the speed of the L2 cache will affect only the miss penalty of the L1 cache
  - Which of course could affect the CPU in various ways…
• 2 big things to consider when designing the L2 cache are:
  - Will the L2 cache lower the average memory access time portion of the CPI?
  - If so, how much will it cost?
    - In terms of HW, etc.
• 2nd level caches are usually BIG!
  - Usually L1 is a subset of L2
  - Should have few capacity misses in the L2 cache
    - Only worry about compulsory and conflict misses for optimizations…
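The slide's local/global numbers, checked in a few lines of C:

```c
#include <stdio.h>

int main(void) {
    double refs = 1000.0, l1_misses = 40.0, l2_misses = 20.0;

    double l1_rate        = l1_misses / refs;        /* local = global = 4% */
    double l2_local_rate  = l2_misses / l1_misses;   /* 50%                 */
    double l2_global_rate = l2_misses / refs;        /* 2%                  */

    printf("L1 miss rate        = %.0f%%\n", 100 * l1_rate);
    printf("L2 local miss rate  = %.0f%%\n", 100 * l2_local_rate);
    printf("L2 global miss rate = %.0f%%\n", 100 * l2_global_rate);
    /* note: l1_rate * l2_local_rate == l2_global_rate */
    return 0;
}
```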

Second-level caches (example)
• Given the following data…
  - 2-way set associativity increases hit time by 10% of a CPU clock cycle
  - Hit time for the L2 direct mapped cache is: 10 clock cycles
  - Local miss rate for the L2 direct mapped cache is: 25%
  - Local miss rate for the L2 2-way set associative cache is: 20%
  - Miss penalty for the L2 cache is: 50 clock cycles
• What is the impact of using a 2-way set associative cache on our miss penalty?

Second-level caches (example)
• Miss penalty (direct mapped L2) = 10 + 25% * 50 = 22.5 clock cycles
• Adding the cost of associativity increases the hit cost by only 0.1 clock cycles
• Thus, miss penalty (2-way set associative L2) = 10.1 + 20% * 50 = 20.1 clock cycles
• However, we can't have a fraction for a number of clock cycles (i.e. 10.1 ain't possible!)
• We'll either need to round up to 11 or optimize some more to get it down to 10. So…
  - 10 + 20% * 50 = 20.0 clock cycles, or
  - 11 + 20% * 50 = 21.0 clock cycles (both better than 22.5)
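The same comparison in C (the hit times, miss rates, and penalty are the slide's):

```c
#include <stdio.h>

int main(void) {
    double l2_penalty = 50.0;

    double direct     = 10.0 + 0.25 * l2_penalty;   /* 22.5 cycles */
    double assoc      = 10.1 + 0.20 * l2_penalty;   /* 20.1 cycles */
    double round_down = 10.0 + 0.20 * l2_penalty;   /* 20.0 cycles */
    double round_up   = 11.0 + 0.20 * l2_penalty;   /* 21.0 cycles */

    printf("L1 miss penalty, direct-mapped L2  : %.1f\n", direct);
    printf("L1 miss penalty, 2-way L2 (ideal)  : %.1f\n", assoc);
    printf("  rounded hit time 10 / 11 cycles  : %.1f / %.1f\n",
           round_down, round_up);
    return 0;
}
```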

(5) Second level caches (some final random comments)
• We can reduce the miss penalty by reducing the miss rate of the 2nd level caches using techniques previously discussed…
  - i.e. higher associativity or pseudo-associativity are worth considering b/c they have a small impact on 2nd level hit time
  - And much of the average access time is due to misses in the L2 cache
• Could also reduce misses by increasing the L2 block size
• Need to think about something called the "multilevel inclusion property":
  - In other words, all data in the L1 cache is always in L2…
  - Gets complex for writes, and what not…

Hardware prefetching
• This one should intuitively be pretty obvious:
  - Try and fetch blocks before they're even requested…
  - This could work with both instructions and data
  - Usually, prefetched blocks are placed either:
    - Directly in the cache (what's a downside to this?)
    - Or in some external buffer that's usually a small, fast cache
• Let's look at an example: (the Alpha AXP 21064)
  - On a cache miss, it fetches 2 blocks:
    - One is the new cache entry that's needed
    - The other is the next consecutive block – it goes in a buffer
  - How well does this buffer perform?
    - A single entry buffer catches 15-25% of misses
    - With a 4 entry buffer, the hit rate improves about 50%

Hardware prefetching example
• What is the effective miss rate for the Alpha using instruction prefetching?
• How much larger an instruction cache would the Alpha need to match that average access time if prefetching were removed?
• Assume:
  - It takes 1 extra clock cycle if the instruction misses the cache but is found in the prefetch buffer
  - The prefetch hit rate is 25%
  - Miss rate for an 8-KB instruction cache is 1.10%
  - Hit time is 2 clock cycles
  - Miss penalty is 50 clock cycles

Hardware prefetching example
• We need a revised memory access time formula:
  - Say: Average memory access time (prefetch) = hit time + miss rate * prefetch hit rate * 1 + miss rate * (1 – prefetch hit rate) * miss penalty
• Plugging in numbers to the above, we get:
  - 2 + (1.10% * 25% * 1) + (1.10% * (1 – 25%) * 50) = 2.415
• To find the miss rate with equivalent performance, we start with the original formula and solve for miss rate:
  - Average memory access time (no prefetching) = hit time + miss rate * miss penalty
  - Results in: (2.415 – 2) / 50 = 0.83%
• The calculation suggests the effective miss rate of prefetching with the 8KB cache is 0.83%
• Actual miss rates: 16KB = 0.64% and 8KB = 1.10%
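The prefetching arithmetic, checked in C (all inputs come from the slide):

```c
#include <stdio.h>

int main(void) {
    double hit_time     = 2.0;     /* clock cycles                  */
    double miss_rate    = 0.011;   /* 8-KB I-cache: 1.10%           */
    double prefetch_hit = 0.25;    /* 25% of misses hit the buffer  */
    double miss_penalty = 50.0;

    double amat_prefetch = hit_time
                         + miss_rate * prefetch_hit * 1.0
                         + miss_rate * (1.0 - prefetch_hit) * miss_penalty;
    printf("AMAT with prefetching = %.3f cycles\n", amat_prefetch);   /* 2.415 */

    /* solve AMAT = hit_time + m * miss_penalty for the equivalent miss rate */
    double equiv_miss_rate = (amat_prefetch - hit_time) / miss_penalty;
    printf("equivalent miss rate  = %.2f%%\n", 100 * equiv_miss_rate); /* 0.83 */
    return 0;
}
```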

Compiler-controlled prefetching
• It's also possible for the compiler to tell the hardware that it should prefetch instructions or data
  - It (the compiler) could have values loaded into registers – called register prefetching
  - Or, the compiler could just have data loaded into the cache – called cache prefetching
• Getting things from lower levels of memory can cause faults – if the data is not there…
  - Ideally, we want prefetching to be "invisible" to the program; so often, nonbinding/nonfaulting prefetching is used
  - With the nonfaulting scheme, faulting instructions are turned into no-ops
  - With the "faulting" scheme, data would be fetched (as "normal")

Reducing Misses by Compiler Optimizations
• McFarling [1989] reduced cache misses by 75% on an 8KB direct mapped cache with 4 byte blocks, in software
• Instructions
  - Reorder procedures in memory so as to reduce conflict misses
  - Profiling to look at conflicts (using tools they developed)
• Data
  - Merging arrays: improve spatial locality by a single array of compound elements vs. 2 arrays
  - Loop interchange: change nesting of loops to access data in the order stored in memory
  - Loop fusion: combine 2 independent loops that have the same looping and some variables overlap
  - Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows

Compiler optimizations – merging arrays
• This works by improving spatial locality
• For example, some programs may reference multiple arrays of the same size at the same time
  - Could be bad:
    - Accesses may interfere with one another in the cache
• A solution: generate a single, compound array…

/* Before: */
int tag[SIZE];
int byte1[SIZE];
int byte2[SIZE];
int dirty[SIZE];

/* After: */
struct merge {
  int tag;
  int byte1;
  int byte2;
  int dirty;
};
struct merge cache_block_entry[SIZE];

Merging Arrays Example

/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
  int val;
  int key;
};
struct merge merged_array[SIZE];

Reduces conflicts between val & key; improves spatial locality.

Compiler optimizations – loop interchange
• Some programs have nested loops that access memory in non-sequential order
  - Simply changing the order of the loops may make them access the data in sequential order…
• What's an example of this?

/* Before: */
for (j = 0; j < 100; j = j + 1)
  for (k = 0; k < 5000; k = k + 1)
    x[k][j] = 2 * x[k][j];

/* After: */
for (k = 0; k < 5000; k = k + 1)
  for (j = 0; j < 100; j = j + 1)
    x[k][j] = 2 * x[k][j];

Loop Interchange Example

/* Before */
for (k = 0; k < 100; k = k + 1)
  for (j = 0; j < 100; j = j + 1)
    for (i = 0; i < 5000; i = i + 1)
      x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k + 1)
  for (i = 0; i < 5000; i = i + 1)
    for (j = 0; j < 100; j = j + 1)
      x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality. (But who really writes loops like this???)

Compiler optimizations – loop fusion
• This one's pretty obvious once you hear what it is…
• Seeks to take advantage of:
  - Programs that have separate sections of code that access the same arrays in different loops
  - Especially when the loops use common data
  - The idea is to "fuse" the loops into one common loop
• What's the target of this optimization?
• Example:

Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i + 1)
  for (j = 0; j < N; j = j + 1)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i + 1)
  for (j = 0; j < N; j = j + 1)
    d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i + 1)
  for (j = 0; j < N; j = j + 1) {
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j];
  }

2 misses per access to a & c vs. one miss per access; improves locality.

Compiler optimizations – blocking
• This is probably the most "famous" of the compiler optimizations to improve cache performance
• Tries to reduce misses by improving temporal locality
• To get a handle on this, you have to work through code on your own
  - Homework!
• This is used mainly with arrays!
• Simplest case??
  - Row-major access

Blocking Example

/* Before */
for (i = 0; i < N; i = i + 1)
  for (j = 0; j < N; j = j + 1) {
    r = 0;
    for (k = 0; k < N; k = k + 1)
      r = r + y[i][k] * z[k][j];
    x[i][j] = r;
  }

• Two inner loops:
  - Read all NxN elements of z[]
  - Read N elements of 1 row of y[] repeatedly
  - Write N elements of 1 row of x[]
• Capacity misses are a function of N & cache size:
  - 2N^3 + N^2 => (assuming no conflicts; otherwise …)
• Idea: compute on a BxB submatrix that fits (see the sketch below)
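For reference, a sketch of the blocked ("after") version that the "compute on a BxB submatrix" idea leads to; this follows the standard Hennessy & Patterson transformation, and the concrete values of N and B are just illustrative (B is chosen so a BxB tile fits in the cache):

```c
#define N 512   /* illustrative size; the slide uses a generic N           */
#define B 32    /* blocking factor: a BxB submatrix should fit in the cache */

double x[N][N], y[N][N], z[N][N];   /* globals: x starts zero-initialized */

void matmul_blocked(void)
{
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            /* work on a BxB tile of z and a strip of y/x rows at a time */
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B && j < N; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B && k < N; k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;    /* accumulate the partial products */
                }
}
```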

Summary of Compiler Optimizations to Reduce Cache Misses (by hand)
• (figure: performance improvement, roughly 1x to 3x, from merged arrays, loop interchange, loop fusion, and blocking on the benchmarks vpenta (nasa7), gmty (nasa7), tomcatv, btrix (nasa7), mxm (nasa7), spice, cholesky (nasa7), and compress)

Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

Reducing the hit time
• Again, recall our average memory access time equation:
  - Hit time + Miss rate * Miss penalty
  - We've talked about reducing the miss rate and the miss penalty – hit time can also be a big component…
• On many machines cache accesses can affect the clock cycle time – so making this small is a good thing!

Small and simple caches
• Why is this good?
  - Generally, smaller hardware is faster – so a small cache should help the hit time…
  - If an L1 cache is small enough, it should fit on the same chip as the actual processing logic…
    - Processor avoids time going off chip!
    - Some designs compromise and keep tags on chip and data off chip – allows for a fast tag check and >> memory capacity
  - Direct mapping also falls under the category of "simple"
    - Relates to the point above as well – you can check the tag and read the data at the same time!

Avoid address translation during cache indexing
• This problem centers around virtual addresses. Should we send the virtual address to the cache?
  - In other words, we have virtual caches vs. physical caches
  - Why is this a problem anyhow?
    - Well, recall from OS that a processor usually deals with processes
    - What if process 1 uses a virtual address xyz and process 2 uses the same virtual address?
    - The data in the cache would be totally different! – called aliasing
• Every time a process is switched, logically we'd have to flush the cache or we'd get false hits.
  - Cost = time to flush + compulsory misses from the empty cache
• I/O must interact with caches, so we need virtual addresses

Separate Instruction and Data Cache
• Multilevel cache is one option for design
• Another view:
  - Separate the instruction and data caches
  - Instead of a unified cache, have separate I-cache and D-cache
  - Problem: what size does each have?

Separate I-cache and D-cache example:
• We want to compare the following:
  - A 16-KB data cache & a 16-KB instruction cache versus a 32-KB unified cache

Miss rates:
Size  | Inst. cache | Data cache | Unified cache
16 KB | 0.64%       | 6.47%      | 2.87%
32 KB | 0.15%       | 4.82%      | 1.99%

• We'll use the average memory access time formula from a few slides ago, but break it up into instruction & data references
• Assume a hit takes 1 clock cycle to process
• Miss penalty = 50 clock cycles
• In the unified cache, a load or store hit takes 1 extra clock cycle b/c having only 1 cache port = a structural hazard
• 75% of accesses are instruction references
• What's the avg. memory access time in each case?

example continued…
• 1st, need to determine the overall miss rate for split caches:
  - (75% x 0.64%) + (25% x 6.47%) = 2.10%
  - This compares to the unified cache miss rate of 1.99%
• Average memory access time – split cache =
  - 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50)
  - = (75% x 1.32) + (25% x 4.235) = 2.05 cycles
• Average memory access time – unified cache =
  - 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50)
  - = (75% x 1.995) + (25% x 2.995) = 2.24 cycles
• Despite its higher miss rate, access time is faster for the split cache!
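The comparison on this slide, checked in C (the miss rates, penalty, and access mix are the slide's):

```c
#include <stdio.h>

int main(void) {
    double inst_frac = 0.75, data_frac = 0.25, penalty = 50.0;

    /* split 16KB+16KB: I-cache 0.64% misses, D-cache 6.47% misses */
    double split = inst_frac * (1.0 + 0.0064 * penalty)
                 + data_frac * (1.0 + 0.0647 * penalty);

    /* unified 32KB: 1.99% misses, +1 cycle on data hits (structural hazard) */
    double unified = inst_frac * (1.0 + 0.0199 * penalty)
                   + data_frac * (2.0 + 0.0199 * penalty);

    printf("split   AMAT = %.2f cycles\n", split);    /* ~2.05 */
    printf("unified AMAT = %.2f cycles\n", unified);  /* ~2.24 */
    return 0;
}
```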

Reducing Time to Hit in Cache: Trace Cache
• Trace caches
  - ILP technique
  - Trace cache finds a dynamic sequence of instructions, including taken branches, to load into a cache block
  - Branch prediction is folded into the cache

The Trace Cache Proposal
• (figure: basic blocks A-G laid out statically across I-cache line boundaries vs. packed along the dynamic 90%/10% branch paths into trace-cache lines)

Cache Summary

• Cache performance is crucial to overall performance
• Optimize performance
  - Miss rates
  - Miss penalty
  - Hit time
• Software optimizations can lead to improved performance

• Next . . . Code Optimization in Compilers

