
ECE 4750, Fall 2020: T04 Fundamental Memory Microarchitecture

School of Electrical and Computer Engineering, Cornell University

revision: 2020-10-28-14-41

1. Memory Microarchitectural Design Patterns
   1.1. Transactions and Steps
   1.2. Microarchitecture Overview

2. FSM Cache
   2.1. High-Level Idea for FSM Cache
   2.2. FSM Cache Datapath
   2.3. FSM Cache Control Unit
   2.4. Analyzing Performance

3. Pipelined Cache
   3.1. High-Level Idea for Pipelined Cache
   3.2. Pipelined Cache Datapath and Control Unit
   3.3. Analyzing Performance
   3.4. Pipelined Cache with TLB

4. Cache Microarchitecture Optimizations
   4.1. Reduce Hit Time
   4.2. Reduce Miss Rate
   4.3. Reduce Miss Penalty
   4.4. Cache Optimization Summary

5. Case Study: ARM Cortex A8 and Intel Core i7
   5.1. ARM Cortex A8
   5.2. Intel Core i7

Copyright © 2018 Christopher Batten, Christina Delimitrou. All rights reserved. This handout was originally prepared by Prof. Christopher Batten at Cornell University for ECE 4750 / CS 4420. It has since been updated by Prof. Christina Delimitrou in 2017-2020. Download and use of this handout is permitted for individual educational non-commercial purposes only. Redistribution either in part or in whole by either commercial or non-commercial means requires written permission.


1. Memory Microarchitectural Design Patterns

\[
\frac{\text{Time}}{\text{Sequence}} =
\frac{\text{Mem Accesses}}{\text{Sequence}} \times
\frac{\text{Avg Cycles}}{\text{Mem Access}} \times
\frac{\text{Time}}{\text{Cycle}}
\]

\[
\frac{\text{Avg Cycles}}{\text{Mem Access}} =
\frac{\text{Avg Cycles}}{\text{Hit}} +
\frac{\text{Num Misses}}{\text{Num Accesses}} \times
\frac{\text{Avg Extra Cycles}}{\text{Miss}}
\]

Microarchitecture        Hit Latency   Extra Accesses for Translation
FSM Cache                >1            1+
Pipelined Cache          ≈1            1+
Pipelined Cache + TLB    ≈1            ≈0
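As a quick worked example with made-up numbers (not from this handout): if a hit takes 1 cycle, 1 in 10 accesses misses, and each miss costs an extra 20 cycles, then

\[
\frac{\text{Avg Cycles}}{\text{Mem Access}} = 1 + \frac{1}{10} \times 20 = 3 ,
\]

and a sequence of 100 accesses with a 2 ns cycle time takes 100 × 3 × 2 ns = 600 ns.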

1.1. Transactions and Steps

• We can think of each memory access as a transaction
• Executing a memory access involves a sequence of steps
  – Check Tag: Check one or more tags in cache
  – Select Victim: Select victim line from cache using replacement policy
  – Evict Victim: Evict victim line from cache and write victim to memory
  – Refill: Refill requested line by reading line from memory
  – Write Mem: Write requested word to memory
  – Access Data: Read or write requested word in cache
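To make the steps concrete, here is a small C sketch (illustrative only; the enum and array names are mine) of the step sequences a write-through, no-write-allocate cache might follow for each kind of transaction:

#include <stdio.h>

// Steps from Section 1.1, encoded as an enum (names invented for this sketch).
typedef enum { CHECK_TAG, SELECT_VICTIM, EVICT_VICTIM, REFILL,
               WRITE_MEM, ACCESS_DATA, DONE } step_t;

// One plausible set of step sequences for a write-through, no-write-allocate
// cache (no eviction is ever needed since lines are never dirty).
static const step_t read_hit[]   = { CHECK_TAG, ACCESS_DATA, DONE };
static const step_t read_miss[]  = { CHECK_TAG, SELECT_VICTIM, REFILL,
                                     ACCESS_DATA, DONE };
static const step_t write_hit[]  = { CHECK_TAG, ACCESS_DATA, WRITE_MEM, DONE };
static const step_t write_miss[] = { CHECK_TAG, WRITE_MEM, DONE }; // no allocate

int main( void ) {
  static const char *names[]  = { "Check Tag", "Select Victim", "Evict Victim",
                                  "Refill", "Write Mem", "Access Data" };
  const step_t      *seqs[]   = { read_hit, read_miss, write_hit, write_miss };
  const char        *labels[] = { "read hit", "read miss",
                                  "write hit", "write miss" };
  for ( int i = 0; i < 4; i++ ) {
    printf( "%s:", labels[i] );
    for ( const step_t *s = seqs[i]; *s != DONE; s++ )
      printf( " [%s]", names[*s] );   // print each step of this transaction
    printf( "\n" );
  }
  return 0;
}

A write-back, write-allocate cache would additionally perform Evict Victim before Refill whenever the victim line is dirty, for both read and write misses.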


Steps for Write-Through with No Write Allocate

Steps for Write-Back with Write Allocate

1.2. Microarchitecture Overview

[Figure: overall memory system organization. The processor issues 32b mmureq/mmuresp messages to a Memory Management Unit, which issues 32b cachereq/cacheresp messages to the cache. The cache contains a control unit and a datapath (a tag array and a data array) connected by control/status signals, and communicates with main memory over 128b mreq/mresp messages. Main memory takes >1 cycle and is modeled as combinational.]


2. FSM Cache

\[
\frac{\text{Time}}{\text{Sequence}} =
\frac{\text{Mem Accesses}}{\text{Sequence}} \times
\frac{\text{Avg Cycles}}{\text{Mem Access}} \times
\frac{\text{Time}}{\text{Cycle}}
\]

\[
\frac{\text{Avg Cycles}}{\text{Mem Access}} =
\frac{\text{Avg Cycles}}{\text{Hit}} +
\frac{\text{Num Misses}}{\text{Num Accesses}} \times
\frac{\text{Avg Extra Cycles}}{\text{Miss}}
\]

Microarchitecture        Hit Latency   Extra Accesses for Translation
FSM Cache                >1            1+
Pipelined Cache          ≈1            1+
Pipelined Cache + TLB    ≈1            ≈0

Assumptions
• Page-based translation, no TLB, physically addressed cache
• Single-ported combinational SRAMs for tag and data storage
• Unrealistic combinational main memory
• Cache requests are 4 B

Configuration
• Four 16 B cache lines
• Two-way set-associative
• Replacement policy: LRU
• Write policy: write-through, no write allocate

[Figure: the same block diagram as in Section 1.2: processor/MMU interface (mmureq/mmuresp, 32b), cache with control unit and tag/data arrays (cachereq/cacheresp, 32b), and combinational, >1 cycle main memory (mreq/mresp, 128b).]


2.1. High-Level Idea for FSM Cache

[Figure: transaction steps as a flow chart. Check Tag leads to Access Data on a read hit; on a read miss it leads to Select Victim, then Refill, then Access Data; a write goes to Write Mem (and also accesses the data array on a write hit).]

[Figure: cycle-by-cycle view of the FSM cache executing a sequence of transactions (read hits, a write hit, and a read miss). Each transaction occupies the cache for several cycles, e.g. a read hit takes Check Tag then Access Data, and a read miss takes Check Tag, Select Victim, Refill, then Access Data; no new transaction starts until the previous one finishes.]


2.2. FSM Cache Datapath

As with processors, we design our cache datapath by incrementally adding support for each transaction and resolving conflicts with muxes.

Implementing READ transactions that hit

[Figure: FSM cache datapath for read hits. The cachereq address register feeds a tag/index/offset split (27b tag, 1b index, 2b word offset, 2b byte offset of 00). The index reads both tag arrays (Way 0 and Way 1); each stored tag is compared against the request tag to produce tarray0_match/tarray1_match and the hit signal. The data array is read and the offset selects one 32b word out of the 128b line ([127:96], [95:64], [63:32], [31:0]) for cacheresp.data. Control signals: tarray0_en, tarray1_en, darray_en.]

MT: Check tag
MRD: Read data array, return cacheresp

Implementing READ transactions that miss

[Figure: FSM cache datapath extended for read misses. Adds to the read-hit datapath: a victim mux (tarray_sel, driven by the victim signal) to choose which way to refill, tag array write enables (tarray0_wen, tarray1_wen), a path that forms memreq.addr from the request tag and index, and a path that writes the 128b memresp.data into the data array (darray_wen, word enable 1111, z4b/z4b_sel).]

MT: Check tag
MRD: Read data array, return cacheresp
R0: Send refill memreq, get memresp
R1: Write data array with refill cache line


Implementing WRITE transactions that miss

[Figure: FSM cache datapath extended for write misses. Adds a path from cachereq.data (32b, zero-extended to 128b) to memreq.data so the write can be sent to memory alongside the refill path; control signals darray_en, darray_wen, z4b_sel, word enable.]

MT: Check tag
MRD: Read data array, return cacheresp
R0: Send refill memreq, get memresp
R1: Write data array with refill cache line
MWD: Send write memreq, write data array

Implementing WRITE transactions that hit

[Figure: FSM cache datapath extended for write hits. Adds a repl unit that replicates the 32b cachereq.data across the 128b line, a word_write_sel / darray_write_en_sel mux for the per-word write enables, and the data-array write path so that a write hit can update the data array in the MWD state as well as send the write memreq.]

MT: Check tag
MRD: Read data array, return cacheresp
R0: Send refill memreq, get memresp
R1: Write data array with refill cache line
MWD: Send write memreq, write data array


2.3. FSM Cache Control Unit

[Figure: FSM state diagram. From MT, a read hit goes to MRD; a read miss goes to R0, then R1, then MRD; a write (hit or miss) goes to MWD; MRD and MWD return to MT.]

We will need to keep valid bits in the control unit, with one valid bit for every cache line. We will also need to keep use bits, updated on every access, to indicate which way was the last "used" line. Assume we create the following two control signals generated by the FSM control unit:

hit    = ( tarray0_match && valid0[idx] ) || ( tarray1_match && valid1[idx] )

victim = !use[idx]
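To tie the state diagram and these signals together, here is a rough C model of the next-state logic (a sketch only: the types and function are invented for illustration, and the real control unit is hardware, not software):

// Sketch of the FSM cache control unit's next-state logic.
// State names and the hit/victim definitions follow the handout;
// everything else is invented for this sketch.
typedef enum { MT, MRD, R0, R1, MWD } state_t;

typedef struct {
  int valid0[2], valid1[2];  // one valid bit per line (2 sets x 2 ways)
  int use[2];                // last-used way per set, for LRU victim selection
} ctrl_state_t;

state_t next_state( state_t state, int cachereq_val, int is_read,
                    int tarray0_match, int tarray1_match,
                    const ctrl_state_t *cs, int idx )
{
  int hit    = ( tarray0_match && cs->valid0[idx] )
            || ( tarray1_match && cs->valid1[idx] );
  int victim = !cs->use[idx];   // way to refill on a miss (used in R0/R1)
  (void) victim;

  switch ( state ) {
    case MT:                              // check tag
      if ( !cachereq_val ) return MT;     // no request: stay in MT
      if ( is_read )       return hit ? MRD : R0;
      return MWD;                         // write-through: writes go to MWD
    case R0:  return R1;                  // send refill memreq, get memresp
    case R1:  return MRD;                 // write refilled line into data array
    case MRD: return MT;                  // read data array, send cacheresp
    case MWD: return MT;                  // send write memreq (update darray on hit)
    default:  return MT;
  }
}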

The control signal table, to be filled in for each state:

        tarray0    tarray1    darray           worden   z4b   memreq      cacheresp
        en   wen   en   wen   en   wen   sel   sel      sel   val   op    val
MT
MRD
R0
R1
MWD


2.4. Analyzing Performance

\[
\frac{\text{Time}}{\text{Sequence}} =
\frac{\text{Mem Accesses}}{\text{Sequence}} \times
\frac{\text{Avg Cycles}}{\text{Mem Access}} \times
\frac{\text{Time}}{\text{Cycle}}
\]

\[
\frac{\text{Avg Cycles}}{\text{Mem Access}} =
\frac{\text{Avg Cycles}}{\text{Hit}} +
\frac{\text{Num Misses}}{\text{Num Accesses}} \times
\frac{\text{Avg Extra Cycles}}{\text{Miss}}
\]

Estimating cycle time

[Figure: the complete FSM cache datapath from Section 2.2, repeated here for reference when estimating the cycle time.]

• register read/write = 1τ
• tag array read/write = 10τ
• data array read/write = 10τ
• mem read/write = 20τ
• decoder = 3τ
• comparator = 10τ
• mux = 3τ
• repl unit = 0τ
• z4b = 0τ
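One hedged way to combine these numbers (which components sit on each state's critical path is my assumption, not given in the handout): the FSM cache's cycle time is set by its slowest state, for example

\[
T_{\text{cycle}} = \max_{\text{state}} T_{\text{critical path}},\qquad
T_{\text{MT}} \approx 1\tau_{\text{reg}} + 10\tau_{\text{tag}} + 10\tau_{\text{cmp}} \approx 21\tau,\qquad
T_{\text{R0}} \approx 20\tau_{\text{mem}} + \text{register/mux delays},
\]

so under these assumptions the cycle time is roughly 20-25τ, set by whichever of the tag-check or memory-access states has the longer path.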


Estimating AMAL

Consider the following sequence of memory accesses which might correspond to copying 4 B elements from a source array to a destination array. Each array contains 64 elements. What is the AMAL?

rd 0x1000  wr 0x2000
rd 0x1004  wr 0x2004
rd 0x1008  wr 0x2008
...
rd 0x1040  wr 0x2040

Consider the following sequence of memory accesses which might correspond to incrementing 4 B elements in an array. The array contains 64 elements. What is the AMAL?

rd 0x1000  wr 0x1000
rd 0x1004  wr 0x1004
rd 0x1008  wr 0x1008
...
rd 0x1040  wr 0x1040


3. Pipelined Cache

\[
\frac{\text{Time}}{\text{Sequence}} =
\frac{\text{Mem Accesses}}{\text{Sequence}} \times
\frac{\text{Avg Cycles}}{\text{Mem Access}} \times
\frac{\text{Time}}{\text{Cycle}}
\]

\[
\frac{\text{Avg Cycles}}{\text{Mem Access}} =
\frac{\text{Avg Cycles}}{\text{Hit}} +
\frac{\text{Num Misses}}{\text{Num Accesses}} \times
\frac{\text{Avg Extra Cycles}}{\text{Miss}}
\]

Microarchitecture        Hit Latency   Extra Accesses for Translation
FSM Cache                >1            1+
Pipelined Cache          ≈1            1+
Pipelined Cache + TLB    ≈1            ≈0

Assumptions
• Page-based translation, no TLB, physically addressed cache
• Single-ported combinational SRAMs for tag and data storage
• Unrealistic combinational main memory
• Cache requests are 4 B

Configuration
• Four 16 B cache lines
• Direct-mapped
• Replacement policy: LRU
• Write policy: write-through, no write allocate

[Figure: the same block diagram as in Section 1.2: memory management unit (mmureq/mmuresp, 32b), cache with control unit and tag/data arrays (cachereq/cacheresp, 32b), and combinational, >1 cycle main memory (mreq/mresp, 128b).]


3.1. High-Level Idea for Pipelined Cache

[Figure: transaction steps as a flow chart, as in Section 2.1. Check Tag leads to Access Data on a read hit; on a read miss it leads to Select Victim, then Refill, then Access Data; a write goes to Write Mem (and also accesses the data array on a write hit).]

[Figure: cycle-by-cycle comparison of the FSM cache and the pipelined cache on the same sequence of transactions. In the FSM cache each transaction occupies the cache for its full sequence of steps before the next one starts. In the pipelined cache, Check Tag and Access Data (or Write Mem) occur in back-to-back pipeline stages, so independent hits overlap; a read miss still expands into Check Tag, Select Victim, Refill, Access Data and stalls later transactions.]


3.2. Pipelined Cache Datapath and Control Unit

As with processors, we incrementally add support for each transaction and resolve conflicts using muxes.

Implementing READ transactions that hit

[Figure: pipelined cache datapath for read hits. In the M0 stage the cachereq address (26b tag, 2b index, 2b word offset) indexes the tag array and the comparison produces tarray_match. In the M1 stage the data array is read and the offset selects one 32b word of the 128b line ([127:96], [95:64], [63:32], [31:0]) for cacheresp.data. Control signals: tarray_en, darray_en.]

Implementing WRITE transactions that hit

[Figure: pipelined cache datapath extended for write hits. The 32b cachereq.data is replicated across the 128b line by a repl unit and written into the data array in M1 (darray_wen), and the write is also sent to memory through memreq.addr/memreq.data (write-through).]


Implementing transactions that miss

[Figure: pipelined cache datapath extended for misses. The M0 stage gains a small FSM (pipe, R0, R1) that stalls the pipeline on a miss: the tag array gets a write enable (tarray_wen), the refill memreq is sent, and the 128b memresp.data is written into the data array (word enable 1111, z4b/z4b_sel, darray_write_mux_sel) before the transaction proceeds to M1.]

• Hybrid pipeline/FSM design pattern
• Hit path is pipelined with two-cycle hit latency
• Miss path stalls in M0 stage to refill cache line

Pipeline diagram for pipelined cache with 2-cycle hit latency

rd (hit) wr (hit) rd (miss) rd (miss) wr (miss) rd (hit)


Parallel read with pipelined write datapath

[Figure: parallel-read datapath. A parallel_read_sel mux allows a read to access the data array in the M0 stage, in parallel with the tag check, while writes still access the data array in M1; the rest of the datapath (repl unit, miss FSM with R0/R1, refill path) is unchanged.]

Pipeline diagram for parallel read with pipelined write

rd (hit) rd (hit) rd (hit) wr (hit) wr (hit) rd (hit)

• Achieves single-cycle hit latency for reads
• Two-cycle hit latency for writes, but is this latency observable?
• With write acks, send write-ack back in M0 stage
• How do we resolve structural hazards?


Resolving structural hazard by exposing in ISA

rd (hit) wr (hit) nop rd (hit)

Resolving structural hazard with hardware stalling

rd (hit) wr (hit) rd (hit)

ostall_M0 = val_M0 && ( type_M0 == RD ) && val_M1 && ( type_M1 == WR )

Resolving structural hazard with hardware duplication

rd (hit) wr (hit) rd (hit)

Resolving RAW data hazard with software scheduling

Software scheduling: the hazard depends on the memory address, which is difficult to know at compile time!

Resolving RAW data hazard with hardware stalling

ostall_M0 = ...


Resolving RAW data hazard with hardware bypassing

We could use the previous stall signal as our bypass signal, but we will also need a new bypass path in our datapath. Draw this new bypass path on the following datapath diagram.

[Figure: the parallel-read/pipelined-write datapath, annotated with the M0 read address and the M1 write address used to detect the RAW hazard; the new bypass path is to be drawn on this diagram.]


Parallel read and pipelined write in set associative caches

To implement parallel read in set-associative caches, we must speculatively read a line from each way in parallel with tag check. This can be expensive in terms of latency, motivating two-cycle hit latencies for highly associative caches.
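A small C model of that idea (illustrative only; the array shapes and names are mine), for a two-way set-associative cache:

#include <stdint.h>

#define NSETS          2   // four 16 B lines, two-way set-associative
#define WORDS_PER_LINE 4   // 16 B lines of 4 B words

// Both ways' data words are read speculatively while the tags are compared;
// the matching way then steers the data through a late mux (way_sel).
// The miss case is omitted to keep the sketch short.
uint32_t parallel_read( uint32_t tag, int idx, int off,
                        const uint32_t tarray[2][NSETS],
                        const uint32_t darray[2][NSETS][WORDS_PER_LINE] )
{
  uint32_t d0      = darray[0][idx][off];        // speculative read, way 0
  uint32_t d1      = darray[1][idx][off];        // speculative read, way 1
  int      way_sel = ( tarray[1][idx] == tag );  // tag check in parallel
  return way_sel ? d1 : d0;                      // late select after tag check
}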

[Figure: parallel-read/pipelined-write datapath for a two-way set-associative cache. Both tag arrays (Way 0, Way 1) are checked in M0 (tarray0_match, tarray1_match), both data-array ways are read speculatively, and a way_sel mux selects the matching way's data; the repl unit, refill FSM (R0/R1), and write path are as before.]


3.3. Analyzing Performance

\[
\frac{\text{Time}}{\text{Sequence}} =
\frac{\text{Mem Accesses}}{\text{Sequence}} \times
\frac{\text{Avg Cycles}}{\text{Mem Access}} \times
\frac{\text{Time}}{\text{Cycle}}
\]

\[
\frac{\text{Avg Cycles}}{\text{Mem Access}} =
\frac{\text{Avg Cycles}}{\text{Hit}} +
\frac{\text{Num Misses}}{\text{Num Accesses}} \times
\frac{\text{Avg Extra Cycles}}{\text{Miss}}
\]

Estimating cycle time

[Figure: the parallel-read/pipelined-write datapath from earlier in this section, repeated here for reference when estimating the cycle time.]

• register read/write = 1τ
• tag array read/write = 10τ
• data array read/write = 10τ
• mem read/write = 20τ
• decoder = 3τ
• comparator = 10τ
• mux = 3τ
• repl unit = 0τ
• z4b = 0τ


Estimating AMAL

Assume a parallel-read/pipelined-write microarchitecture with hardware duplication to resolve the structural hazard and stalling to resolve RAW hazards. Consider the following sequence of memory accesses which might correspond to copying 4 B elements from a source array to a destination array. Each array contains 64 elements. What is the AMAL?

rd 0x1000  wr 0x2000
rd 0x1004  wr 0x2004
rd 0x1008  wr 0x2008
rd 0x100c  wr 0x200c
rd 0x1010


3.4. Pipelined Cache with TLB

How should we integrate an MMU (TLB) into a pipelined cache?

Physically Addressed Caches

Perform memory translation before cache access

[Figure: the processor sends a virtual address (VA) to the TLB; the resulting physical address (PA) is used to access the cache and then main memory.]

• Advantages:
  – Physical addresses are unique, so cache entries are unique
  – Updating memory translation simply requires changing TLB
• Disadvantages:
  – Increases hit latency

Virtually Addressed Caches

Perform memory translation after or in parallel with cache access

[Figure: two arrangements for a virtually addressed cache. In the first, the processor's virtual address accesses the cache directly, and only on the way to main memory is it translated by the TLB into a physical address. In the second, the TLB lookup happens in parallel with the cache access.]


• Advantages:
  – Simple, one-step lookup for hits
• Disadvantages:
  – Intra-program protection (store protection bits in cache?)
  – I/O uses physical addresses (map into virtual address space?)
  – Virtual address homonyms
  – Virtual address synonyms (aliases)

Virtual Address Homonyms

A single virtual address points to two different physical addresses.

[Figure: homonym example. Program 1's page table maps VA 0x3000 to physical page 0x7000 (PA0), while Program 2's page table maps the same VA 0x3000 to physical page 0x5000 (PA1). The virtually addressed cache holds a line with V=1, tag 0x3000, and data 0xcafe that belongs to Program 1; legend distinguishes physical pages present in physical memory from pages not allocated.]

• Example scenario
  – Program 1 brings VA 0x3000 into cache
  – Program 1 is context swapped for program 2
  – Program 2 hits in cache, but gets incorrect data!
• Potential solutions
  – Flush cache on context swap
  – Store program IDs (address space IDs) in cache


Virtual Address Synonyms (Aliases)

[Figure: synonym example. Program 1's page table maps VA 0x3000 and Program 2's page table maps VA 0x1000 to the same physical page 0x4000 (PA0). The virtually addressed cache ends up holding the same physical data under two different tags: Way 0 has V=1, tag 0x3000, data 0xcafe and Way 1 has V=1, tag 0x1000, data 0xbeef.]

• Example scenarios
  – User and OS can potentially have different VAs point to same PA
  – Memory map the same file (via mmap) in two different programs
• Potential solutions
  – Hardware checks all ways (and potentially different sets) on a miss to ensure that a given physical address can only live in one location in the cache
  – Software forces aliases to share some address bits (page coloring), reducing the number of sets we need to check on a miss (or removing the need to check any other locations for a direct-mapped cache)


Virtually Indexed and Physically Tagged Caches

[Figure: virtually indexed, physically tagged lookup. The virtual address is split into a virtual page number (v bits) and page offset (m bits); the cache index (n bits) and block offset (b bits) are taken from the low-order bits. The TLB translates the VPN into a physical page number while a direct-mapped cache with 2^n lines of 2^b-byte cache lines is indexed in parallel; the physical tag (k bits) from the TLB is then compared against the tag read from the cache to produce hit?.]

• Page offset is the same in VA and PA
• Up to n bits of physical address available without translation
• If index bits + cache offset bits (n+b) ≤ page offset bits (m), we can do translation in parallel with reading out the physical tag and data
• Complete (physical) tag check once tag/data access is complete
• With 4 KB pages, a direct-mapped cache must be <= 4 KB
• Larger page sizes (decrease k) or higher associativity (decrease n) enable larger virtually indexed, physically tagged caches
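A quick check of this constraint with concrete numbers (mine, just for illustration): with 4 KB pages and 64 B cache lines,

\[
m = 12,\quad b = 6,\quad
\text{a 4 KB direct-mapped cache has } n = \log_2 (4096/64) = 6,\quad
n + b = 12 \le m = 12 ,
\]

so the index and block offset fit entirely within the page offset and the cache can be indexed in parallel with the TLB lookup. An 8 KB direct-mapped cache (n = 7) would need one translated bit, but making it two-way set-associative brings n back to 6.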


4. Cache Microarchitecture Optimizations

[Figure: Processor, MMU, Cache, and Main Memory in series, with the hit latency measured at the cache and the miss penalty measured at main memory.]

AMAL = Hit Latency + ( Miss Rate × Miss Penalty )

• Reduce hit time
  – Small and simple caches
• Reduce miss rate
  – Large block size
  – Large cache size
  – High associativity
  – Hardware prefetching
  – Compiler optimizations
• Reduce miss penalty
  – Multi-level caches
  – Prioritize reads

4.1. Reduce Hit Time


[Figure 2.4 from Hennessy & Patterson: energy per read (in nanojoules, y-axis from 0 to 0.5 nJ) versus cache size (16 KB, 32 KB, 64 KB, 128 KB, 256 KB) for 1-way, 2-way, 4-way, and 8-way set-associative caches. Caption: "Energy consumption per read increases as cache size and associativity are increased. As in the previous figure, CACTI is used for the modeling with the same technology parameters. The large penalty for eight-way set associative caches is due to the cost of reading out eight tags and the corresponding data in parallel."]

Second Optimization: Way Prediction to Reduce Hit Time

Another approach reduces conflict misses and yet maintains the hit speed of a direct-mapped cache. In way prediction, extra bits are kept in the cache to predict the way, or block within the set, of the next cache access. This prediction means the multiplexor is set early to select the desired block, and only a single tag comparison is performed that clock cycle in parallel with reading the cache data. A miss results in checking the other blocks for matches in the next clock cycle. Added to each block of a cache are block predictor bits. The bits select which of the blocks to try on the next cache access. If the predictor is correct, the cache access latency is the fast hit time. If not, it tries the other block, changes the way predictor, and has a latency of one extra clock cycle. Simulations suggest that set prediction accuracy is in excess of 90% for a two-way set associative cache and 80% for a four-way set associative cache, with better accuracy on I-caches than D-caches. Way prediction yields lower average memory access time for a two-way set associative cache if it is at least 10% faster, which is quite likely. Way prediction was first used in the MIPS R10000 in the mid-1990s. It is popular in processors that use two-way set associativity and is used in the ARM Cortex-A8 with four-way set associative caches. For very fast processors, it may be challenging to implement the one-cycle stall that is critical to keeping the way prediction penalty small.

4.2. Reduce Miss Rate

Large Block Size

• Less tag overhead
• Exploit fast burst transfers from DRAM and over wide on-chip busses
• Can waste bandwidth if data is not used
• Fewer blocks → more conflicts

Large Cache Size or High Associativity

• If cache size is doubled, the miss rate usually drops by about a factor of √2
• A direct-mapped cache of size N has about the same miss rate as a two-way set-associative cache of size N/2
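For example (hypothetical miss rates, purely to illustrate the rule of thumb): a cache with a 4% miss rate at 32 KB would be expected to have roughly

\[
\frac{4\%}{\sqrt{2}} \approx 2.8\%
\]

at 64 KB, and roughly 2% at 128 KB.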


Hardware Prefetching

[Figure: Processor, L1 D$, and Main Memory in series, with a prefetch buffer (holding tag + cache line pairs) sitting next to the L1 data cache and filled by the prefetcher.]

• Previous techniques only help capacity and conflict misses
• Hardware prefetcher looks for patterns in the miss address stream
• Attempts to predict what the next miss might be
• Prefetches this next miss into a prefetch buffer
• Very effective in reducing compulsory misses for streaming accesses
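A minimal sketch of the idea in C (a simple stride predictor; the names are invented and real prefetchers are considerably more sophisticated):

#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 64u

// Watch the miss-address stream; when two consecutive misses are separated by
// the same stride, predict the next miss and prefetch that line into the buffer.
typedef struct {
  uint32_t last_miss_line;
  int32_t  last_stride;
} prefetcher_t;

// Stand-in for issuing a refill into the prefetch buffer.
static void prefetch_into_buffer( uint32_t line_addr ) {
  printf( "prefetch line 0x%08x\n", line_addr );
}

static void on_cache_miss( prefetcher_t *p, uint32_t miss_addr ) {
  uint32_t miss_line = miss_addr & ~(LINE_SIZE - 1u);
  int32_t  stride    = (int32_t)( miss_line - p->last_miss_line );
  if ( stride != 0 && stride == p->last_stride )
    prefetch_into_buffer( miss_line + (uint32_t)stride );  // pattern detected
  p->last_stride    = stride;
  p->last_miss_line = miss_line;
}

int main( void ) {
  prefetcher_t p = { 0, 0 };
  on_cache_miss( &p, 0x1000 );  // first miss: no prediction yet
  on_cache_miss( &p, 0x1040 );  // stride of 64 seen once
  on_cache_miss( &p, 0x1080 );  // stride repeats: prefetch 0x10c0
  return 0;
}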


Compiler Optimizations

• Restructuring code affects the data block access sequence
  – Group data accesses together to improve spatial locality
  – Re-order data accesses to improve temporal locality
• Prevent data from entering the cache
  – Useful for variables that will only be accessed once before eviction
  – Needs a mechanism for software to tell hardware not to cache data ("no-allocate" instruction hints or page table bits)
• Kill data that will never be used again
  – Streaming data exploits spatial locality but not temporal locality
  – Replace into dead-cache locations

Loop Interchange and Fusion

What type of locality does each optimization improve?

Loop interchange:

// before
for(j=0; j < N; j++) {
  for(i=0; i < M; i++) {
    x[i][j] = 2 * x[i][j];
  }
}

// after
for(i=0; i < M; i++) {
  for(j=0; j < N; j++) {
    x[i][j] = 2 * x[i][j];
  }
}

What type of locality does this improve?

Loop fusion:

// before
for(i=0; i < N; i++)
  a[i] = b[i] * c[i];
for(i=0; i < N; i++)
  d[i] = a[i] * c[i];

// after
for(i=0; i < N; i++) {
  a[i] = b[i] * c[i];
  d[i] = a[i] * c[i];
}

What type of locality does this improve?


Matrix Multiply with Naive Code

for(i=0; i < N; i++)
  for(j=0; j < N; j++) {
    r = 0;
    for(k=0; k < N; k++)
      r = r + y[i][k] * z[k][j];
    x[i][j] = r;
  }

[Figure: access patterns of the x, y, and z matrices for one (i,j) iteration; legend: not touched, old access, new access.]

Matrix Multiply with Cache Tiling

for(jj=0; jj < N; jj=jj+B)
  for(kk=0; kk < N; kk=kk+B)
    for(i=0; i < N; i++)
      for(j=jj; j < min(jj+B,N); j++) {
        r = 0;
        for(k=kk; k < min(kk+B,N); k++)
          r = r + y[i][k] * z[k][j];
        x[i][j] = x[i][j] + r;
      }

[Figure: access patterns of the x, y, and z matrices with B×B tiling; legend: not touched, old access, new access.]


4.3. Reduce Miss Penalty

Multi-Level Caches

[Figure: Processor, L1 Cache, L2 Cache, and Main Memory in series; an L1 hit is served by the L1, and an L1 miss that hits in the L2 is served by the L2.]

\[
\text{AMAL}_{L1} = \text{Hit Latency of L1} + ( \text{Miss Rate of L1} \times \text{AMAL}_{L2} )
\]

\[
\text{AMAL}_{L2} = \text{Hit Latency of L2} + ( \text{Miss Rate of L2} \times \text{Miss Penalty of L2} )
\]
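Plugging in illustrative numbers (mine, not from the handout): with an L1 hit latency of 1 cycle and a 10% L1 miss rate, an L2 hit latency of 10 cycles, a 20% L2 local miss rate, and a 100-cycle L2 miss penalty,

\[
\text{AMAL}_{L2} = 10 + 0.2 \times 100 = 30 ,\qquad
\text{AMAL}_{L1} = 1 + 0.1 \times 30 = 4 \text{ cycles per access.}
\]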

• Local miss rate = misses in cache / accesses to cache
• Global miss rate = misses in cache / processor memory accesses
• Misses per instruction = misses in cache / number of instructions

• Use a smaller L1 if there is also an L2
  – Trade increased L1 miss rate for reduced L1 hit time & L1 miss penalty
  – Reduces average access energy
• Use a simpler write-through L1 with on-chip L2
  – Write-back L2 cache absorbs write traffic, doesn't go off-chip
  – Simplifies processor pipeline
  – Simplifies on-chip coherence issues
• Inclusive Multilevel Cache
  – Inner cache holds copy of data in outer cache
  – External coherence is simpler
• Exclusive Multilevel Cache
  – Inner cache may hold data not in outer cache
  – Swap lines between inner/outer cache on miss


Prioritize Reads

[Figure: Processor, Cache, and Main Memory, with a write buffer between the cache and main memory; hits are served by the cache, while misses and buffered writes go to main memory.]

• Processor is not stalled on writes, and read misses can go ahead of writes to main memory
• Write buffer may hold the updated value of a location needed by a read miss
  – On a read miss, wait for the write buffer to be empty
  – Or check write buffer addresses and bypass
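A sketch of that address check in C (the structure and names are mine; in hardware this is an associative search over the buffer entries):

#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 8

// One buffered write that has left the cache but not yet reached main memory.
typedef struct {
  bool     valid;
  uint32_t addr;   // word-aligned address of the buffered write
  uint32_t data;
} wb_entry_t;

typedef struct { wb_entry_t entry[WB_ENTRIES]; } write_buffer_t;

// On a read miss, check every buffered write. If one matches the read address,
// bypass its data instead of waiting for the buffer to drain; otherwise the
// read miss can safely be sent to memory ahead of the buffered writes.
// (Simplification: assumes at most one buffered write per address.)
bool read_miss_bypass( const write_buffer_t *wb, uint32_t read_addr,
                       uint32_t *data_out )
{
  for ( int i = 0; i < WB_ENTRIES; i++ ) {
    if ( wb->entry[i].valid && wb->entry[i].addr == read_addr ) {
      *data_out = wb->entry[i].data;   // bypass the buffered value
      return true;
    }
  }
  return false;                        // no conflict: prioritize the read
}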

4.4. Cache Optimization Summary

Technique                    Hit Lat   Miss Rate   Miss Penalty   BW   HW
Smaller caches               −         +                               0
Avoid TLB before indexing    −                                         1
Large block size                       −           +                   0
Large cache size             +         −                               1
High associativity           +         −                               1
Hardware prefetching                   −                               2
Compiler optimizations                 −                               0
Multi-level cache                                   −                  2
Prioritize reads                                    −                  1
Pipelining                   +                                    +    1

(− means the technique decreases that term, + means it increases it; HW is relative hardware cost/complexity.)


5. Case Study: ARM Cortex A8 and Intel Core i7

5.1. ARM Cortex A8

[Excerpt from Hennessy & Patterson, Section 2.6. The text notes that Figure 2.16 shows how the 32-bit virtual address is used to index the TLB and the caches, assuming 32 KB primary caches and a 512 KB secondary cache with a 16 KB page size, and that the memory hierarchy of the Cortex-A8 was simulated with 32 KB primary caches and a 1 MB eight-way set associative L2 cache using the integer Minnespec benchmarks (KleinOsowski and Lilja [2002]), a version of the SPEC2000 benchmarks with inputs reduced to cut running times by several orders of magnitude without changing the instruction mix.]

[Figure 2.16: the virtual address, physical address, indexes, tags, and data blocks for the ARM Cortex-A8 data caches and data TLB. Since the instruction and data hierarchies are symmetric, only one is shown. The TLB (instruction or data) is fully associative with 32 entries. The L1 cache is four-way set associative with 64-byte blocks and 32 KB capacity. The L2 cache is eight-way set associative with 64-byte blocks and 1 MB capacity. The figure does not show the valid bits and protection bits for the caches and TLB, nor the way prediction bits that would dictate the predicted bank of the L1 cache. Field widths: virtual address <32>, virtual page number <18>, page offset <14>, L1 cache index <7>, block offset <6>, TLB tag <19>, TLB data <19>, L1 cache tag <19>, L1 data <64>, physical address <32>, L2 tag compare address <15>, L2 cache index <11>, L2 cache tag <15>, L2 data <512>.]

• L1 data cache
  – 32 KB, 64 B cache lines, 4-way set-associative with random replacement
  – 32K/64 = 512 cache lines, 128 sets (set index = 7 bits)
  – Virtually indexed, physically tagged with single-cycle hit latency
• Memory management unit
  – TLB with multi-level page tables in physical memory
  – TLB has 32 entries, fully associative, hardware TLB miss handler
  – Variable page size: 4 KB, 16 KB, 64 KB, 1 MB, 16 MB (figure shows 16 KB)
  – TLB tag entries can have wildcards to support multiple page sizes
• L2 cache
  – 1 MB, 64 B cache lines, 8-way set-associative
  – 1M/64 = 16K cache lines, 2K sets (set index = 11 bits)
  – Physically addressed with multi-cycle hit latency
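These numbers line up with the field widths in Figure 2.16:

\[
\frac{32\,\text{KB}}{64\,\text{B}} = 512 \text{ lines},\quad
\frac{512}{4} = 128 \text{ sets} \Rightarrow 7 \text{ index bits},\quad
\log_2 64 = 6 \text{ offset bits},\quad
32 - 7 - 6 = 19 \text{ tag bits},
\]

matching the L1 cache index <7>, block offset <6>, and L1 cache tag <19>. Similarly, 1 MB / 64 B = 16K L2 lines and 16K/8 = 2K sets give the 11-bit L2 cache index, leaving a 32 - 11 - 6 = 15-bit L2 tag compare address.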


5.2. Intel Core i7

[Excerpt from Hennessy & Patterson, Chapter 2. In 2010 the fastest i7 had a clock rate of 3.3 GHz, which yields a peak instruction execution rate of 13.2 billion instructions per second, or over 50 billion instructions per second for the four-core design. The i7 can support up to three memory channels, each consisting of a separate set of DIMMs, and each of which can transfer in parallel. Using DDR3-1066 (DIMM PC8500), the i7 has a peak memory bandwidth of just over 25 GB/sec. The i7 uses 48-bit virtual addresses and 36-bit physical addresses. Memory management is handled with a two-level TLB, summarized in Figure 2.19. Figure 2.20 summarizes the i7's three-level cache hierarchy. The first-level caches are virtually indexed and physically tagged, while the L2 and L3 caches are physically indexed.]

Figure 2.19: Characteristics of the i7's TLB structure, which has separate first-level instruction and data TLBs, both backed by a joint second-level TLB. The first-level TLBs support the standard 4 KB page size, as well as having a limited number of entries of large 2 to 4 MB pages; only 4 KB pages are supported in the second-level TLB.

Characteristic    Instruction TLB   Data TLB      Second-level TLB
Size              128               64            512
Associativity     4-way             4-way         4-way
Replacement       Pseudo-LRU        Pseudo-LRU    Pseudo-LRU
Access latency    1 cycle           1 cycle       6 cycles
Miss              7 cycles          7 cycles      Hundreds of cycles to access page table

Figure 2.20: Characteristics of the three-level cache hierarchy in the i7. All three caches use write-back and a block size of 64 bytes. The L1 and L2 caches are separate for each core, while the L3 cache is shared among the cores on a chip and is a total of 2 MB per core. All three caches are nonblocking and allow multiple outstanding writes. A merging write buffer is used for the L1 cache, which holds data in the event that the line is not present in L1 when it is written. (That is, an L1 write miss does not cause the line to be allocated.) L3 is inclusive of L1 and L2; we explore this property in further detail when we explain multiprocessor caches. Replacement is by a variant on pseudo-LRU; in the case of L3 the block replaced is always the lowest numbered way whose access bit is turned off. This is not quite random but is easy to compute.

Characteristic        L1                    L2           L3
Size                  32 KB I / 32 KB D     256 KB       2 MB per core
Associativity         4-way I / 8-way D     8-way        16-way
Access latency        4 cycles, pipelined   10 cycles    35 cycles
Replacement scheme    Pseudo-LRU            Pseudo-LRU   Pseudo-LRU with an ordered selection algorithm

• 48-bit virtual addresses and 36-bit physical addresses (2^36 = 64 GB of physical memory)
• 4 KB pages except for a few large 2-4 MB pages in the L1 TLBs
• Write-back with merging write buffer (more like no write allocate)
• L3 is inclusive of L1/L2
• Hardware prefetching from L2 into L1, and from L3 into L2
• Virtually indexed, physically tagged L1 caches
• Physically addressed L2/L3 caches