EEC 581 Computer Architecture

Memory Hierarchy Design (II)

Department of Electrical Engineering and Computer Science, Cleveland State University

Topics to be covered

• Cache Penalty Reduction Techniques
  – Victim cache
  – Assist cache
  – Non-blocking cache
  – Data prefetch mechanisms

2

3Cs Absolute Miss Rate (SPEC92)

• Compulsory misses are a tiny fraction of the overall misses
• Capacity misses reduce with increasing cache size
• Conflict misses reduce with increasing associativity

[Figure: miss rate per type (compulsory, capacity, conflict) vs. cache size from 1 KB to 128 KB, for 1-way, 2-way, 4-way, and 8-way associativity]

3

2:1 Cache Rule

• Miss rate of a direct-mapped cache of size X ≈ miss rate of a 2-way set-associative cache of size X/2

4

3Cs Relative Miss Rate

[Figure: relative share of the miss rate per type (compulsory, capacity, conflict) vs. cache size from 1 KB to 128 KB, for 1-way, 2-way, 4-way, and 8-way associativity]

• Caveat: fixed block size

5

Victim Caching [Jouppi’90]

• Victim cache (VC)
  – A small, fully associative structure between L1 and memory
  – Effective for direct-mapped L1 caches
• Whenever a line is displaced from the L1 cache, it is loaded into the VC
• The processor checks both L1 and the VC simultaneously
• Data is swapped between the VC and L1 if L1 misses and the VC hits
• When data has to be evicted from the VC, it is written back to memory
• A minimal lookup sketch follows below

[Figure: victim cache organization — processor, L1 with VC alongside, memory]
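To make the L1/VC protocol above concrete, here is a minimal, tag-only sketch in C (not from the slides): the sizes, names, and the simple rotating replacement in the VC are illustrative assumptions, and writeback of dirty victims is reduced to a comment.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define L1_SETS    64          /* direct-mapped L1, one line per set */
#define VC_ENTRIES 4           /* small, fully associative victim cache */

typedef struct { bool valid; uint64_t tag; } line_t;

static line_t l1[L1_SETS];
static line_t vc[VC_ENTRIES];
static int vc_next;            /* rotating replacement pointer for the VC */

/* Returns 1 on L1 hit, 2 on VC hit (with swap), 0 on a miss serviced by memory. */
int cache_access(uint64_t line_addr)
{
    int set = (int)(line_addr % L1_SETS);

    if (l1[set].valid && l1[set].tag == line_addr)
        return 1;                              /* L1 hit */

    for (int i = 0; i < VC_ENTRIES; i++) {
        if (vc[i].valid && vc[i].tag == line_addr) {
            line_t displaced = l1[set];        /* swap: VC line moves to L1 ... */
            l1[set].valid = true;
            l1[set].tag   = line_addr;
            vc[i] = displaced;                 /* ... and the old L1 line becomes the victim */
            return 2;                          /* L1 miss, VC hit */
        }
    }

    if (l1[set].valid) {                       /* displaced L1 line drops into the VC */
        vc[vc_next] = l1[set];                 /* a dirty VC victim would be written back to memory here */
        vc_next = (vc_next + 1) % VC_ENTRIES;
    }
    l1[set].valid = true;
    l1[set].tag   = line_addr;
    return 0;                                  /* miss serviced by memory */
}

int main(void)
{
    uint64_t a = 0x100, b = 0x100 + L1_SETS;   /* same L1 set, different tags */
    const char *kind[] = { "memory miss", "L1 hit", "VC hit (swap)" };
    for (int i = 0; i < 6; i++) {
        uint64_t addr = (i & 1) ? b : a;
        printf("access %#llx -> %s\n", (unsigned long long)addr, kind[cache_access(addr)]);
    }
    return 0;
}

After the first round trip, the a/b ping-pong that would thrash a direct-mapped L1 is served entirely by swaps with the VC.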

6

% of Conflict Misses Removed

[Figure: fraction of conflict misses removed by victim caching, shown separately for the D-cache and the I-cache]

7

Assist Cache [Chan et al. ‘96]

• The Assist Cache (on-chip) avoids thrashing in the main (off-chip) L1 cache; both run at full speed
• 64 × 32-byte fully associative CAM
• Data enters the Assist Cache on a miss (FIFO replacement policy in the Assist Cache)
• Data is conditionally moved to L1 or back to memory during eviction
  – Flushed back to memory when brought in by "spatial locality hint" instructions
  – Reduces pollution

[Figure: assist cache organization — processor, L1 with AC alongside, memory]

8

Multi-lateral Cache Architecture

[Figure: processor core connected to two first-level cache structures, A and B, both backed by memory]

• A fully connected multi-lateral cache architecture
• Most cache architectures can be generalized into this form

9

Cache Architecture Taxonomy

[Figure: taxonomy of cache organizations, each drawn as processor / cache structure(s) A, B / memory — single-level cache, two-level cache, the general multi-lateral description, victim cache, assist cache, and NTS and PCS caches]

10

Non-blocking (Lockup-Free) Cache [Kroft ‘81]

• Prevents the pipeline from stalling due to cache misses (continues to provide hits to other lines while servicing a miss on one or more lines)
• Uses Miss Status Handling Registers (MSHRs)
  – Track cache misses; one entry is allocated per outstanding miss (called a fill buffer in the Intel P6 family)
  – Each new cache miss is checked against the MSHRs
  – The pipeline stalls on a cache miss only when the MSHRs are full
• Carefully choose the number of MSHR entries to match the sustainable memory bandwidth (see the sketch below)
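A simplified check/allocate model of the MSHRs described above, in C; it tracks primary misses only (no per-entry lists of secondary targets), and all names and sizes are illustrative.

#include <stdbool.h>
#include <stdint.h>

#define NUM_MSHR   4
#define LINE_SHIFT 5                    /* 32-byte lines */

typedef struct {
    bool     busy;
    uint64_t line_addr;                 /* line-aligned address of the outstanding miss */
} mshr_t;

static mshr_t mshr[NUM_MSHR];

typedef enum { SECONDARY_MISS, PRIMARY_MISS, STALL } miss_action;

/* Called when the cache detects a miss on 'addr'; decides whether the
   pipeline may continue or must stall. */
miss_action handle_miss(uint64_t addr)
{
    uint64_t line = addr >> LINE_SHIFT;

    /* Each new miss is checked against the existing MSHR entries. */
    for (int i = 0; i < NUM_MSHR; i++)
        if (mshr[i].busy && mshr[i].line_addr == line)
            return SECONDARY_MISS;      /* merge with an already outstanding miss */

    /* Allocate one entry per new miss, if a free one exists. */
    for (int i = 0; i < NUM_MSHR; i++) {
        if (!mshr[i].busy) {
            mshr[i].busy = true;
            mshr[i].line_addr = line;   /* the memory request would be issued here */
            return PRIMARY_MISS;
        }
    }
    return STALL;                       /* all MSHRs busy: pipeline stalls until a fill returns */
}

/* Called when the fill for 'line' comes back from memory. */
void fill_complete(uint64_t line)
{
    for (int i = 0; i < NUM_MSHR; i++)
        if (mshr[i].busy && mshr[i].line_addr == line)
            mshr[i].busy = false;
}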

11

Bus Utilization (MSHR = 2)

[Figure: memory bus timeline with MSHR = 2 — each miss (m1–m5) pays a lead-off latency and then transfers four data chunks at the initiation interval; with only two MSHRs the processor stalls waiting for a free entry, leaving the data bus idle between transfers]

12

Bus Utilization (MSHR = 4)

[Figure: the same memory bus timeline with MSHR = 4 — more misses can be outstanding at once, the stall shrinks, and memory bus utilization improves]

13

Sample question

• What is the major difference between the CDC 6600's scoreboarding algorithm and the IBM 360/91's Tomasulo algorithm? (One sentence)

• Why did the IBM 360/91 implement Tomasulo's algorithm only in the floating-point unit and not in the integer unit? (One sentence)

• What are the "two main functions" of a ReOrder Buffer (ROB)?

14

Sample question

• What is the major difference between the CDC 6600's scoreboarding algorithm and the IBM 360/91's Tomasulo algorithm? (One sentence)
  – The Tomasulo algorithm performs register renaming.
• Why did the IBM 360/91 implement Tomasulo's algorithm only in the floating-point unit and not in the integer unit? (One sentence)
  – Because of the long latency of the FPU, and because the FPU has only 4 registers.
• What are the "two main functions" of a ReOrder Buffer (ROB)?
  – To support (i) precise exceptions and (ii) branch misprediction recovery.

15

Sample question

• What is the main responsibility of the Load Store Queue?

• Given 4 architectural registers (R0 to R3) and 16 physical registers (T0 to T15), the current RAT content is indicated in the leftmost table below. Note that the physical registers are allocated in circular numbering order (i.e., T0, T1, …, T15, then back to T0). Assume the renaming logic can rename 4 instructions per clock cycle. For the following instruction sequence, fill in the RAT content one cycle later. (The destination register of arithmetic instructions is on the left-hand side.)

• What are the "two main functions" of a ReOrder Buffer (ROB)?

[Tables: current RAT, instruction sequence, and RAT after one cycle — not reproduced here]

16

Sample question

• What is the main responsibility of the Load Store Queue?
  – To perform memory address disambiguation and maintain memory ordering.
• Given 4 architectural registers (R0 to R3) and 16 physical registers (T0 to T15), the current RAT content is indicated in the leftmost table below. Note that the physical registers are allocated in circular numbering order (i.e., T0, T1, …, T15, then back to T0). Assume the renaming logic can rename 4 instructions per clock cycle. For the following instruction sequence, fill in the RAT content one cycle later. (The destination register of arithmetic instructions is on the left-hand side.)
• What are the "two main functions" of a ReOrder Buffer (ROB)?

17

Sample question

• Caches and main memory are sub-divided into multiple banks in order to allow parallel access. What is an alternative way of allowing parallel access?

• What do we call a cache that allows multiple cache misses to be outstanding to main memory at the same time, without stalling the pipeline?

• While cache misses are outstanding to main memory, what is the structure that keeps bookkeeping information about the outstanding cache misses? This structure often augments the cache.

18

Sample question

• Caches and main memory are sub-divided into multiple banks in order to allow parallel access. What is an alternative way of allowing parallel access?
  – Multiporting or duplicating.
• What do we call a cache that allows multiple cache misses to be outstanding to main memory at the same time, without stalling the pipeline?
  – A non-blocking (or lockup-free) cache.
• While cache misses are outstanding to main memory, what is the structure that keeps bookkeeping information about the outstanding cache misses? This structure often augments the cache.
  – Miss status handling registers (MSHRs).

19

Sample question

• Consider a processor with separate instruction and data caches (and no L2 cache). We are focusing on improving the data cache performance, since our instruction cache achieves a 100% hit rate with various optimizations. The data cache is 4 KB, direct-mapped, and has single-cycle access latency. The processor supports a 64-bit virtual address space, 8 KB pages, and no more than 16 GB of physical memory. The cache block size is 32 bytes. The data cache is virtually indexed and physically tagged. Assume that the data TLB hit rate is 100%.
  – The miss rate of the data cache is measured to be 10%. The miss penalty is 20 cycles. Compute the average memory access latency (in number of cycles) for data accesses.
  – To improve the overall memory access latency, we decided to introduce a victim cache. It is fully associative and has eight entries. Its access latency is one cycle. To save power and energy consumption, we decided to access the victim cache only after we detect a miss from the data cache. The victim cache hit rate is measured to be 50% (i.e., the probability of finding data in the victim cache given that the data cache doesn't have it). Further, only after we detect a miss from the victim cache do we start miss handling. Compute the average memory access latency for data accesses. (A worked sketch follows below.)
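One way to work the arithmetic (an assumption of this sketch, not stated on the slide: the 1-cycle hit latency is paid on every access, and the 20-cycle penalty is added only once miss handling starts):

\[
\mathrm{AMAT}_{\text{no VC}} = 1 + 0.10 \times 20 = 3 \text{ cycles}
\]

With the victim cache, every data-cache miss first pays the 1-cycle victim-cache probe; half of those misses hit there, and the other half go on to pay the 20-cycle penalty:

\[
\mathrm{AMAT}_{\text{VC}} = 1 + 0.10 \times \bigl(1 + 0.5 \times 20\bigr) = 1 + 0.10 \times 11 = 2.1 \text{ cycles}
\]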

20

Prefetch (Data/Instruction)

• Predict what data will be needed in the future
• Pollution vs. latency reduction
  – If you correctly predict the data that will be required in the future, you reduce latency; if you mispredict, you bring in unwanted data and pollute the cache
• To determine the effectiveness, ask:
  – When to initiate a prefetch? (timeliness)
  – Which lines to prefetch?
  – How big a line to prefetch? (note that the cache mechanism already performs prefetching within a block)
  – What to replace?
• Software (data) prefetching vs. hardware prefetching

21

Software-controlled Prefetching

• Implemented through instructions
  – Existing instructions
    · Alpha's load to R31 (hardwired to 0)
  – Specialized instructions and hints
    · Intel SSE: prefetchnta, prefetcht0/t1/t2
    · MIPS32: PREF
    · PowerPC: dcbt (data cache block touch), dcbtst (data cache block touch for store)
• Compiler-inserted or hand-inserted prefetch instructions (see the example below)
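As a concrete illustration (not from the slides): GCC and Clang expose software prefetching portably through the __builtin_prefetch(addr, rw, locality) built-in, which the compiler lowers to instructions such as the SSE hints listed above on x86. The prefetch distance of 16 elements is an arbitrary tuning choice for this sketch.

#include <stddef.h>

/* Dot product with software prefetching of both operand arrays.
   __builtin_prefetch(addr, rw, locality): rw = 0 means read,
   locality = 3 means keep in all cache levels (roughly prefetcht0 on x86). */
double dot(const double *a, const double *b, size_t n)
{
    const size_t dist = 16;            /* prefetch distance in elements (tuning knob) */
    double sop = 0.0;

    for (size_t i = 0; i < n; i++) {
        if (i + dist < n) {
            __builtin_prefetch(&a[i + dist], 0, 3);
            __builtin_prefetch(&b[i + dist], 0, 3);
        }
        sop += a[i] * b[i];
    }
    return sop;
}

On targets without a prefetch instruction the built-in simply emits nothing, so the code stays portable.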

22

Alpha

• The Alpha architecture supports data prefetch via load instructions with a destination of register R31 or F31.

  LDBU, LDF, LDG, LDL, LDT, LDWU — normal cache line prefetch
  LDS — prefetch with modify intent; sets the dirty and modified bits
  LDQ — prefetch, evict next; no temporal locality

• The Alpha architecture also defines the following instructions.

  FETCH — prefetch data
  FETCH_M — prefetch data, modify intent

PowerPC

  dcbt — data cache block touch
  dcbtst — data cache block touch for store

Intel SSE

• The SSE prefetch instruction has the following variants:

  prefetcht0 — temporal data; prefetch data into all cache levels
  prefetcht1 — temporal with respect to the first-level cache; prefetch data into all cache levels except the 0th level
  prefetcht2 — temporal with respect to the second-level cache; prefetch data into all cache levels except the 0th and 1st levels
  prefetchnta — non-temporal with respect to all cache levels; prefetch data into a non-temporal cache structure, with minimal cache pollution

Software-controlled Prefetching

/* Original loop with prefetching */
for (i = 0; i < N; i++) {
    prefetch(&a[i+1]);
    prefetch(&b[i+1]);
    sop = sop + a[i]*b[i];
}

/* Unroll loop 4 times */
for (i = 0; i < N-4; i += 4) {
    prefetch(&a[i+4]);
    prefetch(&b[i+4]);
    sop = sop + a[i]*b[i];
    sop = sop + a[i+1]*b[i+1];
    sop = sop + a[i+2]*b[i+2];
    sop = sop + a[i+3]*b[i+3];
}
sop = sop + a[N-4]*b[N-4];
sop = sop + a[N-3]*b[N-3];
sop = sop + a[N-2]*b[N-2];
sop = sop + a[N-1]*b[N-1];

• Prefetch latency <= computational time

25

Hardware-based Prefetching

• Sequential prefetching
  – Prefetch on miss
  – Tagged prefetch
• Both techniques are based on "One Block Lookahead (OBL)" prefetching: prefetch line (L+1) when line L is accessed, based on some criterion

26

Sequential Prefetching

• Prefetch on miss
  – Initiate a prefetch of (L+1) whenever an access to L results in a miss
  – The Alpha 21064 does this for instructions (prefetched instructions are stored in a separate structure called a stream buffer)

• Tagged prefetch
  – Idea: whenever there is a "first use" of a line (demand-fetched or previously prefetched), prefetch the next one
  – One additional "tag bit" per cache line
  – Tag the prefetched, not-yet-used line (tag = 1)
  – Tag bit = 0: the line was demand-fetched, or a prefetched line has been referenced for the first time
  – Prefetch (L+1) only if the tag bit of L is 1 (see the sketch below)
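A minimal, tag-only sketch of the tagged scheme, assuming (as in the figure on the next slide) that a demand miss both fetches L and prefetches L+1; the direct-mapped organization and names are illustrative.

#include <stdbool.h>
#include <stdint.h>

#define SETS 256

typedef struct {
    bool     valid;
    uint64_t line;          /* line number cached in this set */
    bool     tag;           /* 1 = prefetched but not yet referenced */
} block_t;

static block_t cache[SETS];

static void fill(uint64_t line, bool prefetched)
{
    block_t *b = &cache[line % SETS];
    b->valid = true;
    b->line  = line;
    b->tag   = prefetched;  /* only prefetched, not-yet-used lines carry tag = 1 */
}

/* Reference line L under one-block-lookahead tagged prefetching. */
void reference(uint64_t L)
{
    block_t *b = &cache[L % SETS];

    if (b->valid && b->line == L) {
        if (b->tag) {       /* first use of a prefetched line */
            b->tag = false;
            fill(L + 1, true);
        }
        /* ordinary hit on an already-used line: no prefetch */
    } else {
        fill(L, false);     /* demand fetch, tag = 0 */
        fill(L + 1, true);  /* prefetch the next line, tag = 1 */
    }
}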

27

Sequential Prefetching

[Figure: prefetch-on-miss when accessing contiguous lines — the initial miss demand-fetches a line and prefetches the next; the following access hits the prefetched line but triggers no new prefetch, so the access after that misses again (miss, hit, miss)]

[Figure: tagged prefetch when accessing contiguous lines — the initial miss demand-fetches a line (tag 0) and prefetches the next (tag 1); each first use of a prefetched line clears its tag and prefetches one more line, so the following accesses keep hitting (miss, hit, hit)]

28


Virtual Memory

• Virtual memory – separation of logical memory from physical memory
  – Only a part of the program needs to be in memory for execution; hence, the logical address space can be much larger than the physical address space
  – Main memory acts like a cache for the hard disk
  – Allows address spaces to be shared by several processes (or threads)
  – Allows for more efficient process creation

• Virtual memory can be implemented via:
  – Demand paging
  – Demand segmentation

30

Virtual Address

• The concept of a virtual (or logical) address space that is bound to a separate physical address space is central to memory management
  – Virtual address – generated by the CPU
  – Physical address – seen by the memory

• Virtual and physical addresses are the same in compile-time and load-time address-binding schemes; virtual and physical addresses differ in execution-time address-binding schemes

31

Advantages of Virtual Memory

• Translation:
  – A program can be given a consistent view of memory, even though physical memory is scrambled
  – Only the most important part of the program (the "working set") must be in physical memory
  – Contiguous structures (like stacks) use only as much physical memory as necessary, yet can grow later
• Protection:
  – Different threads (or processes) are protected from each other
  – Different pages can be given special behavior (read-only, invisible to user programs, etc.)
  – Kernel data is protected from user programs
  – Very important for protection from malicious programs => far more "viruses" under Microsoft Windows
• Sharing:
  – The same physical page can be mapped into multiple processes ("shared memory")

32

Use of Virtual Memory

[Figure: address-space layouts of Process A and Process B — stack, shared libraries, heap, static data, and code — with the shared-library regions of both processes mapped to the same shared physical page]

33

Virtual vs. Physical Address Space

[Figure: a 4 GB virtual address space whose pages (A, B, C, D) map either to frames of a much smaller physical main memory or out to disk]

34

Paging

• Divide physical memory into fixed-size blocks (e.g., 4 KB) called frames
• Divide logical memory into blocks of the same size (4 KB) called pages
• To run a program of size n pages, find n free frames and load the program
• Set up a page table to map page addresses to frame addresses (the operating system sets up the page table)

35

Page Table and Address Translation

[Figure: the virtual address is split into a virtual page number (VPN) and a page offset; the VPN indexes the page table in main memory to obtain the physical page number (PPN), which is concatenated with the unchanged page offset to form the physical address — a code sketch follows below]
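A one-level version of that translation, sketched in C; a real page-table entry also carries protection and dirty bits, which are collapsed into a single valid flag here, and the sizes are illustrative.

#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12                       /* 4 KB pages */
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define NUM_PAGES  1024                     /* tiny virtual space, just for the sketch */

typedef struct {
    bool     valid;
    uint64_t ppn;                           /* physical page number */
} pte_t;

static pte_t page_table[NUM_PAGES];

/* Translate a virtual address to a physical address; returns false on a page fault. */
bool translate(uint64_t va, uint64_t *pa)
{
    uint64_t vpn    = va >> PAGE_SHIFT;     /* virtual page number */
    uint64_t offset = va & (PAGE_SIZE - 1); /* page offset is copied unchanged */

    if (vpn >= NUM_PAGES || !page_table[vpn].valid)
        return false;                       /* page fault: the OS would handle it */

    *pa = (page_table[vpn].ppn << PAGE_SHIFT) | offset;
    return true;
}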

36

Page Table Structure Examples

• One-to-one mapping — how much space does it take?
  – Large pages → internal fragmentation (similar to having large line sizes in caches)
  – Small pages → page table size issues
• Multi-level paging
• Inverted page table

Example: 64-bit address space, 4 KB pages (12 bits), 512 MB (29 bits) of RAM

• Number of pages = 2^64 / 2^12 = 2^52 (the page table has as many entries)
• At ~4 bytes per entry, the page table is 2^54 bytes = 16 petabytes!
• It can't fit in the 512 MB of RAM!

37

Multi-level (Hierarchical) Page Table

• Divide the virtual address into multiple levels: P1 | P2 | page offset
  – P1 indexes the level-1 page directory (a pointer array), which is stored in main memory
  – P2 indexes the level-2 page table, which stores the PPN
  – The PPN is concatenated with the page offset to form the physical address
  – A code sketch follows below

[Figure: two-level page-table walk driven by the P1 and P2 fields of the virtual address]
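Extending the one-level sketch above to two levels; the field widths and names are arbitrary choices for illustration.

#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT 12                       /* 4 KB pages */
#define L2_BITS    10                       /* P2: index into a level-2 page table */
#define L1_BITS    10                       /* P1: index into the level-1 directory */

typedef struct {
    bool     valid;
    uint64_t ppn;
} pte_t;

typedef struct {
    pte_t entries[1u << L2_BITS];
} l2_table_t;

/* Level-1 page directory: pointers to level-2 tables (NULL = not present). */
static l2_table_t *page_directory[1u << L1_BITS];

bool translate2(uint64_t va, uint64_t *pa)
{
    uint64_t offset = va & ((1u << PAGE_SHIFT) - 1);
    uint64_t p2     = (va >> PAGE_SHIFT) & ((1u << L2_BITS) - 1);
    uint64_t p1     = (va >> (PAGE_SHIFT + L2_BITS)) & ((1u << L1_BITS) - 1);

    l2_table_t *l2 = page_directory[p1];    /* first memory reference of the walk */
    if (l2 == NULL || !l2->entries[p2].valid)
        return false;                       /* page fault */

    /* second reference yields the PPN; concatenate with the page offset */
    *pa = (l2->entries[p2].ppn << PAGE_SHIFT) | offset;
    return true;
}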

38

Inverted Page Table

• One entry for each real page (frame) of memory
• Shared by all active processes
• Each entry holds the virtual address of the page stored in that real memory location, along with process ID information
• Decreases the memory needed to store the page tables, but increases the time needed to search the table when a page reference occurs

39

Linear Inverted Page Table

• Contains one entry per physical page, in a linear array (array size = size of physical memory in pages)
• Need to traverse the array sequentially to find a match
• Can be time consuming

[Figure: lookup example — for a virtual address with PID = 8 and VPN = 0x2AA70, the table is scanned by index until entry 0x120D matches (PID 8, VPN 0x2AA70); the matching index becomes the PPN (0x120D), which is concatenated with the offset to form the physical address]

40

Hashed Inverted Page Table

• Use a hash table to limit the search to a smaller number of page-table entries (see the sketch below)

[Figure: lookup example — the (PID = 8, VPN = 0x2AA70) pair is hashed into the hash anchor table, which points at the head of a chain of inverted-page-table entries linked by a "next" field; the chain is followed until the matching entry is found at index 0x120D]
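A sketch of that chained lookup in C; the hash function, table sizes, and names are placeholders, and the anchor table must be filled with NO_ENTRY before use.

#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT   12
#define NUM_FRAMES   4096
#define HASH_BUCKETS 1024
#define NO_ENTRY     UINT32_MAX

typedef struct {
    bool     valid;
    uint32_t pid;
    uint64_t vpn;
    uint32_t next;                       /* index of the next entry in this hash chain */
} hipte_t;

static hipte_t  inverted_table[NUM_FRAMES];
static uint32_t hash_anchor[HASH_BUCKETS];   /* fill with NO_ENTRY before use */

static uint32_t hash_fn(uint32_t pid, uint64_t vpn)
{
    return (uint32_t)((vpn ^ pid) % HASH_BUCKETS);   /* placeholder hash */
}

bool hipt_translate(uint32_t pid, uint64_t va, uint64_t *pa)
{
    uint64_t vpn    = va >> PAGE_SHIFT;
    uint64_t offset = va & ((1u << PAGE_SHIFT) - 1);

    /* Follow the chain rooted in the hash anchor table. */
    for (uint32_t i = hash_anchor[hash_fn(pid, vpn)]; i != NO_ENTRY;
         i = inverted_table[i].next) {
        if (inverted_table[i].valid &&
            inverted_table[i].pid == pid &&
            inverted_table[i].vpn == vpn) {
            *pa = ((uint64_t)i << PAGE_SHIFT) | offset;  /* entry index = PPN */
            return true;
        }
    }
    return false;                        /* not resident: page fault */
}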

41

Fast Address Translation

• How often does address translation occur? Where is the page table kept?
• Keep translations in hardware: use a Translation Lookaside Buffer (TLB)
  – Instruction TLB and data TLB
  – Essentially a cache (tag array = VPN, data array = PPN)
  – Small (32 to 256 entries are typical)
  – Typically fully associative (implemented as a content-addressable memory, CAM) or highly associative, to minimize conflicts
  – A lookup sketch follows below
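A fully associative TLB probe, sketched in C; hardware compares the VPN against all entries in parallel, whereas this sketch iterates, and replacement, ASNs, and permission bits are omitted.

#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64
#define PAGE_SHIFT  12

typedef struct {
    bool     valid;
    uint64_t vpn;                         /* tag */
    uint64_t ppn;                         /* data */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* Returns true on a TLB hit; on a miss the page-table walker would refill the TLB. */
bool tlb_lookup(uint64_t va, uint64_t *pa)
{
    uint64_t vpn    = va >> PAGE_SHIFT;
    uint64_t offset = va & ((1u << PAGE_SHIFT) - 1);

    /* Hardware: the VPN is matched against every entry at once (CAM). */
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pa = (tlb[i].ppn << PAGE_SHIFT) | offset;
            return true;
        }
    }
    return false;                         /* TLB miss */
}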

42


Example: data TLB

[Figure: data TLB entry format — the virtual address is split into a 35-bit VPN and a 13-bit page offset; each TLB entry holds an 8-bit address space number (ASN), 4 protection bits, a valid bit, the 35-bit tag, and a 31-bit PPN; a 128:1 multiplexer selects the matching entry, and the PPN plus the offset form the 44-bit physical address]

44

TLB and Caches

• Several design alternatives
  – VIVT: virtually indexed, virtually tagged cache
  – VIPT: virtually indexed, physically tagged cache
  – PIVT: physically indexed, virtually tagged cache
    · Not outright useful; the MIPS R6000 is the only design that used it
  – PIPT: physically indexed, physically tagged cache

45


Virtually-Indexed Virtually-Tagged (VIVT)

[Figure: VIVT organization — the processor core indexes the VIVT cache directly with the virtual address; the TLB and main memory are consulted only on a miss, and the cache line is returned on the fill]

• Fast cache access
• Address translation is required only when going to memory (on a miss)
• Issues?

47

VIVT Cache Issues – Aliasing

• Homonym
  – The same VA maps to different PAs
  – Occurs when there is a context switch
  – Solutions
    · Include the process ID (PID) in the cache tag, or
    · Flush the cache upon context switches
• Synonym (also a problem in VIPT)
  – Different VAs map to the same PA
  – Occurs when data is shared by multiple processes
  – Leads to duplicated cache lines in a VIPT cache, and in a VIVT cache with PIDs
  – Data becomes inconsistent due to the duplicated locations
  – Solutions
    · Can write-through solve the problem?
    · Flush the cache upon context switch
    · If (index + offset) < page offset, can the problem be solved? (discussed later for VIPT)

48


Physically-Indexed Physically-Tagged (PIPT)

[Figure: PIPT organization — the virtual address is first translated by the TLB; the resulting physical address indexes the PIPT cache, and main memory is accessed on a miss]

• Slower: the address is always translated before accessing the cache
• Simpler for data coherence

50

Virtually-Indexed Physically-Tagged (VIPT)

[Figure: VIPT organization — the TLB translation and the VIPT cache index proceed in parallel from the virtual address; the physical tag from the TLB is compared against the cache tags, and main memory is accessed on a miss]

• Gains the benefits of both VIVT and PIPT
• Parallel access to the TLB and the VIPT cache
• No homonym problem
• How about synonyms?

51

Deal w/ Synonym in VIPT Cache

[Figure: process A's VPN A and process B's VPN B point to the same location within a shared physical page; since VPN A != VPN B, the two virtual addresses can index different sets of the cache, duplicating the line in the tag and data arrays]

• How to eliminate the duplication? Make cache index A == cache index B?

52

Synonym in VIPT Cache

• Address bit fields: VPN | page offset, and cache tag | set index | line offset
  – "a" = the high-order set-index bits that fall inside the VPN (above the page-offset boundary)
• If two VPNs do not differ in "a", there is no synonym problem, since they will be indexed to the same set of a VIPT cache
• This implies the number of sets cannot be too big
  – Max number of sets = page size / cache line size
  – Example: 4 KB page, 32 B line → max 128 sets (worked out below)
• A more complicated solution is used in the MIPS R10000

53

R10000’s Solution to Synonym

• 32 KB, 2-way, virtually indexed L1
  – [Figure: address fields — 12-bit page offset, with the L1 using a 10-bit set index and a 4-bit line offset, so a = VPN[1:0]]
• Direct-mapped physical L2
  – a = VPN[1:0] is stored as part of the L2 cache tag
  – L2 is inclusive of L1
  – VPN[1:0] is appended to the "tag" of L2
• Given two virtual addresses VA1 and VA2 that differ in VPN[1:0] and both map to the same physical address PA:
  – Suppose VA1 is accessed first, so blocks are allocated in L1 and L2
  – What happens when VA2 is referenced?
    1. VA2 indexes to a different block in L1 and misses
    2. VA2 translates to PA and goes to the same block as VA1 in L2
    3. The tag comparison fails (since VA1[1:0] != VA2[1:0])
    4. It is treated just like an L2 conflict miss: VA1's entry in L1 is evicted (written back if dirty) due to the inclusion policy
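As a quick check of where the two overlap bits come from (using the 10-bit index and 4-bit line offset shown above):

\[
\underbrace{10}_{\text{set index}} + \underbrace{4}_{\text{line offset}} = 14 \text{ bits used to index L1},
\qquad
14 - \underbrace{12}_{\text{page offset}} = 2 \;\Rightarrow\; a = \mathrm{VPN}[1{:}0]
\]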

54

Deal w/ Synonym in MIPS R10000

[Figure: VA1's copy is resident in the virtually indexed L1; VA2, which maps to the same physical page but has different overlap bits (a2 != a1), misses in L1; the L2 lookup with the physical index reaches VA1's block, and the mismatch between the stored a1 and a2 reveals that a synonym copy exists in L1]

55

Deal w/ Synonym in MIPS R10000

[Figure: the L2 conflict-miss handling evicts VA1's copy from L1; the data is then returned from L2 and allocated under index a2, so only one copy is present in L1]

56
