Compute Express Link™ (CXL™): Memory and Cache Protocols


Robert Blankenship, Intel Corporation
2020 Storage Developer Conference. © CXL™ Consortium. All Rights Reserved.

Topics
• What is CXL?
• Caching 101
• CXL Caching Protocol
• CXL Memory Protocol
• Merging Memory and Caching for Accelerators

What is CXL?
• CXL runs across the standard PCIe physical layer with new protocols optimized for cache and memory
• CXL uses a flexible processor port that can auto-negotiate to either the standard PCIe transaction protocol or the alternate CXL transaction protocols
• First-generation CXL aligns to the 32 GT/s PCIe 5.0 specification
• CXL usages are expected to be a key driver for an aggressive timeline to the PCIe 6.0 architecture
[Figure: an x16 PCIe card and an x16 CXL card both attach to the processor through the same x16 connector and PCIe-channel SERDES]

CXL Uses 3 Protocols
• IO (CXL.io) → same as PCI Express®
• Memory (CXL.mem) → memory access from CPU host to device
• Cache (CXL.cache) → coherent access from device to CPU host

Representative CXL Usages
[Figure: representative CXL usage models]

Caching 101

Caching Overview
Caching temporarily brings data closer to the consumer. It improves latency and bandwidth through prefetching and/or locality:
• Prefetching: loading data into the cache before it is required
• Spatial locality (locality in space): access address X, then X+n
• Temporal locality (locality in time): multiple accesses to the same data
[Figure: an accelerator reading data from host memory (access latency ~200 ns, shared bandwidth 100+ GB/s) versus from a local data cache (access latency ~10 ns, dedicated bandwidth 100+ GB/s)]

CPU Cache/Memory Hierarchy with CXL
• Modern CPUs have 2 or more levels of coherent cache
• Lower levels (L1) are smaller in capacity, with the lowest latency and highest bandwidth per source
• Higher levels (L3) have less bandwidth per source but much higher capacity and support more sources
• Device caches are expected to be up to 1 MB
[Figure: two CPU sockets joined by coherent CPU-to-CPU symmetric links; each socket has ~50 KB L1 and ~500 KB L2 caches per CPU, a ~10 MB shared L3 (LLC), a Home Agent, ~10 GB of directly connected memory (DDR) plus ~10 GB of CXL.mem-attached memory, and CXL.cache/CXL.io/PCIe links to devices with ~1 MB caches]
Note: cache/memory capacities are examples and not aligned to a specific product.

Cache Consistency
How do we make sure updates in a cache are visible to other agents?
• Invalidate all peer caches prior to the update
• Can be managed with software or hardware; CXL uses hardware coherence
• Define a point of "Global Observation" (aka GO) at which new data from writes is visible
• Tracking granularity is a "cacheline" of data: 64 bytes for CXL
• All addresses are assumed to be Host Physical Addresses (HPA) in the CXL cache and memory protocols; translations use the existing Address Translation Services (ATS)
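The locality concepts from the Caching Overview above are easy to demonstrate from software. The short C program below is an illustrative sketch (the array dimension N and the element type are arbitrary choices, not taken from the slides): the row-major loop walks addresses X, X+4, X+8, ... within each 64-byte line (spatial locality), while the column-major loop strides a whole row apart on every access and touches a different cacheline almost every time.

```c
#include <stdio.h>
#include <stdlib.h>

#define N 4096   /* arbitrary array dimension chosen for illustration */

/* Good spatial locality: consecutive int elements share a 64-byte cacheline,
 * so one fill from memory services several subsequent accesses. */
long sum_row_major(int (*a)[N])
{
    long sum = 0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            sum += a[i][j];              /* address X, then X+4, X+8, ... */
    return sum;
}

/* Poor spatial locality: each access is N*sizeof(int) bytes from the last,
 * so nearly every access lands on a different cacheline. */
long sum_column_major(int (*a)[N])
{
    long sum = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}

int main(void)
{
    int (*a)[N] = calloc(N, sizeof *a);  /* N x N ints, zero-initialized */
    if (!a)
        return 1;
    printf("row-major sum:    %ld\n", sum_row_major(a));
    printf("column-major sum: %ld\n", sum_column_major(a));
    free(a);
    return 0;
}
```

Once the array exceeds the cache capacities shown on the hierarchy slide, the column-major pass is typically several times slower than the row-major pass; repeated passes over a small working set are what temporal locality rescues.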
Cache Coherence Protocol
• Modern CPU caches and CXL are built on the M,E,S,I protocol/states:
  • Modified – only in one cache; can be read or written; data is NOT up to date in memory
  • Exclusive – only in one cache; can be read or written; data IS up to date in memory
  • Shared – can be in many caches; can only be read; data IS up to date in memory
  • Invalid – not in cache
• M,E,S,I is tracked for each cacheline address in each cache
• The cacheline address in CXL is Addr[51:6]
• Notes:
  • Each level of the CPU cache hierarchy follows MESI, and the layers above it must be consistent
  • Other extended states and flows are possible but are not covered in the context of CXL

How are Peer Caches Managed?
• All peer caches are managed by the "Home Agent" within the cache level; this is hidden from the CXL device
• A "Snoop" is the term for the Home Agent checking a cache's state; it may cause cache state changes
• CXL snoops:
  • Snoop Invalidate (SnpInv): the cache must degrade to I-state and must return any Modified data
  • Snoop Data (SnpData): the cache must degrade to S-state and must return any Modified data
  • Snoop Current (SnpCurr): the cache state does not change, but the cache must return any Modified data

CXL Cache Protocol

Cache Protocol Summary
• Simple set of 15 cacheable reads and writes from the device to host memory
• Keeps the complexity of global coherence management in the host

Cache Protocol Channels
• 3 channels in each direction: Device-to-Host (D2H) and Host-to-Device (H2D)
• The Data and RSP channels are pre-allocated
• D2H requests come from the device; H2D requests are snoops from the host
• Ordering: an H2D Req (snoop) pushes H2D RSP

Read Flow
• Flow diagrams show the message exchanges in time
• X-axis: agents
• Y-axis: time
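Before mapping the flow back onto the CPU hierarchy, here is a minimal C sketch that makes the MESI states and the three snoop types concrete from the device-cache side. Only the behavior comes from the slides (SnpInv degrades to I, SnpData degrades to S, SnpCurr leaves the state unchanged, Modified data is always returned, and the cacheline address is Addr[51:6]); the type names and function signatures are hypothetical and are not the CXL specification's own structures.

```c
#include <stdbool.h>
#include <stdint.h>

/* MESI state tracked per cacheline address. */
typedef enum { LINE_M, LINE_E, LINE_S, LINE_I } mesi_t;

/* The three CXL snoop types sent by the host. */
typedef enum { SNP_INV, SNP_DATA, SNP_CURR } snoop_t;

/* CXL cacheline address: bits [51:6] of the host physical address. */
uint64_t cxl_cacheline_addr(uint64_t hpa)
{
    return (hpa >> 6) & ((UINT64_C(1) << 46) - 1);   /* 51 - 6 + 1 = 46 bits */
}

/* Apply a snoop to one line: returns the new state and reports whether
 * Modified data must be returned to the Home Agent. */
mesi_t apply_snoop(mesi_t state, snoop_t snp, bool *return_modified_data)
{
    *return_modified_data = (state == LINE_M);

    switch (snp) {
    case SNP_INV:  return LINE_I;   /* SnpInv: degrade to I-state  */
    case SNP_DATA: return LINE_S;   /* SnpData: degrade to S-state */
    case SNP_CURR: return state;    /* SnpCurr: state unchanged    */
    }
    return state;
}
```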
Mapping Flow Back to CPU Hierarchy
• The Peer Cache in the flow can be:
  • a peer CXL device with a cache
  • a CPU cache in the local socket
  • a CPU cache in the remote socket
• The Memory Controller in the flow can be:
  • native DDR on the local socket
  • native DDR on the remote socket
  • CXL.mem on a peer device
[Figure: the CPU cache/memory hierarchy slide from earlier, with the peer-cache and memory-controller roles highlighted at each of these locations]

Example #2: Write
• For cache writes there are three phases (sketched in code at the end of this section):
  • Ownership: peer caches are invalidated (S → I) and the device cache gains ownership (I → E)
  • Silent Write: the device writes the line in its own cache (E → M) with no traffic on the link
  • Cache Eviction: the Modified line is evicted (M → I) and the data is written back through the Home to the memory controller
[Flow diagrams across four slides: CXL Device Cache, Peer Cache, Home, and Memory Controller; the legend shows the Modified/Exclusive/Shared/Invalid cache states and the points where the Home's tracker is allocated and deallocated]

Example #3: Streaming Write
• Direct write to host: ownership + write in a single flow
• Rely on the completion to indicate ordering
• May see reduced bandwidth for ordered traffic
• The host may install the data into the LLC instead of writing it to memory

15 Requests in CXL
• Reads: RdShared, RdCurr, RdOwn, RdAny
• Read-0: RdOwnNoData, CLFlush, CacheFlushed
• Writes: DirtyEvict, CleanEvict, CleanEvictNoData
• Streaming Writes: ItoMWr, MemWr, WOWrInv, WrInv(F)

CXL Memory Protocol
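As a recap of the cache-write example above, the sketch below walks one cacheline through the three phases of Example #2 (Ownership, Silent Write, Cache Eviction). It is a conceptual model only: the request names RdOwn and DirtyEvict are taken from the request list above, but every type and function here, including the placeholder send_d2h_request() comments, is hypothetical; the real protocol is implemented in hardware, not as a C API.

```c
#include <stdint.h>
#include <string.h>

typedef enum { LINE_M, LINE_E, LINE_S, LINE_I } mesi_t;

/* Hypothetical model of one line in the CXL device cache. */
typedef struct {
    uint64_t hpa;          /* host physical address of the line */
    mesi_t   state;
    uint8_t  data[64];     /* one 64-byte CXL cacheline */
} dev_line_t;

/* Phase 1 - Ownership: a D2H RdOwn request asks the Home for ownership;
 * peer caches are invalidated (S -> I) and this cache becomes Exclusive. */
void phase_ownership(dev_line_t *line, uint64_t hpa)
{
    /* send_d2h_request(RdOwn, hpa);   placeholder for the hardware request */
    line->hpa   = hpa;
    line->state = LINE_E;
}

/* Phase 2 - Silent Write: the device writes its own copy, E -> M.
 * No messages cross the CXL link for this step. */
void phase_silent_write(dev_line_t *line, const uint8_t src[64])
{
    memcpy(line->data, src, sizeof line->data);
    line->state = LINE_M;
}

/* Phase 3 - Cache Eviction: the Modified data is written back with a D2H
 * DirtyEvict, flowing through the Home to the memory controller, and the
 * local copy becomes Invalid. */
void phase_eviction(dev_line_t *line)
{
    /* send_d2h_request(DirtyEvict, line->hpa, line->data);   placeholder */
    line->state = LINE_I;
}

int main(void)
{
    uint8_t payload[64] = { 0xAB };     /* example data */
    dev_line_t line = { .state = LINE_I };

    phase_ownership(&line, 0x1000ULL);  /* hypothetical HPA */
    phase_silent_write(&line, payload);
    phase_eviction(&line);
    return 0;
}
```

A streaming write (Example #3, e.g., ItoMWr or WOWrInv) collapses the first two phases into a single request, which is why the slide relies on the completion rather than ownership to indicate ordering.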