Oracle's Next Generation SPARC Cache Hierarchy

Next Generation SPARC Processor Cache Hierarchy
Ram Sivaramakrishnan, Hardware Director
Sumti Jairath, Sr. Hardware Architect
Copyright © 2014 Oracle and/or its affiliates. All rights reserved.

Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.

Presentation Outline
• Overview
• Partitioned L3 Cache – Design Objectives
• Core Cluster and L3 Partition
• On-Chip Network and SMP Coherence
• On-Chip Protocol Features
• Workload Performance Optimization
• Conclusion

Processor Overview
The next generation SPARC processor contains
• 32 cores organized as 8 core clusters
• 64MB L3 cache
• 8 DDR4 memory schedulers (MCU) providing 160GBps of sustained memory bandwidth
• A coherency subsystem for 1 to 32 socket scaling
• A coherency and IO interconnect that provides >336GBps of peak bandwidth
• 8 Data Analytics Accelerators (DAX) for query acceleration and messaging
(Die diagram: eight core clusters around the L3, on-chip network, coherence/SMP & I/O subsystems, accelerators, and memory controllers.)

Why a Partitioned Last Level Cache?
• A Logically Unified Last Level Cache
  – Latency scales up with core count and cache size
  – Benefits large shared workloads
  – Workloads with little or no sharing see the impact of longer latency
• Partitioned Shared Last Level Cache
  – A core can allocate only in its local partition
  – A thread sees a lower access latency to its local partition
  – Low latency sharing between partitions makes them appear as a larger shared cache

Last Level Cache Design Objectives
• Minimize the effective L2 cache miss latency for response-time-critical applications
• Provide an on-chip interconnect for low latency inter-partition communication
• Provide protocol features that enable the collection of partitions to appear as a larger shared cache

Core Cluster and Partition Overview
• Four S4 SPARC cores
  – 2-issue OOO pipeline and 8 strands per core
  – 16KB L1 I$ and a 16KB write-through L1 D$ per core
• 256KB 8-way L2 I$ shared by all four cores
• 256KB 8-way dual-banked write-back L2 D$ shared by a core pair
• Dual-banked 8MB 8-way L3$
• 41-cycle (<10ns) load-to-use latency to the L3
• >192GBps interface between a core cluster and its local L3 partition
(Cluster diagram: core0–core3 share the L2 I$, each core pair shares an L2 D$, and the cluster attaches to its 8MB 8-way L3 partition.)
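The cache sizes and associativities above fix the basic indexing geometry of each cache. The short C sketch below works out the set counts and index bits for the 256KB 8-way L2 caches and the 8MB 8-way L3 partition; the 64-byte line size is an assumption (the presentation does not state it), so the numbers are illustrative only.

```c
/* Minimal sketch: derive set count and index bits for the caches described
 * above, assuming 64-byte lines (not stated in the presentation). */
#include <stdio.h>

static unsigned log2u(unsigned long long x)
{
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

static void geometry(const char *name, unsigned long long bytes,
                     unsigned ways, unsigned line)
{
    /* sets = size / (line size * associativity); index bits = log2(sets) */
    unsigned long long sets = bytes / ((unsigned long long)ways * line);
    printf("%-26s %6llu sets, %2u ways, %2u index bits\n",
           name, sets, ways, log2u(sets));
}

int main(void)
{
    const unsigned line = 64;  /* assumed 64-byte cache line */
    geometry("L2 I$ (256KB, 8-way)", 256ULL << 10, 8, line);
    geometry("L2 D$ (256KB, 8-way)", 256ULL << 10, 8, line);
    geometry("L3 partition (8MB, 8-way)", 8ULL << 20, 8, line);
    return 0;
}
```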
Processor Block Diagram
• An On-Chip Network (OCN) that connects
  – 8 L3 partitions
  – 4 SMP & IO gateways
  – 4 memory scheduler (MCU) and 4 DAX instances on each side
• 64GBps per OCN port per direction
• Low latency cross-partition access
(Block diagram: eight 4-core clusters with their 8MB L3 partitions, four SMP & IO gateways, and the MCUs and DAXs, all attached to the on-chip network.)

Salient Features of the L3
• Supplier state per cache line to prevent multiple sharing partitions from sourcing a line
• Provides tracking of application request history for optimal use of the on-chip protocol features
• Idle partitions can be used as a victim cache by other active partitions
• Allows the DAX to directly deposit the results of a data analytics query, or a message to be used by a consuming thread
• Pointer version per cache line for application data protection

On-Chip Network (OCN)
The OCN consists of three networks:
• Requests are broadcast on four address-plane rings, one per SMP gateway on chip
  – No queuing on the ring; maximum hop count is 11
• The Data network is constructed as a mesh with 10-ported switches
  – Provides ~512GBps of cross-section bandwidth
  – Unloaded latency <6ns
• The Response network is a point-to-point network that aggregates snoop responses from all partitions before delivery to the requester

SMP Coherence
• Distributed directory-based SMP coherence for 8-way glue-less and 32-way glued scaling
• Directory is implemented as an inclusive tag rather than an L3 mirror
  – Reduces the needed associativity in the tag
• Inclusive tag sized at 80MB, 20-way set associative, to minimize distributed cache evictions for throughput workloads
• Dynamic organization supports all socket counts from 1–8 in glue-less systems

OCN Protocol
The OCN protocol supports
• Mediated and Peer-to-Peer Requests, Memory Prefetch
  – Protocol choice based on sharing characteristics of a workload
• Dynamically Partitioned Cache
  – Fair re-distribution of the last level cache among a varying number of active threads
• DAX Allocation
  – Coherent communication protocol between core and DAX

Mediated and Peer-to-Peer Requests
The choice of request type is driven by the address-space sharing characteristics of a workload.
• Peer-to-Peer Requests
  – Broadcast to all partitions on the request network
  – Lowest latency path to get data from another partition (also referred to as a C2C transfer)
• Mediated Requests
  – Unicast to the coherence gateway; other agents ignore the request
  – The fastest and most power-efficient path to data from memory

Peer-to-Peer Request Flow
• L3 miss broadcast to all agents on the ring
• All L3 partitions look up their tag and respond with an Ack/Nack
• The supplier partition sources the data
• Lowest latency cross-partition transfer
• Turns into a mediated request on failure
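To make the Ack/Nack flow above concrete, here is a small software model of the decision: broadcast the miss, let every partition snoop its tag, take the cache-to-cache path if a supplier acks, and otherwise fall back to a mediated request. The function names, the toy tag lookup, and the printouts are hypothetical stand-ins for hardware behavior, not the actual protocol.

```c
/* Software model of the peer-to-peer request flow with mediated fallback.
 * Purely illustrative; all lookups and addresses are made up. */
#include <stdbool.h>
#include <stdio.h>

#define NUM_PARTITIONS 8

/* Toy tag lookup: true if 'partition' holds 'line_addr' in supplier state.
 * Here only partition 5 supplies line 0x40, as an example. */
static bool partition_snoop_ack(int partition, unsigned long long line_addr)
{
    return line_addr == 0x40 && partition == 5;
}

static void cache_to_cache_transfer(int supplier, unsigned long long line_addr)
{
    printf("line 0x%llx: supplied cache-to-cache by partition %d\n",
           line_addr, supplier);
}

static void mediated_request(unsigned long long line_addr)
{
    printf("line 0x%llx: no supplier, retry as a mediated request to the SMP gateway\n",
           line_addr);
}

static void handle_l3_miss_peer_to_peer(unsigned long long line_addr)
{
    int supplier = -1;

    /* Broadcast on the request ring: every partition looks up its tag and
     * answers Ack or Nack. The per-line supplier state guarantees that at
     * most one partition acks. */
    for (int p = 0; p < NUM_PARTITIONS; p++)
        if (partition_snoop_ack(p, line_addr))
            supplier = p;

    if (supplier >= 0)
        cache_to_cache_transfer(supplier, line_addr);  /* lowest-latency C2C path */
    else
        mediated_request(line_addr);                   /* fall back to the memory path */
}

int main(void)
{
    handle_l3_miss_peer_to_peer(0x40);  /* served by another partition */
    handle_l3_miss_peer_to_peer(0x80);  /* falls back to a mediated request */
    return 0;
}
```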
Mediated Request Flow
• Unicast to the SMP gateway
• SMP directory lookup and memory prefetch
• Memory read sent on the request network to the appropriate MCU instance (not shown)
• Memory provides data to the requester
• Lowest latency path to memory

Dynamic Request
• Each partition tracks how its L3 misses are served and determines whether a mediated or a peer-to-peer request is appropriate
• A high cache-to-cache rate drives peer-to-peer requests
• A high memory return rate drives mediated requests
• Temporal sharing behavior is effectively tracked using a 64-bit history register (see the sketch after these slides)
• Code and data are tracked independently
(Diagram: per-partition data-supplier and code-supplier tracking, distinguishing lines returned from memory from lines supplied cache-to-cache.)

Memory Prefetch
• The partition taking an L3 miss can request a speculative memory read
• Supported for peer-to-peer and mediated requests
• Data returning from memory is preserved in a 64-entry buffer (256 entries per chip) until a memory read is received
• Allows for faster memory latency for workloads with low sharing
• Is dynamically enabled using code and data source tracking

Workload Caching Characteristics
• Individual Programs, Virtual Machines — Instructions: not shared; Data: not shared; Data source: instructions and data from memory; Performance needs: memory prefetch, predictable QoS
• Multiple Programs Working on Common Data, e.g. Database Shared Global Area (SGA) — Instructions: not shared; Data: shared; Data source: instructions from memory, some data as C2C; Performance needs: instruction memory prefetch, fast data C2C
• Partitioned Data Processing, e.g. Analytics — Instructions: shared; Data: not shared; Data source: instructions as C2C, data from memory; Performance needs: fast instruction C2C, data memory prefetch, high-bandwidth low-latency LLC hits, QoS/predictability
• Linked Lists, Control Data Structures — Instructions: shared; Data: shared; Data source: both instructions and data as C2C; Performance needs: fast instruction and data C2C

Dynamically Partitioned Cache
• For optimal use of the cache under variable load, a partition can use another partition as a victim cache
• A victim partition can either accept or reject the request based on a heuristic

Dynamically Partitioned Cache (workload active-threads timeline)
• Partitioned L3 Cache
  – Optimal for the "Throughput Duration" of the workload (large thread count, parallelized)
• Unified L3 Cache
  – Optimal for the "Serial Duration" of the workload (small thread count: result collection, garbage collection, scheduling)
• Dynamically Partitioned Cache
  – Adapts with workload behavior
  – L3 partitions join or disjoin based on the active threads in a partition
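The Dynamic Request slide above describes per-partition supplier tracking with a 64-bit history register. Below is a minimal sketch of that idea, assuming one history bit per miss (1 = served cache-to-cache, 0 = served from memory) and an assumed majority threshold for choosing peer-to-peer over mediated requests; the presentation specifies only the 64-bit register and independent code/data tracking, so the policy details here are illustrative.

```c
/* Sketch of supplier tracking for dynamic request selection. The shift-in
 * encoding and the >50% threshold are assumptions for illustration. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t data_history;  /* one bit per data miss: 1 = C2C, 0 = memory */
    uint64_t code_history;  /* instruction misses are tracked independently */
} supplier_tracker_t;

/* Shift one outcome bit into the 64-bit history register. */
static void record_miss(uint64_t *history, bool served_c2c)
{
    *history = (*history << 1) | (served_c2c ? 1u : 0u);
}

/* Prefer the peer-to-peer broadcast when most recent misses were served by
 * another partition; otherwise issue a mediated request and let memory
 * prefetch hide latency. */
static bool prefer_peer_to_peer(uint64_t history)
{
    return __builtin_popcountll(history) > 32;  /* GCC/Clang builtin popcount */
}

int main(void)
{
    supplier_tracker_t t = { 0, 0 };

    /* Example: a workload whose data misses are mostly served cache-to-cache. */
    for (int i = 0; i < 64; i++)
        record_miss(&t.data_history, i % 4 != 0);  /* ~75% C2C */

    printf("data misses favor a %s request\n",
           prefer_peer_to_peer(t.data_history) ? "peer-to-peer" : "mediated");
    return 0;
}
```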