Cooperative Cache Scrubbing

Jennifer B. Sartor, Ghent University, Belgium
Wim Heirman, Intel ExaScience Lab∗, Belgium
Stephen M. Blackburn, Australian National University, Australia
Lieven Eeckhout, Ghent University, Belgium
Kathryn S. McKinley, Microsoft Research, Washington, USA

∗Wim Heirman was a post-doctoral researcher at Ghent University while this work was being done.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
PACT'14, August 24–27, 2014, Edmonton, AB, Canada.
Copyright 2014 ACM 978-1-4503-2809-8/14/08 ...$15.00.
http://dx.doi.org/10.1145/2628071.2628083.

ABSTRACT

Managing the limited resources of power and memory bandwidth while improving performance on multicore hardware is challenging. In particular, more cores demand more memory bandwidth, and multi-threaded applications increasingly stress memory systems, leading to more energy consumption. However, we demonstrate that not all memory traffic is necessary. For modern Java programs, 10 to 60% of DRAM writes are useless, because the data on these lines are dead - the program is guaranteed to never read them again. Furthermore, reading memory only to immediately zero initialize it wastes bandwidth. We propose a software/hardware cooperative solution: the memory manager communicates dead and zero lines with cache scrubbing instructions. We show how scrubbing instructions satisfy MESI cache coherence protocol invariants and demonstrate them in a Java Virtual Machine and multicore simulator. Scrubbing reduces average DRAM traffic by 59%, total DRAM energy by 14%, and dynamic DRAM energy by 57% on a range of configurations. Cooperative software/hardware cache scrubbing reduces memory bandwidth and improves energy efficiency, two critical problems in modern systems.

Categories and Subject Descriptors

C.0 [Computer Systems Organization]: General - Hardware/software interfaces, instruction set design; D.3.4 [Programming Languages]: Processors - Memory management (garbage collection), optimization, runtimes

1. INTRODUCTION

Historically, improvements in processor speeds have outpaced improvements in memory speeds and current trends exacerbate this memory wall. For example, as multicore processors add cores, they rarely include commensurate memory or bandwidth [43]. Today, the memory wall translates into a power wall. More cores need bigger memories to store all of their data, which requires more power. The limits on total power due to packaging and temperature constrain system design and consequently, the fraction of total system power consumed by main memory is expected to grow. For some large commercial servers, memory power already dominates total system power [31].

Software trends exacerbate memory bandwidth problems. Evolving applications have seemingly insatiable memory needs with increasingly larger working set sizes and irregular memory access patterns. Recent studies of managed and native programs on multicore processors show that memory bandwidth limits performance and scaling [21, 43, 55]. Two factors in managed languages pose both opportunities and challenges for memory bandwidth optimization: (1) rapid object allocation and (2) zero initialization.

(1) Many applications rapidly allocate short-lived objects, relying on garbage collection or region memory managers to clean up. This programming style creates an object stream with excellent temporal locality, which the allocator exploits to create spatial locality [5, 6, 7]. Unfortunately, short-lived streams displace useful long-lived objects, leave dead objects on dirty cache lines, and increase bandwidth pressure. The memory system must write dirty lines back to memory as they are evicted, but many writes are useless because they contain dead objects that the program is guaranteed to never read again. Our measurements of Java benchmarks show that 10 to 60% of write backs are useless!

(2) Managed languages and safe C variants require zero initialization of all fresh allocation. With the widely used fetch-on-write cache policy, writes of zeroes produce useless reads from memory that fetch cache lines, only to immediately over-write them with zeroes. Prior work shows memory traffic due to zeroing ranges between 10 and 45% [54].

While prodigious object allocation and zero initialization present challenges to the memory system, they also offer an opportunity. Because the memory manager identifies dead objects and zero initializes memory, software can communicate this semantic information to hardware to better manage caches, avoid useless traffic to memory, and save energy.

These observations motivate software/hardware cooperative cache scrubbing. We leverage standard high-performance generational garbage collectors, which allocate new objects in a region and periodically copy out all live objects, at which point the collector guarantees that any remaining objects are dead - the program will never read from the region before writing to it. However, any memory manager that identifies dead cache-line sized blocks, including widely used region allocators in C programs [4], can use scrubbing to eliminate writes to DRAM. The memory manager communicates dead regions to hardware. We explore three cache scrubbing instructions: clinvalidate invalidates the cache line, clundirty unsets the dirty bit, and clclean unsets the dirty bit and moves the cache line to the set's least-recently-used (LRU) position. While clinvalidate is similar to PowerPC's discontinued dcbi instruction [20], and ARM's privileged cache invalidation instruction [3], clundirty and clclean are new contributions. We also use clzero, which is based on existing cache zeroing instructions [20, 32, 47] to eliminate read traffic. We extend the widely used MESI cache coherence protocol to handle these instructions and show how they maintain MESI invariants [14, 20, 22, 24, 37, 41]. After each collection, the memory manager issues one of the three scrubbing instructions for each dead cache line. During application execution, the memory manager zero initializes memory regions with clzero.

To evaluate cache scrubbing, we modify Jikes RVM [2, 7], a Java Virtual Machine (JVM), and the Sniper multicore simulator [11] and execute 11 single and multi-threaded Java applications. Because memory systems hide write latencies relatively well, simulated performance improvements are modest, but consistent: 4.6% on average across a range of configurations. However, scrubbing dramatically improves the efficiency of the memory system. Scrubbing and zeroing together reduce DRAM traffic by 59%, total DRAM energy by 14%, and dynamic DRAM energy by 57%, on average. Across the three scrubbing instructions, we find that clclean, which suppresses the write back and puts the cache line in the LRU position, performs best.

While we focus on managed languages, prior work by Isen et al. identifies and reduces useless write backs in explicitly managed sequential C programs [23]. They require a map of the live/free status of blocks of allocated memory, which hardware stores after software communicates the information. Their approach is not practical since (1) it requires a 100 KB hardware lookup table and a mark bit on every cache line, (2) it exacerbates limitations [...]

2. BACKGROUND AND MOTIVATION

We first present background on the bandwidth bottleneck and memory management. We then characterize the opportunity due to useless write backs. In our Java benchmarks, from 10 to 60% of write backs are useless, i.e., the garbage collector has determined that the data is dead before the cache writes it back. We then describe why existing cache hint instructions are insufficient to solve this problem.

2.1 Bandwidth and Allocation Wall

Much prior research shows that bandwidth is increasingly a performance bottleneck as chip-multiprocessor machines scale up the number of cores, both in hardware and software [10, 21, 35, 40, 43, 55]. Rogers et al. show that the bandwidth wall between off-chip memory and on-chip cores is a performance bottleneck and can severely limit core scaling [43]. Molka et al. perform a detailed study on the Nehalem microarchitecture [40], showing that on the quad-core Intel X5570 processor, memory read bandwidth does not scale beyond two cores, while write bandwidth does not scale beyond one core.

On the software side, Zhao et al. show that allocation-intensive Java programs pose an allocation wall which quickly saturates memory bandwidth, limiting application scaling and performance [55]. Languages with region allocators and garbage collection encourage rapid allocation of short-lived objects, which sometimes interact poorly with cache memory designs. Scrubbing addresses this issue by communicating language semantics down to hardware.

2.2 Memory Management

The best performing memory managers for explicit and managed languages have converged on a similar regime. Explicit memory managers use region allocators when possible [4, 21] and high-performance garbage collectors use generational collectors with a [...]
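
The effect of the four instructions named in the introduction (clinvalidate, clundirty, clclean, and clzero) can be illustrated with a small Python model. This is only a sketch under stated assumptions, not the Jikes RVM or Sniper implementation used in the paper: the names ToyCache and run_nursery, the fully associative LRU organization, and the cache and nursery sizes are all hypothetical, and the model covers a single core with no MESI coherence. It counts DRAM reads and write backs while a nursery region is repeatedly allocated into and then scrubbed.

```python
from collections import OrderedDict


class ToyCache:
    """Fully associative write-back cache with LRU replacement (toy model)."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        # Maps line address -> dirty flag; the first entry is the LRU line.
        self.lines = OrderedDict()
        self.dram_reads = 0
        self.dram_writes = 0

    def _install(self, addr, dirty):
        if len(self.lines) >= self.num_lines:
            _, was_dirty = self.lines.popitem(last=False)  # evict the LRU line
            if was_dirty:
                self.dram_writes += 1  # dirty victim is written back
        self.lines[addr] = dirty  # new entry lands in the MRU position

    def write(self, addr):
        """Ordinary store under a fetch-on-write policy."""
        if addr in self.lines:
            self.lines.move_to_end(addr)
            self.lines[addr] = True
        else:
            self.dram_reads += 1  # a write miss first fetches the line
            self._install(addr, dirty=True)

    def clzero(self, addr):
        """Zero a whole line in the cache without fetching it from DRAM."""
        if addr in self.lines:
            self.lines.move_to_end(addr)
            self.lines[addr] = True
        else:
            self._install(addr, dirty=True)

    def clinvalidate(self, addr):
        """Drop the line; its dirty data is never written back."""
        self.lines.pop(addr, None)

    def clundirty(self, addr):
        """Clear the dirty bit; the line keeps its place in LRU order."""
        if addr in self.lines:
            self.lines[addr] = False

    def clclean(self, addr):
        """Clear the dirty bit and move the line to the LRU position."""
        if addr in self.lines:
            self.lines[addr] = False
            self.lines.move_to_end(addr, last=False)


def run_nursery(scrub, zero_with_clzero, nursery_lines=8, cache_lines=4, rounds=3):
    """Allocate into a reused nursery, scrubbing it after each 'collection'.

    scrub is None or the name of a scrubbing method on ToyCache (hypothetical
    driver; assumes every nursery line is dead after each round).
    """
    cache = ToyCache(cache_lines)
    for _ in range(rounds):
        for addr in range(nursery_lines):
            if zero_with_clzero:
                cache.clzero(addr)  # zero initialization, no DRAM read
            else:
                cache.write(addr)   # zeroing store, fetches the line first
        if scrub is not None:
            for addr in range(nursery_lines):
                getattr(cache, scrub)(addr)
    return cache.dram_reads, cache.dram_writes


baseline = run_nursery(scrub=None, zero_with_clzero=False)
scrubbed = run_nursery(scrub="clclean", zero_with_clzero=True)
print("baseline (reads, writes):", baseline)
print("scrubbed (reads, writes):", scrubbed)
```

In this toy configuration the scrubbed run performs no DRAM reads and fewer write backs than the baseline, because lines still cached at collection time are marked clean before they are evicted. Lines evicted before the collection are still written back, since they were not yet known to be dead. The counts describe this model only and are not comparable to the measurements reported above.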