An Efficient, Self-Contained, On-Chip Directory: DIR1-SISD

2015 International Conference on Parallel Architecture and Compilation An Efficient, Self-Contained, On-Chip Directory: DIR1-SISD Mahdad Davari∗, Alberto Ros†, Erik Hagersten∗ and Stefanos Kaxiras∗ ∗Dept. of Information Technology Uppsala University, Sweden Emails: {mahdad.davari, erik.hagersten, stefanos.kaxiras}@it.uu.se †Dept. of Computer Engineering University of Murcia, Spain Email: [email protected] Abstract—Directory-based cache coherence is the de-facto memory chip multi-processor (CMP) designs, has been standard for scalable shared-memory multi/many-cores and extensively used and studied [11], [12]. significant effort is invested in reducing its overhead. However, At a higher level of abstraction, different directory directory area and complexity optimizations are often anti- thetical to each other. Novel directory-less coherence schemes schemes are distinguished based on (i) how they keep track have been introduced to remove the complexity and cost of sharers, and (ii) which policy they employ to maintain associated with directories in their entirety. However, such coherence across those sharers. The notation DIRi-X has schemes introduce new challenges by transferring some of been used to describe directory schemes [11], [12], where the directory complexity and functionality to the OS and index i refers to how the sharers are tracked, and X denotes using the page table and the TLBs to store data classification information. the policy to maintain coherence across the sharers, such as In this work we bridge the gap between directory-based and broadcast (B) or no-broadcast (NB). Directory policy, on the directory-less coherence schemes and propose a hybrid scheme other hand, defines how the directory scheme behaves with called DIR1-SISD which employs self-invalidation and self- respect to (i) maintaining single-writer-{single or multiple}- shared downgrade as directory policies for the entries. DIR1- reader invariant [11], [13], (ii) request forwarding [12], and SISD allows simultaneous optimizations in area and complexity (iii) directory entry eviction —in case of using a directory without relying on the OS. DIR1-SISD keeps track of a single —private— owner, or allows multiple-readers-multiple-writers cache, which is often the case [8], [9]. Directory complexity to exist simultaneously by transferring the responsibility for is mainly attributed to the complexities associated with the their coherence to the corresponding cores. A DIR1-SISD self- directory policies. As an example, the directory scheme contained directory cache has a unique ability to minimize DIRn-NB is the extreme case which seeks to eliminate eviction-induced complexities by allowing directory entries to be evicted without maintaining inclusion with the cached data the complexities associated with write-induced invalidation (thus avoiding the complexities of broadcasts) and without broadcasts and collecting acknowledgments in their entirety, the need to have a backing store. Using simulation we show however it incurs area overhead by having to save a full- that a small, self-contained, DIR1-SISD cache outperforms a map vector per directory entry. At the other extreme, a n traditional DIR -NB MESI protocol with a directory cache hypothetical DIR0-B would incur no area overhead, but embedded in the LLC (8% in execution time and 15% in traffic) and, further, outperforms a SISD protocol that relies every write request received at the directory would trigger an on the OS to provide a persistent page-based directory (4% in invalidation broadcast, which translates into complexity [11], execution time and 20% in traffic). [12]. Other directory techniques to optimize the area, falling Keywords-multicore; memory hierarchy; cache coherence; in between the two extremes, result in complex directory mechanisms which significantly add to the verification cost and potentially impact scalability [9], [10]. I. INTRODUCTION Besides write requests, directory evictions might as well Hardware-based cache coherence has long served as trigger invalidation broadcasts depending on directory poli- an enabling factor in harnessing the compute power cies. Today’s CMPs implement directories as on-chip sparse of multi/many-cores by providing an easy programming directory-caches [8], [9], which makes them vulnerable paradigm through sparing programmers from dealing with to loss-of-information problem. Although eviction-induced explicit cache and consistency management [1]. While invalidations can be eliminated by allocating a backing snoop-based coherence schemes allowed implementation store in main memory [8], adding backing store is contrary of shared-memory systems using conventional bus-based to the goal of area-efficiency and low-complexity. As a networks [2]–[4], the need for scalable architectures, in- result, inclusion is maintained between cached data and corporating ever-increasing number of cores, necessitated their corresponding directory entries, which consequently directory-based coherence schemes [5]–[10]. Directory- adds to complexity by requiring invalidations —in form of based cache coherence, which has served in many shared- unicast, multicast, or broadcast— and acknowledgement col- 1089-795X/15 $31.00 © 2015 IEEE 317 DOI 10.1109/PACT.2015.23 lection [14]. Furthermore, such invalidations can potentially that enforce invalidations to maintain single-writer-multiple- increase the miss rate and degrade the performance. It is readers invariant [1], our approach allows multiple-writers- still possible to eliminate the eviction-induced invalidations multiple-readers to exist without invalidating any copies of without requiring a backing store, however this introduces a the cache line in L1 caches. This is achieved by giving new type of broadcast upon each directory miss in order to the sharers the responsibility to self-invalidate the shared discover and re-build the sharing status [10]. data when needed. While eliminating the write-induced At the other end of the spectrum, there are coherence invalidations/broadcasts, our approach only requires tracking schemes that aim to eliminate the complexities associated of a single sharer per block, which reduces the area overhead with the directories by removing the directories either in of the directory. In other words, Dir1-SISD either tracks part or in their entirety [15], [16]. By relying on data- a single private owner or allows multiple sharers without race-free (DRF) semantics, the need to obtain ownership tracking them, as long as they self-invalidate and self- upon write accesses is eliminated. In order to maintain downgrade. Thus, Dir1-SISD does not require broadcasts data consistency under such protocols, cores perform self- and implements a simple coherence scheme. invalidation (SI) and self-downgrade (SD) of their level-one Our approach also reduces the complexities associated (L1) cache shared data upon synchronization [17] —locks with eviction-induced invalidations/broadcasts. Under Dir1- and barriers. However, such schemes are heavily software- SISD scheme, a directory entry may be evicted without the dependent and partly delegate directory functions to other need to be backed-up or the need to maintain inclusion, system components such as the operating system (OS), which eliminates eviction-induced invalidations/broadcasts which in turn introduce new complexities and verification present in other protocols. Furthermore, as we later show challenges. As an example, VIPS-M coherence protocol [16] in Section V, our approach enables low-complexity dual- delegates private/shared data classification at page granular- granular directories, which further reduces area overhead by ity to the OS that uses the page-table in main memory as a requiring a single entry per private page in the directory. backing store for TLB entries that hold the classification. Main contributions: We propose a simple Dir1-SISD In essence the TLBs become directory caches and the directory organization to support self-invalidation/self- page table is the backing store. No directory information downgrade coherence that i) eliminates the reliance on the is ever lost in VIPS-M and this is one of the properties that OS, page tables, and TLBs for classification, and ii) intro- contribute to its simplicity. In other words, it is impossible duces no new protocol complexity such as broadcasts. This to use a (classification) directory-cache in VIPS without is because our proposed directory scheme has a unique char- having a backing store of the whole directory or severely acteristic not found in other directories: the on-chip directory compromising its simplicity by introducing broadcasts to cache is a self-contained directory, meaning that it neither manage information loss from the directory [18]. DeNovo needs to be backed-up externally nor enforces inclusion [15] is another coherence protocol which reduces the direc- upon directory evictions. We achieve this by exploiting the tory to track only the writers of data. However, this scheme self-invalidation and self-downgrade policies, as described is heavily application-dependent; furthermore, the last-level in Section III. Further, our directory is naturally extended cache (LLC) is delegated to keep track of the writer for each to multi-granular implementations as the information that is cache line and perform request-forwarding when needed. mainly tasked to track (owners of private blocks) is easily As the aforementioned examples show, VIPS-M and De- compressible to coarser granularities (e.g., regions, pages).

Load more