Louvre: Lightweight Ordering Using Versioning for Release Consistency

Pranith Kumar, Prasun Gera, Hyojong Kim, Hyesoon Kim
School of Computer Science, College of Computing, Georgia Institute of Technology
Atlanta, GA, USA
{pranith, prasun.gera, hyojong.kim, hyesoon.kim}@gatech.edu

Abstract—Fence instructions are fundamental primitives that ensure consistency in a weakly consistent multi-core processor. The execution cost of these instructions is significant and adds a non-trivial overhead to parallel programs. In a naive architecture implementation, the ordering constraints imposed by a fence are tracked by its entry in the reorder buffer, and its execution overhead entails stalling the processor's pipeline until the store buffer is drained, as well as conservatively invalidating speculative loads. These actions create a cascading effect of increased overhead on the execution of the following instructions in the program. We find these actions to be overly restrictive and show that they can be further relaxed, thereby allowing aggressive optimizations.

The current work proposes a lightweight mechanism in which we assign ordering tags, called versions, to load and store instructions while they reside in the load/store queues and the write buffer. The version assigned to a memory access allows us to fully exploit the relaxation permitted by the weak consistency model while restricting the access's execution so that the ordering constraints imposed by the model are satisfied. We utilize the information captured through the assigned versions to reduce the stalls caused by waiting for the store buffer to drain and to avoid unnecessary squashing of speculative loads, thereby minimizing the re-execution penalty. This method is particularly effective for the release consistency model, which employs uni-directional fence instructions. We show that this mechanism reduces the ordering instruction latency by 39.6% and improves program performance by 11% on average over the baseline implementation.

I. INTRODUCTION

As core counts have continued to increase over the last decade, the shared memory programming model has become an attractive choice for high-performance, scientific, and general-purpose parallel applications. The use of this model is facilitated by high-level libraries and frameworks such as Cilk [1] and OpenMP [2]. The shared memory programming model necessitates a consistent view of memory for programmers to reason about the correctness of parallel programs. This consistent view is dictated by the memory consistency model of the processor.

Higher-level languages standardize on a consistency model to maintain portability across architectures. Languages such as C and C++ have standardized on data-race-free memory models [3] that guarantee that accesses to synchronization objects (or atomics) are sequentially consistent. Sequential consistency for such accesses is ensured by the compiler, which generates appropriate ordering instructions. These instructions indicate an ordering point in the instruction stream, and when one is encountered by the processor, certain classes of optimizations on the following memory accesses are disabled to ensure consistency [4]–[6]. Optimizations such as speculative execution, prefetching, and buffering of store instructions are implemented by modern processors to reduce the cost of memory accesses [7]–[9]. Fence instructions restrict optimizations that exploit instruction-level parallelism, disable speculative execution of post-fence loads and stores, and introduce stalls to drain the store buffer.

Researchers have proposed techniques that allow a processor to employ hardware optimizations when ordering instructions are scheduled, thereby reducing their execution overhead [10], [11]. One such technique utilizes check-pointing along with aggressive speculative execution of post-fence loads and stores; the check-pointed state is tracked and updated so that it enables rollback in case of ordering violations [12]–[14]. Another promising approach is to increase the granularity at which consistency is enforced, thereby amortizing the overhead of ordering instructions; this technique has been applied both in hardware [15] and at an algorithmic level [16], [17]. Further, based on the observation that ordering violations rarely occur at runtime [12], studies have proposed techniques that identify scenarios in which violations are likely to occur and handle them in special ways, thereby increasing the scope for speculative execution. These techniques use annotations that tag ordering instructions [18], hardware extensions [13], or run-time information [19].
However, all these techniques were proposed in the context of stronger memory consistency models such as that of x86, and little work has explored reducing the overhead of ordering instructions specifically in weak memory model architectures.

Release consistency (RCsc) [20] is one of the most widely implemented weak memory models in recent processors. Variants of this model have been adopted in architectures such as Itanium, ARM64, and PowerPC, and it is also being considered for adoption in RISC-V processors [21]. In this memory model, memory accesses are classified as ordinary, acquire, or release operations; the classification is done using special instructions in the ISA. A load-acquire operation and a store-release operation place ordering constraints on the memory accesses following or preceding them, respectively. This is in contrast to a full fence, which places ordering constraints in both directions. Acquire and release operations thus allow more leeway for possible re-ordering of memory accesses, thereby allowing a processor implementation to reduce the overhead of ensuring memory consistency.

We propose LOUVRE, a low-overhead hardware extension that ensures memory ordering using versioning and helps reduce the overhead of fence instructions in a processor implementing the release consistency model. Our proposed mechanism adds ordering tag fields to the load/store queues and the write buffer. In the proposed technique, we assign versions to memory instructions when they are issued, and we update the version being assigned upon issuing an ordering instruction. Instead of tracking the ordering constraints through the entry of the fence instruction in the reorder buffer (ROB), we track them in a separate FIFO queue. This allows us to retire the ordering instruction from the ROB while preserving the constraint that needs to be enforced. Using the versions assigned to memory accesses, we identify and rectify any consistency violations at run-time without maintaining any global state or inter-core communication. In contrast to versioning, traditional implementations of ordering instructions stall the pipeline to drain the store buffer in order to ensure order between the stores in the buffer and the memory accesses following the ordering instruction. In the proposed versioning mechanism, we issue and execute the instructions that follow the ordering instruction without draining the store buffer; since the stores in the store buffer are versioned, we enforce the correct order among them using these versions. Furthermore, the versioning mechanism is efficient for one-way fence instructions, since it allows the processor to speculatively issue and execute instructions following a store-release instruction before the store-release instruction itself has finished execution.

Fig. 1. Distribution of fences in ARM64.

Although we did not find any hardware optimizations in the literature specifically targeting one-way fence instructions, the work closest in spirit to ours is zFence [22]. As in the current work, the authors of zFence focus on reducing the stalls caused by waiting for the store buffer to drain. They achieve this by introducing a new coherence state which indicates exclusive permission to a cache line without the cache line data being available. When all the stores waiting in the store buffer acquire this exclusive permission, the store buffer does not need to be drained, since no other processor can modify those cache lines, thereby reducing the fence overhead. In contrast, the current proposal requires only minor hardware changes to avoid the store buffer drain stall.

The main contributions of our work are as follows:
• We give a detailed micro-architectural description of an implementation of the RCSC memory model semantics.
• We propose a hardware mechanism based on versioning loads and stores that reduces the cost of ordering instructions in weak memory model architectures.
• We propose a low-overhead mechanism that optimizes the execution of uni-directional ordering instructions by taking advantage of the re-orderings allowed in a weak memory model architecture.
II. BACKGROUND AND MOTIVATION

A. Fence instructions

Ordering instructions can be classified as either uni-directional (also known as one-way fences) or bi-directional (also known as full fences). Instructions such as mfence and dmb are bi-directional in that they do not allow any re-ordering across them. The restrictions placed by such fence instructions on memory accesses conflict with the optimization techniques used to hide the latency of long memory accesses in modern processors. Efforts to reduce this overhead led to the design of the release consistency model [20]. This model prescribes acquire and release semantics, which are achieved by uni-directional fence instructions. These instructions allow re-ordering of memory accesses across them in only one direction and have either acquire or release semantics. Slight variations of this conventional consistency model are implemented in architectures such as Itanium and ARM64. On ARM64, a load-acquire instruction (also called a synchronizing load, ldar) enforces an order between its load and all the following memory accesses. A store-release instruction (also called a synchronizing store, stlr) enforces an order between its store and all the preceding memory accesses. A store-release also ensures that the store is multi-copy atomic; the store is made visible to all other processors simultaneously [23]. Additionally, a fence instruction is sequentially consistent with respect to all other fence instructions.
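As a concrete software-level illustration, consider the classic message-passing idiom written with C++11 atomics. On ARM64 a compiler can typically map the release store to stlr and the acquire load to ldar rather than emitting full dmb fences; this sketch uses our own variable names and is only meant to show where the one-way orderings fall:

```cpp
#include <atomic>
#include <thread>

std::atomic<int> data{0};
std::atomic<bool> ready{false};

// Producer: the release store orders the preceding write to `data`
// before the store to `ready` (one-way: nothing after it is held back).
void producer() {
    data.store(42, std::memory_order_relaxed);
    ready.store(true, std::memory_order_release);   // typically stlr on ARM64
}

// Consumer: the acquire load orders the following read of `data`
// after the load of `ready` (again one-way).
void consumer() {
    while (!ready.load(std::memory_order_acquire)) { /* spin */ }  // ldar
    int v = data.load(std::memory_order_relaxed);   // guaranteed to see 42
    (void)v;
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}
```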
Since one-way fences remove constraints on possible memory access re-orderings, it is more efficient to generate them instead of full fences whenever possible. Figure 1 shows the frequency distribution of one-way and full fences in benchmarks from the PARSEC [24] benchmark suite compiled for ARM64 processors. We observe that one-way fences, on average, are twice as frequent as full fences in most of the benchmarks. Exploiting the reduced constraints enforced by a one-way fence necessitates an efficient micro-architectural implementation. However, optimizations specific to one-way fences are limited. As a result, a uni-directional fence, when treated as a full fence, limits the potential performance improvement. To the best of our knowledge, our proposal is the first to present a detailed, optimal micro-architectural implementation of one-way fences.

B. Definitions

In the rest of the paper, we use the following terminology. An instruction is fetched (FE), decoded, and then issued (IS). Once it is issued (possibly out of order), it can be executed (EX) when all its dependencies are met. Once an instruction reaches the head of the reorder buffer (ROB) and finishes execution, it can commit or retire (RE). Upon retiring, a store instruction moves from the ROB to the store buffer (SB), at which point it is said to be a post-commit store. A store will complete (CO) once it updates the value in the cache and drains from the store buffer. A load is said to be satisfied (SA) after it loads the value into a register. While a load can retire only after it is satisfied, a store can retire before it completes. An instruction remains in a speculative state as long as it resides in the ROB. The life cycle of a memory access instruction is shown in Figure 2. Note that a satisfied load in a speculative state can be squashed (SQ) and re-executed along with all of its dependent instructions.

Fig. 2. Life-cycle of a memory access instruction.

Program order (po) is the order in which instructions are fetched and decoded by the same core. We use po-before/previous to refer to instructions that are earlier in program order, and po-after/later/following to refer to instructions that are later in program order, relative to the current instruction.

C. RCSC Semantic Rules

The following semantic rules describe the constraints imposed by RCSC [20]. Here X and Y are two instructions.

RC1 and RC2 specify that memory accesses that are po-before a full fence are ordered before the po-later memory accesses.

RC3 specifies that the load of a load-acquire fence should be ordered before any po-later memory accesses.

RC4 specifies that the store of a store-release fence should be ordered after all the po-previous memory accesses.

RC5 specifies that all the fence instructions should be ordered sequentially with respect to one another.

Finally, RC6 specifies that the value of a load is satisfied from the latest po-before store to the same address pending in the store buffer. If there is no such store, the load is satisfied from the latest global store to the same address.

We later show in Section IV-A that LOUVRE satisfies the RCSC semantics listed here.
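Written compactly in our own shorthand (X <po Y for program order and X <m Y for the global memory order; this notation is a paraphrase of the glosses above, not the paper's original formal statements):

```latex
\begin{align*}
\text{RC1, RC2:}\;& X <_{po} F \ \wedge\ F <_{po} Y \;\Rightarrow\; X <_m Y
   \quad\text{($F$ a full fence; RC1 for stores, RC2 for loads)}\\
\text{RC3:}\;& \mathrm{LDAR}(X) \ \wedge\ X <_{po} Y \;\Rightarrow\; X <_m Y\\
\text{RC4:}\;& \mathrm{STLR}(Y) \ \wedge\ X <_{po} Y \;\Rightarrow\; X <_m Y\\
\text{RC5:}\;& <_m \text{ restricted to fence instructions is a total order}\\
\text{RC6:}\;& L(a) \text{ reads from the latest } <_{po}\text{-earlier } S(a)
   \text{ pending in the store buffer, else from the latest global } S(a)
\end{align*}
```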

D. Conventional Microarchitecture

In this section, we describe three actions that a conventional processor performs to avoid ordering violations and to implement the semantics of the RCSC model listed in the previous section: retirement of a fence, draining the store buffer, and squashing a speculative load on a cache line invalidation. The processor enforces RC1 for a full fence by draining the store buffer before retiring the full fence instruction. This ensures that post-fence stores do not complete before pre-fence stores from the unordered store buffer. The processor enforces RC2 by squashing any speculative post-fence load upon receiving an invalidation request for its cache line until the full fence is retired. The one-way fences act as the two individual halves of a full fence, and the processor performs the same actions separately to enforce rules RC3, RC4, and RC5. To enforce RC3 for a synchronizing load, the processor needs to squash any speculative po-later loads upon invalidation as long as the synchronizing load is in flight. To enforce RC4 for a synchronizing store, the processor drains the store buffer of all po-previous stores before retiring it.
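The following sketch summarizes these baseline actions in simulator-style C++. The types and member names are hypothetical and serve only to make the two checks concrete:

```cpp
#include <cstdint>
#include <vector>

// Minimal, hypothetical structures for illustration only.
struct Load { uint64_t cacheLine; bool speculative; bool squashed = false; };

struct StoreBuffer {
    std::vector<uint64_t> pending;                   // pre-fence stores
    bool empty() const { return pending.empty(); }
};

// RC1: a full fence stalls at the ROB head until every pre-fence store
// has drained from the (unordered) store buffer.
bool canRetireFullFence(const StoreBuffer& sb) { return sb.empty(); }

// RC2/RC3: while a full fence or load-acquire is in flight, any speculative
// load satisfied from the invalidated cache line is squashed (and later
// re-executed together with its dependents).
void onCacheInvalidation(std::vector<Load>& lsq, uint64_t line) {
    for (auto& ld : lsq)
        if (ld.speculative && ld.cacheLine == line)
            ld.squashed = true;
}

int main() {
    StoreBuffer sb;                                  // empty: fence may retire
    std::vector<Load> lsq{{0x80, true}};
    onCacheInvalidation(lsq, 0x80);                  // speculative load squashed
    return (canRetireFullFence(sb) && lsq[0].squashed) ? 0 : 1;
}
```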


Fig. 3. Inefficient load speculation in a weak consistency architecture.

III. MOTIVATION

In this section we discuss the ordering constraints imposed by different fence instructions and identify two major sources of overhead caused by these constraints.

A. Constraints of ordering instructions

A full fence and a load-acquire fence enforce the constraint that the processor squash all speculative loads satisfied from a cache line when that cache line is invalidated. This constraint prevents speculative loads from using stale values, which would otherwise not be used if the loads were executed non-speculatively. A full fence and a store-release fence enforce the constraint that the processor drain the store buffer of all po-previous stores before retiring the fence. In the case of a full fence, this constraint prevents a po-later store from completing ahead of a po-previous store. In the case of a store-release fence, it prevents the synchronizing store from completing ahead of any po-previous stores. The inefficiencies caused by these constraints are discussed in the following sections.

B. Inefficient load speculation

A load-store queue (LSQ) keeps track of all in-flight load and store instructions in program order. To detect ordering violations, the LSQ snoops the addresses of incoming cache invalidations. If the address of a cache invalidation matches that of an in-flight satisfied load, the load might have read a stale value, causing an ordering violation. In such cases, the processor squashes the load and all later instructions, and re-executes them.

The first inefficiency we identify is the unnecessary squashing of speculative loads upon invalidation to prevent ordering violations in a processor implementing release consistency.
While such loads require squashing in processors with stricter memory models, it is not always necessary in weaker-model processors. In particular, the processor needs to squash a speculative load only in the presence of an in-flight fence enforcing order on that load. Since ordering instructions retire after they enforce the required constraints, a speculative load no longer violates ordering when all preceding fence instructions have retired. In such cases we do not need to squash it on an invalidation. These scenarios arise in the presence of all three kinds of fences. We detail one such scenario with a synchronizing store next; an example with a synchronizing load is illustrated in Section V-F.

In Figure 3, instruction I1 is a store-release fence. Instructions I2 and I3 are independent loads that access addresses A1 and A2, respectively. Instructions I4 and I5 depend on the value loaded by instruction I3. The illustration shows a situation in which I2 has a cache miss but I3 has a cache hit. Because of the cache hit, the speculative load I3 is satisfied earlier than the speculative load I2. The synchronizing store I1 is waiting for the store buffer to drain to prevent possible reordering with any pending stores residing in the store buffer. At this point, an invalidation request for the cache line containing A2 arrives (say from core 1) at point (1). Note that this is after the speculative load I3 is satisfied. However, since I3 is still speculative, the processor squashes it and its dependent instructions, I4 and I5, and re-executes them. Yet because there is no fence enforcing order on I3, the value it loaded does not violate any ordering constraint, so squashing and re-executing instructions I3, I4, and I5 is unnecessary. We identify such situations in LOUVRE and avoid the squashing and re-execution penalty. This is explained later in Section V-E.

C. Inefficient store retirement

Typically, a store instruction that misses in the cache waits in the ROB until the cache line is ready for an update. When this store reaches the head of the ROB, instead of holding up the following instructions while waiting for the cache line, the processor retires it and buffers it in the store buffer [7], [25]. A store retires from the ROB and moves to the store buffer, from where it drains after completion [26], [27]. In architectures implementing the TSO memory model, the store buffer is ordered: the stores drain out of the buffer in FIFO order. In weaker memory model architectures the store buffer can be unordered, that is, the stores can drain out in any order. In our experiments, we assume a weak memory model architecture and an unordered store buffer. Additionally, the store buffer is augmented with a CAM structure that tracks the destination address and relative age of each store in the buffer; this is necessary for store-to-load forwarding.

The second inefficiency we focus on is the stall of a full fence or synchronizing store at the head of the ROB while waiting for the stores in the unordered store buffer to drain. This stall prevents ordering violations in which post-fence stores are ordered before pre-fence stores. Without this stall, a processor, after retiring the fence, would retire post-fence stores and place them in the store buffer along with pre-fence stores. Since the store buffer is unordered, these post-fence stores could drain before pre-fence stores, resulting in an ordering violation. The cost of draining the store buffer is the number of cycles it takes for all the stores pending in it to complete.
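A minimal sketch of such an unordered, CAM-augmented store buffer is shown below. The types are hypothetical; the age field supplies the relative-age match needed for store-to-load forwarding (cf. rule RC6):

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical post-commit store buffer entry: the CAM portion holds the
// destination address; `age` gives the relative program-order age used to
// pick the youngest matching store when forwarding.
struct SbEntry { uint64_t addr; uint64_t data; uint64_t age; };

struct UnorderedStoreBuffer {
    std::vector<SbEntry> entries;   // drain order is unconstrained (weak model)

    // Store-to-load forwarding: return the data of the youngest (largest age)
    // pending store to the same address, if any.
    std::optional<uint64_t> forward(uint64_t addr) const {
        const SbEntry* best = nullptr;
        for (const auto& e : entries)
            if (e.addr == addr && (!best || e.age > best->age)) best = &e;
        return best ? std::optional<uint64_t>(best->data) : std::nullopt;
    }
};

int main() {
    UnorderedStoreBuffer sb;
    sb.entries = {{0x100, 1, /*age=*/1}, {0x100, 2, /*age=*/2}, {0x200, 9, 3}};
    return sb.forward(0x100).value() == 2 ? 0 : 1;   // youngest store wins
}
```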

Fig. 4. Early retirement and completion of stores from an unordered store buffer: (a) baseline and (b) with versions.

IV. LOUVRE

In this section we describe the semantics of LOUVRE and explain how we employ the versioning rules to order loads and stores.

A. Semantic Rules

This section details the versioning semantic rules (VSR) that describe the global order enforced by versions. Here X and Y are two instructions; vx and vy are the versions assigned to those instructions. LDAR(X) is true if X is a load-acquire instruction. L(a) and S(a) are a load and a store to address a, respectively.

VSR1. X

ORQ (orq): We need to track in-flight synchronizing loads and full fences in order to squash speculative post-fence loads upon invalidation. We use an ordering queue to track these fences.

A simplified architecture with the above structures is detailed in Fig. 7.

Fig. 7. Overview of a simplified architecture implementing LOUVRE.

B. Version Registers Update

We increment the last fence version register, lfvr, after issuing any fence instruction. This ensures that the ordering instructions themselves are properly ordered. We update the version register vr to lfvr when we issue a full fence instruction. This update ensures that the version assigned to later memory accesses is higher, and hence orders them with respect to previous memory accesses.

Table II summarizes the assignment of versions and the updates to the version register vr upon issuing an instruction. Since we continuously increment the version registers, depending on their size, they will occasionally overflow. To handle an overflow, we stop issuing further instructions and wait for the pipeline to drain. After draining the pipeline, we reset the version registers and resume issuing instructions. Assuming that we issue 10 fence instructions per kilo-instruction, with a 10-bit version register we expect to see an overflow approximately once every hundred thousand instructions. We illustrate an example of version assignment to a sequence of memory operations in Table III.

TABLE II
VERSION ASSIGNMENT AND REGISTER UPDATE ON ISSUING AN INSTRUCTION

  Issued Inst.     Version assignment     Register update
  load/store       version = vr           vr is not updated
  load-acquire     version = vr           vr is not updated
  store-release    version = vr + 1       vr is not updated
  fence            —                      vr = lfvr

C. Retirement

In this section we detail the conditions required for the retirement of different kinds of instructions in LOUVRE.

Stores: A store reaching the head of the ROB retires immediately. Retired stores move to the store buffer awaiting completion.

Loads: In conventional implementations, a satisfied load that reaches the head of the ROB can retire. In LOUVRE, however, we also need to consider the stores residing in the store buffer, because of the relaxed constraints on fence retirement. In the proposed mechanism, a load cannot retire as long as a store with a lower version exists in the store buffer. This ensures that any store pending in the store buffer with a lower version completes before this load and hence is ordered before it. Consider the execution of a storeA; fence; loadB instruction sequence in which store A is assigned version 1 and load B version 2: load B should not retire until store A completes from the store buffer, even if load B is at the head of the ROB. To check for a store versioned lower than the load at the head of the ROB without probing the store buffer, we use vmin,sb. As explained in Section V-A, vmin,sb holds the minimum version among the stores in the store buffer. Hence, if the version of the load at the head of the ROB is not greater than vmin,sb, all the ordering constraints are satisfied and we can retire the load.

Ordering instructions: A synchronizing store, like an ordinary store, retires once it reaches the head of the ROB and then moves to the store buffer. There is no other constraint on it, since the version assigned to its store ensures the required ordering. A satisfied synchronizing load, like an ordinary load, retires once it reaches the head of the ROB and its version is not greater than the version in vmin,sb. A full fence instruction retires once it reaches the head of the ROB. There is no delay in retiring a full fence instruction in the proposed technique, since versioning ensures ordering on the memory accesses for which these fences are responsible.
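The issue-time tagging of Table II, together with the load-retirement check against vmin,sb, can be sketched as follows. The structures are hypothetical, and the interaction between lfvr and the one-way fences is simplified here:

```cpp
#include <cstdint>

enum class Kind { Load, Store, LoadAcquire, StoreRelease, FullFence };

// Hypothetical version registers: vr is the version handed to newly issued
// memory accesses; lfvr tracks the version of the last issued fence.
struct VersionRegs { uint64_t vr = 0, lfvr = 0; };

// Issue-time tagging following Table II (lfvr bumped on any fence issue).
uint64_t assignVersion(Kind k, VersionRegs& r) {
    if (k == Kind::LoadAcquire || k == Kind::StoreRelease || k == Kind::FullFence)
        ++r.lfvr;                                   // fences advance lfvr
    switch (k) {
    case Kind::StoreRelease: return r.vr + 1;       // release tagged vr + 1
    case Kind::FullFence:    r.vr = r.lfvr;         // later accesses get a
                             return r.vr;           // strictly higher version
    default:                 return r.vr;           // plain and acquire: vr
    }
}

// A load at the ROB head may retire only if no lower-versioned store is
// still pending; vmin_sb caches the minimum version in the store buffer.
bool loadCanRetire(uint64_t loadVersion, uint64_t vmin_sb, bool sbEmpty) {
    return sbEmpty || loadVersion <= vmin_sb;
}

int main() {
    VersionRegs r;
    uint64_t vA = assignVersion(Kind::Store, r);     // store A  -> version 0
    assignVersion(Kind::FullFence, r);               // fence: vr = lfvr = 1
    uint64_t vB = assignVersion(Kind::Load, r);      // load B   -> version 1
    // Load B must wait for store A (lower version) to complete.
    return (vA < vB && !loadCanRetire(vB, vA, false)) ? 0 : 1;
}
```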
D. Store Completion

Stores complete from the store buffer once they update the cache line. Since the store buffer is unordered, it allows out-of-order completion; the stores can complete in any order consistent with the versions assigned. Among all the stores that have their cache line available, one of the lowest-versioned stores completes first; this ensures that stores complete in the order enforced by the versions. We also allow the oldest store residing in the store buffer to complete irrespective of its version, since no ordering constraint can exist on the oldest store.

E. Speculative Execution and Invalidation

As discussed in Section III-B, we do not need to squash and re-execute every speculative load when the cache line it reads from is invalidated. We need to squash loads only when there is an in-flight full fence or load-acquire instruction enforcing order on them. We ensure this order by comparing the versions of the loads affected by an invalidation with the versions vmin,sb and vmin,lsq. We squash and re-execute a speculative load only if its version is greater than these versions. Only post-fence loads with an active fence (detected using the orq) have a version higher than the versions in vmin,sb and vmin,lsq. Squashing such a load ensures that it is not ordered ahead of the pre-fence stores, which may reside in the store buffer. In all other cases, we do not squash loads upon a cache invalidation. This increases the effectiveness of speculation in the presence of ordering instructions.

We illustrate an optimal execution of the scenario shown in Figure 3 when using versioning in Figure 8. In this example, instruction I1 is a store-release fence. Once it retires, the ordering store moves to the store buffer with a version of 1. The version assigned to loads I2 and I3 is 0. When the invalidation message for the cache line containing A2 arrives at point (1), we can determine that no ordering constraint exists by comparing the version of the load to vmin,sb. Since the version of the load is lower than vmin,sb, we can skip squashing I3, thereby saving the re-execution penalty.

[Figure 8 timeline: on core 0, I1: STLR W0, [A0] retires without waiting for the store buffer to drain; I2: LDR W2, [A1] misses in the cache; I3: LDR W3, [A2] hits and is satisfied early; I4: ADD W4, W3, #1; I5: STR W4, [A3]. The invalidation for A2 from core 1 arrives at point (1); I3 is not squashed because its version does not exceed vmin,sb.]

Fig. 8. Efficient load speculation in LOUVRE.
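A sketch of the completion pick and the invalidation filter, under the same hypothetical structures as before (the completion policy is one plausible reading of the rules above, not the paper's RTL):

```cpp
#include <cstdint>
#include <vector>

struct SbStore { uint64_t version; uint64_t age; bool lineReady; };

// Among stores whose cache line is available, pick a lowest-versioned one.
int pickStoreToComplete(const std::vector<SbStore>& sb) {
    int best = -1;
    for (int i = 0; i < (int)sb.size(); ++i)
        if (sb[i].lineReady && (best < 0 || sb[i].version < sb[best].version))
            best = i;
    return best;   // -1 if no store can complete yet
}

// The oldest entry is also eligible irrespective of its version, since no
// ordering constraint can exist on it.
bool oldestMayComplete(const std::vector<SbStore>& sb, int i) {
    for (const auto& e : sb)
        if (e.age < sb[i].age) return false;
    return sb[i].lineReady;
}

// Invalidation filter: squash a satisfied speculative load only when an
// in-flight fence (tracked in the orq) still enforces order on it, i.e.
// its version exceeds the minimum versions in the SB and LSQ.
bool mustSquash(uint64_t loadVersion, bool orqHasActiveFence,
                uint64_t vmin_sb, uint64_t vmin_lsq) {
    return orqHasActiveFence &&
           loadVersion > vmin_sb && loadVersion > vmin_lsq;
}

int main() {
    std::vector<SbStore> sb{{/*version=*/2, /*age=*/1, true},
                            {/*version=*/1, /*age=*/2, true}};
    int idx = pickStoreToComplete(sb);          // lowest version -> index 1
    bool squash = mustSquash(3, true, 1, 2);    // post-fence load: squash
    return (idx == 1 && oldestMayComplete(sb, 0) && squash) ? 0 : 1;
}
```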

F. Versioning Example

We present a simple two-core execution scenario for a sample instruction sequence containing a synchronizing load, first in the base architecture in Figure 9 (a) and then for the same instruction sequence in LOUVRE in Figure 9 (b).

In this example, instruction I1 on core 1 generates an invalidation request when it writes to address X1. In the base architecture, this invalidation request squashes the satisfied load I2 on core 0, since it is speculative, as shown in Figure 9 (a). However, since the synchronizing load I1 on core 0 has retired by this point, squashing I2 is unnecessary. In LOUVRE, we detect this scenario by comparing the version of the load that the processor would have to squash with the minimum version registers. Since the version of this load is not greater than the minimum version registers, we avoid squashing the load, as shown in Figure 9 (b). Note that both executions produce a valid order even though the values read are different: in the base architecture core 0 reads (W0: 0, W1: 1), whereas in the LOUVRE architecture core 0 reads (W0: 0, W1: 0). The only invalid combination is (W0: 1, W1: 0).
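The same litmus test can be written portably with C++11 atomics, mapping LDAR to an acquire load and the full fence to a sequentially consistent thread fence; the assertion encodes the one forbidden outcome:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x0{0}, x1{0};
int w0, w1;

void core0() {                       // I1: LDAR W0,[X0]; I2: LDR W1,[X1]
    w0 = x0.load(std::memory_order_acquire);
    w1 = x1.load(std::memory_order_relaxed);
}

void core1() {                       // I1: STR #1,[X1]; I2: full fence;
    x1.store(1, std::memory_order_relaxed);          // I3: STR #1,[X0]
    std::atomic_thread_fence(std::memory_order_seq_cst);
    x0.store(1, std::memory_order_relaxed);
}

int main() {
    std::thread t0(core0), t1(core1);
    t0.join();
    t1.join();
    // The acquire load pairs with the fence+store on core 1, so observing
    // x0 == 1 implies the earlier store to x1 is also visible: (1, 0) is
    // forbidden, while (0, 0), (0, 1), and (1, 1) are all legal outcomes.
    assert(!(w0 == 1 && w1 == 0));
}
```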

G. Satisfying RCSC

In this section we discuss how the versioning rules described in Section IV-B satisfy the ordering rules of RCSC listed in Section II-C. We satisfy RC1 and RC2 by versioning rule 1, which ensures that the assigned version is incremented when a full fence instruction is issued. We satisfy RC3 by the restriction on the retirement of a synchronizing load described in Section IV-B2. We satisfy RC4 by versioning rule 3, which assigns the synchronizing store a higher version than the po-before memory accesses. Finally, we satisfy RC5 by the squashing and retirement semantics of LOUVRE described in Section V.

VI. ANALYSIS AND EXPERIMENTS

We implement and analyze LOUVRE using 10-bit version tags. Each entry in the LSQ and the store buffer is augmented with these tags. For a 16-entry store buffer and a 64-entry LSQ, the storage cost of these tags is (16 + 64) * 10 = 800 bits, or 100 bytes.

To track the lowest version in the store buffer, we use a hierarchical comparator network [28]. For a 16-entry store buffer, this network uses 15 comparators. We also use two additional comparators to compare versions in the version registers.
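A functional model of such a comparator network is a pairwise min reduction; for 16 inputs it uses 8 + 4 + 2 + 1 = 15 two-input comparators, matching the count above (a sketch of the function computed, not the hardware itself):

```cpp
#include <cstdint>
#include <vector>

// Hierarchical min reduction mirroring the comparator tree whose result
// would be latched into vmin_sb.
uint64_t vminTree(std::vector<uint64_t> v) {
    while (v.size() > 1) {
        std::vector<uint64_t> next;
        for (size_t i = 0; i + 1 < v.size(); i += 2)
            next.push_back(v[i] < v[i + 1] ? v[i] : v[i + 1]); // one comparator
        if (v.size() % 2) next.push_back(v.back());            // odd leftover
        v.swap(next);
    }
    return v.empty() ? UINT64_MAX : v[0];   // empty buffer: no constraint
}

int main() {
    std::vector<uint64_t> versions{3, 1, 4, 1, 5, 9, 2, 6};
    return vminTree(versions) == 1 ? 0 : 1;
}
```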

[Figure 9 litmus test. Initially [X0] = 0, [X1] = 0.
Core #0: I1: LDAR W0, [X0]; I2: LDR W1, [X1].
Core #1: I1: STR #1, [X1]; I2: FULL FENCE; I3: STR #1, [X0].
The only invalid outcome is (W0: 1, W1: 0). In (a), the base architecture, the X1 invalidation squashes the satisfied speculative load I2 on core 0, which re-executes and reads W1: 1. In (b), LOUVRE, the squash is avoided and core 0 reads W1: 0.]

Fig. 9. A synchronizing load execution in (a) the base architecture and (b) LOUVRE.

In our experiments we assume a release consistency architecture loosely modeled on an in-production ARM64 system, Apple's Cyclone processor [29]. ARM64 follows the RCsc [20], [30] consistency model, in which the ordering operations are sequentially consistent. We consider a 192-entry reorder buffer interfaced to the data cache through the load-store queue, and we assume an out-of-order execution core which can speculatively execute loads and stores. Other micro-architectural details of the simulation are listed in Table IV. We assume an invalidation-based coherence protocol and use the MESI protocol in our experiments. So as not to complicate our simulation, we assume that all the caches participate in the same coherence domain, i.e., any update in one cache is propagated to all processors. Note that this assumption is stricter than the ARM memory model, which allows non-multi-copy-atomic behavior; we assume that stores are multi-copy atomic in our model. We evaluate our implementation on architectures using an unordered 16-entry store buffer.

We model our baseline architecture after the O3 CPU model of the gem5 simulator [31]. In this model, speculatively executed loads are squashed immediately upon invalidation, and all the ordering instructions disable speculative execution of post-fence loads and stores. We evaluate our implementations in a cycle-accurate simulator built on the Qsim [32] framework, which includes SST [33] and MacSim [34]. We ran parallel benchmarks from two suites, PARSEC [24] and SPLASH-2 [35].

TABLE IV
SIMULATED ARCHITECTURE PARAMETERS

  Processor      8-core, 2 GHz, 192-entry ROB
  Fetch          6 instructions, 2 loads or stores per cycle
  Issue          out-of-order schedule
  Store buffer   16- and 32-entry unordered buffer
  L1 cache       64 KB I-cache and D-cache per core, 2-cycle access latency
  L2 cache       512 KB shared L2 cache (4 cores), 10-cycle access latency
  L3 cache (LLC) 8 MB shared L3 cache, 25-cycle access latency
  Coherence      MESI protocol
  Memory         4 GB, 100-cycle access latency

VII. RESULTS

In this section we present our evaluation of the proposed versioning mechanism. We compare our results against the baseline with an equal store buffer capacity. The characteristics we focus on to show the efficiency of the proposed technique are as follows:
1) Performance evaluation (IPC).
2) Reduction in stalls caused by an ordering instruction.
3) Reduction in store latency.
4) The average residency of a fence in the ROB.

A. Performance Evaluation

The main performance benefit of the proposed mechanism is being able to retire a fence instruction without waiting for the store buffer to drain. Speculatively executing loads and stores past the ordering instruction without incurring any delay provides additional benefits. We keep track of the ordering constraints using versions and the version registers, which enables us to retire fences earlier than a naive implementation does; this early retirement reduces stalls in the pipeline. Also note that since this technique is implemented primarily within the core, performance should increase with an increasing number of cores, owing to the growing contention for shared lines.

Figure 10 shows the performance of our versioning scheme compared to a recently proposed fence implementation, zFence [22]. In the zFence mechanism, stores with LLC misses request exclusive permission, which, when granted, allows execution past the fence. This mechanism, like ours, tries to reduce the fence delay caused by draining the store buffer. However, there is still a significant delay in acknowledging this permission request by the directory controller, since permission must be obtained for all the stores residing in the store buffer before execution can proceed. In LOUVRE, no such permission request needs to be issued. On average we improve performance by 11%, with streamcluster gaining the most at 18% IPC over the baseline. Compared to zFence, we improve performance by 4% on average. Although we did not perform any power consumption studies, we believe that the performance gains will translate into energy savings.

B. Scheduling Stalls

As we explained in Section III-A, an ordering instruction introduces bubbles in the pipeline, causing a stall.
One of the major components of this stall is the wait for the store buffer to drain. Using a versioned store buffer eliminates this stall. The reduction in these stalls compared to a conventional store buffer is shown in Figure 11; we see an average reduction of 19%.

Fig. 10. Performance of LOUVRE compared to zFence for a 16-entry store buffer.


Fig. 11. Reduction in scheduling stalls in LOUVRE.

Fig. 13. Average reduction in stalled cycles for post-fence memory accesses when using a 16-entry versioned store buffer.

C. Store latency

Scheduling stores without any fence-induced delay allows us to issue their memory requests earlier. Also, since stores with the same version can complete out of order from the store buffer, the store latency is reduced from 63 cycles to 46 cycles on average, a 26% reduction, as shown in Figure 12. It should be noted that savings in store latency do not directly translate into increased performance, because other architectural optimizations that hide memory access latency, such as prefetching and speculative execution, are present even in the baseline architecture.

Fig. 12. Reduced store latency caused by early retirement.

D. Early Execution

Whenever a full fence or an acquire fence is active, we do not speculatively execute post-fence memory accesses, as this can violate the memory consistency guarantees. Using a versioned store buffer allows us to relax this constraint, thereby enabling us to speculatively execute post-fence memory accesses. The reduction in stalled cycles in comparison to a naive store buffer implementation is shown in Fig. 13. On average we find that a post-fence store in a versioned store buffer implementation starts executing 8 cycles earlier (contributing a 9% reduction in latency) when compared to an implementation with a naive store buffer.

E. Fence Residency in ROB

An instruction is allocated an entry in the ROB once it is scheduled. This entry remains until the instruction is retired from the ROB. A fence instruction retires once it enforces all its consistency guarantees. A naive store buffer needs to be drained completely before a fence can be retired and removed from the ROB. A versioned store buffer, on the other hand, allows us to retire the fence from the ROB without any delay (as explained in Section V-C), as the versions assigned to the memory accesses are responsible for ensuring ordering. This reduces the lifetime of a fence in the ROB, that is, the number of cycles a fence resides in the ROB. This fence residency goes down from an average of 142 cycles in the base implementation to 82 cycles in the versioned implementation, a reduction of 39.6%, as shown in Fig. 14.

Fig. 14. Reduced residency of fences in LOUVRE.

F. Congested Store Buffer

One side effect of delaying the scheduling of stores past an ordering instruction is not being able to issue memory requests for them early. Prefetching can mitigate this issue, but LOUVRE provides the early issue as a side effect. When memory requests are not issued early, there may be multiple post-fence stores with pending memory requests. When these reach the head of the ROB, they are retired and placed in the store buffer. This causes congestion and leads to further stalls. Figure 15 shows the reduction in such stalls when we employ versioning. In all but two benchmarks, we see a decrease in the number of stalls caused by a congested write buffer. In the two benchmarks ocean and radiosity, using a 32-entry store buffer increases the number of stalls. This is because these benchmarks are store intensive, and we issue many memory requests, causing greater cache pressure.

Fig. 15. Stalls caused by the store buffer being full.

VIII. DISCUSSION

A. Multi-copy Atomicity

Architectures like ARM and POWER are not multi-copy atomic [23], [36]. This property allows a store to become visible to some cores before others. One of the reasons for this behavior is the sharing of a cache by some cores: an update from one core can become visible to the cores sharing a cache with it before the update propagates to memory. Intuitively, the coherence protocol should ensure that a reader sees the latest update in the entire cache hierarchy; however, if we restrict the coherence domain to the caches of a core cluster, this behavior arises. Another reason for this behavior is multiple hardware threads sharing the same core using simultaneous multi-threading (SMT), as seen in POWER processors. In such processors, the hardware threads share the store buffer, from which a thread can read the value of an earlier write from another hardware thread before other cores can. However, in our modeling, to keep our simulation simple, we assume that all the caches participate in the same coherence domain and that there are no shared store buffers. We plan to explore non-multi-copy-atomic behavior in our future work.

B. Branch Prediction

Since we introduce two version registers, vr and lfvr, that are updated on instruction issue, we need to consider the scenario when a branch mis-prediction occurs. We need to make sure that the version registers have correct values when the processor starts fetching instructions from the correct path after recovering from a mis-prediction. We ensure this by check-pointing the version registers along with the processor state on speculation. On a mis-prediction, we flush the pipeline and restore the version registers from the check-pointed state.

C. Load-Store Queue Design

We discuss two possible implementation choices for the LSQ: the snoopy and the insulated queue [37]. In a snoopy queue implementation, when a cache line is invalidated, the LSQ is probed to squash all speculative loads that were satisfied from the invalidated cache line, to prevent ordering violations. In an insulated queue implementation, the violation is prevented at the time of retiring a load: during retirement, the LSQ conservatively squashes all later loads in program order satisfied ahead of the current load that could potentially violate consistency. Whereas processors with a stricter consistency model usually employ a snoopy queue design, those with a weaker consistency model use an insulated queue design. After all, in a strict consistency model, ordering must be enforced on a large number of in-flight loads and stores, whereas ordering violations in a weak consistency processor are infrequent, so an insulated load queue is usually utilized at the cost of conservative performance. However, since a fence instruction enforces consistency, an insulated queue implementation can be used to avoid unnecessary squashes. In such a case, at the time of load retirement, we check for an active fence instruction, and if such a fence is found, we identify all the satisfied loads that are later in program order and squash them.

D. Write Combining

A write-combining buffer merges multiple pending writes to the same address to reduce the number of memory transactions. In LOUVRE, however, we ensure that the buffer merges only stores with the same version. If it combined stores with different versions, an ordering violation could occur.
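As a sketch, this version check reduces to a single guard in the combining logic (hypothetical structure):

```cpp
#include <cstdint>

struct PendingWrite { uint64_t addr; uint64_t version; };

// Version-aware write combining: merging is allowed only when both pending
// writes target the same address and carry the same version; combining
// across versions could reorder a store past a fence.
bool mayCombine(const PendingWrite& a, const PendingWrite& b) {
    return a.addr == b.addr && a.version == b.version;
}

int main() {
    PendingWrite w1{0x40, 2}, w2{0x40, 2}, w3{0x40, 3};
    return (mayCombine(w1, w2) && !mayCombine(w1, w3)) ? 0 : 1;
}
```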
IX. RELATED WORK

Considerable previous work has focused on reducing the cost of fences in both software and hardware. The main impetus for research in this area is that a considerable fraction of fences are determined to be unnecessary at run-time, or that they ensure ordering on memory accesses for which ordering is unnecessary, for example, thread-local memory accesses. Various hardware and software techniques have been proposed to remove the fence overhead. Software techniques focus mainly on compiler techniques which automatically infer fences [38], identify and remove redundant fences [39], or reduce the strength of a fence, i.e., convert a stronger fence to a weaker one. Hardware techniques focus on utilizing the dynamic information available at the time of execution of the fence to reduce or elide the overhead of ensuring ordering. Ordering needs to be ensured only for shared memory locations, since sequentially consistent (SC) semantics are guaranteed for local accesses, i.e., a processor always has a sequentially consistent view of its own memory accesses. Taking the help of the compiler to determine such memory accesses and using an unordered store buffer for the identified accesses reduces the overhead of fence instructions [40].

Lin et al. proposed address-aware fences [41], in which the shared memory addresses on which ordering is to be ensured are tracked in a watch list. Ordering is ensured only for these memory locations; all other memory accesses are unordered. The authors extend the idea to order memory accesses only within the scope of the fence [41]. Singh et al. [40] proposed an approach to design sequentially consistent hardware in which they classify memory accesses as either local or shared and maintain order only for the shared accesses. They do this by using an unordered store buffer for local memory accesses in addition to a regular FIFO store buffer for shared memory accesses. The local memory accesses are processed through the unordered store buffer, avoiding the ordering overhead on them.

X. CONCLUSION

A processor implementing a weak memory model requires a higher number of fence instructions to ensure memory consistency for parallel programs than a processor implementing a stronger memory model. However, the weaker semantics provide new opportunities for extracting performance by allowing more leeway to reorder memory operations. Uni-directional fences let us restrict only certain reorderings while still ensuring release consistency. In this paper, we proposed a lightweight mechanism which assigns versions to in-flight memory accesses. We detailed how fences can be retired without causing significant delays by utilizing these versions. With minor modifications to the micro-architecture, in which we add version-holding structures, we significantly reduce the fence overhead. We also studied the various architectural optimizations, in terms of memory re-orderings and speculative execution of memory accesses, that become possible when uni-directional fence instructions are employed, and presented a design for such an optimal implementation. We found that implementing LOUVRE reduces the pipeline stalls caused by fences and reduces store latency, thereby increasing performance by 11% on average over the baseline and 4% over zFence.

REFERENCES

[1] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou, "Cilk: An efficient multithreaded runtime system," in Proc. 5th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming (PPoPP '95), 1995.
[2] A. J. Dorta, C. Rodríguez, F. de Sande, and A. González-Escribano, "The OpenMP source code repository," in Proc. Euromicro Conf. on Parallel, Distributed, and Network-Based Processing, 2005.
[3] S. V. Adve and M. D. Hill, "Weak ordering — a new definition," in Proc. 17th Annual Int. Symp. on Computer Architecture (ISCA '90), 1990, pp. 2–14.
[4] J. Ševčík, "Safe optimisations for shared-memory concurrent programs," in Proc. 32nd ACM SIGPLAN Conf. on Programming Language Design and Implementation (PLDI '11), 2011, pp. 306–316.
[5] E. Eide and J. Regehr, "Volatiles are miscompiled, and what to do about it," in Proc. 8th ACM Int. Conf. on Embedded Software (EMSOFT '08), 2008, pp. 255–264.
[6] O. Trachsel, C. von Praun, and T. R. Gross, "On the effectiveness of speculative and selective memory fences," in Proc. 20th Int. Conf. on Parallel and Distributed Processing (IPDPS '06), 2006.
[7] Y. Chou, B. Fahs, and S. Abraham, "Microarchitecture optimizations for exploiting memory-level parallelism," SIGARCH Comput. Archit. News, vol. 32, no. 2, 2004.
[8] A. R. Lebeck, J. Koppanalil, T. Li, J. Patwardhan, and E. Rotenberg, "A large, fast instruction window for tolerating cache misses," in ISCA-29, 2002, pp. 59–70.
[9] C.-K. Luk, "Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors," in ISCA, 2001, pp. 40–51.
[10] J. F. Martínez and J. Torrellas, "Speculative synchronization: Applying thread-level speculation to explicitly parallel applications," SIGOPS Oper. Syst. Rev., vol. 36, no. 5, pp. 18–29, 2002.
[11] P. Ranganathan, V. S. Pai, and S. V. Adve, "Using speculative retirement and larger instruction windows to narrow the performance gap between memory consistency models," in Proc. 9th ACM Symp. on Parallel Algorithms and Architectures (SPAA '97), 1997, pp. 199–210.
[12] C. Gniady, B. Falsafi, and T. N. Vijaykumar, "Is SC + ILP = RC?" SIGARCH Comput. Archit. News, vol. 27, no. 2, pp. 162–171, 1999.
[13] Y. Duan, A. Muzahid, and J. Torrellas, "WeeFence: Toward making fences free in TSO," SIGARCH Comput. Archit. News, vol. 41, no. 3, pp. 213–224, 2013.
[14] Y. Duan, X. Feng, L. Wang, C. Zhang, and P.-C. Yew, "Detecting and eliminating potential violations of sequential consistency for concurrent C/C++ programs," in Proc. 7th IEEE/ACM Int. Symp. on Code Generation and Optimization (CGO '09), 2009, pp. 25–34.
[15] L. Ceze, J. Tuck, P. Montesinos, and J. Torrellas, "BulkSC: Bulk enforcement of sequential consistency," SIGARCH Comput. Archit. News, vol. 35, no. 2, pp. 278–289, 2007.
[16] J. H. Lee, J. Sim, and H. Kim, "BSSync: Processing near memory for machine learning workloads with bounded staleness consistency models," in Proc. Int. Conf. on Parallel Architecture and Compilation (PACT '15), 2015, pp. 241–252.
[17] K. Vora, S. C. Koduru, and R. Gupta, "ASPIRE: Exploiting asynchronous parallelism in iterative algorithms using a relaxed consistency based DSM," SIGPLAN Not., vol. 49, no. 10, pp. 861–878, 2014.
[18] C. Lin, V. Nagarajan, and R. Gupta, "Address-aware fences," in Proc. 27th Int. Conf. on Supercomputing (ICS '13), 2013, pp. 313–324.
[19] Y. Duan, N. Honarmand, and J. Torrellas, "Asymmetric memory fences: Optimizing both performance and implementability," in Proc. 20th Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS '15), 2015, pp. 531–543.
[20] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy, "Memory consistency and event ordering in scalable shared-memory multiprocessors," in Proc. 17th Annual Int. Symp. on Computer Architecture (ISCA '90), 1990, pp. 15–26.
[21] A. Waterman, Y. Lee, D. A. Patterson, and K. Asanović, "The RISC-V instruction set manual, volume I: User-level ISA, version 2.1," EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2016-118, May 2016.
[22] S. Aga, A. Singh, and S. Narayanasamy, "zFence: Data-less coherence for efficient fences," in Proc. 29th ACM Int. Conf. on Supercomputing (ICS '15), 2015, pp. 295–305.
[23] S. Flur, K. E. Gray, C. Pulte, S. Sarkar, A. Sezgin, L. Maranget, W. Deacon, and P. Sewell, "Modelling the ARMv8 architecture, operationally: Concurrency and ISA," in Proc. 43rd ACM SIGPLAN-SIGACT Symp. on Principles of Programming Languages (POPL '16), 2016, pp. 608–621.
[24] C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The PARSEC benchmark suite: Characterization and architectural implications," Princeton University, Tech. Rep. TR-811-08, 2008.
[25] Y. Chou, L. Spracklen, and S. G. Abraham, "Store memory-level parallelism optimizations for commercial applications," in Proc. 38th Annual IEEE/ACM Int. Symp. on Microarchitecture (MICRO 38), 2005, pp. 183–196.
[26] T. F. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos, "Mechanisms for store-wait-free multiprocessors," in Proc. 34th Annual Int. Symp. on Computer Architecture (ISCA '07), 2007, pp. 266–277.
[27] R. Bhargava and L. K. John, "Issues in the design of store buffers in dynamically scheduled processors," in Proc. IEEE Int. Symp. on Performance Analysis of Systems and Software (ISPASS 2000), 2000, pp. 76–87.
[28] D. Koch and J. Torresen, "FPGASort: A high performance sorting architecture exploiting run-time reconfiguration on FPGAs for large problem sorting," in Proc. 19th ACM/SIGDA Int. Symp. on Field Programmable Gate Arrays (FPGA '11), 2011, pp. 45–54.
[29] A. L. Shimpi, "Apple's Cyclone microarchitecture detailed," http://www.anandtech.com/show/7910/apples-cyclone-microarchitecture-detailed, 2014.
[30] R. Grisenthwaite, "ARMv8 technology preview," 2011.
[31] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, "The gem5 simulator," SIGARCH Comput. Archit. News, vol. 39, no. 2, pp. 1–7, 2011.
[32] C. D. Kersey, A. Rodrigues, and S. Yalamanchili, "A universal parallel front-end for execution driven microarchitecture simulation," in Proc. Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools (RAPIDO '12), 2012, pp. 25–32.
[33] Sandia National Laboratories, "SST," http://sst.sandia.gov, 2015.
[34] H. Kim, J. Lee, N. B. Lakshminarayana, J. Sim, J. Lim, and T. Pho, "MacSim: A CPU-GPU heterogeneous simulation framework user guide."
[35] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, "The SPLASH-2 programs: Characterization and methodological considerations," in Proc. 22nd Annual Int. Symp. on Computer Architecture (ISCA '95), 1995, pp. 24–36.
[36] ARM, "ARM architecture reference manual for ARMv8-A architecture profile," https://developer.arm.com/docs/ddi0487/a/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-profile.
[37] H. W. Cain and M. H. Lipasti, "Memory ordering: A value-based approach," in Proc. 31st Annual Int. Symp. on Computer Architecture (ISCA '04), 2004, pp. 90–101.
[38] M. Kuperstein, M. Vechev, and E. Yahav, "Automatic inference of memory fences," in Proc. Conf. on Formal Methods in Computer-Aided Design (FMCAD '10), 2010, pp. 111–120.
[39] V. Vafeiadis and F. Z. Nardelli, "Verifying fence elimination optimisations," in Proc. 18th Int. Conf. on Static Analysis (SAS '11), 2011, pp. 146–162.
[40] A. Singh, S. Narayanasamy, D. Marino, T. Millstein, and M. Musuvathi, "End-to-end sequential consistency," in Proc. 39th Annual Int. Symp. on Computer Architecture (ISCA '12), 2012, pp. 524–535.
[41] C. Lin, V. Nagarajan, and R. Gupta, "Fence scoping," in Proc. Int. Conf. for High Performance Computing, Networking, Storage and Analysis (SC '14), 2014, pp. 105–116.