
Dynamic Verification of Memory Consistency in Cache-Coherent Multithreaded Computer Architectures

Albert Meixner, Student Member, IEEE, and Daniel J. Sorin, Senior Member, IEEE

Abstract—Multithreaded servers with cache-coherent shared memory are the dominant type of machines used to run critical network services and database management systems. To achieve the high availability required for these tasks, it is necessary to incorporate mechanisms for error detection and recovery. Correct operation of the memory system is defined by the memory consistency model. Errors can therefore be detected by checking whether the observed memory system behavior deviates from the specified consistency model. Based on recent work, we design a framework for the dynamic verification of memory consistency (DVMC). The framework consists of mechanisms to dynamically verify three invariants that are proven to guarantee that a specified memory consistency model is obeyed. We describe an implementation of the framework for the SPARCv9 architecture, and we experimentally evaluate its performance by using full-system simulation of commercial workloads.

Index Terms—Reliability, fault tolerance, multiprocessors, multithreaded processors.

1 INTRODUCTION

COMPUTER system availability is crucial for the multithreaded (including multiprocessor) systems that run critical infrastructure. Unless architectural steps are taken, availability will decrease over time, as implementations use larger numbers of increasingly unreliable components in search of higher performance. Both industry and academia predict that transient and permanent faults will lead to increasing hardware error rates [13], [30], [33]. Backward error recovery is a cost-effective mechanism [32], [25] to tolerate runtime hardware errors, but it can only recover from errors that are detected in a timely fashion. Traditionally, most systems employ localized error detection mechanisms, such as parity bits on cache lines and memory buses, to detect errors. Although such specialized mechanisms detect the errors that they target, they do not comprehensively detect whether the end-to-end [29] behavior of the system is correct.

Our goal is end-to-end error detection for multithreaded memory systems, which would subsume localized mechanisms and provide comprehensive error detection. Memory systems are complicated concurrent systems that include caches, memories, coherence controllers, the interconnection network, and all of the other glue that enables multiple processor cores or hardware thread contexts to communicate. As more of the memory system becomes integrated on chip, including cache and memory controllers and logic for glueless multichip multiprocessing, this logic becomes just as susceptible to hardware errors as the core logic. In this paper, we focus on the runtime detection of transient and permanent hardware errors in the memory system. We do not consider the orthogonal problem of detecting errors unrelated to the memory system in processor cores, because good solutions to this problem already exist (for example, redundant multithreading [28], [27], [20] and DIVA [3]).

Our previous work [19] achieved end-to-end error detection for a very restricted class of multithreaded memory systems. In that work, we designed an all-hardware scheme for the dynamic verification^1 (also referred to as online error detection or online testing) of sequential consistency (DVSC), which is the most restrictive consistency model. Since the end-to-end correctness of a multithreaded memory system is defined by its memory consistency model, DVSC comprehensively detects errors in systems that implement sequential consistency (SC). However, DVSC's applications are limited, because SC is not frequently implemented.

In this paper, we contribute a general framework for designing dynamic verification hardware for a wide range of memory consistency models, including all those commercially implemented. Relaxed consistency models, discussed in Section 2, enable hardware and software optimizations that reorder memory operations to improve performance. Our framework for the dynamic verification of memory consistency (DVMC), described in Section 3, combines the dynamic verification of three invariants (also known as assertion checking) to check for memory consistency. In Section 4, we describe a checker design for each invariant and give a SPARCv9-based implementation of DVMC. Section 5 introduces the experimental methodology used to evaluate DVMC. We present and analyze our results in Section 6. Section 7 compares our work with prior work on dynamic verification. In Appendix A, we formally prove that the mechanisms in Section 4 verify the three invariants introduced in Section 3 and that these invariants guarantee memory consistency.

A. Meixner is with the Department of Computer Science, Duke University, Durham, NC 27708. E-mail: [email protected].
D.J. Sorin is with the Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708. E-mail: [email protected].

1. Dynamic verification checks invariants at runtime and can thus detect runtime physical errors. It differs from static verification, which checks that an implementation satisfies its design specification. Static verification can detect design bugs but, by definition, cannot detect runtime physical errors. The two approaches are complementary.


TABLE 1. Ordering Table for Processor Consistency (PC)

Fig. 1. Operation orderings in the system.

2 BACKGROUND

This work addresses the dynamic verification of shared-memory multithreaded machines, including simultaneously multithreaded processors [34], chip multiprocessors, and traditional multiprocessor systems. For brevity, we will use the term processor to refer to a physical processor or a thread context on a multithreaded processor. We now describe the program execution model and consistency models.

2.1 Program Execution Model

A simple model of program execution is that a single thread of instructions is sequentially executed in program order. Modern microprocessors maintain the illusion of sequential execution, although they actually execute instructions in parallel and out of program order. To capture this behavior and the added complexity of multithreaded execution, we must be precise when referring to the different steps necessary to process a memory operation (an instruction that reads or writes the memory). A memory operation executes when its results (for example, the load value in a destination register) become visible to instructions executed on the same processor. A memory operation commits when its state changes are finalized and can no longer be undone. In the instant at which its state changes become visible to other processors, a memory operation performs. A more formal definition of performing a memory operation can be found in [10].

2.2 Memory Consistency Models

An architecture's memory consistency model [1] specifies the interface between the shared memory system and the software. It specifies the allowable software-visible interleavings of the memory operations (loads, stores, and synchronization operations) that are performed by the multiple threads. For example, SC specifies that there exists a total order of memory operations that maintains the program order of each thread [15].

We specify a consistency model as an ordering table, similar to [12]. Columns and rows are labeled with the memory operation types supported by the system, such as load, store, and synchronization operations (for example, memory barriers). When a table entry contains the value true, the operation type OPx in the entry's row label has a performance ordering constraint with respect to the operation type OPy in the entry's column label. If an ordering constraint exists between two operation types OPx and OPy, then all operations of type OPx that appear before any operation Y of type OPy in program order must also perform before Y. Table 1 shows an ordering table for processor consistency (PC). In PC, an ordering requirement exists between a load and all stores that follow it in program order. That is, any load X that appears before any store Y in program order also has to perform before Y. However, no ordering requirement exists between a store and subsequent loads. Thus, even if store Y appears before load X in program order, X can still perform before Y.

A truth table is not sufficient to express all conceivable memory consistency models, but a truth table can be constructed for all commercially implemented consistency models.
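To make the ordering-table abstraction concrete, the following C++ sketch encodes the PC constraints of Table 1 as a boolean matrix and answers the question "must an operation of type OPx perform before a later operation of type OPy?". The encoding and all names are our illustrative assumptions; only the load-before-store and store-before-load entries are stated explicitly in the text, and the remaining entries follow the usual definition of PC.

#include <array>
#include <cstdio>

// Operation types supported by this toy model.
enum OpType { LOAD = 0, STORE = 1, NUM_OP_TYPES = 2 };

// table[x][y] == true: every operation of type x that precedes an
// operation Y of type y in program order must also perform before Y.
using OrderingTable = std::array<std::array<bool, NUM_OP_TYPES>, NUM_OP_TYPES>;

// Processor consistency (PC), per Table 1: a load is ordered before later
// stores, but a store is not ordered before later loads.
constexpr OrderingTable kPC = {{
    {{true,  true}},   // row LOAD:  load->load, load->store
    {{false, true}},   // row STORE: store->load, store->store
}};

int main() {
    std::printf("PC: load before later store required? %d\n", kPC[LOAD][STORE]);
    std::printf("PC: store before later load required? %d\n", kPC[STORE][LOAD]);
    return 0;
}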
3 DYNAMIC VERIFICATION FRAMEWORK

Based on the definitions in Section 2, we devise a framework that breaks the verification process into three invariants that correspond to the three steps necessary for processing a memory operation (shown in Fig. 1). First, memory operations are read from the instruction stream in program order <p and executed by the processor. At this point, operations impact the microarchitectural state but not the committed architectural state. Second, operations access the (highest level) cache in a possibly different order, which we call cache order. Third, operations perform and thus become visible to other processors, possibly in yet another order.


TABLE 2. Ordering Table for TSO
TABLE 3. Ordering Table for PSO

Stbar provides SS ordering and is equivalent to Membar #SS.

Each of the three steps described above introduces different error hazards, which can be dealt with efficiently at the time that an operation takes the respective step. The basic idea of the presented framework is to dynamically verify an invariant for every step to guarantee that it is done correctly and, thus, to verify that the processing of the operation as a whole is error free. The three invariants (Uniprocessor Ordering, Allowable Reordering, and Cache Coherence) described in the following are sufficient to guarantee memory consistency defined as follows, which we derive from [10] (we formally prove that the three invariants ensure memory consistency in Appendix A.2):

Definition 1. An execution is consistent with respect to a consistency model with a given ordering table if there exists a global order <m such that the following hold:

1. For operations X and Y of type OPx and OPy, it is true that if the ordering table specifies an ordering constraint between OPx and OPy and X <p Y, then X <m Y.

2. Every load receives the value of the most recent store to the same word that precedes it in <m.

A system that dynamically verifies all three invariants in the DVMC framework obeys the consistency model specified in the ordering table, regardless of the mechanisms used to verify each invariant. Our approach is conservative in that these conditions are sufficient but not necessary for memory consistency. General consistency verification without the possibility of false positives is NP-hard [11] and is therefore not feasible at runtime. DVMC's goal is to detect transient errors, from which we can recover with backward error recovery. DVMC can also detect errors due to design bugs and permanent faults, but for these errors, forward progress cannot be guaranteed. Errors in the checker hardware added by DVMC can lead to performance penalties due to unnecessary recoveries after false positives but do not compromise correctness.
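The two conditions of Definition 1 are mechanical to check once a candidate global order is given; finding such an order is the NP-hard part [11]. The following C++ sketch is entirely ours and purely illustrative: it takes an execution already arranged in a candidate <m (the vector's index order) and verifies both conditions, assuming that loads which precede any store to their address are exempt from the value check (initial memory values are not modeled).

#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

enum OpType { LOAD, STORE };

struct Op {
    int      proc;    // processor ID
    uint64_t seq;     // rank in that processor's program order <p
    OpType   type;
    uint64_t addr;
    uint64_t value;   // value loaded or stored
};

using OrderingTable = bool[2][2];  // [x][y]: OPx ordered before later OPy?

bool consistentPerDefinition1(const std::vector<Op>& m, const OrderingTable& tab) {
    // Condition 1: for constrained type pairs on the same processor,
    // program order must imply <m order.
    std::unordered_map<int, std::vector<size_t>> perProc;  // proc -> indices into m
    for (size_t i = 0; i < m.size(); ++i) perProc[m[i].proc].push_back(i);
    for (const auto& kv : perProc) {
        const std::vector<size_t>& idx = kv.second;        // ascending in <m
        for (size_t a = 0; a < idx.size(); ++a)
            for (size_t b = 0; b < idx.size(); ++b) {
                const Op& X = m[idx[a]];
                const Op& Y = m[idx[b]];
                if (X.seq < Y.seq && tab[X.type][Y.type] && idx[a] > idx[b])
                    return false;  // X <p Y with a constraint, but Y <m X
            }
    }
    // Condition 2: every load returns the latest store to its address in <m.
    std::unordered_map<uint64_t, uint64_t> lastStore;      // addr -> value
    for (const Op& op : m) {
        if (op.type == STORE) {
            lastStore[op.addr] = op.value;
        } else if (auto it = lastStore.find(op.addr);
                   it != lastStore.end() && it->second != op.value) {
            return false;
        }
    }
    return true;
}

int main() {
    OrderingTable pc = {{true, true}, {false, true}};      // PC, as in Table 1
    std::vector<Op> m = {
        {0, 1, STORE, 0xA0, 42},   // processor 0 stores 42 to 0xA0
        {1, 1, LOAD,  0xA0, 42},   // processor 1 loads it
    };
    std::printf("consistent: %d\n", consistentPerDefinition1(m, pc));
    return 0;
}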


TABLE 4. Ordering Table for RMO
TABLE 5. Implemented Optimizations

#LL: LL ordering. #LS: LS ordering. #SL: SL ordering. #SS: SS ordering.

4 SPARCV9 IMPLEMENTATION

We started with a baseline system that supports only SC but obtains high performance through load-order speculation and prefetching for both loads and stores. We then implemented the optimizations described in Table 5 to take advantage of the relaxed consistency models. The remainder of this section describes the three verification mechanisms that were added to the system, as shown in Fig. 2.

4.1 Uniprocessor Ordering Checker

Uniprocessor Ordering is trivially satisfied when all operations execute sequentially in program order. Thus, Uniprocessor Ordering can be dynamically verified by comparing all load results obtained during the original out-of-order execution to the load results obtained during a subsequent sequential execution of the same program [9], [6], [3]. Because instructions commit in program order, the results of sequential execution can be obtained by replaying all memory operations when they commit. Replay of memory accesses occurs during the verification stage, which we add to the pipeline before the retirement stage. During replay, stores are still speculative and thus must not modify the architectural state. Instead, they write to a dedicated verification cache (VC). Replayed loads first access the VC and, on a miss, access the highest level of the memory hierarchy (bypassing the write buffer). The load value from the original execution resides in a separate structure but could also reside in the ROB. In case of a mismatch between the replayed load value and the original load value, a Uniprocessor Ordering violation is signaled. Such a violation can be resolved by a simple pipeline flush, because all operations are still speculative prior to verification. Multiple operations can be replayed in parallel, independent of register dependencies, as long as they do not access the same address.

In consistency models that require loads to be ordered (that is, loads appear to have executed only after all older loads performed), the system speculatively reorders loads and detects load-order misspeculations by tracking writes to speculatively loaded addresses. This mechanism allows stores from other processors to change any load value until the load passes the verification stage, and thus loads are considered to perform only after passing verification. To prevent stalls in the verification stage, the VC must be big enough to hold all stores that have been verified but not yet performed.

In a model that allows loads to be reordered, such as RMO, no speculation occurs, and the value of a load cannot be affected by any store after it passes the execution stage. Therefore, a load is considered to perform after the execution stage in these models, and replay strictly serves the purpose of verifying Uniprocessor Ordering. Since load ordering does not have to be enforced, load values can reside in the VC after execution and be used during replay, as long as they are correctly updated by local stores. This optimization, which has been used in the dynamic verification of single-threaded execution [8], prevents cache misses during verification and reduces the pressure on the L1 cache.
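A behavioral model of the replay check may help fix ideas. The sketch below is our simplification, not the authors' hardware: the VC and the highest-level cache are modeled as maps from addresses to values, replay happens strictly in program order, and the RMO-style load-value optimization is ignored.

#include <cstdint>
#include <cstdio>
#include <unordered_map>

struct MemOp {
    bool     isStore;
    uint64_t addr;
    uint64_t value;   // store data, or the load value from out-of-order execution
};

class UniprocOrderingChecker {
    std::unordered_map<uint64_t, uint64_t> vc_;          // verification cache (VC)
    const std::unordered_map<uint64_t, uint64_t>& l1_;   // stands in for the
                                                         // highest-level cache
  public:
    explicit UniprocOrderingChecker(const std::unordered_map<uint64_t, uint64_t>& l1)
        : l1_(l1) {}

    // Replay one committing operation in program order. Returns false on a
    // Uniprocessor Ordering violation (original and replayed load values
    // differ), which the pipeline would resolve with a flush.
    bool replay(const MemOp& op) {
        if (op.isStore) {                 // stores are still speculative:
            vc_[op.addr] = op.value;      // they write only the VC
            return true;
        }
        auto hit = vc_.find(op.addr);     // replayed loads check the VC first
        uint64_t replayed = (hit != vc_.end()) ? hit->second
                          : l1_.count(op.addr) ? l1_.at(op.addr) : 0;
        return replayed == op.value;      // mismatch => signal violation
    }
};

int main() {
    std::unordered_map<uint64_t, uint64_t> l1{{0x40, 7}};
    UniprocOrderingChecker chk(l1);
    MemOp st{true, 0x40, 9}, ld{false, 0x40, 9};
    std::printf("store ok: %d, load ok: %d\n", chk.replay(st), chk.replay(ld));
    return 0;
}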

Fig. 2. Simplified pipeline for DVMC. A single node is shown. Several structures (memory, caches, etc.) were omitted for clarity.


Fig. 3. Legal and illegal overlap of epochs. All epochs are for the same block.

4.2 Allowable Reordering Checker

DVMC verifies Allowable Reordering by checking all reorderings between program order and the cache access order (described in Section 3) against the restrictions defined by the ordering table. The position in program order is obtained by labeling every instruction X with a sequence number seqX that is stored in the ROB during decode. Since operations are decoded in program order, seqX is equal to X's rank in program order. The rank in the perform order is implicitly known, because we verify Allowable Reordering when an operation performs. The Allowable Reordering checker uses the sequence numbers to find reorderings and to check them against the ordering table. For this purpose, the checker maintains a counter register for every operation type OPx (for example, load or store) in the ordering table. This counter max{OPx} contains the greatest sequence number of an operation of type OPx that has already performed. When an operation X of type OPx performs, the checker verifies that seqX > max{OPy} for all operation types OPy that have an ordering relation OPx < OPy in the ordering table; otherwise, an Allowable Reordering violation is signaled. To bound the detection latency, memory barriers are injected periodically, which forces all outstanding operations to perform in the correct order.
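In pseudo-hardware terms, the checker reduces to one max-counter per operation type and a comparison on every perform. The following C++ rendering of that rule is ours; the type encodings and the PC table are placeholders.

#include <array>
#include <cstdint>
#include <cstdio>

constexpr int kLoad = 0, kStore = 1, kNumOpTypes = 2;
using OrderingTable = std::array<std::array<bool, kNumOpTypes>, kNumOpTypes>;

class AllowableReorderingChecker {
    OrderingTable table_;                      // table_[x][y]: constraint OPx < OPy
    std::array<uint64_t, kNumOpTypes> max_{};  // greatest performed seq per type
  public:
    explicit AllowableReorderingChecker(const OrderingTable& t) : table_(t) {}

    // Called when operation X of type opX with program-order rank seqX
    // performs. Returns false if a later (in program order) operation of a
    // constrained type already performed, i.e., the reordering is illegal.
    bool onPerform(int opX, uint64_t seqX) {
        bool ok = true;
        for (int opY = 0; opY < kNumOpTypes; ++opY)
            if (table_[opX][opY] && seqX <= max_[opY])
                ok = false;                    // seqX > max{OPy} is required
        if (seqX > max_[opX]) max_[opX] = seqX;  // update this type's counter
        return ok;
    }
};

int main() {
    OrderingTable pc = {{{{true, true}}, {{false, true}}}};  // PC, as in Table 1
    AllowableReorderingChecker chk(pc);
    chk.onPerform(kStore, 2);            // younger store (rank 2) performs early
    bool ok = chk.onPerform(kLoad, 1);   // older load performs late
    std::printf("load->store order violated: %s\n", ok ? "no" : "yes");  // yes
    return 0;
}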

4.3 Cache Coherence Checker

Several mechanisms have been proposed for the dynamic verification of coherence [7], [31]. Any coherence verification mechanism that ensures the single-writer/multiple-reader (SWMR) principle, such as the schemes proposed by Cantin et al. [7] and Sorin et al. [31], is sufficient for DVMC. We decided to use the coherence verification mechanism introduced as part of DVSC [19], because it supports both snooping and directory protocols and scales well to larger systems.

We construct the Cache Coherence checker around the notion of an epoch. In SWMR protocols, for example, all variations of M(OE)SI protocols, a processor has to obtain appropriate permissions before reading from or writing to a cache block. It can later give up permissions voluntarily in case of a cache eviction, or permissions can be revoked to allow access by another processor. An epoch for block b is a time interval during which a processor has permission to read (Read-Only epoch) or to read and write (Read-Write epoch) block b. At the end of each epoch, the processor's cache controller sends an Inform-Epoch message to the home memory controller for that block. Epochs can end either as a result of a coherence request, that is, another node's GetShared request (if the epoch is Read-Write) or another node's GetExclusive request, or as a result of evicting a block from the cache before requesting a new block. Inform-Epoch traffic is proportional to coherence traffic, because in both cases, a coherence message is sent or received at the end of the epoch. Because the coherence protocol operates independently of DVMC, sending the Inform-Epoch is not on the critical path, and no new states are introduced into the coherence protocol.

For each Inform-Epoch that a memory controller receives, it checks that 1) this epoch does not overlap illegally with other epochs and 2) the correct block data is transferred from epoch to epoch. The memory controller performs these checks by using per-block epoch information that it keeps in its directory-like Memory Epoch Table (MET).

Details of cache controller and CET operation. Each cache has its own Cache Epoch Table (CET), which is physically separate from the cache to avoid slowing cache accesses. A CET entry corresponds to a cache line and stores 34 bits of information: the type of epoch (1 bit to denote Read-Only or Read-Write), the logical time (16 bits) and the data block (data blocks are hashed down to 16 bits, as we will discuss later in this section) at the beginning of the epoch, and a DataReadyBit to denote that data has arrived for this epoch (recall that an epoch can begin before data arrives). At the end of an epoch, the information in the CET is used to construct the Inform-Epoch message sent to the verifier. Error correcting codes (ECCs) on each line of the cache (not the CET) are used to ensure that the data block does not change unless it is written by a store. Without ECC, silent corruptions of the cache state would be undetectable. The need for ECC will increase the cost of DVMC on unprotected processors, but virtually all recent server processors, including those from AMD, Sun (UltraSPARC T1), Intel, and IBM (POWER5), already feature ECC-protected caches.

Details of the memory controller and MET operation. The memory controllers receive a stream of Inform-Epochs and have to determine if any of the Read-Write epochs overlap other epochs and if data is propagated correctly between epochs. In a naive approach, this would require the MET to store all epochs observed to detect collisions with epochs described by later Inform-Epochs, which is clearly infeasible. To simplify the verification process, we require that memory controllers process Inform-Epochs in the logical time order of epoch start times. Since the order in which Inform-Epochs arrive is already strongly correlated with the epoch begin time, incoming Inform-Epochs can be sorted by timestamps in a small fixed-sized priority queue.

The MET at each memory controller maintains the following state per block for which it is the home node: the latest end time of any Read-Only epoch (16 bits), the latest end time of any Read-Write epoch (16 bits), and the data block at the end of the latest Read-Write epoch (hashed to 16 bits). For every Inform-Epoch that the verifier processes, it checks for rule violations and then updates this state. To check for illegal epoch overlapping, the verifier compares the start time of the Inform-Epoch with the latest end times of the Read-Only and Read-Write epochs in the MET. Epochs overlap illegally if either a Read-Only Inform-Epoch's start time is earlier than the latest Read-Write epoch's end time or a Read-Write Inform-Epoch's start time is earlier than either the latest Read-Only or Read-Write epoch's end time. To check for data propagation errors, the memory controller compares the data block at the beginning of the Inform-Epoch to the data block at the end of the latest Read-Write epoch. If they are not equal, data was propagated incorrectly.

The MET only contains entries for blocks that are present in at least one of the processor caches. When a block without a MET entry is requested by a processor, a new entry is constructed by using the current logical time as the last end time of a Read-Write epoch and by computing the initial checksum from the data in the memory. To ensure that the data does not change while no MET entry is present, DVMC requires ECC-protected memory. This requirement is satisfied by virtually all modern server systems, which are equipped with ECC on DRAMs.
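The MET checks can be summarized in a few lines of code. The sketch below is our rendering of the two rules (illegal overlap and data propagation) under simplifying assumptions: Inform-Epochs arrive pre-sorted by start time, MET entries spring into existence with zeroed state rather than being initialized from memory, and a zero hash is abused to mean "no prior Read-Write epoch observed".

#include <cstdint>
#include <cstdio>
#include <unordered_map>

enum class EpochType { ReadOnly, ReadWrite };

struct InformEpoch {
    uint64_t  block;       // block address
    EpochType type;
    uint16_t  start, end;  // logical times
    uint16_t  hashBegin;   // hash of block data at epoch begin
    uint16_t  hashEnd;     // hash of block data at epoch end
};

struct MetEntry {
    uint16_t lastRoEnd = 0;   // latest end time of any Read-Only epoch
    uint16_t lastRwEnd = 0;   // latest end time of any Read-Write epoch
    uint16_t lastRwHash = 0;  // data hash at end of latest Read-Write epoch
};

class MemoryEpochTable {
    std::unordered_map<uint64_t, MetEntry> met_;
  public:
    bool check(const InformEpoch& e) {
        MetEntry& m = met_[e.block];
        bool ok = true;
        // Overlap rules (Fig. 3): no epoch may start before the latest
        // Read-Write epoch ends; a Read-Write epoch additionally may not
        // start before the latest Read-Only epoch ends.
        if (e.start < m.lastRwEnd) ok = false;
        if (e.type == EpochType::ReadWrite && e.start < m.lastRoEnd) ok = false;
        // Data propagation: each epoch must begin with the data that the
        // most recent Read-Write epoch ended with.
        if (m.lastRwHash != 0 && e.hashBegin != m.lastRwHash) ok = false;
        // Update the per-block state with this epoch's end time and data.
        if (e.type == EpochType::ReadOnly) {
            if (e.end > m.lastRoEnd) m.lastRoEnd = e.end;
        } else if (e.end > m.lastRwEnd) {
            m.lastRwEnd = e.end;
            m.lastRwHash = e.hashEnd;
        }
        return ok;
    }
};

int main() {
    MemoryEpochTable met;
    InformEpoch rw{0x80, EpochType::ReadWrite, 1, 5, 0x1234, 0xBEEF};
    InformEpoch ro{0x80, EpochType::ReadOnly,  3, 9, 0xBEEF, 0xBEEF};
    std::printf("rw ok: %d\n", met.check(rw));  // first epoch for the block
    std::printf("ro ok: %d\n", met.check(ro));  // starts before rw ends: illegal
    return 0;
}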
Logical time. The Cache Coherence checker requires the use of a logical time base. Logical time is a time basis that respects causality (that is, if event A causes event B, then event A has a smaller logical time), and there are many potential logical time bases in a system [14]. We choose two logical time bases, one for snooping and one for directories, based on their ease of implementation. For a snooping protocol, the logical time at each cache and memory controller is the number of cache coherence requests that it has processed thus far. For a directory protocol, the logical time is based on a relatively slow, loosely synchronized physical clock that is distributed to each cache and memory controller. As long as the skew between any two controllers is less than the minimum communication latency between them, causality will be enforced, and this will be a valid basis of logical time [32]. For both protocols, ties in logical time can be broken arbitrarily by the priority queue, since there is no causal ordering between events at the same logical time.

The two implementation challenges in this scheme are to prevent incorrect orderings in logical time due to wraparound and to ensure that Inform-Epoch messages are sent before the backward error recovery mechanism becomes unable to recover to a pre-error state in case the epoch violates one of the coherence rules. Both of these problems can be addressed by "scrubbing" the system of old logical times that are in danger of wraparound, which also bounds the amount of time between the start of an epoch and the transmission of an Inform-Epoch message. Old logical times can lurk in the CETs and METs due to very long epochs. Our method of scrubbing remembers newly started epochs in a small FIFO (128 entries in our experiments) at each CET: every time an epoch begins, the cache controller inserts into the FIFO a pointer to that cache entry and the logical time at which the epoch would wrap around. By periodically checking the FIFO, we can guarantee that a FIFO entry will reach the head of the FIFO before wraparound can occur. When it reaches the head and the epoch is still in progress, the cache controller sends an Inform-Open-Epoch to the memory controller. This message, which contains the block address, the type of epoch, the block data at the start of the epoch, and the logical time at the start of the epoch, notifies the memory controller that the epoch is


TABLE 6. Memory System Parameters
TABLE 7. Processor Parameters

still in progress and that it should expect a single Inform-Closed-Epoch message sometime later. The Inform-Closed-Epoch contains only the block address and the logical time at which the epoch ended. To maintain the state about open epochs, each MET entry holds a bitmask (with 1 bit for each processor) for tracking open Read-Only epochs and a single processor ID for tracking an open Read-Write epoch. Whenever there is an open epoch, the MET entry does not need the last Read-Only/Read-Write logical time, so these logical times and the open epoch information can share storage space if we add an OpenEpoch bit.

Data block hashing. An important implementation issue is the hashing of data block values in the CETs, METs, and Inform-Epochs. Hashing is an explicit trade-off among error coverage, storage, and interconnection network bandwidth. We use CRC-16 as a simple function to hash data blocks down to 16 bits. Aliasing (that is, two distinct blocks mapping to the same hash value) represents the probability of a false negative (that is, not detecting an error that occurred). By choosing the hash function and the hash width, we can make the probability of a false negative arbitrarily small. CRC-16 will not produce false negatives for blocks with fewer than 16 erroneous bits, and it has a probability of 1/65,535 of false negatives for blocks with 16 or more incorrect bits.
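The paper specifies only "CRC-16"; the particular polynomial is not stated. A bitwise software rendering with the common CCITT polynomial 0x1021, given below purely for illustration, shows how a 64-byte block is reduced to a 16-bit hash.

#include <cstddef>
#include <cstdint>
#include <cstdio>

// Bitwise CRC-16 over a data block. The polynomial (0x1021) and the initial
// value (0xFFFF) are our assumptions; any 16-bit CRC has the stated property
// of detecting all errors of fewer than 16 bits within one block.
uint16_t crc16(const uint8_t* data, size_t len) {
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; ++i) {
        crc ^= static_cast<uint16_t>(data[i]) << 8;
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc & 0x8000) ? static_cast<uint16_t>((crc << 1) ^ 0x1021)
                                 : static_cast<uint16_t>(crc << 1);
    }
    return crc;
}

int main() {
    uint8_t block[64] = {0};               // a toy 64-byte cache block
    block[0] = 0xAB;
    std::printf("hash = 0x%04x\n", crc16(block, sizeof block));
    return 0;
}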
5 EXPERIMENTAL METHODOLOGY

We performed our experiments by using Simics [16] full-system simulation of eight-node multiprocessors. The systems were configured with either a MOSI directory coherence protocol or a MOSI snooping coherence protocol, and the simulated processors provide support for the SPARC v9 consistency models TSO, PSO, and RMO, as well as SC. All systems use SafetyNet (SN) [32] for backward error recovery, although any other backward error recovery scheme (for example, ReVive [25]) would work. Configurations of the directory and snooping systems are shown in Table 6. Timing information was computed using a customized version of the Multifacet GEMS simulator [17]. We adapted the cycle-accurate TFSim processor simulator [18] to support the timing simulation of relaxed consistency models, and we configured it as shown in Table 7.

Because DVMC primarily targets high-availability commercial servers, we chose the Wisconsin Commercial Workload Suite [2] for our benchmarks. These workloads are described briefly in Table 8 and in more detail in [2]. Although SPARC v9 is a 64-bit architecture, portions of the code in the benchmark suite were written for the 32-bit SPARC v8 instruction set. Since these code segments were written for TSO, a system configured for PSO or RMO must switch to TSO while executing the 32-bit code. Table 9 shows the average fraction of 32-bit memory operations executed for each benchmark during our experiments.

To account for the runtime variability inherent in multithreaded workloads, we run each simulation 10 times with small pseudorandom perturbations. Our experimental results show the mean result values and error bars that correspond to one standard deviation.

6 EVALUATION

We used simulation to empirically confirm DVMC's error detection capability and to gain insight into its impact on error-free performance. In this section, we describe the results of these experiments, and we discuss DVMC's hardware costs and interconnect bandwidth overhead.

6.1 Error Detection

Error injection experiments are commonly used to determine the error coverage of mechanisms that are known to be unable to detect certain error scenarios (for example, errors affecting uncovered store or branch operations in EDDI [21]). In DVMC, we already have proof (see Appendix A) that any single error affecting memory consistency will be detected. The error injection experiments serve merely as a sanity check that DVMC was implemented correctly in our simulation environment and that errors are detected in time to allow for system recovery. Experiments were performed by injecting single errors into all components related to the memory system: the load-store queue (LSQ), write buffer, caches, interconnect switches and links, and memory and cache controllers. The injected single errors included data and address bit flips; dropped, reordered, misrouted, and duplicated messages; incorrect cache state transitions; and reorderings and incorrect forwarding in the LSQ and write buffer. We did not inject errors into the low-level gates and wires of the controllers and the checker, because doing so would have required a detailed synthesis. It is possible that some errors may cause an irrecoverable crash by, for example,


TABLE 8. Workload Description (from [32])

not mapping into a well-defined cache operation. Barring such effects, the error injection experiments cover all other behavioral errors.

For each test, an error time, an error type, and an error location were chosen at random for injection into a running benchmark. After injecting the error, the simulation continued until the error was detected. Since errors become nonrecoverable once the last checkpoint taken before the error expires, we also checked that a valid checkpoint was still available at the time of detection. We conducted these experiments for all four supported consistency models with both the directory and snooping systems.

DVMC detected all injected errors well within the SN recovery time frame of about 100,000 processor cycles. Because errors are not frequent enough to make error recovery a performance limiter, the exact error detection latency is irrelevant, as long as it is within the recovery window. Our DVMC implementation enforces a bounded detection latency in each of the three checkers. Errors in Uniprocessor Ordering are detected before an instruction commits and are corrected by flushing the pipeline. Thus, they never trigger backward error recovery. Illegal reorderings are detected once both instructions in a reordered pair have performed. To bound detection latency, barriers are injected periodically (see Section 4.2), enforcing correct ordering of all outstanding instructions. Violations in the cache coherence protocol are detected as soon as the Inform-Epoch message is received and is in order. This latency is bounded by the time-out until an Inform-Open-Epoch is sent.

In multiple-error scenarios, DVMC does not have the same detection guarantees as in the case of a single error, but it is still very likely that multiple errors will be detected. Intuitively, DVMC detects a multiple-error scenario unless the multiple errors happen to cancel each other out. For example, if a fault causes a Read-Write epoch to overlap a Read-Only epoch and another fault causes an error in the Read-Write epoch's Inform-Epoch such that its epoch does not appear to overlap the Read-Only epoch, then DVMC cannot detect this multiple-error scenario. Obtaining statistically significant estimates for the likelihood of undetected multiple errors would require an infeasibly large number of simulations, because the probability of errors cancelling out is very low.

DVMC could cause false positives in three situations. The first situation is when Inform-Epoch messages are processed out of logical time order. We saw exactly zero instances of this occurring in all of our experiments (covering billions of cycles of execution time), even with a small priority queue at the memory controllers. The second situation is due to errors in the DVMC hardware itself. Such errors are very unlikely, given the small area occupied by the DVMC hardware. The third situation is due to DVMC enforcing invariants that are more stringent than what is strictly required by the consistency model (for example, a coherence protocol error will not necessarily violate memory consistency, but it will be detected). However, the hardware is designed to operate within the invariants enforced by DVMC, and false positives will only occur in case of hardware errors that would not have affected a correct program execution. In all three situations, the false positives can only cause performance overhead but will not impact a correct program execution.

TABLE 9. Percentage of 32-Bit Code

6.2 Performance

Aside from error detection capability, error-free performance is the most important metric for an error detection mechanism. To determine the DVMC performance, we ran each


Fig. 4. Workload runtimes for directory coherence.

benchmark for a fixed number of transactions and compared the runtime on an unprotected system and a system implementing DVMC with different consistency models. The benchmark barnes was run to completion.

6.2.1 Baseline System Performance

Before looking at the DVMC overhead, we compare the performance of unprotected systems (no DVMC or backward error recovery) with different memory consistency models. The "Base" numbers in Figs. 4 and 5 show the relative runtimes, normalized to SC. The addition of a write buffer in the TSO system improves the performance for almost all benchmarks. PSO and RMO do not show significant performance benefits and can even lead to performance degradation, although they allow additional optimizations that are not legal for TSO. In our experiments, the oldest-store-first strategy implicitly used by TSO performs well compared to more complicated policies. Nonspeculative reordering of loads also turns out to be of little value, because load-order misspeculation is exceedingly rare, affecting less than 0.1 percent of loads. Although the benefits from optimizations are limited, relaxed consistency models need to obey memory barriers, which, even when implemented efficiently, can make the performance worse than TSO.

Although most benchmarks show the expected benefits of a write buffer and the expected overhead incurred by verification, some of the slash results are counterintuitive. Highly contended locks make slash sensitive to changes in write access timing, as indicated by the high variance in runtime, and it benefits from the reduced contention caused by the additional stalls present in SC [26].

6.2.2 Dynamic Verification of Memory Consistency Overhead

DVMC can potentially degrade the performance in several ways. The Uniprocessor Ordering checker requires an additional pipeline stage, thus extending the time during which instructions are in flight and increasing the occupancy of the ROB and the physical registers. Load replay increases the demand on the cache and can cause additional cache misses. Coherence verification can degrade the system performance due to the interconnect bandwidth used for Inform-Epoch messages. SN, the backward error recovery mechanism used during our tests, also causes additional interconnect traffic. The Allowable Reordering checker does not have any influence on the program execution, since it operates entirely off the critical path and has no side effects.

We first examine the total impact of all these factors by comparing benchmark results for an unprotected baseline system and a system implementing full DVMC as well as SN backward error recovery. The benchmarks are run for all four supported consistency models and both the directory and snooping coherence protocols. Figs. 4 and 5 show the runtimes of all benchmarks normalized to an unprotected system implementing SC. Despite the many performance hazards described, we observed no slowdown exceeding 11 percent. The worst slowdowns occur with SC, which is rarely implemented in modern systems. In all but four of the 40 DVMC configurations, the overhead is limited to 6 percent. The rest of this section focuses on a directory-based system with TSO, which exhibited the largest overhead of the commonly implemented models.

Fig. 5. Workload runtimes for snooping coherence.


Fig. 6. Verification mechanism runtimes for TSO.
Fig. 8. Interconnect traffic for TSO.

To study the impact of the different DVMC components, we ran the same experiments with a system that only implements backward error recovery using SN, a system with backward error recovery that only verifies cache coherence (SN+DVCC), a system with backward error recovery and uniprocessor ordering verification (SN+DVUO), and full DVMC with backward error recovery (DVTSO). The results of these experiments for a system implementing TSO and directory coherence are shown in Fig. 6. These experiments show that Uniprocessor Ordering verification is the dominant cause of slowdown, and although each mechanism (SN, Uniprocessor Ordering verification, and Coherence verification) adds a small amount of overhead in most cases, full DVTSO is no slower than SN+DVUO. Fig. 6 also shows some unexpected speedups on slash when SN is added. With slash, SN slightly delays some writes, which can reduce lock contention and lead to a performance increase.

Fig. 7 shows the number of L1 cache misses during replay normalized to the number of L1 cache misses during regular execution. Replay misses are rare, because the time between a load's execution and verification is typically small. Most replay cache misses occur when a processor unsuccessfully tries to acquire a lock and returns to the spin loop.

DVMC can increase interconnection network utilization in two ways: the Cache Coherence checker consumes bandwidth for Inform-Epoch messages, and load replays can initiate additional coherence transactions. For the directory system with TSO, Fig. 8 shows the mean bandwidth on the highest loaded link for different workloads and mechanisms. DVCC imposes a consistent traffic overhead of about 20 percent to 30 percent, and load replay does not have any measurable impact.

6.2.3 Scaling Sensitivity

To gain insight into how well DVMC would scale to different multiprocessor systems, we ran experiments with different numbers of processors and different interconnect link bandwidths and compared the benchmark runtimes on an unprotected system to a system featuring SN and DVMC. All experiments were performed for TSO with both directory and snooping protocols.

Fig. 9 shows the average over all benchmark runtimes of an eight-node system, with DVTSO normalized to an unprotected system, for different link bandwidths ranging from 1 Gbyte/s to 3 Gbytes/s. Although performance overheads vary between the different link bandwidths, these variations are not statistically significant and do not show any clear correlation between link bandwidth and performance overhead. In all configurations, the full link bandwidth is only used during brief burst periods, whereas most DVMC-related messages are transmitted during idle times between bursts. Thus, the DVMC traffic has little impact on the total system performance, as long as the transmission can be delayed until traffic bursts are over.

Fig. 7. Cache misses during replay.
Fig. 9. DVTSO overhead versus link bandwidth.


Fig. 10. DVTSO overhead versus the number of processors.

Fig. 10 shows the average over benchmark runtimes of systems with one to eight processors and 2.5-Gbytes/s links. The graph does not show any strong correlation between the system size and the DVMC performance overhead. This result is expected, as the DVMC traffic is all unicast and increases linearly with the overall traffic such that the bandwidth consumption stays constant.

6.3 Hardware Cost

The hardware costs of DVMC are determined by the storage structures and logic components required to implement the three checkers but not the backward error recovery mechanism, which is orthogonal. Information on the implementation cost of SN [32], ReVive [25], and other backward error recovery schemes can be found in the literature. The Uniprocessor Ordering checker is the most complex checker, since it requires the addition of the VC and a new stage in the processor pipeline. These changes are nontrivial, but all required logic components are simple, and the storage structures are small (32-256 bytes). The Allowable Reordering checker is the simplest and smallest checker. It requires an LSQ-sized FIFO, a set of sequence number registers, sequence numbers in the write buffer, the ordering tables, and comparators for the checker logic. The Cache Coherence checker also does not require complex logic, but it incurs greater storage costs. Our CET entries are 34 bits, leading to a total CET size of about 70 Kbytes per node. The MET requires 102 Kbytes per memory controller, with an entry size of 48 bits, but it is not latency sensitive and can be built out of cheap long-latency DRAMs. The MET contains entries for blocks that are currently present in at least one processor cache. Entries for blocks only present at the memory are constructed from the current logical time and memory value upon a cache request. To detect data errors on these blocks, DVMC requires ECC on all main memory DRAMs.
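As a sanity check of the quoted CET size: assuming, purely for illustration, a 1-Mbyte cache with 64-byte lines (the actual parameters are those of Table 6), the CET holds one 34-bit entry per cache line:

\[
\frac{2^{20}\,\text{bytes}}{64\,\text{bytes/line}} \times 34\,\frac{\text{bits}}{\text{entry}}
  = 16{,}384 \times 34\ \text{bits} = 557{,}056\ \text{bits} \approx 68\ \text{Kbytes},
\]

which is consistent with the "about 70 Kbytes per node" figure above.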
7 RELATED WORK

In this section, we discuss prior research in dynamic verification. For verifying the execution of a single thread, there have been two different approaches. First, DIVA adds a small, simple checker core that verifies the execution of the instructions committed by the complex core [3]. By leveraging the microprocessor as an oracular prefetcher and branch predictor, the simple checker can keep up with the performance of the microprocessor. Second, there have been several schemes for multithreaded uniprocessors, starting with AR-SMT [28], that use redundant threads to detect errors. These schemes leverage the multiple existing thread contexts to achieve error detection. Unlike DVMC, these schemes do not extend to the memory system.

For multithreaded systems with shared memory, there are four related pieces of work. Sorin et al. [31] developed a scheme for the dynamic verification of a subset of cache coherence in snooping multiprocessors. Although dynamically verifying these invariants is helpful, it is not an end-to-end mechanism, since coherence is not sufficient for implementing consistency. Cantin et al. [7] propose to verify cache coherence by replaying transactions with a simplified coherence protocol. Cain et al. [5] describe an algorithm to verify SC but do not provide an implementation. Finally, we previously [19] designed an ad hoc scheme for DVSC, which does not extend to any other consistency models.

8 CONCLUSIONS

This paper presents a framework that can dynamically verify a wide range of consistency models and comprehensively detect memory system errors. Our verification framework is modular, because it checks three independent invariants that together are sufficient to guarantee memory consistency. The modular design makes it possible to replace any of our checking mechanisms with a different scheme to adapt to a specific system's design. For example, the coherence checker adapted from DVSC [19] can be replaced by the design proposed by Cantin et al. [7]. DVMC is also not limited to the conventional multiprocessor systems described in this paper but could be used with other shared-memory-based architectures. The simplicity of the proposed mechanisms suggests that they can be implemented with small modifications to existing multithreaded systems. Simulation of a DVMC implementation shows some decrease in performance, but we expect the negative impact to be outweighed by the benefit of improved safety and availability.

APPENDIX A
PROOF OF FRAMEWORK CORRECTNESS

We now prove that dynamic verification schemes that use our framework verify memory consistency. The proof shows that the three verification mechanisms presented in Section 4 verify their respective invariants (Appendix A.1) and together ensure memory consistency (Appendix A.2). To simplify the discussion, we assume that memory is only accessed at the word granularity.

The following proofs rely on a property called Cache Correctness, which states that after a store to word w in a cache, word w in the cache contains the stored value until it is overwritten. This assumption is necessary to make any statements about the contents of a cache after a

sequence of stores. It can be dynamically verified using standard techniques such as ECC.

Definition 2. A cache is correct if, after a store X to word w is executed by a cache, every subsequent load of word w executed by the cache returns the value specified by X until another store to w is executed.

Cache Correctness is assumed for all storage structures, including all levels of caches, the VC, CET, MET, and main memory.

A.1 Checkers Satisfy Invariants

A.1.1 Uniprocessor Ordering Checker Correctness

Definition 3. Uniprocessor Ordering is maintained if every load Y of word w executed by a processor p receives the value of the most recent store X to the same word w in program order (X <p Y).

During replay, a load Y of word w either hits in the VC or misses and accesses the cache. In the first case, some store to w that precedes Y in program order has not yet performed when Y is replayed, and the VC holds the value of the most recent such store; by Cache Correctness of the VC, the replayed load receives that value. In the second case, all stores to w that precede Y in program order also performed before Y performs. This implies that the VC contains no entry for w, and the replayed load value has to be obtained from the cache. If the value in the cache is from a store executed by the same processor that executes Y, then by Cache Correctness, it must be the value of the last store X that wrote to the cache. As the cache is written when a store performs, X is the last store to w performed before Y. When X performed, the corresponding entry in the VC must have been freed, since there are no uncommitted stores to w. During deallocation, Uniprocessor Ordering verification ensures that the value written to the cache by X is equal to the value contained in the VC entry for w, which is the value of the most recent store in program order. Since they have to be identical, X is the most recent store in program order. If Y receives a value from a store Z performed on a different processor, then Z must have overwritten the value written to the cache by X. By the definition of performed, Z must have performed after X.

In both cases, the load receives a value from either the most recent store X to w in processor p's program order or a store from a different processor performed after X performs.

A.1.2 Allowable Reordering Checker Correctness

Allowable Reordering correctness is directly verified as described in Section 4.2. Again, we first give a formal definition of Allowable Reordering.

Definition 4. A reordering of performed operations is allowable if for any two operations X and Y, where X is of type OPx and Y is of type OPy, it is true that if the consistency model requires an ordering constraint between OPx and OPy and X <p Y, then X also performs before Y.

As described in Section 4.2, whenever an operation X of type OPx performs, the checker verifies that seqX > max{OPy} for all OPy with an ordering constraint OPx < OPy. Therefore, if the above statement is not true, an error will be signaled.

A.1.3 Cache Coherence Checker Correctness

Definition 5. Cache Coherence is maintained if for every word w there exists a serialization of all stores to w that is observed by all processors.

Every operation is labeled with a triple <logical time, processor ID, perform sequence number>, where the logical time is the begin time of the epoch in which the operation performs, the processor ID is the ID of the processor performing the operation, and the perform sequence number is the rank of the operation in its processor's perform order. These labels are unique, since no operations executed on the same processor share the same sequence number. Labels can be constructed for all operations, since every operation has to perform within an epoch (rule 1), and processor IDs and sequence numbers trivially exist for all operations.

The serialization of stores to w required in the definition is obtained by sorting operations by their labels. We refer to this order as the coherence order for w. To show that all processors observe the stores to perform in that order, consider any two stores X and Y to the same word.

If X and Y share the same logical time, then they must also share the same processor ID, since by rule 1, stores can only perform in Read-Write epochs, and Read-Write epochs do not overlap (rule 2). The CPU executing the stores observes them performing in program order, which is verified by the Uniprocessor Ordering checker. Any other processor can only observe the stores in a later epoch, since a load must be contained in an epoch (rule 1), which cannot overlap the epoch during which the store is performed (rule 2).


A.2 Invariants Guarantee Memory Consistency

To show that the three invariants together imply Definition 1, every operation is again labeled with a triple <logical time, processor ID, perform sequence number>. The perform sequence number is the rank of the operation in the sequence of all operations performed by a given processor. The labels are unique, and a label can be constructed for every operation. The global order <m is obtained by sorting all operations by their labels.

We use Allowable Reordering correctness (Appendix A.1.2) to show the first property. All Xs and Ys with X <p Y whose operation types are related by an ordering constraint perform in program order on their common processor, and they therefore also appear in that order in <m.

To obtain the value returned from load Y, we consider two cases. First, if there exists a store that precedes Y in the program order but not in the perform order, then both Y and the store must be from the same processor; otherwise, there would be no program order relation between the two operations.

REFERENCES

[1] S.V. Adve and K. Gharachorloo, “Shared Memory Consistency Models: A Tutorial,” Computer, vol. 29, no. 12, pp. 66-76, Dec. 1996.
[2] A.R. Alameldeen, M.M. Martin, C.J. Mauer, K.E. Moore, M. Xu, M.D. Hill, D.A. Wood, and D.J. Sorin, “Simulating a $2M Commercial Server on a $2K PC,” Computer, vol. 36, no. 2, pp. 50-57, Feb. 2003.
[3] T.M. Austin, “DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design,” Proc. 32nd Ann. IEEE/ACM Int’l Symp. Microarchitecture (MICRO ’99), pp. 196-207, Nov. 1999.
[5] H.W. Cain and M.H. Lipasti, “Verifying Sequential Consistency Using Vector Clocks,” Proc. 14th ACM Symp. Parallel Algorithms and Architectures (SPAA ’02), Aug. 2002.


[7] J.F. Cantin, M.H. Lipasti, and J.E. Smith, “Dynamic Verification of Cache Coherence Protocols,” Proc. Second Workshop Memory Performance Issues (WPI ’01), June 2001.
[8] S. Chatterjee, C. Weaver, and T. Austin, “Efficient Checker Processor Design,” Proc. 33rd Ann. IEEE/ACM Int’l Symp. Microarchitecture (MICRO ’00), pp. 87-97, Dec. 2000.
[9] K. Gharachorloo, A. Gupta, and J. Hennessy, “Two Techniques to Enhance the Performance of Memory Consistency Models,” Proc. Int’l Conf. Parallel Processing (ICPP ’91), vol. I, pp. 355-364, Aug. 1991.
[10] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy, “Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors,” Proc. 17th Ann. Int’l Symp. Computer Architecture (ISCA ’90), pp. 15-26, May 1990.
[11] P.B. Gibbons and E. Korach, “Testing Shared Memories,” SIAM J. Computing, vol. 26, no. 4, pp. 1208-1244, Aug. 1997.
[12] M.D. Hill, A.E. Condon, M. Plakal, and D.J. Sorin, “A System-Level Specification Framework for I/O Architectures,” Proc. 11th ACM Symp. Parallel Algorithms and Architectures (SPAA ’99), pp. 138-147, June 1999.
[13] International Technology Roadmap for Semiconductors, 2003.
[14] L. Lamport, “Time, Clocks and the Ordering of Events in a Distributed System,” Comm. ACM, vol. 21, no. 7, pp. 558-565, July 1978.
[15] L. Lamport, “How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs,” IEEE Trans. Computers, vol. 28, no. 9, pp. 690-691, Sept. 1979.
[16] P.S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hållberg, J. Högberg, F. Larsson, A. Moestedt, and B. Werner, “Simics: A Full System Simulation Platform,” Computer, vol. 35, no. 2, pp. 50-58, Feb. 2002.
[17] M.M. Martin, D.J. Sorin, B.M. Beckmann, M.R. Marty, M. Xu, A.R. Alameldeen, K.E. Moore, M.D. Hill, and D.A. Wood, “Multifacet’s General Execution-Driven Multiprocessor Simulator (GEMS) Toolset,” Computer Architecture News, vol. 33, no. 4, pp. 92-99, Sept. 2005.
[18] C.J. Mauer, M.D. Hill, and D.A. Wood, “Full System Timing-First Simulation,” Proc. ACM Sigmetrics ’02, pp. 108-116, June 2002.
[19] A. Meixner and D.J. Sorin, “Dynamic Verification of Sequential Consistency,” Proc. 32nd Ann. Int’l Symp. Computer Architecture (ISCA ’05), pp. 482-493, June 2005.
[20] S.S. Mukherjee, M. Kontz, and S.K. Reinhardt, “Detailed Design and Implementation of Redundant Multithreading Alternatives,” Proc. 29th Ann. Int’l Symp. Computer Architecture (ISCA ’02), pp. 99-110, May 2002.
[21] N. Oh, P.P. Shirvani, and E.J. McCluskey, “Error Detection by Duplicated Instructions in Super-Scalar Processors,” IEEE Trans. Reliability, vol. 51, no. 1, pp. 63-74, Mar. 2002.
[22] M. Plakal, D.J. Sorin, A.E. Condon, and M.D. Hill, “Lamport Clocks: Verifying a Directory Cache-Coherence Protocol,” Proc. 10th ACM Symp. Parallel Algorithms and Architectures (SPAA ’98), pp. 67-76, June 1998.
[23] F. Pong and M. Dubois, “Verification Techniques for Cache Coherence Protocols,” ACM Computing Surveys, vol. 29, no. 1, pp. 82-126, Mar. 1997.
[24] F. Pong and M. Dubois, “Formal Automatic Verification of Cache Coherence in Multiprocessors with Relaxed Memory Models,” IEEE Trans. Parallel and Distributed Systems, vol. 11, no. 9, pp. 989-1006, Sept. 2000.
[25] M. Prvulovic, Z. Zhang, and J. Torrellas, “ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors,” Proc. 29th Ann. Int’l Symp. Computer Architecture (ISCA ’02), pp. 111-122, May 2002.
[26] R. Rajwar, A. Kägi, and J.R. Goodman, “Improving the Throughput of Synchronization by Insertion of Delays,” Proc. Sixth IEEE Symp. High-Performance Computer Architecture (HPCA ’00), pp. 168-179, Jan. 2000.
[27] S.K. Reinhardt and S.S. Mukherjee, “Transient Fault Detection via Simultaneous Multithreading,” Proc. 27th Ann. Int’l Symp. Computer Architecture (ISCA ’00), pp. 25-36, June 2000.
[28] E. Rotenberg, “AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors,” Proc. 29th IEEE Int’l Symp. Fault-Tolerant Computing (FTCS ’99), pp. 84-91, June 1999.
[29] J.H. Saltzer, D.P. Reed, and D.D. Clark, “End-to-End Arguments in Systems Design,” ACM Trans. Computer Systems, vol. 2, no. 4, pp. 277-288, Nov. 1984.
[30] P. Shivakumar, M. Kistler, S.W. Keckler, D. Burger, and L. Alvisi, “Modeling the Effect of Technology Trends on the Soft Error Rate of Combinational Logic,” Proc. Int’l Conf. Dependable Systems and Networks (DSN ’02), June 2002.
[31] D.J. Sorin, M.D. Hill, and D.A. Wood, “Dynamic Verification of End-to-End Multiprocessor Invariants,” Proc. Int’l Conf. Dependable Systems and Networks (DSN ’03), pp. 281-290, June 2003.
[32] D.J. Sorin, M.M. Martin, M.D. Hill, and D.A. Wood, “SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery,” Proc. 29th Ann. Int’l Symp. Computer Architecture (ISCA ’02), pp. 123-134, May 2002.
[33] J. Srinivasan, S.V. Adve, P. Bose, and J.A. Rivers, “The Impact of Technology Scaling on Lifetime Reliability,” Proc. Int’l Conf. Dependable Systems and Networks (DSN ’04), June 2004.
[34] D.M. Tullsen, S.J. Eggers, J.S. Emer, H.M. Levy, J.L. Lo, and R.L. Stamm, “Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor,” Proc. 23rd Ann. Int’l Symp. Computer Architecture (ISCA ’96), pp. 191-202, May 1996.
[35] SPARC Architecture Manual (Version 9), D.L. Weaver and T. Germond, eds. PTR Prentice Hall, 1994.
[36] S.C. Woo, M. Ohara, E. Torrie, J.P. Singh, and A. Gupta, “The SPLASH-2 Programs: Characterization and Methodological Considerations,” Proc. 22nd Ann. Int’l Symp. Computer Architecture (ISCA ’95), pp. 24-37, June 1995.

Albert Meixner received the MS degree in applied computer science from Paris Lodron Universität, Salzburg, Austria. He is currently working toward the PhD degree in the Department of Computer Science, Duke University. His research interests include architectures for fault-tolerant computing. He is a student member of the IEEE and the IEEE Computer Society.

Daniel J. Sorin received the PhD degree in electrical and computer engineering from the University of Wisconsin, Madison. He is currently an assistant professor of electrical and computer engineering and computer science in the Department of Electrical and Computer Engineering, Duke University. His research interests include dependable computer architecture. He is a senior member of the IEEE and the IEEE Computer Society.
