Swich: a Prototype for Efficient Cache-Level Checkpointing and Rollback

Radu Teodorescu, Jun Nakano, and Josep Torrellas
University of Illinois at Urbana-Champaign

IEEE Micro, September–October 2006. Published by the IEEE Computer Society. 0272-1732/06/$20.00 © 2006 IEEE

Existing cache-level checkpointing schemes do not continuously support a large rollback window. Immediately after a checkpoint, the number of instructions that the processor can undo falls to zero. To address this problem, we introduce Swich, an FPGA-based prototype of a new cache-level scheme that keeps two live checkpoints at all times, forming a sliding rollback window that maintains a large minimum and average length.

Backward recovery through checkpointing and rollback is a popular approach in modern processors to providing recovery from transient and intermittent faults [1]. This approach is especially attractive when designers can implement it with very low overhead, typically using dedicated hardware support.

The many proposals for low-overhead checkpointing and rollback schemes differ in a variety of ways, including the level of the memory hierarchy checkpointed and the organization of the checkpoint relative to the active data. Schemes that checkpoint at the cache level occupy an especially attractive design point [2, 3]. The hardware for these schemes, which works by regularly saving the data in the cache in a checkpoint that can later serve for fault recovery, can be integrated relatively inexpensively and used efficiently in modern processor microarchitectures.

Unfortunately, cache-level checkpointing schemes such as Carer [2] and Sequoia [3] don't continuously support a large rollback window, that is, the set of instructions that the processor can undo by returning to a previous checkpoint if it detects a fault. Specifically, in these schemes, immediately after the processor takes a checkpoint, the window of code that it can roll back typically is reduced to zero.

In this article, we describe the microarchitecture of a new cache-level checkpointing scheme that designers can efficiently and inexpensively implement in modern processors and that supports a large rollback window continuously. To do this, the cache keeps data from two consecutive checkpoints (or epochs) at all times. We call this approach a sliding rollback window, and we've named our scheme Swich, for Sliding-Window Cache-Level Checkpointing.

We've implemented a prototype of Swich using field-programmable gate arrays (FPGAs), and our evaluation of the prototype shows that supporting cache-level checkpointing with a sliding window is promising. For applications without frequent I/O or system calls, the technique sustains a large minimum rollback window and adds little additional logic and memory hardware to a simple processor.

Low-overhead checkpointing schemes

To help in situating Swich among the alternatives, it's useful to classify hardware-based checkpointing and rollback schemes into a taxonomy with three axes:

• the scope of the sphere of backward error recovery (BER),
• the relative location of the checkpoint and active data (the most current versions), and
• the way the scheme separates the checkpoint from the active data.

Figure 1 shows the taxonomy.

Figure 1. Taxonomy of hardware-based checkpointing and rollback schemes. (Axes: scope of the sphere of BER, with points register, cache, and memory; relative checkpoint location, with points dual and leveled; separation of checkpoint and active data, with points full and partial, the latter subdivided into buffering, renaming, and logging.)

The sphere of BER is the part of the system that is checkpointed and, upon fault detection, rolls back to a previous state. Any scheme assumes that a datum that propagates outside the sphere is correct, because it has no mechanism to roll it back. Typically, the sphere of BER encloses the memory hierarchy up to a particular level. Consequently, we define three design points on this axis, depending on whether the sphere extends to the registers, the caches, or the main memory (Figure 2).

Figure 2. Design points based on the scope of the sphere of backward error recovery (BER). (Panels: IBM G5; Carer/Sequoia; ReVive/SafetyNet/COMA. Each panel shows the register, cache, and memory levels.)

The second axis in the taxonomy compares the level of the memory hierarchy checkpointed (M, where M is register, cache, or memory) and the level where the checkpoint is stored. This axis has two design points [4]. Dual means that the two levels are the same. This includes schemes that store the checkpoint in a special hardware structure closely attached to M. Leveled means the scheme saves the checkpoint at a lower level than M; for example, a scheme might save a checkpoint of a cache in memory.

The third axis, which considers how the scheme separates the checkpoint from the active data, has two design points [5]. In full separation, the checkpoint is completely separated from the active data. In partial separation, the checkpoint and the active data are one and the same, except the data that has been modified since the last checkpoint. For this data, the checkpoint keeps both the old value at the time of the last checkpoint and the most current value.

Partial separation includes the subcategories buffering, renaming, and logging. In buffering, any modification to location L accumulates in a special buffer; at the next checkpoint, the modification is applied to L. In renaming, a modification to L does not overwrite the old value but is written to a different place, and the mapping of L is updated to point to this new place; at the next checkpoint, the new mapping of L is committed. In logging, a modification to L occurs in place, but the old value is copied elsewhere; at the next checkpoint, the old value is discarded.

Table 1 lists characteristics of several existing hardware-based checkpointing schemes and where they fit in our taxonomy. The "Checkpointing scheme comparison" sidebar provides more details on the schemes listed.

Table 1. Characteristics of some existing hardware-based checkpointing and rollback schemes.

| Scheme | Design point in the taxonomy | Hardware support | Tolerable fault detection latency | Recovery latency | Fault-free execution overhead |
|---|---|---|---|---|---|
| IBM G5 | Register, dual, full | Redundant copy of register unit (R-unit), lock-stepping units | Pipeline depth | 1,000 cycles | Negligible |
| Sparc64, out-of-order processor | Register, dual, renaming | Register alias table (RAT) copy and restore on branch speculation | Pipeline depth | 10 to 100 cycles | Negligible |
| Carer | Cache, dual, logging | Extra cache line state | Cache fill time | Cache invalidation | Not available |
| Sequoia | Cache, leveled, full | Cache flush | Cache fill time | Cache invalidation | Not available |
| Swich | Cache, dual, logging | Extra cache line states | Cache fill time | Cache invalidation | 1% |
| ReVive | Memory, dual, logging | Memory logging, flush cache at checkpoint | 100 ms | 0.1 to 1 s | 5% |
| SafetyNet | Memory, dual, logging | Cache and memory logging | 0.1 to 1 ms | Not available | 1% |
| Cache-Only Memory Architecture | Memory, dual, renaming | Mirroring by cache coherence protocol | 2 to 200 ms | Not available | 5 to 35% |

Trade-offs

The design points in our taxonomy in Figure 1 correspond to different trade-offs among fault detection latency, execution overhead, recovery latency, space overhead, type of faults supported, and hardware required. Figure 3 illustrates the trade-offs.

Consider the maximum tolerable fault detection latency. Moving from register-, to cache-, to memory-level checkpointing, the tolerable fault detection latency increases. This is because the corresponding checkpoints increase in size and can thus hold more execution history. Consequently, the time between when the error occurs and when it is detected can be larger while still allowing system recovery.

The fault-free execution overhead has an interesting shape. In register-level checkpointing, the overhead is negligible because these schemes save at most a few hundred bytes (for example, a few registers or the register alias table), and efficiently copy them nearby in hardware. At the other levels, a checkpoint has significant overhead, because the processors copy larger chunks of data over longer distances. An individual checkpoint's overhead tends to increase moving from cache- to memory-level checkpointing. However, designers tune the different designs' checkpointing frequencies to keep the overall [...]

Checkpointing scheme comparison

Register-level checkpointing

IBM's S/390 G5 processor (in our taxonomy from Figure 1, dual; full separation) includes a redundant copy of the register file called the R-unit, which lets the processor roll back by one instruction [1]. In addition, it has duplicate lock-stepping instruction and execution units, and it can detect errors by comparing the signals from these units.

Fujitsu's Sparc64 processor leverages a mechanism present in most out-of-order processors: To handle branch misspeculation, processors can discard the uncommitted state in the pipeline and resume execution from an older instruction [2]. The register writes update renamed register locations (dual; partial separation by renaming). For this mechanism to provide recovery, it must detect faults before the affected instruction commits. In Sparc64, parity protects about 80 percent of the latches, and the execution units have a residue check mechanism; for instance, the Sparc64 [...]

[...] uses a special directory controller to log memory updates in the memory (dual; partial separation by logging). SafetyNet logs updates to memory (and caches) in special checkpoint log buffers [6]. Therefore, we categorize SafetyNet's memory checkpointing as dual; partial separation by logging.

Morin et al. modified the cache coherence protocol of Cache-Only Memory Architecture (COMA) machines to ensure that at any time, every memory line has exactly two copies of data from the last checkpoint in two distinct nodes [7]. Thus, the machines can recover the memory from single-node failures (dual; partial separation by renaming).

References

1. T.J. Slegel et al., "IBM's S/390 G5 Microprocessor Design," IEEE Micro, vol.
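
As an illustration of the three partial-separation mechanisms named in the taxonomy (buffering, renaming, and logging), the following Python sketch models each one as a small software store with write, checkpoint, and rollback operations. It is only a software analogy under stated assumptions, not the hardware of Swich or of any scheme in Table 1: the class names, the method names, and the `demo` helper are hypothetical and chosen for this example.

```python
class BufferingStore:
    """Buffering: writes accumulate in a side buffer and are applied
    to the location only at the next checkpoint."""
    def __init__(self, initial):
        self.mem = dict(initial)   # state as of the last checkpoint
        self.buf = {}              # modifications since the last checkpoint
    def read(self, loc):
        return self.buf.get(loc, self.mem[loc])
    def write(self, loc, val):
        self.buf[loc] = val
    def checkpoint(self):
        self.mem.update(self.buf)  # apply buffered modifications
        self.buf.clear()
    def rollback(self):
        self.buf.clear()           # drop uncommitted modifications


class RenamingStore:
    """Renaming: a write goes to a fresh slot and the mapping of the
    location is redirected; the new mapping commits at the checkpoint.
    (Assumption: old slots are never reclaimed in this toy model.)"""
    def __init__(self, initial):
        self.slots = list(initial.values())
        self.committed = {loc: i for i, loc in enumerate(initial)}
        self.current = dict(self.committed)
    def read(self, loc):
        return self.slots[self.current[loc]]
    def write(self, loc, val):
        self.slots.append(val)                 # old value is not overwritten
        self.current[loc] = len(self.slots) - 1
    def checkpoint(self):
        self.committed = dict(self.current)    # commit the new mapping
    def rollback(self):
        self.current = dict(self.committed)    # restore the old mapping


class LoggingStore:
    """Logging: a write happens in place, but the old value is first
    copied to a log; the log is discarded at the next checkpoint."""
    def __init__(self, initial):
        self.mem = dict(initial)
        self.log = {}              # loc -> value at the last checkpoint
    def read(self, loc):
        return self.mem[loc]
    def write(self, loc, val):
        self.log.setdefault(loc, self.mem[loc])  # save old value once
        self.mem[loc] = val
    def checkpoint(self):
        self.log.clear()           # old values are no longer needed
    def rollback(self):
        self.mem.update(self.log)  # restore the saved old values
        self.log.clear()


def demo(store_cls):
    """Write, checkpoint, write again, then roll back to the checkpoint."""
    s = store_cls({"L": 1})
    s.write("L", 2)
    s.checkpoint()                 # value 2 becomes the checkpointed state
    s.write("L", 3)
    before = s.read("L")
    s.rollback()                   # undo everything since the checkpoint
    return before, s.read("L")
```

Calling `demo` with any of the three classes returns `(3, 2)`: the active value before the rollback and the checkpointed value after it. In all three models, nothing can be undone immediately after `checkpoint()`, which mirrors the zero-length rollback window that the article attributes to existing cache-level schemes.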
