SWICH: A PROTOTYPE FOR EFFICIENT CACHE-LEVEL CHECKPOINTING AND ROLLBACK

Existing cache-level checkpointing schemes do not continuously support a large rollback window. Immediately after a checkpoint, the number of instructions that the processor can undo falls to zero. To address this problem, we introduce Swich, an FPGA-based prototype of a new cache-level scheme that keeps two live checkpoints at all times, forming a sliding rollback window that maintains a large minimum and average length.

Radu Teodorescu
Jun Nakano
Josep Torrellas
University of Illinois at Urbana-Champaign

Backward recovery through checkpointing and rollback is a popular approach in modern processors to providing recovery from transient and intermittent faults.1 This approach is especially attractive when designers can implement it with very low overhead, typically using dedicated hardware support.

The many proposals for low-overhead checkpointing and rollback schemes differ in a variety of ways, including the level of the memory hierarchy checkpointed and the organization of the checkpoint relative to the active data. Schemes that checkpoint at the cache level occupy an especially attractive design point.2,3 The hardware for these schemes, which works by regularly saving the data in the cache in a checkpoint that can later serve for fault recovery, can be integrated relatively inexpensively and used efficiently in modern processor microarchitectures.

Unfortunately, cache-level checkpointing schemes such as Carer2 and Sequoia3 don't continuously support a large rollback window—the set of instructions that the processor can undo by returning to a previous checkpoint if it detects a fault. Specifically, in these schemes, immediately after the processor takes a checkpoint, the window of code that it can roll back typically is reduced to zero.

In this article, we describe the microarchitecture of a new cache-level checkpointing scheme that designers can efficiently and inexpensively implement in modern processors and that supports a large rollback window continuously. To do this, the cache keeps data from two consecutive checkpoints (or epochs) at all times. We call this approach a sliding rollback window, and we've named our scheme Swich—for Sliding-Window Cache-Level Checkpointing.

We've implemented a prototype of Swich using field-programmable gate arrays (FPGAs), and our evaluation of the prototype shows that supporting cache-level checkpointing with a sliding window is promising. For applications without frequent I/O or system calls, the technique sustains a large minimum rollback window and adds little additional logic and memory hardware to a simple processor.

Low-overhead checkpointing schemes

To help in situating Swich among the alternatives, it's useful to classify hardware-based checkpointing and rollback schemes into a taxonomy with three axes:

• the scope of the sphere of backward error recovery (BER),
• the relative location of the checkpoint and the active data (the most current versions), and
• the way the scheme separates the checkpoint from the active data.

Figure 1 shows the taxonomy.

[Figure 1. Taxonomy of hardware-based checkpointing and rollback schemes. Axes: scope of the sphere of BER (register, cache, memory); relative checkpoint location (dual, leveled); separation of checkpoint and active data (full; partial: buffering, renaming, logging).]

The sphere of BER is the part of the system that is checkpointed and, upon fault detection, rolls back to a previous state. Any scheme assumes that a datum that propagates outside the sphere is correct, because it has no mechanism to roll it back. Typically, the sphere of BER encloses the memory hierarchy up to a particular level. Consequently, we define three design points on this axis, depending on whether the sphere extends to the registers, the caches, or the main memory (Figure 2).

[Figure 2. Design points based on the scope of the sphere of backward error recovery (BER): the sphere encloses the registers (IBM G5); the registers and cache (Carer/Sequoia); or the registers, cache, and memory (ReVive/SafetyNet/COMA).]

The second axis in the taxonomy compares the level of the memory hierarchy checkpointed (M, where M is register, cache, or memory) and the level where the checkpoint is stored. This axis has two design points:4 Dual means that the two levels are the same. This includes schemes that store the checkpoint in a special hardware structure closely attached to M. Leveled means the scheme saves the checkpoint at a lower level than M—for example, a scheme might save a checkpoint of a cache in memory.


Table 1. Characteristics of some existing hardware-based checkpointing and rollback schemes.

| Scheme | Design point in the taxonomy | Hardware support | Tolerable fault detection latency | Recovery latency | Fault-free execution overhead |
|---|---|---|---|---|---|
| IBM G5 | Register, dual, full | Redundant copy of register unit (R-unit), lock-stepping units | Pipeline depth | 1,000 cycles | Negligible |
| Sparc64 (out-of-order processor) | Register, dual, renaming | Register alias table (RAT) copy and restore on branch speculation | Pipeline depth | 10 to 100 cycles | Negligible |
| Carer | Cache, dual, logging | Extra cache line state | Cache fill time | Cache invalidation | Not available |
| Sequoia | Cache, leveled, full | Cache flush | Cache fill time | Cache invalidation | Not available |
| Swich | Cache, dual, logging | Extra cache line states | Cache fill time | Cache invalidation | 1% |
| ReVive | Memory, dual, logging | Memory logging, flush cache at checkpoint | 100 ms | 0.1 to 1 s | 5% |
| SafetyNet | Memory, dual, logging | Cache and memory logging | 0.1 to 1 ms | Not available | 1% |
| Cache-Only Memory Architecture (COMA) | Memory, dual, renaming | Mirroring by cache coherence protocol | 2 to 200 ms | Not available | 5 to 35% |

The third axis, which considers how the scheme separates the checkpoint from the active data, has two design points.5 In full separation, the checkpoint is completely separated from the active data. In partial separation, the checkpoint and the active data are one and the same, except for the data that has been modified since the last checkpoint. For this data, the checkpoint keeps both the old value at the time of the last checkpoint and the most current value.

Partial separation includes the subcategories buffering, renaming, and logging. In buffering, any modification to location L accumulates in a special buffer; at the next checkpoint, the modification is applied to L. In renaming, a modification to L does not overwrite the old value but is written to a different place, and the mapping of L is updated to point to this new place; at the next checkpoint, the new mapping of L is committed. In logging, a modification to L occurs in place, but the old value is copied elsewhere; at the next checkpoint, the old value is discarded.

Table 1 lists characteristics of several existing hardware-based checkpointing schemes and where they fit in our taxonomy. The "Checkpointing scheme comparison" sidebar provides more details on the schemes listed.
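To make the logging subcategory concrete, here is a minimal C sketch of partial separation by logging, under our own simplified assumptions (a flat word-addressed memory, a log that never overflows, and names such as logged_write that are ours, not any scheme's):

```c
#define MEM_WORDS 4096
#define LOG_CAP   4096

static int memory[MEM_WORDS];   /* active data: always holds the most current values */

static struct {
    int addr;                   /* which location L was modified */
    int old;                    /* its value at the time of the last checkpoint */
} undo_log[LOG_CAP];
static int log_len;

/* A modification to L occurs in place, but the old value is copied elsewhere. */
void logged_write(int addr, int value)
{
    undo_log[log_len].addr = addr;        /* assumes LOG_CAP is never exceeded */
    undo_log[log_len].old  = memory[addr];
    log_len++;
    memory[addr] = value;
}

/* At the next checkpoint, the old values are simply discarded. */
void take_checkpoint(void) { log_len = 0; }

/* On a fault, undoing the log in reverse order restores the last checkpoint. */
void roll_back(void)
{
    while (log_len > 0) {
        log_len--;
        memory[undo_log[log_len].addr] = undo_log[log_len].old;
    }
}
```

Hardware schemes apply the same idea at the granularity of cache lines or memory blocks rather than individual words.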

30 IEEE MICRO Checkpointing scheme comparison Register-level checkpointing uses a special directory controller to log memory updates in the memory IBM’s S/390 G5 processor (in our taxonomy from Figure 1, dual; full (dual; partial separation by logging). SafetyNet logs updates to memory separation) includes a redundant copy of the register file called the R- (and caches) in special checkpoint log buffers.6 Therefore, we categorize unit, which lets the processor roll back by one instruction.1 In addition, it SafetyNet’s memory checkpointing as dual; partial separation by logging. has duplicate lock-stepping instruction and execution units, and it can Morin et al. modified the protocol of Cache-Only Mem- detect errors by comparing the signals from these units. ory Architecture (COMA) machines to ensure that at any time, every mem- Fujitsu’s Sparc64 processor leverages a mechanism present in most ory line has exactly two copies of data from the last checkpoint in two out-of-order processors: To handle branch misspeculation, processors can distinct nodes.7 Thus, the machines can recover the memory from single- discard the uncommitted state in the pipeline and resume execution from node failures (dual; partial separation by renaming). an older instruction.2 The register writes update renamed register loca- tions (dual; partial separation by renaming). For this mechanism to provide recovery, it must detect faults before the affected instruction commits. References In Sparc64, parity protects about 80 percent of the latches, and the exe- 1. T.J. Slegel et al., “IBM’s S/390 G5 Microprocessor Design,” cution units have a residue check mechanism—for instance, the Sparc64 IEEE Micro, vol. 19, no. 2, Mar.-Apr. 1999, pp. 12-23. × × checks integer multiplication using mod3(A B) = mod3[mod3(A) mod3(B)]. 2. H. Ando et al., “A 1.3GHz Fifth Generation SPARC64 Micro- processor,” IEEE J. Solid-State Circuits, vol. 38, no. 11, Nov. Cache-level checkpointing 2003, pp. 1896-1905. Carer checkpoints the cache by writing back dirty lines to memory (lev- 3. D.B. Hunt and P.N. Marinos, “General-Purpose Cache-Aided eled; full separation).3 An optimized protocol introduces a new cache line Rollback Error Recovery (CARER) Technique,” Proc. 17th Int’l state called unchangeable. Under this protocol, at a checkpoint, the Symp. Fault-Tolerant Computing Systems (FTCS 87), IEEE CS processor does not write back dirty lines; it simply write-protects them and Press, 1987, pp. 170-175. marks them as unchangeable. Execution continues, and the processor 4. P.A. Bernstein, “Sequoia: A Fault-Tolerant Tightly Coupled lazily writes these lines back to memory either when they need to be Multiprocessor for Transaction Processing,” Computer, vol. updated or when space is needed (dual; partial separation by logging). 21, no. 2, Feb. 1988, pp. 37-45. For this optimization, the cache needs some forward error recovery (FER) 5. M. Prvulovic, Z. Zhang, and J. Torrellas, “ReVive: Cost-Effective protection such as error-correcting code (ECC), because the unchange- Architectural Support for Rollback Recovery in Shared-Memory able lines in the cache are part of a checkpoint. Multiprocessors,” Proc. 29th Ann. Int’l Symp. Computer Archi- Sequoia flushes all dirty cache lines to the memory when the cache tecture (ISCA 02), IEEE CS Press, 2002, pp. 111-122. overflows or when an I/O operation is initiated.4 Thus, the memory con- 6. D.J. Sorin et al., “SafetyNet: Improving the Availability of tains the checkpointed state (leveled; full separation). 
Multiprocessors with Global Checkpoint/ Our proposed mechanism, Swich, is similar to Carer in that it can hold Recovery,” Proc. 29th Ann. Int’l Symp. Computer Architec- checkpointed data in the cache (dual; partial separation by logging). Unlike ture (ISCA 02), IEEE CS Press, 2002, pp. 123-134. Carer, Swich holds two checkpoints simultaneously in the cache. Thus, Swich 7. C. Morin et al., “COMA: An Opportunity for Building Fault-Tol- makes a large rollback window available continuously during execution. erant Scalable Shared Memory Multiprocessors,” Proc. 23rd Ann. Int’l Symp. Computer Architecture (ISCA 96), IEEE CS Memory-level checkpointing Press, 1996, pp. 56-65. ReVive flushes dirty lines in the cache to memory at checkpoints.5 It also
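As a software analogy for the Sparc64 residue check described in the sidebar (our own illustration; in the real processor the check is performed by small dedicated logic alongside the multiplier):

```c
#include <stdbool.h>
#include <stdint.h>

/* Residue check for integer multiplication:
   a fault-free product satisfies mod3(A * B) == mod3(mod3(A) * mod3(B)). */
bool multiply_checked(uint32_t a, uint32_t b, uint64_t *product)
{
    *product = (uint64_t)a * (uint64_t)b;            /* full product, as the multiplier computes */
    uint32_t expected = ((a % 3u) * (b % 3u)) % 3u;  /* cheap, independent residue path */
    return (uint32_t)(*product % 3u) == expected;    /* mismatch signals a detected fault */
}
```

The residue path is far smaller than the multiplier itself, which is what makes this style of checking inexpensive.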

Trade-offs

The design points in our taxonomy in Figure 1 correspond to different trade-offs among fault detection latency, execution overhead, recovery latency, space overhead, type of faults supported, and hardware required. Figure 3 illustrates the trade-offs.

Consider the maximum tolerable fault detection latency. Moving from register-, to cache-, to memory-level checkpointing, the tolerable fault detection latency increases. This is because the corresponding checkpoints increase in size and can thus hold more execution history. Consequently, the time between when the error occurs and when it is detected can be larger while still allowing system recovery.

The fault-free execution overhead has an interesting shape. In register-level checkpointing, the overhead is negligible because these schemes save at most a few hundred bytes (for example, a few registers or the register alias table) and efficiently copy them nearby in hardware. At the other levels, a checkpoint has significant overhead, because the processors copy larger chunks of data over longer distances. An individual checkpoint's overhead tends to increase moving from cache- to memory-level checkpointing. However, designers tune the different designs' checkpointing frequencies to keep the overall execution time overhead tolerable—about 5 percent, as shown in Figure 3.

[Figure 3. Trade-offs for design points in the checkpointing taxonomy. Moving the sphere of BER from the register level to the memory level, the tolerable fault detection latency grows from about 10 cycles to about 100 ms, the execution overhead from about 0 to about 5 percent, the recovery latency from 10-1,000 cycles to 0.1-1 s, the space overhead from about 100 bytes to Mbytes-Gbytes, and the number of recoverable faults from several to many; dual or partial-separation designs generally cost less than leveled or full-separation designs.]


[Figure 4. Evolution of the rollback window length with time for Sequoia and Carer (a), where checkpoints occur at cache fills and the minimum window length falls to zero, and for Swich (b), where checkpoints occur at half-cache fills and a large minimum window length is maintained.]

The designs with relatively more expensive individual checkpoints take less frequent checkpoints. As a result, both their recovery latency and their space overhead increase along the axis from cache- to memory-level checkpoints. The trend is the same moving from dual systems (solid lines in Figure 3) to leveled systems (dashed lines) because individual checkpoints in leveled systems involve moving data between levels and, therefore, are more expensive. A similar effect also occurs as we move from partial-separation (solid lines in Figure 3) to full-separation designs (dashed lines), which require more data copying.

Finally, moving from register-, to cache-, to memory-level checkpointing, the number of recoverable faults increases. The reason is twofold: First, these schemes support more faults of a given type, thanks to the longer detection latencies tolerated. Second, they support more types of faults (for example, faults that occur at the new hierarchy level). However, recovering from these faults requires additional hardware support, which is more expensive because it is spread more widely throughout the machine.

Motivation for the Swich architecture

Clearly, the different checkpointing schemes address a large set of trade-offs; our main focus in this article is cache-level checkpointing schemes, with their advantages and drawbacks.

On the positive side, trends in technology are making the cache level more attractive. More on-chip integration means the possibility of larger caches, which let cache-level checkpointing offer increased fault detection latencies. In addition, the hardware for cache-level checkpointing is relatively inexpensive and functions efficiently in modern architectures. Specifically, the primary hardware these schemes require is new cache line states to mark lines related to the checkpoint. Modification of the cache replacement algorithm might be necessary to encourage these lines to remain cached. Cache-level checkpointing schemes also need to change the state of groups of cache lines at commit and rollback points. Schemes can support this by accessing the cache tags one at a time to change each concerned tag individually, or by adding hardware signals that modify all the concerned tags in one shot. Finally, processors using cache-level checkpointing must save registers efficiently at a checkpoint and restore them at a rollback—similarly to the way processors handle branch speculation.

On the negative side, it is difficult to control the checkpointing frequency in cache-level checkpointing schemes because the displacement of a dirty cache line triggers a checkpoint.6 Such displacements are highly dependent on the application and the cache organization. Thus, a system's performance with a cache-level checkpointing scheme is relatively unpredictable.

Another problem of existing implementations of cache-level checkpointing2,3 is that the length of the code that they can roll back (the rollback window) is at times very small. Specifically, suppose a fault occurs immediately before a checkpoint and is detected only after the checkpoint has completed (when the data from the previous checkpoint has been discarded). At this point, the amount of code that can be rolled back is very small, or even zero. Figure 4a shows how the rollback window length evolves in existing cache-level checkpointing schemes. Right after each checkpoint, the window empties completely.

Sliding-window cache-level checkpointing

Swich addresses the problem of the empty rollback window by providing a sliding rollback window, increasing the chances that the rollback window always has a significant minimum length. The basic idea is to always keep in the cache the state generated by two uncommitted checkpoint intervals, which we call speculative epochs. Each epoch lasts as long as it takes to generate a dirty-data footprint of about half the cache size. When the cache is getting full of speculative data, the older epoch is committed and a newer one starts. With this approach, the evolution of the rollback window length is closer to that illustrated in Figure 4b. When rollback is necessary, the scheme rolls back both speculative epochs. In the worst case (when rollback is necessary immediately after a checkpoint), the system can roll back at least one whole speculative epoch. Previously proposed schemes have a worst-case scenario of an empty rollback window.

Checkpoint, commit, and rollback

We extend each line in the cache with two epoch bits, E1E0. If the line contains nonspeculative (that is, committed) data, E1E0 are 00; if it contains data generated by one of the two active speculative epochs, E1E0 are 01 or 10. Only dirty lines (those with the Dirty bit set) can have E1E0 other than 00.

A checkpoint consists of three steps: committing the older speculative epoch, saving the current register file and processor state, and starting a new speculative epoch with the same E1E0 identifier as the one just committed. Committing a speculative epoch involves discarding its saved register file and processor state and clearing the epoch bits of all the cache lines whose E1E0 bits are equal to the epoch's identifier. It is preferable to clear such bits all in one shot, with a hardware signal connected to all the tags that takes no more than a handful of cycles to actuate.

On a fault, the two speculative epochs roll back. This involves restoring the older of the two saved register files and processor states, invalidating all the cache lines whose E1E0 bits are not 00, and clearing all the E1E0 bits. A one-shot hardware signal is sufficient for the operations on the E1E0 bits. However, because rollback is probably infrequent, these operations can also use slower hardware that walks the cache tags and individually sets the bits of the relevant tags. For a write-through L1 cache, it is simpler to invalidate the entire cache.

Speculative line management

Swich manages the speculative versions of lines, namely those belonging to the speculative epochs, following two invariants. First, only writes can create speculative versions of lines. Second, the cache can contain only a single version of a given line. Such a version can be nonspeculative or speculative with a given E1E0 identifier. In the latter case, the line is necessarily dirty and cannot be displaced from the cache.

Following the first invariant, a processor read simply returns the data currently in the cache and does not change its line's E1E0 identifier. If the read misses in the cache, the line is brought from memory, and E1E0 are set to 00.

The second invariant determines the actions on a write. There are four cases: First, if the write misses, the line is brought from memory, it is updated in the cache, and its E1E0 bits are set to the epoch identifier. Second, if the write hits on a nonspeculative line in the cache, the line is first written back to memory (if dirty), then it is updated in the cache, and finally its E1E0 bits are set to the epoch identifier. In these two cases, a count of the number of lines belonging to this epoch (SpecCnt) is incremented. Third, if the write hits on a speculative line of the same epoch, the update proceeds normally. Finally, if the write from current epoch i hits on a speculative line from the other epoch (i − 1), then epoch i − 1 is committed, and the write initiates a new epoch, i + 1. The line is written back to memory, then updated in the cache, and finally its E1E0 bits are set to the new epoch identifier. When the SpecCnt count of a given epoch (say, epoch i) reaches the equivalent of half the cache size, epoch i − 1 is committed, and epoch i + 1 begins.

Recall that speculative lines cannot be displaced from the cache. Consequently, if space is needed in a cache set, the algorithm first chooses a nonspeculative line as its victim.
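The four write cases and the SpecCnt bookkeeping can be summarized in software. The following C sketch is our own simplified model under stated assumptions: the line has already been located or allocated in its set, writebacks and the one-shot clearing of epoch bits are reduced to comments, and names such as handle_write and spec_cnt are ours, not the hardware's. Victim selection and the set-level corner cases discussed in the following paragraphs are omitted.

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINES 1024
#define HALF_CACHE  (CACHE_LINES / 2)

typedef struct {
    bool    valid, dirty;
    uint8_t epoch;              /* E1E0 bits: 0 = committed, 1 or 2 = speculative epoch id */
} line_t;

static uint8_t cur_epoch = 1;   /* identifier of the current speculative epoch */
static int     spec_cnt[3];     /* SpecCnt per epoch identifier (indices 1 and 2 used) */

static uint8_t other(uint8_t e) { return (uint8_t)(3 - e); }

/* Checkpoint: commit the older epoch and start a new one that reuses
   the just-committed E1E0 identifier. */
static void commit_older_and_advance(void)
{
    uint8_t old = other(cur_epoch);
    /* discard the older saved register file and processor state; clear the
       E1E0 bits of all lines whose epoch == old (one-shot signal or tag walk);
       save the current register file and processor state */
    spec_cnt[old] = 0;
    cur_epoch = old;            /* the new epoch reuses the committed identifier */
}

void handle_write(line_t *line)
{
    if (!line->valid) {                           /* case 1: write miss */
        /* bring the line from memory */
        line->valid = true;
    } else if (line->epoch == cur_epoch) {        /* case 3: hit, same epoch */
        line->dirty = true;                       /* update proceeds normally */
        return;
    } else if (line->epoch != 0) {                /* case 4: hit, other epoch */
        commit_older_and_advance();               /* commit epoch i-1, start epoch i+1 */
        /* write the old version back to memory */
    } else if (line->dirty) {                     /* case 2: hit, nonspeculative dirty */
        /* write the committed version back to memory first */
    }
    line->dirty = true;                           /* update the data in the cache */
    line->epoch = cur_epoch;                      /* line joins the current epoch; we also
                                                     count it in case 4, which the article
                                                     leaves implicit */
    if (++spec_cnt[cur_epoch] >= HALF_CACHE)      /* dirty footprint reached half the cache */
        commit_older_and_advance();               /* commit epoch i-1, begin epoch i+1 */
}
```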


If all lines in the set are speculative, Swich commits the older speculative epoch (say, epoch i − 1) and starts a new epoch (epoch i + 1). Then, the algorithm chooses one of the lines just committed as a victim for displacement.

A fourth condition that also leads to an epoch commitment occurs when the current epoch (say, epoch i) is about to use all the lines of a given cache set. If the scheme allows it to do so, the sliding window becomes vulnerable: If the epoch later needs an additional line in the same set, the hardware would have to commit i (and i − 1), decreasing the sliding window size to zero. To avoid this, when an epoch is about to use all the lines in a set, the hardware first commits the previous epoch (i − 1) and then starts a new epoch with the access that requires a new line.

Finally, in the current design, when the program performs I/O or issues a system call that has side effects beyond generating cached state, the two speculative epochs are committed. At that point, the size of the sliding window falls to zero.

Two-level cache hierarchies

In a two-level cache hierarchy, the algorithm we have described is implemented in the L2 cache. The L1 cache can use a simpler algorithm and does not need the E1E0 bits per line. For example, consider a write-through, no-allocate L1. On a read, L1 provides the data that it has or that it obtains from L2. On a write hit, L1 is updated, and both the update and epoch bits are propagated to L2. On a commit, no action is taken, whereas in the rare case of a rollback, all of L1 is invalidated.

Multiprocessor issues

Swich is compatible with existing techniques that support cache-level checkpointing in a shared-memory multiprocessor.7,8 For example, assume we use the approach of Wu et al.8 In this case, when a cache must supply a speculative dirty line to another cache, the source cache must commit the epoch that owns the line to ensure that the line's data never rolls back to a previous value. The additional advantage of Swich is that the source cache needs to commit only a single speculative epoch, if the line provided belonged to its older speculative epoch. In this case, the rollback window length in the source cache does not fall to zero. A similar operation occurs when a cache receives an invalidation for a speculative dirty line. The epoch commits, and the line is provided and then invalidated. Invalidations to all other types of lines proceed as usual. A similar discussion applies to other existing techniques for shared-memory multiprocessor checkpointing.

Swich prototype

We implemented a hardware prototype of Swich using FPGAs. As the base for our implementation, we used Leon2, a 32-bit processor core compliant with the Sparc V8 architecture (http://www.gaisler.com). Leon2 is in synthesizable VHDL format. It has an in-order, single-issue, five-stage pipeline. Most instructions take five cycles to complete if no stalls occur. The processor has a windowed register file, and the instruction cache is 4 Kbytes. The data cache size, associativity, and line size are configurable. In our experiments, we set the line size to 32 bytes and the associativity to 8, and we varied the cache size. Because the processor initially had a write-through data cache, we implemented a write-back data cache controller.

We extended the processor in two major ways: with cache support for buffering speculative data and rollback, and with register support for checkpointing and rollback. We also leveraged, in part, the rollback mechanism presented elsewhere.9

Data cache extensions

Although the prototype implements most of the design we described earlier, there are a few differences. First, the prototype has a single-level data cache. Second, for simplicity, we allowed a speculative epoch to own at most half the lines in any cache set. Third, the prototype doesn't perform epoch commit and rollback using a one-shot hardware signal. Instead, we designed a hardware state machine that walks the cache tags, performing the operations on the relevant cache lines. The reason for this design is that we implemented the cache with the synchronous RAM blocks present in the Xilinx Virtex II series of FPGAs. Such memory structures have only the ordinary address and data inputs; they are difficult to modify to incorporate additional control signals, such as those needed to support one-shot commit and rollback signals.

We extended the cache controller with a cache walk state machine (CWSM) that traverses the cache tags and clears the corresponding epoch bits (in a commit) and valid bits (in a rollback). A commit or rollback signal activates the CWSM. At that point, the pipeline stalls and the CWSM walks the cache tags. The CWSM has three states, shown in Figure 5a: idle, walk, and restore. In idle, the CWSM is inactive. In walk, its main working state, the CWSM can operate on one or both of the speculative epochs. The traversal takes one cycle for each line in the cache. The restore state resumes the normal cache controller operation and releases the pipeline.

When operating system code is executed, we conservatively assume that it cannot be rolled back and, therefore, we commit both speculative epochs and execute the system code nonspeculatively. A more careful implementation would identify the sections of system code that have no side effects beyond memory updates, and allow their execution to remain speculative.

[Figure 5. Swich state machines: cache walk state machine with states idle, walk, and restore (a); and register checkpointing state machine with states idle, checkpoint, rollback, and restore (b).]

Register extensions

Swich performs register checkpointing and restoration using two shadow register files, SRF0 and SRF1. Each of these memory structures is identical to the main register file. When a checkpoint is to be taken, the pipeline stalls and passes control to the register checkpointing state machine (RCSM), which has the four states shown in Figure 5b.

The RCSM is in the idle state while the pipeline executes normally. When a register checkpoint must be taken, it transitions to the checkpoint state. In this state, the RCSM copies the valid registers in the main register file to one of the SRFs. The register files are implemented in SRAM and have two read ports and one write port. This means that the RCSM can copy only one register per cycle. Thus, the checkpoint stage takes about as many cycles as there are valid registers in the register file. The rollback state is activated when the pipeline receives a rollback signal. In this state, the RCSM restores the contents of the register file from the checkpoint. Similarly, this takes about as many cycles as there are valid registers. The restore state reinitializes the pipeline.
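As an illustration of the tag walk, the following C sketch models the CWSM's walk state in software, processing one tag per iteration (one cycle in hardware). The structure and the names cwsm_walk and tag_t are our own simplified assumptions, not the prototype's VHDL:

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_LINES 1024              /* assumed number of cache lines */

typedef struct {
    bool    valid;
    bool    dirty;
    uint8_t epoch;                  /* E1E0 bits: 0 = nonspeculative, 1 or 2 = epoch id */
} tag_t;

static tag_t tags[NUM_LINES];

/* Walk state of the CWSM: visit every tag, one per cycle.
   On a commit, clear the epoch bits of lines in the selected epoch(s).
   On a rollback, additionally invalidate those lines, since their
   speculative data must not survive. */
void cwsm_walk(uint8_t epoch_mask, bool is_rollback)
{
    for (int i = 0; i < NUM_LINES; i++) {
        tag_t *t = &tags[i];
        if (t->epoch == 0 || (t->epoch & epoch_mask) == 0)
            continue;               /* line not in a selected speculative epoch */
        if (is_rollback)
            t->valid = false;       /* discard speculative data */
        t->epoch = 0;               /* clear the E1E0 bits */
    }
    /* restore state: resume normal cache controller operation and
       release the stalled pipeline */
}
```

In this model, committing epoch 01 corresponds to cwsm_walk(0x1, false), and rolling back both epochs to cwsm_walk(0x3, true).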
tem code access patterns to give us a sense of


Table 2. Applications evaluated using Swich.

| Application | Description |
|---|---|
| hennessy | Collection of small, computation-intensive applications; little or no system code |
| polymorph | Converts Windows-style file names to Unix; moderate system code |
| memtest | Microbenchmark that simulates large array traversals; no system code |
| sort | Linux utility that sorts lines in text files; system-code intensive |
| ncompress | Compression and decompression utility; moderate system code |
| ps | Linux status command; very system-code intensive |

[Figure 6. Amount of logic consumed by different designs and components: hardware overhead in CLB slices for the Sequoia-like system and for Swich at data cache sizes of 4 to 64 Kbytes; each bar breaks down into the base processor (base), register checkpointing (reg ckpt), and speculative cache support (spec cache).]

Results

We have not yet accurately measured Swich's execution overhead or recovery latency because of the limitations of our implementation—specifically, that we implemented epoch commit and rollback with inefficient cache tag walking rather than with an efficient one-shot hardware signal. This problem slows down the Swich FPGA prototype, but it would not be present in a CMOS implementation.

Consequently, we focused our evaluation on measuring Swich's hardware overhead and rollback window size, and on roughly estimating Swich's performance overhead and recovery latency using a simple model.

Hardware overhead

For each of Swich's two components—the register checkpointing mechanism and the data cache support for speculative data—we measured the hardware overhead in both logic and memory structures. Our measurements considered only the processor core and caches, not the other on-chip modules such as the PCI and memory controllers. As a reference, we also measured the overheads of a cache checkpointing scheme with one speculative epoch at a time (a system similar to Sequoia3).

Figure 6 shows the logic consumed, measured in number of configurable logic block (CLB) slices. CLBs are the FPGA blocks used to build combinational and synchronous logic components. A CLB comprises four similar slices; each slice includes two four-input function generators, carry logic, arithmetic logic gates, wide-function multiplexers, and two storage elements.

36 IEEE MICRO Sequoia Swich 50

45 base reg ckpt 40 spec cache 35 30 25 20 15 10 Hardware overhead (RAM blocks) 5 0 4 8 16 32 64 4 8 16 32 64 Data cache size (Kbytes)

Figure 7. Amount of RAM consumed by the different designs and components. gates, wide-function multiplexers, and two again very similar to that of the Sequoia-like storage elements. Figure 6 graphs hardware system. The Swich additions are primarily overhead for the Sequoia-like system and the additional register checkpoint storage and Swich system, each with several different data additional state bits in the cache tags. cache sizes (ranging from 4 Kbytes to 64 Kbytes). Each bar shows the CLBs consumed Size of the rollback window by the logic to support speculative data in the We particularly wanted to determine the cache (spec cache), register checkpointing (reg size of the Swich rollback window in number ckpt), and the original processor (base). of dynamic instructions. This number deter- With Swich, the logic overhead of the com- mines how far the system can roll back at a bined extensions was only 6.5 percent on aver- given time if it detects a fault. We were espe- age and relatively constant across the range of cially interested in the minimum points in caches we evaluated; the support we propose Figure 4b, which represent the cases when does not use much logic. In addition, the only the shortest rollback windows are avail- number of CLBs used for Swich is only about able. Figure 4b shows an ideal case. In prac- 2 percent higher than for the Sequoia-like sys- tice, the minima aren’t all at the same value. tem. In other words, the difference in logic The actual values of the minimum points between a simple window and a sliding win- depend on the code being executed: the actu- dow is small. al memory footprint of the code and the pres- Figure 7 shows the memory consumed, ence of system code. In our current measured in number of SelectRAM memory implementation, execution of system code blocks. These blocks, used to implement the causes the commit of both speculative epochs caches, register files, and other structures, are and, therefore, induces a minimum in the roll- 18-Kbit, dual-port RAM blocks with two back window. independently clocked and independently For our experiments, we measured the win- controlled synchronous ports that access a dow size at each of the minima and took per- common storage area. The overhead in mem- application averages. Figure 8 shows the ory blocks of the combined Swich extensions average window size at minimum points for is again small for all the caches that we evalu- the different applications and cache sizes rang- ated, indicating that the support we propose ing from 8 Kbytes to 64 Kbytes. The same consumes few hardware resources. In addi- information for the Sequoia and Carer systems tion, Swich’s overhead in memory blocks is would show rollback windows of size zero.


[Figure 8. Minimum rollback window size in Swich, per-application averages (thousands of dynamic instructions), for hennessy, polymorph, memtest, sort, ncompress, and ps at data cache sizes of 8 to 64 Kbytes.]

As Figure 8 shows, the hennessy microbenchmark has the largest minimum rollback window. The application is computationally intensive, has little system code, and has a relatively small memory footprint. As the cache grows, the window size increases. For 8- to 64-Kbyte caches, the minimum rollback window ranges from 120,000 to 180,000 instructions. The polymorph and memtest applications have more system code and a larger memory footprint, respectively. Their rollback windows are smaller, although they increase substantially with the cache size, to reach 120,000 and 80,000 instructions, respectively, for the 64-Kbyte cache. The sort, ncompress, and ps applications are system-code-intensive applications with short rollback windows, regardless of the cache size. One way to increase their minimum rollback windows is to identify the sections of system code that have no side effects beyond memory updates and allow their execution to remain speculative.

For completeness, we also measured Swich's average rollback window size and compared it to that of the implementation similar to Sequoia and Carer. Again, we looked at four different data cache sizes and six applications; Figure 9 shows the results. Except in system-code-intensive applications, Swich's average rollback window is significantly larger than that of Sequoia-Carer. The reason, as Figure 8 shows, is that for Swich the rollback window has a nonzero average minimum value. For ps and the other two system-intensive applications, there is little difference between the two schemes. Overall, at least for applications without frequent system code, Swich supports a large minimum rollback window.

Estimating performance overhead and recovery latency

A rough way of estimating Swich's performance overhead is to compute the ratio between the time needed to generate a checkpoint and the duration of the execution between checkpoints. This approach, of course, neglects any overheads that the scheme might induce between checkpoints. To model worst-case conditions, we used the average size of the rollback window at minimum points (Figure 8) as the number of instructions between checkpoints. Across all the applications, this is about 60,000 instructions for 64-Kbyte caches. If we assume an average of one cycle per instruction, this corresponds to 60,000 cycles. On the other hand, a checkpoint can very conservatively take about 256 cycles, because it involves saving registers and asserting a one-shot hardware signal that modifies cache tag bits. Overall, the ratio results in a 0.4 percent overhead.
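Written out with the numbers just given (a conservative 256-cycle checkpoint and roughly 60,000 cycles between checkpoints), the estimate is:

\[
\text{overhead} \approx \frac{T_{\text{checkpoint}}}{T_{\text{between checkpoints}}} = \frac{256~\text{cycles}}{60{,}000~\text{cycles}} \approx 0.4~\text{percent}.
\]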

[Figure 9. Comparison of average rollback window size (thousands of dynamic instructions) for an implementation similar to Sequoia and Carer and for Swich, for the six applications at data cache sizes of 8 to 64 Kbytes.]

We estimated the recovery latency by assuming 256 cycles to both restore the registers and assert a one-shot hardware signal that modifies cache tag bits. In addition, as the processor reexecutes, it must reload into the cache all the dirty speculative lines that were invalidated during rollback. Overall, the combined overhead of both effects is likely to be in the milliseconds at most. Given the low frequency of rollbacks, we believe this is tolerable.

Cache-level checkpointing provides an efficient recovery mechanism that can be implemented inexpensively in modern processors. This work shows that maintaining two speculative epochs at all times to form a sliding rollback window lets Swich support a large minimum (and average) rollback window. Moreover, our FPGA prototype showed that the added hardware is minimal and that it can be well integrated with existing processor microarchitectures. We expect that supporting Swich will significantly help processors recover from transient and intermittent faults, as well as from software errors that have low detection latency. We are currently examining the behavior of Swich under a wider range of workloads. We are also examining how Swich can be supported while the processor executes system code.

Acknowledgments

Partial support for this work came from the National Science Foundation under grants EIA-0072102, EIA-0103610, CHE-0121357, and CCR-0325603; from DARPA under grant NBCH30390004; from the US Department of Energy under grant B347886; and from IBM and Intel.

References

1. D.P. Siewiorek and R.S. Swarz, Reliable Computer Systems: Design and Evaluation, 3rd ed., AK Peters, 1998.
2. D.B. Hunt and P.N. Marinos, "General-Purpose Cache-Aided Rollback Error Recovery (CARER) Technique," Proc. 17th Int'l Symp. Fault-Tolerant Computing Systems (FTCS 87), IEEE CS Press, 1987, pp. 170-175.
3. P.A. Bernstein, "Sequoia: A Fault-Tolerant Tightly Coupled Multiprocessor for Transaction Processing," Computer, vol. 21, no. 2, Feb. 1988, pp. 37-45.
4. N.S. Bowen and D.K. Pradhan, "Processor- and Memory-Based Checkpoint and Rollback Recovery," Computer, vol. 26, no. 2, Feb. 1993, pp. 22-31.
5. M. Prvulovic, Z. Zhang, and J. Torrellas, "ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors," Proc. 29th Ann. Int'l Symp. Computer Architecture (ISCA 02), IEEE CS Press, 2002, pp. 111-122.
6. B. Janssens and W.K. Fuchs, "The Performance of Cache-Based Error Recovery in Multiprocessors," IEEE Trans. Parallel and Distributed Systems, vol. 5, no. 10, Oct. 1994, pp. 1033-1043.


7. R.E. Ahmed, R.C. Frazier, and P.N. Marinos, "Cache-Aided Rollback Error Recovery (CARER) Algorithms for Shared-Memory Multiprocessor Systems," Proc. 20th Int'l Symp. Fault-Tolerant Computing (FTCS 90), IEEE CS Press, 1990, pp. 82-88.
8. K.L. Wu, W.K. Fuchs, and J.H. Patel, "Error Recovery in Shared-Memory Multiprocessors Using Private Caches," IEEE Trans. Parallel and Distributed Systems, vol. 1, no. 2, Apr. 1990, pp. 231-240.
9. R. Teodorescu and J. Torrellas, "Prototyping Architectural Support for Program Rollback Using FPGAs," Proc. IEEE Symp. Field-Programmable Custom Computing Machines (FCCM 05), IEEE CS Press, 2005, pp. 23-32.

Radu Teodorescu is a PhD candidate in the Computer Science Department of the University of Illinois at Urbana-Champaign. His research interests include processor design for reliability, debuggability, and variability tolerance. Teodorescu has a BS in computer science from the Technical University of Cluj-Napoca, Romania, and an MS in computer science from the University of Illinois at Urbana-Champaign. He is a student member of the IEEE.

Jun Nakano is an Advisory IT Specialist at IBM Japan. His research interests include reliability in computer architecture and variability in semiconductor manufacturing. Nakano has an MS in physics from the University of Tokyo and a PhD in computer science from the University of Illinois at Urbana-Champaign. He is a member of the IEEE and the ACM.

Josep Torrellas is a professor of computer science and Willett Faculty Scholar at the University of Illinois at Urbana-Champaign. He also chairs the IEEE Technical Committee on Computer Architecture (TCCA). His research interests include multiprocessor computer architecture, thread-level speculation, low-power design, reliability, and debuggability. Torrellas has a PhD in electrical engineering from Stanford University. He is an IEEE Fellow and a member of the ACM.

Direct questions and comments about this article to Radu Teodorescu, Siebel Center for Computer Science, Room 4238, 201 North Goodwin Ave., Urbana, IL 61801; [email protected].

For further information on this or any other computing topic, visit our Digital Library at http://www.computer.org/publications/dlib.
