A One-Cycle Correction Error-Resilient Flip-Flop for Variation-Tolerant Designs on an FPGA
Electronics 2020, 9, 633; doi:10.3390/electronics9040633; www.mdpi.com/journal/electronics

Article

Dam Minh Tung †, Nguyen Van Toan † and Jeong-Gun Lee *,†

E-SoC Lab/Smart Computing Lab, Department of Computer Engineering, Hallym University, Chuncheon 24252, Korea; [email protected] (D.M.T.); [email protected] (N.V.T.)
* Correspondence: [email protected]; Tel.: +82-33-248-2312
† Current address: 1 Hallymdaehakgil, Chuncheon, Gangwon, Korea.

Received: 13 March 2020; Accepted: 6 April 2020; Published: 10 April 2020

Abstract: Timing error resilience (TER) is one of the most promising approaches for eliminating the design margins required by process, voltage, and temperature (PVT) variations. However, traditional TER circuits have typically been designed for application-specific integrated circuits (ASICs), where customized circuits and transistor-level metastability detectors are possible. Such designs are difficult to implement on a field-programmable gate array (FPGA) because of its predefined look-up table (LUT) structure and irregular wiring. In this paper, we propose an error detection and correction flip-flop (EDACFF) for an FPGA, in which the metastability issue is resolved by imposing proper timing constraints on the circuit structures. The proposed EDACFF uses a transition detector to detect a timing error and a data correction latch to correct the error with a one-cycle performance penalty. To verify its functional and timing behavior, the EDACFF is implemented in a 3-bit counter circuit employing a 5-stage pipeline on a Spartan-6 FPGA device (XC6SLX45). The measurement results show that the proposed design achieves 32% lower power consumption and 42% higher performance than a traditional worst-case design.

Keywords: field-programmable gate array; error detection and correction; PVT variations; timing error resilience; metastability

1. Introduction

To develop reliable circuits, a traditional synchronous circuit must have a large timing margin to ensure correct operation under worst-case timing conditions. That is, an appropriate timing margin is added to the clock period to cover the worst-case circuit propagation delays. However, for most of the circuit's operating time, the worst-case timing margin is not fully used, since the worst case rarely happens in practice. Therefore, the worst-case timing margin causes throughput loss and lower energy efficiency in typical- or best-case conditions.

To minimize timing margins, many techniques have been proposed for tolerating the timing errors that occur in a circuit operating with a minimal margin. With the help of these techniques, timing margins can be reduced significantly. Generally, the techniques can be categorized into two groups: timing error prediction (TEP), and error detection and correction (EDAC).

TEP circuits [1–6] predict a potential error by monitoring data signals. They flag a warning signal whenever the delayed data signals enter an erroneous timing zone that is defined relative to the clock signal. Then, the designer can adjust the supply voltage or clock frequency to ensure correct operation at the edge of a predicted failure. As a result, the main flip-flop (FF) always captures the correct data and does not need any correction. However, with this technique, the timing margin can be reduced only as far as still leaves enough margin for the correct operation of the main FF. In contrast, EDAC techniques [7–15] detect an actual timing error by monitoring critical paths for late-arriving data transitions. They then use extra correction circuits to correct the error that has actually occurred.
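The difference between the two groups can be illustrated with a small behavioral sketch. It is not taken from the paper: the function name `classify_arrival`, the guard band `guard_ns`, and the detection window `detect_ns` are hypothetical, setup and hold times are ignored, and the two techniques are merged into one classifier only for comparison. A TEP circuit would warn when data arrives inside a guard band just before the capturing clock edge, whereas an EDAC circuit would flag data that arrives after the edge but inside its detection window.

```python
from enum import Enum


class Outcome(Enum):
    SAFE = "safe"              # data settles before the guard band starts
    TEP_WARNING = "warning"    # data settles inside the guard band (predicted error)
    EDAC_ERROR = "error"       # data settles after the edge, inside the detection window
    UNDETECTED = "undetected"  # data settles after the detection window has closed


def classify_arrival(arrival_ns, period_ns, guard_ns, detect_ns):
    """Classify a data arrival time measured from the launching clock edge.

    All parameters are hypothetical values in nanoseconds:
    period_ns is the clock period, guard_ns the TEP guard band before the
    capturing edge, and detect_ns the EDAC detection window after it.
    """
    if arrival_ns <= period_ns - guard_ns:
        return Outcome.SAFE
    if arrival_ns <= period_ns:
        return Outcome.TEP_WARNING
    if arrival_ns <= period_ns + detect_ns:
        return Outcome.EDAC_ERROR
    return Outcome.UNDETECTED


if __name__ == "__main__":
    # Illustrative numbers only: 10 ns period, 1 ns guard band, 2 ns window.
    for arrival in (7.0, 9.5, 11.0, 13.0):
        print(arrival, classify_arrival(arrival, 10.0, 1.0, 2.0).value)
```

In this simplified view, a TEP-based design must keep every arrival at or before the clock edge, so the guard band itself remains as margin, while an EDAC-based design may let arrivals slip past the edge as long as they stay inside the detection window and are corrected afterwards.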
In summary, a timing error resilience technique offers higher performance and energy efficiency at a minor area overhead, thanks to the reduced timing margin, as long as no errors occur with this margin. However, as more timing errors occur, the benefits are gradually reduced by the correction overhead. The term “performance” in this paper is defined as average-case timing performance. The average-case timing performance is higher than the worst-case timing performance since circuits work with a clock frequency optimized for typical operating conditions. The worst-case PVT operating conditions that can induce errors happen rarely.

Generally, most EDAC circuits are designed in a custom design style and implemented in an ASIC. Their physical designs must be optimized to satisfy certain timing constraints. Moreover, porting them to a new technology requires redesign effort, which increases design time and cost. On the other hand, in an FPGA-based design, it is difficult to implement these designs due to the predefined circuit structures and the non-customizable place-and-route (P&R) in FPGAs. For instance, the design of a metastability detector in FPGAs is a critical problem. In addition, replacing a traditional FF with a latch in a datapath causes a timing closure problem in FPGAs, because it is difficult for a latch-based design to meet timing closure with commercial timing tools. Therefore, it is not recommended to replace an FF with a latch in FPGA designs.

In this paper, for variation-tolerant designs on an FPGA, we propose a metastability-immune error detection and correction flip-flop (EDACFF) that works with a one-clock-cycle penalty. The metastability problem is resolved by imposing a proper timing constraint on the design. Our proposed EDACFF is fully supported by standard cells and is based on the traditional FF.
Therefore, it is compatible with commercial synthesis tools for FPGA circuit design. Consequently, it can be ported easily to other process technologies with much less design effort than other timing error resilience techniques.

The remainder of this paper is organized as follows. In Section 2, we discuss related work on EDAC circuits. Next, in Section 3, we propose the EDACFF and also consider the metastability issue. Section 4 shows the testing structure for verifying the functional correctness of the proposed EDACFF. Section 5 provides the simulation and measurement results from the implementation of the proposed EDACFF on a Spartan-6 FPGA device (XC6SLX45). Section 6 discusses the presented experimental results and possible future work. Finally, Section 7 concludes the paper.

2. Related Works

In general, traditional EDAC approaches have used ASIC-style implementations. They can be grouped into two categories: (a) FF-based designs and (b) latch-based designs.

(a) FF-based Designs [7–11]: Razor [7] detects an error and re-executes the computation to recover the correct results at a reduced clock rate, with some minor performance degradation. It includes a main FF, a shadow latch, a multiplexer, and an XOR gate. The XOR gate compares the output of the main FF, which samples data at the rising clock edge, with that of the shadow latch, which is clocked by a delayed clock. Since the output of the main FF can be metastable, the output of the XOR gate can be metastable too. Therefore, Razor needs a metastability detector at the output of the FF to guarantee a stable output after detection for a reliable design. In [8], Razor-Lite, a light-weight error detection register using virtual supply rails, incurs a small area overhead since it requires only eight extra transistors along with a traditional FF.
However, Razor-Lite adopts instruction replay to correct errors, which leads to a high performance penalty of up to 11 clock cycles per correction. In [9], a low-overhead transition detector (TD) with a 9-transistor current sensing circuit is proposed. TDs are inserted at the half-path points of critical paths and predict possible timing errors based on the timing behavior observed at these mid-points. In this way, a timing error in the current clock cycle can be prevented before the real timing violation happens at the endpoint of the critical path. Thus, the design does not need an error correction circuit. However, it incurs a large area overhead and requires significant design effort due to the large number of half-path points of critical paths. In [10], a timing error tolerant (TET) flip-flop was proposed. It consists of a transition detection unit for error detection and an FF with preset/clear options for error correction. Whenever an error is detected, the output of the FF is preset to “1” or cleared to “0” depending on the input value of the FF. However, this design incurs area overhead due to the circuits that generate the preset and clear signals. In particular, this design cannot be implemented on an FPGA since a D-FF in an FPGA has only one signal line for presetting or clearing. In [11], an EDAC technique with a new bit-flipping FF is proposed. Whenever a timing error is detected, it is corrected by complementing the output of the corresponding FF. However, the design requires a metastability detector to detect the metastability that can occur at the output of the FF. The design is prototyped in a MIPS microprocessor core on an FPGA, but the metastability detector is not implemented in the demonstration.

(b) Latch-based Designs [12–15]: Razor II [12] is another version of Razor in which a transition detector is used to detect errors.
Similar to Razor, it detects timing errors after they actually occur, and it corrects the timing violations using an architectural replay mechanism. A current-based timing error detector was proposed in [13].