Razor: a Low-Power Pipeline Based on Circuit-Level Timing Speculation
Total Page:16
File Type:pdf, Size:1020Kb
Appears in the 36th Annual International Symposium on Microarchitecture (MICRO-36), December 2003. Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation Dan Ernst, Nam Sung Kim, Shidhartha Das, Sanjay Pant, Rajeev Rao, Toan Pham, Conrad Ziesler, David Blaauw, Todd Austin, Krisztian Flautner1, and Trevor Mudge Advanced Computer Architecture Lab 1 The University of Michigan ARM Ltd 110 Fulbourn Road 1301 Beal Ave Cambridge, UK CB1 9NJ Ann Arbor, MI 48109 [email protected] [email protected] Abstract video is energy-inefficient for audio. The gap between high perfor- mance and low power can be bridged through the use of dynamic With increasing clock frequencies and silicon integration, voltage scaling (DVS) [16], where periods of low processor utiliza- power aware computing has become a critical concern in the design tion are exploited by lowering the clock frequency to the minimum of embedded processors and systems-on-chip. One of the more effec- required level, allowing corresponding reduction in the supply volt- tive and widely used methods for power-aware computing is age. Since dynamic energy scales quadratically with supply voltage, dynamic voltage scaling (DVS). In order to obtain the maximum significant reduction in energy use can be obtained [14]. power savings from DVS, it is essential to scale the supply voltage as Enabling systems to run at multiple frequency and voltage lev- low as possible while ensuring correct operation of the processor. els is a challenging process and requires characterization of the pro- The critical voltage is chosen such that under a worst-case scenario cessor to ensure that its operation remains correct at the required of process and environmental variations, the processor always oper- operating points. The minimum possible supply voltage that results ates correctly. However, this approach leads to a very conservative in correct operation is referred to as the critical supply voltage. The supply voltage since such a worst-case combination of different vari- critical supply voltage must be sufficient to ensure correct operation abilities will be very rare. In this paper, we propose a new approach in the face of a number of environmental and process related vari- to DVS, called Razor, based on dynamic detection and correction of abilities that can impact circuit performance. These include unex- circuit timing errors. The key idea of Razor is to tune the supply volt- pected voltage drops in the power supply network, temperature age by monitoring the error rate during circuit operation, thereby fluctuations, gate-length and doping concentration variations, cross- eliminating the need for voltage margins and exploiting the data coupling noise, etc. These variabilities may be data dependent, dependence of circuit delay. A Razor flip-flop is introduced that dou- meaning that they exhibit their worst-case impact on circuit perfor- ble-samples pipeline stage values, once with a fast clock and again mance only under certain instruction and data sequences, and are with a time-borrowing delayed clock. A metastability-tolerant com- composed of both local and global components. For instance, local parator then validates latch values sampled with the fast clock. In process variations will impact specific regions of the die in different the event of a timing error, a modified pipeline mispeculation recov- and independent ways, while global process variation impacts the ery mechanism restores correct program state. A prototype Razor circuit performance of the entire die and creates variation from one pipeline was designed in 0.18 µm technology and was analyzed. die to the next. Similarly, temperature and supply drop have local Razor energy overheads during normal operation are limited to and global components, while cross-coupling noise is a predomi- 3.1%. Analyses of a full-custom multiplier and a SPICE-level nantly local effect. Kogge-Stone adder model reveal that substantial energy savings are To ensure correct operation under all possible variations, a con- possible for these devices (up to 64.2%) with little impact on perfor- servative supply voltage is typically selected at design-time using mance due to error recovery (less than 3%). corner analysis. Hence, margins are added to the critical voltage to 1 Introduction account for uncertainty in the circuit models and to account for the A critical concern for embedded systems is the need to deliver worst-case combination of variabilities. However, such a worst-case high levels of performance given ever-diminishing power budgets. combination of variabilities may be very rare or even impossible in a This is evident in the evolution of the mobile phone: in the last 7 particular instance of a chip making this approach overly conserva- years mobile phones have shown a 50X improvement in talk-time tive. And, with process scaling, the environmental and process vari- per gram of battery1, while at the same time taking on new computa- abilities are expected to increase, worsening the required voltage tional tasks that only recently appeared on desktop computers, such margins. as 3D graphics, audio/video, internet access, and gaming. As the To allow for more aggressive power reduction, the supply volt- breadth of applications for these devices widens, a single operating age can be tuned to an individual processor chip using embedded point is no longer sufficient to efficiently meet their processing and inverter delay chains [5]. The delay of the inverter chain is used as a power consumption requirements. For example, MPEG video play- prediction of the critical path delay of the circuit and the supply volt- back requires an order-of-magnitude higher performance than play- age is tuned during processor operation to meet a predetermined ing MP3s. However, running at the performance level necessary for delay through the inverter-chain. This approach to DVS has the advantage that it dynamically adjusts the operating voltage to account for global variations in supply voltage drop, temperature fluctuation, and process variations. However, it cannot account for 1. Comparison of standard configurations of Nokia 232 and Ericsson local variations, such as local supply voltage drops, intra-die process T68 phones. variations, and cross-coupled noise, and therefore requires the addi- clk cycle 1 cycle 2 cycle 3 cycle 4 clock Logic Stage D1 Q1 Logic Stage 0 Main L1 1 Flip-Flop L2 clock_d Error_L D instr 1 instr 2 Shadow Latch comparator Error Error RAZOR FF clk_del Q instr 1 instr 2 (a) (b) Figure 1. Pipeline augmented with Razor latches and control lines. tion of safety margins to the critical voltage. Also, the delay of an latched by the flip-flop and the shadow latch, a delay error in the inverter chain does not scale with voltage and temperature in the main flip-flop is detected. The value in the shadow latch, which is same way as the delays of the critical paths of the actual design, guaranteed to be correct, is then utilized to correct the delay failure. which can contain complex gates and pass-transistor logic, which We present several architectural solutions for error correction, rang- again necessitate extra voltage safety margins. In future technolo- ing from simple clock gating to more sophisticated mechanisms that gies, the local component of environmental and process variation is augment the existing mispeculation recovery infrastructure. expected to become more prominent and, as noted in [6], the sensi- The proposed Razor technique was implemented in a prototype tivity of circuit performance to these variations is higher at lower 64-bit Alpha processor design. This prototype implementation was operating voltages, thereby increasing the necessary margins and used to obtain a realistic prediction of the power overhead for in-situ reducing the scope for energy savings. error correction and detection. We also studied the error-rate trends In this paper, we propose a new approach to DVS, referred to as for datapath components using both circuit-level simulation as well Razor, which is based on dynamic detection and correction of speed as silicon measurements of a full-custom multiplier block. Architec- path failures in digital designs. The key idea of Razor is to tune the tural simulations were then performed to analyze the overall supply voltage by monitoring the error rate during operation. Since throughput and power characteristics of Razor based DVS for differ- this error detection provides in-situ monitoring of the actual circuit ent benchmark test programs. We demonstrate that on average, delay, it accounts for both global and local delay variations and does Razor reduced simulated power consumption by more than 40%, not suffer from voltage scaling disparities. It therefore eliminates the compared to traditional design-time DVS and delay-chain based need for voltage margins that are necessary for “always-correct” cir- approaches. cuit operation in traditional designs. In addition, a key feature of The remainder of this paper is organized as follows. In Section Razor is that operation at sub-critical supply voltages does not con- 2, we present the implementation of Razor, providing a detailed stitute a catastrophic failure, but instead represents a trade-off description of both the proposed circuit and architectural techniques. between the power penalty incurred from error correction against In Section 3, we discuss the simulation framework for Razor-based additional power savings obtained from operating at a lower supply DVS and present error rate studies and our simulation results. In voltage. Section 4 we present a detailed survey of prior work in DVS. Finally, It was previously observed that circuit delay is strongly data in Section 5, we draw our conclusions. dependent, and only exhibits its worst-case delay for very specific instruction and data sequences [24]. From this it can be conjectured 2 Razor Error Detection/Correction that for moderately sub-critical supply voltages only a few critical Razor relies on a combination of architectural and circuit level instructions will fail, while a majority of instructions will continue to techniques for efficient error detection and correction of delay path operate correctly.