Razor: a Low-Power Pipeline Based on Circuit-Level Timing Speculation

Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation Dan Ernst, Nam Sung Kim, Shidhartha Das, Sanjay Pant, Rajeev Rao, Toan Pham, Conrad Ziesler, David Blaauw, Todd Austin, Krisztian Flautner1, and Trevor Mudge Advanced Computer Architecture Lab ARM Ltd 1 The University of Michigan 110 Fulbourn Road 1301 Beal Ave Cambridge, UK CB1 9NJ Ann Arbor, MI 48109 [email protected] [email protected] Abstract MPEG video playback requires an order-of-magnitude higher performance than playing MP3s. However, running at the With increasing clock frequencies and silicon integra- performance level necessary for video is energy-inefficient tion, power aware computing has become a critical concern for audio. The gap between high performance and low power in the design of embedded processors and systems-on-chip. can be bridged through the use of dynamic voltage scaling One of the more effective and widely used methods for power- (DVS) [16], where periods of low processor utilization are aware computing is dynamic voltage scaling (DVS). In order exploited by lowering the clock frequency to the minimum to obtain the maximum power savings from DVS, it is essen- required level, allowing corresponding reduction in the sup- tial to scale the supply voltage as low as possible while ensur- ply voltage. Since dynamic energy scales quadratically with ing correct operation of the processor. The critical voltage is supply voltage, significant reduction in energy use can be chosen such that under a worst-case scenario of process and obtained [14]. environmental variations, the processor always operates cor- Enabling systems to run at multiple frequency and volt- rectly. However, this approach leads to a very conservative age levels is a challenging process and requires characteriza- supply voltage since such a worst-case combination of differ- tion of the processor to ensure that its operation remains ent variabilities will be very rare. In this paper, we propose a correct at the required operating points. The minimum possi- new approach to DVS, called Razor, based on dynamic detec- ble supply voltage that results in correct operation is referred tion and correction of circuit timing errors. The key idea of to as the critical supply voltage. The critical supply voltage Razor is to tune the supply voltage by monitoring the error must be sufficient to ensure correct operation in the face of a rate during circuit operation, thereby eliminating the need for number of environmental and process related variabilities that voltage margins and exploiting the data dependence of circuit can impact circuit performance. These include unexpected delay. A Razor flip-flop is introduced that double-samples voltage drops in the power supply network, temperature fluc- pipeline stage values, once with a fast clock and again with a tuations, gate-length and doping concentration variations, time-borrowing delayed clock. A metastability-tolerant com- cross-coupling noise, etc. These variabilities may be data parator then validates latch values sampled with the fast dependent, meaning that they exhibit their worst-case impact clock. In the event of a timing error, a modified pipeline mis- on circuit performance only under certain instruction and data peculation recovery mechanism restores correct program µ sequences, and are composed of both local and global compo- state. A prototype Razor pipeline was designed in 0.18 m nents. For instance, local process variations will impact spe- technology and was analyzed. Razor energy overheads dur- cific regions of the die in different and independent ways, ing normal operation are limited to 3.1%. Analyses of a full- while global process variation impacts the circuit perfor- custom multiplier and a SPICE-level Kogge-Stone adder mance of the entire die and creates variation from one die to model reveal that substantial energy savings are possible for the next. Similarly, temperature and supply drop have local these devices (up to 64.2%) with little impact on performance and global components, while cross-coupling noise is a pre- due to error recovery (less than 3%). dominantly local effect. 1 Introduction To ensure correct operation under all possible variations, A critical concern for embedded systems is the need to a conservative supply voltage is typically selected at design- deliver high levels of performance given ever-diminishing time using corner analysis. Hence, margins are added to the power budgets. This is evident in the evolution of the mobile critical voltage to account for uncertainty in the circuit mod- phone: in the last 7 years mobile phones have shown a 50X els and to account for the worst-case combination of variabil- improvement in talk-time per gram of battery1, while at the ities. However, such a worst-case combination of variabilities same time taking on new computational tasks that only may be very rare or even impossible in a particular instance recently appeared on desktop computers, such as 3D graph- of a chip making this approach overly conservative. And, ics, audio/video, internet access, and gaming. As the breadth with process scaling, the environmental and process variabili- of applications for these devices widens, a single operating ties are expected to increase, worsening the required voltage point is no longer sufficient to efficiently meet their process- margins. To allow for more aggressive power reduction, the sup- ing and power consumption requirements. For example, ply voltage can be tuned to an individual processor chip using embedded inverter delay chains [5]. The delay of the inverter chain is used as a prediction of the critical path delay of the 1. Comparison of standard configurations of Nokia 232 and Ericsson circuit and the supply voltage is tuned during processor oper- T68 phones. ation to meet a predetermined delay through the inverter- Proceedings of the 36th International Symposium on Microarchitecture (MICRO-36 2003) 0-7695-2043-X/03 $17.00 © 2003 IEEE clk cycle 1 cycle 2 cycle 3 cycle 4 clock Logic Stage D1 Q1 Logic Stage 0 Main L1 1 Flip-Flop L2 clock_d Error_L D instr 1 instr 2 Shadow Latch comparator Error Error RAZOR FF clk_del Q instr 1 instr 2 (a) (b) Figure 1. Pipeline augmented with Razor latches and control lines. chain. This approach to DVS has the advantage that it dynam- We propose a combination of circuit and architectural ically adjusts the operating voltage to account for global vari- techniques for low cost in-situ error detection and correction ations in supply voltage drop, temperature fluctuation, and of delay failures. At the circuit level, each delay-critical flip- process variations. However, it cannot account for local varia- flop is augmented with a so-called shadow latch which is tions, such as local supply voltage drops, intra-die process controlled using a delayed clock. The operating voltage is variations, and cross-coupled noise, and therefore requires the constrained such that the worst-case delay is guaranteed to addition of safety margins to the critical voltage. Also, the meet the shadow latch setup time, even though the main flip- delay of an inverter chain does not scale with voltage and flop could fail. By comparing the values latched by the flip- temperature in the same way as the delays of the critical paths flop and the shadow latch, a delay error in the main flip-flop of the actual design, which can contain complex gates and is detected. The value in the shadow latch, which is guaran- pass-transistor logic, which again necessitate extra voltage teed to be correct, is then utilized to correct the delay failure. safety margins. In future technologies, the local component of We present several architectural solutions for error correction, environmental and process variation is expected to become ranging from simple clock gating to more sophisticated more prominent and, as noted in [6], the sensitivity of circuit mechanisms that augment the existing mispeculation recov- performance to these variations is higher at lower operating ery infrastructure. voltages, thereby increasing the necessary margins and reduc- The proposed Razor technique was implemented in a ing the scope for energy savings. prototype 64-bit Alpha processor design. This prototype In this paper, we propose a new approach to DVS, implementation was used to obtain a realistic prediction of referred to as Razor, which is based on dynamic detection and the power overhead for in-situ error correction and detection. correction of speed path failures in digital designs. The key We also studied the error-rate trends for datapath components idea of Razor is to tune the supply voltage by monitoring the using both circuit-level simulation as well as silicon measure- error rate during operation. Since this error detection provides ments of a full-custom multiplier block. Architectural simula- in-situ monitoring of the actual circuit delay, it accounts for tions were then performed to analyze the overall throughput both global and local delay variations and does not suffer and power characteristics of Razor based DVS for different from voltage scaling disparities. It therefore eliminates the benchmark test programs. We demonstrate that on average, need for voltage margins that are necessary for “always-cor- Razor reduced simulated power consumption by more than rect” circuit operation in traditional designs. In addition, a 40%, compared to traditional design-time DVS and delay- key feature of Razor is that operation at sub-critical supply chain based approaches. voltages does not constitute a catastrophic failure, but instead The remainder of this paper is organized as follows. In represents a trade-off between the power penalty incurred Section 2, we present the implementation of Razor, providing from error correction against additional power savings a detailed description of both the proposed circuit and archi- obtained from operating at a lower supply voltage. tectural techniques. In Section 3, we discuss the simulation It was previously observed that circuit delay is strongly framework for Razor-based DVS and present error rate stud- data dependent, and only exhibits its worst-case delay for ies and our simulation results.

Razor: a Low-Power Pipeline Based on Circuit-Level Timing Speculation

Approaches in Green Computing

Power-Aware Design Methodologies for FPGA-Based Implementation of Video Processing Systems

A Dynamic Voltage Scaling Algorithm for Sporadic Tasks∗

Comptia Fc0-Gr1 Exam Questions & Answers

Vertigo: Automatic Performance-Setting for Linux

Power Reduction Techniques for Microprocessor Systems

Vertigo: Automatic Performance-Setting for Linux

ECE 571 – Advanced Microprocessor-Based Design Lecture 28

Dynamic Thermal Management for High-Performance Microprocessors

An Analysis of Efficient Multi-Core Global Power Management Policies

ECE 571 – Advanced Microprocessor-Based Design Lecture 19

Power Savings in Mpsoc