Energy-Effective Issue Logic
Total Page:16
File Type:pdf, Size:1020Kb
Energy-Effective Issue Logic Daniele Folegnani and Antonio GonzaJez Departament d'Arquitectura de Computadors Universitat Politecnica de Catalunya Jordi Girona, 1-3 Mbdul D6 08034 Barcelona, Spain [email protected] Abstract require an increasing computing power, which sometimes is achieved through higher clock frequencies and more sophisticated Tile issue logic of a dynamically-scheduled superscalar processor architectural techniques, with an obvious impact on power is a complex mechanism devoted to start the execution of multiple consumption. instructions eveo' cycle. Due to its complexity, it is responsible for a significant percentage of the energy consumed by a With the fast increase in design complexity and reduction in microprocessor. The energy consumption of the issue logic design time, new generation CAD tools for VLSI design like depends on several architectural parameters, the instruction issue PowerMill [13] and QuickPower [12] are crucial to evaluate the queue size being one of the most important. In this paper we energy consumption at different points of the design, helping to present a technique to reduce the energy consumption of the issue make important decisions early in the design process. The logic of a high-performance superscalar processor. The proposed importance of a continuous feedback between the functional technique is based on the observation that the conventional issue system description and the power impact requires better and faster logic wastes a significant amount of energy for useless activio,. In evaluation tools and models to introduce changes rapidly through particular, the wake-up of empty entries and operands that are various design scenarios. Recent research has shown how read)' represents an important source of energy waste. Besides, architectural power estimation tools like SimplePower [36], we propose a mechanism to dynamically reduce the effective size Wattch [6] and Architectural Power Evaluation [7] can accurately of the instruction queue. We show that on average the effective estimate the power consumption of a whole microprocessor with instruction queue size can be reduced by a factor of 26% with a reasonable overhead and accuracy. minimal impact on performance. This reduction together with the In this paper we first evaluate the energy consumed by the energy saved for empo' and read)' entries result in about 90.7% different parts of a superscalar processor through the Architectural reduction in the energy consumed by the wake-up logic, which Power Evaluation tool [7], which is a detailed cycle-level represents 14.9% of the total energy of the assumed processor. simulator that includes both performance and power consumption Keywords: Issue logic; energy consumption; low power; estimation techniques. This analysis demonstrates that one of the adaptive hardware. critical components for power consumption in modern superscalar processors is the part devoted to extract parallelism from applications at run time. In particular, the wake-up function, 1. Introduction which analyzes data dependencies of the program through an associative search in the instruction queue, is the main power Power consumption has become an important concern in consuming part of the issue logic. This is reflected in a high processor design. More than 95% of current microprocessors complexity circuit and high logic activity. For a typical produced today are used in embedded systems, so low power microarchitecture the instruction queue and its associated issue requirements are nowadays critical[30]. Mobile systems need an logic are responsible for about 25% of the total power extremely efficient use of the energy due to their limited battery consumption on average. This observation motivates an analysis life, which is not expected to experience drastic increases in the of the effectiveness of a large instruction queue. near future. On the other hand, high performance microprocessors have to deal with heat dissipation and high current peak problems. We first note that both empty entries and ready entries This impose very expensive cooling systems with an increasing consume energy for doing a useless activity: the former do not relative cost with respect to the total chip cost. For actual cooling contain any valid data whereas the latter are known to be ready, so systems and technology, beyond 50 Watt of total dissipation, the they do not need to be checked by the wake-up logic. Besides, we cost of dissipating a further Watt becomes superlinear, and may observe that the instruction queue must be large if its size is fixed soon reach an unreasonable cost [271]. Besides, several failure and high performance is desired for a wide range of applications. mechanisms such as thermal runaway, gate dielectric, junction However, different applications may require different queue sizes, fatigue, electro-migration diffusion, electrical-parameter shift and as already pointed out by other authors [2]. In addition, the queue silicon interconnections fatigue become significantly worse as size requirements may significantly vary during the execution of temperature increases [28]. As an example of the current trend in a program. Based on these observations, we propose a technique power consumption, current microarchitectures such as the Alpha to dynamically resize the instruction queue according to the 21364 and PowerPC 704 dissipate about 85 and 100 Watt characteristics of the dynamic instruction stream. This technique respectively [31]. Another important question is that the growing reduces the effective size of the instruction queue by about 26% demand of multimedia functionalities for computer systems [ 16] on average. We show that these mechanisms reduce the energy 230 1063-6897/01 $10.00 © 2001 IEEE consumption of the wake-up logic by about 90% with respect to a requirements [3][2]. Marculescu proposes a mechanism to fixed size instruction queue, while the performance is hardly dynamically adapt the fetch and execution bandwidth based on a affected. profiling at the basic-block level [22]. The rest of this paper is organized as follows. Section 2 Code and data compression [24][15][8] can reduce activity in reviews previous work. Section 3 describes the experimental memories and buses although they introduce the overhead of framework. Section 4 analyzes the energy consumption of a decompressing tasks. typical superscalar processor. Section 5 presents the proposed Some compiler techniques can also be found in the literature. techniques to reduce energy consumption and section 6 evaluates For instance, instruction scheduling approaches [9] can be used to their performance. Finally, section 7 summarizes the main avoid large current peaks in the execution core of high conclusions of this work. performance processors. 2. Related work 3. Experimental Framework A large number of techniques for low power have been proposed in this section we present the framework used to estimate the in the literature. The problem has been attacked from different performance and power consumption of a typical superscalar standpoints, ranging from a high-level perspective, such as the processor. For this purpose, we have used the Architectural Power operating system, to the circuit and technology level. Two main Evaluation tool [7] based on the SimpleScalar simulator [4], with objectives of the low power research area are the peak power extensions to compute the energy spent every cycle. reduction, (i.e. reducing the maximum power dissipated by the die), and the total energy reduction for a given workload. 3.1. Power Consumption Estimation Model Power consumption in CMOS technology is quadratic with voltage supply and linear with switching capacitance and activity. The Architectural Power Evaluation tool [7] considers the Orchestrated methods to scale voltage and frequency processor microarchitecture partitioned into 32 blocks, each simultaneously can be really effective to drastically reduce the corresponding to a basic functional block and a circuit block, The power requirements of microarchitectures. As examples, lntel internal connections of blocks are modeled as components of the StrongARM SA-I10 [11] and Transmeta Crusoe [14] use blocks. The power consumption of the interconnections among combinations of these techniques to considerably reduce power blocks is not considered since it is usually a negligible part of the consumption. total power consumption. Every block is divided in subparts, Another example used in commercial products is the TAU depending on the hardware implementation, and the simulator (Thermal Assist Unit) of the PowerPC G3 and G4 family, where counts the number of times that each subpart is used along the an interrupt is generated based on a programmable thermal program execution. The power consumption of each subpart is threshold [25]. This causes an instruction flow reduction between characterized by the following three parameters: power density, instruction cache and the instruction buffer in order to decrease the area occupied and circuit type. Five different types of circuits are temperature of the chip. Typically, voltage and frequency scaling considered: memory, static logic, dynamic logic, PLA and clock or interruption-based power-aware techniques are managed by the distribution 1. operating system or the applications. Recent studies [5] have In the model, the power characterization of a digital investigated adaptive thermal management and power-conscious synchronous functional block of the microarchitecture is