Characterizing Processors for Energy and Performance Management

2015 16th International Workshop on Microprocessor and SOC Test and Verification Characterizing Processors for Energy and Performance Management Harshit Goyal Vishwani D. Agrawal Department of Electrical and Computer Engineering Department of Electrical and Computer Engineering, Auburn University, Auburn University, Auburn, Alabama 36849 Auburn, Alabama 36849 Email: [email protected] Email: [email protected] Abstract—A processor executes a computing job in a certain II. TERMINOLOGY number of clock cycles. The clock frequency determines the time that the job will take. Another parameter, cycle efficiency A. Technology Benchmark or cycles per joule, determines how much energy the job will A technology benchmark is used to characterize the technol- consume. The execution time measures performance and, in ogy of the target processor whose operation is to be managed. combination with energy dissipation, influences power, thermal behavior, power supply noise and battery life. We describe a The circuit and technology level details of the benchmark method for power management of a processor. An Intel processor circuit should be known such that it can be simulated to in 32nm bulk CMOS technology is used as an illustrative exam- accurately determine its delay and energy versus the operating ple. First, we characterize the technology by H-spice simulation voltage. The circuit size may be reasonable for economy of a ripple carry adder for critical path delay, dynamic energy of the simulation effort, yet it may have an understandably and static power at a wide range of supply voltages. The adder data is then scaled based on the clock frequency, supply voltage, meaningful function. The benchmark used in the present work thermal design power (TDP) and other specifications of the is a 16-bit ripple carry adder. processor. To optimize the time and energy performance, voltage and clock frequency are determined showing 28% reduction both B. Energy per Cycle in execution time and energy dissipation. A power-delay product, known as energy per cycle Keywords-performance, cycle efficiency, energy per cycle, low (EPC) [16], measures the energy dissipated in a circuit per power, thermal design power. switching operation. Since the dynamic energy per switching event is fixed, the EPC describes a fundamental tradeoff between speed and power. Energy per cycle of a circuit is I. INTRODUCTION a key parameter for energy efficiency in ultra-low power Power consumption is proportional to the frequency of applications. Because computing workload is characterized execution and the square of the operating voltage, while energy in terms of clock cycles, this measure directly relates to consumption also depends on the total execution time [17]. the energy consumption of workload [5]–[7]. The total energy per cycle (Etotal) of a single gate is composed of Energy consumption has become one of the primary concerns E E in processor design due to the recent popularity of portable dynamic energy ( dyn) and leakage or static energy ( static): devices and cost concerns for desktops and servers. Notably, E α · C · V 2 battery capacities have improved rather moderately (a factor dyn = 0→1 L dd (1) of 2 to 4 over the last 30 years), while the computational demands have drastically increased over the same time frame. With a number of performance oriented devices emerging Estatic = Pstatic · td and a huge demand of power from a fixed capacity battery, = Ioff · Vdd · td using the battery wisely has assumed high importance. In 2 − Vdd K · C · V · S energy-constrained systems, low power design is essential = L dd 10 (2) for extending battery and system lifetime. Lowering voltage supply (Vdd) quadratically decreases energy dissipation, but also causes an increase in delay [3]. Optimizing performance Etotal = Edyn + Estatic Vdd and power simultaneously requires a thorough study of the − S 2 =(α0→1 + K · 10 ) · CL · V (3) available resources and possible trade-offs [8]. Hence, it be- dd comes important to define and use new or existing metrics. In where α0→1 is low to high transition activity for the gate this work, we have used a recently defined parameter, cycle output node and Pstatic is static leakage power. Ioff is leakage efficiency of processor (η) [11], [12], to investigate the cycles current, K is a constant, CL is load capacitance of the gate that could be run using a given amount of energy. and S is sub-threshold swing. 2332-5674/16 $31.00 © 2016 IEEE 67 DOI 10.1109/MTV.2015.22 C. Cycle Efficiency Cycle efficiency is defined as performance per unit of energy. To increase this efficiency it is required that the fundamental energy of operations be reduced. Further, power is defined as the rate of energy consumption (watts ≡ Joules/second) and is directly affected by the performance. This distinction between power and energy is important because what may seem like a trade-off may just be a Fig. 1. Benchmark circuit structure: an n-bit ripple carry adder. modulation in performance resulting in changes in power consumption. The performance (inverse of time) can be called voltage, maximum temperature and maximum signal loading time efficiency just as cycle efficiency (inverse of energy conditions. per cycle) is energy efficiency. If we regard the clock cycle Processor base frequency describes the rate at which the as a unit of work that a processor performs, then it means processor’s transistors open and close. The processor base work done in a time period 1/f, where f is the frequency in frequency is the operating point where TDP is defined. cycles per second or hertz (Hz). A clock cycle also means Turbo boost and overclocking are both essentially the same certain amount of energy or energy per cycle (EPC).We thing although they may work a little differently. Turbo define cycle efficiency, η =1/EP C, its unit being cycles per boost is a feature of Intel processors created to dynamically joule [11], [12]. Thus, a clock cycle means 1/f second in overclock a CPU, meaning the more you use your CPU, the time and 1/η joule in energy. Consider a program being run faster the CPU moves up to a certain point which is determined on a processor and suppose it takes c clock cycles to execute. by the manufacturer. Overclocking is a similar concept except Then we have, that it is not dynamic and is implemented manually, either c through software or through BIOS on newer motherboards. Execution time = f (4) III. TECHNOLOGY ASSESSMENT c A. Ripple Carry Adder Benchmark Circuit Energy consumed = (5) η The ripple carry adder circuit of Figure 1 is used to learn where, η is cycle efficiency of the processor in cycles per the energy and delay characteristics of the technology of the joule. Equation 4 gives the time performance of the processor processor [9]. In this paper, we use a 16-bit ripple carry as, adder that may a basic element in many digital computational systems. The design methodology emphasizes the operation 1 f Performance in time = = (6) of the adder in 32nm bulk PTM CMOS technology node. We Execution time c assume that the processor being characterized is large and a Similarly, Equation 5 gives the energy performance as, full scale gate level or transistor level model may not be readily available. Even if such a model was available, a detailed 1 c simulation at various voltages would be impractical for high Performance in energy = = η (7) Energy consumed reason of high complexity. However, operational data about Clearly, cycle efficiency (η) characterizes the energy the processor, such as voltage, maximum clock frequency and performance in a similar way as frequency (f) characterizes power consumption, is available. Also, the technology of the the time performance. These two performance parameters are device is specified. We, therefore, characterize the technology related to each other by the power being consumed, as follows: using known and easily analyzable adder benchmark. Then, f we scale this characterization to the processor. (8) Power = η B. Predictive Technology Model D. Base Frequency, Turbo Boost, Overclocking and Thermal Two versions of Predictive Technology Models (PTM) avail- Design Power able from an Arizona State University site that are widely Thermal design power (TDP) is the average maximum used to carry out research experiments are chosen to carry power in watts the processor dissipates when operating at out the characterization: Bulk MOSFET with conventional base frequency with all cores active under a manufacturer SiON/Polysilicon gate and Secondly high-k dielectrics with defined, high complexity workload. TDP is not the maximum metal gate technology, a combination known as HKMG (High- power the CPU may consume - there may be periods of time K, Metal Gate). Our experiments were carried out using 32nm when the CPU dissipates more power than that allowed by its Bulk CMOS PTM model. thermal design. TDP is usually 20%-30% lower than the CPU C. Simulation Tools Used maximum power dissipation. A 16 bit adder is designed using a VHDL code and its Peak power is the maximum power dissipated by the pro- compilation and simulation is carried out using Questa Sim cessor under the worst case conditions - at the maximum core to ensure correct operation before it is synthesized. Leonardo 68 TABLE I H-SPICE SIMULATION OF 16 BIT RIPPLE CARRY ADDER FOR 32NM TECHNOLOGY NODE IN BULK CMOS PTM AT DIFFERENT VOLTAGES (Vdd). Vdd Power from simulation Timing from simulation Energy per cycle Pavg Pdyn Pstatic Ppeak Critical path fmax Edyn Estatic Etotal volts μW μW μW μW delay, ps GHz fJ fJ fJ 1.20

Load more