CORE Metadata, citation and similar papers at core.ac.uk

Provided by Directory of Open Access Journals

INTERNATIONAL JOURNAL OF DESIGN, ANALYSIS AND TOOLS FOR CIRCUITS AND SYSTEMS, VOL. 1, NO. 1, JUNE 2011 9 Dynamic Power Reduction of Stalls in Pipelined Architecture Processors Pejman Lotfi-Kamran, Ali-Asghar Salehpour, Amir-Mohammad Rahmani, Ali Afzali-Kusha, and Zainalabedin Navabi

Abstract—This paper proposes a technique for dynamic power use a forwarding unit [13]. The forwarding unit detects the reduction of pipelined processors. It is based on eliminating dependencies and forwards the required data from the running unnecessary transitions that are generated during the execution instruction to the dependent instructions. In some cases, it is of NOP instructions. The approach includes the elimination of unnecessary changes in pipe register contents and the limita- impossible to forward the result because it may not be ready. tion of boundary movement of transitions caused by inevitable In these situations, using a NOP instruction is inevitable [13], changes in pipe register contents due to insertion of a NOP [14]. The last type of hazard is control hazard that occurs into a pipelined processor. To assess its efficiency, the proposed when a branch prediction is mistaken or in general, when the technique is applied to MIPS, DLX, and PAYEH processors system has no mechanism for the branch prediction. There considering a number of benchmarks. The experimental results show that the techniques can lead to up to 10% reduction in the are two mechanisms for handling the control hazard. The first dynamic power consumption at a cost of negligible (almost zero) mechanism runs instructions after a branch and flushes the pipe speed and (about 0.2%) area overheads. after the misprediction. Generally, flush mechanisms are not Index Terms—Dataflow architectures, low-power design, cost effective. A better solution to handle the control hazard is pipelined processors, stall. to fill the pipe after the jump instruction with specific numbers of NOPs. This mechanism is called delayed jump mechanism and used widely in DSP processors [13], [14]. I. INTRODUCTION The NOP instruction does not contribute to any useful work. OWER dissipation limits have emerged as a major con- Therefore, the power consumed for its execution is wasted. P straint in the design of microprocessors where the speed Our study indicates that the percentage of dynamic power has been traditionally the primary goal [1]. At the low end consumed by NOP instructions in a pipelined processor is of the performance spectrum, namely in the category of considerable. There are many works that have targeted the handheld and portable devices or systems, power has always power optimization of pipelined processors (see, e.g., [17], been the more critical design constraint compared to speed [26]). Among them, several solutions have been presented to constraint [2]-[9]. reduce the number of NOP instructions [13]. Even with em- In battery-powered applications, where the speed is less of ploying these techniques, still a large number of stalls would a concern, relatively simple RISC (Reduced Instruction Set remain. Therefore, the power consumption of the processors Computers) like pipelines are often used [10], [11]. Pipelined may be reduced further by reducing the execution of the NOP processors frequently insert NOP (No Operation Performed) instruction itself. instruction to the pipe to eliminate hazards and generate some delays for the proper execution of the instructions [13]. There The aim of this paper is to reduce the dynamic power con- are three types of hazards which are structural, data, and sumption of a pipelined processor by eliminating the useless transitions that are generated in the when a NOP control [13]. The structural hazard may occur when there 1 are not enough hardware resources for the execution of a instruction passes through pipe stages . This is performed by combination of instructions. While in processors with simple modifying the architecture of RISC processors. The rest of the architectures, this hazard is usually eliminated in the design paper is organized as follows. Section 2 outlines the design phase, it occurs in architectures that use more than one of the baseline pipelined processor used in this work while functional unit for instruction level parallelism [13], [14]. Section 3 motivates the need for a technique for reducing A data hazard occurs when an instruction needs the result he dynamic power consumption of a pipelined processor of its prior instruction that is still in the pipeline and its when a stall happens. In Section 4, our proposed technique result is not ready. This occurs when there is not enough for reducing the dynamic power consumed during a NOP latency between these two instructions which are considered execution is presented. The microarchitectural changes to the data dependent. A technique for preventing data hazard is to baseline pipelined processor for implementing the proposed technique is presented in Section 5. The results are discussed All authors are with the School of Electrical and Computer in Section 6 while the summary and conclusions are gven in Engineering, University of Tehran, Iran. E-mail: plotfi@computer.org, the last section. [email protected], [email protected], [email protected], [email protected]. A.-M. Rahmani is also with Computer Systems Lab., Department of Information Technology, University of Turku, Turku, Finland. E-mail: 1A Preliminary version of this work appeared in the Proc. of VLSI amir.rahmani@utu.fi. Symposium 2008 [15]. INTERNATIONAL JOURNAL OF DESIGN, ANALYSIS AND TOOLS FOR CIRCUITS AND SYSTEMS, VOL. 1, NO. 1, JUNE 2011 10

II. BASELINE PIPELINED PROCESSOR into the pipeline. The controlling signals can be divided into two parts which are critical and non-critical. The examples Figure 1 shows the diagram of the microarchitecture of a of the critical control signals which should be deactivated for conventional pipelined processor based on a 5-stage 32-bit the correct operation of NOP include writing to the memory Von Neumann MIPS I architecture [13]. While we restrict the or register files. The non-critical control signals are those discussion to a MIPS like processor architecture, the proposed signals that do not effect the correct execution of the NOP approach may be applied to other types of architecture. The instruction, and hence, behave as ”do not care” signals. For five stages include: Instruction extraction (FETCH), Instruc- the NOP insertion, only the critical control signals ought to tion decoding (DECODE), Instruction Execution (EXECUTE), be deactivated. Memory access (MEMORY), and Update registers (WRITE BACK). Only two instructions can access the memory. The III. MOTIVATION FOR OUR APPROACH processor contains 32 registers. In the case of hazard, data hazards are resolved with a bypass unit while branch hazards After the DECODE stage, the generated control signals are are resolved by predicting the address results. Interruptions used to control the flow of data. In this stage, if the control unit and exceptions are taken care of by a system coprocessor. determines that the current instruction depends on the former Furthermore, this processor has 4-way instruction and 4-way instructions and the forwarding cannot resolve the dependency, data caches. If a hit happens, the data is immediately sent to the control unit inserts a NOP instruction by deactivating the the next stage while if a miss occurs, the processor should wait critical control signals of the current instruction including for the data to become ready. In the first stage, i.e., FETCH, the control signals for writing to memory and register file. The next instruction is read from the memory and is loaded into the NOP instructions are inserted into the pipeline to eliminate FE/DE register at the end of this stage. In the second stage, hazards. These inserted NOP instructions contribute to the i.e., DECODE, the instruction is decoded where the values overall dynamic power of a pipelined processor by generating of the registers which are needed for running this instruction a number of unnecessary transitions. This is explained by an are read from the register file. In addition, if an immediate example in our baseline pipelined processor. value is used in an instruction, the immediate value is properly sign extended or zero filled. It is in this stage that the controlling signals for running the instruction are generated. These controlling signals include the signals for writing to memory and register file as well as the multiplexer selects and Fig. 2. A simple MIPS program. the ALU operation type. In the third stage, i.e., EXECUTE, the desired operation is performed on the extracted data in A simple program is shown in Figure 2. The first instruction the previous stage. For the branch instruction, the result is is a LOAD instruction that reads a data from memory and computed and based on the computed results, the next value the second instruction is an ADD instruction that uses the of PC (program counter) is determined. In the forth stage, i.e., loaded data. Because of the dependency between these two MEMORY, depending on the instruction, the desired value is instructions, after the LOAD instruction, a NOP instruction written into a memory location or the content of a memory should be inserted into the pipeline. During the execution of location is read. In the last stage, i.e., WRITE BACK, based on the simple program of Figure 2, when the LOAD instruction the instruction, the computed value is written into the register is in the DECODE stage, the control signals and the required file. data corresponding to this instruction are generated/extracted. In some situations, due to the dependency between the On the rising edge of the clock, the generated/extracted two successive instructions, the data needed for the second control/data are latched into the DE/EX pipeline register. In instruction should be produced by the first instruction. In these the next clock cycle, the ADD instruction is in the DECODE cases, when the second instruction is in the decode stage, the stage and the control unit determines that a NOP instruction loaded data from the register file are not valid. However, it is should be inserted into the pipeline. Therefore, the critical possible that when the second instruction actually needs the control signals of the ADD instruction are deactivated and data in the later stages of the pipeline, the first instruction has these deactivated critical control signals along with the other produced the data. Therefore, a forwarding unit is added to the control signals and the required data of the ADD instruction pipeline. If a data field is not valid, the forwarding unit tries (current instruction in the DECODE stage) are latched on to forward the valid data from the subsequent stages. In some the rising edge of the clock. Generally, the data parts of the situations, the first instruction cannot produce the needed data current (i.e., ADD) and previous instruction (i.e., LOAD) are of the second instruction even when this instruction needs the different. It means that data part of NOP is different from the data. In these cases, the second instruction should run with at former instruction (i.e., LOAD). Therefore, passing the NOP least a clock cycle delay. Therefore, in the DECODE stage, instruction in the pipe generates a number of transitions. In these cases are determined and a stall is inserted between the third clock cycle, the ADD instruction should be passed to the two instructions. When a stall is inserted into the pipe, the pipeline. In this time, the control signals corresponding to the FETCH stage stops running (PC is not loaded with a ADD are generated and latched along with its required data. new value) and the content of the controlling signals in the Since the data and non-critical control signals of NOP and DECODE stage are deactivated for a NOP to be inserted ADD instructions are not the same, the number of transitions INTERNATIONAL JOURNAL OF DESIGN, ANALYSIS AND TOOLS FOR CIRCUITS AND SYSTEMS, VOL. 1, NO. 1, JUNE 2011 11

Fig. 1. Dataflow diagram of MIPS pipelined architecture [13]. induced during the passage of the ADD instruction in the instruction are not valid in the DECODE stage, for the correct pipeline stages is not negligible. This imposes some dynamic execution of this instruction, valid data will be prepared by power consumption. The objective of this paper is to minimize the forwarding unit. To minimize the number of transitions these transitions. generated during the execution of NOP in this case, the same data should be prepared for the NOP instruction. If the valid IV. THE PROPOSED TECHNIQUES data of the instruction preceding NOP are still in some pipe As discussed, the data part of an inserted NOP instruction is registers when the NOP instruction needs them, the forwarding not the same as that of its preceding or subsequent instruction unit prepares the data for the NOP as well. In these cases, generating a number of transitions. In addition, passing the a few transitions are generated during the execution of the pending instruction after NOP produces more transitions. NOP instruction. On the other hand, if the valid data are not These transitions lead to wasting the power consumption available in any pipe registers when the NOP instruction needs which should be minimized. For the NOP instruction to them (because the processing of the instruction that generates generate as few transitions as possible, its data part should be those data has been finished and has gone out of the pipe), the same as that of its preceding or subsequent instruction. different data are loaded into some operators generating a Because of the unavailability of some of the data of the number of useless transitions which may propagate to the last instruction passing the pipe after NOP, the data part of the stage of the pipeline. Here, we propose a technique to prevent instruction preceding NOP should be used as the data part the propagation of these transitions to all the pipeline stages. of the NOP instruction. This way, as a NOP instruction For this purpose, the outputs of the NOP instruction should passes through a pipe, relative to the previous cycle, the same be the same as its preceding instruction in all the pipe stages operations are performed on the same data in all stages of the to minimize useless transitions. Implementing this technique, pipeline minimizing the number of transitions. In addition, the value which is loaded in each pipe register for NOP is the non-critical part of the control signals may be the same as the same as that of the previous instruction except for the those of the preceding instruction. The proposed idea may critical control signals. Therefore, only the critical control be implemented in the DECODE stage by modifying the signals of pipe registers should be loaded during the execution architecture of the baseline processor as will be explained in of NOP instructions. Using this approach, if the data of a Section 5. NOP instruction are not valid (i.e., the NOP data differ from The technique decreases the number of unnecessary tran- those of the instruction preceding it) and the valid data are not sitions generated when a NOP is inserted into the pipe. available in the pipe registers (forwarding unit cannot provide When the data part of the instruction before NOP is valid the valid data for the NOP), the change of data which leads in the DECODE stage, the proposed technique guarantees no to some transitions is inevitable. However, these transitions useless transitions is generated as a NOP instruction passes are propagated until they reach a pipe register where their through the pipe. However, if parts of the data of the previous propagation is stopped because writing into pipe registers has INTERNATIONAL JOURNAL OF DESIGN, ANALYSIS AND TOOLS FOR CIRCUITS AND SYSTEMS, VOL. 1, NO. 1, JUNE 2011 12

Fig. 3. Modified MIPS architecture based on the proposed tecniques. been stopped. Therefore, transitions cannot propagate through the pipe registers. In the DECODE stage, if the controller the entire pipeline limits the propagation boundary. detects that the current inststruction is dependent on the former instructions and hence a NOP should be inserted into the pipe, V. MODIFICATION OF BASELINE ARCHITECTURE the load enable control signals of all upcoming stage registers (i.e., ID/EX, EX/MEM, MEM/WB) are activated. The load The techniques proposed in the previous section included enable of ID/EX pipe register is directly fed into it while the the case where the data of the preceding instruction of NOP load enables of EX/MEM, MEM/WB pipe registers (i.e., LN1 is known and the case where the data of the preceding and LN2 respectively) are propagated through pipe registers instruction is not known. To implement the technique based to reach and fed into the desired pipe register. On the other on the first case, it is sufficient to add a load enable control hand, if the controller does not it necessary to insert a NOP signal to the data and non-critical control parts of DE/EX into the pipe, all load enable control signals are deactivated in pipe register. This way, only the critical control signals (such the DECODE stage. as write to memory and register file signals) that should be loaded in each clock cycle are not controlled by the added load enable signal. When a NOP is decided to be inserted VI. RESULTS AND DISCUSSION into the pipe, the controller should deactivate the load enable In this section, we discuss the power reduction, area over- signal. For the second case, a load enable control signal is head, and timing penalty of our proposed power reduction added to each pipe register after the DECODE stage. This technique. The techniques have been implemented in three control signal is only applied to data and non-critical control general processors: MIPS [13], DLX [13], and PAYEH [14]. parts of the pipe registers. By deactivating the load enable MIPS is a 5 stage pipelined processor whose architecture is of a pipe register when NOP results are written into it, only RISC with fixed-width of 32-bit instructions. The details of the critical control signals of that pipe register are changed this processor can be found in reference [13]. DLX is a text and its other parts remain unchanged. The same as other book example of a RISC processor with a 5 stage pipeline control signals, these load enable signals are generated by the using forwarding to avoid data hazards. The DLX processor controller in the DECODE stage and are propagated through uses a load-store architecture. All DLX instructions are 32-bit the pipe registers like other control signals to the desired long. It has 32 32-bit registers [13]. PAYEH is a pipelined destination (i.e., specific pipe register). Figure 3 illustrates the version of SAYEH [27] with a similar instruction set and mechanism of propagating the load enable control signals in has five pipe stages. SAYEH is a multi-cycle RISC processor INTERNATIONAL JOURNAL OF DESIGN, ANALYSIS AND TOOLS FOR CIRCUITS AND SYSTEMS, VOL. 1, NO. 1, JUNE 2011 13

TABLE I CHARACTERISTICS OF ORIGINAL AND MODIFIED MIPS, DLX, AND PAYEH PIPELINED PROCESSORS

Area Characteristic Frequency Characteristic

Processor Original Modified Overhead Original Modified Overhead Area (µm2) Area (µm2) (%) Frequency Frequency (%) (MHz) (MHz)

MIPS 199737.92 200215.89 0.24 50.76 50.76 ≈0

DLX 106396.99 106577.86 0.17 80 80 ≈0

PAYEH 919530.45 921185.60 0.18 129.33 129.33 ≈0

TABLE IV POWER CONSUMPTIONS OF ORIGINAL AND MODIFIED PAYEH TABLE II PROCESSORS FOR DIFFERENT BENCHMARKS AND INPUTS POWER CONSUMPTIONS OF ORIGINAL AND MODIFIED MIPS PROCESSORS FOR DIFFERENT BENCHMARKS AND INPUTS Benchmark input ORG(mW) MOD(mW) IMP(%) n=10 1065.236 972.4543 0.0871 Benchmark input ORG(mW) MOD(mW) IMP(%) n=20 1070.558 975.278 0.089 n=10 502.47 453.43 0.0976 Factorial n=30 1073.907 978.2221 0.0891 n=20 504.98 455.95 0.0971 n=40 1074.204 988.1603 0.0801 Factorial n=30 506.56 457.47 0.0969 n=50 1077.108 981.5689 0.0887 n=40 506.7 458.77 0.0946 n=10 963.8127 885.4547 0.0813 n=50 508.07 460.72 0.0932 n=20 976.6053 897.4026 0.0811 n=10 465.61 427.85 0.0811 Fibonacci n=30 998.4231 908.9644 0.0896 n=20 471.79 431.97 0.0844 n=40 1015.894 936.5526 0.0781 Fibonacci n=30 482.33 443.79 0.0799 n=50 1041.417 953.7297 0.0842 n=40 490.77 446.85 0.0895 n=10 1027.732 949.5212 0.0761 n=50 503.1 459.18 0.0873 n=20 1030.812 956.3874 0.0722 n=10 503.79 471.75 0.0636 Power n=30 1031.812 951.6398 0.0777 n=20 505.3 473.87 0.0622 n=40 1034.504 958.158 0.0738 Power n=30 505.79 473.98 0.0629 n=50 1032.77 956.0356 0.0743 n=40 507.11 472.73 0.0678 n=10 1082.403 1004.47 0.072 n=50 506.26 475.78 0.0602 n=20 1092.483 1007.816 0.0775 n=10 515.43 483.58 0.0618 VectorAddition n=30 1090.404 1011.677 0.0722 n=20 520.23 488.65 0.0607 n=40 1095.717 1018.688 0.0703 Vector Addition n=30 519.24 483.46 0.0689 n=50 1099.623 1017.261 0.0749 n=40 521.77 490.72 0.0595 n=50 523.63 491.11 0.0621 with 16-bit data and 16-bit address buses. PAYEH architecture uses a forwarding unit. This forwarding unit can resolve all dependencies by forwarding the required data from the next pipe stages to the previous ones. TABLE III The original and modified processors were synthesized by POWER CONSUMPTIONS OF ORIGINAL AND MODIFIED DLX PROCESSORS FOR DIFFERENT BENCHMARKS AND INPUTS Synopsys D.. using 130nm TSMC library. Table I shows the reported area and frequency of these processors. As expected, Benchmark input ORG(mW) MOD(mW) IMP(%) the proposed method does not have any adverse effect on n=10 600.45 564.00 0.0607 the frequency of the processors. This is due to the fact that n=20 603.45 567.24 0.06 Factorial n=30 605.34 569.75 0.0588 the method does not affect the critical paths. Table I also n=40 605.51 564.82 0.0672 reveals that the hardware overhead of the proposed technique n=50 607.14 568.71 0.0633 is negligible (< 0.3%). n=10 561.06 523.75 0.0665 n=20 568.507 529.9622 0.0678 Four benchmark programs were used to measure the effec- Fibonacci n=30 581.2077 546.8583 0.0591 tiveness of the proposed dynamic power reduction technique. n=40 591.3779 552.9383 0.065 The Factorial benchmark reads a number and calculates its n=50 606.2355 569.8007 0.0601 n=10 604.548 571.2979 0.055 factorial while Fibonacci benchmark reads a number and n=20 606.36 574.1623 0.0531 computes Fibonacci series up to the requested element. Power Power n=30 606.948 574.2335 0.0539 benchmark reads two numbers, a and b, and calculates a to n=40 608.532 572.6895 0.0589 n=50 607.512 575.0709 0.0534 power b (i.e., ab) and Vector Addition benchmark reads two n=10 628.8246 598.8297 0.0477 vectors and calculates their addition element by element. These n=20 634.6806 603.2639 0.0495 benchmark programs were applied to the original and modified VectorAddition n=30 633.4728 604.9032 0.0451 n=40 636.5594 604.0312 0.0511 synthesized processors. Every benchmark program was run n=50 638.8286 606.1206 0.0512 five times, each time with a different input size. The input size determines the run-time complexity of the program. Input n = 10 (50) corresponds to the least (most) complex runtime. INTERNATIONAL JOURNAL OF DESIGN, ANALYSIS AND TOOLS FOR CIRCUITS AND SYSTEMS, VOL. 1, NO. 1, JUNE 2011 14

As Table II indicates, for the MIPS processor, a maximum [15] P. Lotfi-Kamran et al., “Stall power reduction for pipelined architecture dynamic power reduction of 9.76% was achieved. The average processors,” in Proc. of VLSI Design, January 2008, pp. 541–546. [16] A. Hartstein and T. R. Puzak, “The optimum pipeline depth considering power reduction of the proposed approach was about 7.66% both power and performance,” ACM Transactions on Architecture and for this processor. The table shows that as the complexity of Code Optimization, vol. 1, no. 4, pp. 369–388, December 2004. a program increases, the dynamic power that is consumed by [17] S.-J. Ruan et al., “Bipartitioning and encoding in low-power pipelined circuits,” ACM Transactions on Design Automation of Electronic Sys- the processors also increases. Almost the same results were tems, vol. 10, no. 1, p. 24-32, January 2005. achieved for DLX and PAYEH processors. For the DLX pro- [18] M. Monchiero et al., “Power-aware branch prediction techniques: a cessor, as Table III indicates, a maximum and average power -hints based approach for VLIW processors,” in Proc. of ACM Great Lakes symposium on VLSI, April 2004, pp. 440–443. reduction of 6.22% and 5.69%, respectively, were achieved. [19] D. Parikh et al., “Power issues related to branch prediction,” in Proc. of For the PAYEH processor, 8.86%, and 8.04%, respectively, International Symposium on High-Performance , were the percentages of the maximum and average power February 2002, pp. 233–244. [20] R. I. Bahar and S. Manne, “Power and energy reduction via pipeline savings. balancing,” in Proc. of the 28th Annual International Symposium on Computer Architecture, June-July 2001, pp. 218–229. [21] A. Correale, “Overview of the power minimization techniques employed VII. CONCLUSION in the IBM PowerPC 4xx embedded controllers,” in Proc. of the In this work, we proposed a method for minimizing unnec- ACM/IEEE International Symposium on Low Power Design, April 1995, pp. 75–80. essary transitions that are generated when a NOP instruction is [22] V. Tiwari et al., “Guarded evaluation: pushing power management to inserted into the pipe of a pipelined processor. The proposed logic synthesis/design,” IEEE Transactions on Computer Aided Design approach consisted of two techniques. The first one focused on of Integrated Circuits and Systems, vol. 17, no. 10, pp. 1051–1060, October 1998. eliminating unnecessary changes in the pipe register contents [23] H. Kapadia et al., “Reducing switching activity on datapath buses with while the second one restricted the propagation boundary of control-signal gating,” IEEE Journal of Solid-State Circuits, vol. 34, transitions caused by inevitable changes in the pipe register no. 3, pp. 405–414, March 1999. [24] M. Mnch et al., “Automating RT-level isolation to minimize contents due to insertion of a NOP instruction. To determine power consumption in datapaths,” in Proc. of the Design, Automation the efficacy of the proposed technique, we applied some and Test in Europe Conference and Exhibition), March 2000, pp. 624– benchmarks to MIPS, DLX and PAYEH pipelined processors. 633. [25] G. Kucuk et al., “Low-complexity reorder buffer architecture,” in Proc. While the hardware overhead and timing penalty of the pro- of International Conference on Supercomputing, June 2002, pp. 57–66. posed approach was negligible, the dynamic power reductions [26] S. Manne et al., “Pipeline gating: speculation control for energy re- of up to 10% were achieved. duction,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 17, no. 11, pp. 1061–1079, November 1998. [27] Z. Navabi, Digital design and implementation with field programmable REFERENCES devices. Kluwer Academic Publisher, 2004. [1] “International Technology Roadmap for Semiconductor,” 2007. [2] V. Venkatachalam and M. Franz, “Power reduction techniques for microprocessor systems,” ACM Computing Surveys, vol. 37, no. 3, pp. 195–237, September 2005. [3] D. M. Brooks et al., “Power-aware microarchitecture: design and modeling challenges for next-generation microprocessors,” IEEE Micro, Ali-Asghar Salehpour received B.S. degree from vol. 20, no. 6, pp. 26–44, November/December 2000. Shahed University, Iran, in 2006, and M.S. degree [4] R. Gonzalez and M. Horowitz, “Energy dissipation in general purpose from the University of Tehran, Tehran, Iran, in 2009, microprocessors,” IEEE Journal of Solid-State Circuits, vol. 31, no. 9, both in Computer Engineering. His research interests pp. 1277–1284, September 1996. include low-power design, network, wireless sensor [5] M. Kandemir et al., “Register relabeling: A post-compilation technique networks, and network security. for energy reduction,” in Proc. of Workshop on and Operating Systems for Low Power, October 2000. [6] M. T.-C.Lee et al., “Power analysis and minimization techniques for embedded DSP software,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 5, no. 1, pp. 123–135, March 1997. [7] T. Li and C. Ding, “Instruction balance and its relation to program energy consumption,” in Proc. of Intl. Workshop on Languages and Compilers for Parallel Computing, August 2001, pp. 71–85. [8] W. Zhang et al., “Exploiting VLIW schedule slacks for dynamic and leakage energy reduction,” in Proc. of the 34th Annual Intl. Symp. on Microarchitecture (Micro), December 2001, pp. 102–113. Amir-Mohammad Rahmani received his Master [9] M. Sarrafzadeh et al., “Low power light-weight embedded systems,” degree in computer architecture from Department in Proc. of the international symposium on Low power electronics and of Electrical and Computer Engineering, University design, October 2006, pp. 207–212. of Tehran in 2009. He is currently pursuing his re- [10] J. Montanaro et al., “A 160-MHz, 32-b, 0.5-W CMOS RISC micropro- search in Computer Systems Laboratory, University cessor,” IEEE Journal of Solid-State Circuits, vol. 31, no. 11, pp. 1703– of Turku, Finland and has a Ph.D. position in Turku 1714, November 1996. Center for Computer Science (TUCS). His research [11] IBM/Motorola, PowerPC 405CR User Manual. interests include Low-Power Design, Networks-on- [12] M. A. Amiri et al., “Design and implementation of a 50MHZ DXT chip, Multi-Processor System-on-chip, and 3D ICs. CoProcessor,” in Proc. of EuroMicro Conference on Digital System His PhD thesis is focused on Power Analysis and Design Architectures, Methods, and Tools, August 2007, pp. 43–50. Optimization in 3D-Networks-on-Chip. Amir is a [13] D. A. Patterson and J. L. Hennessy, Computer Architecture: A Quanti- member of IEEE, IEEE Circuits and Systems Society, and EUROMICRO tative Approach. Morgan-Kaufmann, 4th ed. and has published dozens of refereed papers in prestigious books, journals [14] S. Shamshiri et al., “Instruction-level test methodology for CPU core and conferences. self-testing,” ACM Transactions on Design Automation of Electronic Systems, vol. 10, no. 4, pp. 673-689, October 2005. INTERNATIONAL JOURNAL OF DESIGN, ANALYSIS AND TOOLS FOR CIRCUITS AND SYSTEMS, VOL. 1, NO. 1, JUNE 2011 15

Ali Afzali-Kusha received his B.Sc., M.Sc., and Zainalabedin Navabi Dr. Zainalabedin Navabi is a Ph.D. degrees all in Electrical Engineering from professor of electrical and computer engineering at Sharif University of Technology, University of Pitts- the University of Tehran, and an adjunct professor burgh, and University of Michigan in 1988, 1991, at Worcester Polytechnic Institute, Worcester, MA, and 1994, respectively. From 1994 to 1995, he was a USA. He is the author of eight books on VHDL, Ver- Post-Doctoral Fellow at The University of Michigan. ilog and related tools and environments. Dr. Navabis Since 1995, he has joined The University of Tehran, began his work in the EDA area in 1976, when he where he is currently a Professor of the School started the development of a register-transfer level of Electrical and Computer Engineering and the simulator for one of the very first HDLs. In 1981 Director of Low-Power High-Performance Nanosys- he completed the development of an RTL synthesis tems Laboratory. Also, on a research leave from the tool. Since 1981, Dr. Navabi has been involved in University of Tehran, he has been a Research Fellow at University of Toronto the design, definition and implementation of HDLs. He has written numerous and University of Waterloo in 1998 and 1999, respectively. He has published papers on HDLs, design automation, and digital system test. He started one more than 200 technical papers. Dr. Afzali-Kusha, who is a senior member of the first HDL courses in the US in 1990, and has since conducted short of IEEE, currently serves as an associate editor for the ACM Transactions courses and tutorials in the United States and abroad. In addition to being a on Design Automation of Electronic Systems. His current research interests professor, he is also a consultant to CAE companies. Dr. Navabi received his include network-on-chip, low-power high-performance design methodologies M.S. and Ph.D. from the University of Arizona in 1978 and 1891, and his from the physical design level to the system level for nanoelectronics era. B.S. from the University of Texas at Austin in 1975.