1 Introduction

Power consumption (and dissipation) has become a critical design consideration in modern microprocessors. For battery-powered devices, such as laptop PCs and PDAs, total power consumption is the major issue. For high performance applications such as servers, the need to dissipate high power requires expensive packaging and cooling technologies. Furthermore, in large-scale systems, power consumption can be a major operating expense.

Microprocessors can be made more power efficient at a number of levels, ranging from the circuit level, to the gate level, all the way up to software. Our particular interest is in improving power efficiency at the microarchitecture level. For studying and developing power efficient microarchitectures, power estimation tools are almost essential. An important part of our research effort has therefore been the development of a flexible and accurate power estimation tool, WArPE.

WArPE uses detailed microarchitecture simulation to measure energy-consuming activities and execution time. These simulation-derived measurements can then be turned into power estimates, given energy estimates for each of the activities. WArPE is based on the simplescalar simulator [1], a performance simulator widely used among academic researchers. An important element of power estimation is the energy consumed by each of the modeled microarchitecture-level activities. In WArPE, these energy estimates can be supplied directly by the user as empirical data, or for many important subsystems they can be generated via analytical models that are part of WArPE.

Other power estimation tools based on the simplescalar simulator have been developed [3,4]. WArPE is distinguished from these other estimators in a number of ways: 1) it can take chip technology data as an input and scale energy numbers appropriately; 2) the instruction fetch, decode, rename and issue pipeline is modeled in detail, including latches.

This document describes the internal structure and usage of the WArPE tool. Section 2.0 describes the detailed structure of the simulator, including estimation methodology. Section 3.0 describes the analytical models used, and the following section contains the options, configuration files and output file details. Section 5.0 discusses the file structure of the simulator.

2 WArPE Processor Model

WArPE models a modern dynamically scheduled superscalar processor. The processor is divided into a number of functional unit blocks (FUBs). The processor is simulated in much the same way as in a performance simulator. At the end of each cycle, the estimator determines the activity for each FUB, and uses this activity to estimate the energy consumed by that block. The total energy consumed by all the FUBs during each cycle yields an instantaneous power estimate, and the average over all the cycles gives an average power estimate. The instantaneous power is useful when di/dt is of concern; di/dt can be estimated by computing the difference in power consumption between consecutive cycles.
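The per-cycle bookkeeping described above can be sketched as follows. This is a minimal illustration, not WArPE code: the function names, the FUB energy values and the choice of picojoules and MHz as units are assumptions, chosen so that the arithmetic stays in whole numbers (1 pJ per cycle at 1 MHz corresponds to 1 microwatt).

```python
def cycle_power_uw(fub_energy_pj, clock_mhz):
    """Power in microwatts for one cycle: total FUB energy (pJ) times clock (MHz)."""
    return sum(fub_energy_pj.values()) * clock_mhz


def average_power_uw(powers):
    """Average power over all simulated cycles."""
    return sum(powers) / len(powers)


def power_steps_uw(powers):
    """Cycle-to-cycle power differences, used here as a simple proxy for di/dt."""
    return [later - earlier for earlier, later in zip(powers, powers[1:])]


# Hypothetical energy (pJ) consumed by two FUBs in three consecutive cycles.
trace = [
    {"il1cac": 200, "fuint": 100},
    {"il1cac": 200, "fuint": 300},
    {"il1cac": 0, "fuint": 100},
]
powers = [cycle_power_uw(cycle, clock_mhz=100) for cycle in trace]
average = average_power_uw(powers)
steps = power_steps_uw(powers)
print(powers, average, steps)
```

The list of steps is what would be compared against a di/dt threshold, while the average is the figure usually quoted for a whole run.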

The per-activity energy estimates are determined before the simulator starts. These estimates are determined in one of the following ways.

1) A general analytical model for RAM FUBs
2) A power density model for non-RAM FUBs
3) Latch models (primarily in the instruction pipeline)
4) Special models for critical FUBs such as the issue window

The following sections describe the overall superscalar microarchitecture, including the specific FUBs that are modeled. This is followed by descriptions of the RAM and power density analytical models. The latch models are described along with the instruction pipeline, and special models are described where the specific FUB is discussed.

2.1 Microarchitecture

In this section we touch upon some of the details of how the individual instruction pipeline units are modeled. The generic microarchitecture of a pipelined superscalar processor is shown in Figure 1.

[Figure 1 block diagram. Recoverable block labels: PC, Instruction Cache, ITLB, Branch Predictor, Variable length/Width latches, Register Renaming Logic, Issue Window (Collapsible/Non-Collapsible), Register File, Functional Units, Data Cache, DTLB, Result Bus, Reorder Buffer]

Figure 1: Micro architecture of a simple superscalar processor

The associated units include the branch prediction tables, instruction translation lookaside buffer (ITLB), data caches, data translation lookaside buffer (DTLB), reorder buffer, register file, result bus, etc. For most of these we have an approximate analytical model. There is no analytical model for the latches. We now describe some details of the power models of each pipeline stage.

2.1.1 Instruction Fetch

The instruction fetch stage involves access to the instruction cache, the ITLB and the branch prediction logic. The FUBs representing this stage include the new PC generation logic (npc), the logic associated with branch target buffer access (btblog), the branch target buffer RAM structure itself (btbcac), the return stack buffer (rsbcac), three FUBs for the L1 instruction cache, and the latches at the end of the pipeline stage (fdlatch). The three L1 instruction cache FUBs cover the logic circuits that access the cache (il1log), the L1 tag structure (il1tag) and the physical L1 instruction cache array (il1cac). WArPE has analytical models for almost all of these FUBs. Because most of these structures are cache-like or CAM-like, they have invalidate, replacement, write back, read and write counters associated with them. Figure 2 lists all the counters associated with this stage of execution.

Counter No.  Name of the counter  Description
0            Brupdate             branch update activity
1            Brlookup             branch lookup activity
2            Rsbpop               return stack pop activity
3            Rsbpush              return stack push activity
4            Il1acc               il1 access activity
5            Il1wbk               il1 writebacks activity
6            Il1rep               il1 replacements activity
7            Il1inv               il1 invalidations activity
12           Il2acc               il2 access activity
13           Il2wbk               il2 writebacks activity
14           Il2rep               il2 replacements activity
15           Il2inv               il2 invalidations activity
24           Itlbmis              itlb miss activity
27           Itlbacc              itlb access activity
28           Itlbwbk              itlb writebacks activity
29           Itlbrep              itlb replacements activity
30           Itlbinv              itlb invalidations activity
35           Npc                  next pc logic activity
69           Fdlatch_active       Latch after fetch stage active
70           Fdlatch_stall        Latch after fetch stage stalled
71           Fdlatch_empty        Latch after fetch stage empty

Figure 2. Table of all the activity counts associated with the fetch stage

To build power numbers for these structures, we map each table to an approximate cache structure. The CACTI tools, which are used by almost all existing simulators, do this mapping for us. CACTI finds an optimal cache organization for each table from parameters such as the cache size, associativity and number of sets, assuming that some cache optimizations have been done at the circuit level. The numbers of row and column decoders are calculated from this mapping. The power models for the caches and the decoders are the same as those suggested by Wilton and Jouppi [2]. Currently, there are no analytical models for the write back, replacement or invalidation logic circuits, but the simulator maintains a count of these activities. To calculate the power, we multiply the activity counts by approximate power numbers obtained from industry. The user can, however, supply any numbers and hence customize the simulator.
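The multiplication of activity counts by per-activity numbers can be illustrated with a small sketch. The counter names are those of Figure 2; the counts and the per-event energies (in picojoules) are invented for illustration and are not values shipped with WArPE.

```python
def activity_energy_pj(counts, energy_per_event_pj):
    """Total energy: each activity count times the energy of one such event."""
    missing = set(counts) - set(energy_per_event_pj)
    if missing:
        raise KeyError(f"no energy number supplied for: {sorted(missing)}")
    return sum(counts[name] * energy_per_event_pj[name] for name in counts)


# Hypothetical activity counts collected for the L1 instruction cache.
counts = {"Il1acc": 1000, "Il1wbk": 0, "Il1rep": 12, "Il1inv": 3}
# Hypothetical user-supplied energy per event, in pJ.
energy_per_event_pj = {"Il1acc": 50, "Il1wbk": 80, "Il1rep": 120, "Il1inv": 20}

total_pj = activity_energy_pj(counts, energy_per_event_pj)
print(total_pj)
```

Raising an error for a counter without an energy number, rather than silently treating it as zero, is a design choice of this sketch; it makes an incomplete user table visible.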

At the end of the fetch stage is a set of pipeline latches, which may be of variable width. These latches may be in the Active, Stalled or Empty state, with each state consuming a different amount of energy. The simulator keeps an account of the number of latches in each state per cycle. This gives the power consumed each cycle by the latches. More detail on the latch power model follows in Section 3.3.

2.1.2 Instruction Decode / Dispatch Stage

The decode stage comprises the decoders as well as the register aliasing table associated with the register renaming logic. These units are represented in the simulator by FUBs for the dispatch queue (dispatchq), the instruction decoder (decodepla), the decoder logic for handling mispredictions (decodemisp), the logic for stalling the decoder (decodestall), the register aliasing table (ratarr), the input/output dependence check (ratidep, ratodep), the register aliasing table stall (ratstall) and the latches at the end of the pipe stage (dilatch). There are counters associated with decoder stall and mispredict activity as well as with the decoder access itself. The register aliasing table has counters associated with the table itself as well as with input and output dependence checking activity. A list of all the counters is given in Figure 3. Presently, we have an analytical model only for the register aliasing table cache. The remaining activity counters are multiplied by the power numbers obtained from the user input file (pfa mode).

Counter No.  Name of the counter  Description of the counter
36           Dispatchqrd          dispatchq read activity
37           Dispatchqwr          dispatchq write activity
38           Dispatchqrel         dispatchq release activity
39           Dispatchqrec         dispatchq recover activity
40           Decoder              decoder activity
41           Decodemispchk        decoder mispredict detect activity
42           Decodemisp           decoder mispredict correction activity
43           Decodestallchk       decoder stall detect activity
44           Decodestall          decoder stall block activity
45           Ratidep              rat idep allocation activity
46           Ratodep              rat odep allocation activity
47           Ratstallchk          rat stall detection activity
48           Ratstall             rat stall block activity
72           Dilatch_active       Latch after decode stage active
73           Dilatch_stall        Latch after decode stage stall
74           Dilatch_empty        Latch after decode stage empty

Figure 3: Activity Counters associated with the decode/dispatch stage

The decoded instructions are moved into another set of latches, which again may be of variable size and variable number. These latches may model the delay associated with the renaming logic or with the actual decoding of the instruction. As before, the latches can be in one of three states (Active, Stalled or Empty), each with its own power number; these numbers may be the same as for the previous latches. We maintain a per-cycle record of the state of the latches (Dilatch_active, Dilatch_stall, Dilatch_empty) and calculate their per-cycle contribution to total power.

[Figure 4 diagram. Recoverable labels: Tag Bus, Read Insn, Grant, req0, Inst. 0 through Inst. N-1 with comparators (cmprs.), Wakeup, Select, Inst Bus]

Figure 4: Instruction Issue Window

Another distinguishing feature of this power simulator is its treatment of the issue window. The simulator models both collapsible and non-collapsible instruction issue windows with the same FUB: isw. Some power is associated with collapsing the instruction window. The simulator has a counter to record these movements per cycle (Iswcolmoved), and the user can supply the power associated with them. The issue window can also be viewed as a set of fixed-length latches with the same three states as before. The Active state (Iswact) now corresponds to the number of instructions ready to be issued that cycle, while the Stalled state (Iswstall) corresponds to instructions that are still waiting for their operands to become ready. The Empty state (Iswempty) represents the unoccupied entries of the issue window each cycle. A detailed power model for the issue window is explained in Section 3.4.

2.1.3 Instruction Execution and Writeback

The selected instructions are then issued to the corresponding functional units or are stored in the load/store queues. The FUBs for this stage include those for the integer functional units (fuint), the floating point functional units (fufp), the L1 data cache logic circuit (dl1log), the L1 data cache tag structure (dl1tag), the L1 data cache (dl1cac) and similarly for the unified L2 cache (ul2log, ul2tag, ul2cac), the load/store queue (lsqrdyq) and the data TLB (dtlbcac). The simulator does not have an analytical model for any of the functional units, but the load/store queues can be modeled as a pair of cache-like structures together with a CAM-like structure, with analytical models for both. Another structure associated with the execution stage is the data cache. The simulator models the data cache along the same lines as the instruction cache, using the CACTI tools. There are counters for data cache access (dl2acc), write back (dl2wbk), replacement (dl2rep) and invalidation (dl2inv). The data TLB is modeled along the lines of the instruction TLB and hence uses the CAM-like analytical model. The results generated by the functional units are broadcast on the result bus. The current version of the simulator does not calculate the power consumed by the result bus.

All the activities associated with the initialization and the utilization of the register update unit are represented by the FUBs for the ruu array (ruuarr) and the ruu writeback (ruuwb). A complete list of all the FUBs and all the counters is included in the appendix to this manual. The list of counters associated with this stage is as follows:

Counter No.  Name of the counter  Description of the counter
8            Dl1acc               dl1 access activity
9            Dl1wbk               dl1 writebacks activity
10           Dl1rep               dl1 replacements activity
11           Dl1inv               dl1 invalidations activity
16           Dl2acc               dl2 access activity
17           Dl2wbk               dl2 writebacks activity
18           Dl2rep               dl2 replacements activity
19           Dl2inv               dl2 invalidations activity
20           Ul2acc               ul2 access activity
21           Ul2wbk               ul2 writebacks activity
22           Ul2rep               ul2 replacements activity
23           Ul2inv               ul2 invalidations activity
25           Dtlbmis              dtlb miss activity
26           Ul2mis               ul2 miss activity
31           Dtlbacc              dtlb access activity
32           Dtlbwbk              dtlb writebacks activity
33           Dtlbrep              dtlb replacements activity
34           Dtlbinv              dtlb invalidations activity
45           Ratidep              rat idep allocation activity
46           Ratodep              rat odep allocation activity
47           Ratstallchk          rat stall detection activity
48           Ratstall             rat stall block activity
49           Ruuarr               ruu array activity
50           Ruurdyqsch           ruu readyq allocation activity
51           Ruurec               ruu recover activity
52           Ruuret               ruu retire activity
53           Ruurdyqcam           ruu readyq dependence check activity
54           Ruurdyqrel           ruu readyq resource release activity
55           Lsqarr               lsq array activity
56           Lsqrdyqsch           lsq readyq allocation activity
57           Lsqrec               lsq recover activity
58           Lsqret               lsq retire activity
59           Lsqrdyqcam           lsq readyq dependence check activity
60           Lsqrdyqrel           lsq readyq resource release activity
61           Ruuarb               ruu arbitration activity
62           Ruuwb                ruu writeback scheduler activity
63           Ruuwbq               ruu writebackq activity
64           Lsqarb               lsq arbitration activity
65           Lsqwb                lsq writeback scheduler activity
66           Lsqwbq               lsq writebackq activity
67           Fuint                functional unit integer
68           Fufp                 functional unit floating point

Figure 5: Activity counters associated with the execution and the writeback stage

3 Analytical Models

Architectural power estimation methodologies can be broadly classified into empirical methods and analytical methods. These can further be classified into fixed activity and activity sensitive methods. One of the earliest methods of power estimation was a fixed activity method called the Power Factor Approximation (PFA) method, described by Liu and Svensson [5]. Power estimation techniques have come a long way since then, with activity-based models, transition sensitive models and so on. The basic estimation methodology is, however, the same. We either calculate the power density constants associated with each structure (the analytical model) or take the power constants as input from the user (the pfa model).

3.1 Power Density Model

Several architectural power estimation schemes have been discussed in the literature [6][7]. In WArPE we use a scheme similar to Power Factor Approximation (PFA) [5]. We express the power dissipation in terms of the active/inactive power density of each FUB, the area of the FUB and the activity factor, which is determined via performance simulation:

power = {(active power density)*(activity) + (inactive power density)*(1 – activity)}*area

The power density and area numbers are either determined empirically from the real design and scaled to the required technology or are estimated by considering circuit complexity, logic styles, etc. The power density numbers are further divided based on the following circuit styles:

1) Dynamic logic
2) Static logic
3) PLA circuits
4) Memory type regular circuits
5) Clock circuits

Thus for every FUB, one has to define 5*3 = 15 different numbers, corresponding to the active power density, inactive power density and area for each of the five circuit styles. The user can supply these through the configuration file. However, it is not always possible to obtain or estimate these numbers. To overcome this problem we have included routines that can model FUBs analytically. Presently, we can construct models for most regular memory type structures such as caches, register files, register renaming tables, branch target buffers and reorder buffers. The simulator is designed in such a way that models can be updated and new models can be added relatively easily.

In order to take physical structure into consideration, a few more options have been added. The analytical models can, and in fact will have to, be refined continuously to improve result accuracy. Models for other regular structures like PLAs can also be added.
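The power density expression, applied to the five circuit styles of one FUB, might look as follows. This is a sketch under stated assumptions: the style keys, the use of one activity factor for all five styles, and the density and area values are illustrative and are not taken from the WArPE sources.

```python
STYLES = ("dyn", "sta", "clk", "mem", "pla")


def fub_power(densities, activity):
    """Sum the power density expression over the circuit styles of one FUB.

    densities maps a style to (active density, inactive density, area).
    Assumption: the same activity factor applies to every style.
    """
    total = 0.0
    for style in STYLES:
        active, inactive, area = densities[style]
        total += (active * activity + inactive * (1.0 - activity)) * area
    return total


# Hypothetical FUB built from dynamic and static logic only.
densities = {
    "dyn": (10.0, 1.0, 2.0),
    "sta": (4.0, 0.0, 1.0),
    "clk": (0.0, 0.0, 0.0),
    "mem": (0.0, 0.0, 0.0),
    "pla": (0.0, 0.0, 0.0),
}
half_active = fub_power(densities, activity=0.5)
print(half_active)
```

With an activity of 0 the FUB still dissipates its inactive power (2.0 in these units), which is how the model accounts for blocks that are idle but not gated off.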

3.2 Analytical RAM Model

In the analytical mode, power constants are generated using the analytical models provided. Presently, we can model most regular and simple logic-based structures. The models are based on a circuit time-delay and energy simulation model similar to that used by Wilton and Jouppi [2]. The idea is to break FUBs into smaller components for which analytical models are available. Differences from the Wilton and Jouppi models include a choice of static vs. dynamic logic for the decoder and a single-ended read option for register files. These models can be used to construct power constants for FUBs that contain regular, memory type building blocks. The FUBs that have already been modeled are the instruction and data caches, TLBs, branch target cache, register allocation table and return address stack. Other units that can be modeled are the register update unit and load/store queue arrays.

For example, a cache can be divided into a decoder buffer, row decoder, wordlines, bitlines, sense amplifiers, column decoder and output MUXs. The models generate power numbers by calculating the effective switching capacitance. The effective capacitance is estimated by adding the gate, drain and routing capacitances together. These are calculated by functions that take the width and length of the poly used as inputs. The length of all transistors is assumed to be constant and equal to the Leff defined in the technology file. The list of these functions (included in anal.c) follows.

gatecap(): returns the gate capacitance of the transistor.
gatecappass(): returns the gate capacitance for a pass transistor.
draincapp(): returns the drain capacitance for the p-type transistor. It has the added feature of optimizing for stacked transistors, for example the n-type transistors in a 4-input NAND.
draincapn(): similar function for the n-type transistor.
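How an effective switching capacitance turns into an energy and a power number can be sketched with the standard CMOS switching relation. The functions below are illustrative and are not the anal.c routines; the capacitance values are invented, the supply voltage and clock follow the 0.8um example entry of the technology file, and the sketch assumes that one full charge and discharge of the capacitance dissipates C*Vdd^2.

```python
def effective_capacitance(gate_f, drain_f, routing_f):
    """Effective switching capacitance (farads): gate + drain + routing."""
    return gate_f + drain_f + routing_f


def switching_energy(c_eff_f, vdd):
    """Energy (joules) of one full charge/discharge, assumed to be C * Vdd^2."""
    return c_eff_f * vdd ** 2


def switching_power(c_eff_f, vdd, clock_hz, activity=1.0):
    """Average power (watts) when the node switches in a fraction of cycles."""
    return activity * switching_energy(c_eff_f, vdd) * clock_hz


# Hypothetical node: 20 fF gate, 15 fF drain, 65 fF routing capacitance.
c_eff = effective_capacitance(gate_f=20e-15, drain_f=15e-15, routing_f=65e-15)
energy_j = switching_energy(c_eff, vdd=5.0)
power_w = switching_power(c_eff, vdd=5.0, clock_hz=100e6, activity=0.5)
print(c_eff, energy_j, power_w)
```

Here the node presents about 100 fF, so one switching event costs about 2.5 pJ, and switching in half of the cycles at 100 MHz gives about 0.125 mW.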

The following sections describe each of the basic models provided. An example of the usage of these models to create more complex models will be given in the last chapter.

3.2.1 Decoder Buffer

The decoder buffer, as the name suggests, buffers the address lines that go into the decoders. The buffer is an important element if the address lines feed into a large number of gates. Presently, the sizes of the buffer transistors are fixed. These could be changed depending on the number of gates connected to the lines and the speed required. The following figure shows the buffer architecture.

[Figure 6 diagram. Recoverable labels: VDD; ADDR BITS * 2 (BIT and NBIT); Decoder Buffer; Single Buffer]

Figure 6: Decoder Buffer.

3.2.2 Decoder

Two types of decoder models have been included, depending on the type of circuits they use. The first one is a static decoder that is based on a two-level decoding scheme. The first stage is constructed from 3x8 and 2x4 NAND-based decoders. The second stage consists of an n-input NOR for every output bit, where n is the number of min terms in stage 1. The following schematic brings out the basic architecture of this decoder.

[Figure 7 diagram. Stage 1: N decoders (3x8, 2x4) using NAND gates. Stage 2: N input NOR gate.]

Figure 7: Static decoder schematic

[Figure 8 diagram. Recoverable labels: BIT and NBIT; outputs from four 2-input NAND gates; out; structure of decoder, e.g. 2x4 decoder; second stage NOR gate, e.g. 4-input NOR]

Figure 8: Circuits used in the two stages
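The logic function of the two-level static decoder can be checked with a short behavioral sketch. It models only the Boolean behavior (NAND-based first stage with active-low outputs, one NOR per output row), not the circuit or its power; the function names and the 3-bit plus 2-bit grouping of a 5-bit address are assumptions made for the example.

```python
def nand(*bits):
    return 0 if all(bits) else 1


def nor(*bits):
    return 0 if any(bits) else 1


def predecode(bits):
    """NAND-based k-to-2^k decoder: output j is 0 only when bits encode j."""
    outputs = []
    for j in range(2 ** len(bits)):
        # Pick BIT or NBIT per input so the NAND sees all ones for code j.
        literals = [b if (j >> i) & 1 else 1 - b for i, b in enumerate(bits)]
        outputs.append(nand(*literals))
    return outputs


def static_decode(address, groups=(3, 2)):
    """Two-level decode; groups gives the input width of each stage-1 decoder."""
    stage1, shift = [], 0
    for width in groups:
        bits = [(address >> (shift + i)) & 1 for i in range(width)]
        stage1.append(predecode(bits))
        shift += width
    rows = []
    for row in range(2 ** sum(groups)):
        picks, shift = [], 0
        for width, outputs in zip(groups, stage1):
            picks.append(outputs[(row >> shift) & (2 ** width - 1)])
            shift += width
        # A row is selected only when every stage-1 output feeding it is low.
        rows.append(nor(*picks))
    return rows


selected = static_decode(19)
print(selected.index(1))
```

Each second-stage NOR has one input per first-stage decoder, so the NOR fan-in grows with the number of address groups rather than with the number of rows.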

The second type of decoder is the dynamic decoder, which is based on a domino NOR. The number of inputs to this decoder should, however, be limited to about six. The following figure shows a schematic of the dynamic decoder.




Figure 9: Dynamic decoder.

3.2.3 Wordline

The wordline power model includes both the wordline and the wordline driver. The driver size is computed using a function called WLdriver_size(). The inputs to this function are the capacitance driven and the rise-time expected. The rise-time has been assumed to be period/8 due to lack of data. This can be changed by changing the entry in tech.h. The model also takes into account single ended read type cells, used in register files. A schematic of the wordline is shown below.

[Figure 10 diagram. Recoverable labels: Wordline driver; columns]

Figure 10: Word line.

3.2.4 Bitline

The bitline model takes into account the precharge transistors, line capacitance and isolation transistors. Several minute features have been added and detailed comments in the code explain these. The basic schematic of the bitline is shown below.

[Figure 11 diagram. Recoverable labels: Precharge; Precharge equalizer; Isolation; Pass gate]

Figure 11: Bitline

3.2.5 Sense Amplifier

The sense amplifier is shared by many bitlines using a column MUX. However, one should not multiplex more than eight bitlines together due to leakage issues. The MUX is a standard pass-gate based MUX with a column decoder. The basic architecture and the sense amplifier circuit used are shown below.




Figure 12: Sense Amplifier architecture

Figure 13: Sense Amplifier circuit.

3.2.6 Output driver

The output driver uses an array of tri-state drivers like the one shown in the schematic below.

[Figure 14 diagram. Recoverable labels: VDD; Sense amp out]




Figure 14: Output driver.

3.2.7 Generic mux

This is a standard pass-gate based MUX. The only specifications required are the number of inputs to be multiplexed into one bit and the number of output bits. The generic MUX, as the name suggests, can be used to model a general MUX.

3.2.8 Comparator

The comparator design is shown in Fig. 15.

[Figure 15 diagram. Recoverable labels: Vdd; precharge; a0, na0; b0, nb0; # of bits to compare]

Figure 15: n-bit comparator

3.3 Latch Model

At the end of the fetch stage are the pipeline latches associated with that stage. These latches model the delay incurred in moving instructions from the fetch stage to the decode stage. Such delays could be due to the BTB lookup or to obtaining the branch prediction. The latches may be of variable size, and the number of latches also varies with the delay to be modeled. The latch width varies because some information may be added at a later latch in the pipeline. At any time a latch is in one of three states: Active, meaning that a new instruction was moved into the latch that cycle; Stalled, meaning that the latch is still holding the instruction it held in the previous cycle; or Empty, meaning that the latch is not storing anything that cycle. The power associated with each of these states is different and is read from the input file.

[Figure 16 diagram. Recoverable labels: Issue width; Instruction Decode Pipeline length; I-$, IF, ID1, ID2, ID3, ID4, Issue Queue; Pipeline-Latch width; latches get wider deeper in the pipeline]

Figure 16: Simple Architecture along with the Pipeline latches

This breakdown of energy-consuming activity allows for a form of clock gating where active instructions may consume more energy than stalled instructions, and where valid instructions may consume more energy than invalid ones (i.e. empty pipeline slots). For example, consider the logic shown in Figure 17. Here, a typical pipeline latch is shown, as might appear in the decode pipeline. An input multiplexor (typically built into the latch) is used to "recirculate" latched pipeline values when the hold signal is active. In addition, the valid bit from the preceding stage is used to gate the latch itself; if there is no valid data being fed into the latch, then the latch is not clocked.

[Figure 17 diagram. Recoverable labels: hold from next stage; MUX; Latch; data in; data out; Valid from previous stage]


Figure 17: A Pipeline Latch

A Valid Bit from the previous stage is used to gate the clock signal. A hold signal from the succeeding stage is used to switch the multiplexor and recirculate data being stalled.

In this system, a certain amount of energy is consumed if an instruction moves up the pipeline (the hold signal is inactive) and is latched into the next stage. A different (lower) amount is consumed if the hold signal is active: the multiplexor feeds the same data back into the latch and the latch is clocked, but the logic following the latch does not see any of its inputs change. Finally, a different (still lower) amount of energy is consumed if the valid signal is off and the latch is not clocked at all. Similarly, in the issue queue, a particular issue queue slot may consume different amounts of energy depending on whether or not it holds an active instruction and whether or not the instruction actually issues. The pipeline latches were taken from a high-end design environment. A 2-to-1 static mux was used to recirculate the data when stalled. Each cycle the simulator maintains an account of the latches in the various states and of the total power the latches would consume that cycle. This is one of the innovative ideas in this simulator.

3.4 Special Model for Issue Window

As stated before, the simulator models both collapsible and non-collapsible instruction issue windows with the same FUB: isw. Some power is associated with collapsing the instruction window. The simulator has a counter to record these movements per cycle (Iswcolmoved), and the user can supply the power associated with them. The issue window can also be viewed as a set of fixed-length latches with the same three states as before. The Active state (Iswact) now corresponds to the number of instructions ready to be issued that cycle, while the Stalled state (Iswstall) corresponds to instructions that are still waiting for their operands to become ready. The Empty state (Iswempty) represents the unoccupied entries of the issue window each cycle.
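The three-state latch accounting can be sketched as below. The per-latch energies are invented for illustration (WArPE reads the real numbers from the input file); the only property carried over from the text is the ordering active > stalled > empty. The three counts per cycle play the role of counters such as Fdlatch_active, Fdlatch_stall and Fdlatch_empty.

```python
# Hypothetical energy per latch per cycle, in pJ, for each state.
LATCH_ENERGY_PJ = {"active": 10, "stall": 4, "empty": 1}


def latch_energy_pj(n_active, n_stall, n_empty, energy=LATCH_ENERGY_PJ):
    """Energy consumed in one cycle by a bank of pipeline latches."""
    return (n_active * energy["active"]
            + n_stall * energy["stall"]
            + n_empty * energy["empty"])


# Hypothetical four-wide latch bank over three cycles:
# (active, stalled, empty) latch counts per cycle.
cycles = [(4, 0, 0), (1, 3, 0), (0, 0, 4)]
per_cycle = [latch_energy_pj(*counts) for counts in cycles]
total = sum(per_cycle)
print(per_cycle, total)
```

A fully stalled or empty pipeline stage therefore still contributes energy, but much less than one that latches new instructions every cycle.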

[Figure 18 diagram. Recoverable labels: Tag Bus, Read Insn, Grant, req0, Inst. 0 through Inst. N-1 with comparators (cmprs.), Wakeup, Select, Inst Bus]

Figure 18: Instruction Issue Window

For the issue queue, the wakeup logic is modeled by counting the energy in the comparators. For the selection logic, the energy of one arbiter cell was supplied, and the number of arbiter cells per arbiter was then calculated from the number of entries in the issue queue. We assume one arbiter per issue port (in our case, four issue ports). Every entry in the issue queue has some comparators for tag matching. The wakeup logic involves tag comparison and consists of a level of XOR gates followed by NAND gates. Assuming that the NAND gates are smaller than the XOR gates, the simulator records the power consumed in the XOR gates each cycle. There are counters associated with each of the states of the issue window latches, as well as with data movement between these latches for a collapsible window.

4 Options, Configuration, Output
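A rough sketch of this accounting is given below. The text does not state how the number of arbiter cells follows from the number of entries, so the sketch assumes a tree of 4-input arbiter cells; that assumption, the function names, the number of comparators per entry and all energy values are hypothetical and serve only to show the shape of the calculation.

```python
def arbiter_cells(n_entries, fan_in=4):
    """Cells in one arbiter, ASSUMING a tree of fan_in-input arbiter cells."""
    if n_entries < 1:
        raise ValueError("issue queue must have at least one entry")
    cells, width = 0, n_entries
    while True:
        width = -(-width // fan_in)  # ceiling division: cells at this level
        cells += width
        if width == 1:
            return cells


def select_energy_pj(n_entries, issue_ports, cell_energy_pj):
    """One arbiter per issue port, each built from arbiter_cells() cells."""
    return issue_ports * arbiter_cells(n_entries) * cell_energy_pj


def wakeup_energy_pj(n_entries, comparators_per_entry, comparator_energy_pj):
    """Comparator energy, assuming every comparator is exercised in the cycle."""
    return n_entries * comparators_per_entry * comparator_energy_pj


# Hypothetical 64-entry window, four issue ports, two tag comparators per entry.
select_pj = select_energy_pj(64, issue_ports=4, cell_energy_pj=2)
wakeup_pj = wakeup_energy_pj(64, comparators_per_entry=2, comparator_energy_pj=3)
print(select_pj, wakeup_pj, select_pj + wakeup_pj)
```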

This section describes the options, configuration files and output files used in the WArPE power estimation tool.

4.1 Options

The estimator options (in addition to the underlying simplescalar options) are defined below. These options have been registered in the original simplescalar option database. Implementing these options required modification of some of the original sim-outorder.c code.

–power_config : This option specifies the power simulator configuration file. The file must have read permissions. The default file name is power.txt.

–power_outfile : This option specifies the file into which output statistics are dumped. The default file name is power_output.txt.

–tech_file : This option specifies the technology definition file name. The file must have read permissions. The default file name is technology.def.

–technology : This option specifies the power simulation technology. The technology is defined by an identifier listed in the technology file. Eg. –technology 0.25um. The default technology is 0.8um.

–sim_limit : This option specifies the number of instructions (in millions) at which the simulation stops and data is dumped into the output file.

4.2 Configuration files

Following is a description of the various configuration files used in the WArPE estimator. Configuration files provide an easy and effective way of defining the large number of parameters used in the simulator.

4.2.1 Basic configuration file

This is the file defined by the –power_config option. It defines the power densities, areas, mode of operation i.e. pfa (empirical) or anal (analytical model), power thresholds, and physical partitioning parameters. This file can be generated by saving a Microsoft Excel worksheet in tab delimited text format.

The file has three main options:

1) –global: defines the power and di/dt thresholds for the full chip. The unit is watts.

2) One entry per FUB, with the following fields:

unit: name of the FUB (Functional Unit Block) as defined in power_init().
mode: pfa directs the simulator to use empirical data, i.e. dyn_pda,…,pla_a; anal directs the simulator to use the analytical model for the FUB.
maxpowerth: maximum power threshold for the FUB.
maxdidtth: maximum di/dt threshold for the FUB.
dyn_pda: dynamic circuit power density – active
dyn_pdi: dynamic circuit power density – inactive
dyn_a: dynamic circuit area
sta_pda: static circuit power density – active
sta_pdi: static circuit power density – inactive
sta_a: static circuit area
clk_pda: clock circuit power density – active
clk_pdi: clock circuit power density – inactive
clk_a: clock circuit area
mem_pda: memory type circuit power density – active
mem_pdi: memory type circuit power density – inactive
mem_a: memory type circuit area
pla_pda: PLA power density – active
pla_pdi: PLA power density – inactive
pla_a: PLA circuit area

The units of the power densities are W/m^2, and the units of area are m^2.

3) A "-" followed by the FUB name: specifies the physical partition of that FUB.

Eg. –itlbcac 1 2 1 static dual

In the example given above, the option defines the partition for the itlb. The arguments, in order, are:

First argument: The number of partitions of the wordline. Each partition has a different decoder and wordline driver. The partitions however share sense amplifiers.
Second argument: The number of partitions of the bitline. Each partition has separate sense amplifiers and decoders.
Third argument: Similar to bitline partition but shares decoder.
Fourth argument: The type of logic used for decoders, static or dynamic.
Fifth argument: Defines the read mode, i.e. dual for dual rail and single for single ended (used in small register files).
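A reader for the physical partition entry could be sketched as follows. The field names in the Partition tuple are hypothetical labels for the five arguments described above, assumed to appear in the order of the example; the parser itself is an illustration and not part of WArPE.

```python
from typing import NamedTuple


class Partition(NamedTuple):
    fub: str
    wordline_parts: int        # hypothetical name: wordline partitions
    bitline_parts: int         # hypothetical name: bitline partitions
    shared_decoder_parts: int  # hypothetical name: bitline-like, shared decoder
    decoder_logic: str         # "static" or "dynamic"
    read_mode: str             # "dual" or "single"


def parse_partition(line):
    """Parse an entry such as '-itlbcac 1 2 1 static dual'."""
    tokens = line.split()
    if len(tokens) != 6:
        raise ValueError(f"expected 6 fields, got {len(tokens)}")
    fub = tokens[0].lstrip("-\u2013")  # accept a hyphen or an en dash prefix
    logic, mode = tokens[4], tokens[5]
    if logic not in ("static", "dynamic"):
        raise ValueError(f"unknown decoder logic: {logic}")
    if mode not in ("dual", "single"):
        raise ValueError(f"unknown read mode: {mode}")
    return Partition(fub, int(tokens[1]), int(tokens[2]), int(tokens[3]),
                     logic, mode)


itlb = parse_partition("-itlbcac 1 2 1 static dual")
print(itlb)
```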

4.2.2 Process Technology Data File

This file contains the processing technology data for several generations. It must at least contain the data for the technology defined by the –technology option. Some of the data provided in the technology file is not used presently. It will used in later revisions, e.g. for dual Vt technologies. The format for the technology data is as follows

Eg. 0.8um 0.80 5.00 100 0.75 0.75 1 1

tech: Technology identifier. It should match the identifier supplied using the -technology option.

L: The effective channel length in microns.

Vdd: The drain voltage used in the technology.

f: The clock frequency in MHz.

Vtl: For use in dual voltage circuits. This is the lower threshold voltage.

Vth: The higher threshold voltage.

Iol: Leakage current for the lower threshold voltage in nA/um.

Ioh: Leakage current for the higher threshold voltage in nA/um.
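The eight-field layout of a technology entry can be read with a single sscanf() call. The following sketch is an assumption-laden illustration: the type tech_entry_t and the functions parse_tech_line() and tech_matches() are hypothetical and do not reproduce the code in tech.c.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one line of the technology file. */
typedef struct {
    char   id[16];       /* technology identifier, e.g. "0.18um" */
    double leff_um;      /* effective channel length (microns) */
    double vdd;          /* supply/drain voltage (V) */
    double freq_mhz;     /* clock frequency (MHz) */
    double vt_low;       /* lower threshold voltage (V) */
    double vt_high;      /* higher threshold voltage (V) */
    double ileak_low;    /* leakage current at the lower threshold */
    double ileak_high;   /* leakage current at the higher threshold */
} tech_entry_t;

/* Returns 1 if the line holds a complete entry, 0 otherwise
   (for example for a column-header line). */
int parse_tech_line(const char *line, tech_entry_t *t)
{
    return sscanf(line, "%15s %lf %lf %lf %lf %lf %lf %lf",
                  t->id, &t->leff_um, &t->vdd, &t->freq_mhz,
                  &t->vt_low, &t->vt_high,
                  &t->ileak_low, &t->ileak_high) == 8;
}

/* Returns 1 if the entry is the one requested with -technology. */
int tech_matches(const tech_entry_t *t, const char *wanted)
{
    return strcmp(t->id, wanted) == 0;
}
```

A reader of the technology file could apply parse_tech_line() to each line and stop at the first entry for which tech_matches() succeeds.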

4.3 Output file

This file contains the output power statistics generated after the simulated instructions reach sim_limit or the simulation ends. The file is well formatted and the data is self-explanatory. Sample configuration files and a sample output file are shown below.

-global 10 10
Npclog pfa 1 1 7.72 0.772 3.20E+04 6.05 0.605 2.56E+05 8.43 8.43 3.20E+04 10.75 1.075 0.00E+00 91.75 9.175 0.00E+00
Btblog pfa 1 1 7.72 0.772 0.00E+00 6.05 0.605 2.49E+05 8.43 8.43 1.31E+04 10.75 1.075 0.00E+00 91.75 9.175 0.00E+00
Btbcac anal 1 1 7.72 0.772 1.50E+05 6.05 0.605 9.00E+05 8.43 8.43 1.50E+05 10.75 1.075 1.80E+06 91.75 9.175 0.00E+00
Rsbcac anal 1 1 7.72 0.772 3.85E+04 6.05 0.605 7.70E+04 8.43 8.43 1.93E+04 10.75 1.075 5.78E+04 91.75 9.175 0.00E+00
Itlbcac anal 1 1 7.72 0.772 1.50E+05 6.05 0.605 3.00E+05 8.43 8.43 3.75E+04 10.75 1.075 2.63E+05 91.75 9.175 0.00E+00
dtlbcac anal 1 1 7.72 0.772 1.20E+04 6.05 0.605 4.00E+05 8.43 8.43 4.00E+04 10.75 1.075 2.40E+05 91.75 9.175 0.00E+00
pmhlog pfa 1 1 7.72 0.772 6.00E+04 6.05 0.605 2.00E+05 8.43 8.43 2.00E+04 10.75 1.075 1.20E+05 91.75 9.175 0.00E+00
il1log pfa 1 1 7.72 0.772 2.40E+05 6.05 0.605 1.68E+06 8.43 8.43 2.40E+05 10.75 1.075 2.40E+05 91.75 9.175 0.00E+00
il1tag anal 1 1 7.72 0.772 5.28E+05 6.05 0.605 7.92E+05 8.43 8.43 2.64E+05 10.75 1.075 3.70E+06 91.75 9.175 0.00E+00
il1cac anal 1 1 7.72 0.772 0.00E+00 6.05 0.605 1.32E+06 8.43 8.43 3.30E+05 10.75 1.075 4.95E+06 91.75 9.175 0.00E+00
dl1log pfa 1 1 7.72 0.772 3.60E+05 6.05 0.605 1.68E+06 8.43 8.43 1.20E+05 10.75 1.075 2.40E+05 91.75 9.175 0.00E+00
dl1tag anal 1 1 7.72 0.772 2.64E+05 6.05 0.605 7.92E+05 8.43 8.43 2.64E+05 10.75 1.075 3.96E+06 91.75 9.175 0.00E+00
dl1cac anal 1 1 7.72 0.772 0.00E+00 6.05 0.605 1.32E+06 8.43 8.43 3.30E+05 10.75 1.075 4.95E+06 91.75 9.175 0.00E+00
dispatchq pfa 1 1 7.72 0.772 6.50E+05 6.05 0.605 4.88E+05 8.43 8.43 1.63E+05 10.75 1.075 3.25E+05 91.75 9.175 0.00E+00
decodepla pfa 1 1 7.72 0.772 3.20E+04 6.05 0.605 4.80E+04 8.43 8.43 1.60E+04 10.75 1.075 0.00E+00 91.75 9.175 6.40E+04
decodemisp pfa 1 1 7.72 0.772 0.00E+00 6.05 0.605 7.43E+04 8.43 8.43 8.25E+03 10.75 1.075 0.00E+00 91.75 9.175 0.00E+00
decodestall pfa 1 1 7.72 0.772 0.00E+00 6.05 0.605 5.23E+04 8.43 8.43 2.75E+03 10.75 1.075 0.00E+00 91.75 9.175 0.00E+00
ratarr anal 1 1 7.72 0.772 2.08E+05 6.05 0.605 5.20E+05 8.43 8.43 5.20E+04 10.75 1.075 2.60E+05 91.75 9.175 0.00E+00
ruuarr pfa 1 1 7.72 0.772 9.10E+04 6.05 0.605 1.82E+05 8.43 8.43 4.55E+04 10.75 1.075 1.37E+05 91.75 9.175 0.00E+00
lsqarr pfa 1 1 7.72 0.772 4.55E+04 6.05 0.605 9.10E+04 8.43 8.43 2.28E+04 10.75 1.075 6.83E+04 91.75 9.175 0.00E+00
ruurdyq pfa 1 1 7.72 0.772 1.50E+04 6.05 0.605 2.00E+04 8.43 8.43 2.50E+03 10.75 1.075 1.25E+04 91.75 9.175 0.00E+00
lsqrdyq pfa 1 1 7.72 0.772 7.50E+03 6.05 0.605 1.00E+04 8.43 1250 4.00E+04 10.75 1.075 6.25E+03 91.75 9.175 0.00E+00
ruuarb pfa 1 1 7.72 0.772 1.05E+05 6.05 0.605 6.30E+05 8.43 8.43 1.05E+05 10.75 1.075 2.10E+05 91.75 9.175 0.00E+00
ruuwb pfa 1 1 7.72 0.772 2.00E+05 6.05 0.605 1.20E+06 8.43 8.43 2.00E+05 10.75 1.075 4.00E+05 91.75 9.175 0.00E+00
lsqarb pfa 1 1 7.72 0.772 1.05E+05 6.05 0.605 6.30E+05 8.43 8.43 1.05E+05 10.75 1.075 2.10E+05 91.75 9.175 0.00E+00
lsqwb pfa 1 1 7.72 0.772 2.00E+05 6.05 0.605 1.20E+06 8.43 8.43 2.00E+05 10.75 1.075 4.00E+05 91.75 9.175 0.00E+00
fuint pfa 1 1 7.72 0.772 8.50E+04 6.05 0.605 2.38E+05 8.43 8.43 1.70E+04 10.75 1.075 0.00E+00 91.75 9.175 0.00E+00
fufp pfa 1 1 7.72 0.772 1.13E+05 6.05 0.605 3.15E+05 8.43 8.43 2.25E+04 10.75 1.075 0.00E+00 91.75 9.175 0.00E+00
ul2log pfa 1 1 7.72 0.772 1.44E+05 6.05 0.605 6.72E+05 8.43 8.43 4.80E+04 10.75 1.075 9.60E+04 91.75 9.175 0.00E+00
ul2tag anal 1 1 7.72 0.772 3.60E+05 6.05 0.605 2.88E+06 8.43 8.43 3.60E+05 10.75 1.075 3.60E+06 91.75 9.175 0.00E+00
ul2cac anal 1 1 7.72 0.772 1.50E+06 6.05 0.605 6.00E+06 8.43 8.43 0.00E+00 10.75 1.075 2.25E+07 91.75 9.175 0.00E+00
Biu pfa 1 1 7.72 0.772 5.00E+05 6.05 0.605 4.00E+06 8.43 8.43 5.00E+05 10.75 1.075 0.00E+00 91.75 9.175 0.00E+00
fdlatch_0 pfa 1 1 86 34 10 0 0 0 0 0 0 0 0 0 0 0 0
fdlatch_1 pfa 1 1 86 34 10 0 0 0 0 0 0 0 0 0 0 0 0
fdlatch_3 pfa 1 1 86 34 10 0 0 0 0 0 0 0 0 0 0 0 0
fdlatch_4 pfa 1 1 86 34 10 0 0 0 0 0 0 0 0 0 0 0 0
dilatch_0 pfa 1 1 86 34 10 0 0 0 0 0 0 0 0 0 0 0 0
dilatch_1 pfa 1 1 86 34 10 0 0 0 0 0 0 0 0 0 0 0 0
dilatch_2 pfa 1 1 86 34 10 0 0 0 0 0 0 0 0 0 0 0 0
dilatch_3 pfa 1 1 86 34 10 0 0 0 0 0 0 0 0 0 0 0 0
isw pfa 1 1 86 34 10 0 0 0 0 0 0 0 0 0 0 0 0

-dl1cac 1 1 1 static dual
-dl1tag 1 1 1 static dual
-dl2cac 1 1 1 static dual
-dl2tag 1 1 1 static dual
-il1cac 1 1 1 static dual
-il1tag 1 1 1 static dual
-il2cac 1 1 1 static dual
-il2tag 1 1 1 static dual
-dtlbcac 1 1 1 static dual
-itlbcac 1 1 1 static dual
-btbcac 1 1 1 static dual
-regfile 1 1 1 static single

Figure 19: Basic Configuration File

tech     L(um)   Vdd(V)   f(MHz)   Vtl(V)   Vth(V)   Iol(nA/um)   Ioh(nA/um)

0.8um    0.80    5.00     100      0.75     0.75     0.01         0.01
0.6um    0.60    3.30     200      0.65     0.65     0.01         0.01
0.35um   0.35    2.50     300      0.55     0.55     0.1          0.1
0.25um   0.25    1.50     450      0.45     0.45     0.1          0.1
0.18um   0.18    1.05     700      0.35     0.35     1            0.1
0.15um   0.15    1.00     1000     0.30     0.35     1            0.1
0.13um   0.13    1.00     1500     0.28     0.35     1            0.1
0.1um    0.10    0.75     2250     0.25     0.35     1            0.1
0.07um   0.70    0.60     3300     0.25     0.35     10           0.1

Figure 20: Technology File.

Sun May 19 17:07:59 2002
Power simulation checkpoint at 200000051 instructions

functional    cumulative   maximum      maximum      maximum power   maximum didt
block name    power        power        didt power   violations      violations

npclog        4.354e+06    8.262e+06    7.813e+06    0               0
btblog        6.775e+05    8.097e+06    7.835e+06    0               0
btbcac        1.59e+06     2.135e+07    2.092e+07    0               0
itlbcac       2.293e+05    4.446e+05    4.335e+05    0               0
rsbcac        3.414e+05    1.546e+06    1.245e+06    0               0
dtlbcac       4.024e+06    3.801e+07    3.716e+07    0               0
pmhlog        4.667e+05    3.132e+06    3.132e+06    0               0
il1log        3.548e+07    6.648e+07    6.3e+07      0               0
il1tag        1.071e+08    2.033e+08    1.962e+08    0               0
il1cac        1.062e+07    2.029e+07    1.979e+07    0               0
dl1log        1.338e+07    1.819e+08    1.628e+08    0               0
dl1tag        4.12e+07     5.679e+08    5.091e+08    0               0
dl1cac        1.485e+07    2.117e+08    1.905e+08    876705          0
dispatchq     0            0            0            0               0
decodepla     0            0            0            0               0
decodemisp    0            0            0            0               0
decodestall   0            0            0            0               0
ratarr        8.569e+07    2.715e+08    2.384e+08    0               0
ruuarr        2.734e+07    1.864e+08    1.133e+08    0               0
lsqarr        4.258e+06    2.924e+07    2.741e+07    0               0
ruurdyq       1.041e+06    7.845e+06    6.668e+06    0               0
lsqrdyq       7.525e+06    2.3e+07      1.464e+07    0               0
ruuarb        3.15e+07     2.795e+08    1.242e+08    0               0
ruuwb         7.137e+07    1.775e+08    1.745e+08    0               0
lsqarb        3.267e+07    2.795e+08    1.242e+08    0               0
lsqwb         2.487e+07    1.627e+08    1.597e+08    0               0
fuint         3.489e+06    8.958e+06    8.605e+06    0               0
fufp          4.671e+05    5.928e+06    5.461e+06    0               0
ul2log        1.833e+06    5.953e+07    5.85e+07     0               0
ul2tag        1.653e+07    5.574e+08    5.485e+08    0               0
ul2cac        1.352e+07    8.154e+08    8.102e+08    0               0
biu           8.242e+06    2.582e+08    2.512e+08    0               0
isw           1.625e+06    0            1.311e+06    0               0
fdlatch_0     6.458e+04    9.83e+04     7.782e+04    0               0
fdlatch_1     6.442e+04    9.83e+04     7.782e+04    0               0
fdlatch_2     6.387e+04    9.83e+04     7.782e+04    0               0
fdlatch_3     6.329e+04    9.83e+04     7.782e+04    0               0
dilatch_0     6.24e+04     9.83e+04     7.782e+04    0               0
dilatch_1     6.167e+04    9.83e+04     7.782e+04    0               0
dilatch_2     6.133e+04    9.83e+04     7.782e+04    0               0
dilatch_3     5.725e+04    9.83e+04     7.782e+04    0               0

Global statistics:
Total power = 566797441.827776
Maximum power = 3490027519.397630
Maximum didt power = 3198001037.129858
Power violations = 19489894
Didt power violations = 1204832

Figure 21: Output File.

5 File Structure

The simulator is essentially based on Simplescalar [1]. Care has been taken to keep the power simulation functions in separate files, thus minimizing modification of the original code. However, in some places it was unavoidable, or simply much more convenient, to modify the original Simplescalar files. The file structure is as follows.

power.c: The main power number generation file. It contains routines for power calculation. Any new power calculation routines, e.g. clock-gated power calculation, should be included in this file.

power.h: This file contains all the declarations for variables, structures and functions, and the definitions used in power.c.

anal.c: Contains all the analytical models. Any new models developed should be placed in this file.

anal.h: Contains declarations and definitions for variables and functions used in anal.c.

tech.c: Technology processing file. Reads from the technology file and calculates scaling factors for the required technology. The base technology used is 0.8 um, and all simulations are performed by scaling the 0.8 um technology.

tech.h: Contains all the device size definitions for the 0.8 um base technology.

sim-outorder.c and main.c have also been modified as described later.

5.1.1 power.h

As mentioned earlier, power.c contains routines for power computation and power.h is the supporting header file. The simulator is designed using a FUB-centric approach. All the power numbers specific to an FUB are stored together in one structure. The structure is shown below. Not all the elements are used. Some of them are present for future expansion.

typedef struct {
    char name[32];                /* FUB name */

    double active_power;          /* active power, empirical mode */
    double active_power_rd;       /* active power, read (analytical mode) */
    double active_power_wr;       /* active power, write (analytical mode) */
    double static_power;
    double inactive_power;

    double active_power_lt;       /* latch power values */
    double stall_power_lt;
    double empty_power_lt;

    double active_power_cg;       /* clock-gated power values */
    double active_power_wr_cg;
    double active_power_rd_cg;
    double inactive_power_cg;

    double maxpowerth;            /* thresholds from the configuration file */
    double maxdidtth;

    double cum_power;             /* running statistics */
    double prev_power;
    double max_power;
    double max_didt;
    double max_powerx;            /* threshold violation counts */
    double max_didtx;
} fub_t;

The element name stores the name of the FUB, which can be at most 31 characters long, since one of the 32 bytes is needed for the terminating null character. The next five elements store power numbers whose meaning is evident from their names. Note that active power comes in three flavors. When using the empirical method, only active_power is used. It is the sum of the (power density)*(area) products for the five different circuit styles. When analytical models are used, the read and write operations can be separated, and these give different power consumptions, hence the rd and wr suffixes. The element inactive_power is presently redundant but can be used in the empirical mode for standby mode. The next three elements are power values for latches only. The next four elements are the clock-gated power numbers, which are presently not used. Notice that clock gating does not affect static power, and hence static_power_cg is not present. The elements maxpowerth and maxdidtth are the maximum power and maximum di/dt power thresholds for the FUB. These values are defined in the configuration file. cum_power accumulates the power after every cycle and is finally divided by the number of cycles to get the average power dissipated. prev_power, max_power and max_didt are the previous cycle power, maximum power and maximum di/dt power respectively. Finally, max_powerx and max_didtx keep track of the number of threshold violations.
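The two calculations mentioned above, the empirical active power and the average power, can be written down in a few lines. This is a minimal sketch under stated assumptions: the enum and the function names empirical_power() and average_power() are hypothetical and are not the routines in power.c.

```c
/* The five circuit styles of the empirical (pfa) mode. */
enum { DYN, STA, CLK, MEM, PLA, NUM_STYLES };

/* Sum of (power density) * (area) over the five circuit styles. */
double empirical_power(const double density[NUM_STYLES],
                       const double area[NUM_STYLES])
{
    double p = 0.0;
    int i;

    for (i = 0; i < NUM_STYLES; i++)
        p += density[i] * area[i];
    return p;
}

/* Average power: accumulated per-cycle power divided by the cycle count.
   Returning 0 for an empty run is a choice made for this sketch. */
double average_power(double cum_power, unsigned long cycles)
{
    return cycles ? cum_power / (double)cycles : 0.0;
}
```

The same empirical_power() form applies to the active and to the inactive densities; only the density array passed in changes.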

A similar structure of type glb_power_t is used to track the full chip power numbers. Its elements are essentially the sums of the corresponding elements of the FUB structures. Another important structure is power_t, which is used to exchange power numbers. It has three elements, active_power_rd, active_power_wr and static_power, which are self-explanatory.

The activity counts are tracked using two arrays of counters, one for present cycle counts and the other for cumulative counts. Specific counters can be accessed by using the counter name as the index, e.g. pres_count[Ruuarr]. Ninety-three counters have presently been declared. New counters can be added simply by adding their names to the #define list and updating NUM_POWER_COUNTERS. As a convention, only the first character of the counter name is capitalized.
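The counter convention can be sketched as follows. Only pres_count, NUM_POWER_COUNTERS and the counter names come from the text; the array name cum_count, the helper functions, the shortened counter list and the choice to bump both arrays at the time of the event are assumptions made for this illustration.

```c
/* Shortened, illustrative counter list; the real list is much longer. */
#define Brupdate 0
#define Brlookup 1
#define Rsbpop   2
#define NUM_POWER_COUNTERS 3   /* must be updated when a counter is added */

static unsigned long pres_count[NUM_POWER_COUNTERS];  /* present cycle */
static unsigned long cum_count[NUM_POWER_COUNTERS];   /* cumulative (assumed name) */

/* Record one activity event for the given counter. */
void count_activity(int counter)
{
    pres_count[counter]++;
    cum_count[counter]++;
}

/* At the end of a cycle only the present-cycle counts are cleared. */
void end_of_cycle(void)
{
    int i;

    for (i = 0; i < NUM_POWER_COUNTERS; i++)
        pres_count[i] = 0;
}
```

With this layout, adding a counter amounts to one new #define and an increment of NUM_POWER_COUNTERS, as described above.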

As more and more features are added to the simulator, new elements can be added to these structures and new counters can be defined for more detail/functionality. This makes the simulator amenable to future development.

Finally, there is a structure that is used to maintain the power parameter database. The structure type is called power_db. It stores the following data.

name: Name of a FUB/variable/file.

S: The number of sets in a cache-like structure, or the value of a variable, for example the decode width.

A: Associativity.

B: The block size in number of bits.

b: The output size in bits.

nwl, nbl, nsp, logic, rd_mode: as defined in section 4.2.1.

The power_db structure is also used to store the various filenames. By convention, the first element of the database has the name "root". The next element's name is the configuration filename. The third element's name is the output filename. The fourth is the technology filename and the fifth is the technology identifier. This was found to be a way to avoid the addition of an extra field to the database. All other elements are then added in any order. This concludes the discussion of the important structures used. All other structures are self-explanatory.
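The reserved-slot convention can be illustrated with a small array-backed database. Everything named here (db_entry_t, the DB_* indices, db_add_name(), db_find(), the field types and sizes) is hypothetical; the sketch only mirrors the field list and the ordering rule described above and is not the add_param()/get_param() code.

```c
#include <string.h>

/* Hypothetical entry layout; field names follow the list above. */
typedef struct {
    char name[64];        /* FUB, variable, or file name */
    int  S, A, B, b;      /* sets (or a plain value), associativity,
                             block size in bits, output size in bits */
    int  nwl, nbl, nsp;   /* physical partitioning */
    char logic[8];
    char rd_mode[8];
} db_entry_t;

/* Positions reserved at the front of the database. */
enum { DB_ROOT, DB_CONFIG_FILE, DB_OUTPUT_FILE,
       DB_TECH_FILE, DB_TECH_ID, DB_FIRST_PARAM };

#define DB_MAX 64
static db_entry_t db[DB_MAX];
static int db_len = 0;

/* Append a zeroed entry with the given name; returns its index, -1 if full. */
int db_add_name(const char *name)
{
    if (db_len >= DB_MAX)
        return -1;
    memset(&db[db_len], 0, sizeof db[db_len]);
    strncpy(db[db_len].name, name, sizeof db[db_len].name - 1);
    return db_len++;
}

/* Look up an ordinary parameter, skipping the reserved slots. */
db_entry_t *db_find(const char *name)
{
    int i;

    for (i = DB_FIRST_PARAM; i < db_len; i++)
        if (strcmp(db[i].name, name) == 0)
            return &db[i];
    return NULL;
}
```

Starting the search at DB_FIRST_PARAM keeps a file name from being mistaken for a parameter of the same name, which is one way to live with the filenames sharing the name field.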

5.1.2 power.c

power.c contains power estimation routines and option handling routines. These routines are described below.

add_param(), get_param()
These functions are used to add parameters to, and retrieve parameters from, the power simulation database. The former adds a structure of type power_db to the database, while the latter retrieves one from the database.

search_opt(), print_opt()
search_opt() is used to retrieve the physical structure parameters (nwl, nbl, nsp, logic style, read mode) for a given option name. print_opt() prints all the elements of the power parameter database in tabular form. It is helpful in debugging.

dump_fub_stats()
This function dumps all the power statistics to the screen or into the specified file. The file dump is selected by mode = 0 and the screen dump by mode ≠ 0.

power_init()
This function allocates memory for all the FUB structures and calls init() on each FUB. It also reads the thresholds specified by the -global option and initializes the global power structure.

init()
This function reads the power densities and areas of the FUBs from the basic configuration file in the case of the pfa mode. If the mode is anal, it just calls calc_anal(). The function initializes all the power variables inside the structure. Finally, it adds the FUB to the FUB database.

calc_anal(), array_power()
These functions calculate the power numbers in anal mode. calc_anal() calls array_power(), which in turn calls routines from anal.c to generate the power constants.

power_update()
All the functions mentioned above are called only at the beginning of the simulation. This routine, however, is called every cycle to update the power variables. power_update() multiplies the access counts by the active power constants if the count is non-zero, and otherwise uses the inactive power constants. Presently, no clock-gating feature is incorporated, but the infrastructure has already been laid. The function also checks for power threshold and di/dt threshold violations. At the end of the function, the present cycle power counters are reset, whereas the cumulative counts keep accumulating.
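The per-cycle bookkeeping described for power_update() can be sketched for a single FUB as follows. The type fub_sketch_t and the function fub_update() are hypothetical and reduced to one counter per FUB; in particular, treating the di/dt power as the absolute change in power between consecutive cycles is an assumption of this sketch, not something stated in the text.

```c
/* Reduced, hypothetical view of the per-FUB power structure. */
typedef struct {
    double active_power, inactive_power;
    double maxpowerth, maxdidtth;
    double cum_power, prev_power;
    double max_power, max_didt;
    double max_powerx, max_didtx;   /* violation counts */
} fub_sketch_t;

/* One cycle of bookkeeping for one FUB and its present-cycle counter. */
void fub_update(fub_sketch_t *f, unsigned long *pres_count)
{
    /* Active constant scaled by the access count, else the inactive constant. */
    double p = (*pres_count > 0)
                   ? (double)*pres_count * f->active_power
                   : f->inactive_power;
    /* Assumed di/dt measure: absolute cycle-to-cycle power change. */
    double didt = (p > f->prev_power) ? p - f->prev_power
                                      : f->prev_power - p;

    f->cum_power += p;
    if (p > f->max_power)    f->max_power = p;
    if (didt > f->max_didt)  f->max_didt = didt;
    if (p > f->maxpowerth)   f->max_powerx += 1.0;
    if (didt > f->maxdidtth) f->max_didtx += 1.0;

    f->prev_power = p;
    *pres_count = 0;   /* present-cycle count is reset every cycle */
}
```

Dividing cum_power by the number of simulated cycles at the end of the run then yields the average power reported in the output file.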

5.1.3 anal.h

This is the header file for anal.c. It contains all the function declarations for the functions present in anal.c.

5.1.4 anal.c

This file contains all the analytical models. The analytical models are described in more detail in section 3. In this section we describe the interfaces of all the functions in anal.c.

decoder_buffer_power()
This function takes the number of address bits and the number of rows as inputs and generates power constants for the decoder buffer. The decoder buffer is meant to feed all the decoders needed for an array. Presently, the size of the buffer is constant; in the future it can be made dependent on the number of decoders that it feeds.

decoder_power()
This function generates the power numbers for the decoder. It takes the number of rows and the logic style as inputs.

routing_power()
This function estimates the power dissipated due to the routing in the decoder. It takes rows, columns and cell type as inputs. It needs the number of columns as an input because the decoder buffer is assumed to be at the center of all the partitions, as explained in section 3.

wordline_power()
This function calculates the power for the wordline, including the wordline driver. The wordline driver size depends on the number of columns and on the particular kind of memory cell used (i.e. read mode and cell size), both of which are inputs. The size is then calculated using the WLdriver_size() function [].

bitline_power()
This function calculates the power for the bitlines, including the precharge and isolation transistors. It takes the number of rows, columns, cell type and read mode as inputs. In the single ended read mode, no pre-charging is used. Instead, the bitlines are driven by the cell transistors. Hence, this scheme can be used for relatively small structures like register files.

senseamp_power()
This is used for calculating the sense amplifier power constants. It is assumed that the nodes of the sense amp are charged by a separate pre-charge circuit. The inputs to this function are the number of sense amps and the number of bitlines sharing one sense amp.

outmux_power()
This function calculates the power for the output MUX. The inputs to the function are the number of inputs to the MUX and the number of outputs.

compare_power()
This function calculates the power for the comparator. This model is useful for tag arrays and register update unit type FUBs.

genmux_power()
This calculates constants for a generic MUX. The inputs to the function are the number of output bits and the number of bits being multiplexed into one bit.

driver_size()
This function calculates the driver size for driving a capacitance with a desired rise time. The capacitance and rise time are inputs. The voltage swing is assumed to be from 0 to Vdd.

bldriver_size()
This is similar to driver_size() except that the voltage swing is Vsense-Vprecharge. This function is mainly used to calculate pre-charge transistor sizes for bit lines in low-power cache implementations.

gatecap(), gatecappass()
These functions are used to calculate the gate capacitance for a given transistor width and poly length. The latter is used specifically for pass transistors.

draincapp(), draincapn()
These are used to calculate the drain capacitance for p-type and n-type transistors respectively. They also take the number of stacked transistors as an input to optimize the configuration [].

leakage()
This function calculates the leakage power, or static power, for a given transistor size with a given threshold. Presently, it is a very rough calculation, and much more work can be done in the future.

log2()
This function returns the logarithm to the base two, rounded down to the nearest integer. It is mainly used for address bit calculations for a given number of rows.
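A rounded-down base-two logarithm of the kind described for log2() can be computed with integer shifts. The sketch below is an illustration only: the name floor_log2() is hypothetical (chosen to avoid clashing with log2() from the C math library), and returning 0 for an input of 0 is a choice made here, not documented behavior.

```c
/* Floor of log base two, computed by counting right shifts.
   Returns 0 for n == 0 and n == 1. */
int floor_log2(unsigned long n)
{
    int bits = 0;

    while (n > 1) {
        n >>= 1;
        bits++;
    }
    return bits;
}
```

For an array with 512 rows this gives 9 address bits. For a row count that is not a power of two the result is rounded down, so 9 rows give 3.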

5.1.5 sim-outorder.c, main.c

These files have been slightly modified for the power simulator. Following is a list of changes made.

1 In main.c, a power option database called pow_odb has been added. This is used in sim_print_stats() to dump the power statistics. Another change made is the power_init() function call added after sim_init() to initialize the power simulation.

2 In sim-outorder.c, several global variables have been added. These have been well commented. In sim_reg_options(), the five new options have been registered. The power_update() function call has been added in sim_main(). And finally, power_database() has been added. This function essentially processes options and adds them to the power database for use in the analytical models.

5.2 Control Flow

The following flowchart depicts the control flow for the power simulation.


sim-outorder:sim_reg_options() registers the power options into the options database.

sim-outorder:power_database() creates the power database using options read from the configuration file and the options database.

power.c:init()

anal.c:decoder_buffer_power(), decoder_power(), routing_power(), wordline_power(), bitline_power(), senseamp_power(), outmux_power()

main.c

power.c:power_update() every cycle

power.c:dump_fub_stats()

This completes the control flow description of the main functions in the power simulator.

6 References

[1] D. Burger and T. Austin. The simplescalar tool set, version 2.0, Technical report, Computer Sciences Department, University of Wisconsin, June 1997.

[2] S.J.E. Wilton and N.P. Jouppi. An Enhanced Access and Cycle Time Model for On-Chip Caches, Western Research Laboratory Report, May 1993.

[3] D. Brooks, V. Tiwari, M. Martonosi. Wattch: A Framework for Architectural-Level Power Analysis and Optimizations, in Proc. International Symposium on Computer Architecture, Jun. 2000.

[4] N. Vijaykrishnan, M. Kandemir, M. J. Irwin, H. S. Kim, and W. Ye. Energy-driven integrated hardware-software optimizations using SimplePower, in Proc. International Symposium on Computer Architecture, Jun. 2000.

[5] D. Liu and C. Svensson. Power Consumption Estimation in CMOS VLSI Chips. IEEE Journal of Solid-State Circuits, 29(6), pp. 663-670, Jun. 1994.

[6] P. Landman and J. Rabaey. Activity-Sensitive Architectural Power Analysis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 15(6), page 571, Jun. 1996.

[7] R. Chen, M. Irwin, and R. Bajwa. An architectural level power estimator. In Power-Driven Microarchitecture Workshop at ISCA25, 1998.

Appendix

Sl.  Name of the   Description                             Models
No.  FUB                                                   supported

1    npclog        Next pc generation logic                PFA
2    btblog        BTB logic                               PFA
3    btbcac        BTB cache                               PFA/Anal
4    itlbcac       Instruction TLB                         PFA/Anal
5    rsbcac        Return Stack Buffer                     PFA/Anal
6    dtlbcac       Data TLB                                PFA/Anal
7    pmhlog        Page miss handler                       PFA
8    il1log        L1 instruction cache logic              PFA
9    il1tag        L1 instruction cache tag                PFA/Anal
10   il1cac        L1 instruction cache array              PFA/Anal
11   dl1log        L1 data cache logic                     PFA
12   dl1tag        L1 data cache tag                       PFA/Anal
13   dl1cac        L1 data cache array                     PFA/Anal
14   dispatchq     Dispatch Queue                          PFA
15   decodepla     Instruction decoder                     PFA
16   decodemisp    Misprediction handling logic            PFA
17   decodestall   Decoder Stall logic                     PFA
18   ratarr        Register Aliasing table                 PFA/Anal
19   ruuarr        Register update unit / reorder buffer   PFA
20   lsqarr        Load/Store queue                        PFA
21   ruurdyq       Reorder ready queue                     PFA
22   lsqrdyq       Load/Store ready queue                  PFA
23   ruuarb        Reorder arbitration logic               PFA
24   ruuwb         Reorder write back scheduler            PFA
25   lsqarb        Load/store arbitration logic            PFA
26   lsqwb         Load/store write back scheduler         PFA
27   fuint         Integer functional unit                 PFA
28   fufp          Floating point functional unit          PFA
29   ul2log        Unified L2 cache logic                  PFA
30   ul2tag        Unified L2 cache tag                    PFA/Anal
31   ul2cac        Unified L2 cache array                  PFA/Anal
32   biu           Bus/IO unit                             PFA
33   fdlatch       Fetch Decode latch                      PFA
34   dilatch       Decode Issue Latch                      PFA
35   isw           Instruction Issue Window                PFA

Table of FUBs: Shows the various functional unit blocks with the models existing in the simulator.
PFA: Power Factor Approximation
Anal: Analytical models exist

Sl   Name of the      Associated FUB                  Description
No.  counter

0    Brupdate         BTB cache                       branch update activity
1    Brlookup         BTB cache                       branch lookup activity
2    Rsbpop           Return Stack Buffer             return stack pop activity
3    Rsbpush          Return Stack Buffer             return stack push activity
4    Il1acc           L1 Instruction cache            il1 access activity
5    Il1wbk           L1 Instruction cache            il1 writebacks activity
6    Il1rep           L1 Instruction cache            il1 replacements activity
7    Il1inv           L1 Instruction cache            il1 invalidations activity
8    Dl1acc           L1 Data cache                   dl1 access activity
9    Dl1wbk           L1 Data cache                   dl1 writebacks activity
10   Dl1rep           L1 Data cache                   dl1 replacements activity
11   Dl1inv           L1 Data cache                   dl1 invalidations activity
12   Il2acc           L2 Instruction cache            il2 access activity
13   Il2wbk           L2 Instruction cache            il2 writebacks activity
14   Il2rep           L2 Instruction cache            il2 replacements activity
15   Il2inv           L2 Instruction cache            il2 invalidations activity
16   Dl2acc           L2 Data cache                   dl2 access activity
17   Dl2wbk           L2 Data cache                   dl2 writebacks activity
18   Dl2rep           L2 Data cache                   dl2 replacements activity
19   Dl2inv           L2 Data cache                   dl2 invalidations activity
20   Ul2acc           L2 Unified cache                ul2 access activity
21   Ul2wbk           L2 Unified cache                ul2 writebacks activity
22   Ul2rep           L2 Unified cache                ul2 replacements activity
23   Ul2inv           L2 Unified cache                ul2 invalidations activity
24   Itlbmis          Instruction TLB                 itlb miss activity
25   Dtlbmis          Data TLB                        dtlb miss activity
26   Ul2mis           L2 Unified cache                ul2 miss activity
27   Itlbacc          Instruction TLB                 itlb access activity
28   Itlbwbk          Instruction TLB                 itlb writebacks activity
29   Itlbrep          Instruction TLB                 itlb replacements activity
30   Itlbinv          Instruction TLB                 itlb invalidations activity
31   Dtlbacc          Data TLB                        dtlb access activity
32   Dtlbwbk          Data TLB                        dtlb writebacks activity
33   Dtlbrep          Data TLB                        dtlb replacements activity
34   Dtlbinv          Data TLB                        dtlb invalidations activity
35   Npc              Next pc generation logic        next pc logic activity
36   Dispatchqrd      Dispatch Queue                  dispatchq read activity
37   Dispatchqwr      Dispatch Queue                  dispatchq write activity
38   Dispatchqrel     Dispatch Queue                  dispatchq release activity
39   Dispatchqrec     Dispatch Queue                  dispatchq recover activity
40   Decoder          Instruction decoder             decoder activity
41   Decodemispchk    Instruction decoder             decoder mispredict detect activity
42   Decodemisp       Instruction decoder             decoder mispredict correction activity
43   Decodestallchk   Instruction decoder             decoder stall detect activity
44   Decodestall      Instruction decoder             decoder stall block activity
45   Ratidep          Register Aliasing table         rat idep allocation activity
46   Ratodep          Register Aliasing table         rat odep allocation activity
47   Ratstallchk      Register Aliasing table         rat stall detection activity
48   Ratstall         Register Aliasing table         rat stall block activity
49   Ruuarr           Reorder buffer                  ruu array activity
50   Ruurdyqsch       Reorder buffer                  ruu readyq allocation activity
51   Ruurec           Reorder buffer                  ruu recover activity
52   Ruuret           Reorder buffer                  ruu retire activity
53   Ruurdyqcam       Reorder buffer                  ruu readyq dependence check activity
54   Ruurdyqrel       Reorder buffer                  ruu readyq resource release activity
55   Lsqarr           Load/Store queue                lsq array activity
56   Lsqrdyqsch       Load/Store queue                lsq readyq allocation activity
57   Lsqrec           Load/Store queue                lsq recover activity
58   Lsqret           Load/Store queue                lsq retire activity
59   Lsqrdyqcam       Load/Store queue                lsq readyq dependence check activity
60   Lsqrdyqrel       Load/Store queue                lsq readyq resource release activity
61   Ruuarb           Reorder buffer                  ruu arbitration activity
62   Ruuwb            Reorder buffer                  ruu writeback scheduler activity
63   Ruuwbq           Reorder buffer                  ruu writebackq activity
64   Lsqarb           Load/Store queue                lsq arbitration activity
65   Lsqwb            Load/Store queue                lsq writeback scheduler activity
66   Lsqwbq           Load/Store queue                lsq writebackq activity
67   Fuint            Integer functional unit         integer functional unit
68   Fufp             Floating point functional unit  floating point functional unit
69   Fdlatch_active   Fetch Decode latch              Latch after fetch stage active
70   Fdlatch_stall    Fetch Decode latch              Latch after fetch stage stalled
71   Fdlatch_empty    Fetch Decode latch              Latch after fetch stage empty
72   Dilatch_active   Decode Issue Latch              Latch after decode stage active
73   Dilatch_stall    Decode Issue Latch              Latch after decode stage stall
74   Dilatch_empty    Decode Issue Latch              Latch after decode stage empty
75   Iswact           Instruction Issue Window        Issue window latch active
76   Iswstall         Instruction Issue Window        Issue window latch stalled
77   Iswempty         Instruction Issue Window        Issue window latch empty
78   Iswcolmoved      Instruction Issue Window        Collapsible Issue window latch moved

Table of Counters: Note that the number of counters varies with the number of latches. If there are three latches after the fetch stage, there would be 9 Fdlatch (69-77) counters, and the same holds for the latches after the decode stage.


