Exploring Early and Late ALUs for Single-Issue In-Order Pipelines

Alen Bardizbanyan and Per Larsson-Edefors Chalmers University of Technology, Gothenburg, Sweden [email protected], [email protected]

Abstract—In-order processors are key components in energy-efficient embedded systems. One important design aspect of in-order pipelines is the sequence of pipeline stages: First, the position of the execute stage, in which arithmetic logic unit (ALU) operations and branch prediction are handled, impacts the number of stall cycles that are caused by data dependencies between data memory instructions and their consuming instructions and by address generation instructions that depend on an ALU result. Second, the position of the ALU inside the pipeline impacts the branch penalty. This paper considers the question of how to best make use of ALU resources inside a single-issue in-order pipeline. We begin by analyzing which is the most efficient way of placing a single ALU in an in-order pipeline. We then go on to evaluate what is the most efficient way to make use of two ALUs, one early and one late ALU, which is a technique that has revitalized commercial in-order processors in recent years. Our architectural simulations, which are based on 20 MiBench and 7 SPEC2000 integer benchmarks and a 65-nm postlayout netlist of a complete pipeline, show that utilizing two ALUs in different stages of the pipeline gives better performance and energy efficiency than any other pipeline configuration with a single ALU.

I. INTRODUCTION

The need for energy-efficient computing has further amplified the differentiation between pipeline organizations: While complex out-of-order pipelines are certainly needed to sustain extreme single-thread performance, energy-efficient in-order pipelines are essential components for a wide array of situations in which performance requirements are more relaxed [1]. Since the simplicity of the in-order pipeline makes it attractive for many-core processors, in-order pipelines become enablers when parallelism is to be translated to performance [2]. Interestingly, there are also efforts in industry to match the performance of out-of-order processors using in-order pipelines with techniques like dynamic-code optimization [3].

While their simplicity in terms of instruction handling reduces energy per cycle over their out-of-order counterparts, in-order pipelines suffer from the fundamental disadvantage of stall cycles due to data dependencies. Here, the access stage of the arithmetic logic unit (ALU) in the pipeline has a very big impact on the number of stall cycles. Previous work has defined two different configurations of in-order pipelines and explored them in terms of cycle count [4]: load-use interlock (LUI) and address-generation interlock (AGI). In LUI, the ALU is accessed in the conventional execute stage of the pipeline, in which address generation is also handled. This pipeline stalls if there is an immediate dependency between a load instruction and an ALU instruction. In AGI, the ALU is accessed at the end of the data memory access stage. This way, all dependencies between load operations and ALU operations are eliminated. But since ALU operations happen late, stall cycles will be caused if the address generation unit (AG) needs the result of the ALU immediately. In addition, branch resolution is delayed, which increases the misprediction latency.

Designing a CPU pipeline is a complex task that involves many different tradeoffs, from cycle performance at the architectural level to timing performance at the circuit level. With the ultimate goal to improve performance and energy metrics, this work is concerned with design tradeoffs for the ALU resources in a single-issue in-order pipeline. The choice of single-issue functionality is motivated by the fact that we want to explore how to best use ALU resources in the simplest possible pipeline configuration, for which each design tradeoff will have a very noticeable impact. Promising tradeoffs are likely to be applicable to wider-issue pipelines. Our contributions include:
• Beside the LUI and AGI pipeline configurations [4], we also evaluate an intermediate configuration in which the ALU is accessed after the execute stage and before the end of the data memory access stage.
• We evaluate an approach to in-order pipeline design that has very recently been utilized in industry: The approach involves two ALUs in the pipeline, namely an early and a late ALU [5]. This approach tries to simultaneously leverage the benefits of LUI and AGI pipelines, while reducing the disadvantages.
• Methodology-wise, we use a placed and routed 7-stage in-order pipeline, which is inspired by a recent commercial implementation. The access to a physical implementation allows us to significantly advance the previous evaluation [4] by not only considering tradeoffs that concern execution time but also energy dissipation.

We will first review the methodology we used for evaluating the different pipeline configurations. Next, we will present an overview of energy per pipeline event for the default LUI pipeline configuration. Then we will present and comment on the other pipeline configurations, which all will be normalized to the LUI pipeline, after which results are shown.

II. EVALUATION METHODOLOGY

In order to explore different pipeline configurations with respect to energy, we developed an RTL model of a 7-stage in-order MIPS-like LUI-type pipeline without caches. The processor configuration of the RTL model and of the cache/memory hierarchy matches exactly the configuration in Table I, which was used during the architectural simulations in SimpleScalar [6].

TABLE I
PROCESSOR CONFIGURATION FOR ARCHITECTURAL SIMULATIONS.

BPR, BTB              bimod, 128 entries
Branch penalty        6 cycles
ALUs & MUL            1
Fetch & issue width   1
L1 DC & L1 IC         16 kB, 4-way, 32 B line, 2-cycle hit
L2 unified cache      128 kB, 8-way, 32 B line, 12-cycle hit
Memory latency        120 cycles

Fig. 1 presents a simplified diagram of the 7-stage pipeline used for the evaluations. The stages are as follows: instruction fetch 1 (IF-1), instruction fetch 2 (IF-2), instruction decode (ID), execute (EXE), data cache access 1 (DC-1), data cache access 2 (DC-2), and write-back (WB). In addition, the multiplier is divided into three stages, namely multiplier 1 (mult-1), multiplier 2 (mult-2), and multiplier 3 (mult-3). The multiple stages used for multiplication, instruction fetch, and data cache accesses are important for performance reasons; if fewer stages are used, these operations limit the pipeline's maximal clock rate significantly. The address generation unit (AG) and the ALU reside in the EXE stage.

Fig. 1. Stage diagram of the 7-stage pipeline.

Based on a commercial 65-nm 1.2-V low-power process technology, the pipeline was synthesized using Synopsys Design Compiler [7], while the place and route steps were carried out using Cadence Encounter [8]. The postlayout pipeline was operational up to a clock frequency of 675 MHz under worst-case conditions, i.e., worst-case process corner, 1.1-V supply voltage, and 125 °C operating temperature.

In order to verify the complete pipeline and to extract energy values for different pipeline events, we used five different benchmarks from the EEMBC benchmark suite [9]; since these lack system calls, they are practical for RTL logic simulation. All benchmarks were successfully run until completion on the postlayout netlist, in the process generating node-accurate switching activity information. Synopsys PrimeTime [10] was used to obtain the final energy values per pipeline event; for this power analysis, typical conditions were assumed, i.e., typical process corner, 1.2-V supply voltage, and 25 °C operating temperature. Since the EEMBC benchmarks are too small to generate sufficient data for our pipeline design trade-off analysis, the energy values were backannotated to SimpleScalar in order to evaluate larger and more representative benchmarks from the MiBench [11] and SPEC CPU2000 [12] (SPEC2000int) integer benchmark suites.

The MiBench and SPEC2000int benchmarks were compiled using the GCC-MIPS compiler with the -O2 optimization flag. The compiler used a scheduling algorithm optimized for a MIPS R2000 type of pipeline [13]. For this type of pipeline, the scheduler tries to move ALU operations away in time from the load operations to which they have dependencies. This is, however, not the optimal case for some of the pipeline configurations that are evaluated in this paper. Hence, for the pipeline configurations that can have address generation dependencies with ALU operations, the instructions are dynamically analyzed to assess the possibility of moving load/store operations away from the dependent ALU operation. This way, scheduling is optimized for every different pipeline configuration (Sec. IV).
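To make the per-configuration scheduling concrete, the following C sketch shows one way such a rescheduling legality check could look. It is our own illustration under assumed names (insn_t, independent, find_filler) and is not the modified compiler pass or the dynamic analysis used in this work: given a memory operation whose address register is produced by a nearby ALU operation, it searches a short window for an instruction that can legally be hoisted above the memory operation, which pushes the address generation further away from its producer.

/* Illustrative rescheduling legality check (not the authors' tool).
 * Goal: move a load/store with an address-generation dependency further
 * away from the ALU operation that produces its address register.         */
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    bool is_mem;          /* load/store: needs address generation          */
    int  dest;            /* destination register, -1 if none              */
    int  src[2];          /* source registers, -1 if unused                */
} insn_t;

static bool reads(const insn_t *i, int reg)
{
    return reg >= 0 && (i->src[0] == reg || i->src[1] == reg);
}

/* True if insn b can be reordered across insn a without breaking a
 * register dependence (true, anti, or output).                            */
static bool independent(const insn_t *a, const insn_t *b)
{
    return !reads(b, a->dest) && !reads(a, b->dest) &&
           !(a->dest >= 0 && a->dest == b->dest);
}

/* The memory operation at index c uses an ALU result produced shortly
 * before it.  Search a small window after c for a non-memory instruction
 * that can legally be hoisted above code[c..k-1]; placing it before the
 * memory operation increases the producer-to-AG distance by one.
 * Returns the filler's index, or -1 if none is found.                     */
static int find_filler(const insn_t *code, size_t n, size_t c, size_t window)
{
    for (size_t k = c + 1; k < n && k <= c + window; k++) {
        if (code[k].is_mem)       /* be conservative about memory ordering  */
            continue;
        bool ok = true;
        for (size_t j = c; j < k && ok; j++)
            ok = independent(&code[j], &code[k]);
        if (ok)
            return (int)k;
    }
    return -1;
}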

III. ENERGY ANALYSIS OF THE 7-STAGE PIPELINE

To establish a reference case to which we will relate our subsequent results, we here present an energy analysis of the default 7-stage LUI-type pipeline. Table II shows the energy dissipation of the different pipeline stages, i.e., the energy dissipated on each active cycle of a pipeline stage. Apart from the multiplier (MULT), the pipeline stages are always active as long as there is no stall cycle. The multiplier is only active during multiplication operations. The last value in Table II shows the energy dissipated in the top-level clock buffers. The local clock buffers inside the pipeline stages are included in the total energy of the corresponding stage. The overall clock energy is 13.5 pJ per cycle. As expected, the multiplier and the branch predictor dissipate the highest energy in the pipeline.

TABLE II
PIPELINE ENERGY BREAKDOWN.

Stage or unit          Energy (pJ)
IF-1                   1.9
IF-2                   1.75
ID (Bpr / Ctrl / RF)   40.9 (24.5 / 11.0 / 5.4)
ALU / MULT             10.3 / 56.5
DC-1                   1.25
DC-2                   1.31
Clock buffer           6.5

The energy dissipation for the level-1 instruction (L1 IC) and level-1 data (L1 DC) caches is estimated using the SRAM library files provided by the manufacturer. These libraries are prepared with the postlayout results and, hence, these models are very accurate. Each access to a 16-kB 4-way associative L1 IC or L1 DC costs 167 pJ of energy. We are omitting the cache controller energy in this work, but it is usually a very small part of the cache energy. The clock network energy dissipated in the caches is 5 pJ in total when they are not activated. This value, added to the clock network energy of the pipeline per cycle, is assumed to be dissipated on each stall cycle in the architectural simulations. Thus, it is assumed that the processor dissipates 18.5 pJ of energy for each stall cycle.
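The stall-cycle energy assumption above can be summarized in a few lines of bookkeeping. The following sketch is our own illustration of how the per-event values quoted in this section (13.5 pJ of pipeline clock energy, 5 pJ of idle cache clock energy, and 167 pJ per conventional L1 access) combine into a total energy figure during architectural simulation; the names and structure are assumptions, not the actual SimpleScalar modifications.

/* Energy bookkeeping sketch using the per-event values quoted in Sec. III.
 * Names and structure are illustrative, not the simulator's actual code.  */
#define PIPE_CLOCK_PJ   13.5   /* pipeline clock network, per cycle        */
#define CACHE_CLOCK_PJ   5.0   /* idle L1 clock networks, per cycle        */
#define STALL_PJ        (PIPE_CLOCK_PJ + CACHE_CLOCK_PJ)   /* 18.5 pJ      */
#define L1_ACCESS_PJ   167.0   /* conventional 16-kB 4-way L1 access       */

typedef struct {
    unsigned long stall_cycles;           /* counted by the simulator      */
    unsigned long ic_accesses;            /* L1 IC accesses                */
    unsigned long dc_accesses;            /* L1 DC accesses                */
    double        pipeline_pj;            /* active-cycle energy,
                                             backannotated from the netlist */
} energy_counters_t;

/* Total CPU energy charged for one benchmark run, in pJ. */
static double total_energy_pj(const energy_counters_t *c)
{
    return c->pipeline_pj
         + c->stall_cycles * STALL_PJ
         + (c->ic_accesses + c->dc_accesses) * L1_ACCESS_PJ;
}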

We assume a sequential access optimization for the L1 IC, in which the program counter (PC) value is tracked to see if the next access is going to be to the same cache line. Thus, the way information for most of the IC accesses can be determined and only a single data way of the L1 IC is activated [14]. During a single-way activation, only 26.5 pJ of energy is dissipated instead of 167 pJ. By accessing the L1 IC conventionally whenever the PC value comes from the branch predictor, the IC access scheme becomes very straightforward; furthermore, it does not affect the critical path. This optimization is essential, since otherwise the L1 IC energy would be unrepresentative and dominate the overall energy. Since our exploration does not have any impact on data accesses, and since data accesses are less frequent than instruction accesses, this work assumes no L1 DC optimizations.
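The way-tracking scheme can be expressed compactly. The sketch below is a simplified model of the sequential-access optimization as described above; the structure and field names are our own and are not taken from the RTL: only when the new fetch falls in the same cache line as the previous one, and the PC was not redirected by the branch predictor, is a single data way enabled.

/* Simplified model of the L1 IC sequential-access optimization (field and
 * function names are illustrative).                                        */
#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES 32u                    /* 32-B lines (Table I)           */

typedef struct {
    uint32_t last_line;                   /* line address of previous fetch */
    unsigned last_way;                    /* data way that hit last time    */
    bool     valid;                       /* cleared on redirects/refills   */
} ic_way_memo_t;

/* Returns true if only one data way (m->last_way) needs to be activated
 * for the fetch at 'pc'; otherwise a conventional all-way access is done.  */
static bool ic_single_way_access(ic_way_memo_t *m, uint32_t pc,
                                 bool predictor_redirect)
{
    uint32_t line = pc / LINE_BYTES;

    if (predictor_redirect || !m->valid || line != m->last_line) {
        /* Conventional access: all tag and data ways are read (~167 pJ).
         * The hitting way is recorded by the lookup for later reuse.       */
        m->last_line = line;
        m->valid     = true;
        return false;
    }
    return true;   /* same line as before: single-way access (~26.5 pJ)     */
}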

IV. PIPELINE CONFIGURATIONS AND DEPENDENCIES

Fig. 2 shows the four different pipeline configurations that are evaluated in this work. It should be noted that this work is not concerned with the multiplier: Our evaluations show that stall cycles caused by dependencies between load operations and multiplication operations in a conventional 7-stage pipeline (see Fig. 1) are only 7% of the stall cycles caused by the data dependencies between load operations and ALU operations. This indicates that the data dependency stalls of the multiplier operations are almost negligible. As a result, this work focuses on the integer ALU, which also handles branch resolution.

Fig. 2. Evaluated pipeline configurations: (a) reference configuration with the ALU in the EXE stage (LUI); (b) configuration with the ALU in the DC-1 stage (intermediate LUI/AGI); (c) configuration with the ALU in the DC-2 stage (AGI); (d) configuration with ALUs in both the EXE and DC-2 stages (dual-ALU).

Fig. 2a represents the reference LUI-type pipeline, which is similar to the MIPS R2000 processor [13] insofar as the ALU is accessed directly in the execute stage in which address generation is also handled. This pipeline configuration has a 2-cycle load-use delay, which causes stall cycle(s) if an ALU operation is dependent on a load operation and the distance between them is less than three cycles.

Fig. 2b represents a pipeline in which the ALU is located in the DC-1 stage (as in the MIPS 1004K [15]). We will call this pipeline configuration intermediate LUI/AGI. In this configuration, the load-use delay is reduced to one cycle, but a 1-cycle address generation dependency is introduced. Hence, a stall cycle will be caused if a dependent ALU operation comes directly after a load operation, or if an AG operation depends on the result of the immediately preceding ALU operation. In addition, the branch penalty is increased by one cycle compared to the reference LUI pipeline configuration.

Fig. 2c represents an AGI-type pipeline in which the ALU is located in the DC-2 stage. The ARM Cortex-A5 processor uses this type of configuration [16]. While the load-use delay is completely eliminated, this configuration leads to an increase of the address generation dependency to two cycles. Hence, if an AG operation is dependent on the result of one of the previous two ALU operations, stall cycle(s) will be caused. The branch penalty is increased by two cycles compared to the reference LUI pipeline configuration.
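Before turning to the dual-ALU configuration, the dependency rules of the three single-ALU configurations can be condensed into a small stall model. The sketch below is our own restatement of the rules given above and is not taken from the simulator: each configuration is characterized by its load-use delay, its ALU-to-AG dependency delay, and its extra branch penalty relative to LUI, and stall cycles are charged when a consumer issues too soon after its producer.

/* Condensed stall model for the single-ALU configurations of Fig. 2
 * (our own restatement of the rules in the text).                          */
typedef enum { CFG_LUI, CFG_INTERMEDIATE, CFG_AGI } cfg_t;

typedef struct {
    int load_use_delay;   /* load result -> dependent ALU operation         */
    int ag_dep_delay;     /* ALU result  -> dependent address generation    */
    int extra_branch;     /* extra misprediction penalty vs. the LUI case   */
} cfg_rules_t;

static const cfg_rules_t rules[] = {
    [CFG_LUI]          = { 2, 0, 0 },   /* ALU in EXE                       */
    [CFG_INTERMEDIATE] = { 1, 1, 1 },   /* ALU in DC-1                      */
    [CFG_AGI]          = { 0, 2, 2 },   /* ALU in DC-2                      */
};

/* Stall cycles charged when a consumer issues 'distance' instructions after
 * its producer (distance == 1 means back-to-back).                         */
static int stall_cycles(int delay, int distance)
{
    int gap = delay - distance + 1;
    return gap > 0 ? gap : 0;
}

For example, stall_cycles(rules[CFG_LUI].load_use_delay, 1) gives the 2-cycle load-use stall of the LUI pipeline, while the same back-to-back load-use dependency costs one cycle in the intermediate LUI/AGI configuration and none in AGI.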
Finally, Fig. 2d represents a pipeline which utilizes two identical ALUs in different stages, namely an early and a late ALU. The early ALU is located in the EXE stage and the late ALU is located in the DC-2 stage. This configuration was introduced in the ARC HS processors [5] and we will refer to it as dual-ALU. In this configuration, the load-use delay is completely eliminated, just as in the previously described configuration. In addition, most of the address generation dependencies are also eliminated by using the early ALU as soon as the dependencies allow it. As long as there is no load-use dependency, the early ALU is used. Whenever an ALU operation depends on one of the previous two load operations, it is diverted to the late ALU. If there are other ALU operations immediately depending on the result of diverted ALU operations, these are also diverted to the late ALU. If, during this process, an AG operation depends on one of the two previously diverted ALU operations, stall cycle(s) are caused. Similarly, the 2-cycle extra branch prediction penalty is only incurred if the branch resolution depends on one of the last two diverted ALU operations. But these stall cycles or extra branch penalties happen very infrequently in this configuration (see Sec. V). It should be noted that the two ALUs never perform any redundant operations: Dependencies are detected dynamically so that an ALU operation executes on either the early or the late ALU. The two ALUs can execute different ALU operations in the same cycle.
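The divert rules of the dual-ALU configuration can likewise be captured as a small piece of steering logic. The following sketch reflects our reading of the rules stated above and is not the commercial implementation; the types and helper names are assumptions. An ALU operation is sent to the late ALU if it reads the result of one of the two most recent loads or of a recently diverted ALU operation; AG operations and branch resolutions only pay a penalty when they depend on one of the last two diverted results.

/* Steering sketch for the dual-ALU configuration (our own reading of the
 * rules in the text, not the ARC HS implementation).                       */
#include <stdbool.h>

enum { EARLY_ALU, LATE_ALU };

typedef struct {
    int dest[2];   /* destination registers of the two most recent
                      operations of the tracked kind, -1 if none            */
} last2_t;

static bool depends_on(const int src[2], const last2_t *p)
{
    for (int i = 0; i < 2; i++)
        for (int s = 0; s < 2; s++)
            if (src[s] >= 0 && src[s] == p->dest[i])
                return true;
    return false;
}

/* Choose the ALU for an ALU operation with source registers src[2]. */
static int steer_alu_op(const int src[2], const last2_t *recent_loads,
                        const last2_t *recent_diverted)
{
    if (depends_on(src, recent_loads) || depends_on(src, recent_diverted))
        return LATE_ALU;        /* load-use or diverted-ALU dependency      */
    return EARLY_ALU;           /* otherwise resolve as early as possible   */
}

/* An AG operation stalls, and a mispredicted branch pays the extra 2-cycle
 * penalty, only if it depends on one of the last two diverted results.     */
static bool pays_late_penalty(const int src[2], const last2_t *recent_diverted)
{
    return depends_on(src, recent_diverted);
}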

Fig. 3 explains the working mechanism of the dual-ALU pipeline configuration. The code snippet also includes examples that show the source of stall cycles for the different types of pipelines. For example, the address generation operation of op2 depends on op1 and, hence, op2 will cause a 1-cycle stall for the intermediate LUI/AGI pipeline, and it will cause a 2-cycle stall for the AGI pipeline. Another example can be given for a load-use dependency: op3 depends on op2 and, hence, op3 will cause a 2-cycle stall for the LUI pipeline, and it will cause a 1-cycle stall for the intermediate LUI/AGI pipeline. The comments in the code snippet of Fig. 3 identify which ALU operation will be handled in which stage for the dual-ALU pipeline configuration. Since op3 depends on op2, it is diverted to the late ALU, and the upcoming ALU operations are also diverted to the late ALU due to the chain of dependencies. Hence, the branch operation of op5 will have an extra 2-cycle penalty if the branch is mispredicted. Even though op7 is diverted to the late ALU, op9 can be handled in the early ALU since no dependency exists between op9 and the immediately preceding instructions. It should be noted that since there is exactly one instruction between op9 and op7, both the early and late ALUs will be used in the same cycle. The branch of op10 can be resolved without any extra misprediction penalty since it is resolved in the early ALU.

Fig. 3. MIPS assembly code snippet for the dual-ALU approach.

It should be noted that moving the ALU to later stages in the pipeline does not increase the required forwarding for the ALU, since it requires less and less forwarding from the stages ahead. But it requires an extra 68 flip-flops for each stage it is moved. These flip-flops dissipate 1 pJ of energy on each active cycle, which is 1.5% of the overall pipeline energy per cycle, not considering multiplication and cache accesses. The overhead is 2 pJ when the ALU is moved two stages. In the dual-ALU approach, the 2-pJ energy overhead only manifests when an ALU operation is diverted to the late ALU.

V. EXPLORATION RESULTS

In this section, we present our exploration results in terms of stall cycles, execution time, and energy dissipation for the four pipeline configurations previously introduced.

Fig. 4 shows the stall cycle statistics. Since the stall cycles are normalized to the overall instruction count, the figure shows the execution time overhead if all the instructions were to execute in one cycle. Since this is not the case in processors, due to memory operations etc., the execution time statistics are presented separately. The stall cycle trend is the same for the MiBench benchmarks presented in Fig. 4a and the SPEC2000int benchmarks presented in Fig. 4b. As the ALU is moved to later stages in the pipeline, the stall cycles decrease. The dual-ALU configuration reduces the stall cycles significantly compared to the single-ALU approaches. The LUI-type pipeline with its ALU in the execute stage has 10% (MiBench) and 21% (SPEC2000int) stall cycles due to data dependencies. The intermediate LUI/AGI pipeline reduces the stall cycles to 7.0% and 14.4%, while they are reduced to 6.2% and 11.5% for the AGI pipeline. Stall cycles are only 1.7% and 3.7% for the dual-ALU pipeline. Since the SPEC2000int benchmarks have more load-use dependencies than MiBench, the proportion of stall cycles is much higher here for the LUI pipeline. Also, it is clear that the location of the ALU has a more pronounced impact on the SPEC2000int results. The dual-ALU configuration reduces the stall cycles by approximately 83% for both of the benchmark suites.

Fig. 4. Stall cycles (relative to overall instruction count) for different pipeline configurations: (a) MiBench; (b) SPEC2000int.

In some cases, moving the ALU to later stages can cause more stall cycles due to an increasing branch penalty. This is especially visible for the adpcm benchmark, in which load/store operations are very few compared to the overall instruction count. For this benchmark, it is clear that an increasing branch penalty nullifies the benefits of moving the ALU to later stages, but this is not the common case. Moreover, the dual-ALU configuration, which never increases the stall cycles, performs well for this benchmark.

Fig. 5 presents execution times normalized to the LUI pipeline (Fig. 2a). On average, the execution time is improved by 2.1% (MiBench) and 1.7% (SPEC2000int) for the intermediate LUI/AGI pipeline. The improvement is 2.8% and 2.4% for the AGI pipeline, while for the dual-ALU pipeline, the improvement is 4.4% for MiBench and 5.0% for SPEC2000int.

Even though relatively more stall cycles are eliminated for the SPEC2000int benchmarks, the execution time improvement is not that different from MiBench. This is due to the fact that the SPEC2000int benchmarks have much higher cache miss rates and, hence, the stall cycles that occur due to data hazards have less impact on execution time. The execution time improvement can be as high as 25% when the dual-ALU configuration is used. The single-ALU pipelines occasionally increase the execution time due to increasing stall cycles, but this is not common. In contrast, the dual-ALU pipeline always improves the execution time, as expected from the stall cycle results.

Fig. 5. Execution time for different configurations (normalized to LUI): (a) MiBench; (b) SPEC2000int.

Fig. 6 shows the pipeline energy for different configurations; the energy metric includes all the pipeline components and the clock network of the pipeline, apart from the level-1 caches. The energy is normalized to the LUI pipeline. As shown, on average, the energy increase is negligible for the intermediate LUI/AGI and the AGI pipeline configurations, for both MiBench and SPEC2000int. Even though the execution time is improved for both configurations, the energy dissipation is slightly increased due to an increasing branch misprediction latency. When the ALU is moved to a later stage in the pipeline, there are more instructions that are fetched and processed in the first stages of the pipeline during a branch misprediction. Fetching and processing an extra instruction is much more costly in terms of energy than a stall cycle and, hence, the energy saved on stall cycles is canceled out by the increased energy dissipation due to branch mispredictions. For the dual-ALU configuration, the pipeline energy dissipation is reduced by 3.5% and 3% on average for MiBench and SPEC2000int, respectively, and it is reduced for all the benchmarks. The reduction can be as high as 12%, as a result of 1) a shorter execution time, which reduces the clock network energy of the pipeline, and 2) a significantly reduced branch misprediction overhead.

Fig. 6. Pipeline energy for different configurations (normalized to LUI): (a) MiBench; (b) SPEC2000int.

Fig. 7 shows the total energy dissipation normalized to the LUI pipeline of Fig. 2a; the energy is broken down into three CPU components, i.e., the pipeline, the L1 IC, and the L1 DC. On average, the total energy is increased by 0.6% (MiBench) and 0.3% (SPEC2000int) for the intermediate LUI/AGI pipeline. The increase in total energy is 1.1% and 0.7% for the AGI pipeline. The reason for the increasing energy overhead is that the L1-IC energy dissipated during branch mispredictions is included. Turning our attention to the dual-ALU configuration, the total energy is reduced by 0.8% (MiBench) and 1.3% (SPEC2000int). The relative energy improvement of the dual-ALU configuration will be higher if the L1 DC is optimized for energy, since the absolute energy dissipation will then decrease.

Fig. 7. Total energy for different configurations (normalized to LUI), broken down into pipeline, L1 IC, and L1 DC energy: (a) MiBench; (b) SPEC2000int.

VI. CONCLUSIONS

We have shown, in line with previous work [4], that moving the ALU to later stages in the pipeline can improve the execution time. We have also shown that moving the ALU only causes a negligible increase in energy dissipation due to the increased branch penalties. In this work, the energy dissipated in the level-2 cache is not included and, hence, this small energy overhead might disappear, or even turn into savings, when the extra energy dissipated in higher-level caches during stall cycles is considered. So long as they do not cause a dramatic increase in power or energy dissipation, execution time improvements are always welcome. We have shown that the dual-ALU pipeline configuration that was commercialized recently [5] reduces the execution time considerably and improves the energy dissipation at the same time. The area overhead of the extra ALU is 3.7% of the pipeline area without level-1 caches and 1% of the overall core area, which includes the level-1 caches. Thus, the dual-ALU approach is a very viable option for single-issue in-order pipelines.

Nugent proposed a dual-ALU pipeline configuration in which the ALUs are located in the EXE and DC-1 stages [17]. This configuration can potentially provide slightly higher energy savings compared to the dual-ALU approach evaluated in this paper, due to the fact that branch mispredictions can be resolved one cycle earlier. However, since it has a 1-cycle load-use delay for the late ALU, it will not provide the same performance improvements.

One other aspect of moving the ALU to later stages is that it requires a different scheduling if the processor is to be used optimally. The scheduling is initially optimized for an ALU in the execute stage, in which address generation is also handled. When the ALU is moved to a later stage, the scheduling should change: Instead of moving ALU operations away from load operations, AG operations have to be moved away from the ALU operations on which they have an immediate dependency. The scheduling will not affect the correctness, but the efficiency of the approach. Since the dual-ALU approach eliminates most of the dependencies, it is the pipeline configuration evaluated in the paper that is the least affected by the scheduling problem. This is another good motivation for using two ALUs.

REFERENCES

[1] P. Greenhalgh, "big.LITTLE Processing with ARM Cortex-A15 & Cortex-A7," White Paper, ARM, Sep. 2011.
[2] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay, M. Reif, L. Bao, J. Brown, M. Mattina, C.-C. Miao, C. Ramey, D. Wentzlaff, W. Anderson, E. Berger, N. Fairbanks, D. Khan, F. Montenegro, J. Stickney, and J. Zook, "TILE64 processor: A 64-core SoC with mesh interconnect," in IEEE Int. Solid-State Circuits Conf., Feb. 2008, pp. 88–89.
[3] D. Boggs, G. Brown, N. Tuck, and K. Venkatraman, "Denver: Nvidia's first 64-bit ARM processor," IEEE Micro, vol. 35, no. 2, pp. 46–55, Mar. 2015.
[4] M. Golden and T. Mudge, "A comparison of two pipeline organizations," in 27th Annual Int. Symp. on Microarchitecture, 1994, pp. 153–161.
[5] M. Thompson, "Optimizing high-end embedded designs with the ARC HS processor family," Technical Bulletin, Synopsys. [Online]. Available: http://www.synopsys.com/Company/Publications/DWTB/Pages/dwtb-arc-hs-2014Q1.aspx
[6] T. Austin, E. Larson, and D. Ernst, "SimpleScalar: An infrastructure for computer system modeling," Computer, vol. 35, no. 2, pp. 59–67, Feb. 2002.
[7] Design Compiler, v. 2011.06, Synopsys, Inc., Jun. 2011.
[8] Encounter Digital Implementation (EDI), v. 10.1.2, Cadence Design Systems, Inc., Jul. 2011.
[9] Embedded Microprocessor Benchmark Consortium. [Online]. Available: http://www.eembc.org
[10] PrimeTime, v. 2010.03, Synopsys, Inc., Mar. 2010.
[11] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown, "MiBench: A free, commercially representative embedded benchmark suite," in Int. Workshop on Workload Characterization, Dec. 2001, pp. 3–14.
[12] SPEC CPU2000 V1.3, Standard Performance Evaluation Corporation. [Online]. Available: https://www.spec.org/cpu2000/
[13] D. A. Patterson and J. L. Hennessy, Computer Organization & Design: The Hardware/Software Interface, 2nd ed. Morgan Kaufmann Publishers, Inc., 1998.
[14] M. D. Powell, A. Agarwal, T. N. Vijaykumar, B. Falsafi, and K. Roy, "Reducing set-associative cache energy via way-prediction and selective direct-mapping," in 34th ACM/IEEE Int. Symp. on Microarchitecture, Dec. 2001, pp. 54–65.
[15] MIPS32 1004K Coherent Processing System Datasheet, MIPS Technologies, Jul. 2009.
[16] T. R. Halfhill, "ARM's midsize multiprocessor," Microprocessor Report, Oct. 2009.
[17] P. R. Nugent, "Repeated ALU in pipelined processor design," US Patent 5 333 284 A, 1994.
PX, i.ITEPoesn ihAMCre-1 Cortex- & Cortex-A15 ARM with Processing big.LITTLE TM piiighg-n medddsgswt h ARC the with designs embedded high-end Optimizing  R oeetPoesn ytmDatasheet System Processing Coherent .2010.03 v. , ehia ultn yoss Oln] Available: [Online]. Synopsys. Bulletin, Technical , 4hAMIE n.Sm.o Microarchitecture on Symp. Int. ACM/IEEE 34th n.Wrso nWrla Characterization Workload on Workshop Int. R EFERENCES yoss n. u.2011. Jun. Inc., Synopsys, , Computer EEMicro IEEE yoss n. a.2010. Mar. Inc., Synopsys, , n d ognKumnPublishers Kaufman Morgan ed. 2nd , EEIt oi-tt icisConf. Circuits Solid-State Int. IEEE optrOgnzto Design, & Organization Computer o.3,n.2 p 96,Feb. 59–67, pp. 2, no. 35, vol. , o.3,n.2 p 46–55, pp. 2, no. 35, vol. , TM 94 p 153–161. pp. 1994, , Microprocessor rcso:A64-core A processor: aec Design Cadence , ISTech- MIPS , Dec. , Oct. , 585 , ,