Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 1902

Rethinking Dynamic Instruction Scheduling and Retirement for Efficient Microarchitectures

MEHDI ALIPOUR

ACTA UNIVERSITATIS UPSALIENSIS ISSN 1651-6214 ISBN 978-91-513-0868-5 UPPSALA urn:nbn:se:uu:diva-403675 2020 Dissertation presented at Uppsala University to be publicly examined in VIII, Universitetshuset, Biskopsgatan 3, 753 10 Uppsala, Friday, 20 March 2020 at 09:00 for the degree of Doctor of Philosophy. The examination will be conducted in English. Faculty examiner: Professor Mikko H. Lipasti (University of Wisconsin-Madison).

Abstract Alipour, M. 2020. Rethinking Dynamic Instruction Scheduling and Retirement for Efficient Microarchitectures. Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 1902. 76 pp. Uppsala: Acta Universitatis Upsaliensis. ISBN 978-91-513-0868-5.

Out-of-order execution is one of the main micro-architectural techniques used to improve the performance of both single- and multi-threaded processors. The application of such processors ranges from mobile devices to server computers. The technique achieves higher performance by finding independent instructions, hiding execution latency, and using cycles that would otherwise be wasted in a CPU stall. To accomplish this, it uses scheduling resources, including the ROB, IQ, LSQ, and physical registers, to store and prioritize instructions. The pipeline of an out-of-order processor has three macro-stages: the front-end, the scheduler, and the back-end. The front-end fetches instructions, places them in the out-of-order resources, and analyzes them to prepare for their execution. The scheduler identifies which instructions are ready for execution and prioritizes them for scheduling. The back-end updates the processor state with the results of the oldest completed instructions, deallocates the resources, and commits the instructions in program order to maintain correct execution. Since out-of-order execution must be able to choose any available instruction for execution, its scheduling resources need complex circuits for identifying and prioritizing instructions, which makes them expensive and therefore constrained in size. This limited size leads to two stall points, at the front-end and the back-end of the pipeline. The front-end can stall when the resources are fully allocated and no new instructions can be placed in the scheduler. The back-end can stall when the unfinished execution of the instruction at the head of the ROB prevents other resources from being deallocated, keeping new instructions from being inserted into the pipeline. To address these two stalls, this thesis focuses on reducing the time instructions occupy the scheduling resources.
Our front-end technique tackles IQ pressure, while our back-end approach considers the rest of the resources. To reduce front-end stalls, we reduce the pressure on the IQ for both storing (depth) and issuing (width) instructions by bypassing instructions to cheaper storage structures. To reduce back-end stalls, we explore how we can retire instructions earlier, and out-of-order, to reduce the pressure on the out-of-order resources.

Keywords: Out-of-Order Processors, Energy-Efficient, High-Performance, Instruction Scheduling

Mehdi Alipour, Department of Information Technology, Computer Architecture and Computer Communication, Box 337, Uppsala University, SE-75105 Uppsala, Sweden.

© Mehdi Alipour 2020

ISSN 1651-6214 ISBN 978-91-513-0868-5 urn:nbn:se:uu:diva-403675 (http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-403675)

To my parents, who always valued their children, prioritized my sport and education, and supported me throughout

List of papers

This thesis is based on the following papers, which are referred to in the text by their Roman numerals.

I Mehdi Alipour, Trevor E. Carlson, Stefanos Kaxiras, "A Taxonomy of Out-of-Order Instruction Commit". In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) Santa Rosa, California, USA, April 2017.

II Mehdi Alipour, Trevor E. Carlson, Stefanos Kaxiras, "Exploring the Performance Limits of Out-of-order Commit". In Proceedings of the 2017 ACM International Conference on Computing Frontiers (CF) Siena, Italy, May 2017.

III Mehdi Alipour, Trevor E. Carlson, David Black-Schaffer, Stefanos Kaxiras, "Maximizing Limited Resources: A Limit-based Study and Taxonomy of Out-of-order Commit". Journal of Signal Processing Systems 91(3-4): 379-397, 2019 (an extension of Paper II).

IV Mehdi Alipour, Rakesh Kumar, Stefanos Kaxiras, David Black-Schaffer, "FIFOrder: Ready-Aware Instruction Scheduling for OoO Processors". In Proceedings of the Design, Automation and Test in Europe (DATE) Florence, Italy, March 2019.

V Mehdi Alipour, Rakesh Kumar, Stefanos Kaxiras, David Black-Schaffer, "Delay and Bypass: Ready and Criticality Aware Instruction Scheduling in Out-of-Order Processors". In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA) San Diego, CA, USA, February 2020.

Reprints were made with permission from the publishers. Other publications not included in this thesis:

• Alberto Ros, Trevor E. Carlson, Mehdi Alipour, Stefanos Kaxiras, "Non-Speculative Load-Load Reordering in TSO". In Proceedings of the IEEE International Symposium on Computer Architecture (ISCA) Toronto, Canada, June 2017.

• Sizhuo Zhang, Muralidaran Vijayaraghavan, Andrew Wright, Mehdi Alipour, Arvind, "Constructing a Weak Memory Model". In Proceedings of the IEEE International Symposium on Computer Architecture (ISCA) Los Angeles, CA, USA, June 2018.

• Stefanos Kaxiras, Trevor E. Carlson, Mehdi Alipour, Alberto Ros, "Non-Speculative Load Reordering in Total Store Ordering". IEEE Micro Top Picks, June 2018.

• Rakesh Kumar, Mehdi Alipour, David Black-Schaffer, "Freeway: Maximizing MLP for Slice-Out-of-Order Execution". In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA) Washington D.C., USA, February 2019.

• Christos Sakalis, Mehdi Alipour, Alberto Ros, Alexandra Jimborean, Stefanos Kaxiras, Magnus Själander, "Ghost loads: what is the cost of invisible speculation?". In Proceedings of the ACM International Conference on Computing Frontiers (CF) Alghero, Italy, May 2019.

Contents

1 Preface
 1.1 Contributions: Front-End
 1.2 Contribution: Back-End
 1.3 Thesis Organization
2 Overview of Out-of-Order Processors
 2.1 The In-order Front-end
 2.2 Out-of-order Scheduling
  2.2.1 Scheduling Resources
  2.2.2 Scheduling Steps
  2.2.3 Cost Evaluation: the Width and the Depth of the IQ
  2.2.4 Research Problem I: Inefficient Scheduling
 2.3 The In-order Back-end
  2.3.1 Architectural vs. Speculative State of a Processor
  2.3.2 Overview of the Back-end
  2.3.3 Research Problem II: In-order Commit is Overly Conservative
3 Efficient Resource Allocation: Scheduling Considering Readiness and Criticality of Instructions
 3.1 Research Problem: Inefficient Scheduling
 3.2 Insight I: Some Instructions do not Need OoO Scheduling
 3.3 Solution I: FIFOrder, Ready-Aware Instruction Scheduling
  3.3.1 Instruction Criticality and Limits of Ready-Aware Instruction Scheduling
 3.4 Insight II: Overlap Between Readiness and Criticality
 3.5 Potential of Combining Readiness and Criticality
 3.6 Solution II: DNB, Ready- and Criticality-Aware Instruction Scheduling
 3.7 Conclusion
4 High performance resource deallocation: early release and out-of-order commit
 4.1 Out-of-order commit conditions
 4.2 Contribution I: Relaxing Out-of-Order Commit Conditions
 4.3 Contribution II: Category/taxonomy of out-of-order commit
  4.3.1 Safe_OOC
  4.3.2 Unsafe_OOC
  4.3.3 Reluctant
  4.3.4 Aggressive
  4.3.5 Commit Width and Depth
 4.4 Performance evaluation
  4.4.1 Performance evaluation based on the taxonomy
  4.4.2 Performance evaluation based on the OOC conditions
 4.5 Out-of-order Commit and Memory Level Parallelism (MLP)
 4.6 Early release vs. out-of-order commit
 4.7 Conclusion
5 Summary
6 Svensk Sammanfattning
7 Acknowledgements
References

1. Preface

Modern processors apply many different micro-architectural techniques to improve performance. Among them, out-of-order execution has been used widely, from server to mobile processors. Out-of-order execution finds independent instructions outside the program order, hides their latency, and uses cycles that would otherwise be wasted in a CPU stall. The latency source can be rather short, just a couple of cycles, as for a division instruction, or long, up to hundreds of cycles, as for a load instruction that misses in the last level of cache. In both cases the out-of-order processor's contributions are two-fold: 1) finding independent instructions by eliminating data hazards and 2) increasing the size of the dynamic window thanks to the out-of-order bookkeeping resources such as the Re-Order Buffer (ROB), the Instruction Queue (IQ), and the Load Store Queue (LSQ). Register renaming allows the processor to eliminate artificial data hazards1 between instructions caused by the limited number of architectural registers available to the compiler. The out-of-order bookkeeping resources allow the processor to keep many instructions in flight at the same time.

Figure 1.1 shows a typical pipeline of an out-of-order processor. The processor has three macro-stages: the front-end, the scheduler, and the back-end. The front-end fetches instructions and analyzes them to prepare for their execution. It includes instruction fetch (IF), instruction decode (ID), register renaming (RR), and instruction dispatch (DIS). The IF stage fetches instructions from the instruction cache. The ID stage decodes the instructions' op-codes. The RR stage renames the architectural registers to physical ones, and the DIS stage places instructions in the out-of-order resources. The out-of-order scheduler schedules instructions for execution; this includes identifying which instructions are ready for execution. Once an instruction commits, all of its resources, except the IQ, are deallocated2.

The out-of-order scheduling resources require the ability to identify and prioritize instructions across a large execution window. This leads to complex content-addressable memories (CAMs) that consume significant area and energy. Out-of-order resources are thus very expensive and therefore limited. Due to this limitation, the processor has two stall points, at the front-end and the back-end.

1 False dependency.
2 The IQ is deallocated right after issue unless the instruction might need a replay (re-issue due to a cache miss, for example).




[Figure 1.1 appears here. It maps the thesis contributions onto the pipeline: Problem I (allocation inefficiency, front-end) is addressed by Paper IV (instruction readiness reduces the IQ width) and Paper V (instruction readiness and criticality reduce both the IQ depth and width); Problem II (in-order commit is overly conservative, back-end) is addressed by Paper I (when to commit OOO), Paper II (the performance effect of commit conditions), and Paper III (the potential of OOO commit over early release of registers).]

Figure 1.1. Overview of the out-of-order processor and the thesis contributions to the front-end and back-end research problems. The front-end spans the fetch stage to dispatch, and the back-end spans the write-back stage to instruction commit. The out-of-order scheduler includes the out-of-order resources. The front-end research problem is the inefficiency of resource allocation, which Papers IV and V focus on. Being overly conservative is the research problem in the back-end, which Papers I, II, and III focus on.

The front-end can stall when the resources are fully allocated and no new instructions can be placed in the scheduler. The back-end can stall when the unfinished execution of the instruction at the head of the ROB causes the resources to be exhausted. The front-end and back-end stalls are labeled problem I and problem II in Figure 1.1, respectively. To address these two stalls, this thesis focuses on reducing the time that instructions occupy the scheduling resources. To reduce front-end stalls, we reduce the pressure on the IQ for both storing (depth) and issuing (width) instructions by bypassing instructions to cheaper storage structures. To reduce back-end stalls, we explore how we can retire instructions earlier, and out-of-order, to reduce pressure on the rest of the resources.

1.1 Contributions: Front-End

Research Problem: Inefficient Scheduling. To support wake-up and select, designers build the IQ with complex logic that comprises comparators and reduction trees for all entries. These circuits enable parallel comparisons across all instructions to detect which are ready and to identify the prioritized ones. This complexity makes the IQ energy-expensive, yet in OoO processors all instructions must be placed in it (or the LQ/SQ) for execution.

Our Insight. Allocating IQ entries to all instructions is inefficient. Some instructions do not benefit from the expensive IQ scheduling and can therefore bypass the IQ. Bypassing the IQ has the further benefit of reducing both capacity and issue pressure, which allows the use of a smaller (shallower and narrower) IQ, further reducing energy.

Observations and Contributions. We observed that some instructions are ready before being placed in the IQ. Because these instructions are ready to execute when they are placed in the IQ, they do not benefit from IQ features such as wake-up and select and can bypass it. To avoid placing such instructions in the IQ, we need another structure that can buffer them until they execute. This bypass structure can be built more cheaply (as it does not require complex wake-up and select logic for already-ready instructions) and therefore enables the IQ to be reduced in depth and width, hence reducing energy consumption.

• Paper IV. We observe that ready instructions can bypass the IQ and we propose FIFOrder, a ready-aware instruction scheduling microarchitecture. We identify R@D instructions, instructions that have their operands ready before being placed in the IQ. We bypass the IQ for R@D instructions and place them in a FIFO instead, which enables a smaller IQ and lower scheduler energy consumption. We also observe that some instructions are almost ready, that is, they are likely to become ready soon because one operand is already ready. We use them to increase the number of instructions that bypass the IQ. Altogether we bypass the IQ for the majority of instructions and reduce the IQ width to one instruction per cycle, which saves a significant amount of energy. However, reducing the IQ depth hurts FIFOrder's performance, as it is unable to identify instructions that should be prioritized.

• Paper V. We observe that the ready-aware scheduler of FIFOrder cannot reduce IQ depth without hurting performance due to its inability to prioritize critical instructions. The criticality of an instruction determines its effect on performance when its scheduling is delayed or expedited: an instruction is more critical when delaying its execution degrades performance and expediting it improves performance. To address both instruction readiness and criticality, we propose the first ready- and criticality-aware instruction classification. We propose the Delay and Bypass (DNB) microarchitecture, which builds on this classification: it bypasses the IQ for ready instructions (as with FIFOrder) but delays non-critical, non-ready instructions to reduce pressure on the IQ. Delaying instructions gives them time to become ready (e.g., because the instructions producing their input operands have already finished executing). As a result, by bypassing ready and delaying non-critical instructions, only critical and memory instructions are placed in the IQ, which allows us to reduce both its depth and width and save energy without hurting performance.
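The steering policy sketched in these two papers can be illustrated with a small dispatch routine. This is a hypothetical sketch, not the thesis implementation: the structure names (`FIFO`, `DELAY`), the instruction encoding, and the `critical` flag are inventions for illustration only.

```python
# Illustrative ready- and criticality-aware dispatch steering in the spirit
# of FIFOrder/DNB. All names and data shapes here are assumptions, not the
# actual microarchitecture.
from collections import deque

IQ, FIFO, DELAY = deque(), deque(), deque()

def steer(instr, ready_regs):
    """Steer one renamed instruction to a scheduling structure."""
    ready = all(src in ready_regs for src in instr["srcs"])
    if ready:
        FIFO.append(instr)    # ready at dispatch: bypass the expensive IQ
    elif not instr["critical"]:
        DELAY.append(instr)   # non-ready, non-critical: park it cheaply
    else:
        IQ.append(instr)      # critical and non-ready: needs full wakeup/select

ready_regs = {"r1", "r2"}
steer({"srcs": ["r1"], "critical": False}, ready_regs)   # goes to FIFO
steer({"srcs": ["r9"], "critical": True}, ready_regs)    # goes to IQ
steer({"srcs": ["r9"], "critical": False}, ready_regs)   # goes to DELAY
assert len(FIFO) == 1 and len(IQ) == 1 and len(DELAY) == 1
```

With such a split, only the critical non-ready instructions pay for the associative IQ, which is what allows the IQ to shrink in both depth and width.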

1.2 Contribution: Back-End

Research Problem: In-order Commit is Overly Conservative. When instructions are completed, at the final stage of the back-end, the commit stage, all their resources (the ROB, LSQ, and PRF) are deallocated except the IQ3. Traditional in-order deallocation in the back-end can be inefficient, as instructions that are not ready to commit may block the deallocation of those that are, leading to resource starvation and stalling. While in-order commit has its advantages, such as providing precise interrupts and avoiding complications with the memory consistency model, it prevents the processor from freeing scheduling resources until the instructions complete in program order. A more relaxed approach, out-of-order commit, can trade off the complexity of ensuring correctness for improved performance by freeing resources earlier.

3 IQ deallocation occurs right after execution.

Observations and Contributions. We observe that in-order instruction commit limits performance. In addition, the potential performance improvement from relaxing in-order commit varies depending on the processor and the relaxation applied.

• Paper I. In this paper we introduce two approaches to out-of-order commit: aggressive, in which instructions are committed as soon as possible, and reluctant, in which instructions are only committed out-of-order if the processor would otherwise stall. In addition, the introduced approaches include safe and unsafe out-of-order commit, which capture how relaxing or preserving the commit conditions affects performance. Safe means that although the instructions are committed out-of-order, the execution is still correct and the program output is the same as with in-order commit.

• Paper II. Multiple conditions are required to commit instructions. In this work, the performance impact of each of the commit conditions is explored, both in isolation and in combination with the other conditions.

• Paper III. The performance improvement of out-of-order commit for different scheduler sizes4 is explored, and the commit conditions are ranked by their impact on performance. In addition, the potential of out-of-order commit over the early release of resources is explored. This comparison is made because the early release of resources has been shown to be an effective approach to improve the performance of smaller processors.
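The contrast between conventional in-order commit and an aggressive out-of-order commit can be sketched in a few lines. This is purely illustrative, not the papers' mechanism: the `safe` predicate is a hypothetical stand-in for the commit conditions the papers study.

```python
# Illustrative contrast between in-order and aggressive out-of-order commit.
# `safe` is a hypothetical predicate standing in for the real commit
# conditions (no unresolved branches, no possible exceptions, etc.).

def commit_inorder(rob):
    """Commit from the head only; stop at the first unfinished instruction."""
    committed = []
    while rob and rob[0]["done"]:
        committed.append(rob.pop(0)["id"])
    return committed

def commit_aggressive(rob, safe):
    """Commit any finished instruction whose commit conditions hold,
    regardless of program order, freeing its resources early."""
    committed = [i["id"] for i in rob if i["done"] and safe(i)]
    rob[:] = [i for i in rob if not (i["done"] and safe(i))]
    return committed

rob = [{"id": 0, "done": False}, {"id": 1, "done": True}, {"id": 2, "done": True}]
assert commit_inorder(list(rob)) == []              # unfinished head blocks all
assert commit_aggressive(rob, lambda i: True) == [1, 2]
```

The example shows the core inefficiency: with an unfinished instruction at the ROB head, in-order commit frees nothing, while a relaxed commit can already release the resources of instructions 1 and 2.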

1.3 Thesis Organization

Chapter 2 provides an overview of out-of-order processors, explaining the allocation of the out-of-order scheduling resources in the front-end and their deallocation in the back-end. The chapter also covers the research problems associated with both resource allocation and deallocation in out-of-order processors. In chapter 3 we introduce our two efficient resource allocation techniques, FIFOrder (paper IV), which addresses readiness, and Delay and Bypass (DNB) (paper V), which addresses both readiness and criticality, and we show how bypassing or delaying the IQ for some instructions can reduce IQ pressure and size, and thereby improve efficiency without hurting performance. We focus on out-of-order resource deallocation in chapter 4 and introduce a new approach to explore the effectiveness of out-of-order commit on performance (papers I, II, and III).

4 ROB, IQ, LSQ, and PRF sizes.

2. Overview of Out-of-Order Processors

Out-of-order execution is an approach used in high-performance processors. An out-of-order processor is faster than an in-order one because it can execute more instructions in parallel, as it is not limited by the program order. Figure 2.1 shows an overview of an in-order processor. This processor can achieve a throughput of one instruction per cycle only if the execution latency of every instruction is one cycle. On a long-latency instruction, such as a memory access (load/store) that can take on the order of 100 cycles, this pipeline stalls and executes no further instructions. This stall happens even if there are instructions after the long-latency operation that are independent and ready to execute.

An out-of-order processor, on the other hand, considers many instructions at the same time, called in-flight instructions. The scheduler of an out-of-order processor finds schedulable (ready) instructions and issues them to the functional units for execution. By doing so, an out-of-order processor covers the delay of long-latency events by finding independent instructions. An overview of an out-of-order processor is shown in figure 2.2. In this figure, any of the (ready) instructions can be issued to the execution units regardless of their order in the program.

The pipeline of an out-of-order processor has three parts, shown in Figure 2.3: the in-order front-end, instruction scheduling (the out-of-order core), and the in-order back-end and retirement. Resource allocation occurs at the end of the front-end, the dispatch stage, and resource deallocation occurs at the end of the back-end, the commit stage. The resources are the instruction queue (IQ), the reorder buffer (ROB), the load-store queue (LSQ), and the physical registers. The rest of this chapter provides details of the pipeline stages in addition to the research problems associated with resource allocation and deallocation.
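The difference between the two pipelines can be shown with a toy single-issue model. This is purely illustrative, not a cycle-accurate simulator: the `run` function, its latency/dependence tuples, and the issue policy are all inventions for this sketch.

```python
# Toy single-issue model: in-order issue stalls behind a long-latency load,
# while out-of-order issue covers the latency with independent instructions.
# One instruction may issue per cycle; this is an assumption of the sketch.

def run(instrs, out_of_order):
    """instrs: list of (latency, deps). Returns the cycle when all finish."""
    done_at, cycle, pending = {}, 0, list(range(len(instrs)))
    while pending:
        cycle += 1
        # in-order may only consider the oldest pending instruction
        candidates = pending if out_of_order else pending[:1]
        for i in candidates:
            lat, deps = instrs[i]
            if all(done_at.get(d, float("inf")) <= cycle for d in deps):
                done_at[i] = cycle + lat    # issue at most one per cycle
                pending.remove(i)
                break
    return max(done_at.values())

# i0: 100-cycle load; i1 depends on it; i2 and i3 are independent 1-cycle ops
prog = [(100, []), (1, [0]), (1, []), (1, [])]
assert run(prog, out_of_order=False) > run(prog, out_of_order=True)
```

In-order issue serializes i2 and i3 behind the dependent i1, while out-of-order issue slips them under the load's latency, finishing earlier.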


Figure 2.1. Overview of an in-order pipeline. In this pipeline, instructions are issued to the execution units in program order, first i1, then i2 and so on. If the instruction at the head of the FIFO (i1) is not ready for execution, the pipeline stalls and the CPU is not fully utilized.



Figure 2.2. Overview of an out-of-order pipeline. In this pipeline, (ready) instructions can be issued to the execution units out of program order. At each cycle, a few ready instructions are issued to the execution units.

2.1 The In-order Front-end

We assume an in-order front-end that consists of four stages: Fetch, Decode, Rename, and Dispatch. Instructions enter the front-end in program order. The front-end handles the bookkeeping necessary to ensure that instructions chosen for out-of-order issue and execution have their operands ready.

Instruction Fetch is the first stage of the pipeline, where the processor accesses the instruction cache using the program counter (PC), brings new instruction(s) into the pipeline, and places them in the fetch queue. It also calculates the address of the next instruction, typically with the assistance of a branch predictor. In an N-way superscalar processor, in theory, N instructions can be fetched at the same time.

[Figure 2.3 appears here. It depicts the pipeline stages Fetch, Decode, Rename, Dispatch, Issue, Execution, Write-back, and Commit, together with the I-cache, branch prediction table, rename map, ARF/PRF, ROB, IQ, LSQ, D-cache, and the functional units (ALU, AGU, BRU, FPU).]

Figure 2.3. An overview of an out-of-order processor pipeline. It includes three macro-stages: the front-end, the scheduler, and the back-end. The front-end and the back-end are typically in-order while the scheduler is out-of-order.

Instruction Decode takes the instructions from the fetch queue and places them in decoders that analyze their opcodes to determine the instruction type. To simplify execution, most processors split instructions (macro instructions) at the decode stage into simpler micro-operations (μ-ops). For example, a load instruction may be split into two μ-ops, one to calculate the address and the other to access the memory.

Register Renaming

Data hazards, WAW (write after write), WAR (write after read), and RAW (read after write), can slow down processor performance because they force instructions to execute serially, reducing the potential for parallel execution. WAW and WAR are false dependencies caused by the compiler reusing the limited architectural registers provided by the ISA. RAW, however, is a true dependency, since it is a producer-consumer dependency. False dependencies are name dependencies, and changing the names eliminates those hazards. Register renaming enables as much ILP as possible even though instructions reuse the architectural registers; to do so, it detects and respects true dependencies among instructions while eliminating the false ones.

In a processor that renames registers to avoid false dependencies, there are two register files: the architectural register file (ARF), defined by the ISA and visible to the programmer, and the physical register file (PRF), defined by the implementation, whose registers are mapped to ISA registers as needed. Different implementations of an ISA will have the same ARF but different PRF sizes. The PRF must be at least as large as the ARF so that each architectural register can be renamed at least once.

Register renaming (RR) comprises three tasks: reading the source operands, allocating the destination register, and register update. Reading the source operands includes identifying the operands and fetching them. Identification occurs at the rename stage (or decode). Fetching can be done in the rename stage or be postponed until after dispatch. In this thesis, we consider fetching the operands after dispatch, which is more common in modern processors. Fetching the operands after dispatch has the benefit of being less costly, because the operands do not have to be stored with the instruction in the instruction queue.

Figure 2.4 depicts an overview of the ARF and PRF. After identifying the operands, the register renaming logic accesses the ARF to fetch them. Each entry in the ARF has a busy bit. If the busy bit is zero, there is no pending write to this operand, the ARF contains the latest version, and register renaming fetches the value. If the busy bit is one, there is a pending write to that register and the content of the ARF is obsolete. The PRF has a valid bit for each entry, identifying whether the operand is ready. Two possibilities can occur based on the valid bit in the PRF.

If the busy bit of the ARF entry is set, the map table identifies which physical register the pending write targets. Each entry in the PRF has a tag, or identification number. The corresponding PRF entry is accessed through this tag, and two possibilities can occur. If the valid bit is set, the producer of this entry has finished execution, so the operand can be fetched from the PRF. If the valid bit is zero, the execution of the producer is ongoing. In this case, instead of fetching the non-ready register, register renaming forwards the tag from the map table to the consumer. This tag is later used by the scheduler to get the operand when it becomes available. After finishing execution, each instruction broadcasts the tag of its destination register to all instructions in the IQ to indicate that the register is now ready. Instructions that were waiting on that register are then marked as ready and can be selected for execution. Figure 2.4 depicts all the alternatives of source reading in register renaming.

Destination allocation comprises three sub-tasks: setting the busy bit, assigning a tag, and updating the map table. Free physical registers are kept in a free list. The destination register of each instruction is used to access the ARF, and since that register now has a pending write, its busy bit is set. The selected architectural register must also be mapped (renamed) to a free physical register. The busy bit of the physical register is set, and its tag (an index) is used to update the map table to record that the architectural register has been mapped to this particular physical register, so that later instructions can refer to it if they need that register. Subsequent dependent instructions learn of this mapping when they go through the source-reading phase of register renaming.

The third task, register update, affects the ARF and occurs at the back-end of the pipeline. Register update has two phases.
In the first phase, when an instruction finishes execution, it writes its result to the PRF entry indicated by its tag. In the second phase, when the instruction is at the commit stage, the result is copied from the PRF to the ARF. These two phases can occur back-to-back if the updating instruction I is the oldest among all the instructions, or the second phase can happen many cycles later if there are unfinished older instructions ahead of instruction I. After a physical register is copied to an architectural register, its busy bit is reset and it is added to the free list, from which it can be used to rename another architectural register. This is called register retirement; it occurs at the commit stage, and chapter 4 describes it in more detail.
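The map-table and free-list interplay described above can be sketched in a few lines. This is a minimal illustration, not the hardware design: the register names, the flat `map_table` dictionary, and the `ready` set are assumptions of the sketch.

```python
# Minimal register-rename sketch: a map table from architectural to physical
# registers plus a free list. Names (r*, p*) are illustrative only.

free_list = ["p0", "p1", "p2", "p3"]
map_table = {}          # architectural register -> physical register
ready = set(free_list)  # physical registers whose value has been produced

def rename(instr):
    """Rewrite sources via the map table, then allocate a fresh physical
    register for the destination, eliminating WAR/WAW name hazards."""
    srcs = [map_table.get(s, s) for s in instr["srcs"]]
    dst = free_list.pop(0)
    map_table[instr["dst"]] = dst
    ready.discard(dst)            # pending write: value not yet produced
    return {"srcs": srcs, "dst": dst}

i1 = rename({"dst": "r1", "srcs": []})        # r1 is mapped to p0
i2 = rename({"dst": "r2", "srcs": ["r1"]})    # reads p0, writes p1
i3 = rename({"dst": "r1", "srcs": []})        # WAW on r1 gets a fresh p2
assert i2["srcs"] == ["p0"] and i3["dst"] == "p2"
```

Note how the second write to r1 receives a fresh physical register, so i2's read of p0 is unaffected: the WAW hazard is gone while the true RAW dependency (i2 on i1) is preserved through the mapping.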

Instruction Dispatch and In-order Allocation

Allocation is the task of placing instructions into the out-of-order scheduling resources: the Re-Order Buffer (ROB), the Instruction Queue (IQ), and the Load-Store Queue (LSQ). At dispatch, the pipeline places all instructions in the ROB in program order. The ROB keeps the order among the instructions and makes sure they commit in order. At the same time as the

Figure 2.4. Register renaming.

ROB, the instructions are also placed in the IQ to wait for their operands to become ready and then be issued to the functional units for execution. In typical out-of-order processors, all instructions are considered non-ready and are therefore inserted into the IQ, which later determines when they become ready. Moreover, the out-of-order scheduler does not necessarily consider the effect of individual instructions on performance. In chapter 3 we discuss how many instructions are actually ready before they are inserted into the IQ, which provides the opportunity for them to bypass the IQ and avoid both the energy of writing them into the IQ and the capacity required to hold them. If a dispatched instruction is a memory reference, e.g., a load or store, the dispatcher also places it in the LSQ. This is necessary to maintain the order among the memory instructions. The LSQ also allows older stores to forward their results to younger (newer) loads, thereby avoiding the need to wait for the older stores to be written to the cache first.

2.2 Out-of-order Scheduling

Once instructions are placed in the ROB, their execution can be re-ordered while maintaining correct program semantics, as the instructions will commit in the program order recorded in the ROB. The ability to execute instructions out-of-order is one of the most important techniques for achieving higher performance in modern processors. Out-of-order scheduling has two elements: the scheduling steps (the pipeline stages involved) and the scheduling resources. The scheduling steps consist of four phases: wakeup, select, issue, and execution. The resources are the ROB, IQ, LSQ, and physical registers (register renaming was already covered in the previous section).

2.2.1 Scheduling Resources

Reorder Buffer (ROB)
Even though instructions are executed out-of-order, the processor needs to keep track of their original order to ensure correct semantics. The ROB does this. The ROB is a first-in-first-out (FIFO) queue that maintains the program order of the instructions. The instruction at the head of the ROB is the oldest and the one at the tail is the youngest. Accordingly, the visible architectural state of the processor matches the last instruction that has been removed from the ROB.

The ROB contains instructions in different stages, from the rename stage until the commit stage. Essentially, the ROB captures the next set of instructions that will be, have been, or are being executed but have not yet been committed. The status of an instruction in the ROB can be: not ready to issue; ready and waiting to be issued; issued but not yet finished; executed but not committed; or ready to commit.

The ROB contains instructions whose effects will soon be visible to the program unless execution is disrupted, for example by interrupts or exceptions. It also implements the support for recovering from mispredicted branches. When a branch misprediction occurs, the ROB allows the speculatively (wrongly) executed instructions, those younger than the mispredicted branch, to be squashed (removed) from the pipeline.
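The ROB's two jobs, in-order commit from the head and squashing younger instructions on a misprediction, can be sketched as a FIFO. This is an illustration under simplifying assumptions (integer instruction IDs, a bare `done` flag), not the hardware structure.

```python
# Illustrative ROB sketch: a FIFO that commits finished instructions from
# the head in program order, and squashes everything younger than a
# mispredicted branch.
from collections import deque

rob = deque()

def dispatch(instr_id):
    rob.append({"id": instr_id, "done": False})

def commit():
    """Remove finished instructions from the head, in program order."""
    committed = []
    while rob and rob[0]["done"]:
        committed.append(rob.popleft()["id"])
    return committed

def squash_after(branch_id):
    """Drop all entries younger than the mispredicted branch."""
    while rob and rob[-1]["id"] != branch_id:
        rob.pop()

for i in range(4):
    dispatch(i)
rob[0]["done"] = rob[1]["done"] = True
assert commit() == [0, 1]     # 2 is unfinished, so commit stops at the head
squash_after(2)               # branch 2 mispredicted: instruction 3 squashed
assert [e["id"] for e in rob] == [2]
```

The head-blocking behavior visible here (commit stops at the first unfinished entry) is exactly the back-end stall that chapter 4 attacks with out-of-order commit.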

Instruction Queue (IQ) The IQ is the pillar of instruction scheduling: it holds instructions until they become ready for execution and then issues them to the functional units. The IQ selects among the ready instructions. The selection is done by arbiters that implement a priority policy, such as oldest-first (age-based). The other functionality of the IQ is instruction wakeup. Wakeup is the process of "marking" instructions as ready, which occurs when all of an instruction's operands are ready. On each cycle, the IQ performs wakeup and select to choose the instructions to issue in that cycle. Instructions spend a shorter lifetime in the IQ compared to the ROB since they leave the IQ after being issued, while the ROB keeps them until they commit1. The IQ is typically an associative hardware structure implemented as a Content Addressable Memory (CAM). The essence of a CAM is that it performs a parallel comparison "internally" across all entries. Figure 2.5 shows a high-level overview of a single entry of an IQ. Based on this figure, the IQ has three sections: the wakeup logic, the tags and ready logic, which hold the instructions as well, and the select logic.

1Lifetime is the number of cycles that an instruction spends in a queue. Instruction statuses related to the execution stage and beyond apply only to the ROB and not to the IQ.

Figure 2.5. IQ and details of an entry (wakeup logic; tags, select logic, and ready logic; left operand/ready-left; right operand/ready-right). The IQ has three circuit sections, the wakeup logic, the select logic, and the tag-and-ready logic, which holds the instructions as well. At each cycle, instructions are woken up and selected in a loop that requires expensive wiring between all entries (fully associative) in the IQ, which makes it the most complex and energy-hungry part of instruction scheduling.

The Load Store Queue (LSQ) The LSQ holds loads and stores after they have been issued until they commit. This module monitors the ordering between memory accesses to enforce the ISA's memory model. For example, x86 uses a "total store order" memory model, and the LSQ is responsible for ensuring that loads and stores do not bypass each other in ways that would violate the ordering imposed by that memory model. The LSQ ensures that the programmer experiences the correct ordering between older stores and younger loads, making sure no load is executed before an older store that might alias with it. Aliasing occurs when a load and a store have the same address; reordering the execution of such instructions causes the load to read the wrong value. Processors commonly use memory dependency predictors to allow them to reorder the execution of loads and stores that are unlikely to depend on each other, thereby executing them earlier and improving performance. Memory dependency prediction predicts whether a store might alias with load(s). If a load is predicted not to alias with earlier stores, it can be issued even before those older stores, which improves the performance of the out-of-order processor. On a memory dependency misprediction, the LSQ replays the load(s). With memory dependency prediction, processors often reorder loads and stores, but the LSQ and ROB are responsible for ensuring that the instructions and the memory system experience the operations in an order that respects the ISA's memory model.
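The conservative alias check the LSQ enforces can be sketched as follows (a hypothetical toy model, not the thesis's LSQ; `may_issue_load` and its argument shapes are invented for illustration). A real design would additionally consult a memory dependency predictor to speculate past stores with unresolved addresses, replaying on a misprediction.

```python
# Hypothetical LSQ ordering check: a load may issue only if no older store
# has an unknown or matching (aliasing) address.
def may_issue_load(load_addr, older_stores):
    """older_stores: list of (addr_or_None, value); None = address not yet computed."""
    for addr, _ in older_stores:
        if addr is None or addr == load_addr:
            return False          # possible alias: conservatively hold the load
    return True

stores = [(0x100, 7), (None, 3)]                     # second store's address unknown
assert may_issue_load(0x200, stores) is False        # unknown address may alias
assert may_issue_load(0x200, [(0x100, 7)]) is True   # provably disjoint: safe to issue
assert may_issue_load(0x100, [(0x100, 7)]) is False  # true alias: must wait or forward
```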

2.2.2 Scheduling Steps
Instruction Wakeup.
Instructions wait in the IQ until their operands become ready and they are selected for execution. The IQ wakes up the waiting instructions (marks them as ready for execution) as their producers finish execution and generate the required operands. As a waiting instruction can be anywhere in the IQ, the results2 of executed instructions must be broadcast to all entries in the IQ. As the scheduler can execute multiple instructions every cycle, multiple results need to be broadcast simultaneously. For every completed operand broadcast, every instruction in the IQ needs to compare its input operands to see if there is a match, showing that the operand is ready. This requires multiple comparators per IQ entry and as many broadcast busses as operands that can be generated in a cycle. On a match, the IQ marks the operand as available. The instruction itself becomes ready when all of its operands are ready.

Instruction Select.
The instruction issue logic prioritizes and selects instructions for execution from the ready instructions in the IQ. As ready instructions can be anywhere in the IQ, all IQ entries need to be examined in parallel to select among them. Ready instructions are prioritized based on their type (often memory accesses first, to increase MLP) or age (oldest first, to avoid chains of stalled instructions). Computing these priorities requires complex comparison trees of instruction opcodes and tags. In addition to the priority logic, the IQ requires as many output ports as the maximum issue width to enable the selected instructions to be read. Because of this complexity, the size and energy of the IQ increase super-linearly with the number of entries and the issue width [22].
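The wakeup/select loop of the two steps above can be sketched in Python (a hypothetical functional model; the entry fields and function names are invented for this sketch, and the hardware does the compare and prioritization in parallel rather than in a loop):

```python
def broadcast_wakeup(iq, produced_tags):
    """Wakeup: compare every entry's source tags against the broadcast results (CAM-style)."""
    for entry in iq:
        entry["ready_srcs"] -= set(produced_tags)   # operand becomes available on a match

def select(iq, issue_width):
    """Select: oldest-first among fully ready entries, up to the issue width."""
    ready = [e for e in iq if not e["ready_srcs"]]
    ready.sort(key=lambda e: e["age"])              # age-based priority policy
    issued = ready[:issue_width]
    for e in issued:
        iq.remove(e)
    return [e["name"] for e in issued]

iq = [{"name": "i1", "age": 0, "ready_srcs": {"r1"}},
      {"name": "i2", "age": 1, "ready_srcs": set()},
      {"name": "i3", "age": 2, "ready_srcs": set()}]
broadcast_wakeup(iq, ["r1"])          # producer of r1 finished this cycle
assert select(iq, 2) == ["i1", "i2"]  # oldest two of the three ready entries
assert select(iq, 2) == ["i3"]
```

The per-entry set difference stands in for the per-entry comparators; the sort stands in for the selection tree — both of which exist physically for every entry, which is why IQ cost scales so badly.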

Instruction Issue.
This phase consists of sending the selected instructions to the functional units. In modern processors, the functional units are mostly pipelined, so each functional unit can typically accept a new instruction each cycle. Out-of-order issue allows processors to issue instructions based on their readiness and priority, which is often different from their age. Out-of-order processors issue instructions from anywhere in the IQ. We consider a unified issue queue (IQ), where all instructions are placed in the same IQ and issued from it to the functional units.

Instruction Execution.
In this stage, the operands of each instruction are sent to the functional units and the operation is performed.3 To improve performance, dependent back-to-back instructions can read the generated results from the forwarding path even before they are written back to the PRF. Without a forwarding path, a dependent instruction has to wait until the producer writes its results into the PRF.

2Depending on the implementation, the broadcast information can be the result, the destination register number, the instruction id, or a combination of the above. 3Although the results are generated in this stage, they are not written to the PRF until the write-back stage.

2.2.3 Cost Evaluation: the Width and the Depth of the IQ
A significant portion of the total core energy is consumed during instruction scheduling. The heart of instruction scheduling is the IQ, as it is responsible for storing the instructions while they wait to become ready, detecting when they are ready, and choosing which ones to issue. The size of the IQ (depth and issue width) significantly affects the energy consumption of the scheduler.

• IQ width/issue width: the number of instructions that can be issued in the same cycle. It determines the number of arbiters and selectors required to select among the ready instructions. It also defines the maximum number of broadcasting wakeup wires.

• IQ capacity/IQ depth: the number of instructions that the IQ can store. It determines how many entries the IQ has to observe at each cycle to find ready instructions. It also determines how many entries each wake-up signal has to be wired to.

There are two competing issues: the circuit cost (energy, area, delay) and the ability to extract ILP (a larger window). Changing the dimensions of the IQ affects the delay and cost of instruction scheduling. More entries increase the number of in-flight instructions; however, as the size of the IQ increases, its efficiency decreases since the circuit delay increases. Increasing the depth of the IQ affects the wakeup delay quadratically [22]. Selecting from the ready instructions requires reduction trees whose complexity grows linearly with the issue width of the IQ and logarithmically with its depth [23, 29, 28]. Because of this complexity, the power consumption of the IQ grows dramatically with depth and width, as shown in Figure 2.6. Conversely, the energy of simpler scheduling structures, such as in-order FIFO queues, scales more gracefully [22]. However, this comes at the cost of less flexible scheduling and lower performance: FIFOs can only consider instructions at the head of the queue. The focus of Chapter 3 is how we can use these cheaper scheduling structures to achieve the performance of the more expensive IQs. Moreover, when the IQ depth is quadrupled, the out-of-order window in which ready instructions can be found only doubles on average [22]. As a result of this combination of a large increase in cost and a much smaller increase in performance as the depth is increased, designers are always looking for energy-efficient techniques to increase the capacity of the IQ.
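The scaling trends quoted above can be made concrete with a toy cost model (illustrative constants only, no measured data; the functions and unit-free numbers are assumptions made for this sketch, reflecting only the quadratic-wakeup and width-times-log-depth select trends cited from [22, 23, 29, 28]):

```python
import math

# Toy cost model: wakeup delay ~ depth^2, select-tree cost ~ width * log2(depth).
def wakeup_delay(depth):
    return depth ** 2

def select_cost(width, depth):
    return width * math.log2(depth)

assert wakeup_delay(128) / wakeup_delay(32) == 16    # 4x depth -> 16x wakeup delay
assert select_cost(8, 64) / select_cost(4, 64) == 2  # 2x width -> 2x select cost
```

Contrasted with a window-size benefit that only doubles when the depth quadruples, this super-linear cost growth is what motivates augmenting a small IQ with cheap FIFOs rather than simply enlarging it.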

2.2.4 Research Problem I: Inefficient Scheduling
All instructions pass through the IQ, although it is one of the most energy- and area-expensive components of the processor. We have identified that not all instructions need, nor benefit from, the expensive IQ for scheduling. Indeed,

some instructions can either be scheduled directly (because they are ready at dispatch) or can have their scheduling delayed (because they are non-critical). We tackle this problem by looking into how we can classify instructions based on how much they will benefit from the IQ, and how we can use this information to use simpler scheduling structures. In Chapter 3 we provide two efficient schedulers in which instruction classification (readiness and criticality) is used to determine how much instructions will benefit from out-of-order scheduling.

Figure 2.6. Energy consumption per access of IQ (CAM-based) and queue (FIFO-based) instruction storage as a function of width (color) and depth (x-axis). IQ energy grows very significantly with both width and depth, while the simpler FIFO implementation is enormously more efficient.

2.3 The In-order Back-end
The back-end includes the write-back and commit stages. Instructions may enter the back-end out-of-order, but they leave it in program order.

2.3.1 Architectural vs. Speculative State of a Processor
A processor needs two states to allow instructions to execute out-of-order while still making the execution look in-order to the programmer.

The architectural state is updated at the commit stage in program order. Architectural state updates essentially remove the mapping between the physical and architectural registers and write the data from the physical registers to the architectural ones.

The speculative state contains the last architectural state in addition to the changes created by the in-flight instructions. These changes are called speculative since they may not become part of the architectural state if the instructions are squashed. Squashing means that instructions leave the pipeline while their finished results are discarded or their execution is canceled. For example, all instructions after a predicted, but unresolved, branch are speculative as they may turn out to be the wrong instructions if the prediction was incorrect. It is only once the prediction is verified to be correct that the effects of these instructions can be committed to the architectural state. Until then they remain speculative. To execute instructions out-of-order, but still provide an execution that is functionally equivalent to the in-order original program, the processor must correctly convert speculative (out-of-order) instructions into architectural (in-order) ones. This speculation is required to improve the performance of an out-of-order processor.

2.3.2 Overview of the Back-end
From a programmer's point of view, a processor starts the execution of each instruction only after the previous one has completed. However, pipelining leads to overlapping instructions, which means that the processor does not wait for the previous instructions to complete (write back) before starting new ones. In a pipelined processor there are several instructions in flight in different phases of their execution. As a result, instructions may update the processor state in any order, in particular in orders that do not respect the one specified by the programmer. To see the difficulties that arise due to this re-ordering, consider two instructions, inst1 and inst2, where inst1 is older. Inst1 may cause an exception in the last phase of its execution while inst2 might have updated some register in its write-back stage (before inst1). While the correct execution from the program's point of view would be that inst2 never wrote back to the register because inst1 caused an exception, in the pipelined processor this has already happened by the time the exception is raised. In this case, the processor cannot provide a state that includes the effects of all instructions older than inst1 but excludes those of inst2. The main solution is to hold speculative state separate from architectural state, essentially emulating sequential execution in an additional stage, called commit, at the end of the pipeline. Speculative state is not part of the architectural state until the instruction commits. Instructions flow through this stage in the original program order. Indeed, any effects that the instructions have on the processor state before reaching the commit stage are considered speculative. In the example above, the exception triggered by inst1 is handled in the commit stage. Since instructions are committed in program order, inst2 does not commit unless inst1 does; therefore, its changes will not become part of the architectural state since inst1 fails to commit.
In this way, the commit stage handles the exception as if no instruction younger than inst1 had ever been executed. Such a processor provides precise interrupts/exceptions since its architectural state is always the state just before the instruction that raises the exception [25].
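The commit-stage handling of inst1 and inst2 above can be sketched as a toy model (hypothetical helper names, not the thesis's hardware): retirement walks the ROB from the head, stops at an unfinished entry, and on an excepting entry squashes it together with everything younger.

```python
# Hypothetical sketch: at commit, an excepting instruction stops retirement;
# everything younger in the ROB is squashed so the architectural state stays precise.
def commit_until_exception(rob):
    """rob: oldest-first list of (name, done, excepts). Returns (committed, squashed)."""
    committed = []
    for i, (name, done, excepts) in enumerate(rob):
        if not done:
            break                       # cannot commit past an unfinished instruction
        if excepts:
            return committed, [n for n, _, _ in rob[i:]]  # squash exception + younger
        committed.append(name)
    return committed, []

rob = [("inst0", True, False), ("inst1", True, True), ("inst2", True, False)]
done, squashed = commit_until_exception(rob)
assert done == ["inst0"]                # only instructions older than the exception commit
assert squashed == ["inst1", "inst2"]   # inst2's write-back never reaches architectural state
```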

Instruction Completion and Commit
Instructions finish executing when their results have been computed in a functional unit, but these results are not written back to the architecturally visible processor state until the instructions commit. While regular flows of instructions pass through the front-end, scheduler, and back-end, there are important events, interrupts and exceptions, which disrupt this flow and have to be handled. When an interrupt occurs, the program execution must be suspended to allow the operating system to service the interrupt. One way to do this is to stop fetching new instructions and allow the instructions that are already in the pipeline to finish execution, at which time the state of the machine can be saved. Once the interrupt has been serviced by the operating system, the saved machine state can be restored and the original program can resume execution. When exceptions occur, the results of the computation may no longer be valid and the operating system may need to intervene to handle them. The architectural state of the machine present at the time of the excepting instruction must be saved so that the program can resume execution after the exception is serviced. Machines that are capable of supporting this are said to have precise exceptions. A precise exception involves being able to checkpoint the state of the machine just before the execution of the excepting instruction and then resume execution by restoring the checkpointed state and restarting at the excepting instruction. To support precise exceptions, the processor must maintain its architectural state and evolve this machine state as if the instructions in the program were executed one at a time in the original program order.
The reason is that when an exception occurs, the state the machine is in at that time must reflect the condition that all instructions preceding the excepting instruction have completed while no instruction following the excepting instruction has completed. For an out-of-order processor to have precise exceptions, this sequential evolution of the architectural state must be maintained even though instructions are actually executed out of program order. To support precise exceptions, instruction completion must occur in program order to update the architectural state of the machine in program order. To accommodate instructions finishing execution out-of-order but committing in-order, a reorder buffer is needed in the instruction completion stage of the parallel pipeline. As instructions finish execution, they enter the reorder buffer out of order, but they exit the reorder buffer in program order. As they exit the reorder buffer, they are considered architecturally completed. Precise exceptions are handled by the instruction completion stage by using the reorder buffer to ensure that only instructions before the exception in program order affect the architectural state. When an exception occurs, the excepting instruction is tagged in the reorder buffer. The completion stage checks whether each instruction before that instruction has completed. When the tagged instruction is found, it is not allowed to complete. All the instructions before the tagged instruction are allowed to complete.

The machine state is then checkpointed or saved. The machine state includes all the architectural registers and the program counter. The remaining instructions in the pipeline, some of which may have already finished execution, are discarded. After the exception has been serviced, the checkpointed machine state is restored and execution resumes with the fetching of the instruction that triggered the original exception. In Chapter 4 we provide more detail about the research problem associated with in-order commit.

2.3.3 Research Problem II: In-order Commit is Overly Conservative
Because instructions can only commit in-order, speculative instructions must keep their state in the processor until all instructions before them have committed. This means that speculatively executed instructions may tie up resources (reorder buffer (ROB) entries, load/store queue (LSQ) entries, and physical registers) for longer than it takes to execute the instruction. The processor may stall even though there are completed instructions that could be committed (and thereby free up resources) without breaking program order. The idea of out-of-order commit (OOC) is to overcome this problem by allowing completed instructions to commit out-of-order (and thereby free up resources) as long as doing so would not result in an incorrect execution. However, the potential of OOC has not been explored in detail. In this thesis we revisit out-of-order commit by examining the potential performance benefits of lifting the commit conditions [5] one by one and in combination, for both non-speculative and speculative out-of-order commit. While correctly handling recovery for all out-of-order commit conditions currently requires complex tracking and expensive checkpointing, this work aims to explore the potential for selective, speculative out-of-order commit using an oracle implementation without speculative rollback costs. Chapter 4 introduces the work in this thesis in evaluating OOC as an avoidance or prevention technique, including its safety and the risk of not being able to roll back when needed. The commit conditions described by [5] are also analyzed in that chapter to observe their effect on performance when they are evaded, while considering the cost of rolling back to a precise machine state.

3. Efficient Resource Allocation: Scheduling Considering Readiness and Criticality of Instructions

In this chapter, we propose ideas to reduce both the width and the depth of the IQ to reduce its energy; however, doing this blindly hurts performance. Our insight is that by carefully skipping IQ placement for some instructions and by delaying some other instructions, we can respectively reduce the issue ports (width) and the pressure on the IQ's capacity (depth) without hurting performance. However, we need to be able to identify the instructions that can skip or delay the IQ. To do so, we categorize them as "need IQ scheduling", "don't need IQ scheduling", and "hopefully won't need it if we wait a bit". In the end, we propose hardware that takes advantage of this instruction classification to avoid expensive IQ-based instruction scheduling when possible.

3.1 Research Problem: Inefficient Scheduling
Traditional processors place all instructions in the IQ, the most energy-hungry part of out-of-order scheduling. However, we have identified classes of instructions that do not benefit from this energy expenditure. Placing all renamed instructions in the IQ for scheduling is inefficient since some instructions do not benefit from the out-of-order functionalities of the IQ, namely select and wakeup. This is because some have their operands ready before dispatch, some are missing one operand that will be ready in just a few cycles, and some others have one operand ready while the other might take a few hundred cycles to become ready. Indeed, designers have accepted the cost of complex decode and rename stage implementations, but the information provided by this complexity is ignored by the scheduler, even though it could help to delay or bypass IQ placement for some of the instructions. Our data shows that a significant number of instructions fall into these categories, leading to a considerable reduction in IQ accesses and capacity pressure.

The Importance of Tackling Inefficient Scheduling
One of the main determinants of processor performance is how many instructions it can inspect to find ready instructions to execute. This is a function of the number of in-flight instructions, which are all stored in the IQ in a traditional design. Single-threaded processors require more in-flight instructions as an approach to improving their performance. The challenge is how to increase the number of in-flight instructions without paying the cost of a larger IQ. Of the in-flight instructions, only some are ready to be issued. Unfortunately, increasing the number of in-flight instructions by a factor of four only delivers a 2X increase in ready instructions [29, 28]. An additional challenge is the escalation in circuit complexity (and decrease in clock speed) of larger IQs. The IQ consumes a significant part of the core energy [15, 18]. However, it is vital for the performance improvement of a single-threaded processor to have more in-flight instructions, which traditionally requires increasing the IQ depth and width. This raises the challenge of how we can increase the number of in-flight instructions without increasing the size of the IQ. For a standard OoO processor, the number of in-flight instructions is equivalent to the depth (number of entries) of the IQ. In our proposed design the instructions are distributed between the IQ and a FIFO queue. By distributing the instructions among the IQ and the FIFO, the size of the dynamic window is increased, but at far lower cost than making the IQ larger. Alternatively, this insight can allow us to reduce the energy by reducing the size of the IQ and augmenting it with a FIFO to retain the same in-flight window size at lower energy. Dividing instructions into cheap and expensive can reduce the IQ width as well. Since cheap instructions bypass the IQ and are issued from the FIFO, fewer instructions will be issued from the IQ and the width of the IQ can be reduced. As shown in Figure 3.1, reducing the IQ width is also very effective in reducing the energy consumption of CAM-based IQs.

Figure 3.1. Energy consumption per access of IQ (CAM-based) and queue (FIFO-based) instruction storage as a function of width (color) and depth (x-axis). IQ energy grows very significantly with both width and depth, while the simpler FIFO implementation is enormously more efficient.

Table 3.1. Readiness status of instructions based on the availability of operands at dispatch time.

  Left operand    Right operand    Instruction readiness
  pending         pending          non-ready
  pending         ready            half-ready
  ready           pending          half-ready
  ready           ready            ready
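Table 3.1 amounts to a two-input truth table, which can be encoded directly (a hypothetical helper written for this summary, not thesis code):

```python
# Direct encoding of Table 3.1: readiness of a two-operand micro-op
# from the availability of its operands at dispatch time.
def readiness(left_ready, right_ready):
    if left_ready and right_ready:
        return "ready"
    if left_ready or right_ready:
        return "half-ready"
    return "non-ready"

assert readiness(False, False) == "non-ready"
assert readiness(False, True) == "half-ready"
assert readiness(True, False) == "half-ready"
assert readiness(True, True) == "ready"
```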

3.2 Insight I: Some Instructions do not Need OoO Scheduling
In x86 processors, macro instructions are broken down into micro-operations (μ-ops) with a maximum of two operands (inputs). Two-input μ-ops can be classified into four categories based on the readiness of their operands at the rename stage, as shown in Table 3.1.

Observation 1: Ready at Dispatch (R@D)
Based on Table 3.1, whenever both operands of an instruction are ready before it is dispatched, we call it R@D, and it can bypass the IQ because it does not benefit from out-of-order wakeup, select, or value forwarding. It does not require wakeup since all of its operands are already available. R@D instructions do not necessarily need selection either, since they can be issued directly to the functional units. Applying out-of-order wakeup and select across all instructions in an IQ is precisely what makes the IQ expensive and hard to scale; R@D instructions, however, can bypass it. Instruction I1 in Figure 3.3 is an R@D instruction since both of its operands are provided by ready registers. The out-of-order scheduler places all the instructions in the IQ; however, based on Figure 3.2 we find that on average across the SPEC CPU2006 benchmarks, over 20% of instructions are R@D.

Observation 2: Almost Ready at Dispatch (AR@D)
Some instructions have one ready operand and one almost-ready operand, i.e., one that becomes ready soon. This includes the second and third rows of Table 3.1. We call these instructions AR@D. The non-ready operand of an AR@D instruction is almost ready since it comes from either an R@D or an AR@D instruction, but not from a load. As a result, AR@D instructions are also good candidates for scheduling via a FIFO queue. The reason is their short lifetime in the dynamic instruction window. The effect of this short lifetime is twofold: firstly, out-of-order instruction wakeup does not help AR@D instructions execute earlier, and secondly, they are likely to be ready soon, before reaching the head of the FIFO. In Figure 3.3, instruction I3 is AR@D. We do not consider instruction I4 to be AR@D as it is likely to take longer to be ready since neither of its operands is ready. Based on Figure 3.2, 22% of total instructions are AR@D.

Figure 3.2. Classification of instructions at the dispatch stage based on the availability of their operands, across the SPEC CPU2006 benchmarks (average shown): both operands ready (ready-at-dispatch, R@D, 20%), one operand ready (almost-ready-at-dispatch, AR@D, 22%), one operand ready and one coming from a load (load-tail, LDTail, 25%), and no operands ready (not-ready-at-dispatch, NR@D, 33%).

Observation 3: Load Tail (LDTail)
We classify instructions that are dependent on load instructions as load tails (LDTail), as a load is not a cheap instruction: an LDTail instruction has one operand ready while the other comes from a load, i.e., from a register that has not yet been written. LDTail instructions have a relatively longer lifetime compared to the rest of the cheap instructions since their pending operand is provided by a load instruction, and placing them in the IQ would occupy its entries for a long time. Based on this insight, we place LDTail instructions in a FIFO too. Load and store instructions themselves should always be dispatched to the IQ so they execute as early as possible to expose both MLP and ILP. In Figure 3.3, the load instruction (I2) produces an operand consumed by its dependents. I6 is considered expensive because it also has a source coming from another expensive instruction, and I7 is not included as it is likely to take longer since its non-load operand is not yet ready as well. Based on Figure 3.2, 25% of total instructions are LDTail.

Exploiting Instruction Readiness
To take advantage of the readiness of instructions at the dispatch stage, we logically divide instructions into two categories, cheap and expensive to schedule, respectively "no-OOO-benefit" and "OOO-benefit". Cheap and expensive instructions are shown in Figure 3.4 in green and red, respectively. Cheap-to-schedule instructions bypass the IQ and the rest are placed in the IQ. This results in a design with two units for holding the in-flight instructions: an IQ for OoO execution of expensive instructions and a FIFO queue for in-order issuing of cheap instructions, with arbiters at issue selecting among them. As a result, we replace the energy cost of inserting the cheap instructions into the IQ and removing them from it with the cost of a much simpler FIFO, reducing the


Figure 3.3. Instruction dependency graph showing R@D (green), AR@D (blue), and LDTail (red) instructions. The register rename table indicates that physical registers 1-3 have been written (inputs to the R@D and AR@D instructions) while physical register 5 (PR5) has not. I7 has PR5 and the result of I6 as input operands. Apart from PR5, I6 is pending as well, which is why I7 is not classified as LDTail. I6 is a load-dependent instruction, but it is not an LDTail since it has two pending operands.
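The dispatch-time classification illustrated by Figure 3.3 can be sketched as a toy classifier (a hypothetical model written for this summary; the producer-class encoding is an assumption, not the thesis's hardware): an instruction is R@D when no operand is pending, AR@D when its single pending operand comes from a cheap producer, LDTail when that operand comes from a load, and expensive (NR@D) otherwise.

```python
# Hypothetical dispatch-time classifier following the text's rules.
def classify(pending_producers):
    """pending_producers: producer class ('R@D'/'AR@D'/'load'/'NR@D') of each
    not-yet-ready operand; an empty list means all operands are ready."""
    if not pending_producers:
        return "R@D"
    if len(pending_producers) == 1:
        if pending_producers[0] in ("R@D", "AR@D"):
            return "AR@D"
        if pending_producers[0] == "load":
            return "LDTail"
    return "NR@D"                       # two pending operands, or an expensive producer

assert classify([]) == "R@D"                   # like I1: both operands ready
assert classify(["AR@D"]) == "AR@D"            # like I3: pending operand from a cheap producer
assert classify(["load"]) == "LDTail"          # pending operand comes straight from a load
assert classify(["load", "NR@D"]) == "NR@D"    # like I6: two pending operands -> expensive
```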

total energy. While R@D instructions can be sent directly to the functional units for execution, in practice a buffer is needed when all functional units are occupied. They can be placed in a FIFO queue since FIFOs are cheaper (than CAMs) to implement. Placing instructions in FIFO queues is not free, but it is significantly more energy efficient, as shown in Figure 3.1. The other instructions, which require out-of-order scheduling, are placed in the IQ. This essentially shapes the idea of our first solution to the problem of inefficient scheduling, called FIFOrder [4], a ready-aware instruction scheduling microarchitecture (see Figures 3.4 and 3.6).


Figure 3.4. Insight: categorizing instructions into cheap and expensive, respectively "no-OOO-benefit" and "OOO-benefit" (panels: instruction classification and instruction scheduling). Cheap instructions, i.e. R@D, AR@D and LDTail, bypass the IQ while the rest, the expensive ones, are placed in the IQ. Cheap instructions also will not be issued from the IQ.

3.3 Solution I: FIFOrder, Ready-Aware Instruction Scheduling
The goal of FIFOrder is to deliver the performance of an out-of-order processor with the energy budget of an in-order processor. Using the above instruction classification, we hope to be able to bypass the IQ for the majority of the instructions and offload them to cheaper FIFOs, which should allow us to reduce the IQ dimensions and its activity (reads and writes) and thereby save energy. Such a processor will have minimal out-of-orderness, since the majority of instructions bypass the IQ and both the IQ width and depth are reduced significantly. A minimal out-of-order processor in this context means an issue width of one, i.e., the IQ issues a single instruction per cycle. The baseline has an issue width of four; therefore, in the minimal out-of-order processor, three instructions will be issued from the FIFO while in total the processor still issues four instructions per cycle.

First Design: Single FIFO
In the first design for FIFOrder, R@D, AR@D and LDTail instructions are placed in a single FIFO while all other instructions (including loads and stores) are placed in the IQ. This results in the IQ being largely dedicated to memory instructions. In this implementation 66% of the instructions are issued from the FIFO. As a result, the issue width of the IQ can be reduced by three quarters, from 4 to 1. In the baseline processor, AR@D and LDTail instructions can potentially read their missing operand from the value forwarding path(s), but if this value is not yet available at issue time, these instructions need to wait for some time. In FIFOrder, this waiting time is spent in a FIFO instead of a CAM-based IQ. In FIFOrder, when an R@D instruction reaches the head of the FIFO, it is issued to the functional units upon their availability. R@D instructions read their operands from the register files after issue. When AR@D and LDTail instructions arrive at the head of the FIFO, if their pending operand is ready they read it and are then issued to the functional unit(s); otherwise, they wait until the operand becomes available. This waiting causes stalls, due to the in-order nature of the FIFO, which hurts performance.
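The head-of-queue stall of the single-FIFO design can be seen in a toy model (hypothetical names and a simplified readiness test, invented for this sketch): issue is strictly from the head, so one not-yet-ready AR@D/LDTail instruction blocks every ready instruction behind it.

```python
from collections import deque

# Toy single-FIFO issue: only the head may issue, up to the FIFO's bandwidth.
def issue_from_fifo(fifo, ready, bandwidth):
    issued = []
    while fifo and len(issued) < bandwidth and ready(fifo[0]):
        issued.append(fifo.popleft())   # strictly in-order: head first
    return issued

fifo = deque(["ld_tail", "r@d_1", "r@d_2"])
ready = lambda inst: inst.startswith("r@d")     # pretend the load result is still pending
assert issue_from_fifo(fifo, ready, 3) == []    # non-ready head stalls the whole queue
fifo[0] = "r@d_0"                               # once the pending operand arrives...
assert issue_from_fifo(fifo, ready, 3) == ["r@d_0", "r@d_1", "r@d_2"]
```

This is precisely the 15% performance loss mechanism: younger, ready R@D instructions sit behind a stalled head.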

Results and Limitations of the First Design
The fundamental bottleneck of this design is that, because of the single FIFO, AR@D and LDTail instructions are mixed with R@D instructions. This causes the FIFO to stall frequently when non-ready AR@D or LDTail instructions reach its head and block younger, ready instructions from issuing. This hurts performance by 15% compared to the baseline (see Paper IV [4] for more details). The breakdown of stalls for the single-FIFO design is shown in Figure 3.5: the FIFO was stalled (issued zero instructions, or fewer than its bandwidth of three) for 72%


of the execution cycles: 40% of cycles were due to AR@D and the remaining 32% due to LDTail. This suggests that keeping the AR@D and LDTail instructions out of the FIFO that holds R@D instructions would reduce R@D stalls.

Figure 3.5. The distribution of R@D FIFO stalls caused by AR@D and LDTail instructions. Placing the instruction classes in separate queues provides for more out-of-orderness between them and allows different classes of instructions to bypass each other when they are ready, thereby reducing stalls.

Second Design: Double FIFOs
To tackle the problem of R@D stalls caused by non-ready AR@D and LDTail instructions at the FIFO head, we add a second FIFO for AR@D and LDTail instructions. This leaves the first FIFO exclusively for R@D instructions, which will therefore never stall. As in the previous design, the maximum issue width is 1 for the OoO IQ and 3 across both FIFOs. When selecting between the FIFOs, higher priority is given to instructions from the R@D FIFO: they entered the dynamic window with their operands already available, and not giving them the highest priority would add cycles to their lifetime, which could hurt performance.

Results and Limitations of the Second Design
By eliminating the stalls that AR@D and LDTail instructions caused in R@D instruction execution, the dual-FIFO design outperforms the single-FIFO design and is even better than the baseline 4-wide OoO IQ design in a few benchmarks (see Paper IV [4] for more details). On average, the dual-FIFO design matches the baseline performance but does so with more energy-efficient scheduling, thanks to its 1-wide OoO IQ and two 3-wide FIFOs. From the energy point of view, the dual-FIFO design outperforms the baseline in terms of EDP by over 30% on average (see Figure 3.8). Despite the second FIFO, there are still many FIFO stalls, as seen in the middle bar of Figure 3.5. For this design, the majority of the stalls, 35% of all cycles, come from LDTail instructions blocking the second FIFO. To tackle this problem, we separate the LDTail instructions from the AR@D ones by placing them in a third FIFO.

Table 3.2. Microarchitectural parameters (based on Intel Nehalem [10])
Freq, ISA:                3.4 GHz, x86-64
L1i/d:                    32KiB, 8-way, 4clk
L2:                       256KiB, 8-way, 12clk
L3:                       1MiB, 8-way, 36clk
DRAM:                     200clk
Branch predictor:         two-level, front-end penalty 10clk
ROB/IQ/RF(Int,FP)/LQ/SQ:  128/56/(68,68)/48/36
FIFO queues:              32 entries, issue up to 3 from head
Technology/VDD/temp:      22nm itrs-hp / 0.8 / 360K

Table 3.3. FIFO and IQ configurations
Design      # FIFOs   IQ issue width   RF ports
Baseline    0         4                8
Design #1   1         1                8
Design #2   2         1                8
Design #3   3         1                8
FXA [24]    1         2                10

Final Design: Triple FIFO
The final design (Figure 3.6) provides three separate FIFOs: one for each of the instruction classes R@D, AR@D, and LDTail. This design is intended to prevent the different classes from stalling each other. All other instructions are placed in the 1-wide OoO IQ. For instruction dispatch, the highest priority is given to R@D instructions, then AR@D, and the lowest to LDTail. This order gives priority to the instructions that are most likely to be ready soon.


Figure 3.6. FIFOrder microarchitecture. Instructions are classified in the rename stage based on the operand ready bits in the Rename Map Table. In the dispatch stage, they are steered to the IQ or the FIFOs, depending on the instruction classification. The issue stage stores the instructions in either the FIFOs or the IQ and selects ready instructions across the queues for execution. The shaded and gray parts are the parts of the pipeline affected by FIFOrder. Functional units are shared between the IQ and the FIFOs to bring more flexibility at issue time.
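The triple-FIFO dispatch and priority rules can be sketched as follows. The queue names follow the text, but the selection logic is our simplified illustration: real hardware would also check operand readiness at the AR@D and LDTail FIFO heads before issuing.

```python
from collections import deque

# One FIFO per instruction class, plus the 1-wide OoO IQ for everything else.
queues = {"R@D": deque(), "AR@D": deque(), "LDTail": deque()}
iq = deque()

def dispatch(cls, tag):
    """Steer a renamed instruction to its class FIFO, or to the IQ."""
    queues.get(cls, iq).append(tag)

def issue(width=3):
    """Issue up to `width` instructions per cycle from the FIFO heads,
    honoring the priority R@D > AR@D > LDTail (the classes most likely
    to be ready first)."""
    issued = []
    for cls in ("R@D", "AR@D", "LDTail"):
        while queues[cls] and len(issued) < width:
            issued.append(queues[cls].popleft())
    return issued

for cls, tag in [("AR@D", "i1"), ("R@D", "i2"), ("LDTail", "i3"), ("Other", "i4")]:
    dispatch(cls, tag)
# issue() drains the R@D FIFO first: ['i2', 'i1', 'i3']; 'i4' sits in the IQ
```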

Figure 3.7. IPC comparison between the three designs of FIFOrder and a related work, FXA [24], normalized to the baseline (4-wide OoO), across the SPEC CPU2006 benchmarks. Our designs: 1-wide OoO plus 1, 2, or 3 FIFOs; FXA has a 2-wide OoO (see Tables 3.2 and 3.3 for more details).

Figure 3.8. Normalized performance per energy (IPC per energy of the dynamic instruction window) across the SPEC CPU2006 benchmarks. Both the baseline and FXA [24] spend more energy on issuing instructions due to their issue widths of 4 and 2, respectively, compared to 1 in our designs.

Results
The performance and performance-per-energy results of the triple-FIFO implementation are shown in the third set of bars in Figures 3.7 and 3.8. As a result of reducing FIFO stalls (Figure 3.5, right bar), this design outperforms the baseline out-of-order by 8% on average and is 55% higher in energy efficiency.

3.3.1 Instruction Criticality and Limits of Ready-Aware Instruction Scheduling
To the best of our knowledge, there are two ready-aware instruction scheduling techniques: FIFOrder and FXA (Front-End Execution Architecture) [24] (see Figure 1.1.C). Neither of these is aware of the instructions' effect on performance, i.e., instruction criticality. Before describing the limits of ready-aware instruction scheduling, we define instruction criticality.

Instruction criticality
Critical instructions are those whose delayed execution will hurt performance. These instructions often lead up to or generate memory-level parallelism (MLP) [23]. Conversely, non-critical instructions can be delayed to a significant degree without penalty. Non-critical instructions typically depend on long-latency operations and often spend a long time in the IQ before becoming ready, thereby increasing IQ pressure. The long lifetime of non-critical instructions presents the opportunity to delay their placement in the IQ and thereby reduce IQ pressure. Reducing the number of non-critical instructions in the IQ also naturally prioritizes the critical ones. This thesis borrows the definition of criticality from Long-Term Parking [23] but provides a more efficient way of scheduling these instructions since, for the first time, their overlap with readiness is detected and considered in the scheduling.

3.4 Insight II: Overlap Between Readiness and Criticality
We define readiness [4] as having both operands available before dispatch, and criticality as contribution to MLP [23, 8]. As these properties are largely independent, they lead to four categories of instructions that should be handled differently.
Critical & Ready: As these instructions contribute directly to MLP, they are critical for performance and should not be delayed. However, as they are also ready for execution, they do not benefit from the expensive wake-up and select mechanism of the IQ. Therefore, these instructions can bypass the IQ and be issued directly to the functional units for execution.
Critical & Non-Ready: Due to their criticality, these instructions should not be delayed. However, as they do not have all their operands available, they

37                              # #& " & !   &        &  !& .20* ! $ ' (.* (.* "' .20*   .20*!!!   * "   */ "#!   (.*#" #  (.*"&" ! " *! . ! #& *" $  *-/&-/ *" $  *!  !    !&*!1,  %    !&*   #"# )   &

cannot be issued for execution immediately. Therefore, they should be allocated IQ entries so that they can take advantage of the IQ's expensive out-of-order wake-up and select mechanism and be selected for execution as soon as they become ready.
Non-Critical & Ready: As these instructions are ready for execution, they do not require IQ allocation and can be issued directly for execution. However, since they are non-critical, executing them early does not improve performance; on the contrary, eager execution might hurt performance if they take slots from critical instructions. Therefore, these instructions can both bypass IQ placement and have their execution delayed.
Non-Critical & Non-Ready: These instructions are neither performance-critical nor ready for execution. Placing them in the IQ would occupy entries until their operands become available, potentially at the cost of more critical instructions. On the other hand, delaying their scheduling will not harm performance, as they are not performance-critical. These instructions should, therefore, be delayed.

Figure 3.9. Distribution of the instruction classification in an out-of-order processor.
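The four-way policy above can be summarized in a small decision function. The action strings are our own shorthand for the handling the text describes:

```python
def schedule_action(critical: bool, ready: bool) -> str:
    """Map an instruction's (criticality, readiness) to a scheduling action."""
    if critical and ready:
        return "bypass IQ, issue directly"       # no wake-up/select needed
    if critical and not ready:
        return "allocate IQ entry"               # needs OoO wake-up/select
    if not critical and ready:
        return "bypass IQ, delay execution"      # don't steal critical slots
    return "delay, re-check readiness later"     # may become ready meanwhile

print(schedule_action(True, False))   # allocate IQ entry
```

Only one of the four categories ends up paying for an IQ entry; the other three are candidates for bypassing, delaying, or both.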

Limits of FIFOrder [4]
FIFOrder can reduce the IQ issue width but is very sensitive to the IQ depth, for two reasons. First, fewer loads and stores are placed in the IQ, which means lower MLP. Second, address-generating instructions are blocked in the FIFOs due to the in-order nature of these queues: instructions that are critical to performance cannot issue until they reach the head of their FIFO. Figure 3.10 shows the effect of reducing the IQ size on the performance of FIFOrder. Based on this figure, halving the IQ depth in FIFOrder reduces performance by 20 percentage points; however, if FIFOrder were aware of the criticality of instructions and could issue instructions from anywhere in the FIFOs, the performance degradation would have been only 8%.


Figure 3.10. Effect of reducing the IQ size on the performance of FIFOrder. In the oracle version, critical instructions can be issued as soon as they are ready, regardless of their position in the FIFO. Although such a design is impossible to implement, it shows the potential of combining readiness and criticality.

Performance-critical instructions are distributed over the FIFOs in FIFOrder, as shown in Figure 3.11. Based on this figure, critical instructions are delayed in the AR@D and LDTail queues, which affects performance. In conclusion, a ready-aware scheduler alone is not effective in reducing both the IQ depth and width; a combination with criticality awareness is required.

Limits of FXA [24]
FXA (see Figure 1.1.C) consists of two pipelines: the IXU (a 3-stage in-order pipeline with execution units) and the OXU (the out-of-order execution unit). The IXU is embedded between the rename and dispatch stages of the out-of-order pipeline. In FXA, all instructions pass through the IXU in order, and it directly executes those whose operands become available within three cycles. This includes the R@D instructions, plus another 4% of total instructions.


Figure 3.11. Overlap between criticality and readiness in FIFOrder. Many of the critical instructions are placed in FIFO queues and are therefore delayed. Delaying them, due to the in-order nature of a FIFO, degrades performance and makes FIFOrder more sensitive to a smaller IQ size.

FXA does not consider instruction criticality. Ignoring criticality has performance and energy implications: many non-critical instructions (15% of total instructions) still pass through the filter and occupy space in the IQ. The insight is that we can delay placing non-critical instructions into the IQ to reduce IQ capacity pressure without hurting performance. By delaying insertion into the IQ (OXU), non-critical instructions are given a further opportunity to become ready, which turns them from candidates for delayed IQ placement into IQ bypass candidates, which are more efficient. Figure 3.12 shows that such a delay would significantly increase the number of Non-Critical & Non-Ready instructions that become ready, if FXA were aware of instruction criticality (see more discussion of this opportunity in Paper V [3]).

Criticality-Aware Instruction Scheduling
Ready-only-aware instruction scheduling ignores the criticality of instructions and misses some efficiency opportunities. However, a criticality-only-aware scheduler likewise ignores readiness. Long-Term Parking (LTP) [23] (see Figure 1.1.B) is one of the state-of-the-art criticality-aware microarchitectures. LTP classifies instructions in the decode stage based on their contribution to MLP. The critical instructions (the MLP-generating ones) are placed in the IQ directly, but non-critical instructions are placed in a separate "long-term parking" FIFO, which delays their insertion into the IQ. As Figure 3.9 shows, non-critical (Non-Critical & Ready + Non-Critical & Non-Ready) instructions constitute about 22% of the dynamic instruction stream. LTP delays the IQ placement of non-critical instructions by "parking" them in an energy-efficient FIFO queue (see Figure 3.13B). LTP can leverage the reduced IQ pressure to reduce the IQ depth by half with a minimal performance penalty [23]. Non-critical instructions are eventually inserted into the IQ when they reach the head of the parking FIFO, which would otherwise stall.
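LTP-style parking as described above can be sketched in a few lines. The structure names and the single-instruction unpark step are our own simplification, not the mechanism's actual implementation:

```python
from collections import deque

iq, parking = deque(), deque()

def ltp_dispatch(tag, critical):
    """Critical instructions go straight to the IQ; non-critical ones are
    parked in a cheap FIFO, delaying their IQ insertion."""
    (iq if critical else parking).append(tag)

def unpark():
    """Insert the head of the parking FIFO into the IQ (so the FIFO head
    does not block indefinitely)."""
    if parking:
        iq.append(parking.popleft())

ltp_dispatch("ld1", critical=True)    # MLP-generating: IQ directly
ltp_dispatch("add1", critical=False)  # parked first
unpark()
# iq now holds ld1 and add1; add1's IQ entry was merely allocated later
```

The point of the delay is that add1 occupies a cheap FIFO slot, rather than an expensive IQ entry, during the time it would otherwise have spent waiting in the IQ.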

Limits of LTP [23]
LTP places Critical & Ready instructions in the IQ, even though they could bypass it. Critical & Non-Ready instructions that become ready while waiting in the parking queue are also re-inserted into the IQ, when they too could bypass it. The ready instructions come from three sources: Critical & Ready and Non-Critical & Ready instructions account for almost 21% of total instructions based on Figure 3.9, and the third source, delayed instructions that become ready while parked (see Figure 3.12), accounts for another 11% of total instructions. Altogether, LTP could bypass IQ placement for about 32% of total instructions if it were aware of ready instructions.1

1We emphasize that LTP defined "ready" to mean instructions that do not depend on other long-latency instructions. Our definition is stricter in that ready instructions can be executed immediately, and they are easier to detect in hardware.

Figure 3.12. Readiness of non-critical instructions: before issue (blue), after the 3 cycles of the FXA IXU, and after being delayed. On average, 3% of instructions are Non-Critical & Ready and can bypass the IQ. 4% are Non-Critical & Non-Ready that become ready within 3 cycles, and 11% are Non-Critical & Non-Ready that become ready after being delayed. Combined, 18% of instructions become ready and could therefore bypass the IQ.

3.5 Potential of Combining Readiness and Criticality
Leveraging both criticality and readiness has the potential to increase the number of instructions that can be offloaded from the IQ without hurting performance. This is because only Critical & Non-Ready instructions require the sophisticated scheduling of the IQ, whereas the remaining instructions can either be delayed or bypassed. By taking advantage of both characteristics, only 61% of instructions (the Critical & Non-Ready ones) require immediate IQ allocation (Figure 3.9), compared to 78% (Critical & Ready + Critical & Non-Ready) when considering only criticality and 80% (Critical & Non-Ready + Non-Critical & Non-Ready) when considering only readiness. This drastic increase in offloadable instructions can be leveraged to reduce the IQ depth, which lowers the per-access and total energy consumption of the IQ and therefore of the scheduler. In addition, as ready instructions can be issued directly (bypassing the IQ) in parallel with instructions from the IQ, the IQ issue width itself can also be reduced. Reducing both dimensions of the IQ provides a significant power reduction, as shown in Figure 3.1.

Combining FXA and LTP
There are two naive approaches for adding readiness or criticality to existing criticality- or readiness-aware schedulers: add IQ bypassing of ready instructions to LTP's parking of non-critical ones (LTP+Bypass), or add delaying of non-critical instructions to FXA's filtering of ready ones (FXA+Delay, or, as we call it in this thesis, FXA+LTP). These approaches significantly improve the energy efficiency of the baseline designs by reducing the IQ activity, from

41   "+ &  "+ ( & ( "+ & ( "+ ( &  "+ &  ( "+ & 

       )"!$ *

          %#" %#" %#"  

    %#"

!" #"!  !" #"!  !" #"! %#" !" #"!  ! """ ' & '  " ' &  & !!'     #  "  # "  # " "

Figure 3.13. Instruction placement based on readiness/criticality for the four designs. Ready instructions are shown with solid lines and non-ready ones with dashed lines. Non-ready instructions may become ready after a delay and are handled differently across the designs. Note that instructions can only be issued from the head of the FIFO queues (dark gray areas in LTP, CR, and DLQ), while any instruction in the IQs/OXU can be issued.

74% to 44% of the baseline for LTP and from 53% to 46% of the baseline for FXA, as shown in Figure 3.14 (performance is discussed in Section ??). LTP+Bypass needs support for detecting R@D instructions as well as instructions that become ready after their delay. FXA+LTP (FXA+Delay), on the other hand, naturally executes R@D and ready-after-delay instructions in the IXU. However, this design has significant shortcomings, which are covered in the next section.

Limitations and Potentials of FXA + LTP
• Non-ready instructions in the IXU: In the FXA+LTP design, the 61% of instructions that are Critical & Non-Ready pass through the IXU, while fewer than 1% become ready during its 3 stages. (Overall, only 4% of non-ready instructions become ready during those 3 cycles; see Figure 3.12.) To address this, Critical & Non-Ready instructions should also bypass the IXU and be placed directly in the IQ.
• Ready instructions placed in the IQ: To minimize area overhead, the IXU has no floating-point units. As a result, all ready floating-point instructions must be placed in the IQ before they can be executed, increasing both capacity and port pressure in the IQ. The IXU is also inefficient at executing memory instructions, which means that many that are ready spend the energy of going through the IXU without being able to execute: the IXU only executes load/store instructions according to the outcome of the arbitration of the shared caches and memory ports between the IXU and the OXU (the out-of-order backend, i.e., the IQ). Memory instructions thus first try to execute in the IXU, and those not executed there are placed in the OXU, from which they can issue once the memory ports are not busy with the IXU.
• Area: FXA has separate functional units for the in-order IXU and the out-of-order OXU, which leads to underutilization, as many instructions are unable to execute in the IXU.
• Pipeline depth: The IXU increases the pipeline depth by three cycles, which potentially increases the branch misprediction penalty and delays the execution of instructions that require a functional unit present only in the OXU.
• Register file port pressure: FXA requires that operands be read for instruction execution in the IXU and then again for execution in the OXU if the IXU fails to execute them.

Figure 3.14. Instruction scheduling energy reductions for naïve extensions to LTP and FXA.

3.6 Solution II: DNB, Ready- and Criticality-Aware Instruction Scheduling
To address the aforementioned inefficiencies of the FXA+LTP design, we propose the Delay and Bypass (DNB) design. DNB bypasses the IQ for Critical & Ready instructions (reducing both IQ capacity and port pressure, as well as IQ reads/writes), delays Non-Critical & Non-Ready and Non-Critical & Ready instructions by placing them in a FIFO (reducing IQ capacity pressure), and bypasses the IQ for delayed instructions that are ready when they reach the head of the delay FIFO (reducing IQ reads/writes). In DNB, only Critical & Non-Ready instructions are placed in the expensive IQ, as these are the ones that are critical to performance as soon as they become ready. (See Figure 3.13-D and a comparison across all designs in Figure 3.15.)
Figure 3.16 provides an overview of the DNB architecture. DNB adds a FIFO queue for delaying instructions (the Delay Queue, DLQ), a FIFO queue for decoupling the back-end execution of Critical & Ready instructions from front-end fetch (the Critical-Ready Queue, CRQ), and a Critical Instruction Table (CIT) for Iterative Backwards Dependency Analysis (IBDA) [30] to determine instruction criticality. The components common to both LTP and DNB are shown in light gray and the additions in white.

Fetch, Decode, and Rename: The DNB front-end determines instruction readiness and criticality. As in LTP, after instruction fetch, the CIT is accessed with the PC to check whether the instruction is critical. In rename, operand availability is used to detect whether the instruction is ready. This incurs no overhead, as the register-file check is part of renaming.

Detecting Critical Instructions: As with LTP, DNB uses Iterative Backward Dependency Analysis (IBDA) [30] to iteratively identify chains of instructions that lead to MLP-generating instructions. In addition, we mark all loads as critical to avoid delaying them and hurting performance.

Instruction Dispatch: DNB places Critical & Non-Ready instructions directly into the IQ, for execution as soon as they are ready, and places all non-critical instructions into the DLQ to be delayed. Critical & Ready instructions bypass the IQ for immediate execution via the CRQ. The CRQ is needed because ready instructions may not be able to execute immediately due to structural hazards (lack of available functional units) and/or instruction priorities (newer ready instructions should not be prioritized over older ready instructions). The CRQ decouples the bypassing of ready instructions from the front-end to the back-end: it allows fetch to continue even if a ready instruction cannot be executed immediately, and it is much cheaper than inserting these instructions into the CAM-based IQ.
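The DNB dispatch rule reduces to a three-way routing decision. A minimal sketch, with the queue names following the text but the routing code being our own illustration:

```python
from collections import deque

iq, crq, dlq = deque(), deque(), deque()

def dnb_dispatch(tag, critical, ready):
    """Route one instruction per the DNB rule: only Critical & Non-Ready
    pay for the expensive IQ; Critical & Ready bypass via the CRQ; all
    non-critical instructions are delayed in the DLQ."""
    if critical and not ready:
        iq.append(tag)
    elif critical:          # critical and ready
        crq.append(tag)
    else:                   # non-critical, ready or not
        dlq.append(tag)

for tag, c, r in [("i1", True, True), ("i2", True, False),
                  ("i3", False, True), ("i4", False, False)]:
    dnb_dispatch(tag, c, r)
# iq: ['i2']; crq: ['i1']; dlq: ['i3', 'i4']
```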

Design  Technique        Aware of      Performance  Energy
OoO     (baseline)       —             100%         100%
LTP     delay (park)     criticality   91%          74%
FXA     filter (bypass)  readiness     89%          53%
DNB     delay & bypass   both          95%          34%

Figure 3.15. Comparison of the four designs. Limitations that increase energy: FXA is only able to handle non-ready instructions that become ready within 3 cycles (6% in total) and filters even non-ready instructions, costing energy and adding 3 cycles of delay; LTP inserts all parked instructions into both the LTP parking FIFO and the IQ; DNB inserts Non-Critical & Non-Ready instructions that do not become ready after their delay into both the delay queue and the IQ. (R: Ready, C: Critical, NR: Non-Ready, NC: Non-Critical).

Instruction Issue: The DNB back-end can select instructions to issue from the IQ, CRQ, and/or DLQ. The total issue width remains the same as in the out-of-order baseline but is distributed across the three sources: up to two instructions from the IQ and a combination of up to two instructions from the CRQ and DLQ, for a total issue width of four. All queues, including the IQ, apply an age-based issue policy; note that the CRQ and DLQ are inherently age-based due to their FIFO nature. Among the three queues, the DLQ has the highest priority, as its instructions are generally the oldest and might cause CPU stalls if they reach the head of the ROB without having issued. The IQ has the second priority, as it contains memory instructions. The CRQ has the lowest priority, as all its instructions are ready to execute and have a very short lifetime in this queue unless there are insufficient functional units available.
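The issue-width partitioning and priority order can be sketched as follows. Readiness checks at the queue heads are elided, and the exact arbitration here is our simplification of the policy the text describes:

```python
from collections import deque

def dnb_issue(dlq, iq, crq):
    """Select up to four instructions per cycle: at most two from the
    CRQ/DLQ pair (DLQ first, as its instructions are generally the
    oldest) and at most two from the IQ."""
    issued = []
    for _ in range(2):                 # CRQ+DLQ share an issue width of 2
        if dlq:
            issued.append(dlq.popleft())
        elif crq:
            issued.append(crq.popleft())
    issued += [iq.popleft() for _ in range(min(2, len(iq)))]
    return issued

picked = dnb_issue(deque(["d1"]), deque(["q1", "q2", "q3"]), deque(["c1"]))
# picked == ['d1', 'c1', 'q1', 'q2']; 'q3' waits for the next cycle
```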

  " 

#  !    $          !  !           #   ! 

 "    ! # CRQ , DLQ  !  Placement in  

 "!   ROB, IQ and LSQ,    ! "!  # #!      !                  

Figure 3.16. DNB microarchitecture. The Critical Instruction Table (CIT) uses IBDA to identify critical instructions, while the Rename Map Table is used to identify ready ones. Instructions are placed in the appropriate scheduling structure based on their readiness and criticality.

Evaluation
• Performance: Figure 3.17 shows that reducing the IQ size of our baseline out-of-order processor from 64 to 32 entries reduces performance to 84% (BaselineHalf), while the readiness/criticality-aware designs fare much better, at 91% (LTP), 89% (FXA), and 95% (DNB) of the baseline performance. DNB delivers 4 and 6 percentage points better performance than LTP and FXA, respectively, by offloading 32% of the instructions, vs. 22% and 24% for LTP and FXA. This is quite close to the maximum potential of offloading 39% of instructions, given that 61% are Critical & Non-Ready and should go directly into the IQ.
• Energy: The energy-saving potential of these designs comes from two sources: reducing the energy cost of each IQ access by making the IQ itself smaller and narrower, and avoiding inserting instructions into the IQ via bypassing. The three designs have different abilities to achieve these, as outlined in Figure 3.15.

  "!#  "!#  "!! "!!  ( ' & % $ # "

 %## !        -1/ ! $ ' (- (- "' -1/   -1/!!!   "   . "#!   (-#" #  (-"&" ! " ! - ! #& " $  ,.&,. " $  !  ! &    !&!0+  %    !&   #"# ) 

Figure 3.17. IPC comparison between all designs, normalized to the baseline.


Figure 3.18. Instruction scheduling energy reductions. (Not shown: the naïve designs achieve 44%/46%; see Figure 3.14.)

3.7 Conclusion
We explored how to reduce the energy cost of out-of-order instruction scheduling while maintaining performance. We observed that delaying the IQ placement of instructions helps reduce the IQ depth, while bypassing the IQ is effective in reducing its width. Our final approach is to both reduce the width and depth of the IQ and avoid inserting instructions into the IQ that do not benefit from its expensive scheduling. To accomplish this, we apply two complementary approaches to scheduling instructions: delaying, to reduce IQ pressure and allow only important instructions to access the expensive IQ, and bypassing, to reduce IQ pressure and avoid the cost of accessing the IQ altogether. However, to maintain performance while reducing the IQ width and depth, the right instructions must be delayed and/or bypassed. To achieve this, we classify instructions based on their criticality and readiness: ready instructions neither need nor benefit from the IQ, and so should bypass it, while non-critical instructions are likely to block the IQ for a long time with little benefit, and so should be delayed to make space for more important ones. We further observe that delayed instructions may become ready during their delay, providing yet another opportunity for bypassing.
While these classes have been used separately to reduce scheduling costs previously, we demonstrate that combining them leads to significantly better performance, lower energy, and reduced sensitivity to IQ depth and width.

4. High-performance resource deallocation: early release and out-of-order commit

Dynamically-scheduled superscalar processors execute instructions out of program order but commit them in order, to present to the programmer the illusion that instructions execute atomically and sequentially as intended by the program. In this context, precise interrupts are also easily provided, as the processor verifies correct execution before each instruction is committed [26]. The disadvantage of in-order commit is that an instruction ties up the out-of-order resources (ROB, LSQ, and physical registers) for much longer than is necessary for correct execution: it can only release them once it is the oldest instruction, i.e., has reached the head of the ROB. This means that execution halts when any of these resources is fully occupied. Out-of-order commit (OOC) is a solution to this hurdle. Figure 4.1 shows this with an example. The commit stage of an in-order-commit processor is blocked by an unresolved instruction at the head of the ROB. A four-wide superscalar, shown in Figure 4.1-a, tries to commit four instructions per cycle. While the first instruction can commit, the second-oldest instruction causes the commit stage to block, and instead of four, only one instruction commits (in order). Figure 4.1-b, however, shows that OOC can bypass the blocking instruction(s) and commit two more instructions out of order.
In-order commit experiences many stalls, which can be avoided by OOC thanks to better utilization of the commit stage. Figure 4.2 shows the distribution of the number of committed instructions per cycle for three different microarchitectures, comparing in-order and out-of-order commit across all SPEC CPU2006 benchmarks. The three microarchitectures were selected to see where OOC is more effective. Intel Haswell (HSW) is a 4-way superscalar and is supposed to commit 4 instructions per cycle; however, based on this figure, this happens less than 20% of the time on average over the SPEC CPU2006 benchmarks.
We see a large number of commit stalls (zero instructions committed per cycle). With out-of-order commit, the distribution shifts toward four instructions per cycle, reflecting the improved commit performance (and the resulting improvement in overall performance). Considering the size of the out-of-order resources, this figure shows that for smaller, less aggressive cores (with fewer entries in the resources, such as Intel Silvermont (SLM); see Table 4.1), OOC improves performance significantly. The incentive for pursuing OOC lies in the promise of higher performance with fewer resources. A turning point in our understanding of out-of-order

Figure 4.1. Conceptual comparison of in-order and out-of-order commit (commit depth=4). C: Ready to commit, X: Not ready to commit, E: Executing, I: Issued, O: Empty entry.

commit came with the work of Bell and Lipasti [5], who articulated the limiting factors for committing an instruction. The necessary conditions to allow an instruction to be committed range from the completion status of the instruction itself to the branch prediction and exception state of intervening instructions. Several proposals for out-of-order commit, implicitly or explicitly, abide by these conditions, providing cost-efficient means to enforce them.

Table 4.1. Microarchitecture configuration with reorder buffer (ROB), instruction queue (IQ), load and store queues (LQ/SQ) and integer and floating-point register file (RF) details. Dispatch width (D), commit width (CW) and out-of-order commit depth (CD) are the same across configurations to enable a fair comparison (SLM hardware has a D/CW of 2). Register values are additional physical registers above architectural state.

Microarchitecture       D/CW/CD   ROB   IQ   LQ/SQ   RF(INT,FP)
Silvermont-like (SLM)   4/4/8      32   32   10/16    32,32
Nehalem-like (NHM)      4/4/8     128   56   48/36    68,68
Haswell-like (HSW)      4/4/8     192   60   72/42   130,130


Figure 4.2. Commit bandwidth distribution for the SPEC CPU2006 benchmarks of a 4-wide core with in-order commit and out-of-order commit respecting all commit conditions (Safe_OOC). Out-of-order commit increases commit pressure even without aggressive speculation.

4.1 Out-of-order commit conditions
The necessary conditions to allow an instruction to be committed out-of-order are:
1. The instruction is completed. Instructions can commit only after their completion.
2. The instruction is not involved in memory replay traps. This condition simply says that we cannot commit speculative loads or their dependent instructions. It relates to unresolved memory dependencies or memory consistency enforcement. For example, total store order (TSO) requires a replay of speculative loads that violate load→load ordering when this reordering is detected by other cores.
3. Register WAR hazards are resolved (i.e., a write to a particular architectural register cannot be permitted to commit before all prior reads of that register have completed).
4. Previous branches are successfully predicted. This condition simply says that we can commit only while on the correct path of execution.
5. No prior instruction in program order is going to raise an exception. This condition provides precise interrupts and is essential in easing the handling of, e.g., page faults.
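The five conditions above can be summarized as a single eligibility predicate. The sketch below is illustrative only: the `Instr` fields and the `window` list are hypothetical simplifications of real scheduler state (not the thesis's simulator), and only the store→load sub-case of condition 2 is modeled.

```python
# Illustrative sketch of the five out-of-order commit conditions.
# All names (Instr, window) are hypothetical, not from the thesis.
from dataclasses import dataclass, field

@dataclass
class Instr:
    completed: bool = False           # condition 1
    is_load: bool = False
    is_store: bool = False
    addr_resolved: bool = True        # stores: address known?
    is_branch: bool = False
    resolved: bool = True             # branches: outcome known?
    may_except: bool = False          # could still raise an exception
    reads: set = field(default_factory=set)   # architectural regs read
    writes: set = field(default_factory=set)  # architectural regs written

def can_commit_ooo(i: int, window: list) -> bool:
    """True iff window[i] may commit ahead of the older entries window[:i]."""
    instr = window[i]
    if not instr.completed:                                  # 1. completion
        return False
    for older in window[:i]:
        if instr.is_load and older.is_store and not older.addr_resolved:
            return False                                     # 2. memory replay traps
        if instr.writes & older.reads and not older.completed:
            return False                                     # 3. WAR hazards
        if older.is_branch and not older.resolved:
            return False                                     # 4. unresolved branches
        if older.may_except:
            return False                                     # 5. possible exceptions
    return True
```

The sections that follow relax each of these checks individually; each relaxation removes one `return False` branch at the cost of extra recovery support.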

4.2 Contribution I: Relaxing Out-of-Order Commit Conditions
Our first contribution is a study of how to relax the commit conditions. Relaxing or evading the out-of-order commit conditions potentially improves performance. Relaxing does not mean ignoring or avoiding; it means being less conservative than in-order commit while still providing correct execution.

Due to this importance, in this section the thesis examines the possibilities of relaxing each of the conditions, together with the relevant circumstances.
• Instruction complete. The core waits for an instruction to finish executing before commit can occur. We do not examine early commit of loads [12, 13] that miss in the cache; instead, we consider them available for commit only after the data returns and is bound to the destination register.
• Memory replay traps (safe_ST and safe_LD). We describe two sub-cases for this condition:
  – Store-Load (safe_ST): This condition applies to same-thread memory dependencies involving a store and a load. In particular, we cannot commit a load out-of-order in the presence of a prior store with an unresolved address: if the store and the load prove to be dependent (the load should have taken the value of the store), the commit would have been incorrect. The safe_ST condition disallows the commit of a load and its dependent instructions until all prior stores resolve their addresses and all memory dependencies are correctly enforced. By relaxing this condition, we can commit loads and their dependent instructions even if prior non-aliasing stores have unresolved addresses.
  – Load-Load (safe_LD): This concerns memory consistency models that enforce load→load ordering (e.g., Sequential Consistency or TSO). Under this ordering constraint, it is possible to allow loads out-of-order as long as this is not observed in the memory system. The safe_LD condition disallows the out-of-order commit of loads unless it is guaranteed that the correct order will be observed by the memory system. To relax this condition we allow load→load re-orderings that are not observed by other cores. A very specific case would be a memory-mapped IO (MMIO) request that might change the order of memory operations; the MMIO device effectively acts as another observer, as in a multi-processor system. We ignore memory requests from other cores (IO/Coprocessor).
• WAR hazards. WAR hazards are already handled by the out-of-order core within the ROB, and we assume a solution such as the Value Buffer [19] for committing out-of-order. Thus, we do not consider this condition further.
• Unresolved branches (safe_BR). This condition guarantees that we commit only from the correct path of execution: out-of-order commit should not proceed past unresolved branches until they are correctly resolved. We can relax this condition and commit past an unresolved branch if we can undo the commit. To evaluate the maximum performance potential, we assume a zero rollback cost for out-of-order commit mispredictions; however, the normal branch misprediction cost (10 cycles) is faithfully accounted for.

• Exceptions (safe_EXC). This condition caters to precise interrupts. Enforcing it requires that we do not commit past an instruction (floating-point, memory access, or any instruction that may cause an exception) unless we make sure that the instruction will not cause an exception. To relax the safe_EXC condition, we assume the code regions are exception-free.

4.3 Contribution II: Category/taxonomy of out-of-order commit
Our second contribution, covered in Paper I [1], is categorizing out-of-order commit based on aggressiveness and safety. Aggressiveness concerns how frequently the processor commits out-of-order: always (Aggressive) or only just before a commit stall (Reluctant). Safety concerns whether commit conditions are evaded: none evaded (Safe) or at least one evaded (Unsafe).
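As a minimal illustration, the two axes and the four resulting category names can be written down directly. The enum value strings paraphrase the definitions above; only the category names themselves come from the papers.

```python
# The taxonomy's two axes and four combined categories (illustrative).
from enum import Enum

class Safety(Enum):
    SAFE = "no commit condition evaded"
    UNSAFE = "at least one commit condition evaded"

class Aggressiveness(Enum):
    RELUCTANT = "commit out-of-order only just before a stall"
    AGGRESSIVE = "commit out-of-order every cycle"

def category(safety: Safety, mode: Aggressiveness) -> str:
    """Map a (safety, aggressiveness) pair to its category name."""
    prefix = "Safe" if safety is Safety.SAFE else "Unsafe"
    suffix = "ROOC" if mode is Aggressiveness.RELUCTANT else "AOOC"
    return f"{prefix}_{suffix}"
```

The four resulting combinations (Safe_ROOC, Safe_AOOC, Unsafe_ROOC, Unsafe_AOOC) are exactly the configurations evaluated in Section 4.4.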

4.3.1 Safe_OOC
In Safe_OOC, all out-of-order commit conditions are preserved. This case provides the minimum potential performance improvement of out-of-order commit, but also requires the minimum hardware to implement, as it does not rely on speculation and rollback beyond what is already available in the out-of-order core.

4.3.2 Unsafe_OOC
In Unsafe_OOC, one or more (or all) of the out-of-order commit conditions are evaded (apart from true dependencies). Doing so achieves the maximum potential performance improvement of out-of-order commit, but it may require extra support for speculation and rollback to revert changes to the architectural state that are found to be incorrect after commit. Aside from the limiting conditions described above, a separate dimension is the aggressiveness of committing out-of-order: essentially, how frequently OOC is engaged, always or only sometimes. Thus, concerning the mechanics of out-of-order commit, we distinguish two versions: Reluctant OOC (ROOC) and Aggressive OOC (AOOC).

4.3.3 Reluctant
In reluctant out-of-order commit (ROOC), the out-of-order commit mechanisms are engaged only when needed to avoid impending stalls.

Figure 4.3. Functionality of AOOC and ROOC. In this example AOOC commits one more instruction than ROOC (commit depth=4). C: Ready to commit, X: Not ready to commit, E: Executing, I: Issued, O: Empty entry.

In other words, we engage reluctant out-of-order commit only when one of the critical resources (ROB entries, registers, load-store queue entries) is all but exhausted and cannot support the fetching of new instructions in the front end of the pipeline. As such, reluctant out-of-order commit acts as a safety valve that releases the pressure on resources (just before this pressure reaches a critical point), rather than aggressively trying to keep resource pressure low. ROOC could therefore lead to more efficient hardware implementations. In more detail, it aims to find the minimum number of instructions that need to commit (out-of-order) so that no resource is exhausted and the front end can continue to issue instructions at its peak bandwidth. The reason for seeking the minimum number of instructions to commit out-of-order is that this minimizes the perturbation of instruction order. Figure 4.3 compares ROOC with AOOC functionality. While Figure 4.3 only considers the ROB, in the general case all out-of-order resources are treated in the same way. The goal here is to have four empty entries for the next cycle(s). In Figure 4.3-a, the first instruction is committed in-order, which frees one entry for the next cycle. The ROB already contains two free entries. To provide another free entry, only a single instruction has to be committed out-of-order by ROOC.
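The safety-valve behavior above can be sketched as a small calculation: commit out-of-order only the minimum number of instructions needed so that the front end can dispatch at full width next cycle. The function and parameter names are made up for illustration.

```python
# ROOC "safety valve" (illustrative sketch): find the minimum number of
# out-of-order commits needed to keep the front end dispatching.

def rooc_commits_needed(free_entries: int, inorder_committable: int,
                        dispatch_width: int) -> int:
    """Entries freed in-order plus already-free entries may not cover the
    next cycle's dispatch; commit just enough extra instructions OOO."""
    free_after_inorder = free_entries + inorder_committable
    return max(0, dispatch_width - free_after_inorder)
```

In the Figure 4.3-a scenario (two free ROB entries, one in-order commit, a four-wide front end), this yields a single out-of-order commit, matching the example in the text.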

4.3.4 Aggressive
In aggressive out-of-order commit (AOOC), the out-of-order commit mechanisms are continuously active, looking for opportunities to commit instructions as early as possible. AOOC attempts to find up to commit-width instructions so that it can commit at the highest possible bandwidth. In our example in Figure 4.3-b, it aims to make four free entries for the next cycle's instructions. AOOC commits three instructions, of which two are out-of-order. In contrast to ROOC, aggressive out-of-order commit releases resources more eagerly, but disregards the following issues:
• It might prove wasteful, as traditional in-order commit may still be able to provide sufficient resources for forward progress;
• It may be futile, as the chance of encountering an instruction that restricts further commit (e.g., an unresolved branch) tends to increase with aggressiveness (see Section 4.3.5 on the depth of out-of-order commit);
• It creates a significant management problem, as out-of-order commit can create more gaps (compared to ROOC) in several structures, including the ROB as well as the load queue and store queue (which is not completely addressed in prior works [19, 20, 5]).

4.3.5 Commit Width and Depth
In addition to the number of instructions that can be committed each cycle (commit width), OOC must define how many instructions it scans (commit depth) to find committable instructions. While the commit width is the number of instructions that can be committed simultaneously per cycle, the commit depth is a measure of how far (from the oldest to the youngest instruction) the core can scan looking for instructions to commit out-of-order in a given cycle.
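Width and depth act as two independent cut-offs on the same scan. A hypothetical sketch, where `ready` stands in for the full set of commit conditions:

```python
# Illustrative depth- and width-limited commit scan from the ROB head.

def scan_and_commit(rob: list, width: int, depth: int, ready) -> list:
    """Examine at most `depth` entries from the head (oldest first) and
    commit at most `width` of those that satisfy `ready`."""
    committed = []
    for instr in rob[:depth]:        # commit depth: how far we may look
        if len(committed) == width:  # commit width: per-cycle limit
            break
        if ready(instr):
            committed.append(instr)
    return committed
```

With a ROB such as `["C", "X", "C", "C", "X", "C"]` (C = ready to commit, as in Figure 4.1), a width of 4 and depth of 4 commits three instructions, while a depth of 2 commits only the head entry.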

4.4 Performance evaluation
We introduced aggressiveness (aggressive and reluctant) and safety (safe and unsafe) for out-of-order commit; combining them yields four categories. In this section, we first evaluate the overall effect of each category and then explore the impact of each of the commit conditions separately.

Simulation setup
To see how relaxing OOC affects performance, we simulate small, medium, and large cores, similar to Intel's Silvermont (SLM), Nehalem (NHM), and Haswell (HSW) respectively (see Table 4.1 for details). As an overview, Figure 4.4 shows the performance improvement for each microarchitecture assuming Safe_OOC (all conditions respected), averaged over all benchmarks. We can see that with a narrow commit depth (four in this figure), a small out-of-order processor (SLM) has more potential for relative improvement than the medium and large, aggressive microarchitectures. In medium and large aggressive cores, thanks to a larger ROB and other hardware resources, more intrinsic ILP is extracted by traditional in-order commit, leaving less potential for out-of-order commit with a narrow commit depth. The reason is that


Figure 4.4. Performance comparison (harmonic mean of IPCs across SPEC CPU2006) of in-order commit and two types of safe out-of-order commit, reluctant (ROOC) and aggressive (AOOC), with a commit depth of 4 and of ROB size, for increasingly aggressive microarchitectures. These experiments respect all traditional commit conditions, and show that aggressive out-of-order commit can reach the performance of the next class of processor. SLM: Intel Silvermont, NHM: Intel Nehalem, HSW: Intel Haswell.

the smaller processor (with a smaller dynamic window size and, given a balanced design, smaller hardware structures) will more likely stall, as it exposes a smaller amount of the potential ILP in an application. For a larger commit depth, the more aggressive cores (NHM and HSW) have higher potential performance improvement: out-of-order commit frees the processor from the traditional limits, reducing the number of times the processor experiences exhausted resources.

4.4.1 Performance evaluation based on the taxonomy
The minimum and maximum performance improvements are provided by Safe_OOC and Unsafe_OOC, respectively. In Figure 4.5, we show the improvement provided by both safe and unsafe out-of-order commit, for both aggressive and reluctant modes, for all three microarchitectures.

• Safe_AOOC. When Safe_AOOC (safe and aggressive) is evaluated, the size of the processor is the key factor for achieving performance. Based on Figure 4.5a, the processors with more entries in their out-of-order resources, in this case NHM and HSW, benefit more from AOOC.

• Safe_ROOC. For Safe_ROOC (safe and reluctant), the performance improvement is lower for all three microarchitectures compared to AOOC, yet the trend is the same: the bigger the out-of-order resources, the more effective ROOC is.

Unsafe_OOC
By relaxing all conditions, Unsafe_OOC provides the maximum potential for performance improvement. To understand the effectiveness of all conditions together, we assume zero recovery cost for any mis-speculated out-of-order commit condition. In real implementations, Unsafe_OOC will require recovery mechanisms for these techniques, which can reduce the performance potential because of recovery costs.

• Unsafe_AOOC. This technique provides the highest performance improvement for all three architectures. As before, the bigger the out-of-order scheduler, the more benefit Unsafe_AOOC provides.

• Unsafe_ROOC. Reluctant out-of-order commit provides less performance improvement because it is not continuously looking to commit additional instructions. Among the three microarchitectures, the limited SLM benefits the most from ROOC because of the large number of stalls seen by this core. Therefore ROOC, especially Unsafe_ROOC, is an interesting methodology for improving the performance of relatively small but energy-efficient CPUs, as we see a relatively high performance improvement for a less aggressive commit implementation.

4.4.2 Performance evaluation based on the OOC conditions
In the previous section, by analyzing safe and unsafe out-of-order commit, we observed a large gap between the performance improvements of these two implementations. Understanding the cause of this gap, by looking at individual commit conditions in isolation, allows us to better understand where to focus future hardware efforts.


Figure 4.5. IPC improvement of safe and unsafe out-of-order commit relative to in-order commit as a baseline, for both reluctant and aggressive versions, applied to the SPEC CPU2006 benchmarks on three microarchitectures.

Positive Contribution of Out-of-Order Commit Conditions
To study the gap between safe and unsafe out-of-order commit (Figure 4.5), we analyze the effect of relaxing each condition in the presence of the other, preserved conditions in Figure 4.6.¹ We analyze the SLM microarchitecture in detail and provide averages across all microarchitectures for both AOOC and ROOC. Each out-of-order commit condition is analyzed in isolation, and we consider Unsafe_OOC (all conditions relaxed) as the 100% potential improvement budget. In the case of the mcf benchmark in Figure 4.5a, the safe and unsafe OOC performance improvements are 33% and 71%, respectively (46% of the potential improvement budget is provided by Safe_OOC). We also observe that by relaxing the LD condition (unsafe_LD), 52% of the potential improvement budget is achievable (see Figure 4.6a). In Figure 4.6, we can see for some applications (like namd in AOOC mode and leslie3d in ROOC mode) that relaxing just a single condition is not sufficient to fill the gap between safe and unsafe OOC. This does not mean that a single condition is unimportant, but rather that other, preserved conditions are preventing out-of-order commit from achieving its full potential.
• AOOC. We observe that for most of the applications, Unsafe_BR and Unsafe_LD are the conditions that affect performance the most (Figure 4.6a). Additionally, the more aggressive the core, the more important the Unsafe_LD condition becomes. In the SLM, NHM and HSW CPUs, Unsafe_LD fills 4%, 10%, and 12%, and Unsafe_BR fills 9%, 8% and 7%, respectively, of the gap between safe and unsafe OOC.
Unsafe_ST is not very effective because of the rarity of this condition and because the conservative memory dependence predictor used does not provide flexibility in issuing memory instructions out-of-order, let alone committing them out-of-order. Unsafe_EXC, i.e., relaxing exceptions, is not very effective either, because exceptions are very rare, especially in the integer benchmarks.
• ROOC. ROOC is less effective in reducing the gap between safe and unsafe OOC because of the nature of this on-demand OOC mode, which is engaged and needed more in SLM and much less in NHM and HSW (see Figure 4.5b). Indeed, if there are enough empty entries in the resources, ROOC does not engage, regardless of the safe or unsafe mode of instruction commit.

¹ See Paper II for the negative effect of out-of-order commit conditions.


Figure 4.6. Contribution of safe and selectively unsafe out-of-order commit on three different microarchitectures. Unsafe_XX is equivalent to activating (enforcing) all out-of-order commit conditions except XX (the XX condition is relaxed). By relaxing the specific XX condition, the dependence between the other conditions is also observed.

4.5 Out-of-order Commit and Memory Level Parallelism (MLP)
Overlapping cache misses to service them in parallel, in particular long-latency accesses to DRAM, can deliver significant performance benefits [9]. This memory level parallelism (MLP) is typically achieved through the use of multiple Miss Status Holding Registers (MSHRs) [17], which track outstanding memory requests and allow them to execute in parallel. In this section, we compare in-order commit and out-of-order commit in terms of memory parallelism (both to DRAM (MLP) and within the memory hierarchy (MHP) [8]) by changing the number of L1 MSHRs and observing the effect on performance. To explore these effects, we select three applications, shown in Figure 4.7, that are highly memory-bound (mcf), moderately memory-bound (lbm), and largely not memory-bound (gcc) [16]. One key observation is that out-of-order commit, in both reluctant and aggressive modes, is significantly better than in-order commit at exploiting intrinsic application memory parallelism. Figure 4.7 shows that the gap between in-order and out-of-order commit is much larger in the case of HSW, which means that the more aggressive the microarchitecture, the more MLP is exposed by out-of-order instruction commit. Overall, out-of-order commit outperforms in-order commit by exposing additional memory parallelism (both to DRAM and in the hierarchy). In summary, out-of-order commit provides more MLP benefit for more aggressive microarchitectures and more memory-bound applications.
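A back-of-the-envelope model shows why the MSHR count matters: with n MSHRs, at most n long-latency misses can overlap, so the misses are serviced in ceil(misses/MSHRs) rounds. The function name and the numbers below are invented for illustration and do not come from the thesis's simulations.

```python
# Toy MLP model (illustrative): MSHRs bound how many misses overlap.
import math

def miss_service_cycles(n_misses: int, n_mshrs: int, dram_latency: int) -> int:
    """Idealized cycles to service n_misses DRAM accesses when at most
    n_mshrs may be outstanding simultaneously (no queueing modeled)."""
    rounds = math.ceil(n_misses / n_mshrs)
    return rounds * dram_latency
```

For example, 8 misses at a (made-up) 200-cycle DRAM latency take 1600 cycles with one MSHR but only 400 with four, provided the core's window is large enough to expose the misses in the first place, which is exactly where out-of-order commit helps.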

4.6 Early release vs. out-of-order commit
Register renaming avoids false dependencies through the register file, thereby increasing ILP. It does this by translating architectural registers to a larger set of physical registers. This renaming is done early in the pipeline, and the physical registers are not released until the commit stage. This release may happen long after the last consumer reads the data. As a result, physical register entries may be kept alive (occupied) for longer than the dependencies alone require, and new instructions might not be able to rename, which causes front-end stalls. To address this resource constraint, techniques such as delayed allocation and Early Release of the Physical Register (ERPR) [21] have been developed. In this section, we compare ERPR with the OOC taxonomy [2], since both are solutions for releasing resources earlier. OOC and ERPR are similar in that they release physical registers as early as possible. Aggressive OOC outperforms ERPR in general because it releases ROB and LSQ entries earlier, as well as physical registers. However, neither ERPR nor OOC can release IQ

Table 4.2. The contribution of an exhausted (full) register file (RF) and reorder buffer (ROB) to CPU stalls. Other reasons for CPU stalls can be the instruction queue (IQ), the load-store queue (LSQ), instruction cache misses, or not having a free functional unit to allocate to the ready-to-dispatch instructions.

Microarchitecture   RF% (min/avg/max)   ROB% (min/avg/max)   Other% (min/avg/max)
SLM                 6 / 59 / 86         13 / 36 / 92         0 / 5 / 27
NHM                 2 / 68 / 91          8 / 30 / 98         0 / 2 / 7
HSW                 2 / 69 / 91          8 / 29 / 98         0 / 2 / 7

entries earlier than usual, because both of them only address the processor state after the issue stage of the pipeline, when the instructions have left the IQ.
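The contrast between conventional release and ERPR-style early release can be sketched as two predicates over a physical register's state. This is a deliberately simplified model with invented field names; real early release needs recovery support such as the Value Buffer mentioned earlier.

```python
# Illustrative contrast: when may a physical register be freed?
from dataclasses import dataclass

@dataclass
class PhysReg:
    pending_readers: int               # consumers that have not yet read the value
    redefiner_committed: bool = False  # has the overwriting instruction committed?

def release_conventional(r: PhysReg) -> bool:
    # Freed only when the instruction that redefines the architectural
    # register commits -- possibly long after the last consumer read it.
    return r.redefiner_committed

def release_early(r: PhysReg) -> bool:
    # ERPR-style: freed as soon as no consumer can still read the value,
    # assuming rollback support (e.g., a value buffer) exists.
    return r.pending_readers == 0
```

In this model a register with no pending readers is releasable early but not conventionally, which is exactly the occupancy gap that both ERPR and out-of-order commit attack; neither predicate involves the IQ, matching the observation above.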

Comparing AOOC, ROOC and ERPR based on the Out-of-Order Commit Conditions Analysis
The overall trend in Figure 4.8 shows that AOOC outperforms ROOC and ERPR in all configurations. This is not surprising, as AOOC is always enabled, while ROOC is enabled only when a resource is exhausted. ERPR is essentially a subset of AOOC that only focuses on the register file (RF). Therefore, the higher the program's sensitivity to the size of the RF, the higher the effect of ERPR. To better understand this behavior, we analyzed the reasons for CPU stalls across the different microarchitectures (Table 4.2). The table shows that increasing the effective size of the RF (e.g., what ERPR does conceptually) is less effective in SLM than in NHM and HSW. Based on this table, in SLM the CPU stalls are more often due to other resources, such as the ROB size. As a result, since in the SLM architecture the size of the RF contributes less to CPU stalls, ROOC outperforms ERPR in terms of performance improvement: ROOC can virtually extend the effective size of the ROB and LSQ as well as the RF, whereas ERPR only does this for the RF. The RF in the NHM and HSW architectures contributes more to CPU stalls than in the SLM architecture; therefore, ERPR outperforms ROOC in these two architectures.

A more detailed analysis of Figure 4.8, specifically based on commit depth, can be found in Paper III; Figure 4.9, however, uses this analysis to rank the relative importance of each out-of-order commit condition for the three architectures. Overall, relaxing the load→load condition is the most promising for performance improvement. This information provides a guideline for which areas to work on to achieve the best performance improvement, depending on the baseline out-of-order execution.

4.7 Conclusion
To obtain higher performance, extending the reach of the processor core has been a focus of much microarchitecture research. One promising direction is out-of-order commit, which releases precious processor resources early to allow the processor to extend its reach past typical hardware limits. In this thesis, we presented a limit study for out-of-order commit through the introduction of reluctant and aggressive out-of-order commit modes. We show how smaller processors, even with a limited commit scan depth, can benefit from out-of-order commit strategies, but that larger, aggressive cores require deeper commit scan depths to achieve improved performance. In addition, we provide a detailed breakdown of the contributions of each out-of-order commit condition for the SPEC CPU2006 benchmark suite, and compare against similar works such as early release of physical registers. Our results show a very high potential for performance improvement, above 2.25x for some benchmarks, and we believe that out-of-order commit strategies can play an important role in future energy-efficient and high-performance processor designs.

Figure 4.7. Memory hierarchy parallelism comparison between in-order commit and safe and unsafe out-of-order commit for MSHR sizes of 1 to 10. The results are normalized to in-order commit with an MSHR size of one. Sub-graphs (a), (b) and (c) are based on the harmonic mean across SPEC CPU2006. The mcf, lbm and gcc benchmarks are representative of the high, medium and low memory-boundedness categories, respectively.

Figure 4.8. Comparison between AOOC, ROOC and ERPR for different commit depths and microarchitecture setups in six different modes. Unsafe_XX is equivalent to activating (enforcing) all out-of-order commit conditions except condition XX. By relaxing the specific XX condition, the dependence between the other conditions is also observed.

Figure 4.9. Ranking of the benefits of the out-of-order commit conditions for different microarchitectures, based on the depth of commit.

5. Summary

In this thesis, we tackled the pipeline stalls of an out-of-order processor. For the front-end, we proposed techniques to delay and skip inserting instructions into the IQ. We did this by characterizing instructions based on their readiness and criticality. As a result, we reduced the capacity (depth) and issue (width) pressure without affecting performance, which enabled us to significantly improve overall energy efficiency. For the back-end, we explored the factors and conditions limiting the out-of-order release of pipeline resources at the commit stage. This provided insight into when and how these conditions should be evaded or preserved, resulting in a ranking of the conditions based on their effect on performance, in addition to showing how out-of-order commit reduces the occupancy pressure on the out-of-order resources.

Paper I: A Category for Out-of-Order Commit
Problem: When to commit out-of-order.
Solution: Introducing a category for out-of-order commit.
Results: The bigger the out-of-order scheduler size, the more aggressive the out-of-order commit needs to be to achieve higher performance.

Paper II: Exploiting the Potential of Out-of-order Commit Conditions
Problem: What is the potential of out-of-order commit?
Solution: Analyzing the performance effect of evading out-of-order commit conditions.
Results: Memory and branch conditions are the most effective conditions.

Paper III: Out-of-order Commit, MLP and Early-Release of Resources
Problem: Limits of out-of-order commit.
Solution: Categorizing the applications based on MLP sensitivity and comparing different techniques to release the resources earlier.
Results: Out-of-order commit is in synergy with MLP. Out-of-order commit outperforms early release of resources.

Paper IV: FIFOrder, Ready-Aware Instruction Scheduling
Problem: Inefficient scheduling.
Solution: Some instructions do not need out-of-order scheduling and can therefore bypass the IQ.
Results: The IQ is bypassed by the majority of instructions. IQ width is reduced without hurting performance.

Paper V: DNB, Ready- and Criticality-Aware Instruction Scheduling
Problem: Inefficient scheduling; ready-aware scheduling is not effective in reducing IQ depth.
Solution: Categorizing instructions based on readiness and criticality so they can bypass the IQ.
Results: Not only IQ width but also IQ depth is reduced without hurting performance.

6. Svensk Sammanfattning (Swedish Summary)

Oordnad exekvering är en av de främsta mikroarkitekturella teknikerna som används för att förbättra prestandan hos både enkel- och flertrådade proces- sorer. Applikationen av den här typen av processorer sträcker sig från mobil- till serverprocessorer. Oordnad exekvering når högre prestanda genom att hitta fristående instruktioner och gömmer exekveringslatens genom att använda cykler som annars skulle vara bortkastade eller orsaka en CPU-fördröjning. För att uppnå detta använder oordnad exekvering schemaläggningsresurser för att lagra och prioritera instruktioner, inklusive ROB, IQ, LSQ och fysiska reg- ister. En typisk pipeline hos en oordnad processor har tre makrosteg: framsi- dan, schemaläggaren och baksidan. Framsidan hämtar instruktioner, placerar dem i oordningsresurserna och analyserar dem för att förbereda deras exekver- ing. Schemaläggaren identifierar vilka instruktioner som är redo för exekver- ing och prioriterar dem för schemaläggning. Baksidan uppdaterar processor- tillståndet med resultatet från de äldsta avklarade instruktionerna, avallokerar resurserna och slutför instruktionerna i programordning för att bibehålla kor- rekt exekvering. Oordnad exekvering måste ha möjligheten att fritt välja bland tillgängliga instruktioner att exekvera. Schemaläggningsresurserna måste därför ha kom- plexa kretsar för att identifiera och prioritera instruktioner, vilket gör dem väldigt dyra och därför begränsade. På grund av deras kostnad är schemaläg- gningsresurserna begränsade i storlek. Denna begränsade storlek leder till två fördröjningspunkter, en i framsidan och en i baksidan av pipelinen. Framsidan kan fördröjas på grund av att resurser är fullt allokerade och att inga fler in- struktioner som följd kan läggas i schemaläggaren. Baksidan kan fördröjas på grund av att exekveringen av en instruktion längst fram i ROB:n är oavslutad, vilket förhindrar andra resurser från att bli avallokerade som i sin tur gör att nya instruktioner inte kan läggas in. 
To address these sources of stalls, this thesis focuses on reducing the time instructions occupy the scheduling resources. Our front-end technique targets the pressure on the IQ, while our back-end solution focuses on the remaining resources. To reduce front-end stalls, we relieve the pressure on the IQ for both storing (depth) and issuing (width) instructions by steering instructions directly to cheaper storage structures. To reduce back-end stalls, we explore how instructions can be completed earlier, and out of order, to relieve pressure on the out-of-order resources other than the IQ.
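The front-end idea of steering instructions to cheaper storage can be sketched as follows (a hypothetical illustration in the spirit of ready-aware scheduling; the class, method names, and sizes are invented for this example): an instruction whose source operands are already available needs none of the IQ's wakeup logic and can go to a cheap in-order FIFO, leaving the small, expensive IQ for instructions that must wait.

```python
from collections import deque

class Scheduler:
    """Hypothetical sketch of ready-aware instruction steering.
    Ready instructions bypass to a cheap in-order FIFO; only
    instructions waiting on an operand occupy the small, expensive
    out-of-order issue queue (IQ)."""

    def __init__(self, iq_size=2):
        self.iq = []             # expensive CAM-based issue queue
        self.fifo = deque()      # cheap in-order bypass structure
        self.iq_size = iq_size
        self.ready_regs = set()  # registers whose values are available

    def dispatch(self, srcs):
        """Steer one instruction, given its source registers."""
        if all(s in self.ready_regs for s in srcs):
            self.fifo.append(srcs)   # ready now: issue in FIFO order
            return "fifo"
        if len(self.iq) < self.iq_size:
            self.iq.append(srcs)     # must wait: needs IQ wakeup/select
            return "iq"
        return "stall"               # IQ full: the front-end stalls

s = Scheduler(iq_size=1)
s.ready_regs = {"r1", "r2"}
print(s.dispatch(("r1", "r2")))  # both operands ready -> cheap FIFO
print(s.dispatch(("r9",)))       # waits for r9        -> expensive IQ
print(s.dispatch(("r8",)))       # IQ already full     -> front-end stall
```

In this sketch the IQ fills only with instructions that genuinely need wakeup logic, so a much smaller IQ sustains the same dispatch stream, which is the depth and width pressure reduction described above.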

7. Acknowledgements

During my PhD, I worked on out-of-order processors, trying to make them in-order without hurting performance. It seems, however, that this work has not been influential enough, because you are (probably) reading this thesis out of order: you skipped many chapters and came here to read this acknowledgment. Anyway, it has been a long journey, and now it is time to thank the people who made this thesis possible.

First and foremost, I would like to express my sincere gratitude to my advisors, Prof. Stefanos Kaxiras and Prof. David Black-Schaffer, for their continuous support of my PhD study and related research, and for their patience, motivation, and immense knowledge. Your guidance helped me throughout the research and the writing of this thesis. I have been very lucky to work with you two.

Stefanos, I will miss both your excitement and your reluctance about a research topic. The excitement in your first email replies, while I was looking for a PhD position in Sweden, convinced me to come to Sweden even though I had a few other admissions in the US and Canada. You are the one who believed in me and let me pursue all the crazy ideas and challenging projects, especially at the beginning of my PhD. What I will always remember about you is "we can" and "always give credit to the people who have looked at it before".

David, I learned a lot from you, and not only about research. The most important lesson was how to accept failures. I will miss the weekly meetings, the discussions, the planning. I appreciate you for always being available and supportive. I could not have imagined having a better advisor, and especially a better mentor, for my PhD study. What I will always remember about you is "be committed", "be dedicated to the job", "be on time", and "apologize whenever it is needed, ASAP, even if you are the boss".

I would like to thank Prof. Erik Hagersten for creating such a nice group and environment.
Also, thank you for making it easy to get hired into UART by asking exactly the right questions during my interview, even though they were the hardest ones among all the interviews I had back then. I would like to thank my co-advisors, Trevor Carlson, Rakesh Kumar, and Magnus Själander. It has been a pleasure working with you. Trevor, it is amazing how quickly you can come up with an answer to a research question. Rakesh, you work very professionally; the way you look into the "whys" of a research problem is impressive. Magnus, your critical view on research topics, even on papers published at top-ranked conferences, is impressive. It has been a pleasure working with such a modest person and having you as a friend! I have great memories of the trips with you to Austria and Canada.

Next, I would like to thank the entire UART group: Alberto, Alexandra, Andreas Sembrant, Nikos, Vasileios, Mahdad, Moncef, Konstantinos, German, Ricardo, Johan, Kim, Greg, Mihail, Andra, Hassan, Marina, Per, Chang-Hyun, and Yuan. It has been an awesome experience being among you guys. Thank you, Ricardo, for the fun technical discussions; German, for the despacito chats; Per, for helping me with the Swedish translation of my thesis abstract; Hassan, for all the confusions you have; and Marina, for being a very nice office-mate. Chris, many thanks for being a trustworthy friend. It has been a pleasure working with you as a co-author. You know so much, and you are so generous in sharing it. Johan and Kim, my peers from the beginning of my PhD, thank you for always being supportive and for all the good memories. Johan, you have been a very trustworthy friend, a true one, a hardware geek. Many thanks for always understanding me and being there when I needed to talk. Kim, I am not sure what to write about you. A friend? A sister? A buddy? You have been my closest friend during my PhD in Uppsala, not only when you used to be "Kim-Anh" but also when you became "Kim-Mom". Thank you for being very supportive and for helping me not only through the PhD journey but also with getting integrated into a new society, especially in the beginning. Escape rooms, friendly chats, and running over the hills gave me unforgettable memories. Thank you, genius. Many thanks to my non-UART friends: Saleh, Malihe, Alireza, Amin, Aala, Marcus, Reza, and Niloo. Emilia, you appeared in my life during a particular period of time; who knows, perhaps for a reason. I was achieving many things while I was losing a very important person, my mother. Since then you have been very supportive and always available. You have also been a great gym partner. Special thanks to my best friend since childhood, Mohammad Sattari.
I am so proud of my youth because of my sports career and its achievements, and I owe all of them to you, since you took me to the track for the first time. What you have done for me and my family is invaluable. Thank you for making fun of me when I deserve it, and for loving me when I don't. Thank you for staying constant in a world full of change, and for keeping some normalcy and modesty in a world full of chaos.

Furthermore, I would like to thank my family: my two older brothers and my parents. Thank you, Dad, for making it possible for me to continue my professional sports career and go to university while you had to work hard to support us. Thank you, Mom, for being the best mother and a true friend. You sacrificed yourself for us; you loved your family. You are the one who taught me the importance of love in life, how to never give up, how to accept the challenges in life, and to always work hard for the ones we love. You were waiting for this moment, for my PhD defense, and unfortunately you are not among us anymore (RIP), but I will always love you, Madar jan.

All in all, many thanks to the people above and to all those who have helped but have not been mentioned by name.

References

[1] M. Alipour, T. E. Carlson, and S. Kaxiras. A taxonomy of out-of-order instruction commit. In 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 135–136, 2017.
[2] Mehdi Alipour, Trevor E. Carlson, and Stefanos Kaxiras. Exploring the performance limits of out-of-order commit. In Proceedings of the Computing Frontiers Conference, CF '17, pages 211–220, 2017.
[3] Mehdi Alipour, Rakesh Kumar, Stefanos Kaxiras, and David Black-Schaffer. Delay and bypass: Ready and criticality aware instruction scheduling in out-of-order processors. In 26th IEEE International Symposium on High Performance Computer Architecture, HPCA 2020, San Diego, CA, USA, February 22-26, 2020, pages 558–569, 2020.
[4] Mehdi Alipour, Rakesh Kumar, Stefanos Kaxiras, and David Black-Schaffer. FIFOrder microarchitecture: Ready-aware instruction scheduling for OoO processors. In Proceedings of the 25th International Symposium on Design, Automation and Test in Europe, DATE '19, pages 710–715, 2019.
[5] G. B. Bell and M. H. Lipasti. Deconstructing commit. In Proceedings of the 2004 IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS '04, pages 68–77, 2004.
[6] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood. The gem5 simulator. SIGARCH Comput. Archit. News, 39(2):1–7, August 2011.
[7] Ramon Canal and Antonio González. A low-complexity issue logic. In Proceedings of the 14th International Conference on Supercomputing, ICS '00, pages 327–335, 2000.
[8] Trevor E. Carlson, Wim Heirman, Osman Allam, Stefanos Kaxiras, and Lieven Eeckhout. The load slice core microarchitecture. In Proceedings of the 42nd Annual International Symposium on Computer Architecture, ISCA '15, pages 272–284, 2015.
[9] Yuan Chou, Brian Fahs, and Santosh Abraham. Microarchitecture optimizations for exploiting memory-level parallelism. In Proceedings of the 31st Annual International Symposium on Computer Architecture, ISCA '04, pages 76–, Washington, DC, USA, 2004. IEEE Computer Society.
[10] Intel Corporation. Intel® 64 and IA-32 Architectures Optimization Reference Manual. http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html, June 2016.
[11] Michael K. Gowan, Larry L. Biro, and Daniel B. Jackson. Power considerations in the design of the Alpha 21264 microprocessor. In Proceedings of the 35th Annual Design Automation Conference, DAC '98, pages 726–731, 1998.

[12] L. Gwennap. Digital leads the pack with 21164. Microprocessor Report, 8(12):249–260, September 1994.
[13] T. Ham, J. L. Aragón, and M. Martonosi. DeSC: Decoupled supply-compute communication management for heterogeneous architectures. In MICRO, pages 191–203, 2015.
[14] John L. Henning. SPEC CPU2006 benchmark descriptions. SIGARCH Comput. Archit. News, 34(4):1–17, September 2006.
[15] Kuo-Su Hsiao and Chung-Ho Chen. An efficient wakeup design for energy reduction in high-performance superscalar processors. In Proceedings of the 2nd Conference on Computing Frontiers, CF '05, pages 353–360, 2005.
[16] A. Jaleel. Memory characterization of workloads using instrumentation driven simulation. http://www.glue.umd.edu/ajaleel/workload, 2010.
[17] David Kroft. Lockup-free instruction fetch/prefetch cache organization. In Proceedings of the 8th Annual Symposium on Computer Architecture, ISCA '81, pages 81–87, 1981.
[18] Srilatha Manne, Artur Klauser, and Dirk Grunwald. Pipeline gating: Speculation control for energy reduction. In Proceedings of the 25th Annual International Symposium on Computer Architecture, ISCA '98, pages 132–141, 1998.
[19] S. Marti, J. Borras, P. Rodriguez, R. Tena, and J. Marin. A complexity-effective out-of-order retirement microarchitecture. IEEE Transactions on Computers, 58(12):1626–1639, 2009.
[20] J. F. Martinez, J. Renau, M. C. Huang, and M. Prvulovic. Cherry: Checkpointed early resource recycling in out-of-order microprocessors. In MICRO, pages 3–14, 2002.
[21] Teresa Monreal, Victor Vinals, Jose Gonzalez, Antonio Gonzalez, and Mateo Valero. Late allocation and early release of physical registers. IEEE Trans. Comput., 53(10):1244–1259, October 2004.
[22] Subbarao Palacharla, Norman P. Jouppi, and J. E. Smith. Complexity-effective superscalar processors. SIGARCH Comput. Archit. News, 25(2):206–218, May 1997.
[23] Andreas Sembrant, Trevor Carlson, Erik Hagersten, David Black-Schaffer, Arthur Perais, André Seznec, and Pierre Michaud. Long Term Parking (LTP): Criticality-aware resource allocation in OoO processors. In Proceedings of the 48th International Symposium on Microarchitecture, MICRO-48, pages 334–346, 2015.
[24] Ryota Shioya, Masahiro Goshima, and Hideki Ando. A front-end execution architecture for high energy efficiency. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-47, pages 419–431, 2014.
[25] J. E. Smith and A. R. Pleszkun. Implementing precise interrupts in pipelined processors. IEEE Transactions on Computers, 37(5):562–573, 1988.
[27] G. S. Sohi and S. Vajapeyam. Instruction issue logic for high-performance, interruptable pipelined processors. In ISCA, pages 27–34, 1987.
[28] Henry Wong, Vaughn Betz, and Jonathan Rose. Microarchitecture and circuits for a 200 MHz out-of-order soft processor memory system. ACM Trans. Reconfigurable Technol. Syst., 10(1):7:1–7:22, December 2016.
[29] Henry Wong, Vaughn Betz, and Jonathan Rose. High-performance instruction scheduling circuits for superscalar out-of-order soft processors. ACM Trans. Reconfigurable Technol. Syst., 11(1):1:1–1:22, January 2018.
[30] Craig B. Zilles and Gurindar S. Sohi. Understanding the backward slices of performance degrading instructions. In Proceedings of the 27th Annual International Symposium on Computer Architecture, ISCA '00, pages 172–181, 2000.


Acta Universitatis Upsaliensis
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 1902
Editor: The Dean of the Faculty of Science and Technology

A doctoral dissertation from the Faculty of Science and Technology, Uppsala University, is usually a summary of a number of papers. A few copies of the complete dissertation are kept at major Swedish research libraries, while the summary alone is distributed internationally through the series Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology. (Prior to January, 2005, the series was published under the title “Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology”.)

ACTA UNIVERSITATIS UPSALIENSIS Distribution: publications.uu.se UPPSALA urn:nbn:se:uu:diva-403675 2020