Eindhoven University of Technology

MASTER

US4A : universal, super scalable superscalar architecture : a preliminary study

van Hekezen, A.H.

Award date: 1996


Technische Universiteit Eindhoven

Faculty of Electrical Engineering Digital Information Systems Group

Master's Thesis:

US4A: universal, super scalable superscalar architecture
A preliminary study

A.H. van Hekezen

Coach: Dr. ir. A.C. Verschueren
Supervisor: Prof. ir. M.P.J. Stevens
Period: September 1993 - June 1994

The Faculty of Electrical Engineering of Eindhoven University of Technology does not accept any responsibility regarding the contents of Master's theses.

Sequential computers are approaching a fundamental physical limit on their potential computational power. Such a limit is the speed of light.

A.L. DeCegama

(The Technology of Parallel Processing, Volume 1, 1989)

Summary

A superscalar processor is a microprocessor that can execute multiple instructions concurrently. Such a processor changes the instruction order to achieve a better performance. Register renaming is a technique that makes this so-called out-of-order execution possible by cancelling register storage conflicts and by keeping true data dependencies. A buffer of instructions between the decode and the update stage, called an instruction window, is necessary to make the schedule of instructions. To reduce branch penalties, branches can be predicted. Instructions can be fetched and executed speculatively in the predicted directions. However, special precautions are necessary to cancel results of incorrect predictions. Additional logic is necessary for making interrupts precise, that is, the same as in a conventional sequential processor.

A preliminary study of a Universal Super Scalable SuperScalar Architecture (US4A) has been done. Universal refers to instruction set independency (both RISC and CISC); super scalable refers to the large amount of scaling that is possible without a significant increase in clock cycle time. It also refers to the high performance potential. US4A is only a processor core, not a complete processor. A target parameter setting for high performance (more than three instructions per clock cycle using a RISC instruction set) has been defined.

US4A consists of out-of-order execution with full register renaming, using an instruction window based on the reorder buffer with future file principle. For every instruction, the instruction window contains the opcode, source and destination operands, dependency information, and some status bits. Every instruction in the window has a unique label, called a tag. Register renaming has been implemented by replacing a register number with a tag. This tag refers to the instruction that generates the required result. Instructions are executed using the storage in the window for source and destination operands instead of the normal registers. Whenever no trap or interrupt has occurred, results are copied from the instruction window to the register file in program order. Therefore, incorrectly speculatively executed instructions can easily be cancelled by marking those instructions invalid, which prevents them from updating the register file. Furthermore, multi-path speculative execution can be implemented using a simple extension. A speculative register file or future file has been used to implement the dependency checks. Multiple speculative register files have been used for implementing multi-path speculative execution.

Shifting an instruction early in the cycle and updating operands late in the cycle is the key to having the instruction window perform multiple actions in a cycle. This technique has been used for both the instruction window and the speculative register file. An interesting way of combining priority encoders has been used to find a sub-optimal schedule of instructions within one cycle.

US4A is feasible, but hardware counts show that it uses a lot of hardware.

Contents

Preface

1. Introduction
   1.1 Speeding up microprocessors
   1.2 Project description
   1.3 How to read this thesis

2. Superscalar concepts
   2.1 Introduction
   2.2 Pipelining
       2.2.1 Introduction to pipelining
       2.2.2 Pipeline stalls
       2.2.3 Out-of-order execution
       2.2.4 Preventing and reducing pipeline stalls due to data dependencies
   2.3 Instruction windows
   2.4 State recovery because of traps and speculative execution
   2.5 Theoretical performance of superscalar processors
       2.5.1 Instruction level parallelism
       2.5.2 The real performance limit: basic block size and fetch limits
   2.6 Conclusion

3. Description of US4A
   3.1 Introduction
   3.2 Starting-points
       3.2.1 Architectural starting-points
       3.2.2 Current target parameter settings
       3.2.3 Logic design rules
       3.2.4 Layout considerations
       3.2.5 Memory considerations
   3.3 Overview of US4A
   3.4 Description of a non-speculative window version
       3.4.1 Description of a window entry
       3.4.2 Example of a simplified window
       3.4.3 Intermezzo: Reducing storage in the instruction window by sharing tag and register value storage
       3.4.4 Implementing data dependency checks and register renaming: the speculative register file
       3.4.5 Intermezzo: Reducing storage in the instruction window by sharing source and destination operand storage
       3.4.6 Dealing with the flag register
   3.5 State recovery
       3.5.1 State recovery due to traps
       3.5.2 State recovery due to speculative execution
       3.5.3 Dealing with interrupts
   3.6 Multi-path speculative version of US4A
   3.7 Future directions
       3.7.1 Problems and remaining design issues
       3.7.2 Further research
       3.7.3 Possible reductions
   3.8 Conclusion

4. Implementation of the hardware
   4.1 Introduction
   4.2 Moving the instructions through the instruction window
   4.3 Result monitoring
   4.4 Speculative register file (SRF)
   4.5 Flushing paths
   4.6 Dispatch unit
   4.7 Turn around time
   4.8 Hardware counts
   4.9 Conclusion

5. Conclusions and recommendations
   5.1 Conclusions
   5.2 Recommendations

References

Index

Appendix A. Related publications

Appendix B. Hardware counts

Preface

Faster, faster, faster. Computers are always too slow, no matter how expensive. Therefore, users are always looking for new and faster computers. Once the new computer has arrived, old programs run much faster and new programs come within reach. Soon, however, these too run too slow. So again there is a need for faster computers... a never ending story?

Part of this never ending story is the design of superscalar microprocessors. This is one way among others to speed up program execution time. Superscalar microprocessors are processors that execute several instructions at the same time. At the same clock frequency, a superscalar processor will therefore be faster than its non-superscalar equivalent.

This Master's thesis describes a preliminary study of US4A: Universal Super Scalable SuperScalar Architecture. The goal of this study was to define a superscalar processor core which is highly adaptable: the architecture must not depend on a specific instruction set and all architectural parameters must be changeable. Whenever technology improves and more transistors can be put on a chip, more performance can be achieved by increasing the architectural parameters.

Because of this complexity, a lot of time has been spent on learning the concepts of superscalar microprocessors. Examining the literature has consumed a lot of time too; one does not want to design things that are already there. However, the literature was disappointing in several ways. A lot has been written about superscalar microprocessors, but only little of it is really essential. A lot has been written about the architectural concepts, but only little about implementations. Many researchers prefer to use their own terms and unfortunately few of them try to comply

with established terms; no doubt this will increase the time needed for literature examinations. I found several papers that had neither abstract nor title containing the word 'superscalar', though their main theme was superscalar processors! As some recent textbooks on computer architecture (for instance [Stal93]) already contain information about superscalar processors, standardization will probably be better in the future.

This report has been written to be understandable for electrical engineering students that have a basic knowledge of computer architecture. However, superscalar concepts are complex; they cannot be explained in a hierarchical way, but only in a circular way. Hopefully, the introduction and conclusion of every chapter, as well as the numerous references, will overcome this problem.

I was not able to finish this report without the help of others. I want to thank Gert-Jan de Vos for discussing many superscalar topics with me. These discussions were very helpful for the red thread in chapter 2. He also corrected several versions of this thesis. Furthermore, I want to thank Ad Verschueren, my coach. He made a good start possible by handing out many ideas and hints. During the project, he came with many more. Many ideas of this thesis arose during discussions with him, so I owe him a lot. He also corrected several versions of this thesis. Of course, I am responsible for all errors. And there will always be errors, because:

Een perfect rapport is verouderd¹

Finally, I hope the reader will enjoy reading.

Ton van Hekezen

1 Dutch for "A perfect report is out of date"

1. Introduction

1.1 Speeding up microprocessors

As long as microprocessors exist, people have been looking for ways to increase their performance. The history of microprocessors may be short, but the history of computers started back in 1642. Then, Blaise Pascal made a mechanical calculator that can be seen as the first step towards the modern computer. Later, similar attempts were made [Haye88]. Exactly 200 years later, it was Menabrea [Mena1842] who wrote down probably the very first ideas of parallel processing while describing Charles Babbage's analytical engine. In contrast with what most people believe, the history of computer parallelism started surprisingly long ago.

Around 1945, the first electronic computers were built. In 1958, the vacuum tubes were replaced by transistors, but it took up to 1972 before the first processor on an IC, called the microprocessor, could be built.

As long as computers exist, there are basically two ways of making them faster. 'Making them faster' means reducing program execution time. First of all, the clock frequency can be increased. For instance, a new process technology can allow for smaller delays. This will speed up existing computer architectures. Secondly, the computer architecture can be changed. For instance, memory caches could be introduced to reduce the average memory access time. Or, by using a more advanced floating point unit, the time needed to do a floating point add could be reduced from eight to four cycles. These are some of the ways to decrease the execution time of a program without changing the clock frequency. Figure 1 gives an overview of most ways. All terms are explained below.

Architectural methods as well as process technology methods are being improved at the same time. Often, they are highly connected. For instance, pipelining is an architectural issue which allows for a higher clock frequency (or a lower number of clock cycles per instruction).

[Figure 1. Methods of increasing processor performance. The figure divides the methods into architectural issues and IC-technology issues. The architectural issues split into parallelisms - parallel processors (multiprocessor systems (MIMD), vector processors (SIMD), systolic arrays (MISD)) exploiting large-grain parallelism, and sequential processors (VLIW, superscalar, superpipelined) exploiting fine-grain parallelism - and tricks (pipelining, faster ALU/FPU, special instructions). The IC-technology issues (faster transistor technology, better layout) lead to a higher clock frequency.]

Exploiting parallelism is a common method of reducing program execution time among the architectural ones. This can be done in several ways. Before giving some examples, it is important to explain something about types of computers, because this will also give an overview of types of parallelism. Michael Flynn [Flyn72] divided computers into four types, depending on the instruction stream and the data stream. This is known as Flynn's classification:

• SISD: single instruction stream over a single data stream. This is a conventional, sequential CPU, like for instance a 68000 or a Z80.

• MIMD: multiple instruction streams over multiple data streams. Multiple independent processors operate as part of a larger system. Memory can be either shared or private. These kinds of computers are sometimes called 'parallel computers', but this is confusing as it applies to MIMD as well as SIMD and MISD. Multiprocessor systems is a better term.

• SIMD: single instruction stream over multiple data streams. This is called a 'vector computer': a computer which has one instruction to process a whole array. This is an advantage for complicated vector or matrix calculations.

• MISD: multiple instruction streams over a single data stream. This is also called systolic arrays. Several processing elements each compute a part of the computation.

Note that reality is more subtle than this classification. For instance, a network of processors can be classified as a MIMD or a MISD system, depending on the way it is programmed.

Although parallel computers (MIMD, SIMD, MISD) can be much faster than sequential machines (SISD), it is hard to achieve the potential performance increase. Even if the normal, sequential algorithm is known, it is still very difficult to extract the parallelism out of it. Lots of programmers know how to program sequential computers; only few of them are able to program parallel computers. In addition, changing the hardware configuration almost always requires programs to be rewritten. Because of these difficulties with parallel computers, it is still interesting to look for faster conventional processors. This thesis is a proposal for a faster conventional processor.

Back to exploiting parallelism as an architectural issue to speed up program execution time on conventional processors: an example of parallelism can be to execute an integer instruction and a floating point instruction concurrently. Usually, an integer ALU and a floating point processor are different units, so why not use them in parallel? (See section 2.2.3 for a more thorough description of this subject.) In fact, one of Intel's processors uses this approach. Because this is about executing instructions concurrently using multiple execution units (ALUs, load/store units), it is called instruction level parallelism. (Note that this fine-grained parallelism is on a much lower level than Flynn's classification, which is called coarse-grained.) Theoretically, 10 to 200 000 (!) instructions can be executed in parallel, depending on the type of program (see section 2.5). Only hardware limitations prevent using all of this parallelism.

Although making use of instruction level parallelism may seem simple, finding an optimal schedule for the instructions so that all units are used as well as possible is very difficult. In fact, it is an NP-complete problem [Kell75]. Heuristics, however, have been shown to be very effective. Using instruction level parallelism does not have the main disadvantage of parallel computers, that they are hard to program: it hides the parallelism from the programmer. Conventional algorithms and programming techniques can be used, yet parallelism is exploited to gain performance. Therefore, this is a very nice solution: enjoying the performance increase of parallelism without suffering the pain of parallel programming problems.
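Such a heuristic can be sketched in a few lines. The following is an illustrative greedy list scheduler (not an algorithm taken from this thesis; the instruction names and dependencies are hypothetical): each cycle, it issues as many ready instructions as there are execution units.

```python
# Illustrative sketch of a list-scheduling heuristic: pack dependent
# instructions onto a limited number of execution units, cycle by cycle.

def list_schedule(instructions, deps, num_units):
    """instructions: list of names; deps: {instr: set of instrs it needs}.
    Returns {cycle: [instructions issued in that cycle]}."""
    done, schedule, cycle = set(), {}, 0
    while len(done) < len(instructions):
        # An instruction is ready once all its dependencies have completed.
        ready = [i for i in instructions
                 if i not in done and deps.get(i, set()) <= done]
        issued = ready[:num_units]      # fill at most num_units slots
        schedule[cycle] = issued
        done |= set(issued)             # results available next cycle
        cycle += 1
    return schedule

deps = {"i2": {"i1"}, "i3": {"i1"}, "i4": {"i2", "i3"}}
sched = list_schedule(["i1", "i2", "i3", "i4"], deps, num_units=2)
print(sched)   # {0: ['i1'], 1: ['i2', 'i3'], 2: ['i4']}
```

With two units, the two instructions that both depend on i1 are issued together in cycle 1; a real scheduler would additionally weigh latencies and unit types.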

The problem of finding the instruction schedule can be solved by software or hardware.

In the first case, a so called Very Large Instruction Word (VLIW) processor uses large instructions that consist of several smaller instructions. One large instruction is fetched and executed as a group, so the smaller instructions are executed concurrently. The compiler determines which small instructions are combined into a large instruction. Therefore, the compiler determines the schedule.

In the second case, called a superscalar processor, the hardware determines at runtime which instructions are executed concurrently. Because program input can be very different every time and because of certain uncertainties (loading a byte from memory can take a few clock cycles if the byte is in the cache and much more if not), the dynamic scheduling of the hardware in superscalar processors has a better performance potential. Another advantage of superscalar processors over VLIW

processors is that the instruction set of superscalar processors can be compatible with existing instruction sets. So, superscalar versions of existing processors can be (and have been) made. Not even recompiling is necessary to make use of the parallelism². In the case of upgrading a computer system with a processor that has more execution units, a superscalar processor can simply replace the old processor, but with a VLIW processor the instruction set changes, so all programs must be recompiled.

However, the price for a superscalar processor is very complex hardware. Longer hardware design time and complicated production make superscalar processors more expensive than VLIW processors. Because of the higher performance potential and the larger flexibility, superscalar processors are more interesting for general purpose processors; VLIW processors are preferable in special, dedicated situations.

A related topic is the superpipelined processor [Joup89b]. A superpipelined processor fetches instructions at a shorter cycle time than a normal processor. For an introduction to pipelines, see section 2.2. Theoretically, this could give a proportional speedup; in practice, however, the cycle time of a superpipelined processor needs to be longer than this ideal. Figure 2 gives an overview of the differences between VLIW, superscalar and superpipelined processors. Of course, all kinds of combinations are possible.

1.2 Project description

The goal of this project is to find a Universal, Super Scalable SuperScalar Architecture (US4A).

Universal means instruction set independency. Either a Reduced Instruction Set Computer (RISC) or a Complex Instruction Set Computer (CISC) should be able to use the principles described in this thesis. In addition, no specific application is kept in mind. The architecture is targeted at general purpose processors, although this doesn't rule out special purpose optimizations.

Scalable refers to the independence of the architecture from the number of fetch units, number and types of ALUs, instruction window size (see chapter 2), memory organization and so on. When more transistors become available on an IC, more execution units (ALUs, load/store units) can be added without a significant increase in clock cycle time. The architecture has been called super scalable because it must be possible to increase the architectural parameter values a lot without significantly increasing clock cycle time.

2 However, recompiling will squeeze the last performance improvements out of a superscalar processor.

[Figure 2. Overview of several SISD architectures: the stages (IF, ID, EX, WB) of successive instructions plotted against clock cycles 0-4 for (a) a normal, pipelined processor, (b) a VLIW processor, (c) a superscalar processor and (d) a superpipelined processor.]

Superscalar refers to a type of processor that executes several instructions concurrently, and not necessarily in the original program order. This is one way among others to speed up program execution time. Because the number of transistors that fit on a chip increases much more every year than the clock cycle time decreases, superscalar processors have good future prospects.

Because only the core of the processor can be investigated when no specific instruction set is used, the term Architecture is used instead of, for instance, 'microprocessor'. This means that no fetch unit, memory management unit, cache, ALU or floating point unit will be part of the design.

Performance is the main objective. Hardware cost, i.e. size of the chip surface, number and types of transistors, number of busses, and so on, is of secondary concern. The fact that the transistor count increases much more than the clock frequency every year justifies this choice. The design time of a basic superscalar architecture may be long. However, a scalable architecture can strongly reduce the design time for future processors.

Current superscalar architectures are able to execute at most two instructions every clock cycle, with a peak performance of three. Looking at the theoretical instruction level parallelism, it is possible to execute at least ten instructions using a general purpose RISC instruction set, and much more seems to be possible. This project has as target an architecture that can fetch

four instructions each cycle and uses four ALUs and two load/store units, executing up to 4 paths speculatively. With this target architecture, it must be possible to execute more than 3 instructions per cycle.

The architecture has been targeted at near-future silicon budgets of approximately 10 million transistors or more. Because of the scalability, whenever higher transistor budgets become available, more performance can be achieved without a complete redesign. The performance is reached by the superscalar principles. However, the added control logic necessary for superscalar operation must not increase the clock cycle time too much. This will be one of the main design problems. Out-of-order execution, register renaming, branch prediction, and multi-path speculative fetching and execution will all be necessary (see chapter 2 for an explanation of these terms). In the case of a good, well balanced design, the memory bandwidth will probably be the bottleneck.

1.3 How to read this thesis

Arrangement of the sections
The theory of superscalar microprocessors is rather complicated. Not because every part is complex, but because of the strong relations between the parts. Even the different design levels (instruction set design, architecture design, digital logic design) are strongly correlated. Yet, this thesis is divided into four parts: a general introduction, an introduction to superscalar concepts, the architectural level of US4A, and the logic level. Because the relations cannot be avoided, a lot of cross references have been used to aid the reader.

Chapter 2 gives an introduction to superscalar concepts. Because the superscalar theory is explained either in depth (scientific literature) or only superficially (recent textbooks, for instance [Stal93]), this chapter is somewhere in between.

Chapter 3 gives an overview of US4A. It covers the architectural level, although logic design issues cannot be avoided.

Chapter 4 concentrates on the logic design. Several expected implementation problems are covered. Hardware costs are estimated for several configurations.

The last chapter consists of conclusions and recommendations. It is followed by the list of references and an index. A list of related publications can be found in one of the appendices. Although not used in this report, those references may become interesting for future research. A detailed description of the hardware counts has also been given in the appendices.

A few notes on used terminology
A few notes about the use of several words. The Reduced Instruction Set Computing (RISC) versus Complex Instruction Set Computing (CISC) battle will not be fought in this thesis. First of all, the principles of this thesis can be used for both of them. Secondly, the battle is getting fuzzier all the time. A RISC was originally a computer that could run a limited instruction set very fast. Most of the instructions could be executed in one cycle. It was based on the fact that most of the complex instructions were hardly used by compilers and human programmers. However, RISC processors like the Intel 960 have an instruction set that consists of more than 100 instructions. One can hardly call that 'reduced' any more. Furthermore, CISC processors like Motorola's 68060 and Intel's Pentium are approaching the 'one cycle per instruction' limit. Both modern RISC and CISC processors use several million transistors nowadays. Some specific differences remain, however. RISC still uses a load/store architecture. This means that there are instructions for transferring data between memory and registers and instructions for performing operations on data already in registers. CISC can do operations on memory operands. Another difference is the instruction width. RISC uses a fixed instruction width and CISC still uses a variable instruction width. These two differences are among the most important differences between RISC and CISC nowadays. In this thesis, these two differences are what is meant when either CISC or RISC is mentioned.

The terminology of interrupts is very mixed up too. In this thesis, an interrupt is a hardware exception or the program that deals with this signal. A hardware exception can be a request from an I/O device to be serviced. A trap is an exception caused by an instruction. This can be a page fault or a divide by zero. An exception can be either an interrupt or a trap.

2. Superscalar concepts

2.1 Introduction

There are many architectural concepts that have to do with superscalar microprocessors. Some of them are specific to superscalar processors, some are not. The problem is that all concepts interact with each other. They cannot be explained in a nice, easy to grasp, hierarchical way. As a consequence, the reader may need to reread this chapter in order to understand all mentioned principles. For an overview, however, reading it once will be enough. This introduction and the conclusion at the end of this chapter will certainly be an aid for that.

When reading this chapter, one must not forget that a superscalar processor is in fact a processor with multiple pipelines. Therefore, this chapter starts with an introduction to pipelining (section 2.2). Pipelining is a technique whereby multiple instructions are overlapped in execution. A pipeline stall (section 2.2.2) is one or more cycles during which a pipeline - and so the processor - cannot do its work. A data dependency (an instruction writes to a register and its successor reads that value) or a procedural dependency (a jump or a conditional jump) can cause such a pipeline stall.

Changing the original instruction order could prevent some of these pipeline stalls. This is called out-of-order execution (section 2.2.3). In order to make this out-of-order execution possible, the data dependencies must be checked to prevent a schedule of instructions that causes incorrect execution results (section 2.2.4). A technique called scoreboarding can be used for that. Another technique, called register renaming, removes the dependencies caused by a lack of register storage, and keeps only the so called true data dependencies. In order to make a schedule of instructions, a buffer of instructions is necessary. This is called an instruction window. Several sorts of instruction windows will be explained in section 2.3.
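The renaming idea can be sketched in a few lines. The following is an illustrative model, not the hardware described later in this thesis: every result-producing instruction gets a fresh tag, and source registers are replaced by the tag of their latest producer, so write-after-write and write-after-read conflicts disappear while the true (read-after-write) dependencies remain.

```python
# Hedged sketch of register renaming with tags (illustrative model).

def rename(program):
    """program: list of (dest_reg, [src_regs]) in program order.
    Returns renamed tuples (tag, [tag-or-register sources])."""
    latest = {}                      # register -> tag of its latest producer
    renamed = []
    for n, (dest, srcs) in enumerate(program):
        tag = f"t{n}"
        # A source becomes a tag only if an earlier instruction produces it;
        # otherwise the value comes from the register file.
        new_srcs = [latest.get(s, s) for s in srcs]
        latest[dest] = tag           # later writers get new tags: no WAW/WAR
        renamed.append((tag, new_srcs))
    return renamed

prog = [("r1", ["r2"]),      # r1 := f(r2)
        ("r3", ["r1"]),      # true dependency on the first instruction
        ("r1", ["r4"])]      # reuses r1: a storage conflict, not a true dep
print(rename(prog))  # [('t0', ['r2']), ('t1', ['t0']), ('t2', ['r4'])]
```

Note how the third instruction, which overwrites r1, receives its own tag t2 and can execute before or in parallel with the second one; only the second instruction carries a real dependency (on t0).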

In order to remove pipeline stalls due to branches (procedural dependencies), branch prediction can be used (section 2.2.2 again). This technique

tries to predict whether the branch (or conditional jump) will be taken or not. In the predicted direction, instructions will be fetched, and may even be executed. The latter situation is called speculative execution. However, when predictions turn out to be incorrect, several things have to be repaired. This repair mechanism turns out to cooperate well with exception recovery. They both need to save and recover some kind of processor state (section 2.4). Exception recovery in superscalar processors is difficult because when instructions are executed in parallel and out-of-order, it is not clear which instruction caused the trap and in which state the processor should be restarted. Several techniques (checkpoint repair, history buffer, reorder buffer with or without future file) to deal with this problem will be discussed.

Having explained the superscalar concepts, a discussion of instruction level parallelism (the average number of instructions that can be executed in parallel) will be given in section 2.5. When one knows what the theoretical limits are, one knows when to stop searching for more. The basic block size, the number of instructions between two jumps, will also be discussed in this section. This is, or will become, one of the most important bottlenecks.

Like all chapters, this one ends with a conclusion.

2.2 Pipelining

2.2.1 Introduction to pipelining

Pipelining is a technique whereby multiple instructions are overlapped in execution. Because pipelines take more hardware and RISC instructions are simpler to execute, RISC processors were the first processors equipped with a pipeline. Later, CISC processors used pipelines too.

In general, the execution of an instruction consists of four parts:
1. Instruction Fetch (IF), also called instruction issue: read the instruction from memory or cache.
2. Instruction decode and assemble operands (ID): unravel the instruction into opcode and operands, fetch the required operands from the registers. Sometimes this is divided into two separate stages.
3. Execute (EX): execute the instruction.
4. Update, also called write back (WB): write the results back to the destination register.

Depending on the type of architecture or transistor technology, other stages may be possible. This scheme has been chosen as a common example.

[Figure 3. Clock cycle with and without pipelining: (a) without a pipeline, the stages IF, ID, EX and WB of one instruction follow each other before the next instruction starts; (b) with a pipeline, consecutive instructions overlap, each one stage behind its predecessor.]

Without pipelining, executing an instruction takes as long as all four stages (figure 3a). Because every stage uses a different part of the processor, the stages can operate concurrently on different instructions (figure 3b). For instance, when the first instruction has been fetched and is being decoded, its successor can already be fetched. This can continue until all different parts of the processor are operating on a different instruction (see the dotted box in figure 3). Now, the time to execute a complete instruction is four times smaller. This can be expressed as a reduction of the clock cycle time, a reduction of the number of clock cycles per instruction, or a combination of both. It is one of the architectural issues mentioned in section 1.1 that allow for a lower program execution time.

2.2.2 Pipeline stalls

One of the drawbacks of pipelines is the complex control logic required to detect when the pipeline should be stalled. There are situations in which the parallelism used by the pipeline is not allowed. In such a case, the pipeline should be halted. There are basically two types of pipeline stalls. Superscalar processors introduce yet another pipeline stall, as will be explained below.

Data dependency Suppose we have a processor with a simple pipeline. When there is no dependency, the pipeline can keep running (figure 4a). Suppose an instruction writes a result to a register and its successor uses this result (figure 4b). The second instruction has to wait until the result has been written into the register. So, only after the write-back stage can the second instruction use its operands. This is called a data dependency. Note that a technique called data forwarding exists, which sends results to the registers as well as to the next instruction, thereby reducing the idle cycles due to data dependencies. As will be shown, however, this is not enough in the case of superscalar processors. Because the way data dependencies

[Figure 4. Pipeline stalls due to dependencies in the case of a normal processor: (a) no dependency, (b) data dependency, (c) procedural dependency; hatched boxes denote pipeline stalls]

are taken care of has many consequences; section 2.2.4 will tell more about it.

Procedural dependency The second type of pipeline stall is called a procedural dependency (figure 4c). From a performance point of view, this is a more serious problem, because it happens often and is responsible for many idle cycles.

Only after the jump instruction has been decoded and its address has been calculated can the next instruction be fetched. The idle time between the last executed instruction (whether or not including the jump itself) and the first instruction after the jump is called the branch delay or branch penalty. Several techniques exist to reduce the performance degradation due to branches. Branch prediction and speculative execution are hardware techniques; delayed branches and software pipelining are software techniques [John91]. Branch prediction tries to predict whether a branch will be taken or not, based on the history and the properties of the branch. Fetching continues in the direction of the prediction. Speculative execution goes even further: the speculatively fetched instructions after the predicted branch are executed, without knowing whether they need to be executed. If the prediction was good, execution goes on without any delay; if the prediction was wrong, the situation from before the branch has to be recovered.
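A common history-based predictor of the kind mentioned above is a per-branch two-bit saturating counter. The sketch below is an added illustration (the thesis does not specify a particular scheme); the branch address 0x40 and the loop-like outcome pattern are assumptions:

```python
def simulate(outcomes, pc=0x40):
    """Run one branch through a per-address 2-bit saturating counter
    (states 0-1 predict not-taken, states 2-3 predict taken)."""
    counters = {}
    correct = 0
    for taken in outcomes:
        c = counters.get(pc, 1)          # start weakly not-taken
        if (c >= 2) == taken:
            correct += 1
        # saturate towards 3 on taken, towards 0 on not-taken
        counters[pc] = min(c + 1, 3) if taken else max(c - 1, 0)
    return correct, len(outcomes)

# A loop branch: taken eight times, not taken once (loop exit),
# then taken eight times again on re-entry.
print(simulate([True] * 8 + [False] + [True] * 8))  # (15, 17)
```

Even this small counter mispredicts only twice on the loop pattern, which is how prediction rates well above 90% are reached in practice.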

One way to reduce some of the stalls described above is forwarding. When an instruction needs the result of its predecessor, the two communicate via a register, thereby wasting one clock cycle. When, however, the register-to-register communication is bypassed, the result can be used by the following instruction without delay. This prevents the pipeline stall of figure 4b and reduces the pipeline stalls in case of a branch. If the pipeline gets complicated, the detection of forwarding opportunities gets complicated too. Therefore, multiple-pipeline computers that use forwarding, like for instance superscalar processors, must have very complex forwarding detection logic.
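The core of that detection logic is a comparison between the destination of one instruction and the sources of the next. The sketch below is an added simplification (real detection compares across several pipeline stages at once); the tuple encoding of instructions is an assumption:

```python
def needs_forwarding(producer, consumer):
    """An instruction is (destination, [sources]). Forwarding is
    needed when the consumer reads the register the producer writes,
    before the write-back stage has updated the register file."""
    dest, _ = producer
    _, sources = consumer
    return dest in sources

# r3 := r2 + 1 followed by r7 := r3 - 4: the r3 result must be
# bypassed to the second instruction.
print(needs_forwarding(("r3", ["r2"]), ("r7", ["r3"])))  # True
print(needs_forwarding(("r3", ["r2"]), ("r7", ["r1"])))  # False
```

In a superscalar processor this comparison must be replicated for every pair of concurrently active instructions, which is why the text calls the detection logic very complex.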

2.2.3 Out-of-order execution

Normally, instructions are executed in the same order as the original program order. This is called in-order execution. Higher performance can be achieved by allowing execution of instructions out-of-order, that is, not in the original program order.

The term 'execution' is not precise enough. The execution process can be divided into two processes. The first is issuing instructions to the execution units. The second is post-processing the results (e.g. writing the results back to the register file), known as completion. Both processes can be in-order or out-of-order. Except for the infeasible combination of out-of-order issuing with in-order completion, all combinations are explored below.

[Figure 5. In-order and out-of-order issuing and completion for the example program 1. r8 := 6; 2. r3 := r2 + 1; 3. r4 := r3 + 2; 4. load r6, [r1]; 5. r9 := r9 + r10; 6. r2 := r6 · 2 — (a) in-order issue, in-order completion; (b) in-order issue, out-of-order completion; (c) out-of-order issue, out-of-order completion]

A normal processor executes everything in-order (see figure 5a). There are several clock cycles in which the whole pipeline is stalled. The data dependency between instruction 2 and instruction 3 causes a one-cycle stall. Because instruction 4 is a memory load that takes 5 cycles, instructions 5 and 6 have to wait before they can complete. This prevents them from reading registers that have not been updated yet.

A performance increase can be obtained by relaxing the in-order completion constraint. This is called in-order issuing, out-of-order completion (figure 5b). Now instruction 5 can complete earlier than instruction 4, thereby reducing possible future data dependencies. However, because of the data dependency between instruction 4 and instruction 6, all instructions are executed in the same time as before.

Even more performance improvement can be achieved by also allowing out-of-order issuing of the instructions (figure 5c). Now instruction 4, which needs the longest time to complete, is the best candidate to be executed first. Some other changes are made compared to the original order. However, there is still a one-cycle pipeline stall due to the data dependencies.
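The lower bound that out-of-order issue approaches can be computed from the true data dependencies alone. The sketch below is an added illustration; the register names and the 5-cycle load follow the six-instruction example, while unit latencies for the other instructions and unlimited execution units are assumptions:

```python
def dataflow_finish_time(program):
    """Earliest overall finish time when every instruction may start
    as soon as its source operands are ready (true data dependencies
    only; unlimited execution units, storage conflicts ignored)."""
    ready = {}   # register -> cycle in which its value is available
    finish = 0
    for dest, sources, latency in program:
        start = max([ready.get(s, 0) for s in sources], default=0)
        ready[dest] = start + latency
        finish = max(finish, ready[dest])
    return finish

program = [
    ("r8", [], 1),             # 1. r8 := 6
    ("r3", ["r2"], 1),         # 2. r3 := r2 + 1
    ("r4", ["r3"], 1),         # 3. r4 := r3 + 2
    ("r6", ["r1"], 5),         # 4. load r6, [r1]  (5-cycle load)
    ("r9", ["r9", "r10"], 1),  # 5. r9 := r9 + r10
    ("r2", ["r6"], 1),         # 6. r2 := r6 * 2
]
print(dataflow_finish_time(program))  # 6: bounded by the load feeding 6
```

The chain from the load (instruction 4) into instruction 6 dominates: no issue policy can finish earlier than the longest dependency chain.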

Two remarks should be made at this point. First, this is an artificial example. It has been made to show principles only. However, this does not mean that the situation is unrealistic.

Second, although the three concepts look similar, there are significant implementation differences. Going from figure 5a to 5b does not cost too much. However, going from figure 5b to 5c involves a lot of extra hardware. The reason for this is the detection of the dependencies, the calculation of the instruction order, and last but not least interrupts and traps³.

In case an instruction causes a trap - for instance, a divide causes a divide-by-zero error or a memory load causes a page fault - special action is necessary. This can consist of an attempt to correct the situation or to notify the user, and is performed by special software. After this, the original program can continue. But in order to continue correctly, the 'state' of the processor has to be saved. Note that 'state' does not mean the program counter and the condition code register only. Consider the following example (figure 5c). Suppose instruction 4 causes a page fault and therefore a trap. The trap handler repairs the situation by swapping a page from disk to memory. Now execution can be restarted. But at which instruction? If fetching and execution were restarted at instruction 4, then instructions 2, 1, and 3 (which have not run to completion yet) would not be executed after the restart. So, the 'state' of a processor which executes out-of-order cannot be described very easily. Special ways to deal with this problem will be described in section 2.4.

Executing out-of-order complicates the forwarding technique described in the previous paragraph. This hints at an important concept of superscalar processors. When the execute stage of the pipeline is separated from the decode stage, the pipeline stalls will still exist, but will not do much harm. The separation can be realized by a buffer of instructions, called an instruction window. This buffering can reduce the need for forwarding. The separation is sometimes referred to as a decoupled architecture or decoupled instruction fetching.

Superscalar processors are processors that have multiple pipelines. So far, none of the things mentioned are specific to superscalar processors. All types of pipeline stalls exist for superscalar processors too, as can be seen in figure 6. Moreover, a new dependency arises: the resource conflict (figure 6d). Two instructions may ask for the same functional unit. If only one is available, one instruction has to wait.

Every pipeline stall is much worse in the case of a superscalar processor. Consider a superscalar processor of degree four (a superscalar processor capable of fetching and executing four instructions each cycle). Every stall

³ Throughout this thesis, traps are considered exceptions generated by instructions, interrupts are hardware exceptions, and exceptions can be either a trap or an interrupt.

[Figure 6. Data dependencies in the case of a superscalar processor: (a) no dependency, (b) data dependency, (c) procedural dependency, (d) resource conflict]

prevents four pipelines from running. Stalling all pipelines for three cycles (in case of a branch) gives 12 useless pipeline cycles. Therefore, a superscalar processor must be designed more carefully and smartly than a normal processor. All data dependencies should be prevented or reduced as much as possible. Otherwise, the huge amount of hardware necessary for a superscalar processor does not pay back in performance increase. The next section reveals some concepts that try to prevent or reduce the data dependencies of superscalar processors.

2.2.4 Preventing and reducing pipeline stalls due to data dependencies

Basically, there exist three types of data dependencies. The first and most obvious one is a write to a register after which another instruction reads the value (figure 7a; changing colors in the figure denote a change of contents). This is called a true data dependency. It is also referred to as a write-read dependency or flow dependency.

The second type is a read from a register after which a new value is written to that register (figure 7b). Because this is the opposite of a true data dependency, it is called an anti dependency. Another name is read-write dependency.

The third situation is a write to a register followed by a write to the same register (figure 7c). This is called an output dependency or write-write dependency. Because the last two situations are only a matter of lack of storage, they are called storage conflicts. If unlimited storage were available, only true data dependencies would be left.

[Figure 7. Different types of data dependencies: (a) true data dependency (write, then read), (b) anti dependency (read, then write), (c) output dependency (write, then write)]
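The three cases of figure 7 can be detected mechanically by comparing destination and source registers of an instruction pair. The sketch below is an added illustration; the tuple encoding of instructions is an assumption:

```python
def classify(first, second):
    """Classify the dependencies of `second` (the later instruction)
    on `first`. Each instruction is (destination, [sources])."""
    d1, sources1 = first
    d2, sources2 = second
    deps = []
    if d1 in sources2:
        deps.append("true")    # write then read  (figure 7a)
    if d2 in sources1:
        deps.append("anti")    # read then write  (figure 7b)
    if d1 == d2:
        deps.append("output")  # write then write (figure 7c)
    return deps

print(classify(("r1", ["r2"]), ("r3", ["r1"])))  # ['true']
print(classify(("r1", ["r2"]), ("r2", ["r4"])))  # ['anti']
print(classify(("r1", ["r2"]), ("r1", ["r3"])))  # ['output']
```

Note that only the last two cases (the storage conflicts) can be removed by adding storage, which is exactly what register renaming does.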

Considering the dependencies and trying to execute out-of-order, there are two possibilities. First, one could detect the dependencies; once they are known, the allowed instruction order changes are known. Scoreboarding is an algorithm that accomplishes this goal, as will be clarified below. Second, one could add storage to prevent or reduce the storage conflicts. An algorithm called register renaming realizes this, as will also be explained below.

Scoreboarding and the dispatch stack Scoreboarding is an algorithm which uses the detection method. When an instruction has been fetched, it is placed in a buffer of instructions called the instruction window. At the same time, a bit is set at the destination register signalling "waiting for a result". This bit is called the scoreboard bit. When a subsequent instruction reads the result of its predecessor, it can use the scoreboard bit to determine whether the result is already available or not. When the result becomes available, the second instruction detects that. Then, the result is read and the scoreboard bit is reset, indicating that there is no result pending on that register any more.

This scheme fails when a just-fetched instruction wants to write to a register with its scoreboard bit set. The dependencies cannot be administrated any more and fetching must be stalled.
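The single-bit bookkeeping and its failure case can be sketched as follows. This is an added illustration of the scheme described in the text, not a model of the full CDC 6600 scoreboard:

```python
class Scoreboard:
    """One 'waiting for a result' bit per register."""
    def __init__(self):
        self.pending = set()   # registers with their scoreboard bit set

    def can_issue(self, dest, sources):
        # All sources must be available, and a second pending write to
        # the same destination cannot be administrated: stall fetching.
        return dest not in self.pending and not self.pending & set(sources)

    def issue(self, dest):
        self.pending.add(dest)       # set the scoreboard bit

    def complete(self, dest):
        self.pending.discard(dest)   # result arrived: reset the bit

sb = Scoreboard()
sb.issue("r3")                      # r3 := r2 + 1 in flight
print(sb.can_issue("r4", ["r3"]))   # False: r3 not available yet
print(sb.can_issue("r3", ["r5"]))   # False: second write to busy r3
sb.complete("r3")
print(sb.can_issue("r4", ["r3"]))   # True
```

The second stall above (a write to a register whose bit is already set) is precisely the limitation that the dispatch stack's counters remove by administrating multiple pending accesses.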

This algorithm has been used since the end of the sixties (!) in the Control Data 6600 [Thor70]. Tomasulo described another architecture in [Toma67], although some aspects of his architecture resemble the scoreboarding principle very much. An improved algorithm, the dispatch stack, was introduced later ([Torn84] and [Acos86]).

The dispatch stack scheduling algorithm is an extension of the scoreboard bits towards a kind of scoreboard counters. The single dependency check is extended to multiple dependency checks: multiple pending reads can be administrated. Note that the original algorithm as introduced by Torng and Acosta included an instruction window design as well. At the department of Digital Information Systems, an implementation of the dispatch stack [Devo93] has been made under IDaSS [Vers90]. IDaSS is an acronym for Interactive Design and Simulation System, a tool to design and test ULSI digital systems. The main problem with the dispatch stack is the large number of counters; it uses a lot of hardware and has a large clock cycle time. To solve this problem, a fast dispatch stack was introduced later [Dwye92]. Note that the instruction window is usually part of the design.

Register renaming Instead of detecting data dependencies, reducing them is also possible. Register renaming allocates a new instance of a register every time an instruction writes to a register. This new instance has a new name, which explains the name of the algorithm. Consider figure 8. Without renaming, there exists a true data dependency between instructions 1 and 2 (r3) and an anti dependency between instructions 1 and 3 (r2). With renaming, the first instance of a register always gets index a. So, the result of instruction 1

and 2 is written to r3a and r4a respectively. Note that the data dependency between instructions 1 and 2 remains after renaming. At the third instruction, an instance of r2 already exists. So, this instruction has to write its result to a renamed version of r2, called r2b. The anti dependency between instructions 1 and 3 no longer exists. Register renaming has similar consequences for output dependencies. So, register renaming removes storage conflicts. The true data dependencies cannot be removed of course, as they are a part of the program.

(a) without renaming:    (b) with renaming:

r3 := r2 + 1             r3a := r2a + 1
r4 := r3 + 7             r4a := r3a + 7
r2 := r6 · 5             r2b := r6a · 5

Figure 8. Register renaming
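The renaming rule of figure 8 can be sketched as a small algorithm. This is an added illustration of the principle; the tuple encoding of instructions and the letter-suffix naming are assumptions mirroring the figure:

```python
import string

def rename(program):
    """Register renaming as in figure 8: every write allocates a new
    instance of its destination register; the value already present
    in a register before any write counts as instance 'a'."""
    count = {}    # instances allocated per architectural register
    current = {}  # architectural register -> newest instance name

    def read(reg):
        if reg not in current:       # first reference ever: the old
            count[reg] = 1           # contents are instance a
            current[reg] = reg + "a"
        return current[reg]

    def write(reg):
        n = count.get(reg, 0)
        count[reg] = n + 1
        current[reg] = reg + string.ascii_lowercase[n]
        return current[reg]

    renamed = []
    for dest, sources, op in program:
        new_sources = [read(s) for s in sources]  # read before write!
        renamed.append((write(dest), new_sources, op))
    return renamed

prog = [("r3", ["r2"], "+ 1"), ("r4", ["r3"], "+ 7"), ("r2", ["r6"], "* 5")]
print(rename(prog))
# [('r3a', ['r2a'], '+ 1'), ('r4a', ['r3a'], '+ 7'), ('r2b', ['r6a'], '* 5')]
```

The true dependency (r3a feeding instruction 2) survives the renaming, while the write to r2 gets the fresh instance r2b, removing the anti dependency, exactly as the text describes.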

No doubt this will cost a lot of hardware. However, register renaming is essential for improving the performance of superscalar processors above a certain point (see section 2.5.1). In addition, the algorithm is very simple and one can hardly imagine that it increases clock cycle time, in contrast to the dispatch stack. The hardware costs are only a small disadvantage, because the dispatch stack also uses a lot of hardware and has less performance potential [John91, section 6.4].

Tomasulo was the first to implement (limited) register renaming, back in 1967 [Toma67]. In the floating point unit of the IBM 360 model 91, local instruction windows with register renaming are called reservation stations. These local windows contain at most three instructions. Register renaming as well as reservation stations have been implemented in very few commercial processors since 1967. Patterson implemented register renaming with his node tables [Patt85]. In 1989, Popescu showed an implementation of a superscalar architecture with register renaming ([Pope89], [Pope91], [Ligh91], [Lair92]). His Metaflow processor was a superscalar version of the SPARC architecture. IBM's RS/6000 architecture adopted much of Tomasulo's algorithms and has similar yet larger reservation stations with the same, limited, register renaming. It is uncertain whether its successor, the Power2 chip, implements a more advanced form of register renaming. However, this seems to be the most superscalar-like microprocessor. It can ideally execute up to 6 instructions each cycle, and its average performance is approximately 1.75 instructions per cycle [Stal94]. In 1993, Cyrix announced its M1 project, a CISC processor based on Intel's 80x86 line with partial register renaming and speculative execution [Ryan94].

Register renaming could also be done by a compiler, especially if there is a large register file. However, static instruction stream analysis is necessary to find the best way to do this. At compile time, an average program input has to be determined to gather these statistics. This makes software register renaming an annoying job. Moreover, the renaming is only optimal for one type of program input. An advantage can be the low production costs, because development of the compiler determines most of the costs. Hardware register renaming is more dynamic and works for every kind of input. It can even work across multiple threads, operating system calls, and interrupt routines. This makes the performance much better, at the cost of more hardware.

From a theoretical point of view, there is an interesting trade-off between a large number of registers with compiler support (software renaming) and a small number of registers with hardware renaming. 'The number of registers' denotes the number of registers in the instruction set, visible to the programmer. The actual number in the case of renaming is of course higher. This trade-off is beyond the scope of this thesis because of the requirement to comply with existing instruction sets.

2.3 Instruction Windows

Introduction Before explaining instruction windows, it may be good to recall the reasons for this buffer of instructions between the fetch/decode stage and the update stage, as mentioned before:
1. A buffer between the decode and the update stage will reduce pipeline stalls caused by a temporary shortage of instructions or a temporary shortage of free execution units. A temporary shortage of instructions can be an effect of fetch limitations or procedural dependencies (branches). A shortage of free execution units means that every execution unit is currently busy with an instruction. This buffer is essential for superscalar processor performance. For normal processors, which execute at most one instruction every cycle, forwarding is probably a better technique because it achieves approximately equal performance with less hardware cost.
2. Buffers can be necessary for precise interrupts. In case of some traps and interrupts, the processor state has to be saved, which is difficult in case of out-of-order execution. As will be explained in section 2.4, an instruction window can be an important aid for an efficient implementation of exception recovery.
3. Buffers can be an important aid to recover from mispredicted branches. Not only exceptions need some kind of state recovery; branch prediction also asks for processor state recovery. If only speculative fetching is used, then recovery can be done easily by clearing the incorrectly fetched instructions. If instructions are speculatively executed, state recovery is essential when a prediction appears to be wrong and therefore false instructions have been executed.

Note that the second and third reasons may or may not be solved by similar implementations. Considering the first reason, a superscalar processor will always have some kind of instruction window. In fact, it is one of the most important units of a superscalar processor. In any case, it is a buffer that contains some kind of predecoded instructions, with or without the accompanying data. The instructions are 'shelved'. A window can be global for all units of the processor, or it can exist for every execution unit.

Now that the final and most important part of a superscalar processor has been introduced, a simplified general superscalar processor can be shown (figure 9). Single lines can represent multiple busses. Certainly, the interface between memory and the fetch & decode unit consists of multiple busses; otherwise it would not be a superscalar processor. Both local and global instruction windows are incorporated, because a design may consist of both types. The integer units can be all the same or different. The same applies to the other types of execution units. The instruction window, which contains the 'shelved' instructions, issues instructions to the execution units. Results may be written to the instruction window or directly to the register file and the state recovery logic.

Although one could say that at least two pipeline stages have been added (receiving the shelved instructions in the window and receiving their results), this is arguable. Actually, instructions are just buffered two times to increase parallelism. This buffering is a completely different kind of pipeline stage than the others.

[Figure 9. Global view of a general superscalar processor: instruction & data memory feeds a fetch and decode unit; a global instruction window and local instruction windows serve the branch, integer, floating point, and load/store units; results go to the register file and the state recovery logic. Note: a single line can contain several buses]

Different instruction windows An instruction window can differ in several ways. Considering out-of-order issuing and execution as well as speculative execution, the most important characteristics are (approximately in order of importance):
1. Local or global window, or a combination of both. Global instruction windows perform better, because a stall of a local window stalls fetching and therefore affects all parts of the processor. However, global windows are hardware intensive because every window position has to fit the most complex situation.
2. Window contains instructions, parts of instructions, or references to them. Usually, a decoded version of the instruction is stored in the window with some additional status information.

3. Window contains source operands, references to them, or neither of them. A reference to a source operand can be a tag (also called label or ID) or the register name itself. Shelving the complete operands costs hardware but makes the dispatching of instructions to the functional units very fast. A tag as reference is necessary in case of register renaming. Neither of them needs to be stored if a window only contains references to the original instruction.
4. Window contains destination operands, references to them, or neither of them. A reference to a destination operand can be a tag or the register name itself. Shelving all operands costs hardware but makes register renaming and speculative execution a lot easier, as will be shown in chapter 3. A tag is essential for register renaming; in fact, this tag is the 'new' register name. Neither of them needs to be stored if a window only contains references to the original instruction.
5. With or without support for speculative fetching and/or execution. A window may or may not contain support for speculative fetching and/or execution. It is most likely that the window contains at least part of the speculative execution control hardware.
6. With or without support for exception recovery. A window may or may not contain support for exception recovery. If it does not, other units of the processor have to take care of it. Even if it does contain support, a separate unit may be necessary for the storage of (parts of) the processor state.
7. Instructions in the window are ordered in program order (FIFO) or have no specific order (RAM). If the instruction order of the window is the same as the program order, scheduling and state recovery are easier. Instructions enter the window at the top and are removed from the bottom (FIFO). In the case of no specific order, instructions can be placed anywhere and can leave from every position (RAM). Note that in the FIFO case, random access reads and writes are necessary too; therefore, the word FIFO may be a little confusing.
8. Contents (shelved instructions) can be moving or non-moving (FIFO only). A window that implements a FIFO buffer by moving the shelved instructions is hard to implement, but does allow for even easier scheduling and state recovery. There are implementation problems, however, because every clock cycle the window contents must be moved and updated at the same time.
9. Compressing or non-compressing (moving FIFO only). This does not apply to non-moving windows. A compressing window fills up the holes by moving parts of the window faster. Holes originate when speculatively shelved instructions are removed before reaching the end of the buffer. Compressing windows are harder to implement, but perform better because they have more place for useful instructions.

These are the most important characteristics of instruction windows. There are more, but they are less important. Hardly any of them has found wide acceptance. Most of the recent commercial processors, like Motorola's MC88110 and Intel's i960MM and i960CA architectures, do not use out-of-order execution or use a very limited version of it. They have small windows that contain only two or three instructions. Scoreboarding is used for checking data dependencies. IBM's RS/6000 line (from which the 'PowerPC' chip has been derived) is an exception: it has out-of-order execution, register renaming (though limited), and local windows of at most six entries per execution unit. Several other architectures have been reported, but are still in the research phase or details have not been published yet. Several examples will be briefly discussed below.

Examples of instruction windows Tomasulo's reservation stations from the IBM 360/91 [Toma67] are local windows without instructions, but with tag and contents of the source operands (more about tags in the next chapter). Because in the instruction set of the IBM 360 the first source operand is also the destination operand, one could say that the reservation stations contain the destination operand. The shelved instructions can enter and leave the reservation stations anywhere. Because the stations are so small (two or three instructions), this meant only a small increase in hardware complexity. Exception recovery has not been implemented in the window. In fact, it has not been implemented at all, causing the machine to crash in case of an exception.

Although the dispatch stack algorithm is mainly a dependency check algorithm, a window description is usually connected with it. In the case of Torng and Acosta's implementations ([Torn84] and [Acos86]), every entry contains a tag (as a reference to the original instruction), the opcode, both sources, and dependency counters. It is a compressing window. Instructions can leave anywhere and enter at the bottom. No exception recovery logic has been implemented in the window.

Sohi's Register Update Unit [Sohi90] uses a global, non-compressing FIFO buffer. Shelved instructions are placed into the Register Update Unit at the top and leave at the bottom. They are not moved; head and tail pointers are updated for shifting the window. Once an uncompleted instruction has reached the bottom, the window is full and fetching is stalled. A window entry contains source contents and tags, as well as destination contents and tag. The destination tag consists of the register name and a counter. When instructions have been executed, their results are placed back into the window. The result of the oldest instruction is copied to the register file when there is certainty that no exception occurred. With this window, register renaming and exception recovery are implemented at the same time (in fact, it is a reorder buffer, see the next section). Because fetching is stalled if an uncompleted instruction reaches the bottom, a 50-entry queue has a relative speedup of only 1.74

(Cray 1 architecture). Speculative execution, and even multi-path speculative execution, is one of Sohi's recommendations.

The Metaflow architecture ([Pope89], [Pope91], [Ligh91], and [Lair92]) implements these recommendations. It has some extensions and some adaptations of the register update unit: it does not contain the source contents, but the tag only. A tag refers to an instruction in the window. Therefore, a register can be renamed as many times as instructions fit in the window; this limit will hardly ever be reached. The window contains the results, and this is used not only for register renaming, but also for exception recovery and speculative execution. Multiple unresolved branches can exist at the same time. The patent application⁴ reveals more details [Pope89]: window contents are moved and the window is probably non-compressing. Although the window can be seen as a global window, there exist a separate floating point window and a branch window. Although exact performance figures were not available back in 1991 (and in 1994 they are still not available), preliminary figures say that 1.4 to 3.0 instructions per cycle (ipc) are executed for the SPEC suite. The designers claim that the average rate is more than two instructions per cycle. Another interesting conclusion reveals that the processor is almost always executing speculatively and always has one unresolved conditional branch. This conclusion would mean that speculative execution is essential for a performance of more than two ipc.

2.4 State recovery because of traps and speculative execution

A superscalar processor predicts the direction of branches and continues its work (fetching, or fetching and execution) after the prediction. Although prediction schemes have a surprisingly good prediction rate (much more than 90% can easily be achieved and more than 97% (!) is possible, [Yeh92a], [Yeh93]), the prediction may not be correct. In such a case, several things have to be 'repaired' in order to produce correct results. As a consequence, some parts of the processor state have to be saved using some kind of save mechanism and, after detection of a false prediction, a restart mechanism must restart the right instruction within the right state. The save mechanism should be transparent; it should not harm normal

⁴ Although a patent has been applied for, it has not been granted. The resemblance to Sohi's register update unit was probably too strong. Another interesting detail about the Metaflow architecture is that it has never resulted in a commercially available IC. According to a Usenet discussion with a Metaflow Inc. employee, this was due to a non-technical issue. In addition, he announced some 'neat refinements'. However, because of the many indirections in this design, a short clock cycle time may be very difficult to achieve [Lair92]. This is more likely to be the case than a 'non-technical issue'.

execution of the processor. The restart mechanism should be as fast as possible.

A similar situation happens in case of a trap. If an instruction causes a trap (for instance a page fault or a divide by zero), some parts of the processor state have to be saved, the trap handler executes, and finally the original program restarts. But at which instruction? In case of superscalar out-of-order execution, several instructions may be executing. Which instruction caused the trap? And which instructions have already completed? Because that is hard to determine, the instruction at the last known correct situation needs to be restarted.

It may be clear that both problems are comparable and could be solved by a similar method. Traps are usually infrequent; a false prediction is far more common. Therefore, state recovery cannot take too much time. Note that hardware complexity in the branch prediction logic could be exchanged for hardware in the state recovery logic.

A lot of state recovery approaches have been reported in the literature. Although they were originally designed for scalar processors, much of this work appears to be useful for superscalar processors as well. Some important designs are covered below.

Checkpoint repair

Checkpoint repair is based on backing up strategically chosen snapshots of the processor state (figure 10). Once repairs are necessary, execution restarts from such a saved state, called a checkpoint. These states consist of the out-of-order state (register file) of the processor. When instructions from before this backup copy complete, their results must also be written to the state already saved. Because incorrectly predicted branches occur far more often than exceptions, and because not every state can be saved, the state is saved at every branch. In case of an exception, execution has to be restarted from the preceding branch checkpoint up to the offending instruction.

Figure 10. Checkpoint repair

History buffer

The history buffer (figure 11) was originally meant for the implementation of precise interrupts in a pipelined scalar processor with out-of-order completion [Smith85]. In this organization, the register file contains the out-of-order state. The history buffer stores the register values of the in-order state when they are superseded by the out-of-order state. Because these old values are saved, this is called a history buffer. The buffer is managed as a LIFO stack. Because restoring a state costs several cycles, this is a slow solution for a superscalar architecture.
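The LIFO management described above can be illustrated with a small Python sketch. This is a simplified software model for illustration only, not the exact hardware of [Smith85]; all names are hypothetical:

```python
# Simplified model of a history buffer: before a register is overwritten,
# its old value is pushed on a stack; recovery pops entries one by one.

class HistoryBuffer:
    def __init__(self, num_regs):
        self.regs = [0] * num_regs   # register file: out-of-order state
        self.stack = []              # (reg, old_value), in program order

    def write(self, reg, value):
        self.stack.append((reg, self.regs[reg]))  # save superseded value
        self.regs[reg] = value

    def recover(self, n_writes):
        # Undo the last n register writes. In hardware this takes several
        # cycles, one pop at a time, which is why this scheme is slow.
        for _ in range(n_writes):
            reg, old = self.stack.pop()
            self.regs[reg] = old

hb = HistoryBuffer(4)
hb.write(1, 10)
hb.write(2, 20)
hb.write(1, 30)   # speculative writes past a mispredicted branch
hb.recover(2)     # undo the last two writes
print(hb.regs)    # [0, 10, 0, 0]
```

The sequential pops in `recover` are the software analogue of the multi-cycle restore mentioned in the text.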

Figure 11. History buffer

Reorder buffer

A reorder buffer [Smith85] uses approximately as much hardware as a history buffer but has a much better performance. The organization is the opposite of the history buffer: the register file contains the in-order state, the buffer contains the out-of-order state (figure 12). It is a FIFO queue. A new instruction allocates an entry at the top and will be removed at the bottom. In between, results are written to the buffer. When a completed instruction arrives at the bottom, and this instruction is no longer speculative and not responsible for a trap, its result is written to the register file. State recovery is fast because it consists of discarding the reorder buffer beyond the point of the exception or the incorrectly predicted branch. This usually takes only one cycle. The main disadvantage is the associative lookup, prioritized on instruction order, which is necessary to deliver the source operands. Note that source operands are fetched from either the reorder buffer or the register file. A reorder buffer can be combined with the instruction window or can be a separate unit. Sohi's register update unit and Popescu's Metaflow architecture are examples of a reorder buffer combined with the instruction window.

Figure 12. Reorder buffer

Reorder buffer with future file

The associative lookup in the reorder buffer is avoided by using a future file ([Smith85], figure 13). The future file keeps the most recent instance of every register. New operands are read from the future file, which prevents a wide associative search through the reorder buffer. In the figure, the operands are fetched from either the in-order register file or the future file, using a select mechanism. Avoiding the select mechanism at the cost of copy-back logic is a faster approach. In that case, the future file contains the most recent instance of a register, which is either part of the in-order state or of the out-of-order state. In case of a trap, the copy-back logic copies the whole contents of the register file to the future file. The speculative register file of the next chapter is in fact a future file.

Figure 13. Reorder buffer with a future file
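The retire and flush behaviour of a reorder buffer can be illustrated with the following Python sketch. It is a simplified software model (the prioritized associative operand lookup and the future file are left out), and all names are hypothetical:

```python
from collections import deque

# Minimal reorder-buffer model: entries retire in order from the bottom;
# a misprediction discards everything allocated after the branch.

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()        # oldest entry at the left (bottom)
        self.regfile = {}             # in-order state

    def allocate(self, tag, dest):
        self.entries.append({'tag': tag, 'dest': dest,
                             'value': None, 'done': False})

    def complete(self, tag, value):
        for e in self.entries:        # lookup by tag (associative in hardware)
            if e['tag'] == tag:
                e['value'], e['done'] = value, True

    def retire(self):
        # Copy completed bottom entries to the in-order register file.
        while self.entries and self.entries[0]['done']:
            e = self.entries.popleft()
            self.regfile[e['dest']] = e['value']

    def flush_after(self, tag):
        # State recovery: discard all entries younger than 'tag'.
        while self.entries and self.entries[-1]['tag'] != tag:
            self.entries.pop()

rob = ReorderBuffer()
rob.allocate(0, 'r1'); rob.allocate(1, 'r2'); rob.allocate(2, 'r1')
rob.complete(0, 5); rob.complete(2, 7)     # out-of-order completion
rob.retire()                               # only tag 0 may retire in order
rob.flush_after(1)                         # tag 2 was speculative: discard
print(rob.regfile)                         # {'r1': 5}
```

Note how `flush_after` only truncates the queue: this is why recovery with a reorder buffer usually takes a single cycle, as stated above.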

Although it is hard to give a general conclusion about heavily implementation-dependent things like state recovery logic, a recent paper shows that checkpointing is the fastest algorithm, a reorder buffer is 4 to 8% slower, and a history buffer is 5 to 9% slower [Butl93].

2.5 Theoretical performance of superscalar processors

2.5.1 Instruction level parallelism

Before trying to find out how many instructions an actual superscalar processor can execute each cycle, it is interesting to know what the theoretical instruction level parallelism is. Once the theoretical instruction level parallelism is known, one can search for suitable hardware while knowing when to stop searching.

The theoretical instruction level parallelism is very high. Up to 188470 (!) instructions per cycle (ipc) can be achieved by an oracle, a machine with unlimited resources (infinitely many functional units), no fetch limitations, perfect branch prediction, perfectly disambiguated memory, and so on [Lam]. Perfectly disambiguated memory means that a processor can determine whether two memory addresses are identical before they have been computed, which is never possible in a real general purpose processor. However, this number was computed for a numerically intensive program from the SPEC benchmark suite (matrix300). The same oracle machine executed a collection of general purpose programs at 158 ipc. Note that Lam used a RISC instruction set. Although not of practical importance, these numbers show that there is enough instruction level parallelism. However, it is very unlikely that a real processor will ever achieve this, because in order to reach this speed, it would have to execute and resolve multiple branches in one clock cycle.

Using more realistic models, a speedup of 4.1 compared to a single pipelined RISC can be achieved by a superscalar processor with a window size of 32, perfect register renaming, a branch prediction rate of 85%, but no fetch limitations (machine 4 in [Smith89a]). With non-ideal fetch units that fetch up to 4 instructions each cycle, this reduces to a speedup of 2.3. As Mike Johnson is one of the authors of this paper, his book gives similar results [John91]. With the concepts in this book, 3.3 ipc is the upper limit for a machine with given latencies and realistic caches. To convert from speedup to ipc, he uses a comparable single pipelined processor that executes 0.61 ipc.

A lot has been written about theoretically and practically achievable parallelism. Unfortunately, very little of it can be compared, because the context is always different. Instruction set, cache size, misprediction rate, functional unit latencies, with or without a compiler dedicated to superscalar processors; everything seems to be different all the time. Furthermore, hardly any information is available about operating system calls and multi-threading. Operating system calls can take up to 50% of the execution time, but typically take only 10% of it [Ande91]. As far as this Master's thesis could reveal, none of the papers about instruction level parallelism incorporate operating system calls. Some incorporate multi-threading by simulating cache flushes. These aspects will probably have no large influence on instruction level parallelism or execution time; however, because they are not part of the modelling, they decrease accuracy and make all measurements too optimistic. Performance measures have to be used with care.

Johnson's book [John91] in particular gives a good but pessimistic survey of many superscalar concepts, their benefits and their costs. Uht gives a good summary of the existing literature and especially of unbalanced multi-path speculative execution [Uht93].

A few conclusions can be made, though, speaking from the perspective of the project description in section 1.2:
• parallelism is bursty
Although explored more thoroughly by Smith and Johnson ([Smith89a] and [John91, fig. 3-8]), Austin showed graphically how bursty parallelism can be [Aust92]: high peaks, strange patterns, and hardly any program the same. The conclusion that can be drawn from this is that superscalar processors require a lot of hardware that operates only a small part of the time, but this is something which cannot be avoided.
• register renaming is necessary
Austin and Sohi report a remarkable speedup (more than 10 for non-numeric programs) for register renaming versus no register renaming using an idealistic machine [Aust92]. Smith and Johnson report a speedup of 4.1 for a realistic machine with register renaming, and a speedup of 2.5 for the same machine without renaming [Smith89a]. Although the large difference has something to do with the way a normal compiler assigns values to the registers (footnote 2 of [Smith89a]), one can conclude that register renaming is necessary to achieve the highest possible (but realistic) speedup.
• memory renaming is probably not worthwhile
Only very few general purpose programs benefit from memory renaming or stack renaming [Aust92]. Before applying this, a lot of other bottlenecks should be removed first.
• branch prediction and speculative execution are necessary
As branches are the most troublesome bottleneck, a good way to deal with them is essential. Speculative execution will raise the instruction level parallelism to above 2 ipc [Lam92]. With a good branch prediction scheme, a realistic instruction fetch rate can be 4.2 ipc, using RISC [Yeh92b]. This is a pessimistic figure, because the misprediction penalties are relatively high (6 cycles).
• multi-path speculative execution is necessary
Multi-path speculative execution (executing both paths of a branch simultaneously) will be essential to achieve 'massive instruction level parallelism' [Uht93]. Uht predicts speedups of factors in the tens with reasonable hardware costs. From Lam it can be estimated that the performance must be between 6.8 (her 'SP' machine) and 13.27 ipc (her 'SP-CD' machine). It is probably closer to 6.8 than to 13.27, because the SP-CD does more than multi-path speculative execution. However, these figures are for an ideal RISC. Less theoretical figures can be obtained from [Pope91]. This Metaflow architecture incorporates full register renaming, a reorder buffer, and speculative execution across multiple unresolved branches (which is called multi-level speculative execution). It achieves approximately 3 ipc using a RISC instruction set. Therefore, multi-path speculative execution should at least perform better than 3 ipc.

• window size has to be at least 64 entries
Austin showed that with windows of about 100 entries, 7 ipc can be achieved on idealistic machines.
• compiler optimizations have to be specific to superscalar processors
A normal compiler optimization like loop unrolling decreases program execution time on a processor [Lam92]. Note that this does not necessarily imply that the instruction level parallelism increases. A superscalar processor can execute more instructions per cycle on less efficient code, but this does not say anything about total execution time. Optimizations are much more complicated on a superscalar processor but may be more worthwhile, because every pipeline stall blocks several pipelines. Compiler optimizations are necessary to aid or reduce hardware [John91].

2.5.2 The real performance limit: basic block size and fetch limits

The distance between two branch instructions is called the basic block size. However, one should take care in which context this term is used. In general purpose code on a CISC, the average basic block size is about 5 instructions: for an 80x86, 20% of all instructions are branch instructions, which means a basic block size of 5 [Henn90]. Using the DLX RISC processor, the basic block size is 8 instructions, but a RISC instruction performs less work per cycle than a CISC instruction, so both results are roughly comparable.
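The figures above follow from the simple assumption that the average basic block size is the reciprocal of the branch fraction, which can be checked directly:

```python
# Assumed relation: average basic block size = 1 / (fraction of branches).
def basic_block_size(branch_fraction):
    return 1.0 / branch_fraction

print(basic_block_size(0.20))   # 80x86: 20% branches  -> 5 instructions
print(basic_block_size(0.125))  # DLX: 12.5% branches  -> 8 instructions
```

The 12.5% branch fraction for DLX is inferred here from the stated block size of 8; [Henn90] only gives the block sizes directly.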

For a superscalar processor this means that every 5 to 8 instructions, an awkward branch has to be taken, which will limit the performance. As register renaming will take care of most of the data dependencies, this small basic block size will be one of the most challenging problems.

Table 1 gives more precise information about branches. If all branches are counted, the basic block size is as displayed in the column labelled 'all'. The average is 23.4 for numerically intensive programs, 7.72 for general purpose programs (which is close to the 8 from [Henn90]), and 6.65 for C++ programs. However, if only taken conditional branches are counted as branches, the basic block size is as displayed in the column labelled 'true'. If branches are always predicted as taken, this true basic block size is a kind of corrected basic block size. It is 13.3 for general purpose programs. So, with a good prediction scheme, a fast branch target buffer and enough hardware available, it must be possible to fetch more than 10 instructions per cycle (using RISC). In order to reach this, a single cycle jump (or zero cycle, depending on the point of view) is necessary. Yeh showed that the single cycle (or zero cycle) jump is possible for jump instructions that are in the instruction cache, by combining the cache and the branch predictor [Yeh93].

Other conclusions can be drawn from the table. First, indirect jumps are very rare, usually less than 1% of all branch instructions, but are common

Table 1. Type and frequency of branches. Except basic block size, all columns are from [Cald94] (traces from a RISC).

Program      Instructions  Branches   90%   %Taken  %CBr    %IJ    %Br   %Call   %Ret   Basic block size
             (x 1e6)       (x 1e6)                                                      all     true
FP:
doduc          1150.00        98.09    175   48.68  81.31   0.01   4.97   6.86   6.86   11.72   29.62
fpppp          4333.00       122.20     51   47.74  86.66   0.00   8.01   2.66   2.66   35.46   85.71
hydro2d        5683.00       357.04     74   73.34  95.84   0.00   1.38   1.39   1.39   15.92   22.65
mdljsp2        3344.00       354.51     14   83.62  95.43   0.00   4.00   0.29   0.29    9.43   11.82
nasa7          6128.00       188.83     55   79.29  81.34   0.41   6.34   5.95   5.95   32.45   50.32
ora            6036.00       453.73     11   53.24  69.85   0.00  10.65   9.75   9.75   13.30   35.77
spice         16148.00      2029.35     38   71.63  91.56   0.16   3.73   2.28   2.28    7.96   12.13
su2cor         4777.00       208.50     26   73.07  76.42   0.71   9.02   6.92   6.92   22.91   41.03
swm256        11037.00       182.44      3   98.42  99.63   0.07   0.15   0.08   0.08   60.50   61.70
tomcatv         900.00        30.21      5   99.28  99.86   0.02   0.05   0.03   0.03   29.79   30.05
wave5          3555.00       202.81     82   61.79  76.68   0.74   5.92   8.33   8.33   17.53   37.00
Average        5735.55       384.34     49   71.83  86.78   0.19   4.93   4.05   4.05   23.36   37.98
Integer:
alvinn         5240.00       476.25      2   97.74  98.30   0.02   0.40   0.64   0.64   11.00   11.45
compress         92.63        12.88     12   68.25  88.51   0.00   7.59   1.95   1.95    7.19   11.90
ear           17005.00      1377.85      6   90.13  61.37   0.05   3.71  17.42  17.46   12.34   22.31
eqntott        1810.00       208.88     14   90.30  93.47   1.70   1.90   0.70   2.24    8.67   10.27
espresso        513.01        87.80    163   61.90  93.25   0.20   1.88   2.29   2.39    5.84   10.12
gcc             143.74        22.96   1612   59.42  78.85   2.86   5.75   6.04   6.49    6.26   13.36
li             1355.06       239.42     52   47.30  63.94   2.24   7.74  12.92  13.13    5.66   18.71
sc             1450.13       303.55     94   66.88  85.96   0.98   2.62   5.18   5.26    4.78    8.31
Average        3451.20       341.20    244   72.74  82.96   1.01   3.95   5.89   6.20    7.72   13.30
C++:
cfront           19.01         3.00    946   53.18  73.45   2.17   6.40   8.72   9.26    6.22   15.93
db++             86.46        15.18     96   56.86  54.43  15.04   2.03   6.77  21.73    5.70   18.41
idl              21.14         4.15     38   46.70  50.00  12.31   7.55   9.07  21.07    5.10   21.84
text            147.83        14.76    256   57.46  75.87   2.80  10.06   5.49   5.78   10.01   22.97
groff            41.52         6.69    372   54.17  66.12   4.80   7.80   8.77  12.51    6.21   17.34
Average          63.19         8.77    342   53.67  63.97   7.42   6.77   7.76  14.07    6.65   19.30
Overall avg    3792.36       291.71    175   68.35  80.75   1.97   4.99   5.44   6.85

90% = number of branches that contribute to 90% of all executed branches
%CBr = percentage of conditional branches (of all branches)
%IJ = percentage of indirect jumps (of all branches)
%Br = percentage of unconditional branches (of all branches)
true basic block size = only taken, conditional branches have been counted as branches

(7.4%) for C++ programs. Because it is hard, if not impossible, to predict indirect jumps, this may become a problem. However, 7.4% of all branch instructions is only 0.88% of all instructions, which is not too much.

2.6 Conclusion

Superscalar processors are one way among others to increase processor performance by exploiting parallelism. A superscalar processor executes multiple instructions concurrently. The instruction order is changed to achieve a better performance. This is called out-of-order execution. A buffer of instructions between the decode stage and the update stage of the pipeline, called an instruction window, is necessary to make the schedule of instructions. In a pipelined processor, there are two cases that stall the pipeline: a data dependency (an instruction uses the result of its predecessor) and a procedural dependency (a branch or jump instruction).

A superscalar processor has yet another possible stall: the resource conflict.

Data dependencies are overcome by detecting them (scoreboarding) or by cancelling the dependencies caused by storage conflicts (register renaming). Register renaming is an algorithm that creates a new instance of a register - so this instance needs a new name - whenever a write to that register occurs. This cancels storage conflicts and preserves true data dependencies, which makes it possible to change the instruction order while ensuring correct program execution.
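As an illustration of this algorithm (not part of the original design), renaming can be sketched in Python; the rename map and the integer instance names below are hypothetical:

```python
# Sketch: register renaming with a map from architectural register to its
# newest instance. Every write creates a fresh instance, so two writes to
# r1 no longer conflict (WAW/WAR removed), while a read of r1 still refers
# to the correct producer (true data dependence preserved).

import itertools

new_name = itertools.count()     # generator of fresh instance names
rename = {}                      # architectural register -> current instance

def read(reg):
    return rename[reg]           # true dependence: newest instance of reg

def write(reg):
    rename[reg] = next(new_name)  # fresh instance cancels the storage conflict
    return rename[reg]

write('r1')            # instance 0 of r1
a = read('r1')
write('r1')            # instance 1 of r1: independent storage
b = read('r1')
print(a, b)            # 0 1 -- the second write did not destroy instance 0
```

Because `a` still names instance 0, an instruction holding it can execute after the second write to r1 without reading the wrong value.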

The effect of procedural dependencies can be reduced by predicting the direction of branches (an accuracy of more than 97% can be achieved), and by speculatively fetching and executing instructions past branches. Of course, a branch target buffer can be used to reduce the penalty due to unconditional jumps, but most branch prediction schemes have an integrated approach for both conditional and unconditional jumps. Because predictions can be incorrect, it must be possible to cancel the effects of speculative execution.

Exception recovery requires special attention. When instructions are executed out of order (that is, not in the original program order) and speculatively, extra hardware is necessary for a restart mechanism. Three methods exist: checkpoint repair, the history buffer and the reorder buffer (with or without a future file). The reorder buffer integrates in a simple way with the speculative execution mechanism. In addition, it supports precise traps. The reorder buffer stores the out-of-order arriving results. When no trap has occurred, results are copied to the register file in order. So, the register file always contains a safe state of the processor.

Important aspects of the instruction window are local versus global window, the contents of the window entries, the order of instructions, and whether instructions are moved or not.

Instruction level parallelism gives estimates of the performance limits of superscalar processors. Although theoretical limits are absurd (up to 200 000 instructions per cycle), practical research shows that more than 3 instructions per cycle (using RISC) is feasible at the moment, with more possible in the future using multi-path speculative execution. Fetch limitations because of the small basic block size (the group of instructions between two jumps) are a severe bottleneck. For this reason, there should be no inherent limit to the number of branch instructions that can be fetched and handled concurrently.

3. Description of US4A

3.1 Introduction

In this chapter, the architectural issues of US4A will be discussed. First (section 3.2), the architectural starting points are explained. The current target parameter settings give an idea of the values of some of the parameters of the scalable architecture. Furthermore, logic design rules, layout considerations, and memory considerations are given.

An overview of US4A is given in section 3.3. Instruction 'packages' containing opcode and operands are constantly moving through the processor. A tag, a unique label for every instruction in the processor, is used for identification of these floating 'packages'. Note that the tag is essential for the register renaming: in fact, it is the new name of the registers. At the base of it all is a buffer of these 'packages' called the instruction window. This is a FIFO buffer containing the 'packages'. For every instruction the following is stored: the opcode, source and destination operands, some dependency information, status information, and the tag.

Section 3.4 is a very detailed description of a non-speculative version of US4A. A lot of examples are used. In addition to the principles, several tricks will be given - often presented as an 'intermezzo' - that reduce the storage needed by the instruction window.

The fetch unit places new instructions into the instruction window; the dispatch unit sends ready instructions to the execution units. After processing, the results arrive at the instruction window. The register file is thereby bypassed. Executed instructions at the end of the buffer are removed when it is safe to do so. In addition, their results are copied to the in-order register file (which can be compared with a conventional register file). This is a register file containing a safe, in-order state of the processor. This state can be used to restart the processor in case of a trap. The speculative register file, a kind of out-of-order register file, is used by the fetch unit to implement the register renaming and to place new instructions into the instruction window.

In section 3.5, state recovery is explained. State recovery can be necessary in case of traps (with the help of the in-order register file) and in case of speculative execution (with the help of a shadow speculative register file). Also, hardware interrupts are described. After that, the extension to a multi-path speculative version of US4A (section 3.6) is not so complex any more: actually, it only consists of using a speculative register file for every path.

Finally, several possible future directions (section 3.7) and a conclusion (section 3.8) can be found.

3.2 Starting-points

3.2.1 Architectural starting-points

Recalling the project description, an advanced universal super scalable superscalar architecture should be made. US4A must fetch at least four instructions each cycle. It must be possible to execute on average more than three instructions each cycle. Looking at the observations of the previous chapter, the following architectural features are necessary:
• out-of-order execution
• global instruction window
• register renaming
• (multi-path) speculative execution
• precise traps

For high performance, the following aspects - among others - must apply:
• The turnaround time, the time between the calculation of a result and the use of that result as a source operand, has to be as short as possible.
• The average branch penalty must be low. This means that the 'damage' caused by incorrectly predicted branches must be cancelled fast, preferably in one cycle, and the branch prediction accuracy must be high.

3.2.2 Current target parameter settings

Although the architecture has to be scalable, a target configuration has been defined. This target configuration consists of:
• a fetch unit capable of fetching 4 instructions each cycle

• 2 load/store units
• 4 ALUs
• quadruple-path speculative execution
• a window size of 64 entries

In case of a high degree of multi-path speculative execution, a higher fetch rate may be worthwhile. Especially in the case of multi-path specu­ lative execution, the memory bandwidth will probably be the bottleneck.

3.2.3 Logic design rules

In order to achieve not only the architectural aims but a realistic implementation as well, a few logic design rules have been used. An important thing to keep in mind is that this is all about finding algorithms that can be executed in one clock cycle. So, algorithms must be simple and effective. The logic design rules are as follows:
• The slowest part of the processor determines the minimum cycle time. As most ALU operations - especially the add - cannot be made any faster, this will determine the cycle time of the scalar processor. Therefore, the cycle time of all units has to be about the same as (or smaller than) an ALU cycle time. Gate depth is the number of gates through which the data flows between two registers. The longest gate depth is called the critical path. Approximately 20 to 30 gates is normal for a 32-bit ALU - which usually consists of one pipeline stage - so this should be the maximum gate depth for every pipeline stage. If all pipeline stages of a processor have about the same gate depth as the ALU, the superscalar processor will have the same clock frequency as its scalar counterparts.
• Three-state buffers that are connected to long busses are slow. These three-states can be counted as a gate depth of 5 or 6 gates. Before switching three-states, prepare everything as much as possible: calculate addresses, prepare data, and so on. Furthermore, the enable and disable signals can be seen as 'gate' inputs and should be available in time.
• Data transport over busses takes a long time. The longer the busses, and the more connectors attached to them, the larger the capacitance, and therefore the slower the bus. Global, chip-wide busses should therefore be avoided. When this is not possible, the long delay must be taken into account. Only simple operations can be performed on data that has to be transported over long busses.

3.2.4 Layout considerations

Although the layout cannot be influenced much at this level, a few considerations can be given:

• Units should be approximately square, as this will result in the shortest internal connections and an easier global layout design.
• Because of the scalability, units should consist of easily repeatable subunits.

Of course, these are rules of thumb. They are process technology dependent, but they are essential for designing complex digital systems such as microprocessors.

Speed is the main goal; power consumption and the number of transistors are of secondary concern. A lot of silicon cannot be avoided with superscalar processors. Performance has its price. But technology improves, and more than 10 million transistors will be available in the near future.

3.2.5 Memory considerations

Dual or even quadruple ported memory would be interesting for a degree-four superscalar processor. This would make it possible to do four independent read or write operations concurrently (figure 14a). However, this makes memory slower and much more expensive, especially if it is used for main memory. Memory speed increases at a much lower rate than processor speed, and main memory is already a large part of total system costs. Therefore, multi-ported memory is an unlikely alternative, even in the future. Furthermore, multi-ported memory is not possible in case the superscalar processor must be compatible with scalar processors.

Figure 14. Multi-ported memory versus interleaved memory

Interleaved memory is more likely. Interleaved memory is divided into banks (see figure 14b). Different banks can be accessed simultaneously, but when data has been stored in the same bank, simultaneous access is impossible. If some of the lower address bits5 are used to determine which bank an address points to, succeeding instructions or data items can be accessed concurrently. These banks can consist of normal, conventional RAM chips. Interleaving is probably a good compromise between performance and costs.
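The bank selection of footnote 5 can be sketched as follows; the 4-bank, 32-bit-word configuration is only an example matching the target degree of the processor:

```python
# Sketch of bank selection for 4-way interleaved, 32-bit-wide memory.
# Address bits [1:0] select the byte within a word, so bits [3:2] select
# the bank; consecutive words then fall in different banks and can be
# accessed concurrently.

NUM_BANKS = 4

def bank_of(address):
    return (address >> 2) % NUM_BANKS  # skip the two byte-select bits

# Four consecutive instruction words land in four different banks:
words = [0x1000 + 4 * i for i in range(4)]
print([bank_of(a) for a in words])  # [0, 1, 2, 3]
```

Two accesses whose addresses differ by `4 * NUM_BANKS` bytes map to the same bank and would conflict, which is exactly the limitation mentioned in the text.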

Caches have similar trade-offs between interleaved and multi-ported RAM. A second-level cache (off-chip) is probably interleaved; multi-ported is too expensive. A first-level cache (on-chip) may be multi-ported, although a banked version seems more realistic. When the first-level cache is banked, it may use more banks or wider words than the second-level cache. The design of the caches is critical, because memory load and store operations are slow and probably a bottleneck.

Due to the different nature of instruction fetches and data loads/stores (with superscalar processors, the differences are even larger than with scalar processors), and because of the high costs, separate instruction and data caches - which use less silicon area - must exist in a superscalar processor.

3.3 Overview of US4A

Based on the previous sections, the following will hold (see figure 15):
• The following is necessary to implement register renaming:
1. The window has to contain the destination operands.
2. The window must contain, for every source operand, a reference to the renamed register. These references are a kind of labels called tags.
• Window entries contain the source operands and the opcode, so dispatching a complete instruction with its data will be much easier and faster.
• The instruction order in the window is the same as the program order. This makes removing groups of succeeding instructions easier, as will be necessary for (multi-path) speculative execution. Furthermore, scheduling will be easier.
• Instructions are moved through the window. The position in the window indicates the age. The schedule logic can prioritize on position, which makes scheduling faster. Priority encoding based on fixed

5 It is unlikely that the lowest bits of an address are used for determination of the bank. The two lowest bits are usually used in 32-bit-wide memory to decide which byte out of four must be accessed.

Figure 15. Overview of the instruction window. The window contains the shelved instructions with their source and destination operands, and implements register renaming, multi-level speculative execution, precise traps, and result forwarding. An execution unit can be an ALU, FPU or LSU.

locations is faster, because no register values need to be read to know the priority. Priority encoding is fundamentally slow, so this operation should be kept simple and fast to prevent it from becoming a bottleneck.
• At the bottom, completed instructions that are no longer speculative and have generated no trap can be removed. The results of these so-called retired instructions are placed in the so-called in-order register file (IRF). This 'copy' of the in-order state of the processor is used to implement precise traps (see also section 3.4.4).

One can look upon the functionality of the window from the point of view of an instruction or from that of the window logic. First, consider instructions. To be more precise, the instruction window contains shelved instructions. These are packages with the opcode, operands, tag6, and some status information. The following steps are performed sequentially on such an instruction package (see figure 16):
1. Fetch the instruction from memory.
2. Check the dependencies and place the instruction into the window.

6 A tag is in fact a reference to an instruction, not to a register. This will be explained later in this section.

[Figure 16. Overview of the processor, showing the path of one instruction: instructions are fetched and decoded (1), placed into the global instruction window (2), dispatched with their source operands to the execution units (3), their results are written back to the window (4), and finally they are retired (5).]

3. When all operands are available, the dispatch unit sends the instruction with its source operands to the execution units.
4. After execution, the result(s) are written back to the instruction window. This will make other instructions ready to be dispatched.
5. The instruction arrives at the bottom. If it didn't generate a trap, the result can be written back to the register file. The instruction can then be removed. This process is called retirement of the instructions and is performed by a unit called the retire unit.

Note that the instruction is constantly moved through the window. It flows gradually towards the bottom. Even when an instruction is executing, its window entry is moving. It is therefore impossible to refer to an instruction by its window position. A separate label called a tag will be used to do this. This tag will be the unique identifier for every instruction. The tag of every instruction will have to be stored in the instruction window.

From the window logic's point of view, the instruction window has to perform several operations concurrently. These operations are:
• Place new instructions into the window.

• Remove the oldest instructions from the window (retire). This must be done in-order and only when the instructions cannot generate a trap any more.
• Write results to all desired locations. A result may be written to more than one location. Therefore, all parts of the window are constantly monitoring all so-called result busses. This process is covered in more detail in section 3.4.2, page 51.
• Move instructions to a next position. The complete window must be moved by as many positions as old instructions have been retired. Parts of the window may be moved faster in order to fill up holes. This is only necessary in case of (multi-path) speculative execution.
• Dispatch instructions to the execution units.
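The interplay of these window operations can be sketched in software. This is a hypothetical simplification for illustration only (the class and field names are invented, and the operations run sequentially here instead of concurrently in hardware), not the actual hardware design:

```python
from dataclasses import dataclass

@dataclass
class Entry:
    tag: int
    dispatched: bool = False
    executed: bool = False

class Window:
    """A moving instruction window: index 0 is the bottom (oldest entry)."""
    def __init__(self, size):
        self.size = size
        self.entries = []                    # ordered bottom .. top

    def place(self, entry):
        """New instructions enter at the top, in program order."""
        if len(self.entries) < self.size:
            self.entries.append(entry)

    def retire(self):
        """Remove the oldest entries, strictly in order, once executed."""
        retired = []
        while self.entries and self.entries[0].executed:
            retired.append(self.entries.pop(0))   # the rest shifts down
        return retired

    def dispatch_ready(self):
        """Position encodes age, so oldest-first priority is a simple scan."""
        for e in self.entries:
            if not e.dispatched:
                e.dispatched = True
                return e
        return None

w = Window(4)
for t in (6, 7, 8):
    w.place(Entry(tag=t))
w.entries[0].executed = True     # instruction 6 finishes first
retired = w.retire()             # only instruction 6 can retire
```

Note how retirement pops only a contiguous executed prefix from the bottom, mirroring the in-order retirement requirement above.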

The drawback of moving instructions is the large amount of logic needed to shift all instructions. A non-moving implementation, with pointers to address the head and tail of the window, is much simpler, yet its schedule logic is more complex.

3.4 Description of a non-speculative window version

Now that an overview of US4A has been given, all details can be revealed in the following sections.

3.4.1 Description of a window entry

The fields of a window entry that are necessary for a basic implementation without speculative execution will be described below (see figure 17). After an explanation of all fields, an example will be given in the next section.

[Figure 17. An entry from the instruction window: a tag field, a field for every source operand, a field for every destination operand, and status fields.]

• tag
Every fetched instruction has a unique tag. This tag is attached to the instruction until the instruction is removed from the window (by retiring or cancelling).

• source operands
Every source operand has a tag field and a contents field. The tag specifies which other instruction from the window generates this source operand. When the source operand has been generated, the data is copied to the contents field. The locked bit denotes whether the contents are already available. In the following examples, a '1' equals true (locked).
• destination operands
The destination operand has a contents field, a register number (reg. no.), a locked bit, and an update bit. The results of an instruction are kept in the contents field until the instruction can be removed from the window (by retiring or cancelling). Note that more destination operands can exist. The update bit is used to indicate whether a destination register will be used by the instruction or not, since not all instructions may use all destination operands. If the update bit is true ('1'), the destination operand has to be updated so that it contains an actual operand; otherwise (update bit is '0'), the destination operand is not in use. The retirement unit also makes use of this bit7.
• dispatched bit
The dispatched bit indicates whether the instruction has already been sent to the execution units. This prevents the dispatch unit from executing instructions twice. Again, a '1' means true (dispatched).
• executed bit
The executed bit is set after all result operands have returned to the instruction window. Only when this bit is set can an instruction be retired.
• class
The class field contains the class of the instruction. This can be integer, floating point, memory load/store, or branch. Of course, more types of execution units can exist.
• opcode
The opcode is saved in the window because it is necessary to instruct the execution units. For instance, it can instruct an ALU to add or to subtract.
• status flags
Status flags, like for instance the carry bit or the overflow bit, have to be saved in the window because they are necessary for state recovery. This topic is covered in more detail in section 3.4.6.
• pc
The program counter is stored in the window in order to restart fetching after an incorrectly predicted branch or a trap.

7 Note that for the examples in this chapter, the update bit is not used, to keep the examples simple.

• valid
A valid bit indicates whether the window entry contains an instruction or not. Because the window may not be full all the time, this bit is necessary to indicate whether an entry contains a valid instruction.
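The fields above can be summarized as a data structure. The sketch below is only an illustrative model (the names, types and the two-source/one-destination shape are assumptions taken from the running example, not a definitive layout):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SourceOperand:
    locked: bool                 # True: waiting on a result (tag is relevant)
    contents: int                # operand value once available
    tag: Optional[int] = None    # producer's tag while locked

@dataclass
class DestOperand:
    reg_no: int                  # architectural register to update at retirement
    update: bool                 # does the instruction actually write this register?
    locked: bool = True          # result not yet produced
    contents: int = 0

@dataclass
class WindowEntry:
    valid: bool
    tag: int
    opcode: str
    klass: str                   # 'int', 'fp', 'load/store' or 'branch'
    pc: int
    sources: List[SourceOperand]
    dests: List[DestOperand] = field(default_factory=list)
    dispatched: bool = False
    executed: bool = False

    def ready(self):
        """Dispatchable when valid, not yet dispatched, and no source locked."""
        return self.valid and not self.dispatched \
            and not any(s.locked for s in self.sources)

# instruction 6 from the example: r3 := r2 + 73, both sources available
e = WindowEntry(valid=True, tag=6, opcode='add', klass='int', pc=0x100,
                sources=[SourceOperand(False, 16), SourceOperand(False, 73)],
                dests=[DestOperand(reg_no=3, update=True)])
```

The `ready` check is exactly the dispatch condition described in the next section: all source locked bits zero.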

3.4.2 Example of a simplified window

In figure 18, a part of a filled instruction window can be seen. An instruc­ tion set with two source operands and one destination operand has been used as an example instruction set. Of course, other numbers of source or destination operands are possible. The operand field is symbolized by the instruction written at the left. 'An instruction with tag 6' is abbreviated as 'instruction 6', to make it more readable.

[Figure 18. An example of a simple window (i); some details are omitted for clarity. Shown, from old (bottom) to young (top): instruction 6 'r3 := r2 + 73' and instruction 7 'r4 := r3 + 41'. [r2] means the current contents of register 2; 0 = false, 1 = true.]

Instruction 6 has a left operand and a constant right operand (73). The constant right operand is directly placed into the source operand 2 contents field. The accompanying locked bit is set to '0', indicating that this operand is already available. Its tag field is not used. Suppose the contents of r2 are already available, so the contents of r2 can also directly be placed in the contents field. Note that [r2] symbolizes the contents of r2. It does not matter to the window where the contents came from. They are stored, that's all. Because instruction 6 updates register 3, this register number is written in the destination register number field. This will be used by the retirement unit. The dispatched bit and the executed bit are both '0' to notify that the instruction has neither been dispatched nor executed yet. However, because all source operands are known, this instruction is ready to be executed. The dispatch unit can identify this by looking at the locked bits of the source operands. They are all zero, so this instruction can be dispatched.

Instruction 7 is not ready for dispatching because it is waiting on the result of its predecessor. This true data dependency has been symbolized by the arrow. To indicate this in the window, tag 6 has been written into the tag field of source operand 1. Furthermore, the locked bit is '1', notifying that this operand is locked. This reference-by-tag principle is in fact the implementation of register renaming. Instead of referring to a register number, a tag to an instruction is used as a new register name. Whenever the result of instruction 6 arrives, it will be copied to the destination operand of instruction 6 and the left source operand of instruction 7. This will make instruction 7 ready to be dispatched. More about this later in this section. As with the previous instruction, the right operand is a constant and can be placed directly into the window. Its locked bit is set to '0'.

Instruction 8 has only one destination operand. Its destination contents will be 24, and this result must be copied to register 1 after the instruction has retired. It seems to be possible to copy 24 directly to the destination contents field and to mark the instruction as executed. However, in order to set the flag register, it has to go through the ALU8. Therefore, 24 is seen as a known (so unlocked) first source operand and the instruction is marked as neither dispatched nor executed (dispatched and executed bit zero).

Result bus monitoring
When instruction 6 returns from one of the ALUs, tag 6, accompanied by its result (which can be, for instance, 89), is written on one of the result busses. Logic that is part of the window entries is constantly monitoring these busses. When entries find their tag on one of the tag busses, they read the data (the value 89) from the accompanying bus. So, only those registers that are waiting on the result of instruction 6 load the data. Now, the situation is as displayed in figure 19. The grey boxes indicate changed values. The value 89 has been copied to both the destination operand of instruction 6 and the left source operand of instruction 7. The true data dependency has been resolved. Note that both source and destination operands monitor these result busses. This process can be compared to forwarding data in a pipelined processor. Here, it is called more specifically result forwarding.

8 This is a clear example where a superscalar microprocessor suffers from its scalar instruction set. A special instruction set for superscalar microprocessors could support instructions that leave the flag register in the same state. This would make it possible to copy the 'result' of instruction 8 directly to the window, preventing it from executing and thereby unnecessarily occupying resources, and immediately unlocking possible data dependencies with following instructions.

[Figure 19. An example of a simple window (ii). Instruction 6 has just returned from the ALU: the value 89 has been copied to both its destination operand and source operand 1 of instruction 7. The grey boxes indicate changed values. The window also holds younger instructions, among them instruction 8 'r1 := 24' and instruction 11 'r3 := r5 - 33'. [r2] means the current contents of register 2; 0 = false, 1 = true.]

Register renaming
The reference-by-tag principle is (part of) the implementation of register renaming. Instead of reading registers, instructions wait for results of other instructions. The references are unique tags. These tags refer to one of the instructions in the window. Such a tag is in fact the new name of the register; its source or destination operand contents field in the window is the extra storage required for the register with the new name. This makes it possible to execute instructions out-of-order and to remove storage conflicts.

The true data dependency between instruction 6 and 7 has been resolved. There is no other order in which those two instructions can be executed. Instruction 7 can only be executed if all operands are ready. All operands are ready when the result of instruction 6 is known, so when instruction 6 has been executed. Hence, this is the only possible order.

Storage conflicts are cancelled in the following way. Suppose for instance that, for some reason, instruction 11 is ready to be dispatched earlier than instruction 6. Instruction 11 writes to r3 and instruction 7 reads an early version of r3, which is a storage conflict. Instruction 7 refers to a result with tag 6. However, instruction 11 has created its own version of r3 and refers to this version with tag 11. Instruction 11 renamed r3 to 'tag 11', so it can write to that r3 (with name 11) before instruction 7 reads r3 (with name 6). Both versions of r3 do not interfere with each other. Hence, instructions can be executed out-of-order, while correct program execution is guaranteed because true data dependencies remain. Storage conflicts are removed in order to give a better instruction-level parallelism, so a better performance potential.

3.4.3 Intermezzo: Reducing storage in the instruction window by sharing tag and register value storage

A window entry is very wide, approximately 114 to 204 bits (see section 4.8). A few simple techniques have been used to reduce the width. Tag and register contents can share their storage. Source and destination operands can do the same, as will be explained in section 3.4.5. Both reductions make use of the fact that not all storage is necessary at the same time: it is a kind of time-shared storage.

Observing both examples (figures 18 and 19), a source operand is either locked or unlocked. In case it is unlocked, the register contents are known and available in the contents field; the tag field is not necessary any more. In case of a locked operand, the contents are unknown and only the tag is important. The examples show this phenomenon: every time the locked bit is 1, the tag field contains a value; every time the locked bit is 0, the contents field contains a value. What is more obvious than having the tag and register contents field share their storage?

Using this sharing concept, a contents field contains a tag if the locked bit is 1 (locked), and contains a register value if the locked bit is 0 (unlocked). A register is usually wider than a tag. Even in case of only 16-bit registers, a tag that supports 2^16 window entries seems enough. There is room for either the tag or the register value, so not all bits are used in case a contents field contains a tag. Of course, the whole concept of this chapter remains the same.
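The time-shared field can be modelled as one storage location whose interpretation depends on the locked bit. A minimal sketch, with invented names, purely to illustrate the sharing concept:

```python
class SharedField:
    """One storage field time-shared between a tag and a register value.
    While locked, `contents` is interpreted as a tag; once the result
    arrives, the same bits hold the register value."""
    def __init__(self, tag):
        self.locked = True
        self.contents = tag              # interpreted as a tag

    def resolve(self, bus_tag, bus_value):
        """Result bus monitoring: on a tag match, the value overwrites
        the tag in the very same field."""
        if self.locked and self.contents == bus_tag:
            self.contents = bus_value
            self.locked = False

    @property
    def value(self):
        assert not self.locked, "contents still holds a tag"
        return self.contents

f = SharedField(tag=6)        # waiting on instruction 6
f.resolve(bus_tag=6, bus_value=89)
```

Only one field of storage is needed, exactly because the tag and the value are never needed at the same time.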

3.4.4 Implementing data dependency checks and register renaming: the speculative register file

Knowing how the dependency information is stored in the window, the question remains how to calculate this. In the example, a just fetched instruction delivers zero, one or two source operands. For a source operand, three situations are possible.

1. The most recent instance of the source register is in the in-order register file.
2. The most recent instance of the source register is in the window, but the value has not been calculated yet.
3. The most recent instance of the source register is in the window, and the result is known.

If situation 1 or 3 holds, the value of the source operand is known and must be copied to the source operand contents field of the window entry. If situation 2 holds, the tag of the instruction that produces the desired result must be placed in the source operand tag field of the window entry. This is exactly what the speculative register file is going to do.
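The three situations can be sketched as a single lookup. This is an illustrative model (the dictionaries and the IRF fallback are assumptions; in the actual design the SRF holds an entry for every register, as explained below):

```python
def lookup_source(reg, srf, irf):
    """Return (locked, contents) for a source register, covering the
    three situations in the text:
      1. newest instance in the in-order register file -> (False, value)
      2. newest instance in the window, not computed   -> (True, tag)
      3. newest instance in the window, computed       -> (False, value)
    The SRF collapses the search into one direct-addressed read."""
    if reg in srf:                      # situations 2 and 3
        locked, contents = srf[reg]
        return locked, contents        # tag if locked, value otherwise
    return False, irf[reg]             # situation 1

irf = {'r2': 16}
srf = {'r3': (True, 6),     # tag id6: still being computed (situation 2)
       'r5': (False, 1994)} # already computed in the window (situation 3)
```

Whatever the situation, the fetched instruction receives either a value or a tag, which is all the window entry needs.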

Determining the situation would require a prioritized associative search through the instruction window. This is both slow and hardware intensive, and has to be repeated for all operands each cycle. In the instruction set of the example, this means up to 8 operands each cycle. Furthermore, it would make multi-path speculative execution more difficult. Therefore, a future file (see [Smit85] and section 2.4) will be used to solve most of the problems (not all, because that would cost even more hardware). Speculative register file is a better name, because the file has more to do with 'speculative' than with 'future' and because it will be an important aid for (multi-path) speculative execution9.

[Figure 20. Overview of the processor with the speculative register file (SRF). The SRF is placed between the fetch and decode unit and the global instruction window; the window dispatches instructions with their source operands to the execution units and receives their results on the result and tag busses, while retired results go to the in-order register file.]

The speculative register file (SRF) is the speculative version of the in-order register file (figure 20). The SRF contains the most recent instance of a register, or a reference (tag) to the instruction that will calculate the most recent value. New instructions will use the SRF to fetch the operands or the tags. The in-order register file (IRF) will only be necessary

for state recovery. It has turned from the one and only operand supplier into part of the state recovery logic. Therefore, the SRF is a unit that is placed between the fetch unit and the instruction window. The dotted line between the IRF and the SRF symbolizes the busses necessary for state recovery (see section 3.5).

9 As will be shown in section 3.6, in case of speculative execution, a shadow SRF will be used for saving the state of the registers when the processor starts executing speculatively. If the prediction appears to be incorrect, the (results of the) incorrectly executed instructions can be cancelled by clearing the speculative instructions in the window and by restoring the shadow SRF to the main SRF. This can be done in one cycle. In addition, this principle can be extended to multiple paths.

Because the SRF has to supply the contents of the source operand, results have to be written to the instruction window and the SRF. An SRF contains an entry for every register (figure 21). An entry contains a contents field and a locked bit. If the locked bit is one, the contents field contains a tag, referring to the instruction that will generate the most recent instance of the register. If the locked bit is zero, it contains the most recent register value.

The SRF uses result bus monitoring to load a result. It works the same as in the instruction window. Whenever a tag in the SRF matches a tag on the result bus, its result will be loaded. However, a tag in the SRF can be overwritten by a newer tag, whereas a tag in the window remains the same. Therefore, a result of an instruction will always return to the window, but may not return to the SRF.

Figure 21. Several successive states of the speculative register file (i)

        (a) after fetching    (b) after fetching    (c) after fetching
            instruction 6         instruction 7         instruction 8
        L  contents           L  contents           L  contents
  r1    0  1994               0  1994               1  id8
  r2    0  16                 0  16                 0  16
  r3    1  id6                1  id6                1  id6
  r4    0  1994               1  id7                1  id7
  r5    0  1994               0  1994               0  1994

  fetched instr.:  id6: r3 := r2 + 73   id7: r4 := r3 + 41   id8: r1 := 24

An example is a better way to explain this. The previous example will be continued. To make it more readable, a tag is written with 'id' in front of it. The accompanying changes in the window are displayed below the SRF (figure 21). Note that the SRF is an aid to fill the instruction window. Therefore, both SRF entries and instruction window entries are displayed in the figure.

Suppose all registers contain some value, say 1994 (which happens to be the year in which this thesis has been written), except for r2, which contains 16. Instruction 6 (figure 21a) has two source operands. The first source operand (r2) is available in the SRF and can be copied directly to the instruction window. The second source operand is a constant and can also directly be copied to the window. Because this instruction writes to

r3, its result becomes the newest instance of r3. Therefore, tag 6 is written to the contents field of r3 in the SRF. The locked bit is set to 1, indicating that the contents field contains a tag.

In figure 21b, instruction 7 has r3 as one of its source operands. From reading the r3 field in the SRF, it can be seen that tag 6 writes to the most recent version of r3. So, tag 6 is copied to the source operand contents field of the instruction window and the locked bit is set (however, for clarity, the locked bits of the operands in the instruction window are not displayed in the figure). The second source operand is a constant and will be copied directly to the instruction window.

Instruction 8 only has one constant result value (figure 21c). As explained in the previous section, this cannot be seen as a result operand. It has to go through the ALU in order to set the status flags. Only after that can the value '24' update the r1 field of the SRF. Therefore, the constant value '24' is just a simple left operand. The r1 field in the SRF gets tag 8. Its locked bit is set to '1', indicating that the contents field contains a tag.

Fetching an instruction happens in-order, but dispatching instructions is done out-of-order and results may return out-of-order. The next example will show that program execution will still be correct.

Figure 22. Several successive states of the speculative register file (ii)

        (a) after fetching    (b) after arrival of    (c) after arrival of
            instruction 11        result with id11        result with id6
        L  contents           L  contents             L  contents
  r1    0  24                 0  24                   0  24
  r2    0  16                 0  16                   0  16
  r3    1  id11               0  66                   0  66
  r4    1  id7                1  id7                  1  id7
  r5    0  99                 0  99                   0  99

  instr. or result:  id11: r3 := r5 - 33   id11: result = 66   id6: result = 89

Consider the previous example a few cycles later. Instruction 11 has been fetched (figure 22a). Because it writes to r3, the r3 field in the SRF has been replaced by tag 11.

Next, suppose that after some time, instruction 11 returns its result ('66') before instruction 6 (figure 22b). This result is on one of the result busses; its tag is on the accompanying tag bus. Because the tag on the tag bus is the same as in the SRF (in the contents field of r3, figure 22a), the result is loaded into the SRF (figure 22b).

Whenever a previous instance of r3, for instance the result from instruction 6, has been calculated and appears on one of the result busses, this value will not be loaded into the SRF (figure 22c); tag 6 will be compared with all tags, but it matches no tag in the SRF. The result is ignored. In the window, however, the result is written to the destination field of instruction 6 and the source operand field of instruction 7. As can be seen, the out-of-order returning of results is no problem. The SRF contains (references to) the newest instance of a register; the window contains (references to) the desired instance of a register. The SRF approach fully implements register renaming. It cancels the storage conflicts and it keeps the true data dependencies, thereby supporting correct out-of-order execution (see also the paragraph 'Register renaming' in the previous section).

Summarizing, the SRF, which is placed between the fetch unit and the instruction window (see figure 20 in the beginning of this section), performs the data dependency checking by doing the following three things simultaneously:
• New instructions receive a tag or value from the SRF for every source operand.
• One of the SRF fields is updated with the tag of the new instruction. Of course, in case of an instruction set with more destination registers, more SRF fields have to be updated.
• When a result returns, an SRF entry loads that result if the tag on the result bus matches a tag in the SRF.
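These three SRF operations can be sketched together, replaying the example of figures 21 and 22. A hypothetical software model (class and method names are invented for illustration):

```python
class SRF:
    """Speculative register file: per register, either the newest value
    (locked = False) or the tag of the instruction that will produce it."""
    def __init__(self, initial):
        self.regs = {r: [False, v] for r, v in initial.items()}

    def read(self, reg):
        """Supply a new instruction with a tag or a value for a source."""
        locked, contents = self.regs[reg]
        return locked, contents

    def rename(self, reg, tag):
        """Destination of a new instruction: direct addressing, no search."""
        self.regs[reg] = [True, tag]

    def monitor(self, bus_tag, bus_value):
        """Load a result only if its tag is still the newest for some
        register; an overwritten (older) tag matches nothing and the
        result is ignored by the SRF."""
        for entry in self.regs.values():
            if entry[0] and entry[1] == bus_tag:
                entry[0], entry[1] = False, bus_value

srf = SRF({'r3': 1994, 'r5': 99})
srf.rename('r3', 6)       # id6: r3 := r2 + 73
srf.rename('r3', 11)      # id11: r3 := r5 - 33 (overwrites tag 6)
srf.monitor(11, 66)       # id11 returns first: loaded into the SRF
srf.monitor(6, 89)        # id6 returns later: ignored by the SRF
```

After both results return, the SRF holds 66 for r3: out-of-order result arrival leaves the newest instance intact, as the example above showed.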

The SRF approach is fast, because the dependency analysis is only a matter of copying the contents of the SRF to the instruction window. Updating the SRF with the new instructions is done by direct addressing, so it requires no slow associative searches. Updating the SRF with results from the result busses does require associative searches, but these can be done in a similar way as in the instruction window itself, which means they cannot limit the speed of the processor. Furthermore, the SRF approach is easily expandable in case of multi-path speculative execution. As will be shown in the next section, this can be achieved by introducing several SRFs. A disadvantage of the SRF is the amount of hardware. It contains information that can be derived from information stored in the instruction window. However, the SRF makes the instruction window simpler, because the fetch and decode unit does not require read access to the instruction window. Furthermore, by avoiding the associative searches through the window for dispatching instructions with their source operands, long windows of more than 128 entries will be possible without affecting clock cycle time.

3.4.5 Intermezzo: Reducing storage in the instruction window by sharing source and destination operand storage

The SRF contains the most recent contents of a register, or a reference to the instruction that generates the most recent value. New instructions do not read the instruction window; they read the SRF instead. Therefore, storage for source and destination operands can be shared, using the same principle as with tag and contents values (see section 3.4.3). Storage for source operands is necessary before the instruction has been dispatched. Storage for destination operands is necessary after the instruction has been dispatched. After the instruction has been executed, the destination operands are still necessary for retirement. However, the source operands are not used any more, because the speculative register file preserves a copy of the most recent value of every register.

In almost all instruction sets, there are more source operands than destination operands. Therefore, the locked bit and the tag/value fields of source and destination operands can be shared. However, a destination operand contains an extra field, the register number field. This field cannot be shared. Nevertheless, there is a large reduction in storage. Moreover, some comparators are shared as well, although extra (yet simple) multiplexers have to be added to choose between source and destination contents.

3.4.6 Dealing with the flag register

So far, the flag register has not been touched. Flags like the zero, carry, overflow or underflow bit, as well as several processor mode bits, are usually gathered in a special register. This condition code register or flag register is used by many instructions. An execution unit will frequently read and write (part of) the flags. In order to allow out-of-order execution, the same renaming principle as for the registers can be used for the flag register. Implementing the dependency checks can be done with the help of an SRF, as with the 'normal' register dependency checks. However, because of the special status of the flag register, additional actions are necessary and certain optimizations are possible. As with normal registers, a source flag register and a destination flag register are necessary for every window entry (figure 23).

[Figure 23. A window entry extended with flag fields: tag, source operand, source flags, destination operand, destination flags, and status.]

In figure 24, the contents of an instruction window that incorporates the flags are displayed. An add-with-carry is indicated as '+c'. To keep it simple, only the carry bit of all status flags is displayed.

Instruction 6, the instruction with tag 6, was fetched least recently: it is the oldest instruction shown. Both source operands are known and stored in the window. The instruction has been dispatched but has not been executed. Because it does not use the carry flag as input, the source flag register is unlocked (locked bit is 0). As the source flag register doesn't matter, its contents are indicated by an 'x'. The instruction has not been executed yet, so the destination operand is locked and the destination flags are locked.

[Figure 24. Example of an instruction window that supports the flag register. Shown, from old (bottom) to young (top): instruction 6 'r1 := r2 +c 33', instruction 7 'r4 := r3 +c 27', and instruction 8 'JMP_C 456'. Each entry holds the tag, the source operands, the source carry, the destination operand, the destination carry, and status fields (dispatched, executed, class, pc). 0 = false, 1 = true.]

Instruction 7, the next instruction, performs an add-with-carry. Because its predecessor is the only candidate that can set the carry, the source flag field contains the reference to this predecessor. When the results and flags of instruction 6 arrive on the result busses, the flags are copied to the source flags of instruction 7 and its locked bit can be changed from 1 into 0. Because there is a data dependency with instruction 4, as indicated by the source operand 1 field, the instruction must still wait for instruction 4 to finish.

Instruction 8, the jump-on-carry instruction, has the displacement as its source operand. Its source flags are waiting on the preceding instruction; therefore, it contains tag 7. Its destination tag will not be set, but because of the exception recovery, the existing flags will be saved in the destination flag register.

Extensions to the SRF
The SRF must be extended in order to implement the register renaming and dependency checks of the flag register. This can be done in a similar way as for a normal register. However, dividing the flag register into several parts (for instance integer, floating point, miscellaneous) will reduce the dependencies. A floating point instruction does not have to wait for an integer instruction. A locked bit should be available for every part. Furthermore, the same concept of sharing tag and value storage space is used as with the normal registers.

[Figure 25. The divided flag register in the SRF: logic flags, int flags, fp flags, and misc. flags, each with its own locked bit and shared tag/contents field.]
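The effect of dividing the flag register can be sketched directly. The part names below follow figure 25, but the dictionary model and function names are invented for illustration:

```python
# The flag register is split into independent parts, each with its own
# locked bit and shared tag/contents field, so a floating-point
# instruction need not wait on an integer instruction's flags.
flag_srf = {part: {'locked': False, 'contents': 0}
            for part in ('logic', 'int', 'fp', 'misc')}

def rename_flags(parts, tag):
    """A new instruction that writes some flag parts renames only those
    parts; all other parts keep their current tag or value."""
    for p in parts:
        flag_srf[p] = {'locked': True, 'contents': tag}

rename_flags(['int'], tag=6)   # id6 will produce the integer flags

# An fp instruction reading the fp flags is NOT blocked by id6:
fp_ready = not flag_srf['fp']['locked']
```

With an undivided flag register, any flag-writing instruction would lock the whole register and serialize the fp instruction behind it.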

Extension to the instruction window
The same trick of dividing the flag register could be used for an instruction window entry. However, for certain instructions, the complete flag register is necessary. Instructions that save the flag register on the stack will need to have the complete flag register in the instruction window. Nevertheless, in order to make the window entry logic straightforward, it is better to divide the source flag register in the window too.

Every part of the destination flag register needs, just as the destination operand field, an update bit. This update bit indicates whether the instruction modifies this part of the flag register or not (see also section 3.4.1, under 'destination operands').

Intermezzo: sharing storage for the source and destination flag register
Besides dividing the flag register in the instruction window, tag storage and flag contents storage can be shared. This has already been used in the examples of this section. Furthermore, the source and destination flag register storage can be shared, just as the source and destination operands do (see section 3.4.5). Until dispatching, the flag register is a source flag register; after dispatching, it is a destination flag register.

Extensions to the result busses
In case the instruction set contains only one destination register, the extension to the result busses to support the flags is clear: every result bus gets an accompanying flag bus. This is as wide as one flag part, usually one fourth of the flag register: 8 bits. Because every execution unit has its own result bus, it also gets its own flag bus. A valid bit may be necessary to indicate whether the flag bus contains valid data or not. Now, a result bus consists of a tag bus, a contents bus, and a flag bus.

In case the instruction set contains more destination registers, other approaches may be necessary. When an assembly-line principle (see section 3.7.1) can be applied, parts of the result can arrive at different times. Then it depends on the instruction set how the result flag busses need to be organized. When an assembly-line principle cannot be applied, all results of an instruction arrive at the same time and the organization of the result busses can be the same as with one destination register. Then, a result bus consists of one tag bus, several contents busses, and one flag bus.

3.5 State recovery

3.5.1 State recovery due to traps

State recovery for traps does not have to be fast. First of all, traps do not occur very frequently in computer systems. They are used for dealing with special, erroneous situations such as page faults and overflows. In such cases either the performance is already low (page fault), or the fact that severe errors (overflow, divide by zero) exist is more important than the performance.

Trap recovery has to be precise. In case of a page fault, pages are swapped between disk and memory as a kind of repair action. Then the trapping instruction, and precisely that instruction, has to be executed again. Imprecise or relaxed trap handling will not guarantee correct execution, and that is unacceptable. Furthermore, to be sure about a trap, it must be certain that every preceding instruction has been executed and has not caused a trap. This is because a preceding instruction could generate a trap and prevent the trap currently under investigation. Of course, these considerations only apply in case of out-of-order execution.

When the execution of an instruction causes a trap, the following actions should be done (figure 26):
1. The instruction window entry of the instruction must be marked as 'trapped'.
2. All preceding instructions have to be executed and retired. In case of speculative execution, it may happen that preceding branches turn out to be mispredicted. This could cause instructions, including the trapped instruction, to be cancelled. Therefore, state recovery cannot start until retirement of all preceding instructions. In addition, it may be sensible to continue fetching for the paths that did not cause the trap.

3. When the trapped instruction arrives at the bottom of the instruction window, no preceding instruction generated a trap and no preceding branch was mispredicted. At that time,
a. all following instructions are marked invalid (removed),
b. the in-order register file (IRF), containing the safe, in-order version of all registers and status flags, is copied to the speculative register file (SRF),
c. the required trap handler must be fetched and executed.


Figure 26. State recovery actions in case an instruction causes a trap.
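The retirement-time actions of step 3 could be sketched as follows. This is a hypothetical Python model, not from the thesis; `window`, `irf`, and `srf` are illustrative names, and the window is assumed to be ordered oldest-first with entry 0 at the bottom:

```python
# Hypothetical sketch of the three trap-recovery steps: the trapped entry
# has reached the bottom of the window, so all following instructions are
# invalidated and the safe in-order state replaces the speculative state.
def handle_trap_at_retirement(window, irf, srf):
    """window: list of entry dicts, oldest first; irf/srf: register dicts.
    Returns the trapping instruction's PC so the caller can fetch the handler."""
    if not window or not window[0].get("trapped"):
        return None
    trapped = window[0]
    # 3a. all following instructions are marked invalid (removed)
    for entry in window[1:]:
        entry["valid"] = False
    # 3b. copy the safe in-order state over the speculative state
    srf.clear()
    srf.update(irf)
    # 3c. the caller fetches the trap handler next
    return trapped["pc"]
```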

It may be possible to start (pre)fetching the trap handler, which is just another program itself, before every preceding instruction has been executed and retired. However, as the SRF is not up-to-date, the dependency checks cannot be carried out, so those instructions cannot be placed into the window.

3.5.2 State recovery due to speculative execution

Speculative execution actually means speculative fetching and execution. State recovery for this must be fast. Branches are very frequent, say 20 % of all instructions for CISC and 10 - 15 % for (general purpose) RISC (see section 2.5.2). Only conditional branches contribute to speculative execution, but this is still 7.5 % of all instructions (general purpose RISC).

Suppose state recovery would take 5 cycles in a superscalar processor of degree 4, and it takes one cycle to execute an instruction. Every 13 (= 1/7.5 %) instructions there would be a loss of 5 cycles, that is 4 x 5 = 20 instruction slots! This would cause a severe performance reduction. State recovery should therefore be done as fast as possible, for example within one or two cycles.
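The cost estimate above can be checked with a small calculation. The 7.5 % conditional branch rate, 5 recovery cycles, and degree 4 are the figures from the text; the function name is an assumption, and the model deliberately takes the text's worst case in which every conditional branch is mispredicted:

```python
# Quick check of the worst-case misprediction cost estimated above.
def misprediction_overhead(cond_branch_rate, recovery_cycles, degree):
    """Instruction slots lost per executed instruction due to state
    recovery alone, assuming every conditional branch is mispredicted."""
    lost_slots = recovery_cycles * degree      # 5 cycles x degree 4 = 20 slots
    return lost_slots * cond_branch_rate       # one branch per ~13 instructions

overhead = misprediction_overhead(0.075, 5, 4)
```

With these numbers the overhead is 1.5 lost instruction slots per executed instruction, which illustrates why recovery within one or two cycles is essential.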

The whole process of branch prediction and speculative execution is as follows (figure 27):

1. In case of a conditional branch, predict the direction. The prediction should be done as early as possible.
2. a. Fetch instructions in the predicted direction. Mark them as speculative.
   b. Copy the contents of the main SRF to the shadow SRF.
3. Decode these instructions and place them into the instruction window, marked as speculative.
4. Execute the instructions.
5. a. If the branch turns out to be incorrectly predicted, remove all speculative instructions newer than the branch itself.
   b. Restore the SRF state from before the predicted branch by copying the shadow SRF to the main SRF.


Figure 27. State recovery actions in case of speculative execution. Note that removing instructions is the same as marking them invalid, because removing is only a matter of resetting the valid bit.

There are basically two approaches to implement the state recovery for speculative execution (although one of them has been explained above). First, after the misprediction, the processor could calculate the old speculative register file (SRF) from the results available in the window and the register values from the in-order register file. Second, one could save a copy of the SRF when a branch is predicted and restore that copy after having detected that the prediction was wrong.

The first method is cheap, because no additional storage is necessary. Additional multiplexers and busses are necessary to access the instruction window, though. However, this approach is slow, especially because the most recent instance of every register has to be searched and copied to the SRF. One can hardly imagine that searching for 32 values (32 registers is a common amount) can be done concurrently. So this must be done, at least in part, sequentially, which is unacceptably slow.

Making a backup of the SRF and restoring this after detecting the misprediction can be done much faster. However, it duplicates the SRF.

Yet, the extra busses and multiplexers of the first approach cost some hardware too. In addition, the performance of the shadow SRF approach is much better. Copying the shadow SRF to the main SRF can be done in one cycle without stalling normal operation. Removing unwanted instructions can be done in one cycle too. The algorithm for this will be explained in the next chapter. This seems to be an acceptable yet expensive solution. Moreover, as will be shown in the next section, multi-path speculative execution requires several backups of the SRF, so this approach fits the universal and scalable principles better.

Note that copying the current SRF to the shadow SRF is done at the time a conditional branch is fetched. Using similar hardware as for the window entries makes it possible to copy the SRF to a shadow SRF and to load new results into both SRFs in the same clock cycle (see chapter 3). Although the word 'shadow' SRF has been used for clarity, this does not reflect the real situation. After all, suppose the main SRF appeared to be incorrect and the 'shadow' SRF became the correct one; why not continue using the 'shadow' SRF (and 'convert' it to main SRF) instead of copying the shadow SRF back to the main SRF? Again, as will be seen in the next section, in case of multi-path speculative execution it is not possible any more to use a 'shadow' SRF and a 'main' SRF, as there are multiple speculative states. Therefore, copying from 'shadow' SRF back to 'main' SRF is not necessary. Instead, every SRF has a path of its own.

3.5.3 Dealing with interrupts

Note again that a hardware exception refers to an interrupt; a trap refers to an exception that has been caused by an instruction. Theoretically, it should be possible to fetch instructions from an interrupt service routine just as any other instructions. It should even be possible to execute the instructions from the interrupt out-of-order with the instructions from the original program. However, this would cause a lot of problems that cannot be solved easily. For instance, what happens when an instruction from an interrupt causes a trap? Or what happens when instructions from an interrupt routine and 'normal' instructions are in the instruction window, and one of these 'normal' instructions causes a trap? Normally, the rest of the window would be flushed. When instructions from an interrupt routine are in the window, these would be flushed too. This would mean that the interrupt is gone. There is no way to reconstruct the interrupt using the normal state recovery. So, special protection is necessary for interrupt routines. Introducing an extra window entry status bit that indicates whether or not the instruction can be removed from the window would solve the problem.

However, there is another, more severe problem in case of speculative execution, and certainly in case of multi-level speculative execution (see the next section). When more paths are executed simultaneously, to which path should the interrupt be assigned? The only solution is to assign the interrupt to all paths. However, this will not make much sense. For

the target parameter settings (degree 4 superscalar processor, multi-level speculative execution of up to 4 paths), this would mean that every instruction of the interrupt service routine is fetched and executed four times. This does not make sense any more.

The last problem that may be caused by this approach is that it takes too long before an interrupt routine takes effect. As interrupt routines are sometimes used to service real-time needs, this can be an unacceptable situation. Suppose for instance the worst case situation that a window of 64 entries contains only instructions with the longest latency, for instance memory write instructions. Suppose furthermore such a write instruction takes 8 cycles on a 50 MHz clock. Then it takes about 10 microseconds (64 x 8 = 512 cycles of 20 ns) before the interrupt can be served. This can be much too slow for certain applications.

To avoid these problems, it seems better to use the following approach. The in-order register file (IRF) contains a non-speculative, safe state of the processor. Flush the window completely, copy the IRF to an SRF, mark all other SRFs as 'inactive', and start fetching the interrupt service routine. Now, only one, non-speculative path is in use when the interrupt service routine starts. Of course, some already executed results are thrown away. However, it takes only as much time as with a scalar processor before the interrupt takes effect, which can be essential for real-time needs. Furthermore, not much extra hardware is necessary for this approach.

3.6 Multi-path speculative version of US4A

Introduction Instead of predicting the direction of a branch and executing one path, one could also execute both paths. The results of both paths must be stored in the instruction window and when the outcome of a branch arrives, the false results must be removed.

A simplified overview of dual path speculative execution is shown in figure 28. An if-then construction generates two paths (figure 28a). This is graphically shown by the flow chart (figure 28b). Because register r1 is set differently in different paths, this register is said to be control dependent. In the same way, instruction 6 is called a control dependent instruction. Instructions of both paths are fetched and placed into the instruction window, marked as 'speculative' (figure 28c). Although the program paths come together at label 'L2', this fact cannot be used in the instruction window. One of the input operands of instruction 6 is dependent on the branch, so the outcome of that instruction is dependent on the branch too. Both paths (so both versions of instruction 6) may have been executed before the branch has been resolved. So, there is no other solution than placing instruction 6 twice into the window.

Figure 28. Example of dual path speculative execution: (a) program, (b) flowchart, (c) instruction window (with speculative flag and path number).

    (a) program                 (c) instruction window  spec.  path
    1.     r2 := r1 - 7         r2 := r1 - 7            N      0
    2.     jmp zero L1          jmp zero L1             N      0
    3.     r1 := r1 + 1         r1 := r1 + 1            Y      0
    4.     jmp L2               jmp L2                  Y      0
    5. L1: r1 := r1 - 1         r1 := r1 - 1            Y      1
    6. L2: r4 := 2 x r1         r4 := 2 x r1            Y      1
                                r4 := 2 x r1            Y      0

Not every instruction after a branch will be control dependent. Using the example, suppose instruction 6 was r4 := 2 x r3. Then this result could be calculated independently of the branch outcome. It could even be executed before the branch. However, this is very difficult to detect in hardware. Data dependency checks in one clock cycle are already a difficult task, so data dependency checks in one clock cycle between different parts of code are almost impossible. However, software can be used for that: a compiler could move control independent instructions before the branch.

The just described process can be repeated when a second branch is fetched. However, one cannot continue this process indefinitely, as the number of instructions to be fetched grows exponentially. In combination with a branch prediction scheme, however, it is possible to find the most likely path. Then, more likely paths can be evaluated deeper than less likely paths.

Uht [Uht93] showed an algorithm that determines which basic block must be fetched, given the (predicted) probabilities of each path. Resources are always assigned to the most likely path. Given limited resources, it can be proved that this algorithm is optimal¹⁰. Multiplying the probabilities of successive, unresolved branches gives predictions for all paths. When branches are resolved, predictions of the deeper branches change. This can cause the processor to change execution from one path to another. So, a kind of time sharing between multiple paths is created. Although it is unlikely that a real processor can implement this algorithm exactly as stated (multiplying is too difficult for evaluations that have to take place within a clock cycle), this theory gives a good foundation for a realistic approach.
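Uht's ranking rule, as summarized above, can be sketched as follows. This is a simplified, illustrative model only: the real algorithm also accounts for limited resources, and the function names are assumptions:

```python
# Sketch of ranking speculative paths by the product of the predicted
# probabilities of the unresolved branches along each path (after Uht).
from math import prod

def path_probability(branch_probs):
    """branch_probs: predicted probability of each unresolved branch being
    taken in the direction this path follows."""
    return prod(branch_probs)

def most_likely_path(paths):
    """paths: mapping path-id -> list of branch probabilities.
    Fetch resources go to the path with the highest product."""
    return max(paths, key=lambda p: path_probability(paths[p]))
```

When a branch resolves, its probability becomes 1 for the surviving paths and 0 for the others, so re-evaluating the products is exactly the mechanism by which the processor shifts fetch attention from one path to another.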

¹⁰ Note the remarkable resemblance with Huffman's coding theory, an important part of information theory. This theory also gives highest priority (i.e. the smallest word length) to the most probable situation.

A realistic branch approach may be the following. The fetch unit adds two bits to every branch instruction. The first bit indicates the predicted direction of the branch, which can be taken or not taken. The second bit indicates the certainty, which can be high (probable) or low (no prediction was possible). Of course, a probable path gets the highest priority; the other direction of that path gets low priority. When a branch is not the last instruction in the group of instructions that is fetched in the current cycle, and the branch is predicted taken, there are instructions in the group that are part of the unpredicted path. These instructions have already been fetched, so they can better be evaluated and placed into the instruction window than be discarded. This algorithm can easily be extended to more levels of priority by using more certainty bits.

The direction and certainty bits can be derived from several sources. First, a compiler may be able to give hints in the instruction set. Second, a dynamic prediction scheme can use history information. When a jump was taken the last eight times, it is likely that it will be taken the ninth time too. When this history information has been stored, good predictions can be made, with results up to 97 % prediction accuracy (see section 2.4). Third, another approach could be to make predictions based on the situation (i.e. operands or sort of flag): a jump-on-zero is usually unlikely, a jump-on-less-than-zero is usually likely. Branch prediction with certainty bits based on run-time information has never been investigated, as far as could be determined by literature research in the scope of this project. A drawback of the third scheme is that it cannot be implemented in the decode stage, as with the second approach [Yeh93]. However, the third as well as the first approach can always be applied; the second approach can only be applied when the branch has been evaluated before. The best solution seems to be to use several prediction schemes, because they are complementary.
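A classic way to obtain both a direction bit and a certainty bit from history information is a two-bit saturating counter per branch. The following sketch illustrates that idea; it is not the thesis' design, and the class name and the choice to start in the weak not-taken state are assumptions:

```python
# Hypothetical sketch: one two-bit saturating counter per branch address,
# read out as the (direction, certainty) pair described above.
# Counter states: 0 = strong not-taken, 1 = weak not-taken,
#                 2 = weak taken,       3 = strong taken.
class BranchPredictor:
    def __init__(self):
        self.counters = {}              # branch address -> counter in 0..3

    def predict(self, addr):
        c = self.counters.get(addr, 1)  # unknown branches start weak
        taken = c >= 2
        certain = c in (0, 3)           # high certainty only when saturated
        return taken, certain

    def update(self, addr, taken):
        c = self.counters.get(addr, 1)
        self.counters[addr] = min(3, c + 1) if taken else max(0, c - 1)
```

A branch that has gone the same way repeatedly sits in a saturated state and is reported with high certainty; a branch that alternates hovers in the weak states and is reported with low certainty, exactly the high/low distinction the priority algorithm needs.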

Implementing multi-path speculative execution
As shown in the previous section, a backup of the SRF is necessary to implement speculative execution. This scheme can easily be extended to multi-level speculative execution by adding more SRFs (see figure 29). Whenever a branch is discovered, the 'main' SRF is used for the predicted path, a 'shadow' SRF is used for the other path. Both paths will be fetched and executed. When the branch is resolved, there is certainty about the correct direction, and one path has to be cancelled. Cancelling a path consists of the following steps:
• The instructions of the false path have to be removed from the instruction window. Removing means marking them as invalid.
• The instructions from the correct path must be marked as 'not speculative any more'. This is done by clearing the speculative bit.
• The SRF of the incorrect path is simply discarded and becomes available for new paths.

Figure 29. Overview of the processor that supports multi-path speculative execution: instruction and data memory, fetch and decode unit, speculative register files, instruction window, global control and state recovery, in-order register file, execution units, and result and tag busses.

Although the terms 'main' SRF and 'shadow' SRF have been used for clarity, this does not reflect the real situation. When several paths are being executed, which one is the 'main' SRF and which one is the 'shadow' SRF? Because this cannot be determined in the case of more than 2 paths, a different scheme is necessary. This is as follows. Whenever a branch is detected, the predicted path keeps the current SRF¹¹, the other path gets a copy of the current SRF. This copying process will happen transparently. The predicted path can continue updating the current SRF, whereas the current SRF is transferring its contents to a new, currently empty SRF. Whenever results arrive during this cycle, they can update both SRFs! This can be achieved by using the same hardware as for the instruction window (see the next chapter).

Removing incorrectly predicted paths can be done in parallel with the other actions of the instruction window. Cancelling instructions from multiple (!) paths consists of resetting the valid bit, which can be done in one cycle, transparently with all other things the window has to do, as will be explained in the next chapter. The instruction window entries need some additions to support this. A speculative bit and a unique path number need to be added. Note that when a branch has been resolved, the incorrectly predicted instructions have to be marked as invalid, and the correct instructions have to be marked as not speculative any more. The latter is done by clearing the speculative bit in the instruction window entry. This is necessary, because speculative instructions are not allowed to retire. Similar hardware can be used for both actions.

¹¹ However, whenever a branch instruction is not the last instruction in the group of instructions fetched in a cycle, and the branch is predicted taken, it is better to continue with the instructions after the branch (which are not the predicted instructions) using the current SRF, at least for the current cycle. Then, those instructions may become useful; otherwise they would have to be discarded.

Whenever a second branch is detected while one branch is already pending in the window, a similar approach can be used, provided there is a free SRF available. This can be done for both paths (figure 30). Whether the program parts stay apart (figure 30a) or come together (figure 30b), the paths in the processor certainly stay apart (figure 30c). Note that every path gets its own SRF (indicated in the figure by letters), and every sub path gets its own path number (indicated by numbers). This supports removing sub paths and thereby using as few SRFs as possible. Suppose for instance that the main path in figure 30, indicated by the thickest line, appeared to be incorrect. More precisely, path 2 was the correct one. Now, not all instructions that used SRF A must be removed. After all, the instructions from sub path 0 were correct! So, all instructions from the paths 1, 4 and 3 must be marked invalid. Furthermore, the instructions from path 2 must be marked as 'not speculative'. SRF A must be marked as inactive.

Figure 30. Speculative execution of 4 paths: (a) paths stay separate, (b) paths reconnect, (c) processor view. Notation: path no / SRF no.
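The cancellation bookkeeping of the example above could be modelled as follows. This is a hypothetical sketch, not the thesis' implementation; the entry fields and the set-based path arguments are assumptions:

```python
# Hypothetical sketch of resolving a branch in the multi-path window:
# entries on the false sub paths become invalid, entries on the proven
# sub path lose their speculative bit so that they may retire.
def resolve_branch(window, correct_paths, false_paths):
    """window: list of entry dicts with 'path', 'valid', 'speculative'."""
    for entry in window:
        if entry["path"] in false_paths:
            entry["valid"] = False            # removing = resetting the valid bit
        elif entry["path"] in correct_paths:
            entry["speculative"] = False      # now allowed to retire
```

In the figure 30 example this would be called with paths 1, 4 and 3 as the false paths and path 2 as the correct one; entries from sub path 0 are left untouched.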

Assigning resources to each path can be done in the following way. Remember that memory is banked and that instructions are fetched in groups. Software or hardware may or may not align instructions for better performance.

When an unconditional jump has been fetched, the jump may not be at the end of such a group. The instructions after the jump will never be executed, so they can be discarded before they are placed into the instruction window.

Similarly, a conditional jump may be the last instruction of such a group. If so, the predicted path will get a new path number (which is actually a sub path number) and use the same SRF as currently in use. If not, the instructions after the jump will be placed into the window with a different path number¹². These instructions have to use the current SRF, because it is not possible to copy SRFs and to update the dependency information in the SRF in the same cycle. Anyway, fetched instructions will not be discarded.

The predicted path should always get a higher priority. So, after the group of instructions which included the conditional branch has been placed into the instruction window, and the fetch unit can continue fetching in the right direction, a group from the main path should be fetched and placed into the window. After this cycle, the main path should again get attention. Because of the basic block size, however, it is likely that the second fetched group of the main path contains a branch instruction. During the time that this branch is being predicted, another path can be given attention. So, multiple paths are executed by means of time sharing.

The precise way of sharing time between multiple (more than two) paths, depends on (among others) the type of code, type of branch prediction (can be done after fetching of the branch instruction or during fetching [Yeh93]), prediction accuracy, and fetch capabilities. It is hard to give general guidelines. However, Uht's principles are a good start [Uht93], and simulations of the processor running its target programs will give more certainty. The certain/uncertain priority algorithm mentioned earlier in this section is a good candidate for this.

Some special registers: the stack pointer and the program counter
The program counter is necessary for state recovery. In case an instruction causes a trap, a trap handler has to be executed. After that, the trap causing instruction has to be restarted. Therefore, the program counter has to be stored in every window entry. Furthermore, jump instructions have to store the destination program counter, as this is necessary when jump instructions retire. Each SRF contains a program counter that is updated just as all other registers. Usually, the destination contents field can be used for storing the calculated program counter; in fact, this is a kind of result of the instruction. However, this will not be possible for every instruction set. For instance, if an instruction set supports an increment-and-loop instruction, a jump instruction also has an operand, so extra storage in the instruction window is necessary for the destination program counter.

Call and return instructions use the stack pointer. This can be seen as a normal register that is read or changed by all kinds of stack related instructions, like for instance call or return. Therefore, call and return instructions need to store in each window entry the source program counter (the fetch address of the instruction), the destination program counter (the address where the instruction jumps to), the source stack pointer, and the destination stack pointer. As usual, the source and destination stack pointer storage can be shared. The window already contains storage for the source program counter; the destination program counter can be stored in the destination operand field. However, an instruction set that uses operands in stack related instructions needs additional storage for the stack pointer.

¹² This situation does not occur in figure 30.

A register indirect jump can hardly be predicted. Furthermore, it is difficult to deal with such jumps in superscalar processors. However, they cannot be avoided in modern computer systems¹³. The only way to deal with them is to stall fetching of all paths that depend on the register indirect jump instruction, and to wait for the outcome.

Final SRF
The final SRF is displayed in figure 31. Because the program counter and the stack pointer are part of a speculative processor state, they are part of the SRF. Furthermore, an active bit is necessary to indicate whether this SRF is in use or free. A fetch priority is used to determine how often the fetch unit will use this SRF. The path number is necessary because it is copied to the instruction window for every new instruction.

Figure 31. Overview of the final SRF: the register contents, the path number, the active bit, the fetch priority, the program counter, the stack pointer, and the flag parts (logic, integer, floating point, and miscellaneous flags), connected to the other SRFs, the in-order register file, the fetch unit, and the result busses.

¹³ Especially C++ programs use register indirect jumps. However, this is only 7.4 % of all branches and 1 % of all instructions (see table 1, page 38).

3.7 Future directions

3.7.1 Problems and remaining design issues

Although the most important problems have been resolved, some parts still need attention.

Memory load and store instructions
Load and store instructions require some special attention. Loads can be done out-of-order and even speculatively, provided they are not interleaved with stores to the same address. In case the speculation appears to be incorrect, the values can easily be discarded. This is different in case of stores. When speculative stores had been executed, repair would not be so simple any more, because stores could have changed every part of the memory. Furthermore, repair would consist of accessing the relatively slow memory; repairing the processor state from out-of-order and speculative stores should therefore be avoided.

In principle, the reorder buffer approach can be used for this: handle memory stores the same way as register updates. Only at retirement, when every predecessor has been executed and retired, should the store be executed. A load instruction may be executed out-of-order and speculatively, but only if there are no pending stores to that address.
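The reorder-buffer treatment of memory described above could be sketched like this. It is an illustrative model, not the thesis' design; the class and method names are assumptions:

```python
# Hypothetical sketch: stores wait in a buffer until retirement; a load
# may bypass them only when no buffered store targets the same address.
class StoreBuffer:
    def __init__(self, memory):
        self.memory = memory
        self.pending = []                    # (address, value), program order

    def store(self, addr, value):
        self.pending.append((addr, value))   # buffered, not yet in memory

    def can_load_early(self, addr):
        """An out-of-order/speculative load is allowed only when no
        pending store writes to this address."""
        return all(a != addr for a, _ in self.pending)

    def load(self, addr):
        # only safe when can_load_early(addr) is true
        return self.memory.get(addr, 0)

    def retire_oldest_store(self):
        """At retirement the oldest store finally updates memory."""
        addr, value = self.pending.pop(0)
        self.memory[addr] = value
```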

This approach will work correctly; however, it will probably limit performance. Methods exist to do loads and stores out-of-order and speculatively ([Fran92b] and [Pick93]). Those methods usually consist of something like an extended associative cache. This all needs to be worked out, especially in case of multi-path speculative execution.

Retirement unit
Although an overview of the retirement unit's function can be given, its precise function still needs to be designed. Suppose two instructions are in the instruction window and are ready to be retired, and they write to the same register. In order to retire both instructions in one cycle, the oldest result can be discarded. Precise algorithms and implementations for this must be designed.
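The retirement rule just mentioned, keeping only the youngest of several writes to the same register, could look like this in a sketch (hypothetical names, not the thesis' implementation):

```python
# Hypothetical sketch: when several retiring instructions write the same
# register in one cycle, only the youngest value reaches the in-order
# register file; the older result is simply overwritten (discarded).
def retire(ready_entries, irf):
    """ready_entries: retirable window entries, oldest first."""
    for entry in ready_entries:
        irf[entry["dest"]] = entry["value"]   # later (younger) write wins
        entry["retired"] = True
```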

Communication between the fetch unit and the SRF
The implementation of the dependency checks has been explained. However, there are also dependencies between the new instructions fetched in the current clock cycle. A sequential algorithm is necessary to calculate these dependencies. This algorithm is relatively straightforward and can therefore be fast; however, the hardware still needs to be designed.
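The sequential intra-group dependency check could be sketched as follows. This is illustrative only; in hardware it would be realized as combinational priority logic rather than nested loops, and the data representation is an assumption:

```python
# Hypothetical sketch of the dependency check between instructions decoded
# in the same cycle: each instruction's sources are compared against the
# destinations of the earlier instructions in the group; the most recent
# earlier writer wins, otherwise the value comes from the SRF.
def intra_group_dependencies(group):
    """group: list of (sources, destination) tuples in program order.
    Returns, per instruction, the in-group producer index of each source,
    or None when the operand comes from the SRF."""
    deps = []
    for i, (sources, _dest) in enumerate(group):
        producers = []
        for src in sources:
            producer = None
            for j in range(i - 1, -1, -1):   # scan earlier instructions, newest first
                if group[j][1] == src:
                    producer = j
                    break
            producers.append(producer)
        deps.append(producers)
    return deps
```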

Floating point
One or several floating point unit(s) can be implemented as yet another functional unit. These could exist on the same chip or on another chip. Implementing a superscalar processor on several chips is one way to deal with the large amount of silicon. Both the Metaflow architecture [Pope91] and IBM's Power2 architecture do this [Stal94]; IBM even uses a multi-chip module. IBM's newer architecture, the PowerPC chip, has an on-chip floating point module. For a general architecture, either way has to be worked out.

Branch prediction Although the general principles of branch prediction are clear, an exact scheme must be chosen and connected with the rest of the architecture. Although this is not part of the architecture, a superscalar processor will need some kind of branch prediction. The certainty approach (see section 3.6, page 67) with or without other branch prediction schemes seems a promising solution.

Keeping coherency between different types of caches in case of multi-path speculative execution
Instruction cache, data cache, branch history tables [Yeh94], speculative load/store buffers or Address Resolution Buffers [Fran92b]: there are several kinds of caches in a superscalar processor. In case of speculative execution or multi-path speculative execution, it could become a problem to keep them coherent.

Assembly line principle
An assembly line principle can be a possible extension to the architectural features already mentioned. Some complex instruction sets may require this. When an instruction requires multiple actions from multiple execution units, it is possible to have instructions scheduled several times and sent to the required execution units. This can be done by a straightforward extension of the window entries. This requires, among other things, multiple dispatched and executed bits. The window can be seen as an assembly line where several kinds of execution units perform several kinds of actions on one instruction.

3.7.2 Further research

Simulations
Simulations are necessary to get information about several trade-offs. Simulations should consist of a program that simulates the execution of programs on the US4A architecture. By changing parameters, optimal configurations can be determined.

Although simulations for a scalable architecture may not be as important as for a fixed architecture, they are still necessary for several reasons:
• Determination of the exact algorithm for time-sharing of the multi-path speculative execution.
• Determination of the exact shift algorithm. Is compressing the window (filling up holes) necessary, and if so, how complicated should this be? Do all window entries need to be able to receive new instructions? Is it sufficient to retire only the last 4 instructions?
• Determination of the 'memory arbiter'. When should the load/store unit get access to memory and when the fetch unit¹⁴? Which unit has priority in which case?
• Determination of all other parameters of the architecture: window size, number of SRFs, numbers and types of execution units, number of memory banks, sizes of all types of caches, and so on. Simulations are necessary to get the performance increase of every processor unit. This information, together with the hardware cost of every unit, is essential for designing well balanced microprocessors.

Compiler development
Compiler development is often part of modern computer architecture development. For optimal performance, superscalar processors require compiler techniques different from those for scalar processors. Although many scalar compiler techniques can be used, new techniques will be necessary. Because this is a completely different subject, it does not seem wise to have digital system designers develop hardware as well as compilers. However, compiler guidelines can be given.

When one can start from scratch: a superscalar dedicated instruction set
Of course, instruction set compatible superscalar processors suffer from some drawbacks of their scalar counterparts. It is interesting to look for an instruction set dedicated to superscalar microprocessors. For instance, the flag register causes a lot of artificial, unnecessary dependencies. Although the approach mentioned in this chapter reduces this problem, a dedicated instruction set could do this better and more elegantly. These dependencies can be reduced by supporting multiple flag register parts (IBM's PowerPC uses this). Another approach is having the compiler or programmer choose whether flag registers are read and/or written. For instance, for a load-with-immediate instruction it will almost never be necessary to set the flag register. It will not be too difficult for a compiler to determine whether an instruction should read and/or write the flag register. This can remove a lot of dependencies.

It is probably better to give priority to the load/store unit, because this is a more severe bottleneck than fetching instructions.

Such aspects are the subject of much research at the moment. The trade-off between compiler complexity and hardware complexity is one of the main questions.

3.7.3 Possible reductions

Whenever the performance potential offered by US4A is too high or too expensive, or whenever the technology does not permit the required number of transistors, the following reductions could be made, approximately in order of importance, thereby sacrificing execution speed:
1. Use fewer paths of multi-path speculative execution.
2. Do not use multi-path speculative execution at all.
3. Use a simpler shift construction in the instruction window.
4. Reduce the number of window entries in which a new instruction can start.
5. Reduce the number of window entries.
6. Restart execution only after branches. Then the flag registers and the program counter do not have to be stored in every window entry but only in branch instructions, which have enough storage because they usually have fewer operands.

Of course, all types of caches could also be made simpler and smaller. Note that too severe a reduction is not advisable. For strong reductions, other approaches may become worthwhile. Because of the high performance goals, the scalability must mainly be seen as scalability towards higher performance.

3.8 Conclusion

Starting points have been defined, which are necessary to achieve the performance goals. A few logic design rules and layout design rules have been given, as well as some memory considerations. Despite the scalability, a current target parameter setting has been defined. Scalability without an increase of clock cycle time is the most important design issue; the amount of hardware is of secondary concern. Based on all this, a reorder buffer with future file principle has been chosen. The instruction window supports full register renaming, instruction dispatching, branch prediction, (multi-path) speculative execution, and precise traps. The instruction window is a FIFO buffer of instructions. It moves instructions and is capable of compressing. Instructions can enter the window at every position except for a few bottom positions; those are used for retiring executed instructions. When the instruction window is full, fetching must be stalled.

The window contains multiple window entries. Every window entry contains an instruction. The opcode, source and destination operand(s), dependency information of every operand, several status bits, and a tag are stored in a window entry. A tag is a label that refers to the instruction. Every instruction in the window has a unique label. Register renaming has been implemented by changing register names into tags. The tag refers to the instruction that generates the result. Instructions load their source operands by constantly monitoring the result busses and loading data whenever the tag accompanying a result matches the tag in the operand field of the instruction window entry. When all source operands have been loaded, the instruction is ready to be dispatched to the execution units. Special registers like the program counter, stack pointer and the flag register use exactly the same renaming technique.
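The tag-based renaming and result bus monitoring described above can be sketched as a small software model. This is an illustrative sketch only, not the actual hardware; the class and field names are invented for clarity.

```python
# Minimal software model of tag-based register renaming in the window.
# An operand is either a ready value or a tag referring to the producing
# instruction; result-bus broadcasts resolve matching tags into values.

class Operand:
    def __init__(self, value=None, tag=None):
        self.value = value  # resolved data, or None while waiting
        self.tag = tag      # tag of the producing instruction, or None

    def ready(self):
        return self.value is not None

class WindowEntry:
    def __init__(self, tag, opcode, sources):
        self.tag = tag          # unique label of this instruction
        self.opcode = opcode
        self.sources = sources  # list of Operand

    def monitor_result_bus(self, result_tag, result_value):
        # Load a broadcast result into every operand whose tag matches.
        for op in self.sources:
            if not op.ready() and op.tag == result_tag:
                op.value = result_value
                op.tag = None

    def ready_to_dispatch(self):
        return all(op.ready() for op in self.sources)

# r1 = r2 + r3, where r2 is still being produced by the instruction with tag 7
add = WindowEntry(tag=9, opcode="add",
                  sources=[Operand(tag=7), Operand(value=30)])
assert not add.ready_to_dispatch()
add.monitor_result_bus(7, 12)   # tag 7 completes with value 12
assert add.ready_to_dispatch()
```

In hardware, the `monitor_result_bus` loop corresponds to the parallel tag comparators of every operand field, all snooping the result busses in the same cycle.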

A speculative register file is necessary for a fast and scalable way of calculating the dependency information. The speculative register file contains the out-of-order state of the processor. An in-order register file, which can be compared to the normal register file of a scalar microprocessor, contains a safe and in-order state of the processor. A result of an instruction is copied to the in-order register file whenever the instruction arrives at the bottom of the instruction window. In addition, this retiring of an instruction is only allowed if the instruction has not generated a trap, if all preceding instructions have already updated the in-order register file, and if the instruction is not marked as 'speculative'. This in-order register file is used as a safe processor state to restart execution after a trap (software interrupt) has occurred. In case of a hardware interrupt, the complete window must be discarded.

Speculatively executed instructions are marked as 'speculative'. When they turn out to be correctly predicted, the speculative bit can be cleared. When they turn out to be incorrectly predicted, they can be cancelled by clearing the valid bit. Furthermore, a shadow speculative register file is necessary in case the prediction turns out to be incorrect. Several speculative register files, each containing the state of a separate path, are necessary for implementing multi-path speculative execution.

4. Implementation of the hardware

4.1 Introduction

In this chapter, the most important parts of the hardware will be explained. First, the window is described, especially the process of moving the instructions (section 4.2), which is an important feature of US4A. Result bus monitoring (section 4.3) is a hardware structure of the instruction window and the SRF. The SRF is briefly discussed in section 4.4. Section 4.5 is about flushing parts of the window; it shows how multiple paths can be flushed within one cycle. The same technique can be used to reset the speculative bit, which is necessary to validate speculative paths that turned out to be correctly predicted. Then, the dispatch unit will be explained (section 4.6). A simple yet very effective and scalable mechanism has been found to find the least recent ready instructions. The turn around time, the time between dispatching an instruction and using its result, is discussed in section 4.7. In order to know the amount of hardware necessary for US4A, estimated hardware counts are given in section 4.8. This chapter ends with a conclusion.

4.2 Moving the instructions through the instruction window

Moving the instructions through the window is an important aspect of US4A. Figure 32 shows the parts of the window that are important for the shifting process. A window entry consists of the registers for storing operands, dependency information, and status information. Furthermore, there are some multiplexers and comparators. The large multiplexer is used for shifting complete instructions; the small one symbolizes the result bus monitoring logic and consists of tag compare logic, several multiplexers, and logic necessary for the branch prediction and the speculative execution.

Figure 32. Overview of the shifting principle

At the left are the new instruction busses. The result busses, which have not been drawn completely, are at the right. Move control is a unit that controls the shift multiplexers.

A multiplexer has several candidates to shift into a window entry. The candidates are new instructions, one of the previous instructions, or the instruction that is already in this window entry. Early in the clock cycle, a new instruction is shifted into the window entry. Late in the clock cycle, the results (together with their tags) arrive. Tags can be compared and when a match occurs, the result can be loaded directly into the window entry (see section 4.3). Note that tags of operands and tags of result busses are compared at the new position of an instruction. So, instructions can be moved while results arrive.

The retirement unit determines how many positions can be shifted. The number of instructions that are retired, that is, removed from the window, is the number of positions that can be shifted. For the target parameter settings of US4A, 4 instructions can be fetched each cycle, so it makes sense to retire and shift 4 instructions each cycle. However, multi-path speculative execution causes holes in the window because complete paths of instructions will be removed. Two approaches can be used to compress the window, so that holes are removed. It is important to remove holes, because this makes otherwise useless window entries free again. Note that it may not be necessary to remove all holes in one clock cycle.

The first approach is simple: retire (when possible) more instructions each cycle than new instructions arrive. For instance, when up to 4 instructions are fetched each cycle, retiring up to 6 instructions and shifting up to 6 positions can remove at least a part of the holes. This approach uses little hardware but offers less performance potential.

A second approach is more advanced but more expensive. The retirement unit gives a basic shift number; all instructions are moved at least that many positions. However, when holes are detected locally, the part of the window from those holes to the top is moved faster. This cannot be done for every individual window entry: after all, it is a kind of sequential algorithm, which is not scalable. However, it can be done for groups of instructions. When a hole has been detected, all instructions higher than the hole need to shift some extra positions. When another hole is discovered at a higher position, all instructions higher than the second hole have to be shifted yet more positions.

Figure 33. Advanced shifting approach that is able to compress the instruction window

Because even simple addition cannot be done in a fraction of the clock cycle, another approach must be used. This can be done as follows (figure 33). Divide the window into groups. Whenever a hole has been discovered in a group, it signals to all higher window entries that they need to shift some extra amount. To avoid a mathematical add, it seems better to use a bit pattern to indicate the shift amount, as indicated in the figure. This approach uses more hardware than the first approach; however, it is better able to compress the window. It has to be kept in mind that the determination of the shift pattern has to be ready early in the clock cycle. Therefore, other approaches may be interesting too.
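The group-based compression scheme can be sketched as follows. This is a toy model in software terms: it computes the shift amounts as integers for readability, whereas the hardware would propagate a unary bit pattern between groups to avoid a mathematical add.

```python
# Toy model of the group-based compression scheme: every entry shifts at
# least `base` positions, plus one extra position for each hole
# (cancelled instruction) detected in a lower group.

def shift_amounts(valid, base, group_size):
    # valid[i] is False for a hole; index 0 is the bottom of the window.
    amounts = []
    holes_below = 0
    for g in range(0, len(valid), group_size):
        group = valid[g:g + group_size]
        # entries in this group shift extra only for holes in lower groups
        amounts.extend([base + holes_below] * len(group))
        holes_below += group.count(False)
    return amounts

# an 8-entry window, groups of 4, one hole at position 2, base shift of 1
print(shift_amounts([True, True, False, True,
                     True, True, True, True], base=1, group_size=4))
# → [1, 1, 1, 1, 2, 2, 2, 2]: the upper group shifts one extra position
```

Note that, as in the text, holes inside a group are only compressed away once they have moved down into a lower group in a later cycle; the model makes no attempt to remove all holes in one clock cycle.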

Result loading must happen late in the clock cycle because the results arrive late. The long result busses drive a lot of inputs, so these busses and their three-state buffers will be slow. Therefore, shifting of the instructions has to be done early in the cycle and loading the results late in the cycle.

4.3 Result bus monitoring

A more detailed view of the result bus monitoring process is given in figure 34. This figure gives the signal path of an instruction during one cycle. The instruction flows from an old window position to a new position. This new position may be the same as the old one. In the figure, the multiplexer just below the old window position is the shift multiplexer. Just one path through this multiplexer has been symbolized by a curved line. Next, the tag is compared to the tags of all tag busses. If there is a match, the accompanying result will be loaded via the multiplexer into the new window entry.

Figure 34. Loading results into the window

Just one operand is shown in the figure. Of course, all other parts of a window entry are moved in a similar way, but only registers that need to monitor the result busses need hardware like this. For instance, the program counter can simply be copied to the new window entry.

4.4 Speculative register file (SRF)

The SRF implements the data dependency checks as well as the register renaming (see section 3.4.4). The SRF uses structures similar to those of the instruction window. The SRF performs result bus monitoring, but, as it contains only register values or tags and not complete instructions, it does not have to use shifting hardware. Therefore, the SRF is much less complex than the instruction window. Another difference with the instruction window is the connection with the fetch units.

The fetch units deliver several new, decoded instructions every cycle. For every source operand, data (a register value or a tag) must be read from the SRF. This requires three-state buffers. For every destination operand, a new value has to be written to the SRF, which requires write ports.

While the dependencies are taken care of using the SRF, results are arriving on the result busses. Result bus monitoring hardware loads the required results into the SRF. However, some of those results may be operands of the instructions that are being evaluated. So, the dependencies must be evaluated early in the cycle, so that when an instruction is shifted into the instruction window, it is able to load the desired results. This can be done using the result bus monitoring hardware of the window.

But what if there are dependencies within the group of new instructions? Either instruction decoding can be stalled, which requires only simple hardware, or the dependencies between those instructions must be taken care of at the cost of some special hardware. The second solution obliges some kind of sequential handling of the new instructions, which may be too slow to be performed in one clock cycle. Note that most compilers try to use as few different registers as possible (see also 2.5.1 on page 37 under 'register renaming is necessary'). However, in a superscalar processor with register renaming, this is not only worthless, it may even decrease performance or cost unnecessary hardware (depending on the chosen solution).
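The second solution, handling intra-group dependencies with special hardware, amounts to a sequential pass over the fetched group. A hypothetical sketch (the data layout is invented; the thesis leaves the choice between stalling and extra hardware open):

```python
# Toy model of resolving dependencies inside one fetched group: each
# instruction first looks up its sources in the SRF, but a later
# instruction in the group that reads a register written earlier in the
# same group takes the writer's tag instead of the (stale) SRF entry.

def rename_group(group, srf):
    # group: list of (tag, dest_reg, [src_regs]); srf maps reg -> value or tag
    renamed = []
    local_writers = {}                      # reg -> tag, within this group
    for tag, dest, srcs in group:
        ops = [local_writers.get(r, srf.get(r)) for r in srcs]
        renamed.append((tag, dest, ops))
        local_writers[dest] = tag           # later reads of dest use this tag
    return renamed

srf = {"r1": 5, "r2": 7}
group = [(10, "r3", ["r1", "r2"]),          # r3 = r1 op r2
         (11, "r4", ["r3", "r1"])]          # r4 = r3 op r1  (r3 from tag 10)
print(rename_group(group, srf))
# → [(10, 'r3', [5, 7]), (11, 'r4', [10, 5])]
```

The loop-carried `local_writers` dictionary is exactly the sequential element the text warns about: each instruction depends on the bookkeeping of all earlier instructions in the group, which is why this may be too slow for one clock cycle.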

In order to copy the information stored in an SRF to other SRFs, a copy bus is necessary. Every SRF and the in-order register file must be able to write to the bus and read data from the bus. So, in one cycle, the contents of one of the SRFs (or the in-order register file) can be copied to another SRF. One bus is enough, because the contents of one SRF can be copied to multiple other SRFs using one bus. Such a situation can occur - although it is unlikely - when there are multiple branches in the group of instructions fetched in one cycle.

4.5 Flushing paths

In order to cancel multiple speculatively fetched and executed paths that turned out to be incorrect, a fast flush mechanism is necessary. An approach similar to the one used for the result bus monitoring hardware can be used for the flushing hardware. Figure 35 has been organized in the same way as the figure of the result bus monitoring hardware in section 4.3. An instruction goes from its old position to its new position via the shift multiplexer. Before it is loaded into the new window entry, its path number is compared to the paths that should be flushed. Whenever a match occurs, the valid bit is set to '0' (invalid). A bit pattern as displayed in the figure has been used to make it possible to cancel several paths in one cycle.

Figure 35. Flushing instructions of all window entries or from certain paths (path control bus: 0=keep, 1=clear).
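The effect of the path control word can be sketched as follows. This is an illustrative software model only; the dictionary fields stand in for the path number and valid bit of a window entry.

```python
# Toy model of the flush mechanism: the control word carries one bit per
# speculative path (1 = clear); while an entry moves to its new position,
# its path number is checked against this word and, on a match, the
# valid bit is cleared.

def flush(entries, clear_mask):
    # entries: list of dicts with 'path' and 'valid'; clear_mask: set of
    # path numbers to cancel (modelling the 1-bits of the control word)
    for e in entries:
        if e["valid"] and e["path"] in clear_mask:
            e["valid"] = False      # cancelled; the entry becomes a hole
    return entries

window = [{"path": 0, "valid": True},
          {"path": 1, "valid": True},
          {"path": 2, "valid": True}]
flush(window, {1, 2})               # cancel two mispredicted paths at once
print([e["valid"] for e in window])   # → [True, False, False]
```

Because every entry checks the same control word independently, all matching entries of several paths are invalidated in a single cycle, as the text requires.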

4.6 Dispatch unit

Basic idea
The dispatch unit has to select - within one clock cycle - the least recent instructions that are ready to go to an execution unit. 'Ready' means having all operands unlocked. For the target parameter settings, it is necessary to dispatch up to 4 ready instructions. These instructions have to be the least recent instructions (this scheduling algorithm will be discussed at the end of this section). Note that every class of instructions (for instance integer, floating point, load/store, branch) has its own dispatch unit and execution units.

A priority encoder can be used to find the least recent instruction. A second priority encoder could scan all window entries except the least recent one to find the second least recent instruction. This can take place only after the first priority encoder has done its job. So, to schedule several instructions, several priority encoders that do their jobs sequentially are necessary. However, probably only up to two priority encoders can do their jobs sequentially within one cycle. Because this is too little to be scalable, another approach has to be found. It is impossible to find exactly the least recent ready instructions within one cycle, so less optimal algorithms must be applied.

A solution can be the following: suppose up to four ready instructions need to be found. Now, let priority encoder 1 scan window positions 0, 4, 8, ..., let priority encoder 2 scan window positions 1, 5, 9, ..., and so forth (see figure 36). A switch matrix is necessary to transfer the instructions to free execution units. This algorithm will not give the least recently ready instructions; however, it will come close to that. Furthermore, this algorithm is much more scalable than priority encoders that do their jobs sequentially.

Figure 36. Implementation of the priority encoding in the instruction dispatch unit.

Note that the figure does not depict the real situation. Complete window entries, which are more than 100 bits wide, will not be sent to the dispatch unit. It is more likely that the dispatch unit will signal the window which instructions have to be sent to which execution units. Sending instructions to the execution units then consists of enabling three-state buffers.
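The interleaved scanning of the basic idea can be sketched in software. This is an illustrative model only (a real encoder would be combinational logic, not a loop); it also shows why the result is sub-optimal: two old ready instructions in the same lane block each other.

```python
# Toy model of the interleaved dispatch scheme: encoder i scans window
# positions i, i+4, i+8, ... bottom-up and picks the first ready
# instruction it sees, so four candidates are found without any encoder
# having to wait for another.

def dispatch(ready, n_encoders=4):
    # ready[i] is True if the instruction at position i (0 = oldest) is
    # ready; returns one selected position per encoder, or None.
    picks = []
    for enc in range(n_encoders):
        lane = [pos for pos in range(enc, len(ready), n_encoders)
                if ready[pos]]
        picks.append(lane[0] if lane else None)
    return picks

# positions 1, 2, 6 and 9 are ready in a 12-entry window
print(dispatch([False, True, True, False, False, False,
                True, False, False, True, False, False]))
# → [None, 1, 2, None]: 6 and 9 lose to older instructions in their lanes
```

The example shows the price of the scheme: although four instructions are ready, only two are picked this cycle, because positions 6 and 9 share lanes with older ready instructions. This is the small performance loss the text accepts in exchange for scalability.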

Improvement of the basic idea
An improvement can be achieved by applying the algorithm displayed in figure 37. It is an extension to the basic idea that searches not only for a best candidate but for a second best too. A main and a sub priority encoder are used in each of the four priority encoding units.

Priority encoder 1 has as inputs instructions 0, 4, 8, .... Its main priority encoder keeps its function as in the basic idea: it searches for the least recently ready instruction. This is symbolized by the up arrow; the arrow points in the search direction, towards lower priority.

The sub priority encoder searches for a kind of second best: it searches for the most recent candidate in the oldest half of the inputs. So, newer instructions get higher priority! This is displayed in the figure by a down arrow that symbolizes the search direction. The sub priority encoder usually does not find a second best. However, two reasons argue for this approach. First of all, the sub priority encoder does not have to wait for the main priority encoder; it can do its work concurrently. Second, the chance that the main and sub priority encoders find the same candidate is smaller than if they searched in the same direction.

The best candidate is given directly to the select unit. The second best - when not the same as the best candidate - is given to the next select unit as a second candidate. When a priority encoding unit does not find a candidate at all, it may be able to use its predecessor's second best candidate. Using this approach, it is less likely that no candidates are found when they do exist in the instruction window.

Note that a switch matrix is still necessary. Suppose for instance that only the first ALU is free and only priority encoding unit 4 delivers a candidate. A switch matrix makes it possible to schedule the only candidate to the only free ALU.

Below it will be explained that the performance of the presented sub-optimal dispatch unit is good; it is within 5% of an optimal solution. However, simulations could reveal possible (yet small) enhancements which could improve the presented dispatch unit without an increase of hardware. Those simulations should consist of executing large programs on a software model of the US4A architecture.

Figure 37. Improvement of the basic priority encoders in the dispatch unit.

A note on scheduling algorithms
The scheduling or dispatching algorithm 'take the least recently fetched instruction' is not the best that can be used. For example, branch resolving instructions, or instructions that take a long time to execute and that make a lot of other instructions ready, should have a higher priority than other instructions. However, two reasons prevent a processor from using an optimal schedule. First of all, latencies are not always known. Modern processors always use caches, but the dispatch unit cannot know whether data is in the cache or not. Second, finding an optimal schedule is an NP-complete problem [Kell75]. Even software that uses a lot of memory and has enough time to finish cannot do this for a complete program.

Butler and Patt [Butl92] showed that the scheduling technique does not affect performance much. Performance differences of only approximately 5% (but usually less) can be found between techniques. Most techniques favoured in some way the least recent instructions; only techniques that do the opposite have a significantly lower performance. Therefore, the non-optimal dispatching algorithm explained in this section will be sufficient. However, simulations may be used to improve the algorithms presented in this section.

4.7 Turn around time

The turn around time is the time between dispatching an instruction and using its results (see 3.2.1). It is important for two reasons.

First, it is important because it gives a kind of worst case performance. When none of the superscalar concepts can be applied, each instruction will take at least the turn around time to execute. Usually, in such a situation, a superscalar processor is slower than a scalar processor. Suppose a situation with a data dependency between every pair of succeeding instructions. Then it will cost the turn around time to execute every instruction. A lot of simulations of real machines executing real programs have shown that this is an unlikely situation (among many others: [Sohi90] and [Pope91]). Only for small windows (four or fewer entries using a reorder buffer [Sohi90]) could superscalar concepts cause an increase in program execution times.

Second, the turn around time is important for indirect jump instructions. In case of an indirect jump, calculations have to be done first before the processor can continue fetching instructions.

In US4A, the turn around time is normally three cycles:
1. In the first cycle, an instruction becomes ready because its latest source operand arrives and is loaded into the instruction window. All operands are marked unlocked.
2. The dispatch unit notices the ready instructions and performs the priority encoding. The selected instructions are transferred to the execution units. Operand and opcode are loaded into a register just in front of the execution units.
3. The execution unit reads the register, performs the action specified by the opcode, and loads the result into a register at the end of the execution unit.
4/1. (This is the same as the first cycle.) The results are written on the result busses. Window entries that need the operands load the desired results and become ready to be executed.

The situation above only holds when an instruction stays one cycle in the ALU. If an instruction stays more cycles in an execution unit (or: 'if it has a longer execution time'), the turn around time will be greater. So, the turn around time is 2 + n, with n the number of cycles the instruction stays in the ALU.
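The 2 + n relation can be made concrete with a small model of a fully dependent instruction chain, the worst case discussed above (illustrative only):

```python
# Turn around time model: with a one-cycle dispatch stage and a
# one-cycle write-back stage around an n-cycle execution unit, each
# instruction of a fully dependent chain costs 2 + n cycles.

def chain_cycles(latencies):
    # latencies: execution cycles (n) per instruction in a dependent chain
    return sum(2 + n for n in latencies)

print(chain_cycles([1, 1, 1]))   # three chained single-cycle ops → 9
print(chain_cycles([3]))         # one 3-cycle op → 5
```

For independent instructions these costs overlap, which is why the simulations cited above find the fully dependent chain to be an unlikely situation in practice.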

At the cost of more control logic, a small enhancement is possible for move-immediate or move-register-to-register instructions. Note that move instructions still have to pass the ALU because status flags have to be set. However, setting the flags will not cost a whole cycle. So, the source operand (which is also the result) can be copied to the result bus in the cycle after the instruction arrived. Of course, in order to short circuit the ALU, control logic is necessary to prevent the move instruction from overwriting other results.

4.8 Hardware counts

It is very difficult to count hardware when the details of the architecture are not known. However, it is possible to make a reasonable estimate. Of course, this can only be used as an indication of the amount of hardware. A scalable model has been made to calculate most aspects of the hardware. For more details, see appendix B.

Note that the number of instructions fetched each cycle is called the degree of a superscalar processor. This can be seen as the most important parameter. Note furthermore that only the processor core has been evaluated, because US4A only covers the processor core.

Registers just in front of the execution unit and just after the execution unit are necessary because it would take too long to transport data over the slow result busses as well as execute the instruction within one clock cycle. As a consequence, this creates a pipelined execution unit (when, at least for multiple cycle instructions, additional registers are placed in the execution unit).

Table 2. Hardware counts for three configurations (k=1000).

                          Simple RISC    RISC           VCISC          min-SS VCISC
  Instr. fetched          2              4              4              1
  ALUs                    2              4              4              1
  LSUs                    2              2              2              1
  Window length           16             64             64             4
  Source operands         2              2              3              3
  Destination ops.        1              1              2              2
  Registers               16             32             32             32
  Register width          16             32             32             32
  Address width           16             32             32             32
  Spec. paths (= # SRFs)  2              4              4              2

  Consequences (in bits)
  Memory cells            2k             15k            18k            3k
  Mux inputs              20k            178k           265k           29k
  Three-state buffers     5k             26k            32k            6k
  Comparators             400 x 5        2192 x 7       5024 x 7       376 x 3
  (number x width)
  Multiplexers            16 x 113 (5)   62 x 65 (9)    64 x 204 (9)   4 x 199 (3)
  (number x width         92 x 29 (4)    344 x 47 (6)   408 x 47 (12)  892 x 43 (4)
   (1 from ...))          44 x 16 (2)    152 x 32 (4)   152 x 32 (8)   76 x 32 (2)
  Global busses           4 x 113        8 x 165        8 x 204        2 x 199
  (number x width)        4 x 29         6 x 47         12 x 47        4 x 43
                          1 x 3          1 x 7          1 x 7          1 x 3
                          2 x 6          2 x 14         2 x 14         2 x 6
                          4 x 45         6 x 79         6 x 111        2 x 107

Four types of microprocessors have been calculated (table 2). First, a simple RISC architecture has been investigated (see the column labelled 'Simple RISC'). This is a 16-bit superscalar RISC processor of degree 2 with a small register file. Second, a 32-bit RISC architecture of degree 4 has been evaluated. This processor has 32 registers of 32 bits. Third, a VCISC (Very CISC) processor of degree 4 has been investigated. The instruction set of the VCISC processor supports up to three source operands and two destination operands. This processor fits the current parameter settings as defined in section 3.2.2. Finally, a VCISC processor with minimal superscalar concepts (min-SS VCISC) has been evaluated. It consists of a small window of 4 entries, and it executes 2 paths. This is included in the table to give an indication of the cost of the advanced superscalar architecture of the current parameter settings compared to a simple superscalar architecture.

As can be concluded from the table, a large but not exceptional amount of hardware is necessary for US4A. However, performance has its price, and as process technology improves, more transistors will become available in the near future.

4.9 Conclusion

Shifting instructions in the instruction window early in the clock cycle and loading new results or values late in the cycle is one of the key structures of US4A. Candidates to load into a window entry are new instructions, several preceding (newer) instructions (when available), and the instruction that is currently in this window entry. The basic concept consists of shifting the complete window the same amount each cycle. A more advanced shifting scheme can compress the window (remove instructions that turned out to be incorrectly predicted). This is necessary for multi-path speculative execution. A possible scheme divides the instruction window into groups; each group can shift an extra amount compared to the amount the complete window shifts.

At the end of the cycle, just before the instruction is loaded into its new position, tags are compared and results are loaded when tags match. Furthermore, it is possible to remove an instruction.

The dispatch unit contains several priority encoders to pick - within one clock cycle - the oldest instructions that are ready to be executed. These priority encoders do not perform their jobs sequentially, but concurrently. This is possible because they look at different instructions. This scheme has a performance loss of less than 5% but prevents the dispatch unit from becoming the critical path.

The turn around time, the time between dispatching an instruction and using its results, is 2 + n cycles. Hereby, 'n' is the number of clock cycles that an instruction stays in the ALU.

Hardware counts show that US4A costs a lot of hardware but is feasible, certainly in the near future.

5. Conclusions and recommendations

5.1 Conclusions

US4A, a Universal, Super Scalable, SuperScalar Architecture, is feasible without a significant increase in clock cycle time. However, it will cost a lot of hardware.

The architecture consists of out-of-order execution, register renaming, a global instruction window based on a reorder buffer principle with a future file, and multi-path speculative execution. Precise traps and interrupts are supported.

The instruction window contains source and destination operands, a part of the flag register, the program counter, and several status bits. Every instruction in the instruction window has its own label, called a tag. This tag is used to refer to an instruction. In this way, register renaming has been implemented and results coming from the execution units can be identified.

Advantages:
• Universal architecture. No specific instruction set has been assumed. Compatibility is even more important than performance, so it is an important advantage that US4A can be applied to many instruction sets.
• Scalability. US4A is highly scalable: the number of ALUs and LSUs, the window size, the number of paths, and so on. As soon as process technology improves, higher performance can be achieved with only minor design changes. Of course, also when applications or instruction sets change, fast design changes are an advantage.
• The clock cycle is comparable to that of scalar processors.

• High performance possible. Sustained execution of more than three instructions each cycle (using a RISC instruction set) is possible, and even higher performance can be achieved.
• Repeatable structures make a better layout possible. Instruction window entries are almost all the same, and even the speculative register file(s) have a lot in common with the instruction window entries.

Disadvantages:
• The processor core has to be very good.
• Universal architecture. This is both an advantage and a disadvantage: because the architecture has to be universal, not all kinds of optimizations are possible, which means there is still work to do after changing the parameters.
• Amount of hardware necessary. A lot of hardware is required; the current parameter settings are only for future hardware budgets. However, high-performance superscalar processors of any design cost a lot of hardware. Performance has its price.

5.2 Recommendations

• Several aspects of the design must still be finished. The architectural part, but especially the logical part of the design, still needs attention. In order to get a layout, hand-made layout work will be necessary; not much attention has been paid to layout design yet.
• Simulations are necessary to verify certain design choices and to determine optimal parameter settings.
• A simple version needs to be built. IDaSS, the Interactive Design and Simulation System, is a logic design tool developed at Eindhoven University of Technology, Digital Information Systems Group [Vers90]. Especially the newer version, which supports scalable design, will be a useful tool to design a simple version of US4A. It may reveal possible holes or mistakes in the design. After this, layout generation is possible.
• Because nothing convinces more than working samples, a real processor needs to be built. Nice words and figures may be interesting, but only real silicon will convince people. Furthermore, a real processor will be essential for verifying timing details.

References

[Acos86] R.D. Acosta, J. Kjelstrup, and H.C. Torng, An Instruction Issuing Approach to Enhancing Performance in Multiple Functional Unit Processors, IEEE Transactions on Computers, vol. C-35, no. 9, pp. 815-828, September 1986

[Ande91] Thomas E. Anderson, Henry M. Levy, Brian N. Bershad, et al., The Interaction of Architecture and Operating Systems Design, SIGPLAN Notices, vol. 26, no. 4, pp. 108-120, April 1991, also in: Proc. of 4th Ann. Conf. on Architectural Support for Programming Languages and Operating Systems, Santa Clara, CA, April 1991

[Aust92] Todd M. Austin and Gurindar S. Sohi, Dynamic Dependency Analysis of Ordinary Programs, Proceedings of the 19th Annual International Symposium on Computer Architecture, pp. 342-351, May 1992

[Butl92] Michael Butler and Yale Patt, An Investigation of the Performance of Various Dynamic Scheduling Techniques, SIGMICRO Newsletter, vol. 23, no. 1-2, pp. 1-9, December 1992, also in: MICRO 25, 25th Annual Symposium on Microarchitecture

[Butl93] Michael Butler and Yale Patt, A Comparative Performance Evaluation of Various State Maintenance Mechanisms, Proceedings of the 26th Annual International Symposium on Microarchitecture (Micro-26), pp. 70-79, IEEE/ACM, December 1993, also in: SIGMICRO, vol. 24, December 1993

[Cald94] Brad Calder and Dirk Grunwald, Reducing Branch Costs via Branch Alignment, Preliminary version of a paper, 11 pages, 1994

[deVo93] H.J. de Vos, IDaSS implementation of a superscalar microprocessor, Graduation Report, Eindhoven University of Technology, Eindhoven, The Netherlands, No. EB 249, 1993

[Dwye92] Harry Dwyer and H.C. Torng, An Out-of-Order Superscalar Processor with Speculative Execution and Fast, Precise Interrupts, SIGMICRO Newsletter, vol. 23, no. 1-2, pp. 272-281, December 1992, also in: MICRO 25, 25th Annual Symposium on Microarchitecture

[Fran92b] Manoj Franklin and Gurindar S. Sohi, The Expandable Split Window Paradigm for Exploiting Fine-grain Parallelism, Proceedings of the 19th Annual Symposium on Computer Architecture, Gold Coast, Australia, May 19-21 1992, pp. 58-67, May 1992

[Henn90] John L. Hennessy and David A. Patterson, Computer architecture: a quantitative approach, Morgan Kaufmann Publishers, Inc., San Mateo, CA, USA, 1990, This book can also be found with Patterson as first author, as it is the authors' policy to use both names as the first name, in order to reflect their genuine cooperation

[John91] Mike Johnson, Superscalar microprocessor design, Prentice Hall series in innovative technology, Prentice Hall, Englewood Cliffs, NJ, USA, 1991

[Joup89b] Norman P. Jouppi, The Nonuniform Distribution of Instruction-Level and Machine Parallelism and Its Effect on Performance, IEEE Transactions on Computers, vol. 38, No. 12, pp. 1645-1658, December 1989

[Kell75] R.M. Keller, Look-Ahead Processors, Computing Surveys, vol. 7, no. 4, pp. 177-195, December 1975

[Lair92] M. Laird, A comparison of three current superscalar designs, Computer Architecture News, vol. 20, no. 3, pp. 14-21, June 1992

[Lam92] M.S. Lam and R.P. Wilson, Limits of Control Flow on Parallelism, Proceedings of the 19th Annual International Symposium on Computer Architecture, pp. 342-351, May 1992

[Ligh91] Bruce D. Lightner and Gene Hill, The Metaflow Lightning Chipset, COMPCON Spring '91, Digest of Papers, pp. 13-18, March 1991

[Mena1842] L.F. Menabrea, Sketch of the analytical engine invented by Charles Babbage, Bibliothèque Universelle de Genève, Genève, Switzerland, 1842

[Pick93] James K. Pickett and David G. Meyer, Enhanced Superscalar Hardware: The Schedule Table, Proceedings of Supercomputing '93, pp. 636-643, IEEE/ACM SIGARCH, November 1993

[Pope89] Valeri Popescu, Merle A. Schultz, Gary A. Gibson, John E. Spracklen, Bruce D. Lightner, Processor architecture with decoupled instruction fetching (patent application), European application number: 90124075.4; European publication number: 0 432 774 A2; US application numbers: 451403 (15.12.89, with priority) and 622893 (5.12.90, with priority); Japanese application number: 4270421, December 1989, Patent has not been granted

[Pope91] Val Popescu, Merle Schultz, John Spracklen, et al., The Metaflow Architecture, IEEE Micro, vol. 11, no. 3, pp. 10-13, 63-73, June 1991

[Ryan94] Bob Ryan, M1 Challenges Pentium, BYTE, International edition, pp. 83-87, January 1994

[Smit85] J.E. Smith and A.R. Pleszkun, Implementations of Precise Interrupts in Pipelined Processors, Proceedings of the 12th Annual International Symposium on Computer Architecture, pp. 36-44, June 1985, same title appeared in: IEEE Transactions on Computers, vol. 3, no. 3, May 1990, pp. 349-359

[Stal93] William Stallings, Computer Organization and Architecture: Principles of Structure and Function, Third Edition, Macmillan Publishing Company, New York, USA, 1993

[Stat94] Paul Statt, Power2 Takes the Lead - For Now, BYTE, International Edition, p. 77, January 1994

[Thor70] J.E. Thornton, Design of a Computer - The Control Data 6600, Scott, Foresman and Co., Glenview, IL, USA, 1970

[Toma67] R.M. Tomasulo, An Efficient Algorithm for Exploiting Multiple Arithmetic Units, IBM Journal, vol. 11, no. 1, pp. 25-33, January 1967

[Torn84] H.C. Torng, An Instruction Issuing Mechanism for Performance Enhancement, Cornell University, Ithaca, New York, School of Electrical Engineering Technical Report EE-CEG-84-1, 1984

[Uht93] Augustus K. Uht, Extraction of Massive Instruction Level Parallelism, Computer Architecture News, vol. 21, no. 3, pp. 5-12, June 1993, In vol. 21, no. 1, only the pages 8-11 are printed.

[Vers90] A.C. Verschueren, IDaSS for ULSI, Interactive Design and Simulation System for Ultra Large Scale Integration, Manual for IDaSS, Eindhoven University of Technology, Digital Information Systems Group, Eindhoven, The Netherlands, July 1990

[Yeh92a] Tse-Yu Yeh and Y.N. Patt, Alternative implementations of two-level adaptive branch prediction, Proceedings of the 19th Annual International Symposium on Computer Architecture, pp. 124-134, May 1992

[Yeh93] Tse-Yu Yeh and Yale N. Patt, Branch History Table Indexing to Prevent Pipeline Bubbles in Wide-Issue Superscalar Processors, Proceedings of the 26th Annual International Symposium on Microarchitecture (Micro-26), pp. 164-175, IEEE/ACM, December 1993, Also in: SIGMICRO, vol. 24, December 1993

Appendix A. Related publications

[Adam93] R.G. Adams, S.M. Gray, and G.B. Steven, HARP: A Statically Scheduled Multiple-Instruction-Issue Architecture and its Compiler, University of Hertfordshire, Hatfield, Herts, UK, Technical Report No. 163, September 1993, Also submitted to 2nd Euromicro Workshop on Parallel and Distributed Processing, Malaga, Spain, January 1994

[Butl91] M. Butler, Y. Patt, and T.Y. Yeh, Single Instruction Stream Parallelism Is Greater than Two, IEEE 18th Annual International Symposium on Computer Architecture, pp. 276-286, ACM, New York, USA, 1991

[Chan90] P.P. Chang et al., Code Optimization Techniques for Multiple-Instruction-Issue architectures, Center for Reliable and High-Performance Computing Report, University of Illinois, 1990, in preparation (maybe ready today)

[Chan91] Pohua P. Chang et al., IMPACT: An Architectural Framework for Multiple-Instruction-Issue Processors, Proceedings of the 18th Annual International Symposium on Computer Architecture, pp. 266-275, May 1991

[Chen92] J. Bradley Chen, Anita Borg, and Norman P. Jouppi, A Simulation Based Study of TLB Performance, Proceedings of the 19th Annual Symposium on Computer Architecture, Gold Coast, Australia, May 19-21 1992, pp. 114-123, May 1992

[Cmel91] Robert F. Cmelik, Shing I. Kong, David R. Ditzel, et al., An Analysis of MIPS and SPARC Instruction Set Utilization on the SPEC Benchmarks, Sigplan Notices, vol. 26, no. 4, pp. 290-302, 1991

[Coll93] R. Collins, Developing a simulator for the Hatfield superscalar processor, University of Hertfordshire, Hatfield, Herts, UK, Technical Report No. 172, December 1993

[Corp92] Henk Corporaal and Hans Mulder, MOVE: A framework for high-performance processor design, Proceedings Supercomputing '91, pp. 692-701, 1991

[Dief92] K. Diefendorff and M. Allen, Organization of the Motorola 88110 Superscalar RISC Microprocessor, IEEE Micro, vol. 12, no. 2, pp. 35-91, April 1992

[Ditz87] D. Ditzel and H. McLellan, Branch Folding in the CRISP Microprocessor: Reducing Branch Delay to Zero, Proceedings of the 14th Annual Symposium on Computer Architecture, pp. 2-9, 1987

[Dube94] Pradeep K. Dubey, George B. Adams III, and Michael J. Flynn, Instruction Window Size Trade-Offs and Characterization of Program Parallelism, IEEE Transactions on Computers, vol. 43, no. 4, pp. 431-442, April 1994

[Emma87] P.G. Emma and E.S. Davidson, Characterization of Branch and Data Dependences in Programs for Evaluating Pipeline Performance, IEEE Computer, vol. 36, no. 7, pp. 859-879, July 1987

[Ertl94] M. Anton Ertl and Andreas Krall, Delayed Exceptions - Speculative Execution of Trapping Instructions, Preliminary version of a paper for CC'94 in Edinburgh, April 94, March 1994

[Fern92] Edit S.T. Fernandes and Fernando M.B. Barbosa, Effects of building blocks on the performance of super-scalar architecture, Proceedings of the 19th International Symposium on Computer Architecture, pp. 36-45, IEEE/ACM, May 1992

[Fran92a] M. Franklin and G.S. Sohi, A new paradigm for exploiting fine-grain parallelism, Proceedings of the Twenty-Fifth Hawaii International Conference on System Sciences, vol. 1, pp. 4-13, January 1992

[Fuku91] Akiza Fukuda, KRPP and DSNS superscalar architectures using task and instruction level parallelism, IEEE Micro, pp. 16-19 and 50-61, August 1991

[Furb87] S.B. Furber, VLSI RISC Architecture and Organisation, 1987

[Gonz91] Antonio M. Gonzalez and Jose M. Llaberia, Reducing Branch Delay to Zero in Pipelined Processors, IEEE Transactions on Computers, vol. 42, no. 3, ..., March 1993

[Gros82] T.R. Gross, J.L. Hennessy, S.A. Przybylski, and C. Rowen, Measurement and Evaluation of the MIPS Architecture and Processor, ACM Transactions on Computer Systems, vol. 6, no. 3, pp. 229-257, August 1988

[Hana91] Hanawa et al., On-chip multiple superscalar processors with secondary cache memories, IEEE International Conference on Computer Design, pp. 128-131, 1991

[Hill93] Mark D. Hill et al., Wisconsin Architectural Research Tool Set, Computer Architecture News, vol. 21, no. 4, pp. 8-11, December 1993

[Huck89] Jerome C. Huck and Michael J. Flynn, Analyzing computer architectures, IEEE Computer Society Press, Washington, 1989

[Hwan93] Kai Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability, McGraw-Hill Series in Computer Science, McGraw-Hill, Inc., New York, USA, 1993

[Hwu86] W. Hwu and Y.N. Patt, HPSm, a High Performance Restricted Data Flow Architecture Having Minimal Functionality, Proceedings of the 13th Annual Symposium on Computer Architecture, pp. 297-307, June 1986

[Hwu87] W. Hwu and Y.N. Patt, Checkpoint Repair for Out-of-Order Execution Machines, Proceedings of the 14th Annual Symposium on Computer Architecture, pp. 18-26, June 1987

[Inou93] Atsushi Inoue and Kenji Takeda, Performance Evaluation for Various Configurations of Superscalar Processors, Computer Architecture News, vol. 21, no. 1, pp. 4-11, March 1993

[Joup89a] N.P. Jouppi and D.W. Wall, Available Instruction-Level Parallelism for Superscalar and Superpipelined Machines, Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 272-282, ACM Press, New York, USA, 1989

[Kael91] David R. Kaeli and Philip G. Emma, Branch History Table Prediction of Moving Target Branches Due to Subroutine Returns, Proceedings of the 19th Annual International Symposium on Computer Architecture, pp. 34-41, May 1992

[Kato92] Takaaki Kato, Toshihisa Ono, and Nader Bagherzadeh, Performance Analysis and Design Methodology for a Scalable Superscalar Architecture, SIGMICRO Newsletter, vol. 23, no. 1-2, pp. 246-255, December 1992, also in: MICRO 25, 25th Annual Symposium on Microarchitecture

[Kern93] Daniel R. Kerns and Susan J. Eggers, Balanced Scheduling: Instruction scheduling when memory latency is uncertain, Proceedings of the ... Conference on programming language design and implementation / ACM Sigplan Notices, vol. 28, no. 6, pp. 278-289, June 1993

[Kuga91] Morihiro Kuga, Kazuaki Murakami, Shinji Tomita, DSNS (Dynamically-hazard-resolved, Statically-code-scheduled, Nonuniform Superscalar): Yet Another Superscalar Processor Architecture, Computer Architecture News, vol. 19, no. 4, pp. 14-29, December 1991

[Lee84] J.K.F. Lee and A.J. Smith, Branch Prediction Strategies and Branch Target Buffer Design, IEEE Computer, vol. 17, no. 1, pp. 6-22, January 1984

[Logr79] L. Logrippo, Renamings, maximal parallelism, and space-time tradeoffs in program schemata, Journal of the ACM, vol. 26, no. 4, pp. 819-..., 1979

[Matt91] Wolfgang Matthes, How Many Operation Units are Adequate?, Computer Architecture News, vol. 19, no. 4, pp. 94-108, December 1991

[McFa86] S. McFarling and J. Hennessy, Reducing the costs of branches, Proceedings of the 13th annual international symposium on computer architecture, pp. 396-403, June 1986

[Melv91] Steve Melvin and Yale Patt, Exploiting Fine-Grained Parallelism Through Combined Hardware and Software Techniques, Proceedings of the 18th Annual Symposium on Computer Architecture, pp. 287-296, ACM, New York, USA, May 1991

[Oyan90] Yen-Jen Oyang, Chun-Hung Wen, Yu-Fen Chen, and Shu-May Lin, The effect of employing advanced branching mechanisms in superscalar processors, Computer Architecture News, vol. 18, no. 4, pp. 35-51, December 1990

[Oyan91] Yen-Jen Oyang, Exploiting multi-way branching to boost superscalar processor performance, SIGPLAN Notices, vol. 26, no. 3, pp. 68-78, March 1991

[Oyan93] Yen-Jen Oyang, Microprocessing & Microprogramming, vol. 36, no. 4, pp. 205-214, September 1993

[Patt85a] Y.N. Patt, W. Hwu, and M.C. Shebanow, HPS, A New Microarchitecture: Rationale and Introduction, Proceedings of the 18th Annual Workshop on Microprogramming, pp. 103-108, December 1985

[Patt85b] Y.N. Patt, S.W. Melvin, W. Hwu, and M.C. Shebanow, Critical Issues Regarding HPS, A High Performance Microarchitecture, Proceedings of the 18th Annual Workshop on Microprogramming, pp. 109-116, December 1985

[Patt86] Y.N. Patt, M.C. Shebanow, W. Hwu, and S.W. Melvin, A C Compiler for HPS I, A Highly Parallel Execution Engine, Proceedings, 19th Hawaii International Conference on System Sciences, ..., January 1986

[Pint93] Shlomit S. Pinter, Register Allocation with Instruction Scheduling: a New Approach, Proceedings of the ... Conference on programming language design and implementation / ACM Sigplan Notices, vol. 28, no. 6, pp. 248-257, June 1993

[Rise72] Edward M. Riseman and Caxton C. Foster, The Inhibition of Potential Parallelism by Conditional Jumps, IEEE Transactions on Computers, vol. C-21, no. 12, pp. 1405-1411, December 1972

[Russ92] Gordon Russel and Paul Shaw, Shifting Register Windows, [IEEE] Micro, vol. 13, no. 4, ..., April 1992

[Smit89a] M.D. Smith, M. Johnson, and M.A. Horowitz, Limits on Multiple Instruction Issue, Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 290-302, April 1989

[Smit89b] James E. Smith, Dynamic Instruction Scheduling and The Astronautics ZS-1, [IEEE] Computer [Magazine], vol. 22, no. 7, pp. 21-35, IEEE, July 1989

[Smit90] M.D. Smith, M.S. Lam, and M.A. Horowitz, Boosting beyond static scheduling in a superscalar processor, Proceedings of the 17th Annual International Symposium on Computer Architecture, pp. 344-354, May 1990

[Sohi87] G.S. Sohi and S. Vajapeyam, Instruction Issue Logic for High-Performance Interruptable Pipelined Processors, Proceedings of the 14th Annual Symposium on Computer Architecture, pp. 27-34, June 1987, This is a preliminary paper of [Sohi90]

[Sohi89] G.S. Sohi and S. Vajapeyam, Tradeoffs in Instruction Format Design for Horizontal Architectures, Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, April 1989

[Sohi90] G.S. Sohi, Instruction Issue Logic for High-Performance, Interruptible, Multiple Functional Unit, Pipelined Computers, IEEE Transactions on Computers, vol. 39, no. 3, pp. 349-359, March 1990

[Sohi91] Gurindar S. Sohi and Manoj Franklin, High-Bandwidth Data Memory Systems for Superscalar Processors, SIGPLAN Notices, vol. 26, no. 4, ..., April 1991, also in: Proc. of 4th Ann. Conf. on Architectural Support for Programming Languages and Operating Systems, Santa Clara, CA, April 1991

[Step91] Chriss Stephens et al., Instruction Level Profiling and Evaluation of the IBM RS/6000, Proceedings of the 18th Annual International Symposium on Computer Architecture, pp. 180-189, ACM, New York, USA, May 1991

[Stev93b] Fleur L. Steven, Gordon B. Steven and Liang Wang, An Evaluation of the Architectural Features of the iHARP Processor, University of Hertfordshire, Hatfield, Herts, UK, Technical Report No. 170, December 1993

[Stev93b] F.L. Steven, R.G. Adams, G.B. Steven, L. Wang, and D.J. Whole, Addressing Mechanisms for VLIW and Superscalar Processors, Microprocessing and Microprogramming / Proceedings of the Euromicro '93 Conference, vol. 39, no. 2-5, December 1993

[Theo92] K.B. Theobald, G.R. Gao, and L.J. Hendren, On the limits of program parallelism and its smoothability, SIGMICRO Newsletter, vol. 23, no. 1-2, pp. 10-19, December 1992, also in: Proceedings of the 25th Annual International Symposium on Microarchitecture

[Tjad70] G.S. Tjaden and M.J. Flynn, Detection and parallel execution of independent instructions, IEEE Transactions on Computers, vol. C-19, no. 10, pp. 889-895, October 1970

[Tjad72] G.S. Tjaden, Representation and Detection of Concurrency using Ordering Matrices, dissertation, The Johns Hopkins University, Baltimore, MD, USA, 1972, also in IEEE Transactions on Computers, vol 22, no 8, pp. 752-761

[Torn93] H.C. Torng and Martin Day, Interrupt Handling for Out-of-Order Execution Processors, IEEE Transactions on Computers, vol. 42, no. 1, pp. 122-127, January 1993

[Tran92] Thang Tran and Chuan-lin Wu, Limitation of Superscalar Microprocessor Performance, SIGMICRO Newsletter, vol. 23, no. 1-2, pp. 33-36, December 1992, also in: MICRO 25, 25th Annual Symposium on Microarchitecture

[Uht92] Augustus K. Uht, Concurrency Extraction via Hardware Methods Executing the Static Instruction Stream, IEEE Transactions on Computers, vol. 41, no. 7, pp. 820-841, July 1992

[Ulla93] Nasr Ullah and Matt Holle, The MC88110 Implementation of Precise Exceptions in a Superscalar Architecture, Computer Architecture News, vol. 21, no. 1, pp. 15-25, March 1993

[Vaja91] Sriram Vajapeyam, Gurindar S. Sohi, and Wei-Chung Hsu, An Empirical Study of the CRAY Y-MP Processor using the PERFECT Club Benchmarks, Proceedings of the 18th Annual International Symposium on Computer Architecture, pp. 170-179, ACM, New York, USA, May 1991

[Vassi92] Stamatis Vassiliadis, Bart Blaner, and Richard J. Eickemeyer, On the Attributes of the SCISM Organization, Computer Architecture News, vol 20, no. 4, pp. 44-53, December 1992

[Wall91] D.W. Wall, Limits of instruction-level parallelism, SIGPLAN Notices, vol. 26, no. 4, pp. 176-188, April 1991, also in: Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems

[Wang92] Chia-Jiu Wang and Frank Emnett, Implementing precise interruptions in pipelined RISC processors, [IEEE] Micro, vol. 13, no. 4, ..., August 1993

[Wang93] L. Wang, Instruction Scheduling for a Family of Multiple Instruction Issue Architectures, dissertation, University of Hertfordshire, Hatfield, Hertfordshire, U.K., September 1993

[Warr90] H.S. Warren, Instruction scheduling for the IBM RISC System/6000 processor, IBM Journal of Research and Development, vol. 34, no. 1, ..., January 1990

[Weis84] S. Weiss, J.E. Smith, Instruction Issue Logic in Pipelined Supercomputers, IEEE Transactions on Computers, pp. 1013-1022, 1984

[Wilk92] Kent D. Wilken and David W. Goodwin, Toward Zero-Cost Branches Using Instruction Registers, SIGMICRO Newsletter, vol. 23, no. 1-2, pp. 214-217, December 1992, also in: MICRO 25, 25th Annual Symposium on Microarchitecture

[Yeh92b] Tse-Yu Yeh and Yale N. Patt, A Comprehensive Instruction Fetch Mechanism for a Processor Supporting Speculative Execution, SIGMICRO Newsletter, vol. 23, no. 1-2, pp. 129-139, December 1992, also in: MICRO 25, 25th Annual Symposium on Microarchitecture

Appendix B. Hardware counts

The hardware counts were made using the spreadsheet program Quattro Pro (version 5.0 for DOS). The spreadsheet consists of three 'pages'16: the overall counts, the instruction window counts, and the speculative register file counts. Most of the relations have been implemented by using variable names. A list of these names has also been included in this appendix.

As has been explained in section 4.8, it is hard to count hardware when not all hardware has been designed yet. However, these counts will give reasonable estimations.

16 In Quattro Pro's terminology, a '3-dimensional' spreadsheet has been used. However, one can better look at this as three pages or layers (overall, instruction window, speculative register file).

[Spreadsheet pages 'Totals', 'Instruction Window', and 'Speculative Register Files': tables of parameters, bit widths, I/O ports, comparators, multiplexers, and bus counts. The column layout of these tables has not survived the text extraction. Among the recoverable totals: a 204-bit instruction window entry (64 entries, 13056 bits of window storage), a 47-bit result bus (result, tag, and flag part), and 1161 bits per speculative register file, giving 4644 bits for the four SRFs.]
Total SRF bits 1161 SRFentries: 38 Incl. IPandSP 4644 bits For simplicity, normals ffIfJS, flag part rBgS. PC, SP and flOmIai rBg tJrfI COflsidBr9d fK1lIai ::r a Variable names ~ ADDRESS_WIDTH Address width of the memory interface ~ () ALU Number of ALUs o Number of exec unit classes CLASSES S.en CLASS_ID_WIDTH Number of bits to specify a exec.unit class DEST_OP Number of destination operands FETCH Number of Fetch Units (inst's fetched per cycle FLAG_CLASSES Number of classes the flag reg Is divided in FLAG_PART_SIZE Size of flag register parts FLAG_REG_WIDTH Width of complete flag register IW_E Number of IW entries IW_WIDTH Number of bits in a IW-entry LSU Number of Load/Store units OPCODE_WIDTH Opcode width (in bits) PATH_ID_WIDTH Number of bits necessary to identify a path REGS Number of registers REG_ID_WIDTH Number of bits to specify a register REG_WIDTH Register width RESULT_BUSSES Number of results buffers RES_BUS_WIDTH Result bus width (result+tag+flag part) RETIRED Instr's retired each cycle SHIFT Number of positions to shift an instr. each cycle SOURCE_OP Number of source operands SRF Number of SR Fs SRF_ENTRIES Number of entries in a SRF SUBPATH Number of paths between branches (subpaths) TAG_OVERKILL Extra bits in tag to overcome short of tags due to compressing the TAG_WIDTH Number of bits of a tag