CALIFORNIA STATE UNIVERSITY NORTHRIDGE

TOMASULO ARCHITECTURE BASED MIPS

A graduate project submitted in partial fulfilment of the requirements

for the degree of Master of Science

in Electrical Engineering

By

Sameer S Pandit

May 2014

The graduate project of Sameer Pandit is approved by:

______

Dr. Ali Amini, Ph.D. Date

______

Dr. Shahnam Mirzaei, Ph.D. Date

______

Dr. Ramin Roosta, Ph.D., Chair Date

California State University Northridge

ii

ACKNOWLEDGEMENT

I would like to express my gratitude towards all the members of the project committee and thank them for their continuous support and for mentoring me at every step I have taken towards the completion of this project: Dr. Roosta for his guidance, Dr. Mirzaei for his ideas, and Dr. Amini for his utmost support.

I would also like to thank my family and friends for their love, care and support through all the tough times during my graduation.

iii

Table of Contents

SIGNATURE PAGE ...... ii

ACKNOWLEDGEMENT ...... iii

LIST OF FIGURES ...... vii

ABSTRACT ...... x

Chapter 1: Introduction ...... 1

1.1 Introduction to RISC and CISC ...... 1

1.1.1 RISC versus CISC ...... 2

1.1.2 CISC Architecture ...... 3

1.1.3 RISC Architecture ...... 4

1.1.4 Performance Equation ...... 6

Chapter 2: Types of CPU ...... 7

2.1 Single-Cycle CPU Method ...... 7

2.2 Instruction Formats ...... 8

R-Type Instruction Format ...... 9

I-Type Instruction Format ...... 11

Branch Instruction ...... 12

J-Type Instruction Format ...... 12

2.3 Multi Cycle CPU ...... 13

iv

Chapter 3: Pipelining and MIPS ...... 16

3.1 Pipelining ...... 16

3.2 The Classic 5 Stage Pipelined Processor...... 17

3.3 Hazards ...... 18

3.4 Data Forwarding ...... 19

3.5 MIPS Implementation ...... 21

3.5.1 Instruction Fetch ...... 23

3.5.2 Instruction Decode ...... 24

3.5.3 Execute Stage ...... 25

3.5.4 Memory stage ...... 26

3.5.5 Write Back Stage ...... 27

3.5.6 ALU and Control to Complete MIPS ...... 27

Chapter 4: Dynamic Scheduling ...... 31

4.1 Types of scheduling ...... 31

4.2 Types of Data Hazards ...... 32

4.3 The Tomasulo Approach ...... 33

4.3.1 Register Renaming ...... 37

4.4 Working Example of Tomasulo ...... 39

Chapter 5: Simulation and Synthesis ...... 47

5.1 Simulation using ModelSim ...... 47

v

5.2 Simulation using VCS...... 55

5.3 Synthesis Results ...... 60

Chapter 6: Conclusion and Modifications ...... 63

6.1 Conclusion ...... 63

6.2 Modifications ...... 63

References ...... 65

Appendix A: Code Listing and Synthesis Scripts ...... 67

Appendix B: Simulation waveforms showing all Operations...... 72

vi

LIST OF FIGURES

Figure 1: Storage scheme of a generic Computer ...... 2

Figure 2: Single-Cycle Method [5] ...... 7

Figure 3: R-Type Register Format [6] ...... 9

Figure 4: Data Path for R-Type format [7] ...... 10

Figure 5: I-Type Instruction Format [8]...... 11

Figure 6: I-Type Data Path [9] ...... 11

Figure 7: J-Type Instruction Format ...... 12

Figure 8: Data Path for Jump Instruction Format [9] ...... 13

Figure 9: Multi Cycle CPU Data Path [5] ...... 14

Figure 10: A Pipelined RISC Data Path ...... 18

Figure 11: Forwarding Path for the above example. [10] ...... 21

Figure 12: 5 Stage Pipelined MIPS [11] ...... 22

Figure 13: IF Data Path ...... 23

Figure 14: ID Data Path ...... 24

Figure 15: EX Data Path ...... 25

Figure 16: MEM Data Path ...... 26

Figure 17: MIPS after appending ALU Control [12]...... 28

Figure 18: Controller in MIPS ...... 29

Figure 19: 5 Stage Pipelined MIPS Processor [12] ...... 30

Figure 20: MIPS architecture with Tomasulo Algorithm [12] ...... 36

Figure 21: Tomasulo Example - Clock Cycle 0 ...... 39

Figure 22: Tomasulo Example - Clock Cycle 1 ...... 40

vii

Figure 23: Tomasulo Example - Clock Cycle 2 ...... 41

Figure 24: Tomasulo Example - Clock Cycle 3 ...... 41

Figure 25: Tomasulo Example - Clock Cycle 4 ...... 42

Figure 26: Tomasulo Example - Clock Cycle 5 ...... 43

Figure 27: Tomasulo Example - Clock Cycle 6 ...... 43

Figure 28: Tomasulo Example - Clock Cycle 7 ...... 44

Figure 29: Tomasulo Example - Clock Cycle 56 ...... 45

Figure 30: Tomasulo Example - Clock Cycle 57 ...... 45

Figure 31: Top Level Design of Tomasulo ...... 47

Figure 32: Schematic for Tomasulo ...... 48

Figure 33: Simulation waveform for Tomasulo ...... 49

Figure 34: How my Design works ...... 51

Figure 35: Synthesis for Tomasulo Using ModelSim ...... 52

Figure 36: Schematic of Test.v used as top module ...... 56

Figure 37: Memory Unit Schematic ...... 57

Figure 38: Instruction Queue Schematic ...... 57

Figure 39: Simulation Waveform Using Synopsys (1) ...... 58

Figure 40: Simulation Waveform Using Synopsys (2) ...... 58

Figure 41: Simulation Waveform Using Synopsys (3) ...... 59

Figure 42: Simulation Waveform Using Synopsys (4) ...... 59

Figure 43: Top-Down Area Report ...... 60

Figure 44: Top-Down Timing Report ...... 61

Figure 45: Bottom-up Timing Report ...... 62

viii

Figure 46: Bottom-Up Area Report ...... 62

Figure 47: Instruction Queue ...... 72

Figure 48: Load 1 Operation - Instruction 1 ...... 72

Figure 49: Load 2 Operation - Instruction 2 ...... 73

Figure 50: Multiply 1 Operation - Instruction 3 ...... 73

Figure 51: Subtract Operation - Instruction 4 ...... 74

Figure 52: Multiply Operation - Instruction 5 ...... 74

Figure 53: Add Operation - Instruction 6 ...... 75

ix

ABSTRACT

Tomasulo Architecture based MIPS Processor

By

Sameer S Pandit

Master of Science in Electrical Engineering

The main goal of this project is to design a processor based on an architecture of choice. I have chosen to create an out-of-order execution processor based on the Tomasulo architecture. The main feature of this architecture is instruction-level parallelism (ILP), since independent instructions are executed concurrently. ILP can be exploited in two ways: with hardware that detects parallelism dynamically at run time, or with software that detects parallelism statically at compile time. Robert Tomasulo developed the algorithm at IBM in 1967.

The project focuses on the design and implementation of a Tomasulo-based MIPS architecture processor. The principles of out-of-order execution, the concept of dynamic scheduling, and Tomasulo's approach are explained. The handling of the various data hazards, WAR (Write After Read), RAW (Read After Write) and WAW (Write After Write), which Tomasulo's approach suppresses or eliminates, is also implemented.

x

Chapter 1: Introduction

My graduate project mainly focuses on the design and implementation of the Tomasulo architecture on a classic 5-stage MIPS processor. The design is implemented in the Verilog hardware description language, and synthesis of the design is done using Synopsys Design Compiler. The five main stages of the MIPS processor are Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory (MEM) and Write Back (WB). The main goal of my project is to integrate the three stages of the Tomasulo architecture into these five stages of the MIPS processor. The three main stages of the Tomasulo architecture are Issue (IS), Execute (EX), and Write Back (WB).

Microprocessor without Interlocked Pipeline Stages (MIPS) is a Reduced Instruction Set

Computer (RISC) Instruction Set Architecture (ISA). More details on MIPS and Tomasulo will be given in future chapters. For now, let us start with the basics of CPU.

1.1 Introduction to RISC and CISC

A Complex Instruction Set Computer (CISC) is a computer in which a single instruction can perform a complicated task that would otherwise require multiple simple instructions [2].

RISC is a design strategy based on the idea of simplifying the instructions. RISC does not mean that the number of instructions is reduced compared to CISC. It means that each complex CISC instruction is simplified so that the same task is done by breaking it down into simpler sub-instructions. Hence, the number of lines of assembly code is larger than for the equivalent CISC code, but the individual instructions are simpler and easier to understand [3].

1.1.1 RISC versus CISC

The simplest way to understand the difference(s) between the two architectures or design strategies is by comparing them with each other. Let us look at a simple example that will illustrate these differences. Consider the figure below [4].

Figure 1: Storage scheme of a generic Computer

The memory is divided into rows and columns: rows numbered 1 to 6 and columns numbered 1 to 4. The execution unit carries out all the computations. However, the execution unit can only work on data that has been transferred into one of the registers, A to F. Assume that we want to multiply two numbers, one stored in memory location R2:C3 and the other in location R5:C2, and to store the result back in location R2:C3 [4].

2

1.1.2 CISC Architecture

The main goal of the CISC architecture is to finish a task in as few lines of assembly code as possible. This is achieved by building processor hardware that can understand and execute a sequence of operations on its own. For this task, the CISC processor provides a single instruction, which we know as "MULT". When this instruction is executed, it loads the two operands into registers, multiplies them in the execution unit, and then puts the result into the desired destination. Hence, the whole task of multiplying two numbers and storing the result back into the memory location boils down to a single instruction:

MULT R2:C3, R5:C2

MULT here is what we call a "complex instruction". It operates directly on the computer's memory and does not require the programmer to call any explicit load/store instructions. CISC instructions resemble high-level language commands: if "A" represents the value stored in memory location R2:C3 and "B" the value stored in location R5:C2, then the command is equivalent to the C statement "A = A * B".

The main advantage of this system is that the compiler has to do very little work to translate a high-level language statement into assembly. Because the code is short, little memory is needed to store the instructions. The design effort instead goes into building complex instructions directly into the hardware.

3

1.1.3 RISC Architecture

The RISC architecture only uses simple instructions, each of which executes in one clock cycle. Hence, the "MULT" instruction mentioned above is divided into three commands: "LOAD", which transfers data from a memory location to a register, "PROD", which computes the product of two operands held in registers, and "STORE", which transfers data from a register back to memory. To perform the same task, the programmer would write the following lines of assembly:

LOAD A, R2:C3

LOAD B, R5:C2

PROD A, B

STORE R2:C3, A

This may look like a very inefficient way of performing the operation: there are more lines of code, and more memory (RAM) is required to store the assembly-level instructions. There is also more pressure on the compiler, since it must translate each high-level language statement into several assembly instructions.

However, the RISC architecture has its own advantages. As each instruction needs only one clock cycle to complete, the whole program takes approximately the same amount of time as the multi-cycle "MULT" command. These RISC "reduced instructions" need less hardware space than the complex instructions, leaving more space for general-purpose registers. And because all of the instructions complete in one clock cycle, pipelining is possible.

4

Treating the "LOAD" and "STORE" instructions as separate operations also reduces the amount of work the computer must perform. After a CISC-style "MULT" is executed, the processor automatically clears the registers, so if one of the operands is needed for another computation, the processor must re-fetch the data from memory and load it into a register. In RISC, the operand stays in the register until a new value is written in its place.

Therefore, from the above example and the understanding we have gained from it, we can list the differences between CISC and RISC as follows:

CISC                                            RISC

Emphasis on hardware                            Emphasis on software

Multi-clock complex instructions                Single-clock simple instructions

Memory-to-memory operations                     Register-to-register operations

LOAD/STORE incorporated in a single             LOAD and STORE are two independent
instruction                                     instructions

Small code size                                 Larger code size

Cycles per instruction (CPI) is high            CPI is low

Transistors are used for storing complex        More transistors are spent on memory
instructions                                    registers

Table 1: RISC and CISC

5

1.1.4 Performance Equation

The following equation is commonly used to express a computer's performance [4]:

time/program = (instructions/program) x (cycles/instruction) x (time/cycle)

CISC tries to minimize the number of instructions per program at the cost of a higher number of cycles per instruction. RISC does exactly the opposite: it reduces the cycles per instruction at the cost of a larger number of instructions per program [4].
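As a purely illustrative example with hypothetical numbers: suppose a CISC program needs 10 instructions averaging 6 cycles each, while the equivalent RISC program needs 30 single-cycle instructions. With the same 100 ns clock period, the CISC version takes 10 x 6 x 100 ns = 6 us and the RISC version takes 30 x 1 x 100 ns = 3 us; which approach wins in practice depends entirely on the instruction counts, CPI and clock rates actually achieved.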

6

Chapter 2: Types of CPU

Implementation of the Central Processing Unit (CPU) can be done in two ways:

•  Single-Cycle CPU

•  Multi-Cycle CPU

2.1 Single-Cycle CPU Method

A single-cycle CPU means that all the instructions take a single clock cycle to complete, i.e. they all take an equal amount of time. But the main question to ask is: how do we determine the clock cycle? There may be a million instructions in the CPU, but the duration of a clock cycle will always be the time taken by the slowest instruction to complete. The "LOAD" instruction usually tends to be the slowest, as it takes time to access the memory. All the other instructions might finish well before the clock tick; in such cases, the instruction simply has to wait for the next clock cycle before the next instruction can execute.

Figure 2: Single-Cycle Method [5]

7

The above figure shows the data path for an I-type branch instruction. The read address of the instruction memory is generated from the program counter (PC). The output of the instruction memory is broken into parts: one part is sent to the register file to select the registers whose data is needed, and another part is sent to the sign-extend module, which extends the 16-bit immediate to 32 bits. The outputs of the register file go to the ALU, which performs the arithmetic operation selected by the op-code. The ALU result is sent to the data memory, which takes care of writing the result back to the specified memory location.

How the program counter (PC) is updated is decided by the comparison between the operands. If the operands are equal, the PC is incremented by four and the value offset*4 is added to it. If the operands are not equal, the offset is discarded and the PC is simply incremented by four.
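The next-PC selection just described can be sketched in Verilog roughly as follows. This is only an illustration; the signal names are mine and not taken from the project's code.

module next_pc (
    input  wire [31:0] pc,          // current program counter
    input  wire [31:0] offset_ext,  // sign-extended 16-bit branch offset
    input  wire        is_branch,   // the current instruction is a branch
    input  wire        zero,        // ALU reports that the two operands are equal
    output wire [31:0] pc_next
);
    wire [31:0] pc_plus4   = pc + 32'd4;
    wire [31:0] branch_tgt = pc_plus4 + (offset_ext << 2);  // offset * 4 added to PC + 4

    // Take the branch target only when the instruction is a branch and the
    // operands compared equal; otherwise just step to the next instruction.
    assign pc_next = (is_branch && zero) ? branch_tgt : pc_plus4;
endmodule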

2.2 Instruction Formats

There are mainly four instruction formats used in this CPU:

•  R-Type

•  I-Type

•  Branch (a variant of the I-Type format)

•  J-Type (Jump)

The instruction size used in my design is 32 bits. The instructions are stored in and selected from the instruction memory. The first 6 bits are the op-code, and the remaining bits determine the registers and addresses, depending on the type of instruction. Let us look at each type of instruction individually.

R-Type Instruction Format

The R-Type format is meant only for register-to-register instructions. It divides the instruction into six fields, as shown in Figure 3.

•  The op-code field tells us what kind of operation is implied by the instruction.

•  Rs and Rt are the first and second source registers, respectively, used to hold the operands.

•  Rd is the destination register, in which the result of the ALU operation specified by the instruction is stored.

•  Shamt specifies the number of bit positions to shift in a shift operation.

•  Funct selects the particular variant of the operation indicated by the op field.

Figure 3: R-Type Register Format [6]

The data path for an R-Type instruction is as shown below:

9


Figure 4: Data Path for R-Type format [7]

In the R-Type instruction format, the data is read from two source registers, Rs and Rt respectively. The operation is performed in the ALU and the final result is stored in the destination register Rd. This instruction format is mainly used for instructions such as add, sub, and, or etc.

Example: add $Rd, $Rs, $Rt
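A minimal Verilog sketch of how a 32-bit R-Type word splits into these fields is shown below. The field widths (6/5/5/5/5/6) are the standard MIPS ones; the module and wire names are illustrative only.

module rtype_fields (
    input  wire [31:0] instr,
    output wire [5:0]  opcode,
    output wire [4:0]  rs, rt, rd,
    output wire [4:0]  shamt,
    output wire [5:0]  funct
);
    assign opcode = instr[31:26];  // kind of operation
    assign rs     = instr[25:21];  // first source register
    assign rt     = instr[20:16];  // second source register
    assign rd     = instr[15:11];  // destination register
    assign shamt  = instr[10:6];   // shift amount
    assign funct  = instr[5:0];    // selects the variant of the operation
endmodule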

10

I-Type Instruction Format

The I-Type format is mainly used for immediate-type instructions, primarily "LOAD" and "STORE", and it is also used for branch instructions.

Figure 5: I-Type Instruction Format [8]

•  The op-code field tells us what kind of operation is implied by the instruction.

•  Rs and Rt are the same fields as described in the previous section.

•  The address/immediate field holds either a 16-bit offset or an immediate operand.

The figure below shows the data path for an I-Type instruction format.

Figure 6: I-Type Data Path [9]

Here, we can see that the second source register Rt can be used both as a source register and as a destination register, depending on the instruction. The I[15:11] field shown here is used when the instruction is of the R-Type format. The immediate value is sign-extended and then sent to the ALU for the desired operation.

Example: addi $Rt, $Rs, 10 ; where 10 is the immediate value.
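The sign extension mentioned above amounts to replicating the top bit of the immediate; a one-line Verilog sketch (the names are mine, not the project's) is:

module sign_extend (
    input  wire [15:0] imm,      // 16-bit immediate / offset field
    output wire [31:0] imm_ext   // 32-bit sign-extended value
);
    // Copy the sign bit imm[15] into the upper 16 bits.
    assign imm_ext = {{16{imm[15]}}, imm};
endmodule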

Branch Instruction

Both the figure and the description of the branch instruction were given in Section 2.1.

J-Type Instruction Format

These instruction formats are mainly used for jump instructions. The instruction format is as shown below.

Figure 7: J-Type Instruction Format

The figure clearly shows that there are only two fields in this instruction format: a six-bit op-code field that indicates which operation is to be performed, and a 26-bit address field that points to the memory location where the jump is to take place.

12

Figure 8: Data Path for Jump Instruction Format [9]

From the figure, we see that the upper four bits of PC+4, bits [31:28], are concatenated with the 26-bit instruction field from the IM shifted left by two, to form the final 32-bit jump address.

Example: j dest

where j means jump and dest refers to the target address to which the jump is made.
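The jump-address formation just described can be sketched as follows; this is illustrative only, with my own names.

module jump_target (
    input  wire [31:0] pc_plus4,    // PC + 4 of the jump instruction
    input  wire [25:0] addr_field,  // 26-bit address field from the IM
    output wire [31:0] jump_addr
);
    // Upper four bits come from PC+4; the 26-bit field is shifted left by two,
    // i.e. padded with two zero bits, to form a word-aligned 32-bit address.
    assign jump_addr = {pc_plus4[31:28], addr_field, 2'b00};
endmodule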

2.3 Multi Cycle CPU

In a multi-cycle CPU, an instruction takes more than one clock cycle to complete, thereby increasing the clocks per instruction (CPI). For example, a load instruction needs the largest number of clock cycles to complete, whereas other instructions can take fewer. Like a single-cycle CPU, a multi-cycle CPU also requires a control unit; the main difference between the two is that a multi-cycle CPU uses a finite state machine (FSM) to generate its control signals, whereas a single-cycle CPU uses purely combinational logic for the same purpose.

13

Figure 9: Multi Cycle CPU Data Path [5]

From the above figure, the first difference we notice between the multi-cycle and single-cycle designs is that the multi-cycle design adds extra registers between the stages. Another difference is that there is only one ALU, which is also used to increment the program counter; its result is held in an ALUOut register between cycles, something the single-cycle design does not need. The program counter is incremented in the first clock cycle, after which the ALU is free for other operations. The various stages in a multi-cycle CPU are:

•  Instruction Fetch: the PC is incremented and the instruction register (IR) is loaded with the instruction that the PC is pointing to.

•  Instruction Decode: the values read from the register file are put into the A and B registers, and ALUOut is loaded with the branch target address.

14

•  Execute: the ALU operates on the operands as required by the instruction, and the result is loaded into the ALUOut register. Instructions that need a further memory access or register write continue to the next stage; otherwise the cycle returns to Instruction Fetch and a new instruction is loaded into the IR.

•  Memory: any instruction that needs a fourth step enters this stage. The three classes of instruction that require it are load, store and arithmetic. A load reads the value at the computed memory address; a store writes the value from one of the registers into that address; an arithmetic (R-type) instruction instead writes the ALUOut result into its destination register. In the case of store and arithmetic instructions, we then return to the Instruction Fetch stage.

•  Write Back: only load instructions reach this stage. The value read from memory is written into one of the registers, after which we again go back to the Instruction Fetch stage.

A sketch of the controlling state machine follows this list.
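The following rough Verilog sketch shows such a state machine. The state names follow the list above; the decoded-instruction inputs and their names are my own assumptions, and this is not the project's actual controller.

module multicycle_fsm (
    input  wire clk,
    input  wire reset,
    input  wire is_load,    // decoded: load instruction
    input  wire is_store,   // decoded: store instruction
    input  wire is_arith,   // decoded: arithmetic (R-type) instruction
    output reg  [2:0] state
);
    localparam IF = 3'd0, ID = 3'd1, EX = 3'd2, MEM = 3'd3, WB = 3'd4;

    always @(posedge clk) begin
        if (reset)
            state <= IF;
        else case (state)
            IF:  state <= ID;
            ID:  state <= EX;
            // Loads, stores and arithmetic instructions need a fourth step;
            // anything else (e.g. a branch) finishes here and fetches again.
            EX:  state <= (is_load || is_store || is_arith) ? MEM : IF;
            // Only loads need the fifth, write-back step.
            MEM: state <= is_load ? WB : IF;
            WB:  state <= IF;
            default: state <= IF;
        endcase
    end
endmodule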

15

Chapter 3: Pipelining and MIPS

MIPS stands for "Microprocessor without Interlocked Pipeline Stages" and is a RISC-based architecture. My design implements the Tomasulo architecture on the classic 5-stage MIPS architecture, which is a 32-bit processor; hence all bit sizes and widths are chosen with this as the reference [1].

3.1 Pipelining

Before we jump into the MIPS architecture, let us first understand the concept of pipelining. Pipelining refers to performing more than one task at a time in a single data path. As a simple example, suppose there are two instructions that need completely different resources to complete, but the second is waiting for the first to finish. Instead of waiting, the second instruction can start while the first is still executing, since they use different resources, giving the impression that they execute in parallel without disturbing one another.

For this to happen, the data path is broken into smaller parts, or pipeline stages. At every clock cycle, an instruction moves into the next pipeline stage, which allows another instruction to begin in the stage it just left without interfering with it. Therefore, all stages in the path can be used in parallel, and pipelining increases the throughput of the design.

The goal in designing a pipelined processor is mainly to equalize the timing of the pipeline stages. This is possible only in the ideal case; in practice the stages cannot be divided perfectly, and pipelining itself adds some timing overhead. The throughput therefore increases, but not by as much as the ideal calculation suggests.

Let us be clear about one thing though. Pipelining technique improves the average execution time per instruction, not the execution time of individual instructions. The change in the time can be noticed by observing the number of clocks per instruction (CPI).

3.2 The Classic 5 Stage Pipelined Processor

For all its advantages, pipelining also introduces new issues, though not all of them are fundamental; some can be designed around. The main issue to think about is the use of the same data path resources for different operations in the same clock cycle. For example, a single ALU cannot calculate an effective address (EA) and perform another operation in the same cycle. Preventing such conflicts is the work of the designer, but fortunately, because of the simplicity of RISC instructions, the resource analysis becomes much easier. Some of the other issues that have to be addressed, and their solutions, are:

•  The memory is divided into two parts, instruction memory and data memory. This separation removes a hazard that would otherwise occur between the IF stage and the memory access stage. It also means that a pipelined processor needs roughly five times the memory bandwidth of a processor without pipelining running at the same clock frequency.

17

•  The register file is accessed twice for each instruction: it is read in the decode stage when the operands are fetched, and written in the write-back stage when the result is stored. Both a read and a write can therefore fall in the same clock cycle. This is handled by building the register file from master-slave flip-flops, writing on the rising edge of the clock and reading on the falling edge. (A small sketch of such a register file is given after this list.)

•  The PC has to be incremented and stored in every clock cycle if we want to fetch a new instruction every cycle, and this has to take place during the fetch stage.
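Below is a minimal sketch of a register file written on the rising clock edge and read on the falling edge, as described in the second point above. This is illustrative only; the project's actual register file may be organized differently.

module regfile_2r1w (
    input  wire        clk,
    input  wire        we,                      // write enable (WB stage)
    input  wire [4:0]  waddr, raddr1, raddr2,
    input  wire [31:0] wdata,
    output reg  [31:0] rdata1, rdata2
);
    reg [31:0] regs [31:0];

    // Write in the first half of the cycle (rising edge); register $zero is never written.
    always @(posedge clk)
        if (we && waddr != 5'd0) regs[waddr] <= wdata;

    // Read in the second half of the cycle (falling edge), so a value written
    // earlier in the same cycle is already visible to the reader.
    always @(negedge clk) begin
        rdata1 <= (raddr1 == 5'd0) ? 32'd0 : regs[raddr1];
        rdata2 <= (raddr2 == 5'd0) ? 32'd0 : regs[raddr2];
    end
endmodule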

Taking everything into account, pipelining increases the instruction throughput, but it cannot reduce the time needed for any single instruction to complete. The figure below shows a simple model of a pipelined RISC data path.

Figure 10: A Pipelined RISC Data Path

3.3 Pipeline Hazards

In any pipelined processor, a hazard would be defined as any situation that would prevent the next instruction in the pipeline from executing. Such hazards cause longer execution time, and thus lower performance as compared to an ideal pipelined processor. There are mainly 3 kinds of hazards:

18

•  Structural hazards: these occur when more than one instruction tries to use the same hardware resource in the same cycle.

•  Data hazards: these occur when there is a data dependency between two instructions.

•  Control hazards: these are the result of branch instructions and any other instructions that can change the value of the program counter.

Sometimes an instruction has to be stalled in the pipeline because of these hazards. When an instruction is stalled, all instructions issued after it are stalled as well, while instructions issued before it must be allowed to proceed; otherwise the hazard will never clear and the program will be stuck, since no new instruction can be fetched.

3.4 Data Forwarding

Data forwarding is a hardware method to suppress the effect of data hazards. Let us discuss the data forwarding method in detail.

The results obtained in the execution stage and the memory access stage are fed back to the ALU inputs through a multiplexer. We then continuously compare the source registers of the instruction in the execution stage with the destination register of the instruction in the memory access stage; if they match, the operand is taken from that stage instead of from the value read in the Instruction Decode stage. Detecting the hazards and producing the correct control signals is entirely the job of the forwarding unit.

19

Data forwarding is not restricted to the ALU inputs: a result can be forwarded from the output of any pipeline register to the input of any functional unit that needs it, as long as the producing and consuming instructions are at most two clock cycles apart.

Let us look at a simple example for data forwarding. Consider the code below:

ADD $2, $3, $4

LW $5, 20($2)

SW $6, 30($2)

In the above example, the result of the ADD and the data-memory output have to be forwarded to the ALU and memory inputs to remove these hazards. In the fourth cycle, the new value of register $2 computed by the ADD is needed by the LW for its address calculation, so it is forwarded from the ALU output back to the ALU input. In the next clock cycle, the value of $2, which has not yet been written into the register file, is needed for the address calculation of the SW, so it is forwarded from the MEM stage to the ALU input. The figure below depicts this [10].

20

Figure 11: Forwarding Path for the above example. [10]

The problem is that all data hazards cannot be taken care of with data forwarding technique.
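As a rough illustration of the comparisons described in this section, a forwarding unit for the ALU inputs could be sketched as below. The select encodings and signal names are my own, not those of the MIPS reference design or of this project.

module forward_unit (
    input  wire [4:0] ex_rs, ex_rt,                    // source registers of the instruction in EX
    input  wire [4:0] exmem_rd, memwb_rd,              // destination registers further down the pipeline
    input  wire       exmem_regwrite, memwb_regwrite,  // those instructions actually write a register
    output reg  [1:0] fwd_a, fwd_b                     // 00: use ID/EX value, 10: EX/MEM, 01: MEM/WB
);
    always @(*) begin
        fwd_a = 2'b00;
        fwd_b = 2'b00;

        // Forward from MEM/WB (result produced two cycles earlier)...
        if (memwb_regwrite && memwb_rd != 5'd0 && memwb_rd == ex_rs) fwd_a = 2'b01;
        if (memwb_regwrite && memwb_rd != 5'd0 && memwb_rd == ex_rt) fwd_b = 2'b01;

        // ...but the EX/MEM result, being more recent, takes priority.
        if (exmem_regwrite && exmem_rd != 5'd0 && exmem_rd == ex_rs) fwd_a = 2'b10;
        if (exmem_regwrite && exmem_rd != 5'd0 && exmem_rd == ex_rt) fwd_b = 2'b10;
    end
endmodule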

3.5 MIPS Implementation

Any MIPS instruction can be executed in at most five clock cycles. The MIPS processor on which I implement the Tomasulo architecture is the classic 5-stage pipelined MIPS, and it therefore has the five stages shown below:

21

Figure 12: 5 Stage Pipelined MIPS [11]

The figure above shows the 5 stages of MIPS:

•  IF (Instruction Fetch)

•  ID (Instruction Decode)

•  EX (Execute)

•  MEM (Memory)

•  WB (Write Back)

We also see from the figure above that these five stages are separated by pipeline registers; IF/ID, ID/EX, etc. are the names of these pipeline registers. Now let us discuss each of the five stages of our MIPS in more detail [11].

22

3.5.1 Instruction Fetch

The program counter (PC) is used to fetch the instruction from the instruction memory (IM); the instruction is put into the instruction register (IR) at the positive edge of the next clock cycle. The PC always points to the next instruction to be fetched, i.e. to the address in the IM where that instruction is stored, and it is incremented every cycle so that it keeps pointing to the next instruction. The figure below shows the data path for the IF stage.

Figure 13: IF Data Path

23

3.5.2 Instruction Decode

In the ID stage, the instruction held in the IR is decoded: its contents are analyzed to extract the op-code, function field, register addresses and so on. Based on the instruction, the required operands are read from the register file. The register file provides the values of the two registers A and B, which are passed to the ALU through the ID/EX pipeline register. The figure below shows the data path for the ID stage.

Figure 14: ID Data Path

24

3.5.3 Execute Stage

The instructions are completed by execution in this stage. It performs operations on the operands decoded by the ID in the ID/EX pipeline. The final result is sent to the ALU out register which is in the EX/MEM pipelined stage. The data path for the execute stage is as shown below:

Figure 15: EX Data Path

25

3.5.4 Memory stage

The memory stage provides access to the data memory in the design. The data memory has a single read port and a single write port and is 64 entries deep. The data that is read is stored in a 32-bit data-out register of the memory. The figure below shows the data path of the MEM stage.

Figure 16: MEM Data Path

26

3.5.5 Write Back Stage

The write back stage is where the final result is written back into the register file.

3.5.6 ALU and Control to Complete MIPS

There are more than just these 5 stages that make the MIPS what it is. There are two more modules which are equally important for the design. They are:

•  ALU

•  Control Unit

Let us look into these modules briefly.

3.5.6.1 ALU

ALU stands for Arithmetic and Logic Unit. As the name suggests, it takes care of all the arithmetic and logical operations of the processor, and it operates in the EX stage. The operation to be performed is selected by the ALU control module, based on the op-code and function field. The figure below shows the MIPS design after the ALU control has been added.

27

Figure 17: MIPS after appending ALU Control [12]

3.5.6.2 Control Unit

The control unit is the heart of the design and its most important part. It generates the control signals for the various functional units and other modules, and these signals are carried along with the instruction through the pipeline registers.

28

Figure 18: Controller in MIPS

After adding this control module, the design of the MIPS is complete. The figure below shows the 5-stage pipelined MIPS after both the ALU control and the control unit have been added.

29

Figure 19: 5 Stage Pipelined MIPS Processor [12]

30

Chapter 4: Dynamic Scheduling

The main concept of dynamic scheduling is out-of-order execution of instructions. In the software, of course, the instructions remain in program order; it is their execution that takes place out of order, so the first instruction need not be executed before the second. This kind of non-sequential execution is also referred to as a non-Von Neumann style of execution.

Scheduling means re-arranging the order of instruction execution in order to maximize the throughput and performance of the system. To schedule properly, the scheduler mainly needs two things: knowledge of the processor architecture, and knowledge of instruction latencies and dependencies.

4.1 Types of scheduling

There are mainly two types of scheduling methods.

•  Static scheduling – performed by software, i.e. the compiler. Static scheduling relies on in-order instruction issue, meaning that if an instruction is stalled, the instructions behind it cannot execute at all. Multiple instances of a functional unit can sit unused, which leads to inefficiency.

•  Dynamic scheduling – performed by hardware. Dynamic scheduling allows out-of-order execution and completion of instructions. Even if one instruction is stalled, later instructions that are not waiting for the same resources and have no dependency on it can proceed to completion.

31

The main reason for discussing scheduling is to handle, or eliminate, data hazards, so let us briefly look at the different types of data hazards and how they arise.

4.2 Types of Data Hazards

Data hazards can be categorized as shown below:

•  RAW (Read After Write) – Consider two instructions A and B, with A earlier in program order. A RAW hazard occurs when B tries to read a register before A has written it, so B picks up a stale value. This hazard is caused by a true data dependence; to make sure B gets the right data, the read must happen after A's write.

•  WAW (Write After Write) – This hazard occurs when B writes its result to a register before A does, so A's older value ends up in the destination register instead of B's. It is caused by an output dependence on the destination register.

•  WAR (Write After Read) – This hazard occurs when B writes to a register before A has read it, so A reads the wrong (new) value. Such hazards usually appear when the order of instruction execution is changed.

Both Data and Structural hazards are taken care of during the ID stage. If there are no hazards noticed, or they can be taken care of with the help of forwarding, then the instruction gets issued to the next stage.

32

There are mainly 2 techniques used for dynamic scheduling.

•  The scoreboarding technique

•  The Tomasulo algorithm

We will not discuss the scoreboarding technique in depth; the main focus is the Tomasulo algorithm, since this is what the whole project is based on. So let us discuss what the Tomasulo algorithm is.

4.3 The Tomasulo Approach

The Tomasulo Algorithm is a hardware based algorithm developed in 1967 by Robert

Tomasulo for IBM. It allows sequential instructions that would normally be stalled due to certain dependencies to execute non-sequentially or out of order [13].

In the Tomasulo approach, the control logic and buffers are distributed among the functional units (FUs); these per-unit buffers are known as reservation stations (RSs). The register operands in the instructions are replaced either by values or by pointers to the RSs that will produce them, a technique known as register renaming, which we will look at shortly. The main advantage of Tomasulo's algorithm is that it can remove hazards that even the compiler cannot: since there are more reservation stations than architectural registers, renaming can eliminate WAR and WAW hazards that a compiler could not remove.

All data transferred to the functional units (FUs) comes from these RSs rather than directly from the registers. The data travels to the functional units over the Common Data Bus (CDB). The CDB broadcasts each result to all the FUs, and the FU that is waiting for that result picks it up from the CDB. Once an FU has completed its operation, the result is broadcast back to the registers in the same way. This avoids RAW hazards, because an instruction executes only when its operands are available.

The difference between a normal bus and the CDB is that on a normal bus the data is placed on the bus along with its destination, and it is delivered to that particular destination. On the CDB, the data is placed on the bus tagged with who produced it, and this broadcast goes to all the FUs and their corresponding RSs. A destination that is waiting for that producer recognizes the tag, in effect saying "this is mine, I am waiting for it", and picks the data off the CDB; all other destinations ignore the broadcast.

Load and Store are treated as two completely different FUs as they both have their own

RSs. Each RS consists of 7 fields. Let us look into these.

Reservation Station Components:

•  Op: the operation to be performed by the functional unit (such as add or subtract), taken from the instruction's op-code.

•  Vj, Vk: the values of the source operands, once they have been resolved.

•  Qj, Qk: the reservation stations that will produce the pending operands (for each operand, the V and Q fields are mutually exclusive).

•  Busy: the status of the RS and its corresponding FU.

•  A: a field used only for memory-address calculations.

•  Qi: the register-status field kept with each GPR, identifying the RS that will produce its value.

A Verilog sketch of these fields follows.
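In Verilog, these fields could be held as a small set of registers per reservation station, roughly as in the sketch below. The widths are illustrative assumptions (4-bit operation code, 4-bit RS tags with 0 meaning "no producer", 32-bit values); the project's actual field names and widths may differ.

module rs_entry_fields;
    reg        busy;     // the entry and its functional unit are in use
    reg [3:0]  op;       // operation to perform (add, subtract, ...)
    reg [31:0] Vj, Vk;   // operand values, once they have been resolved
    reg [3:0]  Qj, Qk;   // tags of the RSs that will produce the pending operands
    reg [31:0] A;        // address field, used only for memory operations
endmodule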

34

There are mainly 3 stages in the Tomasulo algorithm. Let us look at what those 3 stages are in detail.

•  Issue stage (IS) – instructions are taken from the FP operation queue. In this stage:

o  For an FP operation: if a matching RS is free, issue the instruction to it and send along the operands that are already available in registers.

o  For a load or store: if a load/store buffer is available, issue the instruction.

o  If no RS or buffer is free, there is a structural hazard and the instruction is stalled in this stage.

o  Register renaming takes place in this stage.

•  Execute stage (EX) – the operation is performed on the operands. In this stage:

o  When an operand becomes available, its value is placed into the waiting RS.

o  If an operand is not yet ready, the RS continuously monitors the CDB for it.

o  When both operands are available, the operation specified by the op-code is executed.

o  Delaying execution until the operands are ready is what avoids RAW hazards.

•  Write back stage (WB) – completes the instruction.

o  When the result is available, it is broadcast on the CDB and from there to the registers and any waiting RSs.

o  The RS is then marked available again.

A Verilog sketch of the issue-time renaming check follows this list.
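This issue-time check could look roughly like the following sketch. The reg_tag array plays the role of the Qi field kept with each register; all names, widths and the 0-means-ready convention are my own assumptions rather than the project's code.

module issue_rename (
    input  wire        clk,
    input  wire        reset,
    input  wire        issue,       // an instruction is issued this cycle
    input  wire [4:0]  src_reg,     // architectural source register of that instruction
    input  wire [4:0]  dest_reg,    // its destination register
    input  wire [3:0]  dest_tag,    // the reservation station it was issued to
    output reg  [3:0]  Qj,          // tag of the RS that will produce the operand (0 = ready)
    output reg  [31:0] Vj           // operand value, valid when Qj is 0
);
    reg [3:0]  reg_tag   [0:31];    // Qi per register: producing RS, 0 = none
    reg [31:0] reg_value [0:31];    // committed register values (CDB write-back updates omitted here)
    integer i;

    always @(posedge clk) begin
        if (reset) begin
            for (i = 0; i < 32; i = i + 1) reg_tag[i] <= 4'd0;
        end else if (issue) begin
            // Operand lookup: copy the value if no RS is pending, else record the tag.
            if (reg_tag[src_reg] == 4'd0) begin
                Vj <= reg_value[src_reg];
                Qj <= 4'd0;
            end else begin
                Qj <= reg_tag[src_reg];
            end
            // Rename the destination: it will now be produced by dest_tag.
            reg_tag[dest_reg] <= dest_tag;
        end
    end
endmodule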

The Tomasulo architecture used for our implementation is as shown below.

35

Figure 20: MIPS architecture with Tomasulo Algorithm [12]

The main difference between the Tomasulo and scoreboarding techniques is that Tomasulo implements register renaming. This removes a flaw of scoreboarding, where instructions cannot be issued if a previous instruction is stalled due to a dependency; with register renaming, the Tomasulo architecture can keep issuing instructions. The major differences between the two techniques are listed below.

36

TOMASULO                                            SCOREBOARD

Control and buffers are distributed with           Control and buffers are centralized.
the functional units, where they are known
as "reservation stations".

Registers in the instruction are replaced          Does not use register renaming.
by pointers to the reservation stations
(register renaming).

Uses hardware renaming of registers to             Does not use this method.
avoid the WAR hazard.

The CDB broadcasts results to all the              There is no CDB; results are sent to the
functional units.                                  specific registers.

The load and store units are treated as            All functional units are combined into one.
two different functional units.

Table 2: Tomasulo and Scoreboard technique

4.3.1 Register Renaming

Register renaming can be described as the process by which name dependencies between operands are dynamically identified by the hardware and the registers involved are replaced, or substituted, with 'dummy' register identifiers. In the Tomasulo scheme these dummy registers are the reservation stations. Let us look at an example.

37

Example:

•  Let us consider the code shown below for our design:

F1 <- F3 / F5

F7 <- F1 + F9

Mem[1+R2] <- F7

F9 <- F11 - F15

F7 <- F11 x F9

•  We can observe in the above code that there are both name dependencies and data dependencies, leading to WAR and WAW hazards.

•  How can we eliminate these hazards? We use the register renaming technique. Suppose we introduce two 'dummy' registers, say X and Y, under hardware control. The code changes as shown below.

F1 <- F3 / F5

X <- F1 + F9

Mem[1+R2] <- X

Y <- F11 - F15

F7 <- F11 x Y

38

•  From the above code, we can see that both the WAR and WAW hazards have been removed, as the renamed code no longer contains the offending name dependencies.

4.4 Working Example of Tomasulo

To understand how the Tomasulo approach works and how register renaming takes place, let us look at an example. Consider the code below:

LOAD F6 <- 34 + R2
LOAD F2 <- 45 + R3
MUL  F0 <- F2 * F4
SUB  F8 <- F2 - F3
DIV  F10 <- F0 / F6
ADD  F6 <- F8 + F2

Let us now see how this would work on every clock cycle.

Clock Cycle 0

Figure 21: Tomasulo Example - Clock Cycle 0

39

The above figure describes what happens at clock 0. Also, it shows the different parts of the algorithm that will help us understand this example better. All these parts are marked in red in the figure. At cycle 0, the instructions are ready to be issued.

Clock Cycle 1

Figure 22: Tomasulo Example - Clock Cycle 1

In clock cycle 1, the first load instruction has been issued and the load buffer associated with it receives the address 34+R2; its status changes to busy. The result is destined for the FP register F6, which is why, in the FP register file, F6 now holds a pointer to the load buffer Load 1.

40

Clock Cycle 2

Figure 23: Tomasulo Example - Clock Cycle 2

In cycle 2, the second load instruction is issued; the load buffer (reservation station) associated with it receives the address 45+R3, its status changes to busy, and the FP register F2 holds a pointer to the buffer Load 2, since F2 is the register that will receive this result. Load 2 is used here because Load 1 is already busy.

Clock Cycle 3

Figure 24: Tomasulo Example - Clock Cycle 3

41

In cycle 3, the MULT instruction is issued and assigned to the reservation station Mult 1, as shown in the figure. Since F0 is the FP register that will receive this result, it holds a pointer to the Mult 1 reservation station. One operand, F4, is already available and is not being produced by any instruction in the queue, so its value is written directly under Vk; the other operand is still to be produced by Load 2, and hence Load 2 appears as the pointer under Qj.

Clock Cycle 4

Figure 25: Tomasulo Example - Clock Cycle 4 In cycle 4, we see that Load 1 has completed and the final result has been stored in F6, and

Load 1 is not busy anymore. The next instruction SUBD has been issued and the reservation station Add 1 associated with it gets the instruction. Since one of the operands has been already produced by F6, the value of that is stored under Vj. The other operand is still not available from Load 2, and hence, the pointer to Load 2 can be seen under Qk. Since F8 is the FP register producing the result for this instruction, we can see the pointer to the reservation station Add 1 has been loaded into the F8 register.

42

Clock Cycle 5

Figure 26: Tomasulo Example - Clock Cycle 5 In Cycle 5, we see a lot of activity. The 2nd load instruction completes in this cycle and the

Load 2 reservation station is freed. And the result is now stored in the F2 register instead of a pointer. A new instruction DIVD is issued and the reservation station Mult 2 gets this instruction as Mult 1 is still busy. As F10 is the register that will be producing this result,

F10 gets a pointer to the reservation station Mult 2 instead of the value.

Clock Cycle 6

Figure 27: Tomasulo Example - Clock Cycle 6

43

From the figure for cycle 6, we can see that none of the other instructions have completed yet. The next instruction, ADD, is issued in this cycle; it is assigned to the reservation station Add 2, because Add 1 is already busy, and F6 gets a pointer to this reservation station, since F6 is the register that will receive the result.

Clock Cycle 7

Figure 28: Tomasulo Example - Clock Cycle 7 In Cycle 7, we see that all the instructions have already been issued by now. Since multiply and divide operations take longer time to complete, we see here that the SUB instruction completes before the MULT instruction. Since the result has not yet been written, the Add

1 reservation is still busy as it holds the instruction and also, F8 still holds the pointer to

Add 1.

Let us skip some clock cycles now that all the instructions are issued. We will now continue from cycle 56.

44

Clock Cycle 56

Figure 29: Tomasulo Example - Clock Cycle 56 In this cycle, we can see that almost all instructions have been completed and also written back, however the DIVD instruction has just completed and has not been written yet.

Hence, Mult 2 is still busy and the F10 register still holds the pointer to Mult 2 instead of the real value. All other registers now have their real value.

Clock Cycle 57

Figure 30: Tomasulo Example - Clock Cycle 57

45

This is the last cycle of the example. Here, all the instructions have been issued, executed and written back, which completes the three stages of the Tomasulo algorithm, and all the FP registers now hold real values instead of pointers to reservation stations.

46

Chapter 5: Simulation and Synthesis

The simulation of my design was done on two platforms, ModelSim and Synopsys VCS, and the synthesis was done using Synopsys Design Compiler. I designed the project using the Verilog hardware description language, and the verification of the design was also done in Verilog.

5.1 Simulation using ModelSim

The Tomasulo architecture was implemented on a classic 5-stage pipelined MIPS. To be clear, the Tomasulo architecture was implemented around the ALU of the MIPS, so the full MIPS architecture does not come into the picture in my design. The top-level design of my project is shown in the figure below.

Figure 31: Top Level Design of Tomasulo

The top-level design instantiates all the other modules, including the individual functional units and reservation stations. The figure below shows the schematic of the complete design.

47

Figure 32: Schematic for Tomasulo

I wrote a test bench for my design in Verilog. The code below shows the test bench.

module test;
  reg clk;
  reg reset;
  wire [7:0] data_out;

  tomasulo TOMASULO (clk, reset, data_out);

  always #5 clk = ~clk;

  initial begin
    clk   = 0;
    reset = 1;
    #10 reset = 0;
    #1000 $finish;
  end
endmodule

The below figure shows the simulation waveform for my design.

Figure 33: Simulation waveform for Tomasulo

From the above figure and simulation results, we can notice the following:

•  Issue stage – The instruction queue holds a FIFO list of all the instructions in program order. This is evident from the following excerpt of my instruction queue code (inst_queue.v):

inst_array[0] <= 32'b0001_0010_0000_0110_00100010_00000000; // LOAD F6 <- MEM(R2+34)  // 5
inst_array[1] <= 32'b0001_0011_0000_0010_00101101_00000000; // LOAD F2 <- MEM(R3+45)  // 9
inst_array[2] <= 32'b0101_0010_0100_0000_00000000_00000000; // MUL F0 <- F2 * F4      // 36
inst_array[3] <= 32'b0100_0010_0011_1000_00000000_00000000; // SUB F8 <- F2 - F3      // 6
inst_array[4] <= 32'b0101_0001_0110_1010_00000000_00000000; // MUL F10 <- F1 * F6     // 5
inst_array[5] <= 32'b0011_1000_0010_0110_00000000_00000000; // ADD F6 <- F8 + F2      // 15
inst_array[6] <= 32'd0;
inst_array[7] <= 32'd0;

If the reservation station at the instruction's functional unit is free, the instruction is assigned to that reservation station; if it is not free, the instruction is stalled and a structural hazard is detected. The data structure of the RS supports the resolution of WAR and WAW hazards, as discussed earlier.

•  Execute stage – The first thing that happens in this stage is a check for RAW and WAW hazards. If the operands are not ready yet, i.e. they are still being evaluated, the reservation station waits for an update on the CDB; if the operands are ready, the instruction is queued for execution. It is possible for the operands of several instructions to become ready in the same cycle; the FP (floating point) queue structure resolves this arbitrarily. The load and store units have been designed with a FIFO queue to avoid RAW and WAW hazards. Note, however, that my design does not implement STORE instructions or DIV instructions. Branch instructions are completed before any following instruction is executed.

•  Write back stage – The reservation stations and the GPRs are updated as soon as execution is completed.

Let me explain a part of the waveform to show how the Tomasulo algorithm works.

Figure 34: How my Design works

The above waveform shows that register renaming is taking place as desired. In the results of the ADD, SUB and MUL operations, we can also see that each register keeps its previous value until the result is updated over the common data bus (common_data), and the same holds for the other registers.
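The behaviour seen in the waveform, where each register keeps its old value until the matching tag appears on the common data bus, corresponds to snooping logic of roughly the following shape. This is a sketch with my own names; it is not the project's FP_reg code.

module fp_reg_snoop (
    input  wire        clk,
    input  wire        reset,
    input  wire        cdb_valid,   // a result is being broadcast this cycle
    input  wire [3:0]  cdb_tag,     // which reservation station produced it
    input  wire [31:0] cdb_data,    // the broadcast result (cf. common_data)
    input  wire        rename,      // this register becomes the destination of a newly issued RS
    input  wire [3:0]  rename_tag,
    output reg  [31:0] value,
    output reg  [3:0]  tag          // 0 means the value is valid, no producer pending
);
    always @(posedge clk) begin
        if (reset) begin
            tag   <= 4'd0;
            value <= 32'd0;
        end else if (rename) begin
            tag <= rename_tag;                                  // register renaming at issue
        end else if (cdb_valid && tag != 4'd0 && cdb_tag == tag) begin
            value <= cdb_data;                                  // capture the broadcast result
            tag   <= 4'd0;                                      // no longer waiting
        end
        // In all other cycles the old value is simply held, which is what the waveform shows.
    end
endmodule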

The figure below shows the synthesis report extracted from ModelSim for my design.

51

Figure 35: Synthesis for Tomasulo Using ModelSim

The report below, also extracted from ModelSim, shows the power consumption and other properties of the design:

X-Power Report

Power summary: I (mA) P (mW)

------

Total estimated power consumption: 1021

---

Vccint 1.20V: 341 410

Vccaux 2.50V: 245 612

Vcco25 2.50V: 0 0

---

52

Clocks: 0 0

Inputs: 0 0

Logic: 0 0

Outputs:

Vcco25 0 0

Signals: 0 0

---

Quiescent Vccint 1.20V: 341 410

Quiescent Vccaux 2.50V: 245 612

Thermal summary:

------

Estimated junction temperature: 35C

Ambient temp: 25C

Case temp: 35C

Theta J-A: 10C/W

Decoupling Network Summary: Cap Range (uF) #

------

53

Capacitor Recommendations:

Total for Vccint: 140

470.0 - 1000.0: 2

4.70 - 10.00: 8

0.470 - 2.200: 16

0.0470 - 0.2200: 28

0.0100 - 0.0470: 44

0.0010 - 0.0047: 42

---

Total for Vccaux: 34

470.0 - 1000.0: 1

4.70 - 10.00: 2

0.470 - 2.200: 4

0.0470 - 0.2200: 6

0.0100 - 0.0470: 10

0.0010 - 0.0047: 11

---

54

Total for Vcco25: 17

470.0 - 1000.0: 1

4.70 - 10.00: 1

0.470 - 2.200: 2

0.0470 - 0.2200: 3

0.0100 - 0.0470: 5

0.0010 - 0.0047: 5

Analysis completed: Wed Apr 09 12:11:10 2014

5.2 Simulation using Synopsys VCS

55

The code was again verified with Synopsys VCS and the same result was obtained. The various figures below show the schematics and the output waveforms extracted from Synopsys.

Figure 36: Schematic of Test.v used as top module

The figure above shows the schematic of the test bench used as the top module. The inputs are just the clock and reset, and the output is the final data, data_out.

And below, we have the schematic of our Memory unit.

56

Figure 37: Memory Unit Schematic

The below figure shows the schematic of the instruction queue in the design.

Figure 38: Instruction Queue Schematic

57

The figures below show the output simulation waveforms for our design.

Figure 39: Simulation Waveform Using Synopsys (1)

Figure 40: Simulation Waveform Using Synopsys (2)

58

Figure 41: Simulation Waveform Using Synopsys (3)

Figure 42: Simulation Waveform Using Synopsys (4)

Note that the Store signals appear entirely red (unknown values) because I have not included the 'STORE' instruction in my design, although the store unit itself has been designed.

59

5.3 Synthesis Results

The synthesis was done using both the top-down and the bottom-up approach, and the design was synthesized with a 90 nm library. With the top-down methodology, I found that my design functioned best when run at 10 MHz. The corresponding area and timing reports are shown below:

Figure 43: Top-Down Area Report

60

Figure 44: Top-Down Timing Report

For the bottom-up approach, I found that my design works the best at 14.81 MHz as shown by the area and time reports below.

61

Figure 45: Bottom-up Timing Report

Figure 46: Bottom-Up Area Report

62

Chapter 6: Conclusion and Modifications

6.1 Conclusion

Based on the analysis of the design and the simulation waveforms, the implemented design behaves as desired for all the instructions included in it, and the project has been completed successfully. The main goal of this project was to gain knowledge in coding and verification. I have learnt to use tools such as Synopsys VCS, gained an understanding of how synthesis works, and seen how a design reacts to changes in power and timing depending on the constraints applied to it. Speaking of constraints, the constraints for synthesis were set using a TCL script, so I have also gained at least a basic knowledge of TCL scripting. The main goal of the project was achieved in the end.

6.2 Modifications

I had to cut down a few things in my design due to a shortage of time and the complexity of the functionality. My design does not include BRANCH instructions, and it does not implement DIV instructions because of the complexity of that instruction. The output of the design is shown entirely through the simulation waveforms that the design produces; I have not built a standalone simulator that visualizes the working of the Tomasulo algorithm, which could be a future enhancement. I have also not taken the design through to the GDSII stage using IC Compiler, as I had initially planned. These are some of the enhancements that could be implemented for my design in the future.

64

References

1. http://en.wikipedia.org/wiki/MIPS_architecture Retrieved April 2014

2. http://en.wikipedia.org/wiki/Complex_instruction_set_computing Retrieved April

2014

3. http://en.wikipedia.org/wiki/Reduced_instruction_set_computing Retrieved April

2014

4. http://cs.stanford.edu/people/eroberts/courses/soco/projects/2000-01/risc/risccisc/

Retrieved April 2014

5. M.S.Schmalz(n.d.), Organization of Computer Systems,

http://www.cise.ufl.edu/~mssz/CompOrg/CDA-pipe.html Retrieved April 2014

6. http://stackoverflow.com/questions/6929440/what-does-func-means-in-r-format-

instruction-set Retrieved April 2014

7. http://www.cise.ufl.edu/~mssz/CompOrg/CDA-proc.html Retrieved April 2014

8. http://fourier.eng.hmc.edu/e85_old/lectures/instruction/node10.html Retrieved

April 2014

9. MIPS architecture, http://pages.cs.wisc.edu/~smoler/x86text/lect.notes/MIPS.html

Retrieved April 2014

10. http://scholarworks.csun.edu/handle/10211.2/3332 Retrieved April 2014

11. http://www.cs.nyu.edu/courses/fall09/V22.0436-001/lecture16.html Retrieved

April 2014

12. David A. Patterson, John L. Hennessy: Computer Organization and Design – The

Hardware/Software Interface. Fourth Edition (2006). Morgan Kaufmann

Publisher, Inc. Retrieved April 2014

65

13. http://en.wikipedia.org/wiki/Tomasulo_algorithm Retrieved April 2014

66

Appendix A: Code Listing and Synthesis Scripts

Code Listing

•  test.v (the test bench for the top module)

•  tomasulo.v (the top module)

o  inst_queue.v

o  FP_reg.v

o  Load_unit_1.v

o  Load_unit_2.v

o  Store_unit.v

o  Memory_unit.v

o  Adder_1.v

o  Adder_2.v

o  Adder_3.v

o  Mul_1.v

o  Mul_2.v

Synthesis Script for Top-Down

analyze -format verilog Mul_2.v > ./reports/analyze_Mul_2.rpt
elaborate Mul_2 > ./reports/elaborate_Mul_2.rpt
analyze -format verilog Mul_1.v > ./reports/analyze_Mul_1.rpt
elaborate Mul_1 > ./reports/elaborate_Mul_1.rpt
analyze -format verilog Adder_3.v > ./reports/analyze_Adder_3.rpt
elaborate Adder_3 > ./reports/elaborate_Adder_3.rpt
analyze -format verilog Adder_2.v > ./reports/analyze_Adder_2.rpt
elaborate Adder_2 > ./reports/elaborate_Adder_2.rpt
analyze -format verilog Adder_1.v > ./reports/analyze_Adder_1.rpt
elaborate Adder_1 > ./reports/elaborate_Adder_1.rpt
analyze -format verilog Memory_unit.v > ./reports/analyze_Memory_unit.rpt
elaborate Memory_unit > ./reports/elaborate_Memory_unit.rpt
analyze -format verilog Store_unit.v > ./reports/analyze_Store_unit.rpt
elaborate Store_unit > ./reports/elaborate_Store_unit.rpt
analyze -format verilog Load_unit_2.v > ./reports/analyze_Load_unit_2.rpt
elaborate Load_unit_2 > ./reports/elaborate_Load_unit_2.rpt
analyze -format verilog Load_unit_1.v > ./reports/analyze_Load_unit_1.rpt
elaborate Load_unit_1 > ./reports/elaborate_Load_unit_1.rpt
analyze -format verilog FP_reg.v > ./reports/analyze_FP_reg.rpt
elaborate FP_reg > ./reports/elaborate_FP_reg.rpt
analyze -format verilog inst_queue.v > ./reports/analyze_inst_queue.rpt
elaborate inst_queue > ./reports/elaborate_inst_queue.rpt
analyze -format verilog tomasulo.v > ./reports/analyze_tomasulo.rpt
elaborate tomasulo > ./reports/elaborate_tomasulo.rpt
current_design tomasulo

create_clock clk -period 200
set_clock_uncertainty 1 [get_clocks clk]
set_clock_latency 0.5 -rise [get_clocks clk]
set_clock_latency 0.5 -fall [get_clocks clk]
set_input_delay -max 0.5 -clock clk [get_ports all_inputs]
set_output_delay -max 0.5 -clock clk [get_ports all_outputs]
remove_input_delay [get_ports clk]
set_drive 0 clk
set_drive 0 reset
set_wire_load_mode enclosed
set_wire_load_model -name "280000"

compile
check_design > ./reports/check_design.rpt
report_timing > ./reports/timing.rpt
report_area > ./reports/area.rpt

Synthesis Script for Bottom-Up

set all_blocks {Mul_2 Mul_1 Adder_3 Adder_2 Adder_1 Memory_unit Store_unit Load_unit_2 Load_unit_1 FP_reg inst_queue}

foreach block $all_blocks {
set block_source "$block.v"
analyze -f verilog $block_source
elaborate $block
current_design
check_design
create_clock clk -period 100
#set_clock_uncertainty 1 [get_clocks clk]
#set_clock_latency 0.5 -rise [get_clocks clk]
#set_clock_latency 0.5 -fall [get_clocks clk]
#set_input_delay -max 0.5 -clock clk [get_ports all_inputs]
#set_output_delay -max 0.5 -clock clk [get_ports all_outputs]
#remove_input_delay [get_ports clk]
set_drive 0 clk
set_drive 0 reset
set_wire_load_mode enclosed
#set_wire_load_model -name "280000"
compile > Reports/$block.rpt
check_design
report_timing > Reports/Time_$block.rpt
report_area > Reports/Area_$block.rpt
write -f verilog -hierarchy -o Gates/Gate_$block.v
set_dont_touch $block true
current_design

}

remove_design -all

analyze -f verilog tomasulo.v
elaborate tomasulo
uniquify
create_clock clk -period 135
set_drive 0 clk
set_drive 0 reset
set_wire_load_mode enclosed
compile > Reports/Bottom_Tomasulo.rpt
report_timing > Reports/Time_Tomasulo.rpt
report_area > Reports/Area_Tomasulo.rpt
write -f verilog -hierarchy -o Gates/Gate_Tomasulo.v
write_sdf bottom_tomasulo.sdf

remove_design -all

71

Appendix B: Simulation waveforms showing all Operations

Figure 47: Instruction Queue

Figure 48: Load 1 Operation - Instruction 1

72

Figure 49: Load 2 Operation - Instruction 2

Figure 50: Multiply 1 Operation - Instruction 3

73

Figure 51: Subtract Operation - Instruction 4

Figure 52: Multiply Operation - Instruction 5

74

Figure 53: Add Operation - Instruction 6

75