<<

Rochester Institute of Technology RIT Scholar Works

Theses

1-1994

Design and implementation of an asynchronous version of the MIPS R3000

Kevin Johnson

Follow this and additional works at: https://scholarworks.rit.edu/theses

Recommended Citation Johnson, Kevin, "Design and implementation of an asynchronous version of the MIPS R3000 microprocessor" (1994). Thesis. Rochester Institute of Technology. Accessed from

This Thesis is brought to you for free and open access by RIT Scholar Works. It has been accepted for inclusion in Theses by an authorized administrator of RIT Scholar Works. For more information, please contact [email protected]. DESIGN AND IMPLEMENTATION OF AN ASYNCHRONOUS VERSION OF THE MIPS R3000 MICROPROCESSOR

by

Kevin Johnson

A Thesis Submitted In Partial Fulfillment of the Requirements for the Degree of MASTER OF SCIENCE In Computer Engineering

Approved by: Graduate Advisor - George A. Brown, Professor

Roy S. Czemikowski, Professor and Department Head

Tony H. Chang, Professor

Department of Computer Engineering College of Engineering Rochester Institute of Technology Rochester, New York

January, 1994 THESIS RELEASE PERMISSION FORM

ROCHESTER INSTITUTE OF TECHNOLOGY COLLEGE OF ENGINEERING

Title: Design and Implementation of an Asynchronous Version of the MIPS R3000 Microprocessor.

I, Kevin Johnson, hereby deny permission to the Wallace Memorial Library of RIT to reproduce my thesis in whole or in part.

Date: , II~-Ilf 'f ABSTRACT

The purpose of this thesis is to show some of the advantages for the asynchronous implementation of a major synchronous structure. Three people were involved with the design of an asynchronous microprocessor modeled after the MIPS R3000 microprocessor. This microprocessor implemented all of the MIPS reduced instruction set, while eliminating the need for synchronous clocking throughout the chip. Paul Fanelli modeled the asynchronous processor using VHDL (hardware description language). Kevin Johnson created circuit level designs and layouts of the and supporting hardware. Scott Siers created circuit level designs and layouts of the instruction fetch, write back, memory and instruction decode stages. Table Of Contents

Abstract ii Table of Contents iii List of Figures iv

List of Tables vi Glossary vii 1. Introduction 1 1.1 MIPSR3000 2 1.2 3 1.3 ALU stage 4 2. Registers 6 2.1 Data Dependencies 6 2.2 Register Design 13 3. Arithmetic Logic Unit 21 3.1 TTL Model 21 3.2 DataPath 28 3.3 Dual Purpose 28 3.4 Conditional Statements 29 3.5 Opcode Interfacing 32 4. Shifter Design 38

5 . Multiplier Divider in the Asynchronous Microprocessor 48 6. Bus Control 50 6.1 Inputs 50 6.2 Outputs 52 7. Exception Creation 57 8. Handshaking 61 8.1 Completion Signals 61 8.2 Handshaking Controller Circuit 67 9. Results 70 10. Conclusions 78 Bibliography 80

m List Of Figures

Figure 1.1 Pipeline Stages 3 Figure 1.2 Top Level Design of the ALU stage 5 Figure 2.1 Simple Method For Adding Three Numbers 6 Figure 2.2 Possible Paths for the Valid Bits 9 Figure 2.3 Program Showing Data Dependency 9 Figure 2.4 Petri Net for the Data Dependency Problem 12 Figure 2.5 Two Phase Clocking Scheme..... 13 Figure 2.6 Bus Utilization Scheme 14

Figure 2.7 Example of a Load Word Left Instruction 16

Figure 2.8 Logic Diagram of One Register Bit 18 Figure 2.9 Layout of a One-Bit Register 19

Figure 2.10 Logic Diagram of One 32-Bit Register 20

Figure 3.1 Schematic of the Modified ALU 22 Figure 3.2 Equations for Intermediate Carry Signals 24 Figure 3.3 Look Ahead Circuit 26

Figure 3 .4 Schematic of a 32-bit Adder Using 4-bit Look Ahead Carry Adders 27 Figure 3.5 Sign Extension Circuit 30 Figure 3.6 Condition Analyzer 31 Figure 3.7 Condition Bit Generator 34 Figure 3.8 Instruction Interface Circuit 35 Figure 3.9 Bit Coding of the R3000 Instruction Set 36 Figure 4.1 Arithmetic and Logical Shifts 38 Figure 4.2 Basic Shifting Cell 39 Figure 4.3 Simple Connection For Shifting Cells 40 Figure 4.4 Circuit Diagram of A 3-1 Multiplexor 42

Figure 4.5 Layout of a 3-1 Multiplexor 43

Figure 4.6 Layout of a 2-1 Multiplexor 44 Figure 4.6 Example of a Driving Circuit 46 Figure 4.7 Analog Simulation of the Driving Circuit 47 Figure 6.1 Input Bus Control Block 51 Figure 6.2 A-Bus Selector 53 Figure 6.3 B-Bus Selector 54 Figure 6.4 Bus Selector Decoder 55 Figure 6.5 Output Selector 56

Figure 7.1 Example of Overflow Conditions 58

Figure 7.2 Schematic of the Overflow Logic 59

Figure 7.3 Layout of the Overflow Logic 60

IV Figure 8.1 DCVSL NOR Gate with Completion Signal 62

Figure 8.2 Paths of DCVSL Outputs 63 Figure 8.3 Comparison of DCVSL and Complementary CMOS Logic 65 Figure 8.4 Layout Comparison of DCVSL versus Complementary CMOS 66 Figure 8.5 Handshaking Controller Circuit 68 Figure 9.1 Simulation of Unsigned Addition 73

Figure 9.2 Simulation of a Branch Not Equal to Zero Instruction 74 Figure 9.3 Simulation of a Signed Instruction Causing an Exception 75 Figure 9.4 Simulation of a Shift Instruction 76

Figure 9.5 Simulation of a Move From HI Instruction 77 List of Tables

Table 2.1 Incorrect Scheduling of Dependent Instructions 10 Table 9.1 Average Times of Various Operations 71

VI Glossary:

ALU Arithmetic Logic Unit

ALU Stage Part of the pipeline where all arithmetic operations takes

place.

Asynchronous Actions do not occur at specific time periods or when

another action is scheduled to take place.

CISC Complex instruction set computer.

CMOS Complementary metal-oxide-semiconductor

DCVSL Differential Cascade Voltage Switched Logic

FPU Floating point unit.

HCC Handshaking Controller Circuit.

HI 32-bit register used to hold the high 32 bits in a multiply operation or the quotient in a divide operation.

ID stage Instruction decode stage of the pipeline where the

instruction is decoded into signals.

IF stage Instruction fetch stage of the pipeline where the

instruction is obtained from memory.

LOW 32-bit register used to hold the low 32 bits in a multiply operation or the remainder in a divide operation.

LWL Load word left.

LWR Load word right.

MEM stage Memory stage of the pipeline where all memory operations are performed.

MIPS Company that makes currently now a division of Sun Microsystems.

Vll MIPS Millions of instructions per second.

NMOS N-type metal-oxide-semiconductor.

NOP Instruction that cuses nothing to happen.

PC .

PCL Signal name for the latched output of the program counter.

R3000 A version of the MIPS architecture used in a particular

32-bit microprocessor.

RISC Reduced instruction set computer.

RIT Rochester Institute of Technology

Synchronous Actions occur according to a deterministic schedule, e.g. Actions occur at specific phases of a clock pulse.

TTL Transistor Transistor Logic

VHDL VHSIC Hardware Description Language

VHSIC Very High Speed Integrated Circuit

VLSI Very Large Scale Integration

WB stage Write back stage of the pipeline where the data is written into the registers.

Vlll 1. Introduction

The purpose of this project is to show some of the advantages of the asynchronous design methodology. The design should be large enough to allow the increase in speed caused by the asynchronous operation to be greater than the decrease in speed caused by the extra circuitry. A thirty-two bit microprocessor was chosen for this project. The asynchronous design methodology eliminates the need for clocking circuits by adding handshaking circuitry.

Clock lines in a microprocessor need to be routed to every part of the chip. These lines cause a problem in the operation of the chip called clock skew. The clock lines, upon which the clock signal travels, spread out in many directions from the clock driver. This signal is supposed to change the state of many circuit elements throughout the chip simultaneously. Since these lines are not the same length, the clock pulses on one side of the chip may or may not be in phase with the pulses on the other side. In order to make sure that circuit elements on both sides of the chip have changed state using the same clock pulse, the period of the clock is extended. This makes sure that the clock pulse has had time to travel throughout the chip.

The clock lines and the gates that are driven by the clock create a large capacitive load. This capacitance must be overcome in order for the clock signal to change state in a reasonable amount of time. This causes the clock driver to be necessarily large. The capacitance in the line must be charged in order for the voltage on the line to come to a true logical-one value. The capacitance also must be discharged in order for the value to return to the logical-zero state. The time needed to charge and discharge the large capacitance limits the maximum possible clock frequency. If the clock frequency is faster than the charging and discharging times, the clock pulses

mid- will never reach their true values but will stay fluctuating around some level voltage.

Within a large clocked circuit, the clock speed must be slow enough to allow the slowest part of the circuit to be able to complete its task before the beginning of the next clock cycle. In large designs, this minimum time between clock pulses becomes a problem. For example, in a microprocessor this minimum time is the time it takes for the slowest instruction to execute.

This can be a long period of time because operations such as multiplication take longer than operations such as shifting. Even in simple designs, there can be large discrepancies in the time it takes for an instruction to execute. In a thirty-two bit ripple carry adder, for example, it takes two gate delays for two numbers to be added together, when there are no carrys within the circuit. If, however, a carry is created in every single stage, each addition stage must wait until the carry is obtained from the proceeding stage. This condition takes 32 times longer than the previous example.

In an asynchronous design, the clock is eliminated and circuitry is added to ensure that each stage will finish it's execution before the next stage begins.

The circuit will evaluate at the maximum speed possible for each stage.

1.1 MIPSR3000

The synchronous circuit to be modified in this project is the MIPS

R3000 microprocessor. It is a 32-bit microprocessor with an off chip floating point unit. It employs a five-stage pipeline structure with a reduced instruction set computer(RISC) architecture. Each instruction is 32-bits in length. There are only three different styles for the instruction words, immediate, register and jump.

The microprocessor was implemented in a similar to the MOSIS two micron process that was used in the RIT VLSI labs. The R3000 microprocessor is a relatively old microprocessor and there is a lot of information available. Many books have been written describing this architecture.

1.2 Pipeline

The Mips processor uses a five stage pipeline structure. An example of this pipeline is shown in figure 1.1. Each stage performs a different part of the execution of an instruction. The stages are called instruction fetch, instruction decode, arithmetic logic unit(ALU), memory, and write back. The instruction fetch stage gets the instruction from memory. The instruction decode stage decodes the instruction into useful parts. The arithmetic logic unit performs calculations. The memory stage performs fetches and writes to memory. The write back stage puts completed information into the register bank.

IF ID ALU MEM WB

instruction instruction arithmetic logi memory write back fetch decode unit

Fig 1.1 Pipeline Stages Because of these five stages, five instructions can be operated on at the same time. Different instructions will be located at different stages in the pipeline simultaneously. This allows the microprocessor to execute instructions faster once the pipeline is filled. The pipeline in the MIPS R3000 is clocked. The speed of the processor is the speed that the instructions are able to move from stage to stage. Because this speed is controlled by the clock, it must be able to accommodate the slowest stage in the pipeline. In the asynchronous design, the stages are controlled by signals that mark the beginning and end of the operation in each stage. This does not depend upon the worst case stage time and the stages are allowed to operate as fast as possible. This project is designed to prove that an asynchronous processor will operate faster than its synchronous counterpart.

1.3 ALU stage

This particular thesis only deals with the arithmetic logic unit stage of the total design. The thesis by Scott Siers deals with the other pipeline stages and the thesis by Paul Fanelli describes the VHDL modeling of the microprocessor. The ALU stage is split up into five major parts. These parts are the registers, the ALU, the shifter, the handshaking controls and the exception creation. A top level diagram of the ALU stage is presented in

are Figure 1 .2 to provide an overview of how these different parts combined. Bus Ctrl Block

(31:1)

i(3l :)) instruction indnKiignC3l:B) pcu \y el.in(3l:8> hmOl: intl(3li. -OB(31:B) irradiate KljuKiM in& -OpcLout(31:0) star^ Hert

"Construct ion_to_next_stage(31:0)

;,n Alu 32b it wioiio) Overflov Decode M(3I:1>

CBUI owrfli >except

int) (31:!) s Conpare Bra.Ctrl

qui ***"*"* Obranch

brl Shift Ctrl J Ltd! Shifter ii no 12 3 intl(3l!fl) to s

M (31:9)3 3 < 1 SLI

hLlow \Ja ^tput Sel Sox rcaaiie (31i01 outOI: put intlnitHm(!li!) ffi-f>0ut (31:0)

Figure 1.2 Top Level Design ofthe ALU Stage 2. Registers

2.1 Data Dependencies

Data dependencies occur when an instruction needs the output from a prior instruction that has not left the pipeline. For example, a case might occur where three numbers need to be added together. A simple method to accomplish this operation is shown in Figure 2.1.

AUUn Rt R2 R5 adds Rl to R2 and the results is in R5 Mm R? R? R5 aits R5 to R3 and the ttmlts is in R5

Figure 2. 1 Simple Methodfor Adding Three Numbers

This method adds the first two numbers together and adds their sum to the third. In the pipeline, however, these instructions will immediately follow each other. The answer will be created from the first instruction when the write back stage is reached. The second instruction needs the information from the first instruction in the ALU stage. The memory stage is located between the write back and ALU stages. A space or time delay is needed between these two instructions so that the first instruction will be in the write back stage when the second is in the ALU stage. Another instruction can be placed between the two add instructions to provide the space or time delay. This instruction would be in the memory stage when the first add instruction is in the write back stage. Any non-dependent instruction can be used for this space and, if one cannot be found from the program, a simple instruction that causes nothing to happen can be inserted. Most of the MIPS R3000 processor's data dependencies are taken care of in the compiler or at software level. This can be done because the processor works on a synchronous (clocked) pipeline. For example, a branch instruction causes a data dependency. A test is needed to determine if the branch will be taken. This test occurs in the ALU stage, while the outcome of the branch instruction affects the instruction fetch stage. The instruction decode stage falls between these two stages. In the R3000 processor, the branch instructions are immediately followed by an executable statement. This executable statement acts as the space between the test for the branch and the execution of the branch. This is similar to the way a space was needed in the addition example above. After this second instruction executes, the branch is then allowed to occur. The detection of these data dependencies can be done in software and the insertion of spacer instructions, to eliminate the dependency, can be done at the compiler level.

In an asynchronous processor, one simply cannot assume that letting one instruction pass though a stage will allow enough time for the data to leave the pipeline. An instruction may be in the process of moving into the write back stage when the dependent instruction begins to be executed in the ALU stage.

The write back stage has not yet put the correct data into the register before the

ALU stage operates on the register. This will cause the processor to act on erroneous data and therefore must be avoided.

One solution to this problem is to allow the data to be stored at intermediate points along the pipeline and allow the stored data to be retrievable for the dependent instruction. This would mean that every stage would remember the output from the instruction that had previously passed though it. This implementation would be useful if the dependent instructions immediately followed the instruction they were dependent upon. However, as was described earlier, the dependent instructions in the synchronous microprocessor are separated by a spacer instruction to insure proper timing.

Because one of the goals of this project is to be compatible with the MIPS architecture, this solution is now invalid. In order to allow dependent instructions to be spaced out, the intermediate storage devices now need to be capable of storing more than one value. This solution is problematic at best because increasing the number of storage spaces reduces the speed of the pipeline.

Another solution to this problem is to tag the data that is stored in the register bank. If an instruction is about to leave the instruction decode stage and pass into the ALU stage, the instruction can cause the value in the register, that the instruction uses as a target, to be invalidated. The next instruction, if it is dependent on that data, would then have to wait until the data in the register is validated. In order to implement this scheme, another bit is needed in the register to indicate whether that register has valid or invalid data. When the instruction passes into the ALU section, this register bit would be set to invalid and, when the instruction leaves the write back section, this bit would then be set back to valid. At this point it will be assumed that, when the processor is reset, the condition bits for all the registers will be reset to a valid state.

This scheme works for a clocked system but in an asynchronous system there are some hazards that need to be addressed. For instance, it is conceivably possible for two instructions to be setting and clearing the valid bit at the same time. This would cause an indeterminate state, or race condition, depending on which stage was faster. To eliminate this problem both the instruction decode stage and the write back stage get their own bit which they can toggle. The two bits can then be exclusive ORed together in order to produce a valid bit. These two bits follow the path show in the Figure 2.2. By assigning the states 1 1 and 00 to be valid, the race condition that may occur is eliminated.

This method appears to answer all the problems with data dependencies, however, one problem remains. Until this point, the investigation has only

Figure 2.2 Possible Pathsfor the Valid Bits

been concerned with the possible dependencies in the data part of the instruction. The target part is just used to designate which register will be invalidated. The flaw in this scheme can be shown in the simple program in

Figure 2.3.

- pl.ufN ill Addn R I R2 R3 ysil n .'.! M'l n in i3

pl.kVN 2 Addu R4 R5 R3 ,i,jtfe 1 4 %tt i5 md n in i3 i;: 3 lm --- add one to R3

Figure 2.3 Program Showing Data Dependency

The first two instructions are not data dependent but the third one is. It is interesting to note that instruction 1 is meaningless but still is a valid instruction for the program. An example of the execution of this program is in

Table 2.1. Pipeline Stages

time IF ID ALU MEM WB

1 3 2 1

2 3 2 1

3 3 2 1

4 3 2

5 3

6 3

TaWe 2.7 Incorrect Scheduling ofDependent Instructions

As is shown in the table the third instruction stalls until the valid bit has been set. The valid bit is set at time 4 allowing the third instruction to enter. This is incorrect because the first instruction set the valid bit. In fact, the third instruction was supposed to wait for the second instruction to finish.

The solution for this problem is to only allow the instruction to enter into the ALU stage if all data and target registers are valid. This will stall the second instruction until the first instruction leaves the pipeline. The validity of this method is verified in the petri net shown in the Figure 2.4. By using this method, no dependencies can occur. No set of instructions will cause the microprocessor to operate on erroneous data caused by a dependency. Some care should be taken when programming this type of microprocessor because, if a lot of dependencies occur in the program, the microprocessor efficiency

10 instructions will be very low. The program will still run but, since many of the will become stalled in the pipeline, the program will not run at maximum speed.

11 IF) Data Dependent Operations To The Same Register

Valid Register

Fig 2.4 Petri Net Validating Solution to the Data Dependency Problem

12 2.2 Register Design

The registers, where data is stored for eventual use, are an important

part of any microprocessor. Due to the asynchronous structure of this

processor, some special considerations need to be made, in order to eliminate

hazard conditions that may occur in loading and retrieving information from

the register bank. First, the synchronous method, as used in the R3000,

processor will be investigated. Then, the method that allows the registers to be

used in an asynchronous manner will be described.

One problem encountered when designing registers is preventing access

to the registers for simultaneous reading and writing. If this happens, a race

condition will be created making it unclear whether the information obtained is

an old number stored in the register or a new number being written to the

register. A second problem is caused by the fact that to do any calculation two

numbers are needed. Since registers generally write to a common data bus,

getting two sets of data can become difficult.

The R3000 processor solves both of these problems by using a two

((,1

<|)2

A B c D

Fig 2.5 Two Phase Clocking Scheme

phase, non-overlapping clock. An example of this clocking scheme is shown in

Figure 2.5. This allows the write back stage to do the write backs before the

13 instruction fetch stage grabs the data to give to the arithmetic logic unit stage.

The clock eliminates the hazard that occurs between loading and retrieving numbers and solves the first problem of preventing simultaneous reading and writing.

In order to solve the second problem of obtaining two sets of data, the

R3000 uses one 32-bit data bus and multiplexes the data from the registers to the correct lines in the arithmetic logic unit. Figure 2.5 shows that there are four distinct periods in the clocking scheme. One piece of data can be obtained during the period marked with a 'C and another piece obtained during the period marked with a 'D'. This eliminates any chance of having data collide on the bus.

In the asynchronous microprocessor, one does not have the luxury of knowing exactly when certain events are going to happen. There is no mechanism to stop the write back stage from writing at any time, potentially causing a hazard condition. In order to eliminate this problem, register accesses need to be allowed to happen in the register bank simultaneously.

This costs some space on the chip because, while the R3000 uses one bus and multiplexes its use between the various clock phases; in the asynchronous

write VVTlte y,r ^ back back / \U Asynchronous Processor Register Bank

,/32 ,/3

A \l/ \1/ ALU a ALUb ALU a and b

Fig 2.6 Bus Utilization Scheme

14 microprocessor, three different busses and corresponding logic are required.

Three different busses eliminate the hazard of contention on the bus but not the hazard caused by reading and writing simultaneously to the same register. This is similar to the data dependency problem described earlier in the data dependency section. The use of a register flag, to indicate whether a register is safe to read from or write to, solves the logical dependency problem that occurs in the pipeline. This method also protects the register's data coherency. The

registers' instruction fetch stage waits on the state before accessing the register.

This solves the problem of having both instruction fetch and write back stages accessing the registers at the same time.

In the asynchronous version of this processor, there are thirty-two general purpose registers. Also, there are two additional registers that are used by the multiplier/divider to hold sixty-four bit numbers. The multiplier/divider registers are accessed differently and are located in a different part of the chip.

These registers are only used by the multiplier/divider and by four instructions that move the output of the multiplier/divider to the general purpose registers.

Two other instructions also operate the register bank differently. These instructions are load word left (LWL) and load word right (LWR). One to four bytes of memory can be loaded into a register with these instructions. What makes these instructions different from the standard load instructions is that the bytes not indicated in the target register are not changed after the end of the operation. If a thirty-two bit word is needed and it does not fall on regular byte boundaries, a combination of these instructions can be used to get the number.

An example of this is shown in Figure 2.7. In this example the register is loaded with the three low order bytes from memory location 0 without affecting the high order byte. In order to implement these instructions, more logic must be added to the register bank. The write back stage sends four

15 memory

register x

before a b c d

LWL$x,l($0)

address 4 A s fi 7 after a 1 2 3 address 0 0 1 2 3

Fig 2.7 Example ofa Load Word Left Instruction

signals to the register bank before executing a load instruction. These four

signals select which byte within the register will recieve the new data and

which byte will be the same. Under normal operation these signals will always

be active so that the whole register is affected. For a load word left or a load

word right, these signals will indicate which byte in the register will be loaded.

This is performed by ANDing the load signal with the select signals that are

sent by the write back stage.

The general purpose registers are made out simple components designed to fit together to create the large register bank. Thirty-two of these components

are placed side to side to make a single thirty-two bit register. A logic diagram of one bit of a register and the layout of that logic have been included in

Figures 2.8 and 2.9. As can be seen by these figures, the registers are designed with maximum modularity. A thirty two bit register can be implemented by combining thirty-two of these one bit registers into four groups of eight bits.

Each eight-bit register has an AND gate to control the selection for LWL and

LWR. The four register banks are combined together and two toggle registers are added to create the valid signal described in the previous chapter. A logic

16 diagram of a thirty-two bit register is shown in Figure 2.10. The layout measures 2750 microns wide and 2300 microns high.

17 asE>

idty Oa

inO

-ob

bsO

Figure 2.8 Logic Diagram of One Register Bit

18 one bit register

bs-bar

as -gar

nz

Figure 2.9 Layout ofa One-Bit Register

19 32bit Register

O- as

bsO-

in<31:0)O-

" 8bit reo <,:8) reg_naskOO-

8bit reo "<7:6) reg.nasklC^

' 0

B 8b it reo <7:0, *- reg_nask20" k7ib:

8b it reo a:0J k30- reg_nas Nv >A <5&M\

VCC

alu-tO"

'CLK

K 5 -

reseto

loadO"

K 5-iPf-1 I

Figure 2.10 Logic Diagram of One 32-Bit Register

20 3. Arithmetic Logic Unit

The Arithmetic Logic Unit (ALU) is the brain of the microprocessor.

Within the ALU, all calculations are performed. It is a complex circuit that implements many different functions. In most microprocessors, the ALU is split into two parts, one part operates on floating point numbers and another part operates on integer numbers. Operating on floating point numbers is more complex due to the number's size and format. Floating point numbers are typically 64 bits long. In the and R3000 microprocessors, the floating point unit (FPU) is found on another chip () that is placed beside the microprocessor. Integer operations consist of performing combinations of adding, subtracting, multiplying, dividing and logic operations.

3.1 TTL Model

The asynchronous microprocessor's ALU could not be modeled exactly after the R3000 ALU because no information on the structure of the R3000's

ALU was available. A similar ALU was found in the Fairchild Advanced

Schottky TTL Data Book. This ALU performs all the required functions necessary for the asynchronous microprocessor, except A NOR B. The ALU was modified to perform this operation by adding XOR gates on the inputs and performing an AND operation. The XOR gates complement the inputs and, by

DeMorgan's Laws, the NOR operation is obtained.

The schematic of the modified ALU is shown in Figure 3.1. This ALU has been designed using a look ahead carry format. The look ahead carry

21 Cout

Figure 3.1 Schematic ofthe Modified ALU

22 implementation was chosen because it is faster than the standard ripple carry implementation.

In a ripple carry adder, the two inputs are added together with the carry from the preceding stage. This produces an output and a carry to the next operation. This carry then proceeds to ripple down the chain of bits to the end, where it becomes the carry out. The advantage of this design is the ease of implementation and the size of the circuit. The disadvantage is its slow speed.

The time needed to perform the ripple carry addition can be calculated by finding the amount of time it takes to create a carry signal and multiplying this time by the number of bits through which this signal needs to be carried. If, for example, it takes 3 nanoseconds to create a carry signal, then, for a 32-bit adder, it would take 96 nanoseconds, in the worst case, to add two numbers together.

The look ahead carry adder operates differently. Each pair of inputs are combined and a generate signal and a propagate signal are created. Each output can now be created by combining its own generate and propagate signals with the generate and propagate signals from the preceding pairs. The advantage of this method is that the entire structure of the adder has only three levels of gates, one level to create the generate and propagate signals and two levels to combine the signals together. For example, if the gate delay of one level were 3 nanoseconds, the time needed to add two numbers together would be 9 nanoseconds. This circuit is more than ten times faster than the ripple carry adder.

The equations for the carry in signals are shown in Figure 3.2. From these carry signals, derived from the generate and propagate signals, the output is created by adding the two inputs and the carry. Because the creation of the

23 " :- PUk=Afn+B(i> s- |;.: G

Ci=Gn+PftCh C2=Gi+PiGo+P}PoCo C3=n2+P2C. | +P2Pi G0+P2P | PoQ,

Figure 3.2 Equationsfor Intermediate Carry Signals

generate and propagate signals does not require signals other than the inputs, the outputs can all be created simultaneously.

In practice, a look ahead carry adder is normally not built for more than

eight-bit adders. As the size of the adder increases, more generate and propagate signals need to be combined to perform the addition. (Refer to the

previous figure.) The larger gates needed to handle the greater number of

inputs cause the adder to be less efficient. A compromise between speed and

gate size can be reached by creating another circuit to implement a look ahead

carry between blocks of smaller adders. This adds three more gate levels to the

over all circuit and slows the hypothetical adder to 18 nanoseconds. This is

still a great improvement over the ripple carry adder. An example of this extra

look ahead circuit is shown in Figure 3.3. It is similar to the look ahead carry

within the adder circuit but works with the generate and propagate signals

are used in the asynchronous created by the adder blocks. Smaller ALU blocks

microprocessor to perform their operations upon four bits at a time. An

example of how these smaller ALU blocks are connected can be found in

Figure 3.4.

To extend the design of the look ahead carry adder into an ALU,

perform logical operations on the inputs. This selection circuitry is added to

24 output. This selection allows the look ahead carry adder to create the proper circuitry adds another layer of gates to the overall circuit.

25 Figure 3.3 Look Ahead Circuit

26 Figure 3.4 Schematic ofa 32-bit Adder Using 4-bit Look Ahead Carry Adders

27 3.2 Data Path

"true" The data path in the asynchronous microprocessor is a thirty two bit data path. In microprocessors that use a clock, the sequence for loading data into the ALU stage is to load the first number on the first clock pulse and the second number on the second clock pulse. The advantage of this is that only one bus is required to send data signals from the registers to the ALU.

The disadvantage of using only one bus to move the numbers from the registers to the ALU is that it takes twice the optimum time. In the asynchronous microprocessor, there is no clock to use when scheduling bus utilization. Each number needs to have its own bus to travel from the registers to the ALU. This causes more lines to be used but improves the overall efficiency of the ALU's operation.

3.3 Dual Purpose of the ALU

The R3000 processor uses the ALU for more than just arithmetic and logic operations. It also uses it for base offset address calculations. When a register is used to hold a memory location and an offset is added to this register to point to a desired location in memory, it is called base offset addressing.

This is commonly used for table lookup and stack operations. The ALU stage occurs before the memory stage in the pipeline order. This allows the ALU to calculate addresses for the memory stage before the instruction is executed within the memory stage.

Because the offset is coded into the instruction word, a method is needed to allow the offset part of the instruction to be moved onto the data bus

28 between the ALU and the registers. This offset is always the lower sixteen bits of the instruction. Because offsets can be either positive or negative, this

32- number must be sign extended from sixteen bits to 32 bits to be used by the bit ALU. The circuit that performs this operation is shown in Figure 3.5. The circuit multiplexes between the data lines from the registers and the data in the instruction opcode.

3.4 Conditional Statements

The R3000 microprocessor's ALU also is used to compare two numbers to create conditions that are used to resolve branch on conditional statements and set on conditional statements. The conditions that are used in the R3000 microprocessor are equal, greater than or equal to zero, less than or equal to zero, greater than zero, less than zero, and not equal. These conditions are evaluated within the ALU and a condition bit is sent back to the instruction fetch stage. The branch destination is calculated in the instruction fetch stage while the ALU is determining whether the branch will be taken or not. The condition bit then indicates the location that contains the next instruction.

No information is available on exactly how these condition bits are produced in the R3000 microprocessor. The asynchronous processor performs this operation by adding the two ALU input busses together. The instruction fetch stage places the two pieces of data on the busses. If the operation deals with a comparison to zero, register zero is used for one piece of data. The zero register in the data bank is always hardwired to contain a zero value. The resulting answer is then analyzed by the circuit shown in Figure 3.6. Negative

29 o

T

C_ o

o CD

CD CO

CO

_D CO

CO

00

-t-

Figure 3.5 Sign Extension Circuit

30 <0

-t><0

-O>>0

Zero in(31:0)O equal Tesf "E^equal

-Oout(31:0)

Figure 3.6 Condition Analyzer

31 numbers are indicated by a one in the sign bit. Positive numbers are indicated by a zero in the sign bit and a one somewhere else within the number. Zero is indicated by all the numbers being zero.

The analysis is then fed into another block. This block converts the comparison outputs into a control line for the branch instructions. The instruction decode stage uses this control line to determine whether to take the branch or not. This block is shown in Figure 3.7. The bit code for branch instructions includes two bits in which a condition is encoded. These two bits are bit 26 and 27. These bits are then fed into a 4-1 multiplexor as select lines.

Another series of conditional statements are the BCOND statements.

BCOND stands for branch conditionally. These statements are special instructions unlike the other conditional branches. There are only two types of

BCOND statements. They are greater than or equal to zero and less than zero.

These statements are identified by bit 16 of the BCOND instruction.

The Figure 3.7 shows how the two types of instructions are resolved and muxed together to create the branch control line.

3.5 Opcode Interfacing

The disadvantage of not having any information on the R3000's ALU is that the control lines of the model ALU do not match the instruction encoding of the R3000. A circuit is required to decode the instruction before sending it to the asynchronous ALU. This circuit is shown in Figure 3.8.

The interfacing circuit was created by decoding the operations performed in the R3000 instruction set and matching them to the model ALU's instructions. Figure 3.9 shows the bit coding of the R3000 instruction set and

Figure 3.10 shows the model ALU's instruction set. The R3000 instruction set

32 uses two bit fields to indicate instructions. The highest five bits indicate the instruction. If these bits are all zero's, this indicates a special instruction and the low five bits are used instead. If it is not a special instruction, the instruction is either an immediate or a conditional instruction. All register instructions are special instructions.

33 inst (31:0)1

Obranch-control

Figure 3. 7 Condition Bit Generator

34 A Ar Ay yy> -oso

g^^nSf rOsi 4> Interface Logic D Aaama -OS2

-OS3

Sr P t>T instruct ion(31:0)O-

^Dr

A

A- SA DASS AA On

Figure 3.8 Instruction Interface Circuit

35 OPCODE bits 28..26

bits 0 1

31..29

0 SPECIAL BCOND J JAL BEQ BNE BLEZ BGTZ

ADDI ADDIU SLTI SLTIU ANDI ORI XORI LUI

COPO1 COP11 COP21 COP31 * * * *

* * * * * * * *

LB LH LWL LW LBU LHU LWR *

SB SH SWL SW * * SWR *

LWCO1 LWC11 LWC21 LWC31 * * * *

SWCO1 SWC11 SWC21 SWC31 * * * *

SPECIAL bits 2..0

bits 5..3 0 1

0 SLL * SRL SRA SLLV * SRLV SRAV

JR JALR * * SYSCALL BREAK * *

MFHI MTHI MFLO MTLO * * * *

MULT MULTU Drv divu * * * *

ADD ADDU SUB SUBU AND OR XOR NOR

* * SLT SLTU * * * *

* * * * * * * *

* * * * * * * *

BCOND bits 18..16

20..19 0

0 BLTZ BGEZ * * * * * *

* * * * * * * *

BLTZAL BGEZAL * * * * * *

* * * * * * * *

* Illegal instruction * Coprocessor instructions were not implemented

Figure 3.9 Bit Coding ofthe R3000 Instruction Set

36 INPUTS OUTPUTS

FUNCTION SO SI S2 Cn An Bn FO Fl F2 F3 G P

CLEAR 0 0 0 X X X 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 B MINUS A 1 0 0 0 0 1 0 1 1 1 0 0 0 1 0 0 0 0 0 1 1 0 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0 1 0 1 0 1 1 1 1 1 0 0 1 1 0 1 0 0 0 1 1 1 1 1 0 0 0 0 1 0 0 0 0 1 1 1 1 1 0 A MINUS B 0 1 0 0 0 1 0 0 0 0 1 1 0 1 0 0 1 1 1 0 0 0 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0 1 0 1 0 1 1 0 0 0 1 1 1 1 0 1 1 1 1 0 0 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 A PLUS B 1 1 0 0 0 1 1 1 1 1 1 0 0 1 0 1 1 1 1 1 0 0 1 1 0 1 1 1 0 0 1 0 0 1 0 0 0 1 1 1 0 1 0 0 0 0 1 0 1 1 0 0 0 0 0 1 0 1 1 1 1 1 1 1 0 0 X 0 0 0 0 0 0 1 1 AXORB 0 0 1 X 0 1 1 1 1 1 1 1 X 1 0 1 1 1 1 1 0 X 1 1 0 0 0 0 0 0 X 0 0 0 0 0 0 1 1 A ORB 1 0 1 X 0 1 1 1 1 1 1 1 X 1 0 1 1 1 1 1 1 X 1 1 1 1 1 1 1 0 X 0 0 0 0 0 0 0 0 AANDB 0 1 1 X 0 1 0 0 0 0 1 1 X 1 0 0 0 0 0 0 0 X 1 1 1 1 1 1 1 0 X 0 0 1 1 1 1 1 1 PRESET 1 1 1 X 0 1 1 1 1 1 1 1 X 1 0 1 1 1 1 1 1 X 1 1 1 1 1 1 1 0

1= High Voltage Level 0= Low Voltage Level

X = Don't Care

Figure 3.10 Model ALU's Instruction Set

37 4. Shifter Design

Shifters are important to any microprocessor because they prepare the data for use in other stages. A shifter used in a microprocessor is a steering circuit that moves the bits to the correct positions for the final output. There are four different types of shifting operations; arithmetic, logical, rotate, and roll.

The R3000 architecture requires only arithmetic and logical shifts.

These shifts operate similarly. They move the bits either left or right and, on the edges of the number, they either sign extend or input a 0. An arithmetic shift instruction shifts the bits either left or right. The first bit is both held and shifted at the same time. An example of this is shown in Figure 4.1. In a logical shift, a zero is always shifted into the number. By shifting left, a quick multiply by two can be done without having to run it through the slow multiplier.

0 -^L>^L>-^L>^L>^L>^n

Shift Right Logical

Shift Right Arithmetic

Fig. 4.1 Arithmetic and Logical Shifts

38 The R3000 architecture contains no arithmetic left shift. The arithmetic left shift is meaningless because it is identical in operation to a logical left shift. The R3000 architecture also supports dynamic shifting. A register can be designated in the instruction to indicate how far to shift. The five least significant bits of the register are used to control the shifting operation. The other bits are ignored. Rotate and roll shifts use the bits shifted off one edge of the number as the input to the other edge. They are not implemented in the

R3000 reduced instruction set.

The shifter for the asynchronous microprocessor can be implemented using a multiplexing network because there are no rotate or roll instructions in this processor. This implementation will not greatly increase the space required.

no shift

left input Py right input

\\s \i/ y Arith/logic \

3-1 MUX Shift/noshift >

output y

Fig 4.2 Basic Shifting Cell

The basic cell for one bit, in the asynchronous design, is shown in

Figure 4.2. This cell is a 3-1 multiplexor and has been created by modifying a

39 4-1 multiplexor, because only three inputs are required. A number of basic

cells are then connected to form the multiplexing network. A simple version of

this network is shown in Figure 4.3. The shifter can handle thirty-one bit

shifts. There are five bits in the instruction designating how many lines to

shift. Because there are five control lines for the shifter, there are five

shifting stages. Each stage shifts by a power of two. The first stage shifts by

2l 2, the second shifts by and so on. During the last stage the displacement is

sixteen bits. This does cause long lines to be formed, however, producing line

capacitance which slows the shifter.

Fig 4.3 Simple Connection For Shifting Cells

The multiplexors are created using transmission gates. The drivers,

which are needed in order to overcome the loss of driving capability caused by

the transmission gates, enable the addition of a moderate amount of line

capacitance. These drivers can easily handle the line capacitance caused by

the long routing channels needed in this design.

The advantage of this shifter design is that the basic thirty-two bit part

can be used in other places where a three to one decision needs to be made.

This situation occurs on the second bus coming out of the register bank. The

40 second bus handles the immediate instructions, register instructions, and base offset calculations.

The shifter measures 1800 microns wide by 900 microns high. The basic 3-1 mux cell circuit diagram is shown in Figure 4.4 and the layout of the circuit is shown in Figure 4.5. 160 of these cells were used to form the entire shifter. To select between logical and arithmetic shifts, five 2-1 multiplexors were used. These multiplexors were obtained by modifying the 3-1 multiplexor. The layout of this cell is shown in Figure 4.6.

41 LeftOi rights cy

1E>

2E>

Figure 4.4 Circuit Diagram ofa 3-1 Multiplexor

42 nux

Figure 4.5 Layout ofa 3-1 Multiplexor

43 A m

set e

seie

Figure 4.6 Layout ofa 2-1 Multiplexor

44 The worst case line capacitance was calculated between the fourth and final stages of the network. The area capacitance was calculated from the equation

* total capacitance = total area capacitance coefficient.

The capacitance coefficients differ depending on the process used and the layer in question. Using the two micron Mosis Scalable CMOS process, the capacitance coefficient for the metal2 layer is 0.03 femtofarads per square micron. For the metall layer it is 0.05 femtofarads per square micron. The worst case line capacitance was found to be 135 femtofarads. The driver circuit used to drive this load is shown in Figure 4.7. The analog trace from the Accusim simulation of the resulting circuit is shown in Figure 4.8. The driver has more than amply driven the load with a delay of only 0.5 nanoseconds.

45 vcc

TP TP l=2un w=8un I L=2un w=2^un ^ ^ n > out

\> f>

TN TN 136F l=2un w=^lun l=2un w=6un

Figure 4.6 Example ofa Driving Circuit

46 Trace 5. 00^ 4. 50_| 4. 00J 3. 50_| 3. 00_| 2. 50_| E2. 00_| 50J I. 00_| 0. 50_| 0. 00^_ 5. 00^ 4. 504 4. 004 3. 50_| >3. 004 "2. 504 = 2. 004 ^1. 504 I. 004 0. 504 0. 00_E_ 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 i 1 1 1 1 1 1 1 r

0.00e+00 5. 00e-09 I.OOe-OB l.50e-0B 2.00e-08 2.50e-08 TIME (s)

Figure 4.7 Analog Simulation ofthe Driving Circuit

47 5. Multiplier/Divider in the Asynchronous Microprocessor

The multiplier/divider is a special circuit in the ALU stage, where two

32-bit numbers are either multiplied or divided creating a 64-bit result.

Multiplications and divisions cannot be handled by the regular ALU because the resulting numbers are too large for the thirty-two bit data bus. The output of this separate circuit goes directly to two special registers named HI and

LOW. These two registers either store the 64-bit output from a multiplication or the 32-bit output from a division plus the 32-bit remainder.

Because all of these circuits used in the multiplier/divider are separate from the main ALU, it becomes possible to allow other instructions to execute while the multiplier/divider is operating. This is made possible by latching the inputs to the separate section. The method of having a valid bit for the registers and the fact that the multiplier/divider does not require any further pipeline stages, allow other instructions to be executed simultaneously. The latches on the inputs hold and separate them from the main bus while the multiplication or division is taking place. This allows other traffic on the data busses to be conducted without interfering with the multiplier/divider's data.

The method of designating the registers valid and invalid assures that the instructions on the bus will not interfere with the current operation of the multiplier/divider. All instructions that want to use these multiplier and divider circuits must wait until the valid bit for the HI and LOW registers to become set. If the instruction does not need these circuits, it can continue through the rest of the ALU and bypass the multiplier and divider circuits.

48 The multiplier/divider automatically places its answers into the special registers HI and LOW. The write back and memory stages in the pipeline are not needed for any operation that uses the multiplier/divider. This allows the multiply and divide instructions to be removed from the general flow of instructions down the pipeline. The removal of these instructions allows the multiplier/divider to calculate at its slower speed and, at the same time, allows the rest of the instructions to continue to be calculated.

Scott Siers designed the multiplier/divider and a more complete

investigation of this circuit can be found in his thesis.

49 6. Bus Control

One of the requirements for the ALU stage is to control the values on the various busses leading into and out from the stage. All of the inputs to the

ALU stage need to be latched in order for the different pipeline stages to be independent of each other. The output of the ALU stage can be either the incremented program counter (PC) or the data line leading from the arithmetic logic unit. The output can also come from the multiplier/divider when executing a move from low or move from high instruction.

6.1 Inputs

The inputs to the ALU stage are comprised of the A and B busses from the register bank, the instruction from the instruction decode stage, an immediate signal to indicate an immediate instruction, and the program counter value that will be incremented in the ALU stage to perform link operations.

The logic diagram of the Input Bus Controller is shown in Figure 6.1.

The instruction decode stage will select the correct registers to be connected to the A and B busses before the start signal comes to the ALU stage. This allows the ALU to latch the A and B bus's data immediately after the start pulse is sent from the instruction decode stage. The A-bus is used later for the variable shifts. When this operation is performed, a zero is sent to the arithmetic logic unit and the lower five bits of the A-bus are allowed to pass to the shifter. The other bus that is connected to the ALU is the PCL bus.

The PCL contains the program counter value which allows the ALU to increment the program counter. The incremented program counter can then be

50 32b i l latch Oil(tB) a(31:0>O- inC31 10)

-OBpass(31:0)

32b it lotch -Opcl.out<31:0) pcl_in<31:8>0- nOIIO)

"a A sel KMfllfl)

-OAout(31:0) 32b it lotch

b<31:0)O- n(3l!(H

3i 8 sel

-OBout<31:0) 32bit lotch instruction<3l!0>O- in(3IIO>

mtOliO

startO- Bus sel dec

-Oinst(31:0) innedieteO- JjatcF

Figure 6. 1 Input Bus Control Block

51 placed into the link register by the write back stage. Figure 6.2 shows the diagram of the A-bus selection logic. The B-bus is used for immediate instructions. During immediate instructions the lower sixteen bits of the instruction are sign extended to thirty-two bits and placed on the B-bus instead of the latched value allowing the B-bus to be used to pass information to the next stage. Therefore the ALU can do base offset calculations for store operations while the B-bus is free to pass the data to be stored in memory. The

'8' B selector also multiplexes onto the B-bus the number during link operations in order for the PC to be incremented by the correct number. Figure 6.3 shows the logic diagram of the B-bus controller. The J and the S line are created by a small decoder. This decoder is shown in Figure 6.4. It decodes both jump and link instructions and shift variable instructions.

6.2 Outputs

The outputs are controlled in a similar manner as the inputs. These outputs are multiplexed between the arithmetic logic unit, the Hi and Low registers, and the extra program counter incrementer. The control for this mux comes from the instruction and is decoded by a small decoder. The logic diagram of this is shown in Figure 6.5.

52 A Bus selector out(31:0)

A(31:0)O

J[>

so out( i) REG(i) OUT- ocK31:0)O

pci( i zero(31:Q) t zero( i )

for i := 0 to 31

Figure 6.2 A-Bus Selector

53 o 9

o

o CD

cd CO

A)

CO oo

Figure 6.3 B-Bus Selector

54 JAL y^ AA Bus Select ion y y Decoder

r> AA &i

inst(31:0>O- //////////A

Figure 6.4 Bus Selector Decoder

55 Output Select Box

instruct ion(31 :0)O- "sm\ E^L>

7

out(i) alu( alu(i) -O out (31:0) rcaCi) hi_low( Ii hiJow(i) j :

for i := 0 to 31

Figure 6.5 Output Selector

56 7. Exception Creation

Exceptions are typically indicators that signify something out of the has ordinary occurred. There are many types of exceptions. These exceptions can be grouped into two categories, unmaskable and maskable. Turning off the power would be an event that occurs out of the sequence of ordinary operation. The microprocessor, however, cannot do anything about the occurrence of a power failure. Power failure, therefore, is an unmaskable exception. An example of a maskable exception would be an interrupt from a input/output(I/0) device. The microprocessor can mask the exception from the

I/O device while it is performing a routine. This insures that the microprocessor will be free from interupts while it performs a criticall routine.

The masking can then be turned off, allowing interupts to again be serviced.

This is a common practice inside interupt service routines.

The ALU stage in the R3000 architecture generates only one single exception. This exception occurs if the addition or subtraction of two's complement numbers result in number which is greater than the largest number a 32-bit register can hold. This is called the overflow exception. When this exception occurs, it signals that the number created by the arithmetic logic unit is incorrect. The microprocessor cannot be allowed to continue with incorrect data so this error is fatal to the program. At this point, the microprocessor will stop execution and wait for new instructions. The overflow instruction is an unmaskable exception.

In the asynchronous ALU, the overflow exception is also created. This signal is simply created by exclusive ORing the sign bit and the carry bit of the answer. If two positive two's complement numbers are added together and an overflow occurs, the sign bit of the number will have data written into it

57 changing its value from a 0 to a 1. The carry bit however will not be affected.

This causes the two bits to be different and a 1 will occur on the output of the

XOR gate. If a large negative number has a large positive number subtracted from it, the 1 in the sign bit will change to a 0 but because subtraction of two's complement numbers is implemented by adding the two numbers together, the carry bit will change to a 1. This again will cause an overflow. An example of this operation performed on four bit numbers is shown in Figure 7.1.

Fig 7.1 Examples of Overflow Conditions

A schematic of the overflow creation logic is shown in Figure 7.2. The corresponding layout of this logic is shown in Figure 7.3. The overflow creation logic includes some gates to decode the conditions when an overflow occurs. An overflow can only occur during a signed addition or subtraction operation. The ALU performs the same operations for both signed and unsigned additions so during unsigned operations the overflow creation is turned off.

58 data<31:0)O "Ooverf low

inst<31:0)O|

t

Figure 7.2 Schematic ofthe Overflow Logic

59 data(31)

instruct ion

Figure 7.3 Layout ofthe Overflow Logic

60 8. Handshaking

In an asynchronous design, it is important to have a method to know when an evaluation has been completed. Each part of the design must tell the other parts when it completes its current work and is ready to go onto the next evaluation. In this microprocessor, each pipeline stage must tell the previous stage when it is ready to accept a new set of inputs and tell the following stage when it has completed evaluating the current set of inputs. Each part inside the stage must have completed its work in order for the stage to signal that it is ready to move on. Each part is required to generate a completion signal within its circuitry. The completion signal from one stage can tell the next stage to begin its evaluation. From these signals, the flow within the pipeline stages can be controlled.

8.1 Completion Signals

Two methods can be used to create a completion signal. The first is a dummy delay. A dummy delay is a signal created after a period of time long enough to exceed the worst case delay within the circuit. It has no relationship to the actual circuitry performing the function. This method of completion signal offers little increase in speed to the circuit because it is similar to the

will seen method used in a synchronous circuit. No speed up be to operations

of because the in a circuit that take a short period time corresponding dummy

of of the delay is set for the worst case delay any operations the circuit performs.

61 The other method to create a completion signal is to use differential cascade voltage switched logic (DCVSL). This method allows a completion signal to be generated as soon as the evaluation of the inputs have finished. An example of a DCVSL gate is shown in figure 8.1. The figure consists of a

NOR gate that creates a completion signal when its evaluation is completed.

There are two sides

YCC

start 4 ^ done

a 1 a L H b H

r

Vss

Figure 8. 1 DCVSL NOR Gate with Completion Signal

to a DCVSL gate. The first side evaluates the output and the second side evaluates the complement of the output. The start signal controls the evaluation of this gate. When the start signal is low, the P transistors on the

output and the complement to a one value. The top of the gate charge both the

N transistor on the bottom is turned off, in this phase and the charge is stored on the outputs. When the start signal is turned on, the P transistors are turned

62 off and the N transistor is turned on. The two Nmos trees are then allowed to evaluate. Because both trees are complementary, the one tree will evaluate to zero and the other will evaluate to a one. In this circuit, the tree evaluating to a zero will allow the stored charge to discharge through to Vss. The other tree will keep the stored charge and will evaluate as a one. The completion signal can now be created by comparing the outputs of the two trees. The two outputs are NANDed together creating a signal that will evaluate true only when the two outputs are different in the evaluation phase. The gate can have its outputs latched by another circuit before allowing the start signal low again. This causes one tree to charge back up to a one value sending the completion signal low in the process. The outputs of the circuit will follow the path shown in figure 8.2.

precharge

.. /\ (f6) (fi) evaluation

Figure 8.2 Paths ofDCVSL Outputs

DCVSL will create a reliable completion signal that will signal when evaluation is done. However, by the very nature of this logic, these types of gates are slower than the more standard complementary CMOS gates. The

governed the fall time in evaluation of the Nmos trees are by discharging the

same speed as stored charge. This is typically the complementary CMOS

create a signal. gates. The outputs are then compared to done The NAND gate

gate which creates this done signal incurs a delay, slowing the gate further.

Also, between evaluations, a time is need to precharge the gate.

63 This makes the DCVSL gates slower than other methods.

A simulation of the DCVSL NOR gate, shown in figure 8.1, versus the complementary CMOS NOR gate is shown in figure 8.3. The DCVSL outputs change with roughly the same speed as the complementary CMOS gates but the extra delay causes the done signal to come slightly later than the outputs. Also, the precharge time is not shown in this simulation.

In DCVSL logic, the inputs to the gates need to stay stable throughout the entire evaluation phase, because there is no way to charge one of the outputs back up from a zero. In various places, the outputs of the DCVSL gates need to be latched in order to allow the slower sections of the circuit time to evaluate and precharge. This adds another delay to the operation of DCVSL circuits.

A final drawback of DCVSL logic is that both the input and its complement are needed for each and every gate. This causes twice the amount of lines to be needed in the DCVSL implementation as compared to the complementary CMOS implementation. These lines become very cumbersome in a large design. For instance, if a 32 bit adder is to implemented using

DCVSL, 128 lines are need for the inputs. In a two micron technology, if these lines were to be routed using metal two, 1024 microns will be needed just to handle the 128 bit bus.

The overall size comparison between the DCVSL and the complementary CMOS gates is shown in figure 8.4. The two gates shown are both two input NOR gates. The DCVSL gate is much larger due to the added transistors and to the fact that it needs both the inputs and their complements.

For this layout, the complements of the inputs are assumed to be available. If the complements were not available, more circuitry would be needed to create

64 Illlllllllllllllppifpn ii|iiiiiiii|iiiiiiiii|iiii

^- L*2_,t*"^~ -^- inmi/iinin L tn w o in >+ in rn cvj c in m cu o ft) w a in-^hmco o (A) mm (A) v/ (A) dva ino isadqw) ino isaoq/ia 3N0Q 1SADQ/ (A) ino"soH3/

Fig. 8.3 Comparison ofDCVSL and Complementary CMOS Logic

65 cy

oo CD

c_>

Figure 8.4 Layout Comparison of DCVSL versus Complementary CMOS

66 the complement. The DCVSL gate uses eleven transistors while the complementary CMOS gate uses only four transistors to implement the NOR function. The DCVSL gate however is the only one that will produce a completion signal.

In the ALU stage of the asynchronous microprocessor, DCVSL is only used for the critical path. A reliable completion signal is needed for the stage and the DCVSL gates provide this signal without resulting to the use of a dummy delay. Using the DCVSL gates only in the critical path cuts down on the number of larger DCVSL gates needed and the number of extra lines that need to be routed to the gates. The critical path is the path that takes the longest to evaluate within the circuit. This path is then changed over to

DCVSL which slows it down further, ensuring that it will be the slowest evaluating path within the circuit. This gives the circuit a completion signal while also keeping the impact of the slower DCVSL gates to a minimum.

8.2 Handshaking Controller Circuit

Between the stages within the pipeline, a circuit is used to allow the different stages to start and stop themselves. This circuit is called the handshaking controller circuit (HCC). This circuit accepts the done signal from the first stage and then waits for the ready signal from the second. This second stage is ready to accept inputs after it has transferred its data to the third stage. The HCC will then send a start signal to the second stage and simultaneously send a ready signal to the stage previous to the first indicating that the first stage will accept new inputs. This circuit is very important to the control of the pipeline. Each stage in the pipeline has one of these circuits to control its start and completion signals.

67 The latches at the beginning of the ALU stage drive the entire stage.

The start signal from the HCC sends a load signal to these latches causing them

to latch in the new data. The start signal also causes the first DCVSL gate to

begin evaluation. This starts a chain reaction of DCVSL gates. The first gate's

completion signal causes the second gate to evaluate and so on throughout the chain. At the end of the chain, the completion signal from the last DCVSL gate causes the ALU stage's completion signal to trigger. The HCC uses this signal to indicate that the stage is done and that the data is ready for the following

stage. It also uses this time to force the start signal low causing the DCVSL gates to go into precharge phase. The action of the HCC's takes at least six nano-seconds because the HCC will wait for an acknowledge from the next

HCC. This acknowledge signal means that the following stage has latched in this stage's output. This allows time for the DCVSL gates to precharge.

A circuit diagram of a HCC circuit is shown in figure 8.5.

Rendy(n) Ain

Inil Roiit(n)

Aont Ok(n) QkOtl) U

Figure 8.5 Handshaking Controller Circuit

68 When the Ok(n-l) signal goes low, the HCC knows that the previous stage has data it wishes to send. When the current stage finishes, its Ready(n) signal will go low which pulls the Rout(n) signal low. This signals the following stage that there is data waiting. When the following stage is ready for the inputs, it will latch in the data and send an acknowledgment back via the Ain tine. This acknowledgment causes the Aout line to go high. The Aout line acknowledges the previous stage's data, simultaneously latches it in to this stage and sends the start signal to the current stage. This cycle then repeats itself. The Init line is used to force the HCC into the correct state during a reset. A more complete discussion of the HCC circuit can be found in the thesis written by Scott Siers.

69 9. Results

The results of the ALU stage of the asynchronous microprocessor were obtained from circuit simulation. The analog simulator (Accusim from Mentor

Graphics Corporation) was first used to obtain timing information about the various gates used within the circuits. Each gate was simulated using a five volt source. Various capacitive loads were used to simulate line capacitance within the circuit. An example of this type of simulation was shown in section

3. The driving circuit was simulated and its rise, fall, and delay times were found. This timing information was then returned to the digital simulator

(Quicksim II from Mentor Graphics Corporation) to simulate actual running conditions. It would be impractical to use the analog simulator for the circuit simulation of the whole stage because the simulation would be slow and take an extremely long time. The digital simulator simulates faster and incorporates the delay times that were added.

The output of the simulations can be classified into five categories. The five categories are arithmetic and logic, conditional, exceptions, shift and move from HI/LOW instructions. Due to the asynchronous nature of the logic there is no set time that these operations take. There is, however, an average time that can be used for most cases. Table 9.1 shows these average times. The times do not include the overhead added from the handshaking controller. This overhead is approximately 5 nanoseconds and is invariable for all instructions that are performed.

70 Operation Time in nanoseconds

Arithmetic/Logic 25

Conditional 18

Exceptions 15

Shift 10

Move from HI/LOW 7

Table 9. 1 Average Times of Various Operations

Example simulations of these operations are included in Figures 9.1 through 9.5. In all the examples the instruction field only includes the parts of the instruction needed for the ALU stage. The busses shown as the inputs to the stage are before the latches in the stage. This adds the time for the latching to the overall time of the stage. The start pulse is responsible for latching the numbers. Figure 9.1 shows unsigned addition of two numbers. Figure 9.2 shows a branch not equal to zero instruction. The exception is shown in Figure

9.3 and is an example of signed addition. A simple shift is shown in figure 9.5.

This shift shifts left logically zero places and is typically used as a non operation or NOP. The final figure, Figure 9.5, shows a move from HI instruction. During this instruction the Instruction decode stage will have

correct register onto the HI/LOW bus and all the ALU needs already moved the to do is move it onto its output bus.

The longest time taken within the ALU stage was an average of 25

nanoseconds. The fastest that the R3000 can be run at is 40 megahertz. This is

nanoseconds. shows that the a cycle time of 25 This ALU designed in this

equal the one used. stage was roughly to actually The overhead of the handshaking controller, however, puts the overall stage time to 30 nanoseconds

71 which is significantly slower than the R3000. Other instructions were found to simulate faster than the arithmetic/logic instructions. In this area there is some improvement in speed over the synchronous microprocessor even with the added overhead of the handshaking controller.

72 m

m

o cn

to OJ

OJ CNJ

CO tu . E r-

OJ X X X X X =5? X L X J

CO

CO J J

X X X X cn CVJ OJ X X X X

CD CD CD CD CD CD CD

cn co co co cn co co

Figure 9.1 Simulation of Unsigned Addition

73 co CVJ

UD OJ

OJ OJ

CO

CD'

CD

CO

CO

OJ X X X X X X X X

CD CD CD x X CD CD CD X x X CD CD CD X X X CD CD CD X X X O CD CD X x X CD CD CD X r x X CD X r X X CP =*: X X

cocncncocncnco-^cT-j

Figure 9.2 Simulation ofa Branch Not Equal to Zero Instruction

74 + + +

CD cn

UD CXJ

OJ OJ

CJD CD

OJ

CO

CO

+J

CD CD X CD CM X cp Cp =*: CD CD x CD CD X CD CD X CD CD X i^2 52

CD CD CD CD O CD CD

co cn cn rn cn cn cn

an Exception Figure 9.3 Simulation ofa Signed Instruction Causing

75 CO + H ^ 4 4 + + 4 4 + + cd oo

4 i i- + + + t- + +

CD 4 4 + 4 + in OJ

+ CD + 4 4 4 + + CD CD O CD CD ^t- 4 + 4 4 + 4 + oo CO

4 4 4 t- 4 4

OJ

h + 4 + + + + + cri

1- + h -I 4 4 + 4 4- + + +

CD + h + 4 4 + 4 4 + CO

4 H H 4 - 4 4 4 +

CO

4 H H + 4 + 4 4 + CO X X X X H 4 X + X X X ;>< CO 4 H 4 4 4 4 4 + 1 cr>

+ Hh H + 4 4 + f 4 + 4-

4 H + 4 + 4 4 1 cd

+ + 4 4 + 4 4 + t

oo 4 4 4 4 4 4 4 + 1 cn r-j (_ i_ X O cd o X X X X o CD X X X X =*: == 4 cp cp cp ^ X cd o CD X X X X i o CD CD X X X X CD CD O X X X X CD CD CD X X X CD 4 J H 2 52

cn cn cn cn cn cn cn

Figure 9.4 Simulation ofa Shift Instruction

76 CD

CO

CD

CO

r*- r

+ + <

CD

CM

CD

CD

cd e

ccii

CD ? J .. CO

CD

CD

OJ

cococococococo\

* a _o S o -_

Figure 9.5 Simulation ofa Move From HI Instruction

77 10. Conclusions

asynchronous This thesis has shown that there is merit to an microprocessor. In the VHDL simulation of this microprocessor performed by

Paul Fanelli, the asynchronous processor achieved an average of ten million instructions per second (MIPS). The actual speed of the asynchronous processor however is variable depending on which instructions are being performed. Also, the fact that memory was not fully implemented makes further comparison between the synchronous and asynchronous processors

speed meaningless.

Certain instructions in the R3000 microprocessor did provide a

challenge for implementation in an asynchronous method. For example an

entire circuit was added to implement the two BCOND instructions. These

instructions rely heavily on the multiphase clock to perform a variety of

operations during one clock cycle in the synchronous processor. In the future,

it would be prudent to design the architecture of the microprocessor to more

closely match the strengths and weaknesses of an asynchronous design.

The asynchronous methodology did yield a tremendous advantage. All

the components were modular in construction and use. Because each piece of

the design was allowed to operate at its highest speed, nothing in the design

needs to be compromised. An example of the power of this design can be

shown in the memory requirements for a processor such as this. Any speed memory can be used in this processor since the memory stage will wait until the acknowledge is received. Any module in this microprocessor can be replaced or redesigned at any time. As long as the signals match on the inputs and outputs of the piece, no problems with integration will be encountered.

78 Another problem with the asynchronous microprocessor is the size of

the design involved. The design is much larger than its synchronous counter

part. This will become less of a factor as feature sizes on the chip decrease.

The asynchronous design will always be larger, however, due to the added

logic for the asynchronous controls. Also, without the ability to easily add bus

multiplexing logic more signal lines are need to convey the same information.

A problem occurred in the parallel design of this thesis. The VHDL

modeling of this processor should have been done before any circuits or layout

was attempted. Many times through the course of the design, the VHDL

modeling caught problems that were not apparent to the individual designers.

Because the designs were done in parallel with the VHDL modeling, the designs were changed often and at times completely redone to solve the problems encountered.

79 Bibliography

Self-timed Asada, Kunihiro, Okura Kazama, and Kyoung-rok Cho, "Design of a Data Path for a Fully Asynchronous Microprocessor", International Workshop SASIMI'92, Kobe, Japan, 1992.

" Self-Timed Dewilde, Prof. Dr. Ir. P. M., Dr. Ir. R. Nouta, and Ir. M. Sim, A 91- Handshake-Scheme Applied to a Bit-Serial Multiplier", Technical Report 107, Delft University of Technology, June 1991.

Pipelined Ginosar, Ran, and Nick Mitchell, "On the Potential of Asynchronous Processors", News, Vol. 18, No. 4, December 1990.

Hayes, John P., Computer Architecture and Organization. McGraw-Hill, New York, 1988. liana, David, Ran Ginosar, Michael Yeoli, "An Efficient Implementation of Boolean Functions as Self-Timed Circuits", IEEE Transactions on Computers, Vol. 41, No. 1, January 1992.

Jacobs, Gordon M., and Robert Brodersen, "A Fully Asynchronous Digital Solid- Signal Processor Using Self-Timed Circuits", IEEE Journal of State Circuits, Vol. 25, No. 6, December 1990.

Kane, Gerry, MIPS RISC Architecture. Prentice-Hall, New Jersey, 1988.

Kane, Gerry, and Joe Heinrich, MIPS RISC Architecture. Prentice-Hall, New Jersey, 1992.

Mano, M. Morris, Computer Engineering Hardware Design. Prentice-Hall, New Jersey, 1988.

McAuley, Anthony J., "Four State Asynchronous Architectures", IEEE Transactions on Computers, Vol. 41, No. 2, February 1992.

McCluskey, Edward J., Logic Design Principles. Prentice-Hall, New Jersey, 1986.

Mukherjee, Amar, Introduction to nMOS and CMOS VLSI Systems Design. Prentice-Hall, New Jersey, 1986.

80 Nouta, R., M. Sim, and Gunawan, "Two Self-Timed Handshake Controllers for High Speed Applications", IEEE, 1992.

Siers, Scott E., "Design and Implementation of an Asynchronous Version of the MIPS R3000 Microprocessor", Rochester Institute of Technology, November 1993.

Seitz, Charles I., "Self-Timed VLSI Systems", CalTech Conference on VLSI, January 1979.

Tan, Y. K, and Y. C. Lim, "Self-Timed System Design Technique", Electronic Letters, Vol. 26, No. 5, March 1990.

81