
Abstract

HARCOURT, EDWIN ALAN. Formal Specification of Instruction Set Processors and the Derivation of Instruction Schedulers. (Under the direction of Jon Mauney and Thomas K. Miller III.)

We present two techniques for formally specifying an instruction set processor from the programmer's point of view: the architecture view and the timing view. From the timing specification we show how to derive an instruction scheduler for the processor.

One technique addresses architecture specification, that is, the information required to write correct programs. At the architectural level we present a functional semantics that captures the property that instructions are functions from processor state to processor state.

The second specification technique addresses the programmer's view of the timing of the processor, that is, the information required to write temporally efficient programs. We present a technique for formally describing, at a high level, the timing properties of pipelined, superscalar processors. We illustrate the technique by specifying and simulating a hypothetical processor that includes many features of commercial processors, including delayed loads and branches, interlocked floating-point instructions, and multiple instruction issue. As our mathematical formalism we use SCCS, a synchronous process algebra designed for specifying timed, concurrent systems. Putting our specification to use, we show how to construct an instruction scheduler from the specification by deriving appropriate parameters needed for instruction scheduling. These parameters include instruction latencies, illegal instruction combinations, resource constraints, and instructions that may be issued in parallel.

The Formal Specification of Instruction Set Processors and the Derivation of Instruction Schedulers

by

Edwin Alan Harcourt

A dissertation submitted to the Graduate Faculty of North Carolina State University in partial fulfillment of the requirements for the Degree of Doctor of Philosophy

Computer Science

Raleigh

Approved by

Co-chair of Advisory Committee          Co-chair of Advisory Committee

Biography

Edwin A. Harcourt received a B.S. from the State University of New York at Plattsburgh. He received an M.S. in Computer Engineering and a Ph.D. in Computer Science, both from North Carolina State University.

Contents

List of Figures

Introduction
The Architecture Level
Organization Detail vs. Timing Detail
Static Instruction Scheduling
Outline of Dissertation

Mathematical Specifications and Related Research
Hardware Description Languages
Formal Specification Languages
Sequential Formalisms
Formalisms of Concurrency and Time
Code Generator Generation Languages
Resource Requirements
A Marion Extension
Other Processor Features
Summary: The Specification Language Zoo

Functional ISA Specification
Denotational Semantics
Semantic Algebras
The Semantics of ISAs
Notation
The Syntactic Domain
The Semantic Actions
The Valuation Function
Action Implementation
Code Generation
Discussion

Instruction Timing
Architecture and Organization
Delayed Instructions
Multicycle Instructions
Resource Constraints
Multiple Instruction Issue
Motivation
Functions Don't Work

SCCS: A Synchronous Calculus of Communicating Systems
A Small Example
SCCS Syntax
Connecting Processes
An Algebra of Actions
Extensions to SCCS
Transition Graphs
The Operational Semantics of SCCS
Examples
A Zero-Delay and a Unit-Delay Wire
Logic Gates
Flip-Flop

Specifying a Processor
Our Example Microprocessor
Timing Constraints
The SCCS Specification
Instruction Formats as Actions
Defining the Registers
Register Locking
Defining Memory
Instruction Pipeline
Instruction Issue
Arithmetic Instructions
Integer Load and Store Instructions
The Branch Instruction
Interlocked Floating-Point Instructions
Floating-Point Registers
The Fadd Instruction
Structural Constraints
Modeling Finite Resources
Multicycle Floating-Point Instructions
A Normal Form
Register Locking
Resource Requirements
Undefined Instruction Sequences
Summary

Multiple Instruction-Issue Processors
An Integer/Float Superscalar
Instruction Issue
An Integer/Integer Superscalar
Data Dependencies or Data Hazards
Specifying Data Hazards
An Integer/Integer/Float Superscalar

Simulation
The Reactive View
Simulation
A Simple Example
Example: An Illegal Instruction Sequence
Example: A Floating-Point Vector Sum

Instruction Scheduling
Instruction Scheduling Constraints
Instruction Scheduling

Deriving Instruction Scheduling Parameters
Preliminaries
Derivation of Scheduling Parameters
Determining Illegal Instruction Sequences
A Modal Logic for SCCS
Using the Logic in a Processor
Determining Possible Multiple-Issue Instructions
Computing the Resource Usage Functions
Particulate Actions and Agent Sorts
Resources and Actions
Deriving the Resource Usage Functions
Summary

Conclusions and Future Research
Programmer's Timing View
Instruction Scheduling
Future Research
Interrupts
Cache Model
Verification and Synthesis
Formal Methods

A Example Circuits in the Concurrency Workbench
Simple Logic Gates
Not Gates
Half-Adder
Specification
Implementation
Wires
Full-Adder
Specification
Implementation
Flip-Flop
Implementation
Specification

B Our Example RISC in the Concurrency Workbench
The CWB Listing

List of Figures

A VHDL description of an inverter
A full-adder specified in SML
The classification of specification languages
Syntactic domain that specifies PDP instruction formats
Syntax of semantic actions that specify RTL
The semantics of the PDP addressing modes
Semantics of PDP instructions
Semantics of PDP instructions (continued)
Semantics of PDP instructions (continued)
RTL syntax tree corresponding to the Add instruction
A two-stage Add pipeline constructed from two Add agents
Syntax of SCCS expressions
Transition graph of an agent
Operational semantics of SCCS
Equational laws of SCCS
A nor-gate
A flip-flop constructed from two nor-gates
RISC instruction set
State transition graph of the agent Reg
State transition graph of a floating-point register Freg
State transition graph of a Resource
Possible data dependencies in instruction sequences
Possible three-issue instruction sequences
Derivation of a program executing on the processor
Derivation of an illegal program executing
Program that calculates a vector sum
Transition graph of the vector sum program
Dependency graph of the vector sum program
Part of the resource usage function for the MIPS FDIV instruction
Generic list scheduling algorithm
The structure of a typical compiler
Algorithm that derives instruction latencies
Algorithm to detect illegal instruction pairs
Algorithm to detect multiple instruction issue pairs
Algorithm to calculate resource usage functions

Chapter 1

Introduction

A specification of a system is a precise description of the desired behavior of the system, which any implementation should provide. Specifications have traditionally been expressed in natural language, which unfortunately can be vague and hence lead to imprecision and ambiguity. To remedy this problem, many specifications are now being expressed formally, i.e., using some mathematical framework such as first-order logic or algebraic techniques. Once a formal specification has been written, several tasks can be performed: the specification may be used to help derive an implementation, construct a simulator for the system, or aid in writing documentation.

This dissertation addresses the problem of specifying an instruction set processor in a way that is useful for building instruction schedulers. An instruction set processor is what we normally think of as a microprocessor or CPU. Specifically, we address formally specifying a processor at two levels: the architecture level and the instruction timing level. Also, we show how a timing specification can be used to derive an instruction scheduler for the processor.

Formal processor specifications are valuable for a variety of reasons:

- Formal specifications aid in the design process. They require the user to thoughtfully plan and design the system.
- Processor verification requires a formal specification in order to carry out proofs of correctness.
- If the specification language is executable, and both SML and SCCS are, then a timing-level simulator is automatically available.
- High-level synthesis, whether of compilers or hardware, requires some form of specification.
- Formal specifications are used for precise documentation.

Instruction set processors are best viewed hierarchically, at various levels of abstraction. A common partitioning of the hierarchy is as follows:

- The instruction set architecture (or just architecture) level is a functional view that represents the processor as seen by the assembly language programmer or compiler writer. This view only includes information needed to write functionally correct programs.
- The organization level includes the general structure of the processor in terms of functional units, which include integer and floating-point pipelines, branch units, caches, buses, internal latches, etc.
- The logic level contains the low-level implementation details of the functional units.

When discussing or making an argument about a processor, one has to be clear about the level of abstraction at issue. The user of a processor is usually concerned with the architectural level, since the user must have this information to write correct programs.

The Architecture Level

At its highest level, an architecture is an abstract data type where memory and registers (i.e., the machine state) constitute the data, and the machine instructions are the operations defined on the data. At this level, instructions are essentially functions that map states to states. Usually a simple register transfer language (or RTL) notation suffices to describe the computational effects of an instruction. For example, the instruction

    Add Ri, Rj, Rk

may be described by the register transfer statement

    Reg[i] ← Reg[j] + Reg[k]

Until recently there has been no language available for specifying processors at the architectural level. Cook remedied this by designing a language specifically for designing instruction sets. Since the architecture level provides a functional view, where instructions are functions from processor state to processor state, it can be formalized using a functional language. In Chapter 3 we present a functional semantics of a processor at the architecture level using the language SML.
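This functional view can be made concrete with a small executable sketch. The following Python fragment is an illustration only (the dissertation's actual semantics is written in SML, and the register names are made up): it models the machine state as a mapping from register names to values, and an Add instruction as a pure function from state to state.

```python
# Illustrative Python rendering of the architecture-level view: the
# machine state is a mapping from register names to values, and an
# instruction denotes a pure function from state to state.

def add(i, j, k):
    """The state-to-state function denoted by 'Add Ri, Rj, Rk'."""
    def run(state):
        new_state = dict(state)                 # pure: copy, never mutate
        new_state[i] = state[j] + state[k]      # Reg[i] <- Reg[j] + Reg[k]
        return new_state
    return run

state0 = {"R1": 0, "R2": 5, "R3": 7}
state1 = add("R1", "R2", "R3")(state0)
print(state1["R1"])   # 12
```

Composing such functions in program order then gives the meaning of a straight-line instruction sequence.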

Organization Detail vs. Timing Detail

Besides writing correct programs, a user would also like to write efficient programs. Hence the user needs more information than is contained in an architecture specification. For example, in some RISC architectures the following instruction sequence is not the most efficient:

    (1) Load R1, R2        ; R1 ← Mem[R2]
    (2) Add  R3, R1, R2    ; R3 ← R1 + R2
    (3) Add  R4, R5, R6    ; R4 ← R5 + R6

Instruction (2) will usually cause an interlock, because (2) needs to wait for R1 to be loaded from memory by (1), which wastes cycles and causes the pipeline to stall. On some processors the sequence is illegal, which causes the value of R1 to be undefined during instructions (2) and (3). However, instructions (2) and (3) may be switched without altering the meaning of the program, and this switch would reduce the number of stall cycles or make the sequence valid.

When one instruction can interact with another and alter the computational effect in undefined ways (e.g., the load instruction above), or when we wish to capture timing information so that we can write efficient programs, then viewing instructions as functions is no longer sufficient. We consider this point further in Chapter 4.
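The effect of reordering on stalls can be illustrated with a toy model. The following Python sketch is hypothetical (the two-cycle load latency and the simple in-order issue rule are assumptions for illustration, not taken from any particular processor): each instruction names the register it writes, the registers it reads, and a result latency, and we count the cycles an in-order pipeline spends waiting on operands.

```python
# Toy model of operand-wait stalls on an in-order pipeline that issues
# one instruction per cycle. Latencies and register names are illustrative.

def stall_cycles(program):
    """program: list of (dest, sources, latency) tuples; returns total stalls."""
    ready_at = {}          # cycle at which each register's value is ready
    cycle = 0
    stalls = 0
    for dest, sources, latency in program:
        start = max([cycle] + [ready_at.get(r, 0) for r in sources])
        stalls += start - cycle          # cycles spent waiting on operands
        ready_at[dest] = start + latency
        cycle = start + 1                # next instruction issues a cycle later
    return stalls

load_then_use = [("R1", ["R2"], 2),        # Load R1, R2: result ready 2 cycles later
                 ("R3", ["R1", "R2"], 1),  # Add R3, R1, R2: must wait on R1
                 ("R4", ["R5", "R6"], 1)]  # Add R4, R5, R6: independent
swapped = [load_then_use[0], load_then_use[2], load_then_use[1]]

print(stall_cycles(load_then_use), stall_cycles(swapped))   # 1 0
```

Swapping the two Add instructions hides the load's latency behind useful work, which is exactly what an instruction scheduler automates.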

In many modern instruction set processors, the temporal and concurrent properties of the instructions are visible to the user of the processor. Consequently, such properties should be included in a behavioral processor specification. We present a technique for formally describing, at a high level, the timing properties of pipelined, superscalar processors. We illustrate the technique by specifying and simulating a hypothetical processor that includes many features of commercial processors, including delayed loads and branches, interlocked floating-point instructions, and multiple instruction issue (superscalar). As our mathematical formalism we use SCCS (Synchronous Calculus of Communicating Systems), a synchronous process algebra designed for specifying timed, concurrent systems. After showing how to specify a processor, we parameterize the instruction scheduling problem and demonstrate how to derive these scheduling parameters from the specification, essentially yielding an instruction scheduler for the architecture.

Static Instruction Scheduling

Once a processor has been defined at the instruction timing level, there are several ways in which the specification can be used. A hardware designer could use the specification as the requirements specification that any implementation must meet.

However, a particular level of abstraction is both a specification of a lower level and an implementation of a higher level. For example, the organization level can be viewed as an implementation of the architecture level and as a specification for the logic level. In a similar vein, our timing specification could also be viewed as an implementation of the processor for a higher level: the compiler level. It is in this way that we use our timing specification, by extracting scheduling information from the timing specification and putting it in a form suitable for use with an off-the-shelf instruction scheduler.

A subsequent goal of this research, then, is to utilize our specification by generating instruction scheduling information (e.g., latencies, resource requirements) directly from the specification. We develop algorithms that analyze SCCS descriptions and yield the instruction scheduling information. The algorithms depend on the formal operational semantics of SCCS, which is defined in terms of an abstract machine known as a labeled transition system.
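A labeled transition system is simply a set of states together with action-labeled transitions between them. As an illustrative sketch (in Python, with made-up states and actions, not the dissertation's notation), it can be encoded as follows:

```python
# Minimal encoding of a labeled transition system (LTS): the abstract
# machine produced by SCCS's operational semantics. States and actions
# here are illustrative.

from collections import defaultdict

class LTS:
    def __init__(self):
        self.trans = defaultdict(list)   # state -> [(action, next_state)]

    def add(self, state, action, nxt):
        self.trans[state].append((action, nxt))

    def successors(self, state, action):
        return [s for (a, s) in self.trans[state] if a == action]

# A two-state agent: a clock that alternately ticks and tocks.
clock = LTS()
clock.add("T0", "tick", "T1")
clock.add("T1", "tock", "T0")
print(clock.successors("T0", "tick"))   # ['T1']
```

Algorithms that derive scheduling parameters can then be phrased as walks over such a graph.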

Outline of Dissertation

The rest of this dissertation is organized as follows.

- There have been a variety of methods used to specify low-level digital hardware, both formal and informal. Chapter 2 reviews this related research and further motivates our choice of the synchronous process algebra SCCS.
- Instructions, at the highest level, are mathematically state transformation functions. Chapter 3 gives a functional semantics to an example architecture using the functional language SML.
- In Chapter 4 we motivate the need for expressing programmer-level timing constraints in processor specifications.
- In Chapter 5 we introduce SCCS, both formally, using structural operational semantics, and informally, with several examples of using SCCS to describe some simple combinational and sequential circuits.
- Chapter 6 formally specifies, using SCCS, the instructions and their timing constraints of a hypothetical yet realistic RISC-style processor.
- Chapter 7 shows how we can use SCCS to specify various superscalar configurations of our example architecture.
- The operational semantics of SCCS maps SCCS programs to abstract machines. Our SCCS specification is therefore executable in terms of the abstract machine it represents. In Chapter 8 we discuss how the behavior of our architecture is simulated.
- Chapter 9 gives a brief introduction to instruction scheduling and sets the stage for deriving instruction scheduling information from the SCCS specification.
- Chapter 10 describes how to derive instruction scheduling parameters, including instruction latencies, resource constraints, illegal instruction sequences (e.g., delayed loads), and multiple instruction issue.
- Chapter 11 concludes and discusses future research.
- For purposes of introducing SCCS with some simple examples, Appendix A gives the SCCS implementation, using the Concurrency Workbench, of some simple digital circuits.
- Appendix B gives an SCCS implementation of our example RISC architecture using the Concurrency Workbench.

Chapter 2

Mathematical Specifications and Related Research

    At first researchers hoped that there might exist some one unique ideal specification language. But what could it be? Perhaps some kind of logic, say full first order logic? Or perhaps Horn clause logic? Or equational logic? Or LCF? What about the lambda calculus? Or perhaps we should take a state transition approach? But none of these seemed to work.

    — Joseph Goguen, One, None, a Hundred Thousand Specification Languages

There are many formalisms available, and currently being applied, for specifying the intended behavior and/or semantics of computer hardware and architectures. In this chapter we review hardware description languages (e.g., Verilog and VHDL) and formal specification languages such as first-order logic, higher-order logic, temporal logic, equational algebra, relational algebra, concurrency formalisms, the Z specification language, type theory, and the lambda calculus. We also review compiler code generator generators, which also provide specification languages for describing processors.

Hardware Description Languages

The term hardware description language (or HDL) is not a well-defined term. Traditionally it means any language used to specify hardware. But this definition is too broad, as it would include imperative programming languages such as Pascal and Ada, and abstract formalisms such as Constructive Type Theory and Category Theory. Usually an HDL refers to languages such as VHDL and Verilog, which are imperative-style programming languages extended with primitives and libraries that aid in describing hardware.

There are a variety of HDLs, such as Verilog, VHDL, ISPS, and ELLA. All of them suffer in that they lack a formal definition. HDLs are extensions of imperative programming languages, and their problems can be traced to the problems of imperative languages, such as aliasing and side effects. It has also been suggested that HDLs be abandoned altogether and that imperative programming languages be used; using Ada as an HDL has been addressed in the literature. Also, while HDLs are useful for simulating low-level digital hardware, they are unsuitable for use in high-level processor specification, as pointed out by Cook.

VHDL is one of the more popular hardware description languages. The figure below shows the VHDL code that describes the behavior of a simple inverter. There is nothing too surprising about this description, other than that it is a lot of code for specifying a simple inverter. A more serious problem, however, is that since VHDL is not formally defined, attempting to ascertain the behavior of even simple circuits, let alone more complicated circuits, is dependent upon individual implementations of VHDL. For example, the keyword transport in the inverter specification specifies that the delay in the after clause corresponds to the propagation delay associated with passing a value through a wire. If the keyword transport is omitted, then the delay corresponds not to a propagation delay in a wire but to the hold time, that is, the amount of time the signal must persist in order for it to be considered as a signal. Other timing complications are introduced by the various forms of the wait statement on processes (wait, wait on, wait for, wait until, wait for time-expression).

Another criticism frequently leveled at VHDL is that the timing primitives of VHDL are low-level and overly concrete, as the programmer is required to specify timing constraints in nanoseconds or femtoseconds (one femtosecond is the smallest unit of time in VHDL). If one wishes to view time more abstractly, in terms of, say, clock cycles, then a clock process must be constructed that models a clock, and every process must then perform its actions based on a signal that this clock process emits. These same criticisms apply to Verilog, the other most widely used HDL, as it is similar to VHDL.

    -- Define some enumeration types
    type Logic is ('0', '1');
    type Delayflag is (Zerodelay, Constdelay);

    -- The interface to the inverter
    entity INV is
      generic (Timeflag : Delayflag);
      port (X : in Logic; Y : out Logic);
    end INV;

    -- Specify the behavior of the inverter
    architecture Behavior of INV is
    begin
      p : process
        variable delay : Time := 0 ns;   -- illustrative delay values
      begin
        if Timeflag = Constdelay then delay := 5 ns;
        end if;
        if X = '1' then
          Y <= transport '0' after delay;
        else
          Y <= transport '1' after delay;
        end if;
        wait on X;
      end process;
    end Behavior;

Figure: A VHDL description of an inverter

Formal Specification Languages

A formal specification language is one that is based on some rigorous mathematical framework. The specification then becomes amenable to analysis using the mathematical tools of the framework. For example, if we specify a system using first-order logic, we may then be able to use resolution to perform proofs about the system.

Formal specification languages fall into two broad categories: functional (or relational) and reactive. A functional view of a system is one that maps an initial state to a final state or, in the relational view, maps an input to a set of outputs. This view is appropriate for programs that accept all of their inputs when the program begins executing and yield their output when the program terminates. For example, a compiler maps source programs to target programs.

Some systems have no final output, and termination is viewed as a catastrophe rather than as a virtue. The behavior of these reactive systems is defined in terms of the ongoing interaction with their environment. For example, an operating system should not terminate and has no final output, but has an ongoing interaction with users to service their requests. Any concurrent system can be viewed as reactive because, even if its overall role is functional, the system is composed of interacting components. It is in this way that we view a microprocessor reactively: as a system of interacting components (instructions, registers, and functional units).

A recent overview of formal methods applied to the design and analysis of digital systems can be found in the literature.

Sequential Formalisms

We use the term sequential to mean a formalism that does not include primitives to specify concurrency or time. Sequential formalisms include functional languages based on the λ-calculus, algebraic theories, first-order logic, and higher-order logic.

Functional Specification

There has been some research into specifying hardware and architectures using functional methods. Paillet presents a functional semantics of a CISC-style microprocessor at the instruction set level. No attempt was made to specify instruction timing properties or instruction interaction; rather, the intent of the research was to use the specification as a basis to verify an implementation (although this was not done).

    fun HalfAdder (a, b) = (a <> b, a andalso b)

    fun FullAdder (Cin, a, b) =
      let
        val (Sum1, Carry1) = HalfAdder (a, b)
        val (Sum2, Carry2) = HalfAdder (Cin, Sum1)
      in
        (Sum2, Carry1 orelse Carry2)
      end

Figure: A full-adder specified in SML

Charlton presents a technique for introducing a clock into a functional specification by using lazy lists to represent an infinite stream of clock pulses. The research was limited to a consideration of how one could introduce a clock into a functional model; nothing was subsequently done with the specification.

Gordon represents finite-state systems using the λ-calculus. He then showed how one could verify circuits in the λ-calculus using fixpoint induction. As an example, the figure above uses the functional language SML to specify a full-adder from two half-adders.

Of the various abstraction levels of a processor, the functional view is appropriate if we wish to view it at its highest level, the architecture. In this view, instructions are functions from states to states. However, a function that represents an instruction specifies the final values of an instruction, and not how this value was computed nor how long it took to compute. This view makes it difficult to specify timing or instruction interaction.

HOL: Higher-Order Logic

Probably the most widely used formalism for formally specifying hardware is higher-order logic, in particular the logic embodied in the HOL system of Gordon. See Gordon for an introduction to HOL and Melham for using HOL in hardware verification. Higher-order logic is an extension of first-order logic in which variables can range over functions and predicates. HOL uses the lambda calculus to specify functions and consequently subsumes functional methods.

HOL is very expressive and has been used to describe other formalisms, such as a subset of VHDL, CCS (the asynchronous version of SCCS), the π-calculus, higher-order CCS, and Hoare's CSP. This suggests that HOL is a good metalanguage and is useful for defining other formalisms within HOL. However, there is a sense in which HOL is too expressive, according to Goguen:

    most logics are too expressive. We need more restricted logics in order to get executable specification languages that can be used for rapid prototyping and even efficient programming. And we certainly want to avoid specifying systems that are unimplementable because they involve uncomputable functions or relations.

Higher-order logic is undecidable; that is, it is undecidable whether a statement in higher-order logic is true. Consequently, the HOL system is not a theorem prover but a proof checker, and provides a support environment for carrying out proofs in HOL by hand. Moreover, HOL cannot directly be used to simulate the system being specified, as there is no notion of the operational behavior of an HOL statement.

As in functional methods, neither time nor concurrency are primitives in the logic, but are encoded. However, because of HOL's expressiveness, it is somewhat easier to represent both time and concurrency than in a functional approach. For example, to represent concurrency in HOL we do not need to talk about sets of final values, but simply define two events φ and ψ to occur in parallel if φ and ψ are both true at time t, that is, if φ(t) ∧ ψ(t) holds. So logical conjunction is used to compose two elements in parallel.

Typically, time is represented using the natural numbers (usually abbreviated ℕ). Consider two ports, in and out. A port can be represented as a function from time to booleans:

    in, out : ℕ → bool

We use HOL's logical operations (the standard ∧, ∨, ¬, ⊃, ∀, ∃) to specify constraints on the port values. For example, the invariant that the value at port out at time t + 1 is the same as the value on port in at time t is specified as

    ∀t. out(t + 1) = in(t)

This represents a one-bit unit-delay element, where the output at time t + 1 is the same as the input at time t. The definition below also specifies that at time 0 (power-up) the contents of the register is F:

    Reg(in, out : ℕ → bool)  ≝  ∀t. out(t) = ((t = 0) → F | in(t − 1))

The notation (cond → true | false) is a conditional expression.

The next definition connects two registers together, creating a delay circuit that inputs a bit at time t and outputs it at time t + 2. The types of the ports in and out are omitted for readability, but are functions from time to booleans as above:

    TwoRegs(in, out)  ≝  ∃c. Reg(in, c) ∧ Reg(c, out)

This shows how, using conjunction, two registers are instantiated in parallel. Existentially quantifying the variable c makes an internal connection between the two registers.
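The relational HOL definitions above can be mirrored executably. The following Python sketch (an illustration, not HOL; the input waveform is made up) renders a port as a function from time to booleans and a register as a unit-delay transformer on ports, so that two registers in series are just two such transformers composed:

```python
# Illustrative executable rendering of the register specification:
# a port is a function from time (a natural number) to a boolean;
# a register outputs False at power-up and its input delayed one step.

def reg(in_port):
    """Given a port (time -> bool), return the output port of a unit-delay register."""
    def out(t):
        return False if t == 0 else in_port(t - 1)
    return out

def two_regs(in_port):
    """Two registers in series: a two-cycle delay with an internal connection."""
    return reg(reg(in_port))

in_port = lambda t: t % 2 == 0       # an arbitrary input waveform
out = two_regs(in_port)
print(out(0), out(1), out(5))        # False False False
```

Note how function composition plays the role that conjunction plus existential quantification plays in the HOL definition.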

Cohn describes the Viper microprocessor using HOL. The Viper is a microcoded processor that does not include any instruction-level parallel properties. Also, Herbert discusses specifying and verifying microcoded microprocessors using HOL in general. Again, the techniques do not address instruction-level parallelism.

Formalisms of Concurrency and Time

There are well-developed formalisms for explicitly dealing with reactive systems. The most common formalisms are process algebras and modal and temporal logics.

Process Calculi

There are a variety of formalisms for specifying asynchronous and/or synchronous concurrent systems, including Petri Nets, CCS, SCCS, ACP, CIRCAL, HOP, Esterel, the π-calculus, and CSP. Process algebras have been used to give a semantics to a communications protocol language (LOTOS), a parallel object-oriented language (POOL), a computer-integrated manufacturing system, low-level digital hardware, and biological ecosystems.

The various calculi can be divided into asynchronous and synchronous languages. The synchronous languages can be considered more expressive, as it is typically possible to embed an asynchronous calculus within a synchronous one. For example, the asynchronous calculus CCS is definable within the synchronous calculus SCCS.

In an asynchronous formalism, parallelism is reduced to nondeterminism, or interleaving. For example, let

    A × B  mean processes A and B execute in parallel,
    A + B  mean nondeterministically execute either A or B,
    a.A    mean execute the action a and then become the process A.

In an asynchronous formalism, if the process A can do some action a and become the process A′ (we will write A −a→ A′), and the process B can do some action b and become the process B′, then the process A × B is equivalent to the process a.(A′ × B) + b.(A × B′). That is, the process A × B can execute A for a step or B for a step, but not both at the same time (forgetting for the moment the possibility that A and B might wish to communicate).

This is not the correct way to view clocked hardware, as a circuit does not operate by interleaving the execution of the subcomponents of the circuit. This problem with asynchronous formalisms has been pointed out by Berry. For example, if we construct a half-adder from an or-gate and an xor-gate, the circuit does not operate by computing the result of the or-gate and then computing the result of the xor-gate, or vice versa, but by really computing both the or and the xor at the same time. Consequently, we dismiss asynchronous calculi (i.e., CCS, ACP, LOTOS, CSP, Petri Nets), as they cannot directly represent the global clock associated with hardware. They have, however, been applied to representing asynchronous circuits, a current area of active research.

Synchronous Process Algebras

Several of the algebras cited above (SCCS, Esterel, CIRCAL, HOP) are designed for specifying synchronous systems, in other words, systems that have a global clock, where events occur based on the clock and processes execute in lock step. Given the above example, the process A × B is equivalent to the process (a · b).(A′ × B′), where · is some binary operation on actions. This means that both A and B perform their actions on the same clock cycle. Here × is now being used as a synchronous parallel operator rather than an interleaving one.
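The contrast between the two parallel operators can be sketched concretely. Given a single step of each component, interleaving yields two possible one-action successors, while the synchronous product yields a single composite step; the following Python fragment (with illustrative process and action names) makes this explicit:

```python
# Contrast between asynchronous interleaving and the synchronous product.
# Each step is a triple (process, action, next_process).

def interleave(step_a, step_b):
    """Possible first moves of A x B under interleaving: A moves, or B moves."""
    (A, a, A2), (B, b, B2) = step_a, step_b
    return {(a, (A2, B)), (b, (A, B2))}

def synchronous(step_a, step_b):
    """The single first move of A x B when both act on the same clock tick."""
    (A, a, A2), (B, b, B2) = step_a, step_b
    return {((a, b), (A2, B2))}

step_a = ("A", "a", "A'")
step_b = ("B", "b", "B'")
print(interleave(step_a, step_b))
print(synchronous(step_a, step_b))   # {(('a', 'b'), ("A'", "B'"))}
```

The composite action pairing here stands in for the binary operation on actions (written · above) that a synchronous calculus provides.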

The synchronous languages are all related in that SCCS is the father of them all. Our choice of SCCS over the other synchronous formalisms is largely pragmatic. SCCS:

- is the most widely used and known;
- is the most mature and formally developed;
- has a large body of research to draw upon for doing formal analysis;
- has tools available for carrying out analysis;
- has a simple syntax;
- has a clear and concise operational semantics;
- is compositional, allowing larger systems to be composed of smaller ones.

No one has yet applied SCCS to the problem of hardware specification.

Modal and Temporal Logic

There are some logics that deal explicitly with time and concurrency. Modal logic extends classical propositional or first-order logic by including two operators, □ and ◇, that represent necessity and possibility. (Really, the logic need only include one of the operators, as one can always be defined in terms of the other using negation.) In a temporal context these operators are usually interpreted as henceforth and eventually.

For example, the temporal logic of Pnueli has a temporal operator □ that means henceforth and an operator ◯ that means next. Consider the example from the discussion on HOL, where we wished to represent the invariant that the value on a port in at time t appeared at port out at time t + 1. This can be represented in temporal logic by

    □(◯out = in)

This says that from now on, the next output is equal to the current input.
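Over a finite trace of port values, this invariant is easy to check mechanically. The following Python sketch is an illustration only (temporal logic formulas are of course interpreted over infinite behaviors; here we test a bounded prefix with made-up waveforms):

```python
# Bounded check of the formula box(next(out) = in): at every position t
# where t+1 exists in the trace, out at t+1 must equal in at t.

def henceforth_next_equals(in_trace, out_trace):
    return all(out_trace[t + 1] == in_trace[t] for t in range(len(in_trace) - 1))

in_trace  = [True, False, False, True]
out_trace = [False, True, False, False]   # out is in, delayed by one step
print(henceforth_next_equals(in_trace, out_trace))   # True
```

This is the same delayed-port invariant expressed relationally in the HOL discussion, now phrased as a property of traces.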

It turns out that there is a strong connection between modal (temporal) logics and process calculi. In fact, we will use a particular modal logic, the μ-calculus, to check whether our microprocessor specified in SCCS has certain properties. This connection between modal/temporal logic and process algebra was first established by Pnueli. Briefly, the relationship is this: the operational semantics of SCCS is defined in terms of a labeled transition system (defined in another chapter). In turn, the labeled transition system is used as a model for formulae in the modal logic. That is, the abstract machine generated by the operational semantics of one language (SCCS) is used as a model of another (the modal μ-calculus). We will discuss this in more detail in a later chapter.

Compiler Code Generator Generation Languages

The compiler research area of retargetable code generation offers another method of specifying an architecture. A retargetable code generator uses a general code generation strategy (often some type of pattern matching, as in Twig) parameterized by the properties of the architecture, which are encapsulated in a machine description. However, the term machine description is misleading, as the machine description does not describe a machine as much as it describes an intermediate language (IL). A better term would be code generator description, as one would expect that a machine description describe an architecture's instructions directly by mapping them to some other language. But what a machine description describes is how the compiler's intermediate language maps to the machine's instructions. That is, we would expect

  Machine Description : Instructions → IL

but what we really have is

  Machine Description : IL → Instructions

This problem has been remedied by Giegerich, who proposes a way of inverting the machine description. Until recently, these retargetable code generation systems have dealt only with instruction selection and not instruction scheduling; that is, they have not dealt with instruction timing properties.

There has been some work, by Bradlee and by Proebsting and Fraser (called PF from here on), in compiler code generator description languages where instruction timing properties are inserted into the machine description to try to address the appropriate user-level view of instruction timing. Bradlee was the first to propose extending machine descriptions with instruction timing properties. The specification language Marion was subsequently used to perform retargetable instruction scheduling and as a basis for integrating register allocation with instruction scheduling. The goal of PF was to generate, from a specification of the resource requirements of a processor, a space-efficient state transition graph that can be used for instruction scheduling. Consequently, PF's machine description only includes a method of specifying resource requirements and is not as expressive as Bradlee's. Since our primary interest is in specification, and because Marion is more expressive than PF's description language, we only consider Bradlee in detail.

Resource Requirements

From the time an instruction is fetched to the time it has completed typically takes several cycles. On each cycle an instruction uses a subset of the processor's resources. The specification languages of PF and Bradlee specify these resource requirements directly. For example, the MIPS processor's floating-point unit has as its resource set an unpacker, shifter, adder, rounder, two-stage multiplier, divider, and two exception resources. These are abbreviated by Bradlee as U, S, A, R, M1, M2, D, E1, and E2, respectively. Using Marion's syntax, the resource constraints of the MIPS processor's single-precision floating-point divide instruction divs are specified as

  U SA SR S D*13 DA DR DA DR A R

which means that the divs instruction uses the

- unpacker on the first cycle,
- shifter and adder on the second cycle,
- shifter and rounder on the third cycle,
- shifter on the fourth cycle,
- divider for thirteen consecutive cycles starting on cycle five,

and so on.
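Resource vectors of this kind can be read as reservation tables, and a scheduler uses them to detect structural hazards. The Python sketch below is ours, not Marion's: the resource names follow Bradlee's abbreviations, and the cycle-by-cycle usage is an illustrative rendering of the vector above rather than a faithful transcription of Marion's semantics.

```python
# Each entry is the set of resources the instruction holds on that cycle.
DIVS = [{"U"}, {"S", "A"}, {"S", "R"}, {"S"}] + [{"D"}] * 13

def conflicts(table_a, table_b, issue_gap):
    """True if issuing table_b `issue_gap` cycles after table_a
    would make both instructions claim a resource on the same cycle."""
    for cycle, used_b in enumerate(table_b):
        cycle_a = cycle + issue_gap
        if cycle_a < len(table_a) and table_a[cycle_a] & used_b:
            return True
    return False

# Two divides issued one cycle apart collide (both need the shifter);
assert conflicts(DIVS, DIVS, issue_gap=1)
# issued far enough apart, they do not.
assert not conflicts(DIVS, DIVS, issue_gap=len(DIVS))
```

The minimum legal issue gap between two instructions is then the smallest gap for which `conflicts` returns false, which is exactly the parameter an instruction scheduler needs.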

There are two main criticisms of Marion. The first criticism is that it is not formal, as we only have intuitive notions of what the operators mean. This precludes Marion from being used for anything except Bradlee's own algorithms. That is, in order to use Marion for other tasks which we expect from a specification (i.e., verification, simulation, documentation), the specification language needs to be given a formal definition.

The second criticism of Marion, and of this approach, is that it is sometimes too low-level a way of specifying the timing constraints of an instruction. For example, the Marion specification of the resource requirements for the MIPS processor's integer addition instruction add is given as

  IF RD ALU MEM WB

which specifies that the instruction uses a different resource on each clock cycle, cycles one through five. More specifically, IF, RD, ALU, MEM, and WB represent the instruction fetch, register read, execute, memory access, and register write-back stages of the integer pipeline, respectively, though this information can only be inferred by the reader from already knowing how instruction pipelines are usually structured.

Yet there are no timing constraints on the MIPS's add instruction, and in a program the add instruction can be used in any context. This is impossible to ascertain from the description because we do not know precisely what the description means. As it turns out, the reason that there are no constraints is that there are other resources (forwarding hardware) which are left unspecified. It is not possible to specify these extra resources, as it does not suffice to just enumerate the forwarding hardware as resources in the specification language. One must specify how the forwarding resources are connected to the other resources and how they interact. This capability is currently beyond the specification languages of both PF and Bradlee.

A Marion Extension

To remedy the situation where Marion's resource vector does not accurately reflect the timing properties of the instruction, Bradlee associates a delay value with the instruction. In his notation, the add instruction is now specified as

  IF RD ALU MEM WB (1)

Here the (1) represents the latency of the instruction, or how many cycles need to pass before a subsequent instruction can use the result produced by the add. Since the latency is only one, the very next instruction may use the result. Now, in this case, the resource vector does not yield any useful information about the instruction's timing constraint, which is contained in the extra parameter, the delay value. So sometimes specifying resources is unnecessary, as the timing constraints of the instruction do not depend on them. However, if an instruction's timing constraints do depend on resource constraints, then resources need to be included in the instruction specification. While omitting unproblematic resources is not possible in Marion, we will see that our specification technique using SCCS will allow us to specify latency values with or without resource information, depending on whether it is needed or not.

The situation is actually more complicated. Bradlee also associates with each instruction a value that represents the number of delay slots for the instruction, which for most instructions is zero. For example, the statement

  IF RD ALU

together with a tuple of timing values represents the timing constraints of a delayed branch instruction; the tuple gives the latency and the number of required delay slots, respectively. Furthermore, a negative value for the number of delay slots indicates that the instructions in the delay slots are only executed if the branch is taken; a positive value indicates that the instructions are always executed.
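These per-instruction parameters are exactly what a scheduler consumes. The sketch below is ours (the table layout, field names, and the particular timing values are illustrative, not Bradlee's): it records a latency and a delay-slot count per opcode and answers the basic scheduling question of whether an instruction placed d cycles after a producer may read its result.

```python
TIMING = {
    # opcode: (latency, delay_slots); the values here are illustrative.
    "add":    (1, 0),
    "load":   (2, 1),   # delayed load: result usable two cycles later
    "branch": (1, 1),   # one delay slot
}

def can_use_result(producer, distance):
    """True if an instruction `distance` cycles after `producer`
    may read the value `producer` computes."""
    latency, _ = TIMING[producer]
    return distance >= latency

assert can_use_result("add", 1)        # the very next instruction is fine
assert not can_use_result("load", 1)   # must respect the load delay
assert can_use_result("load", 2)
```

A list scheduler would consult such a table when deciding how far apart to place dependent instructions; the last chapter of this thesis derives these parameters from an SCCS specification instead of stating them by hand.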

Other Processor Features

Some processors use priority schemes to resolve resource conflicts. For example, a processor may give priority to certain instructions over others if they both require the data on the same cycle. Expressing this is not possible in the framework of Bradlee and PF, but it is in SCCS, and in process algebras in general.

Also, superscalar processors execute more than one instruction on each cycle. Specifying this will be possible in SCCS but is not within the frameworks of Bradlee and PF.

Summary: The Specification Language Zoo

In this chapter we have reviewed two areas of processor specification, each adopting a particular viewpoint: hardware description languages (HDLs) and compiler code generator machine descriptions. We argued that the language we require should be formal and be able to explicitly specify the temporal and concurrent properties of instructions. We then argued that no one has addressed formally specifying instruction timing at the appropriate user level.

In the compiler code generation view, we indicated that while there were attempts to specify architectures at the correct user's timing level, the attempts were ad hoc and informal, and were only partially successful at hitting the correct level of abstraction. With these criticisms at hand, we then justified our choice of SCCS as our specification language.

The figure below shows a classification of the various languages and formalisms from the point of view of processor specification, as discussed in this chapter. From our perspective, every language is a processor specification language, which is categorized as either a hardware description language or a code generator description language. Marion and Twig are code generator description languages; VHDL, Verilog, and ELLA are informal hardware description languages. SML and Haskell are two particular implementations of the λ-calculus, and HOL is an implementation of higher-order logic. CTL (computational tree logic, a particular temporal logic) and the μ-calculus are two particular modal logics that are related to process algebra in that they are used to describe reactive systems. SCCS, CIRCAL, and Esterel are synchronous process algebras, while CCS, Petri Nets, and CSP are asynchronous process formalisms. The tree diagram represents the current situation and does not mean to imply that there will never be such a thing as a formal code generator description language. Also, there has been some work at giving VHDL a formal semantics (currently only subsets of VHDL), which could move it into the realm of a formal hardware description language.

[Tree diagram: Processor Specification divides into Hardware Description Languages and Code Generator Description Languages (Twig, Marion). Hardware Description Languages divide into Informal (VHDL, Verilog, ELLA) and Formal; Formal divides into Functional (Lambda-Calculus: SML, Haskell; Higher-Order Logic: HOL) and Reactive (Modal Logic: CTL, mu-calculus; Process Algebras). Process Algebras divide into Asynchronous (CCS, Petri Nets, CSP) and Synchronous (SCCS, CIRCAL, Esterel).]

Figure: The classification of specification languages

Chapter

Functional ISA Specification

In this chapter we present a methodology for formally specifying a denotational, or functional, semantics of an instruction set architecture. More specifically, since there are basic architectural operations common to all architectures (e.g., reading and writing registers and memory), we create an abstract data type of primitive architectural operations suitable for representing a wide variety of architectures. This functional view specifies final computations of instructions and not instruction timing properties.

Denotational Semantics

A denotational semantics of a programming language represents each syntactic construct of the language in terms of mathematical objects (e.g., sets, functions, relations). We can view an architecture as a low-level programming language (but a high-level view of a processor) in which each instruction is a syntactic object that represents an operation on the state of the processor. That is, an instruction is a state transformation function

  I : State → State

and State is the type of a function from locations (register numbers or memory addresses) to values (integers):

  State     = Locations → Values
  Locations = RegNums ∪ Addresses
  Values    = integer

If σ is the state function, then σ(x) is the value stored at location x. A new state can be built from an old state σ with the state update function σ[x ↦ d], which means the state σ with x updated to d. Given this definition of state, we can begin to describe the semantics of individual instructions. For example, the MIPS Add instruction is defined by

  I⟦Add Ri, Rj, Rk⟧ σ = σ[Ri ↦ σ(Rj) + σ(Rk)]

The notation I⟦·⟧ represents the valuation function. The meta-brackets ⟦ ⟧ surround the syntax of an instruction and separate the language being defined from the defining language. The state function σ is an argument to the valuation function, and an updated state function is returned. This coincides with our definition of instructions being functions from state to state.

The advantage of a denotational semantics is that it provides a facility to reason about specifications and to mathematically prove properties about the specification. For example, consider the definition of an instruction that increments by one the value stored in the register:

  I⟦Inc Ri⟧ σ = σ[Ri ↦ σ(Ri) + 1]

Given the definitions of Add and Inc, it is now an easy matter to prove that

  Inc Ri ≡ Add Ri, Ri, 1

by checking that

  I⟦Inc Ri⟧ σ = I⟦Add Ri, Ri, 1⟧ σ

By substituting each side of the ≡ with its definition, we see that

  σ[Ri ↦ σ(Ri) + 1] = σ[Ri ↦ σ(Ri) + 1]

This property of being able to substitute equals for equals is known as referential transparency and is a property of functional specifications not shared by traditional imperative programming languages and some hardware specification languages such as VHDL or Verilog.
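The equational reasoning above can also be exercised by executing the definitions. In this Python sketch (ours, not from the thesis), a dictionary stands in for σ, an instruction denotes a function from state to state, and, since Python has no immediate operands here, we use a register holding 1 in place of the literal:

```python
def update(state, loc, val):
    """sigma[loc -> val]: build a new state, leaving the old one intact."""
    new_state = dict(state)
    new_state[loc] = val
    return new_state

def add(i, j, k):
    """Add Ri, Rj, Rk as a function from state to state."""
    return lambda s: update(s, i, s[j] + s[k])

def inc(i):
    """Inc Ri: increment register Ri by one."""
    return lambda s: update(s, i, s[i] + 1)

sigma = {"R1": 41, "R2": 1}
# Inc R1 and Add R1, R1, R2 (with R2 holding 1) denote the same state
# transformation on this state: substituting equals for equals.
assert inc("R1")(sigma) == add("R1", "R1", "R2")(sigma)
assert sigma == {"R1": 41, "R2": 1}   # states are values, never mutated
```

Because `update` copies rather than mutates, the sketch preserves the referential transparency the text appeals to: applying an instruction to σ never changes σ itself.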

Semantic Algebras

We make our denotational specifications more readable and easier to work with by creating a library of architectural primitives. This library is essentially an abstract data type and is sometimes called a semantic algebra. The operations defined by the semantic algebra are called semantic actions. Rather than specify the architecture directly as a denotational semantics, we first translate it into semantic actions defined by the semantic algebra. The semantic algebra is a register transfer language, or RTL, which is then specified denotationally.

This specification technique is unique in that:

- Specifying the semantic actions occurs once. When this is done, many architectures can be specified using the predefined actions.
- The semantics of the actions need not be specified denotationally but could be specified using any formal semantic method (e.g., operational or algebraic semantics). This gives our semantic specification an abstractness and modularity that is absent from other semantic specification methods. This style of semantics is called a separated semantics.

The Semantics of ISAs

The denotational semantics of an ISA will comprise the three parts of a semantics: a syntactic domain, a semantic domain, and a valuation function.

- Syntactic domain: The syntactic domain is the instruction formats of the architecture being specified.
- Semantic domain: The semantic domain is a set of semantic actions. In this paper we use actions that implement a register transfer language (RTL).
- Valuation function: A valuation function specifies how an item in the syntactic domain maps to an item in the semantic domain. In this case, the valuation function formally describes an instruction in terms of semantic actions.

To illustrate the process, we use the PDP-11 as an example ISA.

Notation

It is common to use a modern functional language to specify a denotational semantics, and in our case we will use Standard ML. While SML is not completely referentially transparent due to its imperative features, its semantics is formally specified. Also, if we restrict ourselves to the purely functional subset, we keep referential transparency.

Traditionally, the syntactic domain is specified with a context-free grammar. Since the syntax of instructions is simple, SML's algebraic data type constructor, datatype, is used to specify the abstract syntax.

The semantic domain is a set of semantic actions that represent basic architectural operations and, as was mentioned before, is RTL. The semantic actions constitute a language with a syntax and a semantics, and to be complete, both must be specified.

The valuation function, as in a traditional denotational description, will be described using SML functions. Since our semantics is directly coded into SML, a simulator for the architecture is immediately available.

The Syntactic Domain

The syntactic domain consists of the ISA's instruction formats. The figure below shows the SML definition of the syntax of the PDP-11 instruction formats. For example, a PDP-11 Add instruction with a register-direct destination and an autoincrement source is represented by an SML construct of the form

  TwoOp (ADD, RegDirect r, AutoInc s)

and an Add with indexed and immediate operands is represented by

  TwoOp (ADD, Indexed (r, offset), Immediate v)

Here TwoOp, RegDirect, AutoInc, Indexed, and Immediate are type constructors, or tags.
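In a language without datatype, the same tagged representation can be sketched with tuples. The Python fragment below is ours (the constructor spellings follow the SML figure; the concrete register numbers are purely illustrative):

```python
# Tagged tuples playing the role of SML datatype constructors.
def TwoOp(op, dest, src):   return ("TwoOp", op, dest, src)
def RegDirect(r):           return ("RegDirect", r)
def AutoInc(r):             return ("AutoInc", r)

# Add with a register-direct destination and an autoincrement source.
instr = TwoOp("ADD", RegDirect(1), AutoInc(2))

# Pattern matching reduces to inspecting the tag.
tag, op, dest, src = instr
assert tag == "TwoOp" and op == "ADD"
assert src[0] == "AutoInc"
```

The tag is what makes the representation an abstract syntax rather than just data: every later phase (valuation, code selection) dispatches on it, exactly as the SML valuation functions dispatch on constructors.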

The Semantic Actions

This section describes the syntax of the language of semantic actions, RTL. We defer their implementation until the Action Implementation section below. The choice of the semantic actions is important: they must be capable of specifying the low-level semantics of a machine instruction and also be suitable as an intermediate representation for the front end of a compiler. Register transfer language (RTL) satisfies both criteria. The figure below shows the SML representation of the RTL syntax. The RTL semantic actions constitute an abstract data type that separates the RTL syntax from its semantics. Now the syntax can be used effectively for pattern matching. This keeps the low-level implementation details of the RTL operators hidden.

The RTL actions are of two kinds: values and imperatives. Value actions produce values (e.g., memory fetching, register access, addition, etc.) and imperative actions alter the state (e.g., assignment and statement sequencing). A valid RTL program is a sequence

type RegNum         = int
type Offset         = int
type ImmediateValue = int

(* Instruction formats *)
datatype Instruction =
    TwoOp  of TwoOpInstr * Operand * Operand
  | TwoOp1 of TwoOpInstr1 * int * Operand
  | OneOp  of OneOpInstr * Operand
  | Branch of BranchInstr * int

(* Instructions allowed for each format *)
and TwoOpInstr  = ADD | MOV | MOVB | SUB | CMP | CMPB
and TwoOpInstr1 = XOR | MUL | DIV | ASH | ASHC
and OneOpInstr  = CLR | CLRB | INC | DEC
and BranchInstr = BR | BNE | BEQ | BGT | BGE | BLT | BLE

(* Operand formats and addressing modes *)
and Operand =
    RegDirect       of int
  | RegIndirect     of int
  | AutoInc         of int
  | AutoDec         of int
  | Immediate       of int
  | Indexed         of int * int
  | AutoIncIndirect of int
  | AutoDecIndirect of int
  | IndexedIndirect of int * int

Figure: Syntactic domain that specifies the PDP-11 instruction formats

datatype
  Mode = CC | Address | Int8 | Int16 | Int32 | Int64 | Float | DoubleFloat

and
  Value =
    (* Arithmetic operators *)
    Mem        of Mode * Value
  | Reg        of Mode * int
  | Add        of Mode * Value * Value
  | Sub        of Mode * Value * Value
  | Mul        of Mode * Value * Value
  | Div        of Mode * Value * Value
  | Mod        of Mode * Value * Value
  | Uminus     of Mode * Value
  | Integer    of Mode * int
    (* Comparison operators *)
  | Compare    of Mode * Value * Value
  | Eq         of Mode * Value * Value
  | Neq        of Mode * Value * Value
  | Gt         of Mode * Value * Value
  | Gteq       of Mode * Value * Value
  | Lt         of Mode * Value * Value
  | Lteq       of Mode * Value * Value
  | IfThenElse of Value * Value * Value
    (* PC and condition codes *)
  | PC
  | N
  | V
  | Z
  | C

and
  Imperative =
    Sequence of Imperative list
  | Parallel of Imperative list
  | Assign   of Value * Value
  | Call     of Value
  | Nop
  | Return

Figure: Syntax of semantic actions that specify RTL

of imperatives. The Parallel operator specifies that an instruction has multiple effects on the state that occur simultaneously. For example, an instruction ADD Dest, Source performs the operation Dest ← Dest + Source and, in parallel, assigns condition codes based on Dest + Source.

The Valuation Function

The valuation function maps syntactic objects (instructions) to semantic actions (RTL). We specify the valuation function in two parts: addressing modes and instructions. The valuation functions build prefix operator terms, or abstract syntax trees, that represent the effect of the operations. It is these terms that can either be given a semantics (as we will do in the next section), matched for code selection, or analyzed for code optimizer generation.

In the following discussion, for the sake of clarity and adhering to common notational conventions, we have strayed slightly from strict SML syntax.

Addressing Modes: A valuation function for an addressing mode returns a Value (see the RTL syntax figure above) that describes how an operand is accessed. Some addressing modes (e.g., the autoincrement/decrement modes) cause side effects on the processor state. Consequently, an addressing mode also returns an Imperative that specifies this side effect. Where no side effects take place, the imperative Nop is returned. The type of the addressing mode valuation function Op is

  Op : Operand -> Mode -> Value * Imperative

where the type Operand (figure above) represents an operand, including its addressing mode, and Mode (figure above) is the size of the data being accessed. Function Op is higher-order: Op takes an Operand and returns a function from Mode to a tuple.

For example, consider the autoincrement addressing mode. The valuation function Op is

  Op (AutoInc RegNum) Mode =
    let
      val r = Reg (Address, RegNum)
    in
      (Mem (Mode, r),
       Assign (r, Add (Address, r, Integer (Address, Sizeof Mode))))
    end

Op returns a tuple, the first item of which is how the operand is accessed, and the second item is the side effect of the autoincrement addressing mode. AutoInc RegNum is abstract syntax, and Reg, Mem, Assign, Add, and Integer are semantic actions. The remainder of the PDP-11's addressing modes are similarly defined and are shown in the figure below.
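The shape of Op, an operand mapped to an access expression paired with a side-effect imperative, can be mirrored directly in Python. This sketch is ours: the tag names follow the SML figures, terms are nested tuples, and Sizeof is hard-wired to 2 bytes purely for illustration.

```python
def op(operand, mode):
    """Return (value-expression, side-effect imperative) for an operand."""
    tag, regnum = operand
    if tag == "RegDirect":                 # no side effect
        return ("Reg", mode, regnum), ("Nop",)
    if tag == "AutoInc":                   # access memory, then bump register
        r = ("Reg", "Address", regnum)
        access = ("Mem", mode, r)
        bump = ("Assign", r, ("Add", "Address", r, ("Integer", "Address", 2)))
        return access, bump
    raise ValueError(f"unhandled addressing mode: {tag}")

value, effect = op(("AutoInc", 3), "Int16")
assert value == ("Mem", "Int16", ("Reg", "Address", 3))
assert effect[0] == "Assign"               # autoincrement has a side effect
assert op(("RegDirect", 3), "Int16")[1] == ("Nop",)
```

Returning the side effect as data, rather than performing it, is the point: the instruction-level valuation function decides where in the term (Sequence, Parallel) that effect belongs.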

Instructions: A valuation function for an instruction constructs an Imperative by combining the actions of the operands with the action that represents the semantics of the instruction. The function Instr has type

  Instr : Instruction -> Imperative

For example, the PDP-11 MOV instruction is specified by the following valuation function Instr:

  Instr (TwoOp (MOV, DestMode, SrcMode)) =
    let
      val (Dest, S1) = Op DestMode Int16
      and (Src,  S2) = Op SrcMode Int16
    in
      Sequence [
        Parallel [
          Assign (Dest, Src),
          Assign (Z, Eq (Int16, Src, Integer (Int16, 0))),
          Assign (N, Lt (Int16, Src, Integer (Int16, 0)))],
        Parallel [S1, S2]]
    end

Two subtrees, S1 and S2, are constructed that represent the side effects from both operand accesses. These subtrees are combined with the subtree that represents the effect, or meaning, of the operation.

An instruction can alter the state in three ways:

- explicitly,
- through a side effect of the addressing mode, and
- by setting condition codes.

The remainder of the PDP-11 instructions are similarly defined and are shown in the figures below.

fun Op (RegDirect RegNum) Mode =                  (* Register direct *)
      (Reg (Mode, RegNum), Nop)
  | Op (RegIndirect RegNum) Mode =                (* Register indirect *)
      (Mem (Mode, Reg (Address, RegNum)), Nop)
  | Op (AutoInc RegNum) Mode =                    (* Autoincrement *)
      let
        val r = Reg (Address, RegNum)
      in
        (Mem (Mode, r),
         Assign (r, Add (Address, r, Integer (Address, Sizeof Mode))))
      end
  | Op (AutoDec RegNum) Mode =                    (* Autodecrement *)
      let
        val r = Reg (Address, RegNum)
      in
        (Mem (Mode, r),
         Assign (r, Sub (Address, r, Integer (Address, Sizeof Mode))))
      end
  | Op (Indexed (Offset, RegNum)) Mode =          (* Indexed *)
      (Mem (Mode, Add (Address, Reg (Address, RegNum),
                       Integer (Address, Offset))), Nop)
  | Op (Immediate ImmediateValue) Mode =          (* Immediate *)
      (Integer (Mode, ImmediateValue), Nop)
  | Op (AutoIncIndirect RegNum) Mode =            (* Autoincrement indirect *)
      let
        val r = Reg (Address, RegNum)
      in
        (Mem (Mode, Mem (Address, r)),
         Assign (r, Add (Address, r, Integer (Address, Sizeof Address))))
      end

Figure: The semantics of the PDP-11 addressing modes

(* Add instruction *)
fun Instr (TwoOp (ADD, DestMode, SrcMode)) =
      let
        val (Dest, S1) = Op DestMode Int16
        and (Src,  S2) = Op SrcMode Int16
      in
        let
          val Result = Add (Int16, Dest, Src)
        in
          Sequence [
            Parallel [
              Assign (Dest, Result),
              Assign (Z, Eq (Int16, Result, Integer (Int16, 0))),
              Assign (N, Lt (Int16, Result, Integer (Int16, 0)))],
            Parallel [S1, S2]]
        end
      end

(* Compare instruction *)
  | Instr (TwoOp (CMP, DestMode, SrcMode)) =
      let
        val (Dest, S1) = Op DestMode Int16
        and (Src,  S2) = Op SrcMode Int16
      in
        let
          val Temp = Sub (Int16, Dest, Src)
        in
          Sequence [
            Parallel [
              Assign (Z, Eq (Int16, Temp, Integer (Int16, 0))),
              Assign (N, Lt (Int16, Temp, Integer (Int16, 0)))],
            Parallel [S1, S2]]
        end
      end

(* Move instruction *)
  | Instr (TwoOp (MOV, DestMode, SrcMode)) =
      let
        val (Dest, S1) = Op DestMode Int16
        and (Src,  S2) = Op SrcMode Int16
      in
        Sequence [
          Parallel [
            Assign (Dest, Src),
            Assign (Z, Eq (Int16, Src, Integer (Int16, 0))),
            Assign (N, Lt (Int16, Src, Integer (Int16, 0)))],
          Parallel [S1, S2]]
      end

Figure: Semantics of PDP-11 instructions

(* Branch on not equal to zero *)
  | Instr (Branch (BNE, Offset)) =
      Assign (PC, IfThenElse (Eq (CC, Z, Integer (CC, 0)),
                              Add (Address, PC, Integer (Address, Offset)),
                              PC))

(* Divide: Dest <- Dest / Src.  Dest must be an even-numbered register.
   The quotient is put in Dest and the remainder in Dest+1. *)
  | Instr (TwoOp1 (DIV, RegNum, SrcMode)) =
      let
        val (Src, S1) = Op SrcMode Int16
      in
        let
          val Quotient  = Div (Int16, Reg (Int16, RegNum), Src)
          and Remainder = Mod (Int16, Reg (Int16, RegNum), Src)
        in
          if even RegNum then
            Sequence [
              Parallel [
                Assign (Reg (Int16, RegNum), Quotient),
                Assign (Reg (Int16, RegNum + 1), Remainder),
                Assign (Z, Eq (Int16, Quotient, Integer (Int16, 0))),
                Assign (N, Lt (Int16, Quotient, Integer (Int16, 0)))],
              S1]
          else Nop
        end
      end

(* Multiply *)
  | Instr (TwoOp1 (MUL, RegNum, SrcMode)) =
      let
        val (Src, S1) = Op SrcMode Int16
        val Result = Mul (Int32, Reg (Int16, RegNum), Src)
      in
        let
          val LoBits = Assign (Reg (Int16, RegNum + 1), LoPart (Int32, Result))
          and HiBits = if even RegNum then
                         Assign (Reg (Int16, RegNum), HiPart (Int32, Result))
                       else Nop
        in
          Sequence [
            Parallel [
              LoBits, HiBits,
              Assign (Z, Eq (Int32, Result, Integer (Int32, 0))),
              Assign (N, Lt (Int32, Result, Integer (Int32, 0)))],
            S1]
        end
      end

Figure: Semantics of PDP-11 instructions (continued)

(* Clear *)
  | Instr (OneOp (CLR, SrcDestMode)) =
      let
        val (SrcDest, S1) = Op SrcDestMode Int16
        val Result = Assign (SrcDest, Integer (Int16, 0))
      in
        Parallel [Result, S1]
      end

(* Increment *)
  | Instr (OneOp (INC, SrcDestMode)) =
      let
        val (SrcDest, S1) = Op SrcDestMode Int16
        val Result = Assign (SrcDest, Add (Int16, SrcDest, Integer (Int16, 1)))
      in
        Parallel [Result, S1]
      end

(* Decrement *)
  | Instr (OneOp (DEC, SrcDestMode)) =
      let
        val (SrcDest, S1) = Op SrcDestMode Int16
        val Result = Assign (SrcDest, Add (Int16, SrcDest, Integer (Int16, ~1)))
      in
        Parallel [Result, S1]
      end

Figure: Semantics of PDP-11 instructions (continued)

Action Implementation

This section outlines a denotational semantics of the RTL semantic actions used in the previous section. The implementation of the semantic actions that describe the RTL only needs to be done once, the point being that once the RTL is implemented we can easily describe a variety of architectures.

We do not have room here to present the semantics of the actions, but we will briefly outline the type signatures of the actions and the kinds of values that they operate on. We first define an abstract notion of state.

The State: Our state consists of memory (byte addressable), registers (16-bit), and a status register for the condition codes C, N, V, and Z, all operating on the data types given by Mode:

  Mode  = Addr | Int8 | Int16 | Int32 | Int64 | Float | CC
  State = Memory * Registers * StatusReg

Each of Memory, Registers, and the StatusReg are stores. Abstractly, a store is a function from locations (addresses, register numbers, condition codes) to data. This suggests the following definition for the three stores:

  Memory    = Address -> Int8
  Registers = RegNum  -> Int16
  StatusReg = Flags   -> Bit
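The store-as-function view can be executed literally. In the Python sketch below (ours, not from the thesis), a store is a function from locations to data, and updating it builds a new function, which matches the σ[x ↦ d] notation used earlier:

```python
def empty_store(default=0):
    """A store mapping every location to a default value."""
    return lambda loc: default

def update(store, loc, val):
    """store[loc -> val]: a new store; the old store is untouched."""
    return lambda l: val if l == loc else store(l)

registers = update(update(empty_store(), "R1", 7), "R2", 9)
assert registers("R1") == 7
assert registers("R2") == 9
assert registers("R3") == 0     # unwritten locations keep the default
```

Lookup cost grows with the number of updates, so a real simulator would use an efficient map; the functional encoding is valuable because it makes the algebraic laws of stores (e.g., a later update to x shadows an earlier one) immediate.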

Action Signatures: The Mem action takes two parameters: a mode (Mode) that specifies how many bytes to fetch, and a value-producing action (Value) that yields an address. Mem itself is a value-producing action. Its signature is given by the following equation:

  Mem : Mode * Value -> Value

An example of an imperative action (Imperative) is Assign. Assign takes two value actions, the first of which must be an L-value that will be assigned the data; the second action produces the R-value. The signature of Assign is

  Assign : Value * Value -> Imperative

A value action is consequently a function from a state to a denotation, and an imperative action is a function from a state to a state:

  Imperative = State -> State
  Value      = State -> Denotation
  Denotation = CC of Bit | Address of int
             | Int8 of int | Int16 of int | Int32 of int

To be complete, we must specify the semantics of individual actions. We only note that there is an SML function for each of the actions in the RTL syntax figure. For example, there is an SML function Assign that has the following signature:

  Assign : Value * Value -> Imperative

Simulation: SML is essentially an implementation of the call-by-value typed λ-calculus. The simulator for the architecture is the SML system itself. The simulation should be viewed as computational; that is, we model the computation of the individual instructions as functions described in the λ-calculus.

Code Generation

A retargetable code generator operates by matching portions of the intermediate language (IL) produced by the front end to the language that describes instructions in the machine description. This, of course, assumes that the IL generated by the front end is the same as the language that describes instructions. Fortunately, the RTL presented in this chapter suffices for both. Using RTL for an IL was first proposed by Davidson and is embodied in the popular retargetable C compiler GCC.

Our specification can be used in a code generator because the RTL used to describe instructions can also be used as an intermediate language. For example, an Add instruction with an autoincrement source operand and a register-direct destination determines the RTL syntax tree linearized in the figure below. Recall that Int16 and Address are types that describe the kind of value in the register or memory. Pattern-matching code generators work by matching portions of the intermediate language generated by the front end with patterns that represent instructions. Twig is one such system that could be used here.
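A matcher in this style walks the subject tree and a pattern tree together, where a pattern may contain wildcards standing for arbitrary operand subtrees. The minimal Python sketch below is ours (the wildcard convention and the example tile are illustrative, not Twig's notation):

```python
WILD = "_"   # pattern wildcard: matches any subtree

def match(pattern, tree):
    """True if `tree` is an instance of `pattern` (both are nested tuples)."""
    if pattern == WILD:
        return True
    if isinstance(pattern, tuple) and isinstance(tree, tuple):
        return (len(pattern) == len(tree)
                and all(match(p, t) for p, t in zip(pattern, tree)))
    return pattern == tree

# An IL fragment: R1 <- R1 + Mem[R2]
il = ("Assign", ("Reg", 1), ("Add", ("Reg", 1), ("Mem", ("Reg", 2))))
# A tile for a register-destination add instruction.
add_tile = ("Assign", ("Reg", WILD), ("Add", ("Reg", WILD), WILD))

assert match(add_tile, il)
assert not match(add_tile, ("Assign", ("Reg", 1), ("Sub", ("Reg", 1), 0)))
```

A real instruction selector such as Twig additionally attaches costs to tiles and picks a minimum-cost covering of the IL tree; the matching step itself is the part shown here.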

Sequence [
  Parallel [
    Assign (Reg (Int16, m),
            Add (Int16, Reg (Int16, m),
                        Mem (Int16, Reg (Address, n)))),
    Assign (Z, Eq (Int16, Add (Int16, Reg (Int16, m),
                                      Mem (Int16, Reg (Address, n))),
                   Integer (Int16, 0))),
    Assign (N, Lt (Int16, Add (Int16, Reg (Int16, m),
                                      Mem (Int16, Reg (Address, n))),
                   Integer (Int16, 0)))],
  Parallel [
    Nop,
    Assign (Reg (Address, n),
            Add (Address, Reg (Address, n),
                 Integer (Address, Sizeof Int16)))]]

Figure: RTL syntax tree corresponding to the instruction Add (Rn)+, Rm

Discussion

An instruction set architecture is, in a sense, a programming language and can be treated as such. Using denotational semantics, we have formally specified the semantics of an instruction set architecture, which allows for the design of modular, readable, and usable architectural specifications. We have implemented the semantics using the functional language SML, which makes our specification executable, automatically providing a simulator for the instruction set architecture.

Chapter

Instruction Timing

In this chapter we motivate our thesis that the timing properties of instructions that are needed for program efficiency (to perform instruction scheduling) should be included in a high-level processor specification. We have refrained from using the term architecture, as that level contains only the information required to write correct programs.

Architecture and Organization

The distinction between the levels of abstraction of a processor is ill-defined and somewhat artificial. Consider the architecture and organization levels: it is often difficult to keep the details of organization from creeping into the architecture. As an extreme example, the i860 processor makes no such separation, and one can even argue that the organization is the architecture. Consequently, in order to program the i860 properly, it is necessary that the user understand the structure of the pipelines and how the internal latches are used.

Ideally, an architecture specification describes the final values that the instructions compute, and the organization describes how those values are computed. That is, the architecture is an abstraction, whereas the organization is an implementation. To this end, the architecture level does not contain timing information but the organization level does. However, when considering the efficiency of a program, the organization level does contain information that is of use to the programmer. Yet the organization also contains a considerable amount of other detail that is of no concern to the user. For example, a floating-point multiply may have a latency of six cycles due to the structure of the pipeline. However, to use the multiplication instruction efficiently, we need only be concerned with the latency itself, not the cause. There are many such examples of this complex instruction interaction, which we now describe.

Delayed Instructions

Some instructions have an architecturally defined delay. The most common examples are
delayed loads and branches. In a delayed instruction, the effect of the instruction always
occurs n instructions (cycles) after the instruction begins, where n represents the number of
required delay slots. For example, some memory load instructions have an architecturally
defined delay because it is not reasonable to expect a word from memory to be fetched in a
single cycle without slowing the processor clock speed, as in the following instruction
sequence, where n = 1:

    (1)  Load R1, R2          R1 ← Mem[R2]
    (2)  Add  R3, R1, R1      R3 ← R1 + R1

On some processors the use of register R1 in (2) is illegal and sends the processor into
an undefined state. For delayed load instructions, program correctness depends on the
programmer, who must ensure that the instruction executing immediately after a delayed
load instruction does not reference the register being loaded.
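The delay-slot restriction above amounts to a simple check that an instruction scheduler
must perform. The sketch below (Python, not part of the specification; the tuple encoding
of instructions is hypothetical) tests whether the instruction in a delayed load's delay slot
touches the register being loaded:

```python
# Sketch: does the delay-slot instruction avoid the register being loaded?
# Instructions are encoded as (opcode, destination, [source registers]).

def delay_slot_ok(load, next_instr):
    """load = ("Load", dest, srcs); next_instr = (op, dest, srcs)."""
    op, dest, srcs = next_instr
    loaded_reg = load[1]
    # The delay-slot instruction may neither read nor overwrite the
    # register being loaded.
    return loaded_reg not in srcs and loaded_reg != dest

prog = [("Load", "R1", ["R2"]),
        ("Add",  "R3", ["R1", "R1"])]   # uses R1 one cycle too early
print(delay_slot_ok(prog[0], prog[1]))  # -> False
```

A scheduler would use this predicate to decide whether a Nop (or an independent
instruction) must be placed in the delay slot.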

Another situation arises with a delayed branch. For some processors, in the following
instruction sequence

    (1)  BZ  R1, Locn
    (2)  Add R2, R3, R4

the Add instruction after the branch (BZ) is always executed before a jump to Locn. If
the branch is not taken, a couple of different behaviors can be defined. One is to always
execute the Add instruction, whether the branch succeeds or fails. The other is to skip the
Add instruction when the branch fails and execute the instruction immediately below the
Add. This is sometimes called nullification of the delay slot. Another constraint is that a
branch instruction may not be immediately followed by another branch instruction.

Multicycle Instructions

Some instructions take several cycles to execute. In this situation an instruction may have
to wait until a previous instruction finishes. This wait, or data-dependent delay, occurs in
the following instruction sequence

    (1)  Fmul FR1, FR2, FR3
    (2)  Fadd FR4, FR1, FR5

where the Fadd instruction has to wait until the Fmul instruction finishes, due to the data
dependency on register FR1. The delay suffered in this instruction pair is not the same
as that encountered in the delayed load instruction, as the delay in the load instruction is
architecturally defined. That is, the load delay is constant across implementations, whereas
the delay in the above instruction pair depends upon a particular implementation of
the processor.
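An instruction scheduler handles this kind of delay with a latency table. The following
sketch (the latency values are illustrative, not taken from any particular implementation)
computes the stall a dependent pair would suffer:

```python
# Sketch: stall cycles implied by an implementation-defined latency table.

LATENCY = {"Fmul": 6, "Fadd": 6, "Add": 1, "Load": 2}   # illustrative values

def stall_cycles(producer, consumer, issue_gap=1):
    """Cycles the consumer waits if it reads the producer's result.
    issue_gap is the number of cycles between the two issue times."""
    op, dest, srcs = producer
    c_op, c_dest, c_srcs = consumer
    if dest in c_srcs:                      # true data dependency
        return max(0, LATENCY[op] - issue_gap)
    return 0

fmul = ("Fmul", "FR1", ["FR2", "FR3"])
fadd = ("Fadd", "FR4", ["FR1", "FR5"])
print(stall_cycles(fmul, fadd))  # -> 5 with these illustrative latencies
```

Separating instructions by more than the producer's latency drives the stall to zero,
which is exactly the transformation an instruction scheduler performs.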

Resource Constraints

A processor has a limited set of resources. When one instruction needs a resource that is
currently being used by another instruction, a structural hazard results and an interlock, or
delay, occurs. For example, if a processor has only one floating-point adder, then in the
following sequence

    (1)  Fadd FR1, FR2, FR3
    (2)  Fadd FR4, FR5, FR6

the second Fadd instruction will have to wait for the first Fadd to finish using the adder,
even though the two instructions are data independent.
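Structural hazards can be modeled by reserving a unit for the duration of an instruction's
occupancy. The sketch below (Python; the single-adder assumption and the latency value
are illustrative) shows the second, data-independent Fadd being pushed back:

```python
# Sketch: a structural-hazard check with a single floating-point adder.
# An instruction reserves the adder for its whole (illustrative) latency.

busy_until = {"fp_adder": 0}          # cycle at which the unit frees up
FP_ADD_LATENCY = 6                    # illustrative

def issue_fadd(cycle):
    """Return the cycle at which this Fadd actually starts executing."""
    start = max(cycle, busy_until["fp_adder"])
    busy_until["fp_adder"] = start + FP_ADD_LATENCY
    return start

print(issue_fadd(0))  # first Fadd starts at cycle 0
print(issue_fadd(1))  # second, data-independent Fadd still waits until 6
```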

Multiple Instruction Issue

On some processors, when instructions have no data dependencies or resource conflicts, the
processor will execute the instructions in parallel. For example, integer and floating-point
instructions typically use mutually exclusive portions of the hardware, and the following
instructions

    (1)  Fmul FR1, FR2, FR3
    (2)  Add  R1, R2, R3

can execute in parallel.
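The issue condition just described can be sketched as a predicate over adjacent
instructions. This is illustrative only: it assumes, hypothetically, one integer and one
floating-point unit and ignores the other constraints discussed above.

```python
# Sketch: may two adjacent instructions issue together?

UNIT = {"Add": "int", "Cmp": "int", "Fmul": "fp", "Fadd": "fp"}

def can_dual_issue(i1, i2):
    (op1, d1, s1), (op2, d2, s2) = i1, i2
    independent = d1 not in s2 and d2 not in s1 and d1 != d2
    different_units = UNIT[op1] != UNIT[op2]
    return independent and different_units

fmul = ("Fmul", "FR1", ["FR2", "FR3"])
add  = ("Add",  "R1",  ["R2", "R3"])
print(can_dual_issue(fmul, add))  # -> True
```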

Motivation

All of the timing constraints described above arise because instructions overlap in execution.
This overlap is called instruction-level parallelism, or ILP. These constraints constitute the
programmer's view of instruction timing. The architecture level generally does not contain
timing information, but the organization level does. However, the organization level also
contains many more details of implementation that are irrelevant to the user. Consequently,
we are interested in a new level of abstraction that describes the programmer's view of
instruction timing. This new level abstracts away from implementation details; we call it
simply the instruction timing view.

The main motivation for describing a processor at the timing level is that, in general,
we should not expect the user of the processor to infer the existence of timing constraints
by examining the organization. One of our goals, then, is to develop a mathematical model
of instruction timing that hides irrelevant details of implementation.

Functions Don't Work

Given the above discussion of the various ways instructions can interact, the behavior of an
instruction can no longer be specified in isolation from other instructions. Consequently, the
typical technique of using register-transfer statements can no longer be applied. However,
as we have mentioned, on some processors it is vital that the compiler know the timing
constraints and the concurrent properties of the processor in order to use the processor
efficiently and, in some cases, correctly (e.g., architecturally defined delayed loads and
branches, and instruction latencies).

It is well known that functional methods do not extend well to include timing constraints
or concurrency. In fact, it is widely known in the programming language community that
functional methods and concurrency are at odds, as has been pointed out by Milner,
Hoare, and Pnueli. This is because a concurrent system does not necessarily have
a functional behavior, as concurrency can introduce nondeterminism into a system: the
definition of a function implies that an input has only one corresponding output.

Consider again the delayed-load instruction of the MIPS R3000. On the MIPS, for
the following instruction sequence

    (1)  Load R1, R2          R1 ← Mem[R2]
    (2)  Add  R3, R1, R1      R3 ← R1 + R1

the contents of register R1 is undefined at instruction (2). That is, the value of R1 is
nondeterministic. The above load instruction, then, cannot be described by a function, as
R1 can receive one of two values:

1. the correct value stored at the memory location given by R2, or

2. some undefined value.

Consequently, since R1 is undefined, R3 will also be undefined, possibly causing a ripple
effect of undefined registers. This implies that if the contents of any register becomes
undefined, then the entire processor state must be considered to be undefined.

Even if a concurrent system does have an overall functional behavior, it may be more
appropriate to model the system using a concurrency-based formal method rather than a
functional one. This is because the concurrent system is composed of interacting components,
which can yield irregular yet deterministic behavior that can be difficult to specify
functionally. For example, if the load instruction above did not include an architecturally
defined delay but was interlocked instead (i.e., if the processor stalled and waited for R1 to
be loaded, as is done on the newer MIPS R4000), then we would have to model the stall as
well. But modeling the stall is cumbersome using functional methods, as one would have to
somehow include a stall value in the output. This is, as we will see, straightforward using a
calculus of concurrency. So we forgo a functional language in favor of a particular calculus
of concurrency: SCCS.

Chapter

SCCS: A Synchronous Calculus of
Communicating Systems

SCCS, or the Synchronous Calculus of Communicating Systems, is a mathematical theory
of communicating systems in which we represent systems by the terms, or expressions,
of a simple language given by the theory. SCCS allows us to directly represent the temporal
and concurrent properties of the system being specified.

A Small Example

In this section we introduce SCCS through a small example involving a pipeline. Consider
a two-stage pipeline where each stage adds one to its input; such a pipeline is depicted in
the figure below.

Each stage is modeled using an agent (process) in SCCS, which may have one or more

    in --> S --> S --> out

    Figure: A two-stage Add2 pipeline constructed from two Add agents.

communication ports. In SCCS each agent must perform an action, that is, use one or more
of its ports on each clock cycle, or execute the idle action, written 1. Communication
between two agents occurs when, at time t, one agent wants to use a port and the other
wants to use the complementary port.

One stage in our example pipeline is represented in SCCS by

    S(x) ≝ in(y)·'out(x) : S(y + 1)

The equation specifies that on clock cycle t, S is an agent with current output x and input
y, and that at time t + 1, S becomes an agent with current output y + 1. In this equation:

- ":" is the prefix operator. The expression a : P represents the process that at time
  t can do the action a and become the process P at time t + 1.

- in(y)·'out(x) is a product of actions, specifying that the two particulate actions in and
  'out occur simultaneously. This action can also be thought of as reading y on port
  in and sending x on port out.

- S is defined recursively, allowing for the modeling of nonterminating agents.

- S is parameterized by the arithmetic expression y + 1. An important characteristic
  of SCCS is that the parameter of an output action may be any expression using
  whatever functions over values we need.

The semantics of SCCS is given formally later in this chapter.
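As a sanity check on the intended behavior, the clocked stages above can be mimicked in
ordinary code. The sketch below (Python, not part of the SCCS formalism) drives two stage
objects, each behaving like S(x): every cycle it emits its current output and becomes a stage
holding its input plus one.

```python
# Sketch: a clocked simulation of the two-stage add pipeline. Each stage
# mimics the agent S(x): on every cycle it reads an input y, emits its
# current output x, and becomes S(y + 1).

class S:
    def __init__(self, x):
        self.x = x                 # current output
    def tick(self, y):
        out, self.x = self.x, y + 1
        return out

def add2(inputs, s1_init=0, s2_init=0):
    s1, s2 = S(s1_init), S(s2_init)
    outs = []
    for y in inputs:
        outs.append(s2.tick(s1.tick(y)))
    return outs

# After a two-cycle fill, each output is its input plus 2.
print(add2([10, 20, 30, 40]))  # -> [0, 1, 12, 22]
```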

SCCS Syntax

Systems specified by SCCS consist of two entities: actions and agents (or processes). A
process is an SCCS expression. The figure below gives the syntax of SCCS expressions.
Actions communicate values and can be either positive (e.g., in) or negative (e.g., 'out).
Positive actions input values and negative actions output values. Two actions a and 'a
associated with two agents running in parallel are connected by the fact that they are
complements of the same name.

Connecting Processes

The combinator × models parallelism. The agent A × B represents agents A and B executing
in parallel. If two agents joined by product contain complementary action names, then these

def

P x x x E Parameterized agent denition

n

Done Idle agent

Inactive agent

E F Parallel comp osition E and F execute in parallel

E F Choice of E or F

E F E or F with preference for E

a x a x a x E Synchronous action prex with values

n n

if b then E else E Conditional

P

E Summation over indexing set I

i

iI

Q

E Comp osition over indexing set I

i

iI

E L Action Restriction

E nnL Particle Restriction

E f Apply relab eling function f

Figure Syntax of SCCS expressions

agents are joined by what may be thought of as wires at those ports. Hence these agents
may now communicate.

Given the definition of S above, we can now construct a two-stage add pipeline from two
add agents. There is a problem, though: the agent S × S does not contain complementary
action names; that is, the pipeline stages are not connected, yet the output of the first S
stage must be fed into the input of the second S stage. To model this kind of connection,
SCCS provides a mechanism for relabeling actions. Relabeling out to α in the first S and
in to α in the second occurrence of S provides the desired effect:

    Add2(x, y) ≝ (S(x)[α/out] × S(y)[α/in]) ↾ {in, out}

In this equation:

- [α/out] is a relabeling function that changes the port name out to α; [α/in] changes in
  to α.

- S[f] means apply relabeling function f to agent S.

- (...) ↾ {in, out} is the restriction combinator applied to the composed agents. Restriction
  serves the purpose of internalizing ports, hiding some actions from the environment and
  exposing others. Here in and out are made known to the environment and α is
  internalized.

- Combining restriction and relabeling is a way of achieving scoping in SCCS.

The net effect of the equation above is to construct a pipeline of two stages, where each
stage adds 1 to its input.

In SCCS the agent A + B represents a choice of performing agent A or agent B. Which
choice is taken depends upon the actions available within the environment. The agent
A1 + A2 + ... + An is abbreviated to Σ_{i=1}^{n} A_i.

An Algebra of Actions

Agents interact with their environment through ports that are identified with labels.
"Port" and "label" are synonymous, and every port and label is also an action; but, as we
will see, the converse is not true. Actions have the following properties:

- There is an infinite set A of names and a set of co-names 'A = {'a | a ∈ A}. The set of
  all labels is L = A ∪ 'A.

- There is a binary operator · used to form a new action that is a product of actions:
  α·β is the product of the actions α, β ∈ L.

- · is commutative and associative.

- Often, when clarity allows, a product of actions is written using juxtaposition,
  e.g., αβ rather than α·β.

- A product of actions α1·α2·...·αn denotes the simultaneous occurrence of
  α1 through αn.

- The idle action 1 is a left and right identity on ·, such that 1·α = α·1 = α.

- The unary operator ' is an inverse operation: α·'α = 1.

With this information at hand we conclude that there is an algebra of actions that forms an
abelian (commutative) group Act generated by A, the set of particles:

    (Act, ·) is an abelian group.

Therefore every α ∈ Act can be expressed as a unique product of particulate actions

    α = a1^{z1} · a2^{z2} · ... · an^{zn},   with ai ∈ A and zi ∈ ℤ,

up to order.
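The free-abelian-group structure of Act is easy to realize concretely: an action is just a map
from particle names to integer exponents (co-names carry negative exponents). The sketch
below (Python; the representation is ours, not Milner's) checks the identity and inverse laws
on a few examples.

```python
# Sketch: the action algebra as the free abelian group over particle names,
# represented as exponent maps (name -> nonzero integer).

from collections import Counter

def act(**exps):                  # e.g. act(getr=2, putr=1)
    return Counter(exps)

def prod(a, b):                   # the commutative product of actions
    c = Counter(a)
    c.update(b)                   # adds exponents (negatives allowed)
    return Counter({k: v for k, v in c.items() if v != 0})

def inv(a):                       # the complement (inverse) of an action
    return Counter({k: -v for k, v in a.items()})

one = Counter()                   # the idle action 1 is the empty product

a = act(getr=2, putr=1)           # the action getr^2 . putr
assert prod(a, one) == a          # 1 is an identity
assert prod(a, inv(a)) == one     # a . a^-1 = 1
assert prod(a, act(getr=1)) == act(getr=3, putr=1)
print("action algebra laws hold on these examples")
```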

Extensions to SCCS

In this section we introduce two extensions to SCCS that will aid us in writing processor
specifications.

Frequently we wish to execute two agents A and B in parallel, where B begins executing
one clock cycle after A (e.g., issuing instructions on consecutive cycles). This can be modeled
by A × (1 : B). We define the binary combinator Next to denote this agent:

    A Next B ≝ A × (1 : B)

Next is right-associative. That is,

    A Next B Next C = A Next (B Next C)

Intuitively, the agent A Next B Next C should begin executing A at time t, B at time t + 1,
and C at time t + 2. We can see that this is the case by expanding it using the definition
above:

    A Next B Next C = A Next (B Next C)          right-associativity
                    = A × (1 : (B Next C))       definition of Next
                    = A × (1 : (B × (1 : C)))    definition of Next

so B and C are delayed by one and two cycles, respectively (recall that 1 is the identity
action).

Another useful operator is the priority sum operator +>. Intuitively, if in the agent
A + B both A and B can execute, then it is nondeterministic as to which is executed. Often
we would like to prioritize, so that if both A and B can execute then A is preferred. Thus
A +> B denotes the priority sum of A and B, where A has priority over B.

Transition Graphs

The operational semantics of SCCS is defined in terms of a labeled transition system.

Definition: A labeled transition system (LTS) is a triple ⟨P, Act, →⟩, where P is
a set of states, Act is a set of actions, and → is the transition relation, a subset of
P × Act × P.

When p, q ∈ P, α ∈ Act, and (p, α, q) ∈ →, we write p -α-> q to mean that state p
can do an α and evolve into q. States p and q represent the state of the system at times
t and t + 1, respectively. The inference rules given later in this chapter map an SCCS
term to a labeled transition system whose states correspond to SCCS expressions. In the
transition p -α-> q, q is called an α-derivative of p, or just a derivative of p.

A state t is reachable from state s if there exist states s1, ..., sn and actions α1, ..., αn
such that

    s -α1-> s1 -α2-> s2 ... -αn-> sn = t.

In this case we call α1 α2 ... αn a trace of s.

The transition graph of an agent p is a graphical representation of the LTS induced by
p. For example, the SCCS agent defined by

    E ≝ a : b : 0 + c : (d : 0 + e : 0)

has the transition graph shown in the figure below.

    E -a-> b:0 -b-> 0
    E -c-> (d:0 + e:0) -d-> 0
                       -e-> 0

    Figure: Transition graph of the agent E defined above.

The transition graph of an agent represents its dynamic nature, as it represents an agent
executing through time. The LTS itself is a structure that represents all possible traces
of the system. For example, the LTS in the figure above identifies the set {ab, cd, ce} as
the possible traces of E of length two. Analysis of SCCS processes (e.g., simulation,
equivalence testing, model checking) is carried out on the LTS rather than on SCCS terms
directly.
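An LTS is just a finite relation, so trace enumeration is a few lines of code. The sketch
below (Python; the state names are ad hoc strings) encodes the transition relation of the
agent E above and enumerates its traces of a given length:

```python
# Sketch: the LTS of E = a:b:0 + c:(d:0 + e:0) as an explicit transition
# relation, with a function that enumerates traces of length n.

TRANS = {
    "E":       [("a", "b:0"), ("c", "d:0+e:0")],
    "b:0":     [("b", "0")],
    "d:0+e:0": [("d", "0"), ("e", "0")],
    "0":       [],
}

def traces(state, n):
    if n == 0:
        return {""}
    out = set()
    for action, nxt in TRANS[state]:
        out |= {action + t for t in traces(nxt, n - 1)}
    return out

print(sorted(traces("E", 2)))  # -> ['ab', 'cd', 'ce']
```

This is, in miniature, the structure on which simulation, equivalence testing, and model
checking operate.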

The Operational Semantics of SCCS

In general, an operational semantics of a language maps terms of the language to some
abstract machine. A structural operational semantics, developed by Plotkin, uses syntax-
directed rules to map terms of the language to a transition system (a labeled transition
system in our case). The structural operational semantics, or SOS, for SCCS is given
in the figure below.

Each rule in the figure is an inference rule structured on the syntax of an SCCS term.
The following conveys the intuition behind the rules.

Action: In the process a : P, the rule Action says that a : P -a-> P.

Product: If P can do an action α and Q can do an action β, then P × Q can do the action
αβ. For example, consider the process a : 0 × b : b : 0. The rule Product says that,
since a : 0 -a-> 0 and

    Action:       a : P -a-> P

    Product:      if P -α-> P′ and Q -β-> Q′, then P × Q -αβ-> P′ × Q′

    Sum1:         if P -α-> P′, then P + Q -α-> P′
    Sum2:         if Q -α-> Q′, then P + Q -α-> Q′

    Restriction:  if P -α-> P′ and α ∈ A, then P↾A -α-> P′↾A

    Relabeling:   if P -α-> P′, then P[f] -f(α)-> P′[f]

    Definition:   if P -α-> P′ and A ≝ P, then A -α-> P′

    Figure: Operational semantics of SCCS.

b : b : 0 -b-> b : 0, then

    a : 0 × b : b : 0 -ab-> 0 × b : 0.

Sum: The rule Sum1 says that if P can do an α and become P′, then P + Q can
do an α and become P′. The rule Sum2 handles the other case.

Relabeling: If P can do an α, then the process P[f] can do an f(α).

Restriction: If P can do an α, then P↾A can do an α provided α ∈ A. The predicate
α ∈ A is called a side condition.

Definition: The Definition rule allows us to bind a process to an identifier. It says that
if P can do an α, then so can any process identifier A where A ≝ P.

The inference rules define all possible transitions for any SCCS term.

SCCS is an algebra that satisfies many equational laws; see the figure below. The equational
laws are based on the definition of bisimulation equivalence in SCCS, which is defined by
Milner. We will not pursue this here.

    If P, Q, R ∈ P and a, b, c ∈ Act, then

        P + 0 = P
        P × Done = P
        P + P = P
        P × 0 = 0
        a:P × b:Q = ab:(P × Q)
        P × (Q + R) = (P × Q) + (P × R)
        P + Q = Q + P
        P × Q = Q × P
        P + (Q + R) = (P + Q) + R
        P × (Q × R) = (P × Q) × R
        a:Done × (b:P + c:Q) = ab:P + ac:Q
        (a:P)↾A = a:(P↾A)    if a ∈ A
        (a:P)↾A = 0          if a ∉ A
        (P↾A)↾B = P↾(A ∩ B)

    Figure: Equational laws of SCCS.

Examples

Here we present some examples using SCCS to specify various simple circuits. More
examples, and their implementation in the Concurrency Workbench, are given in Appendix
B. The purpose of this section is to familiarize the reader with SCCS and how it is used.
Since we will eventually be specifying the timing properties of a RISC-style microprocessor,
we use digital hardware in our examples.

A Zero-Delay and a Unit-Delay Wire

The simplest circuit is a wire. One end is the input and the other is the output, which we
label with the actions in and out.

In all of the examples, variables range over the booleans 0 and 1. The equation below
represents a zero-delay wire: at all times t, the current input at in is equal to the output
at out.

    Wire ≝ in(i)·'out(i) : Wire

The process Wire1 below represents a wire that has a unit delay, which means it has a
current state j. The input i on in at time t becomes, at time t + 1, the output j on out.
(Compare the HOL equation for this same circuit given earlier.)

    Wire1(j) ≝ in(i)·'out(j) : Wire1(i)

We can also specify a wire that has a delay of two clock cycles:

    Wire2(i, j) ≝ in(a)·'out(j) : Wire2(a, i)

We could also have specified the two-delay wire by hooking together two unit-delay
wires:

    Wire2Impl(i, j) ≝ (Wire1(i)[α/out] × Wire1(j)[α/in]) ↾ {in, out}

One important aspect of SCCS is that we can now prove that the agents specified by Wire2
and Wire2Impl are equivalent. In fact, the Concurrency Workbench can establish this for
us automatically.
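The equivalence can also be checked informally by simulation. The sketch below (Python,
not a substitute for the bisimulation proof) runs an input stream through two composed
unit-delay wires and through the two-delay wire, and compares the output streams:

```python
# Sketch: two unit-delay wires in series behave like the two-delay wire
# (the agents Wire2Impl and Wire2 above), checked on a sample input stream.

def wire1(init, inputs):           # unit delay: output trails input by one
    out, state = [], init
    for i in inputs:
        out.append(state)
        state = i
    return out

def wire2(i0, j0, inputs):         # two-cycle delay with initial state (i0, j0)
    out, s = [], [i0, j0]
    for i in inputs:
        out.append(s[1])           # emit j, become Wire2(input, old i)
        s = [i, s[0]]
    return out

ins = [5, 6, 7, 8]
assert wire1(0, wire1(0, ins)) == wire2(0, 0, ins) == [0, 0, 5, 6]
print("Wire2 and Wire2Impl agree on this input stream")
```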

Logic Gates

A unit-delay inverter is specified by:

    Not0 ≝ in(0)·'out(0) : Not1 + in(1)·'out(0) : Not0
    Not1 ≝ in(0)·'out(1) : Not1 + in(1)·'out(1) : Not0

Using a little notational freedom for specifying the functionality of an agent, we can
abbreviate these two equations as:

    Not(j) ≝ in(i)·'out(j) : Not(¬i)

Connecting two inverters together, we get:

    TwoNots(i, j) ≝ (Not(i)[α/out] × Not(j)[α/in]) ↾ {in, out}

Two inverters connected together cancel each other out, except for the delay. We should
expect agent TwoNots to be equivalent to agent Wire2 (and consequently Wire2Impl).
Indeed, this is the case.

Other logic gates are similarly defined; of course, all we really need is a nand or a nor,
as we can define all of the others in terms of that one operation and prove that they meet
their specifications:

    And(j) ≝ in1(a)·in2(b)·'out(j) : And(a ∧ b)
    Or(j)  ≝ in1(a)·in2(b)·'out(j) : Or(a ∨ b)
    Xor(j) ≝ in1(a)·in2(b)·'out(j) : Xor(a ⊕ b)
    Nor(j) ≝ in1(a)·in2(b)·'out(j) : Nor(a nor b)

These processes are all unit-delay, with j being the default initial state.

Flip-Flop

A flip-flop is easily constructed by connecting together two Nor gates, as in the figures
below. Each Nor gate has an output that feeds back into the other Nor gate. In order to
make the desired connections, we make two copies of the Nor gate and suitably relabel
some ports. Finally, the restriction operator internalizes the two feedback connections:

    FF(m, n) ≝ (Nor(m)[s/in1, d/in2, g/out] × Nor(n)[r/in1, g/in2, d/out]) ↾ {s, r, g, d}

    Figure: A nor gate, with ports in1, in2, and out.

    Figure: A flip-flop constructed from two nor gates: external inputs s and r,
    outputs g and d, each output fed back into the other gate.

Chapter

Specifying a Processor

In this chapter we present a specification of the programmer's timing view for a hypothetical
RISC-style processor based on DLX and the MIPS from Patterson and Hennessy.

Our example microprocessor

In this RISC, instructions, the memory word size, registers, and addresses are all thirty-two
bits. The figure below gives the instruction syntax and computational semantics of our
processor. What follows is an informal description of some of our RISC instructions, which
we subsequently describe with SCCS.

Add Ri, Rj, Rk adds registers j and k and puts the result in register i. The instruction
executing immediately after an Add may use register i.

Load Ri, Rj, Const is a delayed load instruction. Register i is loaded from memory at the
base address in register j with offset Const. The instruction executing immediately after
the Load cannot use register i.

BZ Ri, Locn is a delayed branch instruction that branches to Locn if register i is zero.
The instruction immediately after the branch is always executed before the branch is
taken. If the branch is not taken, then the instruction after the branch is not executed.
Another BZ instruction may not appear in the branch delay slot.

Fadd FRi, FRj, FRk is an interlocked floating-point add with a latency of six cycles.
If another instruction tries to use the result before the current Fadd is finished,

    Instruction Syntax          Computational Semantics
    ------------------          -----------------------
    Add Ri, Rj, Rk              Ri ← Rj + Rk
    AddI Ri, Rj, Const          Ri ← Rj + Const
    Mov Ri, Rj                  Ri ← Rj
    Movf FRi, FRj               FRi ← FRj
    MovFPtoI Ri, FRj            Ri ← FRj
    MovItoFP FRi, Rj            FRi ← Rj
    Nop                         No operation
    Cmp Ri, Rj, Rk              Ri ← Rj cmp Rk
    CmpI Ri, Rj, Const          Ri ← Rj cmp Const
    Fadd FRi, FRj, FRk          FRi ← FRj + FRk
    Fmul FRi, FRj, FRk          FRi ← FRj × FRk
    Fdiv FRi, FRj, FRk          FRi ← FRj / FRk
    BZ Ri, Locn                 if Ri = 0 then PC ← Locn
    Load Ri, Rj, Offset         Ri ← Mem[Rj + Offset]
    Loadf FRi, Rj, Offset       FRi ← Mem[Rj + Offset]
    LoadI Ri, Const             Ri ← Const
    Store Ri, Rj, Offset        Mem[Rj + Offset] ← Ri
    Storef FRi, Rj, Offset      Mem[Rj + Offset] ← FRi

    Figure: RISC instruction set.

then instruction execution stalls until the result is ready.

The table in the figure above gives the entire instruction set, along with the RTL statement
describing the functional semantics of each instruction. This functional semantics can be
easily represented using the techniques of the earlier chapters.

Timing Constraints

There are three types of constraints that alter the programmer's view of the timing of a
processor:

Delayed instructions. The effect of an instruction can be delayed (e.g., as in our RISC's
load and branch instructions), making certain instruction sequences illegal.

Multicycle instructions. An instruction that takes more than one cycle to calculate its
result may cause the processor to interlock when a subsequent instruction needs the result
before the previous instruction has finished. These are known as data hazards and will be
discussed in more detail later.

Limited resources. Often two or more instructions may compete for the same resource
(e.g., the floating-point adder), and one will have to wait. This situation is known as a
structural hazard.

We will discuss all three types of timing constraints as we encounter them.

The SCCS Specification

A processor is a system in which registers and memory interact with one or more functional
units. The equation below represents such a system at the highest level in SCCS:

    Processor ≝ (Instruction_Unit × Memory × Registers) ↾ I

    where I is the set of all instructions.

The equation says that the only visible actions are instructions. That is, the labels on the
labeled transition system generated by Processor are entities like Add Ri, Rj, Rk and
BZ Ri, Locn.

Instruction Formats as Actions

We will represent the syntax (format) of each instruction as an SCCS action. The
instructions described in the figure above determine a set of actions, one for each
instruction. For example, the instruction Add R1, R2, R3 is represented in SCCS by the
action AddR1R2R3. More generally, the Add instruction determines a set of actions I_Add,
given by

    I_Add ≝ { AddRiRjRk | 0 ≤ i, j, k ≤ 31 }

The set I_op can be determined for each opcode op ∈ Opcodes, yielding the set I of all
instructions:

    I ≝ ⋃_{op ∈ Opcodes} I_op

For readability we continue to comma-separate an instruction's operands, always keeping
in mind that an instruction is represented by an action. We could also use the bit
representation of the instructions, but this would only make the specification less readable.
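The sets I_op and I are finite and can be enumerated mechanically. The sketch below
(Python; the chosen subset of three-register opcodes is illustrative) builds I_Add and a
union over several opcodes:

```python
# Sketch: enumerating the action set I_Add (one action per concrete Add
# instruction) and a union I over several opcodes, as in the text.

from itertools import product

N_REGS = 32
THREE_REG_OPS = ["Add", "Cmp", "Fadd", "Fmul", "Fdiv"]  # illustrative subset

def actions(op):
    return {f"{op} R{i}, R{j}, R{k}"
            for i, j, k in product(range(N_REGS), repeat=3)}

I = set().union(*(actions(op) for op in THREE_REG_OPS))
print(len(actions("Add")))    # -> 32768 actions for Add alone
print("Add R1, R2, R3" in I)  # -> True
```

The size of these sets (32^3 actions per three-register opcode) is one reason analyses are
carried out symbolically rather than by explicit enumeration.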

Defining the Registers

Before we proceed to specify instructions and their interaction, it is necessary to develop
an appropriate model of registers and memory. In this section we develop an abstract
model of storage in which storage cells are modeled as agents. The equation below defines
one cell, Cell(y), holding a value y, such that an action putc(x) executed at time t stores
x in Cell, where it is available for use at time t + 1. The action 'getc(y) retrieves the
value stored in Cell and assigns it to y. If no agent wants to interact with Cell using putc
or getc actions, then Cell executes the idle action:

    Cell(y) ≝ putc(x) : Cell(x) + 'getc(y) : Cell(y) + 1 : Cell(y)

This model of a storage cell is simple but inadequate, because Cell can perform only
one getc or putc at a time. Consider the instruction Add R1, R1, R1, which accesses
R1 twice and also writes R1. On most processors this instruction can effectively execute
in a single cycle, because registers are read and written in different pipeline stages and
there is appropriate forwarding hardware, register renaming, etc. But we do not need
to model pipeline stages and all of the other organization that goes with them. What
we need to do is augment the agent Cell so that it can handle parallel reads and writes.
For example, the action getc(a)·getc(b) means read Cell twice, putting the results into a
and b. The action getc(a)·putc(b) means read and write Cell in parallel. The action
getc(a)·getc(b)·getc(c)·putc(d) means read Cell three times, with the cell's value placed
in a, b, and c, and also write d to Cell. Only one putc is allowed in each action. The
equation below defines an agent Reg that is a new version of Cell that can accommodate
parallel reads and writes:

    Reg(y) ≝ Σ_{j=0}^{2} ( 'getr(y)^j : Reg(y) + 'getr(y)^j·putr(x) : Reg(x) )

Notice that we can change the upper bound on the summation to allow an arbitrary number
of readers of the register (i.e., of getr's).

If we expand the summation in the definition of Reg, we obtain:

    Reg(y) ≝ 'getr(y)^0 : Reg(y)
           + 'getr(y)^1 : Reg(y)
           + 'getr(y)^2 : Reg(y)
           + 'getr(y)^0·putr(x) : Reg(x)
           + 'getr(y)^1·putr(x) : Reg(x)
           + 'getr(y)^2·putr(x) : Reg(x)

Applying the equational law that states that, for any particulate action a, a^0 = 1
and a^1 = a, this expansion reduces to:

    Reg(y) ≝ 'getr(y) : Reg(y)
           + 'getr(y)^2 : Reg(y)
           + putr(x) : Reg(x)
           + 'getr(y)·putr(x) : Reg(x)
           + 'getr(y)^2·putr(x) : Reg(x)
           + 1 : Reg(y)

In later sections we will not continue to expand summations like this; it was done
here to indicate how summations are used and manipulated. In fact, the algebra allows
the use of infinite sums, that is, sums over a countably infinite indexing set, which would
prohibit us from expanding them completely anyway.
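The intent of Reg (several reads and at most one write per cycle, atomically) can be
mirrored in ordinary code. The sketch below (Python; the two-reader bound mirrors the
summation bound above) is illustrative only:

```python
# Sketch: a register that, on one clock tick, accepts up to two reads and
# at most one write atomically, mirroring the agent Reg above.

class Reg:
    MAX_READS = 2
    def __init__(self, y):
        self.y = y
    def cycle(self, reads=0, write=None):
        """Perform `reads` getr's and an optional putr in one clock tick."""
        if reads > self.MAX_READS:
            raise RuntimeError("too many simultaneous readers")
        got = [self.y] * reads         # all reads see the old value
        if write is not None:
            self.y = write             # the write takes effect next cycle
        return got

r = Reg(7)
print(r.cycle(reads=2, write=9))  # -> [7, 7]; register now holds 9
print(r.cycle(reads=1))           # -> [9]
```

Note that the reads see the old value even in a cycle that also writes, matching the
semantics of the product action 'getr(y)·putr(x).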

Register Locking

The actions getr and putr are atomic. It may be that a register is going to be updated
some time in the future (e.g., by a delayed load), and any attempt to read or write the
register by another agent (instruction) should result in an error. We augment the definition
of Reg by allowing an agent to reserve a register for future writing using the action
lockreg, and then, at some point in the future, to write the register with putr and release
it with the action releasereg. The first equation below modifies Reg so that when an agent
locks a register, the register goes into a state Locked_Reg where the only allowable action
is putr(x)·releasereg. All other combinations of getr and putr in the locked state lead to
the inactive agent 0. The need to trap all of the other illegal action sequences complicates
matters, so we have factored them out into the agent Illegal. Notice again that the upper
bounds of the summations below can be arbitrarily increased to include any number of
readers:

    Reg(y) ≝ Σ_{j=0}^{2} ( 'getr(y)^j·lockreg : Locked_Reg(y)
                         + 'getr(y)^j : Reg(y)
                         + 'getr(y)^j·putr(x) : Reg(x) )

    Locked_Reg(y) ≝ putr(x)·releasereg : Reg(x) + 1 : Locked_Reg(y) + Illegal(y)

    Illegal(y) ≝ Σ_{j=1}^{2} ( 'getr(y)^j : 0
                             + 'getr(y)^j·putr(x) : 0
                             + 'getr(y)^j·putr(x)·releasereg : 0 )
                + putr(x) : 0

The figure below shows the state-transition graph of Reg as just defined, which is the
labeled transition system generated by the operational semantics of SCCS. The figure shows
that, from the state Reg, any number of getr's (including zero) and zero or one putr's will
leave us in the state Reg. However, any number of getr's together with a lockreg action
will put us in the Locked_Reg state. The figure also shows that any action other than a
putr·releasereg will put us in the deadlocked state.

Given the definition of one register, a family of registers Reg0, Reg1, etc., is now defined
by subscripting each of the actions with a register number. For example, the action
putr_i(x) represents writing x to register i. Thirty-two registers are constructed by

    Registers ≝ Reg0(y0) × Reg1(y1) × ... × Reg31(y31)

    Figure: State-transition graph of the agent Reg with locking. From state Reg,
    the actions 'getr^j and 'getr^j·putr (j ≥ 0) loop back to Reg; 'getr^j·lockreg
    leads to state Locked_Reg; putr·releasereg returns from Locked_Reg to Reg; any
    other access from Locked_Reg leads to the dead state 0.

which we abbreviate to

    Registers ≝ Π_{i=0}^{31} Regi(yi)

Note that we have not changed SCCS at all; this is just shorthand notation.
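The locking discipline above is a small state machine, and it can be sketched directly in
code. The following (Python, illustrative only) mirrors the free/locked/dead states, with
any access to a locked register other than the releasing write trapping to the dead state,
as the agent 0 does in the specification:

```python
# Sketch: the locking register as an explicit state machine. An illegal
# access to a locked register drives the model into a dead state,
# mirroring the inactive agent 0.

class LockableReg:
    def __init__(self, y):
        self.y, self.state = y, "free"
    def getr(self):
        if self.state != "free":
            self.state = "dead"        # illegal access traps to 0
            return None
        return self.y
    def lockreg(self):
        self.state = "locked"
    def putr_release(self, x):
        if self.state != "locked":
            self.state = "dead"
            return
        self.y, self.state = x, "free"

r = LockableReg(0)
r.lockreg()                  # a delayed load reserves the register...
print(r.getr())              # -> None: reading it while locked is illegal
print(r.state)               # -> dead
```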

Defining Memory

The definition of an agent Memory is exactly analogous to that of Registers, except that
memory cells do not have locks associated with them. For brevity we note only that the
actions getm_i and putm_i read and write memory cell i:

    MEM_i(y) ≝ putm_i(x) : MEM_i(x)
             + 'getm_i(y) : MEM_i(y)
             + 'getm_i(y)·putm_i(x) : MEM_i(x)
             + 1 : MEM_i(y)

    Memory ≝ Π_i MEM_i(y_i)

Instruction Pipeline

Instruction pipelines are usually described in terms of their stages of execution. For
example, the agent IPL (for "instruction pipeline")

    IPL ≝ IF × ID × EX × MEM × WB

defines a five-stage instruction pipeline, where IF, ID, EX, MEM, and WB represent the
instruction fetch, decode, execute, memory access, and write-back stages.

This is a reasonable and obvious representation, but since we are interested only in
external timing behavior, it is overspecified. We should resist attempting to specify an
architecture's timing behavior in terms of individual stages, as this commits us to describing
the detailed operation of each individual stage. Since our interest is simply timing behavior,
a more abstract specification will suffice.

We should also point out that it is very useful to be able to specify a processor at this
lower, organizational level, as this would count as an implementation of the processor. In
fact, one of SCCS's major benefits is its ability to specify systems at various levels and to
compare and analyze them. This robustness is one of the reasons we chose SCCS.

Instruction Issue

Given our previous definitions of Registers and Memory, and using a program counter PC,
we now describe an agent Instr(PC) that specifies the behavior of our processor's
instructions. Instr(PC) partitions instructions into two classes, Branch and Non_Branch.
Non_Branch instructions are further divided into three classes: arithmetic (Alu), load and
store (Load_Store), and floating-point (Float).

    Instr(PC) ≝ Non_Branch(PC) Next Instr(PC + 1)
              + Branch(PC)
              +> Stall(PC)

    Non_Branch(PC) ≝ Alu(PC) + Load_Store(PC) + Float(PC)

    Stall(PC) ≝ 1 : Instr(PC)

There are three possible alternatives for Instr(PC):

1. A non-branch instruction may execute, in which case the next instruction to execute
is at PC + 1. The first line of the definition of Instr describes this situation.

2. A branch instruction may execute, in which case the next instruction cannot be
determined until it is known whether the branch will be taken or not. Hence the decision
on what instruction to execute next is deferred; the agent Branch is defined later.

3. If no instruction can execute, then the processor must stall. The priority-sum operator
+> is used here because the processor should stall only when no other alternative is
available.

Arithmetic Instructions

Like most processors, ours fetches instructions from memory using a program counter PC. The action

    getm_PC(Add R_i, R_j, R_k)

represents fetching an Add instruction from memory. (Recall the slight abuse of notation with the comma-separated operands.)

From a user's view, the instruction Add R_i, R_j, R_k appears to take one cycle to execute. In the following instruction sequence

    Add R1, R2, R3
    Mov R4, R1

the Add instruction executes at time t and the Mov executes at time t+1. From a behavioral view there is no problem with writing R1 and then reading R1 in consecutive instructions. Normally a user would expect this instruction sequence to be legal, and the user should not need to understand the details of the instruction pipeline that might make the sequence illegal, nor the bypass hardware that makes the sequence behave as originally expected.

The agent

    Alu(PC) ≝ getm_PC(Add R_i, R_j, R_k) getr_j(x) getr_k(y) putr_i(x+y) : Done

represents the execution of the Add instruction. At time t the source registers j and k are read by the actions getr_j(x) getr_k(y), and the result is written to destination register i by the putr_i(x+y) action.

In fact, this definition of Alu(PC) describes the same computation as the register transfer statement

    Reg[i] := Reg[j] + Reg[k]

except that the SCCS equation specifies that the registers are accessed and the result is written atomically, i.e., the instruction executes in a single cycle. The agent Done is the idle agent and represents termination of the instruction agent. The other arithmetic instructions are analogous.
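The functional, state-to-state reading of the Alu agent can be sketched in Python. The state layout and helper names here are illustrative assumptions, not part of the specification:

```python
# Sketch: an ALU instruction as a function from processor state to processor
# state. The reads of R_j and R_k and the write of R_i happen within one
# call, mirroring the single-cycle atomic action product of the Alu agent.

def add(state, i, j, k):
    """Add R_i, R_j, R_k: Reg[i] := Reg[j] + Reg[k], in one cycle."""
    regs = list(state["regs"])           # copy: instructions are pure functions
    regs[i] = regs[j] + regs[k]
    return {"pc": state["pc"] + 1, "regs": regs}

s0 = {"pc": 0, "regs": [0, 0, 7, 5]}
s1 = add(s0, 1, 2, 3)                    # Add R1, R2, R3
```

Because the function returns a new state rather than mutating the old one, it makes the "instructions are functions from processor state to processor state" reading literal.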

Integer Load and Store Instructions

The following instruction sequence

    Load R1, R2
    Mov  R3, R1

is illegal in our processor because of the use of R1 immediately after the Load: the Load instruction accesses memory at time t, and the result of the load is available at time t+1.

The computational behavior of the load instruction is the following RTL statement:

    Load R_i, R_j, Offset:    Reg[i] := Mem[Reg[j] + Offset]

This is represented by

    Load(PC) ≝ lockreg_i getm_PC(Load R_i, R_j) getr_j(B) getm_{B+Offset}(V) :
               putr_i(V) releasereg_i : Done

This definition specifies that at time t three things happen:

1. The base register j is accessed and the base address is placed in the variable B by the action getr_j(B).

2. Memory is fetched, with the value placed in the variable V, by the action getm_{B+Offset}(V), where Offset is the offset value.

3. The destination register i is locked using the action lockreg_i.

At time t+1 two actions occur:

1. The value V is written to destination register i with the action putr_i(V).

2. The destination register i is released with the action releasereg_i.

In the event that an instruction attempts to read or write a locked register, the processor reaches the deadlocked state 0. This is because our process that represents a register traps any such operations.
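The locking discipline of the delayed load can be sketched operationally. The following minimal Python model (class and method names are illustrative) traps any access to a locked register, just as the register process does:

```python
# Sketch of delayed-load register locking: the load locks its destination at
# issue time; one cycle later the leftover agent writes the value and releases
# the lock. Touching a locked register is trapped, like the transition to 0.

class IllegalSequence(Exception):
    """Stands in for the deadlocked state 0 of the specification."""

class Registers:
    def __init__(self, n=32):
        self.val = [0] * n
        self.locked = set()

    def lock(self, i):                 # lockreg_i, performed at load-issue time
        self.locked.add(i)

    def read(self, i):                 # getr_i
        if i in self.locked:
            raise IllegalSequence(f"R{i} is locked")
        return self.val[i]

    def write_release(self, i, v):     # putr_i(v) . releasereg_i, a cycle later
        self.val[i] = v
        self.locked.discard(i)

regs = Registers()
regs.lock(1)                           # Load R1, ... issued
try:
    regs.read(1)                       # using R1 in the load delay slot
    trapped = False
except IllegalSequence:
    trapped = True
regs.write_release(1, 42)              # leftover agent completes the load
```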

The Store Instruction

The store operation effectively takes one cycle to execute. For example, the following code segment is legal:

    Store R1, R2
    Load  R3, R1

The computational behavior of the store instruction is given by the following RTL statement:

    Store R_i, R_j, Offset:    Mem[Reg[i] + Offset] := Reg[j]

It is unlikely, however, that the stored value has finished being written to memory when the next instruction issues. The hardware takes care of the difficulties, allowing the next instruction to access the memory location just written. The following agent represents the store instruction:

    Store(PC) ≝ getm_PC(Store R_i, R_j) getr_i(B) getr_j(V) putm_{B+Offset}(V) : Done

The important characteristic of this definition is that it specifies that the Store instruction effectively takes one cycle to execute. The next definition combines the two agents Load and Store:

    Load_Store(PC) ≝ Load(PC) + Store(PC)

The Branch Instruction

In the following instruction sequence

    Locn: ...
          BZ  R1, Locn
          Add R2, R3, R4

the Add instruction after the branch is always executed before the jump to Locn. If the branch is not taken, then the Add instruction is skipped and the instruction below the Add is executed. A BZ instruction may not be immediately followed by another BZ instruction. The following equation specifies the behavior of the BZ instruction:

    Branch(PC) ≝ getm_PC(BZ R_i, Locn) getr_i(V) :
                 if V = 0 then
                     Non_Branch(PC+1) : Next_Instr(Locn)
                     + getm_{PC+1}(BZ R_i, Locn) : 0
                 else
                     Instr(PC+2)

The BZ instruction has the effect that:

1. at time t, a BZ instruction is fetched and register R_i is accessed;

2. at time t+1, if the value of R_i is not zero, then execution continues with the instruction after the branch delay slot;

3. at time t+1, if the value of R_i is zero, then a non-branch instruction is executed in the branch delay slot, and execution continues with the instruction at Locn at time t+2. If another BZ instruction is in the delay slot, then we reach the inactive agent 0, which represents an error state.
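The control-flow rule of BZ can be sketched as a small function. The names are illustrative, and "taken" abbreviates the V = 0 test:

```python
# Sketch of the BZ timing rule: on a taken branch the delay-slot instruction
# at PC+1 executes before control reaches Locn; on a not-taken branch the
# delay slot is skipped; a branch in the delay slot is the error state 0.

def bz_successors(pc, taken, delay_slot_is_branch, locn):
    """Addresses executed after 'BZ R_i, Locn' fetched at address pc."""
    if delay_slot_is_branch:
        raise RuntimeError("BZ in a BZ delay slot: undefined (state 0)")
    if taken:                      # R_i == 0
        return [pc + 1, locn]      # delay-slot instruction, then the target
    return [pc + 2]                # branch not taken: the delay slot is skipped
```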

Interlocked Floating-Point Instructions

The floating-point add instruction Fadd takes six cycles to compute its result. For instructions that have a large latency, it is generally unreasonable to expect the scheduler to find enough independent instructions to execute until the Fadd is complete, and inserting Nop instructions would significantly increase code size; therefore, floating-point instructions are typically interlocked.

Floating-Point Registers

One method of keeping instructions ordered properly is to associate a lock with each FP-register, as we did in the case of the integer registers. The difference here is that accessing a locked integer register is illegal, while accessing a locked FP-register causes the processor to stall.

    [Figure: State transition graph of a floating-point register Freg_i. The
    action getfr^j · lockfreg leads from Freg_i to Locked_Freg_i, and
    putfr · releasefreg leads back; Freg_i may also be read (getfr) without
    locking, and both states have idle (1) self-loops.]

Our processor has a separate set of thirty-two floating-point registers that are defined similarly to the integer registers, except that we add two new actions, lockfreg and releasefreg. Actions putfr and getfr are the two actions that write and read a floating-point register.

    Freg_i(y)     ≝  Σ_j getfr_i^j(y) lockfreg_i : Locked_Freg_i
                   + Σ_j getfr_i^j(y) : Freg_i(y)
                   + 1 : Freg_i(y)

    Locked_Freg_i ≝  putfr_i(x) releasefreg_i : Freg_i(x)
                   + 1 : Locked_Freg_i

(As in the integer-register definition, j ranges over the number of simultaneous reads permitted in a single cycle.) Thirty-two FP-registers are constructed analogously to the integer registers:

    FP_Registers ≝ Π_{j=0}^{31} Freg_j(0)

The figure above shows the state transition graph for a floating-point register. For the integer registers it was illegal to access a locked register, and trying to do so was trapped by directly putting the processor in the deadlocked state 0. But for the floating-point registers there is no deadlocked state. A process (an instruction) is still not able to read a locked floating-point register, but trying to do so causes the processor to stall, as the only possible transition will be the Stall process defined earlier.

The Fadd Instruction

Now that interlocked registers are defined, we can define the behavior of the floating-point add instruction. The Fadd instruction must:

1. access its source registers and lock its destination register;
2. compute the addition;
3. write the result in the destination register;
4. release the destination register.

The following equation specifies our processor's Fadd instruction:

    Float(PC) ≝ getm_PC(Fadd FR_i, FR_j, FR_k) lockfreg_i getfr_j(x) getfr_k(y) :
                1^n :
                putfr_i(x+y) releasefreg_i : Done

The abbreviation 1^n represents the n-cycle delay 1 : 1 : ... : 1 (n times), which is interpreted as n cycles of internal computation. The processor stalls when an instruction wishes to access a locked FP-register: the instruction will not be able to access the FP-register, and the only other option is to execute the agent Stall. Remember, from the definition of Instr(PC), that our processor continues executing instructions after the Fadd instruction has started.

Structural Constraints

Processors often reuse functional units. For example, a floating-point unit may have only one adder that is used by the addition, multiplication, and division instructions. This failure to fully replicate a resource for each instruction that needs it gives rise to structural hazards, which can alter the timing characteristics of the instructions that require the resource. Moreover, an instruction may require a resource several times, for various lengths of time, during its execution, making the resource constraints complex to describe.

As an example, the MIPS floating-point unit has a floating-point adder, divider, rounder, and shifter (it also has several other functional units, such as an exception checker, which we omit for the sake of simplicity). The single-precision divide instruction FDIV requires the

- floating-point adder and shifter on the first clock cycle;
- rounder and the shifter on the next cycle;
- shifter alone on the cycle after that;
- divider for a long run of intermediate cycles;
- divider and adder together for two cycles;
- divider and rounder together for two cycles;
- adder alone for one cycle; and finally the
- rounder for one cycle.

Modeling Finite Resources

We need a way to include resources in our model. We could leave them out; doing so would still give us a good approximation of the timing behavior of our processor, that is, we could still model delayed loads, branches, and latencies. But if we wish to be more accurate, then we should include resources. Also, instruction schedulers consider resource constraints.

To model limited resources we introduce a generic agent Resource that models a resource that instructions can acquire and release. When an instruction tries to acquire a resource that is being used by another, the processor stalls. The following equations define a generic agent Resource that can be acquired with the action get_resource and released with the action release_resource. This agent will be replicated however many times is needed, once for each resource, and suitably relabeled.

    Resource        ≝  get_resource : Locked_Resource
                     + get_resource release_resource : Resource
                     + 1 : Resource

    Locked_Resource ≝  release_resource : Resource
                     + 1 : Locked_Resource

    [Figure: State transition graph of Resource. get_resource leads to
    Locked_Resource and release_resource leads back; the one-cycle action
    product get_resource release_resource is a self-loop on Resource, and
    both states have idle (1) self-loops.]

The figure shows the transition graph for the agent Resource. Notice that the functionality of a resource is not being modeled, only an instruction's capability to use the resource exclusively. Hence, at this level, modeling a floating-point adder is exactly like modeling a floating-point multiplier. Alternatively, we could specify the pipelined floating-point unit in detail if we desired, but this would mire us in irrelevant organizational detail.

In our processor the floating-point unit has three resources that must be shared: an adder, a multiplier, and a divider. The following equation specifies this by replicating Resource three times and relabeling its actions appropriately:

    FPU ≝ Resource[get_adder/get_resource, release_adder/release_resource]
        × Resource[get_multiplier/get_resource, release_multiplier/release_resource]
        × Resource[get_divider/get_resource, release_divider/release_resource]

We assume that an agent acquires a resource on the first cycle that it needs it and releases the resource on the last cycle that it needs it. For example, if an instruction needs the adder for one cycle only, then it will perform the action product get_adder release_adder. If an instruction needs the adder for two consecutive cycles, then the instruction should specify this with the agent get_adder : release_adder. The instruction should not acquire and release the adder on each cycle, as in the following:

    get_adder release_adder : get_adder release_adder

By requiring that resource requirements be specified this way, we are imposing a normal form on the specification. However, this normal form is reasonable, and we justify it by noting that

1. processor manuals indicate resource requirements in this fashion (for example, the MIPS processor manual), and

2. the instruction scheduling problem, as presented in the literature, defines resource requirements in this manner.
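The Resource agent behaves like a binary semaphore sampled once per cycle. A Python sketch (illustrative names; the stall is reported as a boolean rather than as an explicit Stall agent):

```python
# Sketch of the generic Resource agent: get acquires, release frees, and a
# get on a held resource fails, which at the processor level forces a stall.
# Following the normal form, a one-cycle use is an acquire+release pair.

class Resource:
    def __init__(self, name):
        self.name, self.held = name, False

    def get(self):
        """Try to acquire; False means the requester must stall this cycle."""
        if self.held:
            return False
        self.held = True
        return True

    def release(self):
        self.held = False

adder = Resource("adder")
ok_first = adder.get()             # Fadd acquires the adder
stalled = not adder.get()          # a second FP instruction must stall
adder.release()                    # last cycle of use: adder is free again
ok_again = adder.get()
```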

Multicycle Floating-Point Instructions

Now that there are several floating-point instructions competing for shared resources, the Fadd instruction defined earlier needs to be altered. We redefine the agent Float(PC) to the following:

    Float(PC) ≝ FADD(PC) + FMUL(PC) + FDIV(PC)

The Fadd instruction requires the adder for two cycles after the operands are accessed:

    FADD(PC) ≝ getm_PC(Fadd FR_i, FR_j, FR_k) lockfreg_i getfr_j(x) getfr_k(y) :
               get_adder : release_adder :
               putfr_i(x+y) releasefreg_i : Done

The Fmul and Fdiv can now be similarly defined. The Fmul instruction requires the adder for one cycle, then the multiplier for two cycles, then the adder again for one cycle:

    FMUL(PC) ≝ getm_PC(Fmul FR_i, FR_j, FR_k) lockfreg_i getfr_j(x) getfr_k(y) :
               get_adder release_adder :
               get_multiplier : release_multiplier :
               get_adder release_adder :
               putfr_i(x×y) releasefreg_i : Done

After its operands are accessed, the Fdiv instruction requires the adder for one cycle, the divider for eight cycles, and then the adder again for two cycles:

    FDIV(PC) ≝ getm_PC(Fdiv FR_i, FR_j, FR_k) lockfreg_i getfr_j(x) getfr_k(y) :
               get_adder release_adder :
               get_divider : 1^6 : release_divider :
               get_adder : release_adder :
               putfr_i(x/y) releasefreg_i : Done

Essentially, we are modeling a resource as a binary semaphore. This technique can be used to handle any kind of resource that needs to be accessed exclusively, e.g., register and memory ports, busses, etc.

A processor may have more than one copy of a particular resource. For example, there may be two independent floating-point adders that can be used by any of the floating-point instructions. In this case we duplicate the resource, using the same labels for each. For example, two adders would be specified as

    Resource[get_adder/get_resource, release_adder/release_resource]
    × Resource[get_adder/get_resource, release_adder/release_resource]

and when an agent wishes to acquire one of them, there are three possibilities:

1. Both adders are free, and one of them is nondeterministically chosen.

2. One adder is being used and the other is free, in which case the free adder is acquired.

3. Both adders are busy, in which case the instruction cannot continue and the pipeline stalls, as in the case of the interlocked floating-point registers.

Nondeterminism is an important aspect of specification. The potential nondeterminism introduced above, while not implementable at the hardware level, helps keep a description from being over-specified. For example, if a processor has two identical adders, at the specification level we may not want to dictate which of the two adders an instruction uses.

A Normal Form

This chapter has demonstrated a method of specifying the timing properties of instructions. Our approach requires that instructions be specified in a certain manner, so we present the following normal form for the specification. As we have seen, the restrictions are reasonable.

Register Locking

An instruction that locks a register must eventually unlock that same register. That is, a process that describes an instruction that locks a register i must execute the following sequence:

    lockreg_i ... releasereg_i

Resource Requirements

An instruction that needs a resource must eventually release that same resource. That is, a process that describes an instruction that acquires a resource r must execute the following sequence:

    get_r ... release_r

Moreover, if an instruction needs resource r for one cycle only, then it performs the single action product

    get_r release_r

Undefined Instruction Sequences

Two situations arose in our processor definition where instruction sequences were undefined: a branch instruction followed by another branch, and an instruction referencing a register that was currently being loaded by a load instruction. In both of these situations we trapped the offending sequence by having a transition to 0, SCCS's undefined state. If an instruction sequence i_1 ... i_n is illegal, then executing that sequence should cause a transition to 0. That is, the following transition must occur:

    P --i_1 ... i_n--> 0

The general way of specifying an illegal sequence is to have the instructions communicate through an intermediate process. This was the technique we used when specifying illegal load/use combinations, where the intermediate process was a lockable register.

Summary

In this chapter we have presented a technique for specifying the programmer's timing view of RISC-style architectures. We handled architecturally defined delayed loads and branches, interlocked floating-point instructions, and resource constraints. The technique for specifying resources is very general and can be used to specify a variety of constraints. For example, if we had a dual-ported register file, and if for some reason two ports were not enough and could lead to a structural hazard, we could model a port as a resource.

As another example, one of Motorola's processors has a structural hazard on a writeback data bus, in which it is possible for up to three instructions to try to use the writeback bus on the same cycle. The hazard is resolved by giving priority to integer instructions over floating-point instructions. In this situation we could model the writeback bus as a resource and use our priority-sum operator to correctly model how the hazard is resolved.

Chapter

Multiple Instruction-Issue Processors

An Integer + Float Superscalar

This section describes a superscalar (multiple instruction-issue) version of our processor that can issue one floating-point and one integer instruction per cycle. If two instructions can be issued in parallel, then we have either an integer instruction followed by a floating-point instruction, or a floating-point instruction followed by an integer instruction. This situation is specified by

    Float(PC) × Alu(PC+1)  +  Alu(PC) × Float(PC+1)

We can rewrite this sum as

    Σ_{i,j ∈ {0,1}} Alu(PC+i) × Float(PC+j)

Assuming an instruction is not both an integer and a floating-point instruction, this summation is a folding of the previous equation. It is a sum of four terms:

    Alu(PC)   × Float(PC)
    Alu(PC)   × Float(PC+1)
    Alu(PC+1) × Float(PC)
    Alu(PC+1) × Float(PC+1)

However, when i = j, Alu(PC+i) and Float(PC+j) refer to the same memory location, and it is impossible for a memory location to contain both an Alu instruction and a Float instruction. We can conclude, then, that

    Alu(PC+i) × Float(PC+j) = 0    where i = j

and only the two intended terms remain. We use the summation notation because it enables us to succinctly specify n-way instruction parallelism.

The following definition extends the sum to continue execution at PC+2:

    Do_Two(PC) ≝ ( Σ_{i,j ∈ {0,1}} Alu(PC+i) × Float(PC+j) ) : Next_Instr(PC+2)

There are no data dependencies to worry about, because each instruction accesses a separate register file. That is, because the processes Alu and Float use disjoint register files, and the only way data dependencies arise is when two or more instructions use the same register, it follows that an integer instruction and a floating-point instruction cannot have a data dependency between them.

Instruction Issue

Our top-level instruction issue equation, Instr(PC), must now be modified to take this new two-issue capability into account. For reference we restate Instr and rename it Do_One:

    Do_One(PC) ≝ Non_Branch(PC) : Next_Instr(PC+1)
               + Branch(PC)

The processor can execute two, one, or zero (i.e., stall) instructions per cycle, which we capture by

    Instr(PC) ≝ Do_Two(PC) ⊕ Do_One(PC) ⊕ Stall(PC)

Notice here the use of the priority choice operator ⊕ instead of +: whenever it is possible to do Do_Two it is also possible to do Do_One, and issuing two instructions should take priority over issuing one when possible. Similarly, if we cannot execute any instruction, then the processor must stall.
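The priority choice among Do_Two, Do_One, and Stall amounts to "issue as many instructions as you can." A sketch, where the boolean predicates stand in for whether each agent has a transition available:

```python
# Sketch of prioritized issue: prefer two instructions, then one, then stall,
# mirroring Do_Two (+) Do_One (+) Stall under the priority-choice operator.

def issue_width(can_issue_two, can_issue_one):
    if can_issue_two:            # Do_Two has a transition: take it
        return 2
    if can_issue_one:            # otherwise fall back to single issue
        return 1
    return 0                     # no instruction can execute: stall

assert issue_width(True, True) == 2      # two always preferred when possible
assert issue_width(False, True) == 1
assert issue_width(False, False) == 0    # stall
```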

An Integer + Integer Superscalar

In this section we specify a version of our processor that can execute two integer ALU instructions in parallel. At first glance it would seem that

    Alu(PC) × Alu(PC+1)

specifies the ability to execute two integer instructions in parallel. However, because both instructions use the same register file, we now have the possibility of data hazards existing between the two integer instructions. Hence, parallel execution is sometimes thwarted. Before we continue, we need to introduce the various types of data hazards that can arise.

Data Dependencies, or Data Hazards

The instruction pairs in the figure below represent all of the possible dependencies that can exist between any two instructions.

The instructions in (a) can be executed in parallel because they are data independent, while those in (b) cannot be executed in parallel because of the read-after-write, or RAW, hazard on the first instruction's destination register. In (c) parallel execution is possible if the processor can do register renaming to eliminate the write-after-read, or WAR, hazard; however, we can specify that the instructions may be executed in parallel without having to specify the renaming hardware, as we don't wish to over-specify. The instructions in (d) present a rare but possible write-after-write, or WAW, hazard: here the final value of the shared destination register must be the result of the second instruction. Two solutions are possible. First, since the hazard is rare, execute the instructions sequentially. Second, execute them in parallel and ensure that the destination register gets the result of the second instruction. For simplicity we will choose the first option.

Specifying Data Hazards

Using restriction, we can force Alu(PC) × Alu(PC+1) to apply only to legal integer instruction sequences of length two: if the first integer instruction writes register i, then the second integer instruction cannot write or read register i. For example, given two integer instructions, if the first instruction writes register 0, then the following expression represents the legal integer instruction

(Footnote: There is some confusion in terminology with regard to data hazards. Computer engineers refer to them as RAW, WAR, and WAW hazards, while compiler writers refer to them as forward, anti, and output dependencies, respectively. Forward dependencies are sometimes referred to as true dependencies.)

    (a) No dependencies:            (b) Read-after-write hazard:
        Add R1, R2, R3                  Add R1, R2, R3
        Add R4, R5, R6                  Add R4, R1, R5

    (c) Write-after-read hazard:    (d) Write-after-write hazard:
        Add R1, R2, R3                  Add R1, R2, R3
        Add R2, R4, R5                  Add R1, R4, R5

    Figure: Possible data dependencies in instruction sequences

sequences:

    (Alu(PC) \\ A) × (Alu(PC+1) \\ B)
    where A = {putr_0, getr_0, ..., getr_31}
          B = {putr_1, ..., putr_31, getr_1, ..., getr_31}

Here the agent P \\ S represents particle restriction on the agent P, where S is the set of particles that P may execute. The restriction on the first instruction by the set A specifies that register zero is the only possible destination register, while the restriction on the second instruction by the set B specifies that register zero can be neither a source register nor a destination register.

Summing over all possible destination registers of the first instruction yields the desired result:

    Do_Two(PC) ≝ ( Σ_{i=0}^{31} (Alu(PC) \\ A_i) × (Alu(PC+1) \\ B_i) ) : Next_Instr(PC+2)
    where A_i = {putr_i, getr_0, ..., getr_31}
          B_i = {getr_0, ..., getr_31, putr_0, ..., putr_31} − {putr_i, getr_i}

This definition represents all of the allowable integer instruction sequences of length two that may execute in parallel.
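The restriction sets A_i and B_i amount to a simple register check on a candidate pair. A sketch in which an instruction is a (dest, src1, src2) triple of register numbers (this encoding is an assumption for illustration):

```python
# Sketch of the A_i/B_i restriction as a predicate: two integer instructions
# may dual-issue iff the second neither reads (RAW) nor writes (WAW) the
# first's destination. A WAR overlap is allowed, as in the specification.

def may_dual_issue(first, second):
    dest1 = first[0]
    dest2, src1, src2 = second
    return dest1 not in (dest2, src1, src2)

assert may_dual_issue((1, 2, 3), (4, 5, 6))          # independent
assert not may_dual_issue((1, 2, 3), (4, 1, 5))      # RAW on R1
assert not may_dual_issue((1, 2, 3), (1, 4, 5))      # WAW on R1
assert may_dual_issue((1, 2, 3), (2, 4, 5))          # WAR on R2: allowed
```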

An Integer + Integer + Float Superscalar

In this section we describe a version of our processor that can execute three instructions in parallel, two of which can be integer instructions and the other of which may be a floating-point instruction. Since we have already specified the two dual-issue versions, the three-issue version follows nicely.

In this case, however, the easiest way to write the equation is as a sum of three terms, one for each possible position of the floating-point instruction (ignoring data hazards for the time being):

      Alu(PC)   × Alu(PC+1)   × Float(PC+2)
    + Alu(PC)   × Float(PC+1) × Alu(PC+2)
    + Float(PC) × Alu(PC+1)   × Alu(PC+2)

Adding the data hazard constraints, as in the dual-issue case, gives us

    Do_Three(PC) ≝
        ( Σ_{i=0}^{31} (   (Alu(PC) \\ A_i) × (Alu(PC+1) \\ B_i) × Float(PC+2)
                         + (Alu(PC) \\ A_i) × Float(PC+1) × (Alu(PC+2) \\ B_i)
                         + Float(PC) × (Alu(PC+1) \\ A_i) × (Alu(PC+2) \\ B_i) )
        ) : Next_Instr(PC+3)
    where A_i = {putr_i, getr_0, ..., getr_31}
          B_i = {getr_0, ..., getr_31, putr_0, ..., putr_31} − {putr_i, getr_i}

The top-level issue equation now allows executing three, two, one, or zero instructions each cycle:

    Instr(PC) ≝ Do_Three(PC) ⊕ Do_Two(PC) ⊕ Do_One(PC) ⊕ Stall(PC)

As an example, the instruction sequence in (a) below can be executed in parallel according to this definition, while the sequence in (b) cannot, because of the hazard. The situation is much the same for specifying four-issue, five-issue, etc.

    (a) No dependencies:        (b) RAW hazard:
        Add  R1, R2, R3             Add  R1, R2, R3
        Fadd FR1, FR2, FR3          Add  R4, R1, R5
        Add  R4, R5, R6             Fadd FR1, FR2, FR3

    Figure: Possible three-issue instruction sequences

Chapter

Simulation

In this chapter we show how our SCCS specification of our example processor is simulated. The simulation occurs within the framework of the Concurrency Workbench, which allows us to experiment with, simulate, and analyze SCCS specifications.

The Reactive View

A reactive system is one that interacts with its environment. This is opposed to the functional view, in which a system is viewed extensionally as a mapping of initial inputs to final outputs. A common way to think of a reactive system is to view it as a black box with buttons, where the buttons represent the actions that the user (or environment) can perform.

The reactive view of our SCCS processor description is one where instructions represent the buttons. Initially, at time t, any button can be pressed, i.e., any instruction can be executed. At time t+1 some buttons can be pressed and some cannot. The instructions that can't be executed (the buttons that can't be pressed) must be waiting for something from the instruction that was initiated at time t. For example, if the "Fadd FR1, FR2, FR3" button is pressed at time t, then at time t+1 any button that uses FR1 cannot be pressed.

    t:    Instr(PC) × Registers × Memory × FP_Registers
              {Add R1, R2, R3}  ⟨⟨getr_2(x), getr_3(y), putr_1(x+y)⟩⟩
    t+1:  Instr(PC+1) × Registers' × Memory × FP_Registers
              {Mov R4, R1}  ⟨⟨getr_1(x), putr_4(x)⟩⟩
    t+2:  Instr(PC+2) × Registers'' × Memory × FP_Registers

    Figure: Derivation of a program executing on the processor

Simulation

A simulation of our processor specification amounts to loading a program into memory with putm actions and then running the agent that represents the processor. That is, we can observe the behavior of a program by calculating the transition graph of an agent. Recall that the transition graph of an agent P consists of transitions of the form A --α--> B, where

- A represents the state of the system at time t,
- α is the action performed on the transition, and
- B is the new state at time t+1.

Because our processor represents such a large system, it is not feasible to write down the entire graph, so we abbreviate it by showing only the pertinent states.

In our transition graphs, each node represents the current state of the processor at a particular moment in time. Each edge is labeled with the set of actions that execute on that transition, e.g., instructions, getr, putr, lockreg, etc. For readability, individual particles are enclosed in brackets, e.g., ⟨⟨getr_1(x), putr_2(x)⟩⟩. The particles that appear between the ⟨⟨ ⟩⟩ brackets do not really appear on the transition, because each particle has synchronized with either a register or a resource; we label the transitions with these particles anyway, just to show all of the internal details. To simplify the graphs, many actions and processes have been omitted, and at the nodes, unchanged register and memory cells have been replaced by an ellipsis.

A Simple Example

The figure above shows the transition graph of the following program:

    Add R1, R2, R3
    Mov R4, R1

We assume that the program is loaded into memory with putm actions starting at PC. In the graph:

- Time t is the initial state of the processor.
- Time t+1 is the state of the processor after executing the Add instruction.
- Time t+2 is the final state of the processor, after the Mov instruction is executed.

Example: An Illegal Instruction Sequence

Executing an illegal instruction sequence on our specification should lead to the inactive agent 0. The figure below traces the following illegal instruction sequence:

    Load R1, R2
    Mov  R3, R1

On the second transition the particle getr_1(x) causes the agent Locked_Reg_1 to change to the agent 0.

Example: A Floating-Point Vector Sum

For a more complete example, we trace our processor's behavior on a program that calculates the vector sum of a floating-point array (shown below). For this example we

    t:    Instr(PC) × Registers × Memory × FP_Registers
              {Load R1, R2}  ⟨⟨getr_2(Base), getm_{Base}(V), lockreg_1⟩⟩
    t+1:  Instr(PC+1) × (putr_1(V) releasereg_1 : Done) × Locked_Reg_1(y) × ...
              {Mov R3, R1}  ⟨⟨getr_1(x), putr_3(x), putr_1(V), releasereg_1⟩⟩
    t+2:  0

    Figure: Derivation of an illegal program executing

          LoadI    R0, #0
          LoadI    R1, #0
          LoadI    R2, #Vec
          MovItoFP FR0, R0
    Loop: Load.f   FR1, R2, #0
          AddI     R1, R1, #1
          AddI     R2, R2, #4
          Fadd     FR0, FR0, FR1
          CmpI     R0, R1, #10
          BZ       R0, Loop
          Nop

    Figure: Program that calculates a vector sum

    t:    Instr(PC) × ...
              {LoadI R0, #0}  ⟨⟨lockreg_0⟩⟩
    t+1:  Instr(PC+1) × (putr_0(0) releasereg_0 : Done) × Locked_Reg_0(y) × ...
              {LoadI R1, #0}  ⟨⟨lockreg_1, putr_0(0), releasereg_0⟩⟩
    t+2:  Instr(PC+2) × (putr_1(0) releasereg_1 : Done) × Locked_Reg_1(y) × ...
              {LoadI R2, #Vec}  ⟨⟨lockreg_2, putr_1(0), releasereg_1⟩⟩
    t+3:  Instr(PC+3) × Locked_Reg_2(y) × (putr_2(Vec) releasereg_2 : Done) × ...
              {MovItoFP FR0, R0}  ⟨⟨getr_0(x), putfr_0(x), putr_2(Vec), releasereg_2⟩⟩
    t+4:  Instr(PC+4) × ...
              {Load.f FR1, R2, #0}  ⟨⟨lockfreg_1, getr_2(B), getm_{B+0}(V)⟩⟩
    t+5:  Instr(PC+5) × (putfr_1(V) releasefreg_1 : Done) × Locked_FP_Reg_1 × ...
              {AddI R1, R1, #1}  ⟨⟨putfr_1(V), releasefreg_1, getr_1(x), putr_1(x+1)⟩⟩
    t+6:  (Alu(PC+6) × Float(PC+7)) × ...
              {AddI R2, R2, #4}  ⟨⟨getr_2(x), putr_2(x+4)⟩⟩
              {Fadd FR0, FR0, FR1}  ⟨⟨getfr_0(x), getfr_1(y), lockfreg_0⟩⟩
    t+7:  Instr(PC+8) × (putfr_0(x+y) releasefreg_0 : Done) × Locked_FP_Reg_0 × ...

    Figure: Transition graph of the vector sum program

use the superscalar version of our processor defined earlier. Note that the program contains instructions which we have not specified: LoadI, AddI, and CmpI are load, add, and compare instructions whose third operand is an immediate constant. LoadI's timing properties are the same as Load's, and AddI and CmpI take one cycle to execute, as Add does. MovItoFP moves data from an integer register to an FP-register and takes one cycle to execute. Load.f is a delayed floating-point load instruction.

The trace figure shows a partial transition graph of the vector sum program, up to the first execution of the Fadd instruction. We can see from the transition graph that:

- At least one new instruction is issued every cycle; that is, on every transition there is an action that represents an instruction.

- At several steps we can observe the effect of the delayed load instructions: a leftover agent that writes and releases the destination register, while the destination register of each load sits in its locked state Locked_Reg.

- At the final steps of the trace we can observe that two instructions are executing in parallel, one integer and one floating-point.

In conclusion, since the operational semantics of SCCS maps SCCS terms to abstract machines (i.e., labeled transition systems), we have a direct way of examining the behavior of an SCCS specification.

Chapter

Instruction Scheduling

This chapter gives a brief overview and formalization of instruction scheduling. As an example, we present an algorithm for the most common form of instruction scheduling, list scheduling. We need a formal definition of the instruction scheduling problem so that we can deduce what information an instruction scheduler requires; knowing what to look for then leads us into extracting this information from our specification.

Instruction scheduling is primarily carried out on RISC/superscalar-style architectures that have the following characteristics:

- Load/store architecture: only two instructions, load and store, are allowed to access memory. All other instructions operate on registers.

- Instructions are fixed-format; that is, every instruction is encoded in the same number of bits. A typical RISC has 32-bit instruction formats.

- A new instruction can potentially be issued every cycle.

Instruction Scheduling

Given a sequence of instructions S, instruction scheduling is the problem of reordering S into S' such that S' has two properties:

1. the semantics of the original program S is unaltered, and

2. the time to execute S' is minimal with respect to all permutations of S that preserve the semantics of S.

The instruction scheduling problem is NP-complete in general.

Before we proceed, we need a definition.

Definition. The length of an instruction i is the minimum number of cycles needed to execute i.

In our case, length(i) is the number of cycles needed to execute i in the absence of all hazards. Intuitively, one way to compute length(i) is to execute i in isolation.

Constraints

We introduce a machine mo del that has two types of scheduling constraintsprecedence

and resource

Precedence Constraints

A precedence constraint or data dep endency is a requirement that a particular instruction

i execute b efore another instruction j due to a data dep endency b etween i and j As

mentioned in Section data dep endencies are of three types true or forward anti

and output which corresp ond to RAW WAR and WAW hazards

The ob jects to b e scheduled are individual instructions of a program If I is the set of

instructions in a program then the precedence constraints induce a partial order Prec on

P such that Prec I I If there is a precedence constraint b etween two instructions i

and j such that i must execute b efore j then i j Prec

The partial order Prec is usually represented graphically by a directed acyclic graph

DAG G G is comp osed of a set of vertices V and a set of edges E G V E Each

vertex of the graph is an instruction and if i j Prec then there is a directed edge i j

from vertex i to vertex j As an example Figure shows the dep endency graph for the

lo op b o dy a basic blo ck of the vector sum program in Figure

If instruction i must execute before j, we do not usually require that i run to completion before j can begin. Hence we augment the DAG by labeling each edge e with a minimal latency (delay) d(e). This latency represents the least amount of time, in cycles, that must pass after i begins executing before j can begin.

Definition: The latency between two instructions i and j is the least number of cycles that must pass between initiating i and then initiating j.

Figure: Dependency graph of the vector sum program presented earlier. The vertices are V1: AddI R1, R1, #1; V2: CmpI R0, R1, #10; V3: Load.f FR1, R2, #0; V4: Fadd FR0, FR0, FR1; V5: AddI R2, R2, #4.

Resource Constraints

An architecture consists of a multiset R of resources. On each clock cycle that it is executing, each instruction uses a subset of resources from R, intuitively the resources being used by the instruction at that time. If an instruction i executes for length(i) cycles, then the resource usage function ρ_i for instruction i maps clock cycles to subsets of R. That is, ρ_i : ℕ → Pow(R), where ℕ represents the set of natural numbers. When t > length(i), ρ_i(t) = ∅. For example, ρ_Add(t) represents the multiset of resources used on clock cycle t by the Add instruction.

The scheduling constraint on resources, then, is that at any particular time t the resources needed by the instructions executing at time t do not exceed the available resources.

As an example, recall the description of the MIPS R4000 floating-point unit from an earlier section. The resource set R is

{divider, shifter, adder, rounder, mult-S1, mult-S2, unpacker, exception}

and part of the resource usage function for the FDIV instruction is given in the figure below.

Instruction Scheduling Algorithms

We now define what an instruction schedule is and formalize the instruction scheduling problem.

A multiset is a set that allows duplicate elements; for example, there might be two or more floating-point adders. Multisets are also known as bags.

Figure: Part of the resource usage function for the MIPS R4000 FDIV instruction. On four of its cycles FDIV uses, respectively, the resource sets {adder, shifter}, {rounder, shifter}, {shifter}, and {divider}.

Definition: An instruction schedule σ for a program DAG is an injective function that assigns each node v ∈ V an integer representing its position in the list of instructions; that is, σ is a topological ordering of G.

For example, consider the DAG in the figure above. One possible instruction schedule would be σ(V1) = 1, σ(V3) = 2, σ(V4) = 3, σ(V5) = 4, σ(V2) = 5, which we could abbreviate as V1 V3 V4 V5 V2. Another possible schedule is V1 V3 V5 V4 V2.

Definition: The instruction scheduling problem is to find the shortest schedule σ such that both the precedence (Equation 1) and resource (Equation 2) constraints are met:

∀e = (u, v) ∈ E : σ(v) − σ(u) ≥ d(e)    (1)

∀t : ⋃_{v ∈ V} ρ_v(t − σ(v)) ⊆ R    (2)

Intuitively, if Equation 1 is true then the distance between any two dependent instructions is greater than or equal to the required minimum latency. If Equation 2 is true then the schedule has not overcommitted the available resources. A schedule is not overcommitted if at any time t the union of all the resource requirements of all the instructions still executing at time t (those that started at time t' ≤ t) is a subset of R, the available resources of the machine.

In Equation 2, ⋃ and ⊆ refer to multiset union and subset.
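Both constraints are mechanically checkable. The following Python sketch is our own illustration, not code from the thesis: it represents the resource multiset R with collections.Counter and tests the precedence and resource constraints directly against a candidate schedule.

```python
from collections import Counter

def valid_schedule(sigma, edges, d, rho, R, horizon):
    """Check precedence and resource constraints for a schedule.

    sigma: dict mapping instruction -> issue cycle
    edges: iterable of (u, v) dependence edges
    d:     dict mapping (u, v) -> minimal latency in cycles
    rho:   function rho(v, k) -> Counter of resources used by v
           on its k-th cycle of execution (k = 0, 1, ...)
    R:     Counter of available resources (a multiset)
    """
    # Precedence: sigma(v) - sigma(u) >= d(u, v) for every edge.
    if any(sigma[v] - sigma[u] < d[(u, v)] for u, v in edges):
        return False
    # Resources: on every cycle t, the multiset union of the demands
    # of all executing instructions must fit within R.
    for t in range(horizon):
        demand = Counter()
        for v, start in sigma.items():
            if start <= t:
                demand += rho(v, t - start)
        if any(demand[r] > R[r] for r in demand):
            return False
    return True
```

For instance, with one ALU and a two-cycle latency between a load and a dependent add, scheduling the add two cycles after the load passes both checks, while issuing it one cycle later violates the precedence constraint.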

The above characterization of instruction scheduling is the basis for all instruction scheduling methods, including list scheduling, trace scheduling, and software pipelining.

At this point we have defined the instruction scheduling problem, and we know what information we need to do instruction scheduling. To make the definition more concrete, we discuss the most common instruction scheduling algorithm, list scheduling.

List Scheduling

To see what information is needed to parameterize instruction scheduling, we now describe list scheduling, which we have taken from Lam's thesis.

The main idea of list scheduling is straightforward. There is a list of ready-to-execute instructions, called the ready list, which initially holds the root nodes of V, that is, the nodes that have no incoming edges. Instructions are selected from the ready list using some type of priority scheme until the ready list is empty. After each instruction is scheduled, new nodes are added to the ready list. Altering the priority heuristic produces many different types of scheduling algorithms. The figure below gives the list scheduling algorithm.

We need not go into the details of function List_Schedule, but only note that it is a function of three things:

1. the program DAG G,

2. the table of latencies d, used in Satisfy_Precedence_Constraints, and

3. the resource usage functions ρ, used in Satisfy_Resource_Constraints.
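The three inputs above can be exercised in a small executable rendering of the list scheduling loop. The sketch below is ours, not the thesis's: it substitutes a trivial priority heuristic (smallest node name) and uses Counter multisets for resources, but it mirrors the structure of List_Schedule and its two Satisfy functions.

```python
from collections import Counter

def list_schedule(V, E, d, rho, length, R):
    """Greedy list scheduling: returns sigma, mapping node -> issue cycle.

    E:      set of (u, v) dependence edges
    d:      dict mapping (u, v) -> minimal latency
    rho:    rho(v, k) -> Counter of resources on v's k-th cycle
    length: v -> number of cycles v executes
    R:      Counter of available resources
    """
    preds = {v: {u for (u, w) in E if w == v} for v in V}
    sigma, scheduled = {}, set()
    ready = {v for v in V if not preds[v]}          # root nodes
    while ready:
        v = min(ready)                              # stand-in priority heuristic
        # Earliest cycle satisfying the precedence constraints.
        lb = max((sigma[u] + d[(u, v)] for u in preds[v]), default=0)
        # Advance until the resource constraints are satisfied too.
        t = lb
        while not fits(v, t, sigma, rho, length, R):
            t += 1
        sigma[v] = t
        scheduled.add(v)
        ready.discard(v)
        ready |= {u for u in V
                  if u not in scheduled and preds[u] <= scheduled}
    return sigma

def fits(v, t, sigma, rho, length, R):
    """Would starting v at cycle t overcommit any resource?"""
    for k in range(length(v)):
        demand = Counter(rho(v, k))
        for u, s in sigma.items():
            if s <= t + k < s + length(u):          # u still executing
                demand += rho(u, t + k - s)
        if any(demand[r] > R[r] for r in demand):
            return False
    return True
```

With a single shared ALU, two independent one-cycle instructions are issued on consecutive cycles, and a dependent pair is separated by its edge latency.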

Scheduling algorithms work on a section of straight-line code called a basic block. Trace scheduling and software pipelining are two techniques that schedule instructions beyond basic blocks. Without going into the details, we mention that both trace scheduling and software pipelining use list scheduling as a basis.

The compiler-structure figure below depicts a canonical view of the phases of compilation and shows where this research fits in (highlighted in the dotted box). In the figure, each boldfaced node not enclosed in a box is data for some portion of the compilation system. In our case, the timing specification is input to a program that generates parameters for a generic instruction scheduler.

function List_Schedule(G as (V, E), d, ρ)
let
    Ready = root nodes of V and
    Scheduled = ∅
in
    while Ready ≠ ∅ do
        v := highest-priority node in Ready
        Lb := Satisfy_Precedence_Constraints(v, Scheduled, d)
        σ(v) := Satisfy_Resource_Constraints(v, Scheduled, Lb)
        Scheduled := Scheduled ∪ {v}
        Ready := (Ready − {v}) ∪ {u | u ∉ Scheduled ∧ ∀(w, u) ∈ E : w ∈ Scheduled}
    end while
    return σ
end let
end List_Schedule

function Satisfy_Precedence_Constraints(v, Scheduled, d)
    return max_{u ∈ Scheduled} (σ(u) + d(u, v))

function Satisfy_Resource_Constraints(v, Scheduled, Lb)
    for t := Lb to ∞ do
        if ρ_v(j) ∪ (⋃_{u ∈ Scheduled} ρ_u(t + j − σ(u))) ⊆ R for all j, 0 ≤ j < length(v) then
            return t

Figure: Generic list scheduling algorithm.

Figure: The structure of a typical compiler. A source program passes through the front end to an intermediate representation, which the code generator (driven by a machine description) translates into target code; the timing specification feeds a table generator that produces the scheduler tables used by the instruction scheduler over the target code.

Chapter

Deriving Instruction Scheduling Parameters

This chapter describes how instruction scheduling information is derived from a timing specification written in SCCS. The general idea is that the parameters d and ρ from the previous chapter are derivable automatically from the specification, yielding an instruction scheduler for the processor.

Preliminaries

Recall from the previous chapter that the length of an instruction i is the minimum number of cycles required to execute i. Intuitively, one way to compute length(i) is to execute i in isolation. In SCCS this amounts to running the SCCS agent that describes i and counting cycles. Unfortunately, in general SCCS agents do not have to terminate; for example, our SCCS processor description may not terminate. However, termination is a reasonable assumption to make about the SCCS agents that describe individual machine instructions. Moreover, so that we can observe this termination, each instruction will indicate its termination with a special action done.

We will require that each instruction of our SCCS description indicate its termination using the action done. To do this we redefine the agent Done to the following:

Done ≝ done : 𝟙

The action done is a termination signal; no other agent should do a donē and synchronize with done.

In addition, we borrow the notion of a well-terminating agent from Milner.

Definition: An agent P is well-terminating if

1. donē ∉ Sort(P') for every derivative P' of P, and

2. if P' --done--> P'', then P'' = 𝟙.

The first part of the definition says the action done is reserved to indicate termination: no other agent can execute the action donē and synchronize with done. The second part of the definition says that if any agent does perform a done, it is the last action it performs before changing to the idle agent. The definition does not require that P terminate, only that if it does terminate then it indicates its termination with the action done.

Now we can compute length(i) precisely, inductively on the structure of SCCS terms, for any agent that is well-terminating:

length(𝟙) = 0
length(Done) = 0
length(done : P) = length(P)
length(α : P) = 1 + length(P), for α ≠ done
length(P + Q) = max(length(P), length(Q))
length(P × Q) = max(length(P), length(Q))
length(P ↾ S) = length(P)
length(P[f]) = length(P)
length(A) = length(P), where A ≝ P

For example, length(a : b : c : done : 𝟙) = 3.
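The inductive definition transcribes directly into code. The sketch below uses a toy term representation of our own devising (tagged tuples), not the thesis's encoding, to compute length over well-terminating terms.

```python
def length(term):
    """Minimum cycles to termination, by induction on SCCS terms.

    Term encodings (ours):
      ("done",)              -- the Done agent
      ("prefix", act, P)     -- act : P ; the done action costs no cycle
      ("sum", P, Q), ("product", P, Q)
      ("restrict", P, S), ("relabel", P, f)
    """
    tag = term[0]
    if tag == "done":
        return 0
    if tag == "prefix":
        _, act, p = term
        return length(p) if act == "done" else 1 + length(p)
    if tag in ("sum", "product"):
        return max(length(term[1]), length(term[2]))
    if tag in ("restrict", "relabel"):
        return length(term[1])
    raise ValueError(f"unknown term: {tag}")

# The example from the text: length(a : b : c : done : Done) = 3.
abc = ("prefix", "a", ("prefix", "b", ("prefix", "c",
      ("prefix", "done", ("done",)))))
```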

Derivation of Scheduling Parameters

Given the definition of the instruction scheduling problem from the previous chapter, there are two tasks:

1. Given a program dependence graph G = (V, E), label each edge e ∈ E with d(e), the minimal latency.

2. For each instruction i, construct the resource usage function ρ_i.

We first present an algorithm for deriving the delay function d from our processor specification. Essentially, the delay for an instruction pair (i, j) is calculated by initiating i and then observing how long until j can begin. Most architectures resolve WAR and WAW hazards, and schedulers only have to deal with RAW hazards. We will be more general and calculate the delay for every instruction pair (i, j) in the presence of a hazard h, where h ∈ {RAW, WAR, WAW}. Once we have such a table of latencies, we can correctly label a program DAG.

The Algorithm

The algorithm to calculate instruction latencies from the SCCS processor description, Construct_Latency_Function, is given in the figure below. It works by executing an instruction i and counting how long until a subsequent instruction j can begin after i starts. It does this for all possible pairs of opcodes for each data hazard. Recall that the instruction latency is the amount of time between initiating i and initiating j (the definition above). Notice, however, that it is still possible that after j begins it may stall for some other reason, which will most likely be a resource hazard.

Equivalence With Resp ect to Hazards

To decrease the number of instruction pairs for which we need to calculate latencies, we do not generate all possible pairs of instructions, but all possible pairs modulo the three types of hazards. That is, we consider two instruction pairs to have the same latency if they are equivalent up to hazards. For example, we expect that the two instruction pairs below have the same latency even though they use different integer registers:

Add R1, R2, R3        Add R6, R7, R8
Mul R4, R1, R5        Mul R9, R6, R0

Both instruction pairs have a RAW hazard, the pair on the left on R1 and the pair on the right on R6. The assumption is that the register file has a reasonable amount of orthogonality, as we would expect that the time it takes to access register i is equal to the time it takes to access register j. If the architecture does not possess this orthogonality, then we could inspect all register pairs for all combinations of registers. But the search space is reduced considerably if we treat all registers equivalently.

To see this, note that the number of possible instruction pairs grows with every combination of opcodes and register operands, whereas examining instruction pairs up to hazard equivalence only requires that we check one representative pair per opcode pair and hazard.

Definition: Let I be the set of instructions, let function Opcode(i), i ∈ I, return the opcode of instruction i, and let function Hazard(i, j) return the hazard occurring between instruction i followed by instruction j. Two instruction pairs (i1, i2) and (j1, j2) are said to be equivalent up to hazard, written

(i1, i2) ≡haz (j1, j2),

if

1. Opcode(i1) = Opcode(j1) and Opcode(i2) = Opcode(j2), and

2. Hazard(i1, i2) = Hazard(j1, j2).

Theorem: ≡haz is an equivalence relation.

Proof: We need to show that ≡haz is reflexive, symmetric, and transitive.

Reflexive: Clearly (i1, i2) ≡haz (i1, i2), as Opcode(ik) = Opcode(ik) for k = 1, 2 and Hazard(i1, i2) = Hazard(i1, i2).

Symmetric: Show that if (i1, i2) ≡haz (j1, j2) then (j1, j2) ≡haz (i1, i2). If (i1, i2) ≡haz (j1, j2), then Opcode(ik) = Opcode(jk) for k = 1, 2 and Hazard(i1, i2) = Hazard(j1, j2), which is all we need to show.

Transitive: Show that if (i1, i2) ≡haz (j1, j2) and (j1, j2) ≡haz (k1, k2), then (i1, i2) ≡haz (k1, k2). By the antecedent, Opcode(i1) = Opcode(j1) and Opcode(j1) = Opcode(k1); it follows that Opcode(i1) = Opcode(k1), and similarly for the second opcodes. Also by the antecedent, Hazard(i1, i2) = Hazard(j1, j2) and Hazard(j1, j2) = Hazard(k1, k2); it follows that Hazard(i1, i2) = Hazard(k1, k2). This is all we need to show. □

Algorithm Construct_Latency_Function

Algorithm Construct_Latency_Function requires as input the labeled transition system ⟨P, Act, →⟩ (described earlier) of the processor specification. Consider an instruction pair (i, j) having hazard h between them. Essentially, we initiate instruction i and execute Nop instructions until we reach a state P' such that the transition P' --j--> is possible. That is, if we execute i at time t, then we subsequently try to execute j at time t + 1; if we cannot, then we try executing j at time t + 2, and so on. Eventually j will be able to execute at some future time t + n, and we can conclude that the latency d(i, h, j) is n.

If we let 𝟙* mean zero or more consecutive occurrences of the idle action 𝟙, then we can visualize the latency between two instructions i and j in terms of the transition system:

P --i--> P1 --𝟙*--> Pn−1 --j--> Pn

where the transition labeled i is the first cycle of i, the transition labeled j is the first cycle of j, and the latency is the number of transitions from the first cycle of i up to, but not including, the first cycle of j.

We now prove two theorems about algorithm Construct_Latency_Function. For all of the remaining theorems we assume that the processor is specified in the normal form laid out at the end of the previous chapter: when an instruction acquires a register or resource, it eventually releases the same register or resource.

Lemma: Algorithm Construct_Latency_Function terminates.

Proof: Clearly the outer for-loop terminates, as the set Opcodes × Hazards × Opcodes is finite.

Consider an arbitrary instruction pair (i, j). The only way for the while-loop not to terminate is if instruction j is blocked forever; that is, no state P' exists for which the transition P' --j--> is possible. Since the only thing that can block j is a locked register or resource, and by our assumption every register/resource must eventually be unlocked, eventually the transition --j--> becomes possible and the while-loop terminates. □

In the worst case, j can begin after i terminates. Since i has terminated, i must have unlocked all of its registers/resources, so the transition --j--> is possible and the while-loop terminates. In this case

d(i, h, j) ≤ length(i).

However, we would like to show a stronger result: that d(i, h, j) is the minimal latency.

function Construct_Latency_Function(⟨P, Act, →⟩)
let
    Opcodes = {Add, Fadd, BZ, ...} and
    Hazards = {RAW, WAR, WAW}
in
    for each (i, h, j) ∈ Opcodes × Hazards × Opcodes do
        Construct an instruction pair (i, j) s.t. i uses opcode i,
            j uses opcode j, and hazard h exists from i to j
        let P' be the state s.t. P --i--> P'
        delay := 1
        while there is no transition P' --j--> do
            delay := delay + 1
            let P_next be the state s.t. P' --𝟙--> P_next
            P' := P_next
        end while
        d(i, h, j) := delay
    end for
end let
return d
end Construct_Latency_Function

Figure: Algorithm that derives instruction latencies.
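The inner while-loop can be phrased over any explicit labeled transition system. The sketch below uses a toy encoding of our own (states are hashable values, trans maps (state, action) pairs to successor states, and "1" stands for the idle action 𝟙); a bounded guard substitutes for the thesis's termination assumption.

```python
def latency(trans, start, i, j, idle="1", limit=100):
    """Cycles between initiating i and the first cycle at which j
    can begin, mirroring the Construct_Latency_Function while-loop.

    trans: dict mapping (state, action) -> next state
    """
    state = trans[(start, i)]       # initiate i
    delay = 1
    while (state, j) not in trans:  # j is blocked: idle one cycle
        delay += 1
        state = trans[(state, idle)]
        if delay > limit:           # guard; the thesis assumes termination
            raise RuntimeError("instruction never unblocks")
    return delay
```

In a system where j must idle for two cycles after i issues, the function returns 3; where j can issue on the very next cycle, it returns 1.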

Theorem: Algorithm Construct_Latency_Function constructs d, the minimal latency function.

Proof: We need to show that an arbitrary triple (i, h, j) ∈ Opcodes × Hazards × Opcodes is assigned a minimal latency d(i, h, j). Let the instruction pair (i, j) be the pair constructed in the first statement of the for-loop. Instruction i can start executing (i.e., the second statement of the for-loop is valid) since no other instructions are currently executing that may block i.

There are two cases to consider: instruction i is not blocking j, and instruction i is blocking j.

Case 1: i is not blocking j. In this case the while-loop is never entered, as the only transitions possible are P --i--> P' --j--> P'', and d(i, h, j) = 1.

Case 2: i is blocking j. We only need to show that j starts as early as possible, that is, that the latency is not too large. If P --i--> P' at time t, then we try to execute j at time t + 1. If we cannot, then we issue the idle action, reach a state P'', and try to execute j again. Eventually j can execute at some time t + n, because by the assumption that i is in normal form it will eventually release all of the registers/resources that j depends on.

The second case of the proof essentially says that if i started execution at time t, then i will eventually unlock register r at some time t + n, by the assumption that instruction i is in normal form. The latency is then n. Suppose the current cycle of i is n − 1 (i.e., time t + n − 1) and the variable delay = n − 1. The transition P' --j--> is not yet possible, so we enter the body of the while-loop: delay is set to n and the algorithm issues the idle action (i.e., the transition P' --𝟙--> P'' is taken). But on this transition i unlocks r, because this is cycle n; therefore in state P'' (time t + n), r is unlocked and i is no longer blocking j.

The algorithm tries to execute the loop body again, but now the transition P'' --j--> is possible, and the loop terminates with delay = n. □

Notice that the latency between two instructions i and j that suffer a data hazard may be equal to the length of i. In the second case it is possible that i may lock a resource, causing j to stall after j has begun executing. Here j has already started, but that is all that is needed to calculate the latency, since the latency represents the least amount of time in cycles that must pass after i begins executing before j can begin. The resource vectors of the instructions will yield information allowing the scheduler to optimize for resource conflicts.

Theorem: The time complexity of Construct_Latency_Function is O(mn²), where m is the length of the longest instruction and n is the number of opcodes for the architecture.

Proof: The for-loop in the algorithm iterates |Opcodes × Hazards × Opcodes| = 3n² times, which is O(n²). Since instructions terminate, the while-loop iterates at most m = max_{i ∈ Opcodes} length(i) times, which is O(m). Hence algorithm Construct_Latency_Function is O(mn²). □

Determining Illegal Instruction Sequences

We need to be able to determine which instruction sequences are illegal, e.g., sequences that incorrectly use the load or branch delay slots. If an illegal instruction sequence is executed, then our simulated processor deadlocks (recall the figure in the specification chapter). In our example microprocessor, referring to a register being loaded while it is locked, or executing a branch instruction in the branch delay slot, constitutes an illegal sequence. We would expect these to be the only situations where illegal combinations of instructions arise, but there could be others, so we must check.

A Mo dal Logic for SCCS

At the expense of some preliminary discussion, we can make our algorithms clearer and more concise by employing a modal logic defined in terms of labeled transition systems. The logic lets us describe properties as logical formulae; we may then check whether the transition system (process) satisfies a formula. We introduce the logic informally through an example. The following grammar generates the formulae of the logic, where K ⊆ Act:

A ::= true | false | A ∧ A | A ∨ A | ¬A | [K]A | ⟨K⟩A | μX.A

The logic is essentially the propositional calculus (boolean algebra) with two additional modal operators, [K] and ⟨K⟩. [K] and ⟨K⟩ represent necessity and possibility from classical modal logic, usually denoted □ and ◇. The calculus also includes the fixpoint operator μX.A, which allows us to write recursive logic equations.

If K is a set of actions, then a process P satisfies [K]A, written P ⊨ [K]A, iff after every action in K that P can perform, the resulting process satisfies A. ⟨K⟩ is a dual of [K]: P ⊨ ⟨K⟩A iff after some action in K, the resulting process satisfies A. Every process P has the property true, that is,

∀P : P ⊨ true,

and no agent has the property false:

¬∃P : P ⊨ false.

As an example, consider a process V that represents a vending machine that can accept a quarter or a half-dollar coin and yields a small (quarter) or large (half-dollar) candy; this is a standard example from the literature. The actions q and hd represent inserting a quarter or a half dollar:

V ≝ hd : big : collect : V + q : little : collect : V

Here V ⊨ ⟨q⟩true; that is, V can perform the action q (V can accept a quarter). A candy cannot be retrieved before a quarter is inserted; that is, V ⊨ [big, little]false. Some more properties include:

V ⊨ [hd]([little]false ∧ ⟨big⟩true) — after a half dollar is inserted, we cannot get a little candy but we can get a big one.

V ⊨ [q, hd][q, hd]false — after one coin is inserted, no more coins can be inserted.

V ⊨ [q, hd][big, little]⟨collect⟩true — after a coin is inserted and a candy is returned, we can collect it.

We can express immediate deadlock in the logic with the formula

Deadlock ≝ [−]false

Here − is a wildcard that represents the entire action set. If P satisfies Deadlock, then P cannot do any action. The process 𝟘 possesses no actions and represents the deadlocked agent; consequently, 𝟘 ⊨ Deadlock.

We can also encode the temporal logic operator eventually into the logic. Here P ⊨ eventually A when P eventually evolves to a process P' that satisfies A; that is,

∃P' : P ⟹ P' and P' ⊨ A.

P ⊨ eventually Deadlock if the process P eventually deadlocks, as opposed to immediately deadlocking.
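Over a finite transition system, the existential reading of eventually Deadlock reduces to a reachability search for a state with no outgoing transitions. The toy sketch below uses our own LTS encoding; a real implementation would hand the formula to a modal-logic checker such as the one in the Concurrency Workbench.

```python
def successors(trans, state):
    """All states reachable from state in one transition."""
    return [s2 for (s1, _a), s2 in trans.items() if s1 == state]

def deadlocked(trans, state):
    """[-]false: the state has no outgoing transitions."""
    return not successors(trans, state)

def eventually_deadlock(trans, state):
    """True iff some state reachable from state satisfies Deadlock."""
    seen, stack = set(), [state]
    while stack:
        s = stack.pop()
        if s in seen:
            continue
        seen.add(s)
        if deadlocked(trans, s):
            return True
        stack.extend(successors(trans, s))
    return False
```

A two-state cycle never deadlocks, while a chain ending in a state with no transitions does.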

Using the Logic in a Processor Specification

We first consider illegal instruction pairs. Essentially, we detect these by executing all instruction pairs and examining which cause the processor to deadlock. If CPU is the SCCS process that represents our microprocessor, then

CPU ⊨ [Load R1, R2, #0][Add R3, R1, R1] eventually Deadlock

means that CPU deadlocks when this particular Load instruction followed by the Add instruction is executed. Hence this instruction sequence is illegal. Our intuition suggests that, for the example microprocessor we have been studying, any instruction pair where the first instruction is a Load and the second instruction uses the register being loaded is illegal.

Another illegal instruction pair is a BZ instruction followed by a BZ instruction. So we expect that

CPU ⊨ [BZ Ri, Locn][BZ Rj, Locn] eventually Deadlock.

Algorithm Detect_Illegal_Instruction_Pairs (in the figure below) examines illegal instruction pairs. When the algorithm detects an illegal pair (i, h, j), it records an error value in the table entry for that pair; that is, d(i, h, j) := error.

It is possible to extend this method to all illegal instruction sequences of length n. For example, we could check whether an instruction sequence i1, i2, ..., in is legal with

CPU ⊨ [i1][i2]⋯[in] eventually Deadlock,

but this could lead to combinatorial problems. Fortunately, n is usually quite small.

Encoding the operator eventually requires the least fixpoint operator μ, which denotes a certain solution to recursive equations written in the logic (e.g., an equation such as Z ≝ ⟨−⟩Z). Specifically, from the literature we have

eventually B ≝ μX. B ∨ (⟨−⟩true ∧ [−]X).

function Detect_Illegal_Instruction_Pairs(⟨P, Act, →⟩, d)
let
    Opcodes = {Add, Fadd, BZ, ...} and
    Hazards = {RAW, WAR, WAW, NoHazard}
in
    for each (i, h, j) ∈ Opcodes × Hazards × Opcodes do
        Construct an instruction pair (i, j) s.t. i uses opcode i,
            j uses opcode j, and hazard h exists from i to j
        if P ⊨ [i][j] eventually Deadlock then
            d(i, h, j) := error
    end for
end let
return d
end Detect_Illegal_Instruction_Pairs

Figure: Algorithm to detect illegal instruction pairs.

Theorem: Algorithm Detect_Illegal_Instruction_Pairs correctly identifies illegal instruction pairs.

Proof: By the assumption that the specification is in the normal form laid out at the end of the previous chapter, and by the correctness of the satisfaction relation ⊨. The correctness of ⊨ is well known, and ⊨ is computable for finite-state systems. □

The only interesting part of algorithm Detect_Illegal_Instruction_Pairs is the second statement of the for-loop, checking the satisfaction relation ⊨. Here we have used the modal logic to do all of the transition checking, and we rely on the correct implementation of the logic.

Determining Possible Multiple-Issue Instructions

We can also identify which instructions can be issued in parallel using the modal logic. An instruction is an action in SCCS, and two instructions i and j executing in parallel form the product of these actions, i · j. So if i and j can execute in parallel, then the transition --i·j--> is possible; the figure shows this case.

In terms of the logic, then, we expect that CPU ⊨ ⟨i · j⟩true if i and j can execute in parallel. For example, if our processor can execute an integer instruction in parallel with a floating-point instruction, then

CPU ⊨ ⟨Fadd FR1, FR2, FR3 · Mov R1, R2⟩true.

In this situation we consider the latency between the two instructions to be zero; that is, d(i, h, j) = 0.

Algorithm Detect_Multiple_Issue_Pairs (in the figure below) only examines instruction pairs. As in the case of illegal instruction sequences, we can extend this method to all instruction sequences of length n, but again this leads to combinatorial problems. Fortunately, n is usually quite small, and there is some evidence of a small bound on the amount of available instruction-level parallelism. For all current superscalar architectures, n is less than five; the IBM RS/6000, which has the largest degree of parallelism, is capable of issuing four instructions per cycle.

Theorem: Algorithm Detect_Multiple_Issue_Pairs correctly identifies instruction pairs that can be issued simultaneously.

Proof: By the correctness of ⊨. □

function Detect_Multiple_Issue_Pairs(⟨P, Act, →⟩, d)
let
    Opcodes = {Add, Fadd, BZ, ...} and
    Hazards = {RAW, WAR, WAW, NoHazard}
in
    for each (i, h, j) ∈ Opcodes × Hazards × Opcodes do
        Construct an instruction pair (i, j) s.t. i uses opcode i,
            j uses opcode j, and hazard h exists from i to j
        if P ⊨ ⟨i · j⟩true then
            d(i, h, j) := 0
    end for
end let
return d
end Detect_Multiple_Issue_Pairs

Figure: Algorithm to detect multiple instruction issue pairs.

As in the case of checking illegal instruction sequences, the only interesting part of algorithm Detect_Multiple_Issue_Pairs is the second statement of the for-loop, checking the satisfaction relation ⊨. Again, we have used the modal logic to do all of the transition checking, and we rely on the correct implementation of the logic.

Computing the Resource Usage Functions

In this section we compute the resource usage function ρ_i for each instruction i of the SCCS processor specification. First, some definitions.

Particulate Actions and Agent Sorts

Recall that every action α is uniquely expressible as a product of powers of particulate actions, that is, α = α1^n1 α2^n2 ⋯ αk^nk, where each αi is either a particle a or a co-particle ā. We denote the particles of an action α as Part(α) = {α1, ..., αk}. We need to categorize processes by the actions that they may eventually execute. This categorization is called a sort, and it is like a type in a programming language.

Definition: The action sort of an agent P, denoted Sort(P), is a set of actions containing every action that P may execute: P executes action α implies α ∈ Sort(P).

The converse of the implication does not have to hold, so Sort(P) can be too big. That is, if Sort(P) = S and S ⊆ T, then T is also a sort of P.

Definition: The particle sort of P, denoted PartSort(P), is the set of particles of Sort(P): PartSort(P) = ⋃ {Part(α) | α ∈ Sort(P)}.

The particle sort of an SCCS agent is defined inductively on the structure of SCCS terms:

PartSort(𝟘) = ∅
PartSort(Done) = {done}
PartSort(α : P) = Part(α) ∪ PartSort(P)
PartSort(P + Q) = PartSort(P) ∪ PartSort(Q)
PartSort(P × Q) = PartSort(P) ∪ PartSort(Q)
PartSort(P ↾ S) = PartSort(P) ∩ Part(S)
PartSort(P[f]) = Part(f(Sort(P)))
PartSort(A) = PartSort(P), where A ≝ P

For example, if A ≝ ab̄ : Done + b : Done, then Sort(A) = {ab̄, b, done} and PartSort(A) = {a, b̄, b, done}.

Resources and Actions

The only reason we need to introduce sorts at all is so that we can determine what resources, if any, an instruction requires. If R is the set of resources of a processor specification, let Part(R) be the set of particles used to acquire and release these resources:

Part(R) = {get_r | r ∈ R} ∪ {put_r | r ∈ R}

The set of particles that an instruction i uses to access all of its resources is then PartSort(P) ∩ Part(R), where P is the SCCS agent that describes i. For example, PartSort(P) ∩ Part(R), where P is the divide instruction described earlier, is the set

{get_adder, put_adder, get_divider, put_divider}.

Deriving the Resource Usage Functions

Recall that the resource usage function ρ_i for an instruction i precisely specifies what resources i uses on each cycle of its execution. To derive the resource usage function for i, we simply execute i in isolation and observe what resources i acquires and releases, and also when those resources are acquired and released.

An instruction is using a resource r from the time it gets the resource (using get_r) to the time it releases the resource (using put_r). We can visualize this in terms of the transition system for the instruction:

P0 --α1--> ⋯ --αa--> ⋯ --αb--> ⋯ --> Done

where get_r ∈ Part(αa), put_r ∈ Part(αb), and the instruction is using resource r from the transition labeled αa through the transition labeled αb.

for each instruction i ∈ P s.t. PartSort(i) ∩ Part(R) ≠ ∅ do
    let → and P be the transition relation and start state of i
    InUse := ∅
    for cycle := 1 to length(i) do
        let P_next and α be the state and action s.t. P --α--> P_next
        P := P_next
        if get_r ∈ Part(α) and put_r ∈ Part(α) then
            ρ_i(cycle) := InUse ∪ {r}
        else if get_r ∈ Part(α) then
            ρ_i(cycle) := InUse ∪ {r}
            InUse := InUse ∪ {r}
        else if put_r ∈ Part(α) then
            ρ_i(cycle) := InUse
            InUse := InUse − {r}
        else
            ρ_i(cycle) := InUse
        end if
    end for
end for

Figure: Algorithm to calculate resource usage functions.

The figure above gives the algorithm for computing the resource usage functions. The algorithm works by examining, for each instruction, the transition graph of the agent that describes that instruction. The algorithm scans the instruction, recording the current set of resources that are being used in the variable InUse. When a get_r is encountered, r is added to InUse; when a put_r is encountered, r is removed from InUse.

The if-statement in the innermost for-loop has four cases:

1. If both a get_r and a put_r occur on the same cycle, then the resource r is needed for only one cycle, and the current resources being used are InUse ∪ {r}.

2. A get_r with no put_r signifies that the instruction begins using r for more than one cycle, so the current resources used are again InUse ∪ {r}; InUse is updated to include r.

3. A put_r signifies that the instruction is finished using r after the current cycle. The resource r is removed from InUse.

4. If no get_r or put_r is encountered on the current cycle, then the set of resources currently being used is InUse.
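The four cases translate into a single scan over an instruction's action trace. Below is a sketch under an assumed encoding of our own (each action is given as its set of particle names, e.g. "get_adder"); as in the algorithm, a resource released this cycle still appears in this cycle's set.

```python
def resource_usage(actions, resources):
    """Compute rho: cycle -> frozenset of resources in use.

    actions:   list of particle-name sets, one per cycle of execution
    resources: set of resource names r, with particles
               'get_' + r and 'put_' + r
    """
    rho, in_use = {}, set()
    for cycle, parts in enumerate(actions, start=1):
        gets = {r for r in resources if "get_" + r in parts}
        puts = {r for r in resources if "put_" + r in parts}
        # Cases 1, 2, 4: anything held or acquired this cycle is in use.
        rho[cycle] = frozenset(in_use | gets)
        # Case 2: acquisitions without a same-cycle release persist.
        in_use |= gets - puts
        # Case 3: releases end use after the current cycle.
        in_use -= puts
    return rho
```

For a divide-like trace that touches the adder for one cycle, holds the divider for three cycles, and touches the adder once more, the scan recovers exactly that shape.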

Theorem: The algorithm in the figure above correctly computes the resource usage functions of the architecture.

Proof: Consider an arbitrary instruction i ∈ P in normal form, a resource r ∈ R, and the transitions of i:

P0 --α1--> P1 --α2--> ⋯ --αn--> Done

Consider an αm ∈ {α1, ..., αn}. We need to show that r ∈ ρ_i(m) if and only if

1. get_r ∈ Part(αm), or

2. put_r ∈ Part(αm), or

3. there exist αa, αb ∈ {α1, ..., αn} with a < m < b such that get_r ∈ Part(αa), put_r ∈ Part(αb), and put_r ∉ Part(αc) for every c with a < c < b.

The first two cases are immediate from the first three conditions of the if-statement in the algorithm. The third follows because the transition --αa--> updates InUse to include r, and r is not removed from InUse until the transition --αb-->; consequently, from the final else clause, r ∈ ρ_i(m). □

Theorem: The time complexity of the algorithm in the figure above is O(mn), where m is the number of instructions and n is the length of the longest instruction.

Proof: Clearly the outer for-loop executes at most m times, the number of instructions on the machine. The inner for-loop executes at most n times, where n is the length of the longest instruction. Hence the statement in the inner for-loop executes mn times, which is O(mn). □

The size of R, |R|, will typically be small; this is true, for example, of both the MIPS R4000 and the Motorola 88000.

An Example

Consider the floating-point divide instruction described earlier, which we repeat here:

FDIV(PC) ≝ (getm(PC)(Fdiv FRi, FRj, FRk) · lockfreg_i · getfr_j(x) · getfr_k(y)) :
           (get_adder · put_adder) :
           (get_divider) : ⋯ : (put_divider) :
           (get_adder · put_adder) :
           (putfr_i(x/y) · releasefreg_i) : Done

The algorithm computes the following resource usage function for the floating-point divide instruction Fdiv described earlier:

ρ_Fdiv(t) =
    {adder}      on its first adder cycle,
    {divider}    on the divide cycles that follow,
    {adder}      on its second adder cycle,
    ∅            otherwise.

Summary

Using the labeled transition systems generated by the operational semantics of SCCS, we have shown how to:

1. find instruction latencies,

2. determine illegal instruction sequences,

3. determine instructions that can be issued in parallel, and

4. derive the resource usage functions for each instruction.

Chapter

Conclusions and Future Research

This thesis has addressed the formal specification of two programmer views of a processor. The architecture level describes the processor so that the programmer can write correct programs; the timing level describes the processor so that the programmer can write efficient programs.

At the highest level of abstraction the architecture level a pro cessor can b e viewed as a

set of instructions where each instruction is represented as a function from pro cessor state

to pro cessor state One of the contributions of this thesis was to present a way in which

this functional view can b e represented cleanly using the functional language SML Other

functional languages would have suced such as Haskell or Miranda but we chose SML

as it has the most advanced data abstraction facilities of any functional language We

essentially created an abstract data type ADT of common architectural op erations and

implemented the instructions with this ADT
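The state-to-state view can be sketched as follows. This is a Python rendering for illustration only (the thesis's actual ADT is in SML), and the register-file/PC state layout and instruction names are assumptions of the sketch:

```python
# A minimal sketch of the architecture-level view: each instruction is
# a pure function from processor state to processor state.  The state
# layout is illustrative, not the thesis's SML ADT.

from dataclasses import dataclass, replace
from typing import Callable, Tuple

@dataclass(frozen=True)
class State:
    regs: Tuple[int, ...]   # register file
    pc: int                 # program counter

Instr = Callable[[State], State]

def add(rd, rs, rt) -> Instr:
    """rd := rs + rt, then advance the program counter."""
    def step(s: State) -> State:
        regs = list(s.regs)
        regs[rd] = s.regs[rs] + s.regs[rt]
        return State(tuple(regs), s.pc + 1)
    return step

def nop() -> Instr:
    return lambda s: replace(s, pc=s.pc + 1)

# Running a program is just function composition in sequence:
def run(program, s: State) -> State:
    for instr in program:
        s = instr(s)
    return s

s0 = State(regs=(0, 3, 4, 0), pc=0)
s1 = run([add(0, 1, 2), nop()], s0)
print(s1.regs[0], s1.pc)   # -> 7 2
```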

Programmer's Timing View

Programmers often need more information than the architecture provides. The architecture is minimal in that it only includes the information needed to write correct programs. However, programmers also want to write efficient programs; hence we have the programmer's timing view of instructions. Timing information includes instruction latencies, resource requirements, and multiple instruction issue capabilities.

It is natural to ask whether the functional technique used to specify the architecture could be extended and applied to the timing view. It turns out that specifying the temporal and parallel behavior of systems with functional methods is difficult. Our criteria for choosing another method were that the formalism be able to specify time and parallelism explicitly, and we chose the synchronous process algebra SCCS.

The second contribution of this thesis, then, is to specify as abstractly as possible the timing behavior of a typical RISC architecture. We also showed how to specify superscalar versions of our RISC in which more than one instruction can be issued simultaneously.

Since SCCS has a formally defined operational semantics, our specification is executable. The abstract machine of the operational semantics, a labeled transition system (a type of state machine), specifically describes the behavior of systems specified in SCCS. We have implemented our specification on the Concurrency Workbench, a tool for specifying and analyzing systems written in SCCS or CCS.

Instruction Scheduling

Once a timing view of a processor has been specified, we showed what information is required to do instruction scheduling, essentially parameterizing the instruction scheduling problem. Using the labeled transition system of the specification, we then showed how to derive the instruction scheduling parameters from the specification, consequently yielding an instruction scheduler for the processor.
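As a sketch of how such derived parameters drive a scheduler, the following greedy pass inserts nops until source operands are ready. The latency table, the instruction triple format, and the single-issue model are illustrative assumptions, not the scheduler derived in the thesis:

```python
# Sketch: a latency table (as would be derived from the labeled
# transition system) parameterizes a simple greedy scheduler that
# stalls with nops when a value is used before its producer's latency
# has elapsed.  Dependences are tracked by register name only.

LATENCY = {"load": 2, "add": 1}   # cycles before a result is usable (assumed)

def schedule(program):
    """program: list of (op, dest, sources) triples."""
    out, ready = [], {}           # ready[r] = cycle at which r becomes valid
    cycle = 0
    for op, dest, srcs in program:
        # stall (emit nops) until all source registers are ready
        while any(ready.get(r, 0) > cycle for r in srcs):
            out.append(("nop", None, ()))
            cycle += 1
        out.append((op, dest, srcs))
        cycle += 1
        if dest is not None:
            ready[dest] = cycle - 1 + LATENCY[op]
    return out

prog = [("load", "r1", ("r2",)), ("add", "r3", ("r1", "r1"))]
print(schedule(prog))   # a nop is inserted into the load delay slot
```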

Future Research

The features that we have included in our example processor were those related to instruction scheduling. It is natural to ask whether SCCS is capable of specifying other features, such as interrupts and more sophisticated memory hierarchies such as instruction and data caches.

Interrupts

One aspect of instruction processors that we did not address is interrupts, traps, and exceptions. Any complete processor specification, however, should include them. Fortunately, SCCS, and process algebras in general, are capable of specifying interrupts. For example, Milner discusses an interrupt operator with which a process F may interrupt a process E. Once the interrupt operator is included, it is also possible to specify a Restart operator, allowing processes to be restarted after they have been interrupted. The synchronous process language Esterel also includes operators useful for specifying interrupts, such as its watchdog and trap statements. For example, the statement

do S watching I

repeatedly executes the statement S until the signal I occurs. Eventually we should address specifying processor interrupts; though this may not be useful information for compilers, users of the processor may need to know how interrupts are handled.
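The watchdog idea can be illustrated with a rough simulation: run one step of S per tick until I occurs, then abort. This is only an informal rendering of the intuition, not Esterel's actual preemption semantics (which distinguishes strong and weak abort):

```python
# A toy rendering of "do S watching I": execute one tick of S per
# instant until the signal I is present, then abort S.

def do_watching(step, signal_trace):
    """step: callable performing one tick of S.
    signal_trace: iterable of booleans, True when I is present."""
    ticks = 0
    for present in signal_trace:
        if present:      # watchdog fires: S is preempted
            break
        step()
        ticks += 1
    return ticks

count = []
ran = do_watching(lambda: count.append(1), [False, False, True, False])
print(ran)   # S ran for 2 ticks before I occurred
```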

Cache Model

At present the memory hierarchy we use is only a two-level model of registers and main memory. Though it would complicate our specification of memory, we could add a cache to our specification in a transparent way, without altering the specification of individual instructions, since in a RISC model only two instructions access memory: Load and Store.

Compiling for efficient use of caches is an active area of research. Typically, program profilers generate information that describes the program's memory access patterns, and the compiler rearranges procedures to make efficient use of the instruction cache. Compilers can also rearrange instructions that access data so that the data cache is used more efficiently.

Once a cache model has been included in the specification, it may be possible to use this specification to help generate this optimization phase of the compiler.

Other Classes of Processors

In this dissertation we have concentrated on RISC-style architectures. Certainly it would be possible to specify other classes of architecture, such as CISCs or VLIWs (Very Long Instruction Word). In a VLIW the architecture and the organization are inseparable. Specifying a VLIW would essentially require us to specify the individual resources and the datapaths of the processor. Since it is up to the programmer to organize the instructions so that resource uses do not conflict, a VLIW should in principle be easier to specify than a RISC, where we needed to provide interlocks. A specification of a VLIW would require altering the algorithms to reflect the volatile nature of the resources.

Verification and Synthesis

One important property of SCCS that we have not used is the ability to prove systems equivalent. In SCCS it is possible to specify a high-level system and to verify that an implementation of the system meets its specification. To do this would require that we specify a processor at a lower, implementation level.

Verification

For example, the obvious choice is to specify the organization of the processor: the pipeline stages, forwarding paths, etc. Doing this may be tedious but not difficult. The problem is state explosion. That is, the number of states in any processor is tremendous, and since the notion of equivalence in SCCS relates states in the implementation with states in the specification, any complete verification would require that all of the states be checked.

There are a couple of techniques that we could use to handle the state explosion problem. The first technique exploits the fact that parts of the processor are regular; that is, they have a repetitive or inductive structure. For example, a 32-bit ALU may be constructed from thirty-two 1-bit ALUs (for example, see Patterson and Hennessy). So a specification of a 32-bit ALU is almost identical to the specification of a 1-bit ALU. As Milner points out, systems with inductive structure have proofs that are amenable to induction.

One of the principal reasons a processor specification has so many states is that the number of states in a specification increases exponentially as the size of the memory elements increases. For example, a 32-bit register has 2^32 different states, and each bit increase in word size doubles the number of possible states. There has been some research using abstraction techniques, where an abstraction function is used to merge states. For example, if we want to verify the control portion of a processor, then we can ignore the values in registers and memory. Using this abstraction, all possible states of a register are collapsed into one state. In this case the abstraction function is congruence modulo one, which creates an equivalence class for each register on the processor. Now each register R_i is the representative for all of the possible states of register R_i. If a processor has n 32-bit registers, the number of possible states is reduced from 2^(32n) to just n, a dramatic decrease. The majority of the remaining states would come from the control portion of the hardware, known as the control unit. The control unit likely includes other state registers, which we could not abstract in the previous manner, as they are integral parts of the control unit.
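The arithmetic behind this reduction can be made concrete. Assuming 32-bit registers as in the example above:

```python
# State-count arithmetic for the abstraction argument: n registers of
# WORD bits contribute (2**WORD)**n = 2**(WORD*n) concrete states, but
# collapse to one representative state per register under the
# congruence-modulo-one abstraction.

WORD = 32   # word size assumed from the 32-bit example in the text

def concrete_states(n_regs):
    return (2 ** WORD) ** n_regs     # = 2**(32*n) for WORD = 32

def abstract_states(n_regs):
    return n_regs                    # one representative per register

print(concrete_states(2))   # 2**64 states for just two registers
print(abstract_states(2))   # 2 states after abstraction
```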

Synthesis

We should also note that it may be possible to synthesize hardware from a specification, since a labeled transition system (at least in the finite case) is really just a state machine, and hardware synthesis is often performed from state machines. In fact, there are CAD packages that will synthesize hardware from descriptions of state machines.

Formal Methods

We should also stress that formal specifications are valuable in their own right and are a starting point for a variety of applications. One benefit is that they require the designer to thoughtfully plan the design in a structured manner. As Noam Chomsky points out,

    a formalized theory may automatically provide solutions for many problems other than those for which it was explicitly designed. Obscure and intuition-bound notions can neither lead to absurd conclusions nor provide new and correct ones, and hence they fail to be useful in two important respects.

Starting from a mature and well-defined formal approach made available a large amount of mathematical machinery. Predefined are notions of system equivalence (used for verification), labeled transition systems (used in the semantics, for simulation, and just about everything else), modal and temporal logic, sorts, etc. Often, whenever we needed to do something with our specification, existing results were there to help.

Appendix A

Example circuits in the Concurrency Workbench

In this appendix we show the Workbench implementations of some of the example circuits from earlier chapters.

A few notes about the SCCS syntax and the CWB SCCS syntax: the Concurrency Workbench implements the basic SCCS calculus. Consequently, there is no generalized summation (Σ) in the Workbench. In order to keep the description tangible, I have given only a timing specification and have excised all values (register and memory values).

There are also some notational differences:

- Action product is specified with an explicit operator instead of juxtaposition.
- Action complement is written with a single quote character ('out) instead of an overbar.
- Parallel composition is written A|B.
- Process definition, A =def P, is handled by the bi command, which binds an identifier to a process description.

A Simple Logic Gates

Or gate is described by

Ork aibjoutkOri or j

bi Or aboutOr aboutOr

aboutOr aboutOr

bi Or aboutOr aboutOr

aboutOr aboutOr

bi Or Or Or

An exclusiveor gate is similar

bi Xor aboutXor aboutXor

aboutXor aboutXor

bi Xor aboutXor aboutXor

aboutXor aboutXor

bi Xor Xor Xor

And gate is described by

Andk aibjoutkAndi and j

bi And aboutAnd aboutAnd

aboutAnd aboutAnd

bi And aboutAnd aboutAnd

aboutAnd aboutAnd

bi And And And

A Not Gates

A Not gate

Notj aioutjNotnot i

bi Not aoutNot aoutNot

bi Not aoutNot aoutNot

bi Not Not Not

bi NotNot Notalphaoutalphaout

Notalphaaalphaaaaoutout

A HalfAdder

A Specification

Binary Halfadder specification is described by

HASpeccs

aibjcarrycsumsHASpeci and j i or j

NOTE There is no HASpec as it is an unreachable state

bi HASpec abcarrysumHASpec

abcarrysumHASpec

abcarrysumHASpec

abcarrysumHASpec

bi HASpec abcarrysumHASpec

abcarrysumHASpec

abcarrysumHASpec

abcarrysumHASpec

bi HASpec abcarrysumHASpec

abcarrysumHASpec

abcarrysumHASpec

abcarrysumHASpec

bi HASpec HASpec HASpec HASpec

A Implementation

Binary Halfadder implementation is implemented with an

XORgate properly connected to an ANDgate

HAImplcs Andcf Xorsgh

where f carryout g sumout h ininin

workbench says cannot have relabel of form ajk

but Milner's paper says you can

Workbench implements Act Part where

Act Act would be nice

A big set of all possible actions

basi All abcarrysum abcarrysum

abcarrysum abcarrysum

abcarrysum abcarrysum

abcarrysum abcarrysum

abcarrysum abcarrysum

abcarrysum abcarrysum

abcarrysum abcarrysum

abcarrysum abcarrysum

bi HAImpl HAImpl HAImpl HAImplAll

bi HAImpl Andcarryout carryout

Xorsumout sumout

bi HAImpl Andcarryout carryout

Xorsumout sumout

bi HAImpl Andcarryout carryout

Xorsumout sumout

bi HAImpl Andcarryout carryout

Xorsumout sumoutAll
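As a quick sanity check of the half-adder relation that the specification above encodes (carry = a AND b, sum = a XOR b), one can verify the defining identity a + b = 2*carry + sum over the full truth table. This Python sketch is independent of the Workbench encoding:

```python
# Truth-table check of the half-adder relation used by the
# specification: carry = a AND b, sum = a XOR b, and together they
# satisfy the arithmetic identity a + b = 2*carry + sum.

def half_adder(a, b):
    return a & b, a ^ b    # (carry, sum)

for a in (0, 1):
    for b in (0, 1):
        carry, s = half_adder(a, b)
        assert a + b == 2 * carry + s   # the defining identity
        print(a, b, "->", carry, s)
```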

A Wires

Unit delay wire

Wirej inioutjWirei

Two delay wire Specification

Wireij inaoutjWireai

bi Wire inoutWire inoutWire

bi Wire inoutWire inoutWire

bi Wire Wire Wire

bi Wire inoutWire inoutWire

bi Wire inoutWire inoutWire

bi Wire inoutWire inoutWire

bi Wire inoutWire inoutWire

bi Wire Wire Wire Wire Wire

Implement a two unit delay wire with two pieces of

unit delay wire

WireImpl Wiref Wireginiouti

where f alphaiouti and g alphaiini

bi WireImpl

Wirealphaoutalphaout

Wirealphainalphainininoutout

Notice that Wire WireImpl NotNot
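The equivalence noted in the comment above, a two-unit-delay wire versus two composed unit-delay wires, can be checked informally by simulating both on finite input streams. The list-based stream model and the initial wire contents of 0 are assumptions of this sketch, not the Workbench's bisimulation check:

```python
# Simulate the observation above: composing two unit-delay wires
# behaves like one two-unit-delay wire (outputs are the inputs shifted
# by two ticks, with assumed initial contents of 0).

def unit_delay(stream, init=0):
    out, held = [], init
    for x in stream:
        out.append(held)   # emit last tick's value
        held = x           # latch this tick's input
    return out

def two_delay(stream):
    return unit_delay(unit_delay(stream))

ins = [1, 0, 1, 1, 0]
print(two_delay(ins))   # -> [0, 0, 1, 0, 1]
```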

A FullAdder

A Specification

Full Adder specification

FASpeccs aibjCinkcarrycsums

FASpecif ijk then else

if ijk or then else

bi FASpec FASpec FASpec FASpec FASpec

bi FASpec abcincarrysumFASpec

abcincarrysumFASpec

abcincarrysumFASpec

abcincarrysumFASpec

abcincarrysumFASpec

abcincarrysumFASpec

abcincarrysumFASpec

abcincarrysumFASpec

bi FASpec abcincarrysumFASpec

abcincarrysumFASpec

abcincarrysumFASpec

abcincarrysumFASpec

abcincarrysumFASpec

abcincarrysumFASpec

abcincarrysumFASpec

abcincarrysumFASpec

bi FASpec abcincarrysumFASpec

abcincarrysumFASpec

abcincarrysumFASpec

abcincarrysumFASpec

abcincarrysumFASpec

abcincarrysumFASpec

abcincarrysumFASpec

abcincarrysumFASpec

bi FASpec abcincarrysumFASpec

abcincarrysumFASpec

abcincarrysumFASpec

abcincarrysumFASpec

abcincarrysumFASpec

abcincarrysumFASpec

abcincarrysumFASpec

abcincarrysumFASpec

A Implementation

Full adder implementation

FAImplcs HAf HAg Orhabccarrysum

f cinb tempsuma carrycarry

g tempsumsum carrycarry

h carrya carryb carryout

bi HA HASpec

bpsi S a a b b carry carry sum sum cin cin

bi HA HAcinbcinbtempsumatempsuma

carrycarrycarrycarry

bi HA HAtempsumsumtempsumsumcarrycarry

carrycarry

bi Or Orcarryacarryacarrybcarryb

carryoutcarryout

bi FAImpl HA HA OrS

A FlipFlop

This example of implementing a flip flop with two nor

gates is borrowed from

R Milner Calculi for Synchrony and Asynchrony

Theoretical Computer Science

A Implementation

Nor gate is described by

Nork aibioutkNori nor j

where in SML

infix nor; fun i nor j = not (i orelse j)

bi Nor abgNor abgNor

abgNor abgNor

bi Nor abgNor abgNor

abgNor abgNor

Make a flip flop out of two properly connected

nor gates

FFmn Normf Norng

where f siai giaiouti where i in

g ribi bidiouti

bpsi S s s r r g g d d

bi FF Norsasagaggag

NorrbrbbdgbdgS

bi FF Norsasagaggag

NorrbrbbdgbdgS

bi FF Norsasagaggag

NorrbrbbdgbdgS

bi FF Norsasagaggag

NorrbrbbdgbdgS

bi FF FF FF FF FF

A Specification

A flip flop specification

FFmn sirjgmdnFFi nor n m nor j

Which unfolds to the following

bi SpecFF srgdSpecFF srgdSpecFF

srgdSpecFF srgdSpecFF

bi SpecFF srgdSpecFF srgdSpecFF

srgdSpecFF srgdSpecFF

bi SpecFF srgdSpecFF srgdSpecFF

srgdSpecFF srgdSpecFF

bi SpecFF srgdSpecFF srgdSpecFF

srgdSpecFF srgdSpecFF

bi SpecFF SpecFF SpecFF SpecFF SpecFF

Appendix B

Our example RISC in the

Concurrency Workbench

In this appendix we list how the SCCS specification of our RISC is implemented using the

Concurrency Workbench

B The CWB Listing

Some auxiliary definitions

bi DONE DONE

Definition of a memory cell There can be multiple

readers of the register There is no problem with having more

they're just not defined Only one writer is allowed and it may

be simultaneous with a read

bi Reg putrReg

getrReg

getrReg

getrReg

getrReg

putrgetrReg

getrputrReg

getrputrReg

getrputrReg

Reg

bi Reg Reg lockregLockedReg

bi LockedReg IllegalAccess

putrreleaseregReg

LockedReg

Have to enumerate all of the possible illegal accesses

bi IllegalAccess getr

getr

getr

getr

getrputr

getrputr

getrputr

getrputr

putr

getrputrreleasereg

getrputrreleasereg

getrputrreleasereg

getrputrreleasereg

Do the same thing for another register Reg as in Reg

Should have used relabeling but when running the Workbench the output

is easier to look at if there is not a lot of relabeling

bi Reg putrReg

getrReg

getrReg

getrReg

getrReg

getrputrReg

getrputrReg

getrputrReg

getrputrReg

Reg

bi Reg Reg lockregLockedReg

bi LockedReg IllegalAccess putrreleaseregReg LockedReg

bi IllegalAccess getr

getr

getr

getr

getrputr

getrputr

getrputr

getrputr

putr

getrputrreleasereg

getrputrreleasereg

getrputrreleasereg

getrputrreleasereg

Define memory cells Easier than registers because no interlocks

bi Mem putmMem

getmMem

getmMem

putmgetmMem

getmputmMem

Mem

bi Mem putmMem

getmMem

getmMem

putmgetmMem

getmputmMem

Mem

Defining two memory cells and registers each

bi Registers Reg Reg

bi Memory Mem Mem

Instruction definitions

There are possible add instructions and the nop instruction

Three operands with Reg or Reg possible for each operand

The syntax addxyz represent the add instruction with destination

register x and source registers y and z

bi ALU addgetrgetrputrDONE

addgetrgetrputrDONE

addgetrgetrputrDONE

addgetrgetrputrDONE

addgetrgetrputrDONE

addgetrgetrputrDONE

addgetrgetrputrDONE

addgetrgetrputrDONE

nopDONE

There are load instructions and store instructions

but to save states we are actually only going to enumerate the

ones where the base registers and the register being loaded or

stored are distinct That is you really don't usually want

to do a Load R R

bi LoadStore Load Store

insnijk where i dest reg j base reg k memory locn

bi Load loadgetrgetmlockregputrreleaseregDONE

loadgetrgetmlockregputrreleaseregDONE

loadgetrgetmlockregputrreleaseregDONE

loadgetrgetmlockregputrreleaseregDONE

bi Store storegetrgetrputmDONE

storegetrgetrputmDONE

storegetrgetrputmDONE

storegetrgetrputmDONE

The branch instruction has two outcomes succeed or fail

Also if the branch succeeds we can't have another branch

instruction

bi Branch Succede Fail

bi Succede bzgetrALU LoadStore IPL bz

bzgetrALU LoadStore IPL bz

bi Fail bzgetrIPL bzgetrIPL

Define a particle set of all possible instructions

bpsi Insns add add add add add add add add

load load load load

store store store store

bz nop

bpsi AluInsns add add add add add add add add nop

bpsi FPUInsns fadd fadd fadd fadd

fadd fadd fadd fadd

Definition of dual ALU instruction issue Need appropriate

particle sets See dissertation chapter on Superscalar

bpsi Putr putr getr getr add add add add add

add add add nop

bpsi NotGetr putr getr add add add add add

add add add nop

bpsi Putr putr getr getr add add add add add

add add add nop

bpsi NotGetr putr getr add add add add add add

add add nop

bi TwoAlus ALUPutr ALUNotGetr ALUPutr ALUNotGetr

Floatingpoint registers

bi Freg lockfregLockedFreg

getfrlockfregLockedFreg

getfrlockfregLockedFreg

getfrFreg

getfrFreg

Freg

bi LockedFreg putfrreleasefregFreg

LockedFreg

bi Freg lockfregLockedFreg

getfrlockfregLockedFreg

getfrlockfregLockedFreg

getfrFreg

getfrFreg

Freg

bi LockedFreg putfrreleasefregFreg

LockedFreg

bi FPRegisters Freg Freg

Floatingpoint unit resource declarations

There is an adder multiplier and a divider All of which are

accessed exclusively through semaphores

bi Resource getresourceLockedResource

getresourcereleaseresourceResource

Resource

bi LockedResource releaseresourceResource

LockedResource

bi FPU Resourcegetmultipliergetresource

releasemultiplierreleaseresource

Resourcegetaddergetresource

releaseadderreleaseresource

Resourcegetdividergetresource

releasedividerreleaseresource

Floating-point instructions Fdiv and Fmul not yet implemented

Yuck Enumerate all of the eight possible Fadd instructions

bi Float faddlockfreggetfrgetfr

getadderreleaseadder

putfrreleasefregDONE

faddlockfreggetfrgetfr

getadderreleaseadder

putfrreleasefregDONE

faddlockfreggetfrgetfr

getadderreleaseadder

putfrreleasefregDONE

faddlockfreggetfrgetfr

getadderreleaseadder

putfrreleasefregDONE

faddlockfreggetfrgetfr

getadderreleaseadder

putfrreleasefregDONE

faddlockfreggetfrgetfr

getadderreleaseadder

putfrreleasefregDONE

faddlockfreggetfrgetfr

getadderreleaseadder

putfrreleasefregDONE

faddlockfreggetfrgetfr

getadderreleaseadder

putfrreleasefregDONE

The top-level standard normal form agent

bi Instr TwoAlus ALU LoadStore Float Instr Branch

bi CPU Registers FPRegisters FPU Memory InstrInsns

Bibliography

Alfred V. Aho, Mahadevan Ganapathi, and Steven W. K. Tjiang. Code generation using tree matching and dynamic programming. ACM Transactions on Programming Languages and Systems, October.

Mitch Alsup. Motorola's 88000 family architecture. IEEE Micro, June.

James H. Aylor, Ronald Waxman, and Charles Scarratt. VHDL: feature description and analysis. IEEE Design and Test of Computers, April.

J. C. M. Baeten, editor. Applications of Process Algebra. Cambridge University Press.

Mario R. Barbacci. Instruction set processor specifications (ISPS): The notation and its applications. IEEE Transactions on Computers, January.

Mario R. Barbacci et al. Ada as a hardware description language: An initial report. In Computer Hardware Description Languages and their Applications.

David Bernstein. Global instruction scheduling for superscalar machines. In ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, June.

Gerard Berry and Georges Gonthier. The Esterel synchronous programming language: Design, semantics and implementation. The Science of Computer Programming.

G. Birtwistle and P. A. Subrahmanyam, editors. Current Trends in Hardware Verification and Automated Theorem Proving. Springer-Verlag.

D. Borrione, editor. Proceedings of the IFIP WG Working Conference on From HDL Descriptions to Provably Correct Circuit Designs. Springer-Verlag.

Richard Boulton, Andrew Gordon, Mike Gordon, John Harrison, John Herbert, and John Van Tassel. Experience with embedding hardware description languages with HOL. In V. Stavridou, T. F. Melham, and R. T. Boute, editors, Theorem Provers in Circuit Design. IFIP, North-Holland.

Jonathan Bowen. Formal specification of the ProCos/safemos instruction set. Microprocessors and Microsystems, December.

Jonathan Bowen. Formal specification and documentation of microprocessor instruction sets. Microprocessing and Microprogramming.

David Bradlee, Susan Eggers, and Robert Henry. Integrating register allocation and instruction scheduling for RISCs. In Fourth International Conference on Architectural Support for Programming Languages and Operating Systems. ACM and IEEE Computer Society, April.

David Bradlee, Robert Henry, and Susan Eggers. The Marion system for retargetable instruction scheduling. In ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, June.

W. Brauer, W. Reisig, and G. Rozenberg, editors. Petri Nets: Applications and Relationships to Other Models of Concurrency. Springer-Verlag.

E. Brinksma. Information Processing Systems, Open Systems Interconnection, LOTOS: A Formal Description Technique based upon the Temporal Ordering of Observational Behavior. Draft International Standard ISO.

Bishop C. Brock, Warren A. Hunt, and William D. Young. Introduction to a formally defined hardware description language. In Theorem Provers in Circuit Design.

G. M. Brown. Towards truly delay-insensitive circuit realisations of process algebras. In G. Jones and M. Sheeran, editors, Designing Correct Circuits. Springer-Verlag.

Juanito Camilleri and Glynn Winskel. CCS with priority choice. In LICS: IEEE Symposium on Logic in Computer Science.

C. Charlton, D. Jackson, and P. Leng. A functional model of clocked microarchitectures. In MICRO: Proceedings of the Annual International Symposium on Microarchitecture.

Chi-Hung Chi and Hank Dietz. Unified management of registers and cache using liveness and cache bypass. In ACM Conference on Programming Language Design and Implementation.

Noam Chomsky. Syntactic Structures. Mouton.

Edmund M. Clarke, Orna Grumberg, and David E. Long. Model checking and abstraction. In POPL: Proceedings of the Annual Symposium on Principles of Programming Languages.

Rance Cleaveland, Joachim Parrow, and Bernhard Steffen. The Concurrency Workbench: A semantics-based tool for the verification of concurrent systems. ACM Transactions on Programming Languages and Systems, January.

Avra Cohn. Correctness properties of the Viper block model. In G. Birtwistle and P. A. Subrahmanyam, editors, Current Trends in Hardware Verification and Automated Theorem Proving. Springer-Verlag.

Todd Cook, Paul Franzon, Ed Harcourt, and Thomas Miller. Behavioral modeling of processors from instruction set specifications. In Proceedings of the International Verilog HDL Conference.

Todd Cook, Ed Harcourt, Thomas Miller, and Paul Franzon. LISAS: A Language for Instruction Set Architecture Specification. In First Annual Conference on Hardware/Software Codesign.

Todd A. Cook. Instruction Set Architecture Specification. PhD thesis, North Carolina State University, Raleigh, NC, Department of Electrical and Computer Engineering.

Todd A. Cook, Paul D. Franzon, Ed A. Harcourt, and Thomas K. Miller. System-level specification of instruction sets. In ICCD: Proceedings of the International Conference on Computer Design.

Todd A. Cook and Ed Harcourt. A functional specification language for instruction set architectures. To appear in ICCL: Proceedings of the International Conference on Computer Languages.

Jack W. Davidson. Code selection through object code optimization. ACM Transactions on Programming Languages and Systems, October.

Jack W. Davidson. A retargetable instruction reorganizer. In Proceedings of the SIGPLAN Symposium on Compiler Construction.

Bruce S. Davie. Formal Specification and Verification in VLSI Design. Edinburgh University Press.

John R. Ellis. Bulldog: A Compiler for VLIW Architectures. The MIT Press, Cambridge, MA.

Christopher Fraser and Alan Wendt. Automatic generation of fast optimizing code generators. In ACM SIGPLAN Conference on Programming Language Design and Implementation.

Christopher W. Fraser. A language for writing code generators. In ACM SIGPLAN Conference on Programming Language Design and Implementation.

M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company.

Alfons Geser. A specification of the Intel 8085 microprocessor: A case study. In Algebraic Methods: Theory, Tools and Applications. Springer-Verlag.

Sumit Ghosh. Using Ada as an HDL. IEEE Design and Test of Computers, February.

Robert Giegerich. Code selection by inversion of order-sorted derivors. Theoretical Computer Science.

Joseph A. Goguen. One, none, a hundred thousand specification languages. Technical Report, CSLI, Center for the Study of Language and Information.

Robert Goldblatt. Logics of Time and Computation. CSLI, Center for the Study of Language and Information.

Ganesh Gopalakrishnan. Specification and verification of pipelined hardware in HOP. In Computer Hardware Description Languages and their Applications. Springer-Verlag.

M. J. C. Gordon. The denotational semantics of sequential machines. Information Processing Letters.

M. J. C. Gordon. Register transfer systems and their behavior. In Computer Hardware Description Languages and Their Applications.

M. J. C. Gordon and T. F. Melham. Introduction to HOL: A theorem proving environment for higher order logic. Cambridge University Press.

Keith Hanna, Neil Daeche, and Gareth Howells. Implementation of the Veritas design logic. In Theorem Provers in Circuit Design.

Ed Harcourt, Jon Mauney, and Todd Cook. Specification of instruction-level parallelism. In Proceedings of NAPAW, the North American Process Algebra Workshop.

Ed Harcourt, Jon Mauney, and Todd Cook. Formal specification and simulation of instruction-level parallelism. In Proceedings of the European Design Automation Conference. IEEE Computer Society Press.

Ed Harcourt, Jon Mauney, and Todd Cook. Functional specification and simulation of instruction set architectures. In Proceedings of the International Conference on Simulation and Hardware Description Languages. SCS Press.

John M. J. Herbert. Incremental design and formal verification of microcoded microprocessors. In Theorem Provers in Circuit Design.

Tony Hoare. Communicating Sequential Processes. Prentice Hall.

Paul B. Jackson. Nuprl and its use in circuit design. In V. Stavridou, T. F. Melham, and R. T. Boute, editors, Theorem Provers in Circuit Design. IFIP, North-Holland.

Mike Johnson. Superscalar Microprocessor Design. Prentice Hall.

Gerry Kane and Joe Heinrich. MIPS RISC Architecture. Prentice Hall.

Monica Lam. Software pipelining: An effective scheduling technique for VLIW machines. In ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, June.

Monica Lam. A Systolic Array Optimizing Compiler. Kluwer Academic Press.

Eugene Lawler, Jan Lenstra, Charles Martel, Barbara Simons, and Larry Stockmeyer. Pipeline scheduling: A survey. Technical Report, Computer Science Research Report, IBM Research Division.

Peter Lee. Realistic Compiler Generation. MIT Press.

M. Leeser and G. Brown, editors. Hardware Specification, Verification and Synthesis: Mathematical Aspects. Springer-Verlag.

Roger Lipsett. VHDL: Hardware Description and Design. Kluwer Academic Press, Boston.

Neal Margulis. i860 Microprocessor Architecture. Intel/Osborne McGraw-Hill.

Thomas Melham. Higher Order Logic and Hardware Verification. Cambridge University Press.

Jean P. Mermet, editor. Fundamentals and Standards in Hardware Description Languages, NATO ASI Series. Kluwer Academic Press.

George Milne. CIRCAL and the representation of communication, concurrency, and time. ACM Transactions on Programming Languages and Systems, April.

Robin Milner. Calculi for synchrony and asynchrony. Theoretical Computer Science.

Robin Milner. Communication and Concurrency. Prentice Hall.

Robin Milner. Elements of interaction. Communications of the ACM, January.

Robin Milner, Joachim Parrow, and David Walker. A calculus of mobile processes, I and II. Information and Computation, September.

Faron Moller. The Edinburgh Concurrency Workbench. University of Edinburgh.

John D. Morison. ELLA: a language for the design of digital systems. In Jean P. Mermet, editor, Fundamentals and Standards in Hardware Description Languages, NATO ASI Series. Kluwer Academic Press.

Thomas Müller. Employing finite automata for resource scheduling. In MICRO: Proceedings of the Annual International Symposium on Microarchitecture.

Serafin Olcoz and Jose Manuel Colom. The discrete event simulation semantics of VHDL. In Proceedings of the International Conference on Simulation and Hardware Description Languages. SCS Press.

Jean-Luc Paillet. Functional semantics of microprocessors at the machine instruction level. In Computer Hardware Description Languages and Their Applications.

David A. Patterson and John L. Hennessy. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, San Mateo, CA.

David A. Patterson and John L. Hennessy. Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann, San Mateo, CA.

L. C. Paulson. ML for the Working Programmer. Cambridge University Press.

Gordon D. Plotkin. A structural approach to operational semantics. Technical Report DAIMI FN-19, University of Aarhus, Denmark.

Amir Pnueli. Linear time temporal logic. In the REX (Research and Education in Concurrent Systems) School/Workshop on Linear Time, Branching Time and Partial Order in Logics and Models for Concurrency, Noordwijkerhout, the Netherlands.

Todd Proebsting and Chris Fraser. Detecting pipeline structural hazards quickly. In POPL: Proceedings of the Annual Symposium on Principles of Programming Languages.

Todd A. Proebsting and Charles N. Fischer. Linear-time optimal code scheduling for delayed-load architectures. In ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, June.

Rami R. Razouk. The use of Petri Nets for modeling pipelined processors. In Proceedings of the ACM/IEEE Design Automation Conference.

David A. Schmidt. Denotational Semantics: A Methodology for Language Development. Allyn and Bacon.

Mary Sheeran. Categories for the working hardware designer. In Hardware Specification, Verification and Synthesis: Mathematical Aspects.

Michael D. Smith, Mike Johnson, and Mark A. Horowitz. Limits on instruction-level parallelism. In Second International Conference on Architectural Support for Programming Languages and Operating Systems. ACM and IEEE Computer Society.

J. M. Spivey. Understanding Z: A Specification Language and Its Formal Semantics. Cambridge University Press, Cambridge, UK.

Mandayam Srivas and Mark Bickford. Formal verification of a pipelined microprocessor. IEEE Software, September.

Richard Stallman. Using and Porting GNU C. Free Software Foundation.

V. Stavridou. Formal Methods in Circuit Design. Cambridge University Press.

V. Stavridou, T. F. Melham, and R. T. Boute, editors. Theorem Provers in Circuit Design. North-Holland.

Eliezer Sternheim, Rajvir Singh, and Yatin Trivedi. Digital Design with Verilog HDL. Automata Publishing Co., Cupertino, CA.

Colin Stirling. An introduction to modal and temporal logics for CCS. In Lecture Notes in Computer Science. Springer-Verlag.

Colin Stirling. Modal and temporal logics. In Samson Abramsky, Dov Gabbay, and T. S. E. Maibaum, editors, Handbook of Logic in Computer Science. Oxford, Clarendon Press.

Chris Tofts. Describing social insect behavior using process algebra. Transactions of the Society for Computer Simulation, December.

C. J. Tseng et al. A versatile finite state machine synthesizer. In International Conference on Computer Aided Design.

David W. Wall. Limits of instruction-level parallelism. In Fourth International Conference on Architectural Support for Programming Languages and Operating Systems. ACM and IEEE Computer Society, April.

Glynn Winskel. The Formal Semantics of Programming Languages. MIT Press.