CMU-CS-85-180

A Data-Driven Multiprocessor for Switch-Level Simulation of VLSI Circuits

Edward Harrison Frank

November, 1985

DEPARTMENT of COMPUTER SCIENCE

Carnegie-Mellon University


Carnegie-Mellon University Department of Computer Science Pittsburgh, PA 15213

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science at Carnegie-Mellon University.

Copyright © 1985 Edward H. Frank

This research was sponsored by the Defense Advanced Research Projects Agency (DOD), ARPA Order No. 3597, monitored by the Air Force Avionics Laboratory Under Contract F33615-81-K-1539, and by the Fannie and John Hertz Foundation.

The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency, the US Government, or the Hertz Foundation.

Abstract

In this dissertation I describe the algorithms, architecture, and performance of a computer called the Fast-1, a special-purpose machine for switch-level simulation of VLSI circuits. The Fast-1 does not implement a previously existing simulation algorithm. Rather, its simulation algorithm and its architecture were developed together. The Fast-1 is data-driven, which means that the flow of data determines which instructions to execute next. Data-driven execution has several important attributes: it implements event-driven simulation in a natural way, and it makes parallelism easier to exploit.

Although the architecture described in this dissertation has yet to be implemented in hardware, it has itself been simulated using a 'software implementation' that allows performance to be measured in terms of read-modify-write memory cycles. The software-implemented Fast-1 runs at speeds comparable to other software-implemented switch-level simulators. Thus it was possible to collect an extensive set of experimental performance results of the Fast-1 simulating actual circuits, including some with over twenty thousand transistors. These measurements indicate that a hardware-implemented, uniprocessor Fast-1 offers several orders of magnitude speedup over software-implemented simulators running on conventional computers built using similar technology.

Additional speedup over a uniprocessor can be obtained using a Fast-1 multiprocessor, that is, one constructed from multiple Fast-1 uniprocessors interconnected by one or more broadcast busses. In order for a Fast-1 multiprocessor to exploit the parallelism available in simulations, the Fast-1 representation of circuits must be carefully partitioned onto the processors. Although even simple versions of the partitioning problem are NP-complete, I show that an additional order of magnitude speedup can be obtained by using a multiprocessor Fast-1 and fast heuristic partitioning algorithms.


Acknowledgements

The completion of this work owes much to many people. Throughout my years at CMU, my advisor, Bob Sproull, has provided guidance, friendship, understanding, and good ideas as needed. My dear friend and officemate, Carl Ebeling, has spent many hours listening to me instead of doing his own work. Both Dr. Bob's and Carl's contributions to this research, and to my stay at CMU, are immeasurable. My thesis committee, Randal Bryant, Al Davis, Marc Raibert, Alfred Spector, and Robert Sproull, provided the proper amount of help and criticism, at the proper times. I thank them all for carefully reading this dissertation in finite time.

The CMU VLSI project, originally directed by Sproull, and now being run by H.T. Kung, provided the overall context in which this work was conducted. Many project members, in particular Allan Fisher and Hank Walker, have offered many good ideas and been good listeners.

Several other people have given aid at important times: Rob Mathews of Silicon Solutions Corp. graciously allowed me to simulate the SSC Filter chip, and Dan Perkins, also of SSC, spent several hours working with me in order to get their test vectors to work with my simulator. Marco Annaratone provided the CMOS adder circuit, and Thomas Anantharaman provided the multiplier circuit. Many of the other circuits were given to me by Carl Ebeling. Ivan Sutherland provided many useful comments on an early draft of the thesis.

Though the CMU Computer Science Department has grown and changed since I first came here many years ago, it is still a wonderful place with great resources, both human and computational. As with most people who come from the West, I was pleasantly surprised by Pittsburgh, although it still needs a real place to ski and some real lakes. During my studies at CMU, I was supported by a Fannie and John Hertz Foundation Fellowship, for which I am most grateful.

I am most indebted to my family who, throughout my life, have provided unending love and support. My wife, Sarah Ratchye, has endured the hard times, enjoyed the good times, and along with our daughter, Whitton Anne, has made this all worthwhile.

Table of Contents

Acknowledgements v
I Introduction 1
  I.1. Background and Motivation 1
    I.1.1. Why Machines for VLSI Simulation? 1
    I.1.2. Algorithms, Architecture, and Implementation 2
  I.2. A Simple Simulation Algorithm and a Simple Simulation Machine 3
    I.2.1. An Event-Driven Simulation Algorithm 3
    I.2.2. The Fast-1 Simulation Machine 5
  I.3. The Organization of the Dissertation 8
  I.4. The Contributions of this Research 10
  I.5. A Final Note 11
II Related Work 13
  II.1. Data-Driven Computers 13
  II.2. Multiprocessors and Interconnection Networks 16
    II.2.1. MIMD Machines 16
    II.2.2. SIMD Machines 17
  II.3. Simulation Algorithms 18
    II.3.1. A Brief Survey of Digital Simulation Techniques 18
    II.3.2. Switch-level Simulation Algorithms 19
  II.4. Simulation Machines 20
    II.4.1. Logic-Level Machines 20
    II.4.2. Switch-level Machines 22
    II.4.3. Circuit-Level Machines 23
  II.5. Partitioning 23
III A Switch-Level Simulation Algorithm 25
  III.1. Notation 26
  III.2. A Switch-Level Model of MOS Circuits 26
    III.2.1. Signals 27
    III.2.2. Transistors 28
    III.2.3. Nodes 29
    III.2.4. Strengths and Sizes 30
    III.2.5. Actual Signal Models 30
    III.2.6. Modeling Threshold Drops 35
  III.3. Determining the Steady State of a Network 37
    III.3.1. An Incorrect Switch-Level Simulation Algorithm 38
    III.3.2. The Fast-1 Switch-Level Simulation Algorithm 44
    III.3.3. The Correctness and Complexity of the Simulation Algorithm 46
    III.3.4. Delay 50


    III.3.5. Initialization 51
    III.3.6. Optimizations 52
  III.4. Compiling Circuits into Simulations 56
  III.5. Other Issues 58
    III.5.1. Multi-level Simulation 58
    III.5.2. Fault Simulation 60
IV The Architecture of a Fast-1 Uniprocessor 61
  IV.1. Uniprocessor Architecture 61
    IV.1.1. Instruction Definition 62
    IV.1.2. Instruction Execution 63
    IV.1.3. Implementing Algorithm III-4 Using the Fast-1: A Summary 68
    IV.1.4. Other Issues 68
  IV.2. Implementation 70
    IV.2.1. The Datapaths of a Fast-1 Processor 70
    IV.2.2. Keeping Track of Executable Instructions 71
    IV.2.3. Fixed-Width versus Variable-Width Instructions 73
    IV.2.4. The Impact of Technology 78
    IV.2.5. Reliability 81
    IV.2.6. Other Issues 81
V Uniprocessor Experiments 83
  V.1. The Circuits 84
  V.2. The Software Implementation of the Fast-1 MOS Simulator 85
  V.3. Static Measurements 87
    V.3.1. Transistors, Nodes, and the Distribution of Instructions 88
    V.3.2. Fan-In and Fan-Out 89
    V.3.3. Sizes of Transistor Groups 91
    V.3.4. Representing Bidirectional Transistors Using Two Unidirectional Transistor Instructions 99
    V.3.5. Using Minimal Machines and Fan-in and Fan-out Trees 99
    V.3.6. The Effect of Finding Unidirectional Transistors and Eliminating One-input Nodes 100
  V.4. Dynamic Measurements 102
    V.4.1. The Base Case 106
    V.4.2. The Effect of Changing the Representation of Circuits 108
    V.4.3. The Effect of Optimizations 113
    V.4.4. Using a Queue versus a Stack for Keeping Track of Executable Instructions 117
    V.4.5. Parallelism in the Fast-1 117
    V.4.6. Execution Time Estimates for Other Simulation Machine Architectures 125
    V.4.7. Some Other Thoughts on Parallelism 129
VI Algorithms for Multiprocessor Simulation 131
  VI.1. Multiprocessor Implementation of Algorithm III-4 131
    VI.1.1. Implementation 131
    VI.1.2. Correctness 132
    VI.1.3. Performance Considerations 132
  VI.2. Partitioning Algorithms 133
    VI.2.1. The Complexity of Partitioning 134
    VI.2.2. Practical Partitioning Algorithms 135
VII The Architecture of a Fast-1 Multiprocessor 141

  VII.1. Approaches to Exploiting Parallelism 141
  VII.2. A Multiprocessor Fast-1 144
    VII.2.1. Processor Architecture Assuming Static Instruction Assignment 145
    VII.2.2. Reorganizing Fan-out and Broadcasting 148
    VII.2.3. Interconnect 150
    VII.2.4. Multi-level Simulation 155
VIII Multiprocessor Experiments 157
  VIII.1. An Outline of the Experiments 157
  VIII.2. Speedup 158
  VIII.3. Message Traffic 162
  VIII.4. The Impact of Broadcasting 163
IX Conclusions 167
  IX.1. Contributions 167
  IX.2. Other Applications 168
  IX.3. Future Work 169
  IX.4. And Now a Word to Our Sponsor 170
References 171
A Circuit Descriptions 177
  A.1. Adder 177
  A.2. Cadd24 177
  A.3. Ram 179
  A.4. Stack 180
B Test Programs 183
  B.1. Adder Test Program 183
  B.2. Multiplier Test Program 184
  B.3. RAM Test Program 185
  B.4. Stack Test Program 187
C Sample Raw Data for Experiments 189
  C.1. Sample Raw Static Data for Adder4 189
  C.2. Sample Raw Dynamic Data for Adder4 190

List of Figures

I-1 A one-bit full adder represented using boolean logic gates. 4
I-2 A sample Fast-1 instruction. 6
I-3 The data paths of a Fast-1 processor. 7
II-1 The data-flow graph for the function (-b ± sqrt(b² - 4ac))/2a. 14
II-2 A typical Dennis-style data-flow computer. 15
II-3 A YSE logic processor. 21
II-4 A block diagram of the Mossim Simulation Engine. 23
III-1 A CMOS implementation of a one-bit full-adder. 27
III-2 An NMOS circuit represented as a switch-level network. 32
III-3 The range of values of an interval signal. 34
III-4 Examples of the least upper bound function on interval signals. 35
III-5 A circuit illustrating the modeling of threshold drops. 37
III-6 The simulation of an NMOS circuit. 40
III-7 A case where Algorithm III-3 gives the wrong result. 41
III-8 Another circuit where Algorithm III-3 works incorrectly. 42
III-9 An example illustrating the different static optimizations. 53
III-10 A bridge circuit illustrating that transistors can be bidirectional even when there is only one driver and one sink. 54
III-11 A description of a one-bit adder as transistors and their interconnection. 57
III-12 A compiled version of the circuit in Figure III-11. 59
IV-1 A Fast-1 program for the one-bit adder in Figure I-1. 64
IV-2 A trace of the execution of the Fast-1 program for the adder. 65
IV-3 The datapaths of a Fast-1 processor. 72
IV-4 An illustration of fan-out trees. 74
IV-5 A block diagram of a Fast-1 using a separate Destination memory. 75
IV-6 A block diagram of a Fast-1 processor with separate memories for the Opcode and ExecutionTags, the SourceOperands and Results, and the Destinations. 78
V-1 A program for exhaustively testing an 8-bit multiplier. 87
V-2 Node fan-in. 93
V-3 Cumulative node fan-in. 93
V-4 Instruction fan-in including both transistor and node instructions. 94
V-5 Cumulative instruction fan-in. 94
V-6 The fan-out of nodes. 95
V-7 The cumulative fan-out of nodes. 95
V-8 The fan-out of instructions. 96
V-9 The cumulative fan-out of instructions. 96
V-10 Percentage of transistor groups with n transistors. 97
V-11 Cumulative percentage of transistors versus transistor group size. 97


V-12 Percentage of transistor groups with n nodes. 98
V-13 Cumulative percentage of nodes versus transistor group size. 98
V-14 A portion of the parallelism profile for adder4. 120
V-15 The instructions executed during each parallel step. 120
V-16 The average speedup per instruction. 122
V-17 Cumulative execution time for the RAM circuits. 123
V-18 Cumulative instruction frequency for the RAM circuits. 123
V-19 Cumulative execution time for the other circuits. 124
V-20 Cumulative instruction frequency for the other circuits. 124
VII-1 A pipelined implementation of a Fast-1 processor. 142
VII-2 A multiprocessor Fast-1 with independent memories and evaluation units. 144
VII-3 A Fast-1 processor with queues designed for use in a Fast-1 multiprocessor. 146
VII-4 Using a Copy instruction to reduce the number of remote stores. 149
VII-5 Using maps for translating broadcasts. 151
VII-6 A Fast-1 multiprocessor constructed using a broadcast bus. 153
VII-7 One method for connecting the arbitration signals on a Fast-1 bus. 153
VII-8 A more complex bus interconnect structure using two ports/processor. 155
VIII-1 Speedup using a multiprocessor. 161

List of Tables

I-1 The initial state of the simulation of the adder circuit in Figure I-1. 5
I-2 A trace of Algorithm I-1 simulating the full adder in Figure I-1. 5
V-1 The ratio of transistors to nodes. 89
V-2 The distribution of transistor and node types. 90
V-3 The ratio of instructions to transistors and nodes in original circuits. 91
V-4 Average fan-in and fan-out of circuits. 92
V-5 The effect of representing bidirectional transistors using two unidirectional transistor instructions. 100
V-6 The number of instructions needed to represent a circuit using instructions with two SourceOperands and the ability to fan-out to two destinations. 101
V-7 The effect of finding unidirectional transistors on fan-in, fan-out, and number of instructions. 102
V-8 Base case dynamic simulation data. 107
V-9 The fractional contributions to execution time. 107
V-10 The ratio of transistor executions to node executions. 107
V-11 Seconds of time used by software and hardware implementations of the Fast-1. 109
V-12 Seconds of time used by MossimII. 109
V-13 The effect of using two unidirectional transistor instructions to represent each bidirectional transistor. 110
V-14 The effect of using single-Result bidirectional transistor instructions. 111
V-15 The performance of a minimal Fast-1. 111
V-16 Average dynamic fan-out. 112
V-17 The effect of having all nodes be capacitive. 112
V-18 The effect of initializing instructions only as needed. 113
V-19 The effect of not finding unidirectional transistors other than those connected directly to Vdd or Ground. 114
V-20 The effect of using no dynamic optimizations. 115
V-21 The effect of using a minimal Fast-1 and no dynamic optimizations. 115
V-22 The effect of using only the node optimization. 116
V-23 The effect of using only the Node2 optimization. 116
V-24 The effect of using only the Store optimization. 117
V-25 The effect of using only the Transistor optimization. 117
V-26 The performance of a stack versus a queue for keeping track of executable instructions. 118
V-27 Maximum available parallelism in Fast-1 simulations using the base configuration. 121
V-28 Maximum available parallelism in Fast-1 simulations using the minimal configuration. 121


V-29 The time used by a YSE-style simulation machine. 127
V-30 The time used by a -style simulation machine. 128
VII-1 The rotation of priorities in decentralized parallel arbitration. 154
VIII-1 Simulation time using random partitioning and contention-free interconnect. 159
VIII-2 Simulation time using random partitioning and interconnect with contention. 159
VIII-3 Simulation time using fan-out partitioning and contention-free interconnect. 160
VIII-4 Simulation time using fan-out partitioning and interconnect with contention. 160
VIII-5 A comparison between random partitioning and fan-out partitioning. 161
VIII-6 Message traffic using random partitioning. 162
VIII-7 Message traffic using fan-out partitioning. 162
VIII-8 Number of arbitration cycles performed when using random partitioning. 164
VIII-9 Number of arbitration cycles performed when using random partitioning. 164
VIII-10 Simulation time without broadcasting. 165
VIII-11 Message traffic without broadcasting. 165

List of Algorithms

I-1 An event-driven simulation algorithm. 4
I-2 The 'Fetch/Execute' cycle of the basic Fast-1 processor. 6
III-1 The network functions for cross-product signals. 31
III-2 The network functions for interval signals. 36
III-3 An incorrect switch-level simulation algorithm. 39
III-4 The Fast-1 switch-level simulation algorithm. 45
IV-1 The fetch/execute cycle of the complete Fast-1 processor. 66
VI-1 The DFS fan-out partitioning algorithm. 138
VI-2 A cost function for partitioning using simulated annealing. 139
VII-1 The fetch/execute cycle of a Fast-1 processor designed for use in a multiprocessor. 147
VII-2 The fetch/execute cycle of a multiprocessor Fast-1 with RemoteStore instructions. 148


To my parents

I Introduction

Faster! If there is one word that describes the motivation for much of the work done in computer science and engineering, faster is it. Since computers were first designed and built, scientists and engineers have been trying to make them run faster in order to solve problems more rapidly. Even more important is that as computers become faster, we are able to use them to solve new problems that were previously, for all practical purposes, intractable.

In this dissertation I describe research in algorithm design and computer architecture that has resulted in a system for switch-level simulation that is several orders of magnitude faster than similar simulators running on general-purpose computers.

I.1. Background and Motivation

For the reader unfamiliar with designing VLSI circuits, some motivation is in order. In particular, why is the research described in this thesis in the least bit interesting? In my mind, there are two perspectives from which to consider this research. The narrow perspective is that VLSI simulation is slow and, therefore, methods for speeding it up are of great practical interest. The broader perspective is that, in computer science, faster computing is achieved most effectively when algorithms, architecture, and implementation are combined synergistically. In this light, the system discussed in this thesis is an example of a system developed with this synergistic effect in mind: thus it illustrates what can be achieved when systems are developed in this way. I discuss both of these perspectives in greater detail in the following sections.

I.1.1. Why Machines for VLSI Simulation?

A 'typical' VLSI chip, such as a microprocessor, may require tens to hundreds of thousands of transistors to implement its functions. Yet, even given advanced computer-aided design tools, it is difficult and time-consuming to create a complex chip that works correctly. In this respect the VLSI designer is in much the same boat as the computer programmer. One technique the programmer uses to attack this problem is to simply compile a program, execute it using test data, fix the bugs, and try again; a process that these days can be done reasonably interactively and inexpensively. Similarly, a VLSI designer has a chip fabricated, exercises it with some test data, fixes the bugs, and tries again. But here's the rub: fabrication takes several weeks, at best, and fixing the bugs can be incredibly painstaking. Moreover, fabrication is expensive and so far no one is talking about personal fab lines. So what's a VLSI designer to do? Simulate! While not as good as testing the real thing, simulation provides the designer with a method for building confidence that the actual implementation of a circuit will work.

There are several levels of simulation available to the VLSI designer. At the most detailed level, there is circuit-level simulation in which voltages, currents, and transistors are accurately modeled, allowing the designer to examine the detailed electrical behavior of the circuit. At a less detailed level, there is functional simulation in which much of the analog electrical behavior is omitted and the circuit is modeled as a strictly digital system. The primary advantage of functional simulation over circuit simulation is that, by ignoring most of the electrical detail, the simulation runs orders of magnitude faster.

For VLSI systems a form of functional simulation called switch-level simulation has been found to be quite useful. In switch-level simulation, transistors are modeled as ideal bidirectional resistive switches and effects such as capacitive storage are also taken into account. The result is that the functional effects of most digital VLSI circuits can be modeled correctly. Nevertheless, switch-level simulation is still slow. Simulating the steps that a microprocessor chip uses to execute a single instruction can take minutes. Verifying that a microprocessor executes even a simple program correctly can take hours. Indeed, when using a general-purpose computer, such as a Vax 11/780, a simulated to real time ratio of 10,000,000:1 is not at all atypical. The most obvious way to decrease this ratio is to build faster machines designed specifically for simulation, and it is exactly this approach that is described in this dissertation.

I.1.2. Algorithms, Architecture, and Implementation

There are many ways in which the performance of a computer system can be improved. Perhaps the simplest way is to use faster technology; for example, implementing a computer using ECL instead of TTL. It is certainly true that a traditional switch-level simulation program will run much faster on a Cray-1 than on a Vax, but not several orders of magnitude faster. The performance of the simulation system described in this dissertation results not from faster technology but rather from understanding and taking advantage of the interaction of algorithms, architecture, and implementation technology.

Taking advantage of the synergistic effect algorithms, architecture, and implementation technology have on each other is becoming increasingly important in the development of new computer systems. For example, systolic algorithms [Kung, 1982; Kung and Leiserson, 1979] take advantage of the parallelism offered by Very Large Scale Integration (VLSI) technology. More importantly, they utilize some of the practical aspects of the technology, such as the relative inexpense of computation in comparison with communication. Similarly, Hennessy has described how it is no longer possible to design a high-performance general-purpose microprocessor without considering how it is to be used and how it is to be implemented; there is simply too much interaction between these facets of the design [Hennessy, 1985].

These interactions have led many researchers to conclude that, among other things, the simpler the architecture the faster the implementation will run. In the world of general-purpose computers this discussion has become a debate over reduced instruction set computers (RISC) versus complex instruction set computers (CISC) [Hennessy, 1985; Patterson and Sequin, 1982; Colwell, Hitchcock, and Jensen, 1983]. Similarly, in the context of general system building, Lampson has noted that any system function that adds 10% to the execution time of other system functions had better compensate by reducing the overall execution time [Lampson, 1984]. Independent of one's particular biases in this debate, a general principle still seems to be true: it is easier to make simple things run fast. This is also known as the KISS principle: Keep It Simple, Stupid.

I.2. A Simple Simulation Algorithm and a Simple Simulation Machine

A guiding principle behind the design of the system described in this dissertation is that simulation is simple, and therefore any machine that implements a simulation algorithm should be simple as well. In the following two sections, I describe the simulation algorithm and the simulation machine that form the basis of the remainder of the work discussed in this dissertation.

With the exception of the chapters on performance, much of the rest of this dissertation might be viewed as just describing embellishments, albeit important ones, to the simulation algorithm and machine described below. The essential embellishment is adapting the simulation algorithm and the machine to the needs of switch-level simulation. While a full discussion of switch-level simulation is presented in Chapter III, when considering the algorithm presented below the reader may find it interesting to consider what happens if this algorithm is used to simulate a single-pole single-throw switch, or its analog in VLSI, an MOS transistor, in which current can flow in both directions through the switch.

I.2.1. An Event-Driven Simulation Algorithm

Though there are many ways of implementing a simulator, one of the most common and efficient is called event-driven simulation. The essential idea behind event-driven simulation is that it is necessary to simulate only those parts of a circuit in which there have been changes, or conversely, there is no point in computing new values for those parts of a circuit in which nothing has changed. For example, Figure I-1 shows the circuit for a one-bit full adder using traditional boolean logic gates. Assuming that inputs X, Y, and Z are 0 then, in steady state, the outputs of the gates are as summarized in Table I-1. If we now set input X to 1, we can ask the question: "What do we need to do in order to simulate the circuit?" One possibility is to simulate every gate in turn until no outputs change state. However, this is inefficient. A better way is to simulate a gate only when at least one of its inputs has been changed and then propagate the result to each input that is connected to the output of the gate we just simulated. This process is repeated until no more inputs have changed. Algorithm I-1 summarizes this procedure:

While there are any gates with changed inputs {
    Pick a gate with changed inputs.
    Compute the result of this gate given its inputs.
    If new result ≠ previous result {
        Propagate new result to all inputs connected to the output of the gate.
    }
}

Algorithm I-1: An event-driven simulation algorithm.
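Algorithm I-1 can be sketched in a few lines of Python. The dictionary-based circuit encoding, the set used as the worklist, and the small NOT-into-AND example below are illustrative assumptions for the sketch, not the representation used elsewhere in this dissertation.

```python
# A minimal sketch of Algorithm I-1: event-driven gate simulation.
# Gate encoding and worklist discipline are illustrative, not the
# dissertation's actual implementation.

def simulate(gates, fanout, values, changed):
    """gates: name -> (function, input names); fanout: name -> gates driven;
    values: name -> current 0/1 value; changed: set of gates with changed inputs."""
    while changed:
        g = changed.pop()                      # pick any gate with changed inputs
        fn, inputs = gates[g]
        result = fn(*(values[i] for i in inputs))
        if result != values[g]:                # only propagate real changes
            values[g] = result
            changed.update(fanout.get(g, ()))  # wake up the gates we drive
    return values

# A two-gate example: a NOT gate feeding one input of an AND gate.
gates = {"n1": (lambda a: 1 - a, ["x"]),
         "a1": (lambda a, b: a & b, ["n1", "y"])}
fanout = {"n1": {"a1"}}
values = {"x": 0, "y": 1, "n1": 1, "a1": 1}    # steady state for x = 0
values["x"] = 1                                 # event: input X goes to 1
simulate(gates, fanout, values, {"n1"})
```

Note that only the gates on the affected path are re-evaluated; the check against the previous value is what stops the propagation.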

Figure I-1: A one-bit full adder represented using boolean logic gates.

There are two important reasons for checking that the new and previous results are different: first, it reduces the number of gates needlessly simulated and second, in situations where there is feedback, this check helps eliminate infinite simulation loops. Continuing the example, Table I-2 presents a trace of Algorithm I-1 simulating the circuit in Figure I-1 after the X input has been set to 1.

Table I-1: The initial state of the simulation of the adder circuit in Figure I-1.

Gate  Output
I1    1
I2    1
I3    1
A1    0
A2    0
A3    0
A4    0
A5    0
A6    0
A7    0
O1    0
O2    0

Table I-2: A trace of Algorithm I-1 simulating the full adder in Figure I-1.

Event     Gates Affected
X -> 1    I1, A1, A4, A6, A7
I1 -> 0   A2, A3
A1 -> 1   O1
A4 -> 0
A6 -> 0
A7 -> 0
A2 -> 0
A3 -> 0
O1 -> 1

Algorithm I-1 forms the basis for both the algorithm and the architecture research described in this thesis. Although simulating MOS circuits requires additional detail not present in Algorithm I-1, the basic flavor remains the same.

I.2.2. The Fast-1 Simulation Machine

The Fast-1¹ is a simulation machine whose overall architecture is that of a data-driven computer and is motivated by the observation that Algorithm I-1 resembles the general instruction execution strategy used in data-driven machines. While the relation between the Fast-1 simulation machine and other data-driven architectures is discussed in greater detail in Chapter II, the essential attribute of data-driven computation that is relevant in the present context is that instructions are executed in response to the flow of data and not as dictated by a program counter as in traditional von Neumann computers.

All instructions in the Fast-1 have the same basic format in which there are five different types of fields:

¹This name is not an acronym, nor does it have anything to do with how well the machine works.

• An Opcode that indicates the function to be performed when the instruction is executed.
• One or more SourceOperands that contain the actual data that is used in evaluating the instruction.
• A Result that contains the value of the previous execution of the instruction.
• One or more Destinations that specify SourceOperands in other instructions into which the new Result of this instruction is to be stored.
• An ExecutionTag that specifies whether or not this instruction needs to be executed.

For example, a three-input AND gate can be represented using the Fast-1 instruction shown in Figure I-2.

AND | Source1 | Source2 | Source3 | ExecTag | Result | Dest1 | Dest2 | ... | Destn

Figure I-2: A sample Fast-1 instruction.
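The five fields above might be captured in a small record type. In the sketch below, the Python representation, the field types, and the encoding of each Destination as a (destination instruction index, operand index) pair are hypothetical conveniences; only the field names come from the text.

```python
# A sketch of the five-field Fast-1 instruction format described above.
# Field names follow the text; this Python representation is hypothetical.
from dataclasses import dataclass, field

@dataclass
class Instruction:
    opcode: str                       # function applied when executed, e.g. "AND"
    sources: list                     # SourceOperands: the actual input data
    result: int = 0                   # Result of the previous execution
    # Destinations: (destination instruction index, operand index) pairs
    destinations: list = field(default_factory=list)
    exec_tag: bool = False            # ExecutionTag: does this need execution?

# The three-input AND gate of Figure I-2, with illustrative operand values
# and destination indices:
and_gate = Instruction("AND", [1, 1, 0], result=0,
                       destinations=[(7, 0), (9, 2)], exec_tag=True)
```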

Corresponding to Algorithm I-1 is Algorithm I-2, which describes the 'fetch/execute' cycle of a Fast-1 processor:

While there exists an instruction, I, with I.ExecutionTag set {
    Clear I.ExecutionTag
    Let R ← I.Opcode(I.SourceOperand1, ..., I.SourceOperandn)
    If R ≠ I.Result then {
        I.Result ← R
        Foreach D ∈ I.Destinations {
            Store R into the SourceOperand specified by D
            Set the ExecutionTag of the instruction specified by D
        }
    }
}

Algorithm I-2: The 'Fetch/Execute' cycle of the basic Fast-1 processor.

There are several important things to notice about the execution of instructions. First, an instruction becomes executable whenever a value is stored into any one of its SourceOperands. Moreover, when the instruction is executed it is evaluated using whatever values are stored in its SourceOperands. Thus, an instruction can be executed using both new and old data. As discussed in Chapter II, this is a significant departure from the usual model of data-driven computers.
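The cycle of Algorithm I-2 can be made concrete with a short runnable sketch. The dict-based instruction memory, the opcode table, and the two-instruction program below are illustrative assumptions; they are not the machine's actual instruction encoding.

```python
# A runnable sketch of Algorithm I-2, the basic Fast-1 fetch/execute cycle.
# Instructions are dicts here; the opcode set and destination encoding
# (instruction index, operand index) are illustrative assumptions.

OPS = {"AND": lambda xs: int(all(xs)),
       "OR":  lambda xs: int(any(xs)),
       "NOT": lambda xs: 1 - xs[0]}

def run(mem):
    """mem: list of instruction dicts with keys op, src, result, dest, tag."""
    while True:
        ready = [i for i in mem if i["tag"]]
        if not ready:
            break                              # no ExecutionTags set: quiescent
        inst = ready[0]                        # any executable instruction will do
        inst["tag"] = False                    # clear I.ExecutionTag
        r = OPS[inst["op"]](inst["src"])       # R <- I.Opcode(I.SourceOperands)
        if r != inst["result"]:
            inst["result"] = r
            for d_instr, d_operand in inst["dest"]:
                mem[d_instr]["src"][d_operand] = r  # store into SourceOperand
                mem[d_instr]["tag"] = True          # set its ExecutionTag

# NOT(x) feeding one input of a 2-input AND; x has just changed to 1,
# so the NOT instruction's ExecutionTag is set.
mem = [{"op": "NOT", "src": [1],    "result": 1, "dest": [(1, 0)], "tag": True},
       {"op": "AND", "src": [1, 1], "result": 1, "dest": [],       "tag": False}]
run(mem)
```

Because the NOT's new Result (0) differs from its old one, it is stored into the AND's first SourceOperand and the AND becomes executable in turn; execution stops once no tags remain set.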

Another important observation is that instructions may be executed in any order. This feature is extremely useful when considering how to implement both uniprocessor and multiprocessor Fast-1 machines. Finally, note that at any given time more than one instruction may be executable. We will return to this observation in a moment.

Figure I-3: The data paths of a Fast-1 processor.

Though discussed in detail in Chapter IV, let us consider briefly how a FAST-1 processor might be implemented. As shown in Figure I-3, one can imagine implementing a FAST-1 processor in such a way that each word in instruction memory contains one instruction. To execute an instruction, the fetch unit examines the EXECUTIONTAGS of the instructions stored in memory, and picks an executable instruction to be latched into the instruction register. Though this operation may seem difficult to implement, in Chapters III and IV I describe how the EXECUTIONTAG may actually be implemented as a linked list. The operation of picking the next instruction to execute then becomes a matter of simply accessing the instruction at the head of the list. Once the instruction is in the instruction register, it is evaluated, its EXECUTIONTAG is cleared, and its RESULT field is updated. The modified instruction word is then written back into memory. Assuming that the new and old RESULT differ, each of the instructions referenced by a DESTINATION is read,2 the appropriate SOURCEOPERAND and its EXECUTIONTAG are updated, and the instruction is written back into memory. Hence, the fundamental operation of the processor is a read-modify-write (RMW) memory cycle. When I discuss the performance of various processor configurations, it is in terms of these RMW memory cycles.
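The execution loop just described can be sketched in software. The following Python fragment is an illustrative model under assumed names (the Instruction class, the run function, and the field layout are mine, not the FAST-1's actual instruction-word format); the on_list flag plays the role of the EXECUTIONTAG, and the deque stands in for the linked list of executable instructions.

```python
from collections import deque

class Instruction:
    """One word of instruction memory (illustrative field layout)."""
    def __init__(self, opcode, sources, destinations):
        self.opcode = opcode               # function used to evaluate the instruction
        self.sources = sources             # SOURCEOPERAND values (state is retained)
        self.destinations = destinations   # (instruction index, operand slot) pairs
        self.result = None                 # most recent RESULT
        self.on_list = False               # EXECUTIONTAG: already queued for evaluation?

def run(memory, executable):
    """Execute until no instruction is executable; count read-modify-write cycles."""
    rmw_cycles = 0
    while executable:
        inst = memory[executable.popleft()]        # head of the executable list
        inst.on_list = False
        old = inst.result
        inst.result = inst.opcode(*inst.sources)   # evaluate with current operands
        rmw_cycles += 1                            # RMW of the instruction's own word
        if inst.result != old:                     # propagate only on change
            for target, slot in inst.destinations:
                t = memory[target]
                t.sources[slot] = inst.result      # update the SOURCEOPERAND
                rmw_cycles += 1                    # RMW of the target's word
                if not t.on_list:                  # the read tells us whether it is
                    t.on_list = True               # already listed (see footnote 2)
                    executable.append(target)
    return rmw_cycles
```

Note that an instruction's operands persist between firings, so an evaluation may combine new and old data, exactly as described above.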

Returning to the observation that multiple instructions may be executable at the same time, it seems that one way to improve performance is to exploit this phenomenon and allow instructions to execute in parallel. While we might consider modifying the machine presented in Figure I-3 to allow for this, an alternative architecture is to have multiple FAST-1 processors that each contain some subset of the instructions. The processors would be interconnected so that a DESTINATION can specify that a RESULT is to be written into a SOURCEOPERAND of an instruction that is either in the same processor or a different processor. This style of multiprocessor is discussed in Chapter VII.

1.3. The Organization of the Dissertation

The remainder of this dissertation essentially mirrors the structure of this introduction. There are some discussions about algorithms, some about architecture, and some about performance. Rather than pile everything into three chapters, I have chosen to divide the work into two parts. The first part describes the uniprocessor simulation system. The second part describes the multiprocessor simulation system. Within each of these parts there are separate chapters on architecture, algorithms, and experimental results. In addition, there is a separate chapter on related work that applies to both parts of the dissertation, and a concluding chapter with suggestions for further research. Thus:

• Algorithms. The switch-level simulation algorithm used in the FAST-1 is described in Chapter III. The presentation of the algorithm is divided into two main parts: a discussion of a switch-level model of MOS circuits, and a discussion of an algorithm that uses the model to find the steady-state response of the circuit. Though the algorithm was developed with the goal of eventually implementing it using a FAST-1 processor in mind, this chapter can be read independently of the remainder of the thesis. In particular, the algorithm need not be implemented using hardware, but can be implemented quite effectively in software. Because the form it takes is that of an event-driven algorithm, it may be particularly useful when used in conjunction with a logic-level simulator. Besides a presentation of the switch-level simulation algorithm, Chapter

2. At this level of description it is not obvious why the instruction that is the target of the destination must be read. However, as discussed in Chapter IV, in practical implementations it is necessary to read the instruction in order to tell whether or not it is already on a list of instructions that need to be evaluated.

III contains a discussion of the relationship between the FAST-1 algorithm and other switch-level simulation algorithms, in particular Bryant's MOSSIM II algorithm. In Chapter III, I also consider the correctness and complexity of the simulation algorithm.

As discussed in Chapter VI, the FAST-1 simulation algorithm can be implemented, essentially unchanged, on a multiprocessor. However, because the basic algorithm imposes some constraints on the order in which groups of instructions may be executed, some processor synchronization is required. Of greater concern is that there needs to be some way of partitioning simulations to run on a multiprocessor simulation machine. To this end, Chapter VI presents several different partitioning algorithms.

• Architecture. In Chapter IV, I show how the FAST-1 architecture may be extended in a straightforward manner to incorporate the features needed to implement the switch-level simulation algorithm presented in Chapter III. In Chapter IV I also discuss various implementation details and consider how various technologies affect the implementation.

The architecture of a multiprocessor FAST-1, built using many instances of the uniprocessor, is described in Chapter VII. I discuss the importance of both the logical and physical aspects of the processor interconnect, including such capabilities as broadcasting. I also consider mechanisms for implementing the synchronization required by the simulation algorithm.

• Experimental Results. The effectiveness of the simulation algorithm and the data-driven architecture is evaluated in two chapters. Chapter V discusses the performance of a uniprocessor FAST-1 in terms of the number of read-modify-write memory cycles required to simulate the circuits and shows how this performance is affected by changes in the architecture. A static analysis of circuit attributes, such as the fan-in and fan-out of nodes, is also presented. Finally, this chapter presents an experimental analysis of the potential speedups that might be obtained by executing instructions in parallel. Besides providing a basis for comparing the FAST-1 architecture to other simulation machine architectures, this analysis sets the stage for the multiprocessor research described in the following chapters.

In Chapter VIII, I present the results of using the multiprocessor architecture and the partitioning algorithms to simulate the same circuits simulated using the FAST-1 uniprocessor. Several different configurations are evaluated in this chapter, and these multiprocessor performance results are compared to the uniprocessor measurements.

• Related Work. In order to provide a detailed comparison between my research and other work, much of the discussion of related work is spread throughout this dissertation. In addition, Chapter II offers a more organized but less detailed discussion of related work and provides pointers to the relevant sections of the dissertation where the detailed discussion is presented. Among the topics discussed in Chapter II are simulation algorithms, simulation hardware, data-driven computation, and multiprocessor partitioning.

• Conclusions and Future Work. Chapter IX contains concluding material. I describe how this research can be applied to tasks other than simulation. I also discuss those aspects of this research, such as partitioning, where there is much more work to be done, and what approaches might be investigated in the future.

1.4. The Contributions of this Research

In pursuing the research described herein, I was primarily concerned with answering three questions:

• Is there an event-driven algorithm for switch-level simulation that can be implemented using the FAST-1 architecture presented in Section 1.2?
• To what extent is switch-level simulation a domain in which parallelism can be exploited?
• Are there efficient partitioning algorithms that allow this parallelism to be exploited using a multiprocessor FAST-1?

It is my belief that this dissertation does indeed answer these questions, as well as many others. In summary, I feel the major contributions of this research are:

• A new algorithm for switch-level simulation of VLSI. The algorithm I present is based on previous work by Bryant [Bryant, 1984], Terman [Terman, 1983], and others, yet appears to be more suitable for implementation in hardware. Because the algorithm is fundamentally data-driven, it can be implemented on a multiprocessor essentially unchanged. Finally, and perhaps unintuitively, when implemented in software the algorithm performs as well as other software-implemented simulators.

• A novel processor architecture that implements the simulation algorithm in a very straightforward manner. Assuming a 500ns RMW cycle time, experimental results indicate that a uniprocessor FAST-1 simulation machine will run between a hundred and a thousand times faster than comparable simulators running on a VAX 11/780. Experimental results also indicate that the event-driven architecture used in the FAST-1 is over ten times faster than a uniprocessor architecture that is not event-driven.

• An analysis of the static characteristics of switch-level circuits. The static analysis is based on over two dozen circuits ranging in size from seventy transistors to over forty thousand transistors. Among other things, the static analysis reveals that the ratio of transistors to nodes in circuits is approximately 2 to 1, and that both the average fan-in and the average fan-out of nodes in circuits are less than 3.

• An analysis of the dynamic characteristics of a simulation algorithm and of a simulation machine. These experiments are based on 13 circuits ranging in size from 70 to 20K transistors. The two largest chips simulated are currently being used in running systems.

• An analysis of the potential for speeding up simulation by executing instructions in parallel. These experiments indicate that, contrary to popular opinion, simulation does not provide unlimited opportunities for exploiting parallelism. In the best case measured there is a potential for speeding up the simulation by a factor of 200, assuming that each instruction requires only one RMW cycle to execute. More realistic measurements indicate that the potential for speedup is actually closer to a factor of 90, assuming no communication contention and up to one processor per instruction. This experimental analysis also shows that the average percentage of instructions that can be executed in parallel is a decreasing function of circuit size. In one circuit, which has 200 transistors, about 7% of the instructions can be executed in parallel, on average. In the largest circuit, which has over 20 thousand transistors, only about 0.7% of the instructions can be executed in parallel, on average.

• A description of several algorithms for partitioning simulations onto a data-driven multiprocessor. Two of the algorithms, a random partitioning algorithm and a topology-based partitioning algorithm, are analyzed in detail. In the best cases measured, the random partitioning algorithm achieves speedups of up to a factor of 29, using 64 processors.

1.5. A Final Note

Lest the reader feel misled, let me state here and now that the experimental results reported in this dissertation are based on simulations of proposed hardware, rather than on a hardware prototype. However, as the reader will see, these simulations were done at a level of detail such that a great deal of confidence can be placed in the results. It will also become apparent that the hardware I propose is not what might be called 'if only' hardware, as in 'if only I could build a 100,000 processor system with a cycle time of 10 ns, then I could solve this problem really fast.'3 To those familiar with implementing hardware it will be clear that the architecture I propose can be implemented with a relatively modest amount of circuitry of either the off-the-shelf or custom variety. Though, as always, it would have been nice to have had a hardware implementation of the machine, in reality many of the experiments reported herein would have been difficult to perform using a hardware implementation, because modifying and adding features to software is much easier than munging hardware. Nevertheless, I hope that in the not too distant future I'll be trading in this keyboard for a soldering iron.

3. For a further explanation of this phenomenon, see The VLSI Approach to Computational Complexity by Professor J. Finnegan [Finnegan, 1981].

II Related Work

Somewhat arbitrarily, I have classified the work related to this research as being in one of five areas: data-driven computers, multiprocessors and processor interconnection, simulation algorithms, simulation machines, and partitioning algorithms. At times the relationship between my work and the work of others is quite technical. In these cases I mention the relationship briefly in this chapter and explore it in depth elsewhere in the dissertation.

II.1. Data-Driven Computers

A FAST-1 processor is a form of data-driven, or data-flow, computer. The fundamental idea behind data-driven computation is that instructions are executed in response to the flow of data. Whenever an instruction's firing rule is satisfied, the instruction is considered executable; a common firing rule is that an instruction is executable when it has new values for all of its operands. As will become clearer in a moment, it is possible that several instructions are fireable at the same time, and thus they can be executed concurrently. This is in contrast to the standard von Neumann, or 'control-flow', computer, in which instructions must be executed sequentially, as dictated by the program counter.

Programs for data-driven computers are usually thought of as data-flow graphs, illustrated by the example in Figure II-1. Note the striking similarity to the logic circuit in Figure I-1. The inputs to this program are labeled a, b, and c. The quoted numbers, for example '4', are constants. A node may fire whenever it has new values for all of its inputs. When a node does fire, it consumes its input values and produces an output value that is propagated to the inputs of other nodes. In the example shown, when constants are consumed they are immediately regenerated.
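The firing rule described above can be made concrete with a small interpreter. The sketch below is illustrative only: the Node class and evaluate function are my names, and input tokens are injected once rather than regenerated as constants would be. It uses the conventional consuming firing rule, in contrast to the FAST-1's stateful operands.

```python
import math

class Node:
    """A data-flow graph node: fires when all inputs hold values, consuming them."""
    def __init__(self, fn, n_inputs, outputs=()):
        self.fn = fn
        self.inputs = [None] * n_inputs
        self.outputs = list(outputs)   # (destination node, input slot) arcs
        self.last = None               # most recently produced output token

    def ready(self):
        return all(v is not None for v in self.inputs)

    def fire(self):
        self.last = self.fn(*self.inputs)
        self.inputs = [None] * len(self.inputs)   # consume the input tokens
        for node, slot in self.outputs:
            node.inputs[slot] = self.last         # propagate along the arc

def evaluate(nodes, sink):
    """Fire ready nodes, in any order, until none remain; return the sink's value."""
    while any(n.ready() for n in nodes):
        for n in nodes:
            if n.ready():
                n.fire()
    return sink.last
```

Wiring up the graph of Figure II-1 for one root of the quadratic, with a = 1, b = -3, c = 2, and firing it to completion produces the root 2.0, independent of the order in which ready nodes are chosen.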

Corresponding to the data-flow graph, the conceptually simplest data-flow ar- chitecture would have a processor for each node in the graph. These processors would be interconnected by wires that correspond to each of the arcs. For obvious reasons implementing such an architecture is generally infeasible.

Figure II-1: The data-flow graph for the function (-b + sqrt(b^2 - 4ac))/2a.

Some of the earliest work on data-flow machines was done by Dennis [Dennis, Leung and Misunas, 1979; Dennis, 1980]. A canonical Dennis-style data-flow architecture is illustrated in Figure II-2. In this machine, instructions are stored, together with their source operands and destination tags, in the instruction/operand memory. The arbitration network selects some subset of the executable instructions and routes them to the appropriate processing units. When an instruction finishes executing, the distribution network stores the result in the proper instructions in the memory.

In order to be useful for general-purpose computing, however, data-driven computers must facilitate the use of modern programming language concepts. In particular, the ability to process complex data structures and procedure calls efficiently has been the focus of much of the research on data-driven computers. Work by Kazar [Kazar, 1984] and Arvind [Arvind and Kathail, 1981] is particularly notable in this area. On the other hand, because I am interested in building a machine that is to be used only for simulation, it is possible to avoid solving these problems, the result being a processor that is more straightforward to implement than a general-purpose data-driven computer. Indeed, not having to solve these problems is an important aim of my research.

Figure II-2: A typical Dennis-style data-flow computer.

A primary reason for much of the interest in data-driven computation is that exploiting parallelism appears to be easier when using data-driven computers than when using von Neumann computers. This is, however, a subject of much debate. One of the basic arguments of the data-driven camp goes by the name 'von Neumann bottleneck' [Backus, 1978]. It refers to the problems of the limited bandwidth between processor and memory in von Neumann machines. Arvind and Iannucci have refined this argument, saying that there are really two issues: not degrading performance because of latency when accessing memory, and sharing data without constraining parallelism [Arvind and Iannucci, 1983]. Their claim is that data-driven computers can deal effectively with these issues while von Neumann computers cannot. On the other hand, Gajski et al. argue that while on the surface data-driven machines may seem appropriate for exploiting parallelism, this is often because many of the 'constant factors' have been glossed over [Gajski, Padua, Kuck and Kuhn, 1982]. Moreover, they suggest that in many data-driven machines the von Neumann bottleneck has simply been replaced by other bottlenecks that are just as limiting. Such claims notwithstanding, Kazar has shown experimentally that a large number of data-driven processors connected by high-speed local area networks can indeed exploit the available parallelism in data-flow graphs [Kazar, 1984]. Kazar's results are based on the evaluation of two data-flow programs, one an assembler and the other a program for matrix multiplication. Yet matrix multiplication is a task handled quite effectively by a variety of von Neumann multiprocessors, so it is clear that the debate is far from over.

A significant difference between a FAST-1 processor and most other kinds of data-driven computers is that FAST-1 instructions have state. That is, input values are not consumed when an instruction is executed. Rather, a FAST-1 instruction becomes fireable whenever at least one of its inputs receives a new value. When a FAST-1 instruction fires, it may be evaluated using a combination of new and old input values. As can be seen from the example in Chapter I, this scheme is exactly what is needed for event-driven simulation.

The FAST-1 model of computation is not unlike that of object-oriented programming languages such as Smalltalk [Goldberg and Robson, 1983]. Each instruction can be viewed as an object that is an instance of a class, or type, that is defined for each different opcode. The state of the object is simply the current value of each of its operands. In Smalltalk, computation is performed by sending messages between objects. Whenever an object 'receives' a message, a particular method, or procedure, is invoked according to the type of the message and the type of the object. The result of executing a method is that other messages may be sent to other objects. In the FAST-1, messages are untyped and simply specify a new value for one of the operands of the destination instruction. When the message is received, the 'method' invoked first updates the appropriate operand and then computes one or more new results based on the operands and the instruction's opcode.

For a more in-depth discussion of many of the different kinds of data-driven systems that have been proposed, see the survey articles by Davis and Drongowski [Davis and Drongowski, 1980], Treleaven et al. [Treleaven, Brownbridge, and Hopkins, 1982], Vegdahl [Vegdahl, 1985], and Dennis [Dennis, 1980].

II.2. Multiprocessors and Interconnection Networks

Whether using a data-flow or control-flow computational model, improved performance can often be obtained by using multiple processors. Indeed, data-driven machines are almost exclusively discussed in the context of multiprocessing, since their supposed virtue is their ability to exploit parallelism. Nonetheless, it is of course possible to build multiprocessor control-flow computers as well. My discussion below follows along the well-known lines of Flynn [Flynn, 1972]. The reader is referred to the paper by Davis, Denny, and Sutherland for a more thorough taxonomy of these kinds of machines [Davis, Denny and Sutherland, 1980].

II.2.1. MIMD Machines

At one end of the multiprocessor spectrum are Multiple-Instruction Multiple-Data Stream (MIMD) machines. Most data-flow multiprocessors fit into this category, as do control-flow multiprocessors such as Cm* [Swan, 1978] and C.mmp [Wulf, 1981]. Of primary importance when discussing MIMD machines is how processors communicate. Physically, interprocessor communication can be viewed as either memory-based or message-based. In a memory-based system, such as C.mmp and, as generally viewed, Cm*, the processor interconnect gives each processor access to most or all of physical memory. In a message-based system, processors communicate by explicitly sending messages to each other. Some systems, of course, fall somewhere in between. Data-driven systems may be viewed as message-based; each message consists of data, and the address of a slot in an instruction into which the data is to be written. Indeed, as discussed in Chapter VII, a FAST-1 multiprocessor uses exactly this mechanism.

Depending on implementation details, the processor interconnect may be located in one of several places. One possibility is for it to be placed between the processors and the memory, as is done in C.mmp and many other shared-memory systems. Another possibility is for each processor to have its own memory, and for the interconnect to be located between processors. This architecture is used in many message-based systems, for example, Kazar's DS system.

The structure of the interconnect is, however, largely independent of where the interconnect is located. It would appear that the design of interconnect topology is an area where one is limited only by one's imagination. In loosely coupled systems, where processors are relatively autonomous or communicate relatively infrequently, computer networks such as the Ethernet [Metcalfe and Boggs, 1976; Tanenbaum, 1981] can be used for interprocessor communication. The work of several researchers, such as Spector [Spector, 1981] and Kazar [Kazar, 1984], indicates that local area networks can be used effectively, even in systems that are reasonably tightly coupled and that have high interprocessor communication rates. Nevertheless, in some tightly coupled systems even a minimal overhead for sending and receiving a packet over a local area network is simply too high, and message traffic is high enough that contention is a significant problem. In these situations, some form of parallel communication network, such as a cross-bar, a sorting network, or an n-cube, is needed. This is not to say that these interconnection networks may not have contention problems of their own. See the survey article by Feng for a more in-depth discussion of these latter kinds of interconnect [Feng, 1981]. Anderson and Jensen present a somewhat broader discussion of the taxonomy of interconnection networks [Anderson and Jensen, 1975]. The interconnect used in my research looks very much like a local area network, but is more tightly controlled.

II.2.2. SIMD Machines

At the other end of the multiprocessor spectrum are Single-Instruction Multiple-Data Stream (SIMD) machines, in which every processor executes the same instruction but on different data. A classic example of this style of machine is Illiac-IV [Bouknight et al., 1972], in which 64 processors, each with its own memory, simultaneously execute the same instruction. Using bit-serial techniques, some very large SIMD machines have been built. For example, the Goodyear MPP has 16K processors [Batcher, 1980], while Hillis has proposed building a serial SIMD machine, called the Connection Machine, that has up to a million processors [Hillis, 1981]. Though many SIMD machines, for example Illiac-IV and MPP, have relatively restricted, nearest-neighbor interprocessor communication, the Connection Machine allows arbitrary interprocessor communication. As a result, the Connection Machine is an interesting candidate for a simulation machine, as discussed briefly by Terman [Terman, 1983], and as discussed in Chapter V. Some systolic arrays [Kung, 1982; Kung and Leiserson, 1979] are also SIMD machines. Unlike many SIMD computers, however, systolic arrays take advantage of the massive interprocessor bandwidth available when using very regular arrays. Alas, the apparent irregularity of communication in simulation does not appear to map nicely onto such a regular array.

II.3. Simulation Algorithms

In order to place the work described in this dissertation in context, it is important to understand two things. First, where does switch-level simulation fit in relation to other simulation techniques? Second, how do other switch-level algorithms relate to the one described in Chapter III? The next two sections try to shed some light on these questions.

II.3.1. A Brief Survey of Digital Simulation Techniques

Simulation is an indispensable tool for debugging integrated circuits. Depending on what kind of debugging is being done, a designer will use a simulator that provides the appropriate level of detail and accuracy. Of course, if time and space allowed, a designer would probably use a simulator that provided the most accurate information possible. Unfortunately, the more accurate a simulation, the slower it runs and the more memory it uses. The designer is forced to compromise between accuracy and thoroughness. The greater the accuracy desired, the fewer the number of test cases that can be processed.

Switch-level simulation uses ideal, voltage-controlled, resistive switches and capacitive wires to simulate circuits, particularly those built using MOS.1 Voltage and current are modeled abstractly using states and strengths, which, in some simulators, are combined into a single entity called a signal. Using this model, it is possible to simulate most of the digital properties of MOS circuits. For the designer debugging a chip, this means that it is possible to ascertain

1. MOS is a collective name for the technologies currently being used to construct almost all high-density integrated circuits, such as microprocessors and dynamic memories. There are two types of MOS circuitry in common use: NMOS and CMOS. In NMOS there are two types of transistors: depletion mode and n-channel enhancement mode. Depletion mode transistors are always turned on and are used primarily as load resistors. N-channel enhancement mode transistors are three-terminal devices used as switches. When 0 volts are applied to the gate of an enhancement mode transistor, the switch is turned off. When a non-zero voltage, typically 5 volts, is applied to the gate of an enhancement mode transistor, the switch is turned on, providing a bidirectional current path between the source and drain. In CMOS there are also two types of transistors: n-channel enhancement mode and p-channel enhancement mode. The n-channel enhancement mode transistor is the same as in NMOS. The p-channel transistor is identical to the n-channel transistor except that it is turned on when its gate is at 0 volts, and turned off when its gate is at 5 volts. The reader interested in more background is referred to the introductory article on VLSI by Clark [Clark, 1980] or the textbook on VLSI by Mead and Conway [Mead and Conway, 1980].

whether or not the chip works functionally, but only if the model accurately abstracts the electrical and physical properties of the technology used to implement the chip.
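To make the state/strength idea concrete, here is one common style of signal representation, sketched in Python. The two-level strength ordering and the names DRIVEN and CHARGED are assumptions for illustration; real switch-level simulators, such as Bryant's MOSSIM II, use richer strength lattices.

```python
# Illustrative signal model: a signal pairs a drive strength with a logic state.
# Strengths are ordered; states are 0, 1, or 'X' (unknown).
DRIVEN, CHARGED = 2, 1   # hypothetical two-level strength ordering

def resolve(a, b):
    """Combine two signals meeting at a node: the stronger one wins; equal
    strengths with conflicting states yield the unknown state 'X'."""
    (sa, va), (sb, vb) = a, b
    if sa > sb:
        return a
    if sb > sa:
        return b
    return (sa, va) if va == vb else (sa, 'X')
```

For example, a driven 0 overrides a merely charged 1, while two equally driven but conflicting signals resolve to the unknown state, which is how a simulator can flag a fight between drivers.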

Unlike switch-level simulation, the more detailed circuit-level simulation uses models of active components, such as transistors, and passive components, such as resistors and capacitors, to simulate the circuit. Transistors are described as instances of some formal model, and the results of a simulation are voltages and currents at circuit nodes as a function of time.

Logic-level simulation is similar to switch-level simulation in that it provides information about the functional aspects of a circuit. Whereas switch-level simulation does this by modeling a circuit at the transistor level, logic-level simulation tries to abstract even further by modeling the circuit as consisting of ideal logic gates or even more abstract objects, such as adders or microprocessors. For many circuits, logic-level simulation suffices. Unfortunately, certain features of MOS circuits, for example bidirectional transistors, cannot be modeled easily using Boolean logic gates. Switch-level simulation is more general than logic-level simulation in that switch-level simulation can model a greater variety of circuits.

A number of researchers have investigated multi-level simulation, in which, in the same simulation, different parts of a circuit may be modeled at one of the different levels described above, or, depending on circuit activity, the same section is modeled differently at different times [Hill and vanCleemput, 1979; Agrawal, Bose, Kozak, and Nham, 1980]. Multi-level simulation has great potential in that it gives the designer greater freedom to manipulate accuracy versus time-space tradeoffs. However, multi-level simulation is not without its own difficulties, such as controlling anomalies that may arise due to the interactions between the different models.

II.3.2. Switch-Level Simulation Algorithms

On the surface, most switch-level simulation algorithms appear very similar. While a detailed analysis of the relationships between the FAST-1 algorithm and other switch-level algorithms must wait until Chapter III, some discussion can still take place here.

Though logic-level simulators had been around for a long time, as MOS VLSI became widely used researchers realized that logic-level simulators simply could not model some of the important aspects of MOS circuits. The evolution of switch-level simulation reflects attempts to deal with this situation. Early on, there were attempts to modify standard logic-level simulators to handle some subset of switch-level simulation. In the late 1970's, several researchers, notably Bryant [Bryant, 1980] and Hayes [Hayes, 1982], attacked the problem head-on and developed switch-level models that form the basis for much of the switch-level simulation work being done today.

In switch-level simulators, transistors, wires, voltage, and current are modeled abstractly, and some of the differences among switch-level simulators are in the nature of these models. Using a switch-level model, most switch-level simulation algorithms work by perturbing the gates of transistors and then computing the new steady state of the network using some form of relaxation, or graph traversal, or both. Much of the complexity of a switch-level simulation algorithm is required to correctly model the resistivity and bidirectionality of transistors. While the inner loop of a switch-level simulation algorithm is often similar to the inner loop of an event-driven logic-level simulator, additional mechanisms are needed to simulate switch-level circuits correctly. These mechanisms are explained in detail in Chapter III.

II.4. Simulation Machines

Because simulation is a time-consuming task, it is only natural that there has been a good deal of interest in building special-purpose machines for simulation. Like their software counterparts, these machines are generally designed either for logic-level, switch-level, or circuit-level simulation, with logic-level machines being predominant.

II.4.1. Logic-Level Machines

Perhaps the best known logic-level simulation machine is the IBM Yorktown Simulation Engine (YSE) [Denneau, 1982a]. In the YSE, each gate is represented by a separate instruction. As illustrated in Figure II-3, the most recent output of each instruction is stored in a separate memory, with each instruction being able to specify four locations in this memory as input operands, and one location as its output. However, unlike the FAST-1, the YSE is not event-driven. Rather, there is an instruction counter that cycles through all instructions, causing each instruction to be evaluated in turn. Thus, instructions are evaluated whether they need to be or not. An advantage of this approach is that it allows arbitrary fan-out to be handled in constant time. A disadvantage is that it means instructions are needlessly evaluated. A complete YSE consists of up to 256 simulation engines interconnected via a 256 by 256 by 3-bit cross-point switch. In unit-delay mode, each processor can contain up to 4K instructions, and can evaluate one instruction every 80ns.
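The non-event-driven evaluation style just described can be sketched as follows. All names here are hypothetical, and this is a unit-delay caricature rather than the YSE's actual microarchitecture; the point is that every instruction is evaluated on every pass, whereas an event-driven simulator would evaluate only those instructions whose inputs have changed.

```python
from collections import namedtuple

# A YSE-style instruction: a function, its input locations in the value
# memory, and one output location. (Illustrative, not the YSE's real format.)
Inst = namedtuple('Inst', 'fn sources output')

def oblivious_pass(instructions, values):
    """One pass of non-event-driven evaluation: every instruction is
    evaluated in sequence, whether or not its inputs changed."""
    changed = False
    for inst in instructions:
        new = inst.fn(*(values[s] for s in inst.sources))
        if values[inst.output] != new:
            values[inst.output] = new
            changed = True
    return changed

def settle(instructions, values):
    """Repeat full passes until no value changes; return the pass count."""
    passes = 0
    while oblivious_pass(instructions, values):
        passes += 1
    return passes
```

The per-pass cost is fixed at one evaluation per instruction regardless of activity, which is exactly the property the event-driven comparison in Chapter V examines.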

Although the YSE was designed for logic-level simulation, researchers at IBM have explored using it for switch-level simulation [Barzilai et al., 1983]. Unfortunately, their algorithm requires a good deal of pre-analysis of the circuit, and even then does not deal well with arbitrary bidirectionality or with unknown or illegal states. In their defense, the designers of the algorithm are trying to use existing hardware for a task for which it wasn't designed. Nevertheless, it is possible to abstract away from the exact implementation of the YSE and compare the performance of a non-event-driven architecture to an event-driven architecture when used for switch-level simulation. In Chapter V, I present

[Figure II-3 depicts a YSE logic processor: a function unit, an 8K x 128 instruction memory, a multi-port 8K x 2 data memory, an instruction counter, and result paths to and from the interprocessor switch.]

Figure II-3: A YSE logic processor.

such an analysis, the results of which indicate that event-driven architectures seem much better.

The most comprehensive commercial simulation machine available is the Zycad Logic Evaluator [Zycad, 1982]. Unfortunately, very little detailed information on the algorithms used in the machine has been published in the open literature. The Zycad machine is designed to sit on the bus of a host computer. Attached to this bus is a microprogrammed control processor that controls the machine's logic evaluation processors. In contrast to the YSE, the Zycad machine is event-driven. While designed primarily for logic-level simulation, it has some ability to model states and strengths, as well as bidirectional components. However, the machine lacks a comprehensive algorithm for simulating circuits at the switch level.

In terms of logic-level performance, each logic evaluation processor is capable of handling up to 1 million events per second. Unfortunately, as discussed in Chapter V, because switch-level simulation algorithms are more complicated than logic-level simulation algorithms, using events per second as a performance metric does not necessarily provide much useful information.

Several other logic-level machines have been proposed, some of which have been built. They include the Nippon Electric Co. (NEC) simulation machine [Koike, Ohmori, Kondo, and Sasaki, 1983], the Bell Labs logic simulation machine [Abramovici, Levendel, and Menon, 1983], and the Tegas Accelerator [Barto, 1980], as well as simulation accelerators such as those manufactured by Daisy and Valid for their engineering workstations. The NEC machine is interesting because its design allows it to handle not only simple logic gates but higher-level TTL-style devices as well. This is accomplished by having a dynamically programmable logic array that allows these higher-level devices to be evaluated in a single cycle. Like other logic-level machines, multiple processors are used to get improved performance. While the YSE uses a cross-bar for interconnect, the NEC machine uses a multi-stage routing network.

An interesting aspect of the Daisy and Valid accelerators is that they have the capability of including actual chips as part of a simulation. This style of simulation is useful when building TTL and ECL systems using off-the-shelf components, and for testing custom chips. For example, if one is building a system using a commercial microprocessor, the microprocessor can be plugged into the accelerator and the overall system simulated without having to use a software simulation of the microprocessor.

A final note in regard to logic-level simulation machines is that some machines do a much better job of modeling delay than others. In particular, the Tegas Accelerator, the Zycad Logic Evaluator, and the Bell Labs simulation machine all pay careful attention to modeling delay accurately. The FAST-1, on the other hand, uses a simpler 'unit-delay' model.

II.4.2. Switch-Level Machines

Although some of the logic-level simulation machines can be used to a certain degree for switch-level simulation, it appears that none of them work as well as a good software-implemented switch-level simulator, such as Mossim II [Bryant, 1984]. Besides the work described in this dissertation, there is at least one other effort, by Bill Dally of Caltech, to build a machine specifically for switch-level simulation [Dally, 1984]. Rather than design a new algorithm, Dally's Mossim Simulation Engine (MSE) implements Mossim II directly in hardware. A good way to view the MSE is as a special-purpose machine for tracing through data structures [Dally, 1985]. The FAST-1, on the other hand, is best viewed as a data-driven processor that has, in the case of switch-level simulation, two instructions: node and transistor. Nevertheless, because the simulation algorithm implemented by the FAST-1 is similar to the Mossim II algorithm, there are parts of the MSE and the FAST-1 that are quite similar.

A block diagram of the MSE is shown in Figure II-4. Each subnetwork processor consists of three basic units: the scheduling unit, the node operation unit, and the network traversal unit. The current implementation of the MSE stores the switch-level network in 40 ns access-time memories and has a 200 ns system clock. Preliminary measurements of the hardware indicate that it should run approximately 100 times faster than Mossim II running on a VAX 11/780.

[Figure II-4 depicts an MSE subnetwork processor containing three blocks: a scheduling unit, a node operation unit, and a network traversal unit.]

Figure II-4: A block diagram of the Mossim Simulation Engine.

An advantage of the approach Dally used in the MSE is that, by implementing the Mossim II algorithm, he is able to take advantage of what is already known about the correctness of the algorithm, as well as its performance. However, compared with the approach discussed in this dissertation, it appears that the MSE is more complex than the FAST-1 and, assuming comparable implementation technology, somewhat slower.

Finally, other researchers have proposed implementing existing simulation algorithms using small multiprocessors built from conventional microprocessors [Arnold, 1984; Thacker, 1984].

II.4.3. Circuit-Level Machines

Though I know of no efforts to build special-purpose machines for circuit-level simulation, as with switch-level simulation, there are some researchers using multiprocessors to get improved performance. One example is the MSplice system of Deutsch and Newton [Deutsch and Newton, 1984], which is a multiprocessor implementation of the Splice simulator [Newton, 1979]. Using a BBN Butterfly, a multiprocessor 68000 system, they have been able to obtain a seven-fold improvement in performance with a ten-processor system.

II.5. Partitioning

Various flavors of partitioning are found in many different areas of computer science. Of particular interest to the research at hand are those applicable to the problem of multiprocessor scheduling. As shown in Chapter VI, the partitioning problem in this dissertation is NP-complete; thus one can hope to find only sub-optimal algorithms for its solution.

In his DS system, Kazar uses a series of heuristics to statically and dynamically allocate dataflow graphs to processors [Kazar, 1984]. Among the reasons why Kazar's algorithms work is that he is scheduling reasonably large-grained objects: in particular, he is not trying to schedule individual instructions. A more general solution of the multiprocessor scheduling problem is provided by Stone's max-flow algorithm [Stone, 1977]. Unfortunately, the complexity of the algorithm appears to be O(n2mS), where n is the number of processors and m is the number of nodes in the graph.

Another technique that has been used to find approximate solutions to NP-complete partitioning problems is a form of hill-climbing called simulated annealing [Kirkpatrick, Gelatt, and Vecchi, 1983]. As in hill climbing, in simulated annealing we are trying to optimize a cost function by searching. When minimizing the cost function, for example, a new configuration is accepted if it decreases the cost function. However, in simulated annealing, a configuration in which the cost function increases is accepted with a probability equal to e^(-ΔC/kT), where ΔC is the amount by which the cost function is increased, k is a constant, and T is the current 'temperature'. Thus, small excursions in the wrong direction are accepted with high probability, while large excursions are accepted with small probability. The idea behind temperature is that, at the beginning of the search, the temperature is high, so that it is possible to jump out of large but sub-optimal valleys. As the search progresses, T is lowered, so that gradually a global optimum is approached. Simulated annealing is particularly successful in solving those problems where there exists a well-defined and well-behaved objective function. An example application is partitioning components among circuit boards, where the objective is to minimize the number of connections between the boards.
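The acceptance rule just described can be written down as a short generic skeleton. This is a sketch only; the parameter names, geometric cooling schedule, and step counts are illustrative choices of mine, not part of Kirkpatrick et al.'s formulation:

```python
import math
import random

def anneal(cost, neighbor, state, t0=10.0, cooling=0.95, steps=1000, k=1.0):
    """Generic simulated annealing: always accept a downhill move; accept an
    uphill move of size dC with probability e^(-dC / (k*T)), lowering T as
    the search proceeds."""
    t, best = t0, state
    for _ in range(steps):
        cand = neighbor(state)
        d = cost(cand) - cost(state)
        if d <= 0 or random.random() < math.exp(-d / (k * t)):
            state = cand
            if cost(state) < cost(best):
                best = state
        t *= cooling  # geometric cooling schedule (one common choice)
    return best
```

For partitioning, `state` would be an assignment of circuit elements to processors and `neighbor` would move one element between processors; the hard part, as discussed below, is finding a cost function that is both predictive and cheap to evaluate.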

In a problem more akin to the task of partitioning simulations, Oflazer has used simulated annealing for partitioning production systems onto ten to twenty processors. His cost function tries to maximize the parallel evaluation of productions. Using simulated annealing, he is able to create partitions that are 10% to 20% better than those created using simpler but faster techniques, such as round-robin partitioning [Oflazer, 1984]. In Chapter VI I discuss how simulated annealing can be applied to the task of partitioning simulations, but unfortunately, I was unable to find a cost function that both provides good results and is computationally tractable.

III. A Switch-Level Simulation Algorithm

In this chapter I present a detailed description of the FAST-1 switch-level simulation algorithm. This presentation has two major parts: a description of a switch-level model for representing MOS circuits, and a relaxation algorithm for finding the steady-state response of a circuit.

The FAST-1 switch-level model and simulation algorithm resemble in many ways the work of Bryant [Bryant, 1984], Hayes [Hayes, 1982], Terman [Terman, 1983], and others. This is not surprising, since within the mass of papers published on switch-level simulation there are only a few fundamental ideas, many of them Bryant's. The primary difference between the FAST-1 model and algorithm and previous work is in the details. But these details are important. In particular, because previous algorithms are intended for implementation on a general-purpose computer, the algorithms are usually expressed in a form that takes advantage of the features available in modern programming languages, for example, dynamic memory allocation and complex data structures. Because the FAST-1 algorithm is intended to be implemented using a processor that is very specialized and lacks many of the features of a general-purpose computer, the algorithm uses only those features that can be easily implemented using the general FAST-1 architecture.

After presenting the FAST-1 algorithm for switch-level simulation, I show that it is equivalent to Bryant's Mossim II algorithm [Bryant, 1984], in that given a change in the inputs to a network, both compute the same steady state. I chose Mossim II because it is well known, and because Bryant has developed a thorough theoretical characterization of it.

Both the model and the algorithm are described without considering whether they will be implemented using hardware or using software; thus either implementation medium can be used. After understanding the algorithm, you will undoubtedly have some idea how this is done. Nevertheless, in Chapter IV I spend an entire section discussing how the FAST-1 architecture is used to implement the algorithm, while in this chapter I discuss how circuits are compiled into FAST-1 programs.


III.1. Notation

All algorithms herein are described in a pseudo-programming-language style, in order to keep the algorithms concise. I at times use control structures such as "Until there are no more x's do." When reading the algorithms, do not worry about how these control structures are implemented. In all instances I eventually discuss how it is done.

In describing data structures I use a tuple notation of the form (NAME1, NAME2, ..., NAMEn) to represent a record containing n different fields. Each of the names describes the purpose of the field. For example, (NODECONNECTIONS, NODESIZE) is a record containing two fields. Names that are in bold type, such as NODECONNECTIONS, represent vectors of zero or more elements. Individual elements in a vector are referenced using subscripts, for example NODECONNECTIONSi. Subcomponents of a tuple are referenced using a dot notation. For example, if NODECONNECTIONSi is the tuple (TRANSISTOR, TERMINAL), then the TERMINAL field is referenced as NODECONNECTIONSi.TERMINAL. For brevity's sake, in situations where there is no possible ambiguity, I often leave out the prefix part of a name and simply write, for example, TERMINAL.
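The tuple notation maps naturally onto record types in a conventional language. The following sketch uses Python dataclasses; the lowercase field names are my own transliterations of the small-caps names used in the text:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Transistor:
    gate: int    # node index of the GATE terminal
    source: int  # node index of the SOURCE terminal
    drain: int   # node index of the DRAIN terminal
    size0: int   # SIZE0: channel conductivity when State(GATE) = 0
    size1: int   # SIZE1: channel conductivity when State(GATE) = 1

@dataclass
class Node:
    # NODECONNECTIONS: a vector of (transistor index, terminal name) pairs
    connections: List[Tuple[int, str]]
    size: int    # NODESIZE: capacitive strength

# Dot notation, e.g. NODECONNECTIONSi.TERMINAL, becomes ordinary field access:
n = Node(connections=[(0, 'drain')], size=1)
assert n.connections[0][1] == 'drain'
```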

III.2. A Switch-Level Model of MOS Circuits

Figure III-1 is a schematic for a CMOS implementation of the adder originally shown in Figure 1-1. Essentially, an MOS circuit consists of transistors connected by wires. Logically, we can consider all transistors that are connected to the same wire as meeting at a single point called a node. For our purposes, an MOS circuit can be viewed as an undirected graph whose vertices are the nodes and transistors of the circuit and whose arcs represent their interconnection. This graph is bipartite: nodes are always connected to transistors, and transistors are always connected to nodes. While the degree of a node can be arbitrary, the degree of a transistor is always three.

Associated with each node is some amount of capacitance, which means that the node may be able to store charge. The state of the circuit can be determined by examining the node voltages. Unlike the logic circuit of Figure 1-1, in which each node is driven by only a single output, in MOS circuits each node can have several sources of charge. In an abstract sense, a node must 'compute' its voltage based on its conductance to sources and sinks of charge.

The steady-state behavior of a circuit is calculated using a model of that behavior. The switch-level model is designed to accurately describe the behavior of most digital circuits while being computationally efficient. In the FAST-1, transistors and nodes are modeled using simple functions that operate on signals. A signal describes the conductance to sources and sinks of charge via some path through the switch-level graph. Transistors can alter this conductance. The function of a node is to combine two or more signals into a single signal that


Figure III-1: A CMOS implementation of a one-bit full-adder.

reflects the net effect of the individual signals. The following sections describe this model in detail.

III.2.1. Signals

One of the accuracy-versus-time tradeoffs made in switch-level simulation is to not calculate voltages and currents exactly. Rather, both are modeled abstractly using signals that have state and strength, corresponding roughly to voltage and to the ability to source or sink charge. Whereas in a circuit simulator we calculate the voltage on a wire and the current flowing through a wire, in a switch-level simulator we calculate the state and strength of each node.

The FAST-1 model uses three logic states, {0, 1, X}, where 0 represents a low voltage, 1 represents a high voltage, and X represents either an unknown voltage or a voltage that is between 0 and 1. Given a signal S, the function State(S) returns the signal's state. Finally, for certain representations of signals the logic states are partially ordered, as follows:

      X
     / \
    1   0

That is, X > 1, X > 0, and 0 # 1.

Conductance is modeled using a finite number of strengths. These strengths are a totally ordered set {s1, s2, ..., sn}, where s1 < s2 < ... < sn. The function Strength(S) returns a signal's maximum strength. The function Limit(S, si) returns a signal R such that State(R) equals State(S) and Strength(R) equals Min(si, Strength(S)). As we shall see, the Limit function is used to determine what happens to a signal when it passes through a transistor that has finite conductance.

In each of the representations of signals described in this dissertation, there are a finite number of different signals. Each set of signals is partially ordered. That is, in addition to the standard relations <, >, and = being defined, there is a fourth relation, #. If S and R are signals, S # R is true if no ordering exists between S and R. In other words, S and R are incomparable. Whenever two signals are incomparable, the effect of combining them at a node is to produce a signal whose state is X.

Finally, there exists a signal, ⊥, called bottom, where by definition State(⊥) = X, Strength(⊥) = s1, and for all signals S, S >= ⊥. Intuitively, ⊥ represents the condition, sometimes called tristate, of a completely undriven wire. In contrast, for the power (Vdd) and Ground supplies of a circuit, Strength(Vdd) = Strength(Ground) = sn, State(Vdd) = 1, and State(Ground) = 0.

III.2.2. Transistors

Transistors are modeled as ideal resistive switches. Given a signal on the source node of a transistor, the signal that the transistor contributes to its drain node depends on the transistor's size as a function of the state of the node connected to the transistor's gate.

Each transistor is represented by the pair (TRANSCONNECTIONS, TRANSSIZES). TRANSCONNECTIONS is the 3-tuple (GATE, SOURCE, DRAIN), and specifies the nodes to which the transistor is connected. TRANSSIZES is the pair (SIZE0, SIZE1), where SIZE0 represents the conductivity of the transistor's channel, that is, the conductivity of the path between a transistor's SOURCE node and its DRAIN node, when State(GATE) = 0, and SIZE1 is the channel conductivity when State(GATE) = 1. The transistor SIZES, {t1, t2, ..., tn}, are an ordered subset of the signal strengths, where t1 < t2 < ... < tn, with t1 = s1 = Strength(⊥), and tn < sn. That is, the weakest signal strength equals the weakest transistor size, and the strongest signal strength is greater than the strongest transistor size.

Given a signal G, transistor SIZES ti and tj, and a signal S, the function Trans(G, ti, tj, S) returns a signal that describes how a transistor affects S, as a function of the transistor's sizes and the state of its gate. Given a transistor T,

its effect on its DRAIN node is Trans(T.GATE, T.SIZE0, T.SIZE1, T.SOURCE) and its effect on its SOURCE node is Trans(T.GATE, T.SIZE0, T.SIZE1, T.DRAIN).

Another way to think about the Trans function is that it models a state-controlled 'strength limiter'. The strength of the output signal is the minimum of the strength determined by the state of the gate and the strength of the input signal. Though the exact definition of Trans depends on how signals are actually modeled, there are a number of important invariant properties: it is

non-increasing in the strength of S, in other words, Strength(Trans(G, ti, tj, S)) <= Strength(S); the state of its result is always comparable with the state of S, thus either State(Trans(G, ti, tj, S)) = State(S) or State(Trans(G, ti, tj, S)) = X; and finally, it is idempotent: Trans(G, ti, tj, S) = Trans(G, ti, tj, Trans(G, ti, tj, S)).

III.2.3. Nodes

The interconnection of transistors is modeled as occurring at a single point, called a node. In a circuit-level model, Kirchhoff's laws are used to determine the voltage of a node. In the switch-level model, the state of a node is determined using the least upper bound (LUB) of the signals contributed to the node by each of the transistors connected to the node. The LUB of two signals S and R is a signal T such that T >= S and T >= R, and such that there does not exist a signal U such that U < T, U >= S, and U >= R. Note that the least upper bound function is commutative (LUB(S, R) = LUB(R, S)), associative (LUB(S, LUB(R, W)) = LUB(LUB(S, R), W)), and non-decreasing (LUB(S, R) >= S and LUB(S, R) >= R).

By using the least upper bound to determine the state of a node, we are saying that the state of a node is determined by its inputs of greatest strength. In the case where there are two or more signals that have the same strength but different states, the resulting value of the node is a signal of the same strength but whose state is X. Notice that in using the least upper bound we only approximate physical reality. For example, in the physical world it is possible that the combined influence of two transistors of size ti that are connected in parallel is stronger than the influence of a single transistor of size ti+1. However, using the least upper bound and the switch-level model, two weak transistors connected in parallel appear the same as a single weak transistor. A similar problem arises with transistors connected in series. This is a fundamental limitation of the switch-level model. In simulating most digital designs this limitation is not severe, though it does mean that circuits that appear to be correct when simulated may not actually be electrically correct, and vice versa. Adding more sizes may solve the problem for some circuits. Nevertheless, even with an arbitrary number of transistor sizes a switch-level simulator does not become a circuit-level simulator. In any event, this problem can be solved in part by using a static checker that determines if any worst-case transistor ratios violate any electrical design rules [Baker and Terman, 1980].

Each node is represented by the pair (NODECONNECTIONS, NODESIZE).

NODECONNECTIONS is a vector of pairs (TRANSISTOR, TERMINAL) that describe the node's connections to transistors. NODESIZE describes the capacitance of the node and is from the set {c1, ..., cn}, where c1 < c2 < ... < cn. The set of node sizes is also a subset of the signal strengths, where c1 = t1 = s1, and cn < t2. In other words, the weakest node size equals Strength(⊥), and the greatest node size is less than all non-Z transistor sizes. The capacitance of a node is used to model the node's ability to store charge. The effect of this stored charge is calculated as Limit(S, NODESIZE), where S is the previous steady-state value of the node. When calculating a node's state, this effect is included as though it came from a transistor.

So, if T1, T2, ..., Tn are the transistors whose sources, or drains, are connected to node N, then the signal value of N is LUB(Trans(Value(T1.GATE), T1.SIZE0, T1.SIZE1, Value(T1.DRAIN)), Trans(Value(T2.GATE), T2.SIZE0, T2.SIZE1, Value(T2.DRAIN)), ..., Trans(Value(Tn.GATE), Tn.SIZE0, Tn.SIZE1, Value(Tn.DRAIN)), Limit(S, NODESIZE)).

III.2.4. Strengths and Sizes

Experience has shown that relatively few strengths are needed to model the signals that occur in the majority of MOS circuits. For most circuits three node sizes and three transistor sizes suffice, though in some instances more sizes may be necessary to accurately model how the circuit works, even at the functional level. In the examples that follow I use these representative strengths: Z < WeakCap < StrongCap < Weak < Strong < External, which I often abbreviate as Z, WC, SC, W, S, E, respectively. Thus Strength(⊥) = Z and Strength(Vdd) = Strength(Ground) = External. The node sizes are Z, WeakCap, and StrongCap, and the transistor sizes are Z, Weak, and Strong. A typical p-channel transistor has SIZE0 = Strong and SIZE1 = Z, an n-channel transistor has SIZE0 = Z and SIZE1 = Strong, and a depletion-mode transistor has, in the general case, SIZE0 = Weak and SIZE1 = Strong, although an ordinary pullup has SIZE0 = SIZE1 = Weak.
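The representative scale can be made concrete by encoding strengths as integers, so that the required orderings become simple comparisons. The integer encoding is my own convenience; the text only requires a total order:

```python
# The representative strength scale from the text, encoded as integers so
# that Z < WeakCap < StrongCap < Weak < Strong < External holds.
Z, WEAKCAP, STRONGCAP, WEAK, STRONG, EXTERNAL = range(6)

NODE_SIZES = (Z, WEAKCAP, STRONGCAP)   # c1 = Z; all less than non-Z sizes
TRANSISTOR_SIZES = (Z, WEAK, STRONG)   # t1 = Z = s1; tn < sn = External

# Typical device sizings from the text:
P_CHANNEL = {'size0': STRONG, 'size1': Z}       # conducts when gate = 0
N_CHANNEL = {'size0': Z, 'size1': STRONG}       # conducts when gate = 1
DEPLETION = {'size0': WEAK, 'size1': STRONG}    # general case
PULLUP    = {'size0': WEAK, 'size1': WEAK}      # ordinary depletion pullup

# The ordering constraints on sizes, stated earlier, hold for this scale:
assert max(NODE_SIZES) < min(s for s in TRANSISTOR_SIZES if s > Z)
assert max(TRANSISTOR_SIZES) < EXTERNAL
```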

III.2.5. Actual Signal Models

There are many possible representations of signals that obey the above definitions. In this section I describe two such signal representations and their network functions: cross-product signals and interval signals. Cross-product signals are interesting for two reasons. First, they are a simple representation of signals. They allow us to see how the network functions are defined without getting lost in a lot of detail. Second, in the development of switch-level simulators, the cross-product representation of signals was one of the first models developed. As we shall see, the problem with the cross-product representation of signals is that it is too pessimistic: the X state is generated

1. Or, analogously, SOURCE.

when a 0 or a 1 is more appropriate. By retaining more information, interval signals solve many of the problems associated with cross-product signals. They model more closely the way circuits actually work. Moreover, interval signals provide a natural framework for representing additional circuit phenomena, such as threshold drops, as discussed at the end of this section. Because of all of these advantages, the experiments discussed in later chapters use interval signals.

I have chosen to discuss the representation of signals independently from the relaxation algorithm that uses them. The advantage of this approach is that it clearly shows that there is an accuracy-versus-space tradeoff that is independent of the simulation algorithm.

III.2.5.1. Cross-Product Signals

Perhaps the simplest representation of a signal that follows the above definitions is the pair (STRENGTH, STATE), where STRENGTH is from the set of totally ordered strengths and STATE is from the set of partially ordered states. Using this representation, Ground is represented as (External, 0).

This representation is called a cross-product signal, and many authors, for example, Hayes [Hayes, 1982] and Ullman [Ullman, 1983], have described switch-level simulators based on this model. Using the cross-product representation, it is now possible to define the network functions precisely. To avoid complicating the function definitions, it is convenient to think of any signal whose STRENGTH equals s1 as representing ⊥.

S and R are cross-product signals, s is a signal strength, and t0 and t1 are transistor sizes.

State(S) := if S.STRENGTH = s1 then X else S.STATE

Strength(S) := S.STRENGTH

Limit(S, s) := (Min(s, S.STRENGTH), S.STATE)

LUB(R, S) := if R.STRENGTH > S.STRENGTH then R
             else if S.STRENGTH > R.STRENGTH then S
             else -- R.STRENGTH = S.STRENGTH
                  if State(R) = State(S) then R
                  else (R.STRENGTH, X)

Trans(G, t0, t1, S) := if State(G) = 0 then Limit(S, t0)
                       else if State(G) = 1 then Limit(S, t1)
                       else if t0 = t1 then Limit(S, t0)
                       else (Min(S.STRENGTH, Max(t0, t1)), X)

Algorithm III-1: The network functions for cross-product signals.
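The cross-product network functions translate directly into executable code. The following Python sketch is my own rendering, with signals as (strength, state) tuples and the integer strength encoding an assumption for concreteness; any signal of strength Z stands for bottom:

```python
# Cross-product signals as (strength, state) tuples; strength Z represents
# bottom, whose state is always X.
Z, WC, SC, W, S, E = range(6)  # representative strengths

def state(sig):
    return 'X' if sig[0] == Z else sig[1]

def strength(sig):
    return sig[0]

def limit(sig, s):
    """Limit: keep the state, cap the strength at s."""
    return (min(s, sig[0]), sig[1])

def lub(r, s):
    """Least upper bound: higher strength wins; equal strengths with
    differing states yield X at that strength."""
    if r[0] > s[0]:
        return r
    if s[0] > r[0]:
        return s
    return r if state(r) == state(s) else (r[0], 'X')

def trans(g, t0, t1, sig):
    """Trans: a state-controlled strength limiter."""
    if state(g) == '0':
        return limit(sig, t0)
    if state(g) == '1':
        return limit(sig, t1)
    if t0 == t1:           # gate is X, but the sizes agree
        return limit(sig, t0)
    return (min(sig[0], max(t0, t1)), 'X')
```

Note that `trans` with an X gate and unequal sizes is where the cross-product model's pessimism originates: the state collapses to X even when only one polarity is actually reachable.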

Figure III-2 shows an NMOS circuit with its nodes and transistors annotated

[Figure III-2 depicts an NMOS circuit between Vdd and Ground: enhancement transistors with Size0 = Z and Size1 = Strong, a pullup T4 with Size0 = Weak, nodes annotated with capacitances such as CapSize = WeakCap, and initial steady-state signal values such as E0 and S0.]

Figure III-2: An NMOS circuit represented as a switch-level network.

with their sizes and initial steady-state signal values. The most interesting node is N1 and, to illustrate how cross-product signals are used, let us examine how its value is calculated. The contribution of transistor T1 to node N1 is Trans(E0, Z, Strong, E0), which equals (Z, X) or, in other words, ⊥. Similarly, the contribution of transistor T2 is Trans(E1, Z, Strong, E0), which equals S0. The contribution of transistor T3 is Trans(E0, Weak, Weak, E1), which equals W1. Finally, the contribution of transistor T4 is Trans(E1, Z, Strong, S0), which equals S0. So, node N1 has four signals contributing to its value: ⊥, S0, W1, S0. Its value is their least upper bound, which equals S0, as shown in the figure. Though it may seem strange that transistor T4 contributes a value of S0 to node N1, remember that transistors are bidirectional, and since node N2 has a value of S0, it is reasonable that transistor T4 contributes an S0 to node N1. As discussed in Section III.3, the bidirectionality of transistors is the fundamental source of complexity in switch-level simulation algorithms.
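The four contributions to N1 can be checked mechanically. This sketch repeats a compact version of the cross-product functions (the tuple and integer encodings are my own, as before) and reproduces the calculation above:

```python
# Reproducing the node-N1 calculation from the text with a compact
# cross-product model: signals are (strength, state); strength Z is bottom.
Z, WC, SC, W, S, E = range(6)

def state(sig):
    return 'X' if sig[0] == Z else sig[1]

def limit(sig, s):
    return (min(s, sig[0]), sig[1])

def lub(r, s):
    if r[0] != s[0]:
        return r if r[0] > s[0] else s
    return r if state(r) == state(s) else (r[0], 'X')

def trans(g, t0, t1, sig):
    if state(g) == '0':
        return limit(sig, t0)
    if state(g) == '1':
        return limit(sig, t1)
    if t0 == t1:
        return limit(sig, t0)
    return (min(sig[0], max(t0, t1)), 'X')

E0, E1, S0 = (E, '0'), (E, '1'), (S, '0')

contribs = [
    trans(E0, Z, S, E0),  # T1: gate at 0  -> bottom
    trans(E1, Z, S, E0),  # T2: gate at 1  -> S0
    trans(E0, W, W, E1),  # T3: pullup     -> W1
    trans(E1, Z, S, S0),  # T4: gate at 1  -> S0 (bidirectional, from N2)
]
n1 = contribs[0]
for c in contribs[1:]:
    n1 = lub(n1, c)
assert n1 == S0  # N1 settles at S0, as in the text
```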

What happens if the gate of transistor T4 becomes 0? In steady state, node N1 remains at S0, since transistor T2 is turned on. But node N2 becomes WC0. This is because all of the transistors contributing to node N2 are turned off, and all inputs to the node are ⊥. Therefore, the node assumes the capacitive strength of its previous value, that is, Limit(S0, WeakCap).

A more interesting case is what happens to node N1 if the gate of transistor T1 becomes X. In this case we have LUB(SX, S0, W1, S0), which equals SX. Unfortunately, this value is overly pessimistic since, in reality, as long as State(T2.GATE) = 1 and, therefore, T2 is turned on, node N1 will remain at S0; the state of T1 is irrelevant.

2. Note that the value of the gate input for this kind of pullup is irrelevant.

3. Here and elsewhere, when I say 'the gate', I technically mean 'the state of the node to which the gate is connected'.

There are several ways we might choose to correct this problem. The first is to make certain that it really is a problem. My own experience confirms that of others in showing that this situation arises quite often in simulating actual circuits and that simulators that are too pessimistic in this regard are of limited usefulness.

Another possibility is to use a different model of the circuit, for example, one that is less discrete. Terman's RSIM simulator [Terman, 1983] represents resistance and capacitance as continuous values and solves for node values using systems of linear equations. However, as Bryant notes, this method does not always work well, especially when X's are present in the circuit [Bryant, 1984].

Finally, we might try using a representation of signals that has more information. In particular, a value of SX for the output of transistor T1 indicates that, among other things, the output could be S1, which is not correct. In reality, the only possible outputs are S0 and ⊥.4 Some researchers have tried to extend cross-product signals by including additional states, such as Z0 and Z1, that represent the X state more accurately. However, these ad hoc extensions still have problems. A more reasonable alternative is to use a signal representation that can represent, in a systematic way, the additional information required. The interval signals discussed below are just such a representation.

III.2.5.2. Interval Signals

When the gate of a transistor is X, it indicates that the transistor might be either turned on or turned off. The problem with the cross-product representation of signals is that it cannot represent this condition accurately. For example, when the gate of transistor T1 in Figure III-2 is X, its output is somewhere between ⊥ and S0. This output can be represented using an interval signal of the form [VALUE0, VALUE1], where VALUE0 and VALUE1 are of the form (STRENGTH, STATE), and the states are from the set {0, 1}. Thus, VALUE0 and VALUE1 can take on values such as S0, W1, or Z0. Intervals of this form are essentially the same as those described by Terman [Terman, 1983] and Flake [Flake, Moorby, and Musgrave, 1983]. More interesting is that the interval representation of signals is essentially equivalent to the 'up' and 'down' values that are used in Mossim II.

An interval signal corresponds exactly to an interval on the number line shown in Figure III-3, with VALUE0 being the leftmost endpoint and VALUE1 being the

4. Of course this is not really true either, in the sense that the resistance of an actual MOSFET is a function of its gate voltage. Nevertheless, it is true that with its source or drain connected to ground, the output voltage of the MOSFET is never going to approach Vdd.

rightmost endpoint. As long as the interval does not cross or include ⊥, the state of the signal is either 0 or 1. Any interval that crosses or includes ⊥ has a state of X. This corresponds to the intuitive notion that X is the state of a signal that could be 0, 1, or X.

[Figure III-3 shows a number line with ⊥ (strength Z) at its center; strengths increase outward through WeakCap, StrongCap, Weak, and Strong to External, toward state 0 on the left and state 1 on the right.]

Figure III-3: The range of values of an interval signal.

For example, the interval [(S, 0), (S, 0)] represents a signal that is identical to the cross-product signal S0. The interval [(S, 0), (S, 1)], or, in shorthand, [S0, S1], is identical to the cross-product signal SX. On the other hand, the interval signals [S0, WC1] and [S0, Z1] are signals whose state is X and for which there are no equivalent representations using cross-product signals.

The network functions for interval signals are defined below. While the functions for cross-product signals are fairly obvious, the functions for interval signals are somewhat trickier. The examples in Figure III-4 illustrate how least upper bound works for interval signals. As for the Trans function, when the gate is either a 0 or a 1, the function is essentially the same as the cross-product Trans function, except that the Limit function now must limit both VALUE0 and VALUE1. When the transistor's gate is X, however, we see that whereas for cross-product signals the output is always X, for interval signals the output is an interval that may still have the same state as the input signal.

In the previous example using cross-product signals, when the gate of transistor T1 became X, node N1 became SX. However, using interval signals, the output of transistor T1 is [S0, Z] and the output of transistor T2 is [S0, S0], so the value of node N1 is LUB([S0, Z], [S0, S0]), which equals [S0, S0], a signal whose state is 0 instead of X.

Using the same circuit it is possible to illustrate another situation where interval signals are necessary in order to avoid computing a result that is too pessimistic. If node N1 is S0 and the gate of transistor T4 goes from 1 to X, then, using cross-product signals, node N2 becomes SX. But, using interval signals, it becomes [S0, WC0], a signal whose state is 0 instead of X.

Even though interval signals appear to model more closely how MOS circuits operate, there are still circuits for which the predicted operation of a circuit will differ from the actual operation. A primary source of these differences is when the model does not take into account various aspects of circuit behavior. For example, I have already mentioned that the switch-level representation fails to

Figure III-4: Examples of the least upper bound function on interval signals: (a) when both signals are of the same non-X state, (b) when the strength of one signal is always greater than the other, and (c) when the signal strengths overlap. In each set, the least upper bound is represented by the bold, top line.

accurately model the effect that series and parallel connections have on resistance.

Another circuit phenomenon that is worth considering is threshold drops. So far, transistors have been modeled as ideal resistive switches. Yet, in the physical world, transistors do not necessarily conduct signals of one state as well as the other. For example, when the gate of an n-channel enhancement-mode transistor is at Vdd and its source is at Vdd, its drain is at Vdd - Vth, where Vth is the turn-on threshold of the transistor. It is possible to construct circuits in which there are so many threshold drops that the output of the circuit is near 0 volts instead of Vdd. Once again, an electrical rules checker can generally verify that no such circuit configurations exist. This is particularly true in NMOS. But in CMOS, there are situations where a static analysis may detect a hazard where dynamically none exists. In the next section, I discuss how interval signals can be extended to model threshold drops.

III.2.6. Modeling Threshold Drops

As mentioned above, threshold drops occur because transistors are not ideal linear devices. Modeling threshold drops is, however, not difficult. To start, we need to be able to model the state of a signal that is a threshold below 1 or a threshold above 0, which I represent as 1* and 0*, respectively. This new set of five states has the following partial order:

      X
     / \
    0   1
    |   |
    0*  1*

S and R are interval signals, s is a signal strength, t0 and t1 are transistor sizes.

State(S) :=
    If S.VALUE0.STATE ≠ S.VALUE1.STATE
       OR S.VALUE0.STRENGTH = Z OR S.VALUE1.STRENGTH = Z
    then X
    else S.VALUE0.STATE    -- the states of both halves are the same

Strength(S) := Max(S.VALUE0.STRENGTH, S.VALUE1.STRENGTH)

HalfLimit(VALUE, s) := (Min(VALUE.STRENGTH, s), VALUE.STATE)

Limit(S, s) := [HalfLimit(S.VALUE0, s), HalfLimit(S.VALUE1, s)]

HalfLUB(R, S, i) :=
    If (R.VALUEi.STRENGTH > S.VALUEi.STRENGTH)
       OR (R.VALUEi.STRENGTH = S.VALUEi.STRENGTH AND R.VALUEi.STATE = i)
    then R.VALUEi
    else S.VALUEi

LUB(R, S) := [HalfLUB(R, S, 0), HalfLUB(R, S, 1)]

Trans(G, t0, t1, S) :=
    If State(G) = 0 then Limit(S, t0)
    else if State(G) = 1 then Limit(S, t1)
    else    -- the gate is X
        if State(S) = 0 then
            [HalfLimit(S.VALUE0, Max(t0, t1)), HalfLimit(S.VALUE1, Min(t0, t1))]
        else if State(S) = 1 then
            [HalfLimit(S.VALUE0, Min(t0, t1)), HalfLimit(S.VALUE1, Max(t0, t1))]
        else Limit(S, Max(t0, t1))

Algorithm III-2: The network functions for interval signals.

In fact, it is possible for there to be multiple threshold drops. For simplicity's sake, I have chosen to discuss only a single threshold drop, with additional threshold drops being represented by the X state. However, the general scheme I describe can be easily extended to incorporate an arbitrary number of threshold drops.

The model of nodes remains the same, although the details of LUB do change. The model of transistors must be modified, however, to take into account what happens when either the gate or the input of a transistor has a state of 1* or 0*. Because the resistance of a transistor depends on its gate voltage, and there are now four possible voltages, 0, 0*, 1, and 1*, four transistor sizes are needed: SIZE0, SIZE0*, SIZE1, and SIZE1*. Furthermore, the situations under which a threshold drop occurs need to be described. Associated with each transistor tuple is a new element called DROPS that consists of the pair of booleans (DROP0, DROP1). If DROP0 is true, it indicates that a threshold drop occurs when the state of the gate is either 0 or 0*. If DROP1 is true, it indicates that a threshold drop occurs when the state of the gate is either 1 or 1*. In general, when modeling n threshold drops, there need to be 2(n + 1) + 1 states, 2(n + 1) SIZEs, and 2 DROPS.

Using this model, suppose a transistor has SIZE1 equal to Strong, DROP1 equal to true, and DROP0 equal to false. When the transistor's gate is 1 and its input is S1, its output is S1*. If, on the other hand, its input is S0, its output will still be S0. This example illustrates a more accurate model for an n-type enhancement-mode transistor.
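The drop rule can be sketched as follows. The encoding is assumed: strengths are integers, states are 0, 1, '0*', and '1*'; `sizes` maps each gate state to one of the four transistor sizes, and `drops` is the (DROP0, DROP1) pair. Which input state gets degraded is also an assumption here (the state matching the active drop flag), chosen because it reproduces the n-type example in the text.

```python
def drop_state(st):
    # One threshold below 1 or above 0; any further drop is treated as X.
    return {0: '0*', 1: '1*'}.get(st, 'X')

def trans_half_with_drops(gate_state, sizes, drops, inp):
    """Output half-signal of a transistor for one input half-signal.
    sizes: gate state -> size; drops: (DROP0, DROP1); inp: (strength, state)."""
    strength, st = inp
    out_strength = min(strength, sizes[gate_state])
    drop0, drop1 = drops
    if drop1 and gate_state in (1, '1*') and st == 1:
        st = drop_state(st)          # e.g. an n-type passing a 1
    elif drop0 and gate_state in (0, '0*') and st == 0:
        st = drop_state(st)          # e.g. a p-type passing a 0
    return (out_strength, st)

# The n-type example from the text: SIZE1 = Strong, DROP1 true, DROP0 false.
STRONG = 4
nfet_sizes = {1: STRONG, '1*': STRONG - 1, 0: 0, '0*': 0}  # illustrative sizes
nfet_drops = (False, True)
```

With the gate at 1, an S1 input yields S1* while an S0 input passes through undropped, matching the example above.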

Unfortunately, it is not sufficient to modify just the model of a transistor; we must also change the representation of signals. A moment's reflection and a simple example demonstrate why this is necessary. Recall that our definition of intervals is based on the 0/1 number line. Now that there are two more states, we must use a plane to represent signals, and therefore an interval signal must now have the form

[VALUE0, VALUE0*, VALUE1*, VALUE1]

Figure III-5 illustrates why modeling threshold drops is useful. Assume that the gate of transistor T1 is 1*. If threshold drops are not being modeled, the output of T1 is S0, as is the value of node N1. When thresholds are modeled, however, the output of T1 becomes W0 and node N1 becomes WX, a much closer approximation to physical reality. Moreover, if the size of the pullup is changed to be VeryWeak, the value of node N1 instead becomes W0, thereby modeling a restoring NMOS inverter.

[Figure III-5 shows two versions of the circuit, one with a Weak pullup, where node N1 becomes WX, and one with a VeryWeak pullup, where node N1 becomes W0.]

Figure III-5: Circuits illustrating the modeling of threshold drops.

III.3. Determining the Steady State of a Network

Given the models and network functions for signals, transistors, and nodes, an algorithm is needed that uses the models to determine the steady state of a network. As discussed in Section III.2.3, finding the steady state of a network

involves determining the value of each node. One way of solving this problem is to perform a recursive tree walk starting at each node, a procedure Terman calls 'global' simulation. An alternative method is to use a relaxation algorithm that uses only local information. Such an algorithm has the flavor of Algorithm I-1, the basic event-driven simulation algorithm. Using this algorithm, a node's value is calculated from its set of input5 signals: the contributions of transistors, external signals, and the node's capacitive value. Whenever one of a node's input signals changes, the node's value is recalculated and this new value is propagated to the transistors connected to the node. Similarly, transistors also have inputs. When one of these inputs changes, the transistor recomputes the values of its contribution to its source and drain nodes and propagates these values. This process continues until the node values converge to a steady state.

In the following sections, I present such an algorithm for switch-level simulation. In order to help the reader understand the final algorithm, I first present an incorrect algorithm that is derived directly from Algorithm I-1 and show why it doesn't work. I then show how the algorithm can be modified to perform switch-level simulation correctly. Following this, I discuss the complexity of the algorithm and show that it is equivalent to Bryant's MossimII algorithm. The reader familiar with switch-level simulation may want to skip directly to Section III.3.2.

III.3.1. An Incorrect Switch-Level Simulation Algorithm

In order to describe the algorithms, the representations of nodes and transistors need to be more concrete. Therefore, let each node be represented by the tuple

(INPUTS, OUTPUTS, OUTPUTVALUE, NODESIZE)

where INPUTS is a vector of signal values that are the contributions from the sources and drains of transistors and, if required, an external input value. OUTPUTS is a vector of pairs (TRANSISTOR, TERMINAL) that specify the terminals, that is, the GATES, SOURCES, and DRAINS, of transistors connected to this node. OUTPUTVALUE is the most recent value of the node and NODESIZE is the capacitive size of the node. This representation is very similar to the one in Section III.2.3. The NODESIZE field is the same as before, and the OUTPUTS vector is essentially identical to the CONNECTIONS vector. In addition, there is now an INPUTS vector and an OUTPUTVALUE field. In some ways the INPUTS vector is redundant, as we can use the pointers in the OUTPUTS vector to access the same information. However, there are advantages to this representation, as I show later in this chapter and in the chapter on the FAST-1 hardware.

As might be expected, the more concrete representation of transistors is similar to the representation presented in Section III.2.2. Each transistor is represented by the tuple

5. Not to be confused with Bryant's input nodes, which I call external signals.

(INPUTS, OUTPUTS, OUTPUTVALUES, SIZES)

INPUTS is the triple (GATEVALUE, SOURCEVALUE, DRAINVALUE), where each of the elements is a signal value that is a copy of the most recent OUTPUTVALUE of the associated node. OUTPUTS is the pair (SOURCEOUTPUT, DRAINOUTPUT), where each of these elements is the pair (NODE, INDEX) that specifies the node being driven by this output and an index into that node's INPUTS. OUTPUTVALUES is the pair (SOURCEOUTPUTVALUE, DRAINOUTPUTVALUE), where each element is the most recent signal value contributed by this transistor to the nodes connected to its source and drain. SIZES is a pair of sizes, (SIZE0, SIZE1), that represent the size of the transistor when its gate is 0 or 1, respectively.

Using these representations for nodes and transistors, and any of the signal models described above, Algorithm III-3 presents a naive attempt at adapting Algorithm I-1 to switch-level simulation.

While there is a node or transistor, NT, with a changed INPUT {
    If NT is a node {
        NewVal ← LUB(NT.INPUTS1, NT.INPUTS2, ..., NT.INPUTSn,
                     Limit(NT.OUTPUTVALUE, NT.NODESIZE))
        If NewVal ≠ NT.OUTPUTVALUE {
            NT.OUTPUTVALUE ← NewVal
            For each O ∈ NT.OUTPUTS
                O.TRANSISTOR.INPUTS[O.TERMINAL] ← NewVal
        }
    } else {    -- NT is a transistor
        NewVal ← Trans(NT.GATEVALUE, NT.SIZE0, NT.SIZE1, NT.SOURCEVALUE)
        If NewVal ≠ NT.DRAINOUTPUTVALUE {
            NT.DRAINOUTPUTVALUE ← NewVal
            NT.DRAINOUTPUT.NODE.INPUTS[NT.DRAINOUTPUT.INDEX] ← NewVal
        }
        NewVal ← Trans(NT.GATEVALUE, NT.SIZE0, NT.SIZE1, NT.DRAINVALUE)
        If NewVal ≠ NT.SOURCEOUTPUTVALUE {
            NT.SOURCEOUTPUTVALUE ← NewVal
            NT.SOURCEOUTPUT.NODE.INPUTS[NT.SOURCEOUTPUT.INDEX] ← NewVal
        }
    }
}

Algorithm III-3: An incorrect switch-level simulation algorithm.
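Reduced to its worklist skeleton, the relaxation structure of Algorithm III-3 can be sketched as below. This toy version collapses the signal algebra to bare integer strengths (max standing in for LUB, min for Limit and Trans) purely to show the event-driven propagation machinery; the classes and field names are invented for the sketch, and the bidirectional-feedback flaw discussed next is still present in it.

```python
from collections import deque

class Node:
    def __init__(self, cap=0):
        self.inputs = {}     # contribution per (transistor, terminal)
        self.value = 0
        self.cap = cap       # capacitive (NODESIZE) floor
        self.fanout = []     # transistors whose source or drain is this node

class Trans:
    def __init__(self, size, src, drain):
        self.size = size                 # conduction strength
        self.src, self.drain = src, drain
        src.fanout.append(self)
        drain.fanout.append(self)
        self.out = {'source': 0, 'drain': 0}

def relax(work):
    """Re-evaluate changed nodes and transistors until no values change."""
    work = deque(work)
    while work:
        nt = work.popleft()
        if isinstance(nt, Node):
            new = max([nt.cap, *nt.inputs.values()])
            if new != nt.value:
                nt.value = new
                work.extend(nt.fanout)   # propagate to connected transistors
        else:
            # A transistor drives each terminal from the opposite one.
            for out_end, in_node, dest in (('drain', nt.src, nt.drain),
                                           ('source', nt.drain, nt.src)):
                new = min(nt.size, in_node.value)
                if new != nt.out[out_end]:
                    nt.out[out_end] = new
                    dest.inputs[(id(nt), out_end)] = new
                    work.append(dest)

# Drive one end of a two-transistor chain; each node settles at the
# weakest strength along the path from the driven node.
a, b, c = Node(), Node(), Node()
t1, t2 = Trans(3, a, b), Trans(5, b, c)
a.inputs['external'] = 9
relax([a])
```

After relaxation the driven node holds strength 9 while both downstream nodes settle at 3, the strength of the weaker transistor on the path.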

Referring to the circuit in Figure III-6a, we can see how Algorithm III-3 works. The signals shown are the OUTPUTVALUES of the nodes and transistors and are the steady state values of the circuit. If node N3 is now externally driven to 1, it has a changed input; thus, it is a candidate for evaluation. Using Algorithm III-3, once node N3 is evaluated, its new OUTPUTVALUE will be propagated to transistor T1. T1 will then be evaluated and it will store its output, S0, on nodes N1 and Ground. When evaluated, Ground's value remains at E0, while node N1 becomes S0 and this value is propagated to T2, T3, and T4. These transistors are then evaluated and so on. The final steady state is shown in Figure III-6b.

Figure III-6: The simulation of an NMOS circuit. (a) Its initial steady state. (b) The steady state reached using Algorithm III-3 after node N3 is driven to 1.

Although Algorithm III-3 appears to work, two simple examples illustrate its shortcomings. Given the steady state shown in Figure III-6b, consider what happens if node N3 is driven to 0. Transistor T1 turns off and its contribution to node N1 becomes ⊥. Node N1 is then evaluated using the input values ⊥, ⊥, W1, and S0, from transistors T1, T2, T3, and T4, respectively. The LUB of these values is S0, as shown in Figure III-7, whereas it should be W1. This unexpected result is due to the bidirectionality of transistor T4 and the discrete nature of the simulation algorithm. When node N1 is evaluated, T4 is still contributing S0 to node N1 and therefore node N1 does not change. Obviously, this is a fatal shortcoming.

Figure III-7: A case where Algorithm III-3 gives the wrong result. Starting from the steady state in Figure III-6b, node N3 goes from 1 to 0. Nodes N1 and N2 should become W1, but incorrectly stay at S0.

Another instance where Algorithm III-3 does not work correctly is shown in Figures III-8a and III-8b. Given the steady state of Figure III-8a, suppose nodes N2 and N4 are both driven to 1. If the order of evaluation is N2, N4, T2, N1, T3, ..., then nodes N7 and N8 both have a correct value of W0. However, if the order of evaluation happens to be N2, N4, T3, N7, T4, N8, T4, T2, ..., then N7 and N8 will have an incorrect value of WX. There are several other manifestations of this problem. Suppose node N4 became X instead of 1. In this circumstance, depending on the order of evaluation, nodes N7 and N8 either become [W0, WC0] or [W0, WC1], with the former being the correct value.

To fully understand why Algorithm III-3 is incorrect, the following definition is useful. A transistor group is a connected component of the network graph formed when all Vdd, Ground, and gate edges are deleted. In other words, given a transistor T, the transistors and nodes that are part of the same transistor group are those that can be reached by traversing the graph following only edges that connect to either the source or drain of a transistor, while not following any edge that connects to Vdd, Ground, or any edge between a node and the gate of a transistor.
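This definition amounts to a connected-components computation, which can be sketched as follows. The netlist tuple format and all names here are invented for the illustration; the sketch simply deletes power and gate edges and traverses what remains.

```python
def transistor_groups(transistors, power_nodes=('Vdd', 'Ground')):
    """transistors: list of (name, gate, source, drain) tuples.
    Returns connected components over transistors and their source/drain
    nodes; gate edges and edges touching power nodes are ignored."""
    adj = {}
    def link(a, b):
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    for name, _gate, src, drain in transistors:
        for node in (src, drain):
            if node not in power_nodes:
                link(('t', name), ('n', node))     # source/drain edge kept
            else:
                adj.setdefault(('t', name), set())  # Vdd/Ground edge deleted
    groups, seen = [], set()
    for start in adj:                               # depth-first traversal
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            comp.add(v)
            stack.extend(adj[v])
        groups.append(comp)
    return groups

# Two pass transistors sharing node N1 form one group; T3, attached only
# to Vdd and an otherwise isolated node, forms its own.
netlist = [('T1', 'G1', 'N1', 'N2'), ('T2', 'G2', 'N1', 'N3'),
           ('T3', 'G3', 'Vdd', 'N4')]
```

Note that the gate terminals G1-G3 never contribute edges, so a transistor whose gate is driven from within some group still belongs only to the group of its source and drain.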

Given this definition it can be seen that the first problem with Algorithm III-3 is that within a transistor group there is feedback caused by the bidirectionality of transistors. The result is that the value of a node may remain unintentionally latched. Although the node can be driven more strongly, removing the original input drive has no effect.

Figure III-8: Another circuit where Algorithm III-3 works incorrectly. (a) The initial steady state. (b) The steady state reached when N2 becomes 1 and N4 becomes 1. As shown, the state of N7 is either 0 or X depending on the order in which nodes and transistors are evaluated.

An examination of other switch-level simulation algorithms, such as MossimII, reveals an elegant solution to this problem. Assume that the network is in steady state. Some external agent perturbs the network by externally driving some set of the nodes. Holding the gates of transistors at their present values, we must then calculate the new steady state values for the nodes in each transistor group. But in order to do this correctly we need to first reset all of the INPUTS. This is accomplished by breaking the simulation into three phases. During phase 1, when a node is evaluated it ignores its inputs and limits the strength of its OUTPUTVALUE to be no greater than the node's NODESIZE. Thus the OUTPUTVALUE reflects the node's ability to store charge. In phase 1, transistors are simulated as in Algorithm III-3. The result of phase 1 is that all nodes that can be reached via transistors that are not turned off are set to their capacitive value. Moreover, any node that is evaluated during phase 1 is marked for evaluation during phase 2. Phase 2 is essentially Algorithm III-3, described above. The only difference is that new values for the gates of transistors are not propagated to the transistor gates. Finally, during phase 3, the gates of transistors are updated to their new values, if any, and any transistor whose gate has changed is evaluated. This results in a new set of nodes that need to be evaluated during phase 1. This process continues until the circuit reaches a steady state. As discussed below, it is possible to build circuits, such as ring oscillators, that never reach quiescence.

At first it may not seem obvious why it is necessary to hold the gates of transistors constant during phase 2. The reason for doing this is two-fold. First, it provides a reasonable way of determining when we need to perform phase 1. That is, if we allow the gates of transistors to change during phase 2, we then have to revert back to phase 1, since otherwise we can potentially run into the same problem we are trying to solve. The second reason is that it provides a delay model that, while not necessarily very accurate, still conveys some information. The delay model is easily stated: "Within a transistor group signals propagate with zero delay. Between transistor groups, signals propagate with unit delay." Unit delay is an analogue of inverter delay.

Though the three-phase algorithm handles the problems that arise from transistors being bidirectional, it does not cope with the second set of problems. In this regard the fundamental flaw of Algorithm III-3 is that, depending on the order in which nodes and transistors are evaluated, paths from one node to all of the other nodes to which it is connected may or may not be evaluated independently. But the paths are not, in fact, independent. That is, the value of a node cannot be calculated by finding the value for each path independently and then taking the least upper bound of all of these values. This is because LUB(Trans(G, t0, t1, S), Trans(G, t0, t1, R)) does not necessarily equal Trans(G, t0, t1, LUB(R, S)). This problem is solved in MossimII by noting that for each node there is a 'blocking strength,' which is the minimum6 strength of the final signal value, without any regard for its state. In MossimII, before the actual node states are calculated, the blocking strength for each node is calculated such that during the equivalent of phase 2, above, a node is never set to a value

6. For cross-product signals, the maximum and minimum strength are the same. For interval signals, the maximum strength is as defined above, while the minimum strength is Min(VALUE0.STRENGTH, VALUE1.STRENGTH) if VALUE0.STATE = VALUE1.STATE, and Z otherwise.

whose strength is less than the blocking strength. In the example above, the blocking strength of node N1 is Strong and thus it should never have been set to W1.

Though one can imagine calculating the blocking strengths during a new phase between phases 1 and 2, Algorithm III-4 uses a different method. Notice that we need not actually calculate the blocking strength. Rather, we simply need to ensure that a node is never set to a value whose minimum strength is less than the minimum strength of the node's final value. Fortunately this is easily accomplished. Instead of allowing transistors to be evaluated in any order, we require that values first be propagated only through transistors whose minimum strength, as a function of the state of their gates, equals tn. Once a steady state has been reached, we allow values to be propagated through transistors whose minimum strength is at least tn-1, and so on down to t1.

III.3.2. The FAST-1 Switch-Level Simulation Algorithm

The modifications described above can be summarized as follows. Each step is still the basic event-driven relaxation algorithm.

1. Calculate the capacitive values of nodes.
2. For t := tn down to t1 do: propagate values only through transistors whose minimum dynamic strength is at least t.
3. Update transistor gates to their new values.
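The three steps above can be rendered schematically as one unit-delay cycle. The network interface used here (perturbed_nodes, relax, update_gates) is stubbed out and entirely hypothetical; the point of the sketch is only the phase ordering and the strength-descending sweep of phase 2.

```python
def unit_delay_step(net, strengths):
    # Phase 1: perturbed nodes decay to their capacitive values.
    for node in net.perturbed_nodes():
        node.limit_to_capacitance()
    # Phase 2: propagate strongest-first, so a node is never set to a
    # value weaker than its final (blocking) strength.
    for t in sorted(strengths, reverse=True):
        net.relax(min_strength=t)
    # Phase 3: only now are new gate values released; a True result
    # means some gate changed state and another step is needed.
    return net.update_gates()

class TraceNet:
    """A stand-in network that just records the order of operations."""
    def __init__(self):
        self.log = []
    def perturbed_nodes(self):
        self.log.append('phase1')
        return []
    def relax(self, min_strength):
        self.log.append(('phase2', min_strength))
    def update_gates(self):
        self.log.append('phase3')
        return False   # no gates changed: steady state reached

net = TraceNet()
again = unit_delay_step(net, strengths=[1, 2, 3])
```

Running the stub records phase 1, then phase-2 sweeps at strengths 3, 2, 1 in that order, then phase 3, mirroring the ordering the three steps prescribe.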

Algorithm III-4 is an implementation of these steps. It requires that some additional information be kept in each node and transistor tuple. While previously a node or transistor was either executable or not, now it is possible that while not executable during the present phase, it is executable during some later phase. In order to implement the delay mechanism a transistor may be marked as executable both during phase 1 or 2 and during phase 3. Thus, each node and transistor tuple now contains a two-element vector called PHASE, which is used to control when the node or transistor is evaluated and is set whenever an element of a node's or transistor's INPUTS is set. An element of the PHASE vector is either 'unset' or else equals the value of the phase in which the node or transistor should be evaluated. While various other means of implementing delay are discussed in Section III.3.4, the one presented in Algorithm III-4 works as follows. If at any time during phases 1 or 2 a new value is generated for the gate of a transistor, the transistor is marked as needing to be evaluated during phase 3 and the new value is stored in a new field in the transistor tuple called NEWGATEVALUE. When the transistor is evaluated during phase 3, it checks to see if the state of the NEWGATEVALUE and the state of GATEVALUE are different, and if they are, NEWGATEVALUE is copied into GATEVALUE and the transistor is evaluated. In effect, NEWGATEVALUE is a buffer. In Algorithm III-4 it appears that phase 3 comes first. This is an artifact of how the PHASE vector is used. The idea is that executing phase 3 of the previous delay period is actually what begins the next delay period.

d ← 0
While there exists a node or transistor with PHASEd ≠ Unset {
    For PresentPhase = {3, 1, 2n, 2n-1, ..., 21} {
        For each node or transistor, NT, with PHASEd = PresentPhase {
            NT.PHASEd ← Unset
            If NT is a node then ProcessNode(NT) else ProcessTransistor(NT)
        }
    }
    d ← (d + 1) mod 2
}

ProcessNode(N) := {
    If PresentPhase = 1 {
        NewVal ← Limit(N.OUTPUTVALUE, N.NODESIZE)
        N.PHASEd ← 2n
    } else {
        NewVal ← LUB(N.INPUTS1, N.INPUTS2, ..., N.INPUTSn, N.OUTPUTVALUE)
    }
    If NewVal ≠ N.OUTPUTVALUE {
        N.OUTPUTVALUE ← NewVal
        For each O ∈ N.OUTPUTS {
            If O.TERMINAL = "Gate" {
                O.TRANSISTOR.PHASE(d+1) mod 2 ← 3
                O.TRANSISTOR.NEWGATEVALUE ← NewVal
            } else {
                If O.TRANSISTOR.PHASEd = Unset
                    {O.TRANSISTOR.PHASEd ← PresentPhase}
                O.TRANSISTOR.INPUTS[O.TERMINAL] ← NewVal
            }
        }
    }
}

Algorithm III-4: The FAST-1 switch-level simulation algorithm. The algorithm is continued on the next page.

The operation 'pick a node or transistor that needs to be evaluated' must be performed in the inner loop of Algorithm III-4. Clearly the efficiency of this operation affects the performance of the algorithm. While in the presentation above the PHASE field is used as a tag, a more time-efficient implementation is to use a pointer so that the node or transistor can be kept on some form of linked list. A careful examination of Algorithm III-4 reveals that during delay period d, any evaluation set for period d + 1 always occurs during phase 3 of that delay period. During the other phases of a delay period, a node or transistor is marked only for evaluation during a single phase of that delay period. Thus, the two-element PHASE vector can be replaced by two pointers. This topic is discussed in greater detail in Chapter IV.

ProcessTransistor(T) := {
    If PresentPhase = 3 {    -- update transistor gates to their new values
        If State(T.GATEVALUE) = State(T.NEWGATEVALUE)
            then return    -- new and old states are the same, so do nothing
        else {
            T.GATEVALUE ← T.NEWGATEVALUE
            StorePhase ← 1    -- any stores we do are for phase 1
        }
    } else {    -- during phase 1 or 2i, all stores are for the present phase
        StorePhase ← PresentPhase
    }

    NewDVal ← Trans(T.GATEVALUE, T.SIZE0, T.SIZE1, T.SOURCEVALUE)
    NewSVal ← Trans(T.GATEVALUE, T.SIZE0, T.SIZE1, T.DRAINVALUE)

    -- Calculate the minimum strength of this transistor and then see if this
    -- is the proper time to evaluate the transistor. If it isn't, set the
    -- transistor's outputs to ⊥ and mark the transistor as needing to be
    -- evaluated during some later phase.
    If State(T.GATEVALUE) = X {TransStrength ← Min(T.SIZE0, T.SIZE1)}
    else {TransStrength ← T.SIZEState(T.GATEVALUE)}

    If ProperPhase[TransStrength] 'later than' PresentPhase
       AND (Strength(NewDVal) > TransStrength OR Strength(NewSVal) > TransStrength) {
        NewDVal ← NewSVal ← ⊥
        T.PHASEd ← ProperPhase[TransStrength]
    }

    If NewDVal ≠ T.DRAINOUTPUTVALUE {
        T.DRAINOUTPUTVALUE ← NewDVal
        T.DRAINOUTPUT.NODE.INPUTS[T.DRAINOUTPUT.INDEX] ← NewDVal
        If T.DRAINOUTPUT.NODE.PHASEd = Unset
            {T.DRAINOUTPUT.NODE.PHASEd ← StorePhase}
    }

    If NewSVal ≠ T.SOURCEOUTPUTVALUE {
        T.SOURCEOUTPUTVALUE ← NewSVal
        T.SOURCEOUTPUT.NODE.INPUTS[T.SOURCEOUTPUT.INDEX] ← NewSVal
        If T.SOURCEOUTPUT.NODE.PHASEd = Unset
            {T.SOURCEOUTPUT.NODE.PHASEd ← StorePhase}
    }
}

ProperPhase is simply a map from transistor strengths to phases:
ProperPhase[tn] = 2n, ..., ProperPhase[t1] = 21.

Algorithm III-4, continued: The FAST-1 switch-level simulation algorithm.

III.3.3. The Correctness and Complexity of the Simulation Algorithm

In this section I show that Algorithm III-4 terminates and that it computes the same steady state as Bryant's MossimII algorithm. I also show that, within a unit-delay time step, the algorithm has a running time proportional to the number of transistors in the circuit.

III.3.3.1. Termination

Because it is possible to build circuits, such as oscillators, that never reach a steady state, there will be situations where Algorithm III-4 never terminates. As a necessary condition for termination I show that each unit-delay time step terminates. Moreover, if, after phase 3 of a particular delay step, no transistor gates have changed state, the algorithm itself terminates. In order to show that a particular delay step terminates, I need to demonstrate that the inner loop of the algorithm terminates for each of the phases. An important thing to remember in this proof is that signals are propagated only if they have changed.

During phase 1 any node can be evaluated at most once, since whenever a node is evaluated, it is marked for evaluation during phase 2; thus it cannot be re-marked for evaluation during phase 1. Since each transistor has at most two non-gate inputs, it can be evaluated at most two times, corresponding to each of its two inputs having changed once. Therefore, phase 1 must terminate.

Phases 2n through 21 are essentially identical. Note that by the definition of LUB and Algorithm III-4, each successive value of a node is always greater than or equal to its present value. Therefore, the total number of times the output of a node can change is bounded by the height, h, of the signal lattice. In the signal representations discussed above, h has an upper bound equal to the product of the number of states and the number of strengths. The inputs of the transistors connected to a node can change at most h times. Therefore, the outputs of these transistors can change at most h times. Hence, the maximum number of times a node can be evaluated during any phase 2i is fn×h, where fn is the fan-in of the node.7 Using this same argument, we know that the number of times a transistor can be evaluated has an upper bound equal to 2h, as the transistor has a source input and a drain input. Thus, phases 2n through 21 must terminate.

During phase 3 only transistors are evaluated and each transistor can be evaluated at most once. Moreover, during phase 3 no other nodes or transistors are marked for evaluation during phase 3. Therefore, phase 3 must terminate.

Since all phases terminate, a unit-delay step of Algorithm III-4 terminates. Moreover, if no transistor gates have changed state during phase 3, then no nodes are marked for evaluation during phase 1 and the algorithm itself terminates. □

III.3.3.2. Partial Correctness

While it would be nice to define the correctness of Algorithm III-4 in terms of the actual steady state of the physical circuit, it should be clear that it is impossible to do so. There are many circumstances, ranging from race conditions to detailed electrical behavior, where the switch-level model is simply not accurate enough to predict what the circuit actually does.

7. Actually, Algorithm III-4 can be modified so that during phases 2n through 21, a transistor never outputs a value to a node if the strength of the transistor's value is less than the strength of the node's OutputValue. Given this modification, a node will be evaluated at most h times.

I propose instead to demonstrate that Algorithm III-4, when used with interval signals, computes the same steady state response as MossimII [Bryant, 1984]. MossimII is a reasonable candidate for comparison as it has been described in detail in the literature and is well known within the simulation community. The discussion below is not intended to be a proof. Rather, I hope to provide the reader who is familiar with MossimII with a convincing informal demonstration that the two algorithms and models are functionally equivalent.

The basic unit-delay step in MossimII is identical to that of Algorithm III-4. When finding the steady state response of a network, both algorithms keep the state of all transistor gates fixed. Only after a steady state has been reached in a unit-delay step are new gate values propagated to transistors. Given a set of transistors that have new gate values, both MossimII and Algorithm III-4 perturb the nodes connected to the sources and drains of the transistors, marking them as needing to be evaluated during the next phase of simulation. In MossimII this is called finding the vicinity of a node, and in Algorithm III-4, it is called phase 1.

In MossimII, finding the vicinity of a node is accomplished by making a traversal of a transistor group, adding to the set of nodes whose state needs to be recalculated those nodes reachable through transistors that are not turned off. This is exactly what is accomplished in phase 1 of Algorithm III-4. As capacitive values are propagated, exactly those nodes that are in the vicinity of a node are marked as needing to be evaluated during phase 2.
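Both phase 1 and MossimII's vicinity computation amount to a graph search that stops at transistors that are definitely turned off. A minimal sketch, assuming a hypothetical adjacency map from each node to its (transistor, neighbor) pairs and a predicate telling whether a transistor is off:

```python
from collections import deque

def vicinity(perturbed, adjacency, is_off):
    """Return the set of nodes whose steady state must be recomputed:
    everything reachable from the perturbed nodes through transistors
    that are not turned off."""
    marked = set(perturbed)
    work = deque(perturbed)
    while work:
        node = work.popleft()
        for transistor, neighbor in adjacency.get(node, ()):
            # An off transistor conducts nothing, so the search stops here.
            if not is_off(transistor) and neighbor not in marked:
                marked.add(neighbor)
                work.append(neighbor)
    return marked
```

In Algorithm III-4 the same marking falls out of propagating capacitive values during phase 1, rather than from an explicit traversal like this one.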

In the next step of MossimII, the blocking strength of each node is calculated. Following this are two identical steps in which, given a node's blocking strength, the node's up and down strengths are calculated. The blocking strength of a node is the minimum possible value for either the up or down strength. When solving for the up or down strength of a node, MossimII never calculates a value for the up or down strength that is less than the blocking strength. Once the relaxation process has stabilized, the final up and down strengths are used to determine the new state of the node.

In Algorithm III-4, phases 2n through 21 are responsible for calculating the correct steady state value of all nodes in the vicinity of a perturbed node. Algorithm III-4 never explicitly calculates the blocking strength. Rather, the evaluation of transistors is ordered such that a node never receives a value that is less than its blocking strength, unless the strength of the value is less than all non-Z transistor sizes, that is, less than t2.

It is easy to see that if the blocking strength of a node is tj, where tj is an element of the set of non-Z transistor sizes, then the node is never assigned a value whose strength is tk, where 1 ≤ k < j, since there is no way a signal value with a strength of tj can ever be generated until phase 2j. On the other hand, Algorithm III-4 does allow a node to be assigned a value whose strength is less than t2 but greater than ⊥8, even though this strength may not be the node's blocking strength. The first time this can happen is during phase 1, when a node is set to its capacitive value. If the node's blocking strength is tj, then the node assumes its correct value during phase 2j, and this initial capacitive value has no effect. Alternatively, the blocking strength can be from the set of node sizes. Remembering that until phase 21 no value is propagated across a transistor whose gate is X, it should be clear that if a node's blocking strength is from the set of node sizes, then a transistor whose gate is 0 or 1 can have no effect on a signal that might affect the node. Therefore, in the case at hand, all transistors that are turned on have, in effect, the same size. Thus, the order in which a node is assigned values does not affect its final value.

Note that taken together the up and down strengths of a node correspond fairly closely to an interval value, and that MossimII and Algorithm III-4 both perform relaxation in essentially the same way: a new strength or signal value is calculated and, if greater than the previous value, it is propagated through transistors to other nodes. Therefore, by ordering the evaluation of transistors according to their minimum dynamic size, Algorithm III-4 using intervals calculates the same steady state as does MossimII.

III.3.3.3. Complexity

The analysis of the time complexity of the algorithm follows directly from the proof of termination. Assume that a circuit has t transistors, n nodes, and that h is the height of the signal lattice. Note that n ≤ 3t. During phase 1, each node can be evaluated at most once, and each transistor at most two times. Therefore, the running time of phase 1 is O(t + n), which is O(t). During phases 2n through 21, a transistor can be evaluated at most 2h times. The reason that the number of possible transistor evaluations is this high is that during phase 21, a node's value can change up to h times. The number of times a node can be evaluated equals fn×h. In a circuit with t transistors, the total fan-in, summed across all nodes, equals 2t. Hence, the cost of evaluating all of the nodes in a circuit during phases 2n through 21 is O(ht). In general, h is a small integer and independent of t. Thus, the total running time of phases 2n through 21 is O(t). Finally, during phase 3, each transistor can be evaluated at most once, and so the running time of this phase is O(t). Thus, the total running time of a unit-delay step is O(t). □

The space complexity is a function of the number of nodes and transistors, and their fan-in and fan-out. We already know that n ≤ 3t. We also know that the

8. That is, Strength(⊥) or Z.

total fan-in, summed across all nodes, equals 2t. Similarly, the total node fan-out equals 3t. The fan-in of a transistor is 3, while the fan-out is 2.

One record is allocated for each node and each transistor. Space for one signal value is allocated for each input to a node or transistor. One pointer is allocated for each output of a node or transistor. Thus the total space required for a given simulation is O(t). □

III.3.4. Delay

Algorithm III-4 implements unit-delay by having a buffer on the gate input of a transistor, and it is this scheme that is used in performing the experiments reported in Chapter V. Nevertheless, there are other methods for implementing delay.

One possibility is not to mark transistors as needing evaluation during phase 3, but to mark instead any node whose output changes state during phases 2n through 21 as needing to be evaluated during phase 3. In this case a node stores a value onto the gate of a transistor only during phase 3. While implementing this scheme still requires a two-element Phase vector, the NewGateValue field of a transistor can be eliminated, together with the code that copies NewGateValue into GateValue. Moreover, independent of how many times the node is evaluated during a given delay period, the GateValue field of a transistor connected to the node is updated only once. As it turns out, this latter advantage is mitigated to a large extent by the use of the optimizations discussed in Section III.3.6.2. A disadvantage of this scheme is that it causes the node to be evaluated one extra time.

Another way of implementing delay is to interpose a 'delay-buffer' between a node and the transistor gates to which it is connected. The delay-buffer takes the place of the NewGateValue field in a transistor tuple. A delay-buffer is similar to a node, except that delay-buffers are executed only during phase 3. When executed, a delay-buffer copies its single input to its outputs. This scheme has the disadvantage that it requires an additional tuple for each node that connects to at least one gate. It has the advantage that not only is the NewGateValue field eliminated, but also that only a single-element Phase field is needed.

In actual circuits, signal delay occurs because nodes have non-zero capacitance while transistors and wires have non-zero resistance. Since unit-delay is a crude approximation to true RC delay, it is reasonable to consider whether a more accurate delay model can be implemented without drastically changing Algorithm III-4. A better approximation is to allow different amounts of delay to different gates. One method of implementing such a multiple-delay simulator is to use multiple delay-buffers between a node and the transistor gates it drives. Notice that when delay-buffers are used in this way, one delay-buffer will be storing into another, marking the latter for evaluation during

phase 1. When a delay-buffer is evaluated during phase 1, it leaves its output alone and marks itself as needing to be evaluated during phase 3, thus causing one unit of delay. Another way to implement multiple-delay is to modify the transistor tuple to have a vector of NewGateValues and additional Phase fields.

An alternative implementation of multiple-delay is to use a separate event-queue mechanism, a common technique in logic simulators. Instead of propagating new signal values directly to the node and transistor tuples, new values are stored in a priority queue that is ordered by delay. A simulation phase proceeds by removing an event from the queue, storing the value into a node or transistor tuple, evaluating the tuple, and adding new events to the queue, if necessary. There are many possible implementations of this type of event queue, including sorted linked lists or vectors of linked lists where lists v(i) and v(i+1) contain events that are one time-unit apart. A further discussion of this topic is presented in Chapter IV.
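The 'vector of linked lists' variant is essentially what logic simulators call a timing wheel. A minimal sketch, under the assumption that no event is ever scheduled more than max_delay time units ahead (otherwise a slot would wrap onto still-pending events):

```python
class EventQueue:
    """Timing wheel: slot i holds the events due i time units from now,
    so adjacent slots v(i) and v(i+1) are one time-unit apart."""

    def __init__(self, max_delay):
        self.slots = [[] for _ in range(max_delay + 1)]
        self.now = 0

    def schedule(self, delay, event):
        # delay must lie between 1 and max_delay for the wrap to be safe
        self.slots[(self.now + delay) % len(self.slots)].append(event)

    def advance(self):
        """Advance simulated time one unit and return the events now due."""
        self.now += 1
        idx = self.now % len(self.slots)
        due, self.slots[idx] = self.slots[idx], []
        return due
```

Scheduling and dispatch are O(1) per event, at the cost of fixing a maximum delay in advance; a sorted linked list trades that restriction for O(queue length) insertion.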

Finally, one might ask if we need to model delay at all. As discussed above, using unit or multiple-delay simplifies determining when to transition from one phase to the next. However, in circuits in which all transistors can be modeled as being unidirectional, it is possible to implement zero-delay or pseudo-unit-delay simulation. For some circuits, such as an arbiter built using cross-coupled NAND or NOR gates, zero-delay and pseudo-unit-delay simulation are somewhat advantageous in that, as in real life, the circuit settles. With true unit-delay simulation these circuits oscillate forever. But, as Bryant points out, there are always circuits that settle in real life but cause a discrete-valued simulation to oscillate. In any case, an advantage of true unit-delay simulation is that partitioning a circuit across multiple processors does not change how critical races are simulated.

III.3.5. Initialization

Corresponding to the power being turned off, the initial state of a simulation is that every node has a value of ⊥. The simulation is started by setting the Vdd node to E1, the Ground node to E0, and the proper Phase field in each tuple to 1, and then executing Algorithm III-4. Since all other nodes have a state of X, the result of initialization is that most nodes remain at X. Initializing circuits in this way does not create a problem in circuits without feedback. Unfortunately, in digital circuits with feedback, this form of initialization may be too pessimistic. In their physical realization nodes usually settle to 0 or 1, so the circuit can be designed to self-reset. In circuits with reset signals this problem may not be as severe. In extreme circumstances, it may be necessary to force some subset of the nodes to 0 or 1 before initialization, and once initialization has finished, to remove this external drive. However, in some circuits, for example, circuits with cross-coupled inverters, this may result in oscillatory behavior. In these situations it may be necessary to include in the circuit description transistors that can be used to force proper initialization.

III.3.6. Optimizations

As with most algorithms, there is always room for improvement, and Algorithm III-4 is no exception. In this section, I present a number of ways to improve the performance of Algorithm III-4 without significantly changing how it works. How these optimizations affect the performance of actual simulations is discussed in Chapter V. The optimizations may be roughly broken down into two classes: static and dynamic. Static optimizations are those that affect how circuits are represented. Dynamic optimizations are those that affect the performance of determining a network's steady state. Clearly static optimizations may affect dynamic performance.

III.3.6.1. Static Optimizations

An interesting artifact of Algorithm III-4 is that transistors are treated as perfectly symmetric devices. So, instead of representing each transistor with a single, bidirectional, transistor tuple, we can instead use two unidirectional transistor tuples. Though representing circuits using two unidirectional transistor tuples may be inefficient in both space and time, it has the advantage of making the evaluation of transistors and nodes more nearly the same, as they now both have only a single OutputValue. As discussed in Chapter IV, this may in turn simplify the hardware implementation of the algorithm.

Of greater importance is that being able to use two unidirectional transistor tuples to represent a transistor means that we can take advantage of situations where one of the unidirectional transistor tuples is unnecessary. That is, in some circumstances, transistors can be modeled as propagating signals in only one direction. This corresponds to physical situations where current flows only in one direction through a transistor. Note that the direction of current flow and signal flow do not always correspond. Current flows from Vdd to Ground, while signals flow away from sources of strong signals and, thus, away from both Vdd and Ground. For example, each of the transistors in Figure III-9 can be represented by a single unidirectional transistor tuple, with the direction of signal flow indicated by the arrows.

In practice, with the exception of special structures such as memories, almost all transistors are intended to carry current unidirectionally. Programs such as Jouppi's timing verifier, TV, can successfully identify almost all unidirectional transistors [Jouppi, 1983]. In determining the direction of transistors, Jouppi's program uses both syntactic and semantic features of a circuit. While using semantic assumptions is fine for timing verification, where we are willing to assume that a circuit functions correctly, for simulation, assuming a priori functional correctness of the circuit can be disastrous, for obvious reasons.

For simulation, an algorithm that determines flow direction should depend only on syntactic information. In order to discuss such an algorithm, a criterion is needed to decide if a transistor is unidirectional and, if so, the direction in which signals flow through it. Define a driver node to be Vdd, Ground, any node whose NodeSize is greater than Z, or any node explicitly marked as a

Figure III-9: An example illustrating the different static optimizations. The arrows indicate the direction of signal flow in the unidirectional transistors.

driver node. Define a sink node to be any node that has NodeSize greater than Z or any node explicitly marked as a sink node. By definition, any node connected to the gate of a transistor has NodeSize greater than Z. A transistor is bidirectional iff there is a path, Sd, from its source to a driver and a path, Ds, from its drain to a sink such that Sd and Ds are edge and vertex disjoint, AND there is a path, Dd, from its drain to a driver and a path, Ss, from its source to a sink such that Dd and Ss are disjoint. A transistor is unidirectional from source to drain if Sd and Ds exist and are disjoint, and either Dd or Ss doesn't exist or, if they both exist, they are not disjoint. Similarly, a transistor is unidirectional from drain to source if Dd and Ss exist and are disjoint, and either Sd or Ds doesn't exist or, if they both exist, they are not disjoint. If a transistor is neither bidirectional nor unidirectional in either direction, theoretically it can be removed from the circuit without effect. As illustrated by the bridge circuit in Figure III-10, the above definition does not require there to be two independent drivers and two independent sinks.

As defined above, the problem of finding bidirectional transistors is an instance of a well-known graph theory problem called the two-path problem9: in an undirected graph with distinct nodes P, Q, R, and S, does there exist a path from P to Q and another path from R to S such that these two paths are vertex

9. Within the CMU VLSI group, this problem is also known as the first VLSI lunch problem and was originally conjectured, by me, in my naiveté, to be NP-complete. Mike Foster found the work by Shiloach that showed it can be solved in P-time and came up with an interesting algorithm of his own. Ed Clarke became interested in the problem and in turn interested Bud Mishra, who came up with a better solution as well as some new methods that may prove applicable to other graph problems. This is yet another example of one of the greatest aspects of our department: the close cooperation between people working in supposedly different areas of computer science.

Figure III-10: A bridge circuit illustrating that transistors can be bidirectional even when there is only one driver and one sink.

disjoint? An algorithm by Shiloach requires O(E²V) time to determine if these paths are disjoint, where E is the number of edges in the graph and V is the number of vertices [Shiloach, 1980]. Mishra has developed a new O(EV) algorithm to solve this problem [Mishra, 1984]. Shiloach's algorithm is unsuitable for this problem because there are circuits with subgraphs that have thousands of edges and transistors. Mishra's algorithm, while developed too late for use in my research, is much more suitable, though it may still be too slow when working on large graphs. A simpler and faster approach that works reasonably well is to use a linear-time depth-first search to find drivers and sinks, and to label as bidirectional any transistor for which all four paths, Sd, Ds, Dd, and Ss, exist, even if the paths are not disjoint. In effect this means that pass transistors that are connected in parallel are labeled as bidirectional even though they may actually be unidirectional.
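The approximation can be sketched as four reachability tests per transistor, with disjointness deliberately ignored. This arrangement is illustrative only: a production compiler would compute reachability for all transistors at once with a constant number of whole-circuit depth-first searches, and the adjacency representation here is a hypothetical one.

```python
def reaches(start, targets, adjacency, skip):
    """Depth-first search from start to any node in targets, never
    crossing the transistor 'skip' (the one being classified)."""
    stack, seen = [start], {start}
    while stack:
        n = stack.pop()
        if n in targets:
            return True
        for transistor, m in adjacency.get(n, ()):
            if transistor != skip and m not in seen:
                seen.add(m)
                stack.append(m)
    return False

def classify(t, source, drain, drivers, sinks, adjacency):
    """Label transistor t using path existence only, as in the text."""
    sd = reaches(source, drivers, adjacency, t)   # Sd: source to a driver
    ds = reaches(drain, sinks, adjacency, t)      # Ds: drain to a sink
    dd = reaches(drain, drivers, adjacency, t)    # Dd: drain to a driver
    ss = reaches(source, sinks, adjacency, t)     # Ss: source to a sink
    if sd and ds and dd and ss:
        return 'bidirectional'      # possibly a false positive (paths may overlap)
    if sd and ds:
        return 'source-to-drain'
    if dd and ss:
        return 'drain-to-source'
    return 'removable'
```

The false positives are exactly the parallel pass transistors mentioned above: all four paths exist, so the test conservatively reports bidirectional.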

Notice that transistors connected directly to Vdd or Ground can always be labeled as unidirectional. Therefore, the Vdd and Ground nodes can never change. Rather than have these nodes take up space and time, an Input of a transistor that is connected to Vdd or Ground can be preset to E1 or E0, respectively. Additional space savings can be obtained by having special transistor types called Pullup and Pulldown, which have only a single input, Gate, and a single OutputValue. Similarly, note that the output of some transistors can never change. Such transistors can be eliminated and their contribution to their output nodes precomputed and preset. For example, a depletion-mode pullup that is not part of a superbuffer usually contributes a W1 to its output node.

Instead of using two unidirectional transistor tuples to represent a bidirectional transistor, it is possible to use a single transistor tuple that has a single OutputValue that is propagated to both its source and drain nodes. When this transistor is evaluated during phases 1 or 3, its output is always ⊥. During any other phase its result is Trans(..., LUB(DrainValue, SourceValue)). Since Trans does not distribute over LUB, the result cannot be computed as LUB(Trans(..., DrainValue), Trans(..., SourceValue)). A potential advantage of this implementation of transistors is that when implemented in hardware only one OutputValue is required instead of two.

Finally, if a node has only one input and has a NodeSize of Z, then it can be eliminated and the output of the transistor that is the node's only input can be propagated directly to the node's outputs. To see that this optimization works, notice that for all signals, S, Limit(S, Z) = ⊥ and that LUB(S, ⊥) = S. However, using this optimization requires a minor change to Algorithm III-4. Note that, if the intervening node between two transistors is removed, it is possible for one transistor to change the SourceValue of another transistor during phase 3 of the algorithm. If this happens and the latter transistor's NewGateValue equals its GateValue, then the transistor will not be evaluated. So, an additional check is required to ensure that the transistor is not evaluated iff its NewGateValue equals its GateValue AND its output, when calculated as during phase 2, is the same as its previous output.

III.3.6.2. Dynamic Optimizations

An examination of the Trans function reveals that it depends only on the state of the gate and not its strength. Hence, new values for the gates of transistors need to be propagated only when the state has changed. Recall that a transistor is evaluated whenever any of its inputs change. During phase 1, the capacitive value of a node is propagated to the gates of transistors even though this value has the same state as the node's previous OutputValue. By using this optimization, this extra propagation is avoided. This may significantly improve performance if the mere act of storing a value is as costly as evaluating a transistor, which may indeed be true in a hardware implementation. I call this improvement the store optimization. Implementing this optimization actually simplifies the algorithm. When a node is storing into the gates of transistors, it simply checks that the states of the new and old values are different before performing the store.
In addition, if the static optimization that eliminates one-input nodes is not implemented then, in the evaluation of a transistor, it is no longer necessary to check to see if the new and present gate values are different.
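The two identities invoked for the one-input-node elimination above, Limit(S, Z) = ⊥ and LUB(S, ⊥) = S, can be checked in a toy signal encoding: a (strength, state) pair with integer strengths and Z = 0. This encoding is an assumption for illustration, not the FAST-1's actual interval representation.

```python
Z = 0                    # assumed: Z is the weakest strength
BOTTOM = (Z, None)       # the bottom signal, carrying no information

def limit(sig, size):
    """Cap a signal's strength at a node or transistor size; a size of Z
    lets nothing through, so the result is bottom."""
    strength, state = sig
    if size == Z or strength == Z:
        return BOTTOM
    return (min(strength, size), state)

def lub(a, b):
    """Least upper bound: bottom is the identity, the stronger signal
    wins, and equal strengths with conflicting states merge to X."""
    if a == BOTTOM:
        return b
    if b == BOTTOM:
        return a
    if a[0] != b[0]:
        return a if a[0] > b[0] else b
    return a if a[1] == b[1] else (a[0], 'X')
```

Because the eliminated node's own contribution is always Limit(S, Z) = ⊥, and ⊥ is the identity of LUB, forwarding the input transistor's output directly to the node's outputs leaves every downstream value unchanged.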

If, when setting the source or drain value of a transistor T, T.Size[state(Gate)] = Z, then we need not set T.Phase_d, although we must still store the value. If T.Phase_d is already set, the transistor will be evaluated anyway. But if T.Phase_d is not set, then it indicates that the transistor was already evaluated when its gate was in its current state and that, because the transistor size in this state is Z, no changes to the source or drain can propagate through the transistor. In other

words, when a transistor is turned completely off its outputs are always ⊥. I call this the transistor optimization.

The last optimization I consider is called the node optimization. For each node, divide its inputs into two classes: unidirectional and bidirectional. Unidirectional inputs are the precomputed inputs or inputs from transistors connected directly to Vdd or Ground. Given that the inputs to a node have been classified as unidirectional or bidirectional, during phase 1 we can propagate the node value calculated using the node's unidirectional inputs rather than propagate just its capacitive value. Moreover, if the node value calculated using both the unidirectional and bidirectional inputs equals the node value calculated using only the unidirectional inputs, then we need not mark the node as needing evaluation during phase 2.

The basis for this optimization is that the unintentional feedback of bidirectional transistors, which phase 1 eliminates, occurs only in sub-circuits with bidirectional transistors, and that unidirectional inputs can never be the source of this feedback. Thus, there is no need to ignore values from unidirectional inputs during phase 1. Note, however, that it is now possible that during phase 1, a node with unidirectional inputs may be assigned a value whose strength is greater than the node's capacitive strength, but less than the node's final strength. However, this doesn't cause any problems. Recall that the reason for ordering the evaluation of transistors is to prevent two successive values of the same strength, but opposite sign, from being propagated through a transistor such that the steady state value of some other node is wrongly computed to be X. This cannot happen using the node optimization, as the unidirectional inputs to a node can change only during phase 3. As discussed in Section V.4.3, the node optimization is responsible for about a factor of two improvement in performance.

III.4. Compiling Circuits into Simulations

Unless descriptions of real circuits can be compiled into the switch-level representation, a switch-level simulator is of little use. Figure III-11 shows the 'wire-list' of the one-bit adder of Figure III-1 represented as a list of transistors and their interconnections. The first letter is the transistor type, where n indicates an n-enhancement transistor, and p, a p-enhancement transistor. The next three fields are the names of the nodes that are connected to the transistor's gate, source, and drain. The remainder of each line has information such as the transistor's size.

Given this kind of description, compilation is straightforward, assuming that the circuit is designed using a reasonable style, such as the one advocated by Mead and Conway [Mead and Conway, 1980]. Nominally, each transistor encountered generates a transistor tuple, and each node encountered generates a node tuple, though static optimizations may eliminate unnecessary tuples of

n X xx gnd 3.00 4.00 r 0 0 12.00
n Y yy gnd 3.00 4.00 r 0 0 12.00
n Z zz gnd 3.00 4.00 r 0 0 12.00
n yy gnd p1 3.00 4.00 r 0 0 12.00
n xx p1 p2 3.00 4.00 r 0 0 12.00
n zz p2 Sum 3.00 4.00 r 0 0 12.00
n zz Sum p3 3.00 4.00 r 0 0 12.00
n Y p3 p4 3.00 4.00 r 0 0 12.00
n X p4 gnd 3.00 4.00 r 0 0 12.00
n X gnd p5 3.00 4.00 r 0 0 12.00
n yy p5 p6 3.00 4.00 r 0 0 12.00
n Z p6 Sum 3.00 4.00 r 0 0 12.00
n Z Sum p7 3.00 4.00 r 0 0 12.00
n xx p7 p8 3.00 4.00 r 0 0 12.00
n Y p8 gnd 3.00 4.00 r 0 0 12.00
n yy gnd m1 3.00 4.00 r 0 0 12.00
n xx m1 Carry 3.00 4.00 r 0 0 12.00
n yy gnd m2 3.00 4.00 r 0 0 12.00
n zz m2 Carry 3.00 4.00 r 0 0 12.00
n zz gnd m3 3.00 4.00 r 0 0 12.00
n xx m3 Carry 3.00 4.00 r 0 0 12.00
p X xx vdd 3.00 4.00 r 0 0 12.00
p Y yy vdd 3.00 4.00 r 0 0 12.00
p Z zz vdd 3.00 4.00 r 0 0 12.00
p yy vdd p1a 3.00 4.00 r 0 0 12.00
p xx p1a p2a 3.00 4.00 r 0 0 12.00
p zz p2a Sum 3.00 4.00 r 0 0 12.00
p zz Sum p3a 3.00 4.00 r 0 0 12.00
p Y p3a p4a 3.00 4.00 r 0 0 12.00
p X p4a vdd 3.00 4.00 r 0 0 12.00
p X vdd p5a 3.00 4.00 r 0 0 12.00
p yy p5a p6a 3.00 4.00 r 0 0 12.00
p Z p6a Sum 3.00 4.00 r 0 0 12.00
p Z Sum p7a 3.00 4.00 r 0 0 12.00
p xx p7a p8a 3.00 4.00 r 0 0 12.00
p Y p8a vdd 3.00 4.00 r 0 0 12.00
p yy vdd m1a 3.00 4.00 r 0 0 12.00
p xx m1a Carry 3.00 4.00 r 0 0 12.00
p zz vdd m2a 3.00 4.00 r 0 0 12.00
p yy m2a Carry 3.00 4.00 r 0 0 12.00
p zz vdd m3a 3.00 4.00 r 0 0 12.00
p xx m3a Carry 3.00 4.00 r 0 0 12.00

Figure III-11: A description of a one-bit adder as transistors and their interconnection.

either type. The major task is to determine the size of each node and each transistor. Though a detailed analysis of the circuit can be done in which transistor ratios are carefully calculated, a simpler approach usually suffices:

• A depletion-mode transistor whose source is connected to Vdd and whose gate and drain are connected together, or an n-enhancement-mode transistor whose gate and source are tied to Vdd, or a p-enhancement-mode transistor whose gate is tied to Ground, is assumed to be a weak pullup. An input of its drain node is set permanently to W1 and no transistor tuple is generated.

• All other n-enhancement-mode transistors are assumed to have Size0 = Z and Size1 = Strong. All other p-enhancement-mode transistors have Size0 = Strong and Size1 = Z.

• A depletion-mode transistor that connects two non-external-input nodes, sometimes called a 'spoiled' transistor, has Size0 = Size1 = Strong. Any remaining depletion-mode transistors, such as those that are in superbuffers, are assigned Size0 = Weak and Size1 = Strong.

• A node connected to the gate of a transistor has NodeSize = WeakCap. The drain node of an n-enhancement transistor with only its source connected to Vdd is assumed to be pre-charged and has NodeSize = StrongCap. In addition, nodes can be explicitly given Strong capacitance.
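Under an assumed encoding of strengths and the node names of Figure III-11 ('vdd', 'gnd'), the transistor-sizing rules above can be sketched as one function applied per wire-list line. Everything here, including the strength constants, the 'd' code for depletion devices, and the external_inputs set, is a hypothetical illustration of the rules, not the compiler's actual code.

```python
Z, WEAK, STRONG = 0, 1, 2          # assumed strength encoding

def transistor_sizes(ttype, gate, source, drain, external_inputs=()):
    """Return (Size0, Size1) for one wire-list transistor, or None for a
    weak pullup (no tuple is generated; the drain is preset to W1).
    ttype is 'n', 'p', or 'd' (depletion)."""
    if ttype == 'd' and source == 'vdd' and gate == drain:
        return None                            # depletion pullup
    if ttype == 'n' and gate == 'vdd' and source == 'vdd':
        return None                            # n-enhancement pullup
    if ttype == 'p' and gate == 'gnd':
        return None                            # p-enhancement pullup
    if ttype == 'n':
        return (Z, STRONG)
    if ttype == 'p':
        return (STRONG, Z)
    # remaining depletion devices
    if source not in external_inputs and drain not in external_inputs:
        return (STRONG, STRONG)                # 'spoiled': always on
    return (WEAK, STRONG)                      # e.g. a superbuffer depletion device
```

The rule order matters: the pullup cases must be tested before the catch-all enhancement and depletion cases, since a pullup generates no transistor tuple at all.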

The remainder of the compilation task is for optimization:

• Unidirectional transistors are marked using the depth-first search algorithm described above. Transistors connected directly to Vdd or Ground are always marked as unidirectional.

• Vdd and Ground nodes are eliminated and the attached transistor inputs are set to E1 and E0, respectively.

• Always-'on' transistors, such as spoiled transistors, are eliminated and the source and drain nodes are merged into a single node10.

• Always-'off' transistors, such as lightning arresters on input pads, are eliminated.

• One-input nodes are eliminated, as discussed above.

Figure III-12 is a compiled version of Figure III-11 for a FAST-1 simulator. The first number is the tuple's index and the next two numbers are identifiers used by other programs. The fourth field is the tuple's type. In order to make the code easier for people to read, n-type and p-type transistors are distinguished by name. The simulator's input reader has default values for transistor and node sizes. Values need to be specified only when they differ from the defaults. The next number is the total number of inputs. The following number is ignored for transistors, while for nodes, it is the number of unidirectional inputs. This number is one greater than expected because the compiler automatically adds an extra input to be used for externally driving the node. Next, there is a count of the number of outputs, followed by output triplets of the form: tuple index number, input index, delay. Finally, there is initialization and size information.

III.5. Other Issues

Although Algorithm III-4 has been designed for switch-level simulation, it is easily extended to incorporate other kinds of digital simulation as well. Two of these, multi-level simulation and fault simulation, are discussed in this section.

III.5.1. Multi-level Simulation

In Chapter II I mentioned that multi-level simulation is often useful because it can reduce the amount of computation required and therefore allow functionally more complex circuits to be simulated. In order to use Algorithm III-4 for multi-level simulation, we need to be able to incorporate new functional units into the simulator.

10. An appropriately nasty warning message is given if this merger shorts Vdd and Ground.

[Compiled tuple listing: tuples 0-41 are transistor tuples (nenh, penh, nenhpx, penhpx) and tuples 42-71 are node tuples, each line giving the tuple index, two identifiers, the tuple type, the input and output counts, output triplets of the form (tuple index, input index, delay), and initialization and size information.]

Figure III-12: A compiled version of the circuit in Figure III-11.

An arbitrary functional unit is analogous to a transistor or node: it has a set of inputs, a set of outputs, and a function that describes how its outputs change in response to input changes. Some care must be taken to make sure that the function implemented conforms to the requirements of Algorithm III-4. A bidirectional path may not have delay, and any input that has delay is, in effect, a 'transistor gate' and must, therefore, cause the outputs it controls to behave in the way transistor outputs do when the input is X. Moreover, if the functional unit has delayed inputs that independently control separate outputs, it may be necessary to be able to mark the unit for evaluation during more than one phase.

Inverters and multi-input Nand or Nor gates are examples of functional units that are trivial to implement. A compiler such as the one described above can easily find instances where transistors and nodes can be replaced by such higher-level equivalents. Terman discusses going even further by compiling circuits into their equivalent logic equations [Terman, 1983, chap. 8]. In the experiments reported in the next chapter I have avoided the temptation to use such optimizations, as they are largely independent of the rest of the work in this thesis. Finally, as discussed in Chapter VII, multi-level simulation becomes particularly interesting in the context of a multiprocessor simulation machine, where different functional units can be implemented in separate processors.

III.5.2. Fault Simulation

Even though the design and layout of a chip may be completely correct, once fabricated there is no guarantee that the chip will actually work. This is because fabrication is such an intricate manufacturing process that defects in a potentially large fraction of the chips are almost unavoidable. If you're lucky, the chip doesn't work at all--you plug it in and see smoke. More likely, only a part of the chip doesn't work, and what you want to do is run a relatively small set of tests to determine whether or not a chip is completely functional. In order to do this with any confidence, it is necessary to know the likelihood that a particular test sequence will expose a particular fault. One technique for doing this is called fault simulation. Essentially, various parts of a circuit are simulated as being faulty, a set of tests is run, and a check is made to see whether or not the fault is discovered.
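The fault-simulation loop just described can be sketched in a few lines. The two-gate example circuit, the `simulate` function, and the stuck-at fault encoding below are my own illustrations of the idea, not the FAST-1's interface.

```python
# Sketch of serial fault simulation: each candidate fault forces a node to
# a fixed value, the test vectors are run, and the faulty outputs are
# compared against the fault-free ("good machine") outputs.

def simulate(inputs, stuck=None):
    """Evaluate a toy two-gate circuit: out = not (a and b).
    `stuck` optionally pins the internal node 'ab' to a fixed value."""
    a, b = inputs
    ab = a & b
    if stuck is not None and stuck[0] == 'ab':
        ab = stuck[1]          # the injected fault overrides the node value
    return 1 - ab

def fault_coverage(tests, faults):
    """Fraction of faults exposed by at least one test vector."""
    detected = 0
    for fault in faults:
        if any(simulate(t, stuck=fault) != simulate(t) for t in tests):
            detected += 1
    return detected / len(faults)
```

Running both test vectors of the toy circuit exposes both stuck-at faults on the internal node, while a single vector exposes only one of them.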

The techniques described by Shuster and Bryant for fault simulation in MOS can be used directly [Bryant and Shuster, 1983]. Among the various kinds of faults they and others have described are nodes stuck at 0, 1, or X, transistors stuck at 0, 1, or X, transistors whose sizes are not what they should be, and nodes that are shorted. With the exception of shorted nodes, these faults can be modeled by having a set of fault vectors that are used to assign values to nodes at the beginning of each delay step. The shorting of nodes can be modeled by inserting extra transistors and then turning the gates of these transistors on and off.

IV. The Architecture of a FAST-1 Uniprocessor

In Chapter III we saw that the basic simulation algorithm presented in Chapter I was not sophisticated enough to handle the complexities of switch-level simulation. Thus, it is reasonable to suspect that the basic simulation machine is likewise not sophisticated enough to implement Algorithm III-4. An examination of Algorithm III-4 reveals that it consists of three nested loops: an innermost loop that corresponds almost exactly to the basic simulation algorithm and thus to the basic FAST-1 processor, and two outer loops that are used to sequence through groups of instructions, for which the basic FAST-1 processor architecture provides no support. Hence, among the architectural changes I describe in this chapter are mechanisms for implementing the sequencing provided by the two outer loops of Algorithm III-4.

In the first part of this chapter, I describe the complete FAST-1 architecture used to implement Algorithm III-4. Because one of the primary goals of this research is to design a simulation algorithm that maps naturally onto an implementation in hardware, much of the architectural description may seem repetitive--simply restating parts of Algorithm III-4 in different words. I hope this is true, as it indicates that I have achieved my goal.

The second part of the chapter focuses on how the architecture can be implemented and how technology affects various implementation decisions. As the FAST-1 has yet to be implemented in hardware, the reader is justified in reading this part with a certain amount of skepticism. Nevertheless, I have tried to avoid waving my hands too quickly.

IV.1. Uniprocessor Architecture

The FAST-1 is a data-driven computer, in that it executes instructions in response to the flow of data and not as dictated by a program counter. It differs from other data-driven machines in two substantial ways: its instructions have state, and it provides a unique mechanism for enforcing a partial order on the execution of instructions.

One reason for much of the interest in data-driven computation is that it apparently offers the opportunity to exploit parallelism easily, in that at any given moment several instructions may be executable. However, in order to keep the FAST-1 processor simple, I have explicitly avoided trying to exploit instruction-level parallelism within an individual processor. I have opted, instead, to exploit parallelism by using multiple processors, as discussed in Chapter VII. If one ignores the data-driven nature of a FAST-1 uniprocessor, the overall structure more or less resembles that of a traditional stored-program computer: there is a main memory that is used to store both instructions and data, a fetch unit that controls reads and writes to memory, and an execution unit that is used to evaluate instructions.

IV.1.1. Instruction Definition

A FAST-1 processor contains a vector of INSTRUCTIONS in which each instruction is a 5-tuple of the form (OPCODE, SOURCEOPERANDS, RESULTS, DESTINATIONS, EXECUTIONTAGS) where

• OPCODE specifies the operation that should be performed when the instruction is executed.

• SOURCEOPERANDS is a vector of one or more data values that are used in evaluating the instruction. The type of each SOURCEOPERAND is determined by the OPCODE. In general, SOURCEOPERANDS are written by other instructions, and having one of its SOURCEOPERANDS written causes an instruction to become executable. Whereas in other computers an instruction often contains operand specifiers that are the addresses of the actual data values, FAST-1 instructions contain the actual data values themselves.

• RESULTS is also a vector of data values and is very similar to the SOURCEOPERANDS vector. When an instruction executes it can produce one or more RESULTS that, in addition to being propagated to other instructions, are stored back into the original instruction. Whenever an instruction is executed, the RESULTS from its previous execution are available to the evaluation unit. In this research, the primary use of the RESULTS vector is to be able to check whether the new RESULTS of an instruction differ from its previous RESULTS.

• DESTINATIONS is a vector of SOURCEOPERAND addresses that specify where the new RESULTS of an instruction should be stored. The binding of RESULTS and DESTINATIONS is static. That is, a particular RESULT is always stored using the same set of DESTINATIONS. Each DESTINATION is a four-tuple of the form (INSTINDEX, OPINDEX, TAGINDEX, TAGVALUE) where INSTINDEX specifies an instruction in the INSTRUCTIONS vector and OPINDEX specifies a particular SOURCEOPERAND within that instruction. TAGINDEX and TAGVALUE determine which EXECUTIONTAG is to be set, and what its value should be.

• EXECUTIONTAGS is a vector of tags that are used to indicate whether or not an instruction is executable. Each EXECUTIONTAG takes on values from the set {nil, E1, E2, ..., En}. Whereas it is convenient to think of an instruction as becoming executable whenever one of its SOURCEOPERANDS is set, in reality an instruction is executable if any of its EXECUTIONTAGS have a non-nil value. As we shall see below, it is also useful to be able to use an EXECUTIONTAG as one of the operands in evaluating an instruction. Thus, an EXECUTIONTAG is very much like a SOURCEOPERAND in that it is used in evaluating instructions and setting it causes the instruction to become executable. In simple systems, instructions need only a single EXECUTIONTAG and only one non-nil tag value. By having multiple EXECUTIONTAG values and multiple EXECUTIONTAGS it is possible to create groups of executable instructions such that all instructions in one group are executed before any instructions in another group are executed.
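As a concrete, if schematic, rendering of the 5-tuple, one might write the following in Python; the field names and types are mine, and no bit-level packing is implied.

```python
from dataclasses import dataclass
from typing import Any, List, Optional

# Where a RESULT of one instruction should be stored: a SOURCEOPERAND slot
# of a target instruction, plus which EXECUTIONTAG to set and to what value.
@dataclass
class Destination:
    inst_index: int       # INSTINDEX: target instruction
    op_index: int         # OPINDEX: target SOURCEOPERAND slot
    tag_index: int        # TAGINDEX: which EXECUTIONTAG to set
    tag_value: str        # TAGVALUE: e.g. 'E1', or '*' for the current value

@dataclass
class Instruction:
    opcode: str
    source_operands: List[Any]             # actual data values, not addresses
    results: List[Any]                     # RESULTS of the previous execution
    destinations: List[List[Destination]]  # one destination list per RESULT
    execution_tags: List[Optional[str]]    # None stands for nil

    def executable(self) -> bool:
        # An instruction is executable iff any EXECUTIONTAG is non-nil.
        return any(t is not None for t in self.execution_tags)
```

Note that, unlike a conventional operand specifier, `source_operands` holds the data values themselves, mirroring the text above.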

The complete FAST-1 program for the adder circuit in Figure I-1 appears in Figure IV-1. Each line represents one instruction. The 'Inst Name' field identifies which gate the instruction represents and is not part of the actual instruction. Similarly, the 'Inst Index' field is the instruction's address and is not part of the instruction either. The values shown for each SOURCEOPERAND and RESULT are the initial values corresponding to the state where the X, Y, and Z inputs are all 0. In this instance, there is only a single EXECUTIONTAG and it has only two values: nil and t. As shown in the figure, all EXECUTIONTAGS are nil; thus initially, no instructions are executable. Each DESTINATION is given as the triple n.n.t, which are the values for INSTINDEX, OPINDEX, and TAGVALUE, respectively. Because there is only a single EXECUTIONTAG, I have omitted the value for TAGINDEX. Similarly, because the TAGVALUE field is the same in all DESTINATIONS, one can imagine eliminating it as well. Finally, a Copy instruction is used to take a single input and copy it to several SOURCEOPERANDS.

IV.1.2. Instruction Execution

Given the definition of an instruction, we can now examine in detail how instructions are executed. Assume for the moment that a processor is quiescent; that is, the EXECUTIONTAGS in all instructions are nil. Some external agent, for example a host processor, might then store a value into some SOURCEOPERAND, thereby setting an EXECUTIONTAG of the instruction. The FAST-1 processor will then detect that this instruction is executable, fetch it, and store its RESULTS into the SOURCEOPERANDS specified by its DESTINATIONS. At the same time, an EXECUTIONTAG in each of the target instructions is also set, thereby making these instructions executable. In the simplest case, where each instruction has a single EXECUTIONTAG and there is only one non-nil EXECUTIONTAG value, the process of executing instructions is exactly as presented in Algorithm I-2 in Chapter I. If we examine the innermost loop of Algorithm III-4, we see it can be implemented by Algorithm I-2 as well. In this innermost loop there is a single test to see whether there are any transistor or node records that have their PHASE field set to the current phase value.

Using Algorithm I-2, we can trace the execution of the program in Figure IV-1.

Inst  Inst   Opcode  Source    Exec  Result  Destinations
Name  Index          Operands  Tag
X     0      Copy    0         nil   0       3.0.t, 4.0.t, 9.0.t, 12.0.t, 13.0.t
Y     1      Copy    0         nil   0       5.0.t, 6.0.t, 9.1.t, 11.0.t, 12.1.t
Z     2      Copy    0         nil   0       7.0.t, 8.0.t, 9.2.t, 11.1.t, 13.1.t
A1    3      And     0 1 1     nil   0       10.0.t
I1    4      Invert  0         nil   1       5.1.t, 7.1.t
A2    5      And     0 1 1     nil   0       10.1.t
I2    6      Invert  0         nil   1       3.1.t, 7.2.t
A3    7      And     0 1 1     nil   0       10.2.t
I3    8      Invert  0         nil   1       3.2.t, 5.2.t
A4    9      And     0 0 0     nil   0       10.3.t
O1    10     Or      0 0 0 0   nil   0
A5    11     And     0 0       nil   0       14.0.t
A6    12     And     0 0       nil   0       14.1.t
A7    13     And     0 0       nil   0       14.2.t
O2    14     Or      0 0 0     nil   0

Figure IV-1: A FAST-1 program for the one-bit adder in Figure I-1. The values given for the SOURCEOPERANDS and the RESULTS correspond to the condition when the X, Y, and Z inputs are all 0.

Suppose we set the X input to 1. In the FAST-1 program we accomplish this by setting SOURCEOPERAND 0.0 to 1, and the EXECUTIONTAG to t. Instructions are now executed as shown in Figure IV-2. The trace is shown in terms of reads and writes to instruction memory and, for reasons discussed later in this chapter, instructions are always read before they are written. The cause of a memory reference is indicated in the 'E/S' column. An 'E' indicates that the cycle is for executing an instruction, while an 'S' indicates that the cycle is for storing a RESULT.

IV.1.2.1. Using Multiple Execution Tag Values and Multiple Execution Tags

Now that we understand how the simplified FAST-1 works, it is appropriate to consider a complete FAST-1 processor. Algorithm IV-1 summarizes the process of executing instructions in a FAST-1 with multiple EXECUTIONTAGS and multiple EXECUTIONTAG values. In the following paragraphs, I explain the algorithm in detail.

Let us begin by examining the middle loop of Algorithm III-4. The intent of this loop is to sequence through sets of nodes and transistors that have been marked with the same phase value. In this way, within the innermost loop we evaluate all nodes and transistors marked with the same phase value before proceeding to evaluate any nodes or transistors that are marked with a different phase value. The phase values can be implemented using EXECUTIONTAG values. Sequencing through these values is implemented by the middle loop of Algorithm IV-1. Notice that once an EXECUTIONTAG is set to a non-nil value it cannot be changed, as is required by Algorithm III-4.

Inst  Inst   Opcode  Source    Exec  Result  Destinations          E/S  R/W
Name  Index          Operands  Tag
X     0      Copy    1         t     0       3.0.t, 4.0.t, 9.0.t,  E    R
                                             12.0.t, 13.0.t
X     0      Copy    1         nil   1       3.0.t, 4.0.t, 9.0.t,  E    W
                                             12.0.t, 13.0.t
A1    3      And     0 1 1     nil   0       10.0.t                S    R
A1    3      And     1 1 1     t     0       10.0.t                S    W
I1    4      Invert  0         nil   1       5.1.t, 7.1.t          S    R
I1    4      Invert  1         t     1       5.1.t, 7.1.t          S    W
A4    9      And     0 0 0     nil   0       10.3.t                S    R
A4    9      And     0 0 1     t     0       10.3.t                S    W
A6    12     And     0 0       nil   0       14.1.t                S    R
A6    12     And     1 0       t     0       14.1.t                S    W
A7    13     And     0 0       nil   0       14.2.t                S    R
A7    13     And     1 0       t     0       14.2.t                S    W
A6    12     And     1 0       t     0       14.1.t                E    R
A6    12     And     1 0       nil   0       14.1.t                E    W
A4    9      And     0 0 1     t     0       10.3.t                E    R
A4    9      And     0 0 1     nil   0       10.3.t                E    W
A7    13     And     1 0       t     0       14.2.t                E    R
A7    13     And     1 0       nil   0       14.2.t                E    W
A1    3      And     1 1 1     t     0       10.0.t                E    R
A1    3      And     1 1 1     nil   1       10.0.t                E    W
O1    10     Or      0 0 0 0   nil   0                             S    R
O1    10     Or      1 0 0 0   t     0                             S    W
O1    10     Or      1 0 0 0   t     0                             E    R
O1    10     Or      1 0 0 0   nil   1                             E    W
I1    4      Invert  1         t     1       5.1.t, 7.1.t          E    R
I1    4      Invert  1         nil   0       5.1.t, 7.1.t          E    W
A2    5      And     0 0 0     nil   0       10.1.t                S    R
A2    5      And     0 1 0     t     0       10.1.t                S    W
A3    7      And     0 0 0     nil   0       10.2.t                S    R
A3    7      And     0 1 0     t     0       10.2.t                S    W
A2    5      And     0 1 0     t     0       10.1.t                E    R
A2    5      And     0 1 0     nil   0       10.1.t                E    W
A3    7      And     0 1 0     t     0       10.2.t                E    R
A3    7      And     0 1 0     nil   0       10.2.t                E    W

Figure IV-2: A trace of the execution of the FAST-1 program for the adder. The trace is shown in terms of reads and writes to memory. An 'E', in the column labeled 'E/S', indicates that the memory reference is for executing an instruction. An 'S' indicates that the memory reference is for storing a RESULT. The 'R/W' column indicates whether the instruction is being read or written. Note that reads are always followed by writes.

There are several other important observations about Algorithm III-4. First, when the evaluation of a transistor or node causes other nodes or transistors to be marked for evaluation, it is common for them to be marked for evaluation during the same phase as the original instruction. As shown in Algorithm IV-1, this requirement is implemented in the FAST-1 by having a distinguished tag value, '*', that indicates that the EXECUTIONTAG in the target instruction is to be set to the current EXECUTIONTAG value, E.

m ← the number of elements in EXECUTIONTAGS
k ← 0
while there exists an instruction with a non-nil EXECUTIONTAG {
    foreach tag value, E, in order from E0 to En {
        while there is an instruction, I, with I.EXECUTIONTAGS[k] = E {
            Fetch I
            Execute I
            I.EXECUTIONTAGS[k] ← nil
            foreach RESULT, R {
                if the new value of R ≠ the previous value of R {
                    Store the new value of R back into I
                    foreach DESTINATION, D, of R {
                        INSTRUCTIONS[D.INSTINDEX].SOURCEOPERANDS[D.OPINDEX] ← R
                        let i ← (D.TAGINDEX + k) mod m
                        if INSTRUCTIONS[D.INSTINDEX].EXECUTIONTAGS[i] = nil {
                            if D.TAGVALUE = '*'
                                INSTRUCTIONS[D.INSTINDEX].EXECUTIONTAGS[i] ← E
                            else
                                INSTRUCTIONS[D.INSTINDEX].EXECUTIONTAGS[i] ← D.TAGVALUE
                        }
                    }
                }
            }
        }
    }
    k ← (k+1) mod m
}

Algorithm IV-1: The fetch/execute cycle of the complete FAST-1 processor.
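Under the simplifying assumptions of one RESULT per instruction and a small Boolean opcode set, the fetch/execute cycle of Algorithm IV-1 can be rendered executable roughly as follows; the record layout and the OPS table are my own illustrations, not the FAST-1's actual encoding.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Dest:
    inst_index: int
    op_index: int
    tag_index: int = 0
    tag_value: str = 't'       # '*' would mean "the current tag value E"

@dataclass
class Inst:
    opcode: str
    operands: List[int]
    result: int                # RESULT of the previous execution
    dests: List[Dest]
    tags: List[Optional[str]] = field(default_factory=lambda: [None])

OPS = {'Copy':   lambda v: v[0],
       'And':    lambda v: int(all(v)),
       'Or':     lambda v: int(any(v)),
       'Invert': lambda v: int(not v[0])}

def run(insts, tag_values=('t',)):
    """Fetch/execute cycle of Algorithm IV-1, one RESULT per instruction."""
    m = len(insts[0].tags)
    k = 0
    while any(t is not None for ins in insts for t in ins.tags):
        for E in tag_values:                        # middle loop: tag values
            while True:
                ready = next((ins for ins in insts if ins.tags[k] == E), None)
                if ready is None:
                    break                           # no instruction tagged E
                ready.tags[k] = None
                new = OPS[ready.opcode](ready.operands)
                if new != ready.result:             # propagate only changes
                    ready.result = new
                    for d in ready.dests:
                        tgt = insts[d.inst_index]
                        tgt.operands[d.op_index] = new
                        i = (d.tag_index + k) % m
                        if tgt.tags[i] is None:     # a set tag never changes
                            tgt.tags[i] = E if d.tag_value == '*' else d.tag_value
        k = (k + 1) % m                             # outer loop: delay periods
```

Tagging a Copy instruction whose DESTINATION feeds an Invert and calling `run` propagates the change and leaves every tag nil, in the spirit of the trace in Figure IV-2.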

Next, the exact nature of the evaluation of a node or transistor depends on the phase during which it is evaluated. Moreover, when a node or transistor is evaluated, it may mark not only other transistors or nodes as needing evaluation, but it may also mark itself as needing to be evaluated during some future phase. Though not shown explicitly in Algorithm III-4, these two operations can be incorporated directly into a FAST-1 processor. For the first one, we simply provide the current value of the EXECUTIONTAG as an operand to the evaluation unit. As for the second one, notice that when an instruction is executed, its EXECUTIONTAG is cleared, and there is nothing that prevents it from being set to some new value. At first glance, this may seem to violate the spirit of data-driven computation. However, as I have already mentioned, we can think of the action of storing a RESULT back into the instruction just executed as being an implicitly-specified destination store. As a DESTINATION always specifies a TAGVALUE, there is no reason why the implicit store of a RESULT cannot specify a new value for the EXECUTIONTAG as well. In the default case, this new tag value is nil. In more complex circumstances, its value can be determined dynamically during instruction execution. Hence, an EXECUTIONTAG is very much like a SOURCEOPERAND or a RESULT in that it can be used in evaluating the instruction, and it is either set explicitly when doing a destination store or set implicitly, after executing an instruction. Indeed, the only attribute that distinguishes an EXECUTIONTAG from a SOURCEOPERAND or a RESULT is that it alone is used by the fetch unit in determining which instruction to execute next.

Having presented the use of multiple EXECUTIONTAG values, I now need to discuss how multiple EXECUTIONTAGS are used. Returning to Algorithm III-4, let us examine the outermost loop. The purpose of this loop is to implement the delay mechanism of the simulation algorithm. A subtle attribute of the algorithm is that a transistor[1] may be marked as needing evaluation during both the current delay period and the next delay period. In effect, this allows a transistor to be part of two independent groups. One group is the set of transistors that are to be evaluated during a particular phase of the current delay period. The other group is those transistors that are to be evaluated at the beginning of the next delay period.

The mechanism used to implement this feature in the FAST-1 processor is similar to the one used in Algorithm III-4: an instruction in a FAST-1 processor has a vector of EXECUTIONTAGS and, as shown in Algorithm IV-1, there is an outermost loop that sequences through this vector. Note that when there are multiple EXECUTIONTAGS, a DESTINATION must specify not only a value for the EXECUTIONTAG but also which EXECUTIONTAG is to be set.

Together, multiple EXECUTIONTAGS and multiple EXECUTIONTAG values provide a mechanism for creating groups of instructions, such that all of the instructions in one group are executed before any instructions in the next group. The order in which groups are processed is strictly determined. Within a particular group, instructions may be executed in any order.

IV.1.2.2. A Closer Look at Executing Instructions

There are a number of interesting ramifications of the above definition of a FAST-1 processor that need to be considered in order to understand fully how a FAST-1 processor works.

First, what happens when multiple instructions are marked as executable, or, in other words, when more than one instruction has EXECUTIONTAGS[k] equal to E? As allowed by Algorithm III-4, these instructions may be executed in random order, implementation considerations notwithstanding. The ability to execute instructions in any order is not simply a curiosity; it is fundamental to the operation of a multiprocessor FAST-1.

A subtle feature of the FAST-1 is that it is possible for a SOURCEOPERAND to be overwritten. That is, a SOURCEOPERAND in an instruction may be written, and before the instruction has a chance to be evaluated using this new value, the SOURCEOPERAND may be overwritten with a new value. This can happen even if a SOURCEOPERAND is specified in only a single DESTINATION, as is true for all the work described in this dissertation. On the other hand, this is not true of EXECUTIONTAGS. Once an EXECUTIONTAG is set to a non-nil value, it retains that value until the instruction is executed. Both of these aspects of the FAST-1 are in accordance with Algorithm III-4.

[1] Or a node, if we implement the suggestion in Section III.3.4.

IV.1.3. Implementing Algorithm III-4 Using the FAST-1: A Summary

Though it should be clear by now how Algorithm III-4 is implemented using the FAST-1, an orderly summary is appropriate. The actual number of bits used to represent each field depends on a variety of factors, which are covered in depth in Section IV.2.

Each node and transistor tuple is represented as a single FAST-1 instruction. A bidirectional Transistor instruction has three SOURCEOPERANDS: NEWGATEVALUE, DRAINVALUE, and SOURCEVALUE, and three RESULTS: GATEVALUE, SOURCEOUTPUT, and DRAINOUTPUT. GATEVALUE is considered a RESULT since, during phase 3, NEWGATEVALUE is copied into it. During the remainder of the time GATEVALUE is used as a SOURCEOPERAND. SOURCEOUTPUT and DRAINOUTPUT each have a single DESTINATION that is the address of a SOURCEOPERAND in a node instruction. Each transistor instruction also contains two read-only SOURCEOPERANDS: SIZE0 and SIZE1.

Each Node instruction has a SOURCEOPERANDS vector that corresponds to its inputs. A Node instruction has a single RESULT corresponding to its OUTPUTVALUE. The DESTINATIONS of this RESULT are exactly the OUTPUTS of the node. A node's NODESIZE is handled in the same manner as a transistor's SIZES.

The only thing left is the implementation of phases and delay. Phases are implemented using multiple EXECUTIONTAG values, while unit delay is implemented using multiple EXECUTIONTAGS.
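The instruction formats implied by this summary might be sketched as follows; the flat Python records, field names, and widths are illustrative only, not the FAST-1's actual bit-level encoding.

```python
from dataclasses import dataclass
from typing import List, Tuple

# A SOURCEOPERAND address: (instruction index, operand index).
Address = Tuple[int, int]

@dataclass
class TransistorInst:
    # SOURCEOPERANDS
    new_gate_value: int
    drain_value: int
    source_value: int
    size0: int               # read-only strength operands
    size1: int
    # RESULTS (gate_value doubles as an operand outside phase 3)
    gate_value: int
    source_output: int
    drain_output: int
    # each output RESULT has a single DESTINATION in some node instruction
    source_dest: Address
    drain_dest: Address

@dataclass
class NodeInst:
    inputs: List[int]             # SOURCEOPERANDS vector
    node_size: int                # handled like a transistor's SIZES
    output_value: int             # the single RESULT
    output_dests: List[Address]   # the node's OUTPUTS
```

The point of the sketch is the asymmetry the text describes: a transistor has exactly two output destinations, one into the drain node and one into the source node, while a node fans out to an arbitrary list of OUTPUTS.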

IV.1.4. Other Issues

As with most computer architectures, there are many additional features that are worthwhile to think about. In this section I present some of the features that I have considered and how they might be used.

IV.1.4.1. Data Types and Operations

It should be clear that many of the operations found in general-purpose computers can also be implemented in a FAST-1. However, in order to simplify the implementation of a FAST-1 processor we might consider forcing all data types to occupy the same number of bits. This can be a serious limitation, in general, but not in simulation. One way around this limitation is to pass around references or pointers to objects that reside in a memory that is logically separate from instruction memory. Kazar's DS system uses semantics similar to these [Kazar, 1984].

IV.1.4.2. Complex Firing Rules

In certain applications it may be appropriate to have more complex firing rules. For example, in simulating edge-triggered circuits it may be useful to be able to detect when a new value of an operand has arrived. This feature would make it possible to simulate a gate, such as a clocked register, only when there are new values for both the clock input and the data input.

Without modifying the architecture it is possible to achieve these semantics by having a shadow RESULT field, with no associated DESTINATIONS, for each SOURCEOPERAND. Part of the execution of an instruction would include copying the current value of each SOURCEOPERAND into its associated shadow RESULT field. In this way it is always possible to compare a SOURCEOPERAND with its previous value. A more space-efficient mechanism is to associate with each SOURCEOPERAND a NEWVALUE bit, which is set whenever a new value is stored into the SOURCEOPERAND.

Improved execution efficiency can be obtained by associating one or two additional bits with each SOURCEOPERAND. A NEEDED bit can be used to specify whether or not the NEWVALUE bit of this SOURCEOPERAND must be set in order for the instruction to be executed. With such a bit, the EXECUTIONTAG of an instruction would be set only when all SOURCEOPERANDS marked as NEEDED have their NEWVALUE bit set. A DON'TNEED bit can be used to indicate that setting this SOURCEOPERAND should not cause the EXECUTIONTAG to be set. Alternatively, we can have this bit in the DESTINATION field of the instruction that stores into the SOURCEOPERAND. Depending on how they are used, these firing rule bits may interact with the EXECUTIONTAGS, thus further complicating matters.
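The NEEDED/NEWVALUE discipline can be sketched as below, assuming the two bits live alongside each operand; the record and helper function are hypothetical illustrations of the rule, not a FAST-1 mechanism.

```python
from dataclasses import dataclass

@dataclass
class Operand:
    value: int
    needed: bool = False     # NEEDED: must see a new value before firing
    new_value: bool = False  # NEWVALUE: set on every store into the operand

def store(operands, index, value):
    """Store into one SOURCEOPERAND and report whether the instruction's
    EXECUTIONTAG should now be set: every NEEDED operand must have its
    NEWVALUE bit on."""
    operands[index].value = value
    operands[index].new_value = True
    return all(op.new_value for op in operands if op.needed)
```

Turning on every NEEDED bit (except on constants) yields the strict firing rule of typical data-flow machines, in which all operands must carry new values before the instruction can fire.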

The firing rule of typical data-flow machines, which specifies that all operands must have new values before the instruction can fire, can be implemented by turning on all of the NEEDED bits. Constant operands would have the NEEDED bit turned off. Algorithm III-4, on the other hand, can be implemented without using any of the complex firing rule features.

IV.1.4.3. Early Execution of Instructions

One observation about the way in which Algorithm IV-1 works is that it may access instructions more often than necessary. For instance, in Figure IV-2 the Or instruction is accessed twice even though it could have been executed when it was first referenced. It seems that an instruction could be evaluated when it is fetched for updating one of its SOURCEOPERANDS, although RESULTS cannot be stored until later. One can imagine modifying the architecture so that an EXECUTIONTAG is set not to indicate that the instruction needs to be evaluated but rather to indicate that RESULTS need to be stored. Part of storing a RESULT would then be to evaluate the target instruction and set its EXECUTIONTAG if it needs to store a RESULT. The efficacy of this model depends on two factors. First, how often are instructions executed without storing new results and, second, what are the relative costs of executing an instruction versus simply fetching it? If evaluating an instruction is dependent either on the sequencing provided by EXECUTIONTAGS or on the value of the EXECUTIONTAG when the instruction is executed, it may not be possible to evaluate the instruction early. In any event, even without completely changing how the FAST-1 executes instructions, it may still be useful to do some evaluation of instructions when storing into them. For example, in Section III.3.6.2, I discussed how this kind of optimization can be used to improve the performance of the simulation algorithm.

IV.1.4.4. Queueing Results Instead of Instructions

A somewhat different method of implementing a machine that is driven by the need to store RESULTS is to think of RESULTS as messages that need to be delivered. The inner loop of the fetch/execute algorithm then becomes the process of getting a new message, fetching the instruction it references, and evaluating the instruction using its SOURCEOPERANDS and the data from the RESULT message. In many simulators, this mechanism is called the event queue, where a slot in the event queue corresponds roughly to an EXECUTIONTAG.

If only a single undelivered RESULT message is allowed for each SOURCEOPERAND, and attempts to generate additional messages simply overwrite the data in the already existing message, then a machine using RESULT messages would function identically to the FAST-1. If, however, multiple outstanding RESULT messages are permitted, then some computations can be made to run more efficiently on a machine with an event queue. This is because the storage the event queue is implementing implicitly would have to be implemented explicitly, using instructions, on a FAST-1.
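The one-message-per-operand variant, in which a later store overwrites an undelivered message, can be sketched with a table keyed by destination address; the dict-based queue below is my illustration of the behavior described above, not a proposed implementation.

```python
# One pending RESULT message per SOURCEOPERAND: keying the queue by the
# destination address makes a later store overwrite an undelivered one.
pending = {}   # (inst_index, op_index) -> value

def send(dest, value):
    """Queue a RESULT message, overwriting any undelivered message
    addressed to the same SOURCEOPERAND."""
    pending[dest] = value

def deliver():
    """Drain the queue, returning (dest, value) pairs in arrival order."""
    items = list(pending.items())
    pending.clear()
    return items
```

Allowing multiple outstanding messages per operand would instead require a list per key, which is exactly the extra storage the surrounding text says an event queue provides implicitly.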

A disadvantage of this approach is that the maximum number of outstanding messages needed is a function of the program being executed. Even if only one message per SOURCEOPERAND is allowed, the amount of storage needed to represent a program is approximately double that used in a regular FAST-1.

IV.2. Implementation

The FAST-1 architecture, as described so far, competes well with most paper designs: it seems okay, but can it really be built? In this section I hope to convince you that, assuming reasonable operations and data types, the basic architectural features of a FAST-1 present no barriers to a straightforward implementation.

IV.2.1. The Datapaths of a FAST-1 Processor

In Chapter I, I presented the basic structure of an implementation of a FAST-1 processor. In this section, I reconsider this implementation and discuss how it may be altered either to simplify it or to improve its performance. Figure IV-3 shows a simplified schematic of the datapaths of a FAST-1. Instructions are stored in the instruction memory. Logically, a word of instruction memory must be wide enough to contain the widest instruction, so that each different instruction memory address specifies a different instruction. This is not very practical, and in Section IV.2.3 I discuss other methods for representing instructions. The current EXECUTIONTAG of each instruction, as indexed by k in Algorithm IV-1, is available to a fetch unit that fetches the selected instruction from memory. Once again, implementing the fetch unit in this fashion is impractical and, in Section IV.2.2, I discuss more reasonable methods. When an instruction is fetched from memory, it is stored into an instruction register. The OPCODE, the current SOURCEOPERANDS, and the current RESULTS are fed into an execution unit that computes a new set of RESULTS values. Associated with each new RESULT is a bit that indicates whether or not the new RESULT equals the current RESULT. Once the instruction has been evaluated, the new RESULTS are written back into the instruction, the EXECUTIONTAG is set to nil, and the instruction is written back into memory. Then, each RESULT that differs from its previous value is written into each SOURCEOPERAND specified by the RESULT's DESTINATIONS, and the appropriate EXECUTIONTAGS are set. Since the EXECUTIONTAG must be checked before it is set, the target instruction of a DESTINATION must be read before it is written. Thus, the basic cycle of a FAST-1 is a read-modify-write memory cycle, both for executing instructions and for storing RESULTS into SOURCEOPERANDS.

IV.2.2. Keeping Track of Executable Instructions

In the architecture section, I implied that each EXECUTIONTAG is implemented as a tag, or small integer, and that the fetch unit can examine all of the current EXECUTIONTAGS in parallel in order to determine whether there are any instructions available for execution. However, as nice as this is conceptually, from an implementation standpoint it is a problem. If we build a FAST-1 processor using off-the-shelf memory parts, then there is no easy way to examine all of the EXECUTIONTAGS in parallel. One solution to this problem is to have the fetch unit sequentially read instructions from memory, looking for an instruction that has its EXECUTIONTAG set to the proper value. Unless most instructions are executable, this solution wastes a lot of time. Another approach is to store the EXECUTIONTAGS in individually settable registers that can be read in parallel by the fetch unit. Unfortunately, if there are hundreds of thousands of instructions, this solution is not very space efficient if these registers are implemented using off-the-shelf TTL or ECL parts. We might decide to build a custom chip that implements this part of the fetch unit. Indeed, we might consider implementing the entire FAST-1 processor on a single chip, so that the problem can be solved using a suitable associative memory, for example.

In any event, a parallel solution that doesn't scan through memory requires some form of priority encoder to select an executable instruction. This priority encoder can be serial, in the form of a carry chain, though this may be too slow. Another possibility is to use a tree structure, but this implementation may occupy too much space. An alternative solution is to be able to access the 'next' executable instruction directly. This may, however, force us to execute instructions in a particular order. Two orderings that immediately come to mind are first in, first out (FIFO), and last in, first out (LIFO), as determined by when an instruction is first made executable.

Figure IV-3: The datapaths of a FAST-1 processor.

Both orderings are easily implemented. Each EXECUTIONTAG in an instruction is replaced by a pointer capable of addressing an instruction. Within the fetch unit there are sets of 'head of list' (HOL) pointer registers, where in each set there is a HOL pointer for each different non-nil EXECUTIONTAG value. If FIFO order is used, then a 'tail of list' (TOL) pointer is needed as well. These lists are the equivalent of the event queue found in many simulators. However, rather than queueing data to be delivered, we are queueing instructions to be executed.

For LIFO execution, when an instruction that is not already executable becomes executable, the proper HOL pointer is stored into the instruction's EXECUTIONTAG, and the instruction's address is copied into the HOL pointer. Similarly, when an instruction is executed, its EXECUTIONTAG, which points to the next executable instruction, is copied into the current HOL pointer. Once an instruction is on a particular list of executable instructions, it must not be removed until it is executed. Thus, associated with each EXECUTIONTAG in an instruction must be a bit that indicates that this instruction is already on a list of executable instructions. Testing this bit requires that when a RESULT is stored into an instruction, the instruction be read in order to determine whether or not it is already on a list of executable instructions. Finally, detecting the end of the list can be accomplished using either an additional bit or a special pointer value, such as zero.

Implementing FIFO ordering is somewhat more complicated than implementing LIFO ordering, since we must not only add the new instruction to the tail of the list but must also update the instruction that is currently at the tail to point to the new instruction. Whereas LIFO ordering requires only a single read-modify-write (RMW) cycle to store a RESULT and mark an instruction as executable, FIFO ordering requires two RMW cycles: one to store the RESULT as in LIFO, and one to update the instruction currently at the tail. In a simple implementation these two memory cycles are sequential. In more complex implementations the two memory cycles can be performed concurrently, as discussed below.
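The LIFO bookkeeping can be sketched in a few lines of Python. This is an illustrative model, not the author's implementation: the names (Instruction, FetchUnit, NIL) are mine, and a single HOL register stands in for the per-tag-value register sets described above.

```python
NIL = -1  # hypothetical sentinel: "end of list / not executable"

class Instruction:
    def __init__(self):
        self.exec_tag = NIL   # doubles as a link to the next executable instruction
        self.on_list = False  # the extra bit from the text: already queued?

class FetchUnit:
    def __init__(self, memory):
        self.memory = memory
        self.hol = NIL        # head-of-list (HOL) pointer register

    def make_executable(self, addr):
        # Store-result side: one read-modify-write on the target instruction.
        inst = self.memory[addr]
        if not inst.on_list:          # never queue an instruction twice
            inst.on_list = True
            inst.exec_tag = self.hol  # old HOL goes into the tag...
            self.hol = addr           # ...and this instruction becomes the new HOL

    def fetch_next(self):
        # Execute side: pop the most recently enabled instruction (LIFO).
        if self.hol == NIL:
            return None               # no executable instructions: quiescent
        addr = self.hol
        inst = self.memory[addr]
        self.hol = inst.exec_tag      # the tag points at the next executable one
        inst.exec_tag = NIL
        inst.on_list = False
        return addr

memory = [Instruction() for _ in range(8)]
fu = FetchUnit(memory)
for a in (3, 5, 3, 1):    # enabling 3 twice queues it only once
    fu.make_executable(a)

order = []
addr = fu.fetch_next()
while addr is not None:
    order.append(addr)
    addr = fu.fetch_next()
# order == [1, 5, 3]: last enabled runs first, and 3 appears only once
```

Note that make_executable touches only the target instruction, matching the single RMW cycle claimed for LIFO; a FIFO variant would additionally rewrite the instruction currently at the tail.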

The choice of ordering does not affect the semantics of program execution, since any constrained ordering is certainly a subset of all possible random orderings. However, the number of instructions executed may differ depending on the ordering used. This difference results from the rule that once an instruction is marked as executable, storing into its other SOURCEOPERANDS does not cause it to be executed more than once. It is possible that for one ordering, multiple SOURCEOPERANDS are set in an instruction before it is executed, while for the other ordering instructions are usually executed after every setting of a SOURCEOPERAND. But experiments described in Section V.4.4 indicate that there is negligible difference between LIFO and FIFO ordering.

IV.2.3. Fixed-Width versus Variable-Width Instructions

At first glance, a FAST-1 instruction looks like a monster: it can have a variable number of SOURCEOPERANDS, a variable number of RESULTS, and a variable number of DESTINATIONS. About the only thing that isn't variable is the OPCODE. One way of handling variable-width instructions is to make memory bit or byte addressable and have instructions occupy as little or as much memory as needed. This approach is used in the Intel iAPX-432 microprocessor [Lattin and Ratner, 1981] and in the DEC VAX architecture [DEC, 1979]. It has several unfortunate drawbacks. It is more complicated than having fixed-width instructions, and fetching instructions will likely be slower. Addresses must be larger since they need to specify a particular bit in memory. The advantage of this scheme is that it makes implementing variable-sized data-types straightforward.

At the other extreme, we might force all instructions to have the same width. From an implementation viewpoint this has great appeal. A fixed-width instruction greatly simplifies matters, because a word of memory can be as wide as an instruction and a single memory fetch retrieves the entire instruction in parallel. In order to do this we must determine the maximum number of SOURCEOPERANDS, the maximum number of RESULTS, and the maximum number of DESTINATIONS each RESULT can have. This sounds pretty drastic. Yet, a deeper investigation indicates that it might not be as bad as it sounds. Except for very complicated instructions that maintain a lot of internal state, for example, an instruction that represents a microprocessor, most instructions have only a few RESULTS. If each RESULT depends only on the SOURCEOPERANDS of the instruction, then it is always possible to implement a multi-RESULT instruction using multiple single-RESULT instructions. As a practical matter, recall that Algorithm III-4 requires at most three RESULTS for implementing a bidirectional transistor and that one of these RESULTS, the GATEVALUE, has no DESTINATIONS.

Nevertheless, there are circuits with fan-out and fan-in in the thousands. So, we still must deal with large numbers of DESTINATIONS, representing large fan-out, and large numbers of SOURCEOPERANDS, representing large fan-in. These two topics are discussed in the following two sections.

IV.2.3.1. Handling Instructions with Large Fan-Out

If an instruction can contain only a fixed number, m, of DESTINATIONS, then the easiest way to fan out a RESULT to more than m destinations is to build a tree of Copy instructions that has one input at the top of the tree and ⌈n/m⌉ Copy instructions at the leaves, where n is the total fan-out required. An example of how this works is shown in Figures IV-4a and IV-4b. The first figure is the original instruction. The second figure is an equivalent program using a fan-out tree, assuming that each RESULT can have at most two DESTINATIONS.

Inst   Opcode  Source    Exec  Result  Destinations
Index          Operands  Tag
0      Add     --        -     -       10.0.t, 11.1.t, 13.2.r, 15.0.t, 14.1.r
(a)

0      Add     --        -     -       1.0.*, 2.0.*
1      Copy    --        -     -       3.0.*, 13.2.r
2      Copy    --        -     -       15.0.t, 14.1.r
3      Copy    --        -     -       10.0.t, 11.1.t
(b)

Figure IV-4: An illustration of fan-out trees. (a) is the original instruction; (b) is the instruction with a fan-out tree added. A TAGVALUE of '*' indicates that the target instruction's EXECUTIONTAG is set to the current EXECUTIONTAG value, as discussed in Section IV.1.2.1.
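A tree like the one in Figure IV-4b can be constructed mechanically. The sketch below is my own helper with illustrative names, not part of the FAST-1 itself: it packs destinations into Copy instructions of at most m targets each, layering until the root instruction needs at most m DESTINATIONS.

```python
def fanout_tree(destinations, m):
    """Pack `destinations` into Copy instructions of <= m targets each,
    layering until the original instruction needs <= m DESTINATIONS."""
    copies = []                      # each entry: the target list of one Copy
    layer = list(destinations)
    while len(layer) > m:
        next_layer = []
        for i in range(0, len(layer), m):
            copies.append(layer[i:i + m])                 # a new Copy instruction
            next_layer.append(('copy', len(copies) - 1))  # refer to it by index
        layer = next_layer
    copies.append(layer)             # last entry: the root's own DESTINATIONS
    return copies

# Five destinations with m = 2, as in Figure IV-4: ceil(5/2) = 3 leaf Copies.
tree = fanout_tree(['n0', 'n1', 'n2', 'n3', 'n4'], m=2)
```

This greedy bottom-up layering is not minimal: it yields three leaf Copies plus two interior ones, whereas Figure IV-4b gets by with three Copies total by letting a Copy mix real targets and Copy targets in one instruction.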

There are two major disadvantages of fan-out trees: they occupy more space and they take extra time to execute, though the impact obviously depends on the structure of the original program. Fortunately, there is another way to achieve unlimited fan-out. Whereas in the block diagram in Figure IV-3 instruction memory is a single unit, by splitting this memory into two pieces the fan-out problem can be solved without using significantly more memory or additional time. The first piece of memory, as before, contains instructions, but with one modification: the DESTINATIONS vector is no longer stored with the instruction, but is instead stored in a separate DESTINATION memory. The instruction contains a pointer into this memory, where each block of DESTINATIONS is sequentially allocated. Figure IV-5 is a block diagram of the FAST-1 using a separate DESTINATION memory.

Figure IV-5: A block diagram of a FAST-1 using a separate DESTINATION memory.

On the surface it may appear that in order to use a separate DESTINATION memory, two sequential memory cycles are required to store a RESULT, instead of one. This problem can be solved using some straightforward pipelining: while writing a RESULT into instruction memory, read the next DESTINATION from DESTINATION memory. Assuming that memory read and write times are comparable, the time it takes to read a DESTINATION can be overlapped with the time it takes to write a RESULT into the currently executing instruction or into the SOURCEOPERAND of some other instruction.

Performance is not the only issue; extra space is required as well. Using a variable number of DESTINATIONS requires that there be some way of knowing how many DESTINATIONS there are for each RESULT. One method is to associate with each RESULT a count of its DESTINATIONS. Another method is to have a LAST bit associated with each DESTINATION. While the latter method is probably easier to implement, determining which method uses less memory depends on an analysis of the expected static fan-out of FAST-1 programs, such as presented in Chapter V. If there are n words in DESTINATION memory and the expected fan-out is less than log2 n, then having a LAST bit uses less memory, while if the expected fan-out is greater than log2 n, a count uses less memory. Moreover, using counts has the advantage that for instructions with multiple RESULTS, in which RESULTS change at different times, it is possible to address directly the DESTINATIONS associated with each changed RESULT; it is not necessary to scan through DESTINATION memory checking LAST bits.
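The log2 n threshold can be checked with a quick calculation. The memory size below is my own example; the text leaves n unspecified.

```python
import math

n = 2 ** 16                           # suppose a 64K-word DESTINATION memory
count_bits = math.ceil(math.log2(n))  # a count field must cover any block size: 16 bits

def last_bits(fanout):
    return fanout                     # one LAST bit per DESTINATION in the block

assert count_bits == 16
assert last_bits(4) < count_bits      # fan-out below log2 n: LAST bits are cheaper
assert last_bits(100) > count_bits    # fan-out above log2 n: a count is cheaper
```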

Finally, there are two special cases that should be considered: the case where there are no DESTINATIONS and the case where there is only one DESTINATION. If we are using a count of the number of DESTINATIONS, the zero case is handled automatically. If we are using a LAST bit, we can either have a NONE bit associated with each RESULT, or a special pointer value, such as zero, to indicate that there are no DESTINATIONS. If the number of bits in a DESTINATION is comparable to the number of bits in a pointer into DESTINATION memory, and a large number of RESULTS have fan-out of one, then it may make sense to allow the pointer to be replaced by a DESTINATION tuple. If we are using a count, then a count of one can indicate that what is normally a pointer is, in fact, an actual DESTINATION. If we are using a LAST bit, we need an ONLYONE bit associated with each RESULT to indicate this fact.

There are some disadvantages of having a separate DESTINATION memory. Besides the extra space required, there is the increased implementation complexity of having two separate memories with pipelined access. Furthermore, whereas bit-addressable memory can be allocated as needed for different instruction fields, if memory is physically partitioned into two pieces, we must decide early in the implementation how much memory to dedicate to each piece. Although we can have the DESTINATION pointer in an instruction point to DESTINATIONS stored separately in instruction memory, doing so prevents us from overlapping the reading of DESTINATIONS and the writing of RESULTS. However, by clever interleaving and allocation of instructions and DESTINATIONS it may be possible to be writing into one part of instruction memory while reading from another part.

IV.2.3.2. Handling Instructions with Large Fan-In

It seems that we ought to be able to handle large fan-in in much the same way as large fan-out, using either fan-in trees or a separate SOURCEOPERAND memory. Unfortunately, life is not quite so simple. In order to use fan-in trees a function must be decomposable and associative. For example, a 3-input And gate can be decomposed into two 2-input And gates: And(a,b,c) equals And(a,And(b,c)). However, a 3-input Nand gate cannot be handled in such a straightforward manner, as Nand(a,b,c) does not equal Nand(a,Nand(b,c)). While theoretically any function can be expressed using 2-input Nand gates, as a practical matter we may not always be able to limit fan-in in this way.
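The decomposability claim is easy to check exhaustively; ordinary booleans are used here as a stand-in for whatever value set the gates compute over.

```python
from itertools import product

def And(*xs):
    return all(xs)

def Nand(*xs):
    return not all(xs)

# And decomposes into a tree: And(a,b,c) == And(a, And(b,c)) for all inputs.
for a, b, c in product((False, True), repeat=3):
    assert And(a, b, c) == And(a, And(b, c))

# Nand does not: with a = b = True and c = False,
# Nand(a, b, c) is True but Nand(a, Nand(b, c)) is False.
assert Nand(True, True, False) != Nand(True, Nand(True, False))
```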

Another method of dealing with large fan-in is to separate the SOURCEOPERANDS from the instruction, as was done for the DESTINATIONS in the previous section. However, whereas it is possible to overlap the reading of a DESTINATION with the storing of a RESULT, no such overlap is possible for the SOURCEOPERANDS. If previously the execution unit could process all the operands of an instruction in parallel, it will now take multiple memory cycles to execute an instruction. On the other hand, it is quite likely that we cannot build an arbitrary fan-in execution unit. Hence, as long as a cycle through the execution unit takes approximately the same amount of time as reading from memory, there is no reduction in performance. Since, in this latter case, we do not need to overlap memory accesses, we can choose not to place the SOURCEOPERANDS in a separate memory, but rather to place them in memory words following the instruction. Each word in instruction memory has room for a few SOURCEOPERANDS, and instructions that need additional SOURCEOPERANDS can fetch additional words following the original instruction. This scheme is reasonably space efficient if unused fields in the extra SOURCEOPERAND words, such as the OPCODE, are small, or if we are able to overload these unused fields with additional SOURCEOPERANDS.

If SOURCEOPERANDS reside at a memory location other than that of the OPCODE part of the instruction, then there is one further complication: we now need to be able to address both the SOURCEOPERAND word and the OPCODE word of the instruction in order to store a RESULT and set the EXECUTIONTAG. The solution to this problem is either to have each DESTINATION contain both a pointer to the beginning of the instruction and a pointer to the SOURCEOPERAND, or to have each additional instruction word contain a pointer or offset to the beginning of the instruction. In any event, we must perform two sequential memory references into instruction memory. We can perform these two memory references concurrently if we are willing to have the two pointers as part of each DESTINATION and have SOURCEOPERAND memory separate from the remainder of instruction memory. Since RESULTS are generally of the same data-type as SOURCEOPERANDS, they can be stored in the SOURCEOPERAND memory as well. A block diagram of this configuration is shown in Figure IV-6. Note that once again we must determine the size of each memory at implementation time, rather than being able to adjust the size dynamically for each program.

Figure IV-6: A block diagram of a FAST-1 processor with separate memories for the OPCODE and EXECUTIONTAGS, the SOURCEOPERANDS and RESULTS, and the DESTINATIONS.

IV.2.3.3. Fixed-Width Instructions for the Switch-Level Simulation Machine

Fortunately, in implementing Algorithm III-4, it is possible to use fixed-width instructions, without resorting to separate DESTINATION or SOURCEOPERAND memory. Transistor instructions have both fixed fan-in and fixed fan-out. In a fixed-width implementation we use unidirectional transistor instructions, which have two SOURCEOPERANDS, two RESULTS, and one DESTINATION. Bidirectional transistors are represented using two unidirectional transistor instructions. Node instructions have two SOURCEOPERANDS, one RESULT, and two DESTINATIONS. Nodes with fan-in greater than two are represented using a fan-in tree of Node instructions. This can be done because least upper bound (LUB) is associative and commutative. Arbitrary fan-out is handled using Copy instructions, which can be implemented using Node instructions.
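The LUB property that justifies fan-in trees of Node instructions can be verified over a three-value encoding. The value set {0, 1, X} and this particular lub are my simplification of a switch-level value lattice; the node values in Algorithm III-4 may carry more information (strengths, for instance).

```python
from functools import reduce
from itertools import product

X = 'X'                       # unknown, above both 0 and 1 in the lattice

def lub(a, b):
    """Least upper bound: the common value if the operands agree, else X."""
    return a if a == b else X

vals = (0, 1, X)
for a, b, c in product(vals, repeat=3):
    assert lub(a, b) == lub(b, a)                  # commutative
    assert lub(lub(a, b), c) == lub(a, lub(b, c))  # associative

# So a 5-input Node can be computed as a tree of 2-input Node instructions:
inputs = [1, 1, X, 1, 1]
tree = lub(lub(lub(inputs[0], inputs[1]), lub(inputs[2], inputs[3])), inputs[4])
assert tree == reduce(lub, inputs) == X
```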

In Chapter V, I present statistics that compare a restricted, or minimal, machine with fan-in and fan-out of 2 to the arbitrary fan-in and fan-out machine in Figure IV-6. On average the restricted machine requires 2.7 times more instructions than the general machine, while the general machine runs only twice as fast.

IV.2.4. The Impact of Technology

Now that I have presented most of the implementation issues, it is appropriate to consider, in greater detail, how technology affects the implementation. As originally conceived [Frank, 1982], a FAST-1 processor was to be implemented as a single VLSI chip, and multiple chips were to be used to create a larger machine. Is such a chip feasible? This question is hard to answer without specifying the machine more concretely. For example, assume that the machine has a 2-bit OPCODE, two 8-bit SOURCEOPERANDS, two 8-bit RESULTS, two EXECUTIONTAGS, and two DESTINATIONS per instruction. The number of bits in a DESTINATION and in an EXECUTIONTAG depends on how many instructions we want to be able to reference. Since, at the present time, 256K by 1 memory chips are the densest available, and as we already have a minimum of about 40 bits in each instruction, this means that we can have at most four thousand instructions on chip. In order to leave room for the rest of the processor, let's assume that there are one thousand instructions. Therefore, each instruction address requires 10 bits, adding 40 bits to the instruction. Each instruction is, therefore, about 80 bits wide. Is this too big? Not at all.
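A quick recap of the sizing arithmetic. The bookkeeping is mine: I take the four address-sized fields to be the two EXECUTIONTAGS and two DESTINATIONS, and I ignore the operand-index and tag-value bits a DESTINATION also carries, which is where the rounding to "about 80" comes from.

```python
opcode, source, result = 2, 8, 8
base_bits = opcode + 2 * source + 2 * result  # fixed-width data fields
addr_bits = 10                                # 2**10 = 1024 on-chip instructions
link_bits = (2 + 2) * addr_bits               # two EXECUTIONTAGS + two DESTINATIONS

assert base_bits == 34                        # "a minimum of about 40 bits"
assert link_bits == 40                        # "adding 40 bits to the instruction"
assert base_bits + link_bits == 74            # i.e. about 80 bits wide
```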

Consider for a moment how commercial dynamic random access memories (DRAMs) are built. Because of pin limitations, addressing memory is done in two cycles. First a row address is provided (RAS) and then a column address (CAS), yielding a single bit for reading or writing. Yet, the internal organization of these memories is such that during the RAS cycle usually 128 or 256 bits are available in parallel. Hence, this two-level organization results in a factor of 128 or 256 reduction in potential memory bandwidth. Some commercial memory parts provide a limited mechanism for utilizing the memory bandwidth available during RAS. For example, the TI 4161 [Texas Instruments, 1983] DRAM has a special shift register that can be loaded in parallel using the data available at RAS. The data can then be shifted off-chip, a bit at a time, at a rate faster than the normal memory access time. It is clear that in order to take advantage of the full bandwidth we need to be able to utilize a large number of bits in parallel. A wide instruction word is, therefore, an ideal match to the way in which memories are designed. Though it is true that a wider memory is somewhat slower than a narrower memory, due to the longer word line, it is generally much faster to provide parallel access to n bits in one memory cycle than to use an m-bit wide memory and ⌈n/m⌉ memory cycles.

Is there enough room for the rest of the processor? This depends on the complexity of the control and the complexity of the execution unit. In terms of fetching instructions and storing RESULTS, the control is straightforward. It can be implemented using a finite state machine with at most a few dozen states. Implementing an instruction such as least upper bound is also straightforward, as it is combinatorial. Using the 32-bit Berkeley RISC microprocessor [Patterson and Sequin, 1982] and the 8-bit CMU Programmable Systolic Chip (PSC) [Fisher and Kung, 1983] as guidelines, it appears that the kind of control and execution units needed in a FAST-1 require significantly less space than the memory. So, a thousand-instruction single-chip FAST-1 processor is feasible using current VLSI technology.

If we want to run large programs, such as simulations of circuits with tens to hundreds of thousands of transistors, then a thousand instructions is not enough. There are two solutions to this problem. One of them is to build a multiprocessor using hundreds of VLSI processors. This has the nice property that the machine can be expanded in fairly small increments and that, as more processors are added, the performance of the overall machine may improve since more instructions can now be executed in parallel. On the other hand, this kind of multiprocessor requires partitioning the program onto the processors as well as some method of interprocessor communication. Because such a configuration is a central part of this research, I will postpone further discussion of it until the second half of this dissertation, where I consider it in full detail.

A different approach to building a machine with more than a thousand instructions is to construct it using off-the-shelf memory parts, such as 256K by 1 DRAMs. The instruction word now grows to about 110 bits, since addresses now require 18 bits instead of 10 bits. The remainder of the processor can be implemented as a custom VLSI chip.

In order to run at full speed the chip will require about 140 pins, so that all of the instruction bits can be accessed directly. If we are willing to use higher speed memory and interleaving, then the pinout of the chip can be reduced. Nevertheless, building a chip with 140 pins is certainly feasible using packaging technology such as pin grid arrays. Alternatively, we can implement the processor using semi-custom gate arrays or standard TTL parts such as programmable array logic (PALs) [Texas Instruments, 1984]. Preliminary design work I have done indicates that about 75 chips are needed in addition to the memory chips. Approximately 20 of these are buffers and multiplexors used in accessing memory, 15 are registers used to implement the instruction register, 10 are PALs and ROMs used to implement the ALU, 5 are registers and multiplexors to implement the top-of-stack pointers, and the remainder are ROMs, registers, and buffers used to implement the finite state control and the host interface. Thus, using a total of about 200 standard chips one can implement a large FAST-1 processor.

To improve the price/performance of the implementation it may be possible to use a cache. A cache does not improve ultimate performance, since we can always build everything out of the memory parts used for the cache. For example, much of the performance of the Yorktown Simulation Engine [Denneau, 1982a] comes from the use of 35ns access-time memory. Moreover, for the cache to be effective there must be a good deal of locality, which is not necessarily true for simulation.

It might also be possible to build a virtual memory FAST-1 with demand paging. As with a cache, the effectiveness of this approach depends largely on locality of reference. In von Neumann computers, locality of instruction reference is enhanced by the sequential execution of instructions. A similar effect can be obtained in a data-driven system by placing instructions that have fairly intimate data-dependencies into the same memory page. A similar kind of locality is needed to make a multiprocessor FAST-1 work effectively, in that what are inter-page references in a virtual memory system become inter-processor references in a multiprocessor. For similar reasons, Dally has suggested using virtual memory in the Mossim Simulation Engine as a method for increasing processor utilization [Dally, 1984].

IV.2.5. Reliability

While a complete investigation of reliability is beyond the scope of this dissertation, a few words are in order. One of the advantages of having a machine that is primarily memory is that well-known memory error detection and correction techniques can be used. For example, having a parity bit for each instruction is certainly worthwhile. Furthermore, the relatively wide instruction word required by a FAST-1 makes more sophisticated techniques, such as single-bit error correction, double-bit error detection (SECDED), very space efficient.

In terms of debugging and maintaining the machine, as with many microcoded machines, the instruction register provides a wonderful place for incorporating a serial scan-in, scan-out (SISO) datapath [Frank and Sproull, 1981]. Using the SISO datapath and a little bit of extra control, memory can be accessed and the execution unit exercised with relative ease.

IV.2.6. Other Issues

In order for a computing system to be useful, it must be possible to compile and debug programs, monitor performance, and otherwise interact with it. Rather than incorporate all of the needed features into the FAST-1 processor itself, this is an occasion to stand on the shoulders of standard processor architectures. As with most special-purpose architectures, it makes a lot of sense to envision the machine connected to some form of host processor. I anticipate that programs such as compilers, user interfaces, and the like would all be written using one's favorite traditional programming language. The primary impact this has on implementing a FAST-1 processor is that the proper hooks must be provided for the host processor to control the FAST-1. One very important hook is the ability to download programs. FAST-1 instruction memory must be accessible to the host processor. Whether it should appear as part of the host's address space or look like an I/O device to the host is not particularly important. But it does mean that the FAST-1's memory must be dual-ported. If downloading does not happen too often, one mechanism for implementing host access to FAST-1 memory is to provide a serial path. This technique is a common one, and memory chips such as the TI 4161, in which there is a separate serial access path to memory, provide an easy way of implementing it. If a SISO datapath is provided into the instruction register, it too can be used for downloading programs.

The other necessary hooks are for control. The host processor must be able to start and stop the FAST-1, interrogate registers such as the HOL pointer in the fetch unit, and know when the FAST-1 is idle. Once again, the SISO features incorporated for debugging and maintenance can be used for this purpose as well.

For debugging programs or, as in this thesis, logic simulations, it is useful to trace instruction execution. The easiest way to provide this facility is to add an extra bit to each instruction. If the bit is set when an instruction is executed, the FAST-1 processor halts and then interrupts the host. Other schemes, such as an associative memory that contains the addresses of instructions being traced, are also possible. For performance monitoring and for partitioning, it would be nice to be able to keep track of how many times each instruction is executed. Though this can be done using the trace bit mentioned above, a better approach, which has significantly less performance impact, is to have an execution count field as part of each instruction. Having 32 bits for this field is enough; fewer can be used by interrupting the processor whenever the count overflows. Clearly, such a counter can also be used in tracing instructions.

Finally, the FAST-1 must be able to send and receive RESULTS to and from the host. Depending on the application, the complexity of this interface may vary. In the case of simulation, the host will generally be sending data to the FAST-1 only when the FAST-1 is quiescent. Here, a relatively straightforward interface should suffice. However, when the host is receiving results back from the FAST-1, we would like to have the FAST-1 continue execution while the host processes the last batch of RESULTS. This is particularly true when using multiple EXECUTIONTAGS, so the host can retrieve the RESULTS from the end of the current cycle while the FAST-1 proceeds onto the next cycle. One way to implement this is to allow DESTINATIONS to reference a separate dual-ported interface memory. The host can address this memory in the normal fashion, and FAST-1 references to this memory are created by prepending a host-defined offset to the memory address obtained from the DESTINATION. In this way, the host can implement multiple RESULT buffers in order to improve overall system performance. Note that if we are building a multiprocessor FAST-1, then much of the above interaction can be accommodated by having the host sit on the interprocessor bus, communicating as if it were just another FAST-1 processor.

V. Uniprocessor Experiments

Even though Algorithm III-4 appears to map nicely onto the FAST-1 architecture, there is no way of knowing how well the whole system works without running some experiments. Most papers written about simulators discuss performance by estimating how many events per second or evaluations per second the simulator executes when running on some particular machine. These kinds of figures do not really say much about how fast the simulator will simulate a particular circuit. It is unfortunate that, within the simulation community, there is no well-established set of benchmarks.

In this chapter, I discuss two sets of experimental data. The first set is a static analysis of 27 MOS circuits. While, ultimately, it would be useful to be able to predict dynamic performance using static analysis, I have not attempted to do this. Rather, the static characteristics I describe simply show that there are some attributes of circuits, such as the ratio of transistors to nodes, that can be predicted with reasonable confidence. Moreover, while I make no claims as to whether these circuits are representative of circuits at large, the static analysis has given me some confidence that they indeed may be. The second set of measurements describes the dynamic behavior of a FAST-1 processor implementation of Algorithm III-4. These measurements are the result of using the FAST-1 to simulate 13 of the 27 circuits that were statically analyzed. To provide the reader with some feel as to how well Algorithm III-4 works in comparison to other simulators, I have included measurements of the performance of Mossim II simulating the same circuits. Finally, using simulations and back-of-the-envelope calculations, I compare the FAST-1 architecture to other simulation architectures, in particular the Yorktown Simulation Engine and the Connection Machine.

Perhaps the most interesting results discussed in this chapter are those pertaining to the potential for speedup using parallel implementations of simulation algorithms. It is generally thought that the performance of logic simulators can be dramatically improved using parallel implementations, with the potential, perhaps, of thousand-fold speedups. The experiments presented in this chapter indicate that there is not nearly as much useful parallelism as has previously been thought, at least if the circuits measured are at all representative. In the best case measured, the potential practical speedup is limited to about a factor of 100, even if several thousand processors are used.

The data presented in the following sections summarizes a vast quantity of raw data--samples of which are presented in Appendix C. Where appropriate, I have included the standard deviation and the 95% confidence intervals of the data. In these cases, I have also used the Kolmogorov-Smirnov test [Allen, 1978], at a 5% level of significance, to determine whether or not the data fits a normal distribution. Note, however, that in some instances the distribution test is based on the average of averages, where the original data did not necessarily have a normal distribution. Hence, some caution must be used in interpreting the results.

The performance results described below are based not on a hardware implementation of a FAST-1 simulator, but rather on a software implementation, which is described in Section V.2. One interesting aspect of the software implementation is that even though it is simulating a FAST-1 processor simulating a circuit, its performance is competitive with other software-implemented switch-level simulators. I believe that the results of simulating the hardware are an accurate reflection of the hardware's actual performance.

V.1. The Circuits

The circuits used for the static and dynamic analysis are designs from a variety of sources, largely within the ARPA university community. As far as I know, all of the chips are designed using the Mead and Conway design style [Mead and Conway, 1980]. Hence, it is reasonable to expect that the chips exhibit a fair amount of regularity, though it is unclear if design style has any effect on the statistics presented.

The following is a list of the chips that are analyzed. The name before the dash is used as an identifier for the chip throughout the thesis. An asterisk after a name indicates that the circuit is analyzed both statically and dynamically. Because static analysis requires no knowledge of how a circuit works, or even whether it works, more circuits could be analyzed statically than could be analyzed dynamically.

• riscb - The Berkeley RISC-I microprocessor [Patterson and Sequin, 1982].
• scheme - The MIT Scheme-2 microprocessor [Steele and Sussman, 1980].
• psc - The CMU PSC microprocessor [Fisher and Kung, 1983].
• ssc* - A Silicon Solutions Corp. FIR filter chip.
• chess* - A move generator chip for chess [Ebeling and Palay, 1984].
• phase - An Ethernet phase decoder chip.
• mac - The multiplier accumulator from the PSC.
• mul8x8* - An 8x8 carry-save multiplier.
• adder4*, adder8*, adder16*, adder32 - 4, 8, 16, and 32 bit ripple carry adders from the chess chip.

• stk4x4*, stk8x8*, stk16x16*, stk32x32 - 4x4, 8x8, 16x16, and 32x32 stacks from the chess chip.
• ram4x4*, ram8x8*, ram16x16* - 4x4, 8x8, and 16x16 six-transistor static RAMs.
• ram64x64 - A self-timed 64x64 six-transistor static RAM [Frank and Sproull, 1983].
• cadd24* - A 24-bit CMOS adder.
• fir - A FIR filter chip.
• fifo - A FIFO chip.
• life - A 3x3 Life chip.
• alu - An ALU.
• synth - A frequency synthesizer chip.
• filt80 - Yet another filter chip.

V.2. The Software Implementation of the FAST-1 MOS Simulator

Implementing the FAST-1 in software required writing a program that emulates an instance of the FAST-1 described in Section IV.1.3. In the context of this research, this task is made somewhat more difficult because of the need to be able to simulate multiple FAST-1 processors connected via multiple busses. A reasonable user interface is also required so that it is possible to drive the circuits being simulated. Much of the code has nothing to do with simulation, per se, but rather is required for gathering statistics and allowing many different configurations to be modeled.

The language I chose for implementing the simulator is called 'Class C'1 [Stroustrup, 1980; Stroustrup, 1982], which consists of a preprocessor for the standard Unix C compiler and a run-time package. Class C provides two main features. First, it allows the programmer to define abstract data types in the form of Simula-like classes and, second, it provides light-weight tasks. Thus, simulating multiple processors can be accomplished without resorting to the much more expensive Unix process mechanism. The Unix mechanism has the added disadvantage that it forces inter-processor communication to be simulated using either pipes or message-based IPC.

A FAST-1 processor is defined as a derived class of the system-supplied base class task. When an instance of a processor is allocated it, in turn, allocates a vector of instructions and then allocates one or more ports. A port is also a task, so that a processor and its ports can operate asynchronously. The main

1. Class C has been supplanted by a language called C++, also written by Stroustrup [Stroustrup, 1984]. Though I have never used C++, my experiences with Class C, and my readings of the C++ user manuals, lead me to believe that C++ is a very useful programming language.

part of the processor task is a loop that corresponds to the inner loop of Algorithm IV-1. If no instructions are pending, the processor sleeps, waiting for something to do. Otherwise, an instruction is removed from the list of executable instructions and an evaluation routine is invoked based on the instruction's OPCODE.
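As a sketch, the processor's inner loop might look like this in modern C++; the structure and member names here are illustrative, not taken from the Class C source, and the real task would sleep rather than return when the list empties.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <vector>

// Hypothetical sketch of the processor task's inner loop (names invented).
struct Instruction {
    int opcode;  // selects which evaluation routine to invoke
};

struct Processor {
    std::deque<Instruction*> executable;                  // pending instructions
    std::vector<std::function<void(Instruction&)>> eval;  // routines, indexed by opcode
    int executed = 0;

    void run() {
        while (!executable.empty()) {     // a real task would sleep when empty
            Instruction* i = executable.front();
            executable.pop_front();       // remove from the executable list
            eval[i->opcode](*i);          // dispatch on the instruction's OPCODE
            ++executed;
        }
    }
};
```

The dispatch table stands in for the simulator's per-OPCODE evaluation routines; in the actual implementation each routine would compute RESULTS and propagate them.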

Instructions are also defined as a class. The function Execute evaluates an instruction and calls the routine SetSource to propagate results. The different signal models described in Chapter III are implemented as abstract data types called values. By simply relinking the simulator with a different implementation of the values class, it is possible to experiment with different signal models.
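A minimal sketch of one such values implementation, using the familiar three-state switch-level signal model (0, 1, and X); the names and the combination rule are illustrative only, and the models of Chapter III are richer (they also carry signal strengths).

```cpp
#include <cassert>

// Illustrative three-state switch-level value type (0, 1, unknown).
// Relinking with a different implementation of this interface is how the
// simulator swaps signal models.
enum Value { V0, V1, VX };

// Combining two drivers: agreement keeps the value, conflict yields X.
inline Value combine(Value a, Value b) { return a == b ? a : VX; }

// Inversion: X stays X.
inline Value invert(Value a) { return a == VX ? VX : (a == V0 ? V1 : V0); }
```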

Once an instruction has been evaluated and its RESULTS have been stored, Execute calls the function Delay. It is here that simulated real time is accounted for. The amount of time delayed equals

1 + Number of DESTINATION Stores + Number of Extra SOURCEOPERAND Fetches

and corresponds to the number of read-modify-write memory cycles that need to be performed. For example, assuming a combined total of six SOURCEOPERANDS and RESULTS per instruction, a bidirectional transistor instruction has three SOURCEOPERANDS (NEWGATEVALUE, SOURCEVALUE, and DRAINVALUE), three RESULTS (GATEVALUE, CURRENTDRAINVALUE, and CURRENTSOURCEVALUE), and two DESTINATIONS. If the instruction is executed and yields a DRAINVALUE or SOURCEVALUE that differs from its previous value, then either two or three RMW cycles are required. If both RESULTS are unchanged, only the single cycle is required. Nodes, on the other hand, can have arbitrary fan-in, so that in some situations extra SOURCEOPERAND words must be fetched and, as shown above, this is taken into account. However, in a minimal implementation, where instructions have a fixed width, there are never any extra SOURCEOPERAND fetches.
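The accounting above reduces to a one-line function; the parameter names are mine.

```cpp
#include <cassert>

// Simulated time charged to one instruction execution, per the Delay formula:
// one base read-modify-write cycle, plus one cycle per RESULT actually stored
// to a DESTINATION, plus one cycle per extra SOURCEOPERAND word fetched.
inline int rmw_cycles(int destination_stores, int extra_operand_fetches) {
    return 1 + destination_stores + extra_operand_fetches;
}
```

So a bidirectional transistor instruction whose two RESULTS are unchanged costs one cycle, and one that stores both of its changed RESULTS costs three.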

The outer loops of the simulation algorithm, where PresentPhase and the delay index, d, are updated, are implemented as a main task that receives control any time all processors and ports are idle. Moreover, this task is responsible for processor synchronization, which occurs every time PresentPhase is updated. This updating and synchronization is modeled as costing one RMW cycle. The main task also provides a user interface via which nodes may be driven and sensed. In addition, it allows the user to invoke a user-written test program that has been linked into the simulator. A special class called signal is provided to simplify the setting and retrieving of node values. The signal class allows a test program to drive the simulation and verify the results. Furthermore, using the magic of macros, in some instances the same test program has been used to drive both the simulation of a chip and, using our tester, the actual chip itself. By way of example, the program for exhaustively testing an 8-bit multiplier is shown in Figure V-1. Finally, setting or getting a node value is modeled as costing a RMW cycle.

Figure V-1: The program for exhaustively testing an 8-bit multiplier. Only the opening of the listing survives:

    #include "signals.h"
    #define width 8
    mul()
    {
        int i, j, count, mask;
        signal a("a", width, 0), b("b", width, 0), p("p", width << 1, 0);
        count = 1 <<
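The fragment of Figure V-1 only hints at the test's structure. The following sketch shows the same idea, exhaustively multiplying every operand pair and checking the product; here sim is a stand-in for the simulated chip, which in the real test program would be driven through the signal class (drive a and b, step the simulator, read back p).

```cpp
#include <cassert>

// Hedged reconstruction of the exhaustive-test idea behind Figure V-1.
// 'sim' abstracts the device under test; in the thesis's program it would be
// the simulated (or real) multiplier chip accessed via signals a, b, and p.
template <typename Sim>
unsigned exhaustive_multiplier_test(Sim sim, unsigned width) {
    unsigned failures = 0;
    const unsigned count = 1u << width;   // all operand values 0 .. 2^width - 1
    for (unsigned i = 0; i < count; ++i)
        for (unsigned j = 0; j < count; ++j)
            if (sim(i, j) != i * j)       // compare against the expected product
                ++failures;
    return failures;
}
```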

V.3. Static Measurements

The static analysis of circuits is interesting because it provides one method for finding circuit characteristics that are consistent across designs. Such characteristics can be used in a number of ways. For example, in building simulation hardware it may be necessary to have some idea of the expected value of some characteristic in order to determine the amount of memory needed. Another use is in predicting the average-case performance of various algorithms. For example, Algorithm III-4 is, on average, faster when nodes have low fan-in and fan-out. Thus, one use of some of the static data presented is in trying to predict the running times of simulations.

For those readers uninterested in wading through the details, the following is a summary of the static analysis:

• The ratio of the total number of transistors to the total number of nodes is approximately 2 to 1.
• In NMOS, approximately 29% of the transistors are connected directly to Vdd, and approximately 37% are connected directly to Ground. Thus, in NMOS circuits, 66% of the transistors can be immediately identified as unidirectional. Using the simple depth-first search algorithm described in Chapter III, a total of 80% of the transistors are, on average, labeled as unidirectional.

• The average fan-in of nodes is approximately 2.7, while the average fan-out is approximately 2.9.
• When each node or transistor is represented using a single instruction, the ratio of FAST-1 instructions to circuit transistors is approximately 1.3 when one-input nodes are not eliminated, and 1.1 when one-input nodes are eliminated.
• When circuits are represented using node instructions with only two SOURCEOPERANDS and two DESTINATIONS, and bidirectional transistors are implemented using two unidirectional transistor instructions, the ratio of instructions to circuit transistors is approximately 3.4.
• The distribution of the number of nodes and transistors in a transistor group varies tremendously from circuit to circuit.

V.3.1. Transistors, Nodes, and the Distribution of Instructions

Table V-1 summarizes transistor and node counts for each of the circuits analyzed. Among other things, this data suggests that when building an MOS simulation machine with a capacity for up to t transistors, there needs to be room for about t/2 nodes as well. Moreover, this data provides indirect evidence that the fan-in of nodes is relatively low, since high fan-in circuits, such as RAMs, yield a significantly higher ratio, as shown in the table. These statistics also give us indirect evidence that the 'average' circuit construct is approximately an inverter; thus, complicated function block recognizers may not provide much more utility than simple recognizers.

Table V-2 shows the distribution of uses of transistors and nodes. This refinement of the data in Table V-1 indicates that, without any graph traversal at all, it is possible to identify about two-thirds of the transistors in a circuit as being unidirectional. Moreover, in NMOS, about 25% of the transistors are static pullups. We do not have to generate instructions for these pullups, but can instead simply preset a SOURCEOPERAND of the associated node. Note that, even though the remainder of the transistors are labeled as pass transistors, they may, in reality, be part of structures such as Nand gates.
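The traversal-free classification described above amounts to a test on each transistor's source and drain terminals; the sketch below uses invented structures and omits the depletion/enhancement distinction.

```cpp
#include <cassert>
#include <string>

// Minimal sketch of the traversal-free NMOS classification: a transistor tied
// directly to Vdd or Ground is immediately known to be unidirectional.
// Structure and node names are invented for illustration.
struct Transistor {
    std::string source, drain;
};

enum Kind { Pullup, Pulldown, Pass };

Kind classify(const Transistor& t) {
    if (t.source == "Vdd" || t.drain == "Vdd") return Pullup;
    if (t.source == "Ground" || t.drain == "Ground") return Pulldown;
    return Pass;  // may still be part of a Nand gate, etc.
}
```

Everything classified Pullup or Pulldown here is the roughly two-thirds identified without traversal; the depth-first search of Chapter III then relabels some of the remaining Pass transistors.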

While the data in Table V-1 seems to suggest that the average circuit construct is an inverter, the data in Table V-3 indicates that this is too crude an approximation. Rather, it appears that the 'average' circuit construct is an inverter connected to the source or drain of a pass transistor, a Nand gate, or a multi-input Nor gate. Finally, roughly two out of every three nodes are connected to gates of transistors.

Different implementations of a FAST-1 processor require that memory be allocated in different ways. At minimum, it is useful to have a good estimate of how large a circuit can be simulated using a FAST-1 capable of holding a given number of instructions. Because of various optimizations, this number is not simply the number of transistors plus the number of nodes. As shown in Table V-3, while circuits have twice as many transistors as nodes, FAST-1 programs for

Table V-1: The ratio of transistors to nodes

Circuit     # Transistors   # Nodes   Transistors/Node
adder4             78           46        1.696
adder8            154           90        1.711
adder16           306          178        1.719
adder32           610          354        1.723
alu              3047         1501        2.030
cadd24           2122         1109        1.913
chess           15411         5762        2.675
fifo             8269         4539        1.822
filt80          11987         7920        1.514
fir             14421         8909        1.619
life             1107          572        1.935
mac              4114         1998        2.059
mul8x8           1384          619        2.236
phase            1158          508        2.280
psc             25510        14008        1.821
ram4x4            156           74        2.108
ram8x8            504          204        2.471
ram16x16         1776          650        2.732
ram64x64        29138        10226        2.849
riscb           42084        18771        2.242
scheme           9475         2420        3.915
ssc             20233        12476        1.622
stk4x4            160           86        1.860
stk8x8            540          282        1.915
stk16x16         2068         1058        1.955
stk32x32         8196         4146        1.977
synth            3659         2275        1.608

Mean       2.074
Variance   0.261
Std. Dev.  0.510
95% Conf.  0.193
Dist.      normal

these circuits have only 50% more transistor instructions than node instructions. On average, there are only 30% more instructions than transistors in the circuit. An important exception to this average is the CMOS adder circuit (cadd24), where there are no static pullups. For this circuit there are 50% more instructions, as the original transistors per node ratio leads us to expect. As discussed in Section V.3.6, further savings are possible by eliminating one-input nodes.

V.3.2. Fan-In and Fan-Out

Fan-in and fan-out statistics are interesting because, like the number of transistors in a circuit, they potentially affect the amount of storage needed to represent a circuit. As shown in Table V-4, on average the fan-out of circuit nodes is 2.9, while on average the fan-out of FAST-1 instructions is 2. FAST-1 instructions have lower average fan-out because transistor instructions have a fan-out of 1 or 2 and represent 60% of the instructions in a circuit. In terms of implementing a machine, this data indicates that if destination memory is separate from the remainder of instruction memory, then there should be room for about twice as many destinations as instructions. The last column of the table is the average number of SOURCEOPERAND words used per instruction, assuming that each SOURCEOPERAND word can contain a combined total of six SOURCEOPERANDS and RESULTS. As the data shows, this means that on average, instruction memory must be about 8% larger than indicated by the statistics in Table V-3.
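Under the packing rule just stated (a combined total of six SOURCEOPERAND and RESULT fields per word, plus, per the caption of Table V-4, one extra host input on each node instruction), the operand-word count per instruction can be estimated as follows; the exact field layout of a real FAST-1 instruction word is an assumption here.

```cpp
#include <cassert>

// Estimated SOURCEOPERAND words for one instruction, assuming six combined
// SOURCEOPERAND and RESULT fields fit per word and that each node instruction
// carries one extra input for the host. The packing details are assumptions.
inline int operand_words(int fan_in, int results, bool is_node) {
    int fields = fan_in + results + (is_node ? 1 : 0);
    return (fields + 5) / 6;  // round up to whole words
}
```

A bidirectional transistor instruction (three SOURCEOPERANDS, three RESULTS) fits in one word on this rule, while wide-fan-in nodes spill into extra words, which is what pushes the averages in the last column of Table V-4 above 1.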

Table V-2: The distribution of transistor and node types. Only transistors tied directly to Vdd or Ground are classified as pullups or pulldowns. Bogus transistors are those that can be eliminated without affecting the function of the circuit. Capacitive nodes are those that are connected to gates of transistors, or that are precharged by enhancement-mode transistors.

Circuit     Pullups   Super-buf   Pulldowns   Pass          Bogus         Capacitive
                      Pullups                 Transistors   Transistors   Nodes
adder4       0.321      0.000       0.321       0.359         0.000         0.630
adder8       0.318      0.000       0.318       0.364         0.000         0.633
adder16      0.317      0.000       0.317       0.366         0.000         0.635
adder32      0.316      0.000       0.316       0.367         0.000         0.636
alu          0.220      0.020       0.317       0.365         0.078         0.545
cadd24       0.000      0.453       0.323       0.224         0.000         0.537
chess        0.211      0.030       0.513       0.239         0.007         0.691
fifo         0.086      0.053       0.329       0.477         0.055         0.560
filt80       0.185      0.011       0.373       0.428         0.003         0.612
fir          0.290      0.008       0.342       0.358         0.002         0.588
life         0.218      0.030       0.321       0.415         0.017         0.528
mac          0.175      0.198       0.299       0.328         0.000         0.855
mul8x8       0.267      0.023       0.557       0.145         0.009         0.674
phase        0.259      0.060       0.562       0.117         0.002         0.787
psc          0.084      0.051       0.365       0.474         0.026         0.595
ram4x4       0.346      0.000       0.372       0.282         0.000         0.730
ram8x8       0.335      0.000       0.367       0.298         0.000         0.814
ram16x16     0.331      0.000       0.358       0.311         0.000         0.889
ram64x64     0.313      0.028       0.364       0.291         0.004         0.974
riscb        0.211      0.084       0.380       0.324         0.001         0.806
scheme       0.157      0.018       0.664       0.156         0.004         0.712
ssc          0.269      0.006       0.357       0.366         0.002         0.486
stk4x4       0.262      0.025       0.338       0.375         0.000         0.581
stk8x8       0.256      0.007       0.278       0.459         0.000         0.518
stk16x16     0.252      0.002       0.258       0.487         0.000         0.501
stk32x32     0.251      0.000       0.253       0.496         0.000         0.498
synth        0.316      0.018       0.365       0.299         0.002         0.599

Mean         0.253      0.037       0.368       0.340         0.015         0.652
Variance     0.005      0.002       0.009       0.010         0.001         0.017
Std. Dev.    0.072      0.046       0.096       0.102         0.023         0.130
95% Conf.    0.028      0.021       0.036       0.039         0.012         0.049
Dist.        normal     normal      normal      normal
Samples      26         18          14

Because the time it takes to execute a FAST-1 instruction is proportional to both the instruction's fan-in and its fan-out, it is reasonable to conclude that if most nodes have high fan-in or fan-out, then the FAST-1 architecture may not be the right one to use. Fortunately, as the data shows, this is not the case: both fan-in and fan-out are, on average, reasonably small. On the other hand, circuits that have many instructions with high fan-out may have a higher degree of parallelism, and may be better candidates for simulation on a multiprocessor simulation machine. Low average fan-out may be a mixed blessing.

While the above averages are instructive, it is also useful to consider their distributions. Figure V-2 is a plot of node fan-in versus percentage of nodes. Figure V-3 shows the same data as a cumulative percentage of nodes. This data shows that for most circuits 90% of the nodes have fan-in of 5 or less, although the exact distribution of fan-in varies from circuit to circuit. As shown in Figures V-4 and V-5, if we examine instructions instead of nodes, we see that this variance decreases markedly, due to the impact of transistors. The bimodal nature of the distribution is also due to the effect of transistors, whose fan-in is either four, for unidirectional transistors, or six, for bidirectional transistors.

Table V-3: The ratio of instructions to transistors and nodes in original circuits. Static depletion mode pullups have been eliminated.

Circuit     Transistor     Trans Inst    Trans Inst     Node           Node Inst     Total          Total Inst
            Instructions   /Total Inst   /Transistors   Instructions   /Total Inst   Instructions   /Transistors
adder4             53        0.535         0.679              46         0.465              99         1.269
adder8            105        0.538         0.682              90         0.462             195         1.266
adder16           209        0.540         0.683             178         0.460             387         1.265
adder32           417        0.541         0.684             354         0.459             771         1.264
alu              2141        0.588         0.703            1501         0.412            3642         1.195
cadd24           2122        0.657         1.000            1109         0.343            3231         1.523
chess           12046        0.676         0.782            5762         0.324           17808         1.156
fifo             7103        0.610         0.859            4539         0.390           11642         1.408
filt80           9733        0.551         0.812            7920         0.449           17653         1.473
fir             10209        0.534         0.708            8909         0.466           19118         1.326
life              847        0.597         0.765             572         0.403            1419         1.282
mac              3392        0.629         0.825            1998         0.371            5390         1.310
mul8x8           1003        0.618         0.725             619         0.382            1622         1.172
phase             856        0.628         0.739             508         0.372            1364         1.178
psc             22704        0.618         0.890           14008         0.382           36712         1.439
ram4x4            102        0.580         0.654              74         0.420             176         1.128
ram8x8            335        0.622         0.665             204         0.378             539         1.069
ram16x16         1188        0.646         0.669             650         0.354            1838         1.035
ram64x64        19881        0.660         0.682           10226         0.340           30107         1.033
riscb           33144        0.638         0.788           18771         0.362           51915         1.234
scheme           7946        0.767         0.839            2420         0.233           10366         1.094
ssc             14754        0.542         0.729           12476         0.458           27230         1.346
stk4x4            118        0.578         0.738              86         0.422             204         1.275
stk8x8            402        0.588         0.744             282         0.412             684         1.267
stk16x16         1546        0.594         0.748            1058         0.406            2604         1.259
stk32x32         6138        0.597         0.749            4146         0.403           10284         1.255
synth            2496        0.523         0.682            2275         0.477            4771         1.304

Mean                         0.600         0.749                         0.400                         1.253
Variance                     0.003         0.007                         0.003                         0.015
Std. Dev.                    0.055         0.081                         0.055                         0.124
95% Conf.                    0.021         0.031                         0.021                         0.047
Dist.                        normal        normal                        normal                        normal

As shown in Figures V-6 and V-7, fan-out has much less variance than does fan-in. Nevertheless, it is clear from these plots that some circuits have at least a few nodes with fan-out over a thousand. In many instances, as in the SSC chip, for example, the large fan-out nodes are the outputs of clock drivers. The distribution of the fan-out of instructions, shown in Figures V-8 and V-9, is skewed towards one, due to the impact of transistor instructions.

V.3.3. Sizes of Transistor Groups

Another interesting static statistic is the size of transistor groups. This statistic is relevant because some simulation algorithms, such as Bryant's original Mossim algorithm2 [Bryant, 1980] and Dumlugol's Local Relaxation Algorithm [Dumlugol, et al., 1983], visit every node in a transistor group whenever any node in the group has been perturbed. If transistor groups are small, this traversal doesn't cost very much. If transistor groups are large, however, this traversal

2. But definitely not MossimII.

Table V-4: Average fan-in and fan-out of circuits. Only transistors tied directly to Vdd and Ground are labeled as unidirectional. The instruction fan-in and SOURCEOPERAND statistics are based on a combined total of six SOURCEOPERANDS and RESULTS per instruction word. In addition, each node instruction includes an extra input that is used when driving the node from the host.

Circuit     <---Nodes in Circuit--->   <------All FAST-1 Instructions------>
            Average      Average      Average     Average     Avg. SourceOp
            Fan-in       Fan-out      Fan-in      Fan-out     words/Inst.
adder4       2.304        2.370        5.172       1.919        1.081
adder8       2.333        2.411        5.190       1.938        1.082
adder16      2.348        2.433        5.199       1.948        1.083
adder32      2.356        2.444        5.204       1.953        1.083
alu          2.613        2.909        5.276       2.093        1.069
cadd24       2.342        2.770        4.755       1.755        1.036
chess        3.294        3.367        5.155       1.972        1.127
fifo         2.570        3.304        5.290       2.238        1.063
filt80       2.157        2.525        5.101       1.975        1.059
fir          2.194        2.305        5.096       1.878        1.055
life         2.705        3.086        5.334       2.164        1.118
mac          2.733        3.047        5.143       2.009        1.095
mul8x8       2.540        2.267        4.834       1.607        1.030
phase        2.543        2.220        4.774       1.554        1.062
psc          2.635        3.347        5.282       2.225        1.074
ram4x4       2.703        2.568        5.216       1.909        1.045
ram8x8       3.206        3.113        5.391       2.078        1.074
ram16x16     3.582        3.526        5.514       2.194        1.061
ram64x64     3.672        3.603        5.471       2.166        1.051
riscb        2.964        3.216        5.235       2.064        1.048
scheme       4.510        4.503        5.104       1.960        1.118
ssc          2.212        2.368        5.099       1.899        1.047
stk4x4       2.558        2.767        5.245       2.039        1.137
stk8x8       2.794        3.184        5.465       2.263        1.170
stk16x16     2.907        3.367        5.549       2.349        1.286
stk32x32     2.957        3.441        5.579       2.379        1.193
synth        2.087        2.060        4.977       1.735        1.021

Mean         2.734        2.908        5.209       2.010        1.084
Variance     0.296        0.318        0.046       0.042        0.002
Std. Dev.    0.544        0.564        0.214       0.204        0.046
95% Conf.    0.205        0.213        0.081       0.077        0.017
Dist.        normal       normal       normal      normal

can be very expensive. Figures V-10 and V-11 show the sizes of transistor groups in terms of the number of transistors in the group. Figures V-12 and V-13 show the sizes in terms of the number of nodes. Though from Figures V-10 and V-12 we see that most of the transistor groups are small, Figures V-11 and V-13 show that most of the transistors and nodes are in large transistor groups. If all transistors are equally likely to change, then, on average, algorithms that explicitly traverse transistor groups are going to have to traverse large groups. For example, in the RISC chip, 41% of the transistors are in transistor groups with 319 or more transistors, and 18% are in a single transistor group with 7639 transistors. Consequently, if a random transistor changes state, there is an 18% chance that a transistor group with 7639 transistors must be traversed and a 41% chance that a group with 319 or more transistors must be traversed!
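The 18% and 41% figures above are instances of a general size-weighted calculation: if a uniformly random transistor changes, the probability of landing in a group of size s is proportional to s, so the expected size of the group that must be traversed is the size-weighted mean group size, which large groups dominate. A sketch:

```cpp
#include <cassert>
#include <vector>

// Expected size of the transistor group containing a uniformly random
// transistor: sum of s^2 over groups, divided by the total transistor count.
// Large groups dominate this size-weighted mean.
inline double expected_group_size(const std::vector<int>& group_sizes) {
    long long total = 0, weighted = 0;
    for (int s : group_sizes) {
        total += s;
        weighted += 1LL * s * s;
    }
    return total ? double(weighted) / double(total) : 0.0;
}
```

For instance, three groups of sizes 1, 1, and 2 give an expected traversed-group size of 1.5, already above the unweighted mean of 4/3; with a 7639-transistor group in the mix, as in the RISC chip, the weighted mean is dominated by that one group.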

Plots similar to these could be made for the number of FAST-1 instructions in transistor groups and would have similar characteristics. However, these plots are not as interesting in the case of the FAST-1 because Algorithm III-4 never

Figure V-2: Node fan-in. The Y axis is the percentage of nodes; the X axis is fan-in, from 1 to 256 on a logarithmic scale. Circuits plotted: alu, cadd24, chess, fifo, mac, psc, ram64x64, riscb, scheme, ssc, and stk32x32.

Figure V-3: Cumulative node fan-in. The Y axis is the cumulative percentage of nodes.

Figure V-4: Instruction fan-in, including both transistor and node instructions.

Figure V-5: Cumulative instruction fan-in.

Figure V-6: The fan-out of nodes.

Figure V-7: The cumulative fan-out of nodes.

Figure V-8: The fan-out of instructions.

Figure V-9: The cumulative fan-out of instructions.

Figure V-10: Percentage of transistor groups with n transistors.

Figure V-11: Cumulative percentage of transistors versus transistor group size.

Figure V-12: Percentage of transistor groups with n nodes.

Figure V-13: Cumulative percentage of nodes versus transistor group size.

explicitly visits all of the nodes or transistors in a transistor group when a transistor in the group has changed state. Thus, the amount of work required to find the steady state of a network varies more dynamically. In the worst case, where all of the transistors in the group have a state of 1 or X, it would be necessary to visit every node in the transistor group.

V.3.4. Representing Bidirectional Transistors Using Two Unidirectional Transistor Instructions

One way to simplify the implementation of a FAST-1 processor is to represent bidirectional transistors using two unidirectional transistor instructions. The cost of this simplification is an increase in the number of instructions needed to represent the circuit, as well as an increase in the fan-out of those nodes that connect to gates of bidirectional transistors. The static effect of representing circuits in this way is shown in Table V-5. Note that, in this data, static depletion mode pullups have been eliminated. Furthermore, other transistors connected directly to Vdd or Ground never need to be implemented using two unidirectional transistor instructions and therefore do not affect these statistics.

Whereas transistor instructions previously accounted for about 60% of all instructions, they now account for almost 70% of instructions, and there are now about 25% more instructions than before. As expected, average fan-in has dropped, since there are no longer transistor instructions with six SOURCEOPERAND fields. Though not shown in the table, the average node fan-out has gone up, as expected. However, the average instruction fan-out has gone down, for two reasons. First, all transistor instructions now have fan-out of one, while previously some had fan-out of two. Second, the ratio of node instructions to total instructions has gone down; thus high fan-out nodes have a decreased effect on average instruction fan-out.
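The rewrite itself is mechanical: a bidirectional transistor between nodes s and d, gated by g, becomes one unidirectional transistor instruction for each direction of conduction. A sketch with invented structures:

```cpp
#include <array>
#include <cassert>
#include <string>

// Sketch of splitting a bidirectional transistor into two unidirectional
// transistor instructions. The Uni structure is invented for illustration.
struct Uni {
    std::string gate, from, to;
};

std::array<Uni, 2> split_bidirectional(const std::string& g,
                                       const std::string& s,
                                       const std::string& d) {
    return {{
        Uni{g, s, d},  // conducts from s toward d
        Uni{g, d, s},  // conducts from d toward s
    }};
}
```

Both replacement instructions read the same gate node, which is why the fan-out of nodes driving gates of bidirectional transistors goes up.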

V.3.5. Using Minimal Machines and Fan-in and Fan-out Trees

Further implementation simplifications are possible. Besides using two unidirectional transistor instructions to represent a bidirectional transistor, instructions can be implemented with limited fan-in and fan-out, and trees of instructions can be used to provide the necessary fan-in and fan-out. An instruction requires, at minimum, two SOURCEOPERANDS3 and the ability to fan out to two destinations. Table V-6 shows the effect on the number of instructions.
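With node instructions limited to two SOURCEOPERANDS, a node of fan-in f is realized as a binary tree of two-input instructions; a full binary tree with f leaves has f - 1 internal nodes, so f - 1 instructions suffice. The counting below is standard binary-tree arithmetic, not a figure taken from Table V-6:

```cpp
#include <cassert>

// Number of two-operand node instructions needed to combine f inputs as a
// binary fan-in tree: a tree with f leaves has f - 1 internal nodes. A node
// of fan-in 0 or 1 still needs its single node instruction.
inline int fanin_tree_instructions(int fan_in) {
    return fan_in <= 1 ? 1 : fan_in - 1;
}
```

So a fan-in-2 node costs one instruction, a fan-in-5 node costs four, and the growth is linear in fan-in, which is consistent with the overall instruction count roughly tripling in the minimal representation.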

It can be seen immediately that whereas previously 1.253 instructions, on average, were used per transistor, now 3.387 instructions are required. Programs certainly grow in size. Since the number of transistor instructions remains con-

3A unidirectional transistor instruction has SOURCEOPERANDS, NEWGATEVALUE, and SOURCEVALUE, while a node instruction needs, at minimum, a two-element SOURCEOPERAND vector in order to build binary fan-in trees.

100 A Data-Driven Multiprocessor for Switch-Level Simulation of VLSI Circuits

Table V-5: The effect of representing bidirectional transistors using two unidirectional transistor instructions. The last two columns are the ratios of the data from Table V-3 to the data in the first two columns in this table. Note that in both tables, static depletion mode pullups have been eliminated. Other transistors connected directly to Vdd or Ground are always represented using a single unidirectional transistor instruction.

Circuit     <------ 2 Unidirectional Transistors ------>   <- Original/Unidirectional ->
            Trans Inst    Instructions  Average  Average   Trans Inst    Instructions
            Instructions  Transistors   Fan-in   Fan-out   Instructions  Transistors
adder4      0.638         1.628         4.472    1.717     0.839         0.779
adder8      0.641         1.630         4.478    1.729     0.839         0.777
adder16     0.643         1.631         4.481    1.735     0.840         0.776
adder32     0.644         1.631         4.482    1.739     0.840         0.775
alu         0.684         1.561         4.509    1.837     0.860         0.766
cadd24      0.701         1.746         4.402    1.658     0.937         0.872
chess       0.732         1.394         4.615    1.806     0.923         0.829
fifo        0.709         1.885         4.457    1.924     0.860         0.747
filt80      0.652         1.901         4.402    1.755     0.845         0.775
fir         0.633         1.684         4.438    1.691     0.844         0.787
life        0.695         1.696         4.519    1.880     0.859         0.756
mac         0.703         1.638         4.514    1.807     0.895         0.800
mul8x8      0.660         1.316         4.523    1.540     0.936         0.891
phase       0.661         1.295         4.523    1.504     0.950         0.910
psc         0.713         1.913         4.469    1.921     0.867         0.752
ram4x4      0.664         1.410         4.573    1.727     0.873         0.800
ram8x8      0.704         1.367         4.653    1.843     0.884         0.782
ram16x16    0.728         1.346         4.702    1.918     0.887         0.769
ram64x64    0.735         1.324         4.708    1.910     0.898         0.780
riscb       0.714         1.557         4.563    1.843     0.894         0.793
scheme      0.796         1.250         4.717    1.841     0.964         0.875
ssc         0.640         1.711         4.437    1.707     0.847         0.787
stk4x4      0.674         1.650         4.508    1.803     0.858         0.773
stk8x8      0.697         1.726         4.543    1.927     0.844         0.734
stk16x16    0.707         1.747         4.559    1.972     0.840         0.721
stk32x32    0.711         1.751         4.566    1.989     0.840         0.717
synth       0.612         1.603         4.421    1.598     0.855         0.813

Mean        0.685         1.592         4.527    1.790     0.875         0.790
Variance    0.002         0.037         0.008    0.016     0.001         0.002
Std. Dev.   0.041         0.192         0.089    0.126     0.038         0.048
95% Conf.   0.016         0.072         0.034    0.047     0.014         0.018
Dist.       normal        normal        normal   normal    normal        normal

stant, this increase is due solely to the node instructions used to build fan-in and fan-out trees. These new node instructions account for more than 50% of the instructions used to represent the circuit on a minimal machine.

V.3.6. The Effect of Finding Unidirectional Transistors and Eliminating One-Input Nodes

In the data presented so far, only when a transistor is connected directly to Vdd or Ground is it represented using a single unidirectional transistor instruction. Using the algorithm described in Chapter III, it is possible to find other transistors that can be implemented using a single unidirectional transistor instruction. The advantage of doing this is that it can reduce the fan-in and fan-out of node instructions as well as reduce the number of instruction executions. In addition, by finding unidirectional transistors, it is possible to find 'one-input' nodes that can be eliminated.
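The dissertation's actual labeling pass appears in Section III.3.6.1; the fragment below is only a loose illustration of the underlying idea, under the simplifying assumption that a transistor may be labeled unidirectional when its channel edge is a bridge of the channel graph — that is, when there is no second path (no cycle) through it, the 'two-path' case being left bidirectional. All names are hypothetical:

```python
from collections import defaultdict

# Loose illustration (NOT the Section III.3.6.1 algorithm): depth-first
# search from the supply nodes; a transistor whose channel edge lies on
# no cycle can only ever be driven from the supply side, so it may be
# labeled unidirectional.

def find_unidirectional(edges, supplies=("Vdd", "Gnd")):
    """edges: dict of transistor name -> (node_a, node_b).
    Returns the names whose channel edge is a bridge reachable from a
    supply node."""
    adj = defaultdict(list)
    for name, (a, b) in edges.items():
        adj[a].append((name, b))
        adj[b].append((name, a))
    index, low, bridges = {}, {}, set()

    def dfs(root):
        index[root] = low[root] = len(index)
        stack = [(root, None, iter(adj[root]))]
        while stack:
            node, via, it = stack[-1]
            advanced = False
            for name, nxt in it:
                if name == via:           # don't reuse the parent edge
                    continue
                if nxt not in index:      # tree edge: descend
                    index[nxt] = low[nxt] = len(index)
                    stack.append((nxt, name, iter(adj[nxt])))
                    advanced = True
                    break
                low[node] = min(low[node], index[nxt])   # back edge
            if not advanced:
                stack.pop()
                if stack:
                    parent = stack[-1][0]
                    low[parent] = min(low[parent], low[node])
                    if low[node] > index[parent]:
                        bridges.add(via)  # no second path: unidirectional

    for s in supplies:
        if s in adj and s not in index:
            dfs(s)
    return bridges

# A series chain from Vdd is fully unidirectional; a loop is not:
assert find_unidirectional({"t1": ("Vdd", "n1"),
                            "t2": ("n1", "n2")}) == {"t1", "t2"}
assert find_unidirectional({"a": ("Vdd", "n1"), "b": ("Vdd", "n2"),
                            "c": ("n1", "n2")}) == set()
```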

The number of transistors that can be marked as unidirectional depends on

Uniprocessor Experiments 101

Table V-6: The number of instructions needed to represent a circuit using instructions with two SOURCEOPERANDS and the ability to fan out to two destinations. The first column is the fraction of node instructions devoted to implementing the fan-in trees. The second column is the fraction of instructions devoted to implementing fan-out trees. The third column is the ratio of node instructions, including those used in fan-in and fan-out trees, to nodes in the circuit. The fourth column is the new ratio of instructions to transistors. The fifth column is the ratio of the data from Table V-3 to the data in this table.

Circuit     Fan-in Instr  Fan-out Instr  Node Instr   Instructions  Original/Minimal
            Instructions  Instructions   Nodes        Transistors   Instr/Transistors
adder4      0.261         0.258          3.978        3.385         0.375
adder8      0.261         0.260          4.033        3.403         0.372
adder16     0.261         0.261          4.062        3.412         0.371
adder32     0.262         0.261          4.076        3.416         0.370
alu         0.245         0.292          4.673        3.370         0.355
cadd24      0.223         0.247          3.967        3.297         0.462
chess       0.284         0.270          5.624        3.123         0.370
fifo        0.210         0.352          5.398        4.300         0.327
filt80      0.223         0.272          3.822        3.765         0.391
fir         0.242         0.233          3.460        3.203         0.414
life        0.246         0.289          4.773        3.646         0.352
mac         0.254         0.271          4.729        3.449         0.380
mul8x8      0.281         0.191          3.628        2.492         0.470
phase       0.288         0.180          3.598        2.435         0.484
psc         0.215         0.341          5.362        4.308         0.334
ram4x4      0.292         0.229          4.230        2.942         0.383
ram8x8      0.303         0.260          5.363        3.133         0.341
ram16x16    0.307         0.280          6.232        3.261         0.317
ram64x64    0.301         0.284          6.313        3.189         0.324
riscb       0.261         0.288          5.245        3.451         0.358
scheme      0.308         0.291          8.311        3.117         0.351
ssc         0.228         0.258          3.630        3.333         0.404
stk4x4      0.253         0.264          4.279        3.412         0.374
stk8x8      0.248         0.298          4.979        3.804         0.333
stk16x16    0.248         0.308          5.282        3.937         0.320
stk32x32    0.249         0.311          5.404        3.978         0.315
synth       0.238         0.209          3.078        2.895         0.450

Mean        0.259         0.269          4.723        3.387         0.374
Variance    0.001         0.002          1.252        0.199         0.002
Std. Dev.   0.028         0.039          1.119        0.446         0.047
95% Conf.   0.011         0.015          0.422        0.168         0.018
Dist.       normal        normal         normal       normal        normal

how hard one tries. The depth-first search algorithm described in Section III.3.6.1 does not try very hard. Even without solving the two-path problem, better heuristics than the ones used here can do a better job of finding unidirectional transistors [Jouppi, 1983]. Nevertheless, it is still interesting to see how well a simpler algorithm works. For example, the depth-first search algorithm is able to label as unidirectional all of the transistors in mul8x8.

The performance of this algorithm is summarized in Table V-7. As shown in the first column, about 44% of the transistors that were previously classified as bidirectional transistors are now classified as unidirectional. As shown in the second column, when these newly labeled unidirectional transistors are combined with transistors connected directly to Vdd and Ground, on average, a total of almost 80% of the transistors can be labeled as unidirectional. Finally,

as shown in the last column, after removing one-input nodes, 6% fewer instructions are required than in the case shown in Table V-3.

Table V-7: The effect of finding unidirectional transistors on fan-in, fan-out, and number of instructions. 'Unidir Pass/Pass Trans' is the fraction of transistors not connected directly to Vdd or Ground that are labeled as unidirectional using the depth-first search algorithm. 'All Unidir/Transistors' is the ratio of all unidirectional transistors to transistors in the original circuit. 'Orig/Unidir' is the ratio of the number of instructions per transistor from Table V-3 to the number of instructions per transistor in this table, where one-input nodes are eliminated. Note that bidirectional transistors are represented using a single transistor instruction.

Circuit     Unidir Pass  All Unidir   1-In Nodes   Average  Average  Instructions  Orig/Unidir
            Pass Trans   Transistors  Total Nodes  Fan-in   Fan-out  Transistors   Instr/Trans
adder4      0.429        0.796        0.261        4.920    1.770    1.115         1.138
adder8      0.429        0.792        0.267        4.936    1.789    1.110         1.141
adder16     0.429        0.791        0.270        4.944    1.799    1.108         1.142
adder32     0.429        0.791        0.271        4.948    1.804    1.107         1.142
alu         0.145        0.610        0.043        5.164    2.022    1.174         1.018
cadd24      0.851        0.966        0.364        4.434    1.577    1.332         1.143
chess       0.176        0.796        0.053        5.065    1.915    1.136         1.018
fifo        0.948        0.921        0.389        4.384    1.701    1.194         1.179
filt80      0.898        0.954        0.204        4.349    1.498    1.338         1.101
fir         0.732        0.902        0.042        4.513    1.492    1.300         1.020
life        0.046        0.588        0.016        5.298    2.142    1.274         1.006
mac         0.185        0.733        0.091        5.039    1.948    1.266         1.035
mul8x8      1.000        0.992        0.246        4.512    1.397    1.062         1.104
phase       0.993        0.998        0.004        4.478    1.357    1.176         1.002
psc         0.775        0.867        0.339        4.593    1.820    1.253         1.148
ram4x4      0.273        0.795        0.162        5.085    1.829    1.051         1.073
ram8x8      0.147        0.746        0.108        5.323    2.039    1.026         1.042
ram16x16    0.072        0.712        0.062        5.481    2.176    1.012         1.023
ram64x64    0.032        0.714        0.020        5.454    2.156    1.026         1.007
riscb       0.233        0.750        0.015        5.057    1.946    1.227         1.006
scheme      0.282        0.883        0.085        5.004    1.898    1.072         1.021
ssc         0.894        0.959        0.054        4.379    1.423    1.313         1.025
stk4x4      0.267        0.725        0.000        5.010    1.882    1.275         1.000
stk8x8      0.129        0.600        0.000        5.325    2.170    1.267         1.000
stk16x16    0.063        0.543        0.000        5.475    2.300    1.259         1.000
stk32x32    0.031        0.520        0.000        5.542    2.354    1.255         1.000
synth       0.912        0.972        0.019        4.352    1.319    1.292         1.009

Mean        0.437        0.793        0.147        4.928    1.834    1.186         1.057
Variance    0.122        0.020        0.016        0.155    0.085    0.011         0.004
Std. Dev.   0.350        0.141        0.127        0.393    0.291    0.104         0.061
95% Conf.   0.132        0.053        0.052        0.148    0.110    0.039         0.023
Dist.       normal       normal       normal       normal   normal   normal        normal
Samples 23

V.4. Dynamic Measurements

Static analysis is interesting because it examines circuit characteristics independent of functionality and, for the most part, independent of a particular simulation algorithm. One particularly useful characteristic is the expected size of a FAST-1 program relative to the number of transistors in a circuit. While some of the static statistics, such as fan-out, may give us a rough idea about how long it will take to simulate a particular circuit, it is clear that the functional properties of a circuit have a much greater effect on simulation time.

In this section I present the results of simulating several circuits using the FAST-1 simulator. My purpose in doing this is two-fold. Primarily, it affords the opportunity to see how changes in the architecture and algorithm affect performance. Unfortunately, too much computer time was required to perform the experiments needed to analyze all of the possible combinations. Therefore I conducted two sets of experiments: ones in which all features known to improve performance are used together, and ones in which only a single feature is used, in order to measure the isolated effect of each parameter.

The other purpose of this data is to provide other researchers with detailed performance data for comparison with other simulators, implemented either in hardware or software. Where possible, I do this comparison myself. However, the lack of other published data makes this difficult.

The data presented in this section requires circuits to be simulated with real test data. It is impossible to simulate all of the circuits analyzed in the previous section, both because I do not know how some of them work, and because many of them are not fully functional chips. The circuits analyzed in this section are all fully functional. The test programs that stimulate the circuits verify that the circuits and the simulation yield the correct results. Thus, the performance measurements are accurate in the sense that they are not based on bugs in either the circuit or the simulator.4

Listed below are the 13 circuits that are analyzed. The numbers in parentheses are the number of random test cycles executed, as explained for each circuit. For all experiments, enough test cycles are executed to ensure that at least one million instructions and two million read-modify-write memory cycles are executed by the FAST-1. For the move generator chip and the SSC filter chip, however, performing reasonable simulations requires significantly longer test sequences than for the other circuits. The circuit descriptions and test programs for many of the chips can be found in Appendices A and B, respectively.

• NMOS Adder: 4 bits (6144), 8 bits (3072), 16 bits (1536). The circuit is a ripple-carry adder using a 3-transistor XOR circuit. The test program generated random values for the A, B, and Carry-in signals, the number of times indicated.

• NMOS Stack: 4x4 (256), 8x8 (32), 16x16 (4). The circuit is a true stack of the width and depth shown. The inner loop of the test program pushes depth number of random values onto the stack and then pops them off. The inner loop is repeated the number of times indicated.

• NMOS Static RAM: 4x4 (4096), 8x8 (2048), 16x16 (1024). The circuit is a

4Of course, the whole purpose of a simulator is to help debug chips, so its performance when simulating buggy chips is certainly of interest. For example, for those chips that are clocked, it might be interesting to compare the simulation time required when the clocks work correctly versus the time required when the clocks are stuck at some value, such as X. I have, however, avoided the temptation to conduct this experiment.

six-transistor static RAM of the width and depth shown. The test program first writes a 0 into all locations. In the inner loop, random data is written into random locations and read to verify that the write worked. The inner loop is executed the number of times indicated above. Following this, the entire memory is read and the contents of each location are verified.

• CMOS Adder: 24 bits (512). A 24-bit CMOS adder that does not use a CMOS XOR gate. The test program is essentially the same as for the NMOS adder except there is no carry-in signal. 512 random adds are performed.

• NMOS Multiplier: 8x8 (512). The test program is essentially the same as for the CMOS adder. 512 random multiplies are performed.

• Move Generator Chip for Chess. The chip generates legal moves for chess [Ebeling and Palay, 1984]. The test program was written by Carl Ebeling and tests a fair portion of the chip. The same test program is used to test the actual chip, except that, in the simulations, internal nodes are written directly instead of loading data via shifting as is done when actually using the chip.

• SSC Filter Chip. The chip is a digital filter chip that is a commercial product of Silicon Solutions Corp. The test program applies the first 500 elements of a 5000-element test vector used by SSC both in their own simulations and in testing the actual chip. Independent of the experiments presented herein, I verified that the FAST-1 correctly simulates the chip using all 5000 elements of the test vector.

The experiments performed can be grouped into the six categories listed below.

• Base case. The measurements in this section provide a metric against which the other configurations can be compared. A FAST-1 with the following features is used:

o A separate SOURCEOPERAND memory, with six SOURCEOPERANDS per word. An instruction can have arbitrary fan-in.

o A separate DESTINATION memory. An instruction can have arbitrary fan-out.

o The static unidirectional transistor labeling and one-input node elimination optimizations are implemented.

o The dynamic node, store, and transistor optimizations are implemented.

• Representation. The measurements in this section show the effect on performance of different representations of a circuit, for example, the effect of using fixed fan-in and fan-out instructions.

• Optimizations. The measurements in this section show the effect on performance of disabling the static and dynamic optimizations.

• Control. The measurements in this section explore the effect of different methods for keeping track of executable instructions.

• Parallelism. By using a FIFO to keep track of executable instructions, it is possible to determine sets of instructions that can be executed in parallel. Using this technique, the experiments in this section are able to provide upper bounds on parallelism in simulation.

• Architecture. Using the same technique for finding sets of instructions that can be executed in parallel, it is possible to predict the performance of simulators that are not event-driven. The experiments in this section predict the performance of Algorithm III-4 when implemented on two different architectures that are not event-driven.
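The FIFO levelling technique mentioned above can be sketched as follows. This is a simplified model (in particular, it executes each instruction at most once, which a real simulation does not), and all names are illustrative:

```python
from collections import deque

def measure_parallelism(initial, successors):
    """initial: the initially executable instruction ids.
    successors(i): ids made executable by executing instruction i.
    Everything queued ahead of the end-of-level boundary could have run
    in parallel, so the number of levels bounds the parallel simulation
    steps, while the total count is the uniprocessor work."""
    queue = deque(initial)
    seen = set(queue)
    total = steps = 0
    while queue:
        steps += 1                        # one parallel step per level
        for _ in range(len(queue)):       # drain exactly this level
            inst = queue.popleft()
            total += 1
            for nxt in successors(inst):
                if nxt not in seen:       # simplification: run once only
                    seen.add(nxt)
                    queue.append(nxt)
    return total, steps

# Diamond dependency 0 -> {1, 2} -> 3: four executions in three steps,
# so the optimistic parallelism bound is 4/3.
succ = {0: [1, 2], 1: [3], 2: [3], 3: []}
assert measure_parallelism([0], succ.__getitem__) == (4, 3)
```

The ratio of total executions to levels is the optimistic upper bound on parallelism quoted in the summary below (e.g. 192 to 1 for the SSC filter chip).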

The following points summarize the results of the experiments. The details are explained in the remainder of this chapter. Appendix C.2 contains a sample of the raw data from which these summaries are formed.

• Fetching and executing instructions accounts for about 38% of the RMW cycles of a simulation, while storing results accounts for 50% of the RMW cycles. The remaining 12% of the cycles are used for extra SOURCEOPERAND fetches and for advancing to the next phase of the simulation algorithm.

• A minimal machine in which node instructions are limited to only two SOURCEOPERANDS and two DESTINATIONS, thereby necessitating the use of fan-in and fan-out trees, runs about half as fast as the base case machine, in which each instruction can have an arbitrary number of SOURCEOPERANDS and DESTINATIONS, which are stored in separate memories.

• Using a stack instead of a queue for keeping track of executable instructions yields marginally better performance on average.

• Optimizations are important. The effect of not finding unidirectional transistors, other than pullups and pulldowns, is to increase the run time by about 27% over the simulations where unidirectional transistors are found. When no dynamic optimizations are used, the simulator runs about half as fast as in the base case where dynamic optimizations are used. When only the node optimization is used, simulations run about 9% slower than in the base case. The transistor optimization provides only a 4% improvement over using no dynamic optimizations, while the store optimization provides an 8% improvement over using no dynamic optimizations.

• The amount of parallelism in circuits is highly variable. An optimistic upper bound on potential parallelism is the ratio between instruction executions and parallel simulation steps. In the best case, the SSC filter chip, this ratio is 192 to 1. Using a more reasonable metric, which measures the parallelism available in a fixed fan-in/fan-out representation of the circuit, the ratio of uniprocessor simulation time to parallel simulation time is 85 to 1 for this same chip. The average ratios for all of the chips are 49 to 1 and 29 to 1, respectively.
• On uniprocessors, event-driven simulation is about a factor of 10 faster than non-event-driven simulation, assuming that Algorithm III-4 forms the basis of both implementations.

• A multiprocessor in which instructions are executed bit-serially runs 30%

slower than a uniprocessor FAST-1, on average, and 2.5 times faster in the best case measured. In the bit-serial multiprocessor measured, each instruction is assumed to reside on its own processor, and all instructions are executed in parallel. Each instruction execution cycle is modeled as requiring 92 clock cycles, which equals the sum of the number of bits in an instruction, the number of bits in a RESULT, and the number of bits in a DESTINATION. Results are sent bit-serially between processors. The interconnect is assumed to be contention-free; thus, an actual implementation of a bit-serial multiprocessor would likely have worse performance.

V.4.1. The Base Case

The standard against which all other experiments are measured is presented in this section. The FAST-1 implementation simulated assumes that there are separate memories for SOURCEOPERANDS, DESTINATIONS, and the remainder of the instructions. Each SOURCEOPERAND word can contain up to a combined total of six SOURCEOPERANDS and RESULTS.5 Bidirectional transistors are represented using a single instruction. A stack is used for keeping track of instructions, and all static and dynamic optimizations are used.

Table V-8 summarizes the performance of the FAST-1 simulating the circuits described above. As mentioned earlier, simulation performance is measured in terms of the number of RMW cycles used. There are five contributions to this number:

• Each instruction execution requires 1 RMW cycle.

• Each fetch of up to six additional SOURCEOPERANDS requires 1 RMW cycle.

• Each store of a RESULT requires 1 RMW cycle.

• Each update of PresentPhase is modeled as costing 1 RMW cycle. Using three transistor strengths, Strong, Weak, and Z, this means that PresentPhase must be updated five times per unit-delay cycle. However, in all of the simulations, instructions were never executed during phase 2Weak, so only four updates per unit-delay cycle were required.

• The reading or writing of a node by a test program using Get or Set is modeled as costing 1 RMW cycle.
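Since every contribution costs exactly one RMW cycle, the totals in Table V-8 are simple sums of the five event counts. The small check below uses the adder4 row of Table V-8:

```python
def rmw_cycles(executions, result_stores, extra_source_fetches,
               phase_updates, signal_set_gets):
    """Total RMW cycles: each of the five event kinds above is modeled
    as costing exactly one read-modify-write memory cycle."""
    return (executions + result_stores + extra_source_fetches +
            phase_updates + signal_set_gets)

# adder4 row of Table V-8: the five contributions sum to the RMW total.
assert rmw_cycles(1312857, 1677337, 198468, 198528, 86016) == 3473206
```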

A more reasonable way of viewing the numbers in Table V-8 is in terms of fractional contributions to the total execution time, as shown in Table V-9. From this data we see that on average 38% of the RMW cycles are used for instruction fetches and 50% for storing RESULTS. Extra source fetches account for only 8% of the execution time. Hence, it does not seem reasonable to build a machine with more than six SOURCEOPERANDS per instruction. Finally, in small circuits, the cost of updating PresentPhase and setting and getting node values is more significant than in large circuits, though in all instances this updating cost has a minor effect on execution time.

5The reason for having six SOURCEOPERANDS per word is that a bidirectional transistor instruction requires four SOURCEOPERANDS and two RESULTS.

Table V-8: Base case dynamic simulation data.

Circuit     Instructions  RMW       = Instructions + Destination + Source  + PresentPhase + Signal
            in circuit    Cycles      Executed       Stores        Fetches   Updates        Set/Gets
adder4      87            3473206     1312857        1677337       198468    198528         86016
adder8      171           3929078     1504682        1964268       238484    141772         79872
adder16     339           4352975     1682119        2222790       273110    98156          76800
cadd24      2827          2747671     1159382        1408831       113022    29524          36912
chess       17502         13562413    5449366        6539055       1502197   33484          38311
mul8x8      1470          3263933     1537897        1567639       92777     49236          16384
ram4x4      164           3534312     1210397        1612598       274629    362884         73804
ram8x8      517           4334034     1218854        2252208       610271    195128         57573
ram16x16    1798          5810364     1241060        3519528       902538    99392          47846
ssc         26560         21438255    10503470       10741282      73022     64972          55509
stk4x4      204           2658449     1058359        1214033       220145    141328         24584
stk8x8      684           2738428     1082471        1312888       297729    36112          9228
stk16x16    2604          2799596     1097577        1356092       332931    9136           3860

Table V-9: The fractional contributions to execution time.

Circuit     Instruction  Destination  Source   PresentPhase  Signal
            Execution    Stores       Fetches  Updates       Set/Gets
adder4      0.378        0.483        0.057    0.057         0.025
adder8      0.383        0.500        0.061    0.036         0.020
adder16     0.386        0.511        0.063    0.023         0.018
cadd24      0.422        0.513        0.041    0.011         0.013
chess       0.402        0.482        0.111    0.002         0.003
mul8x8      0.471        0.480        0.028    0.015         0.005
ram4x4      0.342        0.456        0.078    0.103         0.021
ram8x8      0.281        0.520        0.141    0.045         0.013
ram16x16    0.214        0.606        0.155    0.017         0.008
ssc         0.490        0.501        0.003    0.003         0.003
stk4x4      0.398        0.457        0.083    0.053         0.009
stk8x8      0.395        0.479        0.109    0.013         0.003
stk16x16    0.392        0.484        0.119    0.003         0.001

Mean        0.381        0.498        0.081    0.029         0.011
Variance    0.005        0.001        0.002    0.001         0.000
Std. Dev.   0.072        0.038        0.045    0.029         0.008
95% Conf.   0.039        0.021        0.024    0.016         0.004
Dist.       normal       normal       normal   normal        normal

Table V-10: The ratio of transistor executions to node executions.

Circuit     Transistors  Nodes      Transistors Executed
            Executed     Executed   Nodes Executed
adder4      679156       633701     1.072
adder8      787905       716777     1.099
adder16     887941       794178     1.118
cadd24      745155       414227     1.799
chess       2773444      2675922    1.036
mul8x8      943918       593979     1.589
ram4x4      542274       668123     0.812
ram8x8      560553       658301     0.852
ram16x16    583915       657145     0.889
ssc         5288969      5214501    1.014
stk4x4      478533       579826     0.825
stk8x8      482211       600260     0.803
stk16x16    489657       607920     0.805

                         Ratio      Ratio (w/o cadd24 and mul8x8)
Mean                     1.055      0.939
Variance                 0.096      0.017
Std. Dev.                0.310      0.129
95% Conf.                0.169      0.076
Dist.                    normal     normal

As shown in Table V-10, even though 60% of the static instructions are transistor instructions, transistor and node instructions are executed equally often.

This is to be expected, since, nominally, when a transistor instruction is executed, it stores a result which, in turn, causes a node instruction to be executed. Moreover, the store and transistor optimizations tend to decrease the number of transistor executions after which no RESULTS are stored.

Table V-11 shows the CPU time used by the FAST-1 simulator running on a Vax 11/780, compared to the expected real time required to simulate the circuits using a FAST-1 processor that has a 500ns RMW cycle time. A FAST-1 with this cycle time can be built using off-the-shelf 256K dynamic RAMs. Similarly, Table V-12 compares the CPU time used by MossimII [Bryant, 1984] running on a Vax 11/780 to the real time used by a FAST-1. Two sets of figures are presented for MossimII. For the first set, MossimII's 'redundant' optimization is used, while for the second set the redundant optimization is not used. According to Bryant, the redundant optimization allows MossimII to avoid unnecessarily perturbing nodes, if such perturbations simply result in the same steady state being calculated. Implementing the optimization requires updating several data structures in MossimII, and while easy to do in software, this is difficult to do in hardware. Hence, it is an interesting instance of being able to take advantage of the flexibility of general-purpose computers [Bryant, 1985].
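The hardware times in Table V-11 follow directly from the RMW-cycle counts of Table V-8 scaled by the assumed 500ns cycle time, as this small check (using the adder4 row) confirms:

```python
# Hardware time = RMW cycles x 500 ns; adder4 figures taken from
# Tables V-8 and V-11.
rmw = 3473206                      # adder4 RMW cycles (Table V-8)
hardware_seconds = rmw * 500e-9    # 500 ns per read-modify-write cycle
assert abs(hardware_seconds - 1.737) < 1e-3        # Table V-11: 1.737 s

software_seconds = 1016.17         # Vax 11/780 CPU time (Table V-11)
assert round(software_seconds / hardware_seconds) == 585   # ~585x speedup
```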

In thinking about the figures in these tables, remember that it is implicitly assumed that whatever is driving the hardware implementation of the FAST-1 is able to keep up, and that the total cycles in the 'PresentPhase' and 'Set/Get' columns in the above tables reasonably model the cost of interacting with a host. Also remember that the software implementation of the FAST-1 is not the optimal way to implement Algorithm III-4 in software, as the software implementation is designed primarily for ease of experimentation, and not for high performance. Finally, the simulation times for MossimII have been corrected to remove the time required to read and parse the test vector. It turns out that this is a significant fraction of the total CPU time used by MossimII. The numbers presented in the table include only the time used to simulate the circuit and not the time for doing I/O or parsing.

V.4.2. The Effect of Changing the Representation of Circuits

There are several different ways of representing transistors: a single bidirectional transistor instruction with two outputs, two unidirectional transistor instructions, or a single instruction with one output. Furthermore, a FAST-1 processor can be implemented such that instructions have either arbitrary or fixed fan-in and fan-out. The following sections explore how these different implementations affect performance.

V.4.2.1. The Effect of Different Representations of Transistors

The data in Table V-7 indicates that only about 20% of the transistors are bidirectional when unidirectional transistors are found. If each bidirectional transistor is represented using two unidirectional transistor instructions instead of a single bidirectional transistor instruction, about 12% more instructions are required to represent a circuit. As shown in Table V-13, using two unidirec-

Table V-11: Seconds of time used by software and hardware implementations of the FAST-1. The software implementation is running on a Vax 11/780, while the hardware implementation assumes a 500ns read-modify-write memory cycle time.

Circuit     Fast-1 Software  Fast-1 Hardware  Fast-1 software
            Vax 11/780       500ns RMW time   Fast-1 hardware
adder4      1016.17          1.737            585.014
adder8      1109.15          1.965            564.453
adder16     1212.83          2.176            557.367
cadd24      791.55           1.374            576.092
chess       3474.10          6.781            512.329
mul8x8      1000.03          1.632            612.763
ram4x4      1115.63          1.767            631.370
ram8x8      1121.28          2.167            517.434
ram16x16    1290.95          2.905            444.389
ssc         6001.65          10.719           559.908
stk4x4      779.12           1.329            586.245
stk8x8      752.60           1.369            549.744
stk16x16    745.63           1.400            532.593

Mean                                          556.131
Variance                                      2299.153
Std. Dev.                                     47.949
95% Conf.                                     26.066
Dist.                                         normal

Table V-12: Seconds of time used by MossimII. MossimII was run on a Vax 11/780. Its times do not include the time to read and parse the test vectors. In the left set of figures Mossim's redundant optimization is turned on, and in the right set of figures the redundant optimization is turned off. Because of the complexity of the test program for the chess chip, there are no measurements of using MossimII to simulate this chip.

            <------ Redundant on ------>   <------ Redundant off ----->
Circuit     MossimII     MossimII          MossimII     MossimII
            Vax 11/780   Fast-1 hardware   Vax 11/780   Fast-1 hardware
adder4      1020.3       587.392           1101.9       634.370
adder8      1159.2       589.924           1243.0       632.570
adder16     1362.5       626.149           1417.1       651.241
cadd24      968.0        704.512           967.9        704.440
chess
mul8x8      868.5        532.169           1044.6       640.074
ram4x4      710.6        402.151           1335.1       755.574
ram8x8      703.0        324.412           1355.8       625.658
ram16x16    672.8        231.601           1432.1       492.978
ssc         1201.1       112.053           8582.8       800.709
stk4x4      359.8        270.730           906.2        681.866
stk8x8      331.1        241.855           897.9        655.880
stk16x16    386.7        276.214           868.5        620.357

Mean                     408.3                          658.0
Variance                 36885.5                        5807.7
Std. Dev.                192.1                          76.2
95% Conf.                108.7                          43.1
Dist.                    normal                         normal
Samples                  12                             12

tional transistor instructions to represent a single bidirectional transistor causes only a 6% decrease in performance. Note that the data in this table assumes that there are still six SOURCEOPERANDS in each SOURCEOPERAND word. One is tempted, however, to decrease this to four, since this is the number required by a unidirectional transistor instruction.

Table V-13: The effect of using two unidirectional transistor instructions to represent each bidirectional transistor. All instructions still have six SOURCEOPERANDS per instruction.

            <------- 2 Unidirectional Transistors ------->   <--- Base/2 Unidirectional Transistors --->
Circuit     RMW       Instructions  Destination  Source      RMW     Instructions  Destination  Source
            Cycles    Executed      Stores       Fetches     Cycles  Executed      Stores       Fetches
adder4      3804189   1468703       1852474      198468      0.913   0.894         0.905        1.000
adder8      4321275   1686929       2174220      238482      0.909   0.892         0.903        1.000
adder16     4799748   1888297       2463391      273104      0.907   0.891         0.902        1.000
cadd24      2776487   1172737       1424286      113028      0.990   0.989         0.989        1.000
chess       14727764  6133999       7377245      1138665     0.921   0.888         0.886        1.319
mul8x8      3263933   1537897       1567639      92777       1.000   1.000         1.000        1.000
ram4x4      3796744   1341645       1743814      274597      0.931   0.902         0.925        1.000
ram8x8      4597266   1350662       2383888      610015      0.943   0.902         0.945        1.000
ram16x16    6076476   1375140       3653096      901002      0.956   0.902         0.963        1.002
ssc         21482392  10529080      10758093     73174       0.998   0.998         0.998        0.998
stk4x4      2841243   1169505       1316093      189233      0.936   0.905         0.922        1.163
stk8x8      2955884   1217849       1438648      254047      0.926   0.889         0.913        1.172
stk16x16    3047724   1251941       1495996      286791      0.919   0.877         0.906        1.161

Mean                                                         0.942   0.918         0.935        1.063
Variance                                                     0.001   0.002         0.002        0.011
Std. Dev.                                                    0.034   0.045         0.040        0.105
95% Conf.                                                    0.018   0.025         0.022        0.057
Dist.                                                        normal  normal        normal       normal

Alternatively, while still using a single instruction to represent a bidirectional transistor, it is possible for this instruction to have a single RESULT with two DESTINATIONS rather than two RESULTS, each with one DESTINATION. The advantage of the single-RESULT version is that it simplifies the implementation of the hardware. As shown in Table V-14, these two implementations of a bidirectional transistor perform equally well. Although the data here is again based on six SOURCEOPERANDS or RESULTS per SOURCEOPERAND word, when using a one-RESULT bidirectional transistor instruction, five SOURCEOPERANDS per word can be used instead.

V.4.2.2. Using a Minimal Fast-1 with Fixed Fan-in and Fan-out

Even simpler FAST-1 implementations are possible if each instruction has fixed fan-in and fixed fan-out. In such a minimal configuration, each node instruction has two SOURCEOPERANDS and two DESTINATIONS, and each bidirectional transistor is represented by two unidirectional transistor instructions. This configuration allows the use of binary fan-in and fan-out trees. The data presented in Table V-15 shows the impact of using this version of a FAST-1. Note that all other static and dynamic optimizations are still being used. The performance of an even more minimal FAST-1, in which no dynamic optimizations are used, is discussed in Section V.4.3.

The essential result from this data is that while a minimal FAST-1 requires 2.7 times more instructions to represent a circuit than in the base case, a minimal machine runs only about 2.1 times more slowly. Another interesting comparison between the base case and the minimal machine is how dynamic fan-out is affected. As shown in Table V-16, the average dynamic fan-out of the base case is only 1.2 times that of the minimal case, indicating that large fan-out instructions do not appear to dominate the performance of the machine.

Table V-14: The effect of using single-RESULT bidirectional transistor instructions. All instructions still have six SOURCE OPERANDS per instruction.

(Columns: RMW Cycles, Instructions Executed, Destination Stores, Source Fetches; first as raw counts for the single-RESULT configuration, then as Base/Single-RESULT ratios of the same four quantities.)

Circuit     RMW Cycles  Instr. Executed  Dest. Stores  Src. Fetches   RMW     Instr.  Dest.   Src.
adder4      3481817     1387455          1610390       199428         0.998   0.946   1.042   0.995
adder8      3857722     1559074          1848035       228969         1.018   0.965   1.063   1.042
adder16     4226469     1725650          2069344       256519         1.030   0.975   1.074   1.065
cadd24      2659227     1114908          1360528       117355         1.033   1.040   1.036   0.963
chess       15683860    6424387          7686595       1501083        0.865   0.848   0.851   1.001
mul8x8      3263933     1537897          1567639       92777          1.000   1.000   1.000   1.000
ram4x4      3658645     1314730          1660148       247079         0.966   0.921   0.971   1.112
ram8x8      4428453     1322197          2297180       556375         0.979   0.922   0.980   1.097
ram16x16    5871588     1345684          3557500       821166         0.990   0.922   0.989   1.099
ssc         21473480    10520573         10758105      74321          0.998   0.998   0.998   0.983
stk4x4      2564369     1045029          1178443       174985         1.037   1.013   1.030   1.258
stk8x8      2624550     1069937          1275498       233775         1.043   1.012   1.029   1.274
stk16x16    2736505     1109556          1344438       269515         1.023   0.989   1.009   1.235

Mean        0.998   0.965   1.006   1.086
Variance    0.002   0.003   0.003   0.012
Std. Dev.   0.047   0.052   0.056   0.108
95% Conf.   0.025   0.028   0.030   0.059
Dist.       normal  normal  normal  normal

Table V-15: The performance of a minimal FAST-1. Each node instruction has two SOURCE OPERANDS and two DESTINATIONS. Each bidirectional transistor is represented by two unidirectional transistor instructions.

(Columns: RMW Cycles, Instructions Executed, Destination Stores, Source Fetches for the minimal machine; followed by the Base/Minimal ratios of RMW Cycles, Instructions Executed, and Destination Stores.)

Circuit     RMW Cycles  Instr. Executed  Dest. Stores  Src. Fetches   RMW     Instr.  Dest.
adder4      7350415     3347437          3718434       0              0.473   0.392   0.451
adder8      8300133     3803648          4274841       0              0.473   0.396   0.459
adder16     9166095     4218755          4772384       0              0.475   0.399   0.466
cadd24      4551823     2087119          2398264       0              0.604   0.555   0.587
chess       31944832    14735330         17082609      0              0.425   0.370   0.383
mul8x8      4981497     2449651          2466222       0              0.655   0.628   0.636
ram4x4      6921834     2942314          3542836       0              0.511   0.411   0.455
ram8x8      10689089    4438261          5998131       0              0.405   0.275   0.375
ram16x16    16764859    6575771          10041854      0              0.347   0.189   0.350
ssc         52252746    23542216         28552629      0              0.410   0.446   0.376
stk4x4      4989460     2284756          2538792       0              0.533   0.463   0.478
stk8x8      5574383     2574180          2954863       0              0.491   0.421   0.444
stk16x16    5886068     2722767          3150305       0              0.476   0.403   0.430

Mean        0.483   0.411   0.453
Variance    0.007   0.012   0.007
Std. Dev.   0.082   0.109   0.082
95% Conf.   0.045   0.059   0.045
Dist.       normal  normal  normal

It seems that dynamically, just as statically, large fan-out nodes are not a big concern.

V.4.2.3. Other Effects

In the base case described above, the only nodes given non-Z capacitance are those that are connected to gates of transistors or those that appear to be precharged. Alternatively, all nodes can be given a non-Z capacitive strength and are, therefore, able to store some amount of charge. The former approach is useful in verifying that only the gates of transistors, and nodes known to be

Table V-16: Average dynamic fan-out.

Circuit     Base Case  Minimal Machine
            Fan-Out    Fan-Out
adder4      1.28       1.13
adder8      1.31       1.13
adder16     1.32       1.12
cadd24      1.22       1.14
chess       1.20       1.17
mul8x8      1.02       0.99
ram4x4      1.33       1.13
ram8x8      1.85       1.22
ram16x16    2.84       1.30
ssc         1.02       1.30
stk4x4      1.15       1.07
stk8x8      1.21       1.06
stk16x16    1.24       1.05

Mean        1.384   1.139
Variance    0.233   0.008
Std. Dev.   0.482   0.092
95% Conf.   0.262   0.050
Dist.       normal

precharged, are used as storage nodes. The latter approach is useful for verifying that there are no charge-sharing problems. In any event, an interesting question to consider is whether one approach has better performance than the other. As shown in Table V-17, there is either no effect, or else the base case performs slightly better. There does not appear to be any correlation between the percentage of nodes originally given non-Z capacitance and the performance when all nodes are given non-Z capacitance. Note that in this experiment unidirectional transistors are used. It would be interesting to perform the same experiment using only bidirectional transistors. One would expect the results to be similar.

Table V-17: The effect of having all nodes be capacitive. In the base case, only nodes connected directly to the gates of transistors, or those deemed to be precharged, have non-Z capacitance. In the 'All nodes capacitive' case, all nodes have at least weak capacitance.

(Columns: RMW Cycles, Instructions Executed, Destination Stores, Source Fetches; first as raw counts for the all-nodes-capacitive configuration, then as Base/All-nodes-capacitive ratios of the same four quantities.)

Circuit     RMW Cycles  Instr. Executed  Dest. Stores  Src. Fetches   RMW     Instr.  Dest.   Src.
adder4      3473206     1312857          1677337       198468         1.000   1.000   1.000   1.000
adder8      3929078     1504682          1964268       238484         1.000   1.000   1.000   1.000
adder16     4352975     1682119          2222790       273110         1.000   1.000   1.000   1.000
cadd24      2807366     1184526          1437902       118502         0.979   0.979   0.980   0.954
chess       13616619    5476037          6563514       1503825        0.996   0.995   0.996   0.999
mul8x8      3279391     1545189          1575805       92777          0.995   0.995   0.995   1.000
ram4x4      3534312     1210397          1612598       274629         1.000   1.000   1.000   1.000
ram8x8      4334034     1218854          2252208       610271         1.000   1.000   1.000   1.000
ram16x16    5810364     1241060          3519528       902538         1.000   1.000   1.000   1.000
ssc         21453039    10508330         10747927      76301          0.999   1.000   0.999   0.957
stk4x4      2658449     1058359          1214033       220145         1.000   1.000   1.000   1.000
stk8x8      2738428     1082471          1312888       297729         1.000   1.000   1.000   1.000
stk16x16    2799596     1097577          1356092       332931         1.000   1.000   1.000   1.000

Mean        0.998   0.998   0.998   0.993
Variance    0.000   0.000   0.000   0.000
Std. Dev.   0.006   0.006   0.006   0.017
95% Conf.   0.003   0.003   0.003   0.009
Dist.

There are also two slightly different ways of initializing circuits. When a circuit is loaded into the FAST-1, the standard initialization procedure is to mark every instruction as executable. It is also possible to mark as executable only those instructions whose outputs will actually change when they are executed. At first glance it seems that the latter method should perform better but, as shown in Table V-18, either it doesn't matter or else initially marking all instructions as executable performs a little better. The reason for this is that when all instructions are marked as executable, it is more likely that an instruction is already on the stack of executable instructions, so that storing into it does not cause an additional instruction execution. On the other hand, it may be that when using a FIFO queue for keeping track of executable instructions, initializing instructions only as needed will work better. Nevertheless, the bottom line appears to be that when averaged over a reasonably long simulation run, it does not matter.
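The two initialization strategies can be sketched as follows. This is a schematic model with invented names; in particular, `would_change` stands in for evaluating an instruction and comparing the computed output against the currently stored one.

```python
# Sketch of the two circuit-initialization strategies discussed above.

def initialize(instructions, mode, would_change):
    """Return the initial collection of executable instructions.

    mode 'all'       -- the standard procedure: mark every instruction.
    mode 'as_needed' -- mark only instructions whose outputs would change.
    """
    if mode == 'all':
        return list(instructions)
    return [i for i in instructions if would_change(i)]

insns = ['i0', 'i1', 'i2', 'i3']
everything = initialize(insns, 'all', lambda i: False)
needed = initialize(insns, 'as_needed', lambda i: i in ('i1', 'i3'))
```

The measured result is that, over a long run, the two starting sets converge to essentially the same total work.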

Table V-18: The effect of initializing instructions only as needed.

(Columns: RMW Cycles, Instructions Executed, Destination Stores, Source Fetches; first as raw counts for the init-as-needed configuration, then as Base/Init-as-needed ratios of the same four quantities.)

Circuit     RMW Cycles  Instr. Executed  Dest. Stores  Src. Fetches   RMW     Instr.  Dest.   Src.
adder4      3473866     1313149          1677644       198525         1.000   1.000   1.000   1.000
adder8      3929828     1505054          1964567       238559         1.000   1.000   1.000   1.000
adder16     4352913     1682292          2222497       273164         1.000   1.000   1.000   1.000
cadd24      2748539     1159353          1409678       113068         1.000   1.000   0.999   1.000
chess       13615216    5470365          6559597       1513459        0.996   0.996   0.997   0.993
mul8x8      3.60762     1539154          1568100       92880          0.999   0.999   1.000   0.999
ram4x4      3534484     1210473          1612662       274661         1.000   1.000   1.000   1.000
ram8x8      4334895     1219203          2252464       610527         1.000   1.000   1.000   1.000
ram16x16    5814398     1242534          3520552       904074         0.999   0.999   1.000   0.998
ssc         21508346    10542112         10770624      75129          0.997   0.996   0.997   0.972
stk4x4      2659303     1058727          1214425       220235         1.000   1.000   1.000   1.000
stk8x8      2742196     108.1051         1314586       298215         0.999   0.999   0.999   0.998
stk16x16    2815702     1104261          1363272       335169         0.994   0.994   0.995   0.993

Mean        0.999   0.999   0.999   0.996
Variance    0.000   0.000   0.000   0.000
Std. Dev.   0.002   0.002   0.002   0.008
95% Conf.   0.001   0.001   0.001   0.004
Dist.

V.4.3. The Effect of Optimizations

The base case data presented above used both static and dynamic optimizations. In this section, I consider the importance of these optimizations. Note that the optimizations are not completely independent, so the sum of the effects of each optimization used alone is greater than their combined effect.

When compiling circuits we have the choice of using unidirectional transistor instructions for only those transistors connected directly to Vdd or Ground, or else trying to find additional transistors that are unidirectional as well. This is simply a tradeoff between compile time and simulation time. As shown in Table V-19, the effect of not finding additional unidirectional transistors is an average increase in execution time of almost 30%. This is somewhat surprising, as the data from Table V-7 indicates that finding unidirectional transistors, above and beyond pullups and pulldowns, increases the number of unidirectional transistors from 66% to 79%. The static statistic is misleading, however, in that most pullups do not generate instructions; thus, the actual increase in unidirectional transistor instructions is much larger. Furthermore, when additional unidirectional transistors are found, it is possible to eliminate almost 15% of the nodes, as they have only one input.
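The cheap compile-time rule can be sketched as a predicate. This is a deliberate simplification with invented names: the dissertation's compiler also infers additional unidirectional transistors by further analysis, which this sketch does not attempt.

```python
# Sketch of the cheap compile-time rule: a transistor with one terminal
# tied directly to Vdd or Ground (a pullup or pulldown) can only drive
# its other terminal, so it can safely be treated as unidirectional.

POWER_RAILS = {'Vdd', 'Ground'}

def trivially_unidirectional(transistor):
    source, drain = transistor['source'], transistor['drain']
    return source in POWER_RAILS or drain in POWER_RAILS

pullup = {'source': 'Vdd', 'drain': 'out'}      # caught by the cheap rule
pass_gate = {'source': 'a', 'drain': 'b'}       # needs the deeper analysis
```

A pass transistor between two internal nodes fails this test even when a deeper analysis would show that signals only ever flow through it one way; that gap is exactly the 66%-to-79% difference discussed above.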

Table V-19: The effect of not finding unidirectional transistors other than those connected directly to Vdd or Ground.

(Columns: RMW Cycles, Instructions Executed, Destination Stores, Source Fetches; first as raw counts for the not-found configuration, then as Base/Not-Found ratios of the same four quantities.)

Circuit     RMW Cycles  Instr. Executed  Dest. Stores  Src. Fetches   RMW     Instr.  Dest.   Src.
adder4      4422889     1704117          2181714       252514         0.785   0.770   0.769   0.786
adder8      5042659     1966259          2553833       300923         0.779   0.765   0.769   0.793
adder16     5615580     2207066          2890386       343172         0.775   0.762   0.769   0.796
cadd24      4093726     1776237          2110560       140493         0.671   0.653   0.668   0.804
chess       14749655    5964035          7187602       1525603        0.920   0.914   0.910   0.985
mul8x8      4981316     2315309          2507610       92777          0.655   0.664   0.625   1.000
ram4x4      4550725     1646084          2153850       314095         0.777   0.735   0.749   0.874
ram8x8      5273926     1606980          2725702       688535         0.822   0.758   0.826   0.886
ram16x16    6675476     1588309          3918264       1021657        0.870   0.781   0.898   0.883
ssc         55466086    25864069         27458123      2022045        0.387   0.406   0.391   0.036
stk4x4      3112243     1206667          1423885       315779         0.854   0.877   0.853   0.697
stk8x8      2966009     1156605          1418110       345954         0.923   0.936   0.926   0.861
stk16x16    2917201     1135917          1411168       357120         0.960   0.966   0.961   0.932

Mean        0.783   0.768   0.778   0.795
Variance    0.022   0.021   0.023   0.059
Std. Dev.   0.150   0.146   0.153   0.243
95% Conf.   0.081   0.080   0.083   0.132
Dist.       normal  normal  normal

Three dynamic optimizations are analyzed here: the node optimization, the store optimization, and the transistor optimization. As shown in Table V-20, the effect of using no dynamic optimizations is to reduce performance by 50%. The data in Table V-21 demonstrates that this same reduction in performance occurs when using a minimal FAST-1 without the benefit of dynamic optimizations. In other words, the effect of using optimizations appears to be independent of whether we are using the base-case or the minimal FAST-1. This can be seen by multiplying the mean RMW cycle ratio (0.483) from Table V-15 by the mean RMW cycle ratio (0.495) from Table V-20. The product (0.239) is remarkably close to the mean RMW cycle ratio (0.265) in Table V-21.
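The independence argument amounts to multiplying the two measured ratios and comparing the product against the directly measured combined ratio:

```python
# Check from the text: if the minimal-machine slowdown and the
# no-optimization slowdown were fully independent, their product would
# equal the measured combined ratio.
minimal_ratio = 0.483        # mean RMW cycle ratio, Table V-15
no_opt_ratio = 0.495         # mean RMW cycle ratio, Table V-20
combined_measured = 0.265    # mean RMW cycle ratio, Table V-21

predicted = minimal_ratio * no_opt_ratio   # 0.239, as stated in the text
```

The roughly 0.026 gap between the predicted and measured ratios is the residual interaction between the two effects.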

The most important dynamic optimization is the node optimization, in which, whenever possible during phase 1 of Algorithm III-4, a node's final value is propagated instead of its capacitive value. As indicated by the data in Table V-22, this optimization alone is responsible for about 80% of the impact of all dynamic optimizations.

An alternative to using the node optimization is to realize that during phase 1 there is no need to fetch any of the SOURCE OPERANDS of a node instruction. Rather, only the node's PRESENT VALUE needs to be fetched. Thus during phase 1 it is possible to eliminate any extra SOURCE OPERAND fetches for node instructions. Because extra SOURCE OPERAND fetches account for only 8% of the overall execution time, it is unlikely that this alternative optimization, called the Node2

Table V-20: The effect of using no dynamic optimizations. In the base case, all three dynamic optimizations are used.

(Columns: RMW Cycles, Instructions Executed, Destination Stores, Source Fetches; first as raw counts with no dynamic optimizations, then as Base/No-dynamic-optimizations ratios of the same four quantities.)

Circuit     RMW Cycles  Instr. Executed  Dest. Stores  Src. Fetches   RMW     Instr.  Dest.   Src.
adder4      7544586     2886203          4049626       324209         0.460   0.455   0.414   0.612
adder8      8474705     3279351          4593548       380158         0.464   0.459   0.428   0.627
adder16     9338728     3640475          5093525       429768         0.466   0.462   0.436   0.635
cadd24      4297102     1564836          2455083       210743         0.639   0.741   0.574   0.536
chess       23670213    9019777          12546719      2031922        0.573   0.604   0.521   0.739
mul8x8      4792472     2073600          2498628       154616         0.681   0.742   0.627   0.600
ram4x4      8808900     3456678          4510258       405284         0.401   0.350   0.358   0.678
ram8x8      13374191    5003839          7180023       937636         0.324   0.244   0.314   0.651
ram16x16    22243915    8161032          12556189      1379464        0.261   0.152   0.280   0.654
ssc         40615101    16328009         23963995      202616         0.528   0.643   0.448   0.360
stk4x4      4756444     1940123          2351329       299076         0.559   0.546   0.516   0.736
stk8x8      5031112     2117437          2476695       391636         0.544   0.511   0.530   0.760
stk16x16    5190575     2195954          2542241       439380         0.539   0.500   0.533   0.758

Mean        0.495   0.493   0.460   0.642
Variance    0.014   0.030   0.010   0.012
Std. Dev.   0.118   0.174   0.102   0.108
95% Conf.   0.064   0.095   0.056   0.059
Dist.       normal  normal  normal  normal

Table V-21: The effect of using a minimal FAST-1 and no dynamic optimizations.

(Columns: RMW Cycles, Instructions Executed, Destination Stores, Source Fetches for the minimal machine with no optimizations; followed by the Base/Minimal-no-opt ratios of RMW Cycles, Instructions Executed, and Destination Stores.)

Circuit     RMW Cycles  Instr. Executed  Dest. Stores  Src. Fetches   RMW     Instr.  Dest.
adder4      13421476    6175445          6955339       0              0.259   0.213   0.241
adder8      14980750    6941814          7811828       0              0.262   0.217   0.251
adder16     16423885    7647839          8596966       0              0.265   0.220   0.259
cadd24      7422404     3435454          3920510       0              0.370   0.337   0.359
chess       52905530    24352448         28426021      0              0.256   0.224   0.230
mul8x8      8208126     4085658          4056840       0              0.398   0.376   0.386
ram4x4      15990355    7279419          8208676       0              0.221   0.166   0.196
ram8x8      29533725    13160604         16082568      0              0.147   0.093   0.140
ram16x16    49877395    21575514         28138079      0              0.116   0.058   0.125
ssc         74303503    32190875         41954575      0              0.289   0.326   0.256
stk4x4      8834584     4196570          4472098       0              0.301   0.252   0.271
stk8x8      9689542     4692826          4951372       0              0.283   0.231   0.265
stk16x16    10175950    4954481          5208469       0              0.275   0.222   0.260

Mean        0.265   0.226   0.249
Variance    0.006   0.008   0.005
Std. Dev.   0.076   0.089   0.072
95% Conf.   0.041   0.048   0.039
Dist.       normal  normal  normal

optimization, will have much effect. The data in Table V-23 supports this claim.

The store optimization avoids storing into the gate of a transistor if the state of the new value being stored is the same as the state of the previously stored value. This optimization should have more impact when the node optimization is not being used, as the total number of stores is greater. As shown in Table V-24, using only the store optimization causes the simulation to run about 1.7 times slower than the base case, rather than twice as slow as when no optimizations are used.
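In schematic form, the store optimization is just a comparison before the store. The field names and the `state_of` helper below are invented for illustration; the point is only that an unchanged logic state suppresses both the store and the re-execution it would trigger.

```python
# Sketch of the store optimization: skip the store into a transistor's
# gate when the *state* of the new value equals the state already stored,
# even if some other attribute of the value (e.g. strength) differs.

def store_gate(transistor, new_value, state_of, mark_executable):
    """Store new_value into the gate field; return True if a store happened."""
    if state_of(new_value) == state_of(transistor['gate']):
        return False                      # store (and re-execution) avoided
    transistor['gate'] = new_value
    mark_executable(transistor)
    return True

scheduled = []
t = {'gate': ('1', 'strong')}
# Same logic state '1', different strength: the store is skipped entirely.
stored = store_gate(t, ('1', 'weak'), lambda v: v[0], scheduled.append)
```

A second call with a value whose state is '0' would perform the store and schedule the transistor for execution.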

The last dynamic optimization is the transistor optimization in which, when

Table V-22: The effect of using only the node optimization.

(Columns: RMW Cycles, Instructions Executed, Destination Stores, Source Fetches; first as raw counts with only the node optimization, then as Base/Node-optimization ratios of the same four quantities.)

Circuit     RMW Cycles  Instr. Executed  Dest. Stores  Src. Fetches   RMW     Instr.  Dest.   Src.
adder4      3791279     1630930          1677337       198468         0.916   0.805   1.000   1.000
adder8      4321885     1897489          1964268       238484         0.909   0.793   1.000   1.000
adder16     4809520     2138664          2222790       273110         0.905   0.787   1.000   1.000
cadd24      2829169     1238776          1410935       113022         0.971   0.936   0.999   1.000
chess       16514574    6544379          8396198       1502202        0.821   0.833   0.779   1.000
mul8x8      3359050     1616382          1584271       92777          0.972   0.951   0.990   1.000
ram4x4      4020698     1696783          1612598       274629         0.879   0.713   1.000   1.000
ram8x8      5413068     2297888          2252208       610271         0.801   0.530   1.000   1.000
ram16x16    8096200     3526896          3519528       902538         0.718   0.352   1.000   1.000
ssc         28802942    10700752         17908687      73022          0.744   0.982   0.600   1.000
stk4x4      3214481     1331744          1496680       220145         0.827   0.795   0.811   1.000
stk8x8      3369519     1428911          1597539       297729         0.813   0.758   0.822   1.000
stk16x16    3455593     1469839          1639827       332931         0.810   0.747   0.827   1.000

Mean        0.853   0.768   0.910   1.000
Variance    0.006   0.029   0.017   0
Std. Dev.   0.080   0.170   0.129   0
95% Conf.   0.043   0.093   0.070
Dist.       normal  normal

Table V-23: The effect of using only the Node2 optimization.

(Columns: RMW Cycles, Instructions Executed, Destination Stores, Source Fetches; first as raw counts with only the Node2 optimization, then as Base/Node2-optimization ratios of the same four quantities.)

Circuit     RMW Cycles  Instr. Executed  Dest. Stores  Src. Fetches   RMW     Instr.  Dest.   Src.
adder4      7417558     2886203          4049626       197181         0.468   0.455   0.414   1.007
adder8      8325615     3279351          4593548       231068         0.472   0.459   0.428   1.032
adder16     9170087     3640475          5093525       261127         0.475   0.462   0.436   1.046
cadd24      4200143     1564836          2455083       113784         0.654   0.741   0.574   0.993
chess       23146456    9019777          12546719      1508165        0.586   0.604   0.521   0.996
mul8x8      4730512     2073600          2498628       92656          0.690   0.742   0.627   1.001
ram4x4      8619414     3456678          4510258       215798         0.410   0.350   0.358   1.273
ram8x8      12934128    5003839          7180023       497573         0.335   0.244   0.314   1.226
ram16x16    21600236    8161032          12556189      735785         0.269   0.152   0.280   1.227
ssc         40572324    16328009         23963995      159839         0.528   0.643   0.448   0.457
stk4x4      4656853     1940123          2351329       199485         0.571   0.546   0.516   1.104
stk8x8      4909251     2117437          2476695       269775         0.558   0.511   0.530   1.104
stk16x16    5056094     2195954          2542241       304899         0.554   0.500   0.533   1.092

Mean        0.505   0.493   0.460   1.043
Variance    0.014   0.030   0.010   0.040
Std. Dev.   0.119   0.174   0.102   0.200
95% Conf.   0.065   0.095   0.056   0.109
Dist.       normal  normal  normal

storing into the SOURCE VALUE or DRAIN VALUE field of a transistor instruction, the instruction is not marked as executable if the transistor is turned off. However, if the instruction is already queued for evaluation, it will be executed. The simulation algorithm could be designed to turn off the EXECUTION TAG, but in hardware this would require dequeueing the instruction, which takes more time than simply executing the instruction.
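A sketch of the transistor optimization follows. The dictionary fields are invented names standing in for the instruction-word fields just described; the already-queued behavior mirrors the EXECUTION TAG discussion above.

```python
# Sketch of the transistor optimization: a store into the source or drain
# field of a transistor instruction marks it executable only when the
# transistor is on.  An instruction already queued stays queued -- in
# hardware, dequeueing would cost more than simply executing it.

def store_terminal(transistor, field, value, queue):
    transistor[field] = value
    off = transistor['gate_state'] == '0'
    if off or transistor.get('queued'):
        return                      # no (additional) scheduling needed
    transistor['queued'] = True
    queue.append(transistor)

q = []
t_off = {'gate_state': '0'}
store_terminal(t_off, 'source_value', '1', q)   # transistor off: not queued
t_on = {'gate_state': '1'}
store_terminal(t_on, 'drain_value', '1', q)     # transistor on: queued once
store_terminal(t_on, 'source_value', '0', q)    # already queued: no duplicate
```

Note that the store into the field always happens; only the scheduling side effect is conditional.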

As shown in Table V-25, in isolation the transistor optimization is only about half as important as the store optimization. Yet, as explained above, when the node optimization is used as well, it is likely that the transistor optimization and the store optimization have about equal impact.

Table V-24: The effect of using only the Store optimization.

(Columns: RMW Cycles, Instructions Executed, Destination Stores, Source Fetches; first as raw counts with only the store optimization, then as Base/Store-optimization ratios of the same four quantities.)

Circuit     RMW Cycles  Instr. Executed  Dest. Stores  Src. Fetches   RMW     Instr.  Dest.   Src.
adder4      6106923     2886203          2911963       324209         0.542   0.455   0.576   0.612
adder8      7213020     3279351          3331863       380158         0.545   0.459   0.590   0.627
adder16     7959088     3640475          3713885       429768         0.547   0.462   0.599   0.635
cadd24      3333436     1564836          1491417       210743         0.824   0.741   0.945   0.536
chess       19419611    9019791          8296117       2031908        0.698   0.604   0.788   0.739
mul8x8      3917229     2073600          1623385       154616         0.833   0.742   0.966   0.600
ram4x4      7550658     3456678          3252016       405284         0.468   0.350   0.496   0.678
ram8x8      11043120    5003839          4848952       937636         0.392   0.244   0.464   0.651
ram16x16    17740968    8161032          8053242       1379464        0.328   0.152   0.437   0.654
ssc         27898287    16328009         11247181      202616         0.768   0.643   0.955   0.360
stk4x4      4087301     1940123          1682186       299076         0.650   0.546   0.722   0.736
stk8x8      4455345     2117437          1900928       391636         0.615   0.511   0.691   0.760
stk16x16    4632672     2195954          1984338       439380         0.604   0.500   0.683   0.758

Mean        0.601   0.493   0.686   0.642
Variance    0.024   0.030   0.034   0.012
Std. Dev.   0.155   0.174   0.184   0.108
95% Conf.   0.084   0.095   0.100   0.059
Dist.       normal  normal  normal  normal

Table V-25: The effect of using only the Transistor optimization.

(Columns: RMW Cycles, Instructions Executed, Destination Stores, Source Fetches; first as raw counts with only the transistor optimization, then as Base/Transistor-optimization ratios of the same four quantities.)

Circuit     RMW Cycles  Instr. Executed  Dest. Stores  Src. Fetches   RMW     Instr.  Dest.   Src.
adder4      6913293     2254910          4049626       324209         0.502   0.582   0.414   0.612
adder8      7731791     2536437          4593548       380158         0.508   0.593   0.428   0.627
adder16     8499038     2800785          5093525       429768         0.512   0.601   0.436   0.635
cadd24      4210941     1478675          2455083       210743         0.653   0.784   0.574   0.536
chess       21321031    6670595          12546719      2031922        0.636   0.817   0.521   0.739
mul8x8      4689683     1970811          2498628       154616         0.696   0.780   0.627   0.600
ram4x4      7403470     2051248          4510258       405284         0.477   0.590   0.358   0.678
ram8x8      10428164    2057812          7180023       937636         0.416   0.592   0.314   0.651
ram16x16    16161935    2079052          12556189      1379464        0.360   0.597   0.280   0.654
ssc         40158458    15871366         23963995      202616         0.534   0.662   0.448   0.360
stk4x4      4203226     1386905          2351329       299076         0.632   0.763   0.516   0.736
stk8x8      4322566     1408891          2476695       391636         0.634   0.768   0.530   0.760
stk16x16    4421719     1427098          2542241       439380         0.633   0.769   0.533   0.758

Mean        0.553   0.684   0.460   0.642
Variance    0.010   0.009   0.010   0.012
Std. Dev.   0.102   0.095   0.102   0.108
95% Conf.   0.055   0.052   0.056   0.059
Dist.       normal  normal  normal

V.4.4. Using a Queue versus a Stack for Keeping Track of Executable Instructions

Of the various methods for keeping track of executable instructions, the two most convenient are a stack (LIFO) and a queue (FIFO). From the measurements in Table V-26, it can be seen that the performance of a FAST-1 is not affected by the particular method used. Thus, ease of implementation can be the sole factor in deciding which method to use. Though not shown in the table, essentially identical results occur with a minimal FAST-1.
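Both disciplines fit the same scheduler skeleton; only the end from which instructions are removed differs. A schematic sketch, with invented names:

```python
from collections import deque

# Sketch: keeping track of executable instructions with a stack (LIFO)
# or a queue (FIFO).  The measurements above show the choice has
# essentially no effect on FAST-1 performance.

class Scheduler:
    def __init__(self, lifo=True):
        self.pending = deque()
        self.lifo = lifo

    def mark_executable(self, insn):
        self.pending.append(insn)

    def next_insn(self):
        # LIFO removes from the same end insertions go to; FIFO from the other.
        return self.pending.pop() if self.lifo else self.pending.popleft()

stack = Scheduler(lifo=True)
fifo = Scheduler(lifo=False)
for s in (stack, fifo):
    for i in ('a', 'b', 'c'):
        s.mark_executable(i)
```

After marking 'a', 'b', 'c' executable in that order, the stack version executes 'c' first while the queue version executes 'a' first; the measurements say neither ordering wins overall.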

V.4.5. Parallelism in the FAST-1

As a prelude to the remainder of this dissertation, it is appropriate to consider whether or not simulations have parallelism that can be exploited in order to make them run faster. In the context of the FAST-1, one view of parallelism is

Table V-26: The performance of a stack versus a queue for keeping track of executable instructions. In the base case, a stack is used. All other aspects of the simulations are identical.

(Columns: RMW Cycles, Instructions Executed, Destination Stores, Source Fetches; first as raw counts for the queue configuration, then as Base(stack)/Queue ratios of the same four quantities.)

Circuit     RMW Cycles  Instr. Executed  Dest. Stores  Src. Fetches   RMW     Instr.  Dest.   Src.
adder4      3471782     1315281          1679215       192738         1.000   0.998   0.999   1.030
adder8      3922183     1505112          1965921       229502         1.002   1.000   0.999   1.039
adder16     4340039     1680328          2223589       261162         1.003   1.001   1.000   1.046
cadd24      2836117     1197788          1432844       139045         0.969   0.968   0.983   0.813
chess       13537285    5453659          6584268       1427563        1.002   0.999   0.993   1.052
mul8x8      3246629     1528294          1562962       89745          1.005   1.006   1.003   1.034
ram4x4      3538348     1213272          1612286       276110         0.999   0.998   1.000   0.995
ram8x8      4339799     1222213          2251026       613867         0.999   0.997   1.001   0.994
ram16x16    5811346     1242839          3514960       906317         1.000   0.999   1.001   0.996
ssc         21536451    10560603         10775969      79394          0.995   0.995   0.997   0.920
stk4x4      2695175     1080697          1214085       234477         0.986   0.979   1.000   0.939
stk8x8      2785184     1107833          1313134       318873         0.983   0.977   1.000   0.934
stk16x16    2853479     1125348          1357292       357839         0.981   0.975   0.999   0.930

Mean        0.994   0.992   0.998   0.979
Variance    0.000   0.000   0.000   0.005
Std. Dev.   0.011   0.012   0.005   0.069
95% Conf.   0.006   0.007   0.003   0.037
Dist.       normal

to determine whether or not there are instructions that can be executed in parallel. This is done by finding sets of instructions that can be executed in a single parallel simulation step. Having found the total number of parallel simulation steps required, it is also possible to estimate the performance of switch-level simulators that are not event-driven. As discussed in the following section, these estimates are particularly interesting in the context of simulation machine architectures that are not event-driven, such as the Yorktown Simulation Engine or the Connection Machine.

In order to determine sets of instructions that can be executed in parallel, FIFO ordering is used to keep track of executable instructions. At the beginning of each phase, the last instruction in the queue is tagged, and all of the instructions between the head of the queue and this last instruction are assumed to be executable in parallel. After the tagged instruction is executed, the instruction that is now last in the queue is tagged. In this way, another set of instructions that can be executed in parallel is defined. This process continues until there are no more instructions in the queue, at which point the next phase of the simulation begins.

Each set of instructions constitutes a parallel simulation step. The ratio of the number of instructions executed in the base case to the number of parallel steps it takes to execute the same simulation is an upper bound on parallelism. Of course, this bound is not a precise count, because using instruction executions instead of read-modify-write cycles ignores the effect of fan-in and fan-out. Another method for determining the potential speedup from executing instructions in parallel is to keep track of the maximum time used to execute any instruction during each simulation step. The sum of these maximum times yields the parallel simulation time. The ratio of the base-case time to the parallel

time is a somewhat more realistic estimate of the speedup possible, but in some sense it is both too optimistic and too pessimistic. It is too optimistic in that it ignores the impact of inter-processor communication. It is too pessimistic in that it assumes that all instructions in step n must always finish before any instruction in step n + 1 can begin. In reality, this restriction is required only at the end of a simulation phase. Furthermore, the parallel simulation time does not take into account more parallel methods for handling large fan-in and fan-out. One method for getting a feel for the effect of large fan-in and fan-out on parallelism is to perform the same measurements using a minimal FAST-1. In a minimal machine each instruction has a fixed fan-in of two and a fixed fan-out of two, and thus executing an instruction requires at most 3 read-modify-write cycles. The result of using a minimal FAST-1 is that the number of parallel simulation steps increases, while the total parallel simulation time decreases.
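The tagging scheme described above can be sketched directly: everything in the FIFO queue at the start of a step forms one parallel step, and the step's cost is that of its slowest instruction. The names below are invented; `execute` stands in for running one instruction and returning its cost and any successors it makes executable.

```python
from collections import deque

# Sketch of the parallelism measurement: instructions in the FIFO queue
# at the start of a step execute "in parallel"; the step's time is the
# time of its slowest instruction.  Total parallel time is the sum of
# these per-step maxima.

def measure_parallelism(initially_executable, execute):
    queue = deque(initially_executable)
    steps, parallel_time = 0, 0
    while queue:
        step_size = len(queue)          # everything up to the tagged tail
        slowest = 0
        for _ in range(step_size):
            cost, successors = execute(queue.popleft())
            slowest = max(slowest, cost)
            queue.extend(successors)    # these belong to a later step
        steps += 1
        parallel_time += slowest
    return steps, parallel_time

# Tiny example: i0 (3 cycles) wakes i1 and i2 (1 and 2 cycles),
# giving two parallel steps costing 3 and 2 cycles respectively.
costs = {'i0': (3, ['i1', 'i2']), 'i1': (1, []), 'i2': (2, [])}
steps, ptime = measure_parallelism(['i0'], lambda i: costs[i])
```

Dividing the uniprocessor instruction count by `steps` gives the optimistic bound of Table V-27; dividing the uniprocessor time by `parallel_time` gives the more realistic one.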

Figures V-14 and V-15 illustrate the parallel execution steps graphically. The X axis in both plots represents successive parallel simulation steps. The positive Y axis in the first plot is the number of instructions that can be executed in parallel, and the negative Y axis is the time it takes to execute the longest instruction during that step. In the second plot, the Y axis is the instruction address; hence, this plot shows which instructions are actually executed during each step. There are two things to notice about these plots. First, there is high variance in the number of instructions that are executable during each step. Second, the average number of instructions that are actually executed during a step is a small fraction of the number of instructions used to represent the circuit.

Table V-27 is a summary of the parallelism experiments for the base case configuration. As expected, circuits that have the most instructions and that have nodes with high fan-out have the highest ratio of instruction executions to parallel simulation steps. But when looking at the ratio of serial simulation time to parallel simulation time, the difference in the ratios is much smaller. This is because the time it takes to execute each parallel simulation step is now the time required to execute the slowest instruction in that step, which is, more or less, the instruction with the highest fan-out.

The data in Table V-28 is similar to the data in Table V-27. However, in order to get better estimates of the available parallelism, the parallelism measures are obtained using a minimal FAST-1. The advantage of this approach is that the fan-in and fan-out trees allow fan-in and fan-out to be performed in parallel; in the base case data of Table V-27, fan-in and fan-out are done sequentially. As a result, the minimal parallelism data has more parallel time steps, but each step takes at most 3 RMW cycles. Thus, as shown in Table V-28, the ratio of uniprocessor simulation time to parallel simulation time is closer to the ratio of uniprocessor instruction executions to parallel simulation steps than it is in Table V-27.

Regardless of which set of ratios is examined, it is clear that while there is some potential for parallelism, it is not as great as might be hoped. Figure


Figure V-14: A portion of the parallelism profile for adder4. The X axis is parallel steps. The positive Y axis is the number of instructions executed in parallel. The negative Y axis is the longest instruction execution time during that step.


Figure V-15: The instructions executed during each parallel step. The X axis is parallel steps, as above, and the Y axis is the instruction address.

Table V-27: Maximum available parallelism in FAST-1 simulations using the base configuration. 'Uniprocessor Time/Parallel Time' is the ratio of the time used by a uniprocessor FAST-1 to the parallel simulation time. 'Instruction Executions/Parallel Steps' is the ratio of instructions executed on a uniprocessor to the number of parallel simulation steps. This figure provides an optimistic upper bound on parallelism.

Circuit     Parallel  Parallel  Uniprocessor Time  Instruction Executions
            Time      Steps     / Parallel Time    / Parallel Steps
adder4      1039601   320478    3.341              4.097
adder8      907648    266600    4.329              5.644
adder16     729857    209861    5.964              8.015
cadd24      196369    60210     13.992             19.256
chess       472739    41939     28.689             129.936
mul8x8      193890    57992     16.834             26.519
ram4x4      951808    248998    3.713              4.861
ram8x8      789695    136723    5.488              8.915
ram16x16    657835    70505     8.833              17.602
ssc         4743010   54670     4.520              192.125
stk4x4      419397    103275    6.339              10.248
stk8x8      244995    25617     11.177             42.256
stk16x16    205080    6325      13.651             173.530

Mean        9.759    49.462
Variance    51.946   4632.861
Std. Dev.   7.207    68.065
95% Conf.   3.918    37.001
Dist.       normal

Table V-28: Maximum available parallelism in FAST-1 simulations using the minimal configuration. 'Uniprocessor Time/Parallel Time' is the ratio of the time used by the base uniprocessor FAST-1 to the parallel simulation time for a minimal FAST-1. 'Instruction Executions/Parallel Steps' is the ratio of instructions executed on the base uniprocessor to the number of parallel simulation steps for a minimal FAST-1. This figure provides a more realistic bound on potential parallelism.

Circuit     Minimal         Minimal          Base Uniproc. Time /   Base Instr. Executions /
            Parallel Time   Parallel Steps   Minimal Parallel Time  Minimal Parallel Steps
adder4      1351130         577723             2.571                  2.272
adder8      1167079         489237             3.367                  3.076
adder16     927720          384129             4.692                  4.379
cadd24      189109          80434             14.530                 14.414
chess       169689          67630             79.925                 80.576
mul8x8      192413          75103             16.963                 20.477
ram4x4      77_306          331084             4.539                  3.656
ram8x8      494766          210019             8.760                  5.804
ram16x16    273318          113511            21.259                 10.933
ssc         234799          92828             91.305                113.150
stk4x4      427396          178251             6.220                  5.937
stk8x8      113718          47004             24.081                 23.029
stk16x16    28507           11854             97.185                 92.591

Mean         28.877     29.253
Variance   1253.000   1511.388
Std. Dev.    35.398     38.877
95% Conf.    19.242     21.135
Dist.

Figure V-16 illustrates this point graphically. The X axis in this plot is the number of instructions in the circuit. The Y axis is the speedup divided by the number of instructions in the circuit. The resulting plot is the 'speedup per instruction.' For some circuits, such as the RAMs and the stacks, the speedup per instruction grows as the circuit grows. For the remainder of the circuits, the speedup per instruction decreases rather rapidly with increasing circuit size. This plot suggests that even as circuits grow very large, the available parallelism is still relatively limited. Nevertheless, many more measurements are required before any firm conclusions may be drawn.

[Plot: speedup per instruction (Y axis, 0 to about .09) against number of instructions (X axis, 64 to 32768, logarithmic). Four series are shown: Base Instruction Executions/Minimal Configuration Parallel Steps; Base Time/Minimal Configuration Parallel Time; Base Instruction Executions/Base Configuration Parallel Steps; Base Execution Time/Base Configuration Parallel Time.]

Figure V-16: The average speedup per instruction.

Another way to understand why the available parallelism is limited is to examine the distribution of instruction executions and simulation time over instructions. Figures V-17 and V-19 show the cumulative percentage of execution time against the cumulative percentage of instructions sorted by execution time. Figures V-18 and V-20 show the cumulative percentage of instruction executions against the cumulative percentage of instructions sorted by execution frequency for some of the circuits. From the data in these plots, it is clear that, in general, relatively few instructions account for most of the execution time and most of the instruction executions.
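The cumulative curves in Figures V-17 through V-20 can be derived from a per-instruction profile in the following way. This is an illustrative sketch; the profile numbers below are invented, not measurements from the dissertation.

```python
# Sketch: build a cumulative-distribution curve from a per-instruction
# cost profile (execution time or execution count per instruction).

def cumulative_profile(per_instruction_cost):
    """Sort instructions by descending cost and return, for each prefix,
    (percent of instructions, percent of total cost) as paired lists."""
    costs = sorted(per_instruction_cost, reverse=True)
    total = sum(costs)
    n = len(costs)
    pct_instructions, pct_cost = [], []
    running = 0
    for i, c in enumerate(costs, start=1):
        running += c
        pct_instructions.append(100.0 * i / n)
        pct_cost.append(100.0 * running / total)
    return pct_instructions, pct_cost

# A skewed (hypothetical) profile: a few instructions dominate.
profile = [500, 300, 100, 50, 25, 10, 5, 5, 3, 2]
xs, ys = cumulative_profile(profile)
print(xs[1], ys[1])  # -> 20.0 80.0 (20% of instructions, 80% of the time)
```

A curve that rises steeply and then flattens, as here, is exactly the shape seen in the figures: most of the work concentrates in a small fraction of the instructions.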

Finally, the data in Tables V-27 and V-28 reveal a rather interesting phenomenon. Note that for the RAM circuits quadrupling the number of bits

[Plot: cumulative percentage of instruction execution time (Y axis, 0 to 100) against cumulative percentage of instructions (X axis, 0 to 100).]

Figure V-17: Cumulative execution time for the RAM circuits.

[Plot: cumulative percentage of instruction execution frequency against cumulative percentage of instructions; series include ram4x4m, ram8x8, ram8x8m, ram16x16, and ram16x16m.]

Figure V-18: Cumulative instruction frequency for the RAM circuits.

[Plot: cumulative percentage of instruction execution time against cumulative percentage of instructions; series include final, finalm, adder, and adderm.]

Figure V-19: Cumulative execution time for the other circuits.

[Plot: cumulative percentage of instruction execution frequency against cumulative percentage of instructions; series include mul, mulm, final, and finalm.]

Figure V-20: Cumulative instruction frequency for the other circuits.


only doubles the parallelism, while for the stacks quadrupling the number of bits quadruples the parallelism. The likely reason for this is that, in a RAM, we access only one word at a time, and thus only increases in the width of the RAM significantly increase available parallelism. For the implementation of a stack used here, however, nominally all bits are changed when we push or pop the stack, and so the available parallelism increases with increases in either the width or the depth of the stack.

The relatively low parallelism in the RAM is particularly interesting because it indicates that when simulating a microprocessor in which a large fraction of the circuit is memory, there may not be much parallelism to exploit. On the other hand, both the chess chip and the SSC filter chip are known to have a good deal of internal parallelism, and this appears to be reflected in the data in these tables.

V.4.6. Execution Time Estimates for Other Simulation Machine Architectures

The parallel step analysis described above is useful for another purpose. If Algorithm III-4 is implemented on a machine that is not event-driven then, during each step, potentially every instruction needs to be executed. Hence, if we know the number of steps it takes to perform a simulation, we can estimate the time it would take a non-event-driven processor to perform the same simulation. Two such processors are the Yorktown Simulation Engine (YSE) [Denneau, 1982a] and the Connection Machine (CM) [Hillis, 1981]. Rather than concern ourselves with the exact implementation details of these processors, it is more appropriate to consider instead the relative performance of each machine's general architecture in comparison with the FAST-1. In all cases, it is assumed that these machines implement a version of Algorithm III-4. It is conceivable, however, that some other switch-level simulation algorithm would be more appropriate for a processor that is not event-driven.

V.4.6.1. Execution Time Estimates for YSE-Style Processors

In the YSE, instructions are executed sequentially, and a YSE simulation step can be thought of as one pass through instruction memory. Given an implementation of Algorithm III-4 in which one-input nodes are not eliminated and in which instructions are evaluated in FIFO order, it should be clear that during any single FAST-1 parallel simulation step, either transistors are evaluated or nodes are evaluated, but not both. Furthermore, assuming that instructions in a YSE-style machine are allocated sequentially, with transistor instructions followed by node instructions, phases 3 and 1 of Algorithm III-4 can be performed together in a single pass through YSE instruction memory.6 A reasonable estimate of the number of steps used by a YSE-style machine is the number of unit-delay periods plus half of the total number of

6 This is true only if the node optimization is not used. Measurements I have made indicate that the benefit of performing phase 3 and phase 1 in a single YSE pass far exceeds the benefit of the node optimization.

parallel simulation steps performed during phases 2n through 21. However, unless a special mechanism is implemented to keep track of whether or not any instructions need to be executed during a particular phase of Algorithm III-4, a YSE-style machine would have to make at least one pass through its instructions for each phase, even if no instructions need to be executed. The data presented in the table below assumes that such a mechanism exists and is therefore biased in favor of a YSE-style machine.

Next, we need to estimate how long it takes to execute a single YSE-style instruction. Assuming some pipelining and that the operand memory is multi-ported, each instruction can be executed in a single RMW cycle. The architecture handles arbitrary fan-out implicitly, and thus fan-out does not affect execution time nor is there a need for fan-out trees. For node instructions, however, we either need fan-in trees or some mechanism for fetching additional source-words. Because of pipelining considerations, it is probably more reasonable to use fan-in trees. Even though using fan-in trees means that nodes can now output to other nodes, this does not present a problem, since instructions in a YSE-style machine are executed sequentially. The execution of the node instructions that are part of a fan-in tree can be ordered so that using fan-in trees does not increase the number of simulation steps. Other implementation considerations make it difficult to simultaneously write two different results into the operand memory. Consequently, each bidirectional transistor is represented using two unidirectional transistor instructions.
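The estimate just described can be sketched as follows. This is my reading of the estimate, not a procedure from the dissertation; the unit-delay and phase-2 step counts are invented, and the 172-instruction figure is the adder4 circuit size from Table V-29.

```python
# Sketch of the YSE-style execution time estimate: steps are passes
# through instruction memory, and each pass costs one RMW cycle per
# instruction in the circuit.

def yse_steps(unit_delay_periods, phase2_parallel_steps):
    # Phases 3 and 1 share one pass per unit-delay period; each pair of
    # phase-2 parallel steps costs one additional pass.
    return unit_delay_periods + phase2_parallel_steps // 2

def yse_time(steps, num_instructions):
    # One RMW cycle per instruction per pass through instruction memory.
    return steps * num_instructions

steps = yse_steps(unit_delay_periods=1000, phase2_parallel_steps=4000)
print(steps)                 # -> 3000
print(yse_time(steps, 172))  # -> 516000 RMW cycles
```

Note that the time grows with the product of steps and circuit size, which is why the event-driven FAST-1, which touches only active instructions, pulls ahead on large circuits.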

The total time it takes to perform a simulation on a YSE-style machine is the total number of YSE steps times the number of instructions used to represent the circuit. Given this basis, we can compare the performance of the FAST-1 architecture to that of a YSE-style architecture. This comparison is presented in Table V-29. The second column is the number of instructions used to represent the circuit for the YSE-style machine. The third column is the number of simulation steps used by a YSE-style machine, and the fourth column is the total simulation time. The ratio of the FAST-1 uniprocessor simulation time to the YSE-style machine simulation time is shown in the fifth column and indicates that, when using Algorithm III-4, an event-driven processor such as the FAST-1 is about 14 times faster than a non-event-driven processor such as the YSE. Moreover, for the chess chip and the SSC filter chip, the FAST-1 is about 50 times faster than a YSE-style machine.

V.4.6.2. Execution Time Estimates for Connection Machine-Style Processors

The basic Connection Machine (CM) architecture is that of a SIMD computer, in which each processor is a very simple serial machine that can contain the information for either one node instruction or one transistor instruction. Any CM processor can send a message to any other processor; thus implementing Algorithm III-4 is straightforward. As with a YSE-style machine, simulating circuits on a CM-style machine involves steps in which the simulation of transistors is followed by the simulation of nodes. Whereas in a YSE-style machine instructions are executed sequentially, in a CM-style machine instructions are executed in parallel.

Table V-29: The time used by a YSE-style simulation machine.

Circuit     Instructions for     YSE-Style   YSE-Style Time        FAST-1 Time /
            YSE-Style Machine    Steps       = Steps * # of Inst.  YSE-Style Time
adder4      172                  146074        25124728            0.138
adder8      340                  118432        40266880            0.098
adder16     676                  89814         60714264            0.072
cadd24      4457                 25742        114732094            0.024
chess       33487                22910        767187170            0.018
mul8x8      2391                 35991         86054481            0.038
ram4x4      330                  174432        57562560            0.061
ram8x8      1124                 92977        104506148            0.041
ram16x16    4090                 47478        194185020            0.030
ssc         36670                30003       1100210010            0.019
stk4x4      370                  66698         24678260            0.108
stk8x8      1378                 17107         23573446            0.116
stk16x16    5506                 4340          23896040            0.117

Mean       0.068
Variance   0.002
Std. Dev.  0.043
95% Conf.  0.023
Dist.      normal

Therefore, a CM-style machine is able to first execute all transistor instructions in parallel, and then execute all node instructions in parallel. So, in a CM-style machine, all phase 3 transistor evaluations may be performed in parallel, and all phase 1 node evaluations may be performed in parallel, again assuming that the node optimization is not used.7 During phases 2n through 21, each FAST-1 parallel simulation step can be executed in parallel by the Connection Machine.

A reasonable estimate of the number of steps used by a CM-style machine is twice the number of unit-delay steps plus the number of FAST-1 parallel simulation steps performed during phases 2n through 21. There is one further consideration. In order to build a machine with one instruction per processor, each processor must be kept very simple, and therefore arbitrary fan-in and fan-out must be handled using fan-in and fan-out trees. Hence, the number of parallel steps used by a minimal FAST-1 is a better estimate of the number of steps used by a CM-style machine.

We also need to estimate how long it takes to evaluate a Connection Machine instruction. Because the Connection Machine is serial, the number of clock cycles it takes to evaluate an instruction is approximately equal to the number of bits that must be processed. A minimal FAST-1 instruction has two SOURCE-OPERANDS, a RESULT, and some constant information, such as transistor sizes. Each of these fields is about 8 bits long, for a total of 32 bits. Furthermore, there are two DESTINATIONS, each of which must be capable of addressing a processor and a memory location within the processor. Assuming that there are 64K processors, a processor address requires 16 bits. The address within

7 For the Connection Machine, measurements indicate that the choice of performing all phase 1 evaluations in parallel versus using the node optimization is about a toss-up. In the figures presented the node optimization is not used.

the processor can probably be kept to about 6 bits. Thus, DESTINATIONS account for about another 44 bits. Finally, the eight-bit RESULT must be written back into the current instruction, and, on average, propagated to one other instruction, requiring a total of another 16 clock cycles. So, approximately 92 clock cycles are required to process each FAST-1 instruction, ignoring communication contention. The effect of communication contention is to increase the average number of clock cycles per instruction. By ignoring the impact of contention, we should have a reasonable upper bound on the performance of a CM-style machine.
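The 92-cycle figure follows from the bit accounting above. The sketch below just totals the budget; the field widths are the text's stated assumptions, not measured values.

```python
# Per-instruction clock-cycle budget for a bit-serial CM-style processor,
# following the accounting in the text (one cycle per bit processed).

SOURCE_OPERANDS = 2 * 8          # two 8-bit source operands
RESULT_FIELD    = 8              # 8-bit result field
CONSTANTS       = 8              # constant data such as transistor sizes
DESTINATIONS    = 2 * (16 + 6)   # two destinations: 16-bit processor
                                 # address + ~6-bit address within it
RESULT_WRITES   = 2 * 8          # write the result locally and propagate
                                 # it to one other instruction on average

cycles = (SOURCE_OPERANDS + RESULT_FIELD + CONSTANTS
          + DESTINATIONS + RESULT_WRITES)
print(cycles)  # -> 92
```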

Table V-30 summarizes the performance of a CM-style machine implementation of Algorithm III-4. Because instructions are executed in parallel, the time required for a CM-style machine to simulate a circuit is just the number of steps times the number of clock cycles per instruction. The times listed in the table are based on 92 clock cycles per FAST-1 instruction. On average, it seems that the Connection Machine is about 30% slower than a FAST-1. On the other hand, a CM-style machine is able to take advantage of parallelism; thus, it simulates the chess chip and the SSC filter chip several times faster than a uniprocessor FAST-1. Of course, the simulations of these two chips require 45945 processors and 45932 processors, respectively. Moreover, these simulations are the ones that are most likely to be affected by communication contention. While, in this age of VLSI, one should not immediately scoff at using large numbers of processors, it seems clear to me that a uniprocessor FAST-1 is much easier to build than a 45,000-processor CM-style machine.

Table V-30: The time used by a Connection Machine-style simulation machine. The number of CM-style processors equals the number of instructions needed to represent the circuit. The execution of a FAST-1 instruction is assumed to require 92 Connection Machine clock cycles, as explained in the text.

Circuit     CM-Style     CM-Style   CM-Style Time   FAST-1 Time /
            Processors   Steps      = Steps * 92    CM-Style Time
adder4      220          569298       52375416      0.066
adder8      436          476542       43841864      0.090
adder16     868          369402       33984984      0.128
cadd24      5514         76988         7082896      0.388
chess       45945        63803         5869876      2.311
mul8x8      2752         93537         8605404      0.379
ram4x4      411          477452       43925584      0.080
ram8x8      1506         286561       26363612      0.164
ram16x16    5639         155474       14303608      0.406
ssc         45932        97866         9003672      2.381
stk4x4      486          179667       16529364      0.161
stk8x8      19M          46169         4247548      0.645
stk16x16    4381         11711         1077412      2.598

Mean       0.754
Variance   0.945
Std. Dev.  0.972
95% Conf.  0.528
Dist.

V.4.7. Some Other Thoughts on Parallelism

While pondering the data presented above, there are several things to keep in mind. Foremost is that the data is experimental. While the results are perhaps indicative of circuits in general, many more circuits need to be simulated before drawing any firm conclusions. It is also important to understand that these parallelism experiments measure several factors simultaneously. For example, while the experiments clearly reflect the parallelism inherent in the circuits, they also reflect the extent to which Algorithm III-4 limits this parallelism by serializing phases. It is certainly possible that a different simulation algorithm would do a better job of exploiting the inherent parallelism. Similarly, it is also possible that other simulation algorithms are better suited to YSE-style or CM-style machines. Indeed, if I am correct that there is an advantage to designing algorithms and architectures together, then it is not surprising that the FAST-1 algorithm works best using the FAST-1 architecture. Future work will hopefully provide more definite answers to these questions.

VI Algorithms for Multiprocessor Simulation

There are many ways to improve the performance of any computing system. In the first half of this dissertation, I showed how the combination of algorithms and architecture results in a simulation system that can run between two and three orders of magnitude faster than similar simulators running on a Vax 11/780. If performance is our primary concern, we can gain another order of magnitude by using a higher-performance technology than the one used to implement a Vax 11/780.

We can also try to improve performance by exploiting the parallelism available in simulation. In this chapter, I examine the algorithmic issues involved in designing a multiprocessor FAST-1. There are two aspects to this discussion: first, how is Algorithm III-4 implemented on a multiprocessor FAST-1, and second, how are circuits partitioned onto the multiprocessor in order to take advantage of the available parallelism?

The discussion in this chapter presupposes that a multiprocessor FAST-1 is designed using multiple FAST-1 processors that can send messages to one another. In Chapter VII, I discuss the architectural issues involved in designing such a machine, as well as other alternatives for improving performance.

VI.1. Multiprocessor Implementation of Algorithm III-4

In developing a multiprocessor version of Algorithm III-4, there are two major issues that need to be considered: how is the algorithm implemented using a multiprocessor, and does this new implementation compute the same results as the uniprocessor algorithm? Besides having a multiprocessor algorithm that works, we also want one that is fast. While partitioning is the most important factor in determining performance, non-partitioning aspects of multiprocessing also affect performance.

VI.1.1. Implementation

One of the strengths of Algorithm III-4 is that, within a particular phase, transistors and nodes may be evaluated in any order. Moreover, as long as the most recent RESULT of evaluating a transistor or node instruction reflects the most


recent values of the instruction's SOURCE-OPERANDS, the inner loop of Algorithm III-4 can be modified to perform all pending evaluations in parallel. Another virtue of Algorithm III-4 is that there need be no global data. This makes a multiprocessor implementation of the algorithm substantially simpler.

Given these observations, it seems that the simplest way to create a multiprocessor implementation of Algorithm III-4 is to use multiple FAST-1 processors, where each processor implements a copy of Algorithm III-4. Transistor and node instructions are then partitioned among the processors. A processor need not be able to read data from another processor, but must only be able to write a RESULT into a node or transistor instruction stored in another processor. The only other requirement is that the processors stay synchronized on a phase-by-phase basis as, in general, all processors must work on the same phase of Algorithm III-4 at the same time.
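The scheme just described can be sketched as follows. This is an illustrative model in threads, not the FAST-1 hardware design: each 'processor' only writes RESULTs into remote instruction queues, and a barrier keeps all processors on the same phase.

```python
# Sketch: phase-synchronized processors that communicate only by
# writing RESULTs into instructions stored on other processors.
import threading

NUM_PROCS = 4
queues = [[] for _ in range(NUM_PROCS)]         # incoming remote RESULT writes
locks = [threading.Lock() for _ in range(NUM_PROCS)]
barrier = threading.Barrier(NUM_PROCS)

def write_result(target_proc, instruction, value):
    # A processor never reads another processor's memory; it only
    # writes a RESULT into an instruction stored there.
    with locks[target_proc]:
        queues[target_proc].append((instruction, value))

def processor(proc_id):
    # One phase body: evaluate local pending instructions (elided here),
    # then propagate one hypothetical result to a neighboring processor.
    write_result((proc_id + 1) % NUM_PROCS, "t%d" % proc_id, 1)
    barrier.wait()  # processors stay synchronized phase by phase

threads = [threading.Thread(target=processor, args=(p,)) for p in range(NUM_PROCS)]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(sum(queues, [])))  # -> [('t0', 1), ('t1', 1), ('t2', 1), ('t3', 1)]
```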

VI.1.2. Correctness

It follows directly from the discussions on correctness and termination in Chapter III that Algorithm III-4 continues to work correctly when implemented on a multiprocessor as presented above. Recall that the order of evaluation of nodes and transistors, within a particular phase, does not affect the ultimate steady-state node values computed. Therefore, the order of evaluation in a multiprocessor implementation cannot affect the ultimate steady-state node values. This is true even though there is a subtle difference between the order of evaluation in a uniprocessor and in a multiprocessor. In a multiprocessor, two or more instructions may execute and store results concurrently, while in a uniprocessor, executing instructions and storing results is strictly sequential. However, because of the definition of the LUB and Trans functions, this change in evaluation order does not alter the steady state of a circuit. Moreover, if a circuit exhibits oscillatory behavior when simulated on a uniprocessor, it will exhibit the same oscillatory behavior when simulated on a multiprocessor.
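The order-independence argument rests on the LUB being a lattice join, which is commutative, associative, and idempotent. The toy illustration below uses a simplified three-value lattice (0 and 1 incomparable, both below the unknown value X); it is not the Chapter III value system, just a demonstration of why join-based combination is insensitive to evaluation order.

```python
# Toy illustration: a lattice join gives the same result for every
# evaluation order, which is why concurrent stores cannot change the
# steady state.
from itertools import permutations

def lub(a, b):
    """Join on the simplified lattice {0, 1} < X."""
    if a == b:
        return a
    return 'X'  # join of two distinct values is the unknown value X

values = ['0', '1', '0']
results = set()
for order in permutations(values):
    acc = order[0]
    for v in order[1:]:
        acc = lub(acc, v)
    results.add(acc)
print(results)  # -> {'X'}: the same steady value regardless of order
```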

Another interesting feature of Algorithm III-4 is that all nodes and transistors are handled identically; no special restrictions, such as having to have all of the transistors in a single transistor group on the same processor, need to be enforced. Consequently, from a correctness standpoint, there are no restrictions on how circuits are partitioned. Indeed, even if a partitioning algorithm does not partition a circuit as intended, the simulation will still be correct.

VI.1.3. Performance Considerations

Parallel activity in a multiprocessor FAST-1 arises from one instruction propagating results to two or more instructions. Thus, achieving the ultimate speedups for a particular simulation requires, in part, walking the fine line between having enough communication between processors so that there is parallel activity and not having so much communication that the interconnect becomes a bottleneck. It is the role of a partitioning algorithm to walk this line.

Nevertheless, any change to the simulation algorithm that reduces unnecessary communication obviously helps.

Whenever a value is propagated and later replaced by a value of greater strength, there is, in some sense, redundant communication. It would be nice to propagate a value between two processors only if the value is the 'final' value. Of course, propagating a value not only transfers data but also causes computation to be performed. Therefore, in general, there is no easy way to tell that the 'final' value has been reached. When propagating a value to the gate of a transistor, however, it is possible to avoid redundant communication. To a large extent the store optimization already achieves this by storing a new gate value only when the state of the value has changed. It might be better to use one of the other delay techniques described in Chapter III. For example, if delay-buffer instructions are used, then gate values are propagated only once per unit-delay step. The same is true when unit-delay is implemented by scheduling nodes, instead of transistors, for phase 3.

Synchronization is another aspect of the multiprocessor implementation of Algorithm III-4 that may limit performance. As currently designed, the algorithm requires that processors synchronize at the beginning of each phase of Algorithm III-4. But, if a transistor group is never split between two processors, then processors need to be synchronized only once per unit-delay step. To see that this works, notice that the only communication between transistor groups occurs, in effect, during phase 3. Unfortunately, there are two drawbacks to this idea. First, it is not clear that partitioning using transistor groups is the best way to exploit parallelism. Second, as shown by static analysis, some transistor groups can be very large. Nevertheless, even if some transistor groups must be split across two or more processors, only those processors that have instructions in the same transistor group must synchronize at phase boundaries.

VI.2. Partitioning Algorithms

The problem in partitioning simulations is to judiciously use communication so that there is as much parallel activity as possible. If communication is free and an unlimited number of processors are available, partitioning is easy--simply put one instruction on each processor. In real life, partitioning is difficult because communication is not free and the number of processors available is bounded.

As I show below, even ignoring communication, given a bounded number of processors, the partitioning problem is NP-complete. Moreover, there is an important practical consideration to keep in mind when discussing partitioning algorithms. Given that partitioning takes place on a general-purpose processor while simulation takes place on a very high-speed special-purpose processor, we want to make sure that the following inequality holds:

Time(partitioning) + Time(multiprocessor simulation) < Time(uniprocessor simulation)

Obviously, if one is planning to run simulations that take weeks or months, spending a day partitioning in order to make the simulation run two or three times faster may well be worth it. On the other hand, it has been observed that for the Yorktown Simulation Engine minutes or hours of CPU time are often spent partitioning simulations that run for only a few seconds on the YSE [Denneau, 1982b]. During debugging, such short runs are the norm.

It seems that for circuits with hundreds of thousands of transistors, partitioning algorithms with running times worse than, say, O(n log n) are likely to be unusable. Moreover, as I have already discussed, short of simulating a circuit with a particular set of test vectors, it is very difficult to determine the actual running time of a particular simulation. Hence, it is very difficult for any practical partitioning algorithm to know exactly how well it is doing. The algorithms discussed in this section are heuristic. In Chapter VIII, I show how well some of these heuristics work.

VI.2.1. The Complexity of Partitioning

In this section, I show that solving even a simplified version of the partitioning problem1 is NP-complete2. Clearly, there is no hope of creating an optimal partitioning of a circuit without knowing which instructions are executed when, in response to an applied stimulus. Given a uniprocessor instruction execution trace, it should be possible to schedule instructions onto a multiprocessor such that minimum running time is achieved. If all FAST-1 instructions had the same running time this would be easy but, of course, they don't.

A formal statement of the partitioning problem is: given a FAST-1 instruction trace, does there exist an assignment of the instructions to m processors such that the running time will be less than D time units? Note that, as stated, this does not find the optimal assignment. However, given that the uniprocessor running time is an upper bound for the execution time, an algorithm for solving the partitioning problem, together with binary search, can be used to find an optimal partitioning.
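The binary-search observation can be sketched as follows. The `schedulable` oracle here is a hypothetical stand-in for the (NP-complete) decision procedure; the point is only that a yes/no answer for each deadline D, plus the uniprocessor upper bound, suffices to find the optimum.

```python
# Sketch: recover the optimal running time from a decision procedure
# for "can the trace be scheduled on m processors within deadline D?"

def optimal_deadline(schedulable, uniprocessor_time):
    lo, hi = 1, uniprocessor_time  # uniprocessor time is an upper bound
    while lo < hi:
        mid = (lo + hi) // 2
        if schedulable(mid):
            hi = mid               # a schedule exists; try a tighter deadline
        else:
            lo = mid + 1           # infeasible; the optimum is larger
    return lo

# Toy oracle: pretend the best achievable schedule takes 137 time units.
print(optimal_deadline(lambda d: d >= 137, 1000))  # -> 137
```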

It is easy to see that this problem can be solved in NP time. One simply guesses the correct assignment and verifies that it is correct by actually running it. The verification step of running the simulation can be performed in time proportional to the number of nodes and transistors.

1 The reader should not confuse this partitioning problem with the well-known NP-complete problem called Partition.

2 Roughly, a problem is NP-complete if it can be solved in polynomial time by a non-deterministic Turing Machine and there exists a deterministic polynomial time algorithm that transforms an instance of a problem already known to be NP-complete into an instance of the new problem [Garey and Johnson, 1979; Aho, Hopcroft, and Ullman, 1974].

To show that the problem is NP-complete, it suffices to show that some other problem that is already known to be NP-complete can be reduced to an instance of the partitioning problem. The most similar NP-complete problem is the multiprocessor scheduling problem (MPS) [Garey and Johnson, 1979, p. 238]: given a set T of tasks, m processors, a length l(t) for each task t ∈ T, and a deadline D, is there an m-processor schedule, which specifies the starting time for each task, that meets the deadline D? An instance of MPS can be transformed into an instance of the partitioning problem by mapping each task t into a FAST-1 instruction whose execution time is l(t). Note that whereas in the partitioning problem an instruction may be executed several different times, in the transformed MPS problem each instruction will be executed only once. A solution to the partitioning problem gives us an assignment of instructions to processors. To turn this into a solution to the original MPS problem, a list of starting times is needed. Given a partitioning assignment, we simply run the simulation and note the time when each instruction is executed. This becomes the multiprocessor schedule. []

VI.2.2. Practical Partitioning Algorithms

In developing partitioning algorithms, I have restricted myself to those with running times no greater than O(n log n), where n is the number of transistors in the circuit. This restriction is based on the assumption that the whole purpose of using a multiprocessor is not just to reduce the time it takes to simulate, but rather to reduce the total time to compile, partition, and simulate.

The set of partitioning algorithms described in the following sections is far from exhaustive. Rather, these algorithms represent ideas I have investigated to a greater or lesser extent. The performance of two of the algorithms, random partitioning and fan-out partitioning, is discussed in depth in Chapter VIII.

VI.2.2.1. Random Partitioning

While undoubtedly not the best way to solve the partitioning problem, randomly assigning instructions to processors is not as bad an idea as it sounds. At worst, it provides a mechanism for determining a reasonable lower bound for performance, against which the performance of other partitioning algorithms may be compared. At best, randomly partitioning instructions is extremely fast and is useful, even if it allows a multiprocessor to execute a simulation only two or three times faster than a uniprocessor. When partitioning large circuits with reasonably high potential parallelism onto a small number of processors, it is likely that random partitioning will do as well as anything else. This should be true as there will be more than enough activity to keep all processors busy. As long as interprocessor communication queues are long enough, processors are not likely to block waiting for something to do.
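Random partitioning is simple enough to sketch in a few lines; the instruction names below are invented for illustration.

```python
# Minimal sketch of random partitioning: each instruction is assigned
# to a processor chosen uniformly at random.
import random

def random_partition(instructions, num_procs, seed=0):
    rng = random.Random(seed)   # seeded for reproducibility
    return {inst: rng.randrange(num_procs) for inst in instructions}

instructions = ['t%d' % i for i in range(10)]
assignment = random_partition(instructions, num_procs=4)
print(assignment)
```

Its chief virtue is speed: the cost is one random draw per instruction, trivially within the O(n log n) budget discussed above.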

Closely akin to random partitioning is round-robin partitioning, in which transistor and node instructions are sequentially allocated to processors.

Round-robin partitioning works on the assumption that transistors that are near each other in the input file may be related and are, therefore, likely to be active at the same time. Depending on how the input file is generated, this may or may not be true. Also, it is just as reasonable to assume that transistors that are near each other have some data dependency between them and are therefore good candidates for placing on the same processor. These thoughts lead to considering methods that try to take advantage of circuit structure.

VI.2.2.2. Functional Partitioning

It is well known that VLSI circuits often have a significant amount of structure. It is natural to wonder whether this structure can be used in partitioning. For example, given a bit-sliced design, it might be reasonable to place each bit-slice on a separate processor. The intuition is that in many circuits functional blocks are designed to operate in parallel. Obviously, in order for partitioning using functional structure to be feasible, the partitioning program must know the structure. Although a function block recognizer can find small blocks to group together on a single processor, in general it seems that the partitioning program must be told by the user which transistors constitute a functional block.

While this approach has a lot of intuitive appeal, it has two disadvantages. First, it is not clear that functional boundaries are strongly correlated with parallel activity. At the level of individual transistors and nodes it is much harder to tell what is happening with respect to parallelism. In other words, parallelism that is present at the macro or functional level does not necessarily map directly into parallelism at the micro or transistor level. The other disadvantage is that forcing the user to specify possibly hundreds of partitions may be quite burdensome and, depending on the design system used, it may not be particularly feasible.

VI.2.2.3. Transistor Group Partitioning

I have already mentioned transistor group partitioning as a method of reducing the amount of processor synchronization required. Another reason to use transistor groups as a basis for partitioning is that they represent subcircuits in which there are considerable data dependencies. Transistor groups can be found using a linear-time depth-first search algorithm. In some circuits transistor groups can be quite large, so that it may be necessary to split a transistor group among several processors.

Intuitively, partitioning using transistor groups seems like a good idea since, within each phase of Algorithm III-4, activity is localized within transistor groups. Dally has suggested using similar partitioning methods in multiprocessor implementations of the Mossim Simulation Engine [Dally, 1984]. However, unless all transistor groups have approximately equal dynamic activity, it seems that partitioning using transistor groups may leave many processors idle.

VI.2.2.4. Fan-Out Partitioning

While partitioning using transistor groups tries to keep related instructions on the same processor, it does nothing to insure that instructions that can be executed in parallel are on separate processors. Since fan-out is the primary source of parallelism, it seems reasonable to put onto separate processors all instructions that are a target of the same instruction. This idea should not be taken too far, as clearly the targets of instructions that have a fan-out of only one or two are not great sources of parallelism. Since changes in high fan-out nodes, such as clocks and busses, seem to be responsible for much of the parallel activity, Algorithm VI-1 first sorts nodes in the graph in decreasing order of fan-out. Transistor instructions that are the target of the same node are then placed onto separate processors, starting with the transistor instructions associated with the highest fan-out node. Some high fan-out nodes, such as reset signals, are largely inactive, and these may warrant special treatment.

Earlier arguments suggest that, after distributing the target instructions of the highest fan-out nodes, it is reasonable to consider placing on the same processor those instructions that are in the immediate vicinity of these target instructions. The vicinity is defined to be those transistor and node instructions reachable within a given distance of the original transistor instruction. As shown in the algorithm, only instructions within the same transistor group are considered; thus "Gate" edges are not traversed, though there may be situations where this restriction should not be applied. The running time of this partitioning algorithm is dominated by the sort, which takes O(n log n) time. Note that fan-out partitioning has the added advantage that it provides opportunities to exploit broadcasting, as discussed in Chapter VII.

VI.2.2.5. Execution Time Partitioning

While fan-out partitioning tries to maximize parallelism, it does so rather blindly in that it has no way of knowing whether or not a high fan-out node is actually responsible for a significant fraction of the execution time. If execution traces are available, it may be possible to do a better job of partitioning by placing on separate processors any two instructions that are frequently executable in parallel. Alas, the difficulty in implementing such an algorithm is that vast quantities of trace data are required. In order to reduce the volume of information that needs to be gathered, we might only keep track of the total execution time for each instruction, and not the pairwise interactions. Then, in a fashion similar to the fan-out partitioning algorithm, we can sort node instructions by execution frequency and place the most frequently executed instructions and their targets on separate processors. Furthermore, we might assign weights to execution time and fan-out, and attempt to use both as a basis for partitioning.

VI.2.2.6. Simulated Annealing

From the previous sections, it seems that what is needed is a partitioning technique that can effectively utilize a number of different factors, such as fan-out and execution time. One such method is the hill-climbing technique known as

Let Nodes[1..n] be the list of nodes sorted by fan-out, where Nodes[1]
is the highest fan-out node.

CurProcessor ← 0
For i ← 1 to n {
    if Nodes[i].PROCESSOR = undefined {
        Nodes[i].PROCESSOR ← CurProcessor
        CurProcessor ← (CurProcessor + 1) mod NumProcessors
        foreach O ∈ Nodes[i].OUTPUTS {
            NewProcessor ← Rand() mod NumProcessors
            AssignToProc(INSTRUCTIONS[O.INSTINDEX], NewProcessor, Depth)
        }
    }
}

AssignToProc(I, ProcNum, Depth) := {
    if I.PROCESSOR = unassigned {
        I.PROCESSOR ← ProcNum
        if Depth > 0 {
            foreach O ∈ I.OUTPUTS where O.TERMINAL ≠ "Gate" {
                AssignToProc(INSTRUCTIONS[O.INSTINDEX], ProcNum, Depth - 1)
            }
        }
    }
}

Algorithm VI-1: The DFS fan-out partitioning algorithm.
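The same control flow can be rendered as an executable sketch. The data structures here (nodes with 'outputs' lists, a global instruction table of dicts) are hypothetical stand-ins for the FAST-1's actual representation, and `rand` is an arbitrary zero-argument random source:

```python
def fanout_partition(nodes, instructions, num_processors, depth, rand):
    """Distribute the targets of high fan-out nodes across processors,
    then greedily pull each target's nearby instructions onto the same
    processor.

    nodes: list of dicts, sorted by decreasing fan-out, each with an
           'outputs' list of {'inst_index': ..., ...} entries.
    instructions: dict mapping index -> {'processor': None, 'outputs': [...]}.
    """
    def assign(inst_index, proc, depth):
        inst = instructions[inst_index]
        if inst['processor'] is None:
            inst['processor'] = proc
            if depth > 0:
                # Stay within the transistor group: do not cross 'Gate' edges.
                for out in inst['outputs']:
                    if out['terminal'] != 'Gate':
                        assign(out['inst_index'], proc, depth - 1)

    cur = 0
    for node in nodes:  # highest fan-out node first
        if node.get('processor') is None:
            node['processor'] = cur
            cur = (cur + 1) % num_processors
            for out in node['outputs']:
                # Spread this node's target instructions across processors.
                assign(out['inst_index'], rand() % num_processors, depth)
```

As in the pseudocode, the cost beyond the initial sort is linear in the number of edges visited.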

simulated annealing [Kirkpatrick, Gelatt and Vecchi, 1983]. Simulated annealing is like traditional hill-climbing in that it attempts to optimize an objective function, but differs in that it is occasionally willing to accept a solution that is worse than a previous solution in order to avoid getting stuck with a solution that is only locally optimal.
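The acceptance rule that distinguishes annealing from plain hill-climbing can be sketched generically (the Metropolis criterion; the particular cooling schedule, move generator, and cost function below are placeholders, not choices made in this work):

```python
import math
import random

def anneal(initial, neighbor, cost, temp0=1.0, cooling=0.95,
           steps=1000, rng=None):
    """Generic simulated-annealing loop: always accept an improving
    move; accept a worsening move with probability exp(-delta / T),
    where T shrinks as the run proceeds."""
    rng = rng or random.Random(0)
    state, c = initial, cost(initial)
    best, best_c = state, c
    T = temp0
    for _ in range(steps):
        candidate = neighbor(state, rng)
        cc = cost(candidate)
        delta = cc - c
        if delta <= 0 or rng.random() < math.exp(-delta / T):
            state, c = candidate, cc
            if c < best_c:
                best, best_c = state, c
        T *= cooling  # reduce the willingness to accept bad moves
    return best, best_c
```

For partitioning, `state` would be an assignment of instructions to processors and `neighbor` would move one instruction between processors; everything then hinges on how cheaply `cost` can be evaluated, which is the subject of the next paragraphs.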

In order for simulated annealing to be effective, the objective function must accurately reflect the goal one is trying to achieve. For partitioning simulations, the obvious objective function to use is actual simulation time. Thus, after creating a new partition of the circuit, we would run the simulation and measure the real time used. The drawbacks of this approach should be obvious. Alternatively, we might try to devise a cost function that can be evaluated in constant or near-constant time and that also reasonably reflects the relative simulation times of different partitions.

Algorithm VI-2 is a cost function that attempts to capitalize on the various ideas discussed above. The first summation is over all processors and says that there is a penalty whenever the number of instructions on a processor (TotalInst_p), or the total number of instructions executed by a processor (TotalExecutes_p), or the total number of stores done by a processor (TotalStores_p), deviates from the mean. The second summation is over instructions. It tries to capture the notion that remote and local stores have different costs and that, within these two groups of stores, it may be appropriate to assign different costs to stores that are within a transistor group versus stores that are outside of a transistor group. The constants control the relative importance of each parameter. The data for the various parameters is gathered from earlier simulation runs of the same circuit.

Cost() :=
    Σ over p ∈ Processors of
        ( k_Inst (TotalInst_p − MeanInst)² +
          k_Executes (TotalExecutes_p − MeanExecutes)² +
          k_Stores (TotalStores_p − MeanStores)² )
    +
    Σ over i ∈ Instructions, d ∈ i.DESTINATIONS of
        NumStores_d × K(i, d)

K(i, d) :=
    if i.PROCESSORID = INSTRUCTIONS[d.INSTINDEX].PROCESSORID then
        if i.TRANSISTORGROUP = INSTRUCTIONS[d.INSTINDEX].TRANSISTORGROUP
            then k_TGStore
            else k_NonTGStore
    else
        if i.TRANSISTORGROUP = INSTRUCTIONS[d.INSTINDEX].TRANSISTORGROUP
            then k_RemoteTGStore
            else k_RemoteNonTGStore

Algorithm VI-2: A cost function for partitioning using simulated annealing.
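A direct transcription of this cost function may make its two parts easier to see. The flattened representation below (per-processor count dicts and a precomputed locality classification for each destination) is an assumption made for the sketch, not the FAST-1's data layout:

```python
def partition_cost(processors, stores, weights, means):
    """Evaluate a partition in the style of Algorithm VI-2.

    processors: list of dicts with 'inst', 'executes', 'stores' counts.
    stores: list of (num_stores, is_local, same_transistor_group)
            tuples, one per (instruction, destination) pair.
    weights: the k-constants; means: the mean per-processor statistics.
    """
    cost = 0.0
    # First summation: penalize deviation from mean processor load.
    for p in processors:
        cost += weights['inst'] * (p['inst'] - means['inst']) ** 2
        cost += weights['executes'] * (p['executes'] - means['executes']) ** 2
        cost += weights['stores'] * (p['stores'] - means['stores']) ** 2
    # Second summation: charge each destination by locality and by
    # transistor-group membership.
    for num, is_local, same_group in stores:
        if is_local:
            k = weights['tg_store'] if same_group else weights['non_tg_store']
        else:
            k = (weights['remote_tg_store'] if same_group
                 else weights['remote_non_tg_store'])
        cost += num * k
    return cost
```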

Experiments using simulated annealing and this cost function reveal several things. First, over an hour of CPU time is often required to find an optimal solution to the cost function, even though circuits with only a few hundred transistors are being partitioned. Moreover, the partitions found often perform worse than those found using simpler methods. Clearly, a better cost function is needed. However, as simulated annealing using other cost functions would likely use approximately the same amount of CPU time, I have not pursued this method any further.

VII. The Architecture of a FAST-1 Multiprocessor

In this chapter, I describe the architecture of a multiprocessor simulation machine designed using multiple FAST-1 processors. In choosing the FAST-1 as the basic building block of a multiprocessor, I have taken a leap of faith that there is not some other approach that works much better. In part, my willingness to take this leap is based on how well a uniprocessor FAST-1 works, particularly in comparison to other architectures, such as those examined in Chapter V. I may be mistaken in assuming that an architecture that performs well as a uniprocessor will continue to perform well as a multiprocessor. It may be that an architecture that is 10 times slower as a uniprocessor would more than overcome this disadvantage when built as a multiprocessor. Nevertheless, a multiprocessor FAST-1 seems like a reasonable architecture to investigate.

I have already shown that simulations have instruction-level parallelism that can be exploited. There are several other types of parallelism that we might try to utilize as well. In the next section, I explore some of the alternative approaches that can be used to improve the performance of a FAST-1. Following this, I discuss an actual FAST-1 multiprocessor architecture. The performance of this architecture, together with the partitioning algorithms described in Chapter VI, is presented in Chapter VIII.

VII.1. Approaches to Exploiting Parallelism

One obvious approach to exploiting parallelism is to execute many instances of a problem in parallel. In simulation this may be a very feasible approach if there are many sets of independent test vectors that need to be processed. In this circumstance we can use as many FAST-1 uniprocessors as there are test vectors, with each FAST-1 processor running completely independently of the others. Of course, if each test vector is quite long, then it may still be necessary to make each individual FAST-1 processor run faster.

In designing a processor, there are several well-known methods for exploiting parallelism, ranging from instruction pipelining to multiple function units. In instruction pipelining, the execution of an instruction is divided into n stages, or phases. As one instruction progresses from phase i to phase i+1, the following instruction can begin phase i. Thus, at any given time, up to n instructions


can be in some phase of execution. The amount of parallelism achievable is limited only by the number of phases into which instruction execution can be split, and on how full the pipeline can be kept. Unfortunately, unless instructions are fairly complex, it is usually hard to build instruction pipelines with more than a few stages. In addition, data dependencies in the instructions being executed may make it difficult to keep the pipeline full.

If the number of instructions that can be executed in parallel exceeds our ability to pipeline a single function unit, then multiple function units can be used to achieve additional parallelism. In circumstances where there are several different kinds of operations, creating a single pipelined function unit may be difficult. In these cases, multiple function units provide a reasonable alternative. This is particularly true when the different operations have very different execution times.

Taking advantage of instruction pipelining in the FAST-1 requires dividing instruction execution into several phases. Figure VII-1 illustrates one such pipeline in which there are separate stages for instruction fetching, evaluation, and destination storing. Moreover, we have already seen that destination storing can be split into a two-stage pipeline: a stage for fetching addresses, followed by a stage for storing the result. For some instructions the evaluation unit can be pipelined by using a tree of function units. For example, consider the node instruction needed to implement Algorithm III-4. If we have, say, seven two-input least upper bound units, then we can build a three-level pipeline for computing the LUB of eight signal values. To improve performance even further, we might consider replicating various pieces of the pipeline, an obvious candidate being the evaluation unit. This works only if the other stages in the pipeline can continue to supply and absorb data from the replicated stages. In particular, with the exception of the evaluation stage, all stages require access to memory.
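The tree of least-upper-bound units can be made concrete. The sketch below assumes the common three-valued switch-level lattice {0, 1, X}, in which equal values join to themselves and conflicting values join to the unknown value X; the exact value set used by the FAST-1 is not restated here, so treat this as illustrative:

```python
def lub(a, b):
    """Least upper bound on the signal lattice {0, 1, 'X'}: equal
    values join to themselves; conflicting values join to X."""
    return a if a == b else 'X'

def lub_tree(values):
    """Reduce signal values pairwise, as a tree of two-input LUB units
    would: eight inputs take three levels of 4, 2, and 1 units."""
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:
            # Duplicate the odd value out; harmless since LUB is idempotent.
            level.append(level[-1])
        level = [lub(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]
```

In a pipelined realization each `while` iteration corresponds to one level of hardware units, so a new set of eight operands can enter the tree every stage time.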

[Figure VII-1 appears here: a pipeline of instruction fetch, source operand fetch, evaluation, and result store stages, with the fetch and store stages connected to memory.]

Figure VII-1: A pipelined implementation of a FAST-1 processor.

It is clear that in order for a FAST-1 processor to take advantage of pipelining

and multiple function units, there must be sufficient memory bandwidth. Using a cache is one way to increase memory bandwidth, but only in those situations where there is locality of reference. Moreover, it may still be necessary to improve the bandwidth between the cache and the processor. Another common technique for improving memory bandwidth is interleaving, in which memory is divided into n separate sections, where the data stored in location i of section j is referenced by the address i×n+j. When the data in memory location k is read or written, the data stored in addresses k+1 through k+n−1 are read or written as well. The consequence of interleaving is that n memory requests to sequential locations take only slightly longer than a single request. In a FAST-1, interleaving might be used to provide rapid access to the SOURCEOPERANDS of instructions with high fan-in.
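The interleaved address mapping is simple enough to state directly (a sketch; returning a (section, location) pair is just one way to express it):

```python
def interleave(address, n):
    """Map a flat address to (section, location) under n-way
    interleaving: location i of section j holds the data for address
    i*n + j, so n sequential addresses fall in n distinct sections."""
    return (address % n, address // n)
```

Because consecutive addresses land in different sections, the n accesses triggered by a reference to location k can proceed in parallel across the sections.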

In those instances where sequential references are not the norm, memory performance can be improved by multi-porting or partitioning. In multi-ported memories, simultaneous access is provided to the same set of addresses,1 while in partitioned memories, simultaneous access is provided to different sets of addresses. The advantage of multi-porting is that multiple requests to the same set of addresses can be processed simultaneously, while in partitioned memory, these requests must be processed sequentially. Unfortunately, multi-ported memories are expensive to implement. The advantage of partitioned memories is that they are cheaper to implement and, as long as memory references are evenly distributed and access times remain about the same, they perform about as well as multi-ported memories. In addition, using partitioned memory may be more advantageous at the systems level in that it allows memory to be physically distributed.

Partitioned memories can be used in the pipelined structure described above, by providing each stage with its own memory. Nevertheless, in order to store the results of an instruction, the destination storing stage may have to reference the same memory as the instruction fetch unit. Here, a multi-ported memory can be used so that a read and a write can be performed at the same time.

Another way to take advantage of partitioned memory is illustrated in Figure VII-2. In this configuration, there are multiple memory units that are connected to multiple function units via a processor-memory switch, such as a cross-bar. A memory that contains an instruction that needs to be evaluated simply finds an available function unit and sends the instruction to it. A function unit with data to store sends the data to the proper memory unit. If there are enough datapaths from each function unit, then multiple stores to different memories can be performed in parallel.

The structure in Figure VII-2 is most effective in situations where either the evaluation units are slow relative to memory or where the evaluation unit is

1 A higher-level authority may have to deal with simultaneous writes to the same location.

[Figure VII-2 appears here: multiple memory units connected to multiple evaluation units through a processor-memory switch.]

Figure VII-2: A multiprocessor FAST-1 with independent memories and evaluation units.

complicated and replicating it, in its entirety, is costly. If replicating the evaluation unit is not a problem and if evaluation is a relatively small part of instruction execution, then separating the instruction unit from its memory does not make much sense. In particular, note that even with multiple evaluation units, if all executable instructions are in one memory, the bandwidth to that memory is still the bottleneck, as it is in a uniprocessor.

Another alternative is to build a multiprocessor using multiple FAST-1 uniprocessors that are somehow interconnected. Such an arrangement is exactly the one I have decided to use. It is discussed in detail in the next section.

VII.2. A Multiprocessor FAST-1

Though all of the above alternatives might provide reasonable approaches to improved performance, a thorough investigation of all of them is beyond the scope of a single dissertation. In my own mind, the most straightforward and obvious approach to building a higher-performance FAST-1 system is to interconnect multiple FAST-1 uniprocessors with an interconnect that provides the necessary communication.

Even without considering how processors are physically interconnected, there are a number of immediate ramifications of building a multiprocessor out of uniprocessors. First, the only instructions a processor can execute are those that reside in its own memory. Without the aid of some external agent, or substantial modification to the processors, the assignment of instructions to processors must be essentially static. The inability to perform fine-grain dynamic load balancing may potentially limit the ultimate performance of the system, though there is no reason to assume, a priori, that a good algorithm for dynamic load balancing exists.2 On the other hand, there are many advantages of static assignment. For example, instruction addressing is simplified since instructions always reside in the same processor at the same address. Interprocessor communication is simplified, since only 'data' and not 'instructions' need to be sent between processors.

VII.2.1. Processor Architecture Assuming Static Instruction Assignment

Given a static assignment of instructions to processors, the logical structure of interprocessor communication follows naturally. Each processor is given a unique ID, and the DESTINATION address within an instruction is modified to be the tuple (PROCESSORID, INSTINDEX, OPINDEX, TAGINDEX, TAGVALUE). If the PROCESSORID in a DESTINATION is the same as the processor on which the instruction is being executed, then a local store is performed. Otherwise, a remote store must be performed in order to store the RESULT in some other processor's memory. Note that a processor needs only to be able to store into another processor's memory. There is never any need to be able to read data from another processor. There is also no need to wait for the store to finish before continuing with instruction execution. This suggests the processor structure shown in Figure VII-3.
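The local-versus-remote decision can be sketched directly from the DESTINATION tuple (the concrete representation below — a namedtuple, a dict for local memory, and a list for the output queue — is an assumption made for illustration):

```python
from collections import namedtuple

# Field names follow the text: which processor, which instruction,
# which source operand, and the execution-tag index and value.
Destination = namedtuple(
    'Destination', 'processor_id inst_index op_index tag_index tag_value')

def store_result(dest, result, this_processor, local_memory, output_queue):
    """Perform a local store directly; otherwise enqueue a remote-store
    message for the interconnect to deliver.  The processor never reads
    remote memory, so only stores cross processor boundaries."""
    if dest.processor_id == this_processor:
        local_memory[(dest.inst_index, dest.op_index)] = result
    else:
        output_queue.append((dest, result))  # delivered asynchronously
```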

When a processor does a remote store, it puts a message containing the DESTINATION and the associated data into its output queue. If its output queue is full, then the processor must wait for room, though mechanisms for avoiding waiting are discussed below. Similarly, each processor has an input queue into which remote store requests from other processors are placed. A store request is not removed from an output queue unless there is room for it in the input queue into which it is to be placed. These queues allow processors to run relatively independently of one another, at least as long as the queues are not full. Moreover, using an input queue may simplify implementing a processor in that remote stores into a processor's memory can be easily synchronized so as to not occur during the middle of instruction evaluation.

Algorithm VII-1 is a revised FAST-1 fetch/execute cycle that allows for remote stores. It is basically the same as for the uniprocessor, except the inner loop

2 Indeed, there is good reason to believe that any such algorithm would be NP-complete.

[Figure VII-3 appears here: a FAST-1 processor with its instruction memory, fetch and evaluation units, and input and output queues connected to the processor interconnect.]

Figure VII-3: A FAST-1 processor with queues designed for use in a FAST-1 multiprocessor.

now includes a check to see if there are any incoming messages. Processing incoming messages first helps to keep the queues from becoming full. In order to avoid deadlock, it is critical that while a processor is waiting to put a request into its output queue, it continue to process requests that are in its input queue.
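The deadlock-avoidance rule — keep draining the input queue while blocked on a full output queue — can be modeled with bounded queues. This is a simplified sketch: `interconnect_ready` is a hypothetical stand-in for the interconnect accepting one outgoing message, and a real processor would wait rather than poll:

```python
from collections import deque

def remote_store(message, out_q, in_q, out_capacity,
                 apply_store, interconnect_ready):
    """Enqueue a remote store, but while the output queue is full keep
    processing incoming stores, so that two processors blocked on each
    other's full queues cannot deadlock."""
    while len(out_q) >= out_capacity:
        if in_q:
            apply_store(in_q.popleft())  # make progress on incoming work
        if interconnect_ready():
            out_q.popleft()  # interconnect accepted one outgoing message
    out_q.append(message)
```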

Algorithm VII-1 also includes the basic synchronization required for a multiprocessor implementation of Algorithm III-4. The EVERYBODYDONE flag is assumed to be set and cleared by an agent that is capable of monitoring processor and interconnect activity. When it detects that there are no more instructions to execute in any processor, and that there are no outstanding messages, it sets EVERYBODYDONE, allowing each processor to exit the innermost loop of Algorithm VII-1. Depending on exactly how processors are implemented, it may also be necessary to make certain that all processors have exited the inner loop before allowing any processor to begin the next phase of the algorithm. As shown in Algorithm VII-1, this can be enforced by making a processor wait for EVERYBODYDONE to be cleared.

In practice, a processor may spend a significant amount of time waiting for

m = the number of elements in EXECUTIONTAGS
k ← 0
While there exists an instruction with a non-nil EXECUTIONTAG {
    Foreach tag value, E, in order from E0 to En {
        While not EverybodyDone {
            if there is a message, M, in the input queue
                { Dequeue M and store its data into the specified SOURCEOPERAND }
            else if there is an instruction, I, with I.EXECUTIONTAGS[k] = E {
                Fetch I
                Execute I
                I.EXECUTIONTAGS[k] ← nil
                For each RESULT, R {
                    If the new value of R ≠ the previous value of R {
                        Store the new value of R back into I
                        For each DESTINATION, D, of R {
                            if D.PROCESSORID = ThisProcessor
                                { Store result locally }
                            else {  -- remote store
                                While the output queue is full {
                                    While there is a message, M, in the input queue {
                                        Remove M from the queue and store its data
                                        into the SOURCEOPERAND specified
                                    }
                                }
                                Put the request into the output queue
                            }
                        }
                    }
                }
            }
        }
        While EverybodyDone { do nothing }
    }
    k ← (k+1) mod m
}

Algorithm VII-1: The fetch/execute cycle of a FAST-1 processor designed for use in a multiprocessor.

room in its output queue. Algorithm VII-1 can be modified to avoid this. One method is to have a special 'RemoteStore' instruction that is the only type of instruction that has remote DESTINATIONS, that is, DESTINATIONS where the PROCESSORID field differs from the ID of the processor in which the DESTINATION specifier resides. A RemoteStore instruction is executed iff there is room in the output queue. In all other respects a RemoteStore is identical to a Copy instruction. Other instructions would not write directly into the output queue. Rather, for any DESTINATION that would have been remote, there will be a corresponding RemoteStore instruction whose DESTINATION is the remote DESTINATION. Hence, at the cost of an additional instruction execution, a processor will always be able to execute non-RemoteStore instructions, even if its output queue is full. Algorithm VII-2 is the multiprocessor FAST-1 fetch/execute cycle, modified to include the RemoteStore feature.

m = the number of elements in EXECUTIONTAGS
k ← 0
While there exists an instruction with a non-nil EXECUTIONTAG {
    Foreach tag value, E, in order from E0 to En {
        While not EverybodyDone {
            if there is a message, M, in the input queue
                { Dequeue M and store its data into the specified SOURCEOPERAND }
            else if there is an instruction, I, with I.OPCODE = RemoteStore AND
                    I.EXECUTIONTAGS[k] = E AND the output queue is not full {
                Fetch I
                I.EXECUTIONTAGS[k] ← nil
                Put the request into the output queue
            }
            else if there is an instruction, I, with I.OPCODE ≠ RemoteStore AND
                    I.EXECUTIONTAGS[k] = E {
                Fetch I
                Execute I
                I.EXECUTIONTAGS[k] ← nil
                For each RESULT, R {
                    If the new value of R ≠ the previous value of R {
                        Store the new value of R back into I
                        For each DESTINATION, D, of R
                            { Store result locally }
                    }
                }
            }
        }
        While EverybodyDone { do nothing }
    }
    k ← (k+1) mod m
}

Algorithm VII-2: The fetch/execute cycle of a multiprocessor FAST-1 with RemoteStore instructions.

VII.2.2. Reorganizing Fan-Out and Broadcasting

If an instruction has high fan-out, then storing its RESULT may be the most time-consuming part of its execution. When using a multiprocessor, the impact of high fan-out can be reduced by not writing a particular value to a given remote processor more than once. This can be accomplished using Copy instructions. For example, Figure VII-4a shows an instruction that fans out to six SOURCEOPERANDS, two of which are on the same processor as the original instruction, one of which is on processor 2, and three of which are on processor 3. Therefore, four remote stores are required when this instruction needs to store a RESULT. Figure VII-4b shows a potentially better way of doing this operation. Instruction 0 on processor 1 now has four DESTINATIONS. The three DESTINATIONS that originally pointed to processor 3 have been replaced by a single DESTINATION whose target is a new Copy instruction on processor 3. Those three DESTINATIONS are now local DESTINATIONS of the Copy instruction.

Besides reducing the number of remote stores required, reorganizing fan-out

Processor 1
    0  Foo   1.1.0.t, 1.2.0.t, 2.0.0.t, 3.0.0.t, 3.1.0.t, 3.2.0.t
    1  Bar
    2  Bar

Processor 2
    0  Bar

Processor 3
    0  Bar
    1  Bar
    2  Bar

(a)

Processor 1
    0  Foo   1.1.0.t, 1.2.0.t, 2.0.0.t, 3.3.0.*
    1  Bar
    2  Bar

Processor 2
    0  Bar

Processor 3
    0  Bar
    1  Bar
    2  Bar
    3  Copy  3.0.0.t, 3.1.0.t, 3.2.0.t

(b)

Figure VII-4: Using a Copy instruction to reduce the number of remote stores. (a) The original program with four remote stores. (b) The modified program with only two remote stores.

has the added benefit that it also increases potential parallelism, since the fanning-out of values can now proceed in parallel. However, this costs an additional instruction execution. Reorganizing fan-out is not guaranteed to improve performance.
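The transformation of Figure VII-4 can be expressed as a small rewrite of an instruction's destination list: group remote destinations by processor and, where one processor receives several, route them through a single Copy instruction allocated there. The sketch below is illustrative; the destination representation and the `new_copy` allocator are assumptions:

```python
from collections import defaultdict

def reorganize_fanout(dests, home_processor, new_copy):
    """Replace several remote destinations on one processor with a
    single destination pointing at a Copy instruction there.

    dests: list of (processor_id, inst_index) destinations.
    new_copy(proc, targets): allocates a Copy instruction on `proc`
    whose local destinations are `targets`, returning its own
    (proc, inst_index) destination.
    """
    by_proc = defaultdict(list)
    for d in dests:
        by_proc[d[0]].append(d)
    out = []
    for proc, targets in by_proc.items():
        if proc == home_processor or len(targets) == 1:
            out.extend(targets)   # local stores, or a lone remote store
        else:
            # One remote store; the Copy fans out locally on `proc`.
            out.append(new_copy(proc, targets))
    return out
```

Applied to the destinations of Figure VII-4a, this yields the four-destination form of Figure VII-4b, cutting four remote stores down to two.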

Depending on the capabilities of the interconnect, fan-out performance can be further improved using broadcasting. Rather than sending a particular RESULT to each remote processor individually, it is preferable to broadcast the RESULT to all processors using a single remote store. In order to do this, the processor interconnect must be able to handle the situation where one or more of the destination processors does not have room to receive a broadcast, even though other processors may have room. Next, there is the problem of addressing. The processor that is broadcasting needs to be able to distinguish a normal DESTINATION from a broadcast DESTINATION. A processor receiving a broadcast needs a mechanism to determine whether or not it is an intended recipient of the broadcast. If it is an intended recipient, the processor must determine into which SOURCEOPERAND of which instruction the data is to be written.

One way to deal with the whole matter of addressing is to add one bit to each DESTINATION that says whether or not it is a broadcast; alternatively, a special PROCESSORID can be used to indicate this fact, as on the 3MB Ethernet [Metcalfe and Boggs, 1976], for example. The remainder of a DESTINATION is unchanged. Whenever a broadcast is performed, all processors receive the message and place it into their input queue. Since all processors receive the same DESTINATION address, they will all write the RESULT into the same SOURCEOPERAND of the same instruction in their own memory. By convention this instruction is a Copy instruction. The DESTINATIONS of this Copy instruction are the original DESTINATIONS. If the Copy instruction has no DESTINATIONS and is, in effect, a NoOp, then this processor is not logically intended to be a recipient of the broadcast.

Although this scheme is simple, it does have several drawbacks. First, each processor must have as many 'broadcast receiving' Copy instructions as there are different instructions that have broadcast DESTINATIONS. Second, even if there is only a single instruction that is the target of a store on a particular processor, an intermediate Copy instruction may be necessary; in some circumstances it may be possible to eliminate these intermediaries. Finally, extra memory activity may occur if most of the Copy instructions have no DESTINATIONS. Although unnecessary instruction evaluation can be avoided by having a special NoOp instruction that is never queued for execution, the instruction must still be accessed when performing the remote store. Potentially even more troublesome is that whenever a broadcast occurs, all processors must have room in their input queue to receive the message, instead of just those processors that actually need to receive it.

Notice that the Copy instruction described above is simply performing address translation. This function can be easily handled using a translation map, as shown in Figure VII-5. The index into this map is a BROADCASTTAG, which can be stored in the INSTINDEX field of a broadcast DESTINATION. The output of the map is a regular DESTINATION, indicating where the RESULT should actually be written in this processor, as well as a VALID bit that indicates whether or not the write should actually be performed. If the VALID bit is set, then the translated DESTINATION and the RESULT are written into the input queue as before. Using a translation map solves most of the problems mentioned above. However, if most of the entries in the map are invalid, then using an associative memory to implement the map may be desirable. Either way, not only is the implementation of a processor now more complicated, but the time it takes to do a remote store may now have to include the time it takes to look up an entry in the map.
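The translation map's behavior on the receiving side can be modeled as a table lookup (a sketch; the names follow the text, but the dict-and-tuple representation is an assumption):

```python
def receive_broadcast(broadcast_tag, result, translation_map, input_queue):
    """On receiving a broadcast, look the BROADCASTTAG up in this
    processor's translation map.  A valid entry yields a translated
    local DESTINATION, which enters the input queue like any ordinary
    incoming store; an invalid entry means this processor is not a
    logical recipient and the message is dropped."""
    valid, local_dest = translation_map[broadcast_tag]
    if valid:
        input_queue.append((local_dest, result))
```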

VII.2.3. Interconnect

Having avoided the subject so far in this chapter, it is now time to consider the issue of interconnect. In Chapter II, I discussed some of the issues related to the choice of interconnect. Although one can make theoretical arguments about

[Figure VII-5 appears here: a FAST-1 processor with its fetch and evaluation units, instruction register, destinations, instruction memory, and input and output queues, with a translation map between the processor interconnect and the input queue.]

Figure VII-5: Using maps for translating broadcasts.

the performance of a particular interconnect, unless there is a direct mapping of a problem onto processors connected in a particular way (for example, the FFT onto processors connected via a perfect shuffle), it is usually not clear how well a particular structure works without running some experiments. The interconnect structure I describe below fits into this latter category, as in the course of this research I was unable to find an 'obviously right' interconnect structure for use in a simulation machine. Ultimately, it would be useful to investigate the performance of several different interconnect structures for use in simulation machines. In order to get a feel for how other types of interconnect might perform, the experimental results reported in Chapter VIII include not only the performance of the scheme described below, but also the hypothetical situation where there is never contention for the interconnect.

VII.2.3.1. Issues

In the context of this research, the two most crucial aspects of an interconnect structure are its bandwidth and its latency. Bandwidth is important for obvious reasons. Latency is important because processors may be mutually data-dependent. Even though a particular interconnect structure may be able to pass

thousands of messages in the time it takes to execute a single instruction, this bandwidth may be useless if latency is more than a few times the instruction execution time. This same phenomenon is often noticed in computer networks that carry terminal traffic. The critical performance factor is not the network's bandwidth, but rather the latency in echoing a character after it is typed.

Another important issue is arbitration, because it affects bandwidth, latency, and the cost of implementation. At one end of the spectrum there is arbitration by collision detection, for example, as used in CSMA-CD networks such as the Ethernet [Metcalfe and Boggs, 1976]. If contention for the network is high, then both latency and bandwidth are affected. Conversely, if there is no contention, then arbitration does not consume any bandwidth and adds no latency to sending a message. At the other end of the spectrum is point-to-point interconnect, in which arbitrating for the interconnect is unnecessary. Assuming that there is eventually a single consumer of data within each processor, some form of arbitration is still necessary internal to a processor. However, this form of arbitration should be easier and faster than arbitrating for a common interconnect, as it is much more localized. Somewhere in the middle of the spectrum, there is interconnect in which arbitration does not require bandwidth but does add latency.

It goes without saying that the cost and practicality of implementation are important. The interconnect must provide the ability for every processor to communicate with every other processor. If there are more than a few bits in each message, then the difference between parallel and serial communication is important. Assuming for the moment that we can implement n serial connections for the same cost as one n-bit parallel connection, the cost of the raw bandwidth is the same. If there is no contention, the parallel implementation has lower latency. If there is significant contention, the serial implementation may have better performance, depending on how arbitration is done.

VII.2.3.2. An Interconnect Structure Using Broadcast Busses

Because broadcasting seems to be an important capability, busses seem to be an obvious choice for interconnect. In the simplest implementation, each processor is connected to a single bus, as shown in Figure VII-6.

The bus is assumed to be wide enough to pass an entire message in a single cycle. Before a message can be sent, bus ownership must be obtained by arbitration. Arbitration is assumed to take one bus cycle as well, but can be overlapped with message sending. Therefore, if a processor has n messages to send, and no other processors are contending for the bus, n+1 bus cycles are required. Arbitration is based on a rotating priority, such that, in general, the highest-priority participant during the current bus cycle becomes the lowest-priority participant during the next bus cycle. If there are only a few processors on a bus, arbitration can be accomplished using some form of daisy chaining. If there is a fairly large number of participants, a faster mechanism, such as a centralized arbiter, is needed. Decentralized arbitration is also possible. If

Figure VII-6: A FAST-1 multiprocessor constructed using a broadcast bus.

there are 2^m processors, then one such scheme is to use m bidirectional, wired-OR signals for each processor, connected as shown in Figure VII-7. If a processor wants the bus, it drives some subset of its priority wires low and senses the remainder, as indicated by its priority number. The processor that doesn't sense any of its input wires being driven low gets the bus for the next cycle.

Figure VII-7: One method for connecting the arbitration signals on a FAST-1 bus.

In this scheme rotating the priority is slightly more difficult than just incrementing a processor's current priority modulo the number of processors. Table VII-1 shows how the priorities of eight processors rotate. Each processor counts using a Gray code based on its processor number.

As discussed above, a message can be sent only if the intended recipients have room in their input queues. This acknowledge can easily be implemented using a wired-OR acknowledge signal. A bus participant drives the acknowledge wire low if it does not have room for a message. Every message is immediately acknowledged. If it isn't sent successfully, it is resent during a future cycle. Every bus participant senses the acknowledge wire. If any recipient negatively acknowledges a message, all recipients discard the message, whether or not they had room for it.
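The all-or-nothing acknowledge protocol can be sketched as follows. The functions and the queue-draining model are illustrative assumptions, not code from the dissertation; the essential behavior is that one full recipient NACKs the shared wired-OR line, every recipient then discards the message, and the sender retries in a later bus cycle.

```python
def broadcast_cycle(message, recipients, capacity):
    """One bus cycle: True if the message was accepted by all recipients."""
    # Wired-OR NACK: asserted if any recipient lacks queue room.
    nack = any(len(r) >= capacity for r in recipients)
    if nack:
        return False          # all recipients discard; sender will retry
    for r in recipients:
        r.append(message)     # every recipient accepts the message
    return True

def send_with_retry(message, recipients, capacity, max_cycles=100):
    """Resend during future cycles until the message is accepted."""
    for cycle in range(max_cycles):
        if broadcast_cycle(message, recipients, capacity):
            return cycle + 1  # bus cycles consumed by this message
        # In the real machine, processors drain their input queues
        # between cycles; modeled here by removing one queued entry.
        for r in recipients:
            if r:
                r.pop(0)
    raise RuntimeError("message never accepted")
```

Note the cost of the all-or-nothing rule: a single slow recipient forces even recipients that had room to discard and re-receive the message, trading wasted bus cycles for a trivially simple acknowledge circuit.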

Table VII-1: The rotation of priorities in decentralized parallel arbitration.
Each entry is the priority of the given processor during the given cycle.

                        Processor
Cycle      0    1    2    3    4    5    6    7
  0        0    1    2    3    4    5    6    7
  1        1    0    3    2    5    4    7    6
  2        3    2    1    0    7    6    5    4
  3        2    3    0    1    6    7    4    5
  4        6    7    4    5    2    3    0    1
  5        7    6    5    4    3    2    1    0
  6        5    4    7    6    1    0    2    3
  7        4    5    6    7    0    1    3    2
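The rotation in Table VII-1 is consistent with each processor XOR-ing its processor number with a Gray-coded cycle count; the sketch below is my reconstruction of that rule ('each processor counts using a Gray code'), not code from the dissertation, and the scan of the table's last rows may not match it exactly.

```python
def gray(n: int) -> int:
    """Binary-reflected Gray code."""
    return n ^ (n >> 1)

def priority(processor: int, cycle: int, m: int = 3) -> int:
    """Priority of `processor` during `cycle`, for 2**m processors."""
    return processor ^ gray(cycle % (1 << m))

# Two properties make the scheme work: during any cycle the priorities
# form a permutation of 0..2**m - 1 (so exactly one processor wins
# arbitration), and each processor's priority changes by a single bit
# from one cycle to the next, so the 'counting' is a one-gate update
# rather than a full modular increment.
```

The single-bit-change property is exactly why an ordinary increment will not do: incrementing can flip many priority wires at once, while the Gray-coded rotation keeps the wired-OR priority network glitch-free between cycles.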

If the bandwidth that can be provided using a single bus is insufficient, multiple busses can be used. In this situation each processor has multiple ports that connect it to some, or all, of the busses. Depending on the complexity of the implementation, deciding which port to use can be done statically or dynamically. In the experiments reported in Chapter VIII, I have assumed that this decision is made statically, by having a PORT field in each DESTINATION. Furthermore, it makes sense to have a separate output queue for each port and, if RemoteStore instructions are being used, to generalize Algorithm VII-2 to distinguish among multiple output queues.

In a simple multiple-bus configuration each processor is connected to each bus. In more complex configurations, such as the one shown in Figure VII-8, processors are connected to only a subset of the busses. Using the same number of bus interfaces, these configurations provide greater aggregate bandwidth at the cost of increased average minimum latency. Furthermore, some form of message forwarding is required.

Message forwarding can be accomplished in several ways. The simplest is to have a Copy or RemoteStore instruction that receives a message from one port and copies it, having translated the address, to the output queue associated with some other port. Just as a translation map is used to reduce the overhead for broadcasts, a similar map can be used for message forwarding. Because instructions are statically assigned to processors, the contents of these maps can remain fixed. If the interconnect structure is constrained such that the PROCESSOR ID in a destination can be used to determine the port over which a message should be forwarded, then a translation map is not needed. N-cubes and trees are two interconnect structures that exhibit this property.
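For an n-cube the port really can be computed from the PROCESSOR ID alone. The sketch below assumes the common convention that port i links processors whose addresses differ in bit i, and that the lowest-order differing bit is corrected first; the dissertation notes only that n-cubes have the property, not this particular routing rule.

```python
from typing import Optional

def forward_port(here: int, dest: int) -> Optional[int]:
    """Port over which to forward a message, or None if it has arrived."""
    diff = here ^ dest
    if diff == 0:
        return None
    low = diff & -diff            # lowest-order differing address bit
    return low.bit_length() - 1   # port index = bit position

def route(src: int, dest: int):
    """The sequence of processors a message visits."""
    path, here = [src], src
    while (p := forward_port(here, dest)) is not None:
        here ^= 1 << p            # cross to the neighbor on port p
        path.append(here)
    return path
```

Because the port is a pure function of the two processor IDs, the forwarding decision costs one XOR and a find-first-set, and no per-processor map storage at all.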

If arbitrary forwarding is allowed, care must be exercised to avoid deadlock. Using Copy or RemoteStore instructions to implement forwarding prevents deadlock, since once a message is received it is always possible to remove it

Figure VII-8: A more complex bus interconnect structure using two ports per processor.

from the queue. On the other hand, if messages can be received from one port and copied directly into the output queue of another port, deadlock can occur if the queues are full. Deadlock avoidance techniques developed for general store-and-forward networks are applicable [Tanenbaum, 1981; Merlin and Schweitzer, 1980]. Starvation is not a problem in simulation, as there are only a finite number of messages that can be generated during a given phase.

VII.2.4. Multi-level Simulation

A multiprocessor FAST-1 provides an ideal way to implement a multi-level simulator. For example, suppose we are simulating a microprocessor that needs large amounts of main memory. This memory can be implemented at the transistor level, but this is inefficient. A better way to accomplish the same thing is to implement a special-purpose processor that has a large block of memory and a data-driven interface. This processor implements a special instruction called 'Memory.' The processor's architecture is identical to that of a normal FAST-1

processor, with a block of memory attached to the evaluation unit. Effectively, this memory is the evaluation unit. The Memory opcode has SOURCE OPERANDS for each bit of the address, each bit of write data, read/write, and, if needed, bits such as RAS and CAS. Data read from the memory is sent to the destinations specified in the particular instance of the Memory instruction. If, in the system being simulated, there are several different places where access to large amounts of memory is required, each of these places has its own Memory instruction to which to send its requests, and from which to receive results. Executing a Memory instruction is straightforward, as all of the information needed to perform the memory operation is contained in the instruction's SOURCE OPERANDS.
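The evaluation of a Memory instruction can be sketched as follows. The class and operand names (a0, d0, "write", and so on) are hypothetical; the dissertation specifies only that the SOURCE OPERANDS carry the address bits, write-data bits, and a read/write control, and that read data flows to the instruction's destinations.

```python
class MemoryInstruction:
    """Sketch of the special 'Memory' opcode: a block of memory that
    behaves as the evaluation unit of an otherwise normal processor."""

    def __init__(self, n_addr_bits: int, n_data_bits: int, destinations):
        self.store = {}                  # sparse simulated memory
        self.n_addr = n_addr_bits
        self.n_data = n_data_bits
        self.destinations = destinations # one DESTINATION per data bit

    def evaluate(self, operands: dict):
        # All information needed is in the SOURCE OPERANDS, so
        # evaluation is a single local step.
        addr = sum(operands[f"a{i}"] << i for i in range(self.n_addr))
        if operands["write"]:
            self.store[addr] = sum(operands[f"d{i}"] << i
                                   for i in range(self.n_data))
            return []                    # a write produces no messages
        value = self.store.get(addr, 0)
        # Send each bit of the read data to the destination specified
        # for it in this instance of the Memory instruction.
        return [(dest, (value >> i) & 1)
                for i, dest in enumerate(self.destinations)]
```

This is why the scheme is efficient: one instruction execution replaces the thousands of transistor-level evaluations a simulated RAM array would otherwise require.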

Other high-level functions can be implemented in a similar fashion. Moreover, it makes a tremendous amount of sense to interface to the bus a general-purpose processor that is executing a software implementation of the FAST-1 simulator. In this way, a user can write software to implement any OPCODE desired. In particular, there should be such an interface to the host computer.

VIII Multiprocessor Experiments

Using the software implementation of the FAST-1 simulator, it is possible to evaluate the multiprocessor partitioning algorithms and the multiprocessor architectures described previously. By experimenting with several different architectural structures, we can determine how well the partitioning algorithms work, in and of themselves. However, unless we have at our disposal an optimal partitioning algorithm, the converse is not true. That is, suppose that for all partitioning algorithms measured, architecture configuration A always has better performance than configuration B. Though this might provide evidence that configuration A is better than configuration B, it cannot be concluded for certain. For reasons explained in Chapter VI, using optimal partitioning algorithms is infeasible, and so, of necessity, the experiments discussed below answer only some of the questions we might have.

In the following sections, I describe the experiments I performed. Relative to the potential speedups discussed earlier, the results presented in this chapter are quite encouraging. For example, when simulating the chess chip using 64 processors, broadcasting, contention-free interconnect, and random partitioning, a 28-fold speedup over a uniprocessor is obtained. When interconnect contention is modeled, a 13-fold speedup over a uniprocessor is obtained using 64 processors, 3 parallel broadcast busses, and fan-out partitioning.

VIII.1. An Outline of the Experiments

The bulk of the experiments described in this chapter are based on two of the partitioning algorithms and two basic interconnect configurations. The two algorithms used are random partitioning and fan-out partitioning. Random partitioning is used because it partitions circuits very quickly and because it provides a lower bound against which other partitioning algorithms can be compared. The reason for using fan-out partitioning is three-fold. First, it is an algorithm designed to take advantage of broadcasting. Second, it is one of a class of algorithms that try, to some extent, to take advantage of static locality in a circuit. Finally, in preliminary measurements I conducted, fan-out partitioning seemed to perform better than any of the other algorithms I examined. This is not to say that there aren't better partitioning algorithms with the same running time, but rather that I haven't found one.


The difference between the two basic interconnect configurations is that in one of them the busses are assumed to have infinite bandwidth, while in the other, a bus is assumed to be able to send a message and arbitrate in the same amount of time a processor takes to do a RMW cycle. In other words, in the first configuration there is never any contention for the interconnect, while in the second, there is contention. However, in the contention-free configuration there is still one cycle of latency between the time when a message is sent and when it arrives at its destination. Finally, broadcasting is used in both interconnect configurations, except as noted below.

Both interconnect configurations are measured using 4, 16, and 64 processors. In the contention-free case the number of busses interconnecting the processors is irrelevant. For the other experiments, 2 busses are used with the 4-processor configuration, while 3 busses are used with both the 16- and the 64-processor configurations.

All of the circuits simulated in Chapter V are simulated using the algorithms and architectural configurations described above. In addition, in order to understand the impact of broadcasting, I ran a limited set of experiments using the chess chip and the SSC filter chip in which broadcasting is not used.

VIII.2. Speedup

Of greatest interest, of course, is how much faster simulations run when using a multiprocessor than when using a uniprocessor. While the potential speedup data gives us some idea of what to expect, it is overoptimistic in that it assumes optimal partitioning and no communication overhead. Moreover, it assumes that enough processors are available to execute all executable instructions in parallel. For example, measurements of the chess chip indicate that, ignoring initialization, at one point 3822 instructions are executable in parallel; thus 3822 processors are needed to execute these instructions in parallel. Besides being hard to simulate, building a 3822-processor FAST-1 seems somewhat unrealistic. I simulated configurations with up to 64 processors, as I feel that such configurations are feasible. The parallelism profiles also suggest that using 64 processors is reasonable, in that processor utilization will be high.

Tables VIII-1, VIII-2, VIII-3, and VIII-4 present the essential speedup results from the multiprocessor experiments. The uniprocessor measurements used for comparison are those from the base configuration presented in Chapter V. Except for the bus interfaces, each processor in the multiprocessor is identical to the base-case uniprocessor. Time is measured in units of RMW cycles. Figure VIII-1 presents some of this data graphically.

Several things are immediately noticeable from this data. First, when there is no contention for the interconnect, random partitioning works quite well, in fact, even somewhat better than fan-out partitioning. Table VIII-5 highlights

Table VIII-1: Simulation time using random partitioning and contention-free
interconnect. The first three columns give simulation time (in RMW cycles)
on 4, 16, and 64 processors; the last three give the speedup, that is, the
uniprocessor time divided by the n-processor time.

Circuit       4 procs    16 procs    64 procs     Uni/4    Uni/16    Uni/64
adder4        1842966    1356869     1211676      1.820     2.560     2.866
adder8        1909785    1211675     1048667      2.057     3.243     3.747
adder16       1940890    1067093      855835      2.243     4.079     5.086
cadd24        1038628     482043      331960      2.645     5.700     8.277
chess         4060912    1265825      482682      3.340    10.714    28.098
mul8x8        1177897     505593      323921      2.771     6.456    10.076
ram4x4        1893347    1468814     1295719      1.867     2.406     2.728
ram8x8        1908806    1046430      815982      2.271     4.142     5.311
ram16x16      2037284     737985      514636      2.852     7.873    11.290
ssc           6195241    1904547      795407      3.460    11.256    26.953
stk4x4        1137667     685965      577644      2.337     8.026    14.209
stk8x8         898761     331235      187100      3.047     8.267    14.636
stk16x16       812356     242852       89711      3.446    11.528    31.207

Mean                                              2.627     6.635    12.177
Variance                                          0.336    10.643   101.842
Std. Dev.                                         0.580     3.262    10.092
95% Conf.                                         0.315     1.773     5.486
Dist.                                            normal    normal    normal

Table VIII-2: Simulation time using random partitioning and interconnect
with contention (columns as in Table VIII-1).

Circuit       4 procs    16 procs    64 procs     Uni/4    Uni/16    Uni/64
adder4        2019242    1899962     1728563      1.720     1.828     2.009
adder8        2029276    1699789     1557238      1.936     2.312     2.523
adder16       2036607    1534663     1362425      2.137     2.836     3.195
cadd24        1078554     783706      654989      2.548     3.506     4.195
chess         4130621    2274253     1599063      3.283     5.963     8.481
mul8x8        1211743     845367      699730      2.694     3.861     4.665
ram4x4        2090266    1923000     1757998      1.691     1.838     2.010
ram8x8        2023271    1385671     1175984      2.142     3.128     3.685
ram16x16      2094173    1005343      791826      2.775     5.779     7.338
ssc           6233181    3806819     2590742      3.439     5.632     8.275
stk4x4        1239620    1075180      918869      2.145     2.473     2.893
stk8x8         928641     605886      475272      2.949     4.520     5.762
stk16x16       837889     492301      351729      3.341     5.687     7.960

Mean                                              2.523     3.797     4.845
Variance                                          0.373     2.435     5.962
Std. Dev.                                         0.611     1.460     2.442
95% Conf.                                         0.332     0.848     1.327
Dist.                                            normal    normal    normal

this observation. When 64 processors are used, random partitioning provides, on average, 18% greater speedup than fan-out partitioning. Thus, when communication bandwidth is unlimited, evenly distributing activity is important, even though there may be higher latency than when communication is local.

On the other hand, as the with-contention columns of Table VIII-5 show, when communication bandwidth is limited, such that only a few messages may be sent per processor cycle, fan-out partitioning works significantly better than random partitioning. Indeed, when simulating the chess chip with 64 processors, the use of fan-out partitioning results in 50% better performance than when random partitioning is used. As I mentioned in Chapter VI, a good partitioning algorithm must walk the fine line between allowing too little and too much communication. Finally, it is interesting that for ram16x16, random

Table VIII-3: Simulation time using fan-out partitioning and contention-free
interconnect (columns as in Table VIII-1).

Circuit       4 procs    16 procs    64 procs     Uni/4    Uni/16    Uni/64
adder4        1927249    1479473     1421989      1.802     2.348     2.442
adder8        1919952    1391754     1287462      2.046     2.823     3.052
adder16       1830272    1200492     1048849      2.378     3.626     4.150
cadd24        1082025     535633      380282      2.539     5.130     7.225
chess         4090032    1400023      561363      3.316     9.687    24.160
mul8x8        1162656     518770      325829      2.807     6.292    10.017
ram4x4        2090763    1684475     1269054      1.690     2.098     2.785
ram8x8        1828304    1254317      851905      2.371     3.455     5.087
ram16x16      2267179    1101270      793180      2.563     5.276     7.325
ssc           6306554    2151129      930089      3.399     9.966    23.050
stk4x4        1194602     799473      686407      2.225     3.325     3.873
stk8x8         955068     415300      251609      2.867     6.594    10.884
stk16x16       852417     273974      120121      3.284    10.218    23.306

Mean                                              2.676     5.970    11.078
Variance                                          0.274     8.332    69.866
Std. Dev.                                         0.524     2.886     8.359
95% Conf.                                         0.310     1.706     4.940
Dist.                                            normal    normal    normal

Table VIII-4: Simulation time using fan-out partitioning and interconnect
with contention (columns as in Table VIII-1).

Circuit       4 procs    16 procs    64 procs     Uni/4    Uni/16    Uni/64
adder4        2033722    1635876     1579741      1.708     2.123     2.199
adder8        2003287    1541517     1449732      1.961     2.549     2.710
adder16       1891290    1334125     1199681      2.302     3.263     3.628
cadd24        1095292     596258      479651      2.509     4.608     5.728
chess         4114644    1579877     1050702      3.296     8.584    12.908
mul8x8        1190955     754903      635575      2.741     4.324     5.135
ram4x4        2201833    1828748     1441797      1.605     1.933     2.451
ram8x8        1873717    1353946      964976      2.313     3.201     4.491
ram16x16      2284566    1168605      879582      2.543     4.972     6.606
ssc           6336798    2669469     1836770      3.383     8.031    11.672
stk4x4        1271215     910182      819559      2.091     2.921     3.244
stk8x8         969857     452441      345343      2.824     6.053     7.930
stk16x16       856843     306881      219726      3.267     9.123    12.741

Mean                                              2.625     5.183     6.958
Variance                                          0.304     6.037    14.825
Std. Dev.                                         0.551     2.457     3.850
95% Conf.                                         0.326     1.452     2.275
Dist.                                            normal    normal    normal

partitioning works better than fan-out partitioning, even when interconnect contention is modeled. The structure of a ram is such that the transistors connected to each bit line of the ram form a relatively large, fairly shallow transistor group. When fan-out partitioning is used, each of these groups is allocated to a single processor, thereby limiting the parallelism. It seems that some form of min-cut algorithm that splits the group in half by cutting the bit line would improve the situation. Nevertheless, random partitioning does fairly well without any help at all.

[Figure VIII-1 plots speedup versus the number of processors (0 to 70) for the chess and SSC chips under three conditions (contention-free interconnect, random partitioning with contention, and fan-out partitioning with contention), together with the linear-speedup line.]
Figure VIII-1: Speedup using a multiprocessor. Random partitioning is used for the 'contention-free' plots.

Table VIII-5: A comparison between random partitioning and fan-out
partitioning. Each ratio is the simulation time when random partitioning
is used divided by the simulation time when fan-out partitioning is used.

               Without Contention             With Contention
Circuit      4 procs 16 procs 64 procs     4 procs 16 procs 64 procs
adder4        0.956    0.917    0.852       0.993    1.161    1.094
adder8        1.060    0.889    0.816       1.013    1.103    1.074
adder16       1.060    0.889    0.816       1.077    1.150    1.136
cadd24        0.960    0.900    0.873       0.985    1.314    1.366
chess         0.993    0.904    0.860       1.004    1.440    1.522
mul8x8        1.013    0.975    0.994       1.017    1.120    1.101
ram4x4        0.906    0.872    1.021       0.949    1.052    1.219
ram8x8        1.044    0.834    0.958       1.080    1.023    1.219
ram16x16      0.899    0.670    0.649       0.917    0.860    0.900
ssc           0.982    0.885    0.855       0.984    1.426    1.410
stk4x4        0.952    0.723    0.841       0.975    1.181    1.121
stk8x8        0.941    0.798    0.744       0.958    1.339    1.376
stk16x16      0.953    0.886    0.747       0.978    1.604    1.601

Mean          0.973    0.849    0.851       0.993    1.128    1.127
Variance      0.003    0.008    0.013       0.002    0.047    0.042
Std. Dev.     0.052    0.088    0.112       0.050    0.218    0.206
95% Conf.     0.031    0.052    0.066       0.030    0.129    0.122
Dist.        normal   normal   normal      normal   normal   normal

VIII.3. Message Traffic

The data from the previous section makes it clear that communication bandwidth is critically important to performance. The data in this section explores this issue from the standpoint of message traffic.

Tables VIII-6 and VIII-7 show the number of messages sent while simulating circuits partitioned by the two different algorithms. It is clear that using fan-out partitioning significantly reduces message traffic. On average, 50% fewer messages are sent when fan-out partitioning is used than when random partitioning is used.

Table VIII-6: Message traffic using random partitioning. Interconnect with
contention is used. 'Messages per Destination Store' is the fraction of
destination stores that generate messages.

                      Messages                Messages per Destination Store
Circuit      4 procs  16 procs  64 procs       4 procs  16 procs  64 procs
adder4        707753   1036506   1079888        0.430     0.837     0.950
adder8        914812   1175068   1254421        0.482     0.797     0.952
adder16      1093598   1327055   1393296        0.505     0.805     0.912
cadd24        780224    950170    983759        0.555     0.807     0.932
chess        3123027   3717866   3855561        0.463     0.649     0.839
mul8x8        889016   1052089   1097432        0.578     0.834     0.952
ram4x4        699017    826292    854601        0.460     0.735     0.887
ram8x8        728460    832871    864730        0.305     0.536     0.825
ram16x16      712036    850467    883672        0.194     0.673     0.902
ssc          4913488   5966166   6181239        0.431     0.527     0.549
stk4x4        675880    804853    832346        0.555     0.816     0.949
stk8x8        705850    823583    851256        0.553     0.736     0.873
stk16x16      681226    831340    861165        0.546     0.753     0.800

Mean                                            0.466     0.731     0.871
Variance                                        0.012     0.011     0.012
Std. Dev.                                       0.111     0.106     0.110
95% Conf.                                       0.060     0.058     0.060
Dist.                                          normal    normal    normal

Table VIII-7: Message traffic using fan-out partitioning. Interconnect with
contention is used. 'Messages per Destination Store' is the fraction of
destination stores that generate messages.

                      Messages                Messages per Destination Store
Circuit      4 procs  16 procs  64 procs       4 procs  16 procs  64 procs
adder4        560998    585386    591915        0.341     0.380     0.388
adder8        659794    682817    686659        0.345     0.379     0.387
adder16       748779    771560    776444        0.346     0.379     0.386
cadd24        413980    472753    487074        0.283     0.362     0.397
chess        1953466   2215333   2284684        0.282     0.336     0.382
mul8x8        745176    895736    972797        0.489     0.707     0.846
ram4x4        235198    267217    281091        0.143     0.174     0.210
ram8x8        214701    249526    259892        0.097     0.120     0.825
ram16x16      135722    214690    247724        0.039     0.064     0.076
ssc          2876328   3250736   3558545        0.252     0.286     0.314
stk4x4        410163    487214    511731        0.325     0.399     0.429
stk8x8        416816    476369    479362        0.300     0.349     0.364
stk16x16      402496    448658    461829        0.281     0.321     0.332

Mean                                            0.258     0.318     0.415
Variance                                        0.016     0.029     0.053
Std. Dev.                                       0.125     0.170     0.231
95% Conf.                                       0.074     0.101     0.136
Dist.                                          normal    normal    normal

However, in and of itself, message traffic is not bad. Rather, what is really of interest is the latency incurred in sending a message. As the busses get busy, the number of times a processor must arbitrate before getting the bus increases, thereby increasing the latency between the request to send a message and its actually being sent. Tables VIII-8 and VIII-9 show the total number of arbitration cycles and the average number of arbitration cycles per message performed in simulating the circuits. Here we see that the average latency of sending a message is indeed less when using fan-out partitioning than when using random partitioning. Note that it is not a factor of two less, as is the total number of messages sent. It appears, therefore, that small changes in latency can significantly affect performance. Overall, however, it seems that high average latency is tolerable when there is significant parallelism, for reasons analogous to those found in pipelined systems. That is, as long as data dependencies do not cause most processors to become idle, having long queues of messages to send does not affect performance except at the beginning and end of a phase of the simulation algorithm. So it is reasonable to expect simulations that have phases with long execution times to still perform well even though there is high bus latency.

I conclude from this data that a bus structure such as the one shown in Figure VII-8 in Chapter VII, in which the interconnect forms a two-dimensional grid and processors connect to two busses, may be a good one. Except for having to implement message forwarding, the cost of a processor remains about the same or may even be less, as only two ports are needed, rather than the three ports used here. Hence, for approximately the same cost, bus bandwidth is vastly increased, resulting in decreased average latency. Partitioning algorithms might be made to take advantage of the hierarchy imposed by the interconnect. For example, using run-time statistics, a partitioning algorithm could place instructions that communicate frequently onto processors that are directly connected.

VIII.4. The Impact of Broadcasting

In earlier chapters, I conjectured that the ability to broadcast a message is particularly useful in exploiting parallelism. In this section, I explore whether or not this conjecture is true. Because broadcasting is most important when simulating circuits with high fan-out nodes, I did not run these experiments on all of the circuits. Rather, I used only the chess chip and the SSC filter chip in making these measurements.[1] As discussed in Chapter VII, there are two alternatives when broadcasting is not used. The first is to have each instruction have its normal list of destinations. The second is for each instruction to have one DESTINATION for each processor to which it must send a message. Within each

[1] The truth of the matter is that running the experiments on the complete set of circuits would have used several weeks of CPU time, and I thought I'd spare my friends the grief.

Table VIII-8: Number of arbitration cycles performed when using random
partitioning. 'Arbitration Cycles per Message' is the average number of
arbitration cycles required to send a message. Recall that arbitration is
overlapped with message sending, so the average latency equals the average
number of arbitration cycles plus one.

                    Arbitration Cycles          Arbitration Cycles per Message
Circuit      4 procs  16 procs   64 procs        4 procs  16 procs  64 procs
adder4        852664   2666924    2735180         1.177     2.573     2.533
adder8       1096845   3750614    4278870         1.199     3.192     3.411
adder16      1396692   5334571    6755250         1.277     4.020     4.848
cadd24       1048427   5541771   10628287         1.344     5.832    10.804
chess        3886072  26526330  102160374         1.244     7.135    26.497
mul8x8       1203560   5848214    8461821         1.354     5.559     7.711
ram4x4        985974   1642768    1746520         1.411     1.988     2.044
ram8x8       1130419   2566805    2924301         1.552     3.082     3.382
ram16x16     1086350   4246179    5213031         1.526     4.993     5.899
ssc          6857836  45294555  183231347         1.396     7.592    29.643
stk4x4        924680   3223212    3016053         1.368     4.005     3.624
stk8x8        916923   5434270   10561157         1.299     6.598    12.407
stk16x16      873956   6083559   21935442         1.283     7.318    25.472

Mean                                              1.341     4.914    10.637
Variance                                          0.013     3.664    99.506
Std. Dev.                                         0.113     1.914     9.975
95% Conf.                                         0.061     1.041     5.423
Dist.                                            normal    normal    normal

Table VIII-9: Number of arbitration cycles performed when using fan-out
partitioning. 'Arbitration Cycles per Message' is the average number of
arbitration cycles required to send a message. Recall that arbitration is
overlapped with message sending, so the average latency equals the average
number of arbitration cycles plus one.

                    Arbitration Cycles          Arbitration Cycles per Message
Circuit      4 procs  16 procs   64 procs        4 procs  16 procs  64 procs
adder4        649728   1087872    1078429         1.158     1.858     1.822
adder8        811047   1552454    1618257         1.229     2.274     2.357
adder16       924708   2203331    2429956         1.235     2.856     3.130
cadd24        453488   2084127    3800095         1.095     4.408     7.802
chess        2344688  14004314   55941869         1.200     6.322    24.486
mul8x8        945663   4603036    6794742         1.269     5.139     6.985
ram4x4        304297    442230     440995         1.294     1.655     1.569
ram8x8        311824    597493     624811         1.452     2.395     2.404
ram16x16      155018    748651    1074667         1.142     3.487     4.338
ssc          3094939  23369160   97217514         1.076     7.189    27.319
stk4x4        537899   1156297    1371179         1.311     2.373     2.679
stk8x8        489512   2342332    3784305         1.174     4.917     7.894
stk16x16      473259   2953514    9151962         1.176     6.583    19.817

Mean                                              1.220     4.302     9.857
Variance                                          0.012     3.583    88.593
Std. Dev.                                         0.108     1.893     9.412
95% Conf.                                         0.064     1.119     5.562
Dist.                                            normal    normal    normal

processor that receives one of these messages, there is a Copy instruction that fans out the RESULT to the proper instructions. The intended virtue of the second alternative is to reduce message traffic on the bus. Tables VIII-10 and VIII-11 present the results. Clearly, using Copy instructions is a good idea. Nevertheless, broadcasting still provides a substantial benefit.

Table VIII-10: Simulation time without broadcasting. All measurements were
done using 64 processors and interconnect with contention. Each ratio is
the time with broadcasting divided by the time without broadcasting.

Without Copy Instructions
             Random Partitioning             Fan-out Partitioning
Circuit      Time       With/Without Bcast   Time       With/Without Bcast
chess        2563139    0.624                1617159    0.650
ssc          7004382    0.370                6409338    0.287

With Copy Instructions
             Random Partitioning             Fan-out Partitioning
Circuit      Time       With/Without Bcast   Time       With/Without Bcast
chess        2383078    0.671                1435650    0.731
ssc          2676233    0.968                2031774    0.904

Table VIII-11: Message traffic without broadcasting. All measurements were
done using 64 processors and interconnect without contention. Each ratio is
the number of messages with broadcasting divided by the number without
broadcasting.

Without Copy Instructions
             Random Partitioning             Fan-out Partitioning
Circuit      Messages   With/Without Bcast   Messages   With/Without Bcast
chess         5519121   0.699                 3762893   0.607
ssc          10612184   0.582                 7898948   0.450

With Copy Instructions
             Random Partitioning             Fan-out Partitioning
Circuit      Messages   With/Without Bcast   Messages   With/Without Bcast
chess         6203796   0.621                 3369219   0.678
ssc           6168801   1.002                 3472214   1.025

IX Conclusions

The primary goal of this research was to use a very simple data-driven architec- ture as the basis for a switch-level simulation machine that is orders of mag- nitude faster than similar simulators implemented on general-purpose com- puter architectures. In this dissertation I described a switch-level simulation algorithm and an architecture for implementing the algorithm that met these goals. As with other algorithms and architectures, many optimizations are pos- sible, and a variety of such optimizations were discussed. Finally, I described the results of an extensive set of experiments that were used to ascertain that the desired performance had been achieved.

The algorithm and architecture were devised jointly; they are two perspectives on a single design. Thus one way to think of the FAST-1 is as a direct execution machine for programs that happen to be circuits. Nevertheless it is important to understand that both the FAST-1 simulation algorithm and the FAST-1 architecture are useful in and of themselves. As I have shown, it is reasonable to run the simulation algorithm on a general-purpose computer. Similarly, it is reasonable to use the general FAST-1 architecture in implementing a special-purpose machine for executing other switch-level simulation algorithms, for example, MOSSIM II. The FAST-1 architecture is useful for other tasks as well, some of which are described later in this chapter.

IX.1. Contributions

While we may not yet understand enough about data-driven computation to successfully build general-purpose data-driven computers, this dissertation has demonstrated that data-driven computers can be very effective for some applications. Moreover, the claim that data-driven machines allow parallelism to be easily exploited is indeed true--at least for switch-level simulation.

This research has presented a number of interesting results on exploiting paral- lelism in switch-level simulation. For the circuits measured, I found that only a few percent of the transistors or nodes in a circuit can be usefully evaluated in parallel. Nevertheless, I demonstrated that it is possible to take advantage of much of this parallelism using a simple multiprocessor architecture, together with simple and fast partitioning algorithms. That is to say, what parallelism


exists can be reasonably exploited, but unfortunately there is not much paral- lelism to exploit. Of course since these results are experimental in nature, we must be careful not to extrapolate too far.

In the context of switch-level simulation algorithms, I have demonstrated that there exists a good switch-level simulation algorithm whose performance when implemented in software is competitive with that of other switch-level simulation algorithms. Of greater importance is that the algorithm has an efficient hardware implementation that can be used to achieve up to four or five orders of magnitude better performance than a software implementation of a similar simulator.

Finally, I think one of the most useful contributions of this dissertation is to provide other researchers with thorough experimental data that can be used in the evaluation of other simulation systems.

IX.2. Other Applications

There are many other problems that can be solved using event-driven relaxation. One such problem is maze routing. In the typical maze routing problem, pairs of points in a two-dimensional grid need to be connected with the constraint that connections do not cross. There may also be obstructions that must be avoided. A FAST-1 processor for solving this problem would have an instruction for each point in the grid. Each instruction would have four SOURCEOPERANDS and a single RESULT that has four DESTINATIONS corresponding to the point's neighbors in the grid. Two non-nil EXECUTIONTAG values are needed, such that when a RESULT is stored in an instruction, the instruction's EXECUTIONTAG is set to the value that is not equal to the current tag value. As the grid is loaded into memory, all SOURCEOPERANDS and RESULTS are set to 0. When an instruction executes it sets its RESULT to the index of any one of its SOURCEOPERANDS that is non-zero. To route two points, we simply mark the origin point as executable, and make the destination point a special instruction that causes the processor to halt when it is executed. Once the processor has halted, we trace back from the destination point using the RESULT values. A multiprocessor implementation of the algorithm follows directly, though, as with simulation, partitioning is important to performance.

Another useful and important application of the FAST-1 is for implementing Conway's game of Life, described many years ago in Scientific American [Gardner, 1970]. For this "application," each instruction has eight one-bit SOURCEOPERANDS corresponding to its eight neighbors, and a single RESULT with eight DESTINATIONS. A two-phase algorithm is used. During the first phase, any instruction that is marked for evaluation determines its new state using its SOURCEOPERANDS. If this new value is different from its previous value, it saves the new value in its RESULT field and marks itself for evaluation during the second phase. During the second phase, any instruction that is

evaluated ignores its SOURCEOPERANDS, and simply propagates its RESULT to its destinations. By using two phases, the SOURCEOPERANDS are not allowed to change during phase 1, as required by the rules of Life. This is analogous to implementing unit-delay in Algorithm III-4.
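As a concrete illustration, here is a small software sketch of the two-phase, event-driven update just described. It is hypothetical code of my own, not the FAST-1 implementation; the `marked` set stands in for instructions whose EXECUTIONTAG says they should run, and the board is a small torus to keep the example self-contained.

```python
# Hypothetical sketch of the two-phase, event-driven Life update. `state`
# maps (row, col) -> 0/1 on a size x size torus; `marked` is the set of
# cells scheduled for evaluation this generation.

def neighbors(cell, size):
    r, c = cell
    return [((r + dr) % size, (c + dc) % size)
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0)]

def step(state, marked, size):
    # Phase 1: every marked cell computes its next value from the
    # *current* SOURCEOPERANDS; cells that change save the new value.
    changed = {}
    for cell in marked:
        live = sum(state[n] for n in neighbors(cell, size))
        new = 1 if live == 3 or (state[cell] == 1 and live == 2) else 0
        if new != state[cell]:
            changed[cell] = new
    # Phase 2: changed cells propagate their RESULT to their destinations,
    # marking themselves and their neighbors for the next generation.
    marked_next = set()
    for cell, new in changed.items():
        state[cell] = new
        marked_next.add(cell)
        marked_next.update(neighbors(cell, size))
    return marked_next
```

Because no cell's stored state changes until phase 2, every phase-1 evaluation sees the previous generation, which is exactly the property the two-phase hardware scheme provides.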

While in both applications the FAST-1 may be effective from a computational point of view, there is the drawback that it may not be very space efficient. This is because DESTINATIONS allow arbitrary addressing. Since addressing in these instances is very regular, we might consider removing the addresses from the instructions and, instead, calculating them on the fly, based on the current instruction's address.
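For a regular grid the on-the-fly address calculation is trivial. A sketch of the idea, with illustrative names and a row-major layout assumed:

```python
# Illustrative sketch (not from the dissertation): for a width x height
# grid stored one instruction per cell in row-major order, the four
# DESTINATION addresses need not be stored in the instruction -- they
# follow directly from the instruction's own address.

def grid_destinations(addr, width, height):
    """Return the addresses of the four grid neighbors of instruction `addr`."""
    r, c = divmod(addr, width)
    dests = []
    for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
        if 0 <= nr < height and 0 <= nc < width:
            dests.append(nr * width + nc)
    return dests
```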

IX.3. Future Work

Besides the construction of a hardware implementation of the machine, an obvious candidate for further work is partitioning. One of the limiting factors in my own investigation was the vast amount of CPU time required to simulate a multiprocessor FAST-1. It is my feeling that further work in this area requires, at minimum, a general-purpose multiprocessor, with each processor simulating a FAST-1 processor, or preferably, an actual FAST-1 multiprocessor. Among other things, an actual FAST-1 multiprocessor would provide a very good objective function for partitioning using simulated annealing. One can imagine partitioning a circuit onto the real machine, simulating it for a while using a small set of test vectors, moving some subset of the instructions, and then running the simulation again in order to determine if things had improved. The efficacy of this method depends on being able to find a reasonably small set of test vectors that provides meaningful enough information for optimizing the partitioning.
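The partitioning loop suggested here can be sketched as a standard simulated-annealing search. Since no real FAST-1 exists to serve as the objective function, the sketch below substitutes a crude software cost: the number of signal edges cut by the partition, plus a quadratic load term that discourages placing every instruction on one processor. Everything in this sketch is illustrative, not measured.

```python
import math
import random

# Hypothetical annealing loop for instruction partitioning. On a real
# FAST-1 the cost would be a measured simulation time; here a stand-in
# cost (cut edges + load balance penalty) takes its place.

def anneal_partition(edges, n_instr, n_procs, steps=20000, seed=1):
    rng = random.Random(seed)
    part = [rng.randrange(n_procs) for _ in range(n_instr)]

    def cost(p):
        loads = [p.count(q) for q in range(n_procs)]
        cut = sum(1 for a, b in edges if p[a] != p[b])
        return cut + 0.1 * sum(l * l for l in loads)

    cur = cost(part)
    temp = 5.0
    for _ in range(steps):
        i = rng.randrange(n_instr)           # move one instruction...
        old = part[i]
        part[i] = rng.randrange(n_procs)     # ...to a random processor
        new = cost(part)
        # accept improvements always; accept regressions with a
        # probability that shrinks as the temperature cools
        if new <= cur or rng.random() < math.exp((cur - new) / temp):
            cur = new
        else:
            part[i] = old                    # undo the move
        temp *= 0.9995
    return part, cur
```

In the scheme proposed in the text, the call to `cost` would be replaced by loading the candidate partition onto the machine and timing a short run over the test vectors.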

In terms of implementation, it seems that a FAST-1 multiprocessor is an interesting candidate for wafer-scale integration for a number of reasons. First, because so much of a FAST-1 processor is memory, and as redundancy techniques for memory are well understood, it should be possible to get better yields than with a processor that contains a high proportion of non-memory circuitry. At the multiprocessor level, we can imagine connecting processors using a two-dimensional grid of busses. The time to drive these busses is not unreasonable, as it is on the same order of magnitude as driving traces on a PC board. The individual bus wires can be spaced reasonably far apart to reduce the likelihood of shorts. If one bus wire is broken, the rest can be severed using a laser, thus creating two separate busses. The two-dimensional grid helps ensure that there is still some path between any two processors. If a processor is non-functional in a non-interfering way we can simply not use it. Otherwise, we can sever it from its busses using a laser. Unlike many wafer-scale applications where the elimination of a processor causes great difficulties, the architecture of the FAST-1 allows us to use as many or as few processors as we have available. Of course, if too many processors or busses are non-functional we

may end up with very slow interprocessor communication, which may result in a relatively worthless multiprocessor.

In the context of simulation, it is worthwhile to consider how Algorithm III-4 might be modified to better model delay, both within and between transistor groups, though the latter should be much easier to do than the former. Although I have already described, in general terms, the design of a multi-level FAST-1 simulator, the actual details need to be investigated. Moreover, the same kinds of experiments that I have described in this dissertation need to be conducted using the multi-level simulator in order to understand the performance tradeoffs involved.

IX.4. And Now a Word to Our Sponsor

As we all know, when problems are difficult there is nothing like some real data to give one a better idea of what is going on. As I mentioned above, one of the things limiting my research was access to fast computing cycles. CMU has great resources, yet I can think of several dissertations in our department that would have been easier to do, taken less time, and produced 'better' results if more computing power had been available. In my opinion, if DARPA and the US Government really want to advance the state of the art of computing, which, I suppose, the Strategic Computing Initiative indicates that they do, they would take the money required to build one B-1 bomber and use it to buy a Lisp Machine, Sun Workstation, or a Micro-Vax II for each and every researcher in the major CS departments in the country. Moreover, any department that could use a machine like a Cray-1 should be given one of those as well. Creating knowledge is productive; creating weaponry is not.

References

Abramovici, M., Levendel, Y. H., and Menon, P. R., "A Logic Simulation Machine," IEEE Trans. on Computer-Aided Design, CAD-2, 2 (April 1983), 82-94.

Agrawal, V. D., Bose, A. K., Kozak, P., Nham, H. N., and Pacas-Skewes, E., "A Mixed Mode Simulator," Proc. 17th Design Automation Conf., IEEE, June 1980.

Aho, A. V., Hopcroft, J. E., and Ullman, J. D., The Design and Analysis of Computer Algorithms, Addison Wesley, Reading, MA, 1974.

Allen, A. O., Probability, Statistics and Queuing Theory with Computer Science Applications, Academic Press, New York, 1978.

Anderson, G. A., and Jensen, E. D., "Computer Interconnection Structures: Taxonomy, Characteristics, and Examples,"Computing Surveys, 7, 4 (December 1975), 197-213.

Arnold, J., "Parallel Simulation Of Digital LSI Circuits,"Master Degree Thesis Proposal, MIT, March 1984.

Arvind, and Iannucci, R. A., "A Critique of Multiprocessing von Neumann Style," Proc. 10th Annual Intl. Conf. on Computer Architecture, ACM, June 1983, pp. 426-436, reprinted in Supercomputers: Design and Applications, K. Hwang, Editor, IEEE Computer Society Press, 1984.

Arvind, and Kathail, V., "A Multiple Processor Data-flow Machine that Sup- ports Generalized Procedures,"Proc. of the 8th Annual Symposium on Com- puter Architecture, ACM, New York, May 1981.

Backus, J., "Can Programming be Liberated from the von Neumann Style? A Functional Style and Its Algebra of Programs," Comm. ACM, 21, 9 (September 1978), 613-641.

Baker, C. M., and Terman, C., "Tools for Verifying Integrated Circuit Designs," Lambda (now VLSI Design), 1, 3 (Fourth Quarter 1980), 22-31.

Barto, R. L., "Architecture for a Hardware Simulator," Proc. Conf. on Circuits and Computers, IEEE, 1980, pp. 891-893.


Barzilai, Z., Huisman, L., Silberman, G., Tang, D., and Woo, L., "Simulating Pass Transistor Circuits Using Logic Simulation Machines," Proc. 20th Design Automation Conf., IEEE, June 1983, pp. 157-163.

Batcher, K. E., "Design of a Massively Parallel Processor," IEEE Trans. on Computers (September 1980), 836-840, reprinted in Supercomputers: Design and Applications, K. Hwang, Editor, IEEE Computer Society Press, 1984.

Bouknight, W. J., Denenberg, S. A., McIntyre, D. E., Randall, J. M., Sameh, A. H., and Slotnick, D. L., "The Illiac IV System," Proc. IEEE (April 1972), 369-388, reprinted in Computer Structures: Principles and Examples, Siewiorek, D. P., Bell, C. G., and Newell, A., McGraw-Hill, NY, ch. 20, pp. 306-316, 1982.

Bryant, R., "An Algorithm for MOS Simulation," Lambda (now VLSI Design), 1, 3 (Fourth Quarter 1980), 46-54.

Bryant, R., "A Switch-Level Model and Simulator for MOS Digital Systems," IEEE Trans. on Computers, C-33, 2 (February 1984), 160-177.

Bryant, R., Personal Communication, February, 1985.

Bryant, R., and Shuster, M., "Fault Simulation," VLSI Design (October 1983), 24-30.

Clark, W. A., "From Electron Mobility to Logical Structure: A View of In- tegrated Circuits," Computing Surveys, 12, 3 (September 1980), 325-356.

Colwell, R., Hitchcock, C., and Jensen, E. D., "A Perspective on the Processor Complexity Controversy,"Proc. IEEE Intl. Conf. on Computer Design, IEEE, 1983.

Dally, W. J., "A MOSSIM Simulation Machine: Architecture and Design," Tech. Rep. 5123:TR:84, Caltech, April 1984.

Dally, W. J., "A Special-Purpose Machine for Switch-Level Simulation," Talk at DARPA VLSI Contractors Meeting, Salt Lake City, Utah, March 1985.

Davis, A. L., and Drongowski, P. J., "Dataflow Computers: A Tutorial and Survey," Tech. Rep. UUCS-80-109, Univ. of Utah, Dept. of Computer Science, July 1980.

Davis, A. L., Denny, W. M., and Sutherland, I. E., "A Characterization of Parallel Systems," Tech. Rep. UUCS-80-108, University of Utah, August 1980.

DEC, Vax 11/780 Architecture Manual, Digital Equipment Corp., Maynard, MA, 1979.

Denneau, M., "The Yorktown Simulation Engine,"Proc. of the 19th Design Automation Conf., IEEE, June 1982.

Denneau, M., Lecture at CMU, Dec. 1982.

Dennis, J. B., "Data Flow Supercomputers," Computer (November 1980), 48-56, reprinted in Supercomputers: Design and Applications, K. Hwang, Editor, IEEE Computer Society Press, 1984.

Dennis, J. B., Leung, C. K. C., and Misunas, D. P., "A Highly Parallel Processor Using a Data-flow Machine Language," Tech. Rep. CSG 134-1, MIT, Lab. for Computer Science, June 1979.

Deutsch, J. T., and Newton, A. R., "MSPLICE: A Multiprocessor-Based Circuit Simulator," Proc. 1984 Intl. Conf. on Parallel Processing, IEEE, August 1984, pp. 207-214.

Dumlugol, D., DeMan, H. J., Stevens, P., and Schrooten, G. G., "Local Relaxation Algorithms for Event Driven Simulation including Assignable Delay Modeling," IEEE Trans. on CAD, CAD-2, 3 (July 1983), 193-201.

Ebeling, C., and Palay, A., "The Design and Implementation of a VLSI Chess Move Generator," Proc. 11th Annual Intl. Symp. on Computer Architecture, ACM, June 1984, pp. 74-81.

Feng, T., "A Survey of Interconnection Networks," Computer, 14, 12 (December 1981), 12-27, reprinted in Interconnection Networks for Parallel and Distributed Processing, C. Wu and T. Feng, Editors, IEEE Computer Society Press, 1984.

Finnegan, J., "The VLSI Approach to Computational Complexity," CMU Conf. on VLSI Systems and Computations, Oct. 1981, pp. 124-125, Computer Science Press, Rockville, MD.

Fisher, A. L., and Kung, H. T., "The Architecture of the PSC: A Programmable Systolic Chip," Proc. 10th Annual Intl. Symp. on Computer Architecture, ACM, June 1983, pp. 48-53.

Flake, P., Moorby, P., and Musgrave, G., "An Algebra for Logic Strength Manipulation,"Proc. 20th Design Automation Conf., IEEE, June 1983, pp. 615-618.

Flynn, M. J., "Some Computer Organizations and Their Effectiveness,"IEEE Trans. on Computers, C-21, 9 (September 1972), 948-960.

Frank, E. H., "The Fast-1: A Data-driven Multiprocessor for Logic Simulation," Thesis Proposal, Carnegie-Mellon University, Computer Science Department, October 1982.

Frank, E. H., and Sproull, R. F., "Testing and Debugging Custom Integrated Circuits," Computing Surveys, 13, 4 (December 1981).

Frank, E. H., and Sproull, R. F., "A Self-timed Static RAM," Proc. 3rd Caltech Conf. on VLSI, Caltech, Pasadena, CA, March 1983, pp. 275-286.

Gajski, D. D., Padua, D. A., Kuck, D. J., and Kuhn, R. H., "A Second Opinion on Data Flow Machines and Languages," Computer, 15, 2 (February 1982), 58-70.

Gardner, M., "Mathematical Games," Scientific American, 223, 4 (Oct 1970), 120-123.

Garey, M. R., and Johnson, D. S., Computers and Intractability: A Guide to the Theory of NP-Completeness, Freeman, New York, 1979.

Goldberg, A., and Robson, D., Smalltalk-80: The Language and its Implementation, Addison-Wesley, Reading, Mass, 1983.

Hayes, J. P., "A Unified Switching Theory with Applications to VLSI Design," Proc. IEEE, 70, 10 (October 1982), 1140-1151.

Hennessy, J., "VLSI Processor Architecture," IEEE Trans. on Computers, C-34, 1 (January 1985).

Hill, D., and vanCleemput, W. M., "SABLE: A Tool for Generating Structural Multi-Level Simulations," Proc. 16th Design Automation Conf., IEEE, New York, June 1979, pp. 272-279.

Hillis, W. D., "The Connection Machine: Computer Architecture for the New Wave," AI Memo 646, MIT, September 1981.

Jouppi, N., "TV: an nMOS Timing Analyzer," Third Caltech Conf. on VLSI, IEEE, Computer Science Press, Rockville, MD, March 1983, pp. 71-85.

Kazar, M. L., The Automatic Distribution of Programs and Data in a Network Environment, PhD dissertation, Carnegie-Mellon University, Department of Computer Science, September 1984.

Kirkpatrick, S., Gelatt, Jr., C. D., and Vecchi, M. P., "Optimization by Simulated Annealing," Science (May 1983), 671-680.

Koike, N., Ohmori, K., Kondo, H., and Sasaki, T., "A High Speed Logic Simulation Machine,"Proc. 20th Design Automation Conference, IEEE, June 1983, pp. 446-451.

Kung, H. T., "Why Systolic Architectures?," Computer, 15, 1 (January 1982), 37-46.

Kung, H. T., and Leiserson, C. E., "Systolic Arrays (for VLSI)," Sparse Matrix Proceedings 1978, Society for Industrial and Applied Mathematics, 1979, pp. 256-282. A slightly different version appears in [Mead and Conway, 1980, Section 8.3].

Lampson, B., "Hints for Computer System Design," IEEE Software, 1, 1 (January 1984), 11-28.

Lattin, W., and Ratner, J., "Ada Determines Architecture of 32-bit Microprocessor," Electronics (Feb. 24, 1981).

Mead, C., and Conway, L., Introduction to VLSI Systems, Addison-Wesley, Reading, MA, 1980.

Merlin, P. M., and Schweitzer, P. J., "Deadlock Avoidance," IEEE Trans. on Communications, COM-28 (March 1980), 345-354.

Metcalfe, R. M., and Boggs, D. R., "Ethernet: Distributed Packet Switching for Local Computer Networks," Comm. of the ACM, 19, 7 (July 1976).

Mishra, B., "An Efficient Algorithm to Find All 'Bidirectional' Edges in an Undirected Graph," Proc. Conf. on the Foundations of Computer Science, Oct. 1984.

Newton, A. R., "Techniques for the Simulation of Large Scale Integrated Circuits,"IEEE Transactions on Circuits and Systems, CAS-26, 9 (September 1979), 741-749.

Oflazer, K., "Partitioning In Parallel Processing of Production Systems,"Proc. 1984 Intl. Conf. on Parallel Processing, IEEE, August 1984, pp. 92-100.

Patterson, D., and Sequin, C., "A VLSI RISC,"Computer, 15, 9 (Sept. 1982), 8-22.

Shiloach, Y., "A Polynomial-Time Solution to the Undirected Two Paths Problem," J. ACM, 27, 3 (July 1980).

Spector, A. Z., Multiprocessing Architectures for Local Computer Networks, PhD dissertation, Stanford University, 1981, Available as Stanford Report STAN-CS-81-874.

Steele, G. L. Jr., and Sussman, G. J., "Design of a LISP-based microprocessor," Comm. ACM, 23, 11 (November 1980), 628-644.

Stone, H. S., "Multiprocessor Scheduling with the Aid of Network Flow Algorithms," IEEE Trans. on Software Engineering (January 1977), 85-94, reprinted in Supercomputers: Design and Applications, K. Hwang, Editor, IEEE Computer Society Press, 1984.

Stroustrup, B., "Classes: An Abstract Data Type Facility for the C Language,"Tech. Rep.CS-84, Bell Laboratories, Murray Hill, NJ, April 1980.

Stroustrup, B., "A Set of C Classes for Co-routine Style Programming,"Tech. Rep.CS-90, Bell Laboratories, Murray Hill, NJ, July 1982.

Stroustrup, B., "The C++ Programming Language Reference Manual," Tech. Rep. CSTR-108, AT&T Bell Laboratories, Murray Hill, NJ, November 1984.

Swan, R. J., The Switching Structure and Addressing Architecture of an Exten- sible Multiprocessor: Cm*, PhD dissertation, Carnegie-Mellon University, Computer Science Department, August 1978.

Tanenbaum, A. S., Computer Networks, Prentice-Hall, Englewood Cliffs, NJ, 1981.

Terman, C., Simulation Tools for Digital LSI Design, PhD dissertation, MIT, September 1983.

Texas Instruments, "TI 4161 Dynamic RAM", Data Sheet, 1983.

Texas Instruments, PAL Data Book, TI, 1984.

Thacker, C., Personal Communication, August, 1984.

Treleaven, P., Brownbridge, D. R., and Hopkins, R. P., "Data-Driven and Demand-Driven Computer Architecture," Computing Surveys, 14, 1 (March 1982), 93-143.

Ullman, J. D., Computational Aspects of VLSI, Addison-Wesley, San Francisco, 1983.

Vegdahl, S. R., "A Survey of Proposed Architectures for the Execution of Functional Languages," IEEE Trans. on Computers, C-34, 1 (January 1985).

Wulf, W. A., Levin, R., and Harbison, S. P., Hydra/C.mmp: An Experimental Computer System, McGraw-Hill, New York, Advanced Computer Science Series, 1981.

Zycad, Inc., Zycad LE-1001 and LE-1002 Logic Evaluator product description, Preliminary ed., Minneapolis, Minn., 1982.

A Circuit Descriptions

The following are the adder, ram, and stack circuits, described using Terman's NET language.

A.1. Adder

;*********************************************************************
; The sum bit of a Full Adder
;*********************************************************************

(macro SumBit (a b c sum)       ;a b c and sum are parameters
  (local presum)                ;a convenient local name
  (pullup presum)               ;pullup is a depletion mode load
  (etrans a presum b)           ;order of parameters is gate, source, drain
  (etrans b presum a)           ;etrans is enhancement mode
  (pullup sum)
  (etrans presum sum c)
  (etrans c sum presum))

(macro CarryBit (a b cin cout)
  (local precarry)
  (and-or-invert precarry       ;precarry = NOT ((a AND b) OR (a AND cin) OR (b AND cin))
    (a b) (a cin) (b cin))
  (invert cout precarry))       ;invert precarry to make cout

(macro adder (a b carry sum n)
  (local ain bin)
  (repeat i 0 n                 ;generate an n-bit adder
    (invert ain.i a.i)
    (invert bin.i b.i)
    (SumBit ain.i bin.i carry.i sum.i)
    (CarryBit ain.i bin.i carry.i carry.(+ i 1))))

;*********************************************************************
; Test a 4-bit adder
;*********************************************************************

(node a b carry sum carryin ain bin)    ;global names
(invert carry.0 carryin)
(adder a b carry sum 4)

A.2. Cadd24

; NET file for 24-bit Brent-Kung type adder
; half adder (just sum part) made from nor gates

(macro halfadder (a b s)
  (local abar bbar t1 t2)
  (cinvert abar a)
  (cinvert bbar b)
  (cnor t1 a b)
  (cnor t2 abar bbar)
  (cnor s t1 t2))

; full adder (just sum part) made from nand gates

(macro fulladder (a b c s)
  (local abar bbar cbar t1 t2 t3 t4)
  (cinvert abar a)
  (cinvert bbar b)
  (cinvert cbar c)
  (cnand t1 a bbar cbar)
  (cnand t2 abar b cbar)
  (cnand t3 abar bbar c)
  (cnand t4 a b c)
  (cnand s t1 t2 t3 t4))

; propagate-generate cell from adder inputs

(macro propgen (a b g p)
  (local abar bbar)
  (cinvert abar a)
  (cinvert bbar b)
  (cnor g abar bbar)
  (halfadder a b p))

; propagate-generate cell for carry chain

(macro carry (g1 g2 p1 p2 gout pout)
  (local g1bar poutbar t1)
  (cnand poutbar p1 p2)
  (cinvert pout poutbar)
  (cinvert g1bar g1)
  (cnand t1 p1 g2)
  (cnand gout g1bar t1))

; declare global nodes (a and b are inputs, g and p are for the carry chain,
; and s is the output sum)

(node a b g p s)

; generate a 24 bit Brent-Kung adder

; first generate propagate and generate for each bit

(repeat i 1 24 (propgen a.i b.i g.1.i p.1.i))

; now build carry tree

(repeat i 1 12 (setq j (* 2 i))
  (carry g.1.j g.1.(- j 1) p.1.j p.1.(- j 1) g.2.j p.2.j))
(repeat i 1 6 (setq j (* 4 i))
  (carry g.2.j g.2.(- j 2) p.2.j p.2.(- j 2) g.3.j p.3.j))
(repeat i 1 3 (setq j (* 8 i))
  (carry g.3.j g.3.(- j 4) p.3.j p.3.(- j 4) g.4.j p.4.j))

(carry g.4.16 g.4.8 p.4.16 p.4.8 g.5.16 p.5.16)
(carry g.4.24 g.5.16 p.4.16 p.5.16 g.5.24 p.5.24)

(carry g.3.20 g.5.16 p.3.20 p.5.16 g.4.20 p.4.20)
(carry g.3.12 g.4.8  p.3.12 p.4.8  g.4.12 p.4.12)
(carry g.2.22 g.4.20 p.2.22 p.4.20 g.3.22 p.3.22)
(carry g.2.18 g.5.16 p.2.18 p.5.16 g.3.18 p.3.18)
(carry g.2.14 g.4.12 p.2.14 p.4.12 g.3.14 p.3.14)
(carry g.2.10 g.4.8  p.2.10 p.4.8  g.3.10 p.3.10)
(carry g.2.6  g.3.4  p.2.6  p.3.4  g.3.6  p.3.6)
(carry g.1.23 g.3.22 p.1.23 p.3.22 g.2.23 p.2.23)
(carry g.1.21 g.4.20 p.1.21 p.4.20 g.2.21 p.2.21)
(carry g.1.19 g.3.18 p.1.19 p.3.18 g.2.19 p.2.19)
(carry g.1.17 g.5.16 p.1.17 p.5.16 g.2.17 p.2.17)
(carry g.1.15 g.3.14 p.1.15 p.3.14 g.2.15 p.2.15)
(carry g.1.13 g.4.12 p.1.13 p.4.12 g.2.13 p.2.13)
(carry g.1.11 g.3.10 p.1.11 p.3.10 g.2.11 p.2.11)
(carry g.1.9  g.4.8  p.1.9  p.4.8  g.2.9  p.2.9)
(carry g.1.7  g.3.6  p.1.7  p.3.6  g.2.7  p.2.7)
(carry g.1.5  g.3.4  p.1.5  p.3.4  g.2.5  p.2.5)
(carry g.1.3  g.2.2  p.1.3  p.2.2  g.2.3  p.2.3)

; connect extra nodes

(connect g.1.1  g.5.1)
(connect g.2.2  g.5.2)
(connect g.3.4  g.5.4)
(connect g.3.6  g.5.6)
(connect g.4.8  g.5.8)
(connect g.3.10 g.5.10)
(connect g.4.12 g.5.12)
(connect g.3.14 g.5.14)
(connect g.3.18 g.5.18)
(connect g.4.20 g.5.20)
(connect g.3.22 g.5.22)
(repeat i 2 12 (setq j (- (* 2 i) 1)) (connect g.2.j g.5.j))

; now build sums using the generated carries (from g.5)

(halfadder a.1 b.1 s.1)
(repeat i 2 24 (fulladder a.i b.i g.5.(- i 1) s.i))

A.3. Ram

;*********************************************************************
; A decoder cell (by Carl Ebeling)
;   width   : Width of pulldowns
;   size    : The number of address bits.
;   address : The address to be decoded.
;   output  : The decoder output.
;   addr    : 2-rail address lines

; The pullup is not provided and must be externally connected. This
; allows decode pulldowns to be cascaded.
;*********************************************************************

(macro decode (width size address output addrH addrL)
  (do ((i 0 (+ i 1))
       (temp address (/ temp 2)))
      ((>= i size)
       (cond ((> temp 0)
              (printf "Decode: address (%S) too large for size (%S)\n"
                      address size))))
    (cond ((= (% temp 2) 0) (pulldown output (addrH.i width 2)))
          (t (pulldown output (addrL.i width 2))))))

(macro ramcell (bH bL en)
  (local bHint bLint)
  (invert bHint bLint)
  (invert bLint bHint)
  (trans en bH bHint)
  (trans en bL bLint))

(macro ramarray (writeH bitin bitout addrH addrenableH width depth addrsize enable)
  (local addrL addrLx addrHx bitL bitx bitLx)
  ;allocate address drivers
  (repeat i 0 (- addrsize 1)
    (nand addrHx.i addrenableH addrH.i)
    (invert addrL.i addrH.i)
    (nand addrLx.i addrenableH addrL.i))
  ;allocate bit line drivers
  (repeat i 0 (- width 1)
    (connect bitout.i bitx.i)
    (nand bitx.i writeH bitL.i)
    (invert bitL.i bitin.i)
    (nand bitLx.i writeH bitin.i))
  (repeat i 0 (- depth 1)
    (pullup enable.i)
    (decode 4 addrsize i enable.i addrHx addrLx)
    (repeat j 0 (- width 1)
      (ramcell bitx.j bitLx.j enable.i))))

(node write bitin bitout addr addrenable enable)
(ramarray write bitin bitout addr addrenable 32 32 5 enable)

A.4. Stack

;*********************************************************************
; This file contains the definitions for a stack and the control signals
; for a stack
;*********************************************************************

;*********************************************************************
; The stackCell is one element of a stack. It matches the stack layout.
; Just connect top to bottom to build stack
;*********************************************************************

(macro stackCell (top bottom hold1 hold2 SHR SHL)
  (local Ttop)                  ;Top of cell 'inside' pass transistor
  (connect top top.H)
  (connect bottom bottom.H)
  (trans SHR top.H Ttop.H)
  (trans SHL top.L Ttop.L)
  (invert (Ttop.L 2 17) Ttop.H)
  (invert (bottom.H 2 17) bottom.L)
  (trans hold1 bottom.H Ttop.H)
  (trans hold2 Ttop.L bottom.L))

;*********************************************************************
; Cell at top of stack
;*********************************************************************

(macro stackTop (topin topout bottom hold1 hold2 load)
  (local Ttop)                  ;Top of cell 'inside' pass transistor
  (connect topin topin.H)
  (connect topout Ttop.H)
  (connect bottom bottom.H)
  (trans load topin.H Ttop.H)
  (connect topin.L Ttop.L)
  (invert (Ttop.L 2 17) Ttop.H)
  (invert (bottom.H 2 17) bottom.L)
  (trans hold1 bottom.H Ttop.H)
  (trans hold2 Ttop.L bottom.L))

; Stack Buffer buffers the control signals to the stack

(macro stackCntlBufferCell (out in)
  (local PD)
  (invert (PD 2 9) in)
  (dtrans in out vdd 2 2)
  (pulldown out (PD 8 2)))

(macro stackCntlBuffer (Bhold1 Bhold2 BSHR BSHL hold1 hold2 SHR SHL)
  ; (stackCntlBufferCell Bhold1 hold1)
  ; (stackCntlBufferCell Bhold2 hold2)
  (stackCntlBufferCell BSHR SHR)
  (stackCntlBufferCell BSHL SHL)
  (connect Bhold1 hold1)
  (connect Bhold2 hold2)
  (connect BSHR SHR)
  (connect BSHL SHL))

;*********************************************************************
; stack builds a stack width bits wide and depth bits deep.
; depth must be even for this to match control signal buffers.
;*********************************************************************

(macro stack (width depth topin topout hold1 hold2 SHR SHL)
  (local bottom Thold1 Thold2 TSHR TSHL)
  (repeat i 0 (- (/ depth 2) 1)
    (stackCntlBuffer Thold1.i Thold2.i TSHR.i TSHL.i hold1 hold2 SHR SHL))
  (repeat j 0 (- width 1)
    (stackTop topin.j topout.j bottom.0.j Thold1.0 Thold2.0 TSHR.0))
  (repeat i 1 (- depth 1)
    (repeat j 0 (- width 1)
      (stackCell bottom.(- i 1).j bottom.i.j Thold1.(/ i 2) Thold2.(/ i 2)
                 TSHR.(/ i 2) TSHL.(/ i 2)))))

; stackCntlCell generates the control signals for a stack; two copies must be
; used, one to produce shl and hold1, the other to produce shr and hold2.
; pH, pL : pop/push H/L

(macro stackCntlCell (phi1 phi2 pH pL hold shift)
  (local holdPU holdPD shiftPU shiftPD)
  (nor (holdPU 2 4) (pH 4 2) (phi2.L 4 2) (shift 4 2))
  (nor (shiftPU 2 4) (pL 4 2) (phi1.L 4 2) (hold 4 2))
  (invert (holdPD 2 2) (holdPU 8 2))
  (dtrans holdPU vdd hold 4 2)
  (pulldown hold (holdPD 16 2))
  (invert (shiftPD 2 2) (shiftPU 8 2))
  (dtrans shiftPU vdd shift 4 2)
  (pulldown shift (shiftPD 16 2)))

; Build the chess stack including the stack control signals

(macro chessStack (width depth topin topout push pop phi1 phi2)
  (local hold1 hold2 SHR SHL)
  (connect pop pop.H)
  (connect push push.H)
  (invert pop.L pop.H)
  (invert push.L push.H)
  (stackCntlCell phi1 phi2 push.H pop.L hold1 SHL)
  (stackCntlCell phi1 phi2 pop.H push.L hold2 SHR)
  (stack width depth topin topout hold1 hold2 SHR SHL))

; Test the stack

(setq L -1 H -2)
(node input topin topout push pop phi1 phi2 hold1 hold2 shr shl)

(chessStack 16 16 topin topout push pop phi1 phi2)

B Test Programs

The following are the Class C source files for most of the test programs used. Some of the test programs contain code for manual, exhaustive, and random testing of the circuits, though in the experiments reported in Chapters V and VIII, only random testing was used for the associated circuits.

B.1. Adder Test Program

#include "signals.h"   /* header file provides interface to simulator */
int AdderTest();
int (*UserFunction)() = AdderTest;   /* Provides hook to user interface */

int AdderTest()
{
    static int i, j, k, ax = 0, bx = 0, cx = 0, result, simResult;
    static int count, mask, xwidth = 0;
    int width = (xwidth == 0 ? getint("How wide? ", 1, 31, 4) : xwidth);

    /* The signal declaration is used to create an object usable in a Class C
       program that references the right nodes.  The format is:
           internalname("nodename", size, start).
       If size is 0, then the signal is a scalar.  If size is non-zero, then
       the signal is a vector of the form "nodename".start,
       "nodename".(+ start 1), ..., "nodename".(- (+ start size) 1).
       For example, for signal sum we would have sum.0, sum.1, ...,
       sum.(- width 1). */
    signal carryin("carryin",0,0),
           a("a",width,0), b("b",width,0), sum("sum",width,0),
           carryout("carry",1,width);
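The vector-naming rule described in the comment above can be illustrated with a small standalone sketch. The `signal` class itself belongs to the simulator; `signalNodes` below is a hypothetical helper, written here in modern C++ only to reproduce the naming scheme:

```cpp
#include <string>
#include <vector>

// Sketch (not the simulator's actual class): expand a signal
// declaration into the node names it would reference, following the
// rule "nodename".start ... "nodename".(start + size - 1) described
// above.  A size of 0 denotes a scalar signal with one unsuffixed node.
std::vector<std::string> signalNodes(const std::string& name, int size, int start) {
    std::vector<std::string> nodes;
    if (size == 0) {                      // scalar: just the node name
        nodes.push_back(name);
        return nodes;
    }
    for (int i = 0; i < size; i++)        // vector: numbered suffixes
        nodes.push_back(name + "." + std::to_string(start + i));
    return nodes;
}
```

For `sum("sum",4,0)` this yields sum.0 through sum.3, matching the comment's example, and `carryout("carry",1,width)` names the single node carry.4 when width is 4.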

    xwidth = width; count = 1 << width; mask = count-1;
    switch (getchr("What? ","mer",'m'))   /* find out what we should do */
    { /* manual, all, random */
    case 0: { /* manual test */
        for (;;) {
            ax = getint("a? ",-1,mask,ax);   /* get a value from the user */
            if (ax == -1) break;
            bx = getint("b? ",0,mask,bx);
            cx = getint("c? ",0,1,cx);
            /* internalname.Set(value) drives a signal, and thus one or more
               nodes, to the state given by the proper bit of the value.
               Once the nodes have been set, Execute is called to request
               execution of the simulation.  The parameter to Execute is the
               maximum number of unit-delay steps to be simulated.
               Generally a big number is used, and the simulation quiesces
               well within that amount of time.  If it doesn't, Execute
               will print a message. */
            a.Set(~ax); b.Set(~bx); carryin.Set(~cx); Execute(100);
            printf("%d + %d + %d = %d\n",ax,bx,cx,result = ax+bx+cx);

            /* internalname.GetD() is used to get the boolean values of the
               nodes associated with a signal.  If the state of a node is X,
               GetD will return a 0, and GetU will return a 1


*/

            if (result != (simResult = sum.GetD() | (carryout.GetD()< 0;i--) {
                ax = (rand() >> 2) & mask;
                bx = (rand() >> 2) & mask;
                cx = (rand() >> 4) & 1;
                a.Set(~ax); b.Set(~bx); carryin.Set(~cx); Execute(100);
                result = ax+bx+cx;
                if (result != (simResult = sum.GetD() | (carryout.GetD()<

B.2. Multiplier Test Program

#include "signals.h"
int mul();
int (*UserFunction)() = mul;

int mul()
{
    static int width = 8;
    static int i,j,count,mask;
    signal a("a",width,0), b("b",width,0), p("p",width << 1,0);

    count = 1<

    {
        for (;;) {
            i = getint("a",-1,mask,i);
            if (i == -1) break;
            a.Set(i);
            b.Set(j = getint("b",0,mask,j));
            Execute(1000);
            printf("%d * %d = %d\n",i,j,i*j);
            if (i*j != p.GetD() || p.GetU() != 0)
                {printf("but %d != %d, %d\n",i*j,p.GetD(),p.GetU());}
        }
        break;
    }
    case 1: { /* exhaustive */
        for (i=0;i < count;i++) {
            printf(".");fflush(stdout);
            a.Set(i);
            for (j=0;j < count;j++) {
                b.Set(j);
                Execute(100);
                if (i*j != p.GetD() || p.GetU() != 0) {
                    printf("%d * %d = %d\n",i,j,i*j);
                    printf("but %d != %d, %d\n",i*j,p.GetD(),p.GetU());
                }
            }
        }
        break;
    }
    case 2: { /* random test */
        int k;
        for (k = getint("how many",0,1000000,1);k > 0;k--) {
            if (k%64 == 1) {fprintf(outfile,".");fflush(outfile);}

            a.Set(i = ((rand()>>2) & mask));
            b.Set(j = ((rand()>>2) & mask));
            Execute(1000);
            if (i*j != p.GetD() || p.GetU() != 0) {
                printf("%d * %d = %d\n",i,j,i*j);
                printf("but %d != %d, %d\n",i*j,p.GetD(),p.GetU());
            }
        }
        break;
    }
    }
    return(0);
}

B.3. RAM Test Program

#include "signals.h"
int ram();
int (*UserFunction)() = ram;
#define writememory(address,data) \
    ae.Set(0);\
    Execute(1000);\
    addr.Set(address);\
    Execute(1000); /* let the address settle */\
    ae.Set(1);\
    Execute(1000);\
    write.Set(1); /* turn on the write bit */\
    bitin.Set(data);\
    Execute(1000)
#define readmemory(address)\
    (ae.Set(0),write.Set(0),Execute(1000),addr.Set(address),\
     Execute(1000),ae.Set(1),Execute(1000),bitout.GetD())

int ram()

{
    static int xwidth=0,xaddrsize=0,i,depth,depthMask,widthMask,a=0,d=0,*table;
    int width = ((xwidth == 0)?getint("width ?",1,31,8):xwidth),
        addrsize = ((xaddrsize == 0)?getint("address size? ",1,31,8):xaddrsize);
    signal bitin("bitin",width,0), bitout("bitout",width,0),
           addr("addr",addrsize,0),
           ae("addrenable",0,0), write("write",0,0);
    xwidth = width; xaddrsize = addrsize; depth = 1<

    /* initialization: cause all of the memory cells to flip one way */

    switch (getchr("What? ","mer",'m')) {
    case 0: { /* manual poking */
        for (;;) {
            a = getint("addr? ",-1,depthMask,a);
            if (a == -1) break;
            if (getbool("Write",TRUE))
                {writememory(a,d = getint("bitins? ",0,widthMask,d));}
            printf("%d: %d = %d,%d\n",a,d,readmemory(a),bitout.GetU());
        }
        break;
    }
    case 1: { /* exhaustive (read and write every location with every value) */
        for (i=0;i <= widthMask;i++) {
            for (a=0;a < depth;a++) {
                writememory(a,(d = ((a+i) & widthMask)));
                if (readmemory(a) != d || bitout.GetU() != 0) {
                    printf("bad read after write %d: %d = %d %d\n",
                           a,d,bitout.GetD(),bitout.GetU());
                }
            }
            for (a=0;a < depth;a++) {
                if (readmemory(a) != (d = ((a+i) & widthMask)) || bitout.GetU() != 0) {
                    printf("bad verify %d: %d = %d %d\n",
                           a,d,bitout.GetD(),bitout.GetU());
                }
            }
        }
        break;
    }
    case 2: { /* random poking */
        /* initialize the world to zeros */
        for (a=0; a < depth;a++) {table[a] = 0; writememory(a,0);}
        for (i = getint("how many? ",0,999999,100);i > 0;i--) { /* do a bunch of random writes */
            a = (rand() >> 2) & depthMask;
            d = (rand() >> 2) & widthMask;
            writememory(a,d); table[a]=d;
            if (readmemory(a) != d || bitout.GetU() != 0) {
                printf("bad read after write %d: %d = %d %d\n",
                       a,d,bitout.GetD(),bitout.GetU());
            }
        }
        for (i=0;i < depth;i++) {
            if (readmemory(i) != table[i] || bitout.GetU() != 0) {
                printf("bad verify %d: %d = %d %d\n",
                       i,table[i],bitout.GetD(),bitout.GetU());
            }
        }
        break;

} }

    delete table;
    return(0);
}
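The writememory()/readmemory() macros in the listing above perform a fixed handshake: lower the address enable, present the new address, let it settle, raise the enable, and only then assert write and data. The behavioral sketch below mirrors those sequences against a trivial RAM model. `MockRam` and its `Step()` method are illustrative assumptions, standing in for the simulated circuit and the simulator's Execute() call, not reproductions of either:

```cpp
#include <map>

// Behavioral sketch (not the simulated RAM circuit): Step() stands in
// for Execute(), settling the state that the control signals imply.
struct MockRam {
    std::map<int,int> cells;
    int addr = 0, bitin = 0, bitout = 0;
    bool ae = false, write = false;
    int latched = 0;                       // address latched while ae is high
    void Step() {
        if (ae) latched = addr;            // address enable latches the address
        if (ae && write) cells[latched] = bitin;
        bitout = cells.count(latched) ? cells[latched] : 0;
    }
};

// Mirror of the writememory() macro sequence above.
void writeMemory(MockRam& r, int address, int data) {
    r.ae = false;     r.Step();
    r.addr = address; r.Step();            // let the address settle
    r.ae = true;      r.Step();
    r.write = true;                        // turn on the write bit
    r.bitin = data;   r.Step();
}

// Mirror of the readmemory() macro sequence above.
int readMemory(MockRam& r, int address) {
    r.ae = false; r.write = false; r.Step();
    r.addr = address;              r.Step();
    r.ae = true;                   r.Step();
    return r.bitout;
}
```

The point of the staged sequence is that the address must be stable before the enable rises; changing addr while ae is high would (in the real circuit) select cells transiently, which is why each macro inserts an Execute() between the two.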

B.4. Stack Test Program

#include "signals.h"
int StackTest();
int (*UserFunction)() = StackTest;
#define clockit() phi1.Set(0);phi2.Set(1);Execute(1000);\
    phi1.Set(1);phi2.Set(0);Execute(1000)
#define pushit(aValue) topin.Set(aValue); push.Set(1); clockit(); push.Set(0)
#define popit() pop.Set(1); clockit(); pop.Set(0)
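The clockit()/pushit()/popit() macros above encode the stack's driving protocol: a push presents topin and strobes push through one full two-phase clock cycle; a pop strobes pop the same way, after which topout shows the new top of stack. As a behavioral sketch only (MockStack and clock() are illustrative assumptions that collapse the phi1/phi2 pair into a single call; this is not the switch-level circuit):

```cpp
#include <vector>

// Behavioral sketch: one clock() call stands in for a complete
// phi1/phi2 cycle driven by the clockit() macro above.
struct MockStack {
    std::vector<int> cells;
    int topin = 0, topout = 0;
    bool push = false, pop = false;
    void clock() {
        if (push) cells.push_back(topin);
        else if (pop && !cells.empty()) cells.pop_back();
        topout = cells.empty() ? 0 : cells.back();  // expose top of stack
    }
};

// Mirror of the pushit() macro: present data, strobe push for one cycle.
void pushit(MockStack& s, int v) {
    s.topin = v; s.push = true; s.clock(); s.push = false;
}

// Mirror of the popit() macro: strobe pop for one cycle.
void popit(MockStack& s) {
    s.pop = true; s.clock(); s.pop = false;
}
```

This is the pattern the test cases below rely on: after a pushit the pushed value is visible on topout, and the manual-test code prints topout *before* calling popit because the pop removes it.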

int StackTest()
{ /* stack test */
    static int xwidth=0,xdepth=0,widthmask,widthcount,i,j;
    int width = ((xwidth == 0)?getint("width ?",1,31,8):xwidth),
        depth = ((xdepth == 0)?getint("depth? ",1,9999999,8):xdepth);
    signal push("push",0,0), pop("pop",0,0),
           topin("topin",width,0), phi1("phi1.-1",0,0), phi2("phi2.-1",0,0),
           topout("topout",width,0);
    int *table = new int[depth];

    xwidth = width; xdepth = depth; widthcount = 1<

    push.Set(0); pop.Set(0); phi1.Set(1); phi2.Set(1); topin.Set(0);
    Execute(1000);

    switch (getchr("What? ","mer",'m')) { /* process manual, exhaustive, or random command */
    case 0: { /* manual testing */
        for (;;) { /* process manual commands */
            switch (getchr("Push? ","wrq",'w')) {
            case 0: { /* do a push */
                pushit(getint("Value ",0,widthmask,0));
                break;
            }
            case 1: { /* pop */
                printf("TOS: %d , %d\n",topout.GetD(),topout.GetU());
                popit();
                break;
            }
            case 2:
                delete table;
                return(0);
            }
        }
        break;
    }
    case 1: { /* exhaustive */
        int aValue;
        for (i = 0;i < widthcount;i++) { /* try all aValues */
            printf(".");fflush(stdout);
            for (j=0;j < depth; j++) { /* push all locations */
                pushit(aValue = (i+j)&widthmask);
                if (topout.GetD() != aValue || topout.GetU() != 0)
                    {printf("Bad push: %d != %d, %d\n",aValue,topout.GetD(),topout.GetU());}
            }
            for (j=depth-1;j >= 0;j--) { /* pop locations */
                aValue = (i+j) & widthmask;

                if (aValue != topout.GetD() || topout.GetU() != 0) {
                    printf("Bad Verify: %d != %d, %d\n",aValue,topout.GetD(),
                           topout.GetU());
                }
                popit();
            }
        }
        break;
    }
    case 2: { /* random test */
        int k;
        for (k = getint("how many",0,1000000,1);k > 0;k--) { /* do this many random push/pop cycles */
            /* printf(".");fflush(stdout); */
            for (j=0;j < depth; j++) { /* push all locations */
                pushit(table[j] = rand() & widthmask);
                if (topout.GetD() != table[j] || topout.GetU() != 0)
                    {printf("Bad push: %d != %d, %d\n",table[j],topout.GetD(),topout.GetU());}
            }
            for (j=depth-1;j >= 0;j--) { /* pop locations */
                if (table[j] != topout.GetD() || topout.GetU() != 0)
                    {printf("Bad Verify: %d != %d, %d\n",table[j],topout.GetD(),topout.GetU());}
                popit();
            }
        }
        break;
    }
    }
    delete table;
    return(0);
}

C Sample Raw Data for Experiments

In the two sections below are samples of the raw data collected. The first section contains data from the static analysis of the four-bit adder circuit; the second contains the raw data from one of the simulation runs.

C.1. Sample Raw Static Data for Adder4

adder4 adder4P -n 4 -b 2 -p d2 -s -B formatted Thu Mar 20 21:52:27 1985
Total input devices: 78, nets 46, ratio 1.008

What    #    %       Inst %   Catg %  Input %
Inst    97   1.244   (Devices)
Devs    53   0.546   0.679
Pu      0    0.000   0.000   0.000
Pd      25   0.258   0.472   0.321
1Way    12   0.124   0.226   0.154
PX      16   0.185   0.302   0.205
#pu     25   0.258   0.472   0.321
#Devs   0    0.000   0.000   0.000
Nets    44   0.454   0.957   (Nets)
w/Cap   29   0.299   0.659   0.630
Reg     15   0.155   0.341   0.326
FIn     0    0.000   0.000   0.000
FOut    0    0.000   0.000   0.000
BCast   20   0.206   0.455   0.435
#1In    12   0.124   0.273   0.261

Input file fanin/fanout
Fan  In   %      Out  %
0    9    0.196  1    0.022
1    22   0.674  29   0.652
2    1    0.696  13   0.935
3    16   1.043  8    1.109
4    8    1.217  5    1.217
Total nodes: 46
FanIn:  104, Avg: 2.261, Var 2.533, SD 1.592, 95% 0.460, 99% 0.604
FanOut: 99,  Avg: 2.152, Var 1.472, SD 1.213, 95% 0.351, 99% 0.461

Fast1 file fanin/fanout
Fan  In   %      Out  %
0    0    0.000  1    0.010
1    0    0.000  54   0.567
2    0    0.000  29   0.866
3    9    0.093  8    0.948
4    47   0.577  5    1.000
5    1    0.588  0    1.000
6    32   0.918  0    1.000
7    8    1.000  0    1.000
Total Instructions: 97
FanIn:  468, Avg: 4.825, Var 1.500, SD 1.225, 95% 0.244, 99% 0.320
FanOut: 156, Avg: 1.608, Var 0.741, SD 0.861, 95% 0.171, 99% 0.225
InstWords: 138, Avg: 1.423, Var 0.173, SD 0.416, 95% 0.069, 99% 0.091
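The Avg, SD, 95%, and 99% columns in the raw data above are internally consistent with sample means and normal confidence-interval half-widths z * SD / sqrt(n), with z = 1.96 and 2.576. For instance, fan-in Avg = 104/46 ~ 2.261, and with SD = 1.592 over n = 46 nodes the half-widths come out to 0.460 and 0.604, matching the listing. This interpretation is inferred from the numbers, not stated in the text; a minimal check:

```cpp
#include <cmath>

// Inferred reading of the 95%/99% columns: normal confidence-interval
// half-widths z * SD / sqrt(n).  This is a hypothesis checked against
// the adder4 listing above, not a definition taken from the thesis.
double halfWidth(double sd, int n, double z) {
    return z * sd / std::sqrt((double)n);
}
```

The same formula reproduces the instruction fan-in figures (SD = 1.225 over n = 97 instructions gives a 95% half-width of about 0.244), which supports the reading.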

Processor Summary
#   Total  Devs  Nets  total
0   14     5     9     0.144
1   21     13    8     0.216
2   34     20    14    0.351
3   28     15    13    0.289



Destination Summary: 135 internal, 29 external, 164 total

Transistor Groups: Input file
#    tg:node  %      nodes  tg:tran  %      trans
0    0        0.000  0.000  9        0.500  0.000
1    10       0.556  0.217  0        0.500  0.000
2    0        0.556  0.217  1        0.556  0.026
4    4        0.778  0.565  0        0.556  0.026
7    2        0.889  0.870  4        0.778  0.385
8    2        1.000  1.217  0        0.778  0.385
12   0        1.000  1.217  4        1.000  1.000
Total Transistor Groups: 18
Nodes: 46, Avg: 2.556, Var 7.725, SD 2.779, 95% 1.284, 99% 1.688
Total Transistor Groups: 18
Trans: 78, Avg: 4.333, Var 25.765, SD 5.076, 95% 2.345, 99% 3.082

Transistor Groups: Fast-1 Instructions
#    groups  % grp  #Inst  % inst  total  % inst
1    9       0.500  9      0.093   9      0.093
2    1       0.056  2      0.021   11     0.113
7    4       0.222  28     0.289   39     0.402
14   2       0.111  28     0.289   67     0.691
15   2       0.111  30     0.309   97     1.000
Total Transistor Groups: 18
Inst: 97, Avg: 5.389, Var 31.075, SD 5.575, 95% 2.575, 99% 3.385

C.2. Sample Raw Dynamic Data for Adder4

f1> M 0
Memory Delay set to 0
f1> U
How wide? (1 to 31) [4] 4
What? (mer) [m] r
How many? (0 to 999999) [100] 6144
f1> P
adder4 adder4P -n 4 -b 2 -p d2 -s -B formatted Thu Mar 28 22:08:04 1985
97 instructions, 4 processors, on Fri Mar 29 00:46:26 1985
Mem delay 1, 6 src/inst, 2 dest/inst, stack fetch, 4 phases
unit delay, two-output bi-PX, Init: all, 2 time slots
2 busses, 2 ports/processor, max queue 100
Bus delay 1, 1 cycles/trans, arbitration on, lambda 0.00
Time: 2033722 (calculated 3510819), fast1 = 49633
Opt: trans = 361794, net = 344014, store = 0, set = 37308.
Instructions Executed: 1386752+0 = 1386752, Pending: 0+0 = 0
Messages 560998, Arbitrates 649728, Retries 0, Max Q 3,
Destination Stores 1643016, Extra source fetches 196503
Nets executed 714436, Transistors executed 672316
Num Sources examined 6271296, Signal Set/Gets 86016
4.52 avg dynamic fan-in, 1.18 avg dynamic fan-out.
0.73 Storing Inst/Internal, 1.63 Stores/Storing Inst.
27.94 f1 inst/f1 sim time, 1.47 f1 real time/f1 inst.
2722.73 seconds user time, 244.05 seconds system time.
467.43 f1 instructions/CPU sec.
338.00 pages max process size, 1.00 page faults.
f1> adios
2723.2u 244.3s 1:26:27 57% 41+125k 20+4io 10pf+0w