CMU-CS-85-180

A Data-Driven Multiprocessor for Switch-Level Simulation of VLSI Circuits

Edward Harrison Frank

November, 1985

Department of Computer Science
Carnegie-Mellon University
Pittsburgh, PA 15213

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science at Carnegie-Mellon University.

Copyright © 1985 Edward H. Frank

This research was sponsored by the Defense Advanced Research Projects Agency (DOD), ARPA Order No. 3597, monitored by the Air Force Avionics Laboratory under Contract F33615-81-K-1539, and by the Fannie and John Hertz Foundation. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency, the US Government, or the Hertz Foundation.

Abstract

In this dissertation I describe the algorithms, architecture, and performance of a computer called the FAST-1, a special-purpose machine for switch-level simulation of VLSI circuits. The FAST-1 does not implement a previously existing simulation algorithm. Rather, its simulation algorithm and its architecture were developed together. The FAST-1 is data-driven, which means that the flow of data determines which instructions to execute next. Data-driven execution has several important attributes: it implements event-driven simulation in a natural way, and it makes parallelism easier to exploit.

Although the architecture described in this dissertation has yet to be implemented in hardware, it has itself been simulated using a 'software implementation' that allows performance to be measured in terms of read-modify-write memory cycles.
The software-implemented FAST-1 runs at speeds comparable to other software-implemented switch-level simulators. Thus it was possible to collect an extensive set of experimental performance results of the FAST-1 simulating actual circuits, including some with over twenty thousand transistors. These measurements indicate that a hardware-implemented, uniprocessor FAST-1 offers several orders of magnitude speedup over software-implemented simulators running on conventional computers built using similar technology.

Additional speedup over a uniprocessor can be obtained using a FAST-1 multiprocessor, which is constructed from multiple FAST-1 uniprocessors interconnected by one or more broadcast busses. In order for a FAST-1 multiprocessor to exploit the parallelism available in simulations, the FAST-1 representation of circuits must be carefully partitioned onto the processors. Although even simple versions of the partitioning problem are NP-complete, I show that an additional order of magnitude speedup can be obtained by using a multiprocessor FAST-1 and fast heuristic partitioning algorithms.

Acknowledgements

The completion of this work owes much to many people. Throughout my years at CMU, my advisor, Bob Sproull, has provided guidance, friendship, understanding, and good ideas as needed. My dear friend and officemate, Carl Ebeling, has spent many hours listening to me instead of doing his own work. Both Dr. Bob's and Carl's contributions to this research, and to my stay at CMU, are immeasurable. My thesis committee, Randal Bryant, Al Davis, Marc Raibert, Alfred Spector, and Robert Sproull, provided the proper amount of help and criticism, at the proper times. I thank them all for carefully reading this dissertation in finite time. The CMU VLSI project, originally directed by Sproull, and now being run by HT Kung, provided the overall context in which this work was conducted.
Many project members, in particular Allan Fisher and Hank Walker, have offered many good ideas and been good listeners. Several other people have given aid at important times: Rob Mathews of Silicon Solutions Corp. graciously allowed me to simulate the SSC Filter chip, and Dan Perkins, also of SSC, spent several hours working with me in order to get their test vectors to work with my simulator. Marco Annaratone provided the CMOS adder circuit, and Thomas Anantharaman provided the multiplier circuit. Many of the other circuits were given to me by Carl Ebeling. Ivan Sutherland provided many useful comments on an early draft of the thesis.

Though the CMU Computer Science Department has grown and changed since I first came here many years ago, it is still a wonderful place with great resources, both human and computational. As with most people who come from the West, I was pleasantly surprised by Pittsburgh, although it still needs a real place to ski and some real lakes. During my studies at CMU, I was supported by a Fannie and John Hertz Foundation Fellowship, for which I am most grateful.

I am most indebted to my family who, throughout my life, have provided unending love and support. My wife, Sarah Ratchye, has endured the hard times, enjoyed the good times, and along with our daughter, Whitton Anne, has made this all worthwhile.

Table of Contents

Acknowledgements

I Introduction
I.1. Background and Motivation
I.1.1. Why Machines for VLSI Simulation?
I.1.2. Algorithms, Architecture, and Implementation
I.2. A Simple Simulation Algorithm and a Simple Simulation Machine
I.2.1. An Event-Driven Simulation Algorithm
I.2.2. The Fast-1 Simulation Machine
I.3. The Organization of the Dissertation
I.4. The Contributions of this Research
I.5. A Final Note

II Related Work
II.1. Data-Driven Computers
II.2. Multiprocessors and Interconnection Networks
II.2.1. MIMD Machines
II.2.2. SIMD Machines
II.3. Simulation Algorithms
II.3.1.
A Brief Survey of Digital Simulation Techniques
II.3.2. Switch-level Simulation Algorithms
II.4. Simulation Machines
II.4.1. Logic-Level Machines
II.4.2. Switch-level Machines
II.4.3. Circuit-Level Machines
II.5. Partitioning

III A Switch-Level Simulation Algorithm
III.1. Notation
III.2. A Switch-Level Model of MOS Circuits
III.2.1. Signals
III.2.2. Transistors
III.2.3. Nodes
III.2.4. Strengths and Sizes
III.2.5. Actual Signal Models
III.2.6. Modeling Threshold Drops
III.3. Determining the Steady State of a Network
III.3.1. An Incorrect Switch-Level Simulation Algorithm
III.3.2. The Fast-1 Switch-Level Simulation Algorithm
III.3.3. The Correctness and Complexity of the Simulation Algorithm
III.3.4. Delay
III.3.5. Initialization
III.3.6. Optimizations
III.4. Compiling Circuits into Simulations
III.5. Other Issues
III.5.1. Multi-level Simulation
III.5.2. Fault Simulation

IV The Architecture of a Fast-1 Uniprocessor
IV.1. Uniprocessor Architecture
IV.1.1. Instruction Definition
IV.1.2. Instruction Execution
IV.1.3. Implementing Algorithm III-4 Using the Fast-1: A Summary
IV.1.4. Other Issues
IV.2. Implementation
IV.2.1. The Datapaths of a Fast-1 Processor
IV.2.2. Keeping Track of Executable Instructions
IV.2.3. Fixed-Width versus Variable-Width Instructions
IV.2.4. The Impact of Technology
IV.2.5. Reliability
IV.2.6. Other Issues

V Uniprocessor Experiments
V.1. The Circuits
V.2. The Software Implementation of the Fast-1 MOS Simulator
V.3. Static Measurements
V.3.1. Transistors, Nodes, and the Distribution of Instructions
V.3.2. Fan-In and Fan-Out
V.3.3. Sizes of Transistor Groups
V.3.4. Representing Bidirectional Transistors Using Two Unidirectional Transistor Instructions
V.3.5. Using Minimal Machines and Fan-in and Fan-out Trees
V.3.6.
The Effect of Finding Unidirectional Transistors and Eliminating One-input Nodes
V.4. Dynamic Measurements
V.4.1. The Base Case
V.4.2. The Effect of Changing the Representation of Circuits
V.4.3. The Effect of Optimizations
V.4.4. Using a Queue versus a Stack for Keeping Track of Executable Instructions
V.4.5. Parallelism in the Fast-1
V.4.6. Execution Time Estimates for Other Simulation Machine Architectures
V.4.7. Some Other Thoughts on Parallelism

VI Algorithms for Multiprocessor Simulation
VI.1. Multiprocessor Implementation of Algorithm III-4
VI.1.1. Implementation
VI.1.2. Correctness
VI.1.3. Performance Considerations
VI.2. Partitioning Algorithms
VI.2.1. The Complexity of Partitioning
VI.2.2. Practical Partitioning Algorithms

VII The Architecture of a Fast-1 Multiprocessor
VII.1. Approaches to Exploiting Parallelism
VII.2. A Multiprocessor Fast-1
VII.2.1. Processor Architecture Assuming Static Instruction Assignment
VII.2.2. Reorganizing Fan-out and Broadcasting
VII.2.3. Interconnect
VII.2.4. Multi-level Simulation

VIII Multiprocessor Experiments
VIII.1. An Outline of the Experiments
VIII.2. Speedup
VIII.3. Message Traffic
VIII.4. The Impact of Broadcasting

IX Conclusions
IX.1. Contributions
IX.2. Other Applications
IX.3. Future Work
IX.4. And Now a Word to Our Sponsor

References

A Circuit Descriptions
A.1. Adder
A.2.