Reconfigurable Architectures for General-Purpose Computing
Total Page:16
File Type:pdf, Size:1020Kb
MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY A.I. Technical Report No. 1586 October, 1996 Recon®gurable Architectures for General-Purpose Computing Andre DeHon [email protected] Abstract: General-purpose computing devices allow us to (1) customize computation after fabrication and (2) conserve area by reusing expensive active circuitry for different functions in time. We de®ne RP-space, a restricted domain of the general-purpose architectural space focussed on recon®gurable computing architectures. Two dominant features differentiate recon®gurable from special-purpose architectures and account for most of the area overhead associated with RP devices: (1) instructions which tell the device how to behave, and (2) ¯exible interconnect which supports task dependent data¯ow between operations. We can characterize RP-space by the allocation and structure of these resources and compare the ef®ciencies of architectural points across broad application characteristics. Conventional FPGAs fall at one extreme end of this space and their ef®ciency ranges over two orders of magnitude across the space of application characteristics. Understanding RP-space and its consequences allows us to pick the best architecture for a task and to search for more robust design points in the space. Our DPGA, a ®ne-grained computing device which adds small, on-chip instruction memories to FPGAs is one such design point. For typical logic applications and ®nite-state machines, a DPGA can implement tasks in one-third the area of a traditional FPGA. TSFPGA, a variant of the DPGA which focuses on heavily time-switched interconnect, achieves circuit densities close to the DPGA, while reducing typical physical mapping times from hours to seconds. Rigid, fabrication-time organization of instruction resources signi®cantly narrows the range of ef®ciency for conventional architectures. To avoid this performance brittleness, we developed MATRIX, the ®rst architecture to defer the binding of instruction resources until run-time, allowing the application to organize resources according to its needs. Our focus MATRIX design point is based on an array of 8-bit ALU and register-®le building blocks interconnected via a byte-wide network. With today's silicon, a single chip MATRIX array can deliver over 10 Gop/s (8-bit ops). On sample image processing tasks, we show that MATRIX yields 10-20 the computational density of conventional processors. Understanding the cost structure of RP-space helps us identify these intermediate architectural points and may provide useful insight more broadly in guiding our continual search for robust and ef®cient general-purpose computing structures. Acknowledgements: This report describes research done at the Arti®cial Intelligence Laboratory of the Mas- sachusetts Institute of Technology. This research is supported by the Advanced Research Projects Agency of the Department of Defense under Rome Labs contract number F30602-94-C-0252. Acknowledgments This effort grew out the intellectual backdrop of the Transit and Abacus projects. Years prototyping Transit machines with Tom Simon and his specialization philosophy set the stage for my initial interest in FPGAs for computing. The initial ideas for the DPGA grew out of dialogs with Mike Bolotski in which we tried to reconcile Abacus, his SIMD architecture which he described as ªa bunch of one-bit processors,º with FPGAs, which looked to me like ªa bunch of one-bit processors.º Tom Knight has been my research advisor since I was a junior. He has always encouraged me to focus on the big idea and has been supportive as I explored sometimes radical points of view. He gave me plenty of freedom to do the right thing, and hopefully, I have lived up to the con®dence and trust implied by that autonomy. The efforts of Jeremy Brown, Derrick Chen, Ian Eslick, Ethan Mirsky, and Edward Tau during and after 6.371 made the DPGA prototype possible. Ian's perseverance to ®nalize the layout and veri®cation was particular responsible for the completion of that effort. Ed and Ian both helped see the DPGA prototype through its ®nal postmortem. TSFPGA and MATRIX were both possible only because of Derrick Chen and Ethan Mirsky, the Master of Engineering students who respectively took ownership of the microarchitecture and VLSI portions of those designs. We were largely able to complement each other's efforts in our attempts to understand and develop these architectures. Discussion with Rich Lethin, Russ Tessier, and Jonathan Babb at MIT were useful in focusing in on the key issues which needed addressing. Regular interaction with the emerging recon®gurable computing community was valuable for encouragement and for identifying key problems and issues. Notably, discussions with Brad Taylor, Mike Butts, Brad Hutchings, Bill Magione-Smith, John Villasenor, Phil Kuekes, Steve Trimberger, Mike Smith, and Carl Ebling have been helpful in identifying the questions which need answers and cleaning up ideas for presentation. Thomas McDermott provided valuable feedback on the early chapters of this work. The availability of high-quality, experimental CAD tools in source form from universities made the experimental mapping work done here feasible. University of Toronto's Chortle provided a clean basis for several early experiments in DPGA synthesis. UC Berkeley's SIS was used for standard, technology independent circuit mapping. UC Berkeley's mustang was the workhorse behind multicontext FSM mapping. This research was supported by the Advanced Research Projects Agency of the Department of Defense under Rome Labs contract number F30602-94-C-0252. i Contents I Introduction and Background 1 1 Overview and Synopsis 2 1.1 Evolution of General-Purpose Computing with VLSI Technology 2 1.2 This Thesis 3 1.3 Recon®gurable Device Characteristics 5 1.4 Con®gurable, Programmable, and Fixed-Function Devices 5 1.5 Key Relations 7 1.6 New General-Purpose Architectures 8 1.7 Prognosis for the Future 11 2 Basics and Terminology 12 2.1 General-Purpose Computing 12 2.2 General-Purpose Computing Issues 13 2.2.1 Interconnect 13 2.2.2 Instructions 13 2.3 Programmables and Con®gurables 13 2.4 FPGA Introduction 15 2.5 Regular and Irregular Computing Tasks 17 2.6 Metrics: Density, Diversity, and Capacity 17 2.6.1 Functional Density 18 2.6.2 Functional Diversity ± Instruction Density 20 2.6.3 Data Density 20 3 Recon®gurable Computing Background 22 3.1 Successes of Recon®gurable Computing 22 3.1.1 Programmable Active Memories 22 3.1.2 Splash 22 3.1.3 PRISM 23 3.1.4 Logic Emulation 23 3.2 Lineage 23 3.3 Technological Enablers 24 ii II Empirical Review 26 4 Empirical Review of General Purpose Computing Architectures in the Age of MOS VLSI 27 4.1 Processors 27 4.2 VLIW Processors 33 4.3 Digital Signal Processors (DSPs) 34 4.4 Memories 35 4.5 Field-Programmable Gate Arrays (FPGAs) 41 4.6 Vector and SIMD Processors 44 4.7 Multimedia Processors 47 4.8 Multiple Context FPGAs 48 4.9 MIMD Processors 49 4.10 Recon®gurable ALUs 50 4.11 Summary 51 5 Case Study: Multiply 54 5.1 Custom Multipliers 54 5.2 Semicustom Multipliers 54 5.3 General-Purpose Multiply Implementations 56 5.4 Hardwired Functional Units in ªGeneral-Purpose Devicesº 57 5.5 Multiplication Granularity 58 5.6 Specialized Multiplication 58 5.7 Summary 59 6 High Diversity on Recon®gurables 60 III Structure and Composition of Recon®gurable Computing Devices 62 7 Interconnect 63 7.1 Dominant Area and Delay 63 7.1.1 Fixed Area 63 7.1.2 Interconnect and Con®guration Area 64 7.1.3 Delay 64 7.2 Problems with ªSimpleº Networks 66 7.2.1 Crossbars 66 7.2.2 Multistage Networks 67 7.2.3 Mesh Interconnect 67 7.3 Issues in Recon®gurable Network Design 68 7.4 Conventional Interconnect 69 7.5 Switch Requirements for FPGAs with 100-1000 LUTs 71 7.6 Channel and Wire Growth 72 7.6.1 Rent's Rule Based Hierarchical Interconnect Model 72 iii 7.6.2 Wire Growth in Rent Hierarchy Model 74 7.6.3 Switch Growth in Rent Hierarchy Model 75 7.7 Network Utilization Ef®ciency 79 7.8 Interconnect Description 89 7.8.1 Weak Upper Bound 89 7.8.2 Structure Based-Estimates 90 7.8.3 Signi®cance and Impact 92 7.8.4 Instruction Growth versus Interconnect Growth 94 7.9 Effects of Interconnect Granularity 97 7.9.1 Wiring 97 7.9.2 Switches 98 7.10 Summary 99 8 Instructions 100 8.1 General Case Example 100 8.2 Bits per Instruction 102 8.3 Compressing Instruction Stream Requirements 103 8.3.1 Wide Word Architectures 103 8.3.2 Broadcast Single Instruction to Multiple Compute Units 103 8.3.3 Locally Con®gure Instruction 103 8.3.4 Broadcast Instruction Identi®er, Lookup in Local Store 104 8.3.5 Encode Length by Likelihood 105 8.3.6 Mode Bits for Early Bound information 105 8.3.7 Themes 106 8.4 Compressibility 107 8.5 Control Streams 108 8.6 Instruction Stream Taxonomy 109 8.7 Summary 110 9 RP-space Area Model 111 9.1 Model and Assumptions 111 9.2 Peak Performance Density 114 9.3 Granularity 117 9.4 Contexts 121 9.5 Composition 124 9.6 Summary 128 IV New Architectures 129 10 Dynamically Programmable Gate Arrays 130 10.1 DPGA Introduction 132 10.2 Related Architectures 137 10.3 Realm of Application 138 iv 10.3.1 Limited Throughput Requirements 138 10.3.2 Latency Limited Designs 140 10.3.3 Temporally Varying or Data Dependent Functional Requirements 141 10.3.4 Multicontext versus Monolithic and Partial Recon®guration 141 10.4 A Prototype DPGA 145 10.4.1 Architecture 145 10.4.2 Implementation 150 10.4.3 Component Operation 157 10.4.4 Prototype Context Area Model 158 10.4.5 Prototype Conclusions 158 10.5 Circuit Evaluation 160 10.5.1