A Globally Asynchronous Locally Synchronous Configurable Array Architecture for Algorithm Embeddings

A thesis submitted for the degree of Doctor of Philosophy
Department of Computer Science
University of Edinburgh
August 1996

I declare that this thesis has been composed by myself and that the work described within this thesis is entirely my own except where clearly indicated otherwise in the text.

Bo Gao

Abstract

Advanced VLSI/ULSI technologies have made it possible to realise parallelism and pipelining processing principles at affordable cost. One consequence is that more and more algorithms are now implemented directly in hardware. The configurable hardware algorithm approach has the potential to combine the performance of hardware algorithms with the flexibility of software algorithms at the user level.

On the other hand, system timing design has become one of the determining factors in design complexity, correct system function and high performance. This timing problem plays an even more important role in configurable systems. There are two typical approaches to system timing control design: synchronous timing design and asynchronous timing design. This thesis investigates and demonstrates the idea and feasibility of applying asynchronous timing control at the system level and synchronous timing control to the modules that compose the system, namely a Globally Asynchronous Locally Synchronous (GALS) design approach, for very large scale configurable hardware algorithms.

A systematic approach has been adopted in this thesis to develop a configurable GALS array architecture. Based on an analysis of general algorithmic properties, a novel multiple threads computation model is first established. It consists of an architecture with a pool of programmable hardware operators having configurable interconnections, together with a GALS system timing control structure.
The multiple threads computation model bridges algorithms and the architecture for efficient algorithm embeddings, and the GALS timing control makes this threads model practical. A novel and fast event-driven GALS data transfer interface is developed, upon which a bit-serial configurable GALS array system for algorithm embeddings is designed. Good average performance results are obtained with a polynomial evaluation algorithm embedded as a frame buffer. The work on the GALS system timing design principle can be easily extended to the design of general GALS systems.

Acknowledgements

The work described in this thesis was inspired by Dr Thomas Kean's work on cellular Configurable Array Logic (CAL). Many useful suggestions were also obtained from direct discussions with Dr Thomas Kean.

Dr David Rees has very patiently supervised me and given me a lot of encouragement all the way through my research work. He also carefully proofread the draft of this thesis and made many useful comments. Special thanks also go to Professor David Kinniment of the University of Newcastle upon Tyne, who helped me to clarify the synchronisation issue discussed in this thesis.

I would like to thank the Department of Computer Science of Edinburgh University, where an excellent research environment and computing facilities are provided, and from where I have obtained substantial knowledge of computer architectures and programming skills.

I would also like to acknowledge the support of the Sino-British Friendship Scholarship Scheme, which made my research in Britain a reality.
Table of Contents

1. Introduction
    1.1 Computing Systems
    1.2 Algorithms
        1.2.1 Software Solutions
        1.2.2 Hardware Solutions
        1.2.3 Parallelism and Pipelining
    1.3 Regular and Modular Architectures
        1.3.1 Granularity of Array Element
        1.3.2 Array Configurability
        1.3.3 Array System Timing and Control
    1.4 Overview of the Thesis

2. Massively Parallel Computing Systems
    2.1 Cellular Logic Image Processor
    2.2 Distributed Array Processor
    2.3 Massively Parallel Processor
    2.4 Connection Machine
    2.5 Adaptive Array Processor
    2.6 A Data-Driven VLSI Array
    2.7 Reconfigurable Arithmetic Processor
    2.8 Reconfigurable Parallel Array Processor
    2.9 Field Programmable Gate Arrays
    2.10 Cellular Array Logic
    2.11 Comparisons and Remarks
    2.12 Impacts on Configurable Hardware Algorithms
        2.12.1 Circuit Switching vs. Packet Switching
        2.12.2 PE Local Memory
        2.12.3 PE Degree
        2.12.4 PE Functionality
        2.12.5 System Timing Control Strategies
    2.13 Summary

3. Algorithmically Configurable Architectures
    3.1 Towards Algorithmically Structured Systems
    3.2 Hardware Algorithms
    3.3 Computation Architectures
        3.3.1 Dimensionality and Connectivity
        3.3.2 Configuration Methods
    3.4 Computation Models for Hardware Algorithms
        3.4.1 Combinational Hardware Algorithms
        3.4.2 Systolic Algorithms
        3.4.3 Computational Wavefronts
        3.4.4 Non Control-Driven Computations
        3.4.5 Multiple Threads Computations
    3.5 Timing Control Structures
        3.5.1 Clocks and Clock Skews
        3.5.2 Computing without Clocks
        3.5.3 Separately Timed Communications and Computations
        3.5.4 Communicating Synchronous Logic Modules
    3.6 Algorithm Embeddings
    3.7 Summary

4. A Configurable GALS Array
    4.1 Basic Architecture Constraints
        4.1.1 Architecture Regularity
        4.1.2 Architecture Scalability
        4.1.3 Communication Overheads
    4.2 System Level Physical Topology
        4.2.1 Interstitial of a Switch Lattice and PE Array
        4.2.2 Linearisation
        4.2.3 Overlapped Communications and Computations
        4.2.4 Aggregated Switched Communication Network
        4.2.5 A Pseudo Nearest Neighbour Configurable Array
    4.3 The GALS Scheme in PNCA
        4.3.1 Synchronous Regions in PNCA
        4.3.2 Communicating Synchronous PH,s
        4.3.3 A Configurable GALS Array
    4.4 RC and PH,
        4.4.1 DFG Computation Properties
        4.4.2 The Routing Cell
        4.4.3 The Programmable H,
    4.5 PH,, Local Memory
    4.6 Summary

5. An Implementation of a GALSA
    5.1 Design Tools and Implementation Technology
    5.2 The Configuration Technique
    5.3 Asynchronous Data Transfer Interface
        5.3.1 Hand-Shaking Cycle
        5.3.2 Data Status Signal
        5.3.3 Event-Driven Hand-Shaking
        5.3.4 An Event-Driven Register Transfer Interface
    5.4 The Implementation of a PH,
        5.4.1 The Clock Management Unit
        5.4.2 An Event-Driven General GALS Logic Module
        5.4.3 The PH and I/O Selector
        5.4.4 The Execution Code Register
        5.4.5 Multiplexers
    5.5 The Routing Network
        5.5.1 Switches
        5.5.2 The Configuration Control Memory
        5.5.3 The Routing Cell
        5.5.4 Routing Channel Buffers
    5.6 A GALSA System
        5.6.1 The Pre-loading Circuits
        5.6.2 GALS Array I/O Interface
    5.7 Testability
    5.8 Summary

6. Example Algorithms and Simulation Results
    6.1 Typical Timing Characteristics
        6.1.1 Simulation and Measurement Conditions
        6.1.2 The Tri-state Register and the GALS DTI
        6.1.3 The Transmission Gate Adder and Multiplexers
        6.1.4 The Routing Cell and Channel Buffer
        6.1.5 Array Element Test
        6.1.6 Configuration Test
    6.2 A 4 x 4 Multiplier in a GALSA
        6.2.1 Integer Multiplication
        6.2.2 Embedding the 4 x 4 Array Multiplier into a GALSA
    6.3 A Seven Segment Display Decoder
    6.4 Evaluation of Polynomial Expressions
        6.4.1 Display of Pixels for Different Objects
        6.4.2 Polynomials in Single Variable
        6.4.3 Polynomials
