An Integrated Multiprocessor for Matrix Algorithms
Warren Marwood B.Sc., B.E.

A thesis submitted to the Department of Electrical and Electronic Engineering, the University of Adelaide, to meet the requirements for award of the degree of Doctor of Philosophy.

June 1994

Contents

Abstract
Statement of Originality
Acknowledgements
Publications
List of Figures
List of Tables

Chapter 1: The Evolution of Computers
1.1 Scalar Architectures
1.2 Vector Architectures
1.3 Parallel Architectures
1.3.1 Ring Architectures
1.3.2 The Two-dimensional Mesh
1.3.3 The Three-dimensional Mesh
1.3.4 The Hypercube
1.4 Massively Parallel Computers: a Summary
1.5 The MATRISC Processor
1.5.1 Architecture
1.5.2 Performance/Algorithms
1.6 Summary

Chapter 2: A Review of Systolic Processors
2.1 The Inner-product-step Processor
2.2 A Systolic Cell for Polynomial and DFT Evaluation
2.3 The Engagement Processor
2.4 Algorithms
2.4.1 Convolution
2.4.2 Finite Impulse Response (FIR) Filters
2.4.3 Discrete Fourier Transform
2.5 Heterogeneous Systolic Arrays
2.5.1 LU Decomposition
2.5.2 Solution of Triangular Linear Systems
2.5.3 A Systolic Array for Adaptive Beamforming
2.6 Systolic Arrays for Data Formatting
2.7 Bit-Level Systolic Arrays
2.7.1 Beamforming at the Bit-level
2.7.2 Optimisation for Binary Quantised Data
2.8 Configurable Systolic Architectures
2.8.1 The Configurable Highly Parallel (CHiP) Computer
2.8.2 The Programmable Systolic Chip (PSC)
2.9 Mapping Algorithms to Architectures
2.9.1 Software
2.10 Systolic Processors/Coprocessors
2.11 Summary

Chapter 3: Systolic Ring Processors
3.1 A Systolic Multiplier
3.1.1 A Multiplier Model
3.1.2 A Systolic Ring Multiplier
3.2 Parallel Digit Multiplication
3.2.1 Digit-Serial Integer Multiplication
3.2.2 Digit-Serial Floating Point Multiplication
3.3 Coalescence of Systolic Ring Multipliers
3.4 A Ring Accumulator
3.4.1 Floating Point Addition/Accumulation
3.4.2 A Simplified Algorithm
3.5 Coalescence of Ring Accumulators
3.6 A Systolic Ring Floating Point Multiplier/Accumulator
3.6.1 Coalescence of Ring Multiply/Accumulators
3.7 Discussion

Chapter 4: The MATRISC Processor
4.1 A Lattice Model
4.1.1 Implications for Matrix Processors
4.1.2 Analysis of a Rectangular Systolic Array Processor
4.1.3 Optimisation of a Square Systolic Array
4.1.4 Bandwidth Considerations
4.2 A Generalised Matrix Address Generator
4.2.1 The Address Generator Architecture
4.2.2 An Implementation
4.2.3 Area and Time Considerations
4.2.4 Matrix Addressing Examples
4.2.5 Address Generation in Higher Dimensions
4.3 The MATRISC Architecture
4.4 Discussion

Chapter 5: MATRISC Matrix Algorithms
5.1 Matrix Primitives
5.1.1 Matrix Multiplication
5.1.2 Element-wise Operations
5.1.3 Matrix Transposition/Permutation
5.2 Matrix Algorithms
5.2.1 FIR Filtering, Convolution and Correlation
5.2.2 QR Factorisation
5.2.3 The Discrete Fourier Transform
5.3 A MATRISC Performance Model and its Validation
5.3.1 Matrix Multiplication
5.3.2 The FIR Algorithm
5.3.3 A BLAS Matrix Primitive
5.4 A MATRISC Processor and its Performance
5.4.1 Matrix Multiplication
5.4.2 FIR Filters, Convolution and Correlation
5.4.3 The BLAS SGEMM() subroutine
5.4.4 The QR Factorisation
5.5 A Constant Bandwidth Array

Chapter 6: The Fourier and Hartley Transforms
6.1 The Discrete Fourier Transform
6.2 Linear Mappings Between One and Two Dimensions
6.2.1 Relatively Prime
6.2.2 Common Factor
6.3 Two-dimensional Mappings for a One-dimensional DFT
6.3.1 The Prime Factor Case
6.3.2 The Common Factor Case
6.4 p-dimensional Mappings and the One-dimensional DFT
6.4.1 Particular Implementations
6.4.2 A Recursive Implementation
6.5 Performance
6.5.1 Addition/Subtraction
6.5.2 Hadamard or Schur (Elementwise) Multiplication
6.5.3 Matrix Multiplication
6.5.4 The Prime Factor Algorithm
6.5.5 The Common-Factor Algorithm
6.6 The Discrete Hartley Transform
6.7 Two-dimensional Mappings for a One-dimensional DHT
6.7.1 The Prime Factor Case
6.7.2 The Common Factor Case
6.8 Performance
6.8.1 The Prime-Factor Algorithm
6.8.2 The Common-Factor Algorithm
6.8.3 Comparison
6.9 A Multi-dimensional Transform Example
6.9.1 The Multi-dimensional Prime Factor cos Transform
6.10 Summary

Chapter 7: Implementation Studies in Si and GaAs
7.1 Silicon: The SCalable Array Processor (SCAP)
7.1.1 System Architecture
7.1.2 Hardware
7.1.3 The Data Formatter Chip
7.1.4 The Processing Element Chip
7.1.5 Software
7.2 Gallium Arsenide
7.2.1 Gallium Arsenide Technology
7.2.2 Detailed Circuit Design and Simulation
7.2.3 Test Equipment and Procedures
7.3 Interconnection Technology
7.3.1 Multi-chip Modules
7.3.2 Silicon Hybrids

Chapter 8: Summary, Future Trends and Conclusion
8.1 Summary
8.1.1 Implementation
8.1.2 Software
8.2 Current and Future Trends
8.2.1 The Matrix Product
8.3 MATRISC Multiprocessors: Teraflops Machines
8.4 Conclusion

Bibliography

Abstract

Current trends in supercomputer architectures indicate that massive parallelism will provide the means by which computers will achieve the fastest possible performance for arbitrary problems. For the particular case of signal processing, a study of the computationally intensive algorithms revealed their dependence on matrix operations in general, and the O(n^3) matrix product in particular. This thesis proposes a computer architecture for these algorithms. The architectural philosophy is to add hardware support for a limited set of matrix operations to the conventional Reduced or Complex Instruction Set Computer (RISC or CISC) architectures which are commonly used at multiprocessor nodes. The support for the matrix data type is provided with systolic array techniques. This philosophy parallels and extends that of the RISC computer architecture proposed in the 1980s, and as a consequence the new processor architecture proposed in this thesis is referred to as a MATrix Reduced Instruction Set Computer (MATRISC). The architecture is shown to offer outstanding performance for the class of problems which are expressible in terms of matrix algebra. The concepts of massive parallelism are applicable to arrays of MATRISC processors, each of which appears as a conventional machine in terms of both hardware and software. Tasks are partitioned into sub-tasks expressible in matrix form. This form embeds, or hides, a high level of parallelism within the matrix operators.
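To make the O(n^3) dependence concrete (this is the standard operation count for the dense matrix product, not a result specific to the thesis): for the product C = AB of two n x n matrices,

\[ c_{ij} = \sum_{k=1}^{n} a_{ik}\, b_{kj}, \qquad 1 \le i, j \le n, \]

so each of the n^2 output elements requires n multiply-accumulate steps, giving n^3 multiply-accumulates (approximately 2n^3 floating-point operations) per product. It is exactly this kernel that the systolic array at the heart of a MATRISC node is intended to execute in hardware.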
The work in this thesis is devoted to the architecture, implementation and performance of a MATRISC processing node. Specific advantages of the MATRISC architecture include:

1. the provision of orders of magnitude improvement in the peak computational performance of a multiprocessor processing node;
2. a simple object-oriented coding paradigm which follows traditional problem formulation processes and conventional von Neumann coding techniques;
3. a design method which controls complexity and allows the use of arbitrary numbers of processing elements in the implementation of MATRISC processors.

The restricted number of efficiently implemented matrix primitives provided in the MATRISC processor can be used to implement all of the higher-order matrix operators found in matrix algebra. As in the RISC philosophy, resources are concentrated on the primary set of operations or instructions which incur the highest penalties in execution. Also reflected from the RISC development is the integration of the software and hardware to produce a complete system. The novelty in the thesis is found in three major areas, and in the whole to which these areas contribute:

1. the design of low-complexity and low-transistor-count arithmetic units which perform floating point computations with variable precision and dynamic range, and which are designed to optimise the execution time of matrix operations;
2. the design of an address generator and systolic interface which extends the domain of application for the systolic computation engine;
3. the integration of the hardware into conventional software compilers.

Simulation results for the MATRISC processor are provided which give performance estimates for systems which can be implemented in current technologies. These technologies include both Gallium Arsenide and Silicon. In addition, a description of a concept demonstrator is provided which has been implemented in a 1.2 micron CMOS process. This concept demonstrator has been installed in a SUN SPARCstation 1. The simulator code is verified by comparing simulation predictions with measured performance of the concept demonstrator. The simulator parameters are then modified to describe a typical system which is being implemented in current technologies. These simulation results show that nodal processing rates which exceed 5 gigaflops are achievable at single processor nodes which use current memory technologies in non-interleaved structures.
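The coding paradigm and the restricted primitive set described above can be illustrated with a minimal C++ sketch. All names below (Matrix, transpose, gram) and the scalar loop bodies are illustrative assumptions, not the thesis's actual software interface; on a MATRISC node each overloaded operator would instead dispatch to the corresponding hardware primitive (matrix product, element-wise operation, transposition/permutation).

#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side matrix type. Overloaded operators follow the
// mathematical problem formulation; the loops are only stand-ins for
// the systolic hardware primitives.
struct Matrix {
    std::size_t r, c;
    std::vector<double> v;                        // row-major storage
    Matrix(std::size_t r_, std::size_t c_) : r(r_), c(c_), v(r_ * c_, 0.0) {}
    double& operator()(std::size_t i, std::size_t j)       { return v[i * c + j]; }
    double  operator()(std::size_t i, std::size_t j) const { return v[i * c + j]; }
};

// Matrix product primitive: the O(n^3) kernel the systolic array executes.
Matrix operator*(const Matrix& a, const Matrix& b) {
    assert(a.c == b.r);
    Matrix out(a.r, b.c);
    for (std::size_t i = 0; i < a.r; ++i)
        for (std::size_t j = 0; j < b.c; ++j)
            for (std::size_t k = 0; k < a.c; ++k)
                out(i, j) += a(i, k) * b(k, j);
    return out;
}

// Transposition/permutation primitive.
Matrix transpose(const Matrix& a) {
    Matrix t(a.c, a.r);
    for (std::size_t i = 0; i < a.r; ++i)
        for (std::size_t j = 0; j < a.c; ++j)
            t(j, i) = a(i, j);
    return t;
}

// A higher-order operator composed entirely from the primitive set:
// the Gram matrix A^T A that arises in least-squares formulations.
Matrix gram(const Matrix& a) { return transpose(a) * a; }

In this style a task is partitioned into matrix-valued sub-expressions, and the parallelism stays hidden inside each operator, which is the property the abstract attributes to the MATRISC programming model.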