
An Integrated Multiprocessor for Matrix Algorithms

Warren Marwood

B.Sc., B.E.

A thesis submitted to the Department of Electrical and Electronic Engineering, the University of Adelaide, to meet the requirements for award of the degree of Doctor of Philosophy.

June 1994

Contents

Abstract vi

Statement of Originality viii

Acknowledgements ix

Publications x

List of Figures xiv

List of Tables xxi

Chapter 1: The Evolution of Computers 1
1.1 Scalar Architectures 1
1.2 Vector Architectures 5
1.3 Parallel Architectures 6
1.3.1 Ring Architectures 8
1.3.2 The Two-dimensional Mesh 9
1.3.3 The Three-dimensional Mesh 10
1.3.4 The Hypercube 11
1.4 Massively Parallel Computers: a Summary 12
1.5 The MATRISC Processor 14
1.5.1 Architecture 14
1.5.2 Performance/Algorithms 15
1.6 Summary 15

Chapter 2: A Review of Systolic Processors 17
2.1 The Inner-product-step Processor 19
2.2 A Systolic Array for Polynomial and DFT Evaluation 23
2.3 The Engagement Processor 25
2.4 Algorithms 26
2.4.1 Convolution 27
2.4.2 Finite Impulse Response (FIR) Filters 27
2.4.3 Discrete Fourier Transform 28
2.5 Heterogeneous Systolic Arrays 29
2.5.1 LU Decomposition 29
2.5.2 Solution of Triangular Linear Systems 30
2.5.3 A Systolic Array for Adaptive Beamforming 32
2.6 Systolic Arrays for Data Formatting 34
2.7 Bit Level Systolic Arrays 35
2.7.1 Beamforming at the Bit-level 37
2.7.2 Optimisation for Binary Quantised Data 40
2.8 Configurable Systolic Architectures 41
2.8.1 The Configurable Highly Parallel (CHiP) Computer 42
2.8.2 The Programmable Systolic Chip (PSC) 43
2.9 Mapping Algorithms to Architectures 44
2.9.1 Software 44
2.10 Systolic Processors 46
2.11 Summary 49

Chapter 3: Systolic Ring Processors 52
3.1 A Systolic Multiplier 52
3.1.1 A Multiplier Model 54
3.1.2 Systolic Ring Multiplier 59
3.2 Parallel Digit Multiplication 61
3.2.1 Digit-Serial Integer Multiplication 62
3.2.2 Digit-Serial Floating Point Multiplication 66
3.3 Coalescence of Systolic Ring Multipliers 68
3.4 Ring Accumulator 69
3.4.1 Floating Point Addition/Accumulation 69
3.4.2 A Simplified Algorithm 71
3.5 Coalescence of Ring Accumulators 77
3.6 A Systolic Ring Floating Point Multiplier/Accumulator 78
3.6.1 Coalescence of Ring Multiply/Accumulators 82
3.7 Discussion 82

Chapter 4: The MATRISC Processor 84
4.1 A Lattice Model 85
4.1.1 Implications for Matrix Processors 90
4.1.2 Analysis of a Rectangular Systolic Array Processor 91
4.1.3 Optimisation of a Square Systolic Array 93
4.1.4 Bandwidth Considerations 101
4.2 Generalised Matrix Address Generator 103
4.2.1 The Address Generator Architecture 104

4.2.2 An Implementation 108
4.2.3 Area and Time Considerations 109
4.2.4 Matrix Addressing Examples 109
4.2.5 Address Generation in Higher Dimensions 112
4.3 The MATRISC Architecture 114
4.4 Discussion 118

Chapter 5: MATRISC Matrix Algorithms 120
5.1 Matrix Primitives 121
5.1.1 Matrix Multiplication 121
5.1.2 Element-wise Operations 122
5.1.3 Matrix Transposition/Permutation 122
5.2 Matrix Algorithms 122
5.2.1 FIR Filtering, Convolution and Correlation 122
5.2.2 QR Factorisation 127
5.2.3 The Discrete Fourier Transform 129
5.3 A MATRISC Performance Model and its Validation 130
5.3.1 Matrix Multiplication 131
5.3.2 The FIR Algorithm 135
5.3.3 A BLAS Matrix Primitive 137
5.4 MATRISC Processor and its Performance 138
5.4.1 Matrix Multiplication 139
5.4.2 FIR Filters, Convolution and Correlation 142
5.4.3 The BLAS SGEMM() subroutine 143
5.4.4 The QR Factorisation 144
5.5 A Constant Bandwidth Array 144

Chapter 6: The Fourier and Hartley Transforms 147
6.1 The Discrete Fourier Transform 147
6.2 Linear Mappings Between One and Two Dimensions 149
6.2.1 Relatively Prime 150
6.2.2 Common Factor 150
6.3 Two-dimensional Mappings for a One-dimensional DFT 150
6.3.1 The Prime Factor Case 151
6.3.2 The Common Factor Case 158
6.4 p-dimensional Mappings and the One-dimensional DFT 160
6.4.1 Particular Implementations 162
6.4.2 A Recursive Implementation 164
6.5 Performance 165
6.5.1 Addition/Subtraction 166
6.5.2 Hadamard or Schur (Elementwise) Multiplication 167
6.5.3 Matrix Multiplication 168
6.5.4 The Prime Factor Algorithm 170
6.5.5 The Common-Factor Algorithm 173
6.6 The Discrete Hartley Transform 177
6.7 Two-dimensional Mappings for a One-dimensional DHT 178
6.7.1 The Prime Factor Case 179
6.7.2 The Common Factor Case 181
6.8 Performance 183
6.8.1 The Prime-Factor Algorithm 183
6.8.2 The Common-Factor Algorithm 183
6.8.3 Comparison 184
6.9 A Multi-dimensional Transform Example 184
6.9.1 The Multi-dimensional Prime Factor cas Transform 184
6.10 Summary 192

Chapter 7: Implementation Studies in Si and GaAs 193
7.1 Silicon: The SCalable Array Processor (SCAP) 193
7.1.1 System Architecture 194
7.1.2 Hardware 194
7.1.3 The Data Formatter Chip 195
7.1.4 The Processing Element Chip 199
7.1.5 Software 203
7.2 Gallium Arsenide 206
7.2.1 Gallium Arsenide Technology 208
7.2.2 Detailed Circuit Design and Simulation 215
7.2.3 Test Equipment and Procedures 225
7.3 Interconnection Technology 228
7.3.1 Multi-chip Modules 228
7.3.2 Silicon Hybrids 229

Chapter 8: Summary, Future Trends and Conclusion 231
8.1 Summary 231
8.1.1 Implementation 232
8.1.2 Software 233
8.2 Current and Future Trends 233
8.2.1 The Matrix Product 234
8.3 MATRISC Multiprocessors - Teraflops Machines 235
8.4 Conclusion 236

Bibliography 237

Abstract

Current trends in architectures indicate that massive parallelism will provide the means by which computers will achieve the fastest possible performance for arbitrary problems. For the particular case of signal processing, a study of the computationally intensive algorithms revealed their dependence on matrix operations in general, and the O(n³) matrix product in particular. This thesis proposes a processor architecture for these algorithms. The architectural philosophy is to add hardware support for a limited set of matrix operations to the conventional Reduced or Complex Instruction Set Computer (RISC or CISC) architectures which are commonly used at multiprocessor nodes. The support for the matrix data type is provided with systolic array techniques. This philosophy parallels and extends that of the RISC computer architecture proposed in the 1980's, and as a consequence the new processor architecture proposed in this thesis is referred to as a MATrix Reduced Instruction Set Computer (MATRISC). The architecture is shown to offer outstanding performance for the class of problems which are expressible in terms of matrix algebra. The concepts of massive parallelism are applicable to arrays of MATRISC processors, each of which appears as a conventional machine in terms of both hardware and software. Tasks are partitioned into sub-tasks expressible in matrix form. This form embeds, or hides, a high level of parallelism within the matrix operators. The work in this thesis is devoted to the architecture, implementation and performance of a MATRISC processing node.

Specific advantages of the MATRISC architecture include:

1. the provision of orders of magnitude improvement in the peak computational performance of a multiprocessor processing node;

2. a simple object-oriented coding paradigm which follows traditional problem formulation processes and conventional von Neumann coding techniques;

3. a design method which controls complexity and allows the use of arbitrary numbers of processing elements in the implementation of MATRISC processors.

The restricted number of efficiently implemented matrix primitives provided in the MATRISC processor can be used to implement all of the higher order matrix operators found in matrix algebra. As in the RISC philosophy, resources are concentrated on the primary set of operations or instructions which incur the highest penalties in execution. Also reflected from the RISC development is the integration of the software and hardware to produce a complete system.

The novelty in the thesis is found in three major areas, and in the whole to which these areas contribute:

1. the design of low complexity arithmetic units which perform floating point computations with variable precision and dynamic range, and which are designed to optimise the execution time of matrix operations;

2. the design of an address generator and systolic interface which extends the domain of application for the systolic computation engine;

3. the integration of the hardware into conventional software compilers.

Simulation results for the MATRISC processor are provided which give performance estimates for systems which can be implemented in current technologies. These technologies include both Gallium Arsenide and Silicon. In addition, a description of a concept demonstrator is provided which has been implemented in a 1.2 micron CMOS process. This concept demonstrator has been implemented in a SUN SPARCstation 1. The simulator code is verified by comparing simulation predictions with measured performance of the concept demonstrator. The simulator parameters are then modified to describe a typical system which is being implemented in current technologies. These simulation results show that nodal processing rates which exceed 5 gigaflops are achievable at single processor nodes which use current memory technologies in non-interleaved structures.

It is concluded that the extremely high performance of MATRISC processors makes possible the construction of parallel computers with processing capabilities in excess of one teraflops.

Statement of Originality

I hereby declare that this thesis contains no material which has been accepted for the award of any other degree or diploma in any University and that, to the best of my knowledge and belief, this thesis contains no material previously published or written by another person, except where due reference is made in the text of this thesis. I also consent to this thesis being made available for photocopying and loan.

Warren Marwood 30 June, 1994

Acknowledgements

The author is indebted to a number of people without whom this thesis would not have been possible. In particular I wish to thank Mr. A.P. Clarke for his encouragement, support and contributions and my supervisors Dr. K. Eshraghian and Dr. C.C. Lim for their advice, enthusiasm and encouragement.

The philosophies and concepts which are presented in this thesis are the subject of two implementation studies, one carried out by the Defence Science and Technology Organisation (DSTO), and the other by The Centre for GaAs VLSI Technology at the University of Adelaide. The silicon studies carried out at the DSTO have been done with Mr. A.P. Clarke and have culminated in the development of the SCalable Array Processor (SCAP). The design of both the SCAP CMOS chips and board sub-systems were the responsibility of Dr. R.J. Clarke and Mr. I.A. Curtis, of RADLogic Pty. Ltd. The contract to construct the SCAP processors was managed by the Acquisition and Logistics Organisation in the Department of Defence. The development at the University of Adelaide of GaAs transistor models, circuits, chip design and testing which is reported in this thesis was carried out by Mr A. Beaumont-Smith with some assistance from the author. Mr. T. Shaw has contributed to the study of a number of memory architectures for systolic processors, and the performance of multiprocessors constructed from high performance systolic MATRISC nodes.

The support of the Australian Research Council, the Sir Ross and Sir Keith Smith Foundation and The Centre for Gallium Arsenide VLSI Technology is gratefully acknowledged.

Finally I wish to thank my wife Janne, and children Tania, Anita and Damien without whose forbearance the task would not have been possible.

Publications

[1] Marwood, W., "Multiprocessor Architecture-Theory and Application", Draft Master of Engineering Thesis, University of Adelaide, February 1988.

[2] Marwood, W. and Clarke, A.P., "Overhead Penalties in Dynamically Reconfigurable Arrays of Processing Elements", Journal of Electrical and Electronics Engineering, Australia, June 1987.

[3] Marwood, W. and Clarke, A.P., "A Systolic Filter Implemented with Two Asynchronous Multiprocessors", Journal of Microcomputer Applications, February 1989.

[4] Marwood, W. and Clarke, A.P., "A Matrix Product Machine and the Fourier Transform", IEE Proceedings, Vol. 137, Pt. G, No. 4, pp. 295-301, August 1990.

[5] Marwood, W., "A Generalised Systolic Ring Serial Floating Point Multiplier", Electronics Letters, Vol. 26, No. 11, pp. 753-754, 24 May 1990.

[6] Reinhold, O. and Marwood, W., "Silicon Hybrids - A Technique for 'Zero-Defect' Wafer-Scale Processors", Journal of Electrical and Electronics Engineering, Australia, IEAust. & IREE Aust., Vol. 11, No. 3, September 1991.

[7] Clarke, A.P. and Marwood, W., "A Generalised Prime Factor Expression for the Fourier Transform of a One-dimensional Data Sequence", submitted to Signal Processing.

[8] Clarke, A.P. and Marwood, W., "On Signal Processor System Design", I.E.Aust. National Conference Publication No. 8715, April 1987.

[9] Marwood, W. and Clarke, A.P., "On Computing Fourier Transforms Using a Matrix Product Machine", Conference Proceedings, Microelectronics '88, Sydney University, May 16-18 1988.

[10] Marwood, W. and Clarke, A.P., "A Co-Processor with Supercomputer Capabilities for Personal Computers", ICCD '88, IEEE International Conference on Computer Design: VLSI in Computers & Processors, pp. 468-477, Rye Town Hilton, Rye Brook, New York, October 3-5, 1988.

[11] Marwood, W., Steed, M.D. and Woon, I.M., "A CMOS Implementation of a Generic Time-Domain Beamformer", Proceedings of Australian Symposium on Signal Processing and Applications, ASSPA 89, Adelaide, Australia, pp. 178-182, 17-19 April 1989.

[12] Marwood, W. and Clarke, A.P., "The Implementation and Application of a SCAM - A Systolic Configurable Array Module", I.E.Aust. Conference, Microelectronics '89, Brisbane, Australia, 1989.

[13] Reinhold, O. and Marwood, W., "Silicon Hybrids - A Technique for 'Zero-Defect' Wafer-Scale Processors", I.E.Aust. Conference, Microelectronics '89, Brisbane, Australia, 1989.

[14] Marwood, W. and Lim, C.C., "A GaAs Systolic Processor for Implementing a Kalman Filter", IREE and I.E.Aust. Conference, Microelectronics '90, Australia, 1990.

[15] Clarke, A.P. and Marwood, W., "Implementation of FIR Filters on a Systolic Array", ISSPA 90, The Second International Symposium on Signal Processing and its Applications, pp. 111-114, Gold Coast, Australia, August 27-31, 1990.

[16] Beaumont-Smith, A., Marwood, W., Lim, C.C. and Eshraghian, K., "Ultra High Speed Gallium Arsenide Systems: Design Methodology, CAD Tools and Architectures", I.E.Aust. Microelectronics Conference, 1991.

[17] Marwood, W. and Beaumont-Smith, A., "The Architecture and Optimisation of Systolic Ring Processors", IEEE Region 10 Conference, Tencon 92, Melbourne, Australia, pp. 735-739, 11-13 November 1992.

[18] Clarke, R.J., Curtis, I.A., Clarke, A.P. and Marwood, W., "A Floating Point Rectangular Systolic Array on a Chip", IEEE Region 10 Conference, Tencon 92, pp. 1008-1012, 11-13 November 1992.

[19] Marwood, W. and Beaumont-Smith, A., "The Implementation of a Generalised Systolic Serial Floating Point Multiplier", APCCAS '92, IEEE, IREE and IEAust Asia-Pacific Conference on Circuits and Systems, Sydney, NSW, Australia, pp. 513-518, 8-11 December 1992.

[20] Clarke, R.J., Curtis, I.A., Marwood, W. and Clarke, A.P., "A Floating Point Matrix Arithmetic Processor: An Implementation of the SCAP Concept", APCCAS '92, IEEE, IREE and IEAust Asia-Pacific Conference on Circuits and Systems, Sydney, NSW, Australia, pp. 519-524, 8-11 December 1992.

[21] Marwood, W. and Clarke, A.P., "A Prototype Scalable Array Processor (SCAP)", ICSPAT '93, The International Conference on Signal Processing Applications and Technology, Santa Clara, California, USA, pp. 1619-1628, September 28 - October 1, 1993.

[22] Clarke, A.P. and Marwood, W., "A Generalised Cosine Transform", Parallel Computing and Transputers (PCAT-93), Proceedings of the 6th Australian Transputer and Occam User Group Conference, Brisbane, Australia, 3-4 November 1993, pp. 150-157, IOS Press, 1994.

[23] Marwood, W. and Clarke, A.P., "Measurement of the Performance of a Matrix Processor (SCAP) in a SUN SPARCstation 1", Proceedings, 11th Australian Microelectronics Conference, pp. 153-158, Marriott Surfers Paradise Resort, Queensland, Australia, October 5-8 1993.

[24] Marwood, W., Shaw, T., Liebelt, M. and Eshraghian, K., "A Data Controller for a Systolic Outer Product Engine", Proceedings, 11th Australian Microelectronics Conference, pp. 221-226, Marriott Surfers Paradise Resort, Queensland, Australia, October 5-8 1993.

[25] Beaumont-Smith, A., Marwood, W., Eshraghian, K. and Lim, C.C., "The Gallium Arsenide Implementation of a Floating Point Processing Element", Proceedings, 11th Australian Microelectronics Conference, pp. 255-260, Marriott Surfers Paradise Resort, Queensland, Australia, October 5-8 1993.

[26] Beaumont-Smith, A., Marwood, W., Eshraghian, K. and Lim, C.C., "A 2.5 GFLOPS Gallium Arsenide Floating Point Systolic Array Matrix Processor", Internal Report, Dept. Electrical & Electronic Engineering, University of Adelaide, 1993.

Patents Pending

[27] Marwood, W., "A Number Theory Mapping Generator for Addressing Matrix Structures", June 1990.

[28] Curtis, I.A., Clarke, R.J., Clarke, A.P. and Marwood, W., "Data Formatter", November 1992.

[29] Clarke, R.J., Clarke, A.P. and Marwood, W., "Systolic Dimensionless Array", November 1992.

List of Figures

1.1 The architecture of Babbage's Analytical Engine. 2

1.2 A multiprocessor architecture. 7

1.3 A ring architecture. 8

1.4 A two-dimensional mesh of processing elements. 9

1.5 The interconnection topology for the AMT DAP processor. 9

1.6 The MasPar Xnet interconnection topology. 10

1.7 A three-dimensional mesh of processing elements. 11

1.8 A 4-dimensional hypercube. 11

1.9 A fat tree with processing elements at the leaf nodes. 12

2.1 The inner-product-step processor. 19

2.2 The linear systolic array. 20

2.3 Data movement for the matrix-vector product on the linear array. 21

2.4 The hexagonally-connected systolic array. 22

2.5 Band matrix multiplication on the hexagonally-connected array. 22

2.6 Computation of Fourier coefficients with a linear systolic array. 23

2.7 The inner product accumulate systolic cell. 25

2.8 A 3 x 3 engagement processor with two input matrices A and B. 26

2.9 The hex-connected array for LU decomposition. There are two processor types in this array. 29

2.10 The linear heterogeneous systolic array for triangular systems. 31

2.11 The systolic array cells used for the extraction of least squares residuals. 32

2.12 The triangular systolic array organisation for the extraction of least squares residuals. 33

2.13 The triangular systolic array and linear array for QR system solving. 33

2.14 The systolic cell for format conversion. 34

2.15 A systolic array of format conversion cells. 35

2.16 The systolic cell for bit-level correlation. 36

2.17 A systolic array for bit-level correlation. 36

2.18 Evaluation of the elements of the main diagonal of a matrix product. 40

2.19 The CHiP configurable array as a mesh and a binary tree. Processors are indicated by squares and switches by circles. The root node of the binary tree is the node containing R. 42

2.20 A number of systolic arrays which can be implemented with arrays of Programmable Systolic Chips. 43

2.21 The Warp machine. 45

3.1 Machine I/O and operand format. 54

3.2 The systolic floating point multiplier cell. 56

3.3 Operand movement through a four cell linear array of recurrence cells. 57

3.4 Operand movement through a four cell linear array of recurrence cells using the modified recurrences for Y. 58

3.5 The improved systolic floating point multiplier cell. 59

3.6 A systolic ring multiplier with three possible operand formats. 60

3.7 A pipelined four-bit digit multiplier. 65

3.8 A pipelined four-bit digit multiplier optimised for both area and critical path. 66

3.9 A digit-serial multiplier cell. 67

3.10 The coalescence of two adjacent systolic ring multipliers. The I/O of the second ceases to be used when the coalescence has occurred. 68

3.11 A systolic de-normalisation cell. 75

3.12 A systolic ring accumulator. 76

3.13 A double length ring accumulator formed by the coalescence of two adjacent rings. 78

3.14 The systolic ring multiply/accumulate processor. 79

3.15 The systolic multiply/accumulate cell. 80

3.16 The logical function of the systolic cell during multiplication. 80

3.17 The logical function of the systolic cell during de-normalisation. 81

4.1 A schematic diagram of an arbitrary lattice algorithm. 86

4.2 A conformal two-dimensional matrix product. 88

4.3 Evaluation of the area-time product. 97

4.4 Evaluation of the partial derivative ∂(ApT)/∂rk. 98

4.5 Comparison of the continuous model with a model constrained to integral cell numbers and delays for m = 32 and e = 16. 99

4.6 A comparison of model results for m = 64 and e = 16. 100

4.7 A comparison of model results for m = 128 and e = 16. 100

4.8 The bandwidth constraint for an active area of 10^6 gates. 101

4.9 The bandwidth constraint for an active area of 10^7 gates. 102

4.10 The bandwidth constraint for an active area of 10^8 gates. 102

4.11 Schematic of a matrix address generator. 105

4.12 Schematic of a modified RISC/CISC processor. 114

4.13 Schematic of a modified Harvard architecture matrix RISC/CISC processor. The matrix capabilities are added in the form of a co-processor. 115

4.14 Schematic of an integrated matrix RISC/CISC processor. 116

4.15 A schematic of the entry of operand wavefronts to a processor array. 117

5.1 MATRISC code to implement FIR filters. 126

5.2 A schematic representation of the functional components of the software simulator. 130

5.3 Simulated performance for the AB product with the A matrix in row major order. 132

5.4 Simulated performance for the AB product with the A matrix in transposed row major order. 133

5.5 Simulated performance for the AB product with the A matrix in column major order. 133

5.6 Simulated performance for the AB product with the A matrix in transposed column major order. 134

5.7 Simulated and experimental performance for the AB product. The A matrix is in row major order and the B matrix is circulant. 135

5.8 Simulated and experimental performance for an FIR filter on an order N MATRISC processor. 136

5.9 A comparison of the performance of the SCAP processor and the MATRISC model of SCAP when implementing the SGEMM() subroutine. 138

5.10 The performance of a 40 x 40 MATRISC processor when executing matrix multiplication. A write-through cache is used. 140

5.11 The performance of a 40 x 40 MATRISC processor when executing matrix multiplication. A write-back cache is used. 140

5.12 The performance of a 40 x 40 MATRISC processor when executing matrix multiplication with a circulant B matrix. A write-through cache is used. 141

5.13 Simulated performance for FIR filters implemented on an order 40 MATRISC processor. 142

5.14 The performance of the MATRISC processor when executing the SGEMM() subroutine. 143

5.15 The block QR factorisation algorithm implemented on the MATRISC architecture. 144

5.16 A constant bandwidth systolic array for the MATRISC architecture. 145

5.17 The QR algorithm implemented on a constant bandwidth MATRISC architecture. 145

6.1 A surface plot representing the execution times for a constant bandwidth implementation of a two-dimensional factorisation of the complex Fourier transform algorithm. 170

6.2 A 2D point plot representing the execution times for a constant bandwidth implementation of a two-dimensional factorisation of the complex Fourier transform algorithm. 171

6.3 A 2D surface plot representing the execution times for a variable bandwidth implementation of a two-dimensional factorisation of the complex Fourier transform algorithm. 171

6.4 A point plot representing the execution times for a variable bandwidth implementation of a two-dimensional factorisation of the complex Fourier transform algorithm. 172

6.5 A comparison of the performance of a constant bandwidth 40 x 40 array, a variable bandwidth 40 x 40 array and a conventional complex FFT algorithm. 172

6.6 The performance benefit of two- and three-dimensional factorisation algorithms when compared with the FFT algorithm. 173

6.7 The performance benefit of three- and four-dimensional factorisation algorithms when compared with the FFT algorithm. 173

6.8 A comparison between the execution times of common factor and prime factor transforms of varying lengths and the conventional FFT algorithm. 174

6.9 A comparison between the execution times of the Cooley-Tukey FFT and a two-dimensional factorisation of the Fourier transform implemented on a complex MATRISC processor. 176

6.10 A comparison between the execution times of the Cooley-Tukey FFT and three- and four-dimensional factorisations of the Fourier transform implemented on a complex MATRISC processor. 176

6.11 A comparison between the execution times of prime-factor and common-factor implementations of the Hartley transform. The real FFT execution time is included as a reference. 184

6.12 A comparison between the execution times of a real Cooley-Tukey FFT and two- and three-dimensional factorisations of the cas transform implemented on a real MATRISC processor. 186

6.13 A comparison between the execution times of a real Cooley-Tukey FFT and three- and four-dimensional factorisations of the cas transform implemented on a real MATRISC processor. 186

6.14 A comparison between the execution times of the two-dimensional cas transform, and the derived real Fourier and Hartley transforms implemented on a real MATRISC processor. The real Cooley-Tukey FFT time is included for comparison. 189

6.15 A comparison between the execution times of the three-dimensional cas transform, and the derived Hartley transform implemented on a real MATRISC processor. The real Cooley-Tukey FFT time is included for comparison. 190

7.1 A schematic diagram of the data formatter. 195

7.2 A micrograph of a data formatter chip. 196

7.3 A schematic diagram of a processing element. 199

7.4 A micrograph of a processing element chip. 200

7.5 Measured execution speed (Mflops) for a SCAP enhanced SUN SPARCstation 1 as a function of matrix order for real matrices. 204

7.6 A schematic of the GaAs systolic ring processing element, and the associated data format. 207

7.7 DCFL Inverter (a), 2 input NOR gate (b) and equivalent circuit (c). 210

7.8 Propagation delay of a DCFL Inverter as a function of fan-out (capacitive load). 211

7.9 Average Noise Margin of a 3 input DCFL NOR gate as a function of W. 212

7.10 SDCFL Inverter (a), SDCFL Inverter with extra supply (b) and equivalent circuit (c). 213

7.11 OR-AND-INVERT (OAI) logic structure. 213

7.12 SBFL Inverter. 214

7.13 Schematic of a six NOR-gate data flip-flop with clear. 216

7.14 Ring notation of a GaAs data flip-flop with clear or preset. 216

7.15 Layout of a GaAs data flip-flop using ring notation. 216

7.16 SPICE simulation of a GaAs data flip-flop. 217

7.17 The clock generator architecture. 218

7.18 A SPICE simulation of the clock generator output. 218

7.19 A schematic diagram of the full adder carry generation. 220

7.20 A schematic diagram of the full adder sum generation. 220

7.21 A schematic diagram of the digit-serial multiplier. 221

7.22 SPICE simulation of the critical path through the digit-serial multiplier. 222

7.23 A schematic of the systolic ring layout. 223

7.24 A floor plan of the GaAs systolic ring processing element chip. 223

7.25 A micrograph of the GaAs systolic processing element. 224

7.26 Clock input (channel 1) and output (channel 2) waveforms. 226

7.27 Variation of clock frequency with power supply voltage for a 7-gate ring. 227

7.28 Variation of clock frequency with power supply voltage for a 13-gate ring. 227

7.29 DAS output from a functional test of the first systolic cell. 228

7.30 A photograph of a 400 Mflops multi-chip module (actual size). 229

8.1 Performance of the SCAP 1.2 μm system for matrix products. 234

8.2 Performance of a MATRISC 0.5/0.2 μm system for matrix products. 235

List of Tables

1.1 Characteristics of vector supercomputers in the 1980's [Dong87]. 5

1.2 Characteristics of vector supercomputers in the 1990's [Kaha92]. 6

1.3 Flynn's taxonomy of computer architectures. 6

1.4 Characteristics of current massively parallel computers. 13

4.1 Area and time metrics for characteristic circuits in a particular E-D GaAs process. 94

4.2 Values for A1, A2 and g which implement three different mappings from a one-dimensional to a two-dimensional space. 107

4.3 State table for the address generator PLA. 108

7.1 GaAs logic family characteristics. 212

7.2 Clock control signals. 219

7.3 Systolic cell instruction codes. 222

7.4 Table of simulated and observed clock frequencies for chip #2. 226

Chapter 1

The Evolution of Computers

1.1 Scalar Architectures

The design and construction of the first adding machine is attributed to the French mathematician, Blaise Pascal, in 1642 [Lund90]. A general purpose calculating machine was subsequently designed by Leibniz in 1671 which was capable of addition, subtraction, multiplication and division. The German engineer Johann Müller invented a universal calculating engine in June 1782, and by June 20, 1784, its construction was completed [Lund90, pp. 65-66]. Müller developed his ideas for automatic calculations, and proposed in the same year that a machine could be built "... which would simultaneously print in printers' ink on paper any arbitrary arithmetical progression in natural numbers ...". The idea was formally presented with a description of his universal calculating machine in a book published in 1786. According to Lundgren, it is clear that Johann Müller had invented the difference engine by the year 1786. Charles Babbage produced his first difference engine in 1822, and from 1823 to 1833 worked on the production of a difference engine for the automatic computation and printing of tables. Funding for this work of £17,000 was provided by the British government. Whether Babbage was aware of any of the details of Müller's work when he devised his difference engine is open to some conjecture [Lund90]. The difference engine was never completed, but after its failure Babbage conceived the idea of a far more powerful machine. Known as the Analytical Engine, it consisted of an arithmetic unit, a memory and a printing unit. It was programmable with cards similar to those used in the Jacquard weaving looms. This machine was designed to perform data dependent operations, and could print its results. Lundgren states that recent work [Brom82] confirms the functionality of the design, and its

design principles make it the predecessor of the current computer [Wilkes75] and [Metro80].

The first successful construction of a difference engine capable of printing the results of its calculations was carried out by Georg and Edvard Scheutz of Sweden. Inspired by the work of Babbage, the machine was completed in 1843. It was capable of third order difference computations. A second machine was completed in 1853 which was capable of fourth order computations, and in 1859 a third machine with fifth order differences was finished within three weeks of its scheduled completion date [Lund90].

Figure 1.1: The architecture of Babbage's Analytical Engine.

The architecture of Babbage's Analytical Engine is shown in figure 1.1. This machine was intended to operate on scalar quantities, and constitutes the first programmable scalar computer. One of the prime tasks for this computer was to be the automatic computation of logarithmic tables for navigation. This was a computationally intensive task which in the 19th century was prone to error. Much of the support for Babbage's work was forthcoming because errors in logarithmic tables were likely to cause navigational errors, and subsequent shipping losses. In the one and a half centuries since Babbage used the most advanced machining technologies to build his mechanical computer, a series of successively more specialised technologies have been developed for the implementation of faster computers. As in 1834, the support for the development of both technologies and architectures is still found where a cost benefit is to be obtained from a faster and more reliable answer to a computationally intensive problem.

Babbage's architecture has been enhanced, expanded, modified and in some cases even discarded (e.g. neural networks). However, his architecture can still be used as a basis for the description of the scalar computing nodes found in most current

processors. The twelve decades of performance improvement which distinguish the projected computer architectures of today from the Analytical Engine have been found in the symbiotic relationships which exist between the three areas of technology, architecture and algorithm specification (software).

One of the earliest examples of an on-line, real-time data processing and computing system is an automatic totalisator installed at the Ellerslie racetrack in Auckland, New Zealand [Sw87]. Developed by George Julius in Australia, the machine performed all the necessary computations to maintain a real-time tote, providing a display of the current odds and printing tickets recording details of each bet. The machine was implemented with mechanical technology.

The electro-mechanical Harvard Mark I computer or IBM Automatic Sequence Controlled Calculator was constructed and assembled by IBM, and was dedicated at Harvard on 7 August 1944. Constructed to the specification of Howard Aiken, it was the first large scale fully automatic general purpose digital computer to be built [Wilkes85], [Co90]. The Electronic Numerical Integrator and Computer (ENIAC), containing over 18000 vacuum tubes and occupying a room approximately 40 feet by 20 feet, required 150 kW of power and was functional in 1945. Wilkes defines this machine as the world's first electronic computer. He states that its designers, Presper Eckert and John Mauchly, had advanced a long way towards the design of a far more efficient machine when John von Neumann was given permission to visit the ENIAC. Von Neumann subsequently wrote a report on the design of the Electronic Discrete Variable Automatic Computer (EDVAC) which contained the principles on which modern computers have been based: the stored program concept with the same store for numbers and instructions, the serial execution of instructions and the use of binary switching circuits for both computation and control. The Harvard Mark I computer was designed with separate stores for control and data as was Babbage's Analytical Engine. This architecture is used today for computers such as signal processors which are designed to maximise data throughput. General purpose computers evolved during the 1950's and 1960's using the von Neumann stored-program model which unified the program and data memories. The performance of a computer implemented with this architecture is limited by the rate at which instructions and data can be read and written to the memory. This limitation is known as the von Neumann bottleneck. As the Harvard architecture has two memories, it has the potential to execute at twice the rate of a machine constructed with only one, provided the program needs approximately equivalent bandwidths for both code and data. This

observation led to the use of a modified Harvard architecture in which a common instruction/data memory is used with local cache memories for both instructions and data.

M.V. Wilkes, in the early 1950's, proposed the use of microcode to ease the programming problem for arbitrary machines. A microcoded machine is designed in an hierarchical manner. High level instructions from the main memory are read by the instruction fetch unit of the central processing unit (CPU). They are then executed in a number of micro-cycles to implement the desired function. A microcoded machine is programmed at the instruction level by the designer, and at an arbitrary meta-level by the compiler writer or assembler programmer. One of the goals of the approach is to allow different underlying hardware to implement a common set of instructions which are well suited to the implementation of high level languages. The most sophisticated example of a computer designed with this philosophy is the Digital Equipment Corporation VAX 11/780 computer released in the late 1970's.

Beginning in the late 1970's, and continuing to the present, research has been conducted into the design of Reduced Instruction Set Computer (RISC) architectures.

Examples are [Kate85] and [Pa85]. The work was motivated by analysis of compiler-generated computer programs which demonstrated that there was high utilisation of only a small subset of the available Complex Instruction Set Computer (CISC) instructions. This result led to the design of optimised computer architectures which implemented only the subset of instructions containing the most frequently used instructions together with those instructions which were required for efficient high level program execution. In general, RISC processors carry out complex operations not with specialised instructions, but with sequences of simple instructions. The need to control the complexity of the VLSI design task provided additional incentive for this approach as a RISC chip design is simpler than a CISC design. This design simplification resulted in both faster design times and improved yields. As on-chip bandwidths are always greater than off-chip bandwidths, the potential for computation is maximised by constraining the processor architecture to a single chip. The RISC philosophy meant that the area of the chip dedicated to control fell from over 50% to about 10% [Kate85]. This area was then available for the primary function of computation. For the RISC philosophy to succeed, it was found necessary to extend the capability of high level language compilers, as more optimisation had to be performed during the compilation phase of the program.

As the level of integration has risen with shrinking geometries in VLSI processes, RISC processors have been incorporating more capabilities. An example is the inclusion of floating point execution units which can operate in parallel with the conventional integer unit. Further enhancements have led to the capability to execute multiple instructions in each clock cycle. This requires both multiple execution units and extended data paths. These processors are known as superscalar.

1.2 Vector Architectures

Machine              Cycle Time (ns)   Processors   Peak Mflops   Mflops per Proc
Amdahl 500                 7.5              1            133            133
CRAY-1                    12.5              1            160            160
CRAY X-MP-1                9.5              1            210            210
IBM 3090/VF-200           18.5              2            216            108
Amdahl 1100                7.5              1            267            267
NEC SX-1E                  7                1            325            325
CDC CYBER 205             20                1            400            400
CRAY X-MP-2                9.5              2            420            210
IBM 3090/VF-400           18.5              4            432            108
Amdahl 1200                7.5              1            533            533
NEC SX-1                   7                1            650            650
CRAY X-MP-4                9.5              4            840            210
Hitachi S-810/20          14                1            840            840
NEC SX-2                   6                1           1300           1300
CRAY 2                     4.1              4           2000            500

Table 1.1: Characteristics of vector supercomputers in the 1980's [Dong87].

To enhance the speed of the digital computer, particularly for signal processing, a range of techniques have been explored both in the laboratory and by innovative computer companies. The most successful of these techniques prior to 1989, when a massively parallel computer exceeded the computational capability of the vector supercomputers, has been the combination of vector processing with limited parallelism. Vector techniques were preferred for two reasons. The first was the high cost of arithmetic units: it was more cost effective to construct a few fast arithmetic units than to produce many units. The second reason was the difficulty of programming massively parallel processors. The Cray range of supercomputers has until recently epitomised vector processing. The Cray 1, introduced in 1976, was the first Cray supercomputer.

In [Dong87] Dongarra discusses the performance of a number of computers. Table 1.1 lists the cycle time and peak computation speed of those which could exceed 100 Mflops (Million Floating point Operations Per Second).

Machine           Cycle Time (ns)   Processors   Peak Mflops
HITAC S-3800            2                4           32000
NEC SX-3R               4                4           25600
Fujitsu VP2000          3.2              2           10000
CRAY C90                4.2             16           16000
IBM ES/9000             -                -            2670

Table 1.2: Characteristics of vector supercomputers in the 1990's [Kaha92].

The performance of current vector supercomputers is given in table 1.2.

1.3 Parallel Architectures

Parallel processing has always promised major performance benefits for computer users when compared with the capabilities of both scalar and vector processors. However, despite its promise the parallel processors which have been built have never been able to successfully compete with conventional processors implemented in newer technologies. This has been due to the relative complexity of both the hardware and software of the parallel machines. As a result, scalar and vector processors have dominated the computing industry as they have provided more cost effective solutions to the need for increased computation rates. For parallelism to succeed in surpassing the capabilities of the scalar and vector machines it has been necessary to define a parallel architecture which:

1. utilises new technology as readily as scalar machines;

2. does not require major software redevelopment;

3. provides benefits when compared with a scalar implementation.

                 Single Instruction    Multiple Instruction
Single Data      SISD (von Neumann)    MISD
Multiple Data    SIMD (DAP)            MIMD (iPSC/2)

Table 1.3: Flynn's taxonomy of computer architectures.

In the following sections, reference is made to a number of computer architectures using the notation and classification of Flynn [Flyn72]. Flynn's taxonomy of computer architectures categorised computers into four classes. The classes are determined by the 'single' or 'multiple' nature of both the instruction and data streams.

Massively parallel computer architectures are all characterised by many parallel computing nodes, and most use MIMD architectures. Instead of the four to sixteen processors which have been common in past vector and scalar architectures, future massively parallel processors (MPP) will have hundreds to thousands of processors.

Massively parallel architectures are not new. The concept of parallel processing was first explored seriously by the architects of the ILLIAC IV at the University of Illinois in the late 1960's and early 1970's [BaBr68], [Ku68]. Although designed for 256 processors in four arrays of 64, the implementation was limited to 64. The construction of the ILLIAC IV was technologically demanding, yet its performance was soon outstripped by conventional architectures. It was a SIMD machine, in contrast to the MIMD trend of today. MPP computers became serious competitors for the supercomputer manufacturers when the Connection Machine exceeded the performance of traditional supercomputers in 1989.

Figure 1.2: A shared memory multiprocessor architecture.

Commercial acceptance of the viability of MPP architectures is demonstrated by Japan's first massively parallel computer which was announced by Fujitsu. Known as the Vector Parallel Processor (VPP) 500, it can be configured with between seven and

222 processing nodes each with a peak processing capacity of 1.6 gigaflops [Zorp93], a figure regarded by Zorpette as "barely believable". The theoretical peak performance of this machine is 355 gigaflops. The projected performance of some of these massively

parallel computers approaches what has been described as the Holy Grail of computing: one teraflops, or 10^12 floating point operations per second.
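The quoted theoretical peak follows directly from the node count and per-node rate: 222 nodes × 1.6 gigaflops per node = 355.2 gigaflops, which is the 355 gigaflops figure given above.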

The shared-memory model shown in figure 1.2 provides an evolutionary path for both users and software developers to follow, as it is the same model used by mainframe and vector supercomputer manufacturers. Zorpette [Zorp92] lists one massively parallel computer on the market which implements this model, the Kendall Square Research KSR1, and claims that Cray Research, Convex Computer and Tera Computer are among those following the lead. This model has not been used in the past. To utilise parallel processing machines efficiently it has been necessary for the programmer to map the problem onto the machine architecture in a way which suited the interconnection strategy of the machine. If this were not done, the execution efficiency of the code was in general very poor. Architectures which have seen some commercial success are presented below.

1.3.1 Ring Architectures

Figure 1.3: A ring architecture.

Ring architectures with processors and communication paths arranged as shown in figure 1.3 are applicable to a range of problems. The KSR1, a MIMD processor marketed by Kendall Square Research Inc., is constructed as a hierarchy of rings [Zorp92]. As stated above, this machine is the first to support a shared memory model for ease of programming. Up to 1088 distributed nodes, each with 32 Mbytes of memory, are arranged to provide a single large address space. One megabyte (Mbyte) of memory is 10^6 bytes.

1.3.2 The Two-dimensional Mesh

Figure 1.4: A two-dimensional mesh of processing elements.

Figure 1.4 illustrates the two-dimensional mesh structure which has been used in a wide variety of computers including the ILLIAC IV.

Figure 1.5: The interconnection topology for the AMT DAP processor.

Commercial examples of this topology have been produced by both Active Memory Technology (AMT) and MasPar. AMT, a spin-off company from ICL, produced a SIMD processor known as the Distributed Array Processor (DAP). A number of implementations have been provided, all based upon a mesh array of simple processing elements (PEs). Each PE has a local memory, and in later versions a co-processor was added to enhance the floating point capability of the machine. The DAP is hosted by a conventional general purpose processor such as the VAX or a SUN workstation. The connection between the host workstation and the DAP array is handled by a

Motorola processor. Figure 1.5 provides a schematic of the PE interconnection scheme. Applications of the DAP are discussed in [Li88].

Figure 1.6: The MasPar Xnet interconnection topology.

The MasPar MP-1 processor is hosted by a DEC VAX workstation. The architecture of the processor is similar to the DAP in that the simple 4-bit processors, which are grouped into local 4 x 4 clusters, have local memory, and are arranged in a rectangular, two-dimensional lattice. Communications between processing elements (PEs) are achieved with an X-net communications mesh which is oriented at 45° to the PE lattice. An additional global router is provided which allows any processor to communicate with any other processor. The structure of the PE interconnection is shown in figure 1.6. The X-net allows communications from any PE to any of the adjacent eight PEs. On the array boundary the connection network is directed back to the opposite side of the array in the form of a torus.
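The wrap-around addressing implied by the torus can be made concrete with a small sketch. The C fragment below is an illustration only, not MasPar's implementation; the lattice dimensions are hypothetical, and the eight offsets correspond to the eight X-net directions. It prints the neighbours of a PE sitting on the array boundary, with out-of-range coordinates wrapped to the opposite side.

#include <stdio.h>

/* Hypothetical lattice dimensions; the MP-1 itself used much larger arrays. */
#define ROWS 4
#define COLS 4

/* Wrap a coordinate onto the torus: -1 maps to the last index, ROWS maps to 0. */
static int wrap(int v, int n) { return (v % n + n) % n; }

int main(void)
{
    int r = 0, c = 0;                       /* a PE on the array boundary */
    /* The eight X-net directions: N, NE, E, SE, S, SW, W, NW. */
    const int dr[8] = {-1, -1, 0, 1, 1,  1,  0, -1};
    const int dc[8] = { 0,  1, 1, 1, 0, -1, -1, -1};

    for (int k = 0; k < 8; k++)
        printf("neighbour %d: (%d, %d)\n", k,
               wrap(r + dr[k], ROWS), wrap(c + dc[k], COLS));
    return 0;
}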

1.3.3 The Three-dimensional Mesh

The following figure illustrates the three-dimensional mesh structure which is being used in Cray's massively parallel processor T3D. The processing elements in the 3D array will be connected in a torus topology to minimise the distances and hence communication delays between the processors [Zorp93].

Figure 1.7: A three-dimensional mesh of processing elements.

1.3.4 The Hypercube

Figure 1.8: A 4-dimensional hypercube.

Figure 1.8 illustrates a 4-dimensional hypercube. The hypercube connection topology has been used by many researchers. Proposed by Seitz [Se85], the topology has been used in the Connection Machine. Advantages of hypercubes are that the number of processors grows exponentially with the number of connections per node, and efficient algorithms are known for routing messages between arbitrary processors in the network.
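One of the routing algorithms alluded to above is dimension-order (e-cube) routing, in which a message is forwarded along each dimension where the current node's address differs from the destination's. The C sketch below is a generic illustration of the idea, not a description of the Connection Machine's router; node addresses are simply d-bit integers.

#include <stdio.h>

/* Route a message from src to dst in a d-dimensional hypercube by correcting
   one differing address bit (one hop across that dimension) at a time. */
static void ecube_route(unsigned src, unsigned dst, int d)
{
    unsigned node = src;
    printf("%u", node);
    for (int bit = 0; bit < d; bit++) {
        unsigned diff = (node ^ dst) & (1u << bit);
        if (diff) {                 /* addresses differ in this dimension */
            node ^= diff;           /* take one hop across it */
            printf(" -> %u", node);
        }
    }
    printf("\n");
}

int main(void)
{
    ecube_route(0u, 13u, 4);        /* e.g. node 0000 to node 1101 in a 4-cube */
    return 0;
}

The number of hops taken equals the Hamming distance between the two addresses, which is at most the dimension d of the cube.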

Figure 1.9: A fat tree with processing elements at the leaf nodes.

Based on the needs of the Artificial Intelligence and symbolic processing researchers, Thinking Machines Corporation (TMC) was formed in the early 1980's. The machine which was subsequently produced was the Connection Machine (CM-1). It is a SIMD processor with between 4096 and 65536 simple processors, each with local memory. Every 32 processor group optionally has access to a fast floating point accelerator. As with the AMT and the MasPar machines, the SIMD array is hosted by a SUN or DEC workstation. For each group of sixteen processors a router node is provided which is connected to other routing nodes with a hypercube topology. In the early TMC processors a two-dimensional nearest neighbour connection was provided between PEs as well as a hypercube-based connection technique. In the CM-2 processor, the two-dimensional network has been replaced by an n-dimensional network with

1 ≤ n < 32 [Trew90]. The CM-5 model connects processors with a fat tree in which the processors lie on the leaf nodes of a tree and the communication paths form the branches of the tree. The closer the links are to the processors, the lower is their total bandwidth requirement. The node processors consist of a maximum of four SUN SPARC processors with a peak computation rate of 128 Mflops [Zorp92]. Graphically this is represented as a tree in which the branches are thickest at the root as shown in figure 1.9.

1.4 Massively Parallel Computers: a Summary

In 1989 Thinking Machines Corporation won the IEEE Gordon Bell Award. This award is an annual prize given to machines judged to have made significant advances in scientific and engineering computing. The award was given in both categories of raw performance and best price/performance. The application was a seismic modelling program executing on a 65536 processor Connection Machine model CM-2.

The performance achieved was a sustained 5.6 Gflops where one Gflops is 10^9 floating point operations per second.

Machine                         Architecture   Node Processor       Peak Mflops per Node
Intel XP/S                      MIMD           Intel i860           50
Kendall Square Research KSR1    MIMD           Custom RISC          40
MasPar MP-1                     SIMD           Custom               1.2 per chip
Meiko Computing Surface         MIMD           i860/SPARC           40 (i860)
nCube 2S                        MIMD           Custom               2.4
Parsytec GC                     MIMD           Transputer T9000     25
Thinking Machines CM-5          MIMD           SPARC                128 (4 vector units)
Wavetracer                      SIMD           Custom bit-serial    -

Table 1.4: Characteristics of current massively parallel computers.

The success of an MPP in both categories indicated that the future lay in the further development of both multiprocessor architectures and parallel processing software. Some current developments in the field of massively parallel processors are provided in table 1.4. The predominance of the MIMD architecture is obvious. As the Intel i860, the SUN SPARC and the Transputer T9000 can all be considered RISC processors, it is apparent that RISC architectures are preferred for the processing nodes.

Alliant Computer Systems, until they halted operations, marketed a multiprocessor computer system based on fast scalar and vector processors. The modes of operation of their Fortran compiler are representative of the way in which programs are currently scheduled on massively parallel processors. Code is executed using the following mechanisms:

1. scalar;

2. vector;

3. concurrent scalar in which the loops execute in parallel;

4. concurrent vector in which the loops are broken into vectorised chunks;

5. concurrent outer, vector inner in which the innermost loop is executed in vector mode and the outer levels are used to schedule parallel operations.

It was necessary for the compiler to analyse the source code at all levels. Vector operations are constructed from sets of scalar operations, and scheduled accordingly.
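The last of these modes can be pictured with a simple loop nest. The C sketch below is an illustration of the scheduling idea only, not Alliant compiler output; the OpenMP directive is a modern stand-in for the concurrency mechanism. The outer loop over rows is distributed across processors, while each inner loop is a regular stride-1 reduction that a vectorising compiler can map onto vector operations.

#include <stddef.h>

/* y[i] = sum_j a[i][j] * x[j], with a stored in row-major order. */
void matvec(size_t n, size_t m, const double *a, const double *x, double *y)
{
    #pragma omp parallel for            /* concurrent outer loop: one row per task */
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < m; j++)  /* vectorisable inner loop */
            sum += a[i * m + j] * x[j];
        y[i] = sum;
    }
}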

1.5 The MATRISC Processor

As can be seen from the preceding sections, there has been a historical restriction of computer architectures to those which support scalar and vector operations. This is no longer a restriction imposed by technological limitations, as is evidenced by the more recent array processors. However, the current trend is to massively parallel processors based on RISC nodes, with some support for vector processing. A specific example of such an architecture is the Hybrid Array Ring Processor (HARP) of

Dowling et al. [DoFu93] which is based on a host processor, shared memory and a bi-directional systolic ring of RISC processors.

1.5.1 Architecture

In contrast to the current trend, this thesis proposes a simple and elegant two-dimensional systolic hardware extension to conventional processors which provides major performance benefits for matrix operations. It is proposed that the arithmetic units of the systolic array should themselves be novel systolic ring architectures which implement a variable precision and dynamic range floating point inner-product-accumulate algorithm. These systolic rings are to be used in an optimised processor array dedicated to implementing a restricted set of matrix primitives, with the highest performance operator being the matrix product. In addition to the construction of the processor array, a novel data formatting/addressing algorithm and implementation has been proposed which substantially extends the domain of application of the systolic processing engine. The resulting processor, referred to as a MATrix Reduced Instruction Set Computer (MATRISC) due to a number of similarities to the RISC computer philosophies, has been simulated and a range of algorithms implemented on the simulator. It is shown that a variation of the conventional two-dimensional array can be constructed which utilises a constant memory bandwidth. This array is readily constructed using the variable speed characteristics of the systolic ring processors, and is shown to improve the performance of MATRISC processors when executing algorithms with matrix dimensions smaller than the dimensions of the hardware.

1.5.2 Performance/Algorithms

The problem of mapping algorithms to architectures is minimised by restricting the architecture to those operations which lie within the formal mathematical framework of matrices. In contrast to conventional parallel processor architectures, the programmer does not have to match a particular problem to the architecture. Rather, he must find a matrix algorithm to describe the problem. By restricting the algorithm definition to that of matrix algebra, the architecture of the machine is transparent to the programmer. Some of the unique attributes of the address generation hardware extend the area of application by supporting mappings from one to many dimensions.
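As a generic example of recasting a problem in matrix terms (an illustration only; the thesis's own FIR formulation appears in chapter 5), an FIR filter can be written as a matrix-vector product in which each row of the data matrix holds a window of input samples. The C sketch below builds such a Toeplitz-structured matrix explicitly and forms the product; the sample and coefficient values are arbitrary, and the coefficients are applied without time reversal (reversing them gives convolution).

#include <stdio.h>

#define NTAPS 3
#define NSAMP 8
#define NOUT  (NSAMP - NTAPS + 1)

/* FIR filtering written as y = X h, where row i of X holds x[i] .. x[i+NTAPS-1]. */
int main(void)
{
    const double h[NTAPS] = {0.25, 0.5, 0.25};           /* illustrative taps   */
    const double x[NSAMP] = {1, 2, 3, 4, 5, 6, 7, 8};    /* illustrative input  */
    double X[NOUT][NTAPS], y[NOUT];

    for (int i = 0; i < NOUT; i++)
        for (int j = 0; j < NTAPS; j++)
            X[i][j] = x[i + j];                          /* Toeplitz data matrix */

    for (int i = 0; i < NOUT; i++) {
        y[i] = 0.0;
        for (int j = 0; j < NTAPS; j++)
            y[i] += X[i][j] * h[j];                      /* one row of X times h */
        printf("y[%d] = %g\n", i, y[i]);
    }
    return 0;
}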

The performance of a single MATRISC node is studied and examples of the implementation of a range of algorithms are given. The range includes efficient implementations of most of the systolic algorithms reviewed. Particular examples are finite impulse response (FIR) filtering, convolution, correlation and QR factorisation. Linear transform examples which have been given include the Fourier, Hartley and cas transforms. It is shown that the MATRISC processor is faster than the conventional Cooley-Tukey algorithm for the fast Fourier transform (FFT). The discrete Fourier transform (DFT) has been rewritten in an optimum form for execution on the MATRISC processor, and it is shown that to preserve the advantage over the conventional FFT, higher dimensional factorisations are required as the transform length increases.

1.6 Summary

Following a review of the evolution of computers, the absence of directly supported matrix processors has been identified. A systolic matrix architecture is proposed as the next evolutionary architectural step for computers which are used for matrix based algorithms. The proposed acronym for this processor is MATRISC, representing both the RISC/CISC machine into which matrix capabilities have been integrated, and the philosophies from which the architecture is derived. Results presented in later chapters show that the performance enhancements which are possible in single MATRISC nodes are substantial. These enhancements make it evident that the use of MATRISC nodes in MPP architectures will provide major benefits in computing

power for those processors. The MATRISC concept is demonstrated to be feasible in current technology.

The following chapter contains a review of a number of systolic processor developments which have been used as a basis for the development of the matrix component of the MATRISC processor.

Chapter 2

A Review of Systolic Processors

'A systolic system is a network of processors which rhythmically compute and pass data through the system. Physiologists use the word "systole" to refer to the rhythmically recurrent contraction of the heart and arteries which pulses blood through the body. In a systolic computing system, the function of a processor is analogous to that of the heart. Every processor regularly pumps data in and out, each time performing some short computation, so that a regular flow of data is kept up in the network. Many basic matrix computations can be pipelined elegantly and efficiently on systolic networks having an array structure. As an example, hexagonally connected processors can optimally perform matrix multiplication. Surprisingly, a similar systolic array can compute the LU decomposition of a matrix. These systolic arrays enjoy simple and regular communication paths, and almost all processors used in the networks are identical. As a result, special purpose hardware devices based on systolic arrays can be built inexpensively using the VLSI technology.' -- H.T. Kung and Charles E. Leiserson, 1978

Systolic processors were first proposed in 1978 by H.T. Kung and Charles E. Leiserson in a paper presented to the Symposium on Sparse Matrix Computations and Their Applications [KuLe78], which also appears as a slightly different version in [MeCo80]. In this paper they considered the organisation of large arrays of simple processing elements, and their application to algorithms.

Systolic arrays are stated by Kung in [Ku80] to have the following properties:

1. The array can be implemented with only a few types of simple cells;

2. The data and control flow of the array is simple and regular, so that cells can be connected by a network with local and regular interconnections: long distance or irregular communication is not needed;

3. The array uses extensive pipelining and multiprocessing. Typically, several data streams move at constant velocities over fixed paths in the network, and interact where they meet. In this fashion, a large proportion of the processors in the array are kept active, so that the array can sustain a high rate of computation flow;

4. The array makes multiple use of each input data item. As a result high computation throughput can be achieved without requiring high bandwidth between the array and memory.

In the original papers on systolic arrays [KuLe78], [MeCo80], [Ku82], [KuAr82], [Ku83], [Ku80], [WhSp81] and [KuLe85], Kung and others present the hardware and interconnection schemes necessary to construct a number of systolic processor arrays. The arrays are designed to efficiently implement matrix-vector multiplication, matrix-matrix multiplication, LU decomposition and polynomial evaluation. A number of algorithms which are based on these operations were also presented. Much of the work is based on processing elements which implement simple recurrences involving multiplication and addition of operands, although systolic concepts are also used in more complex elementary cells. As an example, Ahmed et al. [AhDe82] proposed cells which implement the CORDIC (Coordinate rotation digital computer) operations of Volder [Vo59]. In [Jo93] it is shown how novel one-dimensional and two-dimensional systolic processing architectures, comprising up to N CORDIC bit-serial processing elements (PEs), can be used to carry out hardware-efficient parallel implementations of the N-point discrete Fourier transform (DFT). The arithmetic units do not use a floating point number representation, and as a consequence are extremely area efficient.

Systolic processor development has been evolutionary with time in the sense that the simplicity of the first architectures has been replaced with architectures of growing complexity at both the systolic cell level and the algorithmic level. Even the first step from the inner-product-step processor of Kung to the inner-product-accumulate processor of Whitehouse et al. [WhSp81] shows an increase in cell complexity, primarily in the area of control, as the cell must be initialised prior to computation, and the result extracted after the completion of the computation. Other areas which exhibit an evolutionary increase in complexity are the interconnection techniques which were developed for algorithmic flexibility, and the considerable development of techniques to map algorithms into systolic forms which are implementable on systolic architectures.

As an example, Tasic et al. [TaGu92] describe a three step procedure, per iteration, to implement the Preconditioned Conjugate Gradient (PCG) Method for solving

systems of linear equations with symmetric positive definite matrices. Regular data flow between the three different stages in the algorithm is obtained using broadcast elimination, synchronization between stages, introduction of new variables, pipelining the variables and other re-indexing techniques.

The following sections summarise the development of systolic processors for numerical applications.

2.1 The Inner-product-step Processor

Figure 2.1: The inner-product-step processor.

The fundamental component upon which the work of Kung and Leiserson was based was the inner-product-step processor. For simplicity of presentation they assumed a synchronous model in which data moved synchronously from processor to processor, although they also pointed out that a data-flow model could be used in which processors computed results when all input operands were available. The operation of the inner-product-step processor is defined by the expression c ← c + a × b.

Each processor is implemented with three registers RA, RB and RC. Each register has two connections to allow for both input and output. In addition to the registers, processors also contain the necessary logic to perform the arithmetic defined by the recurrence relation presented above. Two geometries presented for the inner-product-step processor are shown in figure 2.1.

Figure 2.2: The linear systolic array.

Three interconnection topologies were proposed: the linear array, the mesh-connected array and the hexagonally-connected array. Inputs and outputs from the array occur through boundary elements. The linear array is shown in figure 2.2.

In each time unit, or clock period, every processor in the array moves data present at the A, B and C inputs into the associated registers RA, RB and RC. It then computes a value for RC defined by the expression RC ← RC + RA × RB. RC is made available at the output line labelled C, and the values in the registers RA and RB are made available at the output lines labelled A and B. In the synchronous model all outputs are latched and clocking is carried out so that data is correctly transferred between processors.
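This per-cell behaviour is simple enough to model directly. The following Python fragment is an illustrative sketch only (the class and method names are invented for this description and are not drawn from the cited papers); it models one clock period of a single inner-product-step processor.

```python
class InnerProductStepCell:
    """One inner-product-step processor with registers RA, RB and RC."""

    def __init__(self):
        self.ra = 0.0
        self.rb = 0.0
        self.rc = 0.0

    def clock(self, a_in, b_in, c_in):
        """Latch the A, B and C inputs, apply RC <- RC + RA * RB, and return
        the values presented on the A, B and C output lines."""
        self.ra, self.rb, self.rc = a_in, b_in, c_in
        self.rc = self.rc + self.ra * self.rb
        return self.ra, self.rb, self.rc

cell = InnerProductStepCell()
print(cell.clock(2.0, 3.0, 1.0))   # -> (2.0, 3.0, 7.0)
```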

Matrix-vector multiplication was cited as an example of an algorithm which was well matched to the linear array. Following Kung's example, a matrix A = (a_{ij}) is multiplied by a vector x = (x_1, ..., x_n)^T. The product, represented by y = (y_1, ..., y_n)^T, is computed by the application of the recurrences

y_i^{(1)} = 0

y_i^{(k+1)} = y_i^{(k)} + a_{ik} x_k    (2.1)

y_i = y_i^{(n+1)}

\begin{pmatrix}
a_{11} & a_{12} & 0      & 0      & \cdots & 0 \\
a_{21} & a_{22} & a_{23} & 0      & \cdots & 0 \\
a_{31} & a_{32} & a_{33} & a_{34} & \cdots & 0 \\
0      & a_{42} & a_{43} & a_{44} & a_{45} & \cdots \\
\vdots &        & \ddots & \ddots & \ddots &        \\
0      & \cdots & 0      & 0      & \cdots & a_{nn}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ \vdots \\ x_n \end{pmatrix}
=
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ \vdots \\ y_n \end{pmatrix}    (2.2)

Figure 2.3: Data movement for the matrix-vector product on the linear array.

For the case where A is an n × n band matrix with band width w = p + q − 1, where p is the band width above the diagonal and q is the band width below the diagonal as shown in equation (2.2), the implementation of the product on the linear array is shown in figure 2.3.

The y_i, initially zero, move to the left while the x_i are moving to the right and the a_{ij} are moving down. All data streams move in synchronism. When the y_i exit from the leftmost element of the array, they have accumulated their correct values. The interstitial gaps between matrix elements are zero, and each processor operates only half of the time. Adjacent processors are alternately idle, and Kung notes that a coalescence of adjacent processors would make it possible to use half as many processors to perform the same function. If the bandwidth of the matrix is w, then the number of time units after which product components are shifted out from the array is w, and all components have been output after 2n + w time units, compared to the O(wn) units required on a serial machine. For large values of n the efficiency of the cell, and hence of the array, is approximately 50%. In this context, efficiency is taken as the ratio of clock periods in which the cell is active to the total number of clock periods to complete the computation.

Figure 2.4: The hexagonally-connected systolic array.

Figure 2.5: Band matrix multiplication on the hexagonally-connected array.

Matrix-matrix multiplication was implemented by Kung on the hexagonally-connected mesh array of figure 2.4. If A and B are n × n band matrices of bandwidth w_1 and w_2 respectively, the product matrix C can be computed with an array of w_1 w_2 hex-connected processors.

Data movement for this algorithm is shown in figure 2.5. The three matrix data structures move synchronously through the array, and the recurrences which are applied to the data are as defined for the linear array in equation (2.1). To ensure the interaction of the correct matrix elements, two interstitial zeroes are interposed between matrix elements. This extension of the spacing between elements to three is compared to the spacing of two for the matrix-vector algorithm on the linear array. The number of recurrences which can be applied during the computation of any element c_{ij} is determined by the number of cells which the result passes through before leaving the array, i.e. the dimension of the array.

The computation time for matrix-matrix multiplication is given as 3n + min(w_1, w_2) time units by Kung. For large values of n, the second term is insignificant. In these cases each cell is performing useful computations during one in three clock cycles, giving an efficiency of 33%.

Kung and Leiserson suggested that the optimal size of the network for a particular problem depends not only on the problem, but also on the memory bandwidth to the host computer. To achieve high performance, they state, "... it is desirable to have as many processors as possible in the network, provided they can all be kept busy doing useful computations."

2.2 A Systolic Cell for Polynomial and DFT Evaluation

Figure 2.6: Computation of Fourier coefficients with a linear systolic array.

Kung subsequently proposed an alternative systolic cell for the evaluation of polynomials via Horner's rule [Ku80]. The cell is defined by the recurrences

x_out ← x_in
y_out ← y_in x_in + a

and the structure of the array is shown in figure 2.6.

After pre-loading the coefficients {a_i : i = 0, ..., n − 2}, the array computes Fourier coefficients

y_i = \sum_{j=0}^{n-1} a_j \omega^{ij}    (2.3)

where ω is an nth root of unity. Rewriting the summation for n = 5 gives

y_i = a_4 \omega^{4i} + a_3 \omega^{3i} + a_2 \omega^{2i} + a_1 \omega^{i} + a_0    (2.4)

Applying Horner's rule

y_0 = (((a_4 \cdot 1 + a_3) \cdot 1 + a_2) \cdot 1 + a_1) \cdot 1 + a_0    (2.5)

y_1 = (((a_4 \cdot \omega + a_3) \cdot \omega + a_2) \cdot \omega + a_1) \cdot \omega + a_0    (2.6)

y_2 = (((a_4 \cdot \omega^2 + a_3) \cdot \omega^2 + a_2) \cdot \omega^2 + a_1) \cdot \omega^2 + a_0    (2.7)

y_3 = (((a_4 \cdot \omega^3 + a_3) \cdot \omega^3 + a_2) \cdot \omega^3 + a_1) \cdot \omega^3 + a_0    (2.8)

y_4 = (((a_4 \cdot \omega^4 + a_3) \cdot \omega^4 + a_2) \cdot \omega^4 + a_1) \cdot \omega^4 + a_0    (2.9)

Inspection of these equations shows the relationship of the algorithm to the architecture. The innermost term is evaluated in the leftmost processor, passed to the next which evaluates the second level of nesting, until the last cell which computes the final multiplication and addition of a_0. a_{n-1} is a constant input to the left cell, together with the sequence {ω^i : i = 0, ..., n − 1}. The Fourier coefficients {y_i : i = 0, ..., n − 1} are output from the rightmost processor.
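As a check on the computation the array performs, the following Python sketch (written for this review; the variable names are illustrative and not taken from the cited papers) evaluates the coefficients of equation (2.3) by exactly the nesting of equations (2.5)-(2.9) and compares the result with the direct summation.

```python
import cmath

def dft_horner(a):
    """Evaluate y_i = sum_j a_j * w**(i*j) by Horner's rule, where w is an
    n-th root of unity; this mirrors the nesting of equations (2.5)-(2.9)."""
    n = len(a)
    w = cmath.exp(-2j * cmath.pi / n)    # one choice of n-th root of unity (assumed)
    y = []
    for i in range(n):
        wi = w ** i
        acc = a[-1]                      # a_{n-1} enters the leftmost cell
        for coeff in reversed(a[:-1]):   # pre-loaded coefficients a_{n-2} ... a_0
            acc = acc * wi + coeff       # one cell: multiply by w^i, add coefficient
        y.append(acc)
    return y

a = [1.0, 2.0, 3.0, 4.0, 5.0]
direct = [sum(a[j] * cmath.exp(-2j * cmath.pi / len(a)) ** (i * j)
              for j in range(len(a))) for i in range(len(a))]
assert all(abs(h - d) < 1e-9 for h, d in zip(dft_horner(a), direct))
```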

2.3 The Engagement Processor

In 1981 Whitehouse and Speiser [WhSp81] proposed a different systolic processor to that discussed by Kung and Leiserson. Referred to as the engagement processor, it differs from the inner-product-step processor in that it is constructed from a mesh connected array of inner-product-accumulate cells. Operands move through the array in the same way as for the inner-product-step processor, but with no interstitial zeroes. Results do not move through the engagement processor, but are accumulated in each systolic cell. The C registers of each cell are set to zero at the beginning of the computation. At the completion of the computation, the accumulated results are extracted from the array.

Figure 2.7: The inner-product-accumulate systolic cell.

The inner-product-accumulate cell is shown in figure 2.7. It can be constructed from the inner-product-step processor of figure 2.1 by recirculating the C operand. The operation of the cell is described by the recurrences

c_{ij}^{(0)} = 0    (2.10)

c_{ij}^{(k)} = c_{ij}^{(k-1)} + a_{ik} b_{kj},    1 ≤ k ≤ N    (2.11)

where the desired result is c_{ij}^{(N)}. This cell computes the inner product defined by

c_{ij} = \sum_{k=1}^{N} a_{ik} b_{kj}    (2.12)

Figure 2.8: A 3 × 3 engagement processor with two input matrices A and B.

An example of an engagement processor is shown in figure 2.8. It consists of a two-dimensional mesh-connected array of inner-product-accumulate cells.

A skew of one clock period, or arithmetic cycle time, is introduced between successive wavefronts which enter the engagement processor. Matrix multiplication and accumulation is done simply by initialising the contents of the array to a required initial matrix. The accumulation of the final term of the matrix product occurs at time T = (N − 1) + (N − 1) + N = 3N − 2 clocks. Unlike Kung's arrays of inner-product-step processors, the number of recurrences which can be applied during the formation of a result matrix C is independent of the dimension of the array. Any conformal matrix product of dimension less than or equal to the array size can be computed in one pass. Conformal products of dimension greater than the array size are computed as a set of partitions.

As the operands have no interstitial zeroes, cells are never idle once they begin processing. For large products where end effects can be ignored, both the cells and the arrays compute with efficiencies approaching 100%.
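A behavioural model of the engagement processor is easy to write down. The Python sketch below is an illustration prepared for this review rather than code from the cited papers; it models the skewed entry of the A and B wavefronts into an N × N mesh of inner-product-accumulate cells and confirms that after 3N − 2 clocks the accumulators hold the matrix product.

```python
def engagement_multiply(A, B):
    """Simulate an N x N engagement processor: operands enter with a one-clock
    skew per row/column, move through the mesh, and products accumulate in place."""
    N = len(A)
    c = [[0.0] * N for _ in range(N)]        # accumulators, initialised to zero
    a_reg = [[0.0] * N for _ in range(N)]    # A values held by each cell this clock
    b_reg = [[0.0] * N for _ in range(N)]
    for t in range(3 * N - 2):
        new_a = [[0.0] * N for _ in range(N)]
        new_b = [[0.0] * N for _ in range(N)]
        for i in range(N):
            for j in range(N):
                if j == 0:                   # boundary column takes skewed A input
                    k = t - i                # row i of A is delayed i clocks
                    new_a[i][j] = A[i][k] if 0 <= k < N else 0.0
                else:                        # interior cells take the neighbour's value
                    new_a[i][j] = a_reg[i][j - 1]
                if i == 0:                   # boundary row takes skewed B input
                    k = t - j                # column j of B is delayed j clocks
                    new_b[i][j] = B[k][j] if 0 <= k < N else 0.0
                else:
                    new_b[i][j] = b_reg[i - 1][j]
        a_reg, b_reg = new_a, new_b
        for i in range(N):
            for j in range(N):
                c[i][j] += a_reg[i][j] * b_reg[i][j]
    return c

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
assert engagement_multiply(A, B) == [[19.0, 22.0], [43.0, 50.0]]
```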

2.4 Algorithms

Kung and Leiserson present the hardware and interconnection schemes necessary to construct a number of systolic processor arrays. The arrays are designed to efficiently implement matrix-vector multiplication, matrix-matrix multiplication, LU decompo- sition and polynomial evaluation. A number of algorithms which are based on these operations were also presented, and are given below.

2.4.1 Convolution

Consider two sequences x(n) and y(n) where at least x(n) is finite. The sequence z(n) defined by the expression

z(n) = \sum_{m=0}^{N-1} x(m) y(n - m)    (2.13)

corresponds to the linear convolution of x(n) and y(n). Rewriting (2.13) as a matrix-vector expression gives

\begin{pmatrix}
x_0     & 0       & 0       & \cdots \\
x_1     & x_0     & 0       & \cdots \\
x_2     & x_1     & x_0     & \cdots \\
\vdots  & \vdots  & \vdots  &        \\
x_{N-1} & x_{N-2} & x_{N-3} & \cdots \\
0       & x_{N-1} & x_{N-2} & \cdots \\
0       & 0       & x_{N-1} & \cdots \\
\vdots  &         &         & \ddots
\end{pmatrix}
\begin{pmatrix} y_0 \\ y_1 \\ y_2 \\ \vdots \end{pmatrix}
=
\begin{pmatrix} z_0 \\ z_1 \\ z_2 \\ z_3 \\ z_4 \\ \vdots \end{pmatrix}    (2.14)

X is a Toeplitz matrix of nearly identical form to the A matrix in the matrix-vector equation (2.2) and hence linear convolutions are readily implemented with a linear systolic array of inner-product-step processors.
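The mapping from (2.13) to (2.14) can be checked with a few lines of Python. This is a sketch written for this review; the function names are illustrative.

```python
def convolution_matrix(x, out_len):
    """Build the Toeplitz matrix X of equation (2.14): row n contains
    x(n), x(n-1), ... so that X applied to y gives the linear convolution z."""
    n_x = len(x)
    return [[x[n - m] if 0 <= n - m < n_x else 0.0 for m in range(out_len)]
            for n in range(out_len)]

def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]

x = [1.0, 2.0, 3.0]          # finite sequence x(n)
y = [4.0, 5.0, 6.0, 7.0]     # sequence y(n)
out_len = len(x) + len(y) - 1
X = convolution_matrix(x, out_len)
z = matvec(X, y + [0.0] * (out_len - len(y)))
direct = [sum(x[m] * (y[n - m] if 0 <= n - m < len(y) else 0.0)
              for m in range(len(x))) for n in range(out_len)]
assert z == direct
```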

2.4.2 Finite Impulse Response (FIR) Filters

A p-tap finite impulse response (FIR) filter is implemented by the expression

y(n) = \sum_{m=0}^{p-1} h(m) x(n - m)    (2.15)

Ignoring end effects, this can be written as the middle partition of the matrix-vector product above. If the weights are represented by the sequence {h_0, h_1, ..., h_{p-1}} and the input data is represented by the sequence x(n), the filter output sequence y(n) is given by the expression

y = \begin{pmatrix}
h_{p-1} & \cdots & h_2     & h_1    & h_0    &        &        &        \\
        & h_{p-1} & \cdots & h_2    & h_1    & h_0    &        &        \\
        &        & h_{p-1} & \cdots & h_2    & h_1    & h_0    &        \\
        &        &        & \ddots &        &        &        & \ddots
\end{pmatrix} x    (2.16)

This is identical in form to the convolution, and is efficiently implemented with Kung's linear array.

2.4.3 Discrete Fourier Transform

The discrete Fourier transform is defined by the expression

X(k) = \sum_{n=0}^{N-1} x(n) W_N^{kn}    (2.17)

x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k) W_N^{-kn}    (2.18)

where W_N = e^{-j 2\pi / N}, and both x(n) and X(k) are periodic sequences. Equations (2.17) and (2.18) are best considered as matrix-vector products

\begin{pmatrix} X_0 \\ X_1 \\ X_2 \\ X_3 \end{pmatrix}
=
\begin{pmatrix}
1 & 1   & 1   & 1   \\
1 & W   & W^2 & W^3 \\
1 & W^2 & W^4 & W^6 \\
1 & W^3 & W^6 & W^9
\end{pmatrix}
\begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{pmatrix}    (2.19)

Again the DFT expressed in this form is a candidate for efficient implementation on the linear array.
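A short Python check of equation (2.19) (illustrative only; written for this review) builds the W matrix and compares the matrix-vector product against a direct evaluation of (2.17).

```python
import cmath

def dft_matrix(n):
    """The N x N matrix of equation (2.19), with W = exp(-j*2*pi/N)."""
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (k * m) for m in range(n)] for k in range(n)]

def dft_by_matrix(x):
    W = dft_matrix(len(x))
    return [sum(wkm * xm for wkm, xm in zip(row, x)) for row in W]

x = [1.0, 2.0, 0.0, -1.0]
direct = [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / len(x))
              for m in range(len(x))) for k in range(len(x))]
assert all(abs(a - b) < 1e-9 for a, b in zip(dft_by_matrix(x), direct))
```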

2.5 Heterogeneous Systolic Arrays

The construction of arrays of processing elements which are not homogeneous makes it possible to implement a range of different algorithms. A number of heterogeneous arrays and their applications are given in the following sections.

2.5.1 LU Decomposition

Figure 2.9: The hex-connected array for LU decomposition. There are two processor types in this array.

The factorisation of a matrix A into lower and upper matrices L and U such that A = LU is called the LU decomposition of the matrix. This decomposition makes it

easy either to invert A or to solve the linear system Ax = b. The operation is defined by the recurrences

a_{ij}^{(1)} = a_{ij}

a_{ij}^{(k+1)} = a_{ij}^{(k)} + l_{ik} (-u_{kj})

l_{ik} =
\begin{cases}
0 & i < k \\
1 & i = k \\
a_{ik}^{(k)} / u_{kk} & i > k
\end{cases}
\qquad
u_{kj} =
\begin{cases}
0 & k > j \\
a_{kj}^{(k)} & k \le j
\end{cases}    (2.20)

Kung et al. showed that the recurrences of the inner-product-step processor were sufficient to implement this algorithm with minor modifications to the hex-connected processor array. The assumption is made that the LU decomposition of the matrix A can be performed by Gaussian elimination without pivoting.

The top processor in the hex-connected array is changed to one which computes the reciprocal of its input and passes the result southwest and the input northwards. The top left boundary processors are rotated 120° clockwise and the top right boundary processors are rotated 120° anti-clockwise. Figure 2.9 shows the resulting array structure and the data movement to implement the LU decomposition.

As in the matrix multiplication algorithm, each processor operates on every third clock cycle. Kung quotes the performance of this array for an n × n band matrix A with bandwidth w = p + q − 1. An array with no more than pq hex-connected processors can compute the LU decomposition of A in 3n + min(p, q) time steps.
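The recurrences (2.20) are those of Gaussian elimination without pivoting. A direct, non-systolic Python rendering of them is given below purely to make the data dependencies explicit; it is a sketch written for this review, not the array schedule itself.

```python
def lu_decompose(a):
    """LU factorisation by the recurrences of (2.20); no pivoting is performed,
    so u_kk = a_kk^(k) is assumed to be non-zero at every step."""
    n = len(a)
    a = [row[:] for row in a]                     # a holds a^(k), updated in place
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for j in range(k, n):
            U[k][j] = a[k][j]                     # u_kj = a_kj^(k) for k <= j
        for i in range(k + 1, n):
            L[i][k] = a[i][k] / U[k][k]           # l_ik = a_ik^(k) / u_kk for i > k
            for j in range(n):
                a[i][j] += L[i][k] * (-U[k][j])   # a_ij^(k+1) = a_ij^(k) + l_ik(-u_kj)
    return L, U

A = [[4.0, 3.0], [6.0, 3.0]]
L, U = lu_decompose(A)
prod = [[sum(L[i][k] * U[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
assert prod == A                                  # reassemble and check A = L U
```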

2.5.2 Solution of Triangular Linear Systems

The linear system Ax = b can be solved by first performing the LU decomposition of the coefficient matrix A in which A = LU, and then solving the two triangular systems Ly = b and Ux = y. As upper triangular systems can be rewritten in terms of a lower triangular system, Kung proposed a systolic processor for the solution of lower triangular systems.

If A = (a_{ij}) is a nonsingular n × n band lower triangular matrix and the vector b = (b_1, ..., b_n)^T, the computation of x = (x_1, ..., x_n)^T such that Ax = b can be done by applying the recurrences

y_i^{(1)} = 0

y_i^{(k+1)} = y_i^{(k)} + a_{ik} x_k

x_i = (b_i - y_i^{(i)}) / a_{ii}    (2.21)

The full matrix expression for this problem is given by

\begin{pmatrix}
a_{11} & 0      & 0      & 0      & \cdots & 0 \\
a_{21} & a_{22} & 0      & 0      & \cdots & 0 \\
a_{31} & a_{32} & a_{33} & 0      & \cdots & 0 \\
0      & a_{42} & a_{43} & a_{44} & \cdots & 0 \\
\vdots &        &        & \ddots & \ddots &   \\
0      & \cdots & 0      & 0      & \cdots & a_{nn}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ \vdots \\ x_n \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \\ \vdots \\ b_n \end{pmatrix}    (2.22)

Figure 2.10: The linear heterogeneous systolic array for triangular systems.

The linear systolic heterogeneous processor which implements the required recurrences is shown in figure 2.10.
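The recurrences (2.21) amount to forward substitution. The following plain Python sketch (illustrative only, not the systolic schedule) applies them directly.

```python
def solve_lower_triangular(a, b):
    """Forward substitution following (2.21): y_i accumulates a_ik * x_k for
    k < i, then x_i = (b_i - y_i) / a_ii."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        y_i = sum(a[i][k] * x[k] for k in range(i))
        x[i] = (b[i] - y_i) / a[i][i]
    return x

A = [[2.0, 0.0, 0.0],
     [1.0, 3.0, 0.0],
     [0.0, 4.0, 5.0]]
b = [2.0, 7.0, 23.0]            # constructed so that x = [1, 2, 3]
assert solve_lower_triangular(A, b) == [1.0, 2.0, 3.0]
```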

31 2.5.3 A Systolic Array for Adaptive Beamforming

Figure 2.11: The systolic array cells used for the extraction of least squares residuals.

Ward et al. in [WaRo84] describe a technique for adaptive antenna beamforming using a least squares technique implemented efficiently on a heterogeneous triangular systolic array. The array is constructed from two systolic cell types, one of which is used on the diagonal of the array, and the other in the body of the array. The recurrences which describe the diagonal cell are

v(k+1) = \{ v^2(k) + |x_{in}(k+1)|^2 \}^{1/2}

C_{out}(k+1) = v(k) / v(k+1)

S_{out}(k+1) = x_{in}(k+1) / v(k+1)

Y_{out}(k+1) = Y_{in}(k+1) C_{out}(k+1)    (2.23)

and the recurrences which describe the other cell are

R(k+1) = C_{in}(k+1) R(k) + S_{in}(k+1) Y_{in}(k+1)

Y_{out}(k+1) = C_{in}(k+1) Y_{in}(k+1) - S_{in}(k+1) R(k)

C_{out}(k+1) = C_{in}(k+1)

S_{out}(k+1) = S_{in}(k+1)    (2.24)

Figure 2.11 shows these two cells with their inputs and outputs.
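A small Python model of the two cell types defined by (2.23) and (2.24) may make the data flow clearer. It is a sketch only; the class names are invented for this illustration and the inputs are treated as real quantities.

```python
import math

class BoundaryCell:
    """Diagonal cell of (2.23): updates its stored value v and generates the
    rotation parameters (c, s) from the incoming element."""
    def __init__(self):
        self.v = 0.0
    def step(self, x_in):
        v_new = math.sqrt(self.v ** 2 + x_in ** 2)
        c = self.v / v_new if v_new != 0.0 else 1.0   # guard the v_new = 0 case
        s = x_in / v_new if v_new != 0.0 else 0.0
        self.v = v_new
        return c, s

class InternalCell:
    """Off-diagonal cell of (2.24): applies the rotation to its stored R element."""
    def __init__(self):
        self.r = 0.0
    def step(self, c_in, s_in, y_in):
        y_out = c_in * y_in - s_in * self.r
        self.r = c_in * self.r + s_in * y_in
        return c_in, s_in, y_out

# feed successive leading elements 3 and 4: the boundary cell accumulates
# sqrt(3**2 + 4**2) = 5
bc = BoundaryCell()
bc.step(3.0)
bc.step(4.0)
assert abs(bc.v - 5.0) < 1e-12
```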

Figure 2.12: The triangular systolic array organisation for the extraction of least squares residuals.

Figure 2.13: The triangular systolic array and linear array for QR system solving.

Figure 2.12 shows the organisation of the cells into a triangular array which recursively computes an upper triangular matrix R, which is initialised to zero at the beginning of the calculation and which is updated every clock cycle. Cells on the right hand side of the array contain the evolving right-hand side vector, which is also initially zero and is updated each clock cycle.

Each row of the array implements a Givens rotation on rows of the input matrix to carry out a QR factorisation. A simplified triangular array which does not output the residuals was proposed by Kung and Gentleman to perform QR factorisation.

The solution of the resulting triangular system is achieved by driving the output of the triangular array into the linear array for back-substitution as shown in figure 2.13.

A more recent example of a systolic array for the orthogonal solution of systems of linear equations is given in [GoSc91]. Processor arrays of one and two dimensions are constructed from cells capable of implementing Givens rotations. Due to the choice of algorithm, only multiplications and additions are required in the systolic arrays, with a single division being required at the end of the computation which can be performed by a post-processor, or the host machine.

2.6 Systolic Arrays for Data Formatting

Figure 2.14: The systolic cell for format conversion.

The arrays proposed by Kung and others were efficient at implementing particular algorithms. However the data input/output formats are specific to each algorithm, and it may not be possible to interface one systolic processor to another without

reformatting the operands. This problem has been studied by a number of researchers including O'Leary [Ol87] and Petkov [Pe88]. Petkov used a cell whose I/O and logical operation are defined in figure 2.14.

A systolic array of these cells is shown by Petkov to perform row-to-diagonal and diagonal-to-row input/output format conversion for matrices.

Figure 2.15: A systolic array of format conversion cells.

Figure 2.15 is a systolic array for row-to-diagonal format conversion. Input data entered at time t = 1 is shown output at time t = 3n − 2 in the figure.

2.7 Bit Level Systolic Arrays

Most of the systolic array research which has been conducted has been devoted to word-level systolic systems in which individual processors are implemented on single chips. In contrast to this approach, several workers have applied the concepts at the bit level for signal processing applications. Examples include the work on bit-level architectures of McCanny et al. [McMcW82], [McMcWo84] and [McMcKu90], Hoekstra [Ho85], Urquhart et al. [UrWo84], the GEC studies [McMc82] and the work of the author [MaSt89]. One of the reasons for interest in bit-level systolic arrays is the input/output (I/O) constraints of conventional VLSI technology.

An example of a bit-level architecture is given in [McMc82] in which a systolic correlator is presented which implements the correlation c(k) between a reference sequence r(i) and a data sequence d(i) where

c(k) = \sum_{i=0}^{N-1} r(i) d(i + k)    (2.25)

The dimension of the correlation array in the vertical direction is set by the required output word length which must be sufficient to accommodate the binary representation of the maximum output data value. In the horizontal direction the number of columns corresponds to the correlator length, or number of reference coefficients.

Y_{out} ← Y_{in} ⊕ C_{in} ⊕ (X_{in} · A_{in})
C_{out} ← Y_{in} · C_{in} + Y_{in} · (X_{in} · A_{in}) + C_{in} · (X_{in} · A_{in})
A_{out} ← A_{in}
X_{out} ← X_{in}

Figure 2.16: The systolic cell for bit-level correlation.

Figure 2.17: A systolic array for bit-level correlation.

This use of systolic techniques to implement the desired operation offers a highly efficient solution to the problem. A general purpose microprocessor would be unable

to meet the processing requirements. A feature of the solution is that data transfer to and from the array is minimised as data is input to the array once, and the result is read once, but each data sample is used many times during the computation.
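The computation the correlator array performs, equation (2.25), is easy to state in software. The following Python lines are an illustration written for this summary and compute it directly.

```python
def correlate(r, d):
    """c(k) = sum_i r(i) * d(i + k) from equation (2.25); k ranges over the
    lags for which the data sequence d supplies every required sample."""
    n = len(r)
    return [sum(r[i] * d[i + k] for i in range(n)) for k in range(len(d) - n + 1)]

r = [1, 0, 1]              # reference sequence
d = [1, 1, 0, 1, 1, 0, 1]  # data sequence
assert correlate(r, d) == [1, 2, 1, 1, 2]
```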

2.7.1 Beamforming at the Bit-level

In the bit-level beamformer designed and implemented by the author [MaSt89], the conventional beamforming equation is formulated in terms of a set of binary weighted matrix traces. A low complexity VLSI chip is then described which implements this algorithm with a linear systolic array of boolean multiply/accumulate elements. Performance estimates are given for a range of typical formats. The remainder of this section summarises this work.

In conventional beamforming the observation of some propagating, coherent wave in the ambient noise background is enhanced by summing time-delayed and weighted sensor data. The weights are designed to achieve some specified side-lobe level from a particular sensor array.

The output in a given direction θ is given by

b_θ(t) = \sum_{n=1}^{k} w_n(θ) x_n(t - τ_n(θ))    (2.26)

where k is the number of sensors in an arbitrary array, x_n(t) is the output from sensor n, τ_n(θ) is a time delay to be applied to x_n(t) to steer in the direction θ, and w_n(θ) is a weight factor.

In many practical systems τ_n(θ) is quantised. This in general induces an error in the estimate b_θ(t).

This beamforming task has been represented in matrix form in [MaCla88]. To highlight the ease of mapping from the algorithmic domain to silicon, the development is reproduced below.

The application of the weighting coefficient and time delay to a set of sensor samples as defined in equation (2.26) is written as a scalar product x_n(t) · r_n(θ), where r_n(θ) is a vector that selects an element from x_n(t) and multiplies it by w_n(θ). Then

b_θ(t) = \sum_{n=1}^{k} x_n(t) \cdot r_n(θ) = tr\{ X(t) R(θ) \}    (2.27)

where X(t) is a data matrix at time t, each row of X(t) is a set of samples from a particular sensor and R(θ) is a matrix whose ith column contains a single non-zero element w_i.

This representation is more general than equation (2.26), as it describes an interpolation beamforming algorithm. The inner products implement the convolution of a set of samples with an arbitrary set of filter coefficients r_n(θ). These coefficients can be adjusted so that interpolation between the discrete samples is achieved. However, time-delay-and-sum beamforming is simply defined by equation (2.27) when each column of R consists of at most one non-zero element.

Each b_θ(t) is the trace of the matrix product. The trace is computed by evaluating the main diagonal elements of the product and accumulating the results. For k independent beams it is necessary to carry out this computation k times, once for each R(θ_i), 1 ≤ i ≤ k. Decomposition of the arithmetic operations present in the algorithm allows the replacement of word-level multiplications with bit-level Boolean operations. Implementation of the bit-level operators is simply done with custom VLSI, and a chip has been designed and fabricated with this approach. The chip required about 13000 transistors laid out in less than 16 mm² of silicon area in the ORBIT 2 micron CMOS process.

For a system in which the weighting coefficients 0 ≤ w_i < 1 are constrained to p bits, each coefficient can be expressed as

w_i(θ) = \sum_{j=1}^{p} w_{ij}(θ) 2^{-j}    (2.28)

Substituting this expression in equation (2.26) gives

b_θ(t) = \sum_{n=1}^{k} \sum_{j=1}^{p} w_{n,j}(θ) x_n(t - τ_n(θ)) 2^{-j}    (2.29)

where w_{n,j} is the jth bit of the nth weight factor for the direction θ.

Rearranging the summation order gives

b_θ(t) = \sum_{j=1}^{p} \sum_{n=1}^{k} w_{n,j}(θ) x_n(t - τ_n(θ)) 2^{-j}    (2.30)

The inner summation is identical in form to equation (2.26) with the w_{n,j}(θ) now constrained to the set {0, 1}. Following the formulation of (2.27), equation (2.30) can be rewritten as

b_θ(t) = \sum_{j=1}^{p} \sum_{n=1}^{k} x_n(t) \cdot r_{n,j}(θ) 2^{-j} = \sum_{j=1}^{p} tr\{ X(t) R_j(θ) \} 2^{-j}    (2.31)

where R_j(θ) is a matrix of binary elements whose ith column contains at most a single non-zero bit w_{i,j}(θ). Each column of R_j(θ) extracts and applies a partial weighting of 0 or 1 to one sample from the corresponding row of X(t).

Further decomposition of the beamforming equation can be performed. The mathematical formulation has been omitted, but the extension is simple. Each sensor sample is expressed in the form of equation (2.28). Substitution in equation (2.31) gives a double sum of the traces of binary matrix products. The penalty for this further decomposition is that for p-bit weighting coefficients and q-bit data samples, an array of p × q trace evaluation circuits is required.
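The decomposition of equations (2.28)-(2.31) can be checked numerically. The Python sketch below is illustrative only (the helper names and data are not from the thesis); it quantises a set of weights to p bits, forms binary matrices R_j, and confirms that the weighted sum of traces reproduces the beam output computed directly from the quantised weights.

```python
def trace_product(X, R):
    """tr{X R}: only the main diagonal of the product is evaluated."""
    return sum(sum(X[i][m] * R[m][i] for m in range(len(R))) for i in range(len(X)))

def quantise(w, p):
    """Express 0 <= w < 1 as p binary digits w_j with w = sum_j w_j * 2**-j (eq. 2.28)."""
    bits, frac = [], w
    for _ in range(p):
        frac *= 2.0
        bit = int(frac)
        bits.append(bit)
        frac -= bit
    return bits

k, s, p = 3, 5, 4
X = [[float(i + m) for m in range(s)] for i in range(k)]   # data matrix X(t), k sensors
delays = [0, 2, 4]                                          # sample selected per sensor
weights = [0.75, 0.5, 0.25]                                 # exactly representable in 4 bits

# direct evaluation with the quantised weights
wq = [sum(b * 2.0 ** -(j + 1) for j, b in enumerate(quantise(w, p))) for w in weights]
direct = sum(wq[i] * X[i][delays[i]] for i in range(k))

# bit-plane evaluation: b = sum_j 2**-j * tr{X R_j}  (eq. 2.31)
total = 0.0
for j in range(p):
    Rj = [[0.0] * k for _ in range(s)]
    for i in range(k):
        Rj[delays[i]][i] = float(quantise(weights[i], p)[j])
    total += 2.0 ** -(j + 1) * trace_product(X, Rj)

assert abs(total - direct) < 1e-12
```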

Matrix products are readily computed using systolic techniques such as those described by [WeEs85], [WhSp81] and [MeCo80]. However, for trace computation, matrix products have redundant off-diagonal elements. A simple structure for trace computation in which inner products are formed in parallel is a linear array of multiply-accumulate elements.

Operation of each of the cells in the array is defined by the recurrence relation

a_i = r_{i-1} x_{i-1} + a_{i-1}    (2.32)

where r, x and a are coefficient, data and accumulator values respectively, at sample times i − 1 and i.

Figure 2.18: Evaluation of the elements of the main diagonal of a matrix product.

2.7.2 Optimisation for Binary Quantised Data

Consider the product of two digits x and y in an arbitrary base β. The range of each digit is 0 ≤ x, y ≤ β − 1. Hence

0 ≤ xy ≤ (β − 1)²

For binary data, β = 2, and

0 ≤ xy ≤ 1

A fully decomposed implementation of a beamformer processes only binary data. The operation of the systolic multiply/accumulate elements is defined by equation

(2.32) with the multiplication operation replaced by a Boolean AND. For a time-delay-and-sum beamformer, the addition operator can be replaced with a Boolean OR operator, as the result of the multiply/accumulate operation is constrained to the set {0,1}. Equation (2.32) then becomes the Boolean expression:

a_i = r_{i-1} · x_{i-1} + a_{i-1}    (2.33)

Cells which implement this Boolean recurrence can consist of as few as 42 transistors in a CMOS technology. This transistor count includes the output register and control gates.

On completion of inner product formation, the results are loaded into the output register of each cell. During the formation of the next inner product, these results propagate to the boundary element via the shift register formed by a concatenation of each cell's output register.

A further optimisation is possible in the trace accumulator for time-delay-and-sum beamformers. Since the elements of the main diagonal are constrained to the set {0,1}, the accumulator can be implemented with an incrementer circuit instead of a more general full adder circuit. Log2(k) incrementer cells are required where k is the number of elements in the main diagonal. For a beamformer which has 25 multiply/accumulate elements, 5 incrementer cells are required for the accumulator.

2.8 Configurable Systolic Architectures

The realisation that no single systolic array cell and interconnection structure could directly implement the large range of systolic algorithms led to a number of projects which sought to produce a generic processor based on systolic principles [Sn81], [Snyd81], [Sn82], [FiKu83], [WiCa85] and [AnAr87]. Two of these projects are discussed in the remainder of this section.

2.8.1 The Configurable Highly Parallel (CHiP) Computer

Figure 2.19: The CHiP configurable array as a mesh and a binary tree. Processors are indicated by squares and switches by circles. The root node of the binary tree is the node containing R.

The Configurable Highly Parallel (CHiP) Computer [Sn81], [Sn82] is a family of architectures which are constructed from a collection of homogeneous processing elements (PEs), a switch lattice and a controller. The computer differs from the systolic arrays of Kung and others in that it has programmable interconnections. These interconnections are achieved by connecting the PEs not to each other, but to the switch elements. Each switch element in the lattice contains local memory which can store several configuration settings. A connection is established as a static connection rather than as a dynamic path for data packets. Figure 2.19 shows the structure of a processor/switch lattice for a mesh and a binary tree.

The nature of the CHiP computer allows the processor array to be configured into algorithmically specialised regions which are interconnected. As an example, the solution of a linear system can be performed by setting the switches in one region of the array to form a hex-connected array for LU decomposition and then connecting the output of this region to another region which has been configured into a linear array for back-substitution as proposed by Kung and shown in figure 2.13 [Snyd81], [Sn82].

2.8.2 The Programmable Systolic Chip (PSC)

The Programmable Systolic Chip (PSC) [FiKu83] was designed to be a high performance special-purpose single chip microprocessor which was to be integrated into processor arrays consisting of tens or hundreds of PSCs and which implemented a broad variety of systolic arrays. The PSC was expected to be at least an order of magnitude more efficient at the implementation of systolic algorithms than conventional microprocessors.

Figure 2.20: A number of systolic arrays which can be implemented with arrays of Programmable Systolic Chips.

Figure 2.20 shows a PSC with the available interconnections on its boundary both as a single cell and configured into several possible systolic array processors.

The PSC was conceived as a single chip 8-bit processor capable of supporting both 9-bit arithmetic and multiple precision computation. Multiplication within a single machine cycle was defined to be necessary to provide good performance in signal processing and coding applications. High-bandwidth inter-chip and on-chip communication paths were required to support the continuous flow of data between cells, and also to provide for the transmission of pipelined systolic control signals necessary for runtime control of algorithms.

The architectural features of the PSC which are related to its operation as a cell of a systolic array are summarised below:

1. 6 eight-bit data ports, three dedicated to input and three to output;
2. 6 one-bit control ports, three dedicated to input and three to output;
3. an 8-bit ALU with support for multiple precision and modulo 257 arithmetic;
4. a parallel 8-bit multiplier/accumulator (MAC) with 16-bit accumulation, based on the TRW MAC chips.

Details of the chip implementation, based on the OM2 chip [MeCo80], are as follows [Ba81]:

1. a 64 word writeable control store, each word of 60 bits;
2. a 64 word by 9-bit register file;
3. three 9-bit on-chip buses;
4. a stack based micro-sequencer.

The chip used a two phase clocking strategy.

2.9 Mapping Algorithms to Architectures

Following the presentation of systolic systems such as the PSC and CHiP, the problem of mapping algorithms to working systolic arrays was addressed by a number of workers including [Mo83], [MoFo86], [KuTs89] and [Ts90]. In [Mo83] Moldovan considers the problem of mapping cyclic loop algorithms into special-purpose VLSI arrays using mathematical transformations of index sets and data dependence vectors. In [MoFo86] the problem of mapping problems of size greater than or less than the available hardware was studied.

2.9.1 Software

One of the most complete systolic hardware and software examples is the Warp processor and its associated parallelising compilers, of which examples are Tseng [Ts90] and the code generation technique of [Le90]. The Warp processor is a linear

systolic array of processing cells as shown in the following figure. It is representative of a 'standard' approach to the implementation of a systolic processor.

Figure 2.21: The Warp machine.

Each node consists of:

1. one 5 Mflops floating point multiplier;
2. one 5 Mflops floating point adder;
3. one integer ALU;
4. a data memory of 32K words (128K bytes);
5. an 8K instruction word memory (272 bits wide) and sequencer;
6. X and Y input and output data paths, each with a bandwidth of 20 Mbytes/s, for systolic interconnections between cells.

Peak node performance is 10 Mflops, and for a 10 cell array, the peak performance is 100 Mflops. W1 is the cell assembly language, each statement of which consists of one line of microcode. Each W1 statement controls every cell component with a dedicated instruction field. Each statement can initiate two floating point operations, two integer operations, perform memory and register operations and communicate via each of its ports as well as performing next instruction address generation. Internal pipelining delays of up to seven clocks mean that the programming of the cell is non-trivial. Tseng [Ts90] produced a high level language compiler which generated code for the Warp computer from source code based on the simple sequential control constructs IF-THEN-ELSE, DO-loop and WHILE-loop, and uses data objects consisting of scalars and arrays. This language was referred to as the W2 language. The

compiler uses data dependence analysis and loop scheduling techniques to generate code. The user must program each Warp cell individually and explicitly manage inter-cell communication with communication primitives SEND and RECV.

An additional programming language known as AL (Array Language) was produced by Tseng [Ts90] which hides the array architecture from the user. Programs are written as if for a sequential computer. The AL compiler generates W2 programs and manages the inter-cell communication. An additional data class is available in AL: the distributed array, elements of which may be either scalars or arrays.

Tseng contrasts the AL language with languages written for the Intel hypercube processor iPSC/2 and the IBM RP3 and BBN Butterfly in two areas:

1. the use of data relations in AL to manage distributed data objects. The user must specify the distribution functions in the other languages;

2. the generation by AL of code for fine-grain systolic communications; the other compilers support the hypercube architecture with a more general set of communication primitives.

iWarp [BoCo88] is a system architecture for high speed signal, image and scientific computing which resulted from a collaborative effort between Carnegie-Mellon University and Intel Corporation. The iWarp system was an attempt to extend the Warp concepts in a commercial environment. The building block for iWarp systems was a custom VLSI processor consisting of 600,000 transistors and providing 20 Mflops of computational power with a 320 Mbytes/s throughput. Arrays were constructed from cells containing processors and up to 64 Mbytes of local memory. Processor interconnection topologies implementable with iWarp include 1-dimensional arrays, rings, two-dimensional arrays and tori. System sizes were expected to range from several processors to thousands of processors. Programming languages for iWarp included AL and Apply, an image processing language, as well as an optimising compiler written by M.S. Lam for the Warp processor.

2.10 Systolic Processors/Coprocessors

Much of the work which has been carried out in the field of systolic processing has been with integer arithmetic units. An example is the two-dimensional array of Nash et al. [NaPr87]. In this case, as with others, the design of the processing elements has

been influenced by specific algorithms. Exceptions which have addressed the need for floating point representations include the SAXPY Matrix-1 [FoSc87], which was to be the first commercial general purpose systolic computer, the Warp computer [AnAr87] and the systolic coprocessor of Steenis et al. [StTr88]. The coprocessor of Steenis et al. was designed to provide high speed processing of high order matrix problems with a two-dimensional array of serial inner-product floating point processors. The floating point representation chosen provided 16 bits for the mantissa and 8 bits for the exponent, which is not compatible with the IEEE 754 floating point standard. Details of the arithmetic units from which this processor was constructed are provided in [StTr88], but there is no discussion of the system level architecture which would determine its usefulness as a coprocessor in a real system.

Other examples of implementations which use floating point arithmetic usually consist of arrays of single or multiple chips per processing node. An example of a parallel vector architecture is the GaAs vector processor of Misko et al. [MiRa88] which uses eight McDonnell Douglas gallium arsenide 32-bit RISC microprocessors, each with four floating point coprocessors. This processor is capable of performing 1024-point complex FFTs in 112 μs, and consumes 618 watts. In some cases researchers have investigated the use of linear arrays for matrix-oriented problems for which two-dimensional arrays have been designed [ScTh89]. The paper by Franceschetti et al. [FrMa91] describes a parallel architecture especially designed for a synthetic-aperture-radar (SAR) processing algorithm based on an appropriate two-dimensional fast Fourier transform (FFT). The algorithm is briefly summarized, and the one-dimensional FFT case is generalised to the double FFT. The computer architecture consists of a toroidal net with transputers on each node. [HwCh91] report the simulated performance of the Orthogonal MultiProcessor (OMP). The OMP was evaluated in SPMD (Single-Program and Multiple-Data-streams) mode, which demands inter-processor synchronization at the subprogram level, rather than at the instruction lockstep level as in an SIMD machine. The simulated OMP consists of 16 Intel i860 processors and 256 memory modules interconnected by two-dimensional spanning buses. The simulated benchmarks include unfolded matrix multiplication, two-dimensional FFT, orthogonal sorting, and parallel pattern clustering. These simulation experiments resulted in a speedup factor between 10 and 15 when compared with a system using a single i860 processor. This is a representative approach to systolic computation using general purpose RISC processors with appropriate interconnections.

An example of a systolic array dedicated to a specific algorithm is [BrMe91] in which a two-dimensional systolic array for the column-by-column QD algorithm is presented. The work by Gregoretti et al. [GrRe92] is targeted at limited context application areas such as CAD tools for VLSI design, real time pattern recognition and movement detection, the mapping of programmable architectures and the emulation of cellular neural networks. The generic architecture involves the incorporation of a two-dimensional array of processing elements into a general purpose computing environment. Graphical and logical operators are implemented rather than floating point arithmetic in identical 1-bit processing elements, each of which processes a single image pixel. The first prototype has a 16 x 16 array of processing elements constructed from 4 x 4 arrays of 1.5 μm CMOS chips. The PEs are physically grouped into four columns per chip where the elements of each column share a common data bus used either for memory operations or for broadcasting instruction codes to all PEs.

A floating point arithmetic processor was developed for a systolic architecture for LU decomposition [JaLa89]. The arithmetic of this new cell is based on second-order polynomial interpolation.

In [ShGo89] a parallel computer architecture targeted at signal pattern analysis applications is described which the authors claim is scalable to configurations capable of teraflops (10^12 floating point operations per second). The architecture has a low interconnection overhead, making it well suited to miniaturisation using advanced packaging. These techniques would be necessary as a teraflops machine constructed from the proposed processing elements would require the integration of 40,000 PEs. Preliminary design and thermal tests project a computing density of 300 Gflops per cubic foot. The architecture is reconfigurable as a tree machine, one or more rings, or a set of linear systolic arrays.

Smith et al. [SmSo93] present a systolic processor cell implemented with a controller and the AMD Am29325 32-bit floating-point chip. Together these chips form a two-chip cell designed for one- or two-dimensional systolic arrays which can be used to implement a wide variety of signal processing applications. The controller chip controls the Am29325, and the data communications to other cells in the array. Architectural features include two interchangeable data memories, an input port which can be used as either a local or global port, and a 32-bit instruction word that allows concurrent use of all cell resources. The authors claim the approach is similar to a number of other approaches such as Kung's Programmable Systolic Chip (PSC),

but with novel attributes which enhance system performance.

The report by Nash et al. [NaNu84] had as its objective the investigation of the feasibility of building a floating point processor (24-bit mantissa and 8-bit exponent) on a single chip based on the Hughes Research Laboratories (HRL) present 28-bit fixed point chip (Multiplication Oriented Processor or MOP chip).

The Systolic Linear Algebra Parallel Processor (SLAPP) described by Drake et al. in [DrLu87] was designed to implement a core set of matrix operations which included matrix multiplication, QR decomposition, singular value decomposition (SVD) and generalised SVD. As with other projects, the architecture was tailored to the specific algorithms, and it was not optimised and efficiently integrated into the host processor.

The above survey concentrates on systolic processor techniques which are relevant to scientific, or numerical, computing. A more general survey of systolic array projects which have been initiated during the last decade can be found in a paper by Johnson et al. [JoHuSh93]. This paper provides a systolic processor taxonomy, and a discussion of the implementation and architectural issues related to all classes of systolic processors, including those which are programmable and reconfigurable. Extending Flynn's classifications of SIMD and MIMD to include Very-Few-Instruction stream, Multiple-Data stream (VFIMD) processors, the authors give examples of processors which fall into each category. They also observe the trend towards large systolic array processors which require complex support. They point out, however, that a number of simplified approaches have been shown to be effective for particular applications. Examples include the cellular processor architecture for image processing and VLSI design which operates as a coprocessor in a workstation environment

[GrRe92], and the Splash programmable linear logic array which also operates as a coprocessor to a workstation and is constructed from field programmable gate arrays

(FPGA's) [GoHo91].

2.11 Summary

This chapter has summarised the development of the systolic concept initially presented by Kung and Leiserson. In the fifteen years since the coining of the term systolic, the literature on systolic architectures, implementations, algorithms and compilers has been extensive. It is not possible in a short review to provide details of such extensive activity. What has been presented are the principles of systolic

architectures and the directions in which researchers have progressed in the field of systolic computation for numerical applications.

The architectural philosophies reported in the remaining chapters of this thesis differ from all work known to the author both in the way in which it is proposed that systolic techniques should be integrated into general purpose computing nodes and in the special purpose architectures developed for the implementation. By applying the RISC computer philosophy, effort has been concentrated on the efficient implementation of a reduced set of matrix primitives. While there are some examples of this approach, such as the work of Steenis et al. in [StTr88], the construction of a powerful matrix capability supplies only a part of the solution for a general purpose computing environment. The need for generality in the addressing of matrix data structures is indicated by the work on data formatting for systolic arrays discussed earlier in this chapter. This problem has been addressed in this thesis with a novel address generator which extends the range of algorithms which can be executed on the systolic hardware. The integration of the combined address-generation hardware and the systolic matrix computation unit has been devised in such a way that the software interface for a MATRISC processor, or co-processor, reflects the hardware directly. This provides substantial benefits in the use of the processor. It is not necessary to write new compilers to gain the full benefits from the MATRISC philosophy. It is a simple matter, for example, to define a matrix class in C++ which provides extreme ease of use for the applications programmer. It is unnecessary to provide the extensive software support required by developments such as the Warp processor or SLAPP system.

To summarise, the novelty in this thesis is found in three major areas, and in the whole to which these areas contribute: the use of systolic techniques in a general purpose computing node.

The areas are:

1. the design of the arithmetic units which implement floating point computations with variable precision and dynamic range, and which are designed to optimise the execution time of matrix operations;
2. the design of an address generator and systolic interface which provides an extended domain of application for the systolic computation engine;
3. the integration of the hardware into conventional software compilers.

A number of the algorithms quoted in this chapter as examples of algorithms which require different forms of systolic implementation are rewritten directly for the MATRISC processor. It is shown that the engagement processor of Whitehouse et al. can be applied to a broad range of problems when used in conjunction with appropriate address generation techniques. Example algorithms include convolution, FIR filtering, linear transforms and QR factorisation. Performance estimates are provided for particular hardware configurations.

Chapter 3

Systolic Ring Processors

Serial pipelined multipliers have been used for signal processing applications for more than twenty years. Examples of work on fixed-point multipliers in this field include a paper by [JaKaMc68], and a paper by Lyon [Ly76]. Other works include [BaMo75], and [PePa79].

The concepts used by the above workers for fixed point serial pipelined multipliers were extended and applied to floating point number representations by Marwood [Ma84], in which a linear systolic array floating point multiplier for operands with two's complement exponents and mantissae is described. The multiplier required a total of n identical cells to perform multiplication of floating point numbers with n-bit mantissae, and imposed no limit on the number of exponent bits.

The results of the current study extend the concepts of [Ma84] to systolic rings for both multiplication and accumulation of variable format floating point number representations.

3.1 A Systolic Multiplier

A systolic ring of cells, each of which represents the functional equivalent of a set of recurrence relations, operates on ordered number pairs of sign-magnitude mantissae and two's complement exponents. The cells are physically arranged and connected in a systolic ring which multiplies variable precision operands. Alternatively, the cells are arranged and connected in a linear array to perform the floating point multiplication of operands with unlimited exponent length.

The object of this section is to provide a generic architectural basis for the use of a recurrence cell which, when used in systolic arrays, can implement a new and improved serial pipelined floating point multiplier.

The architecture of the multiplier is shown:

1. to reduce the complexity of the earlier linear systolic array floating point multipliers of [Ma84] by:

(a) using a cell with a single set of recurrences to process mantissae encoded in a sign-magnitude format;

(b) using a new circuit with fewer storage elements to implement both mantissa re-ordering and storage, and the synchronous propagation of exponents;

(c) re-constituting all input operands at the output, in parallel with the product;

2. to provide a multiplier capable of both variable precision and variable dynamic range;

3. to have low gate complexity, consisting only of an I/O interface, a simple finite state machine controller, and at least one computational cell. The cell is constructed from the operand registers, a state storage register, a control signal storage register, a multiplier and an adder;

4. to show how to design a generic multiplier for a broad range of performance specifications by varying both the number of computational cells and/or the number of delay cells in the systolic ring. Varying the number base of the digits provides a further means for controlling the execution time of the multiplier.

The architectural description of a systolic digit-serial floating point multiplier follows. The logical operation of the multiplier is modeled as a state machine constructed from a linear systolic array of identical cells. Each cell is described mathematically as a set of recurrence relations. The results of the analysis are presented as a lemma and show that the cellular array implements the multiplication of numbers in both fixed and floating point formats. The analysis also demonstrates that the cell can be connected in a systolic ring structure to achieve variable precision multiplication. When implemented as a linear array, the floating point multiplier can process operands

with unlimited exponent length. When implemented as a systolic ring, the multiplier restricts the operand length to the dimension of the ring.

A floating point number F is composed of two parts, a fractional mantissa F_m and an integral exponent F_e, and can be represented as the 2-tuple

F = \{ F_e, F_m \}    (3.1)

The real number representation R of this floating point representation is

R = F_m \cdot b^{F_e}    (3.2)

where b is the base of both F_e and F_m, and |F_m| < 1.

In its simplest form the multiplier consists of a systolic ring arranged to accept as input three sequences of data digits, two of which represent mantissae and exponents of respective X and Y operands in floating point format, and a third which represents a partial product input P which is initially zero. The output from the ring is an identically formatted result digit sequence P which is the floating point product of the X and Y operands.
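In software terms the 2-tuple of (3.1) and (3.2) is simply an (exponent, mantissa) pair. The following Python sketch (illustrative only; the function names are not from the thesis) shows the representation and the product that the ring computes: exponents add and mantissae multiply.

```python
def to_real(f, base=2):
    """Real value of the 2-tuple F = {Fe, Fm} of (3.1): R = Fm * base**Fe."""
    fe, fm = f
    return fm * base ** fe

def fp_multiply(x, y):
    """Product of two 2-tuples: exponents add, mantissae multiply; the result
    mantissa stays fractional because |Xm|, |Ym| < 1."""
    (xe, xm), (ye, ym) = x, y
    return (xe + ye, xm * ym)

x = (3, 0.75)     # 0.75 * 2**3  = 6.0
y = (-1, 0.5)     # 0.5  * 2**-1 = 0.25
p = fp_multiply(x, y)
assert to_real(p) == to_real(x) * to_real(y) == 1.5
```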

3.1.1 A Multiplier Model

mode mode Machine M {Y} {0} (m systolic {P} {x} cells) {x}

mantissa exponent mantissa exponent

m ts e m tse k

mode

Figure 3.lz Mo,chine I/O and operand, format.

54 Referring to figure 3.1 let {X} and {)'} be two sequences of digits entered in parallel into a machine M and let {P} be a sequence of digits output from the machine. The sequences are constructed from k digit 2-tuples. Each 2-tuple represents a discrete floating point number and consists of an ordered exponent and mantissa number pair. Each number is entered least significant digit first. The frrst e digits in a 2- tuple represent the exponent, and the remaining (k - e) digits represent the mantissa. Figure 3.1 is a schematic representation of the machine input and output. A mode signal is used to differentiate between exponent and mantissa digits.

Let the machine M be constructed from rn identical cells. Further, let the state of the machine at time n be {^9o(n, XorXtrXrrYo,Yr,Yz,P) t p:1,...,rn} where the states X¿, Y and P represent storage nodes for digits. The behaviour of lhe pth systolic cell in the machine is defined by the following recurrence relations:

Xs(p,n): xr(p - 7,n) (3.3) Xt(p,r): Xo(p,," - 7) (3.4) x2(p,,n): Xt(p,n - 7) (3.5) Yo(p,n): Y2(p - 7,n) (3.6) Yr(p,r): Ys(p,n - 7) i,k -f2p 1n 1ik t 2p I e I1 (3.7) Y1(p,n - 7) ilc + ei2p+71n< (i + 7)k +2p (3.8) Y2(p,n): Y1(p,n - 7) ik-f2p + 1 < nlikI2p-f e+I (3.9) Ys(p,n - I) ik + e i2p + r 1n< (i + r)k +2p + t (3.10) P(p,n) : Xt(p,,n) -tYr(p,r) ik I2p 1n 1 ik l2p -l e (3.11) P(p-1,n-1)-f

X{p,,n)Yy(p,n) ik -t 2p * e ( n < (i 1- 7)k ¡ 2,p (3.12) where i is a non-negative integer.

In the following, a two-dimensional mapping n: ki + j + 1 is used to express the nth digit of the linear input and output sequences in terms of the ith 2-tupIe. Using this mapping, the one-dimensional sequences {X}, {Y} and {P} can al1 be written in the form {tr.'(i, j):Vi ) 0:0 < j <,k} where the element w(i,j) is the 7¿h digit of the ith 2-tuple.

The following lemma has been derived:

55 Lemma: For the machine inputs {X} and {Y} defined above, the state P(p,n) of the pth stage of the machine M at time n is given by the following:

Vi > 0 : 0 < j ( e where j : n-2p

P(p,"):r(i,i)+aU,i) (3.13)

Vi > 0 : eip- 1 < j < k- 1 where j :r-p-7

p-l P(p,r): Ð *(i.,j - Ðv\,s * e) (3.14) s:0

Vi > 0 : 0 ( r ( p where r : ï'¿ - p p-r-r P(p,n) : Ð -I - *r * e) (3.15) s:0 "(i,k ")a\,s

Proof: By Induction.

mode mode DQ DQ

% 2:l Y 2:1 Mu 0 Mu

P, 2l Pr Mu o €'E P. (d 2l Mu f¡¡

x" x, x"

r-bit data storage register

DQ l-bit contol storage register

Figure 3.22 The systolic fl,oating point multiplier cell

The lemma can be interpreted as follows

56 f . in the interval defined by equation (3.13) the exponent elements of the input 2-tuples are added independently in every cell. Only the final cell contributes to the output digit sequence;

2. In the interval defined by equation (3.14), the digits output from the pth cell (p < *) are the low order digits of the product of the i'¿ input mantissae. The expression is not defined for p - rr¿ as the low order digits do not reach the last cell of the machine and do not constitute any part of the output digit sequence;

3. In the interval defined by equation (3.15) the digits output from the last cell of the machine (when p: rn) are the most significant digits of the product of the irä input mantissae.

A complete machine M rntst include carry states in the machine definition. Expand- ing the defi.nition to include these states gives {So(n,Xo,Xr,Xz,Yo,Yt,Y2,P,C) : p:7r...,rnj. These carries are included in the computation as a carry-save opera- tion.

Time Cell 0 Cell 1 Cell 2 Cell 3 n Ao Ut Uz Ao Ut Uz 1 ê¡ Uo Ut Uz 2 €1 €g Uo hUz o a) €2 eI €g 4 €3 €2 €1 êg

5 Tf?,9 e3 €2 ê1 €g 6 Tn1 Tfl,g €3 €2 ê1 €g

I TfI2 TTl,g rnl êg €2 €1 €g

8 Tlt3 TTt,g TfL2 TNI €3 e2 €y êg I €g Tfl,g Tn3 Tnl TfI2 €3 ê2 êy €g 10 €1 €g €g rTIl TfZ3 Tf1,2 €g €2 €1 11 €2 €1 €g Tn1 €g Tfl2 TfI3 €g €2 72 eg €2 e1 €g €g TfI2 €g TfI3 €3 10 Ia) Tf1,g €3 €2 €1 €g TTI2 €g Tf1,3 €¡ 74 TfIl TfIg €3 €2 ey eg êg rf73 €g

15 Tf1,2 TTl,g TfIl €3 €2 €1 Ag Tlt,3 eg 16 Tfl,3 Tfl,g TfI2 Tft,1 €g €2 €1 e0 eg 77 €¡ Tn'g TfI3 Tk1 TfT2 €3 €2 €1 €g

Figure 3.32 Operand rnouen'Lent, through a four cell linea,r array of recurrence cells.

Figure 3.2 shows a ceIl which implements the recurrence relations used to define the machine M. A linear array of these cells implements addition and multiplication of ordered 2-tup1es. The maximum result mantissa de-normalisation which can occur

Ðt with normalised operands is one digit. By placing a sign digii in the exponent field, with guard digits as needed, sign-magnitude representation for the mantissa can be used. A desirable feature of a linear array is the ability to multiply operands with arbitrary length exponents.

Figure 3.3 shows the movement of an eight digit operand through a four cell ar- ray. The exponent digiis are labelled e0,...,e3 and the mantissa digits are labelled Tn6,...,Ttù3. It is clear from the diagram that the exponent digits propagate through the array with simpie delays. In contrast, the mantissa digits only propagate within the array and do not reach the output boundary. Each m¿, however, is stored within cell i for the period that the mantissa is passing through the cell. This allows a partial product to be formed within each cell and accumulated with the partial prod- uct transmitted from the prior cell, so implementing the conventional muliiplication algorithm.

The same multiplication algorithm can be shown to be implemented after replacing equations (3.9) and (3.10) with the modified recurrences: Yr(p,n):Y1(p,n-7) ikl2plnlikl2pleIl (3.16) :Yo(p," - 7) ik I e l-2p +7 1n< (i + 7)k +2p (3.17)

Time Cell 0 Cell 1 Cell2 Cell 3 n Ao Ut Uz Uo At Uz 1 €g Ao At Uz 2 €1 €g Uo UT Uz 3 €2 €y €6 4 €3 €2 et €g

5 TTLg e3 e2 €1 e 0 6 TfIy Tno €3 €2 e 7eO

( Tf|,2 TfLg Tft,1 €3 e 2 e 1 €g

8 Tfl3 Tf1,g T11,2 TfLy €3 €2 €1 €g

9 €g Tf1,ç¡ Tft,g TTt,1 Tfl,2 €3 €2 €1 €g

10 €1 €g Tfl,g TfLl Tf1,3 Tf1,2 €3 €2 €1

11 €2 €y €g TfIl Tf1,g Tf1,2 TTt,3 €3 €2

72 €3 €2 €y €g Tf1,1 Tft,2 TfIg TfI3 €3

13 TfIg €3 e2 €1 €g TfI2 Tfl,1 TfI3 TTLg t4 Tfl,1 Tf"ì,9 €g €2 €1 €g TfI2 Tft,3 TfIl 15 TTt,2 Tft,g Tf-\7 €3 €2 €1 €g Tf1,3 TfI2 16 Tû'3 Tfl,g Tfl2 TfT1 €3 €2 €1 €g TfI3 77 €ç¡ Tng Tf|,3 Tû'y Tf1,2 ê3 €2 €1 €g

Figure 3.42 Operand mouement th,rouglt a four cell linear arrúy of recurrence cells using the modifi,ed, recurrences for Y.

58 However, the behaviour of the cells offers an advantage which is made clear in figure 3.4: the digit sequence from the )z output port of the last cell in the machine is iden- tical to the digit sequence input to the Y input of the frrst cell of the machine. Thus the input operands leave the cell unchanged and in parallel with the floating point product. Mantissa digii rn¿ is still stored in cell i to allow the multiplication algorithm to be implemented identically, and as before the exponent digits pass through delay stages only. The restoration of the Y digits is possible because successive cells rotate the mantissa sequence by one digit, so that at the final stage when all recurrences have been applied the sequence has been fully rotated, and is identical to the input to the first stage. This has significance to the testing and verification of a multiplier implementation.

mode mode DQ DQ

% Y2 2:l 2:l Mu Mu R Pr zil Mu 6) rtË P, ftl 2:l Mu tI{

x" xr x,

r-bit data storage register

DQ l-bit contol storage regisúer

Figure 3.62 The improued systolic fl,oating point multipli'er cell

Figure 3.5 presents the modified cell which preserves the input operands. No addi- tional circuitry is required when compared with the cell of figure 3'2'

3.t.2 A Systolic Ring Multiplier

The recurrence relations which define a systolic multiplier show that X,, Y, P and

59 mode digits are transferred unchanged between cells. As a consequence, additional delays can be introduced between cells without affecting the operation of the recur- rences implemented within the cells. In addition, only a finite number of input digits (k) contribute to each output. It is possible to replace the linear array of rn cells with a systolic ring of zz" systolic cells and lc - 2n" delay cells, where in general n. 1 mf 2. Where there are delay cells, they may be lumped or distributed. The k digits of the operands are input into the ring and rn recurrences are applied by circulating the operands times before outputting the result. The next computation W] is fully pipelined. New operands are entered into the ring as the results from the previous computation are being output. Input and output from the ring and the number of re-circulations of operands within the ring is performed by a multiplexer under the control of a finite state machine.

Confrol

A Systolicmultiply \-/ cell

Delay cell

e e e e e e e e e e e e e I c

e e e

e t

Legend: m mantissa digit e exponent digit s sign digit L load digit g guard digit Figure 3.62 A systolic ring multiplier wi,th, three possible operand, forrnats

60 Non-integral re-circulations of the operands are possible. However, additional in- put/output would be required at various points around the ring to sup- port this possibility.

Figure 3.6 is a schematic representation of both the operand and control format for a ring systolic array. Three possible data formats are presented in this figure where it is assumed Lhat 2 digits are assigned to control functions, such as data-flow instructions embedded within the operand, and guard digits between fields. Assuming that the sign digit is part of the mantissa, in the case where a single computational cell is augmented with k - 2 delay cells, the multiplier can process floating point operands whose specifications range from k - 2 mantissa digits and a single exponent digit (i. e. a fixed point multiplier) to k - 3 exponent digits and two mantissa digits (i. e. a fixed point adder). This provides a wide range of possible dynamic range and precision options in a single hardware implementation for length k operands.

3.2 Parallel Digit Multiplication

In [Ma90] and [Ma91] a digit-seriai floating point systolic multiplier is described. The multiplier is constructed from a ring of simple cells, each of which consists of a number of delay cells, multiplexers, an r x r bit parallel multiplier and a2r-bit adder. The speed of this multiplier is limited by ihe maximum speed of the multiplication and addition elements. Previous work on integer digit-serial processing techniques can be found in [HaCo90], [CoHa92] and [Pa89]. Much of this work is concerned with the partitioning of a parallel operation into a sequence of smaller-radix digit-serial operations. The following sections present a multiplier based on [Ma90] and [Ma91] but which is of lower compÌexity, and which executes at almost twice the speed.

The following section presents the algorithm for the multiplication of an N digit number by an M digit number, where the digit base B is 2' , rewritten in terms of the binary representation of the digits. The result is used to show that for digit-serial multiplication the full r x r bii multiplication of digit pairs shown in [Ma90] and figures 3.2 and 3.5 is not required. In particular it shows that the digits which contribute to the output sequence are formed from the accumulation of partial multiplications whose critical paths are approximately half that of an r x r bit parallel multiplier. A multiply-accumulate structure is described which computes the desired output term in each clock cycle with pipelined partial multiplications, accumulation and addition.

61 The reformulation of the multiplication algorithm in terms of a pipelined digit mul- tiplication leads to an elegant systolic multipiier cell. The use of the cell in a systolic ring digit-serial multiplier has a number of advantages over the original proposal. These advantages include the following:

1. for mulii-bit digits the multiplier executes nearly twice as quickly;

2. tlne adder of [Ma90] is incorporated into the new pipelined parallel digit multiplier, with a consequent reduction in the number of elements in the systolic cell.

3.2.L Digit-Serial Integer Multiplication

Consider X, an M-digif number {*o,rt¡...¡tM-r } represented in base B as

M 1 X D *nltn (3.18) i:0

Similarly Y, an,n/-digit number in base B is represented as

N-1 y:\angn (3.1e) i:0

The product of. X and Y is given by

M-| N-7 xY: f D riajpi+j (3.20) i:0 j:0

Let each digit r¿ and y¡ of X and Y have an r-bit binary representation given by

r-l ri:Ðrik2k (3.21) k:0 and r-L yj: aj 2 (3.22) tl:0

62 Substituting equations (3.21) aú (3.22) into equation (3.20) gives

M-l N-lr-L r-L Xy: Ð Ittr¿tajûk+tþi+i (3.23) i:0 .i:0 È:0 l:0

Re-writing the innermost summation gives

M-l N-l r-7 r-I-k r-L XY: ttt Ð ra¡y¡¡2k+I + P D, *oraj,2r*'-' þi+¡ @.24) i-0 j-o k-0 l:0 I:r-lc

r-7- k Let A¿jn: rikA jt2 k+ (3.25) tl:0 r-l and B¿jt: I r¿tujt2k-tl-r. (3.26) I:r-le

Then equation (3.24) can be re-'vvritten as

M-l N-I r-I xY : t t ÐlAn¡rpi+i + B,¡*þn*i*') (3.27) i-o j-o k-o

The A¿¡¡, are associated with partial digit sums of weight þi+i ur'd lhe B¿¡¡ are associated with partial sums of weight p;+i+t.

Expanding the summation over j gives

M -l r-l xY : B¿ot A¿rnþi+t B¿rt t t lo,orlt' + þi+I + + þi+'l i:0 lc:0

An r0'*' + B¿zt þi+t + .. .A¿o-ÐnþN-t + nnl*-r>*þ*f

M-L r-l N-2 A¿ot 0i + t (An(¡*t)r + B¿¡n) 0i+j+t * B¿(w-ÐnþN (3.28) tti:0 k:0 j:l

Expanding the A¿¡¡ and B¿¡¡ terms in this equation gives

63 M_I r-l r-l-le XY: t t t r¿¡Ys¡2k+I Pi I z: 0 k:0 l:0 N-2 r-l r-l-k r-l r-l j+r t D t r¿tca¡+Ðt2t+¿+I t n¿¡y¡¡2k+I 0i+ + j:0 lc:O l:0 k:0 I:r-k r-I r-I t t ïika(N-L)I2k-tt-r BN (3.2e) k:0 l:r-k

Expanding the two terms in the inner brackets for r : 4 gives the following

rtys2al r¿¡y¡¡2k+I (3.30) t D - x2y325 I rzyz2al lc:O I:4-k rtAs26t rsyz2sl rsyt2al and

roYs23 +roaz22 +roat2l +roao2o + k rt Yz23 +rtat22 +rtyo2r + t D rihY(i+r)t2ktt : (3.31) ,k:0 l:0 rzat23 +rzao22 + rsYo23

where the terms in equation (3.30) are the high-order components of the digit product

@¿a¡+t) and the terms in equation (3.31) are the low-order components of the product @¿A¡). The i, j and i + 1 subscripts have been omitted in these equations for clarity of presentation.

This re-formulation of the XY product shows that partial products of weight p;+i+t are formed by summing the low-order digit output from the digit product (*oyf¡*r¡) with the high-order digit output from the digit product ("¿A¡). To form digits of weight pt'+i+t a structure is required which in each time period can compute and accumulate the two different partial products from adjacent time periods, and then accumulate the result with partial products computed in other cells. The function required is Z: XY +V +W (3.32)

64 where X,, Y, l/ and W are r-bit operands and where the term XY implies the pipelined computation and accumulation of the high and low order digits as discussed.

To implement this function a pipelined parallel multiplier is used. Pipelining of the high-order output of this multiplier with one level of registers delays the high-order digit by one clock cycle. This delayed digit is then fed back to the I/ input during the next computation to allow its accumulation with the next low-order digit, and so forms the desired term in equation (3.29).

Y3/ y2 Yt/ yo/

N O a> ÈN a> ÀO

x0

x1

x2

x3

f3

A Cin x AND Gate Æull Adder v

Cout Sum a l-bitregister delay

Figure 3.72 A pi,peli,neil four-bit d,i,git multiplier.

The structure of a multiplier which implements this operation is shown in figure 3.7

65 The need for the single level of registers to properly sequence the digit-wise addition within the multiplier allows the minimisation of the critical path. In fact it is possible to almost halve the number of delays present in the critical path with an appropriate placement of these registers. This has the immediate consequence that the optimised multiplier can function at nearly double the clock speed of a conventional parallel multiplier array.

¡åa ;Ër å;e å E

AB Cin

o Data Latch Full Adder

Critical Path = 5 Full Adders + I AND GatÊ Cout Sum Figure 3.8: ,4 pipelined four-bit digit multiplier optimised, for both area a,nd criti,cal path.

The optimised multiplier is shown in figure 3.8. Ii contains 2n - 7 registers compared with 3n - 1 registers for the direct implementation shown in figure 3.7.

3.2.2 Digit-serial Floating Point Multiplication

The r x r bit parallel integer multiplier and2r bit adder of the floating point multiplier of figures 3.2 and 3.5 discussed in [Ma90] and [Ma91] can be replaced by the single digit-serial multiplier of the previous section which implements equation (3.32).

66 During the mantissa multiplication the algorithm implemented is

PPout: (XY + Y) + PP¿n (3.33) where

I/ is the high-order digit generated by the pipelined multiplier, and

PP¿n is the partial product input from the previous cell.

mode DQ DQ

Y, 2;L 2:l 0 Mux Mux

PP F Modc = 1 F=X.Y+PP+V

Mode = 0 F=X+Y V

X

r-bit dat¿ storage register

DQ l.-bit conüol storage register

Figure 3.92 A di,git-seri,al multiplier cell

During the exponent addition mode, the function implemented is:

PPout: (X x 1 + V) +Y :X+y (3.34)

The value of I/ is zero in this part of the computation as there is no high-order output from the product X x !.

67 3.3 Coalescence of Systolic Ring Multipliers

Systolic ring arithmetic units provide new possibilities for systolic array processors. As an example, consider an array of processors, designed to process single precision operands. If each processing element in the array is implemented as a systolic ring it is possible to coalesce adjacent rings into half the number of larger rings. These larger rings can process double-length operands with the same number of circulations as the single ring, as the ratio of mantissa digits to systolic cells remains a constant. For larger order systolic arrays the ability for cells to coalesce makes possible the construction of variable dimension arrays which can be matched to both the problem size and the number representation.

Master Slave Contol Control

Input Ouþut (InpuÐ (Ouþut)

Systolic multiply Systolic multiply \-/ cell cell ^

Delay cell Delay cell

Figure 3.10: The coalescence of two adjacent systolic ring multipliers. The I/O of th,e second ceases to be used, when tlt'e coalescence h,as occurred.

Figure 3.10 shows an implementation in which two systolic ring multipliers can be coalesced into a single multiplier. A slave control line from the controller of the first ring is used to operate an additional multiplexer which provides the connection between the two rings.

68 3.4 A Ring Accumulator

The accumulator described in this section is constructed as a heterogeneous ring array structure containing a main logic or arithmetic block and input/output multiplexer, a k-stage delay block and a systolic de-normalisation array.

The number of systolic de-normalisation cells in the ring can range from a minimum of one to a maximum of mf 2 where rn is the number of digits in the mantissa of the number representation. The number of recurrence cells determines the number of times the operands must circulate in the ring for complete de-normalisation, and so determines the speed of the accumulator.

The architecture offers the following advantages

1. a broad range of performance specifications can be achieved by varying the number of de-normalisation cells and delay cells in the systolic ring. Varying the number base of the digits in the floating point format also provides a means for controlling the execution time of the accumulator;

2. an adder/accumulator can be constructed which is capable of both variable precision and variable dynamic range;

3. the complexity of the the problem of constructing variable precision floating point adders and accumulators is reduced by using simple replicated cell struc- tures to perform de-normalisation.

Novel circuitry is used to implement in a serial pipelined fashion both the increment- ing of an exponent difference and the conditional de-normalisation of an associated mantissa in the systolic de-normalisation array.

3.4.I Floating Point Addition/Accumulation

The fixed point accumulation of a frxed point number Z with a fixed point accumu- lator value at time n of An is carried out by the simple operation of integer addition

L.e

An+t:An+Z (3.35)

69 The floating point accumulation of a floating point number {2",2,.} with a floating point accumulator value at time n of {A.,n, Arn,n} to form the new accumulator value {A",n+tt Am,n¡t } at time n l7 is performed by the following algorithm:

Z",n} A'*,n+t : A*,n.b^ir"{o,A",n- I Z *,n.b*ir,{o,z",n-A",n} (3.36)

Á ., Á, 6-(+[ogo¡a;,.*,¡1; (3.37) ^m,n]-l - ^m,n!7' A'",n+r : mar{Ae,n, Z",n} (3.38)

A",n+r : A'",n+r+ 1 + [1ogrl,4'1,,*, l] (3.3e) Aov p : sigr'(M ax:-eïp - A.,n+r) (3.40) AuF: sign(A",rr1v - Mi'n-erP) (3.41) where

Mar-erp is the maximum exponent value in the particular format,

Min-erp is the minimum exponent value,

Aovp is a flag which indicates whether the exponent is greater than Mar-erp

A¡¡p is a flag which indicates whether the exponent is less than Min-erp

ó is the number base of the floating point representation,

[.] represents the integer part of,, and

sign(.) is the sign of the operation.

This algorithm can be considered representative of the way in which addition or accumulation is performed in the IEEE 754 floating point standard, although no attention has been given to the implementation of the rounding modes. [Siev81], [Cody81], [Hough81] and [Coon81] provide a discussion of this standard, its behaviour and its applications.

Some discussion of the algorithm is warranted. Equation (3.36) represents the shift- ing of the operand mantissa which has the smallest exponent by a number of digit places equal to the difference in the exponents, followed by the summation of the shifted operands. This temporary result A'*,n+r is conditionally left or right shifted according to its value. The amount of shifting required is given by equation (3.37) and is known as post-normalisation. Three possibilities exist for this operation:

70 1. the result is smaller than the lower bound of the defined range for the frac- tional part of the representation. In this case the operation has caused a loss of precision known as ca,tastrophi,c cancellati,on. Leading zeroes which have been introduced into the result are removed by lefi shifting the mantissa and appropriately adjusting the exponent;

2. mantissa overflow in which the result is larger than the upper bound of the deflned range of the mantissa. In this case the result is righi shifted one place to restore it to the defined ranÉ{e, and the exponent is decremented;

3. the result falls within the defined range of the representation, in which case no shifts are required.

Equation (3.3S) defines the exponent of the temporary result A'.,n+r. This exponent value is modified by any shifts which are performed upon the mantissa to preserve the real value of the 2-tuple. Additive corrections to this exponent value are defined by equation (3.39). Equations (3.40) and (3.a1) set flags which indicate whether the result has exceeded the floating point representation at either end of its dynamic range.

Existing techniques for the design and construction of floating point adders and accumulators are broadly categorised as parallel or serial. The parallel architectures are intended for low latency designs. An example is the work of Owen [Ow87]. For system architectures in which longer latencies can be tolerated, serial architectures are used to advantage and an example is Chau et al. lChKaKuST].

3.4.2 A Sirnpliffed Algorithm

Let A and Z be numbers whose floating point representations ,4-* and Z* introduce errors e4 and ez such tlnal

A*:Alet (3.42)

Z*:Zlez (3.43)

The sum of these numbers is

77 A* + Z* : A+ Z I et * ez I e"urn (3.44) where e"u* is the error introduced by the machine precision, or register length, during the process of floating point addition.

The maximum relative error .E when forming this sum occurs when they have opposite sign. This worst-case relative error is approximated by

(3.45)

The significance of equation (3.a5) to the floating point addition algorithm of equa- tions (3.36) to (3.41) is that the error is formed in equation (3.36). The post- normalisation process of equaiion (3.37) introduces no error term as it is simply a shift operation in which no significant digits are lost from the register. As a con- sequence the operation may be omitted without altering the error behaviour of the accumulation process.

Using this result the algorithm described by equations (3.36) to (3.a1) can be sim- plified by replacing equation (3.37) by

lart,rnll _ ãm,n!I'Á, 6_lle,*,.+lf (3.46) ^ - and equation (3.39) by

A",n+r : A|,+t + llA'rr,r+rlf (3.47)

These equations omit the post-normalisation operation which would have been per- formed when there is catastrophic canceliation.

A further simplification is possible for some implementations in which the exponent operations are performed in registers whose lengths are substantially longer than the exponents. In these cases there is no need for overflow and underflow flags.

Removing the requirement for post-normalisations after each operation does not com- pletely remove the need for post-normalisation if the final result is to conform to floating point standards. Conformance is necessary for co-processor applications.

72 However the operation needs to be done only once, at the end of the completed accumulation, rather than after every operation, so allowing the design of simpler processing elements which operate more quickly.

Rewriting equations (3.36) to (3.41), but with the simplifi.cation of (3.a6) and (3.47), in terms of the exponent differences between the contents of the accumulator and the current operand gives

A'.,n+r : mar{Ae,n, Z",n} (3.48)

D.,n : min{A.,n,, Z",n} - A|,n+I (3.4e)

A'n ,n: Arn,n A A",n (3.50) for ernll - Z'rr,n: Z^,n

A'rn,n: Z*,n for A'",r+r * A",n (3.51) Z'rnrn: A^,n

Zt*,n+t: Zt^,n.bD",^ lD"l < - - 0 otherwise (3.52) A|n,+r: A'*,n * Ztn,n+t (3.53)

A.,nrr : A|*+r + llA'r,,r+rlf (3.54)

A*,n+r : A'*,n+r.6-lll'*,'+'l) (3.55) where m is the number of digits in the mantissa of the floating point representation.

To implement the de-normalisation of equation (3.36), an array of at least one systolic cell is required in which the transfer of data between cells is described by the following recurrences

Mo(P):Mz(P-r) (3.56) Co(P):Cz(P-7) (3.57)

zo(p):zz(P-7) (3.5s) Ao(p):Az(p-7) (3.5e) and the internal recurrences in each cell are

F7Drù M2(n): M1(n - 7) (3.60) M1(n): Ms(n - 7) (3.61) Cr(") : C{n - 7) (3.62) Ct(") : Co(n - I) (3.63) Z'(") : Zo(n - 7) (3.64) Zn(") : Zs(n - 7) Mt(n - 1) : o : Z+(n - t) Mt(n - 1) : 1 (3.65) Ar("): fu(n - 7) (3.66) Ar("): Ao(n - 7) (3.67) Z"(") : Ct(n- 1) + Zo(n - 7) + Cy(n - 1) for Ms(n - 7).Mt(n - t).Za(n - 1) : 1 : ct(n i- 1) + zt(n - 1) + Cy(n - 1) : Ct(n - 1) + Zo(, - 2) + Cy(n - t) otherwise (3.6s)

C is defined to contain the value 1 in the digit position corresponding to the least signifrcant exponent digit, and is zero elsewhere. An examination of the recurrences (3.65) shows that the sign of the exponent is stored ir Za for the duration of the mantissa. This value is used to control via recurrence (3.68) whether the mantissa output 22 is delayed either one or two stages when lhe mode values Ms and M1 are high. This effects a conditional one-digit de-normalisation of the Z rnantissa field relative to the ,4. mantissa when the exponent difference D" is negative. The presence of a 1 in the C digit sequence is used to increment the exponent difference according to the recurrence (3.68).

Each cell when connected in a linear or ring structure can implement the one-digit de-normalisation and sign-extension required for floating point addition using ones- complement or two's complement mantissae, and the de-normalisation without sign extension for sign-magnitude mantissae. An rn-bit mantissa requires the application of. m recurrences for full de-normalisation. These recurrences may be applied either by connecting rn-cells in a linear array, or by connecting at least one cell in a systolic ring structure with sufficient delay cells to contain the operand, and circulating the operands until the mantissae are aligned as indicated by a non-negative exponent difference or until rn recurrences have been applied.

74 Mo M M2 DQ DQ

z3 0 2:l z4 Mux DQ

1 co c

cy

z^ zr l-'. c) zo rc) 0f- € 2:l cü Mux tli ,\ A A,

r-bit data storage register

DQ l-bit control storage register

Figure 3.11: A systolic de-normali,sation cell.

Figure 3.11 is a schematic diagram of a cell which implements these recurrences

The implementation of the shift unit is a novel application of the characteristics of systolic circuits. Brackenbury et al. [BrWe9O] discuss the design of circular shifting circuits using two or three levels of logic. The shift circuit presented here operates in a serial/paraliel mode as a function of the number base chosen.

lÐ Conüol

Input Ouþut

------>

UO and arithmetic unit

,,^. Systolic denormalisation t

Delay cell

e e e:e:e :e le re re re e e e e e e E

m e,e te te :e I ûÌ Ít

m :m :m:m rm

Figure 3.L22 A systolic ring accumulator

Figure 3.12 is a schematic of a floating point systolic ring accumulator which imple- ments the algorithm discussed above. It is not a homogeneous ring, as non systolic arithmetic operations are performed in the I/O and arithmetic unit at the top of the ring. De-normalisation occurs in the systolic de-normalisation cell, of which only one is shown in the schematic. The delay cells which complete the ring provide storage for the full length of the data operands. The input to the accumulator is presented sequentially with a series of floating point representations of real numb erc Z consist- ing of triplets having the form {2t,2.,2,.}. An initialisation flag fr,eld Z¡ is part of the descrip tor. Z is a digit sequen ce representing the exponent of the real number Z " and Zrn is a digit sequence representing the mantissa of the real number Z. A mode signal entered in parallel with the triplet identifies which of the fields Z l, Z. and Z^ are being input at any one time. In this impiementation an additional constant digit

/o sequence C is also entered in parallel with the triplet and is used to increment the exponent difference D" of. equation (3.49).

The systolic ring consists of four circular registers;

!. a Z register containing fields representing the exponent difference D", equal to the difference between the accumulator exponent A. and the input operand exponent 2", and lhe Z mantissa valte Z^1

2. an,4" register partitioned into fields containing the exponent and mantissa of the A operand. This register contains the accumulated result;

3. a mode register which contains lhe mode signal; and

4. a C register which contains a constant value which is circulated around the ring.

Prior to entry into the ring, the current operand is denoted by the triplet {z¡,z",zrn} in which the sign of the operand z" is entered in the flag freld Z ¡. Sign bits are stored in single-bit registers {A",2"}. The operand registers are of sufficient length to contain the exponent, mantissa, sign bits and flags. The contents are differentiated by the presence of a mode bit which is associated with the operands. Zero flags are required by the accumulator, and are stored in the register set {zr¡, Z"l, Ar¡}. A ioad instruction is entered with the current operand and both initiates a ne'vl/ accumulation and causes the accumulator to output its current value.

The output provides a digit sequence ,4 which is the floating point representation of the accumulation of the real numb ers Z together with a parallel mode output signal which identifles the elements of the triplets {Af ,A.,A,"}.

Operands are entered into the accumulator least significant digit or least significant bit (LSB) first. A state machine decodes the different fields.

3.5 Coalescence of Ring Accumulators

As with the systolic ring multiplier architecture, a characteristic of the ring accu- mulator is the joining or coalescence of two adjacent accumulators to form a single accumulator capable of accumulating operands of double length. In the two systolic rings which have coalesced, the multipiexer for the second ring is controlled by the controller of the first ring.

77 Master Slave Control Control

Input Ouþut (InpuÐ (Ouþut)

ao VO and a¡ithmetic arifhmetic unit unit

Sysúolic Sysûolic ^, denormalisation cell U

Delay cell Delay cell

Figure 3.13: A double lengtlt ring accumulator formed by tlte coalescence of two adjacent rings.

Figure 3.13 is a schematic representation of two rings which can be coalesced. The arithmetic unit of the second ring is slaved to the first ring, and simply circulates the operands. The double ring can process double-length operands with the same number of circulations as the single ring, as the ratio of mantissa digits to systolic cells remains a constant. For larger order systolic arrays the ability for cells to coalesce makes possible the construction of variable dimension arrays which can be matched to both the problem size and the number representation.

3.6 A Systolic Ring Floating Point Multiplier/Accumulator

In previous sections topologically equivalent architectures for both ring multipliers and ring accumulators have been defined which operate on variable precision and variable dynamic range digit-serial operands.

In [MaBe92] and in this section, details are given of a new digit-seriai systolic ring floating point processor for the formation of inner products. This new systolic ring which implements the combined function of multiplication and accumulation appears at the block schematic level to be identical to that used for the multiplier. The difference is a small increase in complexity in both the systolic cells and the single logic element. Interconnections are made only to nearest neighbours, as is characteristic of systolic architectures.

78 Conüol

Input Ouþut

VO and -> a¡ithmetic unit

Systolic multiply/accumulate cell

Delay cell

e e e e e;e e :e I e e s

e e e e s II}Il1In

imrm

Figure 3.L42 The systoli,c ring rnultiply/accumulate processor.

A block schematic of this new digit-serial floating-point systolic ring processor is shown in fi.gure 3.14. The data format required by the ring processor is also shown in this fi.gure. As with the multiplier, the ring consists of three elements. The first is an I/O logic element in which I/O and logical operations are performed. The second is a systolic cell which may or may not be replicated around the ring, and which implements two distinct recurrence relations upon operands which are circulating in the ring. The selection of the appropriate recurrences to be applied at a given time is determined by a microcode field which circulates with the operands. The third element of the ring is a delay ceil. The number of delay cells in a ring is chosen so that the length of the ring is equal to the length of the operands'

79 Instr DQ DQ

Instruction Decode

Y 2:l 2:l 2l 0 Mux Mr¡x Mux

PP

X.YI+PP{C

x

Acc

r-bit dat¿ storage register

DQ 4-bit control storage register

Figure 3.15: Tlte systolic multiply/accumulate cell.

Figure 3.15 shows a schematic representation of the systolic cell

Y 2:l Mux

PP

X.Y,+PP{€

X

Figure 3.16: The logical function of th,e systolic cell during multiplication

80 Sign of exponent difference Y 2:l Mux 0

Mantissa shift

Y Y+1

Exponent increment

Figure 3.17: Tlte logi,cal function of the systolic cell duri,ng iJe-normalisation.

Figure 3.16 shows the logical function of the systolic cell when it is performing a mul- tiplication operation on the mantissa of two floating point operands. The structure of the cell is simpler than that proposed in [Ma90] and [Ma91] due to a number of differences, one of which is the use of a the digit-serial multiplier which embeds the required addition operation in the function

PPIII:rnUn*PPnIcn (3.6e) where pp, r, y and c are each r-bit digits.

Figure 3.17 shows the logical function of the systolic cell when it is performing a de-normalisation operation for the floating point accumulation function. Two oper- ations are required in this mode, one to increment an exponent difference, and the other to de-normalise the mantissa. These operations carried out on the exponent and mantissa fields of the appropriate operand. De-normalisation is a natural op- eration to perform on a systolic processor, as it simply requires the movement of one operand relative to another. In this case the relative movement is obtained by logically selecting a cell data path of either one or two delays for the desired operand. The other operand always passes through two delays, and so there is a relative shift- ing of the operands. The cell circuitry stops the shifting operation if the exponent difference becomes non-negative.

As in the ring architectures described earlier and in [Ma90] and [Ma91], each cell

81 applies a recurrence relation during either multiplication or accumulation to the mantissae of the input operands. For an rn digit mantissa, it is necessary to apply rn recurrences to compute the product.

For an operand format of k digits, m of which represent the mantissa, and the remaining lc - rn represent exponent, sign, instruction and guard digits, a systolic ring can be constructed from n" cells and k -2n" state registers, where n. 1mf2. The state register cells may be lumped or distributed. The k digits of the operands are input into the ring and the m recurrences are applied by circulating the operands l^1".1 times for the multiplication, and l*1".f f 1 times for accumulation where [.'] represents the smailest integer greater than or equal to the argument. The next computation is fully pipelined. New operands are entered into the ring as the results of the previous computation are being output.

As with the earlier ring multipliers and accumulators, the mode input shown in figure 3.14 differentiates between the fields of each operand. As an input to the finite state machines and instruction decoders within the ring, it allows the processing of operands with different formats. In the illustrated format one digit is used for instruction and sign information and one is used as a guard digit. The limiting case for a processor is a single computational cell with k - 2 delay cells.

3.6.1 Coalescence of Ring Multiply/Accumulators

As with systolic ring multipliers and accumulators, systolic ring multiply/accumulate processors can also be coalesced. At the schematic level, the diagram is identical to that of a multiplier or accumulator with the systolic cell replaced by a cell capable of both functions, and a control unit which carries out both operations.

3.7 Discussion

A novel concept for the design of systolic ring floating point processing elements has been presented. The implementation of floating point multiplication algorithms in a linear systolic array [Ma8 ] has been extended in this chapter to the de-normalisation of operands, and hence the accumulation of floating point numbers. In addition, the linear array of [Ma8 ] has been replaced with systolic ring architectures for both

82 multiplication, accumulation and combined multiplication and accumulation. Digit- serial techniques have been used to optimise the arithmetic units within the cells, by integrating the multiplication and addition operations of earlier implementations, as well as optimising the critical path.

The processor architectures are characterised by simplicity and regularity, and offer a range of serial/parallel implementation possibilities. They have short critical paths, and a high degree of pipelining. As they are constructed from parallel sets of sequen- tial registers, multiple scan paths are available for functional testing. Complete scan paths can be readily implemented.

The processor family is well suited to optimisation to meet the needs of specific tasks due to the large range of execution speeds and processing element areas which can be achieved. Optimisation of an integrated multiply/accumulate processor is carried out in the next chapter for the task of matrix multiplication'

83 Chapter 4

The MATRISC Processor

The current trends in parallel processor development, documented in chapter 1, show that machine architects are designing massively parallel processor arrays from the fastest available scalar processing chips. This is a consequence of the desire to obtain maximum benefit from high-volume state-of-the-art RISC processors. In some cases, such as the Thinking Machines CM-5 processor, the SUN SPARC node processors are augmented by vector units lZorp92l. Higher level matrix operations are traditionally carried out by software. As detailed earlier it is possible to implement matrix oper- ations very efficiently with systolic arrays. Current VLSI technoiogy now makes it cost effective to use these systolic techniques to provide hardware support for matrix operations

Use of hardware supported matrix operations provides an architecture which:

1. provides a well defined mathematical framework for problem definition and expression;

2. allows serial code to be written which implements a required scalar, vector or matrix algorithm;

3. allows an accepted machine model, the RISC computer, to describe the parallel architecture;

4. utilises current technology to obtain performance improvements which can ex- ceed the capability of scalar architectures by orders of magnitude.

Simply stated, it is now possible to construct a reduced instruction set processor which directly supports matrix operands. The elements of the matrices can be either integer or floating point. The definition of the resultant processor is obtained from RISC processor architectures by the addition of the matrix data type.

84 Matrix algebra provides the mathematical framework both for problem definition and also for expression in a serial form suitable for writing computer code. The use of the systolic array allows far higher computation rates to be achieved with a given memory bandwidth than is possible with scalar or super-scalar processors. The problem of writing code for a parallel processor does not exist for a MATRISC processor, as the parallelism is embedded within the matrix operators.

In this chapter a iattice model of a systolic architecture is used to provide a basis for the specification of the systolic elements in a matrix processor. Also presented is a formal specifi.cation for a systolic ring multiplier and accumuiator, together with a novel matrix address generator to support matrix operands in a RISC computer environment.

4.L A Lattice Model

A lattice model for cellular algorithms v/as proposed by Katona [Kato88]. This model is briefly reviewed and applied to an analysis of the I/O bandwidth of a two-dimensional systolic array which computes matrix products. The results of the analysis show that the system computation rate of such an array is:

1. proportional to the square of the I/O bandwidth of the array;

2. inversely proportional to Lhe speed of the systolic processing elements.

This result is significant as:

1. it provides a mathematical basis for the specification of the systolic elements in computer systems;

2. it allows optimal decisions to be made regarding serial us. parallel implementa- tions.

The application of the result in real implementations can lead to major increases in the computation rates achievable with existing bus architectures.

Cellular algorithms are based upon the interaction of moving data streams with the computational elements of a processor array. Both the processor array and the data streams can be represented in terms of n-dimensional lattices where a lattice is a net of points. Katona provides both informal and formal descriptions of lattice systems and lattice algorithms in his paper. An informal description which follows Katona's is as follows:

85 1. Consider K n-dirnensional lattices Lt,. . . , Lt l

2. initialise some lattice points with data;

3. assign a translation vector u¿ to the lattices {L¿: i - 1,. ..,k};

4. translate each -L¿ with u¿ in each time step;

5. transform data with a local transformation function if they meet at a point of the n-dimensional space;

6. results are present in certain lattice points after some time interval.

Y

X

Z

Figure 4.Iz A schemat'ic diagram of an arbitrary lattice algorithm.

Each lattice Z¿ is translated with a velocity vector u¿ in every discrete time step r. Parameters are assigned to distinct lattice points and are carried by the lattice. When non-empty lattice points coincide, a local transformation is executed upon

86 them. Although Katona considers the velocity vectors as constants, it is necessary for u¡ to be a function of time to model particular aigorithms and architectures. An example of relevance is the matrix product executed on a systolic array of multiply- accumulate elements. The array forms the product in-place before moving the result to memory.

Figure 4.1 is a schematic representation of an arbitrary lattice algorithm,

.L1 is a lattice which is of particular interest. It consists of a set of points which remain stationary until lattices L2 and -t3 have passed through it with velocities u2 and us, and it then moves at velocity ur. All local transformations take place within the points defi.ned by Lt, and so .L1 defines the physical boundary of the hardware required to implement the algorithm.

This models the function of a higher dimensional engagement processor, and is useful for deriving bandwidth requirements. Consider that the parameters which are present in the lattice points are held in a memory subsystem, and are fetched and mapped to the appropriate lattice positions as required. The number of memory reads is determined by the rate at which the lattices .L2 and -t3 move inwards through the boundary of. L1. This boundary is defined by planes which pass through the origin and are orthogonal to the velocity vectors of the moving lattices. Movement of the parameters through -t1 is performed by the hardware. The parameters are discarded as they leave the domain of L1.

When L2 arrd .Ls have completed their interaction within .t1, the velocity vector of L1 becomes non-zero. Result parameters are written back to memory as the -L1 lattice passes through the plane through the origin which is orthogonal to its velocity vector.

The bandwidth requirements of the algorithm are derived using the dimensions of the lattices, and their velocity vectors.

As a particular example consider three two-dimensional rectangular lattices A, B and C propagating in a Cartesian coordinate system where

1. the dimensions of the lattices are (A,, Ao), (B*,, Br), and (C",Cr);

2. the initial positions are respectively (-A",0), (0, -Br) and (0,0); 3. their velocity vectors are defined to be Lro, irt and -þu"U (, - Tr) respectively,

where ?r : n'¿an(o'-l'","ortt'\ is the time for lattices A and B to UUa'ub/

87 complete their interaction with C, I and j arc the unit vectors along the r and y axes respectively andU(r -Tt) is the unit step.

Non-zero memory bandwidths B f.or each lattice are defined by the following:

A, þ¿': Aruof r 0(z( (4.1) ua

l3a:B,u6fr 0(r aBo (4.2) Db

: Cou.f r T11r

?r is the time that the C lattice remains stationary before moving and requiring memory bandwidth.

Y

K+N-1 M

C N N A va* +vc X

L2 L1

K+M-1

L3 M

Figure 4.22 A conformal two-di,mensional matrir product.

These results can be applied to the evaluation of a matrix product using a two- dimensional array of systolic processing elements. In this case the local trans- formation at the coinciding lattice points is an inner product evaluation i.e. ci:a,i-:_xb¿-tlc¿_t.

88 To compute the conformal product CN*u - A¡,txNBxxut with an N x M systolic array, the lattice dimensions for,4., B and C are (I{ +N - 1, N), (M,I{ + M - 7) and (M,-l[) respectively, and uo : u¡. The additional ¡/ - 1 columns in A and M - 7 rows in B allow appropriate skewing of the matrices within the lattices to ensure a correct interaction. 'Empty points' or zeroes fili ihe additional lattice points, but are counted as parameters for the purpose of computing memory bandwidths. Figurc 4.2 illustrates this example.

The bandwidth requirements of each lattice are:

o(r

o(r

t<] Utfi ¡V-=-L where 7, - is the time for lattice interactions to be completed. ( Let uo --'ùb :1 and N : M. The interval 0 r 1 K + ¡/ - 1 defines the period when both A arrd B lattices demand memory bandwidth at a total rate given by

(4.7) þ,4a:0.a.* {)n:2NT

The bandwidth requirements of C in the subsequent time period defined in equation (4.6) can be made equal to this by setting u" :2 i.e.

þc:^2N (4.8) -T

The maximum computation rate -R of the array is given in this interval by the number of active processors (,n/2 ) multiplied by the number of operations per second which are carried out in each processor. As each processor implements both a multiplication and an addition in each time step, the number of operations is two and the computation rate is

R :2Nz (4.e) T

89 Rewriting equation (4.7),

(4.10)

and substituting in equation (4.8) gives

R': þz¿'sr 2

: l3¿.nN (4.11)

This relation states that the computational power of a square two-dimensional systolic processor is proportional to both the square of the memory bus bandwidth, and the computation time of the systolic processing elements (SPEs) from which the array is constructed. Equivalently, it is equal to the product of the array order and the system bandwidth.

The order of the array which provides this computation rate is given by equation (4.8).

For realistic systems in which the order of the problems to be solved typically exceed the achievable order of the hardware, the system architect has two degrees of freedom in designing a processor to meet a specified computation rate. They are

1. the technologically constrained bus bandwidth, and

2. the technologically unconstrained upper execution time of the SPEs'

The upper bound to the computation rate is determined by the problem size. As the probiem size increases, so does the possibie order of the array, and with it the theoretically achievable power of the processor.

4.L.L Implications for Matrix Processors

The -intuitive nature of the execution time parameter r in equation (4.11) is that performance increases as the element speed decreases. The result is readily explained in terms of an example. Consider a system which is constructed from a lrl x lrl two-dimensional systolic array. Suppose the array achieves a computation rate

90 -R at some maximum clock rale r when the loading of the 2.1/ boundary elements demands the complete system memory bandwidth. A halving of the systolic would allow the same memory bandwidth to be used to load the boundary elements of. a 2N x 2N systolic array. Although the systolic elements execute at half the speed of the original array elements, there are four times as many of them, and the net system computation rate is doubled, as described by equation (4.11). Alternatively, a doubling of the bus bandwidth with the original ciock rate maintained leads to a quadrupling of computation rate in a 2N x 2N array as four times as many elements are active.

For a computer system with a systolic processor subsystem, these results lead to a unique possibility: modular expansion of the computational power of the processor by simply increasing the number of systolic processor chips in the system. This expansion is accompanied by a slowing of the system clock which is particularly easy to implement if the elements are self-timed.

The following sections present novel systolic ring computational elements which are designed for low transistor counts. They are capable of processing digit-serial floating point operands at specified computation rates.

4.L.2 Analysis of a Rectangular Systolic Array Processor

Implementation of a rectangular systolic array of elementary inner-product-accumul- ate processors imposes some physical constraints upon the system architect. It is assumed that the following two constraints are realistic:

1. the design is limited by some maximum area of active circuitry, determined by either physical limitations such as thermal dissipation, or cost;

2. llne design is limiied by some maximum memory bandwidth.

It is further assumed that the area constraint is A, i.e. processor active area is less than or equal lo A, and the maximum bandwidth constraint is B.

Each processing element is constructed as a systolic ring characterised by the two parameters of execution time and area, both of which are functions of

r: the number of bits per digit in the number representation used, and

91 n": the number of recurrence cells present in each ring.

Let the area of the processing element Ap" be represented by the function a(r,,n"), and the execution time of the processor Tp. be represented by the function t(r,,n.).

Under the above assumptions, the number of processing elements p2 in the system is expressed as a function of the total active area as

p2 : Ala(r,n.) (4.72) where p is the order of a square systolic array, and

p Af a(r,n.) (4.13)

The number of operands required to drive the array inputs for each wavefront entered into the array is 2p. The time to fetch these operands from memory is, from equation (4.10), given by T-f :2plB. This is an upper bound for the execution time of each processing element. Thus the bandwidth constraint provides an execution time constraint for the processing elements of the form

t(r,n.) < 2plB (4.14)

Hence, B < zplt(r,n.) and substituting equation (4.13) gives

B (4.15)

Consider the execution of a product of square matrices of order .ly' on a systolic array of order p, where l/ ) p.

The number of partitions to be computed in the product is [,n//p]2, where frl rep- resents the least integral value greater than or equal to r. The pipelined time to compute each partition is given by the time for the array to process all the tü/ave- fronts of any given partition. If the processing time for one wavefront is f(r, n"), then the time to compute one partition is -l/ ú(r, n"). h is convenient to assume that all

92 partitions are the same size, in which case the time required to compute the product in an area limited array is

2 ¡\r T¡ob: Nt(r,n") (4.16) p

: (4.17) *rrr,n")a(r,n") assuming that all partitions are the same size, i.e. .lú modp : [.

Under these assumptions the job execution time is minimised by minimising the area-time product of the processing elements.

The following section assumes that a square systolic processor is constructed from the systolic ring floating point multiply-accumulate processing element presented in the previous chapter. A study of the area-time metric of the processor is carried out for a particular GaAs technology and the results are used to optimise performance. It is noted that a multiprocessor constructed from MATRISC nodes could be made to satisfy Savage's bound AT2 : f¿("n) [SaS1] for the multiplication of matrices. This observation was made by Preparata and Vuillemin [PrVu80] when they pointed out that systolic arrays of the form of Kung and Leiserson could be connected in a manner which met Savage's bound.

4.L.3 Optimisation of a Square Systolic Array

A number of GaAs technologies have been studied. They have been characterised in terms of the properties of a limited set of fundamental circuits, typically a , a 2:1 multiplexer, a full adder and a data latch. Typical values for propagation delay and area of a GaAs NOR gate, fabricated in the process of interest, are 150ps and 900pm2 respectively.

Consider an r-bit per digii processing element (PE) for a floating point format con- sisting of rn mantissa digits, e exponent digits and g guard and control digits. Let the PE be constructed from r¿c computation cells, n¿ delay cells and a ring control element.

The number of delay cells can be expressed as

93 nd:rnlr+elr*g-2n" (4.18) and the area Ap" of the processing element is given by

Ap":n"A"tnaAalA.on (4.1e)

where A" is the area of a systolic computation cell A¿ is the area of a delay cell and A"on is the area devoted to PE control.

Circuit element Area (mm2) Gate Delay (¡;s) NOR gate as :0'9 x 10-3 ts :0.!5 x 10-B 2:1 mux (L'rn :3Ag trr:2ts Full adder ao: lTas to:2ts Register ür :7ag t': 5ts Multiplier arnut : r'(oo i oò I (2r - 7)o, | 2ran tmur:(3r*3)tn

Table 4.Lz Area and time metrics for ch,aracteristic ci,rcui,ts in a parti,cular E-D GaAs process.

The systolic cell area A""¡¡ can be approximated by the area of its constituents: seven r-bit registers, two 4-bit registers, three r-bit multiplexers, two r-bit control gates and an r x r pipelined digit-serial multiplier. An additional ten gates have been allowed for instruction decoding and control. No allowance is made either for interconnections or power and clock distribution. It is assumed that modelling the active circuitry provides a realistic guide to a physical implementation. For the particular E-D GaAs process under study, the area and time characteristics of the circuit elements used are given in table 4.1, and result in the following expression:

Ac"rr : 7ra, I 8o, + 3ra,n | 70an + 12(oo I aò -f (2r - 7)o,

: 12 (ao * oò I r(9a, I 3a* + 2aò I 7a, * 70an :Ar2 lBr*C (4.20)

where A: ao I an B:9arI3a**2as

94 and C -7a,lt\an

The area of a delay cell is

Aa : (4r -14)a, (4.21)

Thus the area of a PE can be written as

Ap":ftcAceilln¿Aa*A"on rn*e : n"(Ar2 i Br + C) + +g_2n c (4r*4)a, lA.on T

: TLcAr2 ¡ (n.(B - 8o,) | a,g)r i n"(C - 8o")+ 4a,(m + e) * 4go, + A"on l4(m -f e)a,r-r (4.22) where A"on is the area devoted to the control of the operation of the ring

The speed of a systolic cell 7..¡¡ is assumed to be determined by the setup and hold time of the registers plus the critical path delay of the digit-serial multiplier discussed earlier. The muitiplier critical path contains r + 1 full adders and one AND gate. Letting the gate delay bets, the total multiplier delay is (3r + 3)úe and the register delay is 5ú.n giving a total cell delay of

Tcetr:(3rf8)ún (4.23)

The time T. to circulate an operand through the ring is given by

T":(mfr+elr*9)7""u

:(mfr+elr+sx3r+8)re (4.24)

The number of circulations required for a multiplication is ffi; and for an accumu- lation ffi; + 1. The execution iime 7i for a multiply/accumulate operation is then given by the expression

2rn T.: (_+1 T" (4.2ó) TTt c

95 The PE execution time is given by multiplying the cell execution time by the number of clocks to complete the required multiply/accumulate operation

2rn (mlr -f elr + 9)7..¡ -+1TTlc ) 2rno * e) 1 :(" -"76m(rn + T ( rn+e+ " nc T7' c )).

B(m-r ù *T* 8g + ,,n)r, (4.26)

The area-time product of the PE is can be represented by the polynomial expression

A,p.T" - (oor' I atr I az I as lr)(tsr-t + trr-' -l tz -f tsr) (4.27)

where øs : ftcA

go,) A1 : n"(B - l4a,g

A2 : n"(C - 3ø,") | 4a,(m + e) + 4ga, I A"on

4,3 :4(m I e)a,

ts : 76m(m I e) ln"

t1 :6m(m -f e)ln. * 8(rn + e) * 76mg f n"

t2 :3(m+e) +8st6msfn.

ts -3s

The partial derivative of the PE area with respect to the number of cells n" is

W: Ar2 +(B - 8a,)r tC - 8a, (4.28) and the partial derivative of the PE execution time with respect to n" is

: -r (6m(m+ e) r6ms)r-r -t r6ms) t' (4.2s) # - þ ttu*(* ")r-' r +

96 Substituting these equations into the expression for the partial derivative of the area- time product with respect to n" gives

aAp"T" )Ap.^ , À ôT" ¿e I JL'D? a 0n" i)n.^ - '- dn" : (Ar2 + (B - Ba,)r + C - 8a,)(tsr-2 I ttr-L t tz lúer)- n"2 (asr2 i atr t az I ø3r-1) x

(to*(* + ")r-' -f (6rn(m + e) * 76ms)r-r -l76ms)t, (4.30)

Area-time product for a PE

mm^2-microssc

80 70 60 50 40 30 20 10 0

1:32 cells 1:32 bits per digit

Figure 4.32 Eaaluation of tlt,e area-time prod,uct.

Expanding Ap"T" and partialiy differentiating with respect to r gives

4# : - Laotor-a - 2(a2to -l øsúr)"-t - (or¿o + azh I ast2)r-2 |

aoh I aúz i azts l2(ast2 | afi3)r I Sastsrz (4.31)

97 Partial derivative of AT with respect to number of cells

mm^2-microsec

0 -10 -20 -30 -40 -50 -60 -70 -80

1:32 cells 1:32 bits per digit

U*:'' Figure 4.42 Eualuation of the part'ial d,eriuatiae OTL.

The area-time product over all r and aII n" has been evaluated for a number represen- tation with 64 mantissa bits, 16 exponent bits and two guard or control digiis. The results of this evaluation are shown in figure 4.3 for the domain of interest for both ihe numberof bits per digit 1 ( r ( 32 andthenumberof systolic cells 71n.< mlQr).

Equation (4.30), the partial derivative of the area-time product with respect to n", has been evaluated over the same domain of interest 7 1r ( 32 and 7 1n. < mlQr). Figure 4.4 shows the result of the evaluation for a number representation with 64 mantissa bits, 16 exponent bits and two guard or control digits. As this partial derivative is always less than zero, the area-time product decreases with increasing n" and is minimised by using the maximum allowable number of systolic cells. For the remainder of the analysis it is assumed lhal n. is maximised by letting n": # which requires two circulations of the operands for a multiplication and three for a de-normalisation operation.

Maximising n" allows the area-time product to be expressed as a function of r

AT: (r(mAl2l4a,g) -l0.5rn(B - $a,") ! 4a,(rn + e) + 4go, + A"o,-1l

0.5r-r m(C - 8o,) * 4(m I e)a, ) X

98 (+o(m + ")r-t * 15(rn + e) * 40g + rSsr) t, (4.32)

In the above analysis it has been assumed that all functions are continuous and dif- ferentiable. These assumptions neglect physical restrictions such as the requirement for the number of delay cells to be integrai. These requirements are included in the model by ihe following operations:

elr +- lelrl mf r *- l*lr1 nd <- l(*1, + elr + g -2".)l (4.33)

Evaluation of the resulting expressions for mantissa lengths of 32, 64 and 128 bits, and 16 exponent bits gives results which show that the optimal number of bits per digit is four.

AT product for PE's with the maximum number of systolic cells 2

18

o q) 16 È

ôl l4

12 Jo õo Ê F<

0.8

0.6 2 4 6 810 12 t4 16 1: 16 bits per digit

Figure 4.62 Comparison of the continuous moil,el witl¿ a model constra'ined, to integral cell numbers and d,elays for m:32 anil e:76.

99 AT product for PE's with the maximum number of systolic cells 9

8

7 o Ê 6 oì

5

(.) a 4 È Fr 3

2

5 10 15 20 25 30 l:32 bits per digit

Figure 4.62 A cornparison of model results for m: 64 and e : 76

AT product for PE's with the maximum number of systolic cells 22

20

18

0) !r 16

cl l4

o 12 = úo À 10 F

8

6

4 5 l0 15 20 25 30 1:32 bits per digit

Figure 4.72 A con'Lpa,r'ison of mod,el resulfu for m : 128 o,nd e: 16.

Local minima in the A? product are associated primarily with the condition mlr - l*lrl. For mantissa lengths which differ from the sample set {32, 64,728}, the optimal number of bits per digit may differ from four. In the case of a 50 bii mantissa and a 15 bit exponent, the minimum rv¡ras found to be 5, and for a 54 bit mantissa and 15 bit exponent the minimum was found to be 3. These results are to be expected from the comparatively broad minimum present in the continuous model, but it is noted that the penalty for processing imperfectly matched operands

100 is usually not severe. The continuous model represents a lower bound for the AT performance metric of the processor.

4.L.4 Bandwidth Considerations

In this section the bandwidth constraint of equation (4.14) is considered, with a number of graphical examples for illustration. The constraint is B ( 2plt(r,n") which can be rewritten as

2plB -t(r,r") t 0

While 2plB -t(r,n") is greater than zero, the constraint is met and the processor bandwidth is fully utilised. When the expression equals zero, the processor and mem- ory subsystem requirements are matched, and when the expression becomes negative, the bandwidth is no longer fully uiilised. It is apparent that in the latter case an increase in the order of the array p carr return the bandwidth to full utilisation. To illustrate this behaviour, three cases are presented, each having an available memory bandwidth B of 108 operands/s. The first is a processor with an area of 106 active gates.

Bandwidth constraint 2plB - t(r,n) for 10^6 gates

mlcrosecs

0

-5

-10

-15

-20

l:32 cells 1:32 bits per digit

Figure 4.82 The bandwidth constraint for an acti,ue area of 106 gates

101 It can be seen from the above figure tlnaf 2plB - t(r,n") is negative for all r and n. Hence the available bandwidth is under-utilised for all PE architectures.

Bandwidth constraint 2plB - t(r,n) for 10^7 gates

mi&osecs

0

-5

-10

-15

-20

1:32 cells 1:32 bits per digit

Figure 4.92 The band,width constraint for an uctiue area of 707 gates.

Bandwidth constraint 2plB - t(r,n) for 10^8 gates

mlcrosecs

0

-5

-10

-15

1:32 cells 1:32 bits per digit

Figure 4.10: The bandw'idth constrai,nt for an act'iue area of 708 gates.

In the case where the area is increased to 107 gates, a significant set of possibilities exist where 2plB - t(r,r") t O. These possibilities are indicated in the figure by

102 those points on the surface which lie above the zero plane. PEs with r and n in this set meet the bandwidth constraint. The set includes the optimal solution of a processor with a minimum area-time metric.

In the final case shown in which the area is further increased to 108 gates, nearly all PE architectures meet the bandwidth constraint, except for those constructed from a few cells of one or two bits per digit.

4.2 A Generalised Matrix Address Generator

This section discusses a novel matrix operand addressing technique for a MATRISC processor. To allow processors to approach their peak performance, particularly in processors designed for signal processing applications, address generators have been designed which automatically generate operand addresses for certain algorithms. A common example is the bit-reversed addressing for the FFT.

Common DSP addressing patterns include:

1. Sequential; 2. Inverted;

3. Reflected; 4. Bit-reversed;

5. Perfect shuffied (Interleaved);

6. Multiple shuffied;

7. Parallel shuffied.

These patterns are in common usage as a result of the vector nature of many computer architectures. Zobel [2o88] provides a description of hardware which implements these addressing patterns. Another paper which describes a versatile hardware ad- dress indexing unit is [Nw85]. A further paper which in part describes a generai purpose address generation technique is [HaRo89].

In all of these papers the address generation techniques are designed to optimise vector-based algorithms. Matrices and matrix algorithms are supported as operations upon sets of vectors. These approaches have limitations when matrix algorithms are implemented which can not readily be expressed in terms of sets of vectors. An example of such an algorithm is the one-dimensional Fourier transform implemented with the prime factor algorithm.

103 A processor designed to operate upon matrix operands differs substantially in iis address generation from conventional RISC processors. This is particularly true when multi-dimensional matrix operands are being processed, and it is of even greater importance if mapped operands are required. To maintain processing effi.ciency, any mappings which need to be applied to operands should be applied 'on-the-fly' to minimise the number of times that data must be read from and written to memory. A simple example of a mapping is the generation of the matrix product R: AAr, where Ar is the transpose of ,4. The transposition operation can be regarded as a 1:1 mapping, and an appropriate address generator will fetch both operands directly.

In contrast to other approaches such as that of Petkov [PeSS] discussed earlier, no systolic transposition circuitry is required. In addition, if ihe design of the address generator is done correctly, additional mappings can be superimposed on simple mappings such as transposition. In this case, it is possible to specify matrix operands which are, for example, prime factor mapped as well as transposed. Benefits from these attributes are that transforms can be implemented with as few accesses to the data vectors as possible, with an associated execution time speedup.

Examples of work which has been carried out for matrix address generators which operate on parallel or interieaved memories are [Ko83] and [LaVo77]. A simple'corner turning'generator is described in [C284] and a more general solution is given in [Ta89].

In [DeSi89] and [DeSi9O] a four-dimensional generator is described and in [ToAr94] an .l/-dimensional generator is presented. None of these generators have the capability to implement the number theory mappings which are possible with the generator described in this section. The paper by Wong et al. [WoCh9O] provides both software and hardware techniques to perform prime factor mappings, but the evaluation of addresses is carried out with the aid of ROM lookup tables in place of the modulo operation. A paper by Ltn et al. [LuSi91] describes the technique of performing an in-order, in-place DFT using ihe complementary Chinese Remainder Theorem and Alternate Integer representation maps as stated by [ElOr79]. The technique for address generation is via permutation of row addresses, rather than the explicit generation of the addresses with 'modulo' addressing.

4.2.L The Address Generator Architecture

A conventional approach to the problem of addressing matrices stored in a linear

104 memory space is to consider addressing the elements of the matrix in terms oL stri,des. The strides specify the linear distance between successive elements of rows or columns respectively. The problem with this approach is that it is not possible to both fetch matrix operands and simultaneously apply general number theory mappings. The mappings must be applied to the matrices as separate operations. These operations must be done in software in a conventional machine and incur significant time penal- ties.

@ ttr@ FEÐ r N1 N2 RIdI 1',s Muxl MuxO @ EI EI

Cin Adder Dec Dec 'l' fl zfl Adder

Signbit Mux2 run 4fI fl, s0 sI s,l s0 '0' rtdl Adder PLA rld2 Cin

Address output

Figure 4.Llz Schemati,c of a matrir address generator

As conventional matrix storage schemes can be considered as simple mappings be- tween one-dimensional and multi-dimensional subspaces, address generation for ma- trices can be performed elegantly by constructing a general number theory mapping generator. The generator must provide a general capability to support mappings between one-dimensional and many-dimensional subspaces. The approach taken in this study is to add to the conventional address generation logic found in RISC or CISC processors additional logic which implements general number theory mappings. This not only performs the addressing of the conventional matrix operands, but also supports linear transform algorithms.

105 The remaining sections describe the architecture of an appropriate address genera- tor for l/-dimensional matrices. An example implementation is given for the two- dimensional case. The generator has the capability to store at least five descriptors which represent the matrix structure. At least two descriptors are required to define two of the dimensions of an n-dimensional matrix and at least two descriptors are used to store the values of two of the strides of the matrix structure. A sequential n-dimensional finite difference engine is defined which takes as its input the ma- trix descriptors and generates from them sequences of addresses which are used to fetch elements of the required matrix structure in an order determined by the para- metric description. The underlying n-dimensional matrix description is reflected in the software description of the data structure which makes it a simple task for the applications programmer to operate on the data structure.

Figure 4.11 depicts a schematic representation of a number theory mapping address generator. Mappings which have been considered include the Chinese Remainder Theorem mapping and the Alternate Integer Representation [E1Or79], [8185] and others.

From one dimension n to two dimensions (nr, .n2), the mappings are

The Chinese Remainder Theorem

T¿ nzNt(¡rit ) r, (4.34) (rrNr(¡r;t )r. f ) ¡¡ where l/r and N2 are mutually prime, ¡Ë¡/, : I[ and (ø)r,'(ø-1)¡¡ : 1.

The notation (ø)l,r means that ¿ is evaluated modulo N

The Alternate Integer Representation

y1 : (n1N2 + n2¡l1)N (4.35) where l/1 and N2 are mutualiy prime and lftJt/r: ¡¿

The simple mapping used conventionally to store matrices in linear memory

n: ntNz I nz (4.36)

Examination of equations (4.34) to (a.36) shows that each can be implemented with a second order difference engine implemented with a modulo arithmetic capability,

106 provided that the constants are chosen appropriately. Consider the following expres- slon

n : base-address I (nrAl + n2L2 (4.37) q

This maps an element n of. an arbitrary matrix [A] stored in a linear address space starting at base-a,ild,ress onto the (n1, n2) element of a two-dimensional address space. The modulo operation denoted bV (.)o normally requires a division. However, by per- forming a conditional subtraction of g during each calculation the modulo arithmetic can be performed without the complexity of multiplication or division.

Mapping A1 L2 q Chinese Remainder Theorem l/z (lÇ1),.r, r\I, (tüit)¡,', ¡r1¡ú' Alternate Index Representation N2 ¡r1 ¡L/ú, Simple(1) Nz ¡L mar-int Simple(2) N2 1 mar-'int Simple(3) 1 ¡L m,o,r-int Simple(4) 1 I{ mar-int

wltere maxjnt is tlte marimum integer of the number repre- sentati,on used. This remoues the use of rnodulo arithrnetic.

Table 4.22 Values for L,7, L2 ønd q which implement three different mappings from a one-dimensiona,l to a two-dimensional space.

To address sequentially all elements of the matrix [A] in some order determined by the constants 41, A2 and Q,,n1 andTL2 aÍe indexed through their respective ranges (the dimension of the matrix). Some of the choices for these constants are given in table 4.2, and the associated mapping is identified. It must be noted that the mappings are not restricted to one-to-one for some parameters.

Some matrix types for which this addressing technique provides access include:

1. Dense matrices; 2. Diagonai matrices; 3. Circulant matrices (".g.the identity matrix);

4. Constant matrices; 5. Number Theory mapped matrices.

707 4.2.2 An Implementation

The address generator assumes a two-phase implementation using a PLA to generate the necessary control signals.

s1 s0 run 217 zf2 rldl ùd2 cln S1 SO 0 0 0 X X 0 0 0 0 0 0 0 1 x x 0 0 1 0 1 0 1 x x x 0 0 0 1 0 1 0 X 0 0 0 0 0 1 0 1 0 x 1 0 1 0 0 1 0 1 0 x 1 1 1 1 0 0 0

Table 4.32 State table for tl¿e add,ress generator PLA.

The state table for the address generator is shown in table 4.3. Variables are: s0, s1: state address inputs

,S0,51: state address outputs run: start flag zfTrzf2: zero flags from decrementers rld7,rld2: reload signals for the decrementers cin: carry input to first adder

Equations which describe the outputs are:

,S0 : cin: s}.sl.run (4.38)

51 : so.l +Ã.s7.71f .Zzî +l0.sl.zrf .V2f (4.3e)

rld! :10.s7.2If .z2f -f s().s1.21 f .t2f (4.40) rld2 : s\.sl.zlf .z2f (4.41)

108 4.2.3 Area and Time Consideratrons

It is assumed that in the schematic of fi.gure 4.11, all computations are done in a two-phase clocked system, with the address calculation being performed with com- binatorial logic. Further, it is assumed that the circuits use the simplest possible ripple-carry addition circuits. Using these assumptions there are estimated to be in excess of 5000 transistors in the multiplexer/inversion/addition data-path for a 32-bit implementation, and less than 10000 transistors in the complete generator.

To achieve maximum performance the address generator can be pipelined, and it is estimated that with existing GaAs processes it is reasonable to expect addresses to be generated at a rate that exceeds 500 million per second.

The use of number theory mapping hardware implemented as a difference engine provides access to both normal and transposed matrix structures, including constant matrices stored as a single scalar. Circulant matrices are generated from a single row. The generality of the approach makes possible the use of prime factor mappings of dense matrices without time penalty. The prime factor mappings are used to optimise performance when executing algorithms such as convolutions, correlations and a large number of linear transforms. Examples of these transforms include the Fourier transform, the Chirp-z transform and the Hartley transform.

The technique is elegant as the number theory mappings which were originally addi- tional operations are implemented without time penalty. In addition, as the conven- tional storage schemes for matrices appear as a subset of the mapping capability of the generator, the need for conventional address generation hardware is removed.

4.2.4 Matrix Addressing Examples

Arguments which are required by the generator are the set {base, deltal, delta2, n1, n2, qj. The following examples consider 3 x 5, 5 x 3 and 5 x 5 matrices, and show the address sequences which are used for normal, transposed, prime factor mapped and circulant matrices. The examples shown are obtained by executing a functional simulator.

109 Norrnal Form

A 3 x 5 matrix stored in row order requires a simple linear sequence

Enter base,delta1,delta2,n1,n2,q 0115315 01234 56789 1011L27374

A non-zero base address simply offsets the address sequence, e.g'

Enter base,deltal, de1ta2,n1,n2, q 100115315

100 101 102 103 104 105 106 107 108 109 110 111 tL2 7L3 tt4

TÞansposed Form

Transposition of the above matrix requires the following arguments

Enter base,delta1,delta2,n1,n2,q 05-93515 0510 1611 2772 3813 49t4

This is a multiple-shuffie of the sequential addresses

Prime factor mappings

A prime factor mapping of the matrix is given by the following

Enter base, de1tal, de1ta2,n1,n2,q 0385315 036912 581t142 1013147

Tlansposed prime factor mappings

Transposition of the above mapped matrix is obtained by:

Enter base, deltal,de1ta2,n1,n2,q 0583515 0510 3813 6117 9744 1227

Circulant rnatrices

For matrices which are circulant, major savings in both storage and generation time are possible by computing only the first row, and generating the required matrix from

110 this one row. As an example the generator is used to generate a 3 x 5 matrix from a single 5-element a;rray.

Enter base, delta1,de1ta2,n1,n2,q 010535 07234 40123 340L2

This is the technique used to generate the identity matrix .I from a single row with a one in the first element position followed by l/ - 1 zeroes (for an order I/ matrix).

A skew-circulant matrix is generated similarly:

Enter base, deltal, de1ta2,n1,n2,q 072535 01234 72340 2340L

Constant Matrices

Where an algorithm calls for the multiplication of a matrix by a scalar constant, e.g. C : aA, it is readily implemented by a Hadamard multiplication of the matrix by an identicaily dimensioned matrix whose elements are the desired constant. This is readily achieved by generating a single scalar and then choosing parameters for the mapping hardware which construct the constant matrix from the one scalar, 'i.e. for a scalar at address 0:

Enter base, delta1,de1ta2,n1,n2, q 000531 00000 00000 00000

Sub-matrix generation

Sub-matrices are extracted from an arbitrary matrix with appropriate offsets, e.g. the 2 x 2 sub-matrix of the matrix in the first example, starting at elemertt a2,2.

Enter base, deltal, delta2,n1,n2, q 6742215 67 t7t2

Skewed sub-matrices are extracted similarly.

111 4.2.5 Address Generation in Higher Dimensions

An arbitrary, conformal matrix product C : AB, of dimension P x Q bV Q x-rB is represented by the following:

C 11 ct2 Ct R A 1 C2l c22 C2 R A2 x [Bt 3z B^l (4.42) : CPI CP2 CPR AP

The matrix A has been partitioned i"t" [#l horizontal partitions where u., is the width of the processor array, and the matrix B has been partitioned i"t. [#l vertical partitions. There is no loss of generality in assuming that the processor array is square.

The computation of each partition C'r of. the resuit matrix C is implemented with the following algorithm:

Cii : I (4.43) r "l(¡r")t where ai is the rrh column f the row matrix At and bf is the rth row of the column matrix .B,.

In the general case, the matrix dimensions are not integral multiples of the processor size, and the matrix C car' be considered to consist of four regions shown below in terms of subscripted partitions.

co" cn, g!@-t) clo^ c?' C?, gl@-t) c?no

g(P-t)r gf -r)z ce-r)(R-r) c!!-'¡n cT,| cï; Clln-t¡ I cf"

The two dimensional difference engine presented in [Marw90], [MaClClCug2] and [MaLi90] implements the addressing for all of the partitions, taken one at a time.

772 However, the cost of handling each partition separately can be significant if the accesses to set up the generator are not implemented efficientiy. A better solution is to use a higher dimensional difference engine. Considering the first region which consists of a two dimensional array of matrix products, a four-dimensional engine is capable of correctly sequencing through the required addresses. In the case of regions two and three, a three-dimensional engine suffi.ces, and for the single fourth partition, it is only necessary to use a two-dimensional engine.

A C++ matrix class which has been defined for use with the matrix address generator is given in the following listing. typedef float precision; Zero} enum DataType {Real, Imaginary, Complex,, 1 typedef float precision; class MATRIX { int SubMatrix; DataType typ"; ll P.ieal or complex matrix int init; I I inírial offset int n1; I I "".of rows int n2; I I "".of cols int d1; I I row element spacing int d2; I I lasf row element to first next-row element int modulo; // modulo int negate; int conjugate; * precision re; I I rcal data structure * precision im; f f irnaginary data structure public: MATRIX(int, int, DataType);

MATRIX(int, DataType) ; MATRIX(int, int); MATRIX(MATRIX&, int, int, int, int); MATRIX(int, int, precision*); MATRIX(int, int, precision *, precision *); MATRIXQ; virtual -MATRIX0; virtual void operator!0;

friend void muI(MATRIX&,MATRIX&,MATRIX&) ;

friend voi d emmul3 ( MATRIX&,MATRIX&, M ATRIX& ) ;

friend void emmul(MATRIX&,MATRIX&,MATRIX& ) ;

friend void nadd(MATRIX&,MATRIX&,MATRIX&) ;

113 friend void maddS ( MATRIX&, MATRIX&, MATRIX& ) ;

friend void msub(MATRIX&,MATRIX&,MATRIX&) ; DataType& Type0 { return typ";} void map(int,int,int); void unmap0; int& submatrix0 { return SubMatrix; } int& offset0 { return init; } int& rows0 { return n1; } int& colsQ { return n2; } int& deltal0 { return d1; } int& delta20 { return d2; } int& modq0 { return modulo; } int& neg0 { return negate; } ini& conj0 { return conjugate; } precision& velr(int j) { return t"b]; } precision& veli(int j) { return i-b]; } precision& elr(int, int); precision& eli(int, int); * precision vectr0 { return re; } * precision vecti0 { return i-; } * void setvectr(precision r) { re: r; } * void setvecti(precision i) { im : i; } void desc0; void Print0; void VPrint0;

MATRIX& operatorf (MATRIX&) ;

MATRIX& operator- (MATRIX&) ;

MATRIX& operator* (MATRIX&) ; )

4.3 The MATRISC Architecture

Addr

Main Instruction \ RISC/ Data memory cache Instr clsc Data cache CPU

System Bus

Figure 4.L22 Schematic of a mod,ifi,ed Haruard, archi,tecture RISC/CßC processor

TT4 Addr -.- Main Instruction RISC/ Data memory cache InsE clsc Data cache CPU \-

System Bus

XData Y Data cache cache o u oÉ O Data Data / Addr Addr

CÉ CË Matrix Matrix ¡ó Þ Address Address E! gen & IÆ gen & IÆ

o E oÉ U

ystolic MaEix Array Address gen & IÆ

Figure 4.132 Schematic of a modi,f,ed Haruard arclti,tecture matrir RISC/USC pro- cessor. Th,e matri,r capabi,Ii,ties are ad,d,ed i,n the form of a co-processor.

A conventional RISC or CISC processor which uses a modified Harvard architecture is shown in figure 4.12. It uses separate caches for instructions and data. This is the architecture which has been chosen for matrix enhancement to produce a MA- TRISC model for simulation and implementation studies. The addition of hardware support for matrix primitives to the architecture of figure 4.12 produces an enhanced architecture shown in figure 4.13.

Each element of the processor array is constructed from a systolic ring floating point arithmetic unit described earlier. Simulation code which has been used to predict the performance of this processor assumes either a single cache for both X and Y matrix operands, or a partitioned cache. The cache is assumed to be direct-mapped. It is also possible to simulate a system without a cache. Both write-through and write-back caching policies have been simulated. Where performance predictions are

115 presented, the caching policy which has been assumed is stated

Mainmemory

System Bus

XData YData Instruction cache cache cache flEv Maftix Matrix Matrix RISC/ Address Address Address CISC gen &W gen &IÆ gen & I/F CPU

stolic Array

Figure 4.142 Schematic of an integrated matrir LISC/USC processor

If the matrix co-processor architecture of figure 4.13 is integrated into the scalar or as in figure 4.74, cache coherency problems can be resolved.

The serial nature of the systolic ring arithmetic units, whether at the bit or digit level, leads naturally to serial data paths in the systolic processor array which are the width of the digits. This allows successive processing elements to be digit-skewed rather than word-skewed as is common in the literature. In this form the array approaches a broadcast array where all elements start processing simultaneously. As a consequence, the I/O architecture of each processor element consists of two orthogonal data transmission paths for the X and Y operands, each consisting of a single one-digit delay/storage cell, and a full length output register. Data is input

116 to the array as a sequence of 2-tuples {instruction,d,ata}. At the completion of a computation, an instruction is entered which callses the unloading of the results into the output registers.

b',s b'+ b",¡ bn1. bn2

J

AOn ... AOZ AO2 ø01 ø00 aln ... An AtZ üÍ AtO --+ d2n . . . a2S CL22 AZI (IZO AJn ... ASS ASZ øgt ûgO

Figure 4.L52 A sclt,ematic of tlt,e entry of operand, wauefronts to a processor arrüy.

The digit-serial data is digit-skewed on entry to adjacent processing elements on the array boundary. This skew is preserved between adjacent elements within the array by passing the data through the single-digit delay stage in each processing element before re-transmitting it to the next PE. The use of serial data both minimises the I/O pin count at the array boundary and allows adjacent processing elements to both commence and conclude their computations with a time differential of only one digit period. Advantages of the skewing approach over a broadcast architecture include the minimisation of job time, the removal of the need to drive long buses with large buffers, and as a consequence, the capability for arbitrary expansion. The computation time is minimised for both a single job and a job stream. For most applications it is reasonable to consider the operations in each processing element to be occurring in parallel.

Figure 4.15 provides a schematic representation of the entry of digit-skewed wave- fronts to the alrra.y. Digit-skewing is indicated by the small offset between adjacent rows of ,4. and columns of B. Each X data operand consists of an instruction followed by a single floating point number in the format discussed previously. The instruction is loaded and decoded, followed by the serial loading of the operand. The internal

t77 format has both extended precision and extended dynamic range when compared with the IEEE standard.

Array I/O is performed with double buffered serial/parallel shift registers. The ad- dress generators ioad one set of registers from the memory subsystem while the other set is either driving the data serially into the array, or reading data serially from the array. To miminise chip area, the registers are clock-enabled sequentially, avoiding the need for physical delays to achieve digit-skewing.

4.4 Discussiori

GaAs Technology and PE Architecture Synergy

Specific characteristics of GaAs circuits which are of interest when designing pro- cessing elements are the very fast switching speed and comparatively poor drive capability. To take advantage of the high speed logic attributes, a novel systolic ring architecture has been developed which can implement the floating point arithmetic required in each element of a systolic processor. The systolic ring has been shown to implement both digit-serial multiplication and accumulation. As the processing element is constructed as a ring, communication is always to the nearest neighbour, so minimising the need for long communication paths. The length of paths internal to each cell of the ring are short due to the digit-serial nature of the arithmetic. The ring processor has a register set which is re-organised from one bit per digit to r bits per digit. This re-organisation incurs no area penalty, as the same amount of register storage is required. The benefits of the rearrangement are a reduced cell control overhead. As there are fewer cells to control, a higher computation rate is achieved with increasing r.

Although not studied in the current work, there is substantial scope for improvements in cell performance. As the number of bits per digit increases, faster algorithms can be used in the digit-serial multiply/accumuiate cell, and cell execution time wili decrease. Higher levels of pipelining are also possible to improve the time performance of the ring. In addition, the recurrences applied within the cells can be rewritten to allow pairs of cells to coalesce. These architectural options are to be explored in detail in a future research programme.

118 Optimisation

An analysis of the matrix multiplication problem for fixed array sizes under area and bandwidth constraints led to the result that processor speed was optimised by minimising the area-time product of the arithmetic units. The architecture of the PEs allows this optimisation to be carried out with ease. The PEs also have unique attributes for processing variable format numbers, as well as allowing adjacent PEs to coalesce when longer operands are to be processed.

In addition to their functional utility the processing element architecture, which is based upon serial register chains, is inherently well suited to scan path testing.

The MATRISC Processor

It has been proposed that the attributes of systolic processor arrays can, when in- tegrated into fast scalar RISC machines, offer numerical performance which is un- surpassed by any alternative architecture. To support the capabilities of the systolic architecture, a new matrix address generator has been developed which extends the domain of application of the systolic processor beyond that which is achievable with conventional processor implementations. To demonstrate the attributes of the soft- ware and hardware, the following chapter presents a number of algorithms which can be implemented on an MATRISC processor. The performance of the processor is determined both by simulation and experimentally.

119 Chapter 5

MATRISC Matrix Algorithms

This chapter summarises the matrix operations which are possible with the MATRISC architecture of the previous chapters. The matrix primitives are ele- ments of the set {multiplication, addition, subtraction, Schur or Hadama,rd, mul- tipli,cati,on, transposition/permutationj. The multiplication primitive is the O("t) operation which provides major performance gains for the N4ATRISC concept. The prime justification for including the remaining element-wise operations of the set is that they are fully supported by the matrix addressing capabilities of the address generators. As an example, if it were necessary to perform matrix addition with the RISC or CISC host, substantial addressing overheads would be incurred by the host if the matrices were mapped. As it is a simple matter to support the element-wise operations directly by the matrix hardware, they are included to improve execution efficiency.

The following sections discuss the direct implementation of the multiplication prim- itive in a machine with a simple . It has not been possible in the scope of this work to provide fuily optimised models or implementations. As a result, simple models are used, and performance predictions should in general be conserva- tive. Optimised matrix product algorithms, such as Strassen's [GoVa89], have not been studied during this research programme.

120 5.1- Matrix Primitives

5.1.1 Matrix Multiplication

As discussed in the previous chapter, the conformal matrix product C : AB, of. dimension P x Q Iw Q x -R is represented by the foliowing:

Ctt Cn Ctn A1 Czt Czz Czn A2 x (Bt B2 na) (5.1) : Cpt Cpz Cpn Ap

The matrix ,4 has been partitioned into ffi] horizontal partitions where t¿r is the width of the processor array, and the matrix B has been partitioned into ffl vertical partitions. There is no loss of generality in assuming that the processor array is square.

The computation of each partition C¡¡ of. the result matrix C is implemented with the following algorithm which accumulates outer products:

C¿j : I (5.2) r ""n(Uf), where a,,¿ is lhe rth column of the row matrix A¿ and (UI)¡ is the rth row of the column matrix B¡.

In the general case, the matrix dimensions are not integral multiples of the processor size, and the matrix C car- be considered to consist of four major regions shown below in terms of superscripted partitions.

Cit ci, ci Cå'^

C z C 't, ci'r-rro clr-r¡, ( P -r\2 lP-1)l.R-1) I rl z't't 't,u c'Éi c'Ë:, v p(n-t) CPR

This partitioning is implemented by the MATRISC system software

727 5.L.2 Element-wise Operations

Matrix addition, subtraction and element-wise multiplication, referred to as Schur or Hadamard multiplication, is performed by the MATRISC processor by partitioning the problem in a similar way to matrix multiplication.

5.1.3 Matrix TYansposition/Permutation

Transposition, permutation and mappings are implemented on-the-fly by the address generation hardware of the MATRISC processor.

5.2 Matrix Algorithms

Chapter 2 presented a survey of various algorithms which have been implemented on a variety of systolic arrays. Summarising, the matrix product was implemented on Kung's linear and hexagonal arrays and the mesh-connected array of Whitehouse and Speiser. The linear array was used for convolution, correlation and FIR filtering. The Fourier transform was implemented with a linear array of modified systolic elements which evaluated polynomials. Matrix transposition was implemented with a systolic array of switch elements. QR factorisation was implemented with a triangular array with two different types of systolic element. The remainder of this section, and the next chapter, present the implementation of these algorithms on the MATRISC processor. As these algorithms are implemented with matrix algebra, the MATRISC processor is not limited to this range.

5.2.L FIR Filtering, Convolution and Correlation

As discussed in chapter 2, lhe finite impulse response filter can be expressed as a Toeplitz matrix product. If an FIR filter of length p has weights represented by the sequence {ho,,hr,. . . ,hp-t} and the input data is represented by the sequence z(n), the filter output sequence g(n) is given by the expression

722 hp-t hz hr hs 1) 0 Uo

hp-t h2 h1 ho 0 r 1 At hp-t h2 hy hs r 2 Uz (5.3) 0 hp-t h2 hr hs T Us

This is identical in form to the convolution and is the form which is implemented with 50% efficiency on Kung's linear a ray.

Following the presentation in [ClMa9O], the mapping

n:n2p+u 0 ( n1 < /ú+ p-7 andn2:0,7,2, (5.4)

where -ly' is some constant, redundantly maps the infinite input sequence r(n) to a matrix X(rt,,n2)oforder(¡/+p)x-. If theinfiniteToeplitzmatrixff isreplaced by ihe top left partition H1 of. dimension p x (¡/ * p), the product

Y:HtX (ô.Ð,)

is an .ly' x oo two-dimensionai mapping of the desired linear convolution. Expanding the matrix notation in terms of the elements,

hp-t hz hy hs hp-t hz h1 hs 0 X 0 hp-t h2 ht hs hp-t h2 h1 hs

íxo ítN (x2N Ao UN UzN ï1 rN+1 rzN+r Ut YN+1 UzN+t ï2 :TN+2 r2N+2 Uz YN+2 az¡v+z (5.6) ixg i¿¡/+s :L2N +3 As YN+g Y2N+J

13N+p-1 r N +p-L .xzN+p-t U N -t AzN -t 9s¡'r-r

This infinite product can be written in a block matrix form

H (xo X1 Xz Xs ..):(yo I Yt Yz Ys ( 5.7)

723 where the X¿ are of order (¡rI +e) x .l/ and the Y are of order l/ x I/. .|y' is chosen to be equal to the order of the systolic array. As an example for a 3 x 3 systolic array and an order 7 filter, the first two matrix partitions X6 and X1 would contain the following vector elements:

ïg 13 fr6 ïg ït2 rts ï1 14 fr7 rto rts Trc 12 15 Ig rtt ltt+ ittz :X3 16 r9 :xtz ltts itta 14 ï7 ïto rtg rt6 ïtg X0 X1 15 rg :xtt ïr4 rt7 :xzo fr6 19 (ttz rts ïLg rzt :x,7 íItO ([tg rta ixtg rzz ïg :XtI ixt+ ittT {tzo íxzg Ig ittZ rrc ixts rzJ. rz+ and the first two output matrix partitions wouid contain the elements

Us Eo\ Us An Uts Yt Ato Aß Urc 'r: îr) Utt Au Un

The output is not in row major order. The Y¿ sub-matrices can be transposed during writing to generate the correct sequencef.or Y, but a better solution is to rewrite the expression in a form which naturally supports data storage in row major order:

XT HT :YT (5.8)

In partitioned form,

XT Yd

T X1 Y{

XI HT (5.e) "{ XT Yr'^

where

724 !r'1 12 13 ï4 I5 r6 I7 ïg I9 xl r4 rg 16 î7 ïg rg {tto rtt rtz (5.10) ít7 :IB 19 rto rtt rtz rts rt4 rts

r9 :xto rlt (ttz rts ixt+ rts ïta r1,7 :Lt8 xl ixtz ixts ït+ frts :xLa ixtt rß rrg rzo rzt (5.11) :xts rts ittT ïrg rtg rzo ixzt rzz (tzs îz+

hs00 hrhs 0 h2 h1 ho hs hz h1 ha h3 h2 h5 h+ hs HT: (5.12) h6 h5 h+ hz ha h5 h6 hz h6 hs hs ht 0hsh¿ oohg

Yd :Qi r:i) (5.1 3)

Us Uto Utt YÏ :( Utz Uts Au (5.14) Uts Arc Un which are in row major order as required.

The following code implements this FIR filtering algorithm on the MATRISC simu- lator. Omitted from the presented code is the definition of the data matrix X which contains the data to be filtered, and the Y matrix, which is the matrix into which the results are to be written. The sub-matrices X¿ and Y; in the previous discussion are represented in the code by X S and Y S. The increasing indices i are implemented by increasing the offset with which ihe sub-matrices are positioned in the respective data vectors associated with the X and Y parent matrices.

725 f f Generate a row matrix to contain the circulant matrix data

MATRIX H( 1,N*n- 1,Real) ; // Assign filter weights to H from a weight vector h[i] for (indx:0;indx

I I Set the number of rows to the systolic anay size H.rows0: N;

I I Set the differences and moduius to make H circulant H.delta10 : 1; H.delta20 : 0; H.modq0 : N*p-l; f f Transpose it !H; f f Generate the X submatrix and map it

MATRIX XS(X,0,0,N,N+p- 1 ) ; XS.deltal0 : 1; XS.delta20:1-P; f f Generute the Y submatrix MATRIX YS(Y,0,0,N,N); YS.deltal0 : YS.delta20 : t;

I I Otrsef the result by the delay of the filter YS.offset0 : Pl2;

I I Set the number of output points per matrix multiply int points : N*N; I I D" the filtering for (int index:0 ;index < (dat alen - p -points) ;indexa :points) { mul(XS,H,YS); XS.offset0 +: points; YS.offset0 +: points; Ì

Ì Figure 6.Lz MATRISC code to implement FIR filters

The ease with which the matrix data structure is generated and manipulated is well illustrated in this example, as is the close relationship between the application software and the hardware.

726 6.2.2 QR Factorisation

The least squares solution of over-determined systems of equations, i.e. the minimi- sation of llAn -Ullz where A e ft*x" with rn ) n and ó € ft-, is reliably done

[GoVaSg] using the reduction of A lo a canonical form via orthogonal transforma- tions. In this section, Householder reflections are used in the computation of the factorisatiot A: Q-R where Q is orthogonal and -R is upper triangular. A Vector Algorithm

A Householder reflection, or Householder matrix, is defined by an n x n matrix P of the form

P:I-2uuTfuTu where the vector u is called a Householder vector. Householder matrices are both symmetric and orthogonal. Given any vector 0 * " € W', and requiring Pr to be a multiple of e¡, the first column vector of In,,

2uuT 2uT n f.L-D^- (' uTu ,uLl)

It can be shown [GoVa89] that u is simply determined from r by setting

,u: t + ll"ll ê1 2 which gives Pr:(r-r#),

: +ll"ll €1 2

When applying a Householder reflection P - I - 2uur f uTu to a matrix A,

/ uu?\ PA: a: A+uwr l, -r^)

727 where ry: BAru and B: -2luru

To minimise the error in the computation of. B, u is defined to be

,u:tf sign(r1)llrll 2 "t

Using the notation of [GoVa89], a single Householder update of a matrix involves the computation of

1. the Euclidean norm of the column to be reflected, p:ll*ll : t/*r, ; 2

2. lhe scaling factor þ: r(7) -l sign(r(1))p;

3. u(2: n): r(2: n)lB, and u(1) : 1'

4. the scaled matrix vector product s : BAru;

5. the outer product matrix update A: A I ,wT

Steps 4 and 5 involve a matrix vector product followed by a matrix accumulation of an outer product.

A Block Algorithm

Golub and Van Loan, section 5.2.2 [GoVa89] discuss the implementation of a block algorithm which has similar error propagation characteristics to the algorithm of the previous section. It is outlined here.

Suppose Q:Qt...Q, is aproductof.n xn Householdermatrices. It canbe shown that

Q: I +wYr where W arrd.Y arenxr matrices. If. P: I -2uurfu"o with u € ffi" and z: 2Quf uTu, andW,Y € ft'x' then

Q+:QP:I-W+YÏ where W+ : lW zl ar'd Y+ : [Yu] are each n x (i + 1).

Consider the block matrix

728 An An Atp Azt Azz Azp A-

Apt Apz Aoo where each partilion A¿¡ € ft""", r : nlP.

The block algorithm consists of the following steps, for j : 7,. .. ,Pi

1. apply the algorithm of the previous section to each block column A,¡, and at each step in the factorisation of A¡, iteratively compuleW and Y, using the resuits of this section

RtJ Rt,z Rt,i Rr,i+t Rr,o 0 Rz 2 Rz,i Rr,i+t Rr,o

: A- 0 0 Rj,j Ri'i+t Ri,i+, Rj,o 0 0 0 A¡¡t,i¡t Aj+t,i+, Aj+t,, 0 0 0 A¡¡z,i¡t A¡¡2,¡¡z Aj+",,

00 0 Ao,j*t Ap,j+z Ao,

2. update the remainder of the matrix with the following operation:

A¡¡t,j+r A¡¡t,j¡z Aj+r,, A.¡¡z,i¡t A¡¡z,j¡z Aj+r,o : (r+ wyr) x

Ao,¡+t Ap,j+z Aoo

Aj+t,i+, A¡¡t,¡¡z A jl'l ,p A¡¡2,¡¡t A¡¡2,¡¡z Aj12,p (5.15)

Ap,i+t Ap,i+z Aoo

5.2.3 The Discrete Fourier TYansform

A detailed discussion of the Fourier, Hartley and cas-cas transforms is provided in the next chapter. These algorithms are implemented particularly efficiently on the MATRISC architecture.

729 5.3 A MATRISC Performance Model and its Validation

Main memory Synchronous DRAM

I

É cache Y cache d X f< ôo È (n oÉ (€ () X Y Result È (t) address address address È O gen gen gen

a X ¡i d Systolic Å processor NXN

Figure 5.22 A sch,ematic representation of tlte functi,ono,l components of the software simulator.

To predict the performance of the MATRISC processor, a matrix class has been defined in C**, and software models of the matrix components of the MATRISC processor have been written which implement the matrix primitives. Algorithms have been coded with this software, and performance estimates obtained from it. Figure 5.2 is a schematic representation of the software simulator. A two level memory hierarchy has been used. The single level direct-mapped cache can be defined as a single cache with two banks, two independent caches with serial access to the main memory or two independent caches with parallel access to the main memory. The last case requires an increased main memory bandwidth. Both write-through and write- back caching policies are simulated. In the write-through model a cache entry is invalidated when results are written to that memory address whereas in a write-back cache, the cache entry is updated. The cache replacement algorithms are intentionally very simple so that the performance predictions produced by the simulations are as conservative as possible.

130 Veriflcation of the performance prediction is done by comparing the predicted per- formance of a MATRISC processor with the measured performance of a concept demonstrator which has been produced by RADLogic Pty. Ltd. for the Australian Defence Science and Technology Organisation (DSTO). The processor is referred to here and elsewhere as SCAP, the SCalable Array Processor'

5.3.1 Matrix Multiplication

Matrices are usually stored in row major order or column major order, i.e. suc- cessive row elements are stored in contiguous memory locations, or successive col- umn elements are stored in contiguous memory locations respectively. In gen- era1, the behaviour of a dense matrix multiplication operation will vary depend- ing upon the storage method for each operand. For the product of two dense matrices A and B, there are sixteen distinct possibilities, indicated in the follow- ing set, where A, indicates row major, and A" indicates column major storage:

* * A, * B", A, + B!, A"* Br, A"+ B!, A" * B", A"* B!,, Al B, Al + {A, Br, A, BT, " BT,,AT * B",AI * BT,A! x B,,A! * BI,A! * 8",,A! * a!]

It must be noted that while there is no algorithmic difference between the order of fetching between A, anð. the transpose of the column major version A!,the use of biock fetches in a cached system means that the memory performance will differ for these two cases. The cache replacement algorithm will aiso affect performance in each of these cases.

Matrix products for each of the sixteen elements of the set described above have been executed on the MATRISC simulation code written for performance estima- tion and algorithm development. Performance parameters used for these simulations were chosen to approximate the SUN SPARCstation 1 using the SUN SBus.

The SBus specification [Su90] shows that the memory subsystem transfers one word every two clock cycles or 100ns, and page 31 of the specification gives an example of a 16 byte Direct Virtual Memory Access (DVMA) transfer in 13 clock cycles, or 650ns. Experience with the 20 MHz SPARCstation 1 implementation of the SBus indicated that this figure is optimistic. The maximum sustained transfer rate pos- sible with four byte, or single word transfers, was about 6 Mbytes/s while running the operating system ) or one transfer in 650ns. For comparison purposes, the exper-

131 imental values have been used and it has therefore been assumed that the total bus latency is 550ns, with successive transfers requiring 100ns per word.

AIso assumed is a direct mapped cache with a block size of four words, with two partitions each of 4096 words,,'i.e. a728 Kbyte cache. Each partition is dedicated to either the A or B matrix operand. The size of the systolic array which has been used is 20 x 20. These choices have been made to match the SCAP hardware as closely as possible so that the model may be verified against a physical implementation.

The partitioning algorithm is implemented automatically by the MATRISC proces- sor, the simulation software and the SCAP system [MaCI93]. The following shows the results of executing matrix products on the simulator for varying storage and access patterns.

90 'ArBr' 80 'ArBrT' 'ArBc' 70

60

.t)È 50 (li FÀ 40

30

20

10

0 0 50 100 150 200 250 Matrix order

Figure 6.32 Simulated perforrlance for the AB prod,uct witl¿ th,e A rnatrir in row major oriler.

r32 90 'ArTBr' 80 'ArTBrT' 'ArTBc' 10

60

U)È 50 tro 40

30

20

10

0 0 50 100 150 200 250 Matrix order

Figure 5.42 Simulateil perforn'Lance for th,e AB prod,uct with the A matririn trans- posed, row major order.

90 'AcBr' 80 'AcBc' 'AcBc' 70

60

(t) È 50 t+i 40

30

20

10

0 0 50 100 150 200 250 Matrix order

Figure 6.52 Si,mulated, perforn'ì,ance for th,e AB product wi,th the A matrir in column rnajor ord,er.

133 90 'AoTBr' 80 'AcTBc' 'AcTBc' 70

60

(t) * 50 Ë Å¡= 40

30

20

10

0 0 50 100 150 200 250 Matrix order

Figure 5.62 Simulated perfornLûnce for the AB product wi,th, th,e A matrir in trans- posed, column major oriler.

Figures 5.3,5.4,5.5 and 5.6 show identical performances for operand matrices whose dimensions fall within the cache. For matrices which exceed cache capacities, the different storage techniques cause performance differences.

If ihe B matrix is circulant, it will not cause performance degradation until the ma- trix order exceeds the cache capacity of 16384. For a dense B matrix, performance will begin to fall once the matrix exceeds the cache size. In the case of the cache chosen, this will be for matrices whose order exceeds JTffi84, or 128. The perfor- mance degradation for matrices which exceed this order is apparent from the graphs. Assuming the C¿¡ output blocks are computed in row major order on a 20 x 20 pro- cessor array, the A matrix partition will achieve hit rates which approach 700% while Lhe partiti,on. is fully contained in the cache. This will be true for matrix orders of up to 800. For orders larger than this, the row partition of the A matrix cannot be held in the cache, and the cache misses which occur during the computation of each C¿,¡ wiII cause a performance loss. For such large order matrices, a block algorithm would be required to maintain performance'

Model Validation

A comparison of simulated and experimental results is given for the circulant matrix

734 problem in the following figure.

r20

.circ' 100

80 (h* É 60 ¿

40

20

0 0 50 100 150 200 250 300 350 400 450 500 Matrix order

Figure 6.72 Simulated, and erperimental performance for the AB product. The A rnatrir i,s in row major order and' tl¿e B mo,trir is circulant'

The results shown in figure 5.7 demonstrate that for matrix multiplication, the mem- ory model used is sufficiently accurate to be used for system performance predictions. Some differences are both expected and observed due to the different cache replace- ment algorithms. This is believed to be the reason for the differences in the predicted and measured performances for matrix orders between 80 and 160. Matrix multipli- cation performance is dominated by the cache access time, which is known to be 150ns for the SCAP system. The differences between predicted and measured per- formance for low order matrix products is due to the omission of system overheads from the software model during task startup and completion.

5.3.2 The FIR Algorithm

MATRISC code which implements FIR filters is given in figure 5.1. The equivalent code which implements the filtering loop on the SCAP processor is given in [MaC193]

AS

135 for (i:0; i*(H-n2- 1)*H- >n2*H- )n1 - 1 < Data-Length; i+:H->n2*H->n2) { V->r: v*i; V->kr : kv f i; mmult3(H,V,R); R->r +: (H->n2*H->n2); R- >kr +: (H-> n2*H->n2); Ì Ì

The input data matrix is I/ and the filtered output matrix is -R. Due to issues particular to the SCAP implementation, two members of the SCAP matrix data structures V and .R require updating, rather than the single member of the MATRISC processor.

140 'MATRISC.fir'

120 -

100

(n È t+i 80 a

60

40

20 0 200 400 600 800 1000 1200 FIR Filter length

Figure 6.& Simulated and" erperimental performance for an FIR fi,Iter on' o,n order N MATLISC processor.

Figure 5.8 shows the experimentally measured performance of the SCAP processor when performing FIR filtering. Also shown are two MATRISC performance predic- tions from the simulation software. The memory model alone is the solid line labelled MATRISC.fi,r, and a model which incorporates 400¡;s of system overhead per matrix multiply is shown in MATRISC-sys.fir. Both MATRISC predictions exceed the mea- sured SCAP performance.

136 The reasons for these differences are twofold, and related to implementation issues. Due to the way in which SCAP interacts with the UNIX operating system, the SCAP processor incurs several types of timing overhead which would not be present in a fully integrated MATRISC processor. The interrupt processing which occurs in SCAP at the completion of a set of matrix operations has been found to exceed 100¡;s. Additional time overheads are due to the writing of the start addresses of the data controller programs into each data controller. These writes are done with system calls, each of which require in excess of 100¡;s. As there are three data controllers, the total system overhead per matrix muliiply is a minimum of 400¡;s. No time has been allowed in these estimates of overheads for loop control and the updating of the offset in the input and output data vectors. As a consequence, the IVIATRISC simulation predictions are optimistic when compared to SCAP.

Although it is possible to quantify the timing overheads in SCAP and make the simuiation accurately model the SCAP implementation, it is of greater interest to use the model to show what SCAP is capable of achieving. For an efficient MATRISC implementation which does not have the unnecessary system overheads of SCAP, the most accurate performance prediction is the memory model alone. This is the solid- line graph in flgure 5.8 labelled MATRISC.fI. What this makes clear is that the SCAP system is capable of substantially better performance for short filter lengths than its measured performance indicates.

5.3.3 A BLAS Matrix Primitive

In Sheikh eú ø/. [ShPh92] a comparison is provided of the performance of some level-2 and level-3 Basic Linear Algebra Subroutines (BLAS) for both the CRAY Y-MP and the CRAY-2. The particular subroutine of interest is SGEMM0 which performs the matrix operation C +- aAB + PC, where A, B andC are matrices , and a and B are scalars.

In this section, experimental results from the SCAP processor presented in [MaCl93] for the SGEMM0 matrix primitive are used to further validate the MATRISC model, and then the model is used to predict the performance of the MATRISC processor under development at the University of Adelaide. The vector version is not considered as the matrix attributes of the MATRISC processor provide no advantage.

737 60 'MRISC

50

40

U)È d 30 àle{

20

10

0 0 50 100 150 200 250 300 350 400 450 500 550 Matrix order

Figure 6.92 A comparison of the performance of th,e SCAP processor and tlte MA-

TRISC model of SCAP when implementing úå,e SGEMMO subroutine.

Figure 5.9 shows the difference between the SCAP processor and the MATRISC model. The different cache replacement algorithms in the two systems will contribute to the lower performance predicted by the simulator than was achieved in practice. However, the shapes of the two curves are nearly identical, indicating good overall agreement. The 'jagged' nature of the simulated curve is due to a more detailed simulation. If the peak figures are taken as representative, there is typically less than

75To rcIative error between simulated and measured performau.ce.

5.4 MATRISC Processor and its Performance ^

A MATRISC processor is currently under development. The architecture which has been studied has a 40 x 40 systolic array and a memory system constructed from syn- chronous dynamic RAM chips. Synchronous DRAM technology offers main memory transfer rates of one word every 10ns with current technology, with a 30ns setup time. These times are consistent with the times quoted in [8u93]. The cache access time is 10ns and the block size is eight. The cache size has been assumed to be 1 Mbyte.

138 5.4.1 Matrix Multiplication

Six memory models have been used for performance assessment. There are three read cache types used for the X and Y operands. The first is the same as IMas used for the SCAP simulations: a partitioned cache with serial access for both X and Y address generators. The second model is equivalent to the modified Harvard style architecture in which X and Y address generators can independently access their respective caches, but the caches have serial access to main memory. The ihird model allows the caches independent access to main memory. For algorithms with high cache hit rates, this model is likely to be the most accurate. The model is invalid for low hit rates as it would require an increase in main memory bandwidth. This could be achieved with an hierarchical cache, but this has not been investigated.

Two memory write models are used. The first assumes single word transfers to main memory, with cache invalidation. This models a write-through cache, and incurs the full main memory setup and write time per memory access. The second models a write-back cache in which data is written back to the caches from the result address generator if there is a write hit on at ieast one of the caches. The time penalty in this case is the access time of the cache. If neither cache contains the data, the time penalty for the write is the same as for the write-through case.

In the following diagrams, a two-digit labelling convention has been adopted to indi- cate the cache structure used in the simulation. The first digit from the set {0, 1, 2} defines the read cache type as: 0: serial access for the X and Y address generators to a partitioned cache; 1: parallel access for the X and Y address generators to a partitioned cache which has serial access to main memory; 2: parallel access for the X and Y address generators to independent caches each of which has parallel access to the main memory.

The second digit from the set {0, 1} defines the write cache type as: 0: write-through; 1: write-back.

Thus a graph referenced as 'circOO' uses a partitioned read cache with serial access for the address generators and has a write-through policy. A graph referenced as 'circl1'

139 uses a write-back cache policy with parallel address generator access to independent partitions, each of which has serial access to main memory.

5000 'c0w0' 4500 ,c

4000

3500

(h 3000 oa l+i 2500 ¿ 2000

1500

1000

500

0 0 50 100 150 200 250 300 350 400 450 500 Matrix order

Figure 5.10: Tlte perforrnünce of a 40 x 40 MATRISC processor when erecuting matrir multiplication. A write-tlt'rough cache is used-

5000 'cOw1' 4500 ,c

4000 w

3500

3000 ÊØ o (tr 2500 Å 2000

1500

1000

500

0 0 50 100 150 200 250 300 350 400 450 500 Matrix order

Figure 5.11: The perforn'Lance of a 40 x 40 MATRISC processor wlt'en erecut'ing matrir multipli,cation. A write-baclc cache is used.

740 5500 5000 circ00 4500 4000 3500

oÈ 3000 É 2500 2000 1500 1000 500 0 0 50 100 150 200 250 300 350 400 450 500 Matrix order

Figure 5.L22 The perfornLance of a 40 x 40 MATRISC processor when erecuti,ng rnatri,r multiplicati,on with a c'irculant B matrir. A write-through cache is used.

It can be seen from fi.gures 5.10 and 5.11 that the performance for matrix multipli- cation is independent of the type of write cache used. This is to be expected as the matrix multiply does not access data which it has written.

The performance for circulant B matrices is shown in figure 5.72 f.or a partitioned cache, and two independent caches. It is clear from these performance graphs that the prime requirement for matrix multiplication is independent, parallel access of the address generators to their respective caches. Parallel access to main memory by the caches contributes little to performance.

741 5.4.2 FIR Filters, Convolution and Correlation

7000 'fir00' 6000 'fir1 'firO1 5000 'fir11 'firzl Èrt) 4000 o Ë À= 3000

2000

1000

0 0 500 1000 1500 2000 2500 3000 3500 4000 Matrix order

Figure 5.13: Sim,ulated performance for FIR f,lters implemented, on an oriler 40 MATRISC processor.

Figure 5.13 shows the predicted performance of the MATRISC processor when per- forming FIR filtering. The three configurations of caches have been simulated, with both write-through and write-back characteristics. As is to be expected from the matrix muitiplication studies, the prime requirement is for the use of independent caches for the X and Y operands. The main memory bandwidth is not significant, as the cache hit rate approaches 100%. As output operands are not re-used, the use of a write-back strategy is of no benefi.t. The two low-performance curves are those of systems which use serial cache access.

742 5.4.3 The BLAS SGEMM0 subroutine

3000 'sgemm0O' 'sgemml 2500 t' IT' 2000 sgemm2l' ÈU) o 1500 à

1000

500

0 0 50 100 150 200 250 300 350 400 450 500 Matrix order

Figure ú.L42 The perforn'tünce of th,e MATRISC processor when erecuting tlt'e SGEMM O subroutine.

Figure 5.14 shows the performance of the MATRISC processor for the BLAS SGEMMO subroutine. The operation is C <- aAB i BC, whete A, B and C are matrices , and a and B are scalars.

The operations carried out are

T:AB T:aT C:pC C:TlC (5.16)

The element-wise operations in this routine significantly degrade the performance of the system. This is to be expected as the high performance operation is the matrix product. However, the benefit of the write-back cache is apparent due to the re-use of the ? and C matrices.

143 6.4.4 The QR Factorisation

400 'qrb00' 350 'qrb1

300 I qrb2I 250 a(t) (*o 200 À 150

100

50

0 50 100 150 200 250 300 350 400 450 Matrix order

Figure 5.15: The block QR factorisation algorithm 'implemented, on the MATRISC arcl¿itecture.

This algorithm benefrts from the write-back cache. It has a performance which .p- proaches 70% of the peak speed of the matrix product. This is not unreasonable for a general purpose algorithm.

5.5 A Constant Bandwidth Array

The analysis carried out in the previous chapters has been for problems whose di- mension I/ is much greater than the order of the systolic array P, i.e. ¡/ > p. For problems where N 1 p,, the performance of the optimised array is less than optimal. In the examples quoted above, the cycle time of the processing elements, and hence the array, has been assumed to be 400ns. This restricts the potential MATRISC per- formance for the computation of a small order matrices. A soiution to this problem is the use of variable speed processing elements. The ring architecture presented in the iast chapter is readily adapted to meet specified speed requirements.

744 Relative speed I u2 ll4 U8 U16 of processing elements: I

U2

U4

u8

Ut6

Figure 5.16: A constant bandwid,th systolic ûrrûU for tlte MATEISC architecture

800 11 700 11 , 1 600

500 rt)È o tr-l 400

300

200

100

0 0 100 200 300 400 500 600 700 800 Matrix order

Figure 5.172 The QR ølgorithm irnplemented, on a constant bandwid,th MATBISC arch,itecture.

There is minimal increase in data-path complexity between systolic elements if the

745 speed of the eiements are binary weighted as in figure 5.16. In this case, the fixed memory bandwidth is used optimally as the computation time for a product is de- termined by the order of the product, rather than the size of the array.

The use of a constant bandwidth array has a significant effect on those algorithms which have many matrix computations which are smaller in order than the array size. The performance of the array for large order outer products, and hence matrix prod- ucts, will remain virtually unchanged. To demonstrate the effect on the block QR algorithm, figure 5.17 shows the predicted performance of the constant bandwidth array. These graphs indicate a performance improvement factor of approximately two, i. e. the 200 Mflops performance obtained from the MATRISC processor when factorising an order 200 matrix improves to about 400 Mflops when a constant band- widih array is used. Also shown in the figure is the performance for the same problem with a cache size of 2 Mbytes. The larger cache provides some benefrts for the larger order matrices.

746 Chapter 6

The Fourier and Hartley Transforms

6.1- The Discrete Fourier TYansform

This chapter describes the implementation on a MATRISC processor of both prime and common factor discrete Fourier transform algorithms using multi-dimensional mapping techniques.

The Fourier integrals provide one means of transforming from the time-domain to the frequency-domain, or uice-uersa. These integrals are

The Fourier Transform

x(u): t:: r(t)e-i't dt (6.1)

The Inverse Fourier Transform

r ¡*oo r(t)::\/ 2rJ--I x(a)ei'Ld,u (6.2) where r(f) is a time-domain signal and X(u.') is the frequency-domain transform of

"(t).

As many signal processing operations such as beamforming, filtering and correlation can be expressed in terms of the Fourier transform and its inverse, efficient algorithms have been developed for its evaluation.

747 The continuous-time Fourier transform, given by equation (6.1) has a discrete rep- resentation which is appropriate for sampled data systems. The Discrete Fourier Transform pair is given by

N-1 x(k): r@)wfi (6 3) n:ot

. N-l r(n):' t x&)wñk" (6.4) k:0

where Wx : and both r(n) and X(k) are periodic sequences. "-iQrlN),

Equations (6.3) and (6.4) are best considered as matrix-vector products.

This chapter discusses the application of number theory mappings to these matrix- vector products. These mappings convert the one-dimensional matrix-vector product, or convolution, to a multi-dimensional convolution which is impiemented as a series of matrix-matrix multiplications.

This re-formulation of the problem yields two major benefits

1. the order of the problem is reduced from O(l/) to O(JÑ) with a two- dimensional mapping, and order W fot p-dimensional mappings;

2. the arithmetic operations are notw matrix-matrix operations rather than matrix- vector operations. These are the operations which are most efficiently imple- mented on the MATRISC processor.

The following sections contain an outline of

1. the mathematical development of both prime-factor and common-factor algo- rithms expressed as matrix products;

2. example code which implements both of these algorithms on the MATRISC machine.

148 6.2 Linear Mappings Between One and Two Dimensions

The derivation of prime factor mappings for linear transforms follows the work of

Good, [Go5S], [Go71], Burrus [Bu77], [BuEs81], Elliott [EIOr79] and Yassaie [YaS6]. Most of the literature e.g. lEIOrTg], references the use of the complementary Chinese Remainder Theorem and Alternate Integer Representation (AIR) maps for forward and reverse mappings, or ai,ce aersû. These complementary maps preserve the DFT coefficient matrices. Burrus [8u77] had commented on the use of ihe AIR map for for- ward and reverse mappings with modified coefficient matrices, as had Yassaie [Ya86]. Yassaie \/as concerned with the implementation of signal processing algorithms on linear arrays and Temperton [Te88] applied prime factor algorithms to the implemen- tation of Fourier transforms on vector supercomputers such as the CRAY-1. Lun eú a/. [LuSi90] proposed the use of the AIR map for both forward and reverse mappings for the special case of DFTs whose lengths are equai to the product of the squares of primes. The two-dimensional formulation of the forward and reverse AIR map- ping for a matrix product machine for the general case of Fourier transforms whose lengths are the product of co-primes t¡/as presented by the author and A.P. Clarke in

[\{arC188], [MaClSS] and [MaCl9O]. This work is extended here for the case of multi- dimensional mappings. Earlier work on the matrix implementation of the DFT on two-dimensional systolic processors was the paper by Yeh and Yeh [YeYe87]. In this paper the prime factor algorithm is not used and the architecture is more complex than that of the matrix formulation of the prime factor algorithm of [MaCi9O].

Consider the problem of mapping a one-dimensional sequence, or array of length l{ : ly'r 7 Nz, into a two-dimensional array of dimension ¡f1 bV ¡/r.

A general mapping which can be used is

y¡ : (M1nt i Mznzl u (6.5) where (.)no indicates that the expression is to be evaluated modulo ly'. Evaluation modulo Iy' makes the map cyclic in n. In order for the map to be unique and one- to-one, the mapping constanls Mt and M2 must be chosen as a function of .ly'1 and Nz.

Two cases are possible:

749 6.2.L Relatively Prime

If ¡L and l/2 are relatively prime then their greatest common divisor is given by (r[r, ¡rr) : r.

For this case the conditions on M1 and lVI2 which make the mapping unique are given by

l(Mt : *Nz) andf or (M, : B¡fr)] and(M1, l/t ) : (Mr, Nr¡ : 1 (6.6) where a and B arc integers

6.2.2 Common Factor

For the case where lr/r and -lr/z have a common factor r, i'e' (l/t,l/r) : r, the conditions on M1 and M2 which make the mapping unique are

(Mr: tNz)and(M2l þNt)and(t,l/t): (Mr,Nr) : 1 (6.7) or (M, I oNz) and(M2 : þNt) and(p, Nr) : (Mt, Nr) : 1 (6.8) where o and B are integers.

6.3 Two'dimensional Mappings for a One-dimensional DFT

The Discrete Fourier Transform of a real or complex vector r is deflned by

¡\r-1 ,_ N-1 x(k): t r(n)exp(- i¡"tr¡: t r(n)Wfrk lc:0,1,...,¡/-1 (6.9) n:O n:0

Consider the mapping of the linear input and output vectors r(n) and X(k) into two-dimensional forms A(nt,n2) and Y(kt,k2) of dimension ¡fr by -l/2, using the mapplngs

150 y¿: (M1nt I IVIznz) x (6.10)

k:(Lth+L2k2)N (6.11) where n1 and k1 a;re indexed from 0 to lúr - 1 and n2 arrdlc2 are indexed from 0 to Nz-7.

Substituting equations (6.10) and (6.11) into (6.9) gives

Nr-1 Nz-1 X(Lrlq¡Lrlcr): t t *(Mtnt I Mznz) ."p(--r'+, (6.12) fl'1:O n2:Q

?,. e

Nr-1 Nz-1 Y(kr,ttr): Ð t a(nt,n2)Wirk (6.13) n1:O n2:Q where

z L z n z te z M r L z n 1 lc z yV M r L r n 1 k 1 M z L 1 n 2 le 1 W i,* : W ff 14¡ 1¡¡ (6.14)

Defining

M2 : BNl (l,nd L1 :1N2 (6. i 5) where B and'lt are integers, makes the last term in equation (6.14) unity

6.3.1 The Prirne Factor Case

For this case .ly'1 and ¡/2 are relatively prime, and it is possible to achieve a unique mapping with

Mt: crNz and Lz:6Nt (6.16)

This makes the second term in (6.ia) also equal unity. Hence equation (6.13) becomes

ly'r -1 Nz -1 Y(t r,kù: Ð Ð v@r,nr)Wf,:*'n'r' WffrNznrkt (6.17) r"t:0 TLZ:0

151 Complementary rnaps for both n and Ic.

Good [Go5S], [Go71] suggested the use of the following constants:

a:þ:7 r : (túi')¡r, (6.1s) o: (r/i')r,

where (p)r (p-t), : 1.

Subsiituting equation (6.13) into (6.17) gives

Nr -1 Nz -1 Y(kr,kù: Ð D, a@t,nr)wi,Xr' wi,:r' (6.1e) ÍLt:0 T¿Z:0

This is a two-dimensional DFT with the two different mappings for n and k given by

y7: (N2n1+ l/rnz)¡,, (6.20)

k : (lit;')no,] n,r, + l(¡/i')r,] n' r,) * (6.21)

This expression can be simplified by writing it in terms of matrix operations. In terms of complex conformal matrices Y , Wt, X and Wz of. orders l/r x l/2, lúr x l/r,

Iúr x l/z and l/z x .1y'2 respectively,

Y : WtXWz (6.22) where the (p, q)rä elements of each matrix are

(Y)o,o : Y(P,q) (6.23) (x)o,o: x(P,q) (6.24)

(Wr)o,o exp -*,-i2tr pq (6.25) - Jvj

(Wz)r,o exp -i2tr (6.26) - -#-pqJ\2

752 The coefficient matrices Wt and W2 are conventional Fourier coefficient matrices.

As an example, consider the Fourier transform of a discrete sequence , , {r(n) . 0, . . . ,20Ì. The length I/ of the sequence is composite with two co-prime factors ¡r1 :7 and Nz:3 where (¡r;t)r, :5 and (¡rit)r, :1. The mappingfor r¿ and lc a,re:.

n - (3ry l7n2)r, (6.27) k:(5xïlq+1x7nz)zt (6.28)

: (1bk1 +7k2)2r (6.2e)

The input sequence {r} is expressed as a matrix of length 21 which is mapped to a matrix of order 3 x 7 by ihe mapping for n, i.e.

rg fr1 fr2 r3 I4 íxs r6 i[7 ixg

ïg lo,o frO,7 lo,2 lo,3 ro,4 f OrS fio,6 frto H tl,o ll,l rl,2 17,3 rL14 ü1,5 ll,6 rtt fr2,0 12,! lcc 12,g 12,4 r2,5 12,6 i[tz íttg íxt+ rt5 frta rt7 Ítg ïtg r2o

rg Ig fø r9 rtz í[15 ixts :( ír7 ïto atts rta ixtg r1 ít4 (6.30) ixt+ rtl frzo r2 I5 rg rtt

The Fourier transform is implemented by executing the following matrix products, where Wtís of dimension 3 x 3,W, is of dimension 7 x 7 and X,of. dimension 3x7, is mapped with the mapping shown above in equation (6.27):

153 T:WtX Y:TWz (6.3i)

The result matrix Y is of dimension 3 x 7. To obtain an in-order result, the map given above in equation (6.29) is used for mappiLB U(kt,,k2) to A* as shown:

Uo,o Ao, 1 Ao,2 Ao,4 Uo,5 Uo,6 Ao,7 As Us Uß An Ue U\,0 U7, 1 Ur,2 Ar,g ur,4 Al,5 Ar,6 Urc Uto U+ Uts Uts A2,o U2, 1 U2,2 U2,B U2,4 U2,5 Az,6 Az Utz At Us Uzo

Ao Ut Az Us A+ As

Uø Ut Ue Us Ð Uto (6.32) Utt Utz An Ut+ Uts Urc Utz Aß Uts Uzo

As a simple example of the operation of the transform, the linearly increasing se- quence V : {(0, j0),,(Li7),(2, i2), . .. , (19, i79),(20, i20)} was inverse Fourier trans- formed to give X : F-rV, where F-L represents the inverse Fourier transform. The result of this inverse transform is given in the following equation. It can be seen from the fi.rst term of the result matrix X that no scaling has been performed. To recover the original sequence, it is simply necessary to perform the forward transform on X, i,.e. V : FX where .F represents the Fourier transform.

154 (210,, 27oj) (0, 0j) (5e.162e, -80.163j) (1, 1j) (23.5402, -44.5402j) (2, 2j) (11.3035, -32.3035j ) (3, 3j ) (4.90067, -25.e007 j) (4, 4j ) (0.816312, -27.8763j) (Ð, 5j ) (-2.72654,, -18.8735j) (6, 6j ) (-4.43782,, -76.5622j) (7, 7j ) (-6.37906, -74.627j) (8, 8j) (-8.10345, (e, ej) -r2.8966j) r-l X- (-9.71313, - 11.286ej ) (10 10j) (6.33) (- 11.2869, -9.77374j) (11 11j) (- 12.8966, -8.10345j) (72 12j) (-74.620e, -6.37e05j) (13 13j) (-76.5622, -4.43782j) (74 14j) (-18.8735, -2.72653j) (15 15j) (-21.8163, 0.816306j) (16 16j) (-25.9007, 4.e0067 j) (77 17 j) (-32.3035, 11.3035j) (18 18j) (-44.5402, 23.5402i) (1e 1ej) (-80.162e, 5e.762ej) (20 20j)

Carrying out this transform with the mapping of equation (6.27) and the matrix multiplications of (6.31), the following matrix results after scaling:

-5.5e-07, -3.5e-06j 75,L\j 9,9j 3,,3j 18, 187 72,72j 6,6j 7,7j t,lj 76,76j 10, 10j 4,4j 79,79j 13, 13j 74,74i 8,8j ,,; 77,77 j 17,71j 5,5j 20,20.i

As V was chosen to be a linearly increasing sequence with initial value zero, the value of each U;¡ is the address to which it must be mapped. Inspection of equation (6.29) and the associated map shows that it is the expected mapping.

Identical maps for both n a.rld lc.

Using the constants suggested by Burrus [8u77] and subsequently Yassaie [Ya86],

a:p:.y:á:1 (6.34) instead of those suggested by Good, equation (6.9) becomes

155 Nr -1 1Vz-1 Y(kr,kù: D Ð Y@t,n2)w N2N'n'k" Wff'N'l.' (6.35) ?/-t:0 flZ:0

This is also a two-dimensional DFT with identical mappings for n and k given by

y¿ - (N2n1+ l/rnz)ru (6.36)

k : (Nzlq + Nrk2) N (6.37)

As before, this expression can be simplified by writing it in terms of matrix operations. In terms of complex conformal matrices Y ,Wl, X and Wl of orders I/r x l[2, l/r x l/r, I/r x .nl2 and Nz x Nz respectively,

Y : WiXWI (6.38) where the (p, g)rä elements of each matrix are

(Y)o,o:Y(P,q) (6.3e)

(x)o,o: x(P,q) (6.40) .---,\ -i2rNz (Wi)o,o - exp -- *.on (6.41) ;2trN1 (Wl)r,o - exp ffry (6.42)

The coefficient matrices W! and Wl are the Fourier coefficient matrices Wt and Wz rotated bV ¡/z and I/1 respectively.

As before, the Fourier transform is implemented by executing the following matrix products, where W! is of dimension 3 x 3, Wl is of dimension 7 x 7 and X, of dimension 3 x 7, is mapped with the same mapping as before and shown above in equation (6.27):

T:WlX Y:TWl (6.43)

156 The result matrix Y is of dimension 3 x 7. To obtain an in-order result, the map given above in equation (6.27) is used for mappin| U(kt,k2) to An as shown:

Ao,o Uo,l Ao,2 Uo,4 Uo,5 Ao,e Uo,7 Ao Us Aø Us Utz Arc A':.a Ul,o Ul,L Ur,2 47,3 47,4 Ul,5 47,6 Az Uto Aß Urc Urs Ut A+ U2,o 92,1 U2,2 42,3 42,4 42,5 42,6 ):( Ut+ Utz Uzo Az As Aa Utt

9o At Uz As U+ Us Ao Az Ae Us H Arc (6.44) At An Uts Au Uts Ata Utz Uß Uts 9zo

Using the same example as for the fi.rst mapping, the Fourier transform of a 21 point sequence is performed with the matrix products of equations (6.43). The resuit matrix is

(-0, -0j) (3,3j) (6,6j) (9,9j ) (72,t2j) (15,15j) (18,18j) (7 ,7 j) (10,10j) (13,13i) (16,16j) (19,19j) (1,1j) (4,4j) (14,74i) (77 ,17 j) (20,20j) (2,2j ) (5,5j) (s,8j) (11,11j)

As can be seen from this example, the correct mapping to recover the in-order se- quence is that of equation (6.44), as required'

157 6.3.2 The Common Factor Case

For this case -ly'1 and I/2 are no ionger relatively prime, but have a common factor r Equation (6.13) becomes:

Nr -1 Nz -1 Y(kr,tr): D Ð a@',rr)w|,:"'r" WffrLznrt"zyyrM;'tNznrkr (6.4b) nt:0 ÍùZ:0

The second term WffrLzutc2 cannot be made unity as in the prime-factor case with- out violating the requirement for a unique mapping. This term therefore remains as a "twiddle" factor. Fung et al. [FuNa91] trse a common factor algorithm to map one-dimensional FFTs into a two-dimensional form which are implemented on a multiprocessor network.

From equations (6.7) and (6.3) the following parameters can be set:

Mt:Lz:0:'Y:7 (6.46)

Hence equation (6.45) becomes

Nr -1 Nz -1 Y(kr,kù: Ð D, a@t,'r)wi,',r' Wç'r'"Wfr:r' (6.47) TIt:O nz:0 with the mapping being given by

n: n! I Ntnz (6.48)

lc:Nzlqllez (6.4e)

As in the prime factor case this expression can be simplified by writing it in terms of matrix operations. In terms of complex conformal matrices YrWt, XrWz arrdWs of orders lfr x l/2, ly'r X ¡f1, ¡tr2 x ly'r' N2 x N2 and l/1 x -1y'2 respectively,

Y:Wtllxrwrl*Ws) (6.50)

158 where S represents an element-wise, or Hadamard matrix product, and the (prq)'o elements of each matrix are

(Y)o,o : Y(p,q) (6.51)

(x)r,o : X(P,q) (6.52) ;r* -.1 L^ (Wt)o,o - exp U, nø (6.53) -j2n (wr)r,o - exP w, nø (6.54) j2n (W")o,o exp - (6.55) - Nrtrnø

Code which executes forward (cf -df{Ð and inverse (cf -idft( )) two dimensional com- mon factor discrete Fourier transforms using the classes defined in the next section follows. The input data is in row order in the matrix X, and the output data is returned in row order in the output matrix ¡8. The matrix ? contains the "twiddle" elements defined in equation (6.55) to be applied after the frrst product. A is the f our-coef0 matrix data structure used in the prime factor transforms which contains the two Fourier matrices W1 and W2, as well as a number of arrays which contain the size of the elements of the factorisation, and working matrix storage. class four-coef { int P; int *n; int *N; MATRIX *W; MATRIX *T; public: four-coef0; four-coef(int, int*); int & ass-p0 { return p; }; ini & assl(int i) return { "[i]; ] int & ass-N(int i) { return N[i]; ] MATRIX & w(int i) { return W[i]; ] MATRIX & t(int i) { return f [i]; ] Ì; void cft(four-coef& A, MATRIX& T' MATRIX & X' MATRIX & R) {

159 int length : A.ass-n(0)*A.ass-N(O); A.t(o) : x; R.colsQ : A.t(1).colsQ : A.t(Q).rowtQ : A.ass-n(1); R.rows0 : A.t(1).rows0 : A.t(g).cols0 : A.ass-n(g); !A.t(0); f f Transpose input matrix MATRIX T1 : A.t(0)*A.w(1); emmui(T1,T,A.t(1)); I I Apply elementwise 'twiddle' matrix

mul(A.w(0),4.t(1),R) ; Ì void cf-idft(four-coef& A, MATRIX& T'MATRIX & X, MATRIX & R) { for(int i:O;i

6.4 p-dirnensional Mappings and the One-dimensional DFT

In [ClMa] the two-dimensional prime factor algorithm of the previous sections are generalised to p dimensions. The derivation in [ClMa] is reproduced here.

The one-dimensional discrete Fourier transform {y(tr)} of the real or complex data sequence {r(")i as defined by (6.9) can be written in the form

*(nt,rL2, ,o) a(kt,kz,. . . ,ko): î- "', fI "*o (-r,, ,W) Þ_:b-"u:0 n2:0 flp:O T: (6.56)

where ¡r : ¡ú¿ gcd(N*,¡ú,,): I Vm,n mtn i:tfl

160 The index maps are

":('å;)^, (6.57)

and *:('åe), (6.58)

The input and output data sequences are

r(n1 ,n2, 'åtr)^,) (6.5e)

and y(kr,kr,...,kr):, ((tå fr)r) respectively (6.60) where (.)r denotes the residue modulo l[

Using the maps for the indices n and k, the product nk is

(o , P p Ttrk nlc: N2 J \-'"t' r \-\- " \L t\T2r 'L,L ¡r,.¡t/" I r:r " r:l s#r )),

p n.krN2 :( (6.61) t N7 r:t N

Substituting the mappings of n, lc and nlc into equation (6.9) gives Nr-1tt Nz-1 No-lt '(('å-t N n1:Q n2:Q np:Q p nrkrN2 exp t N? r:1 N N1tt -1 N2-1 No-1t Í1,1:0 ¡¿2:Q np:0

161 ""' (-'# $;W.'') ) ror some Í :ï_lî_ î-"(('å;),) rl,1=0 n2:Q flp= "

ú*o (-r",W) (662)

so deriving equation (6.56)

6.4.L Particular Implementations

Equation (6.56) provides a generalised expression for a p-dimensional mapping of a one-dimensional transform. The expression can be interpreted as p sets of matrix products. The way in which these matrices are defined in the linear vector space and the definition of the coefficient matrices are discussed below. To clarify this discussion, the application of a MATRISC processor to the transformation of a linear data vector is done best by example.

As an illustration, the following example transforms a linear vector into the Fourier domain with a MATRISC processor using a four-dimensional mapping, i.e. p: [.

The four dimensional maps where I/: l/rl/z¡l3^I4 and (l/r,¡/2,¡tr3,.ly'a):1 a,t"

?? : (¡/2¡l3 N+nt + ¡f1¡/s^I4nz I NtNzN+ns + NrNzNsn¿)¡/ (6.63)

¿ : (I/zl/s N+lq I NtNsN¿lcz + ¡û¡/r¡/4ks * ¡Ë¡/2Nsk¿)N (6.64) and the product is given by

ntc : (Nillwln+lc+ i NlNlwlneks * wIw!wlnztcz t Nlwlwlntkt)iv (6.65)

Using equation (6.56), the expression for the p-dimensional DFT factorisation is

762 Nr -1 Nz-L Ns-1 N¿ -1 Y(kr,lrz,hs,kn): Ð t t L, A@t¡z72,¡Tts,n+)x 1¡.t:0 TùZ:0 fùg:0 Íù+:0

wffY^r'^ WffY"n" WffY'X' wffY'r" (6.66)

The coefficient matrices have elements defined by

t2trNPq (W¡,tr)p,q: (6.67) ""P ff

Let the mapping constants M¿ associated with the n¿ in equations (6.63) and (6.64) be defined by

M¡: flt, : NIN¿ (6.68) i+i

The Fourier transform is implemented by performing the following sets of operations:

(1) x'¿,i:wN"x¿,i o

(2) Xi,i :WNnXl,i o

(3) x'¡,.¡ :W¡'trx¡,i o

*fr"* the elements of X¿,j(p,q) are defi'ned by t(i,i,p,q):'i*Msl i * M+*p* MtlS*Mz.

(4) xi,i:wx"x'¿,i o

163 6.4.2 A Recursive Implementation

An example is given for a p-dimensional impiementation. The following code imple- ments arbitrary order prime factor forward @f -df 4Ð and inverse (pf -idf t( )) Fourier transforms. void pft(four-coef& A' MATRIX & X' MATRIX & R) { int length : A.ass-n(0)*A.ass-N(O); A.t(2) : X; A.t(3) : R; A.i(2).modq0 : A.t(3).modq0 : length; for (int p:0; p

The input data is passed in as the data vector associated with the X matrix, and the in-order transform is returned as the data vector associated with ihe ¡B matrix. For a p-dimensional transform the A data structure contains p Fourier coefficient matrices, together with integer arrays which contain the dimensions of the factorisation. The computation is carried out in the recursive Tensor-Multiply0 function, and the prime factor maps used during the computation are implemented with a map0 function. The class definitions and functions which are used by the above code are given below. void MATRIX::map(int offset, int N1, int N2) { init : offset; d1 : N1; d2: N2 - (n2 - 1)*N1; if (modulo)

764 while (d2<0) d2 ¡: modulo; Ì void Tensor-Multiply(int level, int p, int init, four-coef& A) { int index,PrP-7rP-2,*tr,*N, map1, map2, p-1modp, p2modp, src, dest; p : A.ass_pO; P-1 : P-1; P2: P-2; if (level < P-2) { index : (level+p)%p; for (int i:0; i0) init a: A.assJ'[(index);

Tensor Jvluttiply(levell 1,p,init,A) ; Ì Ì else { f f Evatuating TensorJvlultiply0 init : init % (A.ass-N(O)*A.ass-n(O)); if (init::O) { p-lmodp: (P-1+u)%P; p-2modp : (P-2+v)%P; mapl : A.assJt[(p-1modp);

rnap2 : A.assl\(p-2modp) ; if (p::Q) src:2; else src : (P&1); if (p

( ( 1 . offset ( : A. t ( 2) . offset ( : 4' I 3 offset ( :i¡1¡1 A. t ( 0 ). otrset ) : A. i ) ) ) 1 ). ) ;

mul(A.w(p2modp),4.t(src),A..t(dest)) ; Ì )

6.5 Performance

To obtain algorithmic performance estimates, the execution time of each element of the set of matrix operations { addi,tion, subtraction, multiplication, Hadamard rnultiplicationj is characterised in terms of the memory access time, the dimensions of the matrix operands, and the order p of a square systolic array'

165 6.5.1 Addition/Subtraction

The set { addi,tion, subtraction, } have an identical time performance which is esti- mated from the number of fetches to read the operands, the memory architecture, and the number of memory writes to store the result.

The operand matrices are partitioned so that the partitions have at least one dimen- sion equal to or less than the order of the array.

N2 N2 N2

p Aoo p Boo p Coo p An p Bn p Cn + (6.6e)

(¡/t )o A(*lùo (r/' ), Bçv1fi0 (¡rr )o cg ¡¡o

Real matrices

The time to complete a matrix addition or subtraction using a single memory sub- system is

l/r Treal-ad,it:2 -f/2 max(pz * 2,n/2 max(( Nt) + N1N2r (6.70) p ,7") or,T")

where Tre,1-aitit is the total addition time, ?} is the array cycle time, r is the memory cycle time and [.] is the largest integer less than or equal to the argument.

The first term of the equation gives the time to compute the partitions with row dimension p, the second term gives the time to compute the partition with row dimension (¡úr)o,and the third term is the time to write the result back to memory.

For a constant bandwidth array) the array cycle time is equal to the time to fetch the operand vectors, and hence the time for addition is the same as T*.*, the time spent reading and writing the operands:

166 rl"Lî-;oo: r,n.,n: r" (lil pNzt (¡r,), ¡r,) (6.71)

For a memory subsystem which is dual-ported, the operand fetches occur in parailel, and the general expression becomes:

Treal-ailil: lú2 max(pr,T") + ¡¿, max((I/rlrr,T"¡ * l/rl/zr (6.72) L+l

Complex matrices

The time to complete a complex matrix addition or subtraction using a single memory subsystem is

Tcrnpl : 2Treal (6.73) x -a itil. -aitil

6.5.2 Hadamard or Schur (Elementwise) Multiplication

For element-wise multiplication, The matrices are partitioned identically to that shown in equation (6.69) for addition and subtraction.

Real Matrices

For reai matrices, the equations deflning the execution time of the operation are identical to those for addition and subtraction.

Complex Matrices

In the complex case, the complex result to be computed is given by the following:

(A + ¿B)s (x + iY) : (AX - BY) + i(AY + BX) (6.74)

The general equation for the execution time of this operation can be written in terms

of. Tr"o¿, the read/execute time of the operation, and Trarite, the write time of the operation.

l{r Tr.od, :2 N2max(pr,To)12.n{2max((N1)rr,To) (6.75) p

767 where 4 is the array cycle time and z is the memory cycle time

Turite: NtNzr (6.76)

Ts x : 2(Tr.o¿ t T-rit") (6.77) "hu, -"rnpl -mul

For the case of a dual-ported memory,

T S : Trea il I 2T-rit" (6.78) "lru, -"rrplx -mul

6.5.3 Matrix Multiplication

There are many possible implementations of the matrix product algorithm. However, the direct implementation, which may be considered as an accumulation of outer products, offers numerical advantages when accumulating into extended precision registers [GoVa89].

The simplest expression for the execution time of a matrix multiplication operation for conformal matrices of dimension l/r x l/2 and Nz x Ns to give a result matrix

given time to compute result partitions. of dimension ¡û x l/g is by the t+llpllpl t+l This time is readiiy described in terms of the number of memory accesses to compute the product and write the result.

The number of fetches required to compute the result on a p x p array is

¡ú3 ¡r,¡rg p l',*. lil

The number of memory writes to store the result is l/rl{s. The total time to execute the operation is

tII;ï': (lil ¡r1¡r,. l+l ¡/r¡/3 + ¡fl¡fg T (6.7e) where r is the memory cycle time. In the performance graphs which foliow, r is assumed to be 10ns.

168 This expression is inadequate to describe the behaviour of the MATRISC processor as it is necessary to take into account the respective dimensions of the two matrices being multiplied. The computation of each outer product cannot commence until both vectors of the particular outer product have been loaded.

For a matrix product whose result dimensions do not exceed the size of the systolic array, the computation time is determined by the time to load each pair of vectors to be multiplied in the array, the number of vector pairs or wavefronts, and the number of results to be written to memory. The systolic cycle time T"o is given by:

(nI -f n3)r for a single-ported memory T.a: (6.80) rnax(nl,nï)r for a dual-ported memory

The time to compute the product is then

NzT"a * lútffsr for real matrices ¿NrNzNgLtnrnul (6.s1) - 4N2T"y f 2NrNsr for complex matrices

Conformal matrices of dimension -ly'1 x -l/2 and ¡lz x I/a whose Nr x ¡fs result exceeds the size of the systolic array are computed by the MATRISC processor as a number of output partitions, where the time to compute each partition is given AV tL:,I,iI'' These partitions can be grouped into four regions as shown below:

Cåo Lrol/-1 r,2 cT?*-'- C xz'1, C xa w4a-¡/1 21) C:AB: 10 11 (6.82)

/1x't z /1 21t ?,4 u1r-r¡r w ç'-t¡z C(P-1)(R-1)

The time to compute this partitioned product is given by the expression : rIHi*" f;I','," - 1Ïl )'fi,rr*

(6.83)

(¡'l')o>o v/nere---L^-^ _ _l(¡'l,)o. ro:li otheiwise

(¡r')o > o anct^-r ", _ I (¡'lr)"r ro:tO otheíwise.

169 6.6.4 The Prime Factor Algorithm

From equation (6.62), the prime factor algorithm requires the evaluation of a set of matrix products. The time for these products is given by equation (6.83). The fol- lowing expression gives the execution time for a two-dimensional prime-factor Fourier transform. The matrices are complex.

+NrNz aN1 N1 N2 1,¡N1 N2N2 (6.84) 'pI-dft-ummul- | "mrnul

Plots of the execution times of a two-dimensional factorisation implemented on a constant bandwidth array with dual-ported main memory is shown in the following figure.

Complex Fourier Transform (constant bandwidth)

Transform time (us) 'cft_cbw' 1500 1000 500 0

N2[1:40] N1[1:40]

Figure 6.Lz A surface plot representing the erecution times for a constant bandwid'th, 'implementation of a two-d,imensional factorisation of the compler Fourier transform algorith,m.

170 Complex Fourier Transform (constant bandwidth) 1400 'cft cbwzd' t200 CN ..'..j o 1000 800 600 -9(t

cü 400 F 200

0 0 1000 2000 3000 4000 5000 6000 Transform length

Figure 6.22 A 2D point plot representing the erecution times for a constant band,- width implementation of a two-dimensional factorisation of the compler Fourier transform algorith,m.

The times for a variable bandwidth array are shown beiow

Complex Fourier Transform (variable bandwidth)

Transform time (us) 'cft vbw' 1 500 1000 500 0

N2[1:40] Nl[1:40]

Figure 6.32 A 2D surfa,ce plot representing tlte erecut'ion times for a aariable band- width i,mplementati,on of a two-d,imensional factori,sation of th,e compler Fourier transform algorithm.

777 Complex Fourier Transform (variable bandwidth) 1400 t200 ct) o 1000 800 É o 600 (t d 400 Fr 200 0 0 1000 2000 3000 4000 5000 6000 Transform length

Figure 6.42 A point plot representi,ng the erecution ti,mes for a aari,able ho,n,duiiltlt' irnplementation of a two-d,imensional factorisation of the compler Fourier transform algorithm.

Complex Fourier Transform (40x40 array) 1400 'cft vbw2d' o 1200 'cft cbw2d' + Ø) 1000 o 800

! 600 Ø HcË ti 400

200

0 0 1000 2000 3000 4000 5000 6000 Transform length

Figure 6.6¿ A comparison of tlt,e performance of a constantband'widtlt' 40x40 ürray) a uariable bandwidth 40 x 40 array and, a conuentional compler FFT algorithm.

772 Complex Fourier Transform (40x40 array) 30000 'cft cbw2d' o 'cft cbw3d' 25000 G

Ø ? 20000 o

15000 ôtr Ø + (É + t< 10000 +** F + + +t{ . ++* r** + + ' 5000 ++++ + 1+* +++

0 0 10000 20000 30000 40000 50000 60000 Transform length

Figure 6.62 The performance benefit of two- and three-dimensional factorisation ølgorithms when compared, witlt' the FFT algoritlt'rn.

Complex Fourier Transform (40x40 array) 1.4e+06 'sim3dft ' " l.2e+06 'sim4dft' 'fÎti .*'' )U) 1e+06 C) 800000

Lr (+ro 600000 rt) É + lr ++ F 400000 + ++ ** ** ++* 200000 e +# l+

0 0 500000 1e+06 1.5e+06 2e+06 Transform length

Figure 6.72 The performance benef,t of three- and four-dimensiona,I factorisation algori,thms when cornpared with th'e FFT algoritltm'

6.5.5 The Common-Factor Algorithm

From equation (6.50), the common-factor algorithm requires the evaluation of two complex matrix products and a complex Hadamard matrix product. The time for

173 this is derived directly from equations (6.78) and (6.77)

N1 N2 N2N2 +NrNzLcf aN1umn¿ul 1,¡N1urnrnul r +NrNz (6.85) -dft - I | "Scl¿ur-cntplz-rnul

Complex Fourier Transform (40x40 array) 3000 'sim2dcf cft' o 'sim2dpf-cft' 2500 E

(t) 2000 o ! H o É 1500 ¡r o + çr(t) 1000 F*<

500

0 0 1000 2000 3000 4000 5000 6000 7000 8000 Transform length

Figure 6.8: ,4 cornparison between the erecuti,on times of common factor and, prime factor transforrns of uarying lengths and, the conuentíonal FFT algoritltm.

Figure 6.8 shows that the penalty of the complex Hadamard product is approximately

30% al, the longest transform lengths. Despite this penalty, the common factor algo- rithm is still faster than the conventional Cooley-Tukey FFT except for transforms of length 2048.

In the above comparisons it has been assumed that the Cooley-Tukey algorithm is implemented wilh nlogrn memory read/write operations. Real Cooley-Tukey transforms require the time of a complex transform of half the length. During the application of each set of 'butterfly' operations,2n real accesses are required to read the n point complex vector, and2n real accesses are also required to write the complex result vector back to memory.

The comparisons have been biased towards the Cooley-Tukey implementation as the butterflies have been assumed to be impiemented in a scalar arithmetic unit with complex arithmetic capability. If the same assumption is made for the MATRISC processor, each systolic cell would implement complex arithmetic, and the complex matrices would need to be fetched only once.

774 Using the expressions for the execution time of matrix products given in equations (6.80) and (6.81), the cycle time for a complex array would be

2(nI -f nï)r for a single-ported memory mt (6.86) cmplr-ca - 2 x max(n!,n3)r for a dual-ported memory

The time to compute the complex product is then

t!;X{{" : NzT"^plx-cv l2NvNsr (6.s7)

As before, the time to compute a partitioned matrix product is identical to equation (6.39), with the time for real matrix products replaced by the time for complex matrix products i. e.

¡ft 1 mNr.ly'zNs +roNzyo-. I - I +roNrp t - am,m,ul - L"rn*ul -L O )Lcmrnulr fÇl tÏ11,i,i,,. l4/] le/)"ilr', (6.88)

(¡/r)o (rúr), > o (¡tIr)o (¡rr)o > o where rs : and grs p otherwise - p otherwise

The expressions for the execution time of the two-dimensional Fourier transform on a complex MATRISC processor then becomes:

l¡NrNr N2 rVN1N2N2 +NrNzupf _y| - (6.8e) -dft - - crnntul crnntul

175 Fourier Transform: (40x40 complex array) 14000 'cmplx2dft' 12000 'cfft-'

Ø 10000

C) Ê 8000

F 6000 €v) (!Ê li F 4000

2000

0 0 5000 10000 15000 20000 25000 30000 Transform length

Figure 6.92 A comparison between the erecution times of the Cooley-Tulcey FFT and a two-d,imensional factorisation of the Fourier transform implemented on u compler MATRISC procusor.

Fourier Transform (40x40 complex array) l.4e+06 'cmplx3dft' l.2e+06 'cmplx4dft' O+ + 'fft_' ... Ø) 1e+06 o 800000 É ¡rÈ 600000 çr(t) lrñ F 400000

ôo o 200000 o oo ¿o oÔ .{ ooo óoo ¡oÔ 0 0 500000 1e+06 1.5e+06 2e+06 Transform length

Figure 6.10: A comparison between tlte erecution t'imes of the Cooley-Tukey FFT and, three- and, four-dimens'ional factorisations of tlte Fourier transforrn implemented' on ü cornpler MATRISC processor'

For the p-dimensional transform of length l/ : lll:, l/¿ the general expression is

776 +N rNiN;N1;ar¡o Lpr_üt:-T ^/ tcmmut (6.e0) k ¡vrJ(+rt

A 265650 point complex transform requires 33.4ms to execute using a four- dimensional factorisation with a complex array using the factors (27,22,23,25).

6.6 The Discrete Hartley Transform

Hartley, in 7942lHaa2l, defined an alternative real kernel for the Fourier integrals:

HU): * l* *þ).us(2trft)dt (6.e 1)

x(t): L H(f)cas(2r f t)df (6.s2) where cas(2nft):cos(2trfú)f sin(2zrft),X(t) isarealfunctionof time,andff(/) is a real function of frequency.

These integrals are the continuous Hartley transforms. For sampled data systems the discrete Hartley transform (DHT) pair can be written as

N-1 fr(k) : r(n)cas(2trnk lN) (6.e3) # n:0t

N-1 n(n) : I ng'¡.as(2trnklN) (6.e4) &:0

A fast algorithm for the Hartley transform was presented by Braceweli in [Br84] and [8186]. Villasenor ef ø/. [ViBrSg] presented the use of commmon factor maps for the implementation of a fast algorithm, and Pei et al. [PeJa89] pointed out the use of the complementary Chinese Remainder Theorem and Alternate Integer representation maps to reduce the complexity of the transform.

777 As with the Fourier transform, this section discusses the application of these number theory mappings to the matrix-vector products which define the Hartley transform. These mappings convert the one-dimensional matrix-vector product, or convolution, to a multi-dimensional convolution which is implemented as a series of real matrix- matrix multiplications. As for the Fourier transform, the Alternate Integer represen- tation is used for both the forward and reverse mappings, in contrast to the use in the literature of the complementary maps.

6.7 Two-dimensional Mappings for a One-dimensional DHT

The discrete Hartley transform of a real vector r is deflned by

¡'r-1 H(k): r@)cas(fink) lc:0r!r...r¡f-1 (6.e5) tn:0

Using the mappings derived for the Fourier transform,

y¡: (M1nt I Mznz)modN (6.e6)

k : (Lth t Lzkz) mod N (6.e7)

where n1 and k7 ate indexed from 0 to l/r - 1 and n2 arrdk2 ate indexed from 0 to Nz - 7,,

the linear input and output vectors z(n) and X(k) are mapped into two-dimensional forms y(nt,n2) andY(kt,,k2) of dimension ¡/r bV ¡úz'

Substituting equations and into (6.95) gives

H(Lrkt-f Lzkz): Ï' T .r*,nt * Mznz)' "(T#) (6.e8) 7l.t:O n2=0

?,. e

-òy'1 -1 1y'2 -1 ( H(t r,lcz) : a(nt,nr¡.^"(\ff¡ (6.ee) n'1:Ot' n2:Qi'

178 where

/2rnk\ 2r (M2L2n2lcz I MtLznnJcz I M1L1n1h I MzLtnzkt) ""t\. t ):"u"* (6.100)

Defining

M2: BN1 and L1 :1N2 (6.101) where B and I aîe integers, makes the last term in equation (6.100) an integral multiple of.2tr, which can be ignored.

6.7.1 The Prime Factor Case

For this case .ly'1 and l/z are relatively prime, and it is possible to achieve a unique mapping with

M't : 6rNz and Lz:6Nt. (6.102)

This makes the second term in also an integral multiple of 2r. Hence equation (6.99) becomes

'î' H(kr,,tcz) :b' a(nt,n2)cas (ff;liu*rnzkz t (6.103) ,,t:o 7r:o \ ft*r*rrrrr)

As for the Fourier transform the following constants are used: a:0:r

1: (N;r)modN1 (6.104)

á : (I/it ) mod N2

where l(p) rnod¡ll[(p-t) mod Nl : t.

Substituting equation (6.104) into (6.103) gives

779 H(kr,tcz) : Ï t' a(nt,n2)cas (#;,,r, * (6.105) n1:Q n2:Q \ ft,rr,)

Applying the trigonometric identity

cas(a + ó) : cos(ó)cas(a) + sin(A)cas(-ø) (6.106) yieids

Hk:ï_t-r(Hirrrrr.,,i":ra(nt,nr).u"(ftnzlcz)

Nr Nz -1 -1 . 2¡r + t sin(ftnú) t a(nt,n2)cas(-ú"tfr) (6.107) 1,?"t:0 nz:0

The mappings for r¿ and lc are given by

v¿- (N2n1l Nfiz)modN (6.108)

k : { [(¡r;t )rnod,¡úr] ¡r, kr + l(,vit)mod¡lr] ¡/r k2]mod N (6.109)

A further rearrangement expresses (6.107) as the sum of two conformal matrix prod- ucts of orders l/r x ¡û, ¡f1 x lf2 and N2 x N2 respectively and the result matrix is of order N1 x N2. i.e.

H:CXT"OTSXT_"O (6.110)

where the (p, g)Úä elements of each matrix are

(H)o,o : H(P,q) (6.111)

(X)o,o : X(P,q) (6.112) 2r (C)o,o : cos (6.113) NrPq . 2tr (S)o,o : sln (6.ii4) Nrpq 2r (T.o)o,n : cas (6.1 15) *PQ

: -2tr (6.116) (T-.o)o,o cas *P8

180 6.7.2 The Common Factor Case

For this case .ly'1 and I/2 are no longer relatively prime, but have a common factor r Equation (6.99) becomes:

Nr-l Nz-l .. 2#mrLzntkzt H (kr, kz) : t. t u(nt,n2)cas(ff;Otr"rk, + U:O n2=0 #*"*'n'n') (6.1i7)

'{mtf'rrykk2 The second term cannot be made an integral multipie of.2¡r as in the prime-factor case without violating the requirement for a unique mapping. This term therefore remains.

From equations (6.101) and (6.102) the following parameters can be set:

Mt:Lz:13:^'l :7 (6.118)

Hence equation (6.117) becomes

H(kr,t z) : Ï t' a(nt,n2)cas (Hrr*, +2fir't , + (6.i1e) 'lt4:Q n2:Q \ Hørrrkr) with the mapping being given by

n:nt*Ntnz (6.120)

le:Nzlq*Kz (6.121)

Using the identity

cas(ø + ó + c) :cas(¿)cos(ó) cos(c) * cas(-a) sin(b) cos(c)* cas(-c) cos(ó) sin(c) - cas(-ø) sin(ó) sin(c) (6.122)

181 equation (6.119) can be written as a sum of four conformal matrix products

H(kt,kz):T.o [[X"Cr,'r] S Cw] 1-T-"o [fxr0ryt] I Sry] I T-"o i[x"Srvr] s Cru] - T"o l[x"srir] s .9ru] (6.123) where I is a Hadamard, or element-wise matrix product, and the matrices in each of the four conformal products are of order N2 X N2,Iúz x I/r, lrlr x Ift and lúz x lrir.

Rewriting,

H (kr,kz) :T"o lllxr C Nrl ø cru] - [fx"sryt] ø sru]] tT-"o [[[x"crur] ø sry] + [[x"^9ru'] ø cru]] (6.724)

The (p, g)úä elements of each matrix are

(H)o,o: H(P,q) (6.125) (X)r,o: X(P,q) (6.126) 2r (Sùp,o: sin I pq (6.727) 2tr (Ct)o,o: cos I pq (6.128) 2r (Tro)o* cas pg (6.12e) - Jv1n,

(T-"o)o,o: cas-2n pÇ (6.130) lv1n¡

Code which implements both the prime factor and common factor two-dimensional Hartley transform is shown below.

void cfdht(hart-coef& A' MATRIX & X' MATRIX & R) { int length : A.ass-n(0)*A.ass-N(0);

782 R.rows0 : A.t(O).cols0 : A.t(1).cols0 : X.to*t0 : A.ass-n(g); R.colsQ : A.t(O).rows0 : A.t(1).rowsQ : X.cols0 : A.ass-n(1); !X; // Transpose X A.i(3) : X*4.c0; A.t(2) : X*4.s0;

emmul(4. t (3),A'. cn( ),4. t (0) ) ;

emmul(A.t(2),4. sn0,A.t( 1 )) ; R : A.h0*(A.r(o) - A.t(l)); emmul(A.t (3 ),A'. sn( ),4. t (0) ) ;

emmul(A.t(2),A.cn0,A.t (1 )) ; R : R + (A.h-0*(A.t(o) + A.t(1)));

6.8 Performance

Using the results of section 5, the execution times for prime factor and common factor Hartley transforms can be simply written in terms of matrix products, element-wise products and additions.

6.8.1 The Prime-Factor Algorithm

From equation (6.107), the prime factor algorithm requires the evaluation and sum- mation of two triple matrix products. The time for these operations is derived directly from equations (6.83) and (6.71):

,i;Il :, (ril#!'' + r##i*') + rl"Lf_ioo (6.131)

6.8.2 The Comrnon-Factor Algorithm

From equation (6.124),, the common-factor algorithm requires the evaluation and summation of four matrix products, four Hadamard matrix products and three matrix additions. The time for this computation is derived directly from equations (6.83) and (6.71):

183 zr t' 4T sr t! ![: : zr #",1,',N' + [g? + LZf', n, r rn u r + [1f-'" o o ; " -n¿ :, (r##l'' + rill,i*') + rcfl;f_1oo (6.132)

6.8.3 Comparison

Hartley Transform (40x40 real array) 4500 'cbw ht2d¡ o 4000 'cbw cf ht2d' . 'rfft ' + 3500 (t) 3000 C) E 2500 F l+io 2000 Ø Ê (Ú lr 1500 -v t-{ 1000

500

0 0 2000 4000 6000 8000 10000 12000 14000 16000 Transform length

Figure 6.LLz A comparison between tlt,e erecution times of a prime-factor and comrnon-factor implementalions of tlte Hartley transform. The real FFT etecution time is includ"ed as a reference.

6.9 A Multi-dimensional Transform Example

A transform related to the Hartley transform which is defrned expressly in terms of the cas kernel is discussed in this section. It can be used for the computation of both the Hartley transform and the real Fourier transform.

6.9.1 The Multi-dimensional Prime Factor cas Ttansforrn

Consider the mapping of the linear input and output vectors r(n) and X(k) into p-dimensional forms using the alternate integer representation maps of (6.59), (6.60)

184 and (6.61) where the length l/ of the vectors is factorable into p co-prime factors ¡f1 ...ÄIo,and p ¡i : fl ¡rn. z:l

Substitution of this product into the DHT equation (6.95) converts the one- dimensional Hartley transform into a multi-dimensional transform. In contrast to the Hartley transform, this cøs multi-dimensional transform is written directly in terms of the kernel, and for the p-dimensional case is written as

Nr-1 Nz-l No-l ., : .D"@r,,n2¡...,np) T(kr,lcz,.. kr) I¡r t t n1=O n2:Q tup:o

(6.133)

The inverse transform is simply

Nr-l Nz-l No-l r(nt,rù2¡...,rò : D t . Ð (¿r,kz,...,ko) î¿L:O n2:Q np:o "",(r#t) "..ew) ".,ew) (6.134)

This transform is of interest for the following reasons

f . it is readily computed with a recursive procedure

2. multi-dimensional Hartley transforms can be readily derived from it;

3. multi-dimensional real Fourier transforms are derivable directly from the cas transform.

For the p-dimensional transform of length ¡/ : fll:t .n/, the general expression is

¡./ (6.135) tpf-cas - å td* 'N;NtNç+'l¡o

185 Cas Transform (40x40 real array) 3000 'cbw cas2d' o 'cbw cas3d' 2500 +

U) 2 2000 o ts

H 1500 ! (* (t) É (! 1000 Fti

500 aßo + * 0 0 2000 4000 6000 8000 10000 12000 14000 16000 Transform length

Figure 6.L22 A compari,son between the erecut'ion ti,mes of a real Cooley-Tulcey FFT and, two- and" three-d,imensional fa,ctorisations of the cas transform implemented, on a real MATRISC processor.

Cas Transform (40x40 real array) 600000

500000

(t) ) 400000 a E Ê 300000 f< (+ro (h (€ 200000 Flr

100000

0 0 500000 1e+06 1.5e+06 2e+06 Transform length

Figure 6.13: A comparison between tl¿e erecution times of a real Cooley-Tukey FFT and, three- and, four-ilimensional factorisati,ons of tlte cas transform implemented on a real MATRISC processor.

Although real Fourier and Hartley transforms of dimension higher than three can be computed with the cøs transform, they are likely to be less computationally efficient

186 than the direct computation.

The following presents the relationship between l,he cas-cas and Hartley transforms of two and three dimensions. Chakrabarti and JáJá in [ChJá90] discuss the sys- tolic implementation of two-dimensional discrete Hartley and Cosine transforms, and Bracewell et al. in [BrBu86] discuss a technique for the computation of the two- dimensional Hartley transform H (h,k2).

In [BrBu86], T(kr,kr) : T(u,u) is the transform obtained after taking the DHT of the rows (columns) of a two-dimensional data a;rray,, foliowed by ihe DHT of the columns (rows) of the result.

Bracewell used the ideniity

2cas(a * ó) :cas(a)cas(ó) -l cas(a)cas(-ó)+ cas(-ø)cat(ó) - cas(-ø)cat(-ó) (6.136)

with the two-dimensional cús-cas transform T(kr,,k2) to compute the Hartley trans- form

2í(kr,kz) :T(kr,kr) + - h,kz) + T(kr,Nz - kz)- "(¡ú1 -lq,Nz-kz) (6.137) "(¡/t

The computation of H(k1, k2) was proposed to be done by computing lhe diagonal ercess E -- 1l2l(A+ Q - (B + C)] followed by the in-place replacemenl A <- A- E, B <- B + E, C <- C l-Ð and D <- D-8. Chakrabarti et al. in[ChJág0] propose a different method for the computation of f/ using a new temporary outcome Z defrned by

1 Z(kr,k2) lT(kt,kt) -f T(kt,, Nz kr)l (6.138) 2 -

1 kz) T(kr, Nz kz)l (6.13e) z(kr, N2 - kz) -- , lT(kr, - -

is then deflnedby for 1( kr < l/r - l and 71lcz f |-¡/, 121 - 1. fl

H(kt,kz): Z(kt,,kr) + z(Nt - kt,Nz - tcz) (6.140)

187 H(kt,,N2 - kz): Z(kr,kz) - Z(N, - lq,Nz - kz) (6.141)

Special purpose hardware is presented which implements the Hartley and Cosine transforms.

The MATRISC processor implements this transform directly. The mappings which describe the four components of the Hartley transform in (6.137) are implemented naturally by the address generator. The first mapping is the prime factor maP, and the remainder are simple modifications which are naturaily represented in the software:

T(kr,k2): k: (N2lq + lürkz)¡¡

T(kt,,N2 - k2) : k: (¡/2kr + ¡úr(¡/z - kr))¡¡ : (N2lc1 - Nzkz) u

?(l/r - lq,kz) : k : (Iúz(Iú1 - kr) + Núz) w

: (-N2lq l- Nzkz) ¡,t

- kt,N2 - k2): k: (.n[z(lrlr - kt) + ¡/r(¡'l, - kr))¡¡ "(lúr : (-N2lq - Nzkz) N

The following code implements this transform.

void Hart2d(four-coef & A, int dim, int * sz, MATRIX & a, MATRIX & R) { int vlen : A.assl(O)*A.assl[(0); DataType ResultType : R.TYP"0; R.Type0 - Real; pft(A,a,R); I I Perform the 2-dimensional cas transform R.map (AIR, 0,4. ass J\ (0),4. ass -N( 1 ) ) ; R.modq0 - vlen; R.rows0 : A.ass-N(¡); R.cols0 : A.ass1q(l); MATRIX T10 : R;

188 Tl0.submatrix() : 1; MATRIX TO1 : R; T0l.submatrix0 : 1; MATRIX T11 : R; Tll.submatrix0 : 1;

MATRIX TC(R.rows0,R. cols0,Reai) ;

MATRIX TS(R.rows0,R. cols0,Real) ;

TC.map(AIR,0,A.ass J\(0),A.ass-N( 1 )) ;

TS.map(AIR,0,A. asslt(0),4. ass-N( 1 )) ; TC.modq0 : R.modq0; TS.modq0 : R.modq0;

ass-N( 1 T1 0.map(AIR,0, -4. ass-|I(0),4. )) ; T0 1.map(AIR,0,A. ass-N(0), -A.ass-N( 1 ) ) ; ass-N( 1 T1 1.map(AIR,0, -4. ass-|t(0), -4. )) ; * l* Generate th,e sine a,nd, cosine matrices from th'e T matrir f madd(T10,T01,TC); msub(R,T11,TS); R.unmap0; TC.unmap0; TS.unmap0; madd(TC,TS,R); ì- J

2D cas, Fourier & Hartley Transforms (40x40 real anay) 3000 'cbw ht2d' cbw 2500 'cbw + tc E Ø 2000 o tr + 1500 ts

Ø (d 1000 FH

500 @

0 0 2000 4000 6000 8000 10000 12000 14000 16000 Transform length

Figure 6.L42 A comparison between the erecut'ion t'imes of the two-d,imensional cas transform, and, tlte deri,ued, real Fourier a,nd, Hartley transforms i,mplernented, on a real MATRISC processor. Th,e real Cooley-Tulceg FFT time is incluiled for comparison.

189 It is simple to show that the sine terms and the cosine terms of the real Fourier transform are computed fi.rst in matrices TC and TS, and then added to produce the Hartley transform. The real Fourier transform can be returned in less time than the Hartley transform in this case, as the final addition is unnecessary.

A comparison of the performance of the MATRISC processor when implementing the two-dimensional Hartley and real Fourier transforms is provided in the following figure. The time for the real Cooley-Tukey FFT, and the time to compute the two- dimensional cas transform are also plotted for comparison.

For the three-dimensional case, the following identity is used with the same approach as for the two-dimensional case:

2cas(a+ b + c) :cas(-ø)cas(b)cas(c) * cas(o)cas(-ó)cas(c)* cas(o)cas(b)cas(-c) - cas(-ø)cas(-ó)cat(-") (6.742)

Let T(kr,lrr,kr) be the three-dimensional cas-cüs transform. Then the three- dimensional Hartley transform is given by

2H (kr, kr, ks) :?(l/r - lq,kz,kt) + T(kt, Nz - kz,kr)l T(kr,lrz, Ns - kt) - - lrr, Nz - lcz,¡/s - kt) (6'143) "(lút

3D Cas and derived Hartley Transforms (40x40 real array) 600000 'cbw cashart3d' o 'sim cas3d' 500000 +

(Ò 400000 o Þ 300000 tr (* (t)

CÚ 200000 F¡<

100000

0 0 500000 1e+06 1.5e+06 2e+06 Transform length

Figure 6.15: A comparison between tl¿e erecution tirnes of tlte tl¿ree-d,irnensional cas transform, and, the deriueiJ, Hartley transform irnplemented, on a rea,l MATR'ISC processor. The real Cooley-Tulcey FFT time i,s i,ncludeil for comparison.

190 Code for the computation of the p-dimensional cøs-transform is implemented with the code presented for the p-dimensional Fourier transform. The difference in im- plementation is that the Fourier coefficient matrices are replaced by cos coefficient matrices, and the data is real, and not complex.

For three and higher dimensional factorisations, a tensor-ad,d, fiinction is used to implement the sums as required. This function is presented below with the code for the three-dimensional Hartley transform. void Tensor-Add(int level,int ainit,int xinit,int yinitt nd-MATRIX& A,nd-MATRIX& X,nd-MATRIX& Y) { int i,1,index,P,P -1,P 2,p1,p2,*n,*N,length,AN1,XN1,YN 1,AN2,XN2,YN2; int AN, An-1; P : A.ass_pO; P-1 : P-1; P2:P-2; if (level < P-2) { index - level; for (i:0; i

TensorAdd(levela l,ainit,xinit,yinit,A,X,Y) ; Ì Ì else { length : A.ass-N(O)*A.ass-n(O); p1 :P-7;p2:P-2; A.a0.cols0 : X."0.cols0 : Y.a0.cols0 : A.ass-¡1(p1); A.a0.rows0 : X."0.rows0 : Y.aQ.rowsQ : A.ass-n(p2); if (A.ass-R(pt)) AN1 : -A.assl\(p2); else AN1 : A.ass-N(p2); if (X.ass-R(p1)) XNt : -X.assl\(p2); eise XN1 : X'ass-N(p2); if (Y.ass-R(pt)) YN1 : -Y.assl\(p2); else YN1 : Y.assl{(p2); if (A.ass-R(p2)) AN2 : -A.assJ\(p1); else AN2 : A.ass-|t(p1); if (X.ass-R(pZ)) XNZ : -X.assl\(p1); else XN2 : X.assJ'[(p1); if (Y.ass-R(p2)) YN2 : -Y.assl\(p1); else YN2 : Y.ass-N(p1);

A.a0.map(AIR,ainit%length,AN2, AN1 ) ; X.a0.map(AlR,xinit%1ength,XN2, XN1);

191 Y.a0.map(AlR,yinit%length,YN2, YN1);

madds(4.a0,X.a0,Y. a0) ; Ì i void Hart3d(four-coef & A, int dim, int * sz' MATRIX & a, MATRIX & R) {, int vlen : A.ass-n(0)*A.assJ\(0); DataType ResultType : R.Type0; R.Type0 : Real; pft(A,a,R);

MATRIX t 1(1,vlen,Real); MATRIX t2( l,vlen,Real);

nd-MATRIX X(dim,sz,R) ;

nd-MATRIX Y(dim,sz,R) ; nd-MATRIX T1 (dim,sz,tI); nd-MATRIX T2(dim,sz,t2); X.ass-R(0) : Y.assÌ.(1) : True;

Tensor-Add(0,0,0,0,X,Y,T 1 ) ; X.ass-R(O) : False; X.ass-R(2) : Y.assl.(Q) : Y'ass-R(2) : True; Y.a0.neg0 : 1; X.a0.unmap0; Y.a0.unmap0;

Tensor-Add(0,0,0,0,X,Y,T2) ; X.ass-R(0) : X.assl.(l) : X.ass-R(2) : False; X.a0.unmap0;

Tensor-Add(0,0,0,0,T1,T2,X) ; ¡ : X.aO; R.unmap0; Ì

6.L0 Summary

This section provides a brief development of the application of multi-dimensional number theory mappings to the implementation of the one-dimensional discrete Hart- ley transform on ihe MATRISC. Both prime-factor and common-factor mappings have been used.

792 Chapter 7

Implementation Studies in Si and GaAs

The philosophies and concepts which have been presented in the preceding chap- ters are the subject of continuing studies in implementation. At the University of Adelaide, these studies have been devoted primarily to the development and use of gallium arsenide. As a consequence there has been considerable empha,sis on the design and construction of low transistor count arithmetic units in effi.cient enhance- ment/depletion (E/D) processes. A more directed implementation programme was undertaken in the Australian Department of Defence which let an industry contract to implement a systolic co-processor in a commercial silicon CMOS process. This contract has achieved two working concept demonstrators, both referred to by the generic tiile SCalable Array Processor (SCAP). This chapter presents an overview of both the CMOS SCAP impiementations and the current state of the galiium arsenide processor development at the University of Adelaide.

7.1 Silicon: The SCalable Array Processor (SCAP)

SCAP is the frrst systolic processor subsystem known to the author which im- plements the set of matrix operations {multiplication, ad,diti,on, element-wise (or Hadamard,) rnultiplication, transposition, permutation,). These matrix operators, to- gether with appropriate addressing techniques and the host processor, a,re used to efficiently realise all of the operations of linear algebra. SCAP is not designed as an o,ccelerator, but as a co-processor which provides hardware support for the matrix data type. The consequence of supporting matrix types is a performance gain for the host processor for a large class of algorithms'

193 Generic SCAP features include:

1. IEEE single precision floating point data format;

2. Scalable floating point matrix multiplication performance up to several Gflops;

3. a library of matrix operations currently written in C.

In a loosely coupled network of SCAP enhanced , computation rates of Gflops are readily achievable with existing compilers.

7.L.1 System Architecture

A prototype SCAP system has been implemented in a SUN SPARCstation 1 as a concept demonstrator. Matrix computations are performed in a two-dimensional array constructed from four hundred 1.0 Mflops floating point processors. A number of limitations are present in this system which prevent full utiiisation of the systolic array. However, a performance improvement factor of about 150 has been demon- strated for the prototype system when executing matrix products [MaCla93]. Data is held in the SPARCstationts virtual address space and is accessed via the SUN SBus. The function of the matrix unit is analogous to the function of the SPARC floating point unit, and can be transparent to the programmer. A 128 Kbyte matrix cache is an integrai part of the co-processor subsystem.

7.t.2 Ilardware

Two custom VLSI CN{OS chips have been developed to implement the SCAP con- cept. One chip handles address generation for matrix operations. The other is a processing element chip consisting of an array of 5x4 single precisionfloating point processors. It contains 265,000 transistors. Both chips are implemented in 1.2 mi- cron technology. The processing element chip has been used in discrete packages in the construction of the fi.rst prototype SCAP system and it has also been used to construct a 400 Mflops multi-chip module for use in current SCAP systems.

794 7.L.3 The Data Formatter ChiP

Anay Control and Dûtâ

Address Generator Unit Shift Register Unit

Anev control I btre a Anav control 2 ln¡t sregu dt sresl d2 nl n1, q mode sreg sres I 9

Intemal Bus

Instruction Fetch Unit Bus Control Unit

PU a¡lrir nrfnrntinn Snanc rnstr data

Address/Data Bus Interface and cont¡ol

Figure 7.Lz A scl¿ematic d'iagram of tlr'e d'ata f ormo,tter

The data formatter operates as part of a software/hardware system which can process operands or data structures which can be broadly categorised as scalars, vectors, matrices and tensors. All elements of the data structures are in IEEE 754 floating point format. Details of the design and function of the formatter chip are contained

in [CtClCu92] and [MaClCl92]. The following is a summary of the contents of these documents.

195 Figure 7.22 A micrograph' of a d,ata formatter ch,ip

The data formatter performs two primary functions. These are: 1. the word-sequential extraction of data structures from a memory space in an ordered manner, appending instructions to these data structures, and writing in a bit-serial format a parallei set of {instruction, d,ata} 2-tuples to an interface designed to accept parallel data structures; 2. the input of a parallel data structure whose elements are in bit-serial form from an interface, and the output of the data structure in word sequential form to a memory space in an ordered manner.

The formatter is constructed from four functionally distinct units identified as the Address Generator Unit (AGU), the Shift Register Unit (SRU), the Instruction Fetch Unii (IFU) and the Bus Control Unit (BCU).

196 A schematic diagram of these units and their interconnection is shown in figure 7.1 and a micrograph of the fabricated chip is shown in figure 7.2.

Input and output of the parallel data structure is bit-skewed between adjacent ele- ments of the structure. The parallel data structure can be considered as 'wavefronts' which are either entered into the parallel interface or read from the interface.

Bus Control Unit

The bus control unit provides the control for the internal bus by which functional units communicate between themselves or the external world. Requests for bus access are ordered in priority and serviced by the bus control unit interface. External communications are also controiled by the bus controller.

The external address and data bus and the associated protocols are interfaced to the internal bus in the bus control unit. External bus request and bus grant are part of the interface, as is the muitiplexing of address and data. The internal registers within the various units are made available to the external bus by the control unit so that they may be addressed as memory mapped registers. A number of memory spaces are supported by the bus control unit. This allows the use of partitioned memory to enhance system speed. An example is a partitioned cache in which different matrix operands are stored in different partitions to improve the efficiency of the cache.

Address Generation Unit

The (AGU) consists of. a 32 bit , a microprogram ROM (Read Only Memory) and a microprogrammed sequencer. The addresses for either source or destination data structures are computed by the AGU and passed to the bus control unit to be used in data structure reads and writes. A number of microprograms are held in the microprogram ROM which enabie the AGU to perform a range of different addressing modes.

A general approach to matrix addressing is to use a second order difference engine, impiemented with a modulo arithmetic capabiiity. The following expression is imple- mented in the AGU:

-f nzdz) (1) n : base-address (init * udt I c

where 1 .)c represents the evaluation of the expression rnoilulo q

797 This maps an element of an arbitrary matrix [X], stored at address n in a linear address space starting at base-address, onto the (nrrrr) element of a two-dimensional address space.

To address sequentialty all elements of the matrix, r?1 and rL2 ã,te indexed through their respective ranges (the dimension of the matrix). This is carried out using difference engine principles. For the first row of the matrix, addresses are formed by nt - ! accumulations of the first difference value d1, where each operation is carried out modulo q. The address of the first element of the second row is computed by accumulating the second difference d2 modulo q, and the remaining addresses of the matrix elements are computed by repeating this procedure. Prime factor mappings are implemented directly with this technique'

To enable the addressing of non-rectangular data structures, the dimensions {n;} are variable. By linearly decreasing one of the two dimensions in a matrix it is possible to generate addresses for a triangular region of the matrix.

Instruction Fetch Unit

The operation of the data formatter is determined by an instruction sequence stored in memory. Instructions are fetched by ihe Instruction Fetch Unit (IFU). Execution is started by the host writing a start address into the program start address register, and continues until a HALT instruction is executed.

The ability to write into the internal registers as memory mapped registers allows two modes of operation. The first is as described above where a program in mem- ory determines the function of the formatter, and the second is when the address generator registers are loaded explicitly by a host before the formatter is started.

Shift Register Unit

The Shift Register Unit (SRU) contains 20 serial-to-paraliei/parallel-to-serial shift registers. These shift registers constitute the local storage for structured data which is input either from the sequential memory accesses of the address generator unit when reading structured operands from memory, or from the parallel bit-serial inputs prior to writing an operarl.d back to memory.

198 7.L.4 The Processing Element ChiP

Details of this chip are contained in [CtCuCl92] and [MaClCl92]. The following is a summary of the content of these documents'

U) (A IÈ ú X bus (operands)

Xin Xout

XWSin lnput Register Output XRSin registers le registers XRSout

UIP

Y bus (operands)

R bus (results)

Clkin

Resetinl Control and sequencing

a o o o U) úU)

Figure 7.32 A schemati,c diagram of a processing element.

The primary design goal for the chip is the implementation of a set of primitive floating point matrix operations for conformable matrices of arbitrary orcler on an array of fixed size. A VLSI chip implementation of this design is discussed which contains a rectangular array of 5 x 4 giobally clocked multiply/accumulate floating point elements.

A feature of the chip is its ability, under the control of an instruction, to: 1. compute the product of two matrices; 2. compute the element-wise (Hadamard) product of two matrices; 3. compute the sum of two matrices; 4. permute the rows and columns of a matrix;

199 5. transpose a matrix

Chip Architecture:

Figure 7.42 A micrograph' of a processing element clt'ip.

The systolic chip is composed of a 5 x 4 rectangular array of single precision floating point processing elements (PEs) each of which accept serial ilata-fl,ow operands, and which perform a limited set of operations. Each operand consists of a 5-bit instruc- tion followed by an IEEE standard single precision number. Each processing element is a microcoded ALU with a 32-bit parailel data-path with some dedicated hardware support for floating point multiplication and addition algorithms. The organisation of the chip is shown in the schematic diagram of figure 7.3. Chips can be cascaded arbitrarily in both X and Y directions. The array of processing elements is clocked synchronously. Three bit-serial links provide communication between processing ele- ments. One link is provided for each of the two input operands and one for the output, or result operand. X input data is transferred from left to right across the array and Y input d.ata is transferred from top to bottom. Output results are transmitted from

200 right to left, and are read from the left boundary in bit-serial form

Clocking of the systolic chips is performed by a single phase 50% duty cycle clock from which all internal timing signals are generated. The maximum frequency of this clock is 20MHz. The clock is buffered on entry to the chip and is distributed to each processing element. It is re-buffered within the PE where it is used as a locally synchronous clock. In addition, each PE generates a second, synchronous clock of the same frequency with a duty cycie determined by the precharge requirements of the internal data buses. A micrograph of a chip is shown in figure 7.4.

PE Architecture:

The internal operations within the processing element are executed with a simple instruction set operating on 32-bit data samples. The choice of a simple microcoded ALU with serial floating point algorithms minimises the silicon area for integer mul- tiplication, d.enormalisation and renormalisation operations. These algorithm com- ponents are the most area inefficient in fast parallel floating point processors. In addition, such array processors are almost invariably bandwidth limited, and there is no advantage in implementing the arithmetic operations any faster than operands can be fetched from memorY.

The I/O architecture of each processor element consists of two orthogonal rlata trans- mission paths for X and Y operands, each consisting of a single one-bit delay/storage cell, and a32-l:rit output register. Data is input to the array as a sequence of 2-tuples {instruction,ilata}. At the completion of a computation, an instruction is entered which causes the unloading of the results into the output registers.

Bit-serial data is bit-skewed on entry to adjacent processing elements on the array boundary. This skew is preserved between adjacent elements within the array by passing the data through the single-bit delay stage in each processing element before re-transmitting it to the next PE. The use of serial data both minimises the I/O bandwidth both at the array boundary and between PEs within a chip as well as allowing adjacent processing elements to both commence and conclude their com- putations with a time differential of only one bit period. As each PE only requires a single cell for data transfer, the data distribution paths are both simple and reli- able. The major advantage the skewing approach has over a broadcast architecture is the capability for arbitrary expansion, as no long buses need to be driven, and the associated higher clock rates which are attainable.

207 Each X data operand consists of a 5-bit instruction followed by a single 32-bit IEEE 754 standard floating point number. A variable length gap of several clock periods may be present between operands for I/O synchronisation. The serial instruction is loaded and decoded, foilowed by the serial loading and conversion to parallel format of the IEEE data. The internal format has both extended precision and extended dynamic range when compared with the IEEE standard.

PE Instruction Set:

The instruction set is summarised in the following table

Inst Bit Function ADD 4 Add LDR 3 Load O/P register HAD 2 Enable Load if. d,iagonal fl'ag set SDE 1 Conditionally set diagonal fl'ag CLR 0 Clear accumulator

Matrix Multiplication OPeration

The default operation performed by the PE is an inner-product operation imple- mented as a floating point multiply-accumulate, the input X and Y operands being multiplied. and accumulated with the contents of the accumulator. Multiplication is faciiitated by the inclusion of a modified Booth encoder for the Y operand. The denormalisation and normalisation operations required by the floating point accumu- lation or addition algorithms are facilitated by the repeated application of the shifter circuit which can shift up to 15 bits in a single cycle.

Element-wise Op erations

Element-wise operations are performed by first setting the set-diagonal-element (SDE) flag in a desired processing element. The flag is set by the SDE instruc- tion, which sets the flag in any processing eiement containing a non-zero value in the accumulator. When this flag is set, and a HAD or ADD instruction is received, the result of the computation is unloaded after each data sample. In the case of a HAD instruction, in an array where the diagonal element flags have been set, this results in the element-wise multiplication of the two input matrices. In the case of an ADD instruction executed in an array where the diagonal element flags have been set, the result is the element-wise addition of the two input matrices.

202 Exception Processing

IEEE encoded data is processed according to IEEE standards with the exception of some rounding modes, and the handling of NaN (not a number) and infinity. On completion of the computation, numbers are output in IEEE format with the following exception: internal de-normalised numbers are converted on output t'o zero.

Matrix Permutation

If the fl,ags in the anti-diagonal processing elements have been set by a prior SDE instruction, and an element-wise multiplication of a matrix A with the identity matrix is executed, the result of the operation is the transpose of the matrix A. If an arbitrary orthogonal set of elements have their flags set, a permutation of the input matrix will be performed by this element-wise product'

Data Output

When an unload (tDR) instruction is received, the accumulator contents are con- verted from the internal format to an IEEE standard form, setting the exception conditions where necessary (NaN, and infinity). De-normaiised numbers are trun- cated Lo zero. The IEEE representation of the result is loaded into a separate output register which is concatenated with other output registers in adjacent processing ele- ments to form an output register chain. The result is output in a serial form through this register chain.

7.1.5 Software

Matrix algorithms which are elements of the set of primitive operat ors {multipli,cati,on, aild,ition, element-wise (or Had,amard,) multiplicati,on, transposi,tion, perrnutationj ate performed directly by the processor. For the permutation primitive operator the operand dimensions are bounded by the dimensions of the array. For the element- wise operators the operand and result matrices must be of the same dimensions, with at least one dimension being bounded by the array size. In the case of the matrix multiplication operator, the operands must be conformai, and have a product whose dimension is bounded by the array size. The implementation of matrix operations on operands whose dimensions exceed the size of the array is possible by mathematically partitioning the operation into a set of operations, each element of which can be computed independently on the available array size.

203 Matrix multiplication

r20

(t) 100 oÊi ¡ g< 80 à o o) 60 v)È 40 Ë o) C) X 20 rÐ

0 0 50 100 150 200 300 350 400 450 500 Order of matrix product

Figure 7.52 Measured, erecution speed (Mfl,ops) for a scAP enhanced suN sPARCStation L as a function of matrir ord,er for real matrices.

Measurement of the performance of the prototype is shown in figure 7.5 for the optimal case of circulant matrices where the cache size does not limit performance'

Matrix Multiplication and Accumulation

If a matrix multiplication is commenced with an instruction which does not clear the accumulator, the result of the multiplication will be summed with the prior result. This gives the matrix multiplication/accumulation capability.

Software Examples

The primitive matrix operators of the PEs together with the addressing techniques of the data formatters and the scalar capabilities of the host processor efficientiy implement all of the operations of linear algebra. As the SCAP operators can be implemented for arbitrary matrix dimensions on a SCAP of fixed size, all of the algorithms of linear algebra are independent of the physical dimensions of the SCAP array.

Example 1:

The Fourier transform can be written compactly as a matrix expression if number theory techniques are used. The prime factor algorithm is a well known example. The following C code fragment implements the Fourier transform on the SCAP.

204 X = map(X, O, n[0], n[1]); Y = nap(Y, o, n[o], n[1]); mmult3(F1, X, f); mrnult3(T, F2, Y);

where X is the complex input data, f.1 and F2 arcrotated complex Fourier coefficient matrices, Y is the transform and n[0] and n[1] are the data matrix dimensions. mmutt\0 is a complex matrix multiplication operation, and m'ap0 causes the SCAP address generation hardware to implement on-the.fly mappings during the matrix operations.

The code presented above is executed interpretively by the SCAP subsystem. As Unix is not a real-time operating system, the interpretive code which uses system calls are inefficient and causes significant performance degradation. To avoid this degradation, it is possible to pre-compile data controller instruction sequences which are executed. when required. An example of such pre-compilation is given below for the same Fourier transform.

/x A pre-compiled two-dimensional DFT */ x = maP(x, o, n[o], n[1]);

generate-dc-prog (MATOP-CMMULT, F1 , X, T,0) ; y = map(y, o, n[O] , n[1] );

generate-dc-prog (MATOP-CMMULT, T, F2 , Y,0) ; /* Store progran for subsequent execution */ dft-oP = comPile-oP(2, X, V); /* Execute pre-compiled prograrn */ run-op(dft-op, 2, &X, &Y);

The generate-d,c-progp function generates instruction streams for the data format- ters. These instruction streams are stored for subsequent execution by the function compi,le-op0, arrd are run by the function run-op0.

Example 2:

Finite Impulse Response (FIR) filtering is a computationally intensive task.

The following C code fragment implements FIR filtering on an arbitrarily long data vector as a series of small matrix products. The code presented is interpretive.

205 for (i=0 ;

i+ (H->n2- 1 ) *H->n2+H-)n1 - 1 n2) { V-)r = v+i; V-)kr = lçv+i; rnmult3(H,V,R);

R-)r += (lt->n2xH->n2) ;

R-)kr += (H->n2*H->n2) t )

where H is a weight matrix, V is a matrix associated with the input data and R is a matrix associated with the output data. R-)r is a pointer to the reai data vector in user space, and R-)kr is a pointer to the real data vector in kernel space. Each matrix multiplication implemented with the mmultS0 call computes 400 output points.

Experiments conducted on a SPARCstation 1 enhanced with a SCAP co-processor show that a 32 tap FIR filter is implemented at a rate which approaches 400 kilo- samples per second.

7.2 Gallium Arsenide

The implementation studies described in this section been performed at the Uni- versity of Adelaide during the development of a GaAs MATRISC processor. This processor is intended to achieve computation rates of the order of Gflops. A GaAs processing element has just been fabricated and is in the process of being tested. It is expected that a combined GaAs/Si processor will provide the most algorithmic flexibility in a constant bandwidth array where the corner elements are much faster than the others.

As part of the development, studies are underway to define an extended address generator which:

1. supports directly the multi-dimensional mappings necessary for fast transforms;

2. allows the processing of sparse matrices which are common both in circuit simulation programs such as SPICE and frnite-element analysis programs.

206 Controller Ouþut: x,Y,PP,INSTR[]

Flag generator

VO Mux

Systolic cell

Delay cell

fnmmfnfn e e ie

Figure 7.6-. A sch,ernatic of the GaAs systolic ring processing element, a,nd the as- sociated d,ata format.

The architecture and application of the processing elements have been discussed in The [Ma90], lMalig0], [MaBe92], [MaBeSm92] and earlier chapters of this thesis. processors accept as input two digit-serial numbers and an associated instruction and perform fl.oating point multiplication and/or addition on the input operands. In a processing element optimised in an area-time sense, each digit contains 4 bits. For the number representation chosen the processor is constructed from 4 systolic cells and 5 delay cells in a ring with a fast state machine to control both the ring function and the IIO of data. Modelling studies with the SPICE circuit simulation program indicate that it will operate with a ciock rate of about 300 MHz and per- form approximately 10 million floating point operations per second. A schematic diagram of the systolic ring GaAs floating point processing element chip is shown

207 in figure 7.6. The chip size, including I/O and test structures, is 5.8mm by 3.1nt'm and contains 16,000 devices giving an overall density of 900 devices/rnrn2. Vfithout I/O and test structures, the processing element occupies 7.7mm by 4.8mm, and is implemented in 12000 transistors, giving a device density of 1500 transistors l**'. The remainder of this section provides a discussion of the technology deveioped for the processor, the design methodoiogy, the prototype processing element architecture and implementation, clocking strategy, testing and performance.

7.2.L Gallium Arsenide Technology

The GaAs processing element chip was designed and fabricated in the Vitesse 0.8p'rn E/D GaAs process (H-GAAS II) supplied by Thomson-CSF, France, on their first fabrication run using this process.

As the full design manual was not available during the design cycle, logic circuits were d.eveloped using a technology file and models supplied by MOSIS. The tech- nology employs enhancement and depletion mode Metal Semiconductor Field Effect Transistor (MESFETs) as well as Schottky barrier diodes'

Test structures were also implemented on ihe chip to characterise the process and validate the models used.

GaAs MESFET Logic Classes

Complementary Gallium Arsenide logic suffers in performance from a poor P-type transistor due to its low mobility and so high performance logic has been restricted to the normally-off and normally-on classes of logic.

Normally-on logic classes include: Buffered Fet Logic (BFL) Capacitively Coupled Domino Logic (CCDL) Capacitor Coupled FET Logic (CCFL) Capacitor Diode FET Logic (CDFL) Feed-Forward Static Logic (FFSL) Inverted Common Drain Logic (ICDL)

208 Schottky Diode FET Logic (SDFL) Source Coupled FET Logic (SCFL) Two-Phase dynamic FET Logic (TDFL) Unbuffered FET Logic (UFL)

Normally-offlogic uses enhancement and depletion type MESFETs with the enhance- ment MESFET used as a switch and the depletion MESFET or a resistor as a load. Direct Coupled FET Logic (DCFL) Feedback FET Logic (FBFL) FET FET Logic (FFL) Junction FET Logic (JFL) Pseudo Current Mode Logic (PCML)

Quasi FET Losic (QFL) Super Buffer FET Logic (SBFL) Source Follower Direct Coupled FET Logic (SDCFL) Source Follower FET Logic (SFFL)

Normally-on logic uses only depletion mode MESFETs and typically requires some voltage level shifting of the gate output to be compatible with the next stage. Larger supply voltages are needed for normally-on than for normally-off logic classes with a consequent higher porver dissipation. Gate complexity is also generally higher but the speed may be greater than normaily-off logic. Level shifting may be done by using Schottky diodes.

The H-GAAS II process has been tuned for using Normally-off DCFL derived logic families. A buffered logic family, Source follower Direct Coupled FET Logic (SDCFL) was mixed with an unbuffered logic family, DCFL to optimise the speed and layout density on most of the chip. Super buffered DCFL (SBDCFL) was also used where high drive capability is required such as clock lines. Studies of this mired Iogic approach have shown that it achieves good VLSI density, noise immunity and speed

[BeMa91].

The limits of operation of the logic classes are Power Supply : 1.2 - 2.5 volts Temperature : 0 - 725 degO 0.5ø fast-fast and 0.5ø slow-slow models

Models were available for the following transistor types and sizes:

209 EFET L :7.2pm,7.5p,m DFET L : 7.2p,m,2.4¡lm,,3.2p'm

Gate lengths are specified "as drawn" and are shrunk by 0.4¡tm when processed. Ail device modelling was done using HSPICE with models supplied by MOSIS for the ttedgaas" process.

Direct Coupled FET Logic

vdd vdd vdd

DFET (Wd,Ld) DFET (\Yd,Ld) t Id

output output output --->Ii inPut input EFET (V/e,Le) inPutl input2 Ie EFET (We,Le) GND GND GND

(a) (b) (c)

Figure T.7z DCFL Inuerter (o), 2 input NOR gate (b) ønd' equi,ualent circuit (c)

Direct Coupled FET Logic (DCFL) is the simplest logic class for digitai GaAs design and has the smallest power-delay product of the current GaAs normally-off logic classes. It is comparable to nMOS in Silicon VLSI design.

An enhancement MESFET (EFET) operates as a voltage controlled resistor which pulls the output down as a function of the applied gate voltage while a depletion mode MESFET (DFET) operating in the saturation region provides the active pull up as shown in figure 7.7. A resistor may replace the depletion mode MESFET in some cases.

When a DCFL output drives a DCFL input, the output high is clamped to about 0.7V by the Schottky diode at the input of the next gate. This limits both the voltage swing of the gate and the noise margin.

By varying the pull up and pull down MESFET widths, the gate can be optimised for the following set of characteristics {Speed, Noise Margin, Power, Drive}. The pull d.own to pull up MESFET ratio determines the noise margin, propagation delay and transition times.

210 As there is only a small percentage change in supply current between logic states DCFL produces a quiet power bus.

To achieve low power gates the current in the DFET has been minimised by using gate lengths of 3.2p"m or 2¡1,m. with L¿ : 3.2p,m, a simple DCFL inverter driving another inverter can only drive one fan-out, for larger fan-outs or for driving longer wires, L¿ :2p,m must be used.

450 dcfl: oad.dat' -a--

400

350 a 3 300 d o d É 250 .do ! rd bl rú 200 oP¡ l] 150

L00

50 1 2 3 Fan-out (number of similar gates)

Figure 7.Bz Propagation delay of a DCFL Inuerter as a function of fan-out (capac- itiue load).

Figure 7.8 shows the delay of a DCFL inverter as a function of load. The delay is approximately linear with increasing load (fan-out), with fan-outs above three or four producing long gate delays. As a consequence, a buffered logic class is used for high fan-out circuit requirements.

277 180 'dcflnois e . dat' -+-

160

t40 ; ùr k d L20 E 0 0 o 100 0 bl d H 0 80 å

40 6 10 t2 !4 th 18 EFET Width (um)

Figure 7.92 Aaert,ge Noise Margin of u 3 input DCFL NOB gúte as a functi,on of W.

NM NM Figure 7.9 shows the average noise margin ( ) as a function of EFET width (W") for a three input NOR gate with a fan-out of three. Only one input signal is driven and the other two are tied to ground. This represents the worst case input configuration. The high noise margin NMn approaches zero at around W": t}¡trn whereas the low noise margin is consistently above 150mV. From these results, W" :\pm was chosen for the EFET width'

Gate Deiay(ps) Noise Voltage Power (-W) fan-in fan-out Margin (-V) swing (-V) DCFL Inverter 70 70 550 0.17 3 3 SDFL NOR 135 270 700 0.55 3 3

Table 7.L; GaAs loqic fami,ly ch'aracteristics

Table 7.1 shows the characteristics of a DCFL inverter and an SDCFL NOR gate

272 Source Follower Direct Coupled FET Logic

vdd vdd

Ìi¡d,Ld

Wsd,Lsd We,Le Wsd,tad input

GND GND Vss

(a) (b)

,l

input

GND

DCFL stage Buffer Stage Load

(c)

Figure 7.1O: SDCFL Inuerter (a), SDCFL Inaerter with ertra supply (b) and' equia- alent circui't (c).

vdd

wd,kl Wse,Lse rùVse,Lse v/d,Ld

input3 rJr'e,Le Vy'sd,Lsd input2

rùr'e,Le GND

output

Figure 7.Ll: OR-AND-INVERT (OAI) logi,c structure

Source Follower Direct Coupled FET Logic (SDCFL) is a buffered version of DCFL which is used to improve the load drive capability, voltage swing and noise margin of DCFL. The buffer is a source follower using an EFET as a pull up and a DFET

273 as a pull down ioad as shown in figure 7.10.

As the output of the DCFL stage is clamped by two diode drops, one across the EFET in the source follower and one across the input diode of the DCFL load, the voltage swing is improved over DCFL. Since there is a Vs" voltage drop across the EFET in the source follower, the logic low level is improved. A negative suppiy for the source follower may be used to further improve the voltage swing as shown in frgure 7.10.

As the amount of current through the source follower depends on its logic state, switching transients are induced on the po'vl/er rails which can lead to ground bounce and. noise injection into other circuits. This is due to the DFET coming in and out of saturation when the circuit switches'

SDCFL has a higher noise margin than DCFL and can have a fan-in of up to five.

The OR-AND-INVERT (OAI) structures which can be made using SDCFL are shown in figure 7.11. They are both compact and fast, implementing the following logic function (with a fan-in of three per gate):

z:(A+B+C)+@+E+ F')

: (Ã.8.e) + @.8.F) (A+B+c).(D+E+F) (7.1) where A, B, C, D,,Ð and F are inputs arrd Z is the output

Super Buffer FET Logic

vdd

v/d,Ld Wse,Lse

output

Wsel,Lsel input

GND

Figure 7.t22 SBFL Inuerter.

274 Super Buffer FET Logic (SBFL) improves the capacitive-load drive capability of DCFL and SDCFL utilising a push-pull super buffer as shown in figure 7.12. This gate injects noise into the power rails due to a conduction path from Vdd to ground when the gate changes state. However with adequate design it is well suited to the d.riving of high fan-out loads such as clock lines and buses. It has a higher noise margin than DCFL and SDCFL but a higher poÌtver dissipation which limits its use. The use of ihis gate has been restricted to an inverter driver since the implementation of logic is boih more complex and consumes more area than either DCFL or SDCFL.

The design guidelines which have been used for SBFL transistors with fan-outs of up to seven are as follows:

Wa :2p*, La :2.4U,m

W" : 8p*, L. : 7'2U'm W"" - 72p,mrL".:7.2p'm

Wsel :72¡-tm, Lset : l'2P'm

7.2.2 Detailed Circuit Design and Simulation

This section deals with the design of the components of the chip that form the building biocks for the processing element.

Data Flip-flop Design

The data latches v¡ere required to be area-time efficient, to operate from DC to 1 GHz, and to have a preset or clear facility.

275 Figure 7.L32 schemo,tic of a si,r nor-gate data fl,ip-fl,oqt witlt clear

The classic six gate data flip-flop of frgure 7.13 which has been used successfully by foundries such as Vitesse and Triquint was chosen for this design'

Figure 7.].42 Ring noto,tion of a GaAs d,ata fl,ip-fl,op with clear or preset.

Figure 7.L52 Layout of a GaAs data fl,i,p-flop using ring notation.

276 700 0M DFF.TR() CLK 650 0M - 600 0M = 550 0M = 500-0f4 - ¡150_0H =

q00 O|4

35 0 OM

300 OM

25 0 0t4

E DFF.f RO OBAR 700 0M l b- --/'- \ : OBAR I 600 0M = - CLR .'\ :o+--.- 0t1 _ 500 I 400 0M - .\ I 300 0M l' ,\

200 OM l_

- 10 0 - 0M =- I rr r9 2 271{M 2 5 f)N 8-50N L - 5 0N

Figure 7.L62 SPICE simulation of a GaAs d'ata fip-fl'op'

Ring notation of a iatch with clear or preset is shown in fi.gure 7.74 and figure 7.15 shows the resulting layout. Figure 7.16 is a SPICE simulation of the flip-flop'

Clock Generation

A two speed single-phase clock is generated on-chip using a DCFL ring oscillator which is designed to run at either 7 GHz or 600 }l/.Hz. To provide a range of possible clock speeds for both low speed functional testing, high speed performance verification and process characterisation, the output from the two speed oscillator is divided by two additional modulo 4 counters. External signals are used to multiplex between the direct and derived clocks. In addition, an external clock source can be used to drive the chip. The internaliy generated clock frequencies are 37.5, 62.5, 150, 250, 600 and 1000 MHz. There is no constraint on the external clock frequency which can therefore be used for DC testing.

277 RST

13/7 stage DCFL Ring Oscillator râte

oscout

T TF'X' o 2:L s1 lttx T TFF a CL_output 2tl mx T TT'F o 2zl lllx T TFF a

CLKpad

Figure 7.L72 The cloclc generator architecture.

750 OM 700 0M - l: 650 0l : lr' liì 600 0M - li 1', 550 oil l ti' lil l: 500 0ltl - |: ii l- 0 lil L 450 011 l' l: T ft ri L li, I {00 0ì4 - |: N ll ti: l: 350 rl: 300 0ü l li li 'j 250 OH : li tì ¿00 OH : t50 0H l l- 100 014 :- t: - , l= 50 0H- ,\ ON {,,,. 8.0N

Figure 7.18: A SPICE simulation of the cloclc generator output.

278 CL-output s1 s2 rate (MHz)

1000 0 0 1 250 0 1 1 clkpad 1 0 1 62.5 1 1 1 600 0 0 0 150 0 1 0 clkpad 1 0 0 37.5 1 1 0

Table 7.22 Clock control si,gnals.

Figure 7.17 shows the clock architecture, and table 7.2 shows the clock control signals. The rst signal is an active high oscillator reset signal and the rate signal determines the path through the ring, rate:O,length : 11 DCFL inverters + 1 DCFL NOR + 1 SDCFL NOR rate:7, iength : 5 DCFL inverters + 1 DCFL NOR + 1 SDCFL NOR

clkpad, is the external clock input. A Spice simulation of the osciilator is shown in Figure 7.18.

Clock Distribution

In a process with rise times of 200ps, and gate delays which can be less than 100ps, the problem of clock distribution is signifrcant. An H-tree approach was used to distribute the synchronously to all 500 Data flip-. The distribution tree was carefully balanced and modeled using SPICE simulations trying various arrangements of buffers. The resulting tree had a maximum of 50ps skew over all the leaves in the distribution tree, which is approximately equal to one gate delay.

Full Adder Design

Full and half ad.ders with equal sum and caïïy times were required for the digit-seriai multiplier accumulator The equations for the sum and carry terms generated from the a,b and c inputs were:

H*: o.Øb S: Hn@c C:a.blHn.c (7.2)

279 Available input signals a,re ctr) abar,,brbbar, c and cbar and outputs S, Sbar, C and Cbar are produced.

X A

7 Car-r-g

Cin

Sun

Figure 7.192 A schemati,c iliagram of th'e full adder

l^I z ô C n

¿ X z

Figure 7.2O2 A schematic diagram of tlte fuII ad,d,er sum generation.

Systolic Cell Design

The systolic cell implements digit-seriai arithmetic on the three operands X, Y and PP where X, Y and. PP are single precision floating point numbers which have mantissa, exponent, flag and guard fi,elds. Each digit is a four-bit quantity. There are three mod.es of operation, multiply, add and de-normalisation which are determined by two bits of the instruction digit.

220 Multiplication mode

Figure 7.21; A schemati,c di,agram of th,e d,igit-seri,al multipli,er.

In multiplication mod.e, the Y operand is the multiplicand and the X operand is the multiplier. To form partiai product terms, the fi.rst digit of the Y mantissa is stored and multiplied with each digit of the X operand and the resuit is accumulated with the input partial product and output to PPor¡. The Y mantissa is rotated by one digit as it passes through each systoiic celi. To complete a mantissa multiplication, the mantissa must pass through the same number of systolic cells as there are digits in the mantissa

In a processing element with four systoiic cells in the ring, two recirculations of the operands are required to complete a multiplication for a 32-bit mantissa.

Digit-Serial MultiPlier

A SPICE simulation of the critical path through the digit-serial muitiplier was used to predict circuit speed.

227 DIGI'1UL,TRO CLK 700 OM t-

600 0t4

500 0tt

¡100 0t4

300 OM

200 OM

100 OM -;, i ' i I .'': DIGMUL TRO -- 445 700-0M I - -::- e- -'._-\:-::-- 'I 579 600_0Ma' t, 566 ì 500.0HÉ ¡: +------l: li \ c0 ¡+00 OM v------I, CP_END t,. ti I Ã---_ 300 OM I t7 t: I 0------t'' 207 200 OH lt l- I -::--'{::- 100 OM ,l 0 - 0N 3 ON ( ¡tl 0N 33 - 0N T I ME L I N )

Figure 7.222 SPICE simulation of the critical path through the digit-serial multiplier

Figure 7.22 shows the results of the simulation. It shows a delay of. 3.2ns which indicates the maximum clock speed at which the processor \4/iil function correctly is greater than 300 MHz. The floating point performance which can be expected from the device is approximately 10 Mflops. Use of this processing element in a 16x16 systolic array processor will provide a peak computation rate in excess of 2'5 Gflops.

These results are compared. with the model results used to optimise the design of the processing element. The model assumed a gate delay of 150ps, which incorporates at IITo overhead for wiring deiays, and predicted a clock rate of 3.0ns for the circuit, and a multiplication/accumulation time of 195ns, or 10 Mflops.

Systolic Ring Architecture

rNSrR[o] rNSrR[1] Cell Operation

0 0 floating point addition 0 1 floating point multiplication 1 0 denormalisation

Table 7.32 Systolic cell instruction cod,es

222 Instructions which circulate with the operands and control the operation of the ceils are encoded according to the speciflcation given in table 7.3.

(3:0)

'"TlEåiiî[lifl

XouL (3: You[ ( 3 dcbe ou[ (3

Figure 7.232 A schematic of the systolic ring layout.

OÆMux-Buffers

cell,2

Clock

Delay cell4 cell 3 (4) FET Delay test Clock generation DFF test test

Figure 7.242 A fl,oor plan of the GaAs systoli,c ring processing elenr'ent ch'ip

Power design

The clock distribution and the power rails for the logic circuits were separated to avoid. the introduction of noise. The design took into account current density, self

223 inductance and maximised the supply to ground capacitance. The total power dissi- pation is 2.2 Watts, which is readily dissipated with conventional packaging.

I/O and Packaging

The ECL compatible chip has ESD protected input pads with a bandwidth of 700 MHz. The output pads have a bandwidth of approximately 600 MHz.

Figure 7.262 A micrograph of the GaAs systolic processi,ng element

A micrograph of the fabricated chip is shown in figure 7'25

Testability

The ring architecture consists of a number of parallel register chains. In the exponent mode, the input operands move through the registers unchanged and in synchronism, which provides natural scan paths through the processor'

To provide a mechanism to carry out performance testing for both cell and ring, the output pins are multiplexed between the systolic ring output and the output of the first systolic ceil. This allows the independent testing of both a single cell and the complete array. The cell can be tested both functionally and parametrically.

224 7.2.3 Test Equipment and Procedures

Low speed. functionai and high speed testing was carried out using a Tektronix Digital Anaiysis System (DAS) 9200 with a 92516 pattern generation card and a 92496 data acquisition card. The 92496 has a,24 channel high speed acquisition mode with a timing resolution of.2.5ns. The software is confrgured to run via a host system using an X-windows interface.

The pattern generation modules are programmed with test vectors which are sent to the chip under test. The resulting chip outputs are read by the data acquisition probes and can be post-processed by the host computer'

Analogue waveforms rr¡/ere measured with a LeCroy digital oscilloscope with a 600MHz bandwidth and a sampling rate of 5 gigasamples per second'

Test Procedures

Testing of the following structures has been done both functionally and at speed to determine critical paths: 1. Clock generation circuit; 2. Systolic Cell; 3. Systolic Ring.

The input and output of data to and from the ring is controlled by the I/O state machine which is clocked synchronously with the systolic cells. For testability pur- poses, two outputs are provided. from the chip, one from the 16 bit output of the fi.rst systolic cell and the other from the 16 bit output of the ring. This allows inde- pendent functional testing of both a systolic cell and the systolic ring. Both can be done under DC conditions with external clock control. High speed testing is done by loading the chip at low speed and by recirculating data internally at high speed and monitoring the outPut.

225 Clock Generation Circuit

Signal CKout CKout CKout CKsl CKs2 CKrate Nominal 0.5 o slow Simulated Simulated Measured Chip f 2 Vddc 2.0v 2.0v 1.7V Temp, lÐ tÐ (MHz) (MHz) (MHz) 1000 763 747 0 0 1 250 191 187 0 1 1 62.ó 47.7 46.7 1 1 1 600 461 447 0 0 0 150 115 110 0 1 0 37.5 28.9 27.3 1 1 0

Table 7.42 Table of simulated and obseraed' cloclc frequencies for chip rt2

I 1-Jan-94 17:30:56 l 5ns k-a- 1.00 \J t l \ \ t 5 n5 \ I \ \ 1. 00\/ $

\ \ J t J \

4s ùJe e Ps avenage IsLJ h igh signa pltpK{l) 3.û5 lJ ?.91 3.r2 0. 07 nean(l) 66. I nV 49. B 87. 5 18.0 sdev(l) 1 2i48 V 1.2ø13 1. 2309 0.0109 nns(l) 1,.21,37 V I . 2054 1.2310 0.0118 5ns anpl (l) 2.6? \J 2.65 2.72 0. 03

Figure 7.262 Cloclc input (channel 1) and, output (channel 2) waueforms'

Figure 7.26 shows the clock output waveform with a 67 Í-l terminating resistor mea- sured using the LeCroy oscilloscope. The delay of the clock input to output is ap-

226 proximately 2-2.5ns and the rise and fali time of the output clock was measured at lrzs. This measurement is limited by the bandwidth of the cRo.

The clock frequency rü/as measured using a frequency meter. In most cases the fre- quency was stable to f 10KHz.

49 'oscresl.dat'-a- 'oscsiml.daÈ' -+-'

48

4'l 1 É

;o 46 o É, 0 lr 45

44

43 L.2 L.3 7.4 1.6 r.'7 1.8 vdd

Figure 7.272 Vari,ation of clock frequency witlt power supply uoltøge for a 7-gate rxng

30 'osqres2,dat'+- 'oscsim2.dat'-È-'

29

N f 2A

U o tfa 0 klr

26

1 1 L.2 1.3 L,4 1.5 1,6 r.7 1.8 vdd

Figure 7.282 Variation of cloclc frequency with power supply aoltage for a l?-gate rxng.

Figures 7.27 and.7.28 show the variation of oscillator frequency of chip ft2 with porvl/er

227 supply voltage for a ring lengths of seven and thirteen gates respectively. Rise and fall times of between 1.Ons and 1.5ns lvere measured'

Functional Testing: The Systolic Cell

pe-L1ouE Timing 28 Jan 1994 10:50 Refnem ( stsrip 1, Page 1 simÞfe Tests of PE in mulEiplicaEìon mode Mag: 200 clock lons)

'132 sesence: 669

X-ou E-03 X-outs-02 X-ouE 01 X-out-00 P-ou t - 03 P-outs-02 P-out-01 P-ou E-00 Y-outs-03 Y-ouL 02 Y-our-01 Y-ouE-00 Y-out ÆCD-ouE ÆCD-ouE ÀBCD-outs_0 ÆCD-outs-0 ABCD-ouE X-out P-ou t clockout 1 66 732

Figure 7.2g2 DAS output from a functional test of the f'rst systolic ceII.

Simulations of the systolic cell architecture and generation of the exhaustive test vec- tors were done using C, Systolic Cell simulations were done using VHDL, the layout \4/as generated using MAGIC, and extracted using custom software tools. SPICE simulations r¡/ere done using HSPICE and functional simuiations using IRSIM. The functionality of the systolic cell was tested using exhaustive test data. Figure 7.29 shows the DAS 9200 output from a functional test of the first systolic ceil in the ring.

7.3 Interconnection TechnologY

7 .3.L Multi-chip Modules

The prototype SCAP systems for which results have been quoted have been con- structed. using conventional discrete chip packaging. This packaging technique does

228 not provide good packing densities for regular arrays. A development programme has been undertaken in the DSTO and industry to allow the integration of 400 floating point processors into a single package. This has been accomplished by using multi- chip module technology to package 20 chips, each of which contained 20 processing elements. The mod.ule is 75mm square and is cascadable. The module has approxi- mately 5 million active devices, it dissipates a maximum of 10 watts and provides a peak computation rate of 400 Mflops. AIi current SCAP systems are being designed using multi-chip module packaging. 400 Mfl.ops modules have been produced with high yields which can be improved by the replacement of faulty chips on the modules. The modules are d.esigned to be placed into a one- or two-dimensional array which will provide peak computation rates which are integral multiples of 400 Mflops'

Figure 7.30: A photograph or a /¡00 Mfl,ops multi-chip module (actuøl size)'

A photograph of a module is shown in figure 7.30. Each module requires approxi- mately 1300 wire bonds.

7.3.2 Silicon Hybrids

Further technological research has been carried out, and initial results were reported is aimed it Microelectronics '89 [ReMa89] and subsequently in [ReMa91]. This work at the construction of hybrid technology wafers. Silicon wafers are preferentially etched to provide mounting apertures for either silicon or gallium arsenide chips which are bonded into the wafer with an epoxy. Additional holes are etched to

229 provide interconnects through the wafer to a mechanically rigid substrate- Intercon- nect between chips is carried out with a multi-level thin-film metallisation, which provides low-parasitic interconnection between adjacent devices and avoids the prob- lems of high bond-wire parasitic capacity and inductance. These parasitics become significant at the gigabit rates achievable with GaAs. Thin-film interconnects are achievable because of the pianarity of the hybrid wafer surface.

Currently, only test structures have been constructed with this technique' However, a contract has been let to develop the capability to construct a functional system using this technologY.

The technoiogy has the capability of allowing the construction of densely packaged high-speed mixed technology systems, including silicon, GaAs, bipolar digital and analogue integrated. circuits and devices. It is hoped that the creation of an integrated two-dimensional hybrid system will make possible the future construction of a three dimensional processor. It is anticipated. that optical waveguides can be integrated onto the planar hybrid surface to implement high-bandwidth buses both for local data transfer as well as for I/O from the wafer'

Other areas of research relevant to this technology relate to the construction of con- trolled impedance transmission lines on the wafer by appropriate use of dielectric films.

230 Chapter 8

Summâry: Future Tlends and Conclusion

8.1 Summary

The evolution of computers has been summarised. Computers have been discussed which range from the frrst mechanical difference engines, through the stored program thermionic valve machines, to the still current vector processors which achieve their speed through the extensive use of vector operations. Replacing these latter vector processors in recent years have been massively parallel machines which rely on large numbers of processors and their interconnection to process data at rates approaching one Teraflops. Interconnection topologies include rings, two and three dimensional meshes, hypercubes and trees. It is characteristic of this latter approach to use off- the-shelf microprocessors, aithough there are some manufacturers who use custom processors.

With few exceptions the development of computer architecture has been achieved by optimising the available technologies for scalar and vector data structures. Higher di- mensional structures such as matrices have been supported in general at the software Ievel. \Mhere matrices have been specifically supported, they are supported as one member of a range of possible structures, and the machines have not been optimised for this structure. The MASPAR computers as an example have a general purpose architecture which allows a range of aigorithms to be executed, matrix operators being only a small subset.

A review of systolic processors revealed that despite being a rich source of research topics for over a d.ecade, each systolic architecture is limited by being specific to a particular problem, and is not readily extended to other problem areas. Examples

237 have been quoted which range from Kung's initial inner-product-step processor to Snyder's Configurable Highty Parallel Computer (CHiP), and Kung's Programmable Systolic Chip. Algorithmic examples which \¡/ere presented included polynomial eval- uation, convolution, FIR filtering and DFT evaluation with linear a rays. Systolic cell types vary for these algorithms. Different interconnection techniques' and the use of multiple cell types, were shown to allow LU decomposition on a hexagonally connected array. Triangular linear systems were shown to be soluble on linear arrays with two cell types. Further sophistication of the cells enabled Givens rotations to be used. for the solution of systems, QR factorisations and beamforming. Data format- ting was shown to be possible using systolic techniques and bit-level architectures were also exploited.

Attempts to unify the different aigorithms and architectures, such as with Kung's Programmable Systolic Chip and Snyder's CHiP computer, have not met with notable success. The later systolic machines have been typified by the use of commercial microprocessors. Large computing arrays have been constructed using one or more conventional microprocessors at each node'

The work in this thesis has accepted the restriction of systolic architectures to a lim- ited set of algorithms. However, the usual limitations imposed by such a iimited set have been circumvenied by the implementation of. matri,r primitives which have gen- eral application. The usual I/O restrictions of systolic processors have been avoided by the use of a novel data formatting/address-generation techniques on which two patents are pending. The architecture of the address generator makes transposition and on-the-fly mappings possible. These characteristics extend ihe domain of appli- cation of the systolic processor substantiaily, both for linear algebra and for linear transforms.

8.1.1 Implementation

A study of novel systolic ,ing prå""ssors showed that floating point algorithms were read.ily implemented in ring structures, and furthermore that they had the desirable attributes of variable speed, precision and dynamic range. The use of processors with varying speed capabilities is ideal for a constant bandwidth array where a few fast processors are in the top corner) and successively slower processors are used further into the array. Digit-serial systolic ring processors are characterised by short critical

232 paths and a high degree of pipelining, attributes well matched to boih GaAs and Si technologies. The architecture is well suited to a mixed technology approach in which high-speed gallium arsenide processors are used where the fastest processing speed is required and slower, more dense, silicon processors are used where large numbers of processors are required to execute at slower speeds'

8.1.2 Software

The writing of software for the MATRISC architecture follows the von Neumann coding paradigm. The code is sequential. However, each matrix operation such as C = A * B, where all operands are conformal matrices, is itself a parallel operation, and so the inherent parallelism of the MATRISC concept is hidden from the applications programmer. Software is written not to match a problem to a particular architecture, but simply to express the problem in terms of linear algebra. Not only is the expression of the problem in this form appropriate, but the sequential formulation of mathematical algorithms is the natural coding technique for MATRISC software. The use of conventional compilers to program MATRISC processors has been demonstrated by the SCAP development.
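A minimal sketch of this coding style is given below, assuming a hypothetical host-side library; the names mat_t, mat_mul and mat_add are not part of any real MATRISC interface. The program is ordinary sequential C and contains no explicit parallel constructs; each library call stands for one matrix primitive executed in parallel on the array, in the same way that a scalar expression hides the arithmetic unit from the programmer.

    /* Hypothetical host-side matrix type and primitives (illustrative only). */
    typedef struct { int rows, cols; float *data; } mat_t;

    mat_t mat_mul(mat_t a, mat_t b);   /* C = A * B, executed on the array */
    mat_t mat_add(mat_t a, mat_t b);   /* C = A + B, executed on the array */

    /* The application code reads like the mathematics: Y = A*X + B. */
    static mat_t affine(mat_t a, mat_t x, mat_t b)
    {
        return mat_add(mat_mul(a, x), b);
    }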

8.2 Current and Future Trends

To illustrate the capability of the MATRISC technology, the measured performance of a 1.2 µm CMOS technology (SCAP) has been quoted, together with the simulated performance of a MATRISC processor implemented in a current 0.5 µm technology. To provide estimates for future MATRISC architectures, the performance comparison is extended below to include three technologies: 1.2 µm Si, 0.5 µm Si and 0.2 µm Si. These line widths have been chosen as their ratios are approximately constant. As stated above, the 1.2 µm architecture has been realised in the form of the SCAP processor, and this is used as the benchmark for the other technologies.
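For reference, the successive scaling factors between these line widths are 1.2/0.5 = 2.4 and 0.5/0.2 = 2.5, which differ by only about four per cent.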

8.2.1 The Matrix Product

[Figure 8.1: Performance of the SCAP 1.2 µm system for matrix products. The plot, titled 'Measured and predicted performance for a 20x20 array', shows Mflops (0 to 400) against matrix product order (0 to 500); the measured curve is labelled '20x20-SUN-SPARC1'.]

In figure 8.1 the bottom curve shows the measured performance of the SCAP-enhanced SPARCstation 1. It is limited by both memory bandwidth and clock speed. The SCAP array is implemented with discrete packages on a conventional printed circuit board. The middle curve shows the predicted performance for a new board which is being constructed for the Department of Defence using multi-chip module technology to encapsulate the 400 processors onto a 75 mm square substrate, and which has been discussed in the previous chapter. The speed limitation is due to the available memory bandwidth and the speed of the data formatters. The top curve represents the computational power of the multi-chip module in a system which is not memory bound.

Simulated results for processors implemented in 0.5 µm Si technology are provided in the following figure. These simulations correspond to much of the work which has been done earlier in this thesis. Finally, a simulation is carried out for an 80 x 80 systolic processor. This processor can either be constructed from 0.5 µm technology, in which case it will occupy approximately four times the area of the 40 x 40 processor, or in the future it can be constructed in a 0.2 µm technology, in which case it could be constructed using a multi-chip module of approximately the same area as the current 400 Mflops module.

[Figure 8.2: Performance of a MATRISC 0.5/0.2 µm system for matrix products. The plot, titled 'Predicted performance for 40x40 and 80x80 arrays', shows Mflops (0 to 30000) against matrix product order (0 to 500); one curve is labelled '40x40 10ns'.]

Figure 8.2 gives the matrix product performance graphs for the 40 x 40 and 80 x 80 processors. The 40 x 40 processor requires 10 ns memories whereas the 80 x 80 processor requires 5 ns memories. The 5 ns speed is achievable with current memory technologies, using interleaving techniques.
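The interleaving remark can be quantified to first order. For a sequential access stream serviced by N banks in rotation, and assuming perfectly overlapped accesses, the effective cycle time is approximately the bank cycle time divided by N:

    t_eff ~ t_bank / N,    for example 10 ns / 2 = 5 ns,

so two-way interleaving of 10 ns memories would meet the 5 ns requirement. The two-way figure is an illustration only; the interleave factor is not fixed by the discussion above.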

8.3 MATRISC Multiprocessors: Teraflops Machines

The author's initial interest in MATRISC processors was a consequence of studying the problem of constructing multiprocessors for signal processing. It was concluded that a technique which could increase the computing power at each multiprocessor node would simplify all aspects of the construction of a multiprocessor. As a consequence, this study has been constrained to the matrix enhancement of a scalar or vector processor to provide a high performance MATRISC computing node. Initial studies and simulations have been done primarily by Mr. T. Shaw using arrays of MATRISC processors in hypercube topologies. Tentative results show performance figures which exceed 10¹² floating point operations per second, or one teraflops. This is a reasonable result given the 30 Gflops capability of the 80 x 80 processor.
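As a rough consistency check (the node count used here is illustrative only and is not fixed by the simulations referred to above), a node rate of 30 Gflops implies that

    1000 Gflops / 30 Gflops ~ 34

nodes are needed in principle to reach one teraflops, and a 64-node (six-dimensional) hypercube of such nodes has a peak rate of 64 x 30 Gflops = 1.92 Tflops.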

8.4 Conclusion

As feature sizes reduce and more transistors are integrated onto single chips, it will become cost effective to construct matrix engines as part of the CPU instead of providing larger caches. It is the author's view that the MATRISC concept will then become a standard approach to the design and construction of high performance processing units. Until that time, this thesis has documented the techniques which enable the construction of conceptually simple processor architectures which can provide computation rates approaching 10 Gflops in hand-held modules implemented in 0.5 µm technology.

The architecture is inherently well suited to block matrix algorithms, which are an active research area in their own right. The combination of new algorithms with the power of the MATRISC architecture will offer unparalleled price/performance in any foreseeable technology. In contrast to the restricted application areas of past systolic architectures, the integration of efficient matrix primitives and matrix addressing techniques into conventional compilers substantially extends the application areas of the technology.
