
General-Purpose Systolic Arrays

Kurtis T. Johnson and A.R. Hurson, Pennsylvania State University
Behrooz Shirazi, University of Texas, Arlington
Computer, November 1993

Systolic arrays effectively exploit massive parallelism in computationally intensive applications. With advances in VLSI, WSI, and FPGA technologies, they have progressed from fixed-function to general-purpose architectures.

When Sun Microsystems introduced its first workstation, the company could not have imagined how quickly workstations would revolutionize computing. The idea of a community of engineers, scientists, or researchers time-sharing on a single mainframe could hardly have become ancient any more quickly. The almost instant wide acceptance of workstations and desktop computers indicates that they were quickly recognized as giving the best and most flexible performance for the dollar.

Desktop computers proliferated for three reasons. First, very large scale integration (VLSI) and wafer scale integration (WSI), despite some problems, increased the gate density of chips while dramatically lowering their production cost. Moreover, increased gate density permits a more complicated processor, which in turn promotes parallelism. Second, desktop computers distribute processing power to the user in an easily customized open architecture. Real-time applications that require intensive I/O and computation need not consume all the resources of a mainframe. Also, desktop computers support high-definition screens with color and motion far exceeding those available with any multiple-user, shared-resource mainframe. Third, economical, high-bandwidth networks allow desktop computers to share data, thus retaining the most appealing aspect of centralized computing: resource sharing. Moreover, networks allow the computers to share data with dissimilar computing machines. That is perhaps the most important reason for the acceptance of desktop computers, since all the performance in the world is worth little if the machine is isolated.

Today's workstations have redefined the way the computing community distributes processing resources, and tomorrow's machines will continue this trend with higher bandwidth networks and higher computational performance. One way to obtain higher computational performance is to use special parallel hardware to perform functions such as motion and color support of high-definition screens. Future computationally intensive applications suited for desktop computing machines include real-time text, speech, and image processing. These applications require massive parallelism.

Many computational tasks are by their very nature sequential; for other tasks the degree of parallelism varies. Therefore, a computational architecture must maintain sufficient application flexibility and computational efficiency. It must be

- reconfigurable to exploit application-dependent parallelisms,
- high-level-language programmable for task control and flexibility,
- scalable for easy extension to many applications, and
- capable of supporting single-instruction stream, multiple-data stream (SIMD) organizations for vector operations and multiple-instruction stream, multiple-data stream (MIMD) organizations to exploit nonhomogeneous parallelism requirements.

Systolic arrays are ideally qualified for computationally intensive applications. Whether functioning as a dedicated fixed-function graphics processor or a more complicated and flexible coprocessor shared across a network, a systolic array effectively exploits massive parallelism. Falling into an area between vector computers and massively parallel computers, systolic arrays typically combine intensive local communication and computation with decentralized parallelism in a compact package. They capitalize on regular, modular, rhythmic, synchronous, concurrent processes that require intensive, repetitive computation. While systolic arrays originally were used for fixed or special-purpose architectures, the systolic concept has been extended to general-purpose SIMD and MIMD architectures.

Why systolic arrays?

Ever since Kung proposed the systolic model, its elegant solutions to demanding problems and its potential performance have attracted great attention. In physiology, the term systolic describes the contraction (systole) of the heart, which regularly sends blood to all cells of the body through the arteries, veins, and capillaries. Analogously, systolic computer processes perform operations in a rhythmic, incremental, cellular, and repetitive manner. The systolic computational rate is restricted by the array's I/O operations, much as the heart controls blood flow to the cells since it is the source and destination for all blood.

Although there is no widely accepted standard definition of systolic arrays and systolic cells, the following description serves as a working definition in this article (also see "Systolic array summary" below). Systolic arrays have balanced, uniform, gridlike architectures (Figure 1) in which each line indicates a communication path and each intersection represents a cell or a systolic element. However, systolic arrays are more than processor arrays that execute systolic algorithms. Systolic arrays consist of elements that take one of the following forms:

- a special-purpose cell with hardwired functions,
- a vector-computer-like cell with an instruction decoding unit and a processing unit, or
- a processor complete with a control unit and a processing unit.

In all cases, the systolic elements or cells are customized for intensive local communications and decentralized parallelism. Because an array consists of cells of only one or, at most, a few kinds, it has regular and simple characteristics. The array usually is extensible with minimal difficulty.

Figure 1. General systolic organization.

Systolic array summary

Systolic array: A gridlike structure of special processing elements that processes data much like an n-dimensional pipeline. Unlike a pipeline, however, the input data as well as partial results flow through the array. In addition, data can flow in a systolic organization at multiple speeds in multiple directions. Systolic arrays usually have a very high rate of I/O and are well suited for intensive parallel operations.

Applications: Matrix arithmetic, signal processing, image processing, language recognition, relational database operations, data structure manipulation, and character string manipulation.

Special-purpose systolic array: An array of hardwired systolic processing elements tailored for a specific application. Typically, many tens or hundreds of cells fit on a single chip.

General-purpose systolic array: An array of systolic processing elements that can be adapted to a variety of applications via programming or reconfiguration.

Programmable systolic array: An array of programmable systolic elements that operates either in SIMD or MIMD fashion. Either the array's interconnect or each processing unit is programmable, and a program controls dataflow through the elements.

Reconfigurable systolic array: An array of systolic elements that can be programmed at the lowest level. FPGA (field-programmable gate array) technology allows the array to emulate hardwired systolic elements at a very low level for each unique application.

Three factors have contributed to the systolic array's evolution into a leading approach for handling computationally intensive applications: technology advances, concurrency processing, and demanding scientific applications.

Technology advances. Advances in VLSI/WSI technology complement the systolic array's qualifications in the following ways:

- Smaller and faster gates allow a higher rate of on-chip communication because data has a shorter distance to travel.
- Higher gate densities permit more complicated cells with higher individual and group performance. Granularity increases as word length increases, and concurrency increases with more complicated cells.
- Economical design and fabrication processes produce less expensive systolic chips, even in small quantities.
- Better design tools allow arrays to be designed more efficiently. A systolic cell can be fully simulated before fabrication, reducing the chances that it will fail to work as designed.

With advances in simulation techniques, fully tested, unique cells can now be quickly copied and arranged in regular, modular arrays. As VLSI/WSI designs become more complicated, "systolicizing" them provides an efficient way to ensure fault tolerance: any fault tolerance precautions built into one cell are extensible to all cells. Relatively new field-programmable gate array (FPGA) technology permits a reconfigurable architecture, as opposed to a reprogrammable architecture.

Concurrency processing. Past efforts to add concurrency to the conventional von Neumann architecture have yielded coprocessors, multiple processing units, data pipelining, and multiple homogeneous processors. Systolic arrays combine features from all of these architectures in a massively parallel architecture that can be integrated into existing platforms without a complete redesign. A systolic array can act as a coprocessor, can contain multiple processing units and/or processors, and can act as an n-dimensional pipeline. Although data pipelining reduces I/O requirements by allowing adjacent cells to reuse the input data, the systolic array's real novelty is its incremental instruction processing, or computational pipelining. Each cell computes an incremental result, and the computer derives the complete result by interpreting the incremental results from the entire array in a prespecified algorithmic format.

Demanding scientific applications. The technology growth of the last three decades has produced computing environments that make it feasible to attack demanding scientific applications on a larger scale. Large-matrix multiplication, feature extraction, cluster analysis, and radar signal processing are only a few examples. As recent history shows, when many computer users work on a wide variety of applications, they develop new applications requiring increased computational performance. Examples of these innovative applications include interactive language recognition, relational database operations, text recognition, and virtual reality. These applications require massive repetitive and rhythmic parallel processing, as well as intensive I/O operation. Hence, systolic computing.

Implementation issues

A number of implementation issues determine a systolic array's performance efficiency. Designers should understand the following performance trade-offs at the design stage.

Algorithms and mapping. Designers must be intimately familiar with the algorithms they are implementing on systolic arrays. Designing a systolic array heuristically from an algorithm is slow and error-prone, requiring simulation for verification and often producing a less-than-optimum algorithm. Thus, automatic array synthesis is an important research area. At present, however, most array designs are based on heuristics.

Integration into existing systems. Generally, a systolic array is integrated into an existing host as a back-end processor. The array's high I/O requirements often make system integration a significant problem. Because the existing I/O channel rarely satisfies the array's bandwidth requirement, a memory subsystem often must be added between the host and the systolic array to support data access and data multiplexing and demultiplexing. The memory subsystem can range from the complicated support and cluster processors in the Warp array to the simpler staging memory in the Splash array. (These systems will be discussed in greater detail later.)

Cell granularity. The level of cell granularity directly affects the array's throughput and flexibility and determines the set of algorithms that it can efficiently execute. Each cell's basic operation can range from a logical or bitwise operation, to a word-level multiplication or addition, to a complete program. Granularity is subject to technology capabilities and limitations as well as design goals. For example, integration-substrate families have different performance and density characteristics. Packaging also introduces I/O pin restrictions.

Extensibility. Because systolic arrays are built of cellular building blocks, the cell design should be sufficiently flexible for use in a wide variety of topologies implemented in a wide variety of substrate technologies.

Clock synchronization. Clock lines of different lengths within integrated chips, as well as external to the chips, can introduce skews. The risk of clock skew is greater when dataflow in the systolic array is bidirectional. Wavefront arrays reduce the clock skew problem by introducing more complicated, asynchronous, intercellular communications.

Reliability. As integrated circuits grow larger, designers must build in greater fault tolerance to maintain reliability, and diagnostics to verify proper operation.

Systolic array taxonomy

The term systolic array originally referred to special-purpose or fixed-function architectures designed as hardware implementations of a given algorithm. In mass quantities, the production of these arrays was manageable and economical, and, thus, they were well suited for common applications. But these designs were bound to the specific application at hand and were not flexible or versatile. Every time a systolic array was to be used on a new application, the manufacturer had to undertake the long, costly, and potentially risky process of designing, testing, and fabricating an application-specific integrated chip. Although the cost and risks of developing ASICs have decreased in recent

Table 1. Systolic array taxonomy.

Class             General-purpose                                   Special-purpose
Type              Programmable | Reconfigurable | Hybrid            Hardwired
Organization      SIMD or MIMD | VFIMD          | VFIMD
Topology          Programmable                                      Fixed
Interconnections  Static, Dynamic, Fixed                            Static, Dynamic, Fixed
Dimensions        n-dimensional (n > 2 is rare due to complexity)   n-dimensional

SIMD: single-instruction stream, multiple-data stream; MIMD: multiple-instruction stream, multiple-data stream; VFIMD: very-few-instruction stream, multiple-data stream.

years, budget constraints have motivated a trend away from unique hardware development. Consequently, general-purpose systolic architectures have become a logical alternative. In addition to serving in a wide variety of applications, they also provide test beds for developing, verifying, and debugging new systolic algorithms. Table 1 shows a taxonomy of general-purpose and special-purpose systolic arrays.

Special-purpose architectures. Special-purpose systolic architectures are custom designed for each application. Few problems resist attack from systolic arrays, but some problems may require elegant algorithms. Generally speaking, the systolic design requires a high-performance algorithm that can be efficiently implemented with today's VLSI technology.

One area that easily utilizes systolic algorithms is matrix operations. Figure 2 illustrates the algorithm for the sum of a scalar product, computed in a single systolic element. After the cell is initialized, the a's and the b's are synchronously shifted through the processing element. The accumulator stores the sum of the a·b products. All the a and b data synchronously exits the processing element unmodified to be available for the next element. At the end of processing, the sum of the products is shifted out of the accumulator.

Figure 2. A systolic processing element that computes the sum of a scalar product.

This principle easily extends to a matrix product, as shown in Figure 3. The only difference between single-element processing and array processing is that the latter delays each additional column and row by one cycle so that the columns and rows line up for a matrix multiply. The product matrix is shifted out after completion of processing.

Figure 3. The systolic product of two 3 x 3 matrices.

An obvious problem with this approach is that matrix products involving a matrix larger than the systolic array must be divided into a set of smaller matrix products. This resource and implementation problem affects all systolic arrays. Proper algorithm development compensates for the problem, but performance decreases nevertheless. The matrix product example also demonstrates another problem with special-purpose systolic arrays and hardware in
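The cell behavior of Figures 2 and 3 is easy to check in software. The sketch below is an illustrative simulation written for this discussion, not code from the article: each cell multiplies the a and b values currently passing through it, adds the product to its accumulator, and forwards both values unmodified, while rows of A and columns of B enter skewed by one cycle per row or column, exactly as described above.

```python
# Cycle-accurate sketch of the systolic matrix multiply of Figure 3.
# Cell (i, j) receives A-data moving right and B-data moving down,
# accumulates their product, and passes both inputs on unmodified.

def systolic_matmul(A, B):
    n = len(A)
    acc = [[0] * n for _ in range(n)]        # per-cell accumulators
    a_reg = [[None] * n for _ in range(n)]   # A value held in each cell this cycle
    b_reg = [[None] * n for _ in range(n)]   # B value held in each cell this cycle
    for t in range(3 * n - 2):               # enough cycles to drain the array
        # Shift: values move one cell per cycle (A rightward, B downward).
        for i in range(n):
            for j in reversed(range(n)):
                a_reg[i][j] = a_reg[i][j - 1] if j > 0 else None
        for j in range(n):
            for i in reversed(range(n)):
                b_reg[i][j] = b_reg[i - 1][j] if i > 0 else None
        # Skewed injection: row i of A and column j of B are delayed i (j) cycles.
        for i in range(n):
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < n else None
        for j in range(n):
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < n else None
        # Every cell multiplies and accumulates whatever pair is passing through.
        for i in range(n):
            for j in range(n):
                if a_reg[i][j] is not None and b_reg[i][j] is not None:
                    acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc
```

Because each cell only ever sees the operand pair currently flowing past it, the same loop also shows why an n x n array needs roughly 3n cycles before the complete product can be shifted out.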

general. The more specialized the hardware, the higher the performance; but cost per application also rises and flexibility decreases. Therein lies the attractiveness of general-purpose systolic architectures.

General-purpose architectures. The two basic types of general-purpose systolic arrays are the programmable model and the reconfigurable model. Recently, hybrid models have also been proposed.

In the programmable model, cell architectures and array architectures remain the same from application to application. However, a program controls data operations in the cells and data routing through the array. All communication paths and functional units are fixed, and the program determines when and which paths are utilized. One of the first programmable architectures was Carnegie Mellon's programmable systolic chip.

In the reconfigurable model, cell architectures as well as array architectures change from one application to another. The architecture for each application appears as a special-purpose array. The primary means of implementing the reconfigurable model is FPGA technology. Splash was one of the early FPGA-based reconfigurable systolic arrays.

Hybrid models make use of both VLSI and FPGA technology. They usually consist of VLSI circuits embedded in an FPGA-reconfigurable interconnection network.

Systolic topologies. Array topologies can be either programmable or reconfigurable. Likewise, array cells are either programmable or reconfigurable. A programmable systolic architecture is a collection of interconnected, general-purpose systolic cells, each of which is either programmable or reconfigurable. Programmable systolic cells are flexible processing elements specially designed to meet the computational and I/O requirements of systolic arrays. Programmable systolic architectures can be classified according to their cell interconnection topologies: fixed or programmable.

Fixed cell interconnections limit a given topology to some subset of all possible algorithms. That topology can emulate other topologies by means of the proper mapping transformation, but reduced performance is often a consequence.

Programmable cell interconnection topologies typically consist of programmable cells embedded in a switch lattice that allows the array to assume many different topologies. Programmable topologies are either static or dynamic. Static topologies can be altered between applications, and dynamic topologies can be altered within an application. Static programmable topologies can be implemented with much less complexity than dynamic programmable topologies. There has been little research in dynamic programmable topologies because a highly complex interconnection network could undermine the regular and simple principles of systolic architectures.

Reconfigurable systolic architectures capitalize on FPGA technology, which allows the user to configure a low-level logic circuit for each cell. Reconfigurable arrays also have either fixed or reconfigurable cell interconnections. The user reconfigures an array's topology by means of a switch lattice. Any general-purpose array that is not conventionally programmable is usually considered reconfigurable. All FPGA reconfiguring is static due to technology limitations.

Array dimensions. We can further classify general-purpose and special-purpose systolic architectures by their array dimensions. The two most common structures are the linear array and the two-dimensional array. Linear systolic arrays are by default statically reconfigurable in one-dimensional space. Two-dimensional arrays allow more efficient execution of complicated algorithms. Due to I/O limitations, general-purpose systolic arrays of dimensions greater than two are not common.

Programmable array organization. Programmable systolic arrays are programmable either at a high level or a low level. At either level, programmable arrays can be categorized as either SIMD or MIMD machines. They are typically back-end processors with an additional buffer memory to handle the high systolic I/O rates. High-level programmable arrays usually are programmed in high-level languages and are word oriented. Low-level arrays are programmed in assembly language and are bit oriented.

SIMD. SIMD systolic machines (Figure 4) operate similarly to a vector processor. The host workstation preloads a controller and a memory, which are external to the array, with the instructions and data for the application. The systolic cells store no programs or instructions. Each cell's instruction-processing functional unit serves as a large decoder. Once the workstation enables execution, the controller sequences through the external memory, thus delivering instructions and data to the systolic array. Within the array, instructions are broadcast and all cells perform the same operation on different data. Adjacent cells may share memory, but generally no memory is shared by the entire array. After exiting the array, data is collected in the external buffer memory. SIMD-programmable systolic cells take less space on the VLSI wafer than their MIMD counterparts because they are simple instruction-processing elements requiring no program memory. From tens to hundreds of such cells fit on one integrated chip.

Figure 4. General organization of SIMD programmable linear systolic arrays.
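As a rough sketch of this organization (illustrative only; the opcode set and the function name run_simd are invented for the example), the model below keeps one operand per cell and lets an external controller broadcast one instruction per cycle, so every cell performs the identical operation on its own data:

```python
# Toy model of the SIMD organization of Figure 4: an external controller
# broadcasts one instruction per cycle; every cell applies it to its own
# local operand, so all cells execute the same operation on different data.

def run_simd(program, local_data):
    cells = list(local_data)                 # one operand register per cell
    for opcode, operand in program:          # controller sequences the program
        if opcode == "add":
            cells = [c + operand for c in cells]
        elif opcode == "mul":
            cells = [c * operand for c in cells]
        elif opcode == "shift":              # pass each value to the right neighbor
            cells = [operand] + cells[:-1]
        else:
            raise ValueError(f"unknown opcode {opcode!r}")
    return cells                             # results drain to the buffer memory

# Each cell holds different data but executes the identical broadcast stream.
print(run_simd([("add", 1), ("mul", 2)], [10, 20, 30]))   # [22, 42, 62]
```

Note that no cell stores the program; the instruction stream arrives from outside, which is why SIMD cells need no program memory and stay small.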

MIMD. MIMD systolic machines (Figure 5) operate similarly to homogeneous von Neumann multiprocessor machines. The workstation downloads a program to each MIMD systolic cell. Each cell may be loaded with a different program, or all the cells in the array may be loaded with the same program. Each cell's architecture is somewhat similar to the conventional von Neumann architecture: It contains a control unit, an ALU, and local memory. MIMD systolic cells have more local memory than their SIMD counterparts to support the von Neumann-style organization. Some may have a small amount of global memory, but generally no memory is shared by all the cells. Whenever data is to be shared by processors, it must be passed to the next cell. Thus, data availability becomes a very important issue. High-level MIMD systolic cells are very complicated, and usually only one fits on a single integrated chip. For example, the iWarp is a Warp cell without memory on a single chip. Local memory for each iWarp cell must be supplied by additional chips.

Figure 5. General organization of MIMD programmable linear systolic arrays.

Reconfigurable array organization. Recent gate-density advances in FPGA technology have produced a low-level, reconfigurable systolic array architecture that bridges the gap between special-purpose arrays and the more versatile, programmable, general-purpose arrays. The FPGA architecture is unusual because a single hardware platform can be logically reconfigured as an exact duplicate of a special-purpose systolic array. Figure 6 shows the general organization of a reconfigurable array. The designer logically draws cell architectures on the workstation with a schematic editor (such as Mentor), converts the design to FPGA code with another utility, and then downloads the code in a few seconds to configure the FPGA architecture.

Figure 6. General organization of reconfigurable arrays.

Reconfigurable systolic arrays do not fall into the SIMD or MIMD categories for the same reason that special-purpose systolic arrays do not. Reconfigurable arrays, like special-purpose arrays, are generally limited to VFIMD (very-few-instruction streams, multiple-data streams) organization due to FPGA gate density. Instructions are implicit in the configuration of each cell; therefore, there is no need to download them from the workstation. Since special-purpose systolic architectures consist of a very few unique cells repeated throughout the array, the entire array also tends to be VFIMD. Purely reconfigurable architectures are fine-grain, low-level devices best suited for logical or bit manipulations. They typically lack the gate density to support high-level functions such as multiplication.

Tables 2, 3, and 4 list most of the recent programmable and reconfigurable, general-purpose systolic arrays reported in the literature. When information is unclear in the literature, the corresponding space in the table is left blank. The tables indicate three stages of product development:

Table 2. SIMD-organized programmable systolic architectures.

System name, developer | Development stage | Topology | Key features
Brown Systolic Array [1], Brown University | Prototype | Linear, 470 cells | Very small VLSI footprint; 100s of cells per chip; ISA, SSR architectures; 8-bit ALU only; 157 MOPS
Micmacs [2], IRISA, Campus de Beaulieu, France | Prototype | Linear, 18 cells | 16-bit fixed-point math; broadcast data; 90 MOPS
Geometric Arithmetic Parallel Processor [3], NCR | Commercial | 48 x 48 array, 2,304 cells | Bit-slice cellular architecture; global data; 900 MOPS
Saxpy-1M [4], Saxpy Computer Corp. | Commercial | Linear, 32 cells | 32-bit floating-point capability; broadcast and global data with block processing; 1,000 MFLOPS
Systolic/Cellular Architecture [5], Hughes Research Laboratories | Prototype | 16 x 16 array, 256 cells | 32-bit fixed-function units
Cylindrical Banyan Multicomputer [6], University of Texas at Austin | Research | | Packet-switched programmable topology with programmable cells
Programmable Systolic Device [7], Australian National University | Research | | Instruction decoding occurs once per chip; each chip has many cells

1. R. Hughey and D. Lopresti, "Architecture of a Programmable Systolic Array," Proc. Int'l Conf. Systolic Arrays, IEEE CS Press, Los Alamitos, Calif., Order No. 860, 1988, pp. 41-49.
2. P. Frison et al., "Micmacs: A VLSI Programmable Systolic Architecture," Systolic Array Processors, Prentice-Hall, Englewood Cliffs, N.J., 1989, pp. 145-155.
3. P. Greussay, "Programmation des Mega-Processeurs: du GAPP a la Connection Machine," Course Notes DEA, Paris Univ. VIII, Vincennes, 1985.
4. D. Foulser and R. Schreiber, "The Saxpy Matrix-1: A General-Purpose Systolic Computer," Computer, Vol. 20, No. 7, July 1987, pp. 35-43.
5. K.W. Przytula and J.B. Nash, "Implementation of Synthetic Aperture Radar Algorithms on a Systolic/Cellular Architecture," Proc. Int'l Conf. Systolic Arrays, IEEE CS Press, Los Alamitos, Calif., Order No. 860, 1988, pp. 21-27.
6. M. Malek and E. Opper, "The Cylindrical Banyan Multicomputer: A Reconfigurable Systolic Architecture," Vol. 10, No. 3, May 1989, pp. 319-326.
7. P. Lenders and H. Schroder, "A Programmable Systolic Device for Image Processing Based on Morphology," Parallel Computing, Vol. 13, No. 3, Mar. 1990, pp. 337-344.

- Research: A functional system that has not yet been implemented.
- Prototype: At least a partial system has been implemented.
- Commercial: The system has been implemented, and an integrated chip with multiple cells, a complete system, or both are available for purchase.

Architectural issues of programmable systolic arrays

To be effective, a programmable systolic architecture must adhere to the general principles of systolic arrays: regularity, simplicity, concurrency, and rhythmic communications. In addition, the introduction of a programmable systolic cell with significant local memory has resulted in a new mode of operations: block processing. Block processing combines periods of intensive systolic I/O and periods of sequential von Neumann-style processing to form what is known as a pseudosystolic model.

Replicating and interconnecting a basic programmable systolic cell to form an array carries out the regularity and simplicity principles, provided an appropriate algorithm and program are utilized. However, programmable-cell-based architectures must introduce special features to address the systolic properties of concurrency, rhythmic communications, and block processing.

Concurrency. Two levels of concurrency are possible in a programmable systolic architecture: concurrency across cells and concurrency within cells, both implicit to hardwired designs. To make concurrency feasible in programmable systolic architectures, the designer must add special mechanisms that facilitate processing control.

Intercell concurrency. Coordination of concurrency control has always been inherent in a hardwired systolic design because it has been addressed at the algorithm development stage. The advent of the programmable systolic array required that innovative features be incorporated into designs to satisfy a programmable coordination requirement. Methods of program loading, memory initialization, program switching, and local memory access at the cell level had


Table 3. MIMD-organized programmable systolic architectures.

System name, developer | Development stage | Topology | Key features
Programmable Systolic Chip [1], Carnegie Mellon University | Prototype | |
Warp [2], Carnegie Mellon University | Commercial | Linear, 10 cells | 32-bit floating-point multiplication; block processing; I/O queuing; 100 MFLOPS
iWarp [3], Carnegie Mellon University | | |
Computer for Experimental Synthetic Aperture Radar [4], Norwegian Defense Research Establishment | Prototype | Four 8 x 16 arrays, 512 cells | Bit-serial cellular I/O; 32-bit floating-point multipliers in each cell; 320 MFLOPS
Cellular Array Processor [5], Fujitsu Laboratories, Japan | Commercial | |
Configurable Highly Parallel Computer [6], Purdue University | Research | | Programmable cells embedded in switch lattice for programmable topology
Associative String Processor [7], Brunel University, UK | Research | |
PICAP3 [8], University of Paris | Prototype | 8 x 8 array, 64 cells | 16-bit word length; image oriented
PSDP [9], Univ. of South Florida, Honeywell | Research | | 4-bit word length; wafer-scale design

1. A.L. Fisher, K. Sarocky, and H.T. Kung, "Experience with the CMU Programmable Systolic Chip," Proc. Soc. of Photo-Optical Instrumentation Engineers, Real-Time Signal Processing VII, SPIE, San Diego, Calif., 1984, p. 495.
2. M. Annaratone et al., "Architecture of Warp," Proc. Compcon, IEEE CS Press, Los Alamitos, Calif., Order No. 764, Feb. 1987, pp. 264-267.
3. S. Borkar et al., "iWarp: An Integrated Solution to High-Speed Parallel Computing," Proc. Supercomputing, IEEE CS Press, Los Alamitos, Calif., Order No. 882, 1988, pp. 330-339.
4. M. Toverud and V. Anderson, "CESAR: A Programmable High-Performance Systolic Array Processor," Proc. Int'l Conf. Computer Design, IEEE CS Press, Los Alamitos, Calif., Order No. 872 (microfiche only), 1988, pp. 414-417.
5. M. Ishii et al., "Cellular Array Processor CAP and Application," Proc. Int'l Conf. Systolic Arrays, IEEE CS Press, Los Alamitos, Calif., Order No. 860, 1988, pp. 535-544.
6. L. Snyder, "Introduction to the Configurable Highly Parallel Computer," Computer, Vol. 15, No. 1, Jan. 1982, pp. 47-56.
7. R.M. Lea, "The ASP, a Fault-Tolerant VLSI/ULSI/WSI Associative String Processor for Cost-Effective Processing," Proc. Int'l Conf. Systolic Arrays, pp. 515-524.
8. B. Lindskog and P.E. Danielsson, "PICAP3: A Parallel Processor Tuned for 3D Image Operations," Proc. Eighth Int'l Conf. Pattern Recognition, IEEE CS Press, Order No. 742 (microfiche only), 1986, pp. 1248-1250.
9. D. Landis et al., "A Wafer-Scale Programmable Systolic Data Processor," Proc. Ninth Biennial Univ./Gov't/Ind. Microelectronics Symp., IEEE, Piscataway, N.J., 1991, pp. 252-256.

to be incorporated into a systolic environment. The following mechanisms facilitate programmable concurrency throughout the array:

• Broadcast data permits all cells of an array to be reset simultaneously with initial conditions and constants at the start of a processing block during nonsystolic modes of operation. The larger the array, the more efficiency it will gain from a broadcast data capability.
• Broadcast instructions permit operations similar to those of a vector computer by making each systolic cell analogous to the vector computer's processing unit. However, unlike most vector computers, systolic cells can support a high-bandwidth communication channel with adjacent cells.
• Broadcast instruction addresses allow fast and efficient switching among programs in an MIMD cell's program memory. When all MIMD cells have programs stored in the same places in their program memories, a jump to the broadcast instruction address carries out simultaneous program switching throughout the array. If the programs in all the MIMD cells happen

November 1993 27

Table 4. FPGA-based programmable systolic architectures.

System name, developer | Development stage | Topology | Key features

Splash Systolic Engine¹, Supercomputing Research Center | Prototype | Linear, 32 cells | Based on FPGA

Cellular Array Logic², Edinburgh University, UK | Prototype | 16 × 16 array, 256 cells | Reconfigurable cell architecture based on custom chip

Hybrid Architecture³, University of Illinois | Research | Reconfigurable cell architecture integrating FPGA capability and 32-bit floating-point multiplication

Programmable Adaptive Computing Engine⁴, University of Wales, Bangor, UK | Prototype | Programmable | Functional units embedded in each cell; reconfigurable connections in cell as well as programmable topology

Configurable Functional Array⁵, Tsinghua University, Beijing | Research | Configurable array topology and cell architecture

1. M. Gokhale et al., "Splash: A Reconfigurable Linear Logic Array," Proc. Int'l Conf. Parallel Processing, IEEE CS Press, Los Alamitos, Calif., Order No. 2101, 1990, pp. 1526-1531.
2. T. Kean and J. Gray, "Configurable Hardware: Two Case Studies of Micrograin Computation," in Systolic Array Processors, Prentice-Hall, Englewood Cliffs, N.J., 1989, pp. 310-319.
3. R. Smith and G. Sobelman, "Simulation-Based Design of Programmable Systolic Arrays," Computer-Aided Design, Vol. 23, No. 10, Dec. 1991, pp. 669-675.
4. S. Jones, A. Spray, and A. Ling, "A Flexible Building Block for the Construction of Processor Arrays," in Systolic Array Processors, Prentice-Hall, Englewood Cliffs, N.J., 1989, pp. 459-466.
5. C. Wenyang, L. Yanda, and J. Yue, "Systolic Realization for 2D Convolution Using Configurable Functional Method in VLSI Parallel Array Designs," IEE Proc. Computers and Digital Techniques, Vol. 138, No. 5, Sept. 1991, pp. 361-370.
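The arrays in Table 4 differ in cell technology, but all implement the same rhythmic discipline: on every beat a cell accepts an operand from its upstream neighbor, computes, and passes data downstream. The following toy Python simulation illustrates that discipline for a linear, output-stationary matrix-vector product; the cell behavior is a generic sketch, not the programming model of any machine listed above.

```python
def systolic_matvec(A, x):
    """Simulate y = A x on a linear systolic array of len(x) cells.

    Cell i holds row i of A and an output accumulator; the vector x is
    pumped through the array one cell per beat (output-stationary style).
    """
    n = len(x)
    y = [0] * n          # output-stationary accumulator in each cell
    pipe = [None] * n    # the x value currently held by each cell
    for beat in range(2 * n - 1):                 # extra beats drain the pipe
        for i in range(n - 1, 0, -1):             # pass x downstream
            pipe[i] = pipe[i - 1]
        pipe[0] = x[beat] if beat < n else None   # feed the array, then bubbles
        for i in range(n):                        # all cells fire each beat
            if pipe[i] is not None:               # cell i sees x[beat - i] now
                y[i] += A[i][beat - i] * pipe[i]
    return y

# Example: a 2 x 2 product finishes in 2*2 - 1 = 3 beats.
# systolic_matvec([[1, 2], [3, 4]], [5, 6]) -> [17, 39]
```

Note how the result emerges purely from the shift-and-accumulate rhythm; no cell ever addresses data outside its own registers.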

to be identical, the array is performing SIMD operations.
• Instruction systolic arrays⁹ provide a precise low-level method of coordinating processing from cell to cell. In an ISA, instructions travel through the array with the data. The processed data passes with the original instruction from each cell to the next. An appropriately wide ISA instruction also has built-in microcodable parallelism. ISA algorithms can be lengthy and more difficult to develop than algorithms for other SIMD and MIMD arrays. But they provide a way to uniquely specify concurrency across cells without an overly complicated circuit.
• Direct local memory access or global memory. Status flags can be stored either in each cell's local memory or in a global memory. In the case of local memory, the controller must be able to directly access each cell's local memory to determine flag status; otherwise, the controller must issue a request and wait for it to propagate through the array. When global memory is available, the controller obtains status information much more easily. Direct local memory access or global memory also facilitates initialization within a cell. The Saxpy-1M is one of the first systolic arrays with global memory.

Intracell concurrency. Programmable cells require the ability to perform multiple similar or dissimilar operations simultaneously. Mechanisms incorporated into programmable systolic cells to support concurrency are all proven techniques that originated in conventional von Neumann processors: microcodable processing elements, duplication of functional units, multiple data paths, sufficiently wide instruction words, and pipelining of functional units.

Rhythmic communications. The principle of rhythmic communications clearly separates systolic arrays from other architectures. Programmability creates a more flexible systolic architecture, but the penalties are complexity and possible slowing of operations. When systolic cells are programmable, the issue of data availability arises. To minimize the performance degradation of programmable systolic designs, each cell's processing element must have data available when it is required, high-speed arithmetic capabilities, and the ability to transfer or store the processed data. Thus, each cell must have sufficient memory for data storage as well as program storage to facilitate intercell communications. The following architectural features support rhythmic communications:

• Queued I/O streamlines cellular communications by allowing the source cell to send data to the destination cell when the data is available, not when the destination cell is ready to accept it. Data exits the queue on a FIFO (first in, first out) basis that allows program computation to proceed irrespective of communications status. Queued I/O helps considerably when all cells are computing a single program that is skewed in time across the cells. A disadvantage is that this mechanism uses memory space that is expensive on VLSI wafers.
• Serial I/O reduces the bandwidth requirement on the systolic array's host workstation, at the same time promoting rhythmic cellular communication. The resulting programmable systolic array has bit-serial cellular communications and word-wide operations in each cell. Each cell stores partial words until it receives the full word and then performs functional operations. Using more cells can partially offset the disadvantage of reduced throughput.
• Multiple data paths provide the cell with fast, parallel internal and external communications. One or more data paths are dedicated to computational needs, another to cellular input operations, and yet another to cellular output operations.
• Systolic shared registers provide several benefits specifically for programmable systolic architectures. SSR architectures (Figure 7) provide a program-implicit means of streamlined dataflow. Each two adjacent processing elements share a small register memory. Data flows concurrently with computation as each processing element receives new data from the register memory upstream, operates on it, and directs it to the next register memory downstream. This architecture eliminates the requirement that data movement be explicitly specified, and dataflow through the array becomes a result of processing. The input register and the output register are specified by the instruction word. Therefore, bidirectional dataflow is a natural result. An SSR architecture is advantageous over queued I/O because communications do not occur on a FIFO basis. A disadvantage is that the register bank must be kept small to minimize access time and thus maximize systolic bandwidth. The small register makes programming difficult for some of the more complicated algorithms.
• Broadcast data eases systolic I/O requirements in certain instances by allowing all functional units and local memory of a cell to be initialized or set to a common variable at once.
• Global data eases systolic I/O requirements and the cell's storage requirements by keeping only one copy of the same data and allowing all cells to share it. With proper global memory coherency, any modifications to global data are instantly available to other cells.
• Bidirectional and wraparound dataflow must be explicitly specified in the program, whereas in hardwired designs dataflow is implicit. Bidirectional and wraparound dataflow can reduce the I/O bandwidth requirements of the external buffer memory by recirculating data within the array. More flexible dataflow also allows arrays to execute a wide variety of algorithms more efficiently.

[Figure 7. An SIMD systolic shared-register architecture.]

Block processing. If each programmable systolic cell has significant local memory, block processing is possible in the systolic environment. During block processing, very little systolic I/O occurs, and each cell executes a series of instructions on local data. Once the cells generate an intermediate result, processing pauses while systolic I/O takes place, and new data is written to each cell's local memory.

Block processing on programmable systolic architectures can result in a more efficient, pseudosystolic operation in some applications for two reasons. First, the merger of von Neumann programmable cell architectures into a systolic array environment causes many of the same problems found in homogeneous multiprocessor systems. The serial nature of von Neumann machines interferes with the rhythmic systolic I/O of the array, causing an unacceptable amount of time wasted in waiting for data to become available. Block processing minimizes this wasted time in applications that can be divided into equal segments. Applications are divided into parallel tasks that utilize local data.

Second, for any systolic array, the bandwidth and size of the external memory are always the limiting factors on throughput and performance. Block processing reduces the array's systolic I/O requirements by reducing the amount of cellular I/O.

Important work has been done on characterizing block processing in a general-purpose systolic array.¹⁰ Block processing can be performed on either SIMD or MIMD programmable systolic machines. Requirements for efficient block processing include an appropriate algorithm and significant local memory in each cell. Other types of array communication such as broadcast data and global data are also desirable.

Architectural issues of new reconfigurable systolic arrays

The primary architectural issues in designing a reconfigurable systolic array are the hardware platform of FPGAs and the class of algorithms to be targeted to that platform. The number of FPGAs, their topology, and their gate density will determine the set of systolic architectures that can be synthesized and the set of systolic algorithms that can be implemented on that hardware platform. Reconfigurable systolic architectures are very interesting because the cell's architecture is actually programmed. Consequently, the architecture programmer has complete control over what architectural features are incorporated in each FPGA. Determining the cell architecture can be an iterative process that continuously refines the architecture until it satisfies application requirements.

FPGA architecture programming differs from conventional programming in that one programs the circuit's logical function, instead of programming a model of operations in a high-level language. Reconfiguring an FPGA is the same as logically drawing a new circuit. Therefore, an FPGA platform can assume any architecture that FPGA gate density and package pinout permit.
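The block-processing mode described earlier alternates long, communication-free compute phases with brief systolic I/O phases. A schematic Python rendering of that schedule follows; the per-cell program (a sum of squares) and the block contents are hypothetical stand-ins for whatever the application would run.

```python
def run_block_processed(blocks, cells=4):
    """Toy block-processing schedule for a programmable systolic array.

    Each iteration models one cycle: a systolic I/O phase that loads one
    data block into each cell's local memory, a compute phase in which
    every cell runs many instructions on purely local data, and an I/O
    phase that drains the intermediate results.
    """
    results = []
    for start in range(0, len(blocks), cells):
        batch = blocks[start:start + cells]        # I/O phase: one block per cell
        local_mem = [list(b) for b in batch]       # written to local memories
        # Compute phase: no cellular I/O occurs here at all.
        partial = [sum(v * v for v in mem) for mem in local_mem]
        results.extend(partial)                    # I/O phase: drain results
    return results
```

The point of the sketch is the ratio: the inner compute loop can be arbitrarily long relative to the two short I/O steps, which is exactly how block processing hides the serial nature of each von Neumann cell.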

Novcmber 1993 29 host support. Warp. the Computer for Broadcast instructions Experimental Synthetic Aperture Ra- dar, the Cellular Array Processor, and Saxpy-1M all require expensive and complicated I/O support - for the in- tensive instruction I/O as well as the data 1/0. These machines are high- performance (200 MFLOPS, 320 MFLOPS, and 1,000 MFLOPS. respectively) with central- ized concurrency, constructed to fill an existing performance gap. The introduction of workstations into the workplace has changed the way a

C I I I significant portion of the computing community views computers. Users de- Figure 8. A hybrid SlMD programmable systolic architecture. mand improved desktop computing performance as applications continue to increase in size and complexity. must be evaluated on a case-by-case architectures in the near future. Conse- Workstation architecture can always be basis at the time of configuration. Cur- quently, a natural step is to merge the improved. especially for emerging ap- rent gate-density technology does not two approaches, keeping their desir- plications such as text and speech pro- make MIMD arrays feasible due to cell able aspects and discarding their unde- cessing and gene matching. The more memory and control requirements, but sirable aspects. recent general-purpose systolic array FPGA platforms are well suited for log- The hybrid SlMD architecture" (Fig- projects (Brown, Micmacs, Splash) ical or bit-level applications. Splash, for ure 8) is best utilized for intensive float- show that back-end systolic processors example, is a linear reconfigurable ar- ing-point computational applications but are effective in boosting a workstation's ray appropriate for bit-level applica- does not degrade in performance as performance for these applications. tions. much as high-level programmable ar- These arrays are small enough that the An FPGA chip can be configured for rays when significant logical or control host's open architecture with limited one medium-size systolic element or for operations are included. The hybrid U0 bandwidth does not severely im- several simple systolic cells. In either design combines a commercial floating- pact the array's performance for low- case, current technology limits the cir- point multiplier chip and an FPGA con- to moderate-level granularity. Small cuit size to an order of magnitude ap- troller to form a systolic cell. The com- back-end systolic processors are also proaching 25,000 gates. 
Packaging and mercial multiplier is used for its economically sound. For example. 110 pins also influence the design. Cur- economy. speed, and package density: Splash consists of a two-board add-on rent technology limits I/O to approxi- the FPGA closely binds the cell to a set for a Sun workstation. One board mately 200 pins. specific application. Another project is supports the linear array. and the oth- Reconfigurable architectures are also attempting to integrate a hybrid gener- er supports a buffer memory. The set highly qualified for fault-tolerant com- al-purpose systolic array cell on a single costs from $13,000 to $35,000.' puting. Reconfiguration can simply elim- chip.': Almost without exception, current inate a bad cell from the array. Cell A recently developed simulator al- research emphasizes reconfigurable architectures can be configured around lows simulation and testing of the cell cells, reconfigurable arrays, and hybrids VLSI defects. and the array design for a hybrid archi- of functional units embedded in recon- tecture." Unlike most commercially figurable FPGA arrays. Reconfigurable available computer-aided design utili- designs have proven to be unmatched Architectural issues of ties, which verify designs at the chip for low- to moderate-granularity require- hybrid arrays level, it verifies the performance of al- ments but are not yet mature enough gorithms and architectures through sim- for high-granularity applications. In ulation of the complete array. The sim- addition, as we have said. reconfigu- High-level programmable arrays re- ulator maintains the iterative nature of rable topologies and cells are highly quire extensive efforts tomap algorithms purely reconfigurable array design by fault-tolerant. Their fault tolerance is a to high-level languages after algorithm facilitating tailoring of hybrid designs configuration issue. not a design and development. When mapping is com- for specific applications. fabrication issue. 
pleted, the resulting system often is not Until FPGA chip density progresses as efficient as a system implemented in to the point where a very large FPGA a fixed-function ASIC. On the other Existing architectures can achieve high-level granularity, hy- hand. low FPGA gate density makes it brid architectures present perhaps the unlikely that large-grain tasks will ever A review of projects initiated during most practical means to a reconfigu- be completely implemented in FPGAs the last decade shows that the trend was rable high-level systolic array. Hybrids or that reconfigurable arrays will re- to develop large systolic array proces- make use of the most attractive features place conventional SlMD and MIMD sors that require elaborate. customized of programmable and reconfigurable
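The division of labor in such a hybrid cell can be caricatured in a few lines of Python: a fast but fixed-function multiplier does the arithmetic, while a configurable controller sequences it for one application. The class names, the coefficient parameter, and the multiply-accumulate behavior are illustrative assumptions, not details of the published design.

```python
class MultiplierChip:
    """Stands in for the commercial floating-point multiplier:
    fast and dense, but fixed function."""
    def mul(self, a, b):
        return a * b

class FpgaController:
    """Stands in for the FPGA controller: binds the fixed multiplier
    to a specific application via its configuration."""
    def __init__(self, mult, coeff):
        self.mult = mult
        self.coeff = coeff    # application-specific configuration
        self.acc = 0.0

    def step(self, x_in):
        # One systolic beat: multiply, accumulate, pass the operand on.
        self.acc += self.mult.mul(self.coeff, x_in)
        return x_in           # operand continues downstream

def hybrid_cell_demo(samples, coeff=0.5):
    cell = FpgaController(MultiplierChip(), coeff)
    for x in samples:
        cell.step(x)
    return cell.acc
```

Reprogramming the cell for a different application changes only the controller's configuration; the multiplier, like the commercial chip it models, never changes.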


No discussion of general-purpose systolic arrays is complete without addressing the issues of programming and configuration. High-level-language programming is desirable for promoting widespread use of programmable systolic back-end processors. Of the projects we surveyed, only the larger systems had mature programming environments. Currently, the smaller programmable arrays have implemented only assembly programming. As mentioned earlier, FPGA-configurable arrays are configured with the use of a schematic editor. One drawback of this approach is that it typically takes more effort than programming a programmable cell.

The systolic array is a formidable approach to exploiting concurrencies in a computationally rhythmic and intensive environment. General-purpose systolic arrays provide an economical way to enhance computational performance by emphasizing concurrency and parallelism. Arrays ranging from low-level to high-level granularity have been applied to problems from bit-oriented pixel mapping to 32-bit floating-point-based scientific computing. Systolic arrays hold great promise to be a pervasive form of concurrency processing. As a solution to the intensive computational performance requirements of tomorrow's applications, general-purpose systolic arrays cannot be overlooked.

References

1. W.K. Fuchs and E.E. Swartzlander Jr., "Wafer-Scale Integration: Architectures and Algorithms," Computer, Vol. 25, No. 4, Apr. 1992, pp. 6-8.
2. P. Quinton and Y. Robert, Systolic Algorithms and Architectures, Prentice-Hall, Englewood Cliffs, N.J., 1991.
3. A. Krikelis and R.M. Lea, "Architectural Constructs for Cost-Effective Parallel Computers," in Systolic Array Processors, Prentice-Hall, Englewood Cliffs, N.J., 1989, pp. 287-300.
4. H.T. Kung, "Why Systolic Architectures?" Computer, Vol. 15, No. 1, Jan. 1982, pp. 37-46.
5. J.A.B. Fortes and B.W. Wah, "Systolic Arrays: From Concept to Implementation," Computer (special issue on systolic arrays), Vol. 20, No. 7, July 1987, pp. 12-17.
6. C.K. Ko and O. Wing, "Mapping Strategy for Automated Design of Systolic Arrays," Proc. Int'l Conf. Systolic Arrays, IEEE CS Press, Los Alamitos, Calif., Order No. 860, 1988, pp. 285-294.
7. R. Hughey and D. Lopresti, "B-SYS: A 470-Processor Programmable Systolic Array," Proc. Int'l Conf. Parallel Processing, IEEE CS Press, Los Alamitos, Calif., Order No. 2355-22, 1991, pp. 1580-1583.
8. M. Gokhale et al., "Building and Using a Highly Parallel Programmable Logic Array," Computer, Vol. 24, No. 1, Jan. 1991, pp. 81-89.
9. H. Schroder and P. Strazdins, "Program Compression on the ISA," Parallel Computing, Vol. 17, No. 2-3, June 1991, pp. 207-215.
10. B. Friedlander, "Block Processing on a Programmable Systolic Array," Proc. Int'l Conf. Parallel Processing, IEEE CS Press, Los Alamitos, Calif., Order No. 889, 1987, pp. 184-187.
11. R. Smith and G. Sobelman, "Simulation-Based Design of Programmable Systolic Arrays," Computer-Aided Design, Vol. 23, No. 10, Dec. 1991, pp. 669-675.
12. S. Jones, A. Spray, and A. Ling, "A Flexible Building Block for the Construction of Processor Arrays," in Systolic Array Processors, Prentice-Hall, Englewood Cliffs, N.J., 1989, pp. 459-466.

Kurtis T. Johnson is an advanced engineer at HRB Systems, an E-Systems subsidiary. His interests include parallel computer architectures, parallel algorithms, and digital signal processing. He received his BS in 1986 and his MS in 1992, both in electrical engineering, from Pennsylvania State University.

A.R. Hurson is an associate professor of computer engineering at Pennsylvania State University. His research is directed toward the design and analysis of general-purpose and special-purpose computer architectures. He has published more than 110 papers on topics including computer architecture, parallel processing, database systems and machines, dataflow architectures, and VLSI algorithms. He coauthored the IEEE tutorial books Parallel Architectures for Database Systems and Multidatabase Systems: An Advanced Solution for Global Information Sharing. He cofounded the IEEE Symposium on Parallel and Distributed Processing. Hurson is a member of the IEEE Computer Society Press Editorial Board and a member of the IEEE Distinguished Visitors Program.

Behrooz Shirazi is an associate professor of computer science engineering at the University of Texas at Arlington. Previously, he was an assistant professor at Southern Methodist University. His research interests include task scheduling, dataflow computation, and parallel and distributed processing. He has published widely on these topics. He is a cofounder of the IEEE Symposium on Parallel and Distributed Processing. He has served as chair of the Computer Society section in the Dallas chapter of the IEEE and as chair of the IEEE Region V Area Activities Board. Shirazi received his MS and PhD degrees in computer science from the University of Oklahoma, in 1980 and 1985, respectively.

Send correspondence about this article to A.R. Hurson, Dept. of Computer Science and Engineering, Pennsylvania State University, University Park, PA 16802, or [email protected].

Laxmi Bhuyan, Computer's system architecture area editor, coordinated and recommended this article for publication. His e-mail address is [email protected].
