
General-Purpose Systolic Arrays

Kurtis T. Johnson and A.R. Hurson, Pennsylvania State University
Behrooz Shirazi, University of Texas, Arlington
Computer, November 1993

Systolic arrays effectively exploit massive parallelism in computationally intensive applications. With advances in VLSI, WSI, and FPGA technologies, they have progressed from fixed-function to general-purpose architectures.

When Sun Microsystems introduced its first workstation, the company could not have imagined how quickly workstations would revolutionize computing. The idea of a community of engineers, scientists, or researchers time-sharing on a single mainframe could hardly have become ancient any more quickly. The almost instant wide acceptance of workstations and desktop computers indicates that they were quickly recognized as giving the best and most flexible performance for the dollar.

Desktop computers proliferated for three reasons. First, very large scale integration (VLSI) and wafer scale integration (WSI), despite some problems, increased the gate density of chips while dramatically lowering their production cost. Moreover, increased gate density permits a more complicated processor, which in turn promotes parallelism. Second, desktop computers distribute processing power to the user in an easily customized open architecture. Real-time applications that require intensive I/O and computation need not consume all the resources of a mainframe. Also, desktop computers support high-definition screens with color and motion far exceeding those available with any multiple-user, shared-resource mainframe. Third, economical, high-bandwidth networks allow desktop computers to share data, thus retaining the most appealing aspect of centralized computing: resource sharing. Moreover, networks allow the computers to share data with dissimilar computing machines. That is perhaps the most important reason for the acceptance of desktop computers, since all the performance in the world is worth little if the machine is isolated.

Today's workstations have redefined the way the computing community distributes processing resources, and tomorrow's machines will continue this trend with higher bandwidth networks and higher computational performance. One way to obtain higher computational performance is to use special parallel hardware to perform functions such as motion and color support of high-definition screens. Future computationally intensive applications suited for desktop computing machines include real-time text, speech, and image processing. These applications require massive parallelism.

Many computational tasks are by their very nature sequential; for other tasks the degree of parallelism varies. Therefore, a computational architecture must maintain sufficient application flexibility and computational efficiency. It must be

- reconfigurable to exploit application-dependent parallelisms,
- high-level-language programmable for task control and flexibility,
- scalable for easy extension to many applications, and
- capable of supporting single-instruction stream, multiple-data stream (SIMD) organizations for vector operations and multiple-instruction stream, multiple-data stream (MIMD) organizations to exploit nonhomogeneous parallelism requirements.

Systolic arrays are ideally qualified for computationally intensive applications. Whether functioning as a dedicated fixed-function graphics processor or a more complicated and flexible coprocessor shared across a network, a systolic array effectively exploits massive parallelism. Falling into an area between vector computers and massively parallel computers, systolic arrays typically combine intensive local communication and computation with decentralized parallelism in a compact package. They capitalize on regular, modular, rhythmic, synchronous, concurrent processes that require intensive, repetitive computation. While systolic arrays originally were used for fixed or special-purpose architectures, the systolic concept has been extended to general-purpose SIMD and MIMD architectures.

Why systolic arrays?

Ever since Kung proposed the systolic model, its elegant solutions to demanding problems and its potential performance have attracted great attention. In physiology, the term systolic describes the contraction (systole) of the heart, which regularly sends blood to all cells of the body through the arteries, veins, and capillaries. Analogously, systolic computer processes perform operations in a rhythmic, incremental, cellular, and repetitive manner. The systolic computational rate is restricted by the array's I/O operations, much as the heart controls blood flow to the cells since it is the source and destination for all blood.

Although there is no widely accepted standard definition of systolic arrays and systolic cells, the following description serves as a working definition in this article (also see "Systolic array summary" below). Systolic arrays have balanced, uniform, gridlike architectures (Figure 1) in which each line indicates a communication path and each intersection represents a cell or a systolic element. However, systolic arrays are more than processor arrays that execute systolic algorithms. Systolic arrays consist of elements that take one of the following forms:

- a special-purpose cell with hardwired functions,
- a vector-computer-like cell with an instruction decoding unit and a processing unit, or
- a processor complete with a control unit and a processing unit.

In all cases, the systolic elements or cells are customized for intensive local communications and decentralized parallelism. Because an array consists of cells of only one or, at most, a few kinds, it has regular and simple characteristics. The array usually is extensible with minimal difficulty.

Figure 1. General systolic organization.

Systolic array summary

Systolic array: A gridlike structure of special processing elements that processes data much like an n-dimensional pipeline. Unlike a pipeline, however, the input data as well as partial results flow through the array. In addition, data can flow in a systolic organization at multiple speeds in multiple directions. Systolic arrays usually have a very high rate of I/O and are well suited for intensive parallel operations.

Applications: Matrix arithmetic, signal processing, image processing, language recognition, relational database operations, data structure manipulation, and character string manipulation.

Special-purpose systolic array: An array of hardwired systolic processing elements tailored for a specific application. Typically, many tens or hundreds of cells fit on a single chip.

General-purpose systolic array: An array of systolic processing elements that can be adapted to a variety of applications via programming or reconfiguration.

Programmable systolic array: An array of programmable systolic elements that operates either in SIMD or MIMD fashion. Either the array's interconnect or each processing unit is programmable, and a program controls dataflow through the elements.

Reconfigurable systolic array: An array of systolic elements that can be programmed at the lowest level. FPGA (field-programmable gate array) technology allows the array to emulate hardwired systolic elements at a very low level for each unique application.

Three factors have contributed to the systolic array's evolution into a leading approach for handling computationally intensive applications: technology advances, concurrency processing, and demanding scientific applications.

Technology advances. Advances in VLSI/WSI technology complement the systolic array's qualifications in the following ways:

- Smaller and faster gates allow a higher rate of on-chip communication because data has a shorter distance to travel.
- Higher gate densities permit more complicated cells with higher individual and group performance. Granularity increases as word length increases, and concurrency increases with more complicated cells.
- Economical design and fabrication processes produce less expensive systolic chips, even in small quantities.
- Better design tools allow arrays to be designed more efficiently. A systolic cell can be fully simulated before fabrication, reducing the chances that it will fail to work as designed.

With advances in simulation techniques, fully tested, unique cells can now be quickly copied and arranged in regular, modular arrays. As VLSI/WSI designs become more complicated, "systolicizing" them provides an efficient way to ensure fault tolerance: any fault tolerance precautions built into one cell are extensible to all cells. Relatively new field-programmable gate array (FPGA) technology permits a reconfigurable architecture, as opposed to a reprogrammable architecture.

Concurrency processing. Past efforts to add concurrency to the conventional von Neumann architecture have yielded coprocessors, multiple processing units, data pipelining, and multiple homogeneous processors. Systolic arrays combine features from all of these architectures in a massively parallel architecture that can be integrated into existing platforms without a complete redesign. A systolic array can act as a coprocessor, can contain multiple processing units and/or processors, and can act as an n-dimensional pipeline. Although data pipelining reduces I/O requirements by allowing adjacent cells to reuse the input data, the systolic array's real novelty is its incremental instruction processing, or computational pipelining. Each cell computes an incremental result, and the computer derives the complete result by interpreting the incremental results from the entire array in a prespecified algorithmic format.

Demanding scientific applications. The technology growth of the last three decades has produced computing environments that make it feasible to attack demanding scientific applications on a larger scale. Large-matrix multiplication, feature extraction, cluster analysis, and radar signal processing are only a few examples. As recent history shows, when many computer users work on a wide variety of applications, they develop new applications requiring increased computational performance. Examples of these innovative applications include interactive language recognition, relational database operations, text recognition, and virtual reality. These applications require massive repetitive and rhythmic parallel processing, as well as intensive I/O operation. Hence, systolic computing.

Implementation issues

A number of implementation issues determine a systolic array's performance efficiency. Designers should understand the following performance trade-offs at the design stage.

Algorithms and mapping. Designers must be intimately familiar with the algorithms they are implementing on systolic arrays. Designing a systolic array heuristically from an algorithm is slow and error-prone, requiring simulation for verification and often producing a less-than-optimum algorithm. Thus, automatic array synthesis is an important research area. At present, however, most array designs are based on heuristics.

Integration into existing systems. Generally, a systolic array is integrated into an existing host as a back-end processor. The array's high I/O requirements often make system integration a significant problem. Because the existing I/O channel rarely satisfies the array's bandwidth requirement, a memory subsystem often must be added between the host and the systolic array to support data access and data multiplexing and demultiplexing. The memory subsystem can range from the complicated support and cluster processors in the Warp array to the simpler staging memory in the Splash array. (These systems will be discussed in greater detail later.)

Cell granularity. The level of cell granularity directly affects the array's throughput and flexibility and determines the set of algorithms that it can efficiently execute. Each cell's basic operation can range from a logical or bitwise operation, to a word-level multiplication or addition, to a complete program. Granularity is subject to technology capabilities and limitations as well as design goals. For example, integration-substrate families have different performance and density characteristics. Packaging also introduces I/O pin restrictions.

Extensibility. Because systolic arrays are built of cellular building blocks, the cell design should be sufficiently flexible for use in a wide variety of topologies implemented in a wide variety of substrate technologies.

Clock synchronization. Clock lines of different lengths within integrated chips, as well as external to the chips, can introduce skews. The risk of clock skew is greater when dataflow in the systolic array is bidirectional. Wavefront arrays reduce the clock skew problem by introducing more complicated, asynchronous, intercellular communications.

Reliability. As integrated circuits grow larger, designers must build in greater fault tolerance to maintain reliability, and diagnostics to verify proper operation.

Systolic array taxonomy

The term systolic array originally referred to special-purpose or fixed-function architectures designed as hardware implementations of a given algorithm. In mass quantities, the production of these arrays was manageable and economical, and, thus, they were well suited for common applications. But these designs were bound to the specific application at hand and were not flexible or versatile. Every time a systolic array was to be used on a new application, the manufacturer had to undertake the long, costly, and potentially risky process of designing, testing, and fabricating an application-specific integrated chip. Although the cost and risks of developing ASICs have decreased in recent

Table 1. Systolic array taxonomy.

Class             General-purpose                                   Special-purpose
Type              Programmable | Reconfigurable | Hybrid            Hardwired
Organization      SIMD or MIMD | VFIMD          | VFIMD
Topology          Programmable                                      Fixed
Interconnections  Static, Dynamic, Fixed                            Static, Dynamic, Fixed
Dimensions        n-dimensional (n > 2 is rare due to complexity)   n-dimensional

SIMD: single-instruction stream, multiple-data stream; MIMD: multiple-instruction stream, multiple-data stream; VFIMD: very-few-instruction stream, multiple-data stream.

years, budget constraints have motivated a trend away from unique hardware development. Consequently, general-purpose systolic architectures have become a logical alternative. In addition to serving in a wide variety of applications, they also provide test beds for developing, verifying, and debugging new systolic algorithms. Table 1 shows a taxonomy of general-purpose and special-purpose systolic arrays.

Special-purpose architectures. Special-purpose systolic architectures are custom designed for each application. Few problems resist attack from systolic arrays, but some problems may require elegant algorithms. Generally speaking, the systolic design requires a high-performance algorithm that can be efficiently implemented with today's VLSI technology.

One area that easily utilizes systolic algorithms is matrix operations. Figure 2 illustrates the algorithm for the sum of a scalar product, computed in a single systolic element. After the cell is initialized, the a's and the b's are synchronously shifted through the processing element. The accumulator stores the sum of the a·b products. All the a and b data synchronously exits the processing element unmodified to be available for the next element. At the end of processing, the sum of the products is shifted out of the accumulator.

Figure 2. A systolic processing element that computes the sum of a scalar product.

This principle easily extends to a matrix product, as shown in Figure 3. The only difference between single-element processing and array processing is that the latter delays each additional column and row by one cycle so that the columns and rows line up for a matrix multiply. The product matrix is shifted out after completion of processing.

Figure 3. The systolic product of two 3 x 3 matrices.

An obvious problem with this approach is that matrix products involving a matrix larger than the systolic array must be divided into a set of smaller matrix products. This resource and implementation problem affects all systolic arrays. Proper algorithm development compensates for the problem, but performance decreases nevertheless. The matrix product example also demonstrates another problem with special-purpose systolic arrays and hardware in
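The cell behavior of Figures 2 and 3 is easy to check in software. The sketch below is an illustrative simulation written for this discussion, not code from the article: each cell multiplies the a and b values currently passing through it, adds the product to its accumulator, and forwards both values unmodified, while rows of A and columns of B enter skewed by one cycle per row or column, exactly as described above.

```python
# Cycle-accurate sketch of the systolic matrix multiply of Figure 3.
# Cell (i, j) receives A-data moving right and B-data moving down,
# accumulates their product, and passes both inputs on unmodified.

def systolic_matmul(A, B):
    n = len(A)
    acc = [[0] * n for _ in range(n)]        # per-cell accumulators
    a_reg = [[None] * n for _ in range(n)]   # A value held in each cell this cycle
    b_reg = [[None] * n for _ in range(n)]   # B value held in each cell this cycle
    for t in range(3 * n - 2):               # enough cycles to drain the array
        # Shift: values move one cell per cycle (A rightward, B downward).
        for i in range(n):
            for j in reversed(range(n)):
                a_reg[i][j] = a_reg[i][j - 1] if j > 0 else None
        for j in range(n):
            for i in reversed(range(n)):
                b_reg[i][j] = b_reg[i - 1][j] if i > 0 else None
        # Skewed injection: row i of A and column j of B are delayed i (j) cycles.
        for i in range(n):
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < n else None
        for j in range(n):
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < n else None
        # Every cell multiplies and accumulates whatever pair is passing through.
        for i in range(n):
            for j in range(n):
                if a_reg[i][j] is not None and b_reg[i][j] is not None:
                    acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc
```

Because each cell only ever sees the operand pair currently flowing past it, the same loop also shows why an n x n array needs roughly 3n cycles before the complete product can be shifted out.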

general. The more specialized the hardware, the higher the performance; but cost per application also rises and flexibility decreases. Therein lies the attractiveness of general-purpose systolic architectures.

General-purpose architectures. The two basic types of general-purpose systolic arrays are the programmable model and the reconfigurable model. Recently, hybrid models have also been proposed.

In the programmable model, cell architectures and array architectures remain the same from application to application. However, a program controls data operations in the cells and data routing through the array. All communication paths and functional units are fixed, and the program determines when and which paths are utilized. One of the first programmable architectures was Carnegie Mellon's programmable systolic chip.

In the reconfigurable model, cell architectures as well as array architectures change from one application to another. The architecture for each application appears as a special-purpose array. The primary means of implementing the reconfigurable model is FPGA technology. Splash was one of the early FPGA-based reconfigurable systolic arrays.

Hybrid models make use of both VLSI and FPGA technology. They usually consist of VLSI circuits embedded in an FPGA-reconfigurable interconnection network.

Systolic topologies. Array topologies can be either programmable or reconfigurable. Likewise, array cells are either programmable or reconfigurable. A programmable systolic architecture is a collection of interconnected, general-purpose systolic cells, each of which is either programmable or reconfigurable. Programmable systolic cells are flexible processing elements specially designed to meet the computational and I/O requirements of systolic arrays. Programmable systolic architectures can be classified according to their cell interconnection topologies: fixed or programmable.

Fixed cell interconnections limit a given topology to some subset of all possible algorithms. That topology can emulate other topologies by means of the proper mapping transformation, but reduced performance is often a consequence.

Programmable cell interconnection topologies typically consist of programmable cells embedded in a switch lattice that allows the array to assume many different topologies. Programmable topologies are either static or dynamic. Static topologies can be altered between applications, and dynamic topologies can be altered within an application. Static programmable topologies can be implemented with much less complexity than dynamic programmable topologies. There has been little research in dynamic programmable topologies because a highly complex interconnection network could undermine the regular and simple principles of systolic architectures.

Reconfigurable systolic architectures capitalize on FPGA technology, which allows the user to configure a low-level logic circuit for each cell. Reconfigurable arrays also have either fixed or reconfigurable cell interconnections. The user reconfigures an array's topology by means of a switch lattice. Any general-purpose array that is not conventionally programmable is usually considered reconfigurable. All FPGA reconfiguring is static due to technology limitations.

Array dimensions. We can further classify general-purpose and special-purpose systolic architectures by their array dimensions. The two most common structures are the linear array and the two-dimensional array. Linear systolic arrays are by default statically reconfigurable in one-dimensional space. Two-dimensional arrays allow more efficient execution of complicated algorithms. Due to I/O limitations, general-purpose systolic arrays of dimensions greater than two are not common.

Programmable array organization. Programmable systolic arrays are programmable either at a high level or a low level. At either level, programmable arrays can be categorized as either SIMD or MIMD machines. They are typically back-end processors with an additional buffer memory to handle the high systolic I/O rates. High-level programmable arrays usually are programmed in high-level languages and are word oriented. Low-level arrays are programmed in assembly language and are bit oriented.

SIMD. SIMD systolic machines (Figure 4) operate similarly to a vector processor. The host workstation preloads a controller and a memory, which are external to the array, with the instructions and data for the application. The systolic cells store no programs or instructions. Each cell's instruction-processing functional unit serves as a large decoder. Once the workstation enables execution, the controller sequences through the external memory, thus delivering instructions and data to the systolic array. Within the array, instructions are broadcast and all cells perform the same operation on different data. Adjacent cells may share memory, but generally no memory is shared by the entire array. After exiting the array, data is collected in the external buffer memory. SIMD-programmable systolic cells take less space on the VLSI wafer than their MIMD counterparts because they are simple instruction-processing elements requiring no program memory. From tens to hundreds of such cells fit on one integrated chip.

Figure 4. General organization of SIMD programmable linear systolic arrays.
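As a rough sketch of this organization (illustrative only; the opcode set and the function name run_simd are invented for the example), the model below keeps one operand per cell and lets an external controller broadcast one instruction per cycle, so every cell performs the identical operation on its own data:

```python
# Toy model of the SIMD organization of Figure 4: an external controller
# broadcasts one instruction per cycle; every cell applies it to its own
# local operand, so all cells execute the same operation on different data.

def run_simd(program, local_data):
    cells = list(local_data)                 # one operand register per cell
    for opcode, operand in program:          # controller sequences the program
        if opcode == "add":
            cells = [c + operand for c in cells]
        elif opcode == "mul":
            cells = [c * operand for c in cells]
        elif opcode == "shift":              # pass each value to the right neighbor
            cells = [operand] + cells[:-1]
        else:
            raise ValueError(f"unknown opcode {opcode!r}")
    return cells                             # results drain to the buffer memory

# Each cell holds different data but executes the identical broadcast stream.
print(run_simd([("add", 1), ("mul", 2)], [10, 20, 30]))   # [22, 42, 62]
```

Note that no cell stores the program; the instruction stream arrives from outside, which is why SIMD cells need no program memory and stay small.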

MIMD. MIMD systolic machines (Figure 5) operate similarly to homogeneous von Neumann multiprocessor machines. The workstation downloads a program to each MIMD systolic cell. Each cell may be loaded with a different program, or all the cells in the array may be loaded with the same program. Each cell's architecture is somewhat similar to the conventional von Neumann architecture: It contains a control unit, an ALU, and local memory. MIMD systolic cells have more local memory than their SIMD counterparts to support the von Neumann-style organization. Some may have a small amount of global memory, but generally no memory is shared by all the cells. Whenever data is to be shared by processors, it must be passed to the next cell. Thus, data availability becomes a very important issue. High-level MIMD systolic cells are very complicated, and usually only one fits on a single integrated chip. For example, the iWarp is a Warp cell without memory on a single chip. Local memory for each iWarp cell must be supplied by additional chips.

Figure 5. General organization of MIMD programmable linear systolic arrays.

Reconfigurable array organization. Recent gate-density advances in FPGA technology have produced a low-level, reconfigurable systolic array architecture that bridges the gap between special-purpose arrays and the more versatile, programmable, general-purpose arrays. The FPGA architecture is unusual because a single hardware platform can be logically reconfigured as an exact duplicate of a special-purpose systolic array. Figure 6 shows the general organization of a reconfigurable array. The designer logically draws cell architectures on the workstation with a schematic editor (such as Mentor), converts the design to FPGA code with another utility, and then downloads the code in a few seconds to configure the FPGA architecture.

Figure 6. General organization of reconfigurable arrays.

Reconfigurable systolic arrays do not fall into the SIMD or MIMD categories for the same reason that special-purpose systolic arrays do not. Reconfigurable arrays, like special-purpose arrays, are generally limited to VFIMD (very-few-instruction streams, multiple-data streams) organization due to FPGA gate density. Instructions are implicit in the configuration of each cell; therefore, there is no need to download them from the workstation. Since special-purpose systolic architectures consist of a very few unique cells repeated throughout the array, the entire array also tends to be VFIMD. Purely reconfigurable architectures are fine-grain, low-level devices best suited for logical or bit manipulations. They typically lack the gate density to support high-level functions such as multiplication.

Tables 2, 3, and 4 list most of the recent programmable and reconfigurable, general-purpose systolic arrays reported in the literature. When information is unclear in the literature, the corresponding space in the table is left blank. The tables indicate three stages of product development:

Table 2. SIMD-organized programmable systolic architectures.

System name, developer | Development stage | Topology | Key features
Brown Systolic Array [1], Brown University | Prototype | Linear, 470 cells | Very small VLSI footprint; 100s of cells per chip; ISA, SSR architectures; 8-bit ALU only; 157 MOPS
Micmacs [2], IRISA, Campus de Beaulieu, France | Prototype | Linear, 18 cells | 16-bit fixed-point math; broadcast data; 90 MOPS
Geometric Arithmetic Parallel Processor [3], NCR | Commercial | 48 x 48 array, 2,304 cells | Bit-slice cellular architecture; global data; 900 MOPS
Saxpy-1M [4], Saxpy Computer Corp. | Commercial | Linear, 32 cells | 32-bit floating-point capability; broadcast and global data with block processing; 1,000 MFLOPS
Systolic/Cellular Architecture [5], Hughes Research Laboratories | Prototype | 16 x 16 array, 256 cells | 32-bit fixed-function units
Cylindrical Banyan Multicomputer [6], University of Texas at Austin | Research | | Packet-switched programmable topology with programmable cells
Programmable Systolic Device [7], Australian National University | Research | | Instruction decoding occurs once per chip; each chip has many cells

1. R. Hughey and D. Lopresti, "Architecture of a Programmable Systolic Array," Proc. Int'l Conf. Systolic Arrays, IEEE CS Press, Los Alamitos, Calif., Order No. 860, 1988, pp. 41-49.
2. P. Frison et al., "Micmacs: A VLSI Programmable Systolic Architecture," Systolic Array Processors, Prentice-Hall, Englewood Cliffs, N.J., 1989, pp. 145-155.
3. P. Greussay, "Programmation des Mega-Processeurs: du GAPP a la Connection Machine," Course Notes DEA, Paris Univ. VIII, Vincennes, 1985.
4. D. Foulser and R. Schreiber, "The Saxpy Matrix-1: A General-Purpose Systolic Computer," Computer, Vol. 20, No. 7, July 1987, pp. 35-43.
5. K.W. Przytula and J.B. Nash, "Implementation of Synthetic Aperture Radar Algorithms on a Systolic/Cellular Architecture," Proc. Int'l Conf. Systolic Arrays, IEEE CS Press, Los Alamitos, Calif., Order No. 860, 1988, pp. 21-27.
6. M. Malek and E. Opper, "The Cylindrical Banyan Multicomputer: A Reconfigurable Systolic Architecture," Vol. 10, No. 3, May 1989, pp. 319-326.
7. P. Lenders and H. Schroder, "A Programmable Systolic Device for Image Processing Based on Morphology," Parallel Computing, Vol. 13, No. 3, Mar. 1990, pp. 337-344.

- Research: A functional system that has not yet been implemented.
- Prototype: At least a partial system has been implemented.
- Commercial: The system has been implemented, and an integrated chip with multiple cells, a complete system, or both are available for purchase.

Architectural issues of programmable systolic arrays

To be effective, a programmable systolic architecture must adhere to the general principles of systolic arrays: regularity, simplicity, concurrency, and rhythmic communications. In addition, the introduction of a programmable systolic cell with significant local memory has resulted in a new mode of operations: block processing. Block processing combines periods of intensive systolic I/O and periods of sequential von Neumann-style processing to form what is known as a pseudosystolic model.

Replicating and interconnecting a basic programmable systolic cell to form an array carries out the regularity and simplicity principles, provided an appropriate algorithm and program are utilized. However, programmable-cell-based architectures must introduce special features to address the systolic properties of concurrency, rhythmic communications, and block processing.

Concurrency. Two levels of concurrency are possible in a programmable systolic architecture: concurrency across cells and concurrency within cells, both implicit to hardwired designs. To make concurrency feasible in programmable systolic architectures, the designer must add special mechanisms that facilitate processing control.

Intercell concurrency. Coordination of concurrency control has always been inherent in a hardwired systolic design because it has been addressed at the algorithm development stage. The advent of the programmable systolic array required that innovative features be incorporated into designs to satisfy a programmable coordination requirement. Methods of program loading, memory initialization, program switching, and local memory access at the cell level had


Table 3. MIMD-organized programmable systolic architectures.

System name, developer | Development stage | Topology | Key features
Programmable Systolic Chip [1], Carnegie Mellon University | Prototype | |
Warp [2], Carnegie Mellon University | Commercial | Linear, 10 cells | 32-bit floating-point multiplication; block processing; I/O queuing; 100 MFLOPS
iWarp [3], Carnegie Mellon University | | |
Computer for Experimental Synthetic Aperture Radar [4], Norwegian Defense Research Establishment | Prototype | Four 8 x 16 arrays, 512 cells | Bit-serial cellular I/O; 32-bit floating-point multipliers in each cell; 320 MFLOPS
Cellular Array Processor [5], Fujitsu Laboratories, Japan | Commercial | |
Configurable Highly Parallel Computer [6], Purdue University | Research | | Programmable cells embedded in switch lattice for programmable topology
Associative String Processor [7], Brunel University, UK | Research | |
PICAP3 [8], University of Paris | Prototype | 8 x 8 array, 64 cells | 16-bit word length; image oriented
PSDP [9], Univ. of South Florida, Honeywell | Research | | 4-bit word length; wafer-scale design

1. A.L. Fisher, K. Sarocky, and H.T. Kung, "Experience with the CMU Programmable Systolic Chip," Proc. Soc. of Photo-Optical Instrumentation Engineers, Real-Time Signal Processing VII, SPIE, San Diego, Calif., 1984, p. 495.
2. M. Annaratone et al., "Architecture of Warp," Proc. Compcon, IEEE CS Press, Los Alamitos, Calif., Order No. 764, Feb. 1987, pp. 264-267.
3. S. Borkar et al., "iWarp: An Integrated Solution to High-Speed Parallel Computing," Proc. Supercomputing, IEEE CS Press, Los Alamitos, Calif., Order No. 882, 1988, pp. 330-339.
4. M. Toverud and V. Anderson, "CESAR: A Programmable High-Performance Systolic Array Processor," Proc. Int'l Conf. Computer Design, IEEE CS Press, Los Alamitos, Calif., Order No. 872 (microfiche only), 1988, pp. 414-417.
5. M. Ishii et al., "Cellular Array Processor CAP and Application," Proc. Int'l Conf. Systolic Arrays, IEEE CS Press, Los Alamitos, Calif., Order No. 860, 1988, pp. 535-544.
6. L. Snyder, "Introduction to the Configurable Highly Parallel Computer," Computer, Vol. 15, No. 1, Jan. 1982, pp. 47-56.
7. R.M. Lea, "The ASP, a Fault-Tolerant VLSI/ULSI/WSI Associative String Processor for Cost-Effective Processing," Proc. Int'l Conf. Systolic Arrays, pp. 515-524.
8. B. Lindskog and P.E. Danielsson, "PICAP3: A Parallel Processor Tuned for 3D Image Operations," Proc. Eighth Int'l Conf. Pattern Recognition, IEEE CS Press, Order No. 742 (microfiche only), 1986, pp. 1248-1250.
9. D. Landis et al., "A Wafer-Scale Programmable Systolic Data Processor," Proc. Ninth Biennial Univ./Gov't/Ind. Microelectronics Symp., IEEE, Piscataway, N.J., 1991, pp. 252-256.

to be incorporated into a systolic environment. The following mechanisms facilitate programmable concurrency throughout the array:

• Broadcast data permits all cells of an array to be reset simultaneously with initial conditions and constants at the start of a processing block during nonsystolic modes of operation. The larger the array, the more efficiency it will gain from a broadcast data capability.
• Broadcast instructions permit operations similar to those of a vector computer by making each systolic cell analogous to the vector computer's processing unit. However, unlike most vector computers, systolic cells can support a high-bandwidth communication channel with adjacent cells.
• Broadcast instruction addresses allow fast and efficient switching among programs in an MIMD cell's program memory. When all MIMD cells have programs stored in the same places in their program memories, a jump to the broadcast instruction address carries out simultaneous program switching throughout the array. If the programs in all the MIMD cells happen

November 1993 27

Table 4. FPGA-based programmable systolic architectures.

System name, developer | Development stage | Topology | Key features

Splash Systolic Engine¹, Supercomputing Research Center | Prototype | Linear, 32 cells | Based on FPGA

Cellular Array Logic², Edinburgh University, UK | Prototype | 16 × 16 array, 256 cells | Reconfigurable cell architecture based on custom chip

Hybrid Architecture³, University of Illinois | Research | Reconfigurable cell architecture integrating FPGA capability and 32-bit floating-point multiplication

Programmable Adaptive Computing Engine⁴, University of Wales, Bangor, UK | Prototype | Programmable | Functional units embedded in each cell; reconfigurable connections in cell as well as programmable topology

Configurable Functional Array⁵, Tsinghua University, Beijing | Research | Configurable array topology and cell architecture

1. M. Gokhale et al., "Splash: A Reconfigurable Linear Logic Array," Proc. Int'l Conf. Parallel Processing, IEEE CS Press, Los Alamitos, Calif., Order No. 2101, 1990, pp. 1526-1531.
2. T. Kean and J. Gray, "Configurable Hardware: Two Case Studies of Micrograin Computation," in Systolic Array Processors, Prentice-Hall, Englewood Cliffs, N.J., 1989, pp. 310-319.
3. R. Smith and G. Sobelman, "Simulation-Based Design of Programmable Systolic Arrays," Computer-Aided Design, Vol. 23, No. 10, Dec. 1991, pp. 669-675.
4. S. Jones, A. Spray, and A. Ling, "A Flexible Building Block for the Construction of Processor Arrays," in Systolic Array Processors, Prentice-Hall, Englewood Cliffs, N.J., 1989, pp. 459-466.
5. C. Wenyang, L. Yanda, and J. Yue, "Systolic Realization for 2D Convolution Using Configurable Functional Method in VLSI Parallel Array Designs," IEE Proc. Computers and Digital Techniques, Vol. 138, No. 5, Sept. 1991, pp. 361-370.
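The arrays in Table 4 differ in cell technology, but all implement the same rhythmic discipline: on every beat a cell accepts an operand from its upstream neighbor, computes, and passes data downstream. The following toy Python simulation illustrates that discipline for a linear, output-stationary matrix-vector product; the cell behavior is a generic sketch, not the programming model of any machine listed above.

```python
def systolic_matvec(A, x):
    """Simulate y = A x on a linear systolic array of len(x) cells.

    Cell i holds row i of A and an output accumulator; the vector x is
    pumped through the array one cell per beat (output-stationary style).
    """
    n = len(x)
    y = [0] * n          # output-stationary accumulator in each cell
    pipe = [None] * n    # the x value currently held by each cell
    for beat in range(2 * n - 1):                 # extra beats drain the pipe
        for i in range(n - 1, 0, -1):             # pass x downstream
            pipe[i] = pipe[i - 1]
        pipe[0] = x[beat] if beat < n else None   # feed the array, then bubbles
        for i in range(n):                        # all cells fire each beat
            if pipe[i] is not None:               # cell i sees x[beat - i] now
                y[i] += A[i][beat - i] * pipe[i]
    return y

# Example: a 2 x 2 product finishes in 2*2 - 1 = 3 beats.
# systolic_matvec([[1, 2], [3, 4]], [5, 6]) -> [17, 39]
```

Note how the result emerges purely from the shift-and-accumulate rhythm; no cell ever addresses data outside its own registers.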

to be identical, the array is performing SIMD operations.
• Instruction systolic arrays⁹ provide a precise low-level method of coordinating processing from cell to cell. In an ISA, instructions travel through the array with the data. The processed data passes with the original instruction from each cell to the next. An appropriately wide ISA instruction also has built-in microcodable parallelism. ISA algorithms can be lengthy and more difficult to develop than algorithms for other SIMD and MIMD arrays. But they provide a way to uniquely specify concurrency across cells without an overly complicated circuit.
• Direct local memory access or global memory. Status flags can be stored either in each cell's local memory or in a global memory. In the case of local memory, the controller must be able to directly access each cell's local memory to determine flag status; otherwise, the controller must issue a request and wait for it to propagate through the array. When global memory is available, the controller obtains status information much more easily. Direct local memory access or global memory also facilitates initialization within a cell. The Saxpy-1M is one of the first systolic arrays with global memory.

Intracell concurrency. Programmable cells require the ability to perform multiple similar or dissimilar operations simultaneously. Mechanisms incorporated into programmable systolic cells to support concurrency are all proven techniques that originated in conventional von Neumann processors: microcodable processing elements, duplication of functional units, multiple data paths, sufficiently wide instruction words, and pipelining of functional units.

Rhythmic communications. The principle of rhythmic communications clearly separates systolic arrays from other architectures. Programmability creates a more flexible systolic architecture, but the penalties are complexity and possible slowing of operations. When systolic cells are programmable, the issue of data availability arises. To minimize the performance degradation of programmable systolic designs, each cell's processing element must have data available when it is required, high-speed arithmetic capabilities, and the ability to transfer or store the processed data. Thus, each cell must have sufficient memory for data storage as well as program storage to facilitate intercell communications. The following architectural features support rhythmic communications:

• Queued I/O streamlines cellular communications by allowing the source cell to send data to the destination cell when the data is available, not when the destination cell is ready to accept it. Data exits the queue on a FIFO (first in, first out) basis that allows program computation to proceed irrespective of communications status. Queued I/O helps considerably when all cells are computing a single program that is skewed in time across the cells. A disadvantage is that this mechanism uses memory space that is expensive on VLSI wafers.
• Serial I/O reduces the bandwidth requirement on the systolic array's host workstation, at the same time promoting rhythmic cellular communication. The resulting programmable systolic array has bit-serial cellular communications and word-wide operations in each cell. Each cell stores partial words until it receives the full word and then performs functional operations. Using more cells can partially offset the disadvantage of reduced throughput.
• Multiple data paths provide the cell with fast, parallel internal and external communications. One or more data paths are dedicated to computational needs, another to cellular input operations, and yet another to cellular output operations.
• Systolic shared registers provide several benefits specifically for programmable systolic architectures. SSR architectures (Figure 7) provide a program-implicit means of streamlined dataflow. Each two adjacent processing elements share a small register memory. Data flows concurrently with computation as each processing element receives new data from the register memory upstream, operates on it, and directs it to the next register memory downstream. This architecture eliminates the requirement that data movement be explicitly specified, and dataflow through the array becomes a result of processing. The input register and the output register are specified by the instruction word. Therefore, bidirectional dataflow is a natural result. An SSR architecture is advantageous over queued I/O because communications do not occur on a FIFO basis. A disadvantage is that the register bank must be kept small to minimize access time and thus maximize systolic bandwidth. The small register makes programming difficult for some of the more complicated algorithms.
• Broadcast data eases systolic I/O requirements in certain instances by allowing all functional units and local memory of a cell to be initialized or set to a common variable at once.
• Global data eases systolic I/O requirements and the cell's storage requirements by keeping only one copy of the same data and allowing all cells to share it. With proper global memory coherency, any modifications to global data are instantly available to other cells.
• Bidirectional and wraparound dataflow must be explicitly specified in the program, whereas in hardwired designs dataflow is implicit. Bidirectional and wraparound dataflow can reduce the I/O bandwidth requirements of the external buffer memory by recirculating data within the array. More flexible dataflow also allows arrays to execute a wide variety of algorithms more efficiently.

[Figure 7. An SIMD systolic shared-register architecture.]

Block processing. If each programmable systolic cell has significant local memory, block processing is possible in the systolic environment. During block processing, very little systolic I/O occurs, and each cell executes a series of instructions on local data. Once the cells generate an intermediate result, processing pauses while systolic I/O takes place, and new data is written to each cell's local memory.

Block processing on programmable systolic architectures can result in a more efficient, pseudosystolic operation in some applications for two reasons. First, the merger of von Neumann programmable cell architectures into a systolic array environment causes many of the same problems found in homogeneous multiprocessor systems. The serial nature of von Neumann machines interferes with the rhythmic systolic I/O of the array, causing an unacceptable amount of time wasted in waiting for data to become available. Block processing minimizes this wasted time in applications that can be divided into equal segments. Applications are divided into parallel tasks that utilize local data.

Second, for any systolic array, the bandwidth and size of the external memory are always the limiting factors on throughput and performance. Block processing reduces the array's systolic I/O requirements by reducing the amount of cellular I/O.

Important work has been done on characterizing block processing in a general-purpose systolic array.¹⁰ Block processing can be performed on either SIMD or MIMD programmable systolic machines. Requirements for efficient block processing include an appropriate algorithm and significant local memory in each cell. Other types of array communication such as broadcast data and global data are also desirable.

Architectural issues of new reconfigurable systolic arrays

The primary architectural issues in designing a reconfigurable systolic array are the hardware platform of FPGAs and the class of algorithms to be targeted to that platform. The number of FPGAs, their topology, and their gate density will determine the set of systolic architectures that can be synthesized and the set of systolic algorithms that can be implemented on that hardware platform. Reconfigurable systolic architectures are very interesting because the cell's architecture is actually programmed. Consequently, the architecture programmer has complete control over what architectural features are incorporated in each FPGA. Determining the cell architecture can be an iterative process that continuously refines the architecture until it satisfies application requirements.

FPGA architecture programming differs from conventional programming in that one programs the circuit's logical function, instead of programming a model of operations in a high-level language. Reconfiguring an FPGA is the same as logically drawing a new circuit. Therefore, an FPGA platform can assume any architecture that FPGA gate density and package pinout permit.
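The block-processing mode described earlier alternates long, communication-free compute phases with brief systolic I/O phases. A schematic Python rendering of that schedule follows; the per-cell program (a sum of squares) and the block contents are hypothetical stand-ins for whatever the application would run.

```python
def run_block_processed(blocks, cells=4):
    """Toy block-processing schedule for a programmable systolic array.

    Each iteration models one cycle: a systolic I/O phase that loads one
    data block into each cell's local memory, a compute phase in which
    every cell runs many instructions on purely local data, and an I/O
    phase that drains the intermediate results.
    """
    results = []
    for start in range(0, len(blocks), cells):
        batch = blocks[start:start + cells]        # I/O phase: one block per cell
        local_mem = [list(b) for b in batch]       # written to local memories
        # Compute phase: no cellular I/O occurs here at all.
        partial = [sum(v * v for v in mem) for mem in local_mem]
        results.extend(partial)                    # I/O phase: drain results
    return results
```

The point of the sketch is the ratio: the inner compute loop can be arbitrarily long relative to the two short I/O steps, which is exactly how block processing hides the serial nature of each von Neumann cell.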

Novcmber 1993 29 host support. Warp. the Computer for Broadcast instructions Experimental Synthetic Aperture Ra- dar, the Cellular Array Processor, and Saxpy-1M all require expensive and complicated I/O support - for the in- tensive instruction I/O as well as the data 1/0. These machines are high- performance (200 MFLOPS, 320 MFLOPS, and 1,000 MFLOPS. respectively) with central- ized concurrency, constructed to fill an existing performance gap. The introduction of workstations into the workplace has changed the way a

C I I I significant portion of the computing community views computers. Users de- Figure 8. A hybrid SlMD programmable systolic architecture. mand improved desktop computing performance as applications continue to increase in size and complexity. must be evaluated on a case-by-case architectures in the near future. Conse- Workstation architecture can always be basis at the time of configuration. Cur- quently, a natural step is to merge the improved. especially for emerging ap- rent gate-density technology does not two approaches, keeping their desir- plications such as text and speech pro- make MIMD arrays feasible due to cell able aspects and discarding their unde- cessing and gene matching. The more memory and control requirements, but sirable aspects. recent general-purpose systolic array FPGA platforms are well suited for log- The hybrid SlMD architecture" (Fig- projects (Brown, Micmacs, Splash) ical or bit-level applications. Splash, for ure 8) is best utilized for intensive float- show that back-end systolic processors example, is a linear reconfigurable ar- ing-point computational applications but are effective in boosting a workstation's ray appropriate for bit-level applica- does not degrade in performance as performance for these applications. tions. much as high-level programmable ar- These arrays are small enough that the An FPGA chip can be configured for rays when significant logical or control host's open architecture with limited one medium-size systolic element or for operations are included. The hybrid U0 bandwidth does not severely im- several simple systolic cells. In either design combines a commercial floating- pact the array's performance for low- case, current technology limits the cir- point multiplier chip and an FPGA con- to moderate-level granularity. Small cuit size to an order of magnitude ap- troller to form a systolic cell. The com- back-end systolic processors are also proaching 25,000 gates. 
Packaging and mercial multiplier is used for its economically sound. For example. 110 pins also influence the design. Cur- economy. speed, and package density: Splash consists of a two-board add-on rent technology limits I/O to approxi- the FPGA closely binds the cell to a set for a Sun workstation. One board mately 200 pins. specific application. Another project is supports the linear array. and the oth- Reconfigurable architectures are also attempting to integrate a hybrid gener- er supports a buffer memory. The set highly qualified for fault-tolerant com- al-purpose systolic array cell on a single costs from $13,000 to $35,000.' puting. Reconfiguration can simply elim- chip.': Almost without exception, current inate a bad cell from the array. Cell A recently developed simulator al- research emphasizes reconfigurable architectures can be configured around lows simulation and testing of the cell cells, reconfigurable arrays, and hybrids VLSI defects. and the array design for a hybrid archi- of functional units embedded in recon- tecture." Unlike most commercially figurable FPGA arrays. Reconfigurable available computer-aided design utili- designs have proven to be unmatched Architectural issues of ties, which verify designs at the chip for low- to moderate-granularity require- hybrid arrays level, it verifies the performance of al- ments but are not yet mature enough gorithms and architectures through sim- for high-granularity applications. In ulation of the complete array. The sim- addition, as we have said. reconfigu- High-level programmable arrays re- ulator maintains the iterative nature of rable topologies and cells are highly quire extensive efforts tomap algorithms purely reconfigurable array design by fault-tolerant. Their fault tolerance is a to high-level languages after algorithm facilitating tailoring of hybrid designs configuration issue. not a design and development. When mapping is com- for specific applications. fabrication issue. 
pleted, the resulting system often is not Until FPGA chip density progresses as efficient as a system implemented in to the point where a very large FPGA a fixed-function ASIC. On the other Existing architectures can achieve high-level granularity, hy- hand. low FPGA gate density makes it brid architectures present perhaps the unlikely that large-grain tasks will ever A review of projects initiated during most practical means to a reconfigu- be completely implemented in FPGAs the last decade shows that the trend was rable high-level systolic array. Hybrids or that reconfigurable arrays will re- to develop large systolic array proces- make use of the most attractive features place conventional SlMD and MIMD sors that require elaborate. customized of programmable and reconfigurable
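The division of labor in such a hybrid cell can be caricatured in a few lines of Python: a fast but fixed-function multiplier does the arithmetic, while a configurable controller sequences it for one application. The class names, the coefficient parameter, and the multiply-accumulate behavior are illustrative assumptions, not details of the published design.

```python
class MultiplierChip:
    """Stands in for the commercial floating-point multiplier:
    fast and dense, but fixed function."""
    def mul(self, a, b):
        return a * b

class FpgaController:
    """Stands in for the FPGA controller: binds the fixed multiplier
    to a specific application via its configuration."""
    def __init__(self, mult, coeff):
        self.mult = mult
        self.coeff = coeff    # application-specific configuration
        self.acc = 0.0

    def step(self, x_in):
        # One systolic beat: multiply, accumulate, pass the operand on.
        self.acc += self.mult.mul(self.coeff, x_in)
        return x_in           # operand continues downstream

def hybrid_cell_demo(samples, coeff=0.5):
    cell = FpgaController(MultiplierChip(), coeff)
    for x in samples:
        cell.step(x)
    return cell.acc
```

Reprogramming the cell for a different application changes only the controller's configuration; the multiplier, like the commercial chip it models, never changes.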


No discussion of general-purpose systolic arrays is complete without addressing the issues of programming and configuration. High-level-language programming is desirable for promoting widespread use of programmable systolic back-end processors. Of the projects we surveyed, only the larger systems had mature programming environments. Currently, the smaller programmable arrays have implemented only assembly programming. As mentioned earlier, FPGA-configurable arrays are configured with the use of a schematic editor. One drawback of this approach is that it typically takes more effort than programming a programmable cell.

The systolic array is a formidable approach to exploiting concurrencies in a computationally rhythmic and intensive environment. General-purpose systolic arrays provide an economical way to enhance computational performance by emphasizing concurrency and parallelism. Arrays ranging from low-level to high-level granularity have been applied to problems from bit-oriented pixel mapping to 32-bit floating-point-based scientific computing. Systolic arrays hold great promise to be a pervasive form of concurrency processing. As a solution to the intensive computational performance requirements of tomorrow's applications, general-purpose systolic arrays cannot be overlooked.

References

1. W.K. Fuchs and E.E. Swartzlander Jr., "Wafer-Scale Integration: Architectures and Algorithms," Computer, Vol. 25, No. 4, Apr. 1992, pp. 6-8.
2. P. Quinton and Y. Robert, Systolic Algorithms and Architectures, Prentice-Hall, Englewood Cliffs, N.J., 1991.
3. A. Krikelis and R.M. Lea, "Architectural Constructs for Cost-Effective Parallel Computers," in Systolic Array Processors, Prentice-Hall, Englewood Cliffs, N.J., 1989, pp. 287-300.
4. H.T. Kung, "Why Systolic Architectures?" Computer, Vol. 15, No. 1, Jan. 1982, pp. 37-46.
5. J.A.B. Fortes and B.W. Wah, "Systolic Arrays: From Concept to Implementation," Computer (special issue on systolic arrays), Vol. 20, No. 7, July 1987, pp. 12-17.
6. C.K. Ko and O. Wing, "Mapping Strategy for Automated Design of Systolic Arrays," Proc. Int'l Conf. Systolic Arrays, IEEE CS Press, Los Alamitos, Calif., Order No. 860, 1988, pp. 285-294.
7. R. Hughey and D. Lopresti, "B-SYS: A 470-Processor Programmable Systolic Array," Proc. Int'l Conf. Parallel Processing, IEEE CS Press, Los Alamitos, Calif., Order No. 2355-22, 1991, pp. 1580-1583.
8. M. Gokhale et al., "Building and Using a Highly Parallel Programmable Logic Array," Computer, Vol. 24, No. 1, Jan. 1991, pp. 81-89.
9. H. Schroder and P. Strazdins, "Program Compression on the ISA," Parallel Computing, Vol. 17, No. 2-3, June 1991, pp. 207-215.
10. B. Friedlander, "Block Processing on a Programmable Systolic Array," Proc. Int'l Conf. Parallel Processing, IEEE CS Press, Los Alamitos, Calif., Order No. 889, 1987, pp. 184-187.
11. R. Smith and G. Sobelman, "Simulation-Based Design of Programmable Systolic Arrays," Computer-Aided Design, Vol. 23, No. 10, Dec. 1991, pp. 669-675.
12. S. Jones, A. Spray, and A. Ling, "A Flexible Building Block for the Construction of Processor Arrays," in Systolic Array Processors, Prentice-Hall, Englewood Cliffs, N.J., 1989, pp. 459-466.

Kurtis T. Johnson is an advanced engineer at HRB Systems, an E-Systems subsidiary. His interests include parallel computer architectures, parallel algorithms, and digital signal processing. He received his BS in 1986 and his MS in 1992, both in electrical engineering, from Pennsylvania State University.

A.R. Hurson is an associate professor of computer engineering at Pennsylvania State University. His research is directed toward the design and analysis of general-purpose and special-purpose computer architectures. He has published more than 110 papers on topics including computer architecture, parallel processing, database systems and machines, dataflow architectures, and VLSI algorithms. He coauthored the IEEE tutorial books Parallel Architectures for Database Systems and Multidatabase Systems: An Advanced Solution for Global Information Sharing. He cofounded the IEEE Symposium on Parallel and Distributed Processing. Hurson is a member of the IEEE Computer Society Press Editorial Board and a member of the IEEE Distinguished Visitors Program.

Behrooz Shirazi is an associate professor of computer science engineering at the University of Texas at Arlington. Previously, he was an assistant professor at Southern Methodist University. His research interests include task scheduling, dataflow computation, and parallel and distributed processing. He has published widely on these topics. He is a cofounder of the IEEE Symposium on Parallel and Distributed Processing. He has served as chair of the Computer Society section in the Dallas chapter of the IEEE and as chair of the IEEE Region V Area Activities Board. Shirazi received his MS and PhD degrees in computer science from the University of Oklahoma, in 1980 and 1985, respectively.

Send correspondence about this article to A.R. Hurson, Dept. of Computer Science and Engineering, Pennsylvania State University, University Park, PA 16802, or [email protected].

Laxmi Bhuyan, Computer's system architecture area editor, coordinated and recommended this article for publication. His e-mail address is [email protected].
