
Comparing the Performance of FPGA-Based Custom Computers with General-purpose Computers for DSP Applications

Neil W. Bergmann and J. Craig Mudge

CSIRO/Flinders Joint Research Centre in Information Technology, Flinders University, GPO Box 2100, Adelaide, SA 5001, AUSTRALIA. Tel: 61-8-201-3109; Fax: 61-8-201-3507. E-mail: [email protected], [email protected]

Abstract

When FPGA logic circuits are incorporated within a stored-program computer, the result is a machine where the programmer can design both the software and the hardware that will execute that software. This paper first describes some of the more important custom computers, and their potential weaknesses as DSP implementation platforms. It then describes a new custom computing architecture which is specifically designed for efficient implementation of DSP algorithms. Finally, it presents a simple performance comparison of a number of DSP implementation alternatives, and concludes that (i) the new custom computing architecture is worthy of further investigation, and (ii) custom computers based only on FPGA execution units show little performance improvement over state-of-the-art workstations.

1. Custom computing

Field Programmable Gate Arrays are now a popular implementation style for digital logic systems and sub-systems [1]. Where the programming configuration is held in static RAM, the logic function implemented by those FPGAs can be dynamically reconfigured, in fractions of a second, by rewriting the contents of the SRAM configuration memory. When such FPGA logic circuits are incorporated within a stored-program computer, the result is a machine where the programmer can design both the software and the hardware that will execute that software. Such a machine, where the hardware can be reconfigured and customised on a program-by-program basis, is called a custom computer [2].

Several researchers report algorithm speed-up rates of hundreds or thousands of times compared to conventional desktop computers, and often significantly greater than the best reported results using "conventional" super-computers, especially for those algorithms which can be decomposed into many, simple, parallel processing tasks. Many Digital Signal Processing (DSP) algorithms can be decomposed into parallel tasks, but each task often involves relatively complex operations such as a multiply-accumulate. DSP algorithms are therefore less clearly suitable candidates for efficient implementation on a custom computer.

This paper surveys some of the most important custom computers, presents the authors' work on a new custom computing architecture specifically designed to support DSP applications, and analyses the performance of various implementation alternatives for DSP algorithms.

2. Previous methods of customising computers for DSP

There is a constant tension in computer design between being general purpose, i.e., doing a wide range of computational tasks moderately well, and being application specific, i.e., doing a smaller range of computational tasks much better, usually at the cost of either increased system resources or of poorer "general purpose" performance.

There have been many different approaches investigated over the years for improving the performance of a general purpose computer for the implementation of DSP algorithms. Custom computers represent the latest technology which shows some promise for this task.

A common approach to improving the performance of a general purpose computer for specific applications is the addition of application specific hardware, such as

0-8186-5490-2/94 $03.00 © 1994 IEEE

graphics accelerators for computer displays, or image and video compression chips for multi-media workstations. These can provide excellent speedup for that specific application, but provide no performance improvement when other applications are run. This approach represents one extreme of the generality/cost spectrum.

Another common approach, at the opposite end of the spectrum, is the addition of a general-purpose parallel processing sub-system to a host computer. Such add-on parallel processing boards have seemed an obvious and attractive enhancement to desktop computers for some time, but they have achieved only limited success in the marketplace. It is our conjecture that this is because of difficulty of programming, and high cost resulting from low sales volumes. Additionally, such parallel processing systems have commonly been inefficient at implementing the fine-grain, communications-intensive parallel algorithms associated with DSP applications.

Recently, general-purpose DSP chips such as the TMS320C40 [3] have become available. These combine high speed, floating-point arithmetic performance with high speed interprocessor communication channels incorporating individual DMA controllers. Arrays of such chips provide a formidable challenge for other DSP implementation alternatives.

The idea of a computer which can be customised, under programmer control, on an application-by-application basis, is not new. A writable control store within a micro-programmed computer allows the programmer to design application-specific machine instructions, which can make better use of the existing functional units within the computer, and hence improve the performance of specific applications, including those within DSP [4]. Such performance improvement has been limited because of the interpretative nature of microprogramming.

A custom computer goes one step further than a writable control store by allowing design of new functional units, rather than simply making better use of existing functional units.

3. Some Existing Custom Computers

In this section, we examine three of the best known custom computers as examples of the current state of the art in this area.

3.1 SPLASH

SPLASH and SPLASH II [5] are custom computers which have been developed at the Supercomputing Research Center, Maryland. The SPLASH II processor consists of an extendable number of processor boards, with each board containing 16 Xilinx 4010 FPGAs [6], connected as a linear array, plus an extra Xilinx 4010 for control, with all of the FPGAs having some additional interconnections via a central crossbar switch. The SPLASH boards can communicate with a SUN Sparcstation host via input and output FIFOs, which are on an additional interface board connecting the SUN and the SPLASH array.

Once configured, data is streamed through the SPLASH processor using the systolic array programming model, with results streaming back to the host. The SPLASH processor is most suited to simple streaming operations, and has shown significant speedups over conventional computers for tasks such as text searching and genetic database searching.

SPLASH is usually programmed by specifying the function of the FPGAs using VHDL, which is then automatically translated into an FPGA configuration file. Current research [5] is examining the use of other programming languages, such as data parallel C.

3.2 Programmable Active Memory

The PAM (Programmable Active Memory) has been under development at DEC's Paris Research Labs for several years [7], [8]. The latest version, PeRLe-1, consists of a 5x5 array of Xilinx 3090 FPGAs, connected to local 32-bit wide RAM banks, and also, via a 100 MB/s TURBOchannel interface, to a DEC desktop workstation. The workstation writes data to the RAM banks, which is processed by the Xilinx array and returned to the RAM banks, from where the host then retrieves the results.

Programming the PAM consists of designing software components for the host, and hardware components for the PAM array. The latter can be done by writing a program in a conventional programming language (Lisp, C++, and Esterel are used) using a specialised library. The program describes logic modules by their bit-level logic equations, or by using standard library modules such as adders and registers.

Ten applications are described in [8], including long multiplication, RSA cryptography, data compression, string matching, heat and Laplace equations, a Boltzmann machine neural network, 3-D graphics acceleration, and the discrete cosine transform. Results are very encouraging; e.g., the PAM implementation of 512-bit RSA cryptography was faster than any other reported implementation in any technology as of February 1990, and 10 times faster than the next best reported implementation on a custom VLSI circuit.

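To illustrate the bit-level style of module description used by systems such as the PAM, consider a ripple-carry adder written out as per-bit logic equations. The following Python sketch is purely illustrative of the idea (it is not the actual DEC library syntax, which is not shown in this paper):

```python
# Illustrative sketch only: a logic module described by its
# bit-level equations, in the spirit of the PAM programming
# model (not the real PAM library API).

def ripple_carry_adder(a_bits, b_bits):
    """Add two equal-length little-endian bit vectors."""
    assert len(a_bits) == len(b_bits)
    carry = 0
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)               # per-bit sum equation
        carry = (a & b) | (a & carry) | (b & carry)  # per-bit carry equation
    return sum_bits, carry

# e.g. 3 + 5 with 4-bit little-endian operands: 0011 + 0101
bits, cout = ripple_carry_adder([1, 1, 0, 0], [1, 0, 1, 0])
print(bits, cout)
```

A compiler for such a library would map each equation onto FPGA logic blocks rather than evaluating it in software as above.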
3.3 The Virtual Computer

The Virtual Computer [9], from the Virtual Computer Corporation, provides approximately 500,000 programmable logic gates, using an array of 52 Xilinx 4010 FPGAs and 24 ICUBE Field Programmable Interconnect Devices, 8 megabytes of SRAM, and 16k x 16-bit 25 ns dual-port RAMs. The system also has 3 x 64-bit I/O ports: one for hardware configuration/read-back, and two for general purpose I/O, such as connection to a host workstation. The central processing array of 40 Xilinx FPGAs, called the Virtual Array, consists of four Virtual Pipelines, each with 10 FPGAs connected to dual-port RAMs at each edge of the pipeline, with the other port of these dual-port RAMs connected to a control FPGA with a RAM buffer.

The array is intended to be a flexible computing resource, with a typical application on a host processor loading data into SRAM, which is streamed through the pipeline under local control via the dual-port RAMs. Results would take the reverse route back to the host. All of the FPGAs are designed to be reprogrammed in parallel, so that the function of all 500,000 logic gates can be changed in 25 ms.

It is intended that programming could be at several levels. At the highest level, a run-time function library would be provided. Next would be translation of VHDL or C++ code into hardware implementations, and at the lowest level would be full-custom hand placement and routing. The Virtual Computer is interesting in that it is designed as a commercial product (initially targeted at the research community), rather than as a research vehicle.

4. CC-DSP: A New Custom Computer

4.1 Weaknesses of Existing Custom Computing Architectures for DSP

It is useful to examine those applications where custom computing has shown the greatest speedups. In general, such applications can be decomposed into many parallel computations, each of which can be implemented by a low-gate-count processing element. Thus many such elements are able to be implemented on the available FPGA resources of a custom computer. Typical applications in this category include text searching and genetic database matching.

DSP applications often exhibit a high degree of potential parallelism, but are less suitable for implementation on a custom computer because of the high gate count of the arithmetic operations (addition and multiplication) typically required for each parallel processing element. For this reason, we are now exploring a new custom computing architecture explicitly targeted at the efficient implementation of digital signal and image processing applications.

4.2 Our New Architecture

Our decision to concentrate our efforts on a custom computing architecture which is specifically designed for the support of DSP algorithms leads us naturally to the provision of specific hardware support for the arithmetic operations which dominate many DSP algorithms, but which are costly to implement using FPGA programmable gates.

Our proposed architecture has developed from previous work into the rapid prototyping of DSP algorithms [10]. An experimental system was built from an Algotronix CHS2x4 (Configurable Hardware System) [11] PC plug-in board containing 8 CAL1024 chips and 512 kbytes of RAM, plus a second plug-in board with 4 custom VLSI chips, each containing four 16-bit, bit-serial multiply-accumulate units. The major drawback of our initial architecture is its reliance on a custom-designed arithmetic resource chip. Our limited design and fabrication budget means that such chips have only moderate arithmetic performance, and even a small array of such chips will at best be able to match the arithmetic performance of modern CPU chips. Attempts to code up simple DSP algorithms for this system, such as FIR filters, have also demonstrated weaknesses in the data transfer capability of the CHS2x4 system for communications-intensive applications.

We have therefore commenced work on a more ambitious architecture [12], which more closely couples an arithmetic chip, static RAM, and reconfigurable logic within a processing node, as shown in figure 1. This node is then replicated a number of times to produce a complete custom-computing co-processor for a workstation. It seems likely that this architecture will use commercially available arithmetic chips, providing of the order of 20 MFLOPS each. Eight such processing nodes would give some 160 MFLOPS of processing power.

The architectural and algorithmic challenge then becomes to ensure that the reconfigurable logic chips and static RAM chips can store, communicate and organise operands for these arithmetic chips to allow such a peak processing performance to be sustained.

It is useful to review our progress in the design of such a custom computer in the light of the other custom computing projects underway internationally. In particular, this paper examines whether such an architecture does give some performance advantage compared to the more general-purpose custom computing

[Figure 1 block diagram: each processing node couples an Arithmetic Chip, a Reconfigurable Logic Chip and a RAM Chip, with data paths to its row and column neighbours.]

Figure 1: A new custom computing architecture showing (a) processing node structure, and (b) processing node interconnection.

architectures described above, for the specific application area of digital signal processing.

5. Other Computing Architectures for DSP

For use in later comparison, two more conventional DSP implementation methods are briefly described below.

5.1 Alpha-based workstation

Digital Equipment Corporation's Alpha 21064 processor chip [15], with a system clock rate of 200 MHz, is indicative of the current state of the art in desktop workstation performance. The 21064 chip also forms the basis for the recently announced massively-parallel processing (MPP) supercomputer from Cray Research, Inc [15].

Of particular interest here is the excellent arithmetic performance of the processor. The 21064's pipelined ALU can accept new operands on every clock cycle, with a ten-cycle latency before results appear. It is reasonable to assume that careful coding of DSP inner loops to maximise pipeline occupancy is no more difficult than custom computer hardware design. The peak floating-point (or integer) computation rate is then 200 MFLOPS. This rate would require operands to be available in registers, and instructions to be available in the on-chip cache, for the computation to be sustained. Assumptions about a suitable sustained computation rate for comparison with custom computing approaches are given later.

5.2 TMS320C40 Parallel Processing Development System

The TMS320C40 [3], with a clock rate of 40 MHz, is indicative of the current state-of-the-art in software-programmable digital signal processor chips. Important features of the TMS320C40 are the ALU design, and the interprocessor communications design.

The ALU is based on a single-cycle floating-point unit, which has separate multiplication and addition units to allow these two operations to be done in parallel. Single-cycle operation eases programming of inner loops compared to the Alpha's highly-pipelined unit. Peak arithmetic performance is then 80 MFLOPS, with a 40 MHz clock.

The TMS320C40 also has six byte-wide, bi-directional interprocessor communications ports with a transfer rate of 20 Mbytes/s each. Each port has a dedicated DMA controller, which relieves the CPU of much of the burden associated with the high interprocessor communications in fine-grained, parallel DSP algorithms.

Texas Instruments market a parallel processing board based on the TMS320C40, called the Parallel Processing Development System (PPDS) [3], which can be attached to a general purpose workstation. The TMS-PPDS contains 4 interconnected TMS320C40 chips, plus global and local memory banks. Its peak arithmetic computation rate is then 4 * 80 MFLOPS = 320 MFLOPS. The PPDS will be used as an element of the performance comparison in the next section.
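The peak figures quoted in this section follow directly from clock rate and operations per cycle. As a quick sanity check (an illustrative sketch only; the helper `peak_mflops` is our own, not part of any vendor tool):

```python
# Peak arithmetic rates for the processors of section 5, derived
# from clock rate and floating-point operations per cycle.
# These are peak figures, not sustained rates.

def peak_mflops(clock_mhz, flops_per_cycle, n_chips=1):
    return clock_mhz * flops_per_cycle * n_chips

alpha = peak_mflops(200, 1)           # pipelined ALU, one result per cycle
c40 = peak_mflops(40, 2)              # parallel multiply and add units
ppds = peak_mflops(40, 2, n_chips=4)  # four TMS320C40s on one board

print(alpha, c40, ppds)
```

These reproduce the 200 MFLOPS (Alpha 21064), 80 MFLOPS (TMS320C40) and 320 MFLOPS (TMS-PPDS) figures quoted above.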

6. Performance Comparison of the Programmable DSP Architectures

As mentioned previously, the high logical complexity of the arithmetic operations which are central to many DSP algorithms suggests that FPGA-based custom computers may be less suited to the implementation of such algorithms compared to their demonstrated performance speedups in other areas. We therefore present a simple performance analysis of some DSP implementation alternatives, to test this assertion. The two key questions in any performance comparison are what to compare, and how to compare them.

6.1 What to Compare?

We have chosen to compare DSP implementation systems which are of the order of a single printed circuit board, and which might be used to improve the DSP performance of a desktop computer workstation. The following six systems will be compared.

State-of-the-Art Custom Computers
(i) A single SPLASH II processor board. Results for multiboard systems can be extrapolated from this result.
(ii) The DEC PeRLe-1 system.
(iii) The Virtual Computer.

Specialised DSP Custom Computer
(iv) The new (yet to be built) custom computer described in section 4.2 above, referred to as CC-DSP in this paper. We assume it consists of 8 PEs, each containing one RAM chip, one arithmetic chip, and one reconfigurable logic chip. For the arithmetic chip, we assume an arithmetic performance of one floating-point multiply-accumulate every 100 ns can be sustained, which is consistent with existing state-of-the-art ALU chips (e.g. [16]).

General-purpose Processors
(v) A TMS-PPDS consisting of four TMS320C40s.
(vi) A workstation using a 200 MHz DEC Alpha 21064 processor.

6.2 How to Compare Them?

The goal of this analysis is to give some ballpark figures for the various DSP implementation alternatives, to give some insight into their relative merit. A very straightforward analysis has been undertaken, based on a number of assumptions which are listed below. The performance analysis is a static analysis based on the stated performance of various components. We have not implemented any benchmark algorithms on the various alternatives, and we have not measured the actual performance of any of the systems.

Performance is described in terms of millions of multiply-accumulate operations per second (megaMAC/s) for the case of both floating-point and fixed-point arithmetic, for each of the "single-board-sized" systems.

An attempt is also made to calculate the cost of each board, using a simple measure of "number of chips" as the cost. This gives the performance measure of megaMAC/s/chip. For this measure, only processing and interprocessor communication chips are counted. Memory chips and host-interface chips are not counted. Details of the individual performance calculations are given in section 6.4, below.

6.3 Assumptions

Assumption 1. The limiting factor in DSP performance is the computation of multiply-accumulate operations, which dominate many DSP algorithms, such as digital filters, linear transforms, matrix multiplications, and artificial neural networks.

Assumption 2. For the previously-mentioned conventional custom computing architectures, which all use Xilinx FPGAs, the cost of a fixed-point multiplication is based on a recent design by Casselman [13]. This 24-bit fractional multiplier requires 48 CLBs (configurable logic blocks) within a 4000 series Xilinx FPGA, and produces a result in 16 clock cycles at 16 MHz. We assume that a fixed-point addition can be done in the same time using an extra 2 CLBs, for a total of 50 CLBs for a fixed-point multiply-accumulate operation, at 1 million operations per second.

Assumption 3. For the same conventional custom computing architectures, the cost of a floating-point multiplication is based on the same design by Casselman [13]. A single-precision floating-point multiplier requires 60 CLBs within a 4000 series Xilinx FPGA, and produces a result in 16 clock cycles at 16 MHz. We estimate, given some experience with previous floating-point operator designs [14], that a floating-point adder requires a similar-sized circuit to a floating-point multiplier, which would give a total of 120 CLBs for both. We can assume some saving for a combined floating-point multiply-accumulate operator, and so we will assume 100 CLBs are required. Operating speed is again 1 million operations per second.

Assumption 4. For the same conventional custom computing architectures, it is assumed that 25% of chip area is required for control and other overhead hardware, such as registers. Most of the systems described use Xilinx 4010 FPGAs, with 400 CLBs.
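Assumptions 2-4 together fix the multiply-accumulate capacity of a single FPGA, and hence of each FPGA-based board. The arithmetic can be sketched as follows (an illustrative sketch only; the constant and function names are ours):

```python
# Multiply-accumulate capacity of one Xilinx 4010 under the stated
# assumptions: 400 CLBs, 25% control overhead, 50 CLBs per
# fixed-point MAC operator, 100 CLBs per floating-point MAC
# operator, each operator sustaining 1 million operations/s.

TOTAL_CLBS = 400
USABLE_CLBS = int(TOTAL_CLBS * 0.75)  # 25% overhead leaves 300 CLBs

def mega_macs(clbs_per_mac, mops_per_operator=1):
    operators = USABLE_CLBS // clbs_per_mac
    return operators * mops_per_operator

fixed_rate = mega_macs(50)    # megaMAC/s per FPGA, fixed-point
float_rate = mega_macs(100)   # megaMAC/s per FPGA, floating-point

# Applied to the 16 processing FPGAs of a SPLASH II board:
print(fixed_rate, float_rate, 16 * fixed_rate, 16 * float_rate)
```

This yields 6 megaMAC/s (fixed-point) and 3 megaMAC/s (floating-point) per FPGA, and hence the per-board figures used in section 6.4.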

Accounting for the 25% overhead leaves 300 CLBs for the multiply-accumulate operations. Hence we assume that a Xilinx 4010 can provide a performance of 6 megaMAC/s (fixed-point) or 3 megaMAC/s (floating-point). We further assume that the Xilinx 3090s used in the DEC PeRLe-1 can provide the same performance.

Assumption 5. For our new CC-DSP architecture, we assume 8 processing nodes, where each node is provided with a custom arithmetic support chip providing 10 million multiply-accumulate operations per second (80 million multiply-accumulate operations per second for the system). We assume that the reconfigurable logic chips in the system are used to control data flow to ensure that the FPUs are kept busy, but do not provide any additional multiply-accumulate resources.

Assumption 6. For the DEC Alpha 21064 workstation, Linpack 1000x1000 results [15] suggest a best-case sustained computation rate of 75% of the peak rate. Reducing this to 50% to account for general algorithm and system overheads gives 50 million multiply-accumulate operations per second. A nominal chip count of 10 is assigned to the Alpha-based workstation to account for the support chips needed for this processor.

Assumption 7. The TMS-PPDS has a peak processing rate of 4 * 80 MFLOPS (each chip can do parallel floating-point multiply and add operations at 40 MHz). Since the processors have to deal with overhead instructions, as well as the floating-point operations, we suggest the same average floating-point computation rate as in assumption 6, i.e., 50% of the peak.

6.4 Performance and Cost Estimates

SPLASH II: A SPLASH II board consists of 16 Xilinx 4010 chips, a crossbar switch controlled by another Xilinx, plus some memory and interface circuitry. The 16 Xilinx chips provide 16*6 = 96 megaMAC/s fixed-point, and 16*3 = 48 megaMAC/s floating-point. The cost of the system is considered to be the 16 FPGAs plus another 16 chips for the crossbar switch and its control, giving a total cost of 32 chips. Performance/cost is 96/32 = 3 megaMAC/s/chip (fixed-point) and 48/32 = 1.5 megaMAC/s/chip (floating-point).

DEC PeRLe-1: The DEC PeRLe-1 co-processor board consists of 25 Xilinx 3090 chips for the programming core, four RAM banks, plus additional FPGAs and logic circuits to control the RAM banks and general purpose interconnect. The 25 Xilinx chips provide 25*6 = 150 megaMAC/s fixed-point, and 25*3 = 75 megaMAC/s floating-point. We assign a total cost of 40 chips, giving performance/cost of 150/40 = 3.75 megaMAC/s/chip (fixed-point) and 75/40 = 1.9 megaMAC/s/chip (floating-point).

Virtual Computer: The Virtual Computer contains a total of 52 Xilinx 4010 FPGAs and 24 ICUBE FPICs, 8 megabytes of SRAM, and 16k x 16-bit 25 ns dual-port RAMs. 40 Xilinx chips are available for arithmetic operations, giving 40*6 = 240 megaMAC/s fixed-point, and 40*3 = 120 megaMAC/s floating-point. The cost of the system is considered to be 52+24 = 76 chips, giving performance of 240/76 = 3.1 megaMAC/s/chip (fixed-point) and 120/76 = 1.6 megaMAC/s/chip (floating-point).

CC-DSP: Our new custom computing architecture for DSP has 8 nodes, with each node consisting of one FPGA, one arithmetic chip of 10 megaMAC/s performance, and one RAM chip. Total performance is then 80 megaMAC/s, fixed- or floating-point. Cost, excluding RAM, is 16 chips, giving 5 megaMAC/s/chip, fixed- or floating-point.

TMS-PPDS: The peak performance of the 4-processor system is 4 * 40 = 160 megaMAC/s. With a sustained rate of 50% of the peak, this gives 80 megaMAC/s. Examination of the PPDS shows about four extra support chips (excluding RAM) for each processor chip, giving a total cost of 20 chips, and hence 4 megaMAC/s/chip, fixed- or floating-point.

Alpha 21064 Workstation: The peak performance of this system is 200 MFLOPS = 100 megaMAC/s. With a sustained rate of 50% of the peak, this gives 50 megaMAC/s. With a nominal workstation cost of 10 chips (excluding RAM), this gives 5 megaMAC/s/chip, fixed- or floating-point.

These calculations are summarised in Table 1 below.

6.5 Conclusions

Two conclusions can be drawn from the above simple analysis.

The first is that our CC-DSP architecture gives a moderate performance improvement over other custom computing approaches for floating-point DSP applications, on a computation-per-chip basis, and so is worthy of further investigation. There is an insignificant (in terms of the resolution of this analysis) performance improvement for fixed-point DSP operations.

The second conclusion is that, for DSP applications, single-board custom computing approaches (including our CC-DSP) do not give any performance/cost

improvement over the conventional parallel processing and uniprocessor approaches. The major reason for this is that both the TMS320C40 and Alpha 21064 have high-performance hardware support for floating-point arithmetic. Custom computers provide the best speedups compared to conventional workstations when they can implement in parallel hardware what the workstation needs to implement in serial software. DSP does not immediately appear to be such an application area.

Table 1: Performance and cost estimates for the DSP implementation alternatives (megaMAC/s shown as sustained / peak).

System            | # of chips (# of Xilinx) | megaMAC/s (fixed-point) | megaMAC/s/chip (fixed-point) | megaMAC/s (floating-point) | megaMAC/s/chip (floating-point)
SPLASH II board   | 32 (16)                  | 96 / 96                 | 3                            | 48 / 48                    | 1.5
DEC PeRLe-1       | 40 (25)                  | 150 / 150               | 3.75                         | 75 / 75                    | 1.9
Virtual Computer  | 76 (40)                  | 240 / 240               | 3.1                          | 120 / 120                  | 1.6
Our CC-DSP        | 16                       | 80 / 80                 | 5                            | 80 / 80                    | 5
TMS-PPDS          | 20                       | 80 / 160                | 4                            | 80 / 160                   | 4
Alpha Workstation | 10 (nominal)             | 50 / 100                | 5                            | 50 / 100                   | 5

The reason for the poor performance of the CC-DSP architecture is the unavailability of single-chip ALUs which can match the floating-point performance of modern processors. The single-chip ALUs are largely limited by their ability to transfer operands and results to and from the chip, compared to the processors, whose ALUs transfer data to and from local registers.

Finally, a general comment on the long-term sustainability of special purpose, or custom, architectures can be made. The TMS-PPDS and the Alpha workstation described here are both general purpose systems. Hence their technology progress functions will approximate that of logic technology. The new custom computing architecture presented here, because it is special purpose, will be manufactured in lower volumes and hence will tend to have a less steep technology progress function. Thus, it will be challenging for the custom architecture to sustain a competitive position. It is necessary that the arithmetic, SRAM and FPGA operand-handling components have the same slope of progress function as the general purpose systems. Furthermore, we require that the operand handling architecture built from FPGAs be scalable, so as to avoid a data communications bottleneck which would prevent full utilisation of the technology progress functions of these individual components.

We plan further work in this area to:
(a) get better performance measures than those used in the simple analysis presented in Table 1,
(b) decide on the best method to provide arithmetic resources for DSP within a custom computing architecture, and
(c) discern a demonstrably scalable design which can utilise the technology progress function of the individual components of our CC-DSP architecture.

7. Acknowledgments

This work is supported by the Australian Research Council under Grant A49132307. Thanks to Lui Cirocco for his assistance in the design of our next generation CC-DSP architecture.

References

[1] J. Rose, A. El Gamal, and A. Sangiovanni-Vincentelli, "Architecture of Field-Programmable Gate Arrays," Proceedings of the IEEE, 81(7), July 1993, pp 1013-1029.
[2] Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines, Napa, April, 1993.
[3] TMS320C4x Technical Brief, Texas Instruments, 1991, pp 6.5-6.8.
[4] L. R. Morris and J.C. Mudge, "Speed Enhancement of Digital Signal Processing Software via Microprogramming a General Purpose Minicomputer," Proceedings of ICASSP '77, 1977.
[5] M. Gokhale, R. Minnich, "FPGA Computing in Data Parallel C," Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines, Napa Valley, April, 1993.
[6] The XC4000 Data Book, Xilinx, Inc., San Jose, 1992.
[7] P. Bertin, D. Roncin, J. Vuillemin, "Introduction to Programmable Active Memories", DEC Paris Research Labs, Research Report #3, June, 1989.
[8] P. Bertin, D. Roncin, J. Vuillemin, "Programmable Active Memories: a Performance Assessment", DEC Paris Research Labs, Research Report #24, 1993.

[9] S. Casselman, "Virtual computing", Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines, Napa Valley, April, 1993.
[10] N.W. Bergmann, H. Wong, P. Sutton, "A Rapid Prototyping and Implementation Method for Systolic Array Digital Signal Processors", Proceedings of 1991 Microelectronics Conference, Melbourne, pp 91-93, 1991.
[11] CHS2x4 User Manual, Algotronix Ltd, Edinburgh, Scotland, February, 1992.
[12] N.W. Bergmann, J.C. Mudge and L.R. Cirocco, "FPGA-Based Custom Computers", 1993 Australian Conference on Microelectronics, Gold Coast, October 1993.
[13] S. Casselman, "FPGA multiplier", info-fpga electronic mail posting from [email protected], October, 1993.
[14] D. Giarola and N.W. Bergmann, "Design and Testing of a Bit-Serial Floating Point Adder" and "Design and Testing of a Bit-Serial Floating Point Multiplier", University of Queensland, Electrical Engineering Dept., Internal Reports No. EE8915 and EE8919, May, 1989.
[15] E. McLellan, "The Alpha AXP Architecture and 21064 Processor", IEEE Micro, 13(3), June 1993, pp 36-47.
[16] M. Birman, A. Samuels, G. Chu, T. Chuk, L. Hu, J. McLeod, J. Barnes, "Developing the WTL3170/3171 Sparc Floating-Point Coprocessors", IEEE Micro, 10(1), pp 55-65, February, 1990.
