A Digital Signal Processor and Programming System for Parallel Signal Processing
Total Page:16
File Type:pdf, Size:1020Kb
A Digital Signal Processor and Programming System for Parallel Signal Processing David E. Van den Bout Center for Communications and Signal Processing Department of Electrical Computer Engineering North Carolina State University June 1987 ccsr-TR-87/6 Table of Contents Chapter 1. Introduction .................................................................................... 1 Chapter 2. Background 8 Chapter 3. My Thesis 46 Chapter 4. Algorithm Analysis 51 Chapter 5. Interprocessor Communications 69 Chapter 6. The DSP Architecture........................................................................ 90 Chapter 7. The PaLS Automated Programming System 127 Chapter 8. Conclusions and Areas for Further Work.............................................. 220 References 238 Appendix A. The Axe Hardware Simulation Language........................................... 248 Appendix B. Prediction of Simulated Annealing Curves for Graph Partitioning 278 Appendix C. The Approach to Thermal Equilibrium 299 VAN DEN BOUT, DAVID E. A Digital Signal Processor and Programming System for Paral- lei Signal Processing (Under the direction of T. K. Miller III.) This thesis describes an integrated assault upon the problem of designing high throughput, low cost digital signal processing systems. The dual prongs of this assault con- sist of: • the design of a digital signal processor (DSP) which efficiently executes signal processing algorithms in either a uniprocessor or multiprocessor configuration, • the PaLS programming system which accepts an arbitrary algorithm, partitions it across a group of DSPs, synthesizes an optimal communication link topology for the DSPs, and schedules the partitioned algorithm upon the DSPs. The results of applying a new quasi-dynamic analysis technique to a set of high-level sig- nal processing algorithms were used to determine the uniprocessor features of the DSP design. For multiprocessing applications, the DSP contains an inierprocessor communications port (IPe) which supports simple, flexible, dataflow communications while allowing the total communication bandwidth to be incrementally allocated to achieve the best link utilization. The net result is a DSP with a simple architecture that is easy to program for both uniprocessor and multiprocessor modes of operation (as was verified using the newly developed hardware simulation language, Axe). The PaLS programming system simplifies the task of parallelizing an algorithm for exe- cution upon a multiprocessor built with the DSP. PaLS differs from many other automatic programming systems in that it can map algorithms onto a group of processors while sintul- taneously deriving the interconnection topology for the processors. This is possible because PaLS incorporates a model of computation which accounts for the costs of partitioning, communica- tion link assignment, and scheduling. PaLS uses the method of simulated annealing coupled with the new technique of annealing curve prediction to adaptively control the optimization of the algorithm mapping. CHAPTER 1 Introduction The uses of signal processing are ubiquitous within our society, from the synthetic voice which reminds us to buckle our seatbelt to the enhanced pictures beamed to Earth from the Voyager spacecraft circling Jupiter. Before World War II, all signal processing was done in the analog domain (i.e. continuous-time, continuous-signal). With the advent of integrated circuits in the 1960's, many applications moved from the analog to the digital domain (i.e. discrete-time, quantized-signal). Digital techniques permit greater precision, repeatability, signal-to-noise ratio, and flexibility than their analog counterparts [AlleSS]. Programmable digital signal processors (DSPs) became widely available in the 1970's as large scale integra- tion (LSI) was introduced. Even greater flexibility is available to the user of these devices since their operation can be tailored merely by changing the signal processing program. As usually happens, such benefits are offset by disadvantages. In the case of digital sig- nal processing, the bandwidth of the signals which can be processed is much smaller than for analog techniques. Programmable DSPs can usually be applied to voice-frequency signals (~104 Hz) [Kawa80], while special-purpose digital systems can handle higher frequencies (::::106 Hz) [Lind85]. Analog systems have much higher throughput, and can process signals of ultra-high frequency (~1012 Hz). The operating physics of analog devices makes them faster for many common signal processing tasks. For example, determining the response y (t) of a linear system to a given input x (t) is expressed mathematically as: y (t) = f h (t - a )x (a)d 0: (1.1) -ex; 2 where h(t) expresses the impulse response of the given analog system. This convolution operation is a natural response of many types of analog systems. The digital form of Eqn. (1.1) is: N 1 Yk = L h, -~ I X, (1.2) 1-0 where Yk' hk , and Xk are the discrete, digital counterparts of the previously described analog signals. As can be seen, determining Yk for some value of k requires N multiplications and N -1 additions. This digital form of convolution is not a natural operation of digital systems, but must be explicitly provided for with hardware for performing multiplication and addition. Special-purpose digital systems can use many adders and multipliers in parallel to compute Eqn. (1.2), but to even approach the speed of analog convolution requires a large amount of hardware. Programmable DSPs, which typically have only one multiplier, are much slower than such hardware intensive systems. The flexibility and programmability of digital systems also exacerbates their throughput problem. New algorithms are being proposed which could not readily be performed with analog components but are easily programmed on digital hardware. These algorithms (such as linear system solution and eigenvalue decomposition) require even greater computational resources. In order to expand the range of applicability of programmable DSPs, the speed at which they execute signal processing algorithms must be increased. An increase in DSP speed can be achieved by: • increasing the speed of the circuit technology, • creating signal processing algorithms with reduced computational requirements, • enhancing the DSP architecture, • using multiple DSPs to perform computations in parallel. 3 The speed and density of silicon technology has steadily increased over time, and new technologies such as gallium-arsenide are becoming commercially feasible [Milu86]. How ever, quantum limits will eventually be reached beyond which further improvements are impossible [Mead80]. In any event, improvements in material technology can usually be applied independently of the other speed-up methods mentioned. Therefore, the effects of improved technology will not be considered here. Algorithms with reduced computational requirements (such as the Fast Fourier Transform (FFf) and Winograd FFf [Wino78]), can make digital signal processing practical for certain applications. However, these procedures are not easily discovered, nor do they pos sess the usual simplicity of the majority of general-purpose signal processing algorithms. Therefore, this speed-up technique will not be expanded upon further. Enhancing the architecture of a general-purpose computer in order to increase its perfor mance for specific algorithms is a common technique. The execution speed of an algorithm can be increased as much as 10000/0 [Abd-74], although gains of 10%-50% are probably more common. In the same manner, the architecture of a programmable DSP can be altered to effi ciently support common digital signal processing operations. The architectural enhancements will usually consume some additional circuit area and power, and since these two quantities are not in unlimited supply, a careful costlbenefit analysis is needed. Still, this is a likely area in which to improve the performance of DSPs. Executing signal processing software with multiple processors offers the best hope for massive performance increases. The throughput of the total system can be increased merely by adding more processing elements while avoiding the fundamental physical limits encoun tered when trying to improve the circuit technology. The need to develop algorithms with reduced computational needs is replaced with the (slightly easier) job of creating algorithms consisting of many concurrent tasks. Ideally, these changes make it possible to increase the 4 total system speed indefinitely, rather than just the 10%-500/0 performance improvement achieved from enhancing the architecture of a uniprocessor. The appraisal of parallel processing given above is overly optimistic from a practical point of view. Typical parallel processing systems have saturated their performance at 2-3 times the speed of an individual processing component [Fish83]. Such systems are usually victims of communications bottlenecks. In effect, a significant portion of the system processing time is spent passing data between the individual computing elements. In the worst case, the time spent servicing this communication overhead cannot be used to perform useful computa tions, thereby reducing the throughput of the system. Therefore, designers have spent con siderable effort in developing parallel systems which minimize this overhead. Probably the simplest multiprocessor network consists of a set of processing elements connected via a single global bus (Figure 1.1). The bus is used to transfer