IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 12, NO. 9, SEPTEMBER 2004, p. 995

Variable Precision Arithmetic Circuits for FPGA-Based Multimedia Processors

Stefania Perri, Pasquale Corsonello, Maria Antonia Iachino, Marco Lanuzza, and Giuseppe Cocorullo

Abstract—This brief describes new efficient variable precision arithmetic circuits for field programmable gate array (FPGA)-based processors. The proposed circuits can adapt themselves to different data word lengths, avoiding time and power consuming reconfiguration. This is made possible thanks to the introduction of on purpose designed auxiliary logic, which enables the new circuits to operate in single instruction multiple data (SIMD) fashion and allows high parallelism levels to be guaranteed when operations on lower precisions are executed. The new SIMD structures have been designed to optimally exploit the resources of a widely used family of SRAM-based FPGAs, but their architectures can be easily adapted to any either SRAM-based or antifuse-based FPGA chips.

Index Terms—Multimedia processors, multipliers, single instruction multiple data (SIMD).

Manuscript received January 4, 2003; revised June 2, 2003; February 16, 2004.
S. Perri, M. Lanuzza, and G. Cocorullo are with the Dipartimento de Elettronica, Informatica E Sistemistica (D.E.I.S.), University of Calabria, Rende 87036, Italy (e-mail: [email protected]).
P. Corsonello and M. A. Iachino are with the Department of Computer Science, Mathematics, Electronics and Transportation, University of Reggio Calabria, Reggio Calabria 89060, Italy (e-mail: [email protected]).
Digital Object Identifier 10.1109/TVLSI.2004.833400
1063-8210/04$20.00 © 2004 IEEE

I. INTRODUCTION

Today multimedia applications are one of the major drivers for electronics products. Most multimedia tasks have high computational requirements, real-time throughput constraints and complex data-flows which vary significantly versus the specific application. Nowadays, systems-on-chip (SOCs) realized using the ASIC approach represent the performance optimized solution. Usually, one (or more) embedded processor (e.g., IP Core) is used inside these SOCs to produce a flexible computing platform able to support a large class of existing and future multimedia algorithms. However, the sequential instruction decoding and execution, and the fixed control architecture often limit the performance of such embedded processors [1]. Thus, to accomplish speed performance requirements the embedded processor has to be endowed with special units purposely designed to accelerate a certain class of applications [1], [2]. Just as an example, this is the case of the ARCtangent family of IP Core processors for which digital-signal processing (DSP) extensions (i.e., with SIMD instructions) are available for the design of multimedia SOCs. The ARCtangent-A4 reaches the maximum frequency of 200 MHz when synthesized using 0.18-μm libraries, whereas, the more recent and extensible ARCtangent-A5 runs up to 170 MHz using the same technology. MIPS, ARM, and OPENRISC soft-processors show similar characteristics and comparable speed performances.

Looking at the ASIC space in purely economic terms, the costs for a typical mask set for a modern technology run into the $500–700 K range. Thus, the ASIC design approach seems limited in viability to high-volume applications and big industries.

Modern field-programmable gate array (FPGA) devices are fast and offer a level of integration comparable to current application specific integrated circuits (ASICs). FPGAs are today largely used in multimedia applications [2]–[7], where a good tradeoff between costs, flexibility, performance and time-to-market plays a crucial role. Moreover, several embedded processors can be efficiently synthesized on FPGAs running up to 150 MHz and custom instructions can be added to these microprocessors in order to match algorithms requirements. As an example, in [8] custom logic has been added to the ALU of an ALTERA Nios to realize an MP3 decoder. ARCInternational, TENSILICA and XILINX embedded processors can exploit similar techniques using dedicated or fast-link busses.

The performance of multimedia processors is usually improved with the single instruction multiple data (SIMD) processing [9]–[13]. SIMD circuits should have the ability to dynamically vary their precision. However, often they are realized by trivially replicating computational processing elements with fixed functionality. This approach leads to a consistent waste of resources and power.

In this brief, efficient SIMD architectures for FPGA-based multimedia processors are presented. The new circuits here proposed for realizing variable precision arithmetic basic blocks run-time adapt their structures to different precisions at instruction level avoiding the need for FPGA reconfiguration. Obtained results demonstrate that this property not only saves consistent time and power due to reconfiguration, but also guarantees very high computational capabilities and an easy integration into field programmable systems-on-chip (FP-SOCs).

II. MOTIVATION AND RELATED WORKS

Several works exist in literature in which field programmable gate array (FPGA) devices either constitute or participate to multimedia processing units [3]–[5], [14]. Very recently, in [15] a first attempt to evaluate the impact of the introduction of SIMD instructions on the ALTERA Nios embedded processor has been carried out. There, it is clearly stated that to produce an effective advantage, such extensions have to be strictly additive and implemented in such a way that the extended processor has the same Instruction Set as the original one. Furthermore, in this way, original compiler and development tools can be used for the extended architecture.

For these reasons, variable precision modules that adapt their structures to different precisions avoiding the need for bit-stream downloading are very attractive. In fact, the alternative approach that exploits chip reconfiguration in order to change the processor functionality is very hard. As outlined in [14], it is well known that for initiating the reconfiguration, the need for it has to be firstly identified. Thus, to trigger reconfiguration either multimedia instructions have to be hardware decoded or special reconfiguration instructions inserted into the code have to be software processed. In the first case, purpose-designed complex instruction decoders that can deteriorate the overall performance are needed.¹ On the other hand, the software approach requires a purposely designed compiler that detects the need of reconfigurations introducing into the code reconfiguration instructions. They must be well scheduled to ensure that the appropriate configuration bit-stream is downloaded onto the FPGA device well before the specialized multimedia instructions are executed. In any case, the above approaches imply a consistent performance penalty which further increases if the reconfiguration time (e.g., in the order of milliseconds, for practical cases) cannot be hidden [14].

The circuits described in the following do not require any action either on the processor decoder or on the compiler. The user can easily configure, send data to and read data from the coprocessing unit by means of the “move-to” and “move-from” instructions available in any instruction set.

¹Note that for most embedded processors, the instruction decoder cannot be accessed by the user. Thus, reconfiguration instructions cannot be processed in hardware.

Fig. 1. Architecture of the new variable precision multiplier using four 32 × 9 multipliers.

III. NEW VARIABLE PRECISION SIMD SIGNED/UNSIGNED MULTIPLIERS

In media signal processing (i.e., filtering, discrete cosine transforms, fast Fourier transforms, etc.), multiplication and multiply-accumulation are the most frequently required arithmetic operations. Moreover, multiplication often has the largest impact on the instruction cycle time of a whole system [1]. However, no efficient architectures for realizing variable precision SIMD multipliers have been yet proposed and optimized for FPGA-based processors. It is the goal of this Section.

Typically, an $N \times N$ variable precision SIMD multiplier is able to execute either one $N \times N$ multiplication or two $(N/2) \times (N/2)$ or four $(N/4) \times (N/4)$ multiplications and the obtained results are usually stored into a 2N-bit output register. The design of the new variable precision multipliers with $N = 32$ is based on the following considerations. $A_{31:0}$ and $B_{31:0}$ being two 32-bit operands, the following identities are easily verified:

$$A_{31:0} \times B_{31:0} = \mathrm{SH16L}\left(A_{31:0} \times B_{31:16}\right) + \mathrm{S64E}\left(A_{31:0} \times B_{15:0}\right) \tag{1}$$

$$A_{31:0} \times B_{31:0} = \left(A_{31:16} \times B_{31:16}\right)\ \mathrm{LINK}\ \left(A_{15:0} \times B_{15:0}\right) + \mathrm{SH16L}\left[\mathrm{S64E}\left(A_{31:16} \times B_{15:0}\right) + \mathrm{S64E}\left(A_{15:0} \times B_{31:16}\right)\right] \tag{2}$$

$$A_{31:0} \times B_{31:0} = \mathrm{SH16L}\left[\mathrm{SH8L}\left(A_{31:0} \times B_{31:24}\right) + \mathrm{S48E}\left(A_{31:0} \times B_{23:16}\right)\right] + \mathrm{S64E}\left[\mathrm{SH8L}\left(A_{31:0} \times B_{15:8}\right) + \mathrm{S48E}\left(A_{31:0} \times B_{7:0}\right)\right] \tag{3}$$

where LINK, SHyL, and SzE indicate a concatenation action, a left shift by y bits, and an extension to z bits, respectively.

TABLE I. COMMANDS AND CONTROL SIGNALS FOR THE VARIABLE PRECISION ARCHITECTURE WITH FOUR 32 × 9 MULTIPLIERS

It is useful to recall that an n-bit number can be easily extended

In Fig. 1, the block diagram of the new variable precision SIMD multiplier is depicted. It can be seen that its main building modules are: a control unit (CU) needed to properly set the computational