EURASIP Journal on Applied Signal Processing

Design Methods for DSP Systems

Guest Editors: Bernhard Wess, Shuvra S. Bhattacharyya, and Markus Rupp


Copyright © 2006 Hindawi Publishing Corporation. All rights reserved.

This is a special issue published in volume 2006 of “EURASIP Journal on Applied Signal Processing.” All articles are open access articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Editor-in-Chief Ali H. Sayed, University of California, USA

Associate Editors: Kenneth Barner, USA; Søren Holdt Jensen, Denmark; Vitor H. Nascimento, Brazil; Mauro Barni, Italy; Mark Kahrs, USA; Sven Nordholm, Australia; Richard Barton, USA; Thomas Kaiser, Germany; Douglas O’Shaughnessy, Canada; Ati Baskurt, France; Moon Gi Kang, South Korea; Montse Pardas, Spain; Kostas Berberidis, Greece; Matti Karjalainen, Finland; Wilfried Philips, Belgium; Jose C. Bermudez, Brazil; Walter Kellermann, Germany; Vincent Poor, USA; Enis Cetin, Turkey; Joerg Kliewer, USA; Ioannis Psaromiligkos, Canada; Jonathon Chambers, UK; Lisimachos P. Kondi, USA; Phillip Regalia, France; Benoit Champagne, Canada; Alex Kot, Singapore; Markus Rupp, Austria; Joe Chen, USA; Vikram Krishnamurthy, Canada; Bill Sandham, UK; Liang-Gee Chen, Taiwan; Tan Lee, Hong Kong; Bulent Sankur, Turkey; Huaiyu Dai, USA; Geert Leus, The Netherlands; Erchin Serpedin, USA; Satya Dharanipragada, USA; Bernard C. Levy, USA; Dirk Slock, France; Frank Ehlers, Italy; Ta-Hsin Li, USA; Yap-Peng Tan, Singapore; Sharon Gannot, Israel; Mark Liao, Taiwan; Dimitrios Tzovaras, Greece; Fulvio Gini, Italy; Yuan-Pei Lin, Taiwan; Hugo Van hamme, Belgium; Irene Gu, Sweden; Shoji Makino, Japan; Bernhard Wess, Austria; Peter Handel, Sweden; Stephen Marshall, UK; Douglas Williams, USA; R. Heusdens, The Netherlands; C. Mecklenbräuker, Austria; Roger Woods, UK; Ulrich Heute, Germany; Gloria Menegaz, Italy; Jar-Ferr Yang, Taiwan; Arden Huang, USA; Ricardo Merched, Brazil; Abdelhak M. Zoubir, Germany; Jiri Jan, Czech Republic; Rafael Molina, Spain; Sudharman K. Jayaweera, USA; Marc Moonen, Belgium

Contents

Design Methods for DSP Systems, Markus Rupp, Bernhard Wess, and Shuvra S. Bhattacharyya Volume 2006 (2006), Article ID 47817, 3 pages

Macrocell Builder: IP-Block-Based Design Environment for High-Throughput VLSI Dedicated Digital Signal Processing Systems, Nacer-Eddine Zergainoh, Ludovic Tambour, Pascal Urard, and Ahmed Amine Jerraya Volume 2006 (2006), Article ID 28636, 11 pages

Multiple-Clock-Cycle Architecture for the VLSI Design of a System for Time-Frequency Analysis, Veselin N. Ivanović, Radovan Stojanović, and LJubiša Stanković Volume 2006 (2006), Article ID 60613, 18 pages

3D-SoftChip: A Novel Architecture for Next-Generation Adaptive Computing Systems, Chul Kim, Alex Rassau, Stefan Lachowicz, Mike Myung-Ok Lee, and Kamran Eshraghian Volume 2006 (2006), Article ID 75032, 13 pages

Highly Flexible Multimode Digital Signal Processing Systems Using Adaptable Components and Controllers, Vinu Vijay Kumar and John Lach Volume 2006 (2006), Article ID 79595, 9 pages

Rapid VLIW Processor Customization for Signal Processing Applications Using Combinational Hardware Functions, Raymond R. Hoare, Alex K. Jones, Dara Kusic, Joshua Fazekas, John Foster, Shenchih Tung, and Michael McCloud Volume 2006 (2006), Article ID 46472, 23 pages

Rapid Prototyping for Heterogeneous Multicomponent Systems: An MPEG-4 Stream over a UMTS Communication Link, M. Raulet, F. Urban, J. F. Nezan, C. Moy, O. Deforges, and Y. Sorel Volume 2006 (2006), Article ID 64369, 13 pages

A Fully Automated Environment for Verification of Virtual Prototypes, P. Belanović, B. Knerr, M. Holzer, and M. Rupp Volume 2006 (2006), Article ID 32408, 12 pages

FPGA-Based Reconfigurable Measurement Instruments with Functionality Defined by User, Guo-Ruey Tsai and Min-Chuan Lin Volume 2006 (2006), Article ID 84340, 14 pages

FPGA Implementation of an MUD Based on Cascade Filters for a WCDMA System, Quoc-Thai Ho, Daniel Massicotte, and Adel-Omar Dahmane Volume 2006 (2006), Article ID 52919, 12 pages

A New Pipelined Systolic Array-Based Architecture for Matrix Inversion in FPGAs with Kalman Filter Case Study, Abbas Bigdeli, Morteza Biglari-Abhari, Zoran Salcic, and Yat Tin Lai Volume 2006 (2006), Article ID 89186, 12 pages

Floating-to-Fixed-Point Conversion for Digital Signal Processors, Daniel Menard, Daniel Chillet, and Olivier Sentieys Volume 2006 (2006), Article ID 96421, 19 pages

Optimum Wordlength Search Using Sensitivity Information, Kyungtae Han and Brian L. Evans Volume 2006 (2006), Article ID 92849, 14 pages

Hindawi Publishing Corporation EURASIP Journal on Applied Signal Processing Volume 2006, Article ID 47817, Pages 1–3 DOI 10.1155/ASP/2006/47817

Editorial Design Methods for DSP Systems

Markus Rupp,1 Bernhard Wess,1 and Shuvra S. Bhattacharyya2

1 Institute of Communications and Radio Frequency Engineering, Vienna University of Technology, Gusshausstrasse 25/389, 1040 Vienna, Austria
2 Department of Electrical & Computer Engineering, University of Maryland, College Park, MD 20742, USA

Received 8 August 2005; Accepted 8 August 2005

Copyright © 2006 Markus Rupp et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Industrial implementations of DSP systems today require extreme complexity. Examples are wireless systems satisfying standards like WLAN or 3GPP, video components, or multimedia players. At the same time, harsh constraints like low-power requirements often burden the designer even more. Conventional methods for ASIC design are no longer sufficient to guarantee a fast conversion from initial concept to final product. In industry, the problem has been addressed under the terms design crisis or design gap. While this design gap exists as a complexity gap, that is, a difference between existing, available, and demanded complexity, there is also a productivity gap, that is, the difference between the available complexity and how much of it we are able to efficiently convert into gate-level representations.

This special issue intends to present recent solutions to such gaps, addressing algorithmic design methods, algorithms for floating-to-fixed-point conversion, automatic DSP coding strategies, architectural exploration methods, hardware/software partitioning, as well as virtual and rapid prototyping.

We received 20 submissions from different fields and areas of expertise, of which finally only 12 were accepted for publication. These 12 papers can be categorised into four groups: pure VLSI design methods, prototyping methods, experimental reports on FPGAs, and floating-to-fixed-point conversions.

Most activities in design methods are related to the final product. VLSI design methods intend to deal with high complexity in a rather short time. In this special issue, we present five contributions that allow complex VLSI designs to be completed in substantially shorter time periods.

In “Macrocell builder: IP-block-based design environment for high-throughput VLSI dedicated digital signal processing systems,” N.-E. Zergainoh et al. present a design tool, called DSP macrocell builder, that generates SystemC register-transfer-level architectures for VLSI signal processing systems from high-level representations as interconnections of intellectual property (IP) blocks. The development emphasizes extensive parameterization and component reuse to improve productivity and flexibility. Careful generation of control structures is also performed to manage delays and coordinate parallel execution. Effectiveness of the tool is demonstrated on a number of high-throughput signal processing applications.

In “Multiple-clock-cycle architecture for the VLSI design of a system for time-frequency analysis,” Veselin N. Ivanović et al. present a streamlined architecture for time-frequency signal analysis. The architecture enables real-time analysis of a number of important time-frequency distributions. By providing for multiple-clock-cycle operation and resource sharing across the design in an efficient manner, the architecture achieves these features with relatively low hardware complexity. Results are given based on implementation of the architecture on field-programmable gate arrays, and a thorough comparison is given against a single-cycle implementation architecture.

In “3D-SoftChip: a novel architecture for next-generation adaptive computing systems,” C. Kim et al. present an architecture for real-time communication and signal processing through vertical integration of a configurable array processor subsystem and a switch subsystem. The proposed integration is achieved by means of an indium bump interconnection array to provide high interconnection bandwidth at relatively low levels of power dissipation. The paper motivates and develops the design of the proposed system architecture, along with its 2D subsystems and hierarchical interconnection network. Details on hardware/software codesign aspects of the proposed system are also discussed.

In “Highly flexible multimode digital signal processing systems using adaptable components and controllers,” V. V. Kumar and J. Lach present a design methodology for signal processing systems. The targeted class of applications involves those that can be decomposed naturally into multiple application modes, where the different modes operate during nonoverlapping time intervals. The approach developed in the paper emphasizes supporting flexible application of reconfigurability in multimode signal processing architectures, including reconfigurability in datapath components, controllers, and interconnect, as well as both intra- and inter-mode reconfigurability. The approach is demonstrated through synthesis of multimode applications that are composed of various DSP subsystems.

In “Rapid VLIW processor customization for signal processing applications using combinational hardware functions,” R. R. Hoare et al. present a VLIW processor with multiple application-specific hardware functions for computationally intensive signal processing applications. The hardware functions share the register file with the processor to eliminate overhead by data movement. A design methodology including profiling, compiler transformations for combinational logic synthesis, and code restructuring is proposed to map algorithms written in C onto this architecture. Application speedups are reported for several signal processing benchmarks from the MediaBench suite.

A large amount of activity can currently be found in rapid prototyping, where it is important to find feasible solutions to a challenging system design in rather short time. A final product may look different from the prototype, but the prototype is intended to deliver a first hands-on experience of whether a proposed architectural solution is feasible at all. The prototype thus provides the designers with decisions for a final product while still giving them a chance to further explore parts of the design.

In “Rapid prototyping for heterogeneous multicomponent systems: an MPEG-4 stream over a UMTS communication link,” M. Raulet et al. present a rapid prototyping method using the SynDEx CAD tool, a half-automated method, to map algorithms that are typically specified in C onto various real-time platforms. Supported platforms are by Sundance and Pentek, using a multitude of conventional DSPs and FPGAs. In order to support various platforms, means to describe hardware and software components as well as their communication links are provided in terms of SynDEx kernels. The communication kernel, for example, supports communication between the various functional units via shared RAMs. The efficiency of the proposed method is shown by a rather challenging example: an MPEG-4 stream is provided over a UMTS link.

In a second contribution in this field, entitled “A fully automated environment for verification of virtual prototypes,” P. Belanović et al. present a computer-aided design tool for automated derivation and verification support of virtual prototypes. The targeted virtual prototypes include definitions of the hardware/software interfaces in the given system, which enables parallel development and improved validation support across hardware and software. The developed tool operates in the context of algorithmic specifications developed through the COSSAP commercial design system for signal processing, and also in the context of target platforms based on the StarCore DSP. Retargetability to other algorithm development environments and target platforms is promising due to the general principles and modular architecture of the developed approach.

Many clever ideas to build prototypes based on FPGAs were submitted. The three most interesting ones are presented in this special issue. In “FPGA-based reconfigurable measurement instruments with functionality defined by user,” G.-R. Tsai and M.-C. Lin develop an approach using FPGAs to provide a framework for configurable measurement instruments, where the features and functionality of the instruments can be customized flexibly by the user. A hardware kernel for the configurable instrument approach is presented along with associated implementation considerations. Several examples are developed based on the proposed framework to illustrate the utility of the approach.

In “FPGA implementation of a MUD based on cascade filters for a WCDMA system,” Q.-T. Ho et al. present an FPGA-based implementation of a multiuser detector for WCDMA transmission systems. They exploit a serial interference structure in the form of a cascade filter. Their design methodology strives to support the maximum number of users while reflecting limited FPGA resources and timing constraints. Elaborate resource utilisation studies for VIRTEX II and VIRTEX II Pro FPGAs from XILINX validate their results.

In “A new pipelined systolic array-based architecture for matrix inversion in FPGAs with Kalman filter case study,” A. Bigdeli et al. propose an optimized systolic array-based matrix inversion for implementation in FPGAs. The main advantage of their structure is the small logic resource consumption compared to other systolic arrays in the literature. The hardware complexity is reduced from O(n²) to O(n) for inverting an n × n matrix. The new pipelined systolic array is used for rapid prototyping of a Kalman filter and compared with other implementations.

Floating-to-fixed-point conversion is an ongoing topic in system design. Although many concepts have been proposed over the years, there is hardly any tool support in commercial EDA products. In “Floating-to-fixed-point conversion for digital signal processors,” D. Menard et al. follow a different path than researchers have done before. Rather than minimizing signal-to-quantization-noise energy, they minimize code execution time on a DSP for a given accuracy constraint. This method includes taking into account the DSP architectural structure. To evaluate the fixed-point accuracy, an analytical approach is used to reduce the optimisation time compared to existing methods.

In “Optimum wordlength search using sensitivity information,” K. Han and B. L. Evans propose a fast algorithm for searching for an optimum wordlength by trading off hardware complexity for arithmetic precision at the system outputs. The optimization is based on the complexity-and-distortion measure that combines hardware complexity information with propagated quantized precision loss. Two case studies demonstrate that the proposed method can find optimum wordlengths in less time compared to local search strategies.
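As a toy illustration of the wordlength trade-off discussed above (this is a minimal exhaustive sketch, not Han and Evans's complexity-and-distortion algorithm nor any author's actual tool; all function names here are invented), one can search for the smallest fractional wordlength whose quantization error stays within an accuracy budget:

```python
# Toy sketch: smallest fixed-point fractional wordlength meeting an
# accuracy constraint. Fewer bits mean cheaper hardware but larger
# quantization error, so we take the first (cheapest) wordlength that
# satisfies the error budget.

def quantize(x, frac_bits):
    """Round x to a fixed-point grid with 2**-frac_bits resolution."""
    scale = 2 ** frac_bits
    return round(x * scale) / scale

def max_quant_error(samples, frac_bits):
    """Worst-case quantization error over a set of test samples."""
    return max(abs(s - quantize(s, frac_bits)) for s in samples)

def min_wordlength(samples, error_budget, max_bits=32):
    """Smallest fractional wordlength whose worst-case error is
    within the budget; None if no wordlength up to max_bits works."""
    for b in range(1, max_bits + 1):
        if max_quant_error(samples, b) <= error_budget:
            return b
    return None

signal = [0.1234, -0.9876, 0.5555, -0.3141]
print(min_wordlength(signal, error_budget=1e-3))  # → 9
```

Real wordlength optimizers avoid this exhaustive scan by using sensitivity information to prune the search space, which is precisely the speedup the paper targets.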

Markus Rupp
Bernhard Wess
Shuvra S. Bhattacharyya

Markus Rupp received his Dipl.-Ing. degree in 1988 from the University of Saarbruecken, Germany, and his Dr.-Ing. degree in 1993 from the Technische Universität Darmstadt, Germany. He is presently a Full Professor of digital signal processing in mobile communications at the Technical University of Vienna. He is an Associate Editor of IEEE Transactions on Signal Processing, of the EURASIP Journal on Applied Signal Processing (JASP), and of the EURASIP Journal on Embedded Systems (JES), and is an elected AdCom Member of EURASIP. He has authored and coauthored more than 180 papers and patents on adaptive filtering, wireless communications, and rapid prototyping.

Bernhard Wess received the Dipl. degree and the Ph.D. degree in electrical engineering from the University of Technology, Vienna, in 1985 and 1993, respectively. He is currently the Head of the Electronic Department at the Vienna Institute of Technology and a lecturer at the University of Technology, Vienna. His current research interests are in the areas of code generation and optimization for digital signal processors and rapid prototyping for digital signal processing systems.

Shuvra S. Bhattacharyya is an Associate Professor in the Department of Electrical and Computer Engineering and the Institute for Advanced Computer Studies (UMIACS) at the University of Maryland, College Park. He is also an Affiliate Associate Professor in the Department of . He is coauthor or coeditor of three books and the author or coauthor of more than 90 refereed technical articles. His research interests include VLSI signal processing, embedded software, and hardware/software codesign. He received the B.S. degree from the University of Wisconsin at Madison, and the Ph.D. degree from the University of California at Berkeley. He has held industrial positions as a researcher at the Hitachi America Semiconductor Research Laboratory (San Jose, Calif), and as a Compiler Developer at Kuck & Associates (Champaign, Ill).

Hindawi Publishing Corporation EURASIP Journal on Applied Signal Processing Volume 2006, Article ID 28636, Pages 1–11 DOI 10.1155/ASP/2006/28636

Macrocell Builder: IP-Block-Based Design Environment for High-Throughput VLSI Dedicated Digital Signal Processing Systems

Nacer-Eddine Zergainoh,1 Ludovic Tambour,1, 2, 3 Pascal Urard,2 and Ahmed Amine Jerraya1

1 TIMA Laboratory, National Polytechnique Institute of Grenoble, 46 Avenue Félix Viallet, 38031 Grenoble Cedex 1, France
2 ST Microelectronics, 850 Rue Jean Monnet, 38926 Crolles Cedex, France
3 CIRAD, TA 40/01, Avenue Agropolis Lavalette, 34398 Montpellier Cedex 5, France

Received 3 October 2004; Revised 14 April 2005; Accepted 25 May 2005

We propose an efficient IP-block-based design environment for high-throughput VLSI systems. The flow generates a SystemC register-transfer-level (RTL) architecture, starting from a Matlab functional model described as a netlist of functional IPs. The refinement model automatically inserts control structures to manage delays induced by the use of RTL IPs. It also inserts a control structure to coordinate the execution of parallel clocked IPs. The delays may be managed by registers or by counters included in the control structure. The flow has been used successfully in three real-world DSP systems. The experiments show that the approach can produce efficient RTL architectures and saves a huge amount of design time.

Copyright © 2006 Hindawi Publishing Corporation. All rights reserved.

1. INTRODUCTION

As the complexity of high-throughput dedicated digital signal processing (DSP) systems under hardware design increases, development efforts increase dramatically. At the same time, the market dynamics for electronic systems push for shorter and shorter development times [1]. In order to meet the design-time requirements, a design methodology for VLSI dedicated DSP systems that favors reuse and early error detection is essential. One idea, largely widespread and applied to design DSP systems, is to adopt a modular approach based on a (recursive) divide-and-conquer strategy. The global complexity of the system should be divided into subsystems (i.e., elementary signal processing functions) that are well known and of easily accessible complexity, such as filters (FIR, IIR), the fast Fourier transform (FFT), the Viterbi decoder, and so forth. The system can then be obtained by the hierarchical assembly of these common signal processing functions (also known as IP blocks). Intellectual-property- (IP-) based design is obviously an important issue for improving not only design productivity, but also design from the higher-level abstraction [2, 3].

However, designers encounter two major problems with the IP-block-based design approach [2–4]. The first problem is the difficulty of using IP blocks for high-throughput DSP systems that require various performances (throughput) or functions with nonstandard algorithms [5]. This is because a VLSI DSP system cannot be parameterized for global performance and functions; for example, the necessary processing cycles cannot be adjusted for IP blocks. The second problem comes from the interfacing of IP blocks with one another. Designers have to design IP blocks that can communicate according to the blocks’ interface specification. When they connect two different IP blocks, they have to insert extra interface circuitry in order to synchronize them. The area and delay overhead for this circuitry cannot be neglected in some cases. Our goal is to find some appropriate design tactics to avoid these problems.

In this paper, we propose an efficient IP-block-based design environment for high-throughput VLSI dedicated digital signal processing (DSP) systems, called the DSP macrocell builder tool. The flow generates a SystemC register-transfer-level (RTL) architecture, starting from a Matlab functional model described as a netlist of functional IPs. To provide IPs with more reusability and flexibility, we use parameterized reusable DSP components at the functional and RT levels. Thus, by setting the appropriate parameters, unnecessary functions and redundant interfaces are eliminated in our IP-based design approach. The refinement process automatically inserts control structures to treat delays induced by the use of RTL IPs. It also inserts a control structure to coordinate the execution of parallel clocked IPs. The delays may be managed by registers or by counters included in the control structure. The main contribution of this paper is a prototype implementation and experimentation of the approach. The rest of the paper is organized as follows. After investigating related work in Section 2, we introduce our methodology and discuss its merits, the important issues, and how this approach handles the IP-based design problems. Section 4 details the IP-block-based design environment for high-throughput VLSI DSP systems. Section 5 describes several experiments to analyze the efficiency of the proposed design flow, and Section 6 concludes the paper.

2. RELATED WORK

2.1. Standard design flow for ASIC

A standard design flow for hardware implementation of algorithms has four phases, which are typically handled by four different designers. Algorithm designers conceive the chip and deliver a specification to system designers, often in the form of a floating-point simulation. The system or architecture designers begin to add structure to this simulation, partitioning the design into functional units. They must also convert the data types from floating to fixed point and verify that finite word-length effects and pipeline depth do not compromise the algorithm. The hardware designers map the simulation to RTL code and verify that the code matches the specified functionality and pipeline depth. Physical designers take standard-cell netlists synthesized from the RTL code and generate layout mask patterns. This flow requires three translations of the design, expressing the functionality as gradually less sequential and more structural, with requirements for reverification at each stage. Opportunities for algorithmic modifications, to reduce power and area, are often lost due to the separation of engineering decisions. Performance bottlenecks discovered during the physical design phase are unknown to the algorithm designer. Aggressive system requirements may require new and unusual architectures, which can stall the flow, leading to uncontrolled looping back to earlier stages of the design process and extending the design time indefinitely. The main problem with this flow is that it attempts to avoid feedback of information to algorithm designers.

The flow we need would allow algorithm designers to explore the design space as thoroughly as possible by creating an RTL model and obtaining performance estimates. This exploration should allow the refinement of fixed-point types to be constrained to libraries of efficient hardware blocks, and to be carried out by an automated design flow. This encourages feedback of RTL design issues to algorithm designers by allowing them to maintain ownership of the design data at all times. It would also encourage interaction between algorithm and hardware designers by reducing the design process to a single phase.

2.2. Current methods and flows for DSP algorithm implementation

Recent efforts have identified the gaps between algorithm, system, hardware, and physical design, but have yet to encompass the complete problem. Some attempt to close the gap between algorithm and hardware design by basing synthesis tools on C/C++ descriptions [6–11]. However, these solutions require a style of code that is very similar to RTL code, which is unattractive to algorithm designers. Commercial tools from design automation companies offer RTL code generation solutions from block diagrams [12]. However, these tools are targeted mostly at hardware designers and obscure the information about the algorithm and architecture through the code generation process.

Some have proposed using high-level system design flows, such as Ptolemy [13] and POLIS [14]. These flows emphasize overall system cosimulation and cosynthesis for heterogeneous systems rather than the details required in creating and integrating a DSP ASIC into an existing system. There are also some works on system-level design flows targeted at DSP hardware systems [15–19]. Grape-II [15], Champion [16], Logic Foundry [17], and MATCH [18] follow this scheme. In Grape-II, the target architecture consists of commercial DSP processors, bond-out versions of core processors, and FPGAs linked to form a powerful heterogeneous multiprocessor. The Logic Foundry is a system-level design flow for the rapid creation and integration of FPGA-based DSP by using predictable, preverified IP blocks that have standardized interfaces. The problem with this approach is that the area and delay overhead for the standard interface circuitry cannot be neglected in some cases. Champion is an IP-block-based design approach for the data path of a DSP ASIC. The design automation of the data path is performed using two libraries of predesigned basic blocks (functional and cell libraries). Unfortunately, the lack of flexibility of the libraries (no parameterized blocks) limits the reuse of the IP blocks, especially for high-throughput DSP systems which require various performances (throughput) or functions with nonstandard algorithms. This work was also limited to data paths without runtime control considerations. MATCH has attempted to compile high-level languages, such as Matlab, directly into hardware implementations (including code for DSPs, embedded processors, and FPGAs).

However, we believe that in all the above works, the design methodologies tackle some issues of DSP design but have yet to encompass the entire problem. In fact, most of the above-mentioned approaches cannot satisfy a tradeoff between architecture quality, rapid algorithm/architecture exploration, and fast modeling and validation.

2.3. IP-based design issues

A lot of research has been carried out on IP-based design [2–4, 20–27]. Most of the research deals with IP-based SoCs [2, 20–23]. Problems of SoC synthesis are addressed in [23], where it is assumed that an external reference clock is supplied and asynchronous communication is used. However, most on-chip buses for SoCs use synchronous communication. IP blocks are also exploited in application-specific instruction-set processor (ASIP) synthesis for embedded DSP software [26]. To accelerate the execution of the software, they select an optimal set of IPs

Functional model Generic In Out System-level simulation functional F-IP1 F-IP2 F-IP3 DSP-IP library (parameters exploration) F-IP4

Refinement (parameters extraction, FSM Automatic delay integration, automatic assembly) correction method

RTL architecture FSM

Generic In Out RTL RTL-IP1 RTL-IP2 RTL-IP3 Behavior difference? DSL-IP library RTL-IP4

Figure 1: IP-based design methodology for VLSI dedicated DSP.
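The refinement step of Figure 1 (functional netlist in, RTL netlist of the same topology out, with parameters carried across) can be sketched in miniature as follows. This is an illustrative toy, not the authors' tool: the real flow works on Matlab functional models and synthesizable SystemC/VHDL IPs, and the class and function names here (`FunctionalIP`, `RtlIP`, `refine`) are invented for the sketch.

```python
# Toy sketch of Figure 1's refinement: each functional IP is replaced
# by a corresponding RTL IP carrying the same architectural parameters,
# while the netlist topology (here, the instance names) is preserved.

class FunctionalIP:
    """A functional-level DSP block with architectural parameters."""
    def __init__(self, kind, params):
        self.kind, self.params = kind, params

class RtlIP:
    """A stand-in for a predesigned synthesizable RTL block."""
    def __init__(self, kind, params):
        self.kind, self.params = kind, params

def refine(functional_netlist):
    """Extract each F-IP's parameters and instantiate the matching
    RTL IP; connections between instances would be made by name."""
    return {name: RtlIP(ip.kind, ip.params)
            for name, ip in functional_netlist.items()}

netlist = {
    "fir1": FunctionalIP("fir", {"n_coeff": 16, "nbit": 12}),
    "fft1": FunctionalIP("fft", {"points": 64, "nbit": 12}),
}
rtl = refine(netlist)
print(rtl["fir1"].params["nbit"])  # → 12
```

The essential property the sketch preserves is the paper's: the designer validates parameters once on the functional model, and the RTL architecture is assembled automatically with the same topology.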

and interface types for each IP. However, the interface types for IPs are restricted to a coprocessor integration style. Interrupt/trap or shared-mapped I/O memories are often used. The software technique called handshaking offers flexible communication between hardware and software, but it is too slow. Some researchers are trying to develop general communication interfaces in hardware. In the area of application-specific integrated circuit (ASIC) design, communication between IPs is often conducted through shared registers or shared memories. The typical interface configuration contains multiplexers with enable signals or address decoders. The concept of a generic virtual interface has been attracting a lot of attention as a way to increase design reuse. General virtual interfaces lead designers to believe that any IP could communicate with any other IP [27]. Some practical approaches are reported, such as the automatic matching/generation/deletion of interface pins [3, 4, 24, 25]. General virtual interfaces are kinds of wrapper IPs, so they carry area and delay overhead.

None of the above works solve the two above-mentioned problems for high-throughput ASIC DSP systems. The main contribution of this paper is to provide some appropriate design tactics to avoid these problems.

3. OVERVIEW OF DESIGN METHODOLOGY

The IP-based design methodology is based on designers’ practice [28, 29]. The methodology, described in Figure 1, generates a register-transfer-level (RTL) architecture starting from a functional model, given in Matlab. The functional modeling and RTL architecture generation are performed using two libraries of predesigned DSP basic blocks (functional and RTL libraries).

In our IP-block-based design approach, the functional model is created by the assembly of existing functional IPs written in Matlab [30]. The refinement process keeps the same architecture and replaces each functional IP by a corresponding RTL one, according to a set of parameters given by the designer. These are present as attributes in the functional model. The choice of IP parameter values (i.e., architectural parameter values such as bit width) is made by the system designer in order to satisfy a tradeoff between signal quality and implementation constraints. To generate the architecture, IP parameter values are first extracted from a validated functional architecture model and then used to instantiate the predesigned RTL IPs written in a synthesizable hardware language (i.e., VHDL, SystemC). The architecture is then built by automatic assembly of the predesigned RTL IPs (with the same assembly topology as the functional model). The connection between the RTL IPs is made by name. The design flow includes a unified verification platform used to verify both the RTL and functional models.

The platform directly exploits the high-level environment used for functional validation. The results of the methodology are safe functional and RTL models of the whole DSP application. The functional model can be used as an executable reference for the next generation of the design. Overall, the final architecture takes implicit advantage of the hardware designer’s expertise. The RTL model is suitable for logic synthesis.

3.1. Generic DSP-IP blocks

To provide IPs with more reusability and flexibility (problem 1), we are developing parameterized reusable DSP components at the functional and RTL levels, called “generic F-IP” and “generic RTL IP.” We define the generic F-IP as a template described in a Matlab hybrid representation; many details are left open, and only some signals which are relevant for the quantification are implemented in quantified integers. The generic F-IP blocks are stored in a library. We define the generic RTL IP as a synthesizable RTL model of a basic DSP block. Each F-IP is mapped onto one or more RTL IPs. A typical RTL IP is shown in Figure 2 (where the generic parameters are in italic).

Figure 2: Generic RTL-IP description of an FIR filter (a delay line of Z⁻¹ registers, coefficients coeff 1 … coeff N with N = number of coefficients, an nbit-wide input and sum, followed by round and saturation stages).

The external interface concepts (e.g., external port structure, functional and timing details, generic parameters, etc.) of IPs describe how the IP block exchanges information with its environment. The F-IP interface defines the component name, I/O data stream names, and generic parameter names. The external interface information of an RTL-IP block is described by the component definition (including the component name, generic parameter names, port names, port directions, and port data types). The ports can be data, clock, reset, control, and test ports. Figure 3 illustrates the analogy between the F-IP and RTL-IP interfaces. Therefore, just by setting appropriate parameters (problem 2), unnecessary functions and redundant interfaces are eliminated in the IP-based assembly approach (no need to insert extra interface circuitry).

technique is the area and delay overhead for each IP. However, the problem occurs only in the three cases above. The second technique involves the insertion of registers between RTL IPs in order to compensate for the additive delays. This technique has the advantage of being nonintrusive. However, performing the corrections manually (i.e., locating the places where the problems are, determining exactly how many registers need to be inserted, and where to insert them) is a very difficult task, which grows with the number of IPs. The third technique involves the modification of the initial finite-state machine (FSM) by generating additional signals to control the IPs. These signals are time-shifted versions of the initial signals of the global FSM. Therefore, they are able to put back or put forward the activation of the IPs. This technique adopts various
Furthermore, the de- stages of the second technique (i.e., locating the places where signer does not have to pay attention to the communication the problems are, determining exactly how many signals are of interface protocols. needed) and requires the FSM to be modified. Modification costs more because of the complexity of the FSM (multiplies 3.2. Overview of automatic delay the number of control signals and increases the number of correction method (ADCM) states). In our IP-block-based design, we have implemented Although each functional IP and its equivalent RTL pro- a systematic approach called “automatic delay correction duce individually the same digital values, in some cases, method (ADCM) to solve the problem without inserting an the register-transfer model obtained by automatic assembly extra interface circuitry. The ADCM implements efficiently from the functional model can be wrong [31]. This can oc- the last two techniques (register-insertion-based and FSM cur due to delays induced by implementation constraints modification-based). We have developed two algorithms (pipeline registers, output buffers, etc.). This behavioral fault (algorithm-1 and algorithm-2) to perform ADCM [31]. The is caused by the existence of delays in the RTL model, which first algorithm (algorithm-1), similar to the Bellman labeling cannot be found in the functional model. These delays occur algorithm [32], determines an optimal solution in latency; when the DSP application contains parallel branches of IPs whereas the second algorithm (algorithm-2), similar to the converging towards another IP, feedback loops of IPs, and/or simplex algorithm [32], determines an optimal solution in time-depending IP. This problem is generally known as re- the number of inserted registers (i.e., optimizing area). timing issue. There are three main techniques able to correct the differ- 4. DSP MACROCELL BUILDER: IP-BLOCK-BASED ent behavior between the two models. 
The first technique in- DESIGN ENVIRONMENT FOR volves the insertion of synchronization protocol (e.g., hand- VLSI DEDICATED DSP shake protocol) for each IP component, which indicates when the input and output data are valid. The advantage Our IP-block-based design environment called “DSP macro- of this technique is that the delay problems are solved be- cell builder” shown in Figure 4 consists of system-level vali- fore the assembly of RTL IPs. The main drawback of this dation flow, hardware design flow (including data path and Nacer-Eddine Zergainoh et al. 5

[Figure 3 shows, side by side, a functional F-IP block (a Matlab function [out1, out2] = GF_IP(in1, in2, para1, para2), with its inputs, outputs, and generic parameters) and the corresponding generic RTL-IP block (a component GRT_IP with generic parameters para1 and para2 of type positive, data ports in1, in2, out1, out2, and clock, reset, control, and enable ports), together with the systematic analogy between the two interfaces.]

Figure 3: F-IP interface versus RTL-IP interface.
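As an illustration of the generic-parameter idea of Figures 2 and 3, the FIR RTL-IP can be sketched behaviorally in C++ (the paper's IPs are Matlab and synthesizable VHDL/SystemC; the names `GenericFir`, `frac_bits`, and `nbit_out` below are ours, and the sketch only mirrors the delay line, coefficient multipliers, adder, rounding, and saturation of Figure 2):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Behavioral sketch of the generic FIR RTL-IP of Figure 2: a delay
// line, N coefficient multipliers, an adder, then rounding and
// saturation to the output width. Illustrative, not the authors' IP.
class GenericFir {
public:
    // N (number of coefficients) is implied by coeff.size(); frac_bits
    // and nbit_out play the role of the bit-width generic parameters.
    GenericFir(std::vector<int32_t> coeff, int frac_bits, int nbit_out)
        : coeff_(std::move(coeff)),
          delay_(coeff_.size(), 0),
          frac_bits_(frac_bits),
          nbit_out_(nbit_out) {}

    // One clock cycle: shift the delay line, multiply-accumulate,
    // round (add half an LSB before truncation), saturate.
    int32_t step(int32_t in) {
        delay_.insert(delay_.begin(), in);
        delay_.pop_back();
        int64_t acc = 0;
        for (size_t i = 0; i < coeff_.size(); ++i)
            acc += static_cast<int64_t>(coeff_[i]) * delay_[i];
        acc = (acc + (1LL << (frac_bits_ - 1))) >> frac_bits_;  // round
        const int64_t hi = (1LL << (nbit_out_ - 1)) - 1;        // saturate
        const int64_t lo = -(1LL << (nbit_out_ - 1));
        return static_cast<int32_t>(std::max(lo, std::min(hi, acc)));
    }

private:
    std::vector<int32_t> coeff_;
    std::vector<int32_t> delay_;
    int frac_bits_, nbit_out_;
};
```

Setting the constructor arguments plays the role of the VHDL generics of Figure 3: an instance with a different coefficient set or output width needs no extra interface circuitry.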

[Figure 4 depicts the tool flow of the DSP macrocell builder: the Matlab functional model is translated by Mat2Colif into a Colif functional architecture and validated with CosimX (SystemC functional); ColifRefiner, Prepare-1, and ColifLatencer feed the delay correction method (registers-insertion-based RegInsert or FSM-modification-based ModFSM), producing a Colif RTL description; GenFSM and Prepare-2 generate the FSM (SystemC RTL code), and CosimX simulates the SystemC RTL netlist.]

Figure 4: DSP macrocell builder.

The environment also provides a delay correction flow for high-throughput VLSI dedicated DSP systems. The main feature of our configuration is that the tool flow is based on a unified design model for the simulation and synthesis of system-on-chip (SoC) architectures, called "Colif" [33]. The other tools take advantage of information from the Colif description and of the characteristics of the generic IP libraries. Initially, a designer uses the generic F-IP library to describe his functional model in Matlab [30]. The next step is to explore the pure algorithm for the DSP system in the Matlab environment. Then, the Mat2Colif tool transforms the Matlab description into a Colif description. The IP parameter values are extracted from a validated functional model.

[Figure 5 illustrates the Colif representation: hierarchical modules (M1, M1.1, M1.2, M2.1, M3) containing tasks or black boxes, with wrappers, configuration parameters, virtual components, virtual ports, and virtual channels.]

Figure 5: Colif representation.

They are then used by the Prepare-1 and CosimX tools to generate the functional architecture in SystemC [11]. The delay correction flow (including ColifRefiner, ColifLatencer, and the ADCM), as will be explained later, transforms the functional Colif into a corrected RTL Colif. Architectural parameters are used to instantiate the predesigned RTL IPs written in a synthesizable hardware language (i.e., VHDL, SystemC). The DSP macrocell builder includes the automatic generation of the RTL SystemC of the final architecture (including the data path and the FSM). After cycle-level simulation, the generated architecture can be passed to logic synthesis, automatic placement, and routing tools in order to achieve a good-performance circuit.

The following subsections detail the several automatic phases of the macrocell builder.

4.1. Colif (Codesign language-independent format)

Colif is a unified abstract model for high-level system design and refinement methodology [33]. Colif represents a system as a hierarchical network of virtual components using three basic concepts: module, port, and net. Virtual components use wrappers to separate the interface of the internal component from the interface of the external nets (see Figure 5). The wrapper is the set of virtual ports of a virtual component. Virtual ports contain internal and external ports that can differ in terms of communication protocol and abstraction level. Colif uses a uniform syntax to represent systems that are described at multiple abstraction levels. A virtual port can contain multiple levels of hierarchy to represent an "N : M" (N and M are natural numbers) correspondence between internal and external ports. The internal ports are used to connect the internal behavior of the module to the virtual port. The external ports are used to connect the external communication channel to the virtual port. A virtual channel groups nets that are parts of the same communication protocol. Each Colif object has a list of local parameters, for example, the kind of protocol used in a virtual channel and the addresses of ports. Colif is used as an intermediate language for describing the design model through the different phases of the DSP macrocell builder.

4.2. Mat2Colif

Mat2Colif was developed to transform the functional Matlab model into a functional description in the Colif language. It consists of a lexical and syntactical analyzer applied to the Matlab description, the functions treating the different input parameters of the tool, and the functions necessary for producing the correct output file. The tool needs an intermediate variable for integrating the inputs and the outputs; this means that it is not possible to use the labels of the inputs and the outputs directly for calling these functions. After the intermediate form is explored, the tool imports the Colif objects corresponding to the functional IPs. Once all the objects are correctly imported into the Colif tree structure located in memory, the tool instantiates this structure in order to obtain a suitable file for visualization. This file describes the functional description in the Colif language.

4.3. ColifRefiner

The ColifRefiner tool transforms the Colif functional architecture model into a Colif RTL model. First, the Colif F-IPs are substituted by their corresponding Colif RTL IPs using the IP libraries. Then, the module of the global FSM is added and the port-net connections are performed. The connections between IPs are made by name, meaning that ports with the same role have the same name in both the functional and RTL models. The output of ColifRefiner is a Colif RTL structural description of the system (including the data path and the FSM structure).

[Figure 6 shows the functional netlist (modules F-IP1, F-IP2, F-IP3 connected by data nets) and, after ColifRefiner, the corresponding RTL netlist (modules RTL-IP1, RTL-IP2, RTL-IP3 plus the global FSM module, with clock, reset, and enable ports and nets added).]

Figure 6: Input and output of ColifRefiner tool.

Figure 6 illustrates an example of the input and output of the ColifRefiner tool.

4.4. Delay correction flow

Figure 7 shows the flow of the delay correction method [31]. The inputs of this flow are the Colif functional and Colif RTL descriptions of the entire system. The output is a corrected RTL-level description producing the same digital values as the functional description. The localization and the calculation of the number of delays to be inserted require the use of an intermediate form called the differential graph of evolution, which highlights the delays present in the RTL model and absent from the functional one. For that, the functional model (resp., the RTL model) is represented by a graph called the functional graph of evolution (resp., the RTL graph of evolution) describing its own delays. The differential graph of evolution is created by taking, edge by edge, the difference between the edge weights of the functional graph of evolution and those of the RTL graph of evolution. This difference makes it possible to see only the additive delays due to the implementation constraints, by removing all the delays related to the functionality.

Starting from the differential graph, the ADCM determines the corrections necessary to compensate the additive delays. The ADCM uses two algorithms to perform the corrections needed by the differential graph of evolution in order to obtain a balanced graph. The first algorithm (algorithm-1) determines an optimal solution in latency, while the second algorithm (algorithm-2) gives an optimal solution in the number of inserted registers, optimizing area. Finally, the code generation step produces a corrected RTL description of the system, inserting the right number of delays at the right places. The ADCM implements two alternatives to correct the RTL description of the system: one based on registers insertion, the other based on FSM modification. According to the implementation constraints and the target application, the designer can choose the suitable technique. In practice, the ColifLatencer tool automatically inserts the latency values into the Colif files (Colif RTL and functional), while the ADCM performs the correction.
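For the simple case of parallel branches converging on one IP, the balancing step described above can be sketched as follows (the actual ADCM algorithms [31] are graph algorithms akin to Bellman labeling and the simplex method; this minimal version, with the hypothetical name `registers_to_insert`, only illustrates the differential-graph idea):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Minimal sketch of differential-graph balancing for parallel paths
// converging on one IP: subtract the functional delay of each path from
// its RTL delay (the "differential graph of evolution"), then pad every
// path up to the largest differential delay. The padding per path is
// the number of registers to insert on it.
std::vector<int> registers_to_insert(const std::vector<int>& rtl_delay,
                                     const std::vector<int>& func_delay) {
    std::vector<int> diff(rtl_delay.size());
    for (size_t i = 0; i < diff.size(); ++i)
        diff[i] = rtl_delay[i] - func_delay[i];  // additive delays only
    int target = *std::max_element(diff.begin(), diff.end());
    std::vector<int> regs(diff.size());
    for (size_t i = 0; i < diff.size(); ++i)
        regs[i] = target - diff[i];              // pad to balance
    return regs;
}
```

For the two-branch example discussed next (three one-register FIRs on the first path, one on the second, no delays in the functional model), this yields zero registers for the first path and two for the second.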
[Figure 7 shows the delay correction flow: from the functional and RTL descriptions, the algorithmic and RTL delays between the I/Os of each IP are computed, yielding the functional and RTL graphs of evolution; their delay difference gives the differential graph of evolution, which the ADCM transforms into a corrected (balanced) differential graph of evolution.]

Figure 7: Delay correction flow.

4.5. Synthetic example

In order to highlight the problem of behavior difference and its solution in a real case of IP-block-based design, we have deliberately selected a synthetic example composed of two parallel branches of IP blocks converging towards the same IP (see Figure 8). One branch contains three FIR filter IPs, and the other contains only one FIR filter. A behavior difference between the RTL and functional models has been detected by our ADCM, as well as during both the functional and cycle-accurate simulations (the two digital data curves in Figure 8 are different). The problem is due to an output register present in each RTL FIR filter and absent from the functional filters. This register induces an additional delay in the RTL model. The differential graph of evolution is shown in Figure 9. The first path has three additional delays, whereas the second path has one additional delay. It was necessary to add two delays in the second path in order to balance the differential graph. This delay correction can be translated in two ways on the RTL model. The first one consists of inserting two registers in the second path (Figure 10(a)). The other method involves modifying the initial FSM (Figure 10(b)). The initial FSM generates a control signal every fourth clock cycle. In the case of the second correction method, the FSM has to produce a supplementary control signal, initially delayed by two impulses of the first signal (8 clock cycles). Then, the filter in the second branch starts its computation after the same period as the filters of the first path. Both techniques were applied to this example. Independently of the correction method used, the RTL models produce exactly the same digital values as the functional one (see Figure 11).

[Figure 8 plots the outputs of the Matlab functional model and of the uncorrected SystemC RTL model for the two-branch FIR example; the two curves differ.]

Figure 8: Synthetic example and the behavior difference between the functional and RTL models.

[Figure 9 shows the differential graph of evolution of the synthetic example: each RTL FIR contributes one additive delay (three on the first path, one on the second); the correction adds two delays on the second path.]

Figure 9: Balanced differential graph of evolution.

4.6. Discussion

We assume that the systems are mono-rate, do not include time-varying IPs, and can be built by an acyclic assembly of IPs. We assume the model is a static dataflow graph (SDF graph), that is, the latency and data throughput of the IPs are constant; this model is not a limitation of our methodology. In practice, when designing data-dependent IPs, FIFOs with parameterized sizes are placed at the outputs of the IPs, which reduces the system to the SDF case. In the case of a cyclic graph, heuristic algorithms that build acyclic graphs from cyclic ones need to be considered; these are outside the scope of this paper. Most high-throughput DSP systems can be supported by our methodology.

5. EXPERIMENTAL RESULTS

We applied our ADCM and the associated IP-based design flow to the synthetic example (presented above) and to three high-throughput dedicated DSP systems: a digital modulation chain circuit extracted from a real design of a TV digital satellite transmission application, a decoder based on the soft-output Viterbi algorithm (SOVA), and the MP3 (MPEG-1 audio layer-3) audio compression standard. Functional and RTL models of these three applications were built by assembling the various predesigned and prevalidated IPs. The behaviors were subjected to the ADCM; we used the two alternatives (registers insertion and FSM modification) of implementing delay correction on these circuits. The logic synthesis was performed using Synopsys Design Compiler [34] and the resultant circuits were mapped to AMS's 0.35 μm cell-based array library. The resultant gate-level circuits were compared with respect to the following metrics: area and performance. The area and clock period are obtained after performing synthesis and technology mapping. The performance, that is, the execution time, is the product of the clock period and the number of clock cycles (RTL simulation).

Table 1 presents the number of registers inserted after the behavior corrections performed by the ADCM (i.e., algorithm-1 and algorithm-2). Except for the synthetic example, algorithm-2 significantly improves the solutions found by algorithm-1. The average registers improvement is 50% (the averages were calculated by comparing the sums of the values in the algorithm-1 and algorithm-2 columns).

In Tables 2 and 3, we present, respectively, the execution time and area results. The results are given for the following three cases: designs without ADCM (second column), designs with ADCM by registers insertion (third column), and designs with ADCM by FSM modification (fourth column).
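Before looking at the measured results, the retiming fault of the synthetic example and its register-insertion fix can be reproduced with a toy cycle-level model (each RTL FIR is reduced to its one output register, i.e., an identity filter with one cycle of latency; `Reg` and `run` are illustrative names, not the paper's IPs):

```cpp
#include <cassert>
#include <vector>

// One pipeline register: outputs last cycle's input (1-cycle delay).
struct Reg {
    int q = 0;
    int step(int d) { int out = q; q = d; return out; }
};

// Toy model of Figure 8: branch 1 chains three 1-cycle "FIRs", branch 2
// has one, plus extra_regs_branch2 compensating registers; the
// converging IP (FIR5 stand-in) simply adds the two branch outputs.
std::vector<int> run(const std::vector<int>& in, int extra_regs_branch2) {
    std::vector<Reg> b1(3), b2(1 + extra_regs_branch2);
    std::vector<int> out;
    for (int x : in) {
        int v1 = x, v2 = x;
        for (auto& r : b1) v1 = r.step(v1);
        for (auto& r : b2) v2 = r.step(v2);
        out.push_back(v1 + v2);
    }
    return out;
}
```

Without compensation the adder combines samples of different ages; with the two extra registers on the second branch, both operands carry the same three-cycle latency, so the output matches the functional model up to that latency.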

In the first case (second column), the interfacing of the IP blocks was performed by inserting extra interface circuitry (i.e., a handshake protocol) in order to synchronize them. Tables 2 and 3 indicate that our ADCM results in significant improvements of performance and area: the average performance improvement is 15.67%, whereas the average area improvement is 10.7%. The area and delay overhead of the interface circuitry cannot be neglected in the first case (second column). Regarding the two alternatives of ADCM (registers insertion versus FSM modification), the two corrections are equivalent in terms of performance (third and fourth columns in Table 2). With regard to area (third and fourth columns in Table 3), the two corrections give slightly different results; we note that the difference in area is no more than 4% for the cases studied. This difference is due to the way in which the delay corrections are distributed on the balanced differential graph. N delay corrections on the differential graph correspond to M (1 <= M <= N, where M depends on the distribution of the delay corrections on the differential graph) supplementary control signals in ADCM by FSM modification, whereas they always correspond to N registers in ADCM by registers insertion. The choice of a method is closely related to the application and must be made after applying the two methods and analyzing the area results.

Table 1: Registers insertion performed by ADCM algorithms.

System | Registers inserted (algorithm-1) | Registers inserted (algorithm-2)
Synthetic example | 2 | 2
Modulation chain | 15 | 7
SOVA | 9 | 5
MP3 decoder | 12 | 5

[Figure 10 sketches the two corrections on the RTL model of the synthetic example: (a) two Z^-1 registers inserted in the second branch; (b) the FSM augmented with a supplementary control signal C2, a delayed copy of C1, driving the second branch.]

Figure 10: Two ways to correct the behavior in the RTL model: (a) delay correction by registers insertion and (b) delay correction by modifying the FSM.

[Figure 11 plots the outputs of the Matlab functional model and of the corrected RTL model; the two curves coincide.]

Figure 11: Output signals of the functional and corrected RTL models.

6. SUMMARY AND CONCLUSIONS

In this paper, we proposed an efficient IP-block-based design environment for high-throughput VLSI systems. The flow generates a SystemC RTL architecture starting from a Matlab-based functional model of the digital system. To provide IPs with more reusability and flexibility, we developed parameterized reusable DSP components at the functional and register-transfer levels, called "generic F-IP" and "generic RTL-IP." Thus, by setting the appropriate parameters, unnecessary functions and redundant interfaces are eliminated in the IP-based design approach. Although each functional IP and its equivalent RTL produce the same digital values, in some cases the register-transfer model obtained by automatic assembly from the functional model can be wrong. We have therefore proposed an approach called the automatic delay correction method to solve this problem without the insertion of extra interface circuitry. The approach corrects the behavior of the RTL model in a judicious way.
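As a quick arithmetic check on the reported metrics (execution time is the product of clock period and cycle count; improvement is measured against the handshake-based design without ADCM), the per-design gains can be recomputed from the values of Tables 2 and 3; the helper name `improvement_pct` is ours:

```cpp
#include <cassert>

// Relative improvement (in percent) of a design with ADCM over the
// corresponding design without ADCM (handshake-based interfacing).
double improvement_pct(double without_adcm, double with_adcm) {
    return 100.0 * (without_adcm - with_adcm) / without_adcm;
}
```

For example, the modulator's execution time drops from 4567870 ns to 3859850 ns (about 15.5%), and its area from 541.7 to 482.5 kgates (about 10.9%), consistent with the reported averages of 15.67% and 10.7% over the three designs.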

Table 2: Performance results: registers insertion versus FSM modification.

Execution time (ns)
System | Without ADCM | ADCM by registers insertion | ADCM by FSM modification
Modulator | 4567870 | 3859850 | 3859854
MP3 decoder | 2297500 | 1929902 | 1929907
SOVA* | 68266 | 58026 | 58027

*Time required to decode a single 1024-bit block of information using 4-stage iterative decoding.

Table 3: Area results: registers insertion versus FSM modification.

Area (kgates)
System | Without ADCM | ADCM by registers insertion | ADCM by FSM modification
Modulator | 541.7 | 482.5 | 483.2
MP3 decoder | 149.4 | 134.1 | 134.9
SOVA | 52.5 | 47.5 | 47.1

The correction involves locating the places where the problems occur, determining how many delays are needed, and implementing the correction. We have described two alternatives (registers insertion and FSM modification) for implementing the delay correction method, and we have presented a realistic example where the delay correction method has been efficiently applied. Experimental results on real cases also demonstrate significant improvements in the quality of the synthesized implementations.

REFERENCES

[1] International Technology Roadmap for Semiconductors, 2003 Edition Report, http://public.itrs.net.
[2] A. Sangiovanni-Vincentelli, L. Carloni, F. De Bernardinis, and M. Sgroi, "Benefits and challenges for platform-based design," in Proceedings of the 41st IEEE Design Automation Conference (DAC '04), pp. 409–414, San Diego, Calif, USA, June 2004.
[3] G. Martin, "Design methodologies for system level IP," in Proceedings of IEEE Design, Automation and Test in Europe (DATE '98), pp. 286–289, Paris, France, February 1998.
[4] D. D. Gajski, A. C.-H. Wu, V. Chaiyakul, S. Mori, T. Nukiyama, and P. Bricaud, "Essential issues for IP reuse," in Proceedings of the IEEE Asia and South Pacific Design Automation Conference (ASP-DAC '00), pp. 37–42, Yokohama, Japan, January 2000.
[5] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, John Wiley & Sons, New York, NY, USA, 1998.
[6] Celoxica, Handel-C Language Reference Manual, RM-1003-4.0, 2003, http://www.celoxica.com.
[7] G. De Micheli, "Hardware synthesis from C/C++ models," in Proceedings of the IEEE Design, Automation and Test in Europe Conference and Exhibition (DATE '99), pp. 382–383, Munich, Germany, March 1999.
[8] S. A. Edwards, "The challenges of hardware synthesis from C-like languages," in Proceedings of IEEE Design, Automation and Test in Europe (DATE '05), vol. 1, pp. 66–67, Munich, Germany, March 2005.
[9] D. D. Gajski, J. Zhu, R. Dömer, A. Gerstlauer, and S. Zhao, SpecC: Specification Language and Methodology, Kluwer Academic, Boston, Mass, USA, 2000.
[10] D. C. Ku and G. De Micheli, "HardwareC: a language for hardware design," Tech. Rep. CSTL-TR-90-419, Computer Systems Laboratory, Stanford University, Stanford, Calif, USA, August 1990.
[11] SystemC Community, http://www.systemc.org.
[12] Xilinx System Generator v2.1 for Simulink Reference Guide, Xilinx, 2000.
[13] J. Buck, S. Ha, E. A. Lee, and D. G. Messerschmitt, "Ptolemy: a framework for simulating and prototyping heterogeneous systems," International Journal of Computer Simulation, vol. 4, no. 2, pp. 155–182, 1994.
[14] F. Balarin, M. Chiodo, P. Di Giusto, et al., Hardware-Software Co-Design of Embedded Systems: The POLIS Approach, Kluwer Academic, Boston, Mass, USA, 1997.
[15] R. Lauwereins, M. Engels, M. Ade, and J. A. Peperstraete, "Grape-II: a system-level prototyping environment for DSP applications," IEEE Computer, vol. 28, no. 2, pp. 35–43, 1995.
[16] S. Natarajan, B. Levine, C. Tan, D. Newport, and D. Bouldin, "Automatic mapping of Khoros-based applications to adaptive computing systems," in Proceedings of the Military and Aerospace Applications of Programmable Devices and Technologies International Conference (MAPLD '99), pp. 101–107, Laurel, Md, USA, September 1999.
[17] G. Spivey, S. S. Bhattacharyya, and K. Nakajima, "Logic foundry: rapid prototyping for FPGA-based DSP systems," EURASIP Journal on Applied Signal Processing, vol. 2003, no. 6, pp. 565–579, 2003.
[18] P. Banerjee, N. Shenoy, A. Choudhary, et al., "MATCH: a MATLAB compiler for configurable computing systems," Tech. Rep. CPDC-TR-9908-013, Center for Parallel and Distributed Computing, Northwestern University, Evanston, Ill, USA, August 1999.
[19] W. R. Davis, N. Zhang, K. Camera, et al., "A design environment for high-throughput low-power dedicated signal processing systems," IEEE Journal of Solid-State Circuits, vol. 37, no. 3, pp. 420–431, 2002.
[20] R. K. Gupta and Y. Zorian, "Introducing core-based system design," IEEE Design and Test of Computers, vol. 14, no. 4, pp. 15–25, 1997.
[21] L. Lavagno, S. Dey, and R. Gupta, "Specification, modeling and design tools for system-on-chip," in Proceedings of the 7th IEEE Asia and South Pacific Design Automation Conference and 15th International Conference on VLSI Design (ASP-DAC '02), pp. 21–23, Bangalore, India, January 2002.
[22] W. O. Cesario, A. Baghdadi, L. Gauthier, et al., "Component-based design approach for multicore SoCs," in Proceedings of the 39th IEEE Design Automation Conference (DAC '02), pp. 789–794, New Orleans, La, USA, June 2002.
[23] B.-W. Kim and C.-M. Kyung, "Exploiting intellectual properties with imprecise design costs for system-on-chip synthesis," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 10, no. 3, pp. 240–252, 2002.
[24] M. Vachharajani, N. Vachharajani, S. Malik, and D. I. August, "Facilitating reuse in hardware models with enhanced type inference," in Proceedings of the IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS '04), pp. 86–91, Stockholm, Sweden, September 2004.
[25] R. Passerone, J. A. Rowson, and A. Sangiovanni-Vincentelli, "Automatic synthesis of interfaces between incompatible protocols," in Proceedings of the 35th IEEE Design Automation Conference (DAC '98), pp. 8–13, San Francisco, Calif, USA, June 1998.
[26] H. Choi, J. H. Yi, J.-Y. Lee, I.-C. Park, and C.-M. Kyung, "Exploiting intellectual properties in ASIP designs for embedded DSP software," in Proceedings of the 36th IEEE Design Automation Conference (DAC '99), pp. 939–944, New Orleans, La, USA, June 1999.
[27] VSI Alliance, http://www.vsi.org.
[28] L. Tambour, "Efficient methodology for design and validation of complex DSP system-on-chip," Ph.D. thesis, Institut National Polytechnique de Grenoble (INPG), Grenoble, France, December 2003, http://tima.imag.fr/publications/files/th/mfs 196.pdf.
[29] N. E. Zergainoh, K. Popovici, A. A. Jerraya, and P. Urard, "Matlab based environment for designing DSP systems using IP blocks," in Proceedings of the 12th Workshop on Synthesis and System Integration of Mixed Information Technologies (SASIMI '04), pp. 296–302, Kanazawa, Japan, October 2004.
[30] The MathWorks Incorporation, http://www.mathworks.com.
[31] N. Zergainoh, L. Tambour, H. Michel, and A. A. Jerraya, "Méthode de correction automatique de retard dans les modèles RTL des systèmes monopuces DSP obtenus par assemblage de composants IP," Techniques et Sciences Informatiques, vol. 24, no. 10, pp. 1227–1257, 2005.
[32] A. Gibbons, Algorithmic Graph Theory, Cambridge University Press, Cambridge, UK, 1985.
[33] W. O. Cesario, G. Nicolescu, L. Gauthier, D. Lyonnard, and A. A. Jerraya, "Colif: a design representation for application-specific multiprocessor SOCs," IEEE Design and Test of Computers, vol. 18, no. 5, pp. 8–20, 2001.
[34] Synopsys Incorporation, http://www.synopsys.com.

Nacer-Eddine Zergainoh received the State Engineering degree in electrical engineering from the National Telecommunication School, and the M.S. and Ph.D. degrees in computer engineering from the University of Paris XI, in 1992 and 1996, respectively. Currently, he is an Associate Professor at the Ecole Polytechnique of University Joseph Fourier, Grenoble, and a member of the research staff of the Techniques of Informatics and Microelectronics for Computer Architecture (TIMA) Laboratory, Grenoble. Prior to that, he was an R&D Engineer at ILEX-Computer Systems, Paris, France. His current research interests are hardware/software codesign, high-level synthesis and CAD issues for real-time digital signal processing, and the design and exploration of application-specific multiprocessor SoCs (including the design and analysis of on-chip communication architectures and network-on-chip issues). He also maintains an active interest in parallel processing, multiprocessor architectures, and real-time operating systems. Professor Zergainoh has served on the technical program committees of several international conferences and workshops.

Ludovic Tambour received the Engineer degree in computer science from the Ecole Polytechnique de Grenoble in 2000, and the M.S. and Ph.D. degrees in computer science from the Institut National Polytechnique de Grenoble (INPG), Grenoble, France, in 2000 and 2003, respectively. In 2000, he joined the R&D SHIVA Group at ST Microelectronics and the SLS Group at the TIMA Laboratory, where he worked on methodology and flow for the design and validation of digital signal processing ASIC macrocells. In 2004, Dr. Tambour moved to an Engineer position at CIRAD (International Cooperating Center in Research for Agronomic Developing), Montpellier, France. His research interests include software tools for modeling, simulation, and data analysis in a large field of activities including microelectronics, signal processing, agronomy, and so forth.

Pascal Urard joined ST Microelectronics in 1992, where he has worked successively in test, engineering, ASIC design, and the architecture of mixed signal processing ASICs. In 2000, he joined ST R&D to work on ESLD flows. He initiated a Matlab-2-RTL flow that is now used internally in ST. In 2001, he initiated cooperations with HLS tools companies. He is now the Manager of the High-Level Synthesis Group within ST Microelectronics Central CAD.

Ahmed Amine Jerraya received the Engineer degree from the University of Tunis in 1980 and the DEA, "Docteur Ingénieur," and "Docteur d'Etat" degrees from the University of Grenoble in 1981, 1983, and 1989, respectively, all in computer sciences. In 1986, he held a Full Research position with the CNRS (Centre National de la Recherche Scientifique). From April 1990 to March 1991, he was a member of the scientific staff at Nortel in Canada, working on linking system design tools and hardware design environments. He is the General Chair of HLDVT '02 and Coprogram Chair of CASES '02. He served as the General Chair for DATE 2001 and ISSS '96, and as General Cochair for CODES '99. He also served as Program Chair for ISSS '95 and RSP '96, and as Coprogram Chair of CODES '97. He has published more than 100 papers in international conferences and journals. He received the Best Paper Award at the 1994 ED&TC for his work on hardware/software cosimulation. Dr. Jerraya is currently managing the System-Level Synthesis Group of the TIMA Laboratory and has the grade of Research Director within the CNRS.

Hindawi Publishing Corporation, EURASIP Journal on Applied Signal Processing, Volume 2006, Article ID 60613, Pages 1–18, DOI 10.1155/ASP/2006/60613

Multiple-Clock-Cycle Architecture for the VLSI Design of a System for Time-Frequency Analysis

Veselin N. Ivanović, Radovan Stojanović, and LJubiša Stanković

Department of Electrical Engineering, University of Montenegro, 81000 Podgorica, Montenegro, Yugoslavia

Received 29 September 2004; Revised 17 March 2005; Accepted 25 May 2005

A multiple-clock-cycle implementation (MCI) of a flexible system for time-frequency (TF) signal analysis is presented. Some very important and frequently used time-frequency distributions (TFDs) can be realized by using the proposed architecture: (i) the spectrogram (SPEC) and the pseudo-Wigner distribution (WD), as the oldest and most important tools used in TF signal analysis; and (ii) the S-method (SM) with various convolution window widths, as an intensively used reduced-interference TFD. The architecture is based on realizing the short-time Fourier transform (STFT) in the first clock cycle. It allows the mentioned TFDs to take different numbers of clock cycles and to share functional units during their execution. These abilities represent the major advantages of multicycle design, and they help reduce both hardware complexity and cost. The designed hardware is suitable for a wide range of applications, because it allows sharing in simultaneous realizations of higher-order TFDs. Also, it can be accommodated to implement the SM with a signal-dependent convolution window width. In order to verify the results on real devices, the proposed architecture has been implemented in field-programmable gate array (FPGA) chips. Also, at the implementation (silicon) level, it has been compared with the single-cycle implementation (SCI) architecture.

Copyright © 2006 Hindawi Publishing Corporation. All rights reserved.

1. INTRODUCTION AND PROBLEM FORMULATION

The most important and commonly used methods in TF signal analysis, the SPEC and the WD, show serious drawbacks: low concentration in the TF plane and generation of cross-terms in the case of multicomponent signal analysis, respectively [1–3]. In order to alleviate (or in some cases completely solve) the above problems, the SM for TF analysis is proposed in [4]. Recently, the SM has been intensively used [5–8]. Its definition is [4, 9, 10]

SM(n, k) = \sum_{i=-L_d(n,k)}^{L_d(n,k)} P_{(n,k)}(i) STFT(n, k + i) STFT^*(n, k - i),   (1)

where STFT(n, k) = \sum_{i=-N/2+1}^{N/2} f(n + i) w(i) e^{-j(2\pi/N)ik} represents the STFT of the analyzed signal f(n), 2L_d(n, k) + 1 is the width of a finite frequency-domain (convolution) rectangular window P_{(n,k)}(i) (P_{(n,k)}(i) = 0 for |i| > L_d(n, k)), and the signal's duration is N = 2^m. The SM produces, as its marginal cases, the WD and the SPEC with maximal (L_d(n, k) = N/2) and minimal (L_d(n, k) = 0) convolution window widths, respectively. In the case of a multicomponent signal with nonoverlapping components, by an appropriate convolution window width selection, the SM can produce a sum of the WDs of individual signal components, avoiding cross-terms [4, 10, 11]: P_{(n,k)}(i) should be wide enough to enable complete integration over the auto-terms, but narrower than the distance between two auto-terms. In addition, the SM produces better results than the SPEC and the WD regarding calculation complexity [4] and noise influence [9]. Note that the essential SM properties are the high auto-term concentration, the cross-term reduction, and the noise influence suppression. Two possibilities for the SM implementation are

(1) with a signal-independent (constant) L_d(n, k) = L_d = const [4, 10], when, in order to get the WD for each component, the convolution window width should be such that 2L_d + 1 is equal to the width of the widest auto-term. For the entire TF plane, except at the central points of the widest component, this window would be too long. This fact might have negative effects regarding the cross-term reduction [4, 10] and the noise influence suppression [9]. On the other hand, a shorter window would result in lower concentration;

(2) with a signal-dependent L_d(n, k) (the so-called signal-dependent SM) [11], which may alleviate the
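As a numerical companion to definition (1), the STFT and the signal-independent SM can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function names, the Hann window, the FFT-based frequency axis, and the circular indexing of k ± i are assumptions, not part of the paper's hardware.

```python
import numpy as np

def stft(f, w):
    """Sliding-window STFT of a real signal f with analysis window w,
    one N-point FFT per time instant n (a software stand-in for an
    STFT block; phase conventions may differ from the paper's sum)."""
    N = len(w)
    half = N // 2
    fp = np.pad(f, (half, half))          # so every n has a full window
    return np.array([np.fft.fft(fp[n:n + N] * w) for n in range(len(f))])

def s_method(S, Ld):
    """SM(n, k) per (1) with a constant rectangular window P(i)
    of width 2*Ld + 1 (the signal-independent case)."""
    SM = np.zeros(S.shape)
    for i in range(-Ld, Ld + 1):
        # np.roll gives circular k +/- i indexing along the frequency axis
        SM += np.real(np.roll(S, -i, axis=1) * np.conj(np.roll(S, i, axis=1)))
    return SM
```

With Ld = 0 the sum collapses to |STFT(n, k)|², that is, the SPEC, and increasing Ld toward N/2 moves the result toward the WD, matching the marginal cases stated above.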

disadvantages of the signal-independent form in the analysis of multicomponent signals having different widths of the auto-terms. In addition, it may further significantly improve the essential SM properties [9, 11].

In order to improve the concentration of highly nonstationary signals, higher-order TFDs can be used [5, 12]. One of them, which can be presented in a two-dimensional TF plane and is defined in the same manner as the SM, is the L-Wigner distribution (LWD) [12]:

LWD_L(n, k) = \sum_{i=-L_d}^{L_d} LWD_{L/2}(n, k + i) LWD_{L/2}(n, k - i),   (2)

where LWD_L(n, k) is the LWD of the Lth order, and LWD_1(n, k) ≡ SM(n, k). Note that the LWD is implicitly defined based on the SM and the STFT, so it can be implemented in a similar way as the SM.

Definition (1), based on the STFT, makes the SM very attractive for implementation. However, all TFDs beyond the STFT are numerically quite complex and require significant calculation time. This fact makes them unsuitable for real-time analysis and severely restricts their application. Hardware implementations, when they are possible, can overcome this problem and enable application of these methods in numerous additional problems in practice. Some simple implementations of the architectures for TF analysis are presented in [10, 13–19]. An architecture for the VLSI design of systems for TF analysis and time-varying filtering based on the SM is presented in [16, 17]. However, all these architectures give the desired TFD in one clock cycle. It means that no architecture resource can be used more than once, and that any element needed more than once must be duplicated. Consequently, practical realization of these architectures requires large chips. Besides, just a single TFD (the SM with an exactly defined convolution window width) can be realized this way.

In this paper, we develop an MCI of special-purpose hardware for TF analysis based on the SM, suitable for VLSI design. In the proposed implementation, each step in the TFDs' execution takes one clock cycle. In the first step, the proposed architecture realizes the STFT, as a key intermediate step in the realization of the implemented TFDs. In each higher-order clock cycle, a different TFD is realized: in the second one, the SPEC; in the third one, the SM with unitary convolution window width; and so on. The WD is realized in the clock cycle in which the maximal convolution window width is reached. Note that the proposed architecture can realize almost all commonly used TFDs. The MCI design allows a functional unit to be used more than once per TFD execution, as long as it is used in different clock cycles. This significantly reduces the amount of required hardware. The ability to allow TFDs to take different numbers of clock cycles and the ability to share functional units within the execution of a single TFD are the major advantages of the proposed design.

The paper is organized as follows. After the introduction, MCI architectures for the SM realization (in its signal-independent and signal-dependent forms) are designed, the corresponding controls are defined, and the trade-offs and comparisons with the SCI are given. In Section 3, the designed MCI system is used for the real-time realization of the higher-order TFDs. The proposed approaches are verified in Section 4 by designing the FPGA chips. Also, the obtained implementation results at the silicon level are compared with SCI architectures.

2. MULTICYCLE HARDWARE IMPLEMENTATION OF THE S-METHOD

2.1. Signal-independent S-method

In this section, an MCI system for the SM (1) realization, assuming a fixed convolution window width (L_d(n, k) = L_d), is presented. Since the STFT is a complex transformation, (1) involves complex multiplications. In order to involve only real multiplications in (1), we modify it by using STFT(n, k) = STFT_Re(n, k) + j STFT_Im(n, k) (STFT_Re(n, k) and STFT_Im(n, k) are the real and imaginary parts of STFT(n, k), resp.), as

SM_R(n, k) = STFT_Re^2(n, k) + 2 \sum_{i=1}^{L_d} STFT_Re(n, k + i) STFT_Re(n, k - i),   (3)

SM_I(n, k) = STFT_Im^2(n, k) + 2 \sum_{i=1}^{L_d} STFT_Im(n, k + i) STFT_Im(n, k - i),   (4)

where SM(n, k) = SM_R(n, k) + SM_I(n, k). The kth channel, one of the N channels (obtained for k = 0, 1, ..., N − 1), is described by (3)-(4). Note that it consists of two identical subchannels used for processing of STFT_Re(n, k) and STFT_Im(n, k), respectively.

The hardware necessary for a one-channel MCI of the signal-independent SM is presented in Figure 1. It is designed based on a two-block structure. The first block is used for the STFT implementation, whereas the second block is used to modify the outputs of the STFT block, in order to obtain the improved TFD concentration based on the SM. The STFT block can be implemented by using the available FFT chips [20, 21] or by using approaches based on the recursive algorithm [10, 13, 17, 19, 22–24]. Note that, due to the reduced hardware complexity, the recursive algorithm is more suitable for a VLSI implementation [13]. The second block is designed so that it realizes each summation term from (3)-(4) in the corresponding step of the method implementation. We break the SM execution into several steps, each taking one clock cycle. Our goal in breaking the SM execution into clock cycles is to balance the amount of work done in each cycle, so that we minimize the clock cycle time. In the first step, the STFT will be executed; in the second step, the SPEC will be executed based on the first step execution; in the third step, the SM with the unitary convolution window width will
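The passage from (1) to the real-only form (3)-(4) works because the i and −i terms of (1) are complex conjugates, so the imaginary parts cancel pairwise. This can be cross-checked numerically; the sketch below is illustrative only (`sm_complex` and `sm_real_only` are hypothetical names, and the frequency index is taken circularly):

```python
import numpy as np

def sm_complex(S, Ld):
    """Direct evaluation of (1) for one time instant n."""
    N = len(S)
    return np.array([sum(S[(k + i) % N] * np.conj(S[(k - i) % N])
                         for i in range(-Ld, Ld + 1)) for k in range(N)])

def sm_real_only(S, Ld):
    """Evaluation via (3)-(4): only real multiplications, as in the
    two identical Re/Im subchannels of the SM block."""
    re, im = S.real, S.imag
    N = len(S)
    out = np.empty(N)
    for k in range(N):
        smr = re[k]**2 + 2 * sum(re[(k + i) % N] * re[(k - i) % N]
                                 for i in range(1, Ld + 1))
        smi = im[k]**2 + 2 * sum(im[(k + i) % N] * im[(k - i) % N]
                                 for i in range(1, Ld + 1))
        out[k] = smr + smi                # SM = SM_R + SM_I
    return out

rng = np.random.default_rng(0)
row = rng.normal(size=8) + 1j * rng.normal(size=8)
assert np.allclose(sm_complex(row, 2).imag, 0)            # the sum is real
assert np.allclose(sm_complex(row, 2).real, sm_real_only(row, 2))
```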


Figure 1: MCI architecture for the signal-independent S-method realization.

be executed based on the execution in the first two steps, and so on. With each further step, one realizes the SM with an incremented convolution window width, based on the preceding steps. This improves the TFD concentration, aiming at achieving the one obtained by the WD.

The proposed hardware has been designed for 16-bit fixed-point arithmetic. Each subchannel of the second block contains exactly one adder, one multiplier, and one shift-left register for the implementation of (3)-(4). These functional units must be shared for different inputs in different steps by adding multiplexors and/or a demultiplexor at their inputs. The real and imaginary parts of the SM value, computed in each execution step based on (3)-(4), are saved into the Real and Imag temporary registers, respectively. In the first step, only the STFT block of the proposed two-block architecture is used, whereas in the remaining steps only the second block is used. This is regulated by the set of control signals introduced on the temporary registers, the multiplexors, and the demultiplexor; see Table 1. Note that the control signals SHLorNo and AddSelB assume unity values in each step of the TFD implementation, except in the second step (the SPEC completion step), when they assume zero values. Consequently, these signals can be replaced by one control signal, SPECorSM, that enables the SPEC execution (with its zero value) or the execution of the TFDs with nonzero convolution window widths. Note that the multiplication operation results in two sign bits and, assuming the Q15 format (15 fractional bits), the product must be shifted left by one bit to obtain correct results. This shifter is included in the multiplier.

The longest path in the second block is the one that connects the input STFT_Re(n, k) (or STFT_Im(n, k)), through one multiplier, one shift-left register, and two adders, with the output of the second block. If the STFT is realized based on a recursive algorithm, then it has the same longest path [10, 17]. This path determines the clock cycle time and thus the fastest sampling rate. This design can be implemented as
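The Q15 detail above (two sign bits in the raw product, removed by a one-bit left shift) can be mimicked in software. This is a minimal sketch of the fixed-point behavior only; in the hardware, the shifter sits inside the multiplier, as stated above:

```python
def q15_mult(a, b):
    """Multiply two Q15 values given as 16-bit integers in [-32768, 32767].
    The 32-bit product is in Q30 with two sign bits; shifting left by one
    bit gives Q31, and keeping the high 16 bits yields the Q15 result."""
    prod = a * b        # Q15 x Q15 -> Q30, with two sign bits
    prod <<= 1          # drop the redundant sign bit -> Q31
    return prod >> 16   # high 16 bits -> Q15

# 0.5 x 0.5 = 0.25 in Q15: 16384 x 16384 -> 8192
assert q15_mult(16384, 16384) == 8192
```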

Table 1: Function of each control signal generated by the control logic.

SelSTFT: (m − 1)-bit signal which controls the N/2-input multiplexors (two of them per subchannel are introduced to select between the STFT values from different channels).
SHLorNo: 1-bit signal which enables use of the shift-left register in the corresponding steps (when multiplication by 2 must be implemented), or disables it (in the second step).
AddSelB: 1-bit signal which enables use of only one adder per subchannel for implementing the sums in (3)-(4) by controlling its second input, which can be either the constant 0 (in the second step) or the Real (or Imag) register value (in each further step).
SignLoad: 1-bit signal which enables sampling of the analyzed analog signal f(t), but only after execution of the desired TFD of the analyzed signal samples from the preceding time instant.
SMStore: 1-bit write control signal of the OutREG temporary register. It should be asserted during the step in which the SM with the corresponding convolution window width is computed.

an application-specific integrated circuit (ASIC) chip to meet the speed and performance demands of very fast real-time applications; see Section 4.

Defining the control

From the defined multistep sequence of the multicycle TFDs' execution, we can determine what the control logic must do in each clock cycle. It can set all control signals based solely on the distribution code (TFDcode). This code determines the TFD which will be implemented by using the proposed architecture. Taking N = 64, the TFDcode can be a 6-bit field which determines the convolution window width. The architecture with the control logic and the control signals is shown in Figure 2.

Control for the MCI architecture must specify both the signals to be set in any step and the next step in the sequence. Here we use a finite-state Moore machine to specify the multicycle control, Figure 3. The finite-state control essentially corresponds to the steps of the desired TFD execution; each state in the finite-state machine takes one clock cycle. This machine consists of a set of states and directions on how to change states. Each state specifies a set of outputs that are asserted or deasserted when the machine is in that particular state. The labels on the arcs are conditions that are tested to determine which state is the next one. When the next state is unconditional, no label is given. Note that the implementation of a finite-state machine usually assumes that all outputs that are not explicitly asserted are deasserted, and the correct operation of the architecture often depends on the fact that a signal is deasserted. Multiplexor and demultiplexor controls are slightly different, since they select one of the inputs, whether they be 0 or 1. Thus, in the finite-state machine we always specify the settings of all (de)multiplexor controls that we care about.

2.2. Trade-offs and comparisons of the proposed design with the SCI ones

The SCI architecture executes the desired TFD in one clock cycle. This means that no architecture resource can be used more than once per TFD execution and that any element needed more than once must be duplicated. Then, we can easily conclude that, in the case of the considered SM block (3)-(4) implementation, we have to use (2Ld + 1) adders, 2(Ld + 1) multipliers, and 2Ld shift-left registers if we prefer an SCI approach. This can be tested by studying the SCI architectures represented in [16, 17], as well as the real-time SCI of the SM with Ld = 3 given in Section 4.2.

A comparison of the architectures' resources used in the SCI and MCI designs, as well as a comparison of their clock cycle times, is given in Table 2. The following advantages of the MCI design, compared with the SCI ones, can be noted:

(i) the required reduction of the amount of hardware, achieved by introducing the temporary registers and several multiplexors at the inputs of the functional units. The achieved hardware reduction is significant, and it increases as the convolution window width increases;
(ii) since the temporary registers and the introduced multiplexors are fairly small, this could yield a substantial reduction in the hardware cost, as well as in the used chip dimensions;
(iii) the clock cycle time in the MCI design is much shorter.

Finally, the ability to realize almost all commonly used TFDs with the same hardware represents a major advantage of the proposed MCI design.

On the other hand, the fastest sampling rate in the MCI design of the SM with arbitrary Ld is (Ld + 2) × (Tm + 2Ta + Ts), see Table 2, while it is equal to the clock cycle time in the corresponding SCI design (2Tm + (Ld + 3)Ta + Ts, see Table 2). Then, the SCI approach improves the execution time. However, this disadvantage of the MCI approach is significantly alleviated by the fact that the SM with small Ld is usually used,¹ when the execution times in these two cases (the SCI and the MCI approaches) do not differ significantly.

¹ High TFD concentration (almost as high as in the WD case) is achieved even with small Ld [4, 9], whereas the interference effects [10] and the noise influence [9] are more reduced with decreasing convolution window width.
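Behaviorally, the Moore control described in this section reduces to an output function of the current step. The sketch below is an illustrative software model only; the state numbering, the dictionary keys, and the `tfd_code` encoding (here taken as the requested window width Ld, with 0 meaning the SPEC) are assumptions, not a netlist:

```python
def control_outputs(step, tfd_code):
    """Moore-machine outputs for one step of the multicycle execution:
    step 0 runs the STFT, step 1 completes the SPEC, and step s >= 2
    adds the window term i = s - 1, so the SM with width Ld finishes
    in step Ld + 1 (Ld + 2 cycles in total, as in Table 2)."""
    if step == 0:
        return {"SignLoad": 1, "SMStore": 0}
    return {
        "SignLoad": 0,
        "SelSTFT": step - 1,                # which STFT pair to fetch
        "SPECorSM": 0 if step == 1 else 1,  # SPEC completion vs SM steps
        "SMStore": 1 if step == tfd_code + 1 else 0,
    }
```

For example, the SM with Ld = 2 asserts SMStore only in step 3, after the STFT, SPEC, and Ld = 1 steps.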

Figure 2: MCI architecture for the signal-independent S-method realization together with the necessary control lines. Thick solid line highlights the control line as opposed to a line that carries data.

More technical details about the practical implementation of the MCI and the SCI architectures can be found in Section 4.

Hybrid implementation

In order to achieve a balance between the minimal chip dimensions, hardware consumption, and cost of the MCI approach and the minimal execution time of the SCI approach, a hybrid implementation approach may be considered. The SM block of this implementation would be based on the SCI design of the SM with an exactly defined convolution window width Ld (Ld ≥ 1). As in the MCI design case, the hybrid implementation would give the desired TFD in a few clock cycles: in the second one, this architecture could implement the SMs with convolution window widths up to Ld (up to the SM that is a base for the SM block realization), and in each further step it could realize the SM with an incremental convolution window width. Then, the total number of clock cycles would not be greater than the one from the MCI design. In particular, both implementation approaches, hybrid and MCI, use the same number (two) of clock cycles for the SPEC implementation only. In the case of the SM with nonzero convolution window width implementation, the total number of clock cycles would be smaller by using the hybrid implementation design. For the SM block implementation, one would use (2Ld + 1) adders, 2(Ld + 1) multipliers, and 2Ld shift-left registers, and the corresponding clock cycle time would be Tm + (Ld + 1)Ta + Ts. Note that the hybrid implementation (even the one based on the SM with Ld = 1) increases the hardware complexity, chip dimensions, and cost, as well as the clock cycle time, compared to the MCI design. Then, the SM with Ld = 1 cannot be so useful as a base for the SM block of the hybrid

Figure 3: The finite-state machine control for the architecture shown in Figure 2. Output (SMStore = 1) means that the SMStore control signal is asserted during only the final step of the corresponding TFD execution.

implementation, since it would only slightly improve the execution time of the MCI architecture (it requires only one step, the SPEC completion, less than the MCI approach). The SM with Ld = 2 would be a reasonable choice for this purpose. However, the hybrid approach would not use the whole SM block in each step. For example, the part of the SM block for the SPEC implementation (see Figure 12 from Section 4.2) would be used in the second step only. Note that the clock cycle time is determined by the longest possible path in the SM block, which does not have to be used in every step here. Consequently, the hybrid architecture cannot succeed in balancing the amount of work done in each clock cycle, so the clock cycle time cannot be minimized.

Note that the overall performance of the hybrid implementation is not likely to be very high, since all the steps (except, in some cases, the second one) could fit in a shorter clock cycle. The second step is an exception when the SM with a convolution window width of at least Ld is implemented by using the hybrid design, where Ld is the convolution window width of the SM that is a base for this particular implementation. This fact leads to the dispersion of the hardware resources, as well as of the needed time, in almost all steps used in the TFD execution. Also, the control logic of the hybrid implementation would be similar but, at the same time, more complicated, as compared to the MCI approach case.

Table 2: Total number of functional units per channel in an SM block and the clock cycle time in the cases of (a) the single-cycle implementation (SCI) and (b) the multicycle implementation (MCI). Tm is the multiplication time of a two-input 16-bit multiplier, Ta is the addition time of a two-input 16-bit adder, whereas Ts is the time for a 1-bit shift. The recursive form of the STFT block implementation is assumed when the clock cycle time in the SCI case is represented.

Implementation | Adders | Multipliers | Shift-left registers | Clock cycle time
SCI | 2Ld + 1 | 2(Ld + 1) | 2Ld | 2Tm + (Ld + 3)Ta + Ts
MCI | 3 | 2 | 2 | Tm + 2Ta + Ts
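The trade-off encoded in Table 2 (a much shorter MCI clock cycle versus a single long SCI cycle) can be made concrete with illustrative delays. The numeric values of Tm, Ta, and Ts below are arbitrary assumptions for illustration, not measured figures:

```python
Tm, Ta, Ts = 10.0, 4.0, 1.0   # assumed delays (ns): multiply, add, 1-bit shift

def sci_cycle(Ld):
    """SCI: the whole TFD computed in one long clock cycle (Table 2)."""
    return 2 * Tm + (Ld + 3) * Ta + Ts

def mci_cycle():
    """MCI: one short clock cycle (Table 2)."""
    return Tm + 2 * Ta + Ts

def mci_total(Ld):
    """MCI total execution time: Ld + 2 cycles for the SM with width Ld."""
    return (Ld + 2) * mci_cycle()

for Ld in (1, 3, 7):
    print(Ld, mci_cycle(), sci_cycle(Ld), mci_total(Ld))
```

With these numbers the MCI cycle (19 ns) is roughly half the SCI cycle already at Ld = 1, while the MCI total execution time exceeds the SCI cycle, which is exactly the trade-off discussed above.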

2.3. Signal-dependent S-method

The disadvantages of the signal-independent convolution window in the analysis of multicomponent signals having different widths of the auto-terms motivate the introduction of a signal-dependent convolution window width. It follows, for each point of the TF plane, the widths of the auto-terms, excluding from the summation in (1) the terms where one or both of the components STFT(n, k + i) and STFT(n, k − i) are equal to zero. In addition, it should stop the summation outside a component. Practically, it means that when the absolute square value of STFT(n, k + i) or STFT(n, k − i) is smaller than an assumed reference level R_n^2, the summation in (1) should be stopped. In practice, the reference value is selected based on a simple analysis of the analyzed signal and the implementation system [10, 17]. It is defined as a few percent of the SPEC's maximal value at a considered time instant n, R_n^2 = max_k{SPEC(n, k)}/Q^2, where SPEC(n, k) is the SPEC of the analyzed signal and 1 ≤ Q < ∞. In the sequel, the signals that determine the nonzero values of STFT(n, k ± i) (i = 0, 1, ..., L_d(n, k)) will be denoted by x_{±i}: x_{±i} = 1 if |STFT(n, k ± i)|^2 > R_n^2, and x_{±i} = 0 otherwise.

The sampling rate of the analyzed analog signal f(t) depends on the clock cycle time Tc and on the number of executed steps. Consequently, the same number of steps must be executed in different time instants. In that sense, we have to assume the maximal possible convolution window width as 2L_{d max} + 1 (a variable convolution window width approach with a predefined maximal window width), and to define the sampling rate by (2L_{d max} + 1)Tc. Since the SM(n, k) value is calculated in the Lth step, where L ≤ L_{d max} + 1, it must be saved up to the (L_{d max} + 1)th step in the OutREG temporary register.

In order to accommodate the hardware from Figures 1 and 2 for a signal-dependent window width, we add two N/2-input multiplexors to generate the SignDep(endent) control signal, which determines whether or not the ith term enters the summation in (3)-(4). With the zero value of the SignDep control signal, adding a new term to the calculated SM value is disabled, since additional improvement of the TFD concentration is impossible. It takes different values in different steps, defined as

SignDep = x_i · x_{−i},   i = 0, 1, 2, ..., L_{d max}.   (5)

The signals x_i are set in the first step, after the STFT calculation. The circuit needed to generate the signal x_i is separated within the dashed box and presented in Figure 4.

The multistep sequence of the signal-dependent SM is the same as in the signal-independent case. The first two steps have to be executed, since the SPEC value should be forwarded to the output anyway. Namely, even if |STFT(n, k)|^2 ≤ R_n^2 for all k, that is, x_0 = 0 (practically, these are points (n, k) with no signal), the convolution window width takes the zero value, and then the SM takes its marginal form, the SPEC [4, 9]. Execution of the second step is provided by setting the unit value instead of x_0 at the first respective inputs of the N/2-input multiplexors, so SignDep ≡ 1 in the second (SPEC completion) step.

Defining the control

The control logic for the MCI realization of the signal-dependent SM can set all but one of the control signals based solely on the SM enable code (SM en). The write control signal of the OutREG temporary register is the exception. To generate it, we need to AND together an SMStoreCond signal from the control unit with the SignDep control signal. The finite-state Moore machine that specifies the multicycle control is presented in Figure 5.

3. MULTICYCLE HARDWARE IMPLEMENTATION OF THE HIGHER-ORDER TFDS

Since the LWD is defined in the same manner as the SM (see the LWD definition (2) and the SM definition (1)), it may be realized by using the same hardware presented in Figures 1 and 2. For that purpose, the SM block of the proposed architecture and the second input of the output adder in the SM block must be shared (by introducing two-input multiplexors) for the realization of the LWD with L = 2, Figure 6. This must be done since only one subchannel of the SM block is used when the SM block realizes the LWD [25]. Namely, in that case the SM block always processes the real function SM(n, k). The function of the proposed hardware is determined by the SMorLWD control signal: the SM implementation and the LWD implementation are determined by the SMorLWD zero and unit values, respectively, see Figure 7. Note that the OutREG temporary register is used for saving the computed SM value when we need to use the SM block for the LWD implementation.

Then, the control logic defined in Section 2 must be expanded with the SMorLWD control signal. In the first Ld + 2 clock cycles, the system realizes SM(n, k). The calculated SM value, saved in the OutREG register, will be used in the next Ld + 1 clock cycles, when the LWD with L = 2 will be realized. It is done by asserting the SMorLWD control
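The signal-dependent summation for one time instant n can be sketched behaviorally as follows. This is an illustrative model: the function name, the circular indexing, and folding the x_{±i} test directly into the loop are assumptions; the hardware computes SignDep = x_i · x_{−i} with comparators, as described above.

```python
import numpy as np

def signal_dependent_sm(S, Ld_max, Q=10.0):
    """Signal-dependent SM for one time instant: the i-th term enters
    the sum only while both |STFT(n, k+i)|^2 and |STFT(n, k-i)|^2
    exceed Rn^2 = max_k SPEC(n, k) / Q^2; otherwise the summation stops."""
    N = len(S)
    spec = np.abs(S)**2
    rn2 = spec.max() / Q**2
    x = spec > rn2                       # the x_{+-i} indicator signals
    SM = np.empty(N)
    for k in range(N):
        SM[k] = spec[k]                  # i = 0 (SPEC) term always forwarded
        for i in range(1, Ld_max + 1):
            kp, km = (k + i) % N, (k - i) % N
            if not (x[kp] and x[km]):    # SignDep = x_i * x_{-i} = 0
                break                    # stop the summation outside a component
            SM[k] += 2 * np.real(S[kp] * np.conj(S[km]))
    return SM
```

For an isolated spectral peak, no cross products survive the threshold and the result stays at the SPEC, the marginal form mentioned above.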

Figure 4: MCI architecture for the signal-dependent S-method realization.

Figure 5: The finite-state machine control for the MCI design of the signal-dependent S-method from Figure 4. Veselin N. Ivanovicetal.´ 9

Figure 6: A complete hardware for one-channel simultaneous realization of the S-method/L-Wigner distribution.

signal. The finite-state machine control for this system is shown in Figure 7. If we repeat the last Ld + 1 steps from Figure 7 (i.e., steps Ld + 2 to 2Ld + 2), together with asserting the SMStore control signal in the (2Ld + 2)th step, the LWD with L = 4 is implemented by using the proposed architecture.

Here we do not analyze the finite register length influence on the accuracy of the results obtained by the proposed architecture. Its rigorous treatment may be found in [26]. Also, for the numerical illustration we refer the readers to the papers where the theoretical approach for the methods used in this paper is given [4, 9, 10, 12, 16].

4. PRACTICAL IMPLEMENTATION APPROACH

The architectures for the SM calculation from the STFT samples can be practically realized by using different technologies, such as PC- or DSP-based solutions running special software, or by applying specified chips in the form of ASICs or programmable devices (PDs). The first way is not so useful for real-time processing, since it is mostly based on the Von Neumann architecture, which significantly reduces the speed performance. Otherwise, a great degree of parallelism at high speed, as well as low power consumption, can be achieved with the chip-based solutions. Using FPGA chips instead of classical ASICs has numerous advantages, especially in prototype development. Some of them are: (i) reasonable cost for a small number of pieces, (ii) in-system programming (ISP) possibilities, (iii) availability of software design support provided by different development systems for Windows-based PCs and workstations, and (iv) the developed FPGA cores and schematic entries can be directly translated to the ASIC code. In contrast to the first families, present FPGAs offer not only a lot of logic cells, but also huge register
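Since (2) has exactly the shape of (1), one LWD step is the same frequency-domain convolution applied to SM values instead of STFT values, which is why the SM block can be reused. An illustrative sketch (hypothetical names; a real-valued SM input is assumed):

```python
import numpy as np

def lwd_step(prev, Ld):
    """One application of recursion (2): from LWD_{L/2}(n, k) values
    (for L = 2, from SM = LWD_1) to LWD_L(n, k) at one time instant."""
    N = len(prev)
    return np.array([sum(prev[(k + i) % N] * prev[(k - i) % N]
                         for i in range(-Ld, Ld + 1)) for k in range(N)])

sm_row = np.array([0.0, 1.0, 4.0, 1.0, 0.0, 0.0, 0.0, 0.0])
lwd2 = lwd_step(sm_row, 1)        # LWD with L = 2
lwd4 = lwd_step(lwd2, 1)          # repeating the step gives L = 4
```

Repeating the step doubles the order, mirroring the repetition of steps Ld + 2 to 2Ld + 2 described above.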

[Figure 7 state diagram omitted in this text rendering: after the Start state 0 (SignLoad = 1, SMStore = 0), states 1 through Ld + 1 step SelSTFT from 0 to Ld with SMorLWD = 0 (TFD codes 'SPEC', 'SM with Ld = 1', ..., 'SM with Ld'), while states 2Ld + 2 down to Ld + 2 repeat the same SelSTFT sequence with SMorLWD = 1 (TFD codes 'LWD with L = 2 and Ld = 1, 2, ..., Ld'). SHLorNo = 0 only in the SPEC step, and SMStore is asserted in the final step of each sequence.]

Figure 7: The finite-state machine control for the multicycle hardware implementation from Figure 6.

blocks and memory areas. These can be used to build powerful specialized parallel processing units, such as adders, multipliers, shifters, and so forth, in the form of schematic entry or VHDL code. The internal memory blocks (RAMs, ROMs, FIFOs, etc.) are usable for fast interconnection between parallel structures, as well as to generate the control signals and to configure the system.

In this section, both MCI and SCI architectures are implemented in FPGA chips. The MCI architecture was implemented following the approach proposed here, whereas the SCI one was implemented following the approach given in [17]. The design was carried out in Altera Max+plus II software. For the hardware realization, Altera's FLEX 10K chip family has been chosen. This family is fabricated in CMOS SRAM technology, running up to 100 MHz and consuming less than 0.5 mA at 5 V. It has a high density of 10,000 to 250,000 typical gates, up to 40,960 RAM bits, 2,048 bits per embedded array block

[Figure 8 block diagram omitted in this text rendering: the STFT samples STFT(n, k + Ld), ..., STFT(n, k − Ld), held in the shift memory buffer (ShMemBuff), feed the multiplexors MUX1 and MUX2, then the multiplier MULT, the shifter ShLEFT (controlled by SHLorNo), and the cumulative adder CumADD with output register OutREG producing SM(k). A control logic block (binary counter plus a RAM- or ROM-based LUT, addressed by the TFD code and loaded via the configuration signals from a PC or MC) generates SelSTFT 1, SelSTFT 2, SHLorNo, RESET, and SMStore/STFTLoad.]

Figure 8: Block diagram of FPGA implementation of the MCI approach.

(EAB), and so on. The computation units are realized using standard digital components in the form of schematic entries or by Altera hardware design language (AHDL)-based mega-functions (library of parameterized modules, LPM).

The proposed MCI and SCI architectures, implemented in FPGA technology, will be briefly described and compared against the usual criteria such as chip capacity, computation speed, power consumption, and cost.

4.1. Implementation of the MCI architecture

The FPGA-based implementation of the MCI architecture follows the design logic given in Figure 8. Since the real and imaginary computation lines are identical, the interpretation will be done through the real one. As seen, it consists of several functional blocks (units). The STFT sample is imported from the STFT module to the Shift Memory Buffer (ShMemBuff), which is implemented as an array of parallel-in-parallel-out registers. Their outputs represent the STFT samples in time order STFT(n, k + Ld), STFT(n, k + Ld − 1), ..., STFT(n, k), ..., STFT(n, k − Ld + 1), STFT(n, k − Ld), and on each SMStore/STFTLoad cycle they are shifted by one position. These are also fed to the inputs of the multiplexors MUX1 and MUX2 and, two by two, according to the multiplexor addresses SelSTFT 1 and SelSTFT 2, forwarded to the parallel multiplier MULT in order to produce a partial product term according to (3). This term is either shifted left or not, depending on the signal SHLorNo. The shift is performed by the shifter ShLEFT, the output of which is connected to the first input of the cumulative pipelined adder CumADD. The CumADD has been designed to replace an adder and a multiplexor (addressed by the AddSelB control signal) from Figures 1 and 2. The timing diagram of the calculation process is presented in Figure 9. As shown, the multiplying and shifting operations run in parallel, while the adding has a latency of one clock. After Ld + 1 clocks, the output of the CumADD contains the sum SM(n, k) that represents the final value of the SM. The next two cycles are used for the signals SMStore/STFTLoad and RESET, which store the sum SM(n, k) in the output register and reset the CumADD to zero, respectively. Use of the RESET signal increases the calculation time by one clock, meaning that the calculation process takes Ld + 3 cycles, one more than is elaborated in Figure 3. Note that the RESET signal can be generated from the signal SMStore/STFTLoad, using a short delay, which would reduce the calculation process to Ld + 2 cycles. In order to clarify the principle of calculation and simulation (the process of cumulative sums cumSM represented in Figure 11), we have used the first variant of RESET generation, with Ld + 3 clocks.

A look-up table (LUT), realized in the form of ROM or RAM memory, manages the computation process. As illustrated in Table 3, its memory location consists of the control

[Figure 9 waveforms omitted in this text rendering. Over clocks 1, 2, ..., Ld + 1 of the system clock CLK, after StoreSM(n, k − 1)/LoadSTFT(n, k + Ld), the cumulative adder builds:
Sum(0) = 0 + STFT(n, k) · STFT(n, k),
Sum(1) = Sum(0) + 2 · STFT(n, k + 1) · STFT(n, k − 1),
...
SM(n, k) = Sum(Ld − 1) + 2 · STFT(n, k + Ld) · STFT(n, k − Ld),
after which StoreSM(n, k)/Load STFT(n, k + Ld + 1) stores the result; the signals SMStore/STFTLoad, RESET, and SHLorNo are asserted as described in the text.]
Figure 9: The calculation timing diagram for the block diagram from Figure 8.
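The accumulation sequence of this timing diagram can be mirrored in software. The following Python sketch is purely illustrative (the function name mci_sm is ours, and real-valued STFT samples are assumed, so no complex conjugation appears); it steps through the same Ld + 1 clocks as the hardware.

```python
def mci_sm(stft_window, Ld):
    """Accumulate SM(n, k) over Ld + 1 clocks for one output sample.

    stft_window[i + Ld] holds STFT(n, k + i) for i in -Ld .. Ld;
    the step with i = 0 is the SPEC term."""
    def s(i):                        # STFT(n, k + i)
        return stft_window[i + Ld]

    cum_add = 0                      # CumADD register, cleared by RESET
    for step in range(Ld + 1):       # clocks 1 .. Ld + 1
        product = s(step) * s(-step) # MULT: MUX1 and MUX2 select the pair
        if step > 0:                 # SHLorNo = 1: shift left by one bit
            product <<= 1            # (doubles the lag terms)
        cum_add += product           # cumulative adder
    return cum_add                   # stored in OutREG on SMStore

# Centre sample 7 of the test vector {5, 6, 7, 8, 9, 0, 0, ...} with Ld = 3:
window = [0, 5, 6, 7, 8, 9, 0]       # STFT(n, k - 3) .. STFT(n, k + 3)
print(mci_sm(window, 3))             # 49 + 2*(8*6) + 2*(9*5) + 0 = 235
```

After the loop, two extra cycles (SMStore/STFTLoad and RESET) would store the result and clear the accumulator, giving the Ld + 3 cycles discussed in the text.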

Table 3: LUT’s values for given Ld. The ADD(STFT(n, k)) means the address location of the STFT(n, k) sample inside ShMemBuff, whereas = =  m CEIL(log2 N) Length(SelSTFT 1). Symbol “ ” denotes logical shift left operation. Note that signals SHLorNo, RESET and SM- Store/STFTLoad make control signals area.

LUT’s memory location SHLorNo RESET SMStore/STFTLoad SelSTFT 1 bits SelSTFT 2 bits 0 0 0 0 ADD(STFT(n, k))  m ADD(STFT(n, k)) 1 1 0 0 ADD(STFT(n, k +1)) m ADD(STFT(n, k − 1)) —100 — —

Ld 1 0 0 ADD(STFT(n, k + Ld))  m ADD(STFT(n, k − Ld))

Ld +1 0 0 1 0 0

Ld +2 0 1 0 0 0
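The LUT contents of Table 3 can also be generated programmatically. The Python sketch below is illustrative only: the name build_lut and the addr() mapping of sample offsets to ShMemBuff slots are our assumptions, since the paper does not spell out the buffer addressing.

```python
import math

def build_lut(Ld, N):
    """Generate the Ld + 3 control words of Table 3.

    Each word is (SHLorNo, RESET, SMStore_STFTLoad, sel), where sel packs
    the SelSTFT 1 and SelSTFT 2 fields as sel = addr1 << m | addr2 with
    m = CEIL(log2(N)) address bits per multiplexor field."""
    m = math.ceil(math.log2(N))

    def addr(i):
        # Hypothetical mapping of the offset i (i.e. STFT(n, k + i)) to a
        # ShMemBuff slot; the paper leaves the concrete mapping implicit.
        return (i + Ld) % N

    words = []
    for i in range(Ld + 1):                  # locations 0 .. Ld
        shl = 0 if i == 0 else 1             # the SPEC term is not shifted
        words.append((shl, 0, 0, addr(i) << m | addr(-i)))
    words.append((0, 0, 1, 0))               # location Ld + 1: store the SM
    words.append((0, 1, 0, 0))               # location Ld + 2: RESET
    return words
```

For Ld = 3 and N = 8 this yields six control words, with SHLorNo cleared only in the first (SPEC) step, matching the pattern of Table 3.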

signals area (which consists of the signals SHLorNo, RESET, and SMStore/STFTLoad, resp.) and the MUXs' addresses. The binary counter (see Figure 8) generates the low LUT addresses, while the TFDcode register sets the high ones. This means that the starting address of the running memory block is assigned to the corresponding value Ld stored in the TFDcode register. At the end of the sequence, the binary counter is cleared by the signal RESET. During system initialization, the memory contents and the value of the TFDcode register are automatically loaded from outside by using a PC or a general-purpose microcontroller. Of course, these parameters can be permanently stored using ROMs, EEPROMs, and FLASH memories instead of RAMs.

Figure 10 shows a schematic diagram for the SM calculation from the STFT samples (STFT to SM gateway) using the MCI approach. The control logic is realized using a ROM. The maximal register widths for each unit determine the capacity of the assigned chip. The critical point is the width of the CumADD. It is a function of both the STFT data length and the maximal possible convolution window width Ld max that can be implemented using the proposed architecture. Table 4 shows the relations between the minimum widths of the units and the parameters l (data length) and Ld max. In order to verify the chip operation before its programming, compilation and simulation have been performed using various test vectors. An example of simulation is shown in Figure 11.

[Figure 10 schematic omitted in this text rendering: ShMemBuff (eight 8-bit registers), MUX1/MUX2 (LPM MUX), MULT (LPM MULT), ShLEFT (LPM CLSHIFT), CumADD (LPM ADD_SUB), OutREG (DFF), and a control logic built from an LPM ROM and a 7493 binary counter.]

Figure 10: Schematic diagram of the 8-bit STFT to SM gateway implemented in FPGA using the MCI approach. It is implemented for Ld ≤ 3 and N = 8.

Table 4: Output register lengths of the digital units used, depending on the parameters l and Ld max.

Unit          | MUX1, MUX2 | MULT | ShLEFT  | CumADD and OutREG
Minimum width | l          | 2·l  | 2·l + 1 | CEIL(log2((2^(2l+1) − 1) · (Ld max + 1)))
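As a quick sanity check of these widths, they can be evaluated directly. The helper below is a hypothetical sketch (unsigned arithmetic assumed; the function name is ours).

```python
def unit_widths(l, Ld_max):
    """Minimum register widths per Table 4: MUX1/MUX2 pass l-bit samples,
    MULT produces a 2l-bit product, ShLEFT adds one bit, and CumADD/OutREG
    must hold the sum of Ld_max + 1 terms, each below 2**(2l + 1)."""
    mux = l
    mult = 2 * l
    shl = 2 * l + 1
    # bit_length equals CEIL(log2(x)) here, since the product always has
    # an odd factor greater than one and is never an exact power of two.
    cum = ((2 ** (2 * l + 1) - 1) * (Ld_max + 1)).bit_length()
    return mux, mult, shl, cum

print(unit_widths(8, 3))   # (8, 16, 17, 19)
```

For the 8-bit, Ld max = 3 configuration this gives a 19-bit accumulator bound; the 16-bit configuration needs 35 bits.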

[Simulation traces omitted in this text rendering. Panels (a) (0–20 us) and (b) (25–45 us) show the signals CLK, SM/Load STFT, RESET, SelSTFT[8..0], ShLorNo[0], STFT0[7..0], cumSM[19..0], and SM[19..0]; the cumulative sums cumSM build up as 25; 36, 106; 49, 145, 235; 64, 190, and the stored outputs SM appear as 25, 106, 235.]

Figure 11: Simulation illustration for the test vector V = {5, 6, 7, 8, 9, 0, 0, ...} and Ld = 3.
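The values in the Figure 11 traces can be cross-checked with a short Python model (ours, not the Max+plus II simulator): the S-method output for the real-valued test vector V = {5, 6, 7, 8, 9, 0, 0, ...} with Ld = 3, where samples outside the vector are zero, matching the initially cleared ShMemBuff.

```python
def sm_stream(samples, Ld, out_len):
    """Slide the SM computation over a real-valued test vector."""
    def stft(i):                       # STFT(n, i); zero outside the vector
        return samples[i] if 0 <= i < len(samples) else 0

    results = []
    for k in range(out_len):
        sm = stft(k) * stft(k)         # SPEC term
        for i in range(1, Ld + 1):     # lag terms, doubled by the shift left
            sm += 2 * stft(k + i) * stft(k - i)
        results.append(sm)
    return results

print(sm_stream([5, 6, 7, 8, 9], Ld=3, out_len=4))  # [25, 106, 235, 190]
```

The sequence 25, 106, 235 reproduces the SM[19..0] values stored in the traces, and 190 the final cumSM value shown.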

4.2. Implementation of the SCI architecture

As opposed to the MCI architecture, the SCI has no latency [17]. The arithmetic units are realized using combinational logic, meaning that all calculation operations are performed in parallel. The schematic diagram of its FPGA implementation is given in Figure 12. As seen, there is no need for input multiplexors or control signals such as SMStore/STFTLoad, SelSTFT 1, SelSTFT 2, RESET, and SHLorNo. Thus, the ROM-based generator is needless. At the rising edge of the system clock CLK, the STFT samples are shifted, and on the falling edge, the final result is stored in the output register OutREG, as shown in the simulation diagram given in Figure 13. One parallel multiplier and one shift register are used for each of the product terms from (3), except for the SPEC term, which has no shift register. These terms are added by using a cascade network of two-input parallel adders, giving the final sum SM[19..0]. The register widths are the same as in the case of MCI. It should be emphasized that the number of multipliers, shift registers, and adders drastically increases with the order of Ld. For example, for Ld = 3 we need 4 multipliers (MULT1–4), 3 shift registers (ShLEFT1–3), and 3 adders (ParADD1–3), Figure 12.

4.3. Comparison of MCI and SCI architectures

During the test phase we have implemented 8-bit and 16-bit computation configurations for both architectures, MCI and SCI. Different values of Ld have been considered. Having in mind the design symmetry, both the real and imaginary parts have been developed separately or together. Some implementation details for Ld = 3, N = 8, and selected real devices from the 10K and 20K families are summarized in Table 5. In order to generate visual conclusions, the dependence of used logical
[Figure 12 schematic omitted in this text rendering: ShMemBuff (eight 8-bit registers) feeding four parallel multipliers MULT1–4 (LPM MULT), three shifters ShLEFT1–3 (LPM CLSHIFT), a cascade of parallel adders ParADD1–3 (LPM ADD_SUB), and the output register OutREG producing SM[19..0].]

Figure 12: FPGA schematic diagram of the 8-bit SCI architecture for Ld = 3.

[Simulation trace omitted in this text rendering: for the same test vector (STFT0[7..0] stepping through 5, 6, 7, 8, 9, 0), the output SM[19..0] takes the values 25, 106, 235, 190 within successive clock cycles.]

Figure 13: Simulation diagrams for SCI architecture. The overall computation process is performed in one clock cycle.
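The single-cycle evaluation can be sketched as one combinational pass. The Python model below is ours (real-valued samples assumed); it mirrors the resource count quoted in the text, that is, Ld + 1 multipliers, Ld shifters, and Ld cascaded two-input adders, which for Ld = 3 gives 4, 3, and 3 units.

```python
def sci_sm(window, Ld):
    """Single-pass (combinational) SCI evaluation for real-valued samples.

    window[i + Ld] holds STFT(n, k + i) for i in -Ld .. Ld; no multiplexors
    or control signals are involved."""
    centre = Ld
    # Ld + 1 parallel multipliers (MULT1 .. MULT{Ld+1})
    products = [window[centre + i] * window[centre - i] for i in range(Ld + 1)]
    # Ld shifters; the SPEC term bypasses the shift stage
    shifted = [products[0]] + [p << 1 for p in products[1:]]
    sm = shifted[0]
    for term in shifted[1:]:   # cascade of Ld two-input parallel adders
        sm += term
    return sm

# Same centre sample as in the MCI traces (test vector 5, 6, 7, 8, 9; Ld = 3):
print(sci_sm([0, 5, 6, 7, 8, 9, 0], 3))  # 235
```

The result matches the multicycle MCI computation; only the scheduling of the operations differs.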

Table 5: Summarized implementation utilization for real devices, Ld = 3, N = 8, and data lengths l = 8 and l = 16.

Computation architecture | Total logic cells (LCs) | Total flip-flops used | Memory bits used | Total I/O pins used | Utilized LCs | Recommended device
Real 8-bit MCI           | 641   | 101 | 144 | 41  | 55%  | EPF10K20TC144-3
Real 8-bit SCI           | 1728  | 75  | 0   | 29  | 100% | EPF10K30RC208-3
Real 16-bit MCI          | 1772  | 197 | 144 | 69  | 76%  | EPF10K40RC208-3
Real 16-bit SCI          | 5498  | 147 | 0   | 57  | no fit in the largest 10K device, EPF10K100GC503-3 (4992 LCs) | 66% of an EP20K200
Real + Imag 8-bit MCI    | 1281  | 198 | 144 | 69  | 74%  | EPF10K30RC208-3
Real + Imag 8-bit SCI    | 3532  | 150 | 0   | 57  | 94%  | EPF10K70RC240-2
Real + Imag 16-bit MCI   | 3543  | 397 | 144 | 125 | 94%  | EPF10K70RC248-3
Real + Imag 16-bit SCI   | 11237 | 294 | 0   | 113 | no fit in the largest 10K device, EPF10K100GC503-3 (4992 LCs) | 67% of an EP20K400

devices (total logic cells (LCs)) as a function of Ld, for constant N = 16 and data length l = 8, is illustrated in Figure 14. As seen, the main advantages of the MCI architecture are as follows:

(i) for the same Ld, the MCI architecture needs significantly fewer LCs for its implementation. It is known that the capacity of a chip, that is, the silicon area, is directly proportional to the number of available LCs. Since the MCI architecture is structurally identical for different values of Ld, the number of LCs could only slightly increase with the increase of N. That is caused by the input span and address lengths of the multiplexors (MUX1 and MUX2 from Figure 10);
(ii) the reduced power consumption, which is strongly proportional to the chip capacity; and
(iii) less implementation cost (about 2-3 times).

An advantage of the SCI architecture is the processing speed, which is of importance for time-critical applications. Its number of LCs varies significantly with Ld (about 400–500 LCs per unit of Ld), which complicates the design and increases the implementation cost and power consumption.

After the simulation, the real FLEX 10K devices are configured at system power-up using Altera's UP2 development board with data from a ByteBlasterMV. A microcontroller emulated the STFT front end, while the calculated SM was collected and verified by a PC. Because reconfiguration requires less than 320 ms (in the case of using an external configuration EEPROM), real-time changes can be made during system operation.

5. CONCLUSION

A flexible system for TF signal analysis is proposed, and its MCI design is presented. The proposed architecture can be used for real-time implementation of some commonly used quadratic and higher-order TFDs. It allows a functional unit to be used more than once per TFD execution, as long as it is used on different clock cycles, and, consequently, enables a significant reduction of hardware complexity and cost. The major advantages of the proposed design are the ability to allow implemented TFDs to take different numbers of clock cycles and to share functional units within a TFD execution. Finally, the proposed architecture is practically verified by

its implementations in FPGA devices and compared with the SCI architecture against the usual criteria such as chip capacity, computation speed, power consumption, and cost.

[Figure 14 plot omitted in this text rendering: the total LCs used versus Ld (from 2 to 5, vertical axis up to 2500 LCs) for the MCI and SCI architectures, with the MCI curve remaining far below the SCI one.]

Figure 14: The dependence of the LCs used, assuming N = 16 and data length l = 8.

REFERENCES

[1] L. Cohen, "Time-frequency distributions—a review," Proceedings of the IEEE, vol. 77, no. 7, pp. 941–981, 1989.
[2] F. Hlawatsch and G. F. Boudreaux-Bartels, "Linear and quadratic time-frequency signal representations," IEEE Signal Processing Magazine, vol. 9, no. 2, pp. 21–67, 1992.
[3] L. Cohen, "Preface to the special issue on time-frequency analysis," Proceedings of the IEEE, vol. 84, no. 9, pp. 1197–1197, 1996.
[4] LJ. Stanković, "A method for time-frequency analysis," IEEE Transactions on Signal Processing, vol. 42, no. 1, pp. 225–229, 1994.
[5] B. Boashash and B. Ristic, "Polynomial time-frequency distributions and time-varying higher order spectra: application to the analysis of multicomponent FM signals and to the treatment of multiplicative noise," Signal Processing, vol. 67, no. 1, pp. 1–23, 1998.
[6] P. Goncalves and R. G. Baraniuk, "Pseudo affine Wigner distributions: definition and kernel formulation," IEEE Transactions on Signal Processing, vol. 46, no. 6, pp. 1505–1516, 1998.
[7] C. Richard, "Time-frequency-based detection using discrete-time discrete-frequency Wigner distributions," IEEE Transactions on Signal Processing, vol. 50, no. 9, pp. 2170–2176, 2002.
[8] L. L. Scharf and B. Friedlander, "Toeplitz and Hankel kernels for estimating time-varying spectra of discrete-time random processes," IEEE Transactions on Signal Processing, vol. 49, no. 1, pp. 179–189, 2001.
[9] LJ. Stanković, V. N. Ivanović, and Z. Petrović, "Unified approach to the noise analysis in the spectrogram and Wigner distribution," Annales des Telecommunications, vol. 51, no. 11-12, pp. 585–594, 1996.
[10] S. Stanković and LJ. Stanković, "An architecture for the realization of a system for time-frequency signal analysis," IEEE Transactions on Circuits and Systems—Part II: Analog and Digital Signal Processing, vol. 44, no. 7, pp. 600–604, 1997.
[11] LJ. Stanković and J. F. Böhme, "Time-frequency analysis of multiple resonances in combustion engine signals," Signal Processing, vol. 79, no. 1, pp. 15–28, 1999.
[12] LJ. Stanković, "A method for improved distribution concentration in the time-frequency analysis of multicomponent signals using the L-Wigner distribution," IEEE Transactions on Signal Processing, vol. 43, no. 5, pp. 1262–1268, 1995.
[13] K. J. R. Liu, "Novel parallel architectures for short-time Fourier transform," IEEE Transactions on Circuits and Systems—Part II: Analog and Digital Signal Processing, vol. 40, no. 12, pp. 786–790, 1993.
[14] M. G. Amin and K. D. Feng, "Short-time Fourier transforms using cascade filter structures," IEEE Transactions on Circuits and Systems—Part II: Analog and Digital Signal Processing, vol. 42, no. 10, pp. 631–641, 1995.
[15] B. Boashash and P. Black, "An efficient real-time implementation of the Wigner-Ville distribution," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 35, no. 11, pp. 1611–1618, 1987.
[16] D. Petranović, S. Stanković, and LJ. Stanković, "Special purpose hardware for time-frequency analysis," Electronics Letters, vol. 33, no. 6, pp. 464–466, 1997.
[17] S. Stanković, LJ. Stanković, V. N. Ivanović, and R. Stojanović, "An architecture for the VLSI design of systems for time-frequency analysis and time-varying filtering," Annales des Telecommunications, vol. 57, no. 9-10, pp. 974–995, 2002.
[18] K. Maharatna, A. S. Dhar, and S. Banerjee, "A VLSI array architecture for realization of DFT, DHT, DCT and DST," Signal Processing, vol. 81, no. 9, pp. 1813–1822, 2001.
[19] K. J. R. Liu and C.-T. Chiu, "Unified parallel lattice structures for time-recursive discrete cosine/sine/Hartley transforms," IEEE Transactions on Signal Processing, vol. 41, no. 3, pp. 1357–1377, 1993.
[20] A. Papoulis, Signal Analysis, McGraw-Hill, New York, NY, USA, 1977.
[21] A. V. Oppenheim and R. W. Schafer, Digital Signal Processing, Prentice-Hall, Englewood Cliffs, NJ, USA, 1975.
[22] M. G. Amin, "A new approach to recursive Fourier transform," Proceedings of the IEEE, vol. 75, no. 11, pp. 1537–1538, 1987.
[23] M. Unser, "Recursion in short-time signal analysis," Signal Processing, vol. 5, no. 3, pp. 229–240, 1983.
[24] M. G. Amin, "Spectral smoothing and recursion based on the nonstationarity of the autocorrelation function," IEEE Transactions on Signal Processing, vol. 39, no. 1, pp. 183–185, 1991.
[25] V. N. Ivanović and LJ. Stanković, "Multiple clock cycle real-time implementation of a system for time-frequency analysis," in Proceedings of 12th European Signal Processing Conference (EUSIPCO '04), pp. 1633–1636, Vienna, Austria, September 2004.
[26] V. N. Ivanović, LJ. Stanković, and D. Petranović, "Finite word-length effects in implementation of distributions for time-frequency signal analysis," IEEE Transactions on Signal Processing, vol. 46, no. 7, pp. 2035–2040, 1998.

Veselin N. Ivanović was born in Cetinje, Montenegro, April 10, 1970. He received the B.S. degree in electrical engineering (1993) and the M.S. degree in electrical engineering from the University of Montenegro (1996). He received the Ph.D. degree in electrical engineering from the same University (2001) in time-frequency signal analysis and architecture design for implementation of time-frequency methods and time-varying filtering. In 2001, he received the Siemens Award for scientific achievements in his Ph.D. research. Dr. Ivanović is an Assistant Professor (Docent) at the Electrical Engineering Department, University of Montenegro. He is also Vice-Dean at the Electrical Engineering Department, University of Montenegro. His research interests are in the areas of time-frequency signal analysis, hardware/software codesign, computer organization and design, and design with microcontrollers.

Radovan Stojanović was born in Berane, Montenegro, Yugoslavia, November 18, 1965. He received the B.S.E.E. and M.S.E.E. degrees from the University of Montenegro, and the Ph.D. degree from the University of Patras, Greece, in 1991, 1994, and 2001, respectively. From 1990 to 1998, he was at the Electrical Engineering Department, University of Montenegro. From 1998 to 2001, he was a Research Associate at the Department of Electrical Engineering and Computer Technology, University of Patras, Greece. After that, he spent two years as a Senior Researcher in the Industrial Systems Institute (ISI), Patras, Greece. Currently, he is an Assistant Professor at the University of Montenegro, guiding the group of applied electronics. His fields of interest are hardware/software codesign, applied signal and image processing, and industrial and medical electronics.

LJubiša Stanković was born in Montenegro, June 1, 1960. He received the B.S. degree in electrical engineering from the University of Montenegro, in 1982, with the honor "the best student at the University," the M.S. degree in electrical engineering, in 1984, from the University of Belgrade, and the Ph.D. degree in electrical engineering in 1988 from the University of Montenegro. As a Fulbright grantee, he spent the 1984/1985 academic year at the Worcester Polytechnic Institute, Massachusetts. Since 1982, he has been on the faculty at the University of Montenegro, where he now holds the position of Full Professor. Stanković was also active in politics, as a Vice-President of the Republic of Montenegro (1989–1991), and then the leader of the democratic (anti-war) opposition in Montenegro (1991–1993). During 1997/1998 and 1999, he was on leave at the Ruhr University Bochum, Germany, with the Signal Theory Group, supported by the Alexander von Humboldt foundation. At the beginning of 2001, he spent a period of time at the Technische Universiteit Eindhoven, the Netherlands, as a Visiting Professor. During the period of 2001–2002 he was the President of the Governing Board of the Montenegrin mobile phone company "MONET." His current interests are in signal processing and electromagnetic field theory. He has published about 270 technical papers, more than 80 of them in leading international journals, mainly IEEE editions. He has published several textbooks on signal processing (in Serbo-Croat) and the monograph Time-Frequency Signal Analysis (in English). For his scientific achievements, he was awarded the Highest State Award of the Republic of Montenegro in 1997. Professor Stanković is a Member of the IEEE Signal Processing Society's Technical Committee on Theory and Methods. He is an Associate Editor of the IEEE Transactions on Image Processing. He is a Member of the Yugoslav Engineering Academy, and a Member of the National Academy of Science and Art of Montenegro (CANU). Professor Stanković has been the Rector of the University of Montenegro since 2003.

Hindawi Publishing Corporation
EURASIP Journal on Applied Signal Processing
Volume 2006, Article ID 75032, Pages 1–13
DOI 10.1155/ASP/2006/75032

3D-SoftChip: A Novel Architecture for Next-Generation Adaptive Computing Systems

Chul Kim,1 Alex Rassau,1 Stefan Lachowicz,1 Mike Myung-Ok Lee,2 and Kamran Eshraghian3

1 Centre for Very High Speed Microelectronic Systems, Edith Cowan University, Joondalup, WA 6027, Australia 2 School of Information and Communication Engineering, Dongshin University, Naju, Chonnam 520714, South Korea 3 Eshraghian Laboratories Pty Ltd, Technology Park, Bentley, WA 6102, Australia

Received 1 October 2004; Revised 15 March 2005; Accepted 25 May 2005

This paper introduces a novel architecture for next-generation adaptive computing systems, which we term 3D-SoftChip. The 3D-SoftChip is a 3-dimensional (3D) vertically integrated adaptive computing system combining state-of-the-art processing and 3D interconnection technology. It comprises the vertical integration of two chips (a configurable array processor and an intelligent configurable switch) through an indium bump interconnection array (IBIA). The configurable array processor (CAP) is an array of heterogeneous processing elements (PEs), while the intelligent configurable switch (ICS) comprises a switch block, a 32-bit dedicated RISC processor for control, on-chip program/data memory, and a data frame buffer, along with a direct memory access (DMA) controller. This paper introduces the novel 3D-SoftChip architecture for real-time communication and multimedia signal processing as a next-generation computing system. The paper further describes the advanced HW/SW codesign and verification methodology, including high-level system modeling of the 3D-SoftChip using SystemC, being used to determine the optimum hardware specification in the early design stage.

Copyright © 2006 Hindawi Publishing Corporation. All rights reserved.

1. INTRODUCTION

System design is becoming increasingly challenging as the complexity of integrated circuits and the time-to-market pressures relentlessly increase. Adaptive computing is a critical technology to develop for future computing systems in order to resolve most of the problems that system designers are now faced with, due in no small part to its potential for wide applicability. Up until now, however, this concept has not been fully realized because of many technology constraints, such as chip real-estate limitations and software complexity. With the coupled advancement of semiconductor processing technology and software technology, however, adaptive computing is now facing a turning point. For instance, the reconfigurable computing concept has more recently started to receive considerable research attention [1–3], and this concept is now starting to move and expand into the realm of adaptive computing. Software-defined virtual hardware [4] and "do-it-all" devices [5] are good examples that demonstrate this development direction for computing systems. The major forthcoming impact from the deployment of adaptive computing is do-it-all devices. For example, a small handheld PDA-size device could assume the functionality of about 10 standard devices, simply depending on the context programs included, such as a cellular phone, a GPS receiver, an MP3 player, an e-book reader, a digital camera, a portable television, a satellite radio, a handheld gaming platform, and so forth. This concept also becomes increasingly important as there is a growing need for a single product to support multiple (and evolving) standards without re-engineering work.

Another growing problem in advanced computation systems, particularly for real-time communication or video processing applications, is the data bandwidth necessary to satisfy the processing requirements. The interconnection wire requirements in standard planar technology are increasing almost exponentially as feature sizes continue to shrink. A novel 3D integration system, such as a 3D system-on-chip (SoC) [6] or the 3D-SoftChip [7, 8], which is able to satisfy the severe demand for more computation throughput by effectively manipulating the functionality of hardware primitives through the vertical integration of two 2D chips, is another concept proposed for next-generation computing systems. This paper proposes the novel 3D-SoftChip architecture as a forthcoming giga-scaled integrated circuit computing system and shows an implemented example of a single PE using SystemC.

Figure 1 illustrates the physical architecture of the 3D-SoftChip comprising the vertical integration of two 2D chips. The upper chip is the intelligent configurable switch (ICS).


Figure 1: 3D-SoftChip physical architecture.


Figure 2: 3D-SoftChip: a novel 3D vertically integrated adaptive computing system-on-chip.

The lower chip is the configurable array processor (CAP). Interconnection between the two 2D chips is achieved via an array of indium bump interconnections. A 2D planar architecture of the 3D-SoftChip can be seen in Figure 2.

The rest of the paper is organized as follows. Section 2 introduces an overview of the 3D adaptive computing system. Section 3 provides overall explanations of the proposed 3D-SoftChip architecture and its distinctive features. Sections 4 and 5 introduce the detailed architecture of the CAP and ICS chips, respectively. The interconnection network structure is described in Section 6. Section 7 describes a suggested HW/SW codesign and verification of the 3D-SoftChip and shows an implemented example of a single PE using SystemC. Finally, conclusions are provided in Section 8.

2. 3D ADAPTIVE COMPUTING SYSTEM

2.1. 3D vertically integrated systems overview

During the past few years, there has been significant research into 3D vertically integrated systems. This is due to the ever-increasing wiring requirements, which are fast becoming the major bottleneck for future gigascale integrated systems [6, 9]. In very deep submicron silicon geometries the standard planar technology has many drawbacks in regards to performance, reliability, and so forth, caused entirely by limitations in the wiring. Moreover, the data bandwidth requirements for next-generation computing systems are becoming ever larger. To overcome these problems, the concept of the 3D-SoC, 3D-SoftChip has been developed, which exploits the vertical integration of 2D planar chips to effectively manipulate computation throughput. Previous work has shown that the 3D integration of systems has a number of benefits [10]. As described by Joyner et al. [10], 3D system integration offers a 3.9 times increase in wire-limited clock frequency, an 84% decrease in wire-limited area, or a 25% decrease in the number of metal levels required per stratum. There are three feasible 3D integration methods: a stacking of packages, a stacking of ICs, and a vertical system integration as was introduced by IMEC [9]. In this research, however, the focus is on the use of indium bump interconnection technology, as indium has good adhesion, a low contact resistance, and can be readily utilized to achieve an interconnect array with a pitch as low as 10 µm. The development of 3D integrated systems will allow improvements in packaging costs, performance, reliability, and a reduction in the size of the chips.

2.2. Adaptive computing system

A reconfigurable system is one that has reconfigurable hardware resources that can be adapted to the application currently under execution, thus providing the possibility to customize across multiple standards and/or applications. In most of the previous research in this area the concepts of reconfigurable and adaptive computing have been described interchangeably. In this paper, however, these two concepts will be more specifically described and differentiated. Adaptive computing will be treated as a more extended and advanced concept than reconfigurable computing. Adaptive computing will include more advanced software technology to effectively manipulate more advanced reconfigurable hardware resources in order to support fast and seamless execution across many applications. Table 1 shows the differences between reconfigurable computing and adaptive computing.

2.3. Previous work

3. 3D-SOFTCHIP ARCHITECTURE

3.1. Overall architecture of 3D-SoftChip

Figure 3 shows the overall architecture of the 3D-SoftChip. As can be seen, it is comprised of 4 unit chips. By including four separate unit chips in the architecture, sufficient flexibility is provided to allow multiple optimized task threads to be processed simultaneously. Given the primary target applications of multimedia processing and communications, four unit chips should be sufficient for all such requirements. Each unit chip has a PE array, a dedicated control processor, and a high-bandwidth data interface unit. According to a given application program, the PE array processes large amounts of data in parallel, while the ICS controls the overall system and directs the PE array execution, data, and address transfers within the system.

3.2. Features of 3D-SoftChip

The 3D-SoftChip has 4 distinctive features: various computation models, adaptive word-length configuration computation [7], an optimized system architecture for communication and multimedia signal processing, and dynamic reconfigurability for adaptive computing.

3.2.1. Computation algorithm: various computation models

As described before, one 32-bit RISC controller can supply control, data, and instruction addresses to 16 sets of PEs through the completely freely controllable switch block, so various computation models can be achieved, such as SISD, SIMD, MISD, and MIMD, as required. Enough flexibility is thus achieved for an adaptive computing system. Especially, in the single instruction multiple data (SIMD) computation model, 3 types of different SIMD computational models can be realized: massively parallel, multithreaded, and pipelined [19]. In the massively parallel SIMD computation model, each unit chip operates with the same global program memory. Every computation is processed in parallel, maximizing computational throughput.

2.2. Adaptive computing system The 3D-SoftChip has 4 distinctive features: various compu- tation models, adaptive word-length configuration computa- A reconfigurable system is one that has reconfigurable hard- tion [7], optimized system architecture for communication, ware resources that can be adapted to the application cur- and multimedia signal processing and dynamic reconfigura- rently under execution, thus providing the possibility to bility for adaptive computing. customize across multiple standards and/or applications. In most of the previous research in this area the concepts of re- 3.2.1. Computation algorithm: various configurable and adaptive computing have been described computation models interchangeably. In this paper, however, these two concepts will be more specifically described and differentiated. Adap- As described before, one 32-bit RISC controller can supply tive computing will be treated as a more extended and ad- control, data, and instruction addresses to 16 sets of PEs vanced concept of reconfigurable computing. Adaptive com- through the completely freely controllable switch block so puting will include more advanced software technology to various computation models can be achieved such asSISD, effectively manipulate more advanced reconfigurable hard- SIMD, MISD, and MIMD as required. Enough flexibility is ware resources in order to support fast and seamless exe- thus achieved for an adaptive computing system. Especially, cution across many applications. Table 1 shows the differ- in the single instruction multiple data (SIMD) computation ences between reconfigurable computing and adaptive com- model, 3 types of different SIMD computational models can puting. be realized, massively parallel, multithreaded, and pipelined [19]. In the massively parallel SIMD computation model, each unit chip operates with the same global program mem- 2.3. Previous work ory. Every computation is processed in parallel, maximiz- ing computational throughput. 
In the multithreaded SIMD Adaptive computing systems are mainly classified in terms computation model, the executed program instructions in of granularity, programmability, reconfigurability, computa- each unit chip can be different from the others so multi- tional methods, and target applications. The nature of recent threaded programs can be executed. The final one is the par- research work in this area according to these classifications, is allel SIMD computation model. In this case each unit chip shown in Table 2. This table shows that the early research and executes a different pipelined stage. Because of these SIMD development was into single linear array-type reconfigurable computation characteristics, the 3D-SoftChip can adaptively systems with single and static configuration but also shows maximize it’s computational throughput according to var- that this has evolved towards large adaptive SoCs with het- ious application requirements. These three computational erogeneous types of reconfigurable hardware resources and models are illustrated in Figure 4. with multiple and dynamic configurability. As illustrated in Table 2, the 3D-SoftChip architecture 3.2.2. Word-length configuration has several superiorities when compared with conventional reconfigurable/adaptive computing systems resulting from This is a key characteristic in order to classify the 3D- the 3D vertical interconnections and the use of state-of-the- SoftChip as an adaptive computing system. Each PE’s basic art adaptive computing technology (as will be described in processing word-length is 4 bits. This can, however, be con- the following sections). This makes it highly suitable for the figured up to 32 bits according to the application in the pro- next generation of adaptive computing systems. gram memory. Figure 5 illustrates the proposed word-length 4 EURASIP Journal on Applied Signal Processing

Table 1: Reconfigurable computing versus adaptive computing.

Hardware resources. Reconfigurable: linear array of homogeneous elements (logic gates, lookup tables). Adaptive: heterogeneous algorithmic elements (complete function units such as ALU, multiplier).
Configuration. Reconfigurable: static or dynamic configuration, slow reconfiguration time. Adaptive: dynamic, partial runtime reconfiguration.
Mapping methods. Reconfigurable: manual routing, conventional ASIC design tools (HDL). Adaptive: high-level language (SystemC, C).
Characteristics. Reconfigurable: large silicon area, low speed (high capacitance), high-power consumption, high cost. Adaptive: smaller silicon size, high speed, high performance, low-power consumption, low cost.

Table 2: Reconfigurable computing and adaptive computing systems. Each entry lists granularity/PE type; programmability; reconfigurability; computation method; target application.

PADDI [11]: coarse (16 bits); multiple; static; VLIW, SIMD; DSP applications.
MATRIX [12]: coarse (8 bits); multiple; dynamic; MIMD; general purpose.
RaPiD [13]: coarse (16 bits); single; mostly static; linear array; systolic arrays.
Remarc [3]: coarse (16 bits); multiple; static; SIMD; data-parallel.
RAW [14]: mixed; single; static; MIMD; general purpose.
PipeRench [1]: mixed (128 bits); multiple; dynamic; pipelined; data-parallel, DSP.
MorphoSys [2]: coarse (16 bits); multiple; dynamic; SIMD; data-parallel.
Triscend A7 [15]: mixed; multiple; dynamic; N/A; general purpose.
Motorola MRC6011 [16]: coarse (16 bits); multiple; dynamic; SIMD; computation-intensive applications.
QuickSilver Adapt2400 [17]: coarse (8, 16, 24, 32 bits); multiple; dynamic; heterogeneous nodes array; comm., multimedia DSP.
Elixent DFA100 [4]: coarse (4 bits); multiple; dynamic; linear D-fabric array; multimedia applications.
PicoChip PC102 [18]: coarse (16 bits); multiple; dynamic; 3-way LIW; wireless communications.
3D-SoftChip: coarse (4 bits); multiple; dynamic; various computation models; comm., multimedia signal processing.

configuration algorithm. When 2 PEs configure together, an 8-bit word-length system is created. If 4 PEs configure together this extends to 16 bits. And finally, when 8 PEs configure together, a full 32-bit word length is achieved. This flexibility is possible due to the configurable nature of the arithmetic primitives in the PEs [7, 20] and the completely freely controllable switch block architecture in the ICS chip.

3.2.3. Optimized system architecture for communication and multimedia signal processing

There are many similarities between communications and multimedia signal processing, such as data parallelism, low-precision data, and high computation rates. The distinguishing characteristics of communication signal processing are basically more data reorganization, such as matrix transposition, and potentially higher bit-level computation. To fulfill these signal processing demands, each unit chip contains two types of PE. One is a standard PE for generic ALU functions, which is optimized for bit-level computation. The other is a processing accelerator PE for DSP. In addition, special addressing modes to leverage the localized memory, along with 16 sets of loop buffers in the ICS, add to the specialized characteristics for optimized communication and multimedia signal processing.

3.2.4. Dynamic reconfigurability for adaptive computing

Every PE contains a small quantity of local embedded SRAM memory and additionally the ICS chip has an abundant memory capacity directly addressable from the PEs via the


Figure 3: Overall architecture for 3D-SoftChip.
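The hierarchy of Figure 3 (four unit chips, each pairing an ICS control layer with a CAP PE-array layer) can be captured as a plain data-structure sketch. All type and member names here are illustrative, not from the paper:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch of the Figure 3 hierarchy: the 3D-SoftChip comprises
// four unit chips, each stacking an ICS (control) chip on a CAP (PE-array)
// chip through the indium bump interconnect array.
struct ProcessingElement {           // one 4-bit PE on the CAP chip
    std::vector<uint8_t> localSram;  // small embedded local SRAM
};

struct CapChip {                     // configurable array processor layer
    std::array<ProcessingElement, 16> peArray;  // 16 PEs per unit chip
};

struct IcsChip {                     // intelligent configurable switch layer
    std::vector<uint32_t> programMemory;   // multiple program sets
    std::vector<uint8_t>  dataMemory;
    std::vector<uint8_t>  dataFrameBuffer; // two buffers in the real design
    // 32-bit ICS RISC and DMA controller omitted from this sketch
};

struct UnitChip {
    IcsChip ics;   // upper chip
    CapChip cap;   // lower chip, reached via the indium bump array
};

struct SoftChip3D {
    std::array<UnitChip, 4> unitChips;  // the four unit chips of Figure 3
};
```

The fixed sizes (4 unit chips, 16 PEs per array) follow the counts stated in Sections 3.1 and 3.2.1.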

indium bump interconnect array. Multiple sets of program memory, the abundant memory capacity, and the very high-bandwidth data interface unit make it possible to switch programs easily and seamlessly, even at runtime.

4. ARCHITECTURE OF CAP CHIP

The basic architecture of the CAP chip is a linear array of heterogeneous PEs. Figure 6 shows three possible architecture choices for the PEs. The architecture in Figure 6(b) is suggested as the most feasible architecture for the PE in the 3D-SoftChip because it has the optimum tradeoff between application-specific performance and flexibility. Examples of type A can be seen in [1, 2, 12, 14], type B in [17], and type C in [18]. The CAP chip has the basic role of the processing engine for the 3D-SoftChip. It manipulates large amounts of data at a high computational rate using any of the three different SIMD computation models previously described.

4.1. Two types of PEs

Figure 7 illustrates the two types of PE architecture chosen to optimize multimedia signal processing and communication type applications.

4.1.1. Standard PE

The S-PE is for standard ALU functions and is also optimized for bit-level operation for communication signal processing. It comprises 4 sets of 19-bit registers for S-PE instruction decoding; two multiplexers to select input operands from the data bus, adjacent PEs, or internal registers; a standard ALU with a bit-serial multiplier, adder, subtracter, and comparator; an embedded local SRAM; and 4 sets of registers. The arithmetic primitives are scalable so as to make it possible to reconfigure the word-length for specific tasks. The scalable arithmetic primitive architecture is presented in [7, 20]. Moreover, the S-PE can execute single-clock-cycle absolute value computation and comparison. Table 3 shows the functions of the S-PE. It is suitable for bit-wise manipulation and generic ALU functions.

4.1.2. Processing accelerator PE

The PA-PE is dedicated specifically for digital signal processing (DSP) operations. It consists of 4 sets of 19-bit registers for PA-PE instruction decoding; two multiplexers to select input operands from the data bus, adjacent PEs, or internal registers; a signed 4-bit scalable parallel/parallel multiplier; an accumulator/subtracter modified to enable MAC


and MAS operations within one clock cycle; an 8-bit configurable barrel shifter; an embedded local SRAM; and 4 sets of registers. Two shifters in a quad-PE can also be configured to produce a 16-bit barrel shifter. Its distinctive features are the single-clock-cycle MAC and MAS operations and the parallel/parallel multiplier to accelerate DSP operations.

Figure 4: Computation algorithm: 3 types of SIMD computation models. (a) Massively parallel SIMD computation model, (b) multithreaded SIMD computation model, and (c) pipelined SIMD computation model.
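The PA-PE's accumulate path can be sketched behaviourally. Per Table 4, MAC computes A × B + out(t) and MAS computes A × B − out(t), where out(t) is the current accumulator value; the signed 4-bit operand width and this C++ rendering are illustrative, not a cycle-accurate model of the hardware:

```cpp
#include <cassert>
#include <cstdint>

// Behavioural sketch of the PA-PE accumulator/subtracter (Table 4):
// MAC: acc <- A * B + out(t);  MAS: acc <- A * B - out(t).
// Operands are treated as signed values of the PE's basic 4-bit width.
struct PaPeAccumulator {
    int32_t acc = 0;  // out(t)

    int32_t mac(int8_t a, int8_t b) {             // multiply-accumulate
        acc = static_cast<int32_t>(a) * b + acc;
        return acc;
    }
    int32_t mas(int8_t a, int8_t b) {             // multiply-and-subtract
        acc = static_cast<int32_t>(a) * b - acc;
        return acc;
    }
};
```

In hardware both operations complete in a single clock cycle; here they are simply sequential calls.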


Figure 5: Word-length configuration algorithm. (a) 8-bit configuration, (b) 16-bit configuration, and (c) 32-bit configuration.
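One way to picture the word-length configuration of Figure 5 is as 4-bit PE adders chained through their carries: 2 PEs give an 8-bit result, 4 give 16 bits, and 8 give 32 bits. The carry-chain mechanism below is an assumption for illustration; the paper's actual scalable arithmetic primitives are described in [7, 20]:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch: each PE adds one 4-bit nibble, and 2, 4, or 8 PEs
// are chained through their carries to form an 8-, 16-, or 32-bit adder.
struct NibbleAdd { uint8_t sum; uint8_t carry; };

NibbleAdd peAdd4(uint8_t a, uint8_t b, uint8_t cin) {
    uint8_t s = static_cast<uint8_t>((a & 0xF) + (b & 0xF) + (cin & 1));
    return { static_cast<uint8_t>(s & 0xF), static_cast<uint8_t>(s >> 4) };
}

// Chain numPEs 4-bit PEs (2 -> 8-bit, 4 -> 16-bit, 8 -> 32-bit).
uint32_t chainedAdd(uint32_t a, uint32_t b, int numPEs) {
    uint32_t result = 0;
    uint8_t carry = 0;
    for (int i = 0; i < numPEs; ++i) {
        NibbleAdd r = peAdd4((a >> (4 * i)) & 0xF, (b >> (4 * i)) & 0xF, carry);
        result |= static_cast<uint32_t>(r.sum) << (4 * i);
        carry = r.carry;
    }
    return result;  // wraps modulo 2^(4*numPEs), like a fixed-width adder
}
```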


Figure 6: Types of PEs: (a) homogeneous type, (b) heterogeneous type, and (c) heterogeneous type with dedicated functions for special purposes.

Figure 7: Two types of PE: (a) standard PE and (b) processing accelerator PE.

Table 4 describes the PA-PE functions; the PA-PE is specialized for DSP operations such as MAC, MAS, logical shift, arithmetic shift, rotate, and absolute value computation.

4.1.3. PE instruction format and operation modes

The PE instruction format consists of a 19-bit instruction word. The MSB 2 bits (WS_en/RS_en, WR_en/RR_en) are used for the read/write enable bits of the embedded SRAM and registers. Bits 16 to 10 are used for SRAM and register selection (addressing). Bit 9 is used as the data output register enable signal, and bits 8 to 6 are used to specify the PE operation. Finally, bits 5 to 0 are used to control the input multiplexers for input operand selection. This format is illustrated in Figure 8.

Figure 9 illustrates 3 types of PE operation modes that can be realized on the PE array: horizontal mode, vertical mode, and circular mode. These allow for even greater

flexibility and help to maximize computational throughput according to the target application.

Table 3: Standard PE functions (function: mnemonic).
A and B: AND
A or B: OR
A xor B: XOR
A + B: ADD
A − B: SUB
A × B: SPMUL
A comp B: COMP
|A| (absolute value): ABS

Table 4: Processing accelerator PE functions (function: mnemonic).
A × B: PAMUL
A × B + out(t): MAC
A × B − out(t): MAS
Logical shift left: LSL
Logical shift right: LSR
Arithmetic shift right: ASR
Rotate: ROR
|A| (absolute value): ABS

4.2. Embedded local SRAM

Each PE has a local embedded SRAM. The effective memory bandwidth is, therefore, increased dramatically, by as much as the total number of PEs, which results in an increase in effective processing speed in many applications and allows for rapid dynamic context switching. Bus traffic can also be reduced because many data transmission operations can be contained within a PE. Consequently, power dissipation will also be minimized.

4.3. Quad-PE

As previously described, one quad-PE consists of two pairs of PEs (two S-PEs and two PA-PEs). The quad-PE is controlled and configured by the switch block according to the control and address data from the ICS transmitted through the IBIA. Figure 10 shows the architecture of a single quad-PE.

5. ARCHITECTURE OF ICS CHIP

The ICS chip comprises the switch blocks, the ICS RISC, program memory, data memory, data frame buffers, and a DMA controller, as illustrated in Figure 11. The ICS chip is a control processor which controls the CAP chip via the IBIA as well as the overall system. The ICS RISC provides control and address signals and data to the system as a whole. The switch blocks configure each PE based on the current program instruction. The high-bandwidth data interface unit enables efficient transmission of data and instructions within the system.

5.1. Switch block

The switch block provides data from/to each PE and also provides instruction data to each PE. Three types of switch blocks, 6-sided, 7-sided, and 8-sided, provide optimized interconnections within the ICS chip. A pass-transistor design is used to optimize performance and minimize area, allowing a completely free configuration for each PE.

5.2. ICS RISC

The ICS RISC is a 32-bit dedicated RISC control processor. It controls the execution of the PE array and provides control and address signals to program/data memory, the data frame buffers, and the DMA controller. It has a 3-stage pipelined architecture comprising instruction fetch (F), decode (D), and execute (E) stages. To cope with the iterative nature of DSP arithmetic, it also has 16 sets of loop buffers so as to provide instructions directly to the instruction decoder instead of fetching from program memory in each case. This significantly reduces bus utilization, allowing for improved performance and lower-power dissipation. Moreover, 32 general-purpose registers and specialized addressing modes are provided for optimized communication and multimedia signal processing.

5.3. High-bandwidth data interface unit

The high-bandwidth data interface unit allows the efficient transfer of data within the 3D-SoftChip. Two sets of data frame buffers and the DMA controller make it easy to transfer large amounts of data. Multiple sets of program memory support runtime program switching and, because of this dynamic reconfiguration feature, adaptive computing is possible. The data memory has a variable word width so it can easily be combined to build wider/deeper memories and thus increase flexibility for different application programs and multiple word-length computations.

6. INTERCONNECTION NETWORK

The interconnection network of the 3D-SoftChip can be broken down into three hierarchical levels. The inter-PE bus between PEs in the CAP chip is the first level. This local interconnection network has a 2D-mesh architecture providing nearest-neighbour interconnects between the PEs. The second level of the interconnection network is the switch block array interconnection. This supports longer interconnections on the ICS chip but also has a basic 2D-mesh architecture. The last hierarchical level of interconnection is the indium bump interconnect array (IBIA). With the progression of technology to ever decreasing semiconductor geometry scales, the prediction of interconnection delay, as well as its impact on total system delay, becomes a crucial factor, introducing a major limitation in overall system performance. To overcome these problems, 3D interconnection technology using an array of indium bumps becomes very attractive because


Figure 8: PE instruction formats.
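The 19-bit field layout described in Section 4.1.3 can be sketched as a simple decoder. The split of bits 5 to 0 into MUX B (bits 5 to 3) and MUX A (bits 2 to 0) follows the field boundaries sketched in Figure 8; field and struct names are illustrative:

```cpp
#include <cassert>
#include <cstdint>

// Decoder sketch for the 19-bit PE instruction word: bits 18-17 are the
// SRAM/register write and read enables, bits 16-10 select the SRAM
// location and register, bit 9 enables the data output register,
// bits 8-6 give the PE operation, and bits 5-0 drive the two input
// multiplexers (MUX B = bits 5-3, MUX A = bits 2-0).
struct PeInstruction {
    uint8_t wsEn;       // bit 18: write enable (WS_en/RS_en)
    uint8_t wrEn;       // bit 17: read enable (WR_en/RR_en)
    uint8_t sramRegSel; // bits 16-10: SRAM and register selection
    uint8_t doutEn;     // bit 9: data output register enable
    uint8_t op;         // bits 8-6: PE operation
    uint8_t muxB;       // bits 5-3: input operand select, MUX B
    uint8_t muxA;       // bits 2-0: input operand select, MUX A
};

PeInstruction decodePeInstruction(uint32_t word) {
    PeInstruction i;
    i.wsEn       = (word >> 18) & 0x1;
    i.wrEn       = (word >> 17) & 0x1;
    i.sramRegSel = (word >> 10) & 0x7F;
    i.doutEn     = (word >> 9)  & 0x1;
    i.op         = (word >> 6)  & 0x7;
    i.muxB       = (word >> 3)  & 0x7;
    i.muxA       = word & 0x7;
    return i;
}
```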


Figure 9: PE operation modes: (a) horizontal mode, (b) vertical mode, and (c) circular mode.
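The paper names the three operation modes of Figure 9 but does not spell out their data paths, so the following is only one plausible reading, offered as a sketch: the order in which the 16 PEs of a unit chip (numbered 0 to 15 as in Figure 9) pass data along in each mode. The "circular" interpretation as a perimeter ring is an assumption:

```cpp
#include <cassert>
#include <vector>

// Hypothetical PE chaining orders for the 4x4 array of Figure 9:
// 'h' = horizontal mode (row-major), 'v' = vertical mode (column-major),
// anything else = circular mode (assumed here to walk the array perimeter).
std::vector<int> peChain(char mode) {
    std::vector<int> order;
    if (mode == 'h') {                       // horizontal: along each row
        for (int pe = 0; pe < 16; ++pe) order.push_back(pe);
    } else if (mode == 'v') {                // vertical: down each column
        for (int col = 0; col < 4; ++col)
            for (int row = 0; row < 4; ++row) order.push_back(row * 4 + col);
    } else {                                 // circular: perimeter ring
        int ring[] = {0, 1, 2, 3, 7, 11, 15, 14, 13, 12, 8, 4};
        order.assign(ring, ring + 12);
    }
    return order;
}
```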

it supports a very high bandwidth coupled with a very low inductance/capacitance (and thus low-power dissipation) [8]. However, any other equivalent 3D interconnection technology could also be applied to realize this interconnection level within the 3D-SoftChip architecture.

6.1. Indium bump interconnection

Indium is an excellent material to use as an interconnect due to its excellent adhesion to most metals, including aluminum, which is the metallization for the pads used in most VLSI technologies. Indium has a low melting point, which implies a low work-hardening coefficient, allowing for direct bonding on processed VLSI wafers. Additionally, it provides excellent mechanical as well as electrical connectivity (contact resistance < 1 mΩ per bump). Reflow techniques can be used for flexibility and to increase the bump height-to-width ratio as needed. Such techniques can also be used to incorporate self-alignment features into the bonding process. Figure 12(a) illustrates a cut-away view of the flip-chip indium bump interconnection; a micrograph of a single indium bump after reflow can be seen in Figure 12(b).

7. HW/SW CODESIGN AND VERIFICATION METHODOLOGY

Figure 13 shows the HW/SW codesign methodology for the 3D-SoftChip. HW/SW partitioning is being executed to determine which functions should be implemented in hardware and which in software. The HW is currently being modeled at a system level using SystemC [21, 22] to verify functionality and to explore various architecture configurations, while the software is concurrently modeled in C. After that, a cosimulation and verification process will be implemented to verify the operation and performance of the 3D-SoftChip architecture and to decide on an optimal HW/SW architecture. More specifically, the SW will be a modified GNU C compiler and assembler. After the compiler and assembler for the ICS RISC have been finalized, a program implementing the MPEG-4 motion estimation algorithm will be developed and compiled using them. After that, object code can be produced, which can be used directly as the input stimulus for an instruction set simulator and system-level simulation. The HW/SW verification process can be achieved through comparison between the results from instruction-level simulation and system-level simulation. From this point on, the rest of the procedure can be processed using any conventional HW design methodology, such as full- and semicustom design.

7.1. System level modeling of single PE

Figure 14 shows the single standard PE block diagram, the file structure of the SystemC model, and the output waveform of the system-level modeled standard PE.
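The paper's SystemC model of the standard PE splits into an instruction decoder, ALU, SRAM, and testbench. As a self-contained companion, here is a stripped-down plain-C++ behavioural sketch of just the ALU, implementing the Table 3 function set; mnemonics follow Table 3, while the comparison result encoding (-1/0/1) is an assumption:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Behavioural sketch of the S-PE ALU (Table 3 function set). In the paper
// this is one SystemC module among several (decoder, SRAM, testbench);
// plain C++ is used here so the sketch stands alone.
int32_t standardPeAlu(const std::string& op, int32_t a, int32_t b) {
    if (op == "AND")   return a & b;
    if (op == "OR")    return a | b;
    if (op == "XOR")   return a ^ b;
    if (op == "ADD")   return a + b;
    if (op == "SUB")   return a - b;
    if (op == "SPMUL") return a * b;  // realized as a bit-serial multiplier in HW
    if (op == "COMP")  return a < b ? -1 : (a == b ? 0 : 1);  // assumed encoding
    if (op == "ABS")   return a < 0 ? -a : a;  // single-cycle absolute value
    return 0;  // undefined opcode
}
```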


Figure 10: Quad PE.


Figure 11: Architecture of ICS RISC.
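The 16-entry loop buffer of the ICS RISC (Section 5.2) can be sketched as follows. The fill/replay policy shown is illustrative, not taken from the paper; only the 16 x 32-bit capacity comes from the text:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of the ICS RISC loop-buffer idea: once a short loop body is
// captured in the 16 x 32-bit buffer, subsequent iterations are fed to the
// decoder from the buffer instead of program memory, cutting bus traffic.
class LoopBuffer {
    std::array<uint32_t, 16> buf_{};
    std::size_t len_ = 0;
public:
    // Capture a loop body of up to 16 instruction words.
    bool fill(const std::vector<uint32_t>& body) {
        if (body.empty() || body.size() > buf_.size()) return false;
        for (std::size_t i = 0; i < body.size(); ++i) buf_[i] = body[i];
        len_ = body.size();
        return true;
    }
    // Instruction for an iteration-relative program counter.
    uint32_t fetch(std::size_t pc) const { return buf_[pc % len_]; }
    std::size_t size() const { return len_; }
};
```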


Figure 12: (a) 3D flip-chip indium bump interconnection and (b) indium bump interconnection: single indium bump after reflow.


Figure 13: Suggested HW/SW codesign verification methodology.


Figure 14: System level modeling of single PE: (a) standard PE block diagram, (b) file structure of standard PE, and (c) the output waveform of system-level modeled standard PE.

8. CONCLUSIONS REFERENCES

8. CONCLUSIONS

A novel 3D vertically integrated adaptive computing system architecture for communication and multimedia signal processing has been presented, along with a system-level modeling example of a single PE. The described system leverages the very high-bandwidth connection between two chips, realizable through the indium bump interconnect array, to combine high-level ICS and low-level CAP processing engines to create a next-generation adaptive computing system. The system architecture of the 3D-SoftChip is currently being fully modeled in SystemC in order to determine the optimal hardware architecture. The SW design is being concurrently finalized so that the novel concept of an adaptive system-on-chip computing system can be realized.

REFERENCES

[1] S. C. Goldstein, H. Schmit, M. Budiu, S. Cadambi, M. Moe, and R. R. Taylor, "PipeRench: a reconfigurable architecture and compiler," IEEE Computer, vol. 33, no. 4, pp. 70-77, 2000.
[2] H. Singh, M.-H. Lee, G. Lu, F. J. Kurdahi, N. Bagherzadeh, and E. M. Chaves Filho, "MorphoSys: an integrated reconfigurable system for data-parallel and computation-intensive applications," IEEE Transactions on Computers, vol. 49, no. 5, pp. 465-481, 2000.
[3] T. Miyamori and K. Olukotun, "REMARC: reconfigurable multimedia array coprocessor," in Proceedings of ACM/SIGDA 6th International Symposium on Field Programmable Gate Arrays (FPGA '98), p. 261, Monterey, Calif, USA, February 1998.
[4] Elixent Limited, "The Reconfigurable Algorithm Processor," http://www.elixent.com/products/white papers.htm.

[5] N. Tredennick and B. Shimamoto, "Special Report: do-it-all devices," IEEE Spectrum, pp. 37-40, December 2003.
[6] J. W. Joyner, P. Zarkesh-Ha, and J. D. Meindl, "Global interconnect design in a three-dimensional system-on-a-chip," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 4, pp. 367-372, 2004.
[7] S. Eshraghian, S. Lachowicz, and K. Eshraghian, "3-D vertically integrated configurable soft-chip with terabit computational bandwidth for image and data processing," in Proceedings of 10th International Conference on Mixed Design of Integrated Circuits and Systems (MIXDES '03), pp. 143-148, Lodz, Poland, June 2003.
[8] A. Rassau, G. Alagoda, A. Ehrhardt, S. Lachowicz, and K. Eshraghian, "Design methodology for a 3D softchip video processing architecture," in Proceedings of 6th World Multiconference on Systemics, Cybernetics and Informatics (SCI '02), pp. 324-329, Orlando, Fla, USA, July 2002.
[9] IZM, "3D System Integration," http://www.pb.izm.fhg.de/izm/015 Programms/010 R/.
[10] J. W. Joyner, R. Venkatesan, P. Zarkesh-Ha, J. A. Davis, and J. D. Meindl, "Impact of three-dimensional architectures on interconnects in gigascale integration," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 9, no. 6, pp. 922-928, 2001.
[11] D. Chen and J. Rabaey, "PADDI: programmable arithmetic devices for digital signal processing," in Proceedings of IEEE Workshop on VLSI Signal Processing, pp. 240-249, IEEE Press, San Diego, Calif, USA, November 1990.
[12] E. Mirsky and A. DeHon, "MATRIX: a reconfigurable computing architecture with configurable instruction distribution and deployable resources," in Proceedings of IEEE Symposium on FPGAs for Custom Computing Machines, pp. 157-166, Napa Valley, Calif, USA, April 1996.
[13] D. C. Cronquist, C. Fisher, M. Figueroa, P. Franklin, and C. Ebeling, "Architecture design of reconfigurable pipelined datapaths," in Proceedings of 20th Anniversary Conference on Advanced Research in VLSI (ARVLSI '99), pp. 23-40, Atlanta, Ga, USA, March 1999.
[14] E. Waingold, M. Taylor, D. Srikrishna, et al., "Baring it all to software: raw machines," IEEE Computer, vol. 30, no. 9, pp. 86-93, 1997.
[15] Triscend Corporation, "Triscend A7S Configurable System-on-Chip Platforms," http://www.triscend.com.
[16] Motorola Incorporation, "MRC6100: Reconfigurable Compute Fabric (RCF) device," http://www.motorola.com/semiconductors/.
[17] QuickSilver Technology Incorporation, "Adapt2400 ACM Architecture Overview".
[18] picoChip Designs Limited, "PC102 Product Brief," http://www.picochip.com.
[19] L. Guangming, Modeling, implementation and scalability of the MorphoSys dynamically reconfigurable computing architecture, Ph.D. thesis, Electrical and Computer Engineering Department, University of California, Irvine, Calif, USA, 2000.
[20] S. Eshraghian, "Implementation of arithmetic primitives using truly deep submicron technology (TDST)," M.S. thesis, Edith Cowan University, Perth, Australia, 2004.
[21] Open SystemC Initiative, "The Functional Specification for SystemC 2.0," http://www.systemc.org/.
[22] Open SystemC Initiative, "SystemC 2.0.1 Language Reference Manual Rev 1.0," http://www.systemc.org/.

Chul Kim received the B.S. degree in electric engineering from Sunchon National University, Korea, in 2003. He is currently pursuing his Masters degree at the Center for Very High Speed Microelectronic Systems, Edith Cowan University, Perth, Australia. His research interests include 3D adaptive computing systems and platform-based SoC design for communication and multimedia signal processing.

Alex Rassau received a Ph.D. degree in microelectronics from the University of Reading, Reading, England, in 2000. He joined the Centre for Very High Speed Microelectronic Systems at Edith Cowan University in 2000, and his current research interests include new adaptive computing architectures and microphotonic systems.

Stefan Lachowicz received M.Eng.Sc. and Ph.D. degrees from the Technical University of Lodz, Poland, in 1982 and 1987, respectively. In 1993 he joined Edith Cowan University as a Senior Lecturer in engineering at the School of Engineering and Mathematics and the Deputy Director of The National Networked Teletesting Facility for Integrated Systems (NNTTF). His research interests include CMOS imagers, reconfigurable architectures, and design for test.

Mike Myung-Ok Lee received B.S., MNS, and Ph.D. degrees from Arizona State University, Tempe, USA, in 1983, 1987, and 1988, respectively. He is a Professor in the School of Information and Communication Engineering, Dongshin University, Korea, and his current research interests include high-speed intelligent network design, multimedia Optic-VLSI/ULSI design, telecommunication engineering, and nanobio-medical engineering.

Kamran Eshraghian received B.Tech., M.Eng.Sc., and Ph.D. degrees from the University of Adelaide, South Australia. In 1979 he joined the Department of Electrical & Electronic Engineering at the University of Adelaide after spending 10 years with Philips Research both in Europe and Australia. He has held a number of visiting academic posts, including Professor of Computer Science at Duke University, N.C., USA; Visiting Professor of Microelectronics and Computer Systems at EPFL, Lausanne, Switzerland; and Visiting Professor of Computer Technology at the University of Las Palmas and at the University of Ulm in Germany. In 1987 he founded the Centre for Gallium Arsenide VLSI Technology at the University of Adelaide and was appointed as its Director. In July 1994 he was invited to take up the Foundation Chair of Computer, Electronics, and Communication Engineering at Edith Cowan University to lead the newly established Department of Engineering. He has coauthored 5 textbooks and served as the Editor of the Silicon Systems Engineering series published by Prentice Hall. In 2004, he founded Eshraghian Laboratories as part of his vision for the horizontal integration of nanochemistry and nanoelectronics with those of bio- and photon-based technologies, thus creating a new platform for future research and development.

Hindawi Publishing Corporation, EURASIP Journal on Applied Signal Processing, Volume 2006, Article ID 79595, Pages 1-9, DOI 10.1155/ASP/2006/79595.

Highly Flexible Multimode Digital Signal Processing Systems Using Adaptable Components and Controllers

Vinu Vijay Kumar and John Lach

Charles L. Brown Department of Electrical and Computer Engineering, University of Virginia, Charlottesville, VA 22904, USA

Received 1 October 2004; Revised 21 March 2005; Accepted 25 May 2005

Multimode systems have emerged as an area- and power-efficient platform for implementing multiple timewise mutually exclusive digital signal processing (DSP) applications in a single hardware space. This paper presents a design methodology for integrating flexible components and controllers into primarily fixed logic multimode DSP systems, thereby increasing their overall efficiency and implementation capabilities. The components are built using a technique called small-scale reconfigurability (SSR) that provides the necessary flexibility for both intermode and intramode reconfiguration, without the penalties associated with general-purpose reconfigurable logic. Using this methodology, area and power consumption are reduced beyond what is provided by current multimode systems, without sacrificing performance. The results show an average of 7% reduction in datapath component area, 26% reduction in register area, 36% reduction in interconnect MUX cost, and 68% reduction in the number of controller signals, with an average 38% increase in component utilization for a set of benchmark 32-bit DSP applications.

Copyright © 2006 Hindawi Publishing Corporation. All rights reserved.

1. INTRODUCTION

The burgeoning demand for high-performance DSP systems has spurred widespread research on efficient platforms for implementing the arithmetic-intensive applications characteristic of such systems. Based on these applications' high throughput requirements, fixed logic application-specific integrated circuits (ASICs) are normally the platform of choice. However, their lack of flexibility is disadvantageous in today's world of disparate and rapidly evolving standards and applications, which require the execution of a variety of DSP tasks. In the absence of flexibility, direct hardware implementation of all of the tasks is the only option and can be prohibitively expensive, even in this "transistors for free" era. This has led to the search for new methods for adding flexibility to otherwise fixed logic DSP circuits, without having to pay the large performance, area, and power penalties associated with field programmable gate arrays (FPGAs), DSP processors, or even application-specific instruction processors (ASIPs).

An emerging platform that has been proposed to address the flexibility issue in ASICs for DSP is "multimode" systems [1, 2]. Tasks that are timewise mutually exclusive are synthesized to the same hardware area, allowing the tasks to be separated temporally rather than spatially. When a particular task must be executed, the system switches to the appropriate hardware configuration "mode." Such a design platform can prove useful for many DSP systems. For example, a system jointly implementing two different standards (e.g., CDMA/GSM formats in a cell phone, different region DVD formats in a universal DVD player, etc.), where only one mode needs to be active at any given time, can benefit from a multimode implementation.

However, current multimode systems are severely constrained in their capabilities and efficiency due to their limited reconfigurability. Reconfiguration between modes is accomplished by changing only the dataflow between components; the datapath components themselves do not change. The individual controllers for each mode are composed together into a single controller that also does not change between modes. Hence, such a system is inefficient and not very powerful, with only the interconnections changing between modes.

In this paper, we present a multimode DSP system design and synthesis technique that provides greater implementation flexibility by enabling reconfigurability in controllers and datapath components, as well as in the interconnections. In addition, reconfiguration may be performed not only between modes but also within a single mode. This technique provides improved results in terms of resource requirements and/or task latency and power consumption (via increased component utilization) when compared to existing multimode synthesis techniques, enabling more powerful DSP applications. While these improvements would likely be lost if general-purpose reconfigurable devices (e.g., FPGAs) were

Figure 1: Sample DFGs. (a) DFG 1a: multiplication nodes v1 and v2 feed addition node v3. (b) DFG 1b: multiplication node v4 and addition node v5 feed addition node v6.

used to provide the hardware flexibility, the technique presented here uses small-scale reconfigurability (SSR) [3, 4]. SSR provides hardware flexibility without the area, delay, power, and reconfiguration time penalties associated with general-purpose reconfigurable fabric.

1.1. Illustrative example

Consider the two dataflow graphs (DFGs) in Figure 1, each representing a different mode of a DSP system. The datapath and controller designs must be capable of implementing either mode.

1.1.1. Datapath design

Assuming the system is not pipelined and specifications dictate that modes have minimum latency, DFG 1a (Figure 1(a)) would require two multipliers (MULTs) and one adder (ADD), and DFG 1b (Figure 1(b)) would require one MULT and one ADD. If both tasks were implemented spatially separately, the total number of arithmetic components would be three MULTs and two ADDs.

However, if the tasks are timewise mutually exclusive, they may be implemented using existing multimode techniques in which components may be shared by both tasks, but must remain fixed. Such an implementation would require only two MULTs and one ADD, and the proper mode would be invoked based on the task to be implemented. While the tasks share datapath components, they require separate controllers. In addition, interconnect complexity is high, as component interconnections have to support the dataflow in both product instances. For example, the adder component producing the overall output (mapped to nodes v3 or v6) gets its inputs either from the two multipliers (for DFG 1a) or from a multiplier and itself (for DFG 1b), requiring a MUX to be added at the inputs that was unnecessary in the separate single-mode implementations.

Now consider a flexible arithmetic component (FAC) capable of performing both addition and multiplication that is as fast as a multiplier but is smaller than the combined areas of an adder and multiplier (although larger than each individually). For example, the "morphable multiplier" uses the adder chains within the multiplier to perform addition with minimal area and no delay penalties [5]. The flexibility of a FAC is such that not only can it be a multiplier in one mode and an adder in another (intermode reconfiguration), but it could also be an adder and a multiplier in different control steps (c-steps) within the same mode (intramode reconfiguration).

Using FACs and intramode reconfiguration, DFG 1a can be implemented with one MULT and one FAC, with the FAC switching between a MULT (to execute v2 in c-step 1) and an ADD (to execute v3 in c-step 2). If DFG 1b were synthesized independently, the technique would allocate one MULT and one ADD. But given that the two DFGs are to be implemented in the same physical multimode space, they can be synthesized together, resulting in one MULT and one FAC. Given the assumption that one FAC is smaller than the combined area of an adder and a multiplier, area savings are achieved over existing limited reconfigurable multimode synthesis techniques. Component utilization is also increased, reducing wasted power consumption. Note that if only intermode reconfiguration is utilized (with the FAC functionality being fixed within a mode through all c-steps), the component allocation would be an ADD, a MULT, and a FAC, which still provides an improvement over the nonmultimode implementation.

This technique also inherently leverages any inter- and intra-DFG isomorphism. Using efficient binding of flexible components to nodes, the need for MUXes in the interconnection network is minimized without the need for any isomorphic subgraph identification and matching on the nodes in the DFG. Consider that nodes v1 and v4 in Figure 1 use the MULT component and nodes v2, v3, v5, and v6 use the FAC, so the component interconnects will not change from one mode to the other. Without the FAC, v2 and v5 would be executed on a MULT and ADD, respectively. Both operations' results would be input to the ADD, requiring a 2:1 MUX for the ADD and MULT to write to the same result register or for the ADD to read from different source registers for the different modes. For large system bit widths, MUXes become very expensive. The technique presented here minimizes the need for such MUXes (and control signals for the MUXes, which must be generated individually for each mode), while avoiding computationally intensive subgraph isomorphism identification algorithms [6].

1.1.2. Controller design

Conventional multimode or domain-specific customization approaches, where fixed datapath components are shared between tasks, require separate controllers, with the individual controllers typically MUXed at their outputs to form a composite controller. However, the composite controller in such designs often becomes complex enough to require a programmable microcode-based implementation, which is less area- and power-efficient and has lower performance than hardwired controllers.

SSR-based adaptable controllers can reduce these inefficiencies. Consider now a composite controller for the multimode system shown in Figure 1. Assume that the individual instance controllers have only minor differences; say, for example, one of their output functions is represented by f1 in one and f2 in the other. Assume f1 and f2 are defined as follows: f1 = a + b + ce + de; f2 = ac + ad + bc + bd + e. For a composite controller, both functions could be separately implemented on the same device, with an output MUX selecting between them for the specific mode. Using only 2-input fixed logic gates from a standard cell library such as the Lsi10k library, the total circuit cost for such an implementation is 19 inverter-equivalent gates.

The SSR-designed flexible controller provides the same flexibility more efficiently. f1 and f2 are jointly synthesized as a multi-output function, thus maximizing logic sharing between the functions and automatically obtaining the minimum implementation difference between them. The shared logic (in this case, the common subexpressions X = a + b and Y = c + d) is then implemented in fixed logic, while the differences are implemented in configurable logic. In our example, the functions f1 and f2 are rewritten as f1 = X + e∗Y and f2 = X∗Y + e, and jointly implemented using programmable interconnect for a total circuit cost of 12 inverter-equivalent gates.

An algorithmic framework for integrating flexible datapath and control components into multimode systems is presented in this paper. Conventional high-level synthesis tools do not take advantage of the range of flexibility provided by these components. For example, module allocation techniques for multifunction ALUs (e.g., using operation clustering, etc.) are not optimized for application-specific, limited flexibility addition. Other approaches for incorporating ALUs, in which allocation precedes synthesis, are ineffective for multimode systems since inter-DFG dependencies are not effectively extracted for optimum allocation. In the following sections, new algorithms and extensions to conventional algorithms are proposed for datapath synthesis, allocation, and binding, and for automatic control path synthesis for multimode DSP systems with flexible components and controllers.

2. BACKGROUND AND RELATED WORK

In order to leverage the increasing relative performance, area, and power benefits of hardware versus software, there is a trend towards implementing in hardware many algorithms and applications that had previously been done primarily in software. This is particularly evident with the advent of system-on-a-chip (SOC) designs, in which embedded processors and application-specific circuitry share the processing load based on a designer-defined partition. For applications requiring several disparate performance-sensitive tasks, this often results in low resource utilization, a metric for hardware efficiency.

One approach that has been suggested to address the low resource utilization issue, as well as to enable more powerful hardware implementations of DSP systems, is to jointly synthesize different applications to build a unified datapath. By using separate controllers, the datapath may be animated to implement the various applications. This method was first explored in [2] as "multifunctional processing units." The work provided heuristic local search algorithms for the joint allocation of components so as to minimize interconnect. Designing application-specific programmable processors (ASPPs) by bundling similar applications and jointly synthesizing them has also been investigated [7]. Flexible datapaths have been proposed for fault tolerance purposes, with various configurations available to recover from component failures [8]. Most recently, a "spatially chained transformation" was introduced to enable dataflow graphs of different applications to be chained together for joint component allocation and binding [1]. The essential element in all of these efforts is that timewise mutually exclusive applications can reside in different configurations in the same physical area of a "multimode" system.

Domain-specific customization is a related approach for application-specific flexibility in reconfigurable systems [6, 9]. This approach involves creating a custom reconfigurable architecture to specifically implement a set of circuits from a given domain and be completely flexible within that domain. This is, in a sense, a mirror image of our approach. While ours is aimed at inserting small amounts of reconfigurability into primarily fixed logic circuits, domain-specific customization inserts fixed logic into circuits with primarily reconfigurable fabric. The synthesis techniques developed for such systems therefore address a different set of issues (e.g., template generation, isomorphic subgraph identification and matching, etc.) that is relevant to domain-specific customization. It is difficult to adapt these techniques to address issues specific to our problem, such as runtime reconfigurability within isomorphic subgraphs. The technique presented in this paper addresses such issues.

While hybrid FPGAs and reconfigurable cores provide hardware flexibility, their coarse integration of fixed logic and reconfigurable fabric results in significant area, performance, and power penalties. Techniques have therefore been explored to add flexibility to individual hardware components without the penalties associated with general-purpose reconfigurable arrays. By reusing adder chains within a multiplier, an area-efficient "morphable" multifunction component capable of both addition and multiplication was described in [5]. Such a unit is useful in DSP systems dominated by multiply-accumulate (MAC) chains. Another flexible component capable of both single-precision and double-precision floating point multiplication was described in [10]. Synthesis techniques that integrate flexible components into primarily fixed logic systems are detailed in [3]. That work augmented traditional force-directed list scheduling (FDLS) [11] for component scheduling and allocation using a hybrid library of fixed and reconfigurable arithmetic components, providing significant area savings for single-mode systems. The work presented here provides hybrid library synthesis techniques for multimode systems, yielding even greater savings.

3. EFFICIENT FLEXIBLE HARDWARE

Hardware flexibility is traditionally achieved with large-scale, general-purpose reconfigurable arrays such as FPGAs, which are significantly less efficient (in terms of area, delay, and power) than fixed logic. The logic is not as dense, delays are larger through SRAM-based lookup table (LUT) logic and programmable interconnects, and power consumption is greater due to the increased node capacitance.
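The joint-synthesis saving claimed for the controller functions f1 and f2 in Section 1.1.2 rests on the Boolean identities X = a + b and Y = c + d. A quick exhaustive check (illustrative Python, not part of the authors' design flow) confirms that the factored forms computed from the shared subexpressions are equivalent to the original two-level expressions, reading + as OR and juxtaposition as AND:

```python
from itertools import product

def f1(a, b, c, d, e):
    # original form: f1 = a + b + ce + de
    return a or b or (c and e) or (d and e)

def f2(a, b, c, d, e):
    # original form: f2 = ac + ad + bc + bd + e
    return (a and c) or (a and d) or (b and c) or (b and d) or e

def shared(a, b, c, d, e):
    # shared subexpressions, implemented once in fixed logic
    X = a or b          # X = a + b
    Y = c or d          # Y = c + d
    # the per-mode difference is only how X, Y, and e are combined
    return (X or (e and Y),    # f1 = X + e*Y
            (X and Y) or e)    # f2 = X*Y + e

# verify on all 32 input combinations
assert all((f1(*v), f2(*v)) == shared(*v)
           for v in product([False, True], repeat=5))
print("factorization verified on all 32 input combinations")
```

The check is trivial for five inputs; for real controller output functions the same equivalence is guaranteed by the logic synthesis tool that extracts the shared subexpressions.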

The SSR design technique minimizes these penalties by inserting into a primarily fixed logic design only the flexibility that is required for a specific application. Reconfigurable logic and interconnect (e.g., SRAM-based LUTs, MUXes, SRAM-gated pass transistors, etc.) are finely integrated with fixed logic at a gate-level granularity. While some recent technologies contain both fixed and reconfigurable logic on the same chip, they are coarsely integrated. For example, some hybrid FPGAs contain a fixed logic core surrounded by general-purpose reconfigurable fabric. Domain-specific customization [6, 9], discussed in Section 2, provides another example. SSR allows for finer integration and application-specific implementation, providing the necessary flexibility with ASIC-like efficiency. In addition, the reconfiguration time is significantly shorter, as there is less to reconfigure. The tradeoff is the design effort and fabrication costs associated with all ASICs, but high-volume applications offset these costs, and many applications require ASIC performance.

The SSR design methodology can be applied to a range of designs and applications. The remainder of this section focuses on the use of SSR for designing FACs and adaptable finite state machines (FSMs), which enable efficient datapath and control flexibility.

3.1. Flexible arithmetic components

When designing a FAC, the similarities between the desired operations can be implemented in fixed logic, and reconfigurable logic and interconnect must be used to implement the differences. Therefore, the first step in designing a FAC with SSR is to determine the minimum distance between the operations to be implemented, thereby minimizing the need for reconfigurable hardware (and its associated penalties). Certain operations have inherently greater similarities than others, making them more conducive to SSR implementation. For example, adders and multipliers have similar substructures, resulting in an especially efficient flexible implementation.

Other DSP operation combinations may also be considered for FAC implementation: a wide bit width operation could be integrated with multiple operations of narrower width; several low-precision operations could be embedded within a high-precision operation [10]; a rarely used operation could be integrated within a high use operation (increasing hardware usage); and so on. In addition, reconfigurable rounding modes may be added so that the binary point of the output can be moved based on the inputs and desired rounding. This would address the rounding inaccuracy and scaling problems that plague conventional fixed-point components used in hardware signal processing. As stated in Section 1.1, for a FAC to provide area savings, its area must be smaller than the combined area of all of the operations implemented individually.

FACs built using SSR avoid the large performance, area, and power penalties associated with FPGAs and DSP processors. For example, we have presented a flexible component capable of executing a 4-bit fixed-point addition, subtraction, multiplication, or comparison [3]. The areas of the various components, normalized in terms of the comparator (which is the smallest of the fixed components at 52 inverter-equivalent gates), were obtained as Comparator: X, Adder/Subtractor: 1.44X, Multiplier: 4.5X, Limited Flexible Unit (LFU; Adder/Subtractor and Multiplier): 4.81X, and Full Flexible Unit (FFU; Comparator, Adder/Subtractor, and Multiplier): 5.31X. The flexible units were therefore of very reasonable size compared to the fixed logic units. To compare these inverter-equivalent gate counts to an FPGA implementation, the largest fixed component (i.e., the multiplier) was considered. The 4-bit multiplier implemented on an FPGA with 4-input LUTs required 82 LUTs (using the Xilinx ISE software package). At an approximate area cost of 80 inverter-equivalent gates for each LUT, this is equivalent to 6560 inverters or 126.15X. This does not include the interconnect network, which consumes the majority of reconfigurable fabric area. Reconfiguring the flexible components is also efficient, since it just involves changing the configuration bits for a single 4-input look-up table and select signals for internal MUXes, as opposed to reconfiguring large areas of an FPGA. Small-scale reconfigurability clearly provides hardware flexibility with significantly less area than general-purpose FPGAs. Since FPGAs are, in general, more efficient than DSP processors when customized for a particular application, SSR provides greater benefits overall.

While SSR provides area savings, delay penalties must also be considered, as the length of a c-step may increase with flexible components. For the 4-bit components in [3], the relative path delays of the components were Comparator: Y, Adder/Subtractor: 2Y, Multiplier: 3.67Y, LFU: 4.67Y, and FFU: 5Y, with the approximation that all combinational gates have equal delays. Therefore, assuming the length of the c-step was defined by the multiplier delay, the use of an LFU or FFU will increase the c-step length by 27% and 36%, respectively. These increases must be traded off with the area savings provided by hybrid library scheduling and allocation. In addition, these delays are significantly less than those of an FPGA, revealing the delay benefit of SSR.

It must be noted that the specific multiplier chosen for comparison is that with the shortest critical path. For other multiplier structures, the percentage delay increase would be smaller. For example, the results in this paper use an augmentation of the "morphable multiplier" [5], which is capable of implementing both multiplication and addition (in fact, it can perform two data-independent additions in parallel) with the same delay as a fixed logic multiplier. Given the regularity with which MAC operations occur in DSP algorithms, this FAC provides significant benefits for multimode DSP systems. This component and the augmentations performed are discussed in Section 5.

3.2. Flexible controllers

Controllers in application-specific multimode systems are typically designed as a composition of individual controllers, one for each mode. These controllers can be microcoded or hardwired FSMs. Hardwired FSMs are the more common choice given their smaller area and higher performance. However, when one considers designing a single flexible controller capable of being adapted for each mode, microcode has been the traditional option, as different microinstructions can be loaded for each mode. But as with FPGAs, the area, performance, and power costs of microcoded controllers make them significantly less efficient than hardwired FSMs.

SSR can again be used to find the minimum distance between the various mode controllers to implement an efficient flexible hardwired FSM. Consider that an FSM can be defined as a six-tuple (S, I, O, S0, δ, λ), where S is the set of states, I is the input set, O is the output set, S0 is the initial state, δ is the state transition function set, and λ is the output function set. For an FSM to be made flexible, the δ and λ functions can be implemented using SSR such that they can be configured to implement the various mode controllers (including the control signals setting the FAC functionalities), and the cardinality of S is defined by the maximum S cardinality of the individual mode controllers. Such an implementation is likely to be significantly more efficient than either separate hardware controllers or programmable microcode.

The adaptable FSM is optimally implemented as follows. The set of state transition functions defined by δ is fed to a logic synthesis tool and jointly synthesized as a multiple-output function circuit with logic sharing between the functions. The shared logic in the final synthesized circuit is the logic common to all of the functions and is implemented in fixed logic. The logic specific to each function is implemented in reconfigurable logic (using LUTs and/or MUXes) and changes between contexts. The process is repeated for the λ set.

This process is efficient as long as the intermode functions differ only slightly from each other. The multimode system synthesis techniques detailed in the following section help to minimize these differences, ensuring an efficient implementation. It should be noted that the number of configuration control bits is limited due to the small-scale nature of SSR. Therefore, the extra logic needed for generating these bits in the flexible FSMs is small compared to the savings obtained in simplifying the steering logic for the interconnect MUXes.

4. MULTIMODE DSP SYSTEM SYNTHESIS

While the SSR design methodology helps to enable efficient flexible hardware, a multimode DSP system's efficiency is ultimately driven by high-level synthesis, which ensures the flexibility is used efficiently. General multimode synthesis techniques are emerging, but this section presents the first multimode DSP system synthesis technique that incorporates FACs and adaptable FSMs. In fact, it is noted by the designers of the morphable multiplier that no synthesis techniques exist that make use of such components [5].

The steps in Figure 2 detail the synthesis methodology and are explained via the subsequent example:

(1) input DFGs and system and DFG-specific constraints;
(2) identify the set of potential arithmetic components;
(3) traverse DFGs for critical paths;
(4) for each DFG (in order of increasing slack): HFDLS(DFG); HAL(DFG);
(5) c-step matching (scheduled DFGs);
(6) for each DFG (in order of decreasing resource usage): bind(DFG);
(7) design controller.

Figure 2: Multimode system synthesis methodology.

In step 1, the inputs to the system are the individual DFGs, representing the signal processing modes, and an optional overall delay and/or resource constraint on the whole system. Parameters associated with each DFG, such as the precision, data width, and maximum tolerable latency (or, optionally, the resource constraint), are also input to the system. (The results in Section 5 are for deriving the minimum resource allocation under latency requirements. Future work will address other scenarios.)

Figure 3: Sample DFGs.

Consider the two DFGs in Figure 3. Assume that the latency constraints on DFG 3a (Figure 3(a)) and DFG 3b (Figure 3(b)) are eight c-steps and seven c-steps, respectively. At this time, a preprocessing step, step 2, is performed that identifies the set of arithmetic components that could possibly execute each operation, including satisfying operations of various bit widths, such as implementing a 32-bit addition with two 16-bit adders. For this example, the total resource set includes ADDs, MULTs, and the augmented morphable multipliers. Assume that all operations have the same bit width and that the c-step latencies of an ADD and MULT are one and two, respectively, in both the fixed logic components and the FAC. As stated in Section 3.1, the FAC can also do two data-independent additions in a single c-step. Given that FAC reconfiguration time is minimal due to the SSR implementation, FACs can change functionality within the same mode (i.e., intramode reconfiguration).
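The step 2 preprocessing can be pictured as building a candidate table from each operation to the components able to execute it, including bit-width decompositions such as the two-16-bit-adder case above. The sketch below is only an illustration: the component names, library entries, and pair-decomposition rule are assumptions for this example, not the paper's data structures.

```python
# Hypothetical component library: (name, supported functions, bit width)
LIBRARY = [
    ("ADD32",  {"add"},        32),
    ("ADD16",  {"add"},        16),
    ("MULT32", {"mul"},        32),
    ("FAC32",  {"add", "mul"}, 32),  # e.g., an augmented morphable multiplier
]

def candidates(op, width):
    """Components (or component pairs) that could execute one operation."""
    found = []
    for name, funcs, w in LIBRARY:
        if op in funcs and w == width:
            found.append((name,))
        # a wide addition can also be built from two half-width adders
        if op == "add" and op in funcs and 2 * w == width:
            found.append((name, name))
    return found

print(candidates("add", 32))  # [('ADD32',), ('ADD16', 'ADD16'), ('FAC32',)]
```

The scheduler can then pick among these candidates per node, which is what makes a FAC interchangeable with a fixed ADD or MULT during allocation.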

The DFGs are then traversed in step 3 to find the critical paths based on data dependencies and the operation latency assumptions. Counting additions as one c-step and multiplies as two, the longest paths of DFG 3a and DFG 3b are eight and six c-steps, respectively. Therefore, to meet the latency constraints of each DFG, there is no c-step slack for DFG 3a and one c-step of slack for DFG 3b. This determines the DFG scheduling order, with the DFG with the least slack scheduled first.
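The step 3 traversal amounts to a longest-path computation over each DFG with the latency weights just stated (ADD one c-step, MULT two). Figure 3's exact graphs are not reproduced here, so the sketch below runs on stand-in DFGs chosen only to have the same critical-path lengths (eight and six c-steps); the graph shapes are assumptions for illustration.

```python
LATENCY = {"add": 1, "mul": 2}

def critical_path(dfg):
    """Longest path in c-steps; dfg maps node -> (op, list of predecessors)."""
    memo = {}
    def finish(n):
        if n not in memo:
            op, preds = dfg[n]
            memo[n] = LATENCY[op] + max((finish(p) for p in preds), default=0)
        return memo[n]
    return max(finish(n) for n in dfg)

# Stand-ins with the same critical paths as DFGs 3a and 3b (shapes assumed):
dfg_3a = {1: ("mul", []), 2: ("mul", [1]), 3: ("mul", [2]),
          4: ("add", [3]), 5: ("add", [4])}
dfg_3b = {1: ("mul", []), 2: ("mul", [1]), 3: ("add", [2]), 4: ("add", [3])}

# slack = latency constraint minus critical path; least slack schedules first
slacks = {name: deadline - critical_path(g)
          for name, deadline, g in [("3a", 8, dfg_3a), ("3b", 7, dfg_3b)]}
print(slacks)  # {'3a': 0, '3b': 1}
```

With slack 0, DFG 3a is scheduled first, matching the ordering used in the example.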

Therefore, DFG 3a is scheduled first in step 4 using hybrid force-directed list scheduling (HFDLS) and hybrid allocation (HAL) [3]. Unlike the conventional FDLS algorithm, which lowers the concurrency of same-type operations per c-step, the HFDLS algorithm uses a modified force calculation such that an overall balance in the number of operations per c-step is achieved. The HFDLS algorithm has the same worst-case computational complexity as FDLS: O(n²), where n is the number of nodes to be scheduled in each DFG. After scheduling, general multifunction ALU allocation algorithms can be used by considering the flexible component as a kind of limited ALU. However, the bulk of the work in creating an optimum schedule requiring the minimum number of flexible components is done by HFDLS, and a simple allocation algorithm, such as HAL, is sufficient to allocate components on these scheduled graphs. HAL uses principles from set theory to produce an exact minimum module allocation set. The algorithm has a computational complexity of O(m), where m is the number of adder and multiplier nodes to be allocated. Further details on both HFDLS and HAL can be found in [3].

Given that DFG 3a has no slack, it must be scheduled in the fewest number of c-steps. The resulting scheduled DFG is shown in Figure 4. From this schedule, it is clear that two ADDs and one MULT would be necessary if only fixed components were available. However, this DFG can be implemented with a single FAC. Given the relative size of the various components, this represents a large reduction in area, even within a single mode. (This echoes the results in [3], which focus on FACs in single-mode systems.)

Figure 4: Scheduled DFG for DFG 3a.

Once the first DFG is scheduled, the other DFGs are each scheduled (in order of increasing slack) so that they meet their individual latency requirements. For each DFG, an attempt is made to meet the required latency without more resources than are currently allocated. When the resource allocation must be increased to meet a latency requirement, the global resource allocation is updated. While an increase in the allocation set may enable already scheduled DFGs to reduce latency by rescheduling, the algorithm does not do so, as the DFG latency requirements have already been met. Instead, this resource slack is exploited during binding, as discussed below. When all of the DFGs are scheduled, the final resource allocation set, including both fixed and flexible components, is known.

When DFG 3b is scheduled using this approach, it is obvious that it cannot be scheduled with only one FAC. Therefore, the number of resources must be increased. The minimum resource set to schedule DFG 3b while meeting its latency requirement is two FACs, as opposed to two ADDs and two MULTs if FACs were not available. The resulting scheduled DFG is shown in Figure 5. (Note that the schedule would be different if only fixed components were available, as HFDLS often produces different schedules than traditional FDLS.) Even though we increased the component allocation, we do not reschedule DFG 3a. (In this case the schedule would not change anyway.)

Figure 5: Scheduled DFG for DFG 3b.

If the DFGs have different numbers of c-steps, step 5 matches c-steps across the DFGs. The goal is to maximize same-control-step component usage, thus minimizing the functional differences between the various mode controllers for an efficient adaptable FSM design. This matching is done using maximal weighted matching, which can be solved in polynomial time [11]. The weight assigned to each control-step-connecting edge in the matching graph is the number of resources common to the connected c-steps.

In the example here, there are only two DFGs, making it a case of bipartite matching. Given that DFGs 3a and 3b have eight and seven c-steps, respectively, there is only a slack of one c-step for the matching process. So c-step 1 in DFG 3b can match with either c-step 1 or 2 in DFG 3a, c-step 2 can match with either c-step 2 or 3, and so forth. Given the balanced resource needs of each c-step, the edges all have equal weight. Therefore, by convention, the same-numbered c-steps are matched across DFGs (1 to 1, 2 to 2, etc.).

V. V. Kumar and J. Lach 7
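The step-5 matching can be pictured as an order-preserving alignment between the two schedules. The sketch below is a simplification under stated assumptions: the paper solves general maximal weighted matching [11], whereas the hypothetical `align` helper here exploits the fact that c-step order must be preserved, and breaks weight ties toward same-numbered matches, as the text's convention does:

```python
# Step-5-style c-step matching as an order-preserving alignment:
# match the 7 c-steps of the shorter schedule to the 8 c-steps of the
# longer one (one c-step of slack) so that the total edge weight -- the
# number of resources common to each matched pair -- is maximized.
# Simplified sketch; the paper uses general maximal weighted matching.
from functools import lru_cache

def align(weight, n_short, n_long):
    """weight[i][j]: resources shared by c-step i (short) and j (long).
    Returns (best total weight, matching as ((i, j), ...)), in order."""
    @lru_cache(maxsize=None)
    def best(i, j):                      # match c-steps i.. against j..
        if i == n_short:
            return (0, ())
        if n_long - j < n_short - i:     # not enough c-steps left
            return (float("-inf"), ())
        # match i with j (preferred on ties, i.e., same-number matching)
        w, rest = best(i + 1, j + 1)
        take = (w + weight[i][j], ((i, j),) + rest)
        # or leave the long schedule's c-step j unmatched
        skip = best(i, j + 1)
        return max(take, skip, key=lambda t: t[0])

    return best(0, 0)

# Equal weights (balanced resource needs): convention picks 1-1, 2-2, ...
W = [[1] * 8 for _ in range(7)]
total, pairs = align(W, 7, 8)
```

With all edge weights equal, every feasible alignment scores the same, and the tie-breaking reproduces the 1-to-1, 2-to-2 convention described above.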

Step 6 binds operations to components, starting with the DFG that last set the resource allocation, as it is typically the mode with the highest component utilization. Maximal weighted matching is used to minimize interconnect, MUXes, and registers [13]. An important change to traditional binding is that the graph is constructed with compatibility edges drawn from operation nodes not only to components of that type but also to FACs, albeit with a smaller "component match" weight factor.

In binding the other DFGs, an effort is made to minimize interconnect, MUX, and register overhead above that set by the base DFG, as well as to simplify the subsequent controller design. The key benefit provided by FACs is that subgraphs within a DFG with different operations are actually isomorphic in both nodes and edges if the disparate operations can be bound to the same FAC. Therefore, the binding algorithm is likely to find a larger number of subgraphs and individual nodes that share inputs/outputs. These benefits are further enhanced by the resource slack in the DFGs that were not rescheduled after additional resources were allocated.

The binding results for DFGs 3a and 3b are shown in Figure 6. The different block shadings represent the two FACs. Note that the darkly shaded component always outputs to itself, except for c-step 1 in DFG 3b, and the lightly shaded component always feeds the same input port of the other component. This matching helps minimize the interconnect, MUXes, and registers.

Figure 6: Bounded DFGs.

The final step is controller design. As discussed in Section 3.2, SSR and multiple-output logic synthesis enable the controllers for all of the modes to be implemented in the same physical space, with their similarities implemented in fixed logic and interconnect and their differences in reconfigurable logic. The outputs of the FSM include the control signals for the MUXes, register enables, and FAC settings.

None of the steps in this process is more computationally complex than what is currently done for multimode system synthesis, but as the results in the following section show, the area and power savings can be significant.

5. RESULTS

The methodology presented in this paper has been evaluated by synthesizing multimode DSP systems with runtime reconfiguration. The base DFGs used are well-known DSP instances from the high-level synthesis literature. The parameters of these DFGs, in terms of number of nodes, minimum latency, and so forth, span a wide range and are representative of the kind of DFGs that occur in multimode systems. ELLIP is a fifth-order elliptic digital filter with 33 operations and a minimum latency of 13 c-steps [12], EDGE is an edge detector with 241 operations and a minimum latency of 121 c-steps [13], ARFILT is an autoregressive filter with 28 operations and a minimum latency of 8 c-steps [14], FDCT is a fast discrete cosine transform instance with 42 operations and a minimum latency of 6 c-steps [15], and FIRFILT is a 16-point FIR filter with 23 operations and a minimum latency of 9 c-steps [16].

The datapath is assumed to be 32 bits wide in all of the example systems. The component library consists of 32-bit fixed logic adders, multipliers, and FACs capable of both addition and multiplication (including two data-independent ADDs in one c-step). Other FACs (including those not specific to DSP applications) will be considered as part of future work. As discussed in Section 3.1, the FAC is based on the morphable multiplier [5]. The component areas in terms of the number of constituent transistors are as follows: ADD = 1306, MULT = 6150, FAC = 6860. The multiplier and the multiply-configured FAC have latencies of two c-steps. Pipelined versions of these components can be built at a cost of approximately 1000 additional transistors each, which would be necessary if the required maximum latency were less than what non-pipelined components allowed. Reconfiguring this component simply involves sending the appropriate select signal to internal MUXes and is virtually instantaneous.

The component allocations and resource utilizations for various DFG combinations are shown in Table 1. The DFGs were synthesized for minimum area under the imposed latency constraint. The first three columns (SMFixed: single-mode fixed) show the area for fully implementing each DFG individually with fixed components. The next three columns (MMFixed: multimode fixed) show the synthesis results for a fixed logic multimode DSP system. Both SMFixed and MMFixed were obtained using conventional FDLS and allocation [1]. Finally, the synthesis results for a flexible DSP system with intramode reconfiguration using the method presented here are shown (MMFlex: multimode flexible), including the datapath area savings and resource utilization increase over the fixed logic multimode DSP system. We have shown in [3] that performance gains in flexible single-mode systems are attributable to both the use of FACs and the synthesis algorithm. The modified synthesis and allocation procedures presented here are therefore necessary to fully utilize the benefits provided by FACs and flexible controllers for multimode systems as well. It is important to note that domain-specific synthesis and conventional multifunction ALU allocation techniques could possibly be adapted to produce similar, but intermode-reconfiguration-only, systems. However, the intramode reconfiguration enabled by the technique presented here contributes a substantial portion of the benefit over conventional fixed multimode DSP systems.
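The single-FAC area claim from the scheduling discussion can be checked directly against the transistor counts quoted above. The arithmetic below is only an illustrative sanity check of those figures (2 ADDs + 1 MULT versus one FAC for DFG 3a, and 2 ADDs + 2 MULTs versus two FACs for DFG 3b):

```python
# Transistor-count sanity check using the component areas in the text.
ADD, MULT, FAC = 1306, 6150, 6860

# DFG 3a: two ADDs and one MULT with fixed components vs. one FAC.
fixed_3a = 2 * ADD + 1 * MULT        # 8762 transistors
flex_3a  = 1 * FAC                   # 6860 transistors
saving_3a = 1 - flex_3a / fixed_3a   # roughly a 22% reduction

# DFG 3b: two ADDs and two MULTs vs. two FACs.
fixed_3b = 2 * ADD + 2 * MULT        # 14912 transistors
flex_3b  = 2 * FAC                   # 13720 transistors
```

The numbers bear out the text: even within a single mode, replacing a fixed ADD/ADD/MULT allocation with one FAC saves roughly a fifth of the datapath transistors.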

Table 1: Component allocations and utilizations.

                 SMFixed                  MMFixed                  MMFlex                          Improvement (%)
MM system        +   *   Area     Util.   +   *   Area     Util.   +   *   FAC   Area     Util.   Area    Util.
AR, FDCT, FIR    7   14  95 242   17.9%   4   8   54 424   37.5%   1   6   2     51 926   44.7%   4.59    19.2
ELLIP, FIR       5   4   31 130   29.1%   3   2   16 218   52.3%   1   0   2     15 026   73.2%   7.35    40.0
ARFILT, FDCT     6   12  81 636   29.7%   4   8   54 424   44.6%   0   6   2     50 620   56.2%   6.99    26.0
FIR, EDGE        4   4   29 824   36.1%   3   2   16 218   57.8%   1   0   2     15 026   90.1%   7.35    55.9
ELLIP, EDGE      4   4   29 824   33.6%   3   2   16 218   53.7%   1   0   2     15 026   81.5%   7.35    51.8
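As a consistency check, each area entry in Table 1 should equal the weighted sum of the allocated components using the per-component transistor counts quoted in Section 5 (ADD = 1306, MULT = 6150, FAC = 6860). A short sketch over a few rows (the `area` helper is illustrative, not from the paper):

```python
# Cross-check of the Table 1 area column against the component
# transistor counts given in the text.
ADD, MULT, FAC = 1306, 6150, 6860

def area(n_add, n_mult, n_fac=0):
    return n_add * ADD + n_mult * MULT + n_fac * FAC

# (system/column, computed area, area as tabulated)
rows = [
    ("ELLIP, FIR  SMFixed",  area(5, 4),     31130),
    ("ELLIP, FIR  MMFixed",  area(3, 2),     16218),
    ("ELLIP, FIR  MMFlex",   area(1, 0, 2),  15026),
    ("AR, FDCT, FIR MMFlex", area(1, 6, 2),  51926),
]
for name, computed, tabulated in rows:
    assert computed == tabulated, name
```

Every checked row reproduces the tabulated area exactly, which supports reading the allocation columns as (ADD, MULT, FAC) counts.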

Table 2: Registers, MUXes, and control signals.

                 MMFixed                   MMFlex                    Improvement (%)
MM system        Reg   2:1 MUX   Ctrl      Reg   2:1 MUX   Ctrl      Reg     2:1 MUX   Ctrl
AR, FDCT, FIR    18    83        249       14    51        53        22.22   38.55     78.71
ELLIP, FIR       11    42        84        8     23        25        27.27   45.24     70.24
ARFILT, FDCT     14    69        138       12    45        47        14.29   34.78     65.94
FIR, EDGE        11    29        58        8     22        24        27.27   24.14     58.62
ELLIP, EDGE      7     27        54        4     16        18        42.86   40.74     66.67
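The improvement columns of Table 2 are simple percentage reductions from MMFixed to MMFlex, and recomputing them also recovers the roughly 70% average control-line reduction discussed in the surrounding text. A small sketch (the `improvement` helper is illustrative, not from the paper):

```python
# Recomputing the Table 2 improvement columns: each entry is the
# percentage reduction from MMFixed to MMFlex, (fixed - flex) / fixed.
rows = {   # system: ((Reg, MUX, Ctrl) MMFixed, (Reg, MUX, Ctrl) MMFlex)
    "AR, FDCT, FIR": ((18, 83, 249), (14, 51, 53)),
    "ELLIP, FIR":    ((11, 42, 84),  (8, 23, 25)),
    "ARFILT, FDCT":  ((14, 69, 138), (12, 45, 47)),
    "FIR, EDGE":     ((11, 29, 58),  (8, 22, 24)),
    "ELLIP, EDGE":   ((7, 27, 54),   (4, 16, 18)),
}

def improvement(fixed, flex):
    return [round(100 * (a - b) / a, 2) for a, b in zip(fixed, flex)]

ctrl_avg = sum(improvement(f, x)[2] for f, x in rows.values()) / len(rows)
# ctrl_avg is about 68%, consistent with the "~70% reduction in the
# number of control lines" claim in the text.
```

For example, the ELLIP, FIR row yields 27.27%, 45.24%, and 70.24%, matching the tabulated values.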

The increased resource utilization also results in less wasted power consumption. For all three implementations, turning off components that are completely unused in a mode will help reduce power, but the overhead of turning components on/off prevents intramode component shutdown.

While these datapath area savings are significant, Table 2 shows that even larger savings are provided in terms of registers, MUXes, and control signals. The register and 2:1 MUX reductions, due primarily to the binding process, are especially valuable, as the 32-bit wide datapath makes these components large. The ~70% reduction in the number of control lines (Ctrl) to the datapath from the controller, which among other things helps to simplify placement and routing, is obtained as a result of both the binding process and the use of adaptable controllers.

The controller logic area is also reduced, as shown in Table 3. Since separate controllers need not be built, the logic is greatly simplified. The area numbers shown are in terms of inverter-equivalent gates and are only for the combinational portion of the controller that implements the output functions, including any LUTs in the case of the adaptable controller. The flip-flops and other memory elements in both the fixed controller and the adaptable controller are the same and are hence not included in the area results.

Table 3: Controller logic areas.

                 MMFixed           MMFlex            Improvement (%)
MM system        Controller area   Controller area
AR, FDCT, FIR    1094              709               35.19
ELLIP, FIR       416               263               36.78
AR, FDCT         506               354               30.04
FIR, EDGE        294               133               54.76
ELLIP, EDGE      266               202               24.06

6. CONCLUSIONS

This paper presented an approach for synthesizing multimode DSP systems with a hybrid library of fixed and flexible arithmetic components and adaptable controllers. The implementation capabilities and efficiency of the multimode system platform are greatly increased by the extra hardware flexibility provided by small-scale reconfigurability, without the large area, performance, and power penalties associated with general-purpose reconfigurable fabric. The intramode reconfiguration and the scheduling, allocation, and binding flexibility provided by the FACs result in significant datapath and control area savings, and reduced wasted power consumption, over existing multimode DSP system synthesis techniques.

7. ACKNOWLEDGMENTS

This work is supported in part by the National Science Foundation under Grant numbers CCR-0105626 and EHS-0410526 and by the Woodrow W. Everett, Jr. SCEEE Development Fund in cooperation with the Southeastern Association of Electrical Engineering Department Heads.

REFERENCES

[1] L.-Y. Chiou, S. Bhunia, and K. Roy, "Synthesis of application-specific highly-efficient multi-mode systems for low-power applications," in Proceedings of Design, Automation and Test in Europe Conference and Exhibition (DATE '03), pp. 96–101, Munich, Germany, March 2003.

[2] A. van der Werf, M. J. H. Peek, E. H. L. Aarts, J. L. van Meerbergen, P. E. R. Lippens, and W. F. J. Verhaegh, "Area optimization of multi-functional processing units," in Proceedings of IEEE/ACM International Conference on Computer-Aided Design (ICCAD '92), pp. 292–299, Santa Clara, Calif, USA, November 1992.
[3] V. Vijay Kumar and J. Lach, "Designing, scheduling, and allocating flexible arithmetic components," in Proceedings of 13th International Conference on Field Programmable Logic and Applications (FPL '03), pp. 1166–1169, Lisbon, Portugal, September 2003.
[4] V. Vijay Kumar and J. Lach, "Heterogeneous redundancy for fault and defect tolerance with complexity independent area overhead," in Proceedings of 18th IEEE International Symposium on Defect and Fault-Tolerance in VLSI Systems (DFT '03), pp. 571–578, Boston, Mass, USA, November 2003.
[5] S. M. S. A. Chiricescu, M. A. Schuette, R. Glinton, and H. Schmit, "Morphable multipliers," in Proceedings of 12th International Conference on Field Programmable Logic and Applications (FPL '02), pp. 647–656, Montpellier, France, September 2002.
[6] K. Compton and S. Hauck, "Flexibility measurement of domain-specific reconfigurable hardware," in Proceedings of ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays (FPGA '04), pp. 155–161, Monterey, Calif, USA, February 2004.
[7] K. Kim, R. Karri, and M. Potkonjak, "Synthesis of application specific programmable processors," in Proceedings of ACM/IEEE 34th Design Automation Conference (DAC '97), pp. 353–358, Anaheim, Calif, USA, June 1997.
[8] L. M. Guerra, M. Potkonjak, and J. M. Rabaey, "Behavioral-level synthesis of heterogeneous BISR reconfigurable ASIC's," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 6, no. 1, pp. 158–167, 1998.
[9] E. Bozorgzadeh, S. O. Memik, R. Kastner, and M. Sarrafzadeh, "Pattern selection: customized block allocation for domain-specific programmable systems," in Proceedings of International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA '02), Las Vegas, Nev, USA, June 2002.
[10] G. Even, S. M. Mueller, and P.-M. Seidel, "A dual mode IEEE multiplier," in Proceedings of 2nd Annual IEEE International Conference on Innovative Systems in Silicon (ISIS '97), pp. 282–289, Austin, Tex, USA, October 1997.
[11] P. G. Paulin and J. P. Knight, "Force-directed scheduling for the behavioral synthesis of ASIC's," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 8, no. 6, pp. 661–679, 1989.
[12] S. Park and K. Choi, "Performance-driven high-level synthesis with bit-level chaining and clock selection," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 20, no. 2, pp. 199–212, 2001.
[13] R. M. Haralick and L. G. Shapiro, Computer and Robot Vision, Addison-Wesley, Reading, Mass, USA, 1992.
[14] K. Högstedt and A. Orailoglu, "Integrating binding constraints in the synthesis of area-efficient self-recovering microarchitectures," in Proceedings of IEEE International Conference on Computer Design: VLSI in Computers and Processors (ICCD '94), pp. 331–334, Cambridge, Mass, USA, October 1994.
[15] D. J. Mallon and P. B. Denyer, "A new approach to pipeline optimisation," in Proceedings of European Design Automation Conference (EDAC '90), pp. 83–88, Glasgow, Scotland, UK, March 1990.
[16] R. Karri and A. Orailoglu, "High-level synthesis of fault-secure microarchitectures," in Proceedings of 30th ACM/IEEE International Conference on Design Automation (DAC '93), pp. 429–433, Dallas, Tex, USA, June 1993.

Vinu Vijay Kumar is a Design Engineer in the DSP Design Division of Texas Instruments. His research interests include reconfigurable systems, design methodologies for DSP system synthesis, and physical design of SoC systems. He received the B.E. degree from PSG College of Technology, India, in 2000, and the M.S. degree in electrical engineering and the Ph.D. degree in computer engineering from the University of Virginia, Charlottesville, Va, USA, in 2002 and 2005, respectively. He is a Member of the IEEE and Eta Kappa Nu.

John Lach received the B.S. degree from Stanford University, USA, in 1996, and the M.S. and Ph.D. degrees in electrical engineering from the University of California, Los Angeles, USA, in 1998 and 2000, respectively. Since 2000, he has been an Assistant Professor in the Charles L. Brown Department of Electrical and Computer Engineering at the University of Virginia, Charlottesville, Va, USA. His primary research interests include dynamically adaptable and real-time embedded systems, computer-aided design techniques for very-large-scale integration, general-purpose and application-specific processor designs, and wearable technologies for aged independence. He is a Senior Member of the IEEE, and a Member of the ACM, IEEE Computer Society, IEEE Circuits and Systems Society, ACM SIGDA, and Eta Kappa Nu.

Hindawi Publishing Corporation
EURASIP Journal on Applied Signal Processing
Volume 2006, Article ID 46472, Pages 1–23
DOI 10.1155/ASP/2006/46472

Rapid VLIW Processor Customization for Signal Processing Applications Using Combinational Hardware Functions

Raymond R. Hoare, Alex K. Jones, Dara Kusic, Joshua Fazekas, John Foster, Shenchih Tung, and Michael McCloud

Department of Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, PA 15261, USA

Received 12 October 2004; Revised 30 June 2005; Accepted 12 July 2005

This paper presents an architecture that combines VLIW (very long instruction word) processing with the capability to introduce application-specific customized instructions and highly parallel combinational hardware functions for the acceleration of signal processing applications. To support this architecture, a compilation and design automation flow is described for algorithms written in C. The key contributions of this paper are as follows: (1) a 4-way VLIW processor implemented in an FPGA, (2) large speedups through hardware functions, (3) a hardware/software interface with zero overhead, (4) a design methodology for implementing signal processing applications on this architecture, and (5) tractable design automation techniques for extracting and synthesizing hardware functions. Several design tradeoffs for the architecture were examined, including the number of VLIW functional units and the register file size. The architecture was implemented on an Altera Stratix II FPGA. The Stratix II device was selected because it offers a large number of high-speed DSP (digital signal processing) blocks that execute multiply-accumulate operations. Using the MediaBench benchmark suite, we tested our methodology and architecture to accelerate software. Our combined VLIW processor with hardware functions was compared to software executing on a RISC processor, specifically the soft core embedded NIOS II processor. For software kernels converted into hardware functions, we show hardware speedups of up to 230 times over software, with an average of 63 times. For the entire application, in which only a portion of the software is converted to hardware, the performance improvement is as much as 30X over the nonaccelerated application, with a 12X improvement on average.

Copyright © 2006 Hindawi Publishing Corporation. All rights reserved.

1. INTRODUCTION

In this paper, we present an architecture and design methodology that allows the rapid creation of application-specific hardware accelerated processors for computationally intensive signal processing and communication codes. The target technology is suitable for field programmable gate arrays (FPGAs) with embedded multipliers and for structured or standard cell application-specific integrated circuits (ASICs). The objective of this work is to increase the performance of the design and to increase the productivity of the designer, thereby enabling faster prototyping and time-to-market solutions with superior performance.

The design process in a signal processing or communications product typically involves a top-down design approach with successively lower level implementations of a set of operations. At the most abstract level, the systems engineer designs the algorithms and control logic to be implemented in a high level programming language such as Matlab or C. This functionality is then rendered into a piece of hardware, either by a direct VLSI implementation, typically on either an FPGA platform or an ASIC, or by porting the system code to a microprocessor or digital signal processor (DSP). In fact, it is very common to perform a mixture of such implementations for a realistically complicated system, with some functionality residing in a processor and some in an ASIC. It is often difficult to determine in advance how this separation should be performed, and the process is often wrought with errors, causing expensive extensions to the design cycle.

The computational resources of the current generation of FPGAs and of ASICs exceed those of DSP processors. DSP processors are able to execute up to eight operations per cycle, while FPGAs contain tens to hundreds of multiply-accumulate DSP blocks implemented in ASIC cells that have configurable width and can execute sophisticated multiply-accumulate functions. For example, one DSP block can execute A ∗ B ± C ∗ D + E ∗ F ± G ∗ H in two cycles on

9-bit data, or it can execute A ∗ B + C on 36-bit data in two cycles. An Altera Stratix II contains 72 such blocks as well as numerous logic cells [1]. Xilinx has released preliminary information on their largest Virtex 4, which will contain 512 multiply-accumulate ASIC cells, with 18x18-bit multiply and 42-bit accumulate, and operate at a peak speed of 500 MHz [2]. Lattice Semiconductor has introduced a low-cost FPGA that contains 40 DSP blocks [3]. From our experiments, a floating point multiplier/adder unit can be created using 4 to 8 DSP blocks, depending on the FPGA.

Additionally, ASICs can contain more computational power than an FPGA but consume much less power. In fact, there are many companies, including the FPGA vendors themselves, that will convert an FPGA design into an equivalent ASIC and thereby reduce the unit cost and power consumption.

In spite of these attractive capabilities of FPGA architectures, it is often intractable to implement an entire application in hardware. Computationally complex portions of the applications, or computational kernels, with generally high available parallelism are often mapped to these devices while the remaining portion of the code is executed with a sequential processor.

This paper introduces an architecture and a design methodology that combines the computational power of application-specific hardware with the programmability of a software processor. The architecture utilizes a tightly coupled general-purpose 4-way very long instruction word (VLIW) processor with multiple application-specific hardware functions. The hardware functions can obtain a performance speedup of 10x to over 100x, while the VLIW can achieve a 1x to 4x speedup, depending on the available instruction level parallelism (ILP). To demonstrate the validity of our solution, a 4-way VLIW processor (pNIOS II) was created based on the instruction set of the Altera NIOS II processor. A high-end 90 nm FPGA, an Altera Stratix II, was selected as the target technology for our experiments.

For the design methodology, we assume that the design has been implemented in a strongly typed software language, such as C, or utilizes a mechanism that statically indicates the data structure sizes, like vectorized Matlab. The software is first profiled to determine the critical loops within the program that typically consume 90% of the execution time. The control portion of each loop remains in software for execution on the 4-way VLIW processor. Some control flow from loop structures is removed by loop unrolling. By using predication and function inlining, the entire loop body is converted into a single data flow graph (DFG) and synthesized into an entirely combinational hardware function. If the loop does not yield a sufficiently large DFG, the loop is considered for unrolling to increase the size of the DFG. The hardware functions are tightly integrated into the software processor through a shared register file so that, unlike a bus, there is no hardware/software interface overhead. The hardware functions are mapped into the processor's instruction stream as if they were regular instructions, except that they require multiple cycles to compute. The exact timing of the hardware functions is determined by the synthesis tool using static timing analysis.

In order to demonstrate the utility of our proposed design methodology, we consider several representative problems that arise in the design of signal processing systems in detail. Representative problems are chosen in the areas of (1) voice compression with the G.721, GSM 06.10, and the proposed CCITT ADPCM standards; (2) image coding through the inverse discrete cosine transform (IDCT) that arises in MPEG video compression; and (3) multiple-input multiple-output (MIMO) communication systems through the sphere decoder [4] employing the Fincke-Pohst algorithm [5].

The key contributions of this work are as follows.

(i) A complete 32-bit 4-way VLIW soft core processor in an FPGA. Our pNIOS II processor has been tested on a Stratix II FPGA device and runs at 166 MHz.
(ii) Speedups over conventional approaches through hardware kernel extraction and custom implementation in the same FPGA device.
(iii) A hardware/software interface requiring zero cycle overhead. By allowing our hardware functions direct access to the entire register file, the hardware function can operate without the overhead of a bus or other bottlenecks. We show that the additional hardware cost to achieve this is minimal.
(iv) A design methodology that allows standard applications written in C to map to our processor using a VLIW compiler that automatically extracts available parallelism.
(v) Tractable design automation techniques for mapping computational kernels into efficient custom combinational hardware functions.

The remainder of the paper is organized as follows: we provide some motivation for our approach and its need in signal processing in Section 2. In Section 3, we describe the related work to our architecture and design flow. Our architecture is described in detail in Section 4. Section 5 describes our design methodology, including our method for extracting and synthesizing hardware functions. Our signal processing applications are presented in Section 6, including an in-depth discussion of our design automation techniques using these applications as examples. We present performance results of our architecture and tool flow in Section 7. Finally, Section 8 describes our conclusions with planned future work.

2. MOTIVATION

The use of FPGA and ASIC devices is a popular method for speeding up time critical signal processing applications. FPGA/ASIC technologies have seen several key advancements that have led to greater opportunity for mapping these applications to FPGA devices. ASIC cells such as DSP blocks and block RAMs within FPGAs provide an efficient method to supplement increasing amounts of programmable logic within the device. This trend continues to increase the complexity of applications that may be implemented and the achievable performance of the hardware implementation.

However, signal processing scientists work with software systems to implement and test their algorithms. In general, these applications are written in C and more commonly in Matlab. Thus, to supplement the rich amount of hardware logic in FPGAs, vendors such as Xilinx and Altera have released FPGAs containing ASIC processor cores, such as the PowerPC-enabled Virtex II Pro and the ARM-enabled Excalibur, respectively. Additionally, Xilinx and Altera also produce the soft core processors MicroBlaze and NIOS, each of which can be synthesized on their respective FPGAs.

Figure 1: Execution time contained within the top 10 loops in the code averaged across the SpecInt, MediaBench, and NetBench suites, as well as selected security applications [5].

Unfortunately, these architectures have several deficiencies that make them insufficient alone. Hardware logic is difficult to program and requires hardware engineers who understand the RTL synthesis tools, their flow, and how to design algorithms using cumbersome hardware description languages (HDLs). Soft core processors have the advantage of being customizable, making it easy to integrate software and hardware solutions in the same device.
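The loop-to-hardware methodology described in the introduction (predication and inlining collapse a loop body's control flow into a single data flow graph) can be illustrated in miniature. The kernel below is hypothetical, not one of the paper's benchmarks; the point is that the predicated form is straight-line dataflow in which the branch becomes a 2:1 select:

```python
# If-conversion sketch: predication turns the control flow inside a
# loop body into straight-line dataflow, so the whole body becomes one
# DFG that can map to a combinational hardware function.  The kernel
# is a hypothetical example, not one of the paper's benchmarks.

def kernel_with_branch(xs):
    out = []
    for x in xs:
        if x >= 0:                 # control flow: two execution paths
            y = 3 * x + 1
        else:
            y = -x
        out.append(y)
    return out

def kernel_predicated(xs):
    out = []
    for x in xs:
        p = x >= 0                 # predicate computed as ordinary data
        t, f = 3 * x + 1, -x       # both paths evaluated in parallel
        out.append(t if p else f)  # a 2:1 MUX (select) replaces the branch
    return out

assert kernel_with_branch([-2, 0, 5]) == kernel_predicated([-2, 0, 5])
```

In hardware, the two path expressions and the predicate evaluate concurrently, and the select adds only a MUX delay, which is why a sufficiently large predicated loop body can become a purely combinational hardware function.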
However, these processors are also at the mercy of the synthesis tools and often cannot achieve the necessary speeds to execute the software portions of the applications efficiently. ASIC core processors provide much higher clock speeds; however, these processors are not customizable and generally only provide bus-based interfaces to the remaining FPGA device, creating a large data transfer bottleneck.

Figure 1 displays application profiling results for the SpecInt, MediaBench, and NetBench suites, with a group of selected security applications [5]. The 90/10 rule tells us that on average, 90% of the execution time for an application is contained within about 10% of the overall application code. These numbers are an average of individual application profiles to illustrate the overall tendency of the behavior of each suite of benchmarks. As seen in Figure 1, it is clear that the 10% of code referred to in the 90/10 rule refers to loop structures in the benchmarks. It is also apparent that multimedia, networking, and security applications (this includes several signal processing benchmark applications) exhibit an even higher propensity for looping structures to make a large impact on the total execution time of the application.

Architectures that take advantage of parallel computation techniques have been explored as a means to support computational density for the complex operations required by digital processing of signals and multimedia data. For example, many processors contain SIMD (single instruction multiple data) functional units for vector operations often found in DSP and multimedia codes.

VLIW processing improves upon the SIMD technique by allowing each processing element to execute its own instructions. VLIW processing alone is still insufficient to achieve significant performance improvements over sequential embedded processing. When one considers a traditional processing model that requires a cycle for operand fetch, execute, and writeback, there is significant overhead that occupies what could otherwise be computation time. While pipelining typically hides much of this latency, misprediction of branching reduces the processor ILP. A typical software-level operation can take tens of instructions more than the alternative of a single, hardware-level operation that propagates the results from one functional unit to the next without the need for write-back, fetch, or performance-affecting data forwarding.

Our technique for extracting computational kernels in the form of loops from the original code for no-overhead implementation in combinational hardware functions allows the opportunity for large speedups over traditional or VLIW processing alone. We have mapped a coarse-grain computational structure on top of the fine-grain FPGA fabric for implementation of hardware functions. In particular, this hardware fabric is coarse-grained and takes advantage of extremely low-latency DSP (multiply-accumulate) blocks implemented directly in silicon. Because the fabric is combinational, no overhead from nonuniform or slow datapath stages is introduced.

For implementation, we selected an Altera Stratix II EP2S180F1508C4, in part for its high density of sophisticated DSP multiply-accumulate blocks and the FPGA's rapidly maturing tool flow that eventually permits fine-grain control over routing layouts of the critical paths. The FPGA is useful beyond prototyping, capably supporting deployment with a maximum internal clock speed of 420 MHz, dependent on the interconnect of the design and on-chip resource utilization. For purposes of comparing performance, we compare our FPGA implementation against our implementation of the Altera NIOS II soft core processor.

3. RELATED WORK

Manual hardware acceleration has been applied to countless algorithms and is beyond enumeration here. These systems generally achieve significant speedups over their software counterparts. Behavioral and high-level synthesis techniques attempt to leverage hardware performance from different levels of behavioral algorithmic descriptions. These different representations can be from hardware description languages (HDLs) or software languages such as C, C++, and Matlab.

The HardwareC language is a C-like HDL used by the Olympus synthesis system at Stanford [6]. This system uses high-level synthesis to translate algorithms written in HardwareC into standard cell ASIC netlists. Esterel-C is a system-level synthesis language, developed at Cadence Berkeley Laboratories, that combines C with the Esterel language for specifying concurrency, waiting, and pre-emption [7]. The SPARK synthesis engine from UC Irvine translates algorithms written in C into hardware descriptions, emphasizing extraction of parallelism in the synthesis flow [8, 9]. The PACT behavioral synthesis tool from Northwestern University translates algorithms written in C into synthesizable hardware descriptions that are optimized for low power as well as performance [10, 11].

In industry, several tools exist which are based on behavioral synthesis. The Behavioral Compiler from Synopsys translates applications written in SystemC into netlists targeting standard cell ASIC implementations [12, 13]. SystemC is a set of libraries designed to provide HDL-like functionality within the C++ language for system-level synthesis [14]. Synopsys cancelled its Behavioral Compiler because customers were unwilling to accept reduced quality of results compared to traditional RTL synthesis [15]. Forte Design Systems has developed the Cynthesizer behavioral synthesis tool, which translates hardware-independent algorithm descriptions in C and C++ into synthesizable hardware descriptions [16]. Handel-C is a C-like design language from Celoxica for system-level synthesis and hardware/software co-design [17].

appropriate output for storage into internal storage elements. ADDs are most commonly used for automated generation of test patterns for circuit verification [23, 24]. Our technique is not limited to decisions saved to internal storage, which imply sequential circuits. Rather, our technique applies hardware predication at several levels within a combinational (i.e., DFG) representation.

The support of custom instructions for interface with coprocessor arrays and CPUs has developed into a standard feature of soft-core processors, particularly those designed for DSP and multimedia applications. Coprocessor arrays have been studied for their impact on speech coders [25, 26], video encoders [27, 28], and general vector-based signal processing [29–31].

These coprocessor systems often assume the presence of and an interface to a general-purpose processor, such as a bus. Additionally, processors that support custom instructions for interface to coprocessor arrays are often soft-core and run at significantly slower clock rates than hard-core processors. Our processor is fully deployed on an FPGA system with detailed post place-and-route performance characterization. Our processor does not have the performance bottleneck associated with a bus interconnect but directly connects the hardware unit to the register file. There is no additional overhead associated with calling a hardware function.

Several projects have experimented with reconfigurable functional units for hardware acceleration. PipeRench [32–36] and more recently HASTE [37] have explored implementing computational kernels on coarse-grained reconfigurable fabrics for hardware acceleration. PipeRench utilizes a pipeline of subword ALUs that are combined to form 32-bit
Accelchip provides the AccelFPGA product, operations. The limitation of this approach is the require- which translates Matlab programs into synthesizable VHDL ment of pipelining as more complex operations require mul- for synthesis on FPGAs [18]. This technology is based on tiple stages and, thus, incur latency. In contrast, we are us- the MATCH project at Northwestern [19]. Catapult C from ing non-clocked hardware functions that represent numer- Mentor Graphics Corporation translates a subset of untimed ous 32-bit operations. RaPid [38–42] is a coarse-grain re- C++ directly into hardware [20]. configurable datapath for hardware acceleration. RaPid is a The difference between these projects and our technique datapath-based approach and also requires pipelining. Ma- is that they try to solve the entire behavioral synthesis prob- trix [43] is a coarse-grained architecture with an FPGA like lem. Our approach utilizes a 4-wide VLIW processor to ex- interconnect. Most FPGAs offer this coarse-grain support ecute nonkernel portions of the code (10% of the execution with embedded multipliers/adders. Our approach, in con- time) and utilizes tightly coupled hardware acceleration us- trast, reduces the execution latency and, thus, increases the ing behavioral synthesis of kernel portions of the code (90% throughput of computational kernels. of the execution time). We match the available hardware re- Several projects have attempted to combine a reconfig- sources to the impact on the application performance so that urable functional unit with a processor. The Imagine pro- our processor core utilizes 10% or less of the hardware re- cessor [44–46] combines a very wide SIMD/VLIW processor sources leaving 90% or more to improve the performance of engine with a host processor. Unfortunately, it is difficult to the kernels. achieve efficient parallelism through high ILP due to many Our synthesis flow utilizes a DFG representation that in- types of dependencies. 
Our processor architecture differs as cludes hardware predication: a technique to convert control it uses a flexible combinational hardware flow for kernel ac- flow based on conditionals into multiplexer units that select celeration. from two inputs from this conditional. This technique is sim- The Garp processor [47–49] combines a custom recon- ilar to assignment decision diagram (ADD) representation figurable hardware block with a MIPS processor. In Garp, [21, 22], a technique to represent functional register transfer the hardware unit has a special purpose connection to the level (RTL) circuits as an alternative to control and data flow processor and direct access to the memory. The Chimaera graphs (CDFGs). ADDs read from a set of primary inputs processor [50, 51] combines a reconfigurable functional unit (generally registers) and compute a set of logic functions. with a register file with a limited number of read and write A conditional called an assignment decision then selects an ports. Our system differsasweuseaVLIWprocessorinstead Raymond R. Hoare et al. 5

Our system differs as we use a VLIW processor instead of a single processor, and our hardware unit connects directly to all registers in the register file for both reading and writing, allowing hardware execution with no overhead. These projects also assume that the hardware resource must be reconfigured to execute a hardware-accelerated kernel, which may require significant overhead. In contrast, our system configures the hardware blocks prior to runtime and uses multiplexers to select between them at runtime. Additionally, our system is physically implemented in a single FPGA device, while it appears that Garp and Chimaera were studied in simulation only.

[Figure 2 diagram: instruction RAM and instruction decoder feeding a row of ALUs with custom-instruction MUXes, a controller, and a shared register file.]

Figure 2: Very long instruction word architecture.

In previous work, we created a 64-way and an 88-way SIMD architecture and interconnected the processing elements (i.e., the ALUs) using a hypercube network [52]. This architecture was shown to have a modest degradation in performance as the number of processors scaled from 2 to 88. The instruction broadcasting and the communication routing delay were the only components that degraded the scalability of the architecture. The ALUs were built using embedded ASIC multiply-add circuits and were extended to include user-definable instructions that were implemented in FPGA gates. However, one limitation of a SIMD architecture is the requirement for regular instructions that can be executed in parallel, which is not the case for many signal processing applications. Additionally, explicit communications operations are necessary.

Work by industry researchers [53] shows that coupling a VLIW with a reconfigurable resource offers the robustness of a parallel, general-purpose processor with the accelerating power and flexibility of a reprogrammable systolic grid. For purposes of extrapolation, the cited research assumes the reconfiguration penalty of the grid to be zero and that design automation tools tackle the problem of reconfiguration. Our system differs because the FPGA resource can be programmed prior to execution, giving us a more realistic reconfiguration penalty of zero. We also provide a compiler and automation flow to map kernels onto the reconfigurable device.

4. ARCHITECTURE

The architecture we are introducing is motivated by four factors: (1) the need to accelerate applications within a single chip, (2) the need to handle real applications consisting of thousands of lines of C source code, (3) the need to achieve speedup when parallelism does not appear to be available, and (4) the fact that the size of FPGA resources continues to grow, as does the complexity of fully utilizing these resources.

Given these needs, we have created a VLIW processor from the ground up and optimized its implementation to utilize the DSP Blocks within an FPGA. A RISC instruction set from a commercial processor was selected to validate the completeness of our design and to provide a method of determining the efficiency of our implementation.

In order to achieve custom hardware speeds, we enable the integration of hardware and software within the same processor architecture. Rather than adding a customized coprocessor to the processor's I/O bus that must be addressed through a memory addressing scheme, we integrated the execution of the hardware blocks as if it were a custom instruction. However, we have termed the hardware blocks hardware functions because they perform the work of tens to hundreds of assembly instructions. To eliminate data movement, our hardware functions share the register file with the processor and, thus, the overhead involved in calling a hardware function is exactly that of an inlined software function.

These hardware functions can take multiple cycles and are scheduled as if they were just another software instruction. The hardware functions are purely combinational (i.e., not internally registered) and receive their data inputs from the register file and return computed data to the register file. They contain predication operations and are the hardware equivalent of tens to hundreds of assembly instructions. These features enable large speedup with zero-overhead hardware/software switching. The following three subsections describe each of the architectural components in detail.

From Amdahl's Law of speedup, we know that even if we infinitely speed up 90% of the execution time, we will have a maximum speedup of 10X if we ignore the remaining 10% of the time. Thus, we have taken a VLIW architecture as the baseline processor and sought to increase its width as much as possible within an FPGA. An in-depth analysis and performance results show the limited scalability of a VLIW processor within an FPGA.

4.1. VLIW processor

To ensure that we are able to compile any C software codes, we implemented a sequential processor based on the NIOS II instruction set. Thus, our processor, pNIOS II, is binary-code-compatible with the Altera NIOS II soft-core processor. The branch prediction unit and the register windowing of the Altera NIOS II have not been implemented at the time of this publication.

In order to expand the problem domains that can be improved by parallel processing within a chip, we examined the scalability of a VLIW architecture for FPGAs. As shown in Figure 2, the key differences between VLIWs and SIMDs or MIMDs are the wider instruction stream and the shared register file, respectively. The ALUs (also called PEs) can be identical to those of their SIMD counterparts. Rather than having a single instruction executed each clock cycle, a VLIW can execute P operations for a P-processor VLIW.

We designed and implemented a 32-bit, 6-stage pipelined soft-core processor that supports the full NIOS II instruction set, including custom instructions.

[Figure 3 diagram: a large grid of functional units (FU), grouped into application-specific hardware functions, surrounding the VLIW core of instruction RAM, instruction decoder, controller, shared register file, and four ALUs with custom-instruction MUXes.]

Figure 3: The VLIW processor architecture with application-specific hardware functions.

The single processor was then configured in a 4-wide VLIW processor using a shared register file. The shared 32-element register file has 8 read ports and 4 write ports. There is also a 16 KB dual-ported memory accessible to 2 processing elements (PEs) in the VLIW, and a single 128-bit-wide instruction ROM. An interface controller arbitrates between software and hardware functions as directed by the custom instructions.

We targeted our design to the Altera Stratix II EP2S180F1508C4 FPGA with a maximum internal clock rate of 420 MHz. The EP2S180F has 768 9-bit embedded DSP multiply-adders and 1.2 MB of available memory. The single processor was iteratively optimized to the device based on modifications to the critical path. The clock rate sustained increases to its present 4-wide VLIW rate of 166 MHz.

4.2. Zero-cycle overhead hardware/software interface

In addition to interconnecting the VLIW processors, the register file is also available to the hardware functions, as shown by an overview of the processor architecture in Figure 3 and through a register file schematic in Figure 4. By enabling the compiler to schedule the hardware functions as if they were software instructions, there is no need to provide an additional hardware interface. The register file acts as the data buffer, as it normally does for software instructions. Thus, when a hardware function needs to be called, its parameters are stored in the register file for use by the hardware function. Likewise, the return value of the hardware function is placed back into the register file.

The gains offered by a robust VLIW supporting a large instruction set come at a price in the performance and area of the design. The number of ports to the shared register file and the instruction decode logic have been shown in our tests to be the greatest limitations to VLIW scalability. A variable-sized register file is shown in Figure 4.

In Figure 4, P processing elements interface to N registers. Multiplexing breadth and width pose the greatest hindrances to clock speed in a VLIW architecture. We tested the effect of multiplexers by charting the performance impact of increasing the number of ports on a shared register file, an expression of increasing VLIW width.

In Figure 5, the number of 32-bit registers is fixed to 32 and the number of processors is scaled. For each processor, two operands need to be read and one written per cycle. Thus, for P processors there are 2P read ports and P write ports.

[Figure 4 diagram: write MUXes WrMUX0 through WrMUX(N − 1), with write selects and write enables, feed registers Reg0 through Reg(N − 1); read MUXes RdMUX0 through RdMUX(P − 1), with read selects, feed processing elements PE0 through PE(P − 1).]

Figure 4: N-element register file supporting P-wide VLIW with P read ports and P write ports.

[Figure 5 chart: 32-element register file performance and area versus number of processors (2, 4, 8, 16). Performance drops from roughly 257 MHz at 2 processors to 69-91 MHz at 16, while area grows from roughly 2600 ALUTs (1% of the device) to over 23,000 ALUTs (16%).]

Figure 5: Scalability of a 32-element register file for P processors having 2P read and P write ports. Solid lines are for just a VLIW while dashed lines include access for SuperCISC hardware functions. (∗Area normalized as percentage of area of 16 processor register file; ∗∗performance normalized as percentage of performance of 2 processor register file.)

As shown, the performance steadily drops as the number of processors is increased. Additionally, the routing resources and logic resources required also increase.

From an analysis of the benchmarks we examined, we found an average ILP between 1 and 2 and concluded that a 4-way VLIW was more than sufficient for the 90% of the code that requires 10% of the time. We also determined that the critical path within the ALU was limited to 166 MHz, as seen in Table 1. The performance is limited by the ALU and not the register file. Scaling to an 8- or 16-way VLIW would decrease the clock rate of the design, as shown in Figure 5.

The multiplexer is the design unit that contributes most to the performance degradation of the register file as the VLIW scales. We measured the impact of a single 32-bit P-to-1 multiplexer on the Stratix II EP2S180. As the width P doubled, the area increased by a factor of 1.4x. The performance took the greatest hit of all our scaling tests, losing an average of 44 MHz per doubling, as shown in Figure 6. The performance degrades because the number of P-to-1 multiplexers increases to implement the read and write ports within the register file.

Table 1: Performance of instructions (Altera Stratix II FPGA EP2S180F1508C4).

Post-place-and-route results for ALU modules on EP2S180F1508C4

Module                                ALUTs             % Area   Clock     Latency
Adder/subtractor/comparator           96                < 1      241 MHz   4 ns
32-bit integer multiplier (1 cycle)   0 + 8 DSP units   < 1      322 MHz   3 ns
Logical unit (AND/OR/XOR)             96                < 1      422 MHz   2 ns
Variable left/right shifter           135               < 1      288 MHz   4 ns
Top ALU (4 modules above)             416 + DSP units   < 1      166 MHz   6 ns

[Figure 6 chart: normalized area and performance of a 32-bit P-to-1 multiplexer for P = 4 to 256; performance falls from 422 MHz toward roughly 156 MHz while area grows from under 200 ALUTs to over 1300 ALUTs, all under 1% of the device.]

Figure 6: Scalability of a 32-bit P-to-1 multiplexer on an Altera Stratix II (EP2S180F1508C4). (∗Area normalized as percentage of 256-to-1 multiplexer area; ∗∗performance normalized as percentage of 4-to-1 multiplexer performance.)

For an N-wide VLIW, the limiting factor will be the register file, which in turn requires 2N R-to-1 multiplexers, as each processor reads two registers from a register file with R registers. For the write ports, each of the R registers requires an N-to-1 multiplexer. However, as shown in Figure 5, the logic required for a 4-wide VLIW with 32 shared registers of 32 bits each only achieved 226 MHz, while the 32-to-1 multiplexer achieved 279 MHz. What is not shown is the routing. These performance numbers should be taken as minimums and maximums for the performance of the register file. We were able to scale our 4-way VLIW with 32 shared registers up to 166 MHz.

One technique for increasing the performance of shared register files for VLIW machines is partitioned register files [54]. This technique partitions the original register file into banks of limited-connectivity register files that are accessible by a subset of the VLIW processing elements. Busses are used to interconnect these partitions. For a register to be accessed by a processing element outside of the local partition, the data must be moved over a bus using an explicit move instruction. While we considered this technique, we did not employ register file partitioning in our processing scheme for several reasons: (1) the amount of ILP available from our VLIW compiler was too low to warrant more than a 4-way VLIW, (2) the nonpartitioned register file approach was not the limiting factor for performance in our 4-way VLIW implementation, and (3) our VLIW compiler does not support partitioned register files.

4.3. Achieving speedup through hardware functions

By using multicycle hardware functions, we are able to place hundreds of machine instructions into a single hardware function. This hardware function is then converted into logic and synthesized into hardware. The architecture interfaces an arbitrary number of hardware functions to the register file while the compiler schedules the hardware functions as if they were software.

Synchronous design is by definition inefficient. The entire circuit must execute at the rate of the slowest component. For a processor, this means that a simple left-shift requires as much time as a multiply. For kernel codes, this effect is magnified.

As a point of reference, we have synthesized various arithmetic operations for a Stratix II FPGA. The objective is not the absolute speed of the operations but the relative speed. Note that a logic operation can execute 5x faster than the entire ALU. Thus, by moving data flow graphs directly into hardware, the critical path from input to output is going to achieve large speedup. The critical path through a circuit is unlikely to contain only multipliers and is expected to comprise a variety of operations and, thus, will have a smaller delay than if it were executed on a sequential processor.

This methodology requires a moderate-sized data flow diagram. There are numerous methods for achieving this, and they will be discussed again in the following section. One method that requires hardware support is the predication operation. This operation is a conditional assignment of one register to another based on whether the contents of a third register is a "1." This simple operation enables the removal of jumps for if-then-else statements. In compiler terms, predication enables the creation of large data flow diagrams that exceed the size of basic blocks.

5. COMPILATION FOR THE VLIW PROCESSOR WITH HARDWARE FUNCTIONS

Our VLIW processor with hardware functions is designed to assist in creating a tractable synthesis tool flow, which is outlined in Figure 7.

[Figure 7 diagram: the C program is profiled; extracted loops pass through behavioral synthesis (HDL/DFG) and RTL synthesis to an FPGA bitstream, while the remaining C program passes through the Trimaran IR, the VLIW backend, assembly, and the VLIW assembler to machine code.]

Figure 7: Tool flow for the VLIW processor with hardware functions.

First, the algorithm is profiled using the Shark profiling tool from Apple Computer [4], which can profile programs compiled with the gcc compiler. Shark is designed to identify the computationally intensive loops.

The computational kernels discovered by Shark are propagated to a synthesis flow that consists of two basic stages. First, a set of well-understood compiler transformations, including function inlining, loop unrolling, and code motion, is used to segregate the loop control and memory accesses from the computation portion of the kernel. The remaining code, including the loop control and memory access portions of the computational kernels, is passed through the Trimaran VLIW compiler [55] for execution on the VLIW processor core. Trimaran was extended to generate assembly for a VLIW version of the NIOS II instruction set architecture. This code is assembled by our own VLIW assembler.

5.1. Performance code profiling

The Shark profiling tool is designed to discover the loops that contribute the most to the total program execution time. The tool returns results such as those seen in Algorithm 1. These are the top two loops from the G.721 MediaBench benchmark, which together total nearly 70% of the program execution time.

[Algorithm 1 listing: Shark profile of G.721, showing the accumulation loop in predictor zero(), for (i = 1; i < 6; i++) /* ACCUM */ sezi += fmult(state ptr->b[i] >> 2, state ptr->dq[i]);, and the comparison loop in quan() as the top two contributors to execution time.]

After profiling, the C program is modified to include directives within the code to signal which portions of the code had been detected to be computational kernels during the profiling. As seen in Algorithm 2, the computational kernel portions are enclosed with the #pragma HW START and #pragma HW END directives to denote the beginning and ending of the kernel, respectively. The compiler uses these directives to identify the segments of code to implement in custom hardware.

[Algorithm 2 listing: the same two loops from predictor zero() and quan(), each enclosed by #pragma HW START and #pragma HW END.]

Algorithm 2: Code excerpt from Algorithm 1 after insertion of directives to outline computational kernels that are candidates for custom hardware implementation.

5.2. Compiler transformations for synthesis

Synthesis from behavioral descriptions is an active area of study, with many projects that generate hardware descriptions from a variety of high-level languages and other behavioral descriptions; see Section 3. However, synthesis of combinational logic from properly formed behavioral descriptions is significantly more mature than the general case and can produce efficient implementations. Combinational logic, by definition, does not contain any timing or storage constraints but defines the output as purely a function of the inputs.

Sequential logic, on the other hand, requires knowledge of timing and prior inputs to determine the output values.

Our synthesis technique relies only on combinational logic synthesis and creates a tractable synthesis flow. The compiler generates data flow graphs (DFGs) that correspond to the computational kernel and, by directly translating these DFGs into a hardware description language like VHDL, these DFGs can be synthesized into entirely combinational logic for custom hardware execution using standard synthesis tools.

Figure 8 expands the behavioral synthesis block from Figure 7 to describe in more detail the compilation and synthesis techniques employed by our design flow to generate the hardware functions. The synthesis flow is comprised of two phases. Phase 1 utilizes standard compiler techniques operating on an abstract syntax tree (AST) to decouple loop control and memory accesses from the computation required by the kernel, which is shown on the left side of Figure 8. Phase 2 generates a CDFG representation of the computational code alone and uses hardware predication to convert this into a single DFG for combinational hardware synthesis.

[Figure 8 diagram: phase 1 (left) takes the kernel AST through DU analysis, inlining and unrolling, and code motion to an AST plus data flow with a 32-load/16-store window; phase 2 (right) generates a CDFG, applies hardware predication to produce a DFG with predication, and generates HDL, yielding a combinational hardware description, while HW/SW partitioning leaves the outer loop shell, including loads and stores, in software.]

Figure 8: Description of the compilation and synthesis flow for portions of the code selected for custom hardware acceleration. Items on the left side are part of phase 1, which uses standard compiler transformations to prepare the code for synthesis. Items on the right side manipulate the code further using hardware predication to create a DFG for hardware implementation.

5.2.1. Compiler transformations to restructure code

The kernel portion of the code is first compiled using the SUIF (Stanford University Intermediate Format) compiler. This infrastructure provides an AST representation of the code and facilities for writing compiler transformations to operate on the AST. The code is then converted to SUIF2, which provides routines for definition-use analysis.

Definition-use (DU) analysis, shown as the first operation in Figure 8, annotates the SUIF2 AST with information about how each symbol (e.g., a variable from the original code) is used. Specifically, a definition refers to a symbol that is assigned a new value (i.e., a variable on the left-hand side of an assignment) and a use refers to an instance in which that symbol is used in an instruction (e.g., in an expression or on the right-hand side of an assignment). The lifetime of a symbol consists of the time from the definition until the final use in the code.

The subsequent compiler pass, as shown in Figure 8, inlines functions within the kernel code segment to eliminate artificial basic block boundaries and unrolls loops to increase the amount of computation for implementation in hardware. The first function from Algorithm 2, predictor zero(), calls the fmult() function shown in Algorithm 3. The fmult() function calls the quan() function, which was also one of our top loops from Shark. Even though quan() is called (indirectly) by predictor zero(), Shark reports execution for each loop independently. Thus, by inlining quan(), the subsequent code segment includes nearly 70% of the program's execution time. The computational kernel after function inlining is shown in Algorithm 4. Note that the local symbols from the inlined functions have been renamed by prepending the function name to avoid conflicting with local symbols in the caller function.

1. fmult(int an, int srn) {
2.   short anmag, anexp, anmant;
3.   short wanexp, wanmag, wanmant;
4.   short retval;
5.   anmag = (an > 0) ? an : ((-an) & 0x1FFF);
6.   anexp = quan(anmag, power2, 15) - 6;
7.   anmant = (anmag == 0) ? 32 : (anexp >= 0) ? anmag >> anexp : anmag << -anexp;
8.   wanexp = anexp + ((srn >> 6) & 0xF) - 13;
9.   wanmant = (anmant * (srn & 077) + 0x30) >> 4;
10.  retval = (wanexp >= 0) ? ((wanmant << wanexp) & 0x7FFF) : (wanmant >> -wanexp);
11.  return (((an ^ srn) < 0) ? -retval : retval);
12. }

Algorithm 3: Fmult function from G.721 benchmark.

    for (i = 0; i < 6; i++) {
        // begin fmult
        fmult_an = state_ptr->b[i] >> 2;
        fmult_srn = state_ptr->dq[i];
        fmult_anmag = (fmult_an > 0) ? fmult_an : ((-fmult_an) & 0x1FFF);
        // begin quan
        quan_table = power2;
        for (quan_i = 0; quan_i < 15; quan_i++)
            if (fmult_anmag < *quan_table++)
                break;
        fmult_anexp = quan_i;
        // end quan
        fmult_anmant = (fmult_anmag == 0) ? 32 :
            (fmult_anexp >= 0) ? fmult_anmag >> fmult_anexp
                               : fmult_anmag << -fmult_anexp;
        fmult_wanexp = fmult_anexp + ((fmult_srn >> 6) & 0xF) - 13;
        fmult_wanmant = (fmult_anmant * (fmult_srn & 077) + 0x30) >> 4;
        fmult_retval = (fmult_wanexp >= 0)
            ? ((fmult_wanmant << fmult_wanexp) & 0x7FFF)
            : (fmult_wanmant >> -fmult_wanexp);
        sezi += (((fmult_an ^ fmult_srn) < 0) ? -fmult_retval : fmult_retval);
        // end fmult
    }

Algorithm 4: G.721 code after function inlining.

Once function inlining is completed, the inner loop is examined for implementation in hardware. By unrolling this loop, it is possible to increase the amount of code that can be executed in a single invocation of the hardware function. The number of loop iterations that can be unrolled is limited by the number of values that must be passed into the hardware function through the register file. In the example from Algorithm 4, each loop iteration requires a value loaded from memory, *quan_table, and a comparison with the symbol fmult_anmag. Because there are 15 iterations, complete unrolling results in a total of 16 reads from the register file. The resulting unrolled loop is shown in Algorithm 5.

    if (fmult_anmag < *quan_table)
        quan_i = 0;
    else if (fmult_anmag < *(quan_table + 1))
        quan_i = 1;
    else if (fmult_anmag < *(quan_table + 2))
        quan_i = 2;
    ...
    else if (fmult_anmag < *(quan_table + 14))
        quan_i = 14;

Algorithm 5: Unrolled inner loop of the inlined G.721 hardware kernel.

Once the inner loop is completely unrolled, the outer loop may be considered for unrolling. In the example, several values, such as the array reads, must be passed through the register file beyond the 16 required by the inner loop, preventing the outer loop from being unrolled. However, with a larger register file or special registers dedicated to hardware functions, this loop could be unrolled as well. After unrolling and inlining are completed, there is a maximum of 32 values that can be read from the register file and 16 values that can be written to the register file.

The next phase of the compilation flow uses code motion to move all memory loads to the beginning of the hardware function and all memory stores to its end. This is done so as not to violate any data dependencies discovered during definition-use (DU) analysis. The loads in the unrolled code of Algorithm 5 are from the array quan_table, which is defined prior to the hardware kernel code. Thus, loading the first 15 elements of quan_table can be moved to the beginning of the hardware function code and stored in static symbols mapped to registers that are then used by the unrolled inner loop code. This is possible for all array accesses within the hardware kernel code for G.721. The hardware kernel code after code motion is shown in Algorithm 6.

    for (i = 0; i < 6; i++) {
        quan_table_array_0 = *quan_table;
        quan_table_array_1 = *(quan_table + 1);
        ...
        quan_table_array_14 = *(quan_table + 14);
        state_pointer_b_array_i = state_ptr->b[i];
        state_pointer_dq_array_i = state_ptr->dq[i];
        // Begin Hardware Function
        fmult_an = state_pointer_b_array_i >> 2;
        fmult_srn = state_pointer_dq_array_i;
        if (fmult_anmag < quan_table_array_0)
            quan_i = 0;
        else if (fmult_anmag < quan_table_array_1)
            quan_i = 1;
        else if (fmult_anmag < quan_table_array_2)
            quan_i = 2;
        ...
        else if (fmult_anmag < quan_table_array_14)
            quan_i = 14;
        ...
        // End Hardware Function
    }

Algorithm 6: G.721 benchmark after inlining, unrolling, and code motion compiler transformations. (Hardware functionality is in plain text; in the original, the VLIW software portions are highlighted with a gray background. Here the loop control and the loads above "Begin Hardware Function" are the software portion.)

As shown in Algorithm 6, the resulting code after DU analysis, function inlining, loop unrolling, and code motion is partitioned between hardware and software implementation. The partitioning decision is made statically such that all code required to maintain the loop (e.g., loop induction variable calculation, bounds checking, and branching) and all code required to perform memory loads and stores is executed in software, while the remaining code is implemented in hardware. This distinction is shown in Algorithm 6.

5.2.2. Synthesis of core computational code

    if (bufferstep) {
        delta = inputbuffer & 0xf;
    } else {
        inputbuffer = *inp++;
        delta = (inputbuffer >> 4) & 0xf;
    }

(a)

Once hardware and software partitioning decisions are made as described in Section 5.2.1, the portion of the code selected for implementation in hardware is converted into a CDFG representation. This representation contains a series of basic blocks interconnected by control flow edges; each basic block boundary represents a conditional branch operation within the original code. Creation of a CDFG representation from a high-level language is a well-studied technique beyond the scope of this paper; details on the creation of these graphs can be found in [6].

In order to implement the computation contained within the computational kernel, the control portions of the CDFG must be converted into data flow dependencies. This allows basic blocks, which were previously separated by control flow dependency edges, to be merged into larger basic blocks, which form larger DFGs. If all the control flow dependencies can be successfully converted into data flow dependencies, the entire computational portion of the kernel can be represented as a single DFG. As a result, the DFG can be trivially transformed into a combinational hardware implementation, in our case using VHDL, and can be synthesized and mapped efficiently into logic within the target FPGA using existing synthesis tools.

Our technique for converting these control flow dependencies into data flow dependencies is called hardware predication. This technique is similar to the ADDs developed as an alternate behavioral representation for synthesis flows; see Section 3. Consider a traditional if-then-else conditional construct written in C code.
In hardware, an if-then-else conditional statement can be implemented using a multiplexer acting as a binary switch between predicated output datapaths. To execute the same code in software, an if-then-else statement is implemented as a stream of six instructions composed of comparisons and branch statements. Figure 9 shows several different representations of a segment of the kernel code from the ADPCM encoder benchmark. Figure 9(a) lists the C code, Figure 9(b) shows the corresponding CDFG representation of the code segment, and Figure 9(c) presents a data flow diagram for a 2:1 hardware predication (e.g., multiplexer) equivalent of the CDFG from Figure 9(b).

Figure 9: Software code, CDFG, and DFG with predicated hardware example for control flow in ADPCM encoder.

In the example from Figure 9, the then part of the code from Figure 9(a) is converted into the then basic block in Figure 9(b). Likewise, the statements from the else portion in Figure 9(a) are converted into the else basic block in Figure 9(b). The CDFG in Figure 9(b) shows that the control flow from the if-then-else construction creates basic block boundaries with control flow edges. The hardware predication technique converts these control flow dependencies into data flow dependencies, allowing the CDFG in Figure 9(b) to be transformed into the DFG in Figure 9(c). Each symbol with a definition in either or both of the basic blocks following the conditional statement (i.e., the then and else blocks from Figure 9(b)) must be predicated by inserting a multiplexer. For example, in Figure 9, the symbol delta is defined in both blocks, and these definitions become inputs to the rightmost selection multiplexer in Figure 9(c). The symbol inp is updated only in the else basic block in Figure 9(b). This requires the leftmost multiplexer in Figure 9(c), where the original value from prior to the condition and the updated value from the else block become inputs. All of the multiplexers instantiated by the conversion of these control flow edges into data flow edges are driven by the conditional operation from the if basic block in Figure 9(b).

By implementing the logic in this manner, the six clock cycles required for execution in the processor can be reduced to two levels of combinational logic in hardware. In the example of Figure 9, the assembly code requires as many as nine (9) cycles if the else path is selected, but the hardware version can be implemented as two levels of combinational logic (constant shifts are implemented as wires).
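In software terms, hardware predication replaces the branch with unconditional evaluation of both arms followed by 2:1 selects. The following C sketch models the Figure 9 fragment this way (it models the hardware behavior; it is not the generated VHDL, and the function names are invented for the example):

```c
#include <assert.h>

/* Branchy original (cf. Figure 9(a): ADPCM input-nibble unpacking). */
static int unpack_branchy(int bufferstep, int *inputbuffer,
                          const unsigned char **inp) {
    int delta;
    if (bufferstep) {
        delta = *inputbuffer & 0xf;
    } else {
        *inputbuffer = *(*inp)++;
        delta = (*inputbuffer >> 4) & 0xf;
    }
    return delta;
}

/* Predicated form: compute both arms, then select with 2:1 "muxes". */
static int unpack_predicated(int bufferstep, int *inputbuffer,
                             const unsigned char **inp) {
    int then_delta = *inputbuffer & 0xf;            /* then arm */
    int else_buf   = **inp;                         /* else arm load  */
    int else_delta = (else_buf >> 4) & 0xf;
    int cond = (bufferstep != 0);                   /* mux select */
    *inputbuffer = cond ? *inputbuffer : else_buf;  /* mux: inputbuffer */
    *inp        += cond ? 0 : 1;                    /* mux: inp pointer */
    return cond ? then_delta : else_delta;          /* mux: delta */
}
```

Each ternary corresponds to one multiplexer in the DFG, all driven by the single comparison bufferstep != 0, so the whole fragment collapses into combinational logic.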

In many cases, this type of hardware predication works in the general case and creates efficient combinational logic with moderate performance results. However, in some special cases, control flow can be further optimized for combinational hardware implementation. In C, switch statements, sometimes called multiway branches, can be handled specially. While this construct can be interpreted sequentially to execute the C code, directly realizing it with multiplexing hardware containing as many inputs as there are cases in the original code allows entirely combinational, parallel execution. A second special case exists for the G.721 example described in Section 5.2.1. Consider the unrolled innermost loop shown in Algorithm 6. This code follows the construction if (cond1), else if (cond2), ..., else if (condN). This is similar to the behavior of a priority encoder in combinational hardware, where each condition has a priority, such as high bit significance overriding lower bit significance. For example, in a one-hot priority encoder, if the most significant bit (MSB) is "1", then all other bits are ignored and treated as zeros. If the MSB is "0" and the next MSB is "1", then all other bits are assumed "0". This continues down to the least significant bit. When this type of conditional is written in a similar style in synthesizable HDL, synthesis tools will implement a priority encoder, just as a case statement in HDL implements a multiplexer. Thus, for the cases where this type of code is present, either the multiplexer or the priority encoder structure is retained.

5.3. Interfacing hardware and software

In our architecture, a hardware function can be called with no additional overhead versus executing the code directly in software. In general, however, the impact of even a small hardware/software interface overhead can dramatically reduce the speedup that a kernel achieves: some of the speed benefit gained from acceleration is lost to the interface overhead.

Consider (1), where β is the hardware speedup defined as the ratio of software to hardware execution time. This equation only considers hardware acceleration and does not equate directly to kernel speedup. In (2), α is the actual kernel speedup, as it accounts for the portion of the kernel that cannot be translated to hardware. This portion is labeled as overhead (OH). The label is actually a misnomer, as it implies that there is an added overhead for running our kernel hardware. In fact, this "overhead" consists of the same loads and stores that would be run in the software-only solution. No additional computation is added:

    β = t_sw / t_hw,                                      (1)
    α = t_sw / (t_OH + t_hw) = β / (t_OH / t_hw + 1).     (2)

Figure 10 shows the effect of adding 0 to 128 cycles of hardware/software overhead on a set of hardware-accelerated kernels. We explain how these speedups are achieved later in this paper and focus here on the impact of data movement overhead. A zero overhead is the pure speedup of the hardware versus the software. Note that even 2 software cycles of overhead, perhaps caused by a single I/O write and one I/O read, cause the effective kernel speedup to be cut in half. For a bus-based system, tens of processor cycles of latency dramatically diminish the benefit of hardware acceleration. Thus, by enabling direct data sharing through the register file, our architecture does not incur any penalty.

Figure 10: Real speedup of hardware benchmark functions compared to software execution given varying interface latencies (0 to 128 cycles). Zero-latency speedups: G.721 273x, IDCT column 76x, IDCT row 44x, ADPCM encode 18x, ADPCM decode 17x.

6. BENCHMARKS

To evaluate the effectiveness of our approach for signal processing applications, we selected a set of core signal processing benchmarks. We examine algorithms of interest in signal processing from three categories: voice compression, image and video coding, and wireless communication. The following sections describe selected benchmarks in these domains and specifically examine benchmark codes selected from each domain. Except for the so-called sphere decoder, the software codes examined in the following sections were taken from the MediaBench benchmark suite [57].

Table 2 shows the execution-time contribution of the computational kernels of signal-processing-oriented benchmarks from MediaBench. For example, the ADPCM encode and decode kernels contribute nearly the entirety of the application execution time. The top kernel of both the G.721 and GSM benchmarks requires over 70% of the execution time.

Table 2: Execution profile of benchmarks.

    Benchmark        Kernel 1    Kernel 2    Total
    ADPCM decode     99.9%       N/A         99.9%
    ADPCM encode     99.9%       N/A         99.9%
    G.721 decode     70.5%       N/A         70.5%
    GSM decode       71.0%       N/A         71.0%
    MPEG 2 decode    21.5%       21.4%       42.9%

The MPEG 2 decoder requires two separate loop kernels, which together account for less than 50% of the execution time.

The ILP of the benchmarks is shown in Table 3. The ILP numbers are broken into four groups: the ILP for the computational kernel of highest complexity (kernel 1); that for the next highest kernel (kernel 2), which is only necessary for the MPEG 2 benchmark; the ILP of the nonkernel software code; and, finally, a nonweighted average ILP for the entire application. All numbers were reported for a standard 4-way VLIW processor, as implemented in our system, and compared against a theoretical unlimited-way VLIW processor.

Table 3: Instruction level parallelism (ILP) extracted using the Trimaran compiler.

    Benchmark           Kernel 1   Kernel 2   Nonkernel   Avg
    ADPCM decode
      4-way VLIW        1.13       N/A        1.23        1.18
      Unlimited VLIW    1.13       N/A        1.23        1.18
    ADPCM encode
      4-way VLIW        1.28       N/A        1.38        1.33
      Unlimited VLIW    1.28       N/A        1.38        1.33
    G.721 decode
      4-way VLIW        1.25       N/A        1.32        1.28
      Unlimited VLIW    1.41       N/A        1.33        1.37
    GSM decode
      4-way VLIW        1.39       N/A        1.25        1.32
      Unlimited VLIW    1.39       N/A        1.25        1.32
    MPEG 2 decode
      4-way VLIW        1.68       1.40       1.41        1.54
      Unlimited VLIW    1.84       1.50       1.46        1.67

This limited ILP shows that VLIW processing alone can provide only a nominal performance improvement. The range of speedups possible is 20–60% overall, which is far below our target for these applications. To discover how speedups can be achieved through hardware functions in our system, we begin by examining our algorithms, specifically the computational kernel codes below.

6.1. Voice compression algorithms

We chose three representative voice compression algorithms as benchmarks. These were drawn from various application areas in voice compression and reflect quite different coding algorithms. In each case, the C-language implementation benchmark came from the MediaBench suite. We have purposefully chosen well-established implementations to demonstrate the practical performance gains immediately available to the system designer through our approach.

The first system we examined was the International Consultative Committee on Telegraphy and Telephony (CCITT) G.721 standard, which employs adaptive differential pulse code modulation (ADPCM) to compress toll-quality audio signals down to 32 kbps [57]. The G.721 audio compression standard is employed in most European cordless telephones.

We next consider CCITT-ADPCM, a different ADPCM implementation that is recommended by the IMA Digital Audio Technical Working Group. The algorithm takes 16-bit PCM samples and compresses them to 4-bit ADPCM samples, yielding a compression ratio of 4:1.

The last speech compression algorithm we consider is the GSM 06.10 standard specified for use with the global system for mobile telecommunication (GSM) wireless standard. In the GSM 06.10 standard, residual pulse excitation/long-term prediction (RPE-LTP) is used to encode the speech signal at a compression ratio of 8:1. The linear prediction engine runs Schur recursions, which the package designer argued yield some performance advantages over the usual Levinson-Durbin algorithm when parallelized [58].

One of the significant bottlenecks to accelerating algorithmic execution is control flow (e.g., determining the next operation to execute based on the result of previous operations). Algorithms high in control flow map very well to sequential processors, as these processors are highly optimized to execute sequential codes, achieving high throughputs and clock speeds through techniques like pipelined execution.

When implementing heavily control-oriented codes in hardware, sequential structures such as finite state machines (FSMs) are often used for this purpose. Unfortunately, these FSMs do not allow significantly more parallelism than running the code in a processor. To achieve a speedup using a VLIW processor, it is necessary to remove the control flow dependencies to allow parallel execution. In sequential processors, predication is used to convert many types of control flow to data flow dependencies.

Consider the ADPCM encoder shown in Algorithm 7. The for loop in the example consumes nearly 100% of the execution time (see Table 2). Excluding the control flow associated with the for loop, this code segment contains nine (9) conditional executions. These statements are candidates for predication.

To allow predicated execution in a processor, one or more predication registers are used. Conditional branch instructions are traditionally used to execute if statements. To use predication, these branch instructions are replaced by conditional operations followed by predicated instructions. For example, in Algorithm 7, line 7, the subtraction operation is only executed if diff >= step. Thus, the conditional is calculated and the result is stored in a predication register. The subtraction instruction can be issued, and the result will only be saved if the conditional is true. The same predication register can also be used for the addition operation in line 8. This type of predication allows increased ILP and reduces stalls in pipelined execution.

One of the restrictions we place on our hardware functions is that they consist entirely of combinational logic (i.e., they do not require sequential execution). As a result, we use a technique related to predication called parallel execution.
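The predication of lines 5–8 of Algorithm 7 can be emulated in C with a mask standing in for the predication register. This is a sketch of the idea, not the actual VLIW instruction sequence:

```c
#include <assert.h>

/* One quantization step of Algorithm 7 (lines 5-8) with a branch.
 * (Algorithm 7 writes delta = 4; since delta is 0 at that point,
 * delta |= 4 is equivalent and matches the later |= steps.) */
static void step_branchy(int *delta, int *diff, int *vpdiff, int step) {
    if (*diff >= step) {
        *delta |= 4;
        *diff -= step;
        *vpdiff += step;
    }
}

/* The same step, branch-free: the comparison result acts as the
 * predication register; -1 (all ones) enables the guarded updates,
 * 0 disables them. */
static void step_predicated(int *delta, int *diff, int *vpdiff, int step) {
    int p = -(*diff >= step);   /* predicate mask: 0 or all ones */
    *delta |= 4 & p;
    *diff -= step & p;
    *vpdiff += step & p;
}
```

The three guarded updates can now issue in the same VLIW word as the comparison's consumers, instead of stalling behind a branch.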

    1.  for ( ; len > 0; len--) {
    2.      val = *inp++;
    3.      delta = 0;
    4.      vpdiff = (step >> 3);
    5.      if (diff >= step) {
    6.          delta = 4;
    7.          diff -= step;
    8.          vpdiff += step;
    9.      }
    10.     step >>= 1;
    11.     if (diff >= step) {
    12.         delta |= 2;
    13.         diff -= step;
    14.         vpdiff += step;
    15.     }
    16.     step >>= 1;
    17.     if (diff >= step) {
    18.         delta |= 1;
    19.         vpdiff += step;
    20.     }
    21.     if (sign) valpred -= vpdiff;
    22.     else valpred += vpdiff;
    23.     if (valpred > 32767)
    24.         valpred = 32767;
    25.     else if (valpred < -32768)
    26.         valpred = -32768;
    27.     delta |= sign;
    28.     index += indexTable[delta];
    29.     if (index < 0) index = 0;
    30.     if (index > 88) index = 88;
    31.     step = stepsizeTable[index];
    32.     if (bufferstep) {
    33.         outputbuffer = (delta << 4) & 0xf0;
    34.     } else {
    35.         *outp++ = (delta & 0x0f) | outputbuffer;
    36.     }
    37.     bufferstep = !bufferstep;
    38. }

Algorithm 7: ADPCM encoder kernel C code. (Hardware functionality is in plain text; in the original, the VLIW software portions are highlighted with gray.)

For an if statement, both the then and else parts of the statement are executed and propagated down the DFG based on the result of the conditional. For example, the ADPCM encoder from Algorithm 7 was translated into the DFG shown in Figure 11. The blocks labelled MUX implement the combinational parallel execution. The conditional operation is used as the selector, and the two inputs contain the result of the "predicated" operation as well as the nonmodified result. Two other standard automation techniques were used to convert the code segment into the DFG. First, the load from memory *inp on line 2 and the predicated store *outp on line 35 of Algorithm 7 are moved to the beginning and end of the DFG, respectively, using code motion. This allows the loads and stores to be executed in software. All code executed in software is highlighted in Algorithm 7. Secondly, the static arrays indexTable and stepsizeTable are converted into lookup tables (LUTs) for implementation in ROM structures.

The computational kernel source code for the G.721 benchmark was used in the prior section to describe the various design automation phases. The resulting DFG for G.721 based on these transformations is displayed in Figure 12. It should be noted that the completely unrolled loop has been transformed into a priority encoder in the hardware implementation. It can also be seen that, aside from the encoder, there is only a moderate amount of computation.

6.2. The discrete cosine transform

We next consider a hardware implementation of the inverse discrete cosine transform (IDCT). The IDCT arises in several signal processing applications, most notably in image/video coding (see, e.g., the MPEG standard) and in more general time-frequency analysis. The IDCT is chosen because there has been a large amount of work on efficient algorithm design for such transforms. Our results argue that further gains are possible with relatively little additional design overhead by employing a mixed architecture.

The IDCT code was extracted from the MediaBench MPEG 2 benchmark, specifically from the MPEG 2 decoder used in a variety of applications, most notably for DVD movies. The IDCT also appears in the JPEG benchmark; however, the implementation from MPEG 2 was selected because it has the longer runtime. The implementation in MPEG 2 decomposes the IDCT into a row-wise and a column-wise IDCT.

The IDCT algorithm performs a two-dimensional IDCT through decomposition. It first executes a one-dimensional IDCT in one dimension (rows), followed by a one-dimensional IDCT in the other (columns). The IDCT column-wise decomposition kernel, with the software portions highlighted, is shown in Algorithm 8. Like the IDCT row-wise decomposition, the DFG in Figure 13 again contains a significant number of arithmetic functional units and shows a significant amount of parallelism.

6.3. The sphere decoder

The last application example we consider is the so-called sphere decoder, which arises in MIMO communication systems. The basic MIMO problem is the following: an M-tuple of information symbols s, drawn from the integer lattice Z^M, is transmitted from M transmit antennas to N receive antennas. We assume that the channel is "flat," meaning that the fading parameter h_{m,n} connecting transmit antenna m to receive antenna n can be modeled as a scalar, constant over the transmission. The model at the output of a bank of receiver matched filters (one per receive antenna) is simply

    y = Hs + n,                                           (3)

where n is modeled as zero-mean additive white Gaussian noise (AWGN) arising from receiver electronics noise and possibly from channel interference. We assume that the receiver can track the channel coefficients.
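To make the detection task concrete, the maximum-likelihood receiver for this model searches all Q^M candidate symbol vectors for the one minimizing the squared norm ||Hs - y||^2 — the exponential search that the sphere decoder is designed to avoid. A small brute-force C sketch (real-valued model, illustrative alphabet {-1, +1}; the function names and sizes are chosen for the example):

```c
#include <assert.h>

#define M 2                     /* transmit antennas */
#define N 2                     /* receive antennas  */

/* Squared detection norm ||H*s - y||^2 for one candidate s. */
static double norm2(double H[N][M], const int s[M], const double y[N]) {
    double acc = 0.0;
    for (int n = 0; n < N; n++) {
        double e = -y[n];
        for (int m = 0; m < M; m++)
            e += H[n][m] * s[m];
        acc += e * e;
    }
    return acc;
}

/* Exhaustive ML detection over the alphabet {-1,+1}^M.  With alphabet
 * size Q = 2 there are Q^M candidates -- exponential in M, which is
 * exactly why the sphere decoder's pruned tree search matters. */
static void ml_detect(double H[N][M], const double y[N], int best[M]) {
    double best_norm = 1e300;
    for (int code = 0; code < (1 << M); code++) {
        int s[M];
        for (int m = 0; m < M; m++)
            s[m] = (code >> m & 1) ? 1 : -1;
        double d = norm2(H, s, y);
        if (d < best_norm) {
            best_norm = d;
            for (int m = 0; m < M; m++) best[m] = s[m];
        }
    }
}
```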


Figure 11: Data flow graph for the ADPCM encoder.

Note that the use of a real-valued channel model loses no generality, since any complex model employing quadrature-amplitude modulation (QAM) at the transmitter can be reduced to a real model of twice the dimensions (see, e.g., [59]). At the receiver we seek the input s that minimizes the detection norm

    s_hat = arg min_s ||Hs - y||^2,                       (4)

which is generally an exponentially hard problem (i.e., we must consider Q^M possible signals when an alphabet of size Q is employed). The sphere decoder employs the Fincke-Pohst tree-based search algorithm over Z^M to reduce this to a detection rule that is roughly cubic in M for sufficiently large signal-to-noise ratios [60, 61]. It is expected to form the core of practical receivers for future MIMO systems.

Unlike the previous algorithms, the sphere decoder was written in vectorized Matlab rather than C. Matlab is a popular language for signal processing algorithm development because of its native matrix treatment and its built-in mathematical functions. Vectorized Matlab has another nice feature: it is easy to use for extracting parallelism, a main requirement for achieving performance improvements by direct hardware implementation.

The code in Algorithm 9 represents the computational core of the sphere decoder, executing a pointwise vector multiplication, a vector summation, and finally a distance calculation. The DFG for this function, displayed in Figure 14, includes two square functions and a square root.

7. RESULTS

To implement our architecture, including the hardware functions, several industry computer-aided design tools were used to accomplish the functional testing and gate-level synthesis tasks of the design flow. We used Mentor Graphics' FPGA Advantage 6.1 to assist in the generation of the VHDL code used to describe the core processor architecture. The processor was built up to support all of the operations described in the NIOS II instruction set.

We used Synplicity's Synplify Pro 7.6.1 as the RTL synthesis tool to generate the gate-level netlist targeted to the Altera Stratix II ES2S180F1508C4 from our VHDL description. The netlist was then passed to Altera's Quartus II 4.1 for device-specific placement and routing and bitstream generation. At this level, post-placement-and-routing results were extracted for additional manual optimization, timing-accurate simulation, and verification. It is at this point that the timing information about the hardware functions can be inserted into the software. Both Altera's Quartus and Synplicity's Amplify for FPGA allow manual routing modifications for optimization of design density and critical paths.


Figure 12: Resulting DFG from transformations described in G.721 example.
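The if/else-if chain that this DFG realizes as a priority encoder can be modeled in C. This is an illustrative model of the structure, using the power-of-two thresholds that the power2 table in Algorithm 4 implies; it is not code from the paper:

```c
#include <assert.h>

/* Thresholds corresponding to *quan_table (= power2) in Algorithms 4-5. */
static const int power2[15] = {
    1, 2, 4, 8, 0x10, 0x20, 0x40, 0x80,
    0x100, 0x200, 0x400, 0x800, 0x1000, 0x2000, 0x4000
};

/* Priority-encoder semantics: the FIRST condition that holds wins,
 * exactly like the unrolled if / else-if chain of Algorithm 5. */
static int quan15(int x) {
    for (int i = 0; i < 15; i++)
        if (x < power2[i])
            return i;
    return 15;               /* no condition held */
}
```

In hardware, all 15 comparisons evaluate in parallel and the encoder selects the highest-priority (lowest-index) true one, so the chain costs combinational depth rather than 15 sequential iterations.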


Figure 13: IDCT column-wise decomposition data flow graph.
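The row/column decomposition exploited here relies on the separability of the 2-D transform: a 1-D transform is applied to every row and then to every column of the result. The sketch below demonstrates only that pattern, on a 2x2 block with a toy sum/difference butterfly standing in for the 8-point IDCT (the real kernels are Algorithm 8 and its row-wise counterpart):

```c
#include <assert.h>

#define DIM 2

/* Toy 1-D "transform": a sum/difference butterfly
 * (a stand-in for the real idctrow/idctcol kernels). */
static void butterfly1d(int *a, int *b) {
    int s = *a + *b, d = *a - *b;
    *a = s;
    *b = d;
}

/* 2-D transform by decomposition: 1-D pass over rows, then columns. */
static void transform2d(int blk[DIM][DIM]) {
    for (int r = 0; r < DIM; r++)            /* row-wise pass */
        butterfly1d(&blk[r][0], &blk[r][1]);
    for (int c = 0; c < DIM; c++)            /* column-wise pass */
        butterfly1d(&blk[0][c], &blk[1][c]);
}
```

Separability is what lets MPEG 2 ship two small 1-D kernels instead of one large 2-D kernel, and it is why each of the two passes can be extracted as its own hardware function.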

    if (!((x1 = (blk[8*4] << 8)) | (x2 = blk[8*6]) | (x3 = blk[8*2]) |
          (x4 = blk[8*1]) | (x5 = blk[8*7]) | (x6 = blk[8*5]) |
          (x7 = blk[8*3]))) {
        blk[8*0] = blk[8*1] = blk[8*2] = blk[8*3] = blk[8*4] =
        blk[8*5] = blk[8*6] = blk[8*7] = iclp[(blk[8*0] + 32) >> 6];
        return;
    }
    x0 = (blk[8*0] << 8) + 8192;
    /* first stage */
    x8 = W7 * (x4 + x5) + 4;
    x4 = (x8 + (W1 - W7) * x4) >> 3;
    x5 = (x8 - (W1 + W7) * x5) >> 3;
    x8 = W3 * (x6 + x7) + 4;
    x6 = (x8 - (W3 - W5) * x6) >> 3;
    x7 = (x8 - (W3 + W5) * x7) >> 3;
    /* second stage */
    x8 = x0 + x1;
    x0 -= x1;
    x1 = W6 * (x3 + x2) + 4;
    x2 = (x1 - (W2 + W6) * x2) >> 3;
    x3 = (x1 + (W2 - W6) * x3) >> 3;
    x1 = x4 + x6;
    x4 -= x6;
    x6 = x5 + x7;
    x5 -= x7;
    /* third stage */
    x7 = x8 + x3;
    x8 -= x3;
    x3 = x0 + x2;
    x0 -= x2;
    x2 = (181 * (x4 + x5) + 128) >> 8;
    x4 = (181 * (x4 - x5) + 128) >> 8;
    /* fourth stage */
    blk[8*0] = iclp[(x7 + x1) >> 14];
    blk[8*1] = iclp[(x3 + x2) >> 14];
    blk[8*2] = iclp[(x0 + x4) >> 14];
    blk[8*3] = iclp[(x8 + x6) >> 14];
    blk[8*4] = iclp[(x8 - x6) >> 14];
    blk[8*5] = iclp[(x0 - x4) >> 14];
    blk[8*6] = iclp[(x3 - x2) >> 14];
    blk[8*7] = iclp[(x7 - x1) >> 14];

Algorithm 8: IDCT column-wise decomposition kernel. (Hardware functionality is in plain text; in the original, the VLIW software portions are highlighted with gray.)

    k = k - 1;
    ym(k, k+1) = y(k) - sum(R(k, k+1:M) .* s(k+1:M));
    rp(k) = sqrt(rp(k+1)^2 - (ym(k+1, k+2) - R(k+1, k+1) * s(k+1))^2);

Algorithm 9: Matlab kernel code for the sphere decoder.

For functional simulation and testing of the design, we passed the machine code output of the compiler design flow into the instruction ROM used in modeling our design. ModelSim SE 5.7 was used to generate the waveforms to confirm the functional correctness of our VLIW processor with hardware function acceleration.

Through a series of optimizations to the critical path, we were able to achieve a maximum clock speed of 166 MHz for our VLIW and clock frequencies ranging from 22 to 71 MHz for our hardware functions, equating to combinational delays from 14 to 45 ns. We then compared benchmark execution times of our VLIW, both with and without hardware acceleration, against the pNIOS II embedded soft-core processor.

To exercise our processor, we selected core signal processing benchmarks from the MediaBench suite, including ADPCM encode and decode, GSM decode, G.721 decode, and MPEG 2 decode. As described in Section 6, a single hardware kernel was extracted from each of the benchmarks, with the exception of MPEG 2 decode, for which two kernels were extracted. In the case of GSM and G.721, the hardware kernel was shared by the encoder and decoder portions of the algorithm.

The performance improvement from implementing the computational portions of the benchmarks on a 4-way VLIW, an unlimited-way VLIW, and directly in hardware, compared to a software implementation running on pNIOS II, is displayed in Figure 15. The VLIW performance improvements were fairly nominal, ranging from 2% to 48% over pNIOS II, a single-ALU RISC processor. The performance improvement of the entire kernel execution is compared for a variety of different architectures in Figure 16. The difference between Figures 15 and 16 is that the loads and stores required to maintain the data in the register file are not considered in the former and run in software in the latter. When the software-based loads and stores are considered, the VLIW capability of our processor has a more significant impact. Overall kernel speedups range from about 5X to over 40X.

The width of the available VLIW has a significant impact on the overall performance. In general, the 4-way VLIW is adequate, although, particularly for the IDCT-based benchmarks, the unlimited VLIW shows that not all of the available ILP is being exploited by the 4-way VLIW.

The performance of the entire benchmark is considered in Figure 17. We compare execution times for both hardware and VLIW acceleration against pNIOS II processor execution when the overheads associated with hardware function calls are included. While VLIW processing alone again provides nominal speedup (less than 2X), with hardware acceleration these speedups range from about 3X to over 12X and can reach nearly 14X when combined with a 4-way VLIW processor.


Figure 14: Data flow graph for the spherical decoder from Matlab.

Performance speedup of the computational kernel Performance speedup over software equivalent on various processors computational kernel+load/store setup 350 over the single processor pNIOS II 300 127 127 127 45 250 40 200 35 150 30 100 25 50 Speedup over pNIOS II 20 0 pNIOS II VLIW 4 VLIW Unl 15 . ADPCM decoder 18 33 18 18 10 ADPCM encoder 18.25 16 16 Speedup over pNIOS II GSM decoder 9.33 7 7 5 G721 decoder . 230 184 161 67 0 HW+ HW+ HW+ pNIOS II VLIW 4 VLIW Unl IDCT row 37.40 25.20 25.20 pNIOS II VLIW 4 VLIW Unl IDCT col 64.80 51.20 50 ADPCM decoder 1 1.13 1.13 2.93 4.16 4.16 Spherical decoder 332.50 124.13 123.25 ADPCM encoder 1 1.28 1.28 4 7.67 7.67 GSM decoder 1 1.39 1.39 5.41 7.67 7.67 G721 decoder 1 1.25 1.41 37.16 44.13 44.13 Figure 15: Performance improvement from hardware acceleration IDCT row 1 1.68 1.84 10.96 17.53 26.30 of computational portion of the hardware kernel. IDCT col 1 1.40 1.50 18.95 19.86 24.53 Spherical decoder 1 2.66 2.68 127.29 127.29 127.29

Figure 16: Kernel speedup for several architectures when considering the required loads and stores to maintain the register file.

8. CONCLUSIONS AND FUTURE WORK

In this paper, we describe a VLIW processor with the capability of implementing computational kernels in custom combinational hardware functions with no additional overhead. We provide a design methodology to map algorithms written in C onto this processor and utilize profiling and tractable behavioral synthesis to achieve application speedups.

We tested our processor with a set of signal processing benchmarks from the MediaBench suite and achieved a hardware acceleration of 9.3X to 332X, with an average of 63X better than a single processor, depending on the kernel. For the entire application, the speedup reached nearly 30X and was on average 12X better than a single-processor implementation.

VLIW processing alone was shown to achieve very poor speedups, reaching less than a 2X maximum improvement even for an unlimited VLIW. This is due to a relatively small improvement in ILP and, thus, provides relatively small performance improvement. However, when coupled with hardware functions, the VLIW has a significant impact, providing, in some cases, up to an additional 3.6X. It provided an average of 2.3X over a single processor and hardware alone. This range falls within the additional predicted 2X to 5X processing capability provided by the 4-way VLIW processor.

This improvement is due to several factors. The first is a simplified problem: because the software code has been made more regular (just loading and storing non-data-dependent values), the available ILP for VLIW processing is potentially much higher than in typical code. Secondly, we see a new "90/10" rule. The hardware execution

Benchmark speedup (kernel + nonkernel, including function call overhead) over the single-processor pNIOS II (Figure 17 data):

Benchmark          pNIOS II  VLIW 4  VLIW Unl.  HW+pNIOS II  HW+VLIW 4  HW+VLIW Unl.
ADPCM decoder          1.00    1.13       1.13         2.92       4.15         4.15
ADPCM encoder          1.00    1.28       1.28         4.00       7.66         7.66
GSM decoder            1.00    1.35       1.36         4.02       5.71         5.77
G721 decoder           1.00    1.27       1.39        12.01      13.95        16.01
MPEG2 decode           1.00    1.39       1.48         3.97       3.42         3.95

Figure 17: Overall performance speedup of the entire application for several architectures, including overheads associated with function calls.
accelerates a high percentage of the kernel code by 9X or more, leaving the remaining software portion of the code to dominate the execution time. Improving the remaining software code execution time through VLIW processing impacts this remaining (and now dominant) execution time, thus providing magnified improvements for relatively low ILP (such as the predicted 2X to 5X).

While the initial results for the VLIW processor with hardware functions are encouraging, there are several opportunities for improvement. A significant limiting factor of the hardware acceleration is the loads and stores that are currently executed in software. While these operations would need to be done in software for a single-processor execution, and transformations have made them more regular (and thus able to exhibit higher than normal ILP), they remain a substantial bottleneck.
To improve the performance of these operations, it should be possible to overlap load and store operations with the execution of the hardware block. One way to do this is to create "mirror" register files: while the hardware function executes on one group of registers, the VLIW prepares the mirror register file for the next iteration. Another possibility is to allow the hardware direct access to the memories as well as the register file.

ACKNOWLEDGMENT

This work was supported in part by the Pittsburgh Digital Greenhouse.

Raymond R. Hoare is an Assistant Professor of Electrical Engineering at the University of Pittsburgh. He received his Bachelor of Engineering degree from Stevens Institute of Technology in 1991. He obtained the Master's degree from the University of Maryland and his Ph.D. from Purdue University, in 1994 and 1999, respectively. Dr. Hoare teaches hardware design methodologies at the graduate level, computer organization, and software engineering. His research focus is on high-performance parallel architectures. For large parallel systems, his focus is on communication and coordination networks. For systems on a chip, he is focused on parallel processing architectures and design automation for application-specific computing. Dr. Hoare is one of the founders of, and is the General Chair for, the IEEE Workshop on Massively Parallel Processing.

Alex K. Jones received his B.S. in 1998 in physics from the College of William and Mary in Williamsburg, Virginia.
He sors,” in Proceedings of IEEE Workshop on VLSI Signal Pro- received his M.S. and Ph.D. degrees in cessing, IX, pp. 95–104, San Francisco, Calif, USA, October– 2000 and 2002, respectively, in electrical November 1996. and computer engineering at Northwest- [54] A. Capitanio, N. Dutt, and A. Nicolau, “Partitioned register ern University. He is currently an Assistant ff files For VLIWs: a preliminary analysis of tradeo s,” in Pro- Professor at the University of Pittsburgh in ceedings of 25th Annual International Symposium on Microar- Pittsburgh, Pennsylvania. He was formerly chitecture (MICRO ’92), pp. 292–300, Portland, Ore, USA, De- a Research Associate in the Center for Par- cember 1992. allel and Distributed Computing and Instructor of electrical and [55] “Trimaran, An Infrastructure for Research in Instruction- computer engineering at Northwestern University. He is a Walter Level Parallelism,” 1998, http://www.trimaran.org. P. Murphy Fellow of Northwestern University, a distinction he was [56] A. K. Jones, R. Hoare, I. S. Kourtev, et al., “A 64-way awarded with twice. Dr. Jones’ research interests include compila- VLIW/SIMD FPGA architecture and design flow,” in Proceed- tion techniques for behavioral synthesis, low-power synthesis, em- ings of 11th IEEE International Conference on Electronics, Cir- bedded systems, and high-performance computing. He is the au- cuits and Systems (ICECS ’04), pp. 499–502, Tel Aviv, Israel, thor of over 30 publications related to high-performance comput- December 2004. ing and power-aware design automation including a book chap- [57] C. Lee, M. Potkonjak, and W. H. Mangione-Smith, “Me- ter in Power Aware Computing (Kluwer, Boston, Mass, 2002). He is diaBench: a tool for evaluating and synthesizing multime- currently an Associate Editor of the International Journal of Com- dia and communications systems,” in Proceedings of 30th An- puters and Applications. 
He is also on the Program Committee of the Parallel and Distributed Computing and Systems Conference and the Microelectronic System Engineering Conference.

Dara Kusic is a Masters student at the University of Pittsburgh. Her research interests include parallel processor design, hybrid architectures, and computational accelerators. She is a Student Member of the IEEE and the IEEE Computer Society.

Joshua Fazekas is an M.S.E.E. student at the University of Pittsburgh. His research interests include hardware/software codesign, compiler design, and low-power hardware design. Fazekas received a B.S. in computer engineering from the University of Pittsburgh.

John Foster is a Masters student in the Department of Electrical and Computer Engineering, University of Pittsburgh. He received his B.S. degree in computer engineering from the University of Maryland, Baltimore County. His research interests include parallel processing compilers and hardware/software codesign.

Shenchih Tung is a Ph.D. candidate in the Department of Electrical and Computer Engineering, University of Pittsburgh. He received his B.S. degree in electrical engineering from National Taiwan Ocean University, Taiwan, in June 1997. He received his M.S. degree in telecommunications from the Department of Information Science at the University of Pittsburgh in August 2000. His research interests include parallel computing architecture, MPSoCs, network-on-chip, parallel and distributed computer simulation, and FPGA design. He is a Member of the IEEE and the IEEE Computer Society.

Michael McCloud received the B.S. degree in electrical engineering from George Mason University, Fairfax, VA, in 1995, and the M.S. and Ph.D. degrees in electrical engineering from the University of Colorado, Boulder, in 1998 and 2000, respectively. He spent the 2000–2001 academic year as a Visiting Researcher and Lecturer at the University of Colorado. From 2001 to 2003 he was a Staff Engineer at Magis Networks, Inc., San Diego, California, where he worked on the physical layer design of indoor wireless modems. He spent the 2004 and 2005 academic years as an Assistant Professor at the University of Pittsburgh. He is currently with TensorComm, Inc., Denver, Colorado, where he works on interference mitigation technologies for wireless communications.

Hindawi Publishing Corporation
EURASIP Journal on Applied Signal Processing
Volume 2006, Article ID 64369, Pages 1–13
DOI 10.1155/ASP/2006/64369

Rapid Prototyping for Heterogeneous Multicomponent Systems: An MPEG-4 Stream over a UMTS Communication Link

M. Raulet,1, 2 F. Urban, 1 J.-F. Nezan,1 C. Moy,3 O. Deforges,1 and Y. Sorel4

1 IETR/Image Group Lab, UMR CNRS 6164/INSA, 20 Avenue des Buttes de Coësmes, 35043 Rennes, France
2 Mitsubishi Electric ITE, Telecommunication Lab, 1 Allée de Beaulieu, 35000 Rennes, France
3 IETR/Automatic & Communication Lab, UMR CNRS 6164/Supélec-SCEE Team, Avenue de la Boulaie, BP 81127, 35511 Cesson-Sévigné, France
4 INRIA Rocquencourt, AOSTE, BP 105, 78153 Le Chesnay, France

Received 15 October 2004; Revised 24 May 2005; Accepted 21 June 2005 Future generations of mobile phones, including advanced video and digital communication layers, represent a great challenge in terms of real-time embedded systems. Programmable multicomponent architectures can provide suitable target solutions combin- ing flexibility and computation power. The aim of our work is to develop a fast and automatic prototyping methodology dedicated to signal processing application implementation on parallel heterogeneous architectures, two major features required by future systems. This paper presents the whole methodology based on the SynDEx CAD tool that directly generates a distributed imple- mentation onto various platforms from a high-level application description, taking real-time aspects into account. It illustrates the methodology in the context of real-time distributed executives for multilayer applications based on an MPEG-4 video codec and a UMTS telecommunication link.

Copyright © 2006 M. Raulet et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

New embedded multimedia systems, such as mobile phones, require more and more computation power. They are increasingly complex in design and have a shorter time to market. Computation limits of critical parts of the system (i.e., video processing, telecommunication physical layer) are often overcome thanks to specific circuits [1]. Nevertheless, this solution is not compatible with short design times or the system's growing need for reprogramming and future capacity improvements. An alternative can be provided by programmable software (DSP: digital signal processor, RISC: reduced instruction set computer, CISC: complex instruction set computer) or programmable hardware (FPGA: field programmable gate array) components since they are more flexible. Efficiency loss can be counterbalanced by using multicomponent architectures to satisfy hard real-time constraints. The parallel aspect of multicomponent architectures (programmable software and/or programmable hardware components interconnected by communication media) and possibly their heterogeneity (different component types) raise new problems in terms of application distribution.

Real-time executives developed for single-processor applications can hardly take advantage of multicomponent architectures: handmade data transfers and synchronizations quickly become very complex and result in lost time and potential deadlocks. A suitable design-process solution consists of using a rapid prototyping methodology. The ultimate objective is then to go from a high-level description of the application to its real-time implementation on a target architecture [2] as automatically as possible. The aim is to avoid disruptions in the design process from a validated system at simulation level (monoprocessor) to its implementation on a heterogeneous multicomponent target. The performance of the process can be evaluated by the following aspects:

(i) maximal independence with regard to the architecture,
(ii) possibility of handling heterogeneous multicomponent architectures,
(iii) maximal automation during the process (distribution/scheduling, code generation, including data transfers and synchronizations),
(iv) efficiency of the implementation both in terms of execution time and resource requirements,
(v) reduced design time,
(vi) enhanced quality and robustness of the final executive.

The methodologies generally rely on a description model, which must match the application behavior. These applications are a mixture of transformation and reactive operators [3]. A transformation operator is data-driven: input data is transformed into output data. A reactive operator is event-driven and has to continually react to stimuli. In practice, systems are a combination of both. Nevertheless, an important distinction can be made between systems with deterministic scheduling, whose operators are mainly transformation-oriented, and systems with highly dynamic behavior, whose operators are mostly reactive-oriented. For the first class of system (including signal, image, and communication applications), DFGs (data flow graphs) have proven to be an efficient representation model. They enable automatic rapid prototyping and lead to optimized scheduling [4].

This paper deals with a rapid prototyping methodology based on the SynDEx tool, which is suitable for transformation-oriented systems and heterogeneous multicomponent architectures. Major contributions concern two points:

(i) method and tool, more specifically automatic distributed code generation from SynDEx,
(ii) a complex multilayer application including video and digital communication layers, going from its high-level description to its distributed and real-time implementations on heterogeneous platforms.

SynDEx automatically generates synchronized distributed executives from both application and target architecture description models. These executives specify the inner component scheduling and the global application scheduling, and are expressed in an intermediate generic language. These executives have to be transformed to be compliant with the type of component and communication media so that they automatically become compilable codes. In this article, we will focus on this mechanism, based on the concept of SynDEx kernels, and detail newly developed kernels enabling automatic code generation on various multicomponent platforms.

The design and the distributed implementation of a multilayer application composed of a video (MPEG-4) and a digital communication layer (UMTS) illustrate the methodology. An MPEG-4 coding application provides the UMTS transceiver with a video coded bitstream, whereas the associated MPEG-4 decoder is connected to the UMTS receiver in order to display the video. The result is a complete demonstration application with automatic code generation over several kinds of processors and communication media.

The digital communication layer under investigation is a UMTS FDD (frequency-division duplex) uplink transceiver [5]. UMTS is the European and Japanese selected standard for 3G. It has already spread to many areas of the world, but is not yet predominant. 3G should enable us to benefit from new wireless services requiring quite a high data rate, up to 2 Mbps. Typical targeted applications go from wireless internet to video streaming, and also include high-speed picture exchange and of course voice.

MPEG-4 is the latest multimedia compression standard to be adopted by the Moving Picture Experts Group (MPEG) [6]. The prototyping of MPEG-4 video codecs over multicomponent platforms and their optimizations are studied in the IETR Image Group Laboratory. A part of the project has already been presented in [7]. We will therefore focus on the coupling between the UMTS and MPEG-4 subsystems rather than describe the video codec in detail.

The paper is organized as follows: Section 2 introduces the SynDEx tool and the AAA methodology. Our contribution in terms of prototyping platforms and executive kernels is described in Section 3. The UMTS description according to the AAA methodology and its implementations are explained in Section 4. The methodology is illustrated and validated by the application (MPEG-4 + UMTS) described in Section 5, which allows a new stage in the relevance of the method to be reached. Finally, conclusions and open issues encountered during the application development are given in Section 6.

2. SYNDEX OVERVIEW

SynDEx (http://www.syndex.org) is a free academic system-level CAD (computer-aided design) tool developed at INRIA Rocquencourt, France. It supports the AAA methodology (adequation algorithm architecture [8, 9]) for distributed real-time processing.

2.1. Adequation algorithm architecture

A SynDEx application (Figure 1) comprises an algorithm graph (operations that the application has to execute), which specifies the potential parallelism, and an architecture graph (multicomponent [10] target, i.e., a set of interconnected processors and specific integrated circuits), which specifies the available parallelism. "Adequation" means efficient mapping, and consists of manually or automatically exploring the implementation solutions with optimization heuristics [9]. These heuristics aim to minimize the total execution time of the algorithm running on the multicomponent architecture, taking the execution time of operations and of data transfers between operations into account. These execution times are determined during the characterization process, which associates a list of characteristics, such as execution times, necessary memory, and so forth, with each (operation, processor)/(data transfer, communication medium) pair, respectively.

An implementation consists of both performing a distribution (allocating parts of the algorithm on components) and scheduling (giving a total order for the operations distributed onto a component) the algorithm on the architecture. Formal verifications during the adequation avoid deadlocks in the communication scheme thanks to semaphores

[Figure 1 diagram: the user supplies an algorithm graph, an architecture graph, and constraints; the SynDEx adequation (distribution/scheduling heuristic) produces generic synchronized distributed executives and a timing graph (predictions); target kernels (Target 1 ... Target N, Comm, m4) then yield dedicated executives for specific targets (specific compilers/loaders).]

Figure 1: SynDEx utilization global view.

inserted automatically during the real-time code generation. SynDEx kernels are available for several processors, such Moreover, since the Synchronized Distributed EXecutives as the TI2 TMS320C6x (C62x, C64x) and the Virtex FPGA (SynDEx) are automatically generated and safe, part of the families, and for several communication media such as links tests and low-level hand-coding are eliminated, decreasing SDBs (Sundance digital buses-Sundance high-speed FIFOs), the development lifecycle. CPs (comports-Sundance FIFOs), BIFOs (BI-FIFOs-Pentek SynDEx provides a timing graph, which includes simu- FIFOs), PCI bus, and TCP bus presented in the following sec- lation results of the distributed application and thus enables tion. SynDEx to be used as a virtual prototyping tool. In the AAA methodology, an algorithm is specified as an 2.3. Design process infinitely repeated DFG. Each edge represents a data depen- dence relation between vertices, which are operations; opera- Our previous prototyping process integrated AVS3 (ad- tion stands for a sequence of instructions, which starts when vanced visual systems) as a front-end [14] for functional all its input data is available and which produces output data checking. AVS is a software designed for DFG description at the end of the sequence. In SynDEx, there is an additional and simulation. The application was constructed by inserting notion of reference. Each reference corresponds to the defini- existing modules or user modules into the AVS workspace, tion of an algorithm. The same definition may correspond to and by linking their inputs and outputs. The validated DFG several references to this definition. An algorithm definition was next converted into a new DFG by a translator to be com- is a repeated DFG similar to those in AAA, except that ver- pliant with SynDEx algorithm input. 
The main advantage tices are references or ports so that hierarchical definitions of was the automatic visualization of intermediate and resulting an algorithm are possible. images at the input and output of each module. This charac- teristic enables the image processing designer to check and 2.2. Automatic executive generation validate the functionality of the application with AVS before the step of the implementation. The aim of SynDEx is to directly achieve an optimized im- Although SynDEx is basically a CAD tool for distribu- plementation from a description of an algorithm and an tion/scheduling and code generation, here we demonstrate architecture. SynDEx automatically generates a generic ex- that SynDEx can also be directly used as the front-end of ecutive, which is independent of the processor target, into the process for functional checking (as it is possibly done several source files (Figure 1), one for each processor [11]. with AVS). This is made possible thanks to our kernels pre- These generic executives are static and are composed of a list sented in Section 3. The design process is now based on a sin- of macrocalls. The M4 macroprocessor transforms this list gle tool and is therefore simpler and more efficient. SynDEx of macrocalls into compilable code for a specific processor therefore enables full rapid prototyping from the application target. It replaces macrocalls by their definition given in the description (DFG) to final multiprocessor implementation corresponding executive kernel, which is dependent on a pro- (Figure 2) in three steps as the following: cessor target and/or a communication medium. In this way, SynDExcanbeseenasanoff-line static operating system that is suitable for setting data-driven scheduling, such as signal 2 Texas instrument. processing applications [12, 13]. 3 www.avs.com 4 EURASIP Journal on Applied Signal Processing

[Figure 2 here depicts the three prototyping steps: Step 1, a sequential executive on a PC (Visual C++ application) for functional checking; Step 2, sequential executives with chronometric primitives on the PC (Visual C++) and on the DSP (Code Composer) for node timing estimation; Step 3, a distributed executive (PC + DSPs) running the parallel application.]

Figure 2: SynDEx utilization global view.

Step 1. The user creates the application DFG using SynDEx. Automatic code generation provides standard C code for a single host computer (PC) implementation (SynDEx PC kernel). In this way, the user can design and check each C function associated with each vertex of the DFG, and can check the functionality of the complete application with any standard compilation tools. With automatic code generation, visualization primitives or binary error rate computation can be used for easy functional checking of algorithms. The user can easily check his or her own DFG on a cluster of PCs interconnected by TCP buses. With this cluster, the user can emulate his or her embedded platform thanks to SynDEx distributed scheduling.

Step 2. The developed DFG is then used for automatic prototyping on monoprocessor targets, in which chronometric reports are automatically inserted by the SynDEx code generator. The duration associated with each function (i.e., vertex) executed on each processor of the architecture graph is automatically estimated using dedicated temporal primitives.

Step 3. The user can then use these durations to characterize the algorithm graph by entering the values in SynDEx. The SynDEx tool then executes an adequation (optimized distribution/scheduling) and generates a real-time, distributed, optimized executive for the target platform. Several platform configurations can be simulated (processor type, processor count, and also different media connections).

The main advantage of this prototyping process is its simplicity, because most of the tasks performed by the user concern the description of an application and a compiling environment. Only a limited knowledge of SynDEx and compilers is required. All complex tasks (adequation, synchronization, data transfers, and chronometric reports) are executed automatically, thus reducing the "time to market." The user can rapidly explore several design alternatives by modifying the architecture graph or adding constraints.

3. SYNDEX EXECUTIVE KERNELS

As described above, the SynDEx generic executive is translated into a compilable language. The translation of SynDEx macros into the target language is contained in library files (also called kernels). The final executive for a processor is static, and is composed of one computation sequence and one communication sequence for each medium connected to this processor. Multicomponent platform manufacturers must insert additional digital resources between processors to make communication possible. Thus, SynDEx kernels depend on specific platforms.

3.1. Development platforms

Different hardware providers (Sundance, Pentek) were chosen to validate automatic executive generation. Many components and intercomponent communication links are used in their platforms, ensuring the accuracy and generic aspect of the approach. The use of several hardware architectures guarantees generic kernel developments.

Sundance⁴ platform. A typical Sundance device is made up of a host PC with one or more motherboards, each supporting one or more TIMs (Texas Instruments modules). A TIM is a basic building block from which the system is built. It contains one processing element, which is not necessarily a DSP but may be an input/output device or an FPGA. A TIM also provides mechanisms to transfer data from module to module. These mechanisms, such as SDBs (200 MB/s), CPs (20 MB/s), or a global bus (to access a PCI bus, up to 40 MB/s), are implemented on the TIMs using FPGAs.

The SMT320 motherboard (Figure 3) is modular, flexible, and scalable. Up to four different modules can be plugged into the SMT320 and connected using CP or SDB cables. The SMT361 TIM, with a TMS320C6416 (400 MHz), is very suitable for image processing solutions, as the TMS320C64xx has special functions for handling graphics. The SMT319 TIM is a framegrabber, which includes a TMS320C6414 and two nonprogrammable devices: a BT829 PAL-to-YUV encoder and a BT864 YUV-to-PAL decoder. These two devices are connected to the TMS320C6414 DSP by two FIFOs, which are equivalent to SDBs with the same data rate. An SMT358 is composed of a programmable Virtex FPGA (XCV600), which integrates specific communication links and specific IP (computation) blocks.

Pentek⁵ platform. The Pentek P4292 platform (Figure 4) is made up of four TMS320C6203 DSPs. Each DSP has three communication links: two bidirectional (300 MHz) inter-DSP links and one for the input/output interface. The four DSPs are already connected to each other in a ring structure. Daughterboards may be added to the P4292 thanks to the VIM (velocity interface mezzanine) bus, such as analog-to-digital converters (ADC P6216), digital-to-analog converters (DAC P6229), or FPGAs (XC2V3000, XC2Vx Virtex-II family).

⁴ http://sundance.com/
⁵ http://www.pentek.com/
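The kernel mechanism described in Sections 2.2 and 3 — a target-independent list of macrocalls expanded against target-dependent kernel definitions — can be mimicked with the C preprocessor. The sketch below is only a loose analogy that we add for illustration; the real SynDEx kernels are M4 files, and all names here are ours, not part of SynDEx:

```c
#include <assert.h>

/* "Kernel" side: target-dependent definitions, normally one kernel file
 * per processor family (C62x.m4x, Pentium.m4x, ...). */
#define ALLOC_(name, n)          static float name[n]
#define COMPUTE_(op, in, out, n) op(in, out, n)

/* One "operation" of the DFG: a plain C function attached to a vertex. */
static void scale2(const float *in, float *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = 2.0f * in[i];
}

/* "Generic executive" side: the ordered, target-independent macrocall
 * list that the generator emits; the macroprocessor (M4 there, cpp here)
 * turns it into compilable code for the chosen target. */
ALLOC_(a, 4);
ALLOC_(b, 4);

static void executive(void)
{
    COMPUTE_(scale2, a, b, 4);   /* expands to: scale2(a, b, 4); */
}
```

Swapping the "kernel" definitions while keeping the executive unchanged is what makes the same generated macrocall list compilable for a PC, a C62x, or a C64x.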

[Figure 3 here shows an example Sundance topology: a host PC (Pentium) connected over the PCI bus to an SMT320 embedded motherboard carrying an SMT361 (DSP2, TMS320C6416), an SMT358 (FPGA1, Virtex), and an SMT319 framegrabber (DSP3, TMS320C6414, with BT829 PAL-to-YUV and BT864 YUV-to-PAL devices on the video in/out), interconnected by SDB and CP buses.]

Figure 3: Example of Sundance architecture topology.

This stand-alone Pentek platform is connected to an Ethernet network. This allows TCP/IP (1.5 MB/s) communications between the DSPs and any computer in the network, in order to check a binary error rate or to visualize a decoded image. However, this bus's throughput does not allow the transfer of uncompressed data.

[Figure 4 here shows the Pentek P4292 motherboard: four TMS320C6203 DSPs (A, B, C, D) connected in a ring by BIFO buses, a host PC (Pentium) reached over TCP, and an ADC/DAC daughterboard attached through the VIM interface.]

Figure 4: Pentek 4292 motherboard and its daughterboard.

3.2. Software component kernels

Most of the kernels are developed in the C language so that they can be reused for any C-programmable device. These kernels are similar for the host computer (PC) and the embedded processors (DSPs). The generated executive is composed of a sequential list of function calls (one for each DFG operation). Given this kind of executive, and the fact that C compilers for DSPs have really improved in terms of resource use, the gap between an executive written in C and an executive written in assembly language is narrow. The user can design each function associated with each vertex of the DFG in C, or in assembly language for better results [15].

SynDEx creates a macrocode made of several interleaved schedulers: one for computation and the others for communications, allowing parallelism between those actions. We have chosen to use multichannel enhanced DMA (direct memory access) transfers, thus maximizing parallelism and timing performance. Data transfers are executed in parallel with computation, minimizing communication duration. The DMA and the CPU have their own buses to access the internal memory; bus conflicts therefore only appear when the CPU and the DMA access external memory. As all data buffers are in internal memory, there are no memory bus conflicts between CPU and DMA accesses. Communication overhead is only due to DMA setup, which is negligible compared with the transfers themselves (a few assembly instructions) [16].

The development of an application on TI processors can be hand-coded with the TI RTOS (real-time operating system) called DSP/BIOS [17]. DSP/BIOS is well suited to multithreaded monoprocessor applications. Several processors must be connected to improve computational performance and reach real time. In this case, the multithreaded multiprocessor 3L Diamond⁶ RTOS is more appropriate than DSP/BIOS. Applications are built as a collection of intercommunicating tasks. The mapping and scheduling of each task are chosen manually; data transfers and synchronizations are then implemented by the RTOS using precompiled libraries. 3L makes multiprocessor application development easier and faster, and is suited to dynamic communications between tasks. Data transfers are realized using DMA, but without any computation parallelism, which is nearly equivalent to a polling technique.

Data transfers in a signal processing application are generally statically defined, both in type and in number, so that their description with a DFG is suitable. The execution of DFG operations is also well defined, so that data transfers can be implemented with static processes. As static processes are faster than dynamic ones, SynDEx kernels are developed without any RTOS. That is to say, the SynDEx generic executive is not transformed into dynamic RTOS functions, but into specific, static, optimized functions.

3.3. Communication media kernels

With the AAA methodology, two different models are possible for communication media between processors: the SAM (single access memory) model and the RAM (random access memory, shared memory) model.

The SAM model corresponds to FIFOs in which data are pushed by the producer if the FIFO is not full, and then pulled by the receiver if the FIFO is not empty. The synchronizations between the two processors are hardware signals (empty and full flags) and are not handled by SynDEx semaphores. The data must be received in the same order as they are sent. Most of our kernels are designed according to this model. SDB, CP, and BIFO-DMA links enable parallelism between calculation and communications, whereas TCP and BIFO do not (data polling mechanism).

The RAM model corresponds to an indexed shared memory. A memory space is allocated, and an interprocessor synchronization semaphore is created for each item of data that has to be transferred. This mechanism allows the destination processor to read data in a different order from that in which they were written by the source processor. Interprocessor synchronizations are handled by SynDEx. The first implementation of the RAM model, through the PCI bus, is described below.

A PCI transfer kernel, for communications between a DSP on a Sundance platform and the host computer, was first developed with the SAM model. First, the host and the DSP must be synchronized. Each data transfer therefore encloses two synchronizations, because the PCI bus does not have hardware signals like a usual FIFO (full or empty flags). The receiver must first wait for the sender to write new data into the PCI memory. Then, the receiver can read the data from the PCI memory and send an acknowledgement back to the sender. This "rendez-vous as soon as possible" mechanism induces idle or wait states, but is mandatory to ensure that the medium is ready for the next transfer and to guarantee transfer order. PCI communications using the SAM model reach a maximum transfer rate of 16 MB/s; this mechanism drastically slows down PCI transfers. In addition, a shared buffer is actually allocated in the PC's RAM by the PCI bus driver. Therefore, a new PCI kernel implementing the RAM model has been developed, and the transfer rate has been improved (up to 40 MB/s). Each item of data that has to be transferred has its own address allocation in the PCI memory and its corresponding semaphores, which allows several buffers to be written before the first one is read. This results in fewer wait states and more time for computation. The PCI scheduler is interrupt-driven when using this model. Consequently, communications and computations can be concurrent on the DSP, thus reducing the overall execution time.

3.4. Hardware component kernel

An FPGA kernel for programmable hardware components has also been developed in an HDL (hardware description language); such a component can be considered as a coprocessor used to speed up a specific function of the algorithm. This kernel handles the automatic integration of the intercomponent communication syntheses and instantiates a specific IP (intellectual property) block.

⁶ http://www.3l.com

[Figure 5 here shows the kernel files used at code generation: a generic library (SynDEX.m4x), an application-dependent library (application name.m4x), processor-type-dependent libraries (C62x.m4x, C64x.m4x, Pentium.m4x, FPGA.m4x), and media-type-dependent libraries (SDB.m4x and CP.m4x for C62x/C64x/FPGA, Bus-PCI-SAM.m4x and Bus-PCI-RAM.m4x for C62x/C64x/Pentium, TCP.m4x for Pentium/C62x, BIFO.m4x and BIFO-DMA.m4x for C62x/C64x/FPGA).]

Figure 5: SynDEx kernel organization.

Programming a communication link depends on its type, but also on the processor. Previous works have already validated these libraries [18]; however, they need to evolve along with processors and communication links (depending on the provider's additional logic).

3.5. Kernel organization

The libraries are classified to make development easier and to limit modifications when they become necessary. As shown in Figure 5, these files are organized hierarchically. An application-dependent library contains macros for the application, such as the calls of the algorithm's different functions. A generic library contains macros used regardless of the architecture target (basic macros). The others are architecture-dependent: processor-type- or communication-type-dependent. Processor-dependent libraries contain macros related to the real-time kernel, such as memory allocations, interrupt handling, and the calculation sequence. Communication-type-dependent libraries contain macros related to communications: send, receive, and synchronization macros, and the communication sequences. As different processor types (with different programming of the link) can be connected by the same communication type, one part per processor type can be found in one library. The right part of the file is selected during macroprocessing.

Kernels have been developed for every component of the platforms described in Section 3.1. When SynDEx is used for a new application, only the application-dependent library needs to be modified by the user. Architecture-dependent libraries are added or modified only when a new architecture is used (a processor or a medium that does not yet have its kernel).

4. UMTS APPLICATION

UMTS is much more challenging than previous 2G systems, such as GSM. In particular, UMTS signals have a 3.84 MHz bandwidth, compared with 270 kHz for GSM. Both the application and signal processing layers are very demanding. This partially explains the delay in the effective arrival of UMTS on the market. It presents a very interesting case study for high-efficiency multiprocessing heterogeneous implementations. This becomes even more relevant in a software radio [19] context, which aims to implement as much radio processing as possible in the digital domain, especially on processors and reconfigurable hardware. The advantages consist, firstly, of easing the system design, privileging fast software development instead of heavy low-level hardware development. Secondly, the system can support new services and features thanks to its software adaptation capability during system operation [20].

Table 1: Legend of the UMTS FDD transmitter.

SRC            Source (pseudorandom generator)
CRC            Addition of cyclic redundancy check bits
SEG            Segmentation
COD            Channel coding
EQU            Equalization
INT1           First interleaving
INT2           Second interleaving
SPRdata        Spreading of information bits
SPRctrl        Spreading of control bits
SUM            Creation of a complex signal
CST-SCR-code   Generation of the scrambling code
SCR            Scrambling
DPCCH          Generation of control bits
PSH            Pulse shaping

4.1. General description

The UMTS FDD physical layer algorithms explained in [5] are implemented for baseband, from cyclic redundancy check (CRC) to pulse shaping (PSH) (Table 1) for the transmitter, as shown in the DFG in Figure 6. This does not represent a totally real UMTS, since synchronization is artificial and no propagation channel is used (the link is completely digital). Data may be generated by an arbitrary source (SRC in Figure 6; not in the standard) for bit-error-rate verifications, or extracted from a real application, such as a video stream, for demonstrations.

[Figure 6 here shows the transmitter DFG: SRC → CRC → SEG → COD → EQU → INT1 → INT2 → SPRdata, with DPCCH → SPRctrl; the spread data and control bits are combined into a complex signal (SUM), scrambled (SCR, fed by CST-SCR-code), and pulse shaped (PSH); the chain switches from frame-by-frame to slot-by-slot processing.]

Figure 6: UMTS FDD transmitter (Tx).

[Figure 7 here shows the receiver DFG: MFL and RAKE front-end, then DSCR (fed by CST-SCR-code) → DSPRdata → DINT2 → DINT1 → DEQU → DCOD → DSEG → DCRC → BER; the chain switches from slot-by-slot back to frame-by-frame processing.]

Figure 7: UMTS FDD receiver (Rx).

The link characteristics in the measured version are as follows:
(i) 1 transport channel,
(ii) 1 physical channel,
(iii) no channel coding,
(iv) spreading factor of 4,
(v) data rate of 950 kbps.

The receiver [5] extracts the information necessary for the application using the scheme represented in Figure 7 (Table 2).

The number of operations effectively in use is much greater than the figures shown, as most of them are duplicated several times. The generation of a 10 ms frame (composed of 15 slots) requires the instantiation of approximately 140 operations for Tx and 240 for Rx in this version, which is a minimum. The granularity of the operations has the same level of complexity as an FFT, an FIR, or a memory reorganization.

4.2. FIR implementation

The filter operation is of particular interest because its implementation complexity makes it very resource consuming. It is an FIR (finite impulse response) filter with a raised-cosine impulse response, specified by the UMTS standard at both the transmitter baseband output and the receiver baseband input. Here, the impulse response is symmetric around its center; this characteristic can be exploited to minimize the number of memory accesses, the memory required for storing the filter coefficients, and the number of multiplication operations. In order to obtain a convenient rejection of contiguous bands, the filter impulse response is spread over 16 chips, and consequently has 33 taps with an oversampling factor of 2. The same coefficients are used for Tx and Rx.

Equation (1) gives the representation of an FIR filter with an odd number of coefficients, where h is the real coefficient vector of the filter impulse response (filter taps), K is the number of coefficients (or taps), and x[n] and y[n] are the nth input and output complex data samples, respectively:

  y[n] = h[(K − 1)/2] · x[n − (K − 1)/2]
       + Σ_{k=0}^{(K−1)/2−1} h[k] · (x[n − k] + x[n − K + 1 + k]).      (1)

A real filter (i.e., a filter whose coefficients are real) applied to complex data is very frequent in baseband (BB) processing, and consists of applying the same filter independently to the real and imaginary parts of the data samples. In our case we are interested in fixed-point implementations, so care must be taken to avoid overflow while preserving signal quality (in terms of SNR). The filter at Tx is called pulse shaping (PSH), and at Rx matched filtering (MFL). At Tx, the PSH and oversampling operations (the latter consisting of inserting zeros between binary digits) can be combined in order to minimize computation. In this case we obtain the following: if n is even,

  y[n] = Σ_{k=0}^{(K−1)/4} h[2k] · (x[n − k] + x[(n − (K − 1))/2 + k]);      (2)
In our level of complexity as a FFT, FIR, or a memory reorganiza- case we are interested in fixed point implementations, so care tion. must be taken to avoid overflow while preserving signal qual- ity (in terms of SNR). The filter at Tx is called pulse shap- ing (PSH), and at Rx matched filtering (MFL). At Tx PSH 4.2. FIR implementation and oversample (which consists of inserting zero between bi- The filter operation is of particular interest because its im- nary digits), operation can be combined in order to mini- n plementation complexity makes it very resource consum- mize computation. In this case we obtain the following: if ing. This is a FIR (finite impulse response) with a raised- is even, root cosine impulse response specified by the UMTS stan- dard at both transmitter baseband output and receiver base- K− /    ( 1) 4 n − (K − 1) band input. Here, the impulse response is symmetric around y[n] = h[2k] · x[n − k]+x + k , its center; this characteristic can be exploited to minimize k=0 2 the number of memory accesses, the required memory for (2) M. Raulet et al. 9

if n is odd,

  y[n] = h[(K − 1)/2] · x[n − (K − 1)/2]
       + Σ_{k=1}^{(K−1)/4−1} h[2k] · (x[n − k] + x[n − (K − 1)/2 + k]).      (3)

The nature of an FIR operation is particularly suited to FPGA implementations, but it can also be implemented on DSP processors. A specific characteristic of a DSP is that it has a MAC (multiply-accumulate) or a VLIW structure to support filtering computation in one clock cycle. The TMS320C6x family, based on a VLIW architecture, has six adders and two multipliers, which operate in parallel and complete execution in one clock cycle. A fixed-point multiply-accumulate takes two instructions: a multiply on one cycle and an accumulate on the next. Thanks to pipelining, it is possible to effectively compute two multiply-accumulates per cycle.

The performance then depends directly on the filter length and the processor clock frequency, as each tap is processed sequentially. In an FPGA, it is possible to parallelize part or all of these operations, depending on the available gate surface. The FIR implemented in the FPGA is a distributed arithmetic (DA) filter [21]. This FIR contains no multipliers, only read-only memory (ROM) and accumulators. The complexity of this filter depends only on the number of bits per sample, not on the number of taps.

In the particular case of the C6x, it is possible to use the FIR data buffer organization shown in Figure 8. The FIR is a typical case where the functional units in the microprocessor datapath can speed up processing. Data is processed in blocks. The interface consists of an input data buffer, the coefficient buffer, and an output data buffer.

For each input sample, the algorithm computes y[n] in a for-loop. At the end of each block processing operation, the filter state is updated by copying the last K input data samples into a state buffer (Figure 8). For the sake of processing efficiency, it is assumed that the input data buffer is stored in memory directly after the state data buffer, so that negative indices of the input data buffer point to the state buffer data.

Table 2: Legend of the UMTS FDD receiver.

MFL            Matched filter
RAKE           Simplified RAKE (one perfectly synchronized finger)
CST-SCR-code   Generation of the scrambling code
DSCR           Descrambling
DSPRdata       Despreading of information bits
DINT2          Deinterleaving 2
DINT1          Deinterleaving 1
DEQU           Equalization inverse operation
DCOD           Channel decoding
DSEG           Transport block extraction
DCRC           Analysis of cyclic redundancy check bits
BER            Bit error rate

[Figure 8 here shows the data management for the DSP implementation of an FIR: a state buffer followed by the input buffer x(−K + 1, ..., −1, 0, ..., N − 1), the coefficient buffer h(0, ..., K) of K taps, the output buffer y(0, ..., N − 1), and the state update copying the newest data back into the state buffer.]

Figure 8: Data management for DSP implementation of an FIR.

Table 3: Timing of PSH (input: 2560 samples).

Target                    C62x (300 MHz)   C64x (400 MHz)   XC2Vx (100 MHz)
Time/slot (microseconds)  576              320              338

Table 4: Timing of MFL (input: 5120 samples).

Target                    C62x (300 MHz)   C64x (400 MHz)   XC2Vx (100 MHz)
Time/slot (microseconds)  1130             640              338

Table 5: Tx timings and PSH ratio.

Target         Sundance   Pentek    Pentek    Pentek
Configuration  1×C64x     1×C62x    2×C62x    1×XC2Vx + 1×C62x
Time/frame     9.5 ms     11.8 ms   8.5 ms    9.6 ms
PSH ratio      50%        73%       53%       52%

In Tables 3 and 4, the timing differences between the C62x and the C64x (clock rates aside) are due to the fact that the compilers are not the same for the two processors, and that these DSPs have different internal architectures. In an FPGA (XC2Vx), this FIR operation could be parallelized further, giving better acceleration to the detriment of gate surface. However, these time values are sufficient to obtain a real-time Tx or Rx application, which is why we use the same FIR implementation for PSH and MFL; an elementary oversampling function simply has to be added before PSH. Contrary to the FPGA case, on DSPs we take advantage of the FIR features (cf. Section 4.2) to optimize PSH at Tx and halve its computation complexity, so that 576 microseconds versus 1130 microseconds are obtained on the C62x, and 320 microseconds versus 640 microseconds on the C64x.

4.3. Tx and Rx implementations

Four different implementations (Table 5) of a UMTS transmitter have been automatically tested using SynDEx: three are implemented on the Pentek platform and one on the Sundance platform. A transmitter application must take under 10 ms to be real time.
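As an illustration (our own sketch, not the actual kernel code), the folded form of (1) can be combined with the buffer layout of Figure 8: the state samples sit directly before the current input block, so negative indices fall into the state area, and the coefficient symmetry halves the number of multiplications. Here we keep K − 1 past samples, which is all a K-tap filter needs:

```c
#include <assert.h>
#include <string.h>

#define K 33                       /* taps, odd, with h[k] == h[K-1-k]  */
#define M ((K - 1) / 2)
#define B 64                       /* samples per processed block       */

static float buf[(K - 1) + B];     /* [ state | current input block ]   */
static float *const x = buf + (K - 1);  /* x[-1]..x[-(K-1)] hit state   */

/* Filter one block with the folded form of (1), then update the state. */
static void fir_sym_block(const float *h, const float *in, float *out)
{
    memcpy(x, in, B * sizeof *x);  /* append new block after the state  */
    for (int n = 0; n < B; n++) {
        float acc = h[M] * x[n - M];              /* center tap         */
        for (int k = 0; k < M; k++)               /* symmetric pairs    */
            acc += h[k] * (x[n - k] + x[n - K + 1 + k]);
        out[n] = acc;
    }
    /* Keep the last K-1 inputs as the state for the next block. */
    memcpy(buf, x + B - (K - 1), (K - 1) * sizeof *buf);
}
```

Because the state and input areas are contiguous, the inner loop needs no boundary tests, which is what lets a VLIW compiler software-pipeline it down to the multiply-accumulate throughput discussed above.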

Table 6: Rx timings and MFL ratio.

Target         Sundance   Pentek    Pentek
Configuration  1×C64x     1×C62x    1×XC2Vx + 1×C62x
Time/frame     15.9 ms    20.2 ms   9.9 ms
MFL ratio      60%        84%       32%

Principally due to PSH (see the PSH timing ratio relative to the whole Tx implementation in Table 5), the first transmitter implementation on the Pentek platform did not reach real time with one C62x DSP; however, it is possible to parallelize PSH in order to process half of the samples on each of two processors. Before filtering, two buffers of 1296 samples (as described in Figure 8) must be created. Each block processing operation overlaps 16 transient samples. The PSH duration is reduced by a factor of 1.5 when transfers are taken into account.

Furthermore, the code generation and kernels can be used to shift quickly to another platform. Indeed, UMTS prototyping on the Sundance platform required only a few hours to reach a real-time transmitter application, thanks to our previous work (UMTS algorithm description, SynDEx code generation, and kernels) on the Pentek platform. This is a tremendous proof of the portability offered by the methodology.

The UMTS Rx has been implemented in three different configurations (Table 6). A real-time application has been achieved on the Pentek platform with one DSP and one FPGA. MFL parallelization over several DSPs is also possible on the Pentek platform; however, more than two DSPs would have to be added to match one FPGA in the previous configuration. A configuration with four DSPs requires many transfers in the Pentek ring structure, and thus does not reduce the MFL computation length by much.

5. MPEG-4 OVER UMTS: A MULTILAYER SYSTEM

MPEG-4 is the latest compression standard. An MPEG-4 codec can be divided into ten main parts (e.g., system, visual, and audio) with different timing requirements and execution behaviors. Each part is divided into profiles and levels for the use of the tools defined in the standard. Each profile (at a given level) constitutes a subset of the standard, so that MPEG-4 can be seen as a toolbox from which system manufacturers and content creators have to select one or more profiles and levels for a given application. The application handled here is an MPEG-4 part 2 codec developed in our laboratory, which is based on the Xvid⁷ codec. This MPEG-4 codec has also been tested on several distributed platform configurations [7] (multi-DSP implementation). Here, our aim is to interface UMTS with MPEG-4 to provide a bitstream to the UMTS application.

The methodology permits merging the design of very different (heterogeneous) parts of the system in terms of hardware processing support (PC, DSP, FPGA) as well as processing nature. A conventional methodology would require different environments, which is a cause of bugs and incompatibility at the integration step. This causes delays in the best case, and could even call the whole design into question in the worst case. Our approach permits gathering the different parts of the design very early in the design flow, anticipating integration issues. Nevertheless, MPEG-4 over UMTS raises a new difficulty: the complete application is a multilayer system (two layers, MPEG-4 and UMTS) with different data periodicities between the layers. A consequence is that the whole application cannot be represented by a single DFG. The solution consists of breaking the UMTS physical layer and the video codec layer up into four algorithm subgraphs. These subgraphs (coder, decoder, modulation, and demodulation) have then been implemented onto several processors connected to each other with media (FIFOs), following the topology of Figure 9.

The MPEG-4 codec is not embedded here: firstly, the TCP throughput on the Pentek platform does not enable uncoded or uncompressed data to be transferred, and secondly, too few Sundance TIMs are available in our laboratory to embed a complete application with UMTS + MPEG-4. Our real-time MPEG-4 codec provides the maximum data rate supported by our UMTS transceiver (950 kbps). An MPEG-4 bitstream, coded on a PC, is sent via a UMTS telecommunication link to another PC to be decoded. Once the communication transceiver has been implemented on a platform, it can be viewed as a communication medium equivalent to a FIFO.

So the platform integrating the MPEG-4 codec can be described as two PCs interconnected by a UMTS communication medium. A FIFO is used to connect the asynchronous applications (codec to UMTS communication link). Asynchronous here means different periodicities and different data exchange formats. A codec cycle corresponds to one image processing operation producing a variable compressed bitstream in a variable time (about 40 ms). A UMTS cycle processes one fixed-size frame in 10 ms. The FIFO hardware signals (empty and full flags) ensure the self-regulation of the global system (UMTS + MPEG-4). Two implementations of this global system were rapidly realized on the two platforms thanks to the developed kernels, as described in Figure 9. The global system runs in real time on the Pentek platform, and is not far from real time on the Sundance platform (Rx takes 16 ms where it should take 10 ms). The first implementation of the global application, on the Pentek platform, took quite a long time (two months) to identify and solve the multilayer issue, but this implementation was then instantaneously transposed to the Sundance platform, which well illustrates the efficiency and pertinence of the approach.

6. CONCLUSIONS AND OPEN ISSUES

The design process proposed in this paper covers every step of digital signal application development, from simulation to integration. Compared with a manual approach, the use of our fast prototyping process ensures easy reuse, reduced time to market, design security, flexibility, virtual prototyping, efficiency, and portability.

⁷ www.xvid.org

[Figure: block diagram of the chain MPEG-4 coder (PC), UMTS modulation, UMTS demodulation, MPEG-4 decoder (PC); the PCs connect over TCP and the PCI bus to the FPGA + DSP platforms (Sundance via SDB links, Pentek via the BIFO), with FIFOs between the stages.]

Figure 9: MPEG-4 over UMTS.
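The empty/full-flag self-regulation between the variable-rate codec and the fixed-rate UMTS link, as shown in Figure 9, can be sketched as a bounded FIFO whose producer stalls when the buffer is full and whose consumer stalls when it is empty. The class and method names below are illustrative only, not the actual SynDEx kernel code.

```cpp
#include <cstddef>
#include <vector>

// Illustrative bounded FIFO with the "empty" and "full" material signals
// described in the text; a sketch, not the SynDEx-generated kernel.
class BoundedFifo {
public:
    explicit BoundedFifo(std::size_t capacity) : buf_(capacity) {}

    bool full() const  { return count_ == buf_.size(); }
    bool empty() const { return count_ == 0; }

    // Producer side (MPEG-4 codec): the push is refused when full,
    // which stalls the variable-rate producer.
    bool tryPush(int sample) {
        if (full()) return false;
        buf_[(head_ + count_) % buf_.size()] = sample;
        ++count_;
        return true;
    }

    // Consumer side (UMTS frame builder): the pop is refused when empty,
    // which stalls the fixed-rate consumer.
    bool tryPop(int& sample) {
        if (empty()) return false;
        sample = buf_[head_];
        head_ = (head_ + 1) % buf_.size();
        --count_;
        return true;
    }

private:
    std::vector<int> buf_;
    std::size_t head_ = 0;
    std::size_t count_ = 0;
};
```

Because both sides only test the flags, neither application needs to know the other's periodicity, which is exactly what decouples the roughly 40 ms codec cycle from the 10 ms UMTS cycle.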

On the one hand, we have shown how SynDEx is capable of manually or automatically exploring several implementation solutions using optimization heuristics and, on the other hand, how it automatically generates dedicated distributed real-time executives from kernels dependent on the processors and the media. These executives are dedicated to the application because they do not use any RTOS support and are generated from the results of the adequation, taking the distribution and scheduling of operations and data transfers into account while providing synchronizations between operations and data transfers and between consecutive repetitions of the DFG. The kernels enable recent multiprocessor platforms to be used and also enable the process to be extended to heterogeneous platforms. It was tested on several different architectures composed of TI TMS320C6201, TMS320C6203, and TMS320C6416 DSPs, Xilinx Virtex-E and Virtex-II FPGAs, and PCs.

The calculations and data transfers are executed in parallel. RAM and SAM communication models have been tested for PCI transfers. Higher transfer rates are reached using the RAM model, enabling real-time video transfers between a PC and a DSP platform.

Several complex tasks are performed automatically, such as distribution/scheduling and code generation of data transfers and synchronizations. So the development of a new application is limited to the algorithm description and to the adaptation of kernels for platforms or components. Furthermore, as the C language is used and a large number of topologies have been tested, the developed DSP kernels can easily be adapted to any other DSP and communication media.

The current MPEG-4 + UMTS application is still in progress towards fully wireless communication. A new version already integrates the channel coding (Turbo code) steps, which only slightly increases the overall complexity.

This approach ensures fast prototyping of digital signal applications over heterogeneous parallel architectures in many technological fields. Other applications have already taken advantage of it. A SynDEx description of an MC-CDMA application (a likely 4G candidate) has been developed by the IETR SPR laboratory [22]. LAR is a video codec studied in the IETR Image Group laboratory. A similar scheme (Figure 9) has already been tested on different configurations: LAR over MC-CDMA, MPEG-4 over MC-CDMA, and LAR over UMTS.

The complex MPEG-4 + UMTS application stresses that a multilayer system presents some specific characteristics in terms of data flow. In the future, this case study may be capitalized on by creating in SynDEx new hierarchical models of architecture graphs, in such a way that the physical layer (telecommunication link) may appear as a particular medium. Another issue is memory allocation in SynDEx. At each output of each vertex, SynDEx creates an allocation. At this time, memory allocations are reordered and reused manually to give an optimal solution. Current work deals with an automatic solution, based on graph coloring techniques and lifetime-based memory allocation.

REFERENCES

[1] A. M. Eltawil, E. Grayver, H. Zou, J. F. Frigon, G. Poberezhskiy, and B. Daneshrad, "Dual antenna UMTS mobile station transceiver ASIC for 2 Mb/s data rate," in Proceedings of IEEE International Solid-State Circuits Conference (ISSCC '03), vol. 1, pp. 146–484, San Francisco, Calif, USA, February 2003.
[2] K. Keutzer, S. Malik, A. R. Newton, J. M. Rabaey, and A. Sangiovanni-Vincentelli, "System-level design: orthogonalization of concerns and platform-based design," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 19, no. 12, pp. 1523–1543, 2000.
[3] T. A. Henzinger, C. M. Kirsch, M. A. A. Sanvido, and W. Pree, "From control models to real-time code using Giotto," IEEE Control Systems Magazine, vol. 23, no. 1, pp. 50–64, 2003.
[4] S. S. Bhattacharyya, P. K. Murthy, and E. A. Lee, Software Synthesis from Dataflow Graphs, Kluwer Academic, Norwell, Mass, USA, 1996.
[5] 3GPP TS 25.213 v3.3.0: Spreading and Modulation (FDD), release 1999.
[6] F. Pereira and T. Ebrahimi, The MPEG-4 Book, Prentice-Hall PTR, Upper Saddle River, NJ, USA, 2002.
[7] N. Ventroux, J. F. Nezan, M. Raulet, and O. Déforges, "Rapid prototyping for an optimized MPEG-4 decoder implementation over a parallel heterogeneous architecture," in Proceedings of 28th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), vol. 2, pp. 433–436, Hong Kong, April 2003 (conference cancelled; invited paper, ICME 2003).
[8] Y. Sorel, "Massively parallel computing systems with real time constraints: the "Algorithm Architecture Adequation" methodology," in Proceedings of 1st IEEE International Conference on Massively Parallel Computing Systems (MPCS '94), pp. 44–53, Ischia, Italy, May 1994.
[9] T. Grandpierre, C. Lavarenne, and Y. Sorel, "Optimized rapid prototyping for real-time embedded heterogeneous multiprocessors," in Proceedings of 7th International Workshop on Hardware/Software Codesign (CODES '99), pp. 74–78, Rome, Italy, May 1999.

12 EURASIP Journal on Applied Signal Processing

[10] Y. Sorel, "Real-time embedded image processing applications using the A3 methodology," in Proceedings of IEEE International Conference on Image Processing (ICIP '96), vol. 2, pp. 145–148, Lausanne, Switzerland, September 1996.
[11] T. Grandpierre and Y. Sorel, "From algorithm and architecture specifications to automatic generation of distributed real-time executives: a seamless flow of graphs transformations," in Proceedings of 1st ACM and IEEE International Conference on Formal Methods and Models for Co-Design (MEMOCODE '03), pp. 123–132, Mont Saint-Michel, France, June 2003.
[12] F. Balarin, L. Lavagno, P. Murthy, and A. Sangiovanni-Vincentelli, "Scheduling for embedded real-time systems," IEEE Design and Test of Computers, vol. 15, no. 1, pp. 71–82, 1998.
[13] L. A. Hall, D. B. Shmoys, and J. Wein, "Scheduling to minimize average completion time: off-line and on-line algorithms," in Proceedings of 7th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '96), pp. 142–151, Atlanta, Ga, USA, January 1996.
[14] V. Fresse, O. Déforges, and J. F. Nezan, "AVSynDEx: a rapid prototyping process dedicated to the implementation of digital image processing applications on multi-DSP and FPGA architectures," EURASIP Journal on Applied Signal Processing, vol. 2002, no. 9, pp. 990–1002, 2002, special issue on implementation of DSP and communication systems.
[15] Texas Instruments, "TMS320C6000 Optimizing Compiler User's Guide," reference SPRU187L, March 2004.
[16] Y. Le Méner, M. Raulet, J. F. Nezan, A. Kountouris, and C. Moy, "SynDEx executive kernel development for DSP TI C6x applied to real-time and embedded multiprocessor architectures," in Proceedings of 11th European Signal Processing Conference (EUSIPCO '02), Toulouse, France, September 2002.
[17] Texas Instruments, "TMS320 DSP/BIOS User's Guide," reference SPRU423B, September 2002.
[18] F. Nouvel, S. Le Nours, and I. Herman, "AAA methodology and SynDEx tool capabilities for designing on heterogeneous architecture," in Proceedings of 18th Conference on Design of Circuits and Integrated Systems (DCIS '03), Ciudad Real, Spain, November 2003.
[19] A. Kountouris, C. Moy, and L. Rambaud, "Reconfigurability: a key property in software radio systems," in Proceedings of 1st Karlsruhe Workshop on Software Radios, Karlsruhe, Germany, March 2000.
[20] C. Moy, A. Kountouris, and A. Bisiaux, "HW and SW architectures for over-the-air dynamic reconfiguration by software download," in Proceedings of the Software Defined Radio Workshop of IEEE Radio and Wireless Conference (RAWCON '03), Boston, Mass, USA, August 2003.
[21] S. A. White, "Applications of distributed arithmetic to digital signal processing: a tutorial review," IEEE ASSP Magazine, vol. 6, no. 3, pp. 4–19, 1989.
[22] S. Le Nours, F. Nouvel, and J. F. Hélard, "Example of a co-design approach for a MC-CDMA transmission system implementation," in Journées Francophones sur l'Adéquation Algorithme Architecture (JFAAA '02), Monastir, Tunisia, December 2002.

M. Raulet received his postgraduate certificate in signal, telecommunications, images, and radar sciences from Rennes University in 2002, and his Engineering degree in electronic and computer engineering from the National Institute of Applied Sciences (INSA), Rennes Scientific and Technical University, in 2002, where he is currently working as a Ph.D. student in collaboration with Mitsubishi Electric. His interests include image compression and telecommunication algorithms and rapid prototyping.

F. Urban received his postgraduate certificate in signal, telecommunications, images, and radar sciences from Rennes University in 2004, and his Engineering degree in electronic and computer engineering from INSA, Rennes Scientific and Technical University, in 2004. He is currently working as a Ph.D. student in collaboration with Thomson R&D, France. His interests include image compression, rapid prototyping, DSP optimization, and codesign.

J.-F. Nezan is an Assistant Professor at the National Institute of Applied Sciences (INSA) of Rennes and a Member of the IETR Laboratory in Rennes. He received his postgraduate certificate in signal, telecommunications, images, and radar sciences from Rennes University in 1999, and his Engineering degree in electronic and computer engineering from INSA, Rennes Scientific and Technical University, in 1999. He received his Ph.D. degree in electronics in 2002 from the INSA. His main research interests include image compression algorithms and multi-DSP rapid prototyping.

C. Moy has been an Engineer at the National Institute of Applied Sciences (INSA), Rennes Scientific and Technical University, France, since 1995. He received his M.S. and Ph.D. degrees in electronics in 1995 and 1999 from the INSA. He worked from 1995 to 1999 on spread-spectrum and RAKE receivers for the Institute of Electronics and Telecommunications of Rennes (IETR). He then worked 6 years at the Mitsubishi Electric ITE-TCL Research Laboratory, where he was focusing on software radio systems and concepts. He is now an Assistant Professor at Supélec. His research is done in the SCEE Laboratory of the UMR CNRS 6164 IETR, which focuses on cognitive radio. He addresses heterogeneous design techniques for SDR as well as cross-layer optimization topics.

O. Déforges is a Professor at the National Institute of Applied Sciences (INSA) of Rennes. He graduated in electronic engineering from the École Polytechnique, University of Nantes, France, in 1992, where he also received a Ph.D. degree in image processing in 1995. In 1996, he joined the Department of Electronic Engineering at the INSA, Rennes Scientific and Technical University. He is a Member of the UMR CNRS 6164 IETR Laboratory in Rennes. His principal research interests are lossy and lossless image and video compression, image understanding, fast prototyping, and parallel architectures.

Y. Sorel is a Research Director at INRIA (National Institute for Research in Computer Science and Control) and the Scientific Leader of the Rocquencourt team AOSTE (Analysis and Optimization for Systems with Real-Time and Embedding Constraints). His main research topics are modeling of distributed real-time systems with graphs and partial orders, uniprocessor and multiprocessor real-time scheduling optimizations of systems with multiple constraints, and automatic code generation for hardware/software codesign. He is also the founder of SynDEx, a system-level CAD software distributed free of charge at www.syndex.org.

Hindawi Publishing Corporation
EURASIP Journal on Applied Signal Processing
Volume 2006, Article ID 32408, Pages 1–12
DOI 10.1155/ASP/2006/32408

A Fully Automated Environment for Verification of Virtual Prototypes

P.Belanovic,´ B. Knerr, M. Holzer, and M. Rupp

Institute of Communications and Radio Frequency Engineering, Vienna University of Technology, 1040 Vienna, Austria

Received 15 October 2004; Revised 29 March 2005; Accepted 25 May 2005

The extremely dynamic and competitive nature of the wireless communication systems market demands ever shorter times to market for new products. Virtual prototyping has emerged as one of the most promising techniques to offer the required time savings and resulting increases in design efficiency. A fully automated environment for development of virtual prototypes is presented here, offering maximal efficiency gains, and supporting both design and verification flows, from the algorithmic model to the virtual prototype. The environment employs automated verification pattern refinement to achieve increased reuse in the design process, as well as increased quality by reducing human coding errors.

Copyright © 2006 P. Belanović et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

Complexity of modern embedded systems, particularly in the wireless communications domain, grows at an astounding rate. This rate is so high that the algorithmic complexity now significantly outpaces the growth in complexity of underlying silicon implementations, which proceeds according to the famous Moore's Law [1]. Furthermore, algorithmic complexity even more rapidly outpaces design productivity, expressed as the average number of transistors designed per staff/month [2, 3]. In other words, current approaches to embedded system design are proving inadequate in the struggle to keep up with system complexity.

Hence, a number of new system design techniques with the potential to speed up design productivity are being intensively researched [4, 5]. One of these techniques, known as virtual prototyping [6–8], speeds up the design process by enabling development of the hardware and software components of the embedded system in parallel.

Development of a comprehensive design environment for automatic generation and verification of virtual prototypes (VPs) from an algorithmic-level description of the system is presented here. Section 1.1 describes the concept of a VP in closer detail and Section 1.2 explains the model of the hardware platform used in this work. A survey of related work, including a comparison of the presented environment with the most advanced current approaches, is given in Section 1.3. The design environment for automatic generation of VPs is described in detail in Section 2. The part of the presented environment concerned with automated verification pattern refinement for VPs is presented in Section 3, together with an example design. Finally, conclusions are drawn in Section 4.

1.1. Virtual prototype concept

System descriptions at the algorithmic level contain no specific implementation details. Hence, before implementation of the system can begin, the algorithmic description is partitioned, that is, each component in the description is assigned to software or hardware implementation.

Traditionally, implementation of hardware components proceeds from this point. Development of software modules, however, can begin only once all required hardware design is complete. This is due to the fact that the design of software components must take into consideration the behaviour of the underlying hardware. Hence, a significant penalty is incurred in the length of the design process (see Figure 1, top chart).

Virtual prototyping [9] is a technique which can eliminate most of this penalty and thus dramatically shorten the development cycle. A VP is a software model of the complete system, fully representing its functionality, without any implementation details. To achieve the mentioned system development speedup, we consider VPs which additionally include full definitions of the hardware/software interfaces found in the system, including the required architectural information, but still no details of the actual implementation of any component.

[Figure: two timelines. In traditional embedded system development, algorithmic modeling, hardware development, and software development run in sequence; with a manually generated VP, VP development follows algorithmic modeling and hardware and software development then proceed in parallel, yielding time savings.]

Figure 1: Shortening of the design cycle by the VP technique.

The speedup in the system development cycle obtained by employing virtual prototyping is achieved as depicted in Figure 2. Firstly, the algorithmic model is partitioned into components to be implemented in hardware and those to be implemented in software. This defines the hardware-software interfaces in the system. In Figure 2, blocks B, C, and E have been assigned to implementation in hardware and blocks A, D, and F in software. The algorithmic description is then remodeled to a form where these interfaces are clearly defined. Thus, the VP of the system is created.

From this point, hardware and software development proceed in parallel. It is important to note that all blocks assigned to hardware implementation are grouped into a number of VP components, each of which will later be realised as a separate hardware accelerator in the system architecture. In Figure 2, blocks B and C form VP component 1, whereas block E alone forms VP component 2.

Development of the hardware implementation of VP component 1 is done against the hardware-software interface defined in the VP. Similarly, the hardware implementation of VP component 2 relies on the existence of the same hardware-software interface. At the same time, the development of the software implementation of VP component 3 makes use of the same interface. Such use of the VP ensures co-operability of the three implementations, allowing for their parallel development and the resulting time savings.

Virtual prototyping offers numerous improvements to the design process. First and foremost, it allows parallel development of all components in the system, resolving all interface dependencies. Furthermore, it allows verification of software components which interface with hardware against the known hardware-software interface. Finally, a VP allows verification of the hardware implementation itself, making sure the hardware indeed provides the correct interface to external components, as it was designed for at the algorithmic level.

Very importantly, creation of a VP for a system component requires a relatively small design effort, compared to that of a full hardware or software implementation. This is due to the relaxed requirement of the VP to recreate behaviour only at component boundaries, allowing all other implementation details to be overlooked. As seen in Figure 1 (bottom chart), this allows the time savings which make the VP a desirable design technique.

1.2. Model of hardware platform

The structure of the hardware platform assumed in this work is a generic multiprocessor system-on-chip (SoC) architecture. At least one processor core, such as the StarCore DSP, for example, is present in the architecture, as shown in Figure 3. All the system components assigned to software implementation will be targeted to one of these processor cores. Also present in the system are a number of hardware accelerator (HA) blocks. These contain custom silicon designs to provide accelerated processing for time-critical system functions. All the system components assigned to hardware implementation will be realised as these HA blocks.

The system also contains one or more banks of system memory and a dedicated direct memory access (DMA) controller, serving the processor cores as well as the HA blocks. Communications on this hardware platform are facilitated by at least one system bus, such as an AMBA bus, for example, connecting all system components. Additionally, HA blocks may be provided with dedicated direct I/O ports for off-chip communications.

1.3. Related work

Extension of the virtual prototyping environment into the verification flow requires automated verification pattern refinement, as explained in Section 3. Several previous research efforts in this area exist. Varma and Bhatia [10] present an approach to reusing preexisting verification programs for virtual components. This approach includes a fully automated reuse methodology, which relies on a formal description of architectural constraints and produces system-level verification vectors. However, this approach is applicable only to hardware virtual components.

P. Belanović et al. 3
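The central idea above, that a VP fixes behaviour at component boundaries while hiding all implementation detail, can be sketched as an interface that both the VP stand-in and the eventual hardware driver implement. The class names and the placeholder behaviour below are our own illustration, not code from the presented environment.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch: the hardware/software interface of one component,
// fixed by the VP. Software is written against this boundary only.
class ComponentInterface {
public:
    virtual ~ComponentInterface() = default;
    // Behaviour visible at the component boundary: consume one input
    // block, produce one output block. No implementation detail leaks.
    virtual std::vector<int16_t> process(const std::vector<int16_t>& in) = 0;
};

// Virtual-prototype stand-in: functionally complete, implementation-free.
class VpComponent : public ComponentInterface {
public:
    std::vector<int16_t> process(const std::vector<int16_t>& in) override {
        std::vector<int16_t> out(in);
        for (auto& v : out) v = static_cast<int16_t>(-v);  // placeholder behaviour
        return out;
    }
};
```

Software developed today against ComponentInterface keeps working unchanged when VpComponent is later replaced by the driver of the real hardware accelerator, which is what enables the parallel development shown in Figure 2.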

[Figure: the partitioned algorithmic model (blocks A–F) is remodeled into a virtual prototype in which blocks B and C form Comp1 (HW), block E forms Comp2 (HW), and blocks A, D, and F form Comp3 (SW); from the VP, the hardware implementations (HA1, HA2) and the software implementation (on the DSP) are developed in parallel towards the final implementation.]

Figure 2: System development using a VP.

On the other hand, Stöhr et al. [11] present FlexBench, a fully automated methodology for reuse of component-level stimuli in system verification. While this environment presents a novel structure which supports verification pattern reuse at various abstraction levels without the need for reformatting of the verification patterns themselves, this in turn creates the need for new "driver" and "monitor" blocks in the environment for every new component being verified. Also, this environment has only been applied to hardware components.

An automated testing framework offered by Odin Technologies, called Axe [12], also offers automated reuse of verification patterns during system integration. However, this environment requires manual rewriting of test cases in Microsoft Excel and relies on the use of a third-party test automation tool on the back end. Also, the Axe framework has only been applied to development of software systems.

The verification extension of the virtual prototyping environment presented here is also designed to provide fully automated verification pattern refinement, but addresses this issue in a more general manner than previously published work. Hence, it is applicable to both software and hardware components, and indeed to verification pattern refinement between any two abstraction levels, though the particular instance of the framework presented here is specific to the transition from the algorithmic to the virtual prototype abstraction level.

[Figure: generic SoC platform with one or more DSP cores, a DMA controller, and system RAM on a shared system bus, together with hardware accelerators HA1, HA2, …, which may also have dedicated direct I/O ports.]

Figure 3: Target hardware platform.

2. AUTOMATED VIRTUAL PROTOTYPE GENERATION

As described earlier, design of an embedded system proceeds from the algorithmic-level description towards the system's final implementation, firstly through a partitioning process, followed by the creation of a VP, and finally hardware or software implementation of each individual component.

The process of VP generation is typically performed manually, through rewriting of the VP from the algorithmic-level description. However, when the VP design environment is integrated into a unified design methodology, it is possible to make VP generation a fully automated process. This helps eliminate human errors and drastically decreases the time needed to create a VP, in turn deriving the maximum possible efficiency gain promised by virtual prototyping [13, 14]. This is illustrated in Figure 4.

The automatic VP generation environment presented here is depicted in Figure 5. The process of automatically generating a VP component from that component's algorithmic description consists of two parts. First, the algorithmic description of the entire system (encompassing all its components) is read into the single system description (SSD). This also includes partitioning of the system by labelling each system component for implementation in hardware or software. The second step in the process is the generation of all parts of the VP component from the SSD.

[Figure: two timelines. With a manually generated VP, algorithmic modeling is followed by manual VP development and then parallel hardware and software development; with an automatically generated VP, the VP development step is automated and the cycle shortens further, yielding additional time savings.]

Figure 4: Shortening of the design cycle by automating VP generation.

[Figure: the COSSAP project fileset (.gc, .arc, .ent), constrained by the COSSAP guidelines, is read through the SDI into the SSD, together with the HW/SW partitioning information table; from the SSD, the VPG generates the VP components for blocks W, X, Y, Z, each containing the recreated block structure, a scheduler, and a bus interface.]

Figure 5: Design environment for automatic generation of VPs.

2.1. Processing the algorithmic description

The environment for automatic generation of VPs presented here is based on processing algorithmic descriptions created in the COSSAP environment. Nevertheless, the VP environment is in principle independent of the languages and tools used for algorithmic modelling and can, due to its modular structure, easily be adapted to any language or tool.

COSSAP descriptions contain separate structural/interconnection and functional information. The structural and interconnection information in the COSSAP description is VHDL-compliant and is read into the SSD by the system description interface (SDI). The SDI comprises a VHDL-compliant parser module as well as a scanner module which manages the database structure within the SSD.

The functional information in COSSAP descriptions is written in GenericC (an extension to ANSI C proprietary to the COSSAP environment) and has to be formatted in accordance with specific guidelines. These guidelines ensure compatibility of the GenericC code with the tools in the second phase of the automatic VP generation. Suitably formatted functional component descriptions are placed directly into the SSD.

After the complete algorithmic system description is processed into the SSD, it is necessary to perform hardware/software partitioning before VP components for all hardware components can be generated. Manually created hardware/software partitioning decisions, stored in textual form, are integrated directly into the SSD. Also, possibilities for automated hardware/software partitioning exist and have been successfully applied to the presented environment [15], yielding the same quality of results as manual system partitioning. Once system partitioning has been performed, the first phase of the VP generation process is complete.
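As a rough illustration of how textual partitioning decisions could be merged into an SSD-like database, the following sketch uses an invented one-line-per-block format ("block = hw|sw"); the real SSD, the SDI, and their file formats are tool-specific and not shown in the paper.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical sketch of integrating textual hardware/software
// partitioning decisions into a single system description (SSD).
enum class Impl { Hardware, Software };

struct SingleSystemDescription {
    std::map<std::string, Impl> partitioning;  // block name -> assignment
};

// Parse lines such as "demapper = hw" and label the SSD accordingly.
inline void applyPartitioning(SingleSystemDescription& ssd, const std::string& text) {
    std::istringstream in(text);
    std::string block, eq, impl;
    while (in >> block >> eq >> impl)
        ssd.partitioning[block] = (impl == "hw") ? Impl::Hardware : Impl::Software;
}
```

With the labels in place, a generator pass can then select every block marked Hardware when emitting VP components, which mirrors the two-step process described in Section 2.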

[Figure: starting from the initial concept, the model and its verification patterns exist side by side at each level n; model refinement produces the level n+1 model together with refinement information, which verification pattern refinement uses to produce the level n+1 verification patterns, and so on down to the final product.]

Figure 6: Conceptual view of parallel refinement of the model and the associated verification patterns.

2.2. Virtual prototype generation

A VP component is composed of several parts, as shown in Figure 5. The core of the VP component is the recreated interconnected block structure, as found in the algorithmic-level model—blocks A, B, and C in Figure 5. Additionally, the VP component contains a scheduler which controls the execution of each block, according to the current input and output sample rates of each block and the availability of data to be processed. Finally, the VP component contains a bus interface, responsible for communications between the VP component and the processor core(s) in the system over the bus. This block is shown in gray in Figure 5, because it needs to be created manually, depending on the bus type, communications protocol, and processor core(s) used in the system.

The second phase of automatic VP generation is performed by the virtual prototype generator (VPG) tool. This tool extracts all necessary structural information for the particular component from the SSD and creates the interconnected block structure accordingly. Relevant functional information in the SSD is code-styled to be compliant with the VSIA standard [16] and the C++ language and is then integrated into the VP component. Following these steps, the automatically created VP component can be manually customised to a particular system bus, processor core(s), and communications protocols, before being used.

3. AUTOMATED VERIFICATION PATTERN REFINEMENT

As stated previously, design flows for embedded systems traditionally start from initial concepts of system functionality, progressing through a number of refinement steps, eventually resulting in the final product, containing all the software and hardware components that make up the system. These refinement levels of a particular design flow may include the algorithmic level, the architectural level, the register transfer level (RTL), and others.

As the model of the design progresses from one refinement level to another, it needs to be verified for correct functionality at each level. Hence, the model of the system at each refinement level has associated with it a set of verification patterns, designed to verify correct functionality of the corresponding model.

The verification patterns at each new level in the design flow are traditionally created from the verification patterns at the previous refinement level. This is shown in Figure 6. We refer to this process henceforth as verification pattern refinement.

Whereas a great multitude of EDA tools and research work exists for automating refinement of system models between all the various refinement levels, there is a distinct lack of such support for verification pattern refinement. This causes both significantly prolonged verification cycles and lower design quality, due to the introduction of manual coding errors. Hence, a significant reduction of the time to market as well as an improvement in quality can be achieved by automating verification pattern refinement.

The manual process of verification pattern refinement, as it is customary in modern engineering practice, involves rewriting of the verification patterns from the earlier refinement level, applying the refinement information which resulted from model refinement, to produce the new verification patterns (see Figure 6). Hence, two distinct tasks can be recognised in the process of verification pattern refinement.

(i) Reformatting of verification pattern data, to fit the new format required at the next refinement level.
(ii) Enrichment of the same data, with the refinement information (see Figure 6), which does not appear in the

[Figure: the algorithmic-level verification patterns (data-in streams, data-out streams, parameter-in stream, parameter-out stream) are processed by the test generator script according to the interface specification, producing the virtual-prototype-level verification patterns: direct I/O data, a memory image, and the verification program.]

Figure 7: Structure of the environment for automatic generation of verification patterns.
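The refinement step shown in Figure 7 combines the two tasks named above, reformatting and enrichment. A minimal sketch of that combination follows; the types, the fixed-point scaling, and the timing flag are our assumptions for illustration, not the actual tool's data model.

```cpp
#include <cstdint>
#include <vector>

// Level-n (algorithmic) patterns: untimed floating-point samples.
struct PatternLevelN  { std::vector<double> samples; };

// Level-n+1 (virtual prototype) patterns: fixed-point words plus
// information that did not exist at level n.
struct PatternLevelN1 {
    std::vector<int16_t> words;
    bool timed = false;
};

// Refinement information produced by model refinement (see Figure 6).
struct RefinementInfo {
    double scale;        // e.g., fixed-point scaling chosen during refinement
    bool cycleAccurate;  // e.g., timing detail absent at level n
};

PatternLevelN1 refine(const PatternLevelN& p, const RefinementInfo& info) {
    PatternLevelN1 out;
    out.words.reserve(p.samples.size());
    for (double s : p.samples)                                     // (i) reformatting:
        out.words.push_back(static_cast<int16_t>(s * info.scale)); //     same data, new format
    out.timed = info.cycleAccurate;                                // (ii) enrichment: add
    return out;                                                    //     refinement information
}
```

Automating both steps in one pass is what removes the manual rewriting, and with it the manual coding errors, that the text identifies as the cost of traditional verification pattern refinement.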

original verification patterns, but is a necessary component in the newly created verification patterns.

Although the reformatting task can be, and frequently is, fully automated, current approaches to verification pattern refinement require manual effort from the designer in order to complete the enrichment task, for which traditionally no formal framework exists.

The environment for automated generation of virtual prototypes from algorithmic-level models presented in Section 2 demonstrated automated model refinement between these two refinement levels. This section presents an environment for automating the corresponding verification pattern refinement, from the algorithmic level to the virtual prototype level, performing both reformatting and enrichment of the verification patterns automatically.

3.1. Verification at algorithmic level

At the algorithmic level, the model of the system contains no architectural information and the partitioning of the system is done on a purely functional basis. Hence, the model of the system typically assumes the form of a process network, with all functional blocks that make up the system executing concurrently and communicating through FIFO channels. Popular commercially available environments for development and simulation of such models are Matlab/Simulink, COSSAP, and SPW, among others. The work described here concentrates on algorithmic models developed in the COSSAP environment, though with no substantial changes, it is applicable to other algorithmic-level models as well.

The presence of two types of information flowing through the FIFO communications channels of the model is assumed. The first type of information consists of parameters, responsible for controlling the modes of operation of each process. The second type of information is data, the actual values which are processed in the system and have no influence on the mode of operation of any process.

Therefore, verification patterns at the algorithmic level consist of a set of sequences of values, or streams. Exactly one stream exists for each of the data channels going into the model and one for each data channel going out of the model. A pair of dedicated parameter streams, exactly one for all parameters going into the model, and exactly one for those going out of the model, also exists. The complete set of streams is shown as algorithmic-level verification patterns in Figure 7.

Since no architectural or implementation information is yet known at the algorithmic level, the simulation of the model (and hence its verification) at this level is purely untimed functional. In other words, the simulation is driven solely by the availability of input parameters and data, and their processing by the system modules.

3.2. Verification at virtual prototype level

Use of a virtual prototype implies a highly heterogeneous system. Initially, all of the components in the system have a general, purely algorithmic description. During parallel software and hardware development of the various system components (see Figure 2), some of the initial component descriptions may be replaced by implementation-specific descriptions. For hardware components these may be VHDL or Verilog descriptions, while for software components these may be written in Java or C++, for example. Hence, as the development of the system progresses, the VP becomes increasingly heterogeneous.

In this work, we focus on verification of system components assigned to hardware implementation, since they will be implemented as part of an HA block (see Figure 3). Verification of software components is entirely analogous, but has reduced complexity, because no HA blocks are involved (a more homogeneous problem).

Hence, verification at the virtual prototype level requires the following:

(i) device under verification (DUV),
(ii) verification patterns,
(iii) verification program (runs on the DSP, applies the verification patterns to the DUV).

It is important to note that the structure of the hardware platform (see Figure 3) enforces the separation of verification patterns into two types, according to how they are communicated to the DUV. Hence, there exist verification patterns communicated to the VP through the system bus (stored in a structured memory image) and those communicated to the VP through its direct I/O interfaces (supplied directly to the VP during functional simulation). Both of these types of verification patterns are shown as virtual prototype-level verification patterns in Figure 7, together with the necessary verification program.

Since verification at the virtual prototype level relies heavily on transactions over the system bus, it is implemented in a bus-cycle true manner. The bus interface of the DUV, as well as the rest of the simulation environment, including the VSIA-compliant models of the DSP and the system bus, are also accurate to this time resolution within the functional simulation of the complete system.

3.3. Environment for automatic generation of verification patterns

The environment for automated verification pattern refinement presented here generates virtual prototype-level verification patterns from algorithmic-level verification patterns, as shown in Figure 7.

3.3.1. COSSAP verification patterns

The environment for algorithmic-level modelling considered in this work is COSSAP from Synopsys. Hence, the algorithmic-level verification patterns used also come from the COSSAP environment. As seen in Figure 7, they are divided into four sets of streams: parameter in and out, and data in and out streams.

Exactly one stream exists for all parameters supplied to the DUV during functional verification, as well as exactly one stream for all parameters read from the DUV. Exactly one stream exists for each data input port of the DUV and exactly one for each of its output ports.

The structure of each stream is a sequence of values to be supplied to the inputs or expected at the outputs of the DUV. Remembering that verification at the algorithmic level follows an untimed functional paradigm, that is, is driven purely by the availability of input parameters and data, no further timing information needs to be contained in the streams.

3.3.2. Verification program

The verification program runs on the processor core and communicates with the DUV over the system bus. Its function is to supply the appropriate verification patterns from the memory image to the DUV, as well as to verify the processing results of the DUV against the expected results, also stored in the memory image. The cycle of writing to/reading from the DUV is repeated for the complete set of verification patterns, on the basis of one input block and one output block being processed per cycle (see Section 3.3.3 for more details).

Functionality of the verification program is hence not dependent on the particular VP being verified. Thus, the verification program is generic in nature, and can be reused for verification of any VP component. However, a separate verification program must of course be written for every new processor core used in the system and being employed to run the verification of any DUV.

3.3.3. Memory image

The memory image is a structured representation of the verification patterns for the virtual prototype level. It includes only those verification patterns which are to be supplied to or read from the DUV over the system bus.

As already mentioned, since the verification program is generic and applicable to the verification of any VP component, all verification pattern values, their sequence, and the appropriate interface information must be contained in the memory image. This in turn dictates the structure of the memory image: it contains all the above information, while both making it efficiently accessible in a generic manner by the verification program, as well as minimizing the memory size overhead required to establish this structure.

As a consequence, the memory image is organised as shown in Figure 8. It is primarily divided into the input memory image and the output memory image. The former contains all verification patterns (both parameter and data) which are written to the DUV. The latter contains those verification patterns which are used to check the validity of the outputs of the DUV.

[Figure: the input and output memory image, each a header followed by blocks; each block a header followed by sequences; each sequence carrying values and masks.]

Figure 8: Structure of the memory image.

Further, each of the two primary parts of the memory image contains a header, followed by several blocks. The header contains the number of blocks in the particular image, followed by a pointer to the beginning of each block, as well as a pointer to the end address of the last block. The latter pointer is effectively the pointer to the end of the particular image and is used in assessing the total size of the memory image by the verification program.

Each block is a set of verification patterns which are consumed (for input image) or produced (for the output image)

[Figure: on the left, the algorithmic model of vc1 with subblocks b1 and b2, internal data channels (e.g., d1) and parameter channels (e.g., p1), and external ports di1–di4, do1, pi1, po1, and po2; on the right, the virtual prototype model of vc1 with ports bound either to the system bus or to direct I/O.]

Figure 9: Model refinement of the virtual component vc1, from algorithmic level (left) to virtual prototype level (right).

by the DUV in a single functional invocation. Similar to the structure of the memory image itself, each block contains a header, followed by a number of sequences. The header contains the number of sequences in the particular block, followed by a pointer to the beginning of each sequence.

A sequence is a set of verification pattern values to be written to or read from a contiguous section of the DUV's register space. It is composed of a header, a set of values, and a set of masks. The header contains only the start address within the DUV's register space where the write or read operation is to take place.

In the case of the input memory image, the values in a sequence are to be written to the DUV, while the masks determine which bits of each value are to be written to the DUV (overwriting the current content) and which bits are to be kept at their current state. Hence, the required operation for writing the verification patterns from the memory image to the DUV is given (on the bit level) as n = (m̄ · c) + (m · v), or a simple 1-bit multiplex operation, where v is the value in the verification pattern, m is the mask, c is the current value in the DUV register space, and n is the new value.

In the case of the output memory image, the values in a sequence are to be compared to those returned by the DUV, to verify its functionality. The mask values are used to indicate which of the bits are to be verified and which bits can be regarded as "do not care." Hence, the required operation while verifying the functionality of the DUV is given (on the bit level) as t = m · (c ⊕ v), where v is the expected value, m is the mask, c is the current value in the DUV register space, and t is the test output. A failed test is indicated with the logical state "1" of the variable t.

3.3.4. Direct I/O data

As already mentioned in Section 3.2, during the verification process, some verification patterns are supplied to the DUV directly through the I/O interfaces of the HA (see Figure 3) and not through the system bus. Hence, during the verification process these values are not handled by the processor core and are thus not part of the memory image.

The direct I/O data is therefore handled separately during the simulation process. A dedicated module in the simulation environment has been created to serve the sole purpose of making the direct I/O data available to the DUV through its direct I/O ports.

3.3.5. Interface specification

The interface specification (see Figure 7) contains all the structural information which is present, and naturally required during verification, at the VP level, but did not exist at the algorithmic level. Indeed, this interface information comes as a result of the refinement process, going from the algorithmic model to the VP.

In other words, the interface specification is the refinement information (as depicted in Figure 6) between the algorithmic level and the VP level. Hence, the interface information is needed in order to perform verification pattern refinement between these two levels.

The interface specification can contain interface information for several VP components. Each part dedicated to a particular VP component is composed of exactly one parameter and one data section. The parameter section contains interface information for all the parameters of the VP component in question. Correspondingly, the data section contains interface specifications for each data channel (input as well as output) of the VP component in question.

The parameter interface information includes names of all parameters in the model, together with their bit-exact addresses in the register space of the DUV. Unlike parameters, data is packaged for communication over the system bus and writing into the register space of the DUV. That is to say, several data values may be packaged into one register of the DUV. If the latter is 32 bits wide, it is efficient to package four 8-bit data values into a single register. Hence, the data section of the interface specification contains, in addition to the name of the data input or output, also its packaging factor (being four in the example above) and its starting address in the register space of the DUV.

3.3.6. Test generator script

The test generator script (TGS) lies at the core of the automated environment for verification pattern refinement presented here, as shown in Figure 7. Its main function is to create the VP level verification patterns, that is, perform both steps in the verification pattern refinement process automatically (see Section 3).

In order to achieve this, the TGS creates the structure of the memory image as described in Section 3.3.3. The reformatting step of the verification pattern refinement process is achieved by interleaving the block-based structure of the algorithmic verification patterns, followed by the analysis of the resulting single stream of patterns. As a result of this analysis, the structure of the memory image, with associated block, sequence, and pointer structures, can be created.

The second step in the verification pattern refinement process is the enrichment of the verification patterns with refinement information, that is, architectural details. This task achieves the filling out of the empty memory image structure with the actual verification pattern values, with correct bus interface formats, including appropriate register mapping. Hence, in order to complete this task, the TGS constructs each sequence of each block, both in the input and the output memory image, by bitwise combination of the algorithmic verification patterns, according to the register mapping found in the interface specification. Also, the TGS creates the appropriate bitwise masks found in each sequence, again from the information found in the interface specification.

The so-prepared memory image is written by the TGS in binary file format, ready to be loaded directly into system memory, either within the VP simulation environment or (in the implementation stage of the design process) on the hardware platform itself.

3.4. Example design

An example design, showing the automated refinement of verification patterns for a virtual component vc1, from the algorithmic level to the virtual prototype level, is given in this section. Initially, this component undergoes refinement of the model itself, as shown in Figure 9. Here the model of vc1 in the algorithmic modelling environment, such as COSSAP, is shown on the left. The component is made up of two subblocks, b1 and b2, connected by various data channels (represented by full lines, such as d1) and parameter channels (represented by broken lines, such as p1).

On the right in Figure 9, the virtual prototype model of vc1 is shown. This model contains the same interconnected structure as that in the algorithmic model, but additionally it contains architectural information. This additional architectural information is hence introduced into the model as a result of the refinement process, shown in Figure 6 as "Refinement Information." This architectural information includes the architectural location of data ports, such as the assignment of input port di1 to the system bus interface and input port di4 to the direct I/O interface.

Moreover, this refinement information includes the register mapping of all data and parameter channels which have been assigned to the system bus interface, as described earlier in this section. The register mapping for the virtual component vc1 is shown in Figure 10. Hence, the bus interface between the component vc1 and the processor core on which the software components are running occupies the section of the register space between addresses 01A0 and 01A7 (inclusive). Data corresponding to the input data port di1 occupies registers 01A0 and 01A1, with a packaging factor two (as described earlier). Similarly, the output parameters po1 and po2 occupy nonoverlapping (but bordering) sections of the register 01A7.

[Figure: registers 01A0–01A7 of the DUV register space; 01A0 and 01A1 hold di1 (two values per register), 01A2 holds di2, 01A3 and 01A4 hold di3, 01A5 holds pi1, 01A6 holds do1, and 01A7 holds po1 and po2.]

Figure 10: Register mapping of each data and parameter port of vc1.

All parts of this refinement information are formally described in the interface specification for the component vc1, as shown in Figure 11.

[Figure: the interface specification listing for component vc1 — parameters pi1 (01A5, bits 0–3), po1 and po2 (01A7); data channels di1 (bus, 01A0, packaging factor 2), di2 (bus, 01A2, factor 4), di3 (bus, 01A3, factor 1), di4 (direct I/O), and do1 (bus, 01A6, factor 4) — followed by the entries for further components (vc2, ...).]

Figure 11: Interface specification for the virtual component vc1.

[Figure: the COSSAP streams for the ports of vc1 — one value stream per data port (di1, di2, di3, do1) plus the dedicated para in and para out streams — each divided into blocks marked "New block".]

Figure 12: The COSSAP verification patterns for each port of vc1.

Here, it is specified that the input

Input image header:
    00   00 00 00 05   Number of blocks in the input image
    01   00 00 00 07   Pointer to input block 1
    ...
    05   00 00 00 43   Pointer to input block 5
    06   00 00 00 52   Pointer to end of input block 5

Input block 1:
    07   00 00 00 01   Number of sequences in input block 1
    08   00 00 00 09   Pointer to sequence 1
    09   00 00 01 A0   Pointer to the starting address in the register space
    0A   50 1C CD 9A   Value
    0B   4F 05 E0 D5   Value
    0C   A1 60 89 15   Value
    0D   00 0B 08 55   Value
    0E   00 2C 40 02   Value
    0F   00 00 00 0A   Value
    10   FF FF FF FF   Mask
    11   FF FF FF FF   Mask
    12   FF FF FF FF   Mask
    13   FF FF FF FF   Mask
    14   FF FF FF FF   Mask
    15   00 00 00 0F   Mask
    16   ...           Start of input block 2
    52                 End of input block 5

Output image header:
    53   00 00 00 05   Number of blocks in the output image
    54   00 00 00 5A   Pointer to output block 1
    ...
    58   00 00 00 70   Pointer to output block 5
    59   00 00 00 75   Pointer to end of output block 5

Output block 1:
    5A   00 00 00 01   Number of sequences in output block 1
    5B   00 00 00 5C   Pointer to sequence 1
    5C   00 00 00 A6   Pointer to the starting address in the register space
    5D   74 84 01 22   Value
    5E   00 00 00 08   Value
    5F   FF FF FF FF   Mask
    60   00 00 01 FF   Mask
    61   ...           Start of output block 2
    75                 End of output block 5

Figure 13: The structure and content of the memory image for the virtual component vc1.

parameter pi1 will be read by vc1 from the address 01A5, occupying a total of four bits, between bits 0 and 3 inclusive. Similar specifications are given for the other parameters. The interface of each data channel is similarly described. For example, data associated with the output data channel do1 is to be written by the component vc1 to the system bus interface, at address 01A6, packaging four data values into each 32-bit register.

After the refinement information has been formally specified, in the form of the interface specification, it is possible to automatically generate virtual prototype verification patterns from algorithmic-level verification patterns. These algorithmic-level patterns are shown in Figure 12. As described earlier, each data input and data output port in the algorithmic model has associated with it a stream of values, in addition to the two dedicated parameter streams, para in and para out, for the input and output parameters, respectively. Values in each stream are divided into blocks, for synchronization across streams.

As already explained, the idea of automated verification pattern refinement revolves around the enrichment of the algorithmic-level patterns with the refinement information that results from the model refinement, to create virtual prototype patterns automatically. The result is a memory image, containing the original algorithmic patterns, which are not only reformatted to fit the VP simulation environment (as well as the final hardware platform), but also appropriately enriched with the necessary architectural information, which is not present in the original verification patterns. The structure and content of the memory image for the example virtual component vc1 is shown in Figure 13.

It can be noted that, as explained earlier, the memory image is composed of two parts: the input and the output image. Each image is then further broken down into a header, followed by a number of blocks. In this case, both images contain five blocks. Each block is composed of a header, followed by a number of sequences. In this example, both the first blocks of the input and the output image are shown fully, and both of them contain one sequence each.

Each sequence starts with a pointer to the starting address in the register space, where the reading (in the case of the input image) or writing (output image) is to start. Following this pointer, the rest of the sequence is made up of actual values and the corresponding masks, as described earlier. In this example, as can be seen in Figure 13, the first sequence of the first block of the input image is six values long, whereas the same in the output image is two values long.

4. CONCLUSIONS

In the rapidly changing and highly competitive field of wireless communication systems, minimizing time to market is a key requirement for any commercially viable product development. While virtual prototyping has proved to be one of the most effective techniques for achieving the required time savings, it is only with full automation that the maximal gains can be achieved.

The presented environment for automated development of virtual prototypes not only offers these maximal time gains, but also supports the virtual prototyping process comprehensively, in both the design and verification flows. In other words, the transition from the algorithmic level to the corresponding virtual prototype is covered seamlessly by the presented environment, for both the model itself, as well as for the associated verification patterns.

The application of the presented environment is limited in its general applicability in several aspects. Firstly, the algorithmic descriptions considered in this work come from the COSSAP environment. While system descriptions originating in any of the numerous other environments for algorithmic modelling have not yet been considered, the modular nature of the presented environment offers the possibility to process these other types of descriptions as well with minimal modifications and/or extensions. In particular, processing algorithmic descriptions in SystemC is being considered as a future extension to the presented environment, due to the strong presence of SystemC in the EDA market [17–19]. This will require only minimal extension to the presented environment, due to the already present ability of the underlying framework to process algorithmic descriptions in SystemC.

Furthermore, the verification strategy presented here has been implemented only for systems built around the StarCore DSP [20]. However, the modular nature of the verification environment ensures the applicability of the environment to systems built both around other processor cores as well as multiprocessor systems, with only minimal modifications and/or extensions. One of the directions of future work being considered includes extending the environment to systems using other processor cores, by creating verification programs for a set of supported cores. This may also require reformatting the associated memory images, to accommodate varying register and memory widths. However, since these widths are parameters in the TGS, no further modification to this script itself is required in order to adapt it to any set of processor cores.

ACKNOWLEDGMENTS

The authors would like to acknowledge the ongoing cooperation with Infineon Technologies and in particular thank Guillaume Sauzon, Thomas Herndl, Ahmad Sarashgi, Wolfgang Haas, and Johann Glaser for their collaboration. This work has been funded by the Christian Doppler Laboratory for Design Methodology of Signal Processing Algorithms.

REFERENCES

[1] G. E. Moore, "Cramming more components onto integrated circuits," Electronics Magazine, vol. 38, no. 8, pp. 114–117, 1965.
[2] R. Subramanian, "Shannon vs Moore: driving the evolution of signal processing platforms in wireless communications," in Proc. IEEE Workshop on Signal Processing Systems (SIPS '02), p. 2, San Diego, Calif, USA, October 2002.
[3] International SEMATECH, The International Technology Roadmap for Semiconductors, Austin, Tex, USA, 1999.
[4] G. Karsai, J. Sztipanovits, A. Ledeczi, and T. Bapty, "Model-integrated development of embedded software," Proc. IEEE, vol. 91, no. 1, pp. 145–164, 2003.
[5] P. Belanović, M. Holzer, D. Mičušík, and M. Rupp, "Design methodology of signal processing algorithms in wireless systems," in Proc. International Conference on Computer, Communication and Control Technologies (CCCT '03), pp. 288–291, Orlando, Fla, USA, July–August 2003.
[6] A. Hemani, A. K. Deb, J. Öberg, A. Postula, D. Lindqvist, and B. Fjellborg, "System level virtual prototyping of DSP SOCs using grammar based approach," Design Automation for Embedded Systems, vol. 5, no. 3-4, pp. 295–311, 2000.
[7] C. A. Valderrama, A. Changuel, and A. A. Jerraya, "Virtual prototyping for modular and flexible hardware-software systems," Design Automation for Embedded Systems, vol. 2, no. 3-4, pp. 267–282, 1997.
[8] N. S. Voros, L. Sánchez, A. Alonso, A. N. Birbas, M. Birbas, and A. Jerraya, "Hardware-software co-design of complex embedded systems: an approach using efficient process models, multiple formalism specification and validation via co-simulation," Design Automation for Embedded Systems, vol. 8, no. 1, pp. 5–49, 2003.
[9] R. Ernst, "Codesign of embedded systems: status and trends," IEEE Des. Test. Comput., vol. 15, no. 2, pp. 45–54, 1998.
[10] P. Varma and S. Bhatia, "A structured test re-use methodology for core-based system chips," in Proc. IEEE International Test Conference (ITC '98), pp. 294–302, Washington, DC, USA, October 1998.
[11] B. Stöhr, M. Simmons, and J. Geishauser, "FlexBench: reuse of verification IP to increase productivity," in Proc. Design, Automation and Test in Europe Conference and Exposition (DATE '02), p. 1131, Paris, France, March 2002.
[12] Odin Technology, Axe Automated Testing Framework, 2004, www.odin.co.uk/downloads/AxeFlyer.pdf.
[13] P. Belanović, M. Holzer, B. Knerr, M. Rupp, and G. Sauzon, "Automatic generation of virtual prototypes," in Proc. 15th International Workshop on Rapid System Prototyping (RSP '04), pp. 114–118, Geneva, Switzerland, June 2004.
[14] P. Belanović, B. Knerr, M. Holzer, G. Sauzon, and M. Rupp, "A consistent design methodology for wireless embedded systems," EURASIP Journal on Applied Signal Processing, special issue on DSP enabled radio, 2005.
[15] B. Knerr, M. Holzer, and M. Rupp, "HW/SW partitioning using high level metrics," in Proc. International Conference on Computer, Communication and Control Technologies (CCCT '04), Austin, Tex, USA, August 2004.
[16] U. Bortfeld and C. Mielenz, "White Paper C++ System Simulation Interfaces," Infineon, Munich, Germany, July 2000.
[17] The Open SystemC Initiative (OSCI), San Jose, Calif, USA, www.systemc.org.
[18] CoWare Incorporation, "SoC Platform-Based Design Using ConvergenSC/SystemC," July 2002, www.coware.com.
[19] T. Grötker, S. Liao, G. Martin, and S. Swan, System Design with SystemC, Kluwer Academic, Boston, Mass, USA, 2002.
[20] StarCore DSP, www.starcore-dsp.com.

P. Belanović received his Dr. tech. degree in 2006 from the Vienna University of Technology, Austria, where his research focused on design methodologies for embedded systems in wireless communications, virtual prototyping, and automated floating-point to fixed-point conversion. He received his M.S. and B.E. degrees from Northeastern University, Boston, and the University of Auckland, New Zealand, in 2002 and 2000, respectively. His research focused on the acceleration of image processing algorithms with reconfigurable platforms, both in remote sensing and biomedical domains, as well as custom-format floating-point arithmetic. Currently he is a Ph.D. candidate at the Vienna University of Technology, Austria, focusing on the design methodologies for embedded systems in wireless communications, virtual prototyping, and automated floating-point to fixed-point conversion.

B. Knerr studied communications engineering at the University of Saarland and the Technical University of Hamburg-Harburg, respectively. He finished the diploma thesis about OFDM communications systems and graduated with honours in 2002. He worked for one year as a Software Engineer for the UZR GmbH & Co KG, Hamburg, on image processing and 3D computer vision. In June 2003 he joined the Christian Doppler Laboratory for Design Methodology of Signal Processing Algorithms at the Vienna Technical University as a Ph.D. candidate. His research interests are hw/sw partitioning, multicore task scheduling, static code analysis, and platform-based design.

M. Holzer received his Dipl. Ing. degree in electrical engineering from the Vienna University of Technology, Austria in 1999. During his diploma studies he worked on the hardware implementation of the LonTalk protocol for Motorola. From 1999 to 2001 he worked at Frequentis in the area of automated testing of TETRA systems and afterwards until 2002 at Infineon Technologies on ASIC design for UMTS mobiles. Since 2002 he has held a research position at the Christian Doppler Laboratory for Design Methodology of Signal Processing Algorithms at the Technical University of Vienna.

M. Rupp received his Dipl. Ing. degree in 1988 at the University of Saarbrücken, Germany and his Dr. Ing. degree in 1993 at the Technische Universität Darmstadt, Germany, where he worked with Eberhardt Hänsler on designing new algorithms for acoustical and electrical echo compensation. From November 1993 until July 1995 he had a postdoctoral position at the University of Santa Barbara, California with Sanjit Mitra where he worked with Ali H. Sayed on a robustness description of adaptive filters with impacts on neural networks and active noise control. From October 1995 until August 2001 he was a member of the Technical Staff in the Wireless Technology Research Department of Bell Labs where he was working on various topics related to adaptive equalization and rapid implementation for IS-136, 802.11, and UMTS. He is presently a Full Professor for Digital Signal Processing in Mobile Communications at the Technical University of Vienna. He is an Associate Editor of the IEEE Transactions on Signal Processing, the EURASIP Journal on Applied Signal Processing, and the EURASIP Journal on Embedded Systems, and is elected as an AdCom Member of EURASIP. He authored and coauthored more than 180 papers and patents on adaptive filtering, wireless communications, and rapid prototyping, including 12 patents.

Hindawi Publishing Corporation
EURASIP Journal on Applied Signal Processing
Volume 2006, Article ID 84340, Pages 1–14
DOI 10.1155/ASP/2006/84340

FPGA-Based Reconfigurable Measurement Instruments with Functionality Defined by User

Guo-Ruey Tsai and Min-Chuan Lin

Department of Electronics Engineering, Kun-Shan University of Technology, Taiwan

Received 2 October 2004; Revised 5 March 2005; Accepted 25 May 2005

Using the field-programmable gate array (FPGA) with embedded software-core processor and/or digital signal processor cores, we are able to construct a hardware kernel for measurement instruments, which can fit common electronic measurement and test requirements. We call this approach software-defined instrumentation (SDI). By properly configuring it, we have used the hardware kernel to implement an n-channel arbitrary waveform generator with various add-on functions, a wideband and precise network analyzer, a high-speed signal digitizer, and a real-time sweep spectrum analyzer. By adaptively reconfiguring the hardware kernel, the SDI concept can easily respond to the rapidly changing user-application-specified needs in measurement and test markets.

Copyright © 2006 Hindawi Publishing Corporation. All rights reserved.

1. INTRODUCTION

As the power of FPGAs increases [1, 2], we find ourselves able to design, simulate, analyze, and even emulate ever more complex devices with application-specific embedded processor and/or digital signal processor cores. From the viewpoint of the SDI concept [3], the measurement process reduces to signal excitation, capture, conditioning, processing, and output display, as illustrated in Figure 1 [4]. Figure 2 illustrates that the traditional instrumentation technique depends on a digital signal processor, a microprocessor unit, virtual instruments, an application-specific integrated circuit (ASIC), or an FPGA, which take charge of the signal conditioning and signal processing.

The instrument market is fragmented because instruments are specialized in hardware to serve thousands of slightly divergent test applications. In fact, the traditional classification of measurement instruments (voltmeter, frequency counter, function generator, oscilloscope, signal analyzer, etc.) has become blurred, and to some extent these instruments can be replaced by a single set of reconfigurable hardware, called the hardware kernel. The hardware kernel can be reconfigured by software to implement a specified measurement instrument. Applying such a software-defined architecture concept at the circuit level brings two advantages. First, it can dramatically reduce the number of hardware components in mixed-signal designs, which in turn permits a much smaller chip size for system-on-chip implementation. Second, it can provide automatic adjustment or compensation for circuit component variations due to temperature dependence, aging, manufacturing tolerances, and so forth.

Current high-performance FPGAs are richly equipped with built-in on-chip SRAM, comprising both block RAM and distributed RAM. Therefore, both logic circuits using table-lookup algorithms and embedded processors in system-on-chip applications can utilize the on-chip SRAM to avoid the speed degradation caused by external chip interconnections and thereby enhance overall system performance. Under a single-hardware-core architecture, all the implemented instruments merely adjust the instrument functions in software and apply them to their specified application fields.

In Section 2, we illustrate the system architecture of the proposed hardware kernel. In Section 3, we introduce five kinds of possible instrument design algorithms following the SDI philosophy: a multichannel arbitrary function generator, a DC transfer curve tracer, a transient response analyzer, a steady-state network analyzer, and a real-time spectrum analyzer. In Sections 4, 5, 6, and 7, we demonstrate the practical implementation of four signal processing devices: an n-channel arbitrary waveform generator with various add-on functions [5], a wideband and precise phase detector [6], a high-speed signal sampler using a multiple-path algorithm [7], and an all-digital real-time spectrum analyzer [8]. In Section 8, a flexible reconfiguration methodology for this SDI system is presented. Finally, we draw a conclusion.

Figure 1: Measurement technique by SDI.

Figure 2: Instrumentation technique.

2. HARDWARE KERNEL FOR THE RECONFIGURABLE INSTRUMENTS

Figure 3 illustrates the proposed hardware kernel architecture. Besides the FPGA, we need other ASIC chips to process analog signals. In order to measure the time, frequency, and phase responses of the device under test (DUT), we need the following function modules: digital-to-analog converter, waveform amplifier, analog-to-digital converter, waveform sharpener, phase detector, hardware peak/trough detector, and human interface devices (HIDs).

The original stimulus signal generated by the FPGA is in digital form. When an analog exciting signal is required, it must be converted by the digital-to-analog converter, filtered and shaped by a low-pass filter, and amplified or attenuated by an amplifier or DC offset. The amplitude of the exciting signal can be adjusted through automatic gain control, which is achieved by an FPGA-generated programmable gain-adjustment (PGA) signal.

We also need signal capture and digitization modules. The output signals from the DUT can be digital or analog. The latter need to be captured and digitized by the analog-to-digital converter. To meet the input signal limitations of the analog-to-digital converter, the gain of the output analog signals still needs to be controlled by the PGA signal generated by the FPGA.

We need to detect the inevitable phase drift between the input and output signals of the DUT. With a waveform sharpening circuit, we can transform the periodic analog signal into a square wave. The phase difference can then be drawn out of the duty cycle of the square wave, which is calculated by the FPGA or an ASIC chip. The process of phase detection is shown in Figure 4.

To calculate the sine-wave excitation and response amplification factor, we need a peak extraction circuit to detect the two peak-to-peak values and take the quotient between them. From the data array, the embedded processor in the FPGA can take out the peak values (maximum or minimum) for further processing.

All human interface devices (HIDs) for manipulation and test data presentation are basic interfaces for each instrument. The proposed hardware kernel includes the following HIDs: push-wheel switch, LED, text LCD, graphic STN/LCD or color TFT/LCD display, keyboard, touch panel, and even an oscilloscope signal driver.

The flash RAM can be used to store sine, log, or other mathematical function lookup tables for exciting-signal generation and to ease fast data operations.

This system can be operated in on-line mode, in which a personal computer (PC) controls and communicates with it. Without the PC, the system is also a stand-alone device, operated off-line through the panel components and displaying on a liquid crystal display (LCD). We have designed the panel controller and LCD controller using the embedded processor.

For on-line operations, we can build a PC development platform with a powerful graphical user interface (GUI) and a mathematical function package, which can be supported by Matlab or LabVIEW. On the other hand, the hardware kernel should provide some on-line operation interfaces, such as USB, SPI, or UART, to communicate with the PC.

3. RECONFIGURABLE INSTRUMENTS

With the complete hardware kernel architecture, we can configure the FPGA to match the necessary function specifications for various measurement environments and requirements. Here we introduce five types of SDI design algorithms for specified applications.

3.1. Arbitrary waveform generator

Utilizing the direct digital synthesizer (DDS) algorithm [2, 9–11], we can generate any periodic function with arbitrary frequency, amplitude, and waveform. As illustrated in Figure 5, the function waveforms are preloaded into flash RAM and are loaded directly into the built-in RAM of the FPGA at power-up. The waveform frequency can be set up to half the system clock. Using a 32-bit phase accumulator, we can achieve a frequency resolution of 0.02 Hz. The embedded 8-bit processor in the FPGA is in charge of the control of the HIDs and of the setting calculations for frequency and amplitude. Arranged as in Figure 6, we reorganize the DDS data processing path and generate two channels of FM, PM, FSK, or PSK signals. Section 4 will describe an n-channel arbitrary waveform generator with various add-on functions in detail.

3.2. DC transfer function analyzer

We use a transfer function analyzer to analyze the transfer function between input and output signals, from which we can observe the linearity characteristics of the measured sensor/transducer. The input/output transfer curve measurement is essential for characterizing DUTs containing electronic circuits. Arranging the FPGA software design flow as in Figure 7, we can configure the proposed hardware kernel into a DC transfer function analyzer.
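The DDS scheme of Section 3.1 can be modeled in a few lines of software. The following Python sketch (an illustration, not the authors' HDL; the 32-bit accumulator width is from the text, while the 1024-entry table size is an assumption) shows how a phase accumulator and a sine lookup table produce an output frequency of word × f_clk / 2^32:

```python
# Illustrative DDS model: 32-bit phase accumulator + sine lookup table.
import math

ACC_BITS = 32           # phase accumulator width (as in Section 3.1)
LUT_BITS = 10           # assumed 1024-entry sine lookup table
LUT = [math.sin(2 * math.pi * k / (1 << LUT_BITS)) for k in range(1 << LUT_BITS)]

def tuning_word(f_out, f_clk):
    """Frequency control word: f_out = word * f_clk / 2**ACC_BITS."""
    return round(f_out * (1 << ACC_BITS) / f_clk)

def dds_samples(f_out, f_clk, n):
    """Generate n output samples by accumulating phase on each clock."""
    word, phase, out = tuning_word(f_out, f_clk), 0, []
    for _ in range(n):
        out.append(LUT[phase >> (ACC_BITS - LUT_BITS)])  # top bits address the LUT
        phase = (phase + word) & ((1 << ACC_BITS) - 1)   # wrap modulo 2**32
    return out

# The frequency resolution is f_clk / 2**32, e.g. about 0.0116 Hz at 50 MHz.
resolution = 50e6 / (1 << ACC_BITS)
```

For example, a tuning word of 2^30 at a 50 MHz clock yields f_clk/4 = 12.5 MHz, stepping through the table a quarter period per clock.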

Figure 3: Hardware kernel for the reconfigurable instruments.

Figure 4: Phase detection.

3.3. Steady-state network analyzer

With the hardware kernel of Figure 3, by using the DDS technique to generate a swept sine wave and retrieving the phase and peak responses of the DUT, we can collect tabular data of the system frequency response. The frequency response spectrum can be constructed by converting the tabular data to a logarithmic scale with the embedded processor and displayed on the color graphic LCD. We can also upload the data array to a PC for further processing. Section 5 will describe a wideband and precise network analyzer based on FPGA in detail.

3.4. Transient-state analyzer

When the proposed arbitrary function generator generates a synchronized periodic signal as the required exciting signal, which is fed into the DUT, we obtain a time-response output that needs further transient analysis, as shown in Figure 8. To overcome the low sampling rate of the analog-to-digital converter, a multipass algorithm is proposed, with which we can obtain n times the effective sampling rate. Section 6 will describe a multipass-algorithm digitizer based on FPGA in detail. When the exciting signal is designed and presented as a periodically variable duty-cycle square wave, a step response analyzer is built up. If an FFT algorithm is built into the FPGA, the hardware kernel will be configured as a software-based spectrum analyzer.

3.5. Real-time spectrum analyzer

Figure 9 illustrates a real-time sweep spectrum analyzer using a fixed IF filter and a sweeping local oscillator (LO). The mixer output contains the input signal, the LO signal, the sum and difference of these two signals, and various other frequency components. If we know the LO frequency exactly, then by sending these frequency components through a narrow IF filter, we can identify both the amplitude and the frequency of the unknown input signal. Whenever any of these components falls within the IF filter bandwidth, an AC voltage related to the input signal's amplitude is produced. This AC voltage is converted to a DC voltage by an envelope detector, and the result is displayed on the y-axis of the screen. By HDL coding or schematic entry, we can implement the mixer, narrow IF filter, envelope detector, voltage-controlled oscillator (VCO), and other processing algorithms in the same FPGA chip. Section 7 will describe an FPGA-based design of a real-time sweep spectrum analyzer in detail. The next section describes the developed n-channel arbitrary waveform generator with various add-on functions.

4. n-CHANNEL ARBITRARY WAVEFORM GENERATOR WITH VARIOUS ADD-ON FUNCTIONS [5]

4.1. DDS waveform generator

DDS is the most popular technique to synthesize AC stimulus signals for instrumentation, measurement, and digital communications. Generating synthesized waveforms with the DDS technique has the following benefits: high frequency resolution, precise frequency control, and low complexity. Figure 10 shows the simplified DDS block diagram [9]. Utilizing the conventional table-lookup algorithm for DDS, we need not generate both sine and cosine functions and can realize the desired functions with a smaller memory table size.
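Section 4.1's remark that a smaller memory table suffices can be illustrated with a quarter-wave table, a common DDS memory-compression trick. The paper does not spell out its exact table layout, so the sketch below is an assumed example, storing only the first quadrant of the sine and rebuilding the rest by symmetry:

```python
# Hedged sketch: quarter-wave sine compression for a DDS lookup table.
# Only N/4 + 1 entries are stored instead of N; symmetry supplies the rest.
import math

N = 1024                      # samples per full period
QUARTER = N // 4
QTAB = [math.sin(2 * math.pi * k / N) for k in range(QUARTER + 1)]

def sine_from_quarter(i):
    """Reconstruct sin(2*pi*i/N) from the quarter-wave table."""
    i %= N
    if i <= QUARTER:              # first quadrant: direct lookup
        return QTAB[i]
    if i <= 2 * QUARTER:          # second quadrant: sin(pi - x) = sin(x)
        return QTAB[2 * QUARTER - i]
    if i <= 3 * QUARTER:          # third quadrant: sin(pi + x) = -sin(x)
        return -QTAB[i - 2 * QUARTER]
    return -QTAB[N - i]           # fourth quadrant: sin(2*pi - x) = -sin(x)
```

Every reconstructed value matches the full-table sine exactly, while the table memory shrinks to roughly a quarter.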

Figure 5: System architecture for the arbitrary n-channel function generator.

We can implement all the necessary digital logic circuits and the lookup memory in the same FPGA chip, so that better performance is achieved by avoiding interchip connections [3, 10]. To generate the required analog signals, commercial digital-to-analog converters (DACs) can be adopted. Lowpass filters (LPFs) are required to filter out high-frequency noise.

Figure 6: On-chip FM/PM modulation.

Figure 7: DC transfer function measurement.

The entire instrument system includes a PC platform, a USB controller, the FPGA waveform synthesizer, and a DAC/LPF output buffer, as shown in Figure 11. The PC is the system development platform, responsible for arbitrary function waveform editing, previewing, encoding, lookup-data downloading, and the coding and decoding of USB commands. The multiple operation windows and GUI application programs are coded in the Visual Basic language.

The USB controller deals with the message interchange between the PC platform and the FPGA chip. The Cypress EZ-USB controller is utilized to connect the PC with the FPGA. It provides DLL files, which can be called and linked by Visual Basic, Visual C, and/or LabVIEW programs on the PC platform, and simplifies the design of both the message interchange and the transmission control of the GUI windows. Besides the parallel interface, we can also use the SPI or I2C techniques for communication between the USB controller and the FPGA to save FPGA pin resources. Owing to the plug-and-play property of the USB interface, a PC can be used to develop many AWG instruments simultaneously.

4.2. FPGA realization

The FPGA is used to synthesize the specified function waveform. We adopt a Xilinx Spartan II XC2S200PQ208.

Figure 8: Transient-state analyzer.

Figure 9: Real-time sweep spectrum analyzer.

Figure 10: Simplified block diagram of the direct digital synthesizer.

The adopted FPGA affords on-chip true single-port block synchronous RAM, with a total available memory size of 56 kbit. The FPGA chip is configured into four main parts: the handshaking controller between USB and FPGA, the SRAM (block RAM), the SRAM controller, and the remaining control logic. The SRAM is utilized for both built-in and downloadable lookup tables. The built-in lookup table can also be reserved for direct output of specified waveforms or for the built-in self-test of the instrument itself. Together with n DACs, we can reconfigure the SRAM capacity into n parts for n-channel analog signal outputs. In this case, we have up to 56 channel outputs using 1 kbit per channel. On the PC platform, each channel waveform can be downloaded one by one. Adopting a 50 MHz clock frequency and a 32-bit phase accumulator word length, we have a 0.01164 Hz frequency resolution.

4.3. Function performance

After integrating all the interfacing software, firmware, and hardware, the instrument affords typical fundamental waveform outputs, such as sinusoidal, square, and triangle functions. We can also edit arbitrary mathematical equations in the edit window and output their waveforms. Typical modulated waveforms, such as AM, FM, and others, can be edited and stored into the waveform banks in advance. You can choose any one you like from the waveform banks and freely output it to any desired channel. The instrument also provides periodic piecewise-linear function output with multiple data points. Choosing an FPGA with a larger on-chip SRAM capacity for lookup-table usage, we can flexibly expand the number of output channels or improve the waveform resolution. With the programmability of the PC development platform, we can output the waveforms to individual channels independently, or generate a mixed waveform that is a linear combination of several other channels.

Furthermore, we can produce a series of n-channel waveforms which exhibit group-related functions for special application purposes. Figure 12 shows typical output results of the instrument. Figures 13(a) and 13(b) show two clear frequency spectrums for 1 kHz sinusoidal waveforms produced by this instrument and by an Agilent 33120A signal generator, respectively. We have nearly the same waveform quality.
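The figures quoted in Section 4.2 can be checked directly. The sketch below assumes the stated interpretation — 56 kbit of block RAM split into 1 kbit tables, one per channel, and the standard DDS resolution formula f_clk/2^32:

```python
# Quick arithmetic check of the Section 4.2 numbers (assumed interpretation:
# 56 kbit of block RAM at 1 kbit per waveform table, 32-bit accumulator,
# 50 MHz clock).
BLOCK_RAM_KBIT = 56
KBIT_PER_CHANNEL = 1
CHANNELS = BLOCK_RAM_KBIT // KBIT_PER_CHANNEL   # -> 56 channels
FREQ_RESOLUTION = 50e6 / 2**32                  # Hz, -> about 0.01164
```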

Figure 11: The system architecture of the n-channel arbitrary waveform generator.

4.4. Add-on functions

The instrument also provides an algorithm for more advanced custom-made waveform generation. Taking a custom-made FSK signal generator as an example, you can input or edit a modulation bitstream on the PC platform and download it to the SRAM in the FPGA chip. Processed by a preset FSK control code, we can combine the bitstream channel and the sine-wave channel to generate the desired FSK signal. Figure 14 shows the custom-made signal generation flow. The algorithm receives the data from both the sine-wave and piecewise-linear generators and decides the respective frequencies for "1" and "0", which control the accumulator phase-increment generator and construct the designed FSK signal. An AM-ASK control code can be written to generate a designed AM-ASK signal by the same approach. Through the reserved input port, an external FSK control code IP can be fed in directly, so that the stand-alone instrument generates the designed FSK signal waveform.

5. WIDEBAND AND PRECISE PHASE DETECTOR BASED ON FPGA WITH EMBEDDED PROCESSOR [6]

We can use the multiplier phase detector [4] or logarithmic amplifiers to implement a phase detector. But these two approaches are both mixed-mode and thus unsuitable for system-on-chip (SoC) design, which is all-digital. There exist several digital design methodologies for phase detectors, such as the EXOR phase detector, the JK flip-flop phase detector, the phase-frequency detector, the Nyquist-rate phase detector, the zero-crossing phase detector, and the Hilbert-transform phase detector [12, 13]. But they are all suitable only for some specified narrowband frequency range. To detect phase in another frequency range, we must modify the phase detection and calculation circuits to meet the necessary requirements before a precise value of the phase difference can be obtained.

5.1. All-digital phase detector

The proposed phase detector is an all-digital approach to measure the phase difference of two signals with the same frequency. Gathering the signal frequency with a control circuit in the FPGA and performing the calculations with the embedded processor in the FPGA, we can adaptively adjust the sampling clock, which is used to measure the pulse period. This phase detector automatically detects and adjusts the sampling clock without any circuit modification.

Figure 15 demonstrates the proposed all-digital adaptive algorithm for phase detection, including the incoming signal's frequency recovery circuit, the sampling clock generator, and the all-digital phase detectors. The period of the sampling clock, Ts, is a function of both the incoming signal frequency (fs) and the phase resolution (Δp), and the phase value, Pd, is a function of both the sampling period (Ts) and the pulse duration of the phase difference (ΔPt):

    Ts = f(fs, Δp),     (1)
    Pd = f(Ts, ΔPt).    (2)

From (1), we must first obtain the incoming signal frequency and the desired phase resolution, and then feed them into the programmable fractional-N frequency synthesizer. Finally, we have the sampling clock required for phase detection. Combining the sampling clock with a specially designed counter algorithm, we can obtain a phase-difference value of 8 to 12 bits from (2).

5.2. FPGA realization

We design the whole system with our proposed phase-measurement method for specified network-analyzer applications and implement it on an FPGA, as illustrated in Figure 5. From the operational point of view, we have a USB interface for PC on-line operation and an embedded processor for stand-alone off-line control. The DDS-powered function generator serves as the sweep stimulus signal generator and the sampling clock generator. The all-digital phase-locked loop (PLL) with a programmable divide-by-N module is used for the phase detection of uncontrollable random input signals. An LCD display controller is also included for the stand-alone display.

Utilizing the DDS algorithm, we can generate any periodic function waveform with arbitrary frequency, amplitude, and waveform.
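The counting scheme behind (1) and (2) in Section 5.1 can be modeled in software. The sketch below is an illustration of a counter-based measurement, not the authors' FPGA logic: two square waves of the same frequency are sampled N times per period, and the tick count between their rising edges gives Pd = 360 × count / N degrees:

```python
# Illustrative counter-based phase measurement (assumed model of the
# all-digital detector): phase = 360 * (ticks between rising edges) / N.
def rising_edge(sig):
    """Index of the first 0 -> 1 transition in a sampled square wave."""
    return next(i for i in range(1, len(sig)) if sig[i] and not sig[i - 1])

def phase_difference(n_per_period, shift_ticks, periods=2):
    """Simulate two square waves offset by shift_ticks and count the lag."""
    total = n_per_period * periods
    s1 = [(i % n_per_period) < n_per_period // 2 for i in range(total)]
    s2 = [((i - shift_ticks) % n_per_period) < n_per_period // 2
          for i in range(total)]
    ticks = (rising_edge(s2) - rising_edge(s1)) % n_per_period
    return 360.0 * ticks / n_per_period   # degrees
```

With 360 sampling ticks per period the resolution is 1 degree; a faster counting clock (more ticks per period) refines Δp accordingly, which is exactly why Ts is derived from fs and the desired resolution in (1).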

Figure 12: Typical output waveforms displayed by Agilent 54622D oscilloscope.

The embedded 8-bit processor in the FPGA is in charge of the control of the USB communication and of the setting calculations for frequency, amplitude, and sampling clock.

When the input signal of the DUT is not generated by the system function generator but comes from some other uncontrollable signal source, we use both the all-digital phase-locked loop and the fractional-N frequency synthesizer [14], as illustrated in Figure 16, to recover the input signal's frequency and to generate the required sampling clock. To meet the requirement of precise resolution, the programmable scale-factor-N divider must generate the counting clock required for calculating the duration of the phase difference. Constituted by counters, the all-digital phase detector can output 8- to 12-bit digital values for phase calculation, according to the requirements of the phase resolution and the successive processing circuits.

5.3. Performance analysis

Table 1 lists the comparison of the phase differences of an RC circuit from 10 Hz to 10 kHz, measured by an Agilent 54621A oscilloscope and by our proposed method, respectively. From the results, it is apparent that our proposed method is sufficiently precise, and the measurement frequency range is relatively wide, without specific parameter adjustment or further special logic circuits. When we adopt the automatic sweep stimulus signal generator, controlled by a PC, to generate the input signal, and collect and analyze the data with a LabVIEW program, we obtain the amplitude and phase responses shown in Figure 17. It is evident that the measurement is rather precise within a wide frequency range. This supports that our proposed method is suitable for all-digital SoC system realization.

Figure 13: Displayed 1 kHz sinusoidal spectrum comparison between (a) the reconfigurable instrument and (b) the Agilent 33120A signal generator. The spectrums are displayed by an Agilent 54622D.

6. HIGH-SPEED SIGNAL SAMPLER BY MULTIPLE-PATH ALGORITHM [7]

To implement an A/D converter in a pure digital chip, a delta-sigma D/A algorithm must be built in to work together with SAR, flash, or other types of A/D algorithms. The slow conversion rate of the delta-sigma D/A algorithm is the main limitation of the whole A/D converter for capturing high-bandwidth periodic signals.

Figure 14: The custom-made FSK signal generation flow.

Figure 15: All-digital adaptive algorithm for phase detection.

Figure 16: All-digital PLL fractional-N frequency synthesizer.

Table 1: Comparison of the measured phase differences of the RC circuit.

Frequency (Hz)   Agilent 54621A (degrees)   Proposed method (degrees)
10                  0.00                       0
50                  0.00                       0
100                −4.32                      −4
500               −16.20                     −16
1 k               −31.68                     −32
2.5 k             −52.20                     −51
5 k               −64.80                     −64
10 k              −72.00                     −71

Figure 17: Automatic sweep RC-circuit frequency response by our system (amplitude in dB and phase in degrees, from 10 Hz to 10 kHz).

6.1. Multipass method

According to the sampling theory, we need a sampling rate two times larger than the greatest signal frequency so that the original signal can be reconstructed after it has been sampled. The higher the sampling rate, the more complete the signal reconstruction. For a periodic signal, we can sample it periodically with the available sampling device, whose sampling frequency may be larger or even smaller than the signal frequency, and obtain one set of sampled signal values. Then we apply a fixed time shift, calculated as follows:

    Δt_shift = (sampling time)/n,    (3)

where n is an integer. We can repeatedly and synchronously acquire n sets of sampled signal values and store them into the embedded memory in the FPGA. Figure 18 demonstrates the proposed multipath sampling algorithm.
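The multipass idea of (3) can be sketched numerically: each slow pass over the periodic signal is delayed by Δt = Ts/n, and interleaving the n passes reproduces the waveform as if it had been sampled n times faster. This is a hedged model of the scheme (timing values are illustrative, not the instrument's):

```python
# Equivalent-time (multipass) sampling sketch per eq. (3).
import math

def multipass_capture(signal, period, slow_ts, n, samples_per_pass):
    """Take n slow passes, each delayed by slow_ts/n, and interleave them."""
    passes = []
    for p in range(n):
        dt = p * slow_ts / n                       # time shift of pass p
        passes.append([signal((k * slow_ts + dt) % period)
                       for k in range(samples_per_pass)])
    # Sample j = k*n + p of the interleaved record was taken at j*slow_ts/n.
    return [passes[p][k] for k in range(samples_per_pass) for p in range(n)]

def wave(t):                                       # a 1 Hz periodic test signal
    return math.sin(2 * math.pi * t)

# Slow 0.3 s sampling interval, 4 shifted passes -> effective 0.075 s spacing.
captured = multipass_capture(wave, 1.0, 0.3, 4, 10)
direct = [wave((j * 0.075) % 1.0) for j in range(40)]   # true fast sampling
```

The interleaved record matches direct sampling at four times the rate, which is the "apparent sampling rate" discussed next.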

In this way, the apparent sampling rate is n times higher than the real-time sampling rate. If n is large enough, we can overcome the problem of the low real-time sampling rate and obtain a more satisfactory signal reconstruction.

Figure 18: Multipass under-sampling algorithm.

6.2. FPGA realization

We use an FPGA chip, an RC circuit, an LF398 sample/hold chip, and an LM319 comparator chip to implement the multipass method, as shown in Figure 19. The built-in DDS arbitrary waveform generator is used to generate the desired periodic signal waveform, whose data stream is stored in the waveform data table. The external R-2R circuit converts the digital data streams into a continuous analog signal. To verify the captured signal quality, we directly bypass the output signal into the sample/hold chip so that we can compare the original and the captured signals. Through the comparator chip, the comparison result between the output signal from the sample/hold chip and the RC circuit output signal is fed into the SAR A/D converter. The delta-sigma D/A converter outputs the bitstreams to the RC circuit, and the charged output signal is fed into the comparator chip. The internal DDS module provides the necessary synchronous signal for repeated sampling. If the tested signal comes from another signal source, we need the all-digital phase-locked loop (ADPLL) module (Figure 16) for signal synchronization and timing control. The memory controller is in charge of the storage and transfer of the digital data streams. The TDC signal extractor is used for signal reconstruction. The interface circuit is used for PC communication and data transfer control.

6.3. System performance

In this demonstration, we use a 50 MHz system clock rate. We use the DDS module to generate a sine wave with a frequency of 100 kHz as the stimulus signal. From the step response of the RC circuit, as shown in Figure 20, we have a 151 μs steady-state rise time and a 136 μs fall time. By estimation, the minimum converting time of the SAR ADC is about 160 μs. In other words, the real-time maximum signal-capture sampling speed is about 780 Hz for 8 quantum bits. According to the calculation formula for the real-time sampling rate of the delta-sigma D/A converter, shown as follows, we have a real-time sampling rate of 760 Hz for ADC bit = 8 and F = 15:

    SR = 50 MHz / ((ADC bit + 1) × F × 2^(ADC bit + 1)).    (4)

Figure 21 shows the comparison of the original input 100 kHz sine wave and the signal captured by the proposed signal capturer. The captured signal has 256 sampled points per period. We find that the signal reconstruction is rather satisfactory. We have demonstrated an ultrahigh-speed signal capturer based on a single FPGA chip. The multipath algorithm has enhanced the sampling rate from a 760 Hz real-time sampling rate up to a 25.6 MHz apparent sampling rate. For further applications, we can use this design to measure the transient response of the device under test.

7. ALL-DIGITAL REAL-TIME SPECTRUM ANALYZER [8]

A spectrum analyzer is used to analyze the frequency components in signals under test. By mathematical calculation, the traditional fast Fourier transform (FFT) can transform the time-domain waveform of the signal into the frequency domain, and it has become the key technique of the spectrum analyzer. The sweeping-frequency technique combined with digital techniques is also used to implement spectrum analysis [15]. In general, the analysis accuracy of the FFT technique is basically worse than that of the sweeping-frequency technique equipped with a digital intermediate-frequency filter. From the viewpoint of the dynamic-range decaying effect, the FFT technique is also worse. For a smaller frequency range, the processing speed of the FFT technique is better, but it is worse for a wider frequency range. In the form of a digital intellectual property (IP) core, which can be mapped to a single FPGA chip, we design a real-time sweep spectrum analyzer. The system function block diagram of the real-time sweep spectrum analyzer is shown in Figure 22.

7.1. Theory of operation

The mixer is used to combine the signal under test and the sweeping signals generated by the local oscillator (LO). The mixer is a multiplier circuit, so the output is an amplitude-modulated signal. According to the product-to-sum trigonometric identity, we obtain signals at the sum frequency and the difference frequency, as shown in the following expression:

    sin(2πf0t) × sin(2πfin·t) = (1/2)cos(2π(f0 − fin)t) − (1/2)cos(2π(f0 + fin)t),    (5)

where f0 is the frequency of the LO and fin is the input signal frequency.
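The mixer identity (5) is easy to verify numerically; the frequencies below are illustrative examples, not the instrument's operating values:

```python
# Numerical check of eq. (5): a product of two sines equals difference- and
# sum-frequency cosines at half amplitude.
import math

f0, fin = 30_000.0, 5_000.0      # example LO and input frequencies (Hz)
for k in range(1000):
    t = k * 1e-7                 # 0.1 us time steps
    lhs = math.sin(2 * math.pi * f0 * t) * math.sin(2 * math.pi * fin * t)
    rhs = 0.5 * math.cos(2 * math.pi * (f0 - fin) * t) \
        - 0.5 * math.cos(2 * math.pi * (f0 + fin) * t)
    assert abs(lhs - rhs) < 1e-9
```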

Figure 19: FPGA-based all-digital signal capturer.

Figure 20: The step response of the delta-sigma D/A converter.

Figure 21: The comparison of the original and the captured 100 kHz sine wave.

The central frequency fIF of the finite-impulse-response (FIR) bandpass filter is

    fIF = f0 + fin.    (6)

When the AM signal is passed to the FIR filter, the signal components whose frequency band meets the pass band of the filter can pass through. The following peak detector extracts the passed signal's amplitude, which is shown in the X-Y mode of an oscilloscope.

If we fix the LO frequency, then we must adaptively adjust the central frequency of the FIR filter to properly detect all the frequency components of the signal under test. The adaptive central frequency must satisfy (6). An IF filter with an adaptive central frequency, however, is not practical from the viewpoints of cost, accuracy, and speed. In contrast, keeping the central frequency of the filter fixed and linearly (or logarithmically) sweeping the LO frequency, we have the relation

    f0 = fIF − fin,    (7)

where the changing frequency fin is the possible frequency component of the detected signal.

Figure 22: The function block diagram of the real-time sweep spectrum analyzer.
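The sweep principle of (6) and (7) can be simulated end to end: the IF bin stays fixed while the LO sweeps, and the IF energy peaks exactly when f0 = fIF − fin. The parameters below are illustrative assumptions, not the paper's operating values:

```python
# Hedged sweep-analyzer simulation per eqs. (6)-(7).
import math

FS, N = 200_000, 2000            # sample rate (Hz) and record length
F_IF, F_IN = 25_000, 5_000       # fixed IF and (unknown) input frequency

def if_bin_energy(f0):
    """Mix the input with the LO and measure the DFT bin at F_IF."""
    k = F_IF * N // FS           # F_IF falls exactly on DFT bin k
    re = im = 0.0
    for n in range(N):
        t = n / FS
        x = math.sin(2 * math.pi * F_IN * t) * math.sin(2 * math.pi * f0 * t)
        re += x * math.cos(2 * math.pi * k * n / N)
        im -= x * math.sin(2 * math.pi * k * n / N)
    return re * re + im * im

sweep = range(10_000, 24_001, 500)        # LO sweep: the sum term hits the IF
best = max(sweep, key=if_bin_energy)      # LO value with peak IF energy
```

At the peak, `best` equals F_IF − F_IN, i.e. the sweep position itself reveals the input frequency — the mechanism the fixed-IF/swept-LO architecture relies on.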


Figure 23: The signal processing flow of the real-time sweep spectrum analyzer constructed with both Simulink and System Generator.

Table 2: Analysis performance for the filters with different central frequencies.

    Central frequency       1 MHz            100 kHz          1 kHz
    Frequency resolution    10 kHz           1 kHz            10 Hz
    Scanning range          10 kHz ∼ 2 MHz   1 kHz ∼ 200 kHz  10 Hz ∼ 2 kHz
    System clock rate       60 MHz           60 MHz           60 MHz
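Bandpass filters like those in Table 2 can be designed with the equiripple (Parks-McClellan) method; a sketch using SciPy's `remez` for a 1 kHz-centered filter at an illustrative 8 kHz sample rate (band edges and tap count are assumptions, not the paper's 60 MHz-clock specifications):

```python
import numpy as np
from scipy.signal import remez, freqz

# Parks-McClellan (equiripple) bandpass for a 1 kHz analysis filter.
# Sample rate, band edges, and tap count are illustrative assumptions.
fs = 8000.0
f_c = 1000.0
edges = [0, f_c - 400, f_c - 100, f_c + 100, f_c + 400, fs / 2]
h = remez(101, edges, [0, 1, 0], fs=fs)   # stop / pass / stop bands

w, H = freqz(h, worN=4096, fs=fs)
gain = np.abs(H)
print(round(gain[np.argmin(np.abs(w - f_c))], 2))  # 1.0
```

The equiripple design spreads the approximation error evenly over the stopbands, which is the "equally smooth" attenuated band the text refers to.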

7.2. FPGA realization

We use a software-hardware codesign and cosimulation approach to design a real-time sweep spectrum analyzer core, which can be built into a real-time electrical harmonic analyzer. First, we use the Simulink (with Matlab) function blocks to build up a sweep spectrum analyzer system, as shown in Figure 23. At the same time, we integrate the Xilinx System Generator DSP Block Library [16] into Simulink. With these library modules, we can cosimulate the total system and generate the configuration bitstream for the corresponding FPGA chip so that we can perform the hardware verification. Software simulation can proceed in Simulink or by ModelSim RTL simulation. The Xilinx ChipScope is used to execute the hardware simulation and debugging.

Here, besides the ADC and DAC devices, we have designed an all-digital spectrum analyzer, where we use a digital multiplier as the mixer and the equiripple technique to implement the FIR bandpass filter. The equiripple technique makes the attenuated band of the frequency response of the filter equally smooth and optimizes the filter design.

7.3. System performance

We can implement multiple FIR filters in the same FPGA chip. Here, we demonstrate three different filters with central frequencies of 1 kHz, 100 kHz, and 1 MHz, respectively. Table 2 shows their analysis performance.

In Figure 24, the X-spacing is 20 Hz and the Y-axis represents the component amplitude. We notice that the odd harmonic peaks at 10 Hz, 30 Hz, 50 Hz, and 70 Hz are clearly shown. The relative amplitude ratios are the same as expected from the FFT calculation.

To verify the analysis resolution at the highest detectable frequency of 2 kHz for the filter with a central frequency of 1 kHz, we input an AM signal with a carrier frequency of 2 kHz and a modulation frequency of 20 Hz. Figure 25 shows the detected spectrum. In Figure 25, the central frequency is 2000 Hz, the left frequency is the difference frequency of 1980 Hz, and the right frequency is the sum frequency of 2020 Hz. The X-spacing is 20 Hz. We can easily differentiate between different frequency components with a resolution of less than 20 Hz.

The proposed digital IP can analyze frequency spans ranging from 10 Hz to 2 MHz. It can be flexibly utilized for FPGA-based real-time digital signal processing applications, such as visualized signal analysis, noise level monitoring, test and measurement in a music studio, voice noise processing, voice instruction interpretation, voice reading, and hearing aid design.

8. FLEXIBLY RECONFIGURABLE SDI SYSTEM DESIGN

The configuration of an FPGA can be performed by either an on-line or an off-line process [17–19]. The dynamically reconfigurable SDI system is shown in Figure 1. For the on-line process, we need a PC (personal computer) to connect with the instrument for data exchange and on-line reconfiguring. The prestored configuration bitstream file for a specified function can be used to reconfigure the FPGA via the controlling CPLD (complex programmable logic device). After powering up, the system will behave as a new instrument. The PC platform is able to perform the advanced analysis and 3D display of the data output from the instrument.
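The off-line reconfiguration path detailed later in this section selects one configuration EPROM via its chip-enable line; a minimal model of the CPLD's selection logic, with the three-EPROM count matching Figure 26 and the active-low encoding taken as an assumption:

```python
# Off-line function selection: the CPLD asserts exactly one active-low
# chip enable (CE), so only the selected configuration EPROM drives the
# FPGA at power-up. Three EPROMs as in Figure 26; the active-low
# encoding is an assumption for illustration.
def ce_lines(selected, n_eproms=3):
    """Active-low CE vector: 0 enables the chosen EPROM, 1 disables the rest."""
    if not 0 <= selected < n_eproms:
        raise ValueError("no EPROM at that switch position")
    return [0 if i == selected else 1 for i in range(n_eproms)]

print(ce_lines(1))  # [1, 0, 1]
```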


Figure 24: Analyzed spectrum of input square signal with period of 10 Hz.


Figure 25: Detected spectrum of AM signal with carrier frequency of 2 kHz and modulation frequency of 20 Hz.

In another exchange mode, the PC can also send more sophisticated excitation signals to the instrument for more specific applications. We can use LabVIEW programs or C code to achieve these. For the off-line process, the function-predefined configuration bitstream files are stored in different EPROMs (flash). We can use the function selection switch to order the CPLD to send different CE (chip enable) signals to the selected configuration EPROM (flash). When powered up, the instrument will be reconfigured to work with the new function. By adding or replacing different configuration EPROMs (flash), we can add new functions to the proposed SDI system without changing or redesigning the system hardware.

9. CONCLUSIONS

The development trend of measurement and test technology is shifting from functionality-defined-by-manufacturer to functionality-defined-by-user. The ability of being


Figure 26: Flexibly reconfigurable SDI system.

reconfigurable, reusable, flexible, and rapidly prototyped will be the key to success in the measurement and test market. The proposed FPGA-based reconfigurable instrument really meets this evolution trend. Once you have derived a measurement algorithm, you can easily build up a specialized instrument by the SDI approach, such as an LCR (inductance-capacitance-resistance) meter, a biomedical monitor, and so forth.

ACKNOWLEDGMENT

This work was supported by the National Science Council (NSC-93-2215-E-168-004), Taiwan.

REFERENCES

[1] Xilinx Incorporation, "The programmable logic data book," 2002.
[2] M. Cummings and S. Haruyama, "FPGA in the software radio," IEEE Communications Magazine, vol. 37, no. 2, pp. 108–112, 1999.
[3] G.-R. Tsai, M.-C. Lin, G.-S. Sun, and Y.-S. Lin, "Single chip FPGA-based reconfigurable instruments," in Proceedings of International Conference on Reconfigurable Computing and FPGAs (ReConFig '04), Colima, Mexico, September 2004.
[4] S. A. Dyer, Survey of Instrumentation and Measurement, John Wiley & Sons, New York, NY, USA, 2001.
[5] J.-W. Hsieh, G.-R. Tsai, and M.-C. Lin, "Using FPGA to implement an n-channel arbitrary waveform generator with various add-on functions," in Proceedings of 2nd IEEE International Conference on Field-Programmable Technology (FPT '03), pp. 296–298, University of Tokyo, Tokyo, Japan, December 2003.
[6] G.-R. Tsai, M.-C. Lin, W.-Z. Tung, K.-C. Chuang, and S.-Y. Chan, "Wide-band and precisely measurement method of phase detector based on FPGA with embedded processor," in Proceedings of International Conference on Informatics, Cybernetics and Systems (ICICS '03), I-SHOU University, Kaohsiung, Taiwan, December 2003.
[7] G.-R. Tsai and M.-C. Lin, "High speed signal sampler by multiple-path algorithm," in Proceedings of IEEE Region 10 Conference (TENCON '04), vol. 1, pp. 29–31, Chiang Mai, Thailand, November 2004.
[8] G.-R. Tsai and M.-C. Lin, "Implementation of a real-time harmonic analyzer core by a single FPGA chip," in Proceedings of 25th Symposium on Electrical Power Engineering, Tainan, Taiwan, November 2004.
[9] J. Tierney, C. Rader, and B. Gold, "A digital frequency synthesizer," IEEE Transactions on Audio and Electroacoustics, vol. 19, no. 1, pp. 48–57, 1971.
[10] C. Dick and F. J. Harris, "Configurable logic for digital communications: some signal processing perspectives," IEEE Communications Magazine, vol. 37, no. 8, pp. 107–111, 1999.
[11] J. Vankka, M. Waltari, M. Kosunen, and K. A. I. Halonen, "A direct digital synthesizer with an on-chip D/A-converter," IEEE Journal of Solid-State Circuits, vol. 33, no. 2, pp. 218–237, 1998.
[12] W. C. Lindsey and C. M. Chie, "A survey of digital phase-locked loops," Proceedings of the IEEE, vol. 69, no. 4, pp. 410–431, 1981.
[13] R. E. Best, Phase-Locked Loops: Design, Simulation, and Applications, McGraw-Hill, New York, NY, USA, 5th edition, 2003.
[14] T. Watanabe and S. Yamauchi, "An all-digital PLL for frequency multiplication by 4 to 1022 with seven-cycle lock time," IEEE Journal of Solid-State Circuits, vol. 38, no. 2, pp. 198–204, 2003.
[15] N. Kularatna, Modern Electronic Test and Measuring Instruments, IEE, London, UK, 1996.
[16] Xilinx System Generator v6.2 User Guide, 2004.
[17] Xilinx, "The Low-Cost, Efficient Serial Configuration of Spartan FPGAs," XAPP098, 1998.
[18] Xilinx, "Configuring Spartan-II FPGAs from Parallel EPROMs," XAPP178, 1999.
[19] Xilinx, "Data Generation and Configuration for Spartan Series FPGAs," XAPP126, 2003.

Guo-Ruey Tsai was born in Tainan, Taiwan, in 1953. He received the B.S. and M.S. degrees in electronics engineering from the National Chiao Tung University, Taiwan, in 1975 and 1977, respectively. Since 1993, he has been a Faculty Member at Kun Shan University, Taiwan, where he is currently an Associate Professor in the Department of Electronics Engineering. His research interests are in embedded processor architecture, parallel processing algorithms, and FPGA-based instrumentation and measurement system design.

Min-Chuan Lin was born in Chang-Hua, Taiwan, in 1956. He received the B.S. degree from the National Taiwan University, Taiwan, in 1979, and the M.S. and Ph.D. degrees in electro-optical engineering from the National Chiao Tung University, Taiwan, in 1987 and 1992, respectively. Since 1993, he has been a Faculty Member at Kun Shan University, Taiwan, where he is currently an Associate Professor in the Department of Electronics Engineering. His research interests include FPGA-based reconfigurable system design and optical fiber transmission.

Hindawi Publishing Corporation
EURASIP Journal on Applied Signal Processing
Volume 2006, Article ID 52919, Pages 1–12
DOI 10.1155/ASP/2006/52919

FPGA Implementation of an MUD Based on Cascade Filters for a WCDMA System

Quoc-Thai Ho, Daniel Massicotte, and Adel-Omar Dahmane

Laboratory of Signal and System Integration (LSSI), Department of Electrical and Computer Engineering, Université du Québec à Trois-Rivières, 3351 Boulevard des Forges, C.P. 500, Trois-Rivières, QC, Canada G9A 5H7

Received 2 October 2004; Revised 30 June 2005; Accepted 12 July 2005

The VLSI architecture, targeted on FPGAs, of a multiuser detector (MUD) based on a cascade of adaptive filters for asynchronous WCDMA systems is presented. The algorithm is briefly described. This paper focuses mainly on real-time implementation. It also focuses on a design methodology exploiting the modern technology of programmable logic and overcoming the limitations of commercial tools. The dedicated architecture, based on a regular structure of processors and a special structure of memory exploiting the FPGA architecture, maximizes the processing rate. The proposed architecture was validated using synthesized data in UMTS communication scenarios. The performance goal is to maximize the number of users for different WCDMA data traffics. This dedicated architecture can be used as an intellectual property (IP) core processing an MUD function in the system-on-programmable-chip (SOPC) of UMTS systems. The targeted FPGA components are the Virtex-II and Virtex-II Pro families of Xilinx.

Copyright © 2006 Hindawi Publishing Corporation. All rights reserved.

1. INTRODUCTION

The third generation (3G) of mobile wireless communication is adopted for high-throughput services and the effective utilization of spectral resources. This work focuses on the Universal Mobile Telecommunications System (UMTS). In UMTS systems, the wideband code-division multiple-access (WCDMA) scheme is adopted. The desired data throughputs for 3G UMTS systems are 144 kbps for vehicular, 384 kbps for pedestrian, and 2 Mbps for indoor environments [1, 2]. The receivers in 3G systems must take into account not only intersymbol interference (ISI) but also, more importantly, multiple-access interference (MAI), which increases radically with the number of users and the data rates. Multiuser detectors (MUDs) are applied to eliminate the MAI and have become essential for an efficient 3G wireless network deployment [3]. The algorithmic aspect of MUD has become an important research issue over the last decade (e.g., [3–6]). Moreover, the real-time implementation aspect of MUDs is also well documented (e.g., [6–9]). Rapid prototyping targeted on field-programmable gate arrays (FPGAs) has also been proposed [10–12]. These works demonstrate several limitations in practical systems in terms of timing and of algorithm and hardware constraints (e.g., arithmetic complexity, memory access requirements, data flow) [5–7]. Moreover, no work has been done to maximize the number of users on a chip (or a device in the case of FPGAs). Maximizing the number of users makes it possible to increase the capacity of a cell and multiantenna processing.

Because minimum-mean-square-error (MMSE)-based receivers allow for a significant gain in performance, the adaptive two-stage linear cascade filter MUD (CF-MUD) based on MMSE receivers proposed in [13] offers a good tradeoff between performance and complexity. This algorithm presents low complexity and suitable regularity aspects for FPGA implementation. The CF-MUD is based on two blocks, signature and detection, which will be briefly described in Section 2. Each block acts as a filter in order to cancel the ISI and MAI. In previous works [14, 15], FPGA implementations of the signature block were presented. Based on the CF-MUD algorithm, this paper describes a complete design architecture, including the signature and detection blocks, targeted on the recent FPGA components, the Virtex-II and Virtex-II Pro of Xilinx.

The rest of the paper is organized as follows. Section 2 presents a brief description of the system model and the adaptive MUD algorithm considered in this paper. Section 3 introduces the VLSI architecture of the present MUD targeted on the Virtex-II and Virtex-II Pro components. Section 4 describes the implementation methodology and Section 5 presents the results. Section 6 presents a few conclusions.

Figure 1: Principle of cascade filter MUD (CF-MUD).

2. BACKGROUND

2.1. DS-CDMA baseband model

In a direct-sequence CDMA (DS-CDMA) baseband system model, we consider K mobile users transmitting symbols from the alphabet Ξ = {−1, 1}. Each user's symbol is spread by a pseudonoise (PN) sequence of length Nc called the specific signature code. T denotes the symbol period and Tc denotes the chip period, where Nc = T/Tc is an integer. User k's nth transmitted symbol is b_k^(n). The base transceiver station (BTS) received signal in baseband can then be written as follows:

r(t) = Σ_{n=0}^{Nb−1} Σ_{k=1}^{K} Σ_{l=1}^{Lk} A_k b_k^(n) h_{k,l}^(n) s_k(t − nT − τ_{k,l}) + η(t), (1)

where t denotes the time; Lk is the number of propagation paths; h_{k,l}^(n) and τ_{k,l} are, respectively, the complex gain and the propagation delay of path l for user k; Nb represents the number of transmitted symbols; A_k is the transmitted amplitude of user k; s_k is the specific signature of user k; and η(t) is the additive white Gaussian noise (AWGN) with variance σ_η^2.

To increase the performance and capacity of communication systems, the ISI and MAI must be minimized. It is therefore essential to design MUD processing able to cancel these interferences. The following gives a brief description of the CF-MUD [13].

2.2. Cascade filter multiuser detector

The block diagram of the multiuser detector CF-MUD to be implemented on an FPGA is shown in Figure 1 [13]. We can distinguish two blocks: signature and detection. Each block acts as an adaptive filter for canceling the ISI and MAI. The proposed linear adaptive MUD is based on the least-mean-square (LMS) adaptation method. This filter, however, needs data training sequences to adapt the filter coefficients. Compared to time-division multiple-access (TDMA) used in Global System for Mobile communications (GSM) systems, UMTS systems do not give access to preknown data, with the exception of pilot bits, in order to adjust the filter coefficients. It is important to note that, to assure convergence, both block filters need more than the pilot bits available in a fast-fading context. Preknown data training sequences are internally generated based on channel parameters (amplitudes and delays) obtained from the channel-estimation technique.

The principle of CF-MUD is briefly described in Figure 2. The switch models the training phase and the detection phase. The first block of the CF-MUD, the signature block, adapts the signatures of the users without prior knowledge of their PN codes. In the first step, we synchronized the received signal r(n) based on the estimated propagation delays for each user. In the training phase, we used the following set of equations for user k (k = 1, 2, ..., K):

y_k(n) = Re{w_k(n)^H r^train(n)}, w_k(0) = 0, (2)
α_k(n) = b_k^train(n) − y_k(n), (3)
w_k(n + 1) = w_k(n) + μ r^train(n) α_k(n)^*, (4)

with w_k(n) = [w_{k,0}(n), w_{k,1}(n), ..., w_{k,Nc−1}(n)]^T, and

r(n) = [r(nT), r(nT − Tc), r(nT − 2Tc), ..., r(nT − (Nc − 1)Tc)]^T, (5)

where dim(w_k) = dim(r^train) = Nc × 1, Re(•) defines the real part of a complex value, (•)^H defines the Hermitian operation, and (•)^* the conjugate.

The following notations are used: x̂ is the estimated value of x; y_k(n) is the adaptation output of user k; w_k(n) is the vector of filter coefficients of user k; b_k^train(n) is the synthetic transmitted training data sequence; r^train(n) is the synthetic received training data vector generated from b_k^train(n) transmitted through the estimated channel parameters; α_k(n) is the adaptation error of the signature; and μ is the adaptation step of the adaptive filters in the signature block.

The detection block aims to suppress the residual MAI and ISI based on the data of all users estimated using the output signal of the signature block. From all users, we form a vector y_T(n) at the output of the signature block as follows:

y_T(n) = [y_1(n − 1), ..., y_K(n − 1), y_1(n), ..., y_K(n), y_1(n + 1), ..., y_K(n + 1)]^T. (6)

In the training phase, we used the following set of equations for user k (k = 1, 2, ..., K):

o_k(n) = v_Tk(n)^H y_T(n), v_Tk(0) = 0,
β_k(n) = b_k^train(n) − o_k(n), (7)
v_Tk(n + 1) = v_Tk(n) + ν y_T(n) β_k(n)^*,

where v_Tk(n) = [v_{1,k}(n), v_{2,k}(n), ..., v_{3K,k}(n)]^T, dim(v_Tk(n)) = dim(y_T(n)) = 3K × 1, o_k(n) is the adaptation output of user k corresponding to the output of the respective adaptive filter, v_Tk(n) is the filter coefficient vector of user k, β_k(n) is the adaptation error of detection, and ν is the adaptation step of the adaptive filters in the detection block.
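The signature-block recursion (2)-(4) can be sketched in a few lines of numpy on a toy single-user, noiseless, delay-free model (the paper trains on synthetic WCDMA data generated from estimated channel parameters; the spreading code, step size, and symbol count below are illustrative):

```python
import numpy as np

def train_signature(r_train, b_train, Nc, mu=2**-6):
    """LMS adaptation of one user's signature filter, following (2)-(4).
    r_train: chip-rate complex training signal; b_train: training symbols."""
    w = np.zeros(Nc, dtype=complex)              # w_k(0) = 0
    for n, b in enumerate(b_train):
        r = r_train[n * Nc:(n + 1) * Nc]         # chip vector, cf. (5)
        y = np.real(np.conj(w) @ r)              # y_k(n), eq. (2)
        alpha = b - y                            # error, eq. (3)
        w = w + mu * r * np.conj(alpha)          # update, eq. (4)
    return w

# Toy model: one user, +-1 spreading code, noiseless and delay-free.
rng = np.random.default_rng(0)
Nc = 16
code = rng.choice([-1.0, 1.0], Nc)
b = rng.choice([-1.0, 1.0], 200)
r = (np.repeat(b, Nc) * np.tile(code, len(b))).astype(complex)
w = train_signature(r, b, Nc)
y = np.real(np.conj(w) @ (b[-1] * code).astype(complex))
print(abs(y - b[-1]) < 1e-6)  # True: the filter has converged
```

With this choice of μ the effective step μ·Nc is well inside the LMS stability bound, so the filter converges toward a scaled copy of the user's code.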

Figure 2: Principle of (a) signature block and (b) detection block for the kth user.

In the detection phase, the transmitted data of the mobile users are estimated by the signature block from the following equation:

y_k(n) = Re{w_k(n)^H r(n)}, for k = 1, 2, ..., K. (8)

Regarding the detection block, the transmitted data of the users are estimated by the following equation:

o_k(n) = v_Tk(n)^H y_T(n), for k = 1, 2, ..., K. (9)

Finally, the estimated bits b̂_k(n) are found by simply taking the sign function of o_k(n),

b̂_k(n) = sign(o_k(n)). (10)

When the adaptation process is completed, we apply (8), (9), and (10) to propagate the signal r(n) through the CF-MUD.
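The detection-phase data flow of (6), (9), and (10) can be sketched as follows; the detector matrix here is a hand-built selector rather than coefficients trained via (7), purely to show how y_T(n) is formed and sliced:

```python
import numpy as np

# Data flow of the detection phase: concatenate all users' signature
# outputs at symbols n-1, n, n+1 into y_T(n) (eq. (6)), filter with
# v_Tk (eq. (9)), slice with sign() (eq. (10)). The detector matrix is
# a toy selector, not coefficients trained via (7).
K = 4
rng = np.random.default_rng(1)
y_prev, y_curr, y_next = np.sign(rng.standard_normal((3, K)))
y_T = np.concatenate([y_prev, y_curr, y_next])   # dim 3K, eq. (6)

v_T = np.zeros((3 * K, K))                       # one column per user k
v_T[K + np.arange(K), np.arange(K)] = 1.0        # select y_k(n) only

o = v_T.T @ y_T                                  # o_k(n), eq. (9)
b_hat = np.sign(o)                               # eq. (10)
print(bool(np.array_equal(b_hat, y_curr)))  # True for this trivial detector
```

A trained v_Tk would additionally weight the n−1 and n+1 blocks of y_T(n) to cancel residual ISI and MAI.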

2.3. Performance evaluation of CF-MUD

Figure 3 depicts the algorithmic performance in terms of the block error rate (BLER) of the CF-MUD algorithm compared with the RAKE receiver and the soft multistage parallel interference canceler (Soft-MPIC) in a WCDMA platform [3]. Simulation results were obtained for one antenna, with perfect channel estimation, the Vehicular A channel defined by the International Telecommunication Union (ITU) [16], 3 km/h mobile speed, 64 kbps data rate, and 15 users. We observed a gain of 1.9 dB to target a BLER of 10% for CF-MUD compared with Soft-MPIC, while the RAKE receiver cannot reach the BLER of 10%. No decision feedback has been considered for CF-MUD and Soft-MPIC. Although MUD with decision feedback is considered superior to MUD without it, the decision feedback creates a serious data dependency for parallelizing the implementation on many devices.

Based on the CF-MUD equations (2)-(10), the proposed FPGA-targeted architecture can be described as in Section 3.

Figure 3: A performance evaluation of MUD methods in WCDMA conditions with the Vehicular A channel at a mobile speed of 3 km/h, data rate 64 kbps (OVSF = 16), and 15 users, in terms of BLER.

3. VLSI ARCHITECTURE TARGETED ON FPGA

The developed architecture should be reconfigurable to several baseband processing UMTS systems characterized by the number of users K and different communication scenarios at different mobile speeds. Thus, it can be reconfigured by respecting WCDMA, hardware, and algorithmic constraints. The main WCDMA constraints [2] are the data rates, that is, an orthogonal variable spreading factor (OVSF) of 64, 16, 8, or 4 corresponding, respectively, to 12.2 kbps (voice rate), 64 kbps, 144 kbps, and 384 kbps data rates; a time frame of 38400 chips in 10 milliseconds; and a mobile speed of 3 km/h to 100 km/h.
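The frame constraint above fixes the chip rate, and the OVSF value fixes the symbol count per frame; a quick sanity check (note that the listed user data rates additionally involve channel coding, so they are not simply chip rate divided by OVSF):

```python
# Arithmetic implied by the WCDMA constraints: 38400 chips per 10 ms
# frame gives the 3.84 Mcps chip rate; the OVSF sets the number of
# channel symbols per frame.
CHIPS_PER_FRAME = 38_400
FRAME_S = 0.01

chip_rate = CHIPS_PER_FRAME / FRAME_S            # 3.84e6 chips/s
for ovsf in (64, 16, 8, 4):
    symbols = CHIPS_PER_FRAME // ovsf            # symbols per 10 ms frame
    print(ovsf, symbols)                         # 64 -> 600, ..., 4 -> 9600
```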

Figure 4: Simplified HW architecture of CF-MUD.

The main algorithmic constraints, with respect to MUD performance, consist of the number of adaptation iterations in the signature filter and the detection filter, the adaptation steps μ and ν, and the quantification scales to respect the arithmetic precision in fixed point.

The main hardware constraints take into account the limitations of the targeted FPGAs in terms of the number of dedicated multipliers, the number of block RAMs (BRAMs), and the memory size of each BRAM [17]. These constraints were also used in our method of resource estimation before synthesis. The architecture must be able to respect real-time constraints bounded by the time frame to detect all data frames, and by the adaptation time to adapt all coefficients (w and v) depending on the mobile speed.

The block diagram of the pipelined architecture is based on two stages of the modular array structure of processing elements (PEs) shown in Figure 4. Figure 5 illustrates the mapping of the CF-MUD algorithm onto the array of PEs and the internal memories (inside the FPGA). These PEs consist of optimized cores performing the adaptive filtering defined by (2)-(4), which we call PE_LMS, and the straightforward filtering defined by (2), which we call PE_FIR. The regularity of the CF-MUD makes it possible to time-multiplex a number of users, that is, we used only one PE to process a number of users by time-multiplexing selection. The time multiplexing, that is, the number of users per PE, in the signature and detection blocks is defined by TMUX1 and TMUX2, respectively. Thus, the number of PE_LMS and PE_FIR inside each block is the same, and is represented by NMUX1 and NMUX2 for the signature and detection blocks, respectively. All PEs consider normalized fixed-point complex-value signals and use the same time multiplexing.

The data and address paths are independent to permit maximum simultaneous direct access to data and addresses. Two different external SDRAM memories and two different memory buffers (InputBuffer and OutputBuffer) are used to allow independent access to input/output, and thus to maximize the multiple-path access to external input/output. These memory buffers are implemented by the LUT (lookup-table)-based distributed memory of the FPGAs. The memory buffers InputBuffer and OutputBuffer are multiport. The buffer InternalBuffer is used to store intermediate results from the signature filter and input them to the detection filter. It is implemented by LUT-based distributed memories. The first-in first-out (FIFO) buffers Serial2Parallel and Parallel2Serial are used to minimize the utilization of the input-output (IO) pins of the FPGA and also to minimize the number of external memories. These buffers are implemented by the LUT-based distributed memory of the FPGAs as well. The PEs of the architecture use semiglobal internal BRAM-based memories, that is, a certain number of PEs have access to the same memory. This number is defined by the possible time multiplexing determined from the architectural specification step.

We used an advanced scheduling based on time multiplexing obtained by modifying the conventional methods, that is, As Soon As Possible (ASAP) and As Late As Possible (ALAP). This advanced scheme relies on the fact that ASAP gives low latency, while ALAP gives high latency but uses fewer hardware resources [18]. Modifying these two methods jointly permits balancing the latency while exploiting the particular features of the targeted FPGAs. The constraints of this scheduling involve using only two real dedicated multipliers and a minimum number of multiplexers and other arithmetic operators (adders). This method exploits the symmetric structure of these FPGA components, especially the shared connection between the BRAMs and the dedicated multipliers. Using two real multipliers to implement a complex multiplication, which includes four real multiplications, permits use of this shared connection between a dedicated multiplier and a BRAM. Minimizing the number of multiplexers leads to a reduction in the critical path of the circuit.

Figure 5: Mapping the CF-MUD on processing elements and internal memories.

The fine-grain pipeline of the PEs, shown in Figure 6(a), uses the dedicated 2-level pipelined multipliers available on the silicon die of Xilinx FPGA devices. To understand the PE functionality, consider the complex-number multiplication described by (2). The summation is up to N_T, which is N_C for signature filters and 3K for detection filters:

R_re = Σ_{i=0}^{N_T−1} [Re(r_{k,i}^train) Re(w_{k,i}) − Im(r_{k,i}^train) Im(w_{k,i})],
R_im = Σ_{i=0}^{N_T−1} [Re(r_{k,i}^train) Im(w_{k,i}) + Im(r_{k,i}^train) Re(w_{k,i})]. (11)

And to update the coefficients of (3) in (4),

Re(w_{k,i}(n + 1)) = Re(w_{k,i}(n)) + μ (b_k^train(n) − R_re) Re(r_{k,i}(n)),
Im(w_{k,i}(n + 1)) = Im(w_{k,i}(n)) + μ (b_k^train(n) − R_im) Im(r_{k,i}(n)), (12)

where Re(x) and Im(x) define the real and imaginary parts of x, and R_re and R_im represent the accumulation registers for the real and imaginary parts.

Figure 6(b) illustrates the scheduling and register-transfer logic (RTL) mapping of PE_LMS, including PE_FIR, to implement the complex-number filter using two real-number multipliers, where Ax and Mx (x = 1, 2, 3) are, respectively,


Figure 6: Detailed description of a PE: (a) scheduling and (b) mapping of 2-level pipelined complex-taps adaptive FIR-LMS filters.

the adder and the multiplier units. Unit A1 is an adder-subtracter used for addition or subtraction in the real part of (2). Unit A3 is a subtracter used to calculate the error adaptation in (3). Saturation is applied at the output of these operational units to maintain the width of the data bus. In this figure, the subscripts "r" and "i" represent the real and the imaginary parts of the variables, respectively. Registers Rre (Rim) and R0 (R1) correspond to the real (imaginary) parts of w_k,i(n) and of w_k,i(n+1), respectively. Registers R0 (R1) are used as pipeline registers allowing two concurrent additions in the multiplier-accumulator (MAC) and the complex multiplications in (2), (4). Two registers are added before the inputs of the adders Ax to pipeline without hazard. The IO of the PE can be registered or not; this helps the processor interface with other components of the system. The shift-to-right operation implements a hardware-free multiplication by the adaptation steps mu and nu, whose values are powers of two (2^-n).

The execution time of an adder is one clock cycle (T_clk) and that of a multiplier is 2 cycles. For N-complex-tap filters, the throughput in clock cycles of the adaptation process is (2N + 5) and of the detection process is (2N + 4). Thus, the throughputs of the PE_LMS (including both adaptation and detection processes) and of the PE_FIR (including the detection process only) are, respectively, (3N + 9) and (2N + 5). As a result, the throughputs of the signature block and of the detection block are, respectively, (3N_C + 9), (2N_C + 5) and (9K + 9), (6K + 5).

The coarse-grain pipelined dataflow strategy at the system level of the architecture is detailed in Figures 7 and 8 for the adaptation and detection processes, respectively. The strategy depends on the relative processing times of the signature block and the detection block for the adaptation and detection processes.

4. IMPLEMENTATION METHODOLOGY

This paper focuses on the hardware (HW) design flow of the MUD based on a library of hard optimized IP cores, for example, the complex-taps FIR filters used as PEs for the adaptive MUD. It is necessary to estimate the timing performance and HW resources required by the architectures from the architectural specifications satisfying these constraints. To reach the maximum number of users (K) for the two Xilinx device families, a program based on a nonlinear integer-programming model was developed. This nonlinear integer program is solved by the branch-and-bound method [19]. The model makes it possible to estimate the performance requirements and the limitations of FPGA HW resources. This tool is used to maximize the time multiplexing (number of users in one PE) and the timing performance (number of clock cycles) of the system, while respecting algorithmic constraints and HW resource limitations (numbers of multipliers and BRAM blocks). It is also necessary to minimize the clock rate for power consumption. The program is helpful for choosing a suitable type of architecture in terms of pipeline strategy for the algorithmic specification of the MUD. This tool can also be used conversely to estimate the necessary HW resources and timing performance.

For the specific architecture developed for the CF-MUD algorithm targeted on these FPGA devices (Virtex-II Pro and Virtex-II), the objective is to maximize the number of users K_MAX described by the nonlinear inequality

    K <= f(t, N_MEM, T_MUX1, T_MUX2, OVSF, N_chip, N_m, N_A2, N_cycle),   (13)

respecting the constraint

    T_MUX1 <= g(t, N_MEM, OVSF, N_chip, N_A1, N_cycle),   (14)

where T_MUX2 is an integer satisfying the pipeline strategy of the HW architecture. Here N_MEM is the number of data words per BRAM, N_chip is the number of chips, N_m is the maximum number of dedicated multipliers available on the silicon die of these FPGA components [17], N_cycle is the number of cycles (throughput) to solve the CF-MUD on FPGA (Section 3), and N_A1 and N_A2 are the numbers of adaptation iterations in the signature and detection blocks, respectively. We consider the variables N_A1, N_A2, OVSF, and t as constraints. These inequalities, defined by the straightforward functions f(.) and g(.) in (13) and (14), are built from the constraints stated in Section 3 and the dedicated FPGA architecture.

Since verification is critical in the design flow, dynamic verification by simulation is used throughout. The results of fixed-point simulations in a high-level language (Matlab) provide a static functional reference for the HW verification of the architecture. The synthesized data are used for verification in Matlab as well as in the FPGA device implementation.

5. RESULTS

The HW architecture is targeted on the Virtex-II and Virtex-II Pro components of Xilinx to satisfy different algorithmic and WCDMA specifications in real time.

Tables 1 and 2 summarize the maximum number of simultaneous users (K_MAX) that can be processed in monorate on different devices of the Virtex-II and Virtex-II Pro families at different data rates based on the UMTS 3G standard. The data throughputs are fixed by the OVSF parameter values 64, 16, 8, and 4, corresponding, respectively, to 12.2 kbps (voice rate), 64 kbps, 144 kbps, and 384 kbps (the last three throughputs are for data) [2]. We assumed three mobile speeds: slow fading (T_A = 40 milliseconds), medium fading (T_A = 10 milliseconds), and fast fading (T_A = 2 milliseconds), where T_A represents the allowed adaptation time of the CF-MUD coefficients (w and v) [20]. Considering the short code of 256 chips, the number of adaptation iterations is 100(256/OVSF) for each user k of the signature and detection blocks. We used the same number of adaptation iterations for the hardware estimation.

While the allowed adaptation time constraint varies with the mobile speed, the allowed detection time is always limited by 10 milliseconds, which is the timing length of a frame
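As a floating-point behavioural sketch of the FIR-LMS recursion that the PE implements in fixed-point hardware (the function names and the test signal below are ours, not the paper's), one filtering-plus-adaptation iteration can be written as follows:

```python
import random

def fir_lms_step(w, x, d, mu):
    """One detection (FIR) + adaptation (LMS) iteration of a complex-tap PE.
    w: complex taps, x: input samples (newest first), d: desired sample.
    In the HW, mu is a power of two (2^-n), so the update multiply is a shift."""
    y = sum(wi * xi for wi, xi in zip(w, x))                     # FIR output
    e = d - y                                                    # error term
    w = [wi + mu * e * xi.conjugate() for wi, xi in zip(w, x)]   # tap update
    return y, e, w

# Sanity check: identify a known 2-tap complex channel (noiseless).
random.seed(0)
target = [0.5 + 0.5j, -0.25 + 0.0j]
w, buf, mu = [0j, 0j], [0j, 0j], 0.25            # mu = 2^-2
for _ in range(500):
    xn = complex(random.uniform(-1, 1), random.uniform(-1, 1))
    buf = [xn] + buf[:-1]
    d = sum(t * b for t, b in zip(target, buf))
    y, e, w = fir_lms_step(w, buf, d, mu)
```

With a power-of-two step size, the scalar multiplication by mu in the update line is exactly the shift-to-right operation described for the PE.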

Figure 7: Pipeline strategy of the adaptation process when the processing time of the signature block is (a) greater than and (b) less than the processing time of the detection block.

Figure 8: Pipeline strategy of the detection process when the processing time of the signature block is (a) greater than and (b) less than the processing time of the detection block.

Table 1: Maximum number of simultaneous users (K_MAX) detected and which can be integrated on different devices of the Virtex-II Pro family.

                Slow fading          Medium fading        Fast fading
Device  OVSF:  64   16    8    4    64   16    8    4    64   16    8    4
XC2VP2         10   10    8    6    10    6    6    4     4    2    2    1
XC2VP4         22   20   16   14    20   14   10    6    10    4    2    2
XC2VP7         30   28   24   18    28   18   14    8    12    6    4    2
XC2VP20        52   48   36   28    48   28   16   16    22   12    4    2
XC2VP30        68   68   44   32    68   32   26   16    26   12    4    2
XC2VP40        84   82   64   38    82   38   32   16    32   12    4    2
XC2VP50        98   90   68   46    90   46   32   16    38   12    4    2
XC2VP70       112  108   68   64   108   64   32   16    54   12    4    2
XC2VP100      148  136   88   68   136   68   32   16    54   12    4    2
XC2VP125      170  136  110   68   136   68   32   16    54   12    4    2

Table 2: Maximum number of simultaneous users (K_MAX) detected and which can be integrated on different devices of the Virtex-II family.

                Slow fading          Medium fading        Fast fading
Device  OVSF:  64   16    8    4    64   16    8    4    64   16    8    4
XCV40           2    2    2    2     2    2    2    2     1    1    0    0
XCV80           6    6    6    4     6    4    4    2     2    2    1    1
XCV250         18   18   16   12    18   12   10    6     8    4    2    2
XCV500         24   22   18   16    23   16   10    6    10    4    4    2
XCV1000        28   26   22   16    25   16   12    8    12    6    4    2
XCV1500        34   32   26   20    32   20   16    8    14    8    4    2
XCV2000        36   34   28   22    34   22   16   10    16    9    4    2
XCV3000        56   52   40   32    52   32   19   16    24   12    4    2
XCV4000        66   60   44   32    60   32   24   16    26   12    4    2
XCV6000        72   68   48   32    68   32   28   16    26   12    4    2
XCV8000        84   72   56   32    72   32   32   16    28   12    4    2

of 38400 chips in UMTS systems. To estimate the maximum number of users K_MAX, we assumed a 100 MHz clock frequency for all devices.

Tables 3 and 4 summarize the utilization ratio of resources on the targeted devices corresponding to the estimated maximum numbers of users given in Tables 1 and 2, respectively. We observed that the utilization ratio of resources in the fast-fading scenario is low (indicated in gray zones). This is because the decreasing adaptation time imposes fixing T_MUX1 and T_MUX2 to 1; thus, we are limited by few resources. However, the number of users can easily be increased by simply duplicating the same architecture on the device. Hence, K_MAX can easily be increased in fast-moving conditions.

Note that in these results the users transmit simultaneously in the same sector. Normally, the number of users should be lower than the value of the OVSF; users in excess of the OVSF value should be distributed over the other sectors of the BTS. Under these conditions, the number of users per BTS (3 sectors) should be higher than the data indicated in Tables 1 and 2.

According to the pipeline strategy of the developed architectures, the total time needed to process a data frame is restricted by the maximum execution time in the signature and detection blocks. In the signature block, the performance in terms of adaptation time (t_A1) and detection time (t_D1) is, respectively, defined by

    t_A1 = (3N_C + 9) N_A1 (256/OVSF) T_MUX1 T_clk,
    t_D1 = (2N_C + 5) (38400/OVSF) T_MUX1 T_clk.       (15)

In the detection block, we have

    t_A2 = (9K + 9) N_A2 T_MUX2 T_clk,
    t_D2 = (6K + 5) (38400/OVSF) T_MUX2 T_clk.         (16)

With the pipeline strategy of the architecture, the processing time in each cascade filter is, respectively, max(t_A1, t_D1) and max(t_A2, t_D2), and it must remain below T_A for adaptation, depending on the slow-, medium-, and fast-fading communication situations.

Table 5 summarizes the results of an experimental system for 16 users after placing and routing by the Xilinx physical tool (the ISE Foundation) on the Virtex-II Pro component XC2VP30. The results for the data rates in fast-fading conditions are excluded for the system of 16 users because of the
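The frame-time bookkeeping of (15)-(16) is straightforward to mechanise. The following sketch (our own helper; all parameters are passed in explicitly) evaluates the four times and the max() feasibility check stated above:

```python
def cf_mud_times(N_C, K, N_A1, N_A2, OVSF, T_MUX1, T_MUX2, T_clk):
    """Adaptation/detection times of (15) and (16), in seconds."""
    t_A1 = (3 * N_C + 9) * N_A1 * (256 / OVSF) * T_MUX1 * T_clk
    t_D1 = (2 * N_C + 5) * (38400 / OVSF) * T_MUX1 * T_clk
    t_A2 = (9 * K + 9) * N_A2 * T_MUX2 * T_clk
    t_D2 = (6 * K + 5) * (38400 / OVSF) * T_MUX2 * T_clk
    return t_A1, t_D1, t_A2, t_D2

def within_adaptation_time(times, T_A):
    """Each cascade stage, max(t_A, t_D), must fit in the allowed time T_A."""
    t_A1, t_D1, t_A2, t_D2 = times
    return max(t_A1, t_D1) <= T_A and max(t_A2, t_D2) <= T_A

# Illustrative numbers only: K = 16 users, OVSF = 64, N_A1 = N_A2 =
# 100(256/OVSF) = 400 iterations, 100 MHz clock (T_clk = 1e-8 s).
t = cf_mud_times(N_C=64, K=16, N_A1=400, N_A2=400,
                 OVSF=64, T_MUX1=4, T_MUX2=4, T_clk=1e-8)
```

A designer can sweep K, OVSF, and the T_MUX factors through such a model to reproduce the kind of feasibility screening the branch-and-bound tool performs.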

Table 3: Utilization ratio of hardware (%) for K_MAX of Table 1 on different devices of the Virtex-II Pro family.

                Slow fading          Medium fading        Fast fading
Device  OVSF:  64   16    8    4    64   16    8    4    64   16    8    4
XC2VP2         93   97   98   88    97   88  100   89    79   83   83   39
XC2VP4        100  100   95  100   100  100   95  100   100   71   57   36
XC2VP7         96   95   98   95    95   95   95   97    98   95   68   23
XC2VP20        98   98   98   99    98   99   97   97   100   82   34   11
XC2VP30        90  100   88   97   100   97   99   94    76   70   22    7
XC2VP40        89  100  100   99   100   99   92   67   100   50   16  5.2
XC2VP50       100   92   98   99    92   99   85   55    98   41   13  4.3
XC2VP70        92  100   83   99   100   99   80   39    99   29  9.1  3.0
XC2VP100      100   92   99   92    92   92   59   29    97   22  6.7  2.2
XC2VP125       92   98  100   98    98   98   47   23    78   17  5.3  1.7

Table 4: Utilization ratio of hardware (%) for K_MAX of Table 2 on different devices of the Virtex-II family.

                Slow fading          Medium fading        Fast fading
Device  OVSF:  64   16    8    4    64   16    8    4    64   16    8    4
XCV40          78   80   84   93    79   93   85   90    54   67    0    0
XCV80          95   98   91   85    98   85  100   88    86   75   58   58
XCV250         98   96  100   90    96   90   97   97    89   83   67   42
XCV500         96   98   99  100    98  100   83   88    87   94  100   31
XCV1000        99   98   92   99    98   99   98  100    90   90   75   25
XCV1500        99  100   98   97   100   97  100  100    97  100   62   21
XCV2000        95   98  100   92    98   92   95   98    95   96   54   18
XCV3000        97   99  100  100    99  100   99  100   100  100   31   10
XCV4000        99  100  100   92   100   92  100   80    87   80   25  8.3
XCV6000       100   94  100   92    94   92   97   89    72    7   21  6.9
XCV8000       100  100  100   79   100   78   98   76   100   57   18  6.0

Table 5: Post-place-and-route results using the Xilinx physical tools (ISE Foundation) targeted on the Xilinx Virtex-II Pro XC2VP30 device for a system of K = 16 users under slow- and medium-fading conditions.

Slow fading:
  OVSF  T_MUX1  T_MUX2  Slices             BRAMs          Multipliers    Clock (MHz)  Skew (ns)  t_A (ms)  t_D (ms)
   64     4       4     6149/13696 (44%)   36/136 (32%)   32/136 (23%)   71           0.273      4.53      4.50
   16     4       4     4508/13696 (32%)   36/136 (32%)   32/136 (23%)   72           0.271      8.49      13.45
    8     3       2     6168/13696 (45%)   56/136 (41%)   52/136 (38%)   74           0.28       4.28      13.10
    4     2       2     7474/13696 (54%)   68/136 (50%)   64/136 (47%)   73           0.281      4.192     26.56

Medium fading:
   64     4       4     6155/13696 (44%)   36/136 (32%)   32/136 (23%)   75           0.279      4.34      4.31
   16     2       2     8466/13696 (61%)   68/136 (30%)   64/136 (47%)   83           0.281      3.68      5.82
    8     4       1     8493/13696 (61%)   84/136 (61%)   80/136 (58%)   49           0.708      8.62      9.89
    4     1       1     11940/13696 (87%)  132/136 (97%)  128/136 (94%)  46           1.181      3.33      20.00

limitation of the present architecture in terms of maximum numbers. Again, we find a slight difference in hardware resources (number of slices) between the post-synthesis results in Table 5 and the pre-synthesis results of our resource-estimator tool in Table 1. This was explained in Section 4 by the absence of a database for FPGA components: we consider only the numbers of multipliers and BRAMs in our integer nonlinear programming model. Moreover, even with knowledge of such a database, resource estimation before synthesis is still difficult [21]. Nevertheless, for the main resources, the numbers of multipliers and BRAMs are exactly the same as in Table 1.

6. CONCLUSIONS

The HW architectures of a multiuser detector based on a cascade of adaptive filters (CF-MUD) for WCDMA systems were developed. The CF-MUD based on FIR filters with an LMS adaptation process proved a good choice for targeting FPGA devices. We have exploited the implementation advantages of the algorithm and the particular features of Xilinx devices. The regularity and recursiveness of the CF-MUD algorithm offer the opportunity to maximize the utilization ratio of the resources of the FPGA device. Using a real-time implementation and taking into account all UMTS constraints, we demonstrated a resource utilization ratio near 100%, maximizing the parallelism of the CF-MUD algorithm. These dedicated architectures can later be used as optimized IP cores performing MUD functions. The current HW architectures are purely glue logic. Future work will consist of exploiting software processing in the multirate CF-MUD as a whole, respecting the constraint specifications of 3G wireless communications.

ACKNOWLEDGMENTS

The authors are grateful for the financial support of the Natural Sciences and Engineering Research Council of Canada (NSERC). We also wish to thank Axiocom Inc. for its technical and financial assistance.

REFERENCES

[1] P. Chaudhury, W. Mohr, and S. Onoe, "The 3GPP proposal for IMT-2000," IEEE Communications Magazine, vol. 37, no. 12, pp. 72-81, 1999.
[2] 3rd Generation Partnership Project (3GPP), "Spreading and modulation (FDD)," Tech. Rep. TS 25.213 v4.1.0 (2001-06), 3GPP, Valbonne, France, 2001.
[3] S. Verdú, Multiuser Detection, Cambridge University Press, New York, NY, USA, 1998.
[4] A. O. Dahmane and D. Massicotte, "DS-CDMA receivers in Rayleigh fading multipath channels: direct vs. indirect methods," in Proceedings of IASTED International Conference on Communications, Internet and Information Technology (CIIT '02), St. Thomas, Virgin Islands, USA, November 2002.
[5] A. O. Dahmane and D. Massicotte, "Wideband CDMA receivers for 3G wireless communications: algorithm and implementation study," in Proceedings of IASTED International Conference on Wireless and Optical Communications (WOC '02), Banff, Alberta, Canada, July 2002.
[6] S. Moshavi, "Multi-user detection for DS-CDMA communications," IEEE Communications Magazine, vol. 34, no. 10, pp. 124-136, 1996.
[7] S. Rajagopal, S. Bhashyam, J. R. Cavallaro, and B. Aazhang, "Real-time algorithms and architectures for multiuser channel estimation and detection in wireless base-station receivers," IEEE Transactions on Wireless Communications, vol. 1, no. 3, pp. 468-479, 2002.
[8] O. Leung, C.-Y. Tsui, and R. S. Cheng, "VLSI implementation of rake receiver for IS-95 CDMA testbed using FPGA," in Proceedings of IEEE Asia and South Pacific Design Automation Conference (ASP-DAC '00), pp. 3-4, Yokohama, Japan, January 2000.
[9] G. Xu, S. Rajagopal, J. R. Cavallaro, and B. Aazhang, "VLSI implementation of the multistage detector for next generation wideband CDMA receivers," The Journal of VLSI Signal Processing, vol. 30, no. 1-3, pp. 21-33, 2002.
[10] Y. Guo, G. Xu, D. McCain, and J. R. Cavallaro, "Rapid scheduling of efficient VLSI architectures for next-generation HSDPA wireless system using Precision C synthesizer," in Proceedings of 14th IEEE International Workshop on Rapid Systems Prototyping (RSP '03), pp. 179-185, San Diego, Calif, USA, June 2003.
[11] W. Schlecker, A. Engelhart, W. G. Teich, and H.-J. Pfleiderer, "FPGA hardware implementation of an iterative multiuser detection scheme," in Proceedings of 10th Aachen Symposium on Signal Theory (ASST '01), pp. 293-298, Aachen, Germany, September 2001.
[12] B. A. Jones and J. R. Cavallaro, "A rapid prototyping environment for wireless communication embedded systems," EURASIP Journal on Applied Signal Processing, vol. 2003, no. 6, pp. 603-614, 2003, special issue on rapid prototyping of DSP systems.
[13] D. Massicotte and A. O. Dahmane, "Cascade filter receiver for DS-CDMA communication systems," International Application Published Under the Patent Cooperation Treaty (PCT), May 2004, WO2004/040789.
[14] Q.-T. Ho and D. Massicotte, "FPGA implementation of adaptive multiuser detector for DS-CDMA systems," in Proceedings of 14th International Conference on Field Programmable Logic and Applications (FPL '04), pp. 959-964, Leuven, Belgium, August-September 2004.
[15] Q.-T. Ho and D. Massicotte, "A low complexity adaptive multiuser detector and FPGA implementation for wireless DS-WCDMA communication systems," in Proceedings of Global Signal Processing Expo and Conference (GSPx '04), Santa Clara, Calif, USA, September 2004.
[16] The International Telecommunication Union (ITU), Geneva, Switzerland, available at: http://www.itu.org.
[17] Xilinx, San Jose, Calif, USA, available at: http://www.xilinx.com.
[18] G. De Micheli, Synthesis and Optimization of Digital Circuits, McGraw-Hill, New York, NY, USA, 1994.
[19] S. G. Nash and A. Sofer, Linear and Nonlinear Programming, McGraw-Hill, New York, NY, USA, 1996.
[20] S. Rajagopal, S. Rixner, and J. R. Cavallaro, "A programmable baseband processor design for software defined radios," in Proceedings of 45th IEEE Midwest Symposium on Circuits and Systems (MWSCAS '02), vol. 3, pp. 413-416, Tulsa, Okla, USA, August 2002.
[21] C. Shi, J. Hwang, S. McMillan, A. Root, and V. Singh, "A system level resource estimation tool for FPGAs," in Proceedings of 14th International Conference on Field Programmable Logic and Applications (FPL '04), pp. 424-433, Leuven, Belgium, August-September 2004.

Quoc-Thai Ho received a B.S. degree in electrical and electronics engineering from the Ho Chi Minh City University of Technology, an M.S. degree in design of digital and analog integrated systems from the Institut National Polytechnique de Grenoble, and an M.S. degree in microelectronics from the École Doctorale de Grenoble in September 2000, October 2001, and June 2002, respectively. He is currently pursuing his Ph.D. in electrical engineering at the Université du Québec à Trois-Rivières, where he joined the Laboratory of Signal and System Integration. His Ph.D. work consists of VLSI architectures of multiuser detectors for third-generation DS-WCDMA wireless communication systems. His current research interests include VLSI implementation, design methodologies, and FPGA-based rapid prototyping with applications to CDMA communication systems.

Daniel Massicotte received the B.S.A. and M.S.A. degrees in electrical engineering and industrial electronics in 1987 and 1990, respectively, from the Université du Québec à Trois-Rivières (UQTR), QC, Canada. He obtained the Ph.D. degree in electrical engineering in 1995 at the École Polytechnique de Montréal, QC, Canada. In 1994, he joined the Department of Electrical and Computer Engineering, Université du Québec à Trois-Rivières, where he is currently a Professor. He is currently the Head of the Laboratory of Signal and Systems Integration and Chief Technology Officer of Axiocom Inc. He received the Douglas R. Colton Medal for Research Excellence awarded by the Canadian Microelectronics Corporation, the PMC-Sierra High Speed Networking and Communication Award, and second place at the Year 2000 Complex Multimedia/Telecom IP Design Contest from Europractice, in 1997, 1999, and 2000, respectively. His research interests include VLSI implementation and digital signal processing for communications and measurement problems such as nonlinear equalization, multiuser detection, channel estimation, and signal reconstruction. He is the author or coauthor of more than 60 technical papers. He is also a Member of the Ordre des Ingénieurs du Québec, the Groupe de Recherche en Électronique Industrielle (GREI), and the Microsystems Strategic Alliance of Québec (ReSMiQ).

Adel-Omar Dahmane received the B.S. degree in electrical engineering from the Université des Sciences et de la Technologie Houari Boumediene (USTHB), Algiers, Algeria, in 1997, and the M.S. and Ph.D. degrees with honours in electrical engineering from the Université du Québec à Trois-Rivières (UQTR), QC, Canada, in 2000 and 2004, respectively. He was twice the Laureate of the Governor General of Canada's Academic Medal (gold medal, graduate level) and a Fellow of the Natural Sciences and Engineering Research Council of Canada (NSERC). From 2002 to 2004, he worked for Axiocom Inc. as a Director of research and development. In 2004, he joined the Université du Québec à Trois-Rivières as a Professor in electrical and computer engineering. His current research interests include wireless communications, spread-spectrum systems, multiuser detection, MIMO, and VLSI implementation issues. He is a Member of the Research Group in Industrial Electronics at UQTR.

Hindawi Publishing Corporation, EURASIP Journal on Applied Signal Processing, Volume 2006, Article ID 89186, Pages 1-12, DOI 10.1155/ASP/2006/89186

A New Pipelined Systolic Array-Based Architecture for Matrix Inversion in FPGAs with Kalman Filter Case Study

Abbas Bigdeli, Morteza Biglari-Abhari, Zoran Salcic, and Yat Tin Lai

Department of Electrical and Computer Engineering, The University of Auckland, Private Bag 92019, Auckland, New Zealand

Received 11 November 2004; Revised 20 June 2005; Accepted 12 July 2005

A new pipelined systolic array-based (PSA) architecture for matrix inversion is proposed. The PSA architecture is suitable for FPGA implementations as it efficiently uses the available resources of an FPGA. It is scalable for different matrix sizes and as such allows parameterisation that makes it suitable for customisation to application-specific needs. This new architecture has the advantage of O(n) processing-element complexity, compared to the O(n^2) of other systolic array structures, where the size of the input matrix is given by n x n. The use of the PSA architecture for a Kalman filter, which requires different structures for different numbers of states, is illustrated as an implementation example. The resulting precision error is analysed and shown to be negligible.

Copyright © 2006 Hindawi Publishing Corporation. All rights reserved.

1. INTRODUCTION

Many DSP algorithms, such as the Kalman filter, involve several iterative matrix operations, the most complicated being matrix inversion, which requires O(n^3) computations (n is the matrix size). This becomes the critical bottleneck of the processing time in such algorithms.

With their inherent parallelism and pipelining, systolic arrays have been used for the implementation of recurrent algorithms, such as matrix inversion. The lattice arrangement of the basic processing units in a systolic array is suitable for executing regular matrix-type computation. Historically, systolic arrays have been widely used in VLSI implementations when inherent parallelism exists in the algorithm [1].

In recent years, FPGAs have improved considerably in speed, density, and functionality, which makes them ideal for system-on-a-programmable-chip (SOPC) designs for a wide range of applications [2]. In this paper we demonstrate how FPGAs can be used efficiently to implement systolic arrays as an underlying architecture for matrix inversion and for the implementation of the Kalman filter.

The main contributions of this paper are the following.

(1) A new pipelined systolic array (PSA) architecture suitable for matrix inversion and FPGA implementation, which is scalable and parameterisable so that it can be easily used for new applications.

(2) A new efficient approach for hardware-implemented division in FPGAs, which is required in matrix inversion.

(3) A Kalman filter implementation, which demonstrates the advantages of the PSA.

The paper is organised as follows. In Section 2, the Schur complement for the matrix inversion operation is described and a generic systolic array structure for its implementation is shown. Then a new design of a modified array structure, called the PSA, is proposed. In Section 3, the performance of two approaches to scalar division, direct division by a divider and approximated division by a lookup table (LUT) and multiplier, is compared, and an efficient LUT-based scheme with minimum round-off error and resource consumption is proposed. In Section 4, the PSA implementation is described. In Section 5, the system performance and results verification are presented in detail; a benchmark comparison and the design limitations are discussed to show the advantages as well as the limitations of the proposed design. In Section 6, a Kalman filter implementation using the proposed PSA structure is presented. Section 7 presents concluding remarks.

2. MATRIX INVERSION

Hardware implementation of matrix inversion has been discussed in many papers [3]. In this section, a systolic-array-based inversion is introduced to target more efficient implementation in FPGAs.

2.1. Schur complement in the Faddeev algorithm

For a compound matrix M in the Faddeev algorithm [4],

    M = [  A  B ]
        [ -C  D ],                                   (1)

where A, B, C, and D are matrices with sizes of (n x n), (n x l), (m x n), and (m x l), respectively, the Schur complement, D + C A^-1 B, can be calculated provided that matrix A is nonsingular [4].

First, a row operation is performed to multiply the top row by another matrix W and then to add the result to the bottom row:

    M = [  A        B      ]
        [ -C + WA   D + WB ].                        (2)

When the lower left-hand quadrant of matrix M is nullified, the Schur complement appears in the lower right-hand quadrant. Therefore, W behaves as a decomposition operator and should be equal to

    W = C A^-1                                       (3)

such that

    D + WB = D + C A^-1 B.                           (4)

By properly substituting matrices A, B, C, and D, a matrix operation or a combination of operations can be executed via the Schur complement, for example, as follows.

(i) Multiply and add:

    D + C A^-1 B = D + CB                            (5)

if A = I.

(ii) Matrix inversion:

    D + C A^-1 B = A^-1                              (6)

if B = C = I and D = 0.

2.2. Systolic array for Schur complement implementation

The Schur complement is evaluated through a process of matrix triangulation and annulment [5]. Systolic arrays, because of their regular lattice structure and parallelism, are a good platform for the implementation of the Schur complement. Different systolic array structures that compute the Schur complement are presented in the literature [3, 6-8]. However, when choosing an array structure one must take into account the design efficiency, structure regularity, modularity, and communication topology [9].

The array structure presented in [6] is taken as the starting point for our approach. It consists of only two types of cells, the boundary and internal cells, whereas the structure in [3] needs three types of cells. The cell arrangement in the chosen structure is two-dimensional, while the cells in [7] are connected in three-dimensional space with much higher complexity.

The other consideration when choosing the target structure was the type of operations in the cells. In the preferred structure [6], all the computations executed in the cells are linear, while [8] would require operations such as square and square-root calculations.

A cell is a basic processing unit that accepts the input data and computes the outputs according to the specified control signal. Both the boundary and internal cells have two different operating modes that determine the computation algorithms employed inside the cells. Mode 1 executes matrix triangulation and mode 2 performs annulment. The operating mode of a cell depends on the comparison result between the input data and the register content in the cell. The cell operations are described in Figure 1.

Figure 1: Operations of boundary cell and internal cell.

To create a systolic array for Schur complement evaluation, E = D + C A^-1 B, cells are placed in a pattern of an inverse trapezium shown in Figure 2. The systolic array size is controlled by the size of the output matrix E, which is a square matrix in the case of matrix inversion. The number of cells in the top row is twice the size of E and the number of internal cells


in the bottom row is the same as the size of E. The number of boundary cells and layers is equal to the size of matrix E.

Figure 2: Cells layout in systolic array for different output matrix sizes.

Inputs are packed in a skewed sequence entering the top of the systolic array. Outputs are produced from the bottom row. Data and control signals are transferred inside the array structure from left to right and from top to bottom in each layer through the interconnections. Dataflow is synchronous to a global clock, and data can only be transferred to a cell in a fixed clock period. For example, to invert a 2 x 2 matrix with the Schur complement, let E be

    E = D + C A^-1 B,

    [ e11 e12 ]   [ d11 d12 ]   [ c11 c12 ] [ a11 a12 ]^-1 [ b11 b12 ]
    [ e21 e22 ] = [ d21 d22 ] + [ c21 c22 ] [ a21 a22 ]    [ b21 b22 ].   (7)

Then the matrices are fed into the systolic array in columns. A and B require mode 1 cell operation, while C and D are computed in mode 2. The result can be obtained from the bottom row in a skewed form that corresponds to the input sequence. Figure 3 gives an illustration.
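The special cases (5) and (6) are easy to check numerically. The following small sketch is ours and uses floating point rather than the array's fixed-point cells:

```python
import numpy as np

def schur_complement(A, B, C, D):
    """E = D + C A^{-1} B, the quantity the systolic array evaluates."""
    return D + C @ np.linalg.inv(A) @ B

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
I = np.eye(2)

# (6): with B = C = I and D = 0, the Schur complement is A^{-1}.
E_inv = schur_complement(A, I, I, np.zeros((2, 2)))

# (5): with A = I, it degenerates to the multiply-add D + CB.
B = np.array([[1.0, 2.0], [0.0, 1.0]])
C = np.array([[2.0, 0.0], [1.0, 1.0]])
D = np.ones((2, 2))
E_mac = schur_complement(I, B, C, D)
```

Substituting the identity and zero blocks this way is exactly how the array is repurposed for inversion versus multiply-and-add.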

Figure 3: Dataflow in systolic array of 2 x 2 matrix size.

2.3. Modifying systolic array structure

A new systolic array can be constituted from other array structures to achieve certain specifications with the following four techniques [6].

(i) Off-the-peg maps the algorithm onto an existing systolic array directly. Data is preprocessed but the array design is preserved. However, data may be manipulated to ensure that the algorithm works correctly under the array structure.

(ii) Cut-to-fit customises an existing systolic array to adjust for special data structures or to achieve specific system performance. In this case, data is preserved but the array structure is modified.

(iii) Ensemble merges several existing systolic arrays into a new structure to execute one algorithm only. Both data and array structures are preserved, with dataflow transferring between arrays.

(iv) Layer is similar to the ensemble technique. Several existing systolic arrays are joined to form a new array, which switches its operation modes depending on the data. Only part of the new array is utilised at one time.

In order to overcome the growth of the basic systolic array presented in Section 2.2 with the size of the input matrices, a modified PSA is proposed in this section.


Figure 4: PSA dataflow in 3D visualization form.


Figure 5: Demonstration of feedback dataflow.

When comparing two consecutive layers in the basic array from Figure 2, it can be noted that the cell arrangement is identical except that the lower layer has one internal cell fewer than its immediate upper layer. This leads to the conclusion that the topmost layer is the only one that has the processing capabilities of all other layers and could be reused to perform the function of any other layer, given the appropriate input data into each cell. In other words, the topmost-layer processing elements can be reused (shared) to implement the functionality of any layer (logical layer) at different times. Obviously, for this to be possible, the intermediate results of the calculation from logical layers have to be stored in temporary memories and made available for the subsequent calculation. The sharing of the processing elements of the topmost layer is achieved by transmitting the output data back to the same layer through feedback paths and pipeline registers. The dataflow graph of the PSA is shown in Figure 4.

In the PSA, the regular lattice structure of the basic systolic array is simplified to include only the first (topmost, physical) layer. Referring to Figure 4, data first enters the single cell row and the outputs are passed to the registers in the same column. These registers, which store the temporary results, are connected in series and also provide feedback paths. The end of the register column connects to the input ports of the cell in the adjacent column, and the feedback data becomes the input data of the adjacent cell. The corresponding dataflow paths in the two different array structures are shown in Figure 5, highlighted in bold arrows. The data originally passing through the basic systolic array re-enters the same single processing layer four times during three recursions.

In order to implement the PSA structure for an n x n matrix, the required numbers of elements are

(i) the number of boundary cells C_bc = 1,
(ii) the number of internal cells C_ic = 2n - 1,
(iii) the number of layers in a column of the register bank R_L = 2(n - 1),
(iv) the total number of registers R_tot = 2(n - 1)(2n - 1).

The exact structure of the PSA for the example from Figure 5 is presented in Figure 6. As can be seen, when the input

Figure 7: Logic resource usage comparison between the PSA and the basic systolic array.
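The element counts enumerated above, and the O(n) growth visible in Figure 7, can be tabulated directly with a small helper of our own:

```python
def psa_resources(n):
    """Element counts for an n x n input matrix PSA, as enumerated above."""
    C_bc = 1                             # boundary cells
    C_ic = 2 * n - 1                     # internal cells
    R_L = 2 * (n - 1)                    # register-bank layers per column
    R_tot = 2 * (n - 1) * (2 * n - 1)    # total pipeline registers
    return C_bc, C_ic, R_L, R_tot

# Cell count grows linearly in n; only the register count grows as O(n^2),
# and registers are far cheaper than cells, which is the saving in Figure 7.
table = {n: psa_resources(n) for n in range(2, 8)}
```

Sweeping n from 2 to 7 reproduces the matrix-size axis used in the resource comparison of Figure 7.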

As the input matrix size increases, the number of cells required to build the PSA grows as O(n), which is much smaller than the O(n²) growth of other systolic array structures. The price paid is the number of additional registers used to store intermediate results. However, as the complexity of registers is much lower than that of systolic array cells, substantial savings in the implementation can be achieved, as illustrated in Figure 7 for different matrix sizes. Resource utilisation is expressed as the number of logic elements of the FPGA device used for implementation.

[Figure content: the modified structure with data in, data out, one boundary cell, internal cells, and register columns.]

Figure 6: Modifying the systolic array into the PSA structure.

3. DIVISION IN HARDWARE

3.1. Division with multiplication

Scalar division represents the most critical arithmetic operation within a processing element in terms of both resource utilisation and propagation delay. This is particularly typical for FPGAs, where a large number of logic elements are typically used to implement division. For an efficient implementation of division that still satisfies the accuracy requirements, an approach using an LUT and an additional multiplier has been proposed and implemented.

Noting that the numerical result of "a divided by b" is the same as "a multiplied by 1/b," the FPGA built-in multiplier can be used to calculate the division if an LUT of all possible values of 1/b is available in advance.

FPGA devices provide only a limited amount of memory which can be used for LUTs. Since 1 and b can be considered integers, the values of 1/b fall on a decreasing hyperbolic curve; as b grows, the difference between two consecutive values of 1/b decreases dramatically. To reduce the size of the LUT, the inverse-value curve can be segmented into several sections with different mapping ratios. This is achieved by storing one inverse value, the median of the group, in the LUT to represent the results of 1/b for a group of consecutive values of b. This process is illustrated in Figure 8. The larger the mapping ratio, the smaller the amount of memory needed for the LUT. Obviously, such segmentation induces a precision error, and the way the inverse curve is segmented is important because it directly affects the result accuracy. A further reduction in memory size is achieved by storing only positive values in the LUT; the sign of the division result can then be evaluated by an XOR gate.

On an Altera APEX device, when combining the LUT and multiplier into a single division module, a 16-bit by 26-bit multiplier consumes 838 logic elements (LEs), operates at a 25 MHz clock frequency, and requires a total of 53 248 memory bits on the specific target FPGA device. The overall speed improvement achieved by using the DLM method is 3.5 times when compared to using a traditional divider. Because of the extra hardware required for efficiently addressing the LUT, the improvement in terms of LEs is rather modest: the hardware-based divider supplied by Altera, configured as 16 bit by 26 bit, consumes 1 123 LEs when synthesised for the same APEX device.

3.2. Optimum segmentation scheme

Since b is a 16-bit number (used in 1.15 format), there are (2¹⁵ − 1) = 32 767 different values of 1/b. The performance of various linear and nonlinear segmentation approaches was evaluated with respect to precision error and resource consumption.
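The LUT-with-multiplier scheme of Section 3.1, including the grouped median reciprocals and the sign-by-XOR trick, can be modelled in a few lines. This is a floating-point behavioural sketch, not the 16 × 26-bit hardware; `build_lut`, `divide`, and the tiny two-segment example are our own names and numbers:

```python
# Behavioural model of LUT-plus-multiplier division (illustration only).
# Each segment maps `ratio` consecutive values of b to the reciprocal of
# the bucket's median, trading precision for LUT memory.

def build_lut(segments):
    """segments: list of (b_start, b_end, ratio) tuples."""
    lut = {}
    for b_start, b_end, ratio in segments:
        for base in range(b_start, b_end + 1, ratio):
            median = base + ratio // 2
            for b in range(base, min(base + ratio, b_end + 1)):
                lut[b] = 1.0 / median
    return lut

def divide(a, b, lut):
    # a / b computed as a * LUT[1/|b|]; the sign comes from an XOR of signs,
    # mirroring the hardware's sign-by-XOR evaluation.
    sign = -1 if (a < 0) != (b < 0) else 1
    return sign * abs(a) * lut[abs(b)]

# Two segments: exact reciprocals for b = 1..15, then 1:4 buckets to 63.
lut = build_lut([(1, 15, 1), (16, 63, 4)])
print(divide(100, 5, lut))   # exact segment: approx 20
print(divide(100, 33, lut))  # bucketed segment: approx 100/34
```

The coarser the ratio, the fewer entries the table needs, which is exactly the memory/precision trade-off described above.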

[Figure content: the reciprocal curve 1/b versus b divided into segments 1–3 with small, moderate, and large mapping ratios.]

Figure 8: A simple demonstration of segments with different mapping ratios.

Table 1: The optimum segmentation scheme.

Values of b        Mapping ratio
1–511              1 : 1
512–1 023          1 : 2
1 024–2 047        1 : 4
2 048–4 095        1 : 8
4 096–8 191        1 : 16
8 192–16 383       1 : 32
16 384–32 767      1 : 64

The absolute error is calculated by subtracting the LUT output from the true value of the inverse 1/b. The average error is the mean of the absolute error over the 32 767 data. Since the value of 1/b retrieved from the LUT is later multiplied by a in order to generate the division result, any precision error in the LUT is eventually magnified by the multiplier. Therefore, the worst-case error is more critical than the average precision error. The worst-case error is calculated as follows: worst-case error of 1/b_k = absolute error of (1/b_k) × b_{k−1}.

The error analysis was performed to investigate both the average absolute error and the worst case. As a result of this analysis, an optimum segmentation scheme, tabulated in Table 1, was determined. It provides the minimum precision required of a typical hardware-implemented matrix inversion operation. This was verified by means of simulation using the Matlab DSP blockset for a number of applications. The resulting LUT holds 4 096 inverse values with a 26-bit word length in 16.10 data format.

4. PIPELINED SYSTOLIC ARRAY IMPLEMENTATION

The implementation block diagram of the PSA structure is shown in Figure 9. The datapath architecture is illustrated in Figure 10. The interfacing of the control unit with the other internal and external cells is shown in Figure 11.

4.1. Control unit

The control unit is a timing module responsible for generating the control signals at specific time instances. It is synchronous to the system clock. Counters are the main components in the control unit. The I/O data of the control unit are listed below.

Inputs

(i) 1-bit system clock clk: for synchronisation; the basic unit in the timing circuitry.
(ii) 1-bit reset signal reset: resets the control unit operation; counters are reset to their initial values and restart their counting sequences.

Outputs

(i) 1-bit cell operation signal mode: selects the cell operation mode, "1" for mode 1 and "0" for mode 2.
(ii) 1-bit register clear signal clear: activates the content-clear function in the cell internal registers, "1" for enable and "0" for disable.
(iii) 1-bit multiplexer select signal sel: controls the input data source selection in the datapath multiplexers, "1" for input from the matrix and "0" for input from the feedback path.

Since the modules in the PSA are arranged in a systolic structure and connected synchronously, the control signals required to operate these modules must also be generated in regular timing patterns. Figure 12 shows the required control signals for operating PSAs of different sizes.

5. DESIGN PERFORMANCE AND RESULTS

5.1. Resource consumption and timing restrictions

Compared to other systolic arrays in the literature, small logic resource consumption is the main advantage of the proposed PSA structure. For example, for inverting an n × n matrix, the PSA instantiates 2n cells, while the systolic array in Figure 2 requires (n² + Σ_{k=1}^{2n−1} k) cells.

Because of the feedback paths in the design and the single-cell-layer structure of the PSA, the number of processing elements required for implementation is reduced, and the hardware complexity therefore changes from O(n²) to O(n).

A generic PSA has a customisable size and a configurable structure. The final size of the PSA can be estimated by adding the resource consumption of each building block or module.
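The error metrics of Section 3.2 can be reproduced for the Table 1 scheme with a short script. This is a double-precision behavioural model, not the fixed-point hardware, and the bucket-median placement is our reading of the segmentation description:

```python
# Average and worst-case reciprocal-LUT error for the Table 1 scheme.
SCHEME = [  # (b_start, b_end, mapping ratio)
    (1, 511, 1), (512, 1023, 2), (1024, 2047, 4), (2048, 4095, 8),
    (4096, 8191, 16), (8192, 16383, 32), (16384, 32767, 64),
]

def lut_value(b):
    """Median reciprocal of the bucket containing b."""
    for start, end, ratio in SCHEME:
        if start <= b <= end:
            base = start + ((b - start) // ratio) * ratio
            return 1.0 / (base + ratio // 2)
    raise ValueError(b)

errors = [abs(lut_value(b) - 1.0 / b) for b in range(1, 32768)]
avg_error = sum(errors) / len(errors)
# The worst case magnifies each absolute error by roughly b, reflecting the
# later multiplication a * (1/b) described in the text.
worst = max(err * b for b, err in zip(range(1, 32768), errors))
print(avg_error, worst)
```

In the 1 : 1 segment the stored value is exact, so all of the LUT error comes from the coarser buckets, which is why the worst case rather than the average drives the scheme selection.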

[Figure content: inputs x0–x3 enter a row of one boundary cell and three internal cells; the control unit drives the control signals; the data path of registers and multiplexers produces outputs y0 and y1.]

Figure 9: The PSA structure block diagram.

[Figure content: an input-select multiplexer, driven by sel from the control unit, chooses between new data from the input matrix and feedback data from the pipeline structure; cell outputs pass through a chain of registers (the pipeline structure) whose end forms the feedback path into the adjacent cell.]

Figure 10: Data-path architecture.

[Figure content: from the system clock and reset inputs, the control unit distributes the mode, clear, and sel signals, through one-clock D-FF delays, to the boundary-cell and internal-cell datapaths, multiplexers, and registers.]

Figure 11: Control unit interfacing with other modules in the PSA.

[Figure content: clk, mode, clear, and sel waveforms for PSA sizes n = 2, 3, and 4.]

Figure 12: Timing diagram of control signals for different PSA sizes.

As an example,

PSA size = size(boundary cell) + size(internal cells) + size(data path) + size(control unit)
         = (976) + (495 I) + (16R + 16M) + (131 + 3D) [LEs],   (8)

where I, R, M, and D represent the number of internal cells, 16-bit pipelining registers, 16-bit input-select multiplexers, and 3-bit signal-delay D-FFs, respectively. It should be noted that the actual size of the synthesised PSA on an FPGA device will be affected by the architecture and routing resources of the FPGA.

The processing time for the n × n matrix inversion in the PSA is 2(n² − 1) clock cycles, at a maximum clock frequency of 16.5 MHz for n < 10 in our implementation (Altera APEX EP20K200EFC484-2). When a larger PSA is synthesised, the maximum clock frequency decreases as the critical path extends.

5.2. Comparisons with other implementations

The PSA performance has been compared with some other matrix inversion structures based on systolic arrays in terms of the number of processing elements (or cells), number of cell types, logic element consumption, maximum clock frequency, and design flexibility.

For an n × n matrix inversion, the PSA requires 2n cells, while n(3n + 1)/2 cells are used in the systolic array based on the Gauss-Jordan elimination algorithm [10]. In the PSA, cells are classified as either boundary or internal cells, while the processing elements in the matrix inversion array structure in [5] are divided into three different functional groups.

When working with a 4 × 4 matrix, it takes 4 784 LEs to implement the PSA on an Altera APEX device, while 8 610 LEs are used for the same task in the matrix-based systolic algorithm engineering (MBSAE) Kalman filter [11].
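Equation (8) can be turned into a quick size estimator. The constants come from the equation itself; the values of M and D for a given n are not spelled out in the text, so the ones used in the example call below are placeholders to be replaced with the actual design parameters:

```python
def psa_size_les(I, R, M, D):
    """LE estimate per equation (8): boundary cell (976 LEs), internal
    cells (495 LEs each), data path (16 LEs per 16-bit register and per
    16-bit multiplexer), control unit (131 + 3 LEs per 3-bit delay D-FF)."""
    return 976 + 495 * I + (16 * R + 16 * M) + (131 + 3 * D)

# n = 4 example: I = 2n - 1 = 7 internal cells, R = 2(n-1)(2n-1) = 42
# registers; M = 8 and D = 4 are placeholders, not derived in the text.
print(psa_size_les(I=7, R=42, M=8, D=4))
```

As the text notes, the synthesised size will differ from this estimate because of the FPGA's routing and architectural constraints.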

[Figure content: matrices A (a11 ... a22), B, C, and D are skewed and packed row by row into the generic PSA on FPGA, which computes the Schur complement E = D + CA⁻¹B; the skewed outputs e11 ... e22 are unpacked back into regular matrix form.]

Figure 13: Procedures for input data packing and output data unpacking.

When synthesised on an Altera APEX device (EP20K200EFC484-2), the PSA allows a maximum throughput of 16 MHz, compared to only 2 MHz in the systolic-array-based design reported in [11] and 10 MHz in the geometric arithmetic parallel processor (GAPP) in [12]. The PSA is designed to be customisable and parameterisable, whereas the other systolic arrays in the literature are all fixed-size structures.

5.3. Limitations

In our design, several built-in modules from the vendor library were used for basic dataflow control and arithmetic calculations. Therefore, the results reported in this paper are valid only for specific FPGA devices. However, as the libraries provided by other FPGA vendors have equivalent functionalities readily available, the proposed design can easily be modified and ported to other FPGA device families.

One disadvantage of the PSA design is that input data has to be in skewed form before entering the array. When the PSA interfaces with other processors, a data-wrapping preprocessing stage may be required to pack the data into the specific skewed form shown in Figure 13. Output data from the PSA are unpacked to rearrange the results back into regular matrix form.

5.4. Effects of the finite word length

The finite word length performance of the PSA structure was analysed. All quantities in the structure are represented using fixed-point numbers. It should be noted that only multiplication, and division (which is itself computed by multiplication), introduce round-off error [13]; addition and subtraction do not produce any round-off noise. The approach used here was to follow the arithmetic operations in the update equations of the different variables and keep track of the errors which arise due to finite-precision quantisation. As described earlier in the paper, all multiplication operations are performed on 26-bit data. Computation results, as well as the data in the LUT, are 26 bits long. To a large extent, this eliminates the possibility of overflow occurring with matrices of small size, regardless of the actual data values. Simulation shows that the inverse of a matrix of size up to 10 × 10, with data represented with 26 bits, which is sufficient for most practical applications, can be computed with minimal error. Obviously, as the size of the matrix increases, the error also increases. However, as the proposed design is fully parameterised, the word length used in the computation can be increased accordingly, at the cost of higher FPGA resource usage.

6. KALMAN FILTER IMPLEMENTED USING PSA

6.1. Kalman filter

Since its introduction in the early 60s [14], the Kalman filter has been used in a wide range of applications; it falls in the category of recursive least squares (RLS) filters. As a powerful linear estimator for dynamic systems, the Kalman filter invokes the concept of state space [15]. The main feature of the state-space concept is that it allows Kalman filters to compute a new state estimate from the previous state estimate and new input data [16]. Kalman filter algorithms consist of six equations in a recursive loop; results are continuously calculated step by step. To derive the Kalman filter equations, a mathematical model is built to describe the dynamics and the measurement system in the form of linear equations (9) and (10).

(i) Process equation:

x(n + 1) = A x(n) + w(n).   (9)

(ii) Measurement equation:

s(n) = B x(n) + v(n),   (10)

where x(n) is the state at time instance n, s(n) is the measurement at time instance n, A is the processing matrix, B is the measurement matrix, w(n) is the system processing noise, and v(n) is the measurement noise. In (9), A describes the plant and the changes of the state vector x(n) over time, while w(n) is a plant disturbance vector of zero-mean Gaussian white noise. In (10), B linearly relates the system states to the measurements, where v(n) is a measurement noise vector of zero-mean Gaussian white noise.

The Kalman filter equations can be grouped into two basic operations: prediction and filtering. Prediction, sometimes referred to as time update, estimates the new state and its uncertainty. An estimated state vector is denoted x̂(n). When an estimate of x(n) is computed before the current measurement data s(n) become available, the estimate is classified as an a priori estimate and denoted x̂⁻(n); when the estimate is made after the measurement s(n) arrives, it is called an a posteriori estimate [16]. Filtering, usually referred to as measurement update, on the other hand, corrects the previous estimate upon the arrival of new measurement data. The prediction error can be computed as the difference between the actual measurement and the estimated value; it is used to immediately refine the parameters of the prediction algorithm in order to generate a more accurate estimate in the future. The full set of Kalman filter equations can be found in [17].

It is evident from the Kalman filter equations that the algorithm comprises a set of matrix operations: matrix addition, subtraction, multiplication, and inversion. Among these, matrix inversion is the most computationally expensive and is thus the bottleneck in the processing time of the algorithm: the overall system processing time mainly depends on the matrix inversion speed [10]. In Section 2, a new implementation of matrix inversion, which is in fact the "heart" of the Kalman filter, was presented. Hardware implementation of another critical operation, division, was presented in Section 3.

Table 2: Matrix substitutions for the Kalman filter algorithm (Schur complement E = D + CA⁻¹B).

Step 1: A = I, B = x̂(n−1 | n−1), C = A, D = 0  →  x̂⁻(n | n−1)
Step 2: A = I, B = P(n−1 | n−1), C = A, D = 0  →  AP(n−1 | n−1)
Step 3: A = I, B = Aᵀ, C = AP(n−1 | n−1), D = Q(n−1)  →  P⁻(n | n−1)
Step 4: A = I, B = Bᵀ, C = P⁻(n | n−1), D = 0  →  P⁻(n | n−1)Bᵀ
Step 5: A = I, B = P⁻(n | n−1)Bᵀ, C = B, D = R(n)  →  BP⁻(n | n−1)Bᵀ + R(n)
Step 6: A = BP⁻(n | n−1)Bᵀ + R(n), B = I, C = P⁻(n | n−1)Bᵀ, D = 0  →  K(n)
Step 7: A = I, B = [P⁻(n | n−1)Bᵀ]ᵀ, C = −K(n), D = P⁻(n | n−1)  →  P(n | n)
Step 8: A = I, B = x̂⁻(n | n−1), C = −B, D = s(n)  →  s(n) − Bx̂⁻(n | n−1)
Step 9: A = I, B = s(n) − Bx̂⁻(n | n−1), C = K(n), D = x̂⁻(n | n−1)  →  x̂(n | n)
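Table 2 maps every Kalman step onto the single primitive the PSA evaluates, E = D + CA⁻¹B. The scalar-state sketch below (pure Python, our own naming; the real PSA works on skewed matrices in fixed point) shows that chaining the nine substitutions reproduces a standard Kalman recursion:

```python
# One Kalman recursion for a scalar state, expressed entirely as Schur
# complements E = D + C * (1/A) * B (illustrative sketch of Table 2).

def schur(A, B, C, D):
    return D + C * B / A

def kalman_step(x, P, A, B, Q, R, s):
    x_pred = schur(1, x, A, 0)          # 1: x-(n|n-1) = A x
    AP     = schur(1, P, A, 0)          # 2: A P
    P_pred = schur(1, A, AP, Q)         # 3: P- = A P A^T + Q (A^T = A, scalar)
    PBt    = schur(1, B, P_pred, 0)     # 4: P- B^T
    S      = schur(1, PBt, B, R)        # 5: B P- B^T + R
    K      = schur(S, 1, PBt, 0)        # 6: K = P- B^T / S
    P_new  = schur(1, PBt, -K, P_pred)  # 7: P(n|n) = P- - K B P-
    innov  = schur(1, x_pred, -B, s)    # 8: s - B x-
    x_new  = schur(1, innov, K, x_pred) # 9: x(n|n) = x- + K innov
    return x_new, P_new

# Static scalar system A = B = 1, Q = 0, R = 1, starting from x = 0, P = 1,
# with measurement s = 2.
print(kalman_step(0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 2.0))  # -> (1.0, 0.5)
```

Every step is a single call to `schur`, which is why a hardware block computing only the Schur complement can execute the whole filter by resequencing its inputs.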

6.2. Kalman filter in PSA-based structure

As a case study to verify the performance of the proposed PSA, a Kalman-filter-based echo cancellation application was implemented. By appropriate substitutions of the matrices A, B, C, and D (Table 2), the matrix-form Kalman filter equations can be computed by the PSA in 9 steps. A complete execution of the 9 steps produces the state estimates for the next time instance and constitutes one recursion of the Kalman filter algorithm.

The components of the four input matrices are queued in a skewed package entering the PSA cells row by row. It can be noted from Table 2 that some Schur complement results are used as input data in later steps. Thus, extra registers are required to store the intermediate results. To ensure that the intermediate results are reloaded into the appropriate cells at the correct time instances, a new data path and control unit is created. In the existing PSA structure, data in A and C are aligned in the same column entering the cells in the left-half group, while B and D are in another column toward the right-half cell group. Along the feedback paths, the result E = D + CA⁻¹B is connected back to the same columns as A and C, as shown in Figure 14. In this case, the intermediate result cannot be used as input data for B and D. Therefore, a new data path with an input multiplexer is added to allow E to pass to the cells in the right-half group. A control unit is required to switch the multiplexer input sources between the intermediate result E and new data from B and D. The modified design is highlighted with thick lines in Figure 15.

The results obtained from the echo cancellation application using the PSA-based Kalman filter closely match the

theoretical values. The small residual error observed in the resulting data is attributed to the finite word length effect typical of the fixed-point structure of the proposed design.

[Figure content: inputs D, C, B, A feed the left-half and right-half cell groups, with the result E fed back to the left-half columns.]

Figure 14: The original data paths of the PSA.

[Figure content: the modified design adds a multiplexer and control unit so that E can also be routed to the right-half group.]

Figure 15: The new data paths of a PSA-based Kalman filter.

6.3. Comparison with other implementations

There are several hardware implementations of the Kalman filter in the literature. For a 4-state Kalman filter, all the Kalman filter equations can be expressed as 30 scalar equations. Similar to the PSA, direct operation of matrix inversion is also avoided in the matrix decomposition method (MDM), where the Kalman gain calculation turns into a set of 4 scalar equations with scalar division and addition. With the high processing speed of 169.4 nanoseconds reported in [18], the MDM seems to have a speed advantage over the PSA (280 nanoseconds) for the same target APEX device. However, the PSA structure still enjoys the following advantages.

Flexibility

When the number of states in a Kalman filter changes, all the scalar equations in the MDM become invalid, as the matrix dimensions in the algorithm depend on the size of the state vector. Considerable design time is required to decompose the matrix-form equations again. In the PSA, however, a Kalman filter with a different number of states can be generated by modifying one parameter (the number of states, i.e., the matrix size) in the heading of the VHDL code. The PSA serves as an IP block for a generic Kalman filter in VHDL, while the MDM is a hard-wired implementation for a fixed Kalman filter.

Clock speed

The advantages and the conditions of using an LUT with a multiplier to perform scalar division have been discussed in Section 3.2. This approach enables the PSA to have a system clock frequency 3.5 times faster than using scalar dividers only.

Resource usage

In the MDM method, the scalar operations involve 32 additions/subtractions, 22 multiplications, and 4 divisions. The overall logic element usage of the PSA is 40% lower than that of an equivalent MDM-based design for a 4-state Kalman filter implementation.

7. CONCLUSIONS

In this paper, an optimised systolic-array-based matrix inversion for implementation in FPGA was proposed and used for rapid prototyping of a Kalman filter. Matrix inversion is the computational bottleneck and the most complex operation in Kalman filtering. The PSA matrix inversion results in a simple, yet fast, implementation of the operation. It is scalable to matrices of various sizes and is implemented as a parameterised design. This allows its direct customisation and instantiation for application-specific problems. Resource utilisation is low and depends linearly on the matrix size.

Modified from the Schur complement systolic array, the PSA simplifies the recursive matrix-form equations in Kalman filters to scalar operations and inherits the design advantages of parallelism and pipelining. In the proposed PSA design, a new approach to the implementation of scalar division has also been proposed, which speeds up the division operation 3.5 times over traditional dividers while using fewer logic elements and resources.

REFERENCES

[1] G. W. Irwin, "Parallel algorithms for control," Control Engineering Practice, vol. 1, no. 4, pp. 635–643, 1993.
[2] M. Ceschia, M. Bellato, A. Paccagnella, and A. Kaminski, "Ion beam testing of ALTERA APEX FPGAs," in Proceedings of IEEE Radiation Effects Data Workshop, pp. 45–50, Phoenix, Ariz, USA, July 2002.
[3] A. El-Amawy, "A systolic architecture for fast dense matrix inversion," IEEE Transactions on Computers, vol. 38, no. 3, pp. 449–455, 1989.
[4] A. K. Ghosh and P. Paparao, "Performance of modified Faddeev algorithm on optical processors," IEE Proceedings. J: Optoelectronics, vol. 139, no. 5, pp. 325–330, 1992.
[5] M. Zajc, R. Sernec, and J. Tasic, "An efficient linear algebra SoC design: implementation considerations," in Proceedings of 11th Mediterranean Electrotechnical Conference (MELECON '02), pp. 322–326, Cairo, Egypt, May 2002.
[6] F. M. F. Gaston and G. W. Irwin, "Systolic Kalman filtering: an overview," IEE Proceedings. D: Control Theory & Applications, vol. 137, no. 4, pp. 235–244, 1990.
[7] F. M. F. Gaston, D. W. Brown, and J. Kadlec, "A parallel predictive controller," in Proceedings of UKACC International Conference on Control, vol. 2, pp. 1070–1075, Exeter, UK, September 1996.
[8] A. El-Amawy and K. R. Dharmarajan, "Parallel VLSI algorithm for stable inversion of dense matrices," IEE Proceedings. E: Computers and Digital Techniques, vol. 136, no. 6, pp. 575–580, 1989.
[9] N. Faroughi and M. A. Shanblatt, "An improved systematic method for constructing systolic arrays from algorithms," in Proceedings of 24th ACM/IEEE Design Automation Conference (DAC '87), pp. 26–34, Miami Beach, Fla, USA, June–July 1987.
[10] S.-G. Chen, J.-C. Lee, and C.-C. Li, "Systolic implementation of Kalman filter," in Proceedings of IEEE Asia-Pacific Conference on Circuits and Systems (APCCAS '94), pp. 97–102, Taipei, Taiwan, December 1994.
[11] Z. Salcic and C.-R. Lee, "Scalar-based direct algorithm mapping FPLD implementation of a Kalman filter," IEEE Transactions on Aerospace and Electronic Systems, vol. 36, no. 3, part 1, pp. 879–888, 2000.
[12] D. Lawrie and P. Fleming, "Fine-grain parallel processing implementations of Kalman filter algorithms," in Proceedings of International Conference on Control, vol. 2, pp. 867–870, Edinburgh, Scotland, UK, March 1991.
[13] S. K. Mitra, Digital Signal Processing: A Computer-Based Approach, McGraw-Hill/Irwin, Boston, Mass, USA, 2nd edition, 2001.
[14] R. E. Kalman, "A new approach to linear filtering and prediction problems," Transaction of the ASME, Series D, Journal of Basic Engineering, vol. 82, pp. 35–45, March 1960.
[15] S. V. Vaseghi, Advanced Digital Signal Processing and Noise Reduction, John Wiley & Sons, New York, NY, USA, 2nd edition, 2000.
[16] E. W. Kamen and J. K. Su, Introduction to Optimal Estimation, Springer, London, UK, 1999.
[17] D. C. Swanson, Signal Processing for Intelligent Sensor Systems, Marcel Dekker, New York, NY, USA, 2000.
[18] C.-R. Lee, FPLD implementation and customisation in multiple target tracking applications, Ph.D. thesis, The University of Auckland, Auckland, New Zealand, 1998.

Abbas Bigdeli was born in Ahvaz, Iran, in 1973. He received a Bachelor in electronics engineering in 1995 from the Department of Electrical Engineering, Amir Kabir University of Technology, Tehran, Iran. He started his postgraduate studies at James Cook University, Australia, in 1996. He concluded his Ph.D. research in 2000 and moved to Auckland, New Zealand, to join the Faculty of Engineering at The University of Auckland. His current research interests are in the area of reconfigurable embedded and network processors, security solutions for wireless networks, hardware/software implementation of image and video processing, and design and fabrication of intelligent implantable medical devices. He has published over 35 scientific and technical papers in international journals and conferences. He has recently patented an invention on securing legacy 802.11 wireless LAN systems. He is the Program Leader for electronics projects at the Polymer Electronic Research Centre at The University of Auckland. He has been on the Executive Committee of the IEEE New Zealand North Section since 2001. He has been acting as a technical reviewer for several journals and conferences, including the Journal of Microprocessors and Microsystems, the Australian Journal of Research and Practice in Information Technology, and the FPL, IEEE VLSI, and EUSIPCO conferences.

Morteza Biglari-Abhari received the B.S. degree from Iran University of Science and Technology, the M.S. degree from Sharif University of Technology in Tehran, and the Ph.D. degree from The University of Adelaide in Australia. Currently, he is a Senior Lecturer in the Department of Electrical and Computer Engineering at The University of Auckland in New Zealand. His main research interests are computer architecture, multiprocessor system-on-chips, compiler optimisations, and hardware/software codesign for low-power embedded systems. He is also a Member of the Steering Committee of the Polymer Electronic Research Centre (PERC) at The University of Auckland. He has been Chair of the IEEE Computer Chapter (New Zealand North Section) since 2004 and a reviewer for technical journals and conferences such as the Journal of Microprocessors and Microsystems, EURASIP Journal on Applied Signal Processing, and the FPL, VLSI, ISSPA, and EUSIPCO conferences.

Zoran Salcic is a Professor of computer systems engineering at The University of Auckland, New Zealand. He holds the B.E., M.E., and Ph.D. degrees in electrical and computer engineering from the University of Sarajevo, received in 1972, 1974, and 1976, respectively. He did most of his Ph.D. research at the City University of New York in 1974 and 1975. He has been in academia since 1972, with the exception of the years 1985–1990, when he took posts in industry, leading a major industrial enterprise institute in the area of computer engineering. His expertise spans the whole range of disciplines within computer systems engineering: complex digital systems design, custom computing machines, reconfigurable systems, field programmable gate arrays, processor and computer systems architecture, embedded systems and their implementation, design automation tools for embedded systems, hardware/software codesign, new computing architectures, and models of computation for heterogeneous embedded systems and related areas. He has published more than 170 refereed journal and conference papers and numerous technical reports. He has supervised six Ph.D. and more than 40 M.E. thesis completions and has taken part in numerous Ph.D. and M.E. examinations. He is the Founding Editor-in-Chief of the new EURASIP Journal on Embedded Systems.

Yat Tin Lai received his B.E. and M.E. degrees in computer systems engineering from The University of Auckland in 2002 and 2004, respectively.
Hindawi Publishing Corporation EURASIP Journal on Applied Signal Processing Volume 2006, Article ID 96421, Pages 1–19 DOI 10.1155/ASP/2006/96421

Floating-to-Fixed-Point Conversion for Digital Signal Processors

Daniel Menard, Daniel Chillet, and Olivier Sentieys

R2D2 Team (IRISA), ENSSAT, University of Rennes I, 6 rue de Kerampont, 22300 Lannion, France

Received 1 October 2004; Revised 7 July 2005; Accepted 12 July 2005

Digital signal processing applications are specified with floating-point data types but are usually implemented in embedded systems with fixed-point arithmetic to minimise cost and power consumption. Thus, methodologies which automatically establish the fixed-point specification are required to reduce the application time-to-market. In this paper, a new methodology for floating-to-fixed-point conversion is proposed for software implementations. The aim of our approach is to determine the fixed-point specification which minimises the code execution time for a given accuracy constraint. Compared to previous methodologies, our approach takes the DSP architecture into account to optimise the fixed-point formats, and the floating-to-fixed-point conversion process is coupled with the code generation process. The fixed-point data types and the positions of the scaling operations are optimised to reduce the code execution time. To evaluate the fixed-point computation accuracy, an analytical approach is used, which reduces the optimisation time compared to the existing simulation-based methods. The methodology stages are described and several experimental results are presented to underline the efficiency of this approach.

Copyright © 2006 Daniel Menard et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

Most embedded systems integrate digital signal processing applications. These applications are usually designed with high-level description tools like CoCentric (Synopsys), Matlab/Simulink (Mathworks), or SPW (CoWare) to evaluate the application performances with floating-point simulations. Nevertheless, if digital signal processing algorithms are specified and designed with floating-point data types, they are finally implemented into fixed-point architectures to satisfy the cost and power consumption constraints associated with embedded systems. In fixed-point architectures, memory and bus widths are smaller, leading to a definitively lower cost and power consumption. Moreover, floating-point operators are more complex because they must process the exponent and the mantissa. Thus, floating-point operator area and latency are greater compared to fixed-point operators. In this context, the application specification must be converted into fixed-point.

The manual conversion process is a time-consuming and error-prone task which increases the development time. Some experiments [1] have shown that this manual conversion can represent up to 30% of the global implementation time. To reduce the application time-to-market, high-level development and code generation tools are needed. Thus, methodologies for automatic floating-to-fixed-point conversion are required to accelerate the development.

For digital signal processors (DSPs), the methodology aim is to define the optimised fixed-point specification which minimises the code execution time and leads to a sufficient accuracy. For this accuracy, the desired application performances must be reached. Existing methodologies [2, 3] achieve a floating-to-fixed-point transformation leading to an ANSI-C code with integer data types. Nevertheless, the data types supported by the DSP and the processor scaling capabilities are not taken into account to determine the fixed-point specification. The analysis of the architecture influence on the computation accuracy underlines the necessity of taking the DSP architecture into account to optimise the fixed-point specification [4]. Furthermore, the code generation and the conversion process must be coupled.

In this paper, a new methodology to implement floating-point algorithms in fixed-point processors under an accuracy constraint is presented. Compared to the existing methods, the processor architecture is taken into account and the floating-to-fixed-point conversion process is coupled with the code generation process. The fixed-point specification is optimised to reduce the code execution time as long as the application performances are reached. These optimisations are achieved through the location of the scaling operations and the selection of the data word-length according to the different data types supported by recent DSPs. The scaling operations are moved to reduce the code execution time.

This paper is organised as follows. The previous works on floating-to-fixed-point conversion are presented in Section 2. Our methodology is detailed in Section 3. For the different methodology stages, our approach is justified and the technique used to solve the problem is described. Finally, in Section 4, different experiments are presented underlining the efficiency of our approach.
2. RELATED WORKS

2.1. Floating-to-fixed-point conversion methodologies

In this section the different available methodologies for the automatic implementation of floating-point algorithms into fixed-point architectures are presented.

In [5], a methodology which implements floating-point algorithms into the TMS320C25/50 fixed-point DSPs (Texas Instruments) is proposed. The floating-to-fixed-point conversion is achieved after the code generation process. This methodology is specialised for this particular architecture and cannot be transposed to other architecture classes.

The two methodologies presented below achieve the floating-to-fixed-point transformation at the source code level. The FRIDGE [6] methodology, developed at the Aachen University, transforms the floating-point C source code into a C code with fixed-point data types. In the first step, called annotation, the user defines the fixed-point format of some variables which are critical in the system or for which the fixed-point specification is already known. Moreover, global annotations can be defined to specify some rules for the entire system (maximal data word-length, casting rules). The second step, called interpolation [6, 7], determines the application fixed-point specification. The fixed-point data formats are obtained from a set of propagation rules and the analysis of the program control flow. This description is simulated to verify that the accuracy constraints are fulfilled. The commercial tool CoCentric Fixed-point Designer proposed by Synopsys is based on this approach.

In [3] a method called the embedded approach is proposed to generate an ANSI-C code for a DSP compiler from the fixed-point specification. The data (source data), for which the fixed-point formats have been obtained with the technique presented previously, are specified with the available data types (target data) supported by the target processor. The degrees of freedom due to the source data position in the target data are used to minimise the scaling operations. This methodology produces a bit-true implementation of a fixed-point specification into a DSP. But accuracy and execution time are not optimised through the fixed-point format modification of some relevant variables.

The aim of the tool presented in [2, 8] is to transform a floating-point C source code into an ANSI-C code with integer data types. This code is independent of the targeted architecture. Moreover, a fixed-point format optimisation is done to minimise the number of scaling operations. Firstly, the floating-point data types are replaced by fixed-point data types and the scaling operations are included in the code. The scaling operations and the fixed-point data formats are determined from the dynamic range information obtained with a statistical method [9]. The reduction of the number of scaling operations is based on the assignation of a common format to several relevant data to minimise the scaling operation cost function. This cost function takes account of the number of occurrences of each scaling operation and depends on the processor scaling capabilities. For a processor with a barrel shifter, the cost of a scaling operation is set to one cycle; otherwise the number of cycles required for a shift of n bits is equal to n cycles.

This methodology achieves the floating-to-fixed-point conversion with the minimisation of the scaling operation cost. But the code execution time is not optimised under a global accuracy constraint. The accuracy constraint is only specified through the definition of a maximal acceptable accuracy degradation allowed for each data. The data types supported by the architecture are not taken into account to optimise the fixed-point data formats. Moreover, the architecture model used to minimise the number of scaling operations is not realistic. Indeed, for conventional DSPs including a barrel shifter and based on a MAC (multiply-accumulate) structure, the scaling operation execution time depends on the data location in the data path and is not always equal to one cycle. Furthermore, for processors with instruction-level parallelism (ILP) capabilities, the overhead due to scaling operations depends on the scheduling step and cannot be easily evaluated before the code generation process.

Compared to these methods, our approach optimises the data word-length to benefit from the different data types supported by recent DSPs. Moreover, the scaling operation location is optimised with a realistic model to evaluate the scaling operation execution time. The goal of these two optimisations is to minimise the code execution time as long as the accuracy constraint is fulfilled. In our methodology, the processor architecture is taken into account and the floating-to-fixed-point conversion process is coupled with the code generation process.
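The shift cost model summarised above can be sketched in C. The helper name `scaling_cost` is ours, not part of the tool in [2, 8]; it only illustrates the two cases of the model:

```c
#include <stdlib.h>

/* Cost model of [2, 8] as summarised above: a scaling operation
   (shift of n bits) costs one cycle on a processor with a barrel
   shifter, and n cycles otherwise.  Illustrative sketch only. */
int scaling_cost(int shift_bits, int has_barrel_shifter)
{
    int n = abs(shift_bits);
    if (n == 0)
        return 0;                      /* no scaling needed */
    return has_barrel_shifter ? 1 : n; /* 1 cycle vs. n cycles */
}
```

As the critique above points out, this model ignores where the data sits in the data path and the effect of instruction scheduling, which is why a more realistic model is used in our methodology.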
2.2. Fixed-point accuracy evaluation

Despite fixed-point computation, the application quality criteria must be verified. Thus, the computation accuracy due to fixed-point arithmetic is evaluated. Most of the available methodologies are based on a bit-true simulation of the fixed-point application [10–12]. Nevertheless, this technique suffers from a major drawback, which is the time required for the simulations [11]. The fixed-point mechanism emulation on a floating-point workstation increases the simulation time compared to a classical floating-point simulation. Moreover, a great number of samples is required to verify that the application quality criteria are respected. This drawback becomes a severe limitation when these methods are used in the process of fixed-point optimisation, where multiple simulations are needed to explore the design space [10]. For each evaluation of the fixed-point specification accuracy, a new simulation is required.

An alternative to the simulation-based method is the analytical approach. The verification that the fixed-point implementation respects the application quality criteria is achieved in two steps with the help of a single metric. The most commonly used metric to evaluate the computation accuracy is the signal-to-quantisation-noise ratio (SQNR) [10, 13, 14]. This metric defines the ratio between the desired signal power and the quantisation noise power. Thus, first of all, the minimal value of the computation accuracy (SQNR_min) is determined and then the fixed-point specification is optimised under this accuracy constraint. The accuracy constraint (SQNR_min) is determined according to the application performance constraints. The main advantage of the analytical approach is the execution time reduction of the fixed-point optimisation process. Indeed, the SQNR expression determination is done only once; then, the fixed-point system accuracy is evaluated through the computation of a mathematical expression.

In our methodology, an analytical approach is used to evaluate the computation accuracy. This approach [14] reduces significantly the execution time of the fixed-point optimisation process, compared to the simulation-based methods. This method is described with further details in Section 3.1.3.
3. FLOATING-TO-FIXED-POINT CONVERSION METHODOLOGY

The aim of the methodology presented in this paper is to implement automatically a floating-point application into a fixed-point DSP. Despite the computation error due to the fixed-point arithmetic, the different quality criteria (performances) associated with the application must be respected. For embedded systems, the cost and the power consumption must be minimised. Thus, the optimised fixed-point specification which minimises the code execution time and fulfils a given computation accuracy constraint must be determined. To optimise the implementation, the targeted architecture must be taken into account during the fixed-point conversion process.

3.1. Methodology flow

The methodology flow has been defined from the analysis of the architecture influence on the computation accuracy and from the study of the interaction between the fixed-point conversion process and the code generation process. The global methodology flow is presented in Figure 1. The tool is made up of two main blocks corresponding to the compilation infrastructure and to the floating-to-fixed-point conversion.

The compilation infrastructure front-end generates an intermediate representation from the floating-point C source code. The floating-to-fixed-point conversion process is applied on this intermediate representation. The assembly code is generated with the compilation infrastructure back-end from this fixed-point intermediate representation.

The first stage of the fixed-point conversion process corresponds to the data dynamic range evaluation. These results are used to determine the data binary-point position which avoids overflows. Then, the data word-lengths are determined to obtain a complete fixed-point specification. The data types which minimise the code execution time and respect the accuracy constraint are selected. Finally, the scaling operation locations are optimised to minimise the code execution time as long as the accuracy constraint is fulfilled. This conversion process is achieved under an accuracy constraint to obtain a fixed-point specification which satisfies the application performances. Thus, the computation accuracy must be evaluated and the accuracy constraint must be determined from the application performances.

3.1.1. Compilation infrastructure

The floating-point C source algorithm is transformed into an intermediate representation with the compiler front-end. This intermediate representation (IR) specifies the application with a control and data flow graph (CDFG). The tool uses the SUIF compiler front-end [15], and the CDFG is generated from SUIF's internal-representation abstract trees. This CDFG is made up of different control flow graphs (CFGs) and data flow graphs (DFGs). Each CFG represents one of the application control structures. These structures correspond to basic blocks, conditional structures, and repetitive structures. The core of conditional and repetitive structures is specified with a CFG. Each control structure block contains a specification of its input and output data. The basic block represents a set of sequential computations without control structure. The different computations of a basic block core which correspond to the signal processing part are represented with a data flow graph (DFG). The DFG includes the delay operations. To illustrate this intermediate representation, an FIR (finite impulse response) filter example is under consideration. The floating-point C source code is given in Algorithm 1 and the corresponding intermediate representation is presented in Figure 2.

The code generation is achieved with the flexible code generation tool CALIFE presented in [16] and the processor is described with the ARMOR language [17].
3.1.2. Fixed-point format

A fixed-point data is made up of an integer part and a fractional part as presented in Figure 3. The fixed-point format of a data is specified as (b, m, n), where b is the data word-length. The terms m and n are the binary-point positions referenced, respectively, from the most significant bit (MSB) and the least significant bit (LSB). In fixed-point arithmetic, m and n are fixed and lead to an implicit scale factor which stays constant during the processing.

A binary-point position is assigned to the inputs and output of each operation o_i (m_x, m_y, m_z) as presented in Figure 4. In the same way, a word-length (b_x, b_y, b_z) is assigned to each operand of the operation o_i. Let b_i = (b_x, b_y, b_z) and m_i = (m_x, m_y, m_z) be, respectively, the word-lengths and the binary-point positions associated with the operation o_i. For a CDFG made up of N_o operations, b = [b_1, b_2, ..., b_i, ..., b_No] and m = [m_1, m_2, ..., m_i, ..., m_No] are the vectors specifying, respectively, the word-length and the binary-point position of all CDFG operation operands.

Figure 1: Methodology flow for the floating-to-fixed-point conversion. The tool is made up of two main blocks corresponding to the compilation infrastructure and to the floating-to-fixed-point conversion.

    float h[32] = {-0.0297, ..., 0.897, 0.98, 0.897, ..., -0.0297};
    float x[32];
    float y, acc;

    float fir(float input)
    {
        int i;
        x[0] = input;
        acc = x[0] * h[0];
        for (i = 31; i > 0; i--) {
            acc = acc + x[i] * h[i];
            x[i] = x[i - 1];
        }
        y = acc;
        return y;
    }

Algorithm 1: Specification of the 32-tap FIR filter with the floating-point C source code.

Figure 2: The control and data flow graph equivalent to Algorithm 1, made up of a CFG for the FOR loop and the data flow graphs DFG 1 (basic block B.B. 1), DFG 2 (the loop core, B.B. 2), and DFG 3 (B.B. 3). The node z−1 corresponds to a delay operation.

3.1.3. Computation accuracy management

In our methodology, the SQNR metric is used to ensure that the fixed-point implementation verifies the application quality criteria. Thus, the accuracy constraint and the SQNR expression can be obtained from the application as explained in the following sections.

Figure 3: Fixed-point data specification: b, m, and n represent, respectively, the data word-length, the binary-point position referenced from the MSB (integer part), and the binary-point position referenced from the LSB (fractional part).

Figure 4: Binary-point position model for an operation. The binary-point positions for the operation inputs and output are specified by m'_x, m'_y, and m'_z.

Figure 5: Technique to determine the accuracy constraint. The global error due to the fixed-point conversion is modelled by a noise source (q_y).
Figure 6: Output quantisation noise model in a fixed-point system. The system output noise q_y is a weighted sum of the different noise sources q_gk.

Accuracy constraint determination

The accuracy constraint corresponding to the minimal value (SQNR_min) of the SQNR is determined according to the application performance constraints. This SQNR minimal value is obtained with a floating-point simulation of the application as presented in Figure 5. The error due to the fixed-point conversion is modelled by a noise source (q_y) located at the system output. The power of this noise source is increased as long as the application performance constraints are respected. The SQNR constraint is determined from the maximal value of the noise source power which ensures that the application performances are still reached.

Computation accuracy evaluation

To determine the SQNR expression, the main challenge corresponds to the computation of the system output quantisation noise power. In a fixed-point system, a quantisation noise q_gk is generated when some bits are eliminated during a cast operation. Each quantisation noise source q_gk is propagated inside the system and contributes to the output quantisation noise q_y through the gain α_k as presented in Figure 6. The goal of the analytical approach is to define the power expression of the output noise q_y according to the statistical parameters of the q_gk noise sources and the gains α_k between the output and the different noise sources.

For linear time-invariant systems, each α_k term is obtained from the transfer function between the system output and the q_gk noise source. The transfer functions are determined from the data flow graph [18] representing the application [14]. They are obtained from the Z transform of the recurrent equations representing the system. The recurrent equations are built by traversing the graph from the inputs to the output. This technique requires that the graph be acyclic. Thus, the DFG is transformed into several directed acyclic graphs (DAGs) when cycles are present, as in the case of recursive structures (in a recursive structure, the system output depends on the input samples and the previous output samples; in a nonrecursive structure, it depends only on the input samples).

In nonrecursive and nonlinear systems, each α_k term is obtained from the signals associated with each operation involved in the propagation of the q_gk noise source towards the output [19]. The α_k term expressions are built by traversing the acyclic graph from the inputs to the output. The statistical parameters of α_k are determined with a single floating-point simulation.

The q_gk noise source statistical parameters are determined from the models presented in [20]. These statistical parameters depend on the number of bits eliminated and on the data format after the cast operation. As described in (1), the SQNR is a function of the vectors b and m specified in Section 3.1.2. This function is determined automatically from the data flow graph representing the application with the technique summarised in the previous paragraphs and detailed in [14, 19]:

    SQNR = f_SQNR(b, m).    (1)
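The model of Figure 6 can be written as a weighted sum of the noise-source powers. The sketch below assumes zero-mean, mutually uncorrelated noise sources, so that the output power is the sum of the individual powers weighted by the squared gains; the full expression derived in [14, 19] handles the general case:

```c
/* Output quantisation noise power for the model of Figure 6,
   assuming zero-mean, mutually uncorrelated noise sources:
   P(q_y) = sum_k alpha_k^2 * P(q_gk).  Illustrative sketch only. */
double output_noise_power(const double *alpha, const double *p_q, int n)
{
    double p = 0.0;
    for (int k = 0; k < n; k++)
        p += alpha[k] * alpha[k] * p_q[k];
    return p;
}
```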

3.1.4. Floating-to-fixed-point conversion

For the floating-to-fixed-point conversion process, the data dynamic range is first evaluated. The results are used to determine the binary-point position of each data. Then, the data word-length is selected according to the data types supported by the targeted DSP. Finally, the fixed-point specification is optimised by moving the scaling operations to reduce the code execution time. The data word-length and the scaling operation location are optimised under accuracy constraint. These different transformations in the conversion process lead to the CDFGs G_DR, G_BP, G_WL, and G_SO and are detailed in the following sections. The optimised fixed-point specification obtained after the conversion process can be transformed into a fixed-point C code or a SystemC code. This code can be used to simulate the fixed-point specification and to verify that the application quality criteria are respected.

Figure 7: Methodology flow for the data dynamic range determination. The dynamic range is computed on the DFG representing the application and then the global CDFG is annotated with the dynamic range information.

3.2. Data dynamic range determination

The first stage of the methodology corresponds to the data dynamic range evaluation. This stage only depends on the application and the input signals. To evaluate an application data dynamic range, two approaches based on statistical or analytical methods can be used. The dynamic range can be computed from the data statistical parameters which are obtained with a floating-point simulation. The estimation results depend on the data used for the simulation. This approach produces an accurate estimation of the dynamic range from signal characteristics. It guarantees a low overflow probability for signals with the same characteristics. Nevertheless, overflows can occur for signals with different statistical properties.

The second class of methods corresponds to the analytical approaches, which are based on the computation of the data dynamic range expressions from the input dynamic range. These methods guarantee that no overflow will occur but lead to a more conservative estimation. Indeed, the dynamic range expression is computed in the worst case. The data dynamic range can be obtained with the interval arithmetic theory [21]. The operation's output dynamic range is determined from its input dynamic ranges. A worst-case dynamic range propagation rule is defined for each type of operation. Each data dynamic range is obtained with the help of the propagation rules during the application graph traversal. Thus, this technique cannot be used in the case of cyclic graphs like in recursive structures.

For linear time-invariant systems, the data dynamic range can be computed from the L1 or Chebyshev norm [22] according to the input signal frequency characteristics. These norms compute the data dynamic range in the case of linear time-invariant systems based on a nonrecursive or recursive structure. To evaluate the dynamic range of a data d_i from the system input x, the transfer function of the subsystem with the d_i output and the x input has to be determined.

In our methodology, these two analytical approaches have been combined to determine the data dynamic range in nonrecursive systems and in recursive linear time-invariant systems. The structure of this module is presented in Figure 7. The module input is the intermediate representation corresponding to the application CDFG G_App. The first step eliminates the control structures of the CDFG to obtain a data flow graph (DFG). For repetitive structures, the loops are unrolled, and for conditional structures, the branch which leads to the worst case is retained.

The second step corresponds to the dynamic range computation for each data of the application DFG. For nonrecursive structures the dynamic range information is obtained by traversing the graph from the sources to the sinks. For each operation, a propagation rule is applied as defined in [21]. For recursive linear time-invariant structures, the transfer functions between the critical data and the inputs are determined with the technique presented in [14]. These critical data correspond to the outputs of the addition or subtraction operations. Then, the dynamic range is computed from the input dynamic range with the L1 or Chebyshev norm. For all other data, the dynamic range is obtained with the propagation rule technique.

The last step annotates the CDFG G_App data with the dynamic range and leads to the CDFG G_DR. For a data with only one instantiation in the CDFG, its dynamic range is equal to the dynamic range of the equivalent data in the application DFG. For data defined as vectors (i.e., arrays) and used in loops, the vector dynamic range in the CDFG corresponds to the greatest value of the different vector elements used in the DFG. The dynamic range determination is more complex in the case of data with multiple instantiations, like in the FFT (fast Fourier transform) butterfly where the butterfly inputs and outputs are stored in the same variables. The output vector dynamic range is multiplied by a factor of two at each FFT stage. Thus, the fixed-point format of the output vector must evolve at each stage. The first and the final values of the vector X dynamic range are specified through the input and the output of the loop structure, and the evolution of the vector X dynamic range is specified through the input and the output of the CFG block which represents the ith FFT stage. This is illustrated in Figure 8. Consequently, the expression of the dynamic range evolution for a multiple instantiation data is determined from the different dynamic range values.
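The worst-case propagation rules from interval arithmetic can be sketched as follows for addition and multiplication. The rule set and type name are illustrative; see [21] for the complete set of rules:

```c
/* Worst-case dynamic range propagation in the spirit of interval
   arithmetic [21]: the output range of an operation is computed
   from its input ranges.  Only two rules are sketched here. */
typedef struct { double lo, hi; } range_t;

range_t add_range(range_t a, range_t b)
{
    range_t r = { a.lo + b.lo, a.hi + b.hi };
    return r;
}

range_t mul_range(range_t a, range_t b)
{
    /* the extrema of x*y lie at the corners of the input box */
    double p[4] = { a.lo * b.lo, a.lo * b.hi, a.hi * b.lo, a.hi * b.hi };
    range_t r = { p[0], p[0] };
    for (int i = 1; i < 4; i++) {
        if (p[i] < r.lo) r.lo = p[i];
        if (p[i] > r.hi) r.hi = p[i];
    }
    return r;
}
```

Applying such rules while traversing the DFG from sources to sinks yields the guaranteed (if conservative) ranges used in the second step above.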

Figure 8: Specification of the data dynamic range for the FFT algorithm. The vector X dynamic range is specified for the FOR block input and output and for the FFT stage input and output.

3.3. Binary-point position determination

The second stage of the methodology corresponds to the determination of the data binary-point position. The dynamic range results are used to determine, for each data, the binary-point position which minimises the integer part word-length and avoids overflows. The architecture must be taken into account to determine the binary-point position. Indeed, many DSPs offer accumulator guard bits to manage the supplementary bits due to accumulations. Most of the DSPs achieve a MAC (multiply-accumulate) operation without loss of information: the multiplier output word-length is equal to the sum of the multiplier input word-lengths. Nevertheless, the dynamic range increase due to successive accumulations can lead to an overflow. Thus, many DSPs [23, 24] extend the accumulator word-length by providing guard bits. These supplementary bits ensure the storage of the additional bits generated during successive accumulations. To avoid the introduction of costly scaling operations, these guard bits must be taken into account to determine the binary-point position.

The aim of this methodology stage is to obtain a correct fixed-point specification which guarantees no overflow. Moreover, this transformation must respect the different fixed-point arithmetic rules. Thus, scaling operations are included in the application to adapt the fixed-point format of a data to its dynamic range or to align the binary points of the addition inputs. The input of this transformation is the CDFG G_DR where all the data are annotated with their dynamic range. The output is the CDFG G_BP where all the data are annotated with their binary-point position. A hierarchical approach is used to determine the data binary-point position. First, all the application DFGs are independently processed and then a global processing is applied to the CDFG to obtain a coherent fixed-point specification.

To determine the binary-point position (m) of each data, the different DFGs are traversed from the sources towards the sinks. For each data and operation, a rule is applied to obtain the binary-point position. This technique can be applied only on a directed acyclic graph (DAG). Thus, the graph representing a DFG is firstly dismantled into a DAG if it contains cycles.

For a data x, the binary-point position m_x is obtained from the dynamic range with the following relation:

    m_x = ⌈ log2( max_n |x(n)| ) ⌉.    (2)

A binary-point position is assigned to each operation input and output (m'_x, m'_y, m'_z) as presented in Figure 4. A propagation rule has been defined for each type of operation. These rules determine the values of m'_x, m'_y, m'_z according to the binary-point positions of the operation input and output data (m_x, m_y, m_z).

In the case of the multiplication, the binary-point positions of the inputs (m'_x, m'_y) correspond to those of the operation input data (m_x, m_y). The binary-point position of the multiplier output is directly obtained from the binary-point positions of the operation inputs. Thus, the multiplier propagation rules are given by the following expressions:

    m'_x = m_x,
    m'_y = m_y,                       (3)
    m'_z = m_x + m_y + 1.

For addition and subtraction operations, a binary-point position which is common to the operation inputs has to be defined to align the operation input binary points. This common position must guarantee no overflow. The lack of accumulator guard bits to store the supplementary bits due to overflow must be taken into account to determine the common binary-point position. Thus, to avoid overflow, the common binary-point position m_c must be valid for the output data z and is defined as follows:

    m_c = max(m_x, m_y, m_z),
    m'_x = m_c,
    m'_y = m_c,                       (4)
    m'_z = m_c.

If there are accumulator guard bits, the input and output word-lengths are different. Then, a common reference has to be defined to compare the binary-point positions. New binary-point positions (m''_x, m''_y, m''_z) referenced from the most significant bit of the data with the minimum word-length are computed for the inputs and the output as illustrated in Figure 9. A new parameter g corresponding to the number of guard bits used by the data is introduced as follows:

    m''_x = m'_x - g_x,
    m''_y = m'_y - g_y,               (5)
    m''_z = m'_z - g_z.

Considering that the parameter g_z is unknown when determining m_c, it is fixed to N_g, which is the number of guard bits available for the accumulator:

    m_c = max(m_x - g_x, m_y - g_y, m_z - N_g).    (6)

The real number of guard bits used by the adder output is equal to

    g_z = m_z - m_c  if m_z > m_c,
    g_z = 0          if m_z ≤ m_c;    (7)

and the binary-point positions of the adder inputs and output are equal to

    m'_x = m_c + g_x,
    m'_y = m_c + g_y,                 (8)
    m'_z = m_c + g_z.


Figure 9: Binary-point position for an addition with N_g guard bits. The parameter g defines the number of guard bits used by the data.

The scaling operations required to obtain a correct fixed-point specification are inserted in the CDFG. For each operation, as represented in Figure 4, a scaling operation is introduced if the binary-point position of the data m_x (or m_y) is different from the binary-point position of the operation input m'_x (or m'_y). For the operation output, a scaling operation is introduced if the binary-point positions m_z and m'_z are different.

The results obtained for the FIR filter example presented in Figure 2 are given in Figure 10. The DFG associated with the second basic block (B.B. 2) of the FIR filter is presented. A processor with an accumulator without guard bits is considered. The data are annotated by their dynamic range and their binary-point position. For each operation, the output binary-point position is determined. A scaling operation must be introduced between the multiplication and the addition to align the binary-point position before the addition.
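The scaling of Figure 10 can be illustrated in C. This is a hypothetical Q15 (16-bit) rendering of DFG 2, not the code produced by the tool: each Q15 x Q15 product, whose binary-point position after the multiplication is 1, is shifted right by 2 bits before accumulation so the running sum fits the common position m_c = 3:

```c
#include <stdint.h>

/* Sketch of DFG 2 in Figure 10 for a processor without accumulator
   guard bits: every product x[i]*h[i] is scaled right by 2 bits
   before accumulation so that the running sum fits the common
   binary-point position m_c = 3.  Hypothetical Q15 data. */
int32_t fir_acc(const int16_t *x, const int16_t *h, int taps)
{
    int32_t acc = 0;
    for (int i = 0; i < taps; i++)
        acc += ((int32_t)x[i] * h[i]) >> 2;   /* scaling operation */
    return acc;
}
```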
3.4. Data type selection

In the floating-to-fixed-point conversion process, each data type (word-length) is determined to obtain a complete fixed-point format for each CDFG data. This process must explore the diversity of the data types available in recent DSPs. Different elements of the data path influence the computation accuracy, as described in [4]. The most important element is the data word-length. Each processor is defined by its native data word-length, which is the word-length of the data that the processor buses and data path can manipulate in a single instruction cycle [25]. For most of the fixed-point DSPs, the native data word-length is equal to 16 bits. For ASIPs (application-specific instruction-set processors) or some DSP cores like the CEVA-X and the CEVA-Palm [26], this native data word-length is customisable to adapt the architecture to the targeted applications. The computation accuracy is directly linked to the word-length of the data which are manipulated by the operations and depends on the type of instructions which are used to implement the operation.

Many DSPs support extended-precision arithmetic to increase the computation accuracy. In this case, the data are stored in memory with a greater precision. The data word-length is a multiple of the native data word-length. Considering that extended-precision operations manipulate greater data word-lengths, an extended-precision operation is achieved with several single-precision operations. Consequently, its execution time is greater than that of a single-precision operation.

Figure 10: DFG representing the second basic block (B.B. 2) of the FIR filter specified in Figure 2. (a) The data dynamic range and the binary-point position for the DFG2 are specified. (b) DFG2 after the insertion of the scaling operation is shown. u, u1, and u2 are intermediate variables.
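The annotations of Figure 10 can be reproduced with two small helpers (our own sketch, not the authors' tool): binary_point() derives the integer-part size m from a symmetric dynamic range [−A; A], and align_addition() applies equations (4)-(8) assuming inputs without guard bits (gx = gy = 0).

```c
#include <assert.h>

/* Number of integer bits m needed to hold a dynamic range [-A; A]
 * (sign bit excluded). Hypothetical helper for illustration. */
int binary_point(double A) {
    int m = 0;
    double limit = 1.0;
    while (A > limit) { limit *= 2.0; m++; }
    return m;
}

typedef struct { int mc, gz, shift_x, shift_y; } align_t;

static int max3(int a, int b, int c) {
    int m = a > b ? a : b;
    return m > c ? m : c;
}

/* Alignment of an addition: mx, my are the input binary-point positions,
 * mz the output position, Ng the accumulator guard bits (gx = gy = 0). */
align_t align_addition(int mx, int my, int mz, int Ng) {
    align_t r;
    r.mc = max3(mx, my, mz - Ng);        /* common reference, eq (6) */
    r.gz = (mz > r.mc) ? mz - r.mc : 0;  /* guard bits really used, eq (7) */
    r.shift_x = r.mc - mx;               /* right shift aligning input x */
    r.shift_y = r.mc - my;               /* right shift aligning input y */
    return r;
}
```

For the DFG of Figure 10, binary_point(6.26) = 3 gives the accumulator position mAcc, and align_addition(1, 3, 3, 0) yields a right shift of 2 bits on the multiplier output, which is the scaling operation inserted in panel (b).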

Table 1: Word-length of the data which can be manipulated by different DSPs offering SWP capabilities for arithmetic operations.

Processor                    Data types (bits)
TMS320C64x (T.I.) [29]       8, 16, 32, 40, 64
TigerSHARC (A.D.) [28]       8, 16, 32, 64
SP5, UniPhy (3DSP) [30]      8, 16, 24, 32, 48
CEVA-X1620 (CEVA) [31]       8, 16, 32, 40
ZSP500 (LSI Logic) [32]      16, 32, 40, 64
OneDSP (Siroyan)             8, 16, 32, 44, 88

Figure 11: Flow of the data type selection process. This optimisation process uses the SQNR expression fSQNR to evaluate the computation accuracy. It requires selecting the instructions for each operation and evaluating the code execution time T. The data of the output CDFG GWL are annotated with their optimised word-length specified through the vector b.

To reduce the code execution time, some recent DSPs can exploit the data-level parallelism by providing SWP (subword parallelism) capabilities. An operator (multiplier, adder, shifter) of word-length N is split to execute k operations in parallel on subwords of word-length N/k. This technique can accelerate the code execution by up to a factor k. Thus, these processors can manipulate a wide diversity of data types, as shown in Table 1 for several recent DSPs. In [27], this technique has been used to implement a CDMA (code-division multiple access) synchronisation loop on the TigerSHARC DSP [28]. The SWP capabilities offer the opportunity to achieve an average of 6.6 MAC per cycle with two MAC units.

The main goal of the code generation process is to minimise the code execution time under a given accuracy constraint. Thus, our methodology selects the instructions which respect the global accuracy constraint and minimise the code execution time. The methodology flow is presented in Figure 11. The input of this transformation is the CDFG GBP where all the data are annotated with their binary-point position specified through the vector m. The output is the CDFG GWL where all the data are annotated with their optimised word-length specified through the vector b. This transformation leads to a complete fixed-point specification. This optimisation process uses the SQNR expression fSQNR(b, m) to evaluate the computation accuracy. Before the optimisation process starts, the different instructions which can be used are selected for each operation. During the optimisation process, the application execution time is estimated.
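The SWP principle described above can be emulated in portable C by packing two 16-bit lanes into one 32-bit word; this carry-blocking trick is a standard SWAR sketch of ours, not a specific DSP instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: two independent 16-bit additions carried out in one 32-bit word,
 * emulating an SWP adder split into k = 2 subwords. The top bit of each
 * lane is masked off before the add so carries cannot cross lanes, then
 * restored with an XOR. */
uint32_t swp_add16x2(uint32_t a, uint32_t b) {
    const uint32_t H = 0x80008000u;       /* top bit of each 16-bit lane */
    uint32_t sum = (a & ~H) + (b & ~H);   /* carry-safe partial add */
    return sum ^ ((a ^ b) & H);           /* restore the lane top bits */
}
```

A real SWP-capable DSP performs this in a single instruction on the full-width operator, which is where the factor-k acceleration comes from.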

3.4.1. Code execution time estimation

The processor is modelled by a data flow instruction set. These instructions implement arithmetic operations. The instructions are obtained from one or several instructions of the processor instruction set. Each data flow instruction jk is characterised by its function γk, its operand word-length bk, and its execution time tk. This execution time is obtained from the processor model. For SWP instructions, the execution time is set to the processor instruction execution time divided by the number of operations executed in parallel. For the extended-precision instructions, the execution time is the sum of the execution times of the processor instructions used to implement the operation. A processor model example is presented in Figure 12(a).

The global application execution time is estimated from the instructions selected for the No operations of the CDFG. Nevertheless, the goal is not to obtain an exact execution time estimation but to compare two instruction lists and to select the one that leads to the minimal execution time. Thus, a simple estimation model is used to evaluate the execution time T(b) of the CDFG. This time depends on the type of instruction used to execute the CDFG operations and thus is a function of the vector b which specifies the word-length of the CDFG operation operands. The time T(b) is estimated from the execution time ti and the number of executions ni of each operation oi as follows:

T(b) = Σ_{i=1}^{No} ti · ni.    (9)

This estimation method is based on the sum of the instruction execution times and leads to accurate results for DSPs without instruction parallelism. For DSPs with instruction-level parallelism (ILP), this method does not take account of the instructions executed in parallel. Nevertheless, this estimation can be used to compare adequately two instruction lists in the case of a processor with ILP.

For single-precision and SWP instructions, the gains due to the transformation (code parallelisation) of the vertical code into a horizontal one are similar. Indeed, the two instruction lists use the same functional units at the same clock cycles. The difference lies in the functionality of the processor unit: for SWP instructions, the functional units manipulate fractions of a word instead of the entire word. Thus, the gains due to the code parallelisation are identical with SWP and single-precision instructions.

An extended-precision instruction is achieved with several single-precision instructions. Thus, in the best case and after the scheduling stage, the extended-precision instruction execution time can be equal to the execution time of the single-precision instructions. In this case, the single-precision instructions must be favoured if the precision constraint is fulfilled, to reduce the data memory size. Therefore, the extended-precision instruction execution time is set to the maximal value, so that extended precision is selected only if the single-precision instructions cannot fulfil the precision constraint.

This approach to code execution time estimation can be improved with more accurate techniques such as those presented in [33, 34]. On the other hand, the optimisation time will be increased.

3.4.2. Data type selection

In this section, the data type selection process is described. For each CDFG operation oi, the different instructions achieving oi are selected. Let Ii be the set specifying the instructions selected for the operation oi. Let Bi be the set specifying all the possible word-lengths for the oi operation operands. Thus, for each operation oi, the optimised word-length bi (bi ∈ Bi), that is, the one which minimises the global execution time T(b) and respects the minimal precision constraint, must be selected. Consequently, the application execution time T(b) is minimised as long as the accuracy constraint (SQNRmin) is fulfilled, as described by the following equation:

min_{b ∈ B} T(b)  subject to  fSQNR(b) ≥ SQNRmin.    (10)

Considering that the number of values for each variable bi is limited, the optimisation problem can be modelled with a tree. This optimisation process is illustrated with an FIR filter example in Figure 12. To obtain the optimal solution, the tree must be explored exhaustively. This technique leads to an exponential optimisation time. To explore this tree efficiently, a branch-and-bound algorithm is used with four techniques to limit the search space. These techniques are presented in the next section.

3.4.3. Search space limitation

The tree modelling of this optimisation problem offers the capability to exhaustively enumerate the solutions. Nevertheless, not all the instruction combinations are valid. Let us consider two operations ol and ok where the ol operation input is the ok operation result. In this case, the number of bits nl^in for the ol input fractional part cannot be strictly greater than the number of bits nk^out for the ok output fractional part. Thus, the instruction tested for the operation ol is valid if nk^out ≥ nl^in. If this condition is not respected, the exploration of the subtree is stopped and a new instruction is tested for the operation ol. This technique significantly reduces the search space.

In the branch-and-bound algorithm, the partial solutions are evaluated to stop the tree exploration if they cannot lead to the best solution. At the tree level l, the exploration of the subtree induced by the node representing bl can be stopped if the minimal execution time which can be obtained during the exploration of this subtree is greater than the minimal execution time which has already been obtained. Considering that only the word-lengths b0 to bl are already defined, the minimal execution time is determined by selecting for each operation oj (j ∈ [l + 1, No]) the instruction with the minimal execution time tj.
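The estimate of equation (9) reduces to a weighted sum over the selected instructions. A minimal sketch (the struct layout and the example values are our own illustration):

```c
#include <assert.h>

/* Sketch: execution-time estimate T(b) = sum of ti * ni over the CDFG
 * operations (equation (9)). ti is the cost of the instruction selected
 * for operation oi, ni its number of executions. */
typedef struct {
    double t;   /* execution time of the selected instruction (cycles) */
    int    n;   /* number of times the operation is executed */
} op_cost_t;

double estimate_time(const op_cost_t *ops, int n_ops) {
    double T = 0.0;
    for (int i = 0; i < n_ops; i++)
        T += ops[i].t * ops[i].n;
    return T;
}
```

For a 32-tap FIR with the 16-bit instructions j2 (multiply, 0.5 cycle) and j4 (add, 0.25 cycle) of Figure 12(a), T = 32 · 0.5 + 32 · 0.25 = 24 cycles.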

Instruction jk   Function γk   Execution time tk   I/O operand word-lengths (bin1, bin2, bout)
j1               MULT          0.25                8, 8, 16
j2               MULT          0.5                 16, 16, 32
j3               MULT          1                   32, 32, 64
j4               ADD           0.25                16, 16, 16
j5               ADD           0.5                 32, 32, 32
j6               ADD           1                   64, 64, 64

(a)

(b) FIR filter data flow graph: the operation o0 multiplies x[i] by h[i], with I0 = {j1, j2, j3}; the operation o1 accumulates the product into Acc, with I1 = {j4, j5, j6}. (c) Solution tree: o0 chooses 8 × 8 → 16, 16 × 16 → 32, or 32 × 32 → 64; o1 then chooses j4, j5, or j6 (operand word-length 16, 32, or 64).

Figure 12: Data word-length optimisation process for an FIR filter. (a) Model of the processor data flow instruction set. (b) FIR filter data flow graph. (c) Model with a tree of the different solutions for the optimisation.
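The exploration of the Figure 12(c) tree under the constraint of equation (10) can be sketched with a small branch-and-bound. The accuracy model below (about 6 dB per bit of the smallest operand) is a deliberately crude placeholder for fSQNR, used only to make the sketch runnable; the costs are the cycle counts of Figure 12(a) with ni = 32 executions per operation.

```c
#include <assert.h>
#include <float.h>

#define N_OPS 2
#define N_ALT 3

/* Candidate operand word-lengths and per-execution costs (cycles) for the
 * two FIR operations of Figure 12: o0 (multiplication) and o1 (addition). */
static const int    wl[N_OPS][N_ALT]   = { { 8, 16, 32 }, { 16, 32, 64 } };
static const double cost[N_OPS][N_ALT] = { { 0.25, 0.5, 1.0 }, { 0.25, 0.5, 1.0 } };
static const int    n_exec = 32;   /* ni: each operation runs once per tap */

static double best_T;

/* Depth-first exploration with the partial-solution bound: a branch is cut
 * as soon as its accumulated cost reaches the best complete solution. */
static void explore(int level, double T, int min_wl, double sqnr_min) {
    if (T >= best_T)
        return;
    if (level == N_OPS) {
        if (6.0 * min_wl >= sqnr_min)   /* crude stand-in for fSQNR(b) */
            best_T = T;
        return;
    }
    for (int a = 0; a < N_ALT; a++)
        explore(level + 1, T + n_exec * cost[level][a],
                wl[level][a] < min_wl ? wl[level][a] : min_wl, sqnr_min);
}

/* Returns the minimal T(b) fulfilling the accuracy constraint, as in (10). */
double select_word_lengths(double sqnr_min) {
    best_T = DBL_MAX;
    explore(0, 0.0, 1 << 30, sqnr_min);
    return best_T;
}
```

With a 90 dB constraint the 8-bit multiply is rejected and the cheapest valid solution is the 16-bit pair j2/j4 (24 cycles); relaxing the constraint to 40 dB lets the search keep the 8-bit multiply (16 cycles).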

At the tree level l, the exploration of the subtree induced by the node representing bl can be stopped if the maximal SQNR which can be obtained during the exploration of this subtree is lower than the precision constraint (SQNRmin). The SQNR maximal value is obtained by fixing the word-lengths bj (j ∈ [l + 1, No]) to their maximal value. Indeed, considering that the SQNR is a monotonic and nondecreasing function, the SQNR maximal value is obtained for the maximal operand word-length.

This optimisation technique based on a tree traversal is sensitive to the node evaluation order. To quickly find a good solution that reduces the search space, the variables with the most influence on the optimisation process must be evaluated first. The variables are sorted by their influence on the global execution time. The influence of the operation oi on the execution time is obtained from the number of times (ni) that this operation is executed.

For applications with a great number of variables, the optimisation time can become important. To obtain a reasonable optimisation time, the optimisation is achieved in two steps. Firstly, the variables corresponding to the data word-lengths are considered as positive real numbers and a constrained nonlinear optimisation technique is used to minimise the code execution time under the accuracy constraint. The optimisation technique is based on sequential quadratic programming (SQP) [35]. Let b̂i be the optimised solution obtained with this technique for the variable bi. Secondly, the technique based on the branch-and-bound algorithm presented previously is applied with a reduced number of values per variable. For each variable bi, only the values which are members of Bi and immediately higher and lower than b̂i are retained. Thus only two values are tested for each variable and the search space is dramatically reduced.

An optimisation time of less than 200 seconds has been obtained for the branch-and-bound algorithm with 35 variables and four alternatives per variable. In this case, only the first two techniques, corresponding to the instruction combination restriction and the partial solution evaluation, were used. For the same application, this optimisation time is dramatically reduced when two alternatives are tested for each variable, as in the last search space reduction technique, which achieves the optimisation in two steps.

3.5. Scaling operation optimisation

The previous methodology stages, which correspond to the determination of the data word-length and the binary-point position, lead to a fixed-point specification optimised in terms of accuracy. Indeed, scaling operations have been inserted to maintain a sufficient computation accuracy. These scaling operations are used to adapt the fixed-point format to the data dynamic range or to insert additional bits in the integer part to avoid overflows. Nevertheless, these scaling operations increase the code execution time. The aim of this part is to optimise the fixed-point data formats to minimise the code execution time T(m) as long as the accuracy constraint is fulfilled. The execution time is reduced by moving the scaling operations. These scaling operation transfers modify the

data binary-point position specified through the vector m. Thus, this optimisation problem can be expressed as follows:

min_m T(m)  subject to  fSQNR(m) ≥ SQNRmin.    (11)

3.5.1. Scaling operation transfers

Scaling operations based on a left shift adapt the fixed-point format to the data dynamic range. The number of bits m used for the integer part is reduced, because it is too high compared to the data dynamic range. This bit-number reduction for the integer part can be delayed. Thus, a scaling operation achieved with a left shift can be moved towards the application graph sinks.

Scaling operations based on a right shift realise the insertion of supplementary bits for the integer part to support the data dynamic range increase. This supplementary bit insertion can be brought forward. Thus, a scaling operation achieved with a right shift can be moved towards the application graph sources. Nevertheless, left-shift operations are inserted after a set of accumulations which use guard bits. This operation ensures the guard bit recovery before spilling the data in memory. In this case, the binary-point position is not changed. Consequently, this operation must not be moved, otherwise the guard bits would be lost.

To move the scaling operations, a propagation rule is defined for each class of operations. When a right shift is moved towards a multiplication operation, one of the inputs must be selected to receive the scaling operation. In the case of linear systems, two alternatives are available to move a right shift: these scaling operations can be moved towards the system inputs or towards the coefficients. In the last case the degradation of the SQNR is less important, but in the case of linear filters the degradation of the frequency response due to the coefficient quantisation is more significant.

To optimise the scaling operation location, two approaches have been defined according to the DSP architecture and more particularly to the DSP instruction-level parallelism (ILP).

3.5.2. Architecture influence on the scaling operation cost

Different classes of shift registers are available in DSPs to scale the data. In some processors [24, 36], a specialised shift register is located at the output or at the input of an operator and several specific shifts can be achieved. Thus, the operator input or output can be scaled without a supplementary cycle. For more flexibility, most of the recent DSPs offer a barrel shifter which is able to perform any shift operation in one cycle.

In traditional DSPs [23, 24, 36] based on a MAC (multiply-accumulate) structure, the registers are dedicated to a specific operator. The barrel shifter is connected to the accumulation register and can only scale efficiently the output of an addition. To analyse the additional cost due to the scaling operation, several experiments have been conducted on the DSPStone benchmark [37]. Different locations of a scaling operation in the applications have been tested. A scaling operation requires between one and five cycles for the TMS320C54x [23] and between one and four cycles for the OakDSPCore [38]. These additional cycles required for the scaling operation are due to the transfers between the registers. The evaluation of the scaling operation execution time requires the knowledge of the data location before and after the shift instruction. Thus, the instruction list used to implement the scaling operation has to be determined. This list is obtained with the code selection stage.

In homogeneous architectures, a register file is connected to a set of operators working in parallel, as in VLIW (very long instruction word) DSPs [28, 29]. For these architectures, the barrel shifter can scale the input or the output of any operation in one cycle. For processors with instruction-level parallelism, the scaling operation cost depends on the opportunity to execute this operation in parallel with the other instructions. To illustrate and quantify this concept, the extra cost due to a scaling operation has been measured on the DSPStone benchmark implemented on the TMS320C64x VLIW DSP [4]. For these applications based on a MAC operation, the application execution times have been measured with and without a scaling operation executed after the multiply operation. Let T̄ri and Tri be the code execution times, respectively, with and without a scaling operation ri. The extra cost Cr defined in (12) corresponds to the ratio between the additional execution time due to the scaling operation (T̄ri − Tri) and the application execution time without this scaling operation (Tri). This extra cost depends on the average IPC (instructions per cycle) obtained for the application without a scaling operation. When the IPC is close to its maximal value, the extra cost can be relatively important (47%). Indeed, most of the functional units are used and supplementary cycles are required to execute the scaling operations. When the IPC decreases, the extra cost diminishes and can fall to 0%. Thus, these results underline that the scaling operation execution time can be evaluated only during the scheduling stage:

Cr = (T̄ri − Tri) / Tri.    (12)

3.5.3. DSPs without instruction-level parallelism

In this part, the approach proposed for processors without instruction-level parallelism is explained. For this class of DSPs, only one instruction is executed per cycle and the parallelism is specified through complex instructions. The flow of the optimisation of the scaling operation location is presented in Figure 13. The input of this transformation is the CDFG GWL where all the data are annotated with their optimised fixed-point specification. The output is the CDFG GSO where the location of the scaling operations has been optimised. This optimisation process uses the SQNR expression fSQNR(b, m) to evaluate the computation accuracy. The technique used to estimate the extra execution time due to the scaling operations and the algorithm proposed to minimise this execution time are explained.
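The propagation rules of the scaling operation transfers can be illustrated with exact integer arithmetic: a left shift moved towards the sinks preserves the value, while a right shift moved towards the sources may discard low-order bits, which is why each transfer must be re-validated against the accuracy constraint. Illustrative helpers of our own, assuming nonnegative operands and no overflow:

```c
#include <assert.h>
#include <stdint.h>

/* Left shift moved towards the sink: (a << s) + (b << s) == (a + b) << s,
 * so the transfer is exact (as long as no overflow occurs). */
int32_t add_then_lshift(int32_t a, int32_t b, int s) { return (a + b) << s; }
int32_t lshift_then_add(int32_t a, int32_t b, int s) { return (a << s) + (b << s); }

/* Right shift moved towards the sources: (a >> s) + (b >> s) can differ
 * from (a + b) >> s, because each operand loses its s low-order bits. */
int32_t add_then_rshift(int32_t a, int32_t b, int s) { return (a + b) >> s; }
int32_t rshift_then_add(int32_t a, int32_t b, int s) { return (a >> s) + (b >> s); }
```

For a = 3, b = 5, s = 1, the two right-shift variants give 4 and 3: the moved scaling may be cheaper to schedule but degrades the SQNR, which is exactly the tradeoff this section optimises.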


Figure 13: Flow to optimise the scaling operation location for DSPs without instruction-level parallelism. The execution time tri of the scaling operation ri is estimated.

For a scaling operation ri, let tri be its execution time and nri the number of times that ri is executed. The scaling operation cost is defined as the product of tri and nri. For this class of DSP architectures, the global execution time TSO of the NSO scaling operations located in the application CDFG is determined with the following expression:

TSO = Σ_{i=1}^{NSO} nri · tri.    (13)

The execution time tri is equal to the difference between T̄ri and Tri. The times T̄ri and Tri correspond to the code execution times, respectively, with and without the scaling operation ri. The technique used to evaluate the times T̄ri and Tri is represented in the right part of Figure 13. First of all, the expression tree which includes the scaling operation ri is extracted. Then a code selection is applied on this expression tree with (Āri) and without (Ari) the scaling operation. The execution time is directly computed from the instruction list selected for the expression tree. It corresponds to the sum of the different instruction execution times and leads to a sufficiently accurate estimation of the code execution time for this class of DSP architectures. Indeed, the parallelism is specified through complex instructions and can be detected during the code selection stage. Nevertheless, this technique can be improved by taking account of the pipeline hazards with the technique proposed in [39]: the adjacent instructions can be analysed to determine if a pipeline hazard can occur.

The scaling operation optimisation problem is solved with an iterative algorithm. For each iteration, a scaling operation is moved and this transfer is validated if the accuracy constraint is respected. The scaling operations are processed in cost-decreasing order to consider costly operations first. After each transfer, the application accuracy is evaluated. If the accuracy constraint is no longer respected, the scaling operation is put back in the location which leads to the minimal execution time and this operation will not be moved afterwards. If the accuracy constraint is still fulfilled, the scaling operation transfer is validated. Then, the scaling operation costs are computed. In the next iteration, the scaling operation with the maximal cost is processed. The algorithm finishes when no scaling operation can be moved.

For the FIR filter example presented in Figure 2, the scaling operations have been optimised for the TMS320C50 architecture model. The scaling operations are moved towards the system input. The fixed-point C code generated before and after the optimisation process is presented in Algorithms 2 and 3, respectively. This optimisation process decreases the scaling operation execution time TSO from 120 cycles to 0. Thus, the global code execution time is reduced by 36%. On the other hand, the output SQNR is reduced by 4.5 dB.

3.5.4. DSPs with instruction-level parallelism

For processors with instruction-level parallelism, the estimation of the execution time must be coupled with the scheduling stage to take account of the partial instructions which are executed in parallel. Indeed, the scaling operation cost depends on the opportunity to execute this operation in parallel with the other instructions. Thus, the goal of our approach is to find the scaling operation location which enables the execution of the shift operation in parallel with other instructions. The aim is to find the scheduling which minimises the increase of time compared to the scheduling obtained without the scaling operations.

For a scaling operation rk located between the operations oi and oj, the scaling operation cost ck,ij is defined with the expression (14). The term ηij defines the maximal number of scaling operations which can be inserted between the operations oi and oj without increasing the execution time compared to a solution without scaling operations. This term depends on the mobility of the operations oi and oj and on the processor resource usage rate. It is computed from the operation execution dates obtained with a list scheduling algorithm in a direct and forward sense. For this, the operation oi is executed as soon as possible and the operation oj is executed as late as possible. When no scaling operation can be inserted, the term ηij is null and the cost is equal to its maximal value:

ck,ij = 1 / (1 + ηij).    (14)

The scaling operation optimisation is achieved with an iterative process made up of three steps corresponding to

short h[32] = {-973, ..., 29418, 32112, 29418, ..., -973};
short x[32];
short y;
int acc;

short fir(short input)
{
    int i;
    *x = input;
    acc = *x * *h >> 2;
    for (i = 31; i > 0; i--)
    {
        acc = acc + x[i] * h[i] >> 2;
        x[i] = x[i - 1];
    }
    y = (short)(acc);
    return y;
}

Algorithm 2: Fixed-point C code for the FIR filter before the scaling operation optimisation.

short h[32] = {-973, ..., 29418, 32112, 29418, ..., -973};
short x[32];
short y;
int acc;

short fir(short input)
{
    int i;
    *x = input >> 2;
    acc = *x * *h;
    for (i = 31; i > 0; i--)
    {
        acc = acc + x[i] * h[i];
        x[i] = x[i - 1];
    }
    y = (short)(acc);
    return y;
}

Algorithm 3: Fixed-point C code for the FIR filter after the scaling operation optimisation.
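The two listings trade accuracy for speed: scaling the input once (as in Algorithm 3) removes the in-loop shifts but discards input bits before every multiplication. One MAC step of each variant, with illustrative values of our own:

```c
#include <assert.h>

/* Sketch: a single multiply step from the two FIR variants above. */
int mac_scale_product(int x, int h) { return (x * h) >> 2; }  /* Algorithm 2 style */
int mac_scale_input(int x, int h)   { return (x >> 2) * h; }  /* Algorithm 3 style */
```

With x = 5 and h = 8, the first form yields 10 while the second yields 8; accumulated over the 32 taps, this kind of difference is the source of the 4.5 dB SQNR loss reported above, paid for removing the 120 cycles of in-loop scaling.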

the scaling operation cost computation, the transfer of some scaling operations, and the scheduling. The scaling operations are processed in cost-decreasing order. They are moved as long as their cost is equal to one and the accuracy constraint is fulfilled.

4. EXPERIMENTS AND RESULTS

4.1. Floating-to-fixed-point conversion for a WCDMA receiver

The aim of this part is to show the interest of our approach for obtaining an optimised fixed-point specification in the case of a real-life application corresponding to a WCDMA receiver. In particular, this experiment underlines the benefits provided by the data type selection stage to reduce the code execution time.

4.1.1. WCDMA receiver description

The considered application corresponds to a receiver used in the base station for third-generation telecommunication systems. UMTS (Universal Mobile Telecommunications System) is based on the wideband code-division multiple-access (WCDMA) norm [40]. The information data (DPDCH) and the control data (DPCCH) are spread with an orthogonal variable-spreading-factor (OVSF) code, and then scrambled by a specific spreading sequence (Kasami codes).

In the receiver part, the complex received signal is made up of different delayed copies of the transmitted signal due to the multipaths inside the radio channel. The RAKE-receiver concept is based on the combination of the different multipath components to improve the quality of the decision on symbols. Each multipath signal is processed by a finger which correlates the received signal with a spreading code. The RAKE receiver and the different finger structures are detailed in Figure 14. The signal y(k) corresponds to the combination of the different finger outputs yl(k). To combine the different finger results, the complex amplitude αl of the lth path must be estimated and removed for each multipath. The symbols are decoded by multiplying the received signal with a synchronised version of the code generated in the receiver. The synchronisation between the code and the received signal is realised by a delay-locked loop (DLL).

For each finger, the symbols (DPDCH/DPCCH) are estimated with the symbol decoder structure presented in Figure 15. Thanks to the complex multiplication (CM 1) of the received signal by the conjugate of the Kasami code cK*(n), the unscrambling operation is performed. Then, the phase distortion resulting from the transmission channel is removed with the complex multiplication (CM 2) by the conjugate of the complex amplitude estimation (αl*). At last, the despreading operation with the OVSF codes (cOVSFI(n) and cOVSFQ(n)) transforms the wideband received signal into a narrowband signal. This operation decodes the transmitted symbols yl(k).

4.1.2. Data type selection

Recent DSPs like the TMS320C64x from Texas Instruments provide a wide diversity of data types with the SWP capabilities. The data type selection is a tradeoff between the computation accuracy and the code execution time. To illustrate the different opportunities offered by this class of architectures, the complex correlator used in the RAKE receiver has been implemented with different data types. For each solution Si, the execution time and the signal-to-quantification-noise ratio (SQNR) metric are evaluated. The results are presented


Figure 14: Schematic of the RAKE receiver and the finger for a base station. The RAKE receiver achieves the combination of the different finger results.
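The correlation performed in each finger amounts to a multiply-accumulate of the complex samples with the spreading code over the spreading factor. A deliberately simplified sketch of our own (real bipolar code, no Kasami/OVSF structure):

```c
#include <assert.h>

/* Sketch: correlation of a complex sample stream with a bipolar spreading
 * code (+1/-1 values), accumulated over SF samples, as done in a RAKE
 * finger. Illustrative model only. */
typedef struct { int re; int im; } cplx_t;

cplx_t correlate(const cplx_t *x, const int *code, int sf) {
    cplx_t acc = { 0, 0 };
    for (int n = 0; n < sf; n++) {
        acc.re += x[n].re * code[n];   /* real part of the MAC */
        acc.im += x[n].im * code[n];   /* imaginary part of the MAC */
    }
    return acc;
}
```

With 8-bit samples and a spreading factor of 256, the accumulator can grow by log2(256) = 8 bits over the sum, which is why the operand word-lengths compared in Table 2 drive the accuracy/execution-time tradeoff.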


Figure 15: Symbol decoding subsystem for a base-station receiver.

in Table 2, and the word-lengths of the operation operands are reported. These different results have been obtained by using our methodology with different accuracy constraints (SQNRmin). The execution time (Tnorm(b)) is normalised in relation to the execution time of a classical implementation based on single-precision instructions (multiplication: 16 × 16 ⇒ 32 bits; addition: 32 + 32 ⇒ 32 bits).

Before determining the RAKE-receiver fixed-point specification, the accuracy constraint must be defined. The minimal value of the SQNR (SQNRmin) is defined according to the system performance constraints. In the case of the WCDMA receiver, the performances are specified through the maximal value of the bit error rate (BER). The accuracy constraint has been defined so that the system output BER is only slightly modified after the fixed-point conversion process. Compared to the floating-point implementation, the maximal BER degradation due to fixed-point computation is fixed to 5%. The SQNR minimal value is obtained with a floating-point simulation using the technique explained in Section 3.1.3. For the WCDMA receiver, this accuracy constraint determination process leads to a minimal SQNR equal to 12.5 dB.

The WCDMA receiver fixed-point specification has been obtained with our methodology. The input data (receiving Nyquist filter output) word-length was fixed to 8 bits. The word-lengths of the main data for the symbol decoding subsystem of the RAKE receiver are summarised in Table 3. For this experiment, the Texas Instruments code generation tool is used to benefit from the high performance of the C compiler and more particularly from the software pipelining technique. Thus, the C source code is modified to include the different data types from the fixed-point specification. Intrinsic functions are used to express the data parallelism. The data parallelisation must be achieved by the user to exploit the processor SWP capabilities.

To analyse the improvement due to the data type selection stage, the execution times of the code obtained with a classical implementation based on single-precision instructions (Tunopt(b)) and with our methodology (Topt(b)) have been measured. In the classical approach, the data types are

Table 2: Results of the complex correlator implementation for different data types. Tnorm is the execution time normalised in relation to the classical implementation (impl. 2) execution time.

Si   Tnorm   SQNR (dB)   Multiplication (bits × bits ⇒ bits)   Addition (bits + bits ⇒ bits)
1    0.65    18          8 × 8 ⇒ 16                            16 + 16 ⇒ 16
2    1       89          16 × 16 ⇒ 32                          32 + 32 ⇒ 32
3    1.55    151         32 × 16 ⇒ 32                          32 + 32 ⇒ 32
4    2.1     170         32 × 16 ⇒ 64                          64 + 64 ⇒ 64

Table 4: SWP improvement factor F. This factor F corresponds to the acceleration factor due to the data type selection.

                    Code execution time improvement factor F
Number of fingers   Symbol decoding subsystem   Synchronisation subsystem
1                   2.83                        1.91
2                   2.79                        2.79
4                   3.51                        3.18

Table 3: Data word-length for the symbol decoding subsystem of the RAKE receiver. The data and the operations are presented in Figure 15.

Data    Operations   Data type (bits)
x_l     —            8
CM1     MULT         8 × 8 → 16
        ADD          16 + 16 → 16
CM2     MULT         16 × 16 → 32
        ADD          16 + 16 → 16
M3      MULT         16 × 16 → 32
A1      ADD          16 + 16 → 16
A2      ADD          16 + 16 → 16
y_l     —            16

In the classical implementation, the data types are not optimised and thus only the single-precision instructions are used (multiplication: 16 × 16 ⇒ 32 bits; addition: 32 + 32 ⇒ 32 bits). Given that the two floating-to-fixed-point conversion methods presented in Section 2.1 do not optimise the data types, the results obtained with these approaches correspond to the classical implementation. In our approach, the code is obtained from the fixed-point specification determined with our floating-to-fixed-point conversion methodology. The accuracy constraint and the DSP architecture offer the opportunity to use the SWP instructions. To compare these two approaches, the ratio between the two execution times Tunopt(b) and Topt(b) is computed. This improvement factor F, defined in (15), corresponds to the acceleration factor due to the data type selection:

    F = Tunopt(b) / Topt(b).          (15)

Different experiments have been carried out on the symbol decoding and the synchronisation subsystems for several values of the number of fingers. The results, presented in Table 4, underline the benefit of the SWP instructions. Our approach reduces the code execution time by a factor between 1.91 and 3.51.

4.2. Optimisation of the scaling operation location

In this section, experiments have been conducted to show the interest of our approach for optimising the scaling operation location for DSPs based on a conventional architecture. These experiments were carried out with the C50 and the C54x DSPs from Texas Instruments. These two processors are based on a classical MAC structure. The C54x DSP provides an accumulator with eight guard bits and a barrel shifter connected to the accumulator register. The C50 offers no guard bits, and its scaling capabilities, based on specialised shift registers, are limited: a prescaler register is available to shift the data loaded from memory, and a postscaler register provides the capability to shift the data when they are stored in memory.

The different experimental results are given in Table 5. The scaling operation execution time TSO is given before and after the optimisation of the scaling operation location to analyse the improvement due to the optimisation process. The execution time TSO (number of cycles) corresponds to the difference in application execution time with and without the scaling operations. The accuracy degradation ΔSQNR (dB) due to the scaling operation transfers is measured.

The first two applications are a finite impulse response (FIR) filter and an infinite impulse response (IIR) filter. The complex correlator computes the correlation between a complex signal and a complex bipolar code made up of N points. The last four applications are used in the WCDMA receiver for third-generation telecommunication systems; these applications are described in the previous section. The receivers for the mobile terminal (MT) and for the base station (BS) are similar except for the location of the phase removing processing: in the base station the phase removing is achieved during the symbol decoding, whereas in the mobile terminal it is achieved after the symbol decoding and before the output finger combination.

Table 5: Optimisation of the scaling operation location for different applications implemented on the C54x and the C50 DSPs. The scaling operation execution time TSO (number of cycles) and the SQNR degradation ΔSQNR (dB) are evaluated.

                                  TMS320C54x                     TMS320C50
                                  Initial   After optimisation   Initial   After optimisation
Applications                      TSO       ΔSQNR    TSO         TSO       ΔSQNR    TSO
FIR 32-tap filter                 1         0        1           128       −4.50    0
Second-order IIR filter           3         −8.62    —           7         −8.60    —
Complex correlator (N = 32)       1         0        1           160       −18.26   0
Complex correlator (N = 128)      1         0        1           896       −29.90   0
MT symbol decoding (SF = 32)      1         0        1           128       −12.40   0
BS symbol decoding (SF = 32)      1         0        1           128       −9.50    0
MT RAKE receiver (SF = 4)         9         0        9           80        −12.60   0
BS RAKE receiver (SF = 8)         5         0        5           50        −2.50    0

For the C54x, the guard bits ensure a fixed-point specification with a limited number of scaling operations. Except for the IIR filter, these scaling operations correspond to left shifts required to align the guard bits before storing in memory the data held in the accumulator register. Thus, these scaling operations cannot be moved, and the scaling operation optimisation does not reduce the scaling operation cost. In the IIR filter, the guard bits are not sufficient to limit the number of scaling operations: a scaling operation is required to adapt the formats of the recursive-part and nonrecursive-part outputs, and this scaling operation can be moved to reduce the scaling operation cost.

For the C50, the lack of guard bits leads to a fixed-point specification with a high execution time for the scaling operations. Indeed, these scaling operations are inserted to code the fixed-point data with the maximal accuracy. The limited scaling capabilities do not provide the opportunity to scale the data efficiently between two arithmetic operations; in this case, the scaling operation execution time depends on the number of bits to shift. The optimisation process dramatically reduces the scaling operation execution time by moving these operations towards the application inputs. When the scaling operations are located at the application inputs, the execution time of the scaling operations TSO is null: the prescaler register can scale the inputs when they are loaded from memory, with no supplementary cycle. Nevertheless, these scaling operation transfers degrade the computation accuracy.

These different results show the benefits provided by the optimisation of the scaling operation location and by the guard bits. They underline the necessity of taking the DSP architecture into account to obtain an optimised fixed-point specification.

5. CONCLUSIONS

Efficient application implementation in embedded systems requires using fixed-point arithmetic. Reducing the application time-to-market requires high-level tools which automate the floating-to-fixed-point conversion. In this paper, a new methodology for floating-to-fixed-point conversion has been proposed. This approach minimises the code execution time under an accuracy constraint. Compared to previous methodologies, the DSP architecture is taken into account to optimise the fixed-point specification. The fixed-point data types and the scaling operation location are optimised to reduce the code execution time. These two optimisation processes are achieved efficiently thanks to the use of an analytical technique to evaluate the computation accuracy. Indeed, this technique dramatically reduces the optimisation time compared to a simulation-based approach.

Different experiments have been conducted to analyse the efficiency of our approach. The results obtained for the data type selection underline the tradeoff between the accuracy and the code execution time which can be obtained with recent DSPs. Moreover, the ability of our technique to significantly reduce the execution time with SWP instructions compared to a classical implementation has been demonstrated through the WCDMA receiver example: this technique reduces the code execution time by a factor between 1.9 and 3.5. The experiments on scaling operations show that their execution time can become significant. The use of guard bits or the optimisation of the scaling operation location significantly reduces the code execution time.
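The cost/accuracy effect of moving a scaling operation can be sketched in a few lines: shifting at the accumulator output keeps full precision inside the MAC loop, while pre-shifting the inputs (as a prescaler register would, at no extra cycle) discards input LSBs. This is an illustrative model, not the authors' tool; the word-lengths and shift amount are assumptions.

```python
# Illustrative sketch: accuracy impact of moving a right-shift scaling
# operation from the accumulator output to the application inputs.
import random

random.seed(1)
SHIFT = 4
xs = [random.randrange(-2**15, 2**15) for _ in range(256)]   # 16-bit data
hs = [random.randrange(1, 2**7) for _ in range(256)]         # 8-bit coefficients

exact = sum(x * h for x, h in zip(xs, hs)) / 2**SHIFT

# Scaling after the accumulation: full-precision MACs, one final shift.
post = sum(x * h for x, h in zip(xs, hs)) >> SHIFT

# Scaling at the inputs: each x is pre-shifted (e.g. by a prescaler register
# while it is loaded from memory), so the MAC loop needs no scaling cycles.
pre = sum((x >> SHIFT) * h for x, h in zip(xs, hs))

print(abs(exact - post) <= abs(exact - pre))  # True
```

The final value is the same fixed-point format either way, but the input-side scaling accumulates the discarded LSBs through every MAC, which is exactly the ΔSQNR degradation reported in Table 5.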

Daniel Menard received the Engineering degree and the M.S. degree in electronics and signal processing engineering from the University of Nantes Polytechnic School in 1996, and the Ph.D. degree in signal processing and telecommunications from the University of Rennes in 2002. From 1996 to 2000, he was a research engineer at the University of Rennes. He is currently an Associate Professor of electrical engineering at the University of Rennes (ENSSAT) and a member of the R2D2 (Reconfigurable Retargetable Digital Devices) Research Team at the IRISA Laboratory. His research interests include the implementation of signal processing and mobile communication applications in embedded systems and floating-to-fixed-point conversion.

Daniel Chillet received the Engineering degree and the M.S. degree in electronics and signal processing engineering from ENSSAT, University of Rennes, in 1992 and 1994, respectively, and the Ph.D. degree in signal processing and telecommunications from the University of Rennes in 1997. He is currently an Associate Professor of electrical engineering at the University of Rennes (ENSSAT) and a member of the R2D2 (Reconfigurable Retargetable Digital Devices) Research Team at the IRISA Laboratory. His research interests include memory hierarchy, reconfigurable resources, real-time systems, and middleware. All these topics are studied in the context of SoC design for embedded systems.

Olivier Sentieys received the Engineering degree and the M.S. degree in electronics and signal processing engineering from ENSSAT, University of Rennes, in 1990, the Ph.D. degree in signal processing and telecommunications from the University of Rennes, in 1993, and the "Habilitation à Diriger des Recherches" degree in 1999. He is currently a Professor of electrical engineering at the University of Rennes (ENSSAT). He is the cohead of the R2D2 (Reconfigurable Retargetable Digital Devices) Research Team at the IRISA Laboratory and is a cofounder of Aphycare Technologies, a company developing smart sensors for biomedical applications. His research interests include VLSI integrated systems for mobile communications, finite arithmetic effects, low-power and reconfigurable architectures, and multiple-valued logic circuits. He is the author or coauthor of over 70 journal publications or published conference papers and holds 4 patents.

Hindawi Publishing Corporation
EURASIP Journal on Applied Signal Processing
Volume 2006, Article ID 92849, Pages 1–14
DOI 10.1155/ASP/2006/92849

Optimum Wordlength Search Using Sensitivity Information

Kyungtae Han and Brian L. Evans

Embedded Signal Processing Laboratory, Wireless Networking and Communications Group, The University of Texas at Austin, Austin, TX 78712, USA

Received 2 October 2004; Revised 4 July 2005; Accepted 12 July 2005

Many digital signal processing algorithms are first developed in floating point and later converted into fixed point for digital hardware implementation. During this conversion, more than 50% of the design time may be spent for complex designs, and optimum wordlengths are searched by trading off hardware complexity for arithmetic precision at system outputs. We propose a fast algorithm for searching for an optimum wordlength. This algorithm uses sensitivity information of both hardware complexity and system output error with respect to the signal wordlengths, while other approaches use only one of the two sensitivities. This paper presents various optimization methods and compares sensitivity search methods. Wordlength design case studies for a wireless demodulator show that the proposed method can find an optimum solution in one fourth of the time that the local search method takes. In addition, the optimum wordlength found by the proposed method yields 30% lower hardware implementation costs than the sequential search method in wireless demodulators. Case studies demonstrate that the proposed method is robust for searching for the optimum wordlength in a nonconvex space.

Copyright © 2006 K. Han and B. L. Evans. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

Digital signal processing algorithms often rely on long wordlengths for high precision, whereas digital hardware implementations of these algorithms need short wordlengths to reduce total hardware costs. Determining the optimum wordlength can be time-consuming if assignments of wordlengths are performed by trial and error. In a complex system, 50% of the design time may be spent on wordlength determination [1].

Optimum wordlength choices can be made by solving equations when propagated quantized errors [2] are expressed in an analytical form. However, an analytical form is difficult to obtain in complicated systems. Searching the entire space by simulation is guaranteed to find the optimum wordlength; computation time, however, increases exponentially as the number of wordlength variables increases. For these reasons, many simulation-based wordlength optimization methods have explored a subset of the entire space [3–7].

Choi and Burleson [3] showed how a general search-based wordlength optimization can produce optimal or near-optimal solutions for different objective-constraint formulations. Sung and Kum [4] proposed simulation-based wordlength optimization for fixed-point digital signal processing systems. These search algorithms try to find the cost-optimal solution by using either "exhaustive" search or heuristics.

Han et al. [5] proposed search algorithms that can find the performance-optimal solution by using "sequential" or "preplanned" search. Those algorithms utilize the distortion sensitivity information with respect to the signal wordlengths at the system output, such as the propagated quantized error. Those algorithms assume that the hardware cost of each wordlength is the same. However, complicated digital systems such as a digital transceiver possess different costs or complexity in their digital blocks.

A new algorithm that considers different hardware costs is proposed in [7]. The new algorithm utilizes a measure of the distortion sensitivity as well as the complexity sensitivity, and it speeds up the search for an optimum wordlength by considering both performance and cost in the objective function and the update direction.
This paper is organized as follows. In Section 2, related work on floating-point to fixed-point conversion is presented. Section 3 gives the background for wordlength optimization. Various search methods to find the optimum wordlength are reviewed in Section 4. New sensitivity measures to update search directions are generalized in Section 5. Case studies of the optimum wordlength design are presented in Section 6. In Section 7, simulation results are discussed. Section 8 concludes the paper.

Table 1: Fixed-point conversion approaches for integer wordlength (IWL) and for fractional wordlength (FWL) determination.

Analytical approach                      Statistical approach
Range model (IWL)   Error model (FWL)    Range statistic (IWL)   Error statistic (FWL)
Wadekar [8]         Constantinides [9]   Cmar [10]               Cmar [10]
Stephenson [11]     Shi [12]             Kim [13]                Kum [14]
Nayak [15]          —                    —                       Shi [12]
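The two quantities that the approaches of Table 1 determine — an IWL chosen from a range estimate, and an FWL that fixes the quantization step — can be sketched as follows. This is a toy model; the range value and wordlengths are assumptions for illustration.

```python
# Illustrative sketch: deriving IWL from a range estimate and quantizing
# with the step implied by a chosen FWL.
import math

def iwl_from_range(r):
    """Smallest IWL such that the signed range -2**IWL <= x < 2**IWL covers |x| <= r."""
    return math.ceil(math.log2(r))

def quantize(x, fwl):
    """Round x to the grid with quantization step 2**-fwl."""
    step = 2.0 ** -fwl
    return round(x / step) * step

r = 6.5                   # range statistic, e.g. max |x| seen during simulation
iwl = iwl_from_range(r)   # signed values then live in [-8, 8)
q = quantize(math.pi, 4)  # FWL = 4 gives a step of 0.0625
print(iwl, q)             # 3 3.125
```

A range-based (analytic or statistical) method fixes `iwl`; an error-based method then trades the FWL, and hence the step above, against the precision constraint.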

2. RELATED WORK

During the floating-point to fixed-point conversion process, fixed-point wordlengths, composed of an integer wordlength (IWL) part and a fractional wordlength (FWL) part, are determined by different approaches, as shown in Table 1. Some published approaches for floating-point to fixed-point conversion use an analytic approach for range and error estimation [8, 9, 11, 12, 15], and others use a statistical approach [10, 12–14]. An analytic approach has a range and error model for integer wordlength and fractional wordlength design. Some use a worst-case error model for range estimation [8, 15], and some use forward and backward propagation for IWL design [11]. Still others use an error model for FWL [9, 12]. The advantages of analytic techniques are that they do not require simulation stimulus and can be faster. However, they tend to produce more conservative wordlength results.

A statistical approach has been used for IWL and FWL determination. Some use range monitoring for IWL estimation [10, 13], and some use error monitoring for FWL [10, 12, 14]. The work in [12] also uses an error model whose coefficients are obtained through simulation. The advantage of statistical techniques is that they do not require a range or error model. However, they often need long simulation times and tend to be less accurate in determining wordlengths.

After obtaining models or statistics of range and error by analytic or statistical approaches, respectively, search algorithms can find an optimum wordlength. Some published methods search for the optimum wordlength without sensitivity information [3, 4], and some with sensitivity information [3, 5, 16], as shown in Table 2. "Exhaustive" search [4] and the "branch-and-bound" procedure [3] can find an optimum wordlength without any sensitivity information. However, nonsensitivity methods have an unrealistically large search space as the number of wordlengths increases.

Table 2: Optimum wordlength search methods.

Cost sensitivity:              Local search [3]; Evolutive search [6]
Error sensitivity:             Sequential search [5]; Max-1 search [16]; Preplanned search [5]
Cost and error sensitivity:    Complexity-and-distortion measure search (proposed)
Nonsensitivity:                Exhaustive search [4]; Branch and bound [3]

Some methods use sensitivity information to search for an optimum wordlength. "Local" search [3] and "evolutive" search in [16] use cost sensitivity information. The advantage of cost sensitivity methods is that they can find an optimum wordlength in terms of cost. "Sequential" search and "preplanned" search in [5] and "Max-1" search in [16] use error sensitivity information. The advantage of employing error sensitivity is that these methods find the optimum wordlength in terms of error faster than the cost sensitivity methods. However, neither kind of sensitivity method always reaches the global optimum wordlength.

Cantin et al. provide a useful survey of search algorithms for wordlength determination. In this work, search algorithms are compared, and the "preplanned search" shows the smallest number of iterations to find a solution. However, the heuristic procedures do not necessarily capture the optimum solution to the wordlength determination problem, due to nonconvexity in the constraint space [9]. Thus, the distance between the global optimum wordlength and a local optimum wordlength found by the algorithms is considered. The proposed method is robust in finding a near-optimum wordlength. This paper discusses the distance and robustness of the proposed algorithm in Section 7.

3. BACKGROUND

3.1. Fixed-point data format

When designers model at a high level, floating-point numbers are useful to model arithmetic operations. Floating-point numbers can handle a very large range of values and are easily scaled. In hardware, floating-point data types are typically converted to or built as fixed-point data types to reduce the amount of hardware needed to implement the functionality. To model the behavior of fixed-point arithmetic hardware, designers need bit-accurate fixed-point data types.

Fixed-point data consists of an integer part and a fractional part. The number of bits assigned to the integer representation is called the integer wordlength (IWL), and the number of bits assigned to the fraction is the fractional wordlength (FWL) [17]. The fixed-point wordlength (WL) corresponds to the following equation:

    WL = IWL + FWL.          (1)

The wordlength must be greater than 0. Given IWL and FWL, fixed-point data represent a value in the range R, with quantization step Δ, as

    −2^IWL ≤ R < 2^IWL,  for signed,
    0 ≤ R < 2^IWL,       for unsigned,          (2)
    Δ = 2^−FWL.

IWL and FWL are determined to prevent unwanted overflow and underflow. IWL can be determined by the following relation:

    IWL ≥ ⌈log2 R⌉.          (3)

Here, ⌈x⌉ is the smallest integer that is greater than or equal to x. The range, R, can be estimated by monitoring the maximum and minimum value, or the mean and standard deviation, of a signal [13, 18]. FWL can be determined by wordlength optimization or tradeoffs in the design parameters during fixed-point conversion.

3.2. Formulation of the optimum wordlength

The wordlength is an integer value, and a set of n wordlengths in a system is defined to be a wordlength vector, that is, w ∈ I^n, such as {w1, w2, ..., wn}. We assume that the objective function f is defined by the sum of every wordlength implementation cost function c as

    f(w) = Σ_{k=1}^{n} ck(wk),          (4)

where ck has a real value so that ck : I → R. The quantized performance function p indicates propagated precision or quantized error and is constrained as follows:

    p(w) ≥ Preq,          (5)

where p has a real value so that p : I → R, and Preq is a constant for the required performance. We also consider the lower-bound wordlength w̲ and upper-bound wordlength w̄ as constraints for each wordlength variable:

    w̲k ≤ wk ≤ w̄k,  ∀k = 1, ..., n.          (6)

The complete wordlength optimization problem can then be stated as

    min_{w∈I^n} f(w)  subject to  p(w) ≥ Preq,  w̲ ≤ w ≤ w̄.          (7)

The goal of the wordlength optimization is hence to search for the optimizer w* that minimizes the objective function f(w) in (7).

3.3. Finding the optimum wordlength

One of the algorithms for searching for the "optimum" wordlength starts with an initial feasible solution w(0) and performs an update via

    w(h+1) = w(h) + s ξ(h).          (8)

Here, h is an iteration index, s is the integer step size, and ξ is an integer update direction. A sound initial guess, a well-chosen step size, and a well-chosen update direction can reduce the number of iterations needed to find optimum wordlengths.

Optimum wordlengths can be found by solving equations when the performance function p is expressed in an analytical form. If there is no analytical form to express the performance, then simulation-based search methods can be used to search for optimum wordlengths by measuring the performance function. Typical approaches involve assigning the wordlength vector w(0) to a lower bound, an upper bound, or a vector between the lower and upper bounds. The step size can be fixed or adapted. The update direction is adapted according to the search algorithms in Section 4.

During iteration, the stopping criteria depend on the search algorithm. The algorithm that starts from the lower bound stops when the performance P reaches the required performance Preq. The algorithm that starts from the upper bound stops when P falls below Preq. Other algorithms stop when the performance P or the cost c converges within a neighborhood.

4. REVIEW OF SIMULATION-BASED SEARCH METHODS

Optimum wordlengths can be found by solving equations when the performance function P is expressed in an analytical form. If there is no analytical form to express the performance, then simulation-based search methods can be used to search for optimum wordlengths by measuring the performance at the system output.

4.1. Complete search

Complete search (CS) tests every possible combination of wordlengths between the lower bound and the upper bound and measures the performance of each combination by simulation. The optimum wordlengths can then be selected from the simulation results.

For example, assuming that the number of independent variables is two, and the lower bound and upper bound are {2, 2} and {8, 7}, respectively, the possible wordlength combinations are shown in Figure 1. The number of trial tests, or trials, is 42. The optimum wordlength can be selected from the given simulation results after the simulation completes.

The total number of tests for N wordlength variables is

    E_CS = Π_{k=1}^{N} (w̄k − w̲k + 1).          (9)
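The complete search can be sketched directly from the trial count above: enumerate the grid between the bounds used in Figure 1 and keep the cheapest feasible point. The cost and performance models below are stand-ins (assumptions), not the paper's demodulator.

```python
# Illustrative sketch of complete search over the bounds {2, 2}..{8, 7}.
from itertools import product

lo, hi = (2, 2), (8, 7)
P_REQ = 30.0

def cost(w):   # assumed cost model: sum of the wordlengths, in the spirit of (4)
    return sum(w)

def perf(w):   # assumed performance model standing in for a simulated p(w)
    return 4.0 * w[0] + 2.0 * w[1]

grid = list(product(range(lo[0], hi[0] + 1), range(lo[1], hi[1] + 1)))
print(len(grid))   # 42 trials: (8 - 2 + 1) * (7 - 2 + 1), matching (9)

feasible = [w for w in grid if perf(w) >= P_REQ]
best = min(feasible, key=cost)
print(best, cost(best))   # -> (6, 3) 9
```

Every grid point is simulated once, which is why the trial count — unlike the sensitivity-guided methods below — grows exponentially with the number of wordlength variables.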

each of wordlength cost is similar, the search path is shown in Figure 2. An optimized point {5, 5} is given for a comparison 7 between search methods. The minimum number of trials is 6 24. We have generalized the total number of experiments of N 5 the exhaustive search in dimensions with the sum of the distance. The sum of the distance, d,isdefinedas 2

w 4 d = dw1 + dw2 + ···+ dwN , (10) 3 where dwi is the distance between the minimum wordlength 2 and the optimum wordlength in ith dimension. The expected number of experiments of the exhaustive search is calculated 1 by using the summation of combination-with-replacement in [19]as 12345678 d−1 w N R 1 EES(d) = C (N, r) r=0 Figure 1: The possible wordlength combinations searching the en- ={ } ={ } = N + d − 1 tire space in complete search (w 2, 2 ; w 8, 7 ;trials 42). = CR(N +1,d − 1) = d − 1 N d − = ( + 1)! (N + d − 1) − (d − 1) !(d − 1)! 7 d N − ··· d d d = ( + 1) ( +2)( +1) . N 6 ! (11) wopt 5 24 The trials may be bounded as

2 23

w 4 dw N N,d N 2 EES(d) ≤ EES

12345678 4.3. Sequential search w 1 The basic notion of sequential search is that each trial elimi- ={ } nates a portion of the region being searched [5]. This proce- Figure 2: The direction of exhaustive search (w 2, 2 ; optimum dure is also called a “Min+b search” in [16]. The sequential point ={5, 5}; distance d in (10)is6;trials= 24). search method decides where the most promising areas are located, and continues in the most favorable region after each set of experiments [20]. The sequential search algorithm can Complete search is guaranteed to find a global optimum be summarized by the following four steps. point, but computational time and the number of tests (1) Select a set of values for the independent variables, increase exponentially as the number of wordlength variables which satisfy the desired system performance during the one- increases. variable simulations. (2) Evaluate the system performance. 4.2. Exhaustive search (3) Choose feasible locations at which system perfor- mance is evaluated. Sung and Kum [4] search for the first feasible solution. They (4) If the system performance of one point is better than search for a wordlength with the minimum wordlength as others, then move to the better point, and repeat the search, the initial guess and increment the wordlength by one until until the point has been located within the desired accuracy. the propagated error meets the minimum error. For example, The base point is the minimum wordlength as an initial assuming that we are trying to find the optimum wordlength wordlength w(0) in (8). In step (3), the direction of search, ξ for two variables, the minimum wordlengths are {2, 2},and as in (8) is chosen in accordance with maximum derivative K. Han and B. L. Evans 5


Figure 3: The direction of sequential search (w = {2, 2}; optimum point = {5, 5}; distance d in (10) is 6; trials = 12).

Figure 4: The direction of preplanned search (w = {2, 2}; optimum point = {5, 5}; distance d in (10) is 6; trials = 6).

    ξ = {1, 0, 0, ..., 0}   if m_j = ∇p_{w1},
        {0, 1, 0, ..., 0}   if m_j = ∇p_{w2},
        ···
        {0, 0, ..., 0, 1}   if m_j = ∇p_{wN},   (13)

    m_j = max( ∇p_{w1}, ∇p_{w2}, ..., ∇p_{wN} ),

where ∇ is the gradient operator.

In Figure 3, starting from the wordlength base point {2, 2}, we measure the performance of {2, 3} and {3, 2} from the direction of sequential search in step (3). If the performance of {3, 2} is better than that of {2, 3}, then the new wordlength vector moves to {3, 2}. Simulations are repeated until the desired performance is satisfied.

We have generalized the trials of the sequential search in N dimensions as

    E_SS^N = N · (dw_1 + dw_2 + ··· + dw_N).   (14)

In this example, the number of trials is 12 from (14) and also 12 from Figure 3. The number of trials is reduced by using sensitivity information. However, the optimum wordlength found can be a local optimum.

Local search [3] uses sensitivity information with the above procedure, but it uses cost sensitivity instead of performance sensitivity.

4.4. Preplanned search

A preplanned search [5] is one in which all the experiments are completely scheduled in advance. The directions are obtained from the sensitivity of the performance of an independent variable. The optimum point is found by employing the steepest descent among local neighbor points. The preplanned search algorithm in N dimensions is summarized by the following steps.

(1) Select a set of values for the independent variables which satisfy the desired performance during the one-variable simulations.
(2) Make a performance sensitivity list from the one-variable simulations.
(3) Make a test schedule with the sensitivity list to follow the higher-sensitivity points from the base point.
(4) Evaluate the performance at those points.
(5) Move to the points until the point has been located within the desired accuracy.

In step (3), the direction of preplanned search is chosen in accordance with the maximum derivative of an independent performance

    ξ_j = {1, 0, 0, ..., 0}   if m_j = ∇p^1_{w1},
          {0, 1, 0, ..., 0}   if m_j = ∇p^2_{w2},
          ···
          {0, 0, ..., 0, 1}   if m_j = ∇p^N_{wN},   (15)

where

    m_j = max( ∇p^1_{w1}, ∇p^2_{w2}, ..., ∇p^N_{wN} ).   (16)

In Figure 4, starting from the base point {2, 2}, the preplanned search makes a list of the directions of the steepest ascent by comparing the gradients of the independent performances in one dimension from the one-variable simulations. If the gradient, which is calculated from the one-variable simulations, at w_1 of 2 bits is larger than that at w_2 of 2 bits, then the next feasible location is {3, 2}. Then, if the gradient at w_1 of 3 is smaller than that at w_2 of 2, the next feasible location is {3, 3}. The simulation path would be {2, 2}, {3, 2}, {3, 3}, and so forth. After scheduling the feasible points, the performance of these points is evaluated until the value of the performance meets the desired accuracy.

We generalized the trials of the preplanned search in N dimensions as

    E_PS^N = dw_1 + dw_2 + ··· + dw_N.   (17)

In this example, the trials are 6 from (17) and from Figure 4. The number of trials is the least among the search methods reported so far. However, finding the global optimum wordlength is not guaranteed.

4.5. Search example in CDMA demodulator

Typical demodulators are implemented with an analog block in front of an analog-to-digital converter (ADC) block, as shown in Figure 5(a). As the speed of the ADC increases, analog parts are replaced with digital parts in communication systems [21]. We replaced the analog demodulator with the digital demodulator, as shown in Figure 5(b).

Figure 5: Analog and digital demodulators in CDMA receiver and performance measurement position. (a) Analog demodulator. (b) Digital demodulator.

The demodulator converts modulated signals into baseband signals. In the digital demodulator block of Figure 6, the sampled data values output by the ADC are multiplied by a carrier signal to shift the spectrum down to the baseband. The out-of-band signal is removed by the lowpass filter (LPF). The variables in the digital demodulator are given below [22, 23]:

(i) B_i: input wordlength;
(ii) B_c: carrier wordlength;
(iii) B_m: multiplier output wordlength;
(iv) B_f: filter output wordlength;
(v) B_fc: filter coefficient wordlength.

Figure 6: A digital demodulator block.

The output SNR is used for performance measurement instead of frame error rate (FER), which is a general measurement to evaluate CDMA systems, because direct measurement of FER requires at least 10^5 frames during the simulation [24]. The required output SNR in this system is over 0.8 dB, or the FER is under 0.03 [23].

For the initial point, the minimum wordlength is selected by the independent one-variable simulations in which one variable changes while the other variables keep high precision. Satisfying the output SNR of 0.8 dB, the minimum wordlength of {B_i, B_c, B_m, B_f, B_fc} is {4, 3, 4, 5, 7}, which is acquired from the one-variable simulations shown in Figure 7. For a simplified example, we assume that the cost per bit is one.

In the exhaustive search, the next wordlength design points searched are {5, 3, 4, 5, 7}, {4, 4, 4, 5, 7}, {4, 3, 5, 5, 7}, {4, 3, 4, 6, 7}, {4, 3, 4, 5, 8}, {5, 4, 4, 5, 7}, and so forth. The search is continued until the communications performance meets the specific desired requirement. In the sequential search, the next point is one of the following: {5, 3, 4, 5, 7}, {4, 4, 4, 5, 7}, {4, 3, 5, 5, 7}, {4, 3, 4, 6, 7}, and {4, 3, 4, 5, 8}. The next point would have the largest communication performance among them. From Table 3, {4, 3, 4, 6, 7} is the next point, because it has the largest communication performance.
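The selection of the next point in the sequential-search step above amounts to a maximization over the candidate wordlength vectors. As a minimal sketch in Python, using the step-3 output SNRs reported in Table 3:

```python
# One sequential-search step for the CDMA demodulator example.
# Candidate wordlength vectors {Bi, Bc, Bm, Bf, Bfc} and their output
# SNRs (dB) are taken from the step-3 rows of Table 3.
candidates = {
    (5, 3, 4, 5, 7): 0.735,
    (4, 4, 4, 5, 7): 0.694,
    (4, 3, 5, 5, 7): 0.712,
    (4, 3, 4, 6, 7): 0.759,
    (4, 3, 4, 5, 8): 0.704,
}

# Move to the candidate with the largest communication performance.
best = max(candidates, key=candidates.get)
print(best)  # (4, 3, 4, 6, 7)
```

The chosen point {4, 3, 4, 6, 7} then becomes the base point for the next round of single-bit increments, exactly as Table 3 shows.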

Table 3: Sequence of the sequential search for the CDMA demodulator (traffic channel rate set 1 in additive white Gaussian noise, input SNR = −17.3 dB, Eb/Nt = 3.8, rate = 9600 bps, and desired performance: output SNR > 0.8 dB, FER < 0.03).

Step | {Bi, Bc, Bm, Bf, Bfc} | Output SNR | FER   | Result
1, 2 | {4, 3, 4, 5, 7}       | 0.711      | 0.038 | Fail
3    | {5, 3, 4, 5, 7}       | 0.735      | —     | —
3    | {4, 4, 4, 5, 7}       | 0.694      | —     | —
3    | {4, 3, 5, 5, 7}       | 0.712      | —     | —
3    | {4, 3, 4, 6, 7}       | 0.759      | —     | Max
3    | {4, 3, 4, 5, 8}       | 0.704      | —     | —
4    | {4, 3, 4, 6, 7}       | 0.759      | 0.035 | Fail
3    | {5, 3, 4, 6, 7}       | 0.763      | —     | —
3    | {4, 4, 4, 6, 7}       | 0.722      | —     | —
3    | {4, 3, 5, 6, 7}       | 0.773      | —     | Max
3    | {4, 3, 4, 7, 7}       | 0.751      | —     | —
3    | {4, 3, 4, 6, 8}       | 0.749      | —     | —
4    | {4, 3, 5, 6, 7}       | 0.773      | 0.034 | Fail
...  | ...                   | ...        | ...   | ...
3    | {6, 3, 5, 6, 7}       | 0.798      | —     | —
3    | {5, 4, 5, 6, 7}       | 0.802      | —     | —
3    | {5, 3, 6, 6, 7}       | 0.805      | —     | Max
3    | {5, 3, 5, 7, 7}       | 0.803      | —     | —
3    | {5, 3, 5, 6, 8}       | 0.798      | —     | —
4    | {5, 3, 6, 6, 7}       | 0.805      | 0.029 | Pass

The simulation moves the current point to the new point and continues to search until the performance exceeds the specific desired requirement, which is an output SNR of 0.8 dB in this case. The final point is {5, 3, 6, 6, 7}, as shown in Table 3. The distance between the base and the optimum point is 4 by using (10). The number of trials for the sequential search to find an optimum wordlength is 20 by using (14).

In the preplanned search, the search path is estimated from the sensitivity of each one-variable simulation shown in Figure 7. Starting from the minimum wordlength, or base point, {4, 3, 4, 5, 7}, the first expected point is {4, 3, 4, 6, 7} because B_f has the greatest derivative among the wordlengths at the base point in Figure 7. The sequence of the preplanned search points is {4, 3, 4, 5, 7}, {4, 3, 4, 6, 7}, {4, 3, 4, 6, 8}, {4, 3, 5, 6, 8}, {4, 4, 5, 6, 8}, and so forth. Simulations move the current point to the next point until the performance exceeds the specific desired requirement. The optimum point is {5, 4, 5, 6, 8} and the distance is 5 by using (10). The number of trials of the preplanned search to find an optimum wordlength is 5 by using (17).

Figure 7: Result of the independent one-variable simulations on a CDMA demodulator (output SNR versus wordlength for B_i, B_c, B_m, B_f, and B_fc).

4.6. Comparison

The four search methods are compared with the trials from (9), (11), (14), and (17), as shown in Table 4. The numbers of trials are calculated besides the one-variable simulations, which all of the search methods use. The complete search needs 283920 trials to find the optimum wordlength from (9) with w̄_k = {16, 16, 16, 16, 16} and w_k = {4, 3, 4, 5, 7}, assuming that the maximum wordlength is 16 bits. If the computer simulation to calculate the frame error rate per trial in a CDMA system takes about 10 minutes, the complete search to find an optimum wordlength would require 5 years, which is unrealistic design time.

Table 4: Comparison of complete, exhaustive, sequential, and preplanned search (N = 5, w̄_k = {16, 16, 16, 16, 16}, w_k = {4, 3, 4, 5, 7}, and the term d is defined in (10)).

Search     | Distance (d) | Number of experiments, from (9), (11), (14), (17) | Trials
Complete   | —            | \prod_{k=1}^{N} (w̄_k − w_k + 1)                   | 283920
Exhaustive | 4            | (d + 4)(d + 3) ··· (d) / 5!                       | 56
Sequential | 4            | 5 · d                                             | 20
Preplanned | 5            | d                                                 | 5
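The trial counts in Table 4 can be reproduced directly from (9), (11), (14), and (17); the 10-minutes-per-trial figure used below is the assumption stated in the text. A quick check in Python:

```python
from math import comb, prod

# CDMA example of Table 4: N = 5 wordlength variables with minimum
# wordlengths {4, 3, 4, 5, 7} and a maximum of 16 bits each.
w_min = [4, 3, 4, 5, 7]
w_max = [16] * 5
N = len(w_min)
d = 4  # distance (10) for the exhaustive/sequential rows

complete = prod(hi - lo + 1 for lo, hi in zip(w_min, w_max))  # (9)

def exhaustive_trials(d, n):
    # (11): sum of combinations-with-replacement C^R(n, r), r = 0..d-1
    return sum(comb(n + r - 1, r) for r in range(d))

sequential = N * d  # (14)
preplanned = 5      # (17): equal to the preplanned distance d = 5

print(complete, exhaustive_trials(d, N), sequential, preplanned)
# 283920 56 20 5, matching the Trials column of Table 4

# At ~10 minutes of simulation per trial, complete search is impractical:
years = complete * 10 / (60 * 24 * 365)
print(round(years, 1))  # ~5.4 years
```

The ~5.4-year estimate confirms the "5 years" order of magnitude quoted in Section 4.6.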

The exhaustive search needs 56 trials by using (11), which is fewer than the complete search. The exhaustive search is, however, inefficient at finding the optimum wordlength when the wordlength variables for optimization are numerous and the distance between the base and optimum points is long.

The sequential search and preplanned search require 20 and 5 trials, respectively, which are fewer than the other search methods. The preplanned search has the lowest number of experiments among the search methods, but its distance using (10) is larger than that of the sequential search. This implies that the wordlength found by the sequential search method is closer to a global optimum with respect to hardware cost.

The sequential search and preplanned search have the loss-of-direction problem encountered by techniques based on the gradient projection method. This problem can be solved by adapting the step size.

The sequential search and the preplanned search reduce the trials by rates of 64% and 91%, respectively, when compared to the exhaustive search for wordlength optimization in the CDMA demodulator design. However, the preplanned search seldom converges to the same optimum point, and its distance is longer than that of the other search methods.

5. SENSITIVITY MEASUREMENTS

The sensitivity information used for the update directions in (8) can help reduce the search space dramatically. The sensitivity information can be obtained by measuring hardware complexity and distortion, or propagated quantized precision loss. Complexity measure is used for the hardware cost function in [3]. Distortion measure in [5] utilizes the sensitivity information of a propagated quantization error. Complexity-and-distortion measure in [7] combines the two measures to update the search direction.

5.1. Complexity measure

The complexity measure method considers the hardware complexity function as the cost function in (4) and uses the sensitivity information of the complexity as the direction to search for the optimum wordlengths. The local search in [3] uses the complexity measure. The sensitivity information is calculated by the gradient of the complexity function. For the steepest descent direction, the update direction is

    ξ_CM = −∇f_c(w),   (18)

where ∇ is the gradient of the function.

The complexity measure method updates wordlengths in the direction of the lowest complexity sensitivity until the system meets a required performance such as P_req in (5). The complexity measure method searches for the wordlengths that minimize hardware complexity. However, it demands a large number of iterations, since it does not use any distortion sensitivity information that could speed up finding the optimum wordlengths. For example, in a system composed of adders and multipliers, the complexity sensitivity of a multiplier is larger than that of an adder. The complexity measure method then increases the wordlength of the adder with priority during an increase procedure, even if the wordlength of the multiplier affects the propagated quantized performance more. This wastes computer simulation time if the complexity sensitivity of an adder is much smaller than that of a multiplier.

5.2. Distortion measure

The distortion measure method considers the distortion function as the objective function in (4) and uses the sensitivity information of the distortion for the direction to search for the optimum wordlengths. The sequential search uses the distortion measure. This method assumes that every cost or complexity function is the same or equal to 1, and selects wordlengths with the update direction according to the distortion sensitivity information.

The complexity objective function is replaced with the distortion objective function d(w) as

    f_d(w) = d(w),   (19)

and the complexity minimization problem is changed into a distortion minimization problem:

    min_{w ∈ I^n} f_d(w), subject to d(w) ≤ D_req, c(w) ≤ C_req, w ≤ w ≤ w̄,   (20)

where D_req is the required distortion and C_req is a complexity constant.

The sensitivity information is also calculated by the gradient of the distortion function. For the steepest descent direction, the update direction is

    ξ_DM = −∇f_d(w).   (21)

For the distortion, Fiore and Lee [25] computed an error variance, and Han et al. [5] measured the output SNR.

The distortion measure method reduces the number of iterations for searching the optimum wordlengths, since


Figure 8: Wordlength model for a fixed broadband wireless access demodulator.

the search direction depends on the distortion as the wordlengths change. This method rapidly finds the optimum wordlength satisfying the required performance in a smaller number of iterations compared to the complexity measure method. However, the resulting wordlengths are not guaranteed to be optimum in terms of the complexity.

5.3. Complexity-and-distortion measure

The complexity-and-distortion measure combines the complexity measure with the distortion measure by a weighting factor. In the objective function, both complexity and distortion are simultaneously considered. We normalize the complexity and the distortion functions and multiply them by complexity and distortion weighting factors, α_c and α_d, respectively. The new objective function is

    f_cd(w) = α_c · c_n(w) + α_d · d_n(w),   (22)

where c_n(w) and d_n(w) are the normalized complexity and distortion functions, respectively. The relation between the weighting factors is

    α_c + α_d = 1,   (23)

where

    0 ≤ α_c ≤ 1,  0 ≤ α_d ≤ 1.   (24)

Using (22), the objective function gives a new optimization problem

    min_{w ∈ I^n} f_cd(w), subject to d(w) ≤ D_req, c(w) ≤ C_req, w ≤ w ≤ w̄,   (25)

where D_req and C_req are the required distortion and a complexity constant, respectively. This optimization problem is to find the wordlengths that minimize complexity and distortion simultaneously according to the weighting factors.

The update direction for the steepest descent direction to find the optimum wordlength w is

    ξ_CDM = −∇f_cd(w).   (26)

From (22) and (26), the update direction is

    ξ_CDM = −( α_c · ∇c_n(w) + α_d · ∇d_n(w) ).   (27)

Setting the complexity and distortion weighting factors, α_c or α_d, between 0 and 1, the complexity-and-distortion method searches for an optimum wordlength with tradeoffs between the complexity measure method and the distortion measure method. The complexity-and-distortion measure becomes the complexity measure or the distortion measure when α_d = 0 or α_c = 0, respectively.

The complexity-and-distortion measure method can reduce the number of iterations for searching the optimum wordlengths, because the distortion sensitivity information is utilized. This method can more rapidly find the optimum wordlength that satisfies the required performance using fewer iterations compared to the complexity measure method. However, the wordlengths are not guaranteed to be optimal in terms of the complexity.

6. CASE STUDY

6.1. OFDM demodulator design

Digital communication systems have digital blocks, such as a demodulator, that need wordlength optimization. The searching algorithms in Section 4 were applied to the wordlength optimization of the CDMA demodulator design in Section 4.5. From the CDMA case study, the sequential search is one of the promising methods to find an optimum wordlength. In this section, the complexity measure, distortion measure, and complexity-and-distortion measure of Section 5 are applied in the sequential search framework to determine wordlengths for a fixed broadband wireless demodulator.

Fixed broadband wireless access technology is intended for high-speed voice, video, and data services, which are presently dominated by cable and digital subscriber line technologies [26]. One of the designs for orthogonal frequency division multiplexing (OFDM) demodulators for fixed broadband wireless access is shown in Figure 8. For the wireless channel, we used Stanford University Interim models [27, 28].

The main blocks in the demodulator for finite wordlength determination are the fast Fourier transform (FFT), equalizer, and estimator. For wordlength variables, we choose the wordlengths that have the most significant effect on complexity and distortion in the system. For the OFDM demodulator, we select wordlength variables w0, w1, w2, and w3 for the FFT input, equalizer right input, channel estimator input, and equalizer upper input, respectively, as shown in Figure 8.
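The CDM update direction (27) interpolates between the two pure measures: α_c = 0 recovers the distortion-measure direction (21) and α_c = 1 the complexity-measure direction (18). A minimal sketch, with hypothetical normalized gradient values standing in for measured sensitivities:

```python
import numpy as np

def cdm_direction(grad_cn, grad_dn, alpha_c=0.5):
    """Update direction (27): -(alpha_c * grad c_n(w) + alpha_d * grad d_n(w)),
    with alpha_d = 1 - alpha_c from (23)."""
    alpha_d = 1.0 - alpha_c
    return -(alpha_c * np.asarray(grad_cn, dtype=float)
             + alpha_d * np.asarray(grad_dn, dtype=float))

# Hypothetical sensitivities at the current wordlength vector:
grad_cn = [0.8, 0.1, 0.4]     # complexity grows fastest with the 1st variable
grad_dn = [-0.2, -0.9, -0.3]  # distortion falls fastest with the 2nd variable

print(cdm_direction(grad_cn, grad_dn, alpha_c=0.0))  # distortion measure (21)
print(cdm_direction(grad_cn, grad_dn, alpha_c=1.0))  # complexity measure (18)
print(cdm_direction(grad_cn, grad_dn, alpha_c=0.5))  # balanced tradeoff
```

With α_c = 0.5, the variable whose extra bit buys the largest distortion reduction per unit of added complexity dominates the direction, which is the tradeoff the CDM method is designed to exploit.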

We assume that the internal wordlengths of the given blocks have already been decided. In simulation, only the inputs to each block are constrained to be of fixed-point type, whereas the blocks themselves are simulated in floating-point type.

For the hardware complexity, the number of multiplications is measured, assuming that processing units are not reused. The number of multiplications in a K-point FFT block is

    Cost_FFT = (K/2) log_2 K,   (28)

where K is the FFT length. The cost of the 256-point FFT in the fixed broadband wireless access is estimated to be 1024. Approximately, the simplified complexity vector c of the wordlength per bit is assumed to be {1024, 1, 128, 2} from [4, 29]. We also assume the complexity increases linearly as the wordlength increases, to simplify the demonstration.

For the distortion measurement, the bit error rate (BER) is measured. The minimum wordlength, searched by changing one wordlength variable while the other variables have high precision (i.e., 16 bits), is used for the initial wordlength [4, 5]. The simulation for the minimum wordlength is shown in Figure 9.

Figure 9: Wordlength effect for the demodulator in Figure 8, with Stanford University Interim wireless channel model number 3, SNR of 20 dB, FFT length of 256, and least-squares comb-type channel estimator without error-control coding.

Assuming the minimum performance of BER is 5 × 10^-3, the minimum wordlength is {5, 4, 4, 4} from Figure 9. Starting from the minimum wordlength, wordlengths are increased according to the sensitivity information of the different measures in Section 5. We measure the number of iterations until they find their own optimum wordlength satisfying the required performance, such as BER ≤ 2 × 10^-3 without channel decoder. For the optimum wordlength, we follow the hybrid procedure [16] that combines a wordlength increase followed by a wordlength decrease. Simulation results are presented in Section 7.

6.2. IIR filter case

The OFDM demodulation case requires a large number of long simulations. This becomes especially time-consuming when each simulation takes hours in ensemble averaging of the BER estimation. For a more general case, an infinite impulse response (IIR) filter that has 7 wordlengths is simulated. There are various methods for obtaining the error function and cost function, as described in the related-work section. To simplify the simulation, the mean square error (MSE) is measured for the error function, and a linear cost function of wordlength is assumed. The required performance of the IIR filter is assumed to be an MSE of 0.1. In the IIR filter case study, the wordlength vector has 7 elements, and the hardware complexity of the arithmetic blocks has less difference when compared to the OFDM case study. Results are presented in Section 7.

7. RESULTS

The wordlength optimization problem is a discrete optimization problem with a nonconvex constraint space [30]. This nonconvexity makes it harder to search for a global optimum solution [31]. Tables 5 and 6 show that there are several local optimum wordlengths that satisfy the error specification and minimize hardware complexity in the case studies. In this section, the wordlength optimization methods used in the case studies are compared in terms of number of iterations and hardware complexity, and future work is discussed.

7.1. Number of iterations

The number of iterations to search an optimum wordlength in the OFDM demodulator design is shown in Figure 10. The initial wordlength does not satisfy the desired performance. After a number of trials updating the wordlength as in (8), the error at the system output decreases. The sequential search and the CDM search reach the feasible area after 15 trials. However, the local search takes 38 trials. After arriving at the feasible area, an optimum wordlength is searched again. In this case, the wordlengths searched by the sequential search or the CDM search already arrive at an optimum wordlength. However, the local search needs more iterations to find an optimum wordlength. The total number of trials to find an optimum wordlength in each method for the OFDM case is shown in Table 5. The sequential search and the CDM method can find an optimum solution in one-fourth of the time that the local search method takes.

In IIR filter design, the number of iterations to search an optimum wordlength is shown in Figure 11. This figure demonstrates the number of trials in an infeasible area and a feasible area. After the search methods reach a feasible region, where the MSE of the IIR filter is under 0.1, the search methods continue searching an optimum wordlength. The sequential

Table 5: Simulation results of several search methods starting from the minimum wordlength for the demodulator arcs in Figure 8 (N = 4, w_k = {5, 4, 4, 4}, and w̄_k = {16, 16, 16, 16}). CDM is the complexity-and-distortion measure; α_c is a weighting factor.

Search method         | α_c | Number of trials | Wordlengths for variables | Complexity estimate
Sequential search [5] | 0   | 16               | {10, 9, 4, 10}            | 10781
CDM                   | 0.5 | 15               | {7, 10, 4, 6}             | 7702
Local search [3]      | 1   | 69               | {7, 7, 4, 6}              | 7699
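The complexity estimates in Table 5 follow from the linear cost model of Section 6.1, with the per-bit complexity vector c = {1024, 1, 128, 2} for (w0, w1, w2, w3). A quick check in Python:

```python
# Per-bit complexity vector for (w0, w1, w2, w3) from Section 6.1.
c = [1024, 1, 128, 2]

# Optimum wordlengths reported in Table 5.
table5 = {
    "sequential": [10, 9, 4, 10],
    "cdm":        [7, 10, 4, 6],
    "local":      [7, 7, 4, 6],
}

costs = {name: sum(ci * wi for ci, wi in zip(c, w))
         for name, w in table5.items()}
print(costs)
# {'sequential': 10781, 'cdm': 7702, 'local': 7699} -- the Table 5 estimates
```

The dominant 1024-per-bit FFT-input term explains why the sequential search result, which widens w0 to 10 bits, costs roughly 40% more hardware than the CDM and local search results.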

Table 6: Simulation results in the IIR filter for several search methods (N = 7, w_k = {1, 1, 1, 1, 1, 1, 1}, and w̄_k = {16, 16, 16, 16, 16, 16, 16}). CDM is the complexity-and-distortion measure; α_c is a weighting factor. (Max-1 search starts from w̄_k; sequential search starts from w_k.)

Search method         | α_c  | Number of trials | Wordlengths for variables | Complexity estimate
Max-1 search [16]     | 0    | 94               | {4, 5, 4, 5, 2, 2, 4}     | 378
Sequential search [5] | 0    | 56               | {4, 5, 4, 5, 2, 2, 4}     | 378
CDM                   | 0.25 | 44               | {4, 5, 4, 4, 2, 2, 5}     | 366
CDM                   | 0.5  | 33               | {6, 5, 5, 4, 1, 2, 4}     | 363
CDM                   | 0.75 | 71               | {6, 4, 4, 4, 2, 16, 13}   | 561
Local search [3]      | 1    | 126              | {9, 5, 16, 4, 1, 16, 16}  | 723
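As an illustration of the distortion-measure (sequential) search used in the IIR case, the sketch below runs the same greedy move-to-best-neighbor loop on a small hypothetical quantization problem: a 3-tap FIR filter with quantized input and coefficients, MSE as the distortion, and a one-bit increment per variable per move. The filter, signal, and MSE target are illustrative assumptions, not the paper's IIR benchmark.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(2000)
h = np.array([0.5, -0.25, 0.125])  # toy filter; not the paper's benchmark

def quantize(v, bits):
    # Uniform fixed-point quantizer on [-1, 1) with the given wordlength.
    scale = 2.0 ** (bits - 1)
    return np.clip(np.round(v * scale), -scale, scale - 1) / scale

def mse(w):
    # Distortion when input and coefficients use wordlengths w = (w_x, w_h).
    y_ref = np.convolve(x, h)
    y_q = np.convolve(quantize(x / 4, w[0]) * 4, quantize(h, w[1]))
    return float(np.mean((y_ref - y_q) ** 2))

# Sequential (distortion-measure) search, steps (1)-(4) of Section 4.3:
w = [2, 2]      # minimum wordlengths as the base point
target = 1e-4   # assumed required MSE
trials = 0
while mse(w) > target and max(w) < 16:
    # Probe each neighbor that adds one bit to one variable,
    # then move to the neighbor with the smallest distortion.
    cands = [w[:i] + [w[i] + 1] + w[i + 1:] for i in range(len(w))]
    scores = [mse(c) for c in cands]
    trials += len(cands)
    w = cands[int(np.argmin(scores))]

print("wordlengths:", w, "trials:", trials)
```

Swapping `mse` for a weighted sum of normalized cost and distortion, as in (22), turns this same loop into the CDM search compared in Table 6.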

Figure 10: Number of iterations for optimum wordlength with various search algorithms in OFDM demodulator wordlength design.

Figure 11: Number of iterations for optimum wordlength in the IIR filter with various search algorithms.

search and the local search need a total of 56 and 126 iterations, respectively, including iterations in the feasible and infeasible areas, as shown in Table 6. The "Max-1" search, starting from the feasible area, needs 96 iterations. The CDM methods with weighting factors of 0.25, 0.5, and 0.75 are used for comparison. When α_c is less than 0.5, the CDM methods have the property of the sequential search. When α_c is greater than 0.5, the CDM methods search as the local search does. In Figure 11, the CDM methods with weighting factors of 0.25 and 0.75 show similar shapes to the sequential search and the local search, respectively. In the IIR filter case, the CDM method with α_c of 0.5 can find an optimum solution in one-fourth of the time that the local search method takes.

In general, if error sensitivity information is used for searching an optimum wordlength, the number of iterations can be reduced. The sequential search and the CDM methods with α_c less than 1 use the error sensitivity information. Thus, they converge quickly to an optimum wordlength that satisfies the required error performance.

7.2. Hardware complexity

Tables 5 and 6 show the hardware complexity according to the optimum wordlengths found by the various methods. The results show that the sequential search method, which only uses error sensitivity information for the update direction,

Table 7: Simulation results in noise cancellation with a Wiener filter [32] for several search methods (N = 5, w_k = {1, 1, 1, 1, 1}, and w̄_k = {16, 16, 16, 16, 16}). CDM is the complexity-and-distortion measure; α_c is a weighting factor.

Search method         | α_c  | Number of trials | Wordlengths for variables | Complexity estimate
Sequential search [5] | 0    | 21               | {4, 5, 5, 3, 2}           | 1331
CDM                   | 0.25 | 23               | {4, 4, 5, 4, 2}           | 1200
CDM                   | 0.5  | 24               | {5, 4, 4, 5, 4}           | 1074
CDM                   | 0.75 | 167              | {4, 4, 4, 5, 4}           | 1073
Local search [3]      | 1    | 170              | {5, 4, 4, 15, 3}          | 1082

finds an optimum wordlength that has higher complexity than the CDM method and the local search in the OFDM demodulation case study. However, an optimum wordlength searched by the local search method, which uses hardware complexity information, has higher complexity in the IIR filter case study. If the design space is convex and has only one optimum solution, then the various search methods find the optimum solution. However, the wordlength optimization problem has many local optimum solutions because of the nonconvex space. As the number of wordlength variables increases and as the system becomes more complicated, the probability of being stuck in a local optimum solution increases. In the IIR filter case with 7 elements in the wordlength vector, the wordlengths searched by the local search method are far from globally optimal.

The CDM search with the weighting factor of 0.5 finds an optimum wordlength that has the lowest hardware complexity in this IIR case study. The CDM search with the weighting factor of 0.75 tends toward the local search. The hardware complexity from the CDM method with 0.75 is between the CDM with 0.5 and the local search. Similarly, the complexity from the CDM method with 0.25 is between the sequential search and the CDM with 0.5.

For more examples, additional optimum wordlength search results in noise cancellation with a Wiener filter [32] are shown in Table 7.

7.3. Discussion

The CDM method, which uses error and complexity sensitivity for the optimum wordlength search, takes advantages from both the sequential search and the local search. This method reduces the number of iterations because the error sensitivity helps it quickly reach the feasible boundary. At the same time, this method finds a near-optimum wordlength that has lower hardware complexity because of the sensitivity of hardware complexity. The proposed method is robust for searching the optimum wordlength in a nonconvex space because this method is not easily captured by local optimum solutions.

The complexity-and-distortion measure method has the flexibility to search for an optimum wordlength by setting the weighting factor. The designer can select the weighting factor α_c as in (23). An α_c of 0.5 means that the CDM method equally uses the sensitivity information of the error and the complexity. An α_c of 0.5 in CDM is reasonable for optimum wordlength search algorithms.

7.4. Future work

As an extension of this work, various methods can be combined for wordlength optimization. Wordlength grouping [4] can be used to reduce the wordlength vector. An error model or error monitoring, instead of error measuring, can be used to reduce the simulation time. An actual cost model [12] can be used to get accurate results. For the searching method, different search methods such as binary search can be combined. Preplanned search, which is the fastest error sensitivity search method as compared in [16], can employ CDM methods to reach a near-optimum wordlength more quickly.

8. CONCLUSION

This paper generalized wordlength optimization methods that use sensitivity measures. The proposed complexity-and-distortion measure equation can express the local search or the sequential search by changing the weighting factor. The weighting factor can reduce the number of iterations and the hardware complexity compared to the local search and the sequential search, respectively. In our case studies, the complexity-and-distortion method was simulated and compared. The proposed method can find the optimum solution in one-fourth of the time that the local search takes. In addition, the optimum wordlength searched by the proposed method has 30% lower hardware implementation costs than the sequential search in wireless demodulators. The case studies demonstrate that the proposed method is robust for searching the optimum wordlength in a nonconvex space. Future extensions of this work include combination with analytic wordlength optimization and preplanned search.

REFERENCES

[1] H. Keding, M. Willems, M. Coors, and H. Meyr, "FRIDGE: A fixed-point design and simulation environment," in Proceedings of IEEE Design, Automation and Test in Europe (DATE '98), pp. 429–435, Paris, France, February 1998.
[2] A. V. Oppenheim, R. W. Schafer, and J. R. Buck, Discrete-Time Signal Processing, Prentice-Hall, Upper Saddle River, NJ, USA, 1998.
[3] H. Choi and W. P. Burleson, "Search-based wordlength optimization for VLSI/DSP synthesis," in Proceedings of IEEE Workshop on VLSI Signal Processing, VII, pp. 198–207, La Jolla, Calif, USA, October 1994.
[4] W. Sung and K.-I. Kum, "Simulation-based word-length optimization method for fixed-point digital signal processing systems," IEEE Transactions on Signal Processing, vol. 43, no. 12, pp. 3087–3090, 1995.

Kyungtae Han received the B.S. degree in information engineering from Korea University in 1996, and the M.S. degree in electrical engineering from Seoul National University in 1998. Since August 2002, he has been pursuing the Ph.D. degree in electrical engineering at The University of Texas at Austin. In 2000, he joined the Electronics and Telecommunications Research Institute, Korea, as a Research Engineer. His research experience includes modem designs for CDMA and OFDM systems. His current research activities are on tradeoffs in signal quality versus implementation complexity to develop near-optimal, low-complexity, and low-power algorithms.

Brian L. Evans is Professor of electrical and computer engineering at The University of Texas at Austin, Austin, Tex, USA. His B.S. E.E. C.S. (1987) degree is from the Rose-Hulman Institute of Technology, and his M.S. E.E. (1988) and Ph.D. E.E. (1993) degrees are from the Georgia Institute of Technology. From 1993 to 1996, he was a Postdoctoral Researcher at UC Berkeley. His research efforts are in embedded real-time signal and image processing systems. In signal processing, his research group focuses on the design and real-time software implementation of high-speed ADSL transceivers and multiuser OFDM systems. In image processing, his group focuses on the design and real-time software implementation of high-quality halftoning for desktop printers and smart image acquisition for digital still cameras. Dr. Evans has published over 140 refereed conference and journal papers. Dr. Evans is the Primary Architect of the Signals and Systems Pack for Mathematica. He was a key contributor to UC Berkeley's Ptolemy classic electronic design automation environment for embedded systems, which has been successfully commercialized by Agilent and Cadence. He is an Associate Editor for the IEEE Transactions on Signal Processing. He is the recipient of a 1997 US National Science Foundation CAREER Award.