Research on Hardware Intellectual Property
Cores Based on Look-Up Table Architecture for
DSP Applications
HOANG VAN PHUC
Doctoral Program in Electronic Engineering Graduate School of Electro-Communications The University of Electro-Communications
A thesis submitted for the degree of DOCTOR OF ENGINEERING
September 2012

I would like to dedicate this dissertation to my parents, my wife and my daughter.

Research on Hardware Intellectual Property
Cores Based on Look-Up Table Architecture for
DSP Applications
APPROVED
Asso. Prof. Cong-Kha PHAM, Chairman
Prof. Kazushi NAKANO
Prof. Yoshinao MIZUGAKI
Prof. Koichiro ISHIBASHI
Asso. Prof. Takayuki NAGAI
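Date Approved by Chairman

© Copyright 2012 by HOANG Van Phuc. All Rights Reserved.

Japanese Abstract
Research on Hardware IP Cores Based on Look-Up Table Architecture for DSP Applications
HOANG Van Phuc
Doctoral Program in Electronic Engineering, Graduate School of Electro-Communications, The University of Electro-Communications

This dissertation aims to propose area-efficient, high-performance hardware IP cores based on the look-up table (LUT) architecture and to apply them to DSP applications. LUT-based IP cores and a new computation system based on them are proposed. The proposed computation system includes both conventional operations and the proposed IP cores. The proposed IP cores cover basic operations such as multiplication and squaring, and elementary functions such as the sine function and logarithm/anti-logarithm computation.

First, two methods are proposed for the design of highly efficient LUT-based multipliers and squarers. One is a truncated constant multiplier that can be applied to DSP when the full-width result is not required. This architecture combines two approaches: LUT-based computation and truncated constant multipliers for DSP. An LUT optimization algorithm is studied in order to search for the optimal parameters and LUT contents. In addition, the hybrid LUT architecture is improved for fixed-width squarers. This technique employs both LUT-based circuits and conventional logic circuits in order to find a compromise among the performance, error and complexity of the squarer.

For the computation of elementary functions, two architectures that combine LUT-based computation with the linear difference method are proposed. One is an improved linear difference method for sine function computation, which can be used in digital frequency synthesizers, adaptive signal processing and sine function generators. Numerical analysis and optimization are carried out to search for the optimal parameters that reduce the LUT size and complexity while keeping the error unchanged compared with other methods. The other is a quasi-symmetrical linear approach for logarithm and anti-logarithm computation. A two-step optimization algorithm is also developed to search for the optimal parameters of this architecture. By using these optimization algorithms and the LUT size compression technique, the proposed logarithm and anti-logarithm computation modules can avoid increased hardware complexity while maintaining an accuracy comparable to that of other methods. The method proposed in this dissertation can be applied both to function generators, such as logarithm/anti-logarithm generators in DSP systems, and to the converters in hybrid number system processors.

Finally, application-specific DSP applications and a computation system based on the proposed IP cores are developed. In the future, these methods can be used in application-specific DSP applications and as the arithmetic units of digital signal processors.

Abstract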
Research on Hardware Intellectual Property Cores Based on Look-Up Table Architecture for DSP Applications
HOANG VAN PHUC
Doctoral Program in Electronic Engineering The University of Electro-Communications
Recently, high-performance hardware intellectual property (IP) cores for arithmetic operations have been in high demand to meet the growing needs of digital signal processing (DSP) applications in multimedia, wireless communications, and mobile and handheld devices. Traditionally, high-performance arithmetic function generators are implemented by logic circuits with advanced techniques such as parallel architectures. However, these techniques have to trade off computation performance against system complexity. In other words, high-performance computation circuits may occupy a large hardware area and lead to high power consumption. As a result, alternative approaches are needed for modern and future DSP applications.

The look-up table (LUT)-based computation approach is a promising alternative. LUT-based computation circuits provide the output by accessing pre-stored tables rather than performing actual computations in real time. This access-only operation results in high speed and a low switching rate, which leads to low power consumption. Moreover, according to the report of the International Technology Roadmap for Semiconductors (ITRS), embedded memories are becoming faster and denser, with lower dynamic power consumption, and will dominate the content of future System-on-Chip DSP applications. Also, advanced LUT technologies lead to large improvements in LUT performance and density. Memory-based computation and LUT-based computation are also well suited for many DSP algorithms which require squaring, constant multiplications and elementary function computations. However, in simple direct LUT-based computation, the LUT size grows exponentially as the operand width increases. Therefore, much research focuses on reducing the LUT size. Although there are existing methods for LUT-based computation components and systems, new and improved methods are desired because future DSP applications require more efficient computation circuits.

The research presented in this dissertation focuses on efficient hardware IP cores based on LUT architectures for DSP applications. Four methods for efficient specific hardware IP cores are proposed. Moreover, some detailed design examples and a computation system based on these IP cores are developed. The proposed IP cores include both basic arithmetic operations (such as multiplication and squaring) and elementary functions (such as sine and logarithm/anti-logarithm function computations).

Two methods are proposed for the design of efficient LUT-based multiplier and squarer circuits. Firstly, a novel and efficient LUT-based truncated multiplier is presented for specific DSP applications in which the full-width results are not required. This method combines the two approaches of LUT-based computation and truncated multipliers for DSP applications. An LUT optimization algorithm is also developed to find the optimal parameters and LUT content. Secondly, an improved hybrid LUT-based architecture is employed for fixed-width squarer circuits. This hybrid technique takes advantage of both LUT-based and conventional logic circuits to achieve a good trade-off among squarer performance, error and complexity.

For the computation of elementary functions using LUT-based architecture, two novel architectures which combine LUT-based computation and the linear difference approach are proposed. The first one is the improved linear difference method for sine function computation, which can be used in digital frequency synthesizers, adaptive signal processing and sine function generators. Numerical analysis and optimization are employed to find the optimal parameters which minimize the LUT size and hardware complexity while retaining the same error performance as other methods. The second one is a novel quasi-symmetrical approach for logarithm and anti-logarithm computation. An optimization algorithm is also developed to find the optimal parameters for the hardware architecture. Thanks to the optimization algorithm and the LUT size reduction method, the hardware complexity and computation delay of the logarithm and anti-logarithm computation modules can be reduced significantly with the same accuracy as other methods. This method can be applied to both logarithm/anti-logarithm function generators in DSP systems and the domain converters in hybrid number system processors.

Finally, some specific applications and a prototype computation system based on these IP cores are developed to clarify the improvements of the proposed methods. With the improvements achieved, the proposed methods can be considered as potential candidates for modern and future DSP systems.

List of Abbreviations
ADC  Analog to Digital Converter
ADP  Area-Delay Product
ALOGC  Anti-logarithmic Converter
ALU  Arithmetic Logic Unit
AU  Arithmetic Unit
ASIC  Application Specific Integrated Circuit
CMOS  Complementary Metal Oxide Semiconductor
CORDIC  Coordinate Rotation Digital Computer
CSC  Color Space Conversion
CSD  Canonical Signed Digit
DA  Distributed Arithmetic
DAC  Digital to Analog Converter
DSP  Digital Signal Processing
FET  Field Effect Transistor
FPGA  Field Programmable Gate Array
GPP  General Purpose Processor
HIL  Hardware In the Loop
HNS  Hybrid Number System
IP  Intellectual Property
ITRS  International Technology Roadmap for Semiconductors
KCM  Constant Coefficient Multiplier
LE  Logic Element
LMS  Least Mean Square
LNS  Logarithmic Number System
LOGC  Logarithm Converter
LSB  Least Significant Bit
LUT  Look-up Table
MCU  Microcontroller Unit
MRAM  Magnetic Random Access Memory
MSB  Most Significant Bit
MTM  Multipartite Table Method
MFCC  Mel Frequency Cepstral Coefficients
PAC  Phase to Amplitude Converter
PPM  Partial Product Matrix
RAM  Random Access Memory
ROM  Read-Only Memory
SoC  System-on-chip
SR  Speech Recognition
TTA  Transport Triggered Architecture
VHDL  Very High Speed Integrated Circuits Hardware Description Language
VLSI  Very Large Scale Integration
Contents

1 Introduction
  1.1 Context of Research and Motivation
    1.1.1 Trend of DSP Market
    1.1.2 Trend of DSP Features
    1.1.3 Motivation
  1.2 Objective and Scope of the Dissertation
  1.3 Original Contributions
  1.4 Dissertation Overview

2 Background and General Approach
  2.1 Introduction to DSP Systems
  2.2 Hardware Platforms and DSP Implementation Methods
    2.2.1 General Purpose Processor (GPP)-Based Implementation
    2.2.2 DSP Processor-Based Implementation
    2.2.3 ASIC-Based DSP Implementation
    2.2.4 FPGA-Based DSP Implementation
    2.2.5 Implementation Methods for DSP Applications
  2.3 Concept of LUT-Based Computation and Design Issues
  2.4 Literature Review of LUT-Based Computation Methods
  2.5 General Approach for Efficient LUT-Based IP Cores

3 LUT-Based Truncated Multiplier
  3.1 Introduction to LUT-Based Truncated Multiplier
    3.1.1 Truncated Multiplier
    3.1.2 LUT-Based Truncated Multiplier
  3.2 Proposed LUT-Based Truncated Multiplier
    3.2.1 Proposed Multiplier Architecture
    3.2.2 LUT Optimization
    3.2.3 Synthesis and Implementation Results in FPGA Hardware
  3.3 General Multipartite LUT Split Technique
  3.4 Chapter Conclusions

4 Hybrid LUT-Based Architecture for Fixed-Width Squarer
  4.1 Fixed-Width Squarer
  4.2 LUT-Based Fixed-Width Squarer
  4.3 Proposed Hybrid LUT-Based Fixed-Width Squarer
  4.4 Error Analysis of Proposed Fixed-Width Squarer
  4.5 Implementation and Measurement Results
  4.6 Chapter Conclusions

5 LUT-Based Sine Function Computation
  5.1 Sine Function Computation
  5.2 Direct Digital Frequency Synthesizer (DDFS)
  5.3 Phase to Amplitude Converter in DDFS
  5.4 LUT Compression Methods for Phase to Amplitude Converter
    5.4.1 Sine Wave Symmetry Exploitation
    5.4.2 Coarse/Fine LUT Splitting Method
    5.4.3 Sunderland and Nicholas Architectures
    5.4.4 Difference Method
    5.4.5 Multipartite Table Method
  5.5 Improved Linear Difference Method
  5.6 Implementation Results
  5.7 Chapter Conclusions

6 LUT-Based Logarithm and Anti-logarithm Computation
  6.1 Introduction
  6.2 Logarithm and Anti-logarithm Approximation Methods
    6.2.1 Mitchell Approximation
    6.2.2 Shift-Add Piece-Wise Linear Approximation
    6.2.3 Table-Based Approximation Methods
    6.2.4 Combined LUT-Based/Difference Method
  6.3 Proposed Quasi-Symmetrical Approach
  6.4 Implementation of the Fundamental Functions
  6.5 Applications in Logarithm Generator and HNS Processors
    6.5.1 Logarithm Function Generator for DSP Applications
    6.5.2 Hybrid Arithmetic Unit For HNS Processors
  6.6 Chapter Conclusions

7 Prototype and Applications of Proposed IP Cores
  7.1 Chapter Objectives and Targeted Applications
    7.1.1 Chapter Objectives
    7.1.2 Targeted Applications of Proposed IP Cores
  7.2 Design Flow for Applications using Proposed IP Cores
    7.2.1 Design Flow
    7.2.2 Advantages of Using Proposed IP Cores for DSP Applications
  7.3 A Prototype Computation System for the Proposed IP Cores
    7.3.1 Prototype Computation System Architecture
    7.3.2 System Implementation Results
    7.3.3 Verification and Measurement Method
  7.4 Design Examples Using Proposed IP Cores
    7.4.1 RGB to YCbCr Color Space Conversion Using Proposed Multiplier IP Core
      7.4.1.1 Color Space Conversion and Its Hardware Architecture
      7.4.1.2 FPGA Implementation and Verification Results
    7.4.2 2-D Convolver for Image Processing
      7.4.2.1 Image Convolver
      7.4.2.2 Convolver Hardware Architecture and Implementation Results
    7.4.3 Speech Feature Extraction Using Proposed Logarithm Generator
      7.4.3.1 Speech Recognition
      7.4.3.2 Speech Feature Extraction
      7.4.3.3 Hardware Architecture and Implementation Results
  7.5 Chapter Conclusions

8 Conclusions and Future Work
  8.1 Conclusions
  8.2 Future Work

A Sine/Cosine Computation Using Pipelined CORDIC
  A.1 CORDIC Algorithm
  A.2 Pipelined CORDIC Architecture
  A.3 Implementation Results

B Operation List of the Conventional ALU Component

C List of Publications
  C.1 Journal Papers
  C.2 Book Chapter
  C.3 International Conference Presentations
  C.4 Technical Report
List of Figures

1.1 Operation percentage for OpenGL TnL running on a conventional DSP processor ([3]).
1.2 Trend in memory domination (%) in SoC content ([4]).
1.3 Product function size trends (Source: ORTC 2011 ITRS Korea Winter Public Conference - Executive summary [4]).
1.4 Summary of the motivation and the dissertation overview.

2.1 General block diagram of a DSP system.
2.2 Hardware platforms for implementation of DSP applications.
2.3 General LUT-based computation system.
2.4 Combined APC-OMS architecture for LUT-based multiplier [19].
2.5 General approach for efficient LUT-based IP cores.

3.1 Introduction to LUT-Based Truncated Multiplier.
3.2 General LUT-based multiplier.
3.3 Proposed W × M-bit LUT-based truncated multiplier architecture (a) and its partial product matrix with W = M = 8 and δ = 2 (b).
3.4 Proposed sub-optimal LUT content optimization algorithm.
3.5 Error analysis results for the proposed 8-bit multiplier with different numbers of guard bits (solid line), compared with conventional truncated multiplier (dashed line).

4.1 Partial product matrix of the 8-bit fixed-width squarer.
4.2 General LUT-based full-width squarer.
4.3 Proposed hybrid LUT-based architecture for fixed-width squarer.
4.4 Coarse/fine LUT splitting method to further reduce the LUT size.
4.5 ADP results for different values of W.
4.6 ADP reduction results for different values of W (compared with Garofalo et al. method in [47]).
4.7 Chip microphotograph of the proposed 8-bit fixed-width squarer.
4.8 Test and measurement flow for the proposed fixed-width squarer.
4.9 Simulation result in Modelsim software (left) and measurement result in logic analyzer (right).
4.10 Critical path analysis result in Synopsys Design Compiler.
4.11 Longest path delay measurement result.

5.1 General block diagram and signal flow of the DDFS.
5.2 Operation principle of DDFS with the digital phase wheel.
5.3 General DDFS with sine wave symmetry exploitation.
5.4 Coarse/Fine LUT splitting.
5.5 Sunderland method.
5.6 General difference method.
5.7 Proposed function (J = 4) vs Hsu difference function (J = 3).
5.8 Architecture of phase to amplitude converter using the improved linear difference method.
5.9 Digital sine wave output of the proposed DDFS in Modelsim simulation.

6.1 Area breakdown of a 3-D graphics chip in [3].
6.2 Logarithmic and anti-logarithmic Mitchell error functions.
6.3 General difference method for logarithm approximation.
6.4 The 4-segment linear method with small error LUT proposed in [78].
6.5 The idea for the proposed quasi-symmetrical approach.
6.6 Proposed 2-step parameter optimization algorithm for the hardware approximation of log2(1 + x).
6.7 Proposed quasi-symmetrical linear approximation method (after the parameter optimization).
6.8 Proposed architecture for the approximation of log2(1 + x) with W = 13.
6.9 A 16-bit logarithm function generator using the proposed method.
6.10 An arithmetic unit for HNS processors using proposed converters.
6.11 Layout of the proposed 16-bit logarithm function generator.

7.1 Chapter objectives.
7.2 Design flow for DSP applications using proposed IP cores.
7.3 Merged DSP core/MCU/coprocessor architecture as a trend for future DSP system architecture.
7.4 Hardware architecture of the prototype computation system.
7.5 Layout of the prototype computation system.
7.6 Area breakdown of the prototype computation system.
7.7 Verification and measurement flow for the prototype computation system.
7.8 Proposed RGB to YCbCr color space conversion using LUT-based constant multiplier.
7.9 Verification flow with DSP builder using Matlab simulation and HIL.
7.10 Operation of a 2-D convolver for image processing.
7.11 Hardware architecture of a 2-D convolver for image processing.
7.12 Block diagram of speech recognition system.
7.13 Feature extraction hardware architecture for the speech recognition system.
7.14 Hardware architecture of the logarithm computation block for MFCC feature extraction in the speech recognition application.
7.15 Verification results of the logarithm computation block for MFCC feature extraction in the speech recognition application.

8.1 TTA processor architecture.

A.1 Vector rotation in CORDIC algorithm.
A.2 A stage in the pipelined CORDIC architecture.
A.3 Block diagram of the CORDIC-based DDFS.
List of Tables

1.1 Class of application most suitable for each emerging research technology entry evaluated (Source: ITRS [4]).

2.1 Comparison of different hardware platforms for DSP implementation.
2.2 The truth table of an LUT-based multiplier using APC method with two operands X and A.
2.3 The truth table of an LUT-based multiplier using OMS method with two operands X and A.

3.1 Error analysis results for the proposed 8-bit multiplier with different numbers of guard bits (δ), compared with conventional truncated multiplier.
3.2 Implementation results of different KCM methods in Altera Cyclone II FPGA.

4.1 Four-folding method for hybrid LUT-based fixed-width squarer.
4.2 Error analysis for different fixed-width squarer architectures.
4.3 LUT parameters and compression ratio of the proposed fixed-width squarer designs with LUT splitting technique.
4.4 Implementation results of different fixed-width squarer architectures using 0.18-μm CMOS technology.
4.5 ADP reduction with different values of W (compared with the method in [47]).

5.1 Computation results of word size reduction for different numbers of segments.
5.2 Comparison with the previous methods.

6.1 Proposed 2-step optimization algorithm.
6.2 Optimal parameter results using the 2-step optimization algorithm.
6.3 Summary of optimal parameters and error analysis results for the proposed method compared with the method presented in [78].
6.4 FPGA implementation results of different methods for approximation of log2(1 + x) and (2^x − 1) with W = 13.
6.5 Implementation results in 0.18-μm CMOS technology of different methods for approximation of log2(1 + x) and (2^x − 1) with W = 13.
6.6 FPGA implementation results of different methods for 16-bit logarithm generator.
6.7 Implementation results in 0.18-μm CMOS technology of different methods for the 16-bit logarithm generator.

7.1 List of operations supported by the prototype computation system.
7.2 Main parameters for the ASIC implementation of the prototype computation system.
7.3 Area consumption by components in the prototype computation system (×10^3 μm^2).
7.4 Comparison of FPGA implementation results of the color space converter.
7.5 Verification results using Altera DSP flow by hardware co-simulation with Cyclone II FPGA in Matlab-Simulink environment.
7.6 ASIC implementation results of the proposed RGB to YCbCr color space converter using a standard cell library with 0.18-μm CMOS technology.
7.7 FPGA implementation results of 2-D convolver with the input/output resolution of 8 bit.
7.8 Verification result of 2-D convolver.
7.9 ASIC implementation results of 2-D convolver with 0.18-μm CMOS technology using a standard cell library.
7.10 FPGA implementation results of logarithm computation block in MFCC feature extraction with the input/output resolution of 16 bit.
7.11 ASIC implementation results of logarithm computation block in MFCC feature extraction with 0.18-μm CMOS technology using a standard cell library.

A.1 Implementation results for the pipelined CORDIC-based DDFS.

B.1 List of operations supported by the conventional ALU component.
Chapter 1
Introduction
1.1 Context of Research and Motivation
1.1.1 Trend of DSP Market
The global digital signal processing (DSP) market is growing strongly due to the fast-increasing demand for DSP applications. According to a report in [1], the global DSP market by revenue (in Intellectual Property (IP), Design Architecture and Applications) is estimated to grow from $6.20 billion in 2011 to $9.58 billion in 2016, a Compound Annual Growth Rate (CAGR) of 9.09%. The share of the DSP industry in global semiconductor revenue has been approximately between 1% and 3% over the years, and it stood at 1.97% in 2011. The rising demand for DSPs in the wireless infrastructure sector is one of the most prominent reasons for this growth. The rising trend of several advanced performance technologies such as DSP-based System-on-chip (SoC) is another reason for the significant potential growth in the DSP market.
1.1.2 Trend of DSP Features
With the increasing demand for high-performance, low-area and low-power DSP applications and the continuous development of integrated circuit technology, the following trends in DSP features can be predicted.

• Modern DSP cores support increasingly high clock frequencies. For example, Texas Instruments released the C6000 series DSP processors, which have clock speeds of 1.2 GHz. However, the clock frequency will be limited and other approaches should be considered.

• Multicore/multiprocessor designs are employed more and more in modern DSP processor cores to increase the performance of the DSP system and solve the problem of clock frequency limitation. Moreover, DSP processor architecture is moving toward the many-core approach to achieve higher performance.

• The hybrid (or merged) DSP core/Microcontroller Unit (MCU)/coprocessor architecture is also a trend for modern and future DSP systems due to the high convergence of multiple application types in DSP and multimedia systems.

• More dedicated components are predicted to be present in DSP systems to accelerate the system and provide higher-performance adaptive signal processing.

• Mixed and hybrid number system support is a promising approach to achieve a good trade-off between computation performance and power consumption.
1.1.3 Motivation
With the above trends in the DSP market and architecture, modern and future DSP systems are expected to employ more and more multiplication, squaring and other complex operations, and more dedicated computation modules, to meet the requirements of more complicated algorithms and applications, especially for 3-D graphics and high-speed multimedia processing systems. Employing many multiplier circuits, especially parallel architectures, can lead to high system complexity and high power consumption. As pointed out in [2], division, multiplication and square root operations account for about 90% of the operations in a 3-D graphics rendering application. According to the research by B.G. Nam et al. [3], a typical 3-D graphics processor employs many complex operations such as multiplication (MUL), division (DIV), squaring (SQ) and powering (POW), as presented in Figure 1.1, together with basic operations such as addition (ADD) and subtraction. Hence, low-complexity arithmetic components are highly required for future DSP applications.

Figure 1.1: Operation percentage for OpenGL TnL running on a conventional DSP processor ([3]). Shares shown in the chart: MUL 30%, SQ 24%, ADD 24%, POW 16%, DIV 5%, others 1%.

Moreover, the continuous development of integration technology allows the design and implementation of sophisticated processors for complex algorithms with high computation speed. Also, the higher integration density makes DSP circuits smaller. However, with the fast-increasing demand for wireless, mobile and portable devices such as smart phones, tablets and wireless-based intelligent electronic devices, the requirement for low-area and low-power DSP and multimedia systems is becoming more and more pressing. For these applications, not only high-speed but also low-area and low-power circuits are highly required (especially for battery-based devices). Meanwhile, parallel multipliers and the above-mentioned complex operations often result in high system complexity and high power consumption.

Therefore, an alternative approach should be considered to derive efficient
computation systems without complicated multipliers while achieving high computation performance. Memory-based computation or, more generally, LUT-based computation is an alternative approach which has been presented in the literature.

Meanwhile, according to the 2010 report of the International Technology Roadmap for Semiconductors (ITRS) [4], embedded memories are going beyond the 16-nm technology generation and becoming faster and denser, with lower dynamic power consumption, and will dominate the content of System-on-Chip (SoC) applications. Figure 1.2 presents the trend of memory domination in SoC content. Table 1.1 shows 8 candidate memory technologies which have been evaluated in the ITRS report. In this report, the Emerging Research Devices (ERD) and Emerging Research Materials (ERM) working groups from ITRS identified Spin Transfer Torque MRAM and Redox RRAM as emerging memory technologies recommended for accelerated research and development leading to scaling and commercialization of non-volatile RAM to and beyond the 16-nm generation. Magnetic memory/logic is predicted to dominate future electronic systems ([5] and [6]). Also, the development of advanced 3-D integration technology will further promote this process [7]. Figure 1.3 depicts the trend of product function size reported at the ORTC 2011 ITRS Korea Winter Public Conference. It can be seen that the function size is becoming smaller, corresponding to higher chip density. On the other hand, the density of memory circuits is higher than that of logic circuits. Moreover, P. Meinerzhagen et al. [9] present an approach of using the standard cell library to implement specific memories in 65-nm technology.
Table 1.1: Class of application most suitable for each emerging research technology entry evaluated (Source: ITRS [4]).
Emerging Research Memory Technology Entry  Standalone  Embedded
Ferroelectric-gate FET  √
Nanoelectromechanical RAM  √
Spin Transfer Torque MRAM  √
Nanoionic or Redox Memory  √ √
Nanowire Phase Change Memory (PCM)  √ √
Electronic Effects Memory  √
Macromolecular memory  √ √
Molecular memory  √ √

With the availability of different methods for efficient memory and LUT implementation, the LUT-based architectures and methods presented in this dissertation will be useful for DSP designers when choosing a design method and architecture for the arithmetic functions and modules in DSP applications. In particular, when the advanced memory technologies mentioned previously become popular, these architectures and methods can be employed more efficiently.

Therefore, LUT-based computation combined with some simple logic circuits is a promising alternative approach. LUT-based computation refers to a type of digital circuit or system which provides the computation results by accessing pre-stored tables rather than performing actual computations. An LUT-based circuit can achieve high-speed computation because its delay mainly consists of the access time of the pre-stored tables. It can also reduce the dynamic power consumption because of the low switching rate. The LUT can be implemented by either application-specific embedded memory circuits or optimized logic circuits. Some examples of application-specific memories are low-power memories for mobile devices and consumer electronics products, high-speed memories for multimedia applications, wide-temperature memories for automotive applications, high-reliability memories for biomedical instruments and radiation-hardened memories for space applications.
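The idea of replacing run-time arithmetic with a table access can be illustrated with a minimal Python sketch. It is a software model only, not one of the architectures proposed in this dissertation; the function names `build_square_lut`, `lut_square` and `direct_lut_bits` are hypothetical, and a direct (uncompressed) table for an unsigned squarer is assumed.

```python
def build_square_lut(width):
    """Precompute x*x for every unsigned operand of the given bit width."""
    return [x * x for x in range(1 << width)]


def lut_square(lut, x):
    """Return x*x with a single table access and no arithmetic at run time."""
    return lut[x]


def direct_lut_bits(width):
    """Storage of the direct table: 2**width entries of 2*width bits each."""
    return (1 << width) * (2 * width)


if __name__ == "__main__":
    lut = build_square_lut(8)
    print("13 squared via LUT:", lut_square(lut, 13))
    # The table size grows exponentially with the operand width.
    for w in (4, 8, 12, 16):
        print("width", w, "->", direct_lut_bits(w), "bits")
```

The size function makes the main design issue visible: each additional operand bit roughly doubles the storage of a direct table, which is why LUT size reduction matters for wider operands.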
1.2 Objective and Scope of the Dissertation
The objective of the research presented in this dissertation is to find efficient architectures and methods for arithmetic components based on LUT architecture for specific DSP applications. Since DSP applications have some specific characteristics which are different from those of general computation applications, the architectures and implementation methods need to be adapted appropriately. Moreover, some important characteristics of DSP applications should also be considered, as follows:
Figure 1.2: Trend in memory domination (%) in SoC content ([4]). [Plot: percentage (0 to 100) versus year (2005, 2008, 2014).]
• Error tolerance. Error is acceptable in many DSP applications such as digital filtering, video and image processing. Therefore, the hardware complexity can be reduced significantly by employing techniques such as truncated multipliers and squarers, LUT size reduction and so on.

• Low complexity. Low-complexity DSP applications are highly desired because of the high demand for portable, hand-held and wearable electronic devices. Electronic devices are becoming smaller, lighter and thinner. Even though integration technology is moving to deep sub-micron processes, it is still highly required to develop more efficient methods at the architecture level and system level for lower hardware complexity in DSP applications.

• Real-time operation. Since DSP applications are often real-time, high-speed computation circuits are desired, and the speed (or delay) of the computation module has to be taken into account in the design methodology and optimization algorithms.

• Low power. Low-power DSP applications are becoming more and more popular due to the increasing demand for battery-based devices. Hence, the power consumption should be reduced in future DSP systems.
Therefore, the research work presented in this dissertation aims to find efficient methods which are suitable for DSP applications considering the above characteristics. For example, because DSP systems are error-tolerant, methods for low-error truncated multipliers and squarers using LUT-based architectures are proposed to reduce the hardware area significantly while achieving acceptable computation accuracy. Also, to meet the requirement of low-complexity DSP applications, research on an efficient method for low-area logarithmic and anti-logarithmic converters is carried out for application in DSP systems and in hybrid number system processors.
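The trade-off between accuracy and hardware cost in a truncated multiplier can be made concrete with a small Python sketch. It models a plain truncated multiplier in which the low-order columns of the partial product matrix are simply never formed; it is not the LUT-based architecture proposed in this dissertation. The function names and the exact meaning given to `guard` (extra low-order columns kept below the output) are assumptions made for illustration.

```python
def truncated_product(a, b, width, guard):
    """Fixed-width product that sums only partial-product bits in
    columns >= width - guard and returns the upper `width` bits."""
    cut = width - guard              # columns below `cut` are never formed
    acc = 0
    for i in range(width):           # bit i of operand b
        if (b >> i) & 1:
            for j in range(width):   # bit j of operand a
                if (a >> j) & 1 and i + j >= cut:
                    acc += 1 << (i + j)
    return acc >> width              # keep only the upper `width` bits


def error_stats(width, guard):
    """Maximum and mean absolute error, in units of the output LSB,
    measured exhaustively over all unsigned operand pairs."""
    n = 1 << width
    worst = 0.0
    total = 0.0
    for a in range(n):
        for b in range(n):
            exact = (a * b) / (1 << width)
            err = abs(exact - truncated_product(a, b, width, guard))
            worst = max(worst, err)
            total += err
    return worst, total / (n * n)


if __name__ == "__main__":
    # Each additional guard column keeps more partial-product bits.
    for g in range(0, 7, 2):
        print("guard =", g, "->", error_stats(6, g))
```

Running the loop for a small width shows how the error shrinks as more columns are retained, which is the kind of exhaustive error analysis an error-tolerant DSP design can use to choose how much of the partial product matrix to drop.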
1.3 Original Contributions
Several contributions to the architectures and design methodologies for low area, high performance LUT-based IP cores and systems have been made in this work. Parts of these contributions have been published or submitted for publication. The following list summarizes the main contributions within the scope of this work.
1. Two efficient architectures for the LUT-based truncated multiplier and the fixed-width squarer are proposed. Since multipliers and squarers are widely employed in DSP applications, the proposed methods can offer new options to designers and system developers working on digital hardware for DSP systems. Two papers based on these methods were accepted for publication in IEICE Transactions of Fundamentals in June and July 2012, respectively.
2. A new, efficient and more general linear difference method combined with LUT-based error correction is proposed for the design of the phase to amplitude converter in a Direct Digital Frequency Synthesizer (DDFS). The paper based on this method was published in IEICE Transactions of Fundamentals in March 2011.
3. A novel quasi-symmetrical approach for low area, efficient logarithmic and anti-logarithmic converters is proposed. Since logarithmic and anti-logarithmic converters are essential components in hybrid number system processors and many DSP applications, the proposed approach can have a great impact on improving the system performance in those applications. The paper based on this method was submitted for publication in IEICE Transactions of Fundamentals.
1.4 Dissertation Overview
The dissertation is divided into 8 chapters. Figure 1.4 presents the relationship between the chapters of this dissertation. After this introductory chapter, Chapter 2 provides the background knowledge and design issues of DSP applications and LUT-based computation systems. The existing methods and architectures for LUT-based computation are also reviewed and discussed in this chapter.

Chapter 3 presents an efficient LUT-based truncated multiplier for constant multiplication, which is widely employed in DSP applications. By using the proposed parameter and LUT content optimization, significant improvements in both circuit area and delay can be achieved.

Squaring is also a fundamental operation in many DSP applications. Therefore, an improved hybrid LUT-based architecture for the design of a low error, efficient fixed-width squarer is proposed in Chapter 4. The mathematical identities of the squaring operation are exploited in a new way so that a low error, efficient architecture for the fixed-width squarer can be achieved. The implementation and chip measurement results in 0.18-μm CMOS technology are presented and discussed.

In Chapter 5, an improved linear difference approximation combined with an LUT-based architecture is proposed for sine function computation, targeted at the phase to sine converter in a Direct Digital Frequency Synthesizer (DDFS). By employing parameter optimization in Matlab, an optimal architecture for the linear difference method is derived.

Chapter 6 presents a novel approach to logarithmic and anti-logarithmic converters, which are essential components in hybrid number system processors and many DSP applications. The novel quasi-symmetrical approach and the proposed parameter optimization algorithm for achieving converters with higher area-speed efficiency are also presented. Moreover, the applications of the proposed converters to logarithm/exponent function generators and to the arithmetic unit of hybrid number system processors are proposed in this chapter. The implementation results on both FPGA and 0.18-μm CMOS technology platforms are also presented.

Chapter 7 presents the applications of the proposed IP cores and a prototype computation system based on these cores. Moreover, the design flow and detailed design examples are also provided in this chapter. Finally, Chapter 8 summarizes the main results of this work, concludes the dissertation and suggests a list of open topics for future research.
Figure 1.3: Product function size trends (Source: ORTC 2011 ITRS Korea Winter Public Conference - Executive summary [4]).
[Figure 1.4 diagram: DSP applications, with their high speed and low power requirements, motivate LUT-based computation, which covers basic arithmetic operations (multiplier, Chapter 3; squarer, Chapter 4) and elementary functions (sine generator, Chapter 5; logarithm generator, Chapter 6) and leads to the prototype and applications in Chapter 7.]
Figure 1.4: Summary of the motivation and the dissertation overview
Chapter 2
Background and General Approach
2.1 Introduction to DSP Systems
Signals in the real world are analog by nature. However, digital computers and many electronic devices operate on data represented in binary format, composed of a finite number of bits. In a DSP system, the original analog signals are converted to digital ones, which are represented by sequences of finite-precision numbers, typically binary numbers in which only the two logic values '0' and '1' are used. The signal processing tasks are performed by digital computation components [10]. Figure 2.1 presents the general block diagram of a typical DSP system. As shown in this figure, a DSP system receives the input signal, processes it in the digital domain and generates the outputs according to the given algorithms. The analog and digital parts of this system interact through analog to digital converters (ADC) and digital to analog converters (DAC). The main component in this system is the DSP module, which performs the signal processing algorithms in the digital domain. The input and output filters are also required to guarantee a suitable frequency band for the DSP system.
[Figure 2.1 block diagram: analog input, input filter, ADC, DSP module, DAC, output filter, analog output; binary data words are shown at the input and output of the DSP module.]
Figure 2.1: General block diagram of a DSP system.
2.2 Hardware Platforms and DSP Implementation Methods
2.2.1 General Purpose Processor (GPP)-Based Implementation
General purpose processors (GPP) have been widely employed in computing systems over the past decades. Although DSP applications require specific operations and are computation-intensive, general purpose processors can be used to implement them. The Intel MMX processor is a typical GPP which supports DSP and multimedia applications [11]. The PowerPC 604e processor can also be used to implement DSP applications [12]. The high applicability of this type of processor has led to its wide utilization. Due to their high flexibility, GPPs can perform a wide range of applications. The flexibility inherent in GPPs was a key component of the computer revolution. To date, processors have been the driving engine behind general-purpose computing. Originally, due to the limitation of active chip area, processors focused on the heavy reuse of a single functional unit or a small number of functional units. With the continuous development of Very Large Scale Integration (VLSI) technology, complete and powerful processors and other computation modules can now be integrated onto a single integrated circuit. Therefore, some alternatives to GPP-based DSP implementation are introduced in the following sections.
2.2.2 DSP Processor-Based Implementation
Digital signal processors are often abbreviated as DSP, the same abbreviation used for digital signal processing. Therefore, in this dissertation, the term 'DSP processor' is used to prevent confusion between the two concepts. In this kind of DSP implementation, the specific DSP application is developed as software in a specific programming language and then executed by a DSP processor. Moreover, a DSP processor is a dedicated microprocessor with the following characteristics to support DSP applications:
• Real-time digital signal processing capabilities. DSP processors typically have to process data in real time, i.e., the correctness of the operation depends heavily on the time when the data processing is completed.
• High throughput. DSP processors can sustain processing of high-speed streaming data, such as audio and multimedia data processing.
• Deterministic operation. The execution time of DSP programs can be foreseen accurately, thus guaranteeing a repeatable, desired performance.
• Re-programmability by software. Different system behaviors can be obtained by re-coding the algorithm executed by the DSP processor instead of modifying the hardware.

DSP processors appeared on the market in the early 1980s. Over the last three decades, they have been the key enabling technology for many electronics products in fields such as communication systems, multimedia, automotive, instrumentation and military applications.
2.2.3 ASIC-Based DSP Implementation
Application Specific Integrated Circuit (ASIC)-based implementation can be employed to design high performance DSP systems. With the hardware optimized for a specific application, the ASIC-based DSP implementation method can lead to high performance, area efficient, low power DSP designs. However, it requires high design effort and a long design time. Therefore, this approach is well suited to DSP products manufactured in high volume.
[Figure 2.2 diagram: the platforms GPP, DSP processor, FPGA and ASIC arranged along two axes, programmability and specialization.]
Figure 2.2: Hardware platforms for implementation of DSP applications.
2.2.4 FPGA-Based DSP Implementation
The FPGA, a typical reconfigurable hardware platform, enables rapid prototyping and implementation of DSP applications [13]. It can combine the advantages of both the dedicated ASIC-based and the software-based DSP implementation approaches. With the development of reconfigurable hardware technologies, this approach is very promising for high performance, rapid implementation of DSP applications.

Figure 2.2 presents the design spectrum of DSP applications with the above hardware platforms and their degrees of programmability and specialization. Table 2.1 compares the different hardware platforms for implementing DSP applications in terms of several factors. Depending on the specific features and requirements of each DSP application, the designer can choose a suitable hardware platform for the implementation.
Table 2.1: Comparison of different hardware platforms for DSP implementation.

Platform        Performance  Cost    Power   Flexibility  Design effort
ASIC            high         high    low     low          high
DSP processor   medium       medium  medium  medium       medium
GPP             low          low     medium  high         low
FPGA-based      medium       medium  high    high         medium
2.2.5 Implementation Methods for DSP Applications
The following methods can be used to implement a DSP application.
• Dedicated hardware design method. In this type of implementation, a DSP function or application is performed by a specific, dedicated hardware module. Digital filters such as the Finite Impulse Response (FIR) filter, the Fast Fourier Transform (FFT), encoders and decoders are typical examples of this approach. The hardware platform for this method can be an FPGA or an ASIC.
• Software-based design using a general purpose processor (GPP) or a DSP processor. The software to implement the DSP algorithm is developed based on an architecture of a specific GPP or DSP processor.
• The combination of the two approaches above, called the hardware/software co-design method, can be employed for a complicated DSP application ([14] and [15]).

The hardware IP cores proposed in this research can be used with all of the above implementation methods for DSP applications.
2.3 Concept of LUT-Based Computation and Design Issues
The concept of memory-based computation, first in the form of ROM-based computation, was employed in the IBM 1620 computer announced by IBM in 1959 [16]. In this computer, addition, subtraction and multiplication are accomplished by automatic table look-up in the core storage, whereas division is accomplished by an available subroutine or by an optional automatic divide feature. This concept was also presented in textbooks such as [17] and [18]. The basic idea behind it is that a memory array often has higher density than a logic circuit [18]. It promises high speed computation because only the time for accessing pre-stored memory arrays is required, without actual computation circuits. It can also provide other advantages, as discussed later. However, this approach was not popular in the past because the required memory size grows exponentially with the operand bit-width, memory devices were expensive, and embedded memories were not common at that time.

Currently, as a result of the continuous development in integrated circuit and memory technologies mentioned in Section 1.1, memory-based computation, and the more general concept of LUT-based computation, can be realized efficiently [5]. Bipul C. Paul and his colleagues at Toshiba Corporation have proposed an efficient implementation of a 16-bit multiplier using a ROM-based method with a full custom design flow [8].

[Figure 2.3 block diagram: the W-bit input X addresses an LUT of $2^W$ words, which outputs the (W + M)-bit result Y.]
Figure 2.3: General LUT-based computation system.

The following paragraphs describe the basic concept of LUT-based computation and its background in more detail. LUT-based computation refers to a type of digital circuit or system that provides computation results by accessing pre-stored tables rather than by actual computation. Figure 2.3 shows the block diagram of a general LUT-based computation circuit in which the W-bit input X is used as the address for accessing an LUT of $2^W$ words to provide the (W + M)-bit output Y.
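The look-up principle described above can be illustrated with a minimal Python sketch. It is only a software analogy of the circuit in Figure 2.3: the names `build_lut` and `lut_compute`, the width W = 4 and the constant 11 are illustrative assumptions, not values taken from this dissertation.

```python
def build_lut(func, w):
    """Precompute func(x) for every w-bit input x; the table has 2**w words."""
    return [func(x) for x in range(1 << w)]


W = 4    # assumed input (address) width in bits
A = 11   # hypothetical 4-bit constant, so each word needs W + M = 8 bits

# Pre-stored table of the function Y = X * A.
lut = build_lut(lambda x: x * A, W)


def lut_compute(x):
    """The 'computation' is a single table read at address x."""
    return lut[x]
```

Any function of a W-bit input can be stored in the same way; only the table content changes, while the look-up step stays identical.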
The LUT-based computation approach has several advantages [19]. The main ones are summarized below.
• The LUT-based approach can result in high speed computation and has the potential for high-throughput, reduced-latency implementation because it mainly relies on the time for accessing the pre-stored tables.
• LUT-based computation systems are expected to consume less dynamic power due to the minimization of switching activity.
• LUT-based architectures are easier to design and more regular than conventional logic circuits such as those based on the multiply-add method.
• The LUT-based computation approach is close to human-like computing. It is also suitable for implementing the activation functions in artificial neural networks [20].
With these advantages, LUT-based computation can be employed for a wide range of applications such as constant multiplication, squaring, digital filtering and orthogonal transformation in DSP applications, as well as the computation of elementary functions (trigonometric functions, the logarithm function) and of activation functions in artificial neural networks.

However, as mentioned previously, the LUT-based computation approach has the disadvantage that the LUT size grows exponentially as the operand length increases. This may lead to high hardware complexity for high resolution designs. Therefore, much research has focused on methods that reduce the LUT size at the cost of some simple additional logic circuits. In this dissertation, some novel and improved methods are proposed for efficient LUT-based computation components and systems. Before these methods are detailed in the next chapters, the following sections review the existing methods for LUT-based computation and its design issues.

An LUT is characterized by its size, word length and content. The LUT size can be expressed in number of words or number of bits. The word length is defined as the number of data bits stored in each LUT word. The LUT shown in Figure 2.3 has a size of $2^W$ words, or $2^W \times (W + M)$ bits, because the word length in this case is (W + M) bits.

Moreover, the following design issues need to be considered when implementing LUT-based computation components for DSP applications.
• As mentioned previously, in a general LUT-based computation system, the LUT size grows exponentially as the input width (the address width of the LUT) increases. Therefore, the relationship between the LUT size and the input width should be considered in LUT-based computation, and many researchers have proposed techniques to slow the growth of the LUT size.
• Since DSP applications can tolerate errors, the LUT size can be reduced significantly by discarding some least significant bits of the results. However, the trade-off between accuracy and hardware complexity should be taken into account. Hence, an efficient LUT optimization method (for both the LUT parameters and the LUT content) is required to achieve a good trade-off.
• LUT-based computation can be combined with other methods to derive hybrid LUT-based/logic architectures. This approach also requires considering the trade-off between the LUT-based computation part and the conventional logic part.
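The first two design issues above can be made concrete with a short Python sketch. It evaluates the size expression $2^W \times$ (word length) and shows how discarding t least significant bits shortens every stored word at the price of a bounded error. The function names and the example values (a constant of 181, t = 4) are assumptions chosen only for illustration.

```python
def lut_size_bits(w, word_len):
    """Size in bits of an LUT with 2**w words of word_len bits each."""
    return (1 << w) * word_len


def truncated_constant_lut(a, w, t):
    """Table of x * a with the t least significant bits of each word discarded."""
    return [(x * a) >> t for x in range(1 << w)]


# Exponential growth with the address width W (M = 8 assumed for the word length).
sizes = {w: lut_size_bits(w, w + 8) for w in (8, 12, 16)}

# Discarding t = 4 bits shortens every word by 4 bits; the error introduced
# per entry stays below 2**t in units of the full-width product.
a, w, t = 181, 8, 4
lut = truncated_constant_lut(a, w, t)
max_error = max(x * a - (lut[x] << t) for x in range(1 << w))
```

In this example the full-width table needs 256 × 16 bits and the truncated one 256 × 12 bits, which is the kind of accuracy versus complexity trade-off that the optimization methods in later chapters address.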
2.4 Literature Review of LUT-Based Computation Methods
In the IBM 1620 computer [16], a simple direct memory-based method was used to implement the arithmetic functions. However, because of the simplicity of this method and the low memory density at that time, this principle was not widely used in commercial computers. H. Ling in [21] and B. Vinnakota in [22] have proposed methods of using LUTs to implement multiplier and squarer circuits. B. Parhami et al. also presented some improved methods for LUT-based computation in [23] and [24].
Memory-based and LUT-based methods are quite popular in digital filter design, as presented in [25], [26]. LUT-based computation is also applied to computing trigonometric functions such as the arctangent function [27] and the sine function, which will be presented in detail in Chapter 5. G. Inoue [28] announced an invention that uses semiconductor memory combined with an adder and small control logic to implement an 8-bit multiplier. The LUT-based method is also widely employed to implement transformations in digital signal processing such as the DFT and DCT [29].

Moreover, Bipul C. Paul et al. [8] present another approach that uses full custom memory arrays to implement arithmetic functions. Owing to the advantages of the full custom flow and the advanced integrated circuit technology, this method can improve the speed and area efficiency of the circuits significantly. However, it has the disadvantages of a lack of flexibility and a long design time.

P. K. Meher ([19] and [30]) has proposed several methods for LUT-based computation, including LUT optimization methods for constant multipliers such as the odd multiple storage (OMS) scheme, the anti-symmetric product coding (APC) scheme, the input coding scheme and their combinations. The main idea of these methods is to reduce the LUT size by employing mathematical identities. It is reported that these methods outperform other methods in the literature and that they are promising candidates for future DSP and wireless communication systems. The truth tables for the APC and OMS methods are presented in Tables 2.2 and 2.3, respectively. Figure 2.4 shows the combined APC-OMS architecture for the LUT-based multiplier with the operand X and a constant A.
In the APC method, by exploiting the mathematical identities of anti-symmetric product coding, the input X is recoded into the LUT address ($x_3x_2x_1x_0$ in binary form for the case of a 4-bit address) so that the LUT size is reduced by half. In the APC-OMS method, by employing odd multiple storage, the LUT size can be further reduced by half compared with the APC method. The addresses for the modified LUT in these architectures are generated by an address generator and control circuit and an address decoder, as shown in Figure 2.4. The simple shifting operations in Table 2.3 are performed by a barrel shifter.
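The two size reductions described above can be mirrored in a short Python sketch for a 5-bit input, following the entries of Tables 2.2 and 2.3. This is a simplified software model under stated assumptions, not the circuit of [19]: the function names are hypothetical, a zero address is handled outside the table, and the term 16A is obtained by shifting the stored word A.

```python
def build_oms_table(a):
    """Stored words P0..P7: the odd multiples A, 3A, ..., 15A (cf. Table 2.3)."""
    return [(2 * d + 1) * a for d in range(8)]


def oms_lookup(table, v):
    """Return v * A for a 4-bit v from one stored odd multiple and a left shift."""
    if v == 0:
        return 0                      # assumption: zero is handled outside the LUT
    shifts = 0
    while v % 2 == 0:                 # strip trailing zeros to reach the odd part
        v >>= 1
        shifts += 1
    d = (v - 1) // 2                  # address d3d2d1d0 of the stored word
    return table[d] << shifts         # the shift restores the even multiple


def apc_oms_multiply(table, x):
    """Return x * A for a 5-bit x in 1..31 as 16A plus or minus an APC word."""
    if not 1 <= x <= 31:
        raise ValueError("this sketch covers 1 <= x <= 31 only")
    x4 = x >> 4                       # most significant bit selects the sign
    x_low = x & 0xF
    addr = x_low if x4 else (16 - x_low) & 0xF   # recoded 4-bit address
    word = oms_lookup(table, addr)    # APC word, equal to addr * A
    base = table[0] << 4              # 16A, derived here from the stored word A
    return base + word if x4 else base - word
```

For example, with x = 1 the recoded address is 1111, the APC word is 15A and the result is 16A - 15A = A, which matches the first row of Table 2.2.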
[Figure 2.4 block diagram: input bits x0 to x3, address generator and control circuit, address bits d0 to d3, 4-to-9 line address decoder, 9 × (W+4) LUT, barrel shifter, output to the sign-decision circuit.]
Figure 2.4: Combined APC-OMS architecture for LUT-based multiplier [19].
LUT-based computation can also be applied efficiently to the activation functions in artificial neural networks [20]. Te-Jen Chang et al. [31] proposed an efficient LUT-based method for the squarer circuit. This method targets only the full-width squarer, with application in cryptography.

However, the above methods focus only on architectures for full-length computation, and only synthesis results are presented. Therefore, more research is needed to derive new and improved methods, especially for truncated operations in which full-length results are not compulsory because of the error tolerance of DSP applications. Moreover, the hardware complexity can be reduced significantly when truncation can be applied. Real chip implementations should also be carried out to confirm the improvements of the proposed methods.

To sum up, no method has been presented in the literature for truncated/fixed-width computation using the LUT-based approach. This leaves room for more research in this field. The existing methods should also be improved for new algorithms and applications. Furthermore, technology-independent architectures and methods need to be considered so that more general methods can be applied on multiple hardware platforms. Therefore, the research presented in this dissertation aims to find novel and improved methods for LUT-based computation components and systems that can be employed in modern and future DSP applications.
Table 2.2: The truth table of an LUT-based multiplier using the APC method with two operands X and A.

Input X   Product   Input X   Product   Address x3x2x1x0   APC word
00001     A         11111     31A       1111               15A
00010     2A        11110     30A       1110               14A
00011     3A        11101     29A       1101               13A
00100     4A        11100     28A       1100               12A
00101     5A        11011     27A       1011               11A
00110     6A        11010     26A       1010               10A
00111     7A        11001     25A       1001               9A
01000     8A        11000     24A       1000               8A
01001     9A        10111     23A       0111               7A
01010     10A       10110     22A       0110               6A
01011     11A       10101     21A       0101               5A
01100     12A       10100     20A       0100               4A
01101     13A       10011     19A       0011               3A
01110     14A       10010     18A       0010               2A
01111     15A       10001     17A       0001               A
10000     16A       10000     16A       0000               0
2.5 General Approach for Efficient LUT-Based IP Cores
Figure 2.5 presents the general approach for the IP cores proposed in this dissertation. To obtain a good trade-off between hardware complexity and computation performance, the original direct LUT is decomposed into a smaller LUT (the size-reduced LUT) and a simple logic circuit. A combinational circuit (Comb.) is also required to provide the final computation result. The existing and proposed methods for reducing the LUT size will be presented in the next chapters.

The LUT in this research is implemented on both FPGA and ASIC hardware platforms. For a first evaluation of the proposed methods, the LUTs in FPGA devices and standard cells are used to realize the proposed LUT-based architectures. In future research, however, advanced LUT hardware will be considered for the implementation to improve the performance, power and area efficiency. The implementation issues of the proposed IP cores will be detailed in the next chapters.

Table 2.3: The truth table of an LUT-based multiplier using the OMS method with two operands X and A.

Input x3x2x1x0   Product value   Number of shifts   Shifted input   Stored APC word   Address d3d2d1d0
0001             A               0                  0001            P0 = A            0000
0010             2 × A           1                  0001            P0 = A            0000
0100             4 × A           2                  0001            P0 = A            0000
1000             8 × A           3                  0001            P0 = A            0000
0011             3A              0                  0011            P1 = 3A           0001
0110             2 × 3A          1                  0011            P1 = 3A           0001
1100             4 × 3A          2                  0011            P1 = 3A           0001
0101             5A              0                  0101            P2 = 5A           0010
1010             2 × 5A          1                  0101            P2 = 5A           0010
0111             7A              0                  0111            P3 = 7A           0011
1110             2 × 7A          1                  0111            P3 = 7A           0011
1001             9A              0                  1001            P4 = 9A           0100
1011             11A             0                  1011            P5 = 11A          0101
1101             13A             0                  1101            P6 = 13A          0110
1111             15A             0                  1111            P7 = 15A          0111
[Figure 2.5 diagram: the direct LUT mapping X to Y is replaced by a size-reduced LUT and a simple logic circuit, followed by a combinational block (Comb.) that produces Y.]
Figure 2.5: General approach for efficient LUT-based IP cores.
Chapter 3
LUT-Based Truncated Multiplier
High performance multipliers are essential components in most DSP and communication systems. In these systems, parallel multipliers are often employed to carry out high speed computations and transformations. However, general parallel multipliers often result in high hardware complexity and contribute a large share of the power consumption of the overall system. On the other hand, the rapid growth of mobile and portable devices leads to an emerging requirement for low power, low complexity digital circuits [32]. Therefore, much recent research has focused on architectures for low power, high speed multipliers.

Truncated multiplication is an efficient method to significantly reduce the area and power consumption of multipliers in DSP applications in which the full-length multiplication result is not required [33]. Moreover, in many DSP applications, when one of the multiplier inputs is constant, the hardware complexity can be reduced further by using a constant coefficient multiplier (KCM). For example, in digital filters, the coefficient set is fixed for a given transfer function. In this case, the KCM is more suitable than other approaches such as the distributed arithmetic (DA)-based method [34]. The disadvantage of the KCM, namely its fixed hardware configuration for each constant coefficient set, can be overcome by using reconfigurable hardware such as FPGAs, resulting in a reconfigurable KCM.

Moreover, LUT-based computation is well suited to the implementation of the KCM in DSP systems because the address width is reduced significantly when only one input of the multiplier is used as the LUT address. Some recent research has pointed out that LUT-based computation is a strong candidate for future DSP systems because of its speed advantage, which comes from relying mainly on LUT access operations [34].

However, the direct LUT-based implementation of a high resolution KCM leads to high hardware complexity because the LUT size grows exponentially as the input length increases. Some methods have been proposed to reduce the LUT size in LUT-based full-length multipliers ([34]-[37] and [19]), but more improvements are desired. Moreover, no research in the literature has addressed the combination of LUT-based computation and the truncated multiplier method. Therefore, in this chapter, we propose an efficient LUT-based architecture for truncated multipliers which can be applied in modern and future DSP systems. Figure 3.1 presents the introduction and motivation for the research on the truncated LUT-based KCM.

Figure 3.1: Introduction to LUT-Based Truncated Multiplier.
3.1 Introduction to LUT-Based Truncated Multiplier
3.1.1 Truncated Multiplier
Truncated multiplication can be employed to significantly reduce the hardware area and power consumption of multipliers in DSP applications where full-length multiplication results are not necessary. The truncating operation can be implemented by rounding the final result or by simply discarding some least significant bits (LSB) of the full bit-width product to obtain the desired output bit-width. However, the main drawback of these methods is the waste of hardware resources on the unnecessary LSB part generated before truncation. Hence, some methods have been proposed to approximate the discarded least significant part of the partial product matrix of the multiplier so that the hardware complexity is reduced significantly compared with standard multipliers.

James E. Stine et al. [33] presented some popular methods for truncated multiplication, including Constant Correction Truncated (CCT), Variable Correction Truncated (VCT) and Hybrid Correction Truncated (HCT) multipliers, and their implementation in FPGA hardware. The reported results show that hardware and power consumption can be remarkably reduced when truncated multiplication is employed. Recently, much research has focused on improved VCT methods with optimal compensation techniques to minimize the overall error. Nicola Petra et al. [40] proposed a sub-optimal linear compensation method and its implementation to provide a good trade-off between hardware complexity and multiplication accuracy.

However, more improvements in multiplier area and performance are desired to meet the requirements of future DSP applications. Therefore, in this chapter, we propose a novel and efficient architecture for the LUT-based truncated multiplier that combines the two approaches of truncated multiplication and LUT-based computation.
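The two simplest truncating operations mentioned above, discarding LSBs and rounding, can be compared with a small Python sketch. Note that it computes the full product first, which is exactly the waste that dedicated truncated multipliers avoid, so it illustrates only the error behaviour and not a hardware structure. The function names and the 6-bit operand widths are illustrative assumptions.

```python
def truncate(p, t):
    """Discard the t least significant bits of the product p (floor)."""
    return p >> t


def round_half_up(p, t):
    """Add half an output LSB, then discard the t least significant bits."""
    return (p + (1 << (t - 1))) >> t


def max_abs_error(fn, w, m, t):
    """Worst-case |exact - approximate| in units of the output LSB,
    found by exhaustive search over all w-bit by m-bit products."""
    worst = 0.0
    for x in range(1 << w):
        for a in range(1 << m):
            p = x * a
            worst = max(worst, abs(p / (1 << t) - fn(p, t)))
    return worst


# Assumed example: 6-bit by 6-bit products reduced from 12 to 6 output bits.
err_trunc = max_abs_error(truncate, 6, 6, 6)
err_round = max_abs_error(round_half_up, 6, 6, 6)
```

In this setting simple truncation stays below one output LSB of error and rounding stays within half an output LSB, which gives a baseline against which correction-based schemes can be judged.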
3.1.2 LUT-Based Truncated Multiplier
As presented in Chapter 2, the LUT-based computation circuit provides the out- puts by accessing the pre-stored LUT other than actual computations. This
29 3.1 Introduction to LUT-Based Truncated Multiplier look-up only operation leads to the high speed computation because it mainly requires only the time for accessing the pre-stored tables. Also, the LUT-based circuits consume less dynamic power because of less bit-switching rate [19]. A typical LUT-based computation architecture called LUT-based multiplier is shown in Figure 3.2 in which an LUT-based W × M (W -bit by M-bit) multiplier requires the LUT size of 2W +M (W + M)-bit to store 2W +M words with the word length of (W + M)-bit. As a result, the LUT size increases exponentially as the input lengths W and M increase. In KCM, however, if the second multiplier input (with the length of M) is constant, the LUT size is reduced to 2W (W + M)-bit. In this chapter, we consider the constant multiplication of an W -bit binary number X with an M-bit binary constant coefficient A to produce the product Y . In full-width multiplication, the product Y has the length of (W +M)-bit. The mathematic presentations of the input A and Y are: M −1 i A = ai2 (3.1) i=0
M+ W −1 i Y = aiX2 (3.2) i=0 where ai denotes the binary digit of constant A. According to Equation (3.2), the simplest method called straightforward shift-add KCM is performed by adding shifted partial products X2i which are computed by left-shifting the input X by i bits. It can be observed that the number of non-zero bits defines the number of adders required in the hardware implementation. To reduce the number of addition operations required in straightforward shift- add KCM, some recoding methods such as canonical signed digit (CSD) and Booth recoding have been proposed in which constant coefficients are presented in signed digit form so that the number of non-zero bits is decreased. The com- parison between two above recoding schemes in [39] shows that CSD recoding results in better performance than Booth method for the KCM implementation in most cases since CSD scheme always produces minimal number of non-zero digits in recoded representation. Tom Kean et al. [36] presented the technique of using split LUT structure to implement a constant coefficient multiplier, resulting in reduction of both
30 3.1 Introduction to LUT-Based Truncated Multiplier
W Look-Up W+M X Y=XA Table
Figure 3.2: General LUT-based multiplier. number of words and word length. In this structure, the input X is divided into two fractions XH and XL which are formed by W /2 high bits and W /2 low bits of X, respectively. Therefore, the operand X and the product Y can be written as:
$$X = X_H 2^k + X_L \qquad (3.3)$$

$$Y = X_H A 2^k + X_L A \qquad (3.4)$$

where k = W/2 and W is assumed to be even. Based on Equation (3.4), the LUT with a word length of (W + M) bits can be divided into two LUTs with a word length of (W/2 + M) bits, which store the $X_H A$ and $X_L A$ values. However, the reduction in LUT size must be traded off against the extra adder tree needed to obtain the result Y from the split LUT outputs.

An application of the LUT-based multiplier to constant coefficient convolution was reported in [37]. In [19], P. K. Meher proposed a technique called combined OMS-APC that optimizes the LUT content so that the number of words is reduced, although the word length is not. In this work, to further reduce the hardware area of the multiplier, we propose a novel method that employs both the LUT-based and truncated multiplication approaches to design low-area, high-speed constant multipliers for specific DSP applications.
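As an illustration only, the split LUT structure of Equations (3.3) and (3.4) can be modelled in a few lines of Python. This is a behavioural sketch, not the hardware description used in this work; the function names `build_split_luts` and `split_lut_multiply` and the example constant A = 181 are hypothetical choices made for the example.

```python
def build_split_luts(A, W):
    """Return two LUTs holding X_H * A and X_L * A for every k-bit address."""
    assert W % 2 == 0, "Equation (3.4) assumes an even W"
    k = W // 2
    products = [x * A for x in range(1 << k)]   # each entry fits in (k + M) bits
    # Both fractions are k bits wide, so the two tables hold the same products;
    # the shift by k is applied when the outputs are combined.
    return list(products), list(products)


def split_lut_multiply(X, lut_high, lut_low, W):
    """Compute Y = X_H * A * 2^k + X_L * A from the two LUT outputs."""
    k = W // 2
    x_high = X >> k                 # W/2 high bits of X
    x_low = X & ((1 << k) - 1)      # W/2 low bits of X
    return (lut_high[x_high] << k) + lut_low[x_low]


W, A = 8, 181                       # example values (assumed)
lut_high, lut_low = build_split_luts(A, W)
all_match = all(split_lut_multiply(X, lut_high, lut_low, W) == X * A
                for X in range(1 << W))
words_single = 1 << W                       # words in one unsplit LUT
words_split = len(lut_high) + len(lut_low)  # words in the two split LUTs
print(all_match, words_single, words_split)
```

For W = 8, the two split LUTs together hold 32 words instead of the 256 words of a single LUT, at the cost of one additional adder.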
3.2 Proposed LUT-Based Truncated Multiplier
3.2.1 Proposed Multiplier Architecture
In this section, we present the proposed architecture, which combines an improved bipartite LUT split technique with truncated multiplication. In the bipartite split technique, the input operand X is divided into two fractions as shown in (3.3). A more general method, the multipartite LUT split technique for LUT-based truncated multipliers, is discussed in Section 3.3. Figure 3.3 presents the hardware architecture of the proposed W × M-bit LUT-based truncated multiplier and its partial product matrix for the case of W = M = 8. The proposed multiplier includes two optimized LUTs and an adder block. LUT-1, with a word length of $W_1$ bits, and LUT-2, with a word length of $W_2$ bits, store truncated values of $X_L A$ and $X_H A 2^k$ from (3.4), respectively. Each truncated value is calculated as either the floor or the ceiling of the full-width product, depending on the LUT content optimization.

The LUT parameters and content are optimized to minimize the average relative error of the final result while achieving a good trade-off with hardware complexity. For each LUT parameter set and each constant coefficient value, the LUT content is optimized to minimize the average error of the proposed multiplier. The output $Y_{trunc}$ of the proposed multiplier is computed by adding the LUT-1 and LUT-2 outputs:
$$Y_{trunc} = Y_1 + Y_2 \qquad (3.5)$$

where $Y_1$ and $Y_2$ denote the outputs of LUT-1 and LUT-2, respectively, which can be expressed as:

$$Y_1 = \mathrm{Trunc}_{W_1}(X_L A) \qquad (3.6)$$

$$Y_2 = \mathrm{Trunc}_{W_2}(X_H A 2^k) \qquad (3.7)$$

where $\mathrm{Trunc}_J(x)$ denotes truncation of x to a J-bit result. To compensate for the error that can accumulate when the two truncated LUT values are added, some guard bits are included in the two LUT outputs, as depicted in Figure 3.3. Moreover, to produce a regular partial product matrix and keep the effect of each fraction on the multiplier error equal, it is assumed that:

$$W_2 = W_1 + k \qquad (3.8)$$

and

$$W_1 = k + \delta \qquad (3.9)$$

where δ is the number of guard bits. The absolute error is the difference between the truncated output of the proposed multiplier and the full-length product:

$$E = |Y_{trunc} - Y| \qquad (3.10)$$

where E is the absolute error, $Y_{trunc}$ is the multiplier output and Y is the exact product. Replacing Y in (3.10) by (3.4) and $Y_{trunc}$ by (3.5) gives the following equation for the absolute error E:

$$E = |E_1 + E_2| \qquad (3.11)$$

where the two error components $E_1$ and $E_2$ are defined as:

$$E_1 = Y_1 - X_L A \qquad (3.12)$$

$$E_2 = Y_2 - X_H A 2^k \qquad (3.13)$$
Another useful measure used in this chapter for error analysis is the relative error $E_{rel}$, defined as the ratio of the absolute error to the exact product:

$$E_{rel} = E/Y \qquad (3.14)$$
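The error definitions above can be explored with a small behavioural model. The sketch below is an illustration under stated assumptions, not the Matlab program used in this work. It follows the text in letting LUT-1 hold $X_L A$ with $W_1 = k + \delta$ bits and LUT-2 hold $X_H A 2^k$ with $W_2 = W_1 + k$ bits. It assumes that truncation discards the M − δ least significant bits of each partial product by taking the floor, and that the LUT sum is rescaled to the weight of the exact product before the error is computed. The function names and the constant A = 181 are hypothetical.

```python
def build_truncated_luts(A, W, M, delta):
    """Floor-truncated LUT contents for a W x M constant multiplier (W even)."""
    k = W // 2
    drop = M - delta     # assumed: number of LSBs cut from each partial product
    lut1 = [(xl * A) >> drop for xl in range(1 << k)]          # W1 = k + delta bits
    lut2 = [((xh * A) << k) >> drop for xh in range(1 << k)]   # W2 = W1 + k bits
    return lut1, lut2, drop


def average_relative_error(A, W=8, M=8, delta=2):
    """Average E_rel = |Y_trunc - Y| / Y over all non-zero inputs X."""
    k = W // 2
    lut1, lut2, drop = build_truncated_luts(A, W, M, delta)
    total = 0.0
    for X in range(1, 1 << W):           # X = 0 gives Y = 0, so E_rel is undefined
        y_trunc = (lut1[X & ((1 << k) - 1)] + lut2[X >> k]) << drop
        y_exact = X * A
        total += abs(y_trunc - y_exact) / y_exact
    return total / ((1 << W) - 1)


errors = [average_relative_error(A=181, delta=d) for d in range(5)]
for d, e in enumerate(errors):
    print(f"delta={d}: average E_rel = {100 * e:.2f}%")
```

With floor truncation both $E_1$ and $E_2$ are never positive, so the two errors always add up. This is one reason for allowing each LUT entry to be either the floor or the ceiling of the full-width product.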
3.2.2 LUT Optimization
The proposed LUT optimization method comprises choosing the design parameters (parameter optimization) and calculating the LUT content (LUT content optimization). To choose suitable design parameters, an error analysis is performed for each value of δ (that is, for each pair of $W_1$ and $W_2$ values defined by (3.8) and (3.9)). A Matlab program was developed to compute the average relative error of the proposed multiplier for different design parameter values over all values of the coefficient A, and then to choose the optimal parameter values. The error analysis results of the proposed multiplier are also compared with those of the conventional truncated multiplier, in which the truncated product is generated from the full-width product by discarding some least significant bits.

Figure 3.3: Proposed W × M-bit LUT-based truncated multiplier architecture (a) and its partial product matrix with W = M = 8 and δ = 2 (b). [In (a), the W-bit input X is split into two k-bit fractions that address the optimized LUT-2 ($W_2$-bit output) and the optimized LUT-1 ($W_1$-bit output), whose outputs feed an adder that produces Y. In (b), the matrix shows the constant A, the LUT input X, the LUT-1 and LUT-2 contents, and the output Y with its guard part.]

After the design parameter δ has been chosen, the LUT content optimization further reduces the average error of the proposed multiplier by searching the possible stored values for LUT-1 and LUT-2. A full search over all possible LUT contents would be optimal, but it involves an extensive simulation process and a very long simulation time. A sub-optimal method is therefore proposed that shortens the optimization time significantly by reducing the number of search cases, as shown in Figure 3.4. For each constant coefficient value, LUT-1 is first filled with rounded values of $X_L A$ and a search over all cases of LUT-2 is performed (each entry being either the floor or the ceiling of the full-width product). Then, LUT-2 is filled with rounded values of $X_H A 2^k$ and a similar search over all cases of LUT-1 is performed. Finally, the error results of the two searches are compared to obtain the optimized content of the two LUTs.

Table 3.1: Error analysis results for the proposed 8-bit multiplier with different numbers of guard bits (δ), compared with the conventional truncated multiplier.

Design                                          Average Erel (%)
Conventional truncation                         1.74
Proposed method with δ=0 (W1=4 and W2=8)        3.28
Proposed method with δ=1 (W1=5 and W2=9)        1.78
Proposed method with δ=2 (W1=6 and W2=10)       0.95
Proposed method with δ=3 (W1=7 and W2=11)       0.48
Proposed method with δ=4 (W1=8 and W2=12)       0.22

Table 3.1 and Figure 3.5 present the error analysis results of the optimized design for different values of δ over the full set of constant coefficients A. The average relative error decreases as the number of guard bits increases. For the proposed method with δ = 2 ($W_1$ = 6 and $W_2$ = 10), the average relative error (0.95%) is about 45% lower than that of the conventional truncation method (1.74%); therefore, this value of δ is chosen for the design in this work. For applications that require higher accuracy (a lower error target), higher values of δ can be used. The entries of each optimized LUT are then pre-calculated from the chosen parameters using the above algorithm. The adder module shown in Figure 3.3 (a) is also optimized independently to achieve the highest multiplier performance.
Figure 3.4: Proposed sub-optimal LUT content optimization algorithm. [Flowchart with two branches: (1) LUT-1 <= $\mathrm{Round}_{W_1}(X_L A)$, followed by a search over all cases of LUT-2; (2) LUT-2 <= $\mathrm{Round}_{W_2}(X_H A 2^k)$, followed by a search over all cases of LUT-1. The results of both branches pass through an error comparison that yields the optimized LUT content.]
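One possible reading of the sub-optimal algorithm in Figure 3.4 is sketched below in Python. It is not the Matlab program used in this work, and several details are assumptions: truncation is taken to discard M − δ least significant bits, the error is the average relative error over all non-zero inputs, and the constant A = 181 is an arbitrary example. The sketch also uses a shortcut that is not described in the text. Because each input X addresses exactly one entry of the searched LUT, the floor-or-ceiling choice can be made entry by entry while the other LUT is held fixed, which gives the same result as enumerating all contents of the searched LUT. All function names are hypothetical.

```python
def floor_and_ceiling(p, drop):
    """Floor and ceiling of p / 2**drop (identical when no bits are cut off)."""
    lo = p >> drop
    hi = lo + (1 if p & ((1 << drop) - 1) else 0)
    return lo, hi


def rel_err(y1, y2, exact, drop):
    """Relative error of the rescaled LUT sum against the exact product."""
    return abs(((y1 + y2) << drop) - exact) / exact


def search_free_lut(free_exact, fixed_exact, fixed_lut, drop):
    """Choose floor or ceiling per entry of one LUT while the other LUT is fixed."""
    chosen = []
    for p_free in free_exact:
        def cost(candidate):
            return sum(rel_err(candidate, y_fixed, p_free + p_fixed, drop)
                       for p_fixed, y_fixed in zip(fixed_exact, fixed_lut)
                       if p_free + p_fixed > 0)      # skip X = 0
        chosen.append(min(floor_and_ceiling(p_free, drop), key=cost))
    return chosen


def average_error(lut1, lut2, p_low, p_high, drop):
    """Average relative error over all non-zero inputs for given LUT contents."""
    terms = [rel_err(y1, y2, pl + ph, drop)
             for ph, y2 in zip(p_high, lut2)
             for pl, y1 in zip(p_low, lut1)
             if pl + ph > 0]
    return sum(terms) / len(terms)


def optimize_lut_content(A, W=8, M=8, delta=2):
    k, drop = W // 2, M - delta
    p_low = [xl * A for xl in range(1 << k)]            # exact X_L * A
    p_high = [(xh * A) << k for xh in range(1 << k)]    # exact X_H * A * 2^k
    half = (1 << drop) >> 1
    round_low = [(p + half) >> drop for p in p_low]
    round_high = [(p + half) >> drop for p in p_high]
    # Branch 1: LUT-1 rounded, LUT-2 searched. Branch 2: LUT-2 rounded, LUT-1 searched.
    branch1 = (round_low, search_free_lut(p_high, p_low, round_low, drop))
    branch2 = (search_free_lut(p_low, p_high, round_high, drop), round_high)
    best = min((branch1, branch2),
               key=lambda luts: average_error(luts[0], luts[1], p_low, p_high, drop))
    return best, average_error(best[0], best[1], p_low, p_high, drop)


(lut1, lut2), optimized = optimize_lut_content(A=181)
floor_only = average_error([(x * 181) >> 6 for x in range(16)],
                           [((x * 181) << 4) >> 6 for x in range(16)],
                           [x * 181 for x in range(16)],
                           [(x * 181) << 4 for x in range(16)], 6)
print(f"floor-only: {100 * floor_only:.2f}%  optimized: {100 * optimized:.2f}%")
```

The last lines compare the optimized content with plain floor truncation of both LUTs for the same constant, which gives a rough view of how much the content optimization contributes for a single coefficient.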
3.2.3 Synthesis and Implementation Results in FPGA Hardware
We implemented the 8-bit truncated multiplier using the proposed LUT-based method, together with comparable reference architectures, for all values of the constant coefficient. All designs used the same design parameters and constraints and were implemented on an Altera Cyclone II FPGA using the Quartus-II 9.1 design tool and VHDL coding. The average area, delay and area-delay product (ADP) are compared in Table 3.2. ADP is a widely used figure of merit for comparing the overall performance of different designs. For the FPGA implementation, ADP is calculated as the product of the number of logic elements (LEs) and the delay (in nanoseconds); the most efficient method is the one with the lowest ADP.

Straightforward shift-add KCM and CSD-based KCM, as described in Section 3.1, are specific hardware architectures of the conventional truncation technique chosen for the performance comparison. The Altera Megacore-based KCM, which is the parameterized constant multiplier core provided by Altera, and the conventional split LUT-based KCM (without LUT optimization) are also included in the comparison. The results in Table 3.2 show that the proposed multiplier achieves a lower ADP than the other constant coefficient multipliers. The proposed KCM reduces the ADP by about 26% to 28% compared with the CSD-based KCM, the Altera Megacore-based KCM and the conventional split LUT-based KCM, and by 52% compared with the straightforward shift-add KCM.

Figure 3.5: Error analysis results for the proposed 8-bit multiplier with different numbers of guard bits (solid line), compared with the conventional truncated multiplier (dashed line). [Plot of the average relative error (%), from 0 to 3.5, against the number of guard bits, from 0 to 4.]
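The ADP figures and the relative reductions quoted above can be checked with a few lines of Python. The area and delay values are copied from Table 3.2; the dictionary layout and the function name `area_delay_product` are only illustrative.

```python
# (number of LEs, delay in ns) for each design, taken from Table 3.2
designs = {
    "Straightforward shift-add KCM": (33, 17.5),
    "CSD-based KCM": (25, 15.3),
    "Altera Megacore-based KCM": (28, 13.3),
    "Conventional split LUT-based KCM": (28, 13.8),
    "Proposed LUT-based KCM": (23, 12.0),
}


def area_delay_product(area_les, delay_ns):
    """ADP as defined in the text: number of logic elements times delay in ns."""
    return area_les * delay_ns


adp = {name: area_delay_product(*values) for name, values in designs.items()}
proposed = adp["Proposed LUT-based KCM"]

# ADP reduction of the proposed design relative to each reference design, in percent
reduction = {name: 100 * (1 - proposed / value)
             for name, value in adp.items()
             if name != "Proposed LUT-based KCM"}

for name, pct in reduction.items():
    print(f"{name}: ADP = {adp[name]:.1f}, reduction = {pct:.1f}%")
```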
Table 3.2: Implementation results of different KCM methods in Altera Cyclone II FPGA.

Method                               Area (number of LEs)   Delay (ns)   ADP
Straightforward shift-add KCM        33                     17.5         578
CSD-based KCM                        25                     15.3         383
Altera Megacore-based KCM            28                     13.3         372
Conventional split LUT-based KCM     28                     13.8         386
Proposed LUT-based KCM               23                     12.0         276

3.3 General Multipartite LUT Split Technique

This section presents a more general method for the LUT-based truncated multiplier, the multipartite LUT split technique, in which the input operand is divided into more than two fractions and each fraction is used as the address of an optimized LUT. Consider the multiplication of a W-bit binary number X by a constant A, and suppose that W = l × k (if not, leading zero bits can be added to X so that this condition is satisfied). Then X can be decomposed into l fractions, each k bits wide, and Equation (3.3) can be extended as: