
Research on Hardware Intellectual Property

Cores Based on Look-Up Table Architecture for

DSP Applications

HOANG VAN PHUC

Doctoral Program in Electronic Engineering Graduate School of Electro-Communications The University of Electro-Communications

A thesis submitted for the degree of DOCTOR OF ENGINEERING

September 2012

I would like to dedicate this dissertation to my parents, my wife and my daughter.

Research on Hardware Intellectual Property

Cores Based on Look-Up Table Architecture for

DSP Applications

APPROVED

Assoc. Prof. Cong-Kha PHAM, Chairman

Prof. Kazushi NAKANO

Prof. Yoshinao MIZUGAKI

Prof. Koichiro ISHIBASHI

Assoc. Prof. Takayuki NAGAI

Date Approved by Chairman

© Copyright 2012 by HOANG Van Phuc. All Rights Reserved.

Abstract (in Japanese)

Research on Hardware IP Cores Based on Look-Up Table Architecture for DSP Applications

HOANG Van Phuc

Doctoral Program in Electronic Engineering, Graduate School of Electro-Communications, The University of Electro-Communications

This dissertation aims to propose area-efficient, high-performance hardware IP cores based on the look-up table (LUT) architecture and to apply them to DSP applications. LUT-based IP cores and a new computation system built on them are proposed; the proposed computation system combines conventional arithmetic units with the proposed IP cores. The proposed IP cores cover both basic arithmetic operations, such as multiplication and squaring, and elementary functions, such as sine and logarithm/anti-logarithm computations. First, two methods are proposed for the design of efficient LUT-based multiplier and squarer circuits. One is a truncated constant multiplier applicable to DSP applications in which the full-width result is not required; its architecture combines the two approaches of LUT-based computation and truncated constant multiplication for DSP. An LUT optimization algorithm is studied to find the optimal parameters and LUT contents. In addition, an improved hybrid LUT-based architecture is developed for fixed-width squarer circuits; this technique employs both LUT-based and conventional logic circuits to obtain a good compromise among the performance, error rate and complexity of the squarer. For the computation of elementary functions, two architectures combining LUT-based computation with linear difference methods are proposed. The first is an improved linear difference method for sine function computation, applicable to digital frequency synthesizers, adaptive signal processing and sine function generators; numerical analysis and optimization are performed to find the optimal parameters that reduce the LUT size and complexity while keeping the same error performance as other methods. The other is a quasi-symmetrical linear approach for logarithm and anti-logarithm computation, together with a two-step optimization algorithm developed to find the optimal parameters of its architecture. By using this optimization algorithm and an LUT size reduction technique, the proposed logarithm and anti-logarithm computation modules avoid increased hardware complexity while maintaining accuracy comparable with other methods. The proposed methods are applicable both to logarithm/anti-logarithm function generators in DSP systems and to the domain converters in hybrid number system processors. Finally, application-specific DSP designs and a computation system using the proposed IP cores are developed. In the future, these methods can be used in application-specific DSP applications and as arithmetic units of digital signal processors.

Abstract

Research on Hardware Intellectual Property Cores Based on Look-Up Table Architecture for DSP Applications

HOANG VAN PHUC

Doctoral Program in Electronic Engineering The University of Electro-Communications

Recently, high performance hardware intellectual property (IP) cores for arithmetic operations are highly required to meet the increasing demand of digital signal processing (DSP) applications in multimedia, wireless communications, mobile and handheld devices. Traditionally, high performance arithmetic function generators can be implemented by logic circuits with some advanced techniques such as the parallel architecture. However, these techniques have to trade off between computation performance and system complexity. In other words, high performance computation circuits may occupy much hardware area and lead to high power consumption. As a result, alternative approaches should be proposed for modern and future DSP applications. On the other hand, the look-up table (LUT)-based computation approach is promisingly suitable as such an alternative. LUT-based computation circuits provide the output by only accessing pre-stored tables rather than performing actual computations in real time. This LUT-access-only operation results in high speed and a low switching rate that leads to low power consumption. Moreover, according to the report of the International Technology Roadmap for Semiconductors (ITRS), embedded memories are becoming faster, achieving higher density and lower dynamic power consumption, and will dominate the content of future System-on-Chip DSP applications. Also, advanced LUT technologies lead to high improvements in LUT performance and density. Memory-based computation and LUT-based computation are also well suited for many DSP applications which require squaring, constant multiplications and elementary function computations. However, in simple direct LUT-based computation, the LUT size grows exponentially when the operand width increases. Therefore, much research focuses on reducing the LUT size. Although there are some existing methods for LUT-based computation components and systems, new and more improved methods are desired because future DSP applications require more efficient computation circuits. The research presented in this dissertation focuses on efficient hardware IP cores based on LUT architectures for DSP applications. Four methods for efficient specific hardware IP cores are proposed. Moreover, some detailed design examples and a computation system based on these IP cores are developed. The proposed IP cores include both basic arithmetic operations (such as multiplication and squaring) and elementary functions (such as sine and logarithm/anti-logarithm function computations). Two methods are proposed for the design of efficient LUT-based multiplier and squarer circuits. Firstly, a novel and efficient LUT-based truncated multiplier is presented for specific DSP applications in which the full width results are not required. This method combines the two approaches of LUT-based computation and truncated multipliers for DSP applications. An LUT optimization algorithm is also developed to find the optimal parameters and LUT content. Secondly, an improved hybrid LUT-based architecture is employed for fixed-width squarer circuits. This hybrid technique takes the advantages of both LUT-based and conventional logic circuits to achieve a good trade-off among squarer performance, error and complexity. For the computation of elementary functions using the LUT-based architecture, two novel architectures which combine LUT-based computation and the linear difference approach are proposed.
The first one is the improved linear difference method for sine function computation, which can be used in digital frequency synthesizers, adaptive signal processing and sine function generators. Numerical analysis and optimization are employed to find the optimal parameters which minimize the LUT size and hardware complexity while maintaining the same error performance as other methods. The second one is a novel quasi-symmetrical approach for the logarithm and anti-logarithm computation. An optimization algorithm is also developed to find the optimal parameters of the hardware architecture. Thanks to the optimization algorithm and the LUT size reduction method, the hardware complexity and computation delay of the logarithm and anti-logarithm computation modules can be reduced significantly with the same accuracy as other methods. This method can be applied both to logarithm/anti-logarithm function generators in DSP systems and to the domain converters in hybrid number system processors. Finally, some specific applications and a prototype computation system based on these IP cores are developed to clarify the improvements of the proposed methods. With the improvements achieved, the proposed methods can be considered as potential candidates for modern and future DSP systems.

List of Abbreviations

ADC Analog to Digital Converter
ADP Area-Delay Product
ALOGC Anti-logarithmic Converter
ALU Arithmetic Logic Unit
AU Arithmetic Unit
ASIC Application Specific Integrated Circuit
CMOS Complementary Metal Oxide Semiconductor
CORDIC COordinate Rotation DIgital Computer
CSC Color Space Conversion
CSD Canonical Signed Digit
DA Distributed Arithmetic
DAC Digital to Analog Converter
DSP Digital Signal Processing
FET Field Effect Transistor
FPGA Field Programmable Gate Array
GPP General Purpose Processor
HIL Hardware In the Loop
HNS Hybrid Number System
IP Intellectual Property
ITRS International Technology Roadmap for Semiconductors
KCM Constant Coefficient Multiplier
LE Logic Element
LMS Least Mean Square
LNS Logarithmic Number System
LOGC Logarithm Converter
LSB Least Significant Bit
LUT Look-up Table
MCU Microcontroller Unit
MRAM Magnetic Random Access Memory
MSB Most Significant Bit
MTM Multipartite Table Method
MFCC Mel Frequency Cepstral Coefficients

PAC Phase to Amplitude Converter
PPM Partial Product Matrix
RAM Random Access Memory
ROM Read-Only Memory
SoC System-on-Chip
SR Speech Recognition
TTA Transport Triggered Architecture
VHDL Very High Speed Integrated Circuits Hardware Description Language
VLSI Very Large Scale Integration

Contents

1 Introduction 1 1.1 Context of Research and Motivation ...... 1 1.1.1 Trend of DSP Market ...... 1 1.1.2 Trend of DSP Features ...... 1 1.1.3 Motivation ...... 2 1.2 Objective and Scope of the Dissertation ...... 5 1.3 Original Contributions ...... 7 1.4 Dissertation Overview ...... 8

2 Background and General Approach 13 2.1 Introduction to DSP Systems ...... 13 2.2 Hardware Platforms and DSP Implementation Methods ...... 14 2.2.1 General Purpose Processor (GPP)-Based Implementation .14 2.2.2 DSP Processor-Based Implementation ...... 15 2.2.3 ASIC-Based DSP Implementation ...... 15 2.2.4 FPGA-Based DSP Implementation ...... 16 2.2.5 Implementation Methods for DSP Applications ...... 17 2.3 Concept of LUT-Based Computation and Design Issues ...... 17 2.4 Literature Review of LUT-Based Computation Methods ..... 20 2.5 General Approach for Efficient LUT-Based IP Cores ...... 23

3 LUT-Based Truncated Multiplier 27 3.1 Introduction to LUT-Based Truncated Multiplier ...... 29 3.1.1 Truncated Multiplier ...... 29 3.1.2 LUT-Based Truncated Multiplier ...... 29


3.2 Proposed LUT-Based Truncated Multiplier ...... 32 3.2.1 Proposed Multiplier Architecture ...... 32 3.2.2 LUT Optimization ...... 33 3.2.3 Synthesis and Implementation Results in FPGA Hardware 36 3.3 General Multipartite LUT Split Technique ...... 37 3.4 Chapter Conclusions ...... 39

4 Hybrid LUT-Based Architecture for Fixed-Width Squarer 41 4.1 Fixed-Width Squarer ...... 42 4.2 LUT-Based Fixed-Width Squarer ...... 44 4.3 Proposed Hybrid LUT-Based Fixed-Width Squarer ...... 45 4.4 Error Analysis of Proposed Fixed-Width Squarer ...... 48 4.5 Implementation and Measurement Results ...... 49 4.6 Chapter Conclusions ...... 50

5 LUT-Based Sine Function Computation 59 5.1 Sine Function Computation ...... 59 5.2 Direct Digital Frequency Synthesizer (DDFS) ...... 60 5.3 Phase to Amplitude Converter in DDFS ...... 60 5.4 LUT Compression Methods for Phase to Amplitude Converter .. 61 5.4.1 Sine Wave Symmetry Exploitation ...... 62 5.4.2 Coarse/Fine LUT Splitting Method ...... 62 5.4.3 Sunderland and Nicholas Architectures ...... 63 5.4.4 Difference Method ...... 64 5.4.5 Multipartite Table Method ...... 65 5.5 Improved Linear Difference Method ...... 65 5.6 Implementation Results ...... 67 5.7 Chapter Conclusions ...... 70

6 LUT-Based Logarithm and Anti-logarithm Computation 73 6.1 Introduction ...... 73 6.2 Logarithm and Anti-logarithm Approximation Methods ...... 74 6.2.1 Mitchell Approximation ...... 75 6.2.2 Shift-Add Piece-Wise Linear Approximation ...... 76


6.2.3 Table-Based Approximation Methods ...... 77 6.2.4 Combined LUT-Based/Difference Method ...... 77 6.3 Proposed Quasi-Symmetrical Approach ...... 78 6.4 Implementation of the Fundamental Functions ...... 82 6.5 Applications in Logarithm Generator and HNS Processors .... 84 6.5.1 Logarithm Function Generator for DSP Applications ... 84 6.5.2 Hybrid Arithmetic Unit For HNS Processors ...... 86 6.6 Chapter Conclusions ...... 87

7 Prototype and Applications of Proposed IP Cores 91 7.1 Chapter Objectives and Targeted Applications ...... 91 7.1.1 Chapter Objectives ...... 91 7.1.2 Targeted Applications of Proposed IP Cores ...... 92 7.2 Design Flow for Applications using Proposed IP Cores ...... 93 7.2.1 Design Flow ...... 93 7.2.2 Advantages of Using Proposed IP Cores for DSP Applications 93 7.3 A Prototype Computation System for the Proposed IP Cores ... 94 7.3.1 Prototype Computation System Architecture ...... 95 7.3.2 System Implementation Results ...... 98 7.3.3 Verification and Measurement Method ...... 99 7.4 Design Examples Using Proposed IP Cores ...... 100 7.4.1 RGB to YCbCr Color Space Conversion Using Proposed Multiplier IP Core ...... 100 7.4.1.1 Color Space Conversion and Its Hardware Archi- tecture ...... 100 7.4.1.2 FPGA Implementation and Verification Results . 103 7.4.2 2-D Convolver for Image Processing ...... 105 7.4.2.1 Image Convolver ...... 105 7.4.2.2 Convolver Hardware Architecture and Implemen- tation Results ...... 106 7.4.3 Speech Feature Extraction Using Proposed Logarithm Gen- erator ...... 107 7.4.3.1 Speech Recognition ...... 107


7.4.3.2 Speech Feature Extraction ...... 108 7.4.3.3 Hardware Architecture and Implementation Results108 7.5 Chapter Conclusions ...... 109

8 Conclusions and Future Work 117 8.1 Conclusions ...... 117 8.2 Future Work ...... 118

A Sine/Cosine Computation Using Pipelined CORDIC 121 A.1 CORDIC Algorithm ...... 121 A.2 Pipelined CORDIC Architecture ...... 123 A.3 Implementation Results ...... 124

B Operation List of the Conventional ALU Component 127

C List of Publications 129 C.1 Journal Papers ...... 129 C.2 Book Chapter ...... 130 C.3 International Conference Presentations ...... 130 C.4 Technical Report ...... 131

List of Figures

1.1 Operation percentage for OpenGL TnL running on a conventional DSP processor ([3])...... 3 1.2 Trend in memory domination (%) in SoC content ([4])...... 6 1.3 Product function size trends (Source: ORTC 2011 ITRS Korea Winter Public Conference - Executive summary [4])...... 10 1.4 Summary of the motivation and the dissertation overview ..... 11

2.1 General block diagram of a DSP system...... 14 2.2 Hardware platforms for implementation of DSP applications. ... 16 2.3 General LUT-based computation system...... 18 2.4 Combined APC-OMS architecture for LUT-based multiplier [19]. .22 2.5 General approach for efficient LUT-based IP cores...... 25

3.1 Introduction to LUT-Based Truncated Multiplier...... 28 3.2 General LUT-based multiplier...... 31 3.3 Proposed W × M-bit LUT-based truncated multiplier architecture (a) and its partial product matrix with W = M = 8 and δ = 2 (b). 34 3.4 Proposed sub-optimal LUT content optimization algorithm. ... 36 3.5 Error analysis results for the proposed 8-bit multiplier with different numbers of guard bits (solid line), compared with conventional truncated multiplier (dashed line)...... 37

4.1 Partial product matrix of the 8-bit fixed-width squarer...... 43 4.2 General LUT-based full-width squarer...... 45 4.3 Proposed hybrid LUT-based architecture for fixed-width squarer. 46 4.4 Coarse/fine LUT splitting method to further reduce the LUT size. 48


4.5 ADP results for different values of W ...... 50 4.6 ADP reduction results for different values of W (compared with Garofalo et al. method in [47] )...... 51 4.7 Chip microphotograph of the proposed 8-bit fixed-width squarer. .53 4.8 Test and measurement flow for the proposed fixed-width squarer. 55 4.9 Simulation result in Modelsim software (left) and measurement result in logic analyzer (right)...... 56 4.10 Critical path analysis result in Synopsys Design Compiler. .... 57 4.11 Longest path delay measurement result...... 57

5.1 General block diagram and signal flow of the DDFS...... 61 5.2 Operation principle of DDFS with the digital phase wheel. .... 62 5.3 General DDFS with sine wave symmetry exploitation...... 63 5.4 Coarse/Fine LUT splitting...... 63 5.5 Sunderland method...... 65 5.6 General difference method...... 66 5.7 Proposed function (J = 4) vs Hsu difference function (J =3). .. 68 5.8 Architecture of phase to amplitude converter using the improved linear difference method...... 69 5.9 Digital sine wave output of the proposed DDFS in Modelsim sim- ulation...... 70

6.1 Area breakdown of a 3-D graphics chip in [3]...... 75 6.2 Logarithmic and anti-logarithmic Mitchell error functions. .... 76 6.3 General difference method for logarithm approximation...... 78 6.4 The 4-segment linear method with small error LUT proposed in [78]. 79 6.5 The idea for the proposed quasi-symmetrical approach...... 80 6.6 Proposed 2-step parameter optimization algorithm for the hardware approximation of log2(1 + x)...... 83 6.7 Proposed quasi-symmetrical linear approximation method (after the parameter optimization)...... 84

6.8 Proposed architecture for the approximation of log2(1 + x) with W = 13...... 85 6.9 A 16-bit logarithm function generator using the proposed method. 88


6.10 An arithmetic unit for HNS processors using proposed converters. 89 6.11 Layout of the proposed 16-bit logarithm function generator. ... 89

7.1 Chapter objectives...... 92 7.2 Design flow for DSP applications using proposed IP cores. .... 94 7.3 Merged DSP core/MCU/coprocessor architecture as a trend for future DSP system architecture...... 96 7.4 Hardware architecture of the prototype computation system. ... 97 7.5 Layout of the prototype computation system...... 99 7.6 Area breakdown of the prototype computation system...... 100 7.7 Verification and measurement flow for the prototype computation system...... 102 7.8 Proposed RGB to YCbCr color space conversion using LUT-based constant multiplier...... 104 7.9 Verification flow with DSP Builder using Matlab simulation and HIL...... 110 7.10 Operation of a 2-D convolver for image processing...... 112 7.11 Hardware architecture of a 2-D convolver for image processing. . . 112 7.12 Block diagram of speech recognition system...... 114 7.13 Feature extraction hardware architecture for the speech recognition system...... 114 7.14 Hardware architecture of the logarithm computation block for MFCC feature extraction in the speech recognition application ...... 115 7.15 Verification results of the logarithm computation block for MFCC feature extraction in the speech recognition application ...... 116

8.1 TTA processor architecture...... 119

A.1 Vector rotation in CORDIC algorithm...... 123 A.2 A stage in the pipelined CORDIC architecture...... 125 A.3 Block diagram of the CORDIC-based DDFS...... 125

List of Tables

1.1 Class of application most suitable for each emerging research tech- nology entry evaluated (Source: ITRS [4])...... 5

2.1 Comparison of different hardware platforms for DSP implementation. 17 2.2 The truth table of an LUT-based multiplier using APC method with two operands X and A...... 23 2.3 The truth table of an LUT-based multiplier using OMS method with two operands X and A...... 24

3.1 Error analysis results for the proposed 8-bit multiplier with different numbers of guard bits (δ), compared with conventional truncated multiplier...... 35 3.2 Implementation results of different KCM methods in Altera Cyclone II FPGA...... 38

4.1 Four-folding method for hybrid LUT-based fixed-width squarer. .47 4.2 Error analysis for different fixed-width squarer architectures. ... 52 4.3 LUT parameters and compression ratio of the proposed fixed-width squarer designs with LUT splitting technique...... 53 4.4 Implementation results of different fixed-width squarer architec- tures using 0.18-μm CMOS technology...... 54 4.5 ADP reduction with different values of W (compared with the method in [47] )...... 55

5.1 Computation results of word size reduction for different numbers of segments...... 69


5.2 Comparison with the previous methods...... 70

6.1 Proposed 2-step optimization algorithm...... 82 6.2 Optimal parameter results using the 2-step optimization algorithm. 82 6.3 Summary of optimal parameters and error analysis results for the proposed method compared with the method presented in [78] .. 85 6.4 FPGA implementation results of different methods for approximation of log2(1 + x) and (2^x − 1) with W = 13...... 86 6.5 Implementation results in 0.18-μm CMOS technology of different methods for approximation of log2(1 + x) and (2^x − 1) with W = 13. 87 6.6 FPGA implementation results of different methods for the 16-bit logarithm generator...... 87 6.7 Implementation results in 0.18-μm CMOS technology of different methods for the 16-bit logarithm generator...... 88

7.1 List of operations supported by the prototype computation system. 98 7.2 Main parameters for the ASIC implementation of the prototype computation system...... 101 7.3 Area consumption by components in the prototype computation system (×10^3 μm^2)...... 101 7.4 Comparison of FPGA implementation results of the color space converter...... 105 7.5 Verification results using Altera DSP flow by hardware co-simulation with Cyclone II FPGA in Matlab-Simulink environment...... 111 7.6 ASIC implementation results of the proposed RGB to YCbCr color space converter using a standard cell library with 0.18-μm CMOS technology...... 112 7.7 FPGA implementation results of 2-D convolver with the input/output resolution of 8 bit...... 113 7.8 Verification result of 2-D convolver...... 113 7.9 ASIC implementation results of 2-D convolver with 0.18-μm CMOS technology using a standard cell library...... 114


7.10 FPGA implementation results of logarithm computation block in MFCC feature extraction with the input/output resolution of 16 bit ...... 115 7.11 ASIC implementation results of logarithm computation block in MFCC feature extraction with 0.18-μm CMOS technology using a standard cell library...... 115

A.1 Implementation results for the pipelined CORDIC-based DDFS. . 125

B.1 List of operations supported by the conventional ALU component. 128

Chapter 1

Introduction

1.1 Context of Research and Motivation

1.1.1 Trend of DSP Market

The digital signal processing (DSP) global market is growing strongly due to the fast increasing demand of DSP applications. According to a report in [1], the global DSP market by revenue (in Intellectual Property (IP), Design Architecture and Applications) is estimated to grow from $6.20 billion in 2011 to $9.58 billion in 2016, with a Compound Annual Growth Rate (CAGR) of 9.09%. The percentage share of the DSP industry in the global semiconductor revenue has been approximately between 1% and 3% over the years, standing at 1.97% in 2011. The rising demand for DSPs in the wireless infrastructure sector is one of the most prominent reasons for this growth. The rising trend of several advanced performance technologies, such as DSP-based System-on-Chip (SoC), is another reason accountable for the significant potential growth of the DSP market.

1.1.2 Trend of DSP Features

With the increasing demand of high performance, low area and low power DSP applications and the continuous development of integrated circuit technology, the following trends of DSP features can be predicted.


• Modern DSP cores are coming to support higher clock frequencies. For example, Texas Instruments released the C6000 series DSP processors, which have clock speeds of 1.2 GHz. However, the clock frequency will be limited, and other approaches should be considered.

• Multicore/multiprocessor architectures are employed more and more in modern DSP processor cores to increase the performance of the DSP system and to work around the clock frequency limitation. Moreover, DSP processor architecture is moving toward many-core approaches to achieve higher performance.

• The hybrid (or merged) DSP core/Microcontroller Unit (MCU)/coprocessor architecture is also a trend for modern and future DSP systems due to the high convergence of multiple application types in DSP and multimedia systems.

• More dedicated components are predicted to be present in DSP systems to accelerate the system and provide higher performance adaptive signal processing.

• Mixed and hybrid number system support is a promising approach to achieve a good trade-off between computation performance and power consumption.

1.1.3 Motivation

With the above trends of the DSP market and architecture, modern and future DSP systems are expected to employ more and more multiplication, squaring and other complex operations, and more dedicated computation modules, to meet the requirements of more complicated algorithms and applications, especially for 3-D graphics and high-speed multimedia processing systems. Employing many multiplier circuits, especially parallel architectures, can lead to high system complexity and high power consumption. As pointed out in [2], the large majority (about 90%) of the operations in a 3-D graphics rendering application are multiplication and related arithmetic operations. According to the research by B. G. Nam et al. [3], a typical 3-D graphics processor employs a lot of complex operations such as multiplication (MUL), division (DIV), squaring (SQ) and powering (POW), as presented in Figure 1.1, together with basic operations such as addition (ADD).


Figure 1.1: Operation percentage for OpenGL TnL running on a conventional DSP processor ([3]): MUL 30%, ADD 24%, SQ 24%, POW 16%, DIV 5%, others 1%.

Hence, low complexity arithmetic components are highly required for future DSP applications. Moreover, the continuous development of integration technology allows the design and implementation of sophisticated processors for complex algorithms with high computation speed. Also, the higher integration density makes DSP circuits smaller. However, with the fast increasing demand for wireless, mobile and portable devices such as smart phones, tablets and wireless-based intelligent electronic devices, the requirements of low area and low power DSP and multimedia systems are becoming more and more pressing. For these applications, not only high speed but also low area and low power consumption circuits are highly required (especially for battery-based devices). Meanwhile, parallel multipliers and the above mentioned complex operations often result in high system complexity and high power consumption. Therefore, an alternative approach should be considered to derive efficient computation systems without complicated multipliers while achieving high computation performance.

Using memory-based computation, or LUT-based computation more generally, is such an alternative approach, and it has been presented in the literature. Meanwhile, according to the 2010 report of the International Technology Roadmap for Semiconductors (ITRS) [4], embedded memories are going beyond the 16-nm technology generation and becoming faster, achieving higher density and lower dynamic power consumption, and will dominate the content of System-on-Chip (SoC) applications. Figure 1.2 presents the trend of memory domination in SoC content. Table 1.1 shows 8 candidate memory technologies which have been evaluated in the ITRS report. In this report, the Emerging Research Devices (ERD) and Emerging Research Materials (ERM) working groups of the ITRS identified Spin Transfer Torque MRAM and Redox RRAM as emerging memory technologies recommended for accelerated research and development, leading to scaling and commercialization of non-volatile RAM to and beyond the 16-nm generation. Magnetic memory/logic is predicted to dominate future electronic systems ([5] and [6]). Also, the development of advanced 3-D integration technology will further promote this trend [7]. Figure 1.3 depicts the trend of product function size reported at the ORTC 2011 ITRS Korea Winter Public Conference. It can be seen that the function size is becoming smaller, corresponding to higher chip density. On the other hand, the density of a memory circuit is higher than that of a logic circuit. Moreover, P. Meinerzhagen et al. [9] present an approach of using a standard cell library to implement specific memories in 65-nm technology. With the availability of different methods for efficient memory and LUT implementation, the LUT-based architectures and methods presented in this dissertation will be useful for DSP designers when choosing a design method and architecture for the arithmetic functions and modules in DSP applications. Especially when the advanced memory technologies mentioned previously become popular, these architectures and methods can be employed even more efficiently. Therefore, using LUT-based computation combined with some simple logic circuits is a promising solution for this alternative approach.


Table 1.1: Class of application most suitable for each emerging research technol- ogy entry evaluated (Source: ITRS [4]). Emerging Research Memory Technology Entry Standalone Embedded √ Ferroelectric-gate FET √ Nanoelectromechanical RAM √ Spin Transfer Torque MRAM √ √ Nanoionic or Redox Memory √ √ Nanowire Phase Change Memory (PCM) √ Electronic Effects Memory √ √ Macromolecular memory √ √ Molecular memory

LUT-based computation refers to a type of digital circuit/system which provides the computation results by accessing pre-stored tables rather than performing actual computations. The LUT-based circuit can lead to high speed computation because it mainly involves only the time for accessing the pre-stored tables. It can also reduce the dynamic power consumption because of the minimal switching rate. The LUT can be implemented by either application-specific embedded memory circuits or optimized logic circuits. Some examples of application-specific memories are the low power memories for mobile devices and consumer electronics products, the high-speed memories for multimedia applications, the wide temperature memories for automotive applications, the high reliability memories for biomedical instruments and the radiation hardened memories for space applications.

1.2 Objective and Scope of the Dissertation

The objective of the research presented in this dissertation is to find efficient architectures and methods for arithmetic components based on the LUT architecture for specific DSP applications. Since DSP applications have some specific characteristics which differ from general computation applications, the architectures and implementation methods are required to be adapted appropriately. Moreover, the following important characteristics of DSP applications should also be considered:


Figure 1.2: Trend in memory domination (%) in SoC content ([4]).

• Error-tolerance: Error is acceptable in many DSP applications such as digital filtering, video and image processing. Therefore, the hardware complexity can be reduced significantly by employing some techniques such as truncated multipliers and squarers, LUT size reduction and so on.

• Low complexity DSP applications are highly desired because of the high demand for portable, hand-held and wearable electronic devices. Electronic devices are becoming smaller, lighter and thinner. Even though integration technology is moving to deep sub-micron nodes, it is still highly required to develop more efficient methods at the architecture and system levels for lower hardware complexity DSP applications.



• Real-time operation. Since DSP applications are often real-time, high-speed computation circuits are desired, and the speed (or delay) of the computation module has to be taken into account in the design methodology and optimization algorithms.

• Low power DSP applications are becoming more and more popular due to the increasing demand of battery-based devices. Hence, the power consumption should be reduced in future DSP systems.

Therefore, the research work presented in this dissertation aims to find efficient methods which are suitable for DSP applications considering the above characteristics. For example, due to the error-tolerance of DSP systems, methods for low error truncated multipliers and squarers using LUT-based architectures are proposed to reduce the hardware area significantly while achieving acceptable computation accuracy. Also, to meet the requirement of low complexity DSP applications, research on finding an efficient method for low area logarithmic and anti-logarithmic converters is carried out for applications in DSP systems and in hybrid number system processors.

1.3 Original Contributions

Several contributions on the architectures and design methodologies for low area, high performance LUT-based IP cores and systems have been made in this work. Parts of these contributions have been published or submitted for publication. The following list summarizes the main contributions within the scope of this work.

1. Two efficient architectures for the LUT-based truncated multiplier and fixed-width squarer are proposed. Since multipliers and squarers are heavily employed in many DSP applications, the proposed methods can open up new methods and suggestions for designers and system developers of digital


hardware design related to DSP systems. Two papers based on these methods were accepted for publication in the IEICE Transactions on Fundamentals in June and July 2012, respectively.

2. A new, efficient and more general linear difference method combined with LUT-based error correction is proposed for the design of the phase to amplitude converter in the Direct Digital Frequency Synthesizer (DDFS). The paper based on this method was published in the IEICE Transactions on Fundamentals in March 2011.

3. The novel quasi-symmetrical approach for low area, efficient logarithmic and anti-logarithmic converters is proposed. Since logarithmic and anti-logarithmic converters are essential components in hybrid number system processors and many DSP applications, the proposed approach can have a great impact on improving the system performance in those applications. The paper based on this method was submitted for publication in the IEICE Transactions on Fundamentals.

1.4 Dissertation Overview

The dissertation is divided into 8 chapters. Figure 1.4 presents the relationship between the chapters of this dissertation. After this introduction chapter, Chapter 2 provides the background knowledge and design issues of DSP applications and LUT-based computation systems. The existing methods and architectures for LUT-based computation are also reviewed and discussed in this chapter. Chapter 3 presents the efficient LUT-based truncated multiplier for constant multiplication, which is heavily employed in many DSP applications. By using the proposed parameter and LUT content optimization, significant improvements in both circuit area and delay can be achieved. Squaring is also a fundamental operation in many DSP applications. Therefore, an improved hybrid LUT-based architecture for the design of a low error, efficient fixed-width squarer is proposed in Chapter 4.

The mathematical identities of the squaring operation are exploited in a new way so that a low error, efficient architecture for the fixed-width squarer can be achieved. The implementation and chip measurement results in 0.18-μm CMOS technology are presented and discussed. In Chapter 5, an improved linear difference approximation combined with the LUT-based architecture is proposed for sine function computation, targeted at the phase to sine converter in the Direct Digital Frequency Synthesizer (DDFS). By employing parameter optimization in Matlab software, an optimal architecture for the linear difference method is derived. Chapter 6 presents a novel approach to logarithmic and anti-logarithmic converters, which are essential components in hybrid number system processors and many DSP applications. The novel quasi-symmetrical approach and the proposed parameter optimization algorithm to achieve more area-speed efficient converters are also presented. Moreover, the applications of the proposed converters to logarithm/exponent function generators and the arithmetic unit for hybrid number system processors are proposed in this chapter. The implementation results on both FPGA and 0.18-μm CMOS technology platforms are also presented. Chapter 7 presents the applications of the proposed IP cores and a prototype computation system based on these cores. Moreover, the design flow and detailed design examples are also provided in this chapter. Finally, Chapter 8 summarizes the main results of this work, concludes this dissertation and suggests a list of open topics for future research.


Figure 1.3: Product function size trends (Source: ORTC 2011 ITRS Korea Winter Public Conference - Executive summary [4]).


Figure 1.4: Summary of the motivation and the dissertation overview: DSP applications (high speed and low power requirements) motivate LUT-based computation; basic arithmetic operations: multiplier (Chapter 3), squarer (Chapter 4); elementary functions: sine generator (Chapter 5), logarithm generator (Chapter 6); prototype and applications (Chapter 7).


Chapter 2

Background and General Approach

2.1 Introduction to DSP Systems

The signals in the real world are analog by nature. However, digital computers and many electronic devices operate on data represented in binary format, composed of a finite number of bits. In a DSP system, the original analog signals are converted to digital ones which are represented by sequences of finite-precision numbers, typically binary numbers in which only the two logic values '0' and '1' are used. The signal processing tasks are performed by digital computation components [10]. Figure 2.1 presents the general block diagram of a typical DSP system. As shown in this figure, a DSP system receives the input signal, processes it in the digital domain and generates the outputs according to the given algorithms. The analog and digital parts in this system interact through analog to digital converters (ADC) and digital to analog converters (DAC). The main component in this system is the DSP module which performs the signal processing algorithms in the digital domain. The input and output filters are also required to guarantee the suitable frequency band for the DSP system.



Figure 2.1: General block diagram of a DSP system.

2.2 Hardware Platforms and DSP Implementation Methods

2.2.1 General Purpose Processor (GPP)-Based Implementation

General purpose processors (GPP) have been employed widely in many computing systems over the past decades. Although DSP applications require specific operations and are computation-intensive, general purpose processors can be used to implement them. The MMX processor is a typical GPP which supports DSP and multimedia applications [11]. The PowerPC 604e processor can also be used to implement DSP applications [12]. The high applicability of this type of processor has led to its wide utilization. Due to their merit of high flexibility, GPPs can perform a wide range of applications. The flexibility inherent in GPPs was a key component of the computer revolution. To date, processors have been the driving engine behind general-purpose computing. Originally, due to the limitation of active chip area, processors focused on the heavy reuse of a single or small number of functional units. With the continuous development of Very Large Scale Integration (VLSI) technology, we can now integrate complete and powerful processors and other computation modules onto a single integrated circuit. Therefore, some alternatives to GPP-based DSP implementation will be introduced in the following sections.


2.2.2 DSP Processor-Based Implementation

Digital Signal Processors are often abbreviated as DSP (the same abbreviation as for Digital Signal Processing). Therefore, in this dissertation, the term 'DSP processor' is used to prevent confusion between the two concepts. In this kind of DSP implementation, the specific DSP application is developed as software in a specific programming language and is then executed by a DSP processor. Moreover, a DSP processor is a dedicated processor with the following characteristics to support DSP applications:

• Real-time digital signal processing capabilities. DSP processors typically have to process data in real time, i.e., the correctness of the operation depends heavily on the time when the data processing is completed.

• High throughput. DSP processors can sustain processing of high-speed streaming data, such as audio and multimedia data processing.

• Deterministic operation. The execution time of DSP programs can be foreseen accurately, thus guaranteeing a repeatable, desired performance.

• Re-programmability by software. Different system behaviors might be obtained by re-coding the algorithm executed by the DSP processor instead of hardware modifications.

DSP processors appeared on the market in the early 1980s. Over the last three decades, they have been the key enabling technology for many electronics products in fields such as communication systems, multimedia, automotive, instrumentation and military applications.

2.2.3 ASIC-Based DSP Implementation

Application Specific Integrated Circuit (ASIC)-based implementation can be employed for DSP applications to design high performance DSP systems. With the hardware optimized for a specific application, the ASIC-based DSP implementation method can lead to high performance, area efficient, low power DSP designs. However, it requires high design effort and time. Therefore, this approach is well-suited for high-volume DSP products.


Figure 2.2: Hardware platforms for implementation of DSP applications: GPP, DSP processor, FPGA and ASIC, ordered from high programmability to high specialization.

2.2.4 FPGA-Based DSP Implementation

FPGA, a typical reconfigurable hardware platform, can provide rapid prototyping and implementation for DSP applications [13]. It can combine the advantages of both the dedicated ASIC-based and the software-based DSP implementation approaches. With the development of reconfigurable hardware technologies, this approach is very promising for providing high performance, rapid implementation of DSP applications. Figure 2.2 presents the design spectrum of DSP applications over the above hardware platforms and their degrees of programmability and specialization. Table 2.1 summarizes the comparison of several factors when using different hardware platforms to implement DSP applications. Depending on the specific features and requirements of each DSP application, the designer can choose a suitable hardware platform for the implementation.


Table 2.1: Comparison of different hardware platforms for DSP implementation.
Platform      | Performance | Cost   | Power  | Flexibility | Design effort
ASIC          | high        | high   | low    | low         | high
DSP processor | medium      | medium | medium | medium      | medium
GPP           | low         | low    | medium | high        | low
FPGA-based    | medium      | medium | high   | high        | medium

2.2.5 Implementation Methods for DSP Applications

The following methods can be used to implement a DSP application.

• Dedicated hardware design method. In this type of implementation, a DSP function/application is performed by a specific and dedicated hardware module. Digital filters, encoders and decoders, and in particular the Fast Fourier Transform (FFT) and Finite Impulse Response (FIR) filters, are typical examples of this approach in DSP implementation. The hardware platform for this method can be FPGA or ASIC hardware.

• Software-based design using a general purpose processor (GPP) or a DSP processor. The software to implement the DSP algorithm is developed based on an architecture of a specific GPP or DSP processor.

• The combination of the two above approaches, called the hardware/software co-design method, can be employed for a complicated DSP application ([14] and [15]).

The proposed hardware IP cores in this research can be used with all above implementation methods for DSP applications.

2.3 Concept of LUT-Based Computation and Design Issues

The concept of memory-based computation, first in the form of ROM-based computation, was employed in the IBM 1620 computer announced by IBM in 1959 [16]. In this computer, addition, subtraction and multiplication are accomplished by automatic table look-up in the core storage, whereas division is accomplished by available subroutines or by an optional automatic divide feature. This concept was also presented in some textbooks such as [17] and [18]. The basic idea behind this concept is that a memory array often has higher density than a logic circuit [18]. It promisingly results in high speed computation because only the time for accessing pre-stored memory arrays is required, without actual computation circuits. It can also provide some other advantages, as discussed later. However, since the required memory size grows exponentially when the operand bit-width increases, and since the cost of memory devices was high and embedded memories were not popular at that time, this approach was not widely used in the past. Currently, as a result of the continuous development of integrated circuit and memory technologies mentioned in Section 1.1, the approach of memory-based computation, and the more general concept of LUT-based computation, can be realized efficiently [5]. Bipul C. Paul and his colleagues at Toshiba Corporation have proposed an efficient implementation of a 16-bit multiplier using a ROM-based method with a full custom design flow [8]. The following paragraphs provide the basic concept of LUT-based computation and its background knowledge in more detail. Firstly, LUT-based computation refers to a type of digital circuit/system which provides the computation results by accessing pre-stored tables rather than by actual computation. Figure 2.3 shows the block diagram of a general LUT-based computation circuit with a W-bit input X used as the address for accessing an LUT of 2^W words to provide the output Y with a width of (W + M) bits.

Figure 2.3: General LUT-based computation system.
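To make this concept concrete, the following small C sketch models the system of Figure 2.3 in software (an illustration only, not a module from this dissertation): the table is filled once, off-line, for an arbitrary single-operand function, and the run-time evaluation is a single array access. The widths W and M and the squaring example are arbitrary illustrative choices.

#include <stdint.h>
#include <stdlib.h>

/* Toy C model of the LUT-based computation unit of Figure 2.3: a
 * W-bit input X addresses a table of 2^W pre-computed words, so
 * evaluating f(X) at run time is one memory access, no arithmetic. */
#define W 8                        /* input (address) width          */
#define M 8                        /* extra output bits              */

typedef uint16_t word_t;           /* a (W + M)-bit result fits here */

/* Example stored function: squaring of an 8-bit operand. */
static word_t square8(uint8_t x) { return (word_t)(x * x); }

/* Build the table once, off-line, for any single-operand function f. */
static word_t *build_lut(word_t (*f)(uint8_t)) {
    word_t *lut = malloc(sizeof(word_t) << W);      /* 2^W words */
    for (unsigned x = 0; x < (1u << W); x++)
        lut[x] = f((uint8_t)x);
    return lut;
}

/* Run-time "computation": table access only. */
static word_t lut_eval(const word_t *lut, uint8_t x) { return lut[x]; }

int main(void) {
    word_t *sq_lut = build_lut(square8);
    int ok = (lut_eval(sq_lut, 13) == 169);          /* 13^2 looked up */
    free(sq_lut);
    return ok ? 0 : 1;
}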


Using the LUT-based computation approach has several advantages [19]. The following list summarizes the main advantages of the LUT-based computation approach.

• The LUT-based approach can result in high speed computation and has the potential for high-throughput and reduced latency implementation because it mainly relies on the time for accessing the pre-stored tables.

• LUT-based computation systems promise to consume less dynamic power due to the minimization of switching activity.

• LUT-based architectures are easier to design and more regular compared with conventional logic circuits such as the multiply-add based method.

• The LUT-based computation approach is close to human-like computing. It is also suitable for the implementation of the activation functions in artificial neural networks [20].

With these advantages, LUT-based computation can be employed for a wide range of applications such as constant multiplication, squaring, digital filtering and orthogonal transforms for DSP applications, as well as for computing elementary functions (e.g., the sine and logarithm functions) and activation functions in artificial neural networks. However, as mentioned previously, the LUT-based computation approach has the disadvantage that the LUT size grows exponentially when the operand length increases. This may lead to high hardware complexity for high resolution designs. Therefore, much research has focused on methods to reduce the LUT size at the cost of some simple additional logic circuits. In this dissertation, some novel and improved methods are proposed for efficient LUT-based computation components and systems. Before providing the details of these methods in the next chapters, the following sections review the existing methods for LUT-based computation and its design issues. An LUT is characterized by its size, word length and content. The LUT size can be expressed in number of words or number of bits. The LUT size as shown in Figure 2.3 is 2^W words, or 2^W × (W + M) bits, since the word length in this case is (W + M) bits.

19 2.4 Literature Review of LUT-Based Computation Methods is (W + M) bits. The word length is defined by the number of data bit stored in each LUT word. Moreover, the following design issues needed to be considered when imple- menting LUT-based computation components for DSP applications.

• As mentioned previously, in a general LUT-based computation system, the LUT size grows exponentially when the input width (or address width of the LUT) increases. Therefore, the relationship between the LUT size and the input width should be considered in LUT-based computation, and many researchers have tried to reduce the growth rate of the LUT size with various proposed techniques (a small numerical illustration follows this list).

• Since DSP applications can tolerate errors, the LUT size can be reduced significantly by discarding some least significant bits of the results. However, the trade-off between accuracy and hardware complexity should be taken into account. Hence, an efficient LUT optimization method (for both LUT parameters and content) is required to achieve a good trade-off.

• LUT-based computation can be combined with other methods to derive hybrid LUT-based/logic architectures. This approach also requires considering the trade-off between the LUT-based computation part and the conventional logic part.
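As a hypothetical numerical illustration of these issues (the widths are example values, not design choices from this dissertation): a direct LUT for a W = 16-bit operand with a (W + M) = 32-bit result occupies 2^16 × 32 bits = 2 Mbit. If the operand is split into two 8-bit halves, X = X_H·2^8 + X_L, a linear function such as the constant product A·X can be formed as A·X_H·2^8 + A·X_L from two LUTs of 2^8 × 24 bits each (about 12 kbit in total) plus one adder, turning the exponential growth into a roughly linear one at the price of a small amount of extra logic.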

2.4 Literature Review of LUT-Based Computation Methods

In the IBM 1620 computer [16], the simple direct memory-based method was used to implement the arithmetic functions. However, due to this simple method and the low memory density at that time, this principle was not used widely in commercial computers. H. Ling in [21] and B. Vinnakota in [22] have proposed some methods of using LUTs to implement multiplier and squarer circuits. B. Parhami et al. also presented some improved methods for LUT-based computation in [23] and [24].


Memory-based and LUT-based methods are quite popular in digital filter design, as presented in [25], [26]. Also, LUT-based computation is applied to computing trigonometric functions such as the arctangent function [27] and the sine function, which will be presented in detail in Chapter 5. G. Inoue [28] announced an invention using semiconductor memory combined with an adder and some small control logic to implement an 8-bit multiplier. The LUT-based method is also employed widely to implement some transformations in digital signal processing such as the DFT and DCT [29]. Moreover, Bipul C. Paul et al. [8] present another approach using full custom specific memory arrays to implement arithmetic functions. Due to the advantages of the full custom flow and advanced integrated circuit technology, this method can improve the speed and area efficiency of the circuits significantly. However, it has the disadvantages of a lack of flexibility and a long design process. P. K. Meher ([19] and [30]) has proposed some methods for LUT-based computation, including optimizing methods for LUT-based constant multipliers such as the odd multiple storage (OMS) scheme, the anti-symmetric product coding (APC) scheme, the input coding scheme and their combined techniques. The main idea of these methods is to reduce the LUT size by exploiting mathematical identities. It is reported that these methods outperform other methods in the literature and can be promising candidates for future DSP and wireless communication systems. The truth tables for the APC and OMS methods are presented in Tables 2.2 and 2.3, respectively. Figure 2.4 shows the combined APC-OMS architecture for the LUT-based multiplier with the operand X and a constant A. In the APC method, by exploiting the mathematical identities of anti-symmetric product coding, the input X is recoded to the address x3x2x1x0 (for the case of a 4-bit address) for the LUT so that the LUT size is reduced by half. In the APC-OMS method, by additionally employing odd multiple storage, the LUT size can be further reduced by half compared with the APC method. The addresses for the modified LUT in these architectures are generated by an address generator and control circuit and an address decoder, as shown in Figure 2.4. The simple shifting operations in Table 2.3 are performed by a barrel shifter.



Figure 2.4: Combined APC-OMS architecture for LUT-based multiplier [19].
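To illustrate the odd-multiple-storage idea of Table 2.3 in software, the following C sketch stores only the eight odd multiples of a constant and reconstructs every other product with a shift, mirroring the role of the barrel shifter in Figure 2.4. The constant A, the 4-bit operand width and the omission of the X = 0 case (handled by the control circuit in the real architecture) are illustrative simplifications, not the hardware of [19] or of this dissertation.

#include <stdint.h>
#include <stdio.h>

/* Sketch of odd-multiple storage (OMS): only A, 3A, ..., 15A are
 * stored; a non-zero 4-bit input X is written as X = X_odd << s,
 * so the product is one stored word followed by a left shift. */
#define A 9u                       /* example constant coefficient */

static uint32_t oms_lut[8];        /* P0..P7 = A, 3A, 5A, ..., 15A */

static void oms_init(void) {
    for (int k = 0; k < 8; k++)
        oms_lut[k] = (2u * k + 1u) * A;
}

static uint32_t oms_multiply(uint8_t x) {      /* x in 1..15        */
    unsigned s = 0;
    while ((x & 1u) == 0) { x >>= 1; s++; }    /* peel trailing zeros  */
    return oms_lut[x >> 1] << s;               /* odd x: index (x-1)/2 */
}

int main(void) {
    oms_init();
    for (uint8_t x = 1; x < 16; x++)
        printf("%2u * %u = %3u\n", (unsigned)x, A, (unsigned)oms_multiply(x));
    return 0;
}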

Also, LUT-based computation can be applied efficiently to the activation functions in artificial neural networks [20]. Te-Jen Chang et al. [31] proposed an efficient LUT-based method for the squarer circuit. This method targets only the full-width squarer, with an application in cryptography. However, the above methods only focus on architectures for full-length computation, and only synthesis results are presented. Therefore, more research should be done to derive new and improved methods, especially for truncated operations in which the full length results are not compulsory due to the error tolerance of DSP applications. Moreover, the hardware complexity can be reduced significantly when truncation can be applied. Also, real chip implementations should be carried out to clarify the improvements of the proposed methods. To sum up, no method has been presented in the literature for truncated/fixed-width computation using the LUT-based method. This leaves room for more research in this field. Also, the existing methods should be improved for new algorithms and applications. Furthermore, some technology-independent architectures and methods need to be considered for more general methods that can be applied on multiple hardware platforms. Therefore, the research presented in this dissertation targets some novel and improved methods for LUT-based computation components and systems which can be employed in modern and future DSP applications.


Table 2.2: The truth table of an LUT-based multiplier using the APC method with two operands X and A.
X     | Product | X     | Product | x3x2x1x0 | APC word
00001 | A       | 11111 | 31A     | 1111     | 15A
00010 | 2A      | 11110 | 30A     | 1110     | 14A
00011 | 3A      | 11101 | 29A     | 1101     | 13A
00100 | 4A      | 11100 | 28A     | 1100     | 12A
00101 | 5A      | 11011 | 27A     | 1011     | 11A
00110 | 6A      | 11010 | 26A     | 1010     | 10A
00111 | 7A      | 11001 | 25A     | 1001     | 9A
01000 | 8A      | 11000 | 24A     | 1000     | 8A
01001 | 9A      | 10111 | 23A     | 0111     | 7A
01010 | 10A     | 10110 | 22A     | 0110     | 6A
01011 | 11A     | 10101 | 21A     | 0101     | 5A
01100 | 12A     | 10100 | 20A     | 0100     | 4A
01101 | 13A     | 10011 | 19A     | 0011     | 3A
01110 | 14A     | 10010 | 18A     | 0010     | 2A
01111 | 15A     | 10001 | 17A     | 0001     | A
10000 | 16A     | 10000 | 16A     | 0000     | 0

2.5 General Approach for Efficient LUT-Based IP Cores

Figure 2.5 presents the general approach for the IP cores proposed in this dissertation. To obtain a good trade-off between hardware complexity and computation performance, the original direct LUT is decomposed into a smaller LUT (the size-reduced LUT) and a simple logic circuit. A combinational circuit (Comb.) is also required to provide the final computation result. The existing and proposed methods to reduce the LUT size will be presented in the next chapters. The LUTs in this research are implemented on both FPGA and ASIC hardware platforms. For the purpose of a first evaluation of the proposed methods, the LUTs in FPGA devices and standard cells are used to realize the proposed LUT-based architectures.


Table 2.3: The truth table of an LUT-based multiplier using the OMS method with two operands X and A.
X (x3x2x1x0)           | Product value    | Number of shifts | Shifted input X | Stored APC word | Address (d3d2d1d0)
0001, 0010, 0100, 1000 | A, 2×A, 4×A, 8×A | 0, 1, 2, 3       | 0001            | P0 = A          | 0000
0011, 0110, 1100       | 3A, 2×3A, 4×3A   | 0, 1, 2          | 0011            | P1 = 3A         | 0001
0101, 1010             | 5A, 2×5A         | 0, 1             | 0101            | P2 = 5A         | 0010
0111, 1110             | 7A, 2×7A         | 0, 1             | 0111            | P3 = 7A         | 0011
1001                   | 9A               | 0                | 1001            | P4 = 9A         | 0100
1011                   | 11A              | 0                | 1011            | P5 = 11A        | 0101
1101                   | 13A              | 0                | 1101            | P6 = 13A        | 0110
1111                   | 15A              | 0                | 1111            | P7 = 15A        | 0111

However, in future research, advanced LUT hardware will be considered for the implementation to improve the performance, power and area efficiency. The details of the implementation issues for the proposed IP cores will be presented in the next chapters.



Figure 2.5: General approach for efficient LUT-based IP cores.


Chapter 3

LUT-Based Truncated Multiplier

High performance multipliers are essential components in most DSP and communication systems. In these systems, parallel multipliers are often employed to carry out high speed computations and transformations. However, general parallel multipliers often result in high hardware complexity and contribute large power consumption to the overall system. On the other hand, the rapid increase of mobile and portable devices leads to an emerging requirement for low power consumption and low complexity digital circuits [32]. Therefore, recently, much research has focused on architectures for low power, high speed multipliers. Truncated multiplication is an efficient method to significantly reduce the area and power consumption of multipliers for DSP applications in which the full length multiplication result is not required [33]. Moreover, in many DSP applications, when one of the multiplier inputs is constant, the hardware complexity can be reduced further by using a constant coefficient multiplier (KCM). For example, in digital filters, the coefficient set is fixed for a given transfer function. In this case, the KCM is more suitable than other approaches like the distributed arithmetic (DA)-based method [34]. The disadvantage of the KCM caused by its fixed hardware configuration for each constant coefficient set can be solved by using reconfigurable hardware such as FPGAs, resulting in a reconfigurable KCM. Moreover, LUT-based computation is promisingly suitable for the implementation of KCMs in DSP systems because the address width is reduced significantly when only one input of the multiplier is used for the LUT address.

Figure 3.1: Introduction to LUT-Based Truncated Multiplier.

Some recent research has pointed out that LUT-based computation is a highly potential candidate for future DSP systems due to its excellent speed merit, because it mainly relies on LUT access-only operations [34]. However, the direct LUT-based implementation of a high resolution KCM leads to high hardware complexity because the LUT size grows exponentially as the input length increases. Some methods have been proposed to reduce the LUT size in LUT-based full length multipliers ([34]-[37] and [19]), but more improvements are desired. Moreover, no research in the literature has considered the combination of LUT-based computation and the truncated multiplier method. Therefore, in this chapter, we propose an efficient LUT-based architecture for truncated multipliers which can be applied to modern and future DSP systems. Figure 3.1 presents the introduction and motivation for the research on the truncated LUT-based KCM.


3.1 Introduction to LUT-Based Truncated Multiplier

3.1.1 Truncated Multiplier

Truncated multiplication can be employed to significantly reduce the hardware area and power consumption of multipliers in DSP applications where the full length multiplication results are not necessary. The truncating operation can be implemented by rounding the final result or simply discarding some least significant bits (LSB) of the full bit-width product to get the desired bit-width output. However, the main drawback of these methods is the waste of hardware resources, because the unnecessary LSB part is generated before truncation. Hence, some methods have been proposed to approximate the discarded least significant part of the partial product matrix of the multiplier so that the hardware complexity is reduced significantly compared with standard multipliers. James E. Stine et al. [33] presented some popular methods for truncated multiplication, including Constant Correction Truncated (CCT), Variable Correction Truncated (VCT) and Hybrid Correction Truncated (HCT) multipliers, and their implementation in FPGA hardware. The reported results show that hardware and power consumption can be remarkably reduced when employing truncated multiplication. Recently, much research has focused on the improved VCT method with an optimal compensation technique to minimize the overall error. Nicola Petra et al. [40] proposed a sub-optimal linear compensation method and its implementation to provide a good trade-off between hardware complexity and multiplication accuracy. However, more improvements in multiplier area and performance are desired to meet the requirements of future DSP applications. Therefore, in this chapter, we propose a novel and efficient architecture for the LUT-based truncated multiplier that combines the two approaches of truncated multiplication and LUT-based computation.

3.1.2 LUT-Based Truncated Multiplier

As presented in Chapter 2, the LUT-based computation circuit provides the outputs by accessing the pre-stored LUT rather than performing actual computations.

This look-up-only operation leads to high speed computation because it mainly requires only the time for accessing the pre-stored tables. Also, the LUT-based circuits consume less dynamic power because of the lower bit-switching rate [19]. A typical LUT-based computation architecture, called the LUT-based multiplier, is shown in Figure 3.2, in which an LUT-based W × M (W-bit by M-bit) multiplier requires an LUT size of 2^{W+M} × (W + M) bits to store 2^{W+M} words with a word length of (W + M) bits. As a result, the LUT size increases exponentially as the input lengths W and M increase. In a KCM, however, if the second multiplier input (with the length of M bits) is constant, the LUT size is reduced to 2^W × (W + M) bits. In this chapter, we consider the constant multiplication of a W-bit input X with an M-bit binary constant coefficient A to produce the product Y. In full-width multiplication, the product Y has a length of (W + M) bits. The mathematical representations of the constant A and the product Y are:

A = \sum_{i=0}^{M-1} a_i 2^i    (3.1)

Y = \sum_{i=0}^{M-1} a_i X 2^i    (3.2)

where a_i denotes the i-th binary digit of the constant A. According to Equation (3.2), the simplest method, called the straightforward shift-add KCM, adds the shifted partial products X 2^i, which are computed by left-shifting the input X by i bits. It can be observed that the number of non-zero bits of A defines the number of adders required in the hardware implementation. To reduce the number of addition operations required in the straightforward shift-add KCM, some recoding methods such as canonical signed digit (CSD) and Booth recoding have been proposed, in which the constant coefficients are represented in signed digit form so that the number of non-zero digits is decreased. The comparison between the two recoding schemes in [39] shows that CSD recoding results in better performance than the Booth method for the KCM implementation in most cases, since the CSD scheme always produces the minimal number of non-zero digits in the recoded representation. Tom Kean et al. [36] presented the technique of using a split LUT structure to implement a constant coefficient multiplier, resulting in a reduction of both the number of words and the word length.
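To make the recoding idea concrete, the short Python sketch below converts a constant to CSD (non-adjacent) form and compares its number of non-zero digits with the plain binary representation, which approximates the number of add/subtract units a shift-add KCM would need. This is an illustrative sketch only; the coefficient value is arbitrary and is not taken from this work.

def to_csd(k):
    # Convert a non-negative integer to canonical signed digit (CSD) form.
    # Returns digits in {-1, 0, +1}, least significant first, with no two
    # adjacent non-zero digits, such that sum(d * 2**i) == k.
    digits = []
    while k != 0:
        if k & 1:                    # current bit is 1
            d = 2 - (k & 3)          # +1 if the next bit is 0, -1 if it is 1
            k -= d
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits

if __name__ == "__main__":
    A = 2012                         # arbitrary example coefficient
    csd = to_csd(A)
    assert sum(d << i for i, d in enumerate(csd)) == A
    print("binary non-zero bits :", bin(A).count("1"))
    print("CSD non-zero digits  :", sum(1 for d in csd if d != 0))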



Figure 3.2: General LUT-based multiplier.

In this structure, the input X is divided into two fractions X_H and X_L, which are formed by the W/2 high bits and the W/2 low bits of X, respectively. Therefore, the operand X and the product Y can be written as:

X = X_H 2^k + X_L    (3.3)

Y = X_H A 2^k + X_L A    (3.4)

where k = W/2. Based on Equation (3.4), the LUT with a word length of (W + M) bits can be divided into two LUTs with word lengths of (W/2 + M) bits, which store the X_H A and X_L A values, under the supposition that W is an even number. However, the reduction of the LUT size has to be traded off against the extra adder tree needed to get the result Y from the split LUT outputs. An application of the LUT-based multiplier, called constant coefficient convolution, was reported in [37]. In [19], P. K. Meher also proposed a technique called combined OMS-APC to optimize the LUT content so that the number of words is reduced, but the word length is not. In this work, to further reduce the hardware area of the multiplier, we propose a novel method which employs both the LUT-based and truncated multiplication approaches to design low area and high speed constant multipliers for specific DSP applications.
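As a sanity check of the split-LUT decomposition in (3.3) and (3.4), the following minimal Python sketch builds the two small LUTs and recombines their outputs with a shift and an add. The widths and the coefficient are illustrative values only, not parameters used in this work.

# Split-LUT constant multiplication: X is split into X_H and X_L (k = W/2 bits
# each), two small LUTs hold X_H*A and X_L*A, and a shift-and-add combines them.
W, M, A = 8, 8, 217
k = W // 2

lut_hi = [xh * A for xh in range(2 ** k)]   # stores X_H * A
lut_lo = [xl * A for xl in range(2 ** k)]   # stores X_L * A

def split_lut_kcm(x):
    xh, xl = x >> k, x & (2 ** k - 1)
    return (lut_hi[xh] << k) + lut_lo[xl]   # Y = X_H*A*2^k + X_L*A

assert all(split_lut_kcm(x) == x * A for x in range(2 ** W))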


3.2 Proposed LUT-Based Truncated Multiplier

3.2.1 Proposed Multiplier Architecture

In this section, we present the proposed architecture, which combines an improved bipartite LUT split technique and truncated multiplication. The bipartite split technique means that the input operand X is divided into two fractions as shown in (3.3). A more general method, called the multipartite LUT split technique for the LUT-based truncated multiplier, will be discussed in Section 3.3. Figure 3.3 presents the hardware architecture of the proposed W × M-bit LUT-based truncated multiplier and its partial product matrix for the case of W = M = 8. The proposed multiplier includes two optimized LUTs and an adder block.

The LUT-1, with a W_1-bit word length, and the LUT-2, with a W_2-bit word length, are used to store truncated values of X_L A and X_H A 2^k as described in (3.4), respectively. These truncated values are calculated as either the floor or the ceiling function of the full-width product, depending on the LUT content optimization. The LUT parameters and content are optimized to minimize the average relative error of the final result while achieving a good trade-off with the hardware complexity. For each set of LUT parameters and each constant coefficient value, the LUT content is optimized to minimize the average error of the proposed multiplier. The output Y_trunc of the proposed multiplier is computed by adding the LUT-1 and LUT-2 outputs as:

Y_trunc = Y_1 + Y_2    (3.5)

where Y_1 and Y_2 denote the outputs of LUT-1 and LUT-2, respectively, and they can be expressed as:

Y_1 = Trunc_{W_1}(X_H A 2^k)    (3.6)

Y_2 = Trunc_{W_2}(X_L A)    (3.7)

where Trunc_J(x) denotes the truncation of x to a J-bit result. To compensate for the accumulation of error possibly caused by adding two truncated values from the two LUTs, some guard bits are used in the two LUT outputs, as depicted in Figure 3.3.

Moreover, to produce a regular partial product matrix and to retain an equal effect of each fraction on the multiplier error performance, it is assumed that:

W_2 = W_1 + k    (3.8)

and

W_1 = k + δ    (3.9)

where δ is the number of guard bits. The absolute error is calculated as the difference between the truncated output of the proposed multiplier and the full length product as follows:

E = |Y_trunc − Y|    (3.10)

where E is the absolute error, Y_trunc is the multiplier output and Y is the exact product value. Then, replacing Y in (3.10) by (3.4) leads to the following equation for the absolute error E:

E = |E_1 + E_2|    (3.11)

where the two error components E_1 and E_2 are defined as:

E_1 = Y_1 − X_H A 2^k    (3.12)

E_2 = Y_2 − X_L A    (3.13)

Another useful factor used in this chapter for error analysis is the relative error E_rel, defined as the ratio between the absolute error and the exact product:

E_rel = E/Y    (3.14)
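For illustration, the sketch below evaluates the absolute and relative errors of (3.10) and (3.14) for the conventional truncation baseline that simply discards least significant bits of the full-width product. The width, coefficient and number of discarded bits are example values, and the printed figure is not meant to reproduce the results in Table 3.1.

# Error measures (3.10) and (3.14) applied to a conventional truncation
# baseline (discarding the t least significant bits of the full-width product).
W, M, t = 8, 8, 8
A = 151                                       # example constant coefficient

def truncated_product(x):
    y = x * A                                 # exact full-width product
    return (y >> t) << t                      # discard t LSBs, keep the weight

errors = []
for x in range(1, 2 ** W):                    # skip x = 0 so E_rel is defined
    y_exact = x * A
    e_abs = abs(truncated_product(x) - y_exact)   # E = |Y_trunc - Y|
    errors.append(e_abs / y_exact)                # E_rel = E / Y

print("average relative error: %.4f%%" % (100 * sum(errors) / len(errors)))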

3.2.2 LUT Optimization

The proposed LUT optimization method includes choosing the design parameters (parameter optimization) and calculating the LUT content (LUT content optimization). To choose suitable design parameters, an error analysis is performed for each value of δ (or each set of W_1 and W_2 values defined by (3.8) and (3.9)).



Figure 3.3: Proposed W × M-bit LUT-based truncated multiplier architecture (a) and its partial product matrix with W = M = 8 and δ = 2 (b).

A Matlab program is developed to find the average relative error of the proposed multiplier with different design parameter values for all cases of the coefficient A and then to choose the optimal parameter values. The error analysis results of the proposed multiplier are also compared with those of the conventional truncated multiplier, in which the truncated product is generated from the full width product by discarding some least significant bits. After choosing the design parameter (δ), to further reduce the average error of the proposed multiplier, the LUT content optimization is performed by searching all possible cases of the stored values for LUT-1 and LUT-2. It would be optimal if a full search of all possible cases for the LUT content could be performed after choosing the design parameter.


Table 3.1: Error analysis results for the proposed 8-bit multiplier with different numbers of guard bits (δ), compared with conventional truncated multiplier.

Design                                          Average E_rel (%)
Conventional truncation                         1.74
Proposed method with δ=0 (W_1=4 and W_2=8)      3.28
Proposed method with δ=1 (W_1=5 and W_2=9)      1.78
Proposed method with δ=2 (W_1=6 and W_2=10)     0.95
Proposed method with δ=3 (W_1=7 and W_2=11)     0.48
Proposed method with δ=4 (W_1=8 and W_2=12)     0.22

However, such a full search involves an extensive simulation process and results in a very long simulation time, so a sub-optimal method is proposed for the LUT content optimization that reduces the optimization time significantly by reducing the number of search cases, as shown in Figure 3.4. For each constant coefficient value, firstly, LUT-1 is filled with rounded values of X_L A and a search over all cases of LUT-2 is performed (each entry taken as the floor or the ceiling function of the full width product). Then, LUT-2 is filled with rounded values of X_H A 2^k and a similar search over all cases of LUT-1 is performed. Finally, the error results from the two searching processes are compared to get the optimized content of the two LUTs. Table 3.1 and Figure 3.5 present the error analysis results of the optimized design with different values of δ and the full set of constant coefficients A. It is shown that the average relative error decreases when the number of guard bits is increased. In the case of the proposed method with δ = 2 (W_1 = 6 and W_2 = 10), the average relative error (0.95%) is much lower than that of the conventional truncation method (1.74%). Therefore, this value of δ is chosen for our design in this work. For applications that require higher accuracy (or a lower error target), higher values of δ can be used. Then, the entries of each optimized LUT are pre-calculated based on the chosen parameters with the above algorithm. The adder module shown in Figure 3.3 (a) is also optimized independently to achieve the highest multiplier performance.


(Flowchart: LUT-1 <= Round_{W_1}(X_L A) followed by a search over all cases of LUT-2; LUT-2 <= Round_{W_2}(X_H A 2^k) followed by a search over all cases of LUT-1; error comparison; optimized LUT content.)

Figure 3.4: Proposed sub-optimal LUT content optimization algorithm.
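To illustrate the two-pass idea of Figure 3.4, the toy-scale Python sketch below performs both passes and keeps the better result. The bit-level output format is simplified here: each LUT entry is either the floor or the ceiling of the scaled partial product, and the output is their sum re-scaled by 2^t. The values of W, k, t and A are illustrative toy values chosen so that the exhaustive per-pass search stays tiny; they are not the parameters used in this work.

import itertools, math

W, k, t, A = 4, 2, 3, 11
SCALE = 2 ** t

def avg_rel_error(lut_lo, lut_hi):
    errs = []
    for x in range(1, 2 ** W):
        xh, xl = x >> k, x & (2 ** k - 1)
        y_trunc = (lut_lo[xl] + lut_hi[xh]) * SCALE
        y = x * A
        errs.append(abs(y_trunc - y) / y)
    return sum(errs) / len(errs)

def rounded(values):
    return [int(round(v)) for v in values]

def all_floor_ceil(values):
    # Every combination in which each entry is floored or ceiled.
    choices = [(math.floor(v), math.ceil(v)) for v in values]
    return itertools.product(*choices)

exact_lo = [xl * A / SCALE for xl in range(2 ** k)]           # X_L * A / 2^t
exact_hi = [xh * A * 2 ** k / SCALE for xh in range(2 ** k)]  # X_H * A * 2^k / 2^t

# Pass 1: LUT-1 (low part) fixed to rounded values, search all cases of LUT-2.
pass1 = min(all_floor_ceil(exact_hi),
            key=lambda hi: avg_rel_error(rounded(exact_lo), hi))
# Pass 2: LUT-2 fixed to rounded values, search all cases of LUT-1.
pass2 = min(all_floor_ceil(exact_lo),
            key=lambda lo: avg_rel_error(lo, rounded(exact_hi)))

e1 = avg_rel_error(rounded(exact_lo), pass1)
e2 = avg_rel_error(pass2, rounded(exact_hi))
print("pass 1 error %.4f, pass 2 error %.4f -> keep pass %d"
      % (e1, e2, 1 if e1 <= e2 else 2))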

3.2.3 Synthesis and Implementation Results in FPGA Hardware

We have implemented the 8-bit truncated multiplier using the proposed LUT-based method and comparable reference architectures, for all cases of the constant coefficient, with the same design parameters and constraints, in an Altera Cyclone II FPGA using the Quartus-II 9.1 design tool and VHDL coding. The comparison of the average values of area, delay and area-delay product (ADP) is presented in Table 3.2. ADP is a popular factor for comparing the general performance of different designs. For the FPGA implementation, ADP is calculated as the product of the number of logic elements (LE) and the delay (in nanoseconds); the most efficient method is the one with the lowest ADP. The straightforward shift-add KCM and the CSD-based KCM, as described above, are specific hardware architectures of the conventional truncation technique chosen for the performance comparison. The Altera Megacore-based KCM, which is the parameterized constant multiplier core provided by Altera, and the conventional split LUT-based KCM (without LUT optimization) are also chosen for the comparison. From the results presented in Table 3.2, it is clear that the proposed multiplier results in a lower ADP than the other constant coefficient multipliers.



Figure 3.5: Error analysis results for the proposed 8-bit multiplier with different numbers of guard bits (solid line), compared with the conventional truncated multiplier (dashed line).

The proposed KCM reduces the ADP by around 28% compared with the CSD-based KCM, the Altera Megacore-based KCM and the conventional split LUT-based KCM, and by 52% compared with the straightforward shift-add KCM.

3.3 General Multipartite LUT Split Technique

This section presents a more general method for the LUT-based truncated multiplier, called the multipartite LUT split technique, in which the input operand is divided into more than two fractions. Then, each fraction is used as the address of an optimized LUT. Consider the multiplication of the W-bit binary number X with a constant A and suppose that W = l × k (if not, some leading zero bits can be added to X so that this condition is satisfied).


Table 3.2: Implementation results of different KCM methods in Altera Cyclone II FPGA.

Method                              Area (number of LEs)   Delay (ns)   ADP
Straightforward shift-add KCM       33                     17.5         578
CSD-based KCM                       25                     15.3         383
Altera Megacore-based KCM           28                     13.3         372
Conventional split LUT-based KCM    28                     13.8         386
Proposed LUT-based KCM              23                     12.0         276

Then X can be decomposed into l fractions, each k bits wide, and Equation (3.3) can be extended as:

X = \sum_{i=0}^{l-1} X_i 2^{ik}    (3.15)

where X_i denotes a k-bit fraction of X. The product Y of the multiplication of X and a constant coefficient A can be written as:

Y = \sum_{i=0}^{l-1} A X_i 2^{ik}    (3.16)

l−1 ik X = Xi2 (3.15) i=0 where Xi denotes a k-bit fraction of X. The product Y of the multiplication of X and a constant coefficient A can be written as: l−1 ik Y = AXi2 (3.16) i=0

With this decomposition, each component A X_i in Equation (3.16) can be stored in an LUT of 2^k words. The word length of each LUT depends on the number of guard bits chosen, as in the bipartite method presented in Section 3.2. An adder tree and a rounding circuit can then be used to get the final truncated product. A general optimization algorithm based on the method proposed in Section 3.2 can be employed for this general architecture. However, with more LUTs and a larger adder tree, the optimization algorithm needs to be changed considerably and more parameters should be taken into account in the optimization process. Therefore, it is left as an open topic for further research. Moreover, a more general architecture in which the input operand X is divided into fractions with different bit-widths will also be considered.


3.4 Chapter Conclusions

LUT-based computation and truncated multiplication are two efficient approaches for high speed and low power DSP applications. In this chapter, an efficient architecture for the LUT-based truncated multiplier that combines these two approaches for modern and future DSP systems has been proposed. The hardware implementation results of the proposed multiplier show that, compared with other methods, the proposed architecture can significantly reduce the area-delay product. Hence, it has great potential to be selected as an architecture for high speed, low power multipliers in future DSP applications. However, the proposed LUT-based KCM method has a scalability limitation, because for very high bit-width designs the optimization has to be changed, as discussed in Section 3.3. Therefore, in future research, the general multipartite LUT split technique and its optimization algorithm for the LUT-based truncated multiplier will be considered.


Chapter 4

Hybrid LUT-Based Architecture for Fixed-Width Squarer

Squaring is a fundamental arithmetic operation in many applications such as Viterbi decoding, adaptive filtering, vector quantization, pattern recognition and image compression. However, the direct implementation of general parallel squarer circuits often leads to high hardware complexity and high power consumption. Moreover, the rapid increase of mobile and portable devices leads to an emerging requirement for low power consumption and low complexity digital circuits [41]. Therefore, recently, much research has focused on architectures for low area, low power and high performance squarer circuits. Some proposed methods, such as the folded squarer [42], the merged technique [43] and the K.-J. Cho et al. method [44], employ mathematical identities to reduce the hardware complexity and improve the performance of the full-width squarer circuit. A. G. M. Strollo et al. [45] proposed an improved method that combines Booth recoding with the folded technique and some specific sub-circuits. On the other hand, in many DSP applications, when the square result is not required to be full-width but to have the same bit-width as the operand, the architecture of a high performance, low complexity fixed-width squarer becomes an emerging topic with many issues to be considered, such as error compensation, hardware efficiency and squarer performance. The authors in [46] and [47] have proposed some methods to improve the accuracy and performance of truncated and fixed-width squarer circuits, but more improvements are desired, especially for future DSP systems, which require high performance, low power circuits to meet the high demand for portability and mobility.


Moreover, LUT-based computation has great potential to be employed in future DSP and communication systems, because it mainly relies on memory access operations, resulting in high speed computation and low power consumption [19]. The disadvantage that limits the popularity of the LUT-based architecture in current systems is the exponential growth of the LUT size as the operand width increases. Therefore, it is reasonable to find a proper hybrid architecture that takes the advantages of both LUT-based and conventional logic circuits, so that a good trade-off between squarer performance and hardware complexity can be achieved. In this chapter, we present a new approach for designing fixed-width squarer circuits for specific DSP applications by employing an improved hybrid LUT-based architecture to reduce the error, delay and area of the squarer circuit. The main contribution of this research is that an efficient architecture for the fixed-width squarer is proposed by employing a known specific mathematical identity of the squaring operation in a new way, with some improvements, for low error and high area-delay efficiency.

4.1 Fixed-Width Squarer

In this chapter, we consider the square of a W-bit binary number X; the full-width (2W-bit) square X^2 can be written as:

S = X^2 = \left( \sum_{i=0}^{W-1} x_i 2^i \right)^2 = \sum_{i=0}^{W-1} \sum_{j=0}^{W-1} x_i x_j 2^{i+j}    (4.1)

where x_i and x_j denote the binary bits of X. Figure 4.1 shows the partial product matrix (PPM) of an 8-bit fixed-width squarer generated from the full-width squarer, in which each partial product bit p_{ij} is computed by an AND logic operation: p_{ij} = x_i x_j. In a full-width squarer, all partial product bits in the PPM are summed by using a compression tree and an adder [42]. Some methods proposed in [42]-[44] exploit the mathematical identities of the squaring operation to reduce the complexity of the PPM summation.


(Partial product bits p_{ij} arranged in columns (15) down to (0) and partitioned into the MSP, IC and LSP parts.)

Figure 4.1: Partial product matrix of the 8-bit fixed-width squarer.

However, in a fixed-width squarer, the square result is required to have the same bit-width as the operand. For example, in the 8-bit fixed-width squarer shown in Figure 4.1, the 8-bit square result corresponds to the 8 most significant bits of the full-width square result. The most accurate fixed-width squaring method is ideal rounding, in which all partial products in the PPM are summed and the full-width (2W-bit) result is rounded to provide the fixed-width (W-bit) result. For mathematical convenience, as shown in Figure 4.1, the PPM is divided into three parts: MSP (Maximum Significant Part), IC (Input Correction) and LSP (Least Significant Part). The ideal rounding operation can be implemented by adding '1' to the IC column as the correction constant before summing and then simply dropping some least significant bits of the full-width result to get the fixed-width square result. These two methods lead to high complexity and long computation delay because all partial products of the IC and LSP are generated and accumulated. Therefore, to reduce the hardware complexity and delay, in more advanced fixed-width squarer methods, the LSP is discarded and the IC is used to compensate the error caused by discarding the LSP. E. G. Walters III et al. [48] proposed an architecture for the truncated squarer with a variable correction scheme. Kyung-Ju Cho et al. [46] presented an adaptive error compensation method for the fixed-width squarer with more mathematical analysis and modifications of the Booth folding encoding method.


Furthermore, V. Garofalo et al. [47] proposed an improved method applying the analytical technique presented in [49] and [50] to find the sub-optimal linear compensation function, so that a mean error reduction of 5% to 20% as well as a performance improvement can be achieved.
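As a quick numerical check of the ideal rounding step described above, the following minimal sketch assumes that the IC column carries the weight 2^(W-1), as suggested by Figure 4.1, and verifies that adding '1' there before dropping the W least significant bits equals rounding the full square to W result bits. It is an illustrative sketch, not the proposed squarer.

W = 8

def fixed_width_square(x):
    s = x * x                               # full 2W-bit square (all PPM bits)
    return (s + (1 << (W - 1))) >> W        # '1' added at the IC column, LSP dropped

# The result equals the ideally rounded W-bit square for every 8-bit operand.
assert all(fixed_width_square(x) == int(x * x / 2 ** W + 0.5) for x in range(2 ** W))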

4.2 LUT-Based Fixed-Width Squarer

As mentioned in Chapter 2, the LUT-based circuit provides the outputs by accessing the pre-stored LUT rather than performing actual computations, resulting in high speed computation and less dynamic power because of the lower bit-switching rate [35]. As shown in Figure 4.2, an LUT-based full-width squarer requires an LUT of 2^W × 2W bits to store 2^W words with a word length of 2W bits. For the case of a fixed-width squarer using this direct LUT-based architecture, an LUT of 2^W × W bits is required to store the ideally rounded values of the squaring computation. The maximum error in this ideally rounded direct LUT-based squarer is restricted to only 0.5 LSB (Least Significant Bit) of the squaring result. However, this direct LUT-based implementation leads to exponential growth of the LUT size when the operand width increases. Therefore, some methods have been proposed to reduce the LUT size for LUT-based computation designs. Pramod Kumar Meher [19] has proposed some LUT optimization methods that can reduce the LUT size significantly for the LUT-based constant multiplier, but these methods cannot be applied directly to the fixed-width squarer because of the different PPM and the truncation operation required in fixed-width squaring. Besides, some improved methods for fixed-width and truncated multipliers cannot be employed directly for the design of the fixed-width squarer [50]. Te-Jen Chang et al. [31] proposed an efficient LUT-based method for the squarer circuit; however, it is only applicable to the full-width squarer in cryptography applications. As a result, a more accurate and efficient architecture for the fixed-width squarer is highly desired. It is also reasonable to achieve a good trade-off between the LUT size and the squarer performance by employing a hybrid LUT-based structure for the squarer, using both LUT and conventional logic circuits properly.



Figure 4.2: General LUT-based full-width squarer.

4.3 Proposed Hybrid LUT-Based Fixed-Width Squarer

Again, consider the square of the W-bit binary number as presented in (4.1). In this section, we apply the mathematical characteristic presented in [51] and [52] to the case of four-folding, with some improvements as described below.

If X = (x_{W-1} x_{W-2} ... x_1 x_0) and Y = (y_{W-1} y_{W-2} ... y_1 y_0) are two W-bit binary numbers satisfying y_i = \bar{x}_i, 0 ≤ i ≤ (W − 1), the difference D between the squares of the two numbers can be expressed as:

D = |X|^2 − |Y|^2 = −y_{W-1} 2^W (2^W − 1) + |x_{W-2} x_{W-3} \ldots x_1 x_0 \, 0 \, y_{W-2} y_{W-3} \ldots y_1 y_0 \, 1|    (4.2)

where |X| denotes the value of the binary number X. Therefore, when y_{W-1} = 0, i.e. x_{W-1} = 1, it is derived that:

D = |x_{W-2} x_{W-3} \ldots x_1 x_0 \, 0 \, y_{W-2} y_{W-3} \ldots y_1 y_0 \, 1|    (4.3)

The four-folding method means that the two most significant bits of the operand are used as selecting bits to produce the square result. Applying some mathematical properties, we can derive the relationship between the square of the number X and its binary bits as shown in Table 4.1 (in which MSB12 denotes the two most significant bits of X), together with the following equation:

|X|^2 = |Y|^2 + D    (4.4)

The novelty of the proposed approach is that a specific mathematical identity of the squaring operation presented in [51] and [52] is employed in a new way to derive an efficient architecture for the fixed-width squarer circuit with lower error and higher area-delay efficiency.



Figure 4.3: Proposed hybrid LUT-based architecture for fixed-width squarer.

By exploiting (4.2), to get the fixed-width result, the rounding of the square can be approximated by summing the rounded values of its components as follows:

round_W{|X|^2} ≈ round_W{|Y|^2} + round_W{D}    (4.5)

in which round_W{x} denotes the rounding of x to a W-bit result. This approximation is optimal if one of the two rounding operations on the right-hand side of (4.5) is error-free. In this section, we present a sub-optimal approximation method using the hybrid LUT architecture to reduce the error as much as possible. In Table 4.1, the component D is divided into two parts, DH (high part) and DL (low part), each with the same bit length of W. Therefore, it is obvious that:

D = |DH| 2^W + |DL|    (4.6)

As one can see in the fourth column of Table 4.1, the most significant bit of DL is always zero. Hence, the rounding operation round_W{D} can be approximated by discarding the DL part, so that its result is approximated by the DH part. The maximum error of this approximation can be easily estimated as 1/2 LSB of the result compared with the ideal rounded result.


Table 4.1: Four-folding method for the hybrid LUT-based fixed-width squarer.

MSB12   Y                       DH                                DL
00      x_{W-3} ... x_1 x_0     00 0 ...... 0 (DH0)               00 ...... 0
01      x_{W-3} ... x_1 x_0     00 x_{W-3} ... x_1 x_0 (DH1)      0 x_{W-3} ... x_0 1
10      x_{W-3} ... x_1 x_0     01 x_{W-3} ... x_1 x_0 (DH2)      00 ...... 0
11      x_{W-3} ... x_1 x_0     1 x_{W-3} ... x_1 x_0 0 (DH3)     0 x_{W-3} ... x_0 1

Assuming that the operation round_W(|Y|^2) can be performed by ideal rounding to W bits, the overall maximum error of the approximation in (4.5) will be 1 LSB compared with the ideal rounded result. Therefore, it is promising that this method can result in a low error squarer circuit. The results of the error analysis will be shown in the next section. The block diagram of the proposed fixed-width squarer using the improved hybrid LUT-based architecture is shown in Figure 4.3. The AMU (Address Mapping Unit) has the function of inverting (one's complementing) the input operand and generating the (W−2)-bit address for the LUT. The LUT block provides the square values of its address input. The 4-input multiplexer (MUX4) generates the DH part as shown in Table 4.1, using the two most significant bits (MSB12) as the selecting bits. Then an adder provides the final fixed-width (W-bit) square result. With the proposed architecture, the LUT size for storing Y^2 is reduced to 2^{W-2} × (W − 3) bits, because it stores rounded values of the 2(W − 2)-bit square Y^2. However, for designs with W ≥ 10, the coarse/fine LUT splitting method [53] is also employed to further reduce the LUT size, as shown in Figure 4.4. The input Y, with a bit-width of (A + B), is decomposed into two components: one with a lower bit-width (A bits) for addressing the coarse LUT and another with the full bit-width of (A + B) bits for addressing the fine LUT. The coarse LUT has the same width but fewer storage entries than the traditional one, while the fine LUT has a smaller width and the same number of storage entries as the traditional one, storing the difference between the rounded Y^2 and the coarse LUT values. An additional adder is needed to provide the final square result Y^2 for the overall LUT input.



Figure 4.4: Coarse/fine LUT splitting method to further reduce the LUT size.

4.4 Error Analysis of Proposed Fixed-Width Squarer

In this work, the mean error (ME), maximum error (MaxE) and mean square error (MSE) are used as error criteria for the error analysis. The mean error and mean square error are related to the power of the error and hence play a very important role in DSP applications. However, the maximum error becomes the most important concern in safety critical applications [54]. The error caused by the approximation in (4.5) together with ME and MSE can be expressed as:

Er = X^2 − (round_W{|Y|^2} + round_W{D})    (4.7)

ME = E{Er} (4.8)

MSE = E{Er^2}    (4.9)

where E{x} denotes the averaging operation to get the mean value of the variable x. Table 4.2 summarizes the error analysis results and the comparison between the proposed squarer and other architectures. It is shown that the proposed fixed-width squarer results in lower error than the others. The mean error and mean square error of the proposed squarer are reduced by up to 40% compared with those of the V. Garofalo et al. method presented in [47], whereas the maximum error is reduced by up to 22%.

This low error merit makes the proposed method highly applicable for future DSP applications which require not only high speed and low power but also low error squarer circuits.
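For illustration, the sketch below shows how the error criteria in (4.8) and (4.9), together with the maximum error, can be evaluated, using a deliberately simple approximation (discarding the LSP with no compensation) as the squarer under test; it is not the proposed architecture. Errors are expressed in LSBs of the W-bit output, with the ideally rounded fixed-width square as the reference.

W = 8

def ideal(x):
    return (x * x + (1 << (W - 1))) >> W      # ideally rounded W-bit square

def no_compensation(x):
    return (x * x) >> W                       # LSP simply discarded

errs = [ideal(x) - no_compensation(x) for x in range(2 ** W)]
me   = sum(errs) / len(errs)                  # mean error, E{Er}
maxe = max(abs(e) for e in errs)              # maximum error
mse  = sum(e * e for e in errs) / len(errs)   # mean square error, E{Er^2}
print("ME = %.3f LSB, MaxE = %d LSB, MSE = %.3f LSB^2" % (me, maxe, mse))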

4.5 Implementation and Measurement Results

The proposed fixed-width squarer and other architectures have been implemented in 0.18-μm CMOS technology using a standard cell library, with the same design parameters and constraints, for different values of W, using the Synopsys Design Compiler and Synopsys IC Compiler tools. For designs with W ≥ 10, the coarse/fine LUT splitting method is also employed to further reduce the LUT size, as shown in Figure 4.4. Table 4.3 shows the LUT parameters, size and compression ratio of the different fixed-width squarer designs using the proposed hybrid LUT-based architecture and the coarse/fine LUT splitting method, in which A and B denote the input widths of the two split LUTs as depicted in Figure 4.4. The compression ratio is calculated as the ratio between the direct LUT size and the proposed LUT size. The implementation results in 0.18-μm CMOS technology are presented in Table 4.4 and Figure 4.5. It is shown that the proposed fixed-width squarer can reduce the area, delay and average power consumption significantly compared with the other methods. To compare the overall squarer performance of the different methods, the area-delay product (ADP) is used as a figure of merit. The bold numbers in parentheses in the fifth column of Table 4.4 present the normalized ADP results of the different fixed-width squarer designs for each value of W. All power consumption estimations in this work are performed with an operating clock frequency of 50 MHz. Compared with the fixed-width squarer method presented in [47], the proposed hybrid LUT-based architecture leads to an ADP reduction of up to 42% and a power consumption reduction of up to 20%. To show the trade-off between ADP reduction and operand bit-width, Table 4.5 and Figure 4.6 summarize the ADP reduction results for different values of the bit-width W. It can be seen that the ADP reduction (in %) tends to decrease with higher W values. This is one limitation of the proposed architecture which should be improved in future work.



Figure 4.5: ADP results for different values of W .

Figure 4.7 is the chip microphotograph of the proposed 8-bit fixed-width squarer fabricated in 0.18-μm CMOS technology, and Figure 4.8 shows the test and measurement flow used to verify the improvement of the proposed squarer method. The test patterns are generated by FPGA hardware. The oscilloscope and logic analyzer are then used to measure the output waveforms and logic values, respectively. Matlab software is also used to compare the practical results obtained from the chip with the expected values for fixed-width squaring and to obtain the error analysis results. Figure 4.9 presents the chip functional measurement result obtained with the logic analyzer (right side), which is the same as the simulation result in Modelsim software (left side). Figure 4.10 and Figure 4.11 show the longest (critical) path analysis result in Synopsys Design Compiler and its chip measurement result (the path between the input x6 and the output s7).

4.6 Chapter Conclusions

In this chapter, an efficient hybrid LUT-based architecture for fixed-width squarer circuits has been presented.



Figure 4.6: ADP reduction results for different values of W (compared with Garofalo et al. method in [47]).

The proposed hybrid architecture takes the advantages of both LUT-based and conventional logic circuits to achieve a good trade-off between circuit performance and complexity. The implementation and chip measurement results in 0.18-μm CMOS technology have been shown and discussed. Compared with other methods presented in the literature for the design of fixed-width and truncated squarer circuits, the proposed method not only reduces the error significantly, but also improves the squarer performance and area efficiency. Therefore, it has great potential to be applied in modern and future DSP and multimedia systems. However, for the case of very high bit-width designs, more efficient methods should be considered in future research.


Table 4.2: Error analysis for different fixed-width squarer architectures.

W    Method                  ME (LSB)   MaxE (LSB)   MSE (LSB^2)
8    Walters et al. [48]     0.166      1.215        0.207
     Garofalo et al. [50]    0.334      -            0.441
     Garofalo et al. [47]    0.166      1.260        0.163
     Proposed                0.100      0.106        0.105
10   Walters et al. [48]     0.166      -            0.226
     Garofalo et al. [50]    0.333      -            0.529
     Garofalo et al. [47]    0.166      1.263        0.181
     Proposed                0.119      1.062        0.120
12   Walters et al. [48]     0.167      -            0.246
     Garofalo et al. [50]    0.333      -            0.607
     Garofalo et al. [47]    0.167      1.308        0.200
     Proposed                0.118      1.103        0.119
14   Walters et al. [48]     0.167      -            0.266
     Garofalo et al. [50]    0.333      -            0.693
     Garofalo et al. [47]    0.167      1.365        0.220
     Proposed                0.120      1.117        0.124
16   Walters et al. [48]     0.167      1.862        0.287
     Garofalo et al. [50]    0.333      -            0.776
     Garofalo et al. [47]    0.167      1.432        0.240
     Proposed                0.122      1.120        0.125


Table 4.3: LUT parameters and compression ratio of the proposed fixed-width squarer designs with the LUT splitting technique.

W                       8        10       12       14       16
A                       -        5        7        8        9
B                       -        3        3        4        5
Proposed LUT (bits)     320      704      3072     10752    38912
Direct LUT (bits)       2048     10240    49152    229376   1048576
Compression ratio       6.4:1    14.5:1   16:1     21.3:1   26.9:1


Figure 4.7: Chip microphotograph of the proposed 8-bit fixed-width squarer.


Table 4.4: Implementation results of different fixed-width squarer architectures using 0.18-μm CMOS technology.

W    Method                  Area (x10^3 μm^2)   Delay (ns)   ADP (x10^3)      Power (mW)
8    Folded                  2.3                 6.6          15.3 (1.34)      0.586
     Direct LUT-based        4.0                 4.8          19.2 (1.68)      0.621
     Garofalo et al. [47]    1.9                 5.9          11.4 (1.00)      0.425
     Proposed                1.5                 4.4          6.6 (0.58)       0.354
10   Folded                  3.8                 7.4          28.3 (1.33)      0.792
     Direct LUT-based        19.6                6.2          121.7 (5.73)     1.202
     Garofalo et al. [47]    3.3                 6.5          21.2 (1.00)      0.643
     Proposed                2.7                 5.6          15.1 (0.71)      0.512
12   Folded                  6.1                 8.6          52.9 (1.20)      1.371
     Direct LUT-based        48.8                6.9          336.7 (7.66)     3.264
     Garofalo et al. [47]    5.4                 8.2          43.9 (1.00)      1.125
     Proposed                5.0                 6.7          33.5 (0.76)      0.934
14   Folded                  8.7                 9.9          86.6 (1.23)      1.624
     Direct LUT-based        98.9                7.9          762.4 (11.1)     5.421
     Garofalo et al. [47]    7.7                 9.1          70.5 (1.00)      1.482
     Proposed                7.4                 7.6          56.2 (0.80)      1.326
16   Folded                  10.5                11.0         115.7 (1.21)     2.271
     Direct LUT-based        208.7               9.4          2024.4 (21.2)    9.560
     Garofalo et al. [47]    9.2                 10.4         95.4 (1.00)      1.784
     Proposed                8.9                 9.0          80.3 (0.84)      1.623


Table 4.5: ADP reduction with different values of W (compared with the method in [47]).

W     ADP reduction (%)
8     42
10    29
12    24
14    20
16    16


Figure 4.8: Test and measurement flow for the proposed fixed-width squarer.



Figure 4.9: Simulation result in Modelsim software (left) and measurement result in logic analyzer (right).



Figure 4.10: Critical path analysis result in Synopsys Design Compiler.

(Measured longest-path delay from input x6 to output s7: 4.7 ns.)

Figure 4.11: Longest path delay measurement result.


Chapter 5

LUT-Based Sine Function Computation

5.1 Sine Function Computation

Like other trigonometric functions, the sine function generator is highly required in real scientific computation systems and DSP applications. The software implementation of this function has been well investigated in textbooks and papers. However, due to the high demand for high speed computation and real-time DSP, its hardware implementation needs to be enhanced as well. There are some methods for sine function approximation, such as the CORDIC-based architecture [55]. However, this method involves iterative computation, so it is not suitable for high speed and real-time DSP applications. Moreover, some other sine approximation methods, such as high-order and Taylor approximations, often result in high hardware complexity. On the other hand, the LUT-based method is more attractive for DSP applications because of its merit of high speed computation. Therefore, this chapter focuses on the LUT-based method and its combination with linear approximation for sine computation, with the main targeted application being the digital frequency synthesizer for DSP systems.


5.2 Direct Digital Frequency Synthesizer (DDFS)

Modern digital communication systems require high performance frequency synthesizers to meet the increasing demand for high speed communication services. With many significant advantages over the traditional analog approach [56], direct digital frequency synthesizers (DDFS) are fast becoming an alternative to analog frequency synthesizers. A DDFS can also be employed in digital modulation, digital mixers and many other applications in DSP and digital communication systems. In a DDFS, a frequency-tunable and/or phase-tunable output signal is generated from a high precision reference clock by using some data processing blocks. A general DDFS includes a phase accumulator, a phase-to-amplitude converter (PAC) and a digital-to-analog converter (DAC). Figure 5.1 shows the simplified block diagram of a DDFS and its signal flow. Figure 5.2 presents the operation principle of a DDFS using the digital phase wheel representation [57]. By changing the value of the tuning word M, the output frequency can be changed accordingly. Moreover, by using an additional adder after the phase accumulator, the phase can be controlled as well. In other words, the reference clock frequency is divided down by scaling with the programmable binary tuning word. Therefore, another advantage of a DDFS is that its output frequency, phase and amplitude can be precisely and rapidly controlled by digital processors. The basic tuning equation for a DDFS is:

F_Out = (M × F_Clk) / 2^N    (5.1)

where F_Out is the output frequency of the DDFS, M is the binary frequency tuning word, which is also the jump size on the digital phase wheel presented in Figure 5.2,

F_Clk is the internal reference clock frequency (system clock) and N is the length in bits of the phase accumulator.
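A small numerical example of the tuning equation (5.1) is given below; the clock frequency and target output frequency are arbitrary illustrative values, not those of the design in this work.

F_CLK = 100e6          # reference clock, Hz (illustrative value)
N = 32                 # phase accumulator length, bits

def output_frequency(m):
    return m * F_CLK / 2 ** N               # F_Out = M * F_Clk / 2^N

def tuning_word(f_out):
    return round(f_out * 2 ** N / F_CLK)    # nearest tuning word for a target

m = tuning_word(1.0e6)                      # tuning word for roughly 1 MHz
print("M = %d -> F_out = %.6f MHz" % (m, output_frequency(m) / 1e6))
print("frequency resolution = %.6f Hz" % (F_CLK / 2 ** N))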

5.3 Phase to Amplitude Converter in DDFS

A typical solution for a DDFS and its phase-to-amplitude converter, first introduced by Tierney et al. [58], is shown in Figure 5.3. The converter generates sine values from an LUT. The address of this LUT is provided by a phase accumulator.



Figure 5.1: General block diagram and signal flow of the DDFS.

The phase accumulator is a binary feedback adder whose value is incremented by the frequency tuning word every clock cycle. The main disadvantage of this solution is that the LUT size approximately doubles for every additional bit of resolution, resulting in high hardware complexity, especially when high frequency resolution is required. Therefore, many LUT compression methods for DDFS have been proposed. Among these methods, the linear difference approach promises low hardware complexity when shift-add logic is used for the linear interpolation. In this chapter, an improved linear difference method is proposed to achieve a very high LUT compression ratio. The analysis and comparison of our synthesis and implementation results with previous work are also included.

5.4 LUT Compression Methods for Phase to Amplitude Converter

This section summarizes the existing LUT compression methods for sine computation in the phase to sine converter of a DDFS. Phase truncation is a simple method to significantly reduce the LUT size by discarding some least significant bits (LSB) of the phase accumulator output. However, this truncation also results in more and larger spurs in the output spectrum. Therefore, phase truncation should be used together with an LUT compression method to reduce the LUT size while obtaining reasonable accuracy.


Figure 5.2: Operation principle of DDFS with the digital phase wheel.

5.4.1 Sine Wave Symmetry Exploitation

This technique exploits the quarter-wave symmetry of the sine function. The LUT stores sine values for only π/2 (one quadrant) of the sine wave, and the LUT size is reduced to one-fourth. The cost of the LUT size reduction is the additional logic required to generate the full sine wave. The two most significant bits of the phase accumulator output are used as control bits for this purpose. Therefore, two complementors are used, as shown in Figure 5.3. The first complementor uses the second most significant bit (MSB2) of the phase accumulator output to invert the phase, and the second one uses the first most significant bit (MSB1) to invert the output amplitudes of the PAC.

5.4.2 Coarse/Fine LUT Splitting Method

As shown in Figure 5.4, this technique is based on the idea of a hierarchical LUT structure. The phase value P is divided into two parts, one used as the address of the coarse LUT and the other as the address of the fine LUT.



Figure 5.3: General DDFS with sine wave symmetry exploitation.


Figure 5.4: Coarse/Fine LUT splitting.

The coarse LUT has the same word size but fewer storage entries than the traditional method, while the fine LUT has a smaller word size and the same number of storage entries as the traditional method, storing the difference between the sine function values and the coarse LUT values. An extra adder is needed to provide the sine outputs from these LUTs.

5.4.3 Sunderland and Nicholas Architectures

The original Sunderland architecture and its modified version that employs the splitting technique are presented in [59] and [60], respectively. In the Sunderland architecture, the LUT size is reduced by applying trigonometric identities with the following approximation.


sin(π/2 (a + b + c)) = sin(π/2 (a + b)) cos(π/2 c) + cos(π/2 (a + b)) sin(π/2 c)    (5.2)

As an example, the 12-bit phase value can be decomposed into three 4-bit fractions as P = a + b + c so that a < 1, b < 2^{-4}, c < 2^{-8}. With this assumption, the approximation can be rewritten as:

sin(π/2 (a + b + c)) ≈ sin(π/2 (a + b)) + cos(π/2 (a + \bar{b})) sin(π/2 c)    (5.3)

in which the average value of the b component, \bar{b}, is added to the component a in the second term to improve the accuracy of the approximation. The first term is stored in the coarse LUT with low resolution, and the second term is stored in the fine LUT, which gives additional resolution by approximating the error between each coarse value and the equivalent sine value. Figure 5.5 presents the block diagram of a phase to sine converter using the Sunderland method. The Nicholas architecture applies the same coarse/fine LUT structure as the Sunderland method. However, the stored samples are calculated by numerical optimization to minimize the error or the mean-square error [61].
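The sketch below evaluates the Sunderland approximation (5.3) numerically for a 12-bit phase split into three 4-bit fractions a, b and c; the table sizes and the printed error figure are illustrative only and are not results claimed in this work.

import math

BITS = 4
b_avg = (2 ** BITS - 1) / 2 / 2 ** (2 * BITS)          # average value of b

# Coarse table: sin(pi/2 * (a + b)), addressed by the 8 bits of (a, b).
coarse = [math.sin(math.pi / 2 * ab / 2 ** (2 * BITS))
          for ab in range(2 ** (2 * BITS))]
# Fine table: cos(pi/2 * (a + b_avg)) * sin(pi/2 * c), addressed by (a, c).
fine = [[math.cos(math.pi / 2 * (a / 2 ** BITS + b_avg)) *
         math.sin(math.pi / 2 * c / 2 ** (3 * BITS))
         for c in range(2 ** BITS)] for a in range(2 ** BITS)]

max_err = 0.0
for p in range(2 ** (3 * BITS)):                        # all 12-bit phase words
    a = p >> (2 * BITS)
    b = (p >> BITS) & (2 ** BITS - 1)
    c = p & (2 ** BITS - 1)
    approx = coarse[(a << BITS) | b] + fine[a][c]
    exact = math.sin(math.pi / 2 * p / 2 ** (3 * BITS))
    max_err = max(max_err, abs(approx - exact))
print("maximum approximation error: %.2e" % max_err)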

5.4.4 Difference Method

In the difference method, as shown in Figure 5.6, the LUT stores the error between the sine function and a function D(P) which can be implemented with some basic arithmetic operations. An adder is also needed to provide the sine values. The maximum value of the stored entries is less than that of the original sine function; therefore, the number of bits needed for the LUT words is reduced. The simplest form of the difference algorithm is the sine-phase difference method, in which the LUT stores the error between the sine function and the simple linear function D(P) = P. This method reduces the LUT word size for storing sine values by two bits. To improve the compression ratio, many researchers have tried to find better difference functions.



Figure 5.5: Sunderland method.

Among previous works on the difference method, the paper of Hsu et al. [62] shows the highest compression ratio, employing an optimized three-segment linear difference method.

5.4.5 Multipartite Table Method

Recently, Davide De Caro et al. [63] have investigated and implemented an LUT compression technique for DDFS based on the multipartite table method (MTM). This is a new LUT compression technique in which the sine LUT is divided into A+1 tables: a table of initial values and A tables of offsets. The outputs of these tables are added together to provide the sine values of the equivalent phases. The reported implementation results of MTM show that this technique is very promising in terms of compression ratio and hardware efficiency.

5.5 Improved Linear Difference Method

For mathematical convenience, the range of the phase value is scaled to [0, 1] instead of [0, π/2] for the first quadrant of the sine wave, with the supposition that only the values of the first quadrant are stored in the LUT. The basic idea of the difference algorithm is to reduce the dynamic range of the stored values in order to reduce the LUT size.



Figure 5.6: General difference method.

With the simple sine-phase difference algorithm, the maximum difference value is 0.21 (1/4.75). Therefore, the number of bits needed to store this difference value (the LUT word size) is decreased by 2. To make the LUT word size smaller, some other difference functions have been proposed. Also, the compression ratio must be considered in a trade-off with the complexity of the targeted hardware. The function D(P) can be implemented by a multiple-segment piece-wise linear, parabolic or higher-order function. Normally, with the same number of segments, parabolic and especially higher-order functions provide a better approximation to the sine function than their linear counterparts. However, the implementation of parabolic or higher-order functions requires much more hardware resources because of the multiplications needed for such non-linear functions. On the other hand, modern electronic systems require high power efficiency and low complexity. Therefore, piece-wise linear functions are a good choice, since they can be implemented with low complexity shift/add logic. In the piece-wise linear difference method, the difference function D(P) consists of J linear segments described as:

D(P) =
  a_0 P + b_0              (0 ≤ P < P_1)
  a_1 P + b_1              (P_1 ≤ P < P_2)
  ...
  a_{J-1} P + b_{J-1}      (P_{J-1} ≤ P ≤ 1)


in which the slope values (a_0, a_1, ..., a_{J-1}) and offset values (b_0, b_1, ..., b_{J-1}) of all segments are calculated in advance based on the break-point phase values P_1, P_2, ..., P_{J-1}.

To get the highest compression ratio for each given J value, numerical optimization is performed to find the optimal phase break-point values P_i for each segment by minimizing the maximum value of the difference function sin(P·π/2) − D(P), because this maximum value defines the number of bits needed for each LUT entry. This parameter optimization was done with Matlab software. Table 5.1 lists the optimization results for each J value and the corresponding word size reduction in the LUT. In this table, MaxDiff denotes the maximum value of the difference function. The case of J = 1 corresponds to the sine-phase difference method. It is shown that, for J ≤ 4, one more bit of reduction in the LUT word size is achieved each time J is increased by one. Also, the optimization results for the case of J = 3 are the same as those in the work of Hsu et al. [62]. This means that the design in [62] is a special case of this work. However, for J > 4, the word size reduction does not continue to grow at the same rate. Hence, in order to achieve a good trade-off between compression ratio and hardware complexity, the value J = 4 is chosen for our design, and the word size in the LUT can be reduced by 6 bits. In this case, the optimal values of the phase break points are: P_1 = 0.363, P_2 = 0.598 and P_3 = 0.804. In the hardware implementation, the computation in each segment, consisting of a constant multiplication (by the slope value) and an addition (of the offset value), is performed by shift-add logic. Figure 5.7 presents the proposed difference function compared with the one in [62]. It can be seen that the proposed function leads to a lower maximum difference value. Therefore, the LUT word length can be reduced.
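A simplified Python sketch of this break-point optimization is given below. It assumes, for illustration only, that each segment of D(P) is the chord of sin(P·π/2) between its break points (the slopes and offsets actually used in this work may differ) and minimizes the maximum of |sin(P·π/2) − D(P)| over the interior break points with a generic optimizer.

import math
import numpy as np
from scipy.optimize import minimize

J = 4
P = np.linspace(0.0, 1.0, 4097)
SINE = np.sin(P * math.pi / 2)

def max_difference(interior):
    pts = np.concatenate(([0.0], np.clip(np.sort(interior), 0.0, 1.0), [1.0]))
    d = np.interp(P, pts, np.sin(pts * math.pi / 2))   # piece-wise linear D(P)
    return np.max(np.abs(SINE - d))

x0 = np.linspace(0, 1, J + 1)[1:-1]                    # evenly spaced start
res = minimize(max_difference, x0, method="Nelder-Mead")
print("break points:", np.round(np.sort(res.x), 3))
print("MaxDiff     :", round(float(max_difference(res.x)), 4))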

5.6 Implementation Results

Figure 5.8 shows the structure of the phase to amplitude converter block using the improved linear difference method. In this design, the compare-and-select module compares the current phase value with the P_1, P_2, P_3 values and selects the slope and offset values for each segment. The shift-add logic module performs the constant multiplication and addition to compute the optimized 4-segment linear function D(P).


Figure 5.7: Proposed function (J = 4) vs Hsu difference function (J = 3).

The Canonical Signed Digit (CSD) technique is applied to the shift-add logic module to reduce its complexity. The split LUT stores the difference values sin(P) − D(P). An extra adder is also needed to provide the final sine output. To get the highest possible compression ratio, the improved difference method is used together with the exploitation of the sine wave symmetry and the coarse/fine splitting method. Therefore, only the first-quadrant values of the sine function are stored, and the sine LUT is split into coarse and fine LUTs. The implementation targets the Xilinx Spartan-3E FPGA device with the Xilinx ISE 10.1 design and synthesis tool suite. The higher level simulation was done with the Xilinx System Generator tool in the MATLAB-Simulink environment. The DDFS in this work was designed with a phase accumulator length (tuning word length) of N = 32, a 12-bit phase word and an 11-bit sine output (the same output resolution as the design in [62]).


Table 5.1: Computation results of word size reduction for different numbers of segments.

Number of segments   MaxDiff           Word size reduction in LUT (bits)
1                    0.21 (1/4.75)     2
2                    0.0490 (1/20)     4
3                    0.0220 (1/45)     5
4                    0.0120 (1/83)     6
5                    0.0080 (1/125)    6
6                    0.0058 (1/172)    7


Figure 5.8: Architecture of the phase to amplitude converter using the improved linear difference method.

Table 5.2 compares the results of the proposed DDFS design with previous work, and Figure 5.9 demonstrates the sine wave output in Modelsim simulation. To clarify the improvement of our method, the comparison also includes a DDFS design with the conventional 4-segment linear difference method, in which the two most significant bits of the phase word are used to select the segments. The designs for all cases target a spurious loss value of 75 dBc. The spurious loss (in dBc) is calculated as the relative ratio (in logarithmic scale) between the amplitudes of the highest spur component and the primary frequency component. The area of the FPGA implementation is estimated by the number of slices used; the slice is the primitive component of the Xilinx Spartan-3E FPGA.


Figure 5.9: Digital sine wave output of the proposed DDFS in Modelsim simulation.

Table 5.2: Comparison with the previous methods.

Method              Sunderland   Sine-phase   Conv. lin.   In [62]   Proposed
LUT size (bits)     4096         3854         2688         2560      1536
Compression ratio   44:1         50:1         67:1         70:1      117.3:1
Number of slices    144          146          133          176       136
Spurious loss       -            -            75 dBc       75 dBc    75 dBc

It is also shown in Table 5.2 that our design achieves a higher compression ratio than the other linear difference methods while maintaining quite low hardware complexity, expressed as the number of slices used in the FPGA device. In this table, Conv. lin. denotes the conventional 4-segment linear difference method.

5.7 Chapter Conclusions

The improved piece-wise linear difference algorithm presented in this chapter provides a method to obtain a good trade-off between LUT compression ratio and hardware complexity among the linear difference methods for LUT-based DDFS design.


In general, the LUT block is considered the most power consuming part of a conventional LUT-based DDFS. The limitation of the proposed method is that it mainly focuses on word length reduction, while the reduction of the number of words is based on known methods. However, with a high compression ratio of 117.3:1 and low hardware complexity, the proposed DDFS has the potential to meet the requirements of high hardware efficiency and low power consumption in advanced digital VLSI systems.


Chapter 6

LUT-Based Logarithm and Anti-logarithm Computation

This chapter presents an efficient approach for logarithm and anti-logarithm hardware computation which can be used for the arithmetic unit in hybrid number system processors and for logarithm/exponent function generators in DSP applications. By employing the novel quasi-symmetrical difference method with only simple shift-add logic and a look-up table, the proposed approach can reduce the hardware area and improve the computation speed significantly while achieving similar accuracy compared with the best previous method. The implementation results in both FPGA and 0.18-μm CMOS technology are also presented and discussed.

6.1 Introduction

In many modern DSP applications, e.g. 3-D graphics [64], many complex operations are required, such as multiplication, division, square root, powering and so on. By using the logarithmic-scaled domain, these operations can be simplified. For example, multiplication can be performed by addition, while square root can be replaced by shifting in the logarithmic domain. The logarithmic number system (LNS), which is based on a logarithmically scaled representation and arithmetic, has been shown to be an alternative to floating point for precisions up to 32 bits because of its merit in the above complex operations [65]. Also, LNS is promisingly suitable

for low power applications [66]. However, the addition and subtraction operations in the LNS are more complicated [64], resulting in its low popularity in current systems. Therefore, it is reasonable to combine LNS and the conventional binary number system to derive the hybrid number system (HNS), which takes the advantages of both number systems [64]. In HNS, the linear binary to logarithm converters (LOGC) and logarithm to linear binary converters (anti-logarithmic converters or ALOGC) are essential components. In the implementation of 3-D graphic processors using HNS in [3], the area of LOGC and ALOGC consumes 64% of the total chip area, as shown in Figure 6.1. Hence, reducing the hardware complexity of the LOGC and ALOGC components can have a great impact on overall HNS-DSP systems. Moreover, in many real-time DSP applications such as wireless communication systems, high performance, low area and low power logarithm and anti-logarithm function generators are also required. Therefore, the objective of this research is to find a novel and efficient architecture for logarithm and anti-logarithm hardware computation that can be applied to LNS/HNS systems and general DSP applications. The main contribution of this work is that a new approach called the quasi-linear difference method is proposed to provide an efficient architecture and method for low complexity and high speed LOGC and ALOGC.

6.2 Logarithm and Anti-logarithm Approxima- tion Methods

Without loss of generality, consider the binary (base-2) logarithm of an unsigned number N for LOGC, in which N can be decomposed as [67]:

N = 2^n (1 + x) (6.1)

where n, called the characteristic of N, corresponds to the position of the most significant '1' bit of N in its binary representation and x is the fraction part with 0 ≤ x < 1. As a result, the binary logarithm of N can be expressed as:

log2 N = n + log2(1 + x) (6.2)


Figure 6.1: Area breakdown of a 3-D graphics chip in [3] (LOGC 47%, LNS 24%, ALOGC 17%, FXP 11%, test vector 1%).

Therefore, log2 N can be computed by detecting the most significant '1' bit of N in its binary representation and approximating log2(1 + x), called the fundamental function of LOGC. Hence, many researchers have focused on finding efficient methods to compute log2(1 + x), since it is the essential step in the LOGC. On the other hand, it is also required to convert a value from the logarithmic to the linear scale by an ALOGC, which requires the efficient computation of its fundamental function (2^x − 1) over the same range of x.
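To make the decomposition in (6.1)-(6.2) concrete, the following minimal Python sketch checks the identity for unsigned integers. It is only a software reference and not the hardware architecture; the function names are illustrative.

```python
import math

def log2_decompose(N: int):
    """Decompose N = 2^n * (1 + x) as in (6.1): n is the position of the
    most significant '1' bit and x (0 <= x < 1) is the fraction part."""
    assert N > 0
    n = N.bit_length() - 1            # characteristic n
    x = N / (1 << n) - 1.0            # fraction part
    return n, x

def log2_via_decomposition(N: int) -> float:
    """Evaluate (6.2): log2(N) = n + log2(1 + x)."""
    n, x = log2_decompose(N)
    return n + math.log2(1.0 + x)

if __name__ == "__main__":
    for N in (1, 7, 100, 65535):
        assert abs(log2_via_decomposition(N) - math.log2(N)) < 1e-12
```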

6.2.1 Mitchell Approximation

J. N. Mitchell proposed a simple linear approximation method [67] as follows:

log2(1 + x) ≈ x (6.3)

2^x ≈ x + 1 (6.4)


Figure 6.2: Logarithmic and anti-logarithmic Mitchell error functions.

where 0 ≤ x < 1. Figure 6.2 depicts the two error functions due to this approximation method:

EL = log2(1 + x) − x (6.5)

EA = x − (2^x − 1) (6.6)

where EL and EA, called Mitchell errors, denote the error functions for the approximations in (6.3) and (6.4), respectively. The maximum value of both error functions is 0.08639, and the accuracy is only 3.5 bits, which is too low for many DSP applications. Therefore, much research has focused on more accurate approximation methods.
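As a quick numerical illustration (not part of the original design flow), the Mitchell error functions (6.5) and (6.6) can be evaluated directly; both peak at roughly 0.086 near the middle of the range.

```python
import numpy as np

# Mitchell errors EL(x) = log2(1+x) - x and EA(x) = x - (2^x - 1), 0 <= x < 1.
x = np.linspace(0.0, 1.0, 100001, endpoint=False)
EL = np.log2(1.0 + x) - x
EA = x - (2.0 ** x - 1.0)

# Both error curves share the same maximum (about 0.086), which corresponds to
# only a few bits of accuracy.
print(EL.max(), EA.max(), x[EL.argmax()])
```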

6.2.2 Shift-Add Piece-Wise Linear Approximation

In the shift-add piece-wise linear approximation method, the range [0, 1) of the fraction part x in logarithm and anti-logarithm computation is divided into a number of regions. Within each region, EL or EA is approximated by a linear

function called a segment. A linear function can be expressed as:

y = Slope ∗ x + Offset (6.7)

Alternatively, a linear segment can be defined by two of the following parameters: slope, offset and one point (with x and y coordinates); it can also be defined by two points. The complicated multiplication by Slope in (6.7) can be avoided by using shift-add logic. Some shift-add piece-wise linear approximation methods with different numbers of segments and parameter values of the linear function are presented in [64], [68]-[73]. In [68]-[72], methods were proposed for 2, 4 and 6 segments, in which the parameters of each linear segment are selected by the "trial and error" method. B.-G. Nam et al. [64] presented a method of dividing the input range into 24 regions for the logarithmic and 16 regions for the anti-logarithmic approximation. In [73], the authors proposed an optimization method for the integer binary number to logarithm conversion. However, these methods should be improved before they can be employed in high accuracy applications. Increasing the number of segments [64], [73] or using a higher order approximation [74] can improve the accuracy, but also leads to higher hardware complexity.
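A small sketch of one such segment is given below, assuming the slope is a power of two so that the multiplication in (6.7) collapses to a single shift on fixed-point data; the slope and offset values are placeholders, not the coefficients of any cited design.

```python
def segment_shift_add(x_fixed: int, shift: int, offset_fixed: int) -> int:
    """One linear segment y = Slope * x + Offset from (6.7), with
    Slope = 2^(-shift), so the multiplication becomes a right shift."""
    return (x_fixed >> shift) + offset_fixed

# Example in a 13-bit fixed-point format: Slope = 1/4, Offset = 0.004.
W = 13
x = int(0.3 * (1 << W))                   # x = 0.3
offset = int(0.004 * (1 << W))
y = segment_shift_add(x, 2, offset)
print(y / (1 << W))                       # close to 0.3/4 + 0.004 = 0.079
```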

6.2.3 Table-Based Approximation Methods

The multipartite table method (MTM) to approximate elementary functions including logarithm and anti-logarithm was presented in [75], in which only tables and adders are used. This approach can reduce the table size considerably compared with the direct LUT-based method, in which only an LUT is used to provide the conversion results. S. Paul et al. [76] presented a table-based method for LOGC and ALOGC by combining an LUT and a multiplier-less linear interpolation so that the table size can be reduced compared with some other table-based methods for the same accuracy.

6.2.4 Combined LUT-Based/Difference Method

The difference method can be employed for high accuracy conversions. The basic idea is that an LUT is used to store the difference function value, which



Figure 6.3: General difference method for logarithm approximation.

is defined by the difference between the original function and the approximated one, as depicted in Figure 6.3. In LOGC, the difference function is defined by the difference between EL and the approximated function. In a simple method proposed in [77], EL is stored in an LUT and the LUT output is added to the Mitchell approximation function to get the final result. R. Gutierrez et al. [78] proposed an improved method using the 4-segment linear approximation together with a small error LUT. Figure 6.4 presents the 4-segment linear approximation and its error function as proposed in [78]. It is reported in [78] that this method outperforms previous methods for LOGC in both area efficiency and performance. However, further improvements are desired, and the selection of the linear function parameters and coefficients by the "trial and error" method may lead to a non-optimal architecture. Therefore, in this research work, an optimization algorithm is developed to find the optimal parameters for the approximations.

6.3 Proposed Quasi-Symmetrical Approach

Since EL(x) and EA(x) have similar curves and the same maximum value as mentioned previously, a unified approach for both LOGC and ALOGC is developed. This section mainly describes the method for LOGC, but it can be applied to ALOGC as well. Moreover, a new parameter optimization algorithm combined with the LUT-based correction is presented so that the proposed method is much


Figure 6.4: The 4-segment linear method with small error LUT proposed in [78].

better than the "trial and error" method [78] which is often used in many previous works.

Firstly, as depicted in Figure 6.5, consider the curves of the original EL(x), its mirror function EL(1 − x) and their mean function EM(x) over the operand range 0 ≤ x < 1. From the definitions of EM(x) and EM(1 − x), their relationship can be derived as follows:

EM(x) = 0.5 × [EL(x) + EL(1 − x)] = EM(1 − x) (6.8)

It can be seen that EM(x) is symmetrical about the center line x = 0.5. Also, the maximum value of the difference function (EM − EL)(x) is quite small (about 0.0075) compared with the maximum value of EL (0.08639). As a result, it will be promisingly efficient if we can approximate this mean function, because it requires the approximation of only half the range of x and the other half can be interpolated easily by a simple complement circuit which uses the most significant bit (MSB) as the control bit. Moreover, the LUT size can be further reduced by half if we can approximate the mean curve accurately, because the difference function is also symmetrical.
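The symmetry claim in (6.8) and the quoted bound of about 0.0075 on (EM − EL)(x) can be checked numerically with a few lines of Python; this is only a sanity check of the idea, not part of the hardware design.

```python
import numpy as np

# Check (6.8): EM(x) = 0.5*(EL(x) + EL(1-x)) is symmetric about x = 0.5, and
# the residual (EM - EL)(x) stays small compared with the peak of EL.
x = np.linspace(0.0, 1.0, 100001)
EL = lambda t: np.log2(1.0 + t) - t
EM = 0.5 * (EL(x) + EL(1.0 - x))

print(np.max(np.abs(EM - EM[::-1])))   # symmetry: essentially zero
print(np.max(np.abs(EM - EL(x))))      # about 0.0075, versus ~0.086 for EL
```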



Figure 6.5: The idea for the proposed quasi-symmetrical approach.

Since the difference method involves an approximated function and an error correction LUT, the optimization algorithm has to take into account both the complexity of the approximated function and the LUT size. To find the architecture optimized for the smallest LUT size with the piece-wise linear approximation, we can perform an exhaustive search to find the slope and offset values which minimize the maximum value of the difference function. However, the optimal slope value of this method may lead to a high complexity of the approximate function because of the multiplying operation. A possible solution is to use a shift-add multiplier, but it may result in a low computation speed. On the other hand, using a simple slope value (for example, a power-of-two value) can reduce the complexity significantly because the multiplication can be performed by only one shift operation, but it also leads to a large LUT size. Therefore, we propose the 2-step optimization algorithm to get a good trade-off between hardware complexity and computation speed. The objective of the proposed optimization

algorithm is to find the optimal parameters for two segments that reduce the LUT size as much as possible while achieving low hardware complexity. Therefore, the optimization algorithm is developed to minimize the maximum value of the difference function, because the LUT size depends on this value. The quasi-symmetrical approach means that EL is actually not symmetrical but only nearly symmetrical; however, it can be treated like a symmetrical function with some modifications. Firstly, the entire range of x (0 ≤ x < 1) is divided into two halves so that the quasi-symmetrical method can be applied. Then, the left half range of the operand x (0 ≤ x ≤ 0.5) is further divided into two equal intervals [0, 0.25) and [0.25, 0.5] to keep the selecting circuit simple. Table 6.1 presents the optimization algorithm to find the optimal parameters of the linear segments, in which Peak point denotes the value of the approximate function at the point x = 0.5 and the difference function is defined by the difference between EL and the linear function. Figure 6.6 depicts this optimization algorithm, in which the two linear segments are chosen independently. In step 1, a full search in the restricted ranges of Peak point and Offset1 is performed to find the optimal values of Slope1 and Slope2 that minimize the maximum value (MaxDiff) of the difference function, which is stored in the LUT. Then, in step 2, Slope1 and Slope2 are reassigned to the adjacent power-of-two values and another search is performed to find the optimal offset values which minimize MaxDiff. The power-of-two slope values are used to avoid multiplications and use a one-shift operation. The ranges of Peak point and Offset1 are chosen to guarantee the acceptable accuracy of the approximated results. The optimization results in each step of the proposed algorithm for both the log2(1 + x) and (2^x − 1) functions are summarized in Table 6.2. It can be seen that after step 2, MaxDiff increases slightly but the LUT size remains the same as the result of step 1. The error analysis results are presented in Table 6.3 for the case of W = 13, in which W denotes the input and output bit-width. Figure 6.7 depicts the proposed 2-segment quasi-symmetrical method for LOGC. It can be seen that the proposed method achieves similar accuracy and LUT size compared with the method in [78]. However, the proposed method can reduce the hardware complexity significantly because EL is approximated by four linear segments, but only two segments actually need to be computed.


Table 6.1: Proposed 2-step optimization algorithm.

Step 1: For {Offset1_L ≤ Offset1 ≤ Offset1_H and Peak point_L ≤ Peak point ≤ Peak point_H}: find the optimal values of Slope1 and Slope2.
Step 2: Re-assign the optimal Slope1 and Slope2 values from step 1 to the adjacent power-of-two values and find the optimal offset values.

Table 6.2: Optimal parameter results using the 2-step optimization algorithm.

  Function       Step     Slope1   Offset1   Slope2   Offset2   MaxDiff
  log2(1 + x)    Step 1   0.2332   0.008     0.728    0.0341    0.0089 (1/112)
                 Step 2   0.25     0.004     0.0625   0.0518    0.0101 (1/99)
  (2^x − 1)      Step 1   0.2617   0.006     0.891    0.0495    0.0072 (1/139)
                 Step 2   0.25     0.003     0.125    0.0305    0.0075 (1/133)
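The following Python sketch illustrates how a two-step search of this kind can be organized: a coarse grid search over the slopes and offsets of the two segments, followed by snapping the slopes to powers of two and re-searching the offsets. The grids, the fixed split at x = 0.25 and the use of EL as the target are simplifications of this example, so the values it returns are not expected to reproduce Table 6.2 exactly.

```python
import numpy as np

EL = lambda x: np.log2(1.0 + x) - x          # target function on [0, 0.5]

def maxdiff(s1, o1, s2, o2, xs):
    """MaxDiff of a 2-segment approximation split at x = 0.25."""
    seg2 = xs >= 0.25
    approx = np.where(seg2, s2 * xs + o2, s1 * xs + o1)
    return np.max(np.abs(EL(xs) - approx))

def two_step_search(slope_grid, offset_grid):
    xs = np.linspace(0.0, 0.5, 2001)
    # Step 1: unconstrained grid search over both segments.
    best = min(((maxdiff(s1, o1, s2, o2, xs), (s1, o1, s2, o2))
                for s1 in slope_grid for o1 in offset_grid
                for s2 in slope_grid for o2 in offset_grid),
               key=lambda t: t[0])
    s1, o1, s2, o2 = best[1]
    # Step 2: snap slopes to adjacent powers of two, re-search the offsets.
    snap = lambda s: 2.0 ** np.round(np.log2(s))
    s1p, s2p = snap(s1), snap(s2)
    best2 = min(((maxdiff(s1p, o1, s2p, o2, xs), (s1p, o1, s2p, o2))
                 for o1 in offset_grid for o2 in offset_grid),
                key=lambda t: t[0])
    return best, best2

step1, step2 = two_step_search(np.linspace(0.05, 0.3, 11),
                               np.linspace(0.0, 0.06, 25))
print(step1[0], step2[0])   # compare the MaxDiff of the two steps
```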

Figure 6.8 shows the proposed architecture to compute the log2(1 + x) function for LOGC. A similar architecture is also employed to approximate the (2^x − 1) function for ALOGC.

6.4 Implementation of the Fundamental Functions

The proposed and reference architectures for the log2(1 + x) and (2^x − 1) functions are implemented in both an Altera Cyclone II FPGA (using the Quartus II design tool) and 0.18-μm CMOS technology (using the Synopsys Design Compiler and Synopsys IC Compiler tools with a standard cell library), with the same design parameters and constraints for the case of W = 13. The area in the FPGA implementation is measured by the number of logic elements (LE) used. To compare the overall converter performance, the area-delay product (ADP) results are also presented. In [78], a comparison for log2(1 + x) computation shows that the method proposed in [78] can reduce the hardware complexity by about 43% compared with the method presented in [76] while achieving similar conversion


Figure 6.6: Proposed 2-step parameter optimization algorithm for the hardware approximation of log2(1 + x).

speed and accuracy. Therefore, the implementation results of our method are compared with the method in [78] on the same hardware platforms to clarify the improvements of the proposed approach. Table 6.4 shows that in the FPGA implementation, the proposed method reduces the ADP value by 39% compared with the method in [78] for the log2(1 + x) function and by 57% compared with the MTM-based method for the (2^x − 1) function. Moreover, in ASIC implementations with 0.18-μm CMOS technology, as presented in Table 6.5, the proposed method reduces the ADP value by 44% compared with the method in [78] for the log2(1 + x) function and by 62% compared with the MTM-based method [75] for the (2^x − 1) function. The average power consumption results are also presented in Table 6.5, in which the proposed method leads to a significant reduction of power consumption compared with the other methods.



Figure 6.7: Proposed quasi-symmetrical linear approximation method (after the parameter optimization).

6.5 Applications in Logarithm Generator and HNS Processors

In this section, the proposed architectures for the approximation of the log2(1 + x) and (2^x − 1) functions are applied to logarithm/exponent function generators in DSP systems and to the arithmetic unit (AU) in HNS processors.

6.5.1 Logarithm Function Generator for DSP Applications

Logarithm and exponent function generators are highly required in many DSP applications such as digital communication systems, artificial neural networks and 3-D graphics. Firstly, consider the binary logarithm generator for a 16-bit operand; more general architectures for both logarithm and exponent functions can be built similarly. According to Equations (6.1) and (6.2), the proposed architecture for a 16-bit logarithm function generator consists of four main components as


Table 6.3: Summary of optimal parameters and error analysis results for the proposed method compared with the method presented in [78].

  Method            In [78] (for log2(1 + x))   Proposed (for log2(1 + x))   Proposed (for (2^x − 1))
  No. of segments   4                           2 (4)                        2 (4)
  MaxDiff           0.0080 (1/125)              0.0101 (1/99)                0.0075 (1/133)
  LUT size (bit)    640                         640                          512
  Mean error        2.3 × 10^−4                 2.3 × 10^−4                  2.4 × 10^−4
  Maximum error     8.0 × 10^−4                 8.0 × 10^−4                  8.2 × 10^−4


Figure 6.8: Proposed architecture for the approximation of log2(1 + x) with W = 13.

presented in Figure 6.9. The leading one detector and encoder (LODE) detects the most significant '1' bit and generates the binary coded result of the characteristic n corresponding to the operand N. The inverter block (INV) and the barrel shifter are used to generate the fraction part (x) of the operand. Finally, the logarithm approximation block provides the fraction part of the logarithm result by approximating the log2(1 + x) function. The zero flag z is also generated by the LODE circuit to indicate the exception of a zero input. Tables 6.6 and 6.7 present the implementation results for the logarithm generator using several different methods on Altera Cyclone II FPGA and 0.18-μm CMOS technology platforms, respectively. They show that the proposed method outperforms the other methods in speed, area efficiency and ADP as well.
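A simplified bit-level Python model of this datapath is sketched below. The leading-one detection and the barrel shift follow the description above (13-bit fraction, 12-bit result fraction); the log2(1 + x) block is left as a plug-in and is evaluated exactly here, whereas the hardware uses the proposed shift-add plus LUT architecture of Figure 6.8.

```python
import math

def logarithm_generator_16bit(N: int, frac_block):
    """Model of the 16-bit logarithm generator in Figure 6.9.

    Returns (z, n, fr): zero flag, 4-bit characteristic and 12-bit fraction of
    log2(N).  `frac_block` maps the 13-bit fraction x to the 12-bit result and
    stands in for the log2(1+x) approximation block.
    """
    assert 0 <= N < (1 << 16)
    if N == 0:
        return 1, 0, 0                              # zero flag from the LODE
    n = N.bit_length() - 1                          # LODE: leading-one position
    x13 = ((N << 13) >> n) & ((1 << 13) - 1)        # barrel shifter: fraction x
    return 0, n, frac_block(x13)

# Stand-in fraction block using the exact function (truncated to 12 bits).
exact_frac = lambda x13: int(math.log2(1.0 + x13 / 8192.0) * 4096)

z, n, fr = logarithm_generator_16bit(1000, exact_frac)
print(n + fr / 4096.0, math.log2(1000))             # both are about 9.9658
```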


Table 6.4: FPGA implementation results of different methods for the approximation of log2(1 + x) and (2^x − 1) with W = 13.

  Function       Method             Area (LE)   Delay (ns)   ADP (×10^3)
  log2(1 + x)    Direct LUT-based   980         19.2         18.82
                 MTM-based          216         20.0         7.49
                 Method in [78]     126         22.7         2.86
                 Proposed           100         17.3         1.73
  2^x − 1        Direct LUT-based   948         19.0         18.01
                 MTM-based          185         19.8         3.66
                 Proposed           93          16.8         1.56

It can be seen in Table 6.7 that the proposed method leads to an ADP reduction of 37% and a power consumption reduction of 26% compared with the method in [78].

6.5.2 Hybrid Arithmetic Unit For HNS Processors

In computation-intensive DSP applications such as 3-D graphics and video/image processing, many complex operations (multiplication, division, squaring, etc.) are required. Using the logarithmic number system (LNS) for the arithmetic unit (AU) can reduce the hardware complexity and improve the computation speed for these complex operations, because they are replaced by simple operations such as addition, subtraction and shifting in the logarithmic scale. However, the addition and subtraction operations in LNS are more complicated than in the conventional number system. Therefore, a good trade-off can be achieved by employing the hybrid number system (HNS), which combines LNS and the conventional number system. Figure 6.10 presents the block diagram of an HNS AU, in which X and Y denote two operands and S is the computation result. The input OP is used to select the desired operation by controlling the multiplexers (MUXes).
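The following toy Python example illustrates why such an AU is attractive: a multiplication becomes S = ALOGC(LOGC(X) + LOGC(Y)). Mitchell's approximations (6.3)/(6.4) stand in for the LOGC/ALOGC blocks here, so the product is only approximate; the real unit would use the proposed higher-accuracy converters.

```python
import math

def mitchell_log2(N: float) -> float:
    """LOGC stand-in: n + x, using log2(1 + x) ~ x from (6.3)."""
    m, e = math.frexp(N)                   # N = m * 2^e with 0.5 <= m < 1
    return (e - 1) + (2.0 * m - 1.0)

def mitchell_alog2(v: float) -> float:
    """ALOGC stand-in: 2^n * (1 + x), using 2^x ~ 1 + x from (6.4)."""
    n = math.floor(v)
    return 2.0 ** n * (1.0 + (v - n))

X, Y = 37.0, 113.0
S = mitchell_alog2(mitchell_log2(X) + mitchell_log2(Y))
print(S, X * Y)    # about 3936 versus the exact product 4181 (Mitchell error)
```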


Table 6.5: Implementation results in 0.18-μm CMOS technology of different methods for the approximation of log2(1 + x) and (2^x − 1) with W = 13.

  Function       Method             Area (×10^3 μm^2)   Delay (ns)   ADP (×10^3)   Power (mW)
  log2(1 + x)    Direct LUT-based   37.2                6.3          234.4         6.238
                 MTM-based          7.5                 9.6          72.0          1.526
                 Method in [78]     5.2                 8.7          45.2          1.304
                 Proposed           4.2                 6.0          25.2          0.931
  2^x − 1        Direct LUT-based   34.7                6.0          208.2         6.184
                 MTM-based          7.9                 8.0          63.8          1.795
                 Proposed           4.1                 5.9          24.2          0.926

Table 6.6: FPGA implementation results of different methods for the 16-bit logarithm generator.

  Method             Area (LE)   Delay (ns)   ADP (×10^3)
  Direct LUT-based   1064        26           27.7
  MTM-based          481         27.9         13.4
  Method in [78]     201         22.8         4.58
  Proposed           175         16.8         2.94

6.6 Chapter Conclusions

Logarithmic and anti-logarithmic converters are essential components in HNS processors and many DSP applications. The proposed quasi-symmetrical approach can lead to large improvements of these converters in both hardware complexity and conversion speed. The implementation results in both FPGA and 0.18-μm CMOS technology show that the proposed approach is efficient and can be a promising candidate for DSP applications that employ an HNS arithmetic unit or logarithm/exponent computation modules. However, a remaining limitation of the proposed method is that it is difficult to approximate exactly



Figure 6.9: A 16-bit logarithm function generator using the proposed method.

Table 6.7: Implementation results in 0.18-μm CMOS technology of different methods for the 16-bit logarithm generator.

  Method             Area (×10^3 μm^2)   Delay (ns)   ADP (×10^3)   Power (mW)
  Direct LUT-based   32.6                12.2         397.7         6.983
  MTM-based          23.4                13.0         304.2         5.064
  Method in [78]     9.4                 10.3         96.8          2.269
  Proposed           7.6                 8.0          60.8          1.680

the EM function, which would further reduce the LUT size by half, as presented previously. Therefore, more efficient methods will be considered in future research.



Figure 6.10: An arithmetic unit for HNS processors using proposed converters.

Figure 6.11: Layout of the proposed 16-bit logarithm function generator (116 µm × 116 µm).


Chapter 7

Prototype and Applications of Proposed IP Cores

7.1 Chapter Objectives and Targeted Applications

7.1.1 Chapter Objectives

With the increasing demand for new DSP algorithms and applications, especially for mobile, hand-held and wearable devices, efficient arithmetic components and modules are highly required. Conventional parallel architectures for arithmetic functions often lead to high hardware complexity and power consumption. Therefore, some alternatives need to be proposed for future DSP applications. In previous chapters, several efficient methods have been proposed to derive efficient hardware IP cores based on LUT architectures for specific arithmetic operations in DSP applications. The implementation results have clarified the advantages of the proposed approach of combining LUT-based and simple conventional logic circuits. This chapter provides a general application view of the hardware IP cores presented in this dissertation. Moreover, the design flow and detailed examples are presented to show how to implement the proposed IP cores for specific DSP applications. Besides, a full prototype system using these cores is developed for test purposes.


Figure 7.1: Chapter objectives.

7.1.2 Targeted Applications of Proposed IP Cores

The proposed IP cores can be used both for function-specific DSP applications and as special functional units in the datapath of modern DSP processors. For the case of function-specific DSP applications, the targeted applications of the proposed IP cores can be considered as follows:

• KCM core can be used for applications which employ constant multiplications such as FIR filter, image convolver, color space converter, etc.

• Squarer core is suitable for real-time DSP applications which require a large amount of squaring operations, such as Euclidean distance computation for Viterbi decoding and similarity evaluation for speech and image recognition.

• Sine function computation core can be employed for frequency synthesizers, digital mixers, modulators/demodulators and many other applications in digital communication systems.


• Logarithm and anti-logarithm function computation cores are applicable to many speech and graphics applications. The logarithmic and anti-logarithmic converters are also very important components in DSP processors which employ the logarithmic number system or the hybrid number system, as presented in Chapter 6.

Moreover, these IP cores can be used as special functional units in the datapath of a DSP processor for specific application classes which require high numbers of the corresponding function evaluations.

7.2 Design Flow for Applications using Proposed IP Cores

7.2.1 Design Flow

Figure 7.2 presents the general design flow for DSP applications using the IP cores proposed in this dissertation. These cores are provided as components in a library. Based on the detailed application specifications, the hardware description language (HDL) code for the arithmetic components based on these IP cores can be generated automatically by an automation software tool (a Matlab program). After that, the HDL description for the whole application is developed. Then, the resulting HDL code is synthesized and implemented on FPGA or ASIC hardware platforms. Finally, verification is performed to confirm the functionality and performance of the application.

7.2.2 Advantages of Using Proposed IP Cores for DSP Applications

The proposed IP cores are optimized for high area-delay efficiency as presented in previous chapters. Therefore, using these cores in applications can lead to the following advantages.

• Low area and high speed, due to the use of efficiently optimized IP cores.



Figure 7.2: Design flow for DSP applications using proposed IP cores.

• High flexibility, because these IP cores can be parameterized for the specific applications.

• Reduced design time, since the VHDL code of each IP core can be generated automatically by a Matlab program.

7.3 A Prototype Computation System for the Proposed IP Cores

Traditionally, a digital signal processor is a dedicated processor targeted for digital signal processing, with a specific hardware architecture to support real-time and high speed applications. It differs from general-purpose processors, which are more flexible and targeted for a wide range of applications. Currently, a trend in DSP processor architecture is the merged DSP

core/Micro-controller Unit (MCU) architecture, due to the increasing demand for new applications. The Blackfin processor family released by Analog Devices, Inc. is one typical processor family which employs this merged architecture [82]. A merged DSP processor which combines a 32-bit RISC MCU core and a 16-bit DSP core is presented in [83]. Tay-Jyi Lin et al. [84] presented a unified processor architecture for the RISC-VLIW DSP processor. Moreover, Texas Instruments Inc. has announced its first VLIW family called C6x [85]. With the increasing demand for high performance DSP applications and the continuous development of integrated circuit technology, it can be predicted that one of the major trends in future DSP system architecture is a higher degree of merging of DSP core, MCU and coprocessor. Figure 7.3 presents the trend of merging the DSP core with MCU and coprocessor. A coprocessor is a type of dedicated processor used to support the operations of the primary processor. Some examples of operations supported by a coprocessor are floating point arithmetic, signal processing, string processing, graphics and encryption. By taking the computation-intensive tasks from the main processor, coprocessors can help to accelerate the system performance. H. Parandeh-Afshar et al. [86] presented a parallel merged multiplier-accumulator coprocessor optimized for digital filters. Therefore, it is also required to develop new architectures of the arithmetic unit for merged DSP processors in future DSP applications. The computation system presented in this section is developed for two purposes. Firstly, it is the full prototype in which the separate components (the proposed IP cores presented in previous chapters) are implemented as a whole. Secondly, it can be considered as a new idea for the arithmetic unit which includes both conventional and specific dedicated arithmetic components for the merged DSP core/MCU/coprocessor architecture.

7.3.1 Prototype Computation System Architecture

Future DSP applications require not only the basic computation components such as adders, multipliers and MAC (multiply-accumulator), but also some dedicated components and specific complex function generators, such as squaring, sine function and logarithm function generators for digital communication applications.



Figure 7.3: Merged DSP core/MCU/coprocessor architecture as a trend for future DSP system architecture.

The system performance can be improved significantly if these frequently used operations are available as dedicated components. For example, the computation of the logarithm function is reduced to only one clock cycle by using such a dedicated function module, instead of using many multiplication, addition and powering operations. Certainly, this approach trades off against an increase of the overall system complexity. However, with the improved methods for the above components presented in previous chapters, the system complexity can be reduced to an acceptable value. Figure 7.4 shows the general architecture of the prototype computation system, which includes two inputs (X and Y), the operation selecting signal (OP), the output (R) and eight building blocks (two blocks are the control and output selection units and the six other blocks are computing components). For the purpose of this first prototype of the proposed computation system, the input bit-width is chosen to be 16 bits and the result bit-width is 32 bits. The next section describes each component in this computation system in more detail. Together with the control and output selection units, six functional computation blocks are included in the computation system. The following part explains these blocks in detail.



Figure 7.4: Hardware architecture of the prototype computation system.

• Control unit and output selection unit: The control unit generates the control signals for the other components and synchronizes the operation of the whole computation system. The output selection unit has the role of selecting the appropriate output signal that corresponds to the desired operation. The operation selecting signal is 4 bits wide so that 16 operation types can be carried out. Table 7.1 lists all operations supported by the prototype computation system, in which the conventional ALU operations indicate the basic operations performed by the conventional ALU module as presented in Appendix B.

• Squaring, sine, logarithm and anti-logarithm function generators: The LUT-based squarer circuit, sine function generator and logarithm function generator,


Table 7.1: List of operations supported by the prototype computation system.

  OP            Operation
  0000          Logarithm
  0001          Sine
  0010          Squaring
  0011          Multiply
  0100          Anti-logarithm
  0101 - 0111   Reserved
  1000 - 1111   Conventional ALU operations

as presented in previous chapters, are also included in the proposed computation system.

• Conventional ALU and multiplier units: The conventional arithmetic and logic unit (ALU) is used to compute basic logic operations such as AND, OR, XOR and NOT, and arithmetic operations such as addition and subtraction. The detailed list of operations supported by this ALU unit is presented in Appendix B.
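As a purely behavioural illustration of Table 7.1, the Python sketch below models the operation dispatch of the prototype system. All operation bodies are software stand-ins for the dedicated LUT-based hardware blocks, and the fixed-point scalings are assumptions made only for this example.

```python
import math

def prototype_dispatch(op: int, x: int, y: int) -> int:
    """Behavioural stand-in for the OP decoding of Table 7.1."""
    if op == 0b0000:                       # Logarithm (12-bit fraction assumed)
        return int(math.log2(x) * 4096) if x > 0 else 0
    if op == 0b0001:                       # Sine (16-bit phase -> signed 16-bit)
        return int(math.sin(2.0 * math.pi * x / 65536.0) * 32767)
    if op == 0b0010:                       # Squaring
        return x * x
    if op == 0b0011:                       # Multiply
        return x * y
    if op == 0b0100:                       # Anti-logarithm
        return int(2.0 ** (x / 4096.0))
    if 0b1000 <= op <= 0b1111:             # Conventional ALU operations
        return (x + y) & 0xFFFFFFFF        # placeholder: only ADD is modelled
    raise ValueError("reserved opcode")

print(prototype_dispatch(0b0010, 255, 0))  # 65025
```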

7.3.2 System Implementation Results

The computation system is developed with the design flow depicted in Figure 7.2. It can be seen that the proposed approach for the dedicated computation system leads to higher speed computation with a small extra area (area overhead) for the dedicated components. The computation of complex operations (such as sine, logarithm and anti-logarithm) can be performed in a single cycle, which is far better than the conventional approach of using a Taylor approximation to compute the logarithm function as suggested by Texas Instruments (TI) in [87]. For example, the implementation in the TI TMS320C67X processor needs 134 clock cycles to complete a logarithm computation [88]. Moreover, it is shown in Figure 7.6 that the area for the proposed components is quite small compared


Figure 7.5: Layout of the prototype computation system (650 µm × 650 µm).

with the conventional multiplier block (MUL) and ALU. Therefore, the additional area for these dedicated computation components is not a critical issue. The whole computation system is implemented in 0.18-μm CMOS technology using a standard cell library. Figure 7.5 shows the layout of this computation system, which is planned to be fabricated in 0.18-μm CMOS technology. Table 7.2 presents the main parameters for the ASIC implementation of the prototype system and Table 7.3 lists the area consumed by each component, as also presented in Figure 7.6.

7.3.3 Verification and Measurement Method

Figure 7.7 presents the verification and measurement flow for the prototype computation system. The pattern generator using FPGA hardware generates the test vectors as the input for the circuit under test (the computation system) on the chip board. The logic analyzer is used to record the corresponding outputs.


Figure 7.6: Area breakdown of the prototype computation system (MUL 44%, SINE 13%, ALU 10%, LOG 10%, SQ 9%, ALOGC 9%, OUT 4%, CON 2%).

The output results are saved in csv files which are used as the input data for Matlab software to verify the computation results by comparing them with the pre-calculated expected results.

7.4 Design Examples Using Proposed IP Cores

This section provides some detailed design examples of DSP applications using the proposed IP cores. It can help designers and other researchers understand the design methodology clearly.

7.4.1 RGB to YCbCr Color Space Conversion Using Proposed Multiplier IP Core

7.4.1.1 Color Space Conversion and Its Hardware Architecture

A color space (or color model) is a mathematical model describing the way colors can be represented, created and specified. Currently, many color spaces are used


Table 7.2: Main parameters for the ASIC implementation of the prototype computation system.

  Technology               0.18-μm CMOS technology
  Supply voltage (VDD)     1.8 V
  Layout dimension         0.61 x 0.61 mm
  Power consumption (mW)   2.10
  Throughput               1 cycle/operation
  Maximum latency          2 cycles

Table 7.3: Area consumption by components in the prototype computation system (×10^3 μm^2).

  Multiplier (MUL)            21.70
  ALU                         4.76
  Squarer (SQ)                4.23
  Sine generator (SINE)       6.35
  Logarithm generator (LOG)   4.76
  Anti-logarithm (ALOGC)      4.68
  Control Unit (CTRL)         1.06
  Output Select Unit (OUT)    2.12

for a wide range of image, video and multimedia applications [89]. Among these color spaces, RGB is used widely in computer graphics, in which each color is represented as a combination of three primary color components (Red, Green and Blue). However, processing an image in the RGB representation is not the most efficient method in real-time image/video applications because of the high correlation between the color components [90]. For example, the intensity of all three color components has to be adjusted when we want to change the intensity of a pixel. If each pixel is represented in the form of intensity and color separately, the processing speed can be improved because the color components are uncorrelated. Therefore,



Figure 7.7: Verification and measurement flow for the prototype computation system.

many video, digital TV and imaging standards use independent luminance and color difference signals. In ITU-R digital TV standards such as BT.601 and BT.709, the YCbCr color space is used, in which each pixel is represented by a luminance component (Y) and two chrominance components (Cb and Cr). Due to the mathematical properties of the YCbCr color space representation, the component Y has a range of 16 to 235, while the Cb and Cr components are limited to the range of 16 to 240. Moreover, many image processing algorithms require the conversion from the RGB color space to YCbCr (or the YUV color space, of which YCbCr is a scaled and offset version) to speed up some essential processing steps, such as JPEG compression [91] and digital image watermarking [92]. For the above reasons, it is required to design an efficient RGB to YCbCr color space converter (CSC) for image and video applications. In this section, we present an improved conversion design which employs the proposed LUT-based KCM.


According to the ITU-R BT.601 digital TV standard, we have the basic equations of the RGB to YCbCr conversion described as:

  Y  = 16  + 0.257R + 0.504G + 0.098B
  Cb = 128 − 0.148R − 0.291G + 0.439B        (7.1)
  Cr = 128 + 0.439R − 0.368G − 0.071B

To be implemented in digital hardware using an 8-bit binary representation for the inputs and coefficients, the above equations can be approximated as follows:

  Y  = 16  + (66R + 129G + 25B)/256
  Cb = 128 + (112B − 38R − 74G)/256          (7.2)
  Cr = 128 + (112R − 94G − 18B)/256

Exploiting this mathematical expression, the RGB to YCbCr color space conversion can be implemented with constant multipliers and adders/subtractors. In [93], the authors presented a method to implement this conversion using conventional constant multipliers and adders/subtractors, which leads to high hardware complexity and computation delay. Moreover, the output color components are not required to have the full bit width but often have the same bit width as the input color components. Therefore, in this work, to reduce the hardware area and circuit delay, a CSC is developed using the LUT-based KCM core presented in Chapter 3 and some multi-input adders, as depicted in Figure 7.8. The area and delay improvement comes at the cost of some computation error due to the use of truncated multipliers. The LUT-based multipliers are optimized with the previously mentioned algorithm so that the conversion error is minimized.
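For reference, a plain Python model of the integer conversion in (7.2) is given below; it uses exact 8-bit constant multiplications and an arithmetic shift for the division by 256, whereas the proposed CSC replaces each constant product with the LUT-based truncated KCM.

```python
def rgb_to_ycbcr_bt601(r: int, g: int, b: int):
    """Integer RGB -> YCbCr conversion following (7.2) (ITU-R BT.601, 8 bit)."""
    y  =  16 + ((66 * r + 129 * g +  25 * b) >> 8)
    cb = 128 + ((112 * b -  38 * r -  74 * g) >> 8)
    cr = 128 + ((112 * r -  94 * g -  18 * b) >> 8)
    return y, cb, cr

print(rgb_to_ycbcr_bt601(255, 0, 0))    # pure red -> roughly (81, 90, 239)
```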

7.4.1.2 FPGA Implementation and Verification Results

The proposed CSC and comparable reference architectures are implemented on the Altera Cyclone II FPGA with the same design parameters and constraints. Table 7.4 presents the comparison of the proposed design with the others in terms of implementation results and average relative error. Again, ADP is used as the figure of merit for comparing the different methods. The table shows that our proposed CSC leads to a significant reduction of both area and delay, resulting in a large reduction of ADP. Compared with the results in [93], the proposed CSC reduces the ADP by 48% due to the use of the proposed efficient LUT-based truncated multiplier.



Figure 7.8: Proposed RGB to YCbCr color space conversion using LUT-based constant multipliers.

Moreover, the average relative error of the proposed design is also reduced significantly compared with the conventional methods. To further clarify the improvement of the proposed method and perform more error analysis of the overall CSC design, the FPGA implementation of the proposed conversion architecture was also verified with the Altera DSP design flow using the DSP Builder tool together with the Matlab-Simulink environment, as shown in Figure 7.9, in which the hardware co-simulation is carried out by the Hardware In the Loop (HIL) block of the Altera DSP Builder Blockset Library. For each sample image, both a conventional simulation with Matlab software only and a hardware/software co-simulation with the FPGA board are performed. The converted images of the two methods are then compared to obtain the error analysis results. Table 7.5 presents the verification results using Matlab simulation and Altera hardware/software co-simulation, together with the maximum and average relative error for each color component. In this table, the relative error (Erel) for each color component in the YCbCr space is calculated as the difference


Table 7.4: Comparison of FPGA implementation results of the color space converter.

  Method                            Area (no. of LEs)   Delay (ns)   ADP    Average Erel (%)
  Using shift-add KCM               226                 20.6         4656   1.62
  Using Altera Megacore-based KCM   224                 17.6         3942   1.45
  Proposed in [93]                  370                 17.5         6475   1.32
  Using proposed LUT-based KCM      206                 16.4         3378   1.26

between its values generated by the FPGA hardware and by the Matlab software, respectively. It can be seen that the proposed CSC results in only a small amount of error, which is caused by the truncation operations in the proposed multipliers and the different data representation methods used in the Matlab software and the FPGA hardware implementation. Moreover, the proposed CSC is synthesized and implemented in 0.18-μm CMOS technology using a standard cell library, as shown in Table 7.6.

7.4.2 2-D Convolver for Image Processing

7.4.2.1 Image Convolver

Mathematically, convolution is an operation on two functions h and g that produces another function which can be considered as a modified version of one of the original functions. In other words, convolution is a way of multiplying together two arrays of numbers of different sizes to produce a third array of numbers. The convolution operation is used widely in many image and signal processing applications, probability, statistics, computer vision and electrical engineering. In image processing, convolution is used to implement operators whose output pixel values are simple linear combinations of certain input pixels of the input image. Convolution belongs to a class of algorithms called spatial filters. Spatial filters use a wide variety of masks (kernels) to calculate different results, depending on the desired function. 2-D convolution is the most important one in modern image processing

and is used popularly in image processing for many applications such as image filtering, edge detection and image recognition [94]. The basic idea is to scan a window of some finite size over an image. The output pixel value is the weighted sum of the input pixels within the window, where the weights are the values of the filter assigned to every pixel of the window. The window with its weights is called the convolution mask (or kernel matrix) and the hardware block for the convolution computation is called the convolver. Figure 7.10 presents the operation of a 2-D convolver with a given kernel matrix K. To compute the convolution of each pixel of the input image, the window around this pixel is multiplied with the kernel matrix by a per-element multiplication. Then, the accumulative addition is performed to provide the output pixels. Moreover, the 2-D convolution can be expressed as a linear combination or sum of products of the mask coefficients with the input function. Particularly, the mathematical operation of a 2-D convolver can be expressed as:

  PO(x, y) = Σ_{i=0}^{M−1} Σ_{j=0}^{N−1} PI(x + i, y + j) × K(i, j)    (7.3)

where PI(x, y) and PO(x, y) denote the input and output pixels, while K(i, j) is the coefficient value of the kernel matrix with the size of M × N. In this design example, the popular value of M = N = 3 is chosen for the kernel matrix size.

7.4.2.2 Convolver Hardware Architecture and Implementation Results

Figure 7.11 shows the hardware architecture of a typical image convolver, in which the kernel matrix multiply module is the most important part. Firstly, the input image is taken by a window with the same size as the kernel matrix. Then, this window for each input pixel is multiplied with the kernel matrix and an adder tree is used to provide the output pixel value via an output buffer. For the purpose of a design example, the kernel matrix weights are chosen to perform a high-pass filter as follows:

      ⎡ −1  −2  −1 ⎤
  K = ⎢ −2  13  −2 ⎥
      ⎣ −1  −2  −1 ⎦
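A software reference for (7.3) with this 3×3 kernel is sketched below in Python; in the hardware convolver each constant product is instead mapped onto the proposed KCM core and summed by the adder tree.

```python
import numpy as np

K = np.array([[-1, -2, -1],
              [-2, 13, -2],
              [-1, -2, -1]])                    # high-pass kernel from above

def convolve_3x3(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Direct evaluation of (7.3) for M = N = 3 (borders are not padded)."""
    h, w = image.shape
    out = np.zeros((h - 2, w - 2), dtype=np.int32)
    for y in range(h - 2):
        for x in range(w - 2):
            window = image[y:y + 3, x:x + 3].astype(np.int32)
            out[y, x] = int(np.sum(window * kernel))   # weighted sum
    return out

img = np.random.randint(0, 256, (8, 8), dtype=np.uint8)
print(convolve_3x3(img, K).shape)               # (6, 6)
```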


Using the design flow in Figure 7.2, the implementations on both FPGA and ASIC platforms were completed. With the standard 8-bit pixel format, the VHDL code for the constant multipliers based on the proposed KCM core is generated automatically by a Matlab program. After that, the other components in the image convolver are developed. The resulting VHDL code of the entire convolver is then synthesized and implemented on both FPGA and ASIC hardware platforms. Table 7.7 presents the FPGA implementation results of the proposed convolver compared with the design in [95] for the same case of a 3×3 convolver. It can be seen that the proposed convolver achieves higher performance and area efficiency due to the use of the optimized KCM core. The verification method for the FPGA implementation is similar to that of the color space converter, as shown in Figure 7.9, and the results are presented in Table 7.8, in which the input sample images are in grayscale format. Moreover, Table 7.9 summarizes the ASIC implementation results in 0.18-μm CMOS technology using a standard cell library.

7.4.3 Speech Feature Extraction Using Proposed Logarithm Generator

7.4.3.1 Speech Recognition

In computer science, speech recognition (SR) is the translation of spoken words into text or other data formats for other applications [96]. Speech recognition is employed in many application fields such as health care systems, fighter aircraft, helicopters, air traffic control, etc. Algorithms and software implementations of SR have been deeply explored and proposed in many research papers and presented in many real systems. In speech recognition, and speech processing in general, feature extraction is one of the essential computation steps. This section presents the implementation of the logarithm computation block for the feature extraction part in a typical SR system. As presented in Figure 7.12, a general speech recognition system includes three main components. The sampled input speech signal is fed to the feature extraction block to extract the feature information of the input speech. Then, this feature information is compared with the stored reference data by using a similarity measurement circuit. Based on the

results of this measurement, the decision logic circuit generates the recognized results (e.g. text data).

7.4.3.2 Speech Feature Extraction

Mel frequency cepstral coefficients (MFCC) are an efficient method for speech feature extraction for both speech and speaker recognition applications [97]. Figure 7.13 presents the hardware architecture of an MFCC feature extraction module. Firstly, the speech sample is windowed (for example, by a Hamming window). Then, the Fast Fourier Transform (FFT) is performed to convert the speech signal from the time domain to the frequency domain. After the Mel-scaled filter banks, the logarithm computation is carried out as presented in Equation 7.4. Finally, the Discrete Cosine Transform (DCT) is performed to provide the Mel cepstral coefficient results.

FMel = 781 × log2(1 + fnorm) (7.4)

where FMel indicates the MFCC feature result and fnorm is the normalized frequency value obtained from the FFT module. The normalized frequency can be computed from the frequency output f of the FFT computation as:

fnorm = f/700 (7.5)

This frequency normalization can be performed by a constant division (by 700) which is integrated with the Mel-scale filter banks.

7.4.3.3 Hardware Architecture and Implementation Results

Figure 7.14 shows the proposed hardware architecture for the logarithm function computation block in the MFCC speech feature extraction. Assume that the speech sample has a bit-width of 16 bits with a 12-bit fraction (in fixed-point representation). The Zero Magnitude Detection (ZMD) circuit is used to detect the case of a zero-magnitude input (fnorm < 1) and directly forward fnorm to the log2(1 + x) block. The scaler circuit has the role of detecting the most significant one bit and providing the fraction value to the log2(1 + x) block in the other case of fnorm ≥ 1.


Finally, the KCM circuit performs the constant multiplication by 781 to obtain the MFCC feature result. Tables 7.10 and 7.11 summarize the implementation of the logarithm computation block for MFCC feature extraction on FPGA and ASIC hardware platforms. The comparison with the method of using an Altera Mega-Core for this block is also presented in Table 7.10. It can be seen that by using the proposed logarithm computation IP core, the performance and area efficiency of this computation block can be improved. Moreover, the verification result of this FPGA implementation is depicted in Figure 7.15, which plots the computation results obtained by the hardware implementation. The mean and maximum errors are the same as the results in Table 6.3 of Chapter 6.
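The behaviour of this block can be sketched in Python as follows. The way the ZMD and scaler paths are combined around (1 + fnorm) is this sketch's reading of the description above, the 12-bit fraction format follows the assumption stated earlier, and log2(1 + x) is evaluated exactly, standing in for the proposed LUT/shift-add block.

```python
import math

def mel_feature(f_norm_fixed: int, frac_bits: int = 12) -> float:
    """Model of Figure 7.14: F_Mel = 781 * log2(1 + f_norm), see (7.4)."""
    one = 1 << frac_bits
    v = one + f_norm_fixed                        # 1 + f_norm in fixed point
    if f_norm_fixed < one:                        # ZMD path: n = 0, x = f_norm
        n, x = 0, f_norm_fixed / one
    else:                                         # scaler: leading-one detection
        n = v.bit_length() - 1 - frac_bits
        x = v / (1 << (n + frac_bits)) - 1.0
    return 781.0 * (n + math.log2(1.0 + x))       # KCM: constant multiply by 781

f = 4000.0                                        # input frequency in Hz
print(mel_feature(int(f / 700.0 * 4096)))         # about 2146 for f = 4 kHz
```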

7.5 Chapter Conclusions

In this chapter, the applications and design method for DSP applications have been presented together with some detailed design examples. Moreover, a computation system based on the proposed LUT-based IP cores has been developed for prototyping purposes. By showing the design flow and detailed design examples, this chapter can help other researchers and designers understand the method and advantages of using the proposed IP cores.



Figure 7.9: Verification flow with DSP builder using Matlab simulation and HIL.


Table 7.5: Verification results using the Altera DSP flow by hardware co-simulation with a Cyclone II FPGA in the Matlab-Simulink environment (for each sample image, the RGB input image and the YCbCr output images from the Matlab function and from the proposed method are shown).

  Sample image        Lena                          Pepper
  Maximum Erel (%)    Y: 0.85  Cb: 1.01  Cr: 0.77   Y: 0.95  Cb: 1.09  Cr: 0.97
  Average Erel (%)    Y: 0.13  Cb: 0.52  Cr: 0.45   Y: 0.15  Cb: 0.51  Cr: 0.48


Table 7.6: ASIC implementation results of the proposed RGB to YCbCr color space converter using a standard cell library with 0.18-μm CMOS technology.

  Area (×10^3 μm^2)   7.42
  Delay (ns)          6.6
  Power (mW)          1.74

Figure 7.10: Operation of a 2-D convolver for image processing.


Figure 7.11: Hardware architecture of a 2-D convolver for image processing.


Table 7.7: FPGA implementation results of the 2-D convolver with an input/output resolution of 8 bits.

  Method            Area (number of LEs)   Delay (ns)
  Method in [95]    322                    32.4
  Proposed method   265                    18.9

Table 7.8: Verification result of 2-D convolver.

  Original image | Convolved image (two grayscale sample images and their convolution outputs)


Table 7.9: ASIC implementation results of the 2-D convolver in 0.18-μm CMOS technology using a standard cell library.

  Input/output resolution   8 bit
  Area (×10^3 μm^2)         9.35
  Delay (ns)                8.6
  Power (mW)                1.84


Figure 7.12: Block diagram of speech recognition system.


Figure 7.13: Feature extraction hardware architecture for the speech recognition system.



Figure 7.14: Hardware architecture of the logarithm computation block for MFCC feature extraction in the speech recognition application.

Table 7.10: FPGA implementation results of the logarithm computation block in MFCC feature extraction with an input/output resolution of 16 bits.

  Method                   Area (number of LEs)   Delay (ns)
  Using Altera Mega-Core   443                    28.4
  Proposed method          340                    22.9

Table 7.11: ASIC implementation results of the logarithm computation block in MFCC feature extraction in 0.18-μm CMOS technology using a standard cell library.

  Input/output resolution   16 bit
  Area (×10^3 μm^2)         15.47
  Delay (ns)                11.3
  Power (mW)                2.86



Figure 7.15: Verification results of the logarithm computation block for MFCC feature extraction in the speech recognition application.

Chapter 8

Conclusions and Future Work

This chapter summarizes the main results of this work and describes some open topics for future research. Moreover, the conclusion of the work is included.

8.1 Conclusions

In this dissertation, the fundamentals of LUT-based computation and DSP systems were reviewed and discussed in Chapter 2. After that, four efficient IP cores based on the LUT architecture were proposed. Chapter 3 presented an efficient LUT-based truncated multiplier for the constant multiplication which is required in many DSP applications. An improved hybrid LUT-based architecture for the design of a low error, efficient fixed-width squarer was described in Chapter 4. In Chapter 5, an improved linear difference approximation combined with an LUT-based architecture was proposed for the sine function computation in the Direct Digital Frequency Synthesizer. Chapter 6 presented a novel quasi-symmetrical approach for logarithm and anti-logarithm function computation which can be used in the domain converters of hybrid number system processors and as function generators in many DSP applications. Then, in Chapter 7, the applications and design method for DSP applications using the proposed IP cores were presented together with some detailed design examples. Moreover, a computation system based on the proposed LUT-based IP cores was developed for the prototype purpose.


According to the architecture proposals, numerical analyses, implementation and measurement results presented in this dissertation, the following concluding remarks can be drawn:

• The approach of combining LUT-based circuits with simple conventional logic circuits for arithmetic components and units in DSP applications is very promising for future DSP applications, which require not only high performance but also low-area and low-power computation modules.

• Moreover, by using the proposed parameter and LUT content optimization algorithms, the computation performance and area efficiency can be further improved.

• The proposed methods presented in this work will be useful for system designers and developers to choose a suitable method for each specific DSP application.

8.2 Future Work

Referring to the above concluding remarks, there are several areas of this work which could be extended for future research, as follows:

• Modern and future DSP systems require many high-performance and low-area computation components for specific applications. Therefore, efficient architectures for other arithmetic operations such as square root, division and sigmoid functions will be considered.

• More efficient architectures for the LUT-based computation components and systems presented in this dissertation, as well as their applications in both DSP and general computation systems, should be developed, especially for LUT-based computation components with high bit-width operands.

• The new LUT and memory technologies can be applied together with the architectural and parameter optimization methods so that higher performance and area efficiency can be achieved.


[Block diagram blocks: Instruction Memory, Instruction Fetch, Instruction Decode, Interconnection Network, Register File, LSU, Data Memory, SFU, MUL and ALU.]

Figure 8.1: TTA processor architecture.

• The proposed LUT-based architectures should be combined with design techniques for low-power DSP applications. In particular, the future research plan also targets an ultra-low-power unified arithmetic unit for future DSP processors, due to the increasing demand for low-power electronic devices and multimedia applications.

• An improved speech recognition application using the proposed logarithm generator and an efficient processing method, as mentioned in Section 7.4.3, will be considered.

• The future research will also target the advanced Transport Triggered Architecture (TTA) processor for speech processing, as shown in Figure 8.1, in which the special functional unit (SFU) can be implemented by one of the proposed IP cores for the specific application [98] and LSU indicates the load/store unit. The philosophy behind this architecture is that the instructions define the transport between functional units. Moreover, it can be customized for a given application by extending the instruction set with user-defined application-specific operations that are implemented in hardware as SFUs.


Appendix A

Sine/Cosine Computation Using Pipelined CORDIC

This appendix presents the pipelined CORDIC-based sine/cosine computation targeted at the Direct Digital Frequency Synthesizer (DDFS).

A.1 CORDIC Algorithm

Coordinate Rotation Digital Computer (CORDIC), first described in 1959 by Jack E. Volder, is a simple and efficient algorithm to calculate trigonometric and hyperbolic functions, multiplications, divisions and data type conversions [99]. One of its applications is to compute the sine and cosine values in a DDFS, called a CORDIC-based DDFS [100]. Two basic CORDIC modes have been developed for the computation of different functions: the rotation mode and the vectoring mode. For the purpose of sine/cosine computation, this appendix only deals with the rotation mode of the CORDIC algorithm. However, in both modes, the CORDIC algorithm performs a planar rotation. Graphically, a planar rotation means transforming a vector (X0, Y0) into a new vector (Xi, Yi), as depicted in Figure A.1. This transform can be expressed by the following equations:

Xi = X0 cos θ − Y0 sin θ   (A.1)

Yi = Y0 cos θ + X0 sin θ   (A.2)

Then, these equations can be rewritten as:

Xi = cos θ (X0 − Y0 tan θ)   (A.3)

Yi = cos θ (Y0 + X0 tan θ)   (A.4)

The CORDIC algorithm consists of a series of rotations starting from the vector (X0, Y0). After n iterations, the vector (Xn, Yn) is obtained as in (A.3) and (A.4), with the index changed to n. In each iteration, the vector is rotated by a micro-angle θi, called a micro-rotation. The new vector is computed by equations similar to (A.3) and (A.4), in which di = ±1 indicates the direction of the i-th rotation step and the value of tan θi is restricted to 2^−i so that each multiplication becomes an arithmetic right shift:

Xi+1 = cos θi (Xi − di Yi 2^−i)   (A.5)

Yi+1 = cos θi (Yi + di Xi 2^−i)   (A.6)

or, in another form, with Ki = cos θi = cos(arctan 2^−i):

Xi+1 = Ki (Xi − di Yi 2^−i)   (A.7)

Yi+1 = Ki (Yi + di Xi 2^−i)   (A.8)

However, the multiplication by Ki is avoided by treating it as a gain factor common to all rotation steps. After n iterations, the overall scale factor is K = K0·K1·…·Kn, and when n → ∞, K = 0.607252935.... Therefore, the multiplication by the constant K needs to be performed only once, at the end of all iterations.


Figure A.1: Vector rotation in CORDIC algorithm.

In the application of computing the sine and cosine functions for a DDFS, to avoid this multiplication while still maintaining sufficient accuracy, the starting vector (X0, Y0) is initialized as (|K|, 0). Each iteration performs the computation in (A.5) and (A.6). In order to decide whether di = 1 or di = −1, a new variable is defined as:

zi+1 = zi − di arctan(2^−i)   (A.9)

The value of z0 is initialized with the angle θ for which the sine and cosine functions are to be computed. Then, di = 1 when zi > 0, and di = −1 otherwise. In summary, with the above rotation steps, if the starting vector (X0, Y0) (the input of the CORDIC module) is initialized as (|K|, 0), then after n iterations the outputs of this module are given, with n-bit precision, as Xn = cos θ and Yn = sin θ.
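As an informal illustration of these rotation steps, the following C sketch models the rotation-mode iterations (A.5), (A.6) and (A.9) in floating point. The number of iterations, the on-the-fly computation of the gain K and the function names are assumptions of this sketch; the actual design is described in VHDL as noted in Section A.3.

#include <math.h>
#include <stdio.h>

#define N_ITER 16   /* number of micro-rotations (assumed) */

/* Rotation-mode CORDIC: compute cos(theta) and sin(theta) for theta in [0, pi/2]. */
static void cordic_sincos(double theta, double *cos_out, double *sin_out)
{
    /* Gain K = product of cos(arctan 2^-i) ~ 0.607252935; starting from
     * (K, 0) removes the final multiplication by K, as explained above. */
    double K = 1.0;
    for (int i = 0; i < N_ITER; i++)
        K *= cos(atan(ldexp(1.0, -i)));

    double x = K, y = 0.0, z = theta;
    for (int i = 0; i < N_ITER; i++) {
        double d  = (z >= 0.0) ? 1.0 : -1.0;     /* rotation direction d_i      */
        double xn = x - d * y * ldexp(1.0, -i);  /* (A.5); a shift in hardware  */
        double yn = y + d * x * ldexp(1.0, -i);  /* (A.6)                       */
        z -= d * atan(ldexp(1.0, -i));           /* (A.9)                       */
        x = xn;
        y = yn;
    }
    *cos_out = x;   /* Xn ~ cos(theta) */
    *sin_out = y;   /* Yn ~ sin(theta) */
}

int main(void)
{
    double c, s;
    cordic_sincos(0.5235987756 /* pi/6 */, &c, &s);
    printf("cos = %.6f, sin = %.6f\n", c, s);   /* ~0.866025, ~0.500000 */
    return 0;
}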

A.2 Pipelined CORDIC Architecture

As mentioned previously, the CORDIC algorithm involves n iterations to produce n-bit precision results. Therefore, the typical CORDIC implementation is an iterative architecture which has only one stage to perform all micro-rotations. In the iterative architecture, an additional control logic module and registers are also required to control the one-stage CORDIC module at each iteration step. This architecture is hardware efficient but has the disadvantage of a high computation delay, because it requires more than one clock cycle for each output value. To improve the speed of the CORDIC computation and obtain a good trade-off between performance and hardware complexity, the pipelined CORDIC architecture has been proposed by dividing the CORDIC computation module into a number of stages, where each stage performs one iteration. Figure A.2 shows the structure of the i-th stage in this architecture. X0, Y0 and Z0 are initialized with the values mentioned in the previous section. In a CORDIC-based DDFS, the CORDIC module is used as the phase-to-amplitude converter, as depicted in Figure A.3. Since the CORDIC algorithm has the restricted convergence region of [0, π/2], the two most significant bits (MSB1 and MSB2) of the phase accumulator output are used as control bits to generate the full sine wave. Therefore, a CORDIC-based DDFS has two complementors, as shown in Figure A.3, in which the CORDIC module is pipelined as described above.
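As an informal illustration of this quadrant handling, the C sketch below folds a phase-accumulator word into the convergence region [0, π/2] and flags when the sine output must be negated. The phase width, the complementing scheme and the identifier names are assumptions of this sketch; only the use of MSB1 and MSB2 as control bits comes from the description above.

#include <stdint.h>

#define PHASE_BITS 12                               /* assumed accumulator width */
#define LOWER_MASK ((1u << (PHASE_BITS - 2)) - 1u)  /* bits below MSB1 and MSB2  */

/* Fold a full-circle phase word into [0, pi/2] (1st complementor) and report
 * whether the sine amplitude must be negated afterwards (2nd complementor). */
uint32_t fold_phase(uint32_t phase, int *negate_out)
{
    int msb1 = (phase >> (PHASE_BITS - 1)) & 1;   /* selects the sine half-period    */
    int msb2 = (phase >> (PHASE_BITS - 2)) & 1;   /* selects the quadrant within it  */
    uint32_t lower = phase & LOWER_MASK;

    /* 2nd and 4th quadrants: sin(pi - x) = sin(x), so mirror the phase. */
    uint32_t folded = msb2 ? (LOWER_MASK - lower) : lower;

    /* 3rd and 4th quadrants: sin(pi + x) = -sin(x), so negate the sine output. */
    *negate_out = msb1;
    return folded;
}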

A.3 Implementation Results

The DDFS design was done with the Xilinx ISE 10.1 software suite. All modules were written in VHDL and simulated as components in the ModelSim XE III 6.3c software. The simulation results show that this DDFS can produce the sine and cosine waveforms with high frequency accuracy and fast settling time. The full design was also targeted to a Xilinx Spartan-3E xc3s500e-5fg320 FPGA device. Table A.1 shows the synthesis and implementation results. Owing to the pipelining technique, the speed of the DDFS can be improved, with a maximum clock frequency of 183 MHz in the Xilinx xc3s500e FPGA. In CORDIC-based systems, high-accuracy waveforms can be produced without large sine LUTs. The pipelining technique is employed for both the phase accumulator and the CORDIC module to improve the speed of the DDFS by generating a sample output in each clock cycle.


Figure A.2: A stage in the pipelined CORDIC architecture.

[Block diagram: Phase Accumulator → 1st Complementor → pipelined CORDIC → 2nd Complementor → to DAC, with MSB1 and MSB2 used as control bits.]

Figure A.3: Block diagram of the CORDIC-based DDFS.

Table A.1: Implementation results for the pipelined CORDIC-based DDFS.

Number of slices             305
Number of slice flip-flops   513
Number of 4-input LUTs       556
Maximum clock frequency      183 MHz


Appendix B

Operation List of the Conventional ALU Component

In this appendix, the list of all operations supported by the conventional ALU component in the prototype computation system (Chapter 7) is presented in Table B.1.

Table B.1: List of operations supported by the conventional ALU component.

OP     Operation
000    AND
001    OR
010    XOR
011    NOT
100    Add
101    Subtract
110    Left shift
111    Right shift
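As an informal illustration, a bit-accurate C model of this operation table might look as follows; the 16-bit data width, the single-bit shift amount and the treatment of NOT as operating on the first operand are assumptions of this sketch rather than details of the actual ALU component.

#include <stdint.h>

/* Informal C model of the op-code decoding in Table B.1. */
uint16_t alu(uint8_t op, uint16_t a, uint16_t b)
{
    switch (op & 0x7) {
    case 0x0: return a & b;          /* 000: AND         */
    case 0x1: return a | b;          /* 001: OR          */
    case 0x2: return a ^ b;          /* 010: XOR         */
    case 0x3: return (uint16_t)~a;   /* 011: NOT         */
    case 0x4: return a + b;          /* 100: Add         */
    case 0x5: return a - b;          /* 101: Subtract    */
    case 0x6: return a << 1;         /* 110: Left shift  */
    default:  return a >> 1;         /* 111: Right shift */
    }
}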

Appendix C

List of Publications

C.1 Journal Papers

[1] Van-Phuc Hoang and Cong-Kha Pham, “An Improved Linear Difference Method with High ROM Compression Ratio in Direct Digital Frequency Synthesizer,” IEICE Trans. Fundamentals of Electronics, Communications and Computer Sciences, vol. E94.A, no. 3, pp. 995-998, Mar. 2011.

[2] Van-Phuc Hoang and Cong-Kha Pham, “Efficient LUT-Based Truncated Multiplier and Its Application in RGB to YCbCr Color Space Conversion,” IEICE Trans. Fundamentals of Electronics, Communications and Computer Sciences, vol. E95.A, no. 6, pp. 999-1006, Jun. 2012.

[3] Van-Phuc Hoang and Cong-Kha Pham, “An Improved Hybrid LUT-Based Architecture for Low-Error and Efficient Fixed-Width Squarer,” IEICE Trans. Fundamentals of Electronics, Communications and Computer Sciences, vol. E95.A, no. 7, pp. 1180-1184, Jul. 2012.


C.2 Book Chapter

[1] Van-Phuc Hoang and Cong-Kha Pham, “Low Error, Efficient Fixed Width Squarer Using Hybrid LUT-based Architecture,” in Advances in Electrical Engineering and Electrical Machines, edited by Dehuai Zeng, Lecture Notes in Electrical Engineering (LNEE), vol. 134, pp. 223-230, Springer, ISSN 1876-1100, 2011.

C.3 International Conference Presentations

[1] Van-Phuc Hoang, Thi-Tam Hoang and Cong-Kha Pham, “FPGA Implementation of a Direct Digital Synthesizer Using Pipelined CORDIC-Based Approach,” Proc. Triangle Symposium on Advanced ICT 2009 (TriSAI2009), pp. 105-108, Oct. 2009.

[2] Van-Phuc Hoang and Cong-Kha Pham, “Improved Linear Difference Method for Sine ROM Compression in Direct Digital Frequency Synthesizer,” Proc. The First Solid-State Systems Symposium-VLSI & Related Technologies (4S-2010), pp. 192-195, Jun. 2010.

[3] Van-Phuc Hoang and Cong-Kha Pham, “Efficient LUT-Based Multiplier and Squarer for DSP Applications,” Proc. IEICE International Conference on Integrated Circuits and Devices in Vietnam (ICDV 2011), pp. 148-153, Aug. 2011.

[4] Van-Phuc Hoang and Cong-Kha Pham, “Low Error, Efficient Fixed Width Squarer Using Hybrid LUT-based Architecture,” Proc. Springer 2011 International Conference of Electrical and Electronics Engineering (ICEEE2011), pp. 223-230, Dec. 2011.

[5] Van-Phuc Hoang and Cong-Kha Pham, “Low-Area, High-Speed Logarithmic and Anti-logarithmic Converters for Digital Signal Processors Based on Hybrid Number System,” IEEE Symposium on Low-Power and High-Speed Chips (COOL Chips XV), Poster no. 8, Apr. 2012.


[6] Van-Phuc Hoang and Cong-Kha Pham, “Novel Quasi-Symmetrical Approach for Efficient Logarithmic and Anti-logarithmic Converters,” Proc. IEEE/VDE 8th Conference on Ph.D. Research in Microelectronics & Electronics (PRIME 2012), pp. 111-114, Jun. 2012.

[7] Van-Phuc Hoang and Cong-Kha Pham, “Low-Error and Efficient Fixed-Width Squarer for Digital Signal Processing Applications,” Proc. IEEE 4th International Conference on Communications and Electronics (ICCE2012), pp. 477-482, Aug. 2012.

C.4 Technical Report

[1] Van-Phuc Hoang and Cong-Kha Pham, “Design of a low error LUT-based truncated multiplier,” IEICE Tech. Rep., vol. 110, no. 344 (ICD2010-126), pp. 159-162, Dec. 2010.

Bibliography

[1] Global Digital Signal Processors (DSP) Market by Intellectual Property (IP), Design Architecture & Applications (2011 - 2016). [Online]. Available: http://www.marketsandmarkets.com.

[2] K. Yosida, T. Sakamoto, and T. Hase, “A 3D graphics library for 32-bit for embedded systems,” IEEE Transactions on Consumer Electronics, vol. 44, no. 4, pp. 1107-1114, Aug. 1998.

[3] Byeong-Gyu Nam, Hyejung Kim, and Hoi-Jun Yoo, “Power and Area-Efficient Unified Computation of Vector and Elementary Functions for Handheld 3D Graphics Systems,” IEEE Transactions on Computers, vol. 57, no. 4, pp. 490-504, Apr. 2008.

[4] International Technology Roadmap for Semiconductors. [Online]. Available: http://public.itrs.net/

[5] Kiyoo Itoh, “Embedded Memories: Progress and a Look into the Future,” IEEE Design and Test of Computers, vol. 28, no. 1, pp. 10-13, Jan.-Feb. 2011.

[6] S. Matsunaga, J. Hayakawa, S. Ikeda, K. Miura, T. Endoh, H. Ohno, T. Hanyu, “MTJ-based nonvolatile logic-in-memory circuit, future prospects and issues,” Proc. Design, Automation & Test in Europe Conference & Exhibition (DATE ’09), pp. 433-435, Apr. 2009.

[7] Hojung Kim, Sanghun Jeon, Myoung-Jae Lee, Jaechul Park, Sangbeom Kang, Hyun-Sik Choi, Churoo Park, Hong-Sun Hwang, Changjung Kim, Jaikwang Shin, U-In Chung, “Three-Dimensional Integration Approach to High-Density Memory Devices,” IEEE Transactions on Electron Devices, vol. 58, no. 11, pp. 3820-3828, Nov. 2011.

[8] Bipul C. Paul, Shinobu Fujita, and Masaki Okajima, “ROM-Based Logic (RBL) Design: A Low-Power 16 Bit Multiplier,” IEEE Journal of Solid-State Circuits, vol. 44, no. 11, pp. 2935-2942, Nov. 2009.

[9] Pascal Meinerzhagen, S. M. Yasser Sherazi, Andreas Burg, and Joachim Neves Rodrigues, “Benchmarking of Standard-Cell Based Memories in the Sub-VT Domain in 65-nm CMOS Technology,” IEEE Transactions on Emerging and Selected Topics in Circuits and Systems, vol. 1, no. 2, pp. 173-182, Jun. 2011.

[10] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing, Prentice-Hall, Englewood Cliffs, NJ, USA, 1989.

[11] Intel Corp., “MMX technology architecture overview,” Intel Technology Journal, Sep. 1997.

[12] M. Denman, P. Anderson, M. Snyder, “Design of the PowerPC 604e microprocessor,” Proc. 41st IEEE International Computer Conference (COMPCON ’96), pp. 126-131, Feb. 1996.

[13] Linda Kaouane, Mohamed Akil, Thierry Grandpierre and Yves Sorel, “A Methodology to Implement Real-Time Applications onto Reconfigurable Circuits,” The Journal of Supercomputing, Kluwer Academic Publishers, vol. 30, no. 3, pp. 283-301, 2004.

[14] A. Kalavade, E. A. Lee, “A hardware-software codesign methodology for DSP applications,” IEEE Design & Test of Computers, vol. 10, no. 3, pp. 16-28, Sept. 1993.

[15] M. Brogioli, P. Radosavljevic, J. R. Cavallaro, “A General Hardware/Software Co-design Methodology for Embedded Signal Processing and Multimedia Workloads,” Proc. Fortieth Asilomar Conference on Signals, Systems and Computers (ACSSC ’06), pp. 1486-1490, Nov. 2006.


[16] Daniel N. Leeson and Donald L. Dimitry, Basic Programming Concepts and the IBM 1620 Computer, Holt, Rinehart and Winston, 1962.

[17] M. Rafiqzzaman and R. Chandra, Modern , West Publishing Co., St. Paul, MN, 1988.

[18] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs, chapter 24, Oxford University Press, 2000.

[19] Pramod Kumar Meher, “LUT Optimization for Memory-Based Computation,” IEEE Transactions on Circuits and Systems-II: Express Briefs, vol. 57, no. 4, pp. 285-289, Apr. 2010.

[20] P. K. Meher, “An optimized lookup-table for the evaluation of sigmoid function for artificial neural networks,” 2010 18th IEEE/IFIP VLSI System on Chip Conference (VLSI-SoC), pp. 91-95, Sep. 2010.

[21] H. Ling, “An approach to implementing multiplication with small tables,” IEEE Transactions on Computers, vol. 39, no. 5, pp. 717-718, May 1990.

[22] B. Vinnakota, “Implementing multiplication with split read-only memory,” IEEE Transactions on Computers, vol. 44, no. 11, pp. 1352-1356, Nov. 1995.

[23] Behrooz Parhami and Hsun-Feng Lai, “Alternate memory compression scheme for modular multiplication,” IEEE Transactions on Signal Processing, vol. 41, no. 3, pp. 1378-1385, Mar. 1993.

[24] Behrooz Parhami, “Modular reduction by multi-level table lookup,” Proceedings of the 40th Midwest Symposium on Circuits and Systems, pp. 381-384, Aug. 1997.

[25] Hwan-Rei Lee, Chein-Wei Jen, and Chi-Min Liu, “On the design automation of the memory-based VLSI architectures for FIR filters,” IEEE Transactions on Consumer Electronics, vol. 39, no. 3, Aug. 1993.

[26] P. K. Meher, “New Approach to Look-Up-Table Design and Memory-Based Realization of FIR Digital Filter,” IEEE Transactions on Circuits and Systems-I: Regular Papers, vol. 57, no. 3, pp. 592-603, Mar. 2010.


[27] R. Gutierrez and J. Valls, “Low-Power FPGA-Implementation of atan(Y/X) Using Look-Up Table Methods for Communication Applications,” Journal of Signal Processing Systems, Springer, vol. 56, no. 1, pp. 25-33, Jul. 2009.

[28] G. Inoue, “Multiplication device using semiconductor memory,” US Patent no. 5617346, Apr. 1997.

[29] Jiun-In Guo, Chi-Min Liu, and Chein-Wei Jen, “The efficient memory-based VLSI array design for DFT and DCT,” IEEE Transactions on Circuits and Systems-II: Analog and Digital Signal Processing, vol. 39, no. 10, pp. 723-733, Oct. 1992.

[30] P. K. Meher, “Memory-based hardware for resource-constraint digital signal processing systems,” Proc. 6th International Conference on Information, Communications & Signal Processing, pp. 1-4, Dec. 2007.

[31] Te-Jen Chang, Chia-Long Wu, Der-Chyuan Lou, Ching-Yin Chen, “A low-complexity LUT-based squaring algorithm,” Journal of Computers & Mathematics with Applications, vol. 57, Issue 9, pp. 1494-1501, May 2009.

[32] Anantha P. Chandrakasan, Samuel Sheng, and Robert W. Brodersen, “Low-Power CMOS Digital Design,” IEEE Journal of Solid-State Circuits, vol. 27, no. 4, pp. 473-484, Apr. 1992.

[33] James E. Stine and Oliver M. Duverne, “Variations on Truncated Multiplication,” Proceedings of the Euromicro Symposium on Digital System Design (DSD’03), pp. 112-119, Sep. 2003.

[34] Pramod Kumar Meher, “New Approach to Look-Up-Table Design and Memory-Based Realization of FIR Digital Filter,” IEEE Transactions on Circuits and Systems-I: Regular Papers, vol. 57, no. 3, pp. 592-603, Mar. 2010.

[35] Pramod Kumar Meher, “LUT-Based Circuits for Future Wireless Systems,” Proceedings of IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), pp. 696-699, Aug. 2010.

[36] Tom Kean, Bernie New, Robert Slous, “A Fast Constant Coefficient Multiplier for the XC6200,” Proceedings of the 6th International Workshop on Field-Programmable Logic, Smart Applications, New Paradigms and Compilers, pp. 230-236, Sep. 1996.

[37] Ernest Jamro, Kazimierz Wiatr, “Constant Coefficient Convolution Implemented in FPGAs,” Proceedings of the Euromicro Symposium on Digital System Design (DSD’02), pp. 291-198, Sep. 2002.

[38] Ian Kuon, Russell Tessier and Jonathan Rose, “FPGA Architecture: Survey and Challenges,” Foundations and Trends in Electronic Design Automation, vol. 2, no. 2, pp. 135-253, Feb. 2008.

[39] Aiping Hu, A. J. Al-Khalili, “Comparison of Constant Coefficient Multipliers for CSD and Booth Recoding,” Proceedings of the 14th International Conference on Microelectronics, pp. 66-69, Dec. 2002.

[40] Nicola Petra, Davide De Caro, Valeria Garofalo, Ettore Napoli, and Antonio G. M. Strollo, “Truncated Binary Multipliers With Variable Correction and Minimum Mean Square Error,” IEEE Transactions on Circuits and Systems-I: Regular Papers, vol. 57, no. 6, pp. 1312-1325, Jun. 2010.

[41] Anantha P. Chandrakasan, Samuel Sheng, and Robert W. Brodersen, “Low-power CMOS digital design,” IEEE J. Solid-State Circuits, vol. 27, no. 4, pp. 473-484, Apr. 1992.

[42] J. Pihl and E. Aas, “A multiplier and squarer generator for high performance DSP applications,” Proc. 39th Midwest IEEE Symposium on Circuit and Systems, pp. 109-112, Aug. 1996.

[43] R. K. Kolagotla, N. R. Griesbach and H. R. Srinivas, “VLSI implementation of 350 MHz 0.35 μm 8 bit merged squarer,” Electronics Letters, vol. 34, no. 1, pp. 47-48, Jan. 1998.

[44] K.-J. Cho and J.-G. Chung, “Parallel squarer design using pre-calculated sums of partial products,” Electronics Letters, vol. 43, no. 25, pp. 1414-1416, Dec. 2007.


[45] A. G. M. Strollo and D. De Caro, “Booth folding encoding for high performance squarer circuits,” IEEE Trans. Circuits and Systems II: Analog and Digital Signal Processing, vol. 50, no. 5, pp. 250-254, May 2003.

[46] Kyung-Ju Cho and Jin-Gyun Chung, “Adaptive error compensation for low error fixed-width squarers,” IEICE Trans. Information and Systems, vol. E90-D, no. 3, pp. 621-626, Mar. 2007.

[47] Valeria Garofalo, Marino Coppola, Davide De Caro, Ettore Napoli, Nicola Petra, Antonio G. M. Strollo, “A novel truncated squarer with linear compensation function,” Proc. 2010 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 4157-4160, Jun. 2010.

[48] E. G. Walters III and M. J. Schulte, “Efficient function approximation using truncated multipliers and squarers,” Proc. 17th IEEE Symposium on Computer Arithmetic, pp. 232-239, Jun. 2005.

[49] N. Petra, D. De Caro, V. Garofalo, E. Napoli, and A. G. M. Strollo, “Truncated binary multipliers with variable correction and minimum mean square error,” IEEE Trans. Circuits and Systems-I: Regular Papers, vol. 57, no. 6, pp. 1312-1325, Jun. 2010.

[50] V. Garofalo, N. Petra, D. De Caro, A. G. M. Strollo and E. Napoli, “Low error truncated multipliers for DSP applications,” Proc. 15th IEEE International Conference on Electronics, Circuits and Systems (ICECS 2008), pp. 29-32, Sep. 2008.

[51] Chin-Long Wey and Ming-Der Shieh, “Design of a high speed square generator,” IEEE Trans. Computers, vol. 47, no. 9, pp. 1021-1026, Sep. 1998.

[52] Wei-Chang Tsai, Ming-Der Shieh, Wen-Chin Lin, and Chin-Long Wey, “Design of square generator with small look-up table,” Proc. IEEE Asia Pacific Conference on Circuits and Systems (APCCAS 2008), pp. 172-175, Dec. 2008.

[53] Jouko Vankka, Digital Synthesizers and Transmitters for Software Radio, chapter 9, Springer, Jul. 2005.

[54] V. Garofalo, N. Petra and E. Napoli, “Analytical calculation of the maximum error for a family of truncated multipliers providing minimum mean square error,” IEEE Trans. Computers, vol. 60, no. 9, pp. 1366-1371, Sep. 2011.

[55] P. K. Meher, J. Valls, Tso-Bing Juang, K. Sridharan, and K. Maharatna, “50 Years of CORDIC: Algorithms, Architectures, and Applications,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 56, no. 9, pp. 1893-1907, Sep. 2009.

[56] J. Vankka, M. Waltari, M. Kosunen, and K. A.I. Halonen, “A Direct Digital Synthesizer with an on-chip D/A-converter,” IEEE Journal of Solid-State Circuits, vol. 33, no. 2, pp. 218-227, Feb. 1998.

[57] A Technical Tutorial on Digital Signal Synthesis, Analog Devices, Inc, 1999.

[58] J. Tierney, C. M. Rader, and B. Gold, “A Digital Frequency Synthesizer,” IEEE Trans. Audio Electroacoust., vol. AU-19, no. 1, pp. 48-57, Mar. 1971.

[59] D. A. Sunderland, R. A. Strauch, S. S. Wharfield, H. T. Peterson, and C. R. Cole, “CMOS/SOS Frequency Synthesizer LSI Circuit for Spread Spectrum Communications,” IEEE J. of Solid State Circuits, vol. SC-19, no. 4, pp. 497-505, Aug. 1984.

[60] H. T. Nicholas, H. Samueli, and B. Kim, “The Optimization of Direct Digital Frequency Synthesizer Performance in the Presence of Finite Word Length Effects,” Proceedings of 42nd Annual Frequency Control Symposium, pp. 357-363, Jun. 1988.

[61] H. T. Nicholas, III and H. Samueli, “A 150-MHz Direct Digital Frequency Synthesizer in 1.25-micron CMOS with -90 dBc Spurious Performance,” IEEE Journal of Solid-State Circuits, vol. 26, no. 12, pp. 1959-1969, Dec. 1991.

[62] Li-Wen Hsu and Dah-Chung Chang, “Design of Direct Digital Frequency Synthesizer with high ROM Compression Ratio,” Proc. 12th IEEE International Conference on Electronics, Circuits and Systems (ICECS 2005), pp. 1-4, Dec. 2005.


[63] D. De Caro, N. Petra, and A. G. M. Strollo, “Reducing Lookup-Table Size in Direct Digital Frequency Synthesizers Using Optimized Multipartite Table Method,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 55, no. 7, pp. 2116-2127, Aug. 2008.

[64] B.-G. Nam, H. Kim and H.-J. Yoo, “A low-power unified arithmetic unit for programmable handheld 3-D graphics systems,” IEEE J. Solid-State Circuits, vol. 42, no. 8, pp. 1767-1778, Aug. 2007.

[65] Jeremie Detrey and Florent de Dinechin, “A VHDL library of LNS operators,” Proc. 37th Asilomar Conference on Signals, Systems & Computers, vol. 2, pp. 2227-2231, Nov. 2003.

[66] V. Paliouras and T. Stouraitis, “Low-power properties of the logarithmic number system,” Proc. 15th IEEE Symposium on Computer Arithmetic, pp. 229-236, Jun. 2001.

[67] J. N. Mitchell, “Computer multiplication and division using binary logarithms,” IEEE Trans. Electron. Comput., vol. 11, no. 11, pp. 512-517, Aug. 1962.

[68] M. Combet, H. Van Zonneveld, and L. Verbeek, “Computation of the base two logarithm of binary numbers,” IEEE Trans. Electron. Comput., vol. EC-14, no. 6, pp. 863-867, Dec. 1965.

[69] E. L. Hall, D. D. Lynch, and S. J. Dwyer, “Generation of products and quotients using approximate binary logarithms for digital filtering applications,” IRE (now IEEE) Trans. Comput., vol. 19, pp. 97-105, Feb. 1970.

[70] S. Sangregory, C. Brothers, D. Gallagher, and R. Siferd, “A fast, low-power logarithm approximation with CMOS VLSI implementation,” Proc. 42nd Midwest Symp. Circuits Syst., pp. 388-391, Aug. 1999.

[71] K. H. Aded and R. E. Siferd, “CMOS VLSI implementation of low power logarithmic converter,” IEEE Trans. Comput., vol. 52, no. 9, pp. 1221-1228, Nov. 2003.


[72] Tso-Bing Juang, Sheng-Hung Chen and Huang-Jia Cheng, “A lower error and ROM-free logarithmic converter for digital signal processing applications,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 56, no. 12, pp. 931-935, Dec. 2009.

[73] D. De Caro, N. Petra and A. G. M. Strollo, “Efficient logarithmic converters for digital signal processing applications,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 58, no. 10, pp. 667-671, Oct. 2011.

[74] S. Nasayama, T. Sasao, and J. T. Butler, “Programmable numerical function generators based on quadratic approximation: architecture and synthesis method,” Proc. Asia and South Pacific Conference on Design Automation, pp. 378-383, Jan. 2006.

[75] Florent de Dinechin and Arnaud Tisserand, “Multipartite table methods,” IEEE Trans. Comput., vol. 54, no. 3, pp. 319-330, Mar. 2005.

[76] S. Paul, N. Jayakumar and S.P. Khatri, “A fast hardware approach for approximate, efficient logarithm and antilogarithm computations,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 17, no. 2, pp. 269-277, Feb. 2009.

[77] G. L. Kmetz, “Floating point/logarithmic conversion systems,” U.S. patent 4583180, Apr. 1986.

[78] R. Gutierrez and J. Valls, “Low cost hardware implementation of logarithm approximation,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 12, pp. 2326-2330, Dec. 2011.

[79] Paulo S. R. Diniz, Adaptive Filtering: Algorithms and Practical Implementation, Third Edition, Kluwer Academic Publishers, 2008.

[80] S. S. Mahant-Shetti, S. Hosur, A. Gatherer, “The log-log LMS algorithm,” Proc. 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-97), vol. 3, pp. 2357-2360, Apr. 1997.


[81] Mansour A. Aldajani, “Logarithmic quantization in the least mean squares algorithm,” Digital Signal Processing, Elsevier, Issue 18, pp. 321-333, May 2008.

[82] Website of Analog Devices, Inc. [Online]. Available: http://www.analog.com

[83] Wookyeong Jeong, Sangjun An, Moongyung Kim, Sangkyong Heo, Youngjun Kim, Sangook Moon, Yongsurk Lee, “Design of a combined processor containing a 32-bit RISC microprocessor and a 16-bit fixed-point DSP on a chip,” Proc. 6th International Conference on VLSI and CAD (ICVC ’99), pp. 305-308, 1999.

[84] Tay-Jyi Lin, Shin-Kai Chen, Yu-Ting Kuo, Chih-Wei Liu, and Pi-Chen Hsiao, “Design and Implementation of a High-Performance and Complexity-Effective VLIW DSP for Multimedia Applications,” Journal of Signal Processing Systems, Kluwer Academic Publishers, vol. 51, Issue 3, pp. 209-223, Jun. 2008.

[85] R. Simar, R. Tatge, “How TI adopted VLIW in digital signal processors,” IEEE Solid-State Circuits Magazine, vol. 1, Issue 3, pp. 10-14, Summer 2009.

[86] H. Parandeh-Afshar, S. M. Fakhraie, O. Fatemi, “Parallel merged multiplier-accumulator coprocessor optimized for digital filters,” Computers and Electrical Engineering, Elsevier, vol. 36, Issue 5, pp. 864-873, Sep. 2010.

[87] Jason Jiang, “General Guide to Implement Logarithmic and Exponential Operations on a Fixed-Point DSP,” Application Report (SPRA619), Texas Instruments Inc., Dec. 1999.

[88] Mei Yang, Yuke Wang, Jinchu Wang and S. Q. Zheng, “Optimized scheduling and mapping of logarithm and arctangent functions on TI TMS320C67X processor,” Proc. 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 3, pp. III-3156-III-3159, May 2002.


[89] Marko Tkalcic, Jurij F. Tasic, “Color Spaces: Perceptual, Historical and Applicational Background,” Proceedings of IEEE International Conference on Computer as a Tool (IEEE EUROCON 2003), vol.1, pp. 304-308, Sep. 2003.

[90] F. Bensaali, A. Amira, “Design and Implementation of Efficient Architectures for Color Space Conversion,” ICGST International Journal on Graphics, Vision and Image Processing, vol. 5, no. 1, pp. 37-47, Dec. 2004.

[91] Hideki Noda, Nobuteru Takao, Michiharu Niimi, “Colorization in YCbCr Space and its Application to Improve Quality of JPEG Color Images,” Proceedings of IEEE International Conference on Image Processing (ICIP 2007), pp. IV-385-IV-388, Oct. 2007.

[92] S. Mabtoul, E. Hassan, I. Elhaj, and D. Aboutajdine, “Robust Color Image Watermarking Based on Singular Value Decomposition and Dual Tree Complex Wavelet Transform,” Proceedings of 14th IEEE International Conference on Electronics, Circuits and Systems (ICECS 2007), pp. 534-537, Dec. 2007.

[93] A. M. Sapkal, Mousami Munot, Dr. M. A. Joshi, “R’G’B’ to Y’CbCr color space conversion Using FPGA,” Proceedings of IET International Conference on Wireless, Mobile and Multimedia Networks, pp. 255-258, Mar. 2008.

[94] Kazimierz Wiatr, Ernest Jamro, “Implementation Image Data Convolutions Operations in FPGA Reconfigurable Structures for Real-Time Vision Systems,” Proc. International Conference on Information Technology: Coding and Computing, pp. 152-157, 2000.

[95] M. A. Vega-Rodriguez, J. M. Sanchez-Perez and J. A. Gomez-Pulido, “An optimized architecture for implementing image convolution with reconfigurable hardware,” Proc. Automation Congress 2004, vol. 16, pp. 131-136, 2004.

[96] R. D. Peacocke and D. H. Graf, “An introduction to speech and speaker recognition,” Computer, vol. 23, Issue 8, pp. 26-33, Aug. 1990.


[97] J.-C. Wang, J.-F. Wang, and Y.-S. Weng, “Chip design of MFCC extraction for speech recognition,” Integration, the VLSI Journal, vol. 32, no. 1-3, pp. 111-131, Nov. 2002.

[98] H. Corporaal, Microprocessor Architectures: From VLIW to TTA, John Wiley & Sons, Chichester, UK, 1997.

[99] J. E. Volder, “The CORDIC trigonometric computing technique,” IRE (now IEEE) Transactions on Electronic Computers, pp. 330-334, Sep. 1959.

[100] D. De Caro, N. Petra and A. G. M. Strollo, “A 380 MHz direct digital synthesizer/mixer with hybrid CORDIC architecture in 0.25-μm CMOS,” IEEE Journal of Solid-State Circuits, vol. 42, no. 1, Jan. 2007.

Acknowledgements

First of all, I would like to express my deepest gratitude to my principal advisor, Asso. Prof. Cong-Kha PHAM, for his infinite encouragement, guidance and support throughout my doctoral course. While carrying out the research and preparing the dissertation, I received much invaluable guidance and many meaningful discussions, which led to the successful accomplishment of my dissertation. It has been a pleasure working under his supervision and I would like to maintain this good relationship with him. Also, I would like to express my sincere appreciation to the other committee members, including Prof. Kazushi NAKANO, Prof. Yoshinao MIZUGAKI, Prof. Koichiro ISHIBASHI and Asso. Prof. Takayuki NAGAI, for their interesting questions and valuable comments on this dissertation. Moreover, I would like to thank the Information and Communication Technology (ICT) International Program, The University of Electro-Communications (UEC) and the Ministry of Education, Culture, Sports, Science and Technology (MEXT), Japan for providing me with an excellent opportunity to study in Japan and for the MEXT scholarship that I received during the time of this research. Thanks also go to the members of the PHAM Lab for their kindness and support during my doctoral course. Furthermore, I would like to send my deep gratitude to my parents for their encouragement and for providing me with the best educational opportunities. I am thankful to my wife, Pham Thi Kim Anh, for being together with me and giving me love and encouragement during the most important time of my doctoral course. Especially, many thanks go to my lovely daughter, Hoang Phuong Thao, for her love and fun every day after my school time.


Last but not least, I would like to acknowledge the VLSI Design and Education Center (VDEC), the University of Tokyo and ROHM Ltd. for their technical support.

Author Biography

HOANG Van Phuc was born in Hung Yen, Vietnam on June 15th, 1982. He received the B.E. (Bachelor of Engineering) degree in Electronics & Telecommunications and the M.E. (Master of Engineering) degree in Electronic Engineering, both from Le Quy Don Technical University, Hanoi, Vietnam, in 2006 and 2008, respectively. He is currently working toward the Ph.D. degree in Electronic Engineering at The University of Electro-Communications, Tokyo, Japan. His research interests include digital circuits and systems, reconfigurable hardware systems and VLSI architecture for digital signal processing. Mr. Hoang Van Phuc is a student member of the Institute of Electronics, Information and Communication Engineers (IEICE) of Japan.
