University of Wollongong Research Online


Recommended Citation Wu, Hong Ren, The implementation of multidimensional discrete transforms for digital signal processing, Doctor of Philosophy thesis, Department of Electrical and Computer Engineering, University of Wollongong, 1990. http://ro.uow.edu.au/theses/1353


THE IMPLEMENTATION OF MULTIDIMENSIONAL DISCRETE TRANSFORMS FOR DIGITAL SIGNAL PROCESSING

A thesis submitted in fulfilment of the requirements for the award of the degree

DOCTOR OF PHILOSOPHY from

THE UNIVERSITY OF WOLLONGONG by

WU, HONG REN, B.E., M.E.

THE DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING. FEBRUARY 1990.

"Entertaining someone with fish, you could only serve him once, but if you teach him the art of fishing, it will serve him for a life time."

—An ancient Chinese wise man and philosopher.

CONTENTS

ACKNOWLEDGEMENTS vi
ABSTRACT viii
LIST OF ACRONYMS AND SYMBOLS x

CHAPTER ONE: INTRODUCTION 1

1-1 Introduction to Multidimensional Digital Signal Processing 1
1-2 Applications 2

1-3 History and New Achievements in Fast Signal Processing Algorithms 3
1-4 Objectives 6

1-5 Thesis Review and Contributions 8

1-6 Publications, Submitted Papers and Internal Technical Reports 11

PART I. MULTIDIMENSIONAL DISCRETE FOURIER TRANSFORMS 14

CHAPTER TWO: 1-D DISCRETE AND FAST FOURIER TRANSFORM ALGORITHMS 15

2-1 Definitions 15

2-2 Matrix Representations for 1-D Cooley-Tukey FFT Algorithms 16

2-3 Computational Considerations 23

2-4 Summary 31

CHAPTER THREE: 2-D DFT AND 2-D FFT ALGORITHMS 32

3-1 Introduction to 2-D Discrete Fourier Transforms 32

3-2 Definitions 37

3-3 Row-Column FFT Algorithms 38

3-4 Vector Radix FFT Algorithms 39

3-5 Matrix Representations for 2-D Vector Radix FFT Algorithms 43

3-6 Structure Theorems 46

3-7 Structural Approach via Logic Diagrams 55

3-8 2-D Vector Split-Radix FFT Algorithms 63

3-9 Comparisons of Various 2-D Vector Radix FFT Algorithms 67

3-10 Vector Radix FFT Using FDP™ A41102 69

3-11 Summary 72

CHAPTER FOUR: A PERSPECTIVE ON VECTOR RADIX FFT ALGORITHMS OF HIGHER DIMENSIONS 72

4-1 Definitions 74

4-2 Matrix Representations and Structure Theorems 75

4-3 Diagrammatical Presentations 78

4-4 Computing Power Limitations 84

PART II. MULTIDIMENSIONAL DISCRETE COSINE TRANSFORMS 87

CHAPTER FIVE: INTRODUCTION TO MULTIDIMENSIONAL DISCRETE COSINE TRANSFORMS 88

5-1 Definitions of 1-D DCT and Its Inverse DCT 91

5-2 Definitions of 2-D DCT and Its Inverse DCT 93

5-3 Applications of 2-D DCTs in Image Compression 95

5-4 2-D Indirect Fast DCT Algorithms 100

CHAPTER SIX: 2-D DIRECT FAST DCT ALGORITHMS 103

6-1 2-D Direct Fast DCT Algorithm Based on Lee's Method 103

6-1-1 1-D Lee's algorithm in matrix form 103

6-1-2 Derivation of 2-D fast DCT algorithm from Lee's algorithm 108

6-2 2-D Direct Fast DCT Algorithm Based on Hou's Method 114

6-2-1 1-D Hou's algorithm in matrix form 114

6-2-2 Derivation of 2-D fast DCT algorithm from Hou's algorithm 118

6-3 Comparison of Arithmetic Complexity of Various DCT Algorithms 124

6-4 Comparison of Computation Structures of 2-D Direct VR DCTs and VR FFTs 125

6-5 Summary 126

CHAPTER SEVEN: HARDWARE IMPLEMENTATION OF 2-D DCTS FOR REAL-TIME IMAGE CODING SYSTEMS 128

7-1 Description of Hardware Implementation of Modified 2-D Makhoul DCT Algorithm Using FDP™ A41102 129

7-2 Discussion of 2-D DCT Image Coding Systems Using VLSI Digital Signal Processors 132

CHAPTER EIGHT: THE EFFECTS OF FINITE-WORD-LENGTH COMPUTATION FOR FAST DCT ALGORITHMS 136

8-1 Introduction 136

8-2 Simulation Design 138

8-2-1 Structure of the simulation program 138

8-2-2 Error model for the basic computation structure 140

8-2-3 DCT in infinite-word-length 141

8-2-4 Data collection 141

8-3 Simulation Results 143

8-3-1 Floating-point computation of 1-D DCTs 143

8-3-2 Floating-point computation of 2-D DCTs 147

8-4 Summary 151

CHAPTER NINE: CONCLUSIONS 152

9-1 Conclusions 152

9-2 Suggestions for Future Research 154

BIBLIOGRAPHY 156

APPENDIX A: PRELIMINARY BACKGROUND ON THE TENSOR (KRONECKER) PRODUCT AND THE LOGIC DIAGRAM 172
APPENDIX B: PROOF OF STRUCTURE THEOREMS 175
APPENDIX C: THE COMBINED FACTOR METHOD 176
APPENDIX D: DERIVATION OF VECTOR RADIX 2-D FAST DCT BASED ON LEE'S ALGORITHM 181
APPENDIX E: ARITHMETIC COMPLEXITY OF THE VECTOR SPLIT-RADIX DIF FFT ALGORITHM 187

ACKNOWLEDGEMENTS

The author wishes to express his deepest appreciation to his Supervisor, Dr. F.J. Paoloni, Associate Professor of the Department of Electrical and Computer Engineering, The University of Wollongong, for his guidance, support and encouragement, and also for his understanding of and confidence in the author throughout this research. His professional and optimistic attitude towards the research has made this research challenging, interesting, productive and enjoyable.

The author wishes to thank Professor Huang, Ruji, of the Department of Industrial Automation, University of Science and Technology, Beijing (formerly Beijing University of Iron and Steel Technology), who, as his Master's Supervisor, had a great influence on shaping the author's research skills and abilities as an independent as well as a cooperative researcher.

Sincere thanks are also extended to Professor B.H. Smith who introduced the author to this Institution and made this study possible in the first place.

The author wishes to thank the following people for their generous help, patience and useful discussions at various stages of this program: Dr. G.W. Trott and Dr. T.S. Ng, Department of Electrical and Computer Engineering; Mr. I.C. Piper, Computer Services; Mr. J.K. Giblin, formerly with Computer Services and now with Network Technical Services, B.H.P. Steel International Group; Mr. G. Andersson, Computer Services; Dr. N. Smyth and Dr. K.G. Russell, Department of Mathematics, The University of Wollongong; Professor J.H. McClellan, School of Electrical Engineering, Georgia Institute of Technology, formerly with Schlumberger Well Services; Dr. J.D. O'Sullivan, Dr. D.J. McLean, Dr. C.E. Jacka and Mr. K.T. Hwa, Division of Radio Physics, CSIRO in Epping, New South Wales; Dr. M.J. Biggar and Dr. W.B.S. Tan, Telecom Research Laboratories (Australia); Professor K.R. Rao, Department of Electrical Engineering, The University of Texas at Arlington; Professor M. Vetterli, Department of Electrical Engineering, Columbia University; Mr. P. Single, Austek Microsystems (Australia); Dr. M.A. Magdy, Mr. J.F. Chicharo and Mrs. C. Quinn, Department of Electrical and Computer Engineering, The University of Wollongong; Mr. P.J. Costigan and all technical staff in the Department.

The author is deeply grateful to his friend and English teacher Mrs. B.S. Perry for her generous help and professional assistance in the author's understanding of English and Australian culture, and to her and her husband, Mr. E.J.W. Perry, for the understanding, friendship and encouragement which made the author's stay in Wollongong worthwhile, most pleasant and enjoyable.

The assistance from Miss M.J. Fryer, of the Department of Electrical and Computer Engineering, The University of Wollongong, throughout this research and particularly in reading the final manuscript of this thesis is warmly appreciated.

Financial support received from the Department of Electrical and Computer Engineering and the Committee of Post-Graduate Study, The University of Wollongong, by means of the Departmental Teaching Fellowship and the Post-Graduate Research Scholarship respectively, which made this research possible, is sincerely acknowledged.

Financial support from the Australian Telecommunication and Electronics Research Board and from Telecom Research Laboratories (Australia) through R&D contract No.7066 is also acknowledged.

Finally, the author wishes to express his deepest gratitude to Mei Mei, his best friend, colleague and wife, without whose patience, understanding, appreciation and continuous support, encouragement and inspiration this work would not have been accomplished. The continuous support and understanding from his parents, from whom he has been separated for the cause, is also greatly appreciated.

ABSTRACT

A structural approach to the construction of multidimensional vector radix fast Discrete Fourier Transforms (DFTs) and fast direct Discrete Cosine Transforms (DCTs) is presented in this thesis. The approach features the use of matrix representations of one-dimensional (1-D) and two-dimensional (2-D) FFT and fast DCT algorithms along with the tensor product, and the use of logic diagrams and rules for modification.

In the first part of the thesis, the structural approach is applied to construct 2-D Decimation-In-Time (DIT), Decimation-In-Frequency (DIF) and mixed (DIT & DIF) vector radix FFT algorithms from corresponding 1-D FFT algorithms by the Cooley-Tukey method. The results are summarized in theorems as well as in examples using logic diagrams. It has been shown that the logic diagram (or signal flow graph), as well as being a form of representation and interpretation of fast algorithm equations, is a stand-alone engineering tool for the construction of fast algorithms. The concept of "vector signal processing" is adapted into the logic diagram representation, which reveals the structural features of multidimensional vector radix FFTs and explains the relationships and differences between the row-column FFT, the vector radix FFT reported previously and the approach presented in this thesis. The introduction of the structural approach makes a multidimensional vector radix FFT algorithm of high radix and dimension easy to evaluate and to implement in both software and hardware.

The hardware implementation of 2-D DFTs is discussed in the light of vector radix FFTs using the Frequency Domain Processor (FDP™) A41102, which has shown reduced system complexity compared with the traditional row-column method.

With the help of the structural approach, the vector split-radix DIF FFT algorithm, and the mixed (DIT & DIF) and Combined Factor (CF) vector radix FFT algorithms, are presented, and a comparative study is made in terms of arithmetic complexity. The approach is then generalized to vector radix FFTs of higher dimensions.

Two vector radix DCT algorithms are presented in the second part of the thesis.

Although the one based on Lee's approach was reported by Haque using a direct matrix derivation method, it is derived independently by the author using the structural approach.

The other vector radix DCT algorithm is based on Hou's method. The arithmetic complexities of these two algorithms are compared with those of various other known row-column DCT algorithms. The computation structures of 2-D vector radix direct fast DCT algorithms are discussed in comparison with those of 2-D vector radix FFT algorithms.

A correction to the system description of Hou's DIT fast DCT algorithm is presented as a result of analysing the algorithm's computation structure.

The system design of the 2-D modified Makhoul algorithm using the FDP A41102 provides yet another solution to the real-time 2-D DCT image coding problem. The effects of finite-word-length computation of DCTs using various direct fast algorithms are studied by computer simulation for the purpose of transform coding of images. The results are also presented in the thesis.

LIST OF ACRONYMS AND SYMBOLS

ASSP: Acoustics, Speech, and Signal Processing

AUSTEK: Austek Microsystems Proprietary Inc. and Austek Microsystems Proprietary Ltd.

BF: ButterFly computational structure of fast transform algorithms

CCITT: International Telegraph and Telephone Consultative Committee

CF: Combined Factor method

CSIRO: the Commonwealth Scientific and Industrial Research Organization

DCT: Discrete Cosine Transform

DCTd: the DCT output sequence with double-precision (64-bit floating-point) computation

DCTf: the DCT output sequence with finite-word-length (32-bit floating-point or fixed-point) computation

DFT: Discrete Fourier Transform

DIF: Decimation-In-Frequency

DIT: Decimation-In-Time

DSP: Digital Signal Processor, or Digital Signal Processing

FDP: Frequency Domain Processor

FFT: Fast discrete Fourier Transform algorithm(s)

FIR: Finite-extent Impulse Response

HDTV: High Definition Television

IIR: Infinite-extent Impulse Response

ISDN: Integrated Services Digital Networks

inmos: a part of the SGS THOMSON Microelectronics Group

m-D: multi-Dimensional

M/A: Multiplier/Accumulator, or Multiply/Accumulate

MIT: Massachusetts Institute of Technology

Ms/s: Million samples per second

NMR: Nuclear Magnetic Resonance

RMFFT: Reduced Multiplications Fast discrete Fourier Transform algorithm(s)

SGS THOMSON: SGS THOMSON Microelectronics Group

SNR: Signal to Noise Ratio

TM: Twiddling Multiplications of fast transform algorithms

TRW: TRW LSI Products Inc.

VLSI: Very Large Scale Integrated circuits

VR: Vector Radix

VSP: Vector Signal Processor

VSR: Vector Split-Radix

WFTA: Winograd Fourier Transform Algorithm(s)

Zoran: Zoran Corporation

α, ...: small Greek letters are used for transform coefficients throughout the thesis

matrix of butterfly computation structure outlining the Cooley-Tukey FFT

B: butterfly matrix of Lee's fast DCT algorithm

B: butterfly matrix of vector radix fast DCT algorithm based on Lee's method

C^(2n+1)k_2N: cos[(2n+1)kπ/2N]

C(k1,k2): 2-D DCT sequence in 2-D indirect fast DCT algorithm

C: 1-D DCT matrix

C^-1: inverse 1-D DCT matrix

C: denormalized 1-D DCT matrix

Ĉ^t: transpose of the denormalized 1-D DCT matrix

e_n, e_c, e_o: roundoff errors

F: matrix for 1-D radix-r twiddling multiplication of length-N DIT FFT

matrix for 2-D vector radix-r1*r2 twiddling multiplication structure of length-N1*N2 VR DIT FFT

matrix for 1-D radix-r twiddling multiplication of length-N DIF FFT

matrix for 2-D vector radix-r1*r2 twiddling multiplication structure of length-N1*N2 VR DIF FFT

matrix for 1-D radix-r butterfly structure of length-N DIT FFT

matrix for 2-D vector radix-r1*r2 butterfly structure of length-N1*N2 VR DIT FFT

matrix for 1-D radix-r butterfly structure of length-N DIF FFT

matrix for 2-D vector radix-r1*r2 butterfly structure of length-N1*N2 VR DIF FFT

j: imaginary unity

N: length of the transform

diag.[-,-,...,-]

multiplication matrix of Lee's fast DCT algorithm

multiplication matrix of vector radix fast DCT algorithm based on Lee's method

number of addition operations required by the transform

number of multiplication operations required by the transform

pre- or post-calculation matrix of Lee's fast DCT algorithm

pre- or post-calculation matrix of vector radix fast DCT algorithm based on Lee's method

diag.[1/2, 1, ..., 1]

sample variance

recursive denormalized DCT matrix used in Hou's fast DCT algorithm

diag.[1/2, 1/2, ..., 1/2]

T: matrix of twiddle multiplications outlining the Cooley-Tukey FFT

X: vector of multi-dimensional transform sequence

x: vector of multi-dimensional data sequence

X(k): transform sequence

x(m): data sequence

X: vector of 1-D transform sequence, [X(0),X(1),...,X(N-1)]

x: vector of 1-D data sequence, [x(0),x(1),...,x(N-1)]

X(k) and X̂(k): denormalized 1-D DCT sequence

X(k) and X̂(k): vector of the denormalized 1-D DCT sequence

X(k) and X̂(k): vector of the denormalized 2-D DCT sequence

X̄_i: sample mean of a random variable

v(n1,n2): rearranged 2-D sequence for indirect DCT

V(k1,k2): 2-D discrete Fourier transform of v(n1,n2)

W_N^km: exp(-2πjkm/N)

W_N: 1-D discrete Fourier transform matrix

W_N^-1: 1-D inverse discrete Fourier transform matrix

®: tensor (or Kronecker) product

*: multiply

+: add
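The tensor (or Kronecker) product ⊗ listed above is the central construction tool of the thesis: the 2-D DFT matrix is the Kronecker product of two 1-D DFT matrices. As an illustrative sketch only (Python with NumPy is assumed here; this code is not from the thesis), applying W_N1 ⊗ W_N2 to the row-major-vectorized data reproduces the 2-D DFT:

```python
import numpy as np

def dft_matrix(N):
    # 1-D DFT matrix W_N with entries exp(-2*pi*j*k*m/N).
    k = np.arange(N)
    return np.exp(-2j * np.pi * np.outer(k, k) / N)

N1, N2 = 4, 8
x = np.random.rand(N1, N2)

# (W_N1 tensor W_N2) acting on vec(x) (row-major) gives vec of the 2-D DFT.
W = np.kron(dft_matrix(N1), dft_matrix(N2))
X = (W @ x.flatten()).reshape(N1, N2)

assert np.allclose(X, np.fft.fft2(x))
```

Factorizing each 1-D DFT matrix inside this Kronecker product is precisely what yields the multidimensional fast algorithms constructed in later chapters.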

CHAPTER ONE: INTRODUCTION

1-1 Introduction to Multidimensional Digital Signal Processing

Only after the advent of the modern electronic computer has multidimensional (m-D) signal processing become a reality. It has attracted more and more research interest as integrated circuits have become faster, cheaper and more compact [1]. It covers a large research area including image processing, computer-aided tomography, image compression and image coding, multidimensional Finite-extent Impulse Response (FIR) filtering, multidimensional Infinite-extent Impulse Response (IIR) filtering, beamforming, multidimensional spectrum analysis and estimation, radar detection, seismic signal processing, biomedical signal processing, etc. While multidimensional signal processing, by definition, deals with all signals whose dimensionality is equal to or greater than two, at present two-dimensional and three-dimensional problems are of practical concern [2].

Although multidimensional signal processing is an extension of one dimensional signal processing, it does have its own problems, associated with the huge amount of data involved, which makes implementation a difficult issue. More complicated mathematics is required, which can be more arduous to comprehend. It also allows a greater degree of freedom, providing versatile solutions to a single problem.

These difficulties make multidimensional signal processing a very complicated task, and also motivate research in mathematics, algorithms and implementation. Practical solutions to these problems are based on the development of modern technology (particularly computer technology) and raise the future requirements on the technology front.

On the whole, as in one dimensional signal processing, there are two basic approaches to multidimensional signal processing problems. One is the spatial (or original) domain approach, and the other is the frequency (or transform) domain approach [1, 3-8]. They are two forms of mathematical representation of the natural world. Although they are equally powerful, one can be more appealing than the other in certain applications. This thesis focuses on the transform approach, the implementation of multidimensional Discrete Fourier Transforms (DFTs) and Discrete Cosine Transforms (DCTs) in particular.

1-2 Applications

The Fourier transform theory has played an important role in multidimensional signal processing [1, 7, 8] and will continue to be a topic of interest in theoretical, as well as applied, work in this field [9-12]. Mathematical fundamentals of multidimensional Fourier transforms have been thoroughly examined, and many fast algorithms have been proposed. The introduction of vector processors [61], VLSI vector signal processors [13], VLSI FFT processors [14, 15, 142], systolic array processors [140, 141, 145, 147], Single Instruction Multiple Data (SIMD) [130] and Very Long Instruction Word (VLIW) supercomputers [129] makes the implementation of multidimensional Fourier transforms more practical than ever before.

The multidimensional Fourier transform finds its application in the 2-D context, such as image enhancement (smoothing, edge detection), image restoration, image compression and encoding, image description [4, 8], radar detection [134, 137], 2-D FIR filter implementation and design [1] and invariant object recognition [10]. The 3-D Fourier transform is required in nuclear magnetic resonance imaging algorithms [9], 3-D tomo-synthesis [146] and in the construction of 3-D microscopic-scale objects to remove out-of-focus noise [16]. Multidimensional Fourier transforms used for simultaneous time-spatial or spatial-frequency representation in computer vision and pattern analysis provide better tools for pattern analysis and a better understanding of dynamic patterns in the visual system [2, 10, 11, 12].

Almost a decade after the introduction of the Cooley-Tukey Fast Fourier Transform algorithm (FFT), the Discrete Cosine Transform was first introduced into digital signal processing for the purposes of pattern recognition and Wiener filtering, in 1974 [17].

But it soon led to a vast range of engineering applications. In the multidimensional context, the two-dimensional (2-D) DCT is used for image compression and transform coding of images [3, 4] in telecommunications such as video-conferencing, video telephony, video image compression for High Definition Television (HDTV), block structure/distortion in image coding, activity classification in transform coding, surface texture analysis, tomographic classification, photovideotex, pattern recognition, progressive image transmission, printed image coding and applications in fast packet switching networks [18, 19, 157]. The DCTs can be implemented by fast algorithms in either software or hardware and render almost optimal performance, virtually indistinguishable from the Karhunen-Loeve Transform [3, 17] in terms of energy packing ability and decorrelation efficiency. Various VLSI DCT processors have also been reported and demonstrated recently for video coding applications [74, 75, 87-89, 111, 143].
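One widely used fast route to the DCT, the indirect approach revisited later in the thesis, computes it through an FFT of a reordered sequence. A minimal sketch of that idea (Python with NumPy assumed; an illustrative Makhoul-style reordering, not the thesis's own code), using the unnormalized DCT-II definition X(k) = 2·Σ x(n)·cos[(2n+1)kπ/2N]:

```python
import numpy as np

def dct2_via_fft(x):
    # DCT-II of x through a single N-point FFT (Makhoul-style reordering):
    # v holds the even-indexed samples in order, followed by the
    # odd-indexed samples in reverse; a quarter-sample phase factor
    # then recovers the cosine sums from the DFT.
    N = len(x)
    v = np.empty(N)
    v[:(N + 1) // 2] = x[0::2]
    v[(N + 1) // 2:] = x[1::2][::-1]
    V = np.fft.fft(v)
    k = np.arange(N)
    return 2.0 * np.real(np.exp(-1j * np.pi * k / (2 * N)) * V)

# Check against the direct O(N^2) cosine-sum definition.
N = 16
x = np.random.rand(N)
n, k = np.arange(N), np.arange(N)[:, None]
direct = 2.0 * (np.cos(np.pi * k * (2 * n + 1) / (2 * N)) @ x)
assert np.allclose(dct2_via_fft(x), direct)
```

The attraction of this indirect route is that a single length-N FFT, rather than a length-2N one, suffices; the direct fast DCT algorithms studied in Part II avoid the complex arithmetic altogether.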

1-3 History and New Achievements of Fast Signal Processing Algorithms

Reviewing the history of a research and study area provides a perspective which generally benefits future research and study. A review of the study of FFT algorithms in digital signal processing has particular significance.

Great engineering power has its deep roots in mathematics. Applications of research achievements rely on the development of relevant technology, and the development of technology motivates further research that carries more mathematical wonders into application. The gap has to be bridged by a proper approach and a form of representation which are attractive to the engineering community.

The history of FFT algorithms did not begin with Good, or Thomas, or Danielson, or Lanczos, or even Runge [20]. It can be traced back to the great German mathematician Carl Friedrich Gauss (1777-1855) [21]. But it has only become an important engineering concern, since the advent of the modern electronic digital computer, through the fundamental work laid by Cooley and Tukey [22] and those who have helped to give this mathematical curiosity an engineering interpretation and eventually to convert it to an engineering power [5, 23, 24]. It has been said that the rediscovery of the FFT algorithm was one of the saviours of the predecessor of the IEEE Acoustics, Speech, and Signal Processing Society [23] and marked the beginning of modern digital signal processing [6, 31, 135]. The Cooley-Tukey FFT algorithm, in addition to being widely used because it came first, owes much to its simple structure; a structure which is appealing to the engineering community. The representation of the FFT by the so-called butterfly signal flow graph [24, 25] fits nicely into the newly released VLSI FFT processor, the Frequency Domain Processor (FDP™) A41102 [14, 15, 26-29]. Some problems can be timeless, and solutions to them can be discovered and rediscovered again and again.
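The butterfly structure referred to above can be sketched in a few lines. The following recursive radix-2 decimation-in-time FFT (an illustrative sketch in Python with NumPy, not taken from the thesis) is the Cooley-Tukey recursion whose signal flow graph is conventionally drawn with butterflies:

```python
import numpy as np

def fft_dit2(x):
    # Radix-2 decimation-in-time Cooley-Tukey FFT; len(x) must be a power of 2.
    N = len(x)
    if N == 1:
        return np.asarray(x, dtype=complex)
    E = fft_dit2(x[0::2])            # DFT of the even-indexed samples
    O = fft_dit2(x[1::2])            # DFT of the odd-indexed samples
    tw = np.exp(-2j * np.pi * np.arange(N // 2) / N)  # twiddle factors
    # Each butterfly pairs one even output with one twiddled odd output.
    return np.concatenate([E + tw * O, E - tw * O])

x = np.random.rand(32)
assert np.allclose(fft_dit2(x), np.fft.fft(x))
```

Each add/subtract pair in the final line is one butterfly; the regularity of this pattern is exactly what pipelined FFT hardware exploits.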

Representation also is of vital importance for each step of the conversion from research achievement to engineering application. One of the tasks required of scientific researchers is to demystify and clarify, not mystify.

FFT algorithms also have their roots in Abelian Semi-Simple Algebras, by which the mathematical structure of the FFT is revealed. These Abelian Semi-Simple Algebras provide an explanation of how various FFT algorithms are devised. Many attempts have been made to convey this mathematical result to the engineering community [29-34]. When the mathematical structure of a process is well understood, many fast algorithms for it can be constructed systematically. Starting from the corresponding 1-D algorithms, the tensor (or Kronecker) product has been used successfully to generate multidimensional Winograd Fourier transform algorithms [35] and prime factor FFT algorithms [32].

In [31], a matrix form is introduced to represent the vector radix-2*2 and -4*4 Decimation-In-Time (DIT) FFT algorithms. However, the tensor product is used there as a form of representation for the 2-D VR-4*4 FFT algorithm rather than as a tool for the construction of the VR FFT algorithm from its 1-D counterpart [31]. Many multidimensional fast algorithms are constructed using a direct derivation method [36-38].
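To make the vector radix idea concrete: a vector radix-2*2 DIT FFT decimates both dimensions simultaneously, splitting the 2-D array into four subarrays and combining them with 2-D twiddle factors. A hedged sketch (Python with NumPy assumed; a simplified rendering of the vector radix principle, not the thesis's derivation):

```python
import numpy as np

def vr_fft2(x):
    # Vector radix-2*2 DIT FFT sketch (square input, side a power of two).
    N = x.shape[0]
    if N == 1:
        return x.astype(complex)
    A = vr_fft2(x[0::2, 0::2])   # even rows, even columns
    B = vr_fft2(x[0::2, 1::2])   # even rows, odd columns
    C = vr_fft2(x[1::2, 0::2])   # odd rows, even columns
    D = vr_fft2(x[1::2, 1::2])   # odd rows, odd columns
    w = np.exp(-2j * np.pi * np.arange(N // 2) / N)
    w1 = w[:, None]              # twiddle along the first dimension
    w2 = w[None, :]              # twiddle along the second dimension
    B, C, D = w2 * B, w1 * C, (w1 * w2) * D
    h = N // 2
    X = np.empty((N, N), dtype=complex)
    # One radix-2*2 "butterfly" per (k1, k2): four inputs, four outputs.
    X[:h, :h] = A + B + C + D
    X[:h, h:] = A - B + C - D
    X[h:, :h] = A + B - C - D
    X[h:, h:] = A - B - C + D
    return X

x = np.random.rand(8, 8)
assert np.allclose(vr_fft2(x), np.fft.fft2(x))
```

Note how the 2-D butterfly takes four inputs at once and the twiddle factor W^(k1+k2) appears as a single combined multiplication; this merging of the two dimensions' twiddles is the source of the vector radix saving over the row-column method.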

It is worth noting that in most of the published literature, multidimensional algorithms are still described in a 1-D diagrammatical representation form by the traditional butterfly signal flow graph, which can be over-complicated in the multidimensional case. When problems become extended, new representation forms have to be found to disclose the mysteries behind mathematical structures, which are sometimes quite complicated (or abstract).

In the history of fast signal processing algorithms, the basic issues, which are associated with evaluating the effectiveness of an algorithm from the outset, have been:

(1) reduction of arithmetic complexity;

(2) reduction of round-off errors and errors due to the quantization of the coefficients;

(3) in-place computation; and,

(4) possession of a regular computation structure.

Three of the above four points (point 2 excluded) are associated with the processing speed, which is a major engineering concern. An algorithm which does not possess in-place computation or a regular computational structure will require more bookkeeping and indexing operations, and will therefore reduce the processing speed.

In the early years, multiplications were more time-consuming than additions and other types of operations (data transfer, for instance) on general purpose computers. Reducing the number of multiplications became the centre point of the evaluation of fast algorithms. As a result, a group of FFT algorithms, called the reduced multiplications FFT (RMFFT) algorithms, was introduced [39], including the prime factor algorithm, the Winograd Fourier Transform Algorithm (WFTA) and polynomial transforms. Many of these were obtained at the expense of more additions and loss of regular computation structure. However, the introduction of Digital Signal Processors (DSPs) and the development of VLSI technology, Application Specific Integrated Circuit (ASIC) technology in particular, have changed this tradition dramatically, and now an addition (or even loading of data) takes about the same time to complete as a multiplication on some processors [40]. The issue is not just reduction of the number of multiplications but of the total number of operations. Fast algorithms which do not possess in-place computation or do not have a regular structure will be in a disadvantageous position, as they have to pay a severe cost in loading, storing, copying data and other indexing tasks [39, 41]. In systolic array implementations of DFTs and FFTs, emphasis has been on modularity, pipelining and parallelism, and simple, regular and local communication structures [140, 141, 145, 147] (apart from the area*time^2 criteria commonly used for VLSI designs).

An algorithm is only fast when the hardware can take advantage of it [31]. A theoretically fast algorithm may be even less effective than a "slow" algorithm on certain processors. Some features of many fast algorithms, such as parallelism and pipelining structure, still remain to be fully exploited [134, 137]. These algorithms will be many times faster than they are now only when computer technology resolves the problems which are associated with them. For example, VLSI implementation of FFT algorithms is not limited to radix-2 or radix-4 butterflies. Full-length (up to 256 complex point) CMOS and HMOS FFT processors (FDPs) have been reported, demonstrated [14, 15] and are now commercially available, as mentioned previously. An FFT processor that computes a 4096-point complex DFT in 102 µs with 22-bit floating-point arithmetic was also reported in [142]. When the computation structure of the Cooley-Tukey algorithm described by butterflies is made into a VLSI pipelining architecture, a 256 complex point DFT can be achieved in about 102.4 µs (200 µs for the HMOS chip) on FDPs. Another feature of the FDP A41102 is that an 8*8- or 16*16-point 2-D DFT can be accomplished in one pass, although the row-column approach is used. This means a reduction in the time for sweeping data. This places many digital signal processing applications using the FFT into the real-time or pseudo-real-time processing category.

1-4 Objectives

As explained in the previous section, because of the nature of multidimensional Digital Signal Processing (DSP), there are various multidimensional DSP algorithms from which to choose, and the structures of these algorithms appear to be more complex than those of their 1-D counterparts. Without an appropriate method, the construction, the evaluation of the performance and the implementation of m-D algorithms would be a very difficult task indeed. This thesis seeks a structural approach to m-D fast DSP algorithms to make the task simpler. Instead of deriving m-D fast algorithms from a defined m-D DSP problem directly, then evaluating and implementing the algorithms according to the equations so derived, the approach suggests that the construction, evaluation and implementation of m-D fast algorithms be based on our knowledge of, and experience with, the corresponding 1-D algorithms, wherever possible. For example, 1-D FFT algorithms based on the Cooley-Tukey method are extensively studied and well documented. Computer programs for 1-D FFTs can be found in the published literature and in computer software mathematics libraries. Many DSP manufacturers provide their own versions of FFT programs. The VLSI integration of 1-D FFT algorithms has also broadened our knowledge. All of the above knowledge can be made useful for the development of m-D FFT algorithms. The simplest case is the row-column approach. The row-column m-D FFT algorithm, for instance, is obtained by repeatedly applying the 1-D FFT algorithm along each dimension, so that all the knowledge and experience, including the programs and hardware, of 1-D FFTs are directly made use of in this m-D method; the method is constructed and built on the 1-D FFT by deriving the relation between the m-D DFT and the DFT along each of its dimensions. In this case, it happens that it is not necessary to worry about the structure of 1-D FFTs, nor how they are constructed; it is simply a matter of making use of what is available at the 1-D level.

However, when the number of dimensions of the DFT increases, the computational saving of m-D fast FFTs, of which the vector radix FFT is one, over the row-column FFT becomes substantial in terms of the number of multiplications or the total number of numerical operations, as will be shown in Chapter Four of this thesis [44, 45]. When the m-D vector radix FFT is to be constructed, the structure of 1-D FFT algorithms and the structural relationship between the m-D FFT and 1-D FFTs have to be studied and understood in order to generate systematically the required m-D FFT from the knowledge (algorithm, software and hardware) possessed of the corresponding 1-D FFTs. The approach described above is hereby called a structural approach. This kind of approach will not only help the construction and the software and hardware implementation of m-D algorithms, but will also assist the study of VLSI integration of m-D algorithms, which possess a greater degree of complexity than the 1-D case. Its function as a tool for software development of m-D discrete transform algorithms has been successfully demonstrated during this research.
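The row-column idea described above reuses a 1-D FFT verbatim: transform every row, then every column. A minimal sketch (Python with NumPy assumed, purely for illustration):

```python
import numpy as np

def rowcol_fft2(x):
    # Row-column 2-D FFT: a 1-D FFT along each dimension in turn.
    X = np.fft.fft(x, axis=1)      # 1-D FFT of every row
    return np.fft.fft(X, axis=0)   # then a 1-D FFT of every column

x = np.random.rand(8, 16)
assert np.allclose(rowcol_fft2(x), np.fft.fft2(x))
```

The appeal is evident: any existing 1-D FFT routine, in software or hardware, can be dropped in unchanged, which is exactly why the structural relationship between 1-D and m-D algorithms is worth studying before moving to the more efficient vector radix formulations.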

This thesis is mainly concerned with the implementation of the multidimensional vector radix FFT algorithms [42-44], based on the Cooley-Tukey method [22], and that of 2-D direct vector radix fast DCT algorithms. The mathematical structures of these algorithms are to be examined and a graphical representation (logic diagram) is to be introduced to accommodate the concept of vector signal processing in graphical form.

Various issues associated with the software and hardware implementations of m-D DFTs and DCTs are also to be investigated using some state-of-the-art digital signal processors.

It will be shown that the algorithms under study are highly structured and have a close link to their 1-D counterparts. They provide more efficient processing in terms of computational complexity and will be fast if their parallel and pipeline structure can be fully exploited.

1-5 Thesis Review and Contributions

A structural approach to the construction of multidimensional vector radix fast

Discrete Fourier Transforms (DFTs) and fast direct Discrete Cosine Transforms (DCTs) is presented in this thesis. A rigorous mathematical derivation is given by representing the 1-D and 2-D FFT and fast DCT algorithms in matrix form together with the tensor product. The same algorithms, however, are also derived more simply by examination of the structure of logic diagrams with given rules for modification. The structural approach is applied to construct 2-D Decimation-In-Time (DIT), Decimation-In-Frequency (DIF) and mixed (DIT & DIF) vector radix FFT algorithms from corresponding 1-D FFT algorithms using the Cooley-Tukey approach. The whole procedure is summarized in theorems.

The results are then generalized to vector radix FFTs of higher dimensions and vector radix DCT algorithms. It has been shown that the logic diagram (or signal flow graph) is, in addition to being a form of representation and interpretation of fast algorithm equations, a stand-alone engineering tool for the construction of fast algorithms. The concept of

"vector processing" is adapted into the logic diagram representation. This reveals the structural features of multidimensional vector radix FFTs and explains the relationships and differences between the row-column FFT, the vector radix FFT in [43, 44] and the approach presented in this thesis. The introduction of the structural approach makes multidimensional vector radix FFT algorithms of high radix and high dimension easy to evaluate and implement in both software and hardware.

The hardware implementation of the 2-D DFT is discussed in the light of vector radix FFTs using the Frequency Domain Processor (FDP™) A41102, which shows a reduction in system complexity over the traditional row-column method.

With the help of the structural approach, the vector split-radix DIF FFT algorithm, the mixed (DIT & DIF) vector radix FFT and the Combined Factor (CF) vector radix FFT algorithms are presented, and a comparative study is made in terms of arithmetic complexity.

Two vector radix DCT algorithms are presented in the second part of the thesis. Although the one based on Lee's approach was reported by Haque using a direct matrix derivation method, it is derived here independently, using the structural approach. The other vector radix DCT algorithm is based on Hou's method.

The system design of the 2-D modified Makhoul algorithm using the FDP A41102 provides yet another solution to the real-time 2-D image coding problem. The effects of finite-word-length computation of the DCT using various direct fast algorithms are studied by computer simulation and the results are presented.

Chapter One presents an introduction to multidimensional digital signal processing, with emphasis given to the transform method, and to multidimensional discrete Fourier transforms and discrete cosine transforms in particular. The development and new achievements of fast digital signal processing algorithms are reviewed, providing insight into the research area.

Two basic representations, namely the matrix form and the logic diagram, for 1-D DFT and FFT algorithms are presented in Chapter Two, which lays the foundation for the presentation of the structural approach to the construction of multidimensional vector radix FFT algorithms. It is also shown that the logic diagram is both a form of representation for FFT algorithms and a tool to derive or construct FFT algorithms.

Chapter Three forms one of the major chapters of the thesis. After the introduction of general matrix representations for the first-stage 2-D DIT, DIF and mixed decimation vector radix algorithms, structure theorems are presented along with their diagrammatical representation, which bear the essential message for the structural approach towards the construction of various vector radix FFT algorithms. The applications of the theorems and the logic diagram are demonstrated by various examples, including the 2-D vector split-radix DIF FFT algorithm. Comparative studies of vector radix FFTs and the hardware implementation of vector radix FFTs using the FDP A41102 are also presented in this chapter.

The structural approach is extended to multidimensional vector radix FFT algorithms of higher dimension in Chapter Four. A recursive symbol system, which makes the derivation of multidimensional vector radix FFTs from 1-D FFTs a systematic, straightforward and error-free procedure, is presented for the logic diagram representation of vector radix FFTs.

The second part of this thesis consists of study results on the fast computation of 2-D discrete cosine transforms, their application to the transform coding of real-time images, and error analysis of various direct fast DCT algorithms for image coding purposes using floating-point computation.

A brief introduction to multidimensional DCTs is presented in Chapter Five. Two vector radix direct fast DCT algorithms are constructed using the structural approach and presented in Chapter Six. The arithmetic complexity of various direct fast DCT algorithms is also discussed in this chapter. In Chapter Seven, hardware implementations of 2-D DCTs for real-time image coding are discussed using dedicated VLSI DCT processors, digital signal processors, fast multiplier/accumulators and the newly released FDP A41102. The effects of finite-word-length computation for fast DCT algorithms are studied using floating-point arithmetic, in comparison with the direct matrix multiplication method, and simulation results are presented in Chapter Eight. In conclusion, Chapter Nine summarizes the main approach taken, the contributions made by this thesis and future aspects of research.

Preliminary material on the tensor (or Kronecker) product and logic diagrams is presented in appendices, together with a short proof of the structure theorems, the vector radix direct fast DCT algorithm based on Lee's method, and derivations of various combined vector radix FFT algorithms.

1-6 Publications, Submitted Papers and Internal Technical Reports

[1-6.1] H.R. Wu and F.J. Paoloni, "The Structure of Vector Radix Multidimensional Fast Fourier Transforms", ISSPA 87, Signal Processing: Theories, Implementations and Applications, pp.89-92, August 1987.

[1-6.2] H.R. Wu and F.J. Paoloni, "The Structure of Vector Radix Fast Fourier Transforms", IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol.37, pp.1415-1424, September 1989.

[1-6.3] H.R. Wu and F.J. Paoloni, "On the Two Dimensional Vector Split-Radix FFT Algorithm", IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol.37, pp.1302-1304, August 1989.

[1-6.4] H.R. Wu and F.J. Paoloni, "Structured Vector Radix FFT Algorithms and Hardware Implementation", Journal of Electrical and Electronics Engineering, Australia, September 1990.

[1-6.5] H.R. Wu and F.J. Paoloni, "A Two Dimensional Fast Cosine Transform Algorithm—A Structural Approach", Proceedings of the IEEE International Conference on Image Processing, pp.50-54, Singapore, September 1989.

[1-6.6] H.R. Wu and F.J. Paoloni, "A 2-D Fast Cosine Transform Algorithm Based on Hou's Approach", IEEE Transactions on Acoustics, Speech, and Signal Processing, to appear in June 1991.

[1-6.7] H.R. Wu, F.J. Paoloni and W. Tan, "Implementation of 2-D DCT for Image Coding Using FDP™ A41102", Proceedings of the Conference on Image Processing and the Impact of New Technologies, pp.35-38, Canberra, December 1989.

[1-6.8] H.R. Wu and F.J. Paoloni, "A Structural Approach to Two Dimensional Direct Fast Discrete Cosine Transform Algorithms", Proceedings of the International Symposium on Computer Architecture & Digital Signal Processing, pp.358-362, Hong Kong, October 1989.

[1-6.9] H.R. Wu and F.J. Paoloni, "The Impact of the VLSI Technology on the Fast Computation of Discrete Cosine Transforms for Image Coding", to be submitted.

[1-6.10] H.R. Wu and F.J. Paoloni, "A Perspective on Vector Radix FFT Algorithms of Higher Dimensions", Proceedings of the IASTED International Symposium on Signal Processing & Digital Filtering, June 1990.

[1-6.11] H.R. Wu and F.J. Paoloni, "Implementation of 2-D Vector Radix FFT Algorithms Using the Frequency Domain Processor A41102", Proceedings of the IASTED International Symposium on Signal Processing & Digital Filtering, June 1990.

(Internal Technical Reports)

[1-6.12] H.R. Wu and F.J. Paoloni, "Comparison Study on Software and Hardware Implementation of Various Fast Discrete Cosine Transform Algorithms", Technical Report-1, the University of Wollongong-Telecom Research Laboratories (Australia) R&D Contract for the Study of Fast Implementations of Discrete Cosine Transform Coding Systems, under No.7066, June 1989.

[1-6.13] H.R. Wu and F.J. Paoloni, "Simulation Study on the Effects of Finite-Word-Length Calculations for Fast DCT Algorithms", Technical Report-2, the University of Wollongong-Telecom Research Laboratories (Australia) R&D Contract for the Study of Fast Implementations of Discrete Cosine Transform Coding Systems, under No.7066, October 1989.

[1-6.14] H.R. Wu and F.J. Paoloni, "Comparison Study on Software and Hardware Implementation of Various Fast Discrete Cosine Transform Algorithms", Addendum of Technical Report-1, the University of Wollongong-Telecom Research Laboratories (Australia) R&D Contract for the Study of Fast Implementations of Discrete Cosine Transform Coding Systems, under No.7066, November 1989.

PART I.

MULTIDIMENSIONAL DISCRETE FOURIER TRANSFORMS

CHAPTER TWO: 1-D DISCRETE FOURIER TRANSFORM AND FAST FOURIER TRANSFORM ALGORITHMS

In this thesis, multidimensional vector radix FFT algorithms [42-44] based on the Cooley-Tukey method [22] are considered in detail. Although the 1-D Cooley-Tukey FFT algorithm and many others have been well studied and understood, they are included here for the purpose of understanding the structure of the m-D VR FFT, the evolution of VR FFTs from the 1-D FFT, and even the 1-D FFT itself. The more that is understood about the 1-D FFT, the more easily the knowledge of m-D VR FFT algorithms can be expanded. The matrix and logic diagram representations of the 1-D FFT algorithm form the foundation on which the m-D VR FFTs are built.

After defining the 1-D discrete Fourier transform, the matrix forms for the 1-D FFT are introduced. An examination as to why FFTs can achieve better computational efficiency in different ways is then presented.

2-1 Definitions

The Discrete Fourier Transform (DFT), X(k), of a vector x(m) of length N is defined [22, 47] as follows:

    X(k) = Σ_{m=0}^{N-1} x(m) W_N^{km},    k = 0, 1, ..., N-1,    (2-1-1)

where W_N = exp(-j2π/N) and j = √(-1).

The inverse DFT (IDFT) is given by:

    x(m) = (1/N) Σ_{k=0}^{N-1} X(k) W_N^{-km},    m = 0, 1, ..., N-1.    (2-1-2)

The derivation or development of the DFT from its corresponding continuous Fourier Transform (FT) can be found in [39, 47].
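As a point of reference, Equations (2-1-1) and (2-1-2) transcribe directly into code. The sketch below is illustrative only (not thesis code) and is O(N²) by design, making the N² complex multiplications of the direct method explicit.

```python
import cmath

def dft(x):
    """Direct evaluation of Equation (2-1-1): N^2 complex multiplications."""
    N = len(x)
    W = cmath.exp(-2j * cmath.pi / N)          # W_N = exp(-j*2*pi/N)
    return [sum(x[m] * W ** (k * m) for m in range(N)) for k in range(N)]

def idft(X):
    """Direct evaluation of Equation (2-1-2), including the 1/N scaling."""
    N = len(X)
    W = cmath.exp(-2j * cmath.pi / N)
    return [sum(X[k] * W ** (-k * m) for k in range(N)) / N for m in range(N)]
```

Applying `idft` after `dft` recovers the original sequence, which is the content of Equation (2-1-2).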

In their matrix forms, the DFT and IDFT are defined by the following equations:

    X = W_N x    (2-1-3)

and:

    x = D_N W_N^{-} X,    (2-1-4)

where X = [X(0), X(1), ..., X(N-1)]^T, x = [x(0), x(1), ..., x(N-1)]^T, W_N is an N*N matrix with W_N(k,m) = W_N^{km} for 0 ≤ k, m ≤ N-1, D_N = diag[1/N, ..., 1/N], and W_N^{-} is defined by W_N^{-}(m,k) = W_N^{-km} for 0 ≤ k, m ≤ N-1. All matrices are of size N*N. W_N is often called the 1-D DFT matrix.

2-2 Matrix Representations for 1-D Cooley-Tukey FFT Algorithms

If the direct matrix operation is used, the computation of a 1-D DFT as defined by Equation (2-1-3) needs N² complex multiplications and N² − N complex additions, provided the input sequence is complex. Fast algorithms are therefore needed to reduce the arithmetic complexity and the computation time.

In this section, the 1-D Cooley-Tukey FFT algorithm is first presented in the traditional manner, followed by the customary graphical and matrix representations of the algorithm and its computation structure. The matrix and diagrammatical forms used throughout the thesis for both DIT and DIF FFT algorithms are then presented, with their differences and relationships explained.

Using the Cooley-Tukey method or the Decimation-In-Time FFT algorithm [22, 24, 47], set:

    k = k1·N' + k0;    m = m1·r + m0;    (2-2-1)

where N' = N/r, k1, m0 = 0, 1, ..., r-1 and k0, m1 = 0, 1, ..., N'-1.

Then Equation (2-2-2) is derived from Equation (2-1-1):

    X(k1,k0) = Σ_{m0=0}^{r-1} [ Σ_{m1=0}^{N'-1} x(m1,m0) W_{N'}^{k0·m1} ] W_N^{k0·m0} W_r^{k1·m0}    (2-2-2)

The four steps of the algorithm to calculate Equation (2-2-2) are:

Step 1: The second Butterfly (BF2), or the shorter length DFTs:

    x'1(k0,m0) = Σ_{m1=0}^{N'-1} x(m1,m0) W_{N'}^{k0·m1}    (2-2-3a)

Step 2: The Twiddling Multiplications (TM):

    x1(k0,m0) = x'1(k0,m0) W_N^{k0·m0}    (2-2-3b)

Step 3: The first Butterfly (BF1) of radix-r:

    x2(k0,k1) = Σ_{m0=0}^{r-1} x1(k0,m0) W_r^{k1·m0}    (2-2-3c)

Step 4: The unscrambling:

    X(k1,k0) = x2(k0,k1)    (2-2-3d)

The algorithm shows that the original 1-D DFT of length N can be calculated by r DFTs of length N' (shorter than N), whereby a computational saving is achieved.
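The four steps translate directly into code. The sketch below is an illustration, not thesis code: it performs one decimation stage for a general radix r, with Step 1 evaluated as a direct DFT, whereas a full FFT would apply the same stage to it recursively.

```python
import cmath

def ct_stage(x, r):
    """One DIT Cooley-Tukey stage, Equations (2-2-3a)-(2-2-3d):
    an N-point DFT built from r DFTs of length N' = N/r."""
    N = len(x)
    Np = N // r                                   # N'
    W = lambda n, e: cmath.exp(-2j * cmath.pi * e / n)
    # Step 1 (BF2): length-N' DFTs over m1, one for each m0
    x1p = [[sum(x[m1 * r + m0] * W(Np, k0 * m1) for m1 in range(Np))
            for m0 in range(r)] for k0 in range(Np)]
    # Step 2 (TM): twiddling multiplications W_N^(k0*m0)
    x1 = [[x1p[k0][m0] * W(N, k0 * m0) for m0 in range(r)] for k0 in range(Np)]
    # Step 3 (BF1): radix-r butterflies over m0
    x2 = [[sum(x1[k0][m0] * W(r, k1 * m0) for m0 in range(r))
           for k1 in range(r)] for k0 in range(Np)]
    # Step 4: unscrambling X(k1*N' + k0) = x2(k0, k1)
    return [x2[k % Np][k // Np] for k in range(N)]
```

The result is independent of the radix chosen, which is what allows the decimation to be repeated on the shorter DFTs of Step 1.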

There have been many attempts to interpret the FFT by finding both mathematical and graphical expressions for the algorithm [24, 25, 31, 39, 47]. The Mason flow graph [6] is one such attempt, and its introduction has been of great help in interpreting and presenting the FFT. Figure-1 shows a Mason signal flow graph representing the 8-point DIT Cooley-Tukey FFT algorithm, where N = 8, the input is in natural numerical order and the output in bit-reversed order. By bit-reversed order, it is meant that if (n2 n1 n0)_b is the binary representation of the natural decimal index (n)_d, its decimal index (nr)_d in bit-reversed order will be (n0 n1 n2)_b. For example, if (n)_d = (4)_d = (100)_b, then (nr)_d = (001)_b = (1)_d. Thus the bit-reversed order of the sequence (0, 1, 2, 3)_d, i.e. (00, 01, 10, 11)_b, is (00, 10, 01, 11)_b = (0, 2, 1, 3)_d.
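Bit reversal is easy to mechanize; the helper below is my own illustration (not thesis code) and reverses the low-order bits of an index.

```python
def bit_reverse(n, bits):
    """Reverse the low `bits` bits of n, e.g. (100)b -> (001)b."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (n & 1)  # shift the low bit of n into r
        n >>= 1
    return r

order = [bit_reverse(n, 3) for n in range(8)]
# order == [0, 4, 2, 6, 1, 5, 3, 7]
```

For N = 8 this reproduces the output ordering of the flow graph in Figure-1.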

[Figure-1: Mason signal flow graph of the 8-point DIT Cooley-Tukey FFT algorithm.]

It is assumed that decimal indexing is used unless explicitly indicated otherwise.

In Figure-1, the signal flow graph shows vividly the construction of the algorithm, including the butterflies (BF), twiddles (TM) and the unscrambling, and the number of operations saved, noting that W_8^0 = 1 and W_8^2 = -j. The relationships between the long length DFT and the shorter length DFTs when the algorithm is applied can also be described [25].

An alternative representation uses matrices to give a mathematical interpretation and explanation of the algorithm. The FFT algorithm can be constructed by matrix decomposition (or factorization) [39, 47], and its roots lie in algebra [30, 33]. In [30], the matrix decomposition which underlies the Cooley-Tukey algorithm is shown to be:

    W_N = B_1 T_1 B_2 T_2 ... B_{m-1} T_{m-1} B_m,    (2-2-4)

assuming the length of the DFT N = 2^m. B_i (i = 1, ..., m) represents a butterfly stage which evolves from 2-point DFTs [30]; each B_i needs only N complex additions to calculate. T_i (i = 1, ..., m-1) is a diagonal matrix (representing the twiddling multiplications) with half of its elements being 1 (or trivial), i.e., only N/2 complex multiplications are needed to work out each T_i. This matrix form describes the Cooley-Tukey algorithm both precisely and concisely, and the computational efficiency is also very easy to evaluate.

The matrix form adopted in this thesis is virtually the same as that of Blahut [31], and its indexing scheme follows the traditional Cooley-Tukey presentation [22, 47]. The matrix equations for the 1-D radix-2 DIT FFT algorithm on the 8-point 1-D DFT (i.e., r = 2, N = 8 and N' = 4) are presented as follows:

(BF2:)

    [x'1(0,m0)]   [1  1   1   1] [x(0,m0)]
    [x'1(2,m0)] = [1  1  -1  -1] [x(2,m0)]    (2-2-5a)
    [x'1(1,m0)]   [1 -1  -j   j] [x(1,m0)]
    [x'1(3,m0)]   [1 -1   j  -j] [x(3,m0)]

(TM:)

    [x1(k0,0)]   [1  0       ] [x'1(k0,0)]
    [x1(k0,1)] = [0  W_N^{k0}] [x'1(k0,1)]    (2-2-5b)

(BF1:)

    [x2(k0,0)]   [1   1] [x1(k0,0)]
    [x2(k0,1)] = [1  -1] [x1(k0,1)]    (2-2-5c)

(Unscrambling:)

    [X(0,k0)]   [x2(k0,0)]
    [X(1,k0)] = [x2(k0,1)]    (2-2-5d)

In Equation (2-2-5a), there are two length-four DFTs (for m0 = 0 and m0 = 1, respectively), which can be further decimated using this recursive algorithm.
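The matrix equations of (2-2-5) can be executed literally as a check. The NumPy sketch below is illustrative only (the names are mine, not the thesis's): it carries the four equations out for the 8-point case, with row and column indices in the 0, 2, 1, 3 order used above.

```python
import numpy as np

W8 = np.exp(-2j * np.pi / 8)
# (2-2-5a): the 4-point DFT matrix with indices ordered 0, 2, 1, 3
BF2 = np.array([[1,  1,   1,   1],
                [1,  1,  -1,  -1],
                [1, -1, -1j,  1j],
                [1, -1,  1j, -1j]])

def dit8(x):
    """8-point DIT FFT following the matrix equations (2-2-5)."""
    x = np.asarray(x, dtype=complex)
    order = [0, 2, 1, 3]
    xp = np.empty((4, 2), dtype=complex)
    for m0 in range(2):                  # two length-4 DFTs (m0 = 0, 1)
        xp[:, m0] = BF2 @ x[[m1 * 2 + m0 for m1 in order]]
    X = np.empty(8, dtype=complex)
    for i, k0 in enumerate(order):
        a, b = xp[i, 0], xp[i, 1] * W8 ** k0     # (2-2-5b) twiddles
        X[k0], X[4 + k0] = a + b, a - b          # (2-2-5c) and unscrambling
    return X
```

The function agrees with a direct evaluation of the 8-point DFT, confirming the stage decomposition.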

The algorithm is described by a logic diagram in Figure-2, which is similar to the Mason signal flow graph in Figure-1. There is a correspondence between the matrix form of the butterflies and twiddles and its signal flow graph. As shown in Figure-2, the logic diagram consists of one stage of radix-4 DIT FFT Butterflies (BF), a stage of radix-2 BF, and a Twiddling Multiplication (TM) stage between the radix-4 and radix-2 BF stages, represented by DIT TM-4*2, where a = exp(-jπ/4). Other symbols are defined in Appendix A. It can be seen that there are two radix-4 butterflies in the radix-4 BF stage, and each radix-4 butterfly can be implemented with no multiplication and only eight additions. The difference between Equation (2-2-4) and Equation (2-2-5) is that the former, which is a form that underlies the Cooley-Tukey algorithm [30], is a top-down overall matrix decomposition method, while the latter is a representation of the algorithm

itself. From the matrix form point of view, it is a bottom-up approach. Equation (2-2-4)

can be seen as a mathematical representation of the Cooley-Tukey algorithm: it explains "what" and "why". But when it comes to "how", i.e., how to derive an algorithm, the "decimation" method has been predominantly preferred [5, 6, 22, 24, 25, 36, 37, 43, 68]. Equation (2-2-5) is a concise representation of the Cooley-Tukey algorithm. Further decimation can proceed on Equation (2-2-5a), since for each fixed value of m0 the equation itself is a half-length DFT. It is easy to see that the matrix form used in this thesis is similar to that used by Blahut, but with a reordered data sequence and a different indexing scheme.

Similarly, a matrix form of equations can be introduced for the Decimation-In-Frequency (DIF) Cooley-Tukey FFT algorithm. Assuming N = r·N', set:

    k = k1·r + k0;    m = m1·N' + m0;

where k0, m1 = 0, 1, ..., r-1 and k1, m0 = 0, 1, ..., N'-1.

Given N = 8, N = 2·N' and N' = 4, the matrix equations for the 1-D radix-2 DIF FFT algorithm on the 8-point DFT are presented as follows:

(BF1:)

    [x1(0,m0)]   [1   1] [x(0,m0)]
    [x1(1,m0)] = [1  -1] [x(1,m0)]    (2-2-6a)

(TM:)

    [x1'(0,m0)]   [1  0       ] [x1(0,m0)]
    [x1'(1,m0)] = [0  W_N^{m0}] [x1(1,m0)]    (2-2-6b)

(BF2:)

    [x2(k0,0)]   [1  1   1   1] [x1'(k0,0)]
    [x2(k0,2)] = [1  1  -1  -1] [x1'(k0,2)]    (2-2-6c)
    [x2(k0,1)]   [1 -1  -j   j] [x1'(k0,1)]
    [x2(k0,3)]   [1 -1   j  -j] [x1'(k0,3)]

(Unscrambling:)

    [X(0,k0)]   [x2(k0,0)]
    [X(2,k0)] = [x2(k0,2)]    (2-2-6d)
    [X(1,k0)]   [x2(k0,1)]
    [X(3,k0)]   [x2(k0,3)]

A logic diagram used to perform this 8-point DFT is shown in Figure-3.
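The DIF equations can be exercised in the same way. The sketch below is illustrative only (not thesis code): it applies the (2-2-6a) butterflies and the (2-2-6b) twiddles, then finishes the remaining 4-point DFTs of (2-2-6c) directly.

```python
import numpy as np

def dif8(x):
    """8-point DIF FFT following the matrix equations (2-2-6)."""
    x = np.asarray(x, dtype=complex)
    W8 = np.exp(-2j * np.pi / 8)
    # (2-2-6a) BF1: butterflies between x(m0) and x(m0 + 4)
    x1 = np.array([x[:4] + x[4:], x[:4] - x[4:]])   # rows indexed by k0
    # (2-2-6b) TM: twiddle the k0 = 1 branch by W_8^m0
    x1[1] *= W8 ** np.arange(4)
    # (2-2-6c, d) BF2: a 4-point DFT over m0 on each branch,
    # unscrambled as X(k1*2 + k0) = x2(k0, k1)
    X = np.empty(8, dtype=complex)
    for k0 in range(2):
        for k1 in range(4):
            X[k1 * 2 + k0] = sum(x1[k0][m0] * np.exp(-2j * np.pi * k1 * m0 / 4)
                                 for m0 in range(4))
    return X
```

Comparing with the DIT version, the butterflies and twiddles appear in the reverse order, which is the essential difference between the two decimations.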

Because of the binary organization of digital computers, algorithms for N not a power of 2 have received less attention, although the Cooley-Tukey algorithm can be applied to a DFT of length N for any composite N [21, 24]. Hereafter in this thesis, most of the discussion will be on DFTs whose length is a power of 2.

2-3 Computational Considerations

Computational considerations form a sophisticated problem. The majority of the work undertaken in the search for fast algorithms for DFTs has gone into reducing the computational complexity [34, 39, 48, 49]. On most general purpose computers, multiplications are more expensive than additions. As a result, especially in the early years of research on FFT algorithms, much work was carried out in reducing the number of multiplications, some of which was accomplished at the expense of additions, in-place computation and, most of all, regular computing structure. A second consideration was reducing roundoff errors [25], and research in this direction remains very much alive and up to date [50, 51]. In-place computation has also drawn a lot of attention from researchers [39, 52]. Regular structure is yet another factor which is important to both software and hardware implementation of FFT algorithms [14, 34, 41, 53, 54]. A fast algorithm may lose its initial momentum due to its lack of in-place computation or regular computing structure, which dramatically increases the bookkeeping task, and it may be placed in a disadvantageous position after all [39, 41, 154]. All of the above should be considered and balanced to devise or choose an FFT algorithm for a specific application.

Another point to be considered is that the advantage of an improved algorithm can be

wasted if the computer hardware (or the digital signal processor) cannot take advantage of it [31]. This aspect shall be discussed in the second part of the thesis where the hardware implementation of 2-D DCTs is considered.

The computational complexity in terms of operations is usually evaluated in two ways, i.e., by examining the mathematical equations or by looking at the logic diagrams (the signal flow graphs). There are many interpretations as to why the FFT algorithm can be fast. Mathematically speaking, as the 1-D DFT is a summation of weighted inputs with the weight being a periodic function, it can be evaluated by a clever insertion of parentheses to reduce the number of additions and multiplications, thus becoming an algebraic exercise. The number of multiplications can be further reduced by locating the trivial multiplications such as ±1 and ±j in regular positions [31, 45]. This can also be explained using the logic diagram, which gives an engineering interpretation. Examine a 4-point DFT, for example. The algorithm represented by Figure-4-(a) is equivalent to the direct matrix operation; as a result, 12 additions and 16 multiplications are needed to complete the transform. By close examination, it is found that use can be made of the periodicity of the weighting function, as shown in Figure-4-(b), to group those inputs with the same weights and so reduce the number of additions to 8 in Figure-4-(c). It can be noted that W_4^0 = 1, which does not require multiplication, and that W_4^1 can be moved to the right of the summation symbol, which further reduces the number of multiplications by 1. Using conventional presentation, the algorithm represented by the logic diagram is given by Figure-4-(d). Similarly, W_4^1 = -j, which can be calculated without multiplication. The final fast algorithm for the 4-point DFT needs only 8 additions, as shown in Figure-4-(e).
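The endpoint of this reduction is small enough to write out in full. The sketch below is my own illustration of the Figure-4-(e) form (not thesis code): it computes the 4-point DFT with eight complex additions and no true multiplications, since multiplying by -j only swaps and negates real and imaginary parts.

```python
def fft4(x):
    """4-point DFT in the fully reduced form of Figure-4-(e):
    8 complex additions, no general multiplications."""
    t0 = x[0] + x[2]            # first butterfly stage: 4 additions
    t1 = x[0] - x[2]
    t2 = x[1] + x[3]
    t3 = x[1] - x[3]
    return [t0 + t2,            # X(0)    second stage: 4 more additions
            t1 - 1j * t3,       # X(1) = t1 + (-j)*t3
            t0 - t2,            # X(2)
            t1 + 1j * t3]       # X(3)

fft4([1, 2, 3, 4])
# → [10, (-2+2j), -2, (-2-2j)]
```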

Logic diagrams can also be used to explain why one algorithm can be faster than another, in terms of the number of multiplications, when different radices are used. Take a 16-point 1-D DFT, for example. In Figure-5, a mixed radix-2 and radix-8 FFT algorithm is used, so that 10 multiplications are required to complete the transform. If the twiddles between the radix-2 and radix-8 butterflies are moved to the right and combined with the twiddles inside the radix-8 butterfly, a new group of twiddles is formed as shown in

[Figure-4-(a) to Figure-4-(e): logic diagrams showing the stepwise derivation of the fast 4-point DFT algorithm.]

[Figure-5: logic diagram of a 16-point mixed radix-2 and radix-8 FFT algorithm.]

[Figure-6: logic diagram of the 16-point radix-4 FFT algorithm obtained by combining the twiddles.]

Figure-6. As a result, only 8 multiplications are needed to perform the same 16-point DFT. As a matter of fact, Figure-6 represents the radix-4 FFT algorithm. According to Richard [55], the best choice of algorithm depends strongly on whether the execution speed is dominated by: (1) multiply time; (2) multiply and addition time equally; or (3) butterfly time. From the logic diagram it can be seen that, when multiply time dominates, optimizing an FFT algorithm amounts to finding the best locations for the twiddle factors so that the number of multiplications is minimal.

2-4 Summary

In this chapter, both 1-D Decimation-In-Time and Decimation-In-Frequency FFT algorithms have been examined using two forms, namely the matrix representation and the logic diagram. These two forms will be used throughout this thesis as the basis for the derivation of multidimensional vector radix FFT algorithms. The logic diagram is equivalent to the traditional signal flow graph, but it can be generalized into a multidimensional form without complication or misperception. It has also been demonstrated that alternative algorithms can be derived using logic diagrams alone. In other words, the logic diagram is not only a form of representation of algorithms but can also be used to derive new algorithms; it can be used as a stand-alone engineering tool.

CHAPTER THREE: 2-D DFT AND 2-D FFT ALGORITHMS

3-1 Introduction to 2-D Discrete Fourier Transforms

It is well known that when the sample length of the convolution for image filtering or the correlation for template matching is long, the FFT approach is faster than the original domain approach [8]. In most multidimensional applications, this is usually the case. Although there are many approximation methods in the original domain which perform multidimensional processing in real-time and have achieved reasonably good results, it is believed that multidimensional FFT will still have a place in the field, in theoretical analysis as well as in practical applications.

A 2-D DFT problem can arise from 2-D signal processing applications or can result from mathematical manipulations [1, 57]. The study of fast algorithms for 2-D DFTs is one of the practical concerns in multidimensional (m-D) digital signal processing based on contemporary technology, especially computer technology, and is also the first step to studying multidimensional fast Fourier transform algorithms. Many properties of and applications for 2-D Fourier transforms can be extended directly to multidimensional Fourier transforms [31].

At present, however, it must be admitted that the requirements of most real-time multidimensional DFT applications cannot be satisfied using ordinary methods [1, 8, 31, 39, 47, 63], even with the most advanced digital signal processors [13, 26, 40] or supercomputers [129, 130], due to the large amount of data that needs to be processed. To improve the situation, good multidimensional FFT algorithms and the greater computing power provided by the development of application specific VLSI technology in the computer industry are needed. This has motivated research in both theory and technology. On the theoretical front, apart from the traditional row-column approach, there are reports of many multidimensional FFT methods: the vector radix FFT algorithms [31, 42-45], multidimensional DFTs by polynomial transforms [58], multidimensional prime factor FFT algorithms [32], multidimensional Winograd Fourier transforms [35], the multidimensional Number Theoretic Transform algorithm [59], vector split-radix FFT algorithms [36, 37, 60] and, recently, vector FFT algorithms developed for vector computers [61] and supercomputers [129, 130]. A good mathematical explanation of the different multidimensional FFT algorithms can be found in [1, 30, 31, 33].

On the technology front, apart from various Digital Signal Processors (DSPs), especially the Zoran Vector Signal Processor (VSP) [13], which can perform the FFT, the successful fabrication of radix-2 and radix-4 butterflies has been reported [56], as well as full length FFT processors [142]. Recently, a full length FFT processor which computes up to a 256-point complex DFT in 102.4 µs has been fabricated and demonstrated [14, 15], and is now commercially available [26, 27]. Many proposals have been made to implement DFTs and FFTs using systolic array processors [140, 141, 145, 147] and VLIW [129] or SIMD [130] supercomputers. The neural net implementation of FFTs is still in its early stage [138]. A VLSI architecture has been proposed and designed, using GE 3-µm CMOS technology and the vector radix-2*2 FFT algorithm, for rasterizing the 2-D DFT of size N*N at video speed [134].

From the beginning, research on the fast computation of DFTs has followed criteria to evaluate the effectiveness of an FFT algorithm: an effective FFT algorithm should be computationally efficient in terms of the number of operations, it should reduce roundoff errors, and it should possess in-place computation and a regular structure. Ignoring the last two points will increase the bookkeeping burden, with the disadvantages which the bookkeeping task may cause. According to the above criteria, although algorithms based on the Cooley-Tukey method usually need more multiplications than many reduced-multiplication FFT algorithms [39, 41, 58], they have obvious advantages over the rest by the last three criteria. This is also true of the multidimensional vector radix FFT algorithms [50, 51].

In this part of the thesis, 2-D Vector Radix (VR) FFT algorithms, which are multidimensional extensions of the 1-D Cooley-Tukey algorithms, are considered. The Cooley-Tukey FFT algorithm is of historical importance in modern digital signal processing. It is still among the most widely used algorithms, in both software and hardware, including VLSI implementations of DFTs, because of its regular structure and many other computational advantages. The fact that an algorithm has a regular structure is very crucial in VLSI implementation. The CSIRO-designed AUSTEK FDP™ A41101 and A41102 FFT processors exploit the good structure provided by the Cooley-Tukey algorithm to form a pipeline architecture and to achieve one of the fastest FFT processing speeds on record [14, 15, 142].

The vector radix FFT algorithm was first conceived by Rivard [42], further developed by Harris, McClellan, Chan and Schuessler [43] and Arambepola [44], and unified by Mersereau and Speake [62]. The vector radix FFT algorithm is a straightforward extension of the 1-D Cooley-Tukey algorithm, and it is more efficient in terms of the number of multiplications than its row-column 1-D counterpart. A VLSI architecture using the vector radix-2*2 FFT algorithm has been proposed and patented [134, 137] to show many of its advantages over the traditional row-column implementations. Nevertheless, it is less well known to electrical and computer engineers, and more often than not it is misunderstood and treated as a mathematically complicated and involved process. Many think that it is not worth the effort to use the VR FFT in real applications. This is only natural when all the struggle and strife of two decades ago in understanding and interpreting the Cooley-Tukey FFT algorithm [20, 23, 25, 47, 63, 64] is recalled.

On the other hand, there have been recent reports on the vector split-radix (VSR) FFT algorithms by Pei and Wu [36], and Mou and Duhamel [37]. The derivation of these "strangely" split vector radix algorithms is rather complicated, and the final results are difficult to appreciate because of the direct derivation approach used.

In order to extend different 1-D FFT algorithms to higher dimensions whilst avoiding confusion and tedious derivation, the structural features of the multidimensional FFT algorithms have to be examined. The structural features of the multidimensional Winograd Fourier transform algorithms have been studied extensively [30, 35], as have those of the prime factor algorithm [32]. Efficient as they are, the Winograd Fourier transform algorithm demands that the length of the DFT be a prime, and the prime factor fast Fourier transform algorithm requires the lengths of the short DFTs to be mutually prime [45, 154]. In essence, they tend to reduce the number of multiplications at the expense of the number of additions and, especially, of the regular computation structure. Furthermore, these structural features are described exclusively by matrix decomposition (or factorization), which is somewhat less attractive to engineers than a graphical representation such as the signal flow graph ("butterflies"). On the other hand, the structural features of the vector radix FFT algorithms have not been well examined, understood, or exploited.

In the history of the FFT algorithms, matrix decomposition has served as a method of interpreting fast algorithms [47] and as an alternative representation of the algorithms [39]. Algorithms are usually derived by decimation of the indices of the transform function and finally described by signal flow graphs, on which computer programs or VLSI architecture designs are based. Matrix representation has also been used as a tool for the construction of some FFT algorithms. It has been found that the construction of fast algorithms and algebra, of which matrix theory is one branch, are deeply related, although they are not the same subject [30].

When the dimensions increase, problems become more complex and are often difficult to comprehend, as explained in the previous chapter. Solutions become more flexible and there are many alternative approaches. When a new algorithm is derived, it is not always certain that the formulas are error free, and correcting them is not easy. More often than not, the effort spent in verifying a new algorithm equals the time taken to derive another new algorithm. This is why the study of a systematic and structural approach to the m-D FFT algorithms is justified.

Whenever a mathematical result is used for engineering applications, it results in a new algorithm. It may be further converted into software or hardware implementations, in which case the representation becomes important, so much so that it can make whatever it represents either a technological and industrial wave or a sinking leaf in the sea of research papers. There is no need to stress the significant role played by a group of researchers at MIT in providing electrical engineers with an engineering interpretation of the Cooley-Tukey FFT algorithm [23], as the so-called "butterfly" signal flow graph is a major consideration in all of the publications on FFT algorithms which are based on the Cooley-Tukey method. It has been used to represent many other fast transform algorithms as well [39]. Unfortunately, in most publications on m-D FFT algorithms, the graphical presentation is still in a 1-D form which can, more often than not, be overcomplicated.

Another thing which is missing in the 1-D graphical presentation of m-D algorithms is the connection between the m-D algorithm and its corresponding row-column 1-D algorithm. The graphical presentation which is going to be introduced in this chapter adopts the vector signal processing concept. It will become clear that this is a better form, and it can be used to explain the relationships between different m-D algorithms as well.

It is the purpose of this part of the thesis to establish general matrix representation forms for vector radix FFT algorithms. These general forms provide a structural approach to the construction of various VR FFT algorithms whilst still allowing appreciation of the simple and regular structures, such as those possessed by their 1-D counterparts, in addition to the computational improvement. The form that has been chosen is a combination of matrix representation and logic diagram, the latter being a multidimensional extension of the signal flow graph ("butterflies") for 1-D FFT algorithms. Matrices are used as a concise representation of an algorithm and of any particular structure within it, as well as a tool for deriving multidimensional VR FFTs from their corresponding 1-D algorithms. Logic diagrams independently provide another approach towards the derivation of m-D VR FFT algorithms and are used for computational considerations, software programming and hardware implementation. The relationships between the row-column FFT and the various VR FFTs can also be vividly interpreted or explained by these forms. The use of this structural approach makes the derivation and implementation of various VR FFT algorithms a straight-forward and structured procedure; otherwise it could be a tedious, untidy and potentially erroneous one. From a 1-D FFT algorithm based on Cooley-Tukey's concept, it is possible to derive its multidimensional VR FFT systematically. The properties, such as in-place calculation, symmetry and data order, and structures such as the butterfly, twiddling multiplications and unscrambling, are all preserved in a multidimensional context. This approach can also be extended to some other fast transform algorithms, and has been applied to the 2-D fast cosine transform, as explored in the second part of this thesis.

Another feature of multidimensional VR FFT algorithms worth mentioning is that VR FFTs have better fixed-point and floating-point error characteristics than both the row-column FFTs and the polynomial transform FFTs [50, 51]. Naturally, when the number of operations is reduced, the number of error sources is also reduced.

3-2 Definitions

The general N1*N2-point 2-D DFT and its inverse are defined as [1]:

$$X(k,l) = \sum_{m=0}^{N_1-1}\sum_{n=0}^{N_2-1} x(m,n)\, W_{N_1}^{mk} W_{N_2}^{nl} \qquad (3\text{-}2\text{-}1a)$$

and,

$$x(m,n) = \frac{1}{N_1 N_2}\sum_{k=0}^{N_1-1}\sum_{l=0}^{N_2-1} X(k,l)\, W_{N_1}^{-mk} W_{N_2}^{-nl} \qquad (3\text{-}2\text{-}1b)$$

where $W_N = \exp(-j2\pi/N)$; $k, m = 0,1,\ldots,N_1-1$; and $n, l = 0,1,\ldots,N_2-1$. In the following discussion it is assumed that $N_1$ and $N_2$ are powers of 2 to simplify the presentation.
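The definition can be checked numerically. The following sketch (Python with NumPy; the function name `dft2_direct` is illustrative only, not part of the thesis development) evaluates Equation (3-2-1a) by direct summation and compares the result with a library FFT:

```python
import numpy as np

def dft2_direct(x):
    """Evaluate Equation (3-2-1a) directly:
    X(k,l) = sum_m sum_n x(m,n) * W_N1^(m*k) * W_N2^(n*l),
    with W_N = exp(-j*2*pi/N)."""
    N1, N2 = x.shape
    m = np.arange(N1)
    n = np.arange(N2)
    W1 = np.exp(-2j * np.pi * np.outer(m, m) / N1)  # W_N1^(m*k)
    W2 = np.exp(-2j * np.pi * np.outer(n, n) / N2)  # W_N2^(n*l)
    return W1 @ x @ W2  # sums over m (rows) and n (columns)

x = np.random.default_rng(0).standard_normal((4, 8))
X = dft2_direct(x)
assert np.allclose(X, np.fft.fft2(x))
```

The direct evaluation costs O((N1 N2)^2) operations, which is precisely what the fast algorithms of this chapter avoid.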

The matrix forms of the 2-D DFT and its inverse are given as follows:

$$\mathbf{X} = \mathbf{W}^2 \mathbf{x} \qquad (3\text{-}2\text{-}2a)$$

and,

$$\mathbf{x} = \frac{1}{N_1 N_2}\, \mathbf{W}^{2*} \mathbf{X} \qquad (3\text{-}2\text{-}2b)$$

where $\mathbf{X}$ is an $N_1 N_2$-element column vector formed by stacking the transposed row vectors of the 2-D output array, $\mathbf{x}$ is likewise an $N_1 N_2$-element column vector formed by stacking the transposed row vectors of the 2-D input array, $\mathbf{W}^2 = W_{N_1} \otimes W_{N_2}$, $\mathbf{W}^{2*} = W_{N_1}^{*} \otimes W_{N_2}^{*}$ (the asterisk denoting complex conjugation), and $\otimes$ stands for the tensor (or Kronecker) product. Another matrix form for the definition of the 2-D DFT for general periodically sampled signals is given by Mersereau and Speake [62].

$$X(\mathbf{k}) = \sum_{\mathbf{n}\in I_N} x(\mathbf{n}) \exp\!\left[-j\mathbf{k}^T (2\pi \mathbf{N}^{-1}) \mathbf{n}\right], \qquad \mathbf{k} \in J_N \qquad (3\text{-}2\text{-}3a)$$

and,

$$x(\mathbf{n}) = \frac{1}{|\det \mathbf{N}|} \sum_{\mathbf{k}\in J_N} X(\mathbf{k}) \exp\!\left[\,j\mathbf{k}^T (2\pi \mathbf{N}^{-1}) \mathbf{n}\right], \qquad \mathbf{n} \in I_N \qquad (3\text{-}2\text{-}3b)$$

where $\mathbf{N}$ is a periodicity matrix, and $I_N$ and $J_N$ are the regions on which $x(\mathbf{n})$ and $X(\mathbf{k})$ are supported, respectively [1, 62]. A special case of these two regions is the rectangular one, which is commonly used.
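For the rectangular special case, the periodicity matrix is diagonal and Equation (3-2-3a) reduces to the separable definition of Equation (3-2-1a). A small numerical sketch (Python/NumPy, illustrative only) confirms this reduction:

```python
import numpy as np

# For the rectangular case N = diag(N1, N2), the general 2-D DFT
# X(k) = sum_n x(n) exp(-j * k^T (2*pi*N^{-1}) n) of Equation (3-2-3a)
# coincides with the ordinary row-column 2-D DFT of Equation (3-2-1a).
N1, N2 = 4, 4
Ninv = np.linalg.inv(np.diag([N1, N2]))
x = np.random.default_rng(1).standard_normal((N1, N2))

X = np.zeros((N1, N2), dtype=complex)
for k in range(N1):
    for l in range(N2):
        for m in range(N1):
            for n in range(N2):
                kv = np.array([k, l])
                nv = np.array([m, n])
                X[k, l] += x[m, n] * np.exp(-1j * (kv @ (2 * np.pi * Ninv) @ nv))

assert np.allclose(X, np.fft.fft2(x))
```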

Yet another definition can be given to the 2-D DFT in the form of matrix row and column operations [3]. In this form, the input and its DFT sequences are both 2-D matrices. Although the row and column matrix operations are the most familiar, this form seems difficult to extend beyond the 2-D DFT. When it comes to the derivation of 2-D FFT algorithms, such as the vector radix FFT algorithms, this form is not as convenient to use as the others.

The definition given by Equation (3-2-3) is a very concise mathematical representation of the 2-D (m-D) DFT. Based on this definition, FFT algorithms can be devised for rectangularly or hexagonally sampled signals, or for signals sampled on an arbitrary periodic grid in either the spatial or the Fourier domain. The relationships between the existing m-D FFT algorithms based on the Cooley-Tukey scheme are also well explained in this form. However, this form helps little in showing how to derive vector radix FFT algorithms when the corresponding 1-D FFTs are known.

The definitions given by Equations (3-2-1) and (3-2-2) are by far the most commonly used presentations of m-D DFTs [2, 9, 30, 31, 35, 37, 65, 66], Equation (3-2-2) being a direct matrix representation of Equation (3-2-1).

3-3 Row-Column FFT Algorithms

The relationship between the multidimensional DFT and the 1-D DFTs can be expressed by the Kronecker product in matrix form [30, 35], i.e., the multidimensional DFT matrix $\mathbf{W}^2$ is given by

$$\mathbf{W}^2 = W_{N_1} \otimes W_{N_2} \qquad (3\text{-}3\text{-}1)$$

where $W_{N_i}$, $i = 1, 2$, represents the $N_i$-point 1-D DFT matrix.
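The Kronecker-product relation of Equation (3-3-1) can be verified numerically: applying $W_{N_1} \otimes W_{N_2}$ to the row-stacked input vector reproduces the 2-D DFT. The following sketch (Python/NumPy, illustrative only) demonstrates this:

```python
import numpy as np

def dft_matrix(N):
    """N-point 1-D DFT matrix with entries W_N^(m*k)."""
    idx = np.arange(N)
    return np.exp(-2j * np.pi * np.outer(idx, idx) / N)

N1, N2 = 4, 8
x = np.random.default_rng(2).standard_normal((N1, N2))

# Equation (3-3-1): the 2-D DFT matrix is W_{N1} (x) W_{N2}, acting on the
# N1*N2 column vector obtained by stacking the rows of the input array.
W2d = np.kron(dft_matrix(N1), dft_matrix(N2))
X_vec = W2d @ x.reshape(-1)  # row-major reshape = stacked rows

assert np.allclose(X_vec.reshape(N1, N2), np.fft.fft2(x))
```

Note that the row-major (C-order) reshape corresponds exactly to the stacking of transposed row vectors used in Equation (3-2-2).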

The first implication of Equation (3-3-1) for the 2-D DFT problem is the well-known row-column approach. If N1 = N2 = 16, the row-column radix-4 FFT can be used to calculate the 2-D DFT, as is shown in Figure-7. In Figure-7, the vectors x0 to x15 consist of the row elements of the 2-D input array, and X0 to X15 represent rows of the output array with elements in bit-reversed order. Each heavy line represents sixteen data lines, each of which carries an element from xi. The block inscribed R-16 FFT represents the 16-point DFT using the radix-4 FFT as given in Figure-6 and performs the row FFTs in the diagram. The addition block stands for vector addition and operates on elements from the same column of the two input vectors. Likewise, the -1 block and the -j block perform the corresponding operations on every element of the input vector. The part of Figure-7 to the right of the R-16 FFT blocks forms a radix-4 FFT structure. However, this structure operates on columns of the input array only, i.e., it performs the column FFTs. Using the logic diagram shown in Figure-7, the computation structure of the 2-D DFT is exceedingly clear.
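The row-column procedure of Figure-7 amounts to a 1-D FFT of every row followed by a 1-D FFT of every column of the intermediate array. A minimal sketch (Python/NumPy, illustrative only; the library FFT plays the role of the R-16 FFT blocks):

```python
import numpy as np

def rowcol_fft2(x):
    """Row-column 2-D FFT: 1-D FFTs along the rows, then 1-D FFTs along
    the columns of the intermediate array (the order may be swapped)."""
    rows_done = np.fft.fft(x, axis=1)     # the row FFTs (R-16 FFT blocks)
    return np.fft.fft(rows_done, axis=0)  # the column FFT structure

x = np.random.default_rng(3).standard_normal((16, 16))
assert np.allclose(rowcol_fft2(x), np.fft.fft2(x))
```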

3-4 Vector Radix FFT Algorithms

Instead of proceeding with decimation operations on each dimension separately (one after another) as the row-column method does, the vector radix FFT algorithm suggests that decimation be performed on all indices (or dimensions) simultaneously [42-44].

In the case where decimation-in-time is used on both indices of the 2-D DFT, assuming that $N_1 = r_1 N_1'$ and $N_2 = r_2 N_2'$, set:

$$k = k_1 N_1' + k_0; \quad m = m_1 r_1 + m_0; \quad l = l_1 N_2' + l_0; \quad n = n_1 r_2 + n_0$$

where $k_1, m_0 = 0,1,\ldots,r_1-1$; $k_0, m_1 = 0,1,\ldots,N_1'-1$; $l_1, n_0 = 0,1,\ldots,r_2-1$; and $l_0, n_1 = 0,1,\ldots,N_2'-1$. From Equation (3-2-1), Equation (3-4-1) is derived:

[Figure-7: Logic diagram of the 16*16-point 2-D DFT using the row-column radix-4 FFT algorithm.]

$$X(k_1,k_0;\,l_1,l_0) = \sum_{m_0=0}^{r_1-1}\sum_{n_0=0}^{r_2-1} W_{r_1}^{m_0 k_1} W_{r_2}^{n_0 l_1} \left[ W_{N_1}^{m_0 k_0} W_{N_2}^{n_0 l_0} \sum_{m_1=0}^{N_1'-1}\sum_{n_1=0}^{N_2'-1} x(m_1,m_0;\,n_1,n_0)\, W_{N_1'}^{m_1 k_0} W_{N_2'}^{n_1 l_0} \right] \qquad (3\text{-}4\text{-}1)$$

When decimation-in-frequency is applied along both indices, set:

$$k = k_1 r_1 + k_0; \quad m = m_1 N_1' + m_0; \quad l = l_1 r_2 + l_0; \quad n = n_1 N_2' + n_0$$

where $N_1 = r_1 N_1'$, $N_2 = r_2 N_2'$; $k_1, m_0 = 0,1,\ldots,N_1'-1$; $k_0, m_1 = 0,1,\ldots,r_1-1$; $l_1, n_0 = 0,1,\ldots,N_2'-1$; and $l_0, n_1 = 0,1,\ldots,r_2-1$.

Then from Equation (3-2-1), Equation (3-4-2) is derived:

$$X(k_1,k_0;\,l_1,l_0) = \sum_{m_0=0}^{N_1'-1}\sum_{n_0=0}^{N_2'-1} W_{N_1'}^{m_0 k_1} W_{N_2'}^{n_0 l_1} \left[ W_{N_1}^{m_0 k_0} W_{N_2}^{n_0 l_0} \sum_{m_1=0}^{r_1-1}\sum_{n_1=0}^{r_2-1} x(m_1,m_0;\,n_1,n_0)\, W_{r_1}^{m_1 k_0} W_{r_2}^{n_1 l_0} \right] \qquad (3\text{-}4\text{-}2)$$

Since more than one dimension can be decimated, different decimation schemes can be applied to different dimensions, which leads to a mixed decimation vector radix FFT (mixed VR FFT for short [44, 67]). For instance, DIF can be used on the row index and DIT on the column index by setting:

$$k = k_1 r_1 + k_0; \quad m = m_1 N_1' + m_0; \quad l = l_1 N_2' + l_0; \quad n = n_1 r_2 + n_0$$

where $N_1 = r_1 N_1'$, $N_2 = r_2 N_2'$; $k_1, m_0 = 0,1,\ldots,N_1'-1$; $k_0, m_1 = 0,1,\ldots,r_1-1$; $l_1, n_0 = 0,1,\ldots,r_2-1$; and $l_0, n_1 = 0,1,\ldots,N_2'-1$.

$$X(k_1,k_0;\,l_1,l_0) = \sum_{m_0=0}^{N_1'-1}\sum_{n_0=0}^{r_2-1} W_{N_1'}^{m_0 k_1} W_{r_2}^{n_0 l_1} \left[ W_{N_1}^{m_0 k_0} W_{N_2}^{n_0 l_0} \sum_{m_1=0}^{r_1-1}\sum_{n_1=0}^{N_2'-1} x(m_1,m_0;\,n_1,n_0)\, W_{r_1}^{m_1 k_0} W_{N_2'}^{n_1 l_0} \right] \qquad (3\text{-}4\text{-}3)$$

From Equation (3-4-1), in each stage of the FFT operation, the row twiddles both inside ($W_{r_1}^{m_0 k_1}$) and outside ($W_{N_1}^{m_0 k_0}$) the butterfly structure can be combined with the column twiddles ($W_{r_2}^{n_0 l_1}$ and $W_{N_2}^{n_0 l_0}$, respectively). Intuitively, this explains why vector radix FFT algorithms require fewer multiplications than their row-column counterparts.

The point is that although this original VR FFT presentation is mathematically and computationally simple and clear, it does little to eliminate the complicated and tedious procedure for deriving the various VR FFT algorithms required by specific applications. When the mixed radix FFT method [54] or the split-radix method [68] is invoked for each dimension to obtain VR FFT algorithms, further complications in the derivation procedure are to be expected. No simple solutions to this problem have appeared in the literature. The computational complexity can also be calculated on the wrong basis; for instance, $W_N^2$ counts as one complex multiplication just as $W_N^3$ does. Another point that has to be made here is that the mixed vector radix FFT algorithm has more variety than the 1-D mixed radix FFT algorithm [47, 54], which has not been addressed properly in the published literature [1, 31, 42-44, 62], if it was addressed at all. This will be discussed further through examples.
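The simultaneous decimation of Equation (3-4-1) can be made concrete for the simplest case, $r_1 = r_2 = 2$. The sketch below (Python/NumPy; a recursive implementation written for clarity rather than efficiency, and not the form developed later in this chapter) decimates both indices at once and combines the row and column twiddle factors into single multiplications:

```python
import numpy as np

def vr22_fft2(x):
    """Vector radix-2*2 DIT 2-D FFT, Equation (3-4-1) with r1 = r2 = 2,
    for square arrays whose side is a power of 2. All four polyphase
    components are decimated simultaneously, and the row and column
    twiddles are combined into single factors W_N^(m0*k0 + n0*l0)."""
    N = x.shape[0]
    if N == 1:
        return x.astype(complex)
    A = vr22_fft2(x[0::2, 0::2])  # (m0, n0) = (0, 0)
    B = vr22_fft2(x[0::2, 1::2])  # (0, 1)
    C = vr22_fft2(x[1::2, 0::2])  # (1, 0)
    D = vr22_fft2(x[1::2, 1::2])  # (1, 1)
    w = np.exp(-2j * np.pi * np.arange(N // 2) / N)  # W_N^{k0} (= W_N^{l0})
    T01 = B * w[None, :]                 # column twiddle W_N^{l0}
    T10 = C * w[:, None]                 # row twiddle W_N^{k0}
    T11 = D * (w[:, None] * w[None, :])  # combined twiddle W_N^{k0+l0}
    h = N // 2
    X = np.empty((N, N), dtype=complex)
    X[:h, :h] = A + T01 + T10 + T11  # (k1, l1) = (0, 0)
    X[:h, h:] = A - T01 + T10 - T11  # (0, 1)
    X[h:, :h] = A + T01 - T10 - T11  # (1, 0)
    X[h:, h:] = A - T01 - T10 + T11  # (1, 1)
    return X

x = np.random.default_rng(4).standard_normal((16, 16))
assert np.allclose(vr22_fft2(x), np.fft.fft2(x))
```

Note how the factor `T11` applies one combined multiplication where the row-column method would apply a row twiddle and a column twiddle separately.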

3-5 Matrix Representations for 2-D Vector Radix FFT Algorithms

In order to present the structural approach, a matrix form is introduced for 2-D VR FFTs. Its indexing scheme follows the traditional Cooley-Tukey presentation, which has been widely used in the literature and adopted in both software and hardware implementations; otherwise it is a generalized form of that presented in [31].

A matrix form for the DIT VR FFT given by Equation (3-4-1) can be written as the following three steps:

(BF:)
$$[X(k_1,k_0;\,l_1,l_0)] = \mathbf{I}^t\, [x_1'(k_0,m_0;\,l_0,n_0)] \qquad (3\text{-}5\text{-}1a)$$

(TM:)
$$[x_1'(k_0,m_0;\,l_0,n_0)] = \mathbf{E}^t\, [x_1(k_0,m_0;\,l_0,n_0)] \qquad (3\text{-}5\text{-}1b)$$

(Remaining short-length 2-D DFTs:)
$$[x_1(k_0,m_0;\,l_0,n_0)] = \sum_{m_1=0}^{N_1'-1}\sum_{n_1=0}^{N_2'-1} W_{N_1'}^{m_1 k_0} W_{N_2'}^{n_1 l_0}\, [x(m_1,m_0;\,n_1,n_0)] \qquad (3\text{-}5\text{-}1c)$$

where $[X(k_1,k_0;\,l_1,l_0)]$, $[x_1'(k_0,m_0;\,l_0,n_0)]$, $[x_1(k_0,m_0;\,l_0,n_0)]$ and $[x(m_1,m_0;\,n_1,n_0)]$ are $r_1 r_2$-element column vectors, with $k_1, l_1$ and $m_0, n_0$ varying in bit-reversed order; $\mathbf{E}^t$ is the twiddle factor matrix, an $r_1 r_2 \times r_1 r_2$ diagonal matrix whose element $E^t(i,i)$ ($i = 1,2,\ldots,r_1 r_2$) equals $W_{N_1}^{m_0 k_0} W_{N_2}^{n_0 l_0}$ accordingly; and $\mathbf{I}^t$ is the matrix of the 2-D vector radix-$r_1{\times}r_2$ BF structure, also an $r_1 r_2 \times r_1 r_2$ matrix, whose element $I^t(i,j)$ ($i,j = 1,2,\ldots,r_1 r_2$) equals $W_{r_1}^{m_0 k_1} W_{r_2}^{n_0 l_1}$ correspondingly. Equation (3-5-1c) contains $r_1 r_2$ $N_1'{\times}N_2'$-point 2-D DFTs, which can be further decimated.

Example-1: Given an N1*N2-point 2-D DFT where N1 = N2 = N = 16, the VR-4*4 DIT FFT algorithm in matrix form can be presented as follows:

(BF:)
$$\begin{bmatrix} X_0 \\ X_2 \\ X_1 \\ X_3 \end{bmatrix} = \left( \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -j & j \\ 1 & -1 & j & -j \end{bmatrix} \otimes \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -j & j \\ 1 & -1 & j & -j \end{bmatrix} \right) \begin{bmatrix} x_1'0 \\ x_1'2 \\ x_1'1 \\ x_1'3 \end{bmatrix} \qquad (3\text{-}5\text{-}2a)$$

(TM:)
$$\begin{bmatrix} x_1'0 \\ x_1'2 \\ x_1'1 \\ x_1'3 \end{bmatrix} = \left( \operatorname{diag}\!\left[1,\, W_N^{2k_0},\, W_N^{k_0},\, W_N^{3k_0}\right] \otimes \operatorname{diag}\!\left[1,\, W_N^{2l_0},\, W_N^{l_0},\, W_N^{3l_0}\right] \right) \begin{bmatrix} x_1 0 \\ x_1 2 \\ x_1 1 \\ x_1 3 \end{bmatrix} \qquad (3\text{-}5\text{-}2b)$$

(Remaining short-length 2-D DFTs:)
$$\begin{bmatrix} x_1 0 \\ x_1 2 \\ x_1 1 \\ x_1 3 \end{bmatrix} = \sum_{m_1=0}^{N_1'-1}\sum_{n_1=0}^{N_2'-1} W_{N_1'}^{m_1 k_0} W_{N_2'}^{n_1 l_0} \begin{bmatrix} x0 \\ x2 \\ x1 \\ x3 \end{bmatrix} \qquad (3\text{-}5\text{-}2c)$$

where, for $i = 0,1,\ldots,3$:

$$X_i = [X(i,k_0;\,0,l_0),\, X(i,k_0;\,2,l_0),\, X(i,k_0;\,1,l_0),\, X(i,k_0;\,3,l_0)]^T;$$
$$x_1'i = [x_1'(k_0,i;\,l_0,0),\, x_1'(k_0,i;\,l_0,2),\, x_1'(k_0,i;\,l_0,1),\, x_1'(k_0,i;\,l_0,3)]^T;$$
$$x_1 i = [x_1(k_0,i;\,l_0,0),\, x_1(k_0,i;\,l_0,2),\, x_1(k_0,i;\,l_0,1),\, x_1(k_0,i;\,l_0,3)]^T;$$
$$xi = [x(m_1,i;\,n_1,0),\, x(m_1,i;\,n_1,2),\, x(m_1,i;\,n_1,1),\, x(m_1,i;\,n_1,3)]^T;$$

$$\mathbf{I}^t = \mathbf{I}^{t4} \otimes \mathbf{I}^{t4}, \qquad \mathbf{I}^{t4} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -j & j \\ 1 & -1 & j & -j \end{bmatrix};$$

$$\mathbf{E}^t = \operatorname{diag}\!\left[\mathbf{F}^{t4},\; W_N^{2k_0}\mathbf{F}^{t4},\; W_N^{k_0}\mathbf{F}^{t4},\; W_N^{3k_0}\mathbf{F}^{t4}\right], \qquad \mathbf{F}^{t4} = \operatorname{diag}\!\left[1,\, W_N^{2l_0},\, W_N^{l_0},\, W_N^{3l_0}\right].$$

The tensor product in Equation (3-5-2) is used here simply as a form of concise presentation. However, it indicates an important fact which will be discussed in the next section.

A matrix form can also be written for the first stage of the DIF VR FFT presented by Equation (3-4-2) as the following three steps:

(BF:)
$$[x_1(k_0,m_0;\,l_0,n_0)] = \mathbf{I}^f\, [x(m_1,m_0;\,n_1,n_0)] \qquad (3\text{-}5\text{-}3a)$$

(TM:)
$$[x_1'(k_0,m_0;\,l_0,n_0)] = \mathbf{E}^f\, [x_1(k_0,m_0;\,l_0,n_0)] \qquad (3\text{-}5\text{-}3b)$$

(Remaining short-length 2-D DFTs:)
$$[X(k_1,k_0;\,l_1,l_0)] = \sum_{m_0=0}^{N_1'-1}\sum_{n_0=0}^{N_2'-1} W_{N_1'}^{m_0 k_1} W_{N_2'}^{n_0 l_1}\, [x_1'(k_0,m_0;\,l_0,n_0)] \qquad (3\text{-}5\text{-}3c)$$

where $[X(k_1,k_0;\,l_1,l_0)]$, $[x_1(k_0,m_0;\,l_0,n_0)]$, $[x_1'(k_0,m_0;\,l_0,n_0)]$ and $[x(m_1,m_0;\,n_1,n_0)]$ are $r_1 r_2$-element column vectors, with $k_0, l_0$ and $m_1, n_1$ varying in bit-reversed order; $\mathbf{E}^f$ is the twiddle factor matrix, an $r_1 r_2 \times r_1 r_2$ diagonal matrix whose element $E^f(i,i)$ ($i = 1,2,\ldots,r_1 r_2$) equals $W_{N_1}^{m_0 k_0} W_{N_2}^{n_0 l_0}$ correspondingly; and $\mathbf{I}^f$ is the matrix of the 2-D vector radix-$r_1{\times}r_2$ BF structure, also an $r_1 r_2 \times r_1 r_2$ matrix, whose element $I^f(i,j)$ ($i,j = 1,2,\ldots,r_1 r_2$) equals $W_{r_1}^{m_1 k_0} W_{r_2}^{n_1 l_0}$. The product of $\mathbf{E}^f$, $\mathbf{I}^f$ and $[x(m_1,m_0;\,n_1,n_0)]$ is again a column vector, so that further decimation can proceed on Equation (3-5-3c).

The matrix form for the 2-D mixed vector radix FFT algorithm given by Equation (3-4-3) is as follows:

(BF1:)
$$[x_1(k_0,m_0;\,l_0,n_0)] = \mathbf{I}^m_{1BF}\, [x(m_1,m_0;\,n_1,n_0)] \qquad (3\text{-}5\text{-}4a)$$

(TM:)
$$[x_1'(k_0,m_0;\,l_0,n_0)] = \mathbf{E}^m_{TM}\, [x_1(k_0,m_0;\,l_0,n_0)] \qquad (3\text{-}5\text{-}4b)$$

(BF2:)
$$[X(k_1,k_0;\,l_1,l_0)] = \mathbf{I}^m_{2BF}\, [x_1'(k_0,m_0;\,l_1,n_0)] \qquad (3\text{-}5\text{-}4c)$$

where $[x(m_1,m_0;\,n_1,n_0)]$ and $[x_1(k_0,m_0;\,l_0,n_0)]$ in Equation (3-5-4a) are $r_1 N_2'$-element column vectors with $k_0, l_0$ and $m_1, n_1$ varying in bit-reversed order; $[x_1(k_0,m_0;\,l_0,n_0)]$ and $[x_1'(k_0,m_0;\,l_0,n_0)]$ in Equation (3-5-4b) are $r_1 r_2$-element column vectors with $k_0$ and $n_0$ varying in bit-reversed order; and $[x_1'(k_0,m_0;\,l_1,n_0)]$ and $[X(k_1,k_0;\,l_1,l_0)]$ in Equation (3-5-4c) are $N_1' r_2$-element column vectors with $k_1, l_1$ and $m_0, n_0$ varying in bit-reversed order. $\mathbf{E}^m_{TM}$ is the twiddle factor matrix, an $r_1 r_2 \times r_1 r_2$ diagonal matrix whose element $E^m_{TM}(i,i)$ ($i = 1,2,\ldots,r_1 r_2$) equals $W_{N_1}^{m_0 k_0} W_{N_2}^{n_0 l_0}$ correspondingly; $\mathbf{I}^m_{1BF}$ is the matrix of the 2-D vector radix-$r_1{\times}N_2'$ BF structure, an $r_1 N_2' \times r_1 N_2'$ matrix whose element $I^m_{1BF}(i,j)$ equals $W_{r_1}^{m_1 k_0} W_{N_2'}^{n_1 l_0}$; and $\mathbf{I}^m_{2BF}$ is the matrix of the 2-D vector radix-$N_1'{\times}r_2$ BF structure, an $N_1' r_2 \times N_1' r_2$ matrix whose element $I^m_{2BF}(i,j)$ equals $W_{N_1'}^{m_0 k_1} W_{r_2}^{n_0 l_1}$. The superscript $m$ stands for the mixed decimation vector radix FFT. Further decimation can proceed on both Equation (3-5-4a) and Equation (3-5-4c).

3-6 Structure Theorems

By using the structural features of the multidimensional vector radix DIT FFT stated in the following theorem, the straight-forward but often tedious derivation can be bypassed.

[Structure Theorem 1: Decimation-In-Time FFT]

If a 2-D DFT is defined by Equation (3-2-1a), $N_1 = r_1 N_1'$ and $N_2 = r_2 N_2'$, the vector radix-$r_1{\times}r_2$ decimation-in-time FFT is used, and the matrix representations of the corresponding 1-D FFT equations are given as follows:

$$[X(k_1,k_0)] = \mathbf{I}^{t1}_{N_1}\, [x_1'(k_0,m_0)] \qquad (3\text{-}6\text{-}1a)$$
$$[x_1'(k_0,m_0)] = \mathbf{F}^{t1}_{N_1}\, [x_1(k_0,m_0)] \qquad (3\text{-}6\text{-}1b)$$
$$[x_1(k_0,m_0)] = \sum_{m_1=0}^{N_1'-1} W_{N_1'}^{m_1 k_0}\, [x(m_1,m_0)] \qquad (3\text{-}6\text{-}1c)$$

and,

$$[X(l_1,l_0)] = \mathbf{I}^{t2}_{N_2}\, [x_1'(l_0,n_0)] \qquad (3\text{-}6\text{-}2a)$$
$$[x_1'(l_0,n_0)] = \mathbf{F}^{t2}_{N_2}\, [x_1(l_0,n_0)] \qquad (3\text{-}6\text{-}2b)$$
$$[x_1(l_0,n_0)] = \sum_{n_1=0}^{N_2'-1} W_{N_2'}^{n_1 l_0}\, [x(n_1,n_0)] \qquad (3\text{-}6\text{-}2c)$$

where $k_1, m_0 = 0,1,\ldots,r_1-1$; $k_0, m_1 = 0,1,\ldots,N_1'-1$; $l_1, n_0 = 0,1,\ldots,r_2-1$; $l_0, n_1 = 0,1,\ldots,N_2'-1$; $\mathbf{F}^{t1}_{N_1}$ ($\mathbf{F}^{t2}_{N_2}$ respectively) is the twiddle factor matrix of the 1-D radix-$r_1$ (radix-$r_2$ respectively) DIT FFT, and $\mathbf{I}^{t1}_{N_1}$ ($\mathbf{I}^{t2}_{N_2}$ respectively) is the BF structure matrix of the 1-D radix-$r_1$ (radix-$r_2$ respectively) DIT FFT, then the matrix equation for the 2-D vector radix-$r_1{\times}r_2$ DIT FFT algorithm is given by Equation (3-5-1), where $\mathbf{E}^t = \mathbf{F}^{t1}_{N_1} \otimes \mathbf{F}^{t2}_{N_2}$ and $\mathbf{I}^t = \mathbf{I}^{t1}_{N_1} \otimes \mathbf{I}^{t2}_{N_2}$, with the symbol $\otimes$ standing for the tensor (or Kronecker) product [30, 31, 69]. In other words, $\mathbf{E}^t$ can be obtained by replacing the element $F^{t1}(i,i)$ of matrix $\mathbf{F}^{t1}_{N_1}$ with $F^{t1}(i,i)\,\mathbf{F}^{t2}_{N_2}$, and $\mathbf{I}^t$ by replacing $I^{t1}(i,j)$ of $\mathbf{I}^{t1}_{N_1}$ with $I^{t1}(i,j)\,\mathbf{I}^{t2}_{N_2}$.

The structure theorem can be readily proved using matrix theory once all equations have been expressed in the above matrix form (see Appendix B). The result can be verified to be correct by referring to Equation (3-4-1). The complete equations for a specific DIT VR FFT can be obtained by applying the theorem to the remaining short-length 2-D DFTs repeatedly. The application of the structure theorem will be demonstrated in the examples at the end of this subsection.

The relationship between 1-D radix-$2^\mu$ FFTs and the corresponding vector radix FFTs is clearly explained by the structure theorem, and thus the derivation of higher order vector radix FFT algorithms becomes simpler. Since FFTs based on [22] and [43] are at issue, not surprisingly, the statements cover the processing stages of both BF and TM. The unscrambling stage of a complete vector radix FFT equation is also governed by this rule, i.e., once the unscrambling matrices of the corresponding 1-D FFTs are known, that of the 2-D vector radix FFT algorithm is the tensor product of the two [153]. Similarly, the following theorems for the DIF VR FFTs and the mixed VR FFTs are also true.
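Structure Theorem 1 can be checked numerically for $r_1 = r_2 = 2$ and $N_1 = N_2 = 8$: one VR-2*2 DIT stage, with $\mathbf{E}^t$ and $\mathbf{I}^t$ formed as Kronecker products of the 1-D radix-2 twiddle and butterfly matrices, must map the four half-size 2-D DFT outputs onto the full 2-D DFT. A sketch (Python/NumPy, illustrative only; for radix 2 the bit-reversed order coincides with the natural order):

```python
import numpy as np

# One VR-2*2 DIT stage for N1 = N2 = 8: by Structure Theorem 1,
# I^t = I1 (x) I1 and E^t = F^{t1} (x) F^{t2}.
N = 8
x = np.random.default_rng(5).standard_normal((N, N))
X = np.fft.fft2(x)

# Half-size 2-D DFTs of the four polyphase components (Equation (3-5-1c)).
S = [[np.fft.fft2(x[m0::2, n0::2]) for n0 in range(2)] for m0 in range(2)]

I1 = np.array([[1, 1], [1, -1]], dtype=complex)  # 1-D radix-2 DIT butterfly
It = np.kron(I1, I1)                             # I^t = I1 (x) I1

for k0 in range(N // 2):
    for l0 in range(N // 2):
        F1 = np.diag([1, np.exp(-2j * np.pi * k0 / N)])  # F^{t1} = diag(1, W_N^{k0})
        F2 = np.diag([1, np.exp(-2j * np.pi * l0 / N)])  # F^{t2} = diag(1, W_N^{l0})
        Et = np.kron(F1, F2)                             # E^t = F^{t1} (x) F^{t2}
        s = np.array([S[m0][n0][k0, l0] for m0 in range(2) for n0 in range(2)])
        out = It @ (Et @ s)  # TM stage, then BF stage
        ref = np.array([X[k0, l0], X[k0, l0 + N // 2],
                        X[k0 + N // 2, l0], X[k0 + N // 2, l0 + N // 2]])
        assert np.allclose(out, ref)
ok = True
```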

[Structure Theorem 2: Decimation-In-Frequency FFT]

Suppose that the $N_1{\times}N_2$ 2-D DFT is defined by Equation (3-2-1a), where $N_1 = r_1 N_1'$ and $N_2 = r_2 N_2'$, decimation-in-frequency is used, and the matrix representations of the corresponding 1-D FFT equations are given as follows:

$$[x_1(k_0,m_0)] = \mathbf{I}^{f1}_{N_1}\, [x(m_1,m_0)] \qquad (3\text{-}6\text{-}3a)$$
$$[x_1'(k_0,m_0)] = \mathbf{F}^{f1}_{N_1}\, [x_1(k_0,m_0)] \qquad (3\text{-}6\text{-}3b)$$
$$[X(k_1,k_0)] = \sum_{m_0=0}^{N_1'-1} W_{N_1'}^{m_0 k_1}\, [x_1'(k_0,m_0)] \qquad (3\text{-}6\text{-}3c)$$

and,

$$[x_1(l_0,n_0)] = \mathbf{I}^{f2}_{N_2}\, [x(n_1,n_0)] \qquad (3\text{-}6\text{-}4a)$$
$$[x_1'(l_0,n_0)] = \mathbf{F}^{f2}_{N_2}\, [x_1(l_0,n_0)] \qquad (3\text{-}6\text{-}4b)$$
$$[X(l_1,l_0)] = \sum_{n_0=0}^{N_2'-1} W_{N_2'}^{n_0 l_1}\, [x_1'(l_0,n_0)] \qquad (3\text{-}6\text{-}4c)$$

where $k_0, m_1 = 0,1,\ldots,r_1-1$; $k_1, m_0 = 0,1,\ldots,N_1'-1$; $l_0, n_1 = 0,1,\ldots,r_2-1$; $l_1, n_0 = 0,1,\ldots,N_2'-1$; $\mathbf{F}^{f1}_{N_1}$ ($\mathbf{F}^{f2}_{N_2}$ respectively) is the twiddle factor matrix of the 1-D radix-$r_1$ (radix-$r_2$ respectively) decimation-in-frequency FFT, and $\mathbf{I}^{f1}_{N_1}$ ($\mathbf{I}^{f2}_{N_2}$ respectively) is the BF structure matrix of the 1-D radix-$r_1$ (radix-$r_2$ respectively) decimation-in-frequency FFT. The matrix equation for the 2-D vector radix-$r_1{\times}r_2$ DIF FFT algorithm is given by Equation (3-5-3), where $\mathbf{E}^f = \mathbf{F}^{f1}_{N_1} \otimes \mathbf{F}^{f2}_{N_2}$ and $\mathbf{I}^f = \mathbf{I}^{f1}_{N_1} \otimes \mathbf{I}^{f2}_{N_2}$, with the symbol $\otimes$ standing for the tensor (or Kronecker) product [30, 31, 69].

[Structure Theorem 3: Mixed VR FFT]

For a given $N_1{\times}N_2$ 2-D DFT as defined in Equation (3-2-1a), if $N_1 = r_1 N_1'$ and $N_2 = r_2 N_2'$, and the matrix representations of the 1-D DIF FFT and the 1-D DIT FFT algorithms are presented as follows:

$$[x_1(k_0,m_0)] = \mathbf{I}^{f1}_{N_1}\, [x(m_1,m_0)] \qquad (3\text{-}6\text{-}5a)$$
$$[x_1'(k_0,m_0)] = \mathbf{F}^{f1}_{N_1}\, [x_1(k_0,m_0)] \qquad (3\text{-}6\text{-}5b)$$
$$[X(k_1,k_0)] = \sum_{m_0=0}^{N_1'-1} W_{N_1'}^{m_0 k_1}\, [x_1'(k_0,m_0)] \qquad (3\text{-}6\text{-}5c)$$

and,

$$[X(l_1,l_0)] = \mathbf{I}^{t2}_{N_2}\, [x_1'(l_0,n_0)] \qquad (3\text{-}6\text{-}6a)$$
$$[x_1'(l_0,n_0)] = \mathbf{F}^{t2}_{N_2}\, [x_1(l_0,n_0)] \qquad (3\text{-}6\text{-}6b)$$
$$[x_1(l_0,n_0)] = \sum_{n_1=0}^{N_2'-1} W_{N_2'}^{n_1 l_0}\, [x(n_1,n_0)] \qquad (3\text{-}6\text{-}6c)$$

where $k_0, m_1 = 0,1,\ldots,r_1-1$; $k_1, m_0 = 0,1,\ldots,N_1'-1$; $l_1, n_0 = 0,1,\ldots,r_2-1$; $l_0, n_1 = 0,1,\ldots,N_2'-1$; $\mathbf{F}^{f1}_{N_1}$ is the twiddle factor matrix of the 1-D radix-$r_1$ DIF FFT, $\mathbf{I}^{f1}_{N_1}$ is the BF structure matrix of the 1-D radix-$r_1$ DIF FFT, $\mathbf{F}^{t2}_{N_2}$ is the twiddle factor matrix of the 1-D radix-$r_2$ DIT FFT, and $\mathbf{I}^{t2}_{N_2}$ is the BF structure matrix of the 1-D radix-$r_2$ DIT FFT, then the matrix equation for the 2-D mixed vector radix-$r_1{\times}r_2$ FFT algorithm is given by Equation (3-5-4), where $\mathbf{E}^m_{TM} = \mathbf{F}^{f1}_{N_1} \otimes \mathbf{F}^{t2}_{N_2}$, $\mathbf{I}^m_{1BF}$ is formed in the same tensor product manner from Equation (3-6-5a) and Equation (3-6-6c), and $\mathbf{I}^m_{2BF}$ from Equation (3-6-5c) and Equation (3-6-6a), with the superscript $m$ standing for the mixed vector radix FFT.

The application of the above theorems can be shown by the following examples.

Example-2: Deriving the 1-D radix-8 FFT algorithm used to be a significant task [64]. However, compared with generating the 2-D vector radix-8*8 FFT, it is relatively simple. For many, writing out the corresponding 1-D algorithm (even deriving it from scratch) or drawing its logic diagram is a good starting point for generating the required 2-D VR FFT, and it is simple enough. By applying the structure theorem, the vector radix FFT formula is then obtained with little extra effort.

Consider a 2-D DFT defined as in Equation (3-2-1a) where $N_1 = N_2 = N = 8^\mu$, $\mu$ a positive integer, so that the VR-8*8 DIF FFT can be applied.

Begin by writing the butterfly structure and twiddling multiplications of the 1-D radix-8 DIF FFT algorithm in matrix form as presented by Equation (3-6-3), where $r = 8$, $N' = N/8$, $k_1, m_0 = 0,1,\ldots,N'-1$, and

$$[X(k_1,k_0)] = [X(k_1,0),\, X(k_1,4),\, X(k_1,2),\, X(k_1,6),\, X(k_1,1),\, X(k_1,5),\, X(k_1,3),\, X(k_1,7)]^T;$$
$$[x_1'(k_0,m_0)] = [x_1'(0,m_0),\, x_1'(4,m_0),\, x_1'(2,m_0),\, x_1'(6,m_0),\, x_1'(1,m_0),\, x_1'(5,m_0),\, x_1'(3,m_0),\, x_1'(7,m_0)]^T;$$
$$[x_1(k_0,m_0)] = [x_1(0,m_0),\, x_1(4,m_0),\, x_1(2,m_0),\, x_1(6,m_0),\, x_1(1,m_0),\, x_1(5,m_0),\, x_1(3,m_0),\, x_1(7,m_0)]^T;$$
$$[x(m_1,m_0)] = [x(0,m_0),\, x(4,m_0),\, x(2,m_0),\, x(6,m_0),\, x(1,m_0),\, x(5,m_0),\, x(3,m_0),\, x(7,m_0)]^T;$$

$$\mathbf{I}^{f8}_N = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & -1 & -1 & -1 & -1 \\ 1 & 1 & -1 & -1 & -j & -j & j & j \\ 1 & 1 & -1 & -1 & j & j & -j & -j \\ 1 & -1 & -j & j & a & -a & -ja & ja \\ 1 & -1 & -j & j & -a & a & ja & -ja \\ 1 & -1 & j & -j & -ja & ja & a & -a \\ 1 & -1 & j & -j & ja & -ja & -a & a \end{bmatrix} \qquad (3\text{-}6\text{-}7)$$

where $a = W_8 = \exp(-j\pi/4)$, and

$$\mathbf{F}^{f8}_N = \operatorname{diag}\!\left[1,\, W_N^{4m_0},\, W_N^{2m_0},\, W_N^{6m_0},\, W_N^{m_0},\, W_N^{5m_0},\, W_N^{3m_0},\, W_N^{7m_0}\right].$$

The logic diagram shown in Figure-3 performs the R-8 DIF FFT BF, in which there are only two complex multiplications (caused by $a$); the TM stage, in which there are seven non-trivial complex multiplications because of the twiddles, can be appended to the BF [45].
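The matrix of Equation (3-6-7) can be generated and checked mechanically: its $(i,j)$ entry is $W_8^{m_1 k_0}$ with both $k_0$ and $m_1$ running through the bit-reversed sequence 0, 4, 2, 6, 1, 5, 3, 7. A sketch (Python/NumPy, illustrative only):

```python
import numpy as np

# Entries of I^{f8} in Equation (3-6-7): W_8^(m1*k0) with both k0 and m1
# taken in bit-reversed order (0, 4, 2, 6, 1, 5, 3, 7).
br = [0, 4, 2, 6, 1, 5, 3, 7]
W8 = np.exp(-2j * np.pi / 8)
If8 = np.array([[W8 ** (m1 * k0) for m1 in br] for k0 in br])

# Row five of Equation (3-6-7) (the row for k0 = 1), with a = W_8:
a = np.exp(-1j * np.pi / 4)
assert np.allclose(If8[4], [1, -1, -1j, 1j, a, -a, -1j * a, 1j * a])

# Applying I^{f8} to a bit-reversed-ordered input yields the 8-point DFT
# in bit-reversed output order.
v = np.random.default_rng(6).standard_normal(8)
assert np.allclose(If8 @ v[br], np.fft.fft(v)[br])
```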

According to Structure Theorem 2, the first stage of the 2-D VR-8*8 DIF FFT matrix representation is given by Equation (3-5-3), where $r_1 = r_2 = 8$; $k_1, m_0, l_1, n_0 = 0,1,\ldots,N'-1$; $N' = N/8$; and

$$[x(m_1,m_0;\,n_1,n_0)] = [x0,\, x4,\, x2,\, x6,\, x1,\, x5,\, x3,\, x7]^T,$$
with $xi = [x(i,m_0;\,0,n_0),\, x(i,m_0;\,4,n_0),\, x(i,m_0;\,2,n_0),\, x(i,m_0;\,6,n_0),\, x(i,m_0;\,1,n_0),\, x(i,m_0;\,5,n_0),\, x(i,m_0;\,3,n_0),\, x(i,m_0;\,7,n_0)]$, $i = 0,1,\ldots,7$;

$$[x_1(k_0,m_0;\,l_0,n_0)] = [x_1 0,\, x_1 4,\, x_1 2,\, x_1 6,\, x_1 1,\, x_1 5,\, x_1 3,\, x_1 7]^T,$$
with $x_1 i = [x_1(i,m_0;\,0,n_0),\, x_1(i,m_0;\,4,n_0),\, x_1(i,m_0;\,2,n_0),\, x_1(i,m_0;\,6,n_0),\, x_1(i,m_0;\,1,n_0),\, x_1(i,m_0;\,5,n_0),\, x_1(i,m_0;\,3,n_0),\, x_1(i,m_0;\,7,n_0)]$, $i = 0,1,\ldots,7$;

$$[x_1'(k_0,m_0;\,l_0,n_0)] = [x_1'0,\, x_1'4,\, x_1'2,\, x_1'6,\, x_1'1,\, x_1'5,\, x_1'3,\, x_1'7]^T,$$
with $x_1'i = [x_1'(i,m_0;\,0,n_0),\, x_1'(i,m_0;\,4,n_0),\, x_1'(i,m_0;\,2,n_0),\, x_1'(i,m_0;\,6,n_0),\, x_1'(i,m_0;\,1,n_0),\, x_1'(i,m_0;\,5,n_0),\, x_1'(i,m_0;\,3,n_0),\, x_1'(i,m_0;\,7,n_0)]$, $i = 0,1,\ldots,7$;

$$[X(k_1,k_0;\,l_1,l_0)] = [X0,\, X4,\, X2,\, X6,\, X1,\, X5,\, X3,\, X7]^T,$$
with $Xi = [X(k_1,i;\,l_1,0),\, X(k_1,i;\,l_1,4),\, X(k_1,i;\,l_1,2),\, X(k_1,i;\,l_1,6),\, X(k_1,i;\,l_1,1),\, X(k_1,i;\,l_1,5),\, X(k_1,i;\,l_1,3),\, X(k_1,i;\,l_1,7)]$, $i = 0,1,\ldots,7$;

$$\mathbf{E}^f = \operatorname{diag}\!\left[\mathbf{F}^{f8},\; W_N^{4m_0}\mathbf{F}^{f8},\; W_N^{2m_0}\mathbf{F}^{f8},\; W_N^{6m_0}\mathbf{F}^{f8},\; W_N^{m_0}\mathbf{F}^{f8},\; W_N^{5m_0}\mathbf{F}^{f8},\; W_N^{3m_0}\mathbf{F}^{f8},\; W_N^{7m_0}\mathbf{F}^{f8}\right]$$

where here $\mathbf{F}^{f8} = \operatorname{diag}\!\left[1,\, W_N^{4n_0},\, W_N^{2n_0},\, W_N^{6n_0},\, W_N^{n_0},\, W_N^{5n_0},\, W_N^{3n_0},\, W_N^{7n_0}\right]$.

From Equation (3-6-7) we have:

$$\mathbf{I}^f = \mathbf{I}^{f8}_N \otimes \mathbf{I}^{f8}_N \qquad (3\text{-}6\text{-}8)$$

i.e., the 2-D butterfly matrix is obtained by replacing each element of $\mathbf{I}^{f8}_N$ in Equation (3-6-7) with that element multiplied by $\mathbf{I}^{f8}_N$.

The complete equations of the VR-8*8 FFT for a specific 2-D DFT application can be obtained by applying the structure theorem recursively. Another point to be made is that Equation (3-6-8) is the matrix presentation of the VR-8*8 FFT butterfly structure, which is equivalent to an 8*8-point DFT and can itself be calculated by further invoking the vector radix approach. In mathematical terms, this implies further application of the properties of the tensor product to Equation (3-6-8). Computing the VR-8*8 FFT BF as it stands would commonly mean invoking the row-column method; as a result, this VR-8*8 FFT would be inferior to the VR-4*4 FFT in terms of arithmetic complexity. However, if the vector radix approach is used to perform this VR-8*8 BF by one of the following: the method indicated in [43]; the Combined Factor (CF) method in [45, Appendix C]; or the mixed VR method, as will be shown by the following example, then the performance of the VR-8*8 FFT is better than that of the VR-4*4 FFTs [44, 45, 67].

Example-3:

Given $N_1 = N_2 = N = 8$, $N = 2N'$, $N' = 4$, the matrix equations for the 1-D radix-2 DIF FFT algorithm on the 8-point DFT are presented as follows:

(BF1:)
$$\begin{bmatrix} x_1(0,m_0) \\ x_1(1,m_0) \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} x(0,m_0) \\ x(1,m_0) \end{bmatrix} \qquad (3\text{-}6\text{-}9a)$$

(TM:)
$$\begin{bmatrix} x_1'(0,m_0) \\ x_1'(1,m_0) \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & W_N^{m_0} \end{bmatrix} \begin{bmatrix} x_1(0,m_0) \\ x_1(1,m_0) \end{bmatrix} \qquad (3\text{-}6\text{-}9b)$$

(BF2:)
$$\begin{bmatrix} X(0,k_0) \\ X(2,k_0) \\ X(1,k_0) \\ X(3,k_0) \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -j & j \\ 1 & -1 & j & -j \end{bmatrix} \begin{bmatrix} x_1'(k_0,0) \\ x_1'(k_0,2) \\ x_1'(k_0,1) \\ x_1'(k_0,3) \end{bmatrix} \qquad (3\text{-}6\text{-}9c)$$

The matrix equations for the 1-D radix-2 DrT FFT algorithm on the 8-point DFT are presented as follows:

(BF1:)

*l no -11- •x(0,/0r •xi'(/0,0)- (3-6-10a) 1 -1_ X(1,/0)J .xi'(/0,D. (TM:)

no no 1 o •xi'(/o,oy xi(/o,0)" (3-6-10b) to Lxi(/0,1). xi'(/0.D. o wN 54

(BF2:)

*0 ni -xi(0,no)~ -11 1 l-i -x(0,no) xi(2,no) 11-1-1 x(2,n ) 0 (3-6-10c) xi(l,no) 1 -1 -j j x(l,no)

_xi(3,n0)_ L i -i j -j J -x(3,n0)

Using Structure Theorem 3, the matrix form of the mixed DIF and DIT vector radix FFT algorithm is derived from Equations (3-6-9) and (3-6-10):

(BF1:)
$$\begin{bmatrix} x_1(0,m_0;\,0,n_0) \\ x_1(0,m_0;\,2,n_0) \\ x_1(0,m_0;\,1,n_0) \\ x_1(0,m_0;\,3,n_0) \\ x_1(1,m_0;\,0,n_0) \\ x_1(1,m_0;\,2,n_0) \\ x_1(1,m_0;\,1,n_0) \\ x_1(1,m_0;\,3,n_0) \end{bmatrix} = \left( \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \otimes \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -j & j \\ 1 & -1 & j & -j \end{bmatrix} \right) \begin{bmatrix} x(0,m_0;\,0,n_0) \\ x(0,m_0;\,2,n_0) \\ x(0,m_0;\,1,n_0) \\ x(0,m_0;\,3,n_0) \\ x(1,m_0;\,0,n_0) \\ x(1,m_0;\,2,n_0) \\ x(1,m_0;\,1,n_0) \\ x(1,m_0;\,3,n_0) \end{bmatrix} \qquad (3\text{-}6\text{-}11a)$$

(TM:)
$$\begin{bmatrix} x_1'(0,m_0;\,l_0,0) \\ x_1'(0,m_0;\,l_0,1) \\ x_1'(1,m_0;\,l_0,0) \\ x_1'(1,m_0;\,l_0,1) \end{bmatrix} = \left( \mathbf{F}^{f1}_N \otimes \mathbf{F}^{t2}_N \right) \begin{bmatrix} x_1(0,m_0;\,l_0,0) \\ x_1(0,m_0;\,l_0,1) \\ x_1(1,m_0;\,l_0,0) \\ x_1(1,m_0;\,l_0,1) \end{bmatrix} \qquad (3\text{-}6\text{-}11b)$$

(BF2:)
$$\begin{bmatrix} X(0,k_0;\,0,l_0) \\ X(0,k_0;\,1,l_0) \\ X(2,k_0;\,0,l_0) \\ X(2,k_0;\,1,l_0) \\ X(1,k_0;\,0,l_0) \\ X(1,k_0;\,1,l_0) \\ X(3,k_0;\,0,l_0) \\ X(3,k_0;\,1,l_0) \end{bmatrix} = \left( \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -j & j \\ 1 & -1 & j & -j \end{bmatrix} \otimes \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \right) \begin{bmatrix} x_1'(k_0,0;\,l_0,0) \\ x_1'(k_0,0;\,l_0,1) \\ x_1'(k_0,2;\,l_0,0) \\ x_1'(k_0,2;\,l_0,1) \\ x_1'(k_0,1;\,l_0,0) \\ x_1'(k_0,1;\,l_0,1) \\ x_1'(k_0,3;\,l_0,0) \\ x_1'(k_0,3;\,l_0,1) \end{bmatrix} \qquad (3\text{-}6\text{-}11c)$$

where:

$$\mathbf{F}^{f1}_N \otimes \mathbf{F}^{t2}_N = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & W_N^{l_0} & 0 & 0 \\ 0 & 0 & W_N^{m_0} & 0 \\ 0 & 0 & 0 & W_N^{m_0+l_0} \end{bmatrix}.$$
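The combined twiddle matrix $\mathbf{F}^{f1}_N \otimes \mathbf{F}^{t2}_N$ of the mixed TM stage can be checked numerically as a Kronecker product of the two 1-D $2{\times}2$ twiddle matrices (Python/NumPy sketch, illustrative only):

```python
import numpy as np

# The 4x4 twiddle matrix of the mixed TM stage is the Kronecker product of
# the 1-D row (DIF) and column (DIT) twiddle matrices:
# diag(1, W_N^{l0}, W_N^{m0}, W_N^{m0+l0}).
N = 8
for m0 in range(4):
    for l0 in range(4):
        F1 = np.diag([1, np.exp(-2j * np.pi * m0 / N)])  # DIF row twiddles
        F2 = np.diag([1, np.exp(-2j * np.pi * l0 / N)])  # DIT column twiddles
        expected = np.diag([1,
                            np.exp(-2j * np.pi * l0 / N),
                            np.exp(-2j * np.pi * m0 / N),
                            np.exp(-2j * np.pi * (m0 + l0) / N)])
        assert np.allclose(np.kron(F1, F2), expected)
done = True
```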

The above theorems provide very simple construction tools for various vector radix FFT algorithms. With a knowledge of different 1-D FFT algorithms, the 2-D VR FFT for a required application can be readily obtained. Since the theorems state clearly what the 2-D BF or TM stage should look like, checking a new variation of the VR FFTs becomes a simple and straight-forward procedure. Once the complete equations of the 1-D FFT algorithms are available, it is a matter of interweaving the corresponding BF and TM structures of the 1-D algorithms to form the 2-D (m-D) BF and TM structures. Although not discussed in the theorems, the output sequences of 2-D VR FFTs also obey the properties of the tensor product with respect to the 1-D FFT output sequences [153].

3-7 Structural Approach via Logic Diagrams

The diagrammatical interpretation of the structure theorems can be expressed both at a stage-by-stage level [45] and in a complete form for a specific application [67]. Obtaining the logic diagram of a 2-D FFT from those of 1-D FFTs requires the following procedure: drawing the 1-D FFT logic diagram(s); generating the logic diagram using the row-column FFT; and finally, modifying the row-column logic diagram into the various 2-D vector radix FFTs. Modification of the logic diagram follows the simple rules shown by the following equations:

In Figure-8(a), Ax0 ± Ax1 = A(x0 ± x1).

In Figure-8(b), αAx = A(αx).

where x, x0 and x1 are column vectors; x0 and x1 are of the same dimension; A is an operator and α is a scalar.
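These modification rules are simply the linearity of the operator A, which is what permits blocks to be moved across the adders and scalers of a logic diagram. A numerical illustration (Python/NumPy sketch, with A an arbitrary matrix standing in for the operator):

```python
import numpy as np

# Figure-8 rules: for a linear operator A, A*x0 +/- A*x1 = A*(x0 +/- x1)
# and alpha*(A*x) = A*(alpha*x).
rng = np.random.default_rng(8)
A = rng.standard_normal((4, 4))           # an arbitrary linear operator
x0, x1 = rng.standard_normal(4), rng.standard_normal(4)
alpha = 0.7

assert np.allclose(A @ x0 + A @ x1, A @ (x0 + x1))  # rule of Figure-8(a)
assert np.allclose(A @ x0 - A @ x1, A @ (x0 - x1))
assert np.allclose(alpha * (A @ x0), A @ (alpha * x0))  # rule of Figure-8(b)
```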

For long-length DFTs using high radices [45, 70], the logic diagram of the FFT at the stage-by-stage level is the more useful form, because the final drawing would be difficult to accommodate on one sheet of paper, nor is it necessary, although it is achievable. For small-size DFTs, deriving a complete logic diagram is always preferable.

[Figure-8: basic modification rules for logic diagrams]

Example-4:

In this first example, the VR-4*4 FFT algorithm for a 16*16-point DFT will be derived using the logic diagram. As most 1-D FFT algorithms are well documented, it is always simple to start by drawing a 1-D logic diagram. In this case, the logic diagram of a 16-point DFT using the radix-4 FFT algorithm is presented in Figure-6. Even if no 1-D logic diagram were available, drawing a 1-D diagram from the equations is much simpler than doing so for a 2-D vector radix FFT algorithm. For this reason, it is preferable that fast algorithms be presented as logic diagrams (or, equivalently, flow graphs) whenever feasible. From personal experience, more often than not, one can judge whether a 1-D fast transform algorithm is worth generalizing to its multidimensional counterpart, and whether a saving in computational complexity could be made, just by looking at the structure of the logic diagram of the algorithm.

After the logic diagram is drawn for the 1-D radix-4 FFT as shown in Figure-6, the figure is partitioned into three parts according to the stages of the FFT procedure, as included in Figure-6. The logic diagram of the 2-D 16*16-point DFT is then presented using the row-column radix-4 FFT, as given in Figure-7. Replacing all blocks inscribed R-16 FFT in Figure-7 by Figure-6 yields Figure-9. Figure-9 can then be modified into Figure-10, which is the logic diagram of the vector radix-4*4 DIT FFT algorithm for a 16*16-point DFT. The twiddle factors of the row FFT are combined with those of the column FFT to reduce the number of multiplications, and this is why the vector radix approach is less expensive, in terms of computational operations, than the row-column approach. In Figure-10, the VR-4*4 FFT BF is actually implemented by the row-column approach. An alternative is to apply the VR-2*2 FFT to the VR-4*4 FFT BF, but this yields no further saving in the number of non-trivial multiplications.

[Figure-9; Figure-10: logic diagrams of the 16*16-point DFT using the row-column radix-4 FFT and the vector radix-4*4 DIT FFT]
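The saving from combining twiddles can be seen in isolation: multiplying by a row twiddle and then by a column twiddle collapses into a single elementwise pass with the combined exponents, since W^a · W^b = W^{a+b}. A sketch (Python/numpy; the exponents k and l are arbitrary illustrative values, not taken from the thesis):

```python
import numpy as np

N = 16
rng = np.random.default_rng(0)
x = rng.standard_normal((N, N)).astype(complex)
W = np.exp(-2j * np.pi / N)
k, l = 3, 5                          # illustrative twiddle exponents
n = np.arange(N)
row_tw = (W ** (k * n))[None, :]     # applied along each row
col_tw = (W ** (l * n))[:, None]     # applied along each column
# two multiplication passes (row-column) ...
two_pass = (x * row_tw) * col_tw
# ... equal one pass with the combined twiddle W^(k*n + l*m)
combined = W ** (k * n[None, :] + l * n[:, None])
assert np.allclose(two_pass, x * combined)
```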

The matrix form of the vector radix DIT FFT algorithm and the structure theorem can readily be extended to higher dimensions, and so can the logic diagrams. Figure-10 can be used as the VR-16*16 FFT BF in the vector radix-16*16 FFT algorithm, and so forth [70]. This example not only shows the evolution of the VR-4*4 FFT algorithm from its 1-D counterpart, but also indicates that, to perform a 16*16-point 2-D DFT in pipelined computation, only one complex multiplier is required in a VLSI design [14, 15, 134, 137].

As this technique imposes no requirement on the radices, nor any knowledge of how the decimation (DIT or DIF) procedure is undertaken to obtain the 1-D FFTs, it is not surprising that mixed radix FFT algorithms can be derived by this approach as well.

Example-5:

In this example, the mixed DIF and DIT vector radix FFT algorithm is derived to compute an 8*8-point DFT; it is equivalent to that presented in Equation (3-6-11).

The 8*8-point 2-D DFT is first calculated using the row-column FFT algorithm with different decimation techniques, as shown in Figure-11, where 32 non-trivial multiplications are involved. In Figure-11, the row transforms are performed using the 1-D DIT FFT shown in Figure-2, and the column transforms are computed by the DIF FFT shown in Figure-3. When the mixed vector radix FFT algorithm is applied to the same problem, the logic diagram of which is shown in Figure-12, the number of non-trivial multiplications is reduced to 24 after combining the row and column twiddles. This example once again demonstrates that different multidimensional vector radix FFT algorithms can be developed systematically using the structure theorems.

If the complete Equation (3-6-11) for the mixed vector radix FFT still looks somewhat complicated, its diagrammatical presentation is extremely clear and straightforward.
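That the two decimations may be mixed freely is easy to confirm numerically. The sketch below (Python/numpy, illustrative only) computes an 8*8-point DFT row-column style with DIT on the rows and DIF on the columns, and checks the result against a direct 2-D DFT:

```python
import numpy as np

def dit_fft(x):
    # radix-2 decimation-in-time: split the time index into even/odd
    N = len(x)
    if N == 1:
        return x.astype(complex)
    E, O = dit_fft(x[0::2]), dit_fft(x[1::2])
    tw = np.exp(-2j * np.pi * np.arange(N // 2) / N)
    return np.concatenate([E + tw * O, E - tw * O])

def dif_fft(x):
    # radix-2 decimation-in-frequency: split the frequency index into even/odd
    N = len(x)
    if N == 1:
        return x.astype(complex)
    h = N // 2
    tw = np.exp(-2j * np.pi * np.arange(h) / N)
    X = np.empty(N, dtype=complex)
    X[0::2] = dif_fft(x[:h] + x[h:])
    X[1::2] = dif_fft((x[:h] - x[h:]) * tw)
    return X

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))
rows = np.array([dit_fft(r) for r in x])        # rows by DIT
X = np.array([dif_fft(c) for c in rows.T]).T    # columns by DIF
assert np.allclose(X, np.fft.fft2(x))
```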

[Figure-11: the 8*8-point DFT by the row-column FFT with mixed decimations; Figure-12: the mixed DIF and DIT vector radix FFT]

Before considering the 2-D vector split-radix FFT algorithm and the comparative study of various vector radix FFT algorithms, two points have to be made. One is that this structural approach, in both its matrix and diagrammatical forms, can be extended to multidimensional cases with little difficulty. The other is that the 2-D direct vector radix DCT algorithms were devised by examination of the logic diagrams of the corresponding 1-D algorithms and were later verified by mathematical analysis. The discussion of the combined factor vector radix-8*8 and vector radix-16*16 FFT algorithms is included in Appendix C.

3-8 2-D Vector Split-Radix FFT Algorithms

Another successful application of the structural approach is to generate complete equations for the DIF vector split-radix FFT [60]. The idea behind the split-radix approach is quite simple. In one-dimensional Discrete Fourier Transform computation, the 1-D split-radix approach [68] divides a length-N DFT into two DFTs of length N/2 when a radix-2 FFT is applied at the first stage. One of the resulting N/2 DFTs, which involves the odd terms, is further decimated using radix-2. Thus the original DFT is implemented by an N/2 DFT together with two N/4 DFTs, and an algorithm can be devised to reduce the number of operations required to complete the transform. The trouble is that when this very approach is applied to 2-D DFTs using the traditional mathematical representations [36, 37], the final equation for the algorithm contains so many terms that, without an understanding of its structural features, derivation and verification of the algorithm and its implementation would be difficult indeed [37, 71], not to mention its generalization to even higher dimensions. Recently, the split-radix FFT algorithm has been extended to two dimensions using Decimation-In-Frequency (DIF) [36] and Decimation-In-Time [37]. In this section, the complete equations for the first stage of the vector split-radix DIF FFT algorithm are derived using the structural approach [45, 60]. To derive the complete vector split-radix DIF FFT equations, the structure theorem is used

initially to obtain VR-2*2 and VR-4*4 DIF FFT equations. The split-radix idea is then applied to compute the outputs when both indices are even in a vector radix-2*2 step and the rest in a vector radix-4*4 step. The algorithm is the two-dimensional counterpart of the 1-D split-radix DIF FFT algorithm [68], and differs from the split vector radix 2-D FFT [36] in the way in which the vector radices are divided.

Using the structural approach [45], the vector radix-2*2 and vector radix-4*4 DIF FFT algorithms can be derived easily from the corresponding 1-D algorithms. The matrix form of the 2-D vector radix-2*2 DIF FFT is given by the following equations, assuming N_1 = N_2 = N.

[x_1(0,m_0; 0,n_0), x_1(0,m_0; 1,n_0), x_1(1,m_0; 0,n_0), x_1(1,m_0; 1,n_0)]^T
    = [ 1  1  1  1 ;  1 -1  1 -1 ;  1  1 -1 -1 ;  1 -1 -1  1 ]
      [x(0,m_0; 0,n_0), x(0,m_0; 1,n_0), x(1,m_0; 0,n_0), x(1,m_0; 1,n_0)]^T                 (3-8-1a)

[x_1'(0,m_0; 0,n_0), x_1'(0,m_0; 1,n_0), x_1'(1,m_0; 0,n_0), x_1'(1,m_0; 1,n_0)]^T
    = diag[1, W_N^{n_0}, W_N^{m_0}, W_N^{m_0+n_0}]
      [x_1(0,m_0; 0,n_0), x_1(0,m_0; 1,n_0), x_1(1,m_0; 0,n_0), x_1(1,m_0; 1,n_0)]^T         (3-8-1b)

[X(k_1,0; l_1,0), X(k_1,0; l_1,1), X(k_1,1; l_1,0), X(k_1,1; l_1,1)]^T
    = Σ_{m_0=0}^{N'-1} Σ_{n_0=0}^{N'-1} W_{N'}^{m_0 k_1} W_{N'}^{n_0 l_1}
      [x_1'(0,m_0; 0,n_0), x_1'(0,m_0; 1,n_0), x_1'(1,m_0; 0,n_0), x_1'(1,m_0; 1,n_0)]^T     (3-8-1c)

where k_1, l_1 = 0,1,...,N'-1, and N' = N/2. The vector radix-4*4 DIF FFT is described by the following equations:

[x_1(k_0,m_0; l_0,n_0)] = I^f [x(n_{11},m_0; n_{21},n_0)]                                   (3-8-2a)

[x_1'(k_0,m_0; l_0,n_0)] = E^f [x_1(k_0,m_0; l_0,n_0)]                                      (3-8-2b)

[X(k_1,k_0; l_1,l_0)] = Σ_{m_0=0}^{N''-1} Σ_{n_0=0}^{N''-1} W_{N''}^{m_0 k_1} W_{N''}^{n_0 l_1} [x_1'(k_0,m_0; l_0,n_0)]     (3-8-2c)

where k_1, l_1 = 0,1,...,N''-1, and N'' = N/4;

[X(k_1,k_0; l_1,l_0)] = [X(k_1,0;l_1,0), X(k_1,0;l_1,2), X(k_1,0;l_1,1), X(k_1,0;l_1,3),
                         X(k_1,2;l_1,0), X(k_1,2;l_1,2), X(k_1,2;l_1,1), X(k_1,2;l_1,3),
                         X(k_1,1;l_1,0), X(k_1,1;l_1,2), X(k_1,1;l_1,1), X(k_1,1;l_1,3),
                         X(k_1,3;l_1,0), X(k_1,3;l_1,2), X(k_1,3;l_1,1), X(k_1,3;l_1,3)]^T;

[x(n_{11},m_0; n_{21},n_0)] = [x(0,m_0;0,n_0), x(0,m_0;2,n_0), x(0,m_0;1,n_0), x(0,m_0;3,n_0),
                               x(2,m_0;0,n_0), x(2,m_0;2,n_0), x(2,m_0;1,n_0), x(2,m_0;3,n_0),
                               x(1,m_0;0,n_0), x(1,m_0;2,n_0), x(1,m_0;1,n_0), x(1,m_0;3,n_0),
                               x(3,m_0;0,n_0), x(3,m_0;2,n_0), x(3,m_0;1,n_0), x(3,m_0;3,n_0)]^T;

E^f = diag[1, W_N^{2n_0}, W_N^{n_0}, W_N^{3n_0},
           W_N^{2m_0}, W_N^{2m_0+2n_0}, W_N^{2m_0+n_0}, W_N^{2m_0+3n_0},
           W_N^{m_0}, W_N^{m_0+2n_0}, W_N^{m_0+n_0}, W_N^{m_0+3n_0},
           W_N^{3m_0}, W_N^{3m_0+2n_0}, W_N^{3m_0+n_0}, W_N^{3m_0+3n_0}];

I^f = I^{f4} ⊗ I^{f4} = [ I^{f4}  I^{f4}   I^{f4}   I^{f4} ;
                          I^{f4}  I^{f4}  -I^{f4}  -I^{f4} ;
                          I^{f4} -I^{f4} -jI^{f4}  jI^{f4} ;
                          I^{f4} -I^{f4}  jI^{f4} -jI^{f4} ],
      I^{f4} = [ 1  1  1  1 ;  1  1 -1 -1 ;  1 -1 -j  j ;  1 -1  j -j ].
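The radix-2*2 first stage of Equations (3-8-1) lends itself to a direct numerical check. The sketch below (Python/numpy, a modern illustration rather than the thesis's own code) applies the butterfly, the twiddles, and four (N/2)×(N/2) sub-DFTs, and compares the result with a direct 2-D DFT:

```python
import numpy as np

N = 8                                    # any power of two
rng = np.random.default_rng(0)
x = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
W = np.exp(-2j * np.pi / N)
h = N // 2
m0 = np.arange(h)[:, None]
n0 = np.arange(h)[None, :]

X = np.empty((N, N), dtype=complex)
for k0 in (0, 1):
    for l0 in (0, 1):
        # VR-2*2 butterfly (3-8-1a): signed sum of the four quadrants
        bf = (x[:h, :h] + (-1) ** l0 * x[:h, h:]
              + (-1) ** k0 * x[h:, :h] + (-1) ** (k0 + l0) * x[h:, h:])
        # twiddle stage (3-8-1b): diag[W_N^(m0*k0 + n0*l0)]
        tw = W ** (m0 * k0 + n0 * l0)
        # remaining (N/2)*(N/2)-point 2-D DFTs (3-8-1c)
        X[k0::2, l0::2] = np.fft.fft2(bf * tw)

assert np.allclose(X, np.fft.fft2(x))
```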

The basic approach of the vector split-radix algorithm is to compute the outputs when both indices are even in a vector radix-2*2 step and the rest in a vector radix-4*4

step. Both indices are even in the first line of each of Equations (3-8-1a) to (3-8-1c). Thus, in the 4*4 process, twelve equations are required out of the sixteen in Equation (3-8-2), since X(k_1,0;l_1,0), X(k_1,0;l_1,2), X(k_1,2;l_1,0) and X(k_1,2;l_1,2) have already been solved by the vector radix-2*2 step. This is the first stage of the vector split-radix DIF FFT decomposition, as shown below:

X(k_1,0; l_1,0) = Σ_{m_0=0}^{N'-1} Σ_{n_0=0}^{N'-1} W_{N'}^{m_0 k_1} W_{N'}^{n_0 l_1} x_1(0,m_0; 0,n_0)     (3-8-3a)

x_1(0,m_0; 0,n_0) = [1 1 1 1] [x(0,m_0;0,n_0), x(0,m_0;1,n_0), x(1,m_0;0,n_0), x(1,m_0;1,n_0)]^T            (3-8-3b)

where N' = N/2 and k_1, l_1 = 0,1,...,N'-1; and,

[x_1(k_0,m_0; l_0,n_0)]_m = I^f_m [x(n_{11},m_0; n_{21},n_0)]                                   (3-8-3c)

[x_1'(k_0,m_0; l_0,n_0)]_m = E^f_m [x_1(k_0,m_0; l_0,n_0)]_m                                    (3-8-3d)

[X(k_1,k_0; l_1,l_0)]_m = Σ_{m_0=0}^{N''-1} Σ_{n_0=0}^{N''-1} W_{N''}^{m_0 k_1} W_{N''}^{n_0 l_1} [x_1'(k_0,m_0; l_0,n_0)]_m     (3-8-3e)

where k_1, l_1 = 0,1,...,N''-1, and N'' = N/4;

[X(k_1,k_0; l_1,l_0)]_m = [X(k_1,0;l_1,1), X(k_1,0;l_1,3),
                           X(k_1,2;l_1,1), X(k_1,2;l_1,3),
                           X(k_1,1;l_1,0), X(k_1,1;l_1,2), X(k_1,1;l_1,1), X(k_1,1;l_1,3),
                           X(k_1,3;l_1,0), X(k_1,3;l_1,2), X(k_1,3;l_1,1), X(k_1,3;l_1,3)]^T;

[x_1(k_0,m_0; l_0,n_0)]_m = [x_1(0,m_0;1,n_0), x_1(0,m_0;3,n_0),
                             x_1(2,m_0;1,n_0), x_1(2,m_0;3,n_0),
                             x_1(1,m_0;0,n_0), x_1(1,m_0;2,n_0), x_1(1,m_0;1,n_0), x_1(1,m_0;3,n_0),
                             x_1(3,m_0;0,n_0), x_1(3,m_0;2,n_0), x_1(3,m_0;1,n_0), x_1(3,m_0;3,n_0)]^T;

E^f_m = diag[W_N^{n_0}, W_N^{3n_0}, W_N^{2m_0+n_0}, W_N^{2m_0+3n_0},
             W_N^{m_0}, W_N^{m_0+2n_0}, W_N^{m_0+n_0}, W_N^{m_0+3n_0},
             W_N^{3m_0}, W_N^{3m_0+2n_0}, W_N^{3m_0+n_0}, W_N^{3m_0+3n_0}];

I^f_m = [ I^{f4}_m  I^{f4}_m   I^{f4}_m   I^{f4}_m ;
          I^{f4}_m  I^{f4}_m  -I^{f4}_m  -I^{f4}_m ;
          I^{f4}   -I^{f4}   -jI^{f4}    jI^{f4}   ;
          I^{f4}   -I^{f4}    jI^{f4}   -jI^{f4}  ],
      I^{f4}_m = [ 1 -1 -j  j ;  1 -1  j -j ];

and [x(n_{11},m_0; n_{21},n_0)] is defined as in Equation (3-8-2). The first, second, fifth and sixth rows of [X(k_1,k_0;l_1,l_0)], [x_1(k_0,m_0;l_0,n_0)], E^f and I^f have been omitted to obtain [X(k_1,k_0;l_1,l_0)]_m, [x_1(k_0,m_0;l_0,n_0)]_m, E^f_m and I^f_m. All indices are the same as those in Equation (3-8-2), but a long and tedious direct derivation has been avoided. The logic diagram of the vector split-radix DIF FFT can be obtained by modifying the corresponding logic diagrams of the VR-2*2 and VR-4*4 DIF FFT algorithms, which is a simple procedure. Complete equations for the first-stage 2-D vector split-radix DIT FFT algorithm can also be constructed by this simple approach [37].
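For reference, the 1-D split-radix DIF recursion that the 2-D algorithm above generalizes can be sketched as follows (Python/numpy, an illustrative implementation, not code from the thesis): the even outputs come from one length-N/2 DFT and the odd outputs from two length-N/4 DFTs on twiddled differences.

```python
import numpy as np

def srfft(x):
    """1-D split-radix DIF FFT (length must be a power of two)."""
    N = len(x)
    if N == 1:
        return x.astype(complex)
    if N == 2:
        return np.array([x[0] + x[1], x[0] - x[1]], dtype=complex)
    q = N // 4
    W = np.exp(-2j * np.pi * np.arange(q) / N)
    X = np.empty(N, dtype=complex)
    # even outputs: one length-N/2 DFT
    X[0::2] = srfft(x[:N//2] + x[N//2:])
    # odd outputs: two length-N/4 DFTs on twiddled differences
    a = x[:q] - x[N//2:N//2 + q]
    b = x[q:N//2] - x[N//2 + q:]
    X[1::4] = srfft((a - 1j * b) * W)
    X[3::4] = srfft((a + 1j * b) * W**3)
    return X

x = np.random.default_rng(0).standard_normal(32)
assert np.allclose(srfft(x), np.fft.fft(x))
```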

3-9 Comparisons of Various 2-D Vector Radix FFT Algorithms

The comparison of vector radix FFT algorithms in this section mainly follows the traditional criteria, i.e., arithmetic complexity, error analysis, in-place computation and regularity of the computation structure, as mentioned in the previous chapter. Since the analysis of arithmetic complexity in the early work on vector radix FFTs [42, 43], there have been many other reports on the issue for different vector radix algorithms [1, 36, 37, 44, 45, 60, 62]. The arithmetic complexity, in terms of multiplications, of various vector radix FFTs is listed in Table-1 in comparison with row-column FFTs, assuming complex inputs. N = 4096 is chosen because all the vector radix FFT algorithms considered can be applied. It is worth noting that although the split-radix method requires fewer multiplications than the rest of the Cooley-Tukey based FFTs in 1-D DFT computations [68], its applications in the 2-D case [36, 37, 60] are less effective than the Combined Factor (CF) VR-16*16 FFT in terms of multiply operations [45, 70]. Besides, since vector radix FFTs preserve the regular computation structure inherited from the 1-D Cooley-Tukey algorithms, they are bound to have advantages in software and hardware DFT implementations [154]. They carry out an in-place computation, and their numerical features are also superior to those of the row-column method. Vector radix FFT algorithms consist of VR-2*2 BFs, regular twiddling multiplication stages and regular index formation. These features, along with the pipelined and parallel structure inherited from their 1-D counterparts, would facilitate both software and hardware implementation of fast 2-D DFT computation as well [134, 137].

To give a brief idea of the reduction in complex multiplications, for a 4096*4096 2-D DFT problem, the number of complex multiplications required by the vector split-radix DIF FFT [Appendix E] is only about 37% of that required by the radix-2 row-column FFT algorithm [45]; about 65% of that required by the row-column FFT using the 1-D split-radix FFT [68]; 49% of that needed by the vector radix-2*2 FFT [43]; 66% of that required by a different vector split-radix DIF FFT approach [36]; and it is slightly (2%) inferior to the combined factor vector radix-16*16 FFT algorithm [45, 70]. This algorithm needs slightly more complex additions than the vector radix-2*2 FFT algorithm. Further discussion on the issue can be found in [37].

Table-1  Arithmetic complexity of FFT algorithms for 4096*4096 2-D DFTs in terms of multiplications

  2-D FFT         Number of BF      Number of TM      Total number of    Percentage (total
  Algorithm       multiplications   multiplications   multiplications    multiplications)
  RC R-2          0                 2*92,274,688      184,549,376        100.00%
  RC R-4          0                 2*62,914,560      125,829,120         68.18%
  RC R-8          2*16,777,216      2*44,040,192      121,634,816         65.91%
  RC R-16         2*25,165,824      2*31,457,280      113,246,208         61.36%
  RC SR FFT [68]  N/A               N/A               104,398,848         56.57%
  VR-2*2          0                 138,412,032       138,412,032         75.00%
  VR-4*4          0                 78,643,200        78,643,200          42.61%
  VR-8*8          33,554,432        49,545,216        83,099,648          45.03%
  VR-16*16        50,331,648        33,423,360        83,755,008          45.38%
  CF VR-8*8       25,165,824        49,545,216        74,711,040          40.48%
  CF VR-16*16     33,030,144        33,423,360        60,453,504          36.01%
  VSR-1 [36]      N/A               N/A               102,676,560         55.64%
  VSR-2 [60]      N/A               N/A               67,746,504          36.71%

  NOTE: RC R-i: the row-column 1-D radix-i FFT algorithm; VR: the vector radix 2-D FFT algorithm; CF: Combined Factor method applied; N/A: Not Applicable; BF: ButterFly computation structure; TM: Twiddling Multiplications; SR: Split-Radix FFT algorithm; VSR: Vector Split-Radix 2-D FFT algorithm.
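The first two vector-radix-related rows of Table-1 follow from simple closed forms, under the counting convention (inferred here from the table, so an assumption) that a 1-D radix-2 FFT of length N carries (N/2)(log2 N - 1) non-trivial twiddle multiplications:

```python
import math

N = 4096
stages = int(math.log2(N)) - 1        # stages carrying non-trivial twiddles
# row-column radix-2: two passes of N one-dimensional N-point FFTs
rc_r2 = 2 * N * (N // 2) * stages
# vector radix-2*2: three quarters of the N*N points twiddled per stage
vr_22 = stages * 3 * N * N // 4
assert rc_r2 == 184_549_376
assert vr_22 == 138_412_032
assert vr_22 / rc_r2 == 0.75
```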

3-10 Vector Radix FFT Using FDP™ A41102

The Australian CSIRO-designed AUSTEK Frequency Domain Processor (FDP™) A41102 is a high-performance CMOS VLSI device providing a complete hardware solution for implementing FFTs [14, 15]. Its main features include performing up to 256-complex-point DFTs within 102.4 µs, and 2-D 8*8-point or 16*16-point DFTs, in a single-processor configuration with a throughput of 2.5 Ms/s [28]. In [28], 2-D 512*512-point and 1024*1024-point DFTs are implemented using FDPs by the row-column method. Although there are many publications in which multidimensional vector radix FFT algorithms are shown by simulation to have computational advantages over the row-column method, there have been very few reports on hardware implementation [134, 137]. In this section, it shall be demonstrated that when the vector radix method is used, fewer FDPs are required to obtain the same 2-D FFT processing throughput. The vector radix-8*8 FFT algorithm can be used to calculate 512*512-point DFTs.

The complete operation is divided into three vector radix-8*8 ButterFly (BF) stages and two Twiddling Multiplication (TM) stages [70]. Since the VR-8*8 butterfly computation structure is a 2-D 8*8-point DFT in its own right, it does not greatly matter whether it is implemented by the row-column or the vector radix approach, so long as the most efficient computation is achieved. In fact, the 2-D 8*8-point DFT is calculated by the row-column FFT on the FDP A41102. Using the VR-8*8 FFT algorithm to perform 512*512-point DFTs, a multi-FDP system design is described in Figure-13, which consists of three FDPs with auxiliary discrete circuits rendering a processing rate of 2.5 Ms/s, compared with the four FDPs required by a row-column procedure [28]. In this configuration, the VR-8*8 BFs are calculated by the 2-D 8*8-point FFT function provided on the FDP A41102s, and the two TM stages are performed using the two available uncommitted complex multipliers.

[Figure-13: a three-FDP configuration for 512*512-point DFTs using the VR-8*8 FFT; Figure-14: three-FDP configurations for 1024*1024-point DFTs]

Using a mixed VR-16*16 and VR-8*8 FFT algorithm, 1024*1024-point DFTs can also be

implemented using three FDPs, rendering a throughput of 2.5 Ms/s, with alternative configurations as shown in Figure-14. In real-time image processing, a multi-processor system has to be used, and a reduction in the number of processors means a decrease in system complexity. These are but a few examples of what can be done using the vector radix approach.

3-11 Summary

In this chapter, the structural approach to the construction of 2-D VR FFT algorithms has been presented in both mathematical and diagrammatical forms. The use of this method helps in understanding the structures of various VR FFTs and, most importantly, also eases the burden of the implementation task for electrical engineers. Using the diagrammatical representation, the modification of a VR FFT algorithm to fit special design requirements becomes a simple task. The comparative study of various VR FFTs summarizes their arithmetic complexities and also their merits in the context of error analysis, in-place computation and regularity of the computation structure.

The introduction of the FDP A41102 demonstrates a complete VLSI hardware solution to the DFT computation. It has been shown that if the vector radix method were used, the number of complex multipliers on the processor needed to perform either a 2-D 8*8- or 16*16-point DFT could be reduced to one. Even if the FDP is used in its current form, incorporating the vector radix method in the application of 2-D DFTs for real-time image processing will reduce the number of processors required to achieve the same performance as the row-column FFT. This would mean a reduction in system complexity.

CHAPTER FOUR: A PERSPECTIVE ON VECTOR RADIX FFT ALGORITHMS OF HIGHER DIMENSIONS

As discussed in the introduction, multidimensional (m-D) Discrete Fourier Transforms (DFTs) with dimension equal to or greater than three have been used in the reconstruction of 3-D microscopic-scale objects to remove out-of-focus noise [16], in Nuclear Magnetic Resonance (NMR) imaging algorithms [9], and in computer vision and pattern analysis to provide a better understanding of the dynamics of the visual system [10-12]. When the dimension of the problem increases, the computational burden becomes heavy. Thus the saving in computation time from using efficient fast algorithms will be of even more significance [45]. Because of the complexity of the problem involved, a systematic

approach is required for the comprehension, derivation, construction and effective

implementation of multidimensional fast algorithms. It seems that the structural approach

introduced in Chapter Three is, at least, one technique capable of being developed to

higher dimensions as this method has been successfully demonstrated in the construction

of 2-D vector radix FFT algorithms [45, 60] and 2-D direct vector radix fast Discrete

Cosine Transform (DCT) algorithms which will be discussed shortly after this chapter

[80]. The approach can also assist in developing computer programs using

multidimensional vector radix algorithms, especially when the computer programs for the

corresponding 1-D fast algorithms are available.

In this chapter, the structural approach for the construction of m-D (m ≥ 3) fast

vector radix FFT algorithms is closely examined using both matrix and diagrammatical

forms. From definitions of the multidimensional DFT and its inverse, equations which

represent multidimensional vector radix Decimation-In-Time and Decimation-In-

Frequency FFTs are derived. A structural approach based on the matrix representation is

described, which is used to construct multidimensional vector radix FFTs. A recursive logic diagram symbol system is then presented to show how an m-D (m ≥ 3) vector radix FFT algorithm can be derived and represented in graphical form. An example is also given to demonstrate the simple procedure required to construct a vector radix-4*4*4 FFT algorithm for a 16*16*16-point 3-D DFT problem using the symbol system. Since the approach using diagrammatical representations does not impose any restrictions on how the decimations (DIT or DIF) are applied to each dimension, various vector radix FFT algorithms can be constructed by this method. Although not discussed in this thesis, the material presented in this chapter can be extended to m-D (m ≥ 3) fast vector radix DCT algorithms as well.

4-1 Definitions

As mentioned in the previous chapter, the multidimensional DFT of dimension m is defined as:

X(k_1,k_2,...,k_m) = Σ_{n_1=0}^{N_1-1} Σ_{n_2=0}^{N_2-1} ... Σ_{n_m=0}^{N_m-1} x(n_1,n_2,...,n_m) W_{N_1}^{n_1 k_1} W_{N_2}^{n_2 k_2} ... W_{N_m}^{n_m k_m}     (4-1-1a)

and its inverse is defined as:

x(n_1,n_2,...,n_m) = (1/(N_1 N_2 ... N_m)) Σ_{k_1=0}^{N_1-1} Σ_{k_2=0}^{N_2-1} ... Σ_{k_m=0}^{N_m-1} X(k_1,k_2,...,k_m) W_{N_1}^{-n_1 k_1} W_{N_2}^{-n_2 k_2} ... W_{N_m}^{-n_m k_m}     (4-1-1b)

where k_i, n_i = 0,1,...,N_i-1; i = 1,2,...,m. In their matrix forms,

X = W^m x     (4-1-2a)

and

x = (1/(N_1 N_2 ... N_m)) W^{-m} X     (4-1-2b)

where W^m = W_{N_1} ⊗ W_{N_2} ⊗ ... ⊗ W_{N_m}, with W_{N_i}, i = 1,...,m, representing the N_i-point 1-D DFT matrix; W^{-m} = W_{N_1}^* ⊗ W_{N_2}^* ⊗ ... ⊗ W_{N_m}^*, with W_{N_i}^*, i = 1,...,m, representing the N_i-point 1-D inverse DFT matrix (up to the scale factor 1/N_i); and X and x are N_1 N_2 ... N_m column vectors of the output and input sequences respectively (also see Example-6).

If DIT is used on all indices of the m-D DFT, assuming N_i = r_i · N_i', i = 1,...,m, set:

k_i = k_{i1} · N_i' + k_{i0};   n_i = n_{i1} · r_i + n_{i0};

where k_{i1}, n_{i0} = 0,1,...,r_i-1 and k_{i0}, n_{i1} = 0,1,...,N_i'-1. Then

X(k_{11},k_{10}; k_{21},k_{20}; ...; k_{m1},k_{m0}) =
    Σ_{n_{10}=0}^{r_1-1} Σ_{n_{20}=0}^{r_2-1} ... Σ_{n_{m0}=0}^{r_m-1}
    [ Σ_{n_{11}=0}^{N_1'-1} Σ_{n_{21}=0}^{N_2'-1} ... Σ_{n_{m1}=0}^{N_m'-1}
      x(n_{11},n_{10}; n_{21},n_{20}; ...; n_{m1},n_{m0})
      W_{N_1'}^{n_{11} k_{10}} W_{N_2'}^{n_{21} k_{20}} ... W_{N_m'}^{n_{m1} k_{m0}} ]
    W_{N_1}^{n_{10} k_{10}} W_{N_2}^{n_{20} k_{20}} ... W_{N_m}^{n_{m0} k_{m0}}
    W_{r_1}^{n_{10} k_{11}} W_{r_2}^{n_{20} k_{21}} ... W_{r_m}^{n_{m0} k_{m1}}          (4-1-3)

Accordingly, the m-D DIF VR FFT and mixed VR FFT equations can be derived. If DIF is used on all indices of the m-D DFT, assuming N_i = r_i · N_i', i = 1,...,m, set:

k_i = k_{i1} · r_i + k_{i0};   n_i = n_{i1} · N_i' + n_{i0};

where k_{i1}, n_{i0} = 0,1,...,N_i'-1 and k_{i0}, n_{i1} = 0,1,...,r_i-1. Then

X(k_{11},k_{10}; k_{21},k_{20}; ...; k_{m1},k_{m0}) =
    Σ_{n_{10}=0}^{N_1'-1} Σ_{n_{20}=0}^{N_2'-1} ... Σ_{n_{m0}=0}^{N_m'-1}
    W_{N_1'}^{n_{10} k_{11}} W_{N_2'}^{n_{20} k_{21}} ... W_{N_m'}^{n_{m0} k_{m1}}
    [ Σ_{n_{11}=0}^{r_1-1} Σ_{n_{21}=0}^{r_2-1} ... Σ_{n_{m1}=0}^{r_m-1}
      x(n_{11},n_{10}; n_{21},n_{20}; ...; n_{m1},n_{m0})
      W_{r_1}^{n_{11} k_{10}} W_{r_2}^{n_{21} k_{20}} ... W_{r_m}^{n_{m1} k_{m0}} ]
    W_{N_1}^{n_{10} k_{10}} W_{N_2}^{n_{20} k_{20}} ... W_{N_m}^{n_{m0} k_{m0}}          (4-1-4)

Since there is more than one dimension, different decimations can be applied to different dimensional indices, so there are further variations of the vector radix FFT algorithm. A unified form for the mixed VR FFT algorithms is, however, difficult to present.
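The DIT splitting of Equation (4-1-3) can be verified numerically for the radix-2*2*2 case: the eight even/odd-decimated subcubes are transformed first (the inner sums), then recombined with the twiddle and butterfly kernels (Python/numpy sketch, illustrative only):

```python
import itertools
import numpy as np

N, h = 8, 4
rng = np.random.default_rng(0)
x = rng.standard_normal((N, N, N)) + 0j
Wk = np.exp(-2j * np.pi * np.arange(h) / N)        # W_N^{k0}, k0 = 0..h-1

# inner sums of (4-1-3): DFTs of the 2*2*2 decimated subcubes
S = {d: np.fft.fftn(x[d[0]::2, d[1]::2, d[2]::2])
     for d in itertools.product((0, 1), repeat=3)}

X = np.empty((N, N, N), dtype=complex)
for k1 in itertools.product((0, 1), repeat=3):     # (k11, k21, k31)
    acc = np.zeros((h, h, h), dtype=complex)
    for d in itertools.product((0, 1), repeat=3):  # (n10, n20, n30)
        # twiddle W_N^{n_i0 k_i0} per axis and butterfly kernel W_2^{n_i0 k_i1}
        tw = ((Wk ** d[0])[:, None, None]
              * (Wk ** d[1])[None, :, None]
              * (Wk ** d[2])[None, None, :])
        sign = (-1) ** sum(a * b for a, b in zip(d, k1))
        acc += sign * tw * S[d]
    X[k1[0]*h:(k1[0]+1)*h, k1[1]*h:(k1[1]+1)*h, k1[2]*h:(k1[2]+1)*h] = acc

assert np.allclose(X, np.fft.fftn(x))
```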

4-2 Matrix Representations and Structure Theorems

A matrix form for the DIT vector radix FFT algorithms presented by Equation (4-1-3) can be given as follows:

[X(k_{11},k_{10}; k_{21},k_{20}; ...; k_{m1},k_{m0})] = I^t [x_1'(k_{10},n_{10}; k_{20},n_{20}; ...; k_{m0},n_{m0})]     (4-2-1a)

[x_1'(k_{10},n_{10}; k_{20},n_{20}; ...; k_{m0},n_{m0})] = F^t [x_1(k_{10},n_{10}; k_{20},n_{20}; ...; k_{m0},n_{m0})]   (4-2-1b)

[x_1(k_{10},n_{10}; k_{20},n_{20}; ...; k_{m0},n_{m0})] =
    Σ_{n_{11}=0}^{N_1'-1} Σ_{n_{21}=0}^{N_2'-1} ... Σ_{n_{m1}=0}^{N_m'-1}
    W_{N_1'}^{n_{11} k_{10}} W_{N_2'}^{n_{21} k_{20}} ... W_{N_m'}^{n_{m1} k_{m0}}
    [x(n_{11},n_{10}; n_{21},n_{20}; ...; n_{m1},n_{m0})]                                   (4-2-1c)

where [x(n_{11},n_{10}; ...; n_{m1},n_{m0})], [x_1(k_{10},n_{10}; ...; k_{m0},n_{m0})], [x_1'(k_{10},n_{10}; ...; k_{m0},n_{m0})] and [X(k_{11},k_{10}; ...; k_{m1},k_{m0})] are r_1 r_2 ... r_m column vectors with k_{i1} and n_{i0} (1 ≤ i ≤ m) varying in bit-reversed order; F^t is the twiddle factor matrix, an r_1 r_2 ... r_m × r_1 r_2 ... r_m diagonal matrix whose element F^t(i,i) (1 ≤ i ≤ r_1 r_2 ... r_m) equals W_{N_1}^{n_{10} k_{10}} W_{N_2}^{n_{20} k_{20}} ... W_{N_m}^{n_{m0} k_{m0}} accordingly; and I^t is the matrix for the m-D vector radix-r_1*r_2*...*r_m butterfly structure, also an r_1 r_2 ... r_m × r_1 r_2 ... r_m matrix, whose element I^t(i,j) (1 ≤ i,j ≤ r_1 r_2 ... r_m) equals W_{r_1}^{n_{10} k_{11}} W_{r_2}^{n_{20} k_{21}} ... W_{r_m}^{n_{m0} k_{m1}} correspondingly. Equation (4-2-1c) contains r_1*r_2*...*r_m N_1'*N_2'*...*N_m'-point m-D DFTs which can be further decimated.

The generalization of the structure theorem to the m-D DIT case is stated as follows. If N_i = r_i · N_i' in an m-D DFT defined by Equation (4-1-1), and the 1-D DIT FFT algorithms are given by:

[X(k_{i1},k_{i0})] = I^t_{r_i} [x_1'(k_{i0},n_{i0})]                                        (4-2-2a)

[x_1'(k_{i0},n_{i0})] = F^t_{r_i} [x_1(k_{i0},n_{i0})]                                      (4-2-2b)

[x_1(k_{i0},n_{i0})] = Σ_{n_{i1}=0}^{N_i'-1} W_{N_i'}^{n_{i1} k_{i0}} [x(n_{i1},n_{i0})]    (4-2-2c)

where 1 ≤ i ≤ m; 0 ≤ k_{i1}, n_{i0} ≤ r_i-1; 0 ≤ k_{i0}, n_{i1} ≤ N_i'-1; then the DIT m-D vector radix-r_1*r_2*...*r_m FFT algorithm is given by Equation (4-2-1), where:

F^t = F^t_{r_1} ⊗ F^t_{r_2} ⊗ ... ⊗ F^t_{r_m};     (4-2-3a)

I^t = I^t_{r_1} ⊗ I^t_{r_2} ⊗ ... ⊗ I^t_{r_m}.     (4-2-3b)

Similarly, a matrix form for DIF vector radix FFT algorithms given by Equation (4-1-4) can be presented as follows:

[x_1(k_{10},n_{10}; k_{20},n_{20}; ...; k_{m0},n_{m0})] = I^f [x(n_{11},n_{10}; n_{21},n_{20}; ...; n_{m1},n_{m0})]      (4-2-4a)

[x_1'(k_{10},n_{10}; ...; k_{m0},n_{m0})] = E^f [x_1(k_{10},n_{10}; ...; k_{m0},n_{m0})]                                  (4-2-4b)

[X(k_{11},k_{10}; k_{21},k_{20}; ...; k_{m1},k_{m0})] =
    Σ_{n_{10}=0}^{N_1'-1} Σ_{n_{20}=0}^{N_2'-1} ... Σ_{n_{m0}=0}^{N_m'-1}
    W_{N_1'}^{n_{10} k_{11}} W_{N_2'}^{n_{20} k_{21}} ... W_{N_m'}^{n_{m0} k_{m1}}
    [x_1'(k_{10},n_{10}; k_{20},n_{20}; ...; k_{m0},n_{m0})]                                 (4-2-4c)

where [x(n_{11},n_{10}; ...; n_{m1},n_{m0})], [x_1(k_{10},n_{10}; ...; k_{m0},n_{m0})], [x_1'(k_{10},n_{10}; ...; k_{m0},n_{m0})] and [X(k_{11},k_{10}; ...; k_{m1},k_{m0})] are r_1 r_2 ... r_m column vectors with k_{i0} and n_{i1} (1 ≤ i ≤ m) varying in bit-reversed order; E^f is the twiddle factor matrix, an r_1 r_2 ... r_m × r_1 r_2 ... r_m diagonal matrix whose element E^f(i,i) (1 ≤ i ≤ r_1 r_2 ... r_m) equals W_{N_1}^{n_{10} k_{10}} W_{N_2}^{n_{20} k_{20}} ... W_{N_m}^{n_{m0} k_{m0}} accordingly; and I^f is the matrix for the m-D vector radix-r_1*r_2*...*r_m butterfly structure, also an r_1 r_2 ... r_m × r_1 r_2 ... r_m matrix, whose element I^f(i,j) (1 ≤ i,j ≤ r_1 r_2 ... r_m) equals W_{r_1}^{n_{11} k_{10}} W_{r_2}^{n_{21} k_{20}} ... W_{r_m}^{n_{m1} k_{m0}} correspondingly. Equation (4-2-4c) contains r_1*r_2*...*r_m N_1'*N_2'*...*N_m'-point m-D DFTs which can be further decimated.

The generalization of the structure theorem to the m-D DIF case is stated as follows. If N_i = r_i · N_i' in an m-D DFT defined by Equation (4-1-1), and the 1-D DIF FFT algorithms are given by:

[x_1(k_{i0},n_{i0})] = I^f_{r_i} [x(n_{i1},n_{i0})]                                          (4-2-5a)

[x_1'(k_{i0},n_{i0})] = F^f_{r_i} [x_1(k_{i0},n_{i0})]                                       (4-2-5b)

[X(k_{i1},k_{i0})] = Σ_{n_{i0}=0}^{N_i'-1} W_{N_i'}^{n_{i0} k_{i1}} [x_1'(k_{i0},n_{i0})]    (4-2-5c)

where 1 ≤ i ≤ m; 0 ≤ k_{i1}, n_{i0} ≤ N_i'-1; 0 ≤ k_{i0}, n_{i1} ≤ r_i-1; then the DIF m-D vector radix-r_1*r_2*...*r_m FFT algorithm is given by Equation (4-2-4), where:

E^f = F^f_{r_1} ⊗ F^f_{r_2} ⊗ ... ⊗ F^f_{r_m};     (4-2-6a)

I^f = I^f_{r_1} ⊗ I^f_{r_2} ⊗ ... ⊗ I^f_{r_m}.     (4-2-6b)

To obtain the complete equations for an m-D DIT or DIF vector radix FFT, simply apply the theorem repeatedly to the remaining short-length m-D DFTs; this makes the derivation simpler and the programming easier, especially when the corresponding 1-D algorithm(s) or program(s) are available.

4-3 Diagrammatical Presentations

The logic diagram for an m-D vector radix FFT is much simpler than its matrix representations. Because the representation for m-D (m ≥ 3) is the same as that for 2-D, except for the definitions of each symbol, a recursive symbol system can be developed. Consider a procedure for developing a vector radix-r_1*r_2*...*r_i (1 ≤ i ≤ m, where m is the dimension of the DFT) butterfly structure along with the twiddle multiplication stage. In Figure-15-(a), x_i (0 ≤ i ≤ r_i-1) is a vector of dimension r_1*r_2*...*r_{i-1}, the elements of which are in natural order. The symbol labelled VR-r_1*r_2*...*r_{i-1} BF is the (i-1)-dimensional vector radix-r_1*r_2*...*r_{i-1} butterfly computation structure, and that labelled VR-r_1*r_2*...*r_{i-1} TM is the (i-1)-dimensional vector radix-r_1*r_2*...*r_{i-1} twiddling multiplication. x'_i (0 ≤ i ≤ r_i-1) is a vector of r_1*r_2*...*r_{i-1} dimensions with elements in bit-reversed order. The symbol labelled R-r_i BF on Dimension-i is a 1-dimensional radix-r_i butterfly computation structure which works on dimension i. x_{1i} (0 ≤ i ≤ r_i-1) is a vector of r_1*r_2*...*r_{i-1} dimensions. The elements of x_{1i} are in bit-reversed order, as are the output vectors x_{1i} (0 ≤ i ≤ r_i-1) themselves.

[Figure-15-(a), (b), (c): the recursive symbol system for constructing the m-D vector radix butterfly and twiddling multiplication stages]

Using the basic modification rules of logic diagrams, the logic diagram shown in Figure-15-(b) is derived. Combining the r_1*r_2*...*r_{i-1}-dimensional VR-r_1*r_2*...*r_{i-1} BF with the 1-dimensional radix-r_i BF results in an r_1*r_2*...*r_{i-1}*r_i-dimensional VR-r_1*r_2*...*r_{i-1}*r_i BF computational structure. When the VR-r_1*r_2*...*r_{i-1} TM is combined with the radix-r_i TM, an r_1*r_2*...*r_{i-1}*r_i-dimensional twiddling multiplication stage is formed using this symbol system, as shown in Figure-15-(c). It is thus possible to build an m-dimensional vector radix FFT algorithm by drawing the 1-D FFT algorithm(s) and finally to achieve the algorithm required.
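In code, the same recursive construction is compact. The sketch below (Python/numpy, illustrative only, not from the thesis) implements an m-dimensional vector radix-2*2*...*2 DIF FFT by the butterfly-twiddle-recurse pattern, for any number of dimensions:

```python
import itertools
import numpy as np

def vr_fft(x):
    """Vector radix-2*...*2 DIF FFT; every axis must have the same power-of-two length."""
    N = x.shape[0]
    if N == 1:
        return x.astype(complex)
    m = x.ndim
    h = N // 2
    W = np.exp(-2j * np.pi / N)
    grid = np.indices((h,) * m)            # the n_{i0} index along each axis
    X = np.empty_like(x, dtype=complex)
    for k0 in itertools.product((0, 1), repeat=m):
        # m-D butterfly: signed sum of the 2^m half-cubes
        bf = sum((-1) ** sum(a * b for a, b in zip(n1, k0))
                 * x[tuple(slice(a * h, (a + 1) * h) for a in n1)]
                 for n1 in itertools.product((0, 1), repeat=m))
        # twiddle stage: W_N^(sum_i n_{i0} * k_{i0})
        tw = W ** sum(g * k for g, k in zip(grid, k0))
        # recurse on the remaining (N/2)^m-point m-D DFT
        X[tuple(slice(k, None, 2) for k in k0)] = vr_fft(bf * tw)
    return X

x = np.random.default_rng(0).standard_normal((4, 4, 4))
assert np.allclose(vr_fft(x), np.fft.fftn(x))
```

The recursion mirrors the symbol system: each level adds one BF and one TM stage per dimension, exactly as the diagrams are assembled.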

Example-6:

The procedure to derive a vector radix-4*4*4 FFT algorithm using the logic diagram to compute a 16*16*16-point 3-D DFT, given a 1-D radix-4 FFT algorithm for a 16-point 1-D DFT, is as follows:

(a) draw the logic diagram using the radix-4 FFT for the 1-D 16-point DFT, as shown in Figure-6;

(b) determine the logic diagram using the "row-column" radix-4 FFT method on the 16*16*16-point 3-D DFT (not shown);

(c) use the vector radix-4*4 FFT to replace the 2-D row-column FFT algorithm on the 2-D data vectors, as shown in Figure-16; the blocks inscribed VR-4*4 BFa, VR-4*4 TM and VR-4*4 BFb are defined in Figure-10;

(d) modify Figure-16 into Figure-17 and combine twiddle factors to obtain the vector radix-4*4*4 FFT algorithm. The major difference between Figure-10 and Figure-17 is that all the symbols of Figure-10 represent and operate on vectors of size 16, whilst those of Figure-17 operate on vectors of size 256 (or 16*16), where x_i = [x(i,0,0), x(i,0,1), ..., x(i,0,15), x(i,1,0), ..., x(i,1,15), ..., x(i,15,0), ..., x(i,15,15)], i = 0,...,15.

[Figure-16: the 16*16*16-point DFT using the vector radix-4*4 FFT on 2-D data vectors; Figure-17: the vector radix-4*4*4 FFT algorithm]

4-4 Computing Power Limitations

When the dimensions of DFTs increase, the number of operations required for their computation increases dramatically. At the current stage of VLSI technology, only m-D (m > 3) DFTs of relatively small size can be processed at real-time speed [13, 28, 129, 130]. The difference between the computation times of an addition and a multiplication has been reduced, in some cases to nothing [40], so that the total number of numerical operations becomes the key issue [129]. The time used for data transfers also becomes significant, so that in-place computation and a regular computing structure are crucial in m-D DFT calculations [129, 130]. So far the implementation of m-D DFTs by and large uses the "row-column" method, whether on VLSI [2, 13-15], Very Long Instruction Word (VLIW) architecture supercomputers [129], or distributed memory multiprocessor supercomputers [130]. Amongst the few reports that use m-D fast vector radix FFT algorithms is the one by Liu and Hughes [134]. Although only the implementation of the vector radix-2*2 FFT is discussed in [134], many of its advantages over the row-column method have already been shown. The savings of vector radix FFT algorithms over the row-column FFT become substantial when the dimension of the DFT increases and/or higher radices are used, as indicated by Table-2 [1, 43-45]. Three very active areas associated with the hardware implementation of DFTs are ASICs [14-15, 134, 137, 142], systolic array designs [135, 136, 140, 141, 145, 147] and neural networks [138]. Still, even the latest successful implementations can only cater for 1-D DFTs or very small 2-D or 3-D DFTs at real-time speed, with the neural networks approach in its early stage. Complete hardware solutions to m-D (m > 3) DFT problems are dependent upon future development of VLSI technology, understanding different m-D algorithms and possessing the ability to construct them in a systematic way. It has been shown by many [134, 141, 147] that fast algorithms chosen for VLSI implementation should, apart from low arithmetic complexity and maximal use of pipelining and parallelism, possess a regular computation structure more than anything else, to enable

Table-2  Arithmetic complexity of FFT algorithms for 64*64*64 3-D DFTs

3-D FFT        Number of BF      Number of TM      Total number of    Percentage (total
algorithm      multiplications   multiplications   multiplications    multiplications)
R-2            0                 3*655,360         1,966,080          100.00%
R-4            0                 3*393,216         1,179,648           60.00%
R-8            3*131,072         3*229,376         1,081,344           55.00%
VR-2*2*2       0                 1,146,880         1,146,880           58.33%
VR-4*4*4       0                 516,096           516,096             26.25%
VR-8*8*8       393,216           261,632           654,848             33.31%
CF VR-8*8*8    229,376           261,632           491,008             24.97%

NOTE: R-i: the "row-column" 1-D radix-i FFT algorithm; VR: the vector radix FFT algorithm; CF: combined factor method applied; BF: butterfly computation structure; TM: twiddling multiplications.

systematic VLSI integration. The understanding of such algorithms and their implementation would be greatly assisted by an understanding of their computational structures.
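The percentage column of Table-2 is simply each algorithm's total multiplication count relative to the row-column radix-2 baseline. A quick arithmetic check (a sketch, not part of the thesis) reproduces it:

```python
# Re-derive the "percentage" column of Table-2: each algorithm's total
# multiplication count relative to the row-column radix-2 baseline.
totals = {
    "R-2": 1_966_080, "R-4": 1_179_648, "R-8": 1_081_344,
    "VR-2*2*2": 1_146_880, "VR-4*4*4": 516_096,
    "VR-8*8*8": 654_848, "CF VR-8*8*8": 491_008,
}
baseline = totals["R-2"]
for name, total in totals.items():
    print(f"{name:12s} {100 * total / baseline:6.2f}%")
```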

PART II.

MULTIDIMENSIONAL DISCRETE COSINE TRANSFORMS 88

CHAPTER FIVE: INTRODUCTION TO MULTIDIMENSIONAL DISCRETE COSINE TRANSFORMS

The Discrete Cosine Transform (DCT) was first introduced into digital signal processing for the purposes of pattern recognition and Wiener filtering [17]. The two dimensional (2-D) DCT is used for transform coding of images in telecommunications, such as video-conferencing, video telephony, video image compression for HDTV and applications in fast packet switching networks [3, 18, 19, 157]. Its performance is virtually indistinguishable from that of the optimal Karhunen-Loeve transform [3, 17] in terms of energy packing ability, decorrelation efficiency and least mean-square error. Many fast DCT algorithms require only real number operations and possess fairly regular computational structures similar to those of FFTs and vector radix FFTs, which substantially facilitates software and hardware implementations. It has, by now, become the standard decorrelation transform for compression of 1-D and 2-D signals [72, 73].

To implement the 2-D DCT, there are many fast algorithms available, and these algorithms are basically divided into two groups:

Direct fast algorithms, which are based on matrix factorization of the DCT

matrix or computation of a long length DCT by shorter length DCTs;

Indirect fast algorithms, which compute the DCT through an FFT of the

same size [34, 57, 82] or other fast algorithms [79, 113,124, 156].

In each group there are two approaches: the row-column approach, where the 2-D DCT is generated by repeated application of a 1-D DCT, and the 2-D fast algorithm approach.

In each group there are many fast algorithms, as shown in Figure-18, which is by no means exhaustive.

[Figure-18: Classification of fast 2-D DCT algorithms]

In 2-D image transform coding, the original image is usually divided into 8*8 or 16*16 blocks, and these blocks are cosine transformed. For real-time video coding, it is assumed that the transmission rate is 30 frames/second with a frame size of 288*352 pixels, so a processing rate of about 3.04 Msamples/s is required. This means that an 8*8-point DCT has to be carried out in about 21 μs, or a 16*16-point DCT has to be calculated within about 84 μs.
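The per-block time budgets follow directly from the frame rate and block size; a small sketch (illustrative only) reproduces the figures:

```python
# Time budget per DCT block for real-time video: 30 frames/s at 288*352
# pixels gives the sample rate; each n*n block must finish within
# n*n / rate seconds.
frames_per_s = 30
rate = frames_per_s * 288 * 352          # ~3.04 Msamples/s
for n in (8, 16):
    budget_us = n * n / rate * 1e6       # microseconds per block
    print(f"{n}*{n}-point DCT budget: {budget_us:.0f} us")
```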

For the last couple of years, VLSI fabrications of DCT processors which provide

real-time image coding throughput have been reported. Currently, these DCT processors

can only perform fixed-point calculations. The length of the input data format varies from 9 to 16 bits, as does that of the output of the DCT processors. Usually, they are adequate for the coding applications. Examples of DCT processors are the IMS A121 DCT processor (INMOS) [88], the STV3200 DCT processor (SGS-Thomson) [87], and the TMC2311 (TRW) [89]. Because the transform size used in image coding is relatively small, being 8*8 or

16*16, the performance of the above DCT processors is very close in terms of the

processing speed. Their processing rate varies from 13 MHz to 27 MHz which caters for

real-time image coding. For the same reason, various row-column algorithms, including

the direct matrix multiplication method, have been used in VLSI DCT processors without

showing a great deal of difference in speed performance for image coding. New developments in this area can be found in [158] and [159]. A comparative study of the

error performance of these DCT processors remains to be undertaken.

Of the two classes of 1-D fast DCT algorithms, the indirect approach shows little

advantage over the direct approach in terms of the arithmetic complexity, and it usually

does not have a regular computation structure and involves an excessive number of

additions [72]. These drawbacks carry over to their 2-D extensions, although 2-D indirect methods have been reported to require fewer multiplications [72]. Amongst the

direct algorithms, the Lee algorithm is by far the most efficient in terms of the number of

multiplications (or the total number of numerical operations) [76,78] and it has a regular,

systematic and simple computation structure. However, the algorithm requires inversion

or division of the cosine coefficients which has been claimed to cause numerical

instabilities because of roundoff errors in finite length registers [38, 72, 76, 77]. This

problem will be examined in the next chapter in comparison with other methods. Hou

introduced a new fast DCT algorithm [77] which uses bit-shifting and data shuffling for

better numerical performance. Hou's algorithm is as efficient as Lee's in terms of the number of multiplications and additions, and it also has a simple, regular structure. When these direct 1-D algorithms are extended into 2-D applications, their structural features are preserved, as indicated previously.

In the context of fast computation of 2-D DCTs, there are several reports on 2-D indirect algorithms [57, 72, 79], whilst the direct method up to now is dominated by row-column 1-D algorithms. The 2-D direct fast DCT algorithm [38], though more efficient, remains less well known [65, 72, 77]. Besides, not all 1-D direct fast DCT algorithms can be expanded to 2-D fast algorithms effectively. The only one which has been reported is the 2-D fast algorithm by Haque based on Lee's method [38]. In [38], the direct matrix decomposition method is used to expand Lee's algorithm, and the improvement of the new algorithm over several other known algorithms is demonstrated in terms of the number of operations. It has also been shown that the roundoff errors in Lee's algorithm do not cause serious problems for small sizes such as the 8*8- and 16*16-point 2-D data block DCTs which are commonly used in image coding applications.

We use a structured approach on Lee's algorithm directly to generate a 2-D fast DCT algorithm, reproducing the Haque algorithm. A 2-D logic diagram is also used to represent the algorithm, so that 8*8- and 16*16-point 2-D DCTs using the new 2-D fast DCT algorithm are readily devised from the 1-D Lee algorithm and easily implemented [46, 80, 81]. To avoid the roundoff errors that Lee's algorithm may cause, a new two dimensional fast DCT algorithm has been devised based on Hou's algorithm [80, 81, 108] using the same technique. Both algorithms are equally efficient.

5-1 Definitions of 1-D DCT and Its Inverse DCT

The definition of the N-point 1-D discrete cosine transform and its inverse are given by the following equations [3, 4, 17, 38, 76, 77, 82].

X(k) = (2/N) e(k) \sum_{n=0}^{N-1} x(n) C_{2N}^{(2n+1)k}   (5-1-1a)

and,

x(n) = \sum_{k=0}^{N-1} e(k) X(k) C_{2N}^{(2n+1)k}   (5-1-1b)

where n, k = 0, 1, ..., N-1; C_{2N}^{(2n+1)k} = \cos[\pi(2n+1)k/(2N)]; and e(k) = 1/\sqrt{2} if k = 0, and 1 otherwise.

In its matrix form, Equation (5-1-1) can be written as:

X = C^{II} x   (5-1-2a)

and,

x = C^{I} X   (5-1-2b)

where C^{II} = E \Gamma C; C^{I} = C^{a} E; E = diag[1/\sqrt{2}, 1, ..., 1]; \Gamma = diag[2/N, 2/N, ..., 2/N]; X is the DCT vector; x the data vector; C(i,j) = C_{2N}^{(2j+1)i}; and C^{a} = (C)^{T}, where the superscript T stands for the transpose operation of a matrix.

In order to derive fast algorithms, define

\tilde{X}(k) = [N/(2 e(k))] X(k)   (5-1-3a)

and,

\bar{X}(k) = e(k) X(k)   (5-1-3b)

resulting in the "denormalized" DCT and IDCT as shown in Equation (5-1-4):

\tilde{X}(k) = \sum_{n=0}^{N-1} x(n) C_{2N}^{(2n+1)k}   (5-1-4a)

and,

x(n) = \sum_{k=0}^{N-1} \bar{X}(k) C_{2N}^{(2n+1)k}   (5-1-4b)

where n, k = 0, 1, ..., N-1.

In matrix form:

\tilde{X} = C x   (5-1-5a)

and,

x = C^{a} \bar{X} = (C)^{T} \bar{X}   (5-1-5b)

where T stands for the transpose operation.

It is from Equation (5-1-4) or Equation (5-1-5) that fast algorithms are derived. Operations involving E and \Gamma can be applied either before or after the denormalized DCT or IDCT is performed.

Example-7: For a 4-point DCT,

C^{II} = diag[1/\sqrt{2}, 1, 1, 1] diag[1/2, 1/2, 1/2, 1/2] [ C_8^0   C_8^0   C_8^0   C_8^0 ;
                                                             C_8^1   C_8^3  -C_8^3  -C_8^1 ;
                                                             C_4^1  -C_4^1  -C_4^1   C_4^1 ;
                                                             C_8^3  -C_8^1   C_8^1  -C_8^3 ]   (5-1-6)

and the 4-point IDCT matrix is given by

C^{I} = [ C_8^0   C_8^1   C_4^1   C_8^3 ;
          C_8^0   C_8^3  -C_4^1  -C_8^1 ;
          C_8^0  -C_8^3  -C_4^1   C_8^1 ;
          C_8^0  -C_8^1   C_4^1  -C_8^3 ] diag[1/\sqrt{2}, 1, 1, 1]   (5-1-7)

where C_{2N}^{m} = \cos(m\pi/(2N)), so that C_8^m = \cos(m\pi/8) and C_4^1 = \cos(\pi/4).
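Equations (5-1-6) and (5-1-7) can be checked numerically: since C^{II} = E \Gamma C and C^{I} = C^{T} E, the two matrices should be exact inverses of each other. A NumPy sketch (illustrative, not from the thesis):

```python
import numpy as np

N = 4
# C(i, j) = cos(pi*(2j+1)*i/(2N)); entries C_8^m = cos(m*pi/8) as in (5-1-6).
C = np.array([[np.cos(np.pi * (2 * j + 1) * i / (2 * N)) for j in range(N)]
              for i in range(N)])
E = np.diag([1 / np.sqrt(2)] + [1.0] * (N - 1))  # E = diag[1/sqrt(2), 1, ..., 1]
G = np.diag([2.0 / N] * N)                       # Gamma = diag[2/N, ..., 2/N]
C_fwd = E @ G @ C        # forward DCT matrix of (5-1-6)
C_inv = C.T @ E          # inverse DCT matrix of (5-1-7)
assert np.allclose(C_fwd @ C_inv, np.eye(N))
assert np.allclose(C_inv @ C_fwd, np.eye(N))
```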

5-2 Definitions of 2-D DCT and Its Inverse DCT

The N*N-point 2-D DCT and its inverse (IDCT) are given by Equations (5-2-1a) and (5-2-1b) [4, 57, 72, 80].

X(k,l) = (4/N^2) e(k) e(l) \sum_{n=0}^{N-1} \sum_{m=0}^{N-1} x(n,m) C_{2N}^{(2n+1)k} C_{2N}^{(2m+1)l}   (5-2-1a)

and,

x(n,m) = \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} e(k) e(l) X(k,l) C_{2N}^{(2n+1)k} C_{2N}^{(2m+1)l}   (5-2-1b)

In its matrix form, Equation (5-2-1) can be written as:

X = C^{II} x   (5-2-2a)

and,

x = C^{I} X   (5-2-2b)

where x and X are formed by stacking transposed row vectors of the input and output 2-D arrays respectively; C^{II} = (E \otimes E)(\Gamma \otimes \Gamma)(C \otimes C); C^{I} = (C^{a} \otimes C^{a})(E \otimes E); E = diag[1/\sqrt{2}, 1, ..., 1]; \Gamma = diag[2/N, ..., 2/N]; C(i,j) = C_{2N}^{(2j+1)i}; and C^{a} = (C)^{T}. The symbol \otimes stands for the tensor product.

Defining

\tilde{X}(k,l) = [N^2/(4 e(k) e(l))] X(k,l)   (5-2-3a)

and,

\bar{X}(k,l) = e(k) e(l) X(k,l)   (5-2-3b)

results in the denormalized 2-D DCT and IDCT as shown in Equation (5-2-4) [46, 80]:

\tilde{X}(k,l) = \sum_{n=0}^{N-1} \sum_{m=0}^{N-1} x(n,m) C_{2N}^{(2n+1)k} C_{2N}^{(2m+1)l}   (5-2-4a)

and,

x(n,m) = \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} \bar{X}(k,l) C_{2N}^{(2n+1)k} C_{2N}^{(2m+1)l}   (5-2-4b)

In matrix form:

\tilde{X} = (C \otimes C) x   (5-2-5a)

and,

x = (C^{a} \otimes C^{a}) \bar{X} = (C \otimes C)^{T} \bar{X}   (5-2-5b)

Fast algorithms are usually derived from either Equation (5-2-4) or its matrix form as presented by Equation (5-2-5). Although the definitions for the 2-D DCT and its inverse are slightly different from those given in [72], those of the denormalized forms are the same. These are the basis for the derivation of various fast algorithms. Generally, the mathematical derivation of an algorithm is quite involved and it is difficult to see the computational structure. For this reason, a logic diagram is used to present the computational structure of each algorithm.

5-3 Applications of 2-D DCTs in Image Compression

Image coding (compression) is a typical application of the 2-D DCT. It has been made an international standard by CCITT for video coding applications [72, 73].

Various 2-D DCT algorithms have been developed into computer programs for the purposes of simulation studies of video coding [81, 96]. In the following example, image compression is demonstrated using the row-column Lee fast DCT algorithm on a 256*256-pixel image frame.

Example-8:

In this example, the Series 151 Image Processor by Imaging Technology has been used to acquire and store images. It is hosted by a PC-AT which performs the 2-D DCT calculation. A 512*512-pixel image is snapped and stored in the frame grabber of the Series 151 Image Processor. The frame is divided into four quadrants, each of which consists of 256*256 pixels. The upper-right quadrant is used to display the original image, the upper-left quadrant the scaled DCT coefficients, the bottom-left quadrant the reconstructed image after applying different filtering on the DCT coefficients, and the bottom-right quadrant the difference image between the original and reconstructed images. The 2-D DCT is applied on 8*8-pixel blocks. DCT coefficients are scaled using a block size of 64*64 pixels. The difference image can also be scaled so that the error signal can be seen. A signed 9-bit integer is used for the DCT coefficients. The system setting is shown in Figure-19.

Two types of filter masks are used in this example: the 2-D ideal low-pass filter and the zigzag filter, as shown in Figure-20 (a) and (b), where n is the length of the filter. The filter mask is used to eliminate selected DCT coefficients. In Figure-21, an ideal low-pass filter is used with n = 4 so that a compression ratio of 3.56:1, that is, (8 bits / 9 bits) * (64 pixels / 16 pixels), is obtained, using signed 9-bit integers for the DCT coefficients.

The difference image is magnified twenty times. The effects, shown by stripes, are caused by two dimensional noise introduced in the imaging system. This has been detected by analyzing the 2-D Fourier spectrum of the image [Appendix C]. A zigzag filter of n = 5 is used in Figure-22 to achieve a compression ratio of 3.79:1, that is, (8 bits / 9 bits) * (64 pixels / 15 pixels). Although a higher compression ratio is used than that of Figure-21, the improvement in the reconstructed image quality is obvious, especially in the areas of the English characters of the poster in the background, and the face and shoulder. When different bit allocation or adaptive schemes are applied, higher compression ratios can be obtained [3, 105].
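The two compression ratios follow directly from the number of retained coefficients per 8*8 block; a sketch (the helper `ratio` is a hypothetical name, not from the thesis):

```python
# Compression ratio: (pixel bits per block) / (coefficient bits kept),
# with 8-bit pixels, signed 9-bit DCT coefficients and 8*8 (64-sample) blocks.
def ratio(coeffs_kept, pixel_bits=8, coeff_bits=9, block=64):
    return (pixel_bits * block) / (coeff_bits * coeffs_kept)

print(f"{ratio(16):.2f}:1")  # ideal low-pass, n = 4 keeps 4*4 = 16 coefficients
print(f"{ratio(15):.2f}:1")  # zigzag, n = 5 keeps 1+2+3+4+5 = 15 coefficients
```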

System setting for the image compression experiment.

Figure-19

The two dimensional rectangular filter of size n.

Figure-20 (a)

The two dimensional zigzag filter of size n.

Figure-20 (b)

An example of DCT compression of a 256*256-pixel image; an ideal low-pass filter is used with n = 4, with signed 9-bit DCT coefficients.

Figure-21

An example of DCT compression of a 256*256-pixel image; a zigzag filter is used with n = 5, with signed 9-bit DCT coefficients.

Figure-22

5-4 2-D Indirect Fast DCT Algorithms

In the two categories of 2-D fast DCT algorithms, the indirect approach obtains a 2-

D DCT from a 2-D DFT of the same size [57]. One can use row-column FFT algorithms,

WFTA, etc., to calculate real-valued DFTs as discussed previously. The arithmetic complexity is fairly low [72]. The computational kernel of this method has a simple structure, hence it is easy to implement in VLSI. If 2-D FFT algorithms are invoked to calculate the 2-D DFT, the arithmetic complexity can be further reduced, particularly if vector radix FFT algorithms are used [37, 43-45]. The structure of the algorithm is kept fairly simple and roundoff errors are also reduced compared with the row-column approach [50, 51]. A polynomial transform for 2-D DFT computation has lower computational requirements but a complex computation structure [58]. Whether it is justified for fast computation of the 2-D DCT remains to be seen [72]. The same can be said for 2-D indirect fast DCT methods using other reduced-multiplication fast Fourier transform algorithms [32, 35]. There are several reports on 2-D indirect fast DCTs which map DCTs into DFTs [57, 72, 79, 84], and in [57] complete formulas for both forward and inverse DCT transforms are given. These are used in this thesis. According to Makhoul [57], a 2-D DCT can be converted to and computed by a 2-D

DFT following the steps given below.

Step 1: 2-D N*N-point data rearrangement

v(n1,n2) = x(2n1, 2n2)             if 0 <= n1 <= [(N-1)/2] and 0 <= n2 <= [(N-1)/2];
v(n1,n2) = x(2N-2n1-1, 2n2)        if [(N+1)/2] <= n1 <= N-1 and 0 <= n2 <= [(N-1)/2];
v(n1,n2) = x(2n1, 2N-2n2-1)        if 0 <= n1 <= [(N-1)/2] and [(N+1)/2] <= n2 <= N-1;
v(n1,n2) = x(2N-2n1-1, 2N-2n2-1)   if [(N+1)/2] <= n1 <= N-1 and [(N+1)/2] <= n2 <= N-1.
                                                                              (5-4-1)

Step 2: 2-D N*N-point DFT on v(n1,n2)

V(k1,k2) = \sum_{n1=0}^{N-1} \sum_{n2=0}^{N-1} v(n1,n2) W_N^{n1 k1} W_N^{n2 k2}   (5-4-2)

where k1 = 0, 1, ..., N-1, k2 = 0, 1, ..., N-1 and W_N = e^{-j 2\pi/N}.

Step 3: Obtain the 2-D DCT from the output of the 2-D FFT by either of two formulas:

C(k1,k2) = 2 Re{ W_{4N}^{k1} [ W_{4N}^{k2} V(k1,k2) + W_{4N}^{-k2} V(k1, N-k2) ] }   (5-4-3a)

or:

C(k1,k2) = 2 Re{ W_{4N}^{k2} [ W_{4N}^{k1} V(k1,k2) + W_{4N}^{-k1} V(N-k1, k2) ] }   (5-4-3b)

that is, the 2-D DCT so obtained is:

C(k1,k2) = 4 \sum_{n1=0}^{N-1} \sum_{n2=0}^{N-1} v(n1,n2) \cos[\pi(4n1+1)k1/(2N)] \cos[\pi(4n2+1)k2/(2N)]   (5-4-3c)

Different 2-D fast discrete Fourier transform algorithms may be used (e.g., the vector radix FFT [37, 43, 45, 60], the 2-D WFTA [35], the 2-D polynomial transform [58], etc. [32, 65]) to calculate the 2-D DFT in Step 2 for the 8*8- or 16*16-point FFT.

The inverse DCT can be done by the following steps:

Step 1': Generate the 2-D DFT from the 2-D DCT

V(k1,k2) = (1/4) W_{4N}^{-k1} W_{4N}^{-k2} { [C(k1,k2) - C(N-k1, N-k2)] - j [C(N-k1, k2) + C(k1, N-k2)] }   (5-4-4)

Step 2': 2-D IDFT

v(n1,n2) = (1/N^2) \sum_{k1=0}^{N-1} \sum_{k2=0}^{N-1} V(k1,k2) W_N^{-n1 k1} W_N^{-n2 k2}   (5-4-5)

where n1, n2 = 0, 1, ..., N-1.

Step 3': Recover the sequence x(n1,n2) by inverting the rearrangement of Step 1 of the forward DCT.

In [72], [79] and [84], there are no formulas given. Instead, the 2-D inverse DCTs are generated on the flow graph using the transposition theorem of orthogonal transforms.
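The three forward steps can be checked numerically against the direct 2-D denormalized DCT, for which C(k1,k2) = 4 times the double cosine sum. A NumPy sketch (illustrative only; not the thesis's implementation):

```python
import numpy as np

N = 8
rng = np.random.default_rng(1)
x = rng.standard_normal((N, N))

# Step 1: Makhoul's reordering along each dimension
# (even samples in order, then the odd samples reversed).
idx = np.concatenate([np.arange(0, N, 2), np.arange(N - 1, 0, -2)])
v = x[np.ix_(idx, idx)]

# Step 2: 2-D DFT of the reordered array (numpy uses W_N = exp(-2j*pi/N)).
V = np.fft.fft2(v)

# Step 3: recover the DCT via Equation (5-4-3a).
k1 = np.arange(N)[:, None]
k2 = np.arange(N)[None, :]
W = lambda k: np.exp(-1j * np.pi * k / (2 * N))   # W_{4N}^k
Vshift = np.roll(V[:, ::-1], 1, axis=1)           # V(k1, N-k2), with V(k1, N) = V(k1, 0)
C_dct = 2 * np.real(W(k1) * (W(k2) * V + W(-k2) * Vshift))

# Direct denormalized 2-D DCT for comparison.
Cm = np.array([[np.cos(np.pi * (2 * j + 1) * i / (2 * N)) for j in range(N)]
               for i in range(N)])
assert np.allclose(C_dct, 4 * Cm @ x @ Cm.T)
```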

A 2-D indirect DCT method using a convolution algorithm has been mentioned in [72], which claims a dramatic reduction in multiplications. A detailed study, however, remains to be undertaken.

The arithmetic complexity of indirect DCT algorithms depends on the FFT algorithm used. Since FFT algorithms are well documented, the arithmetic complexity of indirect DCT algorithms can be readily obtained.

CHAPTER SIX: 2-D DIRECT FAST DCT ALGORITHMS

It is known that 2-D fast transform algorithms are often more efficient than row-column 1-D algorithms in terms of computational operations, i.e., they need fewer multiplications and additions than the row-column method to compute the same transform. Many 2-D algorithms also possess in-place computation, a regular structure and small roundoff errors [37, 45, 50, 51], which all provide advantages.

2-D direct fast DCT algorithms, which will be discussed in the following sections, are generated from 1-D Lee's and Hou's algorithms respectively. They require fewer multiplications than the row-column method and provide a systematic computation structure featuring 2-D BFs and TM stages as well. The in-place computation possessed by these algorithms is obvious. The computational complexity of various DCT algorithms is considered, and the computation structure of 2-D direct DCT algorithms is analyzed in comparison with that of 2-D vector radix FFTs.

In the next two sections, the matrix and diagrammatical representations of 1-D

direct DCT algorithms are discussed as the bases for the derivation of 2-D direct vector

radix DCT algorithms. The 2-D DCT algorithms are then introduced using the structural

approach, in a manner similar to that in which the VR FFTs are constructed.

6-1 2-D Direct Fast DCT Algorithm Based on Lee's Method

6-1-1 1-D Lee's algorithm in matrix form

Lee's algorithm is a direct fast DCT algorithm. For an N-point forward DCT,

Equation (5-1-4a) can be decomposed into two N/2-point DCTs by the following steps in

a matrix form [46, 76, 80].

[g'1(n); g'2(n)] = [1 1; 1 -1] [x(n); x(N-1-n)]   (6-1-1a)

[g1(n); g2(n)] = [1 0; 0 1/(2 C_{2N}^{(2n+1)})] [g'1(n); g'2(n)]   (6-1-1b)

[G1(k); G2(k)] = \sum_{n=0}^{N/2-1} C_{2(N/2)}^{(2n+1)k} [g1(n); g2(n)]   (6-1-1c)

[X(2k); X(2k+1)] = [1 0 0; 0 1 1] [G1(k); G2(k); G2(k+1)]   (6-1-1d)

where k, n = 0, 1, ..., N/2-1, and G2(k+1) |_{k=N/2-1} = 0. Define the post- or pre-calculation matrix P, the butterfly matrix B and the multiplication matrix M as follows:

P = [1 0 0; 0 1 1],   B = [1 1; 1 -1],   M = [1 0; 0 1/(2 C_{2N}^{(2n+1)})].

The N-point IDCT in Equation (5-1-4b) can also be decomposed into two N/2-point IDCTs by the following steps in matrix form [46, 80].

[H1(k); H2(k)] = P [X(2k); X(2k+1); X(2k-1)]   (6-1-2a)

[h1(n); h2(n)] = \sum_{k=0}^{N/2-1} C_{2(N/2)}^{(2n+1)k} [H1(k); H2(k)]   (6-1-2b)

[h'1(n); h'2(n)] = M [h1(n); h2(n)]   (6-1-2c)

[x(n); x(N-1-n)] = B [h'1(n); h'2(n)]   (6-1-2d)

where X(2k-1) = 0 if k = 0. The above one dimensional fast DCT algorithm is described in Lee's paper [76], except for the matrix representations [46]. The matrix representations used here are very useful when a new 2-D fast DCT algorithm is devised [45, 46, 60, 70, 80, 83].

Notice that:

[x(n); x(N-1-n)] = [1; s^{N-1-2n}] x(n)   (6-1-3)

[X(2k); X(2k+1); X(2k-1)] = [1; s; s^{-1}] X(2k)   (6-1-4)

and,

[G1(k); G2(k); G2(k+1)] = [1 0; 0 1; 0 s] [G1(k); G2(k)]   (6-1-5)

where the delay operator s is defined by x(n+1) = s x(n), x(n-1) = s^{-1} x(n).
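Applied recursively, the four steps (6-1-1a) to (6-1-1d) compute a complete denormalized DCT. A Python sketch (illustrative only, not Lee's original code):

```python
import math

def lee_dct(x):
    """Denormalized DCT, Xt(k) = sum_n x(n)*cos(pi*(2n+1)*k/(2N)),
    computed by the even/odd decomposition of Equations (6-1-1a)-(6-1-1d)."""
    N = len(x)
    if N == 1:
        return [x[0]]
    h = N // 2
    # (6-1-1a, b): butterfly, then division by 2*cos(pi*(2n+1)/(2N))
    g1 = [x[n] + x[N - 1 - n] for n in range(h)]
    g2 = [(x[n] - x[N - 1 - n]) / (2 * math.cos(math.pi * (2 * n + 1) / (2 * N)))
          for n in range(h)]
    # (6-1-1c): two half-length DCTs
    G1, G2 = lee_dct(g1), lee_dct(g2)
    # (6-1-1d): interleave, with G2(k+1) = 0 at k = N/2 - 1
    X = [0.0] * N
    for k in range(h):
        X[2 * k] = G1[k]
        X[2 * k + 1] = G2[k] + (G2[k + 1] if k + 1 < h else 0.0)
    return X

# Check against the direct O(N^2) sum for N = 8.
x = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, -2.0, 1.5]
N = len(x)
direct = [sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
              for n in range(N)) for k in range(N)]
assert all(abs(a - b) < 1e-9 for a, b in zip(lee_dct(x), direct))
```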

The logic diagrams for the 1-D 8-point and 16-point denormalized IDCT are shown in Figure-23 and Figure-24 respectively. Note that in the above figures the input sequence is in bit reversed order whilst the output sequence is generated by starting with the set (0,1), forming a new set by adding the prefix "0" to each element, and then obtaining the rest of the elements by complementing the existing ones. Therefore the sets corresponding to 2-, 4-, 8- and 16-point output sequences will be: (0,1), (00,01,11,10),

(000, 001, 011, 010, 111, 110, 100, 101) and (0000, 0001, 0011, 0010, 0111, 0110,

0100, 0101, 1111, 1110, 1100, 1101, 1000, 1001, 1011, 1010). The corresponding

1-D fast DCT algorithms can be obtained easily by interchanging the input and output, reversing the direction of data flow and changing addition blocks to branches and branches to addition blocks of the IDCT logic diagrams [76, 84].

From the logic diagram, Lee's algorithm achieves a good performance in terms of the number of multiplications and additions and also has a regular structure.

For an N-point DCT, N = 2^m, the numbers of multiplications and additions required for the calculation are as follows:

O_M[DCT(2^m)] = m * 2^{m-1};

O_A[DCT(2^m)] = m * 2^m + \sum_{i=0}^{m-1} 2^i (2^{m-i-1} - 1), for m > 1.
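These counts can be evaluated directly; a sketch (the helper names are hypothetical):

```python
# Multiplication and addition counts for Lee's N-point DCT, N = 2**m.
def lee_mults(m):
    return m * 2 ** (m - 1)

def lee_adds(m):
    return m * 2 ** m + sum(2 ** i * (2 ** (m - i - 1) - 1) for i in range(m))

print(lee_mults(3), lee_adds(3))   # N = 8: 12 multiplications, 29 additions
```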

[Figure-23: Logic diagram of the 1-D 8-point denormalized IDCT based on Lee's algorithm]

[Figure-24: Logic diagram of the 1-D 16-point denormalized IDCT based on Lee's algorithm]

Since the publication of Lee's paper, the algorithm has been criticized because of the roundoff errors produced by the required division by the cosine coefficients in the matrix M. Haque [38], however, has shown that roundoff errors are not serious for small size DCTs, although this fact has not been widely recognized [65, 72, 77].

6-1-2 Derivation of 2-D fast DCT algorithm from Lee's algorithm

Although this method was first introduced by Haque [38] in 1985, immediately after the publication of Lee's algorithm, it was derived independently by Wu and Paoloni [46] using a structural approach. The latter technique makes the 2-D fast algorithm a simple and systematic extension of Lee's algorithm and will be presented here. The N*N-point 2-D DCT and its inverse (IDCT) after denormalization are given by

Equations (5-2-4a) and (5-2-4b). Equation (5-2-4a) can be decomposed into

(N/2)*(N/2)-point 2-D DCTs in the same way as was done to the 1-D DCT in the last

sub-section. Matrix forms of the four-step algorithm are shown by Equations (6-1-2-1a) to (6-1-2-1d) [46].

g' = \bar{B} \bar{x}   (6-1-2-1a)

g = \bar{M} g'   (6-1-2-1b)

G_i(k,l) = \sum_{n=0}^{N/2-1} \sum_{m=0}^{N/2-1} C_{2(N/2)}^{(2n+1)k} C_{2(N/2)}^{(2m+1)l} g_i(n,m)   (6-1-2-1c)

\bar{X} = \bar{P} \bar{G}   (6-1-2-1d)

where

\bar{x} = ( [1; s^{N-1-2n}] \otimes [1; z^{N-1-2m}] ) x(n,m),

g' = [g'1(n,m), g'2(n,m), g'3(n,m), g'4(n,m)]^T,   g = [g1(n,m), g2(n,m), g3(n,m), g4(n,m)]^T,

\bar{G} = ( [1 0; 0 1; 0 s] \otimes [1 0; 0 1; 0 z] ) G,   G = [G1(k,l), G2(k,l), G3(k,l), G4(k,l)]^T,

\bar{X} = ( [1; s] \otimes [1; z] ) X(2k,2l),

\bar{P} = P \otimes P,   \bar{M} = M \otimes M',   \bar{B} = B \otimes B,

where k, l, n, m = 0, 1, ..., N/2-1, and G_i(k+1,*) |_{k=N/2-1} = G_i(*,l+1) |_{l=N/2-1} = 0, for i = 2, 3, 4; the matrices P, B and M are defined as in the last sub-section and M' is derived by substituting m for n in M; the symbol \otimes stands for the tensor (Kronecker) product; and the delay operators s and z operate on different indices.

Equation (6-1-2-1c) represents four (N/2)*(N/2)-point 2-D DCTs which can be further decomposed into even shorter length 2-D DCTs, and so forth. In essence, the relationship between 2-D direct DCT algorithms and their 1-D counterparts is governed by the properties of the tensor product. Equation (6-1-2-1) can be proved by direct derivation (Appendix D).

The 2-D fast IDCT algorithm derived from Equation (6-1-2) in matrix form can be given in a similar way [46, 80]. According to the structured approach, the 2-D fast IDCT algorithm based on Lee's method is presented by Equation (6-1-2-2):

H = \bar{P} \bar{X}   (6-1-2-2a)

h_i(n,m) = \sum_{k=0}^{N/2-1} \sum_{l=0}^{N/2-1} C_{2(N/2)}^{(2n+1)k} C_{2(N/2)}^{(2m+1)l} H_i(k,l)   (6-1-2-2b)

h' = \bar{M} h   (6-1-2-2c)

\bar{x} = \bar{B} h'   (6-1-2-2d)

where

\bar{X} = ( [1; s; s^{-1}] \otimes [1; z; z^{-1}] ) X(2k,2l),

H = [H1(k,l), H2(k,l), H3(k,l), H4(k,l)]^T,   h = [h1(n,m), h2(n,m), h3(n,m), h4(n,m)]^T,

\bar{x} = ( [1; s^{N-1-2n}] \otimes [1; z^{N-1-2m}] ) x(n,m),

\bar{P} = P \otimes P,   \bar{M} = M \otimes M',   \bar{B} = B \otimes B,

where k, l, m, n = 0, 1, ..., N/2-1; s and z are two delay operators which operate on different dimensions; and X(2k-1,*) |_{k=0} = X(*,2l-1) |_{l=0} = 0.

The mathematical structure becomes even simpler when a logic diagram is used [45, 46, 80]. For instance, the logic diagram of an 8-point 1-D IDCT is given in Figure-23 [46]. The row-column IDCT is applied to an 8*8-point 2-D IDCT in Figure-25, where the row transforms are implemented by the 1-D IDCT blocks at the input and the column transforms by the remainder. The number of multiplications required for an N*N-point 2-D IDCT is N^2 log2 N. Figure-26 shows the evolution of the row-column approach into the two dimensional algorithm. The 1-D IDCT row operations of Figure-25 are now distributed throughout the logic diagram, and thus the number of multiplications remains, as yet, unchanged. However, it is possible to combine adjacent factors to reduce the number of multiplications further. For example, in the block 2D-M3 of Figure-26, the factor \alpha is moved into the block M3, which has the effect of changing the internal multiplier values of M3 [45, 46]. The procedure is repeated for blocks 2D-M2 and 2D-M1. The total number of multiplications required to perform an N*N-point 2-D IDCT is now (3/4) N^2 log2 N, whilst the number of additions remains unchanged. The logic diagram for a 16*16-point

[Figure-25: Logic diagram of the 8*8-point 2-D IDCT using the row-column approach]

[Figure-26: Logic diagram of the 8*8-point 2-D fast IDCT with combined twiddle factors]

IDCT using the above 2-D fast algorithm can also be constructed in the same manner. Start by drawing a 16-point 1-D IDCT diagram using Lee's algorithm; then the 2-D diagram can be constructed immediately using the simple rules of the logic diagram [81], as shown in Figure-27. The forward DCT can be obtained by reversing the direction of the arrows in the logic diagram of the IDCT, since the DCT is an orthogonal transform [76, 84].

Using the above logic diagrams, both software and hardware implementations of 2-D DCTs can proceed.
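The row-column use of the 1-D Lee IDCT can be sketched in Python (illustrative only; `lee_idct` and `idct2` are hypothetical names, not the thesis's code):

```python
import math

def lee_idct(X):
    """Denormalized IDCT, x(n) = sum_k Xb(k)*cos(pi*(2n+1)*k/(2N)),
    via the steps of Equations (6-1-2a)-(6-1-2d)."""
    N = len(X)
    if N == 1:
        return [X[0]]
    h = N // 2
    # (6-1-2a): H1(k) = X(2k), H2(k) = X(2k+1) + X(2k-1), with X(-1) = 0
    H1 = [X[2 * k] for k in range(h)]
    H2 = [X[2 * k + 1] + (X[2 * k - 1] if k > 0 else 0.0) for k in range(h)]
    # (6-1-2b): two half-length IDCTs
    h1, h2 = lee_idct(H1), lee_idct(H2)
    x = [0.0] * N
    for n in range(h):
        # (6-1-2c, d): scaling, then butterfly
        t = h2[n] / (2 * math.cos(math.pi * (2 * n + 1) / (2 * N)))
        x[n] = h1[n] + t
        x[N - 1 - n] = h1[n] - t
    return x

# Row-column 2-D IDCT: 1-D IDCT on every row, then on every column.
def idct2(X):
    n = len(X)
    rows = [lee_idct(list(r)) for r in X]
    cols = [lee_idct([rows[i][j] for i in range(n)]) for j in range(n)]
    return [[cols[j][i] for j in range(n)] for i in range(n)]
```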

6-2 2-D Direct Fast DCT Algorithm Based on Hou's Method

6-2-1 1-D Hou's algorithm in matrix form

Hou introduced a recursive fast DCT algorithm in 1987, which achieves computational efficiency equal to that of Lee's algorithm and provides better numerical performance. However, as a tradeoff, shifting and multiplexing operations are required in this method. Although Hou classifies his algorithm differently, it is still a direct fast DCT algorithm, and it is recursive because the higher order DCT matrices are generated directly from the lower order DCT matrices [77].

The 1-D N-point DCT definition used to generate the algorithm is derived from

Equation (5-1-4a) [82] and is given by:

\tilde{X}(k) = \sum_{n=0}^{N-1} \tilde{x}(n) \cos(\theta_k + 2\pi k n/N)   (6-2-1-1)

where \theta_k = \pi k/(2N) and, for n = 0, 1, ..., (N/2)-1,

\tilde{x}(n) = x(2n),   \tilde{x}(N-1-n) = x(2n+1)   (6-2-1-2)

The Decimation-In-Frequency recursive Hou algorithm for the 1-D DCT generated from Equation (6-2-1-1) is:

[\hat{Z}_e; \hat{Z}_o] = \hat{T}(N) [\tilde{x}_f; \tilde{x}_r]   (6-2-1-3)

where \tilde{x}_f and \tilde{x}_r denote the first and second halves of the reordered sequence \tilde{x}(n);

\hat{T}(N) = [ \hat{T}(N/2)            \hat{T}(N/2) ;
               K \hat{T}(N/2) Q     -K \hat{T}(N/2) Q ],   \hat{T}(1) = 1,   \hat{T}(2) = [1 1; a -a],   a = \cos(\pi/4);

\hat{Z}_e = R \hat{X}_e, where \hat{X}_e is the vector consisting of the even terms of \tilde{X}(k) in natural order;

\hat{Z}_o = R \hat{X}_o, where \hat{X}_o is the vector consisting of the odd terms of \tilde{X}(k) in natural order;

R is the permutation matrix for performing bit reversal. For example,

R_2 = I_2 = [1 0; 0 1],   R_4 = [1 0 0 0; 0 0 1 0; 0 1 0 0; 0 0 0 1],   etc.;

K = R L R,   L = [ 1  0  0  0 ...  0 ;
                  -1  2  0  0 ...  0 ;
                   1 -2  2  0 ...  0 ;
                  -1  2 -2  2 ...  0 ;
                   .  .  .  . ...  . ;
                  -1  2 -2  2 ...  2 ],

Q = diag[\cos \Phi_m], m = 0, 1, 2, ..., (N/2)-1, with \Phi_m = (m + 1/4)(2\pi/N). Note that K_2 = L_2, and that K is the result of bit reversal of the row and column indices of L.

Equation (6-2-1-3) can be used recursively to form the complete formula. The logic diagrams for 8-point and 16-point DCTs using the DIF Hou algorithm are presented in

Figures 28 and 29. The input sequence is ordered as x(0), x(2), ..., x(N-2), x(N-1), x(N-3), ..., x(1), where N is the length of the DCT. The output sequence is in bit-reversed order.
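That the reordering (6-2-1-2) together with definition (6-2-1-1) reproduces the denormalized DCT (5-1-4a) can be verified numerically; a sketch (not from the thesis):

```python
import math

N = 8
x = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5, -1.0, 2.0]
# Reordering of Equation (6-2-1-2): even samples in order,
# odd samples reversed into the upper half.
xt = [0.0] * N
for n in range(N // 2):
    xt[n] = x[2 * n]
    xt[N - 1 - n] = x[2 * n + 1]
# Equation (6-2-1-1) on the reordered sequence...
hou = [sum(xt[n] * math.cos(math.pi * k / (2 * N) + 2 * math.pi * k * n / N)
           for n in range(N)) for k in range(N)]
# ...equals the denormalized DCT (5-1-4a) on the original sequence.
dct = [sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
           for n in range(N)) for k in range(N)]
assert all(abs(a - b) < 1e-9 for a, b in zip(hou, dct))
```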

In [77], another algorithm, the Decimation-In-Time algorithm for the DCT, is devised as a dual method to the DIF algorithm. It is obtained simply by switching the indices between the input and the output and taking the transpose of the DCT matrix \hat{T}(N).

For the inverse transform, substitute Equation (6-2-1-2) into Equation (5-1-4b), to give

x(n) = \sum_{k=0}^{N-1} \bar{X}(k) C_{2N}^{(2n+1)k}   (6-2-1-4)

[Figure-28: Logic diagram of the 8-point DCT using the DIF Hou algorithm]

[Figure-29: Logic diagram of the 16-point DCT using the DIF Hou algorithm]

The IDCT matrix will be exactly the transpose of the DCT matrix defined by Equation (6-2-1-1). Following Hou's indexing scheme, the fast IDCT algorithm is given below.

[x_e; x_o] = T^T(N) [Z_e; Z_o]   (6-2-1-5)

where

T^T(N) = [ T^T(N/2)    Q T^T(N/2) K^T ;
           T^T(N/2)   -Q T^T(N/2) K^T ],    K^T = R L^T R.

This means that the DIT fast DCT algorithm given by Hou is equivalent to the inverse fast algorithm. The IDCT matrix can be factorized into the form shown by Equation (6-2-1-6).

T^T(N) = [ I  I ;  I  -I ] [ I  0 ;  0  Q ] [ T^T(N/2)  0 ;  0  T^T(N/2) K^T ]   (6-2-1-6)

where all matrices I, K^T, Q and T^T(N/2) are of dimension (N/2)*(N/2). According to

Equation (6-2-1-6), Figure-3 in [77] should be the one shown in Figure-30.

The number of multiplications and additions required to perform an N-point DCT is equal to that of Lee's algorithm, with an additional

Σ_{i=0}^{m-1} 2^i (2^{m-i-1} - 1),   m > 1,

shift operations, where N = 2^m. In software programming, the multiplexing can be hidden so that there is no extra operational cost. In other words, there is no extra operation due to this multiplexing compared with the program using Lee's algorithm [81,96].

6-2-2 Derivation of 2-D fast DCT algorithm from Hou's algorithm

Hou's algorithm can be extended into a 2-D fast algorithm in very much the same way as Lee's algorithm. In order to derive a new two-dimensional recursive fast DCT algorithm based on Hou's approach, the DCT matrix in Equation (6-2-1-3) is rewritten as follows:

[Figure-30: Corrected signal-flow diagram for the 1-D DIT Hou fast DCT algorithm (cf. Figure-3 in [77]).]

T(N) = [ T(N/2)  0 ;  0  K T(N/2) Q ] [ I  I ;  I  -I ]

     = [ I  0 ;  0  K ] [ T(N/2)  0 ;  0  T(N/2) ] [ I  0 ;  0  Q ] [ I  I ;  I  -I ]   (6-2-2-1)

where all matrices I, K, Q and T(N/2) are of dimension (N/2)*(N/2).

Set the following equations [57]:

x'(n,m)         = x(2n,2m)
x'(n,N-1-m)     = x(2n,2m+1)
x'(N-1-n,m)     = x(2n+1,2m)
x'(N-1-n,N-1-m) = x(2n+1,2m+1),    0 ≤ n,m ≤ (N/2)-1    (6-2-2-2)

where x' denotes the reordered input sequence.

Substituting Equation (6-2-2-2) into Equation (5-2-4a), a modified version of the 2-D DCT is derived.

X(k,l) = Σ_{n=0}^{N-1} Σ_{m=0}^{N-1} x'(n,m) C_{4N}^{(4n+1)k} C_{4N}^{(4m+1)l}   (6-2-2-3)

The matrix form for the denormalized 2-D DCT defined by Equation (6-2-2-3), after reordering the input and output sequences, will be:

[Z'_ee; Z'_eo; Z'_oe; Z'_oo] = ( T(N) ⊗ T(N) ) [X'_ee; X'_eo; X'_oe; X'_oo]   (6-2-2-4)

where

Z'_e = (R ⊗ R) X_e,   Z'_e = [Z'_ee; Z'_eo],   X_e = [X_ee; X_eo],
Z'_o = (R ⊗ R) X_o,   Z'_o = [Z'_oe; Z'_oo],   X_o = [X_oe; X_oo],
X' = (P ⊗ P) x,  and P is the permutation matrix which results in Equation (6-2-1-2),

X_e and X_o are in natural order.
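The effect of this index mapping is easiest to check in one dimension: reordering the input and replacing the kernel cos((2n+1)kπ/2N) by cos((4n+1)kπ/2N) leaves the denormalized DCT unchanged. A numerical Python sketch, assuming the 1-D form of the mapping above (both function names are illustrative):

```python
import math

def dct_denorm(x):
    # Denormalized DCT-II: X(k) = sum_n x(n) cos((2n+1)k*pi/(2N)).
    N = len(x)
    return [sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                for n in range(N)) for k in range(N)]

def dct_reordered(x):
    # 1-D counterpart of the reordering: even-indexed samples first,
    # odd-indexed samples reflected to the back, with the modified kernel.
    N = len(x)
    xr = [0.0] * N
    for n in range(N // 2):
        xr[n] = x[2 * n]              # xr(n)     = x(2n)
        xr[N - 1 - n] = x[2 * n + 1]  # xr(N-1-n) = x(2n+1)
    return [sum(xr[n] * math.cos((4 * n + 1) * k * math.pi / (2 * N))
                for n in range(N)) for k in range(N)]
```

Both routines agree to machine precision for any even N, which is the identity exploited by the modified 2-D kernel of Equation (6-2-2-3).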

Substituting Equation (6-2-2-1) into (6-2-2-4), a new 2-D fast DCT algorithm is derived.

( T(N) ⊗ T(N) )
 = { [I 0; 0 K] [T(N/2) 0; 0 T(N/2)] [I 0; 0 Q] [I I; I -I] }
   ⊗ { [I 0; 0 K] [T(N/2) 0; 0 T(N/2)] [I 0; 0 Q] [I I; I -I] }

 = { [I 0; 0 K] ⊗ [I 0; 0 K] } { [T(N/2) 0; 0 T(N/2)] ⊗ [T(N/2) 0; 0 T(N/2)] }
   { [I 0; 0 Q] ⊗ [I 0; 0 Q] } { [I I; I -I] ⊗ [I I; I -I] }   (6-2-2-5)

So an N*N-point 2-D DCT is decomposed into four shorter-length DCTs at the cost of an increased number of multiplications, represented by the term which contains the factor Q. After combining the coefficients, the new algorithm uses 25% fewer multiplications than the row-column Hou algorithm. Although its mathematical derivation is quite involved, the logic diagrams for 8*8- and 16*16-point 2-D DCTs using the new fast algorithm are quite simple [81] and are shown in Figures 31 and 32. They are derived from Figures 27 and 28 respectively, with the symbols defined in those Figures accordingly. Again, in the 2-D algorithm additional shift operations are traded for better numerical performance. The elements of the input vectors are ordered as xi = [x(i,0), x(i,2), ..., x(i,N-2), x(i,N-1), x(i,N-3), ..., x(i,1)] and the elements of the output vectors are in bit-reversed order. The 2-D IDCT fast algorithm can be derived from Equation (6-2-1-6) and the logic diagram can be obtained using the same method.

[Figure-31: Logic diagram of an 8*8-point 2-D DCT using the new vector radix fast algorithm.]

[Figure-32: Logic diagram of a 16*16-point 2-D DCT using the new vector radix fast algorithm.]

6-3 Comparison of Arithmetic Complexity of Various DCT Algorithms [72, 156]

Listed below are the arithmetic complexities of direct fast algorithms for a 1-D DCT of length N = 2^m, including the number of real multiplications O_M[DCT(2^m)] and the number of real additions O_A[DCT(2^m)].

Chen [78]:
O_M[DCT(2^m)] = N*log2(N) - (3N/2) + 4,  N > 4;
O_A[DCT(2^m)] = (3N/2)*(log2(N) - 1) + 2;

Lee [76]:
O_M[DCT(2^m)] = (N/2)*log2(N);
O_A[DCT(2^m)] = (3N/2)*log2(N) - N + 1;

Hou [77]:
O_M[DCT(2^m)] = (N/2)*log2(N);
O_A[DCT(2^m)] = (3N/2)*log2(N) - N + 1;

Ma-Yin [65]:
O_M[DCT(2^m)] = m*2^(m-1),  N = 2^m;
O_A[DCT(2^m)] = (3m - 2)*2^(m-1) + 1;

Vetterli, et al. [34]:
O_M[DCT(2^m)] = (N/2)*log2(N);
O_A[DCT(2^m)] = (N/2)*(3*log2(N) - 2) + 1;

The general formula given can be derived either from decomposition equations or from logic diagrams provided in the thesis using an induction method.

The number of multiplications or additions for 2-D row-column DCT methods is obtained by multiplying the number used in 1-D fast DCT algorithms of the same size by 2*N. Then the arithmetic complexity of the 2-D vector radix DCT algorithms is easily

obtained by noting that the number of multiplications is reduced to three quarters of that

used by the row-column fast DCT algorithm and the number of additions remains

unchanged. Further discussions on the arithmetic complexity of DCTs can be found in

[72] and [156].
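These counts, together with the row-column and vector radix scaling rules just described, can be tabulated programmatically. A Python sketch (function names and the argument convention are assumptions, encoding the Chen and Lee/Hou formulas of this section):

```python
import math

def ops_1d(algorithm, N):
    """Real multiplications and additions of a length-N (= 2^m) 1-D DCT,
    per the formulas of Section 6-3."""
    m = int(math.log2(N))
    if algorithm == "chen":
        return N * m - 3 * N // 2 + 4, 3 * N // 2 * (m - 1) + 2
    if algorithm in ("lee", "hou"):
        return N // 2 * m, 3 * N // 2 * m - N + 1
    raise ValueError("unknown algorithm: " + algorithm)

def ops_2d(algorithm, N, vector_radix=False):
    """Counts for an N*N 2-D DCT: the row-column method needs 2N 1-D DCTs;
    the vector radix version cuts the multiplications to three quarters."""
    mults, adds = ops_1d(algorithm, N)
    mults, adds = 2 * N * mults, 2 * N * adds
    if vector_radix:
        mults = 3 * mults // 4
    return mults, adds
```

For example, ops_1d("lee", 8) gives (12, 29), and ops_2d("hou", 8, vector_radix=True) gives (144, 464).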

6-4 Comparison of Computation Structures of 2-D Direct VR DCTs and VR FFTs

So far, independent VR FFT and VR DCT algorithms have been presented which

show some similarities in their computation structures. Further comparison will reveal

those basic computation structures common to both VR FFTs and VR DCTs and major

differences as well. The reason why the vector radix approach can be applied to FFTs

based on the Cooley-Tukey method and direct fast DCTs by Lee and Hou will soon

become clear. This exercise will certainly be beneficial to the software and hardware

implementation, including VLSI implementation, of vector radix fast algorithms.

Apart from the DFT being a complex valued transform and the DCT a real valued

one, there are some obvious differences in the computation structures of 1-D Cooley-

Tukey FFT and 1-D direct fast DCTs. Take a 1-D 8-point DIF FFT as shown in Figure-3

and a 1-D 8-point direct fast DCT by Hou as shown in Figure-28, for example. It can be

seen that the input sequence of the FFT is in natural order and the output in bit-reversed

order (or vice versa) whilst the input of Hou's DCT is in a different shuffled order

(refer to Section 6-2-1) and the output in bit-reversed order. As a result, the DCT

algorithm requires that both input and output sequences be re-ordered whilst the FFT only

rearranges one of them. While the DCT algorithm needs a post-calculation stage, the FFT

does not. The FFT algorithms often have trivial twiddling multiplication stages, such as 126

the one inside the Radix-4 DIF FFT BF of Figure-3, and the Hou's fast DCT does not have trivial twiddling multiplication stage.

On the other hand, there are many important features which are common to Figure-3 and Figure-28. The basic computation structures of both algorithms are 2-point butterflies and separable twiddling multiplication stages. They both perform in-place computations at every stage, which is quite different from the WFTA [31] or Chen's fast DCT algorithm [78]. The post-calculation of Hou's algorithm also has distinct stages. The fact that Cooley-Tukey FFTs and the fast DCTs by Lee and Hou have in-place computation and separable twiddling multiplication stages makes it feasible to extend them to multidimensional fast algorithms. It makes the modification rules of the logic diagram applicable and the combination of twiddle factors of different dimensions possible. Not surprisingly, taking Figures 10 and 32 for example, 2-D VR FFTs and the 2-D VR DCT algorithms derived from Lee's and Hou's methods have the vector radix-2*2 butterfly and the combined twiddle factor stage as their common computation structures. Since VR FFTs based on the Cooley-Tukey method have trivial twiddling stages, 2-D butterflies with higher vector radices are allowed in 2-D VR FFT algorithms.

6-5 Summary

In this chapter, two vector radix fast discrete cosine transform algorithms have been introduced using both the matrix representation and the logic diagram. These two algorithms show arithmetic advantages over the row-column method because the number of multiplications is reduced by one quarter. The computational structure of the 2-D vector radix algorithms is regular, featuring 2-D butterflies, twiddling multiplications and post- or pre-calculation structures, and can be systematically generated from the corresponding 1-D algorithms. Computer programs using these algorithms have been developed and the use of the structural approach has assisted in the program development procedure [96]. The arithmetic complexity of various DCT algorithms has been considered. A comparative study of the computation structures of vector radix FFTs and vector radix direct DCT algorithms has been carried out, and the correct system configuration for the 1-D DIT Hou algorithm has also been presented in this chapter.

CHAPTER SEVEN: HARDWARE IMPLEMENTATION OF 2-D DCTS FOR REAL-TIME IMAGE CODING SYSTEMS

Research on fast digital signal processing algorithms has followed the development of computer technology, especially VLSI technology, since the foundational work laid by

Cooley and Tukey in 1965 [22]. In their well known paper they reduced the burden of computing a length-N Discrete Fourier Transform (DFT) from the original order of N^2 to the order of N*log2(N), and the same reduction can be realized for DCTs.

Since then, many fast algorithms have been published and the evaluation of various fast algorithms has been based on the following theoretical judgements, namely:

(a) the number of numerical operations (multiplications/additions);

(b) round-off errors;

(c) in-place computation; and

(d) the computation structure.

Of the above criteria, the computational complexity in terms of the number of multiplications and additions has been the focal point in the development of fast algorithms. In early years particularly, research concentrated on reduced-multiplication algorithms [39], as the time spent on a multiplication was far greater than that for an addition on general purpose computers. This has also been true of DCT computations until recently [72]. In implementations of 2-D DCTs for real-time image coding, as in any other real-time application, special hardware, instead of general purpose computers, has to be employed. This special hardware often depends on the leading edge of VLSI technology. As a result, the development of VLSI technology led to a re-consideration of the criteria on which new algorithms were devised and to a re-assessment of the effectiveness of various fast algorithms. In other words, the devising and evaluation of (new) fast algorithms have to be made relevant to VLSI technology or, simply, to the specific hardware installation.

In this chapter, a single-processor system to implement the modified Makhoul 2-D indirect algorithm is first described using the newly released CMOS VLSI FFT processor, the A41102. Various VLSI DCT processors for 2-D image coding are then reviewed. Different algorithms for the 2-D DCT computation are re-assessed in the light of hardware implementation using different Digital Signal Processors (DSPs), compared with the direct row-column matrix multiplication algorithm using

Multiplier/Accumulator processors [81].

7-1 Description of Hardware Implementation of Modified 2-D Makhoul DCT Algorithm Using FDP™ A41102

The 2-D indirect fast DCT algorithms may not be as efficient as the 2-D direct methods in terms of arithmetic complexity or they may not have regular computation structure or in-place computation, but if the VLSI FFT processor is used, this method would show its advantages in processing speed and overall system simplicity [74, 86].

As mentioned previously, the Austek A41102 Frequency Domain Processor (FDP) is an FFT chip which provides a continuous sampling rate of up to 2.5 Ms/s and has a selectable 16-, 20-, or 24-bit word length [26-28, 85]. More importantly, 8*8- and 16*16-point DFTs can be performed in a single pass within 25.6 μs and 102.4 μs respectively. When a modified 2-D indirect DCT algorithm is used [57, 86, 126], the configuration using the FDP will give a fairly large DCT processing throughput. For convenience, Equation (5-4-3a) is repeated as follows:

C(k1,k2) = 2 Re{ W_{4N}^{k1} [ W_{4N}^{k2} V(k1,k2) + W_{4N}^{-k2} V(k1,N-k2) ] }   (7-1-1)

From Equation (7-1-1), the 2-D DCT can be obtained by adding two terms from the DFT:

W_{4N}^{k1} W_{4N}^{k2} V(k1,k2)   and   W_{4N}^{k1} W_{4N}^{-k2} V(k1,N-k2).

Define k2' = N - k2 in the second term, which results in:

W_{4N}^{k1} W_{4N}^{-k2} V(k1,N-k2) = W_{4N}^{k1} W_{4N}^{k2'-N} V(k1,k2')
                                    = j W_{4N}^{k1} W_{4N}^{k2'} V(k1,k2').   (7-1-2)

Equation (7-1-2) states a very important fact, namely, that all the elements represented by the second term in Equation (7-1-1) can be obtained by multiplying the corresponding elements of the first term by j, which is nothing but interchanging the real and imaginary parts of the element. There is an uncommitted complex multiplier on the FDP A41102 which can be used either before or after the FFT operation is completed. This uncommitted complex multiplier can be employed in conjunction with a ROM to generate all the elements in Equation (7-1-1). Therefore, the 2-D DCT can be calculated by the system described diagrammatically in Figure-33. Since the use of the uncommitted complex multiplier will not slow down the process, a processing rate of 2.5 Ms/s can be obtained by this single-FDP system. This system configuration provides a comparatively simple hardware solution over that of the row-column method [86]. The processing speed can be improved by introducing a multi-processor configuration [86]. The above modified Makhoul algorithm has also been used to calculate 2-D DCTs using polynomial transforms [126].

[Figure-33: Single-FDP system configuration for computing the 2-D DCT using the modified Makhoul algorithm.]

7-2 Discussion of 2-D DCT Image Coding Systems Using VLSI Digital Signal Processors

For the fast computation of 2-D DCTs in real-time image coding, the fastest and simplest system configuration would be to use dedicated VLSI DCT processors [74, 75, 87, 89]. For example, in [111], a hardware architecture is reported using the row-column fast DCT algorithm by Chen et al. [78] on 8*8-point blocks. The processor accepts 8-bit video input digital signals, uses 12-bit internal precision and provides 12-bit DCT output. A 16*16-point DCT VLSI processor is demonstrated in [139] using a direct matrix multiplication method and a concurrent architecture [75, 143]. The processor accepts 9-bit 2's complement data, maintains 12-bit precision after the column DCTs and produces 14-bit DCT coefficients at a 14.3 MHz sample rate. In a recent report [74], a 27 MHz DCT chip which performs 8*8-point DCTs has been demonstrated using the Duhamel-H'Mida fast cyclic convolution algorithm [113]. SGS-Thomson Microelectronics Group has been marketing its VLSI Discrete Cosine Transformer STV3200 [87]. The STV3200 DCT processor accepts 9-bit 2's complement input data, uses 16-bit internal precision and produces 12-bit 2's complement DCT coefficients. It can perform 4*4- up to 16*16-point DCTs at a rate expected to be 13.5 MHz. The IMS A121 of Inmos, which is now part of the SGS-Thomson Microelectronics Group, is yet another VLSI DCT processor [88]. The IMS A121 can perform an 8*8-point DCT in 3.2 μs (20 MHz pixel rate) using the direct matrix multiplication method. It accepts 9-bit signed input, uses 14-bit signed integers for the cosine function Look Up Table (LUT) and 16-bit precision after the first matrix multiplication, and renders 12-bit output for the DCT coefficients. TRW's TMC2311 [89] is another fast DCT processor, which calculates an 8*8-point DCT in 4.48 μs (14.3 MHz pixel rate). The TMC2311 accepts 12- or 14-bit input data and produces optional 12-, 14- or 16-bit output. The row-column method has been used in all the above DCT processors, and so has fixed-point computation. Since the length of the DCTs under consideration is comparatively small, various algorithms have been used in the VLSI integration of DCT processors without showing a great deal of difference in speed

performance for image coding. A comparative study of the error performance of these

DCT processors remains to be undertaken.

Where DCT processors are not available, the use of various DSPs, the FDP and

Multiplier/Accumulators (M/A) provides many options.

For the fixed-point DCT computation, a single M/A processor, the IDT7210 with a multiply/accumulate cycle of 25 ns [90], would render a throughput of about 2.5 Ms/s for an 8*8-point DCT, or 1.23 Ms/s for a 16*16-point DCT, using the row-column matrix multiplication method [81]. A single-processor system using AT&T's WE DSP16A [91], which is an M/A-based DSP, would give a processing rate of 1.02 Ms/s for an 8*8-point DCT [92] and about 0.95 Ms/s for a 16*16-point DCT [81]. Taking advantage of very

fast M/A processors, the direct matrix multiplication method out-performs many fast

algorithms using other digital signal processors in terms of the speed and system

simplicity.
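The throughputs quoted above are consistent with a simple operation-count model. A back-of-envelope Python sketch (the 2*N^3 MAC count assumes plain row-column matrix multiplication with one multiply/accumulate per coefficient product, ignoring data-movement overheads):

```python
def rowcol_matmul_throughput(N, mac_cycle_s):
    """Samples per second for an N*N DCT on a single multiplier/accumulator:
    2N length-N matrix-vector products, i.e. 2*N^3 MACs per N*N block."""
    macs_per_block = 2 * N ** 3
    return (N * N) / (macs_per_block * mac_cycle_s)
```

With a 25 ns MAC cycle this gives 2.5 Ms/s for N = 8 and 1.25 Ms/s for N = 16, close to the figures quoted above (the 1.23 Ms/s quoted for N = 16 presumably includes some overhead).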

Using the Austek FDP A41102 discussed previously would also provide a fairly

large throughput and simple system solution.

Using TMS320C30 [93], DCTs in floating-point can be calculated, which, as shall

be shown in the next chapter, has a much higher signal to noise ratio than the integer

computation.

The TMS320C30, which is a floating-point digital signal processor, is the third

generation device in the TMS320 family. Multiplication, memory access operation,

addition, shift or all other ALU operations can be executed within one clock cycle (60 ns).

Algorithms can be further optimized using the parallel commands that the TMS320C30

provides. The speed at which a particular algorithm can be implemented depends upon

how compatible it is with the hardware.

From previous studies, if the efficiency of an algorithm is judged by the number of

additions, Chen's algorithm is the best. Lee's algorithm is better than Chen's if the

number of multiplications, or even the total number of numerical operations (including

additions and multiplications), is used as the criterion. But if the TMS320C30 is used to implement an 8-point 1-D DCT, the total number of clock cycles used to

complete the process will be the main issue. Since there are only a limited number of registers on the TMS320C30, not every one of which can be used in the parallel processing instructions, algorithms which do not have a regular structure or in-place computation tend to introduce more data handling operations, resulting in a relatively slow implementation, although they may have the same arithmetic complexity as others [81]. On one occasion, the implementation of a 2-D 8*8-point DCT using Chen's algorithm on the TMS320C30 required about 60 cycles more than that using Lee's or Hou's algorithm under similar programming conditions. Although the indirect DCT algorithm using the WFTA has the same arithmetic complexity as those of Lee's and Hou's algorithms, it also requires about 60 cycles more than the two when it comes to the implementation on the TMS320C30. The difference between algorithms in terms of the exact number of cycles may vary because of the programmer's experience, but the fact remains the same. This problem becomes worse as the length of the DCT increases. Another observation is that although the vector radix DCT algorithms have relatively low arithmetic complexity as well as in-place and regular computation structure, they are out-performed by the row-column DCTs due to the current arrangement of DSP architectures and the limited number of registers provided [81]. In other words, the pipelined and parallel structure of vector radix DCTs cannot be fully employed by current DSPs.

Because the DCT processing speed, using floating-point computation, is considerably slower at present than the fixed-point computation, more TMS320C30s are required to provide real-time image coding speed, which means an increase in the system complexity.

Unless VLSI DCT processors are used, a multi-processor system is required to render a real-time image coding speed for an image of 288*352 pixels, or equivalently a video signal rate of 3.04128 Ms/s. Since the DCT process, together with the quantization, decides the overall performance of an image coding system [128], using floating-point computation for DCTs also remains to be justified.

From the above discussion, it is concluded that:

(1) two fast algorithms, which have equal computational complexity, may not have the same efficiency in the hardware implementation as the limited resources on DSPs often impose different restrictions on them;

(2) the direct matrix multiplication method using very high speed multiplier/accumulators may out-perform many fast algorithms in certain applications and provides a simple system solution;

(3) a fast algorithm which possesses a regular computation structure will not only facilitate future VLSI implementation but also provide better performance using available DSPs than those which do not; and

(4) the pipeline and parallel computation structure of many multidimensional fast algorithms has yet to be fully exploited in VLSI system design [81].

CHAPTER EIGHT: THE EFFECTS OF FINITE-WORD-LENGTH COMPUTATION FOR FAST DCT ALGORITHMS

8-1 Introduction In [81], various fast Discrete Cosine Transform (DCT) algorithms have been examined and compared in terms of computational efficiency (or arithmetic complexity) from both the software and hardware implementation point of view. In this chapter, further comparison of fast DCT algorithms will be conducted in order to analyze the effects of finite-word-length computations on the DCT process.

Generally speaking, the imposition of finite-word-length computation produces overflow and roundoff errors [5, 6, 24]. Overflow occurs when the magnitude of an operation exceeds the value that the finite-word-length register can represent. Roundoff is required when a b-bit data sample is multiplied by a b-bit coefficient, resulting in a product that is 2b bits long. To maintain a certain word length in a computation procedure, truncation or rounding has to be applied, which causes errors usually referred to as roundoff noise or roundoff errors. The use of quantized coefficients will also introduce errors in the finite-word-length calculation [5, 6, 24]. So far, there have been few reports comparing different fast DCT algorithms on this issue [132].

The main concern of this chapter is to investigate roundoff errors produced in various direct fast DCT algorithms when finite-word-length arithmetic is used and when cosine multiplicands are quantized. Results are generated by computer simulation [94,

95]. The infinite precision calculation of a DCT is implemented using double-precision floating-point arithmetic, which is considered to be the benchmark. The roundoff error performance is measured by the Signal to Noise Ratio (SNR), which is defined as follows:

SNR = 10*log10( Σ(DCT_d)^2 / Σ(DCT_d - DCT_f)^2 )   (8-1-1)

where DCT_d is the DCT output with double precision and DCT_f is the DCT output with finite word length, which could be 32-bit floating-point, or integer with finite length.
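The SNR definition of Equation (8-1-1) translates directly into code. A Python sketch (the function name is an assumption):

```python
import math

def snr_db(dct_d, dct_f):
    """SNR of Equation (8-1-1): dct_d is the double-precision (benchmark)
    output, dct_f the finite-word-length output of the same transform."""
    signal = sum(d * d for d in dct_d)
    noise = sum((d - f) ** 2 for d, f in zip(dct_d, dct_f))
    return 10.0 * math.log10(signal / noise)
```

For example, a single coefficient error of 0.1 against a unit-energy reference yields 20 dB.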

For the investigation of roundoff errors caused by using 32-bit floating-point operation, Chen's [78], Lee's [76] and Hou's [77] algorithms will be considered for 1-D

DCT implementations, as well as the direct matrix multiplication method [81, 96]. In the

2-D DCT simulation with 32-bit floating-point, the row-column method using direct matrix multiplication, Chen's, Lee's and Hou's algorithms and 2-D vector radix direct

DCT algorithms [46, 80] will be studied.

For the analysis of roundoff errors caused by using the integer calculation, some simulation results have been reported in [81, 96].

According to simulation theory [94], the mean of a random variable can be estimated by the sample mean x̄_I of I observed values, and the variance of I independent samples can be approximated by the sample variance s_x^2 of the observed values. The formulas for calculating x̄_I and s_x^2 are given by the following equations:

x̄_I = ( Σ_{i=1}^{I} x_i ) / I   (8-1-2)

s_x^2 = ( Σ_{i=1}^{I} x_i^2 - I*x̄_I^2 ) / (I - 1)   (8-1-3)

where x_i is the observed value of the sample sequence. To improve the reliability of the simulation output, the replications method [94] has been used in this study. The formulas for the sample mean X̄_I and variance S_X^2 are presented by the following equations:

X̄_I = ( Σ_{i=1}^{I} X̄_i ) / I   (8-1-4)

S_X^2 = (1/(I-1)) Σ_{i=1}^{I} (X̄_i - X̄_I)^2 = (1/(I-1)) [ Σ_{i=1}^{I} X̄_i^2 - (1/I)( Σ_{i=1}^{I} X̄_i )^2 ]   (8-1-5)

where I is the number of runs and X̄_i is the sample mean of run i.
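Equations (8-1-4) and (8-1-5) amount to the ordinary sample mean and variance of the per-run means. A Python sketch (the function name is an assumption):

```python
def replication_stats(run_means):
    """Sample mean (8-1-4) and sample variance (8-1-5) of the
    per-run mean SNRs over I independent replications."""
    I = len(run_means)
    mean = sum(run_means) / I
    var = sum((x - mean) ** 2 for x in run_means) / (I - 1)
    return mean, var
```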

The input data to the DCT is produced by a random number generator with

Gaussian distribution. The Gaussian input data y_i is obtained from a uniform-distribution sequence x_i on the interval (0,1), which is provided in the run-time library, using the Central Limit Theorem:

y_i = ( Σ_{i=1}^{n} x_i - n/2 ) / sqrt(n/12)   (8-1-6)

When n = 12, the equation becomes:

y_i = Σ_{i=1}^{12} x_i - 6.   (8-1-7)

Equation (8-1-7) has been used in the simulation to generate Gaussian random data as the input to the DCTs. The test of the Gaussian input generating program on one million samples has shown a satisfactory result.
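Equation (8-1-7) is straightforward to implement with any uniform generator. A Python sketch, with the standard-library generator standing in for the run-time library mentioned above:

```python
import random

def gaussian_sample(rng):
    """Approximate N(0,1) sample via Equation (8-1-7): the sum of 12
    uniform(0,1) variates has mean 6 and variance 1, so subtracting 6
    centres it at zero."""
    return sum(rng.random() for _ in range(12)) - 6.0
```

Averaging many such samples should give a mean near 0 and a variance near 1, which is essentially the one-million-sample check referred to above.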

The simulation programs have been written in the C language, and compiled and run on PC-AT computers.

8-2 Simulation Design

In this section, the structure of the simulation program, error models, benchmarks for the DCT computations and data collection are described.

8-2-1 Structure of the simulation program

The simulation program consists of five parts:

simulation requirement input;

initialization;

generation of the input for DCTs;

computation of the DCTs; and

simulation data collection.

In more detail: the simulation input specifies (a) the length of the DCTs, n; (b) the number of block samples in each simulation run, bl; (c) the word length for the LUT, nb1 (optional, for integer computation); and (d) the word length for roundoff, nb2 (optional, for integer computation). The initialization (a) generates the LUT and (b) clears all data collection variables; the random number generator is then seeded. For each run I, the block counter BL is set to 1 and, for each block, input data is generated and the DCT is computed twice: (a) in double precision; and (b) in finite word length. The SNR of the current block is computed and added to the sum of SNRs for run I, and BL is incremented until the required number of blocks is reached. At the end of run I, the mean SNR of the run is calculated and accumulated, together with its square, and I is incremented. Finally, the data collection stage computes (a) the sample mean of the SNR; (b) the sample variance of the SNR; and (c) the confidence interval.

[Figure-34: Structure of the simulation program for error analysis.]

This can be described by a flowchart as shown in Figure-34. The details in some of the blocks may vary from one fast DCT algorithm to another according to simulation requirements.

The input data is integer with a specified word length, selectable as signed 8-bit, unsigned 8-bit or signed 9-bit. The data is processed in blocks of size

4*4, 8*8, 16*16 or 32*32 points, and the length of the DCT equals the block size. The number of blocks is the number of two-dimensional DCTs in each simulation run. For the row-column implementation, the number of 1-D DCTs required is double the block size.

The initialization is used to set up Look Up Tables (LUT) where the cosine or sine multiplicands are pre-calculated and stored, for each DCT program.

Note that the input and output of the DCT process are referred to as "data" and

"coefficients" whilst the values of the cosine functions in Look Up Table are referred to as

"multiplicands".

After each DCT block calculation, the signal to noise ratio is calculated and accumulated to find the sample mean. When the number of blocks is reached, the mean value of the signal to noise ratio on the current run is computed, and this sample mean is again accumulated, as well as its mean square. This process is repeated eight times before the final sample mean and the sample variance of the signal to noise ratio are calculated according to Equations (8-1-4) and (8-1-5). The confidence interval has also been used to render a provisional guide for the simulation.

8-2-2 Error model for the basic computation structure

The basic computational structure of fast DCT algorithms is the butterfly as shown in Figure-3 and consideration needs to be given to the roundoff errors produced in this stage. It is known that the error model of the floating-point calculation is different from that of the integer operation because both floating-point multiplications and additions will introduce roundoff errors whilst only multiplications using integer calculation will cause

roundoff errors. Although the simulation method is used in this study instead of the theoretical approach where the error model is a necessity, understanding of the model assists in the simulation design, especially for the integer calculation.

There are two additions and one multiplication in a butterfly structure, and roundoff errors will usually be introduced at three locations, represented by e_f1, e_f2 and e_o as shown in Figure-35. In essence, the accumulated roundoff errors depend heavily on the total number of multiplications and additions required by each algorithm for the DCT.

Since a' is a finite-word-length expression of the exact multiplicand a, it also introduces computation noise. The error information computed from the simulation will include all the above effects.

8-2-3 DCT in infinite-word-length

It is assumed that the DCT outputs calculated in infinite word length are independent of the individual algorithm used, and that 64-bit double precision is considered to be "infinite" compared with 32-bit floating-point or 16-bit integer data formats. In the simulation, the roundoff noise is calculated by subtracting the DCT coefficients in finite word length obtained by an algorithm from those of the same DCT algorithm using double precision. The signal to noise ratio is evaluated using Equation (8-1-1).

8-2-4 Data collection

The Gaussian random data is mapped into signed 8-bit, or unsigned 8-bit or signed

9-bit integers as the input to the DCT process. The amount of input data is about the same as that contained in a frame of an image 288*352 pixels in size. For each run, a new seed is chosen for the random number generator to make sure that different simulation runs are independent of each other. Double precision is used throughout the calculation of the sample mean and variance to keep the error caused by data collection at a minimum level. The replications method is used to reduce the sample variance [94,96].

[Figure-35: Error model of the butterfly computation structure.]

8-3 Simulation Results

The fast DCT algorithms under evaluation include those by Chen [78], Lee [76] and

Hou [77], in comparison with the direct matrix multiplication method. These one-dimensional algorithms are used to implement the 2-D DCT using a row-column operation. As well, fast two-dimensional algorithms based on Lee's and Hou's approaches have been developed and evaluated [46, 80]. All the simulation results are plotted to present a meaningful comparison of the above-mentioned algorithms in terms of the error performance using the finite-word-length calculation.

That the infinite-word-length computation is independent of the DCT algorithm has been demonstrated by comparing the difference between Lee's and Hou's algorithms using 64-bit double-precision arithmetic. The signal to difference ratio is in excess of 250 dB, independent of block size. Thus, as expected, the output is essentially independent of algorithms when the precision is essentially infinite.

In the floating-point calculation of the DCT, 32-bit floating-point arithmetic is used throughout the computation, and the multiplicands in the LUTs are also stored in 32-bit floating-point format.

8-3-1 Floating-point computation of 1-D DCTs

Figures 36, 37 and 38 show the signal to noise ratios of Chen's, Lee's and Hou's algorithms, in comparison with those of the Direct Matrix Multiplication (DMM) method, for 1-D DCT lengths of 4, 8, 16 and 32. The form of the 1-D input data varies between signed 8-bit, unsigned 8-bit and signed 9-bit integers. It can be seen that as the length of the DCT increases, the signal to noise ratio decreases for all algorithms, with Chen's algorithm showing the least degradation. All the signal to noise ratios are greater than 134 dB for all the fast algorithms under all input data conditions using the floating-point calculation, and at least 10 dB better than that of the direct matrix multiplication method. The difference between the best and the worst SNR for the same DCT length is less than 9 dB for the fast algorithms.

[Figures 36, 37 and 38: Signal to noise ratio, dB, of 1-D DCT algorithms using the floating-point calculation.]

The error performances of Lee's and Hou's algorithms are very close. An interesting fact is that the error performance depends on the form of the input data. For example, for an 8- or 16-point DCT, Chen's algorithm provides a better signal to noise ratio than both Lee's and Hou's when the input data is a signed 8- or 9-bit integer, whilst the reverse is true when the input data is an unsigned 8-bit integer. The rate at which the SNRs of all the algorithms degrade with the unsigned 8-bit integer input is much lower than with a signed 8- or 9-bit integer input.

8-3-2 Floating-point computation of 2-D DCTs

The signal to noise ratios of the row-column Chen's, Lee's and Hou's algorithms are plotted in Figures 39, 40 and 41, along with those of 2-D Vector Radix (VR) DCT algorithms based on Lee's and Hou's approaches, in comparison with that of the row-column direct matrix multiplication method. It is interesting to note that when the input data is a signed 8- or 9-bit integer the row-column Chen's algorithm gives the best error performance, whilst the difference between the SNRs of all the fast algorithms of length 4 is marginal. However, when the input data is an unsigned 8-bit integer the vector radix DCT algorithms provide better performance for the DCT lengths (8 and 16) used in practical image coding. Again, it can be seen that as the length of the DCT increases the signal to noise ratio decreases, with Chen's algorithm degrading the least and the vector radix algorithms the most (about a 23 dB drop from length 4 to length 32 when unsigned 8-bit integers are used as input).

Since in the floating-point computation of DCTs the signal to noise ratio of each fast algorithm considered is greater than 121 dB, the differences between the fast algorithms are relatively marginal. It is clear that the performance of the fast algorithms is superior to that of the direct matrix multiplication method.

[Figures 39, 40 and 41: Signal to noise ratio, dB, of 2-D DCT algorithms using the floating-point calculation.]

8-4 Summary

The errors caused by the use of finite-word-length (32-bit floating-point) computation in the process of discrete cosine transforms for coding purposes have been studied.

In the floating-point computation, the signal to noise ratios of all the fast algorithms are fairly close and above 120 dB for both 1-D and 2-D DCT computations. They are also superior to that of the direct matrix multiplication method. For one-dimensional 4- to 32-point DCTs, Chen's algorithm performs better than both Lee's and Hou's if the input data is a signed 8- or 9-bit integer, whilst the reverse is true when the input data is an unsigned 8-bit integer. For two-dimensional 4*4- to 32*32-point DCTs, the row-column Chen's algorithm is still superior if the input data is a signed 8- or 9-bit integer, with the vector radix DCT algorithms producing larger errors. However, if the input data is an unsigned 8-bit integer the vector radix algorithms perform better than the others for 4*4- to 16*16-point DCTs; they are inferior to the row-column methods only when the length of the 2-D DCT is 32.

It has also been found, for both floating-point and integer computations [96], that the performance of fast DCT algorithms in terms of the signal to noise ratio depends on the form of the input data: Gaussian noise mapped into signed 8- or 9-bit integers or unsigned 8-bit integers. A similar study using fixed-point arithmetic (integer computation) is being undertaken, and the results will be reported elsewhere.

CHAPTER NINE: CONCLUSIONS

9-1 Conclusions

In an attempt to ease the burden of constructing and implementing multidimensional fast transform algorithms, a structural approach is introduced which is described by two representations: the matrix representation with the tensor product, and logic diagrams with a set of modification rules. Using this structural approach, various vector radix FFT algorithms, including the vector split-radix FFT and mixed vector radix FFT algorithms, and vector radix direct fast DCT algorithms are derived and implemented systematically from their 1-D counterparts. The relationship between vector radix algorithms and the corresponding 1-D fast algorithms is clearly explained, particularly by the diagrammatical representation. The derivation of vector radix algorithms becomes much simpler using the logic diagrams, and implementation in both software and hardware can build on pre-knowledge of the corresponding 1-D algorithms. The structural approach is described by theorems and a recursive diagrammatical symbol system, which are applied successively to multidimensional vector radix FFT and vector radix fast DCT algorithms. The development of computer programs using vector radix fast algorithms, including the combined-factor vector radix-8*8 FFT and vector radix DCTs based on Lee's and Hou's methods, has demonstrated the effectiveness of this approach, especially when a program using the 1-D algorithm is available. Further discussion of the hardware implementation of vector radix FFTs has shown that a pipelined VLSI design computing a 16*16-point DFT needs only one complex multiplier, whilst the traditional row-column method requires two. Further, in the implementation of large 2-D DFTs, say 512*512 points, using the FDP A41102, the number of FDPs can be reduced if the vector radix method is applied, thereby reducing system complexity.

Consequently, the structural approach has been extended to vector radix FFTs of higher dimensions. Although not discussed in the thesis, the approach can be applied to m-D (m > 3) vector radix direct DCT algorithms as well.

It has been demonstrated that the logic diagram is a useful and very effective presentation form in expanding knowledge of multidimensional transform algorithms. The fact that the 2-D vector radix fast DCT algorithms were derived first by using logic diagrams, with their matrix representations then sorted out for the general case, is a good example.

The computation structure of the 2-D vector radix DCT algorithms is discussed in comparison with that of the 2-D vector radix FFT algorithms, to show both the basic computation structures common to vector radix algorithms and the major differences.

In analyzing the structure of Hou's DIT fast DCT algorithm, the correct system description is presented together with the 2-D vector radix DCT algorithm.

A single-processor 2-D DCT coding system using the FDP A41102 is presented, providing a processing rate of 2.5 Ms/s. Where VLSI DCT processors are not available, it offers an option for the hardware implementation of transform coding.

Different aspects of the hardware implementation of DCTs for image coding applications are also discussed using various VLSI DCT processors, DSPs, Multiplier/Accumulators and the FDP. It is pointed out that the design of fast algorithms and the design of VLSI processors are closely related, with the computation structure being a very important issue.

The error performance of various fast DCT algorithms is evaluated by computer simulation for the floating-point calculation. When random numbers with a Gaussian distribution are used as input, it has been found that the performance of the algorithms depends on whether the input is a signed 8- or 9-bit or an unsigned 8-bit integer. The DCT lengths considered, 4 to 32, are those relevant to image coding. The performance of the fast DCT algorithms is also compared with that of the direct matrix multiplication method, in both the 1-D and 2-D cases, showing that the former is better than the latter when floating-point calculation is used. The performance of the fast algorithms under integer computation is still being evaluated.

In conclusion, it is appropriate to point out some remaining problems and to make suggestions for further research.

9-2 Suggestions for Future Research

So far, an extensive study has been made of the theoretical aspects of vector radix algorithms for both DFTs and DCTs, and computer programs have been developed to show the validity of the proposed approach. System configurations have been described using vector radix FFT algorithms and the FDP A41102. Various projects for future work are listed below.

(a) Hardware implementation of a pipelined vector radix FFT for 512*512-point DFT computation using FDP A41102s, as described in Chapter Three. Since the FDP uses fixed-point or block floating-point arithmetic, the error analysis of this system can be conducted based on the information presented in [26-28, 85], or by simulation using the software provided by Austek [114]. Different aspects of this multi-processor system can be evaluated and compared with other system configurations.

(b) The feasibility, performance, advantages and disadvantages of VLSI integration of vector radix FFT algorithms can be closely examined to extend published results [2, 134, 137]. Many advantages of using the vector radix FFT in the VLSI implementation of 2-D DFTs, compared with the row-column FFT algorithms, have been shown in [134] and [137]. However, the performance of VR FFTs in terms of area*time^2 [2] has yet to be evaluated. Since only the vector radix-2*2 FFT is considered in [134] and [137], the number of multiplier stages is shown to be log2(N) - 1, where N is the length of the 2-D N1*N2-point DFT assuming N1 = N2 = N. It has been demonstrated in this thesis that when higher radices are used, the number of multiplier stages can be reduced. Thus, it is expected that the area*time^2 performance of a VLSI implementation will be improved using vector radix FFT algorithms.

(c) Extension of the structural approach to other multidimensional fast digital signal processing algorithms should also be studied.

(d) Hardware implementation of the 2-D DCT for image coding can be carried out using the FDP A41102, as described in Chapter Seven, for application to video-telephony or video-conferencing.

(e) An interactive study of DCT computation and quantization can be carried out so that an evaluation of the overall DCT coding system can be reached [128] before a DCT codec is implemented for telecommunication purposes.

(f) The feasibility of VLSI integration of vector radix fast DCT algorithms for coding systems can also be explored.

BIBLIOGRAPHY

[1] D.E. Dudgeon and R.M. Mersereau, Multidimensional Digital Signal Processing,

Prentice-Hall Inc., Englewood Cliffs, N.J., 1984.

[2] I. Gertner and M. Shamash, "VLSI Architectures for Multidimensional Fourier

Transform Processing", IEEE Transactions on Computers, Vol.C-36, pp. 1265-1274,

November 1987.

[3] R.J. Clarke, Transform Coding of Images, Academic Press, 1985.

[4] W.K. Pratt, Digital Image Processing, John Wiley & Sons, Inc., 1978.

[5] L.R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing,

Prentice-Hall, 1975.

[6] A.V. Oppenheim and R.W. Schafer, Digital Signal Processing, Prentice-Hall

International Inc., 1975.

[7] K.R. Castleman, Digital Image Processing, Prentice-Hall Inc., Englewood Cliffs,

New Jersey, 1979.

[8] R. C. Gonzales and P. Wintz, Digital Image Processing, Addison-Wesley Publishing

Company Inc., 1977.

[9] W.S. Hinshaw and A.H. Lent, "An Introduction to NMR Imaging: From the Bloch Equation to the Imaging Equation", Proceedings IEEE, Vol.71, No.3, March 1983.

[10] L. Jacobson and H. Wechsler, "A Theory for Invariant Object Recognition in the

Frontoparallel Plane", IEEE Trans. Pattern Anal. Machine Intell., Vol.PAMI-6, pp.325-331, May 1984.

[11] H. Gafni and Y.Y. Zeevi, "A Model for Separation of Spatial and Temporal

Information in the Visual System", Biol. Cybern., Vol.28, pp.73-82, 1977.

[12] H. Gafni and Y.Y. Zeevi, "A Model for Processing of Movement in the Visual

System", Biol. Cybern., Vol.32, pp.165-173, 1979.

[13] The Last Word in DSP. Zoran, Digital Signal Processors Data Book, ZORAN

Corporation, 1987.

[14] J.D. O'Sullivan, D.R. Brown, K.T. Hua and C.E. Jacka, "A VLSI Chip for Fast

Fourier Transforms", Digest of Papers, IREECON'87, p. 142, 1987.

[15] D.R. Brown, K.T. Hua , J.D. O'Sullivan, C.E. Jacka and P.E. Single, "A VLSI

Chip for Fast Fourier Transforms", ASSPA 89, Signal Processing, Theories,

Implementations and Applications, pp. 164-168, April 1989.

[16] Fernando Macias-Garza, A.C. Bovik, K.R. Diller, S.J. Aggarwal and J.K.

Aggarwal, "Digital Reconstruction of Three-Dimensional Serially Sectioned Optical

Images", IEEE Trans, on Acoustics, Speech, and Signal Processing, Vol.ASSP-36,

pp.1067-1075, July 1988.

[17] N. Ahmed, T. Natarajan and K.R. Rao, "Discrete Cosine Transform", IEEE

Transactions on Computers, Vol.C-23, pp.90-93, January 1974.

[18] K.R. Rao and P. Yip, Discrete Cosine Transform, Academic Press, Orlando, FL, 1990.

[19] A. Uzum, A.W. Seeto, D. Rosenfeld, D. Skellen and A. Maheswaran, "Video

Coding: A Survey", Workshop on Telecommunication Services Based on Video and

Images, Sydney, September 1988.

[20] J.W. Cooley, P.A.W. Lewis and P.D. Welch, "Historical Notes on the Fast Fourier

Transform", Proceedings of the IEEE, Vol.55, pp. 1675-1677, October 1967.

[21] M.T. Heideman, D.H. Johnson and C.S. Burrus, "Gauss and the History of the Fast Fourier Transform", IEEE ASSP Magazine, pp.14-21, 1984.

[22] J.W. Cooley and J.W. Tukey, "An Algorithm for the Machine Calculation of Complex Fourier Series", Math. Comput., Vol.19, No.90, pp.297-301, 1965.

[23] L.R. Rabiner, "The Acoustics, Speech, and Signal Processing Society—A Historical

Perspective", IEEE ASSP Magazine, pp.4-10, January 1984.

[24] B. Gold and C.M. Rader, Digital Processing of Signals, McGraw-Hill Book Co., 1969.

[25] W.T. Cochran, J.W. Cooley, D.L. Favin, H.D. Helms, R.A. Kaenel, W.W. Lang, G.C. Maling, Jr., D.E. Nelson, C.M. Rader and P.D. Welch, "What is the Fast Fourier Transform?", Proceedings of the IEEE, Vol.55, pp.1664-1674, October 1967.

[26] A41102 Frequency Domain Processor, Austek Microsystems Proprietary, Inc. and

Austek Microsystems Pty. Ltd., 1988.

[27] Frequency Domain Processor (FDP™), Austek Microsystems Proprietary, Inc. and

Austek Microsystems Pty. Ltd., 1988.

[28] A41102 Frequency Domain Processor, Austek Microsystems Proprietary, Inc. and

Austek Microsystems Pty. Ltd., 1988.

[29] M. Bellanger, Digital Processing of Signals—Theory and Practice, John Wiley &

Sons Ltd., 1985.

[30] L. Auslander, E. Feig and S. Winograd, "Abelian Semi-Simple Algebras and

Algorithms for the Discrete Fourier Transform", Advances in Applied Mathematics,

No.5, pp.31-55, 1984.

[31] R.E. Blahut, Fast Algorithms for Digital Signal Processing, Addison-Wesley

Publishing, Inc., 1985.

[32] A. Guessoum and R.M. Mersereau, "Fast Algorithms for the Multidimensional

Discrete Fourier Transform", IEEE Transactions on Acoustics, Speech, and Signal

Processing, Vol.ASSP-34, No.4, pp.937-943, August 1986.

[33] L. Auslander and R. Tolimieri, "Ring Structure and the Fourier Transform", The Mathematical Intelligencer, Vol.7, No.3, pp.49-52, 54, 1985.

[34] M. Vetterli and H.J. Nussbaumer, "Simple FFT and DCT Algorithms with Reduced

Number of Operations", Signal Processing, August 1984.

[35] L. Auslander, E. Feig and S. Winograd, "New Algorithms for the Multidimensional

Discrete Fourier Transform", IEEE Transactions on Acoustics, Speech, and Signal

Processing, Vol.ASSP-31, No.2, pp.388-403, April 1983.

[36] Soo-Chang Pei and Ja-Ling Wu, "Split Vector Radix 2-D Fast Fourier Transform",

IEEE Transactions on Circuits and Systems , Vol.CAS-34, pp.978-980, August

1987.

[37] Zhi-Jian Mou and P. Duhamel, "In-Place Butterfly-Style FFT of 2-D Real

Sequences", IEEE Transactions on Acoustics, Speech, and Signal Processing,

Vol.ASSP-36, pp.1642-1650, October 1988.

[38] M.A. Haque, "A Two-Dimensional Fast Cosine Transform", IEEE Transactions on

Acoustics, Speech, and Signal Processing, Vol.ASSP-33, pp.1532-1539, 1985.

[39] D.F.Elliott and K.R. Rao, Fast Transforms: Algorithms, Analyses, Applications,

Academic Press, 1982.

[40] Third-Generation TMS320 User's Guide, SPRU031, Texas Instruments

Incorporated, 1988.

[41] L.R. Morris, "Comparative Study of Time Efficient FFT and WFTA Programs for

General Purpose Computers", IEEE Trans, on Acoustics, Speech, and Signal

Processing, Vol.ASSP-26, pp.141-150, April 1978.

[42] G.E.Rivard, "Direct Fast Fourier Transform of Bivariate Functions", IEEE

Transactions on Acoustics, Speech, and Signal Processing, Vol.ASSP-25, pp.250-252, June 1977.

[43] D.B. Harris, J.H. McClellan, D.S.K. Chan, and H.W. Schuessler, "Vector Radix

Fast Fourier Transform", 1977 IEEE Int. Conf. Acoust., Speech, Signal Processing

Rec, pp.548-551,May 1977.

[44] B. Arambepola, "Fast Computation of Multidimensional Discrete Fourier

Transforms", IEE Proceedings, Vol.127, Pt.F, No.1, February 1980.

[45] H.R. Wu and F.J. Paoloni, "The Structure of Vector Radix Fast Fourier

Transforms", IEEE Transactions on Acoustics, Speech, and Signal Processing,

Vol.37, pp.1415-1424, September 1989.

[46] H.R. Wu and F.J. Paoloni, "A Two Dimensional Fast Cosine Transform

Algorithm—A Structural Approach", Proceedings of IEEE International Conference

on Image Processing, Singapore, pp.50-54, September 1989.

[47] E.O. Brigham, The Fast Fourier Transform, Prentice-Hall Inc., Englewood Cliffs,

N.J., 1974.

[48] S. Winograd, "On Computing the Discrete Fourier Transform", Mathematics of

Computation, Vol.32, No.141, pp.175-199, January 1978.

[49] D.W. Tufts and G. Sadasiv, "The Arithmetic Fourier Transform", IEEE ASSP

Magazine, pp. 13-17, January 1988.

[50] S. Prakash and V.V. Rao, "Vector Radix FFT Error Analysis", IEEE Transactions on

Acoustics, Speech, and Signal Processing, Vol.ASSP-30, pp.808-811, October

1982.

[51] I. Pitas and M.G. Strintzis, "Floating Point Error Analysis of Two-Dimensional Fast

Fourier Transform Algorithms", IEEE Trans, on Circuits and Systems, Vol.35,

pp. 112-115, January 1988.

[52] C.S. Burrus and P.W. Eschenbacher, "An In-Place, In-Order Prime Factor FFT

Algorithm", IEEE Transactions on Acoustics, Speech, and Signal Processing,

Vol.ASSP-29, August 1981.

[53] H.W. Johnson and C.S. Burrus, "On the Structure of Efficient DFT Algorithms",

IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol.ASSP-33,

pp.248-254, February 1985.

[54] Kenji Nakayama, "An Improved Fast Fourier Transform Algorithm Using Mixed

Frequency and Time Decimations", IEEE Transactions on Acoustics, Speech, and

Signal Processing, Vol.ASSP-36, pp.290-292, February 1988.

[55] M.A. Richard, "On the Efficient Implementation of the Split-Radix FFT",

Proceedings of ICASSP-86, pp.1801-1804, 1986.

[56] R.W. Linderman et al., "CUSP: A 2-μm CMOS Digital Signal Processor", IEEE

Journal of Solid-State Circuits, Vol.SC-20, pp. 761-769, June 1985.

[57] J. Makhoul, "A Fast Cosine Transform in One and Two Dimensions", IEEE

Transactions on Acoustics, Speech, and Signal Processing, Vol.ASSP-28, pp.27-34, 1980.

[58] H.J. Nussbaumer and P. Quandalle, "Fast Computation of Discrete Fourier

Transforms Using Polynomial Transforms", IEEE Transactions on Acoustics,

Speech, and Signal Processing, Vol.ASSP-27, pp. 169-181, April 1979.

[59] O.R. Hinton and R.A. Salch, "Two-Dimensional Discrete Fourier Transform with

Small Multiplicative Complexity Using Number Theoretic Transforms", IEE

Proceedings, Vol.131, Pt.G, No.6, December 1984.

[60] H.R. Wu and F.J. Paoloni, "On the Two Dimensional Vector Split-Radix FFT

Algorithm", IEEE Transactions on Acoustics, Speech, and Signal Processing,

August 1989.

[61] R.C. Agarwal and J.W. Cooley, "An Efficient Vector Implementation of the FFT

Algorithm on IBM 3090VF", ICASSP, pp.249-252, 1986.

[62] R.M. Mersereau and T.C. Speake, "A Unified Treatment of Cooley-Tukey

Algorithms for the Evaluation of the Multidimensional DFT", IEEE Transactions on

Acoustics, Speech, and Signal Processing, Vol.ASSP-29, pp.1011-1018, October

1981.

[63] C.S. Burrus and T.W. Parks, Discrete Fourier Transform/Fast Fourier Transform

and Convolution Algorithms, A Wiley-Interscience Publication, John Wiley & Sons,

1985.

[64] G.D. Bergland, "A Fast Fourier Transform Algorithm Using Base Eight Iterations",

Math. Computation, Vol.22, pp.275-279, April 1968.

[65] Weizhen Ma and Ruixiang Yin, "New Recursive Factorization Algorithms to Compute DFT(2^m) and DCT(2^m)", IEEE Asian Electronics Conference, Hong Kong, 1987.

[66] Weizhen Ma and Dekun Yang, "New Fast Algorithm for Two-Dimensional Discrete

Fourier Transform DFT(2n,2)", Electronics Letters, Vol.25, No.l, pp.21-22,

January 1989.

[67] H.R. Wu and F.J. Paoloni, "Structured Vector Radix FFT Algorithms and Hardware

Implementation", submitted to Journal of Electrical and Electronics Engineering,

Australia, for publication, 1989.

[68] P. Duhamel, "Implementation of 'Split-Radix' FFT Algorithms for Complex, Real,

and Real-Symmetric Data", IEEE Transactions on Acoustics, Speech, and Signal

Processing, Vol.ASSP-34, pp.285-295, April 1986.

[69] P.R. Halmos, Finite-Dimensional Vector Spaces, D.Van Nostrand Company, Inc.,

1958.

[70] H.R. Wu, and F.J. Paoloni, "The Structure of Vector Radix Multidimensional Fast

Fourier Transforms", Technical Report No. I, Department of Electrical and Computer

Engineering, The University of Wollongong, 1986.

[71] Zhi-Jian Mou and P. Duhamel, "Corrections to 'In-Place Butterfly-Style FFT of 2-D

Real Sequences'", IEEE Transactions on Acoustics, Speech, and Signal Processing,

Vol.ASSP-37, September 1989.

[72] M. Vetterli, P. Duhamel and C. Guillemot, "Trade-Offs in the Computation of Mono- and Multi-Dimensional DCTs", Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, pp.999-1002, 1989.

[73] S. Okubo, R. Nicol, B. Haskell and S. Sabri, "Progress of CCITT Standardization on n*384 kbit/s Video Codec", IEEE Globecom'87, pp.36-39, 1987.

[74] J.C. Carlach, P. Penard and J.L. Sicre, "TCAD: A 27 MHz 8*8 Discrete Cosine Transform Chip", Proc. ICASSP'89, 1989.

[75] M. T. Sun, T.C. Chen, A. Gottlieb, L. Wu and M.L. Liou, "A 16*16 Discrete

Cosine Transform Chip", Proc. of SPIE'87 Symp. Visual Commun. Image Proc., Vol.845, pp.13-18, Oct. 1987.

[76] B.G. Lee, "A New Algorithm to Compute the Discrete Cosine Transform", IEEE Trans. on Acoust., Speech, Signal Processing, Vol.ASSP-32, pp.1243-1245, December 1984.

[77] H.S. Hou, "A Fast Recursive Algorithm For Computing the Discrete Cosine Transform", IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol.ASSP-35, pp.1455-1461, 1987.

[78] Wen-Hsiung Chen, C. Harrison Smith and S.C. Fralick, "A Fast Computational Algorithm for the Discrete Cosine Transform", IEEE Transactions on Communications, Vol.COM-25, No.9, pp.1004-1009, September 1977.

[79] M. Vetterli, "Fast 2-d Discrete Cosine Transform", IEEE ASSP Conf, pp. 1538-

1541, 1985.

[80] H.R. Wu and F.J. Paoloni, "A Structural Approach to Two Dimensional Direct Fast

Discrete Cosine Transform Algorithms", Proceedings of International Symposium on

Computer Architecture & Digital Signal Processing, Hong Kong, October 1989.

[81] H.R. Wu and F.J. Paoloni, "Comparison Study on Software and Hardware

Implementation of Various Fast Discrete Cosine Transform Algorithms", Technical

Report-1, The University of Wollongong-Telecom Research Laboratories (Australia)

R&D Contract for the Study of Fast Implementations of Discrete Cosine Transform

Coding Systems, under No.7066, June 1989.

[82] M.J. Narasimha and A.M. Peterson, "On the Computation of the Discrete Cosine

Transform", IEEE Transactions on Communications, Vol.COM-26, pp.934-936,

1978.

[83] H.R. Wu and F.J. Paoloni, "The Structure of Vector Radix Multidimensional Fast

Fourier Transforms", 1SSPA 87, Signal Processing, Theories, Implementations and

Applications, pp.89-92, August 1987.

[84] M. Vetterli, "Trade-Off s in the Computation of Mono- and Multi-dimensional

DCTs", Technical Report: CUICTRITR-090-88-18, Center for Telecommunications

Research, 1988.

[85] Austek Microsystems Proprietary, Inc. and Austek Microsystems Pty. Ltd., A User

Guide for the A41102, 1988.

[86] H.R. Wu, F.J. Paoloni and W. Tan, "Implementation of 2-D DCT for Image Coding

Using FDP™ A41102", Proceedings of the Conference on Image Processing and the

Impact of New Technologies, Canberra, December 1989.

[87] Real Time Discrete Cosine Transformer—Advanced Specifications #2, SGS

Thomson Microelectronics, March 1987.

[88] IMS A121 2-D Discrete Cosine Transform Processor—Advance Information, inmos,

April 1989.

[89] TMC2311-CMOS Fast Cosine Transform Processor—Advance Information, TRW

LSI Products Inc., 1989.

[90] High Performance CMOS—Data Book, Integrated Device Technology, 1988.

[91] WE® DSP16A Digital Signal Processor—Advance Data Sheet, AT&T 1988.

[92] D.M. Blaker, "Using the DSP16/DSP16A for Image Compression", DSP Review,

AT&T, Vol.2, Issue 1, pp.4-5, 1989.

[93] TEXAS INSTRUMENTS, Third-Generation TMS320 User's Guide, 1988.

[94] A. Alan B. Pritsker, Introduction to Simulation and SLAM II, 3rd ed. Systems

Publishing Corporation, Halsted Press, 1986.

[95] Byron J.T. Morgan, Elements of Simulation, Chapman and Hall Ltd, 1984.

[96] H.R. Wu and F.J. Paoloni, "Simulation Study on the Effects of Finite-Word-Length

Calculations for Fast DCT Algorithms", Technical Report-2, The University of

Wollongong-Telecom Research Laboratories (Australia) R&D Contract for the Study

of Fast Implementations of Discrete Cosine Transform Coding Systems, under

No.7066, October 1989.

[97] R. Yavne, "An Economical Method for Calculating the Discrete Fourier Transform",

National Computer Conference and Exposition Proceedings, Vol.33, pp.115-125,

1968.

[98] P. Duhamel, B. Piron and J.M. Etcheto, "On Computing the Inverse DFT", IEEE

Trans, on Acoustics, Speech, and Signal Processing, Vol.ASSP-36, pp.285-286,

February 1988.

[99] M.T. Heideman and CS. Burrus, "On the Number of Multiplications Necessary to

Compute a Length-2n DFT", IEEE Trans, on Acoustics, Speech, and Signal

Processing, Vol.ASSP-34, pp.91-95, February 1986.

[100] T.S. Huang, "How the Fast Fourier Transform Got Its Name", Computer, Vol.4,

No.3, p. 15, May-June 1971.

[101] Yoiti Suzuki, Toshio Sone and Ken'iti Kido, "A New FFT Algorithm of Radix 3, 6,

and 12", IEEE Trans, on Acoustics, Speech, and Signal Processing, Vol.ASSP-34,

pp.380-383, April 1986.

[102] W.A. Perera and P.J.W. Rayner, "Optimal Design of Multiplierless DFTs and

FFTs", ICASSP'86, pp.245-248, 1986.

[103] W.M. Gentleman and G. Sande, "Fast Fourier Transforms—For Fun and Profit", Proceedings—Fall Joint Computer Conference, pp.563-578, 1966.

[104] M.R. Schroeder, "The Unreasonable Effectiveness of Number Theory in Science and

Communication (1987 Rayleigh Lecture)", IEEE ASSP Magazine, pp.5-12, January

1988.

[105] K.N. Ngan, K.S. Leong and H. Singh, "Adaptive Cosine Transform Coding of

Images in Perceptual Domain", IEEE Transactions on Acoustics, Speech, and Signal

Processing, Vol.ASSP-37, pp.1743-1750, November 1989.

[106] H. Kitajima, "A Symmetric Cosine Transform", IEEE Trans, on Computers, Vol.C-

29, pp.317-323, 1980.

[107] Byeong Gi Lee, "FCT - A Fast Cosine Transform", IEEE ASSP Conf., 28A.3.1-28A.3.4, 1984.

[108] H.R. Wu and F.J. Paoloni, "A 2-D Fast Cosine Transform Algorithm Based on

Hou's Approach", submitted to IEEE Transactions on Acoustics, Speech, and Signal

Processing , for publication, 1989.

[109] H.R. Wu and F.J. Paoloni, "The Impact of the VLSI Technology on the Fast

Computation of Discrete Cosine Transforms for Image Coding", to be submitted.

[110] H.J. Nussbaumer, Fast Fourier Transform and Convolution Algorithms, Springer-Verlag, Berlin, Heidelberg, 1982.

[111] E. Arnould and J.P. Dugre, "Real Time Discrete Cosine Transform - An Original Architecture", IEEE ASSP Conf., 48.6.1-48.6.4, 1984.

[112] Naoki Suehiro and Mitsutoshi Hatori, "Fast Algorithms for the DFT and Other

Sinusoidal Transforms", IEEE Transactions on Acoustics, Speech, and Signal

Processing, Vol.ASSP-34, No. 3, pp. 642-644, June 1986.

[113] Pierre Duhamel and Hedi H'Mida, "New 2n DCT Algorithms Suitable for VLSI

Implementation", IEEE ASSP Conf, pp. 1805-1808, 1987.

[114] Austek Microsystems Proprietary, Inc. and Austek Microsystems Pty. Ltd., FDPSIM Users Guide, 1988.

[115] WE® DSP32C Digital Signal Processor—Advance Information Data Sheet, AT&T.

[116] WE® DSP32C Digital Signal Processor—Information Manual, AT&T, December

1988.

[117] C.S. Burrus, "Bit Reverse Unscrambling for a Radix-2M FFT", Proc. ICASSP, pp.1809-1810, 1987.

[118] H.Nawab and J.H. McClellan, "Bounds on the Minimum Number of Data Transfers

in WFTA and FFT Programs", IEEE Transactions on Acoustics, Speech, and Signal

Processing, Vol.ASSP-27, No.4, pp.394-398, August 1979.

[119] Z. Wang, "On Computing the Discrete Fourier and Cosine Transforms", IEEE

Transactions on Acoustics, Speech, and Signal Processing, Vol.ASSP-33, No.4, pp.1341-1344, October 1985.

[120] Z. Wang and B.R. Hunt, "Comparative Performance of Two Different Versions of

the Discrete Cosine Transform", IEEE Transactions on Acoustics, Speech, and Signal

Processing, Vol.ASSP-32, No.2, pp.450-453, April 1984.

[121] P. Yip and K.R. Rao, "On the Shift Property of DCTs and DSTs", IEEE

Transactions on Acoustics, Speech, and Signal Processing, Vol.ASSP-35, No.3,

pp.404-406, March 1987.

[122] K.N. Ngan, "Image Display Techniques Using the Cosine Transform", IEEE

Transactions on Acoustics, Speech, and Signal Processing, Vol.ASSP-32, No.l,

pp. 173-177, February 1984.

[123] O. Ersoy, "On Relating Discrete Fourier, Sine, and Symmetric Cosine Transforms",

IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol.ASSP-33,

No.1, pp.219-222, February 1985.

[124] H.S. Malvar, "Fast Computation of the Discrete Cosine Transform and the Discrete

Hartley Transform", IEEE Transactions on Acoustics, Speech, and Signal Processing,

Vol.ASSP-35, No.10, pp.1484-1485, October 1987.

[125] V. Nagesha, "Comments on 'Fast Computation of the Discrete Cosine Transform and

the Discrete Hartley Transform'", IEEE Transactions on Acoustics, Speech, and

Signal Processing, Vol.ASSP-37, No.3, pp.439-440, March 1989.

[126] N. Nasrabadi and R. King, "Computationally Efficient Discrete Cosine Transform

Algorithm", Electronics Letters, Vol.19, January 1983.

[127] H.R. Wu and F.J. Paoloni, "Comparison Study on Software and Hardware

Implementation of Various Fast Discrete Cosine Transform Algorithms", Addendum

of Technical Report-1, The University of Wollongong-Telecom Research

Laboratories (Australia) R&D Contract for the Study of Fast Implementations of

Discrete Cosine Transform Coding Systems, under No.7066, November 1989.

[128] D.J. Bailey and N. Birch, "Image Compression Using a Discrete Cosine Transform

Image Processor", Electronic Engineering, July 1989.

[129] P.K. Rodman, "High Performance FFTs for a VLIW Architecture", Proceedings of

International Symposium on Computer Architecture & Digital Signal Processing,

Hong Kong, October 1989.

[130] R.K. Asbury, "2D and 3D FFTs on the Intel iPSC/2—A Distributed Memory, Multi-

Processor Supercomputer", Proceedings of International Symposium on Computer

Architecture & Digital Signal Processing, Hong Kong, October 1989.

[131] S.Y. Kung, "From VLSI Arrays to Neural Networks", Proceedings of International

Symposium on Computer Architecture & Digital Signal Processing, Hong Kong,

October 1989.

[132] Y. He and Z. Wang, "Fixed-Point Error Analysis for the Fast Cosine Transform",

Proceedings of International Symposium on Computer Architecture & Digital Signal

Processing, Hong Kong, October 1989.

[133] W. Ma and D. Yang, "On Computing 2-D DFT", Proceedings of International

Symposium on Computer Architecture & Digital Signal Processing, Hong Kong,

October 1989.

[134] W. Liu, T. Hughes and W.T. Krakow, "A Rasterization of Two-Dimensional Fast

Fourier Transform", in VLSI Signal Processing, II, ed. by S.Y. Kung, R.E. Owen

and J.G. Nash, pp. 281-292, IEEE Press, 1986.

[135] S.Y. Kung, H.J. Whitehouse and T. Kailath, ed., VLSI and Modern Signal

Processing, Prentice-Hall, Inc., 1985.

[136] S.Y. Kung, VLSI Array Processors, Prentice-Hall, Inc., 1988.

[137] W. Liu and D.E. Atkins, "VLSI Pipelined Architectures for Two Dimensional Fast

Fourier Transform with Raster-Scan Input Device", International Conference on

Computer Design: VLSI in Computer, pp.370-375, 1984.

[138] A.D. Culhane, M.C. Peckerar and C.R.K. Marrian, "A Neural Net Approach to

Discrete Hartley and Fourier Transforms", IEEE Transactions on Circuits and

Systems, Vol.CAS-36, pp.695-703, 1989.

[139] M.-T. Sun, T.-C. Chen and A.M. Gottlieb, "VLSI Implementation of a 16*16

Discrete Cosine Transform", IEEE Transactions on Circuits and Systems, Vol.CAS-

36, pp.610-617, 1989.

[140] J.A. Beraldin, T. Aboulnasr and W. Steenaart, "Efficient One-Dimensional Systolic

Array Realization of the Discrete Fourier Transform", IEEE Transactions on Circuits

and Systems, Vol.CAS-36, pp.95-100, 1989.

[141] T. Willey, R. Chapman, H. Yoho, T.S. Durrani and D. Preis, "Systolic

Implementations for Deconvolution, DFT and FFT", IEE Proceedings, Vol.132,

Pt.F, 1985.

[142] E.E. Swartzlander, Jr. and G. Hallnor, "Fast Transform Processor Implementation",

Proceedings of ICASSP 84, pp.25A.5.1-25A.5.4, 1984.

[143] M.T. Sun, L. Wu and M.L. Liou, "A Concurrent Architecture for VLSI

Implementation of Discrete Cosine Transform", IEEE Transactions on Circuits and

Systems, Vol.CAS-34, pp.992-994, 1987.

[144] C.D. Thompson, "Fourier Transforms in VLSI", IEEE Transactions on Computers,

Vol.C-32, pp.1047-1057, 1983.

[145] H. Mori, H. Ouchi and S. Mori, "A WSI Oriented Two Dimensional Systolic Array

for FFT", Proceedings of ICASSP 86, pp.2155-2158, 1986.

[146] A. Iwata, I. Horiba, N. Suzumura and N. Takagi, "3-Dimensional Reconstructing

Algorithm for Digital Tomo-Synthesis", Proceedings of ICASSP 86, pp. 1741-1744,

1986.

[147] K.J. Jones, "2D Systolic Solution to Discrete Fourier Transform", IEE Proceedings,

Vol.136, Pt.F, pp.211-216, 1989.

[148] H. Schmid, Decimal Computation, John Wiley & Sons, Inc., 1974.

[149] IEEE Micro, (Special Issue on Digital Signal Processors,) Vol.6, No.6, December

1986.

[150] IEEE Micro, (Special Issue on Digital Signal Processors,) Vol.8, No.6, December

1988.

[151] K. Hwang, Computer Arithmetic, John Wiley & Sons, Inc. 1979.

[152] E.E. Swartzlander, Jr., VLSI Signal Processing Systems, Kluwer Academic Publishers, 1986.

[153] H.R. Wu and F.J. Paoloni, "The Structure of Vector Radix Multidimensional Fast Fourier Transforms—Part II", Technical Report No. 2, Department of Electrical and Computer Engineering, The University of Wollongong, 1986.

[154] M. Vulis, "The Weighted Redundancy Transform", IEEE Transactions on Acoustics,

Speech, and Signal Processing, Vol.ASSP-37, pp.1687-1692, November 1989.

[155] J.H. McClellan and C.M. Rader, Number Theory in Digital Signal Processing,

Prentice-Hall Inc., Englewood Cliffs, N.J., 1979.

[156] P. Duhamel and C. Guillemot, "Polynomial Transform Computation of the 2-D

DCT", to be presented at ICASSP-90, April 1990.

[157] J. Suzuki, M. Nomura and S. Ono, "Comparative Study of Transform Coding for

Super High Definition Images", to be presented at ICASSP-90, April 1990.

[158] U. Totzek, F. Matthiesen, S. Wohlleben and T.G. Noll, "CMOS VLSI

Implementation of the 2D-DCT with Linear Processor Arrays", to be presented at

ICASSP-90, April 1990.

[159] M. Yan, J.V. McCanny and Y. Hu, "VLSI Architectures for Digital Image Coding", to be presented at ICASSP-90, April 1990.

APPENDIX A: PRELIMINARY BACKGROUND ON THE TENSOR (KRONECKER) PRODUCT AND THE LOGIC DIAGRAM

In this appendix, a brief introduction is presented to the two basic tools which are used frequently throughout the thesis, namely, the tensor product with its properties, and the logic diagram. The definition and the properties of the tensor product are included so that the thesis is self-contained. The purpose of introducing the logic diagram instead of the conventional signal flowgraph (the Mason flowgraph) will soon become clear.

Definition (the Tensor, or Kronecker, Product) [31]:

Let A = [a_mk] be an M by K matrix, and B = [b_nl] be an N by L matrix. The tensor product of A and B, denoted by A⊗B, is the matrix with MN rows and KL columns whose entry in row (m-1)N + n and column (k-1)L + l is given by c_{mn,kl} = a_mk * b_nl.

The tensor product A⊗B is an M by K array of N by L blocks, with the (m,k)th such block being a_mk*B. It is apparent from the definition that the tensor product is not commutative but is associative, i.e.,

A⊗B ≠ B⊗A    (A-1)

(A⊗B)⊗C = A⊗(B⊗C)    (A-2)

The following equalities can also be proven to be true [31][29]:

(A⊗B)(C⊗D) = (AC)⊗(BD)    (A-3)

P_N^r (I_r⊗A) M_N^r = (A⊗I_r)    (A-4)

where A, B, C, and D are N*N matrices; I_r is of r*r dimension; and P_N^r is a permutation matrix which is defined, when r = 2, by

P_N^2 (x_0, x_1, x_2, ..., x_{N-1}) = (x_0, x_{N/2}, x_1, x_{N/2+1}, ..., x_{N-1})    (A-5)

M_N^r is the inverse matrix of P_N^r, which is denoted by M_N^r = [P_N^r]^{-1}.
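As a numerical sanity check on these identities, the sketch below implements the tensor product directly from the definition in plain Python; the helper names `kron` and `matmul` are illustrative, not taken from the thesis:

```python
def kron(A, B):
    # Tensor (Kronecker) product: an M x K array of N x L blocks a_mk * B.
    return [[A[m][k] * B[n][l] for k in range(len(A[0])) for l in range(len(B[0]))]
            for m in range(len(A)) for n in range(len(B))]

def matmul(A, B):
    # Ordinary matrix product.
    return [[sum(A[i][t] * B[t][j] for t in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
C = [[2, 0], [1, 1]]
D = [[1, 1], [0, 2]]

# (A-1): the tensor product is generally not commutative.
assert kron(A, B) != kron(B, A)
# (A-2): associativity.
assert kron(kron(A, B), C) == kron(A, kron(B, C))
# (A-3): (A (x) B)(C (x) D) = (AC) (x) (BD).
assert matmul(kron(A, B), kron(C, D)) == kron(matmul(A, C), matmul(B, D))
```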

If we define a time delay operator s which gives

x(n+1) = s*x(n)  or  x(n-1) = s^{-1}*x(n),    (A-6)

we shall have the following properties of the tensor product:

[x1(n)  ]   [1 0] [x1(n)]
[x2(n+1)] = [0 s] [x2(n)]    (A-7)

[x1(m,n)    ]   ([1  0]   [1  0]) [x1(m,n)]
[x2(m,n+1)  ] = ([0 s2] ⊗ [0 s1]) [x2(m,n)]
[x3(m+1,n)  ]                     [x3(m,n)]
[x4(m+1,n+1)]                     [x4(m,n)]    (A-8)

The logic diagram which we shall introduce consists of basic elements such as the line, heavy line (or vector line), addition operation block, vector addition operation block, scalar product block, and vector scalar product block. The definitions are shown in Figure-A-1.

The logic diagram of one-dimensional algorithms is equivalent to the Mason signal flowgraph, and the logic diagram of multidimensional algorithms is a direct extension of the one-dimensional case which introduces the vector operation concept into the graphical form. The logic diagram has its own rules which can be readily used to derive equivalent logic diagrams, making the modification of algorithms and the derivation of multidimensional fast algorithms a comparatively simple procedure. The inverse of an orthogonal transform is equal to its transpose. The transpose operation on a logic diagram changes each addition block to a branch node and each branch node to an addition block, and reverses the direction of the input and output data flow.
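The transpose rule relies on the fact that an orthogonal (orthonormal) transform matrix T satisfies T*T^t = I. As a small numerical illustration (not thesis code), the orthonormal DCT-II matrix can be checked in plain Python; the scaling function `c` is the standard normalization, introduced here for the example:

```python
import math

N = 4

def c(k):
    # Orthonormal DCT-II scaling factors.
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

# T[k][n]: orthonormal DCT-II basis, so the inverse of T is its transpose.
T = [[c(k) * math.cos((2 * n + 1) * k * math.pi / (2 * N)) for n in range(N)]
     for k in range(N)]

# T times its transpose should give the identity matrix.
TTt = [[sum(T[i][p] * T[j][p] for p in range(N)) for j in range(N)] for i in range(N)]
for i in range(N):
    for j in range(N):
        assert abs(TTt[i][j] - (1.0 if i == j else 0.0)) < 1e-12
```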

In the thesis, all the multidimensional fast algorithms can be derived either by using the properties of the tensor product (a mathematical approach) or by using the logic diagrams (an engineering approach).

[Figure-A-1: Definitions of the basic logic diagram elements: line, heavy (vector) line, addition operation block, vector addition operation block, scalar product block, and vector scalar product block.]

APPENDIX B: PROOF OF STRUCTURE THEOREMS

In order to prove Structure Theorem 1 of Chapter Three, we assume that I_T = I_{N1} ⊗ I_{N2}. To prove that I^2 = I_T, it is only necessary to show that I^2(k_1,l_1; m_0,n_0) = I_T(k_1,l_1; m_0,n_0), as both matrices are of size r1r2*r1r2. According to the definitions of the 1-D and 2-D butterfly matrices, the following equation is obvious:

I_T(k_1,l_1; m_0,n_0) = I_{N1}(k_1,m_0) * I_{N2}(l_1,n_0) = W_{r1}^{k_1 m_0} * W_{r2}^{l_1 n_0}

Since I^2(k_1,l_1; m_0,n_0) = W_{r1}^{k_1 m_0} * W_{r2}^{l_1 n_0} as well, it follows that

I^2(k_1,l_1; m_0,n_0) = I_T(k_1,l_1; m_0,n_0),  that is,  I^2 = I_T.

The second part of the theorem, as well as the other structure theorems, can be proved using the same approach.

APPENDIX C: THE COMBINED FACTOR METHOD

To obtain the combined factor vector radix-8*8 DIF FFT using Equation (3-6-8), one simply combines a with the matrix I_N, using the fact that a^2 = -j. Diagrammatically, the method attempts to reduce the number of multiplications contained in the vector radix-8*8 butterfly structure by combining the row twiddles with the column twiddles. For some algorithms there may be more than one way of conducting this task, but minimum multiplications, and combined factors (or twiddles) at regular places, are often preferred; the example in [45] is but one.

The combination of the row twiddles with the column's needs more explanation. If the row twiddle is W_{N1}^α and the column twiddle is W_{N2}^β, combining them gives W_{N1}^α * W_{N2}^β; when N1 = N2 = N,

W_N^α * W_N^β = W_N^{α+β} = W_N^γ,  where γ = α + β.

If a Look Up Table (LUT) is used in the program, as is the case in this study, the combined factors are pre-calculated, stored in the LUT, and called when needed. This practice increases the DFT processing speed considerably. Figure-C-1 shows a 2-D 64*64-point DFT calculated using the CF VR-8*8 FFT [45] with the 2-D input data shown in Figure-C-2. Figure-C-3 is another 2-D DFT generated by the same program using Figure-C-4 as the input.

The vector radix-16*16 FFT algorithm can be constructed in the same manner. The vector radix-16*16 butterfly computational structure can be calculated according to Figure-9, and the vector radix-16*16 twiddle factors can be generated, using the structure theorem, from the corresponding twiddles of the radix-16 FFT algorithm.
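The LUT practice described above can be sketched as follows; the function and variable names are illustrative, not those of the thesis software. One table look-up of W_N^{(α+β) mod N} replaces a run-time complex multiplication of the row and column twiddles:

```python
import cmath

N = 64
# Pre-calculated LUT of the N twiddle factors W_N^g = exp(-j*2*pi*g/N).
LUT = [cmath.exp(-2j * cmath.pi * g / N) for g in range(N)]

def combined_twiddle(alpha, beta):
    # One table look-up replaces the complex multiply W_N^alpha * W_N^beta.
    return LUT[(alpha + beta) % N]

# The combined factor equals the product of the separate row/column twiddles.
for a in range(N):
    for b in range(N):
        assert abs(combined_twiddle(a, b) - LUT[a] * LUT[b]) < 1e-9
```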

Figure-C-1: A 2-D 64 x 64-point DFT calculated using the CF VR-8 x 8 FFT algorithm.

Figure-C-2: The 2-D input data used for Figure-C-1.

Figure-C-3: A 2-D 64 x 64-point DFT calculated using the CF VR-8 x 8 FFT algorithm.

Figure-C-4: The 2-D input data used for Figure-C-3.

APPENDIX D: DERIVATION OF VECTOR RADIX 2-D FAST DCT BASED ON LEE'S ALGORITHM

In Equation (5-2-4a), set k = 2k' + k'' and l = 2l' + l'', where k',l' = 0,1,...,N/2-1 and k'',l'' = 0,1. Then the following four equations are obtained:

X(2k',2l') = Σ_{n=0}^{N-1} Σ_{m=0}^{N-1} x(n,m) C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}    (D-1)

X(2k',2l'+1) = Σ_{n=0}^{N-1} Σ_{m=0}^{N-1} x(n,m) C_{2(N/2)}^{(2n+1)k'} C_{2N}^{(2m+1)(2l'+1)}    (D-2)

X(2k'+1,2l') = Σ_{n=0}^{N-1} Σ_{m=0}^{N-1} x(n,m) C_{2N}^{(2n+1)(2k'+1)} C_{2(N/2)}^{(2m+1)l'}    (D-3)

X(2k'+1,2l'+1) = Σ_{n=0}^{N-1} Σ_{m=0}^{N-1} x(n,m) C_{2N}^{(2n+1)(2k'+1)} C_{2N}^{(2m+1)(2l'+1)}    (D-4)

From Equation (D-1),

X(2k',2l') = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} [x(n,m) + x(N-1-n,m) + x(n,N-1-m) + x(N-1-n,N-1-m)] C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}    (D-5)

Note that C_{2(N/2)}^{(2(N-1-n)+1)k'} = C_{2(N/2)}^{(2n+1)k'}. Using the same method:

X(2k',2l'+1) = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} [x(n,m) + x(N-1-n,m) - x(n,N-1-m) - x(N-1-n,N-1-m)] C_{2(N/2)}^{(2n+1)k'} C_{2N}^{(2m+1)(2l'+1)}    (D-6)

X(2k'+1,2l') = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} [x(n,m) - x(N-1-n,m) + x(n,N-1-m) - x(N-1-n,N-1-m)] C_{2N}^{(2n+1)(2k'+1)} C_{2(N/2)}^{(2m+1)l'}    (D-7)

and,

X(2k'+1,2l'+1) = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} [x(n,m) - x(N-1-n,m) - x(n,N-1-m) + x(N-1-n,N-1-m)] C_{2N}^{(2n+1)(2k'+1)} C_{2N}^{(2m+1)(2l'+1)}    (D-8)

Note that C_{2(N/2)}^{(2(N-1-m)+1)l'} = C_{2(N/2)}^{(2m+1)l'}, C_{2N}^{(2(N-1-n)+1)(2k'+1)} = -C_{2N}^{(2n+1)(2k'+1)}, and C_{2N}^{(2(N-1-m)+1)(2l'+1)} = -C_{2N}^{(2m+1)(2l'+1)}.

Define:

g1(n,m) = x(n,m) + x(N-1-n,m) + x(n,N-1-m) + x(N-1-n,N-1-m)

g2(n,m) = [x(n,m) + x(N-1-n,m) - x(n,N-1-m) - x(N-1-n,N-1-m)] / (2C_{2N}^{2m+1})

g3(n,m) = [x(n,m) - x(N-1-n,m) + x(n,N-1-m) - x(N-1-n,N-1-m)] / (2C_{2N}^{2n+1})

g4(n,m) = [x(n,m) - x(N-1-n,m) - x(n,N-1-m) + x(N-1-n,N-1-m)] / (2C_{2N}^{2n+1} * 2C_{2N}^{2m+1})

Then:

X(2k',2l'+1) = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} g2(n,m) * 2C_{2N}^{2m+1} C_{2(N/2)}^{(2n+1)k'} C_{2N}^{(2m+1)(2l'+1)}
    = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} g2(n,m) C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}
    + Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} g2(n,m) C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)(l'+1)}
    = G2(k',l') + G2(k',l'+1)    (D-9)

where G2(k',l') = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} g2(n,m) C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}, noticing that

2C_{2N}^{2m+1} C_{2N}^{(2m+1)(2l'+1)} = C_{2(N/2)}^{(2m+1)l'} + C_{2(N/2)}^{(2m+1)(l'+1)}.

Similarly,

X(2k'+1,2l') = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} g3(n,m) * 2C_{2N}^{2n+1} C_{2N}^{(2n+1)(2k'+1)} C_{2(N/2)}^{(2m+1)l'}
    = G3(k',l') + G3(k'+1,l')    (D-10)

where G3(k',l') = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} g3(n,m) C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}, noticing that

2C_{2N}^{2n+1} C_{2N}^{(2n+1)(2k'+1)} = C_{2(N/2)}^{(2n+1)k'} + C_{2(N/2)}^{(2n+1)(k'+1)}.

Accordingly,

X(2k'+1,2l'+1) = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} g4(n,m) * 2C_{2N}^{2n+1} * 2C_{2N}^{2m+1} C_{2N}^{(2n+1)(2k'+1)} C_{2N}^{(2m+1)(2l'+1)}
    = G4(k',l') + G4(k',l'+1) + G4(k'+1,l') + G4(k'+1,l'+1)    (D-11)

where G4(k',l') = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} g4(n,m) C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}.

After defining G1(k',l') = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} g1(n,m) C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}, the matrix form of the forward algorithm can be obtained.

[g'1(n,m); g'2(n,m); g'3(n,m); g'4(n,m)]^T = (B⊗B) [x(n,m); x(n,N-1-m); x(N-1-n,m); x(N-1-n,N-1-m)]^T    (D-12a)

[g1(n,m); g2(n,m); g3(n,m); g4(n,m)]^T = (M⊗M') [g'1(n,m); g'2(n,m); g'3(n,m); g'4(n,m)]^T    (D-12b)

[G1(k,l); G2(k,l); G3(k,l); G4(k,l)]^T = Σ_{n=0}^{N/2-1} Σ_{m=0}^{N/2-1} C_{2(N/2)}^{(2n+1)k} C_{2(N/2)}^{(2m+1)l} [g1(n,m); g2(n,m); g3(n,m); g4(n,m)]^T    (D-12c)

[X(2k,2l); X(2k,2l+1); X(2k+1,2l); X(2k+1,2l+1)]^T = (P⊗P) [G1(k,l); G2(k,l); G2(k,l+1); G3(k,l); G4(k,l); G4(k,l+1); G3(k+1,l); G4(k+1,l); G4(k+1,l+1)]^T    (D-12d)

where k,l,n,m = 0,1,...,N/2-1, and Gi(k+1,·)|_{k=N/2-1} = Gi(·,l+1)|_{l=N/2-1} = 0, for i = 2,3,4.
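As a numerical check on the even-even branch, Equation (D-5) says that the even-indexed outputs X(2k',2l') are exactly the (N/2)*(N/2)-point 2-D DCT of the folded block g1. The unnormalized DCT below is a small test harness written for this appendix, not the thesis implementation:

```python
import math

def dct2(x):
    # Unnormalized 2-D DCT-II: X(k,l) = sum x(n,m) C_2N^{(2n+1)k} C_2N^{(2m+1)l}.
    N = len(x)
    return [[sum(x[n][m]
                 * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                 * math.cos((2 * m + 1) * l * math.pi / (2 * N))
                 for n in range(N) for m in range(N))
             for l in range(N)]
            for k in range(N)]

N = 8
x = [[(3 * n + 7 * m) % 11 for m in range(N)] for n in range(N)]
X = dct2(x)

# Fold the input as in g1(n,m) and take the half-size 2-D DCT.
g1 = [[x[n][m] + x[N - 1 - n][m] + x[n][N - 1 - m] + x[N - 1 - n][N - 1 - m]
       for m in range(N // 2)] for n in range(N // 2)]
G1 = dct2(g1)

# Equation (D-5): the even-even DCT outputs come from the folded block.
for k in range(N // 2):
    for l in range(N // 2):
        assert abs(X[2 * k][2 * l] - G1[k][l]) < 1e-9
```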

For the inverse 2-D DCT defined by Equation (5-2-4b), the same decimation can be applied to obtain the following equations.

x(n,m) = Σ_{k'=0}^{N/2-1} Σ_{l'=0}^{N/2-1} [X(2k',2l') C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}
    + X(2k',2l'+1) C_{2(N/2)}^{(2n+1)k'} C_{2N}^{(2m+1)(2l'+1)}
    + X(2k'+1,2l') C_{2N}^{(2n+1)(2k'+1)} C_{2(N/2)}^{(2m+1)l'}
    + X(2k'+1,2l'+1) C_{2N}^{(2n+1)(2k'+1)} C_{2N}^{(2m+1)(2l'+1)}]    (D-13)

x(n,N-1-m) = Σ_{k'=0}^{N/2-1} Σ_{l'=0}^{N/2-1} [X(2k',2l') C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}
    - X(2k',2l'+1) C_{2(N/2)}^{(2n+1)k'} C_{2N}^{(2m+1)(2l'+1)}
    + X(2k'+1,2l') C_{2N}^{(2n+1)(2k'+1)} C_{2(N/2)}^{(2m+1)l'}
    - X(2k'+1,2l'+1) C_{2N}^{(2n+1)(2k'+1)} C_{2N}^{(2m+1)(2l'+1)}]    (D-14)

x(N-1-n,m) = Σ_{k'=0}^{N/2-1} Σ_{l'=0}^{N/2-1} [X(2k',2l') C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}
    + X(2k',2l'+1) C_{2(N/2)}^{(2n+1)k'} C_{2N}^{(2m+1)(2l'+1)}
    - X(2k'+1,2l') C_{2N}^{(2n+1)(2k'+1)} C_{2(N/2)}^{(2m+1)l'}
    - X(2k'+1,2l'+1) C_{2N}^{(2n+1)(2k'+1)} C_{2N}^{(2m+1)(2l'+1)}]    (D-15)

x(N-1-n,N-1-m) = Σ_{k'=0}^{N/2-1} Σ_{l'=0}^{N/2-1} [X(2k',2l') C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}
    - X(2k',2l'+1) C_{2(N/2)}^{(2n+1)k'} C_{2N}^{(2m+1)(2l'+1)}
    - X(2k'+1,2l') C_{2N}^{(2n+1)(2k'+1)} C_{2(N/2)}^{(2m+1)l'}
    + X(2k'+1,2l'+1) C_{2N}^{(2n+1)(2k'+1)} C_{2N}^{(2m+1)(2l'+1)}]    (D-16)

where n,m = 0,1,...,N/2-1. Define X(·,2l'-1)|_{l'=0} = 0, X(2k'-1,·)|_{k'=0} = 0, and:

h1(n,m) = Σ_{k'=0}^{N/2-1} Σ_{l'=0}^{N/2-1} X(2k',2l') C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}

h2(n,m) = 2C_{2N}^{2m+1} Σ_{k'=0}^{N/2-1} Σ_{l'=0}^{N/2-1} X(2k',2l'+1) C_{2(N/2)}^{(2n+1)k'} C_{2N}^{(2m+1)(2l'+1)}
    = Σ_{k'=0}^{N/2-1} Σ_{l'=0}^{N/2-1} H2(k',l') C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}

h3(n,m) = 2C_{2N}^{2n+1} Σ_{k'=0}^{N/2-1} Σ_{l'=0}^{N/2-1} X(2k'+1,2l') C_{2N}^{(2n+1)(2k'+1)} C_{2(N/2)}^{(2m+1)l'}
    = Σ_{k'=0}^{N/2-1} Σ_{l'=0}^{N/2-1} H3(k',l') C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}

h4(n,m) = 2C_{2N}^{2n+1} * 2C_{2N}^{2m+1} Σ_{k'=0}^{N/2-1} Σ_{l'=0}^{N/2-1} X(2k'+1,2l'+1) C_{2N}^{(2n+1)(2k'+1)} C_{2N}^{(2m+1)(2l'+1)}
    = Σ_{k'=0}^{N/2-1} Σ_{l'=0}^{N/2-1} H4(k',l') C_{2(N/2)}^{(2n+1)k'} C_{2(N/2)}^{(2m+1)l'}

where n,m = 0,1,...,N/2-1;

H2(k',l') = X(2k',2l'+1) + X(2k',2l'-1);
H3(k',l') = X(2k'+1,2l') + X(2k'-1,2l');
H4(k',l') = X(2k'+1,2l'+1) + X(2k'+1,2l'-1) + X(2k'-1,2l'+1) + X(2k'-1,2l'-1).

Therefore,

x(n,m) = h1(n,m) + h2(n,m)/(2C_{2N}^{2m+1}) + h3(n,m)/(2C_{2N}^{2n+1}) + h4(n,m)/(2C_{2N}^{2n+1} * 2C_{2N}^{2m+1})    (D-17)

x(n,N-1-m) = h1(n,m) - h2(n,m)/(2C_{2N}^{2m+1}) + h3(n,m)/(2C_{2N}^{2n+1}) - h4(n,m)/(2C_{2N}^{2n+1} * 2C_{2N}^{2m+1})    (D-18)

x(N-1-n,m) = h1(n,m) + h2(n,m)/(2C_{2N}^{2m+1}) - h3(n,m)/(2C_{2N}^{2n+1}) - h4(n,m)/(2C_{2N}^{2n+1} * 2C_{2N}^{2m+1})    (D-19)

x(N-1-n,N-1-m) = h1(n,m) - h2(n,m)/(2C_{2N}^{2m+1}) - h3(n,m)/(2C_{2N}^{2n+1}) + h4(n,m)/(2C_{2N}^{2n+1} * 2C_{2N}^{2m+1})    (D-20)

For k,l,n,m = 0,1,...,N/2-1, and X(2k-1,·)|_{k=0} = X(·,2l-1)|_{l=0} = 0, the matrix form of the 2-D IDCT algorithm is presented as follows:

[H1(k,l); H2(k,l); H3(k,l); H4(k,l)]^T = (P⊗P) [X(2k,2l); X(2k,2l+1); X(2k,2l-1); X(2k+1,2l); X(2k+1,2l+1); X(2k+1,2l-1); X(2k-1,2l); X(2k-1,2l+1); X(2k-1,2l-1)]^T    (D-21a)

[h1(n,m); h2(n,m); h3(n,m); h4(n,m)]^T = Σ_{k=0}^{N/2-1} Σ_{l=0}^{N/2-1} C_{2(N/2)}^{(2n+1)k} C_{2(N/2)}^{(2m+1)l} [H1(k,l); H2(k,l); H3(k,l); H4(k,l)]^T    (D-21b)

[h'1(n,m); h'2(n,m); h'3(n,m); h'4(n,m)]^T = (M⊗M') [h1(n,m); h2(n,m); h3(n,m); h4(n,m)]^T    (D-21c)

[x(n,m); x(n,N-1-m); x(N-1-n,m); x(N-1-n,N-1-m)]^T = (B⊗B) [h'1(n,m); h'2(n,m); h'3(n,m); h'4(n,m)]^T    (D-21d)

APPENDIX E: ARITHMETIC COMPLEXITY OF THE VECTOR SPLIT-RADIX DIF FFT ALGORITHM

According to Equation (3-8-3), the multiplications in the 2-D vector split-radix DIF FFT are all contained in the twiddle factor matrix F_m^t. The N*N DFT can be calculated using one (N/2)*(N/2) DFT and twelve (N/4)*(N/4) DFTs. The extra multiplications required at each stage by this approach are caused by F_m^t. The total number of complex multiplications M_n needed for a 2-D N*N DFT, where N = 2^n, is given by

M_n = M_{n-1} + 12*M_{n-2} + M_extra    (E-1)

with M_1 = M_2 = 0, and it can be shown that M_extra = 12*((N/4)^2 - N/4) for n ≥ 3.

The total number of complex additions A_n is

A_n = A_{n-1} + 12*A_{n-2} + A_extra    (E-2)

where A_extra = 3*(N/2)^2 + 48*(N/4)^2, A_0 = 0, and A_1 = 8.

To prove that M_extra = 12*((N/4)^2 - N/4) for n ≥ 3, it is observed that amongst F_m^t there are three groups of factors:

(1) W_N^n, W_N^{3n}, W_N^m, W_N^{3m};
(2) W_N^{m+n}, W_N^{3m+3n}; and
(3) W_N^{2m+n}, W_N^{2m+3n}, W_N^{m+2n}, W_N^{3m+2n}, W_N^{3m+n}, W_N^{m+3n};

within which it is necessary to determine the number of trivial multiplications. The term "trivial multiplication" means that the value of the twiddle factor is ±1 or ±j. No multiplications are needed for 2*2- and 4*4-point DFTs, and thus M_1 = M_2 = 0.

In the first group, when n (resp. m) = 0, there are N/4 trivial multiplications as m (resp. n) varies from 0 to N/4-1.

In the second group, W_N^{m+n} is considered first. According to the properties of the transformation <a>_b, which gives the residue of a modulo b [155], for each m = 1,2,...,N/4-1 there exists an n such that <m+n>_{N/4} = 0, in addition to the case m = 0 and n = 0. There are therefore N/4 trivial multiplications for the factor W_N^{m+n}. Since <3(m+n)>_{N/4} = <<3>_{N/4} <m+n>_{N/4}>_{N/4}, we have <3(m+n)>_{N/4} = 0 whenever <m+n>_{N/4} = 0, and it is thus true that amongst the (N/4)^2 multiplications, N/4 are trivial for each of the twiddle factors in this group.

In the last group, only W_N^{2m+n}, W_N^{m+2n} and W_N^{3m+2n} need be considered, as the rest can be proved accordingly.

Considering the factor W_N^{2m+n}, we have

<2m+n>_{N/4} = <<2m>_{N/4} + <n>_{N/4}>_{N/4}
    = <<2m>_{N/4} + n>_{N/4}
    = <m' + n>_{N/4}    (E-3)

where m' = <2m>_{N/4}. For each m = 0,1,...,N/4-1, and therefore for every resulting m', there exists an n such that <m'+n>_{N/4} = 0; at those points the multiplications become trivial. It can be shown that the same is true for W_N^{m+2n} as well.

W_N^{3m+2n} also contains N/4 trivial multiplications; however, this is not so simple to prove. To start, it is required to prove that <3m>_{N/4} is a one-to-one and onto mapping whose domain is A = {0,1,...,N/4-1}, i.e., for every m ∈ A, <3m>_{N/4} ∈ A, and if m1 ≠ m2, with both m1 and m2 ∈ A, then <3m1>_{N/4} ≠ <3m2>_{N/4}. This can be achieved by invoking the theorem [8] which states that, for n = 0,1,2,...,M-1, <an>_M takes on all the M possible residues if (a,M) = 1.

In the problem considered here, a = 3 and M = N/4 = 2^n/4, n ≥ 3; since N/4 is a power of 2, a and N/4 are mutually prime. According to the above theorem, <3m>_{N/4} is a one-to-one and onto mapping: for every m ∈ A there exists an m' = <3m>_{N/4} ∈ A, and m1 ≠ m2 implies m'1 ≠ m'2.

Since

<3m+2n>_{N/4} = <<2n>_{N/4} + <3m>_{N/4}>_{N/4}
    = <<2n>_{N/4} + m'>_{N/4}    (E-4)

where m' ∈ A can, in accordance with m, take any value in A, for each n ∈ A there exists an m, and thereafter an m', such that Equation (E-4) equals zero. Therefore W_N^{3m+2n} contains N/4 trivial multiplications, which completes the derivation.

From the above discussion, it is concluded that

M_extra = 12*((N/4)^2 - N/4)    (E-5)

for n ≥ 3.
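The count behind Equation (E-5) can be verified by brute force: a twiddle W_N^e is trivial (±1 or ±j) exactly when e ≡ 0 (mod N/4), and each of the twelve factor families contributes N/4 trivial entries out of (N/4)^2. The group-(3) exponent list used below is the reconstruction adopted in this appendix and should be checked against Equation (3-8-3):

```python
N = 32
M = N // 4
# Exponent e(m,n) of each twiddle factor W_N^e in the three groups.
exponents = [
    lambda m, n: n, lambda m, n: 3 * n, lambda m, n: m, lambda m, n: 3 * m,
    lambda m, n: m + n, lambda m, n: 3 * m + 3 * n,
    lambda m, n: 2 * m + n, lambda m, n: 2 * m + 3 * n,
    lambda m, n: m + 2 * n, lambda m, n: 3 * m + 2 * n,
    lambda m, n: 3 * m + n, lambda m, n: m + 3 * n,
]

# W_N^e is trivial exactly when e = 0 mod N/4.
nontrivial = 0
for e in exponents:
    trivial = sum(1 for m in range(M) for n in range(M) if e(m, n) % M == 0)
    assert trivial == M          # each factor family has N/4 trivial entries
    nontrivial += M * M - trivial

assert nontrivial == 12 * ((N // 4) ** 2 - N // 4)   # Equation (E-5)
```

Note that the argument only requires each exponent to be a sum a*m + b*n with at least one coefficient coprime to N/4, so the count is insensitive to the exact pairing of coefficients within the group.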

The number of complex multiplications needed for the 2-D vector split-radix FFT to perform N*N complex DFTs is listed in Table-E-1.

Table-E-1: The number of complex multiplications required for the 2-D vector split-radix FFT to perform N x N-point complex DFTs.

   N     M_{n-2}     M_{n-1}     M_extra         M_n
   8           0           0          24          24
  16           0          24         144         168
  32          24         168         672        1128
  64         168        1128        2880        6024
 128        1128        6024       11904       31464
 256        6024       31464       48384      152136
 512       31464      152136      195072      724776
1024      152136      724776      783360     3333768
2048      724776     3333768     3139584    15170664
4096     3333768    15170664    12570624    67746504