Low Power and Area Efficient DCT Architecture for Low Bit Rate Communication

Muhammad FAISAL SIDDIQUI1 , Raja ALI RIAZ1, 2, Syed SAUD NAQVI1 Department of Electrical Engineering, COMSATS Institute of Information Technology, Islamabad (1), School of ECS, University of Southampton (2) Low Power and Area Efficient DCT Architecture for Low Bit Rate Communication Abstract. In this paper a low power and area efficient DCT (Discrete Cosine Transform) pipelined architecture using multiplier-less method is presented for low bit rate communications such as videoconferencing in mobile devices. The multiplier-less multiplication is implemented by minimum number of additions, subtractions and shifts using CSD (Canonic Signed Digit) representation for fixed point DCT coefficient. Power reduction is achieved by minimizing both the number of arithmetic operations and data-path width. The proposed DCT architecture was implemented on a XILINX FPGA (Field-Programmable Gate Array). The results from power estimation show that our design is capable of reducing the power dissipation 5.5 times compared to the other DCT architectures for video streaming/video conferencing in portable devices. Streszczenie. W artykule zaprezentowano system komunikacji DCT )(Discrete Cosine Transform) o małej przepływności bitów charakteryzujący się małą mocą, dobrym pokryciem obszaru. System ma strukturę potokową. Ograniczenie mocy osiągnięto przez zmniejszenie operacji matematycznych oraz szerokości ścieżki danych. System zaimplementowano z wykorzystaniem elementów FPGA. (Architektura komunikacji DCP o małej mocy do małej przepływności danych) Keywords: Discrete Cosine Transform (DCT), Canonical Signed Digit (CSD), Multiplier-less Słowa kluczowe: komunikacja DCT, przepływność danych Introduction communications such as H.261, H.263, H.263+, H.264 and Many of today multimedia applications such as video- many more [1, 2]. Normally for internet video streaming (@ conferencing, internet video streaming and video-over 20-200 kbps) uses H.263, for video conferencing (@ 20-320 wireless are most bandwidth consuming modes of kbps) uses H.261 or H.263, and for video over 3G wireless communication. Efficient hardware is needed for these (@ 20-100kbps) uses H.263 video standards. types of applications. The key features of hardware design Video is composed of frames of pictures, so the still are to consume very low power and low area to implement image coding has to process one more dimension for video in portable small devices such as mobiles phones or signals compression. To de-correlate the blocks of original cameras. pixels, DCT has been widely used in many multimedia Several international standards of image and video compression international standards. Some low bit rate coders use compression techniques based on DCT which international video coding standards are given in Table 1. transform it in frequency domain. DCT algorithm has excessive numbers of multiplication and addition Table 1. Comparison of Resources Usage operations. In hardware implementation multiplication is Coding Schemes Bit Rate Applications costly and consumes large amount of power. To reduce the (kbps) multiplication operation some mathematical manipulation H.261 / H.263 20-320 Video Telephony/Video Conferencing techniques are used such as row-column efficient matrix H.263 / H.264 20-100 Video over 3G Wireless form of DCT coefficient which reduces the complexity of the H.264 / H.264 20-200 Internet Video Streaming algorithm. In hardware fixed point constant multiplication is efficiently implemented by CSD. Furthermore, to decrease Discrete Cosine Transform (DCT) the arithmetic operations common sub-expression The two dimensional DCT is broadly used in still image elimination technique is also used. compression techniques. It is widely recognized that The proposed architecture is designed in such a way performing DCT over 8 x 8 blocks represents a good trade- that the architecture consumes minimum power to compute off between transform complexity and spatial correlation. the desired results with efficiently using the hardware The 8 x 8 2-D DCT transforms an 8 x 8 block sample from resources. Pipelining is used to increase the frequency and spatial domain f(x,y) into frequency domain F(k,l). DCT is throughput of the design. Hardware optimization is used to defined mathematically as in Eq (1): make resources more suitable and consumes less power with less complexity. The proposed design is implemented (1) on XILINX FPGA. The comparison results show that 77 proposed architecture proves to be efficient compared to 1 kx ly other published results. Fkl(,) CkCl () () f (, xy )cos cos This paper is organized as follows. First section gives an 41616xy00 overview of the video coding standards. Second section kxk(2 1) introduces the background of the DCT algorithm with x efficient matrix approach. Third and Fourth sections briefly lyl(2 1) describe the CSD multiplication and common sub- y expression elimination respectively. Fifth section presents 1 the design of the proposed architecture and comparison if k,0 l results. Finally, conclusions are presented in last section. Ck() Cl () 2 Overview on Video Coding Standards 1 otherwise In emerging multimedia applications such as image and and kl, varies from 0 to 7 video compression, require the use of coders. There are many different video coding standards for low bit rate 216 PRZEGLĄD ELEKTROTECHNICZNY (Electrical Review), ISSN 0033-2097, R. 88 NR 8/2012 For direct computation of 2-D DCT according to the [7]. CSD form notation is: definition, it needs 4096 multiplications and 4032 additions. N 1 i This computational complexity can be reduced by using the (4) Sa i 2 row-column approach [3]. In this row-column technique, 2-D i0 DCT can be implemented by two successive 1-D DCT. First 1-D DCT is performed in row-wise and the second 1-D DCT Here ai is in the set {-1, 0, 1} for each i . is performed in column wise on the output of the first 1-D The DCT coefficients are implemented in fixed point DCT. Eq (2) yields the mathematical expression of the 1-D CSD representation. According to IEEE 1180-1990, 12-bit DCT. precision is used in order to confirm the accuracy specifications [6]. CSD expression of the cosine coefficients 7 are shown in Table 2. 1 kx Fk() Ck () f ()cos x 416x0 Table 2. 12-bit Precision DCT Coefficient in CSD Form Coefficient Decimal Value CSD Representation (2) kxkx (2 1) c1 0.4904 0.ı00000ī0ī000 -c1 -0.4904 0.ī00000ı0ı000 1 c2 0.4619 0.ı000ī0ī00ı00 if k 0 -c2 -0.4619 0.ī000ı0ı00ī00 Ck() 2 c3 0.4157 0.ı0ī0ı0ı0ı0ī0 -c3 -0.4157 0.ī0ı0ī0ī0ī0ı0 1 otherwise c4 0.3536 0.ı0ī0ī0ı0ı000 -c4 -0.3536 0.ī0ı0ı0ī0ī000 and k varies from 0 to 7 c5 0.2778 0.0ı00ı00ī000ı -c5 -0.2778 0.0ī00ī00ı000ī In this paper, the efficient DCT coefficient matrix form is c6 0.1913 0.0ı0ī000ı000ī used for computational and architectural optimizations [4]. -c6 -0.1913 0.0ī0ı000ī000ı The 8 x 8 DCT coefficient matrix form can be represented c7 0.0975 0.00ı0ī00ı000ī as: -c7 -0.0975 0.00ī0ı00ī000ı cccccccc In this manner maximum addition latency is three for 44 4 4 4 4 4 4 computing any multiple of fixed point DCT coefficient. cc13 c 5 c 7 c 7 c 5 c 3 c 1 Common Sub-Expression Elimination cc26 c 6 c 2 c 2 cc 66 c 2 The fixed coefficients of a DCT in CSD format have some common sub-expressions. Common sub-expression ccccccc c 3715517 3 means that some of the bit patterns occur more than once cccccccc in any expression. Close observation on the fixed 44444 444 coefficients extracts some common sub-expressions which can be easily eliminated. For example, a bit pattern of cccc5173 cccc 3715 (ı0ī )CSD in coefficient c3 comes twice so it is not needed to cccccccc 6 22 6 62 26 compute this two times. It is achieved just by computing it ccccccccone time and using its shifted result. This characteristic 75311357 makes it possible to reduce the number of arithmetic where operations and the hardware complexity. c3 is computed as: 1 i (3) ci() cos, i 1,2, ,6,7 tx 13 x 216 1 tx2 57 x Since the coefficient matrix having constant values so minimum number of arithmetic operations will be possible cttt3121 8 on using CSD representation. The given example shows that the common sub- Multiplier-Less Implementation of Multiplication using expression technique eliminates one arithmetic operation. Canonic Signed Digit The first operation result is re-used with some shift in Multiplier-less method is widely used for VLSI (Very computing the final result. It provides the maximum addition Large scale Integration) realization because of improvement latency of 3. in speed, area overhead and power consumption [5-6]. In normal multipliers-less multiplication of a constant number Proposed DCT Architecture is implemented just by shift and add operations. In this The proposed DCT architecture is inherently low power method the number of adders are directly proportional to and low area because of minimum required arithmetic the number of ones ‘1’ in a constant. The proposed scheme operations, therefore making it highly suitable to compact especially incorporates CSD method for more efficient mobile applications e.g. video streaming, video hardware usage and reduces the hardware complexity conferencing and online video-gaming. The 5-stage significantly in multiplier-less implementation of DCT. In pipelined structure is pretty simple, composed of 4 barrel CSD form the numbers of ones are lesser than and equal to shifters, 5 adders/subtractors and 1 multiplexer, shows in the normal binary representation of a number. Therefore Fig. 1. CSD method decreases the number of add/sub operations PRZEGLĄD ELEKTROTECHNICZNY (Electrical Review), ISSN 0033-2097, R. 88 NR 8/2012 217 Fig. 1. Proposed DCT Architecture First three stages are performing the multiplication enables an encoding rate of up to 68 PAL (Phase operation. The 4th stage adder is adding the eight Alternating line) frames (720 x 576 pixels) per second and multiplication values which make one resultant value of the 280 CIF (Common Intermediate Format) frames (352 x 288 1D-DCT.

Load more