A Low-Power and High-Quality Implementation of the Discrete Cosine Transformation

Adv. Radio Sci., 5, 305–311, 2007 Advances in www.adv-radio-sci.net/5/305/2007/ © Author(s) 2007. This work is licensed Radio Science under a Creative Commons License.

A low-power and high-quality implementation of the discrete cosine transformation

B. Heyne and J. Gotze¨ University of Dortmund, Information Processing Lab, Otto-Hahn-Str. 4, 44227 Dortmund, Germany

Abstract. In this paper a computationally efficient and high- most 45% of the total area (Sung et al., 2005). In this regard, quality preserving DCT architecture is presented. It is ob- Tran (2000) proposed the binDCT which approximates mul- tained by optimizing the Loeffler DCT based on the Cordic tiplications with add and shift operations. It only consumes algorithm. The computational complexity is reduced from 11 about 38% of the power of the Loeffler DCT. However, it also multiply and 29 add operations (Loeffler DCT) to 38 add and looses about 3 dB in PSNR compared to the Loeffler DCT 16 shift operations (which is similar to the complexity of the (Sung et al., 2005). binDCT). The experimental results show that the proposed Jeong et al. (2004) has proposed a Cordic based imple- DCT algorithm not only reduces the computational com- mentation of the DCT. COordinate Rotation DIgital Com- plexity significantly, but also retains the good transformation puter (Cordic) is an algorithm that can be used for the evalu- quality of the Loeffler DCT. Therefore, the proposed Cordic ation of various functions in signal processing (Volder, 1959; based Loeffler DCT is especially suited for low-power and Walther, 1971). In addition, the Cordic algorithm is highly high-quality CODECs in battery-based systems. suited for VLSI-implementation. In this paper we propose a computationally efficient and high-quality Cordic based Loeffler DCT architecture, which is optimized by taking advantage of certain properties of 1 Introduction the Cordic algorithm and its implementations (Goetze and Hekstra, 1995). It only requires 38 add and 16 shift opera- Recently, many kinds of digital image processing and video tions. The resulting DCT algorithm not only reduces com- compression techniques have been proposed in the literature, putational complexity significantly, but also retains the good such as JPEG, Digital Watermark, MPEG and H.263 (Con- transformation quality of the Loeffler DCT. Therefore, the zalez and Woods, 2001; Richardson, 2002). All the above presented Cordic based Loeffler DCT implementation is es- standards require the Discrete Cosine Transform (DCT) Con- pecially suited for low-power and high-quality CODECs. zalez and Woods (2001) to aid image/video compression. This paper is organized as follows. Section 2 briefly intro- Therefore, the DCT has become more and more important duces the algorithms of the DCT, Loeffler DCT and Cordic in today’s image/video processing designs. based DCT. In Sect. 3, we will present the proposed Cordic In the past few years, much research has been done on low based Loeffer DCT algorithm. The experimental results are power DCT designs (Li and Lu, 1996; Hsiao et al., 2005; shown in Sect. 4, while Sect. 5 concludes this paper. August and Ha, 2004; Jeong et al., 2004; Shams et al., 2002; Fanucci and Saponara, 2002). One of the most popular ways to realize the fast DCT (FDCT) is to use the Flow-Graph Al- gorithm (FGA) for VLSI-implementation (Chen et al., 1977; Wang, 1984). Loeffler et al. (1989) has proposed a low- complexity FDCT/IDCT algorithm based on FGA that requires only 11 multiply and 29 add operations. However, the multiplications consume about 40% of the power and al-

Correspondence to: B. Heyne ([email protected])

Published by Copernicus Publications on behalf of the URSI Landesausschuss in der Bundesrepublik Deutschland e.V. 306 B. Heyne and J. Gotze:¨ Low-Power DCT 2 B. Heyne and J. G¨otze: Low-Power DCT

2.2 MAC based Loeffler DCT 0 0 7 7 1 F (k,l)= 4 C(k)C(l) f(x, y) Many 1-D flow graph algorithms have been reported in the x=0 y=0 1 4 literature (Chen et al., 1977; Wang, 1984). The Loeffer 1- P P (2x+1)kπ (2y+1)lπ 2C 3π /8 D 8-point DCT algorithm (Loeffler et al., 1989) requires 11 (1) 2 cos[ 16 ] cos[ 16 ] 2 · S3π /8 multiplications and 29 additions as shown in Table 3. The −S 3π /8 flow graph of the Loeffler DCT is illustrated in Fig. 1, with 1 if m =0 3 6 √2 2C C(m) = 3π /8 C = cos(x) and S = sin(x). One of its variations is adopted 1 otherwise. x x C3π /16 4 1 by the Independent JPEG Group (JPE, 1998) for their im- Since computing the above 2-D DCT by using matrix plementation of the popular JPEG image coding standard. S3π /16 4 Cπ /16 multiplication requires 8 multiplications, a commonly used 5 2 5 Note that this factorization requires a uniform scaling fac- S approach in hardware designs to reduce the computational π /16 √1 − tor of at the end of the flow graph to obtain the original Sπ /16 2 2 complexity is row-column decomposition. The decompo- 6 2 3 DCT coefficients. In the 2-D transform this scaling factor Cπ /16 sition performs row-wise one-dimensional (1-D) transform −S π 1 3 /16 becomes which can be easily implemented by a shift oper- followed by column-wise 1-D transform with intermediate 8 7 7 ation. Although the Loeffler DCT requires multipliers, which transposition. An 8-point 1-D DCT can be expressed as fol- C π 3 /16 will result in larger power dissipation and area, it offers bet- lows: ter rate distortion than the other approaches. Therefore it is 7 especially useful for high-quality CODECs. 1 (2x+1)kπ Fig.Fig. 1. 1.FlowFlow graph graph of of an an 8-point Loeffler DCT. DCT. F (k)= 2 C(k) f(x)cos[ 16 ] x=0 2.3 Cordic based DCT P (2) 1 ifk=0 2(2005)]. DCT algorithmsThe Cordic has a very regular structure suitable for C(k)= √2 The Loeffler DCT achieves good quality transformation re- 1 otherwise. VLSI design. Figure 2 shows the flow graph of an 8-point 2.1Cordic The based DCT DCT background using six Cordics [Jeong et al. (2004)], sults, but on the other hand it needs multiplications which This decomposition approach has two advantages. Firstly requiring 104 additions and 84 shift operations as shown in are computationally intensive in both software and hardware the number of operations is significantly reduced. Secondly, TheTable two 3. dimensional DCT in Eq. (1) transforms an 8 × 8 implementation. In this regard, one of the popular ways to with regard to the implementation, the original 1-D DCT can blockIn sample order to from realize spatial a vector domain rotationf (x, fory) into the Cordicfrequency algo- do- implement a fast multiplierless approximation of the DCT is be replaced easier by more efficient DCT algorithms. mainrithm,F that(k, l) is. rotating a vector (x, y) by an angle θ, the circu- using the Cordic algorithm (Jeong et al., 2004; Hsiao et al., lar rotation angle is described7 7 as 2005). The Cordic has a very regular structure suitable for 2.2 MAC based Loeffler DCT F (k, l) = 1 C(k)C(l) P P f (x, y) VLSI design. Figure 2 shows the flow graph of an 8-point 4 1 i θ =x=0σyi=0tan− (2− ) Cordic based DCT using six Cordics (Jeong et al., 2004), Many 1-D flow graph algorithms have been reported in the i · (3) requiring 104 additions and 84 shift operations as shown in literature [Chen et al. (1977); Wang (1984)]. The Loeffer 1- (2x+1)kπP (2y+1)lπ · cos[ ] cos[ ] (1) Table 3. D 8-point DCT algorithm [Loeffler et al. (1989)] requires 11 16 with σi =116, 1. − multiplications and 29 additions as shown in Table 3. The In order to realize a vector rotation for the Cordic algo- Then, the vector( 1 rotation can be performed using the iterative flow graph of the Loeffler DCT is illustrated in Figure 1, √ if m = 0 rithm, that is rotating a vector (x, y) by an angle θ, the circu- C(m)equation= given2 in [Mariatos et al. (1994);Jeong et al. (2004)]: lar rotation angle is described as with Cx = cos(x) and Sx = sin(x). One of its variations 1 otherwise. is adopted by the Independent JPEG Group [JPE (1998)] for i P −1 −i xi+1 = xi σi yi 2− θ = σi · tan (2 ) their implementationof the popular JPEG image coding stan- Since computing the above 2-D− DCT· · byi using matrix multi-(4) yi4+1 = yi + σi xi 2− . i dard. Note that this factorization requires a uniform scaling plication requires 8 multiplications,· · a commonly used ap- (3) factor of 1 at the end of the flow graph to obtain the origi- proachIn equation in hardware 4, only designsshift and toadd reduce operations the are computational required in 2√2 with σi = 1, −1. nal DCT coefficients. In the 2-D transform this scaling factor complexitydigital hardware. is row-column Next, the decomposition.results of the rotation The iterati decompo-ons 1 sitionneed to performs be compensated row-wise (scaled) one-dimensional by a compensation (1-D) transform factor s. Then, the vector rotation can be performed using the iterative becomes 8 which can be easily implemented by a shift operation. Although the Loeffler DCT requires multipliers, which followedThis is also by done column-wise using an iterative 1-D transform approach: with intermediate equation given in (Mariatos et al., 1994; Jeong et al., 2004): will result in larger power dissipation and area, it offers bet- transposition. An 8-point 1-D DCT can be expressed as fol- = − · · −i lows: xi+1 = xi(1 + γi Fi) xi+1 xi σi yi 2 ter rate distortion than the other approaches. Therefore it is · −i (4) yi = yi(1 + γi Fi) yi+1 = yi + σi · xi · 2 . especially useful for high-quality CODECs. +1 · 7 (5) (2x+1)kπ In Eq. (4), only shift and add operations are required in digi- 2.3 Cordic based DCT F (k) = 1 C(k) P f (x)withcos[(1 + γi F]i) ∼= s 2 i 16 · tal hardware. Next, the results of the rotation iterations need x=0 i and γi =Q (0, 1, 1), Fi =2− . The Loeffler DCT achieves good quality transformation re- − (2) to be compensated (scaled) by a compensation factor s. This ( sults, but on the other hand it needs multiplications which When using√1 if the k =Cordic 0 to replace the multiplications of the is also done using an iterative approach: C(k) = 2 are computationally intensive in both software and hardware 8-point DCT the angles θx are fixed. Therefore, we can skip 1 otherwise. x + = x (1 + γ · F ) implementation. In this regard, one of the popular ways to some unnecessary iterations without losing accuracy. Table i 1 i i i y = y (1 + γ · F ) implement a fast multiplierless approximation of the DCT is This1 shows decomposition the detailed approach number ofhas iterations two advantages. and compensa- Firstly i+1 i i i using the Cordic algorithm [Jeong et al. (2004); Hsiao et al. thetions number for the of Cordic operations based is algorithm significantly [Jeong reduced. et al. Secondly, (2004)]. Q ∼ (5) with (1 + γi · Fi) = s with regard to the implementation, the original 1-D DCT can i −i be replaced easier by more efficient DCT algorithms. and γi = (0, 1, −1), Fi = 2 .

Adv. Radio Sci., 5, 305–311, 2007 www.adv-radio-sci.net/5/305/2007/ B. Heyne and J. G¨otze: Low-Power DCT 3

a Table 1. Parameters for the DCT based on [Jeong et al. (2004)]. 1 a2

π 3π 7π 3π b b Angle 4 8 16 16 1 2

Rotation iterations [σi, i] according to Eq. (4) a b 2 1 2 1 −1 0 −1 2 +1 0 −1 1 π / 4 , , , , b a 2 1 2 2 - −1, 3 +1, 1 −1, 3

3 - −1, 6 +1, 3 −1, 10 Fig. 3. Cordic rotation of the π/4 angle. 4 - −1, 7 +1, 10 −1, 14 3 Cordic based Loefﬂer DCT Compensation iterations [1+ γi · Fi] according to Eq. (5)

1 1 1 1 1 Based on our previous work about Cordic based FFTs 1 1 − 1+ + 1 − 4 32 2 8 8 [Heyne and Goetze (2004); Heyne et al. (2004)], we now − 1 1 1 1 propose an optimized Cordic based Loeffler DCT in this pa- 2 1 16 1+ 128 1+ 256 1+ 64 per. This implementation requires only 38 add and 16 shift 1 1 1 1 operations. We have taken the original Loeffler DCT as the 3 1+ 256 1+ 1024 1+ 4096 1+ 1024 starting point for our optimization, because the theoretical 1 1 1 4 1+ 512 1+ 4096 - 1+ 4096 lower bound of the number of multiplications required for the 1-D 8-point DCT had been proven to be 11 [Duhamel 1 5 1+ 4096 - - - and H’Mida (1987)]. In order to derive the proposed algorithm, we first con- sider the butterfly at the beginning of Loeffler’s flow graph B. Heyne and J. Gotze:¨ Low-Power DCT 307 as shown in Figure 1. In this case the butterfly can be expressed as: Table 1. Parameters for the DCT based on Jeong et al. (2004). 0 0 a 1 1 b π 2 1 2 4 = (6) 4 b2 11 · a1 − Angle π 3π 7π 3π 4 2 4 8 16 16 3π The first matrix can then be decomposed to 6 8 6 1 1 Rotation iterations [σi, i] according to Eq. (4) 7 1 1 1 √2 √2 7π = √2 1 1 (7) 16 11 · − "− √2 √2 # 1 −1, 0 −1, 2 +1, 0 −1, 1 5 3π 16 7 which equals a Cordic rotating the input values by π/4, fol- 2 – −1, 3 +1, 1 −1, 3 lowed by a scaling of √2 as shown in Figure 3. 5 3π The scaled butterflies with scaling factors 3π/8, 1π/16 3 – −1, 6 +1, 3 −1, 10 3 16 and 3π/16 can also be replaced by Cordics using θ =3π/8, 1π/16 and 3π/16 respectively. Hence, we can replace all − + − 7π 4 – 1, 7 1, 10 1, 14 16 butterflies in the Loeffler DCT to derive the pure Cordic 1 3 based Loeffler DCT as shown in Figure 4. Compensation iterations [1 + γi · Fi] according to Eq. (5) The most commonly used DCT-based CODECs for signal Fig. 2. Flow graph of an 8-point Cordic based DCT [Jeong et al. processing are usually followed by a quantizer. In this re- Fig. 2. Flow graph of an 8-point Cordic based DCT (Jeong et al., gard we can skip some Cordic iterations without losing visual 1 1− 1 1+ 1 1 + 1 1− 1 2004(2004)].). B. Heyne4 and J. Götze:32 Low-Power2 8 DCT 8 3 quality, and shift the compensation steps to the quantization − 1 + 1 + 1 + 1 table without using additional hardware. 2 1 16 1 128 1 256 1 64 a Next, we will start to optimize each rotation angle and re- Table 1. Parameters for the DCT based on [Jeong et al. (2004)]. 1 a2 + 1 + 1 + 1 + 1 duce the computational complexity. 3 1 256 1 1024 1 4096 1 1024 π 3π 7π 3π Although the Cordicb based DCT can reduceb the number of At first,due to the special structure of the Loeffler DCT, Angle 1 2 + 1 4+ 1 8 16 + 1 16 computations in image/video compression, it still needs more the scaling of √2 and the five needed compensation steps as 4 1 512 1 4096 – 1 4096 operations than the binDCT does [Tran (2000)]. shown in Table 1 can be performed at the end of the flow 1Rotation iterations [σi, i] according to Eq. (4) + a2 5 1 4096 ––– b 1 2 1 −1 0 −1 2 +1 0 −1 1 π / 4 , , , , b a 2 1 2 2 - −1, 3 +1, 1 −1, 3

3 - −1, 6 +1, 3 −1, 10 When using the Cordic to replace the multiplications of the 4 Fig.Fig. 3. Cordic 3. Cordic rotation rotation of of the theπ/π/4 angle. 8-point DCT the angles θx are fixed. Therefore, we can skip −1 7 +1 10 −1 14 some unnecessary4 iterations- without, losing accuracy., Table, 1 shows the detailed number of iterations and compensations 3 Cordic based Loeffler DCT Compensation iterations [1+ γi · Fi] according to Eq. (5) as: for the Cordic based algorithm (Jeong et al., 2004). Although 1 1 1 1 1 Based on our previous work about Cordic based FFTs the Cordic based1 DCT1 − can reduce1+ the number+ of computa-1 − a2 1 1 b1 4 32 2 8 8 [Heyne= and Goetze· (2004); Heyne et al. (2004)], we now(6) tions in image/video compression, it still needs more opera- b2 −1 1 a1 − 1 1 1 1 propose an optimized Cordic based Loeffler DCT in this pa- tions than the2 binDCT1 does16 (Tran1+, 2000128 ).1+ 256 1+ 64 Theper. first This matrix implementation can then be requires decomposed only 38 to add and 16 shift 1 1 1 1 operations. We have taken the original Loeffler DCT as the 3 1+ 256 1+ 1024 1+ 4096 1+ 1024 starting point√ for" √ our1 optimization,√1 # because the theoretical 4 1+ 1 1+ 1 - 1+ 1 1lower 1 bound of the2 number2 of multiplications required for 3 Cordic based Loeffler512 DCT 4096 4096 = 2 · 1 1 (7) −the1 1 1-D 8-point− DCT√ had√ been proven to be 11 [Duhamel 1 2 2 5 1+ 4096 - - - and H’Mida (1987)]. Based on our previous work about Cordic based FFTs whichIn equals order to a Cordic derive√ the rotating proposed the inputalgorithm, values we by firstπ/ con-4, fol- (Heyne and Goetze, 2004; Heyne et al., 2004), we now pro- lowedsider by the a scalingbutterfly of at the2 asbeginning shown inof Fig.Loeffler’s3. flow graph pose an optimized Cordic based Loeffler DCT in this paper. Theas shown scaled in butterfliesFigure 1. In with this scaling case the factors butterfly 3π/ can8, be 1 ex-π/16 This implementation requires only 38 add and 16 shift op- andpressed 3π/16 as: can also be replaced by Cordics using θ=3π/8, erations. We0 have taken the original Loeffler DCT as the 1π/16 and 3π/16 respectively. Hence, we can replace all 0 a2 1 1 b1 starting point for our optimization, because theπ theoretical 2 4 butterflies in the Loeffler= DCT to derive the pure Cordic(6) 4 b2 11 · a1 lower bound of the number of multiplications required for based Loeffler DCT as shown − in Fig. 4. 4 2 the 1-D 8-point DCT had been proven to be3 11π (Duhamel TheThe most first commonly matrix can thenused be DCT-based decomposed CODECs to for signal 8 and H’Mida6 , 1987). 6 processing are usually followed by a quantizer. In this re- 1 1 In order7 to derive the proposed algorithm, we first con-1 gard we can skip some1 1 Cordic iterations√2 √ without2 losing visual 7π = √2 1 1 (7) 16 11 · sider the butterfly at the beginning of Loeffler’s flow graph quality, and shift the− compensation "− √ steps2 √2 to# the quantization as shown in5 Fig. 1. In this case the butterfly can be expressed table without using additional hardware. 3π 16 7 which equals a Cordic rotating the input values by π/4, followed by a scaling of √2 as shown in Figure 3. 5 www.adv-radio-sci.net/5/305/2007/3π The scaled butterflies Adv. with scalingRadio Sci., factors 5, 305–3π/8311, 1π/, 200716 3 16 and 3π/16 can also be replaced by Cordics using θ =3π/8, 1π/16 and 3π/16 respectively. Hence, we can replace all 7π 16 butterflies in the Loeffler DCT to derive the pure Cordic 1 3 based Loeffler DCT as shown in Figure 4. The most commonly used DCT-based CODECs for signal Fig. 2. Flow graph of an 8-point Cordic based DCT [Jeong et al. processing are usually followed by a quantizer. In this re- (2004)]. gard we can skip some Cordic iterations without losing visual quality, and shift the compensation steps to the quantization table without using additional hardware. Next, we will start to optimize each rotation angle and reduce the computational complexity. Although the Cordic based DCT can reduce the number of At first,due to the special structure of the Loeffler DCT, computations in image/video compression, it still needs more the scaling of √2 and the five needed compensation steps as operations than the binDCT does [Tran (2000)]. shown in Table 1 can be performed at the end of the flow 4 B. Heyne and J. Götze: Low-Power DCT

0 0 Table 2. Cordic based Loefﬂer DCT - Cordic parameters. π / 4 π / 4 π / 4 7 4 3084 B. B. Heyne Heyne and and J. J.G¨otze: Gotze:¨ Low-Power Low-Power DCT DCT

1 2 π 3π π 3π π / 4 π / 4 3π /8 Angle 4 8 16 16 0 0 TableTable 2. 2.CordicCordic based based Loeffler Loeffler DCT DCT - Cordic – Cordic parameters. parameters. 6 6 π / 4 π / 4 π / 4 7 Rotation iterations [σi, i] according to Eq. (4) 4 2 5 π / 4 π /16 π / 4 5 1 1 −1, 0 −1, 0 −1, 3 −1, 1 2 π 3π π 3π 1 π / 4 π / 4 3π /8 AngleAngle 4π 83π 16π 163π π / 4 6 6 4 8 16 16 3 7 2 - −1, 1 −1, 4 −1, 3 π / 4 3π /16 π / 4 Rotation iterations [σi, i] according to Eq. (4) 2 5 4 3 π π π Rotation iterations [σi, i] according to Eq. (4) 3 / 4 - +1/16, 4 - / 4 - 5 1 1 −1, 0 −1, 0 −1, 3 −1, 1 π / 4 1 −1, 0 −1, 0 −1, 3 −1, 1 Fig. 4. Pure Cordic based Loeffler DCT architecture. 3 Compensation iterations [1+ γi · Fi] according to Eq. (5) 7 2 - −1, 1 −1, 4 −1, 3 π / 4 3π /16 π / 4 4 3 − − − − 1 2 – 1, 1 1, 4 1, 3 1 - - - 1 8 3 - +1, 4 - -

1 3 – +1, 4 - - 2 - - - 1+ graph for the angle θ = π/4. In other words, the π/4 rota- Fig. 4. Pure Cordic based Loefﬂer DCT architecture. 64 Compensation iterations [1+ γi · Fi] according to Eq. (5) tion only needs two add operations to carry out the Cordic Fig. 4. Pure Cordic based Loefﬂer DCT architecture. Compensation iterations [1 + γ · F ] according− 1 to Eq. (5) rotation. 1 - - - i i 1 8

Secondly, for the angle θ = 3π/8 we reduce the number 0 2 0 1+ 1 graph for the angle . In other words, the 4 rota- 2 - - - −64 1 of iterations to three and also shift all compensation steps θ = π/4 π/4 1 – – – 1 8 to the quantizer. Although the optimized 3π/8 rotation will tion1 only needs two add operations to carry out the2 Cordic4 rotation. 4 2 – – – 1+ 1 decrease the quality of the results, the influences are not no- 1 64 2 2 ticeable in video sequence streams or image compression. 3.1694 2 Secondly, for the angle θ = 3π/π8 we reduce the number 0 0 3 /8 4 of3 iterations to three and also shift all compensation1 steps6 Thirdly, when we take a close look at the angle θ = π/16, 3.1694 1 2 4 it can be easily observed that the needed compensation of to the quantizer. Although the optimized 3π/8 rotation will 4 4 2 1 π/ π/ the π/16 rotation is very close to one. Thus, we can ignore decrease the quality of the results, the influences are4 not no- between the following stages of the 16 and 3 1 16 rota- π /16 2 2 ticeable in video sequence streams or image compression.1 tions. However, we still can ignore some unnoticeable3.1694 iter- the compensation steps of the π/16 rotation. Therefore, it 5 5 π 2 3 /8 only needs two iterations in the Cordic calculation. Unfortu- ations3 and compensation steps to reduce the computational1 6 Thirdly, when we take a close look at the angle θ =1 π/16, 3.1694 6 3 complexity of the angle θ=3π/16. nately we can not shift any compensation steps of the 3π/16 it can be easily observed that the needed compensation2 of 3π /16 4 Table 2 shows the summary of Cordic iterations2 and1 com- rotation to the end of the graph, due to the data correlation the π/16 rotation is very close to one. Thus, we can2 ignore 4 7 7 π /16 between the following stages of the and rota- 4 pensation steps for the proposed Cordic based Loeffler1 DCT. π/16 3π/16 the compensation steps of the π/16 rotation. Therefore, it 5 5 2 tions. However, we still can ignore some unnoticeable iter- only needs two iterations in the Cordic calculation. Unfortu- In Fig. 5 the optimized flow graph is shown, including the 1 ations and compensation steps to reduce the computational nately we can not shift any compensation steps of the scaling6 factors incorporated into the quantization table.3 Fig. 5. Flow graph of an 8-point Cordic based Loeffler DCT3 archi-π/16 2 complexity of the angle θ =3π/16. Fig.rotation 5. Flow to graphthe end of ofan 8-pointthe graph, Cordic due based to the Loeffler data correlation DCT archi- It only requires3π /16 38 add and 16 shift operations to realize tecture. 7 2 7 Table 2 shows the summary of Cordic iterations and com- tecture.between the following stages of the π/16 and 3π/16 rota- the DCT transformation. In short, we try to ignore4 some un- pensation steps for the proposed Cordic based Loeffler DCT. tions. However, we still can ignore some unnoticeable iter- noticeable iterations and shift the compensation steps of each In Fig. 5 the optimized flow graph is shown, including the ations4 Experimental and compensation Results steps to reduce the computational angle to the quantizer to derive the optimized Cordic based Next, we will start to optimize each rotation angle and re- Fig. 5. Flow graph of an 8-point Cordic based Loeffler DCT archi- scaling factors incorporated into the quantization table. complexity of the angle θ =3π/16. Loeffler DCT. Moreover, the proposed DCT algorithm not duce the computational complexity. tecture. In our experiments we have used different criteria to eval- only reduces the computational complexity significantly, but It only requires 38 add and 16 shift operations to realize Table 2 shows the summary of Cordic iterations and com- uateAt first,due four architectures:√ to the special Loeffler structure DCT, of Cordic the Loeffler based DCT, DCT, also keeps the high transformation quality as well as the orig- the DCT transformation. In short, we try to ignore some un- pensation steps for the proposed Cordic based Loeffler DCT. thebinDCT-C5 scaling of and2 Cordic and the based five needed Loeffler compensation DCT. Table 3steps sum- as inal Loeffler DCT does. noticeable iterations and shift the compensation steps of each In Fig. 5 the optimized flow graph is shown, including the 4 Experimental Results shownmarizes in the Table number1 can of be operations performed of at each the end DCT of architec- the flow Therefore, the proposed DCT algorithm has the same com- angle to the quantizer to derive the optimized Cordic based graphscaling for factors the angle incorporatedθ=π/4. into In the other quantization words, the table.π/4 rota- ture. It can be easily observed from Table 3 that the pro- Inputational our experiments complexity we have as the used binDCT, different but criteria as shown to eval- in the Loeffler DCT. Moreover, the proposed DCT algorithm not tionIt only only needs requires two 38 add add operations and 16 shift tooperations carry out the to realize Cordic posed DCT reduces the computational complexity signifi- uatenext four Section architectures: it can perform Loeffler as wellDCT, as Cordic the Loeffler based DCT, DCT in only reduces the computational complexity significantly, but rotation.the DCT transformation. In short, we try to ignore some un- cantly compared to the original Loeffler DCT and Cordic binDCT-C5quality. and Cordic based Loeffler DCT. Table 3 sum- also keeps the high transformationquality as well as the orig- noticeableSecondly, iterations for the andangle shiftθ= the3π/ compensation8 we reduce steps the ofnumber each based DCT. marizes the number of operations of each DCT architec- inal Loeffler DCT does. ofangle iterations to the toquantizer three and to derive also shift the optimized all compensation Cordic based steps In order to demonstrate the quality features of the pro- ture. It can be easily observed from Table 3 that the pro- toLoeffler the quantizer. DCT. Moreover, Although the the proposed optimized DCT 3π/8 algorithm rotation notwill Therefore, the proposedDCT algorithm has the same com- posed DCT algorithm, we have also applied it to the video posed4 Experimental DCT reduces results the computational complexity signifi- only reduces the computational complexity significantly, but putational complexity as the binDCT, but as shown in the decreasecoding standardthe quality MPEG of the 4, results, by using the influences a publicly are available not no- cantly compared to the original Loeffler DCT and Cordic also keeps the high transformationquality as well as the orig- next Section it can perform as well as the Loeffler DCT in ticeableXVID inCODEC video sequence software [XIVstreams (2005)]. or image The compression. DCT in the basedIn our DCT. experiments we have used different criteria to eval- quality. inalCODEC Loeffler of the DCT selected does. XVID implementation is based= on Thirdly, when we take a close look at the angle θ π/16, uateIn order fourarchitectures: to demonstrate Loeffler the quality DCT, features Cordic of based the pro- DCT, it canTherefore, be easily the observed proposedDCT that the algorithm needed has compensation the same com- of posedbinDCT-C5 DCT algorithm, and Cordic we based have also Loeffler applied DCT. it to Table the video3 sum- theputationalπ/16 rotation complexity is very as close the binDCT, to one. butThus, as we shown can in ignore the codingmarizes standard the number MPEG of 4, operations by using of a publiclyeach DCT available architec- thenext compensation Section it can steps perform of the asπ/ well16 as rotation. the Loeffler Therefore, DCT in it XVIDture.CODEC It can be software easily observed [XIV (2005)]. from Table The3 DCTthat in the the pro- onlyquality. needs two iterations in the Cordic calculation. Unfortu- CODECposed DCT of the reduces selected the XVID computational implementation complexity is based signifi- on nately we can not shift any compensation steps of the 3π/16 cantly compared to the original Loeffler DCT and Cordic rotation to the end of the graph, due to the data correlation based DCT.

Adv. Radio Sci., 5, 305–311, 2007 www.adv-radio-sci.net/5/305/2007/ B. Heyne and J. G¨otze: Low-Power DCT 5

Table 3. Complexity of different DCT architectures. 4 60 3.5 50 3 Power 40 ``` 2.5 PDP ` Operation 2 30 ``` PDP `` Mult Add Shift 1.5 DCT type `` [mw] Power 20 1 0.5 10 Loefﬂer 11 29 0 0 0 -C5 Loeffler Cordic DCT binDCT Cordic [Jeong et al. (2004)] 0 104 82 Cordic Loeffler

Cordic Loefﬂer 0 38 16 Fig. 7. Experimental results of the Power and PDP. binDCT-C5 [Tran (1999)] 0 36 17 B. Heyne and J. Gotze:¨ Low-Power DCT 309 700 10000 9000 600 8000 500 7000 EDP Table 3. Complexity of different DCT architectures. 50 400 6000 EDDP 5000 EDP Loeffler DCT 300 4000 EDDP 200 3000 Cordic DCT 2000 ` 100 `` 1000 `` Operation 45 Cordic Loeffler ``` Mult Add Shift 0 0 DCT type `` binDCT-C5 C5 ` Loeffler c DCT - Cordi Cordic Loeffler binDCT ] Loefﬂer 11 29 0 40 PSNR [dB Cordic [Jeong et al. (2004)] 0 104 82 Fig. 8. Experimental results of the EDP and EDDP.

Cordic Loefﬂer 0 38 16 35 Synopsys PrimePower to estimate the power consumption at binDCT-C5 [Tran (1999)] 0 36 17 gate-level. The simulation results are shown in Table 4. Some im-

1 2 3 4 5 6 7 8 9 10 portant points can be observed easily. Firstly, the proposed Quantization Level DCT architecture only consumed 19% of the area and about 16% of the power of the original Loeffler DCT. Secondly, the In order to demonstrate the quality features of the pro- proposed DCT architecture occupied the same area as the posed DCT algorithm, we have also applied it to the video Fig. 6. The average PSNR of the “Foreman” sequence from low to binDCT-C5. However, it has only about 59% of the power B. Heyne and J. Götze: Low-Power DCT Fig. 6. The average PSNR of the “Foreman” sequence from low to5 coding standard MPEG 4, by using a publicly available highhigh compression compression ratio ratio (XVID (XVID CODEC). dissipation and half the delay time of the binDCT-C5. XVID CODEC software (XIV, 2005). The DCT in the To further display the low-power features of the proposed algorithm, we have analyzed the Power, Power-Delay Prod- TableCODEC 3. Complexity of the of selected different XVID DCT architectures.implementation is based on Loeffler’s4 factorization with floating-point multiplicati60 ons. uct (PDP), Energy-Delay Product (EDP) and Energy-Delay- Loeffler’s factorization with floating-point multiplications. 3.5 In this part we have applied each DCT algorithmto50 the XVID Delay Product (EDDP). The PDP is the average energy con- In this part we have applied each DCT algorithm to the XVID 3 Power software,2.5 and simulated with some well-known40 video se- ``` PDP sumed per DCT transformation. A lower PDP means that the software,` and simulatedOperation with some well-known video se- 2 30 ``` quences to show the ability of the proposed approach.PDP Figure power consumption is better translated into speed of each op- `` Mult Add Shift 1.5 quencesDCT type to show the ability`` of the proposed approach. Fig- [mw] Power 20 6 shows1 the average PSNR results of four DCT algorithms eration and the EDP represents that one can trade increased ure 6 shows the average PSNR results of four DCT algo- 10 from low0.5 to high compression ratio (i.e. quantization steps delay for lower energy of each operation. Finally, the EDDP rithmsLoeffler from low to high compression11 ratio29 (i.e. quantization0 from 1 to0 10) with the “Foreman” video sequence.0 represents whether or not it results in a voltage-invariant ef- C5 steps from 1 to 10) with the “Foreman” video sequence. It can beLoeffler seen from the PSNR simulation- results, that the Cordic DCTCordic Loeffler binDCT ficiency metric. CordicIt can be[Jeong seen et from al. (2004)] the PSNR simulation0 104 results,82 that the proposed algorithm performs as well as the Loeffler DCT Figure 7 illustrates the experimental results of the power proposed algorithm performs as well as the Loeffler DCT does. Also, the average PSNR is about 2 dB higher than that consumption and the PDP. As illustrated for the PDP the pro- does.Cordic Also, Loeffler the average PSNR is about0 238 dB higher16 than that Fig.of 7.theExperimental binDCT-C5. results In summary, of the Power the overall and PDP. performance of posed DCT algorithm only consumes about 10% of the PDP of the binDCT-C5. In summary, the overall performance of Fig.the 7. CordicExperimental based Loeffler results of DCT the Poweris very and similar PDP. to the Loeffler of the Loeffler DCT and 16% of the PDP of the Cordic based thebinDCT-C5 Cordic based [Tran Loeffler (1999)] DCT is very0 similar36 to17 the Loeffler DCT in terms of video quality. However, it only requires the DCT respectively. It also reduces the PDP by about 59% DCT in terms of video quality. However, it only requires the erationsame computational and the EDP complexityrepresents that as the one binDCT-C5 can trade does. increased compared to the binDCT-C5. Figure 8 shows the EDP and 700 10000 delayTo for analyze lower the energy performance of each operation. and the energy Finally,9000 consumption the EDDP the EDDP. As shown, the performance of the proposed DCT same computational complexity as the binDCT-C5 does. 600 of the proposed Cordic based Loeffler DCT,8000 we have mod- algorithm is far superior to the other DCT algorithms, espe- represents500 whether or not it results in a voltage-invariant7000 EDP ef- 50 To analyze the performance and the energy consumption ficiencyeled the400 metric. four different DCT architectures as6000 RTL. AfterEDDP syn- cially in the EDDP. of the proposed Cordic based Loeffler DCT, we have mod- 5000 EDP eled the four different DCT architecturesLoeffler as RTL. DCT After syn- thesizingFigure3007 withillustrates Synopsys the experimental Design Compiler, results4000 weEDDP of have the power used Hence, the proposed DCT not only reduces the compu- 200 3000 Cordic DCT consumption and the PDP. As illustrated for2000 the PDP the pro- thesizing with Synopsys Design Compiler, we have used 100 Cordic Loeffler posed DCT algorithm only consumes about1000 10% of the PDP Synopsys45 PrimePower to estimate the power consumption at 0 0 binDCT-C5 of the Loeffler DCT and 16% of theC PDP5 of the Cordic based gate-level. Loeffler c DCT - Cordi Cordic Loeffler binDCT ] The simulation results are shown in Table 4. Some im- DCT respectively. It also reduces the PDP by about 59% portant points can be observed easily. Firstly, the proposed compared to the binDCT-C5. Figure 8 shows the EDP and 40 DCT architecture only consumed 19% of the area and about the EDDP. As shown, the performance of the proposed DCT PSNR [dB 16% of the power of the original Loeffler DCT. Secondly, the Fig.algorithm 8. Experimental is far superior results toof the EDPother and DCT EDDP. algorithms, espe- proposed DCT architecture occupied the same area as the cially in the EDDP. Hence, the proposed DCT not only reduces the compu- binDCT-C5.35 However, it has only about 59% of the power dissipation and half the delay time of the binDCT-C5. Synopsystational complexity PrimePower significantly, to estimate but the also power achieves consumption the best at performance in all criteria. To further display the low-power features of the proposed gate-level. algorithm, we have analyzed the Power, Power-Delay Prod- The simulation results are shown in Table 4. Some important points can be observed easily. Firstly, the proposed uct1 (PDP),2 Energy-Delay3 4 Product5 6 (EDP)7 and8 Energy-Delay-9 10 5 Conclusions Delay Product (EDDP).Quantiza Thetio PDPn Level is the average energy con- DCT architecture only consumed 19% of the area and about sumed per DCT transformation. A lower PDP means that the 16%In this of the paper power a low of complexity the original and Loeffler high qualityDCT. Secondly, DCT trans- the power consumption is better translated into speed of each op- proposedformation DCT based architecture on the Cordic occupied algorithm the is same presented. area as The the Fig. 6. The average PSNR of the “Foreman” sequence from low to binDCT-C5. However, it has only about 59% of the power high compression ratio (XVID CODEC). dissipation and half the delay time of the binDCT-C5. www.adv-radio-sci.net/5/305/2007/ Adv. Radio Sci., 5, 305–311, 2007 To further display the low-power features of the proposed algorithm, we have analyzed the Power, Power-Delay Prod- Loeffler’s factorization with floating-point multiplications. uct (PDP), Energy-Delay Product (EDP) and Energy-Delay- In this part we have applied each DCT algorithmto the XVID Delay Product (EDDP). The PDP is the average energy con- software, and simulated with some well-known video se- sumed per DCT transformation. A lower PDP means that the quences to show the ability of the proposed approach. Figure power consumption is better translated into speed of each op- 6 shows the average PSNR results of four DCT algorithms eration and the EDP represents that one can trade increased from low to high compression ratio (i.e. quantization steps delay for lower energy of each operation. Finally, the EDDP from 1 to 10) with the “Foreman” video sequence. represents whether or not it results in a voltage-invariant ef- It can be seen from the PSNR simulation results, that the ficiency metric. proposed algorithm performs as well as the Loeffler DCT Figure 7 illustrates the experimental results of the power does. Also, the average PSNR is about 2 dB higher than that consumption and the PDP. As illustrated for the PDP the pro- of the binDCT-C5. In summary, the overall performance of posed DCT algorithm only consumes about 10% of the PDP the Cordic based Loeffler DCT is very similar to the Loeffler of the Loeffler DCT and 16% of the PDP of the Cordic based DCT in terms of video quality. However, it only requires the DCT respectively. It also reduces the PDP by about 59% same computational complexity as the binDCT-C5 does. compared to the binDCT-C5. Figure 8 shows the EDP and To analyze the performance and the energy consumption the EDDP. As shown, the performance of the proposed DCT of the proposed Cordic based Loeffler DCT, we have mod- algorithm is far superior to the other DCT algorithms, espe- eled the four different DCT architectures as RTL. After syn- cially in the EDDP. thesizing with Synopsys Design Compiler, we have used Hence, the proposed DCT not only reduces the compu- 310 B. Heyne and J. Gotze:¨ Low-Power DCT

B. Heyne and J. G¨otze: Low-Power DCT Table 4. Power, Area and Time delay simulation results at gate-level.5

Table 3. Complexity of different DCT architectures. ``` 4 ``` DCT Arch. 60 3.5 `` Loeffler Cordic based (Jeong et al., 2004) Cordic based Loeffler binDCT-C5 (Tran, 1999) Measures ``` 50 3 Power 40 ``` 2.5 PDP ` Operation 2 30 ``` Power (mW) 3.557PDP 1.954 0.5616 0.9604 `` Mult Add Shift 1.5 DCT type `` [mw] Power 20 1 Area0.5 (GateCount) 15.06 K10 6.66 K 2.81 K 2.83 K Loeffler 11 29 0 0 0 Delay (ns) 13.49-C5 15.08 8.37 12.17 Loeffler Cordic DCT binDCT Cordic [Jeong et al. (2004)] 0 104 82 Cordic Loeffler

Cordic Loefﬂer 0 38 16 TSMC 0.13-µm at 1.2 V without pipelining. Fig. 7. Experimental results of the Power and PDP. binDCT-C5 [Tran (1999)] 0 36 17

700 10000 on Electronics, Circuits and Systems, 541–544, 2002. 9000 600 8000 Goetze, J. and Hekstra, G.: An Algorithm and Architecture Based 500 7000 EDP 50 on Orthonormal Micro-Rotations for Computing the Symmetric 400 6000 EDDP 5000 EVD, in: Integration – The VLSI Journal, 20, 21–39, 1995. EDP Loeffler DCT 300 4000 EDDP Heyne, B. and Goetze, J.: A Pure Cordic Based FFT for Reconfig- 200 3000 Cordic DCT 2000 urable Digital Signal Processing, in: 12th European Signal Pro- 100 Cordic Loeffler 1000 45 0 0 cessing Conference, 2004. binDCT-C5 C5 Heyne, B., Buecker, M., and Goetze, J.: Implementation of a Cordic Loeffler c DCT - Cordi Cordic Loeffler binDCT Based FFT on a Reconfigurable Hardware Accelerator, in: 3rd ] Karlsruhe Workshop on Software Radios, 2004. 40 Hsiao, S., Hu, Y., Juang, T., and Lee, C.: Efficient VLSI Imple-

PSNR [dB Fig. 8. Experimental results of the EDP and EDDP. Fig. 8. Experimental results of the EDP and EDDP. mentations of Fast Multiplierless Approximated DCT Using Pa- rameterized Hardware Modules for Silicon Intellectual Property Design, 52, 1568–1579, 2005. 35 proposed Cordic based Loeffler DCT architecture only re- Jeong, H., Kim, J., and Cho, W. K.: Low-Power Multiplierless DCT Synopsys PrimePower to estimate the power consumption at quires 38 add and 16 shift operations to carry out the DCT Architecture Using Image Correlation, 50, 262–267, 2004. gate-level. transformation, which is about the same complexity as the Li, J. and Lu, S. L.: Low Power Design of Two-Dimensional DCT, The simulation results are shown in Table 4. Some im- binDCT-C5’s. The proposed algorithm not only reduces the in: IEEE Conf. on ASIC and Exhibit, 309–312, 1996. portant points can be observed easily. Firstly, the proposed Loeffler, C., Lightenberg, A., and Moschytz, G. S.: Practical Fast 1- 1 2 3 4 5 6 7 8 9 10 computational complexity significantly compared to the orig- Quantization Level DCT architecture only consumed 19% of the area and about D DCT Algorithms With 11-Multiplications, in: Proc. ICASSP, inal Loeffler DCT, it also keeps the good quality transforma- 2, 988–991, Glasgow,UK, 1989. 16%tion of result. the power In this of regard,the original the proposed Loeffler DCT. DCT Secondly, algorithm the is proposed DCT architecture occupied the same area as the Mariatos, E., Metafas, D., Hallas, J., and Goutis, C.: A Fast very suitable for low-power and high quality CODECs. DCT Processor, Based on Special Purpose CORDIC Rotators, Fig. 6. The average PSNR of the “Foreman” sequence from low to binDCT-C5. However, it has only about 59% of the power high compression ratio (XVID CODEC). in: IEEE International Symposium on Circuits and Systems, 4, dissipation and half the delay time of the binDCT-C5. 271–274, 1994. ReferencesTo further display the low-power features of the proposed Richardson, I. E. G.: Video Codec Design, John Wiley & Sons Ltd, algorithm, we have analyzed the Power, Power-Delay Prod- Atrium, England, 2002. Loeffler’s factorization with floating-point multiplications. uctThe (PDP), JPEG-6b Energy-Delay Website, http://www.ijg.org/, Product (EDP) 1998. and Energy-Delay- Shams, A., Pan, W., Chidanandan, A., and Bayoumi, M. A.: A In this part we have applied each DCT algorithmto the XVID DelayThe XVID Product Website, (EDDP). http://www.xvid.org/, The PDP is the 2005. average energy con- Low-Power High Performance Distributed DCT Architecture, in: software, and simulated with some well-known video se- sumedAugust, per N. DCTJ. and transformation. Ha, D. S.: Low Power A lower Design PDP of DCT means and that IDCT the IEEE Computer Society Annual Symposium on VLSI, 21–27, quences to show the ability of the proposed approach. Figure powerfor Low consumption Bit Rate Video is better Codecs, translated IEEE Transactions into speed on of Multime- each op- 2002. dia, 6, 414–422, 2004. Sung, C. C., Ruan, S. J., Lin, B. Y., and Shie, M. C.: Quality and 6 shows the average PSNR results of four DCT algorithms eration and the EDP represents that one can trade increased from low to high compression ratio (i.e. quantization steps Chen, W. H., Smith, C., and Fralick, S.: A Fast Computational Power Effcient Architecture for the Discrete Cosine Transform, delayAlgorithm for lower for energy the Discrete of each Cosine operation. Transform, Finally, 25, 1004–1009, the EDDP IEICE Transactions on Fundamentals Special Section on VLSI from 1 to 10) with the “Foreman” video sequence. represents1977. whether or not it results in a voltage-invariant ef- Design and CAD Algorithms, 2005. It can be seen from the PSNR simulation results, that the ficiencyConzalez, metric. R. C. and Woods, R. E.: Digial Image Processing, Tran, T. D.: A Fast Multiplierless Block Transform for Image and proposed algorithm performs as well as the Loeffler DCT FigurePrentice-Hall, 7 illustrates Inc., New the Jersey, experimental 2001. results of the power Video Compression, in: International Conf. on Image Process- does. Also, the average PSNR is about 2 dB higher than that consumptionDuhamel, P. and and H’Mida, the PDP. H.: As New illustrated 2n DCT for Algorithms the PDP Suitable the pro- ing, 822–826, 1999. of the binDCT-C5. In summary, the overall performance of posedfor VLSIDCT Implementation, algorithm only in: consumes IEEE International about 10% Conference of the PDP on Tran, T. D.: The BinDCT: Fast Multiplierless Approximation of the the Cordic based Loeffler DCT is very similar to the Loeffler of theICASSP, Loeffler 12, DCT1805–1808, and 16% 1987. of the PDP of the Cordic based DCT, 7, 141–144, 2000. Fanucci, L. and Saponara, S.: Data Driven VLSI Computation For Volder, J.: The CORDIC Trigonometric Computing Technique, IRE DCT in terms of video quality. However, it only requires the DCT respectively. It also reduces the PDP by about 59% Low-Power DCT-Based Video Coding, in: International Conf. Trans. Electron. Comput., EC-8, 330–334, 1959. same computational complexity as the binDCT-C5 does. compared to the binDCT-C5. Figure 8 shows the EDP and To analyze the performance and the energy consumption the EDDP. As shown, the performance of the proposed DCT of the proposed Cordic based Loeffler DCT, we have mod- algorithmAdv. Radio is Sci.,far superior 5, 305–311 to the, 2007 other DCT algorithms, espe- www.adv-radio-sci.net/5/305/2007/ eled the four different DCT architectures as RTL. After syn- cially in the EDDP. thesizing with Synopsys Design Compiler, we have used Hence, the proposed DCT not only reduces the compu- B. Heyne and J. Gotze:¨ Low-Power DCT 311

Walther, J.: A Uniﬁed Algorithm for Elementary Functions, in: Wang, Z.: Fast Algorithms for the Discrete W Transform and for Proc. Spring Joint Comput. Conf., 38, 379–385, 1971. the Discrete Fourier Transform, 32, 803–816, 1984.

www.adv-radio-sci.net/5/305/2007/ Adv. Radio Sci., 5, 305–311, 2007