LOW-COMPLEXITY ADAPTIVE BLOCK-SIZE TRANSFORM BASED ON EXTENDED TRANSFORMS

Honggang Qi1,Wen Gao 1, Siwei Ma2, Debin Zhao 1 and Xiangyang Ji1

1Institute of Computing Technology, Chinese Academy of Sciences E-mail: {hgqi, wgao, dbzhao, xyji}@jdl.ac.cn 2Department of Electrical Engineering and Integrated Medias Center University of Southern California E-mail: [email protected]

ABSTRACT small transform prevents more ringing artifacts in coded pictures [2]. To exert the virtues of different transform In this paper, a low-complexity 8x8/4x4 Adaptive Block- sizes greatly, Adaptive Block-size Transform (ABT) size Transform (ABT) scheme is tentatively proposed for scheme is employed in image and coding standards. the Chinese Audio and Video coding Standard (AVS)1. The scheme can improve the coding performance In the proposed ABT scheme, an integer 8x8 transform is significantly. For example, the average luma PSNR gain derived from the integer 4x4 transform used in AVS in high-definition sequences is up to 0.5dB in H.264/AVC. according to a transform extension principle. The 8x8 In ABT scheme, the transforms in different sizes are transform not only has high energy compacted property adaptively chosen by encoder. Usually, RD decision is but also can be merely implemented within several employed in encoder for choosing the better transform [2]. additions and shifts, and all intermediate results are With RD decision, the optimal coding performance can be limited with 16-bit. The 8x8 transform and 4x4 transform achieved in RD sense. Since two transform sizes are can be merged together and share the same scale matrix so employed in ABT scheme, two zig-zag scans are needed that the hardware units and storage resources are for corresponding transforms. Moreover, the RD decision efficiently saved for both encoder and decoder. The requires that the current block is coded twice with the two experimental results on numerous sequences show that the transforms. It is clear that the ABT scheme introduces proposed ABT scheme can achieves significant high complexity to encoder. However, in view of its high performance improvement for AVS. coding performance, the increased complexity is tolerable in some applications. Thus, ABT scheme is adopted into 1. INTRODUCTION Fidelity Range Extensions of H.264/AVC by Joint Video Team (JVT). The Discrete Cosine Transform (DCT) is an effective In this paper, the ABT scheme with extended tool for image and video coding. In transforms is proposed for Chinese Audio and Video hybrid video coding, it is used to convert the correlated coding Standard (AVS). In this scheme, the extended 8x8 prediction residuals to seldom correlated frequency transform is derived from the original 4x4 transform in domain coefficients for more efficient . AVS. The extended 8x8 transform not only improves the Different coding standards employ different transforms. coding performance of ABT but also reduces the memory MPGE-2 adopts an 8x8 floating-point DCT, which is a requirements and hardware units for both encoder and real DCT with fine compression performance but high decoder. Since the computation and storage complexities complexity. H.264/AVC adopts a 4x4 integer DCT [1], of ABT mainly focus on the encoder and only some which has similar properties to real 4x4 DCT but relative storage complexities are increased in the decoder, the low complexity. Since the inverse DCT in the encoder and proposed method can reduce the complexities of decoder the decoder are defined by exact integer operations, the efficiently. Thus, the method is meaningful to some problem of mismatches is avoided. applications where the decoders are paid more attention The different transform sizes possess different than encoder. coding performances. Larger transform has a better The rest of this paper is organized as follows: In the energy compaction than smaller transform. However, the section 2, the concept of extended transforms is introduced, and an 8x8 transform which is an extension of 1 4x4 transform of AVS is designed. In section 3, the The Chinese audio and video coding standard, an emerging standard targeting at higher coding efficiency and lower complexity. implementation of ABT in AVS is briefly specified. In

1­4244­0367­7/06/$20.00 ©2006 IEEE 129 ICME 2006 section 4, the experimental results on test sequences with A low-complexity integer 4x4 transform is employed in different coding characteristics are shown. Finally, the AVS-M (a version of AVS orient to mobile applications), conclusions are made by us. The transform is expressed as follows:

2. THE EXTENDED TRANSFORMS OF ABT ª 2 2 2 2º « 3 1 1  3» Extended transforms means that the NxN transform as a C4 « » (3) part involves in the 2Nx2N transform [4]. Taking 4x4 and « 2  2  2 2» « » 8x8 transform as an example, if the 4x4 transform is ¬ 1  3 3 1¼ expressed with (1), its extended 8x8 transform should be expressed as (2). The transform has a good energy compaction and low complexity. It can be implemented with merely several ª c0 c0 c0 c0 º addition and shift operations. In practice, the C4 is just a « c c  c  c » core matrix of DCT. The norms of its some rows are D4 « 2 6 6 2 » (1) « c4  c4  c4 c4 » unequal. A scale matrix is needed for normalizing the core « » matrix. As the output of the transform and scale of the ¬ c6  c2 c2  c6 ¼ input X, Y is expressed as

ª c c c c c c c c º 0 0 0 0 0 0 0 0 Y (C4u X uC4c) S4 (4) « c c c c c c c c » « 1 3 5 7  7  5  3  1» « c2 c6  c6  c2  c2  c6 c6 c2 » « » where the notation denotes element-by-element c  c  c  c c c c  c (2) D8 « 3 7 1 5 5 1 7 3 » multiplication, and S4 means the 4x4 scale matrix. The « c  c  c c c  c  c c » « 4 4 4 4 4 4 4 4 » scale matrix S4 can be calculated by « c5  c1 c7 c3  c3  c7 c1  c5 » « c c c c c c c c » 6  2 2  6  6 2  2 6 3 3 « » 2 2 ¬« c7  c5 c3  c1 c1  c3 c5  c7 ¼» S4x,y 1 ( ¦ C4x,j )u( ¦C4ic,y ) (5) j 0 i 0 It can be found from (1) and (2) that the even basis functions of 8x8 transform D8 inherit all basis functions 2.2. 8x8 transform design of 4x4 transform D4. i.e. D82i,j= D4i,j, (0İi, j<4). The relation can be more clearly expressed from their butterfly According to the requirement of performance, the integer flow graph in Fig.1, where the two 4x4 transforms D4II 8x8 transform should preserve the real DCT compaction and D4IV mean the II-type and IV-type DCT respectively property. Its basis functions of transform matrix should [3]. If D4II is identical to D4, it is said that D8 is extended approximate the real transform’s. Meanwhile, these basis from D4. functions should not be so large that the transform can be implemented within narrow bit range. Moreover, the integer 8x8 transform should be an extension of the integer 4x4 transform C4 so that the 8x8 transform can be shared by the 4x4/8x8 transforms. Thus, the integer 8x8 II D4x4 transform should meet the following three design rules:

1) The basis functions of each row of the designed 8x8 transform should correlate closely with corresponding floating-point basis functions of real 8x8 transform where the correlation is measured by the cosine of the included angle between them; D IV 4x4 2) The basis functions of the designed 8x8 transform matrix should be represented by several simple integers; 3) The designed 8x8 transform must be the extension of Fig.1. Fast DCT butterfly signal flow graph. 4x4 transform of AVS i.e. meeting the relation:

C82x,y=C4x,y, 0İx, y<4. 2.1. 4x4 transform

130 According to the above rules, an integer 8x8 DCT is prediction mode. Moreover, numorous experimental expressed as follows: results show that ABT applying to chroma signal does not bring the significant coding gains. Thus, from the view ª 4 4 4 4 4 4 4 4º point of complexityˈ ABT scheme is not used in chroma « 6 6 3 2 2 3 6 6» signals. Unlike the conventional schemes where all 8x8, «     » « 6 2  2  6  6  2 2 6» 8x4, 4x8 and 4x4 transforms are used, no more transforms « » are used in proposed scheme except for 8x8 and 4x4 6  2  6  3 3 6 2  6 C8 « » / 2 (6) transforms. Because the 8x4 and 4x8 transforms just « 4  4  4 4 4  4  4 4» « » obtain small PSNR gains while introducing large coding « 3  6 2 6  6  2 6  3» loads to the . « 2  6 6  2  2 6  6 2» RD optimizer is used to choose the transform size « » with the better coding performance for current coded « 2  3 6  6 6  6 3  2» ¬ ¼ block. Since the RD optimizer just works in the encoder, a one-bit flag is added into bitstream to signal the decoder where C8 is an integer orthogonal matrix. Its even basis which transform is used to coding the current block. For functions and those of integer 4x4 transform matrix C4 further reducing the complexity, the 4x4 and 8x8 zig-zag meet the relation, C82x,y=C4x,y. Thus, C8 is an extended scans, entropy codings and deblocking tools are also transform of C4. The odd basis functions of C4 consists of merged with some optimization algorithms. For example, integers (6, 6, 3, 2), which are so simple that dynamic the 8x8 quantized transform coefficients are split into four range is only increased by 5.1 bits (log234). The cosine of 4x4 coefficients such that the original 4x4 zig-zag scan the included angle between the integer array (6, 6, 3, 2) tool can be applied to the four 4x4 coefficients and the and the float array of real DCT (0.4904, 0.4157, 0.2778, 4x4 entropy coding tool is still workable in the four 4x4 0.0975) is 0.9834 very near to the upper limit 1. Thus, it coefficients. has the similar compacting property to the real DCT. Similarly to the normalization of C4, a scale matrix S8 for 4. EXPERIMENTAL RESULTS normalization of C8 is needed. The scale matrix S8 is expressed as In the proposed ABT scheme, only the 8x8 forward/inverse transform units are implemented in codec. 7 7 2 2 The 4x4 forward/inverse transform operations can be S8x,y 1/ ( ¦ C8x,j )u( ¦C8ic,y ) (7) j 0 i 0 done by the 4x4 forward/inverse transform units involved in the 8x8 forward/inverse transform units. Thus, the 4x4 forward/inverse transform units of both encoder and since C8 =C4 , 0 x, y<4, thus 2x,y x,y İ decoder are saved. Like H.264, the scales matrix of core transform matrix is usually merged into the quantization 3 3 2 2 [5] in AVS for reducing the computational complexity S4x,y 1/ ( ¦ C4x,j )u( ¦C4ic,y ) j 0 i 0 and memory cost. The merging process is derived from: (8) 7 7 2 2 c 1/ ( ¦ C82x,j /2)u( ¦C8ic,2 y /2) S82x,2 y / 2 Y (C u X u C ) E // Q(qp) j 0 i 0 (9) (C u X u Cc) (E // Q(qp)) (C u X u Cc) Qˆ (qp) The division by 2 in above formula can is implemented where the notation // denotes element-by-element division with just one-bit right-shift operation in hardware. The and Q means the quantization step, which is a vector relation is useful for us to merge the 4x4 scale matrix into function of the quantization parameter qp. In formula (9), the 8x8 scale matrix. The normalizations of 4x4 and 8x8 the calculation of scale matrix and quantization vector transforms can be done with only an 8x8 scale matrix. E//Q(qp) is merged into a scale and quantization (or The above analysis on forward transforms can be also ˆ analogically done on inverse transforms. inverse quantization) matrix Q(qp) . Since the scale matrixes of the 4x4/8x8 extended transforms have such 3. THE IMPLEMENTATION OF ABT SCHEME relation as (8), only the Qˆ (qp) of 8x8 transform is stored and that of 4x4 transform is saved. Since there are no 4x8 The ABT scheme is implemented into AVS-M reference ˆ software. Since there is only intra 4x4 prediction mode in and 8x4 transforms, their Q(qp) are be saved too. As a AVS-M and the 4x4 transform usually has better result, the memory of (4x4+8x4+4x8)x8x16x2 = 20480 performance than 8x8 transform in this mode, the bits are saved (the quantization step period is 8 in AVS proposed scheme is just applied to the luma inter-

131 and the data type of scale and quantization matrixes is 16- [2] Wien, M., ”Variable block-size transforms for H.264/AVC, bit). “ Special Issue on the H.264/AVC video coding standard,” July, The ABT scheme is implemented into reference 2003. software AVS-M R0. The comparisons are conducted [3] W. Chen, C. H. Smith and S. C. Fralick, “A Fast between codec with and without ABT. Numerous Computational Algorithm for the Discrete Cosine Transform”, sequences in QCIF and CIF formats are tested. All IEEE Transactions on Communications, vol. com-25, No. 9, pp. sequences are coded with a fixed Quantization Parameter 1004-1009, September 1977. (QP) in 20, 26, 32 and 38. The RD optimization and [4] Jie Dong, Jian Lou, etc, “A new approach to compatible loopfilter are enabled, two frames are referenced and 2D- adaptive block-size transforms” VCIP2005, Beijing, Jan, 2005. VLC is used in entropy coding. There is no B-picture in [5] Lousi Kerofsky, Shawmin Lei, “Reduced bit-depth AVS-M for the consideration of complexity. Thus, the quantization”, ITU-T SG16 DOC. VCEG-N20, Aug, 2001. ABT scheme is applied to the P-pictures only. For fully [6] G. Bjontegaard, “Calculation of average PSNR differences demonstrating ABT’s performances, only the first picture between RD-Curves”, Doc. VCEG-M33, Mar. 2001. structure is set to I-picture and others are P-pictures. 200 frames of each sequence are coded. The results of coastguard(cif 30Hz) comparison are listed in Table 1. The average luma PSNR 43 gains and average bitrate savings of proposed ABT in 42 41 various sequences are calculated using the method 40 proposed in [6]. The average luma PSNR gains are from 39 0.19 to 0.69dB and the average bitrate savings are from 38 37 3.59% to 12.09%. The results imply that with the increase 36

of picture sizes, the more PSNR gains are achieved. The PSNRY(dB) 35 RD curves of sequence coastguard in CIF and QCIF 34 33 no ABT formats are also shown in Fig. 2. It is observed that the 32 significant performance improvements, especially at high 31 ABT bitrates, are achieved in sequence coastguard using the 30 300 1300 2300 3300 4300 proposed ABT scheme. Rate(Kbits/s)

coastguard (qcif 30Hz) 43 4. CONCLUSIONS 42 41 40 In this paper, an ABT scheme based on 8x8/4x4 extended 39 transforms for AVS is proposed. The 8x8 transform in the 38 37 proposed scheme is extended from the 4x4 transform of 36 AVS-M, which has similar decorrelation ability with the 35 PSNRY(dB) 34 real DCT while low complexity. The advantages of 33 extended transforms used in ABT are that not only some 32 no ABT 31 ABT transform units are saved in hardware but also smaller 30 memory is required. Compared with general ABT 29 0 200 400 600 800 1000 schemes, the proposed ABT scheme based on extended Rate(Kbits/s) transforms has the lower hardware complexities and maintain the high performance. Fig.2. The RD curves for typical test sequence coastguard in QCIF and CIF formats. 5. ACKNOWLEDGMENTS Table 1. Comparisons of sequences coded with and This work is partially supported by the key project of without ABT scheme. National Natural Science Foundation of China under grant Size CIF(30Hz) QCIF(30Hz) QCIF(15Hz) No. 60333020 and Natural Science Foundation of Beijing PSNRY Bitrate PSNRY Bitrate PSNRY Bitrate Sequance gains savings gains savings gains savings under grant No. 4041003. tempete 0.28dB 5.15% 0.36dB 5.87% 0.35dB 5.98% mobile 0.40dB 6.48% 0.32dB 5.14% 0.27dB 4.30% 6. REFERENCES foreman 0.23dB 5.11% 0.22dB 4.59% 0.20dB 4.36% [1] Wiegand, T.; Sullivan, G.J.; Bjntegaard, G.; Luthra, A. football 0.41dB 6.47% 0.27dB 3.98% 0.22dB 3.16% “Overview of the H.264/AVC video coding standard,” IEEE coastguard 0.69dB 12.09% 0.44dB 8.45% 0.44dB 8.50% Trans. on CSVT. Special Issue on the H.264/AVC video coding bus 0.49dB 8.13% 0.33dB 5.03% 0.27dB 4.32% standard, July, 2003. silent 0.19dB 3.77% 0.22dB 3.59% 0.20dB 3.33%

132