Hardware Implementation of HEVC Inverse Transform in 45nm CMOS

IEEE Latin American Symposium on Circuits and Systems (LASCAS) 2020; San José, Costa Rica; February 2020

Richard Calusdian and Aaron Stillmaker
Electrical and Computer Engineering Department, California State University, Fresno, CA, USA
[email protected], [email protected]

978-1-7281-3427-7/20/$31.00 ©2020 IEEE

Abstract: The High Efficiency Video Coding (HEVC) standard relies on the use of the inverse discrete cosine transform (IDCT) to perform video decompression. HEVC has increased the complexity of the decoder, and the inverse transform lends itself well to hardware acceleration due to repeated addition and multiplication on unit blocks. A hardware implementation of the inverse quantization and inverse transform, compliant with the HEVC standard, is presented. The design targets the 4 × 4 inverse quantization and transform, and is taken through a synthesis and place & route flow using the Nangate FreePDK45 Open Cell Library. The operational frequency of the presented design supports 4K video at up to 30 frames/sec. The core area of the presented design is 14 664 µm² and it can operate at a maximum frequency of 367 MHz.

Index Terms: HEVC, Inverse Discrete Cosine Transform, IDCT

I. INTRODUCTION

The same group that released the H.264 video compression standard [1] has also developed a follow-up, the High Efficiency Video Coding (HEVC) standard, also known as the H.265/MPEG-HEVC standard or simply H.265 [2]. This latest standard builds on the framework of H.264 and expands and extends its tools and features. The standard improves upon the existing coding efficiency while specifically addressing the needs created by the proliferation of High Definition (HD) and Ultra High Definition (UHD) video, and it adds support for parallel architectures [3].

In order to process a video sequence, each frame of the sequence is partitioned into non-overlapping square blocks of pixels. These square blocks can then be sub-divided into smaller coding blocks for eventual spatial or temporal coding. The encoder algorithm will encode large blocks where possible to reduce the bit rate and use small blocks only where needed to retain detail. The H.264 standard calls its primary block a macroblock, which is 16 × 16 pixels, whereas HEVC uses the term coding tree unit (CTU), which can be as small as 16 × 16 and as large as 64 × 64. An advantage of the larger block sizes of 32 × 32 or 64 × 64 is that, for large flat areas, the HEVC standard can offer better efficiency than the smaller 16 × 16 macroblock of H.264 [3].

The next step is to subtract the predicted block from the current block, producing a residual, or difference, signal. It is this residual signal which is then transformed into the frequency domain using an integer approximation to the discrete cosine transform (DCT). The transform coefficients are then quantized according to a particular mode to eliminate high frequency data. The resulting data is then entropy encoded and transmitted for storage or to a decoder, which performs the inverse operations to reconstruct the video. The HEVC decoder uses the same steps as the encoder, but in reverse, to undo the encoding. The decoder undoes quantization using the inverse quantization and undoes the forward transformation using the inverse discrete cosine transformation (IDCT).

A. Transform

HEVC defines a 2D transform for sizes 4 × 4, 8 × 8, 16 × 16, and 32 × 32. The transform is used to change the representation of the residual signal from the spatial domain into the frequency domain. In HEVC this is accomplished by using a finite-precision approximation to the DCT. HEVC explicitly defines the matrix values of the DCT as integer values to make the math more amenable to digital systems and to produce consistent results. The integer values also avoid encoder/decoder mismatches due to differing precision in the representation of the DCT matrix values. The equations for the HEVC DCT and IDCT for an input residual block U and transform matrix D are shown in (1) and (2).

$$\mathrm{DCT}(U) = D \times U \times D^{T} = Y \tag{1}$$

$$\mathrm{IDCT}(Y) = D^{T} \times Y \times D = U \tag{2}$$

In an actual implementation the inverse transform is computed by separating the equation in (2) into two one-dimensional transformations in succession. Once the DCT is performed, the coefficients of the resulting matrix represent increasing frequency components of the image, starting with a DC coefficient in the upper-left corner and increasing as we move down and to the right, with the highest frequency component in the lower-right corner.

Consider the case of a flat residual block, in which all entries have the same value. After transformation, such a block will yield a set of coefficients with only one non-zero element. This follows from the arrangement of the coefficients: the lowest frequency term is positioned in the upper-left corner, and increasing frequencies are placed below and to the right as we move away from this term. The highest frequency is thus positioned at the lower-right corner.

In the resulting transformed block, the upper-left corner term is called the DC coefficient and represents the average value of the residual block of pixels. Video tends to have most of its energy in the lower frequency components, and the human eye is not sensitive to high frequency contrasts. Together, these two facts allow video to be compressed by discarding the higher frequency content with little loss of detail as perceived by the human eye [4].

The previous standard, H.264, relies exclusively on the DCT, but in HEVC the DCT is augmented by the discrete sine transform (DST) for 4 × 4 luma intra-prediction blocks [2]. It was found that using the DST on 4 × 4 luma intra-prediction blocks improved the bitstream compression by about 1% [5].

B. Quantization

Once transformed, the resulting coefficients are quantized by dividing them by an integer and rounding down. The divisor is called the quantization step and is derived from an encoded parameter, the quantization parameter (QP). As a result of the division and rounding, some coefficients will be rounded down to zero and discarded. For larger values of the quantization step, more zeroes will be produced in the resulting data. Typically, for natural scenes, the higher frequency components are smaller in magnitude, and after division by the quantization step these values will round to zero. The larger the value of QP, the more coefficients are rounded to zero and discarded, and thus more compression is achieved at the cost of detail in the reconstructed picture. This process of quantization is a lossy step since data is being thrown away. Provided only the higher frequency components are removed, the resulting picture will appear to the observer to be as detailed as the original.

II. RELATED WORK

Ma et al. [6] exploit the (anti-)symmetry property of the DCT matrix and continue the decomposition into multiple factors using sparse matrices to reduce the number of multiplications and additions when compared to the direct method. The design was done using combinational circuits only, thus no clock frequency is provided. The design also appears to use two IDCT blocks, which will impact circuit area and power.

Conceicao et al. [7] propose a fast 4-point IDCT that uses statistical information of the transformed residual data, where quality is maintained per the PSNR measurement along with a small but desirable decrease in bit rate. Their approach reduced circuit complexity and achieved high pixel throughput.

[...] of transform size. This is accomplished by having a flexible architecture that shares data in parallel across column and row processing units. These processing units are equally used for both the smallest and largest HEVC supported block sizes.

III. DESIGN

The design of the 4 × 4 inverse quantization and inverse transform consists of three design phases. The first is the RTL design, done in Verilog and then verified with ModelSim and MATLAB. The second phase uses Synopsys Design Compiler to synthesize the design and produce files for use in the last phase. Finally, placement and routing uses the files from the synthesis step to produce a placed and routed design. Cadence Innovus was used for placement and routing.

A. IDCT

The IDCT module takes inverse quantization data, performs the 2D transform as two separate 1D transforms, and outputs residual data.

Consider the DCT transform matrix shown in (3) and defined by the HEVC standard. This matrix and its transpose are used in the calculation of the 4 × 4 DCT and 4 × 4 IDCT. For notation purposes we will denote this matrix D.

$$D = \begin{bmatrix} 64 & 64 & 64 & 64 \\ 83 & 36 & -36 & -83 \\ 64 & -64 & -64 & 64 \\ 36 & -83 & 83 & -36 \end{bmatrix} \tag{3}$$

The four-point 1D inverse transform of this matrix with an input vector x is given by $D^{T} \times x$.

Performed as a direct transformation, this requires 16 multiplications and 12 additions. If, instead, we decompose the 1D transform of D and x using the even-odd algorithm as proposed by Budagavi et al. [11], then some savings can be realized. The even-odd decomposition splits the matrix product into an even part and an odd part, which are then combined to form the output y, the 1D inverse transform result.

The HEVC standard also provides scaling of the output data after each 1D inverse transform stage. Scaling is also done during the forward transform, performed as part of the encoding process. The purpose of this scaling is to both maintain the [...]
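As an illustration of the even-odd idea and the two-pass evaluation of (2) described in Section III-A, the following minimal Python sketch may help. It is not the authors' Verilog design: the function names are hypothetical, the grouping into even and odd terms is one straightforward reading of the decomposition for the 4-point case, and the per-stage scaling mentioned in the text is deliberately left out, so all results are unscaled integers.

```python
# HEVC 4x4 core transform matrix D, as given in (3).
D = [
    [64,  64,  64,  64],
    [83,  36, -36, -83],
    [64, -64, -64,  64],
    [36, -83,  83, -36],
]


def transpose(m):
    """Return the transpose of a list-of-lists matrix."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Plain integer matrix product, used only as a reference."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def idct4_direct(x):
    """Unscaled 1D inverse y = D^T x: 16 multiplications, 12 additions."""
    return [sum(D[k][n] * x[k] for k in range(4)) for n in range(4)]


def idct4_even_odd(x):
    """Same result via an even/odd split (hypothetical helper name).

    Even-indexed inputs form the even part, odd-indexed inputs form the
    odd part; a final add/subtract stage combines them. As written this
    uses 6 multiplications and 8 additions/subtractions.
    """
    e0 = 64 * x[0] + 64 * x[2]
    e1 = 64 * x[0] - 64 * x[2]
    o0 = 83 * x[1] + 36 * x[3]
    o1 = 36 * x[1] - 83 * x[3]
    return [e0 + o0, e1 + o1, e1 - o1, e0 - o0]


def idct4x4(Y, idct1d=idct4_even_odd):
    """Unscaled 2D inverse D^T Y D computed as two successive 1D passes."""
    # First pass: apply the 1D inverse to each column of Y, giving D^T Y.
    stage1 = transpose([idct1d(col) for col in transpose(Y)])
    # Second pass: apply the 1D inverse to each row, giving (D^T Y) D.
    return [idct1d(row) for row in stage1]


def dct4x4(U):
    """Unscaled forward transform D U D^T from (1), for reference."""
    return matmul(matmul(D, U), transpose(D))


if __name__ == "__main__":
    flat = [[3] * 4 for _ in range(4)]   # flat residual block
    print(dct4x4(flat))                  # only the DC term is non-zero
    print(idct4_even_odd([10, -4, 2, 1]) == idct4_direct([10, -4, 2, 1]))
```

The multiplications by 64 in the even part are powers of two, so in hardware they could be realized as shifts; that is an observation about this sketch rather than a statement about the presented design.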
