IEEE Latin American Symposium on Circuits and Systems (LASCAS) 2020; San José, Costa Rica; February 2020

Hardware Implementation of HEVC Inverse Transform in 45nm CMOS

Richard Calusdian and Aaron Stillmaker
Electrical and Computer Engineering Department, California State University, Fresno, CA, USA

Abstract: The High Efficiency Video Coding (HEVC) standard relies on the use of the inverse discrete cosine transform (IDCT) to perform video decompression. HEVC has increased the complexity of the decoder, and the inverse transform lends itself well to hardware acceleration due to repeated addition and multiplication on unit blocks. A hardware implementation of the inverse quantization and inverse transform, compliant to the HEVC standard, is presented. The design targets the 4×4 inverse quantization and transform, performing a synthesis and place & route flow using the Nangate FreePDK45 Open Cell Library. The operational frequency of the presented design supports 4K video at up to 30 frames/sec. The core area of the presented design takes up 14 664 µm² and can operate at a maximum frequency of 367 MHz.

Index Terms: HEVC, Inverse Discrete Cosine Transform, IDCT

978-1-7281-3427-7/20/$31.00 ©2020 IEEE

I. INTRODUCTION

The same group that released the H.264 video compression standard [1] has also developed a follow-up, the High Efficiency Video Coding (HEVC) standard, also known as the H.265/MPEG-HEVC standard or simply H.265 [2]. This latest standard builds on the framework of H.264 and expands and extends its tools and features. The standard improved upon the existing coding efficiency while specifically addressing the needs of the proliferation of High Definition (HD) and Ultra High Definition (UHD) video, while also adding support for parallel architectures [3].

In order to process a video sequence, each frame of the sequence is partitioned into non-overlapping square blocks of pixels. These square blocks can then be sub-divided further into smaller coding blocks for eventual spatial or temporal coding. The encoder algorithm will encode large blocks where possible to reduce the bit rate and leave the small blocks only where needed to retain detail. The H.264 standard calls its primary block a macroblock, which is of size 16×16 pixels, whereas HEVC uses the term coding tree unit (CTU), which can be as small as 16×16 and as large as 64×64. An advantage of the larger block size of 32×32 or 64×64 is that for large flat areas, the HEVC standard can offer better efficiency as compared to the smaller 16×16 macroblock of H.264 [3].

The next step is to subtract the predicted block from the current block, producing a residual, or difference, signal. It is this residual signal which is then transformed into the frequency domain using an integer approximation to the discrete cosine transform (DCT). The transform coefficients are then quantized according to a particular mode to eliminate high frequency data. The resulting data is then entropy encoded and transmitted for storage or to a decoder, which performs the inverse operations to reconstruct the video. The HEVC decoder uses the same operations as the encoder, but in inverse, to undo the encoding. The decoder undoes quantization using the inverse quantization and undoes the forward transformation using the inverse discrete cosine transformation (IDCT).

A. Transform

HEVC defines a 2D transform for sizes 4×4, 8×8, 16×16, and 32×32. The transform is used to change the representation of the residual signal from the spatial domain into the frequency domain. In HEVC this is accomplished by using a finite approximation to the DCT. HEVC explicitly defines the matrix values of the DCT as integer values to make the math more amenable to digital systems and to produce consistent results. Also, the integer values avoid encoder/decoder mismatches due to differing precision of representations of the DCT matrix values. The equations for the HEVC DCT and IDCT for an input residual block U and transform matrix D are shown in (1) and (2).

$$\mathrm{DCT}(U) = D \times U \times D^{T} = Y \tag{1}$$

$$\mathrm{IDCT}(Y) = D^{T} \times Y \times D = U \tag{2}$$

In actual implementation the inverse transform is computed by separating the equation in (2) into two one-dimensional transformations in succession. Once the DCT is performed, the coefficients of the resulting matrix represent increasing frequency components of the image, starting with the lowest frequency term in the upper-left corner and increasing as we move down and to the right, with the highest frequency component in the lower-right corner. Consider the case of a flat residual block, such that the residual block has all entries of the same value. After transformation, such a block will yield a set of coefficients with only one non-zero element. This follows from the arrangement of the coefficients: a flat block contains no variation, so only the lowest frequency term in the upper-left corner is non-zero.

In the resulting transformed block, the upper-left corner term is called the DC coefficient and represents the average value of the residual block of pixels. It tends to be the case that video has most of its energy in the lower frequency components, and interestingly, the human eye is not sensitive to high frequency contrasts. Both of these facts allow video to be compressed by discarding the higher frequency content with little loss of detail as perceived by the human eye [4].

The previous standard, H.264, exclusively relies on the DCT, but in the case of HEVC, the use of the DCT is augmented by use of the discrete sine transform (DST) for 4×4 luma intra-prediction blocks [2]. It was found that using the DST on 4×4 luma intra-prediction blocks improved the bitstream compression by about 1% [5].

B. Quantization

Once transformed, the resulting coefficients are quantized by dividing them by an integer and rounding down. The divisor is called the quantization step and is derived from an encoded parameter, the quantization parameter (QP). As a result of the division and rounding, some coefficients will be rounded down to zero and discarded. For larger values of the quantization step, more zeroes will be produced in the resulting data. Typically, for natural scenes, the higher frequency components are smaller in magnitude, and after division by the quantization step these values will round to zero. The larger the value of QP, the more coefficients will become zero and be discarded, and thus more compression will be achieved at the cost of detail in the reconstructed picture. This process of quantization is a lossy step since data is being thrown away. Provided only the higher frequency components are removed, the resulting picture will appear to the observer to be as detailed as the original.

II. RELATED WORK

Ma et al. [6] exploit the (anti-)symmetry property of the DCT matrix and continue the decomposition into multiple factors using sparse matrices to reduce the number of multiplications and additions when compared to the direct method. The design was done using combinational circuits only, thus no clock frequency is provided. The design also appears to use two IDCT blocks, which will impact circuit area and power.

Conceicao et al. [7] propose a fast 4-point IDCT that uses statistical information of the transformed residual data, where quality is maintained per the PSNR measurement along with a small but desirable decrease in bit rate. Their approach reduced circuit complexity and achieved high pixel throughput. The PSNR parameter used to assess quality is one accepted measurement, but research is still ongoing looking for objective metrics that better reflect human perception of video quality [8].

The work by Ziyou et al. [9] exploits the (anti-)symmetry exhibited by the HEVC transform matrix to minimize the physical quantity of multipliers. Their implementation requires the bit count for the transpose memory to be 2× what a typical shared IDCT design uses, but the gate count for their design is smaller than most. The design also has a significant requirement that the clock frequency must operate at the pixel rate of the video input.

In the work by Hong et al. [10] an architecture is proposed that seeks to maintain high hardware utilization regardless of transform size. This is accomplished by having a flexible architecture that shares data in parallel across column and row processing units. These processing units are equally used for both the smallest and largest HEVC supported block sizes.

III. DESIGN

The design of the 4×4 inverse quantization and inverse transform consists of three distinct phases. The first is the RTL design, done in Verilog and then verified with ModelSim and MATLAB. The second phase consists of using Synopsys Design Compiler to synthesize the design and produce files for use in the last phase. Finally, the placement and routing uses files from the synthesis step to produce a placed and routed design. Cadence Innovus was used for placement and routing.

A. IDCT

The IDCT module takes inverse quantization data, performs the 2D transform as two separate 1D transforms, and outputs residual data.

Consider the DCT transform matrix shown in (3) and defined by the HEVC standard. This matrix and its transpose are used in the calculation of the 4×4 DCT and 4×4 IDCT. For notation purposes we will denote this matrix D.

$$D = \begin{bmatrix} 64 & 64 & 64 & 64 \\ 83 & 36 & -36 & -83 \\ 64 & -64 & -64 & 64 \\ 36 & -83 & 83 & -36 \end{bmatrix} \tag{3}$$

The four-point 1D inverse transform of this matrix with an input vector x is given by $D^{T} \times x$.

Performing a direct transformation, this will require 16 multiplications and 12 additions. If, instead, we decompose the 1D transform of D and x using the even-odd algorithm as proposed by Budagavi et al. [11], then some savings can be realized. The even-odd decomposition splits the matrix into an even part and an odd part, which are combined to form the output y, the 1D inverse transform result.

The HEVC standard also provides scaling of the output data after each 1D inverse transform stage. Scaling is also done during the forward transform, performed as part of the encoding process. The purpose of this scaling is both to maintain the norm of the residual data through the encoding/decoding steps and to scale the data to a width that will keep hardware requirements to a reasonable amount. The HEVC standard specifies these scaling factors such that the output results from 1D (inverse-)transforms will be limited to a width of 16 bits, sign inclusive.

The HEVC standard specifies scaling such that memory and operands may be sized as 16 bits wide. This, however, should not be confused with size requirements of internal variables used to hold calculated values. For example, when multiplying two binary numbers, the resulting product may be up to twice as wide as the operands. An analysis done by the JCT-VC on the range requirements for HEVC inverse transform operations [12] was used as the guide for setting bit widths for variables used in the RTL design of the inverse transform engine. Specifically, for this work, a width of 24 bits is used for variables prior to any scaling operations. After the required scaling, as specified by the HEVC standard, the results may be stored in 16-bit registers and memory.

B. IDST

The development of even-odd decomposed matrices is dependent on the matrix exhibiting (anti-)symmetry properties. The DST matrix does not have these properties, but some savings in multiplications and additions can still be achieved. By computing the equations for the direct method, it may be observed that certain patterns emerge in the resulting sums. These patterns can be used to form factors that are re-used for multiple outputs. The fast algorithm used in this work was taken from the Kvazaar HEVC encoder project [13].

C. Transpose Memory

Many systems need memory as a means to store intermediate values, but in applications that use the DCT, DWT, and encryption techniques, the use of transpose memory is required to manipulate the data [14]. The purpose of the transpose memory is to take data that is written in a column fashion and then output the data for the second stage of the transform in a row fashion.

The transpose memory proposed by El-Hadedy et al. [14] was selected as a model for this work. This memory consists of an array of flip-flops and an input and output mux. The input mux allows the flip-flops to input data from one of two sources, while the output mux gives the flexibility to use the transpose memory as conventional memory, i.e., without the transposition function. In this work, the registers of the transpose memory are 16 bits wide, and the output mux is not needed and therefore not included.

The transpose memory consists of a 4×4 array of 16-bit registers which inputs column data from the IDCT module and outputs row data. The first four clocks shift column data into memory, and then on the next four clocks row data is shifted out to a 2-to-1 mux. The mux is used to select either data from the inverse quantization module or data from the transpose memory. Register based memory designs are more efficient for smaller transform sizes since a register based design does not require the overhead circuitry of SRAM blocks, e.g., row and column decoding and sense amplifiers.

D. Inverse Quantization

The Inverse Quantization module undoes the quantization function performed in the encoder. Quantization consists of dividing the transformed coefficients by a quantization step size, and inverse quantization is done by multiplying by this same step size. The step size is determined by the QP value, which can range from 0 to 51. Every increase of one in QP corresponds to an increase of approximately 12% in the step size. In addition to providing a means to undo a division operation, the de-quantizer also supports frequency scaling lists. These lists allow for frequency components to be divided by different amounts. In typical use, the higher frequency components will be divided by larger numbers. These lists can be either a default list defined by the standard or custom lists, which must be supplied to the encoder during the quantization operation and transmitted in the bitstream for decoder use. The proposed design supports all three list modes, specifically no list, default, and custom. The list mode is determined by writing to the control register in the control module. The HEVC standard defines the de-quantizer operation as shown in (4).

$$\mathrm{coeffQ}[x][y] = \left(\mathrm{level}[x][y] \times w[x][y] \times \left(g_{QP\%6} \ll \tfrac{QP}{6}\right) + \mathrm{offset}_{IQ}\right) \gg \mathrm{shift1} \tag{4}$$

Here level[x][y] corresponds to transformed and quantized input data and w[x][y] is the scaling list, either custom or default. The value of g is used to map the QP value to one of a set of six values and a corresponding shift. The offset is given for a specific bit depth and transform size. The value of shift1 is equal to (M − 5 + B), where M is $\log_2$(transform size) and B is the video bit depth. The use of shift1 is driven by the need to maintain the norm for the residual block as it undergoes inverse quantization and inverse transformation.

E. Simulation

The design was functionally verified with various modes of operation, including custom frequency scaling lists, QP values, and other supported modes. Results were cross-checked with a MATLAB script that called various functions to compute and compare both intermediate and final output values with Verilog simulations, checking both inverse transformation and inverse quantization results as well as end-to-end results. The MATLAB script and functions were implemented using integer math so as to produce results which could be directly compared to the HDL results.

F. Synthesis and Place & Route

This work targeted the CMOS 45 nm Nangate FreePDK45 Library. The Nangate 45nm Open Cell Library was developed to provide a library for research, testing, and exploring EDA flows. The plot in Fig. 1 shows the placed and routed design.

IV. RESULTS

In Table I, the designs are compared to provide an overview of both performance and architecture. These metrics are meant to compare the supported functions and also each design's performance. All designs implement the IDCT and, in the case of Ziyou et al. [9], the IDST. In this work, both the IDST and the inverse quantization functions are implemented.

The 4K speed metric shows the clock frequency that each design would operate at in order to process 4K video. The gate count is based on the area size of a NAND2X1 of each corresponding technology. Note that for Ma et al. [6] the gate count is based on a NAND2 area size of 5.0922 µm².

TABLE I
ARCHITECTURE AND PERFORMANCE COMPARISON BETWEEN DISCUSSED WORKS AND THIS WORK

Design          Tech.    IDCT   IDST   Inverse Quant.   Gate count   4K Speed (MHz)
Ma [6]          130 nm   8×8    no     no               8.2k         n/a
Conceicao [7]   FPGA     4×4    no     no               n/a          23.3
Ziyou [9]       65 nm    all    yes    no               145.4k       412
Hong [10]       65 nm    all    no     no               112.5k       162.3
This work       45 nm    4×4    yes    yes              17.5k        200
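The two normalised metrics in Table I can be sanity-checked with simple arithmetic. The sketch below is only an illustration: the function names are hypothetical, and the 4:2:0 chroma factor of 1.5 and the rate of two samples per clock (one 4×4 block every eight clocks, as the four-clocks-in, four-clocks-out transpose memory would suggest) are assumptions rather than figures stated in this paper.

```python
def required_clock_mhz(width, height, fps, samples_per_clock, chroma_factor=1.5):
    """Clock (MHz) needed to keep up with a video stream.

    chroma_factor=1.5 assumes 4:2:0 sampling (an assumption, not from the paper).
    """
    samples_per_second = width * height * chroma_factor * fps
    return samples_per_second / samples_per_clock / 1e6


def gates_from_area(area_um2, nand2_area_um2):
    """Gate count expressed in NAND2-equivalents for a given cell area."""
    return area_um2 / nand2_area_um2


if __name__ == "__main__":
    # 3840x2160 at 30 frames/sec, assuming 2 samples are completed per clock.
    print(required_clock_mhz(3840, 2160, 30, samples_per_clock=2))
```

Under these assumptions the required clock comes out at roughly 187 MHz, which is of the same order as the 200 MHz listed for this work.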

Fig. 1. A plot of the physical design of the HEVC compliant 4×4 inverse quantization and inverse transform module showing sectioned areas of the major functional blocks.

The design of Ma et al. [6] is purely combinational logic and thus has no clock. The gate count of Ma only includes one 1D transform and does not include the transpose memory. If we estimate the expected memory size using the design of this work, a register based 8×8 memory would require approximately 9.5k gates and would bump the total gate count to roughly 25k gates for a serially processed design, i.e., the transpose memory in between two IDCT modules. In this work, the design re-uses the 1D transform to perform the 2D transform and thus achieves some area savings.

The design of Conceicao et al. [7] operates at a very high throughput due to the discarding of columns of data and only performing one 1D inverse transform. It is unclear how effective this technique would be when paired with lower QP values, as the proposed design only discusses high QP values. As mentioned previously, it is difficult to assess human perceived quality from objective metrics. It is also unclear how effective this technique would be with larger transform sizes.

The proposal of Ziyou et al. [9] does implement the full range of transform sizes but does so at a hefty gate count. Also, the design processes one coefficient at a time and so relies on a higher clock speed than the other designs. This high clock speed may be accompanied by higher power dissipation and increased emissions.

The proposal of Hong et al. [10] implements the full range of transform sizes and achieves high hardware utilization. However, this utilization means that circuits may not be turned off when performing smaller transform sizes, as is possible in architectures that scale utilization with transform size.

V. CONCLUSION

In this work an HEVC compliant inverse quantization and inverse transform has been presented, capable of processing 4K video at 30 frames/sec. The design was entered in Verilog HDL, synthesized, and then placed and routed in 45 nm CMOS technology using industry standard tools. The transform and quantization components were discussed to give the reader a brief introduction to these tools and how they achieve video compression. The proposed design was then discussed and compared to some related works.

REFERENCES

[1] "ITU-T Rec. H.264 and ISO/IEC 14496-10: Advanced Video Coding for Generic Audio-Visual Services," ITU-T and ISO/IEC, 2003.
[2] "ITU-T Rec. H.265 and ISO/IEC 23008-2: High efficiency video coding," ITU-T and ISO/IEC, 2013.
[3] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the High Efficiency Video Coding (HEVC) standard," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1649-1668, Dec. 2012.
[4] R. Bahirat and A. Kolhe, "Video compression using H.264 AVC standard," Intl Journal Emerg Rsrch Mngmnt and Tech, vol. 3, no. 2, pp. 31-37, 2014.
[5] V. Sze, M. Budagavi, and G. J. Sullivan, High Efficiency Video Coding (HEVC): Algorithms and Architectures, London: Springer, 2014.
[6] T. Ma, C. Liu, Y. Fan, and X. Zeng, "A fast 8x8 IDCT algorithm for HEVC," in 2013 IEEE 10th International Conference on ASIC, Shenzhen, 2013.
[7] R. Conceicao, A. Araujo, M. Porto, B. Zatt, and L. Agostini, "Hardware design of fast HEVC 2-D IDCT targeting real-time UHD 4K applications," in 2015 IEEE 6th Latin American Symposium on Circuits & Systems (LASCAS), Montevideo, 2015.
[8] H. R. Sheikh and A. C. Bovik, "Image information and visual quality," IEEE Transactions on Image Processing, vol. 15, no. 2, pp. 430-444, Feb. 2006.
[9] Y. Ziyou, H. Weifeng, H. Liang, H. Guanghui, and M. Zhigang, "Area and throughput efficient IDCT/IDST architecture for HEVC standard," in 2014 IEEE International Symposium on Circuits and Systems, Melbourne, 2014.
[10] L. Hong, W. He, H. Zhu, and Z. Mao, "A cost effective 2-D adaptive block size IDCT architecture for HEVC standard," in 2013 IEEE 56th International Midwest Symposium on Circuits & Systems (MWSCAS), Columbus, 2013.
[11] M. Budagavi, A. Fuldseth, G. Bjøntegaard, V. Sze, and M. Sadafale, "Core transform design in the High Efficiency Video Coding (HEVC) standard," IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 6, pp. 1029-1041, Dec. 2013.
[12] L. Kerofsky and S. Riabtsev, "Dynamic range analysis of HEVC/H.265 inverse transform operations," JCT-VC, Geneva, 2013.
[13] Kvazaar HEVC Encoder. (2017), U. V. Group. Accessed: Mar. 2, 2019. [Online]. Available: ultravideo.cs.tut.fi./#encoder
[14] M. El-Hadedy, S. Purohit, M. Margala, and S. Knapskog, "Performance and area efficient transpose memory arch. for high throughput adaptive signal proc. systs.," in 2010 NASA/ESA Conf. on Adaptive Hardware and Systems, Anaheim, 2010.
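As an illustration of Sections III-A and III-D, the following is a minimal software sketch of the de-quantizer in (4) followed by the 4×4 inverse transform using an even-odd split of the matrix in (3). It is not the RTL of this work. The function names are hypothetical, and several values are assumptions taken from common HEVC practice rather than from the text above: the table g = {40, 45, 51, 57, 64, 72}, the flat weight w = 16, the offset of half the shift1 divisor, and the stage shifts of 7 and 20 − B. Clipping to 16 bits is omitted for brevity.

```python
# 4x4 core transform matrix D, as in (3).
D = [
    [64,  64,  64,  64],
    [83,  36, -36, -83],
    [64, -64, -64,  64],
    [36, -83,  83, -36],
]

# Scale table g indexed by QP % 6 (assumed values, see lead-in).
G = [40, 45, 51, 57, 64, 72]


def dequantize(level, qp, w=16, log2_size=2, bit_depth=8):
    """Equation (4) for one coefficient; w=16 models the flat (no list) case."""
    shift1 = log2_size - 5 + bit_depth       # M - 5 + B
    offset = 1 << (shift1 - 1)               # assumed rounding offset
    return (level * w * (G[qp % 6] << (qp // 6)) + offset) >> shift1


def idct4_direct(x):
    """1D inverse transform as the plain product D^T x (16 multiplications)."""
    return [sum(D[k][n] * x[k] for k in range(4)) for n in range(4)]


def idct4_even_odd(x):
    """Same result via the even-odd split (6 multiplications, 2 of them by 64)."""
    e0 = 64 * x[0] + 64 * x[2]
    e1 = 64 * x[0] - 64 * x[2]
    o0 = 83 * x[1] + 36 * x[3]
    o1 = 36 * x[1] - 83 * x[3]
    return [e0 + o0, e1 + o1, e1 - o1, e0 - o0]


def scale(value, shift):
    """Rounding right shift applied after each 1D stage."""
    return (value + (1 << (shift - 1))) >> shift


def idct4x4(coeff, bit_depth=8):
    """2D inverse transform: columns first, then rows, scaling after each stage."""
    # Stage 1: transform every column of the coefficient block.
    cols = [[scale(v, 7) for v in idct4_even_odd([coeff[r][c] for r in range(4)])]
            for c in range(4)]
    # cols[c][r] is the intermediate value at row r, column c. Reading it back
    # row by row plays the role of the transpose memory.
    shift2 = 20 - bit_depth
    return [[scale(v, shift2)
             for v in idct4_even_odd([cols[c][r] for c in range(4)])]
            for r in range(4)]


if __name__ == "__main__":
    # A block with only a DC level should reconstruct to a flat residual.
    levels = [[1, 0, 0, 0]] + [[0, 0, 0, 0] for _ in range(3)]
    coeff = [[dequantize(v, qp=22) for v in row] for row in levels]
    print(idct4x4(coeff))
```

With a single DC level the reconstructed residual is flat, which matches the observation in Section I-A that a flat block has only one non-zero coefficient.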