A Very High Throughput Deblocking Filter for H.264/AVC
J Sign Process Syst
DOI 10.1007/s11265-013-0744-4

M. Kthiri · B. Le Gal · P. Kadionik · A. Ben Atitallah

Received: 28 October 2011 / Revised: 18 December 2012 / Accepted: 25 March 2013
© Springer Science+Business Media New York 2013

M. Kthiri (*) · B. Le Gal · P. Kadionik
IMS laboratory - ENSEIRB-MATMECA, University Bordeaux 1, CNRS UMR 5218, 351, Cours de la Libération, 33 405 Talence Cedex, France

A. B. Atitallah
High Institute of Electronics and Communication, University of Sfax, 3018 Sfax, Tunisia

Abstract This paper presents a novel hardware architecture for the real-time high-throughput implementation of the adaptive deblocking filtering process specified by the H.264/AVC video coding standard. A parallel filtering order of six units is proposed according to the H.264/AVC standard. With a parallel filtering order (fully compliant with H.264/AVC) and a dedicated data arrangement in local memory banks, the proposed architecture can process the filtering operations for one macroblock in fewer filtering cycles than previously proposed approaches. Filtering efficiency is improved by a novel computation scheduling and a dedicated architecture composed of six filtering cores. The architecture can be used in either the decoder or the encoder as a hardware accelerator for the processor, or it can be embedded into a full-hardware codec. The Intellectual Property block developed from the proposed architecture supports multiple high-definition processing flows in real time. Working at a clock frequency of 150 MHz and synthesized in a 65 nm low power and low voltage CMOS standard cell technology, it easily meets the throughput requirements for 4 k video at 30 fps at all levels of the H.264/AVC video coding standard and consumes 25.08 Kgates.

Keywords Deblocking filter · Filtering order · ASIC · H.264/AVC video coding

1 Introduction

In the beginning of 2002, the H.264/AVC algorithm was presented as a promising solution for the multimedia market due to its higher compression efficiency compared to other video encoding algorithms such as MPEG-2, H.263 and MPEG-4 [1]. Comparative studies reveal that, while maintaining the same video quality, the stream generated by the H.264/AVC algorithm occupies approximately half of the bandwidth required by the MPEG-2 algorithm [2]. In order to increase global video encoding efficiency, the H.264/AVC standard improves some traditional MPEG internal modules, for example the DCT (using a 4×4 integer version) and inter-frame motion estimation (supporting quarter pixel resolution, multi-frame and variable block size). Moreover, several additional features have been incorporated in the H.264/AVC standard, which include intra-frame prediction, CABAC and a deblocking filter [3]. An important H.264/AVC advantage is the inclusion of an anti-blocking filter, also named deblocking filter. This filter, applied to the final images, improves video quality by attenuating blocking artifact effects, which are normally found in decoded images. As a result, the final subjective quality is significantly improved, allowing the video quality to be maintained while reducing the bitrate. The drawback of the deblocking filter comes from its high computational complexity.

In fact, one of the most important pieces of information in the complexity analysis of a system is the distribution of time complexity amongst its major subsystems. In [4], the authors have generated results that have been averaged over all sequences in the test set.
As a result, loop filtering (33 %) and interpolation (25 %) are the largest components, followed by bitstream parsing and entropy decoding (13 %), and inverse transforms and reconstruction (13 %).

The deblocking filter is the most complex functional block of the decoder. It consumes more than one-third of the computational complexity of the H.264/AVC decoder (Fig. 1). Thus, fast computation of the deblocking filter is necessary for high-definition video processing.

Figure 1 Profiling of H.264/AVC decoder [4].

Due to its high complexity, wide research has been carried out regarding the implementation of the H.264/AVC deblocking filter. The main source of its complexity can be attributed to the fact that each pixel must be read a number of times in different directions to filter a complete macroblock. To deal with this problem, several processing orders were proposed in previous works, all of them aiming to decrease the computation time and the amount of memory used in the filtering process.

In this paper, we propose a new filtering order for the deblocking filter and a new architectural design for this filtering order. The architecture was described in VHDL and was validated first in simulation and then on an FPGA device (using a co-design based approach). Finally, it was implemented targeting a 65 nm low power and low voltage ASIC technology.

This paper is structured as follows: Section 2 outlines the algorithm of the deblocking filter. Section 3 is devoted to the presentation of the filter ordering solutions published in the literature. The proposed filtering order solution and its hardware architecture are then presented. Section 4 reports the results and compares them to other related works. Section 5 concludes.

2 Deblocking Filter

In the H.264/AVC, the deblocking filter is applied to all four edges of each 4×4 block in one diagram. In Fig. 2, macroblocks are processed following raster scan order. For each macroblock, the vertical edges are first filtered rightwards and then the horizontal edges downwards. As shown in Fig. 2, the luma macroblock is first processed vertically, i.e. from g to j, and then horizontally from k to n. The chroma components follow the same rule. Each set of 8 pixels on a straight line across two adjacent 4×4 blocks, such as (p3,p2,p1,p0) and (q0,q1,q2,q3) in Fig. 2(a), is sent to the filter at the same time.

Figure 2 Vertical and horizontal edges in one macroblock. (a) Luma components. (b) Chroma components.

The H.264/AVC deblocking filter is highly adaptive. There are several conditions that determine:
1. Whether a 4×4 block edge will be filtered or not
2. The strength of the filtering for the block edges that will be filtered.

The Boundary Strength (BS) parameter, the α and β thresholds, and the values of the pixels in the edge determine the outcomes of these conditions. The BS parameter varies adaptively according to the quantization step-size used when the block was coded, the coding mode of neighboring blocks and the gradient of the values of the pixels computed across the edge being filtered [1]. Five strength levels exist (BS = 0 to 4). BS = 0 means “no filtering” and BS = 4 indicates maximum smoothing.

Figure 3 illustrates the principle of the deblocking filter using a one-dimensional visualization of a 4×4 block edge. In Fig. 3, {q0,q1,q2,q3} represent the pixels from the current 4×4 block, whereas {p0,p1,p2,p3} represent the pixels from the adjacent 4×4 block, as detailed in Fig. 2. Whether the pixels p0 and q0, as well as p1 and q1, are filtered is determined by the Quantization Parameter (QP) and the threshold variables α and β, which are used to prevent true edges from being filtered. The values of α and β depend on QP. The filtering strength for an edge is determined by comparing pixel gradients with the α and β threshold values for that edge. Thus, filtering of p0 and q0 only takes place if the following content activity checks are satisfied (1):

$$\mathrm{BS} \neq 0 \ \text{and}\ |p_0 - q_0| < \alpha \ \text{and}\ |p_1 - p_0| < \beta \ \text{and}\ |q_1 - q_0| < \beta \tag{1}$$

Correspondingly, filtering of p1 or q1 occurs if (2) is satisfied:

$$|p_2 - p_0| < \beta \ \text{and}\ |q_2 - q_0| < \beta \tag{2}$$

The dependency of α and β on the QP links the strength of filtering to the general quality of the reconstructed picture prior to filtering. The basic idea is that if a relatively large absolute difference between samples near a block edge is measured, it is quite likely to be a blocking artifact and should therefore be reduced. However, if the magnitude of that difference is so large that it can no longer be explained by the coarseness of the QP used in the encoding, the edge is more likely to reflect the actual behavior of the source picture and should not be smoothed over.

The next paragraphs present the two variations of the deblocking algorithm according to the BS value.

2.1 Algorithm for 0<BS<4

To calculate the new values of p0 and q0, the parameter Dif0 is computed:

$$\mathit{Dif}_0 = \mathrm{Clip}\big(c_1, -c_1, (((q_0 - p_0) \ll 2) + (p_1 - q_1) + 4) \gg 3\big) \tag{3}$$

The parameter c1 used by the Clip function is defined by the H.264/AVC standard (clip table) as shown in Table 1 [1].

$$\mathit{Dif}_2 = \mathrm{Clip}\big(c_0, -c_0, (q_2 + ((p_0 + q_0 + 1) \gg 1) - (q_1 \ll 1)) \gg 1\big) \tag{7}$$

$$p_1' = p_1 + \mathit{Dif}_1 \tag{8}$$

$$q_1' = q_1 + \mathit{Dif}_2 \tag{9}$$

2.2 Algorithm for BS=4

The following expressions are used to compute the new values of the filtered pixel sequences, initially considering the current block (Q) and previous block (P), we compute
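As an illustration of the 0<BS<4 case of Section 2.1, the following Python fragment is a minimal sketch of how one line of 8 pixels could be filtered in software. It is not the proposed hardware architecture. Several points are assumptions rather than statements from the text: the function names (`clip`, `clip_pixel`, `filter_edge`) are hypothetical, samples are assumed to be 8-bit, the updates p0' = p0 + Dif0 and q0' = q0 − Dif0 and the expression for Dif1 (taken as the mirror image of (7)) follow the usual form of the H.264/AVC filter, and each half of condition (2) is applied to its own side of the edge. The thresholds α and β and the clip values c0 and c1 are passed in as plain arguments instead of being looked up from QP.

```python
def clip(upper, lower, value):
    """Clamp value to [lower, upper], using the Clip(c, -c, x) argument order."""
    return max(lower, min(upper, value))


def clip_pixel(value):
    """Keep a filtered sample in the 8-bit range (assumed sample depth)."""
    return max(0, min(255, value))


def filter_edge(p, q, bs, alpha, beta, c0, c1):
    """Filter one line of 8 pixels across a 4x4 block edge for 0 < BS < 4.

    p = (p0, p1, p2, p3) and q = (q0, q1, q2, q3), where p0 and q0 are the
    samples closest to the edge. Returns the new (p, q) tuples.
    """
    if not 0 <= bs < 4:
        raise ValueError("this sketch only covers BS values 0 to 3")
    p0, p1, p2, p3 = p
    q0, q1, q2, q3 = q

    # Condition (1): leave the line untouched for BS = 0 or a true edge.
    active = (abs(p0 - q0) < alpha and abs(p1 - p0) < beta
              and abs(q1 - q0) < beta)
    if bs == 0 or not active:
        return p, q

    # Equation (3), then the assumed updates of p0 and q0.
    dif0 = clip(c1, -c1, (((q0 - p0) << 2) + (p1 - q1) + 4) >> 3)
    new_p0 = clip_pixel(p0 + dif0)
    new_q0 = clip_pixel(q0 - dif0)

    # Rounded average of the unfiltered edge samples, shared by Dif1 and Dif2.
    avg = (p0 + q0 + 1) >> 1

    new_p1, new_q1 = p1, q1
    if abs(p2 - p0) < beta:  # p side of condition (2)
        dif1 = clip(c0, -c0, (p2 + avg - (p1 << 1)) >> 1)  # assumed mirror of (7)
        new_p1 = p1 + dif1  # equation (8)
    if abs(q2 - q0) < beta:  # q side of condition (2)
        dif2 = clip(c0, -c0, (q2 + avg - (q1 << 1)) >> 1)  # equation (7)
        new_q1 = q1 + dif2  # equation (9)

    return (new_p0, new_p1, p2, p3), (new_q0, new_q1, q2, q3)


if __name__ == "__main__":
    # Arbitrary illustrative values, not taken from the standard's tables.
    print(filter_edge((70, 68, 67, 66), (80, 82, 83, 84),
                      bs=2, alpha=20, beta=6, c0=1, c1=3))
    # prints ((73, 69, 67, 66), (77, 81, 83, 84))
```

In this example the step of 10 between p0 and q0 is below α, so the edge is treated as a blocking artifact and the four samples nearest the edge are pulled towards each other. With a step of 180 the same call would return the input unchanged, since condition (1) would classify it as a true edge.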