J Sign Process Syst DOI 10.1007/s11265-013-0744-4

A Very High Throughput Deblocking Filter for H.264/AVC

M. Kthiri & B. Le Gal & P. Kadionik & A. Ben Atitallah

Received: 28 October 2011 / Revised: 18 December 2012 / Accepted: 25 March 2013
© Springer Science+Business Media New York 2013

Abstract This paper presents a novel hardware architecture for the real-time, high-throughput implementation of the adaptive deblocking filtering process specified by the H.264/AVC video coding standard. A parallel filtering order for six filtering units is proposed in accordance with the H.264/AVC standard. With this parallel filtering order (fully compliant with H.264/AVC) and a dedicated data arrangement in local memory banks, the proposed architecture can process the filtering operations for one macroblock in fewer filtering cycles than previously proposed approaches. Moreover, filtering efficiency is improved thanks to a novel computation scheduling and a dedicated architecture composed of six filtering cores. It can be used either in the decoder or in the encoder, as a hardware accelerator for a processor, or it can be embedded into a full-hardware codec. The Intellectual Property block developed from the proposed architecture supports multiple high-definition processing flows in real time. Working at a clock frequency of 150 MHz, synthesized with a 65 nm low power and low voltage CMOS standard cell technology, it easily meets the throughput requirements for 4k video at 30 fps of all the levels of the H.264/AVC video coding standard and consumes 25.08 Kgates.

Keywords Deblocking filter · Filtering order · ASIC · H.264/AVC coding

M. Kthiri (*) · B. Le Gal · P. Kadionik
IMS laboratory - ENSEIRB-MATMECA, University Bordeaux 1, CNRS UMR 5218, 351 Cours de la Libération, 33405 Talence Cedex, France

A. Ben Atitallah
High Institute of Electronics and Communication, University of Sfax, 3018 Sfax, Tunisia

1 Introduction

In the beginning of 2002, the H.264/AVC algorithm was presented as a promising solution for the multimedia market due to its higher compression efficiency compared to other video encoding algorithms such as MPEG-2, H.263 and MPEG-4 [1]. Comparative studies reveal that, while maintaining the same video quality, the stream generated by the H.264/AVC algorithm occupies approximately half of the bandwidth required by the MPEG-2 algorithm [2]. In order to increase the global video encoding efficiency, the H.264/AVC standard improves some traditional MPEG internal modules, for example the DCT (using a 4×4 integer version) and inter-frame prediction (supporting quarter-pixel resolution, multiple reference frames and variable block sizes). Moreover, several additional features have been incorporated in the H.264/AVC standard, which include intra-frame prediction, CABAC and a deblocking filter [3]. An important H.264/AVC advantage is the inclusion of an anti-blocking filter, also named deblocking filter. This filter, applied to the final images, improves video quality by attenuating the blocking artifact effects that are normally found in decoded images. As a result, the final subjective quality is significantly improved, allowing the video quality to be maintained while reducing the bitrate. The drawback of the deblocking filter comes from its high computational complexity.

In fact, one of the most important pieces of information in the complexity analysis of a system is the distribution of time complexity amongst its major subsystems. In [4], the authors have generated results that have been averaged over all sequences in the test set.
As a result, loop filtering (33 %) and interpolation (25 %) are the largest components, followed by bitstream parsing and entropy decoding (13 %), and inverse transforms and reconstruction (13 %). The deblocking filter is the most complex functional block of the decoder. It consumes approximately more than one-third of the computational complexity of the H.264/AVC decoder (Fig. 1). Thus, fast computation of the deblocking filter is necessary for high-definition video processing.

Due to its high complexity, wide research has been carried out regarding the implementation of the H.264/AVC deblocking filter. The main source of its complexity can be attributed to the fact that each pixel must be read a number of times in different directions to filter a complete macroblock. To deal with this problem, several processing orders were proposed in previous works, all of them aiming to decrease the computation time and the amount of memory used in the filtering process.

In this paper, we propose a new filtering order for the deblocking filter together with a new architectural design for this filtering order. The architecture was described in VHDL and was validated first in simulation and then on an FPGA device (using a co-design based approach). Finally, it was implemented targeting a 65 nm low power and low voltage ASIC technology.

This paper is structured as follows: Section 2 outlines the algorithm of the deblocking filter. Section 3 is devoted to the presentation of the filter ordering solutions published in the literature. The proposed filtering order as well as its hardware architecture is presented in Section 4. Section 5 reports the results and compares them to other related works. Section 6 concludes.

2 Deblocking Filter

In H.264/AVC, the deblocking filter is applied to all four edges of each 4×4 block within a macroblock. In Fig. 2, macroblocks are processed following raster scan order. For each macroblock, the vertical edges are first filtered rightwards and then the horizontal edges downwards. As shown in Fig. 2, the luma macroblock is first processed vertically, i.e. from g to j, and then horizontally from k to n. The chroma components follow the same rule. Each set of 8 pixels on a straight line across two adjacent 4×4 blocks, such as (p3,p2,p1,p0) and (q0,q1,q2,q3) in Fig. 2(a), is sent to the filter at the same time. The H.264/AVC deblocking filter is highly adaptive. There are several conditions that determine:

1. Whether a 4×4 block edge will be filtered or not
2. The strength of the filtering for the block edges that will be filtered.

The Boundary Strength (BS) parameter, the α and β thresholds, and the values of the pixels in the edge determine the outcomes of these conditions. The BS parameter varies adaptively according to the quantization step-size used when the block was coded, the coding mode of the neighboring blocks and the gradient of the pixel values computed across the edge being filtered [1]. Five strength levels exist (BS = [0, 4]). BS equal to 0 means "no filtering" and BS = 4 indicates maximum smoothing.

Figure 3 illustrates the principle of the deblocking filter using a one-dimensional visualization of a 4×4 block edge. In Fig. 3, {q0,q1,q2,q3} represent the pixels from the current 4×4 block, whereas {p0,p1,p2,p3} represent the adjacent 4×4 block, as detailed in Fig. 2. Whether the pixels p0 and q0, as well as p1 and q1, are filtered is determined by the Quantization Parameter (QP) and the threshold variables α and β that are used to prevent true edges from being filtered. The values of α and β depend on QP. The filtering strength for an edge is determined by comparing pixel gradients with the α and β threshold values for that edge. Thus, filtering of p0 and q0 only takes place if the following content activity check conditions are satisfied (1):

BS ≠ 0 and |p0 − q0| < α and |p1 − p0| < β and |q1 − q0| < β   (1)

Correspondingly, filtering of p1 or q1 occurs if (2) is satisfied:

|p2 − p0| < β and |q2 − q0| < β   (2)
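To make these conditions concrete, the following minimal C sketch evaluates the checks of Eqs. (1) and (2) for one line of pixels across a 4×4 block edge. It assumes that α and β have already been derived from QP; the structure and names are ours, not taken from the paper or the reference software.

```c
#include <stdlib.h>   /* abs() */

/* Outcome of the content-activity checks for one line of pixels. */
typedef struct {
    int filter_p0_q0;   /* Eq. (1): the edge samples p0/q0 may be filtered */
    int filter_p1;      /* Eq. (2), P side: p1 may be filtered             */
    int filter_q1;      /* Eq. (2), Q side: q1 may be filtered             */
} edge_decision_t;

/* p[] holds p0..p3 of the adjacent block, q[] holds q0..q3 of the current
 * block; bs is the boundary strength, alpha/beta the QP-dependent thresholds. */
static edge_decision_t edge_decision(int bs, int alpha, int beta,
                                     const int p[4], const int q[4])
{
    edge_decision_t d = {0, 0, 0};

    /* Eq. (1): BS != 0 and |p0-q0| < alpha and |p1-p0| < beta and |q1-q0| < beta */
    d.filter_p0_q0 = (bs != 0) &&
                     (abs(p[0] - q[0]) < alpha) &&
                     (abs(p[1] - p[0]) < beta)  &&
                     (abs(q[1] - q[0]) < beta);

    /* Eq. (2): |p2-p0| < beta (P side) and |q2-q0| < beta (Q side) */
    d.filter_p1 = d.filter_p0_q0 && (abs(p[2] - p[0]) < beta);
    d.filter_q1 = d.filter_p0_q0 && (abs(q[2] - q[0]) < beta);

    return d;
}
```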

The dependency of α and β on the QP links the strength of filtering to the general quality of the reconstructed picture prior to filtering. The basic idea is that if a relatively large absolute difference between samples near a block edge is measured, it is quite likely to be a blocking artifact and should therefore be reduced. However, if the magnitude of that difference is so large that it can no longer be explained by the coarseness of the QP used in the encoding, the edge is more likely to reflect the actual behavior of the source picture and should not be smoothed over.

Figure 1 Profiling of the H.264/AVC decoder [4].

Figure 2 Vertical and horizontal edges in one macroblock: (a) luma components, (b) chroma components.


The next paragraphs present the two variations of the deblocking algorithm according to the BS value.

Figure 3 Principle of a 4×4 block edge deblocking filtering.

2.1 Algorithm for 0 < BS < 4

To calculate the new values of p0 and q0, the parameter Dif0 is computed:

Dif0 = Clip(c1, −c1, (((q0 − p0) ≪ 2) + (p1 − q1) + 4) ≫ 3)   (3)

The parameter c1 used by the Clip function is defined by the H.264/AVC standard (clip table) as shown in Table 1 [1]. As a result, the updated values of p0 and q0 (named p0′ and q0′) are computed using Eqs. 4 and 5:

p0′ = Clip(p0 + Dif0)   (4)

q0′ = Clip(q0 − Dif0)   (5)

The computation of p1′ and q1′ occurs in the same manner. First, the values of Dif1 and Dif2 are determined. After that, p1′ and q1′ are respectively given by:

Dif1 = Clip(c0, −c0, (p2 + ((p0 + q0 + 1) ≫ 1) − (p1 ≪ 1)) ≫ 1)   (6)

Dif2 = Clip(c0, −c0, (q2 + ((p0 + q0 + 1) ≫ 1) − (q1 ≪ 1)) ≫ 1)   (7)

p1′ = p1 + Dif1   (8)

q1′ = q1 + Dif2   (9)

2.2 Algorithm for BS = 4

The following expressions are used to compute the new values of the filtered pixel sequences. Considering the current block (Q) and the previous block (P), we compute the filtered pixels with the following equations:

q0′ = (p1 + 2·p0 + 2·q0 + 2·q1 + q2 + 4) ≫ 3   (10)

q1′ = (p0 + q0 + q1 + q2 + 2) ≫ 2   (11)

q2′ = (2·q3 + 3·q2 + q1 + q0 + p0 + 4) ≫ 3   (12)

p0′ = (q1 + 2·q0 + 2·p0 + 2·p1 + p2 + 4) ≫ 3   (13)

p1′ = (q0 + p0 + p1 + p2 + 2) ≫ 2   (14)

p2′ = (2·p3 + 3·p2 + p1 + p0 + q0 + 4) ≫ 3   (15)

For chrominance blocks, the following equations must be adopted:

q0′ = (2·q1 + q0 + p1 + 2) ≫ 2   (16)

p0′ = (2·p1 + p0 + q1 + 2) ≫ 2   (17)
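To illustrate the normal-strength branch, the following is a minimal C sketch of the luma filtering of Eqs. (3)–(9). It assumes that the single-argument Clip of Eqs. (4) and (5) saturates to the 8-bit pixel range and that right shifts are arithmetic; the function and variable names are illustrative and not taken from the paper or the reference software.

```c
/* Clamp x into [low, high]; mirrors the three-argument Clip of the paper. */
static int clip3(int low, int high, int x)
{
    return (x < low) ? low : (x > high) ? high : x;
}

/* 0 < BS < 4 luma filtering of one line of pixels across an edge.
 * p[] holds p0..p3, q[] holds q0..q3; results are written back in place.
 * filter_p1/filter_q1 are the outcomes of the Eq. (2) checks. */
static void filter_edge_bs1to3(int p[4], int q[4], int c0, int c1,
                               int filter_p1, int filter_q1)
{
    /* Eq. (3): Dif0 = Clip(c1, -c1, (((q0 - p0) << 2) + (p1 - q1) + 4) >> 3) */
    int dif0 = clip3(-c1, c1, ((q[0] - p[0]) * 4 + (p[1] - q[1]) + 4) >> 3);

    /* Eqs. (4)-(5): the single-argument Clip is read here as saturation to [0, 255]. */
    int p0_new = clip3(0, 255, p[0] + dif0);
    int q0_new = clip3(0, 255, q[0] - dif0);

    if (filter_p1) {
        /* Eqs. (6) and (8) */
        int dif1 = clip3(-c0, c0,
                         (p[2] + ((p[0] + q[0] + 1) >> 1) - (p[1] << 1)) >> 1);
        p[1] += dif1;
    }
    if (filter_q1) {
        /* Eqs. (7) and (9) */
        int dif2 = clip3(-c0, c0,
                         (q[2] + ((p[0] + q[0] + 1) >> 1) - (q[1] << 1)) >> 1);
        q[1] += dif2;
    }
    p[0] = p0_new;
    q[0] = q0_new;
}
```

The BS = 4 branch would replace these additive corrections with the stronger averaging expressions of Eqs. (10)–(17).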


Figure 4 Rules of the edge filtering order.

3 Related Works

In order to filter a macroblock, the value of a pixel must be read multiple times and the intermediate results of the filtering are stored in a local memory, because the following computation steps utilize them. In order to improve the use of the local memory and the filtering performance, it is necessary to reorder the filtering operations in such a manner that the intermediate results are used sooner.

The only restriction imposed by the standard in relation to the processing order is that the entire horizontal filtering which uses a given sample must occur before the vertical filtering which adopts this sample. An illustration of the computation order imposed by the standard is provided by Fig. 4.
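As an illustration only, this restriction can be expressed as a per-edge readiness check: the horizontal edge on top of a 4×4 block may be processed once the vertical edges touching its samples have been filtered. The 4×4 block grid, array layout and names below are our own sketch of the dependency rule, not the paper's scheduler.

```c
#include <stdbool.h>

/* v_done[r][c] is true once both vertical edges (left and right) of the
 * 4x4 block at row r, column c of the macroblock have been filtered.
 * The horizontal edge above block (r, c) uses samples of that block and
 * of the block above it, so both must be horizontally filtered first;
 * for r == 0 the block above belongs to the previously processed
 * macroblock and is assumed ready. */
static bool horizontal_edge_ready(const bool v_done[4][4], int r, int c)
{
    bool above_ready = (r == 0) || v_done[r - 1][c];
    return above_ready && v_done[r][c];
}
```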

The processing order proposed by the H.264/AVC standard [1] is presented in Fig. 5. As evident, the vertical borders of the luminance and chrominance blocks are all filtered before the horizontal borders. Since the results of the vertical filtering are employed in the horizontal filtering, the overall intermediate results must be stored. Consequently, this processing order is expensive in terms of memory usage and execution time. Indeed, it requires the storage of 384 bytes (16 luminance blocks and 4 blocks for each chrominance component) until the horizontal filtering occurs.

Table 1 Value of the filter clipping variable c1 as a function of Index A and BS [1].

Figure 5 Original H.264/AVC filtering order [1].

The filtering order proposed by G. Khurana [5], presented in Fig. 6, is based on an alternation between horizontal and vertical filtering of the blocks. This solution decreases the local memory size, as just one line of 4×4 blocks must be stored in order to be used by the next filtering steps. When the pixels are completely filtered (i.e. in both directions), they can be written back to the main memory in order to be displayed or to be used as a reference in the future.

Figure 6 Filtering order proposed in [5].

The proposal of He Jing [6], presented in Fig. 7, is based on both data reuse and concurrent processing (using multiple filtering cores) to increase the design throughput. This architecture exploits a parallel filtering order using two edge filters to process the vertical and the horizontal edges simultaneously. Repeated numbers in Fig. 7 correspond to the edge filterings that are executed in parallel during the same clock cycle on the two distinct filtering cores.

Figure 7 Filtering order proposed in [6].

Besides, the processing order proposed in [7] and shown in Fig. 8 significantly reduces the number of clock cycles required to process a macroblock. This solution is based on the parallel execution of horizontal and vertical filtering computations. Using the proposed computation schedule, up to three filtering cores can be used to speed up the data processing. The number of concurrent filterings is limited by the data dependencies between edges. Therewith, based on the filtering schedule proposed in [8], up to four edge filters are possible. The order of the edge filtering process is provided in Fig. 9.

Figure 8 Filtering order proposed in [7].

In fact, the vertical edges of the first sub-block row of a MB, that is, the edges numbered 0–3 in Fig. 9, are processed successively to reuse the content data as efficiently as possible. After the left and right vertical edges of a sub-block are successfully filtered, the sub-block data are transposed and then transferred to the second stage of the pair, that is, the vertical filtering process, which performs the deblocking filtering on the horizontal edges.

Figure 9 Filtering order proposed in [8].

4 Solution Based on 6 Edge-Filter Units

4.1 A Filtering Order for Up to 6 Parallel Computations

According to the restriction imposed by the H.264/AVC standard, it would be possible to perform three or more concurrent filterings in the same macroblock without a significant increase of the local memory size.

All the processing orders presented before are performed at the block level, i.e. the filtering of a 4×4 block edge is performed serially by the same filter and the border of a block can be filtered only after the filtering of the 4 LOPs (Lines of Pixels) of the previous (left) block (with a certain parallelism of computation, while respecting the constraints imposed by the H.264/AVC standard).

The architecture proposed in this paper is based on a new processing order and a dedicated local memory organization. Moreover, since the deblocking filter for chrominance pixels is almost identical to the one for luminance pixels, the data path can be shared with the effect of minimizing the idle cycles of the edge filters.

Our sample-oriented processing order allows a more effective use of the architecture parallelism without significantly increasing the local memory size. Figure 10 demonstrates the proposed filtering order. This processing order produces the same functional results as the order specified in the H.264/AVC standard [1].

Figure 10 Proposed filtering order.

Considering this processing order, up to six filterings may occur in parallel, resulting in throughput increases when the architectural design is composed of six filter cores, as detailed in Section 4.2.

4.2 Hardware Architecture Based on 6 Edge Filter Units

Based on the proposed edge filter scheduling, we have designed a dedicated architecture composed of six filter units. The architecture is shown in Fig. 11. The hardware architecture exploits six identical filter units to enhance the processing throughput. Three edge filter units are dedicated to the horizontal edges and the three others are dedicated to the vertical ones.
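For orientation, a back-of-the-envelope count shows why six 4-clock-cycle edge filterings per block cycle are consistent with the 44 cycles per macroblock reported in Section 5. The small program below is our own arithmetic, using the 48 4×4-block edge filterings per macroblock implied by the numbering of Fig. 5; it is not part of the design.

```c
#include <stdio.h>

int main(void)
{
    const int edges_per_mb    = 48;  /* 32 luma + 16 chroma 4x4 edge filterings */
    const int units           = 6;   /* 3 HF + 3 VF edge filter cores           */
    const int cycles_per_edge = 4;   /* one LOP enters a filter per clock cycle */

    /* Ideal lower bound if the six units were never idle ...            */
    int ideal_block_cycles = (edges_per_mb + units - 1) / units;   /* = 8 */

    /* ... versus the 11 block cycles (44 clock cycles) obtained once the
     * data dependencies of the schedule in Fig. 12 are respected.       */
    printf("lower bound: %d cycles/MB, proposed schedule: %d cycles/MB\n",
           ideal_block_cycles * cycles_per_edge, 11 * cycles_per_edge);
    return 0;
}
```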


Figure 11 Proposed deblocking filter architecture. J Sign Process Syst

Table 2 Input and output from the transpose module.

Cycle   Data input              Data output
1       a00, a01, a02, a03      a00, a10, a20, a30
2       a10, a11, a12, a13      a01, a11, a21, a31
3       a20, a21, a22, a23      a02, a12, a22, a32
4       a30, a31, a32, a33      a03, a13, a23, a33

This filter organization authorizes parallel computations of the horizontal filtering of vertical edges and the vertical filtering of horizontal edges. A BS computation module, one threshold calculator module, one c1 calculator module, 12 transpose modules, six 4×32 bit FIFO memories and thirteen 4×32 bit temporal buffers compose the rest of the architecture.

Edge filter computations were scheduled and bound on the filtering units. Scheduling and binding were realized specifically taking into account two main constraints: simplifying the local memory accesses and providing the best usage rate of the filtering units. In the proposed scheduling, the architecture can start the execution when all the pixel data and the BS information required for the computations have been received. This choice was made to simplify the synchronization of the I/O and computation tasks, which have a pipelined execution.

Figure 12 summarizes the filtering process within the proposed architecture in terms of "block cycles". Each block cycle requires 4 clock cycles (this corresponds to the execution time of each 4×4 block). The processing starts with the horizontal filtering.

Figure 12 Edge computation scheduling on the filtering units.

Figure 13 Local memory organization (left, top and line 1–4 memory banks for the luminance and chrominance components).


Figure 14 I/O and filtering execution sequences using 128 bit I/O interfaces.

As a matter of fact, on the first block cycle, the input pixels to be filtered, [p0,p3] and [q0,q3], are fetched to the appropriate V-edge filters (HF1, HF2, HF3) from the left_luma_mem and the line1_luma_mem, the left_chromaU_mem and the line1_chromaU_mem, and the left_chromaV_mem and the line1_chromaV_mem, respectively. In addition, the vertical edge (L0~block0), the vertical edge (L5~block16) and the vertical edge (L7~block20) are simultaneously filtered. Then the blocks L0′, L5′ and L7′ are transferred into the write stage and written into the filtered memories.

The partially filtered block0′ and block16′ are then forwarded directly again to the appropriate V-edge filter (through a FIFO memory). Block20′ is transferred to the appropriate 4×32 bit temporal buffer in order to be used in the filtering of the edge between block20 and block21. The blocks 1, L1, 4 and 17 are loaded simultaneously during the next 4 clock cycles and the edges block0′~block1, blockL1~block4 and block16′~block17 are filtered. The block0″, the block16″ (vertically filtered), the block T1 and the block T5 are sent to the transpose register for transposing, in order to be used in the vertical filtering of the edges blockT1~block0 and blockT5~block16 (with the suitable filters VF1, VF2, VF3); the block1′ and block4′ are forwarded to the suitable V-edge filters for the filtering of the edges L2~block8 and block4′~block5. This process repeats until all edges are filtered, using either the 4×32 bit FIFO memories or the 4×32 bit temporal buffers.

To authorize such an edge filter scheduling, a dedicated memory binding of the data has been developed. Figure 13 shows the memory organization that authorizes the computation scheduling without memory access conflicts. In order to guarantee that all transfers can be performed in one clock cycle without access conflicts, this architecture is composed of 14 local memory banks (each line of 4×4 blocks for luminance or chrominance pixels in independent 16×32 bit and 8×32 bit memories respectively).

The loop-filter architecture is linked to the rest of the system through two buses: one dedicated to input data and another one to output data. The bus widths are 128 bits. Data provided by the system are stored in the local memory banks according to the memory binding presented in Fig. 13.

T and Tinv units are required to transpose a 4×4 block of pixels from rows to columns and from columns to rows respectively. Because the proposed architecture is designed to perform both the horizontal and the vertical filtering of block edges using the same filter, the pixels in each block must be transposed before and after the deblocking filter. The implemented T unit completes the transpose operation of a 4×4 block in 4 clock cycles (in each clock cycle it receives one LOP). Table 2 presents the operations made by this module for one 4×4 block. The input of this block is {ai0, ai1, ai2, ai3} with i ∈ {0, 1, 2, 3} and the transposed output is {a0i, a1i, a2i, a3i} with i ∈ {0, 1, 2, 3}.

The "control filter" module is a finite state machine, responsible for the synchronization of all data transfers (memory read/write and input/output interfacing) in order to ensure that the filter module constantly processes new values.
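As a behavioural reference for the T and Tinv units, the following C sketch performs the 4×4 transpose summarized in Table 2 in a single call, whereas the hardware streams one LOP per clock cycle; types and names are illustrative.

```c
typedef unsigned char pixel_t;

/* Transpose a 4x4 block: the LOP in[i][0..3] entering the unit comes out
 * as the column out[0..3][i], as listed in Table 2. */
static void transpose_4x4(const pixel_t in[4][4], pixel_t out[4][4])
{
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            out[col][row] = in[row][col];
}
```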

Table 3 Comparison with other designs.

                              [10]        [8]         [12]        [11]        [14]        [13]        [9]         Proposed
Technology                    0.13 μm     0.13 μm     0.18 μm     0.18 μm     0.13 μm     0.18 μm     0.18 μm     65 nm
Application target            1920×1088   2560×1920   1920×1088   1920×1088   3840×2160   1920×1088   6000×4000   6000×4000
                              @60 fps     @30 fps     @30 fps     @60 fps     @30 fps     @60 fps     @30 fps     @30 fps
Working frequency (a)         62 MHz      225 MHz     100 MHz     200 MHz     98 MHz      60 MHz      135 MHz     123.75 MHz
Gate count (KGates)           19.8        36.9        12.6        21.49       22.5        58.64       41.6        25.08
Number of filter cores        2           3           2           2           2           8           4           6
Memory (bytes)                208         672         192         768         416         0           640         848
Processing time (cycles/MB)   128         260         110         204         100         56          48          44
Max. throughput (KMB/s) (b)   484         850         910         980         980         1080        2812        2812

(a) Corresponds to the frequency required to process the given application target.
(b) Throughput (KMB/s) = ((1/Fmax) × processing time)^−1.

Figure 16 Throughput comparison (KMB/s).

The filtering cores perform the filtering operations using the samples and the values of BS, the thresholds (α and β) and c1, which were previously computed. The BS calculator computes the filtering strength and the threshold calculator defines the values of α and β based on the quantization parameters of the two blocks that are being filtered. The c1 calculator is a module that, based on the filtering strength and on the threshold values, generates the clipping value that is adopted in the filtering process.

The proposed architecture authorizes a full pipelining of the I/O task with the computation task, as in [8]: once the data from the shared memory banks have been consumed by the computation units, the design can start loading the next filtering data. Indeed, the resulting data generated by the filter cores are stored in 4×32 bit local memories (temporal buffers) or in 4×32 bit "first in, first out" (FIFO) memories that hold the intermediate data to be employed in the subsequent computations while data is read from another bank. In the same way, once the computations of a 4×4 block filtering have completed, the computation results are immediately sent to the system.

Figure 14 provides an overview of the design behavior. The time required to fill the input memory banks depends on the input bus width. Indeed, depending on the bus width, the number of clock cycles required to receive the 640 pixel data from the system changes. To enable full-speed processing, 128 bit data interfaces are required. Indeed, using such a width reduces the data loading stage to 640 data / 16 data per cycle = 40 cycles. The time required for data loading is lower than the execution time.

5 Implementation and Performance Results

We have designed the hardware architecture in VHDL at the RTL level and synthesized it using the Design Compiler tool from Synopsys. However, in order to prove the correct behavior of the architecture before silicon, a co-design based implementation of the architecture was realized on an FPGA target.

Architecture validation was first realized in simulation using ModelSim. A VHDL testbench was used to send pixel data to the deblocking filter architecture and to store the computation results. Input data was extracted from real video streams using the JM decoder tool [9]. Results generated by the architecture were compared to the JM decoded ones. In a second step, we implemented the architecture in a Virtex-5 FPGA from Xilinx (ML507 board). The JM decoder was executed on the PowerPC core in the FPGA and the deblocking filter was implemented as an accelerator. The communication was realized using a PLB bus. The JM tool was modified to (1) execute the loop filter computations, (2) send/receive the data to/from the coprocessor and (3) check the bit equivalence of the software and hardware results. The video streams used in this experimentation were stored on a compact flash device.

As previously explained, the proposed architecture considers a new filter ordering and its consequent algorithm. Thus, an analysis of the number of cycles required to filter a complete macroblock in each filtering order has been established. As evident in Table 3, the proposed filtering order performs the whole filtering of a macroblock in 44 clock cycles (the filtering of each edge during a step takes 4 clock cycles).

Figure 15 shows the area profiling of the proposed work when targeting an operating frequency of 150 MHz. In fact, the proposed design improves on the performance of other works. The area consumption is still low compared to other high-performance architectures. However, the proposed solution requires more memory bytes.

Table 4 Required frequency for video standards.

Application target       Frequency
1920×1088 @ 60 fps       21.54 MHz
2560×1920 @ 30 fps       25.34 MHz
3840×2160 @ 30 fps       42.76 MHz
6000×4000 @ 30 fps       123.75 MHz

Figure 15 Hardware complexity profiling.
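The figures of Tables 3 and 4 follow directly from the 44 cycles/MB result and footnote b of Table 3: the required clock equals the number of macroblocks per second multiplied by the cycles per macroblock. The small program below is our own helper, not part of the design, and simply reproduces these numbers.

```c
#include <stdio.h>

/* A macroblock covers 16x16 pixels, so a frame holds width*height/256 MBs. */
static double required_clock_hz(int width, int height, int fps, int cycles_per_mb)
{
    double mb_per_frame = (double)(width * height) / 256.0;
    return mb_per_frame * fps * cycles_per_mb;
}

int main(void)
{
    /* 6000x4000 @ 30 fps with 44 cycles/MB -> 123.75 MHz (last row of Table 4). */
    printf("%.2f MHz\n", required_clock_hz(6000, 4000, 30, 44) / 1e6);
    /* 1920x1088 @ 60 fps -> 21.54 MHz (first row of Table 4). */
    printf("%.2f MHz\n", required_clock_hz(1920, 1088, 60, 44) / 1e6);
    return 0;
}
```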

The proposed architecture is faster than the other ones in the literature. It allows achieving a higher throughput at an identical clock frequency, or requiring a lower frequency when targeting an identical throughput. The deblocking filter is a system bottleneck in terms of processing cycles. Based on the proposed architecture, we can greatly reduce the processing cycles (only 44 clock cycles) and improve the system throughput; the number of clock cycles per macroblock is reduced by 4.5 % to 490 % compared with previous designs. The synthesis result shows that the proposed design takes 25.08 kGates, relatively lower than previous approaches [7, 8, 10]. However, this area result must be put into perspective because we consume more memory bytes.

The designs presented in Table 3 include several filter cores executing in a parallel way. In fact, when looking into the work proposed in [11], we find that the total cost of that design and ours are comparable, although a different memory organization is employed in our architecture. Thus, with our proposed design, we consume a reasonable area cost and we accelerate the computation time. Since the proposed architecture owns six edge filters, a significant issue in designing the controller is to exploit these filters almost fully.

The design in [12] reduces the gate count substantially because it performs the MB filtering with pipelined computation. However, its smaller buffer requires more frequent accesses to the external memory, which leads to a larger power consumption. It is noted that the proposed design contains 14 local memory modules in order to take advantage of parallel computing.

Figure 16 compares the throughput performance achieved by this work and some previous works. Indeed, we can see that the proposed design achieves four times the real-time performance requirement of the recent design [7]. Similarly, when comparing with [13, 14], the throughput performance of the proposed design reaches as high as three times theirs. In conclusion, compared with [7, 8, 10–14], our design achieves the highest throughput due to the lowest processing cycle count and a relatively high working frequency, with only a slight increase in the final area of the architecture despite the use of six filtering cores, since registers are used in place of the memory blocks. Thus, Fig. 16 shows that we can sustain the same throughput as [8] with a lower frequency.

In addition, the proposed work provides an effective trade-off between hardware complexity and processing capability. Our deblocking filter is able to sustain real-time video applications of 6000×4000 at 30 fps with low frequency requirements. This is due to the low number of clock cycles required to generate the filtered macroblock.

In Table 4 we present the required frequency to process several application targets based on the provided results of the proposed design. In this manner, the proposed deblocking filter architecture (producing a lower dynamic power consumption) can be employed as an IP core either in a dedicated or in a platform-based H.264/AVC codec system.

6 Conclusion

This paper presents a new hardware approach for implementing the H.264/AVC deblocking filter. The presented architecture is based on a new processing order with a new memory organization. The related solution provides an efficient filtering order with the respective algorithm, achieving better throughput results than other works. This hardware implementation is designed to be used as a part of an H.264/AVC video decoder or encoder. It benefits from several components executed in parallel. It solves the problem of real-time constraints and enables a better efficiency in video coding or decoding (the H.264/AVC deblocking filter can be used either in the decoder or in the encoder).

Acknowledgments This study was carried out for the RTEL4I project and funded by the French SYSTEM@TIC ICT cluster [15].

References

1. ISO/IEC MPEG and ITU-T (2003). Draft ITU-T Recommendation and final draft international standard of joint video specification (AVC, ISO/IEC 14496-10). ISO/IEC and ITU-T.
2. Richardson, I. E. (2003). H.264 and MPEG-4 video compression. Chichester, England: Wiley & Sons.
3. Wiegand, T., Sullivan, G. J., Bjontegaard, G., & Luthra, A. (2003). Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7), 560–576.
4. Horowitz, M., Joch, A., Kossentini, F., & Hallapuro, A. (2003). H.264/AVC baseline profile decoder complexity analysis. IEEE Transactions on Circuits and Systems for Video Technology, 13(7).
5. Khurana, G., Kassim, T., Chua, T., & Mi, M. (2006). A pipelined hardware implementation of in-loop deblocking filter in H.264/AVC. IEEE Transactions on Consumer Electronics, 52(2), 536–540.
6. Jing, H., Yan, H., & Xinyu, X. (2009). An efficient architecture for deblocking filter in H.264/AVC. In Proceedings of the Fifth International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP'09) (pp. 848–851).
7. Chien, C. A., Chang, H. C., & Gue, J. I. (2008). A high throughput in-loop de-blocking filter supporting H.264/AVC BP/MP/HP video coding. In Proceedings of the IEEE Asia Pacific Conference on Circuits and Systems (APCCAS'08) (pp. 312–315).
8. Chen, K. H. (2010). 48 cycles-per-macroblock deblocking filter accelerator for high-resolution H.264/AVC decoding. IET Circuits, Devices & Systems, 4(3), 196–206.
9. ITU (2008). H.264/AVC reference software decoder (v10.2). http://iphome.hhi.de/suehring/tml/doc/ldec/html.
10. Chen, C. M., & Chen, C. H. (2008). Configurable VLSI architecture for deblocking filter in H.264/AVC. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 16(8), 1072–1082.
11. Wei, H., Tao, L., & Zheng-hui, L. (2009). Parallel processing architecture of H.264 adaptive deblocking filters. Journal of Zhejiang University, 10(8), 1160–1168.
12. Tobajas, F., Callicó, G. M., Pérez, P. A., de Armas, V., & Sarmiento, R. (2008). An efficient double-filter hardware architecture for H.264/AVC deblocking filtering. IEEE Transactions on Consumer Electronics, 54(1), 131–139.

13. Xu, K., & Choy, C. S. (2008). Five-stage pipeline, 204 cycles/MB, single-port SRAM-based deblocking filter for H.264/AVC. IEEE Transactions on Circuits and Systems for Video Technology, 18(3), 363–374.
14. Lin, Y. C., & Lin, Y. L. (2009). A two-result-per-cycle deblocking filter architecture for QFHD H.264/AVC decoder. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 17(6), 838–843.
15. SYSTEM@TIC ICT cluster. http://www.systematic-paris-region.org/.

Moez Kthiri was born in Béja, Tunisia, in 1982. He received his degree in Instrumentation and Communication from the Faculty of Science at Sfax, his master in Electronic Engineering from the Sfax National Engineering School (ENIS), Tunisia, in 2008, and his Ph.D. degree in electronics from the IMS laboratory, University of Bordeaux 1, in 2012. He is currently an assistant professor at the Higher Institute of Applied Sciences and Technologies of Mateur (Tunisia). His research interests include digital signal processing, image and video coding with emphasis on the H.264/AVC standard, and co-design implementation.

Bertrand Le Gal was born in 1979 in Lorient, France. He received his Ph.D. degree in information and engineering sciences and technologies from the Université de Bretagne Sud, Lorient, France, in 2005 and the DEA (MS degree) in Electronics in 2002. He is currently an Associate Professor in the IMS Laboratory, ENSEIRB Engineering School, Talence, France. His research focuses on system design, high-level synthesis, SoC design methodologies and security issues in embedded devices such as Virtual Component Protection (IPP).

Patrice Kadionik received his ENSEIRB engineer diploma in 1989 and the Ph.D. degree in Instrumentation and Measurement from the University of Bordeaux, France, in 1992. After having worked for 3 years for the France Telecom group, he joined the IXL Laboratory of Microelectronics. He is currently an associate professor at the ENSEIRB School of Electrical Engineering. He teaches embedded system conception, networks and System on Chip. His main research activities include System on Chip for video compression and for sensor networks, and FPGA testing.

Ahmed Ben Atitallah received his Dipl.-Ing. and MS degrees in electronics from the National Engineering School of Sfax (ENIS) in 2002 and 2003, respectively, and his Ph.D. degree in electronics from the IMS laboratory, University of Bordeaux 1, in 2007. He is currently an assistant professor at the Higher Institute of Electronics and Communication of Sfax (Tunisia). He teaches embedded system conception and System on Chip. His main research activities are focused on image and video signal processing, hardware implementation and embedded systems.