High-Speed FPGA Architecture for CABAC Decoding Acceleration in H.264/AVC Standard

J Sign Process Syst manuscript No. (will be inserted by the editor) High-speed FPGA Architecture for CABAC Decoding Acceleration in H.264/AVC Standard Roberto R. Osorio · Javier D. Bruguera Abstract Video encoding and decoding are comput- Keywords H.264, CABAC, FPGA, arithmetic codes, ing intensive applications that require high performance video coding processors or dedicated hardware. Video decoding offers a high parallel processing potential that may be exploited. However, a particular task challenges paral- lelization: entropy decoding. In H.264 and SVC video 1 Introduction standards, this task is mainly carried out using arithmetic decoding, an strictly sequential algorithm that The new H.264/AVC [15] video coding standard repre- achieves results close to the entropy limit. By accel- sents a significant improvement in quality and compres- erating arithmetic decoding, the bottleneck is removed sion ratio over previous standards such as MPEG-2 and and parallel decoding is enabled. Many works have been MPEG-4 Part 2. This advantage comes with a price as published on accelerating pure binary encoding and de- H.264 is characterized by high computational complex- coding. However, little research has been done into how ity. to integrate binary decoding with context managing In H.264, the Context-based Adaptive Binary Arith- and control without losing performance. In this work metic Coding (CABAC) algorithm offers a significant we propose a FPGA-based architecture that achieves compression gain with respect to the baseline method real time decoding for high-definition video by sustain- (&10%). However, this algorithm cannot be parallelized ing a 1 bin per cycle throughput. This is accomplished to a significant extent. by implementing fast bin decoding; a novel and area Sequential tasks pose a threat on performance as efficient context-managing mechanism; and optimized they limit the available parallelism due to Amdahl's control scheduling. law. This problem is addressed in [3] where some solu- tions are proposed, such as high clock frequency pro- Work supported in part by Ministry of Science and Innova- cessors that are allocated to sequential tasks, or recon- tion of Spain, co-funded by the FEDER funds of the European figurable accelerators. Union, under contract TIN2010-17541, and by the Xunta de The need for hardware acceleration in arithmetic de- Galicia, Program for Consolidation of Competitive Research Groups ref. 2010/6 and 2010/28. coding (AD) has been addressed in [1] and [13] among others. Subsequently, the remaining decoding tasks are Roberto R. Osorio carried out in parallel using either reconfigurable hard- Dept. of Electronic and Systems, University of A Coru~na,Spain ware, or a number of programmable processors in a E-mail: [email protected] manycore system. Javier D. Bruguera FPGA is a mature technology that allows accelerat- Centro de Investigacion en Tecnologias de la Informacion ing computing tasks without coming into the high non- (CITIUS), recurring costs and lack of flexibility of ASICs. Recon- University of Santiago de Compostela, Spain He is a member of the European Network of Excellence on figurability is a key factor, as the combination of pro- High Performance and Embedded Architecture and Compi- grammable and reconfigurable parallel processing pro- lation (HiPEAC). vides the adaptability required by modern multi-media E-mail: [email protected] platforms that must implement multiple standards. 2 Roberto R. Osorio, Javier D. Bruguera Several architectures have been proposed for AD bin 1 1 1 1 1 1 1 1 1 0 0 0 1 1 in recent years that apply different strategies. In [8, meaning >0 >1 >2 >3 >4 >5 >6 >7 >8 <17 +04 +02 +1 neg 12, 14, 19] the goal is to decode one bit of informa- context 0,1, 3 4 5 6 6 6 6 6 exp-golombcodes offset or2 nocontextused tion (bin) per cycle. More recent papers acquiesce in higher complexity in order to decode several bins each Fig. 1 Decoding of value -10 as a motion vector cycle [2,6,9,10,18,20,21]. Actual throughput, however, varies widely. In this way, only [21] actually decodes more than 2 bins and only [2, 9, 10] sustain more than are converted to a sequence of bins before compression 1 bin. In many works, throughputs in the 0.25 - 0.75 in a process called binarization. range are obtained. In general, just a few of the works found in the liter- Hence, decoding consist of both recovering bins from ature deal with the complexity of implementing context the bitstream, and also reconstructing the original syntax- modeling and inverse binarization for several bins in the elements and take further decoding decisions. In this same cycle. In [2, 8, 18, 19, 21] this is addressed in more section, the encoding procedures will be explained first, or less detail, but most of the best performing architec- and decoding details will be provided therafter. tures [9,10] do not even consider this subject. In many Syntax-elements are binarized by applying some ba- works, complexity is reduced by decoding multiple bins sic rules as described in [11]. As an example, Fig. 1 only under certain conditions. This leads to low and shows the breakdown of value -10 as a motion vector input-dependent throughput, as mentioned in the pre- coordinate, consisting of unary code of 9 bits, followed vious paragraph. Also, the convenience of multiple bin by an Exp-Golomb code of 4 bits, and the sign. decoding is difficult to assess as none of those papers Each bin is compressed using arithmetic coding, ex- details how much the cycle length increases compared ploiting the statistics of the source, that is, the propor- to a single bin architecture. tion between the number of 0's and 1's. CABAC deals In this work, a new FPGA-based architecture is pro- with the statistical properties of the bins using 2 vari- posed for single bin arithmetic decoding. The strengths ables: mps (most probable symbol) which signals with of this architecture are: a sustained throughput of 1 a single bit whether 0 or 1 has the highest probability bin per cycle is achieved, independently of the video se- at a given time; and state, which is a 6-bit measure of quence, by eliminating stalls; context modeling is fully the skewness of the mps probability. Both values are implemented; inverse binarization is performed for all updated using transition tables every time a bin is en- the syntax-elements; the scheduling is carried out con- coded. The combination of mps and state is called a sidering all the macroblock types; a novel scheme based context. on simple transpositions is used to evaluate all data de- pendencies; and the characteristics of FPGAs are exploited in order to achieve an efficient implementation. In Section 2, the arithmetic coding and decoding al- gorithms are presented. Section 3 explains the decoding challenges at macro-block level. Section 4 describes the 2.1 Contexts architecture, focusing on the high level control. Section 5 describes syntax-elements and bin decoding. Section In CABAC, more than 400 contexts are used to reflect 6 presents the results and compares with other archi- the different statistics of the different types of bins. In tectures. the last row in Fig. 1 we see which context is used for each bin in the example (actually, an offset with respect to a base context is shown). Some bins share the same 2 Arithmetic decoding in H.264 context as they have similar statistics, and some others do not have a context assigned. The latter ones are the The Context-based Adaptive Binary Arithmetic Cod- bypass bins, for which 0 and 1 are equally probable. ing (CABAC) [11] is the preferred entropy coding al- Another kind of bins with a fixed probability, ”final”, gorithm in H.264 Main and High profiles. CABAC en- also exist. codes the different pieces of information produced by As it can be seen in Fig. 1, the first bin is encoded us- the H.264 encoder, such as motion vectors, transform ing 1 out of 3 possible contexts. This is done in CABAC coefficients and flags, referred to as syntax-elements. because the statistics are dependent on the value of The binary arithmetic coder in CABAC deals only with neighboring syntax-elements. This is further addressed bit-size data, called bins. Therefore, all syntax-elements in Section 3. High-speed FPGA Architecture for CABAC Decoding Acceleration in H.264/AVC Standard 3 2.2 Encoding Table 1 Characteristics of MB types I MB P MB B MB The encoding process itself uses 2 variables: low and On slices... I, P, B P, B B range. For every bin that is encoded, they are updated Skip MB none skip skip & as follows: direct Partitions 4x4, PCM, 16x16, 8x16, 16x16, 8x16, 16x16 16x8, 8x8 16x8, 8x8 MPS lownew = low Sub- none 8x4, 4x8, 8x4, 4x8, partitions 4x4 4x4 − rangenew = range rLP S (1) Intra-mode 4x4 & none none prediction 16x16 LPS lownew = low + range − rLP S 8x8 only if 4x4 yes yes rangenew = rLP S (2) transform prediction Motion none only forward forward & where LPS stands for least probable symbol and rLPS vectors backward is the probability of LPS scaled to the current value Reference none one list two lists of range. The rLPS is read from a look-up-table using frames range and state. The value of range is a 9-bit number that must be in [256, 511]. If it drops below 256, it is left-shifted (normalized) as many bits as needed to put ber of syntax-elements: motion vectors, transform coef- it back in the right interval.

Load more