Efficient Video Decoding on Gpus by Point Based Rendering

Graphics Hardware (2006) M. Olano, P. Slusallek (Editors) Efficient Video Decoding on GPUs by Point Based Rendering Bo Han and Bingfeng Zhou National Engineering Research Center for New Technologies in Electronic Publishing Institute of Computer Science and Technology Peking University, Beijing, P.R.China Abstract To accelerate computation intensive video decoding tasks, we present a novel framework to offload most decoding operations to current GPUs. Our method is based on rendering graphics points and suitable for block-based video standards. By representing video blocks as graphics points, we achieve great flexibility and high paral- lelism to utilize the GPU’s pipelined stream processing architecture. The computational resources within texture units and blending units are also exploited to facilitate computations. We propose a high performance implementation of IDCT on GPUs, which efficiently excludes most zero-value coefficients to save the bandwidth and the computations. Compared with the existing quad-based representation, our point based implementation of MC greatly reduces data transfer and redundancy. We have demonstrated the efficiency of our proposed framework by a MPEG-2 decoder. Our results indicate a significant improvement over prior CPU and GPU solutions. Categories and Subject Descriptors (according to ACM CCS): I.3.3 [Computer Graphics]: Graphics hardware 1. Introduction designed specifically for video algorithms. However, they cost additional transistors and are independent from the tra- Digital video applications have become an essential compo- ditional 3D graphics hardware. Until now these VPs still lack nent of our daily lives, ranging form high-definition televi- a high level programming interface to release the computa- sions to multimedia mobile devices. For most users, efficient tional power. video decoding is their most concern. But advanced video compression techniques adopted in current video standards Driven by interactive 3D graphics applications, commod- make the decoding process one of the most computational ity graphics hardware has evolved into a powerful and flex- demanding tasks. The popular high definition (HD) videos ible graphics processor , known as GPUs. Their perfor- and the new video standards propose challenges to current mance and functionalities have made GPUs attractive as co- CPUs for both intensive computations and huge volumes of processors for various general-purpose computation prob- data. Therefore, a means should be found to offload some ∗ lems [OLG 05]. Architecturally, GPUs are highly parallel decoding tasks from CPUs to other sub-systems. streaming processors optimized for vector operations, which ∗ Graphics cards have been used to assist video decoding is similar to some media processors [ORK 02][JL05]. Due over the last decade. First, video overlay was introduced to to this facts, it should be possible to map the video coding handle expensive color space conversion. Then, dedicated algorithms on GPUs, and we expect such a solution can have decoding hardware emerged on PCs and has been widely several advantages. First, the well established graphics APIs adopted on today’s commodity graphics chips due to the well and high level shader languages can provide a convenient defined DirectX Video Acceleration(DXVA) specification. programming interface. Therefore, the implementation can Unfortunately, most of them are built on hard-wired circuits be independent of underlying hardware and platforms. Sec- only for certain specific video standards (most for MPEG-2 ondly, the higher performance growth rate over Moore’s law playback). Recently, graphics vendors have began to inte- and increasingly enhanced functionalities make GPUs more grate programmable video-processing engines in their prod- promising. Thirdly, a unified multimedia processing subsys- ucts, such as nVidia’s PureVideo and ATI’s Avivo. The en- tem, capable of handling different types multimedia con- gines are built on underlying SIMD vector processors(VPs) tents, could be implemented on GPUs, which can achieve c The Eurographics Association 2006. B.Han & B.Zhou / Efficient Video Decoding on GPUs by Point Based Rendering ∗ higher utilization of hardware resources and is specially at- rithms, Shen et al. [SGL 05] first put forward to use generic tractive for mobile devices. GPUs to accelerate video decoding. By using small quads to represent video blocks, they move the whole motion com- Since GPUs are specially designed for graphics opera- pensation feedback loop and the CSC to the GPU, thus tions, it is hard to directly map the complex and branchy leading to a considerable performance speedup. Kelly and video decoding algorithms onto the graphics pipeline. To- Kokaram [KK04] proposed to use texture bilinear filtering day’s video standards share a similar block/macroblock units on the GPU for fast image interpolation and combined structure. Each block has its own parameters and character- it with motion estimation. Then, two DCT/IDCT methods istics, which makes the computation not efficient in a sin- on GPUs [FSLC05][Nvi05] were proposed; they are based gle quad-texture used by traditional GPGPU applications. on direct matrix multiplication and the JPEG ANN fast al- Recently some researchers have began to exploit GPUs to ∗ gorithms respectively. The performance of the former one is implement partial video/image decoding process [SGL 05] higher due to its regular memory accesses. Both two meth- [FSLC05][HL05]. They adopt the texture-based GPGPU ods can achieve comparable performance to the optimized model and demonstrate the feasibilities, but the insignificant CPU implementation. However, in practice the expensive performance speedup makes them not attractive enough in a floating point texture updating and unpacking can greatly practical real-time decoder. hamper the overall performance, which has been addressed The main contributions of this paper are concluded as fol- in the recent study of [HL05]. In that paper, Hirvonen and lows. First, we propose a novel GPU-accelerated video de- Leppänen implemented a GPU-based H.263 decoder based coding framework based on rendering graphics points. By on similar techniques with the previous works. They demon- mapping video blocks to points, we transform decoding op- strated that current GPUs are capable of handling all the pro- erations into highly parallel graphics tasks. Our point-based cessing stages of H.263 video decoder except serial variable block representation greatly reduces the data transfer and re- length decoding. However, their work mainly focuses on the dundancy. Secondly, we present a high performance IDCT feasibility and concerns a little about the performance. implementation of the straightforward basis-image combi- In terms of graphics points, due to the simplicity and su- nation approach on GPUs, which excludes most zero co- perior flexibility, point sets have became increasingly attrac- efficients and fully utilizes the graphics pipeline for higher tive as an alternative surface representation as well as for efficiency. It is specially suitable for decoding highly com- processing of complex 3D models [KB04]. Our work in this pressed videos and images. Thirdly, our study demonstrate paper is mainly inspired by its advantages shown in large the GPU’s great potential capabilities to perform video cod- particle systems [KKKW05][KSW04]. In addition, Krüger ing algorithms. Some discussions about modifications on et al. [KW03] also introduce a special case of using points GPUs are given and expected to improve the performance to perform linear algebra operations on sparse random ma- for generic media processing. trices. They use sets of vertices to render the matrix values at The rest of the paper is organized as follows. Section the correct position, which exploits the sparseness and saves 2 gives a brief survey on related works. Section 3 high- a significant amount of GPU memory and operations. lights some characteristics of a typical block-based video decoder to facilitate understanding of our implementation. We present details of our point-based decoding methods in sec- 3. Characteristics of Decoding Modules tion 4 and evaluate our methods on aspects of performance In this section we give an overview of a typical block-based and visual quality in section 5. Section 6 and 7 gives the dis- video decoder and analyze its characteristics to facilitate un- cussion and concludes the paper. derstanding of our implementation on the GPU. Today’s major video standards incorporate block-based 2. Related Works transform coding and motion compensation techniques to In this section we give a brief overview of previous works exploit both the spatial and the temporal redundancy of a on using GPUs for video coding/processing and highlight video sequence. The macroblock, corresponding to a 16 × several related point based applications. 16-pixel region of a frame, is typically the basic unit for motion compensation. It is usually organized with four 8 × 8 In the area of video processing, GPUs have been widely luminance sample blocks and two chrominance blocks used used to render or post process video pictures such as fil- for transform coding which is typically the discrete cosine tering, composition, visual effects, and color space con- ∗ transform (DCT). A block diagram of a standard video de- version(CSC) [App05][DNVRH 05] . Moreover, Ben et coder is shown in Figure 1. al. [CCLW05] gave the comparison and analysis of FPGAs and GPUs in their use for video processing, which indi- Each decoding module has its own characteristics. In gen- cates GPUs can offer a viable solution but fall down on ap- eral, variable length decoding (VLD) is a sequential bit-wise plications with high memory accesses, such as 2D convo- process and is the only functional module where GPUs can lution in a big size. For the low level video coding algo- not offer any advantage. The following inverse zigzag scan c The Eurographics Association 2006. B.Han & B.Zhou / Efficient Video Decoding on GPUs by Point Based Rendering bit stream and can directly omit zero values to utilize the sparseness CPU GPU property of DCT matrix X, which is quite suitable for the VLD IZS IQ IDCT CSC Display GPU’s stream processing architecture.

Load more