Multiview Video Coding Extension of the H.264/AVC Standard
Ognjen Nemčić 1, Snježana Rimac-Drlje 2, Mario Vranješ 2
1 Supra Net Projekt d.o.o., Zagreb, Croatia
2 University of Osijek, Faculty of Electrical Engineering, Osijek, Croatia

52nd International Symposium ELMAR-2010, 15-17 September 2010, Zadar, Croatia
This research is supported by Supra Net Projekt d.o.o., Zagreb, Croatia.

Abstract - An overview of the new Multiview Video Coding (MVC) extension of the H.264/AVC standard is given in this paper. The benefits of multiview video coding are examined by encoding two data sets captured with multiple cameras. These sequences were encoded using the MVC reference software, and the results are compared with references obtained by encoding the multiple views independently with the H.264/AVC reference software. An experimental comparison of coding efficiency is given. The peak signal-to-noise ratio (PSNR) and the Structural Similarity (SSIM) Index are used as objective quality metrics.

Keywords - H.264/AVC, MVC, inter-view prediction, temporal prediction, video quality

I. INTRODUCTION

H.264/AVC is the state-of-the-art video coding standard [1], developed jointly by the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). H.264/AVC was developed to provide efficient compression and high reliability in video transmission. Owing to its high compression efficiency and error robustness, the H.264/AVC standard has become widely accepted and used for video content storage, streaming and transmission of real-time multimedia. The most recent work of VCEG and MPEG focuses on multiview video. Multiview video represents a three-dimensional (3-D) scene captured by two or more cameras at slightly different angles. Typical services using this type of scene capturing and representation include entertainment (3-D cinemas and PC gaming), education, surveillance, immersive telepresence and videoconferencing, three-dimensional television (3DTV) and free viewpoint television (FTV). These 3-D services require new types of media that expand the user experience beyond what is offered by traditional media. 3DTV, also referred to as stereo TV, offers a 3-D depth impression of the observed scene, while FTV allows for an interactive selection of viewpoint and direction within a certain operating range. All of the mentioned services share similar components of the processing chain, consisting of multiview video capture, 3-D scene representation, coding, transmission, rendering and display. In comparison with a conventional 2-D scene, a 3-D scene representation usually requires a significantly higher amount of data; hence efficient compression for data storage, and transmission over limited bandwidth with little degradation and delay, are challenging tasks. Therefore, multiview video coding (hereinafter referred to as MVC) has gained significant attention recently. In July 2008, MPEG officially approved an amendment of the ITU-T Rec. H.264 & ISO/IEC 14496-10 Advanced Video Coding (AVC) standard on Multiview Video Coding.

II. MULTIVIEW VIDEO CODING

MVC was designed as an amendment of H.264/AVC and uses the improved coding tools that have made H.264/AVC superior to older standards. It is based on conventional block-based motion-compensated video coding, with several new features which significantly improve its rate-distortion performance. The main new features of H.264/AVC, also used by MVC, are the following:
• variable block-size motion-compensated prediction with the block size down to 4x4 pixels;
• quarter-pixel motion vector accuracy;
• multiple reference pictures for motion compensation;
• bi-directionally predicted pictures used as references for motion prediction;
• intra-picture prediction in the spatial domain;
• adaptive deblocking filter within the motion-compensated prediction loop;
• small block-size transformation (4x4 block transform);
• enhanced entropy coding methods: Context-Adaptive Variable-Length Coding (CAVLC) and Context-Adaptive Binary Arithmetic Coding (CABAC).

Special attention is given to improving robustness to data errors or losses during transmission over different networks. The standard defines a network abstraction layer (NAL), which maps H.264/AVC video coding layer (VCL) data to different transport layers. The VCL creates a coded representation of the video content, while the NAL organizes these data into units adapted to the specific application and network. MVC essentially performs block-based predictive coding across the cameras in addition to predictive coding along the time axis of each camera, hence achieving high compression efficiency.

Fig. 1 shows the overall MVC system structure. In addition to temporal redundancies between adjacent frames of an individual camera view, multiple camera signals also contain a large amount of statistical dependencies.

Figure 1. Overall MVC system architecture

The encoder receives N temporally synchronized video streams and generates one output bitstream. The decoder receives the bitstream, decodes it and outputs N video signals. MVC views are identified by arbitrary view ID numbers which do not imply any order or a specific dependency between different views.

Since MVC shares some design principles with the Scalable Video Coding (SVC) amendment of H.264/AVC, many features available in SVC have been reused in MVC. Similar to SVC, the MVC amendment defines a new NAL Unit Header Extension. While SVC allows identification of multiple layers in the SVC NAL Unit Header Extension, the MVC NAL Unit Header Extension contains a View Order Index. The View Order Index is the parameter that represents the relation between different views. By carefully defining the dependencies between different camera views, high coding efficiency can be achieved.

To meet the requirements of the applications envisioned for MVC (Sect. I), scalability and adaptability, as well as random access, need to be supported. For example, in a 3DTV scenario, advanced displays capable of displaying multiple views would decode more views than stereoscopic displays that display only two views. The MVC standard defines efficient ways that enable easy extraction of any subset of the views from the entire bitstream. Since applications based on two-dimensional displays are still widely used, backward compatibility has also been an important target for MVC. The MVC standard achieves backward compatibility by defining a bitstream from which a compliant H.264/AVC decoder can decode a single 2D view and discard the rest of the data, whereas a compliant MVC decoder can decode all the views and generate the 3D video. Temporal random access is provided by inserting an I picture, typically every 0.5 to 1 second. Backward compatibility is also supported by the related communication protocols for transport over the MPEG-2 Transport Stream and the Internet Protocol (IP), meaning that a device capable of receiving an H.264/AVC stream over the MPEG-2 Transport Stream or the Real-Time Transport Protocol over IP is also capable of receiving an MVC stream over these protocols.

Some requirements of a video coding standard are often contradictory to one another, such as providing high compression efficiency, but with low delay. Usually these scenarios cover real-time conversational applications, such as videoconferencing, telepresence and mobile

These subsets are referred to as profiles. For multiview coding, the Multiview High profile is defined. The Multiview High profile supports two or more views using both inter-picture (temporal) and MVC inter-view prediction, but does not support field pictures and macroblock-adaptive frame-field coding. The levels in MVC impose constraints on the processing power and the memory size in order to limit the maximum coded frame size, the frame rate and the number of reference pictures stored for prediction purposes. In comparison with H.264/AVC, encoding multiview video sequences with a large number of frames requires a bigger Decoded Picture Buffer (DPB) to store all reference pictures necessary for prediction. Therefore the DPB size also needs to be considered when implementing a multiview prediction structure.

III. INTER-VIEW PREDICTION STRUCTURE

Several research groups addressed MVC and developed dedicated inter-view/temporal prediction structures to efficiently exploit all statistical dependencies within the multiview video data set. The structure developed by Fraunhofer HHI for the case of a 1D camera arrangement (linear or arc) is depicted in Fig. 2 [2].

Figure 2. Multiview video coding structure combining inter-view and temporal prediction based on H.264/AVC hierarchical B pictures

This scheme first uses inter-view prediction to provide P pictures for even views (Camera 2, 4 and 6). The rest of the pictures in even camera views are predicted with no further inter-view dependencies, i.e. by using hierarchical B pictures in the temporal direction. Hierarchical B pictures provide significantly improved compression performance when the quantization parameters for the various pictures are assigned appropriately. Odd camera views (Camera 1, 3 and 5) are obtained by combining inter-view prediction from the two adjacent even views with a hierarchical B coding structure in the temporal direction. For an even number of views, the last view represents a specific case for prediction. Camera 7 is coded as shown, starting with an inter-view predicted P picture, followed by hierarchical B pictures, which are also inter-view predicted from the previous view. Thus, the coding scheme can