Multiview Coding Extension of the H.264/AVC Standard

Ognjen Nemčić 1, Snježana Rimac-Drlje 2, Mario Vranješ 2
1 Supra Net Projekt d.o.o., Zagreb, Croatia
2 University of Osijek, Faculty of Electrical Engineering, Osijek, Croatia

Abstract - An overview of the new Multiview Video Coding (MVC) extension of the H.264/AVC standard is given in this paper. The benefits of multiview video coding are examined by encoding two data sets captured with multiple cameras. These sequences were encoded using MVC reference software and the results are compared with references obtained by encoding multiple views independently with H.264/AVC reference software. An experimental comparison of coding efficiency is given. The peak signal-to-noise ratio (PSNR) and the Structural Similarity (SSIM) Index are used as objective quality metrics.

Keywords - H.264/AVC, MVC, inter-view prediction, temporal prediction, video quality

I. INTRODUCTION

H.264/AVC is the state-of-the-art video coding standard [1], developed jointly by the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Pictures Experts Group (MPEG). H.264/AVC was developed to provide efficient compression and high reliability in video transmission. Its high compression efficiency and error robustness are already widely accepted, and the standard is widely used for video content storage, streaming and transmission of real-time multimedia.

The most recent work of VCEG and MPEG focuses on multiview video. Multiview video represents a three-dimensional (3-D) scene captured by two or more cameras at slightly different angles. Typical services using this type of scene capturing and representation include entertainment (3-D cinemas and PC gaming), education, surveillance, immersive telepresence and videoconferencing, three-dimensional television (3DTV) and free viewpoint television (FTV). These 3-D services require new types of media that expand the user experience beyond what is offered by traditional media. 3DTV, also referred to as stereo TV, offers a 3-D depth impression of the observed scene, while FTV allows for an interactive selection of viewpoint and direction within a certain operating range. All of the mentioned services share similar components of the processing chain, consisting of multiview video capture, 3-D scene representation, coding, transmission, rendering and display. In comparison with a conventional 2-D scene, a 3-D scene representation usually requires a significantly higher amount of data, hence efficient compression for data storage or transmission with less degradation and delay over limited bandwidth is a challenging task. Therefore, multiview video coding (hereinafter referred to as MVC) has gained significant attention recently. In July 2008, MPEG officially approved an amendment of the ITU-T Rec. H.264 & ISO/IEC 14496-10 (AVC) standard on Multiview Video Coding.

II. MULTIVIEW VIDEO CODING

MVC was designed as an amendment of H.264/AVC and uses the improved coding tools which have made H.264/AVC superior to older standards. It is based on conventional block-based motion-compensated video coding, with several new features which significantly improve its rate-distortion performance. The main new features of H.264/AVC, also used by MVC, are the following:

• variable block-size motion-compensated prediction with the block size down to 4x4 pixels;
• quarter-pixel motion vector accuracy;
• multiple reference pictures for motion-compensated prediction;
• bi-directional predicted picture as a reference for motion prediction;
• intra-picture prediction in the spatial domain;
• adaptive deblocking filter within the motion-compensated prediction loop;
• small block-size transformation (4x4 block transform);
• enhanced entropy coding methods: Context-Adaptive Variable-Length Coding (CAVLC) and Context-Adaptive Binary Arithmetic Coding (CABAC).

Special attention is given to the improvement of robustness to data errors or losses during transmission over different networks. The standard defines a network abstraction layer (NAL), which maps H.264/AVC video coding layer (VCL) data to different transport layers. The VCL creates a coded representation of the video content, while the NAL organizes these data into units, adapted for a specific application and network. MVC essentially performs block-based predictive coding across the cameras in addition to predictive coding along the time axis of each camera, hence achieving high compression efficiency.

Fig. 1 shows the overall MVC system structure. In addition to temporal redundancies between adjacent frames of an individual camera view, multiple camera signals also contain a large amount of statistical dependencies.
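As a rough illustration of the NAL unit organization described above, the following Python sketch splits the first byte of a NAL unit header into its three fields (1-bit forbidden_zero_bit, 2-bit nal_ref_idc, 5-bit nal_unit_type, as in the H.264/AVC syntax). The names NalHeader, parse_nal_header and base_view_units are hypothetical and exist only for this example. The set of unit types treated as extension data (14, 15 and 20) is an assumption about how a single-view decoder would skip multiview data; it is not taken from this paper.

```python
from typing import Iterable, List, NamedTuple

# Assumption for this sketch: NAL unit types that carry extension data
# (prefix NAL unit, subset sequence parameter set, coded slice extension).
EXTENSION_NAL_TYPES = frozenset({14, 15, 20})


class NalHeader(NamedTuple):
    forbidden_zero_bit: int  # 1 bit, expected to be 0
    nal_ref_idc: int         # 2 bits, 0 means "not used for reference"
    nal_unit_type: int       # 5 bits, identifies the payload type


def parse_nal_header(first_byte: int) -> NalHeader:
    """Split the first NAL unit header byte into its three bit fields."""
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("a NAL unit header byte must be in the range 0..255")
    return NalHeader(
        forbidden_zero_bit=(first_byte >> 7) & 0x1,
        nal_ref_idc=(first_byte >> 5) & 0x3,
        nal_unit_type=first_byte & 0x1F,
    )


def base_view_units(first_bytes: Iterable[int]) -> List[int]:
    """Keep only the header bytes a single-view decoder would process."""
    return [
        b for b in first_bytes
        if parse_nal_header(b).nal_unit_type not in EXTENSION_NAL_TYPES
    ]


if __name__ == "__main__":
    # Hypothetical header bytes: types 7, 8, 5 (base view) and 15, 20 (extension).
    stream = [0x67, 0x68, 0x65, 0x6F, 0x74]
    for byte in stream:
        print(hex(byte), parse_nal_header(byte))
    print([hex(b) for b in base_view_units(stream)])
```

Filtering on nal_unit_type alone keeps the sketch short. A real extractor would also have to locate NAL unit boundaries in the byte stream, which is omitted here.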

This research is supported by Supra Net Projekt d.o.o., Zagreb, Croatia

52nd International Symposium ELMAR-2010, 15-17 September 2010, Zadar, Croatia

Figure 1. Overall MVC system architecture

The encoder receives N temporally synchronized video streams and generates one output bitstream. The decoder receives the bitstream, decodes it and outputs N video signals. MVC views are identified by arbitrary view ID numbers which do not imply any order or a specific dependency between different views.

Since MVC shares some design principles with the Scalable Video Coding (SVC) amendment of H.264/AVC, many features available in SVC have been reused in MVC. Similar to SVC, the MVC amendment defines a new NAL Unit Header Extension. While SVC allows identification of multiple layers in the SVC NAL Unit Header Extension, the MVC NAL Unit Header Extension contains a View Order Index. The View Order Index is the parameter that represents the relation between different views. By carefully defining the dependencies between different camera views, high coding efficiency can be achieved.

To meet the requirements of applications envisioned for MVC (Sect. I.), scalability and adaptability, as well as random access, need to be supported. For example, in a 3DTV scenario, advanced displays capable of displaying multiple views would decode more views than stereoscopic displays that display only two views. The MVC standard defines efficient ways that enable easy extraction of any subset of the views from the entire bitstream. Since applications based on two-dimensional displays are still widely used, backward compatibility has also been an important target for MVC. The MVC standard achieves backward compatibility by defining a bitstream from which a compliant H.264/AVC decoder can decode a single 2D view and discard the rest of the data, whereas a compliant MVC decoder can decode all the views and generate the 3D video. Temporal random access is provided by inserting an I picture typically every 0.5 to 1 second. Backward compatibility is also supported by the related communication protocols for transport over the MPEG-2 Transport Stream and the Internet Protocol (IP), meaning that a device capable of receiving an H.264/AVC stream over the MPEG-2 Transport Stream or the Real-Time Transport Protocol over IP is also capable of receiving an MVC stream over these protocols.

Some requirements of a video coding standard often contradict one another, such as providing high compression efficiency with low delay. Usually these scenarios cover real-time conversational applications, such as videoconferencing, telepresence and mobile video applications. In these cases a good trade-off needs to be found.

Like H.264/AVC, the MVC standard also defines subsets of the coding tools available for the encoding process, identified to meet a certain set of specifications of the intended applications. These subsets are referred to as profiles. For multiview coding, the Multiview High profile is defined. The Multiview High profile supports two or more views using both inter-picture (temporal) and MVC inter-view prediction, but does not support field pictures and macroblock-adaptive frame-field coding. The levels in MVC impose constraints on the processing power and the memory size in order to limit the maximum coded frame size, the frame rate and the number of reference pictures stored for prediction purposes. In comparison with H.264/AVC, encoding multiview video sequences with a large number of frames requires a bigger Decoded Picture Buffer (DPB) to store all reference pictures necessary for prediction. Therefore the DPB size also needs to be considered when implementing a multiview prediction structure.

III. INTER-VIEW PREDICTION STRUCTURE

Several research groups addressed MVC and developed dedicated inter-view/temporal prediction structures to efficiently exploit all statistical dependencies within the multiview video data set. The structure developed by Fraunhofer HHI for the case of a 1D camera arrangement (linear or arc) is depicted in Fig. 2 [2].

Figure 2. Multiview video coding structure combining inter-view and temporal prediction based on H.264/AVC hierarchical B pictures

This scheme first uses inter-view prediction to provide P pictures for even views (Camera 2, 4 and 6). The rest of the pictures in even camera views are predicted with no further inter-view dependencies, i.e. by using hierarchical B pictures in the temporal direction. Hierarchical B pictures provide significantly improved compression performance when the quantization parameters for the various pictures are assigned appropriately. Odd camera views (Camera 1, 3 and 5) are obtained by combining inter-view prediction from 2 adjacent even views and the hierarchical B coding structure in the temporal direction. For an even number of views, the last view represents a specific case for prediction. Camera 7 is coded as shown, starting with an inter-view predicted P picture, followed by hierarchical B pictures, which are also inter-view predicted from the previous view. Thus, the coding scheme can be applied to any multiview setting with more than 2 views. To allow random access, I pictures are inserted (Camera 0/T0, Camera 0/T8, etc., as shown in Fig. 2). The example above is for a Group of Pictures (GOP) length of 8, meaning that every 8th picture of the base Camera 0 view is an I picture. The syntax of the hierarchical B pictures implemented in
H.264/AVC is very flexible and allows multiview GOPs of any length to be used. For encoding the sequences and obtaining the experimental results provided in this paper, a GOP length of 12 pictures is used.

IV. EXPERIMENTAL RESULTS

The main goal of MVC is to provide a gain in rate-distortion performance when compared to encoding all of the views independently. Therefore, all views were first encoded independently using the JVT JM 9.5 H.264/AVC software [3]. The resulting decoded video signals served as the reference for evaluating the rate-distortion performance of the JVT JMVC 7.1 codec, which uses inter-view/temporal dependencies among the adjacent camera views during the encoding process. The test data set consisted of two sequences, ballroom and exit [4]. These sequences represent multiview video data captured using 8 cameras in a 1D linear arrangement with 20 cm width [5]. Fig. 3 represents all available views at one time instance. The leftmost frame represents the N-th frame of view 0, while the rightmost one represents the N-th frame of view 7.

Figure 3. Example of multiview video test data sequences ballroom (set of camera views in upper row) and exit (set of camera views in lower row)

Both sequences are provided in YUV 4:2:0 planar format with 640 x 480 pixel resolution, a frame rate of 25 fps and 250 frames. The ballroom sequence (represented by the set of frames in the upper row in Fig. 3) is a dynamic scene containing fast movement of the dancers and more object overlapping than is present in the second sequence, exit. The exit sequence (the set of camera views in the lower row in Fig. 3) is a mostly static scene with a few persons slowly moving from the right part of the scene towards the door in the middle of the scene.

During the first encoding stage, the views were encoded independently using the H.264/AVC High profile at level 4 with typical settings and parameters (CABAC, variable block size, rate control). In the second stage, for MVC, the Multiview High profile was used with a GOP size of 12 for each view. Since the H.264/AVC codec enables rate control, the ballroom sequence was coded at 128, 192, 256, 384 and 512 kbps, while the exit sequence required lower bitrates due to the lower amount of motion present in the scene, namely 96, 128, 192, 256 and 384 kbps. The bitrate of the MVC stream is controlled by varying the quantization parameter (QP). In general, the higher the QP, the higher the compression ratio, but the lower the quality of the coded video. For comparing MVC encoded material with the references obtained using H.264/AVC, the QPs needed to be carefully chosen in order to roughly match the bitrate provided by the H.264/AVC encoder with rate control enabled. For quality comparison, the PSNR and SSIM objective measures for the luma component were obtained using MSU Video Measurement Tool 2.6, available from [6]. Table I. summarizes the obtained coding results.

TABLE I. MVC AND H.264/AVC COMPARISON RESULTS

Sequence   AVC avg.        MVC QP   Average PSNR Y, dB   Average SSIM Y
           bitrate, kbps            AVC      MVC         AVC      MVC
Ballroom   128             40       27,45    29,98       0,761    0,827
           192             37       29,41    31,75       0,814    0,867
           256             34       30,82    33,33       0,847    0,896
           384             31       32,75    35,08       0,884    0,92
           512             29       34,06    36,08       0,904    0,932
           Average coding gain      2,35 dB              0,047
Exit       96              37       32,76    34,40       0,870    0,894
           128             34       34,13    35,83       0,889    0,911
           192             31       35,83    37,12       0,91     0,924
           256             29       36,82    37,85       0,92     0,930
           384             26       38,03    38,83       0,931    0,938
           Average coding gain      1,29 dB              0,015

Fig. 4 and Fig. 5 represent the PSNR Y and SSIM Y results of encoding the ballroom sequence, respectively. The solid line on both graphs represents the average of the PSNR Y and SSIM Y values across all camera views obtained by MVC for a set of targeted average bitrates. The dashed line represents the average PSNR Y and SSIM Y values obtained using H.264/AVC at roughly the same average bitrate as MVC.

Figure 4. Ballroom sequence coding results (PSNR Y)

Figure 5. Ballroom sequence coding results (SSIM Y)
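The "Average coding gain" rows of Table I can be reproduced with a few lines of Python. The sketch below is only an illustration: the numbers are transcribed from Table I with decimal points instead of decimal commas, and the names TABLE_I and average_gains are hypothetical. It assumes that each per-rate gain is the MVC value minus the AVC value in the same row, and that the average is the arithmetic mean over the 5 rate points.

```python
from statistics import mean

# Rows transcribed from Table I:
# (AVC avg. bitrate in kbps, MVC QP, PSNR Y AVC, PSNR Y MVC, SSIM Y AVC, SSIM Y MVC)
TABLE_I = {
    "ballroom": [
        (128, 40, 27.45, 29.98, 0.761, 0.827),
        (192, 37, 29.41, 31.75, 0.814, 0.867),
        (256, 34, 30.82, 33.33, 0.847, 0.896),
        (384, 31, 32.75, 35.08, 0.884, 0.920),
        (512, 29, 34.06, 36.08, 0.904, 0.932),
    ],
    "exit": [
        (96, 37, 32.76, 34.40, 0.870, 0.894),
        (128, 34, 34.13, 35.83, 0.889, 0.911),
        (192, 31, 35.83, 37.12, 0.910, 0.924),
        (256, 29, 36.82, 37.85, 0.920, 0.930),
        (384, 26, 38.03, 38.83, 0.931, 0.938),
    ],
}


def average_gains(rows):
    """Return (mean PSNR Y gain in dB, mean SSIM Y gain) of MVC over AVC."""
    psnr_gain = mean(mvc - avc for _, _, avc, mvc, _, _ in rows)
    ssim_gain = mean(mvc - avc for _, _, _, _, avc, mvc in rows)
    return psnr_gain, ssim_gain


if __name__ == "__main__":
    for name, rows in TABLE_I.items():
        psnr_gain, ssim_gain = average_gains(rows)
        print(f"{name}: PSNR Y gain {psnr_gain:.2f} dB, SSIM Y gain {ssim_gain:.3f}")
```

With these inputs the PSNR Y gains round to 2.35 dB for ballroom and 1.29 dB for exit, matching the table. The SSIM Y gain recomputed from the rounded table entries comes out at about 0.046 for ballroom, slightly below the published 0,047, which is presumably a rounding effect of the printed values.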

Fig. 6 and Fig. 7 depict the coding efficiency comparison obtained for the exit sequence.

Figure 6. Exit sequence coding results (PSNR Y)

Figure 7. Exit sequence coding results (SSIM Y)

For each rate point a gain is computed as the difference between the PSNR Y obtained by MVC and by AVC. An average gain is computed as the mean of the gains at the 5 targeted bitrates. Fig. 8 depicts these average gains.

Figure 8. Average PSNR Y and SSIM Y gains for both test data sets

The average gain of MVC is almost 2 times higher for the ballroom sequence (2,35 dB) than for the exit sequence (1,29 dB). The results also show that the gain for the ballroom sequence equals 2,53 dB at the lowest bitrate (128 kbps) and tends to drop as the bitrate increases, reaching 2,02 dB at the highest bitrate (512 kbps). The gain for the exit sequence shows the same trend. The conclusion is that MVC achieves higher compression efficiency at lower bitrates and should also be taken into account for services and applications using limited-capacity communication channels, such as video for mobile services. The difference between the average gains for the ballroom and exit sequences is most likely related to the more dynamic nature of the content in the ballroom sequence.

Fig. 9 shows the benefit of exploiting the inter-view/temporal prediction structure described in Sect. III. The view order used (reflecting the order in which views are coded) is 0 2 1 4 3 6 5 7. The results presented on the graph show that even views require more bitrate than odd ones. This is expected, because inter-view prediction is used only to provide P pictures at the beginning of each GOP within individual even views. It can also be seen that odd views 1, 3 and 5 use less bitrate, because all the pictures contained within these views are obtained by combining inter-view prediction from 2 adjacent even views and the hierarchical B coding structure in the temporal direction. Additionally, view 7 consumes more bitrate than views 1, 3 and 5 because it is predicted using the inter-view dependency on only one adjacent view (view 6).

Figure 9. Ballroom sequence coded using inter-view/temporal prediction of MVC (QP=29, average bitrate=515,5 kbps, average PSNR Y=36,08 dB)

V. CONCLUSION

In this paper the basic concepts of multiview video coding are presented and the inter-view prediction structure is examined. The obtained experimental results indicate that MVC provides up to 2,53 dB higher PSNR Y at roughly the same bitrate in comparison with H.264/AVC, which does not exploit inter-view dependencies. Our work also shows that MVC provides the same quality of the coded signal using around 30% (sequence exit) and 40% (ballroom) less bitrate on average. The bitrate savings are most pronounced at the lower bitrates.

REFERENCES

[1] ISO/IEC 14496-10 and ITU-T Rec. H.264: "Advanced Video Coding", 2003.
[2] P. Merkle, K. Müller, A. Smolic and T. Wiegand, "Efficient compression of multi-view video exploiting inter-view dependencies based on H.264/MPEG4-AVC", ICME 2006, IEEE International Conference on Multimedia and Exposition, Toronto, Ontario, Canada, Jul. 2006.
[3] JVT JM version 9.5 H.264/AVC Reference Software, available online at http://iphome.hhi.de/suehring/tml/
[4] ftp://ftp.merl.com/pub/avetro/mvc-testseq
[5] A. Kubota, A. Smolic, M. Magnor, M. Tanimoto, T. Chen, C. Zhang: "Multiview Imaging and 3DTV", IEEE Signal Processing Magazine, Vol. 24, No. 6, pp. 10-21, Nov. 2007.
[6] MSU Graphics&Media Lab, www.compression.ru/video/
