OVERVIEW OF THE MULTIVIEW HIGH EFFICIENCY VIDEO CODING (MV-HEVC) STANDARD

Miska M. Hannuksela1, Ye Yan2, Xuehui Huang2, and Houqiang Li2

1Nokia Technologies, 2University of Science and Technology of China

ABSTRACT

This paper reviews the multiview extension (MV-HEVC) of the High Efficiency Video Coding (HEVC) standard. MV-HEVC is capable of multiview video coding with or without accompanying depth views. The key design concepts and design elements of MV-HEVC are described in the paper. Furthermore, the features and characteristics of MV-HEVC compared to other standardized video codec extensions for three-dimensional (3D) video coding are reviewed.

Index Terms: HEVC, MV-HEVC, 3D video coding

1. INTRODUCTION

While stereoscopic content is most commonly used in today's 3D video content and services, an increasing amount of attention has been paid to depth-enhanced and multiview 3D video. Depth or disparity information can be used in depth-image-based rendering (DIBR) [1] to synthesize views as decoder-side post-processing. When applied with stereoscopic displays, view synthesis can be used for adjusting the disparity between the displayed views according to viewers' preferences and viewing conditions, such as viewing distance and display size, in order to reach as comfortable a 3D experience as possible. Depth-enhanced multiview video together with DIBR can also be used with multiview autostereoscopic displays to generate a required number of views for displaying from the received views. The multiview-video-plus-depth (MVD) format refers to a data format with more than one texture view, each paired with the depth view of the same viewpoint [2].

The two most recent international video coding standards, namely the Advanced Video Coding (AVC) standard (also known as H.264) [3] and the High Efficiency Video Coding (HEVC) standard (also known as H.265) [4], were initially intended for two-dimensional (2D) video. Multiview extensions for both AVC and HEVC have been developed, referred to as Multiview Video Coding (MVC) [3][5] and Multiview HEVC (MV-HEVC) [4], respectively. For depth-enhanced multiview video coding one can use the AVC extensions referred to as multiview video and depth coding (MVC+D) [3][6] and multiview and depth video with enhanced non-base view coding (3D-AVC) [3], and the HEVC extensions MV-HEVC and 3D-HEVC [7][8]. A fundamental principle of MVC+D and MV-HEVC is to re-use the coding tools of the underlying 2D codecs, i.e. AVC and HEVC, respectively, so that implementations can be realized by software changes to high-level syntax at the slice header level and above. On the contrary, 3D-AVC and 3D-HEVC introduce new low-level coding tools and aim at improved compression efficiency compared to MVC+D and MV-HEVC, respectively.

Section 2 of this paper reviews the design of MV-HEVC. Compared to earlier reviews, such as [9], the paper describes the design of the approved MV-HEVC standard. Section 3 of the paper presents a comprehensive feature comparison between 3D video coding standards. Finally, conclusions are provided in Section 4.

2. MULTI-LAYER HEVC EXTENSIONS

The development of the scalable extensions (SHVC) [4][9][10] of HEVC started when the MV-HEVC project was already ongoing. In an early phase of the SHVC development, it was decided to use an approach requiring only high-level syntax changes as well as inter-layer processing. Furthermore, since SHVC and MV-HEVC both shared the fundamental principle of using only the HEVC coding tools for slice data, their design was unified [11][12][13]. This high-level syntax principle is achieved by enabling the inclusion of pictures originating from direct reference layers in the reference picture list(s) used for decoding pictures of predicted layers, while otherwise these inter-layer reference pictures are treated identically to any other reference pictures. Eventually, both SHVC and MV-HEVC were released as parts of HEVC version 2 [4] and share the same specification of the multi-layer extensions. As one consequence of this unified approach, views are treated as scalable layers like any other type of scalability.

An elementary unit in HEVC bitstreams is called a network abstraction layer (NAL) unit, consisting of a header and a payload. The NAL unit header contains a 6-bit NAL unit type, a 6-bit layer identifier called nuh_layer_id, and a 3-bit temporal sub-layer identifier. A video parameter set (VPS) NAL unit specifies a mapping of the layer identifiers to values of scalability dimensions, including a dependency identifier for SHVC, a view order index for MV-HEVC and 3D-HEVC, a depth flag for 3D-HEVC, and an auxiliary layer type identifier, referred to as AuxId, which may be used in any multi-layer bitstream. A selectable number of scalability dimensions can be used in a multi-layer bitstream. The VPS also indicates a mapping from each view order index to a view identifier value, where each unique view identifier value represents a distinct viewpoint. Temporal sub-layers are used for temporal scalability, i.e. they provide the capability of extracting sub-bitstreams of different picture rates. For more information on the high-level syntax and temporal scalability of HEVC, the reader is referred to [14].
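The two-byte NAL unit header layout described above can be made concrete with a short C sketch. The bit positions follow the HEVC specification, while the struct and function names are only illustrative and are not taken from any reference software.

#include <stdint.h>

/* Illustrative parsing of the two-byte HEVC NAL unit header.
 * Bit layout per the HEVC specification: forbidden_zero_bit (1),
 * nal_unit_type (6), nuh_layer_id (6), nuh_temporal_id_plus1 (3). */
typedef struct {
    unsigned nal_unit_type;  /* identifies the NAL unit payload type            */
    unsigned nuh_layer_id;   /* layer identifier (0 for the base layer)         */
    unsigned temporal_id;    /* temporal sub-layer, nuh_temporal_id_plus1 - 1   */
} NalUnitHeader;

static NalUnitHeader parse_nal_unit_header(const uint8_t hdr[2])
{
    NalUnitHeader h;
    h.nal_unit_type = (hdr[0] >> 1) & 0x3F;                    /* bits 6..1 of byte 0          */
    h.nuh_layer_id  = ((hdr[0] & 0x01) << 5) | (hdr[1] >> 3);  /* 6 bits split over both bytes */
    h.temporal_id   = (hdr[1] & 0x07) - 1;                     /* nuh_temporal_id_plus1 - 1    */
    return h;
}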

Auxiliary picture layers, originally proposed in [15], enable multiplexing of supplemental coded video into a bitstream that also contains the primary coded video. It was decided to enable depth views with the auxiliary picture layer mechanism in MV-HEVC. An AuxId value equal to 2 associated, in the VPS, with a layer indicates that the layer represents a sequence of depth pictures. More detailed properties of the depth auxiliary layers, such as the depth or disparity range represented by the sample values, can be provided with the depth representation information supplemental enhancement information (SEI) message [4]. Depth auxiliary layers are also associated with a view order index and a view identifier; hence, multiview depth coding is supported in MV-HEVC. It is noteworthy that 3D-HEVC uses the depth flag scalability dimension to indicate depth views. This is because 3D-HEVC enables the use of depth information for coding non-base texture views, whereas auxiliary picture layers are not allowed to affect the decoding of the primary video. As a consequence, depth views of MV-HEVC and 3D-HEVC are not compatible.

As scalable multi-layer bitstreams enable decoding of more than one combination of layers and temporal sub-layers, the multi-layer HEVC decoding process is given as input a target output operation point, specifying the output layer set (OLS) and the highest temporal sub-layer to be decoded. An OLS represents a set of layers, each of which is either a necessary or an unnecessary layer. A necessary layer is either an output layer, meaning that the pictures of the layer are output by the decoding process, or a reference layer, meaning that its pictures may be directly or indirectly used as a reference for prediction of pictures of any output layer. The VPS includes a specification of OLSs, and can also specify buffering requirements and parameters for OLSs. Unnecessary layers are not required to be decoded for reconstructing the output layers but can be included in OLSs for indicating buffering requirements for such sets of layers in which some layers are coded with potential future extensions.

Multi-layer HEVC extensions support hybrid codec scalability, in which the base layer is coded as a separate non-HEVC bitstream. The multi-layer HEVC decoding process takes as input the decoded base layer pictures and certain properties for them. The base layer bitstream and the HEVC enhancement bitstream are separate, and it is up to the systems layer to handle synchronization between the bitstreams as well as the connection between the base and enhancement layer decoding processes. The hybrid codec scalability feature enables 3D services that are compatible with 2D AVC decoding. For example, broadcast services could provide a conventional 2D service for AVC-capable devices and build 3D capability using either MV-HEVC or 3D-HEVC on top of the AVC service.

While earlier video coding standards specified profile-level conformance points applying to a bitstream, the multi-layer HEVC extensions specify layer-wise conformance points. To be more exact, a profile-tier-level (PTL) combination is indicated for each necessary layer of each OLS, while even finer-grain temporal-sub-layer-based PTL signaling is allowed. Consequently, decoder capabilities can be indicated as a list of PTL values, where the number of list elements indicates the number of layers supported by the decoder. Non-base layers that are not inter-layer predicted can be indicated to conform to a single-layer profile, such as the Main profile, while they also require the so-called independent non-base layer decoding (INBLD) capability to deal correctly with layer-wise decoding. The Multiview Main profile was specified for MV-HEVC and suits both non-base texture views and non-base depth views.

Table I shows an example of a bitstream containing three texture views and three depth views, both coded with the so-called IBP inter-view prediction order, where the left-side view (I view) is coded independently of the other views, the right-side view (P view) may utilize inter-view prediction from the I view, and the center view (B view) may be predicted from both the I and P views. As can be seen, the view order index values of the respective views (left, center, right) of texture and depth are identical. The VPS is used to specify the mapping of view order index and AuxId values to nuh_layer_id values, of which Table I shows one example where the depth views follow the texture views in decoding order, while other mappings and decoding orders would also be possible. The reference layers are indicated in the VPS according to the IBP inter-view prediction pattern. In this example, four OLSs are specified in the VPS. The 0th OLS contains only the texture base view conforming to the Main profile. The 1st and 2nd OLSs represent stereoscopic video with wide (side views) and narrow (adjacent views) baseline, respectively. As can be seen from Table I, the P view is not an output layer but is a necessary layer in the narrow-baseline OLS. Each necessary non-base view conforms to the Multiview Main profile. In the 3rd OLS, all texture and depth views are output. The independently coded depth view is indicated to conform to the Main profile with the INBLD capability, while all predicted views comply with the Multiview Main profile.

Table I. Example bitstream with three texture views and three depth views with IBP inter-view prediction

              AuxId  nuh_layer_id  ViewOrderIdx  Ref. layers     0th OLS,           1st OLS,             2nd OLS,                 3rd OLS,
                                                 (nuh_layer_id)  base texture view  wide stereo texture  narrow stereo texture    3-view MVD
Color video
  Left        0      0             0             -               output, Main       output, Main         output, Main             output, Main
  Center      0      2             2             0, 1            -                  -                    output, MV Main          output, MV Main
  Right       0      1             1             0               -                  output, MV Main      necessary only, MV Main  output, MV Main
Depth video
  Left        2      3             0             -               -                  -                    -                        output, Main (INBLD)
  Center      2      5             2             3, 4            -                  -                    -                        output, MV Main
  Right       2      4             1             3               -                  -                    -                        output, MV Main

(For each OLS, "output" marks an output layer and "necessary only" a necessary layer that is not output; the profile given is that of the necessary layer. MV Main denotes the Multiview Main profile.)
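Read together with the text above, the example of Table I can also be summarized in code. The following C sketch stores the same per-layer scalability mapping and the four OLS definitions as plain data structures; the type and field names are illustrative only and are not VPS syntax element names.

/* Illustrative data structures holding the Table I example: the per-layer
 * mapping signalled in the VPS and the four output layer sets. */
typedef struct {
    int nuh_layer_id;
    int aux_id;            /* 0 = primary (texture), 2 = depth auxiliary layer */
    int view_order_idx;
    int num_ref_layers;
    int ref_layer_ids[2];  /* direct reference layers (nuh_layer_id)           */
} LayerInfo;

typedef struct {
    int num_layers;
    int layer_ids[6];      /* necessary layers of the OLS                      */
    int is_output[6];      /* 1 if the corresponding layer is an output layer  */
} OutputLayerSet;

static const LayerInfo layers[6] = {
    { 0, 0, 0, 0, { 0, 0 } },  /* texture left (I view), independently coded     */
    { 1, 0, 1, 1, { 0, 0 } },  /* texture right (P view), predicted from id 0    */
    { 2, 0, 2, 2, { 0, 1 } },  /* texture center (B view), predicted from 0 and 1 */
    { 3, 2, 0, 0, { 0, 0 } },  /* depth left, independently coded                */
    { 4, 2, 1, 1, { 3, 0 } },  /* depth right, predicted from depth left         */
    { 5, 2, 2, 2, { 3, 4 } },  /* depth center, predicted from depth left/right  */
};

static const OutputLayerSet olss[4] = {
    { 1, { 0 },                { 1 } },                /* 0th OLS: base texture view             */
    { 2, { 0, 1 },             { 1, 1 } },             /* 1st OLS: wide stereo texture           */
    { 3, { 0, 1, 2 },          { 1, 0, 1 } },          /* 2nd OLS: narrow stereo; P view is
                                                          necessary but not output               */
    { 6, { 0, 1, 2, 3, 4, 5 }, { 1, 1, 1, 1, 1, 1 } }, /* 3rd OLS: 3-view MVD, all layers output */
};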

3. COMPARISON OF MVD CODING STANDARDS

Table II presents a comparison of the features of the MVD coding extensions overviewed in Section 1. The following sub-sections provide further details on the information provided in Table II, with a specific focus on MV-HEVC. The goal of this section is to assist in selecting a 3D coding format that suits the properties and purposes of the used video acquisition equipment, display, and application.

Table II. Comparison of MVD coding extensions (yes* = supported with constraints explained in the text)

                          AVC extensions          HEVC extensions
                          MVC+D      3D-AVC       MV-HEVC      3D-HEVC
Data format
  Unpaired MVD            yes        yes*         yes          yes*
Mixed resolution
  Texture w.r.t. depth    yes        yes          yes          yes*
  Between texture views   no         no           yes*         yes*
Compatibility
  Base texture view       AVC        AVC          AVC or HEVC  AVC or HEVC
  Second texture view     MVC        MVC or       MV-HEVC      MV-HEVC or
                                     3D-AVC                    3D-HEVC

3.1. Rate-distortion (RD) performance

The RD performance of 3D-HEVC relative to MV-HEVC was analyzed in [16] with 3-view MVD coding using view synthesis optimization (VSO) [8] in the encoding. The JCT-3V common test conditions [17] were used, according to which the RD performance of the coded texture views as well as of three synthesized views between each pair of coded views is compared. It was found that 3D-HEVC reduced the bitrate by about 14% and 20% for the coded texture views and the synthesized views, respectively, compared to MV-HEVC in terms of the Bjontegaard delta bitrate (dBR) [18]. No results for a 2-view scenario were provided in [16], but it can be assumed that the bitrate reduction of 3D-HEVC compared to MV-HEVC would be roughly halved. Furthermore, it is mentioned in [16] that only one depth view is coded for MV-HEVC. Encoder-side optimizations and verification of the RD performance difference between MV-HEVC and 3D-HEVC are therefore subjects of further studies.
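Since the comparisons in this section are reported as Bjontegaard delta bitrates, a minimal sketch of the usual dBR calculation [18] is given below: a cubic polynomial is fitted to log(bitrate) as a function of PSNR for the anchor and the test codec, both fits are integrated over the overlapping PSNR range, and the mean log-domain difference is converted to a percentage. Four (bitrate, PSNR) points per codec are assumed; the function names are illustrative and the code is not taken from any reference tool.

#include <math.h>

static void fit_cubic(const double x[4], const double y[4], double c[4])
{
    /* Exact interpolation through four points: solve the Vandermonde system
     * [1 x x^2 x^3] c = y by Gauss-Jordan elimination with partial pivoting. */
    double a[4][5];
    for (int i = 0; i < 4; i++) {
        a[i][0] = 1.0; a[i][1] = x[i]; a[i][2] = x[i] * x[i];
        a[i][3] = x[i] * x[i] * x[i]; a[i][4] = y[i];
    }
    for (int col = 0; col < 4; col++) {
        int piv = col;
        for (int r = col + 1; r < 4; r++)
            if (fabs(a[r][col]) > fabs(a[piv][col])) piv = r;
        for (int k = 0; k < 5; k++) {
            double t = a[col][k]; a[col][k] = a[piv][k]; a[piv][k] = t;
        }
        for (int r = 0; r < 4; r++) {
            if (r == col) continue;
            double f = a[r][col] / a[col][col];
            for (int k = 0; k < 5; k++) a[r][k] -= f * a[col][k];
        }
    }
    for (int i = 0; i < 4; i++) c[i] = a[i][4] / a[i][i];
}

static double poly_integral(const double c[4], double lo, double hi)
{
    double fh = c[0] * hi + c[1] * hi * hi / 2 + c[2] * hi * hi * hi / 3 + c[3] * hi * hi * hi * hi / 4;
    double fl = c[0] * lo + c[1] * lo * lo / 2 + c[2] * lo * lo * lo / 3 + c[3] * lo * lo * lo * lo / 4;
    return fh - fl;
}

/* Average bitrate difference of the test codec against the anchor in percent;
 * negative values mean a bitrate saving. */
static double bd_rate(const double rate_anchor[4], const double psnr_anchor[4],
                      const double rate_test[4], const double psnr_test[4])
{
    double ya[4], yt[4], ca[4], ct[4];
    double lo_a = psnr_anchor[0], hi_a = psnr_anchor[0];
    double lo_t = psnr_test[0], hi_t = psnr_test[0];
    for (int i = 0; i < 4; i++) {
        ya[i] = log(rate_anchor[i]);
        yt[i] = log(rate_test[i]);
        if (psnr_anchor[i] < lo_a) lo_a = psnr_anchor[i];
        if (psnr_anchor[i] > hi_a) hi_a = psnr_anchor[i];
        if (psnr_test[i] < lo_t) lo_t = psnr_test[i];
        if (psnr_test[i] > hi_t) hi_t = psnr_test[i];
    }
    fit_cubic(psnr_anchor, ya, ca);
    fit_cubic(psnr_test, yt, ct);
    double lo = lo_a > lo_t ? lo_a : lo_t;   /* overlapping PSNR interval */
    double hi = hi_a < hi_t ? hi_a : hi_t;
    double avg_a = poly_integral(ca, lo, hi) / (hi - lo);
    double avg_t = poly_integral(ct, lo, hi) / (hi - lo);
    return (exp(avg_t - avg_a) - 1.0) * 100.0;
}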
3.2. Data format

Uncompressed depth information can be obtained for camera-captured content generally in two ways, either through depth estimation or by using depth sensors. In a depth estimation process, the corresponding pixels in adjacent views are searched, resulting in a disparity picture or, equivalently, a depth picture. Many depth sensors produce pictures having a significantly lower spatial resolution than the resolution produced by typical color image sensors. Depth sensors and color image sensors are typically located adjacently, i.e. the viewpoint of the depth pictures differs from the viewpoint(s) of the color image sensors. The data format called unpaired MVD, introduced in [17], comprises texture and depth views that need not represent the same viewpoints. The unpaired MVD format provides flexibility in depth acquisition and reduced complexity in encoder-side pre-processing.

All depth-enhanced coding formats support the MVD and the unpaired MVD data formats. When using the unpaired MVD format, texture coding tools that use depth views have to be selectively turned off in 3D-AVC and 3D-HEVC. As there are no dependencies from depth views to texture views or vice versa in MVC+D and MV-HEVC, the selection of the transmitted views, e.g. for bitrate adaptation, can be performed flexibly, e.g. resulting in bitstreams with an unequal number of texture and depth views. 3D-AVC and 3D-HEVC require more careful view extraction to ensure that cross-component prediction dependencies are obeyed in the transmitted bitstream.

When only a limited amount of receiver-side disparity adjustment is required, unpaired MVD can be used to improve RD performance, as studied in [20] and [21] for MVC+D and 3D-AVC, respectively. As no similar study had been made for MV-HEVC, an experiment was performed for this paper. The JCT-3V common test conditions (CTC) [17] were used for coding MV-HEVC content with two texture views and one depth view. The baseline was adjusted by 10% at the decoding end through the view synthesis arrangement described in [20]. The results are reported in Table IIIa, indicating that, with the exception of one sequence, the transmission of one depth view provides improved compression efficiency. The use of the unpaired MVD format with MV-HEVC is therefore justified from the RD performance point of view when the disparity adjustment needs are limited.
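The disparity adjustment applied in this experiment relies on the relation between depth samples and pixel disparity that underlies DIBR. The C sketch below shows one common form of this conversion for 8-bit depth maps; the convention and parameter names are assumptions for illustration and do not reproduce the exact view synthesis arrangement of [20].

/* Illustrative conversion of an 8-bit depth sample to a pixel disparity for
 * DIBR-style view synthesis. A common convention maps sample value 255 to the
 * nearest depth z_near and 0 to the farthest depth z_far, linearly in 1/z; the
 * focal length f (in pixels) and camera baseline b then give the horizontal
 * disparity d = f * b / z. Names and convention are illustrative assumptions. */
static double depth_sample_to_disparity(unsigned char v, double z_near, double z_far,
                                        double focal_px, double baseline)
{
    double inv_z = (v / 255.0) * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far;
    return focal_px * baseline * inv_z;   /* d = f * b / z, in pixels */
}

Scaling the baseline passed to such a conversion, e.g. by a factor of 0.9, is one way a decoder-side view synthesizer can realize a baseline adjustment of the kind applied in the experiment above.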

3.3. Mixed spatial resolution

Mixed resolution between texture and depth views may be desirable to achieve reduced computational complexity in cases where the acquired depth resolution is inherently smaller than the texture resolution, e.g. when a depth sensor has been used to acquire a depth map sequence. As can be observed from Table II, the codecs are similar when it comes to supporting a mixed resolution between texture and depth views. 3D-HEVC includes depth-based texture coding tools that require the same spatial resolution to be applied in both texture and depth pictures. Hence, in order to support unequal spatial resolutions between texture and depth views, 3D-HEVC encoders need to turn off the depth-based texture coding tools.

A proper selection of the depth resolution can provide compression benefits when considering both coded and synthesized views, as studied in [22] and [23] in the context of MVC+D and 3D-AVC, respectively. Early results for MV-HEVC were provided in [24], while for this paper we performed a new set of simulations with a recent version of the reference software, HTM 12.1 [19], with depth pictures having half of the resolution of the texture pictures both vertically and horizontally. The resampling was performed as explained in [25], and the JCT-3V CTC [17] were used. As can be seen from Table IIIb, the bitrate is reduced by more than 5% on average, while the RD performance of the synthesized views remains practically unchanged. Even though VSO was not used in these simulations, they indicate that a depth resolution lower than the texture resolution can provide a good trade-off between RD performance and complexity.
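For concreteness, the sketch below shows a straightforward 2x horizontal and vertical downsampling of an 8-bit depth map. It is not the nonlinear resampling of [25] that was used in the simulations above; plain block averaging of this kind blurs depth edges, which is one reason nonlinear methods are preferred in practice.

/* Illustrative 2x downsampling of a depth map to half resolution in each
 * dimension, shown only to make the resolution reduction concrete. */
static void downsample_depth_2x(const unsigned char *src, int width, int height,
                                unsigned char *dst)  /* dst is (width/2) x (height/2) */
{
    int out_w = width / 2, out_h = height / 2;
    for (int y = 0; y < out_h; y++) {
        for (int x = 0; x < out_w; x++) {
            const unsigned char *p = src + (2 * y) * width + 2 * x;
            int sum = p[0] + p[1] + p[width] + p[width + 1];      /* 2x2 block   */
            dst[y * out_w + x] = (unsigned char)((sum + 2) / 4);  /* rounded mean */
        }
    }
}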

Table III. MV-HEVC performance in a) 2V+1D unpaired MVD and b) 3-view MVD with 1/4-resolution depth, relative to conventional MV-HEVC (dBR %)

       a) Unpaired MVD               b) Depth resolution
       Coded views  10% adjustment   Coded views  Synthesized views
S01    -5.0         -2.4             -7.8          0.4
S02    -2.4         -2.0             -9.2         -4.4
S03    -1.1         -0.7             -6.8          6.8
S04    -1.5         -1.3             -2.2         -0.3
S05    -7.9         -4.7             -5.4         -0.1
S06    -6.0         -1.1             -3.4         -0.6
S08    -6.7          6.8             -1.5          1.3
S10    -4.2         -4.5             -8.0         -6.0
Avg    -4.4         -1.2             -5.6         -0.4

Mixed-resolution stereoscopic video has been broadly studied, e.g. in [26][27][28][29], as it could potentially achieve a quality comparable to symmetric-resolution stereoscopic video while reducing computational complexity thanks to the smaller number of samples processed. The compared MVD coding extensions use pictures of other views without resampling as inter-view reference pictures, and hence none of them supports mixed-resolution coding when inter-view prediction is applied. However, both MV-HEVC and 3D-HEVC support views having different resolutions when no prediction takes place between the views.

3.4. Compatibility

The MVD coding extensions provide seamless compatibility with 2D video decoding. The HEVC extensions have the additional benefit that they can use a base view of any coding standard, such as AVC. This feature enables i) an efficient use of hardware designs including both AVC and HEVC decoders for stereoscopic services, as well as ii) the establishment of 3D video services providing compatibility with AVC-capable 2D clients while using HEVC-based coding technology for the non-base views and depth views to keep the additional bitrate as low as possible. Similarly, it may be desired to support clients with stereoscopic decoding capability and other clients with multiview decoding capability with the same content or service. It can be seen from Table II that 3D-AVC and 3D-HEVC leave it up to the encoder to determine whether the second view conforms to MVC and MV-HEVC, respectively, or requires the dedicated coding tools of 3D-AVC and 3D-HEVC, respectively.

4. CONCLUSIONS

The paper reviewed the design of the MV-HEVC standard, which introduces only high-level changes over the HEVC standard, facilitating implementations through software updates to existing HEVC codecs. Compared to the 3D-HEVC approach of introducing dedicated 3D coding tools, MV-HEVC is inferior in rate-distortion performance but is easier to implement and provides more flexibility in terms of the data format and scalability, as explained in the paper.

REFERENCES

[1] L. McMillan Jr., "An image-based approach to three-dimensional computer graphics," PhD thesis, Department of Computer Science, University of North Carolina, 1997.
[2] A. Smolic et al., "Multi-view video plus depth (MVD) format for advanced 3D video systems," Joint Video Team (JVT) document JVT-W100, Apr. 2007.
[3] ITU-T Recommendation H.264, "Advanced video coding for generic audiovisual services," Feb. 2014.
[4] ITU-T Recommendation H.265, "High efficiency video coding," Oct. 2014.
[5] Y. Chen, Y.-K. Wang, K. Ugur, M. M. Hannuksela, J. Lainema, and M. Gabbouj, "The emerging MVC standard for 3D video services," EURASIP Journal on Advances in Signal Processing, vol. 2009, Article ID 786015, 13 pages, 2009. doi:10.1155/2009/786015.
[6] Y. Chen, M. M. Hannuksela, T. Suzuki, and S. Hattori, "Overview of the MVC+D 3D video coding standard," Journal of Visual Communication and Image Representation, vol. 25, no. 4, pp. 679-688, May 2014.
[7] G. Tech, K. Wegner, Y. Chen, and S. Yea (editors), "3D-HEVC draft text 7," JCT-3V document JCT3V-K1001, Apr. 2015.
[8] Y. Chen, G. Tech, K. Wegner, and S. Yea (editors), "Test model 10 of 3D-HEVC and MV-HEVC," JCT-3V document JCT3V-J1003, Nov. 2014.
[9] G. J. Sullivan, J. M. Boyce, Y. Chen, J.-R. Ohm, C. A. Segall, and A. Vetro, "Standardized extensions of High Efficiency Video Coding (HEVC)," IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 6, pp. 1001-1016, Dec. 2013.
[10] Y. Ye and P. Andrivon, "The scalable extensions of HEVC for ultra-high-definition video delivery," IEEE MultiMedia, vol. 21, no. 3, pp. 58-64, July-Sept. 2014.
[11] K. Ugur, M. M. Hannuksela, J. Lainema, and D. Rusanovskyy, "Unification of scalable and multi-view extensions with HLS only changes," JCT-VC document JCTVC-L0188, Jan. 2013.
[12] J. Chen, J. Boyce, Y. Ye, and M. M. Hannuksela, "SHVC Working Draft 1," JCT-VC document JCTVC-L1008, Mar. 2013.
[13] G. Tech, K. Wegner, Y. Chen, M. M. Hannuksela, and J. Boyce, "MV-HEVC draft text 3," JCT-3V document JCT3V-C1004, Mar. 2013.
[14] R. Sjöberg, Y. Chen, A. Fujibayashi, M. M. Hannuksela, J. Samuelsson, T. K. Tan, Y.-K. Wang, and S. Wenger, "Overview of HEVC high-level syntax and reference picture management," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1858-1870, Dec. 2012.
[15] M. M. Hannuksela, "REXT/MV-HEVC/SHVC HLS: auxiliary picture layers," JCT-VC document JCTVC-O0041 and JCT-3V document JCT3V-F0031, Oct. 2013.
[16] Y. Chen, G. Tech, K. Müller, A. Vetro, L. Zhang, and S. Shimizu, "Comparative results for 3D-HEVC and MV-HEVC with depth coding," JCT-3V document JCT3V-G0109, Jan. 2014.
[17] K. Müller and A. Vetro, "Common test conditions of 3DV core experiments," JCT-3V document JCT3V-G1100, Jan. 2014.
[18] G. Bjøntegaard, "Calculation of average PSNR differences between RD-curves," ITU-T SG16 Q.6 (VCEG) document VCEG-M33, Apr. 2001.
[19] HTM reference software, online: https://hevc.hhi.fraunhofer.de/svn/svn_3DVCSoftware/tags/
[20] P. Aflaki, D. Rusanovskyy, M. M. Hannuksela, and M. Gabbouj, "Unpaired multiview video plus depth compression," Proc. of International Conference on Digital Signal Processing, July 2013.
[21] L. Chen, D. Rusanovskyy, P. Aflaki, and M. M. Hannuksela, "3D-AVC: Coding of unpaired MVD data," JCT-3V document JCT3V-D0161, Apr. 2013.
[22] P. Aflaki, M. M. Hannuksela, and M. Gabbouj, "Flexible depth map spatial resolution in depth-enhanced multiview video coding," Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2014.
[23] P. Aflaki, M. M. Hannuksela, and X. Huang, "CE7: Removal of texture-to-depth resolution ratio restrictions," JCT-3V document JCT3V-E0035, July 2013.
[24] S. Shimizu and S. Sugimoto, "AHG13: Results with quarter resolution depth map coding," JCT-3V document JCT3V-G0151, Jan. 2014.
[25] P. Aflaki, M. M. Hannuksela, D. Rusanovskyy, and M. Gabbouj, "Nonlinear depth map resampling for depth-enhanced 3-D video coding," IEEE Signal Processing Letters, vol. 20, no. 1, pp. 87-90, Jan. 2013.
[26] M. G. Perkins, "Data compression of stereopairs," IEEE Transactions on Communications, vol. 40, no. 4, pp. 684-696, Apr. 1992.
[27] H. Brust, A. Smolic, K. Müller, G. Tech, and T. Wiegand, "Mixed resolution coding of stereoscopic video for mobile devices," Proc. of 3DTV Conference, May 2009.
[28] G. Saygili, C. G. Gurler, and A. M. Tekalp, "Quality assessment of asymmetric stereo video coding," Proc. of IEEE International Conference on Image Processing, pp. 4009-4012, Sept. 2010.
[29] P. Aflaki, M. M. Hannuksela, and M. Gabbouj, "Subjective quality assessment of asymmetric stereoscopic 3D video," Springer Journal of Signal, Image and Video Processing, Mar. 2013. Online: http://doi.org/10.1007/s11760-013-0439-0
