Video Coding and Transmission Standards for 3D Television – a Survey
OPTO-ELECTRONICS REVIEW 21(1), 39-51. DOI: 10.2478/s11772-013-0072-z

A. BUCHOWICZ*
Institute of Radioelectronics, Warsaw University of Technology, 15/19 Nowowiejska Str., 00-665 Warsaw, Poland
*e-mail: [email protected]

Emerging 3D television systems require effective techniques for the transmission and storage of the data representing a 3-D scene. Scene representations based on multiple video sequences, or on multiple views plus depth maps, are especially important, since they can be processed with existing video technologies. This paper presents a review of the relevant video coding and transmission techniques.

Keywords: television, video coding, video broadcasting, video streaming.

1. Introduction

The goal of a 3D television (3DTV) system is to create a 3-D viewing experience for the viewer. Sophisticated information processing techniques are currently being developed to achieve this goal. The main areas of research are:
- 3-D scene capturing, to create a 3-D scene representation suitable for further processing [1];
- coding and transmission of the 3-D scene representation;
- display technologies, to present a 3-D scene reconstructed from its representation.

The simplest 3-D scene representation can be obtained with two video cameras in a stereoscopic setup (Fig. 1). Only a very limited 3-D viewing experience can be achieved with such a representation: the scene can be observed from a single fixed position. This 3-D mode is widely supported by currently available displays; however, special 3-D glasses must be worn to perceive the depth of the scene.

Fig. 1. Stereo view capturing module (C - video camera).

A multiview scene representation obtained with a multi-camera capturing system (Fig. 2) offers higher flexibility: viewers can switch between views and select their viewing position in the scene. Stereoscopic displays with an eye/head tracking system or multiview autostereoscopic displays [2] can be used for view selection. The free viewpoint television (FTV) systems widely discussed in the literature [3,4] are based on this concept.

Fig. 2. Multiview capturing system (C - video camera).

The number of views in a multiview 3-D scene representation is fixed and limited by the throughput of the transmission channel, so the 3-D viewing experience is confined to a very narrow angular range. Additional scene geometry information, e.g., depth maps, can be added to overcome this limitation. The 3-D scene representation based on multiple views with added depth maps (multiview plus depth, MVD) [5] allows additional views to be synthesized, thus reducing the number of views that have to be transmitted. The depth maps can be estimated by analysing the neighbouring views (Fig. 3) [6].

Fig. 3. Multiview plus depth capturing system (C - video camera, DE - depth estimation unit).

Depth maps can also be generated with distance measuring devices (Fig. 4), e.g., time-of-flight cameras, or with pattern projection techniques [1]. An interesting solution based on infrared light fringe patterns and fast video cameras is presented in Ref. 7.

Fig. 4. Multiview plus depth capturing system (C - video camera, DM - distance measuring device).

Other 3-D scene representations, such as surface-based or volumetric ones [8], are used in practice as well. However, 3-D scene representations based on multiple views, possibly enhanced with depth information, have an important advantage: they can be encoded and transmitted with existing, or suitably extended, techniques developed for standard video. This paper surveys these techniques.

The paper is organized as follows. A review of the video coding standards is given in Sect. 2. Issues related to media stream multiplexing and storage are discussed in Sect. 3. Techniques used for video transmission are presented in Sect. 4. Conclusions and future research directions are discussed in Sect. 5.

2. Video coding

A video sequence is a representation of a natural scene sampled temporally and spatially. In most cases it requires a huge amount of data to reproduce the scene with good quality. The goal of video coding is to find a compact representation of the video sequence while preserving its quality.
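To give a sense of the amount of data involved, a back-of-the-envelope calculation is sketched below. The resolution, frame rate and sampling format are illustrative values chosen for this example, not figures taken from the survey.

```python
# Raw bit rate of an uncompressed video sequence (illustrative values).
# A 4:2:0 chroma-subsampled frame carries 1.5 samples per pixel on average
# (one luma sample plus two chroma samples shared by four pixels).

def raw_bitrate_bps(width, height, fps, bits_per_sample=8, samples_per_pixel=1.5):
    """Bits per second needed to store the sequence without compression."""
    bits_per_frame = width * height * samples_per_pixel * bits_per_sample
    return bits_per_frame * fps

# Example: 1080p at 25 frames/s, 8-bit, 4:2:0 -> about 622 Mbit/s raw.
if __name__ == "__main__":
    rate = raw_bitrate_bps(1920, 1080, 25)
    print(f"{rate / 1e6:.0f} Mbit/s")  # prints: 622 Mbit/s
```

A single camera already produces hundreds of megabits per second of raw data; a multiview setup multiplies this by the number of cameras, which is why compression is indispensable.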
Video compression is achieved by exploiting the temporal and spatial redundancy in the video sequence [9].

2.1. MPEG-4 AVC/H.264

The MPEG-4 AVC/H.264 standard is currently the most important one in the area of video coding [10]. Like many of its predecessors, it is based on the hybrid motion compensation and transform coding algorithm. Significant enhancements of the classic algorithm have been implemented in this standard to improve its coding efficiency [11-14].

The H.264/AVC encoder is divided into two layers: the video coding layer (VCL) and the network abstraction layer (NAL). The VCL processes the video data: each frame of the input sequence is partitioned into a set of macroblocks, each macroblock is temporally or spatially predicted, and its prediction error is then transform coded. The VCL generates a stream of encoded macroblocks organized into slices. A slice covers a part of a frame (or an entire frame) and can be parsed independently of other slices. The NAL formats the output stream of the encoder as a series of packets called NAL units. A set of consecutive NAL units decodable into a single frame is called an access unit (AU). Each NAL unit is composed of a one-byte header followed by its payload (Fig. 5). The header contains three fields describing the payload:
- F - error flag (1 bit); NAL units with this field set to 1 should not be processed;
- NRI - NAL unit priority (2 bits); this field should be set to 0 in all NAL units not used as a reference by other NAL units. The higher the value of this field, the more important the NAL unit is for the reconstruction of the video sequence;
- TYPE - NAL unit type (5 bits); values 1 to 23 are restricted to use within the H.264/AVC standard, while the values 0 and 24 to 31 may be used for other purposes, e.g., in transmission.

Fig. 5. NAL units in the H.264/AVC bitstream.

The length of a NAL unit is not encoded in its header. Therefore, NAL units must be prefixed with start codes (e.g., as defined in Annex B of the H.264/AVC standard) or encapsulated in additional data structures (e.g., transmission protocol packets) to allow their separation; the sketch below illustrates both the start-code framing and the header fields.
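The following minimal Python sketch scans a byte buffer for Annex B start codes and decodes the F, NRI and TYPE fields of each one-byte NAL unit header. The bit layout follows the header description above; the function names and the example buffer are mine, for illustration only, and the sketch deliberately omits parts of a real Annex B parser (e.g., removal of emulation-prevention bytes from the payload).

```python
# Minimal Annex B parsing sketch: split a raw H.264 byte stream on
# 0x000001 / 0x00000001 start codes and decode each one-byte NAL header.
# For clarity, emulation-prevention bytes (0x000003) are NOT stripped
# from the payloads, which a real decoder would have to do.

def split_nal_units(stream: bytes):
    """Yield the NAL units (header + payload) found between start codes."""
    starts = []
    i = 0
    while i < len(stream) - 2:
        if stream[i:i + 3] == b"\x00\x00\x01":
            starts.append(i + 3)  # the NAL unit begins after the start code
            i += 3
        else:
            i += 1
    for begin, end in zip(starts, starts[1:] + [len(stream)]):
        # Zero bytes just before the next start code belong to the 4-byte
        # form of the start-code prefix, not to this NAL unit.
        yield stream[begin:end].rstrip(b"\x00")

def parse_nal_header(nal: bytes):
    """Decode the F (1 bit), NRI (2 bits) and TYPE (5 bits) fields."""
    header = nal[0]
    return {
        "forbidden_zero_bit": header >> 7,    # F: must be 0 in a valid unit
        "nal_ref_idc": (header >> 5) & 0x03,  # NRI: 0 = not used as reference
        "nal_unit_type": header & 0x1F,       # TYPE: e.g. 5 = IDR slice, 7 = SPS
    }

# Example: a buffer holding one SPS NAL unit (header byte 0x67: NRI = 3, TYPE = 7).
if __name__ == "__main__":
    buf = b"\x00\x00\x00\x01\x67\x64\x00\x1f"
    for nal in split_nal_units(buf):
        print(parse_nal_header(nal))
        # {'forbidden_zero_bit': 0, 'nal_ref_idc': 3, 'nal_unit_type': 7}
```

Splitting on the 3-byte pattern is safe because emulation prevention guarantees that 0x000001 never occurs inside an encoded NAL unit.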
There are two main NAL unit categories:
- VCL NAL units, containing encoded video data; the following types have been defined:
  - coded slice of an instantaneous decoding refresh (IDR) picture (TYPE = 5). An IDR picture allows the decoding process to start; the first frame of a video sequence is therefore always encoded in the IDR mode. IDR pictures may be repeated in the video sequence to allow stream switching or recovery from transmission errors;
  - coded slice of a non-IDR picture (TYPE = 1);
  - coded slice data partition (TYPE = 2, 3, 4). Data partitioning is an error resilience tool available in the H.264/AVC standard; a more detailed description can be found in Ref. 10;
- non-VCL NAL units, carrying associated additional information; the most important types are:
  - sequence parameter set (SPS, TYPE = 7), containing infrequently changing information related to the entire video sequence;
  - picture parameter set (PPS, TYPE = 8), containing infrequently changing information related to one or more pictures in the video sequence;
  - supplemental enhancement information (SEI, TYPE = 6), providing information that supports the decoding process or the display of the reconstructed video sequence;
  - access unit delimiter (TYPE = 9);
  - end of sequence (TYPE = 10);
  - end of stream (TYPE = 11).

The syntax of the H.264/AVC bitstream is more flexible than in the previous video coding standards. The only required syntax element in every access unit is the VCL NAL unit containing at least one slice of the primary coded picture (Fig. 6). In certain profiles of the H.264/AVC standard, the primary coded data may be followed by VCL NAL units carrying a redundant representation of the picture.

Fig. 6. Access unit representation in the H.264/AVC bitstream.

Each slice header refers directly to the PPS and indirectly to the SPS; both parameter sets must be known to the decoder to allow slice processing. Usually, non-VCL NAL units with the SPS and PPS are transmitted before any VCL NAL unit in the same channel ("in-band"); however, SPS/PPS NAL units may also be transmitted in an additional, more reliable channel ("out-of-band"). SEI messages may be very useful in VCL NAL unit processing, but they are not necessary to decode the access unit. Similarly, access unit delimiters are not required to detect the beginning of a new frame in the encoded video sequence: the frame boundaries can be derived from the slice headers in VCL NAL units, although this is a resource-consuming process.

2.1.1. Scalable extension

In many applications it is desirable to deliver the same video content in multiple versions differing in spatial resolution, frame rate or image quality. This goal can be achieved by simulcast transmission of all required versions of the video content, but such a solution is highly inefficient, as the toy calculation below illustrates. Scalable coding is another possibility: a scalable video bitstream is composed of two or more hierarchical layers (Fig. 7).
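The inefficiency of simulcast can be illustrated with a toy comparison. All numbers below are assumed, illustrative values, not figures from the survey: simulcast must transmit every version in full, whereas a scalable bitstream transmits roughly the highest-quality version plus some layering overhead.

```python
# Toy comparison of simulcast vs. scalable delivery of three versions
# of the same content. All numbers are illustrative assumptions.

# Assumed bitrates (Mbit/s) of three independently coded versions.
VERSIONS_MBPS = [1.0, 2.5, 6.0]  # e.g. low / medium / high quality

# Assumed coding-efficiency penalty of a scalable bitstream relative to
# a single-layer encoding of the highest version (illustrative value).
SCALABLE_OVERHEAD = 0.10

simulcast = sum(VERSIONS_MBPS)                 # every version sent in full
scalable = max(VERSIONS_MBPS) * (1 + SCALABLE_OVERHEAD)

print(f"simulcast: {simulcast:.1f} Mbit/s")    # 9.5 Mbit/s
print(f"scalable:  {scalable:.1f} Mbit/s")     # 6.6 Mbit/s
```

Under these assumptions the scalable stream needs noticeably less channel capacity than simulcast, while each receiver can still discard the layers it does not need.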