
SEAMLESS CONTENT DELIVERY IN THE FUTURE MOBILE INTERNET

SCALABLE VIDEO CODING OVER RTP AND MPEG-2 TRANSPORT STREAM IN BROADCAST AND IPTV CHANNELS

THOMAS SCHIERL, KARSTEN GRÜNEBERG, AND THOMAS WIEGAND, FRAUNHOFER HHI

ABSTRACT

The ITU-T and ISO/IEC standard for scalable video coding (SVC) was recently finalized. SVC allows for scalability of the video bitstream in the temporal, spatial, or fidelity domain, or any combination of those. Video scalability may be used for different purposes, such as saving bandwidth when the same media content has to be sent simultaneously on a broadcast medium at different resolutions to support heterogeneous devices, when unequal error protection is to be used for coverage extension in wireless broadcasting, as well as for rate shaping in IPTV environments. Furthermore, it may also be useful in layered multicast transmission over the Internet or peer-to-peer networks, or in any transmission scenario where prioritized transmission for network flows is meaningful. To make use of SVC in the aforementioned use cases, standards defining the transport format and procedure are required. Therefore, we give a detailed overview of the recently finished SVC standards on transport over IP/RTP and the MPEG-2 transport stream. Both standards are important for IPTV and video on demand, where the first is important for SVC transport over mobile broadcast/multicast channels, and the latter is also important for SVC transport over traditional digital broadcast channels.

INTRODUCTION

The most recent International Telecommunication Union Telecommunication Standardization Sector (ITU-T) and International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) standards for scalable video coding (SVC) [1] specify a video bitstream that is scalable in the temporal, spatial, or fidelity domain, or any combination of those, allowing bit rate reduction while still maintaining reasonable video quality. In order to make use of SVC in media delivery scenarios, standards specifying the transport of SVC over IP/Real-Time Transport Protocol (RTP) [2] and MPEG-2 transport stream [3] were recently finalized. Both standards are important for delivery of SVC over digital broadcast, wireless multicast, and IPTV channels.

Today's digital broadcast channels as specified by the Digital Video Broadcasting Project (DVB) or the Advanced Television Systems Committee (ATSC) rely on MPEG-2 systems [4] for encapsulation and signaling for media delivery. This is a legacy of the early success of MPEG-2 in digital video broadcasting. Over the years new standards such as H.263, MPEG-4 Visual Part 2, and H.264/MPEG-4 Advanced Video Coding (AVC) have been developed. Even with the new video coding standards, there was no requirement to update the TV delivery chain based on MPEG-2 Transport Stream (TS). Therefore, today's DVB IPTV services are mostly based on MPEG-2 TS delivered over IP, using RTP over the User Datagram Protocol (UDP) or UDP directly. This keeps digital broadcast and IPTV services compatible with each other.

Looking at emerging wireless and mobile broadcast/multicast channels such as the DVB standard for satellite-to-handheld (SH), the DVB standard for next-generation handheld (NH), or 3GPP's multimedia broadcast multicast services (MBMS), the situation is different. These service types target mobile receiver devices, which typically have native IP interfaces and therefore preferably support native media transport over IP; that is, those standards rely on RTP and the Session Description Protocol (SDP) for packetization and signaling.

Recently, second tier standardization committees have already adopted the SVC transport standards as the DVB standards for media delivery over IP, and MPEG-2 TS and the ATSC standard for mobile/handheld are close to the final stage of adoption.

The work of the authors is supported by the European Commission under the contract number FP7-ICT-214063, project SEA.

1536-1284/09/$25.00 © 2009 IEEE. IEEE Wireless Communications, October 2009.

Each of the transport formats comprises signaling as well as encapsulation and transport mechanisms. Since MPEG-2 TS and RTP comprise different signaling and encapsulation techniques, we highlight both transport methods and also mention the possible carriage of MPEG-2 TS over IP.

The remainder of the article is structured as follows. The next section gives a summary of the SVC standard and highlights the parts of the high-level syntax of SVC that are important for encapsulation operations on SVC bitstreams. The following two sections give a detailed overview of transporting SVC over RTP and over MPEG-2 TS, respectively.

SVC OVERVIEW

CODING

Video scalability, also known as layered video coding, has always been a desirable feature of a media bitstream. In contrast to earlier approaches included in video coding standards like H.262/MPEG-2 Video, H.263, and MPEG-4 Visual (Part 2), SVC [1] has achieved reasonable coding efficiency, which makes it applicable for a wide range of services. For a more detailed description of the SVC codec see [5], and for SVC use cases see [6].

The SVC design, which is an extension of the H.264/AVC video coding standard adding new profiles, can be classified as a layered video codec. In SVC the hybrid video coding approach of motion-compensated transform coding is extended in such a way that a wide range of spatial, temporal, and fidelity scalability is achieved. An SVC bitstream consists of a base layer and one or several enhancement layers. The removal of enhancement layers still leads to reasonable quality of the decoded video at reduced temporal or signal-to-noise ratio (SNR) fidelity, and/or spatial resolution. The base layer is a bitstream conforming to existing H.264/AVC profiles, ensuring backward compatibility with existing receivers.

HIGH-LEVEL SYNTAX

In this section we highlight the signaling provided in the SVC bitstream to allow high-level readability in general. For this purpose, SVC as well as AVC comprise a so-called network abstraction layer (NAL), which acts as a packet interface to the system and transport layers.

SVC and AVC bitstreams consist of a sequence of NAL unit packets that can be identified by the NAL unit header, as shown in Fig. 1. A NAL unit may carry different types of payloads:
• Video coding layer (VCL) information (e.g., entropy coded intra, residual, or motion data)
• Non-VCL information such as:
  – Information about decoding process settings as parameter sets
  – Additional supplemental enhancement information (SEI) messages not required by the decoding process

The payload content of a NAL unit is identified by the NAL unit type field in the 1-byte AVC NAL unit header section. For NAL units that contain SVC VCL data in the scalable extensions of AVC (NAL unit type = 20), three additional header bytes are used, as shown at the bottom of Fig. 1.

[Figure 1 diagram: a NAL unit consists of a NAL unit header and a NAL unit payload. The 1-byte header carries a 0 bit, Ref IDC, and the NAL unit type. For type 14 or type 20, the 3-byte NAL unit header SVC extension carries a 1 bit, IDR, Priority ID, nILP, D (3 bits), Q (4 bits), T (3 bits), uRBP, DF, OF, and two 1 bits.]

Figure 1. SVC NAL unit (top), AVC compatible header section (left, darker blue), and SVC header extension (right, lighter blue).

This NAL unit header SVC extension contains, among other information, three fields that specify the level with regard to the three scalability dimensions, here labeled D for dependency ID (spatial or coarse-grain quality scalability [CGS] level), Q for quality ID (medium-grain quality scalability [MGS] level), and T for temporal level ID (temporal scalability level). D, Q, and T identify for each SVC VCL NAL unit the specific level of scalability. For the AVC base layer, where D and Q are zero, a special prefix NAL unit (NAL unit type = 14) carrying the 3-byte SVC extension header is inserted before each VCL NAL unit of the AVC base layer in order to indicate the temporal level T to which it belongs. For more details on SVC high-level syntax, we refer to [13].

SVC OVER RTP

This section gives an overview of the transport mechanisms for SVC over IP. We first discuss RTP and then go into detail about the RTP payload format for SVC. We then discuss the out-of-band signaling for SVC using SDP.

MEDIA OVER IP: REAL-TIME TRANSPORT PROTOCOL

RTP is an application layer protocol that provides the means to transport real-time media data over IP using a variety of transport layer protocols such as UDP, Transmission Control Protocol (TCP), and Datagram Congestion Control Protocol (DCCP). RTP provides encapsulation for real-time media, which requires immediate transport as well as immediate consumption of the data at the receiver. This makes RTP suitable for different scenarios such as live broadcasting and media streaming on demand, as well as conversational services.

Pure RTP provides media encapsulation, media synchronization, quality of service signaling, and basic coordination for multi-party communication. Today's RTP has been fundamentally extended, providing error protection by forward error correction (FEC) and retransmission, advanced signaling about the received media quality to control the media codec via codec control messages, as well as security. In the following we highlight the RTP details important for TV-like delivery of real-time media over broadcast/multicast as well as IPTV channels.

Besides media transport, the RTP protocol suite provides a second packet type, Real-Time Control Protocol (RTCP) packets. For IPTV, RTCP has two important functionalities: the provisioning of inter-RTP flow media synchronization and receiver feedback on connection statistics.

RTP defines different topologies. Besides typical point-to-point streaming scenarios, these topologies also include point-to-multipoint as well as multipoint-to-multipoint.

Due to the focus of this article, we chose the two examples shown in Fig. 2, where the upper part shows a layered multicast scenario (single source multicast), that is, the use of IP multicast by a single server where clients join sessions of a layered codec depending on their downlink capacity. Layered multicast as introduced by Steven McCanne can also be applied to transmission over a wireless broadcast channel, where different channel protection may be applied to the layers in different RTP sessions (each containing exactly one RTP flow).

[Figure 2 diagram: top, an SVC server sends layers 0 to 3 as separate multicast sessions through the network; client 1 has joined all sessions, client 2 has joined only the lower layers. Bottom, a MANE receives n RTP flows and forwards 1 RTP flow toward clients 3 and 4.]

Figure 2. SVC layered multicast + media-aware network elements (MANEs)

Assume a further extension of the layered multicast scenario, as shown in the bottom part of Fig. 2, where layered multicast is only applied in a dedicated network (e.g., the IPTV provider's network). At the last hop, which may be a digital subscriber line (DSL) access multiplexer, a media-aware network element (MANE) is present which combines layers of different RTP sessions (RTP flows) in order to build a single session containing the layers matching the link capacity constraints as well as the device capability.

In RTP terminology, a MANE could act in two different modes, first as an RTP mixer, which terminates incoming RTP sessions and establishes new RTP sessions to the client terminal while modifying the RTP payload, or second as an RTP translator, which keeps the end-to-end connection and signaling intact while modifying RTP packets as well as RTP payload content on the fly. The MANE shown in Fig. 2 acts as an RTP mixer.

MEDIA PACKETIZATION: RTP PAYLOAD

The design of the RTP payload format for SVC video [2] is inherently based on the RTP payload format for AVC video [7]. All packet types, packetization modes, and signaling parameters of [7] are inherited.

The AVC RTP payload format [7] supports encapsulating a single NAL unit, more than one NAL unit, or a fragment of a NAL unit into one RTP packet. A single NAL unit as specified in H.264/AVC can be included in the RTP packet as is, and the NAL unit header serves as the payload header. Four types of aggregation NAL units are specified using values in the NAL unit type space that are undefined in SVC [1]. The two single-time aggregation packet types, STAP-A and STAP-B, allow encapsulating multiple NAL units that belong to the same picture (identified by identical RTP timestamps) into one RTP packet. The two multiple-time aggregation packet types, MTAP16 and MTAP24, can be used to aggregate NAL units from different pictures into a single RTP packet. The AVC RTP payload format [7] also supports two types of fragmentation units, FU-A and FU-B, which enable fragmentation of one NAL unit into multiple RTP packets. In Fig. 3 the use of the single NAL unit packet and the FU as well as STAP are shown. An SVC bitstream with its access units (coded media frames) in decoding order is shown in the upper part. Each square of a different color represents a NAL unit of a particular layer, where white squares identify the base layer NAL units. For details of the packet description see [2, 8].

The AVC RTP payload format [7] further defines three different packetization modes, which differ in the supported packet types as well as the method of packet transmission. The single NAL unit mode only supports the use of single NAL unit packets, the non-interleaved packetization mode additionally allows the presence of FU-A and STAP-A packets, and the interleaved packetization mode only allows the use of STAP-B and MTAP packets, and fragmentation units starting with an FU-B packet. These three packetization modes are also supported by the SVC payload format. In the SVC payload format it is further allowed to separate data of an SVC bitstream onto different RTP flows (RTP streams) in order to allow differentiation in transport for partial content encryption or unequal treatment by a packet loss recovery mechanism such as retransmission or FEC. We highlight the approach of SVC multisession transmission (MST) later on in more detail.

The SVC payload format provides three new packet types:

Payload content specific information (PACSI): A table of contents of the NAL units contained in the payload. This optional packet also carries error resilience information, and it can additionally contain a number indicating the decoding order of the NAL units.

Non-interleaved multi-time aggregation packet (NI-MTAP): This packet allows for grouping of non-interleaved, consecutive NAL units from different frames. This could reduce the packetization overhead, especially in hierarchical group of pictures (GOP) coding structures for low-bit-rate streaming, where the NAL unit size for pictures of the highest temporal level typically is significantly below the network maximum transmission unit (MTU) size for Ethernet.

Empty packet: Used to produce dummy data of the same timestamp in order to align different RTP flows based on timing information.
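As an illustration of the header fields and packet types discussed above, the following minimal Python sketch reads the 1-byte NAL unit header, the 3-byte SVC extension for NAL unit types 14 and 20, and classifies the packet types of the AVC payload format. It rests on assumptions not spelled out in this article: the bit positions follow the field order of Fig. 1 as we understand it from H.264 Annex G, and the type values 24 to 29 for STAP-A, STAP-B, MTAP16, MTAP24, FU-A, and FU-B are taken from the AVC RTP payload format [7]. All class and function names are hypothetical, and the new SVC packet types (PACSI, NI-MTAP, empty packet) are deliberately not covered.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: NAL unit type values from the AVC RTP payload format [7].
AGGREGATION_AND_FU_TYPES = {
    24: "STAP-A", 25: "STAP-B", 26: "MTAP16", 27: "MTAP24",
    28: "FU-A", 29: "FU-B",
}


@dataclass
class SvcExtension:
    idr_flag: int
    priority_id: int
    no_inter_layer_pred: int   # nILP in Fig. 1
    dependency_id: int         # D, 3 bits
    quality_id: int            # Q, 4 bits
    temporal_id: int           # T, 3 bits
    use_ref_base_pic: int      # uRBP in Fig. 1
    discardable: int           # DF in Fig. 1
    output_flag: int           # OF in Fig. 1


@dataclass
class NalHeader:
    nal_ref_idc: int
    nal_unit_type: int
    kind: str
    svc: Optional[SvcExtension]


def parse_nal_header(data: bytes) -> NalHeader:
    """Parse the NAL unit header at the start of an RTP payload (sketch)."""
    if not data:
        raise ValueError("empty NAL unit")
    first = data[0]
    if first & 0x80:
        raise ValueError("the leading 0 bit of the NAL unit header is set")
    nal_ref_idc = (first >> 5) & 0x03
    nal_unit_type = first & 0x1F
    svc = None
    if nal_unit_type in (14, 20):
        # Prefix NAL unit (14) or SVC VCL NAL unit (20): 3 extra header bytes.
        if len(data) < 4:
            raise ValueError("truncated SVC extension header")
        b1, b2, b3 = data[1], data[2], data[3]
        svc = SvcExtension(
            idr_flag=(b1 >> 6) & 0x01,
            priority_id=b1 & 0x3F,
            no_inter_layer_pred=(b2 >> 7) & 0x01,
            dependency_id=(b2 >> 4) & 0x07,
            quality_id=b2 & 0x0F,
            temporal_id=(b3 >> 5) & 0x07,
            use_ref_base_pic=(b3 >> 4) & 0x01,
            discardable=(b3 >> 3) & 0x01,
            output_flag=(b3 >> 2) & 0x01,
        )
        kind = "prefix NAL unit" if nal_unit_type == 14 else "SVC VCL NAL unit"
    elif nal_unit_type in AGGREGATION_AND_FU_TYPES:
        kind = AGGREGATION_AND_FU_TYPES[nal_unit_type]
    elif 1 <= nal_unit_type <= 23:
        kind = "single NAL unit"
    else:
        kind = "other"
    return NalHeader(nal_ref_idc, nal_unit_type, kind, svc)
```

A MANE or receiver doing the payload header parsing mentioned for single-session transmission could call `parse_nal_header` on each payload and then filter on `svc.dependency_id`, `svc.quality_id`, and `svc.temporal_id`.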


[Figure 3 diagram: an SVC bitstream with access units (samples) 1 to 3 in decoding order, made up of NAL units of layers 0 to 3, is mapped onto three RTP flows using single NAL unit packets, fragmentation units (FU), single-time aggregation packets (STAP), and empty packets. Legend: RTP header; payload header; NAL unit of layer i.]

Figure 3. NI-T multisession transmission of four SVC layers in three RTP flows (example using temporal scalability).
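The reassembly that Fig. 3 implies can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm specified in [2, 9]: it assumes each RTP flow is already available as a list of (RTP timestamp, NAL unit) pairs in arrival order, that empty packets (modeled here as empty strings) guarantee every timestamp of a lower flow also appears in all higher flows, and that there are no losses. The function name, the tuple layout, and the example data are hypothetical.

```python
from typing import Dict, List, Tuple

Packet = Tuple[int, str]  # (RTP timestamp, NAL unit label); "" models an empty packet


def recover_decoding_order(flows: List[List[Packet]]) -> List[Tuple[int, List[str]]]:
    """Timestamp-based reassembly sketch; flows[0] is the lowest flow, flows[-1] the highest."""
    # Step 1: take the order in which timestamps first appear in the highest flow.
    order: List[int] = []
    seen = set()
    for timestamp, _ in flows[-1]:
        if timestamp not in seen:
            seen.add(timestamp)
            order.append(timestamp)

    # Step 2: index every flow by timestamp, keeping the in-flow order.
    indexed: List[Dict[int, List[str]]] = []
    for flow in flows:
        index: Dict[int, List[str]] = {}
        for timestamp, nal in flow:
            index.setdefault(timestamp, []).append(nal)
        indexed.append(index)

    # Step 3: build each access unit from the lowest to the highest flow.
    access_units: List[Tuple[int, List[str]]] = []
    for timestamp in order:
        nal_units: List[str] = []
        for index in indexed:
            nal_units.extend(n for n in index.get(timestamp, []) if n)  # drop empty packets
        access_units.append((timestamp, nal_units))
    return access_units


# Hypothetical temporal-scalability example: timestamps are not increasing
# in decoding order, and higher flows carry empty packets for alignment.
EXAMPLE_FLOWS = [
    [(0, "T0"), (4, "T0")],
    [(0, ""), (4, ""), (2, "T1")],
    [(0, ""), (4, ""), (2, ""), (1, "T2"), (3, "T2")],
]
```

For `EXAMPLE_FLOWS` the function returns the access units in the order of timestamps 0, 4, 2, 1, 3, each holding only the real NAL units. Without the empty packets in the highest flow, timestamps 0, 4, and 2 would be missing from the order taken in step 1.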

An important functionality for the transport of SVC is the differentiated transport of NAL unit packets. Therefore, the system layer needs to find out about the importance of packets. Two different approaches can be used here:

Single-session transmission (SST): All NAL units of all layers are transported within one RTP flow; thus, for their differentiation a network element such as a MANE and the receiver need to rely on payload header parsing to determine the importance of the NAL units.

Multisession transmission (MST): NAL units belonging to one or more layers (identified by a combination of D, Q, and T in the SVC NAL unit header) are transported in a separate RTP flow, which is sent on a separate transport address (session multiplexing, as discussed earlier).

MST is supported in potentially four different modes:

Non-interleaved timestamp-based (NI-T) mode: Allows for non-interleaved transmission within each session (single NAL unit as well as non-interleaved packetization modes can be used) and relies on media timestamps to recover the decoding order if SVC layers are received from different RTP flows.

Non-interleaved cross session decoding order (NI-C) mode: Allows for non-interleaved transmission within each session (single NAL unit as well as non-interleaved packetization modes can be used) and relies on a number provided in the PACSI packet to recover the decoding order if SVC layers are received from different RTP flows.

Interleaved cross session decoding order (I-C) mode: Allows for interleaved transmission within each session (interleaved packetization mode is used) and relies on a number in the payload header to recover the decoding order if SVC layers are received from different RTP flows.

NI-TC mode: Offers both means of decoding order recovery (i.e., timestamp-based [NI-T] and number-based [NI-C]).

NI-T mode basically relies on media timestamps to recover the decoding order if NAL units of an SVC stream are received from, say, different transport addresses. One problem may arise because the RTP timestamps represent the presentation time and do not necessarily appear in increasing order within an RTP flow. Therefore, an algorithm is specified in [2, 9] to recover the decoding order, starting in the highest RTP flow (i.e., the flow carrying the least important layer in the layered coding hierarchy) to find out the order of RTP timestamp appearance in that RTP flow. The algorithm then proceeds to lower RTP sessions carrying data of the same SVC bitstream. Given the RTP timestamp order from the highest RTP flow, it places the NAL units from lowest to highest RTP flow into the access unit following the RTP timestamp order in the highest RTP flow. This algorithm relies on the precondition that data of a particular timestamp present in one RTP flow is also contained in all higher RTP flows. Therefore, if temporal scalability is used, it may be required to satisfy this rule at the transport layer; in such cases an empty NAL unit is included in higher layers. In [14] a detailed comparison of using NI-T and NI-C mode in error-prone environments is given.

In the example shown in Fig. 3, an empty packet is included in the higher RTP flow (e.g., RTP flow F3) whenever an access unit exists in a lower layer but not in the higher layer. Figure 3 shows the relation of NAL units and access units present in an SVC bitstream and their mapping to an NI-T multisession transmission. In order to find out dependencies of RTP flows (i.e., the order of RTP flows in terms of lower and higher RTP flows), a special signaling mechanism is defined as discussed in the next section. Since the NI-T mode depends on timestamps, the RTP timestamps have to be translated to media presentation times; that is, they have to be mapped to wallclock timestamps (provided in RTCP), in order to allow the aforementioned algorithm to find matching timestamps in the different RTP flows. Currently, there is an optional mechanism under specification in [9] to rapidly provide timestamp matching between RTP flows in RTP header extensions.

SIGNALING: SDP

SDP is required for RTP transport of SVC as well as other purposes, to identify the used packetization modes (e.g., non-interleaved mode), the used MST modes (e.g., NI-T mode), and dependencies between RTP sessions. For the latter, a new signaling framework based on SDP grouping has been defined, where each session is identified by a value called mid. The SDP dependency grouping [10] in general indicates the potential relation of RTP sessions and additionally provides information with respect to the dependent RTP sessions in the a=depend parameters.

Such RTP sessions are typically set up based on a media description starting with m= in an SDP message. For each media description a single RTP session following a particular RTP payload type can be set up by the client. This is typically negotiated using the Real Time Streaming Protocol (RTSP) for point-to-point streaming, or indicated using the Session Announcement Protocol (SAP) for multicast or using other means (e.g., an electronic program guide). RTP payload types are detailed in SDP using a combination of a line starting with a=rtpmap and a line starting with a=fmtp, where rtpmap indicates the payload format to be used (e.g., H264 for AVC or H264-SVC for SVC). The fmtp line indicates further details about the contained media such as the used profile and level, the parameter sets and transport information, or the mode used for decoding order recovery in multisession transmission (mst-mode) and packetization (packetization-mode). SDP signaling for SVC also allows for indication of operation points (i.e., the layers) contained in the layers of an RTP session using the parameter sprop-operation-point-info in the fmtp line.

Table 1 shows an SDP example where an SVC bitstream is offered by a server for multisession transmission with up to two potential RTP sessions. The attribute specified in line 2 declares an SDP group having decoding dependencies DDP, which contains the media sessions indicated by mid: L1 and L2 assigned in lines 8 and 14, respectively, to the media description blocks (shaded with different colors). Each media description is also associated with one or more payload type (PT) numbers in lines 3 and 9, respectively. The a=depend attribute (line 15) indicates dependencies per payload type that exist for all layers except the base layer. In Table 1 payload types are marked red, while the media session identifiers are marked blue. For the highest layer of the SVC bitstream indicated in the media description starting at line 9, it is required to also receive either payload type 97 or 98 of the media description starting at line 3 (mid = L1).

In order to support SSRC multiplexing (a multiplexing technique based on the RTP header's SSRC field instead of using the network address or port as for RTP session multiplexing) with SVC, there is still a need for a dependency group specification as discussed above to be used with SSRC grouping as defined in [11].

SVC OVER MPEG-2 TRANSPORT STREAM

In this section we discuss the transport mechanism for SVC based on MPEG-2 systems [3]. We give a brief introduction to media transport over the transport stream defined by MPEG-2 systems [4]. Furthermore, we describe the details of using SVC in MPEG-2 transport streams. We then discuss the possibility of using MPEG-2 system streams over IP.

MPEG-2 TRANSPORT STREAM

To encapsulate media elementary stream data (bitstream data) into MPEG-2 TS, MPEG-2 systems defined the packetized elementary stream (PES), which provides packetization of coded media frames and access units. Therefore, an access unit is encapsulated using PES packets. The PES packet header provides framing, size information, stream type identification, as well as media frame related information such as decoding and presentation timestamps. The timing information carried in the PES is relative to a program clock reference (PCR), which is carried in the TS for each program. An access unit may be transported in one or more PES packets, where for SVC one packet per access unit is used.

The PES packets are framed into TS packets, which provide the level for multiplexing other transport streams containing data of other media elementary streams of the same program as well as media data of other programs. Besides PES data, program-specific information (PSI) tables that map the program information to a packet identifier (PID) in the TS packet headers are provided. The TS provides a fixed framing size of typically a 184-byte payload and a 4-byte header per TS packet, which carries, for example, the PID field.

As already mentioned, the transport stream provides the PES data of different elementary streams of different programs, each identified by its own PID, as well as table of contents information. For the latter case, a program association table (PAT) is used, which is by default associated to PID = 0 and identifies the different programs (TV channels). Each program has its own program map table (PMT) in a separate PID, which is indicated in the PAT. In the PMT detailed information about each program (i.e., the PIDs of the different PESs) is indicated. The PSI tables such as PMT and PAT are typically provided at fixed intervals in the TS multiplex in order to allow stream access and demultiplexing.

Figure 4 shows the relation of the different tables. In the example we assume, in the media data section of the illustrated program, the presence of more than one PID carrying different


Line #  SDP message
1       …
2       a=group:DDP L1 L2
3       m=video 20000 RTP/AVP 97 98
4       a=rtpmap:97 H264/90000
5       a=fmtp:97 profile-level-id=4d400a; packetization-mode=1; mst-mode=NI-T; sprop-parameter-sets=…;
6       a=rtpmap:98 H264/90000
7       a=fmtp:98 profile-level-id=4d400a; packetization-mode=2; mst-mode=I-C; sprop-parameter-sets=…;
8       a=mid:L1
9       m=video 20002 RTP/AVP 99 100
10      a=rtpmap:99 H264-SVC/90000
11      a=fmtp:99 profile-level-id=53000c; packetization-mode=1; mst-mode=NI-T; sprop-parameter-sets=…;
12      a=rtpmap:100 H264-SVC/90000
13      a=fmtp:100 profile-level-id=53000c; packetization-mode=2; mst-mode=I-C; sprop-parameter-sets=…;
14      a=mid:L2
15      a=depend:99 lay L1:97; 100 lay L1:98

Table 1. SDP messages describing a simple SVC multisession transmission with two RTP flows.
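To make the dependency signaling of Table 1 concrete, the following Python sketch extracts the DDP group and the per-payload-type dependencies from such a message. It is a minimal illustration that handles only the attribute shapes appearing in Table 1 (a=group:DDP, a=mid, and a=depend entries separated by semicolons); it is not a general SDP parser. The function name and the returned data layout are hypothetical, and the fmtp lines are left out of the sample text because their parameter sets are elided in the table.

```python
import re
from typing import Dict, List, Optional, Tuple

SDP_EXAMPLE = """\
a=group:DDP L1 L2
m=video 20000 RTP/AVP 97 98
a=rtpmap:97 H264/90000
a=rtpmap:98 H264/90000
a=mid:L1
m=video 20002 RTP/AVP 99 100
a=rtpmap:99 H264-SVC/90000
a=rtpmap:100 H264-SVC/90000
a=mid:L2
a=depend:99 lay L1:97; 100 lay L1:98
"""

Key = Tuple[Optional[str], int]        # (mid of the dependent session, payload type)
Ref = Tuple[str, int]                  # (mid of the required session, payload type)


def parse_dependencies(sdp: str) -> Tuple[List[str], Dict[Key, List[Ref]]]:
    """Return the DDP group members and a map of payload-type dependencies."""
    group: List[str] = []
    blocks: List[List[str]] = []       # attribute lines per media description
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("a=group:DDP"):
            group = line.split()[1:]
        elif line.startswith("m="):
            blocks.append([])
        elif blocks:
            blocks[-1].append(line)

    depends: Dict[Key, List[Ref]] = {}
    for block in blocks:
        mid = next((l[len("a=mid:"):] for l in block if l.startswith("a=mid:")), None)
        for line in block:
            if not line.startswith("a=depend:"):
                continue
            for entry in line[len("a=depend:"):].split(";"):
                match = re.fullmatch(r"\s*(\d+)\s+(\w+)\s+(.+?)\s*", entry)
                if not match:
                    continue
                payload_type = int(match.group(1))
                refs: List[Ref] = []
                for target in match.group(3).split():
                    ref_mid, ref_pts = target.split(":")
                    refs.extend((ref_mid, int(pt)) for pt in ref_pts.split(","))
                depends[(mid, payload_type)] = refs
    return group, depends
```

Applied to `SDP_EXAMPLE`, the sketch reports the group members L1 and L2 and shows that payload type 99 of session L2 depends on payload type 97 of session L1, and payload type 100 on payload type 98.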

PESs of different media (e.g., different layers of an SVC bitstream). More details about the packetization of SVC into MPEG-2 PES and TS packets are given in the next section.

SVC IN MPEG-2 TRANSPORT STREAM

The key to transporting SVC over MPEG-2 TS is the distribution of SVC layers to different PESs, which are indicated by different PIDs in the TS. The Amendment for the transport of SVC over MPEG-2 system streams [3] defines the transport of SVC data in a more restricted way than the RTP payload format for SVC video.

The AVC base layer is by definition transported in its own PID and has the stream type assignment of 0x1B, which identifies in the PES header AVC bitstreams conforming to existing AVC profiles as defined in Annex A of [1]. SVC enhancements are transported in a separate PES with a separate PID assigned. Such SVC PESs are indicated by the stream type 0x1F, which indicates AVC bitstreams conforming to SVC profiles defined in Annex G of [1]. An SVC PES carries a video sub-bitstream [3], which is identified by a common value of D for SVC NAL unit header extensions. This rule ensures that data of a particular access unit that is contained in a lower video sub-bitstream (e.g., the AVC base layer) is also contained in all higher SVC video sub-bitstreams. This is important for the access unit reassembling process, which is similar to the one used in the RTP payload format for SVC video in NI-T mode discussed earlier.

Figure 5 illustrates the splitting of an SVC bitstream into different elementary streams, each carried in its own PES. A dependency representation (DR) represents the part of an SVC access unit associated to a particular D-value. The example SVC bitstream includes four different D-values (i.e., an access unit may contain up to four DRs). The example stream additionally shows different frame rates in the different video sub-bitstreams; there are different numbers of DRs present in the access units depending on the position in the bitstream (access units AUn and AUn+3 in Fig. 5).

In order to differentiate between different values of Q in the SVC NAL unit header extension, an application has to parse into the PES stream and identify such NAL units, where a quality enhancement to the AVC base layer is contained in the same video sub-bitstream.

For detailed information about the SVC video sub-bitstreams, three different descriptors are available and are contained in the PMT of the program element containing the SVC bitstream:
• Hierarchy descriptor: Indicates the dependencies between PESs containing video sub-bitstreams (similar to the dependency grouping discussed earlier) and the type of scalability present in a video sub-bitstream as temporal, spatial, and quality scalability.
• AVC video descriptor: Details about the bitstream such as the used profile and level.
• SVC extension descriptor: Details about the contained operation point in terms of bit rate, resolution, and ranges of D, Q, and T present in the PES.

At the receiver, the DRs contained in the different PESs need to be reassembled to access unit decoding order. Therefore, [3] defines an access unit and bitstream reassembling process based on the decoding timestamps (DTSs) provided in the PES headers. A similar process as previously presented for the NI-T mode is used, except that MPEG-2 TSs can rely on the presence of DTSs in the TS, where RTP has to rely on presentation timestamps (similar to the PTS). Reference [3] mandates the presence of DTSs and PTSs in all PES packet headers for video sub-bitstreams of an SVC bitstream.

When temporal scalability is used, DTS values of DRs do not necessarily match between layers with different temporal resolutions. In the example


[Figure 4 diagram: rows of TS packets, each with a TS header (HDR). The PSI/SI tables comprise the PAT on PID = 0 and the PMT on PID m. The media data comprise PES packets on PID j (base layer), PID k, and PID l; a PES header may be split across TS packets. Legend: HDR: header; PAT: program association table; PES: packetized ES; PID: packet ID (in TS HDR); PMT: program map table; TS: transport stream.]

Figure 4. Meta data structures indicating the location of elementary streams in MPEG-2 TS.
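The PID-based structure of Fig. 4 can be illustrated with a short Python sketch that reads the 4-byte TS packet header and groups packet payloads by PID, which is the first step before the PAT on PID 0 and the PMT can be evaluated. The sketch relies on details of MPEG-2 systems [4] that this article does not spell out and that are stated here as assumptions: the sync byte 0x47, the 13-bit PID spanning the second and third header bytes, and the 4-bit continuity counter. It ignores adaptation fields and does not parse the PAT or PMT sections themselves. All names are hypothetical.

```python
from typing import Dict, List, NamedTuple

TS_PACKET_SIZE = 188   # 4-byte header plus 184-byte payload
SYNC_BYTE = 0x47       # assumption: first byte of every TS packet per [4]
PAT_PID = 0            # PID carrying the program association table


class TsHeader(NamedTuple):
    pid: int
    payload_unit_start: bool
    continuity_counter: int


def parse_ts_header(packet: bytes) -> TsHeader:
    """Read the PID and two further fields from one TS packet (sketch)."""
    if len(packet) != TS_PACKET_SIZE:
        raise ValueError("a TS packet is 188 bytes long")
    if packet[0] != SYNC_BYTE:
        raise ValueError("missing sync byte 0x47")
    pid = ((packet[1] & 0x1F) << 8) | packet[2]
    return TsHeader(pid, bool(packet[1] & 0x40), packet[3] & 0x0F)


def demux_by_pid(stream: bytes) -> Dict[int, List[bytes]]:
    """Group the 184-byte payloads of consecutive TS packets by PID.

    Simplification: adaptation fields are not handled, so every packet is
    treated as carrying payload only.
    """
    payloads: Dict[int, List[bytes]] = {}
    for offset in range(0, len(stream) - TS_PACKET_SIZE + 1, TS_PACKET_SIZE):
        packet = stream[offset:offset + TS_PACKET_SIZE]
        header = parse_ts_header(packet)
        payloads.setdefault(header.pid, []).append(packet[4:])
    return payloads
```

In a receiver, the payloads collected under `PAT_PID` would be parsed first to locate the PMT, and the PMT would then name the PIDs that carry the AVC base layer and the SVC enhancement PESs.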

shown in Fig. 5, PES 1, carrying the base layer, has the lowest temporal resolution; PES 2, carrying the first SVC enhancement, has twice the temporal resolution of PES 1; and PES 3 again doubles the temporal resolution of PES 2. To allow continuous decoding, the DTSs have equidistant intervals, so that the time available for decoding an access unit is longer at a lower temporal resolution than at a higher one. To allow access unit reassembly in the temporal scalability case, [3] defines, besides the DTS value carried in the PES header, a timestamp reference TREF, which indicates (e.g., for AU(n+3) in PES 3) the DTS″ of the related DR in PES 2.

[Figure 5: an SVC bitstream with access units AUn, AU(n+1), AU(n+2), and AU(n+3) of one GOP is split into PES 1 to PES 4, each carrying DTS values; PES 2 carries DTS′ and the adjusted DTS″. Legend: GOP: group of pictures; AU: access unit; PES: packetized elementary stream; DTS: decoding time stamp; DTS′: DTS of original AU; DTS″: adjusted DTS.]

Figure 5. Splitting an SVC bitstream into elementary streams of MPEG-2 TS.

The timing and buffering model of MPEG-2 systems [4], the system target decoder (STD), defines buffer sizes, data transfer between buffers, and the bit rates used for these operations at the receiver. Figure 6 shows the buffer model for SVC. On the left of the figure, the incoming TS is demultiplexed based on the PIDs. According to the PMT of the received program containing the SVC bitstream, the TS packets are forwarded to the corresponding transport buffer (TB). The data is then transferred to the multiplexing buffer (MB), and the TS packet header is removed in the process. In the MB the PES packet header data is removed, and the resulting data is transferred byte-wise into the DR buffer (DRB) using a leaky bucket method. Once the DR is complete in each DRB, the data with matching DTS (td) or matching TREF (tref) is removed from the DRBs, reassembled in the order of the ESs (as indicated in the hierarchy descriptor), and transferred to the SVC decoder at time td_{n+m} of the highest received ES with PID_{n+m} (Fig. 6).

[Figure 6: the TS demultiplexer routes transport packets of PID n to PID n+m into per-stream buffer chains TB_n, MB_n, DRB_n up to TB_{n+m}, MB_{n+m}, DRB_{n+m}; other PIDs are routed elsewhere. Access unit reassembly takes DR_n(j_n) with td_n(j_n) from DRB_n and DR_{n+1}(j_{n+1}) with tref_{n+1}(j_{n+1}) == td_n(j_n) OR td_{n+1}(j_{n+1}) == td_n(j_n) from DRB_{n+1}, and so on up to DRB_{n+m}, and passes the result through the elementary stream buffer to the SVC decoder. Legend: TB: transport buffer; MB: multiplexing buffer; DRB: dependency representation buffer.]

Figure 6. Decoder buffer allocation and reassembly of access units for SVC layers.

MPEG-2 TRANSPORT STREAM OVER IP

Since an MPEG-2 transport stream provides a completely self-contained transport stream description, such a stream is also suitable for carriage in IP streams. Therefore, [12] provides a specification to transport an MPEG-2 transport stream over RTP as well as over UDP. Currently there are no means defined to allow multisession transmission of different video sub-bitstreams over different transport addresses.

SUMMARY

In this article we give a compact overview of the SVC transport standards for digital TV and IPTV channels. We highlight the feature of multisession transmission in RTP and transmission over multiple MPEG-2 transport streams.

REFERENCES

[1] ITU-T Rec. H.264, “Advanced Video Coding for Generic Audiovisual Services,” Nov. 2007; ISO/IEC 14496-10:2008, “Information Technology - Coding of Audio-Visual Objects (MPEG-4) - Part 10: Advanced Video Coding,” 2008.
[2] S. Wenger et al., Eds., “RTP Payload Format for SVC Video,” IETF AVT, Sept. 2009; http://tools.ietf.org/html/draft-ietf-avt-rtp-svc
[3] T. Schierl, B. Berthelot, and T. Wiegand, Eds., “ISO/IEC 13818-1:2007/AMD3 - Transport of SVC Video over ITU-T Rec. H.222.0,” ISO/IEC 13818-1, to be published.
[4] ITU-T Rec. H.222.0, “Information Technology - Generic Coding of Moving Pictures and Associated Audio Information: Systems,” May 2006; ISO/IEC 13818-1:2007, “Information Technology - Generic Coding of Moving Pictures and Associated Audio Information (MPEG-2) - Part 1: Systems,” 2007.
[5] H. Schwarz, D. Marpe, and T. Wiegand, “Overview of the Scalable Video Coding Extension of the H.264/AVC Standard,” IEEE Trans. Circuits Sys. for Video Tech., vol. 17, no. 9, Sept. 2007, pp. 1103–20.
[6] T. Schierl, T. Stockhammer, and T. Wiegand, “Mobile Video Transmission Using Scalable Video Coding,” IEEE Trans. Circuits Sys. for Video Tech., vol. 17, no. 9, Sept. 2007, pp. 1204–17.
[7] S. Wenger et al., Eds., “RTP Payload Format for H.264 Video,” IETF RFC 3984, Feb. 2005.
[8] S. Wenger, Y.-K. Wang, and T. Schierl, “Transport and Signaling of SVC in IP Networks,” IEEE Trans. Circuits Sys. for Video Tech., vol. 17, no. 9, Sept. 2007, pp. 1164–73.
[9] C. Perkins and T. Schierl, Eds., “Rapid Synchronisation of RTP Flows,” IETF AVT, July 2009; http://tools.ietf.org/html/draft-ietf-avt-rapid-rtp-sync-05
[10] T. Schierl and S. Wenger, Eds., “Signaling Media Decoding Dependency in Session Description Protocol (SDP),” IETF MMUSIC RFC 5583, July 2009.
[11] J. Lennox, J. Ott, and T. Schierl, Eds., “Source-Specific Media Attributes in the Session Description Protocol,” IETF MMUSIC RFC 5576, June 2009.
[12] D. Hoffman et al., Eds., “RTP Payload Format for MPEG1/MPEG2 Video,” IETF RFC 2250, Jan. 1998.
[13] Y.-K. Wang et al., “System and Transport Interface of SVC,” IEEE Trans. Circuits Sys. for Video Tech., vol. 17, no. 9, Sept. 2007, pp. 1149–63.
[14] T. Schierl et al., “Decoding Order Recovery for Multi Flow Transmission of Scalable Video Coding (SVC) over Mobile IP Channels,” Proc. ICST Mobimedia 2009, London, U.K., Sept. 2009.

BIOGRAPHIES

THOMAS SCHIERL ([email protected]) received his Dipl.-Ing. degree in computer engineering from Berlin University of Technology, Germany, in December 2003. He has been with Fraunhofer Institute for Telecommunications - HHI since 2004. As project manager he is responsible for HHI’s multimedia networking activities, and leads various scientific and industry research projects. He has conducted research work on reliable real-time transmission of H.264/MPEG-4 AVC and SVC in mobile point-to-point, point-to-multipoint, and broadcast environments used by 3GPP PSS/MBMS and DVB-H/SH. He has submitted different inputs on real-time streaming to standardization organizations including 3GPP, JVT/MPEG, ISMA, and IETF. Besides the IETF RTP payload format for SVC and the IETF generic SDP signaling for layered codecs, he is co-authoring various IETF inputs. Furthermore, he is one of the editors of the Amendments for SVC and MVC over MPEG-2 TS. In 2007 he visited the group of Prof. Bernd Girod at Stanford University for research activities on application layer multicast using SVC over wired and wireless channels.

KARSTEN GRÜNEBERG ([email protected]) received his Diplom-Ingenieur degree in electrical engineering from the University of Hannover in 1987. He joined the research group on HDTV in the Image Processing and Terminal Devices department of Heinrich-Hertz-Institute, also in 1987. He conducted different research activities on signal processing for HDTV transmission and storage devices. Further work concerned the implementation of a real-time system for stereoscopic videoconferencing with view interpolation. Recently, he worked on architectures and implementations for networked media distribution and storage in the context of several European research projects (MediaNet, ASTRALS, SEA). He is also actively participating in and contributing to standardization committees such as MPEG and DVB.

THOMAS WIEGAND ([email protected]) is a professor of electrical engineering at the Technical University of Berlin chairing the Image Communication group, and jointly heads the Image Processing department of the Fraunhofer Institute for Telecommunications - Heinrich Hertz Institute, Berlin, Germany. Since 1995 he has been an active participant in standardization for multimedia with successful submissions to ITU-T VCEG, ISO/IEC MPEG, 3GPP, DVB, and IETF. In October 2000 he was appointed Associate Rapporteur of ITU-T VCEG. In December 2001 he was appointed Associate Rapporteur/Co-Chair of the JVT. In February 2002 he was appointed Editor of the H.264/AVC video coding standard and its extensions (FRExt and SVC). In January 2005 he was appointed Co-Chair of MPEG Video. In 1998 he received the SPIE VCIP Best Student Paper Award. In 2004 he received the Fraunhofer Award for outstanding scientific achievements in solving application related problems and the ITG Award of the German Society for Information Technology. Since January 2006 he has been an Associate Editor of IEEE Transactions on Circuits and Systems for Video Technology. In 2008 he received the Primetime Emmy Engineering Award that was awarded to the JVT standards committee by the Academy of Television Arts & Sciences for development of the High Profile of H.264/MPEG-4 AVC, for which he served as Associate Rapporteur/Co-Chair, Editor, and technical contributor.
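As a supplementary illustration of the access unit reassembly described for the buffer model of Fig. 6, the matching rule (a DR of a lower ES belongs to the same access unit if its td equals the TREF, or otherwise the DTS, of the DR above it) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the normative procedure of [3]: the names DependencyRepresentation and reassemble_access_unit, the dictionary layout of the DRBs, and the choice to skip an ES that holds no matching DR are all hypothetical, and neither the leaky bucket transfer nor the buffer sizes are modelled.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class DependencyRepresentation:
    dts: int                    # decoding time stamp td from the PES header
    payload: bytes              # DR data after TS and PES headers were removed
    tref: Optional[int] = None  # timestamp reference, if signalled


def reassemble_access_unit(
    drbs: List[Dict[int, DependencyRepresentation]],
    td_highest: int,
) -> Optional[bytes]:
    """Reassemble one access unit at decoding time td_highest.

    drbs[0] models the DRB of the base ES and drbs[-1] that of the highest
    received ES; each maps a DTS to a completed DR (hypothetical layout).
    """
    top = drbs[-1].pop(td_highest, None)
    if top is None:
        return None  # the DR of the highest ES is not complete yet
    parts = [top]
    # A lower DR matches if its td equals the TREF of the DR above it,
    # or the DTS of that DR when no TREF is signalled.
    wanted = top.tref if top.tref is not None else top.dts
    for drb in reversed(drbs[:-1]):
        lower = drb.pop(wanted, None)
        if lower is None:
            continue  # assumption: this ES contributes nothing to the AU
        parts.append(lower)
        wanted = lower.tref if lower.tref is not None else lower.dts
    # Hand the data to the decoder in ES order, base layer first.
    return b"".join(dr.payload for dr in reversed(parts))
```

Calling reassemble_access_unit(drbs, td) with td set to the DTS of the highest received ES removes the matching DRs from all modelled DRBs and returns their payloads concatenated in hierarchy order, which mirrors the removal and reordering step described for Fig. 6.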

IEEE Wireless Communications • October 2009 71