Quality-aware 3D Video Delivery
by Ahmed Hamza
M.Sc., Mansoura University, Egypt, 2008
B.Sc., Mansoura University, Egypt, 2003
Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
in the
School of Computing Science
Faculty of Applied Sciences
© Ahmed Hamza 2017
SIMON FRASER UNIVERSITY
Spring 2017
All rights reserved. However, in accordance with the Copyright Act of Canada, this work may be reproduced without authorization under the conditions for “Fair Dealing.” Therefore, limited reproduction of this work for the purposes of private study, research, criticism, review and news reporting is likely to be in accordance with the law, particularly if cited appropriately.

Approval
Name: Ahmed Hamza
Degree: Doctor of Philosophy
Title: Quality-aware 3D Video Delivery
Examining Committee:
Chair: Arrvindh Shriraman
Associate Professor
Mohamed Hefeeda
Senior Supervisor
Professor

Joseph Peters
Supervisor
Professor

Jiangchuan Liu
Internal Examiner
Professor
School of Computing Science

Abdulmotaleb El Saddik
External Examiner
Professor
School of Electrical Engineering and Computer Science
University of Ottawa
Date Defended: April 18, 2017
Abstract
Three dimensional (3D) videos are the next natural step in the evolution of digital media technologies. In order to provide viewers with depth perception and an immersive experience, 3D video streams contain one or more views and additional information describing the scene’s geometry. This greatly increases the bandwidth requirements for 3D video transport. In this thesis, we address the challenges associated with delivering high quality 3D video content to heterogeneous devices over both wired and wireless networks. We focus on three problems: energy-efficient multicast of 3D videos over 4/5G networks, quality-aware HTTP adaptive streaming of free-viewpoint videos, and achieving quality-of-experience (QoE) fairness in free-viewpoint video streaming in mobile networks. In the first problem, multiple 3D videos represented in the two-view-plus-depth format and scalably coded into several substreams are multicast over a broadband wireless network. We show that optimally selecting the substreams to transmit for the multicast sessions is an NP-complete problem and present a polynomial time approximation algorithm to solve it. To maximize the power savings of mobile receivers, we extend the algorithm to efficiently schedule the transmission of the chosen substreams from each video. In the second problem, we present a free-viewpoint video streaming architecture based on state-of-the-art HTTP adaptive streaming protocols. We propose a rate adaptation method for streaming clients based on virtual view quality models, which relate the quality of synthesized views to the qualities of the reference views, to optimize the user’s quality-of-experience. We implement the proposed adaptation method in a streaming client and assess its performance. Finally, in the third problem, we propose an efficient radio resource allocation algorithm for mobile wireless networks in which multiple free-viewpoint video streaming clients compete for the limited resources. The resulting allocation achieves QoE fairness across the streaming sessions and reduces quality fluctuations.

Keywords: Free-viewpoint video; adaptive video streaming; rate adaptation; DASH; 3D video; energy efficiency; mobile multimedia; multi-view video; wireless networks
To mom, dad, and Amgad.
“Read!” — First verse of The Noble Qur’an (96:1)
“And of knowledge, you (mankind) have been given only a little.” — The Noble Qur’an (17:85)
“Details matter, it’s worth waiting to get it right.” — Steve Jobs
“There is no greatness where there is not simplicity, goodness, and truth.” — Leo Tolstoy, War and Peace
“He who can have patience can have what he will.” — Benjamin Franklin
Acknowledgements
First and foremost, I owe my deepest gratitude to Dr. Mohamed Hefeeda. It has been a great honour to work with Dr. Hefeeda and have him as my senior supervisor. I would like to thank him for his endless encouragement, patience, support, and guidance throughout this journey. His critical reviews and intellectual input have enabled me to develop a deeper understanding of the exciting fields of multimedia networking and distributed systems. I would like to extend my sincerest gratitude to Dr. Joseph Peters, my supervisor, for his valuable advice and comments during my graduate studies. I am heartily thankful to him for sharing his time and his insights whenever I needed them and for the valuable brainstorming and discussion sessions from which I have learned a lot. I would also like to express my gratitude to Dr. Jiangchuan Liu, my internal examiner, and Dr. Abdulmotaleb El Saddik, my external examiner, for being on my committee and reviewing this thesis. Many thanks to Dr. Arrvindh Shriraman for taking the time to chair my thesis defence. I want to thank all my colleagues at the Network Systems Lab throughout the years of my graduate career. I am especially grateful to Cheng-Hsin Hsu, whom I greatly respect and learned a great deal from. Thank you for all the valuable advice and support. Your dedication and hard work have been an inspiration for me to keep going. Special thanks also go to Cong Ly, Shabnam Mirshokraie, Somsubhra Sharangi, Saleh Almowuena, Ahmed Abdelsadek, Kiana Calagari, Tarek El-Ganainy, and Khaled Diab. I also want to thank Hamed Ahmadi for all his help and for a great collaboration. I am really fortunate to have worked with such talented and amazing people and I cannot imagine this journey without them. More importantly, this thesis would not have been possible without the endless love and support of my parents and my brother, Amgad. Words cannot express my eternal gratitude to my parents, who have made great sacrifices so that I can pursue my dreams. Thank you for your constant encouragement in all my pursuits and for always pushing me to succeed and to be a better person. Amgad, thank you for being an amazing brother, and thank you for all your support and care, and for cheering me up whenever I felt down. I am truly blessed to have you.
Table of Contents

Approval
Abstract
Dedication
Quotations
Acknowledgements
Table of Contents
List of Tables
List of Figures
List of Acronyms

1 Introduction
  1.1 Introduction
  1.2 Thesis Contributions
    1.2.1 Energy-efficient Multicast of 3D Videos over Wireless Networks
    1.2.2 Quality-aware HAS-based Free-viewpoint Video Streaming
    1.2.3 QoE-fair Adaptive Streaming of FVV over LTE Networks
  1.3 Thesis Organization

2 Background
  2.1 Introduction
  2.2 3D Content Capturing and Post-processing
    2.2.1 Camera Parameters and Geometric Calibration
    2.2.2 Image Rectification
    2.2.3 Color Correction
  2.3 Human Visual System
  2.4 3D Display Technologies
  2.5 3D Video Representation and Coding
    2.5.1 3D Video Representations
    2.5.2 3D Video Coding
  2.6 HTTP Adaptive Streaming
  2.7 Wireless Cellular Networks
    2.7.1 IEEE 802.16 WiMAX Networks
    2.7.2 Long Term Evolution Networks
    2.7.3 Multimedia Multicast Services

3 Energy-Efficient Multicasting of Multiview 3D Videos over Wireless Networks
  3.1 Introduction
  3.2 Related Work
    3.2.1 3D Video Transmission Over Wireless Networks
    3.2.2 Modeling Synthesized Views Quality
    3.2.3 Optimal Texture-Depth Bit Allocation
  3.3 System Overview
  3.4 Problem Statement and Formulation
  3.5 Proposed Solution
    3.5.1 Analysis
  3.6 Energy Efficient Radio Frame Scheduling
    3.6.1 Proposed Allocation Algorithm
  3.7 Validation of Virtual View Quality Model
  3.8 Performance Evaluation
    3.8.1 Setup
    3.8.2 Simulation Results
  3.9 Summary

4 Virtual View Quality-aware Rate Adaptation for HTTP-based Free-viewpoint Video Streaming
  4.1 Introduction
  4.2 Related Work
    4.2.1 Server-based Approaches
    4.2.2 Client-based Approaches
  4.3 Problem Definition
  4.4 Reference View Scheduling
  4.5 Virtual View Quality-aware Rate Adaptation
    4.5.1 Rate Adaptation Based on Empirical Virtual View Quality Measurements
    4.5.2 Rate Adaptation Based on Analytical Virtual View Quality Models
  4.6 System Architecture and Client Implementation
    4.6.1 Content Server
    4.6.2 FVV DASH Client
  4.7 Evaluation
    4.7.1 Content Preparation
    4.7.2 Experimental Setup
    4.7.3 Empirical Quality Models Results
    4.7.4 Analytical Quality Models Results
    4.7.5 Subjective Evaluation
  4.8 Summary

5 QoE-fair HTTP Adaptive Streaming of Free-viewpoint Videos in LTE Networks
  5.1 Introduction
  5.2 Related Work
    5.2.1 Fairness in Wired Networks
    5.2.2 Fairness in Wireless Networks
  5.3 System Model and Operation
    5.3.1 Wireless Network Model
    5.3.2 FVV Content Model
    5.3.3 System Operation
  5.4 Problem Statement
  5.5 Proposed Solution
    5.5.1 Rate-Utility Models for FVV
    5.5.2 Quality-fair FVV Rate Allocation
  5.6 Evaluation
    5.6.1 Setup
    5.6.2 Performance Metrics
    5.6.3 Results
  5.7 Summary

6 Conclusions and Future Work
  6.1 Conclusions
  6.2 Future Work

Bibliography
List of Tables

Table 3.1 List of symbols used in Chapter 3.
Table 3.2 3D video sequences used in 3D distortion model validation experiments.
Table 3.3 Data rates (Kbps) and Y-PSNR values (dB) representing each layer of the scalable encodings of the texture and depth streams.

Table 4.1 Coefficient of determination and average absolute fitting error for virtual view quality models generated from 100 operating points at view position 2 of the Kendo and Balloons sequences (encoded using VBR).
Table 4.2 Kendo sequence representation bitrates (bps).
Table 4.3 Café sequence representation bitrates (bps).
Table 4.4 Bandwidth change patterns.

Table 5.1 List of symbols used in Chapter 5.
Table 5.2 Mobile network configuration.
List of Figures

Figure 1.1 Problem 1 - Energy-efficient Multicasting of 3D Videos over Wireless Networks.
Figure 1.2 Problem 2 - Quality-aware Free-viewpoint Adaptive Video Streaming.
Figure 1.3 Problem 3 - QoE-fair Radio Resource Scheduling for Free-viewpoint Video Streaming over Mobile Networks.

Figure 2.1 End-to-end 3D video communication chain.
Figure 2.2 Multi-view camera arrangements.
Figure 2.3 Pinhole camera model geometry.
Figure 2.4 Working principles of auto-stereoscopic displays.
Figure 2.5 Two-view head-tracked display.
Figure 2.6 Different view packing arrangements for CSV.
Figure 2.7 Video plus depth representation of 3D video.
Figure 2.8 Synthesizing three intermediate views using two reference views and associated depth maps.
Figure 2.9 A sample frame from a layered depth video (LDV).
Figure 2.10 Hierarchical (dyadic) prediction structure.
Figure 2.11 Scalable video coding.
Figure 2.12 Typical MVC hierarchical prediction structure.
Figure 2.13 Video stream adaptation in HTTP adaptive streaming.
Figure 2.14 Structure of an MPD file in MPEG-DASH.
Figure 2.15 WiMAX frame.
Figure 2.16 LTE network system architecture.
Figure 2.17 Downlink frame in LTE.

Figure 3.1 Calculating profits and costs for texture component substreams of the reference views.
Figure 3.2 Transmission intervals and decision points for two streams in a scheduling window of 20 TDD frames.
Figure 3.3 Average PSNR quality of 3 synthesized views from decoded substreams with respect to views synthesized from uncompressed references.
Figure 3.4 Average SSIM quality of 3 synthesized views from decoded substreams with respect to views synthesized from uncompressed references.
Figure 3.5 Average quality of solutions obtained using proposal (taken over all video sequences) for: (a) variable number of video streams; (b) different MBS area sizes.
Figure 3.6 Average running times for: (a) variable number of video streams; (b) different MBS area sizes.
Figure 3.7 Average running times for different values of the approximation parameter.
Figure 3.8 Allocation algorithm performance in terms of receiver buffer occupancy levels of selected substreams using a 4 second scheduling window: (a) receiving buffer; (b) consumption buffer; (c) overall buffer level.
Figure 3.9 Average energy saving.

Figure 4.1 Free-viewpoint video streaming systems where view synthesis is performed at: (a) server and (b) client.
Figure 4.2 Segment scheduling window. Deciding on left (L) reference view, right (R) reference view, and pre-fetched (P) view.
Figure 4.3 Kendo sequence R-D surface for virtual view at camera 2 position using reference cameras 1 and 3 and equal depth bit rate of 1 Kbps.
Figure 4.4 Architecture of our free-viewpoint video streaming system.
Figure 4.5 The components of our DASH client prototype.
Figure 4.6 The user interface of our DASH client prototype.
Figure 4.7 Frame buffers for the texture components of the reference streams. Dummy frames are inserted into the pre-fetch stream’s buffer to synchronize the buffers when no pre-fetch segments are needed.
Figure 4.8 OpenGL-based view synthesis pipeline.
Figure 4.9 Evaluation testbed.
Figure 4.10 Client response using different rate allocation strategies.
Figure 4.11 Average quality for the Balloons video sequence with CBR encoding and fixed network bandwidth.
Figure 4.12 Average quality for the Café video sequence with CBR encoding and fixed network bandwidth.
Figure 4.13 Average quality for the Balloons video sequence with VBR encoding and fixed network bandwidth.
Figure 4.14 Results for the Kendo video sequence with variable network bandwidth.
Figure 4.15 Results for the Café video sequence with variable network bandwidth.
Figure 4.16 Difference mean opinion score (DMOS) between proposed virtual view quality-aware rate allocation and [113] for VBR and CBR encoded MVD videos at different available network bandwidth values.

Figure 5.1 System model for a HAS-based FVV streaming system.
Figure 5.2 Free-viewpoint video using multi-view-plus-depth content representation.
Figure 5.3 Sequence diagram using HTTP or HTTPS with CDN and mobile network collaboration.
Figure 5.4 Sequence diagram using HTTPS with no CDN and mobile network collaboration.
Figure 5.5 Operating points for a virtual view where two reference views and their associated depth maps are used for view synthesis and each component has 6 CBR-coded representations.
Figure 5.6 Average video quality over time (20 users).
Figure 5.7 Average rate of downward video quality switches.
Figure 5.8 Percentage of saved resource blocks.
Figure 5.9 Fairness in terms of average Jain’s Index across users.
Figure 5.10 Cumulative distribution function of average running time per scheduling window (40 users).
List of Acronyms
3GPP Third Generation Partnership Project.
AVC Advanced Video Coding.
CBR Constant Bit Rate.
CQI Channel Quality Information.
CSV Conventional Stereo Video.
DASH Dynamic Adaptive Streaming over HTTP.
DIBR Depth-Image-Based Rendering.
FDD Frequency Division Duplex.
FVV Free-Viewpoint Video.
GOP Group of Pictures.
HAS HTTP Adaptive Streaming.
HEVC High Efficiency Video Coding.
HTTP Hyper-Text Transfer Protocol.
HVS Human Visual System.
IEEE Institute of Electrical and Electronics Engineers.
IP Internet Protocol.
LDV Layered Depth Video.
LTE Long Term Evolution.
MB Macroblock.
MBS Multicast and Broadcast Service.
MCS Modulation and Coding Scheme.
MPD Media Presentation Description.
MPEG Moving Picture Experts Group.
MV Motion Vector.
MVD Multi-View plus Depth.
NAL Network Abstraction Layer.
OFDM Orthogonal Frequency-Division Multiplexing.
OFDMA Orthogonal Frequency-Division Multiple Access.
QoE Quality-of-Experience.
RB Resource Block.
RTCP Real-time Transport Control Protocol.
RTP Real-time Transport Protocol.
RTSP Real-time Streaming Protocol.
SEI Supplemental Enhancement Information.
SFN Single Frequency Network.
SNR Signal-to-Noise Ratio.
TCP Transmission Control Protocol.
TDD Time Division Duplex.
UDP User Datagram Protocol.
UE User Equipment.
UMTS Universal Mobile Telecommunications System.
VBR Variable Bit Rate.
WiMAX Worldwide Interoperability for Microwave Access.
Chapter 1
Introduction
1.1 Introduction
Three-dimensional (3D) video has been gaining popularity over the past few years. The global 3D display market is expected to grow at a compound annual growth rate of 28.38% between 2016 and 2020 [96]. Advances in 3D video acquisition and display technologies have paved the way for many emerging 3D applications, such as 3D TV, free-viewpoint video, and immersive teleconferencing. Such applications provide a realistic, visually appealing viewing experience by allowing viewers to perceive depth and view the captured scene from different vantage points. 3D TV extends the traditional 2D TV experience with the ability to perceive depth using displays that are able to decode and display more than one view simultaneously. Rendering multiple views at different spatial locations provides motion parallax, one of the main cues in the human visual system for perceiving depth. These displays can be two-view (stereoscopic) displays with associated glasses or more advanced auto-stereoscopic displays that support a large number of views and do not require specialized glasses. In contrast to 3D TV, the free-viewpoint video (FVV) application focuses on free navigation. Viewers can interactively choose their viewpoint to observe the captured scene from a preferred perspective. Switching between the different views can be achieved using a simple remote control, a head tracking device, or a head-mounted display. If the desired viewpoint is not available, i.e., it was not captured by a camera, a virtual view interpolated from the available views is rendered. An efficient technique for synthesizing realistic virtual views utilizes depth information associated with the captured views and is known as depth-image-based rendering (DIBR) [37]. The third application of 3D videos is immersive teleconferencing, which creates a photo-realistic, immersive, and interactive 3D collaboration environment among geographically dispersed users. Each site involved in this collaborative environment has an array of cameras and 3D displays. The cameras capture the participants in the local scene from various
angles and generate a continuous 3D video stream that is rendered on the 3D displays at the receiving sites. Immersiveness can be achieved either in a free-viewpoint video or a 3D TV manner. Due to the large number of views, as well as the additional geometry information such as depth maps, involved in the representation of 3D videos, storing and transporting these videos is quite challenging for both wireless and wired networks. With the increasing popularity of 3D content, current and future networks will be required to allocate a huge amount of bandwidth to 3D video streaming services. Given the limited amount of available resources in a given network, sharing these resources among multiple 3D video streams while providing end users with the best possible quality-of-experience (QoE) is a challenging task. This is especially the case for wireless networks, such as cellular 4/5G broadband networks, where the operator bandwidth is often limited with respect to users’ demands and the users are watching the content on battery-powered mobile devices. Adding to the complexity of the problem is the time-varying nature of the wireless channel conditions. An important problem in wireless networks is therefore how to dynamically adapt the utilization of radio resources to achieve the best perceived quality. One promising technology for transmitting 3D video content over wireless networks is utilizing multicast/broadcast services. These services enable the delivery of multimedia content to large-scale user communities in a cost-efficient manner. By serving mobile terminals interested in the same video using a single multicast session, mobile network operators can efficiently utilize network resources and significantly reduce the load. A base station scheduler is responsible for the important task of determining how to allocate video data to the multicast/broadcast data area in each frame such that the real-time nature of the video stream is maintained and the perceived quality is maximized over all sessions. An efficient method for video delivery that has recently been widely adopted is HTTP adaptive streaming (HAS) [109]. HAS is a client-driven streaming approach that is able to dynamically adjust the rate of the video stream by monitoring network conditions. Videos are encoded at a number of bit rates (versions) and each version is split into small, equal-duration chunks. A HAS client periodically measures the end-to-end throughput between itself and the server and selects the appropriate chunk version accordingly. Therefore, HAS is a promising technique for providing high QoE in 3D video streaming systems. However, unlike 2D videos, 3D videos often have multiple components, e.g., multiple views and/or texture and depth streams, and involve a view interpolation (synthesis) method. The quality of the output of the view interpolation method is directly affected by the qualities of the components used as input to the process. Therefore, traditional HAS rate adaptation methods are unable to efficiently handle 3D video streams, where the relationship between the stream bit rate and the perceived quality is more complex.
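To make the HAS adaptation loop described above concrete, the following minimal sketch shows throughput-driven version selection by a generic HAS client. It is an illustrative example only, not the adaptation logic proposed in this thesis; the bitrate ladder, smoothing factor, and safety margin are hypothetical values.

```python
# Minimal sketch of throughput-based HAS rate adaptation (illustrative only).
# The bitrate ladder and smoothing factor below are hypothetical values.

LADDER_KBPS = [400, 800, 1500, 3000, 6000]  # available representations (versions)

def update_throughput_estimate(prev_estimate_kbps, chunk_bits, download_time_s, alpha=0.8):
    """Exponentially smoothed estimate of the end-to-end throughput (kbps)."""
    sample_kbps = (chunk_bits / 1000.0) / download_time_s
    if prev_estimate_kbps is None:
        return sample_kbps
    return alpha * prev_estimate_kbps + (1.0 - alpha) * sample_kbps

def select_version(throughput_estimate_kbps, safety_margin=0.9):
    """Pick the highest bitrate version that fits within the estimated throughput."""
    budget = throughput_estimate_kbps * safety_margin
    feasible = [r for r in LADDER_KBPS if r <= budget]
    return max(feasible) if feasible else min(LADDER_KBPS)

# Example: after downloading a 2-second chunk of 1.2 Mb in 0.5 s, the client
# estimates ~2400 kbps of throughput and requests the 1500 kbps version next.
est = update_throughput_estimate(None, chunk_bits=1_200_000, download_time_s=0.5)
print(select_version(est))
```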
In this thesis, we study multiple challenges associated with delivering 3D video content using wireless multicast services and over-the-top adaptive streaming services. We focus on optimizing the quality-of-experience for users in terms of perceived visual quality and, in the case of mobile users, viewing time. We develop algorithms that enable 3D video delivery systems to adapt to the dynamics of the network conditions and provide viewers with the best possible video quality. Given that a large portion of mobile traffic nowadays is video streaming data, a secondary objective in some of our proposed algorithms is to save as much battery power as possible for mobile receivers. Not only does this improve the users’ experience, but it also reduces the number of recharging cycles for their devices, which has a positive impact on the environment by reducing the load on electrical power grids. We focus on three main problems. The first problem is how to efficiently multicast 3D videos over mobile wireless networks. This problem is illustrated in Figure 1.1. Here, mobile users are interested in watching 3D videos which are transmitted over a number of multicast channels. The users’ mobile devices are capable of rendering the videos using either stereoscopic displays (more common) or the more advanced auto-stereoscopic displays, which can render a large number of views to provide the viewer with a more immersive experience within a certain viewing angle. This application is referred to as 3D TV since it does not involve interactive view switching. The 3D video transmitted by each multicast session is captured using two color cameras and two depth cameras/sensors, and the streams are encoded using a scalable video coder into a number of quality layers. Using a process known as depth-image-based rendering [37], a receiving client is able to render a number of (virtual) views if the device is equipped with an auto-stereoscopic display, or adjust the displacement between the stereoscopic image pair based on the display size in the case of stereoscopic displays. Given the limited capacity of the wireless channel, our goal is to select the best set of substreams for the components of each of the multicast videos such that the network capacity is not exceeded and the quality of the rendered views is maximized. Moreover, it is important to efficiently schedule the video data of the multicast sessions within the radio frames in order to minimize energy consumption at the receiver side. This scheduling process should also ensure that receiving buffers are never drained completely at any point in time during playback, to avoid interruptions. The second problem we address in this thesis is quality-aware free-viewpoint adaptive video streaming, shown in Figure 1.2. Free-viewpoint videos are captured using a set of cameras recording the scene from different angles. Unlike the first problem, where user interactivity is not practical because the videos are multicast to a large number of users, here we consider a single video streaming session where the user is able to navigate between a number of views. Each captured view has an associated depth map stream to enable streaming clients to render non-captured views. The video and depth map streams are available in a number of representations, where each representation is encoded at a certain bitrate and/or quality, and are stored as a set of fixed-duration chunks on a content server along with a manifest file. The streaming client uses an HTTP adaptive
streaming standard, such as MPEG-DASH [56], to request video chunks from the server. Since the user can request any view in the region between the first and last captured views, including non-captured views, the client needs to dynamically adapt to both the target view and the network conditions when requesting video chunks. In this context, we implement a free-viewpoint video streaming client based on HTTP adaptive streaming and propose a rate adaptation method that maximizes the quality of rendered views. The proposed rate adaptation method determines which set of captured views (and depth maps) should be requested from the content server and which representation is chosen for each chunk.

Figure 1.1: Problem 1 - Energy-efficient Multicasting of 3D Videos over Wireless Networks.
Figure 1.2: Problem 2 - Quality-aware Free-viewpoint Adaptive Video Streaming.

In the third problem, we consider a set of users streaming free-viewpoint videos within a single cell of a mobile wireless network. This problem is shown in Figure 1.3 and is an extension of the previous adaptive streaming problem. Here the wireless channel is a bottleneck and users compete for the radio resources. In wireless networks, the base station scheduler is responsible for allocating radio resource blocks and assigning a guaranteed bitrate to each user. Given that the complexity of the video content being streamed varies from one video to another, the challenge is how to allocate these resources in a way that maximizes the perceived quality while maintaining fairness across users. Although this problem is also applicable to 2D video streams, the complexity of the rate-utility relationship for synthesized views in the case of free-viewpoint videos makes it more challenging.
We propose an efficient heuristic algorithm that solves this problem using virtual view rate-utility models. In addition to the previously mentioned objectives, our proposed algorithm also minimizes the variability in video quality at the receivers. Although we address the problem in the context of wireless networks, it should be noted that the proposed solution is also applicable to any network where a centralized component is responsible for managing network resources, e.g., software-defined networks.
1.2 Thesis Contributions
We consider 3D video delivery in two network settings: (i) multicast/broadcast transmission over wireless networks such as LTE/LTE-Advanced and WiMAX, where popular 3D videos are sent through multicast/broadcast channels to reduce the load on the core network of the operator and efficiently manage the spectrum; and (ii) streaming of multi-view-plus-depth 3D videos using HTTP adaptive streaming to enable free-viewpoint navigation at the receiver side. We identify multiple optimization problems and, for each problem, we design efficient algorithms to optimize the quality-of-experience observed by users.
1.2.1 Energy-efficient Multicast of 3D Videos over Wireless Networks
We consider the problem of multicasting 3D videos over 4/5G broadband access networks, such as Long Term Evolution (LTE) and WiMAX, to mobile devices with auto-stereoscopic
displays. For such displays, 3D scenes need to be efficiently represented using a small amount of data that can be used to generate arbitrary views not captured during the acquisition process. The multi-view-plus-depth (MVD) representation has proven to be both efficient and flexible in providing good quality synthesized views. However, the quality of synthesized views is affected by the compression of the texture videos and depth maps. Given the limitations on the wireless channel capacity, it is important to efficiently utilize the channel bandwidth such that the quality of all rendered views at the receiver side is maximized. In addition, an efficient multicast solution should minimize the power consumption of the receivers to provide a longer viewing time. We address two main challenges: (i) maximizing the video quality of rendered views on auto-stereoscopic displays [32][122] of mobile receivers such as smartphones and tablets; and (ii) minimizing the energy consumption of the mobile receivers during multicast sessions.

Figure 1.3: Problem 3 - QoE-fair Radio Resource Scheduling for Free-viewpoint Video Streaming over Mobile Networks.

Our contributions in this topic can be summarized as follows [44]:
• We study the problem of optimal substream selection for multicasting scalably-coded MVD 3D videos over wireless networks. We mathematically formulate the problem and prove its NP-hardness. We then propose an approximation algorithm for solving the problem in real time.
• We propose an energy-efficient radio frame scheduling algorithm that utilizes our substream selection algorithm and reduces the power consumption of receiving clients while maximizing perceived quality. Instead of continuously sending the streams at the encoding bit rate, our energy saving algorithm transmits the video streams in bursts at much higher bit rates to reduce transmission time. After receiving a burst of data, mobile subscribers can switch off their RF circuits until the start of the next burst (a toy numerical illustration of this effect follows this list). Our radio frame scheduling algorithm generates a burst schedule that maximizes the average system-wide energy saving over all multicast streams and prevents buffer overflow or underflow instances from occurring at the receivers.
• We evaluate the performance of our algorithms using simulation-based experiments. Our results show that the proposed algorithms provide solutions that are within 0.3 dB of the optimal solutions while satisfying the real-time requirements of multicast systems, and they result in an average power consumption reduction of 86%.
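The following toy sketch illustrates how burst transmission translates into receiver energy saving, i.e., the fraction of time the RF circuit can sleep. The stream bitrate, channel rate, scheduling window, and wake-up overhead are hypothetical; the actual scheduler in Chapter 3 additionally guarantees that receiver buffers never underflow or overflow.

```python
# Toy illustration of burst-based multicast transmission and the resulting
# receiver energy saving (fraction of time the RF circuit can sleep).
# All numbers are hypothetical; the scheduler in Chapter 3 also enforces
# buffer constraints at the receivers.

def energy_saving(stream_kbps, channel_kbps, window_s, overhead_s=0.002):
    """Energy saving for one multicast stream over a scheduling window.

    The stream needs stream_kbps * window_s kilobits per window; sending them
    in a burst at the full channel rate keeps the radio on only for that air
    time plus a fixed wake-up overhead (modelled here as a single burst).
    """
    bits_needed = stream_kbps * 1000.0 * window_s
    air_time = bits_needed / (channel_kbps * 1000.0)
    on_time = min(window_s, air_time + overhead_s)
    return 1.0 - on_time / window_s

# Example: a 2 Mbps 3D stream sent in bursts over a 50 Mbps multicast channel
# with a 1-second scheduling window -> the receiver radio sleeps ~96% of the time.
print(round(energy_saving(2000, 50000, 1.0), 3))
```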
1.2.2 Quality-aware HAS-based Free-viewpoint Video Streaming
We study the problem of optimizing interactive free-viewpoint video streaming to heterogeneous clients using HAS. We analyze the relationship between the quality of the reference views’ components and that of synthesized virtual views and derive a simple model that captures this relationship. We formulate a virtual view quality optimization problem to find the optimal set of reference representations to request and present a virtual-view-quality-aware rate adaptation algorithm to solve this problem. Our quality-aware adaptation algorithm is based on virtual view quality models that enable the streaming client to estimate the average quality of virtual views over the duration of each video chunk. We implement the proposed algorithm in a real HAS-based free-viewpoint video streaming testbed and conduct experiments for performance evaluation. Our contributions can be summarized as follows [45]:
• We present a two-step rate adaptation method for free-viewpoint videos. In the first step, the streaming client performs view pre-fetching based on historical viewpoint positions to reduce view switching latency and reduce quality degradation when switching views. In the second step, the client utilizes virtual view quality models described in the manifest file of the video to determine the best set of segments for the next request such that the average quality of the synthesized virtual views is maximized (a simplified sketch of this selection step follows this list).
7 • We describe an end-to-end system architecture for free-viewpoint video streaming based on HAS and the multi-view-plus-depth 3D video representation. We develop a complete streaming client that implements the proposed rate adaptation method using empirical and analytical virtual view quality models.
• We rigorously evaluate the performance of the proposed virtual view quality-based rate adaptation algorithm using multiple video sequences. We evaluate the perceived quality using objective quality metrics, and we conduct a subjective quality assessment study. Our results indicate that the proposed virtual view quality-aware rate adaptation method results in significant quality gains (up to 4 dB for constant bit rate streams and up to 2.26 dB for variable bit rate streams), especially under low bandwidth conditions.
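The sketch below illustrates the core idea behind virtual-view-quality-aware representation selection: enumerate the representation combinations of the left and right reference views (texture and depth) that fit the throughput budget and keep the one whose modelled virtual view quality is highest. The linear quality model, bitrates, and quality values are hypothetical placeholders for the content-dependent models derived in Chapter 4.

```python
# Simplified sketch of virtual-view-quality-aware representation selection.
# The linear virtual view quality model and all bitrates/qualities below are
# hypothetical; Chapter 4 derives the actual per-content models.

from itertools import product

# (bitrate_kbps, quality_dB) for each representation of each component.
LEFT_TEX  = [(500, 34.0), (1000, 37.0), (2000, 39.5)]
RIGHT_TEX = [(500, 33.5), (1000, 36.5), (2000, 39.0)]
LEFT_DEP  = [(100, 40.0), (300, 44.0)]
RIGHT_DEP = [(100, 39.5), (300, 43.5)]

def virtual_view_quality(q_lt, q_rt, q_ld, q_rd):
    """Hypothetical model: synthesized view quality as a weighted sum of the
    reference texture and depth qualities (texture dominates)."""
    return 0.40 * q_lt + 0.40 * q_rt + 0.10 * q_ld + 0.10 * q_rd

def best_operating_point(budget_kbps):
    """Return the highest-quality feasible combination of representations."""
    best = None
    for lt, rt, ld, rd in product(LEFT_TEX, RIGHT_TEX, LEFT_DEP, RIGHT_DEP):
        rate = lt[0] + rt[0] + ld[0] + rd[0]
        if rate > budget_kbps:
            continue
        q = virtual_view_quality(lt[1], rt[1], ld[1], rd[1])
        if best is None or q > best[0]:
            best = (q, (lt, rt, ld, rd))
    return best

# Example: with a 3 Mbps budget the client picks the combination of reference
# representations that maximizes the modelled virtual view quality.
print(best_operating_point(3000))
```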
1.2.3 QoE-fair Adaptive Streaming of FVV over LTE Networks
We study the problem of QoE-fair radio resource allocation for adaptive FVV streaming to heterogeneous clients in mobile networks. Here, a number of mobile terminals within a cellular network use our HAS-based client to stream FVV content. The wireless channel conditions vary from one user to another due to channel fading as well as how far the user is from the base station. Most base station schedulers allocate resources by employing variations of the proportional fair [66] scheduling policy, which is inefficient when users follow different utilities. We propose a radio resource allocation algorithm that finds an optimized QoE-fair allocation for the streaming clients, maximizing their perceived quality and reducing quality variation. Our contributions in this topic can be summarized as follows:
• We study the rate-utility relationship for synthesized virtual views in free-viewpoint videos represented using multiple views plus depth. We propose content-dependent parametric models that describe this relationship. These models enable resource allocation algorithms to estimate the perceived quality of synthesized virtual views given an allocated bandwidth.
• We formulate the problem of QoE-fair resource allocation for HAS-based video streaming in LTE networks as a multi-objective optimization problem that is known to be NP-hard. In addition to achieving QoE fairness, a resource allocation algorithm should also maximize the average video quality and minimize quality variations. We propose a heuristic algorithm that utilizes rate-utility models for synthesized virtual views, attempts to achieve a balance between the three objectives, and can run in real time (a small illustration of the fairness metric we target follows this list). The proposed algorithm runs on media-aware network elements (MANEs) within the mobile operator’s network to support the base station scheduler.
• We evaluate the proposed radio resource allocation algorithm using OPNET and Matlab. The proposed algorithm is compared to state-of-the-art approaches. Our results show that our algorithm is able to achieve a high level of fairness while reducing the rate of quality switches by up to 32% compared to other algorithms. Our algorithm also saves up to 18% of the radio resource blocks while achieving comparable average quality.
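As a brief illustration of the fairness notion used above, the sketch below computes Jain's fairness index over per-user utilities (perceived qualities). The utility values are hypothetical; in Chapter 5 they come from the rate-utility models for synthesized virtual views, and the actual allocation algorithm jointly balances fairness, average quality, and quality variation.

```python
# Sketch of measuring QoE fairness across streaming sessions with Jain's index.
# The per-user utility (quality) values are hypothetical; Chapter 5 obtains
# them from rate-utility models for synthesized virtual views.

def jains_index(utilities):
    """Jain's fairness index: equals 1.0 when all utilities are identical."""
    n = len(utilities)
    total = sum(utilities)
    squares = sum(u * u for u in utilities)
    return (total * total) / (n * squares) if squares > 0 else 0.0

# Example: a rate-fair allocation may leave users of complex FVV content with
# lower perceived quality than users of simple content.
rate_fair_quality = [30.0, 41.0, 43.0]   # dB, hypothetical
qoe_fair_quality  = [37.0, 38.0, 38.5]   # dB, hypothetical

print(round(jains_index(rate_fair_quality), 4))  # lower fairness
print(round(jains_index(qoe_fair_quality), 4))   # closer to 1.0
```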
1.3 Thesis Organization
The chapters of this thesis are organized as follows. We first present some background on the end-to-end 3D video communications chain and wireless cellular networks in Chapter 2. In Chapter 3, we propose energy-efficient techniques for multicasting 3D videos over mobile broadband access networks. We address the problem of quality-aware adaptive streaming of free-viewpoint videos in Chapter 4, where we formulate the problem and derive virtual-view quality models to support the rate adaptation process. In Chapter 5, we present a heuristic algorithm to perform QoE-fair radio resource allocation for free-viewpoint video streaming in cellular networks. We conclude the thesis and discuss potential future research directions in Chapter 6.
Chapter 2
Background
2.1 Introduction
An end-to-end 3D video communication chain consists of several stages, as depicted in Figure 2.1. These include: content capturing, 3D representation, data compression, transmission, decompression, post-processing, and 3D rendering and display. In this chapter, we introduce some background as well as relevant work for each of these stages. First, we describe how 3D video content is captured using a set of cameras and the challenges associated with this process in Section 2.2. At the end of the chain, 3D displays exploit the characteristics of human depth perception to provide an immersive 3D experience. In particular, our visual system relies on several cues that provide it with information about how far objects are and their relative positions. The priorities of these depth cues and the exact integration method used by the human visual system to fuse this information together to visualize the world’s 3D structure are still not fully known. Current flat screen 3D displays take advantage of two such cues (binocular stereopsis and motion parallax) to provide the illusion of depth. We describe how the human visual system perceives depth in order to visualize objects and their relative positions in 3D space in Section 2.3 and discuss the main depth cues. In Section 2.4, we present the different 3D display technologies and how they exploit the human visual system’s principles to provide the eyes with the information that enables them to perceive a video sequence in 3D. The 3D scene information can be represented in different ways depending on the application and the type of display technology used to render the content. Examples include image-based representations, depth-based representations, 3D mesh models, and point clouds. In this thesis, we mainly focus on image- and depth-based representations as they are the main representation formats currently being investigated for 3D TV and FVV applications. The amount of information required to capture the aspects of a 3D scene is a multiple of that required by 2D videos. Therefore, transmitting this information over existing
bandwidth-limited networks requires efficient compression algorithms that exploit inherent redundancies and significantly reduce the amount of data. Section 2.5 covers 3D content generation in terms of representation and coding formats, as well as various coding approaches. The delivery of 3D video services poses more challenges than conventional 2D video services due to the large amount of data involved, diverse network characteristics and user terminal requirements, as well as the user’s context (e.g., preferences, location) [10]. HTTP adaptive streaming (HAS) has recently been adopted as the universal client-driven streaming solution for video distribution over the Internet. HAS is designed to cope with the highly dynamic nature of communication channels and is a promising delivery method for 3D videos. A brief introduction to the principles of HAS and how HAS content is generated is presented in Section 2.6. Recent studies have also shown that mobile video traffic is dominating the mobile communication landscape, with more and more users preferring to watch their favourite content on the go [25]. We discuss the main concepts related to mobile broadband access networks in Section 2.7.

Figure 2.1: End-to-end 3D video communication chain.
2.2 3D Content Capturing and Post-processing
Most 3D and free-viewpoint video systems use multiple cameras to capture real-world scenery. These cameras are sometimes combined with depth sensors in order to capture the scene geometry. The density (i.e., number of cameras) and arrangement of the cameras impose practical limitations on the view navigation range and the quality of the rendered views at a certain virtual view position [108]. Three typical multi-camera arrangements are shown in Figure 2.2. Individual cameras composing a multi-camera system have unique internal characteristics. Even when two cameras from a certain manufacturer are used to capture images of
the same object from the exact same location and direction, the resulting images will not be identical. Moreover, if the location of each camera and the orientation of their respective optical axes cannot be determined precisely, virtual views cannot be interpolated accurately. Therefore, multi-camera capturing systems impose additional requirements, which are not present in traditional 2D video capturing systems, in order to correct for these factors.

Figure 2.2: Multi-view camera arrangements: (a) divergent; (b) convergent; and (c) parallel.

These requirements include [80, Chapter 2]:
1. Accurate 3D positions and viewing directions of all cameras should be known (to integrate captured multi-view video data geometrically).
2. The cameras should be accurately synchronized (to integrate captured multi-view video data temporally).
3. Brightness and chromatic characteristics of the cameras should be accurately known (to integrate captured multi-view video data chromatically).
4. All object surface areas should be observed by at least two cameras to reconstruct their 3D shapes by stereo-based methods.
Figure 2.3: Pinhole camera model geometry.
2.2.1 Camera Parameters and Geometric Calibration
Geometric camera calibration is the process of estimating the parameters of the geometric transformation conducted by a camera, which projects a 3D point onto the 2D image plane of the camera. These parameters include the internal geometric and optical characteristics and/or the 3D position and orientation of the camera frame relative to a certain world coordinate system. Most geometric calibration methods, such as the popular method proposed by Tsai [119], are based on a pinhole camera geometry model in which three types of coordinate systems are defined: the world coordinate system, the camera coordinate system, and the image coordinate system. Each camera (view) has its own camera coordinate system and its own image coordinate system. As shown in Figure 2.3, the pinhole camera model is described by an optical centre (camera projection centre), $C$, and an image plane. The distance of the image plane from $C$ is called the focal length, $f$. The line from the camera centre $C$ perpendicular to the image plane is called the principal axis (optical axis) of the camera. The plane parallel to the image plane and containing $C$ is called the principal plane (focal plane). To describe the relationship among the coordinate systems, two sets of camera parameters are defined: extrinsic parameters and intrinsic parameters. Extrinsic parameters describe the transformation from world coordinates to camera coordinates. This is represented by a translation vector $t_{3\times 1}$ and a rotation matrix $R_{3\times 3}$. Intrinsic parameters describe the characteristics of the camera that influence the transformation from camera coordinates to image coordinates. These characteristics are represented by the camera calibration matrix $K$, which contains information about the focal length $f$, the image centre coordinates $(o_x, o_y)$, and the pixel size in millimetres $(s_x, s_y)$ along the axes of the camera photo-sensor. A 3D point is projected onto the image plane along the line containing the point and the optical centre. The relationship between the 3D coordinates of a scene point and the coordinates of its projection onto the image plane is described by the central or perspective projection
[47, Chapter 9]. If the world (scene) and image points are represented by homogeneous vectors, the perspective projection of a point $M = (X, Y, Z, 1)^T$ in the 3D space of the scene to a pixel $m = (u/z, v/z, 1)^T$ in the image plane is defined by Eq. (2.1), where $P$ is the camera projection matrix which describes the linear mapping and $z$ corresponds to the depth value of the view described by $P$:

$$z\,m = P\,M \tag{2.1}$$

$$\begin{bmatrix} u \\ v \\ z \end{bmatrix} = P \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \tag{2.2}$$

In general, $P$ is a $3 \times 4$ full-rank matrix. Since it is a homogeneous matrix, it has 11 degrees of freedom. The camera projection matrix encodes both the extrinsic and intrinsic parameters of the camera. Using QR factorization, we can decompose the $3 \times 4$ full-rank matrix $P$ into the matrices and vectors representing those parameters. Thus, $P$ can be factorized as given in Eq. (2.3), where $t$ is the translation vector, $R$ is the rotation matrix, and $K$ is the upper triangular (non-singular) camera calibration matrix:

$$P = K\,[R\,|\,t] \tag{2.3}$$
The intrinsic matrix $K$ represents the transformation from camera coordinates to image coordinates. It is defined as given in Eq. (2.4), where $\alpha_x$ and $\alpha_y$ are the focal lengths along the x-axis and y-axis, respectively, and $(o_x, o_y)$ is the principal point offset. The reason that the focal length differs in the two axial directions is that CCD cameras may have non-square pixels. Since image coordinates are measured in pixels, this introduces unequal scale factors in each direction and the image coordinates become non-Euclidean. Thus, $\alpha_x = f n_x$ and $\alpha_y = f n_y$, where $n_x$ and $n_y$ are the number of pixels per unit distance in the x-direction and y-direction, respectively.

$$K = \begin{bmatrix} \alpha_x & 0 & o_x \\ 0 & \alpha_y & o_y \\ 0 & 0 & 1 \end{bmatrix} \tag{2.4}$$

The extrinsic matrix $E = [R\,|\,t]$ describes the transformation from world coordinates to camera coordinates and is composed of a rotation matrix $R_{3\times 3}$ and a translation vector $t_{3\times 1}$. Using homogeneous coordinates, the transformation is given in Eq. (2.5). This can also be simplified and represented as given in Eq. (2.6). It should be noted that $t = -R\tilde{C}$, where $\tilde{C}$ represents the coordinates of the camera centre in the world coordinate frame (represented in non-homogeneous coordinates).
$$\begin{bmatrix} X_{cam} \\ Y_{cam} \\ Z_{cam} \\ 1 \end{bmatrix} = \begin{bmatrix} R_{3\times 3} & t_{3\times 1} \\ 0 & 1 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \tag{2.5}$$

$$\begin{bmatrix} X_{cam} \\ Y_{cam} \\ Z_{cam} \end{bmatrix} = R_{3\times 3} \cdot \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} + t_{3\times 1} \tag{2.6}$$

Thus, the perspective projection of a point $M = (X, Y, Z, 1)^T$ in the 3D space of the scene to a pixel $m = (u/z, v/z, 1)^T$ in the image plane can be re-written as given in Eq. (2.7):

$$\begin{bmatrix} z \cdot u \\ z \cdot v \\ z \\ 1 \end{bmatrix} = \begin{bmatrix} K_{3\times 3} & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} R_{3\times 3} & t_{3\times 1} \\ 0 & 1 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \tag{2.7}$$
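The following snippet illustrates the projection pipeline of Eqs. (2.1)-(2.7) numerically: it builds an intrinsic matrix $K$ and a pose $[R\,|\,t]$, forms $P = K[R\,|\,t]$, and projects a world point to pixel coordinates. All parameter values are hypothetical and chosen only for illustration.

```python
# Numerical illustration of the pinhole projection in Eqs. (2.1)-(2.7).
# The intrinsic and extrinsic parameter values below are hypothetical.
import numpy as np

# Intrinsics: focal lengths (in pixels) and principal point offset.
alpha_x, alpha_y = 1200.0, 1180.0
o_x, o_y = 640.0, 360.0
K = np.array([[alpha_x, 0.0,     o_x],
              [0.0,     alpha_y, o_y],
              [0.0,     0.0,     1.0]])

# Extrinsics: identity rotation and a small translation (t = -R @ C_tilde).
R = np.eye(3)
t = np.array([[0.1], [0.0], [0.0]])

# Camera projection matrix P = K [R | t]  (Eq. 2.3).
P = K @ np.hstack((R, t))

# A world point in homogeneous coordinates, M = (X, Y, Z, 1)^T.
M = np.array([0.5, 0.2, 4.0, 1.0])

# Perspective projection z*m = P*M (Eq. 2.1); divide by the depth z to get pixels.
uvz = P @ M
z = uvz[2]
u, v = uvz[0] / z, uvz[1] / z
print(f"pixel coordinates: ({u:.1f}, {v:.1f}), depth z = {z:.1f}")
```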
2.2.2 Image Rectification

In some settings (e.g., if depth is estimated from the disparities between the captured views), it is also desirable to perform multi-view image rectification as a post-processing step after capturing the multi-view content using calibrated cameras. Due to differences in the positioning of the cameras, the captured multi-view images usually have both horizontal and vertical disparities between neighboring views. If the vertical disparities are removed completely, the search range for depth estimation algorithms is reduced to a single (horizontal) dimension and the stereo matching process becomes greatly simplified and much faster [62]. Multi-view image rectification aligns the epipolar lines of each camera view and removes vertical disparities. In addition, it also compensates for the slight focal length differences existing between the cameras.
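For rectified, parallel cameras, depth and horizontal disparity are inversely related, which is what makes the one-dimensional search useful for depth estimation. The helper below shows the standard relation; this formula is not stated in the text, and the numerical values are hypothetical.

```python
# Depth from horizontal disparity for a rectified, parallel stereo pair.
# Z = f * B / d, with focal length f (pixels), baseline B (metres), and
# disparity d (pixels). All values below are hypothetical.

def depth_from_disparity(f_px, baseline_m, disparity_px):
    """Depth (metres) of a point with the given horizontal disparity."""
    return f_px * baseline_m / disparity_px

# Example: f = 1200 px, baseline = 6.3 cm, disparity = 20 px -> Z = 3.78 m.
print(depth_from_disparity(1200.0, 0.063, 20.0))
```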
2.2.3 Color Correction
Another common problem in multi-camera systems is that different camera sensors produce different color responses to the same image object [106]. Physical factors during the imaging process introduce a variation that differs from one camera to another. Moreover, even if all cameras are properly calibrated, it is practically impossible to capture an object under perfectly constant lighting conditions at different spatial positions. Color inconsistency across cameras negatively affects the 3D multi-view video processing chain since it reduces the correlation between captured views. This leads to a significant decrease in multi-view coding efficiency since the inter-view prediction scheme produces higher-energy residuals in cases of luminance difference between the reference view and the predicted view. Color inconsistency also reduces the quality of disparity estimation, as a result of potentially wrong matches in stereo correspondence algorithms, as well as the quality of rendered virtual views. Moreover, the view synthesis process in free-viewpoint video systems is negatively affected if there are color differences between the two reference cameras (which are warped to the image coordinates of the target viewpoint) that need blending. Therefore, color calibration and/or color correction methods are necessary to enhance the performance of 3D TV and FTV systems [129].
2.3 Human Visual System
In the human visual system (HVS), the eyes are horizontally separated by a distance of approximately 6.3 cm. Thus, when looking at a 3D scene, each eye sees a unique image from a slightly different angle. This is known as binocular stereopsis. The resulting difference in the image location of an object seen by the left and right eyes is referred to as binocular disparity (also known as binocular parallax). This difference provides cues about the relative distance of objects (i.e., the perceived distance between objects) and their depth structure, as well as absolute depths (the perceived distance from the observer to objects). This is possible because the brain fuses the two images to enable depth perception, a process known as binocular fusion. The single mental image of the scene that results from binocular fusion is known as the cyclopean image. As the distance between the viewer and the object increases, the differences between the two images decrease and, consequently, the ability to identify differences in depth diminishes. The two most important depth cues are binocular stereopsis and motion (movement) parallax [92], [90]. Estimating depth through binocular depth cues has two parts: vergence and stereopsis. Vergence is the process of positioning the eyes so that the difference between the information projected onto the two retinae is minimized. The angle between the eyes is used as a depth cue. After this convergence step, the stereopsis process uses the residual disparity of the surrounding area to estimate the depth relative to the point of convergence. Binocular depth cues have been discussed at length in the literature. They are the main cues utilized in 3D displays. For example, lenticular lenses are placed on top of a multi-view display to direct different views to each eye. Displays that require wearing glasses of any kind also rely on this property of the brain. Images for the left eye are separated from those for the right eye, and the technology relies on the brain to reconstruct the images and provide 3D perception.
However, even with one eye closed, one can still perceive depth through a number of depth cues known as monocular depth cues [95]. This is possible through external factors that tell the viewer where things are with respect to each other. Monocular depth cues include: relative size, motion parallax, occlusion (interposition), light and shade, texture gradient, haze, and perspective. Motion parallax cues are a result of eye or head movement. The relative movement between the viewing eye and the scene is a cue to depth perception. An object that moves faster is perceived to be located much closer to the viewer than an object that moves slower. Free-viewpoint TV is a promising approach to leveraging the head motion parallax component of 3D perception, which is very important. However, there are different kinds of motion parallax depth cues. In addition to head movements, eyeball movements also facilitate depth estimation. Other cues that play an important role in the perception of depth include accommodation and pictorial cues. Accommodation is the ability of the eye to change the optical power of its lens in order to focus on objects at various distances. As the distance between the eyes and the object becomes very large, binocular depth cues become less important and the HVS relies on pictorial cues for depth assessment. These are monocular cues such as shadows, perspective lines, and texture scaling. In the real world, all the aforementioned depth cues are fused together in an adaptive way depending on the viewing conditions and the viewing space. However, in artificial scenarios, such as when watching 3D content on a 3D TV, two or more depth cues may be in conflict. Although in many such cases the HVS can still correctly interpret the 3D scene, this comes at a higher cognitive load, and after some time the viewer may experience visual discomfort, e.g., eye strain, headache, or nausea. An example of such a conflict is the so-called accommodation-vergence rivalry, where the eyes converge to focus on an object but the lens accommodation stays on the screen where the image is sharpest [99].
2.4 3D Display Technologies
3D displays are imaging devices that create 3D perception by utilizing different characteristics of the human visual system. Such displays can be categorized, based on the technique used to direct the left- and right-eye views to the appropriate eye, into aided-view displays and free-view displays [99].
Aided-View Displays
Aided-view displays rely on special user-worn devices, such as stereo glasses or head-mounted miniature displays, to optically channel the left- and right-eye views to the corresponding eye. Various multiplexing methods have been proposed to carry the optical signal to the appropriate eye, including color multiplexing, polarization multiplexing, and time multiplexing.
• Color Multiplexed Displays. Color multiplexing is used in anaglyph displays, where the images for the left and right eyes are combined using a complementary color coding technique. The most common anaglyph method uses the red channel for the left eye and the cyan channel for the right eye. The viewer wears a pair of colored (anaglyph) glasses so that the left and right eyes receive the corresponding images only. Each lens permits the wavelength of the correct image to reach the eye while blocking other wavelengths for that eye. Different color coding techniques are possible in anaglyph displays, including red/cyan, yellow/blue, and green/magenta. The main drawbacks of this type of display are the loss of color information and the increased degree of cross-talk [122]. Moreover, the anaglyph viewing filters sometimes cause chromatic adaptation problems for the viewer, and it has been widely reported that prolonged use of this technology causes headaches. To reduce crosstalk and increase the quality of the images, methods such as image alignment, color component blurring, and depth map adjustment have been shown to significantly improve image quality [50], [51].
• Polarization Multiplexed Displays. In polarization multiplexing, the states of polarization of the light corresponding to the two images in the stereo pair are made mutually orthogonal. The two views are superimposed on the screen and the viewer needs to wear polarized glasses to separate them. Two types of polarized glasses are possible: linearly polarized (one lens horizontally polarized and the other vertically polarized) and circularly polarized (one lens polarized clockwise and the other counter-clockwise). Circular polarization allows more head tilt before cross-talk becomes noticeable. Although polarizing filters can cause chromatic aberration, these types of 3D displays offer high resolution and the color quality issues are generally negligible, unlike the anaglyph-based displays. Polarized displays are the most common in movie theatres nowadays.
• Time Multiplexed Displays. Time-multiplexed displays exploit the persistence of vision of the human visual system to create 3D perception. The left- and right-eye images are displayed on the screen in an alternating fashion at high frame rates, usually 120 Hz. The viewer is required to wear battery-powered active shutter glasses, which are synchronized to the content being displayed [122]. The lenses of these glasses are actually small LCD screens. Applying a voltage to a lens causes the shutter to close, preventing the image from passing through to the eye. By synchronizing this behavior with the screen displaying the 3D content, normally using an infrared transmitter, each eye sees a separate view.
Auto-stereoscopic Displays
Auto-stereoscopic displays relieve the viewer from the discomfort of wearing specialized glasses by dividing the viewing space into a finite number of viewing slots where only one image (view) of the scene is visible.
Figure 2.4: Working principles of auto-stereoscopic displays: (a) parallax barrier; (b) lenticular lens.
Thus, each of the viewer's eyes sees a different image and, depending on the type of display, those images may change as the viewer moves or changes head position. This is achieved by applying optical principles such as diffraction, refraction, reflection, and occlusion to direct the light from a certain view to the appropriate eye. Various types of auto-stereoscopic displays use different techniques to control the light paths. The two most well-known auto-stereoscopic techniques are parallax barriers and lenticular arrays.
• Parallax Barrier Displays. Parallax barriers utilize occlusion to hide part of the image from one eye while keeping it visible to the other eye. At the right distance and angle, each eye will only be able to see the corresponding view, as shown in Figure 2.4a. These displays can be switched to a 2D display mode for backward compatibility by removing the optical function of the parallax elements [130], which can be achieved using polarization-based electronic switching systems. The two main problems associated with parallax barrier systems are the loss of brightness, caused by the barriers themselves, and the loss of spatial resolution, caused by using only half of the pixels for each viewing zone. The optimum viewing distance in these systems is proportional to the distance between the display and the parallax barrier and inversely proportional to the display pixel size. As the display resolution gets higher, the optimum viewing distance of the system gets larger [122].
• Lenticular Array Displays. Unlike parallax barrier displays, lenticular displays are based on the refraction principle. An array of vertically oriented cylindrical lenses is placed in front of columns of pixels, as shown in Figure 2.4b. The display creates repeating viewing zones for the left and right eyes. A lenticular display can also be switched between 2D and 3D viewing modes using lenticular lenses filled with a special material that can switch between two refracting states. The alignment of the lenticular array on the display panel is critical in lenticular systems. This alignment gets more difficult as the display resolution increases, and any misalignment can cause distortions in the displayed images [122]. The problem with
lenticular systems is the reduction in resolution as the number of views increases. In vertically aligned lenticular arrays, the resolution decreases in the horizontal direction only. However, if the lenses are slanted, the resolution loss is distributed across both axes [29].
Regardless of the technique used to direct the light, auto-stereoscopic displays can either be two-view displays, where only a single stereo pair is displayed, or multi-view displays, where multiple stereo pairs are produced to provide 3D images to multiple users. Two-view auto-stereoscopic displays divide the horizontal resolution of the display into two sets. Every second column of pixels constitutes one image of the left-right image pair, while the other image consists of the remaining columns. The two displayed images are visible in multiple zones in space. However, the viewer will perceive a correct stereoscopic image only if standing at the ideal distance and in the correct position. Moving too far forward or backward from the ideal distance greatly reduces the chance of seeing a correct image.

If the two-view stereoscopic display is equipped with a head-tracking device, it can prevent incorrect pseudoscopic viewing by displaying the right and left images in the appropriate zones. One disadvantage of head-tracking stereoscopic displays is that they only support a single viewer. Moreover, they need to be designed to have minimal lag so that the user does not notice the head tracking. Multi-view auto-stereoscopic displays overcome the limitations of two-view and head-tracking stereoscopic displays by increasing the number of displayed views. Thus, they have the advantage of allowing viewers to perceive a 3D image when the eyes are anywhere within the viewing zone. This enables multiple viewers to see the 3D objects from their own points of view, which makes these displays more suitable for applications such as computer games, home entertainment, and advertising.
2.5 3D Video Representation and Coding
2.5.1 3D Video Representations
Conventional Stereo Video
A stereo video signal captured by two input cameras is the simplest 3D video data representation. This 3D format is called conventional stereo video (CSV). By presenting each of the captured views to one of the eyes, the viewer is provided with a 3D impression of the captured scene. Standardized solutions for CSV have already found their way to the market: 3D cinema, Blu-ray Disc, and broadcast. A common way to represent and transmit the two views is to multiplex them either temporally or spatially [125]. In temporal multiplexing, the left and right views are interleaved as alternating frames. Temporal multiplexing has the advantage of maintaining the full resolution of each view. However, this comes at the expense of doubling the raw data rate compared to conventional single-view video.
Figure 2.5: Two-view head-tracked display; (a) swapping the viewing zones as the viewer moves his head; (b) producing only two views and controlling where the views are directed in space [32].
Figure 2.6: Different view packing arrangements for left (L) and right (R) views in conventional stereo video: (a) side-by-side; (b) above-below; (c) line-by-line; and (d) checkerboard.
With spatial multiplexing, the left and right views are sub-sampled either horizontally or vertically and interleaved (packed) within a single frame. Possible packing arrangements include side-by-side, above-below, line-by-line, and checkerboard arrangements. Figure 2.6 illustrates the different packing arrangements. The resulting format is known as a frame-compatible format, which essentially tunnels the stereo video through existing hardware and delivery channels. Thus, the main advantage of spatial multiplexing is that it allows broadcasters to use the same bandwidth as regular monoscopic video content, and transmission can be achieved in the same way. The obvious drawback, however, is the loss of spatial resolution, which may impact the quality of 3D perception.

Subjective experiments have shown that, up to a certain limit, if the quality of one of the two views in a stereo pair is reduced by low-pass filtering, the overall perceived quality tends to be dominated by the higher quality view. This is known as the binocular suppression theory [110]. Another possible representation of stereo video exploits this theory to reduce the overall bit rate of the stereo video; it is known as mixed resolution stereo. Among the important issues that have been studied is whether the bit rate of the auxiliary view in a mixed resolution stereo video should be reduced by downscaling (and then rescaling at the receiver) or by quality reduction (increasing the quantization parameter). Temporal scaling, i.e., reducing the frame rate of the auxiliary view, is also possible but has been shown to give unacceptable results in terms of perceived 3D quality, especially for high motion content.

The main limitation of the stereo representation in general is its dependency on the acquisition hardware. The acquisition process is tailored to a specific type of stereoscopic display (e.g., size, display type, number of views, etc.). Moreover, the baseline distance between the two cameras is fixed. This hinders the flexibility of modifying the 3D impression at the receiver side, and prevents supporting head motion parallax, occlusion, and disocclusion when the viewer changes the viewpoint. Additional information about the captured scene, such as geometry information, needs to be provided in order to support such features.
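As an illustration of the frame-compatible packing described above, the following minimal sketch packs a stereo pair into a single side-by-side frame and unpacks it at the receiver. It is only a sketch under simplifying assumptions (grayscale frames, decimation without an anti-aliasing filter, nearest-neighbour upscaling); the function names are illustrative and not part of any standard.

```python
import numpy as np

def pack_side_by_side(left: np.ndarray, right: np.ndarray) -> np.ndarray:
    """Pack a stereo pair into one frame-compatible, side-by-side frame.

    Each view is horizontally sub-sampled by a factor of two (here by simply
    dropping every other column; a real packer would low-pass filter first to
    avoid aliasing) and the two half-width views are placed next to each other
    in a single frame of the original size.
    """
    assert left.shape == right.shape
    half_left = left[:, ::2]    # keep every other column of the left view
    half_right = right[:, ::2]  # keep every other column of the right view
    return np.hstack([half_left, half_right])

def unpack_side_by_side(packed: np.ndarray):
    """Split a side-by-side frame and upscale each half back to full width."""
    h, w = packed.shape[:2]
    half_left, half_right = packed[:, : w // 2], packed[:, w // 2 :]
    # Nearest-neighbour upscaling by column repetition; a real receiver would
    # use a proper interpolation filter.
    return np.repeat(half_left, 2, axis=1), np.repeat(half_right, 2, axis=1)

# Example: a 1080p luma pair packed into a single 1920x1080 frame.
left = np.random.randint(0, 256, (1080, 1920), dtype=np.uint8)
right = np.random.randint(0, 256, (1080, 1920), dtype=np.uint8)
packed = pack_side_by_side(left, right)
assert packed.shape == (1080, 1920)
```

The packed frame occupies exactly the same raster as a monoscopic frame, which is why it can be carried over unmodified 2D distribution chains at the cost of halving the horizontal resolution of each view.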
Video Plus Depth
The video-plus-depth (V+D) format provides a regular 2D video signal along with geometry information of the scene in the form of an associated depth map video, as shown in Figure 2.7. The V+D format enables the generation of virtual views, within a certain range around the captured view, using a view synthesis technique such as depth image-based rendering (DIBR) [64], [37]. In this format, the 2D video represents texture information and color intensity, while the depth video provides a Z-value for each pixel representing the distance between the optical center of the camera and a 3D point in the captured scene. The depth range is restricted to a range between two extremes, Znear and Zfar, indicating the minimum and maximum possible distances of a 3D point, respectively. Information about this range needs to be transmitted along with the video bitstream. The depth range is usually quantized
Figure 2.7: Video plus depth representation of 3D video.
using 8 bits into 256 quantization intervals, where Znear is associated with value 255 and
Zfar is associated with value 0. Thus, the depth video carries a monochromatic video signal. The depth data is usually stored as inverted real-world depth according to
\[
d = \operatorname{round}\left( 255 \cdot \frac{\frac{1}{z} - \frac{1}{Z_{\mathrm{far}}}}{\frac{1}{Z_{\mathrm{near}}} - \frac{1}{Z_{\mathrm{far}}}} \right), \tag{2.8}
\]
where z is the real-world depth and d is the corresponding value in the depth map [88]. This method of depth storage has the following advantage: since depth values are inverted, a high depth resolution is achieved for nearby objects, while farther objects only receive coarse depth resolution. This also aligns with the human perception of stereopsis, where a depth impression is derived from the shift between the left- and right-eye views [88]. Thus, the stored depth values are quantized similarly to these shift (disparity) values.

The process of capturing the depth map is in itself error-prone. One technique for capturing depth information is triangulation: a laser stripe is scanned across the scene and captured by a camera positioned at a distance from the laser pointer, and the range of the scene is then determined from the focal length of the camera, the distance between the camera and the laser pointer, and the observed stripe position in the captured image [68]. This technique works well for static scenes, but not for dynamic ones. It also tends to change the color and texture of the scene, which means it introduces artifacts even before the encoding process. Another depth map capturing method utilizes the time-of-flight principle [46]: laser beams (often in the infrared spectrum) are emitted toward the scene, and the reflections are collected by the device to measure the time of flight [68]. Pulsed-wave sensors, such as the ZCam Depth Camera from 3DV Systems [43] (later acquired by Microsoft), can measure the time of delay directly. This method works well for dynamic scenes, but it remains to be determined how it performs in certain environments, for example in the presence of mirrors, very smooth or very rough surfaces, very fast motion, or extreme heat or cold. Depth information may also be estimated from a stereo pair by solving for stereo correspondences [98].
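For concreteness, the following short sketch applies Eq. (2.8) and its inverse to convert between real-world depth values and 8-bit depth map samples. The depth range values are arbitrary examples chosen purely for illustration.

```python
import numpy as np

# Example depth range of the captured scene in metres (illustrative values;
# any consistent unit works as long as Z_NEAR < Z_FAR).
Z_NEAR, Z_FAR = 0.5, 50.0

def depth_to_8bit(z: np.ndarray) -> np.ndarray:
    """Quantize real-world depth z into 8-bit inverse-depth values (Eq. 2.8).

    Points at Z_NEAR map to 255 and points at Z_FAR map to 0, so nearby
    objects receive finer depth resolution than distant ones.
    """
    d = 255.0 * (1.0 / z - 1.0 / Z_FAR) / (1.0 / Z_NEAR - 1.0 / Z_FAR)
    return np.round(d).astype(np.uint8)

def depth_from_8bit(d: np.ndarray) -> np.ndarray:
    """Invert the quantization to recover (approximate) real-world depth."""
    inv_z = d / 255.0 * (1.0 / Z_NEAR - 1.0 / Z_FAR) + 1.0 / Z_FAR
    return 1.0 / inv_z

z = np.array([0.5, 1.0, 5.0, 25.0, 50.0])
d = depth_to_8bit(z)        # d[0] == 255 (nearest point), d[-1] == 0 (farthest)
z_hat = depth_from_8bit(d)  # close to z; coarser for the distant points
```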
The video-plus-depth format has several advantages. Encoding the depth map adds only a small overhead (about 10-20%) to the video bit rate, which makes the format attractive when attempting to minimize bandwidth. Moreover, the inclusion of depth enables a display-independent solution that provides adaptivity at the user side for different kinds of displays. For example, it allows the adjustment of depth perception in stereo displays based on the viewing characteristics. Finally, the format is backward compatible with legacy devices. However, the flexibility provided by the video-plus-depth format comes at the cost of increased complexity at both the sender and the receiver sides. Depth estimation algorithms, for example, are highly complex, time consuming, and error prone. At the receiver side, a second view must be generated to drive a stereoscopic display. Moreover, video-plus-depth is only capable of rendering a limited range of views and is prone to errors at disoccluded points.
Multi-view Plus Depth
One key issue with video-plus-depth is that it enables synthesizing only a limited continuum of views around the original view. This is due to the disocclusion (or exposure) problem, where some regions in the virtual view have no mapping because they were invisible in the original reference view. These regions are known as holes and require applying a filling algorithm that interpolates the value of the unmapped pixels from surrounding areas. This disocclusion effect increases as the angular distance between the reference view and the virtual view increases. On the other hand, advanced 3D video applications that enable the user to change the viewing point, such as wide-range multi-view auto-stereoscopic displays and free-viewpoint video, require a very large number of output views. To overcome the limitations of the V+D format, the Moving Picture Experts Group (MPEG) developed a new 3D video standard based on a multi-view video-plus-depth (MVD) representation format, where multiple 2D views are combined with their associated depth map signals. Virtual views may be synthesized more accurately if two or more reference views, from both sides of the virtual view, are used [42]. This is possible because areas which are occluded in one of the reference views may not be occluded in the other one. Thus, the MVD format enables generating many high-quality views from a limited amount of input data. Moreover, the format enables flexible adjustment of the video signal to various display types and sizes, and different viewing preferences. When only two reference views and their depth maps are available for a certain video, the representation format is referred to as MVD2. Similarly, MVD4 is the representation format where four reference views and four associated depth maps are available. Figure 2.8 demonstrates the results of the view synthesis process using an MVD2 video. This representation format will be used in subsequent chapters of this thesis.
Figure 2.8: Synthesizing three intermediate views using two reference views and associated depth maps.
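To make the idea of synthesizing a virtual view from two references concrete, the following is a highly simplified, hypothetical sketch of the warping-and-blending step for a rectified, horizontally aligned camera pair. It is not the DIBR algorithm of [64], [37]: occlusion handling and hole filling are reduced to their simplest form, the sign conventions are simplified, and the parameters (alpha, max_disp) and function names are illustrative.

```python
import numpy as np

def warp_view(texture, depth8, shift_scale):
    """Forward-warp a grayscale reference view towards a virtual viewpoint.

    Assumes a rectified, horizontally aligned setup, so each pixel moves only
    horizontally by a disparity proportional to its 8-bit (inverse) depth.
    Returns the warped texture and a mask of pixels that received a value
    (unfilled pixels are disocclusion holes).
    """
    h, w = texture.shape
    warped = np.zeros_like(texture)
    filled = np.zeros((h, w), dtype=bool)
    disparity = np.round(shift_scale * depth8 / 255.0).astype(int)
    for y in range(h):
        for x in range(w):
            xt = x + disparity[y, x]
            if 0 <= xt < w:
                warped[y, xt] = texture[y, x]
                filled[y, xt] = True
    return warped, filled

def synthesize(left_tex, left_depth, right_tex, right_depth, alpha, max_disp=20):
    """Blend two warped references into a virtual view at fraction alpha of
    the baseline (0 = left camera position, 1 = right camera position)."""
    wl, ml = warp_view(left_tex, left_depth, -alpha * max_disp)
    wr, mr = warp_view(right_tex, right_depth, (1.0 - alpha) * max_disp)
    out = np.where(ml, wl, wr)  # prefer the left reference, fall back to right
    both = ml & mr
    out[both] = (wl[both].astype(int) + wr[both].astype(int)) // 2
    # Pixels visible in neither reference remain holes (value 0) and would be
    # filled by an inpainting step in a real renderer.
    return out, ~(ml | mr)
```

A real renderer would derive per-pixel disparities from the camera parameters, resolve depth conflicts with a z-buffer during warping, and inpaint the remaining holes; the point here is only that the second reference fills most regions that are disoccluded in the first.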
Layered Depth Video
Layered depth video (LDV) [89] is an extension of the MVD representation and is based on the concept of layered depth images [105]. A layered depth video has two layers: a base layer, comprising a texture sequence and an associated depth map sequence, and an enhancement layer (also known as the occlusion layer), composed of a residual texture sequence and a residual depth map sequence. Figure 2.9 illustrates the components of a single frame in the LDV representation. The occlusion layer in the LDV representation is used to fill any holes during the view synthesis process. Generating the occlusion information, however, requires more processing during content generation, or pre-processing before compression and transmission. Since non-occluded regions contain largely redundant data, this redundancy can be exploited to code the occlusion layer efficiently. Occlusion data can be obtained from side views or from previous or next frames. Although LDV has a more compact representation than MVD, the generation of LDV is based on view warping and, consequently, on the error-prone depth data. Moreover, this basic approach is oblivious to reflections, shadows, and other factors which result in the same content appearing differently in different views [7].
2.5.2 3D Video Coding
Before going over the different video coding standards for 3D and multi-view videos, we first provide a brief background on video coding concepts for 2D videos. A 2D video sequence is a stream of individual frames (pictures) which are presented with constant or variable time intervals between them. In addition to the spatial redundancies which are normally present in 2D images, video sequences also contain temporal redundancies, because successive frames within the video often have small differences, with the exception of scene changes.
Figure 2.9: A sample frame from a layered depth video (LDV).
The frames are grouped into coding structures known as a group of pictures (GOP), where each GOP contains the same number of frames. State-of-the-art 2D video encoders divide each frame into a set of small, non-overlapping blocks, each known as a macroblock (MB). Within a GOP, the encoder reduces the temporal redundancy by attempting to predict each MB within a frame from MBs in previous (and possibly future) frames, obtaining a residual frame that contains the prediction errors between the original and predicted blocks. Residual frames have smaller energy and can therefore be coded more efficiently using fewer bits. To construct the predicted frame, the encoder performs a search within a certain region of the reference frame(s) to find the closest matching block, and a motion vector (MV) representing the displacement between that block and the block being coded is calculated, a process known as motion estimation. The difference between the two blocks is then transmitted in the encoded bitstream along with the corresponding motion vector. This process is known as inter-frame prediction (or inter-prediction for short). For MBs where inter-prediction cannot be exploited, intra-prediction is used to eliminate spatial redundancies. Intra-prediction attempts to predict a block by extrapolating the neighboring pixels from adjacent blocks along a defined set of directions. Frames within a GOP can therefore be classified, based on the type of prediction used for their MBs, into: I-frames (only intra-predicted MBs), P-frames (intra-predicted MBs and/or MBs predicted from previous frames), and B-frames (intra-predicted MBs and/or MBs predicted from previous and future frames). The prediction relationships between the different frames within a GOP can therefore be represented using a dependency structure similar to the one shown in Figure 2.10, where the black frames are I-frames and the number below each frame represents its decoding order. For a more detailed explanation of predictive coding, as well as other 2D video coding concepts used by recent video coding standards, the reader is referred to [97, Chapter 3].
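As an illustration of the motion-estimation step described above, the following is a minimal sketch of full-search block matching using the sum of absolute differences (SAD) as the matching criterion. The block size, search range, and function name are illustrative; practical encoders use fast search strategies and rate-distortion-aware cost functions rather than this exhaustive search.

```python
import numpy as np

def full_search_me(cur_block, ref_frame, top, left, search_range=8):
    """Minimal full-search motion estimation for one macroblock.

    cur_block is an NxN block whose top-left corner sits at (top, left) in the
    current frame; the function searches a +/- search_range window in the
    reference frame and returns the motion vector (dy, dx) minimizing the SAD,
    together with the residual (prediction error) block.
    """
    n = cur_block.shape[0]
    h, w = ref_frame.shape
    best, best_sad = (0, 0), np.inf
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + n > h or x + n > w:
                continue  # candidate block falls outside the reference frame
            cand = ref_frame[y:y + n, x:x + n]
            sad = np.abs(cur_block.astype(int) - cand.astype(int)).sum()
            if sad < best_sad:
                best_sad, best = sad, (dy, dx)
    dy, dx = best
    residual = cur_block.astype(int) - ref_frame[top + dy:top + dy + n,
                                                 left + dx:left + dx + n].astype(int)
    return best, residual  # MV and prediction error to be entropy coded
```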
Figure 2.10: Hierarchical (dyadic) prediction structure.
In state-of-the-art video coding standards, such as H.264/AVC and high efficiency video coding (HEVC) [114], the encoded video bitstream consists of data units called network abstraction layer (NAL) units, each of which is effectively a packet containing an integer number of bytes. Some NAL units contain parameter sets that carry high-level information regarding the entire coded video sequence or a subset of the pictures within it. Other NAL units carry coded samples in the form of slices that belong to one of the various picture types defined by the video coding standard. In addition, some NAL units contain optional supplemental enhancement information (SEI) that supports the decoding process or may assist in other ways, such as providing hints about how best to display the video. A set of NAL units whose decoding results in one decoded picture is referred to as an access unit (AU).

One main challenge in video compression applications is delivering multiple versions of a video at different operating points, i.e., different qualities, spatial resolutions, and frame rates. The straightforward way to achieve this with conventional video coders is to encode each version of the video sequence independently. However, this approach results in significant overhead since the generated versions contain many redundancies. Scalable video coders, such as the scalable video coding (SVC) extension of H.264/AVC [102] and scalable high efficiency video coding (SHVC) [17], exploit the correlation between different versions of the same video sequence to reduce storage requirements and transmission bandwidth. In addition to the spatial and temporal motion-compensated predictions available in a single-layer coder, scalable video coders utilize inter-layer prediction, where the reconstructed video signal from a reference layer is used to predict an enhancement layer. A single scalable encoder produces multiple coded bitstreams referred to as layers, as shown in Figure 2.11. The lowest (base) layer is a stream decodable by a standard single-layer decoder, yielding a version of the video sequence at the lowest available quality/resolution operating point. One or more enhancement layers are coded as scalable bitstreams. To decode the sequence at a higher quality or resolution, a scalable video decoder decodes the base layer and one or more enhancement layers.
Figure 2.11: Scalable video coding.
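As a minimal illustration of how a receiver benefits from layered coding, the following sketch selects how many layers of a scalable stream to decode given the available bandwidth. The cumulative layer bit rates are hypothetical numbers chosen purely for illustration.

```python
# Illustrative only: cumulative bit rates (kbps) of a base layer and two
# enhancement layers of a scalably coded stream (hypothetical numbers).
LAYER_RATES_KBPS = [800, 1500, 3000]  # base, base+EL1, base+EL1+EL2

def layers_to_decode(available_kbps: float) -> int:
    """Return how many layers (base layer = 1) fit within the available bandwidth."""
    count = 0
    for cumulative_rate in LAYER_RATES_KBPS:
        if cumulative_rate <= available_kbps:
            count += 1
        else:
            break
    return max(count, 1)  # always decode at least the base layer

print(layers_to_decode(1000))  # -> 1 (base layer only)
print(layers_to_decode(2000))  # -> 2 (base + first enhancement layer)
```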
Conventional Stereo Video Coding
The most direct method of coding a multi-view video is simulcast, where each view is individually coded using a conventional 2D video encoder such as H.264/AVC or HEVC. This method does not take advantage of the correlation between neighboring views. In the case of stereo video, the fidelity range extensions (FRExt) [78] of the H.264/AVC standard define a special stereo video SEI message that enables the decoder to identify the multiplexing of the stereo video and extract the two views. The two views are interlaced line-by-line into one sequence, where the top field contains the left view and the bottom field contains the right view. The interlaced sequence is then encoded using the field coding mode of H.264/AVC. At the receiver side, the interlaced bitstream is decoded and the output sequence is de-interlaced to obtain the two individual views. The main drawback of this approach is that it does not support backward compatibility: traditional 2D devices incapable of demultiplexing the two views will not be able to decode and display a 2D version of the content.
Multi-view Video Coding
In early 2010, the Joint Video Team (JVT), which is a collaboration between the Video Coding Experts Group (VCEG) of the ITU-T and MPEG of the ISO/IEC, standardized multi-view video coding (MVC) [126] as an extension to the H.264/AVC video coding standard. The multi-view extension provided inter-view prediction to improve compression efficiency, in addition to support for traditional temporal and spatial prediction schemes. The MVC standard introduced two profiles, which indicate a subset of coding tools that must be supported by conforming decoders: the Multi-view High Profile (supporting multiple views with no interlace coding tools) and the Stereo High Profile (supporting only two views with interlace coding tools). MVC has been selected as the standard for 3D video distribution by the Blu-ray Disc Association (BDA). In addition to the general video coding requirements, such as those implemented in H.264/AVC, some specific requirements for MVC include view switching random access, view scalability, and backward compatibility.
Figure 2.12: Typical MVC hierarchical prediction structure.
View switching random access provides the ability to access, decode, and display a specified view at a random access point with a small amount of data required to decode the image. View scalability is the ability to access a subset of the bitstream in order to decode a subset of the encoded views. For backward compatibility, a subset of the encoded multi-view video bitstream should be decodable by an H.264/AVC decoder.

A typical MVC prediction structure uses a combination of hierarchical temporal prediction and inter-view prediction, as shown in Figure 2.12. In such a structure, some views depend on other views in order to be accessed and decoded. To access view 2, for example, both view 1 and view 3 need to be decoded first. Moreover, view 3 will only be available after decoding view 1 because it depends on it. The sequence parameter set of an H.264/AVC bitstream was extended to include high-level syntax that signals the view identification, the view dependencies, and indicators of resource requirements. To support backward compatibility, an MVC bitstream is structured to include a base view that can be decoded independently.
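The view dependencies described above determine which parts of the bitstream a client must fetch and decode. The following sketch resolves such dependencies for the example given in the text; the dependency table is a hypothetical illustration and uses the same view numbering as that example (view 1 being the independently decodable base view).

```python
# Hypothetical inter-view dependencies following the example in the text:
# view 1 is the base view, view 3 is predicted from view 1, and view 2 is
# predicted from views 1 and 3 (the pattern repeats for further views).
VIEW_DEPS = {1: [], 3: [1], 2: [1, 3], 5: [3], 4: [3, 5]}

def required_views(target: int) -> list[int]:
    """Return every view that must be decoded, in decoding order, before the
    target view can be accessed."""
    order: list[int] = []
    seen: set[int] = set()

    def visit(view: int) -> None:
        if view in seen:
            return
        for ref in VIEW_DEPS[view]:
            visit(ref)  # decode the reference views first
        seen.add(view)
        order.append(view)

    visit(target)
    return order

print(required_views(2))  # -> [1, 3, 2]
print(required_views(4))  # -> [1, 3, 5, 4]
```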
Video Plus Depth Coding
In the video-plus-depth representation of 3D videos, the depth map can be considered a monochromatic grayscale image sequence. Therefore, it is possible to use existing video codecs, such as H.264/AVC, to compress the depth map stream. However, these codecs are optimized for encoding texture information, which is viewed by the user. This is in contrast to depth maps, which represent geometry information that is never directly presented to the viewer and only aids the view rendering process. Thus, compression artifacts have a more severe effect when existing codecs are used for compressing depth maps and cause distortions
in rendered synthesized views. Two approaches have been proposed in the literature to address this problem. The first approach is to take the special characteristics of depth maps into consideration and develop novel compression techniques specifically suited to them, e.g., [83] and [49]. The second approach is to modify and optimize existing video codecs for depth map encoding and to remove undesirable compression artifacts at the decoder side using post-processing image denoising techniques. Examples of the second approach include [91] and [74].

The ISO/IEC 23002-3 MPEG-C Part 3 standard specifies a representation format for depth maps which allows encoding them as conventional 2D sequences, together with additional parameters for interpreting the decoded depth values at the receiver side [54]. The specification is based on the encoding of 3D content inside a conventional MPEG-2 transport stream, which includes the texture video, the depth video, and some auxiliary data. It specifies high-level syntax that allows a decoder to correctly interpret two incoming video streams as texture and depth data. The standard does not introduce any specific coding algorithms and supports different coding formats such as MPEG-2 and H.264/AVC. Transport is defined in a separate MPEG Systems specification, ISO/IEC 13818-1:2003 Carriage of Auxiliary Data [53]. The two (texture and depth) bitstreams are interleaved frame-by-frame, resulting in one transport stream that may contain additional depth map parameters as auxiliary information.

Another option for encoding video-plus-depth sequences is to use the multiple auxiliary components (MAC) of MPEG-4 version 2. An auxiliary component is a grayscale shape that is used to describe the transparency of a video object; it is also sometimes known as an alpha channel. However, an auxiliary component can be defined in a more general way in order to describe shape, depth shape, or other secondary texture. Thus, the depth video can be used as one of the auxiliary components in MPEG-4 version 2 [63]. Moreover, the same compression techniques used for the texture component can also be used for the auxiliary components, and the motion vectors used for motion compensation are identical.
High Efficiency Video Coding of 3D Videos
The ISO/IEC MPEG and ITU-T Video Coding Experts Group (VCEG) standardization bodies established the Joint Collaborative Team on 3D Video Coding Extension Development (JCT-3V) in July 2012. The aim of this group is to develop the next-generation 3D video coding standards, which provide more advanced compression capabilities and facilitate the synthesis of additional non-captured views to support emerging auto-stereoscopic displays. JCT-3V has developed two extensions for HEVC [114], namely, Multiview HEVC (MV-HEVC) [117], which was integrated in the second edition of the HEVC standard [57], and 3D-HEVC [118], which was completed in February 2015 and integrated in the latest edition of the standard. In MV- and 3D-HEVC, a layer can represent texture, depth, or other auxiliary information of a scene related to a particular camera perspective. All layers
belonging to the same camera perspective are denoted as a view, whereas layers carrying the same type of information (e.g., texture or depth) are usually called components in the scope of 3D videos [116].

MV-HEVC enables efficient coding of multiple camera views and associated auxiliary pictures and follows the same design principles as multi-view video coding (see Section 2.5.2). For example, MV-HEVC allows prediction from pictures in the same AU and the same component but in different views. To enable this, decoded pictures from other views are inserted into one or both of the reference picture lists of the current picture being decoded. Therefore, motion vectors may be temporal MVs, when related to temporal reference pictures of the same view, or disparity MVs, when related to inter-view reference pictures. MV-HEVC comprises only high-level syntax (HLS) additions. Therefore, it can be implemented using existing single-layer decoding cores without changing the block-level processing modules. Compared to HEVC simulcast, MV-HEVC provides higher compression gains by exploiting the redundancy between different camera views of the same scene. Support for depth maps is enabled through auxiliary picture high-level syntax. The auxiliary picture decoding process would be the same as that for video or multi-view video, and the required decoding capabilities can be specified as part of the bitstream.

The second, more advanced, extension is 3D-HEVC. This extension targets a coded representation consisting of multiple views and associated depth maps, as required for generating additional intermediate views in advanced 3D displays. 3D-HEVC aims to compress the video-plus-depth format more efficiently by introducing new compression tools that: 1) explicitly address the unique characteristics of depth maps; and 2) exploit dependencies between multiple views as well as between video texture and depth. 3D-HEVC extends MV-HEVC by allowing new types of inter-layer prediction to enable more efficient compression. The new prediction types include:
• combined temporal and inter-view prediction (A+V), where a reference picture is in the same component but in a different AU and a different view;
• inter-component prediction (C), where reference pictures are in the same AU and view but in a different component; and
• combined inter-component and inter-view prediction (C+V), where reference pictures are in the same AU but in a different view and component.
Additional bit rate reductions compared to MV-HEVC are achieved by specifying new block-level video coding tools, which explicitly exploit statistical dependencies between video texture and depth and specifically adapt to the properties of depth maps. A further design change compared with MV-HEVC is that, in addition to sample and motion information, residual, disparity, and partitioning information can also be predicted or inferred.
2.6 HTTP Adaptive Streaming
Until recently, the transmission control protocol (TCP) was considered unsuitable for video applications, due to the latency and overhead of its sliding window and retransmissions. For a long time, the main video delivery protocol was the real-time transport protocol (RTP) [100], which uses the unreliable user datagram protocol (UDP) for transmission. RTP is a push-based protocol where video frames are transmitted in a paced manner from the server to the receiving client. Using UDP as the underlying transport protocol eliminates the overhead and latency of the retransmissions encountered in TCP, since UDP does not implement error and flow control techniques. This made RTP attractive for real-time interactive applications and multicast.

However, although using UDP for media delivery has its advantages, it became clear over the past few years that it has many shortcomings. For example, due to the unreliable nature of the protocol, an unavoidable consequence of network interruptions is that the receiver has to ignore lost and late frames. This causes artifacts and distortions in the rendered video and decreases the user's quality-of-experience. Moreover, RTP-based media delivery is quite complex and does not scale well. While RTP itself is responsible for the transmission of the actual media data, it relies on additional protocols for establishing and controlling media sessions between end points and for providing feedback information. For example, in addition to establishing a separate UDP connection for each media component, e.g., the audio and video components, RTP requires additional UDP connections for real-time transport control protocol (RTCP) [100] channels, one for each RTP connection. To facilitate real-time control of media playback from the server, a separate connection is dedicated to a control plane protocol known as the real-time streaming protocol (RTSP) [101], which enables clients to issue VCR-like commands to the server. Moreover, UDP is a non-responsive protocol, i.e., it does not reduce its data rate when there is congestion. Given the rapid increase in video streaming network flows, this may lead to congestion collapse, where little or no useful communication takes place due to congestion. Finally, the frame-based delivery in RTP requires streaming servers to parse video files in order to extract the frames, which adds overhead and impacts scalability.

Recently, multimedia delivery services have adopted a pull-based approach for delivering media content over the Internet using the widely popular Hyper-Text Transfer Protocol (HTTP). HTTP adaptive streaming (HAS) aims to overcome the issues of RTP streaming and is motivated by the stateless nature of the HTTP protocol, which makes it a more scalable solution, and by the fact that, unlike UDP, almost all firewalls and network address translators (NATs) are configured to allow HTTP traffic. HAS adapts the video configuration over time in order to deliver the best possible quality to the user at any given time. This allows for an enhanced quality-of-experience enabled by intelligent adaptation to different network path conditions, device capabilities, and content characteristics.
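As a minimal illustration of the adaptation decision a HAS client makes before requesting each segment, the following sketch picks the highest-bit-rate representation that fits within a fraction of the recently measured throughput. The representation bit rates, the safety margin, and the function name are illustrative assumptions; this simple throughput-based rule is only one of many possible adaptation strategies and is not meant to represent any particular system.

```python
# Hypothetical bit rates (kbps) of the available representations of a video,
# as they would be advertised in a streaming manifest.
REPRESENTATIONS_KBPS = [500, 1200, 2500, 5000, 8000]

def choose_representation(throughput_kbps: float, safety_margin: float = 0.8) -> int:
    """Pick the highest representation whose bit rate fits within a fraction
    of the recently measured throughput (a simple rate-based heuristic)."""
    budget = throughput_kbps * safety_margin
    chosen = REPRESENTATIONS_KBPS[0]  # never go below the lowest representation
    for rate in REPRESENTATIONS_KBPS:
        if rate <= budget:
            chosen = rate
    return chosen

# A client would call this before issuing the HTTP request for the next segment.
print(choose_representation(4000))  # -> 2500 with the default safety margin
```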