Quality-aware 3D Video Delivery

by Ahmed Hamza

M.Sc., Mansoura University, Egypt, 2008
B.Sc., Mansoura University, Egypt, 2003

Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

in the School of Computing Science Faculty of Applied Science

© Ahmed Hamza 2017
SIMON FRASER UNIVERSITY
Spring 2017

All rights reserved. However, in accordance with the Copyright Act of Canada, this work may be reproduced without authorization under the conditions for “Fair Dealing.” Therefore, limited reproduction of this work for the purposes of private study, research, criticism, review and news reporting is likely to be in accordance with the law, particularly if cited appropriately.

Approval

Name: Ahmed Hamza
Degree: Doctor of Philosophy
Title: Quality-aware 3D Video Delivery

Examining Committee:

Chair: Arrvindh Shriraman, Associate Professor

Mohamed Hefeeda, Senior Supervisor, Professor

Joseph Peters, Supervisor, Professor

Jiangchuan Liu, Internal Examiner, Professor, School of Computing Science

Abdulmotaleb El Saddik, External Examiner, Professor, School of Electrical Engineering and Computer Science, University of Ottawa

Date Defended: April 18, 2017

Abstract

Three dimensional (3D) videos are the next natural step in the evolution of digital media technologies. In order to provide viewers with an immersive experience, 3D video streams contain one or more views and additional information describing the scene’s geometry. This greatly increases the bandwidth requirements for 3D video transport. In this thesis, we address the challenges associated with delivering high quality 3D video content to heterogeneous devices over both wired and wireless networks. We focus on three problems: energy-efficient multicast of 3D videos over 4/5G networks, quality-aware HTTP adaptive streaming of free-viewpoint videos, and achieving quality-of-experience (QoE) fairness in free-viewpoint video streaming in mobile networks. In the first problem, multiple 3D videos represented in the two-view-plus-depth format and scalably coded into several substreams are multicast over a broadband wireless network. We show that optimally selecting the substreams to transmit for the multicast sessions is an NP-complete problem and present a polynomial time approximation algorithm to solve it. To maximize the power savings of mobile receivers, we extend the algorithm to efficiently schedule the transmission of the chosen substreams from each video. In the second problem, we present a free-viewpoint video streaming architecture based on state-of-the-art HTTP adaptive streaming protocols. We propose a rate adaptation method for streaming clients based on virtual view quality models, which relate the quality of synthesized views to the qualities of the reference views, to optimize the user’s quality-of-experience. We implement the proposed adaptation method in a streaming client and assess its performance. Finally, in the third problem, we propose an efficient radio resource allocation algorithm for mobile wireless networks where multiple free-viewpoint video streaming clients compete for the limited resources. The resulting allocation achieves QoE fairness across the streaming sessions and reduces quality fluctuations.

Keywords: Free-viewpoint video; adaptive video streaming; rate adaptation; DASH; 3D video; energy efficiency; mobile multimedia; multi-view video; wireless networks

To mom, dad, and Amgad.

“Read!” — First verse of The Noble Qur’an (96:1)

“And of knowledge, you (mankind) have been given only a little.” — The Noble Qur’an (17:85)

“Details matter, it’s worth waiting to get it right.” — Steve Jobs

“There is no greatness where there is not simplicity, goodness, and truth.” — Leo Tolstoy, War and Peace

“He who can have patience can have what he will.” — Benjamin Franklin

Acknowledgements

First and foremost, I owe my deepest gratitude to Dr. Mohamed Hefeeda. It has been a great honour to work with Dr. Hefeeda and have him as my senior supervisor. I would like to thank him for his endless encouragement, patience, support, and guidance throughout this journey. His critical reviews and intellectual input have enabled me to develop a deeper understanding of the exciting fields of multimedia networking and distributed systems. I would like to extend my sincerest gratitude to Dr. Joseph Peters, my supervisor, for his valuable advice and comments during my graduate studies. I am heartily thankful to him for sharing his time and his insights whenever I needed them and for the valuable brainstorming and discussion sessions from which I have learned a lot. I would also like to express my gratitude to Dr. Jiangchuan Liu, my internal examiner, and Dr. Abdulmotaleb El Saddik, my external examiner, for being on my committee and reviewing this thesis. Many thanks to Dr. Arrvindh Shriraman for taking the time to chair my thesis defence. I want to thank all my colleagues at the Network Systems Lab throughout the years of my graduate career. I am especially grateful to Cheng-Hsin Hsu, whom I greatly respect and have learned a great deal from. Thank you for all the valuable advice and support. Your dedication and hard work have been an inspiration for me to keep on going. Special thanks also go to Cong Ly, Shabnam Mirshokraie, Somsubhra Sharangi, Saleh Almowuena, Ahmed Abdelsadek, Kiana Calagari, Tarek El-Ganainy, and Khaled Diab. I also want to thank Hamed Ahmadi for all his help and for a great collaboration. I am really fortunate to have worked with such talented and amazing people and I cannot imagine this journey without them.

More importantly, this thesis would not have been possible without the endless love and support of my parents and my brother, Amgad. Words cannot express my eternal gratitude to my parents, who have made great sacrifices so that I can pursue my dreams. Thank you for your constant encouragement in all my pursuits and for always pushing me to succeed and to be a better person. Amgad, thank you for being an amazing brother. And thank you for all your support and care, and for cheering me up whenever I felt down. I am truly blessed to have you.

Table of Contents

Approval

Abstract

Dedication

Quotations

Acknowledgements

Table of Contents

List of Tables

List of Figures

List of Acronyms

1 Introduction
  1.1 Introduction
  1.2 Thesis Contributions
    1.2.1 Energy-efficient Multicast of 3D Videos over Wireless Networks
    1.2.2 Quality-aware HAS-based Free-viewpoint Video Streaming
    1.2.3 QoE-fair Adaptive Streaming of FVV over LTE Networks
  1.3 Thesis Organization

2 Background
  2.1 Introduction
  2.2 3D Content Capturing and Post-processing
    2.2.1 Camera Parameters and Geometric Calibration
    2.2.2 Image Rectification
    2.2.3 Color Correction
  2.3 Human Visual System
  2.4 3D Display Technologies
  2.5 3D Video Representation and Coding
    2.5.1 3D Video Representations
    2.5.2 3D Video Coding
  2.6 HTTP Adaptive Streaming
  2.7 Wireless Cellular Networks
    2.7.1 IEEE 802.16 WiMAX Networks
    2.7.2 Long Term Evolution Networks
    2.7.3 Multimedia Multicast Services

3 Energy-Efficient Multicasting of Multiview 3D Videos over Wireless Networks
  3.1 Introduction
  3.2 Related Work
    3.2.1 3D Video Transmission Over Wireless Networks
    3.2.2 Modeling Synthesized Views Quality
    3.2.3 Optimal Texture-Depth Bit Allocation
  3.3 System Overview
  3.4 Problem Statement and Formulation
  3.5 Proposed Solution
    3.5.1 Analysis
  3.6 Energy Efficient Radio Frame Scheduling
    3.6.1 Proposed Allocation Algorithm
  3.7 Validation of Virtual View Quality Model
  3.8 Performance Evaluation
    3.8.1 Setup
    3.8.2 Simulation Results
  3.9 Summary

4 Virtual View Quality-aware Rate Adaptation for HTTP-based Free-viewpoint Video Streaming
  4.1 Introduction
  4.2 Related Work
    4.2.1 Server-based Approaches
    4.2.2 Client-based Approaches
  4.3 Problem Definition
  4.4 Reference View Scheduling
  4.5 Virtual View Quality-aware Rate Adaptation
    4.5.1 Rate Adaptation Based on Empirical Virtual View Quality Measurements
    4.5.2 Rate Adaptation Based on Analytical Virtual View Quality Models
  4.6 System Architecture and Client Implementation
    4.6.1 Content Server
    4.6.2 FVV DASH Client
  4.7 Evaluation
    4.7.1 Content Preparation
    4.7.2 Experimental Setup
    4.7.3 Empirical Quality Models Results
    4.7.4 Analytical Quality Models Results
    4.7.5 Subjective Evaluation
  4.8 Summary

5 QoE-fair HTTP Adaptive Streaming of Free-viewpoint Videos in LTE Networks
  5.1 Introduction
  5.2 Related Work
    5.2.1 Fairness in Wired Networks
    5.2.2 Fairness in Wireless Networks
  5.3 System Model and Operation
    5.3.1 Wireless Network Model
    5.3.2 FVV Content Model
    5.3.3 System Operation
  5.4 Problem Statement
  5.5 Proposed Solution
    5.5.1 Rate-Utility Models for FVV
    5.5.2 Quality-fair FVV Rate Allocation
  5.6 Evaluation
    5.6.1 Setup
    5.6.2 Performance Metrics
    5.6.3 Results
  5.7 Summary

6 Conclusions and Future Work
  6.1 Conclusions
  6.2 Future Work

Bibliography

List of Tables

Table 3.1  List of symbols used in Chapter 3.
Table 3.2  3D video sequences used in 3D distortion model validation experiments.
Table 3.3  Data Rates (Kbps) and Y-PSNR Values (dB) Representing Each Layer of the Scalable Encodings of the Texture and Depth Streams.

Table 4.1  Coefficient of determination and average absolute fitting error for virtual view quality models generated from 100 operating points at view position 2 of the Kendo and Balloons sequences (encoded using VBR).
Table 4.2  Kendo sequence representation bitrates (bps).
Table 4.3  Café sequence representation bitrates (bps).
Table 4.4  Bandwidth change patterns.

Table 5.1  List of symbols used in Chapter 5.
Table 5.2  Mobile Network Configuration.

List of Figures

Figure 1.1   Problem 1 - Energy-efficient Multicasting of 3D Videos over Wireless Networks.
Figure 1.2   Problem 2 - Quality-aware Free-viewpoint Adaptive Video Streaming.
Figure 1.3   Problem 3 - QoE-fair Radio Resource Scheduling for Free-viewpoint Video Streaming over Mobile Networks.

Figure 2.1   End-to-end 3D video communication chain.
Figure 2.2   Multi-view camera arrangements.
Figure 2.3   Pinhole camera model geometry.
Figure 2.4   Working principles of auto-stereoscopic displays.
Figure 2.5   Two-view head-tracked display.
Figure 2.6   Different view packing arrangements for CSV.
Figure 2.7   Video plus depth representation of 3D video.
Figure 2.8   Synthesizing three intermediate views using two reference views and associated depth maps.
Figure 2.9   A sample frame from a layered depth video (LDV).
Figure 2.10  Hierarchical (dyadic) prediction structure.
Figure 2.11  Scalable video coding.
Figure 2.12  Typical MVC hierarchical prediction structure.
Figure 2.13  Video stream adaptation in HTTP adaptive streaming.
Figure 2.14  Structure of an MPD file in MPEG-DASH.
Figure 2.15  WiMAX Frame.
Figure 2.16  LTE network system architecture.
Figure 2.17  Downlink frame in LTE.

Figure 3.1   Calculating profits and costs for texture component substreams of the reference views.
Figure 3.2   Transmission intervals and decision points for two streams in a scheduling window of 20 TDD frames.
Figure 3.3   Average PSNR quality of 3 synthesized views from decoded substreams with respect to views synthesized from uncompressed references.
Figure 3.4   Average SSIM quality of 3 synthesized views from decoded substreams with respect to views synthesized from uncompressed references.
Figure 3.5   Average quality of solutions obtained using proposal (taken over all video sequences) for: (a) variable number of video streams; (b) different MBS area sizes.
Figure 3.6   Average running times for: (a) variable number of video streams; (b) different MBS area sizes.
Figure 3.7   Average running times for different values of the approximation parameter.
Figure 3.8   Allocation algorithm performance in terms of receiver buffer occupancy levels of selected substreams using a 4 second scheduling window: (a) receiving buffer; (b) consumption buffer; (c) overall buffer level.
Figure 3.9   Average energy saving.

Figure 4.1   Free-viewpoint video streaming systems where view synthesis is performed at: (a) server and (b) client.
Figure 4.2   Segment scheduling window. Deciding on left (L) reference view, right (R) reference view, and pre-fetched (P) view.
Figure 4.3   Kendo sequence R-D surface for virtual view at camera 2 position using reference cameras 1 and 3 and equal depth bit rate of 1 Kbps.
Figure 4.4   Architecture of our free-viewpoint video streaming system.
Figure 4.5   The components of our DASH client prototype.
Figure 4.6   The user interface of our DASH client prototype.
Figure 4.7   Frame buffers for the texture components of the reference streams. Dummy frames are inserted into the pre-fetch stream's buffer to synchronize the buffers when no pre-fetch segments are needed.
Figure 4.8   OpenGL-based view synthesis pipeline.
Figure 4.9   Evaluation testbed.
Figure 4.10  Client response using different rate allocation strategies.
Figure 4.11  Average quality for the Balloons video sequence with CBR encoding and fixed network bandwidth.
Figure 4.12  Average quality for the Café video sequence with CBR encoding and fixed network bandwidth.
Figure 4.13  Average quality for the Balloons video sequence with VBR encoding and fixed network bandwidth.
Figure 4.14  Results for the Kendo video sequence with variable network bandwidth.
Figure 4.15  Results for the Café video sequence with variable network bandwidth.
Figure 4.16  Difference mean opinion score (DMOS) between proposed virtual view quality-aware rate allocation and [113] for VBR and CBR encoded MVD videos at different available network bandwidth values.

Figure 5.1   System model for a HAS-based FVV streaming system.
Figure 5.2   Free-viewpoint video using multi-view-plus-depth content representation.
Figure 5.3   Sequence diagram using HTTP or HTTPS with CDN and mobile network collaboration.
Figure 5.4   Sequence diagram using HTTPS with no CDN and mobile network collaboration.
Figure 5.5   Operating points for a virtual view where two reference views and their associated depth maps are used for view synthesis and each component has 6 CBR-coded representations.
Figure 5.6   Average video quality over time (20 users).
Figure 5.7   Average rate of downward video quality switches.
Figure 5.8   Percentage of saved resource blocks.
Figure 5.9   Fairness in terms of average Jain's Index across users.
Figure 5.10  Cumulative distribution function of average running time per scheduling window (40 users).

List of Acronyms

3GPP Third Generation Partnership Project.

AVC Advanced Video Coding.

CBR Constant Bit Rate.

CQI Channel Quality Information.

CSV Conventional Stereo Video.

DASH Dynamic Adaptive Streaming over HTTP.

DIBR Depth-Image-Based Rendering.

FDD Frequency Division Duplex.

FVV Free-Viewpoint Video.

GOP Group of Pictures.

HAS HTTP Adaptive Streaming.

HEVC High Efficiency Video Coding.

HTTP Hyper-Text Transfer Protocol.

HVS Human Visual System.

IEEE Institute of Electrical and Electronics Engineers.

IP Internet Protocol.

LDV Layered Depth Video.

LTE Long Term Evolution.

MB Macroblock.

MBS Multicast and Broadcast Service.

MCS Modulation and Coding Scheme.

MPD Media Presentation Descriptor.

MPEG Moving Picture Experts Group.

MV Motion Vector.

MVD Multi-View plus Depth.

NAL Network Abstraction Layer.

OFDM Orthogonal Frequency-Division Multiplexing.

OFDMA Orthogonal Frequency-Division Multiple Access.

QoE Quality-of-Experience.

RB Resource Block.

RTCP Real-time Transport Control Protocol.

RTP Real-time Transport Protocol.

RTSP Real-time Streaming Protocol.

SEI Supplemental Enhancement Information.

SFN Single Frequency Network.

SNR Signal-to-Noise Ratio.

TCP Transmission Control Protocol.

TDD Time Division Duplex.

UDP User Datagram Protocol.

UE User Equipment.

UMTS Universal Mobile Telecommunications System.

VBR Variable Bit Rate.

WiMAX Worldwide Interoperability for Microwave Access.

Chapter 1

Introduction

1.1 Introduction

Three-dimensional (3D) video has been gaining popularity over the past few years. The global 3D display market is expected to grow at a compound annual growth rate of 28.38 % between 2016 and 2020 [96]. Advances in 3D video acquisition and display technologies have paved the way for many emerging 3D applications, such as 3D TV, free-viewpoint video, and immersive teleconferencing. Such applications provide a realistic, visually appealing viewing experience by allowing the viewers to perceive depth and view the captured scene from different vantage points. 3D TV extends the traditional 2D TV experience with the ability to perceive depth using displays that are able to decode and display more than one view simultaneously. Rendering multiple views at different spatial locations provides motion parallax, one of the main cues in the human visual system for perceiving depth. These displays can be two-view (stereoscopic) displays with associated glasses or more advanced auto-stereoscopic displays that support a large number of views and do not require specialized glasses.

In contrast to 3D TV, the free-viewpoint video (FVV) application focuses on free navigation. Viewers can interactively choose their viewpoint to observe the captured scene from a preferred perspective. Switching between the different views can be achieved using a simple remote control, a head tracking device, or a head-mounted display. If the desired viewpoint is not available, i.e., it was not captured by a camera, a virtual view interpolated from other available views is rendered. An efficient technique for synthesizing realistic virtual views utilizes the depth information associated with the captured views and is known as depth-image-based rendering (DIBR) [37].

The third application of 3D videos is immersive teleconferencing, which creates a photorealistic, immersive, and interactive 3D collaboration environment among geographically dispersed users. Each site involved in this collaborative environment has an array of cameras and 3D displays. The cameras capture the participants in the local scene from various

angles and generate a continuous 3D video stream that is rendered on the 3D displays at the receiving sites. Immersiveness can be achieved either in a free-viewpoint video or 3D TV manner.

Due to the large number of views, as well as the additional geometry information such as depth maps, involved in the representation of 3D videos, storing and transporting these videos is quite challenging for both wireless and wired networks. With the increasing popularity of 3D content, current and future networks will be required to allocate a huge amount of bandwidth to 3D video streaming services. Given the limited amount of available resources in a given network, sharing these resources among multiple 3D video streams while providing end users with the best possible quality-of-experience (QoE) is a challenging task. This is especially the case for wireless networks, such as cellular 4/5G broadband networks, where the operator bandwidth is often limited with respect to users' demands and the users are watching the content on battery-powered mobile devices. Adding to the complexity of the problem is the time-varying nature of the wireless channel conditions. An important problem in wireless networks is therefore how to dynamically adapt the utilization of radio resources to achieve the best perceived quality.

One promising technology for transmitting 3D video content over wireless networks is utilizing multicast/broadcast services. These services enable the delivery of multimedia content to large-scale user communities in a cost-efficient manner. By serving mobile terminals interested in the same video using a single multicast session, mobile network operators can efficiently utilize network resources and significantly reduce the load. A base station scheduler is responsible for the important task of determining how to allocate video data to the multicast/broadcast data area in each frame such that the real-time nature of the video stream is maintained and the perceived quality is maximized over all sessions.

An efficient method for video delivery that has recently been widely adopted is HTTP adaptive streaming (HAS) [109]. HAS is a client-driven streaming approach that is able to dynamically adjust the rate of the video stream by monitoring network conditions. Videos are encoded at a number of bit rates (versions) and each version is split into small, equal-duration chunks. A HAS client periodically measures the end-to-end throughput between itself and the server and selects the appropriate chunk version accordingly. Therefore, HAS is a promising technique for providing high QoE in 3D video streaming systems. However, unlike 2D videos, 3D videos often have multiple components, e.g., multiple views and/or texture and depth streams, and involve a view interpolation (synthesis) method. The quality of the output of the view interpolation method is directly affected by the qualities of the components used as an input to the process. Therefore, traditional HAS rate adaptation methods are unable to efficiently handle 3D video streams, where the relationship between the stream bit rate and the perceived quality is more complex.
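To make the throughput-driven chunk selection described above concrete, the following minimal sketch shows a generic HAS heuristic. It is not the adaptation logic proposed in this thesis; the bitrate ladder and safety margin are illustrative assumptions.

```python
def pick_representation(available_bitrates_bps, measured_throughput_bps, safety_margin=0.8):
    """Classic HAS heuristic: request the highest-bitrate version of the next chunk
    that fits within a conservative fraction of the measured end-to-end throughput."""
    budget = measured_throughput_bps * safety_margin
    feasible = [r for r in sorted(available_bitrates_bps) if r <= budget]
    return feasible[-1] if feasible else min(available_bitrates_bps)

# Example: a video offered at five bitrates, with 3.1 Mbps of measured throughput.
ladder = [500_000, 1_000_000, 2_000_000, 4_000_000, 8_000_000]
print(pick_representation(ladder, 3_100_000))  # -> 2000000
```

Such a heuristic works for single-component 2D streams; as argued above, it does not directly extend to 3D streams whose perceived quality depends jointly on several texture and depth components.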

In this thesis, we study multiple challenges associated with delivering 3D video content using wireless multicast services and over-the-top adaptive streaming services. We focus on optimizing the quality-of-experience for users in terms of perceived visual quality and, in the case of mobile users, viewing time. We develop algorithms that enable 3D video delivery systems to adapt to the dynamics of the network conditions and provide viewers with the best possible video quality. Given that a large portion of mobile traffic nowadays is video streaming data, a secondary objective in some of our proposed algorithms is to save as much battery power as possible for mobile receivers. Not only does this improve the users' experience, but it also reduces the number of recharging cycles for their devices, which has a positive impact on the environment by reducing the load on electrical power grids.

We focus on three main problems. The first problem is how to efficiently multicast 3D videos over mobile wireless networks. This problem is illustrated in Figure 1.1. Here, mobile users are interested in watching 3D videos which are transmitted over a number of multicast channels. The users' mobile devices are capable of rendering the videos using either stereoscopic displays (more common) or the more advanced auto-stereoscopic displays, which can render a large number of views to provide the viewer with a more immersive experience within a certain viewing angle. This application is referred to as 3D TV since it does not involve interactive view switching. The 3D video transmitted by each multicast session is captured using two color cameras and two depth cameras/sensors, and the streams are encoded using a scalable video coder into a number of quality layers. Using a process known as depth-image-based rendering [37], a receiving client is able to render a number of (virtual) views if the device is equipped with an auto-stereoscopic display, or adjust the displacement between the stereoscopic image pair based on the display size in the case of stereoscopic displays. Given the limited capacity of the wireless channel, our goal is to select the best set of substreams for the components of each of the multicast videos such that the network capacity is not exceeded and the quality of rendered views is maximized. Moreover, it is important to efficiently schedule the video data of the multicast sessions within the radio frames in order to minimize energy consumption at the receiver side. This scheduling process should also ensure that receiving buffers are not drained completely at any given point in time during playback to avoid interruptions.

Figure 1.1: Problem 1 - Energy-efficient Multicasting of 3D Videos over Wireless Networks.

The second problem we address in this thesis is quality-aware free-viewpoint adaptive video streaming, shown in Figure 1.2. Free-viewpoint videos are captured using a set of cameras recording the scene from different angles. Unlike the first problem, where user interactivity is not practical because the videos are multicast to a large number of users, here we consider a single video streaming session where the user is able to navigate between a number of views. Each captured view has an associated depth map stream to enable streaming clients to render non-captured views. The video and depth map streams are available in a number of representations, where each representation is encoded with a certain bitrate and/or quality, and are stored as a set of fixed duration chunks on a content server along with a manifest file. The streaming client uses an HTTP adaptive streaming standard, such as MPEG-DASH [56], to request video chunks from the server. Since the user can request any view in the region between the first and last captured views, including non-captured views, the client needs to dynamically adapt to both the target view and the network conditions when requesting video chunks. In this context, we implement a free-viewpoint video streaming client based on HTTP adaptive streaming and propose a rate adaptation method that maximizes the quality of rendered views. The proposed rate adaptation method determines which set of captured views (and depth maps) should be requested from the content server and which representation is chosen for each chunk.

Figure 1.2: Problem 2 - Quality-aware Free-viewpoint Adaptive Video Streaming.

In the third problem, we consider a set of users streaming free-viewpoint videos within a single cell in a mobile wireless network. This problem is shown in Figure 1.3 and is an extension of the previous adaptive streaming problem. Here, the wireless channel is a bottleneck and users compete for the radio resources. In wireless networks, the base station scheduler is responsible for allocating radio resource blocks and assigning a guaranteed bitrate to each user. Given that the complexity of the video content being streamed varies from one video to another, the challenge is how to allocate these resources in a way that maximizes the perceived quality while maintaining fairness across users. Although this problem is also applicable to 2D video streams, the complexity of the rate-utility relationship for synthesized views in the case of free-viewpoint videos makes it more challenging.

We propose an efficient heuristic algorithm that solves this problem using virtual view rate-utility models. In addition to the previously mentioned objectives, our proposed algorithm also minimizes the variability in video quality at the receivers. Although we address the problem in the context of wireless networks, it should be noted that the proposed solution is also applicable to any network where a centralized component is responsible for managing network resources, e.g., software-defined networks.

1.2 Thesis Contributions

We consider 3D video delivery using two network settings: (i) multicast/broadcast transmission over wireless networks such as LTE/LTE-Advanced and WiMAX, where popular 3D videos are sent through multicast/broadcast channels to reduce the load on the core network of the operator and efficiently manage the spectrum; and (ii) streaming of multi-view-plus-depth 3D videos using HTTP adaptive streaming to enable free-viewpoint navigation at the receiver side. We identify multiple optimization problems and, for each problem, we design efficient algorithms to optimize the quality-of-experience observed by users.

1.2.1 Energy-efficient Multicast of 3D Videos over Wireless Networks

We consider the problem of multicasting 3D videos over 4/5G broadband access networks, such as Long Term Evolution (LTE) and WiMAX, to mobile devices with auto-stereoscopic

displays. For such displays, 3D scenes need to be efficiently represented using a small amount of data that can be used to generate arbitrary views not captured during the acquisition process. The multi-view-plus-depth (MVD) representation has proven to be both efficient and flexible in providing good quality synthesized views. However, the quality of synthesized views is affected by the compression of the texture videos and depth maps. Given the limitations on the wireless channel capacity, it is important to efficiently utilize the channel bandwidth such that the quality of all rendered views at the receiver side is maximized. In addition, an efficient multicast solution should minimize the power consumption of the receivers to provide a longer viewing time. We address two main challenges: (i) maximizing the video quality of rendered views on the auto-stereoscopic displays [32][122] of mobile receivers such as smartphones and tablets; and (ii) minimizing the energy consumption of the mobile receivers during multicast sessions.

Figure 1.3: Problem 3 - QoE-fair Radio Resource Scheduling for Free-viewpoint Video Streaming over Mobile Networks.

Our contributions on this topic can be summarized as follows [44]:

• We study the problem of optimal substream selection for multicasting scalably-coded MVD 3D videos over wireless networks. We mathematically formulate the problem and prove its NP-hardness. We then propose an approximation algorithm for solving the problem in real time.

• We propose an energy-efficient radio frame scheduling algorithm that utilizes our substream selection algorithm and reduces the power consumption of receiving clients while maximizing perceived quality. Instead of continuously sending the streams at the encoding bit rate, our energy saving algorithm transmits the video streams in bursts at much higher bit rates to reduce transmission time. After receiving a burst of data, mobile subscribers can switch off their RF circuits until the start of the next burst (a small illustrative sketch of this idea follows this list). Our radio frame scheduling algorithm generates a burst schedule that maximizes the average system-wide energy saving over all multicast streams and prevents buffer overflow or underflow instances from occurring at the receivers.

• We evaluate the performance of our algorithms using simulation-based experiments. Our results show that the proposed algorithms provide solutions which are within 0.3 dB of the optimal solutions while satisfying real-time requirements of multicast systems, and they result in an average power consumption reduction of 86 %.
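As referenced in the second bullet above, the following minimal sketch estimates the energy saving a receiver obtains when data arrives in bursts and the RF circuit sleeps in between. It is not the thesis scheduling algorithm; the burst durations, relative power levels, and wake-up overhead are illustrative assumptions.

```python
def energy_saving(window_len_s, bursts, p_on=1.0, p_off=0.1, wakeup_overhead_s=0.002):
    """Estimate the fraction of RF energy saved in one scheduling window.

    window_len_s: length of the scheduling window in seconds.
    bursts: list of burst durations (seconds) within the window.
    p_on / p_off: relative RF power while receiving / sleeping (illustrative values).
    wakeup_overhead_s: time spent waking the RF circuit before each burst.
    """
    on_time = min(sum(bursts) + wakeup_overhead_s * len(bursts), window_len_s)
    off_time = window_len_s - on_time
    energy_bursty = on_time * p_on + off_time * p_off
    energy_always_on = window_len_s * p_on
    return 1.0 - energy_bursty / energy_always_on

# Example: a 1 s window in which the stream is delivered in two 60 ms bursts.
print(f"saving = {energy_saving(1.0, [0.06, 0.06]):.1%}")
```

The actual saving reported in Chapter 3 depends on the generated burst schedule and the receiver power model used in the evaluation.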

1.2.2 Quality-aware HAS-based Free-viewpoint Video Streaming

We study the problem of optimizing interactive free-viewpoint video streaming to heterogeneous clients using HAS. We analyze the relationship between the quality of the reference views' components and that of synthesized virtual views and derive a simple model that captures this relationship. We formulate a virtual view quality optimization problem to find the optimal set of reference representations to request and present a virtual-view-quality-aware rate adaptation algorithm to solve this problem. Our quality-aware adaptation algorithm is based on virtual view quality models that enable the streaming client to estimate the average quality of virtual views over the duration of each video chunk. We implement the proposed algorithm in a real HAS-based free-viewpoint video streaming testbed and conduct experiments for performance evaluation. Our contributions can be summarized as follows [45]:

• We present a two-step rate adaptation method for free-viewpoint videos. In the first step, the streaming client performs view pre-fetching based on historical viewpoint positions to reduce view switching latency and reduce quality degradation when switching views. In the second step, the client utilizes the virtual view quality models described in the manifest file of the video to determine the best set of segments for the next request such that the average quality of synthesized virtual views is maximized (a minimal sketch of this selection step follows this list).

7 • We describe an end-to-end system architecture for free-viewpoint video streaming based on HAS and the multi-view-plus-depth 3D video representation. We develop a complete streaming client that implements the proposed rate adaptation method using empirical and analytical virtual view quality models.

• We rigorously evaluate the performance of the proposed virtual-view quality based rate adaptation algorithm using multiple video sequences. We evaluate the perceived quality using objective quality metrics, and we conduct a subjective quality assessment study. Our results indicate that the proposed virtual view quality-aware rate adaptation method results in significant quality gains (up to 4 dB for constant bit rate streams and up to 2.26 dB for variable bit rate streams), especially at low bandwidth conditions.
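The selection step referenced in the first bullet above can be illustrated with a small sketch. It is not the thesis algorithm: the quality model below is a hypothetical placeholder for the virtual view quality models advertised in the manifest, and the candidate sets are simply enumerated exhaustively.

```python
import math
from itertools import product

def select_representations(tex_reps, depth_reps, quality_model, budget_bps):
    """Pick one texture and one depth representation per reference view (left, right)
    so that the estimated virtual view quality is maximized within the throughput budget.

    tex_reps / depth_reps: {'L': [bitrates...], 'R': [bitrates...]} in bps.
    quality_model: callable mapping (tl, tr, dl, dr) bitrates to an estimated quality.
    """
    best, best_q = None, float("-inf")
    for tl, tr, dl, dr in product(tex_reps['L'], tex_reps['R'],
                                  depth_reps['L'], depth_reps['R']):
        if tl + tr + dl + dr > budget_bps:
            continue  # this combination exceeds the throughput estimate for the next segment
        q = quality_model(tl, tr, dl, dr)
        if q > best_q:
            best, best_q = (tl, tr, dl, dr), q
    return best, best_q

# Hypothetical concave quality model and representation ladders (for illustration only).
model = lambda tl, tr, dl, dr: math.log1p(tl + tr) + 0.3 * math.log1p(dl + dr)
textures = {'L': [250_000, 500_000, 1_000_000], 'R': [250_000, 500_000, 1_000_000]}
depths = {'L': [64_000, 128_000], 'R': [64_000, 128_000]}
print(select_representations(textures, depths, model, budget_bps=1_500_000))
```

The exhaustive search is only feasible because the number of representations per component is small; Chapter 4 describes how the actual client makes this decision.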

1.2.3 QoE-fair Adaptive Streaming of FVV over LTE Networks

We study the problem of QoE-fair radio resource allocation for adaptive FVV streaming to heterogeneous clients in mobile networks. Here, a number of mobile terminals within a cellular network are using our HAS-based client to stream FVV content. The wireless channel conditions vary from one user to another due to channel fading as well as the user's distance from the base station. Most base station schedulers allocate resources by employing variations of the proportional fair [66] scheduling policy, which is inefficient when users have different utility functions. We propose a radio resource allocation algorithm that finds a QoE-fair allocation for the streaming clients which maximizes their perceived quality and reduces quality variation. Our contributions on this topic can be summarized as follows:

• We study the rate-utility relationship for synthesized virtual views in free-viewpoint videos represented using multiple views plus depth. We propose content-dependent parametric models that describe this relationship. These models enable resource allocation algorithms to estimate the perceived quality of synthesized virtual views given an allocated bandwidth.

• We formulate the problem of QoE-fair resource allocation for HAS-based video streaming in LTE networks as a multi-objective optimization problem that is known to be NP-hard. In addition to achieving QoE fairness, a resource allocation algorithm should also maximize the average video quality and minimize quality variations. We propose a heuristic algorithm which utilizes rate-utility models for synthesized virtual views, attempts to achieve a balance among the three objectives, and can run in real time. The proposed algorithm runs on media-aware network elements (MANEs) within the mobile network provider's network to support the base station scheduler.

• We evaluate the proposed radio resource allocation algorithm using OPNET and Matlab and compare it to state-of-the-art approaches. Our results show that our algorithm is able to achieve a high level of fairness (a small sketch of the fairness metric we report follows this list) while reducing the rate of quality switches by up to 32 % compared to other algorithms. Our algorithm also saves up to 18 % of the radio resource blocks while achieving comparable average quality.
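The fairness results in Chapter 5 are reported using Jain's fairness index (see Figure 5.9). As a reference, the following minimal sketch computes the index over per-user QoE values; the sample numbers are purely illustrative.

```python
def jains_index(values):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2).

    Equals 1.0 when all users receive the same QoE and approaches 1/n
    as the allocation becomes maximally unfair."""
    n = len(values)
    total = sum(values)
    return (total * total) / (n * sum(v * v for v in values))

# Illustrative per-user video qualities (e.g., PSNR in dB) for two allocations.
print(jains_index([38.0, 37.5, 38.2, 37.9]))  # nearly fair -> close to 1.0
print(jains_index([42.0, 30.0, 44.0, 28.0]))  # unfair      -> noticeably lower
```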

1.3 Thesis Organization

The chapters of this thesis are organized as follows. We first present some background on the end-to-end 3D video communications chain and wireless cellular networks in Chapter 2. In Chapter 3, we propose energy-efficient techniques for multicasting 3D videos over mobile broadband access networks. We address the problem of quality-aware adaptive streaming of free-viewpoint videos in Chapter 4, where we formulate the problem and derive virtual-view quality models to support the rate adaptation process. In Chapter 5, we present a heuristic algorithm to perform QoE-fair radio resource allocation for free-viewpoint video streaming in cellular networks. We conclude the thesis and discuss potential future research directions in Chapter 6.

Chapter 2

Background

2.1 Introduction

An end-to-end 3D video communication chain consists of several stages, as depicted in Figure 2.1. These include: content capturing, 3D representation, data compression, transmission, decompression, post-processing, and 3D rendering and display. In this chapter, we introduce some background as well as relevant work for each of these stages. First, we describe how 3D video content is captured using a set of cameras and the challenges associated with this process in Section 2.2. At the end of the chain, 3D displays exploit the depth-perception characteristics of the human visual system to provide an immersive 3D experience. In particular, our visual system relies on several cues that provide it with information about how far objects are and their relative positions. The priorities of these depth cues and the exact integration method used by the human visual system to fuse this information together to visualize the world's 3D structure are still not fully known. Current flat screen 3D displays take advantage of two such cues (binocular and motion parallax) to provide the illusion of depth. We describe how the human visual system perceives depth in order to visualize objects and their relative positions in 3D space in Section 2.3 and discuss the main depth cues. In Section 2.4, we present the different 3D display technologies and how they exploit the human visual system principles to provide the human eyes with the necessary information that enables them to perceive a video sequence in 3D.

The 3D scene information can be represented in different ways depending on the application and the type of display technology used to render the content. Examples include image-based representations, depth-based representations, 3D mesh models, and point clouds. In this thesis, we mainly focus on image- and depth-based representations as they are the main representation formats currently being investigated for 3D TV and FVV applications. The amount of information required to capture the aspects of a 3D scene is a multiple of that required by 2D videos. Therefore, transmitting this information over existing bandwidth-limited networks requires efficient compression algorithms that exploit inherent redundancies and significantly reduce the amount of data. Section 2.5 covers 3D content generation in terms of representation and coding formats as well as various coding approaches.

The delivery of 3D video services poses more challenges than conventional 2D video services due to the large amount of data involved, diverse network characteristics and user terminal requirements, as well as the user's context (e.g., preferences, location) [10]. HTTP adaptive streaming (HAS) has recently been adopted as the universal client-driven streaming solution for video distribution over the Internet. HAS is designed to cope with the highly dynamic nature of communication channels and is a promising delivery method for 3D videos. A brief introduction to the principles of HAS and how HAS content is generated is presented in Section 2.6. Recent studies have also shown that mobile video traffic is dominating the mobile communication landscape, with more and more users preferring to watch their favourite content on-the-go [25]. We discuss the main concepts related to mobile broadband access networks in Section 2.7.

Figure 2.1: End-to-end 3D video communication chain.

2.2 3D Content Capturing and Post-processing

Most 3D and free-viewpoint video systems use multiple cameras to capture real world scenery. These cameras are sometimes combined with depth sensors in order to capture scene geometry. The density (i.e., number of cameras) and arrangement of the cameras impose practical limitations on the view navigation range and the quality of the rendered views at a certain virtual view position [108]. Three typical multi-camera arrangements are shown in Figure 2.2.

Figure 2.2: Multi-view camera arrangements: (a) divergent; (b) convergent; and (c) parallel.

Individual cameras composing the multi-camera system have unique internal characteristics. Even when two cameras from a certain manufacturer are used to capture images of the same object from the exact same location and direction, the resulting images will not be identical. Moreover, if the location of each camera and the orientation of their respective optical axes cannot be determined precisely, virtual views cannot be interpolated accurately. Therefore, multi-camera capturing systems impose additional requirements, which are not present in traditional 2D video capturing systems, in order to correct for these factors. These requirements include [80, Chapter-2]:

1. Accurate 3D positions and viewing directions of all cameras should be known (to integrate captured multi-view video data geometrically).

2. The cameras should be accurately synchronized (to integrate captured multi-view video data temporally).

3. Brightness and chromatic characteristics of the cameras should be accurately known (to integrate captured multi-view video data chromatically).

4. All object surface areas should be observed by at least two cameras to reconstruct their 3D shapes by stereo-based methods.

Figure 2.3: Pinhole camera model geometry.

2.2.1 Camera Parameters and Geometric Calibration

Geometric camera calibration is the process of estimating the parameters of the geometric transformation conducted by a camera, which projects a 3D point onto the 2D image plane of the camera. These parameters include the internal geometric and optical characteristics and/or the 3D position and orientation of the camera frame relative to a certain world coordinate system. Most geometric calibration methods, such as the popular method proposed by Tsai [119], are based on a pinhole camera geometry model in which three types of coordinate systems are defined: the world coordinate system, the camera coordinate system, and the image coordinate system. Each camera (view) has its own camera coordinate system and its own image coordinate system. As shown in Figure 2.3, the pinhole camera model is described by an optical centre (camera projection centre) $C$ and an image plane. The distance of the image plane from $C$ is called the focal length $f$. The line from the camera centre $C$ perpendicular to the image plane is called the principal axis (optical axis) of the camera. The plane parallel to the image plane and containing $C$ is called the principal plane (focal plane).

To describe the relationship among the coordinate systems, two sets of camera parameters are defined: extrinsic parameters and intrinsic parameters. Extrinsic parameters describe the transformation from world coordinates to camera coordinates. This is represented by a translation vector $\mathbf{t}_{3\times 1}$ and a rotation matrix $\mathbf{R}_{3\times 3}$. Intrinsic parameters describe the characteristics of the camera that influence the transformation from a camera coordinate to its image coordinate. These characteristics are represented by the camera calibration matrix $\mathbf{K}$, which contains information about the focal length $f$, image centre coordinates

$(o_x, o_y)$, and pixel size in millimetres $(s_x, s_y)$ along the axes of the camera photo-sensor. A 3D point is projected onto the image plane with the line containing the point and the optical centre. The relationship between 3D coordinates of a scene point and coordinates of its projection onto the image plane is described by the central or perspective projection

[47, Chapter-9]. If the world (scene) and image points are represented by homogeneous vectors, the perspective projection of a point $\mathbf{M} = (X, Y, Z, 1)^T$ in the 3D space of the scene to a pixel $\mathbf{m} = (u/z, v/z, 1)^T$ in the image plane is defined by Eq. (2.1), where $\mathbf{P}$ is the camera projection matrix which describes the linear mapping and $z$ corresponds to the depth value of the view described by $\mathbf{P}$.

$$ z\,\mathbf{m} = \mathbf{P}\,\mathbf{M} \qquad (2.1) $$

$$ \begin{bmatrix} u \\ v \\ z \end{bmatrix} = \mathbf{P} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \qquad (2.2) $$

In general, $\mathbf{P}$ is a $3 \times 4$ full-rank matrix. And since it is a homogeneous matrix, it has 11 degrees of freedom. The camera projection matrix encodes both the extrinsic and intrinsic parameters of the camera. Using QR factorization, we can decompose the $3 \times 4$ full-rank matrix $\mathbf{P}$ into the matrices and vectors representing those parameters. Thus, $\mathbf{P}$ can be factorized as given in Eq. (2.3), where $\mathbf{t}$ is the translation vector, $\mathbf{R}$ is the rotation matrix, and $\mathbf{K}$ is the upper triangular (non-singular) camera calibration matrix.

$$ \mathbf{P} = \mathbf{K}\,[\mathbf{R} \mid \mathbf{t}] \qquad (2.3) $$

The intrinsic matrix $\mathbf{K}$ represents the transformation from a camera coordinate to its image coordinate. It is defined as given in Eq. (2.4), where $\alpha_x$ and $\alpha_y$ are the focal lengths along the x-axis and y-axis, respectively, and $(o_x, o_y)$ is the principal point offset. The reason that the focal length differs in the two axial directions is that CCD cameras may have non-square pixels. Since image coordinates are measured in pixels, this introduces unequal scale factors in each direction and the image coordinates become non-Euclidean. Thus, $\alpha_x = f n_x$ and $\alpha_y = f n_y$, where $n_x$ and $n_y$ are the number of pixels per unit distance in the x-direction and y-direction, respectively.

$$ \mathbf{K} = \begin{bmatrix} \alpha_x & 0 & o_x \\ 0 & \alpha_y & o_y \\ 0 & 0 & 1 \end{bmatrix} \qquad (2.4) $$

The extrinsic matrix $\mathbf{E} = [\mathbf{R} \mid \mathbf{t}]$ is defined for the transformation from world coordinates to camera coordinates, which is composed of a rotation matrix $\mathbf{R}_{3\times 3}$ and a translation vector $\mathbf{t}_{3\times 1}$. Using homogeneous coordinates, the transformation is given in Eq. (2.5). This can also be simplified and represented as given in Eq. (2.6). It should be noted that $\mathbf{t} = -\mathbf{R}\tilde{\mathbf{C}}$, where $\tilde{\mathbf{C}}$ represents the coordinates of the camera centre in the world coordinate frame (represented in non-homogeneous coordinates).

$$ \begin{bmatrix} X_{cam} \\ Y_{cam} \\ Z_{cam} \\ 1 \end{bmatrix} = \begin{bmatrix} \mathbf{R}_{3\times 3} & \mathbf{t}_{3\times 1} \\ \mathbf{0} & 1 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \qquad (2.5) $$

$$ \begin{bmatrix} X_{cam} \\ Y_{cam} \\ Z_{cam} \end{bmatrix} = \mathbf{R}_{3\times 3} \cdot \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} + \mathbf{t}_{3\times 1} \qquad (2.6) $$

Thus, the perspective projection of a point $\mathbf{M} = (X, Y, Z, 1)^T$ in the 3D space of the scene to a pixel $\mathbf{m} = (u/z, v/z, 1)^T$ in the image plane can be re-written as given in Eq. (2.7).

$$ \begin{bmatrix} z \cdot u \\ z \cdot v \\ z \\ 1 \end{bmatrix} = \begin{bmatrix} \mathbf{K}_{3\times 3} & \mathbf{0} \\ \mathbf{0} & 1 \end{bmatrix} \begin{bmatrix} \mathbf{R}_{3\times 3} & \mathbf{t}_{3\times 1} \\ \mathbf{0} & 1 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \qquad (2.7) $$
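To make the projection pipeline above concrete, the following small sketch composes $\mathbf{P} = \mathbf{K}[\mathbf{R} \mid \mathbf{t}]$ and projects a 3D point, following Eqs. (2.1)-(2.7). The numeric camera parameters are arbitrary illustrative values, not taken from the thesis.

```python
import numpy as np

# Illustrative intrinsic parameters (Eq. 2.4): focal lengths and principal point in pixels.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Illustrative extrinsic parameters: identity rotation, camera centre 0.5 m left of the origin.
R = np.eye(3)
C_tilde = np.array([-0.5, 0.0, 0.0])   # camera centre in world coordinates
t = -R @ C_tilde                       # t = -R C~, as noted above

# Camera projection matrix P = K [R | t] (Eq. 2.3), a 3x4 matrix.
P = K @ np.hstack([R, t.reshape(3, 1)])

# Project a homogeneous world point M = (X, Y, Z, 1)^T (Eqs. 2.1 and 2.7).
M = np.array([0.2, 0.1, 2.0, 1.0])
u, v, z = P @ M                        # z m = P M, with m = (u/z, v/z, 1)^T
print("pixel:", u / z, v / z, "depth:", z)
```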

2.2.2 Image Rectification

In some settings (e.g., if depth is estimated from the disparities between the captured views), it is also desirable to perform multi-view image rectification as a post-processing step after capturing the multi-view content using calibrated cameras. Due to differences in the positioning of the cameras, the captured multi-view images usually have both horizontal and vertical disparities between neighboring views. If the vertical disparities are removed completely, the search range for depth estimation algorithms is reduced to a single (horizontal) dimension and the stereo matching process becomes greatly simplified and much faster [62]. Multi-view image rectification aligns the epipolar lines of each camera view and removes vertical disparities. In addition, it compensates for the slight focal length differences existing between the cameras.

2.2.3 Color Correction

Another common problem in multi-camera systems is that different camera sensors acquire a different color response to the same image object [106]. Physical factors during the imaging process introduce a variation that differs from one camera to another. Moreover, even if all cameras are properly calibrated, it is practically impossible to capture an object under perfectly constant lighting conditions at different spatial positions. Color inconsistency

across cameras negatively affects the 3D multi-view video processing chain since it reduces the correlation between captured views. This leads to a significant decrease in multi-view coding efficiency since the inter-view prediction scheme produces higher energy residuals in cases of luminance difference between the reference view and the predicted view. Color inconsistency also reduces the quality of disparity estimation, as a result of potentially wrong matches in stereo correspondence algorithms, as well as the quality of rendered virtual views. Moreover, the view synthesis process in free-viewpoint video systems is negatively affected if there are color differences between the two reference cameras (which are warped to the image coordinates of the target viewpoint) that need blending. Therefore, color calibration and/or color correction methods are necessary to enhance the performance of 3D TV and FTV systems [129].

2.3 Human Visual System

In the human visual system (HVS), the eyes are horizontally separated by a distance of approximately 6.3 cm. Thus, looking at a 3D scene, each eye sees a unique image from a slightly different angle. This is known as binocular stereopsis. The resulting difference in image location of an object seen by the left and right eyes is referred to as binocular disparity (also known as binocular parallax). This difference provides cues about the relative distance of objects (i.e., the perceived distance between objects) and their depth structure, as well as absolute depths (the perceived distance from the observer to objects). This is possible because the brain fuses the two images to enable depth perception, a process known as binocular fusion. The single mental image of the scene that results from binocular fusion is known as the cyclopean image. As the distance between the viewer and the object increases, the difference between the two images decreases and, consequently, the ability to identify differences in depth diminishes. The two most important depth cues are binocular stereopsis and motion (movement) parallax [92], [90]. Estimating depth through binocular depth cues has two parts: vergence and stereopsis. Vergence is the process of positioning the eyes so that the difference between the information projected in both retinae is minimized. The angle between the eyes is used as a depth cue. After this convergence step, the stereopsis process uses the residual disparity of the surrounding area to estimate the depth relative to the point of convergence. Binocular depth cues have been discussed at length in the literature. They are the main cues utilized in 3D displays. For example, optical elements such as parallax barriers or lenticular lenses are placed on top of a multi-view display to direct different views to each eye. Displays that require wearing glasses of any kind also exploit this property of the brain. Images for the left eye are separated from those for the right eye and the technology relies on the brain to reconstruct the images and provide 3D perception.

However, even with one eye closed, one can still perceive depth through a number of depth cues known as monocular depth cues [95]. This is possible through scene cues that indicate where objects are relative to each other. Monocular depth cues include: relative size, motion parallax, occlusion (interposition), light and shade, texture gradient, haze, and perspective. Motion parallax cues are a result of eye or head movement. The relative movement between the viewing eye and the scene is a cue to depth perception. An object that appears to move faster is located much closer to the viewer than an object that appears to move slower. Free-view TV is a promising approach to leveraging the head motion parallax component of 3D perception, which is very important. However, there are different kinds of motion parallax depth cues. In addition to head movements, eyeball movements also facilitate depth estimation. Other cues that play an important role in the perception of depth include accommodation and pictorial cues. Accommodation is the ability of the eye to change the optical power of its lens in order to focus on objects at various distances. As the distance between the eyes and the object becomes very large, binocular depth cues become less important and the HVS relies on pictorial cues for depth assessment. These are monocular cues such as shadows, perspective lines, and texture scaling. In the real world, all the aforementioned depth cues are fused together in an adaptive way depending on the viewing conditions and the viewing space. However, in artificial scenarios, such as when watching 3D content on a 3D TV, two or more depth cues can be in conflict. Although in many such cases the HVS can still correctly interpret the 3D scene, this comes at a higher cognitive load, and after some time the viewer may sense visual discomfort, e.g., eye strain, headache, or nausea. An example of such a conflict is the so-called accommodation-vergence rivalry, where the eyes converge to focus on an object but the lens accommodation stays on the screen where the image is sharpest [99].

2.4 3D Display Technologies

3D displays are imaging devices that create 3D perception as their output by utilizing different characteristics of the human visual system. Such displays can be categorized, based on the technique used to direct the left- and right-eye views to the appropriate eye, into aided-view displays and free-view displays [99].

Aided-View Displays

Aided-view displays rely on special user-worn devices, such as stereo-glasses or head-mounted miniature displays, to optically channel the left- and right-eye views to the corresponding eye. Various multiplexing methods have been proposed to carry the optical signal to the appropriate eye, including: color multiplexing, polarization multiplexing, and time multiplexing.

• Color Multiplexed Displays. Color multiplexing is used in anaglyph displays where the images for the left and right eyes are combined using a complementary color coding technique. The most common anaglyph method uses the red channel for the left eye and the cyan channel for the right eye (a minimal anaglyph-composition sketch follows this list). The viewer wears a pair of colored (anaglyph) glasses so that the left and right eyes receive the corresponding images only. Each lens permits the wavelength of the correct image to reach the eye while blocking other wavelengths for that eye. Different color coding techniques are possible in anaglyph displays, including: red/cyan, yellow/blue, and green/magenta. The main drawbacks of this type of display are the loss of color information and the increased degree of cross-talk [122]. Moreover, the anaglyph viewing filters sometimes cause chromatic adaptation problems for the viewer and it has been widely reported that prolonged use of this technology causes headaches. Methods such as image alignment, color component blurring, and depth map adjustment have been shown to reduce crosstalk and significantly improve image quality [50], [51].

• Polarization Multiplexed Displays. In polarization multiplexing, the state of polarization of light corresponding to each image in the stereo pair is made mutually orthogonal. The two views are superimposed on the screen and the viewer needs to wear polarized glasses to separate them. Two types of polarized glasses are possible: linearly polarized (one lens horizontally polarized and the other vertically polarized) and circularly polarized (one lens polarized clockwise and the other counter-clockwise). Circular polarization allows more head tilt before cross-talk becomes noticeable. Although polarizing filters can cause chromatic aberration, these types of 3D displays offer high resolution and the color quality issues are generally negligible, unlike the anaglyph-based displays. Polarized displays are most common in movie theatres nowadays.

• Time Multiplexed Displays. Time-multiplexed displays exploit the persistence of vision of the human visual system to give 3D perception. The left and right eye images are displayed on the screen in an alternating fashion at high frame rates, usually 120 Hz. The viewer is required to wear battery powered active shutter glasses, which are synchronized to the content being displayed [122]. The lenses of these glasses are actually small LCD screens. Applying a voltage to a lens causes the shutter to close, preventing the image from passing through to the eye. By synchronizing this behavior with the screen displaying the 3D content, normally using an infrared transmitter, each eye sees a separate view.

Auto-stereoscopic Displays

Auto-stereoscopic displays relieve the viewer from the discomfort of wearing specialized glasses by dividing the viewing space into a finite number of viewing slots where only one image

Figure 2.4: Working principles of auto-stereoscopic displays: (a) parallax barrier; (b) lenticular lens.

(view) of the scene is visible. Thus, each of the viewer's eyes sees a different image and, depending on the type of display, those images may change as the viewer moves or changes his head position. This is achieved by applying optical principles such as diffraction, refraction, reflection, and occlusion to direct the light from a certain view to the appropriate eye. Various types of auto-stereoscopic displays use different techniques to control light paths. The two most well-known auto-stereoscopic techniques are parallax barriers and lenticular arrays.

• Parallax Barrier Displays. Parallax barriers utilize occlusion to hide part of the image from one eye while maintaining it visible to the other eye. At the right distance and angle, each eye will only be able to see the corresponding view, as shown in Figure 2.4a. These displays can be switched to a 2D display mode for backward compatibility by removing the optical function of the parallax elements [130]. This can be achieved using polarization-based electronic switching systems. The two main problems associated with parallax barrier systems are the loss of brightness, caused by the barriers themselves, and the loss of spatial resolution, caused by using only half of the pixels for each viewing zone. The optimum viewing distance in these systems is proportional to the distance between the display and the parallax barrier and inversely proportional to the display pixel size. As the display resolution gets higher, the optimum viewing distance of the system gets larger [122].

• Lenticular Array Displays. Unlike parallax barrier displays, lenticular displays are based on the refraction principle. An array of vertically oriented cylindrical lenses is placed in front of columns of pixels, as shown in Figure 2.4b. The display creates repeating viewing zones for the left and right eyes (shown with green and red colors). A lenticular display can also be switched between 2D and 3D viewing modes using lenticular lenses filled with a special material that can switch between two refracting states. The alignment of the lenticular array on the display panel is critical in lenticular systems. This alignment gets more difficult as the display resolution increases, and any misalignment can cause distortions in the displayed images [122]. The problem with

lenticular systems is the reduction in resolution with the increase in the number of views. In vertically aligned lenticular arrays, the resolution decreases in the horizontal direction only. However, if the lenses are slanted, the resolution loss is distributed over two axes [29].

Regardless of the technique used to direct the light, auto-stereoscopic displays can either be two-view displays, where only a single stereo pair is displayed, or multi-view displays, where multiple stereo pairs are produced to provide 3D images to multiple users. Two-view auto-stereoscopic displays divide the horizontal resolution of the display into two sets. Every second column of pixels constitutes one image of the left-right image pair, while the other image consists of the rest of the columns. The two displayed images are visible in multiple zones in space. However, the viewer will perceive a correct stereoscopic image only if standing at the ideal distance and in the correct position. Moving too far forward or backward from the ideal distance greatly reduces the chance of seeing a correct image. If the two-view stereoscopic display is equipped with a head tracking device, it can prevent incorrect pseudoscopic viewing by displaying the right and left images in the appropriate zones. One disadvantage of head-tracking stereoscopic displays is that they only support a single viewer. Moreover, they need to be designed to have minimal lag so that the user does not notice the head tracking. Multi-view auto-stereoscopic displays overcome the limitations of two-view and head-tracking stereoscopic displays by increasing the number of displayed views. Thus, they have the advantage of allowing viewers to perceive a 3D image when their eyes are anywhere within the viewing zone. This enables multiple viewers to see the 3D objects from their own point of view, which makes these displays more suitable for applications such as computer games, home entertainment, and advertising.

2.5 3D Video Representation and Coding

2.5.1 3D Video Representations

Conventional Stereo Video

A stereo video signal captured by two input cameras is the simplest 3D video data representation. This 3D format is called conventional stereo video (CSV). By presenting each of the captured views to one of the eyes, the viewer is provided with a 3D impression of the captured scene. Standardized solutions for CSV have already found their way to the market in 3D cinema, Blu-ray Disc, and broadcast. A common way to represent and transmit the two views is to multiplex them either temporally or spatially [125]. In temporal multiplexing, the left and right views are interleaved as alternating frames. Temporal multiplexing has the advantage of maintaining the full resolution of each view. However, this comes at the expense of doubling the raw data rate relative to conventional single-view video.

Figure 2.5: Two-view head-tracked display; (a) swapping zones over as the viewer moves his head; (b) producing only two views and controlling where the views are directed in space [32].

Figure 2.6: Different view packing arrangements for left (L) and right (R) views in conventional stereo video: (a) side-by-side; (b) above-below; (c) line-by-line; and (d) checkerboard.

With spatial multiplexing, the left and right views are sub-sampled either horizontally or vertically and interleaved (packed) within a single frame. Possible packing arrangements include side-by-side, above-below, line-by-line, and checkerboard arrangements. Figure 2.6 illustrates the different packing arrangements. The resulting format is known as a frame-compatible format, which essentially tunnels the stereo video through existing hardware and delivery channels. Thus, the main advantage of spatial multiplexing is that it allows broadcasters to use the same bandwidth as regular monoscopic video content, and transmission can be achieved in the same way. The obvious drawback is, however, the loss of spatial resolution, which may impact the quality of 3D perception. Subjective experiments have shown that, up to a certain limit, if the quality of one of the two views in a stereo pair is reduced by low-pass filtering, the overall perceived quality tends to be dominated by the higher quality view. This is known as the binocular suppression theory [110]. Another possible representation of stereo video exploits this theory to reduce the overall bit rate of the stereo video. This representation is known as mixed resolution stereo. Among the important issues that have been studied is whether the bit rate of the auxiliary view in a mixed resolution stereo video should be reduced by downscaling (and then rescaling at the receiver) or by quality reduction (increasing the quantization parameter). Temporal scaling, i.e., reducing the frame rate of the auxiliary view, is also possible but has been shown to give unacceptable results in terms of perceived 3D quality, especially for high motion content. The main limitation of the stereo representation in general is its dependency on the display hardware: the acquisition process is tailored to a specific type of stereoscopic display (e.g., size, display type, number of views). Moreover, the baseline distance between the two cameras is fixed. This hinders the flexibility of modifying the 3D impression at the receiver side, and prevents supporting head motion parallax, occlusion, and disocclusion when the viewer changes the viewpoint. Additional information about the captured scene, such as geometry information, needs to be provided in order to support such features.
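To make the spatial multiplexing operation concrete, the following sketch is a minimal illustration, assuming the decoded views are available as NumPy arrays; the function name pack_side_by_side is ours and not part of any standard. It packs a stereo pair into a side-by-side frame-compatible frame by keeping every second column of each view.

```python
import numpy as np

def pack_side_by_side(left: np.ndarray, right: np.ndarray) -> np.ndarray:
    """Pack a stereo pair into a frame-compatible side-by-side frame.

    Each view is horizontally subsampled by a factor of two (every second
    column is kept), so the packed frame has the same resolution as a single
    original view.  Illustrative only; real systems typically apply a
    low-pass filter before decimation to reduce aliasing.
    """
    assert left.shape == right.shape, "views must have the same resolution"
    half_left = left[:, ::2]    # keep every second column of the left view
    half_right = right[:, ::2]  # keep every second column of the right view
    return np.concatenate([half_left, half_right], axis=1)

# Example: two 1080p views packed into one 1920x1080 frame.
L = np.zeros((1080, 1920, 3), dtype=np.uint8)
R = np.ones((1080, 1920, 3), dtype=np.uint8)
packed = pack_side_by_side(L, R)
print(packed.shape)  # (1080, 1920, 3)
```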

Video Plus Depth

The video-plus-depth (V+D) format provides a regular 2D video signal along with geometry information of the scene in the form of an associated depth map video, as shown in Figure 2.7. The V+D format enables the generation of virtual views, within a certain range around the captured view, using a view synthesis technique such as depth image-based rendering (DIBR) [64], [37]. In this format, the 2D video represents texture information and color intensity, while the depth video provides a Z-value for each pixel representing the distance between the optical center of the camera and a 3D point in the captured scene. The depth range is restricted to a range between two extremes Znear and Zfar indicating the minimum and maximum possible distances of a 3D point, respectively. Information about this range needs to be transmitted along with the video bitstream. The depth range is usually quantized

Figure 2.7: Video plus depth representation of 3D video (texture frame and corresponding depth map, with depth values quantized between 0 and 255).

using 8 bits into 256 quantization intervals, where Znear is associated with the value 255 and Zfar with the value 0. Thus, the depth video carries a monochromatic video signal. The depth data is usually stored as inverted real-world depth according to

$d = \operatorname{round}\left[\, 255 \cdot \left( \frac{1}{z} - \frac{1}{Z_{far}} \right) \Big/ \left( \frac{1}{Z_{near}} - \frac{1}{Z_{far}} \right) \right]$,   (2.8)

where z is the real-world depth and d is the corresponding value in the depth map [88]. This method of depth storage has the advantage that, since the depth values are inverted, nearby objects receive a high depth resolution while farther objects receive only a coarse depth resolution. This also aligns with the human perception of stereopsis, where a depth impression is derived from the shift between the left- and right-eye views [88]; the stored depth values are thus quantized similarly to these shift (disparity) values. The process of capturing the depth map is in itself error-prone. One technique for capturing depth information is triangulation: a laser stripe is scanned across the scene and captured by a camera positioned at a distance from the laser pointer, and the range of the scene is then determined by the focal length of the camera, the distance between the camera and the laser pointer, and the observed stripe position in the captured image [68]. This technique works well for static scenes but not for dynamic ones, and it tends to change the color and texture of the scene, which means it introduces artifacts even before the encoding process. Another depth map capturing method utilizes the time-of-flight principle [46]: laser beams (often in the infrared spectrum) are emitted towards the scene, and the reflections are collected by the device to measure the time of flight [68]. Pulsed-wave sensors, such as the ZCam depth camera from 3DV Systems [43] (later acquired by Microsoft), can measure the time of delay directly. This method works well for dynamic scenes, but its performance in certain environments, such as scenes with many mirrors, very smooth or rough surfaces, very fast motion, or extreme heat or cold, is yet to be determined. Depth information may also be estimated from a stereo pair by solving for stereo correspondences [98].
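As a concrete illustration of Equation (2.8), the short sketch below (plain NumPy; the function and variable names are ours) converts real-world depth values into 8-bit depth-map samples, so that Znear maps to 255 and Zfar maps to 0.

```python
import numpy as np

def quantize_depth(z: np.ndarray, z_near: float, z_far: float) -> np.ndarray:
    """Map real-world depth z (z_near <= z <= z_far) to 8-bit depth values
    following Equation (2.8): nearer points receive larger values, so the
    depth resolution is finer for nearby objects."""
    d = 255.0 * (1.0 / z - 1.0 / z_far) / (1.0 / z_near - 1.0 / z_far)
    return np.round(d).astype(np.uint8)

depths = np.array([1.0, 2.0, 5.0, 10.0])  # example distances in metres
print(quantize_depth(depths, z_near=1.0, z_far=10.0))  # 255 for Znear, 0 for Zfar
```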

The video-plus-depth format has several advantages. Encoding the depth map adds only a small overhead (about 10-20%) to the video bit rate, which makes it attractive when attempting to minimize bandwidth. Moreover, the inclusion of depth enables a display-independent solution that provides adaptivity at the user side for different kinds of displays. For example, it allows the adjustment of depth perception in stereo displays based on the viewing characteristics. Finally, the format is backward compatible with legacy devices. However, the flexibility provided by the video-plus-depth format comes at the cost of increased complexity at both the sender and the receiver sides. Depth estimation algorithms, for example, are highly complex, time consuming, and error prone. At the receiver side, it is required to generate a second view to drive a stereoscopic display. Moreover, video-plus-depth is only capable of rendering a limited range of views and is prone to errors at disoccluded points.

Multi-view Plus Depth

One key issue with video-plus-depth is that it enables synthesizing only a limited continuum of views around the original view. This is due to the disocclusion (or exposure) problem, where some regions in the virtual view have no mapping because they were invisible in the original reference view. These regions are known as holes and require applying a filling algorithm that interpolates the value of the unmapped pixels from surrounding areas. This disocclusion effect increases as the angular distance between the reference view and the virtual view increases. On the other hand, advanced 3D video applications that enable the user to change the viewing point, such as wide range multi-view auto-stereoscopic displays and free-viewpoint video, require a very large number of output views. To overcome the limitations of the V+D format, the Moving Picture Experts Group (MPEG) developed a new 3D video standard based on a multi-view video-plus-depth (MVD) representation format, where multiple 2D views are combined with their associated depth map signals. Virtual views may be synthesized more correctly if two or more reference views, from both sides of the virtual view, are used [42]. This is possible because areas which are occluded in one of the reference views may not be occluded in the other one. Thus, the MVD format enables generating many high quality views from a limited amount of input data. Moreover, the format enables flexible adjustment of the video signal to various display types and sizes, and different viewing preferences. When only two reference views and their depth maps are available for a certain video, the representation format is referred to as MVD2. Similarly, MVD4 is the representation format where four reference views and four associated depth maps are available. Figure 2.8 demonstrates the results of the view synthesis process using an MVD2 video. This representation format will be used in subsequent chapters of this thesis.

Figure 2.8: Synthesizing three intermediate views using two reference views (texture plus depth) and their associated depth maps.

Layered Depth Video

Layered depth video (LDV) [89] is an extension of the MVD representation and is based on the concept of layered depth images [105]. A layered depth video has two layers: a texture sequence and an associated depth map sequence comprising the base layer, and an enhancement layer (also known as the occlusion layer) composed of a residual texture sequence and a residual depth map. Figure 2.9 illustrates the components of a single frame in the LDV representation. The occlusion layer in the LDV representation is used to fill any holes during the view synthesis process. Generating the occlusion information, however, requires more processing during content generation, or pre-processing before compression and transmission. Since the non-occluded regions contain a large amount of redundant data, this redundancy can be exploited to code the occlusion layer efficiently. Occlusion data can be obtained from side views or from previous or next frames. Although LDV has a more compact representation than MVD, the generation of LDV is based on view warping and, consequently, on the error-prone depth data. Moreover, this basic approach is oblivious to reflections, shadows, and other factors which result in the same content appearing differently in different views [7].

2.5.2 3D Video Coding

Before going over the different video coding standards for 3D and multi-view videos, we first provide a brief background on video coding concepts in 2D videos. A 2D video sequence is a stream of individual frames (pictures) which are presented with constant or variable time intervals between them. In addition to spatial redundancies which are normally present in 2D images, video sequences also contain temporal redundancies because successive frames within the video often have small differences, with the exception of scene changes. The

Figure 2.9: A sample frame from a layered depth video (LDV): the main layer (texture and depth) and the enhancement layer (residual texture and residual depth).

frames are grouped into coding structures known as groups of pictures (GOPs), where each GOP contains the same number of frames. State-of-the-art 2D video encoders divide each frame into a set of small non-overlapping blocks, each known as a macroblock (MB). Within a GOP, the encoder reduces the temporal redundancy by attempting to predict each MB within a frame from MBs in previous (and possibly future) frames, and obtains a residual frame containing the prediction errors between the original and predicted blocks. Residual frames have smaller energy and can therefore be coded more efficiently using fewer bits. In order to construct the predicted frame, the encoder performs a search within a certain region in the reference frame(s) to find the closest matching block, and a motion vector (MV) representing the displacement between that block and the block being coded is calculated, a process known as motion estimation. The difference between the two blocks is then transmitted in the encoded bitstream along with the corresponding motion vector. This process is known as inter-frame prediction (or inter-prediction for short). For MBs where inter-prediction cannot be exploited, intra-prediction is used to eliminate spatial redundancies. Intra-prediction attempts to predict a block by extrapolating the neighboring pixels from adjacent blocks in a defined set of directions. Frames within a GOP can therefore be classified, based on the type of prediction used for their MBs, into: I-frames (only intra-predicted MBs), P-frames (intra-predicted MBs and/or MBs predicted from previous frames), and B-frames (intra-predicted MBs and/or MBs predicted from previous and future frames). The prediction relationships between the different frames within a GOP can therefore be represented using a dependency structure similar to the one shown in Figure 2.10, where the black frames are I-frames and the number below each frame represents its decoding order. For a more detailed explanation of predictive coding, as well as other 2D video coding concepts used by recent video coding standards, the reader is referred to [97, Chapter 3]. In state-of-the-art video coding standards, such as H.264/AVC and high efficiency video coding (HEVC) [114], the encoded video bitstream consists of data units called network abstraction layer (NAL) units, each of which is effectively a packet that contains an integer number of bytes. Some NAL units contain parameter sets that carry high-level information


Figure 2.10: Hierarchical (dyadic) prediction structure.

regarding the entire coded video sequence or a subset of the pictures within it. Other NAL units carry coded samples in the form of slices that belong to one of the various picture types defined by the video coding standard. In addition, some NAL units contain optional supplemental enhancement information (SEI) that supports the decoding process or may assist in other ways, such as providing hints about how best to display the video. A set of NAL units whose decoding results in one decoded picture is referred to as an access unit (AU). One main challenge in video compression applications is delivering multiple versions of a video at different operating points, i.e., different qualities, spatial resolutions, and frame rates. The straightforward way to achieve this using conventional video coders is to encode each version of the video sequence independently. However, this approach results in significant overhead since the generated versions contain many redundancies. Scalable video coders, such as the scalable video coding (SVC) extension of H.264/AVC [102] and scalable high efficiency video coding (SHVC) [17], exploit the correlation between different versions of the same video sequence to reduce storage requirements and transmission bandwidth. In addition to the spatial and temporal motion-compensated predictions that are available in a single-layer coder, scalable video coders utilize inter-layer prediction, where the reconstructed video signal from a reference layer is used to predict an enhancement layer. A single scalable encoder produces multiple coded bitstreams referred to as layers, as shown in Figure 2.11. The lowest (base) layer is a stream decodable by a standard single-layer decoder to generate a version of the video sequence at the lowest available quality/resolution operating point. One or more enhancement layers are coded as scalable bitstreams. To decode a sequence at a higher quality or resolution, a scalable video decoder decodes the base layer and one or more enhancement layers.

Figure 2.11: Scalable video coding: a scalable encoder produces a base layer and enhancement layers, which are decoded by receivers with low, medium, and high bandwidth, respectively.
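As a rough illustration of how a receiver might exploit such a layered bitstream, the sketch below (hypothetical layer bit rates and function name; not part of the SVC or SHVC specifications) selects the number of layers whose cumulative bit rate fits the available bandwidth.

```python
def select_layers(layer_rates_kbps, available_kbps):
    """Return the number of layers (base + enhancements) to decode.

    layer_rates_kbps[0] is the base layer rate; subsequent entries are the
    additional rates of the enhancement layers.  The client keeps adding
    layers while the cumulative rate still fits the available bandwidth."""
    total, count = 0, 0
    for rate in layer_rates_kbps:
        if total + rate > available_kbps:
            break
        total += rate
        count += 1
    return max(count, 1)  # always decode at least the base layer

# Example: a base layer at 500 kbps plus two enhancement layers.
print(select_layers([500, 700, 1300], available_kbps=1500))  # -> 2
```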

Conventional Stereo Video Coding

The direct method of coding a multi-view video is simulcast, where each view is individually coded using a conventional 2D video encoder such as H.264/AVC or HEVC. This method does not take advantage of the correlation between neighboring views. In the case of stereo video, the fidelity range extensions (FRExt) [78] of the H.264/AVC standard define a special stereo video SEI message that enables the decoder to identify the multiplexing of the stereo video and extract the two views. The two views are interlaced line-by-line into one sequence, where the top field contains the left view and the bottom field contains the right view. The interlaced sequence is then encoded using the field coding mode of H.264/AVC. At the receiver side, the interlaced bitstream is decoded and the output sequence is de-interlaced to obtain the two individual views. The main drawback of this approach is that it does not support backward compatibility: traditional 2D devices incapable of demultiplexing the two views will not be able to decode and display a 2D version of the content.

Multi-view Video Coding

In early 2010, the Joint Video Team (JVT), which is a collaboration between the Video Coding Experts Group (VCEG) of the ITU-T and MPEG of the ISO/IEC, standardized multi-view video coding (MVC) [126] as an extension to the H.264/AVC video coding standard. The multi-view extension provides inter-view prediction to improve compression efficiency, in addition to supporting the traditional temporal and spatial prediction schemes. The MVC standard introduced two profiles, which indicate a subset of coding tools that must be supported by conforming decoders: the Multi-view High Profile (supports multiple views with no interlace coding tools), and the Stereo High Profile (supports only two views with interlace coding tools). MVC has been selected as the standard for 3D video distribution by the Blu-ray Disc Association (BDA). In addition to the general video coding requirements, such as those implemented in H.264/AVC, some specific requirements for MVC include: view switching random access, view scalability, and backward compatibility. View switching random access provides the


Figure 2.12: Typical MVC hierarchical prediction structure.

ability to access, decode, and display a specified view at a random access point with a small amount of data required to decode the image. View scalability is the ability to access a subset of the bitstream to decode a subset of the encoded views. For backward compatibility, a subset of the encoded multi-view video bitstream should be decodable by an H.264/AVC decoder. A typical MVC prediction structure uses a combination of hierarchical temporal prediction and inter-view prediction, as shown in Figure 2.12. In such a structure, some views depend on other views to be accessed and decoded. To access view 2, for example, both view 1 and view 3 need to be decoded first. Moreover, view 3 will only be available after decoding view 1 because it depends on it. The sequence parameter set of an H.264/AVC bitstream was extended to include high-level syntax that signals the view identification, view dependencies, and indicators of resource requirements. To support backward compatibility, an MVC bitstream is structured to include a base view which can be decoded independently.
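The inter-view dependencies in such a prediction structure can be viewed as a small directed graph. The sketch below (the dependency table loosely mirrors the structure in Figure 2.12 but is otherwise illustrative) resolves the full set of views that must be decoded in order to access a requested view.

```python
def views_to_decode(target, deps):
    """Return all views that must be decoded to access `target`, following
    inter-view dependencies transitively (the base view depends on nothing)."""
    needed, stack = set(), [target]
    while stack:
        view = stack.pop()
        if view not in needed:
            needed.add(view)
            stack.extend(deps.get(view, []))
    return sorted(needed)

# Illustrative dependencies for an 8-view structure similar to Figure 2.12:
# even views form a P-chain anchored at the base view V0, while odd views
# are predicted from their two even neighbours.
deps = {0: [], 2: [0], 4: [2], 6: [4], 7: [6],
        1: [0, 2], 3: [2, 4], 5: [4, 6]}
print(views_to_decode(1, deps))  # -> [0, 1, 2]: V1 needs V0 and V2, and V2 needs V0
```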

Video Plus Depth Coding

In the video-plus-depth representation of 3D videos, the depth map can be considered as a monochromatic grayscale image sequence. Therefore, it is possible to use existing video codecs, such as H.264/AVC, to compress the depth map stream. However, these codecs are optimized for encoding texture information, which is viewed by the user. This is in contrast to depth maps, which represent geometry information that is never directly presented to the viewer and only aids the view rendering process. Thus, compression artifacts have a more severe effect when existing codecs are used for compressing depth maps, and cause distortions

in rendered synthesized views. There are two possible approaches to solving this problem in the literature. The first approach is to take the special characteristics of depth maps into consideration and develop novel compression techniques specifically suitable for them, e.g., [83] and [49]. The second approach is to modify and optimize existing video codecs for depth map encoding and remove undesirable compression artifacts at the decoder side using post-processing image denoising techniques. Examples of the second approach include [91] and [74]. The ISO/IEC 23002-3 MPEG-C Part 3 standard specifies a representation format for depth maps which allows encoding them as conventional 2D sequences, along with additional parameters for interpreting the decoded depth values at the receiver side [54]. The specification is based on the encoding of 3D content inside a conventional MPEG-2 transport stream, which includes the texture video, the depth video, and some auxiliary data. It specifies high-level syntax that allows a decoder to interpret two incoming video streams correctly as texture and depth data. The standard does not introduce any specific coding algorithms and supports different coding formats like MPEG-2 and H.264/AVC. Transport is defined in a separate MPEG Systems specification, ISO/IEC 13818-1:2003 Carriage of Auxiliary Data [53]. The two (texture and depth) bitstreams are interleaved frame-by-frame, resulting in one transport stream that may contain additional depth map parameters as auxiliary information. Another option for encoding video-plus-depth sequences is using the multiple auxiliary components (MAC) of MPEG-4 version 2. An auxiliary component is a greyscale shape that is used to describe the transparency of a video object. However, an auxiliary component can be defined in a more general way in order to describe shape, depth shape, or other secondary texture. It is also sometimes known as an alpha channel. Thus, the depth video can be used as one of the auxiliary components in MPEG-4 version 2 [63]. Moreover, the same compression techniques used for the texture component can also be used for the auxiliary components, and the motion vectors used for motion compensation are identical.

High Efficiency Video Coding of 3D Videos

The ISO/IEC MPEG and ITU-T Video Coding Experts Group (VCEG) standardization bodies established the Joint Collaborative Team on 3D Video Coding Extension Development (JCT-3V) in July 2012. The aim of this group is to develop the next-generation 3D video coding standards, which provide more advanced compression capabilities and facilitate the synthesis of additional non-captured views to support emerging auto-stereoscopic displays. JCT-3V has developed two extensions for HEVC [114], namely, Multiview HEVC (MV-HEVC) [117], which was integrated in the second edition of the HEVC standard [57], and 3D-HEVC [118], which was completed in February 2015 and integrated in the latest edition of the standard. In MV- and 3D-HEVC, a layer can represent texture, depth, or other auxiliary information of a scene related to a particular camera perspective. All layers

belonging to the same camera perspective are denoted as a view, whereas layers carrying the same type of information (e.g., texture or depth) are usually called components in the scope of 3D videos [116]. MV-HEVC enables efficient coding of multiple camera views and associated auxiliary pictures and follows the same design principles of multi-view video coding (see Section 2.5.2). For example, MV-HEVC allows prediction from pictures in the same AU and the same component but in different views. To enable this, decoded pictures from other views are inserted into one or both of the reference picture lists of the current picture being decoded. Therefore, motion vectors may be temporal MVs, when related to temporal reference pictures of the same view, or disparity MVs, when related to inter-view reference pictures. MV-HEVC comprises only high-level syntax (HLS) additions. Therefore, it can be implemented using existing single-layer decoding cores without changing the block-level processing modules. Compared to HEVC simulcast, MV-HEVC provides higher compression gains by exploiting the redundancy between different camera views of the same scene. Support for depth maps is enabled through auxiliary picture high-level syntax. The auxiliary picture decoding process would be the same as that for video or multi-view video, and the required decoding capabilities can be specified as part of the bitstream. The second, more advanced, extension is 3D-HEVC. This extension targets a coded representation consisting of multiple views and associated depth maps, as required for generating additional intermediate views in advanced 3D displays. 3D-HEVC aims to compress the video-plus-depth format more efficiently by introducing new compression tools that: 1) explicitly address the unique characteristics of depth maps; and 2) exploit dependencies between multiple views as well as between video texture and depth. 3D-HEVC extends MV-HEVC by allowing new types of inter-layer prediction to enable more efficient compression. The new prediction types include:

• combined temporal and inter-view prediction (A+V), where a reference picture is in the same component but in a different AU and a different view;

• inter-component prediction (C), where reference pictures are in the same AU and view but in a different component; and

• combined inter-component and inter-view prediction (C+V), where reference pictures are in the same AU but in a different view and component.

Additional bit rate reductions compared to MV-HEVC are achieved by specifying new block-level video coding tools, which explicitly exploit statistical dependencies between video texture and depth and specifically adapt to the properties of depth maps. A further design change compared with MV-HEVC is that, in addition to sample and motion information, residual, disparity, and partitioning information can also be predicted or inferred.

2.6 HTTP Adaptive Streaming

Until recently, the transmission control protocol (TCP) was considered unsuitable for video applications due to the latency and overhead of its sliding window and retransmissions. For a long time, the main video delivery protocol was the real-time transport protocol (RTP) [100], which uses the unreliable user datagram protocol (UDP) for transmission. RTP is a push-based protocol where video frames are transmitted in a paced manner from the server to the receiving client. Using UDP as an underlying transport protocol eliminates the overhead and latency of retransmissions encountered in TCP, since UDP does not implement error and flow control techniques. This made RTP attractive for real-time interactive applications and multicast. However, although using UDP for media delivery has its advantages, it became clear over the past few years that it has many shortcomings. For example, due to the unreliable nature of the protocol, an unavoidable consequence of network interruption is that the receiver has to ignore lost and late frames. This causes artifacts and distortions in the rendered video and decreases the user's quality-of-experience. Moreover, RTP-based media delivery is quite complex and is not scalable. While RTP itself is responsible for the transmission of the actual media data, it relies on additional protocols for establishing and controlling media sessions between end points and providing feedback information. For example, in addition to establishing a separate UDP connection for each media component, e.g., the audio and video components, RTP requires additional UDP connections for real-time transport control protocol (RTCP) [100] channels, one for each RTP connection. To facilitate real-time control of media playback from the server, a separate connection is dedicated to a control plane protocol known as the real-time streaming protocol (RTSP) [101], which enables clients to issue VCR-like commands to the server. Moreover, UDP is a non-responsive protocol, i.e., it does not reduce its data rate when there is congestion. Given the rapid increase in video streaming network flows, this may lead to congestion collapse, where little or no useful communication takes place. The frame-based delivery in RTP requires streaming servers to parse video files in order to extract the frames, which adds additional overhead and impacts scalability. Recently, multimedia delivery services have adopted a pull-based approach for delivering media content over the Internet using the widely popular Hyper-Text Transfer Protocol (HTTP). HTTP adaptive streaming (HAS) aims to overcome the issues of RTP streaming and is motivated by the stateless nature of the HTTP protocol, which makes it a more scalable solution, and by the fact that, unlike UDP, almost all firewalls and network address translators (NATs) are configured to allow HTTP traffic. HAS optimizes and adapts the video configurations over time in order to deliver the best possible quality to the user at any given time. This allows for an enhanced quality-of-experience enabled by intelligent adaptation to different network path conditions, device capabilities, and content characteristics.


Figure 2.13: Video stream adaptation in HTTP adaptive streaming: the client switches among representations (Rep. 1-3) as the available bandwidth varies over time.

In HAS, each video is divided into a number of non-overlapping chunks called segments (or fragments), each corresponding to a few seconds of the media. The segments are encoded at multiple discrete bitrates and/or resolutions called representations and, in general, can be decoded independently, unless a multi-layer encoder is used. Segments from different representation streams are aligned so that the video player can switch to a different representation when necessary at a segment boundary. One or more HTTP servers store the encoded segments along with a manifest file which describes the different video profiles and codecs, available representations, Internet protocol (IP) addresses of servers, and segment URLs. HAS is a pull-based protocol where the client is responsible for requesting the most appropriate resources from the server. At the beginning of a HAS streaming session, the client requests the manifest file from the streaming server. All segments are downloaded using HTTP GET requests and are placed in a playback segment buffer. A key feature in HAS is the ability to respond to bandwidth variations during a streaming session. The streaming logic resides on the client side and, based on the information provided in the manifest, the client is able to decide on the segments which are compatible with its decoding and rendering capabilities, as illustrated in Figure 2.13. More importantly, the client continuously monitors the available network resources, such as the available bandwidth and the state of the TCP connection(s), and dynamically selects the right encoding bitrate for the next segment. Because the video data is segmented and available at a range of bitrates, the client is able to provide fast start-up and seek times by requesting the lowest bitrate segments first and subsequently transitioning to higher bitrates to improve visual quality. The lack of a common standard for HTTP adaptive video streaming led to the development of several commercial and proprietary streaming solutions like Adobe HTTP

Figure 2.14: Structure of an MPD file in MPEG-DASH: periods, adaptation sets, representations, and segment access information (initialization and media segment URLs).

Dynamic Streaming [11], Apple HTTP Live Streaming [15], and Microsoft Smooth Streaming [82]. As a result, video streaming devices had to support multiple protocols to access different streaming services, and users were often limited to the video streaming clients supplied by the streaming provider. A common standard for HTTP video streaming would therefore allow standard-compliant client devices to access any standard-compliant video streaming service. Standardization of HAS has been driven by various standards bodies including the Third Generation Partnership Project (3GPP) and the Moving Picture Experts Group (MPEG) [86]. The resulting standard was published in ISO/IEC 23009-1 [56] and is commonly known as MPEG Dynamic Adaptive Streaming over HTTP (DASH). Which video codec to choose, and how the client should adapt playback based on the options offered, are out of the scope of the DASH standard and are therefore left to content providers and client implementations to decide. In DASH, the manifest file is known as a media presentation description (MPD) file, and it describes the properties and URLs of the content and its segments. The MPD file is structured as illustrated in Figure 2.14. The media description is organized into a hierarchy. At the top level of this hierarchy, the media is segmented into periods. A period represents a time period during which the set of adaptation options does not change. For instance, a period could contain the main movie with several adaptation options, while a second period comprising out-takes is only available with a reduced set of options. An adaptation set is a logical group of adaptation options. For example, a full-length movie would typically have three adaptation sets defined: one for the video, one for the audio, and one for the subtitles (where multiple languages might be available). An adaptation set in turn contains different representations of the specified option. For instance, a video adaptation set would contain representations that correspond to different bitrates/qualities and/or resolutions. At the bottom of the hierarchy is a list of media segments, which contains the locations (i.e., URLs) of the described media content segments in chronological order.
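The adaptation logic itself is left to client implementations. The following minimal sketch (illustrative representation bit rates, safety margin, and smoothing factor, none of which come from the DASH specification) shows one common throughput-based approach: estimate throughput from past segment downloads and pick the highest representation that safely fits.

```python
def choose_representation(bitrates_kbps, throughput_estimate_kbps, safety=0.8):
    """Pick the highest representation whose bit rate fits within a safety
    margin of the estimated throughput; fall back to the lowest otherwise."""
    affordable = [b for b in sorted(bitrates_kbps)
                  if b <= safety * throughput_estimate_kbps]
    return affordable[-1] if affordable else min(bitrates_kbps)

class ThroughputEstimator:
    """Exponentially weighted moving average over per-segment throughput."""
    def __init__(self, alpha=0.3):
        self.alpha, self.estimate = alpha, None
    def update(self, segment_bits, download_seconds):
        sample = segment_bits / download_seconds / 1000.0  # kbps
        self.estimate = sample if self.estimate is None else \
            self.alpha * sample + (1 - self.alpha) * self.estimate
        return self.estimate

est = ThroughputEstimator()
est.update(segment_bits=4_000_000, download_seconds=2.0)  # ~2000 kbps measured
print(choose_representation([400, 800, 1500, 4000], est.estimate))  # -> 1500
```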

2.7 Wireless Cellular Networks

Wireless cellular technologies are continuously evolving to meet the increasing demands for high data rate mobile services. Fourth generation (4G) mobile broadband access networks, represented by IEEE’s WiMAX [14] and 3GPP’s long-term evolution (LTE) [28] (and its extension LTE-Advanced [12]), were designed to improve on prior UMTS-based mobile networks by enhancing the system capacity and transmission coverage. They also allow both data and voice services to be provided in an integrated fashion using the Internet protocol (IP). However, with the explosion in the number of wireless mobile devices and services, there are still challenges that cannot be accommodated by the currently deployed 4G networks. In this section, we provide a brief description of WiMAX and LTE networks and the challenges associated with delivering 3D video services using these networks. We note that the algorithms presented in Chapter 3 and Chapter 5 of this thesis are applicable to both of these networks.

2.7.1 IEEE 802.16 WiMAX Networks

WiMAX is a broadband wireless access technology based on the IEEE 802.16 standard [52]. WiMAX enables the delivery of last-mile wireless broadband services as an alternative to traditional wire-based access technologies such as digital subscriber line (DSL). WiMAX uses orthogonal frequency-division multiplexing (OFDM) to improve the transmission range and increase bandwidth utilization by preventing inter-channel interference among adjacent wireless channels. The frequency band considered by the IEEE 802.16 standard is 2-66 GHz. This band is divided into two frequency ranges: 2-11 GHz for non-line-of-sight transmissions and 10-66 GHz for line-of-sight transmissions. WiMAX is capable of offering a peak downlink data rate of up to 63 Mbps and a peak uplink data rate of up to 28 Mbps in a 10 MHz channel bandwidth using multiple-input and multiple-output (MIMO) antenna techniques and flexible sub-channelization schemes [10]. At the physical layer, WiMAX transmits data over multiple carriers using time-division duplex (TDD) frames. The duration of a WiMAX frame ranges from 2 to 20 ms. Each frame consists of downlink and uplink subframes, as shown in Figure 2.15. The downlink subframe is followed by a transmit/receive transmission gap (TTG) and the uplink subframe is separated from the following downlink subframe using a receive/transmit transmission gap (RTG). A frame contains header information and uplink/downlink maps followed by bursts of user data. The downlink and uplink maps specify the resource allocation within the frame.

2.7.2 Long Term Evolution Networks

The system architecture of an LTE network is shown in Figure 2.16. The network is composed of two main components: a) a radio access network, and b) a core network. The access network, referred to as the evolved UMTS terrestrial radio access network (E-UTRAN) in

LTE terminology, is responsible for managing access to the available radio resources. It consists of a number of base stations known as eNodeBs (or eNBs for short). An E-UTRAN also provides two sets of protocols, known as the user-plane and the control-plane, which support user data transmission and control and manage the connection between the network and mobile terminals, respectively. The core network is known as the evolved packet core (EPC). It is responsible for managing mobility, policy, and security. An EPC consists of three entities: a mobility management entity (MME), a serving gateway (S-GW), and a packet data network gateway (P-GW).

Figure 2.15: WiMAX frame: downlink and uplink subframes (preamble, FCH, DL/UL maps, data bursts, and an MBS region) separated by the TTG and RTG gaps.

LTE networks use a transmission technique known as orthogonal frequency-division multiple access (OFDMA) in downlink channels. In OFDMA, the frequency is divided into a number of sub-carriers, each capable of carrying one modulation symbol. Sub-carriers are grouped together to form a sub-channel that serves as the basic unit of data transmission. LTE systems support both frequency-division duplex (FDD) and time-division duplex (TDD) transmissions using a common radio frame structure. In the case of FDD, uplink and downlink transmissions are separated in the frequency domain. For TDD, a sub-frame is either allocated to downlink or uplink transmission [41]. At the physical level, the downlink channel is divided into 10 ms frames, each further divided into 1 ms sub-frames [13]. The sub-frames are transmitted using OFDM, which divides the available radio resources into a grid in both the time and frequency domains, as shown in Figure 2.17. Each element in this grid, referred to as a resource element, corresponds to one complex-valued modulation symbol and spans 15 kHz and 66.7 µs. The smallest unit that can be allocated by the eNodeB is called a resource block (RB). Each RB comprises multiple resource elements spanning 0.5 ms (i.e., half a sub-frame) in the time domain and 12 sub-carriers (180 kHz) in the frequency domain.

Preamble R) ahR opie utpersuc lmnsspannin elements resource multiple comprises RB Each (RB). FCH . Downlink Subframe Downlink s h mletui htcnb loae yteeoe is eNodeB the by allocated be can that unit smallest The µs. 7 UL Map DL Map iue21:WMXFrame. WiMAX 2.15: Figure DL #4 DL #3 DL #1 rqec-iiinduplex frequency-division DL #2 orsod ooecmlxvle ouainsymbol modulation complex-valued one to corresponds , Time (OFDMSymbols) MBS Map Data Region 36 MBS T RTG TTG rhgnlfeunydvso mul- frequency-division orthogonal FD and (FDD) vle aktcore packet evolved al ai eore.I consists It resources. radio lable Uplink Subframe ecnrlpae hc support which control-plane, he insmo.Sbcrir are Sub-carriers symbol. tion s(8 H)i h frequency the in kHz) (180 rs ME,asriggateway serving a (MME), y ententokadmobile and network the ween 7 aheeeti hsgrid, this in element Each 17. iy oiy n euiy An security. and policy, lity, vddit ssub-frames ms 1 into ivided lberdorsucsit a into resources radio ilable ruln rnmsin[ transmission uplink or UL #2 UL #1 prtdi h frequency the in eparated cui fdt transmis- data of unit ic h hscllvl the level, physical the t ) nEURNalso E-UTRAN An t). iedvso duplex time-division EC.I is It (EPC). ddit a into ided 0 g . ms 5 13 ]. Core Network Radio Access Network

Home Subscriber Mobility Management Server (HSS) Entity (MME)

Internet

Packet Data Network Serving Gateway Gateway (P-GW) (S-GW)

Figure 2.16: LTE network system architecture.

Figure 2.17: Downlink frame in LTE: a 10 ms radio frame divided into 1 ms subframes and 0.5 ms slots; each resource block spans 12 subcarriers (15 kHz each) in frequency and a number of OFDM symbols in time.

User equipment (UE) devices are dynamically allocated non-overlapping sets of resource blocks depending on their channel conditions. To accommodate the time-varying radio channel conditions of UEs, LTE uses a link adaptation method known as adaptive modulation and coding (AMC), which adapts the modulation scheme and code rate based on the channel's signal-to-noise ratio (SNR). The quality of a channel is periodically measured at the UE and sent to the eNodeB in the form of so-called channel quality indicators (CQIs). The modulation and coding scheme (MCS) used for the resource blocks assigned to a UE is then chosen based on the reported CQI value such that the block error rate (BLER) is less than a certain threshold.
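To illustrate this link-adaptation loop, the sketch below maps a reported CQI to a modulation order and code rate from a simplified, hypothetical table and estimates the payload bits one resource block can carry; the values are illustrative only and do not reproduce the standardized 3GPP CQI/MCS tables.

```python
# Simplified, illustrative CQI -> (bits per modulation symbol, code rate) table.
# Real LTE systems use the standardized 3GPP tables; the values here are examples.
CQI_TABLE = {
    3:  (2, 0.30),   # QPSK with a low code rate
    7:  (4, 0.50),   # 16-QAM
    11: (6, 0.75),   # 64-QAM
}

RE_PER_RB = 12 * 7  # resource elements per RB: 12 subcarriers x 7 OFDM symbols

def bits_per_resource_block(cqi: int) -> float:
    """Estimate the payload bits carried by one resource block for a reported
    CQI, picking the highest table entry not exceeding the CQI."""
    eligible = [k for k in CQI_TABLE if k <= cqi]
    key = max(eligible) if eligible else min(CQI_TABLE)
    bits_per_symbol, code_rate = CQI_TABLE[key]
    return RE_PER_RB * bits_per_symbol * code_rate

print(bits_per_resource_block(9))  # uses the 16-QAM entry: 84 * 4 * 0.5 = 168 bits
```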

2.7.3 Multimedia Multicast Services

According to a recent study by Cisco, video traffic will account for nearly 75% of total mobile data traffic by the year 2020 [25]. In order to cope with this increasing demand, cellular service providers may need to rely on video delivery services based on point-to-multipoint transmissions, i.e., broadcast and/or multicast. Using multicast, a streaming server can substantially reduce the wireless network load by serving mobile devices interested in the same video stream with a single multicast session. To facilitate the process of initiating multicast and broadcast sessions, the WiMAX standard defines a specific service at the data link layer known as the multicast and broadcast service (MBS) [24]. LTE networks also define a similar service for the delivery of multicast video sessions known as evolved multimedia broadcast multicast services (eMBMS) [8]. In WiMAX, MBS provides an efficient concurrent transmission method in the downlink for data common to a group of mobile stations through a shared radio resource using a common connection identifier. Using MBS, a certain area in each TDD frame can be reserved for multicast and broadcast data only, as shown in Figure 2.15. It is also possible to designate the entire frame as a downlink-only broadcast frame, and multiple MBS regions can be constructed within the downlink subframe. One or more base stations constitute an MBS zone. Each MBS zone is identified by a unique MBS zone identifier that is not reused across any two adjacent MBS zones. Coordination between base stations within an MBS zone enables a mobile device to receive MBS transmissions from any base station without having to re-register to that base station. It is also possible to synchronize MBS transmissions across all base stations within an MBS zone to improve reception reliability using macro-diversity, where the mobile subscriber receives the signal from multiple base stations [10]. In LTE, multicast services can be provided in two modes of transmission. The first mode is the independent mode, where multicast transmissions within a cell are not coordinated with its neighboring cells. The independent mode has the advantage of being adaptable to changes in the distribution of users within the cell. For example, if no active users are present within a cell, multicast transmission can be disabled within that particular cell.

The second multicast mode is known as single frequency network (SFN). In SFN, a set of base stations coordinate their multicast transmissions by using the same frequency for common multicast sessions. This significantly improves the utilization of radio resources, since the coordinated cells use identical radio signals in their transmissions, which enables receivers at cell edges to obtain multiple copies of the same data from multiple eNodeBs. Since the same frequency is used by the different eNodeBs during transmission, the strength of the received signal at the cell edge is enhanced while the interference power is largely reduced. A user can therefore move from one cell to another without significant degradation in reception quality.

Chapter 3

Energy-Efficient Multicasting of Multiview 3D Videos over Wireless Networks

3.1 Introduction

Recent advances in wireless communications have significantly changed people's lifestyles. The introduction of innovative and powerful mobile devices has motivated new applications and services. The popularity of these devices, in addition to streaming services such as Netflix, has led to behavioral changes in viewing habits and a steady increase in the demand for mobile video. Recent studies indicate that the time spent watching video on smartphones and tablets has increased over the past few years [35]. Major telecommunication operators are starting to use multicast services for popular content to efficiently utilize their scarce radio resources. As an example, Verizon Wireless broadcast the 2014 Super Bowl to more than 30,000 of its customers, which consumed about 1.9 TB [124]. In 2015, UK telecom provider EE, in collaboration with BBC R&D, broadcast the FA Cup Final at Wembley Stadium using LTE eMBMS [34]. As mobile devices such as cell phones, tablets, personal gaming consoles, mobile video players, and personal digital assistants become more powerful, their ability to handle 3D content is becoming a reality. Recent market studies report that the mobile 3D market, including smartphones, notebooks, mobile Internet devices, and portable game players, is estimated to grow to 547.69 million units by 2018 [84]. However, there are still many challenges that need to be addressed before commercial 3D mobile video streaming services can take off. A mobile 3D video solution would require high quality views, low power and bandwidth consumption, and low complexity. Despite the significant advances in data transfer rates of wireless networks, there is still an upper limit on the capacity of the

wireless transmission channel. The increased volume of video information inherent in 3D videos requires efficient ways to deliver the content over these channels. In this chapter, we consider the problem of multicasting 3D videos over 4/5G broadband access networks, such as LTE/LTE-Advanced and WiMAX. In particular, we address two main challenges:

• Maximizing the video quality of rendered views in auto-stereoscopic displays [32, 122] of mobile receivers such as smartphones and tablets.

• Minimizing the energy consumption of the mobile receivers during multicast sessions.

We note that our focus is on optimizing the quality of 3D TV applications which use multicast for content delivery. These applications are different from free viewpoint video (FVV) applications, in which user interactivity is important to determine the appropriate views of the 3D videos to be transmitted to different users. User interactivity is mostly achieved in unicast systems where a feedback channel exists. In multicast sessions, such feedback channels are not practical for large-scale user communities. We consider multicasting multi-view video streams in which the texture and depth map components of the views are simulcast coded using the scalable video coding extension of H.264/AVC (see Section 2.5.2). Two views of each multi-view-plus-depth video are chosen for multicast and all chosen views are multiplexed over the wireless transmission channel. Joint texture-depth rate-distortion optimized substream extraction is performed in order to minimize the distortion in the views rendered at the receiver. We mathematically formulate the problem of selecting the set of substreams from each of the two chosen views for all video sequences being transmitted. We show that this problem is NP-hard, and thus cannot be optimally solved in real time for an arbitrary number of input video streams. We propose a substream selection scheme that enables receivers to render the best possible quality for all views given the bandwidth constraints of the transmission channel and the variable nature of the video bit rate. For most current multimedia services, the subscribers are mainly mobile users with energy-constrained devices. Therefore, an efficient multicast solution should minimize the power consumption of the receivers to provide a longer viewing time. We extend our algorithm to perform energy-efficient radio frame scheduling of the selected substreams. The allocation algorithm attempts to find a burst transmission schedule that minimizes the energy consumption of the receivers. Transmitting the video data in bursts enables the mobile receivers to turn off their wireless interfaces for longer periods of time, thereby saving battery power. The extended algorithm first determines the best substreams to transmit for each of the multicast sessions based on the current network capacity. It then allocates the video data to radio frames and constructs a burst schedule that does not result in buffer overflow or underflow instances at the receivers.
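The substream selection problem is formulated precisely in the following sections. Purely to illustrate its structure, and not the approximation algorithm proposed in this chapter, the sketch below greedily adds substreams in order of quality gain per bit, respecting the layered decoding order within each video, until the channel capacity is exhausted; the substream sizes and quality gains are hypothetical.

```python
def greedy_substream_selection(videos, capacity_kbps):
    """Illustrative greedy selection: substreams of each video must be taken
    in order (layered coding), so we repeatedly pick the next substream with
    the best quality-gain-per-bit ratio that still fits the remaining capacity.

    `videos` maps a video id to an ordered list of (rate_kbps, quality_gain)."""
    next_idx = {v: 0 for v in videos}
    selected, remaining = {v: [] for v in videos}, capacity_kbps
    while True:
        best = None
        for v, subs in videos.items():
            i = next_idx[v]
            if i < len(subs) and subs[i][0] <= remaining:
                ratio = subs[i][1] / subs[i][0]
                if best is None or ratio > best[0]:
                    best = (ratio, v)
        if best is None:
            return selected
        _, v = best
        rate, _ = videos[v][next_idx[v]]
        selected[v].append(next_idx[v])
        remaining -= rate
        next_idx[v] += 1

# Hypothetical example: two videos, each with a base and one enhancement substream.
videos = {"A": [(800, 30.0), (600, 5.0)], "B": [(1000, 28.0), (900, 4.0)]}
print(greedy_substream_selection(videos, capacity_kbps=2500))  # {'A': [0, 1], 'B': [0]}
```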

The rest of this chapter is organized as follows. Section 3.2 summarizes the related work in the literature. We provide a system overview in Section 3.3 and formally state the optimal 3D video multicasting problem. The problem formulation is presented in Section 3.4, and the proposed scalable 3D video multicast algorithm is described in Section 3.5. We then extend the proposed algorithm to perform energy-efficient radio frame allocation in Section 3.6. The chosen virtual view quality model is validated in Section 3.7, and we evaluate the proposed substream selection and burst scheduling algorithms in Section 3.8. Finally, Section 3.9 presents a summary of the chapter.

3.2 Related Work

The work presented in this chapter is related to research in the areas of 3D video transmission over 4/5G broadband wireless access networks, quality modeling of synthesized views, and optimal bit allocation between texture videos and depth maps for depth-image-based rendering.

3.2.1 3D Video Transmission Over Wireless Networks

A number of studies have been conducted on 3D video transmission over wireless networks. In [31], De Silva et al. study the performance of 3D-TV transmission over an error-prone WiMAX broadband wireless network. The authors compared the stereoscopic and video-plus-depth 3D video representations. The target video transmission rate was set to 6 Mbps for all video sequences, with a distribution of 4.8 Mbps for texture and 1.2 Mbps for depth in the case of video-plus-depth, and 3 Mbps for each of the left and right views in the case of stereoscopic videos. Different adaptive modulation and coding settings were used to deliver the 6 Mbps test video sequences. It was observed that 3D video streams encoded using the stereoscopic representation yielded better PSNR results than the video-plus-depth representation. However, these results are due to the use of only a single view plus depth, which leaves larger disocclusion holes when a second view is synthesized; these holes must then be filled using heuristic techniques.

In [18], Ceglie et al. investigate how 3D video formats and their average encoding rate impact the quality experienced by end users when the videos are delivered over a heterogeneous LTE-Advanced network composed of macro and pico cells. The video representation formats used in the evaluation are side-by-side stereoscopic and video-plus-depth. Objective metrics like packet loss ratio, peak signal-to-noise ratio, structural similarity index, delay, and goodput are used for measuring QoS and QoE. The study shows that the video-plus-depth representation format suffers less QoE degradation than other formats for all user numbers and coding rates and is therefore the best choice for 3D video delivery systems. Moreover, results indicate that LTE-Advanced is very effective for the transmission of 3D videos, even in the presence of other kinds of flows. However, similar to [31], the authors only evaluated a single view plus depth representation, which is not very effective in producing high quality 3D videos where a large number of views needs to be generated.

Such scenarios would require transmitting two or more views along with the depth information, which significantly increases the bandwidth requirements. Moreover, the evaluated scenarios are based on unicast transmissions, which do not effectively utilize the wireless network's radio resources. In [73], Lin et al. study multi-view 3D video multicast over IEEE 802.11 WiFi networks. They analyze the view failure probability in the case of view synthesis with DIBR and compare it with the traditional view loss probability, i.e., the probability that a view fails to be sent to a user without DIBR. A multi-view group management protocol is proposed for multi-view 3D multicast. The protocol utilizes the analytical results to transmit the most suitable right and left views, so that the view failure probability is guaranteed to stay below a threshold when users join or leave a multicast session. Simulation results show that the proposed protocol reduces bandwidth consumption and increases the probability for each client to successfully play back the desired view in a multi-view 3D video.

3.2.2 Modeling Synthesized Views Quality

There has been some recent work on modeling the quality of synthesized views based on the qualities of the two reference views from which the target view is synthesized. In [76], Liu et al. attempt to solve the no-reference evaluation problem by proposing a distortion model to characterize the view synthesis quality without requiring the original reference image. In this model, the distortion of a synthesized virtual view is composed of three additive distortions: video coding-induced distortion, depth quantization-induced distortion, and inherent geometry distortion. The practicality of the presented model is, however, restricted due to its high complexity. Yuan et al. [135] propose an alternative and concise low-complexity distortion model for the synthesized view. Kim et al. [67] also attempt to overcome the no-reference evaluation problem when coding depth maps by approximating the rendered view distortion from the reference texture video that belongs to the same viewpoint as the depth map. However, the model does not jointly consider both texture and depth map distortions. For our work, we validate the model presented in [135] and use it to solve the multiple 3D video multicasting problem.

3.2.3 Optimal Texture-Depth Bit Allocation

The works most related to ours are [93] and [21]. In [93], Petrovic et al. perform virtual view adaptation for selective streaming of 3D multi-view video. However, the proposed adaptation scheme requires empirically constructing the rate-distortion function for the 3D multi-view video. Moreover, exhaustively searching the space of possible quantizers is computationally expensive. In [21], Cheung et al. address the problem of selecting the best views to transmit and determining the optimal bit allocation among texture and depth maps

of the selected views, such that the visual distortion of synthesized views at the receiver is minimized. Contrary to our work, the bit allocation optimization problem presented in [21] is applicable in scenarios where the selected views are encoded on-the-fly and the coding parameters can be adjusted based on the available bandwidth. Coding 3D videos in real-time is, however, challenging. Our work assumes that the views are pre-encoded using scalable video coders and bit rate adaptation is performed via substream extraction, which is expected to be the common case in practice due to the flexibility it provides.

3.3 System Overview

A wireless mobile video streaming system has four main components: content servers, access gateways, cellular base stations, and mobile receivers. Receivers periodically send feedback about current channel conditions, e.g., signal-to-noise ratio (SNR) or link-layer buffer state, to the base station. Based on this feedback, the base station adjusts the modulation and coding scheme to maintain an acceptable error rate, which consequently results in a change in channel capacity. Knowing the current capacity of the channel, a base station can adapt the bit rate of the transmitted video accordingly.

The main challenge in 3D video transmission over wireless networks is the capacity of the wireless channel, which is limited by the available bandwidth of the radio spectrum and various types of noise and interference. 3D video places greater demands on network bandwidth than 2D video because it requires the transmission of at least two video streams. These two streams can either be a stereo pair (one for the left eye and one for the right eye), or a texture stream and an associated depth stream from which the receiver renders a stereo pair by synthesizing a second view.

Receivers in a wireless network such as WiMAX and LTE are heterogeneous. A smartphone may be equipped with an auto-stereoscopic display capable of rendering only two views. Examples of such mobile displays include MasterImage's TN-LCD stereoscopic display [79] and the 3D HDDP LCD produced by NEC [121]. However, devices with a larger display size, such as tablets, are capable of rendering more views. Rendering a large number of views enables a more immersive experience. For example, a user can move his/her head in front of the display and experience (to a certain degree) a look-around effect by being able to see newly revealed background behind foreground objects. For such displays, a small number of views are used to synthesize the remaining ones. Transmitting two views and their depth maps enables the display to render higher quality views at each possible viewing angle [42].

Although it is possible to use three or more reference views to cover most of the disocclusion holes in the synthesized view, the major concern is bandwidth consumption. Even with the texture and depth information of only two reference views, the aggregate rate of the four streams may exceed the channel capacity due to the variable bit rate nature of the video streams and the variation in the wireless channel conditions. Thus, the allocation of system resources should be performed dynamically and efficiently to reflect the time-varying characteristics of the channel [112], which is the goal of this chapter.

We consider a wireless multicast/broadcast service in 4G wireless networks streaming multiple 3D videos in the MVD2 representation. The architecture of this streaming system was shown in Figure 1.1 (Chapter 1). The videos are transmitted using multicast services such as the evolved multicast broadcast multimedia services (eMBMS) in LTE networks and the multicast broadcast service (MBS) in WiMAX. As mentioned in Section 2.5.1, MVD2 is a multi-view-plus-depth (MVD) representation in which there are only two views. Therefore, two video streams are transmitted along with their depth map streams. Each texture/depth stream is encoded using a scalable encoder into multiple quality layers.

3.4 Problem Statement and Formulation

In the scheduler of a 4G wireless network base station, time is divided into a number of scheduling windows of equal duration δ, i.e., each window contains the same number of time division duplex (TDD) frames. The base station allocates a fixed-size data area in the downlink subframe of each TDD frame. In the case of multicast applications, the parameters of the physical layer, e.g., signal modulation and transmission power, are fixed for all receivers. These parameters are chosen to ensure an average level of bit error rate for all receivers in the coverage area of the base station. Thus, each frame transmits a fixed amount of data within its multicast area. In the following, we assume that the entire frame is used for multicast data and we refer to the multicast area within a frame as a multicast block.

The first problem that we attempt to address is the optimal substream selection problem, which can be stated as follows:

Problem 1 (Optimal Multicast of 3D Video Streams). Consider a set S of 3D video streams in two-view-plus-depth (MVD2) format and receivers with auto-stereoscopic displays, where each receiver generates a set I of synthesized virtual views. Each texture and depth component of every video stream is encoded into L layers using a scalable video coder. Given the capacity of a scheduling window, select the optimal subset of layers to be transmitted over the network for each video stream s ∈ S such that: 1) the total amount of transmitted data does not exceed the available network capacity; and 2) the average quality of synthesized views over all transmitted 3D video streams is maximized.

An algorithm for solving this problem may either be integrated into the base station's scheduler or implemented in a separate media-aware network element (MANE) that is attached to the base station. Such an algorithm is periodically run at the beginning of each scheduling window to determine, for each multicast session, the set of substreams for the texture and depth components of the 3D video being multicast.

We now mathematically formulate the substream selection problem. The symbols used in the formulation are listed in Table 3.1.

Table 3.1: List of symbols used in this chapter.

Symbol       Description
S            Set of 3D video streams
S            Number of 3D video streams
I            Set of synthesized intermediate views
I            Number of synthesized intermediate views
L            Number of layers per view
q^t_{s,l}    Average PSNR of left and right texture substream l of stream s
q^d_{s,l}    Average PSNR of left and right depth substream l of stream s
r^t_{s,l}    Sum of left and right texture substream l data rates of stream s
r^d_{s,l}    Sum of left and right depth substream l data rates of stream s
b^t_{s,l}    Number of blocks required for texture substream l of stream s
b^d_{s,l}    Number of blocks required for depth substream l of stream s
P            Number of TDD frames within the scheduling window
F            Capacity of the TDD frame
τ            Duration of the TDD frame
δ            Duration of the scheduling window
α^i_s        Quality model parameter for intermediate view i of video s
β^i_s        Quality model parameter for intermediate view i of video s
Υ^k_s        Stream s consumption buffer level at the start frame of interval k
x^k_s        Start frame of interval k for stream s
z^k_s        End frame of interval k for stream s
y^k_s        Number of frames to be allocated for stream s in interval k

Let S = {1, ..., S} be a set of multi-view-plus-depth video streams where two reference views are picked for transmission from each video. All videos are to be multiplexed over a single channel. If each view is encoded into multiple layers, then at each scheduling window the base station needs to determine which substreams to extract for every view pair of each of the S streams. Let R be the current maximum bit rate of the transmission channel. For each 3D video, we have four encoded video streams representing the two reference streams and their associated depth map streams. Each stream has at most L layers. Thus, for each stream, we have L substreams to choose from, where substream l includes layer l and all layers below it. Let the data rates and quality values for selecting substream l of stream s be rs,l and qs,l, respectively, where l ∈ {1, 2, ..., L}.

For example, q3,2 denotes the quality value for the first enhancement layer substream of the third video stream. These values may be provided as separate metadata. Alternatively, if the scalable video is encoded using H.264/SVC [102] and the base station is media-aware, this information can be obtained directly from the encoded video stream itself using the Supplementary Enhancement Information (SEI) messages.

In the general case, the texture and depth streams will not have the same number of layers. This provides flexibility in choosing the substreams that satisfy the bandwidth constraints. This flexibility, however, complicates the quality model in the objective function since we would have to deal with the quality of the left view and the right view independently. Thus, we only consider an equal number of layers for the left and right texture streams, as well as for the left and right depth streams. Moreover, corresponding layers in the left and right streams are encoded using the same quantization parameter (QP). This enables us to treat corresponding layers in the left and right texture streams as a single item with a weight (cost) equal to the sum of the two rates and a representative quality equal to the average of the two qualities. The same also applies to the left and right depth streams.

Let I be the set of possible intermediate views which can be synthesized at the receiver for a given 3D video that is to be transmitted. The goal is to maximize the average quality over all i ∈ I and all s ∈ S. Thus, we have the problem of choosing the substreams such that the average quality of the intermediate views synthesized between the two reference views is maximized, subject to the constraint that the total bit rate of the chosen substreams does not exceed the current channel capacity. Let xs,l be binary variables that take the value of 1 if substream l of stream s is selected for transmission, and 0 otherwise. We denote with superscripts t and d the texture and depth streams, respectively. If the capacity of the scheduling window is C and the size of each TDD frame is F, then the total number of frames within a window is P = C/F. The data to be transmitted for each substream can thus be divided into bs,l = ⌈δ rs,l/F⌉ multicast blocks.

We use a recent linear virtual view distortion model presented in [135] to represent the quality of a synthesized view in terms of the qualities of the reference views. In Section 3.7, we experimentally validate this distortion model using two different video quality metrics.


Figure 3.1: Calculating profits and costs for texture component substreams of the reference views.

Based on this model, the quality of a virtual view can be approximated by a linear surface in the form given in Eq. (3.1), where Qv is the average quality of the synthesized views, Qt is the average quality of the left and right texture references, Qd is the average quality of the left and right reference depth maps, and α, β, and c are model parameters. The model parameters can be obtained by either solving three equations with three combinations of

Qv, Qt, and Qd, or more accurately using regression by performing linear surface fitting.

Qv = αQt + βQd + c. (3.1)
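As an illustration of the surface-fitting step, the following minimal sketch estimates α, β, and c by least squares from a set of (Qt, Qd, Qv) measurements. The numeric samples are hypothetical placeholders; in our system such measurements come from the encoded reference substreams and the views synthesized from them.

```python
import numpy as np

# Hypothetical (Q_t, Q_d, Q_v) samples for one virtual view position: average texture
# quality, average depth quality, and resulting synthesized view quality (all in dB).
Q_t = np.array([35.3, 36.5, 39.5, 40.7, 42.2, 35.3, 39.5, 42.2])
Q_d = np.array([40.3, 40.9, 43.3, 44.4, 45.6, 45.6, 40.3, 43.3])
Q_v = np.array([35.6, 36.4, 38.9, 39.8, 41.0, 36.1, 38.5, 40.6])

# Linear surface fit of Eq. (3.1): Q_v = alpha*Q_t + beta*Q_d + c.
A = np.column_stack([Q_t, Q_d, np.ones_like(Q_t)])
(alpha, beta, c), *_ = np.linalg.lstsq(A, Q_v, rcond=None)
print(alpha, beta, c)
```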

Consequently, we have the optimization problem (P1). In this formulation, constraint (P1a) ensures that the chosen substreams do not exceed the transmission channel’s band- width. Constraints (P1b) and (P1c) enforce that only one substream is selected from the texture references and one substream from the depth references, respectively.

$$
\begin{aligned}
\text{Maximize} \quad & \frac{1}{S}\frac{1}{I} \sum_{s \in S} \sum_{i \in I} \left( \alpha_s^i \sum_{l=1}^{L} x_{s,l}^{t} q_{s,l}^{t} + \beta_s^i \sum_{l=1}^{L} x_{s,l}^{d} q_{s,l}^{d} \right) && \text{(P1)} \\
\text{such that} \quad & \sum_{s=1}^{S} \left( \sum_{l=1}^{L} x_{s,l}^{t} b_{s,l}^{t} + \sum_{l=1}^{L} x_{s,l}^{d} b_{s,l}^{d} \right) \le P && \text{(P1a)} \\
& \sum_{l=1}^{L} x_{s,l}^{t} = 1, \quad s = 1,\ldots,S, && \text{(P1b)} \\
& \sum_{l=1}^{L} x_{s,l}^{d} = 1, \quad s = 1,\ldots,S, && \text{(P1c)} \\
& x_{s,l}^{t},\; x_{s,l}^{d} \in \{0, 1\} && \text{(P1d)}
\end{aligned}
$$
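To make the cost and profit terms in (P1) concrete, the sketch below builds the per-layer items for one texture component: corresponding left and right layers are merged into a single item whose cost is the block requirement of the summed rate and whose profit is the average of the two qualities. All rate and quality values are hypothetical placeholders.

```python
import math

# Hypothetical per-layer metadata for the texture component of one video: cumulative
# substream rates (kbps) and Y-PSNR qualities (dB) of the left and right references.
left  = [(157, 35.35), (284, 36.47), (653, 39.50)]
right = [(162, 35.42), (290, 36.51), (660, 39.58)]

delta = 1.0      # scheduling window duration in seconds
F     = 100.0    # capacity of one multicast block (TDD frame) in kb

items = []
for l, ((rl, ql), (rr, qr)) in enumerate(zip(left, right), start=1):
    rate = rl + rr                              # r_{s,l}: sum of left and right rates
    items.append({
        "layer": l,
        "blocks": math.ceil(delta * rate / F),  # b_{s,l} = ceil(delta * r_{s,l} / F)
        "quality": (ql + qr) / 2.0,             # q_{s,l}: average of the two qualities
    })

print(items)
```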

The following theorem shows that the optimal texture-depth substream selection problem given in (P1) is an NP-complete problem.

Theorem 1. Determining which layers to transmit from the texture and depth components of multiple 3D video sequences in MVD2 format over a wireless channel with a limited capacity such that the average perceived quality of all synthesized views is maximized is an NP-complete problem.

Proof. We reduce a well-known NP-complete problem, the multiple choice knapsack problem (MCKP) [65, pp. 317], to our problem in polynomial time. We then show that a solution for our problem can be verified in polynomial time. In an MCKP instance, there are M mutually exclusive classes N1,..., NM of items to be packed into a knapsack of capacity

W. Each item j ∈ Ni has a profit pi,j and a weight wi,j. The problem is to choose exactly one item from each class such that the profit is maximized without having the total sum exceed the capacity of the knapsack. The substream selection problem can be mapped to the MCKP in polynomial time as follows. The texture/depth streams of the reference views of each 3D video represent a multiple choice class in the MCKP. Substreams of these texture/depth reference streams represent items in the class. The average quality of the texture/depth reference view substreams represents the profit of choosing an item, and the sum of their data rates represents the weight of the item. Figure 3.1 demonstrates this mapping for the texture component of video s in a set of 3D videos, where both the texture and depth streams are encoded into 4 layers. For example, item-2 in the figure represents the second layer in both the left and right reference texture streams, with a cost equal to the sum of their data rates and a profit equal to their average quality. The 3D video is represented by two classes in the MCKP, one for the texture streams and one for the depth map streams. Finally, by making the scheduling window capacity the knapsack capacity, we have an MCKP instance. Thus, the problem is NP-hard, i.e., an optimal solution to our problem would yield an optimal solution to the MCKP. Moreover, given a set of selected substreams from the components of each 3D video stream, this solution can be verified in O(SL) steps. Hence, the substream selection problem is NP-complete.

3.5 Proposed Solution

The 3D video multicasting problem can be solved optimally using enumerative algorithms such as branch-and-bound or dynamic programming. These algorithms are implemented in most of the available optimization tools. However, such algorithms have, in the worst case, running times that grow exponentially with the input size. Thus, this approach is not suitable for large problem instances. Furthermore, optimization tools may be too complex to run on a wireless base station.

We propose an approximation algorithm which runs in polynomial time and finds near-optimal solutions. Given an approximation factor ǫ, an approximation algorithm will find a solution with a value that is guaranteed to be no less than (1 − ǫ) of the optimal solution value, where ǫ is a small positive constant. The main steps of our proposed scalable 3D video multicast (S3VM) algorithm are given in Algorithm 1.

Algorithm 1: Scalable 3D Video Multicast (S3VM) Algorithm
Input: Scheduling window capacity P; TDD frame capacity F; set of scalably simulcast coded MVD2 3D videos S; model parameters α^i_s, β^i_s for each virtual view position of each video; approximation factor ǫ
Output: Set of substreams to transmit during the current scheduling window for the texture/depth components of each 3D video
1: LP-relaxation: relax the integrality constraint (P1d) in the problem formulation to obtain an LP-relaxation of the problem
2: Call SolveRelaxedLP
3: Drop the fractional values to obtain a split solution of value z′
4: Calculate an upper bound (2zh) on the optimal solution, where zh ← max(z′, zs)
5: Calculate a scaling factor K
6: Scale the qualities of the substreams: q′s,l ← ⌊q̂s,l/K⌋
7: Solve the scaled-down instance of the problem using dynamic programming by reaching to obtain a solution whose value is no less than (1 − ǫ)z*

To solve a substream selection problem instance, we first calculate a single coefficient for the decision variables in the objective function. For variables associated with the texture component we have $\hat{q}^t_{s,l} = q^t_{s,l} \sum_{i \in I} \alpha_s^i$, and the coefficient for depth component variables is $\hat{q}^d_{s,l} = q^d_{s,l} \sum_{i \in I} \beta_s^i$. We then find an upper bound on the optimal solution value in order to reduce the search space. This is achieved by solving the linear program relaxation of the MCKP. A linear time partitioning algorithm for solving the LP-relaxed MCKP exists. This algorithm is based on the works of Dyer [33] and Zemel [139] and does not require any pre-processing of the classes, such as expensive sorting operations. It relies on the concept of dominance to delete items that will never be chosen in the optimal solution. We apply the Dyer-Zemel algorithm to our problem as shown in Algorithm 2. We note that a class in the context of the MCKP represents one of the two components (texture or depth) of a given 3D video in our problem, where each component is comprised of the corresponding streams from the two reference views. It should also be noted that m denotes the number of classes available at a particular iteration, since this changes from one iteration to another as the algorithm proceeds. Thus, at the beginning of the algorithm we have m = 2S classes.

An optimal solution vector xLP to the linear relaxation of the MCKP satisfies the following properties: (1) xLP has at most two fractional variables; and (2) if xLP has two fractional variables, they must be from the same class.

Algorithm 2: SolveRelaxedLP
Input: LP-relaxed version of (P1) after dropping integrality constraint (P1d)
Output: Solution vector xLP, having at most two fractional variables
1: foreach class (component) Nj do
2:     pair substreams two by two as ⟨(j,k1), (j,k2)⟩ and order each pair such that b_{j,k1} ≤ b_{j,k2}, breaking ties such that q̂_{j,k1} ≥ q̂_{j,k2}, and eliminate dominated substreams
3: B ← 0 and Q ← 0
4: foreach class (component) Nj do
5:     if Nj has only one substream k left then
6:         decrease capacity P ← P − b_{j,k}
7:         Q ← Q + q̂_{j,k}
8:         remove component Nj
9: foreach pair ⟨(j,k1), (j,k2)⟩ do
10:     derive the slope $\pi_{\langle (j,k_1),(j,k_2) \rangle} = \dfrac{\hat{q}_{j,k_2} - \hat{q}_{j,k_1}}{b_{j,k_2} - b_{j,k_1}}$
11: π ← median of the slopes $\{\pi_{\langle (j,k_1),(j,k_2) \rangle}\}$
12: for j = 1 to m do
13:     derive Mj(π), φj, and ψj according to:
        $M_j(\pi) = \left\{ k \in N_j : (\hat{q}_{j,k} - \pi b_{j,k}) = \max_{l \in N_j} (\hat{q}_{j,l} - \pi b_{j,l}) \right\}$
        $\phi_j = \arg\min_{k \in M_j(\pi)} b_{j,k}$
        $\psi_j = \arg\max_{k \in M_j(\pi)} b_{j,k}$
14: if π is optimal, i.e., if $B + \sum_{j=1}^{m} b_{j,\phi_j} \le P \le B + \sum_{j=1}^{m} b_{j,\psi_j}$

When there are two fractional variables, one of the items (substreams) corresponding to these two variables is called the split item, and the class containing the two fractional variables is denoted as the split class. A split solution is obtained by dropping the fractional values and maintaining the LP-optimal choices in each class, i.e., the variables with value equal to 1. If xLP has no fractional variables, then the obtained solution is an optimal solution to the MCKP.

By dropping the fractional values from the LP-relaxation solution, we obtain a split solution of value z′ which we can use to compute an upper bound. A heuristic solution to the MCKP with a worst-case performance equal to 1/2 of the optimal solution value can be obtained by taking the maximum of z′ and zs, where zs is the quality of the split substream from the split class, i.e., the stream to which the split substream belongs, plus the sum of the qualities of the substreams with the smallest number of required multicast blocks in each of the other components' streams [65]. Since the optimal objective value z* is less than or equal to z′ + zs, we have z* ≤ 2zh, where zh = max(z′, zs), and thus an upper bound on the optimal solution value.

We use the upper bound in calculating a scaling factor K for the quality values of the layers. In order to get a performance guarantee of 1 − ǫ, we choose K = ǫzh/(2S). The quality values are scaled down to q′s,l = ⌊q̂s,l/K⌋. We then proceed to solve the scaled-down instance of the problem using dynamic programming by reaching (also known as dynamic programming by profits) [65, Chapter 2]. Let B(g, q) denote the minimal number of blocks for a solution of an instance of the substream selection problem consisting of stream components 1, ..., g, where 1 ≤ g ≤ 2S, such that the total quality of the selected substreams is q. For all components g ∈ {1, ..., 2S} and all quality values q ∈ {0, ..., 2zh}, we construct a table where the cell values are B(g, q) for the corresponding g and q. If no solution with total quality q exists, B(g, q) is set to ∞. Initializing B(0, 0) = 0 and B(0, q) = ∞ for q = 1, ..., 2zh, the values for classes 1, ..., g are calculated for g = 1, ..., 2S and q = 1, ..., 2zh using the recursion in Eq. (3.2).

$$
B(g, q) = \min
\begin{cases}
B(g-1,\, q - q_{g,1}) + b_{g,1} & \text{if } 0 \le q - q_{g,1} \\
B(g-1,\, q - q_{g,2}) + b_{g,2} & \text{if } 0 \le q - q_{g,2} \\
\quad \vdots & \\
B(g-1,\, q - q_{g,n_g}) + b_{g,n_g} & \text{if } 0 \le q - q_{g,n_g}
\end{cases}
\qquad (3.2)
$$

The value of the optimal solution is given by Eq. (3.3). To obtain the solution vector for the substreams to be transmitted, we perform backtracking from the cell containing the optimal value in the dynamic programming table.

Q∗ = max{q|B(2S, q) ≤ P }. (3.3)
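The sketch below illustrates the scaling step and the dynamic program by reaching on a toy instance with one video (one texture and one depth component). The coefficients, block counts, and the bound zh used here are hypothetical; in the actual algorithm they are produced by the preceding steps, and the quality axis of the table is bounded by 2zh after scaling.

```python
import math

def s3vm_dp(components, P):
    """Dynamic programming by profits (reaching) over the 2S texture/depth components.

    components: list where each entry holds the (blocks, scaled_quality) tuples of the
                substreams of one component; exactly one substream is chosen per component.
    P:          number of multicast blocks available in the scheduling window.
    Returns the index of the selected substream per component, or None if infeasible.
    """
    G = len(components)
    # Upper bound on the total scaled quality (the text bounds this by 2*z_h).
    q_max = sum(max(q for _, q in comp) for comp in components)
    INF = math.inf

    # B[g][q] = minimal number of blocks using components 1..g with total scaled quality q.
    B = [[INF] * (q_max + 1) for _ in range(G + 1)]
    choice = [[None] * (q_max + 1) for _ in range(G + 1)]
    B[0][0] = 0

    for g in range(1, G + 1):
        for q in range(q_max + 1):
            for idx, (blocks, quality) in enumerate(components[g - 1]):
                if q >= quality and B[g - 1][q - quality] + blocks < B[g][q]:
                    B[g][q] = B[g - 1][q - quality] + blocks
                    choice[g][q] = idx

    # Eq. (3.3): best total scaled quality whose block requirement fits in the window.
    feasible = [q for q in range(q_max + 1) if B[G][q] <= P]
    if not feasible:
        return None
    q = max(feasible)

    selection = []          # backtrack to recover the chosen substream of every component
    for g in range(G, 0, -1):
        idx = choice[g][q]
        selection.append(idx)
        q -= components[g - 1][idx][1]
    return list(reversed(selection))

# Hypothetical example: scale the coefficients with K = eps*z_h/(2S), then run the DP.
S, eps = 1, 0.1
q_hat  = [[41.2, 55.9, 63.1], [28.4, 33.0]]   # q_hat per substream of each component
blocks = [[2, 4, 7], [1, 3]]                  # block requirements b_{s,l}
z_h = 96.1                                    # max(z', z_s) from the LP relaxation step
K = eps * z_h / (2 * S)
comps = [[(b, math.floor(q / K)) for b, q in zip(bs, qs)]
         for bs, qs in zip(blocks, q_hat)]
print(s3vm_dp(comps, P=8))
```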

3.5.1 Analysis

Correctness

The core component of our algorithm is solving the dynamic programming formulation based on the recurrence relation in Eq. (3.2). We prove the correctness of the recurrence relation using induction. For the basis step, where we only consider a single component of one video stream, only the substream with the maximum quality whose block requirement does not exceed the capacity of the scheduling window is selected. For the induction hypothesis, we assume that in the case of g − 1 components the selected substreams also have the maximum possible quality with a total bit rate not exceeding the capacity. For filling the B(g, q) entry in the dynamic programming table, we first retrieve all B(g − 1, q − qg,l) entries and add the block requirements bs,l of the corresponding layers to them. According to Eq. (3.2), only the substream with the minimum number of blocks among all entries which result in quality q is chosen. This guarantees that the exactly-one-substream-per-component constraint is not violated. Since B(g − 1, q) is already minimum, B(g, q) is also minimum for all q. Therefore, based on the above and Eq. (3.3), the proposed algorithm generates a valid solution for the substream selection problem.

Approximation Factor

Let the optimal solution set to the problem be X* with a corresponding optimal value of z*. Running dynamic programming by profits on the scaled instance of the problem results in a solution set X̃. Using the original values of the substreams chosen in X̃, we obtain an approximate solution value zA. Because we use the floor operation to round down the quality values during the scaling process, we have

$$z^A = \sum_{j \in \tilde{X}} q_j \;\ge\; \sum_{j \in \tilde{X}} K \left\lfloor \frac{q_j}{K} \right\rfloor. \qquad (3.4)$$

The optimal solution to a scaled instance will always be at least as large as the sum of the scaled quality values of the substreams in the optimal solution set X* of the original problem. Thus, we have the following chain of inequalities

$$\sum_{j \in \tilde{X}} K \left\lfloor \frac{q_j}{K} \right\rfloor \;\ge\; \sum_{j \in X^*} K \left\lfloor \frac{q_j}{K} \right\rfloor \;\ge\; \sum_{j \in X^*} K \left( \frac{q_j}{K} - 1 \right) = \sum_{j \in X^*} (q_j - K) = z^* - 2SK. \qquad (3.5)$$

Replacing the value of K, we get

$$z^A \;\ge\; z^* - 2S \cdot \frac{\epsilon z^h}{2S} = z^* - \epsilon z^h. \qquad (3.6)$$

Since zh is a lower bound on the optimal solution value (zh ≤ z*), we finally have

$$z^A \;\ge\; z^* - \epsilon z^* = (1 - \epsilon) z^*. \qquad (3.7)$$

This proves that the solution obtained by our algorithm is always within a factor of (1−ǫ) from the optimal solution. Therefore, it is a constant factor approximation algorithm.

Time Complexity

The dynamic programming table for the B(g, q) entries contains 2S × 2zh entries. Computing each entry in the table requires O(L) time according to the recurrence relation given in Eq. (3.2). Therefore, table construction requires O(L · 2S · 2zh) time, or O(nz*), where n is the total number of layers over all stream components. Calculating zh takes O(n) time using the Dyer-Zemel algorithm. This leads to a total time of O(n + nz*). We are using dynamic programming to solve a scaled-down instance of the problem where K = ǫzh/(2S). Since z* ≤ 2zh, we have z*/K ≤ 4S/ǫ. This means that computing the table entries now takes O(nS/ǫ) time. Therefore, the time complexity of the S3VM algorithm is O(nS/ǫ).
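As a rough illustration of the resulting work for plausible parameter values (hypothetical round numbers, not measurements), the table dimensions and the number of cell updates can be estimated as follows.

```python
# Illustrative size of the scaled dynamic programming instance: S videos, L layers per
# component, approximation factor eps. After scaling, the quality axis is about 4S/eps.
S, L, eps = 30, 5, 0.1
n = 2 * S * L                       # total number of substreams over all components
cells = 2 * S * int(4 * S / eps)    # table size after scaling (z*/K <= 4S/eps)
operations = cells * L              # each cell is filled in O(L) time
print(n, cells, operations)
```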

3.6 Energy Efficient Radio Frame Scheduling

In the S3VM algorithm presented in Section 3.5, we determined which substreams to transmit from the texture and depth components of each 3D video in order to maximize the quality of the synthesized views. We now turn to the problem of allocating the video data of the chosen substreams within the frames of the scheduling window. Minimizing energy consumption is a major concern for battery-powered mobile wireless devices. Implementing an energy saving scheme which minimizes the energy consumption over all mobile subscribers is therefore a crucial requirement for multicasting video streams over wireless access networks. Instead of continuously sending the streams at the encoding bit rate, a typical energy saving scheme transmits the video streams in bursts. After receiving a burst of data, mobile subscribers can switch off their RF circuits until the start of the next burst. An optimal allocation scheme should generate a burst schedule that maximizes the average system-wide energy saving over all multicast streams. The problem of finding the optimum schedule is complicated by the requirement that the schedule must ensure that there are no receiver buffer violations for any multicast session. In fact, the problem of burst scheduling for the much simpler case of 2D video streams has been proven to be NP-complete [48].

3.6.1 Proposed Allocation Algorithm

We approach the problem by leveraging a scheme known as double buffering, in which a receiver buffer of size B is divided into two buffers, a receiving buffer and a consumption buffer, each of size B/2 [49]. Thus, a number of bursts with an aggregate size of B/2 can

be received while the video data are being drained from the consumption buffer. This scheme resolves the buffer overflow problem. To avoid underflow, we must make sure that the receiving buffer is completely filled by the time the consumption buffer is completely drained, and the two buffers are swapped at that point in time. Unlike [49], we consider a burst composed of one or more contiguous radio frames allocated to a certain video stream because we are dealing with complete radio frames of fixed duration.

Let γs be the energy saving for a mobile subscriber receiving stream s. γs is the ratio of the amount of time the RF circuits are put in sleep mode within the scheduling window to the total duration of the window. This metric has been used in previous works in the literature [133, 30] to evaluate the energy saving of a burst schedule. The average system-wide energy saving over all multicast sessions can therefore be defined as

$$\gamma = \frac{1}{S} \sum_{s=1}^{S} \gamma_s. \qquad (3.8)$$

The output of an energy-efficient allocation algorithm is thus a list $\Gamma$ of the form $\langle n_s, \langle f_s^1, w_s^1 \rangle, \ldots, \langle f_s^{n_s}, w_s^{n_s} \rangle \rangle$ for each 3D video stream. In this list, $n_s$ is the number of bursts that should be transmitted for stream s within the scheduling window, and $f_s^k$ and $w_s^k$ denote the starting frame and the width of burst k, respectively. Moreover, no two bursts should overlap, i.e., $[f_s^k \ldots f_s^k + w_s^k] \cap [f_{\bar{s}}^{\bar{k}} \ldots f_{\bar{s}}^{\bar{k}} + w_{\bar{s}}^{\bar{k}}] = \emptyset$. Here the operator $[\ldots]$ denotes an integer interval.

Since the substreams have already been chosen by the S3VM algorithm, we omit the substream subscripts l from the corresponding terms in the following for simplicity, e.g., $r_s^t$ instead of $r_{s,l}^t$. Let $r_s$ be the aggregate bit rate of the texture and depth component substreams of video s, i.e., $r_s = r_s^t + r_s^d$. For each 3D video stream, we divide the scheduling window into a number of intervals $w_s^k$, where k denotes the interval index, during which we need to fill the receiving buffer with B/2 data before the consumption buffer is completely drained. We note that, depending on the video bit rate, the length of the interval may not necessarily be aligned with the radio frames. Therefore, buffer swapping at the receiver, which occurs whenever the consumption buffer is completely drained, may take place at any point during the last radio frame of the interval. The starting point of an interval is always aligned with radio frames. Thus, it is necessary to keep track of the current level of the consumption buffer at the beginning of an interval to determine when the buffer swapping will occur and set the deadline accordingly.

Let $\Upsilon_s^k$ denote the consumption buffer level for stream s at the beginning of interval k, and let $x_s^k$ and $z_s^k$ be the start and end frames for interval k of stream s, respectively. The end frame for an interval represents a deadline by which the receiving buffer should be filled before a buffer swap occurs. Within each interval for stream s, the base station schedules $y_s^k$ frames for transmission before the deadline. Except for the last interval, the number of frames to be transmitted is $\lceil (B/2)/F \rceil$. We note that the last of the scheduled frames within an interval may not be completely filled with video data. For the last interval, the end time is always set to the end of the scheduling window. The amount of data to be transmitted within this interval is calculated based on how much data will be drained from the consumption buffer by the end of the window.

Algorithm 3: Energy-efficient Scalable 3D Video Multicast (eS3VM)
Input: Scheduling window capacity P; TDD frame capacity F; buffer size B; set of scalably simulcast coded MVD2 3D videos S; model parameters α^i_s, β^i_s for each virtual view position of each video; approximation factor ǫ
Output: Video data burst allocation to radio frames of the current scheduling window
1: Run the S3VM algorithm to select the substreams for the texture and depth components of the videos
2: for s = 1 to S do
3:     k ← 0
4:     Calculate x^k_s using (3.10)
5:     while x^k_s < P do
6:         Calculate y^k_s and z^k_s using (3.12) and (3.11)
7:         k ← k + 1
8: Let Λ = ∅
9: foreach decision point do
10:     t_current ← current time
11:     t_next ← next decision point time
12:     Get the interval w^k_s with the earliest deadline z^k_s among all outstanding intervals
13:     Allocate frames between t_current and t_next
14:     e^k_s ← actual completion time for the bursts in interval k
15: if max{e^k_s − z^k_s} ≤ 0 then
16:     return Λ
17: No feasible solution
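As a small illustration of the energy-saving metric, the sketch below computes $\gamma_s$ for hypothetical burst schedules of two streams and averages them as in Eq. (3.8).

```python
def energy_saving(bursts, P, tau):
    """Fraction of the scheduling window a receiver can keep its radio in sleep mode.

    bursts: list of (start_frame, width) pairs for one stream (the Gamma list entries).
    P:      number of TDD frames in the scheduling window.
    tau:    TDD frame duration in seconds.
    """
    awake_frames = sum(width for _, width in bursts)
    return (P - awake_frames) * tau / (P * tau)        # gamma_s in Eq. (3.8)

# Hypothetical schedules for two multicast sessions in a 200-frame window.
schedules = [[(0, 10), (60, 10), (120, 10)], [(10, 25), (100, 25)]]
gammas = [energy_saving(b, P=200, tau=0.005) for b in schedules]
print(gammas, sum(gammas) / len(gammas))               # per-stream and system-wide saving
```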

$$
\Upsilon_s^k =
\begin{cases}
B/2 & \text{if } k = 0 \\[4pt]
\dfrac{B}{2} - \left(1 - \dfrac{\Upsilon_s^{k-1} \bmod r_s\tau}{r_s\tau}\right) r_s\tau & \text{if } \Upsilon_s^{k-1} \bmod r_s\tau \neq 0 \\[4pt]
B/2 & \text{otherwise}
\end{cases}
\qquad (3.9)
$$

$$
x_s^k =
\begin{cases}
0 & \text{if } k = 0 \\
z_s^{k-1} & \text{if } \Upsilon_s^{k-1} \bmod r_s\tau = 0 \\
z_s^{k-1} + 1 & \text{otherwise}
\end{cases}
\qquad (3.10)
$$

56 0 1 23 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

interval decision points

Stream−1

Stream−2

All decision points

Figure 3.2: Transmission intervals and decision points for two streams (r2 > r1) in a scheduling window of 20 TDD frames.

$$
z_s^k =
\begin{cases}
P & \text{if } k \text{ is the last interval} \\[4pt]
x_s^k + \left\lfloor \dfrac{\Upsilon_s^k}{r_s\tau} \right\rfloor & \text{otherwise}
\end{cases}
\qquad (3.11)
$$

$$
y_s^k =
\begin{cases}
\left\lceil \dfrac{(B/2 - \Upsilon_s^k) + r_s\tau(P - x_s^k)}{F} \right\rceil & \text{if } k \text{ is the last interval} \\[4pt]
\left\lceil \dfrac{B/2}{F} \right\rceil & \text{otherwise}
\end{cases}
\qquad (3.12)
$$
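The sketch below enumerates the transmission intervals of a single stream by directly applying the reconstructed Eqs. (3.9)-(3.12) above. The parameter values are hypothetical, and the exact rounding used for the last interval is an assumption carried over from that reconstruction.

```python
import math

def intervals(r_s, B, P, F, tau):
    """Enumerate (start_frame, end_frame, frames_to_allocate) per interval for one stream.

    r_s: aggregate texture+depth bit rate (kbps), B: receiver buffer size (kb),
    P: frames per scheduling window, F: frame capacity (kb), tau: frame duration (s).
    """
    drained_per_frame = r_s * tau
    out, upsilon, x = [], B / 2, 0               # Eqs. (3.9)-(3.10) for k = 0
    while x < P:
        z = x + math.floor(upsilon / drained_per_frame)                  # Eq. (3.11)
        if z >= P:                               # last interval
            z = P
            y = math.ceil(((B / 2 - upsilon) + drained_per_frame * (P - x)) / F)  # Eq. (3.12)
            out.append((x, z, y))
            break
        y = math.ceil((B / 2) / F)               # Eq. (3.12), non-last interval
        out.append((x, z, y))
        remainder = upsilon % drained_per_frame
        if remainder == 0:                       # Eqs. (3.9)-(3.10) for the next interval
            upsilon, x = B / 2, z
        else:
            upsilon = B / 2 - (1 - remainder / drained_per_frame) * drained_per_frame
            x = z + 1
    return out

print(intervals(r_s=1000, B=500, P=800, F=150, tau=0.005))
```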

Based on the above formulation, a complete energy-efficient scalable 3D video multicast (eS3VM) algorithm is given in Algorithm 3. Assuming that the consumption buffer is initially full, the proposed allocation extension proceeds as follows. The start frame number for all streams is initially set to zero. Decision points are set at the start and end frames for each interval of each frame as well as the frame at which all data to be transmitted within the interval has been allocated. At each decision point, the algorithm picks the interval with earliest deadline, i.e., closest end frame, among all outstanding intervals. It then continues allocating frames for the chosen video until the next decision point or the fulfillment of the data transmission requirements for that interval. We demonstrate the concepts of transmission intervals and decision points in Figure 3.2 for a two stream example. Stream-2 in the figure has a higher data rate. Thus, the consumption buffer for the receivers of the second multicast session is drained faster than consumption buffer of the receivers of the first stream. Consequently, the transmission intervals for stream-2 are shorter. The set of decision points within the scheduling window is the union of the decision points of all streams being transmitted. If no feasible allocation satisfying the buffer constraints is returned, the selected sub- streams cannot be allocated within the scheduling window. Thus, the problem size needs to be reduced by discarding one or more layers from the input video streams and a new set of substreams needs to be recomputed. To prevent severe shape deformations and geometry errors, we initially restrict the layer reduction process to the texture components of the 3D videos. This process is repeated until a feasible allocation is obtained or all enhancement layers of texture components have been discarded. If a feasible solution is not obtained after


Table 3.2: 3D video sequences used in 3D distortion model validation experiments.

Sequence    Resolution   Number of Views   Reference Views   Synthesized Views
Champagne   1280×960     80                37, 41            38, 39, 40
Pantomime   1280×960     80                37, 41            38, 39, 40
Kendo       1024×768     7                 1, 5              2, 3, 4
Balloons    1024×768     7                 1, 5              2, 3, 4
Lovebird1   1024×768     12                4, 8              5, 6, 7
Newspaper   1024×768     9                 2, 6              3, 4, 5

If a feasible solution is not obtained after discarding all texture component enhancement layers, we proceed with reducing layers from the depth components. If no feasible solution is found given only the base layers of all components, the system should reduce the number of video streams to be transmitted. Deciding on the video stream from which an enhancement layer is discarded is based on the ratio between the average quality of the synthesized views and the size of the video data being transmitted within the window. We calculate the average quality given by the available substreams of each video over all synthesized views. We then divide this value by the amount of data being transmitted within the scheduling window. The video stream with the minimum quality-to-bits ratio is chosen for enhancement layer reduction.
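A minimal sketch of this layer-reduction choice, using hypothetical per-session state (quality in dB and scheduled size in kb):

```python
def pick_stream_for_reduction(sessions):
    """Pick the video whose enhancement layer should be dropped next.

    sessions: dict mapping video id -> (average synthesized-view quality of the currently
              selected substreams, bits scheduled for that video within the window).
    Returns the video id with the minimum quality-to-bits ratio.
    """
    return min(sessions, key=lambda vid: sessions[vid][0] / sessions[vid][1])

# Hypothetical state for three multicast sessions.
state = {"Kendo": (39.4, 2400), "Balloons": (38.9, 1900), "Newspaper": (37.1, 2600)}
print(pick_stream_for_reduction(state))   # -> "Newspaper" (lowest quality per bit)
```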

3.7 Validation of Virtual View Quality Model

To validate that the relation between the quality of the synthesized views and the quality of the texture and depth maps of the reference views can be approximated by a linear plane, we perform the following experiment. For a set of 6 multi-view video sequences and their associated depth maps, we chose from each video two reference views that are 4 baseline distances apart. We encoded 30 frames (about 1 second) of the texture and depth map streams of the chosen views using the Joint Scalable Video Model (JSVM) software [61]. JSVM is the reference software for scalable video coding (SVC). The encoder was configured such that the generated scalable streams contain 4 layers (one base layer and three MGS enhancement layers). A description of the sequences used in our experiment is given in Table 3.2.

We use two video quality metrics to calculate the quality of each layer of the left and right reference streams. This is done for both texture and depth streams. The two metrics used are the luminance component peak signal-to-noise ratio (Y-PSNR) and structural similarity (SSIM) [128]. We then synthesize three intermediate views between the two reference views using the view synthesis reference software (VSRS) [115]. VSRS uses two reference views, left and right, to synthesize an intermediate virtual view by using the two corresponding reference depth maps.

The distortion in views synthesized from reconstructed references is a combination of the distortion resulting from compression and the distortion resulting from the view synthesis process. We calculate the average quality of each texture and depth map substream combination against views synthesized at the same camera position but from the uncompressed reference streams. This demonstrates only the effect of reference view compression on the view synthesis process. A sample of our results is shown in Figures 3.3 and 3.4; results for the other video sequences are similar. The figures illustrate the relationship between the average quality of the left and right reference textures, the average quality of the left and right depth maps, and the average quality of the 3 synthesized views in terms of PSNR and SSIM values. As shown in the figures, all surfaces can indeed be approximated by a linear surface in the form given in Eq. (3.1).

3.8 Performance Evaluation

3.8.1 Setup

We implemented the proposed substream selection algorithm in Java and evaluated its performance using scalable video trace files. To generate the video traffic, we used six 3D video sequences. We divided each sequence into four 60-frame (2-second) segments to obtain 24 multi-view-plus-depth video streams. The texture and depth streams were then encoded using the JSVM reference software version 9.19 [61] into one base layer and four medium grain scalability (MGS) layers. The quantization parameter values used in the encoding process are 36, 34, 30, 28, and 26. We then extracted and decoded each of the substreams from the encoded bitstreams and calculated the average quality and total bit rate for the corresponding layers of the left and right reference views. We summarize this information for the first segment of each of the original six video sequences in Table 3.3.

Figure 3.3: Average PSNR quality of 3 synthesized views from decoded substreams with respect to views synthesized from uncompressed references: (a) Balloons; (b) Pantomime; (c) Champagne Tower; (d) Kendo; (e) Newspaper; (f) Lovebird1.

Figure 3.4: Average SSIM quality of 3 synthesized views from decoded substreams with respect to views synthesized from uncompressed references: (a) Balloons; (b) Pantomime; (c) Champagne Tower; (d) Kendo; (e) Newspaper; (f) Lovebird1.

Table 3.3: Data Rates (Kbps) and Y-PSNR Values (dB) Representing Each Layer of the Scalable Encodings of the Texture and Depth Streams.

Sequence    Comp.   1 Layer (r1, q1)   2 Layers (r2, q2)   3 Layers (r3, q3)   4 Layers (r4, q4)   5 Layers (r5, q5)
Champagne   t       157, 35.3511       284, 36.4732        653, 39.5048        971, 40.6820        1493, 42.2360
Champagne   d       64, 40.2629        104, 40.9074        238, 43.2668        386, 44.3530        650, 45.6268
Pantomime   t       517, 34.5877       674, 35.4403        1183, 38.2229       1670, 39.5058       2398, 41.3435
Pantomime   d       119, 40.6100       180, 41.2614        352, 42.9276        554, 43.8257        896, 44.8836
Kendo       t       295, 35.9112       415, 36.8813        771, 39.4434        1121, 40.6169       1697, 42.0590
Kendo       d       203, 38.5026       294, 39.3620        546, 41.5120        819, 42.6939        1264, 44.0096
Balloons    t       217, 35.3134       342, 36.4006        716, 39.1762        1068, 40.3533       1665, 41.8603
Balloons    d       101, 38.7728       162, 39.6052        348, 41.7705        564, 42.8321        952, 44.0922
Lovebird1   t       137, 33.8922       282, 34.9077        739, 37.9702        1108, 39.1853       1710, 40.7982
Lovebird1   d       30, 43.1877        52, 43.7545         116, 45.3736        207, 46.4544        370, 47.5168
Newspaper   t       168, 34.7678       295, 35.7765        694, 38.7524        1035, 39.9533       1587, 41.4886
Newspaper   d       79, 39.1052        126, 39.7935        288, 41.8723        466, 42.9496        787, 44.2220

Figure 3.5: Average quality of the solutions obtained using the proposed algorithm (taken over all video sequences) for: (a) variable number of video streams; (b) different MBS area sizes.

For each texture-depth quality combination, three intermediate views are synthesized using VSRS 3.5 [115]. We synthesize virtual views using the general synthesis mode with half-pel precision. The quality of the synthesized views is compared against the quality of views synthesized from the original non-compressed references. These values are then used, along with the average qualities obtained for the compressed reference texture and depth substreams, to obtain the model parameters at each synthesized view position.

We consider a 20 MHz Mobile WiMAX channel, which supports data rates up to 60 Mbps depending on the modulation and coding scheme [69]. The typical frame duration in Mobile WiMAX is 5 ms. Thus, for a 1 second scheduling window, there are 200 TDD frames. We assume that the size of the MBS area within each frame is 100 Kb. The initial multicast channel bit rate is therefore 20 Mbps.

To assess the performance of our algorithm, we run several experiments, as described in the sequel, and compare our results with the optimal substream selection solution obtained using the CPLEX LP/MIP solver [27]. All experiments were run on a dual 2.66 GHz Intel Xeon processor machine with two cores in each physical processor (for a total of 4 cores) and 8 GB of physical memory. The two performance metrics used in our evaluation are: average video quality (over all synthesized views and all streams), and running time.
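For reference, the channel rate implied by these settings can be checked with a few lines; the figures are exactly the ones stated above.

```python
# Sanity check of the multicast channel rate: a 100 kb MBS area in every 5 ms TDD frame.
frame_duration_s = 0.005
mbs_area_kb = 100
frames_per_second = 1 / frame_duration_s                 # 200 TDD frames per second
channel_rate_mbps = frames_per_second * mbs_area_kb / 1000
print(frames_per_second, channel_rate_mbps)               # 200 frames, 20 Mbps
```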

3.8.2 Simulation Results

Video Quality

In the first experiment, we study the performance of our algorithm in terms of video quality. We first fix the MBS area size at 100 Kb and vary the number of 3D video streams from 10 to 35 streams. The approximation parameter ǫ is set to 0.1. We calculate the average quality across all video streams for all synthesized intermediate views. We compare the results obtained from our algorithm to those obtained from the absolute optimal substream set returned by the CPLEX optimization software. The results are shown in Figure 3.5a. As expected, the average quality of a feasible solution decreases as more video data need to be allocated within the scheduling window. However, it is clear that our algorithm returns a near-optimal solution, with a set of substreams whose average quality is lower than that of the optimal solution by at most 0.3 dB. Moreover, as the number of videos increases, the gap between the solution returned by the S3VM algorithm and the optimal solution decreases. This indicates that our algorithm scales well with the number of streams.

We then fix the number of video streams at 30 and vary the capacity of the MBS area from 100 Kb to 350 Kb, reflecting data transmission rates ranging from 20 Mbps to 70 Mbps. As can be seen from the results in Figure 3.5b, the quality of the solution obtained by our algorithm again closely follows the optimal solution.

Figure 3.6: Average running times for: (a) variable number of video streams; (b) different MBS area sizes.

Running Time

In the second set of experiments, we evaluate the running time of our algorithm against that of finding the optimum solution. Fixing the approximation parameter at 0.1 and the MBS area size at 100 Kb, we measure the running time of our algorithm for a variable number of 3D video streams. Figure 3.6a compares our results with those measured for obtaining the optimal solution. As shown in the figure, the running time of the S3VM algorithm is much smaller than the time required to obtain the optimal solution for all samples. In Figure 3.6b we show the results for a second experiment where the number of videos was fixed at 30 streams and the MBS area size was varied from 100 Kb to 350 Kb. From the figure, it is clear that the running time of our algorithm is still significantly less than that of the optimum solution.

Figure 3.7: Running time and average quality for different values of the approximation parameter: (a) running time; (b) average quality.

Approximation Parameter

In the last experiment, we study the effect of the value of the approximation parameter ǫ on the running time of our algorithm. We use 30 video streams with an MBS area size of 100 Kb, and vary ǫ from 0.1 to 0.5. As shown in Figure 3.7a, increasing the value of the approximation parameter results in a shorter running time. As described in Section 3.5, the scaling factor K is proportional to the value of ǫ. Therefore, increasing ǫ results in smaller scaled quality values, which reduces the size of the dynamic programming table and consequently the running time of the algorithm, at the cost of increasing the gap between the returned solution and the optimal solution, as illustrated in Figure 3.7b.

Buffer Level Validation

To study the performance of our allocation algorithm, we generate a 500-second workload from each 3D video. We do this by taking the 8-second video streams, starting from a random initial frame, and then repeating the frame sequences. The resulting sequences are then encoded as discussed in Section 3.8.1. The experiments are performed over a period of 50 consecutive scheduling windows.

In this experiment, we validate that the output schedule from the proposed allocation algorithm does not result in buffer violations at the receivers. We set the scheduling window duration to 4 seconds and the size of the receivers' buffers to 500 kb. We then plot the total buffer occupancy for each multicast session at the end of each TDD frame within the scheduling window. The total buffer occupancy is calculated as the sum of the receiving buffer level and the consumption buffer level. Figure 3.8 demonstrates the buffer occupancy for the two buffers as well as the total buffer occupancy for one multicast session.

As can be seen from Figure 3.8a, the receiver buffer occupancy never exceeds the buffer size, indicating no buffer overflow instances. For the consumption buffer, we observe that its occupancy jumps directly to the maximum level as soon as the buffer becomes empty due to buffer swapping, as shown in Figure 3.8b. Similar results were obtained for the rest of the multicast sessions. This indicates that no buffer underflow instances occur.

Energy Saving

In the final experiment, we evaluate the energy saving performance of our radio frame allocation algorithm. For this evaluation, we use the power consumption parameters of an actual WiMAX mobile station [103]. The power consumption during the sleep mode and the listening mode is 10 mW and 120 mW, respectively. This translates to an energy consumption of 0.05 mJ and 0.6 mJ, respectively, for a 5 ms radio frame. In addition, the transition from the sleep mode to the listening mode consumes 0.002 mJ. We set the TDD frame size to 150 kb and the receiver buffer size to 500 kb.

Using a 2 second scheduling window, we vary the number of multicast videos from 5 to 20 and measure the average power saving over all streams, as shown in Figure 3.9a. Next, keeping all other parameters the same, we set the number of videos to 5 and vary the duration of the scheduling window from 2 to 10 seconds. We plot the average energy savings along with the variance in Figure 3.9b. Finally, in Figure 3.9c, we evaluate the energy saving at different buffer sizes. We set the number of videos to 10, the duration of the window to 2 seconds, and vary the receiver buffer size from 500 to 1000 kb. As can be seen from Figure 3.9, our eS3VM algorithm maintains a high average energy saving, around 86%, over all transmitted streams. In all cases, the measured variance was very small and hardly noticeable in the figures.
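Using the power figures above, the per-window energy of a receiver for a given burst list can be estimated as in the sketch below; the burst schedule in the example is hypothetical.

```python
# Power/energy figures for a WiMAX mobile station reported in [103]:
SLEEP_MJ_PER_FRAME  = 0.05    # 10 mW over a 5 ms frame
LISTEN_MJ_PER_FRAME = 0.6     # 120 mW over a 5 ms frame
TRANSITION_MJ       = 0.002   # sleep -> listen transition

def window_energy_mj(bursts, frames_per_window):
    """Energy consumed by a receiver over one scheduling window for a given burst list."""
    listen = sum(width for _, width in bursts)
    sleep = frames_per_window - listen
    return (listen * LISTEN_MJ_PER_FRAME + sleep * SLEEP_MJ_PER_FRAME
            + len(bursts) * TRANSITION_MJ)

# Hypothetical 2-second window of 400 frames with three bursts of 20 frames each.
print(window_energy_mj([(0, 20), (150, 20), (300, 20)], frames_per_window=400))
```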

3.9 Summary

In this chapter, we studied the 3D video multicasting problem in wireless environments, where the components of each 3D video are simulcast coded using a scalable video coder. We divided the problem into two sub-problems. In the first sub-problem, the base station scheduler selects the reference view substreams that maximize the quality of the synthesized views rendered on the receiver's display given the bandwidth limitations of the channel. We showed that the problem is NP-complete and we presented an approximation algorithm for solving it. Our algorithm leverages scalable coded multi-view-plus-depth 3D videos and performs joint texture-depth rate-distortion optimized substream extraction to maximize the average quality of rendered views over all 3D video streams. We proved that our algorithm has an approximation factor of (1 − ǫ) and a running time complexity of O(nS/ǫ), where n is the total number of layers, S is the total number of streams, and ǫ is the approximation parameter. In the second sub-problem, the chosen substreams need to be scheduled such that the power consumption of the receiving mobile devices is minimized without introducing any buffer overflow or underflow instances.

We proposed an efficient heuristic burst scheduling algorithm based on a double-buffering technique to solve this sub-problem. The performance of our algorithms was evaluated using trace-based simulations. The results showed that our algorithm runs much faster than enumerative algorithms for finding the optimal solution. The selected set of substreams yields an average quality for the synthesized views that is within 0.3 dB of the optimal. Moreover, our energy-efficient radio frame allocation produces schedules that reduce the power consumption of the receivers by 86% on average.


Figure 3.8: Allocation algorithm performance in terms of receiver buffer occupancy levels of selected substreams using a 4 second scheduling window: (a) receiving buffer; (b) consumption buffer; (c) overall buffer level.


Figure 3.9: Average energy saving for: (a) variable number of video streams; (b) variable scheduling window duration; and (c) variable receiver buffer size.

Chapter 4

Virtual View Quality-aware Rate Adaptation for HTTP-based Free-viewpoint Video Streaming

4.1 Introduction

There has been great interest recently in visual experiences beyond what is offered by traditional 2D video systems. 2D video streaming transmits a single view of the 3D world to viewers, delivering a limited experience. In a real world experience, however, one expects to be able to navigate a scene from different viewpoints. This is particularly desirable in scenes where multiple points of interest are present, e.g., concerts and sports events, where users would like to direct their own version of the event. With recent advances in scene capturing technologies and virtual reality hardware, e.g., Oculus Rift and Microsoft HoloLens, interactive 360-degree videos and free-viewpoint videos (FVV) are increasingly gaining popularity. Unlike 360-degree videos, which only provide rotational movement around the center point of the camera, FVV enables viewers to move to any viewpoint that is located between cameras positioned at different locations around the scene. For example, in sports events the cameras can be placed around the field and viewers can experience the game from different perspectives. This provides a richer experience by enabling viewers to watch the scene from their view angle of interest and/or to move around obstructions to get a better view of occluded objects. Supporting interactive free-viewpoint video streaming over the best-effort Internet to heterogeneous clients is challenging because of the high system complexity. For example, when viewers navigate across viewpoints in a free-viewpoint system, some of the viewpoints may not have been captured by cameras. An interactive free-viewpoint video streaming system should satisfy the following requirements:

• Responsiveness. Users should be able to interact with the system in real time. The delay between a request for a viewpoint change and the rendering of the target view should be minimized. This includes network-related delays as well as processing delays.

• Scalability. The system should be able to handle a large number of concurrent clients that are possibly viewing the scene from different angles.

• Adaptability. The system should provide the best possible quality to heterogeneous clients while handling network dynamics, such as bandwidth variation.

• Immersiveness. A user should be able to choose between a large number of viewpoints and transition smoothly between them in order to provide a truly immersive experience.

In 2D video streaming, a single video stream is transmitted from the server to the client. The quality of the decoded video is directly related to the compression-induced distortion of that single stream, which in turn is inversely proportional to the bit rate of the stream. Unlike 2D video streaming, in an FVV streaming system, multiple video streams corresponding to the 3D components of different views are sent to the client, and the rendered video frames are the result of a view synthesis process applied to the received components. This makes the problem of rate adaptation in these systems more complex because the quality of the rendered video stream depends on the qualities of the component streams used as references in the view synthesis process. Moreover, changes in the bit rates of those components do not equally contribute to the quality of the resulting video. In this chapter, we study the problem of optimizing adaptive streaming of interactive free-viewpoint videos to heterogeneous clients. More specifically, we address the problem of selecting the optimal versions for the components of scheduled reference views that maximize the quality of rendered virtual views. To this end, we present a two-step rate adaptation approach for FVV streaming systems. In the first step of the proposed approach, a number of reference views are scheduled for transmission based on the user's view navigation pattern. In the second step, the streaming client determines the optimal bit rate allocation for the chosen component streams in order to deliver the best possible quality for rendered virtual views given the available network bandwidth. We implement the proposed approach in a real free-viewpoint video streaming testbed and we develop the first complete DASH-based FVV streaming client using open source libraries. The rest of this chapter is organized as follows. Section 4.2 summarizes the related work in the literature. In Section 4.3, we define the rate adaptation problem for HAS-based FVV streaming. Sections 4.4 and 4.5 present our reference view scheduling method and the virtual view quality-aware rate adaptation approach, based on empirical and analytical quality models, and Section 4.6 describes the system architecture and client implementation. We evaluate the quality of the proposed rate adaptation method in Section 4.7 using both objective quality metrics and subjective quality assessment. Finally, Section 4.8 summarizes the chapter.


Figure 4.1: Free-viewpoint video streaming systems where view synthesis is performed at: (a) server and (b) client.

4.2 Related Work

A number of research works in the literature attempted to address the problem of interactive multi-view and free-viewpoint video streaming. These works can be classified into two classes: server-based non-adaptive approaches and client-based adaptive streaming approaches.

4.2.1 Server-based Approaches

An interactive multi-view video system based on the idea of transmitting residual information to aid the client in generating virtual views is presented in [81]. In the proposed system, the encoder-side performs view synthesis and computes the necessary residual information which is then stored and transmitted to the user when needed. This increases the storage and bandwidth requirements of the server and it may affect the scalability of the system because the server needs to prepare and transmit user-specific information for each incoming request. Kurutepe et al. propose a selective streaming system based on multi-view video coding and user head tracking [70]. In this system, the client sends the current viewpoint to the server over a feedback channel. The server then encodes and streams a multi-view video sequence containing a stereo pair corresponding to the user's viewpoint and two lower resolution side views to reduce the view-switching latency. An enhancement layer is simulcast coded for the stereo pair to improve the quality of the selected views. Such a system requires that the complex and time consuming multi-view coding process be performed on-the-fly for each client. This increases the server's processing load and does not scale well with a large number of clients requesting different views. Moreover, because the system

does not perform view synthesis, the users are restricted by the viewpoints available at the server and a smooth view-switching experience can only be achieved when a large number of captured views are available at the server, thereby increasing the storage overhead.

4.2.2 Client-based Approaches

Xiao et al. [131] present two approaches to streaming multi-view videos over DASH. The focus of [131] is providing timely view switching without playback interruptions. The main idea of the two proposed approaches is to utilize multi-view encoders performing inter-view prediction to reduce the view switching latency. Unlike our proposed system, their work only considers multi-view videos with no depth information. Different versions of both simulcast-coded and inter-view-coded views are stored on the server. Moreover, in the first approach, all possible version combinations for inter-view coded streams are generated. This imposes a large storage overhead on the content server and significantly increases the cost for the content provider. In [38], Gao et al. present a multi-modal 3D video streaming system based on the DASH standard which allows users to view arbitrary sides of a captured object. Although their work is somewhat similar to ours, the authors mainly focus on supporting multi-modal data and do not provide details on how rate adaptation across the different modalities is performed. The work most closely related to ours is presented in [113], where Su et al. present an HEVC multi-view streaming system using DASH. Similar to our proposed system, the streamed video is represented using a number of views and associated depth maps. Unlike our proposed system, however, the client adapts the bit rate of the video by leveraging view scalability, where the number of transmitted reference views (and possibly the distances between them) is varied based on the available network bandwidth. Because the views and their depth streams are jointly coded, the video data for all the components are encoded into a single stream. This dictates a fixed distribution of the total segment bit rate between the different components. In [113], all the segment components for a given representation have an equal bit rate. Therefore, the system does not provide much flexibility in terms of rate adaptation and does not consider how each component stream contributes differently to the qualities of the synthesized views. Consequently, such a system cannot guarantee a bit rate allocation that results in optimal quality.

4.3 Problem Definition

A content server stores a number of free-viewpoint videos in which scenes are captured using a sparse camera arrangement. Each captured view has a corresponding depth map stream representing the depth value of each pixel in the captured frames. The texture and depth streams for each view are simulcast coded with the same encoding configuration to obtain a multi-view-plus-depth (MVD) representation of the scene. Each component stream

is encoded at L different quality levels (representations). The resulting streams are each divided into a set of segments with equal playback duration τ. A single media presentation descriptor (MPD) file describing the component streams as well as metadata information related to the captured reference views is stored at the server. The structure of the MPD file is discussed in more detail in Section 4.6.1. Let V be a set of evenly spaced captured views, where |V| = N. We assume that an equal number of evenly spaced virtual views, say K, are available for view navigation between each pair of adjacent captured views i and i + 1. In this chapter, we refer to the set of virtual views between two adjacent captured views as a virtual view range. Therefore, the set of views that a user can navigate to is V′, with |V′| = N + K(N − 1), and the views are separated by a distance of d = 1/(K + 1). We can express a viewpoint position as a multiple of the inter-view distance, i.e., a view with index j is at position k = j · d. The problem of virtual view quality-aware optimal segment selection that we attempt to address in this chapter can therefore be stated as follows:

Problem 2. Consider a free-viewpoint video where a number of reference views composed of texture and depth component streams encoded at L different representations are stored on the server. Given the current viewpoint position and the available network bandwidth between the server and client, determine which reference views should be requested and which representations for each texture and depth component should be downloaded such that the quality of the rendered virtual views at the client side is maximized.

To solve this problem, we propose a two-step approach. In the first step, the client determines the set of reference views it needs to request from the server in order to render the current viewpoint as well as any potential viewpoints that the user may navigate to in the future. In the second step, the client’s rate adaptation logic should decide on the representations for each of the segments of the scheduled views’ components. In the following sections, we discuss the two steps of the proposed approach in more detail.

4.4 Reference View Scheduling

In an FVV streaming client, the user expects to be able to navigate freely to any desired viewpoint. The view navigation pattern depends on the nature of the video and the interests of the user. It may happen that the rate at which the user is changing views causes the viewpoint to change to a position that lies outside the virtual view range of the buffered reference views before the current segment duration ends. When the user navigates to a viewpoint outside the virtual view range of the reference views currently in the buffer, the client needs to replace one of the reference views by downloading another segment for a view that bounds the new virtual view range in which the requested virtual view lies. During the download time of the new reference view, the client can either perform view synthesis using

only one of the available references, resulting in sudden quality degradation, or wait for a segment from the new reference view to become available, which results in high view-switching latency. An FVV streaming client therefore requires a reference view scheduling component that determines which reference views should be downloaded based on the viewer's view-switching behaviour. To reduce the reference view switching latency, our FVV streaming client conditionally pre-fetches an additional reference view based on the current and previous viewpoint positions of the user. This is achieved by periodically recording the viewpoint position of the user and using a navigation path prediction technique to extrapolate the future viewpoint position based on historical measurements. We propose using a simple location estimation technique known as dead reckoning [107, Ch.5]. Dead reckoning calculates an object's current position by using a previously determined position, or fix, and advancing that position based on known or estimated velocities over the elapsed time and course. By tracking the user's viewpoint positions, dead reckoning enables the client to predict the future navigation path by assuming that the user maintains the current view-switching velocity. We divide time into discrete instants with interval ∆, where τ = ζ∆ and ζ is a fixed value. Let x(t) be the view position at time instant t. We can therefore calculate the instantaneous view-switching velocity v(t) using

v(t)=(x(t) − x(t − ∆))/∆. (4.1)

Knowing the view-switching velocity also enables the client to determine whether pre-fetching an additional reference view is necessary for the next segment duration, depending on how fast the view-switching is. Based on the calculated view-switching velocity, the client can predict the view position at the beginning of the next segment as

x(t + τ)= x(t)+ v(t) · τ. (4.2)

If the estimated position is not within the current virtual view range, the view scheduler will schedule an additional reference view that, along with one of the current reference views, bounds the estimated viewpoint position. To obtain a more accurate prediction of the viewpoint position, we apply a smoothing filter, such as the exponentially weighted moving average (EWMA), to either the predicted position or the view-switching velocity. The smoothed view-switching velocity v′(t) can be calculated as

v′(t) = θ · v(t) + (1 − θ) · v′(t − τ),        (4.3)

where θ ∈ [0, 1] is the smoothing factor.
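The following is a minimal sketch (hypothetical class and variable names, smoothing applied per position sample for simplicity) of the dead-reckoning predictor defined by Eqs. (4.1)–(4.3).

# Hypothetical sketch of the dead-reckoning viewpoint predictor:
# instantaneous velocity (Eq. 4.1), EWMA smoothing (Eq. 4.3), prediction (Eq. 4.2).
class ViewpointPredictor:
    def __init__(self, theta: float = 0.5):
        self.theta = theta        # EWMA smoothing factor, theta in [0, 1]
        self.prev_x = None        # last sampled viewpoint position
        self.v_smooth = 0.0       # smoothed view-switching velocity v'(t)

    def sample(self, x: float, delta: float) -> None:
        """Record a viewpoint position sampled every `delta` seconds."""
        if self.prev_x is not None:
            v = (x - self.prev_x) / delta                         # Eq. (4.1)
            self.v_smooth = (self.theta * v
                             + (1 - self.theta) * self.v_smooth)  # Eq. (4.3)
        self.prev_x = x

    def predict(self, tau: float) -> float:
        """Estimated position one segment duration `tau` ahead (Eq. (4.2))."""
        return self.prev_x + self.v_smooth * tau

# Example: positions sampled every 0.25 s, predicting one 1 s segment ahead.
p = ViewpointPredictor(theta=0.5)
for x in [2.0, 2.1, 2.25, 2.4]:
    p.sample(x, delta=0.25)
print(p.predict(tau=1.0))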


Figure 4.2: Segment scheduling window. Deciding on left (L) reference view, right (R) reference view, and pre-fetched (P) view.

Figure 4.2 shows an example of a set of viewpoint positions over time, the estimated view navigation path, and the corresponding view scheduling window. The reference view scheduling method can be summarized as follows:

• The streaming client maintains a view scheduling window that stores decisions about which views are requested for each segment index within the window. The length of the window is kept relatively small, e.g., three segments, to reduce the bandwidth overhead in the case of inaccurate predictions in the viewpoint position.

• During one segment duration, the view scheduler records viewpoint position changes and updates the direction and velocity of view-switching based on Eq. (4.1) and Eq. (4.3). The estimated viewpoint position for the next segment index to be downloaded is then calculated based on Eq. (4.2). We note that the system may enforce a maximum view-switching velocity to ensure that at least one of the buffered views would be an immediate reference at the time of rendering.

• After the download of all component segments of the current segment index has completed, a view scheduling decision is made based on the estimated viewpoint position and the view scheduler slides the window by one segment duration. A scheduling decision includes the index of the left (L) and right (R) reference views and, if necessary, an additional view to be pre-fetched (P).

Because the view scheduler schedules views for a number of segment indices in the future, it is possible that scheduling decisions made at a certain decision point will no longer be valid at the following decision point due to unpredictable behaviour of the user. In such cases, the client keeps any previous decisions for multi-component segments for which no segment download has been initiated and attempts to utilize the available views to perform view synthesis using a single reference view.
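A minimal sketch of the per-segment scheduling decision described above is given below; the function and variable names are assumptions, and the logic simply chooses the current bounding views plus a pre-fetch view when the predicted position leaves the current virtual view range.

# Hypothetical sketch of the view-scheduling decision: given the current and
# predicted viewpoints, choose left/right references and, if needed, a pre-fetch view.
def schedule_views(current_pos: float, predicted_pos: float):
    """Return (left, right, prefetch) captured-view indices for the next segment."""
    left, right = int(current_pos), int(current_pos) + 1   # current virtual view range
    prefetch = None
    if not (left <= predicted_pos <= right):               # predicted to leave the range
        # Pre-fetch the captured view that, together with one of the current
        # references, bounds the predicted viewpoint position.
        prefetch = int(predicted_pos) if predicted_pos < left else int(predicted_pos) + 1
    return left, right, prefetch

# Example: viewing at position 1.5 and predicted to move to 2.3 -> pre-fetch view 3.
print(schedule_views(1.5, 2.3))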

4.5 Virtual View Quality-aware Rate Adaptation

4.5.1 Rate Adaptation Based on Empirical Virtual View Quality Measurements

To decide on the best combination of representations that maximizes the quality of rendered views given the current network bandwidth observed by the client, the adaptation module needs to have access to a rate-distortion (R-D) model for synthesized views associated with each MVD video. This model describes the relationship between the average bit rates of the texture and depth components and the quality of synthesized views. This need not be a continuous function since, based on DASH, the components are encoded at a discrete set of bit rates. This information can be conveyed within the MPD file downloaded by the client. Assuming each component stream is encoded at L different bit rates, the search space for finding the best bit rate combination is of size L^4. We refer to each bit rate combination as an operating point. To reduce this search space, we assume that the streams of the depth components are encoded using the same set L = {1, ..., L} of bit rates and that the streaming client chooses the same bit rate for the reference views' depth components. Therefore, for each l ∈ L we generate an R-D surface by rendering the virtual view using different bit rate combinations for the texture components of the reference views. An example R-D surface for the Kendo MVD video sequence is shown in Figure 4.3. This R-D surface is generated for the virtual view at camera 2 using views from cameras 1 and 3 as reference. The rate adaptation logic searches the R-D surface corresponding to each of the depth bit rates to find the best operating point within that surface with a total bit rate that does not exceed the estimated throughput. By ordering the set of best operating points obtained from the surfaces, the module can find the optimal operating point. The rate adaptation algorithm presented in this section is able to select the optimal operating point among the available representation bit rates because it relies on empirical models based on actual measured distortion values. Creating such empirical models, however, requires generating a synthesized view for each combination of representations for the pairs of neighbour reference views and measuring the average distortion in each iteration.


Assuming an MVD video with M captured views, this requires (M − 1)·K·L^4 decode-synthesize iterations, where K is the number of supported virtual views. In addition, these empirical models need to be communicated to the client at the beginning of the streaming session. Adding L^4 values for each segment index in the MVD video's MPD file incurs additional overhead and delays the start-up of playback. This is especially the case for long duration videos with a large number of segments.

Figure 4.3: Kendo sequence R-D surface for the virtual view at the camera 2 position, using reference cameras 1 and 3 and an equal depth bit rate of 1 Kbps.
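To make the surface-based search of this section concrete, the following is a minimal sketch with hypothetical data structures: for each candidate depth bit rate, the measured R-D surface is scanned and the best feasible operating point under the throughput budget is kept.

# Hypothetical sketch of the empirical R-D surface search described above.
def best_empirical_point(surfaces, depth_rates, budget):
    """surfaces[l][(r_left, r_right)] = measured virtual view quality (e.g., PSNR)."""
    best = None  # (quality, r_left, r_right, depth_rate)
    for l, d_rate in enumerate(depth_rates):
        for (r_left, r_right), quality in surfaces[l].items():
            total = r_left + r_right + 2 * d_rate     # both references share the depth rate
            if total <= budget and (best is None or quality > best[0]):
                best = (quality, r_left, r_right, d_rate)
    return best

# Toy example with two depth levels and a 2x2 grid of texture rates per surface (Kbps).
surfaces = [
    {(500, 500): 30.1, (500, 1000): 31.0, (1000, 500): 31.2, (1000, 1000): 32.4},
    {(500, 500): 30.6, (500, 1000): 31.4, (1000, 500): 31.6, (1000, 1000): 32.9},
]
print(best_empirical_point(surfaces, depth_rates=[200, 400], budget=2600))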

4.5.2 Rate Adaptation Based on Analytical Virtual View Quality Models

We now propose a rate adaptation method for DASH-based FVV streaming systems based on analytical virtual view quality (or distortion) models. Such models provide an estimate for the quality of a synthesized view based on the distortions of the reference views' components. Using a virtual view distortion model, the adaptation module can quickly calculate the expected distortion of each supported virtual view given a certain operating point for the immediate reference views. An operating point is a combination of representations for the components of the scheduled reference views. The rate adaptation module then chooses the representations corresponding to the operating point which minimizes the expected distortion over a set of virtual views and satisfies the available network bandwidth. In the following, we derive a relation for the virtual view distortion based on the qualities of the reference views' components. We note that our system can also support other virtual view distortion models provided that the model gets signalled within the MPD file. For example, models such as [22] and [123] can be adapted and used in our system.

Let S_v be the virtual image synthesized from the original (uncompressed) texture images and original depth maps, S̄_v the virtual image synthesized from the original texture images and compressed depth maps, S̃_v the virtual image synthesized from the compressed texture images and original depth maps, and Ŝ_v the virtual image synthesized from the compressed texture images and compressed depth maps. The distortion of a synthesized virtual view, in terms of mean squared error (MSE), can be expressed as

D_v = E[(S_v − Ŝ_v)²]
    = E[{(S_v − S̄_v) + (S̄_v − Ŝ_v)}²]
    = E[(S_v − S̄_v)²] + E[(S̄_v − Ŝ_v)²] + 2E[(S_v − S̄_v)(S̄_v − Ŝ_v)]
    ≈ E[(S_v − S̄_v)²] + E[(S̄_v − Ŝ_v)²],        (4.4)

where E(·) denotes the expectation taken over all pixels in one image. In Eq. (4.4), E[(S_v − S̄_v)²] represents the view synthesis distortion induced by depth map compression (with the original texture video), and E[(S̄_v − Ŝ_v)²] represents the view synthesis distortion induced by texture video compression (with the decoded depth maps). The cross term 2E[(S_v − S̄_v)(S̄_v − Ŝ_v)] can be neglected since the distortions induced by texture video and depth map compression are not correlated [137]. To minimize the quality degradation resulting from unoccluded regions appearing in the virtual view, 3D-warping-based DIBR algorithms generally use two reference views, the left and right adjacent camera views, to synthesize the virtual image. The virtual view synthesis is expressed as

I_V(x_V, y_V) = ω_L·I_L(x_L, y_L) + ω_R·I_R(x_R, y_R),        (4.5)

where I_V(x_V, y_V), I_L(x_L, y_L), and I_R(x_R, y_R) are the pixel values of matching points in the virtual view, left reference view, and right reference view, respectively, and ω_L and ω_R are distance-dependent blending weights satisfying ω_L + ω_R = 1. Hence, the virtual view distortion using this double-warping DIBR technique can be expressed as

D_v = ω_L²·D_v^L + ω_R²·D_v^R,        (4.6)

where D_v^L and D_v^R are the virtual view distortions induced by the left and right reference views [136], respectively, and each can be modeled using Eq. (4.4). Using power spectral density (PSD) and Gaussian modeling of the depth map [94], the depth-induced term E[(S_v − S̄_v)²] in Eq. (4.4) can be modeled as

E[(S_v − S̄_v)²] = ϱ · E[∆P²],        (4.7)

where ϱ is a linear parameter associated with the texture image content of the reference view, and ∆P is the warping position error of the reference view; ϱ represents the motion

sensitivity factor for the reference view [76]. The warping position error at point (x, y) can be formulated as

∆P(x, y) = (f·δ / 255) · (1/Z_near − 1/Z_far) · e(x, y),        (4.8)

where f is the focal length of the cameras, δ is the horizontal distance between the virtual viewpoint and the reference view (left or right), e(x, y) is the error between the original and the compressed depth map at point (x, y), and Z_near and

Z_far are the nearest and farthest depth values in the scene, respectively. Hence, from Eq. (4.7) and Eq. (4.8), the depth-coding-induced distortion can be represented as

E[(S_v − S̄_v)²] = Φ·δ²·E[e(x, y)²]·ϱ
               = Φ·δ²·D_d·ϱ,        (4.9)

where D_d is the compression distortion for the depth component of the reference view, and Φ is a constant expressed as

Φ = ((f/255) · (1/Z_near − 1/Z_far))².        (4.10)

The term E[(S̄_v − Ŝ_v)²] in Eq. (4.4) can be considered as the compression distortion D_t for the texture component of the reference view. Therefore, Eq. (4.6) can be re-written as

D_v = ω_L²·(D_t^L + Φ·δ_L²·D_d^L·ϱ_L) + ω_R²·(D_t^R + Φ·δ_R²·D_d^R·ϱ_R)
    = λ·D_t^L + µ·D_d^L + ν·D_t^R + ξ·D_d^R + c.        (4.11)

It is therefore sufficient to find the values of the coefficients λ, µ, ν, ξ and the constant c in Eq. (4.11) for each supported virtual viewpoint and communicate them to the client. To obtain the values of these coefficients, we use the multiple linear least squares regression function in Matlab [4] and a small sample of n R-D points to solve the system of linear equations given in Eq. (4.12). The values of these coefficients are then added to the MPD file in a VVRDModel element, as shown in the example in Listing 4.1 for one segment and one virtual view range with three virtual view positions.

Listing 4.1: Model parameters in the MPD file.
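The coefficients can equivalently be fitted with any ordinary least-squares solver; the following NumPy sketch (hypothetical variable names, used here in place of the Matlab function mentioned above) solves the system of Eq. (4.12) from n sampled operating points.

import numpy as np

# Hypothetical fitting of the virtual view distortion model of Eq. (4.11):
# D_v ~ a0*Dt_L + a1*Dt_R + a2*Dd_L + a3*Dd_R + a4, arranged as in Eq. (4.12).
def fit_vvrd_model(dt_l, dt_r, dd_l, dd_r, d_v):
    """Each argument is a length-n array of sampled distortions for one virtual view."""
    A = np.column_stack([dt_l, dt_r, dd_l, dd_r, np.ones(len(d_v))])
    coeffs, *_ = np.linalg.lstsq(A, d_v, rcond=None)
    return coeffs  # coefficients in the chosen column order, plus the constant term

# Toy example with n = 5 sampled R-D points (arbitrary numbers).
rng = np.random.default_rng(0)
dt_l, dt_r, dd_l, dd_r = rng.uniform(1, 10, size=(4, 5))
d_v = 0.4 * dt_l + 0.3 * dt_r + 0.1 * dd_l + 0.1 * dd_r + 0.5
print(fit_vvrd_model(dt_l, dt_r, dd_l, dd_r, d_v))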

Algorithm 4: FindBestOperatingPoint
Input: Set P of operating points for the components of the left and right reference views
Input: Bandwidth constraint R_c
Input: Set A of α values corresponding to the K virtual view positions
Output: Operating point p* which minimizes the average distortion over all virtual views
 1: p* ← ∅, D_min ← ∞
 2: foreach p ∈ P do
 3:     if R(p) ≤ R_c then
 4:         D_sum ← 0
 5:         for i ← 0 to K do
 6:             α ← A(i)
 7:             D_sum ← D_sum + D(p, α)
 8:         D_avg(p) ← D_sum / K
 9:         if D_avg(p) < D_min then
10:             D_min ← D_avg(p)
11:             p* ← p
12:         end
13: if p* == ∅ then
14:     p* ← lowest quality representations
15: return p*
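For readers who prefer running code, here is a minimal Python sketch (hypothetical data structures) that mirrors Algorithm 4; D(p, α) is assumed to come from the model of Eq. (4.11), and the fallback picks the lowest-rate operating point.

# Hypothetical sketch mirroring Algorithm 4: pick the operating point that
# minimizes the average modeled virtual view distortion under the rate budget.
def find_best_operating_point(points, rate_of, distortion_of, alphas, r_c):
    """points: candidate operating points; rate_of(p): total bit rate of p;
    distortion_of(p, alpha): modeled distortion D(p, alpha), e.g., from Eq. (4.11)."""
    best, d_min = None, float("inf")
    for p in points:
        if rate_of(p) > r_c:                 # infeasible under the bandwidth budget
            continue
        d_avg = sum(distortion_of(p, a) for a in alphas) / len(alphas)
        if d_avg < d_min:
            best, d_min = p, d_avg
    # Fallback when nothing is feasible: cheapest (lowest-rate) operating point.
    return best if best is not None else min(points, key=rate_of)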

⎡ D_t,1^L   D_t,1^R   D_d,1^L   D_d,1^R   1 ⎤ ⎡ a_0 ⎤     ⎡ D_v,1 ⎤
⎢ D_t,2^L   D_t,2^R   D_d,2^L   D_d,2^R   1 ⎥ ⎢ a_1 ⎥     ⎢ D_v,2 ⎥
⎢    ⋮         ⋮         ⋮         ⋮      ⋮ ⎥ ⎢ a_2 ⎥  =  ⎢   ⋮   ⎥        (4.12)
⎣ D_t,n^L   D_t,n^R   D_d,n^L   D_d,n^R   1 ⎦ ⎢ a_3 ⎥     ⎣ D_v,n ⎦
                                              ⎣ a_4 ⎦

The rate adaptation logic therefore proceeds as given in Algorithm 4, where R(p) is the total bit rate of operating point p and D(p, α) is the corresponding distortion at virtual view position α. After deciding on a reference view schedule (Section 4.4), the client invokes the rate adaptation logic to determine how the available bandwidth will be distributed amongst the video components of the scheduled views. In the case of stationary viewing, where the user does not navigate much around a certain viewpoint, the client schedules only two reference views and the rate adaptation algorithm evaluates, for each operating point, the corresponding average estimated distortion of the virtual views between the two reference views. The representations corresponding to the operating point which results in minimal distortion are then chosen by the algorithm. For segment durations where an additional view will be pre-fetched, a few minor modifications to the algorithm are required. In the case of a scheduled pre-fetch view, an operating point p will involve six components instead of four: four components for the left and right reference views, and two for the pre-fetched view.


The rate adaptation logic will first calculate the average distortion for the virtual views in the current view range (D_avg^c) and the average distortion for the expected view range (D_avg^e) separately. Hence, lines 4 to 7 in Algorithm 4 will be repeated for the expected view range. Since the viewer will navigate between two virtual view ranges, the value assigned to D_avg in line 8 will be replaced with the weighted sum (1 − β)·D_avg^c + β·D_avg^e, where β determines the likelihood of the user navigating to the expected view range. When β equals 0.5, the weighted sum becomes the average distortion over all virtual views of the two ranges.

Figure 4.4: Architecture of our free-viewpoint video streaming system.

4.6 System Architecture and Client Implementation

The architecture of our DASH-based free-viewpoint video streaming system is shown in Figure 4.4. In this section, we describe a complete system that implements the proposed virtual view quality-aware rate adaptation algorithm. Our system contains two main entities: the content server and the FVV streaming client. In the following, we describe the roles of each of these two entities and the implementation of the different components of the FVV streaming client.

4.6.1 Content Server

The free-viewpoint videos are captured using a camera array covering the scene from multiple viewpoints. To generate the depth information, an additional depth camera may be provided for each viewpoint of the camera array. Alternatively, depth maps can be generated at a later time using one of the known depth estimation methods [98]. The content server stores the videos in the MVD representation format, where each captured view is composed of two separate streams: a texture stream and an associated depth stream.


Figure 4.5: The components of our DASH client prototype.

We refer to these streams as the components of the view. The server also runs a standard HTTP Web server process that handles requests from DASH clients. Similar to 2D-based DASH streaming systems, each component is encoded at different bit rates (quality levels) using standard 2D video codecs, such as H.264/AVC and HEVC [57]. To synchronize the different component videos, the same frame rate and GOP size is used for all components. The resulting streams constitute different representations of the component at different qualities. The representations are then segmented according to the DASH standard [56] to generate segments of equal duration. A media presentation descriptor (MPD) file is also generated; it describes the MVD content and all its components as well as other information needed to support the view synthesis process and rate adaptation logic on the client side. We now describe the proposed MPD structure utilized by our FVV streaming system. MPD Structure. As mentioned in Section 2.6, the MPD file in DASH provides the client with a description of all available components of the media content as well as per-segment and per-representation information which enables the client to make adaptation decisions at each segment download time. In our FVV streaming system, each Period element in the MPD file is divided into three sections: component adaptation sets, camera parameters, and virtual view quality models. We note here that each Period element represents a scene within the MVD content and that each scene may be captured by a different camera arrangement and, therefore, may contain a different number of captured views. Without loss of generality, we assume in the following an MVD video with a single Period element for simplicity. Unlike traditional single-view 2D content which contains only a single video stream, MVD content contains multiple video streams (one for each component of each view). Therefore, in our proposed system, each texture or depth component of a

captured view will have its own AdaptationSet element within the Period elements of the MPD file. To uniquely identify each view, we use a Viewpoint descriptor element within the AdaptationSet elements of each of the view's components. To identify the type of the component, we use the Role descriptor element with @value="t" for texture streams and @value="d" for depth streams. The MPD file needs additional information about the MVD video sequence to support client-side view synthesis. This information includes the intrinsic and extrinsic parameters of the cameras capturing the scene, as well as the closest and furthest depth values. We use a CameraParameters element, as shown in Listing 4.2, to signal the camera parameters for each captured view. The views are given sequential identifiers based on a left-to-right order and are represented using a View child element.

Listing 4.2: Camera parameters for a multi-view video with two views.

Since our rate adaptation module utilizes a distortion model to estimate the quality of virtual views, it is necessary that the streaming client gains access to quality information for the component streams. A number of MPEG proposals, e.g., [140], were recently presented for signalling quality information in DASH. Quality information can be signalled either at the MPD level or the media container level. In the former case, the MPD would contain additional metadata sets with metadata representations having associations to corresponding media representations in the adaptation sets. Therefore, each metadata segment is always associated with a media segment and both are time aligned. The association between the different metadata elements and their corresponding media elements within the MPD can be achieved by sharing the same id values, for example. Alternatively, communicating quality information to the client can be achieved using metadata tracks within the media container file format itself. This, however, requires modifications to the demuxers used by the decoder to support the syntax of the additional quality metadata tracks. We use a simpler approach in which the average quality of a component stream is provided using additional attributes in the corresponding Representation element. For example, each video Representation element in the MPD file may have avgPSNR and avgSSIM attributes holding the average values of the PSNR and SSIM quality metrics, respectively. Listing 4.3 provides an example

of an adaptation set for the texture component of one of the captured views with one sample representation.

Figure 4.6: The user interface of our DASH client prototype.

Listing 4.3: AdaptationSet for texture component with two representations.

4.6.2 FVV DASH Client

The FVV DASH streaming client is composed of a number of modules: MVD-DASH Manager, Segment Downloader, Segment Decoder, View Scheduler, MVD Rate Adaptor and Renderer, as illustrated in Figure 4.5. Our streaming client maintains a buffer of multi-component segments. A multi-component segment is a logical container for the segments of the components of the reference views, possibly including a pre-fetched view. All segments within a multi-component segment have the same duration and correspond to the same segment index and time duration in the presentation timeline. The client is implemented using the actor-based concurrent programming model which relies on message passing between the different actors. Because actors do not share state and messages are sent asynchronously, the different modules do not compete for locks, which significantly improves the performance of the streaming client. Figure 4.6 shows the user interface of a prototype implementation of our client. In the following, we discuss the roles of the various modules in more detail.

Rendering Module

The renderer generates the final frames that will be displayed to the user. Based on the user’s current viewpoint, this may require performing view synthesis to generate a virtual view if the requested viewpoint is not at a captured view position. The renderer contains six frame buffers: two for the components of the left reference view, two for the components of the right reference view, and two for the components of the pre-fetched view. To synchronize the frame buffers when segments for one or more components are missing, the frame buffers are filled with dummy frames. In addition to the content of the decoded frame, each buffered frame has associated metadata indicating the index of the view to which the frame belongs and a flag indicating whether the frame is a dummy frame or not. Figure 4.7 illustrates the frame buffers for the texture components and how synchronization is achieved using dummy frames. The renderer picks one frame from each of the reference frame buffers and generates the final frame to be displayed based on the user’s current viewpoint position as follows. Given the current viewpoint position, the renderer finds the indices of the immediate left and right captured views and compares them against the indices of the views associated with the frames from the left and right reference buffers, respectively. If the indices match, the frames in the left and right reference buffers are used to synthesize the virtual view. Otherwise, the renderer checks whether a non-dummy pre-fetch frame is available. If that is the case, the view index of the pre-fetch frame is used to determine the relative position of the pre-fetch frame to the left and right reference frames. If only a dummy pre-fetch frame is available, then view synthesis is performed using only one reference frame, the one closest to the virtual view position.
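The following sketch (hypothetical frame metadata and function names) illustrates the renderer's per-frame reference-selection logic described above: use the buffered left and right frames when their view indices bound the current viewpoint, otherwise fall back to the pre-fetched frame or to a single closest reference.

# Hypothetical sketch of the renderer's reference selection for one output frame.
def select_references(viewpoint, left_frame, right_frame, prefetch_frame):
    """Each *_frame is a dict like {"view": int, "dummy": bool}."""
    want_left, want_right = int(viewpoint), int(viewpoint) + 1
    if left_frame["view"] == want_left and right_frame["view"] == want_right:
        return [left_frame, right_frame]                 # normal double-warping case
    if prefetch_frame is not None and not prefetch_frame["dummy"]:
        # Pair the pre-fetched frame with whichever current reference still
        # bounds the viewpoint position.
        other = right_frame if right_frame["view"] in (want_left, want_right) else left_frame
        return sorted([prefetch_frame, other], key=lambda f: f["view"])
    # Single-reference synthesis with the closest available reference.
    return [min((left_frame, right_frame),
                key=lambda f: abs(f["view"] - viewpoint))]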


Figure 4.7: Frame buffers for the texture components of the reference streams. Dummy frames are inserted into the pre-fetch stream’s buffer to synchronize the buffers when no pre-fetch segments are needed.

For uninterrupted playback, the levels of the frame buffers are maintained above a certain threshold. Whenever the level of one of the frame buffers drops below the threshold, the renderer requests a batch of frames from the controller to avoid buffer underflows. If the user's viewpoint falls at a virtual view position, the renderer performs depth-based view synthesis to generate the target view using the available reference view frames. It is important that the chosen view synthesis algorithm can run in real time in order to keep up with the frame rate of the 3D video. We developed a DIBR implementation which exploits graphics processing units (GPUs) to speed up the view synthesis process for 1D-parallel camera arrangements, where cameras are aligned in a straight line perpendicular to their optical axes. The rendering module uses the OpenGL graphics API [6] to perform the different stages of the view synthesis process. The frames are uploaded to the GPU's memory and shader programs perform the warping, blending, and hole-filling steps. Our implementation achieved a 30 fps frame rate for full high definition (HD) resolution on an NVIDIA GeForce 560 Ti GPU. DIBR based on 3D warping finds correspondences between pixels in the reference view and a target virtual view by first mapping each reference view pixel to a point in the 3D coordinate space and then projecting these points back to the image plane of the virtual view. A 1D-parallel camera arrangement simplifies finding the correspondence between the pixels of any two views. A pixel in the reference view is mapped to a new position in a virtual view, corresponding to a camera with horizontal shift b, by applying only a horizontal disparity δ which depends on the pixel depth z. The depth z is usually stored in the depth map of

the reference view as an 8-bit unsigned integer d using the following non-linear mapping:

1/z = (d/255) · (1/z_near − 1/z_far) + 1/z_far,        (4.13)

where z_near and z_far are the minimal and maximal depth values in the reference picture. The horizontal disparity δ is calculated as

δ = f·b / z,        (4.14)

where f is the focal length of the camera. Any pixel with coordinates (u_r, v_r) in the reference picture only needs to be shifted to the coordinates (u, v) in the virtual view, with u = u_r ± δ and v = v_r. The shift is positive when warping an image from right to left and negative when warping from left to right. Since δ is not constrained to take integer values, proper re-sampling and interpolation are required to fill the integer grid corresponding to the pixels of the virtual view. Moreover, along object edges, disocclusions may show up, uncovering portions of objects (typically the background) that are occluded in the reference view. Our OpenGL-based implementation of the view synthesis pipeline includes three shader programs, one for each stage of the pipeline, as shown in Figure 4.8. A shader is a user-defined program designed to run on some stage of a GPU. It is executed independently and in parallel over all vertices or pixels and performs various transformations on them in order to implement custom rendering algorithms. Our shader programs are implemented using the OpenGL Shading Language (GLSL). Each stage in our view synthesis pipeline is implemented using two shader programs: a vertex shader (.vert) and a fragment shader (.frag). The vertex shader is executed once per vertex and can transform its position, while the fragment shader is executed once per pixel and can transform its final color. Except for the final stage, the output of each stage is rendered offscreen using OpenGL constructs known as framebuffer objects. OpenGL texture objects (not to be confused with the texture component in the context of video-plus-depth representations) are attached to a bound framebuffer object, and when an OpenGL shader program executes, the result is rendered to the attached texture objects. This process is known as render-to-texture in OpenGL. The resulting texture object therefore contains the partially processed frame and can then be used as input to the next stage in the view synthesis pipeline. In our implementation, the synthesized views generated from the left and right references are rendered to two separate OpenGL texture units. These texture units are configured to use a four-channel RGBA format: three channels for the color and one alpha channel. We use the alpha channel as a binary map to indicate which pixels in each target image have been assigned a color by the view synthesis method. We refer to this map as the hole map. The blending shader program then takes the synthesized frames of the two OpenGL texture units and generates a merged frame based on their respective hole maps.
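As a rough CPU-side illustration of Eqs. (4.13) and (4.14), outside the GPU pipeline, the following NumPy sketch (hypothetical parameter names and values) recovers 1/z from an 8-bit depth map and converts it to a per-pixel horizontal disparity.

import numpy as np

# Hypothetical illustration of Eqs. (4.13)-(4.14): 8-bit depth -> 1/z -> disparity.
def disparity_map(depth_8bit, f, b, z_near, z_far):
    """depth_8bit: HxW uint8 depth map; f: focal length (pixels);
    b: horizontal camera shift; z_near/z_far: scene depth range."""
    d = depth_8bit.astype(np.float64)
    inv_z = (d / 255.0) * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far   # Eq. (4.13)
    return f * b * inv_z                                               # Eq. (4.14): f*b/z

# A warped pixel lands at u = u_r + delta (right-to-left warp) or u_r - delta.
depth = np.full((4, 4), 128, dtype=np.uint8)
print(disparity_map(depth, f=1000.0, b=0.05, z_near=1.0, z_far=10.0)[0, 0])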


Figure 4.8: OpenGL-based view synthesis pipeline.

The output of the blending shader program still contains a small number of holes. In the following stage, the inpainting shader program searches the region surrounding these holes and tries to deduce suitable color values for them.

Segment Downloader

The segment downloader holds references to a number of download actors. When a segment download request is received, the request is forwarded to one of the available actors and that actor is responsible for fetching the contents of the segment from the content server. Each download actor maintains a persistent HTTP connection with the server to reduce delays and overhead associated with establishing a new TCP connection for each segment request. The segment downloader is also responsible for keeping track of the download start and finish times for each component segment and reports this information back to the MVD-DASH manager to estimate the available channel bandwidth. The size of the downloaded data for all requested component segments is accumulated and the time interval between the earliest download start time and last download finish time is calculated. The download throughput for a multi-component segment can therefore be measured by dividing the amount of data downloaded by the download interval.
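A minimal sketch of this throughput estimate, with hypothetical record fields, is shown below: the total downloaded bits are divided by the span from the earliest download start to the latest download finish.

# Hypothetical throughput estimate for one multi-component segment.
def segment_throughput(downloads):
    """downloads: list of (start_time_s, finish_time_s, size_bytes) per component."""
    start = min(d[0] for d in downloads)
    finish = max(d[1] for d in downloads)
    total_bits = 8 * sum(d[2] for d in downloads)
    return total_bits / (finish - start)        # bits per second

# Four component segments (texture/depth for two views) fetched in parallel.
print(segment_throughput([(0.00, 0.40, 150_000), (0.00, 0.35, 60_000),
                          (0.01, 0.45, 140_000), (0.01, 0.30, 55_000)]))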

Segment Decoder

Similar to the segment downloader, this module maintains a number of actors which are responsible for decoding individual component segments. It receives decode messages containing a downloaded DASH segment for a texture or depth component of one of the views and forwards it to one of the decoding actors. Decoding actors launch decoding threads for each of the segments and decoded frames are returned to the controller for buffering.

View Scheduler

The view scheduler maintains a scheduling window which determines the reference views to be downloaded within a time window. View scheduling decisions are based on predictions of future viewpoints, taking into consideration the current viewpoint of the user as well

as historical view navigation positions. The module implements the dead reckoning-based view scheduling method described in Section 4.4.

MVD Rate Adaptor

This module is responsible for responding to variations in network conditions by allocating the available bandwidth between the component segments of the views scheduled for download. The rate adaptor takes advantage of virtual view distortion models and implements the proposed rate adaptation algorithm described earlier in Section 4.5.

MVD-DASH Manager

This module is the controller which orchestrates the interaction between the different components of the player. When a multi-component segment is to be downloaded, the manager invokes the MVD rate adaptor to decide on the set of component representations that will be requested from the server. Based on the decision returned from the adaptation module, the MVD-DASH manager sends a segment download message with the chosen representations to the segment downloader. The downloaded DASH segments for scheduled reference views belonging to the same segment index are aggregated into multi-component segments which are then placed into the segment buffer. Each multi-component segment fetched from the segment buffer is sent to the segment decoder module for decoding. Similar to the rendering module, the controller also maintains six frame buffers to hold decoded segment frames. In the case where only two reference views are being requested, i.e., without a pre-fetch view, the decoders for the pre-fetch view components generate dummy frames for those components to synchronize the number of frames across the buffers.

4.7 Evaluation

To evaluate the proposed FVV DASH rate adaptation algorithms, we implemented a complete DASH-based FVV streaming client using C++ and libdash [3], an open-source library which implements the MPEG-DASH standard as defined by ISO/IEC 23009-1 [56]. The implementation includes the proposed virtual view quality-aware R-D-based rate adaptation algorithm and reference view scheduling method. The architecture and details of the implementation were presented in Section 4.6. In our evaluation experiments, we used three MVD sequences from the MPEG 3DV ad-hoc group data sets [55] [5] which have different characteristics: Kendo, Balloons, and Café. The resolution for the Kendo and Balloons sequences is 1024 × 768 and the resolution of the Café sequence is 1920 × 1080. The Kendo and Balloons sequences have moving cameras while the cameras in the Café sequence are fixed.

4.7.1 Content Preparation

We extended the length of the video sequences from 10 to 30 seconds by repeating the frame sequence. For each MVD video, we chose three cameras from the set of captured views and we allow three virtual views within each virtual view range, for a total of 6 supported virtual view positions. The system therefore supports 9 view positions (including captured views). The video streams for the texture and depth components of each camera were then encoded using the H.264/AVC video coder as follows. For the evaluation of rate adaptation based on empirical models, we used a constant bit rate (CBR) encoder configuration at ten different bit rate levels ranging from 200 Kbps to 2 Mbps with a step of 200 Kbps. For the case of rate adaptation based on analytical models, the video streams for the texture and depth components of each camera were encoded using two configurations. In the first configuration, we use the variable bit rate (VBR) setting of the H.264/AVC encoder with quantization parameter values ranging from 24 to 44 with a step of 4. The average bit rates of each of the components' representations for the Kendo and Café sequences are given in Table 4.2 and Table 4.3, respectively. In the second configuration, CBR encoding was used with bit rate values ranging from 250 Kbps to 1.5 Mbps with a step of 250 Kbps. We used the GPAC framework [71] to generate one second segments for the different representations of each component. For each segment index, we generate virtual view quality models for all supported virtual view positions as described in Section 4.5. We generate two quality models for each virtual view position: one based on 100 operating point samples, and one based on 40 samples. To validate the generated quality models, we calculate the coefficient of determination (R²) and the average absolute fitting error. Table 4.1 shows the results for the virtual view at position 2 for 10 segments of the Kendo and Balloons video sequences. The results indicate that the derived virtual view quality models are good fits for the empirical quality values obtained for all operating points of the encoded video sequences. Finally, we developed Python scripts to parse the MPDs of the component streams and generate a single MPD file for the MVD video based on the structure described in Section 4.6.1. In the evaluation experiments, the client's multi-component segment buffer capacity was set to 3 segments and the rendering module's frame buffer capacity to 300 frames.
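For reference, the two validation metrics used above can be computed as in the following sketch (hypothetical variable names and toy numbers).

import numpy as np

# Hypothetical validation of a fitted quality model: coefficient of determination
# (R^2) and average absolute fitting error against measured virtual view qualities.
def validate_model(measured, predicted):
    measured, predicted = np.asarray(measured), np.asarray(predicted)
    ss_res = np.sum((measured - predicted) ** 2)
    ss_tot = np.sum((measured - measured.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    avg_abs_err = np.mean(np.abs(measured - predicted))
    return r2, avg_abs_err

# Toy example (arbitrary PSNR-like values, in dB).
print(validate_model([30.1, 31.0, 31.8, 32.4], [30.0, 31.2, 31.7, 32.6]))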

4.7.2 Experimental Setup

Our test environment setup includes one physical machine and two virtual machines (VMs), as shown in Figure 4.9. The server storing and streaming the content is a VM running the Apache2 Web server [1]. The second VM, running FreeBSD 7.3 and the KauNet network emulator [39], is placed between the server and the client and is configured with two network interfaces. Two virtual networks are set up to connect the two interfaces to the server and the client, respectively.

Table 4.1: Coefficient of determination and average absolute fitting error for virtual view quality models generated from 100 operating points at view position 2 of the Kendo and Balloons sequences (encoded using VBR).

                 Kendo                 Balloons
Seg. Index   R²       Avg. Error   R²       Avg. Error
11           0.9780   0.1787       0.9721   0.1643
12           0.9765   0.1863       0.9770   0.1474
13           0.9796   0.1707       0.9736   0.1598
14           0.9756   0.1738       0.9789   0.1537
15           0.9722   0.1722       0.9798   0.1530
16           0.9715   0.1542       0.9812   0.1607
17           0.9723   0.1454       0.9821   0.1528
18           0.9718   0.1725       0.9829   0.1549
19           0.9763   0.1597       0.9836   0.1589
20           0.9767   0.1557       0.9845   0.1544

Table 4.2: Kendo sequence representation bitrates (bps).

                  Camera-1             Camera-3             Camera-5
Rep. Id   QP   t         d          t         d          t         d
1         44   298660    201855     294604    163524     295086    234109
2         40   422265    304327     417409    245342     417551    346308
3         36   612393    451252     607739    358878     607991    509153
4         32   913832    677638     901649    533684     903657    766869
5         28   1404159   1017229    1387540   788757     1390379   1153091
6         24   2304750   1524505    2300003   1165825    2305187   1724321

Table 4.3: Café sequence representation bitrates (bps).

                  Camera-2             Camera-3             Camera-4
Rep. Id   QP   t         d          t         d          t         d
1         44   342501    169776     344584    168252     337362    169260
2         40   493704    262633     497246    258166     485853    261458
3         36   734593    419286     740923    411769     718603    416571
4         32   1124944   661603     1145868   654460     1098888   656130
5         28   1862387   1006185    1925930   1005778    1810588   995894
6         24   3582198   1478472    3902100   1496876    3525769   1471927


Figure 4.9: Evaluation testbed.

The KauNet VM is responsible for controlling the available bandwidth between the client and the server based on input bandwidth change patterns. A trigger pattern is configured to send a periodic signal to the client to synchronize the start of the streaming session with the beginning of the bandwidth change pattern. Each time the trigger pattern is invoked, KauNet sends an event to the client indicating the beginning of the bandwidth change pattern. We perform two sets of experiments to evaluate the performance of our proposed rate adaptation method. In the first set of experiments, using objective video quality metrics, we compare the quality of our rate adaptation method based on analytical virtual view quality models against the rate adaptation strategy used in [113] and the optimal rate allocation obtained by using empirical virtual view quality models, as presented in Section 4.5. To obtain a fair comparison, the bandwidth estimates resulting from a playback run using the equal allocation strategy presented in [113] are recorded and used as input to the rate adaptation logic in subsequent runs. We first evaluate the client's behaviour at a fixed network bandwidth and a fixed viewing point (the center virtual view position of the first virtual view range). We then assess the behaviour of our streaming client when the viewpoint is fixed while the bandwidth is varied. This enables us to assess the response of the virtual view rate-distortion allocation algorithm in isolation from the view switching prediction logic. In the second set of experiments, we conduct a subjective quality assessment study of the results. In these experiments, subjects were asked to compare the qualities of virtual view videos generated using our proposed approach and the approach presented in [113].

4.7.3 Empirical Quality Models Results

Figure 4.10 demonstrates the behavior of the client using different bit rate allocation strategies while streaming the Kendo MVD video sequence. In Figure 4.10a, the player was using a naive equal bit rate allocation in which the available bandwidth is divided equally between the four components being streamed. In Figure 4.10b, we used the R-D based rate allocation strategy in which the R-D surfaces for the virtual view are searched to find the optimal operating point.


Figure 4.10: Client response using different rate allocation strategies: (a) fair allocation; (b) R-D-based allocation.

As can be seen from the figures, the player using the unequal bit rate allocation strategy is able to achieve smoother total bit rate changes during playback. To evaluate the effect of different bit rate allocation strategies on the quality of the rendered video, we compare the quality of the rendered virtual view to that of the original video sequence captured at that viewpoint. The average peak signal-to-noise ratio (PSNR) obtained for the Kendo MVD sequence using fair bit rate allocation was 28.41 dB, compared to 29.95 dB using R-D-based rate allocation. Moreover, from Figure 4.10, we can see that this is achieved with a lower total bit rate for the chosen representations. This indicates the importance of intelligent bit rate allocation between the different components used in generating virtual views in the context of FVV streaming.

4.7.4 Analytical Quality Models Results

Fixed Bandwidth

For the fixed bandwidth experiments, we set the available network bandwidth to a fixed value and repeat the streaming session using one of the rate adaptation methods being evaluated in each run. The network bandwidth values used in the evaluation are 1, 2, 4, 5, and 6 Mbps. We fix the view angle at the middle viewpoint between the first two captured camera views (view 2 for Balloons and Kendo, and view 2.5 for Café). Figures 4.11 and 4.12 show the resulting average virtual view quality, in terms of peak signal-to-noise ratio (PSNR), for segments 11 to 20 of the Balloons and Café videos, respectively, using CBR encoding. The quality of the rendered virtual view is measured against the virtual view synthesized at the same position using the original uncompressed reference streams. In the figures, Opt. refers to the optimal quality obtained using empirical quality measurements (Section 4.5.1), Equal refers to the equal rate allocation strategy used in [113], and VVRD 40 and VVRD 100 refer to our proposed approach, where the numbers indicate the number of samples used to generate the virtual view quality model.

Table 4.4: Bandwidth change patterns.

Kendo   Time (sec.)        0     3     10    20    25    30
        Bandwidth (Mbps)   1.5   2.5   3     2     1     2.5
Café    Time (sec.)        0     5     10    15    25    30
        Bandwidth (Mbps)   1.5   2.5   4.5   3     2.5   3.5

It can be seen that even at low bandwidth conditions, our rate adaptation approach achieves significant quality gains (up to 4 dB for Balloons and 2.8 dB for Café). We notice that the improvement achieved by our algorithm is more significant when the bandwidth is not too large (around 2 Mbps), which is the common case. For larger bandwidth values (e.g., more than 6 Mbps), there is little room for optimization and most adaptation algorithms would yield similar or very close qualities. By comparing the results for VVRD 100 and VVRD 40, we can see that the quality improvements can be achieved even when a small number of operating points is used to obtain the model coefficients, which means that our algorithm does not impose significant computational overheads on the servers. The model coefficients are computed only once for each video (not with every streaming session of the same video) and stored in metafiles. Our proposed virtual view quality-aware rate adaptation method also results in quality improvements when the reference views are encoded using VBR. Figure 4.13 shows the average virtual view quality for the Balloons sequence when the reference views are encoded using VBR. Our method improves the quality of the rendered stream with gains of up to 2.23 dB for some segments (with an average gain of 2.03 dB) in the case of 2 Mbps network bandwidth, and up to 2.26 dB (with an average gain of 1.76 dB) in the case of 4 Mbps network bandwidth. We note that even though the reference view representations were encoded for constant quality using VBR, the combination of representations that results in the optimal virtual view quality differs from one segment to another, as can be seen in the figures. In addition to calculating the quality gains in terms of PSNR, we also evaluated the quality gains using the structural similarity (SSIM) index for CBR- and VBR-encoded videos. Our proposed method also outperformed [113] in terms of SSIM; we omit those results due to space limitations.
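For reference, per-frame quality measurements of this kind could be computed as in the following hedged sketch, which assumes frames are available as 8-bit grayscale NumPy arrays and uses scikit-image's PSNR and SSIM implementations; it is illustrative and not the evaluation code used in this chapter.

import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_quality(rendered, reference):
    # PSNR and SSIM of a rendered virtual view frame against a reference
    # frame synthesized from uncompressed components.
    psnr = peak_signal_noise_ratio(reference, rendered, data_range=255)
    ssim = structural_similarity(reference, rendered, data_range=255)
    return psnr, ssim

# Tiny synthetic demo: a gradient frame and a slightly noisy copy of it.
ref = np.tile(np.arange(256, dtype=np.uint8), (256, 1))
noisy = np.clip(ref.astype(int) + np.random.randint(-3, 4, ref.shape), 0, 255).astype(np.uint8)
print(frame_quality(noisy, ref))

Averaging such per-frame values over the frames of a segment would give per-segment figures analogous to those reported above.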

Variable Bandwidth

We now evaluate the streaming client's behaviour when the viewpoint is fixed while the bandwidth is varied. Because the video sequences used in the evaluation have different resolutions, and therefore different bandwidth requirements, we generate two different bandwidth change patterns, as shown in Table 4.4. These patterns are used by the KauNet VM to change the channel bandwidth between the client and server during the streaming session of the corresponding video sequence.

Figure 4.11: Average quality for the Balloons video sequence with CBR encoding and fixed network bandwidth. Panels show PSNR (dB) per segment index (11 to 20) at (a) 2 Mbps, (b) 3 Mbps, and (c) 4 Mbps, for Opt., VVRD 40, VVRD 100, and Equal.

We note that the results presented in this section are for VBR-encoded videos; similar results were obtained for the CBR encoding configuration. Figure 4.14 shows the results for segments 11 to 30 of the Kendo sequence. In Figure 4.14a, the client's estimated channel bandwidth before downloading each segment is compared to the total bit rate of the operating point chosen by our rate adaptation method and that chosen based on [113]. As shown in the figure, because the approach presented in [113] is not flexible in distributing the bit rates of the component segments, due to the tight coupling between them at content generation time, it is unable to efficiently utilize all of the available bandwidth. Our virtual view quality-aware approach, on the other hand, is able to take full advantage of the available bandwidth and improve the quality of the rendered virtual view. Using our rate adaptation algorithm, the client achieves PSNR gains of up to 2.13 dB (Figure 4.14b) and SSIM gains of up to 0.014 (Figure 4.14c) for view 2. Similar results were obtained for the Café sequence using the bandwidth change pattern in Table 4.4, where our proposed rate adaptation approach resulted in PSNR gains of up to 1.09 dB and SSIM gains of up to 0.022, as shown in Figures 4.15b and 4.15c, respectively, for view 2.5.

4.7.5 Subjective Evaluation

We followed the recommendations of ITU-R BT.500-13 [58] to perform subjective quality assessment experiments using the double-stimulus continuous quality-scale (DSCQS) method. In our subjective tests, the qualities of two impaired video sequences are considered in relation to each other. We evaluated a total of 12 test conditions (3 video contents × 2 encoding configurations × 2 bandwidth capacities). Similar to the objective quality evaluation presented in Section 4.7.4, the virtual view at camera position 2 was used for the Kendo and Balloons sequences, and the virtual view at camera position 2.5 was used for the Café sequence. For each test condition, the subjects were presented with two stimuli corresponding to two versions of the synthesized virtual view: one based on our proposed virtual view quality-aware rate adaptation algorithm, and one using the algorithm presented in [113]. The virtual views were generated from the reference segments chosen at 10 segment indices to obtain test stimuli with a duration of 10 seconds, following the BT.500-13 recommendations. A total of 17 subjects participated in our experiments. The subjects were graduate computer science students (12 males and 5 females) at the university, whose ages ranged from 23 to 33 years. All subjects were screened and given written instructions before the test session, and these instructions were also explained verbally to make sure they fully understood the experimental procedure. A comfortable seating arrangement was provided for the subjects at a viewing distance of three to four times the height of the display. The test video sequences were shown on a 60" LG 4K Ultra HD 240 Hz display (model 60UF8500).

The 12 test conditions were shown to the subjects in random order. The order of the two stimuli was also randomized in each test session. For each test condition, one of the stimuli was shown for 10 seconds, preceded by 3 seconds of a mid-grey field indicating the coded name of the stimulus. Another mid-grey field with the coded name of the other stimulus was then shown for 3 seconds, followed by the other 10-second stimulus. This presentation sequence was repeated a second time and followed by 10 seconds of a mid-grey field between different test conditions. The subjects were asked to rate the overall quality of both stimuli and mark their scores on a continuous grading scale. The marks were then mapped to integer values in the range 0-100, and the difference opinion score (= score for the stimulus based on our proposed approach − score for the stimulus based on [113]) was calculated. Finally, we calculated the difference mean opinion score (DMOS) for each test condition. The results were then screened for outliers, where a test subject's voting scores deviate by more than twice the standard deviation in more than half of the test videos. Only one test subject was determined to be an outlier, and their voting scores were removed from the final results. Figure 4.16 shows the DMOS values for the Kendo, Balloons, and Café video sequences at different available network bandwidth conditions. We evaluate two encoding configurations, where the captured reference views are encoded using VBR and CBR. The results indicate that in almost all test conditions, the subjects rated the virtual views synthesized from reference view representations chosen by our proposed rate allocation approach higher than those generated from reference view representations based on an equal bitrate allocation. This is indicated by the positive DMOS values in the figure. Moreover, the quality improvement due to the proposed approach was higher in the case of CBR-encoded reference views, which is the encoding configuration most widely used for DASH content by content providers. For example, for the Balloons video sequence, the subjects rated the results of our rate adaptation method higher than those based on the method presented in [113] by 25 points on average and reported seeing a much clearer image.
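The following is a small, illustrative sketch of the DMOS computation and outlier screening described above; the array names and shapes are assumptions, and it is not the analysis script used for these experiments.

import numpy as np

def dmos_with_screening(scores_proposed, scores_baseline):
    # scores_*: arrays of shape (num_subjects, num_conditions) on a 0-100 scale.
    diff = np.asarray(scores_proposed, float) - np.asarray(scores_baseline, float)
    mean, std = diff.mean(axis=0), diff.std(axis=0)
    # A subject is an outlier if their difference score deviates by more than
    # twice the standard deviation in more than half of the test conditions.
    deviant = np.abs(diff - mean) > 2 * std
    keep = deviant.mean(axis=1) <= 0.5
    return diff[keep].mean(axis=0)  # DMOS per test condition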

4.8 Summary

In this chapter, we presented a complete architecture for HTTP adaptive streaming of free-viewpoint videos. We proposed a novel two-step rate adaptation method that takes into consideration the user's interaction with the scene as well as the special characteristics of multi-view-plus-depth videos and the quality of rendered virtual views. Our rate adaptation method schedules the reference views that will be requested from the server based on the estimated viewpoint position of the user. The representations of the scheduled views are then determined based on virtual view quality models which are generated offline by the content provider from a small number of operating points. Experimental results indicate that our proposed virtual view quality-aware rate adaptation method results in significant quality gains over other rate adaptation approaches (up to 4 dB for CBR streams and up to 2.26 dB for VBR streams), especially at low bandwidth conditions.

Figure 4.12: Average quality for the Café video sequence with CBR encoding and fixed network bandwidth. Panels show PSNR (dB) per segment index (11 to 20) at (a) 2 Mbps, (b) 3 Mbps, and (c) 4 Mbps, for Opt., VVRD 40, VVRD 100, and Equal.

Figure 4.13: Average quality for the Balloons video sequence with VBR encoding and fixed network bandwidth. Panels show PSNR (dB) per segment index (11 to 20) at (a) 2 Mbps, (b) 3 Mbps, and (c) 4 Mbps, for Opt., VVRD 40, VVRD 100, and Equal.

Figure 4.14: Results for the Kendo video sequence with variable network bandwidth. Panels show, per segment index (12 to 30): (a) total segment bitrate and estimated throughput, (b) PSNR (dB), and (c) SSIM.

Figure 4.15: Results for the Café video sequence with variable network bandwidth. Panels show, per segment index (11 to 20): (a) total segment bitrate and estimated throughput, (b) PSNR (dB), and (c) SSIM.

Figure 4.16: Difference mean opinion score (DMOS) between the proposed virtual view quality-aware rate allocation and [113] for VBR- and CBR-encoded MVD videos at different available network bandwidth values (given between parentheses). Positive DMOS indicates that our approach is preferred over [113].

Chapter 5

QoE-fair HTTP Adaptive Streaming of Free-viewpoint Videos in LTE Networks

5.1 Introduction

Video streaming over cellular networks has become one of the most prevalent mobile services. According to a recent study by Cisco, video traffic will account for nearly 75 % of total mobile data traffic by the year 2020 [25]. Recent technology advances have paved the way for new video applications such as interactive 360-degree videos and free-viewpoint videos (FVV). FVV provides users with an interactive and immersive experience by enabling them to view a scene from an arbitrary position. With the increasing popularity of mobile video and the introduction of more powerful mobile devices, FVV streaming will be the natural extension to current mobile video streaming systems. Future mobile networks based on next-generation long-term evolution (LTE) infrastructures such as LTE-Advanced and 5G networks will provide enough bandwidth to support the high bit rate demands of FVV systems. For video delivery over cellular networks, HTTP adaptive streaming (HAS) [109] has recently emerged as a simple and effective method. In HAS, videos are encoded in several bit rate versions and each version is split into small chunks called segments. A streaming client adaptively requests video segments based on the network conditions at the time of request. During each segment download, the client estimates the available bandwidth based on the download throughput and decides on the best version to request for the next segment. This adaptive mechanism enables the video streaming client to provide the user with the highest quality-of-experience (QoE) given the network conditions. Integrating HAS with free-viewpoint systems is therefore a promising solution for deploying interactive multi-view video streaming services.
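As a rough illustration of the HAS mechanism described above, the sketch below shows a throughput-based version selection step; the smoothing factor and bitrate ladder are arbitrary placeholders rather than values from any particular player.

def estimate_bandwidth(history, alpha=0.8):
    # Exponentially weighted average of observed segment throughputs (bps).
    est = history[0]
    for sample in history[1:]:
        est = alpha * est + (1 - alpha) * sample
    return est

def pick_version(bitrate_ladder, throughput_history):
    # Choose the highest bitrate version not exceeding the bandwidth estimate.
    estimate = estimate_bandwidth(throughput_history)
    feasible = [b for b in sorted(bitrate_ladder) if b <= estimate]
    return feasible[-1] if feasible else min(bitrate_ladder)

print(pick_version([1e6, 2.5e6, 5e6, 8e6], [3.1e6, 2.8e6, 2.6e6]))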

In cellular networks, each cell is managed by a base station, known as an eNodeB. Users within a cell share the radio channel and its radio resources. The eNodeB scheduler is responsible for managing and distributing the radio resources among active users and quickly adapting to fading channel conditions using channel state feedback. In a multi-user video streaming environment where users compete for the available resources, achieving efficiency and QoE-fairness across the competing video players becomes critical. Most commercial eNodeB schedulers allocate resources by employing variations of the proportional fair [66] scheduling policy. While this type of fairness seems to perform well when all users follow the same utility, it tends to be inefficient in scenarios where users follow different utilities [120], which is typically the case with video streaming. Moreover, proportionally fair schedulers mainly focus on the network (bitrate) utility and are oblivious to application-layer QoE, which may not be ideal for video streaming applications. In video streaming applications, the relationship between rate and QoE varies within a single video stream as well as across multiple streams [19] [111]. A rate allocation policy that provides the same data rate to all users may result in significantly unfair QoE. This is because users are likely to watch different types of videos, and these videos may require different bitrates to achieve the same level of QoE. For instance, it is known that low-motion videos, e.g., interviews and talk shows, require a lower data rate to achieve good QoE compared to high-motion videos, e.g., sports events and action movies. This problem is amplified in the case of FVV streaming. When viewers navigate across viewpoints in an FVV streaming system, some of the views may not have been captured by cameras. In such cases, the non-captured view, referred to as a virtual view, needs to be synthesized using some of the captured views, referred to as reference views, by applying techniques like depth-image-based rendering (DIBR) [37]. DIBR generates a virtual view using two reference views, where each reference view consists of image and depth streams. Since the quality of synthesized virtual views depends on the encoding configuration of the components of the reference views, the rate-quality relationship becomes more complex. Moreover, the relationship between rate and perceived quality in FVV content varies from one viewpoint position to another due to differences in the complexities of the frames in different reference views, depending on the angle from which the scene was captured. Therefore, changing the viewpoint results in a different rate-quality relationship, which translates to variations in the perceived quality. Recent studies have shown that users are sensitive to such variations, and this tends to affect their overall QoE [85]. In this chapter, we propose a QoE-fair radio resource allocation algorithm for HTTP-based FVV adaptive streaming to heterogeneous clients in mobile networks. The proposed algorithm runs on a media-aware network element (MANE) connected to the eNodeB scheduler. The MANE finds the optimized guaranteed bitrate (GBR) values and corresponding radio resource shares for the streaming clients in order to maximize the perceived quality while achieving quality fairness and reducing quality variations. To find these values, the algorithm relies on derived rate-utility models for the virtual views of the videos being streamed.

The results are then passed to the eNodeB scheduler to set the resources allocated to each user. Unlike previous works for achieving QoE-fairness [132] [36] [23], our algorithm takes quality variation into consideration. Furthermore, these works are not directly applicable to free-viewpoint videos since they only optimize the resource allocation based on the quality of a single view. To the best of our knowledge, this is the first work that addresses the problem of QoE-fair radio resource allocation for FVV streaming systems in mobile networks. The rest of the chapter is organized as follows. Section 5.2 summarizes the related works in the literature. Section 5.3 describes the architecture of FVV streaming systems. In Section 5.4, we define and formulate the QoE-fair resource allocation problem. Section 5.5 presents the proposed resource allocation algorithm for FVV streaming systems. Section 5.6 presents the evaluation of the algorithm, and Section 5.7 concludes the chapter.

5.2 Related Work

We divide works that address the problem of achieving fairness between multiple competing clients based on the delivery network: wired and wireless.

5.2.1 Fairness in Wired Networks

Jiang et al. [60] proposed an algorithm to achieve HAS fairness using a combination of harmonic bandwidth estimation, stateful and delayed bitrate updates, and randomized request scheduling. The randomized scheduler ensures that a player's request time is independent of its start time, and the stateful bitrate selection guarantees that the bitrates will eventually converge to a fair allocation. In [72], Li et al. presented an approach that probes network subscription, i.e., the relationship between requested segment sizes and the fair-share portion of the bandwidth, by small increments to determine when the TCP download throughput can be taken as an accurate indicator of the fair-share bandwidth, and backs off when congestion is encountered. Both [60] and [72] attempt to achieve fairness in terms of bandwidth share and do not take the differences between the rate-quality relationships of different videos into consideration. Although they may work for stable wired networks, they are not suitable for cellular networks with dynamic links where the underlying bandwidth share is not fair. For example, FESTIVE [60] utilizes a large window for throughput estimation in order to improve stability. Such a window cannot handle the large, uneven drops in throughput that characterize wireless channels with mobile clients. Moreover, when multiple HAS clients compete for the available bandwidth, user-driven approaches such as [60] and [72] will generally yield suboptimal results. Therefore, network-assisted streaming approaches, which rely on active cooperation between video streaming applications and the network, are more efficient. Cofano et al. [26] evaluated several network-assisted strategies for HTTP adaptive streaming in software-defined networks (SDNs) in terms of fairness, average video quality, and quality variations.

The authors presented a video control plane which enforces video quality fairness among concurrent video flows generated by heterogeneous client devices by solving a max-min fairness optimization problem at runtime. Mansy and Ammar [77] utilized the concept of maximal fairness, because QoE max-min fair allocations might not exist due to discrete QoE values. They proposed a QoE-based progressive algorithm, referred to as QPA in this chapter, to achieve maximal fairness. They also proposed a device-dependent metric for QoE. These network-assisted schemes demonstrate the advantage of joint client-network adaptation. However, they are not designed for dynamic cellular networks, nor can they handle complex FVV content.

5.2.2 Fairness in Wireless Networks

In [127], De Vleeschauwer et al. proposed a method to adaptively set the guaranteed bitrate of each video flow in an LTE network with heterogeneous traffic. Although the algorithm attempts to achieve proportional fairness in terms of utility, the utility function used by the algorithm is not content-aware since it is not based on a video quality metric. In [132], the authors modeled a QoE continuum by considering both cumulative playback quality and playback smoothness using an exponential weighted moving average. Based on this model, they also proposed a quality adaptation algorithm that can guarantee both QoE and fairness between multiple clients in a cellular network by exploiting the nature of human perception and the video source. This algorithm tends to adjust the instantaneous quality in proportion to the channel quality. However, it results in undesired quality variations in the case of temporary fluctuations in channel quality. In [36], El Essaili et al. presented a QoE-based resource allocation method for HAS that optimizes a utility function combining the perceived quality and a penalty for quality switches. Two rate adaptation approaches (reactive and proactive) are discussed to adapt the users' application rates to the data rates chosen by the scheduler. Although the results indicate that the proposed resource allocation method improves fairness in comparison with traditional non-QoE-aware methods, fairness is not considered as an objective in the optimization. Cicalò et al. [23] studied the problem of QoE-fair resource allocation for HAS over cellular networks and formulated it as a multi-objective optimization problem in terms of maximizing the average quality and minimizing QoE differences between users. However, the given formulation fails to consider quality variation. They proposed an iterative quality-fair adaptive streaming algorithm (QFAS) to solve the formulated problem. In this chapter, we compare the performance of our algorithm against QFAS and QPA since they perform well in achieving fairness and high video quality in the case of 2D videos. We modify QPA to take into consideration the users' channel conditions in the case of cellular networks. We also modify both algorithms to utilize rate-utility models for FVV content, described in Section 5.5.1.


Figure 5.1: System model for a HAS-based FVV streaming system. Videos are stored in multi-view-plus-depth format and encoded at multiple bitrates. Clients issue segment requests based on current viewpoint position. Resource allocation algorithm runs on MANE to ensure QoE fairness and achieve high and stable quality for each FVV session.

5.3 System Model and Operation

We consider a streaming system that supports FVV content, as shown in Figure 5.1. In the following, we describe the various models assumed in this chapter and the operation of the system.

5.3.1 Wireless Network Model

An LTE wireless access network with an eNodeB serves free-viewpoint videos to a set K = {1,...,K} of user equipments (UEs). The LTE downlink channel is divided into 10 ms frames, each further divided into 1 ms sub-frames [41]. The sub-frames are transmitted using orthogonal frequency-division multiplexing (OFDM), which divides the available radio resources into a grid in both the time and frequency domains. A resource block is the smallest unit that can be allocated by the eNodeB in LTE. Each resource block spans 0.5 ms (i.e., half a sub-frame) in the time domain and 12 OFDM sub-carriers (180 kHz) in the frequency domain. UEs are dynamically allocated non-overlapping sets of resource blocks depending on their channel conditions. The quality of a channel in the LTE downlink is measured at the UE and sent to the eNodeB in the form of so-called channel quality indicators (CQIs). To accommodate the time-varying radio channel conditions of the UEs, LTE uses adaptive modulation and coding. The modulation and coding scheme (MCS) used for each UE is chosen based on the CQI value reported by the device.
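The following sketch illustrates the resource-block accounting implied by this model; the CQI-to-bits-per-RB mapping below is a made-up placeholder for illustration, not a value taken from the LTE specification.

import math

# Hypothetical mapping CQI -> usable bits per resource block (placeholder values).
HYPOTHETICAL_BITS_PER_RB = {1: 30, 5: 90, 9: 180, 12: 320, 15: 450}

def rbs_needed(bitrate_bps, window_s, cqi):
    # Number of resource blocks needed to carry bitrate_bps for window_s seconds
    # at the per-RB capacity implied by the reported CQI.
    bits_per_rb = HYPOTHETICAL_BITS_PER_RB[cqi]
    return math.ceil(bitrate_bps * window_s / bits_per_rb)

print(rbs_needed(1.5e6, window_s=1.0, cqi=12))  # RBs to sustain 1.5 Mbps for 1 s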

5.3.2 FVV Content Model

Each UE receives an FVV in the multi-view-plus-depth (MVD) representation format, where a video is composed of a set W = {w1,...,wW} of equidistant captured (reference) views and their associated depth maps (Figure 5.2).

Figure 5.2: Free-viewpoint video using multi-view-plus-depth content representation. Captured views (texture + depth) bound virtual view ranges containing the virtual viewpoints (e.g., View-1.5 and View-2.5).

Depth maps can either be captured using depth cameras or estimated using a depth estimation technique [98]. In the following, we refer to the texture and depth video streams corresponding to each captured view as the component streams of the view. Neighboring reference views bound a virtual view range (a set of virtual view positions). The number of virtual view positions in each virtual view range is equal to E. Therefore, the total number of possible viewpoint positions that a UE can request is (W − 1)E + W. The component streams of each captured view are encoded at L bitrates (representations) and divided into a number of segments of duration τ seconds each. A manifest file, also known as a media presentation descriptor (MPD), is generated for the FVV with information about the captured views and depth maps as well as the parameters of the cameras used to capture the views. For a given virtual view range, we refer to a combination of representations for the reference views of the range as an operating point. To support the rate adaptation process and to enable the client to choose the best operating point for a virtual view range, the MPD file also includes virtual view quality models for each virtual view range and each segment index, similar to the model in [45]. These models provide an estimate of the quality of a virtual view given the qualities of the components of the reference views. The client searches for the operating point that gives the best estimated quality among the set of operating points with a total bitrate that does not exceed the available bandwidth. In addition, the MPD file includes a link to a file which contains, for each segment index and each virtual view range, the parameters of a rate-utility model.

This model represents the relation between the total bitrate of the four components of the reference views corresponding to the virtual view range and the average quality of the virtual views within the range. These models are needed by the resource allocation algorithm to achieve QoE-fairness and are stored in a separate file since they are not utilized by the streaming client and would increase the download time of the MPD file. We describe the details of the rate-utility models and how they are generated in Section 5.5.1.
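A small sketch of the bookkeeping implied by this content model is given below; the function names are illustrative only.

from itertools import product

def num_viewpoints(num_captured_views, positions_per_range):
    # (W - 1) * E virtual positions plus the W captured views themselves.
    return (num_captured_views - 1) * positions_per_range + num_captured_views

def operating_points(num_representations):
    # One representation index per component: two reference views x (texture, depth).
    return list(product(range(num_representations), repeat=4))

print(num_viewpoints(3, 3))        # e.g., W = 3 and E = 3 give 9 viewpoint positions
print(len(operating_points(6)))    # L = 6 representations per component gives 6^4 = 1296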

5.3.3 System Operation

FVV content, including the segments of the component streams and a manifest file, is hosted on servers within a content distribution network (CDN). These servers are accessible by the core network of the mobile network operator (MNO) via the packet data gateway (P-GW), which connects the core network to the Internet. Alternatively, the MNO may host the FVV content on its own mobile content distribution network (mCDN) to provide fast and efficient delivery and establish implicit trust between the components of the mobile network infrastructure and the CDN [134]. The manifest (MPD) file for the free-viewpoint videos contains an XML element that provides a URL from which a rate-utility models file for the virtual view ranges and segment indices can be retrieved. An FVV streaming client runs on each UE and keeps track of the user's viewpoint position and the available network bandwidth. The client issues segment requests to the content server for the texture and depth component streams of the two reference views bounding the virtual view range to which the target view position belongs. Decisions on the operating points to be requested from the server are based on a virtual view quality-based rate adaptation method similar to the one presented in Chapter 4. Communication between the streaming client and the content server can be either over non-secure HTTP or over secure HTTP (HTTPS). A media-aware network element (MANE) is connected to the eNodeB. The MANE is able to intercept HAS requests from the UEs. However, if the streaming client uses HTTPS (or HTTP/2 [87]) to communicate with the content server, the MANE may not be able to inspect the payload of requests and responses. We therefore distinguish between two scenarios. In the first scenario, we assume that either the client requests are sent using HTTP, or they are sent using HTTPS but with a common certificate and encryption keys used by both the CDN and the MANE, enabling the MANE to inspect intercepted requests. Having the CDN certificate available at the MANE can be achieved through an agreement between the CDN provider and the MNO. This agreement is mutually beneficial since it enables the MNO to optimize the streaming flows within the mobile network, which improves the user experience and in turn results in more users subscribing to the content provider's services. Alternatively, the MNO may establish its own CDN and therefore have full control over the content and the MANE. The communication between the different network components is shown in Figure 5.3. A client initially sends a request for the MPD file.


Figure 5.3: Sequence diagram using HTTP or HTTPS with CDN and mobile network collaboration.

The MANE generates a copy of the retrieved MPD and processes it to identify the component streams and extract the URL of the rate-utility file. The MANE then uses the URL to fetch the file from the server. Requests for reference view segments are collected by the MANE within the scheduling period. Our proposed QoE-fair resource allocation algorithm then determines the GBR assigned to each client using the rate-utility models for the requested virtual view ranges. The assigned GBRs are communicated to the eNodeB, and the MANE determines the best operating points corresponding to the GBR values. The MANE then overrides the clients' requests with new requests for the chosen operating points. In the second scenario, the streaming client communicates with the content server over a secured protocol, either HTTPS or HTTP/2, and the CDN and MANE do not share encryption keys. Since the exchange between the client and the server is encrypted, the MANE is unable to distinguish HAS traffic and interpret its semantics. Therefore, this scenario requires collaboration between the streaming clients and the MANE through a separate control plane. This is demonstrated in Figure 5.4. A similar approach has been proposed in [20] for managing streaming sessions over HTTPS in SDN networks.


Figure 5.4: Sequence diagram using HTTPS with no CDN and mobile network collaboration.

Here, the client needs to explicitly inform the MANE about MPD and segment requests. The clients locate and identify the MANE using a discovery protocol. At the beginning of the video streaming session, the client sends an MPD message to the MANE signalling the initiation of the session and specifying the URL of the MPD file. This enables the MANE to separately download and process the MPD and retrieve the rate-utility models file. Segment download requests are preceded by segment messages, sent to the MANE, containing the virtual view range index and the segment index. This enables the MANE to look up the parameters of the corresponding rate-utility model to be used by the resource allocation algorithm. After determining the assigned GBRs, the MANE announces these values to the clients using GBR messages. The GBR message informs the client of the bandwidth available to it and enables it to select the best operating point.

5.4 Problem Statement

Problem 3. Given the channel conditions of multiple free-viewpoint video streaming sessions, determine the optimal number of resource blocks to be assigned to each streaming session and the corresponding GBR such that: (i) the average QoE of each session is maximized, (ii) the QoE difference across all streaming sessions is minimized, and (iii) the quality fluctuation within each session is minimized.

We formulate the QoE-fair radio resource allocation problem for FVV streaming as follows. The symbols used in this chapter are listed in Table 5.1. Let T_s be the scheduling time interval. At each time instant nT_s, n ∈ N, the MANE needs to determine the users' GBR values R_k[n], k = 1,...,K, which achieve QoE-fairness between the UEs in the following scheduling window. We denote by V = {v_1,...,v_K} the set of videos being streamed by the UEs. Let m_k be the MCS value chosen by the eNodeB for user k based on the CQI reported by the UE of that user, where m ∈ [1, M]. The per-resource-block capacity c_m is non-decreasing in the MCS mode m, such that c_1 ≤ c_2 ≤ ··· ≤ c_M. Let Π be the total number of resource blocks available for UEs within a scheduling window. It should be noted that the number of available resource blocks may vary from one scheduling window to another and can be dynamically computed as in [127] based on the number of UEs. We denote by r_k the total bitrate of the representations in the chosen operating point for video v_k. For two reference views whose component streams are encoded into L representations, the number of possible operating points equals the number of all possible representation combinations for the four components (i.e., L^4). Let the set of operating points be O_k = {o_{k,1},...,o_{k,L^4}} and the corresponding set of bitrates R_k = {r_{k,1},...,r_{k,L^4}}. We denote by r_{k,min} and r_{k,max} the minimum operating point bitrate (corresponding to o_{k,1}) and the maximum operating point bitrate (corresponding to o_{k,L^4}), respectively. Let s_k[n] and λ_k[n] be the segment index and the virtual view range index requested by user k at time n, respectively. In the following, we drop the scheduling window index n for brevity, unless explicitly specified.

Table 5.1: List of symbols used in this chapter.

Symbol            Description
𝒦                 Set of UEs (video streaming sessions).
𝒲                 Set of captured views.
𝒱                 Set of videos streamed by UEs.
𝒪_k               Set of operating points for video k.
ℛ_k               Set of operating point bitrates for video k.
K                 Number of UEs.
W                 Number of captured views for each FVV.
E                 Number of virtual view positions in each virtual view range.
L                 Number of representations for each captured view.
T_s               Scheduling time.
τ                 Duration of a video segment.
s_k[n]            Segment index requested by user k in scheduling window n.
λ_k[n]            Index of the virtual view range of user k in scheduling window n.
m_k               MCS chosen for user k based on channel conditions.
c_m               Resource block capacity corresponding to MCS m.
Π                 Number of available resource blocks in a scheduling window.
β_1, β_2          Quality gain thresholds.
γ                 Channel stability threshold.
r_{k,min}         Minimum operating point bitrate.
r_{k,max}         Maximum operating point bitrate.
ϕ_m(·)            Function mapping a number of resource blocks to a data rate for MCS m.
U_{k,s_k,λ_k}     Utility function for video k, segment index s_k, and virtual view range λ_k.

Given a rate-utility model U_{k,s_k,λ_k}, the problem can be formulated as follows:

\[ \max \;\; \sum_{k=1}^{K} U_{k,s_k,\lambda_k}(r_k[n]) \qquad (5.1a) \]

\[ \min \;\; \sum_{k=1}^{K} \sum_{j>k} \Delta\!\left( U_{k,s_k,\lambda_k}(r_k[n]),\; U_{j,s_j,\lambda_j}(r_j[n]) \right) \qquad (5.1b) \]

\[ \min \;\; \sum_{k=1}^{K} \Gamma\!\left( U_{k,s_k[n-1],\lambda_k[n-1]}(r_k[n-1]),\; U_{k,s_k,\lambda_k}(r_k[n]) \right) \qquad (5.1c) \]

\[ \text{s.t.} \;\; \sum_{k=1}^{K} \left\lceil \frac{R_k[n]\,\tau}{c_{m_k}} \right\rceil \le \Pi \qquad (5.1d) \]

\[ U_{k,s_k,\lambda_k}(r_{k,\min}) \le U_{k,s_k,\lambda_k}(r_k[n]) \le U_{k,s_k,\lambda_k}(r_{k,\max}), \quad k \in \mathcal{K}, \qquad (5.1e) \]

where c_{m_k} is the resource block capacity for UE k given MCS m_k.

Similar to [23], the Δ function in (5.1b) is a utility-fairness metric defined as:

\[ \Delta(U_i, U_j) =
\begin{cases}
0 & \text{if } U_i = f_U(r_{i,\min}) \wedge U_j < U_i \\
0 & \text{if } U_i = f_U(r_{i,\max}) \wedge U_j > U_i \\
|U_i - U_j| & \text{otherwise,}
\end{cases} \qquad (5.2) \]

where ∧ denotes the AND operator. This metric takes into consideration that U_i and U_j are constrained to their minimum and maximum values. If one of the videos achieves its maximum utility, the available resources should be used to increase the utilities of the other videos. When resources are scarce, if the i-th video is at its minimum utility value, decreasing its rate is not possible. It is therefore necessary to decrease the rates of the other videos, at the price of decreasing the related utilities. We note that, unlike other formulations in previous works (e.g., [23]), our formulation has an explicit objective for minimizing quality changes for each user, Eq. (5.1c). The Γ function in (5.1c) is a measure of the change in quality. This enables us to evaluate quality changes between the current allocation window and the previous window for each user. It has been shown that the QoE degradation caused by a change in quality where the quality is increased is much smaller than that caused by a decrease in quality of the same scale [75]. Therefore, we define Γ as follows:

\[ \Gamma(U_i, U_j) =
\begin{cases}
|U_i - U_j| & \text{if } U_j < U_i \\
0 & \text{otherwise.}
\end{cases} \qquad (5.3) \]
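For clarity, the following is a small Python transcription of Eqs. (5.2) and (5.3) as written above, for a single pair of utilities; it is illustrative only, and u_min and u_max stand for the utilities at the minimum and maximum operating points of video i.

def delta(u_i, u_j, u_min, u_max):
    # Utility-fairness metric of Eq. (5.2): differences that cannot be removed
    # because u_i is already at a bound are not counted.
    if u_i == u_min and u_j < u_i:
        return 0.0
    if u_i == u_max and u_j > u_i:
        return 0.0
    return abs(u_i - u_j)

def gamma(u_prev, u_curr):
    # Quality-change measure of Eq. (5.3): only downward changes are penalized.
    return abs(u_prev - u_curr) if u_curr < u_prev else 0.0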

5.5 Proposed Solution

5.5.1 Rate-Utility Models for FVV

A QoE-aware resource allocation algorithm requires access to the relation between the video bitrate and the perceived quality. Unlike 2D videos, where there is only one video stream, each free-viewpoint video has multiple video streams corresponding to the components of the different views. The relation between bitrate and quality for virtual views is therefore complicated by the fact that changes in the bitrates of the component streams do not contribute equally to the quality of the synthesized virtual view.

Figure 5.5: Operating points for a virtual view where two reference views and their associated depth maps are used for view synthesis and each component has 6 CBR-coded representations. The plot shows PSNR (dB) versus total bitrate (bps) for all operating points, the Pareto-optimal points, and the fitted rate-quality model.

We consider the following parametric rate-utility model:

\[ U_{k,i,j}(r_k) = f(r_k;\, \boldsymbol{\alpha}_{k,i,j}), \qquad (5.4) \]

where k is the UE index, i is the segment index, j is the virtual view range index, r_k is the total bitrate of the requested components for segment i of the video streamed by user k, and α_{k,i,j} ∈ A ⊂ R^{N_α} is a time-varying and content-dependent vector of N_α parameters. To determine the model that best represents the relationship between the operating point bitrate and the quality of the virtual views for a free-viewpoint video, we generate HAS content based on the system model described in Section 5.3.2 for three MVD videos. For each segment index and each virtual view range, we generate a scatter plot for all operating points. We show an example plot for the Kendo MVD video sequence in Figure 5.5, where the components of views 1 and 3 are used to synthesize views 1.5, 2, and 2.5, and each component is encoded into 6 constant bit rate representations (250, 500, 750, 1000, 1250, and 1500 kbps). We note that similar plots were obtained for the other two videos. Each point in the figure designates an operating point with a total bitrate equal to the sum of the component bitrates and the corresponding average quality for the synthesized virtual views. The quality of the virtual views is measured in peak signal-to-noise ratio (PSNR) using views synthesized from the original uncompressed components as references. As can be seen in the figure, the black points represent the Pareto-optimal points, which provide the maximal quality for a given bitrate. The rate-utility relationship for each virtual view range can therefore be obtained offline by applying a curve fitting method to this set of points. In our evaluation, we chose the following logarithmic model:

\[ U_{k,i,j}(r_k) = f(r_k;\, \boldsymbol{\alpha}_{k,i,j}) = \alpha_1 \log(\alpha_2 r_k + \alpha_3), \qquad (5.5) \]

where the parameters α_1, α_2, and α_3 are the elements of α_{k,i,j}.
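The model parameters could, for example, be obtained with a standard curve-fitting routine, as in the following sketch; the (bitrate, PSNR) samples below are invented for illustration and are not measurements from the thesis data sets.

import numpy as np
from scipy.optimize import curve_fit

def rate_utility(r, a1, a2, a3):
    # Logarithmic rate-utility model of Eq. (5.5).
    return a1 * np.log(a2 * r + a3)

# Hypothetical Pareto-optimal samples for one segment and virtual view range.
rates = np.array([1.0e6, 2.0e6, 3.0e6, 4.5e6, 6.0e6])   # total bitrates (bps)
psnrs = np.array([39.2, 41.5, 42.8, 44.0, 44.9])         # measured qualities (dB)

params, _ = curve_fit(rate_utility, rates, psnrs, p0=(3.0, 0.2, 1.0), bounds=(0, np.inf))
print(params)                        # alpha_1, alpha_2, alpha_3 for this range/segment
print(rate_utility(2.5e6, *params))  # predicted utility at 2.5 Mbps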

5.5.2 Quality-fair FVV Rate Allocation

To achieve QoE-fairness and minimize fluctuations in perceived quality, we propose the Quality-fair Free-viewpoint Video Resource Allocation (QFVRA) algorithm, presented in Algorithm 5.

In Algorithm 5, ϕ_m(·) is a function that maps a number of radio resources to the corresponding data rate given the modulation and coding scheme m. Assuming that a feasible solution is achievable, i.e., the sum of the minimum-bitrate resource blocks over all users is less than or equal to Π, the algorithm proceeds as follows. For the first scheduling window, each user is initially assigned a number of resource blocks that corresponds to the bitrate of the minimum operating point of the video being streamed and the UE's reported channel condition (lines 2 to 4). In line 5, the algorithm then uses the function PRA (defined in Algorithm 6) to iteratively add resource blocks to users based on their estimated qualities. Users are sorted based on their estimated perceived quality, and the user receiving the minimal quality is allocated an additional resource block. The estimated perceived quality of each user is calculated based on the rate-utility model given in Eq. (5.4) for the virtual view range that the user's viewpoint is currently within. This process is repeated until either all users reach data rates equivalent to the bitrates of the maximum operating points of their respective videos, or all resource blocks within the scheduling window are allocated (Algorithm 6, lines 3 to 9). It should be noted that if adding a resource block to a user causes their data rate to exceed the maximum operating point bitrate, this user is skipped and is no longer considered by the algorithm within this scheduling window. The progressive resource allocation algorithm described above guarantees fairness among users within the scheduling window in terms of perceived quality. However, due to the time-varying channel conditions of users, running this algorithm for each scheduling window independently, without considering the qualities achieved by previous allocations, may not result in a stable and smooth perceived quality for each user. To minimize quality variation across scheduling windows, QFVRA first attempts to maintain the same quality values that the users obtained in the previous scheduling window. This is done by finding the corresponding data rates for these quality values using f_U^{-1}(U; α_{k,s_k,λ_k}), the inverse of the function f_U in Eq. (5.4), and calculating the number of resource blocks each user needs to achieve these data rates in the current scheduling window, given the users' new channel conditions at the beginning of the window (line 8). The total number of needed resource blocks is then compared against the capacity of the scheduling window to determine how to proceed.

Algorithm 5: QFVRA
Input: Set 𝒦 of UEs, where |𝒦| = K
Input: Vectors r_min = (r_{1,min}, ..., r_{K,min}) and r_max = (r_{1,max}, ..., r_{K,max}) with the minimum and maximum operating point bitrates, respectively, of the videos streamed by the UEs
Input: Vector of users' channel conditions m = (m_1, ..., m_K)
Input: Number of radio resources in a scheduling window Π
Input: Set of vectors A = {α_1, ..., α_K}, where each vector α_k ∈ A holds the values of the N_α parameters of the utility function of UE k, and A ⊂ R^{N_α}
Input: Allocation vector of the previous scheduling window p = (p_1, ..., p_K)
Input: Vector q' = (q'_1, ..., q'_K) of estimated qualities in the previous scheduling window
Input: Quality gain thresholds β_1 and β_2
Output: Allocation vector x = (x_1, x_2, ..., x_K)
Output: Vector of estimated qualities q = (q_1, q_2, ..., q_K) based on the allocation
 1  if first scheduling window then
 2      for i ← 1 to K do
 3          x_i ← ⌈r_{i,min} τ / c_{m_i}⌉
 4          q_i ← f(ϕ_{m_i}(x_i); α_i)
 5      x ← PRA(K, Π, m, r_min, r_max, A, x)
 6  else
 7      for i ← 1 to K do
 8          b_i ← ⌈f^{-1}(q'_i; α_i) τ / c_{m_i}⌉
 9          x_i ← b_i
10      if Σ_{i=1}^{K} x_i < Π then
11          x ← PRA(K, Π, m, r_min, r_max, A, x)
12          for i ← 1 to K do
13              q_i ← f(ϕ_{m_i}(x_i); α_i)
14              if (m_i^{instability} > γ) OR (β_1 > q_i − q'_i) then
15                  x_i ← b_i
16                  q_i ← f(ϕ_{m_i}(x_i); α_i)
17              else if (β_2 < q_i − q'_i) then
18                  x_i ← ⌈f^{-1}(β_2; α_i) τ / c_{m_i}⌉
19                  q_i ← β_2
20      else if Σ_{i=1}^{K} x_i > Π then
21          x ← PRA(K, Π, m, r_min, r_max, A, x)
22          for i ← 1 to K do
23              q_i ← f(ϕ_{m_i}(x_i); α_i)
24      else
25          for i ← 1 to K do
26              q_i ← f(ϕ_{m_i}(x_i); α_i)
27  return x, q

If the total number of resource blocks required to maintain the qualities of the previous scheduling window exceeds the number of resource blocks available within the current window (line 20), a draining process is utilized. Similar to progressive filling, users are first sorted based on an estimate of their perceived qualities given the current allocation. However, instead of adding resource blocks, the algorithm iteratively removes one resource block from the user with the highest estimated quality value in each iteration until the capacity of the scheduling window is reached (Algorithm 6, lines 10 to 16). When the number of resource blocks required to achieve the qualities of the previous window is below the capacity of the current window, the qualities of the previous scheduling window are achievable and there is room for increasing some users' qualities, since resource blocks remain available. However, distributing these available resource blocks among users is not trivial. If a user is suffering from temporary changes in channel conditions, assigning more resource blocks to them will soon be followed by reclaiming those resource blocks in the following scheduling window, causing variations in the perceived quality. To maintain long-term QoE smoothness and avoid fluctuations in the perceived quality, QFVRA considers the stability of the channel condition of each user and only allocates more resource blocks to those users with stable conditions. Therefore, the scheduler needs to estimate the conditions of the users' channels. This is done by utilizing the CQI history in order to predict the stability of the channel conditions in the next scheduling window. Another issue which arises when increasing the resource blocks allocated to users is that, when the allocation does not result in a noticeable change in quality, it is more efficient to assign these resource blocks to users who are suffering from bad channel conditions so that they can enjoy a quality comparable to other users. On the other hand, when the allocation results in a large difference in quality, it is better to limit the amount of increase in quality. For example, if the channel conditions drop in the near future, the impact of the quality switch would be reduced. If the channel conditions were good and allow more quality improvements, the quality would be gradually increased and the user would not suffer from abrupt changes. To overcome these issues, QFVRA maintains a set of users which are eligible for quality improvement. This set includes users which satisfy two conditions: (i) they have relatively stable channel conditions; and (ii) additional resources allocated to them through progressive filling will result in a noticeable improvement in perceived quality. After running the PRA algorithm to calculate a QoE-fair allocation starting from a resource allocation providing the qualities of the previous window, QFVRA checks both the stability of each user's channel conditions and the quality difference between the two consecutive windows (line 14). We use the standard deviation of each user's channel conditions over a window of H previous CQI reports, denoted m_i^{instability} in Algorithm 5, to estimate the stability by comparing it against a stability threshold γ. Parameter β_1 defines a quality gain threshold corresponding to a just-noticeable difference. Users that do not satisfy these conditions are assigned a number of resource blocks that maintains their perceived quality of the previous window (lines 15 to 16). Parameter β_2 defines a second threshold that limits the amount of quality change between consecutive scheduling windows (line 17).

Algorithm 6: Progressive Resource Allocation (PRA)
Input: Number of UEs K
Input: Number of radio resources in the scheduling window Π
Input: Vector of UEs' channel conditions m = (m_1, ..., m_K)
Input: Vectors r_min = (r_{1,min}, ..., r_{K,min}) and r_max = (r_{1,max}, ..., r_{K,max}) with the minimum and maximum representation bitrates, respectively, of the videos streamed by the UEs
Input: Set of vectors A = {α_1, ..., α_K} representing the parameters of the UEs' utility functions
Input: Initial radio resource allocation vector b = (b_1, ..., b_K)
Output: Allocation vector x = (x_1, x_2, ..., x_K)
 1  for i ← 1 to K do
 2      x_i ← b_i
 3  while Σ_{i=1}^{K} x_i < Π do
 4      j ← arg min_{k=1,...,K} { f_k(ϕ_{m_k}(x_k + 1); α_k) }
 5      w ← ⌈r_{j,max} τ / c_{m_j}⌉
 6      if x_j + 1 ≤ w then
 7          x_j ← x_j + 1
 8      else
 9          x_j ← w
10  while Σ_{i=1}^{K} x_i > Π do
11      j ← arg max_{k=1,...,K} { f_k(ϕ_{m_k}(x_k − 1); α_k) }
12      w ← ⌈r_{j,min} τ / c_{m_j}⌉
13      if x_j − 1 ≥ w then
14          x_j ← x_j − 1
15      else
16          x_j ← w
17  return x
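The progressive filling phase of PRA (lines 3 to 9) can be sketched in Python as follows, under simplifying assumptions: utility(k, n) estimates the perceived quality of UE k when assigned n resource blocks, and max_rb[k] is the block count corresponding to the maximum operating point bitrate of UE k. The draining phase (lines 10 to 16) is symmetric and omitted. This is an illustration, not the simulator code.

def progressive_fill(alloc, total_rb, utility, max_rb):
    # alloc: initial per-UE resource block allocation (list of ints).
    alloc = list(alloc)
    active = {k for k in range(len(alloc)) if alloc[k] < max_rb[k]}
    while sum(alloc) < total_rb and active:
        # Give one block to the active UE whose quality would remain lowest.
        j = min(active, key=lambda k: utility(k, alloc[k] + 1))
        alloc[j] += 1
        if alloc[j] >= max_rb[j]:
            active.discard(j)  # UE reached its maximum operating point
    return alloc

# Toy usage: 3 UEs, concave per-UE utilities, 10 blocks to distribute.
print(progressive_fill([1, 1, 1], 10, lambda k, n: (k + 1) * n ** 0.5, [6, 6, 6]))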

The time complexity of the proposed algorithm depends on the difference between the number of resource blocks assigned in the current and previous scheduling windows, which makes it considerably faster than non-quality-variation-aware progressive filling algorithms, where the resource blocks are allocated independently in each allocation window by iteratively allocating them starting from an allocation that achieves the minimum operating point bitrates. In the worst case, the running time of the proposed QFVRA algorithm is O(Π), where Π is the total number of available resource blocks in the scheduling window. This happens when the number of available resource blocks oscillates between its minimum and maximum possible values between scheduling windows, which is very improbable.

5.6 Evaluation

5.6.1 Setup

We simulate an LTE cellular network using OPNET Modeler and its LTE module [138]. We implement the proposed QFVRA algorithm and use the channel conditions obtained from OPNET as input to the algorithm. We also implement the QFAS [23] and QPA [77] algorithms to compare the performance of our algorithm against them. We demonstrate the efficiency of our algorithm in preserving the average video quality and fairness while minimizing the rate of quality switches and using significantly fewer resource blocks. Although the proposed algorithm is applicable to any wireless network which uses OFDM, our evaluation is based on the LTE Release-12 standard [9]. Table 5.2 shows the configuration of the simulated network. Other parameters are set to the default values of the OPNET LTE module. Time is divided into scheduling windows with a duration of one second. The simulator runs the resource allocation algorithm at the beginning of each scheduling window. We configure users to move following the random way-point model, in which the mobility speed is randomly chosen between 0 and 5 m/s. We configure the mobile devices to send CQI reports to the associated base station every 100 ms. We choose this reporting interval to ensure that we do not miss any channel condition changes while not receiving unnecessarily frequent reports. For QFVRA, we use a sliding window of 20 CQI reports to assess the stability of the UEs' channels, and we set the stability threshold γ to 1.0. Quality gain thresholds β_1 and β_2 are set to 0.5 dB and 1.0 dB, respectively. In practical scenarios, mobile operators usually install base stations in crowded areas to serve most users with strong signals. Accordingly, in our simulations, mobile users are randomly distributed within each cell such that the majority of the users (about 90 % of them) are densely populated within 1/3 of the cell radius and the rest are sparsely scattered around the rest of the cell area. We evaluate multiple scenarios where the number of users is 10, 20, 30, 40, and 50. Due to the random elements in our simulations, such as the mobility model, we run each scenario 5 times and report the average of the results. We note that, with the chosen network configuration, increasing the number of users beyond 50 may result in some users not being admitted. This happens when a number of users experience bad channel conditions and the total number of resource blocks required to admit all users therefore exceeds the capacity of the network. In our evaluation, we use three MVD FVV sequences from the MPEG 3DV ad-hoc group data sets [55] [5]: Kendo, Balloons, and Café, which have different characteristics. The resolution of the Kendo and Balloons sequences is 1024 × 768, and the resolution of the Café sequence is 1920 × 1080.

Table 5.2: Mobile Network Configuration.

Physical Profile         LTE 20 MHz FDD
eNodeB Antenna Gain      15 dBi
UE Antenna Gain          −1 dBi
Max. Downlink Bitrate    6000 kbps
Cell Radius              5 × 5 Km
Transmission Power       0.0558 W
Propagation Model        Pedestrian Environment (ITU-R M.1225)
Mobility Model           Random Way-point (0 − 5 m/sec)
Simulation Time          10 minutes

The Kendo and Balloons sequences have moving cameras, while the cameras in Café are fixed. We extend the length of the video sequences from 10 to 360 seconds by repeating the frame sequence. For each video, we choose three cameras from the set of captured views and we allow three virtual views within each virtual view range, for a total of 6 supported virtual view positions. The video streams for the texture and depth components of each camera are then encoded using a CBR configuration of the H.264/AVC encoder with bitrate values of 250, 500, 750, 1000, 1250, and 1500 kbps. We use the GPAC framework [2] to generate one-second segments for the different representations of each component. For each segment index, we generate virtual view quality models for all supported virtual view positions, as discussed in Section 5.3.2. We also generate rate-utility models for each virtual view range.

5.6.2 Performance Metrics

Our evaluation is based on the following metrics:

• Quality of Experience. Developing a unified QoE metric for video streaming applications is very challenging [16]. To the best of our knowledge, there is no comprehensive quantitative measure to evaluate QoE for adaptive video streaming. However, low video quality and frequent and/or high-amplitude quality switches are believed to result in low QoE [40] [104]. Therefore, in order to assess QoE, we use two metrics: video quality and quality switches. We measure the average video quality (AVQ) for user j as

\[ Q^j_{\text{AVQ}} = \frac{\sum_{i=1}^{T} q_i^j}{T}, \qquad (5.6) \]

where q_i^j is the video quality in scheduling window i, and T is the total number of scheduling windows during the user's session. For quality switches, we measure the average amplitude of quality switches over time for each user. Since it has been observed that the QoE degradations caused by upward video quality switches are much smaller than those caused by downward switches of the same scale [75], we only measure the rate of downward video quality switches (DVQS) as

\[ Q^j_{\text{DVQS}} = \frac{\sum_{i=2}^{T} I(q_i^j, q_{i-1}^j)}{T}, \qquad (5.7) \]

\[ I(q_i^j, q_{i-1}^j) =
\begin{cases}
q_{i-1}^j - q_i^j & \text{if } q_i^j < q_{i-1}^j \\
0 & \text{otherwise.}
\end{cases} \qquad (5.8) \]

(A short computation sketch for these metrics is given after the list.)

• Network Utilization. We measure the percentage of saved resource blocks (SRB) in each scheduling window as
\[
U_{\mathrm{SRB}} = \left(1 - \frac{\sum_{i=1}^{N} a_i}{\Pi}\right) \times 100, \qquad (5.9)
\]
where $a_i$ is the number of resource blocks assigned to user $i$ and $\Pi$ is the total number of available resource blocks in the scheduling window.

• Fairness. We measure QoE-fairness based on Jain's index [59] in scheduling window $i$ as
\[
F_{\mathrm{QoE}} = \frac{\left(\sum_{j=1}^{N} q^j_i\right)^2}{N \sum_{j=1}^{N} \left(q^j_i\right)^2}, \qquad (5.10)
\]
where $N$ is the number of admitted users in the network.

• Running Time. The time required to compute the allocation.
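As a concrete illustration of how these metrics are computed from the per-window quality traces, the following Python sketch evaluates equations (5.6)–(5.10) on a small synthetic example. The function names, the quality values, and the assumed total of 100 resource blocks per scheduling window are illustrative only.

    from typing import List

    def avg_video_quality(q: List[float]) -> float:
        """Equation (5.6): mean of one user's per-window qualities."""
        return sum(q) / len(q)

    def downward_switch_rate(q: List[float]) -> float:
        """Equations (5.7)-(5.8): average amplitude of downward quality switches."""
        drops = [q[i - 1] - q[i] for i in range(1, len(q)) if q[i] < q[i - 1]]
        return sum(drops) / len(q)

    def saved_resource_blocks(assigned: List[int], total_rbs: int) -> float:
        """Equation (5.9): percentage of resource blocks left unassigned."""
        return (1 - sum(assigned) / total_rbs) * 100

    def qoe_fairness(window_qualities: List[float]) -> float:
        """Equation (5.10): Jain's index over the users' qualities in one window."""
        n = len(window_qualities)
        return sum(window_qualities) ** 2 / (n * sum(q * q for q in window_qualities))

    # One user's quality (dB) over six scheduling windows, and the qualities of
    # four users in a single window (synthetic numbers).
    trace = [40.0, 41.0, 40.5, 39.0, 39.5, 41.0]
    print(avg_video_quality(trace))                       # Q_AVQ
    print(downward_switch_rate(trace))                    # Q_DVQS
    print(saved_resource_blocks([20, 25, 18, 22], 100))   # U_SRB, assuming 100 RBs
    print(qoe_fairness([40.0, 41.5, 39.0, 40.5]))         # F_QoE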

5.6.3 Results

Quality of Experience

Figure 5.6 shows the average video quality over a 100-second period of the simulation for the 20-user scenario. Note that in each scenario run, users randomly choose to watch one of the three FVV videos. The figure demonstrates how our proposed algorithm avoids potentially damaging upward quality switches in order to achieve a better quality of experience. The other two algorithms increase the video quality at every opportunity, regardless of the stability of the channel conditions. Although this approach provides higher video qualities, it suffers from quality variations and consequently results in lower QoE, especially when users are experiencing unstable channel conditions. QFVRA, in contrast, minimizes the impact of quality switches. Figure 5.7 shows that our algorithm outperforms QFAS and QPA, reducing the rate of downward quality switches by at least 12% and by up to around 32%. As can be seen, the amount of improvement varies across scenarios; it is also affected by the channel conditions and the rate-distortion complexity of the video segments watched by the users. Note that since the quality levels of virtual views differ from those of

reference views, the amplitude of quality switches per second is reported instead of merely their frequencies.

Figure 5.6: Average video quality (dB) over time for QFVRA, QFAS, and QPA (20 users).

Network Utilization

Blindly increasing the video quality in each scheduling window, without considering the qualities that users have previously experienced or possible variations in channel conditions, has another drawback, namely bandwidth underutilization. Figure 5.8 shows that our algorithm saves around 12 to 18 percent of the total available resource blocks while providing almost the same average video quality and significantly fewer quality variations. These resource blocks can be provisioned for other traffic in the same base station. The same figure shows that QPA saved some resource blocks in the 10- and 50-user cases even though, by design, it assigns all the available resource blocks to the users. This is because these scenarios lie at the extreme ends of resource block availability. When there are few users, QPA is more likely to already provide them with the maximum possible qualities. On the other hand, as the system becomes more crowded, the available resource blocks might not even be sufficient to serve all users at the minimum possible qualities, so no resource blocks are assigned to some users in some scheduling windows. Similar to QPA, QFAS also tends to exhaust all the available resource blocks. However, since it solves the problem in the continuous domain and then chooses the nearest rate-distortion points in the discrete domain, a small portion of the resource blocks, around 1.5%, remains unassigned.

Figure 5.7: Average rate of downward video quality switches (dB/sec) versus the number of users.

Fairness

Figure 5.9 shows the average Jain's index across users for different numbers of FVV clients. As can be seen, all three algorithms achieve a high level of fairness, and their performance is not influenced by the number of clients in the cell.

Running Time

An algorithm, however efficient, will not be deployed in practical settings unless it produces its results in a timely manner. Here, we calculate the running time for each scheduling window. Figure 5.10 shows the empirical cumulative distribution function of the average running time per scheduling window for the 40-user scenario. Our simulations were conducted on a PC with a 2.7 GHz CPU and 16 GB of RAM. The running time per scheduling window is expected to be virtually the same for QFAS and QPA. However, we observe that a portion of the scheduling windows in QFAS take more time, and this portion grows as the number of clients increases. This can be attributed to the running time of the Newton's method step used by QFAS, which is not constant and depends on the complexity of the derivative of the input function. For our proposed algorithm, the running time of each scheduling window varies depending on how far the next feasible solution is from the solution of the previous scheduling window. In the worst case, the running time of our algorithm approaches that of QPA; this only occurs if the channel conditions of all users experience frequent abrupt changes all the time, which is an extreme case. According to Figure 5.10, our algorithm's execution time is noticeably shorter than that of QFAS and QPA. For example, QFVRA completes 65% of

the scheduling windows in less than 0.5 seconds, whereas this fraction is around 0% for QFAS and 1% for QPA. In almost all scheduling windows, the running time of our algorithm was less than the scheduling interval; therefore, it can easily run in real time.

Figure 5.8: Percentage of saved resource blocks versus the number of users.

5.7 Summary

In this chapter, we proposed a QoE-fair resource allocation algorithm for adaptive streaming of free-viewpoint videos over cellular networks. The proposed algorithm utilizes virtual view rate-quality models to allocate the radio resources among clients such that the differences in perceived qualities between them are minimized, i.e., fairness in terms of QoE is achieved. More importantly, the proposed algorithm also minimizes the frequency and amplitude of quality switches by taking into account the qualities observed by the clients as a result of previous allocations and avoiding unnecessary quality increases when the user’s channel conditions are unstable. We simulated an LTE network and evaluated the performance of our algorithm and compared it against the closest algorithms in the literature. Results show that the proposed algorithm achieves a high level of fairness, and it reduces the rate of quality switches by up to 32 % compared to other algorithms. In addition, the proposed algorithm saves up to 18 % of the radio resource blocks of the cellular network while achieving comparable average quality to the other algorithms.

Figure 5.9: Fairness in terms of the average Jain's index across users, for different numbers of users.

Figure 5.10: Cumulative distribution function of the average running time per scheduling window (40 users).

Chapter 6

Conclusions and Future Work

6.1 Conclusions

3D video delivery services face a number of challenges due to their high bandwidth requirements and the complexity of the view synthesis process, which affects the relationship between the transmission bitrate and the quality of generated virtual views. When delivering 3D videos over wireless networks, service providers encounter additional challenges caused by the limited capacity and instability of the wireless channel as well as the limited battery life of mobile terminals. To address these challenges, we have developed a number of novel algorithms in this thesis to enable streaming clients and mobile network providers to optimize the quality delivered to the user and efficiently utilize radio resources.

As 3D videos gain popularity and more devices incorporate displays capable of rendering stereoscopic and multi-view videos, mobile network operators will turn to multicast services as an efficient way to deliver traffic to multiple clients simultaneously while minimizing the wireless resource usage. With an increase in the number of multicast sessions, it becomes more important to efficiently utilize and distribute radio resources in order to provide the best possible quality-of-experience to the users. Towards this end, we proposed an algorithm for scheduling the transmission of a set of 3D videos over multicast sessions in Chapter 3. To enable flexible adaptation and reduce storage requirements, video components are encoded into a number of scalability layers. The objective of the algorithm is to reduce the energy consumed by receivers over the duration of the multicast sessions while maximizing the quality of the rendered videos and ensuring uninterrupted playback. The algorithm decides which set of substreams to transmit for each 3D video using quality models that relate the qualities of substreams to the quality of virtual views synthesized using the substreams as references. Burst scheduling transmission using double-buffering is then used to deliver the chosen substreams to reduce energy consumption and avoid buffer underflows. We implemented and validated our algorithms in a simulation setup and studied the impact of a wide range of parameters. Results have shown that our algorithm was able to select substreams

that yield an average virtual view quality within 0.3 dB of the optimal. Our energy-efficient radio frame allocation reduces the power consumption of the receivers by 86% on average.

In Chapter 4, we studied the problem of delivering free-viewpoint videos using HTTP adaptive streaming methods. With free-viewpoint videos, users are able to dynamically choose a different vantage point, and a user's experience becomes more realistic as the number of available views increases. To reduce the transmission and storage overheads, free-viewpoint videos are often represented using a small set of captured views plus depth information, and additional views are synthesized on-the-fly using this information. The rate adaptation module in streaming clients should therefore handle interactive view switching as well as the complex relationship between the bitrate of the reference views and the quality of synthesized views. We presented a rate adaptation method that performs view pre-fetching in order to reduce view switching latency and selects the reference views' representations that maximize the quality of synthesized virtual views. The proposed rate adaptation method uses view prediction based on historical viewpoint positions and performs quality-aware operating point selection using empirical or analytical virtual view quality models that are generated offline by the content provider. We implemented a streaming client with a rate adaptation module based on the proposed method and discussed how to signal the quality models within the manifest file describing the video. Through experimental evaluation we have shown that the proposed virtual view quality-aware rate adaptation method results in significant quality gains compared to other approaches (up to 4 dB for CBR streams and up to 2.26 dB for VBR streams).

In Chapter 5, we extended the scenario described in Chapter 4 to multiple users and considered a number of HAS-based free-viewpoint video streaming sessions over broadband wireless access networks. In such networks, video sessions share the downlink capacity, and the base station scheduler is responsible for dynamically allocating the downlink radio resources based on fed-back channel state information. Current resource scheduling algorithms are designed for traditional single-rate 2D video streaming and elastic data flows and are not optimized for adaptive video flows and free-viewpoint videos. Moreover, these algorithms rely on the concept of proportional fairness, which mainly focuses on the network (bitrate) utility and is oblivious to application-layer QoE, an important factor in video streaming applications. We formulated the radio resource scheduling problem as a multi-objective optimization problem and proposed a heuristic algorithm that solves it. The proposed algorithm allocates the radio resources with the goal of achieving QoE-fairness across the streaming sessions while maximizing the perceived quality and reducing quality variations. It utilizes rate-utility models to estimate the quality of synthesized virtual views given an assigned user data rate and incrementally assigns radio resources. Results show that the proposed algorithm achieves a high level of fairness, and it reduces the rate of quality switches by up to 32% compared to other algorithms. Moreover, it saves up to 18% of

the radio resource blocks of the cellular network while achieving comparable average quality to the other algorithms.

6.2 Future Work

The problems studied in this thesis can be extended in several directions. In Chapter 3, we presented an algorithm that schedules the transmission of 3D video data for a number of multicast sessions across a set of radio frames within a scheduling window. However, the proposed algorithm considers a single-cell multicast scenario. As mentioned in Section 2.7, mobile broadband access networks currently support cooperative transmission between a number of neighboring cells through the use of identical and synchronized radio signals, where the cooperating cells constitute what is known as a single-frequency network (SFN). This enables receivers at cell edges to get multiple copies of the same data from multiple base stations, thereby enhancing the received signal. A future research direction that can be investigated is how to optimize the scheduling of multicast data across multiple cells within an SFN. This includes addressing issues such as determining the number of SFNs that should be created, determining how to allocate cells to SFNs, and dynamically adjusting cell configurations.

While network bandwidth is the main constraint for optimizing the quality-of-experience of adaptive streaming clients in wired networks, mobile clients streaming video over wireless networks are faced with an additional factor that affects their viewing experience, namely, battery lifetime. The development pace of battery technologies is failing to catch up with the progress of mobile multimedia hardware and applications. The amount of remaining battery power determines the maximum viewing time that the viewer can enjoy. In video streaming applications, the three main sources of battery drainage are the display, the audio/video decoding processes, and the wireless network interface. A mobile video streaming client should therefore take into consideration the power consumed by the device while using the service. An extended version of the rate adaptation method presented in Chapter 4 would jointly optimize the virtual view quality and the energy consumed by the client. For example, by including additional metadata concerning the power consumption in the manifest file, a client can estimate the power consumed for decoding each segment before the actual decode time and make energy-conservative decisions in the segment selection process. Based on the remaining battery life and the total duration of the video, an FVV streaming client running on a mobile device can determine the average acceptable power consumption, and the total power consumed by all the segments scheduled for download should not exceed this value (see the sketch at the end of this section). The streaming client periodically updates this value as playback progresses and the remaining battery power changes.

Other extensions to this work include conducting more studies on users' view navigation patterns to derive models that accurately reflect user behavior and improve the accuracy

of reference view scheduling. The proposed rate adaptation method can also be improved by considering a hybrid rate adaptation approach that incorporates both the estimated throughput and the client's segment buffer level.

In the first two problems that we addressed in this thesis, multicasting 3D videos and HAS-based FVV adaptive streaming, the solutions resulting from the proposed algorithms depend on the chosen virtual view quality models. Finding a good quality model for synthesized views is an on-going research problem, and several models have recently been introduced in the literature. One possible extension to the work presented in this thesis is to evaluate the accuracy of these models and integrate them with our algorithms. However, it should be noted that not all models will be directly compatible with the algorithms presented in this thesis. For example, some complex higher-order models, e.g., the cubic model presented in [21], would require revisiting the formulation of the substream selection problem presented in Chapter 3.

The radio resource allocation algorithm presented in Chapter 5 is able to achieve fairness across users in terms of perceived quality. However, a user's quality-of-experience is a combination of multiple factors. One factor that is not considered by the algorithm is the clients' playout buffer levels. Re-buffering events during video playback result in stalling, which can significantly degrade the users' QoE. The stalling probability increases as the data traffic load within the cell increases, especially for new streaming sessions. An important extension to the proposed algorithm is to make sure that an adequate and fair amount of video playback time is available in the buffers when performing the resource allocation.
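As a purely illustrative sketch of the per-segment energy budgeting idea described earlier in this section, the following Python snippet computes an average power budget from the remaining battery energy and the remaining playback time, and picks the highest-quality representation whose estimated power draw fits that budget. The function names, the energy model, and the numbers are hypothetical and are not part of the proposed algorithms.

    from typing import List, Tuple

    def power_budget_watts(remaining_battery_joules: float,
                           remaining_video_seconds: float) -> float:
        """Average power the client can afford for the rest of the session."""
        return remaining_battery_joules / remaining_video_seconds

    def select_representation(reps: List[Tuple[float, float]],
                              budget_watts: float) -> Tuple[float, float]:
        """Pick the highest-quality representation whose estimated power draw
        (decoding plus download, e.g., signalled as metadata in the manifest)
        fits within the budget. Each representation is (quality_db, power_watts)."""
        feasible = [r for r in reps if r[1] <= budget_watts]
        if not feasible:
            return min(reps, key=lambda r: r[1])   # fall back to the cheapest one
        return max(feasible, key=lambda r: r[0])

    # Hypothetical numbers: 3600 J (about 1 Wh) of battery and 30 minutes of video left.
    budget = power_budget_watts(3600.0, 30 * 60)    # 2.0 W on average
    reps = [(38.0, 1.2), (40.0, 1.8), (41.5, 2.4)]  # (quality in dB, power in W)
    print(select_representation(reps, budget))      # -> (40.0, 1.8)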

Bibliography

[1] Apache HTTP Server Project. https://httpd.apache.org/. [Accessed: January 15, 2016].

[2] GPAC multimedia framework. http://gpac.wp.mines-telecom.fr/.

[3] libdash C++ library. https://github.com/bitmovin/libdash. [Accessed: January 15, 2016].

[4] MATLAB and Curve Fitting Toolbox Release 2014a. http://www.mathworks.com/. [Accessed: January 15, 2017].

[5] Nagoya University FTV test sequences. http://www.fujii.nuee.nagoya-u.ac.jp/multiview-data/mpeg/mpeg_ftv.html.

[6] OpenGL 2D/3D graphics API. https://www.khronos.org/opengl/.

[7] 3D video and free viewpoint video – from capture to display. Pattern Recognition, 44(9):1958–1968, 2011.

[8] 3GPP. Improved video support for packet switched streaming (PSS) and multimedia broadcast/multicast service services (v13.0.0). TS 26.903, 3rd Generation Partnership Project (3GPP), December 2015.

[9] 3GPP. LTE RRC protocol specification version 12.5.0 release 12. TS 36.331, 3rd Generation Partnership Project (3GPP), April 2015.

[10] Omar Abdul-Hameed, Erhan Ekmekcioglu, and Ahmet Kondoz. State of the art and challenges for 3D video delivery over mobile broadband networks. In Ce Zhu and Yuenan Li, editors, Advanced Video Communications over Wireless Networks. CRC Press, 2013.

[11] Adobe HTTP Dynamic Streaming. http://www.adobe.com/products/hds-dynamic-streaming.html. [Accessed: January 15, 2017].

[12] Ian F. Akyildiz, David M. Gutierrez-Estevez, and Elias Chavarria Reyes. The evolution to 4G cellular systems: LTE-Advanced. Physical Communication, 3(4):217–244, December 2010.

[13] Tara Ali-Yahiya. Understanding LTE and its performance. Springer, 1st edition, 2011.

[14] Jeffrey G. Andrews, Arunabha Ghosh, and Rias Muhamed. Fundamentals of WiMAX: Understanding Broadband Wireless Networking. Prentice Hall PTR, Upper Saddle River, NJ, USA, 2007.

[15] Apple HTTP Live Streaming. https://developer.apple.com/streaming/. [Accessed: January 15, 2017].

[16] Athula Balachandran, Vyas Sekar, Aditya Akella, Srinivasan Seshan, Ion Stoica, and Hui Zhang. A quest for an internet video quality-of-experience metric. In Proceedings of the ACM Workshop on Hot Topics in Networks (HotNets’12), pages 97–102, 2012.

[17] J. M. Boyce, Y. Ye, J. Chen, and A. K. Ramasubramonian. Overview of SHVC: Scalable extensions of the high efficiency video coding standard. IEEE Transactions on Circuits and Systems for Video Technology, 26(1):20–34, January 2016.

[18] Cristiano Ceglie, Giuseppe Piro, Domenico Striccoli, and Pietro Camarda. Perfor- mance evaluation of 3D video streaming services in LTE-Advanced networks. Wireless Networks, 20(8):2255–2273, 2014.

[19] Nesrine Changuel, Bessem Sayadi, and Michel Kieffer. Control of distributed servers for quality-fair delivery of multiple video streams. In Proceedings of the ACM Inter- national Conference on Multimedia (MM’12), pages 269–278, 2012.

[20] Junyang Chen, Mostafa Ammar, Marwan Fayed, and Rodrigo Fonseca. Client-driven network-level QoE fairness for encrypted ’DASH-S’. In Proceedings of the ACM Work- shop on QoE-based Analysis and Management of Data Communication Networks, Internet-QoE ’16, pages 55–60, 2016.

[21] G. Cheung, V. Velisavljevic, and A. Ortega. On dependent bit allocation for multi- view image coding with depth-image-based rendering. IEEE Transactions on Image Processing, 20(11):3179–3194, November 2011.

[22] Tae-Young Chung, Jae-Young Sim, and Chang-Su Kim. Bit allocation algorithm with novel view synthesis distortion model for multiview video plus depth coding. IEEE Transactions on Image Processing, 23(8):3254–3267, August 2014.

[23] S. Cicalò, N. Changuel, V. Tralli, B. Sayadi, F. Faucheux, and S. Kerboeuf. Improving QoE and fairness in HTTP adaptive streaming over LTE network. IEEE Transactions on Circuits and Systems for Video Technology, 26(12):2284–2298, December 2016.

[24] C. Cicconetti, L. Lenzini, E. Mingozzi, and C. Eklund. Quality of service support in IEEE 802.16 networks. IEEE Network, 20(2):50–55, March 2006.

[25] Cisco visual networking index: Global mobile data traffic forecast update, 2014-2019. White Paper, February 2016. http://www.cisco.com/c/en/us/ solutions/collateral/service-provider/visual-networking-index-vni/ mobile-white-paper-c11-520862.html.

[26] G. Cofano, L. De Cicco, T. Zinner, A. Nguyen-Ngoc, P. Tran-Gia, and S. Mascolo. Design and experimental evaluation of network-assisted strategies for HTTP adap- tive streaming. In Proceedings of the ACM International Conference on Multimedia Systems (MMSys’16), pages 3:1–3:12, 2016.

[27] IBM ILOG CPLEX Optimizer. http://www.ibm.com/software/integration/ optimization/cplex-optimizer/. [Accessed: January 15, 2017].

[28] Erik Dahlman, Stefan Parkvall, and Johan Skold. 4G: LTE/LTE-Advanced for mobile broadband. Academic Press, second edition, 2014.

[29] Dick K. G. de Boer, Martin G. H. Hiddink, Maarten Sluijter, Oscar H. Willemsen, and Siebe T. de Zwart. Switchable lenticular based 2D/3D displays. In Proceedings of SPIE, volume 6490 of Stereoscopic Displays and Virtual Reality Systems XIV, pages 64900R–64900R–8, March 2007.

[30] E. de Diego Balaguer, F.H.P. Fitzek, O. Olsen, and M. Gade. Performance evaluation of power saving strategies for DVB-H services using adaptive MPE-FEC decoding. In Proceedings of the IEEE International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC’05), volume 4, pages 2221–2226, September 2005.

[31] D.V.S.X. De Silva, E. Ekmekcioglu, O. Abdul-Hameed, W.A.C. Fernando, S.T. Wor- rall, and A.M. Kondoz. Performance evaluation of 3D-TV transmission over WiMAX broadband access networks. In Proceedings of the International Conference on In- formation and Automation for Sustainability (ICIAfS’10), pages 298–303, December 2010.

[32] N.A. Dodgson. Autostereoscopic 3D displays. IEEE Computer, 38(8):31–36, August 2005.

[33] M. Dyer. An O(n) algorithm for the multiple-choice knapsack linear program. Math- ematical Programming, 29:57–63, 1984.

[34] 4G Broadcast technology trial at Wembley 2015 FA Cup Final. http://www.bbc.co. uk/rd/blog/2015-05-4g-broadcast-trial-wembley-2015-fa-cup-final. [Ac- cessed: January 15, 2017].

[35] TV and Media 2016: The evolving role of TV and media in consumers’ everyday lives. White Paper, November 2016. https://www.ericsson.com/res/docs/2016/ consumerlab/tv-and-media-2016.pdf.

[36] A. E. Essaili, D. Schroeder, E. Steinbach, D. Staehle, and M. Shehada. QoE-based traffic and resource management for adaptive HTTP video delivery in LTE. IEEE Transactions on Circuits and Systems for Video Technology, 25(6):988–1001, June 2015.

[37] C. Fehn. Depth-image-based rendering (DIBR), compression, and transmission for a new approach on 3D-TV. In Proceedings of SPIE, volume 5291 of Stereoscopic Displays and Virtual Reality Systems XI, pages 93–104, May 2004.

[38] Zhenhuan Gao, Shannon Chen, and Klara Nahrstedt. OmniViewer: Enabling multi- modal 3D DASH. In Proceedings of the ACM International Conference on Multimedia (MM’15), pages 801–802, 2015.

[39] Johan Garcia, Emmanuel Conchon, Tanguy Pérennou, and Anna Brunstrom. KauNet: Improving reproducibility for wireless and mobile research. In Proceedings of the International Workshop on System Evaluation for Mobile Platforms (MobiEval’07), pages 21–26, 2007.

[40] M. N. Garcia, F. De Simone, S. Tavakoli, N. Staelens, S. Egger, K. Brunnström, and A. Raake. Quality of experience and HTTP adaptive streaming: A review of subjective studies. In Proceedings of the International Workshop on Quality of Multimedia Experience (QoMEX'14), pages 141–146, September 2014.

[41] Arunabha Ghosh, Jun Zhang, Jeffrey G. Andrews, and Rias Muhamed. Fundamentals of LTE. Prentice Hall, 1st edition, 2010.

[42] M. Gotfryd, K. Wegner, and M. Domański. View synthesis software and assessment of its performance. Doc. M15672, ISO/IEC JTC1/SC29/WG11 (MPEG), Hannover, Germany, 2008.

[43] Ronen Gvili, Amir Kaplan, Eyal Ofek, and Giora Yahav. Depth keying. In Proceedings of SPIE, volume 5006 of Stereoscopic Displays and Virtual Reality Systems X, pages 564–574, May 2003.

[44] Ahmed Hamza and Mohamed Hefeeda. Energy-efficient multicasting of multiview 3D videos to mobile devices. ACM Transactions on Multimedia Computing, Communications, and Applications, 8(3s):45:1–45:25, October 2012.

[45] Ahmed Hamza and Mohamed Hefeeda. Adaptive streaming of interactive free view- point videos to heterogeneous clients. In Proceedings of the ACM International Con- ference on Multimedia Systems (MMSys’16), pages 10:1–10:12, 2016.

[46] Miles Hansard, Seungkyu Lee, Ouk Choi, and Radu Patrice Horaud. Time-of-Flight Cameras: Principles, Methods and Applications. Springer-Verlag London, 2013.

[47] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, second edition, 2004.

[48] M. Hefeeda and Cheng-Hsin Hsu. Energy optimization in mobile TV broadcast net- works. In Proceedings of the International Conference on Innovations in Information Technology, pages 430–434, December 2008.

[49] Jin Heo and Yo-Sung Ho. Improved context-based adaptive binary arithmetic cod- ing over H.264/AVC for lossless depth map coding. IEEE Signal Processing Letters, 17(10):835–838, October 2010.

[50] Ianir Ideses and Leonid Yaroslavsky. New methods to produce high quality color anaglyphs for 3-D visualization. In Aurélio Campilho and Mohamed Kamel, editors, Image Analysis and Recognition, volume 3212 of Lecture Notes in Computer Science, pages 273–280. Springer Berlin / Heidelberg, 2004.

[51] Ianir Ideses and Leonid Yaroslavsky. Three methods that improve the visual quality of colour anaglyphs. Journal of Optics A: Pure and Applied Optics, 7(12):755, 2005.

[52] IEEE. IEEE standard for air interface for broadband wireless access systems. IEEE Standard 802.16, Institute of Electrical and Electronics Engineers (IEEE), 2012.

[53] ISO/IEC. ISO/IEC 13818-1:2003/FDAM2 Carriage of Auxiliary Data. Doc. N8799, ISO/IEC JTC1/SC29/WG11 (MPEG), Marrakech, Morocco, January 2007.

[54] ISO/IEC. ISO/IEC 23002-3 Representation of Auxiliary Video and Supplemental Information. Technical report, International Organization for Standardization, Geneva, Switzerland, January 2007.

[55] ISO/IEC. Description of Exploration Experiments in 3D Video Coding. Doc. N11095, ISO/IEC JTC1/SC29/WG11 (MPEG), January 2010.

[56] ISO/IEC. Information technology – Dynamic adaptive streaming over HTTP (DASH) – Part 1: Media presentation description and segment formats. ISO 23009-1:2012, International Organization for Standardization, Geneva, Switzerland, 2012.

[57] ISO/IEC. ISO/IEC 23008-2 High efficiency coding and media delivery in heteroge- neous environments – Part 2: High efficiency video coding. Technical report, Interna- tional Organization for Standardization, Geneva, Switzerland, May 2015.

[58] ITU-R. Methodology for subjective assessment of the quality of television pictures. Recommendation ITU-R BT.500-13, ITU Radiocommunication Sector (ITU-R), 2012.

[59] R Jain, DM Chiu, and W Hawe. A quantitative measure of fairness and discrimination for resource allocation in shared computer systems. Technical Report TR-301, Digital Equipment Corporation, 1984.

[60] Junchen Jiang, Vyas Sekar, and Hui Zhang. Improving fairness, efficiency, and sta- bility in HTTP-based adaptive video streaming with FESTIVE. In Proceedings of the International Conference on Emerging Networking Experiments and Technologies (CoNEXT’12), pages 97–108, 2012.

[61] Joint Scalable Video Model (JSVM) - Reference Software. www.hhi.fraunhofer.de/ departments/video-coding-analytics/research-groups/image-video-coding/ research-topics/svc-extension-of-h264avc/jsvm-reference-software.html. [Accessed: January 15, 2016].

[62] Y. S. Kang, C. Lee, and Y. S. Ho. An efficient rectification algorithm for multi-view images in parallel camera array. In In Proceedings of the 3DTV Conference: The True Vision - Capture, Transmission and Display of 3D Video (3DTV-CON), pages 61–64, May 2008.

[63] H.A. Karim, S. Worrall, A. H. Sadka, and A. M. Kondoz. 3-D video compression using MPEG4-multiple auxiliary component (MPEG4-MAC). In Proceedings of the International Conference on Visual Information Engineering (VIE), April 2005.

[64] P. Kauff, N. Atzpadin, C. Fehn, M. Müller, O. Schreer, A. Smolic, and R. Tanger. Depth map creation and image-based rendering for advanced 3DTV services providing interoperability and scalability. Signal Processing: Image Communication, 22(2):217– 234, 2007.

[65] Hans Kellerer, Ulrich Pferschy, and David Pisinger. Knapsack Problems. Springer- Verlag, 2004.

[66] Frank Kelly. Charging and rate control for elastic traffic. European Transactions on Telecommunications, 8(1):33–37, 1997.

[67] W.-S. Kim, A. Ortega, P. Lai, D. Tian, and C. Gomila. Depth map coding with distortion estimation of rendered view. In Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, volume 7543, January 2010.

[68] A. Kubota, A. Smolic, M. Magnor, M. Tanimoto, T. Chen, and C. Zhang. Multiview imaging and 3DTV. IEEE Signal Processing Magazine, 24(6):10–21, November 2007.

[69] Amitabh Kumar. Mobile Broadcasting with WiMAX: Principles, Technology, and Applications. Elsevier Inc., 2008.

[70] E. Kurutepe, M.R. Civanlar, and A.M. Tekalp. Client-driven selective streaming of multiview video for interactive 3DTV. IEEE Transactions on Circuits and Systems for Video Technology, 17(11):1558–1565, 2007.

[71] Jean Le Feuvre, Cyril Concolato, Jean-Claude Dufourd, Romain Bouqueau, and Jean- Claude Moissinac. Experimenting with multimedia advances using GPAC. In Proceed- ings of the ACM International Conference on Multimedia (MM’11), pages 715–718, 2011.

[72] Z. Li, X. Zhu, J. Gahm, R. Pan, H. Hu, A. C. Begen, and D. Oran. Probe and adapt: Rate adaptation for HTTP video streaming at scale. IEEE Journal on Selected Areas in Communications, 32(4):719–733, April 2014.

[73] Chi-Heng Lin, De-Nian Yang, Ji-Tang Lee, and Wanjiun Liao. Error-resilient mul- ticast for multi-view 3D videos in IEEE 802.11 networks. CoRR, abs/1503.08726, 2015.

[74] Shujie Liu, PoLin Lai, Dong Tian, Cristina Gomila, and Chang Wen Chen. Joint tri- lateral filtering for depth map compression. Proceedings of the SPIE, 7744(1):77440F, 2010.

[75] Y. Liu, S. Dey, D. Gillies, F. Ulupinar, and M. Luby. User experience modeling for DASH video. In Proceedings of the International Packet Video Workshop (PV’13), pages 1–8, December 2013.

[76] Yanwei Liu, Qingming Huang, Siwei Ma, Debin Zhao, and Wen Gao. Joint video/depth rate allocation for 3D video coding based on view synthesis distortion model. Signal Processing: Image Communication, 24(8):666–681, September 2009.

[77] A. Mansy, M. Fayed, and M. Ammar. Network-layer fairness for adaptive video streams. In IFIP Networking Conference, pages 1–9, May 2015.

[78] D. Marpe, T. Wiegand, and S. Gordon. H.264/MPEG4-AVC fidelity range extensions: tools, profiles, performance, and application areas. In IEEE International Conference on Image Processing, volume 1, pages 593–596, September 2005.

[79] MasterImage 3D Autostereoscopic 3D LCD. http://masterimage3d.com/products/3d- lcd, 2012.

[80] Takashi Matsuyama, Shohei Nobuhara, Takeshi Takai, and Tony Tung. 3D Video and Its Applications. Springer-Verlag London, 2012.

[81] T. Maugey and P. Frossard. Interactive multiview video system with low complexity 2D look around at decoder. IEEE Transactions on Multimedia, 15(5):1070–1082, 2013.

[82] Microsoft Smooth Streaming. http://www.iis.net/downloads/microsoft/ smooth-streaming. [Accessed: January 15, 2017].

[83] S. Milani and G. Calvagno. A depth image coder based on progressive silhouettes. IEEE Signal Processing Letters, 17(8):711–714, August 2010.

[84] Mobile 3D market estimated 547.69 million units by 2018. http://www. marketsandmarkets.com/PressReleases/3d-mobile.asp. [Accessed: January 15, 2017].

[85] Ricky K.P. Mok, Edmond W.W. Chan, Xiapu Luo, and Rocky K.C. Chang. Inferring the QoE of HTTP video streaming from user-viewing activities. In Proceedings of the ACM SIGCOMM Workshop on Measurements Up the Stack (W-MUST’11), pages 31–36, 2011.

[86] Moving Picture Experts Group (MPEG). http://mpeg.chiariglione.org/. [Ac- cessed: January 15, 2017].

[87] C. Mueller, S. Lederer, C. Timmerer, and H. Hellwagner. Dynamic adaptive streaming over HTTP/2.0. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME’13), pages 1–6, July 2013.

[88] K. Müller, P. Merkle, and T. Wiegand. 3-D video representation using depth maps. Proceedings of the IEEE, 99(4):643–656, April 2011.

[89] K. Müller, A. Smolic, K. Dix, P. Kauff, and T. Wiegand. Reliability-based generation and view synthesis in layered depth video. In Proceedings of the IEEE Workshop on Multimedia Signal Processing, pages 34–39, October 2008.

[90] Mark Nawrot. Depth from motion parallax scales with eye movement gain. Journal of Vision, 3(11):841–851, 2003.

[91] Kwan-Jung Oh, Sehoon Yea, A. Vetro, and Yo-Sung Ho. Depth reconstruction filter and down/up sampling for depth coding in 3-D video. IEEE Signal Processing Letters, 16(9):747–750, September 2009.

[92] S. Ohtsuka and S. Saida. Depth perception from motion parallax in the . In Proceedings of the IEEE International Workshop on Robot and Human Communication, pages 72–77, July 1994.

[93] Goran Petrovic, Luat Do, Sveta Zinger, and Peter H. N. de With. Virtual view adaptation for 3D multiview video streaming. Stereoscopic Displays and Applications XXI, 7524(1):752410, 2010.

[94] Prashant Ramanathan and Bernd Girod. Rate-distortion analysis for light field coding and streaming. Signal Processing: Image Communication, 21(6):462–475, 2006.

[95] Stephan Reichelt, Ralf Häussler, Gerald Fütterer, and Norbert Leister. Depth cues in human visual perception and their realization in 3D displays. In Proceedings of SPIE, volume 7690 of Three-Dimensional Imaging, Visualization, and Display, pages 76900B–76900B–12, May 2010.

[96] Research and Markets. Global 3d Display Market 2016-2020. http://www. researchandmarkets.com/research/gptljv/global_3d_display, 2016.

[97] Iain E. Richardson. The H.264 Advanced Video Compression Standard. John Wiley & Sons, 2nd edition, 2010.

[98] Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of dense two- frame stereo correspondence algorithms. International Journal of Computer Vision, 47:7–42, 2002.

[99] Oliver Schreer, Peter Kauff, and Thomas Sikora, editors. 3D video communication: Algorithms, concepts and real-time systems in human centred communication. Wiley, 2005.

[100] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson. RTP: A transport protocol for real-time applications. RFC 3550, July 2003. https://tools.ietf.org/html/ rfc3550.

[101] H. Schulzrinne, A. Rao, and R. Lanphier. Real Time Streaming Protocol (RTSP). RFC 2326, April 1998. https://tools.ietf.org/html/rfc2326.

[102] H. Schwarz, D. Marpe, and T. Wiegand. Overview of the scalable video coding extension of the H.264/AVC standard. IEEE Transactions on Circuits and Systems for Video Technology, 17(9):1103–1120, September 2007.

[103] Datasheet: SQN1130 System-on-Chip for WiMAX Moblie Stations. http://www. sequans.com/products-solutions/mobile-wimax/sqn1130/, 2007.

[104] M. Seufert, S. Egger, M. Slanina, T. Zinner, T. Hoßfeld, and P. Tran-Gia. A survey on quality of experience of HTTP adaptive streaming. IEEE Communications Surveys Tutorials, 17(1):469–492, 2015.

[105] Jonathan Shade, Steven Gortler, Li-wei He, and Richard Szeliski. Layered depth im- ages. In Proceedings of the Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH’98), pages 231–242, 1998.

[106] Feng Shao, Gang-yi Jiang, and Mei Yu. Color correction and geometric calibration for multi-view images with feature correspondence. Optoelectronics Letters, 5(3):232–235, 2009.

[107] Sandeep Singhal and Michael Zyda. Networked Virtual Environments: Design and Implementation. Addison-Wesley Professional, 1st edition, 1999.

[108] Aljoscha Smolic. An Overview of 3D Video and Free Viewpoint Video, pages 1–8. Springer Berlin Heidelberg, Berlin, Heidelberg, 2009.

[109] I. Sodagar. The MPEG-DASH standard for multimedia streaming over the Internet. IEEE MultiMedia, 18(4):62–67, April 2011.

[110] L. Stelmach, Wa James Tam, D. Meegan, and A. Vincent. Stereo image quality: effects of mixed spatio-temporal resolution. IEEE Transactions on Circuits and Systems for Video Technology, 10(2):188–193, March 2000.

[111] G. M. Su, Z. Han, M. Wu, and K. J. R. Liu. A scalable multiuser framework for video over OFDM networks: Fairness and efficiency. IEEE Transactions on Circuits and Systems for Video Technology, 16(10):1217–1231, October 2006.

[112] Guan-Ming Su, Yu-Chi Lai, Andres Kwasinski, and Haohong Wang. 3D video com- munications: Challenges and opportunities. International Journal of Communication Systems, 24(10):1261–1281, October 2011.

[113] Tianyu Su, Ashkan Sobhani, Abdulsalam Yassine, Shervin Shirmohammadi, and Ab- bas Javadtalab. A DASH-based HEVC multi-view video streaming system. Journal of Real-Time Image Processing, pages 1–14, 2015.

[114] G.J. Sullivan, J. Ohm, Woo-Jin Han, and T. Wiegand. Overview of the high efficiency video coding (HEVC) standard. IEEE Transactions on Circuits and Systems for Video Technology, 22(12):1649–1668, December 2012.

[115] M. Tanimoto, T. Fujii, K. Suzuki, N. Fukushima, and Y. Mori. Reference softwares for depth estimation and view synthesis. Doc. M15377, ISO/IEC JTC1/SC29/WG11 (MPEG), Archamps, France, April 2008.

[116] G. Tech, Y. Chen, K. Muller, J. Ohm, A. Vetro, and Y. Wang. Overview of the multiview and 3D extensions of high efficiency video coding. IEEE Transactions on Circuits and Systems for Video Technology, 26(1):35–49, January 2016.

[117] G. Tech, K. Wegner, Y. Chen, M. M. Hannuksela, and J. Boyce. MV-HEVC draft text 9. Doc. JCT3V-I1002, ITU-T/ISO/IEC Joint Collaborative Team on 3D Video Coding Extension Development (JCT-3V), Sapporo, Japan, July 2014.

[118] G. Tech, K. Wegner, Y. Chen, and S. Yea. 3D-HEVC draft text 7. Doc. JCT3V- K1001, ITU-T/ISO/IEC Joint Collaborative Team on 3D Video Coding Extension Development (JCT-3V), Geneva, Switzerland, February 2015.

[119] R. Y. Tsai. An efficient and accurate camera calibration technique for 3D machine vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 364–374, 1986.

[120] G. Tychogiorgos, A. Gkelias, and K. K. Leung. Utility-proportional fairness in wireless networks. In Proceedings of the IEEE International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC’12), pages 839–844, September 2012.

[121] S.-i. Uehara, T. Hiroya, H. Kusanagi, K. Shigemura, and H. Asada. 1-inch diagonal transflective 2D and 3D LCD with HDDP arrangement. In Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, volume 6803, page 68030O, March 2008.

[122] Hakan Urey, Kishore V. Chellappan, Erdem Erden, and Phil Surman. State of the art in stereoscopic and autostereoscopic displays. Proceedings of the IEEE, 99(4), 2011.

[123] V. Velisavljevic, Gene Cheung, and J. Chakareski. Bit allocation for multiview image compression using cubic synthesized view distortion model. In IEEE International Conference on Multimedia and Expo (ICME'11), pages 1–6, July 2011.

[124] Verizon Wireless: Customers Use 1.9 Terabytes of Data in Stadium at Super Bowl, February 2014. http://tiny.cc/Verizon2014. [Accessed: January 15, 2016].

[125] A. Vetro, A.M. Tourapis, K. Müller, and Tao Chen. 3D-TV content storage and transmission. IEEE Transactions on Broadcasting, 57(2):384–394, June 2011.

[126] A. Vetro, T. Wiegand, and G.J. Sullivan. Overview of the stereo and multiview video coding extensions of the H.264/MPEG-4 AVC standard. Proceedings of the IEEE, 99(4):626–642, April 2011.

[127] D. De Vleeschauwer, H. Viswanathan, A. Beck, S. Benno, G. Li, and R. Miller. Optimization of HTTP adaptive streaming over mobile cellular networks. In Proceedings of the IEEE Conference on Computer Communications (INFOCOM'13), pages 898–997, April 2013.

[128] Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, April 2004.

[129] Stefan Winkler and Dongbo Min. Stereo/multiview picture quality: Overview and recent advances. Signal Processing: Image Communication, 28(10):1358–1373, 2013.

[130] Graham J. Woodgate, Jonathan Harrold, Adrian M. S. Jacobs, Richard R. Moseley, and David Ezra. Flat-panel autostereoscopic displays: characterization and enhancement. Proceedings of SPIE - International Society for Optical Engineering, 3957(1):153–164, 2000.

[131] Jimin Xiao, Miska M. Hannuksela, Tammam Tillo, and Moncef Gabbouj. A paradigm for dynamic adaptive streaming over HTTP for multi-view video. In Yo-Sung Ho, Jitao Sang, Yong Man Ro, Junmo Kim, and Fei Wu, editors, Advances in Multimedia Information Processing, volume 9315 of Lecture Notes in Computer Science, pages 410–418. Springer International Publishing, 2015.

[132] Z. Yan, J. Xue, and C. W. Chen. QoE continuum driven HTTP adaptive streaming over multi-client wireless networks. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME'14), pages 1–6, July 2014.

[133] X.D. Yang, Y.H. Song, T.J. Owens, J. Cosmas, and T. Itagaki. Performance analysis of time slicing in DVB-H. In Joint IST Workshop on Mobile Future, and the Symposium on Trends in Communications (SympoTIC'04), pages 183–186, October 2004.

[134] F. Z. Yousaf, M. Liebsch, A. Maeder, and S. Schmid. Mobile CDN enhancements for QoE-improved content delivery in mobile operator networks. IEEE Network, 27(2):14–21, March 2013.

[135] Hui Yuan, Yilin Chang, Junyan Huo, Fuzheng Yang, and Zhaoyang Lu. Model-based joint bit allocation between texture videos and depth maps for 3-D video coding. IEEE Transactions on Circuits and Systems for Video Technology, 21(4):485–497, April 2011.

[136] Hui Yuan, Yilin Chang, Junyan Huo, Fuzheng Yang, and Zhaoyang Lu. Model-based joint bit allocation between texture videos and depth maps for 3-D video coding. IEEE Transactions on Circuits and Systems for Video Technology, 21(4):485–497, April 2011.

[137] Hui Yuan, Ju Liu, Hongji Xu, Zhibin Li, and Wei Liu. Coding distortion elimination of virtual view synthesis for 3D video system: Theoretical analyses and implementation. IEEE Transactions on Broadcasting, 58(4):558–568, December 2012.

[138] Yasir Zaki, Thushara Weerawardane, Carmelita Görg, and Andreas Timm-Giel. Long term evolution (LTE) model development within OPNET simulation environment. In Proceedings of OPNETWORK 2011, Bethesda, MD, USA, August 2011.

[139] Eitan Zemel. An O(n) algorithm for the linear multiple choice knapsack problem and related problems. Information Processing Letters, 18:123–128, March 1984.

[140] Shaobo Zhang and Xin Wang. Metadata representation carrying quality information signalling for DASH. Doc. M32198, ISO/IEC JTC1/SC29/WG11 (MPEG), San Jose, California, USA, January 2014.
