On Quality Aware Adaptation of Internet Video

Dimitrios Miras

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy of the University of London.

Department of Computer Science, University College London

May 2004

To my parents, Fotios and Ioanna

To Athena

Abstract

The main issue with the transmission of streamed video over the Internet is that of adaptation: due to the best-effort service model of the Internet, abundance of bandwidth to guarantee good quality is not always feasible. Instead, congestion control of the shared resources needs to be employed, resulting in variations in bandwidth availability. These variations of the transmission rate introduce a first level of quality degradation. In addition, the time-varying complexity of the underlying visual scenes requires a widely fluctuating transmission bandwidth in order to achieve good quality, otherwise quality oscillations occur. To tackle these problems, a multitude of rate-adaptation techniques have been proposed. However, these proposals either consider adaptation solely from the network point of view (e.g., TCP-friendliness, rate-adaptation, layering) and completely disregard the effect on quality, or employ simplistic metrics of quality (e.g., peak signal-to-noise ratio) that are not necessarily true representations of the quality as experienced by the user of the video service.

This thesis advocates the integration of emerging objective video quality metrics within the adaptation cycle of Internet video. A result of recent research efforts, objective quality metrics are computational models that produce quality ratings which are highly correlated with human judgements of quality. By considering the time-changing relationship between properties of the video content, the available bandwidth, and the effect of both on perceived quality, this dissertation studies quality-aware rate adaptation techniques that improve end quality perception in the context of two different application scenarios of video streaming.

Firstly, this dissertation examines applications that involve the transmission of multiple concurrent media streams to a receiver (e.g., the transmission and display of several video streams relevant to the application scenario). We address the problem of efficiently apportioning the bandwidth available to a multi-stream session among its constituent media flows, by allowing participating flows to jointly adapt their transmission rates in consideration of their respective time-varying quality. Suitable adaptation timescales that coincide with changes in the video content (scene cuts) and an inter-stream adaptation mechanism that considers the time-varying objective quality of the participating streams are proposed. Experimental results show the benefits of the proposed method, in terms of improved session quality and utilisation of the session bandwidth, in comparison to (i) a priority-based inter-stream adaptation and (ii) the case where the session flows are transmitted over independent congestion controlled connections.

Secondly, the thesis deals with the problem of providing smooth quality rate adaptation for live, unicast, real-time video streaming. Since real-time performance is necessary, an objective quality metric cannot be applied in-line, as it is computationally intensive. For this reason, artificial neural networks are utilised to predict quality ratings in real-time. Predictions are sought based on descriptive features of the video content and the bandwidth that is available to the stream. The limitations of current approaches to provide stable or smooth quality are identified. A rate-quality controller, built on the principles of fuzzy logic, is then developed to alleviate annoying short-term quality variations that appear due to mismatches between the available bandwidth and the rate required for stable quality. Based on the neural network's quality predictions, the controller continuously monitors the recent quality values, the nominal transmission rate and the occupancy levels of the participating buffers, to calculate appropriate encoding rates that eliminate short-term quality fluctuations. Experimental results show that the proposed solution offers significant stability of short-term quality and 'smoothes out' annoying oscillations of quality to extreme low and high values, while at the same time respecting transmission rate constraints and preserving buffer stability.

By presenting numerous experimental results with a wide variety of video sequences, this dissertation shows that video streaming systems can utilise objective measures of perceived quality to deliver improved presentation quality, by applying quality-aware adaptation techniques which are tailored to the semantics of the specific streaming application.

Contents

1 Introduction
  1.1 Video adaptation - interaction between the codec and the network
  1.2 Video adaptation - interaction between the codec, network and metrics of perceived quality
  1.3 Contributions
  1.4 Structure of thesis

2 Internet video streaming - issues and challenges
  2.1 Introduction
  2.2 Overview of video communications
  2.3 Video compression
    2.3.1 Current and emerging video compression solutions
  2.4 Networked delivery of compressed video
    2.4.1 Transport rate control
    2.4.2 Rate-adaptive video encoding
    2.4.3 The role of buffering
  2.5 Streaming video content: the effect on quality
    2.5.1 Encoding artifacts
    2.5.2 Transmission artifacts
  2.6 Measuring video quality
    2.6.1 Subjective video assessment
    2.6.2 Objective metrics of video quality
    2.6.3 Weaknesses of video quality assessment techniques
  2.7 A closer insight into an objective quality metric
  2.8 IP video adaptation: from network friendly to media friendly

3 Quality aware adaptation in multi-stream sessions
  3.1 Motivation
  3.2 Related work
    3.2.1 Joint rate control
    3.2.2 Integrated congestion control
  3.3 Timescales of inter-stream adaptation
  3.4 Content-aware quality adaptation model
  3.5 Experimental results
    3.5.1 Effect on quality smoothness
  3.6 Chapter summary and discussion

4 A neural network predictor of quality
  4.1 Motivation and problem description
    4.1.1 Smooth quality video rate control - literature review
    4.1.2 Constraints and requirements of smooth quality video streaming
  4.2 Architecture of smooth quality live video streaming
  4.3 Artificial neural networks
    4.3.1 Related work
    4.3.2 Challenges
  4.4 Extraction of content features
  4.5 Neural network architecture and setup
  4.6 Examination of the ANN performance
    4.6.1 Examination of additional overhead
  4.7 Chapter summary and discussion

5 Smooth quality rate adaptation using a fuzzy controller
  5.1 Estimation of encoding rate for smooth perceived quality
  5.2 Fuzzy adaptive quality smoothing
  5.3 Performance of the fuzzy quality controller
    5.3.1 Study of the quality smoothing capability of the controller
    5.3.2 A closer examination of buffer stability and its impact
    5.3.3 Networks with CBR-like bandwidth
  5.4 Chapter summary and discussion

6 Conclusions
  6.1 Contributions
  6.2 Critical review and areas of future research

A Video sequences

B Fuzzy adaptive quality smoothing - additional results

List of Figures

1.1 Involving objective metrics of perceived quality in the video adaptation process.
2.1 Impact of common video compression artifacts on the perceived quality of images.
2.2 Video quality assessment scales used in subjective MOS tests.
2.3 Failure of pixel error metrics to accurately reflect the perceived quality.
2.4 PSNR of sequence Foreman encoded at 300 Kbps. Observe the significant drops of PSNR as frames are skipped during encoding.
2.5 This figure illustrates the definition of the spatio-temporal (S-T) region.
2.6 Top: edge-enhanced versions of an input (left) and output (right) image. Bottom: histograms of the polar coordinates of the filtered images.
3.1 Variation of instantaneous quality within the scene boundaries and between different scenes. The vertical lines indicate scene cuts in the video content.
3.2 Model schema of content-based inter-stream adaptation architecture.
3.3 The network topology used in the simulations.
3.4 Total session quality of the proposed method in comparison to the proportional allocation method for the first 200 sec of the simulation.
3.5 Average session quality percentage gain of the proposed method in comparison to a proportional allocation, at different levels of network load.
3.6 Left: session quality of the proposed method in comparison to the aggregate quality of the four independent TCP-friendly streams. Right: throughput of the ensemble flow in comparison to the aggregate throughput of the independent TCP-friendly streams.
3.7 Percentage of average session quality gain and throughput of the proposed method in comparison to the aggregate quality and throughput of the independent TCP-friendly streams, at different levels of network load.
3.8 Coefficient of variation of active number of layers for each individual stream.
3.9 Coefficient of variation of quality for each individual stream.
4.1 The effect of a network-friendly video rate encoding on instantaneous quality and the hypothetical shape of a desired quality with infrequent oscillations.
4.2 This illustration shows the components of a framework for smooth quality adaptation of live video streaming and the interactions between involved modules.
4.3 The structure of the basic ANN component - a neuron or perceptron - and a feedforward neural network with n inputs, one hidden layer with m neurons, and one output layer (the weights matrix for layer L is also shown).
4.4 Early stop training: evolution of squared error of neural network responses for the training and monitoring sets.
4.5 Monitoring RMSE, before and after the stepwise elimination on the original inputs (content features) to the model, at different numbers of hidden layer neurons.
4.6 Monitoring RMSE, before and after the stepwise elimination on the PCA scores (principal components), at different numbers of hidden layer neurons n_h.
4.7 Comparison of the monitoring RMSE at different numbers of hidden layer neurons for the two data reduction schemes: sensitivity elimination on (i) original inputs, (ii) PCA scores.
4.8 Sensitivity of the input variables to the model.
4.9 ANN prediction performance. The graphs co-plot the actual versus the predicted quality scores at different target bit rates.
4.10 Histograms of the ANN prediction error values.
4.11 Percentage of prediction error values over Δ, as a function of Δ.
4.12 Side-by-side box-plots depicting the range of ANN prediction residual values.
5.1 Examples of membership functions: (a) s-function, (b) π-function, (c) z-function, (d-f) triangular versions, (g-i) trapezoidal versions, (j) flat π-function, (k) rectangle, (l) singleton.
5.2 An example of three membership functions for the linguistic variable 'age'.
5.3 Main components of a typical end-to-end live video streaming system.
5.4 Histogram of quality change (error) values over subsequent S-T periods.
5.5 The shape of the membership functions for all fuzzy sets for the controller's inputs (error and buflev) and output (parameter a).
5.6 The control surface of the fuzzy inference system that determines parameter a.
5.7 Single bottleneck network topology used in the simulations.
5.8 Improvement of quality stability for two different video sequences. The graphs depict the on-going quality when video is encoded based on the available transmission rate of the flow and that achieved by the fuzzy rate-quality controller.
5.9 Time-plot of Q_tcpf, Q_target and Q_actual (top), sender and receiver buffer sizes, B_s and B_r (middle). The buffers accommodate the mismatches between the encoding and transmission rates (bottom).
5.10 Smoothness index over time between the two quality series of interest: Q_tcpf and Q_target.
5.11 Histograms of ΔQ values: ΔQ_tcpf, ΔQ_target, and ΔQ_actual. Notice the different ranges of the x-axes on the different graphs (initial playout delay: 8 sec).
5.12 Autocorrelation of Q_tcpf, Q_target and Q_actual for two video sequences, The Matrix and Terminator.
5.13 Box-plots depicting the range of variation of Q_tcpf, Q_target and Q_actual values at larger timescales.
5.14 Impact of startup delay (6 and 10 sec) on quality smoothing and buffer occupancy (sequence: The Matrix).
5.15 Impact of the startup delay on quality smoothness and buffer stability. Top: distribution of ΔQ_tcpf, ΔQ_target and ΔQ_actual. Bottom: send and receive buffer sizes over the duration of the simulation (sequence: The Matrix).
5.16 Impact of the startup delay on quality smoothness and buffer stability. Top: distribution of ΔQ_tcpf, ΔQ_target and ΔQ_actual. Bottom: send and receive buffer sizes over the duration of the simulation (sequence: Football).
5.17 CBR transmission. Top: distribution of ΔQ_tcpf, ΔQ_target and ΔQ_actual. Bottom graphs: send and receive buffer sizes over the duration of the simulation (sequence: The Matrix).
5.18 CBR transmission. Top: distribution of ΔQ_tcpf, ΔQ_target and ΔQ_actual. Bottom graphs: send and receive buffer sizes over the duration of the simulation (sequence: Football).
6.1 Clustering effects in the values of a video content feature.
B.1 Histograms of ΔQ values: ΔQ_tcpf, ΔQ_target, and ΔQ_actual for sequences Terminator and mixed. Summary statistics of all depicted distributions are shown for comparison.
B.2 Autocorrelation of Q_tcpf, Q_target and Q_actual for two video sequences, Terminator and mixed.
B.3 Box-plots depicting the range of variation of Q_tcpf, Q_target and Q_actual values at larger timescales.
B.4 Impact of the startup delay on quality smoothness and buffer stability. Top: distribution of ΔQ_tcpf, ΔQ_target and ΔQ_actual. Bottom: send and receive buffer sizes over the duration of the simulation (sequence: Terminator).

List of Tables

3.1 Video sequences used in simulations.
4.1 Content features extracted from the original video frames.
4.2 Numerical properties of the ANN prediction error, at different bit rates: mean absolute error, mean error, median, 0.01, first, third and 0.99 quantile values.
5.1 Linguistic values of the control parameter alpha as a function of fuzzy variables error and buflev.
5.2 Various parameters of the fuzzy rate-quality controller system.
6.1 A simple classification of spatial and temporal content activity and the resulting classes in the spatio-temporal domain.
A.1 Video sequences features.

Acknowledgements

Although a PhD thesis is a lonely journey of individual endeavour, this work would not have come to fruition without the support of many people, to whom I am sincerely indebted and thankful.

My deepest gratitude is dedicated to my supervisor, Graham Knight, for his continuous guidance, fruitful feedback and moral support throughout. Graham eagerly provided a surplus of advice and constructive comments as well as optimism and encouragement at times when things were not looking rosy. For all these, I thank him deeply.

I am especially grateful to Prof. Jon Crowcroft for eagerly reviewing this work and providing invaluable suggestions and improvements at several stages of my research. This thesis has been greatly influenced and improved by his discerning comments. I would like to truthfully thank Prof. Mark Handley for providing insightful, detailed and constructive comments that helped me to better shape my research ideas. His guidance on the development of this work and the shaping of this thesis was paramount. I also owe a sincere debt of thanks to Prof. Angela Sasse for providing me with invaluable support, both in financial and equipment terms, at the later stages of this work, when it was most needed. I wish to sincerely thank Dr. Saleem Bhatti, who provided a thorough review of this thesis and suggested comments of extraordinary quality and depth. Special thanks are also dedicated to Dr. David Hands at BT Labs for appreciating my work, providing feedback based on his expertise on video quality and funding parts of this work. My gratitude also goes to Steve Rudkin and Richard Jacobs at BT Labs for giving me the chance to spend seven months at the Distributed Systems Group and the opportunity to kick-start this research. The funding support that followed was absolutely crucial, as it enabled me to dedicate more time to this research and reduce external fund-raising activities.

Several other individuals have greatly contributed to this work through numerous discussions, suggestions and responses to my queries. I am very thankful to Stephen Wolf at the Institute for Telecommunication Sciences (ITS) for providing valuable information and suggestions on issues related to video quality and eagerly responding to my (sometimes unsolicited) questions. He also assisted me with the implementation of the video quality metric designed at ITS, which is used throughout this thesis, by providing me with output data from their model to compare against my own implementation. Prof. Mohammed Ghanbari also provided helpful suggestions at the early stages of this research. My thanks also go to Matthew Trotter and Dr. Dave Corney from the Machine Learning Group at UCL, who provided initial suggestions and kindly answered several questions on neural networks.

Publications

• D. Miras, R. Jacobs and V. Hardman. Utility based inter-stream adaptation of layered streams in a multiple-flow IP session. 7th Intl. Workshop on Interactive Distributed Multimedia Systems and Telecommunication Services (IDMS). Enschede, The Netherlands. LNCS 1905, pp. 77-88, Oct. 2000.

• D. Miras, R. Jacobs and V. Hardman. Content-aware quality adaptation in IP sessions with multiple streams. 8th Intl. Workshop on Interactive Distributed Multimedia Systems and Telecommunication Services (IDMS2001). Lancaster, UK. LNCS 2158, pp. 168-180, Sep. 2001.

• D. Miras. A survey on network QoS needs of advanced Internet applications. Internet2, QoS Working Group, Dec. 2002, http://www.internet2.edu/qos/wg/apps/fellowship/.

• J. McCarthy, M. A. Sasse and D. Miras. Sharp or smooth? Comparing the effects of quantization vs. frame rate for streamed video. Intl. Conf. on Human Factors in Computing Systems (CHI). Vienna, Austria, Apr. 2004.

• D. Miras and G. Knight. Smooth quality streaming of live Internet video. IEEE Global Telecommunications Conference (Globecom). Dallas, Texas, US, Nov. 2004.

• J. McCarthy, M. A. Sasse and D. Miras. Evaluating mobile video quality. 13th IST Mobile and Wireless Communications Summit, Lyon, France, June 2004.

Chapter 1

Introduction

Video has been an important medium for communication, learning and entertainment for many decades. In the early days, video was captured and transmitted in analog form, and such legacy video systems still operate today. The advent of digital signal processing and advances in computing technology led to the digital manipulation of video, enabling the emergence of new and far more efficient video compression, storage and transmission practices. In the early 1990s, video compression and transmission was a major area of research, revolutionising the landscape of video applications. A variety of exciting new applications appeared, including storage of high quality movies on Digital Versatile Discs (DVD), satellite and terrestrial digital television (DTV), broadcast of TV signals over cable, video conferencing and video telephony over circuit-switched networks. At the same time, mainly due to the emergence of the World Wide Web, the Internet started to experience increasing growth and popularity. It soon became apparent that content-rich multimedia data can be transmitted over best-effort packet networks efficiently and economically, despite the fact that at first sight, the proposition of a 'no-guarantees' data network was considered unsuitable for real-time media.

Still today, real-time delivery of good quality video over the Internet is complicated by the unknown and time-varying bandwidth, delay and packet losses, as well as other related issues, such as how to avoid overloading of network resources, how to fairly share the resources among competing users or flows, and how to arrange efficient and scalable transmission of popular content to many recipients simultaneously.

Until recently, delivery of video services over the Internet to a large audience has been limited to poor quality material. The main impeding factor was slow network access speeds, as the majority of users were confined to slow dial-up connections. Naturally, these services could not be regarded as reliable alternatives to the well-established motion pictures services over terrestrial, satellite or cable networks. But this situation has been changing significantly. During the past couple of years, supply and demand for high-speed Internet access has experienced a dramatic upsurge. According to a recent survey (May 2003), the number of high-speed Internet connections in Europe, which include digital subscriber line (DSL) and cable, grew on average over 130% in a year (over 235% in the UK), with more than one-third (and fast approaching the 50% mark) of Internet users in Europe and the US currently connected at high speed (source: Nielsen Netratings [1]). At the end of 2003, there were more than 100 million high speed lines worldwide, making broadband one of the "fastest growing new technologies in history", even beating the meteoric rise of mobile phones. Furthermore, the number of dial-up connections has started to decrease, attributed to falling prices of broadband access. At the same time, demand for streaming media is surging. Millions of online users listen to Internet radio stations or view streamed video content (news, sports, movie clips, advertisements). The great advantage of the Internet is that it is a universal medium; despite the fact that users might at times be unhappy with the quality and download times of media content, they will continue using streaming content that they cannot get anywhere else. Furthermore, a large percentage of those users are willing to pay a fee to have private access to high-quality content [2]. Deployment of commercial video services for high speed Internet users is gradually gaining pace. Today, telecom operators encounter slowly declining telephony revenues due to competition from mobile operators and falling prices. The situation with data services is similar: subscription prices are falling while the volume of data is ever increasing. There is the opportunity to revive revenue by offering bundled voice, data and video services over the same medium (telephone cables). The efficiency of IP delivery allows a new breed of video services to emerge that are either not offered elsewhere or are more efficient and economical when offered over an IP network: specialised and personalised content according to the preferences of the individual user, network personal video recorder, time-shifted TV, video on demand (VoD), TV channels that are not available on cable or satellite (e.g., from other countries). However, similar opportunities also exist for the individual to create and transmit his/her own content, unfold personal talents and attract an audience that otherwise would not have been possible.

1.1 Video adaptation - interaction between the codec and the network

Originally engineered for data traffic, the Internet provides an unpredictable and best-effort service with a time-varying end-to-end channel in terms of bandwidth, delay, and packet loss rate. Nevertheless, by being tolerant to large delay and delay-variation, data applications can be effectively supported using the Transmission Control Protocol (TCP), which essentially transforms time-varying bandwidth and loss into extra delay and delay-jitter. This is achieved by adapting to the time-varying channel rate and using end-to-end retransmission to recover from packet losses. In contrast, media streams have real-time requirements; the additional delay, delay-jitter and packet loss are therefore usually unacceptable for streaming audio and video applications. Video compression techniques, on the other hand, were typically not designed with the transport channel characteristics in mind. Instead, a constant data-rate channel with low loss rates was assumed. For example, the popular MPEG-1 video coding standard was designed for the compact disc medium and MPEG-2 for higher-quality media storage on DVD or transmission over suitably provisioned broadcast TV networks. As a result, such network-ignorant compression techniques cannot accommodate the dynamic nature of best-effort packet networks in terms of loss, bandwidth and latency.

There are two approaches to bridging the gap between what the network offers and what compressed video streams require. The first approach is to rebuild or augment the current network to provide the Quality of Service (QoS) guarantees necessary to transport network-ignorant multimedia. This is typically achieved through resource reservation, admission control or service discrimination techniques that attempt to establish deterministic or statistical end-to-end guarantees on delay, delay-jitter, and packet loss rate. The most prominent attempts are the Integrated Services (IntServ) [3, 4] and the Differentiated Services (DiffServ) [5] models developed for the Internet. While these approaches promise to provide high quality video, their relative complexity and deployment issues have prevented their widespread adoption despite extensive research and development efforts.

A second approach is to develop adaptive applications that attempt to react to changing network conditions. Adaptation is a multi-faceted principle that is manifested at various components of a video streaming system. Rate adaptation aims to match the bit rate of the stream to the changing bandwidth availability. Adaptation to delay curtails the impact of end-to-end delay and delay variation using a buffer, at the expense of delay in playout time. Adaptation to loss protects the integrity of rendered video through error resilient coding, error concealment techniques or retransmission. Adaptation of video has been made possible by recent advances in video compression techniques [6, 7, 8, 9, 10, 11, 12, 13, 14, 15] as well as the development of transport level congestion control for real-time media [16, 17, 18, 19, 20, 21]. Indeed, adaptation is feasible only through the - direct or indirect - interaction between the video compression algorithm and the network transmission policy. For example, a transport mechanism reacts to changes in network congestion by appropriately adjusting the rate at which bits are injected into the network; at the same time, the video codec alters its compression parameters in order to produce a complying bitstream rate. A video codec can also react to information from the transport mechanism about the packet loss rate by increasing the error resilience of the compressed data accordingly.
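To make the transport half of this interaction concrete, equation-based congestion control schemes in the spirit of TFRC (one strand of the transport-level work cited above) pace the sender at roughly the long-term throughput a TCP connection would obtain on the same path. A commonly used form of this TCP response function is shown below; this is a sketch of the general model, not necessarily the exact variant adopted in any of the cited proposals:

```latex
X = \frac{s}{R\sqrt{\tfrac{2bp}{3}} + t_{RTO}\left(3\sqrt{\tfrac{3bp}{8}}\right) p \left(1 + 32p^{2}\right)}
```

Here X is the permitted transmit rate, s the packet size, R the round-trip time, p the loss event rate, b the number of packets acknowledged per ACK, and t_RTO the retransmission timeout. A rate-adaptive codec is then asked to produce a bitstream at or below X, which is precisely the codec-transport coupling described above.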

Delivering real-time video over the Internet is often a predicament with respect to the end-video quality (quality of presentation). This is attributed to the video content's inherently varying spatio-temporal complexity: in order to achieve sustainable levels of quality, the bit rate requirement of video varies widely over short and long timescales. Scenes with low spatial activity and motion can be efficiently compressed with good quality, while complex visual content and motion increase the distortion introduced by the encoder unless a high bandwidth is available. At the same time, the bandwidth available to the application varies as the transmission rate adapts to changing levels of contention for network resources. However, the available bandwidth is seldom enough to support minimal distortion to the video sequence and, unfortunately, its pattern of variation does not coincide with the bit rate requirement for good encoding quality.

Therefore, an important issue in video adaptation is to understand its impact on video quality and how to measure it. Measuring the quality performance of digital imaging systems with respect to the capture, compression, storage, transmission and display of visual information is one of the most important challenges in this area. As digital video systems exhibit artifacts that are fundamentally different from analog systems, well-established measurement procedures of analog video quality are tenuously correlated with the perceived quality of digital video. To overcome these limitations, researchers and designers of video systems have had to resort to subjective viewing tests to obtain reliable video quality ratings [22]. While these tests are the closest interpretation of the 'truth' regarding quality, they require costly, complex and time-consuming set-ups and are therefore often impractical and not suitable for on-line quality monitoring. Instead, pixel-error metrics, such as mean squared error (MSE) and peak signal-to-noise ratio (PSNR), are commonly used for this purpose. Such arithmetic fidelity metrics are easy to calculate. However, they suffer a major drawback: they do not always correlate well with human judgements of quality [23], especially at low-to-modest bit rates (up to a few hundreds of Kbps). The main issue with MSE and PSNR is that they cannot discriminate between impairments that humans can and cannot see, or impairments that are more or less annoying.
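For reference, both pixel-error metrics reduce to a few lines of arithmetic, which explains their popularity despite the perceptual shortcomings just described. The sketch below assumes 8-bit frames held as NumPy arrays; the function names are illustrative:

```python
import numpy as np

def mse(reference: np.ndarray, degraded: np.ndarray) -> float:
    """Mean squared error between two equally sized frames."""
    diff = reference.astype(np.float64) - degraded.astype(np.float64)
    return float(np.mean(diff ** 2))

def psnr(reference: np.ndarray, degraded: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB (peak = 255 for 8-bit video)."""
    err = mse(reference, degraded)
    if err == 0.0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(peak ** 2 / err)
```

Note that the computation weighs every pixel error identically, regardless of where it occurs or whether a viewer could perceive it - the root of the drawback discussed above.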

These problems have prompted an intensified study of visual objective video quality metrics in recent years [24]. These are computational models that measure video quality in a way that preserves high correlation to human ratings of quality, by accounting for the type and magnitude of perceived distortions in the video signal and their perceptual effect. Visual quality metrics have proven to exhibit significant performance in terms of quality prediction accuracy and excellent correlation with subjective quality ratings [25]. The main advantage of these models is that they are content-independent and apply to a variety of video processing systems. Commercial products are now emerging with the aim of replacing subjective tests in areas like the quality evaluation of video sequences, the improvement of new video codecs and coding algorithms or the monitoring of video transmission systems.

Figure 1.1: Involving objective metrics of perceived quality in the video adaptation process.

1.2 Video adaptation - interaction between the codec, network and metrics of perceived quality

Thus far, objective perceptual quality metrics are being used to measure the performance of video systems or components in a non-intrusive manner. They are used to obtain mean opinion score (MOS) predictions, compare and optimise video processing algorithms, check the quality of encoded video files, test and compare competing streaming equipment, assess the network service-level agreement (SLA) of streaming services or characterise video content prior to encoding. The results of this process are then exploited by experts to facilitate improvements, modifications and decision-making. Focusing on Internet video streaming, this thesis proposes the use of perceptual video quality metrics as an integral part of the streaming system, to drive or assist quality-aware adaptation of video. The accurate quality prediction attributes of these models can be exploited to facilitate adaptation decisions that account for their effect on perceptual QoS. This proposition is graphically pictured in Figure 1.1, which circumscribes the conceptual context of this research.

Video quality metrics, however, offer the means but not the solution to performing quality-aware adaptation. How these metrics can be integrated is a question that is bound to the specific characteristics of the video application and the properties of the streaming system. The precise objective of this work may be stated thus:

This thesis focuses on rate adaptation of Internet video to investigate and develop engineering enhancements in video streaming systems that allow the integration of visual, objective quality metrics in the adaptation life-cycle, and demonstrates the benefits of quality-aware adaptation in delivering improved perceived quality of service and wiser utilisation of the transmission bandwidth.

This objective is examined in the context of two applications:

1. The bandwidth allocation and inter-stream adaptation problem in the transmission of multiple, co-located, concurrent media streams.

2. The provision of smooth quality adaptation for unicast streaming of live video content.

Work in this thesis focuses on video adaptation from the viewpoint of rate-control, i.e., how the rate of the compressed video should be manipulated so that the end-quality is improved. As a result, the issue of video protection against packet loss (error resilience) is outside its scope. However, the techniques developed herein do not disallow or impede the use of error protection techniques in tandem.

1.3 Contributions

The contributions of this thesis can be summarised as follows:

• A detailed, informative discussion on the issue of video quality. This dissertation provides an analytical discussion of what constitutes video quality and how it can be interpreted and measured. It introduces a detailed discussion on video quality measurement techniques, assesses their respective merits and drawbacks and proposes their use in the process of video adaptation.

• An adaptation technique for multiple-stream multimedia sessions. In the context of a multimedia session that engages the transmission of multiple concurrent video streams to a receiver¹, a method is developed for adapting the media rate of each flow so that the total quality of the session is maximised. The approach makes use of the time-varying objective quality of each constituent video flow to enable quality-efficient utilisation of the session bandwidth. To achieve this, a quality-based scheduling approach is proposed to apportion the nominal session bandwidth among the session flows, based on the instantaneous quality contribution of each media stream to the overall session quality. The adaptation algorithm operates at suitable timescales that are found to coincide with changes in the underlying visual content of each stream (video scene boundaries). We show that quality-aware inter-stream adaptation can improve the total session quality and result in better utilisation (in terms of quality) of the available bandwidth when compared against current practices in tackling the same problem.

¹For example, the playback of a sports event, like a football game, which involves the transmission and display of several video streams coming from various cameras around the stadium (e.g., main pitch action, close-ups, player-follow camera, etc.)

• A method for real-time prediction of instantaneous video quality based on machine learning techniques. A method is developed for the real-time estimation of continuous objective quality ratings based on the generalisation capabilities of artificial neural networks. For applications like live video streaming, it is essential to obtain fast indications of the instantaneous quality as well as to acquire rate-to-quality mappings for different types of video content (or scenes) in order to perform quality-aware adaptation. In such cases, directly using a visual quality model is not feasible due to the prohibitive computational complexity of evaluating such metrics. The proposed method utilises artificial neural networks to acquire quality scores based on descriptive features of the video content over continuous, short-time intervals and the candidate encoding rate. A feature selection process is proposed to extract meaningful descriptors of content activity that influence the encoding quality. Several issues focusing on the calibration of a suitable neural network architecture for this kind of application are also studied. We show that the use of artificial neural networks can provide accurate estimates of continuous quality scores and facilitate the - indirect - use of objective metrics in video adaptation.

• A fuzzy-logic rate-quality controller to provide smooth quality for live-video streaming. Network-friendly video rate adaptation is not always media-friendly: a congestion-controlled (or, rate-controlled) video stream exhibits frequent oscillations in perceived quality of service, which are particularly disturbing for the viewer. A source-rate quality controller based on the principles of fuzzy logic is developed to provide smooth quality streaming of live video and ameliorate the side-effects of network-friendly transmission. The presented technique accounts for several properties of human perception of continuous quality to achieve a stable streaming quality experience. The rate-quality controller seeks continuous indications of the perceived quality from the neural network predictor and derives appropriate encoding rates that maintain quality stability, subject to constraints imposed by the transmission rate and the size of media buffers. Experimental results show that the proposed rate-quality controller can alleviate annoying short-term oscillations of quality and that media-friendly adaptation can be achieved under a network-friendly transmission regime.

1.4 Structure of thesis

This dissertation is structured as follows. Chapter 2 briefly reviews the state of the art in Internet video transmission. It then engages in a comprehensive discussion on the issue of video quality and the recent advances in the field. Special attention is given to objective video quality metrics, as they form the foundation of this thesis. Implementation details of such an objective metric that is extensively used in this thesis are presented. Certain issues presented in Chapter 2 have also been discussed in [26], [27] and [28].

Chapter 3 addresses the problem of efficiently apportioning the bandwidth available to a multimedia session that consists of an ensemble of concurrent media streams, under an integrated network congestion management regime (e.g., [29]). Using integrated congestion control, the session can benefit by sharing knowledge of the ongoing network conditions between the participating flows, and control the allocation (or, more precisely, the distribution or sharing) of resources to individual flows by considering their respective contribution to the overall session quality. An appropriate inter-stream bandwidth allocation scheduler is presented. We define suitable adaptation timescales based on the content dynamics among the participating media streams. We show the benefits that arise from the joint use of integrated congestion control and content-based quality profiles, generated by means of objective measurements, both in terms of improved session quality and wiser use of the available session bandwidth, in comparison to a 'priority-based' inter-stream adaptation and to the case where flows are transmitted as independent IP flows. Parts of this work have been presented in [30] and [31].

The second part of the thesis (Chapters 4 and 5) deals with the problem of providing smooth perceived quality adaptation for live, unicast, real-time video streaming. In this application, stable video quality is considered very significant since frequent oscillations of quality are particularly annoying to the user. Chapter 4 first reviews related work on quality smoothing of video streaming and identifies limitations of current approaches. It then introduces an architecture for quality-aware adaptation of live video and establishes the main requirements that arise. The main part of Chapter 4 presents an artificial neural network method that yields continuous objective quality ratings in real-time, based on video content feature descriptors extracted on-line from the source video frames and the bandwidth that is available to the stream. Details of the neural network architecture and its generalisation performance are also presented.

Chapter 5 then introduces a fuzzy rate-quality controller to alleviate fluctuations in quality that result from a network-friendly transmission and provide a 'smoother' quality adaptation. The method accounts for the impact that the type, magnitude and frequency of quality changes have on perceived quality and examines the constraints imposed by the available transmission bandwidth and the sender and receiver media buffers. The efficacy of the method is demonstrated through numerous experimental results. Parts of this work appear in [32].

Chapter 6 summarises the contributions of this thesis and reviews limitations of this work in its current state; it also outlines possible directions for future research. Finally, the Appendix section of this dissertation presents auxiliary information as well as additional experimental results for the various techniques developed in the thesis.

Chapter 2

Internet video streaming - issues and challenges

Today, visual media are used in a multitude of application scenarios. Understanding the different classes of video applications is important, as they pose different sets of constraints and requirements and therefore involve different choices for system design. This chapter first provides a brief overview of the diverse range of video communication applications. It then reviews the state of the art in digital video transmission, focusing on recent advances in Internet video and the associated areas of research. New challenges in video streaming are then introduced, namely the issue of video quality and its potential role in Internet video adaptation that this thesis addresses. A detailed review of work on video quality measurement techniques is presented, which serves as the foundation that the work of this thesis is based upon.

2.1 Introduction

Recent advances in computing technologies, video compression and higher network access speeds have made it feasible to provide multimedia services over the Internet. Specifically, digital video is now an important medium for communications and entertainment on the Internet. While video services are still predominantly carried over specialised networks, such as terrestrial digital TV (DTV), cable, satellite, or circuit-switched networks, the growth and popularity of the Internet has motivated video communication over best-effort packet networks. At first sight, the Internet seemed unsuitable to support video services, since digital video is characterised by high bandwidth demands, error intolerance and requirements for low delay. However, research and development in video compression, adaptive rate control, congestion control, error protection and other related areas have proven that video can be efficiently and effectively transmitted over a network that does not offer any explicit quality of service (QoS) guarantees.

Transmission of digital video over a best-effort packet network poses several challenges at various levels in the system design: from signal processing for efficient and robust video coding to appropriate communication protocols and transmission policies. Furthermore, interaction and cooperation among the various components of the video transmission system is also required to achieve the objectives of video applications. The following sections survey the most important recent advances in video streaming research and development and outline the challenges that this dissertation tackles.

2.2 Overview of video communications

Digital video systems exhibit a diverse set of requirements, operating conditions and constraints. Video applications may be point-to-point, broadcast or multicast, and can be one-way, interactive, or loosely interactive [26]. Digital video content may be pre-encoded (stored), or encoded in real-time (e.g., broadcasting, videoconferencing or live streaming). The video communication channel¹ may be static or dynamic, may support some form of QoS control, or may only provide a best-effort service model. The following paragraphs briefly discuss the most important of these properties and how they affect the design of the video system.

Broadcast, point-to-point and multicast

Perhaps the most popular form of video communication is one-to-many, or broadcast, where programme content can be delivered to all willing receivers simultaneously. Because the broadcast signal is intended for a widespread audience at dispersed geographical locations, different channel characteristics apply, and the source-channel encoding is often designed to adequately service receivers with the worst-case channel, thus sacrificing some reception quality for more well-off receivers (e.g., in a city centre). In broadcast communication, feedback from the recipient of the signal to the sender is not feasible (and in most cases not required), limiting the chances of system adaptation.

In point-to-point, or one-to-one, communication, the signal is only transmitted to or between the involved peers (e.g., unicast video streaming, video telephony). In this case, a feedback channel may or may not be present.

Finally, multicast is a form of one-to-many or many-to-many communication. Multicast is different to broadcast in the sense that in the latter the signal is always available for receivers to ‘tune in’, while in the former, the signal is transmitted only to those parts of the network where receivers are willing to view the corresponding programme.

¹The term channel is here used in its broad, signal processing definition, which may correspond to the transmission or storage medium.

Interactive vs. non-interactive

Interactive applications, such as videoconferencing, involve a two-way (or multi-way) transmission of video streams between the participating nodes. Such applications pose hard real-time constraints, especially on latency. The maximum acceptable latency depends on the application, but usually interactive video communication requires a latency that is below 150 ms, and definitely no more than 300-400 ms [33]. In non-interactive applications, such as viewing of stored or live video material, video transmission is one-way and has a more relaxed latency constraint. For some on-demand video distribution applications, which may be called loosely-interactive, some form of interactivity might occur in the form of VCR-like actions, e.g., to rewind or fast-forward the presentation. The degree of interactivity required by the video application is therefore critical for the design of the video communication system. For example, non-interactive streaming applications use a preroll buffer of significant size (initial delay), in the order of seconds or tens of seconds, to alleviate fluctuations in the network bandwidth or the bursty nature of the video bit rate.
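The size of that preroll buffer can be reasoned about with cumulative curves: playback may safely start after d intervals if the data received by every instant covers the data consumed by that instant. Below is a minimal sketch of this test, with hypothetical per-interval bit counts (illustrative only, not drawn from the thesis):

```python
from itertools import accumulate

def min_startup_delay(arrival_bits, playout_bits):
    """Smallest startup delay d (in intervals) such that, if consumption of
    playout_bits[k] occurs in interval d + k, cumulative arrivals always
    cover cumulative consumption (the preroll buffer never underruns)."""
    cum_in = list(accumulate(arrival_bits))
    cum_out = list(accumulate(playout_bits))
    for d in range(len(cum_in)):
        if all(cum_in[min(d + k, len(cum_in) - 1)] >= cum_out[k]
               for k in range(len(cum_out))):
            return d
    return None  # no feasible startup delay within the observed trace

# Bandwidth oscillating between 2 and 8 units per interval, steady playout of 5:
# a single interval of preroll absorbs the swings.
print(min_startup_delay([2, 8, 2, 8, 2, 8], [5, 5, 5, 5, 5, 5]))  # -> 1
```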

Real-time vs. pre-encoded

Video content may be encoded for real-time communication or may be pre-encoded and then stored for later viewing. For example, interactive applications require real-time performance from both the encoder and decoder of the video system in order to meet latency constraints. When video material is encoded and stored so that it can be viewed on-demand at some future time, only the real-time performance of the decoder is necessary. In this case, more sophisticated encoding practices may be utilised to increase compression and streaming efficiency (asymmetric coding), as used, for example, in local storage applications (DVD) or Internet streaming. Note that non-interactive applications may also require real-time encoding performance, for instance, streaming of a live event.

Channel characteristics

The nature of the channel characteristics and attributes is the most decisive factor in the design of the video communication system. With respect to its behaviour over time, the communication channel is termed static if it provides a fixed bit rate and delay and very low loss or error rates (e.g., ISDN). If these channel properties vary over time, the channel is dynamic, like the Internet or a wireless network. Furthermore, the channel can be constant bit rate (CBR), if its bandwidth is constant over time, or variable bit rate (VBR) otherwise. This has a direct implication on how video is coded in order to 'fit' in the channel. If good (constant) quality is required, a CBR channel is needed that provides enough bandwidth to accommodate the inherent variability of the compressed video rate (as in digital TV networks or DVD storage). If this is not the case (e.g., in an ISDN network), or if the channel is VBR (e.g., a best-effort network), the video signal rate needs to be constantly altered to match what the channel can support, resulting in time-varying quality.

Support for Quality of Service

Network quality of service (QoS) support for the Internet has been the subject of intensive research over the past two decades. QoS is a vague term that conveys the network's ability to provide some type of preferential delivery service or certain performance guarantees to flows, e.g., guarantees on bandwidth, bounded delay or packet loss. Network QoS has the potential to greatly facilitate video communication, since it can prioritise delay or loss sensitive video data relative to other flows or discriminate its service between different forms of video data. There has been a multitude of research on network QoS. The discussion that follows briefly outlines the two most widely known approaches: Integrated Services and Differentiated Services. The Integrated Services (IntServ) model [3] attempts to provide flows with end-to-end QoS guarantees in terms of bandwidth, delay and packet loss. Such guarantees can be established by explicitly allocating resources on the transmission path of a flow using a signalling protocol called the Resource Reservation Protocol (RSVP) [4]. The high complexity, cost of deployment and lack of scalability of the RSVP-based architecture led to the proposal of other QoS mechanisms. In particular, the Differentiated Services (DiffServ) model [5] was designed with the requirement of lower complexity and easier deployment in mind. The DiffServ architecture decouples the application from the network load control process by not providing per-flow guarantees but service differentiation on flow aggregates. Flow differentiation is facilitated by appropriately marking packets. Traffic entering the network is assigned to different traffic aggregates or classes. Within the network core, different classes of traffic (identified by a DS code point on every packet) are forwarded according to the per-hop behaviour assigned to each DS code point. Differentiated services are achieved through a combination of traffic conditioning and per-hop-behaviour (PHB) based forwarding of packets.
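At the host, requesting DiffServ treatment amounts to setting the DS code point on each outgoing packet. A minimal sketch follows, using the Linux socket API; the marking may of course be re-written or ignored at the network edge, and the destination address is a placeholder:

```python
import socket

AF41 = 34  # DS code point for Assured Forwarding class 4, low drop precedence

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# The DSCP occupies the upper six bits of the former IPv4 TOS byte.
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, AF41 << 2)
sock.sendto(b"video payload", ("203.0.113.10", 5004))
```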

Despite its potential and the significant number of proposals and techniques developed, QoS is still not widely deployed in today's Internet, and this seems likely to remain the case for the foreseeable future. The reason is that the deployment of these technologies requires enhancements to the Internet infrastructure, which are subject to a wide range of commercial as well as technical considerations. RFC 2990 [34] outlines some of the technical challenges to the deployment of QoS architectures in the Internet. Often, interim measures that provide support for fast-growing applications are adopted and are successful enough to relieve the pressure for a ubiquitous deployment of more disruptive QoS technologies. For example, most service providers resort to over-provisioning their backbone networks. There are many examples of the slow deployment of infrastructure that are similar to the slow deployment of QoS mechanisms, including IPv6 and IP multicast. This thesis deals with video transmitted over dynamic, best-effort, packet-switched networks that do not provide guarantees on packet delivery.

2.3 Video compression

This section presents a brief overview of video coding and video compression standards, high­ lighting important practices of current and emerging encoding algorithms. The motivation for this discussion is that the most eminent standards of video compression (e.g., MPEG-1/2/4 or H.261/3/4) and proprietary solutions (e.g., RealNetworks or Microsoft Windows Media) are based on the same principles and methodologies, thus it provides a basic understanding of a video streaming modus operandi and serves as a foundation for several issues, related to the video compression process that are tackled in later chapters. Compression in video data is achieved by exploiting similarities and redundancies that ex­ ist in a typical video signal, both in the spatial and temporal dimensions. For instance, within a single video frame, there is significant spatial redundancy as the characteristics of neighbouring pixels (luminance - pixel intensity and chrominance - pixel colour) are often highly correlated. Similarly, consecutive frames of a video sequence exhibit high temporal redundancy since they typically contain the same visual objects with some levels of movement between frames. The objective of video compression is also to remove irrelevant information in the video signal, thus saving bits, by only coding features that are perceptually important. In the case of coloured images, a colour space conversion is first applied to convert the Red-Green-Blue (RGB) image^ into a luminance/chrominance colour space (one luminance, F, and two chrominance compo­ nents, UV, also referred to as the YUV colour space), where the luminance and chrominance characteristics of a frame can be better exploited. compression algorithms are based on the quantisation of the coefficients of a Dis­ crete Cosine Transform (DCT) on pixel values and motion-compensation [35]. As adjacent pixels in a frame are often highly similar, their energies are concentrated in the low frequencies. To exploit this, frames are divided into small (usually 8x8) blocks of pixels^ and 2-dimensional DCT is applied. From the resulting 8 x 8 block of DCT coefficients, only a small fraction (those of lower frequencies) is enough to accurately reconstmct the frame. These coefficients are then quantised to remove redundancy and then further processed using techniques such as zigzag scanning, mn-length coding and Huffman coding. This procedure is applied to both the lu-

In the temporal dimension, redundancy is removed by exploiting similarities between neighbouring frames: a frame is first predicted from a reference frame, and then the error of the prediction (instead of the whole frame) is coded. This is performed by estimating the motion between frames, using techniques known as motion estimation (ME) and motion-compensated prediction. These (usually) operate on 16×16 blocks of pixels (macroblocks); for each macroblock, prediction involves searching for the best-matching macroblock in the reference frame. The relative displacement of the matching macroblock is termed the motion vector (MV). Frames can be predicted (i) from a previously encoded frame (predicted, or P-frames), (ii) from both previous and future coded frames (bi-directionally predicted, or B-frames), or (iii) not predicted at all, i.e., encoded independently (I-frames). The prediction error pixels are also DCT-transformed and quantised. Quantised DCT coefficients, together with motion vector coordinates, are run-length and Huffman coded to produce the compressed bitstream.
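To make the transform-and-quantise stage concrete, the following sketch applies a 2-D DCT to a single 8×8 block of luminance samples and quantises the coefficients with a uniform step. It is a minimal illustration only: the block contents and the quantisation step are arbitrary choices for the example, not values taken from any particular standard (real codecs typically use perceptually tuned quantisation matrices).

```python
import numpy as np
from scipy.fftpack import dct, idct

def dct2(block):
    # Separable 2-D type-II DCT (applied along columns, then rows)
    return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

def idct2(coeffs):
    return idct(idct(coeffs, axis=0, norm='ortho'), axis=1, norm='ortho')

# An arbitrary 8x8 block of 8-bit luminance samples (illustrative values)
rng = np.random.default_rng(0)
block = rng.integers(100, 140, size=(8, 8)).astype(float)

coeffs = dct2(block - 128)            # level shift, then transform
step = 16                             # illustrative uniform quantiser step
quantised = np.round(coeffs / step)   # most high-frequency values become 0
reconstructed = idct2(quantised * step) + 128

print("non-zero coefficients:", np.count_nonzero(quantised), "out of 64")
print("mean absolute reconstruction error:",
      round(float(np.abs(block - reconstructed).mean()), 2))
```

As the printout shows, only a handful of low-frequency coefficients survive quantisation, yet the block is reconstructed with a small average error; this is precisely the trade-off the quantisation parameter controls.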

2.3.1 Current and emerging video compression solutions

During the last decade or so, a number of video compression standards have emerged to resolve the problem of ensuring interoperability between encoders and decoders used by different entities. This standardisation process facilitated the speedy acceptance and widespread use of video communication solutions, reducing cost and associated risks. Standards have been developed to address a variety of applications; currently, there are two main standardisation bodies: the International Telecommunication Union - Telecommunication Standardization Sector (ITU-T) and the International Organisation for Standardisation (ISO). The following paragraphs briefly outline the main properties of the most prominent video compression solutions.

H.261: The first compression standard to gain widespread acceptance was the ITU H.261 [36], which originally targeted videoconferencing applications over ISDN networks. It was designed to operate at multiples of the baseline ISDN data rate, or p × 64 Kbps, p = 1, ..., 30. It was the first standard to use the typical encoder structure that still predominates today (8×8 block DCT, macroblock motion estimation, scalar quantisation, run-length and variable-length entropy encoding).

H.263: To target video transmission at lower data rates (e.g., video telephony at 33 or 56 Kbps), the H.263 [6, 7] standard emerged as an evolution of H.261, and it is still considered state-of-the-art. H.263 is capable of variable block-size motion compensation, half-pixel motion estimation, and bi-directional prediction. During development, its target bit rate was broadened to 2000 Kbps or more, as it proved to provide superior quality in comparison to H.261.

H.263+: Technically a second version of H.263, it added a number of new operational features to H.263 [8]. A notable advance over prior standards is that it offered error resilience mechanisms for transmission over packet-based or wireless networks. H.263+ also added a number of improvements in compression efficiency, flexible video formats and frame rates, and scalability (layered H.263+), and extended the operational bit rate of H.263 to essentially any bit rate (from very low to very high), giving superior performance over existing standards.

MPEG-1: This is a widely successful standard for the storage and retrieval of moving pictures and audio, at VHS quality or better, at about 1-2 Mbps [37]; it was the result of work by the ISO Moving Picture Experts Group (MPEG). MPEG-1 is intended to be generic, i.e., only the coding syntax is defined, and therefore effectively only the decoding scheme is standardised. MPEG-1 defines a block-based DCT and motion estimation/compensation hybrid. It features bi-directionally predicted frames (known as B-frames) and half-pixel motion estimation. It also provides functionality for random access in digital storage media.

MPEG-2: MPEG-2 was a step further in bit rate range and picture quality. It forms the heart of broadcast-quality television for both standard-definition (SDTV) and high-definition (HDTV) signals [38]. Its target bit rate is in the range of 4-20 Mbps or higher. The video coding scheme used in MPEG-2 is again generic, and is a refinement of the one in MPEG-1. Furthermore, it features extensions for scalable (layered) encoding. In order to keep implementation complexity low for products that do not require all video formats supported by the standard, so-called 'Profiles', describing functionality, and 'Levels', describing resolutions, are defined to provide separate MPEG-2 conformance levels.

MPEG-4: MPEG-4 is a standard for multimedia applications [39, 9]. It addresses the need for robustness in error-prone environments, interactive functionality for content-based access and manipulation, and high compression efficiency at very low bit rates. MPEG-4 features an object-based coding scheme using so-called 'audio-visual objects', which can be, for example, the fixed background of a video scene, the picture of a person in front of that background, the voice associated with that person, etc. The basic video coding structure supports shape coding, motion compensation and DCT-based texture coding, as well as a zero-tree wavelet algorithm.

Proprietary solutions: Besides the above video compression standards, proprietary video codecs have emerged from industry efforts, such as Microsoft's Windows Media Video codec [40] and Real's RealVideo [41]. They are heavily used today in the two most popular commercial video streaming solutions (Windows Media and RealNetworks). Although details of these video codecs have not been fully disclosed, in principle their coding core utilises the block-DCT, motion-compensation hybrid scheme. They are also said to feature somewhat more sophisticated processing in various respects, such as improved motion prediction, adaptive transform sizes, pre-filtering to remove noise and post-processing (like frame up-sampling to interpolate skipped frames and de-ringing filtering to reduce ringing artifacts), and demonstrate better coding gain in comparison to standard codecs [12, 42]. Automatic Repeat-Request (ARQ) mechanisms are also employed to allow the client to request retransmission of missed packets, which, if delivered within their playout deadlines, enable recovery from losses.

Emerging standards - JVT H.264: The increasing number of video services and emerging transmission technologies such as cable, xDSL and wireless generate greater needs for higher compression efficiency. In these transmission environments, data rates are lower than those of digital TV broadcast channels, and enhanced coding efficiency is required to enable the transmission of good-quality video representations. In recognition of these demands, H.264/AVC (or MPEG-4 Part 10) is the newest emerging video coding standard, the result of joint work by the ITU-T and ISO in the Joint Video Team (JVT) [10, 11]. The main goals of this effort are increased compression performance and the provision of 'network-aware' video representations for both interactive (video telephony/conferencing) and non-interactive (storage, broadcast, streaming) applications. The design of the H.264 video coding layer is based on the block-based, motion-compensation hybrid, but with some important improvements relative to prior standards. Motion compensation becomes more flexible by allowing variable block sizes (as small as 4×4) and shapes, as well as quarter-pixel motion vector accuracy (as opposed to half-pixel in prior codecs). In-the-loop de-blocking filters are used to reduce blocking artifacts, and a smaller block transform (4×4) helps to reduce ringing artifacts^4. More advanced entropy coding techniques (such as arithmetic and context-adaptive coding) are used to improve performance. When these features are used together, bit rate savings of up to 50% (or better, especially for high-latency, asymmetric coding) for equivalent perceptual quality relative to prior standards have been reported [43]. The higher computational complexity that these enhancements require can be accommodated by the ever-increasing CPU power or efficient hardware implementations available in modern systems. Besides these, a number of features are available to increase robustness against data errors and packet loss, allowing flexibility of operation over a variety of networking environments [11].

^4 Section 2.5 presents a discussion of the most common artifacts in compressed digital video.

2.4 Networked delivery of compressed video

This section presents a concise discussion of IP video transmission issues, reports the main problems and challenges in video streaming, and reviews the state of the art, focusing on transport rate control issues and rate-adaptive video encoding.


Traditionally, the objective of a video codec has been to optimise video quality at a given bit rate. This approach was founded on the assumption of a constant-rate transmission channel. Therefore, until recently, the complications of transmitting over a time-varying channel were not seriously considered by the signal processing community. For Internet streaming video, this objective has changed: with a best-effort service, there are no guarantees on bandwidth, delay or loss rate. Therefore, a key goal in the design of a video streaming system is to deliver video of acceptable quality when dealing with unknown and dynamic:

• Bandwidth. The bandwidth available to a video stream is the main determinant of end quality. If a sender transmits video data faster than the available bandwidth, congestion causes packets to be lost, resulting in a severe drop in quality. On the other hand, if the sender transmits at a lower rate, the receiver application displays sub-optimal video quality. In principle, the goal is to estimate the available bandwidth and match the transmitted video bit rate to it.

• End-to-end delay variation. Fluctuations of the end-to-end delay (jitter) cause data packets to arrive at irregular intervals, producing problems in the reconstructed video stream, e.g., jerkiness in the video playout. This problem can be eliminated or significantly reduced by the introduction of receiver playout buffers, at the expense of additional delay. While this option is suitable for one-way or non-interactive video, it is more cumbersome for interactive video communication, where the total end-to-end delay needs to be kept at a very low value (ideally < 150 ms) [26].

• Loss rate. Packet loss can result in particularly unpleasant artifacts in the reconstructed video stream. The type and nature of the artifacts depend on several factors, such as the type of loss (packet loss or bit errors, depending on the type of transmission network) and the pattern of losses (independent losses or loss bursts). There has been a significant amount of work on combating the effect of data loss by introducing error resilience and control into video streaming systems. Approaches to error control can be roughly grouped into four categories: (i) forward error correction (FEC), (ii) retransmission, (iii) error concealment at the receiver, and (iv) error-resilient video coding. Error control for video transmission is a broad research area and a detailed review is beyond the scope of this thesis.

To address the aforementioned technical issues in best-effort video delivery, two general approaches have been proposed. The first approach is network-centric: it advocates that the core of the network (routers/switches) is augmented with functionalities that provide QoS support for real-time flows (e.g., integrated or differentiated services, as briefly outlined in section 2.2).

Given the obstacles to a wider deployment of these services, a second, solely end-system approach that does not impose any requirements on the network seems to be the preferred option at present. In this approach, the end application is in charge of employing control mechanisms to maximise video quality without any explicit QoS support from the underlying network^5. These mechanisms can be viewed from either the transport or the video encoding perspective. The transport perspective refers to the use of control techniques without regard to the specific video semantics; in other words, these techniques are applicable to generic data. The main functionality here, relevant to this thesis, is that of transmission rate control^6: the application is responsible for regulating the number of bits that it injects into the network, so that it does not contribute to congestion, as well as probing for available bandwidth that would improve video quality. The video encoding perspective refers to the deployment of signal processing techniques that consider the video semantics at the encoding layer. This deals with how best to encode the video signal so that bandwidth constraints are met, also referred to as rate-adaptive video encoding or video rate adaptation, and how to better protect the compressed bitstream in the presence of loss. However, the power of the end-system approach to provide better video QoS relies on the ability to combine transport and signal processing mechanisms that work in tandem (joint source/channel coding). The following sections discuss the main principles of transport rate control and video rate adaptation techniques.

^5 Even if network QoS guarantees were present, it is most probable that they would be offered at varying cost configurations. Therefore, adaptation mechanisms would still be necessary to yield adequate quality with cheap, 'relaxed' QoS guarantees.
^6 Transport-layer error control is another such functionality, e.g., traditional FEC (i.e., channel coding) and conventional retransmission (i.e., automatic repeat request, ARQ).

2.4.1 Transport rate control

Despite being an overwhelmingly complex and heterogeneous structure, the Internet has to date shown remarkable stability, in large part due to the congestion avoidance and control mechanisms implemented in its main transport protocol, TCP. However, since TCP's reliability functionality introduces delays that may not always be acceptable for media streaming, UDP is commonly employed as the transport protocol for real-time streams instead^7. On the other hand, UDP does not perform congestion control for its streams. The ever-growing volume of uncontrolled UDP real-time traffic (audio/video streaming, IP telephony, conferencing, games, etc.) raises the problem of unfair treatment of those flows that promptly react to incipient congestion in the network. Under heavy network load, this unfair situation may in extreme cases lead to starvation of TCP flows and high packet loss rates (also referred to as congestion collapse [16]).

^7 This is not to say that TCP cannot be used to transport real-time media. Indeed, in certain cases there are advantages to using TCP, such as its proven stability and guaranteed delivery. It may come as no surprise that, very often today, streaming media are carried over TCP, for instance to get through firewalls and Network Address Translation (NAT) boxes [44].

Besides being unfair to compliant flows, unresponsive media streams can also harm themselves, as they experience higher packet loss rates, especially if switches employ queue management techniques (e.g., Random Early Detection, RED [45]) that drop packets in proportion to a stream's transmission rate. It is known that, in the presence of packet loss, the quality of video deteriorates significantly even if the video rate is increased [46]; therefore, lower-rate, low-loss transmission is preferable to higher-rate, high-loss transmission. For this reason, congestion control should be implemented for UDP-based network transport, so that bandwidth is shared fairly between TCP and non-TCP flows, or, to use a now mainstream term, in a TCP-friendly manner. A non-TCP flow is characterised as TCP-friendly if its long-term throughput does not exceed the throughput of a conforming TCP connection under the same network transmission conditions (round-trip time, packet loss, etc.). TCP-friendly congestion control schemes can be classified with respect to a multitude of characteristics [21]. The following classification is therefore informative rather than exhaustive:

Sender-based vs. receiver-based. In sender-based congestion control, the sender (or source) is responsible for adapting the transmission rate of the video stream. In principle, this is done by trying to match the rate of the media stream to the available network bandwidth, by appropriately responding to network congestion signals (i.e., loss events). Source-based rate control can be applied to both unicast [17] and multicast [47] transmission. In the receiver-based approach, it is the end application that regulates the receiving rate of the video. Typically, this scheme is applied to layered (multiple-rate) multicast video: the sender may transmit a number of cumulative video layers and the receiver adds or drops channels according to its estimate of the available bandwidth [48, 49]. Single-rate multicast congestion control approaches have also been proposed [50, 51]. Unicast congestion control can also be receiver-based [52]. This approach may have the additional advantage of removing the burden of monitoring the network condition from the server. The receiver is responsible for communicating its estimate of bandwidth to the sender and instructing it to switch to the stream construction that would yield the minimum distortion over the current channel. It therefore reduces the complexity of server-side processing, with the potential of allowing servers to support more simultaneous connections.

AIMD vs. model-based. This classification of TCP-friendly congestion control mechanisms refers to the method the protocol uses to adjust a flow's rate to its 'fair' share of bandwidth. These approaches use congestion signals, such as a packet loss event or an Explicit Congestion Notification (ECN) signal [53], or the absence of them, to reduce their congestion window (or transmission rate [18]) or to probe for extra bandwidth. The additive-increase, multiplicative-decrease (AIMD) approach adopts a TCP-style congestion control strategy, linearly increasing its window (or rate) upon receipt of one window of acknowledgements within a round-trip time (RTT) and multiplicatively decreasing it upon detection of a loss event. The control algorithm may be expressed as:

$$\text{Increase:}\quad w_{t+R} = w_t + \alpha, \qquad \alpha > 0$$

$$\text{Decrease:}\quad w_{t+R} = (1 - \beta)\,w_t, \qquad 0 < \beta < 1$$

where R is the RTT estimate of the flow, w_t is the window size at time t, and α, β are the increase/decrease constants. When α = 1 and β = 0.5, the expressions above represent the typical TCP rate-halving behaviour. AIMD(α, β) congestion control can be implemented using a window-based mechanism, à la TCP, or with rate-based techniques like the Rate Adaptation Protocol (RAP) [18].
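A minimal sketch of the AIMD(α, β) rule above, in Python; loss detection and per-RTT scheduling are assumed to be handled elsewhere and are abstracted into a boolean flag, and the defaults reproduce the TCP-like α = 1, β = 0.5 behaviour described in the text.

```python
def aimd_update(window, loss_event, alpha=1.0, beta=0.5):
    """One AIMD(alpha, beta) adjustment, applied once per RTT.

    Linear increase by alpha when a full window was acknowledged;
    multiplicative decrease to (1 - beta) * window on a loss event.
    """
    if loss_event:
        return (1.0 - beta) * window
    return window + alpha

# Example: the familiar sawtooth: grow for 9 RTTs, then halve on a loss.
w = 10.0
for rtt in range(10):
    w = aimd_update(w, loss_event=(rtt == 9))
    print(f"RTT {rtt}: window = {w:.1f} packets")
```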

However, when viewed over shorter timescales, TCP's rate-halving response to single loss events means that the available bit rate exhibits frequent fluctuations (the familiar sawtooth shape). Many applications, though, would prefer a less variable transmission rate, while still being adaptive to congestion over longer time periods. Model-based (or equation-based) congestion control uses a response function [54] to achieve a sending rate that is compatible with TCP, yet exhibits greater rate smoothness at short timescales. The main incarnation of this category is the TCP-Friendly Rate Control (TFRC) protocol, proposed in [19]. TFRC employs a slow-start procedure similar to TCP. After that, the sender estimates the loss event rate, where a loss event is defined as one or more dropped packets within a single RTT. It then uses the TCP response function:

$$T = \frac{s}{R\sqrt{2p/3} + t_{RTO}\left(3\sqrt{3p/8}\right)p\,(1 + 32p^2)}$$

to continuously adjust the sending rate of the flow, where s is the packet size, R is the estimate of the RTT, p is the estimated rate of loss events and t_RTO is the TCP retransmit timeout value. By being less responsive to isolated loss events, and by refraining from aggressively utilising available bandwidth, TFRC achieves a far smoother rate than AIMD(1, 0.5) while remaining responsive (albeit at a slower pace) to persistent congestion [19, 55]. TFRC relies on receiver feedback (at least once per RTT) to update the estimates of the loss event rate and the RTT value used in the calculation of the TCP formula. Smoother sending rates can also be achieved when AIMD operates with a more moderate decrease parameter value [56], or by employing other mutations of the AIMD mechanism that use binomial algorithms to increase/decrease the congestion window in response to packet losses [57].
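The response function above translates directly into code. The sketch below computes the TFRC sending rate from the quantities defined in the text; the example values, and the common simplification of setting t_RTO = 4R, are illustrative assumptions rather than requirements of the protocol.

```python
from math import sqrt

def tfrc_rate(s, R, p, t_RTO):
    """TCP response function: allowed sending rate in bytes/second.

    s: packet size (bytes), R: round-trip time (seconds),
    p: loss event rate (0 < p <= 1), t_RTO: retransmit timeout (seconds).
    """
    denom = R * sqrt(2.0 * p / 3.0) + \
            t_RTO * (3.0 * sqrt(3.0 * p / 8.0)) * p * (1.0 + 32.0 * p ** 2)
    return s / denom

# Example: 1000-byte packets, 100 ms RTT, 1% loss event rate, t_RTO = 4R
print(f"{tfrc_rate(1000, 0.1, 0.01, 0.4) / 1000:.0f} KB/s")
```

Note how the rate falls smoothly as p grows, rather than halving abruptly on each loss event; this is the source of TFRC's smoother behaviour at short timescales.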

While TCP-friendly congestion control solutions are readily available and proven safe for wide deployment [20], it is questionable whether application designers are willing to undertake the task of implementing congestion control as part of the application's requirements. To address this problem, a recent IETF initiative is working on the Datagram Congestion Control Protocol (DCCP) [58], a lightweight transport protocol supporting congestion control. DCCP offers applications the necessary transport protocol services, such as connection establishment and tear-down, sequencing, different DCCP packet types, acknowledgements, provisioning for security, etc. Some applications would prefer the more aggressive TCP-like rate-halving and probing for available bandwidth; others, the smoother evolution of sending rate provided by TFRC. DCCP offers freedom of choice of the congestion control mechanism; the selection is made using a specific Congestion Control ID (CCID) at connection establishment.

2.4.2 Rate-adaptive video encoding

The previous section discussed transport-level actions to adjust the sending rate of a video stream according to the changing conditions of the network. Transport rate control is applicable to generic data; it is therefore the streaming system's responsibility to decide how best to encode the video signal, and what video data to transmit, over the available bandwidth. This section categorises existing techniques for video rate control and adaptation over best-effort networks, comparing their relative merits and disadvantages. The discussion is based on the assumption that a desired communication rate can be determined (e.g., using TCP-like congestion control), and emphasis is given to the ability of each scheme to adapt the compressed video signal to variations in bandwidth.

Fixed-rate bitstream
To date, the majority of video content is still compressed into a single, constant bit rate (CBR) stream. In such cases, it is particularly difficult to change the communication rate after compression is completed. In practice, existing applications, such as Windows Media Player and RealPlayer, apply a buffering delay of ten seconds or more and transmit a compressed bitstream whose encoding rate is (often significantly) lower than the expected channel rate, in order to reduce the probability of congestion. To improve on the quality variations that result from CBR video encoding, controlled variable bit rate (VBR) encoding can be used in conjunction with receiver-side buffering and sophisticated packet scheduling; in this way, more bits can be allocated to more active scenes, while the transmission rate remains constant.

Transcoding
Transcoding is a direct way to modify the media bit rate and is extremely useful for media transmission between networks with different characteristics (e.g., satellite-delivered programmes transported over a cable or xDSL network, or at the boundaries of wired-to-wireless networks). In principle, this process involves decoding the media and then re-encoding it at the desired rate. This approach has its drawbacks: the resulting video quality is in general lower than if the video had been coded directly from the original source at the same rate. Furthermore, it usually requires extensive computation, making the approach expensive. The complexity problem can, however, be addressed using compressed-domain transcoding, where the bitstream is partially decoded to the DCT level and its bit rate is scaled down, either by removing higher-frequency coefficients or by re-quantising coefficients with a larger quantisation step, carrying out motion compensation entirely in the DCT domain [59, 60].

Multiple-file and multiple-rate switching
Another commonly used technique is multiple-file switching, whereby the same content is encoded at different bit rates. Early implementations of this method encoded the content at a few strategic media rates that correspond to common connection speeds (e.g., dial-up, DSL/cable, T1, etc.). Once a media rate was chosen, it remained the same for the duration of the session. Multiple-rate switching, on the other hand, enables dynamic alternation among different rates within a single media streaming session, allowing better adaptation to longer-term fluctuations of bandwidth. This technique is used by Microsoft's Intelligent Streaming [61] and RealNetworks' SureStream [12] streaming solutions. The encoder produces multiple representations (or streams) of the original content, which are stored in a single file, in a form that facilitates their efficient retrieval by the server. This approach overcomes the transcoding limitations, as very little additional computation is required and no re-compression penalty occurs. However, the main disadvantage is that multiple copies of the same media need to be stored, incurring a higher storage cost. Furthermore, the granularity of adaptation is limited by the number of alternative representations (or media rates), which, in practical implementations, is kept low.
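In code, the core of multiple-rate switching is little more than a table lookup: at each adaptation opportunity, the server picks the highest pre-encoded rate that fits the current bandwidth estimate. The sketch below is a simplified illustration of the idea; the exact selection logic of Intelligent Streaming and SureStream has not been disclosed, and the rate set here is invented.

```python
# Pre-encoded representations of the same clip, in Kbps (illustrative set)
ENCODED_RATES = [56, 150, 300, 700, 1500]

def select_stream(bandwidth_estimate_kbps):
    """Pick the highest encoded rate not exceeding the estimate,
    falling back to the lowest representation when even that is too high."""
    feasible = [r for r in ENCODED_RATES if r <= bandwidth_estimate_kbps]
    return max(feasible) if feasible else min(ENCODED_RATES)

for bw in (40, 200, 900):
    print(f"estimate {bw} Kbps -> stream at {select_stream(bw)} Kbps")
```

The coarseness visible in the output (a 900 Kbps estimate is served at 700 Kbps) is exactly the granularity limitation noted above.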

Scalable encoding
A more elegant approach to adapting the rate of a video stream is to use layered or scalable compression [13]. This technique is similar in spirit to multi-rate switching but, rather than producing multiple representations of the same content at different rates, layered encoding produces a set of cumulative (ordered) bitstreams, referred to as layers. Transmission of more layers results in a higher transmission rate, but also higher quality, in terms of spatial resolution, temporal resolution or picture quality. By dynamically changing the number of layers used, rate control can be achieved in both unicast and multicast transmission. Many commonly used standards, such as MPEG-2, H.263 and MPEG-4, offer scalability extensions for layered coding. The granularity of adaptation for layered coding coincides with the layer granularity, or the number of layers: the more layers the stream is encoded into, the finer the possible grain of adaptation. However, due to the inter-dependency of layers, correct reconstruction of a higher layer relies on the successful receipt of all lower layers. The major drawback of this method is that, due to the redundancy necessarily present among layers, there is a compression penalty in comparison to non-scalable coding.

A major characteristic of layered coding is that an enhancement video layer has to be entirely transmitted, received and decoded, or it does not provide any quality enhancement at all. By offering improved rate-adjustment scalability, the Fine Granular Scalability (FGS) video coding algorithm is considered a compression method suitable for video streaming applications and has been introduced into the MPEG-4 standard [14]. An FGS video stream consists of two layers, a Base Layer (BL) and an Enhancement Layer (EL). The BL is generated using conventional DCT-based MPEG-4 coding and provides a minimum video quality. The EL is generated by coding the difference between the original and the BL data using so-called bit-plane coding of the DCT coefficients. The advantage of this method is that the enhancement layer can be truncated at any number of bits within each frame, offering the ability to better match the time-varying available bandwidth. The decoder reconstructs an enhanced video from the BL and the truncated EL bitstreams, providing partial enhancement proportional to the number of bits decoded for each frame.

Finally, multiple description coding (MDC) [62] is a particular scalable coding method that is specifically designed for error-prone networks. The signal is coded into two (or more) separate bitstreams (called descriptions) with the following properties: (i) each description can be independently decoded to give a usable reconstruction of the original signal, and (ii) the quality of the signal improves with the number of descriptions correctly received. It differs from conventional scalable coding schemes, in which the enhancement layer(s) require that the base layer be properly decoded. The important property of MDC is that it enables repair of corrupted frames in one description using uncorrupted frames from the other description(s), even when both descriptions are afflicted by loss, as long as both descriptions are not simultaneously lost [63].
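Because the FGS enhancement layer can be truncated at any point, the sender's per-frame decision reduces to simple arithmetic: deliver the base layer in full, then spend whatever bandwidth remains on enhancement-layer bits. A minimal sketch of this budgeting idea, with invented function names and example figures (it does not model the MPEG-4 bitstream mechanics):

```python
def fgs_el_budget(available_bps, bl_bps, frame_rate):
    """Bytes of enhancement layer to send with each frame.

    available_bps: current bandwidth estimate (bits/s)
    bl_bps: base-layer encoding rate (bits/s); always delivered in full
    frame_rate: frames per second
    """
    spare_bps = max(0.0, available_bps - bl_bps)   # bits/s left for the EL
    return int(spare_bps / frame_rate / 8)         # bits -> bytes per frame

# Example: 500 Kbps available, 200 Kbps base layer, 25 fps
print(fgs_el_budget(500_000, 200_000, 25), "EL bytes per frame")
```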

Adaptive single layer
For video conferencing or real-time streaming of live events (broadcasting/webcasting), rate control is achieved by dynamically altering the encoder's parameters, mainly the quantisation parameter (QP), and possibly the frame rate, resolution, etc. Traditional codecs (H.261, MPEG-1/2) typically rely on changing the QP to achieve the target rate, as they operate at constant frame rates. Since altering only the QP is not always enough for low bit rate encoding, and in many cases increasing the frame drop rate rather than the coarseness of quantisation results in superior encoding quality [27], recent coding schemes (H.263, MPEG-4) allow the alteration of the frame rate (frame skipping). The fundamental problem in video rate control is how to determine suitable encoding parameters (QP, frame rate, etc.) that achieve the target rate while ensuring the best quality result and also catering for additional streaming requirements (e.g., error protection). As an alternative to changing the encoding parameters, rate shaping is a technique for adapting the rate of a compressed video bitstream to a target rate constraint. A rate shaper is an interface, or filter, between the encoder and the transport module. Rate shaping does not require any interaction with the encoder, so it is applicable to any video coding scheme, for both live and stored material. Rate shaping can be viewed from two angles: transport and compression. Selective frame discard [64] is a representative mechanism of transport-level rate shaping. Given network bandwidth and client buffer constraints, this technique uses a dynamic programming algorithm to find the minimum number of frames that must be discarded in order to meet the transport constraints. Dynamic rate shaping [15], on the other hand, operates at the compression level: based on rate-distortion theory, this method selectively discards DCT coefficients of higher frequencies so that the target bit rate is achieved.
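To make transport-level rate shaping concrete, the sketch below drops the largest droppable frames first until a bit budget is met. Note that this greedy toy version is not the cited selective frame discard algorithm [64], which uses dynamic programming and models client buffer constraints; here, frame dependencies are ignored except that I-frames are never discarded.

```python
def greedy_frame_discard(frames, budget_bytes):
    """frames: list of (frame_type, size_bytes); returns indexes to drop.

    Toy rate shaper: drop the largest droppable frames until the total
    size fits the budget. I-frames are protected, since losing them
    would corrupt every frame predicted from them."""
    total = sum(size for _, size in frames)
    droppable = sorted(
        (i for i, (ftype, _) in enumerate(frames) if ftype != 'I'),
        key=lambda i: frames[i][1], reverse=True)
    dropped = []
    for i in droppable:
        if total <= budget_bytes:
            break
        total -= frames[i][1]
        dropped.append(i)
    return sorted(dropped)

# One illustrative group of pictures: 25.7 KB must fit into 20 KB
gop = [('I', 12000), ('B', 2500), ('B', 2200), ('P', 6000), ('B', 3000)]
print(greedy_frame_discard(gop, budget_bytes=20000))   # drops frame 3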

2.4.3 The role of buffering

It is common practice for streaming clients to buffer several seconds' worth of data before media playout commences. As consecutive media packets have playout deadlines, a buffering scheme essentially relaxes this constraint by an identical amount of time. Critical to the performance of video streaming over a best-effort network, buffering provides a number of advantages:

1. Protection against delay variation (jitter). Jitter causes variation in the packet inter-arrival times on the end-to-end path, resulting in jerkiness in playback, as certain video data are skipped when they cannot meet their presentation deadlines. Essentially, the effect of high jitter is identical to that of the data being lost. Receiver buffering extends the presentation deadlines and, in most cases, eliminates this problem.

2. Error recovery through retransmission. If the playout of the video is sufficiently delayed, there is enough time for TCP's ARQ mechanism to operate (or for the streaming client to request the retransmission of lost packets, if UDP is used to transport the media). Since packet loss significantly affects the quality of the decompressed video, this loss recovery option greatly improves media quality in the presence of packet loss. In fact, retransmission-based recovery (either through TCP's ARQ or on top of UDP) is heavily used in contemporary commercial streaming solutions.

3. Smoothing throughput fluctuations. Since the bandwidth of IP media flows is time-varying, a receiver buffer can provide the data needed to sustain playback when the network throughput is temporarily lower than the media consumption rate at the decoder (a simulation sketch follows this list).

4. A buffering scheme may also reside at the sender side of a streaming system, e.g., in the case of live video encoding, to alleviate situations similar to (3) above, in conjunction with the receiver-side buffer. When the media encoding rate is lower (higher) than the transmission rate, the send buffer drains (fills) respectively. In this case, the application may diverge from encoding the media at the currently available bandwidth, if this provides better quality, and rely on the buffers to absorb rate mismatches. A sender buffer can also be used to smooth the rate of VBR video bitstreams (e.g., MPEG-2 video) that cannot easily be transmitted across the network due to their bursty bit rate [65].
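The smoothing role of the receiver buffer (point 3 above) can be illustrated with a few lines of simulation. The numbers are invented: a constant decoder drain rate and a synthetic throughput trace containing a temporary outage.

```python
# Receiver-buffer smoothing: playback survives a throughput dip as long
# as the buffer holds enough media. All figures are illustrative.
CONSUMPTION = 300  # decoder drain rate, Kbps

def playback_survives(buffer_kbits, throughput_kbps_per_sec):
    """Return True if the buffer never empties (no playback stall)."""
    level = buffer_kbits
    for thr in throughput_kbps_per_sec:
        level += thr - CONSUMPTION   # net fill (or drain) over one second
        if level < 0:
            return False
    return True

# 5 s of pre-buffered media at 300 Kbps (= 1500 Kbits) rides out a 4 s outage
dip = [300, 300, 0, 0, 0, 0, 300, 300]
print(playback_survives(1500, dip))   # True: the buffer absorbs the dip
print(playback_survives(600, dip))    # False: playback would stall
```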

Nevertheless, buffering comes at a price. Besides the additional storage requirements at the receiver^8, the main drawback of buffering is the additional startup delay. Furthermore, the streaming application becomes slow to react to user commands such as pause, fast-forward and rewind. However, playout delays of a few seconds are not considered annoying, and the use of adaptive media playout techniques [66] can further reduce the average startup delay, enabling excellent trade-offs between delay and reliability. The aforementioned buffering methods do not apply as easily to interactive audio-visual communication, since an extensive playout delay is prohibitive there; only a minimal amount of buffering is possible, and the media rate should closely match the available network bandwidth.

^8 This is not considered a problem with modern PCs/workstations, but it might be more of an obstacle for mobile and hand-held devices that feature limited memory space (although even this is questionable, as mobile devices are supplied with ever-increasing storage capacity).

2.5 Streaming video content: the effect on quality

End-to-end transmission of video material is subject to two sources of distortion. The original content sequence is encoded (in real time or off-line) to reduce its otherwise prohibitive bandwidth requirements. As a result, a first level of distortion, caused by the lossy nature of encoding, is introduced. The compressed bitstream is then transmitted in packets over the network. There, delay, delay variation and packet loss disrupt the delivery of data to the decoder, and further impairments occur. The following paragraphs describe the most common types of distortion introduced during the encoding and transmission of digital video.


2.5.1 Encoding artifacts

The majority of contemporary video codecs rely on motion compensation, block-based DCT transformation of blocks of pixels, and quantisation of the resulting transform coefficients. Quantisation is the main source of encoding distortions, although other encoding parameters (like frame dropping) also influence the perceived fidelity of the video. The main types of artifact in a compressed video sequence [67] can be summarised as follows:

• Blocking effect or tiling. Blockiness is defined in [68] as a “distortion of the image characterised by the appearance of an underlying block encoding structure”. It is caused by the independent quantisation of blocks, resulting in visible discontinuities at the boundaries of adjacent blocks. In other words, tiling creates false horizontal and vertical edges at the block boundaries. Due to its pattern, this deformation is the most apparent visual distortion of the encoding process.

• Blurring. Blurring is a global distortion over the entire image, characterised by reduced sharpness of object edges and spatial detail [68]. It is the result of the suppression of higher-frequency coefficients when a coarser quantisation parameter is used by the encoder.

• Temporal edge noise or mosquito effect. This is a distortion characterised by time-varying sharpness (shimmering) of the edges of objects in the video scene. This temporal artifact is the result of the varied coding of the same area of the image in subsequent frames.

• Jagged motion. Jagged motion is the result of poor motion estimation. Block-based motion estimation works best when the movement of all pixels in a block is identical. Otherwise, the error of motion prediction is large, and as a result, the quantisation error of the prediction residual is high.

• Jerky motion. Jerkiness is defined in [68] as “originally smooth and continuous motion being perceived as a series of discontinued images”. This is due to lost motion energy when video is transmitted at low frame rates.

• Other artifacts include colour bleeding (smearing of colour between areas of strong chrominance difference), added random noise and chrominance mismatch (due to the use of the luminance motion vectors for chrominance motion compensation).

Some of the above effects are unique to block-based coding, while others are prevalent in all types of compression algorithm. In wavelet codecs, for example, there are no block-related artifacts, as the transform is applied to the entire image; however, blurring may become more noticeable. Figure 2.1 shows the perceptual impact of the most common compression artifacts.
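Being tied to the fixed block grid, blockiness is also the easiest of these artifacts to approximate computationally. The sketch below compares luminance discontinuities at vertical 8-pixel block boundaries against those inside blocks; it is an illustrative measure in the spirit of the blockiness features used by the objective metrics discussed later in this chapter, not a standardised one.

```python
import numpy as np

def boundary_vs_interior(lum):
    """Mean absolute luminance step across vertical 8x8 block boundaries
    versus the mean step elsewhere; a boundary value well above the
    interior value suggests visible tiling."""
    diffs = np.abs(np.diff(lum.astype(float), axis=1))
    cols = np.arange(diffs.shape[1])
    at_boundary = (cols % 8) == 7            # steps straddling a boundary
    return diffs[:, at_boundary].mean(), diffs[:, ~at_boundary].mean()

ramp = np.tile(np.linspace(0, 255, 64), (64, 1))    # smooth horizontal ramp
print("smooth ramp:", boundary_vs_interior(ramp))   # boundary ~ interior

tiled = ramp.copy()
for c in range(0, 64, 8):                    # flatten each 8-pixel block,
    tiled[:, c:c + 8] = tiled[:, c:c + 8].mean()   # mimicking coarse coding
print("tiled ramp: ", boundary_vs_interior(tiled)) # boundary >> interior
```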

2.5.2 Transmission artifacts

A common source of impairments is the transmission of the compressed video bitstream over a packet network. In this case, the bitstream is fragmented into a series of packets, which are then sent to the destination. Two different types of network transmission characteristics contribute to transmission impairments: (i) packet loss and (ii) end-to-end delay. When packets are lost, they are unavailable to the decoder, while excessively delayed packets (high jitter) are worthless to the application. Therefore, both types have the same impact: temporary data unavailability. The impact of such losses depends on the nature of the video encoder and the level of redundancy present in the compressed bitstream (for example, intra-coded bitstreams are more resilient to loss). For MC/DCT codecs, like MPEG, H.263, etc., interdependencies of syntax information can cause an undesired effect, where the loss of a macroblock may corrupt subsequent macroblocks until the decoder can re-synchronise. This manifests as error blocks within the image that bear no resemblance to the current scene and contrast with adjacent blocks. Error blocks have a prevalent impact on perceived quality that is sometimes greater than the effects of coding artifacts. Another problem arises when blocks in subsequent frames are predicted from a corrupted macroblock: they will be damaged as well, causing a temporal propagation of errors until the next intra-coded macroblock is available.

2.6 Measuring video quality

The quality of a video application is usually difficult to measure or quantify in an unambiguous, universal manner. Quality is a multi-parameter property related to the nature of the application and the context of its use. In some cases, application quality is synonymous with the level of satisfaction or enjoyment of the human viewer; in others, with the degree to which the application successfully completes its task (e.g., to enable visual communication or assist a cognitive task); and usually it is a combination of the two. A viewer's enjoyment when watching a video programme also depends on its content and material. Other factors, like the viewing distance, the size and quality of the display, or its resolution and contrast, are also significant [69, 70, 71]. Above all, though, the quality of the sound and video themselves are of primary importance. Furthermore, it is well known that good sound quality tends to reduce the viewer's ability to detect impairments in the video [72]. Although quality is a rather difficult concept to grasp, two features of quality measurement are widely accepted.

Figure 2.1: Impact of common video compression artifacts on the perceived quality of images (panels: Original Image, Blockiness, Blurring, Edge Noise).

First, quality implies a comparison. This comparison may be direct (‘better’, ‘worse’) or indirect (‘good’, ‘bad’). Second, quality must be measured on some open-ended scale^9. A direct comparison takes place when both the original (or reference) and the impaired (or test) video sequences are available and shown to the viewer or passed to the measurement instrumentation. When the reference is not available, the comparison (in the case of subjective tests) is made against some ‘internal’ reference of quality of the individual viewer.

^9 An open-ended scale means that the quality of the test video sequence is allowed to be either better or worse than the reference.

In general, two different kinds of quality measurement methodology are recognised and appreciated. Firstly, those that can be used off-line, or out-of-service, to gain an in-depth understanding of the specific application, or of the quality requirements of the system's components. In this case, quality might mean user perception, success in achieving and completing the application task, or clues on how an application needs to be efficiently designed and engineered. In this category, several techniques and disciplines, or combinations of them, may be used:

• User studies and trials, including, subjective quality assessment tests, questionnaires, specialised psychology tests, or interviews with user groups, application designers and service providers.

• Objective measures that use computational models to quantify the quality given a certain set of application parameters (e.g., the value of network performance parameters, like delay, packet loss, etc.).

• Relevant results and knowledge gained from similar experiences in the past.

A second category includes quality metrics that can be used to monitor the quality of an application during operation, also called in-service metrics. Such metrics are equally important, as they can provide invaluable feedback on the application QoS during its real-time operation. They can also be used to dynamically monitor a video system and identify problems on the application's end-to-end path.

Quality tools and methods belonging to the above classes are not destined to work exclusively, but in a complementary way. The background, or out-of-service, methods are believed to provide more accurate results, because they may use multi-disciplinary techniques, do not have real-time performance requirements, and have access to both the original and the distorted video sequences. The in-service metrics, on the other hand, are precious management and quality monitoring mechanisms for the real-time analysis of a service. Since time constraints prohibit the use of computationally expensive processing, and the original video is not usually available for direct comparison, in-service models may sometimes fail to provide equally accurate results. This reveals another important feature of the out-of-service methods: they can be used for the alignment and calibration of the on-line models during their design and development.

As video becomes a fundamental part of networking applications, being able to measure its quality has significant importance at all stages, from the development of new video codecs to the monitoring of the transmission system's quality of service. This became more apparent with the advent of compressed digital video, which exposed the limitations of techniques traditionally used for quality measurement of analog video [73]. Because of compression, digital video exhibits artifacts fundamentally different from those of analog systems: blockiness, blurring, motion estimation mismatches, etc. [67]. Furthermore, transmission over networks that do not provide strong QoS support introduces additional types of distortion (e.g., artifacts due to packet loss). Given these limitations, designers of video systems resorted to subjective viewing tests in order to obtain reliable ratings of video quality. Subjective tests may produce accurate ratings, but they require costly and complex setups and controlled viewing conditions; thus they are inflexible, often impractical, and in certain cases impossible to use (e.g., for on-line quality monitoring). In the quest for more convenient alternatives, researchers traditionally use simple pixel error measures, like the mean squared error (MSE) or the peak signal-to-noise ratio (PSNR), with the implied suggestion that they form valid quality measures. However, these pixel-based metrics fail to appreciate the true perceptual impact of image distortions. The complexity of subjective tests and the inaccuracy of simple pixel-based error metrics have generated intensified research on the development of objective quality assessment models, or metrics. Objective quality metrics are computational models that can produce quality ratings that correlate well with human opinions of quality. This area of research is already quite mature, yet continuously developing. While objective models are very promising and commercial products have been developed, especially for digital TV signals (e.g., [74, 75, 76, 77, 78, 79, 80]), both subjective and objective methods are considered useful for measuring and appreciating the quality of a video system. For example, subjective tests are required for the calibration of objective models. The next sections outline the methods of subjective quality assessment and introduce the principles of objective quality metrics and recent research work in this area.

2.6.1 Subjective video assessment

Subjective quality assessment, or mean opinion score (MOS), tests aim to capture the user's perception and understanding of quality and, inevitably, produce some form of rating or quality score that corresponds to the viewer's judgement of quality.

DSCQS scale    DSIS scale
Excellent      Imperceptible
Good           Perceptible, but not annoying
Fair           Slightly annoying
Poor           Annoying
Bad            Very annoying

Figure 2.2: Video quality assessment scales used in subjective MOS tests

Formal subjective testing, as defined by ITU-R Rec. BT.500 [22], has been used for several years at all stages of video system design and production. In this method, a panel of human subjects is shown a series of video sequences and asked to score the quality of the video scenes in one of a variety of manners. The recommendation also defines issues such as the selection of subjects, the selection of video material, viewing conditions, the timing and presentation of the video scenes, and the scaling methods for opinion scores.

Depending on which contextual factors influencing user perception need to be derived, three testing procedures are most commonly used: Double Stimulus Continuous Quality Scale, Double Stimulus Impairment Scale and Single Stimulus Continuous Quality Evaluation.

Double Stimulus Continuous Quality Scale (DSCQS). In this method of subjective quality assessment, viewers are shown several pairs of video sequences, each consisting of a reference and a test sequence. Video clips have a short duration, usually 8-10 seconds. The reference and test sequences are shown to the user twice, in alternating fashion, with the order chosen randomly. Subjects do not know in advance which is the reference sequence and which is the test sequence. They rate the material on a scale ranging from bad to excellent (Figure 2.2), in reference to the other video clip in the pair. This rating may have an equivalent numerical scale from 0 to 100. The difference, rather than the absolute values, of the two ratings is taken for further analysis. This difference removes some rating uncertainties caused by the material content and the viewers' experience.
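Since DSCQS analysis operates on the difference of the two ratings, the per-subject arithmetic is straightforward; a small sketch with invented ratings on the equivalent 0-100 numerical scale:

```python
# (reference rating, test rating) per subject, on the 0-100 DSCQS scale.
# All ratings here are invented for the example.
ratings = [(82, 61), (75, 58), (90, 70), (68, 55)]

# Working with the difference of the two ratings cancels some of the
# bias due to the material content and each viewer's experience.
diffs = [ref - test for ref, test in ratings]
mean_diff = sum(diffs) / len(diffs)
print(f"difference scores: {diffs}, mean difference: {mean_diff:.1f}")
```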

Double Stimulus Impairment Scale (DSIS). In contrast to DSCQS, in this method the reference sequence is always presented before the test sequence, and the pair need not be shown twice. The viewers rate the second clip with reference to the first on an overall impression or impairment scale, ranging from very annoying to imperceptible (Figure 2.2). This scale is commonly referred to as the 5-point scale. This method is more useful for evaluating clearly visible impairments, such as noticeable artifacts caused by encoding or transmission.

The limitation of using rather short test sequences becomes a problem when one is interested in the continuous evaluation of digital video systems over longer timescales. In such cases, video systems generate substantial quality variations that may not be uniformly distributed over time. Both the DSCQS and DSIS methods were not designed for quality monitoring applications, i.e., for video sequences of arbitrarily long duration. The first problem encountered is that, if double stimulus presentation is used for longer sequences, comparable moments in the two sequences will be too far apart in time to be rated accurately against each other. Furthermore, it is known that more recent stimuli weigh more heavily in human memory when the duration of the stimulus is increased from 10 to 30 seconds [81]. In other words, for long sequences, the most recent parts of the video have a relatively greater contribution to the overall quality impression. This phenomenon, called the recency effect, has long been known in the psychology literature [82, 72] and is difficult to quantify in subjective tests.

Single Stimulus Continuous Quality Evaluation (SSCQE). In order to capture temporal variations of quality in video sequences of extended duration, the SSCQE method allows viewers to assess the quality dynamically, in a continuous manner. In this case, the viewers are shown the sequence to be evaluated and rate the instantaneous perceived quality by continuously adjusting a slider on the DSCQS scale (from bad to excellent). The slider is typically a hardware device [83]. Instantaneous quality scores are obtained by periodically sampling the slider value, usually every 1-2 seconds or less. In this way, differences between alternative transmission configurations can be analysed in a more informative manner. The drawback of this method is that the accuracy of user ratings can be compromised by the cognitive load imposed by the task of moving the slider. In addition, as the content of the programme tends to have a significant impact on the SSCQE ratings, it becomes difficult to compare scores from different test sequences. When a model is required to link instantaneously perceived quality to an overall quality score calculated for the whole sequence, the non-linear influence of good or bad parts within the sequence can be expressed by appropriate pooling methods, like Minkowski weighting [84]. Other problems with evaluating SSCQE scores include the impact of the recency effect on scale judgements, order effects (the relative position of periods of bad and good quality), and the fact that the viewer's reaction time to changing quality is not constant but usually varies. Momentary changes in quality are also quite difficult to track, leading to potential reliability problems for the derived results.
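As an illustration of the pooling step just mentioned, the sketch below combines sampled instantaneous scores into a single sequence-level score using a Minkowski summation over the impairment (100 minus the score). The exponent is a free parameter, chosen arbitrarily here as 4; larger exponents let brief periods of bad quality dominate the overall score.

```python
def minkowski_pool(scores, p=4.0):
    """Pool instantaneous quality scores (0-100) into one overall score.

    Pooling the impairment (100 - score) with exponent p > 1 means that
    short severe drops weigh more than the arithmetic mean would suggest,
    mimicking the non-linear influence of bad segments on viewers."""
    impairments = [100.0 - s for s in scores]
    pooled = (sum(i ** p for i in impairments) / len(impairments)) ** (1.0 / p)
    return 100.0 - pooled

steady = [70.0] * 30
with_dip = [70.0] * 27 + [20.0] * 3            # short, severe quality drop
print(f"{minkowski_pool(steady):.1f}")          # 70.0: nothing to penalise
print(f"{minkowski_pool(with_dip):.1f}")        # ~53: well below the mean of 65
```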

2.6.2 Objective metrics of video quality

Although subjective quality testing procedures constitute a reliable method for gaining insightful knowledge about the performance of a digital video system, the complicated and costly setup process involved makes these methods unattractive and infeasible for automating the quality evaluation process. The involvement of human subjects renders this approach unusable when the quality monitoring system has to be embedded into practical processing systems. Even for cases of off-line quality evaluation (e.g., examining the quality performance of a new video codec), subjective tests impose a significant burden in terms of cost and time. Furthermore, subjective tests are very sensitive to the viewing conditions and the number of human subjects required to draw statistically sound and safe results. In this respect, extemporaneously planned subjective tests can produce doubtful or even invalid results, and should be treated with scepticism; in such cases, it is believed that they can be even worse than pixel error metrics like PSNR [85]. For this reason, quality metrics that can automatically produce objectively obtained ratings that correlate well with human quality ratings present an attractive alternative [86]. Objective test methods do not use human subjects. Instead, they measure and analyse the video signal for perceivable distortions and artifacts. There are a number of reasons why new quality metrics for digital video systems are necessary:

1. The operating characteristics of digital transmission systems, like bandwidth, packet loss (or bit-error rate) and delay, may change over time, producing quality fluctuations. These transients may be very difficult to capture unless the performance of the system is being continuously monitored. Therefore, only continuous, non-intrusive performance monitoring can accurately capture what the viewer is perceiving in these instances.

2. Digital video systems produce fundamentally different kinds of impairments than analog video systems, as described in section 2.5 above. To fully quantify the performance characteristics of a digital video system, it is desirable to have a set of performance parameters, where each parameter is sensitive to some unique dimension of video quality or impairment type. This discrimination property of performance parameters is useful for designers trying to optimise certain system attributes over others, and for network operators wanting to know not only when a system is failing, but where and how it is failing.

3. The advent of digital video compression, storage and transmission systems has exposed fundamental limitations of techniques and methodologies that have traditionally been used to measure analog video performance. Traditional performance parameters have relied on the 'constancy' of a video system's performance for different input scenes, i.e., a deterministic quality response for any kind of input scene. Thus, one could inject a test pattern or test signal, measure some resulting system attribute (e.g., frequency response), and be relatively confident that the system would respond similarly for any other video material (e.g., video with motion). However, the digital manipulation of video has made the connection between these traditional parameters and perceived video quality much more tenuous. Digital video systems adapt and change their behaviour depending on the characteristics of the input signal. For example, a system that works fine for video conferencing may be inadequate for digital television, and vice versa. In such cases, quality metrics need to use a wide range of input scenes for testing, as well as a representative sample of the types of distortion introduced during compression and transmission.

4. In recent years, there has been a rapid evolution of video compression, storage and transmission technologies that makes the performance measurement task far more difficult. Therefore, new quality measurement instruments need to be independent of the specific technologies.

Figure 2.3: Failure of pixel error metrics to accurately reflect the perceived quality (MSE = 27.10 vs. MSE = 21.26).

The simplest form of objectively measuring quality is to calculate the distortion at the pixel level. Due to their relative simplicity, fidelity metrics like peak signal-to-noise ratio (PSNR) and mean squared error (MSE) are extensively used by the image processing community. Despite being straightforward to calculate, these error measures operate strictly on a pixel-by-pixel basis and neglect the influence that the type of distortion, the image content and the viewing conditions have on the actual visibility of artifacts. The main issue with MSE and PSNR is that they cannot discriminate between impairments that humans can and cannot see, or between impairments that are more or less annoying. Therefore, they cannot appraise distortions as perceived by a complex and multi-dimensional system like the human visual system and, as a result, it is widely accepted that they very often fail to correlate well with human judgements of quality [23, 87]. An illustration of this is shown in Figure 2.3, where an image with higher MSE, and thus supposedly worse quality, exhibits superior visual quality. More importantly, it is difficult to compare different sequences using these metrics: whatever the known PSNR relationship between two sequences is (better, same or worse), one cannot safely assume that a similar perceived quality relationship holds; PSNR therefore cannot serve as a 'universal' quality indicator. Another problem with pixel error measures in video is that skipped frames have a disproportionate effect on the value of PSNR. Skipping frames is very commonly used in low bit rate video streaming as a means of keeping the output rate under control. In the simplest of cases, missing frames are compensated at the decoder by simply repeating the last decoded frame, whilst sophisticated frame interpolation techniques can also be used, at the expense of increased complexity at the decoder. Although low frame skip rates are known to have minimal impact on perceptual quality [88, 27], this fact cannot be appreciated by PSNR, SNR, etc. Figure 2.4 plots the PSNR for 300 frames of the sequence Foreman, encoded at 300 Kbps. When frames are skipped, PSNR drops dramatically. In contrast, recent experimental results show that frame skipping (i.e., effective frame rates down to 15, 12 or even fewer frames/second) does not have a considerable effect on the acceptability of the encoded sequence quality [27].

Figure 2.4: PSNR of sequence Foreman encoded at 300 Kbps. Observe the significant drops of PSNR as frames are skipped during encoding.
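The frame-skipping effect just described is easy to reproduce numerically. The following sketch is illustrative and not from the thesis: the coarse quantisation stands in for real coding noise, and the frame sizes and skip pattern are assumptions.

```python
import numpy as np

def psnr(reference: np.ndarray, test: np.ndarray, peak: float = 255.0) -> float:
    """PSNR in dB between two same-sized luminance frames."""
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(0)
original = [rng.integers(0, 256, (288, 352)) for _ in range(6)]
coded = [(f // 8) * 8 for f in original]      # crude stand-in for coding noise

# Suppose frames 2 and 3 were skipped: the decoder repeats the last decoded frame.
decoded = [coded[0], coded[1], coded[1], coded[1], coded[4], coded[5]]

for n, (ref, out) in enumerate(zip(original, decoded)):
    print(f"frame {n}: PSNR = {psnr(ref, out):.2f} dB")
```

The repeated frames score only a few dB, even though a brief freeze of this kind has little perceptual impact, which is exactly the mismatch discussed above.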

These problems have prompted an intensified study of visual, objective video quality metrics (VQM) in recent years [24]. An objective VQM is an algorithm, or computational model, that receives as input one or two video sequences and produces quality scores that reflect the predicted video quality (of the single sequence), or the predicted fidelity (of the impaired video with reference to its undistorted counterpart). To achieve high correlation with human judgements of quality, the development process of such metrics involves their calibration with subjective evaluations (subjective quality scores obtained by means of standard MOS tests). After the calibration phase, the models are tested for their prediction performance with subjective data not used during calibration.

Over the past few years, a wide variety of methods for the objective testing of digital video quality have been developed. One way of classifying these metrics is by the extent to which the original video (reference) is used by the model in order to evaluate the quality of the test video. Using this classification, three categories can be identified: (i) Full-Reference (FR), (ii) Reduced-Reference (RR), and (iii) No-Reference (NR). Full-Reference (FR) models require both the reference and test sequences in order to measure quality. FR VQMs perform a pixel-by-pixel and frame-by-frame comparison (usually on frame differences) between the reference and test. PSNR is the simplest example of the category. The KDD/Pixelmetrix model [75] computes the image difference map (like PSNR) and then uses a sophisticated three-layered bottom-up noise weighting to determine video quality [89]. Typical noise types from the compression are classified and weighted according to their characteristics. The local texture is analysed to compute the local degree of masking. Finally, a gaze prediction stage is used to emphasise noise visibility in and around objects of interest. The measure of distortion is obtained from the PSNR computed on the weighted noise. The Sarnoff JNDMetrix (just noticeable difference) model [90] performs similar comparisons between the reference and test, but only after the video is passed through a human vision model that indicates the areas of the image where viewers are likely to see differences between the original and distorted versions of the same scene. In this case, distortions that are noticeable to the human eye can be weighted more than distortions that are not (e.g., due to spatial or temporal masking).

FR models are very useful in evaluating and improving new image processing or compression techniques. FR metrics, though, are very sensitive to pixel offsets and require accurate alignment; therefore, a pre-registration of the reference and test sequences is required prior to the application of the metric.

Reduced-Reference (RR) VQMs operate by extracting features and computing statistics from the test sequence and performing specialised comparisons with the corresponding features of the reference sequence. Features are chosen so that they reflect perceivable distortions in the image. Feature statistics are then chosen and correlated with subjective quality scores using conventional regression techniques. Webster et al [91] developed a video quality assessment system that is based on a combination of three low-level features, selected empirically from a number of candidates so as to yield the best correlation with subjective data for a given test set. These features reflect the amount of lost and added energy in the spatial and temporal domains and are calculated using spatial gradients and the temporal error between successive frames. They are then linearly combined to give a measure of video quality. The metric developed by the Institute for Telecommunication Sciences (ITS) [92] is based on the extraction and statistical manipulation of scalar, spatio-temporal features from the original and degraded video sequences to obtain a single measure of distortion, and is in essence a progression from [91]. This is one of the most prominent objective quality models to date: it attained high performance during the latest evaluation process by the ITU-T VQEG [25] and has recently been recommended to the ITU [93]. Because this model is extensively used in this work, it is described in greater detail later in section 2.7.
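As an illustration of the kind of low-level features such RR models build on, the sketch below computes a spatial-activity feature (the standard deviation of a Sobel-like gradient magnitude), a temporal-activity feature (the standard deviation of the frame difference), and a simple 'lost spatial energy' comparison. The exact features and comparison functions of [91, 92] differ; this is a hedged approximation only.

```python
import numpy as np

def spatial_information(frame: np.ndarray) -> float:
    """Std. dev. of a Sobel-like gradient magnitude (a common spatial-activity feature)."""
    f = frame.astype(np.float64)
    # Sobel responses via array slicing: horizontal (gx) and vertical (gy) gradients
    gx = (f[:-2, 2:] + 2*f[1:-1, 2:] + f[2:, 2:]) - (f[:-2, :-2] + 2*f[1:-1, :-2] + f[2:, :-2])
    gy = (f[2:, :-2] + 2*f[2:, 1:-1] + f[2:, 2:]) - (f[:-2, :-2] + 2*f[:-2, 1:-1] + f[:-2, 2:])
    return float(np.std(np.hypot(gx, gy)))

def temporal_information(prev: np.ndarray, cur: np.ndarray) -> float:
    """Std. dev. of the frame difference (a common temporal-activity feature)."""
    return float(np.std(cur.astype(np.float64) - prev.astype(np.float64)))

rng = np.random.default_rng(0)
ref = rng.integers(0, 256, (288, 352)).astype(float)
test = (ref + np.roll(ref, 1, axis=1)) / 2.0         # crude horizontal blur
si_ref, si_test = spatial_information(ref), spatial_information(test)
lost_energy = max(0.0, si_ref - si_test) / si_ref    # 'lost spatial energy' style comparison
print(f"SI(ref)={si_ref:.1f}, SI(test)={si_test:.1f}, lost={lost_energy:.2f}")
```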

RR models can be used for in-service quality monitoring, when the full reference sequence is not available (e.g., because the measurement infrastructure resides at the other end of the transmission system). In such cases, a limited number of statistical features can be transmitted or stored downstream and compared with the corresponding features of the distorted stream.

The above two classes are also called double-ended, since some information from the original sequence (the exact reference, or some reduced information extracted from the reference) must be available to the measurement instrument in order to make picture quality calculations. On the other hand, No-Reference (NR), or single-ended, VQMs do not require any access to the original video frames to calculate quality scores, but rely on the extraction of features from the distorted sequence that reveal artifacts such as blockiness, noise or image blur. For example, the amplitude of adjacent pairs of pixels relative to their position in the macroblock grid (i.e., pixel pairs on either side of a macroblock boundary and pixel pairs on the same side) can provide information about the level of blocking artifacts, the most apparent form of video distortion [94, 77]. As these models do not have access to the original sequence, they are less accurate than double-ended models. Nonetheless, NR models are especially useful for on-line monitoring of digital video transmission systems and are practical trouble-shooting tools; for this reason, several commercial tools have become available (e.g., [78, 79, 80]).
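To make the blockiness idea concrete, the following sketch (an illustrative heuristic, not the actual metric of [94, 77]) compares the average luminance jump across 8-pixel block boundaries with the average jump inside blocks; values well above 1 suggest visible blocking.

```python
import numpy as np

def blockiness(frame: np.ndarray, block: int = 8) -> float:
    """Ratio of luminance jumps across block boundaries to jumps inside blocks."""
    f = frame.astype(np.float64)
    d = np.abs(np.diff(f, axis=1))                 # horizontal neighbour differences
    cols = np.arange(d.shape[1])
    at_boundary = (cols % block) == (block - 1)    # pixel pairs straddling a boundary
    return float(d[:, at_boundary].mean() / d[:, ~at_boundary].mean())

rng = np.random.default_rng(0)
smooth = rng.normal(128, 2, (288, 352))
# one DC offset per 8-pixel-wide block mimics block-based coding artifacts
block_dc = np.repeat(rng.normal(0, 8, (1, 352 // 8)), 8, axis=1)
blocky = smooth + block_dc
print(f"smooth: {blockiness(smooth):.2f}, blocky: {blockiness(blocky):.2f}")
```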

Objective quality metrics can also be classified with respect to the method used to reveal and measure perceivable artifacts in video. There are two broad categories: (i) feature extraction models and (ii) perceptual models.

Feature extraction uses mathematical computations to derive characteristics of single frames (spatial features) or sequences of frames (temporal features). The features from the reference and test sequences are then compared to produce a quality score. The ITS VQM, as well as Hamada's three-layered bottom-up noise weighting [89], belong to this category. Watson's Digital Video Quality (DVQ) metric [95, 96] relies on the measurement of spatial sensitivity and masking effects, the measurement of visibility thresholds for temporally varying DCT quantisation noise, and the modelling of temporal masking. Tan et al [97] proposed an MPEG video quality metric that computes the perceptual impairment based on contrast sensitivity and masking, using spatial filtering and Sobel operators. The masked error signal is then calculated and normalised. At a second stage, a cognitive emulator is utilised that simulates higher-level aspects of perception, such as the delay and temporal smoothing effect of observer responses, the non-linear saturation of perceived quality and the asymmetric behaviour of viewers with respect to order effects. This tool is one of the few targeting the measurement of the temporal variation of video quality and, although it requires the reference as input, the cognitive emulator was shown to improve the predictions on subjective SSCQE ratings. A drawback of feature extraction models is that they may not always work well for all types of distortions. However, their advantage is that features extracted from the reference pictures may be sent or stored along with the compressed picture for quality evaluation at the receiving end.

Other perceptual quality models rely on video processing based on the properties of the human visual system (HVS). Building on the significant amount of work on the modelling of the HVS conducted over the past three decades, several perceptual video metrics have appeared in the literature. Recent models, like Lambrecht's Moving Picture Quality Metric [98] and Winkler's Perceptual Distortion Metric (PDM) [99], among others, have gained particular attention. The Sarnoff JND metric [90] also belongs to this category. Tan and Ghanbari [100] presented a multi-metric model for MPEG video based on the combination of a perceptual model and a blockiness detector. The perceptual model performs non-linear processing and filtering to weigh the visibility of details of the image according to properties of the HVS and to account for temporal masking effects. The blockiness detector operates on edge-enhanced versions of the reference and test frames and performs edge cancellation followed by a harmonic analysis process to measure blockiness. The perceptual model and the blockiness detector are combined to produce a single quality score.

While quality metrics based on models of the HVS are believed to provide more accurate results, independent of the type of the underlying distortion, the fact that they implement complicated models of the HVS imposes significant computational cost.

Standardisation efforts

The research efforts to design objective video quality models have generated recent standardisation activities that involve the specification of the essential features and requirements of objective quality assessment models, such as target video applications and desired performance characteristics. Candidate models are then evaluated against the above specifications with the purpose of producing industry standards. This standardisation work is now described.

Video Quality Experts Group. The Video Quality Experts Group (VQEG) [24] was formed in 1997 with the objective of evaluating and analysing the performance of video quality metrics in a variety of application areas (broadcast and cable television, multimedia, etc.). The goal of this effort is to ultimately propose suitable objective quality metrics that would be standardised by relevant bodies like the ITU. During the first phase of its efforts, ten video quality assessment models (proponents), with the inclusion of PSNR as a reference objective model for comparison purposes, were scrutinised for their ability to provide predictions in agreement with subjective ratings. A large number of subjective tests were organised by independent labs, strictly adhering to the specifications of the ITU-R BT.500-10 [22] procedure for the DSCQS method of subjective evaluation (cf. section 2.6.1). Based on the analysis of results [101], VQEG concluded that it could not propose any model for inclusion in ITU Recommendations, and that further validation was required. However, the effort produced significant insights into the process of designing efficient objective video quality metrics and understanding the limitations of the current models. During the second evaluation phase of VQEG [25], however, the group concluded that four models performed well enough to be included in a new recommendation on objective quality metrics [102].

ITU - Study Group 9. The Study Group 9 (SG9) of the ITU [103] has been working on preparing and maintaining recommendations for the delivery of audiovisual material (voice, sound, distribution television, video on demand, and other related services) over cable or hybrid networks. Among other efforts, ITU-SG9 is working in close cooperation with VQEG on standardisation aimed at the quality evaluation of these services and networks (for example, the transport of audiovisual signals using IP), the determination of quality parameters for television transport, and objective and subjective methods for the evaluation of the audiovisual quality of conversational or distribution multimedia services.

2.6.3 Weaknesses of video quality assessment techniques

The previous sections described procedures and methods to measure the quality of digital video. While these methods are very helpful for evaluating digital video components (video codecs, monitoring transmission quality, understanding psychophysical aspects of quality) or the quality ratings of video programmes, they have certain limitations.

Firstly, although subjective and objective methods can quite accurately measure the impairments of digital video, most of them work on short video sequences (typically, around 10 seconds long). Clearly, 10-second video sequences are not long enough to exhibit all the kinds of impairments that would occur in a real Internet video application. This problem has been partially tackled by employing continuous assessment techniques [104, 83] or by using temporal pooling methods such as Minkowski summation on objectively acquired quality scores [105]. Since objective video models that continuously assess quality have to be tuned against the corresponding subjective methods (e.g., SSCQE), they inherit the problems encountered with those subjective methods, as mentioned in section 2.6.1.

Secondly, the objective models described earlier are more accurate in measuring the effect of compression on perceived quality (or, more accurately, visual artifacts) than other types of quality degradation, such as those caused by large delays or delay variation (and, to a certain degree, losses). The main reason for this is that they were initially developed with digital (broadcast or cable) TV as the target transmission environment, not the Internet. An objective metric of Internet video quality also needs methods that account for the joint effect of visual (encoding distortions, artifacts due to packet loss) and non-visual (delay and delay variation in the presentation of the video) types of distortion, and must therefore be able to measure the perceived quality effects imposed by the various components on the end-to-end path of a video streaming system. Encouragingly, this requirement is currently being addressed by the VQEG and is satisfied, to a certain degree, by some existing models (e.g., the ITS VQM [92]).

Thirdly, the quality judgements produced by these models concern, in the majority of cases, the video part only. One may question such practice, since video material (whether interactive or one-way) is rarely used on its own without audio. If judged jointly for the purpose of quality assessment, the inherent relationships between the various media types involved in an application scenario (and especially between audio and video) may substantially alter the quality scores. Certain distortions may become more or less important, or new requirements may arise, for example, media synchronisation. Nevertheless, current activity in the VQEG considers this issue. In the context of a multimedia perceptual quality model, the requirement for synchronisation of the audio and video components of a multimedia presentation is discussed in [106].
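For concreteness, one commonly used form of the Minkowski pooling mentioned above is the power mean over per-frame (or per-segment) scores; the specific form and the typical exponent range quoted in the comment are drawn from the general pooling literature, not from this thesis.

```latex
% Minkowski (power-mean) temporal pooling of T per-frame distortion scores d_t.
% Larger p (values around 2-4 are typical) weighs the worst moments more heavily.
Q = \left( \frac{1}{T} \sum_{t=1}^{T} \lvert d_t \rvert^{p} \right)^{1/p}
```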

Figure 2.5: Illustration of the definition of the spatio-temporal (S-T) region (a horizontal width by vertical width block of pixels spanning a temporal width of consecutive frames).

2.7 A closer insight into an objective quality metric

This section presents details of a video quality assessment metric developed at the Institute for Telecommunication Sciences (ITS) [92, 74], called the ITS VQM. This model attained significant performance (it was one of the models recommended to the ITU) during the latest (2003) phase of evaluation of FR quality models conducted by VQEG [25], exhibiting a Pearson correlation of over 0.90 with the corresponding quality ratings from a large panel of viewers.
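Prediction performance of this kind is typically reported as the Pearson linear correlation between the model's scores and the subjective MOS over a set of test clips. A minimal sketch, with purely hypothetical scores for eight clips:

```python
import numpy as np

def pearson(objective: np.ndarray, mos: np.ndarray) -> float:
    """Pearson linear correlation between model scores and mean opinion scores."""
    o = objective - objective.mean()
    s = mos - mos.mean()
    return float((o * s).sum() / np.sqrt((o ** 2).sum() * (s ** 2).sum()))

model = np.array([78, 64, 55, 90, 42, 70, 61, 83], dtype=float)  # hypothetical
mos   = np.array([75, 60, 50, 88, 40, 72, 58, 85], dtype=float)  # hypothetical
print(f"Pearson correlation: {pearson(model, mos):.3f}")
```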

The metric is based on the comparison of features extracted from the processed video with similar features extracted from the original video, to generate a set of parameters that are indicative of perceptual changes in video quality. Features are extracted from localised spatio-temporal (S-T) regions of edge-enhanced frames. An S-T region defines a block of pixels by the number of pixels (i) horizontally and (ii) vertically, and (iii) the number of consecutive frames in the temporal dimension, as pictured in Figure 2.5. For instance, for a CIF (352 x 288) size image, an 8 x 8 x 6 S-T region contains 384 pixels, and each 6-frame period contains 1584 S-T regions.

The size of the S-T region is a trade-off between the storage or bandwidth requirement of an auxiliary data channel and the correlation of objective-to-subjective quality scores. (For on-line monitoring purposes, S-T region features of the original video may be transmitted to the receiving end, where they can be combined with the corresponding features of the received video to monitor the end quality.) An S-T region size of 8 x 8 pixels x 6 frames has been shown to produce the best correlation, although greater horizontal and vertical widths (32 and higher) and temporal lengths (up to 30 frames) still produce satisfactory results, while reducing the volume of generated features [107].
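The arithmetic of the S-T partition is straightforward; the sketch below (assuming the frame dimensions divide evenly by the region size) reproduces the figures quoted above for CIF.

```python
def st_region_count(width: int, height: int, frames: int,
                    bw: int = 8, bh: int = 8, bt: int = 6) -> int:
    """Number of S-T regions in a clip, assuming dimensions divide evenly."""
    return (width // bw) * (height // bh) * (frames // bt)

print(st_region_count(352, 288, 6))   # -> 44 * 36 = 1584 regions per S-T period
print(8 * 8 * 6)                      # -> 384 pixels per S-T region
```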

Prior to the extraction of the spatio-temporal features from each S-T region, the process involves perceptual filtering of both the original and distorted video frames to enhance perceivably salient properties of the frames, such as edge information. This is achieved by applying horizontal and vertical edge enhancement filters (Sobel-like filters) to the luminance pixel values (Y component) of the frame. Each pixel's response to the filters is then converted to polar coordinates (to reveal the magnitude and orientation of edges). Figure 2.6 shows the result of the edge enhancing filtering process for the images on the top row of Figure 2.1. Observe how the top-left edge-enhanced version of the original image accentuates areas with edge activity. The tiling and blurring effects of compression are shown in the edge-enhanced version of the encoded image (top-right): there are added edges with horizontal and vertical orientation, revealing added blocks, and a reduction in the intensity of edges otherwise present in the original, an indication of lost high frequencies (blurring). The plots in the bottom row show the histograms of the polar coordinates of the original and output images, where it can be seen how blockiness increases the number of pixels with horizontal and vertical orientation, and how the energy of edges with diagonal orientation is lost in the output image due to the presence of blurring.

Figure 2.6: Top: edge-enhanced versions of an input (left) and output (right) image. Bottom: histograms of the polar coordinates of the filtered images.

For each identical S-T region in the original and distorted edge-enhanced frames, two features are calculated. The first feature, f_1(s, t), measures the overall spatial activity within a given S-T region, where s and t are the spatial and temporal indices of the S-T region. It is calculated as the standard deviation of the (edge-enhanced) pixel magnitudes over the S-T region, clipped at some perceptibility threshold. This feature is sensitive to changes in the overall amount of spatial activity within a given S-T region: for example, localised blurring reduces the amount of spatial activity, while added noise or blocking artifacts increase it. The second feature, f_2(s, t), is calculated by accounting for pixels that are horizontal or vertical edges and pixels that are diagonal edges (i.e., whose gradients are not in the horizontal or vertical direction). It is defined as the ratio of the average magnitude of pixels that are horizontal and vertical edges over the average magnitude of pixels that are diagonal edges, clipped at an appropriate perceptibility threshold. The f_2(s, t) feature is sensitive to changes in the angular orientation of spatial activity within the S-T region. For instance, if the image suffers from blocking artifacts, horizontal and vertical gradients appear larger in the S-T region of the distorted frame, and the f_2(s, t) feature of the distorted frame will be larger than that of the original. On the other hand, blurring produces an f_2(s, t) value for the output that is less than that of the input. The f_2 feature thus provides a simple means of including variations in the sensitivity of the human visual system with respect to angular orientation. The gain or loss in the value of these features between the original and distorted image therefore reveals the magnitude of an existing perceptual distortion. Gain and loss are examined separately, since they produce fundamentally different effects on quality perception (e.g., loss of spatial activity is due to blurring, while gain of spatial activity is caused by added noise or blocking). Perceptual impairments of each S-T region, based on the gain or loss of a feature's value, are then calculated using specific functions that model visual masking: if an input feature (i.e., feature f_1 or f_2 of an S-T region in the reference frame) is denoted f_{in}(s, t) and the corresponding output feature (in the distorted frame) is denoted f_{out}(s, t), then for a given S-T region, gain and loss distortions are computed using

gain(s, t) = pp\{ \log_{10} [ f_{out}(s, t) / f_{in}(s, t) ] \}   and
loss(s, t) = np\{ [ f_{out}(s, t) - f_{in}(s, t) ] / f_{in}(s, t) \},

where pp and np are the positive and negative part operators (i.e., negative and positive values, respectively, are replaced by zero).

This impairment masking process yields three measures that the developers of the ITS VQM found to best capture the perceptual distortions in each S-T region: f_1^{loss}(s, t), f_1^{gain}(s, t) and f_2^{gain}(s, t). Next, error pooling functions across space and time emulate how humans deduce subjective ratings. First, the three measures of perceptual distortion are pooled (spatially collapsed) over the spatial index s (for example, there are 1584 such values for a CIF-size frame). Since the worst part of the picture is a predominant factor in the subjective quality decision (localised impairments tend to draw the attention of the viewer [92]), the spatial collapsing function involves some form of worst-case processing: it is computed at each temporal index t as the average of the worst 5% of the measured distortions. Finally, temporal collapsing is used to summarise the spatially collapsed measures over the duration of the clip (typically 8-10 seconds). Since no memory effects appear during the evaluation of such short clips, the average over the temporal index t is used; different temporal collapsing functions may be used for longer clips to account for the recency effect in quality evaluation. The weights of the resulting three measures in the joint value of perceived distortion are found using regression, based on a large collection of subjective data, to produce

joint = 0.38 \cdot f_1^{loss} + 0.39 \cdot f_1^{gain} - 0.23 \cdot f_2^{gain}.

The value of the joint perceived distortion measure lies between -1 (very high distortion, or lowest quality) and 0 (no distortion, or highest quality).

It can be observed at this point that, since the temporal pooling function is the average of the spatially collapsed measures f_1^{loss}(t), f_1^{gain}(t) and f_2^{gain}(t), one can safely postulate that the value

d(t) = 0.38 \cdot f_1^{loss}(t) + 0.39 \cdot f_1^{gain}(t) - 0.23 \cdot f_2^{gain}(t)    (2.1)

represents a good indication of the instantaneous perceived level of distortion (or, equivalently, the value q(t) = 1 + d(t) represents an indicator of the instantaneous quality) at every time instance t, with a sampling granularity equal to the temporal width of the S-T region, e.g., 6 frames, or 200 ms (for a 30 fps input video). This temporal duration of an S-T region is hereafter called the S-T period. This observation is exploited later in this thesis when a measure that relates to the continuous on-going perceived quality is required.

Since much of the work described in this thesis requires the - direct or indirect - application of an objective quality metric, the ITS VQM was implemented based on the above algorithmic guidelines and further details published in [92]. Some of the reasons for selecting this specific model are:

1. it appears to be one of the most accurate models, achieving high correlation with subjective scores according to the latest VQEG evaluation process [25], and has recently been recommended to the ITU [93]

2. it is designed to work equally well with sequences encoded at high and low bit rates

3. in contrast to other well-known quality metrics, there exists substantial published information, which facilitated the implementation of such a complex algorithm.

Notwithstanding the above arguments, the work in this thesis is not bound to this specific quality metric and can work with other objective quality metrics that can provide reliable quality scores. Finally, in order to represent quality ratings in a more comprehensible range, all objective quality scores q(t), which originally lie in the [0, 1] range, are scaled to the [0, 100] range, which coincides with the range of ratings of a Single Stimulus Continuous Quality Evaluation (SSCQE) scale [22].
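To make the pooling stages above concrete, the sketch below strings together the gain/loss operators, the worst-5% spatial collapsing, the weighted combination of equation (2.1), and the rescaling of q(t) to [0, 100]. It is an illustrative sketch only: per-region features are assumed precomputed, the input arrays are synthetic, and the calibrated model's exact thresholds and sign conventions [92] are not reproduced; here the pooled impairment magnitudes simply drive d(t) towards -1.

```python
import numpy as np

def pp(x: np.ndarray) -> np.ndarray:
    """Positive-part operator: negative values replaced by zero."""
    return np.maximum(x, 0.0)

def np_(x: np.ndarray) -> np.ndarray:
    """Negative-part operator: positive values replaced by zero."""
    return np.minimum(x, 0.0)

def spatial_collapse(impairments: np.ndarray, frac: float = 0.05) -> float:
    """Mean of the worst (largest) 5% of per-region impairment magnitudes."""
    k = max(1, int(frac * impairments.size))
    return float(np.sort(impairments)[-k:].mean())

def instantaneous_distortion(f1_in, f1_out, f2_in, f2_out) -> float:
    """Simplified d(t) for one S-T period; each argument has one entry per region."""
    f1_loss = spatial_collapse(np.abs(np_((f1_out - f1_in) / f1_in)))  # blurring
    f1_gain = spatial_collapse(pp(np.log10(f1_out / f1_in)))           # noise/blocks
    f2_gain = spatial_collapse(pp(np.log10(f2_out / f2_in)))           # added HV edges
    # weights quoted in the text; magnitudes accumulated into a distortion in [-1, 0]
    return -min(1.0, 0.38 * f1_loss + 0.39 * f1_gain + 0.23 * f2_gain)

rng = np.random.default_rng(1)
f1_in = rng.uniform(10, 40, 1584); f1_out = f1_in * rng.uniform(0.6, 1.1, 1584)
f2_in = rng.uniform(1.0, 2.0, 1584); f2_out = f2_in * rng.uniform(0.9, 1.5, 1584)
d_t = instantaneous_distortion(f1_in, f1_out, f2_in, f2_out)
q_t = 100.0 * np.clip(1.0 + d_t, 0.0, 1.0)     # SSCQE-like [0, 100] scale
print(f"d(t) = {d_t:.3f}, q(t) = {q_t:.1f}")
```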

2.8 IP video adaptation: from network friendly to media friendly

Signal quality measurements like PSNR work very well with analog and full-bandwidth digital video systems, because uncompressed or lightly compressed systems are practically linear, that is, the system behaviour is time invariant and signal independent. With the advent of compressed digital video systems, the situation becomes more complex, as the types of distortion that may occur are more subtle. In these cases, simple signal quality measures (PSNR, MSE) are not adequate and may give misleading results. Despite this fact, these metrics are still widely used for the design or optimisation of video source coding or source-channel coding techniques, due to their relative simplicity and convenience. This raises the question of how closely they reflect the quality experienced by the end-user. The recent development of new video quality metrics that can accurately measure perceived distortions in the video signal under dynamic characteristics of both the input video and the transmission system offers a reliable yet economical solution to the problem of video quality assessment. The emergence of these metrics (i) facilitates the performance evaluation of video system components, bypassing laborious, time-consuming subjective tests, and (ii) enables automatic quality monitoring of video systems in real-time. While objective models are primarily used in these two areas, they undoubtedly bring forth opportunities for exploiting them in the rate-control and adaptation process of IP video streaming, in a way that improves the viewing experience and quality. Therefore, the current model of adaptation, where a video stream is adapted to varying scene content and network conditions based on some internal reference of quality (usually, a rate-distortion model in the encoder), can be enhanced by also considering the perceptual impact of these adaptation decisions. Thus, this thesis advocates the - direct or indirect - use of objective quality metrics within the adaptation cycle of Internet video. It is steered by the fact that video quality is strongly determined by (i) the complexity of the video scene content and (ii) the transmission bandwidth or, equivalently, the encoding bit rate available to the video stream (under the assumption that other encoding parameters, like the picture size, or frame resolution, and the video compression algorithm, are kept unchanged). This observation enables the development of appropriate quality-aware adaptation policies or frameworks. The details of the video adaptation framework are subject to the specific nature and requirements of the application. The following chapters demonstrate the benefits of quality-aware video adaptation based on the ITS objective quality metric (section 2.7) by considering two different application scenarios of video streaming:

1. Inter-stream adaptation in IP sessions with multiple concurrent media flows

2. Smooth perceived quality adaptation for live, unicast, real-time video streaming

Inter-stream adaptation in IP sessions with multiple concurrent media flows

This part of the work relates to media sessions that transmit numerous simultaneous media flows as part of the application's usage scenario. It is motivated by the fact that joint congestion and rate control over a set of streams is far more efficient than independent adaptation, as has already been demonstrated in other types of transmission networks (broadcast TV). Unlike independent adaptation, joint or aggregate adaptation dynamically distributes the available bandwidth among video programmes according to their respective time-varying quality, hence allowing for more efficient multiplexing and achieving better overall session quality. While in traditional distribution networks the bandwidth over which the video programmes can be transmitted is known to the operators, this is not true for Internet transmission. On today's Internet, the approach is to transmit multiple streams that employ congestion control independently of each other, disallowing opportunities of inter-stream cooperation when performing adaptation. This work builds on recent proposals for integrated congestion management [29] that allow applications to apply their own bandwidth sharing policies to individual streams, while ensuring a network-friendly behaviour for the aggregate transmission. Chapter 3 presents a method to perform such an inter-stream session bandwidth sharing policy, which considers the time-varying contribution of each individual video stream to the overall session quality. The quality contribution of each stream is measured using appropriate < rate, perceived quality > mappings, built using the ITS quality metric. Suitable adaptation timescales, and how these are integrated within the adaptation life-cycle, are proposed, based on the scene content dynamics of the participating flows.

Smooth perceived quality adaptation for live, unicast, real-time video streaming

The second part of the thesis deals with the problem of providing smooth perceived quality adaptation for live, unicast, real-time video streaming. This work is motivated by important observations about the way human viewers rate and appreciate video quality: frequent quality oscillations are particularly annoying, and viewers tend to prefer a lower but more stable quality to frequent alternations of high and low quality [108]. The proposed system attempts to determine appropriate encoding rates so as to smooth short-term objective quality variations, while respecting the constraints imposed by the available bandwidth. The previous problem considers the streaming of stored video material, so the application of the ITS VQM does not have any impact on the real-time performance of the system. However, the requirement changes for live video streaming. Since real-time performance is necessary, the ITS VQM metric cannot be applied in-line, as it is computationally intensive. To tackle this, Chapter 4 introduces an artificial neural network predictor to yield accurate estimates of the continuous objective quality ratings in real-time. Based on the ability offered by the neural network to predict a rate-objective quality mapping, Chapter 5 presents a method of smoothing the on-going video quality that employs a fuzzy rate-quality controller. By continuously monitoring the recent quality values and the levels of the sender and receiver buffers, the fuzzy quality controller calculates an appropriate encoding rate (which may diverge from the allowed transmission rate) that eliminates annoying short-term quality fluctuations.

Chapter 3

This chapter examines the use of objective quality metrics to address the bandwidth allocation and adaptation problem for multiple, pre-recorded, concurrent media streams, by considering the quality of the individual media streams. This work is motivated by recent research in digital TV transmission which shows that joint adaptation of multiple video programmes is more efficient and effective than independent adaptation. Unlike independent adaptation, joint or aggregate adaptation dynamically distributes the available bandwidth among video programmes according to their respective time-varying quality, hence allowing for more efficient multiplexing and achieving better overall session quality. However, recent research tackles this issue either by assuming a constant channel capacity (as in digital video broadcast networks) or by adopting conventional quality metrics to perform rate allocation to each participating stream. On today's Internet, the approach is to transmit multiple streams that employ congestion control independently of each other, disallowing opportunities of inter-stream cooperation when performing adaptation. An inter-stream rate allocation method is therefore presented that accounts for the dynamic network conditions of a best-effort network and uses an objective quality metric to measure the dynamic quality contribution of each media flow. We show that quality improvement is possible if the unequal quality requirements among the participating video streams at any time instance can be exploited. This approach is made feasible by measuring (off-line) the varying quality of the participating video streams using realistic objective measures of perceived quality. This work also discusses suitable adaptation timescales based on the scene content dynamics of the individual flows and presents how these can be realised within the adaptation cycle.

3.1 Motivation

Traditional forms of multimedia networking, like media streaming or videoconferencing, are becoming commonplace in today's Internet. Whereas these applications involve the transmission of a limited number of media streams (an audiovisual stream, perhaps accompanied by graphics or animation), evolving applications are expected to offer a far richer multimedia experience. Such services will involve the transmission of a collection of (continuous or static) flows relevant to the application. In this scenario, static (text, graphics, etc.) and continuous (audio, video, animation, virtual worlds and avatars) data are blended together to create elevated interactive, collaborative or entertainment experiences. The rapid deployment of digital broadband networks (Asymmetric Digital Subscriber Line - ADSL - and perhaps fibre to the home in the future) to home users and the ever-increasing access speeds allow telecom operators and content service providers to offer integrated Internet access and IP-based TV entertainment and, therefore, to generate viable business models and revenue opportunities. Full-rate ADSL can exceed 8 Mbps at well over 3 km from the local exchange, which makes it a viable medium to carry multiple simultaneous 'entertainment-quality' digital video streams (TV, movies, sports), as well as music, games, Web data, etc., over a standard copper phone line. In the longer term, Very High Speed Digital Subscriber Line (VDSL) can potentially achieve 50 Mbps over short runs, such as in high-density housing areas. Sophisticated programme packages may arise by migrating respective services from broadcast digital TV, such as multi-view TV over IP, where the viewer could receive a number of concurrent streams on a desktop PC or a TV set with an IP set-top-box. Interactive IP TV will also let customers themselves create customised programme profiles based on their own personal viewing preferences (for example, TiVo, http://www.TiVo.com, is a product in this direction that mainly offers personalised video recording capabilities). For example, the broadcast of a sports event, like a football game, may transmit and display several video feeds coming from the various cameras in the stadium on the user's display as the game progresses (pitch action view, close-ups, player-follow camera, reactions from the bench or the spectators, etc.). Similarly, during the coverage of an F1 race, the viewer can simultaneously receive and display several video streams showing the action from different parts of the circuit. In such application scenarios, the group of streams belonging to the application has a highly dynamic nature. Streams may be started or stopped at any time, and the relative importance or priority of a stream with respect to the others varies over time as a result of the user's changing preferences. More importantly, the quality-bandwidth requirement of each stream also changes over time.

In applications where data, audio, video, or a set of visual objects are delivered simultaneously over the Internet, the standard practice is that the relevant streams are either transmitted using several independent TCP or congestion controlled (TCP-friendly) UDP flows, or media rates are aggregated, where the aggregate bit rate is less than or equal to the available channel bandwidth. In the former case, each individual flow is responsible for performing its own congestion control and adapting its transmission rate depending on network conditions. In this scenario, it is not possible to exercise any form of quality coordination. As a result, the participating video programmes will most probably exhibit unequal quality relative to each other, which is not very efficient with respect to the overall quality impression of the multi-stream session. In the latter case, collaborative rate control is employed and is typically manifested by assigning a fixed fraction of the total session rate to each media flow [109], often based on the relative importance of the flow or user preference. However, there are certain shortcomings with such independent allocation of bandwidth resources to the individual flows. Setting the rate of an individual flow as a fixed portion of the overall session bandwidth is an inefficient use of the available bandwidth. With respect to video, not all scenes within a programme (intra-scenes) exhibit the same complexity, hence they do not require the same bit rate to achieve good picture quality. For example, relatively static scenes require fewer bits than scenes with lots of motion or spatial activity. As a consequence, if the bit rate of a stream is set statically (i.e., it is either constant, or a fixed percentage of the total, time-varying session bandwidth), different scenes of a video programme may be rendered with unequal quality.

Instead of independent rate adaptation, joint or coordinated rate adaptation is considered more advantageous (e.g., [110, 111, 112, 113, 114, 29]). In this case, the method of calculating the aggregate bit rate of the session remains unchanged, but the bit rate of individual flows is allowed to vary according to their time-varying content complexity (and, consequently, quality). With coordinated rate control, bits from a less active media flow may be moved to more active ones, so that better picture quality is maintained and the overall quality of the multimedia session is improved. In this way, rate control is extended to an additional dimension, that is (in the case of video) between scenes of the different video streams (inter-scene), allowing more freedom in how the session bandwidth is allocated among streams. Furthermore, quality also depends on several other parameters of the video: the image resolution, the video codec used to encode the original video, etc. Given that, in a multi-stream application, the participating video flows would most probably have different resolutions and would be encoded using a variety of encoding engines, coordinated adaptation allows all these quality-affecting attributes of video to be accounted for, albeit indirectly, by considering a common denominator: the quality of the individual streams.

The problem of inter-stream rate-quality adaptation in a multiple-stream session can therefore be formulated as follows. Consider a networked multimedia application that consists of an ensemble of multiple concurrent media flows that belong to the same thematic presentation and that are transmitted from a media server to a single receiver. These flows exhibit time-varying intra- and inter-scene content variation and, therefore, time-varying intra- and inter-scene quality. Furthermore, these video flows may have different resolutions, or may be encoded using different video codecs. This work proposes a coordinated quality adaptation, where the bandwidth available to the ensemble is apportioned to the participating flows according to their dynamically varying quality and user preferences. Within this framework, a scheduling mechanism that allocates the session bandwidth to each individual flow should be devised, based on the relative contribution of each stream to the overall session quality. Although recent work has produced techniques for multi-stream rate control (cf. section 3.2), the true impact of multi-stream rate control on perceptual quality has not been considered. This work differs from previous approaches by measuring the quality contribution of individual flows using an objective video quality metric that quite accurately reflects the way human viewers assess quality. The objective quality metric used is the ITS video quality metric, described in section 2.7. The following sections demonstrate how this quality metric can be integrated in the inter-stream adaptation mechanism and present appropriate adaptation timescales. A number of trace-driven simulation results show the benefits of performing inter-stream adaptation based on the (dynamic) quality profiles of the individual flows.
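As a toy illustration of why quality-aware apportioning pays off, consider two streams with concave rate-quality curves and a session budget handed out in small increments, each going to the stream with the largest marginal quality gain. The curves and the greedy scheme below are assumptions for illustration only; in the thesis, the mappings are measured off-line with the ITS VQM, and the actual allocation mechanism is the one developed later in this chapter.

```python
import math

def allocate(curves, session_rate: float, step: float = 10.0):
    """curves: list of functions mapping rate (kbps) -> quality (0-100)."""
    rates = [0.0] * len(curves)
    budget = session_rate
    while budget >= step:
        # marginal quality gain of giving `step` more kbps to each stream
        gains = [c(r + step) - c(r) for c, r in zip(curves, rates)]
        best = max(range(len(curves)), key=gains.__getitem__)
        rates[best] += step
        budget -= step
    return rates

# Hypothetical concave rate-quality curves for an 'easy' and a 'demanding' scene.
easy      = lambda r: 100.0 * (1.0 - math.exp(-r / 150.0))
demanding = lambda r: 100.0 * (1.0 - math.exp(-r / 600.0))
print(allocate([easy, demanding], session_rate=1000.0))
# The demanding stream receives the larger share, instead of a fixed 50/50 split.
```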

While applications like those anticipated in this work are realised by a number of concurrent flows of various media (audio, video, etc.), multi-stream sessions consisting only of video streams are considered here, for two reasons. Firstly, the majority of users almost always prefer a constantly good audio quality over a varying video quality. Secondly, the audio part of a multimedia session usually requires relatively fewer network resources than video to operate with acceptable quality, thus making adaptation of the video components more worthwhile and interesting. (This might not always be the case: for example, high quality, multi-channel audio may have bit rates close to those of video. However, whereas multiple video streams can be presented on the user's display, there can only be one 'active' audio stream, which is usually preferred to have a constantly good quality.)

So far, no assumptions have been made on how the total bandwidth of a multi-stream session can be estimated. The next section discusses recent trends and proposals on this issue.


3.2 Related work

The concept of (direct or indirect) collective or collaborative manipulation of multiple data streams is widely used in bandwidth management. Proposed techniques are, in general, bound to the attributes of the transmission channel and have multifaceted objectives: QoS provisioning, efficient bandwidth sharing and better utilisation of the channel capacity, and/or quality improvement.

One such approach tackles the issue of QoS provisioning for multimedia streams with diverse and time-varying requirements by introducing QoS management inside the network, including resource reservation (IntServ) and flow prioritisation (DiffServ). Using these network QoS techniques, it becomes feasible to apply policies that discriminate between media flows based on their bandwidth requirements or relative importance to the application. Despite the large volume of research in this field, adoption of these proposals in the Internet has been hindered by the difficulties of large-scale deployment and by a lack of substantial market interest, and this is expected to remain the case for the foreseeable future. In the context of the discussion above, priority-based mechanisms require that different packets or flows are labelled with different priorities [115, 116]. However, the exact mechanism for dynamically mapping application-level priorities to packet or router priority levels can be cumbersome.

3.2.1 Joint rate control

Perhaps the area where the principles of collective or joint rate control have been best used and commercially exploited is digital TV broadcasting, e.g., Digital Broadcast Satellite (DBS) applications. Following the transition from analog to digital transmission, statistical multiplexing has been employed as a means of both increasing channel bandwidth utilisation and improving the end video quality. In typical multi-programme digital TV broadcasts, multiple live (such as news, sports or live shows) or pre-encoded (such as movies, commercials, etc.) video programmes are multiplexed onto a single constant bit rate channel. Early implementations used the simple approach where the available capacity was equally divided among all programmes, and each sequence was coded independently at a constant bit rate (CBR). For example, in a typical DBS scenario, four video programmes share equal portions of a 24 Mbps channel, so that each of the involved encoders compresses its video programme at 6 Mbps. As a consequence of CBR encoding, the resulting quality of a video programme is uneven, as content changes over time. Therefore, the bit rate for each programme has to be set to a high enough value to guarantee adequate picture quality, but this has been shown [117] to lead to poor utilisation of the channel (a satellite transmission environment still incurs high operational fees, and operators desire to squeeze as many programmes per MHz as possible without compromising quality). By allowing variable bit rate encoding and employing joint coding or joint rate control, a more uniform quality among the programmes and a better utilisation of the channel capacity are achieved. As a result, more programmes can be transmitted over the same channel [118, 119]. A multitude of work has been proposed to tackle this issue; the following paragraphs discuss the main concepts and ideas and their relation to this work.

The main principle behind joint coding is to allocate bits to every stream according to its content complexity [117, 119, 110, 111, 112, 113, 114]. Statistical measurements of video complexity can be gathered using a feedback or a look-ahead approach. In the feedback approach [120, 121], complexity statistics are gathered as a by-product of the compression process, while in the look-ahead method (e.g., [119]), video programmes are pre-processed prior to encoding. In the look-ahead case, more accurate bit rate predictions, in a rate-distortion sense, are achieved, at the expense of additional complexity and cost. For example, a two-stage encoding may be employed, where the first stage is a high quality encoding that extracts content complexity statistics from each stream and stores them for later use (as a consequence, this approach applies only to pre-recorded material). The second stage performs the joint rate control during transmission time and may involve a set of transcoders or transraters to reduce the rate of the streams to the appropriate ranges [122]. Since MPEG-2 video coding is used in the vast majority of these systems, the complexity C of a single frame or a group of pictures (GOP) is, in principle, obtained as a function of the encoding quantisation parameter and the number of bits generated, derived from the bit production model of MPEG-2 (e.g., MPEG-2 Test Model 5 [123]):

C_{l,n} = QP_{l,n} \cdot R_{l,n},    l = 1, 2, ..., L,

where QP_{l,n} is the (average) quantisation parameter used for frame (or GOP) n of programme l, and R_{l,n} is the number of bits generated for that frame (or GOP). The target bit rate for each frame (or GOP) of stream l is then allocated as a fraction of the channel bandwidth, that is,

targetrate_l = ( C_l / \sum_{i=1}^{L} C_i ) \cdot B,

where B is the CBR channel bandwidth. For all participating streams, the joint rate control chooses, among all possible quantisation parameter values, the one that minimises the difference between the total number of generated bits and the target channel rate. Usually, an equal quantiser is selected for all the video streams in this case, as the objective is to produce programmes with equivalent quality. To ease this selection process, it is usually reasonably assumed that frames (or GOPs) have the same complexity measure within continuous scenes. Using the techniques described above, it has been shown [121, 112, 122, 124] that more VBR MPEG-encoded sources can be multiplexed over the same channel capacity, and a more uniform picture quality among programmes can be achieved, than when CBR streams are used. However, all this work still employs signal-to-noise ratio metrics (PSNR or SNR) as the picture quality index. Furthermore, as these proposals are aimed at the transmission of video programmes over broadcast television networks, where a constant capacity channel is available, they are not directly applicable to applications that transmit their media streams over a packet-based, best-effort network.
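A minimal sketch of the complexity-proportional allocation rule above, with illustrative numbers (the C = QP \cdot R complexity model follows MPEG-2 TM5 [123]; the programme values are hypothetical):

```python
def target_rates(qp, bits, channel_bps: float):
    """qp[l], bits[l]: quantiser and generated bits of programme l's last frame/GOP."""
    complexity = [q * r for q, r in zip(qp, bits)]   # C = QP * R per programme
    total = sum(complexity)
    return [channel_bps * c / total for c in complexity]

# Four programmes on a 24 Mbps channel (cf. the DBS example): the low-complexity
# programme cedes bandwidth to the more demanding ones.
qp   = [20, 34, 28, 45]
bits = [300_000, 550_000, 400_000, 700_000]
print([f"{r / 1e6:.2f} Mbps" for r in target_rates(qp, bits, 24e6)])
```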

3.2.2 Integrated congestion control

Unlike the digital TV networks considered above, the Internet cannot provide a fixed channel capacity. It is therefore the responsibility of the application to 'learn' what bandwidth is available at any instance and also to respond to congestion. Recent research has proposed the idea of integrated or coordinated congestion control (e.g., [125, 126, 29]) as a remedy to new trends in traffic patterns that potentially undermine the stability of the Internet. Increasingly, the trend on the Internet is for unicast senders (like Web servers) to transmit multiple concurrent heterogeneous flows of data (ranging from text, applets and images to audio or video, etc.) to receivers using different transport protocols. These flows may share the same path from the sender to the receiver and need to employ adaptive congestion control in order to probe for bandwidth or react to congestion. However, these concurrent streams typically compete, rather than cooperate, with each other for network resources and, in several cases, refrain from implementing proper congestion control (i.e., they transmit at a constant rate throughout the duration of a session). Cooperation can help them to 'learn' from each other about the state of the network and jointly orchestrate congestion control. Even if each individual stream implements congestion control, it has been reported [127] that an aggregate of TCP flows is more aggressive when congestion occurs than a single TCP connection. The key observation that motivates integrated congestion control is that it is how much data, rather than what data, is transmitted that matters as far as the load and the state of congestion in the network are concerned. In other words, it is irrelevant from the viewpoint of congestion control whether the data being transferred between a pair of hosts is spread across multiple TCP connections or concentrated in a single connection; it only matters how much data is outstanding in the network. For this purpose, a sender can maintain a unified congestion window (or aggregate transmission rate) for the set of connections between a pair of hosts.

(RTT, packet loss rate, etc.) can be extracted from the set of active flows; the session rate can then be increased to probe for spare bandwidth or decreased as a reaction to congestion signals (e.g., a loss event in one of its constituent flows). A brief outline of the most prominent research in the area of integrated congestion control follows. Touch [125] proposed the sharing of some of the state in the TCP Control Block (TCB) to allow integrated congestion control over an ensemble of TCP flows and improve transport performance [128]. In the case of independent TCP streams, each stream keeps connection state information in an individual TCP structure, frequently called TCP Control Block (TCB). TCB state can be however aggregated and shared across ‘similar’^ concurrent connections, to enable ensemble sharing. Therefore the system may maintain a unified round trip time and conges­ tion window size for the set of connections and in this way, determine an aggregate congestion window for all TCP flows of the ensemble. Padmanabhan [129, 126] proposed TCP-Session, a method of providing coordinated congestion control to an ensemble of TCP flows. The ideas and principles are similar to Touch’s TCB independence; in other words, the dynamics of the session congestion control are much like that of a single connection using standard TCP. Ex­ perimental results quantify the benefits of sharing congestion state for improved stability and better loss recovery. TheCongestion Manager (CM) [29,130,131] builds on the above work to provide an inte­ grated framework that allows applications to perform integrated congestion control among their participating flows. Furthermore, recognising the danger from applications not implementing proper congestion control, its proposers extend congestion management to non-TCP flows as well, by providing applications with a common ground for transparent adaptation from an end- to-end perspective. The CM handles all these relevant flows as an aggregate flow, termed a macroflow, and congestion control is applied to the macroflow. The server’s total session rate remains the same as if the session was treated as a single TCP-friendly session and it is up to the application to apportion it to its individual flows according to an appropriate schedul­ ing policy. The CM features a TCP-friendly traffic shaper coupled with a lightweight proto­ col to elicit receivers’ feedback and a hierarchical round-robin scheduler for flow transmission scheduling. This scheduler is effectively apportioning the nominal session bandwidth among flows in proportion to pre-configured weights that may correspond to each flow’s importance or are determined based on receiver preference hints. It is shown later in this chapter that this does not always provide the best aggregate perceived quality and that a scheduler based on the quality-over-time of each flow is preferable.

^We discuss how the system can identify ‘similar’ connections later in this section.

A by-product of integrated congestion management, relevant to this work, is that it enables efficient multiplexing of concurrent flows. While it is the congestion manager’s responsibility to determine when and how much data can be transmitted, it allows the application to decide what to transmit, i.e., which of the flows should get the opportunity to transmit. Flow discrimination within the session is therefore possible: the application can schedule more data from a demanding flow and less from an ‘easy-to-encode’ flow, in order to improve the session’s utility. Such an option is not feasible if application data is transmitted over independent (TCP or TCP-friendly) streams.

There are, however, several issues with integrated congestion management that should be considered. Sharing network information between data streams in order to perform integrated control can only happen among streams that use the same network path or, at least, share the same bottleneck link(s). Hence, a common congestion control approach is only reasonable between data streams of an end system which have the same receiver, or at least receivers in the same part of the network. These data streams form a (data stream) ensemble. The algorithms that determine which, how, and when network information is shared among the different streams of an ensemble form the actual controller of a common congestion control approach, which is transparent to the application [29]. Although aggregating the flows transmitted between the same pair of hosts seems the logical sharing policy, it may sometimes be problematic: different flows of a session could be routed over different paths (path diversity), invalidating the assumption of a common network path. Network address translation (NAT) boxes or QoS networks further complicate the situation [132]. Certain techniques have been proposed for identifying such false sharing [133]; they are based on testing the correlation of the delay and loss experienced by the flows and stem from the assumption that if two flows experience correlated patterns of delay and loss, then it is very probable that they share the same congested resources. Finally, a coordinated congestion management architecture should also determine the granularity of sharing when it aggregates connections. Several levels of granularity may be considered: the application level (all flows of an individual application), the user level (all application flows of an individual user) or the host level (all flows generated by a host computer). There is no single accepted answer to this, as it is also related to the still-debated issue of the granularity of fairness in IP networks.
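As an illustration of the correlation idea behind such false-sharing tests (the detectors of [133] are more elaborate), two flows whose delay samples over common intervals are strongly correlated can be presumed to share a congested resource. A minimal sketch, with an illustrative threshold:

```python
import numpy as np

def likely_share_bottleneck(delays_a, delays_b, threshold=0.6):
    """Heuristic shared-bottleneck test: flows whose delay samples are
    strongly correlated probably traverse the same congested link.
    `delays_a` and `delays_b` are equal-length arrays of delay samples
    taken over the same time intervals; `threshold` is illustrative."""
    a = np.asarray(delays_a, dtype=float)
    b = np.asarray(delays_b, dtype=float)
    r = np.corrcoef(a, b)[0, 1]   # Pearson correlation of the two series
    return r >= threshold
```

The same test can be applied to per-interval loss counts; uncorrelated series suggest the flows should not be aggregated into one ensemble.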

3.3 Timescales of inter-stream adaptation

The previous section discussed how the application acquires knowledge of the aggregate bandwidth that is available to its flows by using some form of integrated congestion management. This work assumes that the participating video streams are pre-encoded using a layered encoding scheme: with layered or hierarchical encoding, the original signal is split into a number of cumulative layers of increasing quality (and bandwidth), and the sender transmits as many layers as can be accommodated by the given network bandwidth. In a multi-stream transmission scenario, rate adaptation amounts to transmitting as many of a flow's layers as possible without their cumulative rate exceeding the rate designated to that stream. Therefore, for each stream $i$ encoded at $L$ layers, there is an allowable operating range, $R_i^1, \ldots, R_i^L$, defined by the cumulative bit rates that the stream is encoded with.

The objective of the resource allocation mechanism is to maximise the overall quality of the session under the given aggregate bit rate constraint. In other words, the problem may be restated as allocating the total bandwidth of the session, based on the time-varying quality of the individual video programmes, by selecting a suitable number of layers to transmit for each stream. An important issue that arises is that of determining an efficient timescale over which the quality of each video stream is measured, which also affects the frequency at which adaptation takes place. From the perspective of the overall session quality, changing the layers allocated to a stream is justified only when there is a considerable quality change (increase or decrease) in any of its participating flows.

This precondition (a step change in quality) usually manifests when there is a significant change in the content activity of any of the transmitted video programmes. This predominantly occurs when there is a scene change in the video content; therefore, video scene boundaries are proposed here as suitable adaptation points. A video scene is typically perceived as the number of consecutive frames within a video sequence that exhibit a more or less uniform content activity. Scenes are the elementary building blocks of video content. Within a video scene or shot, the camera action may be fixed or may exhibit a number of relatively uniform changes, like panning, zooming, tracking, etc. Shot transitions, or cuts, can be identified as abrupt or gradual transitions of the camera action (usually due to the video editing process). Partitioning a video sequence into shots is the first step toward video-content analysis and content-based video browsing and retrieval. Video shots are therefore considered to be the primitives for higher-level content analysis, indexing, and classification. Detection of shot boundaries provides a base for all video abstraction and segmentation approaches [134]. Efficient techniques of scene boundary identification have been extensively investigated, e.g., pixel differences, histograms, motion vector and block matching techniques [135, 136]. Of special interest are those that operate on the compressed sequence, as they are less computationally intensive and can operate in real-time [137, 138, 139]. These methods have been shown to produce quite accurate detection of scene cuts.

[Plot: objective quality per 6-frame interval, over roughly 1500 frames; panel title ‘Inter- vs. Intra-stream quality variation’.]

Figure 3.1: Variation of instantaneous quality within the scene boundaries and between different scenes. The vertical lines indicate scene cuts in the video content.

The high degree of content correlation within the boundaries of a video shot has also led to the development of scene-based traffic models for compressed video [140, 141, 142].
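As an illustration of the histogram family of detectors cited above (a sketch of the general idea, not of any specific algorithm from [135, 136]), the following declares a cut when the normalised histogram distance between consecutive greyscale frames exceeds a threshold; the bin count and threshold are illustrative:

```python
import numpy as np

def detect_scene_cuts(frames, threshold=0.4):
    """Toy histogram-difference shot-boundary detector. `frames` is an
    iterable of greyscale frames (2-D uint8 arrays); a cut is declared
    at frame idx when the histogram distance to the previous frame
    exceeds `threshold` (a value in [0, 1])."""
    cuts, prev_hist = [], None
    for idx, frame in enumerate(frames):
        hist, _ = np.histogram(frame, bins=64, range=(0, 256))
        hist = hist / hist.sum()                             # normalise
        if prev_hist is not None:
            distance = 0.5 * np.abs(hist - prev_hist).sum()  # total variation
            if distance > threshold:
                cuts.append(idx)
        prev_hist = hist
    return cuts
```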

From a quality perspective, it is expected that, provided the encoding bit rate remains the same, the level of instantaneous quality will not change considerably within a scene; it is when a new scene occurs that there is a change in quality, especially if the content features (spatial and motion energy) change significantly between subsequent scenes. As a consequence, there will also be a difference in the corresponding (objectively acquired) quality scores for those successive scenes, which would justify a re-scheduling of the transmission bit rates (or layers) of the participating flows. For example, observe Figure 3.1, which depicts objective measurements of quality over successive 6-frame periods for a 1500-frame excerpt from the action movie The Matrix, encoded using an H.263 video codec at 400 Kbps (the quality measurements were obtained using our implementation of the ITS VQM). The plot shows that the quality remains fairly similar within the boundaries of a scene, whereas changes of quality occur as new scenes emerge. The vertical lines mark the occurrence of a scene cut in the original content. For this reason, choosing the scene duration as the adaptation timescale is a suitable approach; the inter-stream allocation scheduler therefore operates when a scene cut occurs in any of the participating flows. A single quality score is calculated and assigned to each successive scene by applying the ITS VQM. This emulates, to a certain degree, the process of a user continuously evaluating the perceived quality, for example by means of a side slider: the user moves the slider up or down whenever a noticeable change in quality occurs.
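The per-scene scoring can be pictured with the small sketch below; note that simple averaging of the 6-frame period scores is an assumption made here for illustration, whereas the scene score in this work is obtained by applying the ITS VQM to the scene itself:

```python
import numpy as np

def scene_quality(st_scores, scene_bounds):
    """Collapse 6-frame S-T period scores into one score per scene.
    `st_scores[p]` is the score of period p (frames 6p..6p+5);
    `scene_bounds` is a list of (start_frame, end_frame) pairs."""
    per_scene = []
    for start, end in scene_bounds:
        lo = start // 6
        hi = max(lo + 1, end // 6)      # at least one period per scene
        per_scene.append(float(np.mean(st_scores[lo:hi])))
    return per_scene
```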

[Figure: architecture diagram. A media server hosts the scene detection module, the objective quality assessment module (OQAM) with the video content quality profiles, the inter-stream adaptation module (ISAM) and a sender-side congestion manager; flows 1..n are carried as a macroflow to the receiver, whose congestion manager returns congestion information and whose decoders/players render the streams; user preferences are signalled back to the server.]

Figure 3.2: Model schema of content-based inter-stream adaptation architecture.

Since the objective quality does not vary considerably within a scene, the user's perception of quality is likely to remain unchanged, but a viewer's judgement of quality will change if a new scene appears that exhibits a different level of distortion. Characteristic durations of video shots range from 2 to 10 sec or more in typical video sequences (news, sports, movies, etc.). In several cases, however, the scene duration may be much shorter (e.g., TV commercials or video clips where rapid scene alternations are observed) or much longer (e.g., a TV interview). In the case of many successive short scenes, the short scenes are aggregated into a longer one and quality assessment is applied to that longer scene. For longer scenes, the approach followed was to break them down into a number of consecutive scenes of duration 8-10 sec.
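A minimal sketch of this aggregation and splitting rule, assuming frame-indexed cut positions; the frame rate, thresholds and tie-breaking choices are illustrative:

```python
def normalise_scenes(boundaries, fps=25, min_dur=2.0, max_dur=10.0):
    """Merge runs of very short scenes and split overly long ones into
    roughly max_dur-second segments. `boundaries` is a sorted list of
    cut frame indices that includes the first and last frame."""
    min_f, max_f = int(min_dur * fps), int(max_dur * fps)
    segments, start = [], boundaries[0]
    for end in boundaries[1:]:
        if end - start < min_f and end != boundaries[-1]:
            continue                      # too short: merge with next scene
        while end - start > max_f:        # too long: split into chunks
            segments.append((start, start + max_f))
            start += max_f
        segments.append((start, end))
        start = end
    return segments
```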

3.4 Content-aware quality adaptation model

Figure 3.2 depicts the main components of the proposed quality-aware inter-stream rate adaptation architecture. The scene detection module identifies the occurrence of a new video scene in any of the participating video flows and informs the objective quality assessment module (OQAM), which retrieves the objective perceptual quality profile (described in the next paragraph) for the new scene. A companion congestion control manager monitors the state of the network and determines the nominal share of bandwidth for the session. The congestion control manager can be one of the proposals discussed in section 3.2.2, or any alternative that employs similar principles, i.e., one that provides an estimate of the aggregate session bandwidth. Based on the quality profiles obtained by the OQAM, the inter-stream adaptation module (ISAM) apportions the session bandwidth by dictating the transmission rate (number of layers) of each flow. The client application may also signal to the server the user-level preferences for the participating flows.

Assume that the application session consists of $N$ layered streams. Each stream possesses a quality profile, which is a set of operating qualities for the stream's layers. Each operating quality point is mapped to the corresponding resource, generating a quality-resource mapping; the resource in this case is the bandwidth required to provide that quality (the cumulative layer bit rate). The main problem that arises in a multi-stream session is to locate the operating quality for each individual media flow so that the overall system utility is maximised under the system's bandwidth constraints. How the overall utility of the system is quantified as a function of the quality of the individual flows in the ensemble is not straightforward and depends on several aspects of the system under consideration, such as the user's visual attention at every instant in time, the specific application scenario, etc. A reasonable interpretation of the overall session utility is the (weighted) sum of the qualities of the individual streams. For each stream, a quality profile exists in the form of $L$ tuples $\langle s, l, Q(s, l) \rangle$, where $s$ is the scene index in the video sequence, defined by a start and finish frame number, $l = 1, \ldots, L$ is the layer index, and $Q$ is the corresponding, objectively obtained, quality score for $l$ layers of scene $s$. User preference for each media stream is represented by assigning a preference weight $w_i$ to each corresponding flow $i$ of the application, where $\sum_i w_i = 1$. Denote by $B$ the aggregate time-varying session bandwidth, provided by a companion congestion manager. The aim of the allocation can therefore be formulated as:

\[
\text{maximise the total session quality:} \quad \sum_{i=1}^{N} w_i \, Q_i(c_i) \qquad (3.1)
\]

\[
\text{subject to:} \quad \sum_{i=1}^{N} R_i(c_i) \le B,
\]

where $c_i \in \{1, \ldots, L\}$ denotes the number of layers allocated to stream $i$, and $R_i(c_i)$ is the cumulative bandwidth requirement for $c_i$ layers of stream $i$. Because there are discrete operating points (the layers of each stream), this problem is an instance of the multiple-choice knapsack problem (MCKP), which can be stated as follows. Assume $N$ groups of items, where each group $i$ has $k$ items, and a knapsack of size $R$. Each item $j$ of group $i$ has a size $r_{ij}$ and a value $u_{ij}$. The problem is to select exactly one item from each group (a pick) such that the sum of the item values of the pick (the solution value) is maximised and the total size of the pick does not exceed the knapsack size:

\[
\text{maximise} \quad \sum_{i=1}^{N} \sum_{j=1}^{k} x_{ij} u_{ij}
\]

\[
\text{subject to} \quad \sum_{i=1}^{N} \sum_{j=1}^{k} x_{ij} r_{ij} \le R,
\]

\[
\text{where} \quad \sum_{j=1}^{k} x_{ij} = 1; \quad x_{ij} \in \{0, 1\}; \quad i = 1, \ldots, N, \; j = 1, \ldots, k.
\]

A pick denotes the selection of $N$ items, one from each group. In each group, only one $x_{ij}$ is 1 (picked), and the rest are 0. With regard to our problem, the items are the discrete operating points (layers) of each stream, where an item's size and value are the cumulative bit rate and the corresponding (weighted) quality score, respectively. Each group corresponds to a layered video stream. The MCKP is an NP-hard problem but, fortunately, because such applications involve a relatively small number of groups (flows) and items (cumulative layers), it can be solved by performing an exhaustive search on all $(R_i, Q_i)$ pairs without incurring any significant computational burden; a sketch of such a search closes this section. (When this number is high, i.e., when the number of participating flows is high, a greedy, near-optimal solution can be used instead, as presented in [143].) The solution provides the number of layers that should be transmitted for each video flow of the ensemble, so that the overall quality of the session is maximised and the bandwidth constraint is met. Based on the above discussion, there are three conditions that might trigger a change in the allocation of bandwidth (or layers) among the individual streams of the ensemble:

1. A scene cut (start of a new scene) is detected in any of the participating flows.

2. The nominal bandwidth available to the session is less than the total bandwidth consumption of all the currently transmitted layers of the flows. In this case, some layers need to be dropped, and the inter-stream adaptation will pick those that minimise the effect on total quality.

3. The bandwidth available to the session is enough to add new layers to (at least one of) the participating flows.

Obviously, (2) and (3) above follow an aggressive approach to dropping and adding layers that may result in quality fluctuations for the participating flows. This can, however, be tackled by augmenting the system with more conservative layer add and drop mechanisms and receiver buffering, as proposed in the literature (e.g., [144, 145]), but this is outside the scope of this work.
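The exhaustive search referred to above can be sketched as follows; the function name and the representation of quality profiles as lists of (cumulative rate, quality) pairs are illustrative assumptions, not the thesis implementation:

```python
from itertools import product

def allocate_layers(profiles, weights, budget):
    """Exhaustive MCKP search over per-stream operating points.
    `profiles[i]` is a list of (cumulative_rate, quality) tuples, one
    per layer count, for stream i; `weights[i]` is the preference
    weight w_i; `budget` is the aggregate session bandwidth B.
    Returns per-stream layer counts maximising weighted quality."""
    best_pick, best_quality = None, float('-inf')
    for pick in product(*(range(len(p)) for p in profiles)):
        rate = sum(profiles[i][c][0] for i, c in enumerate(pick))
        if rate > budget:
            continue                      # infeasible pick
        quality = sum(w * profiles[i][c][1]
                      for i, (c, w) in enumerate(zip(pick, weights)))
        if quality > best_quality:
            best_pick, best_quality = pick, quality
    # convert 0-based item indices to numbers of transmitted layers
    return None if best_pick is None else [c + 1 for c in best_pick]
```

For four flows with at most seven layers each, this enumerates at most 7^4 = 2401 picks per adaptation event, which supports the claim that exhaustive search is affordable at this scale.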

3.5 Experimental results

The proposed inter-stream rate allocation approach was tested through a series of trace-driven simulations. In order to generate various source content dynamics between the participating flows of the simulated multi-stream application, a number of different video sequences were selected, which cover a reasonably broad range of video content: news, TV commercials, sports, etc., and feature different levels of camera operations and spatio-temporal activity. The video clips used in the experiments are shown in Table 3.1 (descriptions for most of these clips are presented in Appendix A). The simulated application scenario involved the transmission of four concurrent video streams between the source and the destination. Two video multi-stream programme transmission scenarios were constructed.

1. In Scenario A, three of the video streams were constructed by repeatedly selecting at random, for the duration of the simulation, one of the clips labelled 1 to 8 in Table 3.1, while the fourth stream was clip 9.

2. In the second scenario (Scenario B), three video streams repeatedly transmitted a randomly selected one of the clips labelled Football A, B and C in Table 3.1, while the fourth randomly transmitted clips 1 to 9. Each of the three Football clips was a 5000-frame excerpt from an English Premiership football match. This scenario represents a realistic multiple-flow video streaming application for a sports event.

The number of scenes for each of the clips is also shown in Table 3.1. Without loss of generality, the flow preference weights in equation 3.1 were set to be equal for each stream (i.e., $w_i = 0.25$, $i = 1, \ldots, 4$). All video sequences were encoded at up to six or seven layers, with cumulative bit rates from 64 Kbps up to 2 Mbps (see Table 3.1), using an H.263+ layered codec developed at British Telecom Labs [146]. The corresponding original and encoded versions of these clips were then processed by the ITS quality metric to produce one quality score per layer for each individual scene of the selected clips. Recall that a quality score is a real number between 0 and 100, where 0 represents a completely unacceptable quality and 100 corresponds to undistorted video. Using these quality scores, the quality profiles for each scene of the clips were then created. Instead of using a scene cut detection algorithm, the sequences were manually processed in order to obtain a precise detection of shot boundaries. The proposed technique is compared with two prominent methods of inter-stream session bandwidth allocation described in section 3.2:

[Figure: dumbbell topology. A video source and N ON/OFF CBR sources feed a 15 Mbps, 20 ms bottleneck link between routers R1 and R2, leading to the video receiver and destinations 1..N.]

Figure 3.3: The network topology used in the simulations.

Table 3.1: Video sequences used in simulations.

Sequence                Resolution  Frames  No of scenes  Cumulative bandwidth (Kbps)
1. Claire               CIF         495     1             64, 128, 256, 512, 768, 1024, 1536
2. Jack&Box             CIF         145                   128, 256, 512, 768, 1024, 1536, 2048
3. Miss America         CIF         150                   64, 128, 256, 512, 768, 1024, 1536
4. Canoa valsesia       CIF         220                   128, 256, 512, 768, 1024, 1536, 2048
5. Mobile & Calendar    CIF         125                   128, 256, 512, 768, 1024, 1536, 2048
6. F1 car               CIF         220                   128, 256, 512, 768, 1024, 1536, 2048
7. Akiyo                CIF         220                   64, 128, 256, 512, 768, 1024, 1536
8. Rugby                CIF         220                   128, 256, 512, 768, 1024, 1536, 2048
9. Why (TV commercial)  QCIF        1820    53            64, 128, 256, 512, 768, 1024, 1536
10. Football A          CIF         5000    27            100, 250, 500, 800, 1200, 1800
11. Football B          CIF         5000    36            100, 250, 500, 800, 1200, 1800
12. Football C          CIF         5000    29            100, 250, 500, 800, 1200, 1800

1. A proportional bandwidth allocation scheme where, within the session, each flow is allocated a transmission rate in proportion to its (user) preference weight^. A similar approach is followed in the Congestion Manager [29], which utilises a simple Hierarchical Round Robin scheduler to allocate bandwidth among the constituent flows in proportion to pre-configured weights.

2. A transmission scenario where the application's video streams are transmitted over independent TCP-friendly connections. In this case, the corresponding flows are individually responsible for performing congestion control and adapting their transmission rate (or number of transmitted video layers) so that it fits within the available bandwidth envelope. This is the current practice of transmission for multi-stream applications.

^No assumptions are made here of how the preference weights might be set; they may be pre-defined or dynamically extracted from the user based on receiver application hints or cues.

The ns-2 network simulator [147] was used to simulate the transmission of the video streams. The simulated network had a typical dumbbell topology (Figure 3.3) and the video flows of interest were transmitted from a source to a destination node located behind a single bottleneck link with a capacity of 15 Mbps and a propagation delay of 20 ms. The bottleneck routers had a drop-tail queue management policy, with the queue size set to the bandwidth-delay product. All other links were sufficiently provisioned so that the bottleneck was always the link between R1 and R2 in Figure 3.3. To simulate variations in the level of contention for network bandwidth, the bottleneck link was also shared by a number of background CBR ON/OFF flows, whose ON and OFF times were drawn from a heavy-tail (Pareto) distribution with mean values of one and two seconds, respectively [148]. The CBR rate of the ON/OFF sources was 300 Kbps and their start times were randomly chosen from the [0, 2] sec range. The transmission of the application video streams commenced 5 sec after the start of the simulation. In order to simulate different levels of resource availability for the multi-stream session, several experiments were carried out in which the number of background connections was varied between thirty and seventy.

For the simulation experiments where the ensemble of streams is treated as a single flow from the perspective of congestion control, the TCP-Friendly Rate Control protocol (TFRC) [19] was used to determine the session's nominal transmission rate. This means that a single TFRC flow was created between the sender and receiver pair and its estimate of available bandwidth was then shared among the participating flows using the proposed quality-aware method (as well as the proportionally-weighted alternative). While this does not constitute a fully operable integrated congestion management framework (like the Congestion Manager), it suffices for the purposes of experimental evaluation, as it provides the functionality required by the proposed model (i.e., a continuous estimate of the aggregate available bandwidth). For the independent transmission scenario, four individual TFRC flows were used to simulate transmission of the four streams of the application. All simulations ran for a duration of 600 sec and the graphs that follow show the result of averaging over ten simulation runs.

Figure 3.4 shows, for the first 200 sec of the simulation, the total session quality (Eq. 3.1) that the proposed method achieved in comparison to that obtained when the session bandwidth is allocated in proportion to each flow's preference weight^. Graphs are shown for both multi-stream application programme scenarios (A and B) described above. In both cases, the proposed method results in improved aggregate session quality. The benefits of the quality-aware inter-stream adaptation technique are more evident in Figure 3.5, which plots the average percentage quality gain this method achieved over a proportional allocation policy. The quality gain value was calculated by averaging over the duration of the simulated transmission at different levels of network load (number of background ON/OFF flows). The error bars extend to the 5th and 95th percentiles of the observed values of session quality.

^The number of background sources in this experiment was sixty.

Figure 3.4: Total session quality of the proposed method in comparison to the proportional allocation method for the first 200 sec of the simulation.

These results show that there is a consistent improvement in session quality when the transmission of each flow is scheduled based on its content and quality dynamics, as opposed to a proportional scheduling based solely on relative priority or user preference. The comparative quality gains appear to increase as network load increases. This can be explained by the fact that at lower network loads more bandwidth is available and therefore more stream layers can be transmitted. Given that quality saturates at higher encoding rates, further increasing the encoding rate does not yield a significant increase in quality; a different allocation of layers between the flows (in comparison to the proportional alternative) therefore offers only marginal gains. This changes, however, when the bandwidth can only accommodate a few layers per flow: a different distribution of layers among the streams based on their quality contribution has a greater impact on the total quality. Figure 3.5 shows that slightly higher gains are achieved with multi-stream programme scenario A than with scenario B. This is explained by the fact that the video sequences of scenario A exhibit higher content unevenness among them, whereas those of scenario B exhibit less inter-stream content (and thus quality) variability.

In turn, Figure 3.6 (left) compares the session quality (for the first 200 sec of the simulation) achieved by the proposed method against the case where the application streams are transmitted using independent TCP-friendly flows^. Inspecting this graph, it seems that, on average, there is no apparent difference between the aggregate quality of the four independent flows and that of the proposed method. However, note that the aggregate throughput that the independent TCP-friendly flows attain, shown in the right plot of Figure 3.6, is significantly higher than that achieved by the single congestion controlled flow used to transport all application streams.

^The graph corresponds to simulation runs with sixty background ON/OFF sources.

Figure 3.5: Average session quality percentage gain of the proposed method in comparison to a proportional allocation, at different levels of network load.

This difference in throughput is expected because, as reported in [29], the integrated flows behave as one from the perspective of congestion control, while the independent TCP-friendly connections behave more aggressively under the same network conditions (the increase and decrease coefficients for the aggregate of independent connections are larger than for a single connection). The advantages of quality-aware integrated congestion control are more clearly shown in Figure 3.7, which plots the average session quality gain percentage of the proposed method over independent flow transmission, over the duration of the simulation, at different network load conditions. The difference (as a percentage) between the average throughput of the ensemble flow and the average aggregate throughput of the four TCP-friendly connections is also drawn. Although the aggregate throughput of the independent flows is significantly higher (indicated by negative gain values), it does not generate any worthwhile quality gain; in many cases, the comparison rather favours the quality-aware mechanism. This exemplifies the usefulness of integrated congestion control in offering a mechanism for efficient multiplexing of media flows and vindicates the benefits of accounting for the content dynamics and quality of the individual streams when multiplexing several application flows over a single congestion controlled connection.

3.5.1 Effect on quality smoothness

Layered encoding provides coarse-grain control of video quality: the original signal is split into cumulative layers and the sender transmits a subset of the layers so that the resource constraint is met. As the bit rate available to the session is subject to significant variations, the number of layers that are transmitted inevitably changes; this leads to quality fluctuations when layers are added or dropped. When layers are added or dropped frequently, the fluctuations in quality may become disturbing to the human viewer. As the proposed quality-aware adaptation mechanism involves an extra condition for a re-assignment of the active stream layers of the session (when a scene cut occurs), it is interesting to investigate its effect on the quality smoothness of the individual flows.

Figure 3.6: Left: Session quality of the proposed method in comparison to the aggregate quality of the four independent TCP-friendly streams. Right: Throughput of the ensemble flow in comparison to the aggregate throughput of the independent TCP-friendly streams.

Figure 3.7: Percentage of average session quality gain and throughput of the proposed method in comparison to the aggregate quality and throughput of the independent TCP-friendly streams, at different levels of network load.

Figure 3.8 shows the coefficient of variation (CoV) of the number of transmitted layers for each of the four streams of the session, for the experiments described above. The top four graphs show the CoV of transmitted layers for the four streams of scenario A, and the bottom four the layer CoV of the scenario B streams. Results are presented for the three methods examined in this chapter, at different levels of network load. These graphs show that the extra triggering condition for inter-stream adaptation (a scene cut appearing in any of the participating flows) increases the number of layer changes for the content-aware mechanism, illustrated by higher CoV values for the quality-aware method. However, although it is believed that frequent oscillations of the active number of layers of a stream may generate fluctuations in the end-user quality, the frequency of layer changes is not necessarily a truthful measure of quality smoothness (or variation) in a video stream.
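For reference, the CoV plotted in Figures 3.8 and 3.9 is simply the standard deviation normalised by the mean:

```python
import numpy as np

def coefficient_of_variation(samples):
    """CoV = standard deviation / mean, applied here to per-scene
    quality scores or to active layer counts over a simulation run."""
    x = np.asarray(samples, dtype=float)
    return x.std() / x.mean()
```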

A more realistic quality smoothness metric is one that is related to the objective quality of the flow, a representation of the actual perceived quality. Figure 3.9 depicts the coefficient of variation of each stream's objective quality over different network loads. The top four graphs refer to the four streams of programme transmission scenario A and the bottom four to those of scenario B. In contrast to what was conveyed by Figure 3.8, the fluctuation of quality for all the streams within the quality-aware session is not at all inferior in comparison to the other two methods of scheduling examined (proportional apportioning and independent transmission). Although the proposed technique engages in more frequent switching of layers, it does not adversely affect the quality stability of the individual streams. The CoV of quality tends to increase as the level of contention for network resources increases (or, equivalently, as the bandwidth that is available to the session or the individual flows decreases). When the number of transmitted layers for a flow is low, adding or dropping a layer considerably changes its quality, which translates to higher CoV values. On the other hand, adding or dropping the high-order layers of a stream does not generate a considerable change in quality since, as discussed earlier, quality saturates beyond a certain range of encoding rates. This initial observation on quality smoothness and its implications will be investigated in future work.

Figure 3.8: Coefficient of variation of the active number of layers for each individual stream (top four panels: scenario A; bottom four: scenario B).

Figure 3.9: Coefficient of variation of quality for each individual stream (top four panels: scenario A; bottom four: scenario B).

3.6 Chapter summary and discussion

The proliferation of recent IP networking technologies offers opportunities to deliver high data rates to large numbers of home users. Higher available bandwidth gives the opportunity to support far richer multimedia applications. Whereas the current Internet multimedia experience is, for the majority of users, restricted to streaming of pre-recorded or real-time audio-visual content, new services could comprise a larger number of media data streams to create an elevated, personalised, collaborative or entertainment experience. Such applications will involve the transmission of numerous media streams from a source to a destination. While the flows of a multi-stream application would normally perform congestion control independently of each other, recent proposals on integrated congestion control [125, 29, 126, 130] outline the benefits of performing collective congestion control, by allowing streams to cooperate rather than compete for common network resources. This collaborative congestion control environment allows for greater flexibility in the multiplexing of concurrent flows, giving the application the freedom to schedule data from its constituent flows according to their respective utility or importance.

Using the principles of integrated congestion management as a foundation, a method of dynamically apportioning the session bandwidth that accounts for the qualities of the participating flows was introduced. The method works on video streams encoded at a number of cumulative layers (layered streams). Based on the assumption that within a range of frames that display uniform content activity (a video scene) the quality of an encoded video layer does not change significantly, quality profiles (mappings of resource to quality) are constructed at each discrete operating point, or layer. When a condition of adaptation occurs (the total transmission rate exceeds the nominal session bandwidth, there is enough bandwidth to accommodate extra layers, or a scene change is detected in any of the streams), the system re-assigns the active layers of each stream, based on their contribution to the total quality. The results presented show the benefits of this approach in terms of improved overall session quality in comparison to a scheduling policy that apportions bandwidth in proportion to fixed weights that reflect the importance (priority) of each stream to the application or the user. Partially attributed to the benefits of integrated congestion control, the proposed mechanism also exhibits significantly better utilisation of bandwidth in comparison to an application whose flows are transmitted as independent congestion controlled flows. In addition, preliminary results show that, in comparison to the proportional allocation and the case of independent transmission, the individual flows within the quality-aware session do not exhibit higher quality fluctuations (a usually disturbing phenomenon), despite the more frequent layer changes.

The method presented in this chapter assumes that the (objectively obtained) quality profiles of the application streams are readily available at the time of transmission. This assumption hinders the operation of applications engaged in transmitting live streaming content; in this case, the system should be capable of producing or predicting the time-varying quality of a video flow at its different operating points in real-time. In the next chapter, a method of predicting the real-time objective quality of a video stream is presented, tailored however to another interesting problem in video streaming: that of maintaining stable quality for live, real-time video streaming.

Chapter 4

A neural network predictor of quality

In contrast to Chapter 3, which proposed a quality-aware adaptation mechanism for multi-stream sessions, this chapter focuses on quality-aware adaptation of live video streaming. A live video stream, when encoded and transmitted using a congestion controlled IP flow, experiences varying quality of service over short and long timescales, due to the bursty nature of video content and variations of available bandwidth. As a consequence, the perceived quality of the video can suffer frequent fluctuations. Such quality fluctuations are particularly disturbing to the viewer; psychophysical studies and subjective tests reveal that maintaining a stable quality is an important factor when viewing video content. In this respect, this chapter and Chapter 5 together present a method to alleviate short-term quality variation. In contrast to related approaches in the literature, the proposed method performs smooth quality video adaptation by employing a realistic objective measure of the perceived quality. The first part of this chapter discusses the problem of achieving smooth quality video streaming and introduces an architecture for rate-quality control of live video streaming, with the objective of maintaining stable quality under varying video scene content complexity and network bandwidth. The main challenges of this architecture are then discussed. The architecture relies on rate adaptation based on quality awareness, achieved by the use of a video quality metric. The direct application of the objective metric is, however, identified as a hindrance to the real-time performance of the streaming system. For this reason, the architecture comprises an artificial neural network to generate predictions of the on-going quality. In addition, the system features a tunable fuzzy rate-quality controller that accounts for several idiosyncrasies of video quality perception in its attempt to provide smooth streaming quality. The second part of this chapter addresses the problem of providing accurate real-time predictions of quality, which can facilitate the operation of the rate-quality controller. It presents details of an artificial neural network approach for estimating the on-going encoding quality of a real-time video stream and examines its performance in producing accurate quality predictions.

[Plot: instantaneous quality over time (200 ms intervals), comparing ‘TCP-friendly quality’ with a ‘desired smooth quality’ curve.]

Figure 4.1: The effect of a network-friendly video rate encoding on instantaneous quality and the hypothetical shape of a desired quality with infrequent oscillations.

4.1 Motivation and problem description

In live unicast video streaming (webcasting), the source of video information (such as a standard video camera, a satellite feed, etc.) is directed to an encoding engine that is responsible for digitising the video information (if in analog format), compressing it down to a desirable bit rate, and passing the compressed bitstream to the video server for transmission. However, a real-time video stream encoded and transmitted over a best-effort Internet experiences varying quality of service. This is attributed both to the video content's inherently varying spatio-temporal complexity and to the underlying network conditions. Video scenes with low spatial activity and motion are easy to encode with good quality, while complex visual content and motion increase the distortion introduced by the encoder. If a fairly constant encoding quality is desired, the resulting encoded video bitstream is bursty, usually referred to as variable bit rate (VBR), exhibiting significant short- and long-term rate fluctuations. Furthermore, in a shared network such as the Internet, end systems should react to congestion by adapting their transmission rates, using either TCP or TCP-friendly transmission. As the level of contention for network resources changes, the TCP-friendly rate can exhibit significant variation over time that, unfortunately, does not match the rate requirements for a constantly acceptable video quality. Therefore, as the encoding rate of the live stream has to be confined within the TCP-friendly transmission rate, an extra level of quality variation is added. Figure 4.1 displays this effect; it depicts the continuous quality (measured using the objective quality metric described in section 2.7) of a video sequence encoded using its TCP-friendly [19] share of bandwidth, obtained through simulation [147]. Figure 4.1 shows that the resulting video quality exhibits high short-term variability due to mismatches between the bit rate required to maintain a stable level of quality and the bandwidth available to the video stream. These mismatches are caused both by the underlying video content complexity and by the variability of the available transmission rate. As a result, a stable quality level cannot be sustained for long enough, leading to drops in the quality value. At other times, quality may rise to a high value but is only preserved for a limited time before dropping again. The hypothetical shape of a ‘desired’ smooth quality is plotted on the same graph (Figure 4.1). Frequent fluctuations of video quality are particularly annoying to the human viewer: higher variation of quality over time leads to worse perceived quality [108]. A video streaming system that delivers a modest yet stable end quality is preferable to a system that may at times deliver video with imperceptible impairments but also experiences periods of heavily distorted video. In agreement with psychophysical studies of video quality perception [149, 150] that indicate the negative impact of transient picture impairments, periods of low quality ought to be avoided, even if the system consequently cannot achieve as high a visual quality. Hence, a more conservative approach to rate control, one that avoids driving the quality to extreme, short-lived high and low levels, is preferable.

4.1.1 Smooth quality video rate control: literature review

There has been a significant amount of work in the literature of source-channel coding and rate adaptation techniques to provide video streaming with constant or smooth quality. In general, two main directions can be identified, which are based on different objectives and assumptions:

• The objective of the first approach is to provide a ‘constant’ level of quality throughout the transmission of the video programme. Since constant video quality can only be accomplished using an uncontrolled VBR encoding, this approach works best under the assumption of a well-provisioned transmission channel with enough capacity to carry the bursty bitstream (for example, a broadcast TV network).

• When the transmission channel is dynamic, as in a best-effort Internet, it is very difficult to acquire constantly acceptable quality throughout, unless a well-provisioned connection is in place. In this case, the second best alternative is a source-channel adaptation that alleviates frequent oscillations of quality and caters for a signal with a ‘smoother’ quality, for some appropriate measure of smoothness. This solution is often enabled by the use of a receiver preroll buffer to compensate any short-term burstiness of such a delivery process.

Basso et al. [151] presented an MPEG-2 encoding method to achieve near-constant perceived quality. The method relies on manipulating the quantisation factor $q$ during video encoding using a proportional-integral-derivative (PID) controller at each sampling point $k$ (with sampling duration $T$):

\[
q(k+1) = q(0) + K_p\,c(k) + \frac{K_p T}{T_i} \sum_{i=1}^{k} c(i) + K_p \frac{T_d}{T}\,[c(k) - c(k-1)],
\]

where $K_p$, $T_d$ and $T_i$ are PID design variables. The factor $c(k) = MPQM(k) - MPQM_{target}$ represents the quality distance of the current sampling period from the target. The term $MPQM$ denotes the perceived quality measured using an objective video quality metric, the Moving Picture Quality Metric (MPQM) [98]. The method produces an encoded sequence with almost constant objective quality but, as a consequence, exhibits very high variations of the output bit rate. Since there is no constraint on the encoder's bit rate inside the control process, this method is unsuitable for Internet streaming. Furthermore, the process introduces a significant computational penalty, as it involves the calculation of the MPQM metric at every sampling index $k$.
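A minimal sketch of this control loop, following the positional PID form above and treating the expensive quality evaluation as an external callback (names are illustrative):

```python
class QualityPID:
    """Discrete PID update of the quantiser q toward a target quality.
    `kp`, `ti`, `td` and the sampling period `t` are design variables;
    the caller supplies the measured quality of each sampling period."""
    def __init__(self, q0, target, kp, ti, td, t):
        self.q0, self.target = q0, target
        self.kp, self.ti, self.td, self.t = kp, ti, td, t
        self.integral, self.prev_error = 0.0, 0.0

    def step(self, measured_quality):
        error = measured_quality - self.target            # c(k)
        self.integral += error                            # running sum of c(i)
        q_next = (self.q0
                  + self.kp * error
                  + (self.kp * self.t / self.ti) * self.integral
                  + self.kp * (self.td / self.t) * (error - self.prev_error))
        self.prev_error = error
        return q_next
```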

Zhang et al. [152] presented a rate-allocation scheme for fine-grain scalability (FGS) MPEG-4 streams^ that achieves small variation of distortion between consecutive frames. The method relies on the estimation of rate-distortion (R-D) curves for the enhancement layer (using linear interpolation over a set of R-D points extracted during the encoding process) and uses a sliding window of frames to accommodate changing channel conditions over time. The smoothness achieved improves as the size of the sliding window increases. The method uses a traditional measure of distortion (mean squared error, MSE) to quantify quality, and no indication of the smoothness of the actual perceived quality is presented. Furthermore, it requires that a few R-D points are calculated during encoding to enable the approximation of the R-D curve. While the authors argue this is not a major problem for FGS bitstreams, due to the bit-plane method of coding, it might be an overhead for non-FGS codecs.

^FGS video coding is discussed in section 2.4.2.
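The interpolation step of such R-D estimation can be pictured as follows, assuming sparse (rate, MSE) samples supplied by the encoder:

```python
import numpy as np

def estimate_distortion(rate, rd_points):
    """Piecewise-linear R-D curve estimate from sparse samples.
    `rd_points` is a list of (bit_rate, mse) pairs; rates outside the
    sampled range are clamped to the nearest endpoint by np.interp."""
    rates, mses = zip(*sorted(rd_points))
    return float(np.interp(rate, rates, mses))
```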

Nelakuditi et al. [145] described a method of providing smooth quality for layered video streaming. They proposed smoothness criteria based on layer runs, where a layer run is defined as the number of consecutive frames in a layer. They argued that longer runs provide smoother video presentation quality. An optimal off-line layer transmission policy is initially presented, where the bandwidth availability for the whole duration of the video is known a priori. Based on the optimal algorithm, an online adaptive algorithm is then presented to predict the buffer requirements that can sustain the addition of a new layer so that a sufficient run is achieved. This method has certain drawbacks. Notwithstanding the fact that frequent oscillations of the number of transmitted layers result in annoying variations in quality, the assumption that layer smoothness (long layer runs) coincides with quality smoothness is, to some extent, speculative, if not ill-defined. This drawback is exacerbated by the fact that the algorithm presented works with layered CBR streams, since it is known that CBR video exhibits much higher quality variation. Kim and Ammar [153] extended this work to solve the problem of streaming layered FGS MPEG-4 video with minimum quality variation. As the enhancement layers of FGS compressed video are predominantly VBR and the algorithm in [145] works on CBR layers, a first step involves a rate smoothing procedure, based on [154], to reduce the rate variability of the compressed bitstream. An optimal adaptation algorithm is then presented assuming, as in [145], prior knowledge of the available bandwidth. Subsequently, an online heuristic based on the optimal algorithm is described, which assumes no future knowledge of the available bandwidth. The performance of the algorithm is demonstrated using both TFRC and TCP as congestion control policies. Again, layer runs are used as an indication of smoothness, which preserves the aforementioned disadvantages. In addition, this work considers streaming of pre-encoded video material. The proposed heuristic exhibits a noticeable performance difference in comparison to the optimal adaptation algorithm, so one cannot draw straightforward conclusions about its performance, especially if perceived quality is brought into consideration. As expected, performance increases if more buffering space (a higher initial playout delay) is made available at the receiver. Furthermore, their experiments simulated the transmission of high-rate streams (4 Mbps); such rates can support high-quality video anyway and are not typical of what the majority of Internet users experience today on an end-to-end path.

Lu et al. [155] presented a method for perceptual video quality control based on quality feedback as obtained by the ITS video quality metric. The objective here is not smooth quality adaptation, but transmission rate adaptation based on quality clues. Their system features the extraction of ITS S-T region features at both the sender and the receiver, as discussed in section 2.7. At the sender, S-T features are extracted not from the original but from the encoded video frames. Similar features are extracted from the decoded frames at the receiver. These features are then sent back to the sender via an auxiliary TCP feedback channel. Using the local and remote quality parameters, the sender can estimate the levels of quality degradation caused by the network transmission (since the local quality features are extracted from the compressed version of the video before transmission, the differences indirectly measure the perceived artifacts of transmission, i.e., those due to packet loss). If quality is impaired due to network congestion (increased packet loss), the sender lowers its output bit rate by appropriately increasing the quantisation factor. Conversely, if the video is not distorted during transmission, the quantisation scale is decreased to improve quality. This approach has certain drawbacks. First of all, the auxiliary TCP channel that carries the remote quality features introduces an extra level of latency before this information is available to the sender for further evaluation. The timely delivery of the feedback quality parameters can be further hindered if it is assumed that the receiver employs a playout buffer, a necessary component of a video streaming application. Since quality features at the receiving end are extracted after decoding the video data, the sender acquires the feedback quality data for the currently transmitted S-T frames only after at least a playout buffer's depth plus a round trip time worth of delay. This calls into question whether the feedback is timely, and thus accurate enough, to evaluate quality. Finally, no indication is given in that work to establish whether the approach produces an output bit rate that is also TCP-friendly. With respect to live encoding and streaming of video, the feature extraction process required by this scheme places a significant computational burden on the streaming server (similar computation is required at the receiving end). In a scenario where a server performs real-time streaming of several video sequences to multiple clients, it can easily become overloaded if required to handle this amount of computation.

The discussion in this section reveals that most of the proposed approaches to smooth or stabilise quality adopt traditional, pixel-based metrics or rely on indirect, and hence inaccurate, metrics (like smoothness of transmitted layers) to quantify video quality. Fortunately, the emergence of objective quality metrics has encouraged their use in this process. However, the main obstacle, especially in the context of encoding and transmission of live video, has been the computational burden that objective metrics place on the real-time performance of the streaming system. The next section firstly presents a number of constraints present in a live encoding and transmission system. Secondly, it identifies the set of requirements that arise from the use of a video quality metric to achieve smooth quality adaptation.

4.1.2 Constraints and requirements of smooth quality video streaming

Transmission of live video material poses a set of constraints not present in streaming of pre-recorded, stored video. In the latter case, a number of optimisations at different stages of the streaming process are possible. First of all, the whole video sequence is available to the encoder to perform optimal encoding (e.g., asymmetric or multiple-pass encoding). This option is not available with live video; the encoder has access only to those frames being generated by the video source and perhaps an extra window of frames (typically a few seconds long), if a look-ahead buffer is used [156]. Furthermore, live video has its own capture clock; hence, frames cannot be generated faster or slower than the capture rate (e.g., 30 fps). Another major difference is where the burden of adaptation is placed. In on-demand streaming of pre-recorded video, most of the complexity for efficient and robust encoding is placed on the encoder, which cannot adapt to the varying channel conditions and must rely on the media streaming server for this task. Rate-scalable representations are very important, as they allow adaptation to varying network throughput without requiring additional computation at the media server. As a consequence, there is no need for a direct connection between the encoder and the server; the task of the server is limited to selecting an appropriate signal representation that best matches the current network conditions. In a live encoding scenario, the server is tightly bound to the encoder: feedback about the condition of the network (available bandwidth and, optionally, packet loss rate) is propagated to the encoder, which in turn alters its parameters to generate a compliant bitstream^. Finally, efficient packetisation and packet scheduling algorithms can be employed to stream pre-recorded packetised video in a rate-distortion optimised way (e.g., [157, 158]). These techniques determine how video packets are formed and which packets are transmitted so as to meet a cost constraint (the time-varying available bandwidth) while minimising the end-to-end distortion (e.g., [157]).

^This does not mean that the former option is precluded; the encoder can still be decoupled from the streaming server and produce scalable video representations, which the streaming server uses appropriately, but this option is less flexible.

The proposed solution differs from recent proposals in the literature in its appreciation of the quality effect in the smoothing process. Contrary to solutions where measurement of video quality performance is usually limited to a few objective metrics, such as mean squared error (MSE) and peak signal-to-noise ratio (PSNR), a realistic measure of perceived distortion is used. To summarise, the requirement of an ‘ideal’ system is to provide a sequence-based rate control solution that can achieve almost constant quality at short timescales and smooth quality changes over longer timescales. The solution should be independent of the characteristics of the sequence content and should adhere to rate constraints imposed by the network (i.e., it should be independent of the underlying transmission policy). This approach poses two distinct challenges:

1. Achieving live video streaming with consistent quality requires a method that appropriately manipulates the encoding bit rate so that the resulting video quality exhibits minimal difference from the quality of the recent past. To achieve this, a method of mapping the encoding bit rate to the resulting encoding quality in real-time has to be available.

^This does not mean that the former option is precluded; the encoder can still be decoupled from the streaming server and produce scalable video representations, which the streaming server uses appropriately, but this option is less flexible.

2. If rate-to-quality mappings can be made available to the system, then appropriate quality values should be continuously chosen over successive S-T periods so that smooth quality is maintained. This is accomplished by searching for an estimate of the encoding bit rate that is required to achieve the designated smooth quality (while respecting the limitations imposed by the TCP-friendly congestion control regime). Therefore, a rate-quality controller that continuously determines appropriate target quality values is required.

To address the first challenge above, the ITS quality metric can be used to obtain continuous S-T period quality scores, as discussed in section 2.7. Doing so, however, requires encoding and decoding of the S-T period frames at several candidate bit rates in the search space and the subsequent application of the metric. Furthermore, recall that the metric calculates distortion features from both the original and distorted frames. This approach is clearly prohibitive in terms of real-time performance; the ITS metric (or any other objective quality model) alone is quite computation-intensive and can hardly run in real-time^, even for a single pair of original and distorted frame sequences (recall that the system needs to perform multiple quality evaluations at several encoding bit rates to locate the proper encoding rate point, not to mention the corresponding encoding and decoding involved in such a process). The rest of this chapter proposes an architecture that addresses these issues. It then presents a method that utilises artificial neural networks to obtain automatic predictions of the continuous S-T period quality scores, bypassing the time-consuming calculation of the ITS metric for each S-T period. A solution to the second challenge is briefly introduced in the next section and is discussed in detail in Chapter 5.

4.2 Architecture of smooth quality live video streaming

This section describes the architecture and the functionality of all components in the proposed smooth quality video streaming adaptation system, as illustrated in Figure 4.2. The system relies on a congestion control module that is periodically sampled to elicit the nominal transmission bit rate of the stream. Since the transmission environment assumed in this thesis is the best-effort Internet, the congestion control module performs some form of TCP-friendly algorithm to estimate the available bandwidth. Denote the TCP-friendly transmission rate as R_tcpf. The video encoder continuously receives frames from a live video source (e.g., from a studio camera, a satellite link, etc.) with the task of producing a compressed bitstream. Every time period t, summary statistics of video features are extracted from a small number of subsequent frames.

^For example, the low-computation mode (General Model) of the ITS VQM evaluates the quality of a video sequence at a rate of about 9 frames/sec on a 1.4 GHz Pentium 4 processor, assuming PC-based video formats (CIF or QCIF) [159].

[Figure 4.2 here: block diagram of the system. A TCPF congestion control module supplies R_tcpf(t) to the content feature extraction module and the ANN quality predictor, which outputs Q_tcpf(t); a fuzzy adaptive quality smoother and rate-quality controller combines Q_tcpf(t), the quality error and the send/receive buffer level (buflev) to determine the target quality Q_target(t) and the encoding rate R_enc(t) driving the encoder, which turns the input frames into encoded frames.]

Figure 4.2: This illustration shows the components of a framework for smooth quality adaptation of live video streaming and the interactions between involved modules.

This content feature extraction process at the encoder is described in detail in section 4.4. Based on these content feature statistics, which reflect the complexity of the original visual content, and on the current nominal transmission rate, R_tcpf(t), an artificial neural network (ANN) quality predictor, adequately calibrated with the ITS quality metric, calculates predictions of the instantaneous encoding quality, Q_tcpf(t)^. In general, the ANN quality predictor generates a quality approximation, Q_R(t), given a set of content feature statistics of the relevant video frames and an encoding rate R. The sampling of the TCP-friendly rate and the estimation of the continuous quality scores are carried out at a period equal to the duration of the S-T region (i.e., every 6 frames, or 200 ms for a 30 frames/sec video input). This period is a reasonable trade-off between the granularity of network adaptation^ and the duration of the S-T region of the quality metric, as described in section 2.7. It also minimises the additional delay and buffering at the sender.

The proposed method determines continuous values of the encoding rate that alleviate short-term variations in quality. In doing so, the encoding rate can be lower or higher than the transmission rate at each adaptation time instance. To accommodate these mismatches, the system relies upon buffer cushions at both the sender and the receiver, as explained in detail later (Chapter 5). The proposed method utilises a fuzzy rate-quality controller that receives successive values of Q_tcpf and an estimate of the sender and receiver buffer sizes to determine the value of a target quality, Q_target, that reduces variations of the encoding quality relative to the quality of the recent past. At the same time, it attempts to maintain the stability of both the sender and receiver buffers. The function of the controller is to locate, by further invocations of the ANN, the required encoding bit rate, R_enc, that achieves a final encoding quality closely approximating Q_target. In other words, the rate-quality controller determines, at every S-T period t, a suitable encoding rate that results in a smoother on-going quality alternative to Q_tcpf. The details of the fuzzy rate-quality controller are presented in Chapter 5.

^Therefore, Q_tcpf(t) represents the encoding quality when the encoding rate is set equal to the TCP-friendly rate.

^Contemporary TCP-like or TCP-friendly congestion control algorithms re-calculate their transmission window (or rate) every time a new feedback (ACK) packet is received, but it is safe to assume that within this relatively short timescale (200 ms) one would not expect a drastic change of network conditions.

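To make the interactions in Figure 4.2 concrete, the following sketch outlines a single adaptation step of the loop described above. It is an illustrative outline only: the objects and methods (sample_rate, extract_st_features, predict_quality, choose_target_quality, find_rate_for_quality) are hypothetical stand-ins for the congestion control module, the feature extraction module, the ANN predictor and the fuzzy rate-quality controller of Chapter 5, not an actual implementation.

    # Sketch of one adaptation step (one S-T period, about 200 ms) of the
    # loop in Figure 4.2. All objects and methods are hypothetical stand-ins
    # for the modules described in the text.

    ST_PERIOD_FRAMES = 6  # S-T region: 6 frames at 30 fps = 200 ms

    def adaptation_step(encoder, congestion_ctrl, ann, controller, buffers):
        # 1. Sample the nominal TCP-friendly transmission rate R_tcpf(t).
        r_tcpf = congestion_ctrl.sample_rate()
        # 2. Summarise content features over the current 6-frame S-T period.
        features = encoder.extract_st_features(ST_PERIOD_FRAMES)
        # 3. Predict the encoding quality Q_tcpf(t) that would result if the
        #    encoder ran at the TCP-friendly rate.
        q_tcpf = ann.predict_quality(features, r_tcpf)
        # 4. The fuzzy controller picks a smoother target quality Q_target(t),
        #    using the recent quality history and the buffer occupancies.
        q_target = controller.choose_target_quality(q_tcpf, buffers.levels())
        # 5. Invert the rate-quality mapping by further ANN invocations: find
        #    an encoding rate R_enc(t) whose predicted quality approximates
        #    Q_target(t).
        r_enc = controller.find_rate_for_quality(ann, features, q_target)
        # 6. Encode the next S-T period at R_enc; the sender/receiver buffers
        #    absorb the mismatch between R_enc and R_tcpf.
        encoder.set_rate(r_enc)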

4.3 Artificial neural networks

There exist a multitude of techniques for modelling or approximating an unknown function over multi-dimensional data, such as function approximation using regression methods, statistical approaches and Bayesian inference, artificial neural networks, genetic programming, etc. For example, multivariate regression is a well-studied field; however, these methods require the specification of some pre-defined function to approximate. The widely used linear regression may fail to represent non-linear relationships in the data, and for non-linear data some complicated (non-linear) function must be provided. Bayesian reasoning provides a probabilistic approach to inference and is based on the premise that the quantities of interest are governed by probability distributions. Such methods constitute an excellent choice for classification problems, but they usually require initial knowledge of many probabilities. Genetic algorithms and genetic programming provide an approach to machine learning that is based on simulated evolution.

This work utilises the learning and generalisation capabilities of artificial neural networks to discover the relationship between short-term video content features and the value of the corresponding objective quality. Although other techniques (such as the ones mentioned above) may be equally effective, the scope of this work is not to propose the best method of approximating this relation, but to show that it is feasible to predict ongoing objective quality scores in real-time with a high level of accuracy. An Artificial Neural Network (ANN) is a general, practical form of machine learning that provides a robust approach to approximating real-valued, discrete-valued or vector-valued target functions and learns to interpret complex real-world data. When suitably trained, ANNs can provide accurate estimations of the output(s) based on a selection of inputs, efficiently capture non-linear relationships among multi-dimensional data and support a general paradigm for dealing with complex mathematical functions. They constitute one of the most effective learning methods and have proven successful in a wide and diverse range of

[Figure 4.3 here: left, a single perceptron computing a = f(\sum_i w_i x_i + b); right, a feedforward ANN with input, hidden and output layers.]

Figure 4.3: The structure of the basic ANN component (a neuron or perceptron), and a feedforward neural network with n inputs, one hidden layer with m neurons, and one output layer; W_L denotes the weights matrix for layer L.

applications: industrial, financial, medical, speech and face recognition, telecommunications, robotics and many others. ANNs possess a number of properties that make them particularly suitable for the problem under examination:

• The target function to be 'learned' can be defined over instances of input attributes and target values. No assumption on the input attributes is required; they can be highly correlated or independent of each other, and input values can take any real value.

• They are suitable for applications where long training times are acceptable but fast evaluation of the learned target function (or target values) is an essential requirement.

• The requirement to understand the learned target function is not considered important. Effectively, a neural network learns weights between its connections, which are often difficult for humans to interpret.

Extensive research in this area has resulted in a multitude of approaches to neural network computing and types of neural networks; this discussion is limited to the very basic principles that govern an ANN and to the most popular type of ANN, the multi-layer perceptron with error back-propagation [160, 161], which is used in this work. The basic building block of a neural network is an elementary neuron, or perceptron, shown in Figure 4.3. Each value of the input vector x = [x_1, x_2, ..., x_n] is weighted with an appropriate weight w_i that defines the contribution of input variable x_i to the perceptron's output a. The sum of the weighted inputs together with a bias b, also called the activation of the neuron, is projected onto a differentiable transfer function f to produce the neuron's output:

a = f\left( \sum_{i} w_i x_i + b \right)
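As a minimal numerical illustration (a sketch added here, not part of the original text), the output of a single perceptron with a tangent-sigmoid transfer function can be computed as follows:

    import numpy as np

    def perceptron(x, w, b):
        # a = f(sum_i w_i * x_i + b), here with f = tangent-sigmoid
        return np.tanh(np.dot(w, x) + b)

    # Example: a neuron with three inputs
    x = np.array([0.5, -1.2, 0.3])   # input vector
    w = np.array([0.8, 0.1, -0.4])   # weights
    print(perceptron(x, w, b=0.05))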

Typically, neural network frameworks have a layered topology, where layers of perceptrons are combined to form a multi-layer feedforward network (Figure 4.3). A neural network can therefore be described by the number of neuron layers, denoted by L, and the number of neurons in each of its layers, denoted by n_l, where l is the ANN layer index. The number of neurons in a layer is also called the layer size. In a typical layered topology, the input layer is not really neural but serves to introduce the values of the input variables. The output layer contains neurons that produce the output(s) of the network. One or more hidden layers can exist between the input and output layers; their neurons have no direct interaction with the 'outside world', only with other neurons within the network. Multiple layers of perceptrons with non-linear transfer functions, like the log-sigmoid or tangent-sigmoid^, allow the network to learn linear and non-linear relationships between inputs and output(s) without an a priori assumption of a specific model form. The function of a neural network is shaped by a set of adjustable parameters, the weights and biases at every layer and neuron, whose suitable values are found by an iterative procedure, called training or learning, performed on the set of training samples. These adjustable parameters are given random initial values and the training process consists of two steps per iteration. For a set of training input vectors with a known response vector y, a forward pass calculates the activations at every neuron to generate a predicted response vector ŷ. Then, a back-propagation step adjusts all the weights of the neural network based on the magnitude of the error between the predicted and actual output:

E^k = \left( y^k - \hat{y}^k \right)^2 \qquad (4.1)

E = \sum_{k=1}^{K} E^k \qquad (4.2)

where y^k is the target value in y, \hat{y}^k is the ANN output with respect to the k-th training input vector, and K is the total number of training patterns.

^The log-sigmoid and tangent-sigmoid transfer functions are defined as \mathrm{logsig}(x) = \frac{1}{1 + e^{-x}} and \mathrm{tansig}(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}.

The error cost measure in expression (4.2) is commonly used for its simplicity, and it represents the deviation of the network's output from the ideal. The task of training is to find the weights and biases that minimise E. This iterative procedure is repeated with the newly optimised parameters until an acceptably low error is achieved. Several algorithms have been proposed to adjust the weights at every iteration of the training phase; gradient descent is probably the most popular [162]. Essentially, this method performs iterative steps in the weight space, proportional to the negative gradient of the cost function E, to update the weights:

w_{ij} \leftarrow w_{ij} + \Delta w_{ij}, \qquad \Delta w_{ij} = -\eta \, \frac{\partial E}{\partial w_{ij}}, \qquad \frac{\partial E}{\partial w_{ij}} = \sum_{k=1}^{K} \frac{\partial E^k}{\partial w_{ij}}

where η is the step-size parameter, usually called the learning rate. Training an ANN is therefore an optimisation procedure that attempts to locate the minimum of a multidimensional error surface, which usually includes several local minima. A neural network might not always find the absolute minimum, but an acceptable local minimum close to it. After the training phase, the ANN can be validated for its generalisation capability by comparing its output with the actual (expected) values on input data from a set of samples unknown during the training phase, called the test set. A common problem that occurs during training is over-fitting. In this case, the error on the training set may be reduced to a very small value, but when presented with new, unknown test patterns, the network performs poorly (large prediction error) because it has almost memorised the training samples. The tendency to over-fit increases with the network size (number of hidden layers and hidden-layer neurons), but the best size for the network is difficult to know beforehand. Early stopping is a technique very often used to stop the training process before the network starts to over-fit. In this method, the available training data are split into a training set and a (usually smaller) monitoring set, and the error on the monitoring set is also inspected during training. While at the beginning both the training and monitoring errors decrease, when the network begins to over-fit the training data, the monitoring error starts to increase, as shown in Figure 4.4. If this increase continues for a specific number of iterations, the training process stops and the ANN parameters (weights and biases) that presented the minimum monitoring error are retained.
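The mechanics just described (forward pass, squared-error cost, gradient-descent weight updates and early stopping on a monitoring set) are summarised in the following sketch of a one-hidden-layer network with a tangent-sigmoid hidden layer and a linear output. It is a schematic illustration under simplifying assumptions (full-batch updates, fixed learning rate), not the MATLAB implementation used in this work.

    import numpy as np

    def train_early_stop(X, y, Xmon, ymon, n_hidden=14,
                         lr=0.01, max_iter=5000, patience=50):
        """Gradient-descent training of a one-hidden-layer network with
        early stopping on a monitoring set (full-batch, fixed learning rate)."""
        rng = np.random.default_rng(0)
        n_in = X.shape[1]
        # Small random initial weights and biases
        W1 = rng.uniform(-0.1, 0.1, (n_hidden, n_in))
        b1 = rng.uniform(-0.1, 0.1, n_hidden)
        W2 = rng.uniform(-0.1, 0.1, n_hidden)
        b2 = rng.uniform(-0.1, 0.1)

        best_err, best_params, worse = np.inf, None, 0
        for _ in range(max_iter):
            # Forward pass: tangent-sigmoid hidden layer, linear output
            H = np.tanh(X @ W1.T + b1)
            err = (H @ W2 + b2) - y            # residuals of E = sum_k err_k^2

            # Back-propagation: gradients of E w.r.t. weights and biases
            gW2 = 2 * err @ H
            gb2 = 2 * err.sum()
            delta = 2 * np.outer(err, W2) * (1 - H ** 2)   # through tanh
            gW1 = delta.T @ X
            gb1 = delta.sum(axis=0)

            # Gradient-descent step: w <- w - lr * dE/dw
            W1 -= lr * gW1; b1 -= lr * gb1
            W2 -= lr * gW2; b2 -= lr * gb2

            # Early stopping: watch the error on the monitoring set
            mon = np.tanh(Xmon @ W1.T + b1) @ W2 + b2
            mon_err = np.mean((mon - ymon) ** 2)
            if mon_err < best_err:
                best_err = mon_err
                best_params = (W1.copy(), b1.copy(), W2.copy(), b2)
                worse = 0
            else:
                worse += 1
                if worse > patience:           # monitoring error keeps rising
                    break
        return best_params                     # weights at minimum monitoring error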

The method proposed in this chapter employs neural networks to predict the on-going

[Figure 4.4 here: squared error versus training iterations, showing the training error decreasing throughout, while the monitoring error falls during the modelling phase and rises once over-fitting begins; the training stop point lies at the minimum of the monitoring error.]

Figure 4.4: Early-stop training: evolution of the squared error of the neural network responses for the training and monitoring sets.

quality based on representative measures of the underlying visual content features from the original video scene. Such features are: the magnitude of spatial complexity of the images, the amount and characteristics of motion among subsequent frames, the locality or spread of texture and motion activity, the type of camera action, and others. These properties of the underlying content determine the encoder's capacity to compress the video data with imperceptible distortion under a given rate constraint^ or, equivalently, the level of introduced artifacts. There is a broad understanding of the qualitative effect that the complexity of video content and the bit rate budget have on visual image quality (block effects, blurred edges, jerky motion, etc.), discussed in detail in section 2.5. However, how these features quantitatively determine perceived quality is difficult to deduce using conventional mechanisms^, due to the complex operation of the human perception process that complicates such evaluation procedures. To overcome this hurdle, neural networks are utilised to provide an approximation of the unknown relationship between content activity, encoding bit rate and quality. Therefore, the following sections examine the performance of ANNs for the on-line prediction of perceived quality, based on input features of the original content and the encoding bit rate. The ANN model operates on visual content descriptors that are extracted from the input video frames during the encoding process on an S-T period basis. The neural network directly yields objective quality scores associated with vectors of extracted content features. The function that maps feature vectors to objective quality ratings is learned by training the neural network. For the training process, continuous objective quality scores are obtained using the ITS VQM, described in section 2.7, on the premise that they reflect, with high accuracy, the actual subjective opinion of human viewers.

^This is under the assumption of a non-changing picture size (resolution) and that the same video codec is used throughout.

^Traditional rate-distortion theory is a quantitative mechanism, but, as mentioned earlier, it is based on the assumption that the distortion is some function of the error between pixel values.

4.3.1 Related work

The following paragraphs briefly outline related work on issues such as the role of content feature extraction in digital video processing and the use of ANNs for content-based adaptation and quality prediction. Extraction of video content features can provide a significant amount of information about the underlying visual content and has been used in video traffic modelling and video retrieval systems. Dawood and Ghanbari [141] presented a traffic model for VBR MPEG video based on scene content descriptors, where non-homogeneous video clips are classified into homogeneous video shots in terms of their texture and motion complexity (low, medium and high). To measure texture and motion complexity, the average magnitude of the block DCT coefficients and the average magnitude of the macroblock motion vectors are calculated, respectively; both features are then averaged over the video shot. Each homogeneous shot class is represented by autoregressive models to generate the expected number of bits of the compressed data. Bocheck [163] focused on the construction of content-based traffic models for VBR video that can be used to predict the network resource requirements of video or drive the reservation of such resources in the network. In that work, long video sequences were segmented into homogeneous activity periods (defined by scene boundaries) and several features were extracted from compressed MPEG-2 bitstreams, such as camera operation (static, panning, zoom, transition, etc.), number and size of video objects (VOs), and spatial and temporal complexity of VOs. These features were subsequently quantised into a small number of levels (small, medium and high). Based on these descriptors, activity periods were clustered into a number of classes using a Bayesian unsupervised classifier. Corresponding traffic descriptors for the activity periods were generated and machine learning was used to map a content class to a traffic class that yields a prediction for the resource requirements of each activity period. A similar approach was presented in [164], albeit the neural network is trained with non-quantised features to predict long-term traffic. A large set of content features was gathered and a feature selection procedure (termed Sequential Forward Selection) was used to reduce the features to a smaller subset exhibiting the highest relevance to traffic prediction. To accomplish the traffic prediction task, the reduced set of significant features, together with the associated bandwidth statistics of the observation period, was used to train a back-propagation multi-layer neural network.

Bocheck et al [165] described a content-based quality adaptation method for MPEG-4 streams based on the generation of dynamic utility functions. The system comprises two main components: a real-time estimation module and an adaptation module. Based on visual and encoder-based content features extracted online by a content analyser (frame size, average and variance of the motion vectors, camera operation, energy of DCT coefficients, etc.), the estimator dynamically determines a utility class and the characteristic utility function for the video object (video frame). No actual estimation of the utility class is performed. The off-line adaptation loop uses unsupervised machine learning (a decision tree) to find 'clusters' in the content feature space of the video objects in the training pool. Since this module is computationally intensive, it is decoupled from the real-time estimation path and is periodically called to re-compute the decision tree parameters and update the video objects in the training pool. The characteristic utility function for the video object is then calculated by the real-time utility estimator, which is based on supervised machine learning [166]. The utility function maps a set of (discontinuous) bandwidth values to a corresponding utility for the video object. Attached to each such pair is a scaling profile that contains appropriate adaptation techniques to scale the video to the specific rate that results in that particular utility. The drawback of this architecture is its significant complexity. Scaling profiles need to be either transmitted together with the stream (for example, using the MPEG-4 object descriptor) or located using uniform resource locators (URLs). Furthermore, the utility is expressed in terms of the SNR metric, and there is a significant estimation error for the predicted utility function (from 5% to 20%).

Mohamed and Rubino [167] proposed a method for predicting the end quality of a transmitted video stream using Random Neural Networks (RNN). The authors argued that the main factors that influence quality are: (i) the encoding bit rate, (ii) the output frame rate of the encoded stream, (iii) the average packet loss rate, (iv) the average size of loss bursts and (v) the ratio of intra- to inter-encoded macroblocks^. For this reason, the RNN is trained with various combinations of these parameters at discrete points, together with the corresponding quality ratings obtained by means of subjective MOS tests. However, the experimental results presented show the behaviour of the system with respect to only one 300-frame video clip. No indication is given of the generalisation performance with other types of sequences, nor of the problem of predicting the evolution of continuous quality ratings.

Gastaldo et al [168, 169] presented a neural network model that processes a set of input features, extracted from MPEG-2 encoded bitstreams, to yield an associated estimate of perceived quality. This work has similar objectives, but there is a fundamental difference: the extraction of the features that train the network is performed on compressed MPEG-2 streams. This option is

^This ratio is indicative of the efficiency of the motion estimation process but is also a simple way of adding robustness against the propagation of error macroblocks.

clearly not available in our case. Furthermore, quality scores are obtained by means of subjective MOS tests. Again, the on-going quality of video is not considered in this work. Lin and Mersereau [170, 171] presented a bit rate allocation scheme for MPEG video coding with the aim of optimising subjective quality. Four block-based features are extracted from the input and output video sequences and feed a four-layer (two hidden layers) back-propagation neural network. The four features are selected from a much larger set of candidate features, based on their ability to predict an observer's assessment of quality. These features are: (i) the mean and (ii) the standard deviation of the magnitude of the DFT coefficient differences between the original and coded blocks, (iii) the mean absolute Wepstrum^ difference and (iv) the colour difference, as measured by the energy in the difference between the original and coded blocks in the UVW colour system (a linear transformation of the YUV colour system). Training was based on quality scores obtained by subjective tests using a continuous quality scale. A macroblock bit-allocation algorithm is also presented that changes the quantisation scale of every macroblock in order to optimise the end quality (obtained from the neural network). This method is obviously not suitable in our case, as it requires the presence of both the original and coded sequences and, due to its significant encoding complexity, is more suitable for asymmetric applications (off-line encoding). Yao et al [172] presented a video quality evaluation model based on multi-feature extraction and radial-basis neural networks. Again, the measured features that are used to train and test the neural network are extracted from both the original and distorted video frames.

4.3.2 Challenges

The ANN method presented here does not rely on the availability of both the original and decoded sequences, or of the encoded bitstream. Quality predictions are sought based only on features that are extracted on-the-fly from the original input frames and on the encoding bit rate. Due to this requirement, several challenges arise:

1. The choice of the set of content features to be extracted from the original material. Such features should be obtained without any significant computational cost, since real-time performance is a major requirement. Therefore, features have to be collected as part of the natural encoding process, i.e., they should be features that the encoder calculates anyway.

2. The choice of the set-up details of the neural network (the ANN architecture) also bears significant importance to the performance of the quality prediction process. The abil­ ity of the neural network to converge to an acceptable solution is subject to a number

^The inverse wavelet transform of the logarithm of the magnitude of the wavelet coefficients.

of parameters, including, but not limited to, the complexity of the unknown function being approximated, the size of the neural network and its influence on the generalisation performance, and how the input data can be appropriately pre-processed to facilitate learning.

It follows from the discussion above that the proposed neural-network-based quality prediction method does not constitute a 'true' objective video quality assessment model, as it does not incorporate any modelling of the human cognitive or subconscious processes of visual perception. It may be argued, however, that the human quality perception process is indirectly integrated into the quality scores that are used for the calibration of the neural network model. In this respect, the level of perceived distortion in the respective video frames is quantified, albeit indirectly, by the corresponding ITS VQM continuous quality scores. This hypothesis is successfully tested later, in section 4.6. It is possible, therefore, that certain idiosyncrasies of human perception cannot always be captured by the neural network predictor from content features of the reference video alone. This is natural, as full-reference objective quality models, described in section 2.6.2, have access to both the original and distorted sets of video frames and can therefore measure impairments present in the distorted sequence, in comparison to the original, more accurately. The next two sections confront the two main challenges of an ANN quality prediction architecture as outlined above. Specifically, section 4.4 describes in detail which features are selected to represent the video content activity and how they are extracted from the original video frames as part of the encoding process. In turn, section 4.5 highlights important aspects of the neural learning paradigm and investigates a suitable architecture and size for the featured neural network.

4.4 Extraction of content features

Keeping in mind the requirement of real-time processing, a set of fifteen content features is extracted from every original frame within the S-T period (6 frames), chosen as those most likely to influence quality. The analysis that follows is based on an H.263+ video codec [146], but it can be applied, with minimal modifications, to any other hybrid DCT block-based codec that employs motion estimation. The spatial complexity of frames largely determines the bit rate requirement of video and, for this reason, its quality. For example, high spatial energy results in more high-frequency coefficients being dropped during quantisation. Four features that measure texture complexity are extracted. The pixel activity, PelAct, defined as the standard deviation of luminance pixels for each block, averaged over the number of (8x8) blocks in the frame, and the spatial spread of pixel activity, PelActSpread, defined as the deviation of the block-PelAct values over the frame, are measures of the spatial complexity (image texture activity) of the frame at the pixel level. Similar features are calculated to measure the edge activity within the frame. Edges are defined as regions of pixels with high variation of luminance. Edges convey significant visual information, reveal texture and are more susceptible to certain encoding impairments than flat regions of an image (e.g., blurring distorts the strength of edges). From the human vision point of view, spatial and texture masking are sensitive to the intensity of areas with edge activity. To determine the edge activity, the magnitude of the pixel gradients in each block is calculated. This can be done by applying a Sobel filter (or gradient operator) at each pixel value:

\mathrm{magn}(\nabla p_{ij}) = | p_{i-1,j-1} + 2 p_{i-1,j} + p_{i-1,j+1} - p_{i+1,j-1} - 2 p_{i+1,j} - p_{i+1,j+1} | + | p_{i-1,j-1} + 2 p_{i,j-1} + p_{i+1,j-1} - p_{i-1,j+1} - 2 p_{i,j+1} - p_{i+1,j+1} |

where p_{ij} is the luminance value of the pixel at row i and column j of the frame's pixel grid. The edge activity, EdgeAct, is the deviation of the magn(\nabla p_{ij}) values in every block, averaged over the number of frame blocks. The spread of edge activity, EdgeActSpread, is calculated similarly to PelActSpread.
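The four spatial features can be computed from a luminance frame with a few array operations, as in the following sketch (numpy; it assumes frame dimensions divisible by the 8x8 block size and uses the |Gx| + |Gy| Sobel magnitude of the expression above):

    import numpy as np

    def spatial_features(Y, B=8):
        """PelAct, PelActSpread, EdgeAct, EdgeActSpread from one luminance frame.

        Y is a 2-D luminance array whose dimensions are assumed to be
        multiples of the block size B (8 for 8x8 blocks).
        """
        Y = Y.astype(np.float64)
        h, w = Y.shape

        # Sobel gradient magnitude |Gx| + |Gy| at interior pixels, matching
        # the two absolute-value terms of the expression above
        gx = np.abs(Y[:-2, :-2] + 2 * Y[:-2, 1:-1] + Y[:-2, 2:]
                    - Y[2:, :-2] - 2 * Y[2:, 1:-1] - Y[2:, 2:])
        gy = np.abs(Y[:-2, :-2] + 2 * Y[1:-1, :-2] + Y[2:, :-2]
                    - Y[:-2, 2:] - 2 * Y[1:-1, 2:] - Y[2:, 2:])
        grad = np.zeros_like(Y)
        grad[1:-1, 1:-1] = gx + gy

        # Standard deviation inside each BxB block, over the whole block grid
        pel_blocks = Y.reshape(h // B, B, w // B, B).std(axis=(1, 3))
        edge_blocks = grad.reshape(h // B, B, w // B, B).std(axis=(1, 3))

        return (pel_blocks.mean(),   # PelAct
                pel_blocks.std(),    # PelActSpread
                edge_blocks.mean(),  # EdgeAct
                edge_blocks.std())   # EdgeActSpread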

The amount of motion energy and the complexity of the motion can also affect the quality of a rate-constrained video stream. When motion in the sequence is more intense, the motion estimation error residue is considerably higher and is affected by the subsequent quantisation process. Several motion-related features are extracted with the aim of covering the range of motion attributes that may introduce perceived distortions. The first feature is the sum of absolute pixel differences, soad^, which is a measure of pixel change between the current (motion-estimated) frame and its reference frame. The average magnitude of the motion vectors over the whole frame, MVMagn, is calculated as follows:

\mathrm{MVMagn} = \frac{1}{M} \sum_{i=1}^{M} \| mv_F(i) \|_2 \qquad (4.3)

where M is the number of macroblocks (MBs) in the frame, and mv_F(i) is the motion vector (MV) of the (motion-estimated) MB i in frame F. The spatial variance of the motion vector magnitude, MVMagnVar, is also calculated:

^This feature is equivalent to the mean absolute difference, MAD, often used in video coding terminology.

\mathrm{MVMagnVar} = \frac{1}{M} \sum_{i=1}^{M} \left( \| mv_F(i) \|_2 - \mathrm{MVMagn} \right)^2 \qquad (4.4)

It is also interesting to locate frames where there is strong motion in portions of the image, as this may lead to localised impairments. Therefore, the motion vector magnitude of expression (4.3) is also calculated for each of the four spatial quadrants of the frame, resulting in four additional features: MVMagnUL, MVMagnUR, MVMagnLL and MVMagnLR. The ratio of motion-estimated MBs over the total number of the frame's MBs, MERatio, is also calculated as a representative measure of the coding efficiency of the motion estimation process.

Besides the magnitude of the motion, expressed by the above features, other properties of motion may influence compression efficiency and therefore encoding quality. Features that signify the complexity of motion are introduced, such as the direction of motion within the frame, the speed of motion and the change of motion speed (motion acceleration). The complexity of motion, MotCompl, is calculated as follows: every motion vector of the motion-estimated MBs is classified according to the dominant axis of the vector (up, down, left, right, none), and the variance of this five-bin histogram is taken [173]. A uniform histogram of the directional MVs reveals a more complex motion throughout the frame. To measure the changes in motion within the frame, two final features are produced. MotDirChange represents changes in the motion direction and is formed by subtracting the motion vectors of successive motion-estimated blocks and averaging over the frame's number of MBs:

\mathrm{MotDirChange} = \frac{1}{M} \sum_{i=1}^{M} \| mv_F(i) - mv_{F'}(i) \|_2

where F' represents the reference frame used for motion estimation in frame F. The motion acceleration, MotAccel, captures the change in motion speed (acceleration), again averaged over the frame's MBs.
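A sketch of how the motion-related features can be derived from the encoder's motion vector field is given below (numpy). The dominant-axis binning convention and the acceleration definition are illustrative assumptions consistent with the descriptions above; mv and mv_ref are hypothetical M x 2 arrays of per-macroblock motion vectors for the current frame F and its reference frame F'.

    import numpy as np

    def motion_features(mv, mv_ref):
        """Motion features from per-macroblock motion vectors.

        mv and mv_ref are (M, 2) arrays holding the motion vectors of the
        motion-estimated MBs in the current frame F and its reference F'.
        """
        mags = np.linalg.norm(mv, axis=1)
        mv_magn = mags.mean()                          # expression (4.3)
        mv_magn_var = np.mean((mags - mv_magn) ** 2)   # expression (4.4)

        # MotCompl: variance of the five-bin histogram of dominant motion
        # directions (right, left, down, up, none); the exact binning
        # convention here is an assumption
        dx, dy = mv[:, 0], mv[:, 1]
        bins = np.where((dx == 0) & (dy == 0), 4,
               np.where(np.abs(dx) >= np.abs(dy),
                        np.where(dx > 0, 0, 1),
                        np.where(dy > 0, 2, 3)))
        hist = np.bincount(bins, minlength=5) / len(mv)
        mot_compl = hist.var()

        # MotDirChange: mean magnitude of the MV difference between the
        # current and reference frames (formula above). MotAccel: mean
        # change of MV speed, an assumed definition consistent with the text.
        mot_dir_change = np.linalg.norm(mv - mv_ref, axis=1).mean()
        mot_accel = np.abs(mags - np.linalg.norm(mv_ref, axis=1)).mean()
        return mv_magn, mv_magn_var, mot_compl, mot_dir_change, mot_accel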

Table 4.1 summarises the content features that are extracted from the original frames.

The mean value of each content feature is calculated over the duration of the 6-frame S-T period to obtain an equal number of content feature descriptors that summarise the video content complexity over the duration of the corresponding activity period. Other descriptive statistics were also examined, like the median or maximum, but did not provide any improvement over the selection of the average values.

Table 4.1: Content features extracted from the original video frames.

  #     Content Feature                Description
  1     PelAct                         Pixel activity averaged over all blocks
  2     PelActSpread                   Deviation of pixel activity over all blocks
  3     EdgeAct                        Edge activity averaged over all blocks
  4     EdgeActSpread                  Deviation of edge activity over all blocks
  5     soad                           Sum of abs. pixel differences between adjacent frames
  6     MVMagn                         Magnitude of motion vectors
  7     MVMagnVar                      Spatial variance of motion vector magnitudes
  8-11  MVMagnLL, MVMagnLR,            Magnitude of motion vectors per quadrant
        MVMagnUL, MVMagnUR             (lower left & right, upper left & right)
  12    MERatio                        Ratio of motion-estimated MBs in the frame
  13    MotCompl                       Motion complexity (variance of the directional
                                       motion vector histogram)
  14    MotDirChange                   Change of motion direction between adjacent frames
  15    MotAccel                       Acceleration of motion between adjacent frames

The feature extraction process and the calculation of the feature descriptors were integrated into an H.263+ video codec [146], but this process can be easily ported to other hybrid DCT/ME video codecs with relatively small modifications.
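Summarising the per-frame features over an S-T period then amounts to a per-feature mean over the six frames, e.g. (a trivial sketch):

    import numpy as np

    def st_descriptors(frame_features):
        # frame_features: array of shape (6, 15), one row of content
        # features per frame of the S-T period; the descriptor vector
        # is the per-feature mean over the period.
        return np.asarray(frame_features, dtype=float).mean(axis=0)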

4.5 Neural network architecture and setup

This section discusses the architecture of the proposed ANN quality predictor^. Through a series of evaluation experiments, two important aspects of the development of a successful neural network model were resolved: (i) the effect of the network topology (number and size of layers) and (ii) how the input variables can be pre-processed so that they constitute a more informative input to the model and, ideally, lead to better learning. The supervised ANN training involves the set of content feature descriptors, extracted online during the encoding process and summarised (averaged) over the duration of the S-T period. The training process calibrates the neural network based on the corresponding S-T period quality scores, obtained after processing the original and encoded test sequences with our implementation of the ITS metric. The ANN architecture features a feedforward, multi-layer network with error back-propagation (Figure 4.3), which is the most popular and most extensively used type of ANN. The optimisation of the ANN topology and the selection of appropriate inputs to the model are probably the most tedious steps in the development of a model. An ANN can perform unbiased estimation of the training set to arbitrary precision, e.g., using an indefinitely large number of hidden nodes; however, such an ANN would be extremely sensitive to the idiosyncrasies of the training data and would fail to generalise. For this reason, reducing the number of layers and nodes in the ANN is required, and one should opt for more parsimonious models, which do not produce a perfect fit on the training data but provide more accurate responses to unknown inputs.

^The neural network implementation and evaluation experiments described in this chapter were performed using MATLAB.

Within the set of content features that are extracted during encoding as potential factors that influence the encoding quality, there is probably redundancy, and not all of them may be highly relevant to the ANN model. Retaining only those input variables that are relevant to the model not only results in a simpler model representation and improved prediction ability, but also improves the training speed and reduces memory requirements. However, deciding which features to preserve for the purpose of function approximation is not straightforward. The simplest way to extract the subset of features that yields the best generalisation consistency is to form all possible combinations of the original set of content descriptors and evaluate the corresponding approximation accuracy. This, however, involves a prohibitively large number of combinations: \sum_{k=1}^{N} \binom{N}{k} = 2^N - 1, where N is the total number of feature descriptors (15 in this case). Input features that are relevant to the model can instead be derived through a stepwise, trial-and-error method. For example, with stepwise addition, one may start with an initially small set of inputs and add a new variable at a time until a satisfactory monitoring or prediction error is achieved. This carries the risk that the method may stop with selected input variables x_1, ..., x_m, although some information important to the model may also be contained in an input x_n, n > m. For this reason, stepwise elimination of input variables can be employed. With stepwise elimination, a deliberately large subset of initial variables is chosen and variables are subsequently removed until the monitoring or prediction error no longer improves. The selection of the appropriate input variables during stepwise elimination can be improved if the relevance of each variable to the model, called its sensitivity, is estimated. The sensitivity of the input variables can be determined by partial modelling [174]. This method is based on the estimation of the individual contribution of each input variable to the variance of the predicted response of the neural network. First, the ANN is trained to estimate the parameters of the model (weights and biases). Then, the sensitivity of each input variable is calculated as the variance of the response predicted by the trained ANN when all the input variables, except the one under consideration, are set to zero. Once all sensitivities are estimated, the variable with the lowest sensitivity is tentatively removed and the ANN is re-trained. If the monitoring error decreases^, the variable is deemed irrelevant to the model and is removed; otherwise, it is retained and the process continues with the next variable. At the end of this process, a subset of the initial input variables forms the new input feature set.

^Strictly speaking, the monitoring error (RMSE) may slightly increase between successive elimination trials, as the condition for eliminating a variable is RMSE(k) < r · RMSE(k−1), where r is a tolerance factor, in this case r = 1.05. Increasing this factor will result in removing more variables, at the expense of losing some variables relevant to the model. This value should not be below 1.0, otherwise generalisation might be poor.
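The sensitivity-driven stepwise elimination can be outlined as in the sketch below, where train_and_monitor is a hypothetical routine that trains an ANN on the given inputs and returns a prediction function together with its monitoring RMSE, and the sensitivity computation follows the partial-modelling idea (variance of the trained network's response when all inputs but one are zeroed).

    import numpy as np

    def stepwise_eliminate(X, y, Xmon, ymon, train_and_monitor, r=1.05):
        """Sensitivity-driven stepwise elimination of input variables.

        train_and_monitor(X, y, Xmon, ymon) -> (predict_fn, monitoring_rmse)
        is a hypothetical training routine; r is the tolerance factor of the
        footnote above.
        """
        keep = list(range(X.shape[1]))
        predict, rmse = train_and_monitor(X[:, keep], y, Xmon[:, keep], ymon)
        improved = True
        while improved and len(keep) > 1:
            improved = False
            # Sensitivity of each retained variable (partial modelling):
            # variance of the response when all other inputs are zeroed
            sens = []
            for j in range(len(keep)):
                Z = np.zeros((X.shape[0], len(keep)))
                Z[:, j] = X[:, keep[j]]
                sens.append(predict(Z).var())
            # Try removing variables, starting from the least sensitive one
            for j in np.argsort(sens):
                trial = [k for i, k in enumerate(keep) if i != j]
                p2, rmse2 = train_and_monitor(X[:, trial], y, Xmon[:, trial], ymon)
                if rmse2 < r * rmse:        # tolerated: variable is irrelevant
                    keep, predict, rmse = trial, p2, rmse2
                    improved = True
                    break                   # re-estimate the sensitivities
        return keep                         # indices of the retained variables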

In addition, Principal Component Analysis is examined as an alternative input data compression technique. Principal Component Analysis (PCA) [175] is a data dimensionality reduction scheme that is very often used with neural networks. This technique is a linear transformation that extracts characteristic features from the data while minimising the information loss. The basic principle of PCA is the representation of the data by a reduced set of unit vectors (eigenvectors). PCA is first applied to the training input vectors (the calibration matrix). Therefore, if T_{n×m} is the training samples matrix (where n is the number of training patterns and m is the number of input variables) and V_{m×m} is the principal component transformation matrix, the transformed training set of patterns is the (n × m) matrix T' = T × V. Note that the same transformation has to be applied to the set (matrix) of test patterns as well, using the transformation matrix V derived from the calibration matrix (training patterns). Usually, most of the data variance can be explained using the first few principal components (PCs) of T'. While it is usually difficult to assess the significance of the original input variables to the model, it becomes much easier to do so when the input data are preconditioned with PCA. Accordingly, a similar stepwise elimination process is applied to the input variables, which are now the PCs of the transformation of the original input data. Although PCA is often more useful when there is a large number of input variables, or when one wants to increase the ratio of the number of training samples to the number of adjustable parameters of the ANN, PCA was employed as an additional method to investigate whether it improves the prediction performance of the model. More importantly, it is also included in anticipation of future work that might involve the prediction of quality scores over longer timescales (e.g., per video scene, as discussed in Chapter 3). In that case, in order to accurately describe the distribution of the frame-level content feature values over time periods of the order of seconds or a few tens of seconds, more statistical descriptors should be calculated. For example, content feature descriptors can be generated based on the mean, standard deviation, min, max, 0.05 and 0.95 quantile values, etc., to obtain an accurate estimate of the variation of content throughout the longer timescale. In this case, a large number of inputs to the neural network emerges, with higher levels of redundancy in the data; hence, the aforementioned data compression technique is deemed appropriate, not least because training and optimisation of the network topology can be extremely time-consuming if all original inputs are retained.
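A minimal sketch of the PCA preconditioning step follows; note that the transformation matrix V is derived from the training (calibration) matrix only and is applied unchanged to the test patterns.

    import numpy as np

    def pca_precondition(T_train, T_test, n_pc):
        """Project training and test patterns onto the first n_pc principal
        components; V is derived from the calibration (training) matrix only."""
        mean = T_train.mean(axis=0)
        # Principal axes of the centred training data, via SVD
        _, _, Vt = np.linalg.svd(T_train - mean, full_matrices=False)
        V = Vt.T[:, :n_pc]
        # Apply the same transformation matrix V to both sets
        return (T_train - mean) @ V, (T_test - mean) @ V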

Besides the probable benefits that data dimensionality reduction brings to the calibration of an ANN model, there is another important reason why these techniques are considered here. The content feature selection process, discussed thoroughly in section 4.4, identified a set of features that potentially influence or determine the objective quality score of the corresponding observation period. There is, however, no a priori assertion that these features (or some of them) do indeed fulfil their anticipated role. Therefore, the above techniques constitute an indirect method of scrutinising the appropriateness of the proposed set of selected features. The ANN architecture comprises a three-layer topology (one input, one hidden and one output layer)^. The hidden layer has n_h neurons with a non-linear (tangent-sigmoid) transfer function, whereas the output layer has a linear transfer function. At this point, it should be mentioned that other types of ANNs might yield better prediction performance; however, the quest for an optimal neural network is outside the scope of this thesis. The set of input variables is obtained after the sensitivity analysis on both the original input content features and the input variables obtained from the PCA transformation, with subsequent stepwise elimination on the set of initial input variables. In order for the training samples to lie in the active range of the non-linear transfer function, they are scaled using a linear min-max mapping^. Furthermore, common practice dictates that the initial weights and biases at each layer are initialised to small non-zero values, drawn from a uniform distribution on a symmetric interval [−κ, κ], in this case κ = 0.1. A final step involves the removal of outliers from the input data: input sample values are removed as outliers if they are positioned more than three standard deviations away from the mean value of the population.

4.6 Examination of the ANN performance

A large collection of video scenes was selected for the calibration of the proposed neural network model and the assessment of its generalisation performance. The chosen set of video scenes featured a wide range of content types and complexity: different camera actions (static, panning, zooming, fades, etc.) and various levels of spatial energy and motion activity. Video frames were extracted from action movies (The Matrix, Terminator, X-Men), sports (football clips from the English Premiership) and also several short video clips from the VQEG website [24] (refer to Appendix A for details). In total, the test sequence library consisted of over 39,000 frames (approximately 6,500 S-T periods). From the whole set of 6,500 patterns, 80% were randomly chosen as the training set and the remaining 20% as the validation (test) set. One fourth of the training samples was used as the monitoring set (to facilitate early stopping) and the rest as the actual training set.

^The theoretical property of universal approximation has been proved for ANNs with only one hidden layer [176], and it is recommended that, in most cases, one hidden layer is enough in multi-variate calibration.

^For each variable vector x, the sample x_i ∈ x is scaled to x_i' = v_{\min} + \frac{(x_i - \min(\mathbf{x}))(v_{\max} - v_{\min})}{\max(\mathbf{x}) - \min(\mathbf{x})}, where v_{\min} = -1 and v_{\max} = 1 for the tangent-sigmoid nonlinearity.

The training process involved the modelling of several neural networks, one for each operating bit rate in the range R_0, ..., R_N, where R_0 = 100 Kbps, R_N = 2000 Kbps and R_{i+1} − R_i = 100 Kbps, i = 0, ..., N−1. Early stopping was employed to alleviate the risk of over-fitting, as discussed in section 4.3. Since the final ANN model is sensitive to the initial conditions, several trials were performed with different sets of initial random weights and biases, to reduce the chance that the monitoring set is also over-fitted. Ten such training trials were attempted and the set of weights and biases that led to the monitoring error value closest to the median value over the replicate trials was retained.
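The calibration protocol just described (one model per operating bit rate, ten randomly initialised trials, keeping the weights whose monitoring error is closest to the median over the trials) can be outlined as follows, with train_once standing in as a hypothetical wrapper around one early-stopped training run:

    import numpy as np

    RATES_KBPS = range(100, 2001, 100)   # R_0 = 100 Kbps, ..., R_N = 2000 Kbps

    def calibrate_models(datasets, train_once, n_trials=10):
        """One ANN per operating bit rate. For each rate, run n_trials randomly
        initialised trainings and keep the weights whose monitoring error is
        closest to the median over the trials."""
        models = {}
        for rate in RATES_KBPS:
            X, y, Xmon, ymon = datasets[rate]   # training and monitoring data
            trials = [train_once(X, y, Xmon, ymon, seed=s)
                      for s in range(n_trials)]          # (model, mon_err) pairs
            mon_errs = np.array([e for _, e in trials])
            pick = int(np.argmin(np.abs(mon_errs - np.median(mon_errs))))
            models[rate] = trials[pick][0]
        return models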

The performance of the neural network for different numbers of hidden neurons (n_h) was first examined. An upper bound on the number of hidden nodes is known to be of the order of the size of the training samples (i.e., the number of input variables) [177]. Therefore, the performance of the ANN on the monitoring set of samples was tested for various values of n_h, from sixteen down to four. At each hidden layer configuration (value of n_h), both input variable elimination techniques discussed above were followed (i.e., stepwise elimination on both the original fifteen input variables and the scores of the PCA transformation). Detailed results of this process, as well as of the subsequent performance evaluation of the ANN models, are shown in later graphs with respect to four representative encoding rate points in the range (100–2000 Kbps): 200, 500, 800 and 1500 Kbps.

Figure 4.5 plots the root mean square error (RMSE) of the ANN's responses on the monitoring set of input samples for different values of the parameter n_h, where the input samples correspond to the original content feature descriptors. The numerical labels indicate the number of retained input variables after the stepwise elimination, based on the sensitivity of each input variable to the model. In turn, Figure 4.6 depicts the monitoring RMSE, this time when the ANN's input variables are the principal components of a PCA transformation of the original input variables (content features). From these graphs, two conclusions can be drawn. First, the sensitivity elimination process on the input variables does not seem to produce an ANN model with considerably improved prediction performance. For certain hidden neuron configurations, the input reduction process results in seemingly better monitoring RMSE, but this pattern is not consistent over ANNs with the same configuration but trained with quality score targets corresponding to different encoding bit rates. This is also evident from the fact that the sensitivity elimination process retains almost all of the inputs (the number of retained inputs in each experiment is shown as a numerical label on the same graphs), which means that the content features initially selected correlate with the resulting quality score and therefore constitute a

[Figure 4.5 here: four panels (200K, 500K, 800K, 1500K) of monitoring RMSE versus n_h, before and after stepwise elimination.]

Figure 4.5: Monitoring RMSE, before and after the stepwise elimination on the original inputs (content features) to the model, at different numbers of hidden layer neurons n_h.

good choice. Secondly, the number of hidden neurons used does not significantly affect the monitoring error, although there is a favourable bias towards a configuration with more than six neurons.

Finally, Figure 4.7 compares the monitoring RMSE for the two data reduction options: sensitivity elimination (i) on the original set of content feature descriptors and (ii) on the scores

(PCs) of the PCA transformation on the original descriptors. No apparent benefit of the PCA method can be assumed based on these results. This is expected, since, as shown above, (i) there is no significant redundancy among the original content features and (ii) the number of input variables to the model is not large. The rest of the experimental results are based on an

ANN topology with the number of hidden neurons n_h = 14. Figure 4.8 plots the sensitivity of each input variable (the S-T period content feature descriptors) to the model. From this graph, it appears that the most influential variables (for a network topology with 14 hidden neurons) are inputs 1 (PelAct), 3 (EdgeAct) and 14 (MotDirChange).

The following results examine in detail the prediction performance of the developed ANN models when presented with new, unknown input data (test patterns). Figure 4.9 shows

200K 500K

c before stepwise eiiminatior before stepwise eiiminatior 6 after stepwise elimination after stepwise elimination

Hh rih 800K 1500K

-o before stepwise eiiminatior ■o before stepwise eiiminatior A after stepwise elimination A after stepwise elimination

Figure 4.6: Monitoring RMSE, before and after the stepwise elimination on the PCA scores (principal components), at different numbers of hidden layer neurons n_h.

correlation (scatter) plots of the actual objective quality scores versus the ANN predictions. On the top and right sides of each graph are the histograms of the actual and predicted scores, and the ANN prediction residual is shown at the bottom of each graph. It can be seen that the ANNs are able to successfully predict the objective quality scores: the Pearson correlation coefficient R between the predicted responses and the actual scores is as high as R_200K = 0.915,

R_500K = 0.919, R_800K = 0.901 and R_1500K = 0.875 for the four bit rate configurations examined. Some outliers can be observed; these originate from S-T periods whose content feature descriptors cannot be accurately fitted by the trained model, because similar descriptors were not present in the training set of samples. Figure 4.10 plots the distribution of the ANN's generalisation error on the test samples. For the four representative target bit rates examined in detail here, the average absolute prediction errors attained by the ANN model on the test data were 4.64, 3.89, 3.68 and 3.36, respectively^. In addition, Figure 4.11 plots the percentage of ANN output quality scores whose absolute difference from the actual quality targets is more than Δ, for various

^The error vector is defined as err = y − ŷ, where y and ŷ are the ANN target and predicted output vectors, respectively.

[Figure 4.7 here: four panels (200K, 500K, 800K, 1500K) comparing monitoring RMSE versus n_h for the two data reduction schemes (original inputs versus PC inputs).]

Figure 4.7: Comparison of the monitoring RMSE at different numbers of hidden layer neurons for the two data reduction schemes: sensitivity elimination on (i) the original inputs, (ii) the PCA scores.

values of the parameter Δ. These graphs show that the ANN achieves considerable generalisation performance: for the vast majority of ANN output quality scores, the absolute prediction error is much lower than 10 (recall that quality scores are in the [0, 100] range). The prediction error tends to decrease as the encoding bit rate increases, because quality values (the targets of the function the ANN tries to approximate) at higher encoding rates rest in the high end of the scale, allowing the neural network to regress better.
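The generalisation figures reported here (Pearson R, mean absolute error and the proportion of errors exceeding Δ) can be computed from the test-set targets and predictions with a few lines, e.g. (a sketch; y and y_hat are the target and ANN output vectors):

    import numpy as np

    def generalisation_report(y, y_hat, deltas=(2, 5, 10, 15)):
        """Pearson R, mean absolute error and tail percentages of the
        prediction residual err = y - y_hat on the test set."""
        err = y - y_hat
        r = np.corrcoef(y, y_hat)[0, 1]     # Pearson correlation coefficient
        mae = np.abs(err).mean()            # mean absolute prediction error
        # Percentage of outputs with |err| > Delta, for each Delta
        tail = {d: 100.0 * np.mean(np.abs(err) > d) for d in deltas}
        return r, mae, tail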

Thus far, the presented results have examined in detail the performance of the ANN quality prediction models with respect to the four representative encoding bit rates: 200, 500, 800 and 1500 Kbps. In turn, Figure 4.12 shows side-by-side box-plots of the distribution of the ANN prediction error for quality scores corresponding to several encoding bit rates in the considered range (100–2000 Kbps). Boxes show the inter-quartile range, with whiskers extending to the 1- and 99-percentile values of the respective samples. The graph shows that the ANN models attain a consistent generalisation performance and that the majority of prediction error values are constrained to significantly small values, an indication of good prediction capability. The numerical properties of the respective error distributions are summarised in Table 4.2.

[Figure 4.8 here: bar chart of the sensitivity of each input variable to the model, with bars for the different bit rate versions (200K, 800K, 1500K, ...).]

Figure 4.8: Sensitivity of the input variables to the model.

Table 4.2: Numerical properties of the ANN prediction error at different bit rates: mean absolute error, mean error, median, and the 0.01, 0.25, 0.75 and 0.99 quantile values.

Bit rate version   Abs. Mean   Mean    Median   x_0.01   x_0.25   x_0.75   x_0.99
200K                  4.64      0.162   0.165   -14.42    -3.49     3.39    14.81
400K                  3.88     -0.01    0.27    -13.85    -2.68     2.85    11.63
500K                  3.89     -0.03   -0.04    -13.22    -2.94     2.83    15.48
600K                  3.76      0.15    0.17    -13.83    -2.48     2.90    13.97
800K                  3.68     -0.24   -0.10    -12.83    -2.63     2.69    11.87
1000K                 3.22     -0.40   -0.05    -12.82    -2.64     1.95    10.04
1200K                 3.19      0.10    0.35    -13.60    -2.16     2.36     9.96
1500K                 3.36      0.00    0.38    -12.93    -2.26     2.63     9.52
1800K                 3.21      0.40    0.63    -13.23    -1.34     2.78    10.28
2000K                 3.50      0.04    0.37    -17.55    -2.05     2.46    10.24

4.6.1 Examination of additional overhead

The on-line quality predictor introduces two additional processing elements: the extraction and statistical summarisation (averaging) of content features inside the video codec, and the invocation of the neural network quality predictor. The overhead of the feature extraction process to the video encoder is not significant. Most of the chosen features, such as pixel activity, the sum of absolute differences (SAD) and motion vectors (apart from the edge energy), are already calculated as part of the encoding process, namely for motion estimation; hence, no additional delay occurs. Features like complexity of motion, acceleration and direction of motion are computed from the values of the motion vectors using simple statistical calculations. Calculation of edge activity in the frame is the main overhead (the

Sobel gradient involves sixteen additions per pixel). The rest of the processing cost involves the statistical summarisation of frame-level features (mean and standard deviation over the frame) and a mere calculation of the mean value for each six-sample content feature vector.

Figure 4.9: ANN prediction performance. The graphs co-plot the actual versus the predicted quality scores at different target bit rates: (a) 200K, (b) 500K, (c) 800K, (d) 1500K.

Figure 4.10: Histograms of the ANN prediction error values (200K, 500K, 800K and 1500K).

Figure 4.11: Percentage of prediction error values over $\Delta$, as a function of $\Delta$ (at 200K, 500K, 800K and 1500K).

Figure 4.12: Side-by-side box-plots depicting the range of ANN prediction residual values for each version of the ANN model (200K-2000K).

In total, the additional overhead is, on average, less than 15 ms per 6-frame activity period for CIF-size frames on a 2.2 GHz Pentium 4 processor; therefore, it does not adversely affect real-time performance (note that no attempt was made to optimise the speed of any of the above calculations). The overhead of the neural network is negligible: by nature, an ANN might require a significant amount of time to train, but the process of calculating a response involves only a small number of operations on the input variables vector.
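To see why a response is cheap, note that a query to a single-hidden-layer feedforward network costs two matrix-vector products and one activation pass. A minimal sketch (the tanh activation and the single linear output are assumptions made for illustration):

    import numpy as np

    def ann_response(x, W1, b1, W2, b2):
        # Hidden layer: one matrix-vector product plus an activation pass.
        h = np.tanh(W1 @ x + b1)
        # Linear output neuron: the predicted quality score.
        return float(W2 @ h + b2)

For fifteen inputs and on the order of ten hidden neurons, this amounts to a few hundred floating-point operations per prediction, negligible next to encoding a frame.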

4.7 Chapter summary and discussion

Real-time streaming of live video content over the Internet is subject to significant variations in perceived quality, owing to content changes in video scenes and fluctuations of the bandwidth that is available to the video encoder. Findings in psychophysical studies and subjective tests on video quality highlight the negative impact of quality variations. In this respect, the first part of this chapter outlined the architectural components of a live video streaming system that attempts to smooth such variations in quality. The method is based on a rate-quality controller that monitors the quality values of the recent past and determines a quality value for the present time so that it maintains short-term stability of the on-going quality. The novelty with respect to existing approaches is the use of an objective video quality metric to obtain continuous measures of the instantaneous quality. However, real-time encoding requirements prohibit the direct use of the quality metric. For this reason, the major part of this chapter was dedicated to the development of a real-time quality prediction mechanism based on the principles of machine learning. The main assumption for the introduction of machine learning models was that the encoding quality of video is some function of the encoding rate and the properties (features) of the original video content.

Specifically, the use of feedforward, multi-layer artificial neural networks was proposed to approximate this unknown function. The first step was the selection of appropriate features of content activity that are considered influential to the end-quality result and can be efficiently extracted from the original video frames as part of the encoding process, without incurring a considerable computation cost. This step resulted in a set of fifteen content features that measure several aspects of the spatial and temporal activity within a small number (six) of successive frames. Several ANNs (one for each operating bit rate point in the range) were calibrated using an extensive set of frames from numerous video scenes. The training of the ANNs also involved an input sensitivity analysis and variable reduction method, to examine (i) whether some of the input variables can be removed from the input data if this achieves better prediction performance, and (ii) whether the content features initially selected are correlated with the corresponding quality. This process revealed that there was no consistent gain from a sensitivity elimination process and that the set of features described in section 4.4 provides a reasonable set of parameters that influence quality. Additional results on the impact of the number of hidden neurons suggested that the examined configurations provide sufficient and equivalent prediction performance, with a certain bias towards a higher number. Most of the insensitivity of the ANN models to a data reduction technique, as well as to the number of hidden neurons, can be explained by the fact that the cardinality of the input vectors is not very high. However, the examined methodologies are considered useful when the number of inputs to the model increases, for example, when one desires to obtain quality scores over a larger number of frames (e.g., on a per video scene basis, as discussed in Chapter 3). In this case, more statistics on the frame-level content features than just the mean value should be gathered, in order to more accurately represent the distribution of feature values over longer time-frames.

Through a number of validation tests using new, unknown test data, the proposed ANN-based quality estimation method showed significant generalisation performance, with a mean absolute prediction error ranging between 3.0 and 5.0 (recall that quality scores fall in the [0, 100] range). The generalisation performance of the proposed ANN architecture was shown to be consistent over the various ANNs, which were trained to provide predictions of quality scores at various encoding bit rates in the considered range (100-2000 Kbps). Based on the experimental findings of this chapter, the main contributions can be summarised as follows:

• We demonstrated that ANNs provide a reliable technique to obtain accurate predictions of the instantaneous quality by ‘learning’ the relationship between features of the video content, the candidate encoding bit rate and the resulting encoding quality.

• We presented a set of video content features to describe the spectrum of the underlying spatial and temporal activity of video content. More importantly, the extraction of content features comes as part of the encoding process, therefore it does not impede the real-time performance of the system.

• We showed that the use of machine learning techniques, and specifically artificial neural networks, presents a reliable complementary tool to objective evaluation models when online, real-time quality monitoring is required.

Chapter 5

Smooth quality rate adaptation using a fuzzy controller

This chapter presents an adaptation technique for encoding and streaming real-time, live video, so that it exhibits a smooth evolution of quality. At first, the operational enhancements that can be introduced to the underlying rate-quality controller are identified. In short, the functionality of the rate-quality controller is to deduce suitable encoding bit rates that result in a desired, smooth, continuous quality. The question this chapter answers is how the system can derive appropriate values of consistent quality and, at the same time, account for the interactions among the parameters of a typical live video streaming system: real-time performance, available transmission rate, and the status of sender and receiver buffers over time. Based on these requirements, a controller built on the principles of fuzzy logic is proposed to determine continuous values of the target quality. The advantages of this method are argued and results that demonstrate the effectiveness of the proposed scheme are presented.

5.1 Estimation of encoding rate for smooth perceived quality

Chapter 4 presented a mechanism that can provide accurate predictions of the instantaneous objective quality in real-time, based on the candidate encoding bit rate and content feature descriptors extracted from the uncompressed video data in real-time. Since video transmission over a best-effort Internet is assumed, the transmission rate of the video flow is determined by a TCP-friendly congestion control algorithm. Section 4.1 highlighted the main problem of TCP-friendly encoding of live video: a time-variable transmission rate and the bursty nature of video content result in undesired oscillations in video quality. These short-term quality variations are due to mismatches between the bit rate required to maintain a stable level of quality and the bandwidth available to the TCP-friendly stream. As a result, a stable quality level cannot be sustained for long enough periods, leading to drops of the quality value. At other times, it may result in a quality increase that will only be preserved for a limited time, before dropping again. In the design of a quality-aware rate adaptation, a more conservative approach that avoids driving the quality to extreme, short-lived high and low levels is preferable. This is in agreement with studies in subjective quality assessment; higher variation of quality over time leads to worse perceived quality [108]. Other interesting phenomena that appear when human viewers assess video material include:

1. There is evidence that viewers react with higher time constants to positive changes in quality than to negative changes. In other words, viewers are quick to criticise but slow to forgive; thus, a drop in quality has a higher perceptual impact than a rise in quality of equivalent size [150].

2. In accordance with claim (1) above, research seems to show that the perceived quality during an evaluation interval is primarily determined by the worst perceivable impairments [92].

3. Also, there is a memory effect in human ratings, lasting several seconds, whereby the quality levels of the recent past influence the overall subjective opinion [178, 149].

These issues are addressed accordingly later in this section, where relevant design decisions are made. Short-lived increases or drops of the instantaneous quality, which create most of the annoying quality variation, can be smoothed out by appropriately manipulating the instantaneous encoding rate. Recall from Chapter 4 that $Q_{tcpf}(t)$ denotes the instantaneous quality during time (S-T period) $t$, derived when the encoding rate, $R_{enc}(t)$, is set to the rate the stream is allowed to transmit, $R_{tcpf}(t)$ (the TCP-friendly rate). The rate-to-quality mapping is obtained in real-time using the ANN predictor described in the previous chapter. A desirable rate-quality controller should continuously vary the encoding rate so that the output video quality exhibits smaller variation, by being insensitive to transient, short-lived perturbations in the value of $Q_{tcpf}$. The aim of the controller is to produce a smooth representation of the recent $Q_{tcpf}$ values, which is denoted by $Q_{target}$. At the same time, the rate estimator has to be responsive to occasions where quality exhibits consistent, long-term changes from one quality level to another and stays there for a significant time (e.g., due to a step change in the steady-state available bandwidth). Failure to do so will lead to undesirable system behaviour that either underachieves, by encoding at lower quality levels than possible, or overshoots quality, resulting in unresponsiveness to the TCP-friendly rate (i.e., the encoding rate is consistently higher than the TCP-friendly rate). The basis of the approach is to calculate the smoothed quality value using an exponentially weighted moving average (EWMA) of $Q_{tcpf}$:

$$Q_{target}(t) = \alpha \cdot Q_{target}(t-1) + (1 - \alpha) \cdot Q_{tcpf}(t), \quad \alpha \in [0, 1] \qquad (5.1)$$

and set $Q_{target}$ as the target quality of the S-T period $t$. The reason for choosing an EWMA filter as the quality smoothing strategy is that the (smooth) encoding quality produced by the controller has to inevitably follow the trend of $Q_{tcpf}$, in order to maintain long-term transmission stability. The parameter $\alpha$ controls the predictor; it represents the weight of the recent history of quality values in the estimation of the current value of $Q_{target}$. EWMA predictors are quite simple, but the main design difficulty is the choice of the weight $\alpha$. Given that, in practice, the variation of $Q_{tcpf}$ is unknown, setting $\alpha$ to a high value leads to successful elimination of large variations but lacks responsiveness to transient changes, while a small value fails to provide smooth values for the target quality. The desired approach is to be able to determine the value of $\alpha$ on-line, according to the changes of $Q_{tcpf}$. The following sections introduce a fuzzy logic controller to dynamically calculate appropriate values for $\alpha$.
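For concreteness, the smoothing step of equation (5.1) is a one-line update; the controller's only task, developed below, is to choose $\alpha$ anew at every S-T period. A minimal sketch (names are illustrative; fuzzy_alpha stands for the controller of section 5.2):

    def smooth_target_quality(q_target_prev, q_tcpf, alpha):
        # Equation (5.1): alpha weighs the recent history of target quality
        # against the new instantaneous sample Q_tcpf(t).
        return alpha * q_target_prev + (1.0 - alpha) * q_tcpf

    # Per S-T period, with alpha supplied by the fuzzy controller:
    # q_target = smooth_target_quality(q_target, q_tcpf,
    #                                  fuzzy_alpha(error, buflev))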

5.2 Fuzzy adaptive quality smoothing

Fuzzy logic was introduced by Zadeh [179, 180] to describe vagueness in system behaviour, where a variable or a parameter exhibits a gradual transition between different states. In such cases, the assertion that a parameter is in a specific state can be, to a certain degree, both true and false. A fuzzy logic system can simultaneously handle numerical data and empirical knowledge of the system's properties or desired performance, expressed in linguistic terms, and represent linear and non-linear mappings of an input data vector into a scalar output. The usefulness of such an approach lies in the fact that imprecisely or roughly defined 'classes' or properties that directly reflect human knowledge or experience can be incorporated into the design of a system. In general, a fuzzy system has inputs $u_i \in \mathcal{U}_i$ $(i = 1, \ldots, n)$ and outputs $y_i \in \mathcal{Y}_i$ $(i = 1, \ldots, m)$. These ordinary (crisp) sets $\mathcal{U}_i$ and $\mathcal{Y}_i$ are called the universe of discourse, or universe, for $u_i$ and $y_i$ and, in practice, they are sets of real numbers. Since the operation of a fuzzy system is based on linguistic descriptions, linguistic expressions are used for the inputs and output(s) and their characteristics. Therefore, associated with each input and output are linguistic variables that describe the fuzzy system's inputs and outputs. Just as $u_i$ and $y_i$ take on values from the universes $\mathcal{U}_i$ and $\mathcal{Y}_i$, the respective linguistic variables, $\tilde{u}_i$ and $\tilde{y}_i$, take on linguistic values that describe characteristics of the variables. Linguistic values are often given names that match adjectives in a descriptive manner, like 'low' or 'high'. Each linguistic value $\tilde{A}_j$ of $\tilde{u}_i$ is

in turn associated with a fuzzy set. In a fuzzy set, its members (elements of a universe $\mathcal{U}$) do not have an exclusive, either-or membership or crisp value like in conventional sets, but a grade of membership. Every element of the universe $\mathcal{U}$ is a member of a fuzzy set $F$ to some degree (even zero) and, for every element $u \in \mathcal{U}$, a function called the membership function, $\mu_F(u)$, associates a number that represents the degree of membership of that element in the fuzzy set $F$. In other words, the membership function describes the certainty that the element $u$, with linguistic description $\tilde{u}$, takes on the respective linguistic value. Thus, a fuzzy set $F$ is a set of ordered pairs:

$$F = \{(u, \mu_F(u)) : u \in \mathcal{U}\}, \quad \mu_F : \mathcal{U} \to [0, 1].$$

In practice, a universe of discourse is usually partitioned into a number of fuzzy sets whose membership functions cover it in a more or less uniform manner. Depending on the requirements, a membership function can have different shapes. Figure 5.1 draws some typical shapes of membership functions. The behaviour of the system is controlled by introducing a set of linguistic rules, mainly in an if-then format, that associate fuzzy input and output variables. A rule conveys the knowledge of an expert about the system and its desired response, expressed in a linguistic manner, and contains a series of logical connectives between linguistic variables:

if $\tilde{u}_1$ is $A_1$ and $\tilde{u}_2$ is $A_2$ and ... and $\tilde{u}_n$ is $A_n$ then $\tilde{y}_q$ is $B_q$,

where $\tilde{u}$ and $\tilde{y}$ represent linguistic variables and $A$ and $B$ linguistic values, respectively. It is a set of rules in this form that the expert or the system designer has to specify in order to describe how to control the system. The theory of fuzzy control is based on the principles briefly outlined above; further introductory details can be found in [179, 180, 181, 182]. As an example, assume the universe of discourse to be the range of ages of a person. One may define three gradations for the linguistic variable age, which takes on the linguistic values 'young', 'middle-aged' and 'old', which signify age groups. Figure 5.2 shows an example of the corresponding membership functions that map the real age to one or more linguistic characterisations. A simple fuzzy controller involves a fuzzifier that converts a crisp input value to a degree of membership (the numeric value of its membership functions). An inference engine evaluates the membership values of all fuzzy variables by combining the corresponding rules and mathematically maps (infers) input fuzzy sets to an output fuzzy set. In other words, it determines which rules are 'fired'. The inference mechanism will then seek to combine the recommendations of all the active rules to come up with a single output fuzzy set. The output fuzzy set has to be converted to a numerical value, which is the output control signal.

Figure 5.1: Examples of membership functions: (a) s-function, (b) π-function, (c) z-function, (d-f) triangular versions, (g-i) trapezoidal versions, (j) flat π-function, (k) rectangle, (l) singleton.

Figure 5.2: An example of three membership functions for the linguistic variable 'age'.

This operation is called defuzzification, and the output fuzzy set is thus defuzzified into a crisp control signal. There are several methods to do this; the most common one is to calculate the centre of gravity, where the crisp output value $y$ is the abscissa under the centre of gravity of the output fuzzy set:

$$y = \frac{\sum_i \mu(y_i) \cdot y_i}{\sum_i \mu(y_i)},$$

which is, in effect, the weighted average of the elements of the output fuzzy set that have non-zero membership (also called the support of a fuzzy set). For continuous membership functions, the sums in the above expression are replaced by integrals.
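Over a discretised output set, centre-of-gravity defuzzification is a few lines (an illustrative sketch, not code from this work):

    def defuzzify_cog(points, memberships):
        # Weighted average of the output support points, weighted by their
        # membership grades; for continuous sets the sums become integrals.
        num = sum(y * mu for y, mu in zip(points, memberships))
        den = sum(memberships)
        return num / den if den > 0.0 else 0.0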

Figure 5.3: Main components of a typical end-to-end live video streaming system (live source, encoder, send buffer, transmitter, network, receiver buffer, decoder, display).

Following the discussion in the previous section, it seems that, while the selection of a proper value for $\alpha$ might be simple for the extreme cases (small or large changes of $Q_{tcpf}$), it is not as simple to make decisions for the intermediate values in the range of changes, or to decide how changes of the input parameters should affect the value of the designated output parameter $\alpha$.

The situation becomes more complicated when the effect of the choice of a quality value on the level of occupancy of the sender and receiver buffers is also considered. This work proposes the use of a fuzzy controller [183] to continuously update the parameter $\alpha$ of the exponentially weighted filter so that it responds to changes of $Q_{tcpf}$. The reasoning behind this approach is that, in this way, the system's desired behaviour can be described with a few intuitive, linguistic variables. Fuzzy controllers are particularly useful when there is a system whose behaviour is difficult to derive or represent accurately by an analytical model, and when its approximate behaviour can be characterised qualitatively, based on human experience and intuitive requirements. Therefore, interactions between all involved parameters (magnitude of quality change, occupancy level of sender and receiver buffers) in the live streaming system can be easily integrated. The fuzzy controller is described in more detail below, where the design decisions taken are argued. The ensuing analysis is based on a typical configuration of a live video streaming system, as shown in Figure 5.3. Video frames from a live video programme (from a satellite feed, digital camera or disc storage) constantly feed the encoding engine. The encoded bitstream is packetised into IP packets, which are then sent to the destination (streaming client). In order to accommodate differences between the encoding and transmission rates, a sender buffer usually resides between the packetiser and the transmitter (encoded video data may also be kept in the sender buffer in case the client application requests retransmission of lost packets). Received video packets are also placed in a receiver buffer before they are decoded and presented on the user display. Such buffering is essential to reduce the impact of transmission variations (variations of the transmission rate and delay) and provide continuous playback for the user. In broadcast-like streaming, there is a delay between the point the live video is encoded and when it is played back at the receiver. This delay guarantees that there is enough data buffered at the sender and receiver buffers. Since interactivity constraints are relaxed, this playback delay can be in the order of seconds or tens of seconds [26, 184]. What constitutes an acceptable playback delay for such applications is hard to devise; certain guidelines [185] recommend a figure around ten seconds for pre-recorded material. For streaming of live material, current experience shows that this time may even be up to a few tens of seconds. Initially, uncompressed frames enter the encoding engine at a constant rate (e.g., 25 fps or 30 fps) and encoded data fill the sender buffer for a period of $T_s$ sec. Data are then transmitted from the sender buffer and placed in the receiver-side playout buffer. After a further $T_r$ sec, normal video playback commences and encoded frames from the receiver buffer are forwarded to the decoder to be subsequently displayed. Therefore, there is a total playback delay (due to buffering) $D = T_s + T_r$. Denote $B_s(t)$ and $B_r(t)$ the amount of data (in bits) that reside in the sender and receiver buffers, respectively, at time instance $t$. The total amount of buffered data represents a fixed amount of media time, but the distribution of this media time (as well as data bits) between the receive and send buffers changes, depending on the values of the encoding and transmission rates over time (there is also an amount of outstanding data in transit within the network, which is assumed to be significantly smaller than the buffered data). Since rate adaptation decisions happen on a S-T period basis, the streaming server can quite accurately track the status of both sender and receiver buffers, by continuously updating $B_s$ and $B_r$:

$$B_s(t) = B_s(t-1) + T \cdot (R_{enc}(t) - R_{tcpf}(t)), \quad t > T_s \qquad (5.2)$$

$$B_r(t) = B_r(t-1) + T \cdot (R_{tcpf}(t) - R_p(t)), \quad t > D \qquad (5.3)$$

where $T$ is the duration of the adaptation period (in sec) and $R_p(t)$ is the rate at which data from the receive buffer are consumed by the decoder (video playout rate). As the playback delay is fixed at $D$ sec, $R_p(t) = R_{enc}(t - D)$. Equation (5.3) for the calculation of the receiver buffer size is considered quite accurate at low packet loss rates. Higher packet loss rates can, however, contaminate the accuracy of this estimate. Alternatively, the sender may use feedback from the client application about the level of the receiver buffer in order to calculate a more accurate value for $B_r(t)$. The value of the playout delay $D$ may also vary over time if adaptive playout delay techniques are employed; this does not, however, impede the operation of the system, as long as the server maintains updated values for $B_s(t)$ and $B_r(t)$.

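The server-side bookkeeping of equations (5.2) and (5.3) amounts to two one-line updates per adaptation period, sketched below (illustrative Python; the variable names are assumptions):

    def update_buffers(B_s, B_r, R_enc, R_tcpf, R_play, T):
        # Sender buffer (5.2): fills at the encoding rate and drains at the
        # transmission rate.
        B_s += T * (R_enc - R_tcpf)
        # Receiver buffer (5.3): fills at the transmission rate and drains at
        # the playout rate, R_play = R_enc(t - D) for a fixed delay D.
        B_r += T * (R_tcpf - R_play)
        return B_s, B_r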

Figure 5.4: Histogram of quality change (error) values over subsequent S-T periods (1-percentile = -24.0, 99-percentile = 14.34).

The first steps towards designing the controller are to determine the inputs and outputs of the system, associate them with linguistic variables, identify appropriate linguistic values for the different gradations of these variables and, finally, define the membership functions of the resulting fuzzy sets. Then, linguistic rules that govern the desired behaviour of the controller are introduced.

Observe that the change of the $Q_{tcpf}$ value between successive S-T periods, $\Delta Q_{tcpf} = Q_{tcpf}(t) - Q_{tcpf}(t-1)$, represents the level of encoding quality variation that we want to curtail. $\Delta Q_{tcpf}$ can be negative (if $Q_{tcpf}$ decreases), zero, or positive (when $Q_{tcpf}$ increases). It is both the sign and the magnitude of $\Delta Q_{tcpf}$ that determine the type and level of quality variation. Therefore, a fuzzy input variable quality error, or error, is associated with $\Delta Q_{tcpf}$, and five gradations are defined (linguistic values that error takes on): neglarge, negsmall, zero, possmall, and poslarge. The associated fuzzy variable error is scaled down to the [-1, 1] range using the following procedure. The value of a quality change, $\Delta Q_{tcpf}$, is theoretically in the range [-100, 100] (recall that quality values range between 0 and 100). The scaling process is influenced by an important observation: it appears very unlikely that $\Delta Q_{tcpf}$ values span the whole range of theoretical values, as shown, for example, in Figure 5.4, which plots a histogram of error values from a 2840-frame sequence (sequence: The Matrix, encoded at 400 Kbps). The majority of error magnitude values are confined to the [-20, 20] range, and the occasional larger values only occur when there is an abrupt scene cut in the video content between two S-T periods. This observation was asserted by performing similar experiments with several other video sequences and encoding rates. Thus, a scaling based on the width of the absolute error range (100) would most likely produce a highly non-uniform distribution of error, where values are gathered close to zero, instead of spanning the whole [-1, 1] range in a more uniform way. This may lead the fuzzy controller to undesirable behaviour, as the fuzzy variable error never takes values from the whole range of the universe of discourse. To tackle this, error values are scaled relative to a 'large quality error' value, denoted $LE$; therefore, $error(t) = \Delta Q_{tcpf}(t)/LE$. If, occasionally, $|\Delta Q_{tcpf}| > LE$, then error is set to -1 or 1 accordingly. For the experiments presented hereafter, a value of $LE$ equal to 20 was used, based on the above observation (Figure 5.4) and also because this value represents one grade on the range of quality scores in a 5-point MOS scale. Nevertheless, the value of this parameter can be set to another appropriate value that can be obtained from the subjective significance of a quality change based on user trials.

While the fuzzy controller is designed to manipulate the encoding rate so that $Q_{target}$ follows the trend of $Q_{tcpf}$, mismatches between the corresponding rates are commonplace. Such disparities can be accommodated by introducing both a sender and a receiver-side buffer and allowing for an initial playback delay to occur, in order to fill these buffers with a sufficient amount of data. With respect to the send and receive-buffer sizes, the difference between the encoding and transmission rates has the following effects: if $R_{enc}$ is higher than $R_{tcpf}$, then the sender buffer size increases; if equal, it remains unchanged; otherwise, it decreases. Similarly, the receiver buffer fills, remains unchanged or drains when the transmission rate is, respectively, higher than, equal to or lower than the video data playout rate. Therefore, the number of bits distributed between the two buffers reflects the relationship between the encoding and transmission rates and playback continuity. It is important that, besides providing a stable encoding quality, buffers are kept well away from underflow situations. While a sender buffer underflow can be prevented by transmitting at a lower rate than $R_{tcpf}$, this is usually unwelcome, as it will then take the TCP-friendly protocol some time to climb back to the flow's fair share. Receive buffer underflows are more undesirable, as they lead to temporary breaks in video playout. The degree of distribution of buffered video data between the sender and receiver buffers can be represented using the variable buffered data balance level (buflev):

$$buflev = \frac{B_r}{B_r + B_s}. \qquad (5.4)$$

This expression is a convenient way to establish whether the sender or receiver buffer level is low; if the receiver buffer runs low ($B_r \to 0$) then $buflev \to 0$, and if the sender buffer approaches underflow levels ($B_s \to 0$), $buflev \to 1$. Therefore, the system can determine how encoded data is distributed between the two buffers at any time, monitor both buffers for running at low levels and react accordingly.

Figure 5.5: The shape of the membership functions for all fuzzy sets for the controller's inputs (error and buflev) and output (parameter $\alpha$; output singletons: small = 0.4, medium = 0.7, large = 0.96).

The value buflev forms the second input to the fuzzy controller, and a linguistic variable with the same name is associated with it. This fuzzy variable takes on three linguistic values: low, medium and high. These gradations are enough to describe the different states of both buffers; adding further gradations does not present any obvious advantage and introduces unnecessary complexity. A fuzzy value 'low' means that the receiver buffer runs low, a 'high' value that the sender buffer's level is low, while a 'medium' value means that there is enough data distributed, more or less evenly, between the two buffers.

Summarising, the fuzzy controller comprises two inputs, $error \in [-1, 1]$ and $buflev \in [0, 1]$, and one output, $\alpha \in [0, 1]$. The linguistic variable error is an indication of the magnitude of change of the short-term quality, while buflev shows whether enough data reside in both the sender and receiver buffers. Figure 5.5 depicts the membership functions of the fuzzy linguistic values for all three fuzzy variables. Standard triangular fuzzy sets were used, thus three parameters for each membership function had to be specified. The performance of several values for the parameters of all membership functions involved was examined, and Figure 5.5 depicts the ones that were found to perform best with respect to the following desired system behaviour: to calculate suitable values of the weight $\alpha$, so that the target quality the controller recommends smoothly follows $Q_{tcpf}$, if the amount of buffered data allows this. The rationale behind deriving the support of each fuzzy set and, consequently, the ranges of the corresponding membership functions as depicted in Figure 5.5, was as follows. With respect to the fuzzy input parameter error, the parameters of the five membership functions were selected based on intuitive yet reasonable assumptions on the perceptual impact of the magnitude of change in quality values. The range of the fuzzy set zero is [-0.2, 0.2], which is set to correspond to an undetectable quality change value, which is between -4 and 4 (recall that error is associated with the scaled value of $\Delta Q_{tcpf}$, using $LE = 20$, and that quality values are in the [0, 100] range). Additionally, an absolute quality change that is less than 8 is considered to represent a just noticeable quality disturbance; therefore, the ranges of the fuzzy sets negsmall and possmall, which represent negative or positive changes in quality, are [-0.4, 0] and [0, 0.4], respectively. Finally, scaled quality change values that are below -0.2 or over 0.2 are considered to constitute a considerably noticeable fluctuation in quality, and the ranges of the fuzzy sets corresponding to the fuzzy values 'neglarge' and 'poslarge' are [-1, -0.2] and [0.2, 1], respectively. Naturally, there is an overlap in the ranges of the various membership functions, as shown in Figure 5.5, which is necessary to represent the fuzziness in the membership value of error. As mentioned, the proposed parameters of the above membership functions are, to some extent, dictated by common-sense assumptions on how noticeable a quality change is to the human viewer. Nevertheless, these membership functions can be altered as a result of subjective experiments that answer the question of the relationship between the numerical value of a quality change and the degree to which this change is noticeable by the user. The shape of the membership functions of the fuzzy input buflev is based on the assumption that a buffer that is less than 25% full is considered as being close to underflow. Finally, the membership functions of the output are represented as singleton values. The main reason for this is that, with fuzzy singletons, it is possible to drive the control signal (the weight $\alpha$) to its extreme values; this property is extremely desirable in our controller, for cases where only rules that result in a 'high' $\alpha$ are fired. Maintaining stable short-term quality requires that $\alpha$ takes on its highest value (0.96), and singleton output fuzzy sets guarantee this. Furthermore, singleton fuzzy sets result in simpler computation of the output.

The remaining part of the controller yet to be defined is the set of linguistic rules that govern its behaviour. In principle, the controller is designed to be resistant to changes in $Q_{tcpf}$, by opting for a high $\alpha$ value when not restricted by either of the buffers running low. Note here the effect that the value of $Q_{target}$ with respect to $Q_{tcpf}$ has on the size of the send and receive buffers: when $Q_{target}$ is higher than $Q_{tcpf}$, the media encoding rate is consequently higher than the transmission rate; therefore, the send buffer fills but the receive buffer drains. The opposite happens when $Q_{target}$ has a lower value than $Q_{tcpf}$: encoded video data is transmitted at a faster rate than it is produced, emptying the send buffer (recall that we cannot feed the encoder at a faster rate than the capture rate of the original video frames). The following cases outline what linguistic values $\alpha$ should take on, based on the linguistic values of the inputs:

• If error is neglarge (high drop of $Q_{tcpf}$) and buflev is low ($B_r \to 0$), then a small value for parameter $\alpha$ is appropriate, so that the target quality drops in order to follow $Q_{tcpf}$. Doing otherwise means that the video is encoded at a consistently higher rate than what the stream is allowed to transmit, which further drains the receiver buffer. When, on the other hand, there is enough data buffered at the receiver (buflev is medium or high), the weight $\alpha$ can be kept large, to maintain stable quality and smooth out an undesirable drop in quality.

• When there is a significant, yet not very large drop of $Q_{tcpf}$ (negsmall) and buflev is low, a medium value for $\alpha$ is chosen. For other levels of buflev, $\alpha$ is kept at a large value.

• When $Q_{tcpf}$ changes only slightly between successive S-T periods (error is fuzzy zero), a stable $Q_{target}$ can be maintained by selecting a high $\alpha$, without adversely affecting the size of either buffer.

• When $Q_{tcpf}$ increases noticeably, but not substantially (error is possmall), the following policies are defined: if buflev is low or medium, $\alpha$ preserves a high value and $Q_{target}$ is kept stable (this also fills the receiver buffer when its size is low). When, however, buflev is high (i.e., the sender buffer occupancy is low), $\alpha$ takes on a medium value to avoid an underflow situation.

• Similar decisions are taken when there is an ample increase in $Q_{tcpf}$ (error is poslarge) and there is enough data in the sender buffer (buflev is medium or high). In this case, a large $\alpha$ value is preferable. The rationale behind this decision is that it is not desirable for $Q_{target}$ to swiftly increase to a much higher value due to temporary favourable circumstances (either a sudden increase of the available bandwidth or a period where the video content is very simple), only to be subsequently dropped after a short time. Nevertheless, in the case where a high encoding quality can be sustained, $Q_{target}$ will follow this trend, only at a slower pace (recall that a human observer does not appreciate an increase in visual quality to the same degree as a quality drop of similar magnitude is penalised). An exception in this case is allowed: when buflev is high ($B_s \to 0$) and, therefore, the sender buffer is at risk of draining, $\alpha$ takes on a small value in order to increase the encoding rate (and thus the rate at which data fills the sender buffer).

Notice how the system's behaviour is described using adjectives and adverbs (e.g., low, high, slightly, substantially) to signify the value or the state of an involved parameter. Here lies

Table 5.1: Linguistic values of the control parameter $\alpha$ as a function of the fuzzy variables error and buflev.

                        error
buflev      neglarge   negsmall   zero    possmall   poslarge
low         small      medium     large   large      large
medium      large      large      large   large      large
high        large      large      large   medium     small

the flexibility of the fuzzy controller. No strict numerical values or boundaries, or analytical models that describe the input-to-output relationship, are required. Such relationships are construed using adjectival descriptions of the different states of the variables and simple, linguistic descriptions of these relationships. Based on the above cognitive rules, a total of fifteen rules is generated (also summarised in Table 5.1):

1. if error is neglarge and buflev is low then $\alpha$ is small
2. if error is neglarge and buflev is medium then $\alpha$ is large
3. if error is neglarge and buflev is high then $\alpha$ is large
4. if error is negsmall and buflev is low then $\alpha$ is medium
5. if error is negsmall and buflev is medium then $\alpha$ is large
6. if error is negsmall and buflev is high then $\alpha$ is large
7. if error is zero and buflev is low then $\alpha$ is large
8. if error is zero and buflev is medium then $\alpha$ is large
9. if error is zero and buflev is high then $\alpha$ is large
10. if error is possmall and buflev is low then $\alpha$ is large
11. if error is possmall and buflev is medium then $\alpha$ is large
12. if error is possmall and buflev is high then $\alpha$ is medium
13. if error is poslarge and buflev is low then $\alpha$ is large
14. if error is poslarge and buflev is medium then $\alpha$ is large
15. if error is poslarge and buflev is high then $\alpha$ is small
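Putting the pieces together, the whole controller fits in a short routine: fuzzify the two inputs, fire the fifteen rules, and defuzzify the singleton consequents with a weighted average. The sketch below is illustrative only; the set supports and the singleton values (0.4, 0.7, 0.96) follow Figure 5.5, but the peak positions of the shoulder sets and the use of min as the 'and' connective are assumptions.

    def tri(x, a, b, c):
        # Triangular membership with feet at a and c and peak at b; setting
        # a == b (or b == c) yields a left (right) shoulder set.
        x = min(max(x, a), c)
        if x == b:
            return 1.0
        return (x - a) / (b - a) if x < b else (c - x) / (c - b)

    ERROR_SETS = {
        'neglarge': lambda e: tri(e, -1.0, -1.0, -0.2),
        'negsmall': lambda e: tri(e, -0.4, -0.2, 0.0),
        'zero':     lambda e: tri(e, -0.2, 0.0, 0.2),
        'possmall': lambda e: tri(e, 0.0, 0.2, 0.4),
        'poslarge': lambda e: tri(e, 0.2, 1.0, 1.0),
    }
    BUFLEV_SETS = {
        'low':    lambda b: tri(b, 0.0, 0.0, 0.25),
        'medium': lambda b: tri(b, 0.0, 0.5, 1.0),
        'high':   lambda b: tri(b, 0.75, 1.0, 1.0),
    }
    ALPHA = {'small': 0.40, 'medium': 0.70, 'large': 0.96}  # output singletons

    RULES = {   # Table 5.1: rows are buflev, columns are error
        'low':    {'neglarge': 'small', 'negsmall': 'medium', 'zero': 'large',
                   'possmall': 'large', 'poslarge': 'large'},
        'medium': {'neglarge': 'large', 'negsmall': 'large', 'zero': 'large',
                   'possmall': 'large', 'poslarge': 'large'},
        'high':   {'neglarge': 'large', 'negsmall': 'large', 'zero': 'large',
                   'possmall': 'medium', 'poslarge': 'small'},
    }

    def fuzzy_alpha(error, buflev):
        # Fire all fifteen rules and defuzzify the singleton outputs with a
        # weighted average (the centre of gravity for singleton sets).
        num = den = 0.0
        for bname, bfun in BUFLEV_SETS.items():
            for ename, efun in ERROR_SETS.items():
                w = min(efun(error), bfun(buflev))   # rule firing strength
                num += w * ALPHA[RULES[bname][ename]]
                den += w
        return num / den if den > 0.0 else ALPHA['large']

For example, a large negative error with a nearly empty receiver buffer fires mostly the 'small' singleton and drives $\alpha$ down, whereas a stable quality series keeps $\alpha$ near its maximum of 0.96.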

The relationship between the two inputs, error and buflev, and the output of the controller (the parameter $\alpha$ of the EWMA filter) is shown in Figure 5.6, which plots the control surface of the fuzzy inference system.

Figure 5.6: The control surface of the fuzzy inference system that determines parameter $\alpha$.

5.3 Performance of the fuzzy quality controller

Recall the architecture of the proposed smooth quality rate controller described in section 4.2 and illustrated in Figure 4.2. The task of the fuzzy rate-quality controller is to continuously monitor, at every S-T period $t$, $Q_{tcpf}(t)$ and the relative occupancy of the two buffers, $buflev(t)$, in order to determine a smooth value for the target quality, $Q_{target}(t)$. It then utilises the ANN predictor to obtain the appropriate encoding rate, $R_{enc}(t)$, that achieves $Q_{target}(t)$. Note that the ANN is not trained to provide predictions at every possible rate. Instead, once the range of operating bit rates for the specific application has been chosen, it can be sampled at $N$ closely distanced points to obtain a set of rates in increasing order, $R_0, R_1, \ldots, R_N$, where the distance between the sampling points is $R_{step}$. The ANN is only trained at these 'sampled' rate points, and the same rate points are used for prediction. So, by performing a small number of iterative invocations of the ANN, the rate controller performs the simple task of finding the index $i \in \{0, \ldots, N-1\}$ such that $Q_{R_i} \le Q_{target} \le Q_{R_{i+1}}$. Assuming that $Q_R$ is an increasing function of $R$, $R_{enc}$ is then found by interpolating $Q$ between $R_i$ and $R_{i+1}$. After interpolation, $Q_{R_{enc}} \approx Q_{target}$. The interpolation error becomes insignificant by choosing reasonably small values of $R_{step}$. The server's estimate of the sender and receiver buffer sizes is updated using expressions (5.2) and (5.3).
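This bracketing search and linear interpolation can be sketched as follows (illustrative code; ann_quality stands for the trained per-rate ANN predictor and is a hypothetical helper):

    def encoding_rate_for_target(q_target, rates, ann_quality):
        # Predicted quality Q_R at each sampled rate point R_0..R_N.
        q = [ann_quality(r) for r in rates]
        if q_target <= q[0]:
            return rates[0]
        for i in range(len(rates) - 1):
            # Find the bracketing pair and interpolate Q linearly in R.
            if q[i] <= q_target <= q[i + 1]:
                if q[i + 1] == q[i]:
                    return rates[i]
                frac = (q_target - q[i]) / (q[i + 1] - q[i])
                return rates[i] + frac * (rates[i + 1] - rates[i])
        return rates[-1]   # target above the highest predicted quality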

When there is a drop in $Q_{tcpf}$, the controller calculates an encoding rate $R_{enc}$ that needs to remain higher than the currently available transmission bit rate $R_{tcpf}$ in order to retain a stable target quality. The ratio of the maximum value the encoding rate is allowed to take to the current transmission rate is called the bandwidth overshoot ratio, ratio:

Table 5.2: Various parameters of the fuzzy rate-quality controller system.

Symbol                    Description
$Q_{tcpf}$                S-T quality when the encoding rate is set to the transmission rate
$Q_{target}$              Target quality that smoothes $Q_{tcpf}$
$Q_R$                     ANN quality prediction when the encoding rate is $R$
$B_s$                     Send buffer size (bits)
$B_r$                     Receive buffer size (bits)
$R_{enc}$, $R_{tcpf}$     Encoding and (TCP-friendly) transmission rates
$R_{step}$                Distance between the encoding rate sampling points $R_0, \ldots, R_N$
ratio                     Bandwidth overshoot ratio


$$R_{enc} \le ratio \cdot R_{tcpf}, \quad ratio \ge 1 \qquad (5.5)$$

The value of ratio is a trade-off between the ability of the controller to resist quality drops and the speed at which the receiver buffer is drained. A high ratio value may achieve the desired smooth target quality over a short timescale, but it drains the receiver buffer faster, affecting quality stability in the long term. Furthermore, if the receiver buffer is already at some low level, it may also result in buffer underflow. For this reason, ratio is made dependent on the value of buflev and is defined as $ratio = 1 + 1.5 \cdot buflev$ (therefore, $ratio \in [1, 2.5]$). In this way, if buflev has a low value (i.e., the receive buffer is running at low levels), then ratio is relatively close to one and $R_{enc}$ is not allowed to take a much higher value than the transmission rate $R_{tcpf}$. On the other hand, when buflev is relatively high, the system can more freely overshoot $R_{tcpf}$ to achieve the target quality. Table 5.2 summarises the notation and the parameters of the fuzzy quality controller used throughout this chapter.
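In code, the overshoot constraint (5.5) is a clamp applied to the rate returned by the interpolation step sketched above (illustrative only):

    def clamp_encoding_rate(r_enc, r_tcpf, buflev):
        # ratio = 1 + 1.5 * buflev, so ratio lies in [1, 2.5]: a well-filled
        # receiver buffer permits a larger overshoot of the TCP-friendly rate.
        ratio = 1.0 + 1.5 * buflev
        return min(r_enc, ratio * r_tcpf)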

The following experiments examine the performance of the fuzzy rate-quality controller in its ability to provide a smooth encoding quality and to maintain the stability of the sender and receiver playout buffers. The purpose of the analysis below is, therefore, to examine the following issues:

1. Whether the fuzzy rate-quality controller provides a stable ongoing quality, in comparison to the quality attained by a pure TCP-friendly rate encoding regime and, in general, any pattern of available bandwidth, and the degree of quality smoothness improvement achieved.

2. Whether the controller provides an uninterrupted presentation, i.e., there are no significant gaps in the playout of the video at the receiver (the receiver buffer does not drain), and whether video is always available for transmission (the sender buffer does not underflow).

Figure 5.7: Single bottleneck network topology used in the simulations (TFRC sender and receiver plus N background source/sink pairs sharing the bottleneck link).

3. What the significance and impact of the various control parameters described above are on the general behaviour of the controller (and whether we can find optimal values for these parameters, under certain assumptions and conditions).

4. Finally, whether the controller exhibits consistent behaviour regardless of the transmission environment, i.e., the original video content's type and complexity and the underlying network conditions (e.g., the pattern of the available transmission rate).

To examine the ability of the fuzzy rate-quality controller to maintain stable on-going quality, an Internet-like transmission environment was simulated using the ns-2 Network Simulator [147]. The network topology used throughout was the typical 'dumbbell' network with a single bottleneck link, as shown in Figure 5.7. The bottleneck bandwidth was 10 Mbps with a 20 ms delay, and the queue size at the bottleneck routers was set to the delay-bandwidth product. The end-to-end path between the video sender and receiver employed equation-based TCP-friendly congestion control (using the TFRC [19] agent in ns-2) to determine the nominal transmission rate of the application. To create a realistic variation of network conditions, a number $N$ of background ON/OFF CBR (300 Kbps) flows, with ON and OFF times drawn from a Pareto distribution [148], also traversed the bottleneck link. The mean ON and OFF times were 1 sec and 2 sec, respectively. The number $N$ was chosen so that the bandwidth available to the TFRC sender normally lies within the range of encoding bit rates, $R_0$ to $R_N$. The video sequences used in the experiments were encoded at bit rates from 100 Kbps to 2000 Kbps ($R_0$ = 100 Kbps, $R_N$ = 2000 Kbps), with a distance between encoding points equal to $R_{step}$ = 100 Kbps.

Figure 5.8 shows the improvement in quality stability; it illustrates the continuous objective quality values for the two approaches: (i) when the encoding rate for each 6-frame period is determined by the sampled value of the TCP-friendly bandwidth of the flow ($Q_{tcpf}$), and (ii) when the encoding rate is obtained from the proposed fuzzy quality controller ($Q_{target}$). Graphs are plotted for (a) an 8000-frame-long video extract from the movie The Matrix and (b) a 4400-frame-long sports video clip from an English Premiership game (Football). All the frames of the video sequences used in the experiments were CIF size (352 x 288). The two clips contained several scenes with various content activities and alternations of 'simple' and 'complex' scenes. The initial buffer build-up delay was 8 sec (the first 4 sec filled the sender buffer, then transmission started and the receiver buffer was filled for the next 4 sec). As shown in Figure 5.8, the target quality determined by the fuzzy controller follows the trend of $Q_{tcpf}$, a necessary condition for transmission stability; at the same time, it exhibits significantly less variation, as the controller avoids driving $Q_{target}$ to considerably low and high values, eliminating short spikes in quality values. While the standard deviation over the whole duration of the transmission should not be considered a comprehensive metric of 'quality smoothness', for plot (a) the standard deviation of $Q_{tcpf}$ is 11.05 and that of $Q_{target}$ is 7.75, while for plot (b) the values are 12.50 and 6.39, respectively. The smoothing effectiveness of the method is assessed in more detail in the following, by introducing appropriate metrics of smoothness. A visual inspection of Figure 5.8, however, reveals that the fuzzy controller: (i) eliminates the frequent oscillations of quality and (ii), when sufficient buffering allows it, avoids driving quality to values as low as those of $Q_{tcpf}$. This second feature is extremely useful for video streaming since, as reported earlier in this section, subjective experiments show that it is the periods where video exhibits high distortions that greatly impact a user's rating of quality [149]. Such 'bad' periods of quality have a long-lasting influence on quality perception indeed ("...viewers are slow to forgive").

The top graph in Figure 5.9, in addition to $Q_{target}$ and $Q_{tcpf}$, shows the actual quality, $Q_{actual}$, that is achieved after the search for the rate $R_{enc}$ such that $Q(R_{enc}) \approx Q_{target}$ (sequence: Football). In this graph, it can be observed that, for the majority of the time, $Q_{actual}$ coincides with $Q_{target}$, as desired, but certain mismatches between the target and actual quality occur, like, for example, at points A. These disparities are caused by significant variance of the underlying visual content: since the encoding rate is restricted to at most ratio times the available transmission rate, the encoder cannot achieve as high a quality as targeted by the controller.

Section 5.2 described how the fuzzy controller was designed to avoid buffer underflow situations. Sender buffer underflows are caused when the sender buffer fill rate ($R_{enc}$) is consistently lower than the buffer drain rate (the transmission rate $R_{tcpf}$) for a considerable amount of time. Given that uncompressed frames are produced at a constant rate (e.g., 30 fps), the encoder cannot retrieve frames faster than this rate. On the other hand, a receiver buffer underflow occurs when the buffer's fill rate ($R_{tcpf}$) is lower than its drain rate (the rate at which the decoder extracts data from the receiver buffer).

Figure 5.8: Improvement of quality stability for two different video sequences: (a) The Matrix, 8000 frames; (b) Football, 4400 frames. The graphs depict the on-going quality when video is encoded based on the available transmission rate of the flow, and that achieved by the fuzzy rate-quality controller.

Figure 5.9: Time-plot of $Q_{tcpf}$, $Q_{target}$ and $Q_{actual}$ (top); sender and receiver buffer sizes, $B_s$ and $B_r$ (middle). The buffers accommodate the mismatches between the encoding and transmission rates (bottom). Sequence: Football.

Receive buffer underflows are considered more hazardous, as they lead to temporary interruptions of playout. Send buffer underflows are also undesired, though they can be handled more efficiently, by transmitting at a lower rate than allowed. In addition, broadcast-like, live video streaming systems usually introduce an additional 'broadcast delay', whereby the encoder processes uncompressed video frames time-shifted a few seconds (or even tens of seconds) into the past with reference to the present time. This delay is not visible to the client application; however, if the sender buffer is in danger of underflowing, the encoder may read frames faster than the input frame rate to momentarily escape from this situation (this option, though, is not considered in these results).

Figure 5.9 (middle graph) shows that the fuzzy controller keeps the buffers well away from an underflow situation. A more detailed analysis of the controller's ability to maintain buffer stability is performed later in this section. The bottom plot of Figure 5.9 depicts the evolution of the target encoding rate $R_{enc}$ in comparison to the transmission rate $R_{tcpf}$, where it is shown how mismatches between the two rates are used to fill or drain the buffers accordingly.

Figure 5.10: Smoothness index over time between the two quality series of interest, $Q_{tcpf}$ and $Q_{target}$: (a) The Matrix, 8000 frames; (b) Football, 4400 frames.

5.3.1 Study of the quality smoothing capability of the controller

Recall from sections 2.7 and 5.1 that quality scores over six-frame-long S-T periods are used as a measure of the ongoing quality. The magnitude of the quality change between successive S-T periods is introduced as a representative metric of quality smoothness:

$$\Delta Q = |Q(t) - Q(t-1)|,$$

where high $\Delta Q$ values indicate significant short-term quality variation, while low $\Delta Q$s suggest a stable on-going quality.

An illustration of the variability of $Q_{target}$ in comparison to $Q_{tcpf}$ is presented in Figure 5.10, which plots the smoothness index between the two qualities, for the two sequences mentioned above (The Matrix and Football). The smoothness index is calculated using the expression:

$$smoothness\ index = \frac{\Delta Q_{tcpf}}{\Delta Q_{tcpf} + \Delta Q_{target}}.$$

(93.7% for video sequence The Matrix and 92.8% for sequence Football). 5.3. Performance of the fuzzy quality controller 143

The smoothing capability of the proposed system is also evident by examining the prob­ ability density histograms of the three AQ series (Figure 5.11). The histograms of AQ tcpf

(top-row) are much more ‘flat’ than those of AQtarget (middle-row) and AQactuai (bottom- row). This means that Qtc^f quality values show a wide range of magnitude of change between consecutive values, while changes of the quality determined by the fuzzy quality controller tend to gather close to a near-to-zero zone, an indication of significantly smoother quality. The respective summary statistics of the ‘A ’ series are shown at the bottom of Figure 5.11. The two plots of the third row in Figure 5.11 show the histograms of AQactuai values. Notice that the histograms of AQactuai are slightly ‘flatter’ than those of AQtarget- This is, as explained before in this section (see Figure 5.9), due to mismatches between the quality devised by the controller and what the system can actually achieve due to limitations imposed by the video content’s complexity and the nominal transmission bit rate. Similar results with two further video sequences are presented in Appendix B (Figure B.l).

Measuring smoothness over larger time scales

So far, the reduction of quality variation that the fuzzy controller achieves was evaluated by examining only the short timescale of two successive S-T periods, that is, 400 ms for a 30 fps input video. Although it is expected that reducing the variation between successive six-frame periods also results in stabilising the quality in the longer term, it is interesting to examine how quality fluctuates and what the smoothing efficiency is over longer time frames.

A first attempt to investigate this is by examining the autocorrelation functions of the $Q_{tcpf}$, $Q_{target}$ and $Q_{actual}$ values. Since the autocorrelation of a set of observation values is an indication of the observations' regularity, it is expected that a variable that does not change significantly will exhibit high autocorrelation at larger lags. Figure 5.12 plots the autocorrelation functions of $Q_{tcpf}$, $Q_{target}$ and $Q_{actual}$. As observed, the autocorrelation of $Q_{tcpf}$ quality values declines faster in comparison to $Q_{target}$ and $Q_{actual}$, for which high autocorrelation is maintained at longer lags. The plots correspond to the system's response for two video sequences - The Matrix and Football - from the same simulation experiment.

The autocorrelation of quality observations presented in Figure 5.12 provides a qualitative indication of quality stability. In order to obtain a more comprehensive conclusion on the smoothing performance of the fuzzy adaptive quality controller, a quantitative metric to measure quality variation over longer timescales is introduced. A measure of quality disturbance within a time window $T$ is defined as the range of quality values observed, and is calculated as the distance between the minimum and maximum quality observations within $[t, t+T]$, for every S-T period $t$.

[Figure 5.11 histogram panels: (a) Sequence: The Matrix; (b) Sequence: Football. Summary statistics of the Δ series:]

              (a) The Matrix                     (b) Football
            ΔQtcpf  ΔQtarget  ΔQactual   ΔQtcpf  ΔQtarget  ΔQactual
Min.          0.0     0.0       0.0        0.0     0.0       0.0
1st Quar.     0.975   0.12      0.15       0.97    0.12      0.15
Median        2.20    0.26      0.34       2.20    0.26      0.34
Mean          3.25    0.34      0.74       3.25    0.34      0.74
3rd Quar.     3.84    0.45      0.70       3.84    0.45      0.70
Max.         30.14    3.19      7.68      30.14    3.19      7.68

Figure 5.11: Histograms of ΔQ values: ΔQtcpf, ΔQtarget, and ΔQactual. Notice the different ranges of the x-axes on the different graphs (initial playout delay: 8 sec).

[Figure 5.12 panels: (a) Sequence: The Matrix; (b) Sequence: Football. x-axis: lag (in S-T periods).]

Figure 5.12: Autocorrelation of Qtcpf, Qtarget and Qactual for two video sequences, The Matrix and Football.

[Figure 5.13 panels: (a) Sequence: The Matrix; (b) Sequence: Football. x-axis: timescale (sec).]

Figure 5.13: Box-plots depicting the range of variation of Qtcpf, Qtarget and Qactual values at larger timescales.

This metric was examined for several values of the time window T, from 400 ms up to 60 sec. For each window T, the distances between the minimum and maximum quality values were calculated for all time windows of length T sec within the duration of the simulated video transmission. Given that continuous quality scores are obtained on an S-T period basis, if the total number of S-T periods for the duration of the transmission experiment is n, then for each time window, n - T/0.2 quality-distance values are calculated (one S-T period is 0.2 sec). Figure 5.13 shows the range of quality values for different time windows T. The range is presented as a box-plot extending to the upper and lower quartiles of the values, with the whiskers extending to the minimum and maximum values observed. The average values of the series are also shown as interpolated lines. These plots show a very significant reduction of quality variation achieved by the proposed controller, even at quite large time windows of observation. Similar figures of longer-term quality smoothness from experiments with two further video sequences are shown in Appendix B.
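As an illustration of this disturbance metric, the sketch below computes the max-min quality range over every window of a given length; the names and the sliding-window granularity are choices made here for illustration.

    import numpy as np

    ST_PERIOD = 0.2  # seconds per S-T period (six frames at 30 fps)

    def quality_ranges(q, window_sec):
        # q: one quality score per S-T period.
        w = int(round(window_sec / ST_PERIOD))
        return np.array([np.max(q[i:i + w]) - np.min(q[i:i + w])
                         for i in range(len(q) - w + 1)])

    # Summarising quality_ranges(q, T) for T in {0.4, ..., 60} sec gives the
    # per-timescale distributions that the box-plots of Figure 5.13 depict.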

5.3.2 A closer examination of buffer stability and its impact

While buffers provide a convenient cushion that absorbs short-term fluctuations of quality and available bit rate, buffer underflows can happen when there is a consistent difference between the encoding and transmission rates. Sender or receiver buffer underflow occurrences need to be eliminated, as they result in data being unavailable for transmission or in breaks in video playout at the user end, respectively. A preliminary examination of the buffer size evolution over time, shown in Figure 5.9, indicates that the fuzzy controller keeps the buffers' occupancy well above starvation levels. In the following, a more thorough assessment of the buffers' behaviour is performed, to test the controller's ability to maintain not only smooth quality but also buffer stability under various underlying conditions. Figure 5.14 shows how the available buffer size affects the smoothness that the fuzzy controller is capable of delivering: Qtcpf is plotted on the top graph together with Qtarget for two different values of the initial playout delay (6 and 10 sec). The second graph shows how the input parameter buflev changes over time, as the distribution of data changes between the two buffers. Finally, the two bottom graphs plot the size of the sender (Bs) and receiver (Br) buffers over time, for the two different startup-delay experiments. As expected, the controller is capable of achieving a more stable quality when more data resides in both buffers. In the case of the 10 sec initial delay, the relative abundance of buffered data allows the controller to gracefully increase or decrease the target quality, in comparison to Qtcpf, at the expense of faster buffer draining.


Figure 5.14: Impact of startup delay (6 and 10 sec) on quality smoothing and buffer occupancy (sequence: The Matrix).

This is not always feasible, though, in the case of a 6 sec initial playout delay: the buffers drop faster to a low level when mismatches between the encoding and transmission rate occur, forcing the fuzzy controller to adopt a smaller value for the control parameter a. This results in the target quality swiftly following Qtcpf in order to avoid buffer underflows (e.g., at points 'A'), thus generating slightly higher variation.
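The defuzzified control parameter a is produced by the controller's rule base; purely as a loose illustration of its role, the sketch below uses an exponential-smoothing form in which healthy buffers (high a) yield slow target-quality changes, while draining buffers (low a) let the target snap back towards Qtcpf. Both the smoothing form and the linear buffer mapping are assumptions made here, not the controller's actual output.

    def next_target_quality(prev_target, q_tcpf, a):
        # High a: hold the quality steady; low a: follow q_tcpf closely.
        return a * prev_target + (1.0 - a) * q_tcpf

    def control_parameter(buf_sender, buf_receiver, buf_capacity):
        # Crude stand-in for the fuzzy mapping from buffer health to a.
        buflev = min(buf_sender, buf_receiver) / float(buf_capacity)
        return min(0.95, max(0.0, buflev))  # drained buffers force a towards 0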

Figures 5.15 and 5.16 plot, in the form of ‘box-percentile’ plots, the distributions of the

ΔQtarget and ΔQactual values that the proposed system achieves for different initial sender and receiver buffer build-up times (from six to sixteen seconds). For comparison, the distribution of the ΔQtcpf values is also plotted on the same graphs (note the difference in the range of the y-axes). In these figures, the evolution (over time) of the sender and receiver buffer sizes is also plotted (bottom graphs). Results for these graphs were obtained from simulations with the same setup described above (Figure 5.7), using the two video sequences The Matrix and Football (similar results for the Terminator sequence are presented in Appendix B). The side-by-side box-percentile plots for ΔQtcpf, ΔQtarget and ΔQactual extend to the maximum absolute value of the observations, with the horizontal lines indicating the 0.25, 0.5 and 0.75 quantiles. These results indicate that the system is resilient to buffer underflows; under reasonable initial buffering delays, no buffer under-runs occur. Occasional receive-buffer underflows do occur when the initial buffering delay is very low (Figure 5.16, at around time 90 sec for sequence Football and 3 sec of initial receiver buffering). As expected, the smaller the initial delay, the weaker the capacity of both buffers to accommodate rate mismatches, and therefore the closer they get to low levels. As a result, the controller has to react by reducing the control parameter a. Furthermore, as the ratio of bandwidth overshoot, ratio, is set to be dependent on the level of buffered data in both buffers (expressed by the input parameter buflev), at low initial buffer sizes the encoder cannot always achieve the target quality determined by the controller. This is indicated by wider ΔQactual distributions at lower playout delays. At higher startup delays, the buffers can easily absorb differences between the transmission rate and the rate devised by the controller, while staying away from low levels. In these cases the controller is capable of maintaining a high a value most of the time. It may be argued at this point that, since there is still an abundance of buffered data that could be exploited, better smoothing could be achieved. One could work around this by a mere change of the parameters of the membership function of the linguistic value high of the fuzzy output a, so that the output of the controller is much closer to one. Doing so, however, would result in a controller that is very slow to respond to consistent rises in quality, which is not a desired property. Furthermore, the controller was designed to operate at relatively low initial playout delays (around 10 sec); better improvement (but not significant) can be accomplished with a higher initial delay. Observe that even low initial delays do not adversely influence the quality smoothing process, as shown by the distribution of the ΔQtarget values, but there are occasions where the controller cannot achieve the desired target quality, as shown by the more elongated ΔQactual distributions. What higher initial buffer sizes do offer, though, is greater protection against buffer underflow.
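To make the coupling between buflev and the permissible bandwidth overshoot concrete, a minimal sketch follows; the linear cap and the parameter names are assumptions for illustration (the system derives the overshoot through the fuzzy rule base).

    def max_encoding_rate(tx_rate, buflev, max_overshoot=0.3):
        # The encoder may exceed the nominal transmission rate by at most
        # tx_rate * ratio; the allowed ratio shrinks as the buffers drain.
        ratio = max_overshoot * min(1.0, max(0.0, buflev))
        return tx_rate * (1.0 + ratio)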

5.3.3 Networks with CBR-like bandwidth

The following experiment examines the performance of the system when the video is transmitted over an end-to-end path that provides a fairly constant bandwidth throughout the duration of the service. Although a CBR-like channel might not be what a stream usually experiences on today's Internet, this experiment is useful for comparison purposes, in order to examine how the algorithm controls a bursty VBR video stream when the bandwidth is stable.


Figure 5.15: Impact of the startup delay on quality smoothness and buffer stability. Top: distribution of ΔQtcpf, ΔQtarget and ΔQactual. Bottom: send and receive buffer sizes over the duration of the simulation (sequence: The Matrix).


Figure 5.16: Impact of the startup delay on quality smoothness and buffer stability. Top: distribution of ΔQtcpf, ΔQtarget and ΔQactual. Bottom: send and receive buffer sizes over the duration of the simulation (sequence: Football).

A further incentive for this evaluation stems from the fact that recent home networking infrastructures (DSL, cable) do provide a CBR-like channel, as the bottleneck is in most cases the last mile of the network (from the local exchange to the home). As these networking infrastructures become more and more prolific, and thus enable access to streamed media for a wider audience, this pattern of bandwidth availability may become more widespread. This experiment also serves as a comparison platform to the current practice in video streaming. In the majority of contemporary streaming applications, video is encoded at one (or multiple) discrete CBR rate(s), and adaptation is performed by switching to the encoded bitstream whose rate most closely matches the nominal transmission rate.¹ For this experiment, the transmission rate was kept at 500 Kbps throughout the duration of the transmission, which is similar to what broadband home users may have available now and in the near future. Figures 5.17 and 5.18 show the distributions of ΔQtcpf, ΔQtarget and ΔQactual, as well as the evolution of the send and receive buffer sizes at different initial delays, for sequences The Matrix and Football respectively. Under a CBR transmission, the system exhibits good quality-smoothing properties. It also seems to provide greater buffer stability, as it avoids driving the buffers to low thresholds (in comparison to the variable, TCP-friendly transmission examined above). This comes as no surprise, as it is a widely fluctuating bandwidth that generates greater differences between the buffer fill and consumption rates, creating shortages in buffered data.

5.4 Chapter summary and discussion

A live video stream, when encoded and transmitted as a congestion-controlled IP flow, experiences a varying quality of service due to the bursty nature of video content and bandwidth variations. As a result, the end quality of the video delivered to the user may suffer frequent oscillations that deteriorate the user's experience. For this reason, it is desirable to adapt the rate of the video in response to changes in its content and the available bandwidth in a way that preserves a certain degree of stability in quality. In stored media streaming, opportunities for improving the encoding quality are greater, as the encoder has the whole sequence of frames at its disposal. Encoding optimisations (e.g., multiple-pass asymmetric encoding) or efficient packet scheduling can be employed to alleviate the perceptual impact of 'hard-to-decode' scenes by efficiently amortising the bit-budget throughout the compressed bitstream. In live video streaming, these techniques cannot be applied. Therefore, the solution is to adapt the encoding rate to the current state of the network. This chapter presented a solution to the problem of alleviating short-term quality fluctuations.

¹Quite often, though, in commercial services, streams are transmitted at an encoded rate well below the available network rate, to avoid the complexities of congestion control and adaptation.


Figure 5.17: CBR transmission. Top: distribution of ΔQtcpf, ΔQtarget and ΔQactual. Bottom: send and receive buffer sizes over the duration of the simulation (sequence: The Matrix).


Figure 5.18: CBR transmission. Top: distribution of ΔQtcpf, ΔQtarget and ΔQactual. Bottom: send and receive buffer sizes over the duration of the simulation (sequence: Football).

Based on the real-time generalisation capabilities of a neural network, demonstrated in Chapter 4, a fuzzy controller was introduced to determine the target encoding quality that achieves a smoother evolution of quality. The ability of the controller to achieve this is based on the amount of data buffered at both the sender and receiver, as well as on the underlying variability of the video content and the network bandwidth. The results presented herein, as well as viewings of the reconstructed sequences obtained using the proposed method (reconstructed video sequences can be viewed and compared in [186]), show that:

• A rate-quality controller based on the principles of fuzzy logic presents an efficient and flexible method for performing quality-aware adaptation that aims to alleviate annoying short-term quality variations. The flexibility of the method lies in its ability to represent the relationship among the quality-affecting parameters of the streaming system using easy-to-understand behavioural rules.

• The fuzzy controller successfully eliminates short-lived fluctuations of the on-going quality and provides a smoother video quality in comparison to a pure TCP-friendly encoding and transmission, while at the same time adhering to the constraints imposed by network bandwidth availability.

• The fuzzy controller prevents the sender and receiver buffers from running at low occupancy levels, even with considerably small initial buffering delays. The smoothing ability of the system improves with increasing buffering delay, but seems to saturate beyond 10-15 seconds of buffering.

• In conclusion, the experimental findings in this chapter demonstrate that, by integrating objective metrics of perceived video quality into the adaptation process, it is feasible to achieve media-friendly adaptation of a live video stream while adhering to a network-friendly transmission regime.

The quality controller assumes the existence of a data buffering scheme at both the originating and receiving end-points of the application. The controller uses knowledge of the occupancy levels of these buffers to determine the degree of quality improvement that it is capable of delivering. Therefore, it assumes the existence of an accurate mechanism to estimate the size of these buffers at any point in time during transmission. The method presented in this chapter is based on the assumption that there is no feedback from the receiver informing the server of the current size of the receiver buffer. This may not always give a completely accurate estimate, for example when some packets are lost. However, this assumption can be replaced by a mechanism that provides such feedback from the receiver to the sender.
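A sender-side estimate of this kind can be sketched as follows; the simplifying assumptions (every transmitted byte arrives, and a known decoder drain rate) are made here for illustration and are exactly what receiver feedback would correct.

    def estimate_receiver_buffer(bytes_sent, t_now, t_playout_start,
                                 drain_rate):
        # bytes_sent: total bytes transmitted so far.
        # drain_rate: decoder consumption rate in bytes/sec (assumed known).
        # Lost packets are counted as delivered, so this overestimates.
        played = max(0.0, t_now - t_playout_start) * drain_rate
        return max(0.0, bytes_sent - played)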

The benefits of the fuzzy controller rest on the flexibility given to the application designer to select the type of reaction to quality changes that is considered appropriate in terms of its subjective impact on the end-user or the application semantics. For example, the current setup of the fuzzy controller was designed to react slowly to quality changes, given that there is enough data buffered at the sender or receiver. When there is a consistent change of Qtcpf, either due to a scene change in the original video or due to a change in the nominal transmission rate, the controller determines a target quality that follows that trend, albeit at a slower pace. Another quality adaptation policy may introduce the importance of a scene change in the video. As discussed in Chapter 3, intra-scene content activity is comparatively uniform, and it is among non-homogeneous video scenes that content variation is evident. In this case, the objective of the controller might be to keep the target quality within the same scene as stable as possible, but allow for abrupt changes when there is a scene cut, allowing the target quality to move to the new level. Implementing this policy within the controller requires a simple change of the fuzzy rules and possibly a few straightforward changes in the parameters of the membership functions of the linguistic values. Here, therefore, lies the flexibility of a controller based on fuzzy logic and linguistic rules: it can readily express various adaptation policies suited to the needs of the specific application being designed. Finally, the current solution strictly adheres to the nominal transmission rate obtained by a TCP-friendly congestion controller. The scheme presented can be more efficient in delivering improved quality stability and averting potential receiver buffer underflows if the stream is allowed to occasionally transmit at a slightly higher rate than TCP when there is a risk of a receiver buffer drain. Such an approach is not considered TCP-unfriendly when a number of conditions apply (e.g., when the level of statistical multiplexing at a bottleneck link is high). For example, the effect of a flow that transmits slightly faster for a short duration of time is relatively benign. Indeed, concurrent TCP or TCP-friendly flows exhibit different instantaneous throughputs, but over longer timescales they share the link capacity equally.
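To make the notion of 'a simple change of the fuzzy rules' concrete, a scene-aware policy of the kind described above might be expressed with rules such as the following. The rule syntax and the linguistic values are hypothetical, written here only to illustrate how compactly such a policy can be stated; they are not the controller's actual rule base.

    # Hypothetical linguistic rules for a scene-aware adaptation policy.
    RULES = [
        ("IF scene_change IS false AND buflev IS high", "THEN a IS high"),
        # hold the target quality steady within a homogeneous scene
        ("IF scene_change IS true", "THEN a IS low"),
        # let the target quality jump to the new scene's level
        ("IF buflev IS low", "THEN a IS low"),
        # protect the sender and receiver buffers from underflow
    ]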

Chapter 6

Conclusions

This thesis studied quality-aware techniques for the rate-adaptive transmission of Internet video. The starting point has been that delivering perceptually good video is a desirable yet, so far, not directly addressed feature in the design of video streaming systems. The main reason for this is that video streaming components (e.g., the encoder) have been designed to optimise quality based on internal representations of quality that do not always correlate well with quality as perceived by the inevitable judge, the human viewer. The main obstacle has been the lack of mechanisms to measure perceptual quality in a manner that makes their interaction with an automated adaptation framework feasible. However, as reviewed in this dissertation, recent advances in video quality research have generated objective models that yield quality ratings in agreement with human judgements of quality. On these grounds, this thesis investigated adaptation procedures from an additional perspective, that of their perceptual impact on quality. Its main contribution was to demonstrate that objective video quality metrics can be integrated, directly or indirectly, into the adaptation life-cycle of video to improve the perceptual quality of service, and to propose and develop techniques for achieving this. The principal question it answered was what application-level enhancements should be put in place to facilitate this integration, without interfering with other aspects of a video streaming system, such as its network rate-control policy (e.g., TCP-friendliness) or its real-time performance requirements.

6.1 Contributions

Realising quality-aware video adaptation requires an understanding of the way humans perceive quality and of how this can be measured. Throughout this dissertation, several aspects of human perception of video quality were revisited with respect to their influence on the design of the proposed techniques [26, 27, 28]. With regard to the issue of video quality measurement, the main concepts of objective quality assessment models and the opportunities they present were discussed. In particular, this thesis described how features of visual quality models, currently used for non-intrusive quality assessment of video sequences or to compare the performance of video streaming components, can be tailored to assist the application in acquiring knowledge of the actual stream quality and in driving the adaptation mechanism accordingly.

Quality-aware adaptation in multi-stream sessions

The first application area in which the concept of quality-aware adaptation was applied was multimedia applications that involve the transmission of multiple concurrent video streams to a receiver. Based on the principle of joint rate-control among the participating flows, a method was presented that allocates the total bandwidth available to the session according to the time-varying quality of its constituent flows, with the objective of maximising the overall session quality [30, 31]. The concept of integrated congestion control was introduced as a good choice for this kind of application, as it offers a method to determine the aggregate transmission rate of the flow ensemble in an IP network. In effect, the inter-stream adaptation technique distributes the available bit-budget by 'stealing' bits from some flows and 'offering' them to others, if this process results in increased total quality. Two observations steered the design of the proposed inter-stream allocation mechanism, and the following conclusions were drawn from the results of a number of simulation experiments:

• The first observation was that the encoding quality of compressed video remains fairly constant within a sequence of frames with similar levels of content activity, given that the encoding rate does not change. For this reason, the video scene (defined as a sequence of successive frames with uniform visual content) was adopted as the time frame (or timescale) for quality evaluation. A homogeneous video scene is therefore regarded as an autonomous entity with respect to quality evaluation. Long-running video clips can be segmented into a sequence of scenes, and a video quality metric is applied to each scene to obtain scene-level quality scores.

• Based on the above, the second observation identified the instants at which a scene change occurs in any of the participating sequences as appropriate points to adapt the rate of the constituent flows (that is, times where a re-scheduling of the total bandwidth between the participating flows is justified).

• Experimental results showed that the proposed quality-aware inter-stream adaptation method improves the aggregate session quality in comparison to a proportional allocation mechanism based on the priority or importance of each flow. This is attributed to the dynamically changing rate-quality relationships among the participating video streams, which the proposed method is able to exploit.

• Furthermore, we contrasted the proposed method with the most common current practice for delivering multiple related concurrent flows over the Internet, which involves transmitting each media stream over a separate, independently congestion-controlled flow. In this scenario, simulation results confirmed similar observations in the literature [29]: the independent flows attain a significantly higher aggregate throughput in comparison to integrated management, where all streams are 'treated' as one with respect to congestion control. This behaviour of a group of independent connections between the same pair of hosts is not considered 'socially proper'. However, in this work, by assuming congestion cooperation among the constituent streams and considering the time-varying quality of each stream in the inter-stream allocation process, we showed that quality-aware inter-stream adaptation achieves a far more efficient use of the available bandwidth. Despite the fact that, under the same network conditions, the aggregate throughput of the independent flows is significantly higher than that of the ensemble, there is no significant gain in total session quality.

• Finally, preliminary results showed that the proposed adaptation scheme does not adversely influence the quality smoothness of the participating streams in comparison to the other two adaptation mechanisms examined.

These results proved the validity of the initial hypothesis: the use of a quality metric provides a far more efficient utilisation of the available bandwidth in terms of the quality (or utility) that the streaming application delivers.
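One plausible way to realise such quality-aware apportioning is a greedy allocation over per-flow rate-quality functions, sketched below. The function names, the step size and the assumption that each flow exposes a scene-level quality_fn(rate) are illustrative; the dissertation's allocator is the one described in Chapter 3.

    def allocate_session_bandwidth(total_rate, flows, step=10.0):
        # flows: list of (weight, quality_fn) pairs; quality_fn(rate) returns
        # the scene-level quality of a flow when encoded at 'rate' kbit/s.
        rates = [0.0] * len(flows)
        budget = total_rate
        while budget >= step:
            # Give the next slice of bandwidth to the flow with the largest
            # weighted marginal quality gain ('stealing' bits from the rest).
            gains = [w * (q(r + step) - q(r))
                     for (w, q), r in zip(flows, rates)]
            best = gains.index(max(gains))
            rates[best] += step
            budget -= step
        return rates

    # e.g. allocate_session_bandwidth(1000.0,
    #          [(1.0, lambda r: 100 * r / (r + 300)),
    #           (0.5, lambda r: 100 * r / (r + 150))])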

Real-time prediction of quality using artificial neural networks

The problem of providing smooth perceived quality for the real-time encoding and transmission of live streamed material was also examined [32]. Constantly high visual quality cannot always be guaranteed over a best-effort transmission network; the second-best alternative has therefore been to maintain quality stability. This is in accordance with psychophysical studies that emphasise the negative impact of frequent alterations in quality. Adaptation options for live video content are limited by a number of constraints not present in stored media streaming (which has attracted the lion's share of the recent literature). The most important difference is that the encoder does not have the whole video sequence at its disposal to perform optimised compression (such as multiple-pass encoding). Frames are produced in real-time, so only a limited number of them are available at each time interval. Furthermore, rate-adaptation has real-time requirements. The main problem that emerged in providing smooth perceived quality streaming was that an objective quality metric cannot be used directly for in-service quality estimation, as it requires significant computation power that would hamper the real-time performance of any streaming server. To alleviate this problem, the use of artificial neural networks was proposed. The solution was based on the assumption that the level of perceived distortion is primarily influenced by the video content's complexity and the encoding rate, and that the function relating the video content activity, the encoding bit rate and the resulting quality can be 'learned' by the ANN. The design of the ANN quality predictor involved the following procedures, and the evaluation produced a number of interesting findings (a minimal sketch of such a predictor follows this list):

• Neural networks were trained with features that closely match the type and magnitude of short-term, spatio-temporal content activity at a range of encoding rates. Target quality scores were obtained by applying the ITS video quality metric. Fifteen frame-level features were introduced to capture content dynamics in the spatial and temporal domains. Since the quality metric and the respective rate adjustments are applied with a temporal width of six frames (the S-T period), simple statistical manipulation (averaging) of the frame-level content features was employed to obtain descriptors of the content activity over these short time-scale periods.

• A sensitivity analysis on the fifteen initially chosen features and a corresponding input reduction method revealed no significant redundancy among these features, with respect to a three-layer ANN topology with a hidden layer size of the order of the input vector size (from four to sixteen neurons). Experiments with pre-conditioning of the training matrix using PCA showed no particular benefit either. Nevertheless, this process revealed that the chosen features are indeed capable of quantifying content-specific factors that influence quality. It was further suggested that PCA, together with the input sensitivity analysis and the stepwise elimination methods, is useful for cases where the number of input variables is larger. Such cases emerge if one wants to predict quality ratings for video scenes of longer duration (e.g., of the order of seconds), where several summary statistics are needed to represent the distribution of objectively obtained quality scores.

• Neural networks were trained with a broad range of video scenes that constituted a fairly representative collection of content activity types. The validation of the ANN with unknown test patterns showed that the proposed neural network method can yield accurate predictions of the instantaneous objective quality for a variety of encoding bit rates.
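The following sketch shows the shape of such a predictor, using scikit-learn as a stand-in for the dissertation's ANN implementation; the random placeholder data, the hidden-layer size and the training settings are assumptions made here for illustration.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    # 15 averaged content features plus the encoding bit rate per pattern;
    # targets are ITS-metric quality scores (placeholder values here).
    X = rng.random((500, 16))
    y = rng.random(500) * 100
    model = MLPRegressor(hidden_layer_sizes=(12,), max_iter=2000,
                         random_state=0).fit(X, y)
    q_pred = model.predict(X[:6])  # quality predictions for six patterns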

The principal idea for accomplishing stable quality was to choose encoding rates that are sometimes lower and at other times higher than the available transmission rate, according to the value of the encoding quality targeted by the controller. Mismatches between the instantaneous encoding and transmission rates were accommodated using a limited amount of media buffering at the sender and receiver ends. In order to estimate suitable target quality values that preserve quality stability and to derive appropriate encoding rates to achieve that target quality, while at the same time adhering to a TCP-friendly transmission rate and avoiding underflow of the sender and receiver buffers, a rate-quality controller designed on the principles of fuzzy logic was proposed. The choice of a fuzzy logic based controller was motivated by its capability to represent the behaviour of the system with a few intuitive rules, without having to resort to complicated analytical models. The fuzzy rate-quality controller relied on the generalisation capabilities of the neural network to obtain real-time mappings from encoding rate to instantaneous quality, with the task of producing a more stable value for the video's target encoding quality. The design of the fuzzy controller was engineered to account for several properties of quality perception, of which the most important were considered to be the requirements of reducing short-term quality fluctuations and of preventing the target quality from dropping to low levels. The findings from an extensive number of experiments with several test video sequences, initial buffering times and transmission paradigms can be summarised as follows:

• The controller was shown to produce a significantly smoother evolution of perceived quality [186], in comparison to the quality the system delivers when the encoding rate closely follows the nominal bandwidth available to the flow.

• For reasonable initial playout (buffering) delays, the occupancy of the sender and receiver buffers is kept away from underflow levels. The smoothing ability of the system improves with increasing buffering delay, but seems to saturate beyond 10-15 seconds of buffering.

6.2 Critical review and areas of future research

This section provides a critical discussion of the assumptions made in parts of this dissertation and contemplates certain design issues and technical decisions taken. It also proposes a number of suggestions to mitigate potential limitations and presents options for future work. The objective of the inter-stream allocation mechanism presented in Chapter 3 is to maximise the total quality of the session, i.e., the weighted sum of each flow's quality, where the weight of a flow represents its relative importance or priority to the application or the user's preference. Further investigation is required to examine whether this objective coincides with the viewer's perception of session quality for this type of multi-stream application. Extracting such knowledge and conveying it to the streaming server is not a simple task. Simple clues may be helpful: usually, the video feed that the viewer is watching at any given time is the most important. The size of the display and the distance of the viewer from it are also crucial, as they determine the 'active' region of the display that is 'viewable' by the user. Another reasonable approach would be for the flows to exhibit equivalent levels of quality over time. This is, for example, the objective of joint rate control in contemporary broadcast TV scenarios, where a number of independent programmes are statistically multiplexed over a common channel (e.g., via direct broadcast satellite - DBS). Although this kind of application is different in spirit from the one examined in this thesis, only the scheduler would need to be altered to realise this policy.

The inter-stream adaptation method uses scene change events to trigger a re-allocation of session bandwidth among the participating video flows. This is based on the assumption that quality remains fairly stable within frames of the same scene; quality is therefore assessed at the video-scene level. Whether this method coincides with the user's opinion of each video stream's quality is quite difficult to confirm. On the one hand, scenes can be independent in terms of how their quality is assessed by a human viewer, hence the introduced distortions can be independent of each other, validating this approach. On the other hand, the quality of the recent past also influences user perception of quality; therefore, the quality of the past few scenes may also influence the user's opinion of the quality of the current scene. This is an interesting subject that requires further investigation, as discussed in Section 2.6.3. Techniques for the objective measurement of continuous video quality will soon become available and may be used instead, but in the meantime, the scene-based segmentation approach followed here presents the best match [150, 187].

Results presented in Chapter 4 indicate that using artificial neural networks is an elegant approach to eliciting predictions of objective quality scores when applying an objective quality metric itself is deemed too costly for the performance of the system (such as when real-time encoding is also required). A number of content features that describe the frame-level spatio-temporal activity are extracted in real-time as part of the encoder's operation. This feature extraction process was incorporated into the source code of an H.263+ codec. Although the set of features selected was broad enough to cover all aspects of content activity, it is not necessarily exhaustive. More specialised features could be designed so that they extract content properties from the salient areas of a video frame. For example, it is known that visual attention is usually drawn to the area around the centre of the image, the area with the dominant motion activity, or the main object in the video scene.

[Figure 6.1 scatter plots; x-axes: mean pixel activity (PelAct) and mean sum of abs. pixel differences (soad), plotted against objective quality.]

Figure 6.1: Clustering effects in the values of a video content feature.

In this respect, new features may concentrate on measuring content activity properties (e.g., pixel or edge activity, motion magnitude or complexity) in those areas of the image that are thought to attract the human eye during the viewing process.

If these features prove to be more strongly correlated with the levels of perceived distortion introduced, they will also constitute more relevant inputs for the training of the neural network, and thus result in improved prediction accuracy.
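As an illustration of the kind of frame-level measurement involved, the sketch below computes two features of the sort named above (soad and a pixel-activity measure) over either the full frame or a central region; the exact feature definitions here, in particular approximating PelAct by the luminance variance, are assumptions for illustration rather than the codec's implemented formulas.

    import numpy as np

    def frame_features(prev_frame, curr_frame, central_only=False):
        # prev_frame, curr_frame: 2-D luminance arrays of equal shape.
        if central_only:
            h, w = curr_frame.shape
            sl = (slice(h // 4, 3 * h // 4), slice(w // 4, 3 * w // 4))
            prev_frame, curr_frame = prev_frame[sl], curr_frame[sl]
        # Temporal activity: sum of absolute pixel differences (soad).
        soad = np.abs(curr_frame.astype(float) - prev_frame.astype(float)).sum()
        # Spatial activity: a simple stand-in for PelAct.
        pel_act = curr_frame.astype(float).var()
        return soad, pel_act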

The ANN architecture was trained with an extensive set of input patterns obtained from a large collection of video frames. Its generalisation performance was strong when it was presented with unknown sets of input patterns. However, the training patterns used should not be considered exhaustive with respect to the gamut of content present in all video scenes one could imagine. By definition, machine learning techniques work well if presented with inputs that fall within the spectrum of the data that trained them; in other words, an ANN is only as good as the data used to train it. The prediction performance can be improved if the neural network is trained to specialise on subsets of all possible values of each input content feature.

The following presents a high-level initial consideration of how this can be achieved. The procedure may involve unsupervised classification (e.g., using Bayesian classifiers) of the content features to reveal groups (or clusters) within the input patterns. For example,

Figure 6.1 plots two such content features (the frame-level activity of pixels, PelAct, and the sum of absolute pixel differences between successive frames, soad), introduced in Section 4.4, and their relation to the objective quality at a certain encoding bit rate (sequence used: mixed).

An observation of these graphs reveals the formation of clusters in the corresponding content features. One could then define or derive classifications on the selected set of input variables that categorise the range of values of content features into groups with similar or correlated values.

Table 6.1: A simple classification of spatial and temporal content activity and the resulting classes in the spatio-temporal domain.

Texture   Motion   Spatio-temporal
L         L        LL
L         M        LM
L         H        LH
M         L        ML
M         M        MM
M         H        MH
H         L        HL
H         M        HM
H         H        HH

One such simplistic classification may be deduced by defining three classes based on the value of the feature under inspection: low (L), medium (M) and high (H). For instance, considering two generic content features, such as texture (spatial energy) and motion (motion activity), produces the two-dimensional classification shown in Table 6.1. To reduce the number of classes, and hence the complexity of the approach, several of these classes may be merged, as they will probably generate similar objective quality values (for example, low texture and high motion would probably have an impact on quality nearly identical to that of medium texture and low motion). Multi-dimensional classifications could also be created if more than two content features are considered, at the expense of additional complexity. A pool of neural networks can then be created, where each neural network corresponds to a specific class in the list. In this way, generalisation accuracy will most probably improve, as each neural network is trained on a narrower range of the spectrum of input patterns. The issues discussed in this paragraph are the subject of on-going and future work.
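A minimal sketch of this classify-then-dispatch idea follows; the normalised feature ranges, the thresholds and the pool structure are hypothetical choices made for illustration.

    def st_class(texture, motion, thresholds=(0.33, 0.66)):
        # Map normalised texture/motion values to the classes of Table 6.1.
        def level(v):
            return "L" if v < thresholds[0] else ("M" if v < thresholds[1] else "H")
        return level(texture) + level(motion)

    # pool maps each (possibly merged) class to a specialised predictor:
    # quality = pool[st_class(texture, motion)].predict(features)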

Section 2.5 identified the two primary sources of distortion in packetised video: distortions due to the compression process and artifacts introduced during transmission over a non-deterministic, unreliable inter-network. The work presented in this thesis primarily focused on the impact of artifacts generated by the video encoding process and the subsequent quality aspects of video rate-adaptation. As such, it did not account for the perceived effects of distortions introduced during the transmission of the packetised video (mainly due to packet loss). Although the potential perceptual impact of transmission-based distortions might be significant, and in some cases might surpass that of the signal compression process, contemporary source-channel video encoding techniques add significant tolerance to these kinds of impairments. Moreover, persistently high packet drop rates are rarely seen in the Internet today in the absence of routing failures or other major disruptions. This fortunate situation is primarily due to over-provisioning in the core, with congestion typically found on lower-capacity access links, and to the use of end-to-end congestion control by TCP or TCP-friendly UDP. Nonetheless, there is a large volume of literature on error-resilient video, including but not limited to forward error correction, retransmission, and error concealment at the receiver. Indeed, retransmission of lost packets is heavily used in commercial streaming applications today for media streams that are not particularly sensitive to additional delay, such as those considered in this work. These techniques diminish to a great degree the impact of packet loss on perceived quality. It is therefore advised that such schemes be incorporated in a real system that implements the ideas and techniques canvassed in this thesis.

Bibliography

[1] Nielsen//NetRatings. http://www.nielsen-netratings.com.

[2] Arbitron/Edison Media Research. Internet 8: Advertising vs. subscription - which streaming model will win? http://www.arbitron.com/radio_stations/internetusage.htm.

[3] R. Braden, D. Clark, and S. Shenker. Integrated services in the Internet architecture: an overview. RFC 1633, June 1994.

[4] L. Zhang, S. Deering, D. Estrin, S. Shenker, and D. Zappala. RSVP: A New Resource ReSerVation Protocol. IEEE Network, 7(5):8-18, September 1993.

[5] S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and W. Weiss. An architecture for differentiated services. RFC 2475, December 1998.

[6] International Telecommunications Union (ITU-T). Video coding for low bit rate communication (ver. 2). ITU-T Rec. H.263, 1997.

[7] K. Rijkse. H.263: video coding for low bit-rate communication. IEEE Communications Magazine, 34(12):42-45, December 1996.

[8] B. Erol, M. Gallant, G. Côté, and F. Kossentini. The H.263+ video coding standard: complexity and performance. In Proceedings of the IEEE Data Compression Conference, pages 259-268, Snowbird, Utah, USA, March 1998.

[9] T. Ebrahimi and C. Horne. MPEG-4 natural video coding: an overview. Signal Processing: Image Communication, pages 365-385, 2000.

[10] Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG. Draft ITU-T recommendation and final draft international standard of joint video specification. ITU-T Rec. H.264 / ISO/IEC 14496-10 AVC, JVT-G050, 2003.

[11] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra. Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):560-576, July 2003.

[12] G. J. Conklin, G. S. Greenbaum, K. O. Lillevold, A. F. Lippman, and Y. A. Reznik. Video coding for streaming media delivery on the Internet. IEEE Transactions on Circuits and Systems for Video Technology, 11(3):269-281, March 2001.

[13] M. Ghanbari. Two-layer coding of video signals for VBR networks. IEEE Journal on Selected Areas in Communications, 7(5):771-781, June 1989.

[14] W. Li. Overview of fine granularity scalability in MPEG-4 video standard. IEEE Transactions on Circuits and Systems for Video Technology, 11(3):301-317, March 2001.

[15] A. Eleftheriadis and D. Anastassiou. Meeting arbitrary QoS constraints using dynamic rate shaping of coded digital video. In Proceedings of the 5th International Workshop on Network and Operating System Support for Digital Audio and Video (NOSSDAV), Durham, New Hampshire, April 1995.

[16] S. Floyd and K. Fall. Promoting the use of end-to-end congestion control in the Internet. IEEE/ACM Transactions on Networking, 7(4):458-472, August 1999.

[17] D. Wu, Y. T. Hou, W. Zhu, H.-J. Lee, T. Chiang, and Y.-Q. Zhang. On end-to-end architecture for transporting MPEG-4 video over the Internet. IEEE Transactions on Circuits and Systems for Video Technology, 10:923-941, September 2000.

[18] R. Rejaie, M. Handley, and D. Estrin. RAP: An end-to-end rate-based congestion control mechanism for realtime streams in the Internet. In IEEE INFOCOM '99, pages 1337-1345, New York, NY, USA, March 1999.

[19] S. Floyd, M. Handley, J. Padhye, and J. Widmer. Equation-based congestion control for unicast applications. In ACM SIGCOMM ’00, pages 43-56, Stockholm, Sweden, August 2000.

[20] D. Bansal, H. Balakrishnan, S. Floyd, and S. Shenker. Dynamic behavior of slowly-responsive congestion control algorithms. In Proceedings of ACM SIGCOMM '01, pages 263-274, San Diego, CA, USA, August 2001.

[21] J. Widmer, R. Denda, and M. Mauve. A survey on TCP-friendly congestion control. IEEE Network Magazine, Special Issue on Control of Best Effort Traffic, 15(3):28-37, May 2001.

[22] International Telecommunications Union. ITU-R Recommendation BT.500-11, Methodology for the subjective assessment of the quality of television pictures, 2002.

[23] B. Girod. What's wrong with mean-squared error? In A. B. Watson, editor, Digital Images and Human Vision, pages 207-220. MIT Press, Cambridge, MA, USA, 1993.

[24] The Video Quality Experts Group. VQEG. http://www.vqeg.org.

[25] The Video Quality Experts Group. Draft final report from the Video Quality Experts Group on the validation of objective models of video quality assessment. Phase II, July 2003. Version 4.

[26] D. Miras. A survey on network QoS needs of advanced Internet applications. Working document, Internet2 QoS Working Group, December 2002. http://www.internet2.edu/qos/wg/apps/fellowship/.

[27] J. McCarthy, M. A. Sasse, and D. Miras. Sharp or smooth? Comparing the effects of quantization vs. frame rate for streamed video. In Intl. Conf. on Human Factors in Computing Systems (CHI), Vienna, Austria, April 2004.

[28] J. D. McCarthy, M. A. Sasse, and D. Miras. Evaluating mobile video quality. In 13th IST Mobile and Wireless Communications Summit, Lyon, France, June 2004.

[29] H. Balakrishnan, H. Rahul, and S. Seshan. An integrated congestion management architecture for Internet hosts. In Proceedings of SIGCOMM '99, Cambridge, MA, September 1999.

[30] D. Miras, R. Jacobs, and V. Hardman. Utility based inter-stream adaptation of layered streams in a multiple-flow IP session. In 7th Intl. Workshop on Interactive Distributed Multimedia Systems and Telecommunication Services (IDMS2000), Lecture Notes in Computer Science (LNCS 1905), pages 77-88, Enschede, The Netherlands, October 2000.

[31] D. Miras, R. Jacobs, and V. Hardman. Content-aware quality adaptation in IP sessions with multiple streams. In 8th Intl. Workshop on Interactive Distributed Multimedia Systems and Telecommunication Services (IDMS2001), Lecture Notes in Computer Science (LNCS 2158), pages 168-180, Lancaster, UK, September 2001.

[32] D. Miras and G. Knight. Smooth quality streaming of live internet video. In IEEE Global Telecommunications Conference (Globecom), Dallas, Texas, US, November 2004.

[33] International Telecommunications Union (ITU-T). General characteristics of international telephone connections and international telephone circuits. ITU-T G.114, 1998.

[34] G. Huston. Next steps for the IP QoS architecture. RFC 2990, November 2000.

[35] M. Ghanbari. Video coding - An introduction to standard codecs. IEE Telecommunications Series 42. The Institute of Electrical Engineers, London, UK, 1999.

[36] International Telecommunications Union (ITU-T). Video codec for audiovisual services at p × 64 kbit/s. ITU-T Rec. H.261, 1993.

[37] International Organization for Standardization (ISO/IEC JTC1). Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s - Part 2: Video. ISO/IEC 11172-2, March 1993.

[38] ITU-T and ISO/IEC JTC1. Generic coding of moving pictures and associated audio information - Part 2: Video. ITU-T Rec. H.262 - ISO/IEC 13818-2 (MPEG-2), November 1994.

[39] International Organization for Standardization (ISO/IEC JTC1). Coding of audio-visual objects. ISO/IEC 14496, 1999.

[40] Microsoft Windows Media 9 Series. 9 Series codecs: video. http://www.microsoft.com/windows/windowsmedia/9series/codecs/video.aspx.

[41] RealNetworks. RealVideo 10. http://www.realnetworks.com/products/codecs/realvideo.html.

[42] RealNetworks. RealVideo 10 technical overview. http://docs.real.com/docs/rn/rv10/RV10_Tech_Overview.pdf, 2003.

[43] A. Joch, F. Kossentini, H. Schwarz, T. Wiegand, and G. J. Sullivan. Performance comparison of video coding standards using Lagrangian coder control. In Proceedings of Int. Conference on Image Processing 2002, volume 2, pages 501-504, British Columbia Univ., Vancouver, BC, Canada, September 2002.

[44] J. van der Merwe, S. Sen, and C. Kalmanek. Streaming video traffic: characterization and network impact. In 7th Intl. Workshop on Web Content Caching and Distribution (WCW), Boulder, Colorado, August 2002.

[45] S. Floyd and V. Jacobson. Random early detection gateways for congestion avoidance. IEEE/ACM Transactions on Networking, 1(4):397-413, August 1993.

[46] O. Verscheure, P. Frossard, and M. Hamdi. MPEG-2 video services over packet networks: joint effect of encoding rate and data loss on user-oriented QoS. In Proceedings of the 8th International Workshop on Network and Operating System Support for Digital Audio and Video (NOSSDAV), pages 257-264, Cambridge, UK, July 1998.

[47] J.-C. Bolot, T. Turletti, and I. Wakeman. Scalable feedback control for multicast video distribution in the Internet. In ACM SIGCOMM '94, pages 58-67, London, UK, September 1994.

[48] S. McCanne, V. Jacobson, and M. Vetterli. Receiver-driven layered multicast. In ACM SIGCOMM '96, pages 117-130, Palo Alto, California, USA, August 1996.

[49] L. Vicisano, L. Rizzo, and J. Crowcroft. TCP-like congestion control for layered multicast data transfer. In IEEE INFOCOM '98, volume 3, pages 996-1003, San Francisco, USA, March 1998.

[50] L. Rizzo. pgmcc: a TCP-friendly single-rate multicast. In ACM SIGCOMM '00, pages 17-28, Stockholm, Sweden, August 2000.

[51] J. Widmer and M. Handley. Extending equation-based congestion control to multicast applications. In ACM SIGCOMM ’01, pages 275-285, San Diego, CA, USA, August 2001.

[52] I. Rhee, V. Ozdemir, and Y. Yi. TEAR: TCP emulation at receivers - flow control for multimedia streaming. NCSU Technical Report, Computer Science Dept., North Carolina State University, 2000.

[53] K. Ramakrishnan, S. Floyd, and D. Black. The addition of explicit congestion notification (ECN) to IP. RFC 3168, September 2001.

[54] J. Padhye, V. Firoiu, D. Towsley, and J. Kurose. Modeling TCP throughput: a simple model and its empirical validation. In ACM SIGCOMM ’98, pages 303-314, Vancouver, CA, September 1998.

[55] S. Floyd, M. Handley, and J. Padhye. A comparison of equation-based and AIMD congestion control, February 2000. http://www.aciri.org/tfrc.

[56] R. Yang and S. S. Lam. General AIMD congestion control. In International Conference on Network Protocols (ICNP), pages 187-198, Osaka, Japan, November 2000.

[57] D. Bansal and H. Balakrishnan. Binomial congestion control algorithms. In IEEE INFOCOM '01, volume 2, pages 631-640, Anchorage, AK, April 2001.

[58] E. Kohler, M. Handley, S. Floyd, and J. Padhye. Datagram congestion control protocol (DCCP). Internet Draft draft-ietf-dccp-spec-05.txt, October 2003. IETF.

[59] H. Sun, W. Kwok, and J. W. Zdepski. Architectures for MPEG compressed bitstream scaling. IEEE Transactions on Circuits and Systems for Video Technology, 6(2):191-199, April 1996.

[60] P. Assuncao and M. Ghanbari. A frequency-domain video transcoder for dynamic bit-rate reduction of MPEG-2 bit streams. IEEE Transactions on Circuits and Systems for Video Technology, 8(8):953-967, December 1998.

[61] Microsoft Windows Media 9 Series. Intelligent streaming. http://www.microsoft.com/windows/windowsmedia.

[62] A. R. Reibman, H. Jafarkhani, Y. Wang, M. T. Orchard, and R. Puri. Multiple description video coding using motion-compensated temporal prediction. IEEE Transactions on Circuits and Systems for Video Technology, 12(3):193-204, March 2002.

[63] J. G. Apostolopoulos. Reliable video communication over lossy packet networks using multiple state encoding and path diversity. In Visual Communications and Image Processing (VCIP), pages 392-409, San Jose, CA, January 2001.

[64] Z.-L. Zhang, S. Nelakuditi, R. Aggarwal, and R. P. Tsang. Efficient selective frame discard algorithms for stored video delivery across resource constrained networks. In IEEE INFOCOM ’99, pages 472-479, New York, NY, March 1999.

[65] S. Sen, J. Rexford, J. Dey, J. Kurose, and D. Towsley. Online smoothing of variable-bit-rate streaming video. IEEE Transactions on Multimedia, 2(1):37-48, March 2000.

[66] B. Girod, J. Chakareski, M. Kalman, Y. J. Liang, E. Setton, and R. Zhang. Advances in network-adaptive video streaming (invited paper). In Proceedings of the 2002 Tyrrhenian International Workshop on Digital Communications (IWDC), pages 1-8, Capri, Italy, September 2002.

[67] M. Yuen and H. R. Wu. A survey of hybrid MC/DPCM/DCT video coding distortions. Signal Processing, 70(3):247-278, 1998.

[68] American National Standards Institute (ANSI). ANSI T1.801.02, Digital transport of video teleconferencing/video telephony signals - Performance terms, definitions and examples. Alliance for Telecommunications Industry Solutions, 1996.

[69] A. J. Ahumada and C. H. Null. Image quality: a multidimensional problem. In A. B. Watson, editor, Digital Images and Human Vision, pages 141-148. MIT Press, Cambridge, MA, USA, 1993.

[70] S. A. Klein. Image quality and image compression: A psychophysicist's viewpoint. In A. B. Watson, editor, Digital Images and Human Vision, pages 73-88. MIT Press, Cambridge, MA, USA, 1993.

[71] J. B. Martens and V. Kayargadde. Image quality prediction in a multidimensional perceptual space. In Intl. Conf. on Image Processing, volume 1, pages 877-880, Lausanne, Switzerland, 1996.

[72] D. E. Pearson. Viewer response to time-varying video quality. In B. E. Rogowitz and T. N. Pappas, editors, SPIE Human Vision and Electronic Imaging, pages 16-25, Bellingham, WA, 1998.

[73] W. Y. Zou. Performance evaluation: From NTSC to digitally compressed video. SMPTE Journal, 103(2):795-800, 1994.

[74] S. Wolf and M. Pinson. Video quality measurement techniques. Technical Report NTIA Report 02-392, National Telecommunications and Information Administration (NTIA), Institute for Telecommunication Sciences (ITS), June 2002. http://www.its.bldrdoc.gov/n3/video/documents.htm.

[75] Pixelmetrix/KDD Media. VP2000 picture quality analyser. http://www.pixelmetrix.com/rel/VPDatasheet.pdf.

[76] M. Knee. The picture appraisal rating (PAR) - A single-ended picture quality measure for MPEG-2. In Int. Broadcasting Convention, The Netherlands, 2000.

[77] A. Woerner. A realtime no reference video quality analysis. Technical report, Rohde&Schwarz, October 2001.

[78] Snell&Wilcox. Mosalina. http://www.snellwilcox.com.

[79] Rohde&Schwarz. http://www.rohde-schwarz.com.

[80] Tektronix Picture Quality. http://www.tektronix.com.

[81] R. Aldridge, J. Davidoff, M. Ghanbari, D. Hands, and D. Pearson. Recency effect in the subjective assessment of digitally-coded television pictures. In Fifth IEE International Conference on Image Processing and its Applications, pages 336-339, Edinburgh, UK, July 1995.

[82] A. D. Baddeley. Working Memory. Oxford University Press, Oxford, UK, 1986.

[83] H. de Ridder and R. Hamberg. Continuous assessment of image quality. SMPTE Journal, 106(2):123-128, February 1997.

[84] H. de Ridder. Minkowski-metrics as a combination rule for digital image coding impairments. In SPIE Human Vision, Visual Processing and Digital Display, volume 1666, pages 16-26, San Jose, CA, 1992.

[85] Sarnoff Corporation. Video Quality Experts Group (VQEG) Frequently Asked Questions. http://www.sarnoff.com/products_services/video_vision/jndmetrix/documents/vqeg_faq.asp.

[86] David Fibush. Overview of picture quality measurement methods. Contribution to IEEE Standards Subcommittee G-2.1.6, May 1997. http://grouper.ieee.org/groups/videocomp/1997g216/pqm.pdf.

[87] S. Pefferkorn and J.-L. Blin. Perceptual quality metric of color quantization errors on still images. In Proc. of SPIE, volume 3299, pages 210-220, San Jose, CA, January 1998.

[88] D. Wang, F. Speranza, A. Vincent, T. Martin, and P. Blanchfield. Towards optimal rate control: a study of the impact of spatial resolution, frame rate, and quantization on subjective video quality and bit rate. In Visual Communications and Image Processing (VCIP 2003), Lugano, Switzerland, July 2003.

[89] T. Hamada, S. Miyaji, and S. Matsumoto. Picture quality assessment system by three-layered bottom-up noise weighting considering human visual perception. SMPTE Journal, 108(1):20-26, 1999.

[90] The Alliance for Telecommunications Industry Solutions (ATIS). Objective perceptual quality measurement using a JND-based full reference technique. Technical Report T1.TR.PP.75-2001, October 2001.

[91] A. Webster. An objective video quality assessment system based on human perception. In Human Vision, Visual Processing, and Digital Display IV, pages 15-26, San Jose, CA, February 1993.

[92] S. Wolf and M. Pinson. Spatial-temporal distortion metrics for in-service quality moni­ toring of any digital video system. In Proceedings of SPIE International Symposium on Voice, Video and Data Communications, pages 175-184, Boston, USA, September 1999. Bibliography 171

[93] International Tellecommunication Union. Revision of ITU-T Recommendation J.144 - Objective perceptual video quality measurement techniques for digital cable television in the presence of a full reference. Temporary document TD-0871rl, ITU, Stuy Group 9, Geneva, April 2003. Study Period 2001-2004.

[94] J. Lauterjung. Picture quality measurement. In International Broadcasting Convention, pages 413-417. lEE, London, 1998.

[95] A. B. Watson. Toward a perceptual video quality metric. In Proceedings of SPIE Human Vision and Electronic Imaging, volume 3299, pages 139-147, Jan Jose, CA, 1998.

[96] A. B. Watson, J. Hu, J. F. McGowan, and J. B. Mulligan. Design and performance of a digital video quality metric. In Proceedings of SPIE Human Vision and Electronic Imaging, volume 3644, pages 168-174, Jan Jose, CA, 1999.

[97] K. T. Tan, M. Ghanbari, and D. E. Pearson. An objective measurement tool for MPEG video quality. Signal Processing, 70(3):279-294, 1998.

[98] C. J. van den Braden Lambrecht and O. Verscheure. Perceptual quality measure using a spatio-temporal model of the human visual system. In Proceedings of Digital Video Compression: Algorithms and Technologies, pages 450-461, San Jose, CA, 1996.

[99] S. Winkler. A perceptual distortion metric for digital color video. In Proceedings of SPIE Human Vision and Electronic Imaging, volume 3644, pages 175-184, San Jose, CA, 1999.

[100] K. T. Tan and M. Ghanbari. A multi-metric objective picture-quality measurement model for MPEG video. IEEE Transactions on Circuits and Systems for Video Technology, 10(7):1208-1213, October 2000.

[101] A. M. Rohaly, J. Libert, P. Corriveau, and A. Webster (ed. committee). Final report from the Video Quality Experts Group on the validation of objective models of video quality assessment. Video Quality Experts Group (VQEG), April 2000.

[102] Working Group 6Q International Telecommunication Union (ITU). Objective perceptual video quality measurement techniques for standard definition digital broadcast television in the presence of a full reference. Draft New Recommendation ITU-R BT.fDoc. 6/39], September 2003.

[103] International Telecommunication Union (ITU). ITU-T Study Group 9. http : / /www. itu.int/ITU-T/studygroups/com09.

[104] R. Hamberg and H. de Ridder. Time-varying image quality: modelling the relation be­ tween instantaneous and overall quality. SMPTE Journal, pages 802-811, November 1999.

[105] A. M. Rohaly, J. Lu, N. R. Franzen, and M. K. Ravel. Comparison of temporal pooling methods for estimating the quality of complex video sequences. In B. Rogowitz and T. N. Pappas, editors, SPIE Human Vision and Electronic Imaging IV, volume 3644, pages 218-225, 1999.

[106] D. Hands. Requirements for a multimedia perceptual model. ITU, Study Group 9 - Contribution 28, July 2001. Bibliography 1 72

[107] S. Wolf and M. Pinson. The relationship between performance and spatial-temporal region size for reduced-reference, in-service video quality monitoring systems. In SCI/ISAS 2001 (Systematics, Cybernetics, and Informatics / Information Systems Analy­ sis and Synthesis), Orlando, Florida USA, July 2001.

108] B. Girod. Psychovisual aspects of image communication. Signal Processing, 28(3):239- 251, 1992.

109] Q. Zhang, W. Zhu, and Y.-Q. Zhang. Resource a]location for multimedia streaming over the Internet. IEEE Transactions on Multimedia, 3(3):339-355, September 2001.

110] L. Wang and A. Vincent. Bit allocation and constraints for joint coding of multiple video programs. IEEE Transactions on Circuits and Systems for Video Technology, 9(6):949- 959, September 1999.

111] L. Borocky, A. Y. Ngai, and E. F. Westermann. Statistical multiplexing using MPEG-2 video encoders. IBM Journal of Research and Development, 43(4):511-520, July 1999.

112] W.-C. Gu and D. W. Lin. Joint rate-distortion coding of multipie videos. IEEE Transac­ tions on Consumer Electronics, 45(1):159-164, February 1999.

113] L. Wang, A. Vincent, and P. Corriveau. Multi-program video coding with joint rate control. In IEEE Globecom, volume 3, pages 1516-1520, Nov 1996.

114] H. Soria], W. E. Lynch, and A. Vincent. Joint bit-allocation for MPEG encoding of multiple video sequences with minimum quality-variation. In IEEE Intl. Symposium on Circuits and Systems, volume II, pages 9-12, Geneva, Switzerland, May 2000.

115] H. R. Shao, W. Zhu, and Y. Q. Zhang. User-aware object-based video communication over next generation networks. Journal of Signal Processing.Tmage Communication, 16:763-784, 2001.

116] J. Shin, W. K. Kim, and C.-C. J. Juo. Content-based packet video forwarding mechanism in differentiated networks. In 10th Intl. Workshop on Packet Video, Sardinia, Italy, May 2000.

117] S. Lee, S. H. Jang, and J. S. Lee. Dynamic bandwidth allocation for multiple VBR MPEG video sources. In IEEE Intl. Conf. on Image Processing, volume 1, pages 268-282,1994.

118] T. Koga, Y. lijima, K. limuna, and T. Ishigure. Statistical performance analysis of an interframe encoder for broadcast television signals. IEEE Transactions on Communica­ tions, C-29:1868-1875, December 1981. 1981.

119] A. Guha and D. J. Reiniger. Multichannel joint rate control of VBR MPEG encoded video for DBS applications. IEEE Transactions on Consumer Electronics, 40:616-623, August 1994.

120] M. Perkins and D. Amstein. Statistical multiplexing of multiple MPEG-2 video programs in a single channel. SMPTE Journal, 104(9):596-599, 1995.

121] L. Wang and A. Vincent. Joint rate control for multi-program video coding. IEEE Transactions on Consumer Electronics, 42(3):300-305, August 1996.

122] H. Z. Sorial, W. E. Lynch, and A. Vincent. Joint transcoding of multiple MPEG video bitstreams. In IEEE Intl. Symposium on Circuits and Systems (ISCAS), volume 4, pages 251-254, Orlando, Florida, May 1999. Bibliography 173

123] ISO/IEC JTC1/SC29/WG11 N0400. Test model 5, April 1993.

124] A. Vincent, P. Corriveau, P. Blanchfield, and R. Renaud. Modelling of the coding gain of joint coding for multi-program video transmission. In Intl. Conf. on Multimedia & Expo (ICME), volume 3, pages 1309-1312, July 2000.

125] J. Touch. TCP control block independence. RFC 2140, April 1997.

126] V. Padmanabhan. Coordinated congestion management and bandwidth sharing for het­ erogeneous data streams. In 9th Intl. Workshop on Network and Operating System Sup­ port for Digital Audio and Video (NOSSDAV), Basking Ridge, NJ, June 1999.

127] H. Balakrishnan, V. Padmanabhan, S. Seshan, M. Stemm, and R. Katz. TCP behavior of a busy web server: Analysis and improvements. In Proceedings of IEEE INFOCOM ’99, San Francisco, CA, March 1998.

128] L. Eggert, J. Heidemann, and J. Touch. Effects of ensemble-TCP. ACM Computer Communication Review, pages 15-29, January 2000.

129] V. Padmanabhan. Addressing the Challenges of Web Data Transport. PhD thesis, Univ. of California, Berkeley, December 1998.

130] H. Balakrishnan and S. Seshan. The congestion manager. RFC 3124, June 2001. IETF Endpoint Congestion Management Working Group.

131] D. Andersen, D. Bansal, D. Curtis, S. Seshan, and H. Balakrishnan. System support for bandwidth management and content adaptation in Internet applications. In 4th Sympo­ sium on Operating Systems Design and Implementation, pages 213-226, San Diego, CA, October 2000.

132] S. A. Akella, S. Seshan, and H. Balakrishnan. The impact of false sharing on shared con­ gestion management. Technical Report CMU-CS-01-135, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA, June 2001.

133] D. Rubenstein, J. Kurose, and D. Towsley. Detecting shared congestion of flows via end-to-end measurement. lEEE/ACM Transactions on Networking, 10(3):381-395, June 2002.

134] A. Hanjalic. Shot-boundary detection: unraveled and resolved? IEEE Transactions on Circuits and Systems for Video Technology, 12(2):90-105, February 2002.

135] J. Boreczky and L. A. Rowe. Comparison of video shot boundary detection tech­ niques. In I.K. Sethi and R. C. Jain, editors, Storage and Retrieval for Image and Video Databases, volume IV of Proceedings of SPIE 2670, 1996.

136] R. Lienhart. Comparison of automatic shot boundary detection algorithms. In IS&T/SPIE Storage and Retrieval for Image and Video Databases VII, volume 3656, pages 290-301, January 1999.

137] B.-L. Yeo and B. Liu. Rapid scene analysis on compressed video. IEEE Transactions on Circuits and Systems for Video Technology, 5(6):533-544, December 1995.

138] J. Feng, K.-T. Lo, and H. Mehrpour. Scene change detection algorithm for MPEG video sequence. In Proceedings of Intl. Conf. on Image Processing (ICIP), volume 2, pages 821-824, 1996. Bibliography 174

[139] H.-C. Liu and G. Zick. Automatic determination of scene changes in MPEG compressed video. In IEEE Intl. Symposium on Circuits and Systems,, volume 1, pages 764-767, Seattle, USA, 1995. [140] P. Bocheck and S. Chang. A content based video traffic model using camera operations. In Intl. Conference on Image Processing (ICIP ’96), September 1996. [141] A. M. Dawwod and M. Ghanbari. Content-based MPEG video traffic modelling. IEEE Transactions on Multimedia, l(l):77-87, March 1999. [142] M. Krunz and A. M. Ramasamy. The correlation structure for a class of scene-based video models and its impact on the dimensioning of video buffers. IEEE Transactions on Multimedia, 2(l):27-36, March 2000. [143] R. Rajkumar, C. Lee, J. Lehoczky, and D. Siewiorek. A resource allocation model for QoS management. In IEEE Real-time Systems Symposium, December 1997. [144] R. Rejaie and M. Handley. Quality adaptation for congestion controlled video playback over the Internet. In Proceedings of ACM SIGCOMM 99, Cambridge, MA, USA, Aug. 31-Sep. 3 1999. [145] S. Nelakuditi, R. R. Harinath, E. Kusmierek, and Z.-L. Zhang. Providing smoother quality layered video stream. In 10th Intl. Workshop on Network and Operating System Support for Digital Audio and Video (NOSSDAV), Chapel Hill, North Carolina, USA, June 2000. [146] M. Nilsson, D. Dalby, and J. O’Donnell. Layered audio-visual coding for multicast distribution on IP networks. In IEEE Packet Video Workshop (PV 2000), Forte Village, Cagliari, Italy, May 2000.

[147] ns-2 Network Simulator, 1998. http : / /www-mash. cs . berkeley. edu/ns. [148] W. Willinger, M. S. Taqqu, R. Sherman, and D. V. Wilson. Self-similarity through high- variability: statistical analysis of Ethernet LAN traffic at source level. In ACM SIG­ COMM ’95, Cambridge, MA, August 1995. [149] D. Hands and S.E. Avons. Recency and duration neglect in subjective assessment of television picture quality. Applied Cognitive Psychology, 15:639-657, 2001. [150] S. Wolf. Continuous quality assessment of long video clips. Institute for Telecommuni­ cation Sciences. Private Communication. [151] A. Basso, I. Dalgic, F. A. Tobagi, and C. J. van den Braden Lambrecht. Feedback-control scheme for low-latency constant-quality MPEG-2 video encoding. In SPIE Proceed­ ings Digital Compression Technologies and Systems for Video Communications, volume 2952, pages 460^71, 1996. [152] X. M. Zhang, A. Vetro, Y. Q. Shi, and H. Sun. Constant quality constrained rate al­ location for FGS-coded video. IEEE Transactions on Circuits and Systems for Video Technology, 13(2):121-130, February 2003. [153] T. Kim and M. H. Ammar. Optimal quality adaptation for MPEG-4 fine-grained scalable video. In Proc. of IEEE Infocom 2003, San Francisco, CA, USA, March 2003. [154] J. D. Salehi, Z.-L. Zhang, J. Kurose, and D. Towsley. Supporting stored video: reduc­ ing rate variability and end-to-end resource requirements through optimal smoothing. lEEE/ACM Transactions on Networking, 6(4):397-410, August 1998. Bibliography 175

155] X. Lu, R. O. Morando, and M. El Zarki. Understanding video quality and its use in feedback control. In IEEE Packet Video Workshop, Pittsburgh, PA, April 2002,

156] B. Bimey. Windows Media 9 series - Reducing broadcast delay. Microsoft Corpora­ tion, April 2003. http://www.inicrosoft.com/windows/windowsmedia/ howto/articles/broadcastd% elay.aspx.

157] P. A. Chou and Z. Miao. Rate-distortion optimized streaming of packetized me­ dia. Technical Report MSR-TR-2001-35, Microsoft Research, February 2001. h t t p : //research.microsoft.com/pachou. 158] P. A. Chou and A. Sehgal. Rate-distortion optimized receiver-driven streaming over best-effort networks. In Packet Video Workshop, Pittsburgh, PA, April 2002.

159] M. Pinson, S. Wolf, P. G. Austin, and A. Allhands. Video quality measurement PC user’s manual, November 2002. h t t p : / /www. i t s . b ld r d o c . gov/n3 /v id e o / documents .htm.

160] M. T. Hagan, H. B. Demuth, and M. H. Beale. Neural Network Design. PWS Publishing Company, Boston, MA, 1996. 161] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. Rumelhart and J. McClelland, editors. Parallel Data Processing, volume 1, chapter 8, pages 318-362. MIT Press, Cambridge, MA, 1986. 162] R. Fletcher. Practical Methods of Optimization. John Wiley & Sons, New York, 2000.

163] P. Bocheck. Content-based video communication: methodology and applications. PhD thesis, Columbia University, 2000. 164] M. Wu, R. A. Joyce, H.-S. Wong, L. Guan, and S.-Y. Kung. Dynamic resource allocation via video content and short-term traffic statistics. IEEE Transactions on Multimedia, 3(2): 186-199, June 2001. 165] P. Bocheck, A. T. Campbell, S.-F. Chang, and R.-F. Liao. Utility-based adaptation for MPEG-4 systems. In 9th Conference on Network and Operation Systems for Digital Audio and Video (NOSSDAV 99), Basking Bridge, NJ, June 1999.

166] R.-F. Liao, P. Bouklee, and A. T. Campbell. Dynamic generation of bandwidth utility curves for utility-based adaptation. In Packet Video ’99, New-York, USA, April 1999.

167] S. Mohamed and G. Rubino. A study of real-time packet video quality using random neural networks. IEEE Transactions on Circuits and Systems for Video Technology, 12(12): 1071-1083, December 2002.

168] P. Gastaldo, S. Rovetta, and R. Zunino. Objective quality assessment of MPEG-2 video streams by using CBP neural networks. IEEE Transactions on Neural Networks, 13(4):939-947, July 2002.

169] P. Gastaldo, R. Zunino, and S. Rovetta. Objective assessment of MPEG-2 video quality. Journal of Electronic Imaging, 11(3), July 2002.

170] F.-H. Lin and M. Mersereau. An optimization of MPEG to maximize subjective quality. In IEEE Intl. Conference on Image Processing, volume 2, pages 547-550, October 1995.

171] F.-H. Lin and M. Mersereau. Rate-quality tradeoff MPEG video encoder. Signal Pro­ cessing: Image Communication, 14:297-309, 1999. Bibliography 116

172] S, Yao, W. Lin, Z. Lu, E, Ong, and X. Yang. Video quality assessment using neural net­ work based on multi-feature extraction. In Visual Communications and Image Processing 2003 - Special Session on Image and Video Quality Assessment: Methods, Metrics and Applications, Lugano, Switcherland, July 2003.

173] M. Wu, R. A. Joyce, H.-S. Wong, L. Guan, and S.-Y. Kung. Dynamic resource allocation via video content and short-term statistics. IEEE Transactions on Multimedia, 3(2): 186- 199, June 2001.

174] F. Despagne and D.L. Massait. Variable selection for neural networks in multivariate calibration. Chemometrics & Intelligent Laboratory Systems, 40:145-163, 1998.

175] I.T. Jolliffe. Principal Component Analysis. Springer-Verlag New York Inc., October 2002.

176] K. Homik, M. Stinchcombe, and H. White. Multilayer feedforward neural networks are universal approximators. Neural Networks, 2(5):359-366, 1989.

177] D. R. Hush and B. G. Home. Progress in supervised neural networks. IEEE Signal Processing Magazine, 10(l):8-39, January 1993.

178] D. E. Pearson. Viewer response to time-varying video quality. SPIE Proceedings, Human Vision and Electronic Imaging III, 3299:16-25, Jan 1998.

179] L. H. Zadeh. Fuzzy sets. Information and control, 8:338-353, 1965.

180] L. H. Zadeh. Outline of a new approach to the analysis of complex systems and decision processes. IEEE Transactions on Systems, Man & Cybernetics, 3(1):28^44, 1973.

181] K. M. Passino and S. Yurkovich. Fuzzy Control. Addison-Wesley, 1998.

182] J. M. Mendel. Fuzzy logic systems for engineering: A tutorial. Proceedings of IEEE, 83(3):345-377, March 1995.

183] P. S. Khedkar and S. Keshav. Fuzzy prediction of timeseries. In IEEE Conference of Fuzzy Systems (FUZZ-IEEE), March 1992.

184] Microsoft Corporation. Windows media 9 series, h t t p : / / w w w .m icro so ft. com/ windows /windowsmedia.

185] Streamcheck Research. News video quality survey, h t t p : / / www. s tre a m c h e c k . com, June 2002.

186] http://www.cs.ucl.ac.uk/stafiyd.miras/QA\^deo/, 2004.

187] M. Ghanbari. Scene-based video quality evaluation. University of Essex. Private Com­ munication. Appendix A

Video sequences

The following table describes the properties of several video sequences used in the experimental results.

Table A.1: Video sequence features.

Sequence           Characteristics                                                  Duration
Akiyo              Head & shoulders newscaster                                      8 sec
Harp               Camera zoom-in of a person playing the harp, saturated colour    8 sec
                   and masking effect
F1 car             Fast movement, saturated colours                                 8 sec
Canoa Valsesia     Water movement, movement in different directions, high detail    8 sec
News1              Head & shoulders: newscaster with simple background              8 sec
Rugby              Outdoor rugby match: movement and colour                         8 sec
Foreman            Facial close-up followed by wide shot of construction site      8 sec
Irene              Shot of woman using sign language, hand movement                 8 sec
Coastguard         Still camera of a moving coastguard boat                         8 sec
News2              Two newscasters superimposed on a background with movement       8 sec
                   (dancers)
Jacknbox           Colour, rapid motion                                             5 sec
Susie              Close-up of woman talking on the phone, some head movement       5 sec
BTadv              BT commercial, several scenes, camera movement, moving humans    15 sec
Ship               Slowly moving commercial ship                                    7 sec
Salesman           TV telesales, low motion, spatial detail                         7 sec
Mobile & Calendar  Colour, motion and spatial detail                                5 sec

Appendix B

Fuzzy adaptive quality smoothing - additional results

The following graphs depict the smoothing properties of the adaptive fuzzy quality controller, gathered from simulations with two additional video sequences: an 8000-frame excerpt from the action movie Terminator and a 3000-frame sequence, called mixed. The latter was artificially constructed by concatenating several of the shorter video scenes listed in Table A.1; it exhibits a wide variation in content activity and therefore produces different quality levels between scenes. Figure B.1 shows the histograms of ΔQ values for the Q_tcpf, Q_target and Q_actual quality series. The results were obtained using the same simulation setup described in Section 5.3, with the initial buffer build-up delay set to 8 sec. The autocorrelation function of the quality series is shown in Figure B.2, while Figure B.3 plots the range of quality values over larger timescales (up to 60 sec). Finally, Figure B.4 shows the impact of the initial buffering delay (startup delay) on the quality-smoothing performance of the fuzzy controller; it also plots the evolution of the sender and receiver buffer sizes (sequence: Terminator).
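For reference, the ΔQ series and the summary statistics reported with Figure B.1 can be reproduced from a sampled quality series along the following lines. This is a minimal sketch, assuming one quality rating per S-T period is available as an array and that ΔQ denotes the magnitude of the quality change between consecutive periods; the function and variable names are illustrative, not taken from the simulation code:

    import numpy as np

    def delta_q_stats(q):
        """Per-period quality changes |Q_t - Q_{t-1}| and their summary statistics."""
        # Assumption: q holds one quality rating per S-T period.
        dq = np.abs(np.diff(np.asarray(q, dtype=float)))
        q1, med, q3 = np.percentile(dq, [25, 50, 75])
        return dq, {"Min.": dq.min(), "1st Quar.": q1, "Median": med,
                    "Mean": dq.mean(), "3rd Quar.": q3, "Max.": dq.max()}

Applying the same routine to each of the three series (Q_tcpf, Q_target, Q_actual) and to each sequence would populate the six columns of the statistics table.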

[Figure B.1 placeholder: histogram panels of ΔQ values, (a) Sequence: Terminator, (b) Sequence: mixed; x-axis: ΔQ, approximately 0-30.]

            Terminator                         mixed
            ΔQ_tcpf  ΔQ_target  ΔQ_actual     ΔQ_tcpf  ΔQ_target  ΔQ_actual
Min.            0.0        0.0        0.0         0.0        0.0        0.0
1st Quar.      1.02       0.07       0.11        0.80       0.11       0.14
Median         2.33       0.14       0.26        1.75       0.26       0.33
Mean           3.24       0.27       1.06        2.77       0.33       1.30
3rd Quar.      4.16       0.30       0.30        3.15       0.47       1.32
Max.          37.01      10.05      25.01       44.43       2.83      26.93

Figure B.1: Histograms of ΔQ values: ΔQ_tcpf, ΔQ_target and ΔQ_actual for sequences Terminator and mixed. Summary statistics of all depicted distributions are shown for comparison.

[Figure B.2 placeholder: autocorrelation plots, (a) Sequence: Terminator, (b) Sequence: mixed; x-axis: lag (in S-T periods), 0-50.]

Figure B.2: Autocorrelation of Q_tcpf, Q_target and Q_actual for two video sequences, Terminator and mixed.
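The autocorrelation curves can likewise be estimated directly from the quality series. A sketch of the standard sample-autocorrelation estimator, with the lag expressed in S-T periods (illustrative, not the thesis code):

    import numpy as np

    def autocorrelation(q, max_lag=50):
        """Sample autocorrelation of a quality series for lags 0..max_lag."""
        q = np.asarray(q, dtype=float) - np.mean(q)
        var = np.dot(q, q)  # unnormalised variance; acf at lag 0 is 1 by construction
        return np.array([np.dot(q[:len(q) - k], q[k:]) / var
                         for k in range(max_lag + 1)])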

[Figure B.3 placeholder: box-plot panels, (a) Sequence: Terminator, (b) Sequence: mixed; x-axis: timescale (sec), 0.4-60.]

Figure B.3: Box-plots depicting the range of variation of Q_tcpf, Q_target and Q_actual values at larger timescales.
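Each box in Figure B.3 summarises the quality range (maximum minus minimum) observed within windows of a given length. One way to compute the underlying values is sketched below, assuming non-overlapping windows and a sampling period of 0.4 sec, the smallest timescale on the axis; both the window treatment and the period are assumptions, not details stated with the figure:

    import numpy as np

    def ranges_at_timescale(q, window_sec, period_sec=0.4):
        """Max-min quality range within consecutive non-overlapping windows."""
        # period_sec is an assumed sampling period, not from the thesis.
        w = max(1, int(round(window_sec / period_sec)))  # samples per window
        n = (len(q) // w) * w                            # drop the ragged tail
        blocks = np.asarray(q[:n], dtype=float).reshape(-1, w)
        return blocks.max(axis=1) - blocks.min(axis=1)

    # One box per timescale on the x-axis of Figure B.3:
    # {t: ranges_at_timescale(q_actual, t) for t in (0.4, 0.8, 1, 2, 4, 5, 10, 20, 30, 40, 60)}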

[Figure B.4 placeholder: top: distributions of quality change vs. playout delay (6, 8, 10, 12, 16 sec); bottom: sender buffer and receiver buffer sizes vs. time (sec).]

Figure B.4: Impact of the startup delay on quality smoothness and buffer stability. Top: distributions of ΔQ_tcpf, ΔQ_target and ΔQ_actual. Bottom: sender and receiver buffer sizes over the duration of the simulation (sequence: Terminator).