LOW COMPLEXITY SCALABLE VIDEO ENCODING

by

Rashad M. Jillani

A Dissertation Submitted to the Faculty of

The College of Engineering and Computer Science

In Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

Florida Atlantic University

Boca Raton, FL

May 2012

LOW COMPLEXITY SCALABLE VIDEO ENCODING

by

Rashad M. Jillani

This dissertation was prepared under the direction of the candidate's dissertation co-advisors, Dr. Hari Kalva and Dr. Abhijit S. Pandya, Department of Computer and Electrical Engineering and Computer Science, and has been approved by the members of his supervisory committee. It was submitted to the faculty of the College of Engineering and Computer Science and was accepted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

SUPERVISORY COMMITTEE:

Hari Kalva, Ph.D.
Dissertation Co-Advisor

Abhijit S. Pandya, Ph.D.
Dissertation Co-Advisor

Hanqi Zhuang, Ph.D.

Borko Furht, Ph.D.
Chair, Department of Computer and Electrical Engineering and Computer Science

Mohammad Ilyas, Ph.D.
Interim Dean, College of Engineering and Computer Science

Dean, Graduate College

ACKNOWLEDGEMENTS

First and foremost, my deepest gratitude is to my advisors, Dr. Hari Kalva and Dr.

Abhijit Pandya. I have been amazingly fortunate to have advisors who gave me the freedom to explore on my own and at the same time the guidance to recover when my steps faltered. Hari taught me how to question thoughts and express ideas. His patience and support helped me overcome many crisis situations and finish this dissertation.

I am very grateful to my committee members, Dr. Imad Mahgoub, Dr. Hanqi

Zhuang and Dr. Sam Hsu for their insightful comments and constructive criticisms at different stages of my research. Unfortunately, Dr. Sam Hsu had to opt out of the committee at the last moment due to his travel arrangements.

I would also like to thank Dr. Chiranjib Bhattacharyya (Associate Professor,

Department of Computer Science and Automation, IISc, Bangalore, India), and his student Urvang Joshi for helping me with the generation of training classifiers using cost-sensitive and chance-constrained approaches.

Next I would like to thank the former and current members of Dr. Kalva's MLAB, especially Chris Holder, Paula Carrillo, Sebastian Possos, Velibor Adzic and many others for their friendship, support and advice. MLAB and its members have been wonderful through all these years of my graduate research and it has been fun working here. I have


enjoyed every moment that we have worked together, including all those late nights in the

lab and out of lab activities.

I would like to thank the many friendly faces of the CEECS staff, especially Rosemary,

Jean, Helene, and Gina who patiently guided me through administrative procedures.

Surely, this dissertation would never have been completed without their assistance.

Of course no acknowledgements would be complete without giving thanks to my

family for their years of sacrifice and support. Thanks to my parents, brother and sisters,

without their unconditional love, support, and prayers, I would never have achieved this goal. They have given up so many things for me to pursue a PhD; they have cherished with me every great moment and supported me whenever I needed it. To Aatif, thanks for your understanding, continuous encouragement and endless patience through this journey. I am also grateful to Aamna and Aaliya for their support.

Last, but certainly not least, I must acknowledge with tremendous and deep thanks my wife, Umaira. Umaira, through your love, support and unwavering belief in me, I have been able to complete this long dissertation journey. You are my biggest fan and supporter. You have given me confidence and motivated me to complete this dissertation. Thank you with all my heart and soul. Your complete and unconditional love carries me through always.


ABSTRACT

Author: Rashad M. Jillani

Title: Low Complexity Scalable Video Encoding

Institution: Florida Atlantic University

Dissertation Co-Advisor: Dr. Hari Kalva

Dissertation Co-Advisor: Dr. Abhijit Pandya

Degree: Doctor of Philosophy

Year: 2012

The emerging Scalable Video Coding (SVC) extends the H.264/AVC video coding standard with new tools designed to efficiently support temporal, spatial and SNR scalability. In real-time multimedia systems, the coding performance of video encoders and decoders is limited by computational complexity. This thesis presents techniques to manage computational complexity of H.264/AVC and SVC video encoders. These techniques aim to provide significant complexity saving as well as a framework for efficient use of SVC.

This thesis first investigates, experimentally, the computational complexity of MB coding mode decision in the H.264/AVC video encoder. Based on machine learning techniques, complexity reduction algorithms are proposed. It is shown that these


algorithms can reduce the computational complexity of Intra MB coding with negligible loss of video quality. Complexity reduction algorithms based on statistical classifiers are proposed for SVC encoder. It is shown that these algorithms can flexibly control the computational complexity in enhancement layers of SVC with negligible loss of video quality.

The inherent relationship of MB mode decision in base and enhancement layers of

SVC is investigated through experimental tests and a rate model function is proposed. An innovative fast mode decision model is developed to reduce the computational complexity by using the layer relationship along with the rate model function. We develop a general framework that applies to SVC and use this framework to adapt the SVC bitstream, employing low-complexity video encoding together with input video streaming constraints.

The proposed SVC-based framework uses both objective low-complexity video encoding techniques and subjective saliency-based video adaptation, resulting in optimal use of network resources. The approaches described in this thesis can not only reduce the computational complexity of a video encoder, but can also manage the trade-off between complexity and distortion. The proposed algorithms are evaluated through experimental testing in terms of complexity reduction performance, rate-distortion performance, and subjective and objective visual quality. The advantages and disadvantages of each approach are discussed.


DEDICATION

To Umaira

LOW COMPLEXITY SCALABLE VIDEO ENCODING

TABLE OF CONTENTS

LIST OF TABLES ...... xiii

LIST OF FIGURES ...... xv

1. INTRODUCTION ...... 1

1.1 Overview ...... 1

1.2 Problem Statement ...... 3

1.3 Contributions ...... 10

1.4 Organization ...... 11

2. FUNDAMENTALS OF VIDEO CODING ...... 13

2.1 Introduction ...... 13

2.2 General Framework ...... 14

2.3 Generic Compression System ...... 16

2.3.1 Hybrid Video Codec ...... 18

2.3.2 Transforms ...... 21

2.3.3 Quantization ...... 24

2.3.4 Entropy Coding ...... 26


2.4 Video Encoding Process...... 30

2.4.1 Intra Frame Prediction ...... 32

2.4.2 Inter Frame Prediction ...... 34

2.5 Video Quality Metrics ...... 38

2.6 Video Encoding Standards ...... 41

2.6.1 Single Layer Coding (H.264/AVC) ...... 42

2.6.2 Scalable Video Coding (SVC) ...... 48

2.7 Scalable Extension of H.264/AVC...... 50

2.7.1 Hierarchical Coding Structure ...... 52

2.7.2 Inter-Layer Prediction ...... 54

2.7.3 Progressive Refinement Slices (Quality Scalability) ...... 56

2.7.4 NAL Unit Syntax ...... 57

2.8 Summary ...... 59

3. SURVEY OF RELATED WORK ...... 60

3.1 Introduction ...... 60

3.2 Mode Decision Complexity ...... 61

3.3 Adaptive GOP Structure...... 63

3.4 Early Skip Schemes ...... 63

3.5 Exploiting Layer Information in SVC ...... 64


3.6 Exploiting Psycho-Visual Characteristics ...... 64

3.7 Summary ...... 67

4. H.264/AVC COMPLEXITY REDUCTION ...... 68

4.1 Introduction ...... 68

4.2 Intra Prediction Complexity ...... 69

4.3 Fast Intra Coding ...... 71

4.3.1 Machine Learning (ML) and Data Mining (DM) ...... 72

4.3.2 Applying Machine Learning for Intra Prediction ...... 73

4.3.3 Low Complexity Mode Decision ...... 82

4.3.4 Performance Evaluation ...... 94

4.4 Summary ...... 109

5. SVC COMPLEXITY REDUCTION I...... 111

5.1 Introduction ...... 111

5.2 Cost-Sensitive Learning ...... 114

5.2.1 Mathematical Formulation ...... 116

5.2.2 Implementation...... 118

5.2.3 Performance Evaluation ...... 118

5.3 Chance Constrained Approach ...... 120

5.3.1 Mathematical Formulation ...... 122


5.3.2 Geometrical Formulation ...... 125

5.3.3 Implementation...... 127

5.3.4 Performance Evaluation ...... 129

5.4 Summary ...... 132

6. SVC COMPLEXITY REDUCTION II ...... 134

6.1 Introduction ...... 134

6.2 SVC Layers Correlation Evaluation ...... 135

6.2.1 Spatial Scalability ...... 136

6.2.2 SNR Scalability ...... 155

6.3 Mode Decision and Inter-Layer Prediction ...... 157

6.4 Fast Mode Decision Model ...... 167

6.5 Summary ...... 174

7. ADAPTIVE DELIVERY OF SCALABLE VIDEO ...... 176

7.1 Introduction ...... 176

7.2 System Architecture ...... 178

7.3 System Model ...... 183

7.3.1 Problem Definition ...... 184

7.3.2 Region Based Solution ...... 184

7.3.3 System Implementation ...... 185


7.4 Summary ...... 187

8. CONCLUSIONS AND FUTURE WORK ...... 189

8.1 Conclusion ...... 189

8.2 Future Work ...... 193

BIBLIOGRAPHY ...... 195


LIST OF TABLES

Table 4-I: Decision Trees and Mode Decisions ...... 83

Table 4-II: Node 1 Decision Tree Statistics ...... 84

Table 4-III: Node 2 Decision Tree Statistics ...... 85

Table 4-IV: Node 3 Decision Tree Statistics ...... 85

Table 4-V: Node 4 Decision Tree Statistics ...... 85

Table 4-VI: Node 5 Decision Tree Statistics ...... 86

Table 4-VII: Node 6 Decision Tree Statistics ...... 86

Table 4-VIII: Node 7 Decision Tree Statistics ...... 86

Table 4-IX: Node 8 Decision Tree Statistics ...... 87

Table 4-X: Node 9 Decision Tree Statistics ...... 87

Table 4-XI: Node 10 Decision Tree Statistics ...... 87

Table 4-XII: Node 11 Decision Tree Statistics ...... 88

Table 4-XIII: Node 12 Decision Tree Statistics ...... 88

Table 4-XIV: Comparison Results For Top Level Classifier ...... 107

Table 4-XV: Comparison Results For Intra 16x16 Classifier ...... 107

Table 4-XVI: Comparison Results For Intra 4x4 Classifier ...... 108

Table 4-XVII: Comparison Results For All Classifiers ...... 108

Table 4-XVIII: Comparison Results with Other Algorithm [59] ...... 108

Table 5-I: Mathematical Notations for Cost Sensitive Classifier ...... 116

Table 5-II: Classification Results by Cost-Sensitive Learning ...... 119

Table 5-III: Mathematical Notations for Chance Constrained Classifier ...... 122

Table 6-I: Comparison of SVC to Simulcast and Single-Layer Scenarios ...... 141

Table 6-II: Comparing Delta QP of EL for Given BL QP ...... 153

Table 6-III: Mode Distribution in EL ...... 168

Table 6-IV: Proposed Model vs. Reference Encoder ...... 171


LIST OF FIGURES

Figure 2.1: Overview of a Video Coding System ...... 15

Figure 2.2: Hybrid Motion-Compensated Video Encoder ...... 18

Figure 2.3: Hybrid Motion Compensated Video Decoder ...... 19

Figure 2.4: Function of a 7-Step Uniform Quantizer ...... 26

Figure 2.5: An Example of Huffman Tree ...... 27

Figure 2.6: An Example of Arithmetic Encoding Process ...... 29

Figure 2.7: GOP consisting of 7 Frames ...... 31

Figure 2.8: GOP of 12 Frames (M=3 and N=2) ...... 32

Figure 2.9: A 4x4 Block and its Neighboring Samples ...... 33

Figure 2.10: Directions of 9 Modes of Intra 4x4 ...... 33

Figure 2.11: Directions of 4 Modes of Intra 16x16 ...... 34

Figure 2.12: ...... 35

Figure 2.13: An Example of Logarithmic Search ...... 37

Figure 2.14: Block Diagram for H.264/AVC Encoder ...... 43

Figure 2.15: Block Diagram for H.264/AVC Decoder ...... 45

Figure 2.16: Coding Structure for SVC Extension of H.264/AVC ...... 51

Figure 2.17: Hierarchical Prediction Structure ...... 53

Figure 2.18: Hierarchical Prediction Structure with Inter-Layer Prediction ...... 54

Figure 2.19: SVC NAL Unit ...... 58

Figure 3.1: H.264 CPU usage distribution for Intra frames ...... 61

Figure 4.1: Decision Tree Creation ...... 74

Figure 4.2: Decision Trees in Complexity Reduction Mode ...... 75

Figure 4.3: Decision Tree for Intra Coding ...... 80

Figure 4.4: Attribute Selection for Intra 16x16 ...... 82

Figure 4.5: Attributes Selection for Intra 4x4 Modes ...... 90

Figure 4.6: Partitions for Intra 16x16 vs. Intra 4x4 ...... 93

Figure 4.7: RD Performance for Top Level Classifier (QCIF) ...... 97

Figure 4.8: RD Performance for Top Level Classifier (CIF)...... 98

Figure 4.9: RD Performance for Top Level Classifier (CCIR) ...... 98

Figure 4.10: RD Performance for Intra 16x16 Classifier (QCIF) ...... 100

Figure 4.11: RD Performance for Intra 16x16 Classifier (CIF)...... 101

Figure 4.12: RD Performance for Intra 16x16 Classifier (CCIR) ...... 101

Figure 4.13: RD Performance for Intra 4x4 Classifier (QCIF) ...... 103

Figure 4.14: RD Performance for Intra 4x4 Classifier (CIF)...... 103

Figure 4.15: RD Performance for Intra 4x4 Classifier (CCIR) ...... 104

Figure 4.16: RD Performance for Combined Intra Classifier (QCIF) ...... 105

Figure 4.17: RD Performance for Combined Intra Classifier (CIF) ...... 106

Figure 4.18: RD Performance for Combined Intra Classifier (CCIR) ...... 106

Figure 5.1: EL RD Performance and BL RD Performance using Cost-Sensitive

Learning...... 119


Figure 5.2: Geometrical Interpretation of SOC Constraint ...... 127

Figure 5.3: RD Performance with Chance Constrained Classifier for mobile and

flower sequences ...... 131

Figure 5.4: Time Complexity Reduction with Chance Constrained Classifier for

mobile and flower sequences ...... 131

Figure 5.5: RD Performance with Machine Learning Classifier for mobile and

flower sequences ...... 132

Figure 5.6: Time Complexity Reduction with Machine Learning Classifier for

mobile and flower sequences ...... 132

Figure 6.1: Average relative computational complexity ...... 139

Figure 6.2: RD Performance for SVC, Simulcast and Single Layer solutions for

sequences Flower and Mobile ...... 140

Figure 6.3: RD Performance for SVC, Simulcast and Single Layer solutions for

sequences City and Crew ...... 141

Figure 6.4: BL is quantized at variable rate while EL is quantized at constant rate.

Red area over SL curve shows SVC overhead compared to SL for

Mobile Sequence (QCIF-CIF)...... 143

Figure 6.5: BL is quantized at constant rate while EL is quantized at variable rate.

Red area over SL curves shows SVC overhead compared to SL for

Mobile Sequence (QCIF-CIF)...... 144


Figure 6.6: BL is quantized at variable rate while EL is quantized at constant rate.

Red area over SL curve shows SVC overhead compared to SL for

Crew Sequence (CIF-4CIF)...... 145

Figure 6.7: BL is quantized at constant rate while EL is quantized at variable rate.

Red area over SL curves shows SVC overhead compared to SL for

Crew Sequence (CIF-4CIF)...... 146

Figure 6.8: Proposed rate modeling results. (a) Flower, QCIF-CIF. (b) Mobile,

QCIF-CIF. (c) City, CIF-4CIF. (d) Crew, CIF-4CIF ...... 148

Figure 6.9: Coding performance of 2-layer spatial scalability for City sequence ...... 149

Figure 6.10: Coding performance of 2-layer spatial scalability for Crew sequence ...... 149

Figure 6.11: Coding performance of 2-layer spatial scalability for Flower sequence .... 150

Figure 6.12: Coding performance of 2-layer spatial scalability for Mobile sequence .... 150

Figure 6.13: EL RD performance w.r.t. BL QP for Mobile sequence ...... 154

Figure 6.14: EL RD performance w.r.t. BL QP for Flower sequence ...... 154

Figure 6.15: 2-layers SNR performance for City sequence in 4CIF format, with

differential QP equal to 6 and 2...... 156

Figure 6.16: Average (CIF and 4CIF) relative computational complexity for different

coding options: single layer, 2-layer Simulcast and SVC 2-layers SNR

coding ...... 156

Figure 6.17: Correlation of MB Type in base and enhancement layer for

Foreman(24,18) ...... 160


Figure 6.18: Correlation of MB type in base and enhancement layer for

Crew(24,12) ...... 161

Figure 6.19: Correlation of MB type in base and enhancement layer for

Crew(24,24) ...... 162

Figure 6.20: Correlation of MB type in base and enhancement layer for

Flower (24,12)...... 163

Figure 6.21: Correlation of MB type in base and enhancement layer for

City(24,18) ...... 163

Figure 6.22: Correlation of MB type in base and enhancement layer for

City(24,36) ...... 164

Figure 6.23: RD comparison for proposed technique for Mobile sequence @

QP_BL = 24 ...... 172

Figure 6.24: RD comparison for proposed technique for City sequence @

QP_BL = 30 ...... 172

Figure 6.25: RD comparison for proposed technique for Crew sequence @

QP_BL = 36 ...... 173

Figure 7.1: Example of H.264/SVC video streaming with heterogeneous receiving

devices and variable network conditions...... 180

Figure 7.2: SVC based video streaming framework ...... 181

Figure 7.3: Comparison of SVC (spatial scalability) and cropping from HD format to

480x320...... 183

Figure 7.4: Scalable ROI/Saliency based System ...... 186


Figure 7.5: Process of bitstream extraction and selection ...... 187


1. INTRODUCTION

1.1 Overview

Image compression techniques play a central role in video encoding technology. Compressed video reduces storage space and bandwidth requirements and is therefore suitable for storage and transmission. The efficiency of video encoding techniques, in terms of increased quality and reduced bandwidth requirements, has played a major role in the explosive growth of digital video applications in the past decade. It benefits the recording, exchange, processing, and retrieval of visual information over wired and wireless digital communication infrastructure. Video coding is the science of representing visual information in a compact and reliable way. This goal is achieved through proper analysis, transformation, and reorganization of digital video information.

Compression of images (and video signals) is possible because a large part of the information in an image is redundant. Image compression techniques seek to exploit the various statistical redundancies present in the signal and the perceptual limitations of human vision in order to obtain a compact representation, thereby minimizing the amount of information necessary to represent an image for transmission or storage purposes. It is necessary that the image reconstructed from this compact representation be of visually acceptable quality. It is also usually required that the complexity of the image compression system be low. The various requirements of an application, along with economic


and computing constraints, usually determine the available transmission capacity, which in turn places a bound on the coding rate and on the fidelity with which reconstruction can be achieved.

Development of video compression is rooted in the source coding theorem introduced by Shannon [1]. According to the theorem, source coding and channel coding can each perform asymptotically close to their theoretical bounds. The goal of source coding is to represent the original source with a bitstream as short as

possible, given a quality requirement and the assumption that the transmission channels

deliver the coded video error free. The current state-of-the-art video compression technology achieves a data reduction of up to 80 times, i.e., the compressed bitstream is only 1.25% of the size of the original video, while maintaining excellent visual quality. The distribution of multimedia contents through numerous applications, such as DVD video, Video on

Demand (VoD) and video streaming, High Definition TV (HDTV), Digital Video

Broadcasting (DVB), has significantly changed the way people acquire and exchange

visual content, and has thereby improved our everyday life.

IP networks are key to digital communications in today’s world. All types of

computer networks and TV networks, over wired or wireless media,

provide a means for efficient video transmission. Nearly all network protocols make it

possible for users from different types of networks to communicate and share multimedia

contents. As a result, new video applications have emerged and demand for high quality

services, including accessibility, resolution, latency, and reliability, has increased. There


is a growing demand for high quality video services over a diverse range of client capabilities and transmission channel capacities through heterogeneous networks.

1.2 Problem Statement

The problem addressed in this work is broadly defined as the complexity of video encoding and the adaptation of the bitstream subject to network and end-user resources.

However, this characterization requires further refinement.

Beyond the goal of achieving high compression gains, a fundamental problem in video coding is its high complexity in terms of computing resources. This is essentially a result of the increasing popularity of video applications distributed over various error-prone, limited-bandwidth networks and targeted towards resource-constrained mobile devices. Many real-time and mobile applications require low power and low complexity.

Two major issues challenge the existing video coding technology for transmission over heterogeneous networks, where multimedia contents are delivered to heterogeneous clients. The first is the complexity of the coding process. Video encoding is an extremely compute-intensive task, requiring high capacity servers to complete even the most basic video encoding projects in a reasonable time frame. Compared with other types of data, such as text, voice and images, transmitted over networks, video data consumes the most bandwidth. Limited transmission bandwidth, low processor power, and memory availability all restrict the delivery capabilities of networks. In software implementations of current video coding standards such as

H.263 and MPEG-4, the performance of the codec is limited by available processing power as well as, or rather than, by available bandwidth. It is therefore


important to develop flexible methods of managing the computational complexity of video encoding and decoding.

Heterogeneity is another factor that constrains video applications over IP networks. Different types of networks have varying bandwidth and traffic loads. Clients have diverse equipment and different quality requirements. For example, multicast video clients may want different display resolutions, and client systems may have different caching or intermediate storage resources. Even if multiple clients want the same resolution of video, they may have varying capacities to display the video, thus requiring multiple qualities of the same video at the same resolution. Efficient video compression should make video streams suitable for transmission through heterogeneous, unreliable networks and maintain high quality. Error resilience and scalability are desirable features and essential to accomplish this mission.

In wireless communications networks, an important issue that must be addressed is the limited energy supply of a mobile device, especially in wireless video applications.

The problem becomes even more critical with the power-demanding video encoding functionality integrated into the mobile computing platform. As the available power in the battery decreases, the performance of video encoding is limited by the available processing power as well as, or rather than, the available transmission bandwidth.

Moreover, from the power consumption perspective, efficient video compression significantly reduces the size of the video data to be transmitted, which in turn saves a significant amount of energy in data transmission. On the other hand, more efficient video compression often requires higher power consumption. All of this implies that


there is a tradeoff among the bandwidth R, power consumption P, and video quality D. To

find a relationship between power consumption and video encoding complexity, we need an analytic framework to explore the P-R-D behavior of the video encoding system.

However, such a framework is beyond the scope of this dissertation, though it is an important aspect. We

believe that reduced complexity in video encoding inherently implies

reduced power consumption.

Scalable video coding (SVC) is a coding technique which codes a video sequence

into a set of bitstreams that can support various levels of quality at the decoder.

Scalability provides robustness to video in that certain types of loss are acceptable and

may not terminate decoding or severely affect visual quality. There are three major types

of scalability, namely spatial, temporal and fidelity (SNR) scalability.

The advances in the international video coding standards such as H.261, MPEG-1,

MPEG-2 Video, H.263, and MPEG-4 Visual have been crucial in the success of digital

video applications [2] [3] [4] [5]. These standards made it possible to interoperate among

products from different manufacturers while at the same time providing flexibility in

implementations and optimizations in heterogeneous environment scenarios. The

H.264/AVC video coding standard [6] is the latest and state-of-the-art video coding

standard. When compared to the previous coding standards, H.264/AVC significantly

increases the coding efficiency – a property measured by the reduction in bit rate necessary

to represent a given level of perceptual quality.

Continuous evolution of receiving devices and the increasing usage of

heterogeneous network environments that are characterized by a widely varying


connection quality gives rise to the need for SVC, which allows on-the-fly adaptation to certain application requirements, such as the display and processing capabilities of target devices, and to varying transmission conditions. SVC has been an active research topic in video encoding for at least 20 years. The prior international video coding standards MPEG-2

Video, H.263, and MPEG-4 Visual already include the scalable profiles. However, the scalable profiles of these standards never gained popularity for their use in commercial products. Reasons for that include the considerable growth in decoder complexity and significant reduction in coding efficiency (i.e. bit rate increase for a given level of reconstruction quality) as compared to the corresponding non-scalable profiles. The new

SVC extension [7] of the H.264/AVC standard addressed these drawbacks, which had made the success of scalability profiles impossible. The characteristics of traditional video transmission systems, the significant loss in coding efficiency caused by the spatial and quality scalability features, and the large increase in decoder complexity compared to the non-scalable profiles were the major factors in the failure of the scalability profiles of the earlier video coding standards.

The H.264 video coding standard is the latest block-oriented, motion-compensation-based codec standard, developed by the ITU-T Video Coding Experts Group (VCEG) together with the ISO/IEC Moving Picture Experts Group (MPEG). H.264 can achieve considerably higher coding efficiency than previous standards. Unfortunately, this comes at the cost of considerably increased complexity at the encoder, mainly due to motion estimation and mode decision. The high computational complexity of H.264 and the real-


time requirements of video systems represent the main challenge to overcome in the

development of efficient encoder solutions.

The compression efficiency of H.264/AVC has increased mainly because of the

large number of coding options available. For example, the H.264 codec supports Intra

prediction with 3 different block sizes and Inter prediction with 8 different block sizes.

The encoding of a macroblock (MB) involves evaluating all the possible coding options

and selecting an option that has the least cost associated with it. Resource constrained

devices typically manage the complexity by using a subset of possible coding modes

thereby sacrificing video quality. This quality and complexity relationship is evident in

most video codecs used today. Most H.264 encoder implementations on mobile devices do not implement the standard profiles fully due to high complexity.

Most of the traditional complexity reduction approaches in video coding are based on eliminating a subset of allowed coding modes, sacrificing quality for reduced complexity. The traditional approaches have not been able to reduce the encoding complexity enough to enable the use of these features on resource constrained devices. We develop machine learning based approaches to reduce the complexity of video encoding. This approach reduces the computationally expensive elements of encoding, such as coding-mode evaluation, to a classification problem with negligible complexity. The key contribution of this work is the exploration of machine learning in video encoding applications. In this research, we also focus our attention on macroblock mode decision, one of the most stringent tasks involved in the encoding process.


The Scalable Video Coding (SVC) standard has higher complexity than

H.264/AVC since it has spatial, temporal and quality scalability in addition to

H.264/AVC functionality. Furthermore, the introduction of additional tools in SVC, i.e., inter-layer prediction and layered coding for spatial scalability, makes motion estimation and mode decision more complex.

Multimedia tools ranging from multimedia messaging, video telephony, and video conferencing over mobile TV, wireless and Internet video streaming, to standard and high definition TV are being widely used in homes and workplaces. The most prominent of these tools use the Internet and wireless networks as carriers for video applications. Therefore, the transmission of video data over the Internet and wireless networks carries much more weight than traditional data traffic over networks. The transmission of video data over such networks is exposed to unreliable and variable transmission conditions, which can be handled gracefully by using the scalability features of video coding. Furthermore, the heterogeneous nature of the network infrastructure and the usage of a variety of decoding devices with a range of display options and computational capabilities add to the complexity of the situation.

Therefore, we need a flexible standard which can adapt once-encoded content according to the needs of heterogeneous target networks and devices, while simultaneously providing interoperability of encoder and decoder devices from different manufacturers.

The recent deployment of Internet and mobile networks has greatly contributed to the success and adoption of distributed multimedia communications. Additionally, recent


advances in hardware and the great spread of digital communications have stimulated the

research interest in digital techniques for encoding and transmitting visual information.

Multimedia traffic, and more specifically digital video traffic, has grown in recent years.

This video traffic, not only on the Internet, but also on the mobile networks, increases

daily. Videoconferencing, streaming video and digital television are some of the currently

available video services. Furthermore, HDTV systems are a contemporary revolution in

the world of domestic multimedia applications, offering the best video qualities available

in the broadcast systems. HD video quality implies an increase in the resolution of the

domestic television sets; in other words, the number of points, or pixels, that make up an

image on the screen is greater in HD. Whereas standard television transmits images

with a resolution of 720x576 pixels, the high-resolution image has been increased up to

1920x1080 pixels, which implies a considerable increase in the amount of data to transmit. Even though it is now possible to transmit at greater data rates over current communications platforms, it is not economically viable to dedicate such an amount of bandwidth to a single video communication service. It is at this point that video compression standards enter the picture. By considerably reducing the amount of data, and therefore the resources required for transmission and storage, while guaranteeing high image quality, compression schemes enable the storage and transmission of video information in a more compact form.

Recently, HTTP-based delivery for Video on Demand (VoD) has been gaining popularity. Progressive download over HTTP, typically used in VoD, takes advantage of the widely deployed network caches to relieve video servers from sending the same


content to a high number of users in the same access network. Instead of transmitting all the representations of a video targeted for a variety of end-users in a heterogeneous environment in the form of multiple H.264/AVC bitstreams, the layered structure of SVC provides flexibility in such a framework through a single bitstream.

1.3 Contributions

This dissertation is focused on reducing the complexity of the scalable extension of H.264/AVC video encoding using machine learning techniques. The idea behind using machine learning is to exploit structural similarities in video in order to predict optimal coding modes through statistical analysis and probabilistic approaches. In general, this research work contributes the following innovations related to the complexity reduction of video encoding.

• Video complexity reduction techniques based on data mining using machine

learning are designed (Chapter 4 - Chapter 5).

• Evaluation of different machine learning algorithms in terms of reduction in

complexity based on the analysis of the suitability of video features (Chapter 4 -

Chapter 5).

• The design of a cost-sensitive classifier for the machine learning algorithm (Section

5.2). We developed a new approach by combining decision trees with a chance-

constrained classifier.

• The design of a chance-constrained classifier for the machine learning algorithm

(Section 5.3) to be applied on SVC layers.


• Development of the framework by combining machine learning with the

statistical approaches, namely Cost-Sensitive and Chance-Constrained (Chapter

5).

• Performance evaluation of the statistical-classifier-based models in Scalable

Video Coding (Chapter 5-6).

• Complexity reduction for SVC and optimal layer selection in SVC. We developed

a fast mode decision model in SVC (Chapter 6).

• Development of the framework for adaptive delivery of SVC contents.

A framework is proposed for bitstream adaptation in a video streaming environment.

(Chapter 7)

1.4 Organization

The first three chapters introduce the research work, provide background information, and give a survey of existing research. The proposed algorithms, culminating in the development of a practical complexity reduction technique along with an efficient reduced-complexity encoding framework, are presented in the remaining chapters.

• Chapter 1 briefly introduces the background of the proposed work, the motivations

behind this research work and the objectives it aims to achieve. The

problems inherent in the video encoding i.e., complexity and the distribution of

multimedia contents over heterogeneous networks, are described. Next a few

examples of the video applications are listed. Finally, the contributions of this

research work are given.


• Chapter 2 covers background material regarding video encoding. First, the video

encoding standard H.264/AVC is described in general. A review of basic concepts

of video coding techniques is made, analyzing the main functions used in

most video compression standards. Next we describe Scalable Video Coding

(SVC) in more detail including different types of scalabilities.

• Chapter 3 gives a survey of existing research in complexity reduction in video

encoding.

• Chapter 4 presents the new complexity reduction techniques in H.264/AVC for

Intra coding we developed for the framework.

• Chapter 5 builds on the work in Chapter 4 by proposing new research examining

the statistical approaches towards the complexity reduction in H.264-SVC in

detail. Research presented in this chapter forms the basis for further work in the

development of the effective framework for low complexity video encoding

targeted for heterogeneous environment.

• Chapter 6 presents a detailed analysis of spatial and SNR scalability. We have

shown the effectiveness and variation of inter-layer prediction in spatial and SNR

scalability. We have also proposed a fast mode decision model as a function of

base and enhancement layer mode decision.

• Chapter 7 uses the techniques presented in Chapters 4, 5 and 6 to design a

framework for effective bitstream extraction and adaptation from SVC bitstream.

• Chapter 8 presents concluding remarks and discusses possible future extensions.


2. FUNDAMENTALS OF VIDEO CODING

2.1 Introduction

Digital video consists of a sequence of still pictures. Individual pictures of a

digital video that are close in time are usually very similar and highly correlated. While

containing a large amount of visual information, video sequences also carry

considerable redundancy. Uncompressed video is not suitable for general-purpose applications due to the

demanding storage space and transmission bandwidth required.

Video compression represents the visual information in a compact bit stream by reducing the redundancy. Effectively removing redundant information between or within the pictures, and organizing information intelligently, are two fundamental methods to achieve high compression gains. By considering the features of the human visual system, which is not sensitive to certain visual information, such as high frequency noise and fast motion, higher compression gains can be obtained with acceptable and controllable quality and information loss. This results in a significant reduction in the amount of bits needed for storage and transmission, and thus enables a wide variety of applications.

In addition to making the video bit stream as small as possible while keeping required qualities, other features are also required in video compression. In general, the desirable features include fast and easy accessibility, cross platform compatibility, scalability, and reliability for storage and delivery. Depending on applications, a


particular feature may benefit the overall performance more than others. All the above requirements need, in terms of information theory, additional redundancy in the compressed bit streams. It then becomes an ultimate goal to achieve a perfect balance between compression efficiency and other functionalities.

Much of the material in this chapter is background material. Original work includes the evaluation of existing video encoding standards and the complexity inherent in them. The subsequent development of the classification of complexity reduction techniques is also a contribution of this work.

2.2 General Framework

The components of a video coding algorithm are determined to a large extent by the source model that is adopted for modeling the video sequences. The video coder seeks to describe the contents of a video sequence by means of its source model. The source model may make assumptions about the spatial and temporal correlation between pixels of a sequence. It might also consider the shape and motion of objects or illumination effects. In Figure 2.1, we show the basic components in a video coding system. In the encoder the digitized video sequence is first described using the parameters of the source model. If we use a source model of statistically independent pixels, then the parameters of this source model would be the luminance and chrominance amplitudes of each pixel. On the other hand, if we use a model that describes a scene as several objects, the parameters would be the shape, texture, and motion of individual objects.

In the next step, the parameters of the source model are quantized into a finite set of symbols. The quantization parameters depend on the desired tradeoff between the bit


rate and distortion. The quantized parameters are finally mapped into binary code-words using lossless coding techniques, which further exploit the statistics of the quantized parameters. The resulting bitstream is transmitted over the communication channel. The decoder retrieves the quantized parameters of the source model by reversing the binary encoding and quantization process of the encoder. Then, the image synthesis algorithm at the decoder computes the decoded video frame using the quantized parameters of the source model.

Figure 2.1: Overview of a Video Coding System

The rest of this chapter reviews the basic technologies that provide the above features in video compression. Section 2.3 covers the technologies in the different stages of the state-of-the-art motion-compensated video compression approach. In Section 2.4, the video encoding process in general is described. Section 2.5 covers video quality metrics. In

Section 2.6, international standards in the area of video compression are briefly introduced. In Section 2.7, the scalable extension of H.264/AVC is introduced.

2.3 Generic Compression System

Networked visual communication has become an active research area in recent years. One of the most challenging problems for the implementation of a video communication system is that the available bandwidth of the network is usually insufficient for the delivery of the voluminous amount of video data. In order to solve this problem, considerable effort has been applied in the last three decades to the development of video compression techniques. These efforts have resulted in video coding standards such as H.261, H.263, MPEG-1, MPEG-2 and

MPEG-4.

Image and Video Coding is an optimization problem. A successful image and

video coding algorithm delivers a good trade-off between visual quality and other coding performance measures, such as compression, complexity, scalability, robustness and security.

Compared to analog signals, digital signals can be easily stored and analyzed by computers. They can also be transmitted directly over a digital communication link like

T1. Public databases often archive digital signals as well. As a result, digital signals have become part of a wide variety of applications. However, they are not perfect. The main problem lies in their enormous data size, which often clogs low-capacity memory and congests network traffic.

The most effective solution to reduce the large file size lies in data compression, which decreases the storage space for the same amount of data.


The process of compacting data into a smaller number of bits is named

Compression. That is to say, Video Compression is the process of compacting a digital

video sequence into a smaller number of bits. A digital video sequence is created with a

set of successive images. Without compression, it can be seen as

a succession of images shown at a rate that produces, in our visual system, the

sensation of movement. The use of video compression makes possible digital video over

networks that would not support uncompressed video (raw format), i.e. current Internet

capabilities are insufficient to handle it in real time. Besides, a DVD can only store a few

seconds of raw video; therefore, video and audio compression are necessary. High bitrate

connections and high storage capacity, such as Blu-ray Disc (BD) [8], may help, but today they are not enough to support uncompressed video. Even with constant advances in storage and transmission capacity, compression seems to be an essential component of multimedia services for many years to come. From an economic and enterprise point of view, it turns out to be much more profitable to reduce the amount of information to send or to store, with the purpose of offering competitive prices to the final consumer while maximizing the utilization of the services being deployed.

Digital video coding technologies have been studied by scientists, researchers, and engineers all over the world for more than twenty years. Numerous approaches and algorithms have been reported and applied in practical applications. Among them, block-based motion-compensated prediction techniques have become fundamental to the industry and to all existing international standards. The following parts of this section review the general techniques of this video coding technology.


2.3.1 Hybrid Video Codec

A video codec consists of two parts: an encoder and a decoder. The encoder performs the task of transforming a video sequence into a compressed bit stream, and the decoder reconstructs the video content from the bit stream. A well designed codec has many common features shared between the encoder and decoder. The encoder contains three main function units: a temporal unit, a spatial unit, and an entropy coding unit, which remove temporal, spatial, and statistical redundancy of original video sequences, respectively.

Block-based hybrid motion-compensated video coding is the most successful and widely used video compression technique today. It is illustrated in the block diagrams of Figure 2.2 and Figure 2.3.

Figure 2.2: Hybrid Motion-Compensated Video Encoder

The main loop of the scheme consists of the temporal unit, called a motion-

compensated predictor. It uses block-based prediction coding on neighboring video frames. The residual signal from the prediction coding is then fed into the spatial unit.

The spatial correlation between pixels within a block is removed by transform coding. All the information from both the temporal and spatial units is further entropy coded to remove the statistical redundancy remaining in the signals. Adjustments to the symbols may be required prior to the entropy coding unit so that maximum compression gains can be achieved.
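To make the flow of Figure 2.2 concrete, the following minimal Python sketch walks one frame through a drastically simplified hybrid loop. It is illustrative only: zero-motion prediction stands in for a real motion search, the identity stands in for the spatial transform, entropy coding is omitted, and the function name encode_frame is hypothetical.

```python
import numpy as np

BLOCK = 16  # macroblock size used by most block-based standards

def encode_frame(frame, reference, qstep=8):
    """One pass of the hybrid loop: predict, transform (stand-in),
    quantize, and locally decode so encoder and decoder share the
    same reference for the next frame."""
    h, w = frame.shape
    decoded = np.empty_like(frame)
    symbols = []  # quantized residuals handed to the entropy coder
    for y in range(0, h, BLOCK):
        for x in range(0, w, BLOCK):
            cur = frame[y:y+BLOCK, x:x+BLOCK].astype(np.int32)
            # Temporal unit: zero-motion prediction from the reference
            # (a real encoder searches for the best motion vector).
            pred = reference[y:y+BLOCK, x:x+BLOCK].astype(np.int32)
            residual = cur - pred
            # Spatial unit: a 2-D transform such as the DCT would go
            # here; the identity is used to keep the sketch short.
            coeff = residual
            q = np.rint(coeff / qstep).astype(np.int32)
            symbols.append(q)
            # Mirror the decoder: inverse quantize and reconstruct.
            decoded[y:y+BLOCK, x:x+BLOCK] = np.clip(pred + q * qstep, 0, 255)
    return symbols, decoded

prev = np.random.randint(0, 256, (64, 64)).astype(np.uint8)
cur = np.roll(prev, 1, axis=1)  # toy "motion": shift by one pixel
syms, recon = encode_frame(cur, prev)
```

The design point the sketch preserves is that the encoder reconstructs each block exactly as the decoder will, so both sides predict from identical reference frames.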

Figure 2.3: Hybrid Motion Compensated Video Decoder

A video frame is first divided into multiple blocks, which are the basic processing

units of motion estimation and prediction. Properly selected block sizes provide a good

tradeoff between compression efficiency and computation complexity. Large blocks

benefit spatial compression in that less block-to-block correlation is maintained. A small block size is more feasible in computation and also provides more accurate motion estimation, especially in areas with complex content. The normal block size in video coding is 16×16. The video encoding standard H.264/AVC allows flexible selection of block sizes and shapes for higher compression gains [6]. In H.264, blocks smaller than


16×16, in either square or rectangular shapes, can be the units of motion compensation.

Thus different motion vectors, each for a particular part of the 16×16 block, can better describe the complicated scene areas and motions.

Discrete Cosine Transform (DCT) and wavelet transform are the two most studied transform techniques for video coding. Both have the capability to represent information with fewer coefficients. The DCT is the most used transform in industry due to its high efficiency and simplicity.

Quantization restricts coefficient values to a limited number of levels, which greatly improves the compression gains of the subsequent entropy coding. It is the only step that causes information loss in normal encoding and decoding processing. Scalar quantization is used in all image and video coding standards. Vector quantization, or quantization in multiple dimensions, may bring higher compression gains but has yet to be widely used due to its cost in complexity [9].

The main loop in Figure 2.2 yields two types of information to be sent to entropy encoding: quantized coefficients and motion vectors. Both are compressed by the entropy coding unit to form the bit-stream. A reversible run-length coding stage is applied before entropy coding to further reduce the number of symbols. Huffman coding and arithmetic coding are two widely used variable-length entropy coding (VLC) algorithms.
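As a sketch of the reversible run-length stage, the toy encoder below collapses zero runs in a serialized coefficient block into (run, level) pairs; the pair layout and the end-of-block marker are assumptions for illustration, not the syntax of any particular standard.

```python
def run_length_encode(coeffs):
    """Collapse zero runs into (run, level) pairs; coefficients are
    assumed already serialized (e.g., in zig-zag order)."""
    pairs, run = [], 0
    for c in coeffs:
        if c == 0:
            run += 1
        else:
            pairs.append((run, c))  # zeros skipped before this level
            run = 0
    pairs.append((run, 0))          # end-of-block marker
    return pairs

def run_length_decode(pairs):
    out = []
    for run, level in pairs[:-1]:
        out.extend([0] * run)
        out.append(level)
    out.extend([0] * pairs[-1][0])  # trailing zeros from the marker
    return out

# The stage is reversible: decoding recovers the input exactly.
assert run_length_decode(run_length_encode([7, 0, 0, -2, 0, 1, 0, 0])) \
       == [7, 0, 0, -2, 0, 1, 0, 0]
```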

Side information, such as headers, decoding parameters and synchronization information, is generated by the encoder during compression. It is represented in a group of equal-length and variable-length codes. The above three types of information are


combined according to standard specifications to form the final video stream for

transmission.

Three types of frames have been adopted internationally. The I frames are intra

coded without temporal prediction. Because they are not dependent on any other frames,

they can be used as random access points and error correction points. The P frame is

inter-frame prediction coded using one reference frame. The B frame is bidirectional

predicted using two reference frames from both before and after the coded frame.

Decoding of P and B type frames is dependent on their references and so is affected by

errors in their references.

2.3.2 Transforms

Block transforms have been studied extensively and applied successfully to

practical applications. The transforms convert spatial signals into a set of transform

coefficients with respect to a transform basis. Discrete separable orthogonal transforms are

widely used in digital image and video processing due to the efficient decorrelation,

separable computations, real-valued functions, and symmetry of the forward and backward transform matrices. Transform coefficients can be independent if the optimal orthogonal transforms are used.

Discrete Cosine Transform (DCT)

The Discrete Cosine Transform (DCT) is the most popular 2-D discrete separable

orthogonal transform used in video compression. Given the block signal $f_{m,n}$, $0 \le m, n \le N-1$, the forward and backward DCT transforms are defined as:

$$F_{k,l} = a_k a_l \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} f_{m,n} \cos\left( \frac{(2m+1)k\pi}{2N} \right) \cos\left( \frac{(2n+1)l\pi}{2N} \right) \qquad (2.1)$$

$$f_{m,n} = \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} a_k a_l F_{k,l} \cos\left( \frac{(2m+1)k\pi}{2N} \right) \cos\left( \frac{(2n+1)l\pi}{2N} \right) \qquad (2.2)$$

where $a_0 = \sqrt{1/N}$ and $a_k = \sqrt{2/N}$ for $1 \le k \le N-1$.
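For a numerical check, Equations (2.1) and (2.2) can be transcribed directly into NumPy by exploiting the separability of the 2-D transform, i.e. $F = C f C^T$ with $C_{k,m} = a_k \cos((2m+1)k\pi/2N)$. The helper names dct2 and idct2 below are our own; this is a brute-force reference, not one of the fast algorithms of [11].

```python
import numpy as np

def _dct_matrix(N):
    # C[k, m] = a_k * cos((2m+1) k pi / (2N)), as in Eqs. (2.1)-(2.2)
    k = np.arange(N)
    a = np.where(k == 0, np.sqrt(1.0 / N), np.sqrt(2.0 / N))
    return a[:, None] * np.cos((2 * k[None, :] + 1) * k[:, None] * np.pi / (2 * N))

def dct2(f):
    C = _dct_matrix(f.shape[0])
    return C @ f @ C.T       # forward transform, Eq. (2.1)

def idct2(F):
    C = _dct_matrix(F.shape[0])
    return C.T @ F @ C       # backward transform, Eq. (2.2)

block = np.random.randint(0, 256, (8, 8)).astype(float)
assert np.allclose(idct2(dct2(block)), block)  # orthogonality: exact inverse
```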

The DCT transform is applied to blocks of prediction residual data or intra coded

data of video frames in most standards and applications. For commonly encountered

pictures, the DCT transform basis vectors approximate those of the Karhunen-Loeve

Transform (KLT), which is a statistics-based transform proven to be optimal in the sense of minimizing the geometric mean of coefficient variances [10]. Since, unlike the KLT basis, it is not data dependent, the DCT is often used as a substitute for the KLT in practice. Another important reason for the success of the DCT is the availability of fast and inexpensive computation algorithms [11].

The signal transform itself does not compress anything. The goal of the signal

transform step is to concentrate energy into fewer transform coefficients than the original

signal. For natural images, the coefficients have a very uneven distribution, which

generally leads to compression gains when they are entropy encoded. In lossy

compression, transform domain coefficients can easily be dropped or reduced in accuracy

according to their importance to the image quality. This is useful in low bit rate image

and video compression.

22

Integer Transform

A DCT based integer transform is introduced in H.264 to transform residual data

in 4x4 blocks [6] [12]. It approximates the 4x4 DCT transform by adjusting some

coefficients, separating it into two simple operations: an all-integer core transform and a scaling process.

Let X be 4x4 picture data and Y the corresponding DCT domain coefficients. The

4x4 DCT is given by:

$$Y = AXA^{T} = \begin{bmatrix} a & a & a & a \\ b & c & -c & -b \\ a & -a & -a & a \\ c & -b & b & -c \end{bmatrix} X \begin{bmatrix} a & b & a & c \\ a & c & -a & -b \\ a & -c & -a & b \\ a & -b & a & -c \end{bmatrix} \qquad (2.3)$$

where $a = \frac{1}{2}$, $b = \sqrt{\frac{1}{2}} \cos\frac{\pi}{8}$, and $c = \sqrt{\frac{1}{2}} \cos\frac{3\pi}{8}$. The matrix multiplication can be equivalently rewritten as:

$$Y = \left( C X C^{T} \right) \otimes E = \left( \begin{bmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{bmatrix} X \begin{bmatrix} 1 & 2 & 1 & 1 \\ 1 & 1 & -1 & -2 \\ 1 & -1 & -1 & 2 \\ 1 & -2 & 1 & -1 \end{bmatrix} \right) \otimes \begin{bmatrix} a^2 & \frac{ab}{2} & a^2 & \frac{ab}{2} \\ \frac{ab}{2} & \frac{b^2}{4} & \frac{ab}{2} & \frac{b^2}{4} \\ a^2 & \frac{ab}{2} & a^2 & \frac{ab}{2} \\ \frac{ab}{2} & \frac{b^2}{4} & \frac{ab}{2} & \frac{b^2}{4} \end{bmatrix} \qquad (2.4)$$

where $\otimes$ indicates scalar multiplication of coefficients at the same positions of the matrices $(CXC^{T})$ and $E$. Here $d = c/b$ is modified to $1/2$, and $b$ to $\sqrt{2/5}$, to maintain the orthogonality of the transform.

The inverse transform first multiplies each coefficient by a weighting factor and then applies the inverse core transform, as given by the following equation:

$$X = C_{I}^{T} \left( Y \otimes E_{I} \right) C_{I} = \begin{bmatrix} 1 & 1 & 1 & \frac{1}{2} \\ 1 & \frac{1}{2} & -1 & -1 \\ 1 & -\frac{1}{2} & -1 & 1 \\ 1 & -1 & 1 & -\frac{1}{2} \end{bmatrix} \left( Y \otimes \begin{bmatrix} a^2 & ab & a^2 & ab \\ ab & b^2 & ab & b^2 \\ a^2 & ab & a^2 & ab \\ ab & b^2 & ab & b^2 \end{bmatrix} \right) \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & \frac{1}{2} & -\frac{1}{2} & -1 \\ 1 & -1 & -1 & 1 \\ \frac{1}{2} & -1 & 1 & -\frac{1}{2} \end{bmatrix} \qquad (2.5)$$
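The development above can be verified numerically. The sketch below implements the forward core transform and scaling of Equation (2.4) and the inverse of Equation (2.5), and confirms an exact round trip; the variable names (Cf, Ci, Ef, Ei) and the floating-point check are our own conventions, not part of the standard text.

```python
import numpy as np

a, b = 0.5, np.sqrt(2.0 / 5.0)  # b adjusted to sqrt(2/5), as above
Cf = np.array([[1, 1, 1, 1], [2, 1, -1, -2],
               [1, -1, -1, 1], [1, -2, 2, -1]])      # forward core, Eq. (2.4)
Ci = np.array([[1, 1, 1, 1], [1, .5, -.5, -1],
               [1, -1, -1, 1], [.5, -1, 1, -.5]])    # inverse core, Eq. (2.5)
Ef = np.outer([a, b/2, a, b/2], [a, b/2, a, b/2])    # forward scaling E
Ei = np.outer([a, b, a, b], [a, b, a, b])            # inverse scaling E_I

def forward(X):
    # All-integer core (adds and shifts in hardware) plus scaling;
    # in H.264 the multiplication by Ef is folded into quantization.
    return (Cf @ X @ Cf.T) * Ef

def inverse(Y):
    # Scaling first, then the integer inverse core, as in Eq. (2.5).
    return Ci.T @ (Y * Ei) @ Ci

X = np.random.randint(-128, 128, (4, 4)).astype(float)
assert np.allclose(inverse(forward(X)), X)  # exact reconstruction
```

In a real codec the multiplications by Ef and Ei disappear into the quantization and inverse quantization tables, leaving only the add-and-shift cores, which is exactly the complexity reduction described in the next paragraph.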

Putting inverse scaling before inverse transform can avoid loss of decoding

accuracy and mismatch between encoder and decoder. The forward and inverse transforms are orthogonal. The integer transform reduces complexity in that the first part of the transform can be implemented using only additions and shifts. The scaling processes $E$ and $E_{I}$ are then integrated into the quantization and inverse quantization steps, respectively.

2.3.3 Quantization

Quantization is important to image compression owing to its effect on image quality and rate control. Quantization uses a limited number of values to approximately represent real signals according to the range they fall in. It is appropriate for removing unnecessary information to which the human visual system is not sensitive, such as the fractional parts of signals after de-correlation or predictive coding, and before entropy coding. Compression is achieved by transmitting only values in the finite set instead of any possible value. On the other hand, too much loss of information leads to quality degradation. Quantization is a lossy operation and is irreversible.

Generally, multi-dimensional quantization, also called vector quantization, is more efficient than one-dimensional quantization, i.e., scalar quantization, in the sense of compression gain. But scalar quantization is more popular due to its lower complexity

compared to vector quantization. A scalar quantizer is a many-to-one function along the

real axis. An example of a seven-step uniform quantizer is shown in Figure 2.4.

The whole real axis is divided into seven non-overlapping intervals. Five of them, between $[x_1, x_6)$, are of equal length. Real numbers in each interval are mapped to a selected single value, the quantizer level, to minimize quantization errors.
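A midtread uniform quantizer of the kind shown in Figure 2.4 takes only a few lines; the step size and the clipping to seven levels below are assumptions for illustration.

```python
import numpy as np

def quantize(x, step, levels=7):
    """Midtread uniform quantizer: round to the nearest multiple of
    step, then clip to (levels-1)/2 steps on each side of zero."""
    m = (levels - 1) // 2                       # 3 steps on each side
    idx = np.clip(np.rint(np.asarray(x) / step), -m, m)
    return idx.astype(int)                      # transmitted symbol

def dequantize(idx, step):
    return idx * step                           # reconstruction level

x = np.array([-10.2, -0.4, 0.3, 2.6, 7.9])
print(dequantize(quantize(x, step=2.0), 2.0))   # [-6.  0.  0.  2.  6.]
```

The outer two intervals are unbounded, mirroring Figure 2.4: any input beyond the last decision boundary maps to the extreme quantizer level.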

The information loss due to quantization is mathematically computable. A

commonly used objective in quantization is to achieve the highest compression with minimum

quantization error. While the optimal quantization levels for a given signal distribution have to

be calculated iteratively [13], in general, fixed quantization levels are specified based on

practical statistics. The choices of quantization steps are designed to meet both

compression and quality requirements.



Figure 2.4: Function of a 7-Step Uniform Quantizer

2.3.4 Entropy Coding

Entropy coding is usually used as the last step of source coding to remove

statistical redundancies of messages. It outputs binary bitstreams which are ready for

transmission.

Huffman Coding

Huffman codes by far are the most popular entropy codes used in image and video

compression. The basic idea of Huffman coding is to give high-probability symbols short

code-words and low-probability symbols longer codes.


The optimal Huffman encoding procedure includes the construction of a Huffman codebook according to the source probability distribution, and the encoding of incoming source symbols based on the codebook [14]. Two assumptions are made for this to be valid: 1) the probability distribution is known in advance; 2) the coding scheme for a given source is not adaptive. Changes in the probability distribution will result in varying compression efficiency.

Figure 2.5: An Example of Huffman Tree

Creating codebooks requires constructing Huffman trees. Figure 2.5 demonstrates a canonical Huffman tree for a source with a four-entry alphabet. Unique routes from the root to every leaf node index the valid code-words corresponding to the leaf symbols.

Variable length code-words are assigned to each leaf as needed.
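The tree-building procedure can be sketched with a binary heap: repeatedly merge the two least probable subtrees, prefixing '0' to one side and '1' to the other. The probabilities below are assumed for illustration and, as noted above, other equally valid codebooks exist for the same source.

```python
import heapq

def huffman_codebook(probs):
    # Each heap entry: (probability, tiebreak id, {symbol: code-so-far}).
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(sorted(probs.items()))]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # two least probable subtrees
        p2, i, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, i, merged))
    return heap[0][2]

book = huffman_codebook({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125})
print(book)  # e.g. {'a': '0', 'b': '10', 'c': '110', 'd': '111'}, up to relabeling
encoded = "".join(book[s] for s in "abba")  # encoding replaces each symbol
```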

Note that the way to construct the Huffman tree is not unique, i.e., there can be different codebooks for the same source. They may not have the same coding efficiency.


Encoding an incoming message is carried out according to the designed codebook.

The encoder replaces each incoming source symbol in the message with the

corresponding code-word. The output of the encoder is a compressed bit-stream which maintains all information of the source message.

Arithmetic Coding

Arithmetic coding has been studied for over ten years as an alternative to Huffman coding. It is more attractive than Huffman coding due to two factors: 1) it can achieve fractional average bit rates without chunking a long message into blocks of symbols; 2) it allows the coding procedure to adapt to the statistical model of the source, which results in higher compression gains. It has been an option in the video compression standard H.263 [5]

and recently became the key technique of H.264/AVC [6].

Unlike Huffman coding, the basic idea behind arithmetic coding is to map the

whole source symbol sequence into one single variable length code-word. The probability distribution of symbols is still the metric of code-word generation: likely symbols yield short codes and unlikely symbols give long ones.

In arithmetic coding, a message of symbols is represented by an interval of real numbers within [0, 1) [15]. The size of the interval reduces as the knowledge of the message increases. Symbols with different probabilities reduce the interval by different amounts and add corresponding bits to the code. The encoder outputs, in the form of binary bits, a fractional number located in the final interval. The decoder uses the received binary number to locate the interval by performing a scaling-down segmentation procedure similar to encoding.


Figure 2.6: An Example of Arithmetic Encoding Process

Figure 2.6 demonstrates an encoding example. Assume a source with alphabet {a, b, c, d} whose probabilities are shown in Figure 2.6, and assume the incoming message is b c a c d. The encoder narrows the working interval to [0.25, 0.5) when it encounters symbol b. That range then becomes [0.275, 0.3125) when symbol c comes in, and so on. The final subinterval is [0.295625, 0.29590625). Any number within this subinterval is precise enough to specify the message.
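The interval-narrowing step of this example can be sketched in a few lines of Python. The cumulative intervals used below are illustrative placeholders only; the actual probabilities of the example are those shown in Figure 2.6.

    def arithmetic_encode(message, cum_intervals):
        # Narrow [low, high) once per symbol; any number in the final
        # interval identifies the whole message.
        low, high = 0.0, 1.0
        for symbol in message:
            c_low, c_high = cum_intervals[symbol]
            width = high - low
            low, high = low + width * c_low, low + width * c_high
        return low, high

    # Placeholder cumulative intervals on [0, 1), one per symbol:
    intervals = {"a": (0.0, 0.25), "b": (0.25, 0.5),
                 "c": (0.5, 0.75), "d": (0.75, 1.0)}
    print(arithmetic_encode("bcacd", intervals))   # a small subinterval of [0.25, 0.5)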

As with Huffman coding, any error in the code may lead the decoder to a totally different interval and render the rest of the encoded information unusable.


2.4 Video Encoding Process

The pictures in a video sequence may be encoded in different ways, depending on the video compression standard and its properties. The three most common kinds of pictures available in the multimedia standards are Intra (I), Predictive (P), and Bi-predicted (B) pictures. Each of them exploits one or more kinds of redundancy. The goal of the different kinds of pictures is to exploit the diverse classes of redundant information that can be found in a video sequence.

• Intra pictures (I Type): These images are encoded using only spatial redundancy (as explained in Section 2.4.1), without reference to previous or later images. This type of image is used to establish reference points for the reproduction of a video sequence and to reduce error propagation. A GOP starts with an Intra picture.

• Predicted pictures (P Type): These images are encoded with motion estimation and compensation (explained in Section 2.4.2), using a previous I or P image as a reference. For this type of image, it is possible to reduce the size to up to half that of an I image.

• Bi-directional or Interpolated pictures (B Type): These are coded using motion-compensated prediction from past and/or future I or P images (the two closest frames) relative to the one being encoded. These images need more than one reference frame to be encoded, using motion estimation and compensation. An error in an image of this type does not affect any other, because this type of image is never used as a reference for others. For this kind of image, the compression factor is usually around one fourth of an I image.

Figure 2.7: GOP consisting of 7 Frames

This division into types of images allows the complexity of encoders and decoders to be controlled. Depending on the types of images selected for encoding the video sequence, the computational time will vary, as will the compression ratio, the error resilience, the quality of the sequence, and so on. Figure 2.7 shows seven pictures in a GOP, where the arrows indicate the references for a picture.

The order of the different types of images in a sequence is given by the parameters M and N: M is the distance between two successive anchor (I or P) pictures, whereas N is the distance between two I pictures. A typical encoding pattern is M = 3 and N = 12, as seen in Figure 2.8.
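A short Python sketch makes the relation between M, N, and the resulting picture-type pattern concrete (the function is illustrative only):

    def gop_pattern(m=3, n=12):
        # One GOP in display order: an I picture, then an anchor (P)
        # every m pictures, with m-1 B pictures in between.
        types = []
        for i in range(n):
            if i == 0:
                types.append("I")
            elif i % m == 0:
                types.append("P")
            else:
                types.append("B")
        return "".join(types)

    print(gop_pattern(3, 12))   # IBBPBBPBBPBB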

As shown in Figure 2.8, most of the images are encoded as B pictures. This allows a higher compression factor and reduces the probability that an error appears in an image used as a reference (I and P pictures). Due to the need for synchronization in case of errors, as well as to enable random access, the distance between two I images is usually 10 or 12 in a video sequence with 25 images per second; the decoder would then have to wait at most about 0.5 seconds to synchronize itself again.

Figure 2.8: GOP of 12 Frames (M=3 and N=12)

2.4.1 Intra Frame Prediction

Intra prediction in H.264/AVC is conducted in the spatial domain. A prediction block is formed based on previously coded and reconstructed blocks, and the difference between the current block and the prediction is coded. From Figure 2.9, we can see that intra 4×4 prediction is conducted for samples a~p of a block using A~Q as prediction samples. For luma samples, there are two intra prediction block types: intra 4×4 and intra 16×16. H.264/AVC FRExt (Fidelity Range Extension) added intra 8×8 prediction, but most profiles of H.264 do not support it. Intra 4×4 prediction has 9 directional prediction modes (Figure 2.10). Intra 16×16 is suitable for smooth image areas and has 4 prediction modes: vertical, horizontal, DC, and plane (Figure 2.11). The chroma prediction technique is similar to the intra 16×16 prediction of luma.

Figure 2.9: A 4x4 Block and its Neighboring Samples

Figure 2.10: Directions of 9 Modes of Intra 4x4


Figure 2.11: Directions of 4 Modes of Intra 16x16

2.4.2 Inter Frame Prediction

Inter-frame prediction predicts the current frame from reference frames to reduce temporal redundancy. The transmitted prediction residual has smaller energy than the original data and therefore uses less bandwidth. The more accurate the prediction is, the less energy is left in the residual. Due to movement of objects, camera changes, or scene cuts, highly similar areas are spatially displaced between two frames. Therefore, inter-frame prediction is not applied to a whole frame, but rather to each small area of a frame. The spatial displacement from a reference area to the corresponding region in the current frame is called a motion vector. This is illustrated in Figure 2.12.


Figure 2.12: Motion Estimation

The procedure of searching a reference image for the region that best matches the intensity pattern of interest is called motion estimation. The motion vectors and the prediction errors are transmitted to the decoder as the major parts of the video message. The decoder combines motion vectors, prediction errors, and predictors in the reference frames to recover the actual pixel values. This reversed process is called motion compensation; inter-frame prediction is thus also called motion-compensated prediction.

Video compression only considers the motions of individual objects contained in the video. To date, all current video coding standards use block-based approaches to reflect local motions, called block matching. A matched block is one that satisfies a certain criterion in the motion estimation. The "motions" found may not always be the actual motions of natural objects, but these vectors still point to the most similar blocks in the references and thus meet the requirements of predictive coding. Simple linear motions of blocks have been shown to be accurate enough to approximate any practical motion, given the small intervals between video frames.

Several criteria for motion estimation are commonly used [16]. In video compression, average performance matters more than other concerns, such as worst-case performance. The quadratic (squared-error) criterion is not a good choice because a single large prediction error can dominate the distortion measure. The more popular absolute-difference criterion is not only more robust, but also simpler to implement, since no multiplication is needed.

An exhaustive search for the matching block is time-consuming and computationally costly. The logarithmic search [17] is a popular fast practical search method. It checks blocks only at some dominant points and then shrinks the search region along certain directions based on the search results at those points. The motion estimate is then repeatedly refined until certain requirements are met. Thus, the number of times the criterion function must be evaluated is greatly reduced. Figure 2.13 demonstrates a three-step logarithmic search method [18].

Centered at the location of the current block, a square section is initiated as the first search area. This search area usually has a side length shorter than that of the total neighborhood to be searched. In this example, the total neighborhood covers 12x12 blocks and the search area in step 1 is 8x8. Nine blocks are chosen as search candidates; they are marked by circles in Figure 2.13. The one that gives the minimum distortion (the block at -4, 4) is then picked as the center of the search area in step 2. The new search area has a side length half of that in step 1, and another nine candidates are evaluated. The third step is carried out based on the outcome of step 2. The final motion vector is (-5, 3).
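A minimal Python sketch of the three-step search using the absolute-difference (SAD) criterion follows; frame layout and boundary handling are simplified for exposition, and all names are ours.

    import numpy as np

    def sad(block, candidate):
        # Sum of absolute differences: the matching criterion discussed above.
        return np.abs(block.astype(int) - candidate.astype(int)).sum()

    def three_step_search(cur, ref, bx, by, bsize=16, step=4):
        # Evaluate nine candidates per step; the step size halves each
        # round (4, 2, 1), so roughly 25 SAD evaluations replace a full search.
        block = cur[by:by + bsize, bx:bx + bsize]
        mvx, mvy = 0, 0
        while step >= 1:
            best = None
            for dy in (-step, 0, step):
                for dx in (-step, 0, step):
                    x, y = bx + mvx + dx, by + mvy + dy
                    if 0 <= x <= ref.shape[1] - bsize and 0 <= y <= ref.shape[0] - bsize:
                        cost = sad(block, ref[y:y + bsize, x:x + bsize])
                        if best is None or cost < best[0]:
                            best = (cost, dx, dy)
            mvx, mvy = mvx + best[1], mvy + best[2]
            step //= 2
        return mvx, mvy   # the motion vector of the best-matched block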

Figure 2.13: An Example of Logarithmic Search


Many other methods can be used to discover the details of motions. An example is to apply phase correlation combined with frequency domain motion detection and block matching [19].

2.5 Video Quality Metrics

In order to specify, evaluate and compare video communication systems it is necessary to determine the quality of the video sequences delivered to the viewer.

Measuring the visual quality is a difficult and often an imprecise art since there are so many factors that can affect the results. Visual quality is inherently subjective and is influenced by many factors that make it difficult to obtain a completely accurate measure of quality. Using objective criteria gives accurate, repeatable results but as yet there are no objective measurement systems that completely reproduce the subjective experience of a human observer watching a video display.

The most widely used video quality metric is the Peak Signal to Noise Ratio (PSNR) (Equation 2.7). PSNR is measured on a logarithmic scale and depends on the Mean Squared Error (MSE) (Equation 2.6) between an original and a decoded image or video frame, relative to $(2^n - 1)^2$ (the square of the highest possible signal value in the image, where $n$ is the number of bits per image sample).

$$MSE = \frac{1}{M \times N} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} \left( I_{original}(m,n) - I_{decoded}(m,n) \right)^2 \quad (2.6)$$

$$PSNR_{dB} = 10 \log_{10} \frac{(2^n - 1)^2}{MSE} \quad (2.7)$$
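A direct Python transcription of Equations 2.6 and 2.7, for 8-bit samples by default, is:

    import numpy as np

    def psnr(original, decoded, n=8):
        # Equation 2.6: mean squared error between the two frames.
        diff = original.astype(np.float64) - decoded.astype(np.float64)
        mse = np.mean(diff ** 2)
        if mse == 0:
            return float("inf")            # identical frames
        # Equation 2.7: PSNR in dB relative to the peak value (2^n - 1)^2.
        return 10 * np.log10((2 ** n - 1) ** 2 / mse)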


The PSNR measure suffers from a number of limitations. PSNR requires an

unimpaired original image for comparison but this may not be available in every case and

it may not be easy to verify that an original image has perfect fidelity. For a given image

or image sequence, high PSNR usually indicates high quality and low PSNR usually

indicates low quality. However, PSNR ratings do not necessarily correlate with

subjective quality.

The Moving Pictures Quality Metric (MPQM) [20] is another metric that incorporates some modeling of the HVS, in particular two key human perception phenomena that have been intensively studied: contrast sensitivity and masking. The first phenomenon accounts for the fact that a signal is detected by the eye only if its contrast is greater than some threshold; the eye's sensitivity varies as a function of spatial frequency, orientation, and temporal frequency. The second phenomenon concerns the human visual response to the combination of several signals. When a stimulus consists of two types of signals (foreground and background), the detection threshold of the foreground is modified as a function of the contrast of the background. MPQM is an objective quality metric for moving pictures that incorporates the two human vision characteristics mentioned above. It first decomposes an original sequence and a distorted version of it into perceptual channels. The channel-based distortion is then computed, accounting for contrast sensitivity and masking. Finally, the data is pooled over all the channels to compute the quality rating, which is then scaled from 1 to 5 (from bad to excellent). MPQM does not take the chrominance into consideration, which is why the Color Moving Pictures Quality Metric (CMPQM) was introduced.


With the Noise Quality Measure (NQM), another metric introduced in [Nir00], a degraded image is modeled as an original image that has been subjected to linear frequency distortion and additive noise injection. These two sources of degradation are considered independent and are decoupled into two quality measures: a Distortion Measure (DM) of the effect of frequency distortion, and an NQM of the effect of additive noise. The NQM takes into account: (1) variation in contrast sensitivity with distance and image dimensions; (2) variation in the local luminance mean; (3) contrast interaction between spatial frequencies; (4) contrast masking effects.

Finally, the Video Quality Metric (VQM) [21] was developed by the Institute for Telecommunication Sciences (ITS) to provide an objective measurement of perceived video quality. It measures the perceptual effects of video impairments including blurring, jerky/unnatural motion, global noise, block distortion, and color distortion, and combines them into a single metric. Testing results show that VQM has a high correlation with subjective video quality assessment, and it has been adopted by the American National Standards Institute (ANSI) as an objective video quality standard. VQM can be computed using various models based on certain optimization criteria. These models include (1) Television, (2) Videoconferencing, (3) General, (4) Developer, and (5) PSNR. VQM, shown in Equation 2.8, was derived from PSNR.

$$VQM_P = \frac{1}{1 + e^{0.1701 \times (PSNR - 25.6675)}}, \quad (10 \leq PSNR \leq 55) \quad (2.8)$$


A different approach to video quality assessment, presented by Wang [22], is known as the Structural SIMilarity index (SSIM). This method differs from the previously described methods by measuring structural distortion instead of error. The idea behind this is that the human visual system is highly specialized in extracting structural information from the viewing field, rather than in extracting errors. Thus, a measurement of structural distortion should correlate better with the subjective impression.

2.6 Video Encoding Standards

The adoption of digital video in many applications has been fueled by the development of video coding standards. These standards provide the necessary interoperability between systems designed by different manufacturers. ITU-T and ISO/IEC JTC1 (MPEG) are the two active formal organizations in video coding standardization. The ITU-T video coding standards are called recommendations and are denoted H.26x (H.261, H.262, H.263, and H.264); they were designed mostly for real-time video communication applications. The MPEG standards (known as MPEG-1, MPEG-2, and MPEG-4), on the other hand, mainly address the needs of video storage, broadcast video, and video streaming applications.

Different coding standards may require specific input video formats. The formats of video sequences used in common video coding are derived from the CCIR-601 digital video format [23]. Each of the coding standards can use several formats for different resolutions and output bit rates. Low-resolution formats are suitable for real-time and network applications, while large-format video is designed for applications such as high definition TV. Formats for progressive and interlaced display are also defined respectively in the standards.

2.6.1 Single Layer Coding (H.264/AVC)

Aiming to provide significantly better coding efficiency than MPEG-2 and H.263, H.264 standardizes many new algorithms [6]. Context-adaptive binary arithmetic coding (CABAC) and context-adaptive variable-length coding (CAVLC) have higher compression gains than previous entropy coding methods. Quarter-pixel motion vectors and enhanced motion estimation with variable block sizes provide higher prediction coding gains. The introduction of an integer block transform increases coding speed. The in-loop de-blocking filter significantly improves both objective and subjective visual quality compared with other coding standards. In total, H.264 generates on average 50% less bitrate at the same fidelity compared to prior standards.

H.264 targets a wider range of bit rates and video formats than its predecessors. H.264 has shown excellent performance in very different environments, from very low bit rate QCIF video over wireless networks to digital-cinema-compatible high definition (HD) video.

Starting with MPEG-2, the MPEG committee has defined profiles and levels to specify coding requirements for applications. Each profile includes particular functions, and each level imposes limits on parameters such as processing rates, formats, coded bit rates, and memory requirements. A combination of a profile and a level indicates a range of bit rates and supported features, thus reducing implementation cost for certain applications. There are five profiles and four levels in MPEG-2. In MPEG-4, the three types of video (natural, synthetic, and hybrid) each have their own profiles. Early ITU-T recommendations use options to define additional coding methods; for example, there are fifteen negotiable options in H.263+, the second version of H.263 [5]. H.264 defines three profiles (baseline, main, and extended) and corresponding levels within each profile.

As with other standards, H.264 does not define rules for implementing the encoder; it only defines the mechanism for decoding a sequence encoded with H.264, together with the syntax of an encoded video bit stream. Figure 2.14 and Figure 2.15 show the different elements of an H.264 encoder and decoder respectively. The basic functional elements can also be found in previous standards such as MPEG-2 or H.263; the main differences lie in the way these functional elements work. However, the de-blocking filter is a new element not defined in prior standards.

Figure 2.14: Block Diagram for H.264/AVC Encoder


The encoder (Figure 2.14) has two paths, known as the forward path (left to right) and the reconstruction path (right to left). In the forward path, an input frame or field $F_n$ is processed in units of a macroblock (16x16 pixels) and can be coded in Intra or Inter mode. The encoder creates a prediction, labeled P in Figure 2.14, based on reconstructed picture samples. In Intra mode, P is formed from samples in the current slice that have previously been encoded, decoded, and reconstructed ($uF'_n$ in Figure 2.14). In Inter mode, P is created by motion-compensated prediction from the reference pictures. These reference pictures may be chosen from a selection of past or future pictures that have already been encoded, reconstructed, and filtered. The prediction image P is subtracted from the current image to produce a residual image, which is transformed and quantized to obtain X, a set of quantized transform coefficients that are reordered and entropy encoded. In addition, the encoder decodes the frame to provide a reference for future predictions: X is rescaled ($Q^{-1}$) and inverse transformed ($T^{-1}$) to produce $D'_n$. The prediction image P is added to $D'_n$ to create the reconstructed image $uF'_n$; however, this image is unfiltered. In the last step, a filter is used to reduce the effects of blocking distortion.

A de-blocking filter is used to reduce blocking distortion and is applied to each decoded macroblock. This module may improve the compression performance, because the filtered image is often a more faithful reproduction of the original frame than a blocky, unfiltered image. In the encoder (Figure 2.14), this filter processes the macroblock after the inverse transform $T^{-1}$, before the stage of reconstruction and storage for future predictions. In the decoder (Figure 2.15), it is the last operation of the process. The function of this module is to smooth block edges, improving the appearance of the decoded frames. The filtered image is used for motion compensation in future frames. The filter is applied to the vertical and horizontal edges of the 4x4 blocks in a macroblock, except for the edges on slice boundaries.

Figure 2.15: Block Diagram for H.264/AVC Decoder

The transform used in the H.264 standard, $T$ and $T^{-1}$ (Figure 2.14 and Figure 2.15), depends on the type of residual data to be coded. There are three kinds of transforms available: a Hadamard Transform (HT) for the 4x4 array of luma DC coefficients in Intra macroblocks predicted in 16x16 mode, a HT for the 2x2 array of chroma DC coefficients in any macroblock, and a DCT-based transform for all other 4x4 blocks in the residual data.

The H.264 transform [24] is based on the DCT but with some fundamental differences:

• It is an integer transform, which implies that no floating-point operations are needed. The mismatch between the encoder and the decoder is zero, without loss of accuracy.

• It can be implemented using only additions and shifts.

• The number of operations can be reduced by integrating part of the operations involved in the transform into the quantizer.
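For illustration, the 4x4 forward core transform can be written directly from its well-known integer matrix; the scaling that completes the DCT approximation is folded into the quantizer, as noted above, and this sketch omits it.

    import numpy as np

    # Core matrix of the H.264 4x4 forward integer transform.
    C = np.array([[1,  1,  1,  1],
                  [2,  1, -1, -2],
                  [1, -1, -1,  1],
                  [1, -2,  2, -1]])

    def forward_transform_4x4(residual):
        # Y = C * X * C^T on a 4x4 integer residual block; all values
        # stay integers, so encoder and decoder cannot mismatch.
        return C @ residual @ C.T

    x = np.arange(16).reshape(4, 4)               # toy residual block
    assert forward_transform_4x4(x).dtype.kind == "i"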

The quantizer, $Q$ and $Q^{-1}$ (Figure 2.14 and Figure 2.15), adopted by the H.264 standard is a scalar quantizer. A total of 52 values of the Quantization Parameter (QP) are supported by the standard, and the quantization step size doubles for every increment of 6 in QP. This wide range of quantizer step sizes makes it possible for an encoder to control the tradeoff between bitrate and quality accurately and flexibly. In addition, the H.264 standard allows different QP values for luma and chroma. The quantization step sizes are not linearly related to the quantization parameter (unlike in prior standards). A default relationship is specified between the quantization step sizes used for luma and chroma, and the encoder can adjust this relationship at the slice level to balance the desired fidelity of the color components.
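The doubling rule can be expressed compactly; the base step sizes below are those commonly tabulated for QP 0 to 5 and should be read as an illustrative sketch rather than normative text.

    # Step size doubles for every increment of 6 in QP (QP in 0..51).
    QSTEP_BASE = [0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125]   # QP 0..5

    def qstep(qp):
        return QSTEP_BASE[qp % 6] * (2 ** (qp // 6))

    assert qstep(10) == 2 * qstep(4)   # +6 in QP doubles the step size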

The Entropy encode (Figure 2.14) and Entropy decode (Figure 2.15) modules are where the elements of the sequence are encoded/decoded using fixed- or variable-length binary codes. As shown later, this operation depends on the profile being used to encode/decode the video sequence.

The entropy-encoded coefficients, together with the side information required to decode each macroblock, form the compressed bit stream, which is passed to the Network Abstraction Layer (NAL), where the picture is prepared for transmission or storage.

The H.264 standard does not specify the mechanism for transmitting NAL units, but a distinction is made between transmission over packet-based transport mechanisms (packet networks) and transmission in a continuous data stream (circuit-switched channels). Each NAL unit contains a Raw Byte Sequence Payload (RBSP), a set of data corresponding to coded video data or header information. The reason for separating variable-length coding and the NAL is to decouple coding features from transport features.

On the other hand, the decoder (Figure 2.15) only has the forward path (left to

right). The dataflow path in the decoder shows the similarities between encoder and

decoder.

The input to the decoder is a compressed bit stream from the NAL, and the entropy module decodes the data to generate a set of quantized coefficients, denoted by X in Figure 2.15. These are rescaled and inverse transformed to give $D'_n$, exactly the same $D'_n$ created in the encoder (Figure 2.14) if there were no errors during the process. Using the information stored in the video sequence, the decoder generates the P image. The decoder adds these two images to produce $uF'_n$, which is filtered to obtain $F'_n$.

In the encoder, each block coming from the quantizer is mapped to a sixteen-element array in a zigzag order. This is the function of the Reorder module, which prepares the data (reordering the coefficients) for the next module, where the entropy coding is done. The inverse process is performed in the decoder (Figure 2.15): the macroblock coefficients are reordered before the inverse quantization.

The Intra prediction and Inter prediction modules, namely Motion Compensation (MC) and Motion Estimation (ME) in the encoder and Motion Compensation (MC) in the decoder, will be explained in more detail later, because they are the modules for which the different approaches of this dissertation are proposed.

2.6.2 Scalable Video Coding (SVC)

The problem of sharing video from a single source among a number of users in various settings, using existing system and network resources, raises new challenges for video coding. In many applications, such as video streaming, video conferencing, broadcasting, and surveillance, it is now typical that the same video source is simultaneously sent to many clients in very different environments. A video coding solution that is able to adapt to different requirements is desirable for quick and easy video transmission in these applications [25] [26].

For example, in a large video surveillance system, such as those used for airport, subway, or highway administration, there can be more than a thousand network cameras distributed over many critical places. Meanwhile, several hundred monitoring points, such as the central control room, on-site monitors, data analysis applications, mobile monitors in vehicles, or handheld devices, may need to access any captured or stored video data from the system in real time. Far-away cameras can connect to the system over IP networks, while other cameras and the data storage center may use high-bandwidth networks. The connection of monitors to the system may vary from high-speed cable and local area networks to wireless transmission.

Each of these monitors may require the same video in different resolutions, frame

rates, and decoding complexity simultaneously. A universal video codec solution is

essential in the above system to adaptively and efficiently meet the requirements.


The heterogeneity among video clients can be grouped into three categories: network conditions, device settings, and quality requirements. In general, wireless

network conditions, device settings, and quality requirements. In general, wireless

networks have smaller bandwidth and higher error rates than wired networks. Even

different sections in the same type of networks may have different and varying quality of

services. Some applications, such as video conferencing and broadcasting, are more

sensitive and have less tolerance to transmission delay. Handheld and mobile devices,

such as pocket PC, iPod, and cell phone, usually have less processing power, memory,

and storage space than computers or other high end video players, and therefore can only

afford lower complexity decoding and lower quality video. Users in subscription service plans are usually limited to the corresponding video services. In such a system, a low-cost, quality-controllable coding scheme is highly desirable.

Streaming video over heterogeneous networks demands video services of varying spatial, temporal, and quality fidelity. Transcoding a video into several formats to make it suitable for different clients requires special processing of the video bit stream for each particular requirement, which is complicated and time-consuming. Having a server keep several versions of the same video and switch among them during transmission also requires additional storage and fast data access for frequent bitstream switching. This is usually not practical for many small and middle-sized servers.

Scalable video coding provides an efficient way to generate adaptive video for transmission under heterogeneous requirements [27]. Conventional scalable video coding techniques generate layers of video with different importance to video quality and a fixed decoding order. In Layered Coding (LC), the effect of loss or errors in enhancement layers is limited: a minimum quality and a normal decoding process are retained by the base layer. Three types of scalability, namely quality (SNR), spatial, and temporal, are most commonly used and standardized, for example in MPEG and H.26x. However, the base layer content requires stronger protection in order to benefit from scalability; otherwise, losses in the base layer still cause error propagation and render the enhancement layers useless.

This section has presented fundamental knowledge of scalable video coding techniques. The next section reviews layered video coding technologies, including spatial, temporal, and quality (SNR) scalability.

2.7 Scalable Extension of H.264/AVC

The notion of scalability is the functionality expected to introduce a high degree of flexibility in coding/decoding systems. A scalable video coding scheme produces a compressed bitstream, parts of which can be decoded independently. Compared to decoding the complete bitstream, decoding a part of the bitstream produces pictures with degraded quality, smaller image size, or a lower frame rate.

As a remarkable feature of the SVC project, most components of H.264/AVC are used as specified in the standard (e.g. motion-compensated prediction, intra prediction, transform coding, entropy coding, and de-blocking filter), while only a few components have been added or modified. The key features of the scalable extension of H.264/AVC are:

• Hierarchical prediction structure

• Layered coding scheme with switchable inter-layer prediction mechanisms


• Base layer compatibility with H.264/AVC

• Fine granular quality scalability using progressive refinement slices

• Usage and extension of the NAL unit concept of H.264/AVC

The scalable H.264/AVC extension specifies a layered video codec. In general, the coder structure depends on the scalability space that is required by the application

[28].

Figure 2.16: Coding Structure for SVC Extension of H.264/AVC

Figure 2.16 shows the proposed scalable coder, which consists of two motion-compensated coders that encode a video sequence and produce a single SVC bitstream from which two bitstreams of different resolutions can be extracted individually. The generic structure is characterized by mixed spatial and temporal scalability and by independent motion estimation and compensation performed in the individual prediction loops. These two features, together with the other improvements described further below, are essential to the high efficiency obtained.

In each spatial layer, the basic concepts of MC prediction and intra prediction are employed as in H.264/AVC. The redundancy between different layers is exploited by additional interlayer prediction concepts that include prediction mechanisms for motion parameters as well as for texture data (Intra and residual data). A base representation of the input pictures of each layer is obtained by transform coding similar to that of

H.264/AVC; the corresponding NAL units contain motion information and texture data.

The NAL units of the base representation of the lowest layer are compatible with single layer H.264/AVC. The reconstruction quality of the base representations can be improved by an additional coding of so called progressive refinement (PR) slices; the corresponding NAL units can be arbitrarily truncated in order to support medium granular quality scalability (MGS) or flexible bitrate adaptation.

2.7.1 Hierarchical Coding Structure

In contrast to standards such as MPEG-2/4, the coding and display order of pictures is completely decoupled in H.264/MPEG4-AVC. Any picture can be marked as a reference picture and used for motion-compensated prediction of following pictures, independent of the corresponding slice coding types. These features allow the coding of picture sequences with arbitrary temporal dependencies.

Temporal scalable bitstreams can be generated by using hierarchical prediction structure as illustrated in Figure 2.17.


Figure 2.17: Hierarchical Prediction Structure (temporal layers T0-T3; key pictures typically intra-coded; GOP structural delay of 7 frames)

So-called key pictures are coded at regular intervals, using only previous key pictures as references. The pictures between two key pictures are hierarchically predicted

as in Figure 2.17. It is obvious that the sequence of key pictures represents the coarsest

supported temporal resolution, which can be refined by adding pictures of following

temporal prediction levels. The hierarchical picture coding can be extended to motion-

compensated temporal filtering (MCTF) [29]. For that, motion-compensated update operations using the prediction residuals (colored arrows in Figure 2.17) are introduced in addition to the MC prediction.

Because of the similarities in MC prediction, the H.264/AVC approach to temporal scalability is maintained in SVC. In addition to enabling temporal scalability, the hierarchical prediction structures also provide improved flexibility compared to classical IBBP coding, at the cost of an increased encoding-decoding delay.


2.7.2 Inter-Layer Prediction

Spatial scalability requires the largest degree of change to H.264/AVC among the

various types of scalabilities. Spatial scalability is achieved by an oversampled pyramid

approach. The pictures of different spatial layers are independently coded with layer specific motion parameters as illustrated in Figure 2.18.

Figure 2.18: Hierarchical Prediction Structure with Inter-Layer Prediction

However, in order to improve the coding efficiency of the enhancement layers in comparison to simulcast, additional inter-layer prediction mechanisms have been introduced. These prediction mechanisms have been made switchable, so that an encoder can freely choose which base layer information should be exploited for efficient enhancement layer coding. The following techniques turned out to provide gains and were included in the scalable video codec:

• Prediction of intra macroblocks using up-sampled base layer intra blocks

• Prediction of motion information using up-sampled base layer motion data

• Prediction of residual information using up-sampled base layer residual blocks

Since the incorporated inter-layer prediction concepts include techniques for motion parameter and residual prediction, the temporal prediction structures of the spatial layers should be temporally aligned for an efficient use of the inter-layer prediction.

The following three inter-layer prediction techniques are included in the SVC design.

Motion Vector Prediction

In order to employ base layer motion data for spatial enhancement layer coding, additional macroblock modes have been introduced in spatial enhancement layers. The macroblock partitioning is obtained by up-sampling the partitioning of the co-located 8x8 block in the lower resolution layer. The reference picture indices are copied from the co- located base layer blocks and the associated motion vectors are scaled by a factor of 2.

These scaled motion vectors are either used unmodified or refined by an additional quarter sample motion vector refinement. Additionally, a scaled motion vector of the lower resolution can be used as motion vector predictor for the conventional macroblock modes.
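A small sketch of the motion vector derivation for dyadic spatial scalability follows; the refine argument stands in for the optional quarter-sample refinement and is a hypothetical parameter of ours.

    def enhancement_layer_mv(mv_base, refine=(0, 0)):
        # Scale the base layer motion vector by 2 to match the doubled
        # resolution, then apply an optional quarter-sample refinement.
        return (2 * mv_base[0] + refine[0], 2 * mv_base[1] + refine[1])

    print(enhancement_layer_mv((3, -2)))           # (6, -4)
    print(enhancement_layer_mv((3, -2), (1, 0)))   # refined: (7, -4)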

Residual Prediction

The usage of inter-layer residual prediction is signaled by a flag that is transmitted for all inter-coded macroblocks. When this flag is true, the base layer signal of the co-located block is block-wise up-sampled and used as a prediction for the residual signal of the current macroblock, so that only the corresponding difference signal is coded [30].


Intra Prediction

Furthermore, an additional intra macroblock mode is introduced in which the

prediction signal is generated by up-sampling the co-located reconstruction signal of the lower layer. For this prediction it is generally required that the lower layer is completely decoded including the computationally complex operations of MC prediction and de- blocking. However, this problem can be circumvented when the inter-layer intra prediction is restricted to those parts of the lower layer picture that are intra coded. With this restriction, each supported target layer can be decoded with a single motion compensation loop.

2.7.3 Progressive Refinement Slices (Quality Scalability)

A picture is generally represented by a non-scalable base representation, which includes all the corresponding motion data as well as a ‘coarse’ approximation of the intra and residual data, and zero or more quality scalable enhancement representations, which represent the residual between the original sub-band pictures (prediction residuals and intra blocks) and their reconstructed base representation (the subordinate enhancement representation). For the encoding of quality enhancement representations a new slice type called progressive refinement (PR) slice has been introduced [31].

Usually, the base representation corresponds to a minimally acceptable reconstruction quality and this basic quality can be improved in a fine granular way by truncating the enhancement representation NAL units at any arbitrary point.

For SNR scalability, coarse-grain scalability (CGS) and medium-grain scalability

(MGS) are distinguished.


Coarse-Grain SNR (CGS) Scalability

Coarse-grain SNR scalable coding is achieved using the concepts for spatial scalability. The only difference is that for CGS the up-sampling operations of the inter-layer prediction mechanisms are omitted. With CGS, only selected SNR scalability layers are supported, and the coding efficiency is optimized for coarse rate graduations of a factor of 1.5-2 from one layer to the next. The restricted inter-layer prediction that enables single-loop decoding is even more important for CGS than for spatial scalable coding.

Medium-Grain SNR (MGS) Scalability

In terms of coding mechanism, MGS is nearly the same as CGS. A main

difference between CGS and MGS is the high level syntax that affects the flexibility in

discarding data to meet a bitrate constraint. All NAL units of a CGS layer must be either

completely retained or completely discarded, whereas NAL units of an MGS layer can be

individually discarded. In particular, in MGS, the coded data corresponding to a quantization step size (equivalent to a CGS layer) can be fragmented into at most 15 (sub-)layers. Thanks to this splitting, finer scalability can be achieved. This kind of

packet-based scalability can be considered as a compromise between CGS and Fine Grain

SNR (FGS) scalability.

2.7.4 NAL Unit Syntax

The coded video data of H.264/AVC and its scalable extension is organized into

NAL units, each of which is effectively a packet that contains an integer number of bytes.

The high-level syntax of SVC obeys similar design criteria as those of H.264/AVC.


Parameter sets, containing information for more than one picture, are normally transmitted out-of-band, using a reliable transmission protocol (e.g., TCP), but could also be repeated in-band, e.g., for broadcast applications.

Figure 2.19: SVC NAL Unit (AVC header followed by the SVC extension header, with fields such as idr_flag, priority_id, no_inter_layer_pred_flag, dependency_id, quality_id, temporal_id, use_ref_base_pic_flag, discardable_flag, and output_flag)

The video data is transmitted in Network Abstraction Layer (NAL) units. The first

byte of the SVC NAL unit header extension (Figure 2.19) contains the syntax element

priority_id and also indicates whether the NAL unit belongs to an IDR access unit

(idr_flag). The second and third bytes provide information on the scalability dimensions, through the fields dependency_id (Did), temporal_id (Tid), and quality_id (Qid), and also on the possibility of discarding NAL units from the decoding of layers with higher Did

(discardable_flag). A legacy H.264/AVC decoder regards SVC NAL units as regular

NAL units with unknown NAL unit types, and discards them while still being able to

decode the base layer.
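A sketch of how the three extension bytes could be unpacked in Python is shown below; the field names and widths follow Figure 2.19, but the exact bit positions used here are our reading of the header layout and should be treated as an assumption, not normative text.

    def parse_svc_extension(b0, b1, b2):
        # Unpack the three-byte SVC NAL unit header extension (Figure 2.19).
        return {
            "idr_flag":                 (b0 >> 6) & 0x01,
            "priority_id":              b0 & 0x3F,
            "no_inter_layer_pred_flag": (b1 >> 7) & 0x01,
            "dependency_id":            (b1 >> 4) & 0x07,   # Did
            "quality_id":               b1 & 0x0F,          # Qid
            "temporal_id":              (b2 >> 5) & 0x07,   # Tid
            "use_ref_base_pic_flag":    (b2 >> 4) & 0x01,
            "discardable_flag":         (b2 >> 3) & 0x01,
            "output_flag":              (b2 >> 2) & 0x01,
        }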


2.8 Summary

This chapter presented background material for understanding video compression and encoding systems. It began with the description of a general framework of a compression system, whose main building blocks were explained in detail. Next, we described the video encoding process and its main stages. Finally, we looked into the video encoding standard H.264/AVC and its scalable extension, along with the SVC structure and the tools available for the prediction process.


3. SURVEY OF RELATED WORK

3.1 Introduction

This chapter presents a survey of the literature on the topic of complexity in video

coding in general and particularly for scalable video coding. First we look at the

foundational papers in the area to introduce the topic. Next we examine existing

techniques and methods for complexity reduction in video encoding including a

discussion of mode elimination methods based on statistical models.

Current video coding standards are highly asymmetrical: encoding is typically 5-10 times more complex than decoding. This is due to the use of inter-frame predictive coding, which is desirable for consumer electronics (CE) applications, including DVD (Digital Versatile Disc), DTV (Digital Television), video streaming, and video on demand (VOD), as it can result in high compression ratios.

The computational complexity of a software video CODEC may be controlled by using variable complexity algorithms for processor-intensive functions. In many software implementations of video coding standards such as H.263, H.264 and MPEG-4, the performance of the video CODEC is limited by available processing power as well as, or rather than, by available bandwidth. It is therefore important to develop flexible methods of managing the computational complexity of video encoding and decoding.


The following section describes mode decision complexity, and the subsequent sections classify complexity reduction techniques in video encoding into different categories.

3.2 Mode Decision Complexity

Figure 3.1: H.264 CPU usage distribution for Intra frames

The computational complexity for the intra prediction in H.264 is very high due to the many coding modes. Figure 3.1 shows the CPU usage profile of a software based

H.264 encoder running on a PC based on Intel Architecture [32]. There are 9 modes for

4x4 luma blocks, 4 modes for 16x16 luma blocks, and 4 modes for 8x8 chroma blocks. The H.264 standard adopts many state-of-the-art techniques to improve coding performance: a 4x4 block-based integer transform, motion compensation using variable block sizes and multiple references, an advanced in-loop de-blocking filter, improved entropy coders such as CAVLC (Context Adaptive VLC) and CABAC (Context Adaptive Binary Arithmetic


Coding), and enhanced intra-prediction. RDO (Rate Distortion Optimization) is conducted for Intra/Inter coding to select the best coding mode among the possible combinations, guaranteeing the smallest distortion under the given bit rate instead of just minimizing the bit rate or the distortion. Since RDO must perform the transform and entropy coding for each coding mode, the computational complexity is increased enormously compared to conventional encoders, making H.264/AVC difficult to apply directly on low-complexity devices.

Mode decision is the computationally most expensive process in video coding. As described above, efforts are made to reduce this computation and to predict the modes faster. Coding of enhancement layers in SVC can be done more effectively if the base layer is coded sub-optimally such that it can be maximally utilized in inter-layer prediction.

As described above, H.264/AVC has multiple coding modes with variable block sizes ranging from 4x4 to 16x16, in addition to Inter and Intra coding. The SVC extension of H.264/AVC adds more modes that take advantage of the layered structure of SVC by reusing base/lower layer information. The best coding mode is selected by trading off the rate and distortion performance of each mode. This selection of the best coding mode is computationally expensive if an exhaustive search is performed through all the coding modes. Therefore, fast mode decision algorithms are required. The key to fast mode decision algorithms is to reduce the set of candidate modes before computing the rate-distortion cost.


3.3 Adaptive GOP Structure

It is possible to detect the magnitude of temporal activity over several images by extracting information from the motion vectors and then change the GOP size in order to terminate the mode decision process early. In [33], the authors adaptively change the size of the GOPs according to the temporal characteristics of the video, which helps terminate the mode decision process early based on the average motion vector magnitude and the number of Intra-coded macroblocks. The authors compute the average motion vector magnitude (|MV|) and the number of Intra-coded macroblocks (numIntra) for a full-sized GOP. Large motion vectors and a large number of Intra-coded macroblocks imply high temporal activity, in which case a smaller GOP size is used to reduce complexity, and vice versa. This technique requires pre-determined threshold values, which may not be suitable for different types of video content.

3.4 Early Skip Schemes

Similarly, in [34], the authors exploit the property that most natural video sequences tend to have homogeneous motion. The basic idea behind this technique is the fact that frames in a GOP show a similar distribution of motion vectors (MVs). The authors utilize this stored information from frames inside a GOP at a lower level for the mode decision at a higher level. The mode information of the referenced frame is stored in a Mode History Map (MHM), which is further refined by considering the motion vector magnitude. This technique takes advantage of the relation between levels in a GOP: when a macroblock in a reference frame at a low level has the SKIP mode, the macroblock at the higher level also tends to have a SKIP mode. Therefore, if the macroblock modes of the references are all SKIP, it is reasonable to consider only the SKIP and P16x16 modes as candidate modes.

3.5 Exploiting Layer Information in SVC

In [35], the authors use the mode decided at the base layer for mode prediction at the enhancement layer, based on a statistical analysis of mode evaluation at the base and enhancement layers. The candidate modes at the enhancement layer are reduced based on the actual mode at the base layer.

In [36], the authors consider motion vectors as well as the integer transform coefficients of the residual for mode prediction at the enhancement layer. For non-zero motion blocks, the integer transform coefficients of the residual between the current macroblock and the macroblock motion-compensated by the motion vectors predicted from the base layer are considered. For a Zero Motion Block (ZMB) or Zero Coefficient Block (ZCB), the Inter 16x16 mode is used. For others, RD costs are computed for a number of candidate modes.

3.6 Exploiting Psycho-Visual Characteristics

In [37], the authors explore the psycho-visual characteristics to decide the mode.

The basic idea driving this approach is the well-known fact that moving objects usually attract more human attention than static ones. This technique defines a motion attention model, which generates a motion attention map based on the estimated motion vectors. Visually more attended regions of the frame undergo the usual exhaustive search scheme, while for visually less attended regions a fast mode decision algorithm is applied, similar to the one proposed by He Li et al. in [35].

In [38], the authors explore the correlation between base and enhancement layers. This correlation is then utilized to predict the mode of the next layer from the previous layer. The subordinate layer is divided into two regions, with QP < 33 and QP > 33. If the QP of the reference layer is greater than 33, inter-layer prediction is skipped, since the reference layer would be of lower quality. If the QP of the reference layer is less than 33, all the modes with inter-layer prediction are considered for testing.

In [39], the encoding time is reduced by up to 70% with negligible PSNR loss and bit-rate increase. The authors exploit the spatial correlation and the high mode correlation between neighboring macroblocks in an algorithm in which the first level of selection is based on directional masks (instead of edge detectors); the mode with the minimum mask difference is chosen as a candidate mode. Additional candidate modes are obtained using neighboring mode information. In the end, the best mode is the one with the minimum SATD among the neighboring modes and the minimum-mask candidate mode. In [40], a fast mode decision is made using temporal correlation. The basic idea of this approach is to measure the difference between the current macroblock and its collocated macroblock in the previous (already coded) picture. If they are close enough (a threshold decision), the current macroblock reuses the mode decision of its collocated macroblock and the entire mode decision process is skipped. The proposed algorithm can reduce the computational complexity by up to 33% with negligible loss of compression efficiency (less than 0.2 dB).

Many fast prediction methods have been proposed that successfully reduce the computation for H.264 intra-prediction. Huang et al. [41] developed a fast algorithm that performs context-based adaptive skipping of unlikely prediction modes and simplifies the matching operations, saving about 50% of the encoding time. Meanwhile, Pan [42] proposed a fast intra-mode decision scheme that uses the Sobel operator to measure the edge angle of 4x4 blocks and 16x16 macroblocks (MBs) in order to reduce the number of probable modes. Wang [43] proposed a simple edge detection algorithm, proposed in MPEG-7 as a feature descriptor, to achieve a better result than the previous algorithm [42]. More recent studies of intra-prediction for fast intra/inter-coding have also been published [44] [45]. Tsai et al. [46] proposed a technique based on a direction detection algorithm that computes sub-block and pixel direction differences.

In [42], the authors propose a fast mode classification based on a local edge detection histogram. It exploits the idea that pixels along the direction of a local edge normally have similar values; therefore, a good prediction can be achieved if a block is predicted using those neighboring pixels that lie in the same direction as the edge. An edge map computed with the Sobel operator is applied to each pixel, and then a histogram over the different mode directions is calculated. The mode with the most occurrences, if it is strong enough (the difference from the next-highest count is greater than a threshold), becomes the candidate best mode. Finally, the mode decision is made among the DC mode, the edge-histogram candidate, and its neighbors in terms of direction. In the case that all the histogram cells have similar amplitudes, the DC mode is the better choice. However, it is difficult to pre-define a universal threshold that suits different block contexts and different video sequences. Another technique used along with the edge direction histogram is early termination of the RDO calculation: in RDO, the coding cost consists of two parts, rate and distortion. After calculating the rate cost, there may be cases in which the rate cost alone is already higher than the coding cost of the best mode found so far. This implies that the current mode will not be the best mode, since its coding cost cannot be the smallest; the RDO calculation is therefore terminated and the calculation of the distortion is eliminated. The results show an average time saving of 25% with negligible losses in PSNR and increments in bit rate for IPPP sequences; for I-type sequences, the average time saving is approximately 60%. The early termination scheme contributed about 6% to 8% of the total time saving. Another observation was that QCIF sequences achieve more time saving than CIF sequences.
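The core of the edge-histogram filter in [42] can be sketched as follows; simple central differences stand in for the full Sobel kernels, and the bin count and threshold are placeholder values, since a universal threshold is hard to define, as noted above.

    import numpy as np

    def edge_direction_histogram(block, bins=9, threshold=1.5):
        # Gradient direction per pixel, binned over [0, pi) and weighted
        # by gradient magnitude; a clearly dominant bin suggests a
        # directional intra mode, otherwise DC is the safer choice.
        b = block.astype(float)
        gx = b[1:-1, 2:] - b[1:-1, :-2]        # horizontal gradient
        gy = b[2:, 1:-1] - b[:-2, 1:-1]        # vertical gradient
        angle = np.arctan2(gy, gx) % np.pi
        magnitude = np.abs(gx) + np.abs(gy)
        hist, _ = np.histogram(angle, bins=bins, range=(0, np.pi),
                               weights=magnitude)
        order = np.argsort(hist)
        if hist[order[-1]] > threshold * max(hist[order[-2]], 1e-9):
            return int(order[-1])              # index of the dominant direction
        return "DC"                            # no clear edge: prefer DC mode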

3.7 Summary

We have given a survey of the literature on complexity reduction in the H.264/AVC standard. We described the general approaches to complexity reduction in video encoding and categorized them based on the techniques applied during the process. The common element among all these techniques is the elimination of the least probable modes. We have observed that most of the surveyed techniques try to avoid the RDO calculation in order to reduce the encoding complexity of the H.264/AVC encoder. Most current complexity reduction techniques in video encoding use fixed thresholds and are unable to react to changing QPs. Moreover, they do not exploit the full space of neighborhood correlation. Our proposed complexity reduction techniques do not use any fixed thresholds and exploit the spatial similarity of video content.


4. H.264/AVC COMPLEXITY REDUCTION

4.1 Introduction

H.264 offers a significant performance improvement over previous video coding

standards such as H.263++ and MPEG-4 in terms of better peak signal-to-noise ratio

(PSNR) and visual quality at the same bit rate [47]. This is accomplished mainly due to the consideration of variable block sizes for motion compensation, multiple reference frames, integer transform [an approximation to discrete cosine transform (DCT)], in-loop de-blocking filter, context based adaptive binary arithmetic coding (CABAC), but also due to better exploitation of the spatial correlation that may exist between adjacent

Macroblocks, with the multiple intra mode prediction in intra (I) slices [48].

The H.264 video coding standard supports intra prediction for various block sizes.

For coding the luma signal, one 16x16 macroblock may be predicted as a whole using

Intra-16x16 modes, or the macroblock can be predicted as individual 4x4 blocks using nine Intra-4x4 modes.

The RD Optimization (RDO) technique [49] has been employed in H.264 for

Intra-prediction mode selection to achieve coding efficiency. However, the computational complexity of the RDO technique is extremely high since the encoder has to encode the target block by searching all possible modes exhaustively for the best mode in the RD


sense; this makes H.264/AVC difficult to use in applications with low computational capability, such as mobile devices.

4.2 Intra Prediction Complexity

H.264/AVC defines a slice as a group of MBs with a scanning order. A popular slice shape in H.264 is the rectangular shape with raster scanning order, where the slice can start and end at any MB. I-slices in H.264/AVC contain Intra pictures that have all MBs coded with Intra modes. An important aspect of slices in H.264/AVC is that a picture can be composed of multiple slices of different slice types; for example, I, P, and B slices can be mixed to constitute a single picture. Since our experiments are related to Intra coding modes only, we deal with I slices only in our experiments, simulations, and implementations. Slices are primarily used for error concealment and ROI (Region of Interest) configuration, which is outside the scope of our current topic and is discussed elsewhere [50].

The H.264/AVC video coding standard uses block-based motion compensation, the same principle adopted by every major coding standard since H.261. Important differences from earlier standards include the support for a range of block sizes (down to 4x4) and fine sub-pixel motion vectors (quarter-pel in the luma component). H.264/AVC supports motion compensation block sizes ranging from 16x16 to 4x4 luminance samples, with many options between the two.

Intra prediction is a pre-processing operation, applied before the DCT, that modifies the signal characteristics of the input images to improve compression by tuning them to the DCT basis. The main focus of Intra prediction is to suppress low frequency components in a predictable way that enables perfect reconstruction of the source picture in the decoder; generally, a better Intra prediction suppresses low frequency components more. Intra prediction has been present in almost all video compression standards, such as MPEG-1, MPEG-2, MPEG-4 Part 2, VC-1 and H.264/AVC, although the actual techniques vary in each standard. In contrast to previous video coding standards (especially H.263 and MPEG-4 Visual), where Intra prediction was conducted in the transform domain, Intra prediction in H.264/AVC is always conducted in the spatial domain, by referring to neighboring samples of previously decoded blocks that are to the left of and/or above the block to be predicted. Since this can result in spatio-temporal error propagation when Inter prediction has been used for neighboring macroblocks, a constrained Intra coding mode can alternatively be selected that allows prediction only from Intra coded neighboring macroblocks.

In Intra mode, a prediction block is formed based on previously coded and reconstructed blocks and is subtracted from the current block before encoding. The luma samples have two block modes: Intra 4x4 MB and Intra 16x16 MB. When using the Intra 4x4 MB mode, a prediction is formed for each 4x4 block of the luma component, and each block chooses one of the nine prediction modes. For Intra 16x16 MB, the prediction is formed for the entire block, which chooses one of the four prediction modes. The four chroma prediction modes are similar to those of Intra 16x16 MB prediction except for a different ordering of the mode numbers; the same prediction mode is always used by both chroma blocks.
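To make the spatial-domain prediction concrete, the following C listing is a minimal sketch of three of the nine Intra 4x4 modes (vertical, horizontal and DC), assuming the reconstructed top row and left column of neighbors are available; the function names are ours, not those of the reference software.

#include <stdint.h>

/* Minimal sketch of three of the nine Intra-4x4 prediction modes.
 * above[] holds the reconstructed row above the block,
 * left[] the reconstructed column to its left.                      */
void intra4x4_vertical(uint8_t pred[4][4], const uint8_t above[4])
{
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            pred[r][c] = above[c];          /* copy top row downwards  */
}

void intra4x4_horizontal(uint8_t pred[4][4], const uint8_t left[4])
{
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            pred[r][c] = left[r];           /* copy left column across */
}

void intra4x4_dc(uint8_t pred[4][4], const uint8_t above[4], const uint8_t left[4])
{
    int sum = 0;
    for (int i = 0; i < 4; i++)
        sum += above[i] + left[i];
    uint8_t dc = (uint8_t)((sum + 4) >> 3); /* rounded mean of the 8 neighbors */
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            pred[r][c] = dc;
}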


Encoders typically use Rate-Distortion Optimization (RDO) to select the best coding mode. Rate-distortion is used to measure compression performance, where the goal of the encoder is to optimize its overall fidelity: minimize the distortion D subject to a constraint on the rate R. The encoder must encode the Intra block using all mode combinations and choose the one that minimizes the RD cost. Since the chroma prediction is independent of the luma prediction, each luma prediction mode must be evaluated against four different chroma prediction modes. Therefore, the number of RDO calculations is M8 x (M4 x 16 + M16), where M8, M4 and M16 represent the number of prediction modes for the chroma block, the Intra 4x4 MB and the Intra 16x16 MB, respectively. This gives 4 x (9 x 16 + 4) = 592 RDO calculations; that is, for each MB the encoder has to perform 592 different RDO calculations before the best RDO mode is determined. As a result, the complexity and computational load of the encoder is extremely high.
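The nested loop structure below is a schematic C sketch (not JM code; the rd_cost call is elided) that makes the count of 592 evaluations per macroblock explicit.

#include <stdio.h>

/* Counts the RD evaluations implied by exhaustive Intra mode search:
 * 4 chroma modes x (9 Intra-4x4 modes x 16 sub-blocks + 4 Intra-16x16
 * modes) = 592 per macroblock. Only the loop structure that drives
 * the complexity is shown.                                           */
int main(void)
{
    int evaluations = 0;
    for (int cm = 0; cm < 4; cm++) {          /* chroma modes        */
        for (int b = 0; b < 16; b++)          /* 4x4 luma sub-blocks */
            for (int m4 = 0; m4 < 9; m4++)    /* Intra-4x4 modes     */
                evaluations++;                /* rd_cost(...) here   */
        for (int m16 = 0; m16 < 4; m16++)     /* Intra-16x16 modes   */
            evaluations++;                    /* rd_cost(...) here   */
    }
    printf("RDO evaluations per MB: %d\n", evaluations); /* prints 592 */
    return 0;
}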

The intra prediction for the chrominance components Cb and Cr of a macroblock is similar to the Intra 16x16 type for the luminance component, because the chrominance signals are very smooth in most cases. It is always performed on 8x8 blocks, using vertical, horizontal, DC and plane prediction.

4.3 Fast Intra Coding

We present a new fast intra prediction mode decision method that improves the encoding speed without much sacrifice in RD performance. This method is based on the idea that the prediction mode of each block is correlated with those of the neighboring blocks, and on the fact that the dominating direction of a smaller block is similar to that of the bigger block.


4.3.1 Machine Learning (ML) and Data Mining (DM)

Machine learning algorithms are used in the Data Mining (DM) step of a Knowledge Discovery from Data (KDD) process, whose main goal is the extraction of knowledge from structured or raw data [51]. While DM has become the most popular term in the field of knowledge discovery, it is in fact only one step in the KDD process, with important steps carried out before (data cleaning, data pre-processing, etc.) and after (model evaluation, knowledge dissemination, etc.) the application of the DM algorithms. Thus the DM step, and hence the machine learning algorithms, can be viewed as the core of KDD. The way in which DM algorithms obtain the knowledge varies depending on the machine learning paradigm they are based on. In the case of the inductive approach, used in this work, the goal is to look for regularities in the data and transform them into generalizations expressed using a knowledge representation model. DM has been used in an extensive range of applications, including search engines, medical diagnosis, stock market analysis, classification of DNA sequences, speech and handwriting recognition, object recognition in computer vision, game playing, and robot motion.

The two main goals in DM can be broadly classified as prediction and description. In predictive DM, the goal is to determine the value of a target (dependent) variable using the values taken by some predictive variables (or attributes). In descriptive DM, the main goal is the discovery of associations or relations between the variables and/or instances in the dataset (e.g., clusters, associations, graphs, etc.). In our research we focus on predictive DM, and more concretely on classification, as our target (i.e., class) variable (the MB mode decision) can take a finite number of (nominal) outcomes. From the different classification models available in the literature we use decision trees [52]. Machine learning uses statistics with different kinds of algorithms to solve a problem by studying and analyzing the data [53].

4.3.2 Applying Machine Learning for Intra Prediction

While there are no known published results on the use of machine learning in video encoding, there are several techniques for fast intra mode decision in H.264 encoding [41] [42] [43]. All these approaches reduce the computational cost compared to the H.264 reference software; the complexity reduction, however, is not sufficient to enable the use of complex video encoding features on resource constrained devices. We have developed an innovative approach that is not limited by such selective evaluation approaches.

Machine learning refers to the study of algorithms and systems that "learn" or acquire knowledge from experience. Deductive machine learning deduces new rules/knowledge from existing rules, while inductive machine learning analyzes data sets to create a set of rules for taking decisions. In our case these rules are used to build a decision tree from a set of experiments or examples, called the training data set. This data set must have the following properties [54]:

1. Each attribute or variable can take nominal or numerical values, but the number of attributes cannot vary from one example to another; that is, all the samples in the training data set used for training the model must have the same number of variables.

2. The set of categories that the examples can be assigned to must be known a priori to enable supervised learning.

3. The set of categories must be finite, and the categories must be distinct from one another.

4. Since inductive learning consists of obtaining generalizations from examples, a sufficiently large number of examples is assumed to exist.

Figure 4.1: Decision Tree Creation


Figure 4.2: Decision Trees in Complexity Reduction Mode

Multimedia Data Mining techniques have been developed over the last few years and have mainly been used to analyze or understand multimedia data [55]. Our approach is different: we still analyze intrinsic properties of the multimedia data, but instead of producing the discovered model as the output, our goal is to integrate it into a multimedia data processing task, which in this case is a low complexity encoder. In our opinion, the complexity of H.264/AVC encoding creates an opportunity for applying machine learning algorithms in order to reduce the complexity of the encoder. DM can be used to develop decision trees that classify MB mode decisions without having to evaluate all the possible combinations of partitions and sub-partitions. We propose a DM based approach to select the H.264/AVC MB mode in a video encoder.

Concretely, we describe the process of using DM to build a classifier for very low complexity encoding. The decision tree(s) will be used to determine the coding mode of an MB in I frames of the output H.264 video, based on the information gathered during the decoding stage of the input sequence.

A decision tree is built by mapping observations about a set of data into a tree composed of arcs, nodes and leaves. The arcs represent the choices that a decision tree can make, the nodes represent the classifiers, and the leaves (rectangles) represent the possible modes decided by the decision trees. The tree can have more than one level; in that case, the intermediate nodes represent decisions based on the values of the different variables that drive us from the root to a leaf. These types of trees are used in data mining processes for discovering the relationships in a set of data, if they exist. The tree leaves are the classifications and the branches are the features that lead to a specific classification.

The decision tree for MB mode classification was built using the WEKA [54] data mining tool. WEKA is a collection of machine learning algorithms for data mining tasks; the algorithms can be applied directly to a dataset. WEKA contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization, and it is well suited for developing new machine learning schemes. It is open source software issued under the GNU General Public License. The files used by the WEKA data mining program are known as ARFF (Attribute-Relation File Format) files. An ARFF file is written in ASCII text and shows the relationship between a set of attributes. These files contain the datasets to be classified. An ARFF file has two sections: the first is the header, with information about the name of the relation, the attributes that are used and their types; the second contains the data. In the header section we have the attribute declaration. For each macroblock, the proposed algorithm (Section 4.3.3) uses different attributes and the corresponding H.264/AVC MB coding mode decision for that MB (either Intra 16x16 or Intra 4x4) as determined by the JM reference software. The following listing shows our declaration for the ARFF files.

@RELATION 16x16_Vs_4x4.arff

@ATTRIBUTE Mean_Of_Means_SMB4x4 NUMERIC

@ATTRIBUTE Var_Of_Means_SMB4x4 NUMERIC

@ATTRIBUTE Mode {0, 1}

@DATA

234.81, 15.32, 0

221.04, 325.86, 0

------

------

------
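Each line of the @DATA section can be emitted directly from the instrumented encoder. The following one-function C sketch is ours (the attribute names follow the ARFF header above; nothing here is JM code):

#include <stdio.h>

/* Sketch: append one ARFF @DATA row per macroblock while encoding.
 * mode is the JM-decided class label: 0 = Intra 4x4, 1 = Intra 16x16. */
void emit_arff_row(FILE *f, double mean_of_means, double var_of_means, int mode)
{
    fprintf(f, "%.2f, %.2f, %d\n", mean_of_means, var_of_means, mode);
}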


The proposed approach uses the C4.5 algorithm [52] for building classifiers. With respect to the DM algorithm used to learn the model(s)/tree(s), we chose C4.5, a greedy, recursive, top-down algorithm for the induction of decision trees from data. The algorithm starts at the root node with all the available data and selects the best test, i.e., the most informative one with respect to the class, among those available. This test is placed at the root node and the data set is partitioned following the possible outcomes of the test. The process is then recursively repeated for each partition until a stopping criterion is met. In C4.5 the best attribute is selected by using information gain (based on Shannon's entropy), and numerical attributes are discretized on-line by searching for the threshold that yields the maximum entropy reduction (see [52] for details).

C4.5 is one of the most commonly used algorithms in data mining. Such systems take as input a collection of cases, each belonging to one of a small number of classes and described by its values for a fixed set of attributes, and output a classifier that can accurately predict the class to which a new case belongs. C4.5 generates classifiers expressed as decision trees. A decision tree is used to classify a case, i.e., to assign a class value to a case depending on the values of its attributes: a path from the root to a leaf of the decision tree is followed based on the attribute values of the case, and the class specified at the leaf is the class predicted by the tree. A performance measure of a decision tree over a set of cases is the classification error, defined as the percentage of misclassified cases, i.e., cases whose predicted classes differ from the actual classes.
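To illustrate the split criterion, the following self-contained C sketch computes Shannon entropy and the information gain of one candidate binary split on a numeric attribute, the quantity C4.5 maximizes when growing the tree; the toy data set and variable names are ours and purely illustrative.

#include <math.h>
#include <stdio.h>

/* Shannon entropy (in bits) of a two-class distribution. */
static double entropy(int pos, int neg)
{
    double n = pos + neg, h = 0.0;
    if (pos > 0) h -= (pos / n) * log2(pos / n);
    if (neg > 0) h -= (neg / n) * log2(neg / n);
    return h;
}

/* Information gain of splitting samples x[] (class y[], 0/1) at threshold t. */
static double info_gain(const double *x, const int *y, int n, double t)
{
    int lp = 0, ln = 0, rp = 0, rn = 0;
    for (int i = 0; i < n; i++)
        if (x[i] <= t) { lp += y[i]; ln += 1 - y[i]; }
        else           { rp += y[i]; rn += 1 - y[i]; }
    int nl = lp + ln, nr = rp + rn;
    return entropy(lp + rp, ln + rn)
         - ((double)nl / n) * entropy(lp, ln)
         - ((double)nr / n) * entropy(rp, rn);
}

int main(void)
{
    /* Toy data: a variance-like attribute against a binary MB mode label. */
    double var[] = { 5, 9, 14, 40, 55, 80 };
    int mode[]   = { 1, 1, 1, 0, 0, 0 };  /* 1 = Intra 16x16, 0 = Intra 4x4 */
    printf("gain at t=20: %.3f bits\n", info_gain(var, mode, 6, 20.0)); /* 1.000 */
    return 0;
}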


The dependent variable, namely Mode in the example, is the variable that we are trying to understand, classify, or generalize; in our approach the value 0 denotes Intra 4x4 and the value 1 denotes Intra 16x16. The other variables are the attributes used to make the classification. The ARFF data section contains the instance lines, which are the samples used to train our model; each macroblock sample is represented on a single line.

The decision tree proposed to predict the Intra mode decisions for 16x16 and 4x4 macroblocks is a model of the data that encodes the distribution of the class label (namely Mode in the example) in terms of the attributes (the other variables in the example). The final goal of this decision tree is to find a simple structure that shows the possible dependencies between the attributes and the class.


Figure 4.3: Decision Tree for Intra Coding

The proposed approach was developed based on insights from our work on MPEG-2 to H.264 transcoding that exploited machine learning tools [56] [57]. The key idea behind this approach is to exploit the correlation between the structural information in a video frame and the corresponding H.264 MB mode decisions, and build a classifier, or decision tree. Figure 4.1 depicts the process for building the decision trees to be used in the H.264/AVC MB coding mode decisions during the encoding process. The decision trees will be used to determine the coding mode of MBs and sub-MBs in I frames of the given H.264/AVC video, based on information gathered during the prior encoding

process. This technique determines encoder decisions that are computationally expensive, such as MB coding mode decisions, by using easily computable features derived from the uncompressed video. The machine learning algorithm C4.5 is used to deduce the classifier/decision tree from these features. The decision trees were obtained using the WEKA data mining tool; specifically, the J48 algorithm (the Java implementation of C4.5) in WEKA was used to create them.

The training sets were made using only the I-frames of YUV video sequences. The H.264/AVC coding mode decisions in the training sets were obtained by encoding the video sequence separately for each value of the quantization parameter. After extensive experiments, we found that sequences containing regions varying from homogeneous to high-detail serve as good training sets; good sample sequences are Flower and Football. Our goal is to develop a single, generalized decision tree for each level that can be used for encoding any H.264 video.

Once a tree is trained, the encoder coding mode decisions that are normally made using cost-based models evaluating all possible coding options are replaced with the decision tree. Figure 4.2 depicts the process of using decision trees in complexity reduction mode. Decision trees are in effect if-else statements in software and require negligible computing resources. We believe this simple approach has the potential to significantly reduce encoding complexity and change the way encoders are used on mobile devices.
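As a sketch of what "if-else statements in software" means here, the hypothetical root classifier below is hard-coded in C; the attribute names and thresholds are placeholders, since the actual tests are learned by WEKA per QP.

/* Hypothetical root classifier (node 1): decides Intra 16x16 vs. Intra 4x4
 * from two MB attributes. The thresholds are placeholders; in the real
 * encoder they come from the WEKA-trained tree for the active QP.        */
enum mb_mode { MODE_INTRA16x16, MODE_INTRA4x4 };

enum mb_mode classify_mb(double mean_of_means, double var_of_means)
{
    if (var_of_means <= 120.0) {              /* homogeneous region?  */
        if (mean_of_means > 40.0)
            return MODE_INTRA16x16;           /* flat region -> 16x16 */
        return MODE_INTRA4x4;
    }
    return MODE_INTRA4x4;                     /* detailed region -> 4x4 */
}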


4.3.3 Low Complexity Mode Decision

This section discusses the proposed low complexity mode prediction algorithm. Mode prediction is achieved by making use of the H.264/AVC coding mode along with the means, variances and other attribute values of the pixel data. The energy of a picture, and hence of a macroblock, is represented by the means and variances of the 4x4 sub-blocks of the macroblock. Intra slices in H.264/AVC use the sum of absolute differences (SAD) between the original 4x4 block and the predicted block, thereby exploiting the spatial correlation of the 4x4 sub-blocks. Therefore, the mean and variance of the 4x4 sub-blocks in a macroblock can be exploited to understand the spatial correlation of two macroblocks or of adjacent sub-blocks within a macroblock.

Figure 4.4: Attribute Selection for Intra 16x16


Intra MBs in H.264 are coded as Intra 16x16, Intra 4x4, or Intra 8x8. The Baseline profile used in mobile devices does not support the Intra 8x8 mode, so that mode is not discussed further in this dissertation. Intra modes also have associated prediction modes: Intra 16x16 has 4 prediction modes and Intra 4x4 has 9 prediction modes. Baseline profile encoders typically evaluate both Intra 16x16 and Intra 4x4 modes and the associated prediction modes before making MB mode decisions. In the proposed machine learning based approach we separate the Intra MB mode and Intra prediction mode decisions: the Intra MB mode is determined as Intra 16x16 or Intra 4x4 without evaluating any prediction modes, and the appropriate prediction modes for that MB mode are then determined. Since the MB mode is determined first, our approach immediately eliminates the computation of any prediction modes for the MB mode that is not selected; if the MB mode is determined to be Intra 16x16, there is no need to evaluate any prediction modes for the 4x4 sub-blocks.

Table 4-I: DECISION TREES AND MODE DECISIONS

Node Number | MB Type     | Mode Decision
     1      | Intra       | Intra 16x16 vs Intra 4x4
     2      | Intra 16x16 | DC vs non-DC
     3      | Intra 16x16 | PLANE vs HOR, VER
     4      | Intra 16x16 | VER vs HOR
     5      | Intra 4x4   | DC vs non-DC
     6      | Intra 4x4   | 0,3,5,7 vs 1,4,6,8
     7      | Intra 4x4   | 0,5 vs 3,7
     8      | Intra 4x4   | 1,6 vs 4,8
     9      | Intra 4x4   | 0 vs 5
    10      | Intra 4x4   | 3 vs 7
    11      | Intra 4x4   | 1 vs 6
    12      | Intra 4x4   | 4 vs 8


Figure 4.3 shows the hierarchical decision tree, composed of twelve different trees, used in making the H.264/AVC Intra MB mode and prediction mode decisions. Table 4-I lists the decision trees, represented by node numbers, along with their mode decisions. Each node uses its own attributes, and hence separate training sets are used for each decision tree.

The goal of the decision tree is to accelerate the Intra MB mode decisions. This is achieved by making use of different attributes of the MBs and sub-MBs, calculated prior to finalizing the mode decisions. The attributes of the MBs and sub-MBs can thus be exploited to understand the spatial correlation of MBs for Intra mode decisions in H.264/AVC. The open source WEKA data mining tool is used to discover patterns in the attribute values for the H.264/AVC coding mode decisions. Figure 4.3 shows the decision tree used in the proposed encoder.

Table 4-II: NODE 1 DECISION TREE STATISTICS

Node | Sequence | QP | No. of Samples | No. of Leaves | CCI (%)
  1  | Flower   | 20 |      3168      |      16       | 97.601
     |          | 24 |      3168      |      11       | 97.9167
     |          | 28 |      3168      |       9       | 97.5379
     |          | 32 |      3168      |       7       | 97.5694
     |          | 36 |      3168      |       5       | 95.4545
     |          | 40 |      3168      |       8       | 92.7715


Table 4-III: NODE 2 DECISION TREE STATISTICS

Node | Sequence | QP | No. of Samples | No. of Leaves | CCI (%)
  2  | Flower   | 20 |      1182      |      10       | 87.1404
     |          | 24 |      1268      |       9       | 85.1735
     |          | 28 |      1385      |      12       | 83.1769
     |          | 32 |      1534      |       7       | 83.442
     |          | 36 |      1772      |       8       | 84.8194
     |          | 40 |      2096      |      10       | 84.9237

Table 4-IV: NODE 3 DECISION TREE STATISTICS

Node | Sequence | QP | No. of Samples | No. of Leaves | CCI (%)
  3  | Flower   | 20 |       983      |       5       | 88.6063
     |          | 24 |      1044      |       6       | 89.1762
     |          | 28 |      1137      |       6       | 86.8953
     |          | 32 |      1223      |       9       | 90.5151
     |          | 36 |      1108      |       6       | 78.7004
     |          | 40 |      1700      |       6       | 91.5882

Table 4-V: NODE 4 DECISION TREE STATISTICS

Node | Sequence | QP | No. of Samples | No. of Leaves | CCI (%)
  4  | Flower   | 20 |      1750      |      12       | 75.9429
     |          | 24 |      1895      |      15       | 73.6148
     |          | 28 |      2035      |      15       | 72.3833
     |          | 32 |      2269      |      10       | 74.5703
     |          | 36 |      2647      |      11       | 73.4794
     |          | 40 |      3082      |      12       | 74.0104


Table 4-VI: NODE 5 DECISION TREE STATISTICS

Node | Sequence | QP | No. of Samples | No. of Leaves | CCI (%)
  5  | Flower   | 20 |      4480      |       5       | 80.6027
     |          | 24 |      4304      |       8       | 80.948
     |          | 28 |      4048      |       9       | 80.5089
     |          | 32 |      3856      |       5       | 78.7863
     |          | 36 |      3488      |       4       | 77.9243
     |          | 40 |      2976      |       6       | 75.5376

Table 4-VII: NODE 6 DECISION TREE STATISTICS

Node | Sequence | QP | No. of Samples | No. of Leaves | CCI (%)
  6  | Flower   | 20 |      1412      |       6       | 65.5807
     |          | 24 |      1403      |       7       | 66.6429
     |          | 28 |      1338      |       9       | 65.0972
     |          | 32 |      1217      |       7       | 63.3525
     |          | 36 |       993      |       6       | 65.861
     |          | 40 |       760      |       4       | 64.7368

Table 4-VIII: NODE 7 DECISION TREE STATISTICS

Node | Sequence | QP | No. of Samples | No. of Leaves | CCI (%)
  7  | Flower   | 20 |      1605      |       9       | 61.7445
     |          | 24 |      1573      |       8       | 62.6192
     |          | 28 |      1420      |       8       | 63.8732
     |          | 32 |      1377      |       6       | 62.1641
     |          | 36 |      1149      |       5       | 62.054
     |          | 40 |       993      |       6       | 61.8328


Table 4-IX: NODE 8 DECISION TREE STATISTICS

Node | Sequence | QP | No. of Samples | No. of Leaves | CCI (%)
  8  | Flower   | 20 |      1946      |       8       | 56.8859
     |          | 24 |      1881      |       7       | 56.0872
     |          | 28 |      1815      |       6       | 54.6556
     |          | 32 |      1671      |       4       | 57.2711
     |          | 36 |      1526      |       7       | 55.1114
     |          | 40 |      1197      |       5       | 57.3935

Table 4-X: NODE 9 DECISION TREE STATISTICS

Node | Sequence | QP | No. of Samples | No. of Leaves | CCI (%)
  9  | Flower   | 20 |       982      |       7       | 64.8676
     |          | 24 |       983      |       8       | 68.4639
     |          | 28 |      1922      |       5       | 77.4194
     |          | 32 |      1962      |       4       | 80.4281
     |          | 36 |       333      |       3       | 79.2793
     |          | 40 |       844      |       3       | 76.5403

Table 4-XI: NODE 10 DECISION TREE STATISTICS

Node | Sequence | QP | No. of Samples | No. of Leaves | CCI (%)
 10  | Flower   | 20 |       623      |       6       | 52.6485
     |          | 24 |       590      |       5       | 50.678
     |          | 28 |       553      |       2       | 50.4521
     |          | 32 |       529      |       3       | 53.8752
     |          | 36 |       264      |       6       | 59.8485
     |          | 40 |       963      |      10       | 64.1745


Table 4-XII: NODE 11 DECISION TREE STATISTICS

Node | Sequence | QP | No. of Samples | No. of Leaves | CCI (%)
 11  | Flower   | 20 |      1007      |       4       | 65.2433
     |          | 24 |      1687      |       6       | 65.9158
     |          | 28 |       716      |       4       | 75.419
     |          | 32 |      1673      |       8       | 68.2606
     |          | 36 |       674      |       6       | 75.6677
     |          | 40 |      1480      |       5       | 74.4595

Table 4-XIII: NODE 12 DECISION TREE STATISTICS

Node | Sequence | QP | No. of Samples | No. of Leaves | CCI (%)
 12  | Flower   | 20 |       939      |       7       | 69.9681
     |          | 24 |       971      |       6       | 53.1411
     |          | 28 |       921      |       3       | 52.7687
     |          | 32 |       770      |       3       | 67.9221
     |          | 36 |       689      |       8       | 69.521
     |          | 40 |       545      |       2       | 60.9174

The decision tree consists of twelve WEKA decision trees, shown in Figure 4.3 as nodes numbered from 1 to 12. The first WEKA tree (at root node 1) checks whether the incoming macroblock is to be handled as 16x16 or as 4x4. If an MB is determined to be 16x16, the left sub-tree at node 2 is followed, and if necessary down to node 4, until a mode decision is made. Similarly, if an MB is determined to be a 4x4 MB, the right sub-tree at node 5 is followed, down to node 12 if necessary. As soon as a mode is determined at any stage during tree traversal, i.e., a leaf is reached, further traversal is aborted and the mode number is returned as the coding mode for that MB or sub-MB. The WEKA tool determined the attribute-value thresholds for each of the twelve WEKA trees in the decision tree. Due to space constraints we cannot show all the rules evaluated in the WEKA decision nodes; however, the process described herein should be sufficient for interested readers to develop the decision trees and repeat these experiments.
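A compact C sketch of this traversal follows. The run_node() function stands in for evaluating the WEKA-learned tree at one node (its stub below only makes the sketch compile; the real thresholds are the learned ones, which are not reproduced here). The mode numbering follows Table 4-I.

/* Sketch of the 12-node hierarchy of Figure 4.3. run_node() returns 0
 * (first class of the node's split) or 1 (second class).              */
static int run_node(int node, const double *attrs)
{
    (void)node; (void)attrs;
    return 0;   /* stub: a real implementation evaluates the WEKA tree */
}

static int decide_intra_mode(const double *attrs)
{
    if (run_node(1, attrs) == 0) {                 /* Intra 16x16 branch */
        if (run_node(2, attrs) == 0) return 2;     /* DC                 */
        if (run_node(3, attrs) == 0) return 3;     /* PLANE              */
        return run_node(4, attrs) == 0 ? 0 : 1;    /* VER : HOR          */
    }
    /* Intra 4x4 branch */
    if (run_node(5, attrs) == 0) return 2;         /* DC                 */
    if (run_node(6, attrs) == 0) {                 /* {0,3,5,7}          */
        if (run_node(7, attrs) == 0)
            return run_node(9, attrs) == 0 ? 0 : 5;
        return run_node(10, attrs) == 0 ? 3 : 7;
    }
    if (run_node(8, attrs) == 0)                   /* {1,6} vs {4,8}     */
        return run_node(11, attrs) == 0 ? 1 : 6;
    return run_node(12, attrs) == 0 ? 4 : 8;
}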

The decision tree works as follows:

Attributes: In a digital picture, neighboring blocks have a very high similarity; hence, by using these spatial correlations among blocks, the most suitable mode can be predicted. The decision tree exploits the spatial similarity between the current candidate macroblock and the prediction pixels that are used to predict the different mode decisions. The attributes chosen as input to the WEKA tool are based on the pixel intensities of the current MB and sub-MB, and of the rows (top, bottom) and columns (right, left) of the current as well as of the adjacent MBs and sub-MBs.

Figure 4.4 shows the schematic diagram for the calculation of attributes for 16x16 macroblocks. The dark rectangle shows the macroblock being evaluated, while the light colored rectangles are the top and left macroblocks adjacent to the current candidate macroblock. We calculate the 16 means and 16 variances of the 4x4 sub-blocks of the current macroblock. We also compute the difference between the bottom row of the top MB and the top row of the current MB and, similarly, the difference between the left column of the current MB and the right column of the left MB. These two differential metrics give a strong hint about the horizontal or vertical prediction of the current MB.
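A minimal C sketch of this attribute computation follows, assuming 8-bit luma in a 16x16 array plus the neighboring boundary pixels; the variable names are ours, not JM's.

#include <stdint.h>

/* Attributes for one 16x16 MB: mean and variance of each 4x4 sub-block,
 * plus boundary differences against the top and left neighbor MBs.
 * mb[16][16] is the current luma MB; top_row holds the bottom row of
 * the top MB, left_col the right column of the left MB.               */
void compute_mb_attributes(const uint8_t mb[16][16],
                           const uint8_t top_row[16],
                           const uint8_t left_col[16],
                           double mean[16], double var[16],
                           double *diff_top, double *diff_left)
{
    for (int sb = 0; sb < 16; sb++) {          /* 4x4 sub-blocks, raster order */
        int r0 = (sb / 4) * 4, c0 = (sb % 4) * 4;
        double s = 0.0, s2 = 0.0;
        for (int r = r0; r < r0 + 4; r++)
            for (int c = c0; c < c0 + 4; c++) {
                s  += mb[r][c];
                s2 += (double)mb[r][c] * mb[r][c];
            }
        mean[sb] = s / 16.0;
        var[sb]  = s2 / 16.0 - mean[sb] * mean[sb];
    }
    /* Boundary differences: hints for vertical/horizontal prediction. */
    double dt = 0.0, dl = 0.0;
    for (int i = 0; i < 16; i++) {
        dt += (double)mb[0][i] - top_row[i];   /* our top row vs. top MB's bottom row  */
        dl += (double)mb[i][0] - left_col[i];  /* our left col vs. left MB's right col */
    }
    *diff_top  = dt / 16.0;
    *diff_left = dl / 16.0;
}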


Figure 4.5: Attributes Selection for Intra 4x4 Modes (showing the top, left, and current 4x4 sub-MBs)

Figure 4.5 shows the schematic diagram for the calculation of attributes for 4x4 sub-blocks. In addition to the types of attributes chosen for 16x16 macroblocks, we also calculate the difference of the intensity measures between the bottom row of the top sub-MB and the right column of the left sub-MB. These differential measures provide helpful clues for the angular mode decisions in 4x4 sub-MBs.

The decision trees obtained using the various attributes are discussed below. All of the decision trees are implemented as binary trees. Tables 4-II to 4-XIII show the performance of the training sets (from node 1 to node 12) in terms of CCI % (Correctly Classified Instances).

Node 1: The inputs for this node are the variance of the 16 4x4 sub-MB variances and the mean of the 16 4x4 sub-MB means. Since the 16x16 mode is selected for homogeneous regions, these two attributes proved very helpful in characterizing the intensity variation of a macroblock. For a 16x16 MB, the variance of the pixel values tends to be very low, which is also a very helpful indicator in determining the macroblock mode. The output of this node determines the macroblock mode type, i.e., 16x16 or 4x4. Table 4-II shows the statistics of the decision tree for node 1.

Node 2: This node differentiates between the DC mode and the non-DC modes for a 16x16 macroblock. Since the DC mode has no specific direction, it cannot be predicted from a spatial correlation. The inputs for this node are the variance of the 16 4x4 sub-MB variances and the mean of the 16 4x4 sub-MB means. Two additional attributes are calculated by subtracting the mean values of the bottom row of the current MB from those of the bottom row of the top MB and, similarly, by taking the difference between the mean values of the right column of the left MB and the right column of the current MB. Table 4-III shows the statistics of the decision tree for node 2.

Node 3: The output of this node classifies the Plane mode as one class and the Horizontal and Vertical modes as another class, using the same inputs as node 2. The statistics of the classifier for node 3 are shown in Table 4-IV.

Node 4: This node also uses the same input attributes as nodes 2 and 3. The classifier at this node differentiates between the horizontal and vertical prediction modes; its output is either mode 0 (vertical) or mode 1 (horizontal) for the 16x16 macroblock. The statistics of the classifier for node 4 are shown in Table 4-V.

Node 5: This node differentiates between the DC and non-DC modes for a 4x4 sub-block in a macroblock. The inputs for this node are the mean and variance of the current sub-block, as well as a third attribute calculated by taking into account the intensity values of the rows and columns in the angular directions. The statistics for this node are shown in Table 4-VI.

Node 6: This node classifies the non-DC Intra 4x4 modes into two classes. We divided the Intra 4x4 non-DC modes such that four adjacent modes go to each class, as shown in Figure 4.3: modes 3, 7, 0, 5 form one class and modes 4, 6, 1, 8 the other. The statistics for this classifier are shown in Table 4-VII.

Node 7: This node classifies modes 3, 7, 0, 5 into two classes, combining modes 0 and 5 into one class and modes 3 and 7 into another. In addition to the attributes used by the classifier for node 6, it also uses the mean values of the top and bottom rows and of the right and left columns of the current 4x4 sub-block. The statistics for this classifier are shown in Table 4-VIII.

Node 8: This node classifies modes 4, 6, 1, 8 into two classes, combining modes 1 and 6 into one class and modes 4 and 8 into another. The classifier for this node uses the same attributes as node 7. The statistics for this classifier are shown in Table 4-IX.

Node 9: The classifier at this node differentiates between mode 0 and mode 5 of Intra 4x4 prediction. It uses the same attributes as node 7. The statistics for this classifier are shown in Table 4-X.

Node 10: The classifier at this node differentiates between mode 3 and mode 7 of Intra 4x4 prediction. It uses the same attributes as node 7. The statistics for this classifier are shown in Table 4-XI.


Node 11: The classifier at this node differentiates between mode 1 and mode 6 of Intra 4x4 prediction. It uses the same attributes as node 8. The statistics for this classifier are shown in Table 4-XII.

Node 12: The classifier at this node differentiates between mode 4 and mode 8 of Intra 4x4 prediction. It uses the same attributes as node 8. The statistics for this classifier are shown in Table 4-XIII.

Figure 4.6: Macroblock Partitions for Intra 16x16 vs. Intra 4x4. (a) Flower.cif, reference encoder (QP=28); (b) Flower.cif, proposed encoder (QP=28)

Since the MB mode decision depends upon the quantization parameter (QP) used in H.264/AVC encoding, the mean and variance thresholds have to differ at each QP. Two solutions are possible: 1) develop a single decision tree and adjust the mean, variance and other attribute thresholds used by the tree based on the QP, or 2) develop a decision tree for each QP and use the appropriate tree depending on the QP selected. For the first option, a single decision tree is developed for a mid-range QP value of 25 and then adjusted for other values of QP. Since the quantization step size in H.264/AVC doubles when QP increases by 6, the thresholds are adjusted by 2.5% for each change in QP of 1: for QP values higher than 25 the thresholds are decreased, and for QP values lower than 25 they are proportionally increased. For the second option, decision trees are built for 6 different values of QP, i.e., 20, 24, 28, 32, 36 and 40; since the standard conformance procedure for H.264/AVC mostly uses these QP values, we adopted the same set. The different trees were implemented in the JM 14.2 reference software [58] using C function pointers. After extensive experimentation to evaluate both of the methods mentioned above, we concluded that using multiple decision trees for separate QP values produces better results.
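The two options can be sketched in a few lines of C, as we understand them from the description above; the helper names are ours, not JM's.

#include <stdlib.h>

/* Option 1 sketch: scale a threshold learned at QP 25 for the active QP,
 * shrinking it by 2.5% per QP step above 25 and growing it below 25.   */
double adjust_threshold(double threshold_at_qp25, int qp)
{
    double scale = 1.0 - 0.025 * (qp - 25);   /* decrease above QP 25 */
    return threshold_at_qp25 * scale;
}

/* Option 2 sketch: pick the tree trained for the nearest supported QP. */
static const int kTrainedQps[6] = { 20, 24, 28, 32, 36, 40 };

int nearest_trained_qp(int qp)
{
    int best = kTrainedQps[0];
    for (int i = 1; i < 6; i++)
        if (abs(kTrainedQps[i] - qp) < abs(best - qp))
            best = kTrainedQps[i];
    return best;
}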

Figure 4.6 shows an example of the results obtained by applying our proposed algorithm. Figure 4.6(a) illustrates the Intra mode selection (16x16 vs 4x4) for the original Flower sequence in CIF format encoded using the JM reference software, and Figure 4.6(b) shows the Intra mode selection using our proposed algorithm. From this comparison it is clear that our algorithm generates results very similar to those obtained with the JM reference software.

4.3.4 Performance Evaluation

The proposed low complexity MB coding mode decision algorithm is implemented in the H.264/AVC reference software, version JM 14.2. We measure complexity reduction in terms of execution time for both the reference and modified encoders on the same machine, which keeps our complexity reduction measurements valid. Figure 4.1 and Figure 4.2 show the overall operation of the proposed encoder. The input video is decoded and the information required by the decision trees is gathered at this stage, as shown in Figure 4.2; the additional computation here is the calculation of the mean and variance of the 4x4 sub-blocks of the residual MBs. The MB coding mode decision determined by the decision trees is used in the low complexity H.264 encoding stage shown in Figure 4.2. This is an H.264 reference encoder with the MB mode decision replaced by a simple mode assignment from the decision tree. The H.264 video encoder takes as input the decoded video (pixel data) and the MB mode decision from the decision tree and encodes the H.264 video. The H.264/AVC Intra prediction mechanism that uses the SAD (Sum of Absolute Differences) calculation is not used; the encoder performs the Intra prediction only for the final MB mode determined by the decision tree.

The performance of the proposed very low complexity encoder is compared with the reference H.264/AVC encoder with RD Optimization enabled. The metrics used to evaluate performance are the reduction in computational cost and the rate-distortion function, measured by PSNR and bit rate (BR). The reported times are for the H.264 encoding for both the proposed and reference encoders.

We have conducted an extensive set of experiments with videos representing a wide range of motion, texture, and color. Experiments were conducted to evaluate the performance of the proposed algorithm when encoding videos at commonly used resolutions: CCIR-601, CIF, and QCIF. The input to the encoder is a YUV sequence. Since the proposed encoder addresses the problem of complexity reduction for Intra prediction, all frames are encoded as I frames. The experiments have shown that the proposed approach performs extremely well in reducing complexity across all QP values and resolution formats.

The sequences were encoded with H.264 using QP values ranging from 20 up to 40 in steps of 4, which corresponds to the H.264 QP range used in most practical applications. All frames were encoded as I-frames by setting IntraPeriod to 1. Rate control was disabled and ProfileIDC was set to Baseline for all simulations. The simulations were run on an Intel Core 2 Duo machine with a 2.40 GHz processor and 4 GB of RAM. The results are reported for six different sequences: two for each of the three resolutions mentioned above. The configuration parameters for the proposed and reference encoders are the same. Due to space constraints, RD curves are shown for three CIF and QCIF sequences (Akiyo, Mobile, Stefan) and for two CCIR sequences (Mobile, Flower); however, the average speedup, BR increase and PSNR loss are shown for all simulated sequences in the tables below.

The comparison metrics were produced and tabulated based on the percentage coding time difference (∆Time), the PSNR difference (∆PSNR) and the percentage bit-rate difference (∆BR). The PSNR and bit-rate differences are calculated from the numerical averages between the RD curves of the reference JM encoder and the algorithm under study (the modified JM encoder with RD Optimization disabled). The detailed procedures for calculating these differences can be found in the JVT documents by Bjontegaard, as recommended by the JVT Test Model Ad Hoc Group. Note that the PSNR and bit-rate differences should be regarded as equivalent, i.e., there is either a decrease in PSNR or an increase in bit-rate, but not both at the same time. To evaluate the performance of our proposal, we divide our classifiers into the following four categories.

A. Top Level Classifier

The classifiers at this level are the decision trees that decide the MB mode partition type, i.e., Intra 16x16 or Intra 4x4. The rest of the mode decisions are determined by the default reference implementation in JM 14.2.

Figure 4.7: RD Performance for Top Level Classifier (QCIF)


Figure 4.8: RD Performance for Top Level Classifier (CIF)

Figure 4.9: RD Performance for Top Level Classifier (CCIR)


Figure 4.7, Figure 4.8, and Figure 4.9 show the RD performance results for the reference and proposed encoders for selected sequences of different resolutions. As seen from these RD curves, the PSNR obtained with the proposed encoder deviates only slightly from the results obtained with the considerably more complex reference encoder. Table 4-XIV summarizes the average statistics for the top level classifier. Compared with the reference encoder, the proposed encoder shows an average PSNR drop of 0.32 dB with a penalty of little more than 1% average BR increase and an average speedup of about 64% for QCIF sequences. The average results for CIF and CCIR-601 sequences are shown in the same table. The negligible drop in PSNR is more than offset by the reduction in computational complexity: as shown in Table 4-XIV, the encoding time is reduced by more than 60% relative to the RD-optimized reference encoder.

B. Intra 16x16 Classifier

The classifier at this level enables the encoder to classify the Intra 16x16 decisions based on the proposed approach. It also includes the decisions made by the top level classifier, while the Intra 4x4 mode decisions are made by the reference encoder.

Figure 4.10, Figure 4.11, and Figure 4.12 show the RD performance results for the reference and proposed encoders for selected sequences. The results obtained for this classifier are very similar to those of the top level classifier. Table 4-XV summarizes the average statistics for the Intra 16x16 classifier. Compared with the reference encoder, the proposed encoder shows an average PSNR drop of 0.33 dB with a penalty of about 2% average BR increase and an average speedup of about 64% for QCIF sequences. The average results for CIF and CCIR-601 sequences are shown in the same table. The negligible drop in PSNR is more than offset by the reduction in computational complexity: as shown in Table 4-XV, the encoding time is reduced by more than 60% relative to the RD-optimized reference encoder.

Figure 4.10: RD Performance for Intra 16x16 Classifier (QCIF)


Figure 4.11: RD Performance for Intra 16x16 Classifier (CIF)

Figure 4.12: RD Performance for Intra 16x16 Classifier (CCIR)


C. Intra 4x4 Classifier

The classifier at this level enables the encoder to classify only the Intra 4x4 decisions based on the proposed approach. It also includes the decisions made by the top level classifier, while the Intra 16x16 mode decisions are made by the reference encoder.

Figure 4.13, Figure 4.14, and Figure 4.15 show the RD performance results for the reference and proposed encoders for selected sequences. Table 4-XVI summarizes the average statistics for the Intra 4x4 classifier. Compared with the reference encoder, the proposed encoder shows an average PSNR drop of 0.24 dB with a penalty of about 29% average BR increase and an average speedup of about 65% for QCIF sequences. The average results for CIF and CCIR-601 sequences are shown in the same table. Although the drop in PSNR is negligible and the encoding time is reduced by more than 65% for all sequences, the massive bit-rate increase does not make this classifier viable for the proposed encoder.


Figure 4.13: RD Performance for Intra 4x4 Classifier (QCIF)

Figure 4.14: RD Performance for Intra 4x4 Classifier (CIF)


Figure 4.15: RD Performance for Intra 4x4 Classifier (CCIR)

D. Combined Classifier

At this level, all Intra mode decisions are determined by the proposed classifiers: first, the top level classifier determines whether the current MB is going to be Intra 16x16 or Intra 4x4, and then, according to the selected mode, the next level classifier decides the prediction directions. Figure 4.16, Figure 4.17, and Figure 4.18 show the RD performance results for the reference and proposed encoders for selected sequences. The results obtained for this classifier are very similar to those of the Intra 4x4 classifier. Table 4-XVII summarizes the average statistics for the combined classifier. Compared with the reference encoder, the proposed encoder shows an average PSNR drop of 0.25 dB with a penalty of about 30% average BR increase and an average speedup of about 65% for QCIF sequences. The average results for CIF and CCIR-601 sequences are shown in the same table. As with the Intra 4x4 classifier, the drop in PSNR is negligible, but it comes at the cost of a bit-rate increase of almost 30%. As shown in Table 4-XVII, the encoding time is reduced by more than 65% for all sequences.

Figure 4.16: RD Performance for Combined Intra Classifier (QCIF)


Figure 4.17: RD Performance for Combined Intra Classifier (CIF)

Figure 4.18: RD Performance for Combined Intra Classifier (CCIR)


Table 4-XIV: COMPARISON RESULTS FOR TOP LEVEL CLASSIFIER

                |           QCIF            |            CIF            |           CCIR
Sequence        | ∆PSNR dB  ∆BR %  ∆TIME %  | ∆PSNR dB  ∆BR %  ∆TIME %  | ∆PSNR dB  ∆BR %  ∆TIME %
Akiyo           |  -0.292   1.380  -60.500  |  -0.310   3.080  -60.910  |     -       -        -
Coastguard      |  -0.271   0.565  -64.480  |  -0.277   0.518  -65.520  |     -       -        -
Container       |  -0.280   1.772  -62.537  |  -0.267   1.542  -64.143  |     -       -        -
Flower          |  -0.465   0.865  -67.167  |  -0.474   1.013  -67.408  |  -0.424   0.846  -67.866
Foreman         |  -0.223   0.858  -62.256  |  -0.260   1.284  -63.277  |     -       -        -
Hall Monitor    |  -0.263   2.215  -62.333  |  -0.244   2.624  -62.964  |     -       -        -
Mobile          |  -0.404   0.754  -67.429  |  -0.377   0.766  -68.050  |  -0.355   0.911  -67.835
Mother Daughter |  -0.286   0.812  -61.269  |  -0.265   2.716  -61.026  |     -       -        -
Silent          |  -0.303   0.198  -62.761  |  -0.314   0.884  -63.427  |     -       -        -
Stefan          |  -0.430   1.001  -65.914  |  -0.402   0.726  -66.052  |     -       -        -
Table           |  -0.289   1.232  -62.533  |  -0.263   1.138  -64.428  |     -       -        -
Average         |  -0.319   1.060  -63.560  |  -0.310   1.480  -64.290  |  -0.390   0.880  -67.850

Table 4-XV: COMPARISON RESULTS FOR INTRA 16X16 CLASSIFIER

                |           QCIF            |            CIF            |           CCIR
Sequence        | ∆PSNR dB  ∆BR %  ∆TIME %  | ∆PSNR dB  ∆BR %  ∆TIME %  | ∆PSNR dB  ∆BR %  ∆TIME %
Akiyo           |  -0.312   2.414  -60.801  |  -0.331   5.462  -61.652  |     -       -        -
Coastguard      |  -0.275   1.426  -64.753  |  -0.282   1.531  -65.860  |     -       -        -
Container       |  -0.282   3.251  -62.940  |  -0.272   2.858  -64.834  |     -       -        -
Flower          |  -0.466   1.000  -67.283  |  -0.479   1.282  -67.870  |  -0.426   1.091  -68.093
Foreman         |  -0.235   1.351  -62.394  |  -0.271   2.394  -63.685  |     -       -        -
Hall Monitor    |  -0.274   3.175  -62.672  |  -0.259   4.534  -63.529  |     -       -        -
Mobile          |  -0.403   0.971  -67.469  |  -0.380   1.075  -68.129  |  -0.359   1.315  -67.975
Mother Daughter |  -0.294   1.679  -61.626  |  -0.285   5.952  -61.802  |     -       -        -
Silent          |  -0.308   1.146  -62.905  |  -0.318   2.100  -63.698  |     -       -        -
Stefan          |  -0.434   1.564  -66.083  |  -0.406   1.471  -66.363  |     -       -        -
Table           |  -0.297   2.228  -63.011  |  -0.274   2.320  -65.066  |     -       -        -
Average         |  -0.330   1.820  -64.020  |  -0.330   2.680  -64.970  |  -0.390   1.200  -68.030


Table 4-XVI: COMPARISON RESULTS FOR INTRA 4X4 CLASSIFIER

                |           QCIF            |            CIF            |           CCIR
Sequence        | ∆PSNR dB  ∆BR %  ∆TIME %  | ∆PSNR dB  ∆BR %  ∆TIME %  | ∆PSNR dB  ∆BR %  ∆TIME %
Akiyo           |  -0.130  33.640  -62.208  |  -0.143  41.829  -62.066  |     -       -        -
Coastguard      |  -0.127  31.531  -66.120  |  -0.101  35.433  -67.104  |     -       -        -
Container       |  -0.221  30.781  -64.024  |  -0.172  32.450  -64.872  |     -       -        -
Flower          |  -0.402  13.413  -69.142  |  -0.412  13.497  -68.486  |  -0.363  14.610  -69.615
Foreman         |  -0.240  48.931  -64.109  |  -0.199  41.902  -64.440  |     -       -        -
Hall Monitor    |  -0.118  39.265  -64.103  |  -0.049  48.467  -63.896  |     -       -        -
Mobile          |  -0.375  14.720  -69.383  |  -0.349  19.223  -69.651  |  -0.330  19.432  -69.523
Mother Daughter |  -0.168  33.426  -62.947  |  -0.055  51.746  -61.766  |     -       -        -
Silent          |  -0.205  32.184  -64.814  |  -0.183  33.188  -64.962  |     -       -        -
Stefan          |  -0.405  18.798  -67.696  |  -0.321  25.540  -67.398  |     -       -        -
Table           |  -0.241  21.082  -63.972  |  -0.169  24.343  -65.270  |     -       -        -
Average         |  -0.240  28.890  -65.320  |  -0.200  33.420  -65.450  |  -0.350  17.020  -69.570

Table 4-XVII: COMPARISON RESULTS FOR ALL CLASSIFIERS

                |           QCIF            |            CIF            |           CCIR
Sequence        | ∆PSNR dB  ∆BR %  ∆TIME %  | ∆PSNR dB  ∆BR %  ∆TIME %  | ∆PSNR dB  ∆BR %  ∆TIME %
Akiyo           |  -0.133  34.963  -62.614  |  -0.180  44.391  -62.871  |     -       -        -
Coastguard      |  -0.135  32.461  -66.424  |  -0.107  36.506  -67.537  |     -       -        -
Container       |  -0.237  32.395  -64.512  |  -0.184  33.839  -65.607  |     -       -        -
Flower          |  -0.402  13.577  -69.184  |  -0.418  13.792  -68.933  |  -0.366  14.884  -69.839
Foreman         |  -0.247  49.574  -64.307  |  -0.212  43.251  -64.898  |     -       -        -
Hall Monitor    |  -0.148  40.609  -64.380  |  -0.069  50.612  -64.521  |     -       -        -
Mobile          |  -0.376  14.938  -69.451  |  -0.356  19.574  -69.755  |  -0.345  19.922  -69.785
Mother Daughter |  -0.177  34.496  -63.367  |  -0.063  55.401  -62.526  |     -       -        -
Silent          |  -0.202  33.224  -64.992  |  -0.191  34.437  -65.244  |     -       -        -
Stefan          |  -0.414  19.424  -67.897  |  -0.335  26.381  -67.709  |     -       -        -
Table           |  -0.254  22.335  -64.515  |  -0.187  25.702  -65.862  |     -       -        -
Average         |  -0.250  29.820  -65.600  |  -0.210  34.900  -65.950  |  -0.360  17.400  -69.810

Table 4-XVIII: COMPARISON RESULTS WITH OTHER ALGORITHM [59] [61]

         |           PDD             |    Top Level Classifier   |   Intra 16x16 Classifier
Sequence | ∆PSNR dB  ∆BR %  ∆TIME %  | ∆PSNR dB  ∆BR %  ∆TIME %  | ∆PSNR dB  ∆BR %  ∆TIME %
QCIF     |  -0.21    3.18   -62.67   |  -0.32    1.06   -63.56   |  -0.33    1.82   -64.02
CIF      |  -0.26    3.56   -63.35   |  -0.31    1.48   -64.29   |  -0.33    2.68   -64.97


One important observation across the four classifiers described above is the effect of complexity reduction on bit-rate. For the top level and Intra 16x16 classifiers the bit-rate penalty is within reasonable limits, but for the Intra 4x4 and combined classifiers it is much larger. In summary, the proposed algorithm is about two times faster than the JM 14.2 reference software, while the increase in BR is nearly negligible in the case of the top level and Intra 16x16 classifiers; for the Intra 4x4 and combined classifiers we get the same amount of complexity reduction with a tangible increase in bit-rate. Comparing our top level and Intra 16x16 classifiers with the Pixel-Based Direction Detection (PDD) algorithm proposed in [59], the computation time of our proposed methods is further reduced, albeit by a small margin; more importantly, our classifiers are more cost effective in terms of bit-rate penalty by a factor of 2, as shown in Table 4-XVIII. This comparison is based only on QCIF and CIF sequences.

4.4 Summary

In this chapter we have presented a machine learning based approach that reduces the Intra coding complexity of the H.264/AVC encoder, and we have explained the reasons for using machine learning in video encoding and its effectiveness. We have proposed a novel macroblock partition mode decision algorithm for Intra prediction in H.264/AVC. The proposed algorithm uses data mining techniques to exploit the correlation between H.264/AVC MB statistics and H.264 MB coding modes; the WEKA tool was used to develop the decision trees for the H.264/AVC coding mode decisions. The proposed algorithm has very low complexity, as it only requires the calculation of MB statistics such as means, variances and differences of border pixels. The proposed encoder was evaluated using YUV sequences at QCIF, CIF and CCIR resolutions. Our results show that the proposed algorithm maintains good picture quality while reducing the computational complexity by 65% on average, with negligible impact on the quality and bit-rate of the encoded video, and that it maintains its performance across all resolutions. The proposed approach is novel, and the basic idea can also be used to refine other techniques in video encoding [60] [61].


5. SVC COMPLEXITY REDUCTION I

5.1 Introduction

SVC is based on a multilayer representation of the video with an AVC compliant base layer (BL). The basic concept of SVC is the creation of a compressed bit stream comprising partial bit streams, enabling the transmission and decoding of those partial streams. SVC structures the data of a compressed video bit stream into layers: the base layer is an ordinary H.264/AVC bit stream, while one or more enhancement layers provide improved quality for those decoders that are capable of using them. The SVC bit stream always consists of a lower-resolution/quality version of the video signal (the base layer) and the full-resolution/quality video (the enhancement layer). With only a small increase in decoder complexity relative to single-layer H.264/AVC, SVC provides network-friendly scalability at the bit stream level. Furthermore, SVC makes it possible to losslessly rewrite fidelity-scalable SVC bit streams into single-layer H.264/AVC bit streams. The target applications of the SVC extension of H.264/AVC range from video conferencing and mobile video to high-definition broadcast and professional editing applications.

Three fundamental types of scalability are enabled in H.264/SVC. The first is temporal scalability, in which the enhancement layer increases the frame rate of the base layer; to a large extent this was already supported in the original H.264/AVC standard, but new supplemental information has been designed to make such uses more powerful. The second is spatial scalability (or resolution scalability), in which the enhancement layer offers increased picture resolution for receivers with greater display capabilities. Finally, there is quality scalability, sometimes also referred to as SNR or fidelity scalability, in which the enhancement layer provides an increase in video quality without changing the picture resolution.

The purpose of SVC is to extend the capabilities of the H.264/AVC design to address the needs of applications for more flexible video coding in highly heterogeneous and time-varying environments. Instead of encoding each video source multiple times to provide an optimized bit stream for each client, scalable coding provides a single bit stream whose syntax enables flexible and low complexity extraction of the information needed to match the requirements of different devices and networks.

We obtained classification data by using the JSVM reference software configured with two spatial layers. Based on cost-sensitive learning, we developed a classification algorithm and implemented it in JSVM, replacing the traditional Intra prediction tools in AVC and the ILP tools in SVC. The proposed approach was developed based on insights from our work on MPEG-2 to H.264 transcoding [62] and low complexity Intra MB encoding in H.264 [63], both of which exploit machine learning tools. The key idea behind this approach is to exploit the correlation between the structural information in a video frame and the corresponding mode decisions.


Building on machine learning techniques that have previously been used with improved performance, we propose a low complexity video encoder operating in the spatial domain. It improves the RD performance and yields a significant speedup in encoding by applying cost-sensitive learning in a first phase and then a chance constrained approach.

SVC achieves extremely efficient encoding, but at the same time the processing involved is quite complex, and hence the codec is often unusable on resource constrained devices such as mobile phones. The most time consuming process in H.264/SVC encoding is the macroblock (MB) mode decision. For Intra-frame prediction, H.264 has 4 prediction modes for the 16x16 block size and 9 modes each for the 8x8 and 4x4 block sizes. Mode selection is typically performed by trying all the modes and choosing the one that achieves the best Rate-Distortion (RD) performance. This approach describes a novel method for mode decision which uses a classifier to predict the optimal mode. In particular, the classifier is trained so as to minimize the loss in RD performance. This is a better approach than training the classifier to optimize the zero-one error because, for our application, different wrong predictions incur different penalties in terms of RD performance [64].

In our effort to reduce the complexity of the H.264/AVC encoder, we developed machine learning based techniques using the C4.5 algorithm (see Chapter 4). The decision trees we obtained are binary trees; however, in the context of video encoding the choice of binary trees for determining the MB modes is somewhat limiting. Stochastic approaches provide the opportunity to select the best possible choice among all of the available inputs. In other words, binary decision trees compel us to divide all possible choices into two sets at every step; once a path is taken, half of the possible choices are removed from consideration and the final selection must be made from the remaining half. Stochastic methods can make the best possible selection from all of the available inputs.

5.2 Cost-Sensitive Learning

Learning algorithms can be used to minimize the expected cost of misclassification. Most classification learning algorithms attempt to minimize the expected number of misclassification errors, but in many applications different kinds of classification errors have different costs, so cost-sensitive methods are needed. In video encoding, the RD cost determines the video quality in terms of compression efficiency. Rate-Distortion (RD) optimization can be used to improve quality in video encoding, where decisions have to be made that affect both file size and quality simultaneously. Our purpose here is to reduce the file size (bit-rate) while improving the quality; therefore, we consider the RD cost an important feature that can prove very helpful for classification during the learning phase.

In classical machine learning and data mining settings, classifiers generally try to minimize the number of mistakes they will make in classifying new instances. Such a setting is valid only when the costs of different types of mistakes are equal. Unfortunately, in many real-world applications the costs of different types of mistakes are often unequal. For example, in video encoding, the cost of an Intra 4x4 MB mistakenly classified as an Intra 16x16 MB may be much larger than that of an Intra 16x16 MB misclassified as Intra 4x4, because the former type of misclassification results in the heavier penalty in terms of RD performance.

The proposed approach uses a Support Vector Machine (SVM) [65] for building classifiers. With respect to the DM algorithm used to build the classifiers, we chose SVM because it solves classification and regression problems with a cost model and dependent costs, and it is computationally efficient in training and classification. It is one of the most commonly used algorithms in data mining. Such systems take as input a collection of cases, each belonging to one of a small number of classes and described by its values for a fixed set of attributes, and output a classifier that can accurately predict the class to which a new case belongs. SVM generates classifiers which we converted into decision trees. A decision tree is used to classify a case, i.e., to assign a class value to a case depending on the values of its attributes: a path from the root to a leaf of the decision tree is followed based on the attribute values of the case, and the class specified at the leaf is the predicted class. A performance measure of a decision tree over a set of cases is the classification error, defined as the percentage of misclassified cases, i.e., cases whose predicted classes differ from the actual classes.

We solve the mode-prediction problem using a two-level approach. At the first level, we use a decision tree classifier to predict the block size, 16x16 or 4x4 (we have not used the 8x8 block size in our experiments). At the second level, we use one classifier each for the 16x16 and 4x4 block sizes, each trained using cost-sensitive learning. The mathematical formulation of cost-sensitive learning is described below.


5.2.1 Mathematical Formulation

Mathematical notations and symbols used in the next section are explained, with their meanings, in Table 5-I.

Table 5-I: MATHEMATICAL NOTATIONS FOR COST-SENSITIVE CLASSIFIER

Notation    | Meaning
$\xi$       | Slack variable to allow errors in the training set
$\omega$    | Weight vector learned by the classifier
$C$         | Regularization factor
$\Psi$      | Feature map
$\otimes$   | Tensor product
$\Lambda^c$ | Orthogonal (binary) encoding

In [65] Joachims et al presents a general framework based on Support Vector

method for building classifiers so as to optimize performance measures as opposed to the

zero-one error. In this approach, the classifier is trained by solving the following optimization problem: 1 + (5.1) , 0 2 푚푖푛 2 ‖휔‖ 퐶휉 휔 휉 ≥ [ ( ) ( )] ( ) . . \ , , , 푇 푠 푡 The is called∀푦�′ slack∈ 푌� variable.푦� ∶ 휔 IfΨ a training푥̅ 푦� − exampleΨ 푥̅ 푦 �lies′ on≥ the∆ “wrong”푦�′ 푦� − side휉 of the

hyperplane, 휉the corresponding is greater than 1. The factor controls the amount of 116 휉 퐶

regularization. ( , ) is a feature map describing the match between feature vector and label andΨ 푥is 푦the loss function. (See [65] for detail notations). Note that optimal푥

upper bounds푦 training∆ loss ( , ), where

푝푟푒푑 휉 ∆ 푦= arg푦� ( , ) 푇 푝푟푒푑 푦 The choices of and are푦 application 푚푎푥specific.휔 ForΨ our푥 application,푦 the classifiers for

16x16 & 4x4 blockΨ sizes∆ will have 4 & 9 target classes respectively. Hence, we have chosen feature map

\Psi(x, y) = x \otimes \Lambda^{c}(y) \qquad (5.2)

which is similar to the one used for SVM-multiclass in [66]. In other words, the vector Ψ(x, y) is an nk-dimensional vector (assuming x is an n-dimensional feature vector and k is the number of modes), obtained by shifting the feature vector x according to the label y. In particular,

\Psi(x, y)[\,n(y-1) + i\,] = x[i], \quad i = 0, \ldots, n-1 \qquad (5.3)

with all other entries equal to 0.

Due to this choice of Ψ, the learned vector ω is a collection of vectors v₁, v₂, …, v_k, where v_i is the weight vector for the i-th mode. Next, to minimize the reduction in RD performance, we have chosen the loss function

\Delta(\bar{y}, y) = Cost(y) - Cost(\bar{y}) \qquad (5.4)

where Cost(y) is the RD cost incurred when mode y is chosen. Thus Δ(ȳ, y) in essence denotes the extra RD cost incurred when we choose a non-optimal mode y instead

of the optimal one ȳ. With this choice of Ψ and Δ, we train one classifier each for block sizes 16x16 and 4x4 using the framework above, tuning the parameter C to learn the vector ω. During encoding, mode prediction for a given MB with feature vector x is performed as follows: for each mode ȳ, we calculate its score using

score(\bar{y}) = \omega^{T}\Psi(x, \bar{y}) \qquad (5.5)

Then the mode with the lowest score is predicted by the classifier.
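As a concrete illustration of Equations 5.3 and 5.5, the following minimal sketch (our own illustration, not the SVMLight code used in the experiments) exploits the fact that ω^T Ψ(x, ȳ) reduces to the dot product of the ȳ-th block of ω with x:

    import numpy as np

    def mode_scores(omega, x, k):
        """Score every mode per Equation 5.5; omega is the learned
        nk-vector (v_1, ..., v_k), x the n-dimensional feature vector."""
        n = x.size
        # By Equation 5.3, Psi(x, y) places x in the y-th block of an
        # nk-vector, so omega^T Psi(x, y) is simply the dot product v_y . x
        v = omega.reshape(k, n)   # rows are the per-mode weight vectors
        return v @ x              # score(y) for y = 1, ..., k

    # Hypothetical numbers: 3 modes, 2 features per macroblock
    scores = mode_scores(np.array([0.5, -0.2, 1.0, 0.3, -0.4, 0.8]),
                         np.array([1.0, -1.0]), k=3)
    predicted_mode = int(np.argmin(scores)) + 1   # lowest score wins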

5.2.2 Implementation

We used SVMLight to build the cost-sensitive classifier for the training data. SVMLight is an implementation of the Support Vector Machine (SVM) in C [65]. To evaluate the performance of our algorithm, we implemented the classification step in the SVC reference software JSVM 9.17, replacing the Intra MB decisions in the base layer and enhancement layers with our algorithm. Test sequences were encoded with all-Intra frames, and performance was evaluated by comparing the RD performance and speedup of the modified encoder with those of the standard SVC encoder.

5.2.3 Performance Evaluation

We measure complexity reduction in terms of execution time, running both the reference and modified encoders on the same machine so that the timing comparisons remain valid. The results show that the MB mode decisions can be made with very high accuracy, as shown in Figure 5.1. Prediction mode decisions are complex and introduce a small loss in PSNR; the maximum PSNR loss suffered in this case is about 0.5 dB. Table 5-II shows the classification performance for various QCIF and CIF sequences. For the base layer we obtain a speedup of about 50%, while for the enhancement layer a speedup of about 70% is obtained. Figure 5.1 shows the RD performance for the base and enhancement layers.

Figure 5.1: EL RD Performance and BL RD Performance using Cost-Sensitive Learning.

Table 5-II: CLASSIFICATION RESULTS BY COST-SENSITIVE LEARNING

                   QCIF                                CIF
Sequence     ΔPSNR (dB)   ΔBR (%)   ΔT (%)       ΔPSNR (dB)   ΔBR (%)   ΔT (%)
Akiyo        -0.292       2.38      -50.5        -0.31        3.08      -70.91
Flower       -0.465       1.86      -57.16       -0.474       3.01      -77.4
Foreman      -0.223       1.85      -52.25       -0.26        2.28      -67.27
Mobile       -0.404       2.75      -57.42       -0.377       4.76      -68.05
Mot Dau      -0.286       3.81      -51.26       -0.265       4.71      -79.02
Silent       -0.303       1.19      -52.76       -0.314       2.88      -73.42
Average      -0.33        2.31      -53.56       -0.33        3.45      -72.68


5.3 Chance Constrained Approach

In the previous section, we developed a machine learning based approach that reduces computationally expensive elements of encoding, such as coding mode evaluation, to a classification problem of much lower complexity. We successfully adopted this technique for transcoding and produced much better results, and we continued to improve the methodology by employing various techniques within the classification domain. The key contribution of this part of the dissertation is the exploration of machine learning and the mathematical formulation of the video coding mode decision as a classification problem.

The proposed approach (chance constrained) was developed based on insights from our work on H.264/AVC complexity reduction that exploited machine learning tools. The key idea behind this approach is to exploit the correlation between the structural information in a video frame and the corresponding H.264/AVC MB mode decisions. The basic idea is to determine computationally expensive encoder decisions, such as MB coding mode decisions, using easily computable features derived from the uncompressed video. The chance constrained technique is used to deduce the classifier based on such features. Once the classifier is obtained, the encoder coding mode decisions that are normally made using cost-based models evaluating all possible coding options are replaced with a binary decision tree. We believe this simple approach has the potential to significantly reduce encoding complexity and affect the way encoders are used in mobile devices. The results of this technique should not necessarily be interpreted as an improvement over the results of our previous machine learning work; rather, we have successfully explored an alternate approach that keeps the problem in the classification domain.

The features that are of significant value when taking Intra coding mode decision

in H.264 are those which capture the similarity between current Macroblock (MB) and

neighboring blocks as well as the ones which capture the self-similarity of the MB. The

importance of self-similarity comes from the fact that coding mode depends upon

whether the MB is homogeneous or contains highly detailed information. For instance, to

minimize Rate-Distortion, optimum coding mode size of a homogeneous MB would be

large and vice versa. Metrics such as mean and variance of intensity values of an MB

provide a good measure of self-similarity. We can also partition each 16x16 MB into 16

sub-macroblocks (SMBs) of size 4x4 and take the mean and variance of intensity values within each SMB, which signifies how much the intensity values and homogeneity differ across various parts of the MB. The similarity with neighboring MBs can be captured using the difference in intensity values between the current and neighboring

MBs. These differences can again be in terms of mean and variance of pixel wise

differences of intensities.
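As an illustration of this feature set, the sketch below computes mean/variance features for one macroblock, assuming the MB and one neighboring MB are given as 16x16 arrays of luma samples (the exact feature list used in the experiments may differ):

    import numpy as np

    def mb_features(mb, neighbor):
        """Mean/variance feature sketch for one 16x16 macroblock."""
        feats = [mb.mean(), mb.var()]            # self-similarity of the MB
        # mean and variance of each 4x4 sub-macroblock (16 SMBs in total)
        smbs = mb.reshape(4, 4, 4, 4).transpose(0, 2, 1, 3).reshape(16, 16)
        feats += smbs.mean(axis=1).tolist()
        feats += smbs.var(axis=1).tolist()
        # similarity with a neighboring MB: pixel-wise intensity differences
        diff = mb.astype(float) - neighbor.astype(float)
        feats += [diff.mean(), diff.var()]
        return np.array(feats)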

It is clear that pairs of mean and variance values play a significant role in the MB coding-mode decision in our problem. From this viewpoint, we can consider an approach similar to the chance constrained approach in [68], which treats each pair of mean-variance values as the mean and variance of a random variable. We use this approach to derive an equivalent formulation for our problem. The mathematical and geometrical formulation of the problem is described in detail in [67].


5.3.1 Mathematical Formulation

The mathematical notations and symbols used in the next section are explained, with their meanings, in Table 5-III.

Table 5-III: MATHEMATICAL NOTATIONS FOR CHANCE CONSTRAINED CLASSIFIER

Notation          Meaning
μ                 Mean value of MB
σ                 Variance of MB
ρ                 PDF (Probability Density Function)
ξ                 Slack variable to allow errors in the training set
η                 User-defined parameter (lower-bounds the classification accuracy)
ω                 Weight vector of the classifier hyperplane
b                 Bias (offset) of the classifier hyperplane
‖ω‖₂ ≤ W          Second Order Cone Programming (SOCP) constraint
Zʲ                Random variable
Xᵢ ∼ f_{Xᵢ}       Xᵢ has the pdf given by f_{Xᵢ}

Let the data of each class be specified by its first two moments, i.e., mean μ and covariance Σ. Let X₁ and X₂ represent the random vectors that generate the data points of the positive and negative classes respectively. Assume that the distributions of X₁ and X₂ can be modeled using mixture models whose component distributions have diagonal covariance matrices, i.e., the component distributions are de-correlated. Let k₁ and k₂ be the numbers of components in the mixture models of the positive and negative classes respectively, and let k = k₁ + k₂.

X_{1} \sim f_{X_{1}} = \sum_{j=1}^{k_{1}} \rho^{j} f_{x^{j}}, \qquad X_{2} \sim f_{X_{2}} = \sum_{j=k_{1}+1}^{k} \rho^{j} f_{x^{j}}

\text{s.t. } \sum_{j=1}^{k_{1}} \rho^{j} = 1 = \sum_{j=k_{1}+1}^{k} \rho^{j}

and

\Sigma^{j} = \mathrm{diag}\big(\sigma_{j1}^{2}, \sigma_{j2}^{2}, \ldots, \sigma_{jn}^{2}\big) \quad \forall j

where σ²_{ji} is the variance of the i-th component of Xʲ. We want to learn a classifier ⟨ω, b⟩ which separates the two classes with

high probability, i.e. 휔 푏

P\big(\omega^{T} X^{j} + b \ge 1\big) \ge \eta, \quad j = 1 \to k_{1}

P\big(\omega^{T} X^{j} + b \le -1\big) \ge \eta, \quad j = k_{1}+1 \to k \qquad (5.6)

X^{j} \sim f_{x^{j}}, \quad j = 1 \to k

where η is a user-defined parameter which lower-bounds the classification accuracy.

To handle outliers and almost linearly separable cases, we can introduce slack variables ξʲ and obtain the following formulation with relaxed constraints:

\min_{\omega,\, b,\, \xi}\ \sum_{j=1}^{k} \xi^{j} \quad \text{s.t.}

P\big(\omega^{T} X^{j} + b \ge 1 - \xi^{j}\big) \ge \eta, \quad j = 1 \to k_{1}

P\big(\omega^{T} X^{j} + b \le -1 + \xi^{j}\big) \ge \eta, \quad j = k_{1}+1 \to k

\xi^{j} \ge 0, \quad j = 1 \to k

\|\omega\|_{2} \le W

X^{j} \sim f_{x^{j}}, \quad j = 1 \to k \qquad (5.7)

The probabilistic constraints in Equation 5.7 can be simplified using the Chebyshev-

Cantelli Inequality as in [69] to get the following equivalent formulation (Equation 5.8):

\min_{\omega,\, b,\, \xi}\ \sum_{j=1}^{k} \xi^{j} \quad \text{s.t.}

y_{j}\big(\omega^{T} \mu^{j} - b\big) \ge 1 - \xi^{j} + \kappa \sqrt{\omega^{T} \Sigma^{j} \omega}, \quad j = 1 \to k

\xi^{j} \ge 0, \quad j = 1 \to k \qquad (5.8)

W \ge \|\omega\|_{2}

where \kappa = \sqrt{\frac{\eta}{1-\eta}}.
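For completeness, the step from the probabilistic constraints of Equation 5.7 to the SOC constraints of Equation 5.8 can be sketched as follows (our condensed restatement of the argument in [69]). Cantelli's inequality states that, for a random variable s with mean E[s] and variance Var(s),

P\big(s - \mathbb{E}[s] \le -t\big) \le \frac{\mathrm{Var}(s)}{\mathrm{Var}(s) + t^{2}}, \qquad t \ge 0.

Taking s = \omega^{T} X^{j} - b for a positive-class component, so that \mathbb{E}[s] = \omega^{T}\mu^{j} - b and \mathrm{Var}(s) = \omega^{T}\Sigma^{j}\omega, and setting t = \mathbb{E}[s] - (1 - \xi^{j}), the chance constraint P(s \ge 1 - \xi^{j}) \ge \eta is guaranteed whenever \mathrm{Var}(s)/(\mathrm{Var}(s) + t^{2}) \le 1 - \eta, which rearranges exactly to \omega^{T}\mu^{j} - b \ge 1 - \xi^{j} + \kappa\sqrt{\omega^{T}\Sigma^{j}\omega} with \kappa = \sqrt{\eta/(1-\eta)}.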


The formulation in Equation 5.8 can be further simplified as follows. Let Zʲ be a random vector such that Z^{j} = (\Sigma^{j})^{-1/2} X^{j}. Hence \mathbb{E}[Z^{j}] = (\Sigma^{j})^{-1/2}\mu^{j} and \mathrm{cov}(Z^{j}) = I. Now, using Zʲ in place of Xʲ in Equation 5.8, we can obtain the following equivalent formulation:

\min_{\omega,\, b,\, \xi}\ \sum_{j=1}^{k} \xi^{j} \quad \text{s.t.}

y_{j}\big(\omega^{T} \mu_{Z}^{j} - b\big) \ge 1 - \xi^{j} + \kappa \|\omega\|, \quad j = 1 \to k

\xi^{j} \ge 0, \quad j = 1 \to k \qquad (5.9)

W \ge \|\omega\|_{2}

where \mu_{Z}^{j} = \mathbb{E}[Z^{j}] = (\Sigma^{j})^{-1/2}\mu^{j}. This classification problem is an instance of a Second Order Cone Program

(SOCP), which can be solved by solvers such as SeDuMi.
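For illustration, a minimal sketch of Equation 5.9 using the cvxpy modeling package follows (our experiments actually used yalmip with the SeDuMi solver; the function and its inputs are assumptions made for this sketch):

    import cvxpy as cp
    import numpy as np

    def train_soc_classifier(mu_z, y, kappa, W):
        """Solve the SOCP of Equation 5.9 for one binary sub-classifier.
        mu_z: (k, n) whitened component means; y: (k,) labels in {+1, -1}."""
        k, n = mu_z.shape
        w = cp.Variable(n)
        b = cp.Variable()
        xi = cp.Variable(k, nonneg=True)   # slack variables
        constraints = [
            cp.multiply(y, mu_z @ w - b) >= 1 - xi + kappa * cp.norm(w, 2),
            cp.norm(w, 2) <= W,
        ]
        cp.Problem(cp.Minimize(cp.sum(xi)), constraints).solve()
        return w.value, b.value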

5.3.2 Geometrical Formulation

The geometric interpretation of Equation 5.9 turns out to be classifying most of the spheres, with center being the mean value and radius κ, correctly, as opposed to classifying points as in the usual case. This can be seen as follows. Let the set of points lying in the sphere with center c and radius r be denoted by

\mathcal{B}(c, r) = \{\, x \mid (x - c)^{T}(x - c) \le r^{2} \,\}.


Now, the problem of classifying the points in \mathcal{B}(\mu, \kappa) correctly (including the case of outliers) is:

\omega^{T} x - b \ge 1 - \xi, \quad \forall x \in \mathcal{B}(\mu, \kappa) \qquad (5.10)

The constraints in Equation 5.10, which imply that the whole sphere should be on the positive half-space of the hyperplane \omega^{T} x - b \ge 1 - \xi, can be replaced by a single constraint:

\omega^{T} x_{0} - b \ge 1 - \xi, \quad \text{where } x_{0} = \arg\min_{x \in \mathcal{B}(\mu, \kappa)} \big(\omega^{T} x - b\big) \qquad (5.11)

which specifies that the point nearest to the hyperplane \omega^{T} x - b = 1 - \xi should be on the positive half-space.

Now, x₀ can be found as follows: drop a perpendicular to the hyperplane from the centre of the sphere; the point at which the perpendicular intersects the sphere is x₀. Using this geometry, and noting that the sphere has center μ and radius κ, we get

x_{0} = \mu - \kappa \frac{\omega}{\|\omega\|}

Now, x_{0} \in \mathcal{B}(\mu, \kappa). Hence, from Equation 5.10,

\omega^{T} x_{0} - b \ge 1 - \xi \qquad (5.12)

Putting the value of x₀ into Equation 5.12, we get

\omega^{T} \mu - b \ge 1 - \xi + \kappa \|\omega\|, \quad j = 1 \to k \qquad (5.13)

which is similar in form to the Second Order Cone (SOC) constraint in Equation

5.8, given that we have considered the positive half-space here (y = 1).

This is illustrated geometrically in Figure 5.2. All blue spheres have positive labels, while red spheres have negative labels. Note that, except for the red sphere intersecting the hyperplane, all spheres are classified correctly.

Figure 5.2: Geometrical Interpretation of SOC Constraint

5.3.3 Implementation

Our mode-decision problem is a multiclass problem with 13 possible class labels: 4 prediction modes for the 16x16 block size and 9 prediction modes for the 4x4 block size. As in the approach used in Section 4.3.2, we solve this problem using a set of

12 binary sub-classifiers arranged as a hierarchical tree.

For each sub-classifier we use a set of features in the form of mean-variance pairs of various intensity values relevant to that particular sub-classification. Each classifier is


then trained using the formulation in [67] via yalmip, an interface to the SOCP solver SeDuMi, to obtain the model parameters ω and b.

For testing, first the required features of the incoming macroblock are computed. Then, the mean values are scaled using the equation

\mu_{scaled}^{j} = (\Sigma^{j})^{-1/2} \mu^{j}

Note that this computation is done only once per macroblock and is then reused at all levels of the tree. Note also that, as Σ is a diagonal matrix, the computation of \Sigma^{-1/2} as well as the matrix multiplication are very low-cost operations. After this initial computation, each sub-classifier takes a decision based on \mathrm{sign}\big(\omega^{T} \mu'^{\,j}_{scaled} - b\big), where \mu'^{\,j}_{scaled} consists of only those mean values which are relevant to the sub-classification at this level of the tree.
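A minimal sketch of this test-time procedure follows (the tree node layout below is a hypothetical data structure, not the actual JM integration):

    import numpy as np

    def classify_mb(mu, sigma2, tree):
        """Walk the hierarchical tree of binary sub-classifiers.
        mu, sigma2: per-feature means and variances of the incoming MB."""
        mu_scaled = mu / np.sqrt(sigma2)  # diagonal Sigma^(-1/2) mu, done once
        node = tree
        while not node.is_leaf:
            # each level uses only the mean values relevant to its decision
            score = node.w @ mu_scaled[node.idx] - node.b
            node = node.pos_child if score >= 0 else node.neg_child
        return node.mode                  # one of the 13 Intra modes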

Note that one major goal in our problem is the speed-up achieved in encoding compared to the standard reference encoder (JM 14.2). This is clearly achieved here, as all operations required during testing are low-cost; the experimental results in the performance evaluation section support this claim.

Another major goal is the performance in terms of bit-rate penalty, which depends heavily on the performance of the classifier. Experimental results suggest that the bitrate performance of the approach is good as well; that is, the bitrate penalty of video encoded using this technique is very low.


5.3.4 Performance Evaluation

We measure complexity reduction in terms of execution time, running both the reference and modified encoders on the same machine so that the timing comparisons remain valid. Intra MBs in H.264 are coded as Intra

16x16, Intra 4x4, or Intra 8x8. The baseline profile used in mobile devices does not support Intra 8x8 mode and this mode will not be discussed further in this dissertation.

Intra modes also have associated prediction modes; Intra 16x16 has 4 prediction modes and Intra 4x4 has 9 prediction modes. Baseline profile encoders typically evaluate both

Intra16x16 and Intra4x4 modes and the associated prediction modes before making MB mode decisions. In the proposed chance constrained based approach we separate the Intra

MB mode and Intra prediction mode decisions. Intra MB mode is determined as Intra

16x16 or Intra 4x4 without evaluating any prediction modes. The appropriate prediction modes for the chosen MB mode are then determined. Since the MB mode is determined first, our approach immediately eliminates the computation of any prediction modes for the MB mode that is not selected. If the MB mode is determined to be Intra 16x16, there is no need to evaluate any prediction modes for the 4x4 sub-blocks. Figure 4.3 shows the hierarchical decision tree used in making H.264 Intra MB mode and prediction mode decisions. The attributes used in the decision trees are the mean and the variance of the means of the 4x4 sub-blocks for MB mode decisions; for prediction mode decisions, the mean and variance of the pixels used for prediction are the attributes. Analytical results based on this approach show that the number of operations required for Intra mode determination is reduced by about 15 times [63].


We implemented the decision tree in order to evaluate the performance of the chance constrained based decisions. The decision trees were implemented in the H.264 reference software JM 14.2: the Intra MB decisions in JM 14.2 were replaced by a decision tree, i.e., a set of if-else statements. Test sequences were encoded with all-Intra frames, and performance was evaluated by comparing the RD performance of an encoder using the proposed Intra mode decisions with that of a standard encoder. The results show that MB mode decisions can be made with very high accuracy: as shown in Figure 5.3, with MB mode decisions made by the proposed approach and the prediction modes determined by the reference software, there is only a very minor difference compared with the reference encoder. Prediction mode decisions are complex and introduce a small loss in PSNR; the maximum PSNR loss suffered has been less than 1 dB. The complexity reduction in terms of encoding time when the chance constrained classifier is applied is shown in Figure 5.4. Figure 5.5 shows the RD performance and time complexity measurements of the H.264/AVC Intra MB coder developed with the machine learning approach for the same sequences. From these figures it is clear that the performance of the two classifiers is almost identical. Figure 5.6 shows the complexity reduction in encoding time when the machine learning classifier is applied.


Figure 5.3: RD Performance with Chance Constrained Classifier for mobile and flower sequences

Figure 5.4: Time Complexity Reduction with Chance Constrained Classifier for mobile and flower sequences


Figure 5.5: RD Performance with Machine Learning Classifier for mobile and flower sequences

Figure 5.6: Time Complexity Reduction with Machine Learning Classifier for mobile and flower sequences

5.4 Summary

In this chapter, we have introduced two stochastic classifiers for video encoding: the Cost-Sensitive classifier and the Chance-Constrained classifier. When we use these


stochastic classifiers in combination with the machine learning classifiers, we improve

RD performance.

We have proposed a novel approach to H.264/AVC Intra MB mode computation based on the chance constrained classifier. The proposed approach has great potential to reduce computational complexity. The results of the implementation in JM 14.2 show that the RD performance is very close to that of the reference encoder, and the two classifiers perform very close to each other. The encoding time with the chance constrained classifier is reduced by about 1/5 compared to the reference encoder [67].

We have also proposed a novel approach to H.264/SVC Intra MB mode computation based on the cost-sensitive classifier. The proposed approach has great potential to reduce computational complexity and explores a new type of classifier. The results of the implementation in the reference encoder JSVM 9.17 show that for the base layer we obtain a speedup of about 50%, while for the enhancement layer a speedup of about 70% is obtained [70].

Both of the proposed approaches have great potential to reduce the computational complexity of the H.264/AVC encoder and of the scalable extension of H.264/AVC. We believe that both approaches are also applicable to Inter mode prediction and are expected to substantially reduce encoding complexity when adopted for it.


6. SVC COMPLEXITY REDUCTION II

6.1 Introduction

In this chapter, we focus on low complexity video encoding, developed for applications such as wireless sensor networks, mobile video devices, and distributed video surveillance systems. These systems are characterized by scarce memory, computation, and energy resources at the video encoder.

Scalable Video Coding (SVC) has not yet produced practical and widely accepted applications in the market. One of the reasons is the significant increase in encoding complexity over single-layer H.264/AVC video, due to the layered nature of SVC. SVC provides three types of scalability: temporal, spatial, and quality. Using quality scalability, additional quality information is transmitted to the user, while temporal scalability allows adapting the frame rate. Both techniques increase complexity only slightly, in contrast to spatial scalability.

Spatial scalability allows different resolutions to be encoded in a single bitstream.

Unlike in a simulcast scenario, where all streams are encoded independently, inter-layer prediction (ILP) is applied in SVC to encode the different spatial layers into a single bitstream.

Using ILP, the lower resolution (base layer) can be used as a predictor for higher resolutions (enhancement layers). Hence, the mode decision (including motion estimation) has to be performed twice for the enhancement layer, once using regular


techniques (as in H.264/AVC) and once with the base layer as a predictor. Therefore,

spatial scalability comes with a high complexity.

To reduce the encoding complexity of the enhancement layer, fast mode decision

models have been proposed. Most of these models are based on limiting the evaluations

of macroblock partition size, or inter-layer residual prediction. While many relevant methods are listed here, many more techniques have been proposed.

In the following sections, we first investigate the correlation between the base and enhancement layers in H.264/SVC and based on the existing correlation, we recommend the guidelines for the configuration of practical applications. Moreover, by using these guidelines, we propose a fast mode decision algorithm.

6.2 SVC Layers Correlation Evaluation

As described earlier, SVC uses layer-based encoding, and there is always a correlation between the base layer and the enhancement layer. The correlation can be classified into two types based on the relation between the base and enhancement layers. In the case of coarse grain quality scalability (CGS), both the base and enhancement layers have the same resolution and the only difference is the QP; hence, the correlation between the base and enhancement layers is very significant. We propose to use this correlation, along with the spatial correlation inherent to any video, in our mode decision. In the case of spatial scalability, the resolutions of the layers differ, and compared with CGS the correlation between the base and enhancement layers is weaker because the resolution of the video in the two layers is different. However, since the enhancement layer is an interpolated version of the base layer, there is bound to be considerable correlation which we can utilize for mode decision in the enhancement layer.

We have observed that the mode distributions of the base layer and its enhancement layers have a certain correlation. In spatial scalability, for each MB at the base layer, the corresponding up-sampled MBs at the enhancement layers tend to have the same mode partition. For coarse grain signal-to-noise ratio (SNR) scalability (CGS), each enhancement-layer MB tends to have a finer mode partition than the corresponding MB at the base layer. In the case of temporal scalability, the mode partition of MBs in the current frame is most similar to the mode partition of MBs in its reference frames. Motivated by these observations, we propose an effective fast mode decision for spatial and CGS scalable video coding. With this proposal, a good mode partition prediction can be achieved by predicting the MB mode at an enhancement layer from that at the base layer. The presented algorithm therefore reduces the number of candidate modes for an MB at the enhancement layers by using the mode distribution at the base layer and, hence, reduces the computational complexity significantly.

We now carefully analyze the correlation between the base and enhancement layers for spatial and SNR scalability in terms of bitrate and PSNR. In this analysis, we determine the effect of the base layer on the quality of the enhancement layer for different values of QP in both the base and enhancement layers.

6.2.1 Spatial Scalability

First we analyze the case of spatial scalability with two layers. In the experiments, we selected five scalable video sequences: {Foreman, Flower, Mobile, City, Crew}. These sequences represent slow, medium, and fast motion with low and high spatial detail. The performance evaluation of the coding efficiency and computational complexity of the base and enhancement layers in the Scalable Video Coding standard has been conducted using the Joint Scalable Video Model (JSVM), the reference SVC software provided by the JVT, version 9.18.

Coding efficiency has been evaluated in terms of Rate-Distortion figures, by encoding each test sequence several times with different quantization parameters (QP).

Distortion has been measured as the average Peak Signal to Noise Ratio (PSNR) of the luminance component of the decoded sequences with respect to the original ones.

As a test bench, we adopted the same sequences commonly used by the JVT experts in their experiments: the considered sequences are Foreman, Flower, Mobile, City, and Crew in QCIF, CIF, and 4CIF formats. Computational complexity has been determined by measuring the overall CPU user time dedicated to the coding process of each sequence in each tested configuration. All complexity measures are given in relative terms so as to be as independent as possible of the underlying hardware platform.

Without doubt, different implementations of the SVC standard could exhibit different

JSVM software or, on the contrary, if more sophisticated methods are employed to increase the performance of inter-layer prediction.


Similarly, much different complexity results can be obtained by optimizing a

particular SVC implementation for a specific hardware platform, for instance by

exploiting MMX instructions on PC machines, or by taking advantage of application

specific coprocessors. Nonetheless, our aim is to give a first and general layer evaluation

of the SVC standard without considering specific issues regarding software

implementations or hardware platforms. Hence, R-D and complexity results should not

be considered conclusive, but rather as a starting point for deeper profile analysis and

possible further algorithmic and platform optimizations. Moreover, the results provided

can be helpful to identify the combination of scalability tools and the rate-distortion- complexity trade-off that best suit the preferred application, which is an extremely

complex problem that can hardly be described analytically.

In our simulation tests we compared the performance of the SVC inter-layer

prediction for QCIF to CIF and CIF to 4CIF (both at 30 Hz) scalability, with respect to

single-layer QCIF, CIF and 4CIF coding and to Simulcasting QCIF+CIF and CIF+4CIF

layers.

Figure 6.1 illustrates the relative computational complexity of the different coding

options, averaged considering all tested sequences with all tested Quantization

Parameters in BL and EL. The histogram shows that if the computation required for

single-layer CIF H.264/AVC encoding is scaled to 1.0, then the coding of two separate

layers in CIF and 4CIF format requires about 5 times the computation of the single-layer

CIF. Instead, if the two spatial layers are jointly coded by using SVC with adaptive inter-

layer prediction, the relative computation required is 9.6, which means that SVC is 1.8


times more complex than Simulcasting. This result is due to the fact that motion estimation, which accounts for most of the encoding computation, is performed twice for each macroblock of the EL: with and without inter-layer residual prediction.

Figure 6.1: Average relative computational complexity

The performance of the SVC extension requires careful analysis. In this section, we report the results of some coding efficiency experiments using the JSVM software that is maintained by the JVT. It is important for the reader to understand that the encoder configuration used here is designed to maximize the quality and fidelity of the lower resolution sequences. Thus, the results in this section indicate system performance when the lower resolution sequence is of primary importance and thus cannot be degraded relative to a single-layer encoding.

Experiments make use of image sequences that are common in the video coding community. Specifically, sequences consist of the Foreman, Flower, Mobile, City, and

Crew test data. The spatial resolution of the first three sequences is 352x288 luma


samples per picture (common intermediate format, or CIF), while the remaining

sequences have a spatial resolution of 704x576 luma samples per picture (4CIF). All image frames used 4:2:0 color sampling.

The image sequences were encoded with a hierarchical B-frame structure and with only one intra-picture coded frame (located at the beginning of the sequence). The interval between P-frames was 16 frames for Crew, 32 frames for Foreman, Flower,

Mobile and 64 frames for City. (The difference was due to the relative motion of the

sequences.)

Figure 6.2: RD Performance for SVC, Simulcast and Single Layer solutions for sequences Flower and Mobile


Figure 6.3: RD Performance for SVC, Simulcast and Single Layer solutions for sequences City and Crew

Table 6-I: COMPARISON OF SVC TO SIMULCAST AND SINGLE-LAYER SCENARIOS

                 Simulcast                       Single Layer
Sequence    ΔBitrate (%)   ΔPSNR (dB)      ΔBitrate (%)   ΔPSNR (dB)
Foreman     -9.8           -0.06           15.2           -0.05
Flower      -11.2           0.02           19.5            0.02
Mobile      -8.5           -0.05           20.2           -0.03
City        -9.3           -0.03           14.2           -0.02
Crew        -15.1           0.00            7.4            0.00

Representative results appear in Figure 6.2 and Figure 6.3, where rate-distortion

plots for the Flower, Mobile, City and Crew sequences are illustrated. Distortion values are reported for the enhancement layer, while rate parameters represent the aggregate rate of the scalable bit-stream. For comparison, Figure 6.2 and Figure 6.3 also contain the


rate-distortion performance for a single layer H.264/AVC encoding as well as a simulcast scenario. The single-layer and simulcast results were generated using the same software implementation and encoding algorithms, though reconfigured to encode a single layer representation.

Visual evaluation of Figure 6.2 and Figure 6.3 provides insight into the system performance. Moreover, delta peak signal-to-noise ratio (PSNR) and bit rate measurements for all sequences are provided in Table 6-I. As can be seen from both the figures and the table, the JSVM encoder outperforms the simulcast solution by an average of 10.8% in terms of bit rate. Compared to single-layer coding, the JSVM encoder performs within 15% of the single-layer codec. Of course, the SVC solution provides a lower resolution layer that can be easily extracted and decoded by legacy decoders.


Figure 6.4: BL is quantized at variable rate while EL is quantized at constant rate. Red area over SL curve shows SVC overhead compared to SL for Mobile Sequence (QCIF-CIF).


Figure 6.5: BL is quantized at constant rate while EL is quantized at variable rate. Red area over SL curves shows SVC overhead compared to SL for Mobile Sequence (QCIF-CIF).


Figure 6.6: BL is quantized at variable rate while EL is quantized at constant rate. Red area over SL curve shows SVC overhead compared to SL for Crew Sequence (CIF-4CIF).


Figure 6.7: BL is quantized at constant rate while EL is quantized at variable rate. Red area over SL curves shows SVC overhead compared to SL for Crew Sequence (CIF-4CIF).


Figure 6.4 - Figure 6.7 show the distribution of bitrates between the base and enhancement layers in scalable bitstreams. For the relationship between the base and enhancement layers, there are two possibilities: either keep the base layer at a constant bitrate or keep the enhancement layer at a constant bitrate. From these graphs we can observe that, besides the effect of BL quantization on the input video signal to the EL encoder, the output of the EL encoder is determined by the EL quantization.

To derive the rate model of an EL, we plot the rate of a dependent layer EL, with respect to the rate of its reference layer BL, in Figure 6.8, where rate pairs [R1(Q1),

R2(Q1,Q2)] are shown. Here R1 represents bitrate of BL, R2 represents bitrate of EL, while

Q1 and Q2 represent the QPs of BL and EL respectively. There are two types of curves

reflecting two different settings of a rate function denoted by quantization steps of BL

and EL (bitrates of BL and EL). The dashed curve on the diagonal plots the EL rate when

EL_QP + 6 = BL_QP. For solid curves, the value of BL_QP varies whereas the value of

EL_QP is fixed. We see that for each fixed EL_QP, increasing bitrate of BL (or

decreasing BL_QP) results in a roughly linear reduction in EL bitrate. However, the EL

rate saturates and does not decrease further beyond the point where EL_QP

= BL_QP - 6. Based on the above observation, we conclude that the rate of a dependent

spatial layer can be approximated at EL_QP = BL_QP-6.



Figure 6.8: Proposed rate modeling results. (a) Flower, QCIF-CIF. (b) Mobile, QCIF-CIF. (c) City, CIF-4CIF. (d) Crew, CIF-4CIF


Figure 6.9: Coding performance of 2-layer spatial scalability for City sequence

Figure 6.10: Coding performance of 2-layer spatial scalability for Crew sequence

Figure 6.11: Coding performance of 2-layer spatial scalability for Flower sequence

Figure 6.12: Coding performance of 2-layer spatial scalability for Mobile sequence


Figure 6.9 - Figure 6.12 show the RD performance of the enhancement layer

compared to the base layer extracted from the SVC bitstream for the selected QPs {12,

18, 24, 30, 36, 42, 48}. We evaluated spatial scalability performance for a 4CIF scalability scenario, considering two spatial layers in CIF and 4CIF formats and for a CIF

scalability scenario, considering two spatial layers in QCIF and CIF formats. Note that

the resolution ratio between the two layers is 2.0 in both dimensions, and that the two

layers have the same temporal resolution (30 Hz), so that interlayer prediction can be

efficiently exploited for each picture of the enhancement layer.

We used five different sequences, coding the two layers using the same

quantization step, i.e. QP_BL=QP_EL={12, 18, 24, 30, 36, 42, 48}, considering that in a

real application they are likely to be transmitted with similar quality.

We obtained experimental results by employing adaptive inter-layer prediction

from BL to EL, because in the spatial scalability case each macroblock of the EL can

correspond to more than one macroblock in the BL, depending on how the macroblock

grids of the two layers are aligned, so that a straightforward method, such as the forced

interlayer prediction, is not applicable.

We also found that spatial scalability can give better results if the BL is less

quantized than the EL, so that a higher quality reference is offered to the inter-layer

prediction method. Therefore we coded all the sequences again, this time with QP_BL=QP_EL+6, and found that in this situation spatial scalability gives much better results compared to H.264/AVC Simulcast, achieving a remarkable average bit-rate reduction of 46.5%, or equivalently a PSNR increase of 1.55 dB.


However, this particular coding configuration may be of limited applicability,

considering that BL has much higher quality than EL. In fact, the average PSNR

difference between the two layers is typically higher than 3 dB, with a peak of 14.96 dB

for the City sequence using QP_BL=12 and QP_EL=48, where Y-PSNR(BL) = 45.94 dB and Y-PSNR(EL) = 30.98 dB. The lowest difference was measured for the sequence Crew using QP_BL=12 and QP_EL=48, where Y-PSNR(BL) = 46.37 dB and Y-PSNR(EL) = 35.51 dB.

Our simulations also exhibit another major trait of spatial scalability as the enhancement layer QP varies from one extreme to the other. The difference between the QPs of the base and enhancement layers, positive or negative, also plays a major role in determining the effectiveness of spatial scalability, as explained later. Figure 6.13 and Figure 6.14 show the R-D performance of the EL for different BL QPs. Simulations were performed for the entire set of QPs {12, 18, 24, 30, 36, 42, 48}, and 2-layer SVC bitstreams were encoded; thus 49 bitstreams were generated for each sequence. Realistic R-D performance of spatial scalability could only be observed for QP > 24 and QP < 40, as indicated in the figures by the ovals drawn over this range of QPs. When the EL QP is too low, the overhead of the EL in terms of bitrate is greater than that of simulcast. Conversely, if the EL QP is too high, there is no bitrate gain for the EL. This phenomenon is illustrated in Table 6-II for the sequences Flower and Crew, which shows the overhead of EL bitrate for differential QPs. As explained earlier, the overhead is comparatively large when the EL QP is too low, while for higher EL QP values the difference becomes negligible and hence no benefit is gained from the scalability process.

From Figure 6.13 and Figure 6.14, we can also conclude the spatial scalability analysis with the remark that, if allowed by the application, it is always a good solution to encode the BL with lower Quantization Parameter than the EL, in order to better exploit inter-layer prediction and achieve improved compression efficiency in comparison to

Simulcasting and single-layer coding.

Table 6-II: COMPARING DELTA QP OF EL FOR GIVEN BL QP

Sequence   BL QP   EL QP pair   ΔBR (EL1-EL2)
Flower     12      12,18        2389.81
                   18,24        1604.7
                   24,30        842.79
                   30,36        277.63
                   36,42        123.03
                   42,48        -84.6
Crew       18      18,24        6136.22
                   24,30        1618.34
                   30,36        325.17
                   36,42        72.70
                   42,48        18.75


Figure 6.13: EL RD performance w.r.t. BL QP for Mobile sequence


Figure 6.14: EL RD performance w.r.t. BL QP for Flower sequence

6.2.2 SNR Scalability

CGS can be considered a special case of spatial scalability where the spatial

layers have a resolution ratio equal to one. CGS employs the same inter-layer prediction

mechanism of Spatial Scalability discussed in Section 6.2.1, with the only difference that

BL data must not be up-sampled, which implies computation savings and R-D

improvements.

We assessed CGS performance for 2-layers using CIF and 4CIF coding, using

fixed differential quantization parameters (DQP) between the two layers, equal to 2 and

6. We used the same simulation environment and test sequences as we used in the

previous section for spatial scalability. We also tested the JSVM adaptive and forced

options for inter-layer prediction (ILP). Figure 6.15 shows the R-D curves obtained for the sequence City in 4CIF format, proving that DQP=6 typically provides better coding

efficiency than DQP=2, except at very low bit-rates. By using DQP=6, 2-layers SNR

scalability achieves only about 0.5 dB loss with respect to single-layer coding, or almost

1.5 dB gain with respect to Simulcast, which is a notable result.

Thanks to the fact that BL data do not need to be up-sampled, the efficiency of the

inter-layer prediction is considerably higher than in the Spatial Scalability case. We can

obtain very good performance by limiting the coding process to the forced inter-layer

prediction, which in addition allows a considerable saving of computational complexity,

as demonstrated by Figure 6.16, because motion estimation is no longer needed in the

EL, as previously explained.


Figure 6.15: 2-layers SNR performance for City sequence in 4CIF format, with differential QP equal to 6 and 2.


Figure 6.16: Average (CIF and 4CIF) relative computational complexity for different coding options: single layer, 2-layer Simulcast and SVC 2-layers SNR coding


In conclusion, we can state that SVC allows obtaining CGS scalability with good

coding gain with respect to Simulcast by using adaptive inter-layer prediction with a delta

inter-layer QP equal to 6. Alternatively, if very low computational complexity is

preferred, SVC still allows good coding gain (with a maximum loss of 0.3 dB with

between 3 and 6 Mb/s) by using forced inter-layer prediction, which permits to encode 2- layers with only about 20% of additional computation with respect of single-layer coding, whereas 2-layer SNR coding with adaptive prediction requires roughly 5.1 times the computation of single-layer coding.

6.3 Mode Decision and Inter-Layer Prediction

A fast mode decision method exploiting neighboring macroblock statistics is proposed in [71]; it reports a 44.81% time saving. A selective inter-layer residual prediction method reduces complexity by 40% [72]. This is achieved by evaluating all modes without inter-layer residual prediction and re-evaluating the rate-distortion (RD) optimal mode with inter-layer residual prediction; finally, the RD-optimal prediction is applied for the mode. Because this technique requires all modes to be evaluated, it could be used to improve existing fast mode decision models.

The previously mentioned methods do not exploit encoded base layer information, such as macroblock types, for the spatial enhancement layer mode decision.

If base layer information is used, it is possible to lower the video encoding complexity as shown in [73], which uses a classification mechanism for the most probable modes, based on base layer information. This results in a complexity reduction of 65%, with a reported bit rate increase of 0.17%. Macroblock modes can be prioritized based on the base layer


macroblock type, as is suggested by [74]. Based on the state (i.e., all-zero block) of the current macroblock and neighboring macroblocks, an early termination strategy is applied to the prioritized list. The reported small complexity reduction of 20.23% for

CGS and 27.47% for dyadic spatial scalability makes this technique less suited as a stand-alone technique. However, in combination with other techniques this could yield a lower complexity. Another prioritizing scheme [75] alters the mode decision, based on the base and enhancement layer neighboring macroblocks, yielding a 30% time saving.

Li’s model [76] limits the enhancement layer mode decision based on co-located base layer modes. An off-line analysis of encoded video streams determines which modes are not likely to be selected. This model shows significant time savings of 60% on average, with small bit rate and PSNR changes.

In the following sections, an analysis of the relationship between the MB type of a macroblock in the enhancement layer and the MB type of the co-located macroblock in the base layer is presented. In this process, the resolutions of the base and enhancement layers, the quantization parameters of the base and enhancement layers, and the MB type of the base layer are the main factors. Quite intuitively, it can be observed that the MB types of the enhancement layers are highly dependent on the spatial content of the MB; therefore, there is a high correlation between the MB type of the enhancement layer and that of the base layer. However, it can also be observed that the difference between the quantization parameters of the base and enhancement layers affects the determination of the MB type in the enhancement layer.


The analysis has been performed using five test sequences (i.e., Foreman, Flower,

Mobile, Crew, and City). These five sequences represent different kinds of motion and

texture, such that the conclusions of the analysis can be extended to any sequences. All

encoded sequences contain two spatial layers: namely a base layer and one spatial

enhancement layer. The first three sequences have a QCIF resolution for the base layer,

while the enhancement layer has a CIF resolution. The last two sequences have a CIF

resolution at the base layer, and a 4CIF resolution for the enhancement layer.

To analyze the impact of the quantizers, for each sequence, a number of streams

were generated with varying QPs of base layer and enhancement layers (QP_BL,

QP_EL). The values for both QP_BL and QP_EL are given by: {12, 18, 24, 30, 36, 42,

48}. For all sequences, each combination of QP_BL and QP_EL is encoded, noted as the

ordered pair (QP_BL, QP_EL), which leads to 49 combinations for one sequence.

Although there are rarely any practical applications for sequences where QP_BL <

QP_EL (i.e., the quality of the enhancement layer is reduced compared to the base layer),

these streams are included in the analysis for completeness of this study and will help to

understand the mechanism behind the enhancement layer mode selection.

For each combination of (QP_BL, QP_EL), 100 frames have been encoded using

the Joint Scalable Video Model (JSVM) reference software version 9.19, with an Intra

period of 32 frames. The sequences have a GOP size of 16 frames. This results in a total

of 49 encoded streams for each test sequence. Each layer has the same temporal

resolution of 30 fps, adaptive ILP is used for enhancement layers. Context-based

Adaptive Binary Arithmetic Coding (CABAC) is used as entropy coding mode.


As Intra-coded frames have already been discussed in the previous chapter, we will not discuss Intra coding in this chapter. Only Inter-coded frames (i.e., P and B pictures) are discussed here.


Figure 6.17: Correlation of MB Type in base and enhancement layer for Foreman(24,18)



Figure 6.18: Correlation of MB type in base and enhancement layer for Crew(24,12)

Figure 6.17 - Figure 6.19 show some of the graphs representing the distribution of MB types with respect to the base and enhancement layers. The sequences Foreman, Flower, Mobile, and City show the same characteristics, so a graph is shown for only one of these sequences to highlight the properties; the mutual trends can be seen throughout the different graphs of these sequences. The sequence Crew has different characteristics, so graphs of Crew are included to show them. Note that these findings do not apply only to the Crew sequence: many other sequences will exhibit the same characteristics, although such sequences were not incorporated in this analysis.

These graphs visualize, in terms of percentages, the occurrence of each MB type pair (MB_Type_BL, MB_Type_EL). Each bar represents the probability, for a random macroblock, that a given enhancement layer MB type is selected given the a priori knowledge of the MB type of the co-located macroblock in the base layer. For each MB type of the base layer, the sum of the row (all points with a constant base layer MB type) is 1 or 0. If the sum is 0, no base layer macroblock is encoded using that MB type. On the other hand, when a base layer MB type is used, the probabilities of all enhancement layer MB types given that base layer MB type sum to 1. The macroblock type is indicated on both axes.


Figure 6.19: Correlation of MB type in base and enhancement layer for Crew(24,24)


Figure 6.20: Correlation of MB type in base and enhancement layer for Flower (24,12)

Figure 6.21: Correlation of MB type in base and enhancement layer for City(24,18)


City (24,36) 80 70 60 50 40

30

20 MODE_INTRA MODE_8x8 10 MODE_8x16 0 MODE_16x8 Probability (%) Probability MODE_16x16 MODE_SKIP BL Type MB

EL MB Type

Figure 6.22: Correlation of MB type in base and enhancement layer for City(24,36)

Encoded streams of the sequences Foreman, Flower, Mobile, and City have similar characteristics despite their different resolutions. This indicates that the resolution of the layers does not influence the probability of the enhancement layer MB type. On the other hand, the graphs for the sequence Crew show a different layout, even though it uses the same resolutions as the City sequence. This can be seen in Figure 6.17 - Figure 6.19, where only the sequence Crew (Figure 6.18 - Figure 6.19) has Intra coded macroblocks in the base layer, whereas for the sequence Foreman (Figure 6.17) none of the base layer macroblocks is Intra coded.

For the Crew sequence, less than 5% of the base layer macroblocks in Inter-coded frames are Intra coded. Other notable findings are observed when the MB type of the base layer is Intra coded. First, in most situations the MB type of the enhancement layer is the same as that of the base layer (Figure 6.18, Figure 6.19). Second, when the quality of the enhancement layer increases considerably, i.e., QP_EL < QP_BL-8, the MB type of the enhancement layer can be Intra coded as well (Figure 6.18). The latter is caused by the increase in the degree of detail in the enhancement layer due to the increase in quality and resolution. This increase results in extra residual data if the enhancement layer MB type is the same as the base layer MB type, and in poor inter prediction; consequently, more macroblocks are Intra coded. This can be seen in Figure 6.18, where the enhancement layer for Intra coded base layer macroblocks has values around 40-60%, while nearly all of the Intra coded base layer macroblocks of the stream corresponding to Figure 6.19 have Intra coded macroblocks in the enhancement layer.

For Inter-coded frames, some similarities in the general layout of the graphs can be seen (this holds for both P and B pictures); differences between the two frame types can also be observed, due to their different natures.

A general observation for all graphs is a diagonal of significant values, which means that there is a substantial probability, for a random macroblock in the enhancement layer, that the MB type of the enhancement layer is the same as the MB type in the base layer. The exact influence of this diagonal on the selection probabilities depends on both QP_BL and QP_EL. In Figure 6.18, it can be seen that the probability that the enhancement layer MB type equals the base layer MB type is significantly lower than in Figure 6.19. Nevertheless, in every situation this diagonal is an important factor.


As a special case of this diagonal property, Intra coded macroblocks have to be considered. In almost all graphs it can be seen that if the MB type of the base layer is Intra coded, then the MB type of the enhancement layer is also Intra coded with a probability of almost 1. This may be less explicit when the quality of the enhancement layer decreases compared to the base layer, but the alternative types do not have a significant probability.

A last general observation applies only to the Crew sequence. It is observed that for Inter-coded macroblocks in the base layer, Intra coded macroblocks in the enhancement layer have a significant probability as long as QP_EL < QP_BL. These findings can be observed in Figure 6.18 and Figure 6.19.

A direct relationship between the number of skipped macroblocks and the QP of the enhancement layer is observed. When QP_EL > 18, about 10% of the macroblocks in the enhancement layer have Mode_SKIP, independent of the MB type of the base layer (illustrated by Figure 6.17 and Figure 6.21). This value increases as QP_EL increases: due to the quality decrease in the enhancement layer, more macroblocks in the enhancement layer are Mode_SKIP coded. As much as 35-40% of all enhancement layer macroblocks are coded Mode_SKIP when QP_EL > 30 and QP_BL < QP_EL (as depicted in Figure 6.22). This phenomenon is explained by the coarser quantization of the enhancement layer, which reduces the details of both the reference picture for the enhancement layer and the current frame of the enhancement layer; consequently, residual data becomes unnecessary. Note that only when QP_EL > QP_BL is the impact of Mode_16x16 reduced in favor of Mode_SKIP, as can be seen in Figure 6.22.


Overall, it is observed that 16x8 partitions in the base layer rarely correspond to 8x16 partitions in the enhancement layers. This is explained by the fact that in the enhancement layer either the same orientation is maintained as in the low-resolution image, more details are included (Mode_8x8), or the texture is smoothed (Mode_16x16). The analogy holds for 8x16 partitions in the base layer. It can be stated that 16x8 and 8x16 partitioned macroblocks can occur in the enhancement layer when the base layer MB type has such a partitioning. This is depicted in Figure 6.20 - Figure 6.22.

It has also been observed that Mode_16x8 and Mode_8x16 are selected less often in the enhancement layer when the co-located base layer macroblock corresponds to Mode_16x16 or Mode_8x8. It seems that macroblocks which do not have a rectangular partitioning in the base layer do not tend to have such a partitioning in the enhancement layer.

6.4 Fast Mode Decision Model

The macroblock type selection process for enhancement layers in the encoder can be optimized using the analysis presented in the previous section. Based on prior knowledge (i.e., the base layer MB type, QP_BL, and QP_EL), only a subset of all available macroblock types has to be tested. This subset corresponds to the macroblock types that have high selection probabilities given the a priori knowledge. As long as the quantization does not change, this subset remains the same throughout the sequence, with the remark that the base layer MB type changes according to the macroblock type actually applied in the base layer.


Table 6-III: MODE DISTRIBUTION IN EL

Sequence   MODE_SKIP (%)  MODE_16x16 (%)  MODE_16x8 (%)  MODE_8x16 (%)  MODE_8x8 (%)  MODE_INTRA (%)
Foreman    36.5           34.1            8.6            9.6            7.6           4
Flower     38.2           15.7            10.1           5.7            28.3          2
Mobile     31.3           37.2            7.5            5.8            15.3          3
City       65.7           10.8            3.8            1.7            5.2           13
Crew       72.4           12.6            1.4            0.6            4.4           9
Average    48.82          22.08           6.28           4.68           12.16         6.2

Table 6-III shows the percentage mode distribution in the enhancement layer (EL) for the given sequences with BL_QP=24 and EL_QP=18. It can be seen from Table 6-III that more than 70% of the best prediction modes after mode decision are large-size (16x16) modes, and only about 12% of MBs choose a small-size (8x8) mode such as Inter 8x8, Inter 8x4, or Inter 4x4. This shows that the majority of best prediction modes after mode decision are large-size modes such as Inter 16x16 and SKIP, which implies that the search for small-size modes is unnecessary in most cases. It is therefore better to have a proper early termination strategy midway through the fast mode decision algorithm for SVC; a proper breaking-out mechanism helps the algorithm find a good tradeoff between mode decision speed and mode decision quality. On the other hand, the EL and BL contain the same video sequence at different qualities or spatial resolutions, and thus the RD costs of co-located MBs have a certain correlation between the layers. Besides, homogeneous regions in video sequences have a strong spatial correlation, and the RD cost of an MB tends to be spatially correlated.

The presented model is content independent. This has the advantage that only one

model has to be defined, which gives reasonable results for all sequences. To


achieve this sequence independence, the properties from the analysis of the Crew sequence are merged with the findings from the analysis of the other sequences.

For Inter-coded pictures, a subset of macroblock types can be chosen based on the following recommendations (a sketch of this selection logic follows the list):

• In every situation, the MB type of the base layer is included in the subset because of the diagonal property.

• If the MB type of the base layer is Intra coded, the subset is reduced to the base layer MB type or Mode_16x16, provided QP_EL < QP_BL; otherwise only the base layer MB type is selected. Note that when the base layer MB type is Mode_16x16, the enhancement layer MB type will always be Mode_16x16.

• If the MB type of the base layer is Inter-coded, the enhancement layer MB type only has to be tested for Mode_16x16 or Mode_8x8. As observed in the analysis, Mode_16x16 is not significant for all sequences; however, to meet the sequence independence requirement, it is always included as one of the possibilities. Even though Mode_8x8 gains influence if QP_EL < 24, it remains significant when this constraint is not fulfilled.

• For Inter-coded base layer macroblocks (which are not Mode_SKIP), the co-located macroblock can be Mode_SKIP when QP_EL > 18.

• For Inter-coded base layer macroblocks, the macroblock types Mode_8x16 and Mode_16x8 are best tested when QP_EL < 24, while their counterpart Mode_16x16 seems to be influential when QP_EL > 24.
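A minimal sketch of these recommendations follows (the mode labels and function below are hypothetical; the real encoder works with JSVM's internal mode codes):

    # Hypothetical mode labels for the sketch
    SKIP, M16x16, M16x8, M8x16, M8x8, INTRA = (
        "SKIP", "16x16", "16x8", "8x16", "8x8", "INTRA")

    def el_candidate_modes(bl_type, qp_bl, qp_el):
        """Candidate EL macroblock types per the recommendations above."""
        if bl_type == INTRA:
            # Intra in BL: also try 16x16 only when the EL quality is higher
            return {INTRA, M16x16} if qp_el < qp_bl else {INTRA}
        subset = {bl_type, M16x16, M8x8}   # diagonal property + defaults
        if qp_el > 18:                     # skipped MBs become likely
            subset.add(SKIP)
        if qp_el < 24:                     # rectangular partitions pay off
            subset.update({M16x8, M8x16})
        return subset

    # Example: BL macroblock coded 16x8 with QP_BL = 24, QP_EL = 18
    print(sorted(el_candidate_modes(M16x8, 24, 18)))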

We measure complexity reduction in terms of execution time for both the reference and the modified encoder on the same machine, which keeps the comparison valid for our experiments. Preliminary test results of the proposed model, implemented in an SVC encoder based on JSVM 9.19, are promising. Table 6-IV shows results for the test sequence Soccer. The bit rate and PSNR of the streams generated by the reference software are shown for each stream. The bit rate increase is given as a percentage, and the difference in PSNR compared to the original streams is shown for both the reference encoder and our proposed model. The time saving reported is for the mode selection step of the enhancement layer; it does not include timings for other calculations, such as base layer encoding, which are the same between the different models. The time saving is given by:

T_s = (T_JSVM − T_proposed) / T_JSVM

An average increase of approximately 1.7% in bit rate and a quality reduction of 0.4 dB have been observed. These measurements are in line with those of other fast mode decision models. On the other hand, while previous models report an average time saving of about 50%, we identify a time saving of 70% for the mode selection step in the enhancement layer.
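
As a quick numeric check of this formula, the lines below (with invented timings, not measurements from our experiments) reproduce a ΔT value in the range reported in Table 6-IV:

    # Minimal sketch of the time-saving metric; the timings are invented.
    def time_saving(t_jsvm: float, t_proposed: float) -> float:
        """T_s = (T_JSVM - T_proposed) / T_JSVM, as a fraction."""
        return (t_jsvm - t_proposed) / t_jsvm

    # e.g. 250 s of EL mode selection in JSVM vs. 80 s in the modified encoder
    print(f"{time_saving(250.0, 80.0):.0%}")  # prints 68%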


Table 6-IV: PROPOSED MODEL VS. REFERENCE ENCODER JSVM 9.19

  QP         JSVM (Reference)         Proposed
BL   EL    BR (kbps)  PSNR (dB)    %BR    ΔPSNR (dB)  ΔT (%)
12   12      9209       45.71      1.04     -0.43       68
12   18      5884       41.16      1.23     -0.33       74
12   24      3729       36.72      1.15     -0.21       67
12   30      2681       33.19      1.12     -0.15       69
12   36      2382       31.16      1.01     -0.09       50
18   12      8333       45.71      1.31     -0.65       68
18   18      5015       41.16      1.39     -0.51       78
18   24      2856       36.75      1.48     -0.35       78
18   30      1803       33.20      1.26     -0.22       67
18   36      1500       31.16      1.09     -0.19       62
24   12      7769       45.72      1.15     -0.71       65
24   18      4449       41.17      1.52     -0.55       77
24   24      2298       36.78      2.09     -0.43       68
24   30      1238       33.21      2.15     -0.25       78
24   36       930       31.14      1.85     -0.12       60
30   12      7547       45.75      1.15     -0.65       69
30   18      4230       41.18      1.24     -0.49       65
30   24      2077       36.80      2.03     -0.31       72
30   30      1018       33.22      2.19     -0.29       80
30   36       707       31.12      1.80     -0.21       72
36   12      7481       45.72      2.76     -0.75       79
36   18      4163       41.19      2.53     -0.56       82
36   24      2014       36.80      2.36     -0.45       75
36   30       953       33.22      2.30     -0.34       79
36   36       643       31.13      2.12     -0.25       72
Average                            1.6528   -0.3796     70.96


[Figure: rate-distortion curves, PSNR (dB) vs. bitrate (kbps), comparing QP_BL=24 (reference) with QP_BL_Proposed=24]

Figure 6.23: RD comparison for proposed technique for Mobile sequence @ QP_BL = 24

[Figure: rate-distortion curves, PSNR (dB) vs. bitrate (kbps), comparing QP_BL=30 (reference) with QP_BL_Proposed=30]

Figure 6.24: RD comparison for proposed technique for City sequence @ QP_BL = 30

[Figure: rate-distortion curves, PSNR (dB) vs. bitrate (kbps), comparing QP_BL=36 (reference) with QP_BL_Proposed=36]

Figure 6.25: RD comparison for proposed technique for Crew sequence @ QP_BL = 36

Figure 6.23, Figure 6.24 and Figure 6.25 show the coding efficiency of the proposed fast mode technique when used in the reference encoder as a replacement for the default behavior. There is a slightly lower RD performance for the Mobile, City and Crew sequences with base layers encoded at QP values {24, 30, 36}. Such a small decrease in RD performance justifies the use of the low-complexity fast mode technique. When even lower complexity is required, these techniques can be combined with other fast mode decision models; however, RD performance will decrease further.


6.5 Summary

The emerging Scalable Video Coding is an extension of the H.264/AVC video coding standard designed to allow temporal, spatial and SNR scalability. After extensive analysis, which considered both Rate-Distortion and computational performance, we can conclude that SVC is based on a flexible and efficient coding framework, achieving remarkable advantages with respect to the conventional Simulcast approach. These characteristics make SVC an appealing solution for many modern digital video applications, particularly whenever it is necessary to deal with different video formats, heterogeneous terminal capabilities and varying network conditions.

The SVC reference software model provided by the Joint Video Team was mainly designed to allow easy integration of new functionalities by the JVT experts and to attain competitive Rate-Distortion efficiency, without much consideration of the computational complexity of the coding process, which is a matter beyond standardization. Nonetheless, depending on the chosen coding options, it can provide computational savings with respect to Simulcasting at moderate RD losses, or achieve much better RD at the price of extra computation.

In this chapter, we presented a fast mode algorithm for inter-frame coding in SVC by leveraging the insights from all the performed experiments. We have collected the most important conclusions for practical applications of H.264/AVC video coding. From the experiments contained in this chapter, a trade-off between video quality and coding complexity has been identified. Therefore, for practical applications, the H.264/SVC encoder needs to be configured following the guidelines provided in this chapter.


7. ADAPTIVE DELIVERY OF SCALABLE VIDEO

In this chapter, based on all the algorithms and techniques proposed in this thesis, a real-time scalable solution is proposed. In particular, our schemes are designed to cope with varying network conditions along with the resource availability on the target platforms and devices. The decision to adapt the bitstream takes into account the available resources and determines the type of scalability to be adopted. We propose a framework for a real-time video streaming environment based on a low-complexity solution.

7.1 Introduction

If a video stream is to be made compatible with a specific viewing device and channel bandwidth, it must be encoded many times with different settings. Each combination of settings must yield a stream that targets the bandwidth of the channel carrying the stream to the consumer and the decoding capability of the viewing device. If the original uncompressed stream is unavailable, the encoded stream must first be decoded and then re-encoded with the new settings. This quickly becomes prohibitively expensive.

In an ideal scenario, the video would be encoded only once with a high-efficiency codec. The resulting stream would, when decoded, yield the full-resolution video. Furthermore, in this ideal scenario, if a lower resolution or bandwidth stream was needed to reach further into the network and target a lower-performance device, a small portion of the encoded stream would be sent without any further processing. This smaller stream would be easier to decode and yield lower-resolution video. In this way, the encoded video stream would be able to adapt itself to the bandwidth of the channel it was required to travel and to the capabilities of the target device. These are exactly the qualities of a scalable video codec.

Resource-constrained devices typically manage complexity by using a subset of the possible coding modes, thereby sacrificing video quality. This quality-complexity relationship is evident in most video codecs used today. Most H.264 encoder implementations on mobile devices do not implement the standard profiles fully due to high complexity. For this reason, most content available today is not developed for mobile devices. When such content is delivered to mobile devices, scaling down to the resolution of the device is the most common solution. When downscaling is employed, mobile devices use techniques such as letterboxing to maintain the aspect ratio. Using the valuable display space on mobile devices for letterboxing is not an effective use of resources; systems adapting content for mobile devices should exploit the available display fully. The classic quality vs. bitrate tradeoff should be balanced. A short calculation below illustrates how much display area letterboxing can waste.
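
The sketch below makes this point concrete for the 1280x720-to-480x320 case discussed later in this chapter (Figure 7.3); the function is hypothetical, written only for this example:

    # Fit a source frame onto a smaller display while preserving aspect
    # ratio, and report the letterbox bars that result.
    def letterbox(src_w, src_h, dst_w, dst_h):
        scale = min(dst_w / src_w, dst_h / src_h)
        out_w, out_h = round(src_w * scale), round(src_h * scale)
        bar = (dst_h - out_h) // 2   # black rows above and below the video
        return out_w, out_h, bar

    print(letterbox(1280, 720, 480, 320))  # -> (480, 270, 25)
    # 50 of the 320 display rows (about 16%) are spent on black bars.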

Proper rate control can significantly improve performance by reducing time-out effects and packet loss, thus enhancing video quality and guaranteeing quality of service (QoS) [77]. The rate control algorithms that appear in the literature primarily focus on achieving a target bitrate and do not consider the content of the video or human perception. Similar work has been done in model-based coding, where content modeling drives bit allocation [78].

There exist technologies for video streaming in networks with fluctuating bandwidth that define regions of interest in a video. MPEG-4 selective enhancement [79] is used in the enhancement layer of MPEG-4 FGS (Fine Grained Scalability) in order to stream higher-quality video within selected image regions. However, MPEG-4 selective enhancement does not provide quality improvement for the base layer. FGS-MR video encoding [80] uses MR (Multi-Resolution) frames based on MR masks to improve rate-distortion performance. However, this process introduces additional complexity into the encoding, since the Canny edge detection required for FGS-MR is resource-consuming.

SVC provides an ideal solution to these video streaming problems, since it addresses the need to make video coding more flexible for use in highly heterogeneous and time-varying environments. We propose a system model based on SVC and state the system components and processes defined while solving the problem. The next section describes the components and processes of the proposed system model in detail.

7.2 System Architecture

The purpose of SVC is to extend the capabilities of the H.264/AVC design to address the needs of applications to make video coding more flexible for use in highly heterogeneous and time-varying environments. Instead of multiple encodings of each video source to provide an optimized bit stream to each client, scalable coding provides a single bit stream whose syntax enables flexible and low-complexity extraction of the information so as to match the requirements of different devices and networks. Such an application scenario is shown in Figure 7.1.

Though SVC allows simple and flexible truncation of a coded bitstream, efficient solutions to support adaptation in different scenarios are still an open research issue. For example, given a bitrate constraint, one needs to know how to decide the best adaptation tradeoff among the three scalability dimensions, or how to model and associate QoS data with a bitstream. We discussed some of these issues in Chapter 5 and Chapter 6 and developed complexity reduction techniques to address them. The proposed solution was implemented in the form of decision trees in the H.264/AVC reference software JM and the H.264/SVC reference software JSVM. Attribute calculation (used in the decision trees) takes additional computation time, but our algorithm bypasses the standard H.264 mode decision algorithm, which in turn saves considerable encoding time.

SVC provides three major choices for possible solutions in the video streaming environment. When network conditions or user demand vary, the provider has three options: 1) change the quality of the video, i.e., change the bitrate; 2) change the format of the video, i.e., change the spatial dimension of the video; or 3) change both the quality and the format of the video. The last option could be the result of variation in the number of end-users. However, these choices, when applied mechanically, can prove inefficient. The availability of network bandwidth should not automatically result in transmitting a higher bitrate, consuming more bandwidth, if it does not perceptually improve the video quality.


Therefore, we propose the use of ROI (Region of Interest) and saliency in order to use the network bandwidth efficiently.

Figure 7.1: Example of H.264/SVC video streaming with heterogeneous receiving devices and variable network conditions.

SVC, when used keeping in mind the users’ quality of experience (QoE), provides better perceptual and subjective performance compared to traditional complexity reduction techniques. As far as scalable video coding (SVC) is concerned, video adaptation includes format transcoding, bitstream extraction, bitstream replacement, etc. Here, bitstream extraction is taken as an example to illustrate our SVC-based video adaptation framework. Bitstream extraction is regarded as adaptation of the coded stream at the signal level, and the resulting bitstream is produced according to various configuration parameters to fit different bitrates. Under the video adaptation framework shown in Figure 7.2, the video content information is combined with the coding structure information. The compound description is used to steer the adaptation process at the signal level. Because video content analysis is added into the adaptation, the extraction results fit visual needs. The process includes the following steps (a selection sketch follows the list):

[Figure: block diagram — the source video feeds both video content analysis (SVC IROI/saliency detection, producing a video content description) and the low-complexity SVC encoder (producing the bitstream and a bitstream description); video streaming parameters drive the adaptation that outputs the adapted bitstream]

Figure 7.2: SVC based video streaming framework

1. The source video is preprocessed before it is input to the encoder. Visually sensitive regions are located or ROIs are defined to generate the video content description;

2. The source video is input to the encoder and encoded into a scalable video coding bitstream by employing the low-complexity video encoding techniques;

3. Video streaming environment parameters (available bandwidth, number of users, end-user device capabilities, etc.) are used to generate the bitstream;

4. The video content description and the bitstream structure description are combined according to the application demands, and then a new bitstream is generated. A parser included in the video encoder is used to generate the adapted bitstreams.
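
As referenced above, the following sketch illustrates one way the selection in step 4 could be steered by the two descriptions. It is a simplified stand-in: the data layout and scoring are assumptions for this example, and a real implementation would operate on SVC NAL units through the bitstream description.

    # Hypothetical selection among valid SVC extraction points, steered by
    # the content description (ROI quality) and the streaming parameters.
    from typing import List, NamedTuple, Optional

    class ExtractionPoint(NamedTuple):
        bitrate_kbps: float    # from the bitstream description
        roi_quality: float     # from the video content description
        overall_quality: float

    def select_point(points: List[ExtractionPoint],
                     bandwidth_kbps: float) -> Optional[ExtractionPoint]:
        """Pick the feasible extraction point with the best ROI quality,
        breaking ties on overall quality."""
        feasible = [p for p in points if p.bitrate_kbps <= bandwidth_kbps]
        if not feasible:
            return None  # even the base layer does not fit the channel
        return max(feasible, key=lambda p: (p.roi_quality, p.overall_quality))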

The proposed video adaptation framework has a distinct feature: owing to the close relation between content description and user’s visual needs, the adapted result can provide optimal overall utility while meeting diverse resource constraints and user preferences.

Figure 7.3 compares the perceptual quality of spatial scalability and IROI-based cropping. Cropping produces better perceptual quality because, unlike down-sampling, it preserves the fine details of the selected area of the image. Down-sampled HD images targeted for mobile devices do not produce good perceptual quality on small display screens, and the detail of the whole frame is obscured. On the other hand, while cropping may produce good results, the selection of the ROI is the major problem in this case. Moreover, the overhead of using FMO is significant, as are the sizes of the XML descriptions for ROIs. These shortcomings make this framework more complex for mobile devices; however, it may be better suited for surveillance systems.


[Figure panels: HD Image [1280x720]; Spatially Scalable [480x320]; Cropped [480x320]]

Figure 7.3: Comparison of SVC (spatial scalability) and cropping from HD format to 480x320.

7.3 System Model

We propose a framework where video content is efficiently delivered to consumers having diverse communication environments. The proposed framework uses the low-complexity encoding techniques and the ROI/saliency-based adaptation described in this chapter.


7.3.1 Problem Definition

Nowadays, applications of video content delivery are very popular thanks to advances in hardware and software technologies. Usually, a piece of content, e.g., a video clip on a video sharing web site, is consumed by many users. This brings an important research issue: how to efficiently deliver video content to consumers having diverse communication environments. Network resources (e.g., bandwidth) available for different users may be quite different and time-varying. In addition, the characteristics of users’ terminals may vary significantly in terms of display resolution, processing power, etc. Therefore, the same content needs to be delivered at the same time in different formats according to these variables. Figure 7.1 illustrates this situation.

Having different scalability dimensions, video transmission using scalable video coding schemes requires an adaptive strategy to determine which scalability options should be used for given resource constraints. When a bit rate limitation is given, spatial and SNR scalability have a trade-off relationship: increasing the spatial resolution can be done only at the cost of decreased frame quality. The ultimate goal of such a strategy is to maximize end-users’ quality of experience for the delivered content. Therefore, it is important to understand the impact of different scalability options and their combinations on human observers’ quality perception through subjective quality assessment.
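
A minimal sketch of such a strategy is shown below. The QoE estimate is a deliberately crude placeholder for a model that would, in practice, be fitted to subjective quality-assessment data; all names are illustrative.

    # Sketch: pick the spatial/SNR operating point that maximizes an
    # estimated QoE under a bit-rate constraint.
    def estimated_qoe(width: int, height: int, qp: int) -> float:
        # Toy stand-in: QoE grows with resolution, shrinks as QP coarsens.
        return (width * height) ** 0.5 / (1 + qp)

    def choose_operating_point(options, max_bitrate_kbps):
        """options: dicts with 'width', 'height', 'qp', 'bitrate_kbps'."""
        feasible = [o for o in options if o['bitrate_kbps'] <= max_bitrate_kbps]
        return max(feasible,
                   key=lambda o: estimated_qoe(o['width'], o['height'], o['qp']),
                   default=None)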

7.3.2 Region Based Solution

The intrinsic limitation of current video streaming applications is their unawareness of the content type. Current video streaming applications depend upon bitrate, resolution, frame rate, bandwidth, etc. While this set of parameters is very useful for “blind” streaming of video content, it does not necessarily provide perceptually and subjectively better video. In this case the video streaming application is simply using a greedy approach to maximize the usage of network bandwidth. This approach uses H.264/AVC and SVC bitstreams in the form of layers. If we can determine ROIs and salient regions in the video content, then it is possible to enhance the quality of those regions and thus improve the quality perception of the delivered video. This is where ROIs and saliency play a very important part in video streaming applications aiming at better perceptual quality.

7.3.3 System Implementation

We discuss the delivery of video content in a streaming environment with better subjective quality in the context of a system consisting of client (end-user), server (encoding, adaptation) and network components, as shown in Figure 7.4. The server performs low-complexity video encoding and network-aware adaptation, and then delivers the content over the network. Network considerations include bandwidth, the number of users and the required content types, consisting of different resolutions and qualities. Clients consist of receiver devices having a variety of capabilities in terms of display, processing power, etc. Figure 7.5 shows the flowchart of the bitstream extraction and selection algorithm. The choice of algorithm depends on the application, resource availability, and constraints.


Figure 7.4: Scalable ROI/Saliency based System


Figure 7.5: Process of bitstream extraction and selection

7.4 Summary

In this chapter, we proposed a content-based low-complexity scalable video stream adaptation framework. Under this framework, video adaptation using visually sensitive regions and Interactive ROI (IROI) is implemented. The video adaptation scheme extracts sub-streams at non-equal steps while remaining compatible with the scalable video coding format. Moreover, a broad range of bit rates can be obtained. A saliency-based framework is proposed that can determine the ROI with different methods and then decide to adapt the bitstream, choosing a smaller area from the high-resolution source video for a low-resolution target video. Experimental results show that the method can effectively maintain the integrity of the video content. A detailed analysis of the results can be found in [81].

It is worthwhile to note that the proposed framework has several other closely related open issues, such as the accuracy of the extracted visually sensitive regions and whether it is consistent with subjective judgments. Such cross-disciplinary exploration is critical to innovation and advancement of video adaptation technologies for next generation pervasive media applications.


8. CONCLUSIONS AND FUTURE WORK

8.1 Conclusion

As we have discussed in Chapter 1, transmission of digital video signals over current data networks demands efficient, reliable, and adaptable video coding techniques due to the heterogeneous nature of current wired and wireless networks. In this dissertation, we have mainly addressed the scalable video coding structure, in particular spatial and SNR scalability, and low-complexity video encoding techniques.

Recent years have witnessed a rapid growth of research and development to provide users with video communication through wired and wireless media. Although the video coding standards exhibit acceptable quality-compression performance in many visual communication applications, further improvements are desired and more features need to be added, especially for some specific applications. The important considerations for video coding schemes to be used within future networks can therefore be summarized as:

• Compression efficiency

• Adaptability to different available bandwidths

• Adaptability to memory and computational power for different clients

Several other communication and networking issues are also relevant, such as scalability, robustness, and interactivity. The choice of a Scalable Video Coding framework in this context brings technical and financial advantages. Under this framework, network elements can adapt the video streams to the channel conditions and transport the adapted video streams to receivers with acceptable perceptual quality. The advantages of deploying such an adaptive framework are that it can achieve suitable QoS for video over wired and wireless networks, bandwidth efficiency and fairness in sharing resources.

This work has been undertaken to create a framework suitable for a video streaming environment. To be practical for efficient video transmission through a streaming application, the resulting video bitstream must be adaptive to the myriad network and end-user conditions present in a video streaming scenario. Important factors include low-power, low-complexity end-user devices, a varying number of end-users, and the available network bandwidth. The proposed framework must be invariant to these conditions or adapt its behavior to accommodate them.

Through a thorough research effort, this work culminated in Chapter 7 with the creation of an SVC-based framework. The proposed framework is shown to reduce the computational complexity, evaluated using both subjective and objective techniques.

Much preliminary research and development was required to arrive at the proposed SVC-based framework. To begin, Chapters 1, 2 and 3 presented an introduction to the problem, an overview of video coding, and a survey of related work. This material is largely a summary of existing literature, to which we have also contributed [64].

Chapters 4-7 contain original work leading to the design of the SVC-based framework. First we designed low-complexity video encoding techniques based on machine learning algorithms [53]. These low-complexity video encoding techniques form the core of the proposed framework.

Chapter 4 contains a machine learning based approach that reduces the Intra coding complexity in the H.264/AVC encoder. We have also explained the reasons for using machine learning in video encoding and its effectiveness. We have proposed a novel macroblock partition mode decision algorithm for Intra prediction in H.264/AVC. The proposed algorithm uses data mining techniques to exploit the correlation between H.264/AVC MB statistics and H.264 MB coding modes. The WEKA tool was used to develop a decision tree for the H.264/AVC coding mode decision. The proposed algorithm has very low complexity, as it only requires the calculation of MB statistics such as means, variances and differences of border pixels. Our results show that the proposed algorithm is able to maintain good picture quality while reducing the computational complexity by 65% on average. The reduction in computational cost has negligible impact on the quality and bit-rate of the encoded video, and the algorithm maintains its performance across all resolutions. The proposed approach is novel, and the basic idea can also be used to refine other techniques in video encoding [60] [61].

In Chapter 5, we introduced two stochastic classifiers for video encoding, the Cost-Sensitive classifier and the Chance-Constrained classifier. When we use these stochastic classifiers in combination with the machine learning classifiers, we improve RD performance. The encoding time for the chance-constrained classifier is reduced by about 1/5 compared to the reference encoder [67]. For the Cost-Sensitive classifier, the results of the implementation in the reference encoder JSVM 9.17 show that for the base layer we obtain a speedup of about 50%, while for the enhancement layer a speedup of 70% is obtained [70].

Chapter 6 identifies the relationship between base and enhancement layer mode decision process. In this chapter, we have presented a fast mode algorithm for inter-frame coding in SVC by leveraging the insights of all the performed experiments. We have collected the most important conclusions for practical applications of H.264/AVC video coding. From the experiments contained in this chapter, a trade-off between video quality and coding complexity has been identified. Therefore, for practical applications, the configuration of the H.264/SVC video coding needs to be adjusted, following the guidelines provided in this chapter. With the help of a fast mode decision model defined in this chapter, we identify a time saving of 70% for the mode selection step in the enhancement layer.

Finally, Chapter 7 describes the design of the SVC-based low-complexity scalable video stream adaptation framework built on the low-complexity video encoding techniques designed in the previous chapters. Under this framework, video adaptation using visually sensitive regions and Interactive ROI (IROI) is implemented. SVC provides the flexibility of choosing a spatially or SNR-based adaptive bitstream.

A reduction in the computational complexity of video encoders affects decoded video quality. The proposed algorithms, and subsequently the framework, achieve minimal quality distortion under a fixed encoder complexity constraint and provide controllable trade-offs between complexity and video quality. Computation resources are efficiently allocated among different frames and components of a video sequence.

A reduction in the computational complexity of video encoders also affects their power consumption. Since the modified reduced-complexity encoders take much less time to encode the same video content than the reference encoders, the energy required for encoding is reduced accordingly.

The techniques described in this thesis fulfill the aim and objectives of this research work. They are practical and can be applied to video coding applications where encoding time is an important factor, such as real-time multimedia systems, to control computational complexity whilst maintaining acceptable video quality. These algorithms can also benefit power-constrained systems, for example mobile video phones, by reducing the power consumption for encoding video, hence potentially achieving longer battery life.

8.2 Future Work

• We will analyze the MC residual for complexity reduction, generalizing the approach based on the different regions recognized in the MC residual mean and variance images so that it applies to every standard operating on residual frames. Exploiting this idea could allow more standards to be added to the study and may help in analyzing the relationship between the residual and the MB mode selected by a video coding standard.

• We will explore more classifiers while using machine learning algorithms.


• Power analysis is another part of complexity management on which we intend to focus in the future.

• Application of scalability to the ROI is another interesting extension of our proposed framework. While existing methods use a fixed size for enhancement layers, scalable ROI will utilize only the ROI as an enhancement layer so that it can improve subjective QoS with low transmission overhead.

• High Efficiency Video Coding (HEVC) is a proposed video compression standard, a successor to H.264/AVC, currently under joint development by ISO/IEC (MPEG) and ITU-T (VCEG). It has sometimes been referred to as "H.265", since it is considered the successor of H.264, although this name is not commonly used within the standardization project. Complexity is an important concern for HEVC and for its scalable extension. We intend to find new techniques and extend our existing techniques for HEVC.


BIBLIOGRAPHY

[1] C. E. Shannon, "A mathematical theory of communication," Bell Sys. Tech. Journal,

pp. 27:379–423, 623–656, Oct 1948.

[2] ITU-T, "Video Codec for Audiovisual Services at px64 kbit/s," in ITU-T Recommendation H.261, Version 1: Nov. 1990, Version 2: Mar. 1993.

[3] ISO/IEC JTC 1, "Coding of Moving Pictures and associated Audio for Digital

Storage Media at up to about 1.5 Mbit/s -- Part 2: Video," in ISO/IEC 11 172-2

(MPEG-1 Video), Mar. 1993.

[4] ITU-T and ISO/IEC JTC 1, "Generic coding of moving pictures and associated audio information - Part 2: Video," in ITU Recommendation H.262 and ISO/IEC 13818-2 (MPEG-2 Video), Nov. 2004.

[5] ITU-T, "Video coding for low bit rate communication," in ITU-T recommendation

H.263, Version 1: Nov. 1995, Version 2: Jan. 1998, Ver. 3: Nov. 2000.

[6] "Advanced Video Coding (AVC)--3rd Edition," in ser. ITU-T Rec. H.264 and

ISO/IEC 14496-10 (MPEG-4 Part 10), JVT, 2004.


[7] Gary J. Sullivan, Thomas Wiegand, and Heiko Schwarz, "ITU-T Rec. H.264 |

ISO/IEC 14496-10 Advanced Video Coding Defect Report (JVT-Z210)," in JVT

Meeting (Joint Video Team of ISO/IEC MPEG & ITU-T VCEG), Antalya, Turkey,

Jan 13-18, 2008.

[8] System Description Blu-ray Disc Rewritable, Part 1: Basic Format Specifications Ver. 2.11, Mar. 2006. Including System Description Blu-ray Disc Hybrid Format Ver. 1.01 (Part 1), Dec. 2005.

[9] A. G. Gersho and R. M. Gray, "Vector Quantization and Signal Compression," in

Kluwer Academic Publishers, Boston, MA, 1992.

[10] R. J. Clarke, Transform Coding of Images. New York: Academic Press, 1985.

[11] W. H. Chen, C. H. Smith , and S. C. Fralick , "A fast computational algorithm for

the discrete cosine transform," in IEEE Trans. Commu. COM-25., 1977, pp. 1004–

1009.

[12] Iain E.G. Richardson, H.264 and MPEG-4 Video Compression: Video Coding for

Next-generation Multimedia. England: Wiley, 2003.

[13] S. P. Lloyd, "Least squares quantization in PCM," in IEEE Trans. Info. Theory,

March 1982, pp. 129–137.

[14] Lina J. Karam, "Lossless Coding," in Handbook of Image and Video Processing.

San Diego, CA, USA: Academic Press, 2000, ch. 5.1, pp. 461–474.


[15] T. C. Bell, J. G. Cleary, and I. H. Witten, Text Compression, Prentice Hall, Ed.

Englewood Cliffs, NJ, USA: Prentice Hall Advanced Reference Series: Computer

Science., 1990.

[16] Janusz Konrad, Handbook of Image and Video Processing, Motion Detection and

Estimation chapter 3-10, Ed. San Diego, CA, USA: Academic Press, 2000.

[17] Jaswant R. Jain and Anil K. Jain, "Displacement measurement and its application in

interframe coding," in IEEE Trans. on Commu, 1981, pp. (12):1799–1808.

[18] T. Koga, "Motion compensated interframe coding for video conferencing," in Proceedings of the National Telecommunications Conference, New Orleans, Louisiana, Nov. 1981, pp. G5.3.1–G5.3.5.

[19] A. Tekalp, "Digital Video Processing," in Prentice-Hall, Englewood Cliff, NJ, 1995.

[20] C. J. Branden Lambrecht and O. Verscheure, "Perceptual Quality Measure using a

Spatio-Temporal Model of the Human Visual System," in Proc. SPIE, vol. 2668,

March 1996, pp. 450-461.

[21] Margaret H Pinson and Stephen Wolf, "A New Standardized Method for Objectively

Measuring Video Quality," in IEEE Transactions on Broadcasting, vol. 50, Sep.

2004, pp. 312-322.

[22] Z. Wang, L. Lu , and A. C. Bovic, "Video quality assessment using structural

distortion measurement," in Signal Processing: Image Communication, special issue

on “Objective video quality metrics”, vol. 19, Feb. 2004, pp. 121-132.


[23] Barry G. Haskell, Atul Puri, and Arun N. Netravali, Digital video: an introduction to

MPEG-2, Digital Multimedia Standards Series ed. USA: Chapman & Hall, 1997.

[24] A. Hallapuro, M. Karczewicz, and H. Malvar, "Low Complexity Transform and

Quantization – Part I: Basic Implementation," Geneva, JVT document JVT-B038,

February 2002.

[25] ISO/IEC JTC1/SC29/WG11, "Applications and requirements for scalable video

coding," Technical Report N5540 International Organisation for Standardisation,

March 2003.

[26] ISO/IEC JTC1/SC29/WG1, "Applications and requirements for scalable video

coding," Technical Report N6880 International Organisation for Standardisation,

January 2005.

[27] Jens-Rainer Ohm, "Advances in scalable video coding," in Proceedings of the IEEE, January 2005, pp. 93(1):42–56.

[28] Heiko Schwarz, Detlev Marpe, and Thomas Wiegand, "Combined scalability support for the scalable extension of H.264/AVC," in IEEE International Conference on Multimedia and Expo (ICME 05), July 2005.

[29] Heiko Schwarz, Detlev Marpe, and Thomas Wiegand, "Analysis of hierarchical B

pictures and MCTF," in IEEE International Conference on Multimedia and Expo

(ICME 06), July 2006.


[30] Detlev Marpe, Heiko Schwarz, Tobias Hinz, and Thomas Wiegand, "Constrained

inter-layer prediction for single-loop decoding in spatial scalability," in IEEE

International Conference on Image Processing (ICIP 05), Sep. 2005.

[31] Heiko Schwarz, Detlev Marpe, and Thomas Wiegand, "SNR-scalable extension of H.264/AVC," in IEEE International Conference on Image Processing (ICIP 04), Oct. 2004.

[32] Intel. (2010, January) [Online].

http://download.intel.com/design/intarch/designgd/323064.pdf

[33] Chih-Wei Chiou, Chia-Ming Tsai, and Chia-Wen Lin, "Fast mode decision

Algorithms for Adaptive GOP structure in Scalable Extension of H.264/AVC," in

Proc. of ISCAS, May 2007, pp. 3459-3462.

[34] S. Lim, J. Yang, and B. Jeon, "Fast coding mode decision for Scalable Video

Coding," in ICACT, Feb. 2008.

[35] H. Li, Z. G. Li, and C. Wen, "Fast mode decision for Spatial Scalable Video

Coding," in Proc. Picture Coding Symp, Beijing, China, Apr. 2006.

[36] Bumshik Lee, Munchurl Kim, Sangjin Hahm, Changseob Park, and Keunsoo Park,

"A Fast mode selection scheme in Interlayer Prediction of H.264 Scalable Extension

Coding," in IEEE International Symposium on Broadband Multimedia Systems and

Broadcasting , ISBMSB, 2008, pp. 1-5.


[37] Yun-Da Wu and Chih-Wei Tang, "The Motion Attention Directed Fast Mode Decision for Spatial and CGS Scalable Video Coding," in IEEE International Symposium on Broadband Multimedia Systems and Broadcasting, ISBMSB, 2008, pp. 1-4.

[38] H. C. Lin, W. H. Peng, H. M. Hang, and W. J. Ho, "Layer Adaptive Mode decision

and Motion Search for Scalable Video Coding with Combined CGS and Temporal

scalability," in in Proc. IEEE Int. Conf. Image Process., vol. 2, Sep. 2007, pp. 289–

292.

[39] J. Kim, K. Jeon, and J. Jeong, "H.264 Intra Mode Decision for Reducing Complexity

Using Directional Masks and Neighboring Modes," in LNCS 4319, 2006, pp. 959-

968.

[40] J. Xin and A. Vetro, "Fast Mode Decision for Intra-only H.264/AVC coding," in

Picture Coding Symposium (PCS), PCS 2006, Apr. 2006.

[41] Y. W. Huang, B. Y. Hsieh, T. C. Chen, and L. G. Chen, "Analysis, fast algorithm,

and VLSI architecture design for H.264/AVC intra-frame coder," in IEEE Trans.

Circuits Syst. Video Technol, vol. 15, March 2005, pp. 378–401.

[42] F. Pan et al., "Fast mode decision algorithm for intra-prediction in H.264/AVC video coding," in IEEE Trans. Circuits Syst. Video Technol, vol. 15, July 2005, pp. 813–822.


[43] J. C. Wang, J. F. Wang, J. F. Yang, and J. T. Chen, "A fast mode decision algorithm

and its VLSI design for H.264/AVC intra-prediction," in IEEE Trans. Circuits Syst.

Video Technol., vol. 17, Oct. 2007, pp. 1414–1422.

[44] I. Choi, J. Lee, and B. Jeon, "Fast coding mode selection with rate-distortion

optimization for MPEG-4 part-10 AVC/H.264," in IEEE Trans. Circuits Syst. Video

Technol, vol. 16, Dec. 2006, pp. 1557–1561.

[45] Changsung Kim and C.-C. Jay Kuo, "Feature-Based intra/intercoding mode

selection for H.264/AVC," in IEEE Trans. Circuits Syst. Video Technol, vol. 17,

Apr. 2007, pp. 441–453.

[46] An-Chao Tsai, Jhing-Fa Wang , Jar-Ferr Yang , and Wei Guang Lin , "Effective

Subblock-Based and Pixel-Based Fast Direction Detections for H.264 Intra

Prediction," in IEEE Trans. Circuits Syst. Video Technol, vol. 18, July 2008, pp.

975–982.

[47] "Report of the formal verification tests on AVC (ISO/IEC 14 496-10 - ITU-T Rec.

H.264)," MPEG2003/N6231, Dec. 2003.

[48] ISO/IEC IS 13818, "Information Technology-Generic coding of moving pictures

and associated audio information, Part 2: Video," ISO/IEC JTC1/SC29/WG11 2004.

[49] Gary J. Sullivan and Thomas Wiegand, "Rate Distortion Optimization for Video

Compression," in IEEE Signal Processing Magazine, Nov. 1998, pp. 74-90.


[50] Thomas Wiegand, Gary Sullivan , G. Bjontegaard , and A. Luthra , "Overview of the

H.264/AVC video coding standard," in IEEE Transactions on Circuits Systems and

Video Technology, vol. 13, July 2003, pp. 560-576.

[51] U. Fayyad, G. Piatetsky-Shapiro , and P. Smyth , "From data mining to knowledge

discovery in databases," in AI Mag, vol. 17, 1996, pp. 37–54.

[52] Quinlan J. R., C4.5: Programs for Machine Learning. San Francisco, CA: Morgan

Kaufman Publishers Inc, 1993.

[53] R. Jillani et al., "Video Encoding and Transcoding Using Machine Learning," in 9th

Intl. Workshop on Multimedia Data Mining: associated with the ACM SIGKDD

2008 (MDM ‘08), Aug. 2008.

[54] Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and

Techniques, 2nd ed., Morgan Kaufmann, Ed. San Francisco, 2005.

[55] Valery A. Petrushin and Latifur Khan, Multimedia Data Mining and Knowledge Discovery, 1st ed. New York: Springer, 2007.

[56] Gerardo Fernández-Escribano, Hari Kalva , Pedro Cuenca , and Luis Orozco-

Barbosa, "A Very Low Complexity MPEG-2 to H.264 Transcoding Using Machine

Learning," in Proceedings of the ACM Multimedia, Santa Barbara (California),

USA, Oct. 2006.


[57] G. Fernández-Escribano, H. Kalva , P. Cuenca , and L. Orozco-Barbosa,

"RDOptimization for MPEG-2 To H.264 Transcoding," in Proceedings of the IEEE

International Conference on Multimedia & Expo, ICME'06, Toronto, Canada, July

9-12 2006.

[58] JVT H.264/AVC reference software JM14.2. http://iphome.hhi.de/suehring/.

[59] An-Chao Tsai, Jhing-Fa Wang , Jar-Ferr Yang , and Wei Guang Lin , "Effective

Subblock-Based and Pixel-Based Fast Direction Detections for H.264 Intra

Prediction," in IEEE Trans. Circuits Syst. Video Technol, vol. 18, Jul. 2008, pp.

975–982.

[60] H. Kalva, P. Kunzelmann, R. Jillani, and A. Pandya, "Low Complexity H.264 Intra

MB Coding," in Proceedings of the IEEE International Conference on Consumer

Electronics, Las Vegas, USA, January 9-13, 2008.

[61] R. Jillani and H. Kalva, "Low Complexity Intra MB Encoding in H.264/AVC,"

Consumer Electronics, IEEE Transactions on, vol. 55, no. 5, pp. 277-285, February

2009.

[62] Gerardo Fernández, Pedro Cuenca, Luis Orozco Barbosa, and Hari Kalva, "Very Low Complexity MPEG-2 to H.264 Transcoding Using Machine Learning," in Proceedings of the 14th annual ACM international conference on Multimedia (MULTIMEDIA '06), New York, 2006.


[63] Hari Kalva and Lakis Christodoulou, "Using Machine Learning for Fast Intra MB

Coding in H.264," in Visual Communications and Image Processing, vol. 6508,

2007.

[64] R. Jillani and H. Kalva, "Scalable Video Coding Standard," in Encyclopedia of

Multimedia. US: Springer, 2008, pp. 775-781.

[65] Thorsten Joachims, "A support vector method for multivariate performance

measures," in in ICML ’05:Proceedings of the 22nd international conference on

Machine learning, New York, 2005, pp. 377–384.

[66] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun, "Support vector machine

learning for interdependent and structured output spaces," in ICML ’04, New York,

NY, USA, 2004.

[67] R. Jillani, U. Joshi, C. Bhattacharyya, H. Kalva, and K. R. Ramakrishnan, "Video

Coding Mode Decision As A Classification Problem," in SPIE/IS&T Visual

Information Processing and Communication, San Diego, USA, January 17-21 2010.

[68] J. S. Nath, C. Bhattacharyya, and M. N. Murty, "Clustering based large margin

classification: a scalable approach using socp formulation," in KDD ’06:

Proceedings of the 12th ACM SIGKDD international conference on Knowledge

discovery and data mining, New York, NY, USA, 2006, pp. 674–679.


[69] J. S. Nath, C. Bhattacharyya, and M. N. Murty, "Clustering based large margin classification: a scalable approach using socp formulation," in KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, New York, USA, 2006, pp. 674–679.

[70] U. Joshi, R. Jillani, C. Bhattacharyya, H. Kalva, and K. R. Ramakrishnan, "Speedup

Macroblock Mode Decision in H.264/SVC Encoding Using Cost-Sensitive

Learning," in IEEE International Conference on Consumer Electronics, Las Vegas,

USA, January 11-13, 2010.

[71] G. Goh, J. Kang, M. Cho, and C.-S. Cho, "Fast mode decision for scalable video

coding based on neighboring macroblock analysis," in Proceedings of the 2009

ACM symposium on Applied Computing, March 2009, pp. 1845-1846.

[72] C.-S. Park, S.-J. Baek, M.-S. Yoon, H.-K. Kim, and S.-J. Ko, "Selective inter-layer

residual prediction for SVC-based video streaming," in IEEE Trans. on Consumer

Electronics, vol. 55, Feb. 2009, pp. 235-239.

[73] S.-T. Kim, K. R. Konda, C. Su Park, C.-S. Cho, and S.-J. Ko, "Fast mode decision

algorithm for inter-layer coding in scalable video coding," in IEEE Trans. on

Consumer Electronics, vol. 55, no. 3, Aug. 2009, pp. 1572–1580.

[74] S.-W. Jung, S.-J. Baek, C.-S. Park, and S.-J. Ko, "Fast mode decision using all-zero

block detection for fidelity and spatial scalable video coding," in IEEE Trans. on

Circuits and Systems for Video Technology, vol. 20, no. 2, Feb. 2010, pp. 201-206.


[75] J. Ren and N. Kehtarnavaz, "Fast adaptive early termination for mode selection in

H.264 scalable video coding," in Journal of Real-Time Image Processing, vol. 4,

no.1, Mar. 2009, pp. 13-21.

[76] H. Li, Z. Li, C. Wen, and L.-P. Chau, "Fast mode decision for spatial scalable video

coding," in International Symposium on Circuits and Systems (ISCAS), May 2006.

[77] D. Wu, T. Hou, and Y.-Q. Zhang, "Transporting real time video over the Internet:

challenges and approaches," in Proc IEEE 88, 2000, pp. 1855-1877.

[78] J. B. Lee and A. Eleftheriadis, "Spatio-temporal model-assisted compatible coding

for low and very low bit rate video-telephony," in Intl. Conf. Image Processing,

Lausanne, Switzerland, Oct. 1996, pp. 429-432.

[79] M. van der Schaar and Y.-T. Lin, "Content-based selective enhancement for

streaming video," in Image Processing, 2001. Proceedings. 2001 International

Conference on, 7-10 Oct 2001, pp. 977-980 vol.2.

[80] S. Chattopadhyay, S.M. Bhandarkar, and K. Li, "FGS-MR: MPEG4 fine grained

scalable multi-resolution layered video encoding," in Procs. of ACM NOSSDAV'06,

New Port, RI, USA, May 2006.

[81] R. Jillani, C. Holder, and H. Kalva, "Exploiting Spatio-Temporal Characteristics of Human Vision for Mobile Video Applications," in SPIE Optics + Photonics 2008, Applications of Digital Image Processing XXXI, San Diego, CA, Aug. 2008.


[82] G. Fernandez-Escribano et al., "Low-Complexity Heterogeneous Video Transcoding

Using Data Mining," in Multimedia, IEEE Transactions on, vol. 10, Feb. 2008, pp.

286-299.

[83] G. Fernandez-Escribano, H. Kalva , P. Cuenca , and L. Orozco-Barbosa , "Speeding-

up the Macro block Partition Mode Decision for MPEG-2 to H.264 Transcoding," in

Proceeding of ICIP 2006, Atlanta, 2006.

[84] S. Boykin and A. Merlino, "Machine learning of event segmentation for news on

demand," in Communications of the ACM, vol. 43, No. 2, Feb. 2000, pp. 35-41.

[85] A. Vailaya, M.A.T. Figueiredo, and A.K. Jain, "Image classification for content-

based indexing," in IEEE Transactions on Image Processing, jan. 2001, pp. 117-

130.

[86] N. Baskiotis and M. Sebag, "C4.5 competence map: a phase transition-inspired

approach," in Proceedings of the 21st International Conf. on Machine Learning,

ICML '04, vol. 69, 2004.

[87] M. H. Lee, H. W. Sun, D. Ichimura, Y. Honda, and S. M. Shen, "ROI Slice SEI Message," in JVT-S054, Input Document, Joint Video Team (JVT), Geneva, Switzerland, Apr. 2006.


[88] P. Lambert, D. De Schrijver, D. Van Deursen, Y. Dhondt, W. De Neve, and R. Van de Walle, "A real-time content adaptation framework for exploiting ROI scalability in H.264/AVC," in Lecture Notes in Computer Science (8th international conference on Advanced Concepts for Intelligent Vision Systems), vol. 4179, Antwerp, Belgium, 2006, pp. 442-453.

[89] H. K. Arachchi, S. Dogan, H. Uzuner, and A. M. Kondoz, "Utilising Macroblock SKIP Mode Information to Accelerate Cropping of an H.264/AVC Encoded Video Sequence for User Centric Content Adaptation," in Third International Conference on Automated Production of Cross Media Content for Multi-Channel Distribution (AXMEDIS'07), 2007, pp. 3-6.

[90] D. De Schrijver, W. De Neve, D. Van Deursen, S. De Bruyne, and R. Van de Walle,

"Exploitation of interactive region of interest scalability in scalable video coding by

using an XML-driven adaptation framework," in Proc AXMEDIS’2006, Leeds, UK,

Dec. 2006.

[91] Wei Liu, Wei Wang, and Guohui Li, "A FOA-based error resiliency scheme for

video transmission over unreliable channels," in International Conference on

Wireless Communications, Networking and Mobile Computing, vol. 2, 23-26 Sep.

2005, pp. 1265-1270.

[92] L. Itti, "Automatic foveation for video compression using a neurobiological model

of visual attention," in IEEE Transactions on Image Processing, 2004.
