HEVC OPTIMIZATION IN MOBILE ENVIRONMENTS

by

Ray Garcia

A Dissertation Submitted to the Faculty of

The College of Engineering and Computer Science

in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

Florida Atlantic University

Boca Raton, FL

May 2014

Copyright by Ray Garcia 2014


ACKNOWLEDGEMENTS

As we journey through life, individual achievements are rarely individual; they are a collection of the help, encouragement, and support of a multitude of people involved directly or indirectly in the endeavor. My scholastic effort for this dissertation is no different. First and foremost I want to thank my wife for her patience and support; without them this manuscript would not have been possible. In addition, the staff at Florida Atlantic University was invaluable in providing guidance and recommendations. I am very thankful to my advisor, Dr. Hari Kalva; my committee members Dr. Borko Furht, Dr. Imad Mahgoub, and Dr. Daniel Raviv; and graduate department staff Jean Mangiaracina.

I am truly grateful to all.


ABSTRACT

Author: Ray Garcia

Title: HEVC Optimization in Mobile Environments

Institution: Florida Atlantic University

Dissertation Advisor: Dr. Hari Kalva

Degree: Doctor of Philosophy

Year: 2014

Recently, multimedia applications and their use have grown dramatically in popularity, in large part due to mobile device adoption by the consumer market. Applications, such as video conferencing, have gained popularity. These applications and others have a strong video component that uses the mobile device’s resources, including processing time, network bandwidth, memory, and battery life.

The goal is to reduce the need for these resources by reducing the complexity of the coding process. Mobile devices offer unique characteristics that can be exploited for optimizing video codecs. The combination of small display size, video resolution, and human vision factors, such as acuity, allows encoder optimizations that will not (or will only minimally) impact subjective quality.

The focus of this dissertation is optimizing video services in mobile environments. Industry has begun migrating from H.264 video coding to the more resource intensive but compression efficient High Efficiency Video Coding (HEVC). However, there has been no proper evaluation and optimization of HEVC for mobile environments. Subjective quality evaluations were performed to assess relative quality between H.264 and HEVC. This allows for better use of device resources and migration to new codecs where it is most useful. Complexity of HEVC is a significant barrier to adoption on mobile devices, and complexity reduction methods are necessary. Optimal use of encoding options is needed to maximize quality and compression while minimizing encoding time. Methods for optimizing coding mode selection for HEVC were developed. Complexity of HEVC encoding can be further reduced by exploiting the mismatch between the resolution of the video, the resolution of the mobile display, and the ability of the human eyes to acquire and process video under these conditions. The perceptual optimizations developed in this dissertation use the properties of spatial (visual acuity) and temporal information processing (motion perception) to reduce the complexity of HEVC encoding. A unique feature of the proposed methods is that they reduce both encoding complexity and encoding time.

The proposed HEVC encoder optimization methods reduced encoding time by 21.7% and bitrate by 13.4% with insignificant impact on subjective quality evaluations. These methods can easily be implemented today within HEVC.


TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

1 INTRODUCTION
1.1 Motivation
1.2 Contribution
1.3 Outline
2 PROBLEM DESCRIPTION
2.1 H.264 vs. HEVC Subjective Evaluation
2.2 Decision Optimization
2.3 Complexity Reduction with HVS Factors
3 BACKGROUND
3.1 Overview of Video Compression
3.1.1 HEVC overview
3.2 Overview of HVS
3.2.1 Retina
3.2.2 Smooth Pursuit Eye Movement
3.2.3 Transparent Motion Perception
4 LITERATURE REVIEW
4.1 Subjective Quality
4.1.1 Metrics for Quality Evaluation
4.1.2 Subjective Evaluation
4.2 Complexity Reduction
4.2.1 H.264
4.2.2 HEVC
5 HEVC AND H.264 SUBJECTIVE EVALUATION
5.1 Background
5.2 Evaluation Methods
5.3 Experiments
5.4 Results
5.5 Discussion
5.6 Concluding Remarks
6 HEVC DECISION OPTIMIZATION
6.1 Background
6.2 Method
6.3 Prediction Modeling
6.4 Results
6.4.1 Data Reading Primer
6.5 Data Analysis
6.5.1 Prediction Model Analysis
6.6 Application
7 ADAPTING LOW BIT RATE SKIP MODE IN MOBILE ENVIRONMENT
7.1 Background
7.2 HEVC Elements
7.2.1 Quad-Tree
7.2.2 Skip Mode Method
7.3 Proposed Method
7.4 Experiments
7.5 Results
7.6 Concluding Remarks
8 CONCLUSION
9 FUTURE WORK
10 REFERENCES

LIST OF TABLES

Table 1: Major differences for portable devices
Table 2: Small screen complexity reduction opportunities
Table 3: Retina topography data
Table 4: Class definition for resolution, frame rate, and bit rates
Table 5: Complexity reduction reviewed research
Table 6: Video sequence information
Table 7: 400 kbps data for H.264 and HEVC PSNR(Y)
Table 8: 200 kbps data for H.264 and HEVC PSNR(Y)
Table 9: Subjective grading scale
Table 10: Mean opinion score (MOS) for 400 kbps rate
Table 11: Mean opinion score (MOS) for 200 kbps rate
Table 12: Video sequences preference
Table 13: Video sequence definition
Table 14: Class C, D, E, and All video sequence, average Δ PSNR and time-savings
Table 15: Kristen and Sara 64/4/3 PSNR and Δ PSNR for 100 and 600 kbps
Table 16: Video sequences WEKA estimate for Δ PSNR and time-savings
Table 17: Video sequences actual vs. WEKA difference
Table 18: Correlation between actual and WEKA results
Table 19: Test sequences
Table 20: Acuity test conditions
Table 21: Rate distortion loop calls
Table 22: Bitrates for subjective videos near 400 kbps
Table 23: Encoding time for subjective videos near 400 kbps
Table 24: Acuity average and standard deviation
Table 25: MOS for base vs. high acuity at 960x540 and 400 kbps bit rate
Table 26: MOS for base vs. low acuity at 960x540 and 400 kbps bit rate
Table 27: MOS for depth reduced vs. high acuity at 960x540 and 400 kbps bit rate
Table 28: MOS for depth reduced vs. low acuity at 960x540 and 400 kbps bit rate
Table 29: MOS for base vs. high acuity at 480x272 and 400 kbps
Table 30: MOS for base vs. low acuity at 480x272 and 400 kbps
Table 31: MOS for depth reduced vs. high acuity at 480x272 and 400 kbps
Table 32: MOS for depth reduced vs. low acuity at 480x272 and 400 kbps

LIST OF FIGURES

Figure 1: Mobile environment elements
Figure 2: H.261 hybrid coding
Figure 3: Scope of H.264 standard
Figure 4: H.264 video encoder block diagram (decoder is within dashed box)
Figure 5: HEVC video encoder block diagram (with decoder elements in grey)
Figure 6: CTU (or CTB) subdivision into CUs (or CBs)
Figure 7: Modes for splitting a CB into PBs
Figure 8: Intra picture directional orientations
Figure 9: Receptor density in retina [1]
Figure 10: Tangential section of photoreceptors through human fovea [2]
Figure 11: Retina topography
Figure 12: Viewing cone
Figure 13: Scatter plots
Figure 14: H.264 processor utilization
Figure 15: Simplified H.264 and HEVC encoder block diagram
Figure 16: Active and inactive area; active and inactive area in binary
Figure 17: Decision tree for MB encoding
Figure 18: Restriction of selectable sub-macroblock modes
Figure 19: Flexible search method flow chart
Figure 20: Transform choices by transform skip mode
Figure 21: Decomposition using checkered pattern
Figure 22: Template and block matching vectors
Figure 23: CU splitting and pruning process
Figure 24: Literature research on hybrid coding map
Figure 25: Basketball drill PSNR(Y) rate distortion curve
Figure 26: Observer to LCD viewing definition
Figure 27: Mean opinion score (MOS) for 400 kbps bit rate
Figure 28: Mean opinion score (MOS) for 200 kbps bit rate (graph)
Figure 29: PSNR(Y) Δ (dB) vs. MOS Δ (dB)
Figure 30: Race horses sequence; horse coat color is bothersome to observer
Figure 31: Foreground tree detail loss not a concern in Keiba sequence
Figure 32: MOS vs. POA with MOS bubble @ 400 kbps, TI > 10
Figure 33: MOS vs. POA with MOS bubble @ 200 kbps, TI > 10
Figure 34: TI/SI plot
Figure 35: Kristen and Sara 100 kbps time-savings vs. Δ PSNR
Figure 36: Kristen and Sara sequence, 100 kbps; max-quality and left-option points
Figure 37: Kristen and Sara sequence, 100 kbps; right-option and min-quality points
Figure 38: Kristen and Sara 100 kbps to 600 kbps shift for time-savings vs. Δ PSNR
Figure 39: Kristen and Sara 100 kbps time-savings vs. Δ PSNR, WEKA derived
Figure 40: Recursive CU splitting example
Figure 41: Mode decision process
Figure 42: Viewing cone
Figure 43: Neighbor LCU motion vector
Figure 44: Video sequence order
Figure 45: Observer voting
Figure 46: Display viewing setup
Figure 47: MOS for base vs. high acuity (960x540 display resolution and 400 kbps)
Figure 48: MOS for base vs. low acuity (960x540 display resolution and 400 kbps)
Figure 49: MOS for depth reduced vs. high acuity (960x540 display resolution and 400 kbps)
Figure 50: MOS for depth reduced vs. low acuity (960x540 display resolution and 400 kbps)
Figure 51: MOS for base vs. high acuity (480x272 display resolution and 400 kbps)
Figure 52: MOS for base vs. low acuity (480x272 display resolution and 400 kbps)
Figure 53: MOS for depth reduced vs. high acuity (480x272 display resolution and 400 kbps)
Figure 54: MOS for depth reduced vs. low acuity (480x272 display resolution and 400 kbps)

1 INTRODUCTION

Multimedia applications have developed extensively during the last 20 years. In conjunction with these multimedia advances is the evolution of video encoding and decoding standards. From the onset, subjective evaluation has been a strong driver in video coding standards development, and a significant amount of human and computational resources has been expended on subjective evaluation. From this effort, standards have evolved rather successfully to support the consumption of multimedia with the technologies available in the industry.

One form of technology that has become very popular is the use of mobile compute devices for consuming and generating multimedia data. Mobile multimedia generation dedicates a significant amount of computational and wireless transmission resources to video. Therefore, video compression efficiency and effectiveness are very important, and subjective performance on the mobile compute platform needs proper care to maximize performance.

Even though the video compression standards clearly provide direction for mobile multimedia use, current video compression research has focused significantly on entertainment consumption of multimedia data, mainly internet use on desktop, notebook, and tablet computers and television. The video resolution tends to be 1280x720 or higher, and available bandwidth tends to be higher than one megabyte and usually in multiples of megabytes. The vast majority of research to date is highly focused on larger-scale video resolution and consumption essentially over wired internet, where transmission bandwidth is not as constrained as in the mobile arena. During the next decade, more focus will be given to mobile multimedia consumption and the best methods to effectively manage mobile device resources to enhance the device’s performance for the consumer.

1.1 Motivation

My research focus is to address some underserved aspects of video consumption on mobile devices. The main emphasis is to optimize video services in mobile environments. By optimization, the target is to reduce coding complexity without penalizing the observer’s perception.

Mobile environments have inherent discrepancies between (a) the mobile device display, such as display resolution and display size, (b) encoded video characteristics, such as encoded resolution, and (c) the human vision system, such as acuity, smooth eye pursuit, and viewing distance. In essence, these are three elements, shown in Figure 1, that when combined in a mobile environment can be effectively exploited. Current research does not adequately address the opportunities within this environment.

Figure 1: Mobile environment elements

Subjective analysis in mobile environments comparing HEVC, the latest coding standard, and H.264 is very limited. HEVC’s target is to improve over H.264 by a factor of two; for the same subjective feedback, HEVC should use half the bit rate. The targeted HEVC gains are not as clear in mobile environments. This needs addressing to give the mobile developer comparison data on performance between HEVC and H.264 in the mobile use case.

Optimal HEVC mode selection is needed for any optimization technique targeting the mobile environment. Models are needed to effectively handle decision optimization for mode selection. Prediction modeling is a widely known optimization technique that can be used to reduce the complexity of video coding on mobile devices.

1.2 Contribution

The main contributions of this dissertation are:

• Conducted subjective evaluation experiments comparing H.264 and HEVC compressed video under mobile video delivery conditions. This work has been published in peer-reviewed conferences [3], [4] and journals [5].

• Developed a complexity reduction method using a predictive model to estimate Δ PSNR and encoding time savings, with techniques to efficiently choose HEVC coding tools (CU size, CU depth, TU depth) in video conferencing applications. This work has been published in peer-reviewed conferences [6], [7].

• Developed complexity reduction methods using a perceptually-aware method for video encoding that exploits mismatches between video resolution, display resolution, visual acuity, and motion perception. This work has been published in a peer-reviewed conference [8].

• Developed joint complexity and bitrate reduction methods for HEVC encoding; the method has negligible impact on subjective quality evaluations.

1.3 Outline

The rest of the dissertation is outlined as follows. Section 2 is the problem description. Section 3 is the background, which gives an overview of video compression. Section 4 is the literature review within the multimedia field, mainly focusing on subjective quality and complexity reduction. Section 5 presents quality evaluation on mobile devices. Section 6 identifies optimal mode selection for HEVC using predictive modeling techniques. Section 7 exploits visual perception for complexity and bitrate reduction.

2 PROBLEM DESCRIPTION

As presented earlier, the focus is optimizing video services in mobile environments. Video coding complexity is a significant problem. The mobile environment has unique characteristics that allow new approaches to complexity reduction. Today’s mobile devices have small displays with high resolutions. In addition, the HVS has limitations that can be exploited when encoded video is viewed on a small display.

There are several subjective studies within the video coding realm. However, very few studies address the impact in the mobile compute environment. Several of the earlier studies have the older video compressed at twice the bit rate of HEVC. These results predispose the outcome by using a different bit rate for each coding standard. In addition, there have been substantial complexity reduction efforts in the last few years with the HEVC standard. However, significant focus has been given to higher resolutions, which are the main target of the broadcast industry. The higher resolutions are mainly for larger screen devices, from desktop displays to televisions. Limited data is available for mobile phone type screen sizes, which are less than 6” and typically range from 3.5” to 5”.

Research is needed where the mobile environment dictates the setup restrictions, which then apply to all coding standards for subjective evaluation. From this effort, feasible modifications, such as complexity reduction, can then be determined for mobile-sensitive resources, such as power and bandwidth.

Mobile devices, such as mobile phones, have unique requirements that have not been sufficiently studied and exploited by earlier works. Table 1 shows typical values for the major differences among common portable devices in the consumer industry.

Table 1: Major differences for portable devices

Type           Screen Size   Resolution   Wireless Medium
Mobile phone   < 5.0”        < 1080P      Wireless WAN
Tablet         ~ 10”         ~ 1080P      Wi-Fi
Notebook       > 14”         > 1080P      Wi-Fi

Mobile multimedia demand will increase significantly during the next few years. [9] states mobile data and internet traffic will increase from 237 petabytes (PB)/month in 2010 to an estimated 6,254 PB/month by 2015. The increase is not only due to higher data use on existing phones, but also a significant deployment of multimedia capable phones on wireless wide area networks throughout the world.

Mobile compute platforms, with small screen sizes, lead to unique characteristics that can be exploited by the developer. The complexity reduction research is geared to exploit these characteristics and employ methods that will reduce encoding time with minimal impact to subjective quality. Some potential coding elements to research are listed in Table 2. In addition, encoding time reduction can dramatically reduce power drain on the mobile device, a current problem being addressed in today’s mobile devices.

Table 2: Small screen complexity reduction opportunities

Item                       Potential
Optimized mode selection   (1) Reduce ME time by eliminating modes not conducive
                           to smaller screens; this method is tree pruning.
                           (2) Use the previous frame’s modes if criteria, such
                           as an MV threshold, are met.
Depth selection            Use the previous frame’s CU and TU depths if criteria
                           are met.
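The reuse criteria in Table 2 can be sketched as follows. The function, its inputs, and the L1 motion-vector threshold are illustrative assumptions for this sketch, not values mandated by HEVC or taken from the experiments in this work.

```python
# Sketch of the Table 2 reuse criteria. The threshold value is a hypothetical
# tuning parameter, not an HEVC-specified constant.

MV_THRESHOLD = 4  # in quarter-pel units; illustrative assumption

def reuse_previous_decision(prev_mv, prev_cu_depth, prev_tu_depth):
    """Decide whether the co-located block's decisions can be reused.

    prev_mv: (x, y) motion vector of the co-located block in the previous frame.
    Returns the reused (cu_depth, tu_depth), or None if a full search is needed.
    """
    mv_magnitude = abs(prev_mv[0]) + abs(prev_mv[1])  # cheap L1 norm
    if mv_magnitude <= MV_THRESHOLD:
        # Low motion: the co-located decisions are likely still near-optimal,
        # so the exhaustive mode/depth search can be pruned.
        return (prev_cu_depth, prev_tu_depth)
    return None  # high motion: fall back to the normal search
```

The pruning saves time exactly where small-screen viewing makes a suboptimal choice least visible: low-motion regions whose co-located decisions rarely change frame to frame.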

2.1 H.264 vs. HEVC Subjective Evaluation

Mobile compute environments provide a unique set of user needs and expectations that designers must consider. The focus within the mobile compute environment is smart phones. Multimedia use in smart phones has increased with the surge in popularity of these devices. With increased multimedia use in mobile environments, video encoding methods within the smart phone market segment are key factors that contribute to a positive user experience. Currently available display resolutions and expected cellular bandwidth are major factors the designer must consider when determining which encoding methods should be supported. Recent mobile devices released in the consumer market show display technology has progressed strongly. The displays range from high-end mobile phones with resolutions up to 1280x720 for 5.0” diagonal screen sizes to entry-level smart phones with resolutions around 480x270 for 3.5”.

A comparative evaluation of user-experience subjective quality between the HEVC and H.264 video coding standards is needed to guide the developer designing for the mobile environment. The desired goal is to maximize the consumer experience, reduce cost, and reduce time to market.

2.2 Decision Optimization

HEVC has many configuration options for encoding. A model and approach are needed to select an efficient set of HEVC encoding options for devices in mobile environments as described earlier. A main goal is to reduce the encoding complexity without significantly affecting the quality of video conferencing applications. A real-time adaptive configuration is needed to optimize for the available bandwidth within the mobile environment. Basic configuration options offered by HEVC are coding unit size, coding unit depth, and transform unit size. There are many other options, but these are the most commonly manipulated options in related works. The encoder computational complexity can be reduced for the target bit rates while maintaining an allowable additional PSNR loss.
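The kind of selection described above can be sketched as a lookup over candidate configurations: pick the fastest (CU size, CU depth, TU depth) triple whose estimated PSNR loss stays within the allowance. The numbers below are made-up placeholders, not results from this dissertation; in practice they would come from measurement or a prediction model.

```python
# Hypothetical search over (CU size, max CU depth, max TU depth) configurations.
# The (time_savings_pct, psnr_loss_db) pairs are illustrative placeholders.

configs = [
    # (cu_size, cu_depth, tu_depth): (time_savings_pct, psnr_loss_db)
    ((64, 4, 3), (0.0, 0.00)),   # full search: quality baseline
    ((64, 3, 2), (18.5, 0.04)),
    ((32, 3, 2), (27.0, 0.09)),
    ((16, 2, 1), (41.0, 0.31)),
]

def best_config(max_psnr_loss_db):
    """Fastest configuration whose estimated PSNR loss stays within budget."""
    feasible = [(cfg, ts) for cfg, (ts, loss) in configs
                if loss <= max_psnr_loss_db]
    # Maximize time savings among the feasible configurations.
    return max(feasible, key=lambda item: item[1])[0]
```

For example, a 0.1 dB allowance selects the third configuration, while a 0.0 dB allowance falls back to the full search. A real-time encoder would re-run this selection as bandwidth and target bit rate change.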

2.3 Complexity Reduction with HVS Factors

Encoding complexity reduction without degrading video quality is a common goal for video coding researchers. Indeed, several authors presented within the literature review strove to reduce complexity while minimally affecting video quality. However, two elements are usually not addressed for mobile environments. The mobile environment offers unique characteristics that can be exploited. For example, mobile devices, such as smart phones, typically have displays that are 3.5” to 5.5” diagonal and use bandwidth-constrained networks, such as cellular, as the main data transmission method. In addition, human vision system factors, such as visual acuity and smooth eye pursuit, can be used to identify where complexity and bit rate reductions can occur without significantly impacting perception of video quality.

Use of mobile devices, such as smart phones, has grown dramatically in the last few years and has penetrated the consumer space [10]. Applications, such as real-time video sharing, have gained popularity. Quality of service is a factor in application adoption, and research has shown adoption is successful if the technology delivers a predominantly positive experience [11]. A successful mobile real-time video application will need to work well on low bit rate and lossy networks.

3 BACKGROUND

3.1 Overview of Video Compression

Video encoding in its basic essence was patented by Raymond Davis Kell in 1929 [12]. Kell’s U.S. patent revealed the benefits of transmitting only changes in subsequent images [13], which are the main elements of predictive compression. The proposed implementation method, using separate light sources to display the changes with analog means, is impractical; however, the basic idea for video compression was born.

An early digital method for video compression was released by the International Telegraph and Telephone Consultative Committee (CCITT), currently the International Telecommunication Union (ITU), in 1984 in the form of the standard recommendation H.120 [14]. This standard is directed at point-to-point transmission, with video conference service as a main beneficiary. The standard’s decisions were heavily influenced by the analog video transmission ecosystem available at the time. The release supported the National Television System Committee (NTSC) system of 525 lines and the Phase Alternating Line (PAL) system of 625 lines. The implementation was mainly unsuccessful due to inadequate video quality. However, several basic concepts developed within this standard are evident within today’s video compression techniques, such as quantization and code tables.

During 1990, a much improved coding standard was released as H.261 [15]. As with H.120, a major emphasis included videophone and video conference within the audio-visual services. The main transmission target is ISDN lines using a 64 kbit/s bitrate. However, the standard made allowances for p x 64 kbit/s services, where p can be specified anywhere from 1 to 30. Common video resolutions used within H.261 implementations include the common intermediate format (CIF), a resolution of 352 x 288 pixels, and the quarter common intermediate format (QCIF), a resolution of 176 x 144 pixels. H.261 further improved coding methods by introducing the hybrid video coding method that is still a key basis for modern coding standards. Hybrid video coding combines two methods: (1) frame-to-frame motion is estimated and compensated for by using data from previously encoded frames, and (2) spatial domain data is decoupled and transformed to the frequency domain, where it can be quantized. The hybrid model is shown in Figure 2.

T: Transform
Q: Quantizer
P: Picture memory with motion compensated variable delay
F: Loop filter
CC: Coding control
p: Flag for INTRA/INTER
t: Flag for transmitted or not
qz: Quantizer indication
q: Quantizing index for transform coefficients
v: Motion vector
f: Switching on/off of the loop filter

Figure 2: H.261 hybrid coding
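The two hybrid-coding steps can be illustrated on a toy 1-D "frame". This is a deliberate simplification: the identity transform and fixed quantizer step stand in for the DCT-style transform and quantization machinery a real codec uses, and all sample values are invented for the example.

```python
# Toy sketch of hybrid coding on 1-D "frames": (1) predict from the previous
# reconstructed frame, (2) transform and quantize the residual. The trivial
# transform and the quantizer step are illustrative stand-ins.

QP_STEP = 4  # quantizer step size (illustrative)

def encode_frame(current, reference):
    """Return the quantized residual for one frame."""
    residual = [c - r for c, r in zip(current, reference)]   # step (1): prediction
    return [round(x / QP_STEP) for x in residual]            # step (2): quantize

def decode_frame(coeffs, reference):
    """Inverse quantize and add the prediction back."""
    return [r + q * QP_STEP for q, r in zip(coeffs, reference)]

ref = [100, 102, 104, 106]   # previous reconstructed frame
cur = [101, 106, 104, 98]    # current frame to encode
coeffs = encode_frame(cur, ref)
recon = decode_frame(coeffs, ref)
```

Note that small changes quantize to zero (cheap to code) while the reconstruction of larger changes is exact only up to the quantizer step; this lossy trade-off is the heart of the hybrid approach.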

A parallel effort was started in 1988 by the Moving Pictures Expert Group (MPEG); this group leveraged many concepts from H.261 and released the MPEG-1 (ISO/IEC 11172) standard in 1993. The standard’s focus included multimedia distribution. As with H.261, MPEG-1 is hybrid coding. New techniques were developed to enhance video sequence reconstruction quality, such as the Group of Pictures (GOP). This included three picture frame types: the intra-coded (I) frame, the predictive-coded (P) frame, and the bi-directional-predictive-coded (B) frame. A GOP must include an I frame, positioned such that the remaining frames can use the I frame for decoding. In addition, an improved concept was the increase of the maximum allowable macroblock (MB) size to 16x16 from the 8x8 of H.261. The hybrid video coding approach represents each coded picture in block-shaped units of associated luma and chroma samples, which are called MBs.
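One common GOP layout, an I frame followed by P anchors separated by B frames, can be sketched as below. The helper and its parameters are illustrative; MPEG-1 permits many other layouts.

```python
# Sketch of a simple GOP pattern (e.g. I B B P B B P ...). The layout is one
# common choice, not the only structure MPEG-1 allows.

def gop_frame_types(gop_size, b_frames_between):
    """Frame types for one GOP: an I frame, then P anchors with B frames between."""
    types = ["I"]
    while len(types) < gop_size:
        types.extend(["B"] * b_frames_between)  # bi-directionally predicted
        types.append("P")                       # forward-predicted anchor
    return types[:gop_size]
```

For a GOP of seven frames with two B frames between anchors, this yields I, B, B, P, B, B, P; the B frames reference the surrounding I/P anchors, which is why the anchors must be decoded first.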

The next release, in 1995, was MPEG-2 / H.262, the first joint effort between the ISO and ITU standards organizations, maintained jointly by the ITU-T Video Coding Experts Group (VCEG). The focus included multimedia distribution for digital video broadcasting and consumer purchased multimedia, such as DVD. The combined committee built upon and advanced coding capabilities from each organization’s standards and greatly improved video content distribution for use in transmission environments from 3 to 10 Mbit/s. The standard strove to balance quality and complexity at these higher rates to provide studio quality digital video. In addition, the committee was forward looking, as shown by defining HDTV/SDTV and allowing frame sizes up to 2^14 x 2^14. As the coding technology advanced, the consumer industry benefited greatly from a unified direction for video coding. This greatly reduced adoption risk for encoder developers, and more development resources were dedicated to implementing the encoding protocol of one standard. The MPEG-2 / H.262 standard is backward compatible with MPEG-1.

The next ITU-T VCEG standard, H.263 [16], returned its focus to video conferencing applications; more specifically, the standard’s scope is compressing the moving picture component of audio-visual services at low bit rates. This standard was released in 1996 and was widely adopted by video conferencing and cell phone codecs. The design target is video conferencing applications on mobile devices where low bit rate restrictions exist. H.263 is utilized by many video telephony, video conferencing, and internet conferencing standards, such as H.324, H.323, H.320, RTSP, and SIP.

Crafting of the MPEG-4 Part 2 standard began in 1995, near the release of MPEG-2 / H.262, and it was released in 1999. As with H.263, the original goal of MPEG-4 Part 2 was use within low bit rate environments. However, the scope expanded, and several novel ideas were introduced within this standard that allow a wide range of compression quality and bit rate compromises. Many consider this a major facet, since continued enhancements have evolved with new profiles that include original coding concepts such as interactive graphics, object and shape coding, scalable coding, and 3D graphics, among other techniques.

In the last decade, ITU-T VCEG released H.264 / MPEG-4 Part 10 Advanced Video Coding (AVC) [17], commonly known as H.264 or H.264/AVC. Since its release in 2003, this coding standard has been well received. As with other coding standards, the scope of H.264 is a decoder specification, as shown in Figure 3. Even though only the decoder is specified, this guides the encoder on the expected data format for transmission between the encoder and decoder.

Figure 3: Scope of H.264 standard

The H.264 standard leveraged prior standards heavily and enhanced known coding techniques, such as variable block size motion compensation, decoupling reference order from display order, and hierarchical block transforms, among other enhancements [18]. The H.264 coding block diagram is shown in Figure 4. A main goal is superior compression performance across a wide range of bit rates. This standard has been adopted by several well-known consumer products, such as the Apple iPod and Sony PlayStation. Also, several consumer media formats, such as HD DVD and Blu-ray, have adopted this standard for video compression.

Figure 4: H.264 video encoder block diagram (decoder is within dashed box)

3.1.1 HEVC overview

Recently, in late January 2013, the High Efficiency Video Coding (HEVC) standard [19] was released by ITU-T VCEG. The main target is to achieve the same subjective video performance at half the H.264 data rate. HEVC uses the same hybrid coding approach that has been successful in earlier standards. Several H.264 tools were enhanced to achieve this goal [20], [21]; more details are given below within this section.

The HEVC standard [19] has adopted one profile with three configurations: “intra”, “low-delay” coding, and “random-access” coding. This is to support a wide range of services, such as broadcast, mobile, and streaming. Recent assessments show that HEVC can achieve subjective quality equivalent to H.264 using approximately 50% less bit rate [21]. The low-delay coding typically uses the previous frame as the reference frame for inter prediction. Random-access coding uses both past and future frames as reference frames. Intra does not use other frames for prediction, which means intra frames do not have temporal elements.

The bit rate improvements are achieved, in part, by the introduction of new tools, such as the block partitioning structure [22] and variable block-size prediction and transform coding [23]. Although the coding efficiency of low-delay coding is worse than that of random-access coding, low-delay coding is widely required in industry because of the importance of real-time video applications [24]. The remainder of the HEVC discussion within this section is an overview of the HEVC structure and tools of interest in the research presented. Figure 5 shows the HEVC hybrid video encoder block diagram, which can be used as a reference for this overview.


Figure 5: HEVC video encoder block diagram (with decoder elements in grey)

As with earlier standards, the frame is partitioned into square blocks for compression management. The largest block with luma and chroma data is called the largest coding unit (LCU). The partitioning of the LCU into smaller coding units (CUs) is determined by the encoder and is mapped into a coding tree unit (CTU), as shown in Figure 6. The CTU size is selected by the encoder and consists of a luma coding tree block (CTB) and the corresponding chroma CTBs. The size LxL of a luma CTB can be chosen as L = 16, 32, or 64. For example, in 4:2:0 color sampling, there is one LxL luma CTB and two corresponding L/2xL/2 chroma CTBs. A larger LxL size typically enables better compression for higher resolution video sequences, since a 64x64 LCU is a smaller footprint of the overall frame and has a higher likelihood of homogeneous pixel data. HEVC then supports partitioning of the CTBs into smaller coding blocks (CBs) using a tree structure and quadtree signaling, as shown in Figure 6, with a "Z" scan pattern during the split process.


Figure 6: CTU (or CTB) subdivision into CUs (or CBs)
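The quadtree split with its "Z" scan order can be sketched as a short recursion. This is an illustrative model, not HEVC reference code; the `split_depth` and `min_size` parameters are hypothetical stand-ins for the encoder's split decisions:

```python
def z_scan(x, y, size, split_depth, min_size=8):
    """Split a block quadtree-fashion, listing leaf CBs in HEVC "Z" order:
    top-left, top-right, bottom-left, bottom-right."""
    if split_depth == 0 or size <= min_size:
        return [(x, y, size)]
    half = size // 2
    blocks = []
    for dy in (0, half):        # top row first
        for dx in (0, half):    # left before right
            blocks.extend(z_scan(x + dx, y + dy, half, split_depth - 1, min_size))
    return blocks
```

Splitting a 64x64 LCU one level yields the four 32x32 CBs in Z order: (0, 0), (32, 0), (0, 32), (32, 32).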

A CTB may contain only one CU or may be split to form multiple CUs. Each CU has an associated partitioning into prediction units (PUs) and a tree of transform units (TUs).

For prediction units and prediction blocks (PBs), the decision whether to code a picture area using inter-picture or intra-picture prediction is made at the CU level. HEVC supports variable PB sizes from 64x64 down to 4x4 samples. CBs are split into PBs according to partitioning "modes", which can support asymmetrical shapes as shown in Figure 7. The PB is tied to a motion vector during the motion estimation (ME) process, which will be discussed further in a later section.

Figure 7: Modes for splitting a CB into PB

Transform units (TUs) and transform blocks (TBs) take the prediction residual and code it using block transforms. A TU tree structure has its root at the CU level. HEVC supports variable TB sizes from 32x32 down to 4x4. The HEVC design allows a TB to span across multiple PBs for inter-picture-predicted CUs to maximize the potential coding efficiency of TB partitioning.

Motion vector computation uses advanced motion vector prediction (AMVP), which includes derivation of several most likely candidates based on data from adjacent PBs and the reference picture. A merge mode for motion vector (MV) coding may also be used, which allows the inheritance of MVs from temporally or spatially neighboring PBs. Compared to H.264/MPEG-4 AVC, improved skipped and direct motion inference was specified.

Motion compensation uses quarter-sample precision for MVs, and 7-tap or 8-tap filters are used for interpolation of fractional-sample positions. As with H.264, multiple reference pictures can be used. A PB can be associated with one or two motion vectors, resulting in uni-predictive or bi-predictive coding, respectively. Uni-predictive coding uses a single reference frame, while bi-predictive coding combines two references, which may include both previous and future frames. As with H.264, a scaling and offset operation is supported.
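The fractional-sample interpolation can be illustrated with the 8-tap half-sample luma filter; the tap values below are the HEVC half-pel coefficients, while the helper function itself is only a one-dimensional sketch of the two-dimensional process:

```python
def interp_half_pel(samples, i):
    """Half-sample position between samples[i] and samples[i+1] using the
    HEVC 8-tap luma filter taps, normalized by 64 with rounding."""
    taps = (-1, 4, -11, 40, 40, -11, 4, -1)
    acc = sum(t * samples[i - 3 + k] for k, t in enumerate(taps))
    return (acc + 32) >> 6  # divide by 64 with rounding
```

On a linear ramp the filter lands on the midpoint; for example, interpolating halfway between samples of value 30 and 40 gives 35.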

Intra prediction uses only spatial prediction, which means the decoded boundary samples of adjacent blocks are used as reference data for prediction. Intra-picture prediction supports 33 directional modes, compared to eight directional modes in H.264, in addition to planar (surface fitting) and DC (flat) prediction modes. The selected intra-picture prediction modes are encoded by deriving most probable modes and prediction directions based on those of previously decoded neighboring PBs.


Figure 8: Intra picture directional orientations

Quantization is similar to H.264 where quantization scaling matrices are supported for the various transform block sizes.

Entropy coding is very similar to the context adaptive binary arithmetic coding (CABAC) scheme in H.264. However, several improvements were developed to increase throughput speed, which parallel-processing architectures can take advantage of, and to improve compression performance. In addition, the improvements allowed a reduction in context memory requirements.

A deblocking filter similar to the one used in H.264 operates within the inter-picture prediction loop. The purpose of deblocking is to smooth out block edges with neighboring blocks in order to obtain a subjectively better picture. A major difference of HEVC over H.264 is that the design has simplified decision-making and filtering processes and is more conducive to parallel processing.

Sample adaptive offset (SAO) is a new feature. A nonlinear amplitude mapping is introduced within the inter-picture prediction loop after the deblocking filter. The goal is to better reconstruct the original signal amplitudes by using a look-up table and a few additional parameters determined by histogram analysis from the encoder.

3.2 Overview of HVS

This section will present the properties of the HVS exploited in the dissertation.

This understanding, in combination with video compression techniques, will yield new methods to exploit HVS for improving video compression.

3.2.1 Retina

The retina consists of many layers of specialized tissue, with each layer playing a unique and critical role in human vision [25], [26]. The layer discussed further here is the photoreceptor layer, the light-gathering layer that turns light into neural signals. This layer is a dense mosaic of cell bodies that contains rods and cones. Rods are achromatic and are primarily responsible for vision in low light. Cones are primarily responsible for color vision. There are three types of color cone receptors: S-cones (absorbing blue), M-cones (absorbing green), and L-cones (absorbing red).

Cones dominate and are densely populated near the fovea, which is the area necessary for sharp, detailed vision and is of extreme importance. Cones are primarily used for high-light (such as daylight) conditions. The fovea is also commonly referred to as the focal point of vision and is used for detailed vision capture, such as reading.

The fovea’s minute center is called the central island and has the highest density of photoreceptors in a 0.2° (12 minutes of arc) area. The central island, where vision is sharpest, contains only red and green cones (no blue cones or rods) [27], [28], [29]. As one moves away from the fovea’s central island, visual acuity drops off. The most significant drop-off occurs near the central island edges. The retinal acuity topography is roughly concentric zones, with a maximum resolvable spatial frequency for each zone. The spatial frequency is determined by the number of photoreceptors in the zone. The central island has approximately 120 receptors per degree. Surrounding the central island and within the fovea is the foveola, which spans approximately 1.2° of visual angle. As with the central island, the foveola is free of rods and blood vessels. The foveola photoreceptor coverage is roughly 70 to 120 receptors per degree. The fovea includes the foveola, has a visual field span of 6°, and consists of rods and cones. Cone density decreases and rod density increases farther from the central island. The fovea has approximately 50 to 70 receptors per degree. Receptor density mapping over the retina is shown in Figure 9. The central island is centered at 0 degrees.

Figure 9: Receptor density in retina [1]

For the central island area, photoreceptors are positioned in a hexagonal pattern, similar to a honeycomb, as shown in Figure 10. The central island has approximately 120 photoreceptors per degree. Per Nyquist sampling theory, this translates to a theoretical resolution capability of 60 cycles per degree. Basically, it takes two receptors to distinguish that a change (i.e., a cycle) has occurred, as indicated by Nyquist. Therefore, the Nyquist limit resolves to one-half of the samples, which are the photoreceptors. Using the same method, the foveola has a resolution capacity of 35 to 60 cycles per degree, and the fovea approximately 25 to 35 cycles per degree. The retina topography data is shown in Table 3.

Figure 10: Tangential section of photoreceptors through human fovea [2]

Table 3: Retina topography data

Topography Location   Span (Degrees)   Photoreceptors / Degree   Cycles / Degree
Central Island        0.2              120                       60
Foveola               1.2              70-120                    35-60
Fovea                 6.0              50-70                     25-35
Parafovea             7.0              20-50                     10-25
Perifovea             -                0-20                      0-10
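The cycles-per-degree column in Table 3 follows directly from the Nyquist relationship; a minimal sketch:

```python
def max_cycles_per_degree(receptors_per_degree):
    """Nyquist limit: two photoreceptors are needed to register one cycle,
    so resolvable cycles/degree is half the photoreceptor density."""
    return receptors_per_degree / 2
```

For the central island's 120 receptors per degree this gives the 60 cycles per degree listed in the table.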

Retina topography showing location and relative size from the central island to periphery is shown in Figure 11.


Figure 11: Retina topography

“20/20” vision is a Snellen fraction, developed in 1862, used to express normal visual acuity measured at a distance of 20 feet (6 meters). Basically, 20/20 vision means one can see clearly at 20 feet what should normally be seen at 20 feet. For example, 20/200 vision means one must be 20 feet away to clearly see what a person with normal vision can see at 200 feet. 20/20 vision correlates with 30 cycles per degree of acuity resolving capacity [29].

Recapping this section, the retina’s photoreceptor density affects the ability to distinguish change between adjacent photoreceptors, in terms of cycles per degree. Using Nyquist sampling theory, the cycles per degree is one-half of the photoreceptor density per degree. HVS acuity is based on cycles per degree.

When a display’s resolution per degree is greater than the HVS acuity, individual pixel changes will be difficult for the HVS to detect. One example is viewing a display, such as a television, from a long distance: the television’s resolution is packed into a tighter viewing cone (see Figure 12 for the viewing cone definition). Mobile users have a similar environment, where a high resolution display sits within a tight viewing cone. This discrepancy can be used for HEVC encoder complexity reduction with minimal impact on subjective quality.

Figure 12: Viewing cone
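The mismatch between display resolution and acuity can be quantified as pixels per degree of visual angle. The sketch below assumes a flat display, and the example numbers (a hypothetical 5-inch-tall, 1080-line display viewed at an arm's-length 16 inches) are illustrative only:

```python
import math

def pixels_per_degree(pixels, size, distance):
    """Pixels per degree of visual angle for a display dimension spanning
    `pixels` pixels and `size` length units, viewed from `distance` units."""
    span_deg = 2 * math.degrees(math.atan(size / (2 * distance)))
    return pixels / span_deg
```

At roughly 61 pixels per degree, this hypothetical display exceeds the ~60 pixels per degree (2 pixels per cycle x 30 cycles per degree) that 20/20 acuity can resolve, so per-pixel detail is largely invisible to the viewer.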

3.2.2 Smooth Pursuit Eye Movement

Pursuit eye movement was first reported as a distinct movement by Raymond Dodge [30] in 1903 and by his contemporaries of that era. This observation was later defined as Smooth Pursuit Eye Movement (SPEM), the eye movement in which the line of observation follows an object moving across the field of vision. Human beings instinctively follow objects from early infancy. This instinct is so persistent that in adult life it is very difficult to keep the eyes from moving when an object moves. Analysis showed that when an object moves, the ocular muscle response follows shortly to track the object.

Later works distinguished clearly between saccadic and smooth pursuit human eye movements. Saccadic eye movement has received much attention, primarily due to the very rapid motion of the human eyes within a very short period of time. In contrast, smooth pursuit movements occupy a significant portion of ocular activity [31]. Per Robinson, smooth pursuit can occur at velocities up to 25-30 degrees per second. He noted that smooth pursuit velocity overshoot is noticeable at a movement rate of 5 degrees per second, while no noticeable overshoot occurs at 15 degrees per second. Overshoot is the condition where the eye movement keeps tracking on the same path after the object has settled. Saccades were reported at a velocity threshold of 25 to 30 degrees per second. Girod [32] reported saccades as large rapid eye movements, with angular velocities of up to 830 degrees per second, that align the point of regard with an interesting target, while several earlier works typically reported saccades at 30 to 60 degrees per second. In addition, if the target is moving, our eyes can compensate for this motion by SPEM with a maximum velocity around 20 to 30 degrees per second. Adzic [33] reported a threshold, defined as the saccadic suppression detection threshold, of 48 to 60 degrees per second. Velocity saturation is where smooth pursuit briefly becomes disconnected from the object during the pursuit movement. Saccadic eye movement can briefly occur during the eye's pursuit of the object.

3.2.3 Transparent Motion Perception

Transparent motion perception concerns a minute change in object position over a short period of time where the change in distance is not perceived by the HVS. Nakayama [34] performed subjective tests in which HVS sensitivity was measured for horizontally moving, randomly generated dots. Dots were displaced for 12 msec, 100 msec, or 200 msec and then returned to the original position. The observers were asked if motion occurred. For all three displacement times, the transparent motion sensitivity, which is the just-noticeable displacement, is approximately 2 arc-minutes. Observers noticed movement for displacement distances of 2 arc-minutes or greater. Subjective tests were performed by Qian, Andersen, and Adelson [35], where the observer was asked to identify motion between randomly generated dots or paired dots. Several experiments were performed. One particular experiment indicates where motion is detected between pixel offsets. Each pixel in this particular experiment spans 0.028° (1.68 arc-minutes). The transparent motion sensitivity difference is between 1 and 2 pixels: no positive indications were given for 1-pixel movement, and about a 10% positive indication was given for 2-pixel movement.

4 LITERATURE REVIEW

4.1 Subjective Quality

Multimedia consumption is to be viewed and enjoyed. The ultimate purpose of video compression is to reduce resources, such as bandwidth, while maintaining the highest level of video quality as measured by human beings. The ultimate critic of multimedia consumption is the end user. The purpose of subjective quality assessment is to determine which parameters are important, and which are less important, to the end user. This section has two main bodies of discussion: (1) the quality evaluation metrics currently available, and (2) a review of recent subjective evaluations.

4.1.1 Metrics for Quality Evaluation

Objective video quality assessment (VQA) uses computational models to evaluate video quality in line with the perception of the human visual system (HVS) [36]. This method is strongly computation based and uses Structural Similarity (SSIM) as one component of the subjective portion of the assessment. SSIM was used along with an alternate weighting method between motion compensation blocks as a measure of temporal distortion, known as motion compensated SSIM (MC-SSIM) [37]. The second subjective component is a singular value decomposition (SVD) based image quality metric for computing the spatial scores [38]. Both scores (MC-SSIM and SVD) are computed and combined into a single score. The score is computed for both reference and impaired frames. In addition, other objective quality scores were generated for the dataset, such as Peak Signal-to-Noise Ratio (PSNR), Visual Signal-to-Noise Ratio (VSNR) [39], and Visual Information Fidelity (VIF) [40]. The subjective testing was performed on H.264 encoded video sequences at resolutions of 352x288 and 768x432. A predictive model was generated between each calculated quality score and Mean Opinion Score (MOS) results. The authors' model was better at predicting subjective quality for the test cases used.
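Of the objective scores above, PSNR is the simplest to state; a minimal sketch over flat pixel lists (8-bit by default):

```python
import math

def psnr(ref, deg, max_val=255):
    """Peak Signal-to-Noise Ratio in dB between a reference and a degraded
    frame, each given as a flat sequence of pixel values."""
    mse = sum((r - d) ** 2 for r, d in zip(ref, deg)) / len(ref)
    if mse == 0:
        return float("inf")  # identical frames
    return 10 * math.log10(max_val ** 2 / mse)
```

As the cited studies note, PSNR is purely statistical and often tracks subjective quality poorly, which motivates the perceptual metrics discussed next.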

A very good comparison of a few well-known video quality assessment (VQA) models was conducted by [41]. The study compares H.264 and HEVC MOS (subjective VQA) with five objective VQA models: (1) Structural Similarity (SSIM), (2) Multi-Scale SSIM index (MS-SSIM), (3) Video Quality Metric (VQM) [42], (4) MOtion-based Video Integrity Evaluation index (MOVIE) [43], and (5) PSNR. The encoding configuration of HM5.0 was set as random-access high-efficiency, and the JM18.3 configuration was adjusted accordingly to best match that of HM5.0. The scatter plots of objective scores versus MOS are shown in Figure 13. The four other objective VQA models clearly outperform PSNR for quality assessment estimation, with MS-SSIM yielding slightly better results than the remaining three.


Figure 13: Scatter plots

4.1.2 Subjective Evaluation

Mainstream video coding methods use block-based and region-based coding approaches, where statistical features of the video sequence are exploited to detect redundancies and obtain bit rate reduction. These methods decompose video sequences into coherent regions based on features such as motion, color, or texture. In the Image Analysis and Completion Video Coding approach [44], regions in a video sequence are subdivided into two classes: perceptually relevant and perceptually irrelevant. Perceptually irrelevant regions are highly textured regions, and perceptually relevant regions are the remainder. The authors' idea is to represent perceptually irrelevant parts by imperceptible approximations. Therefore, the imperceptible approximation may be achieved at a lower bit rate than with other objective/statistical compression methods. The authors' main contributions were identifying image completion approximations with subjective consideration and breaking down a video sequence into the mix of the two regions, perceptually relevant and irrelevant. A few image completion methods investigated by the authors were auto-regressive, auto-regressive moving average, Laplace, and non-linear, among others.

In [45], the authors' focus is mobile quality of experience (QoE) for multimedia-enriched web services within a mobile environment. The paper includes experiments that analyzed QoE for different media-enriched web-based services in a mobile environment. The user display in the experiments is 2.8” with a resolution of 320x240. SSIM and MOS results were compared for all video sequences. Tests included video sequences ranging from 2-second to 120-second clips. Video clips were regenerated multiple times with different levels of transmission degradation, as measured by SSIM. MOS results showed that observers' tolerance is significantly higher, as shown by higher MOS results, for longer (100 s to 120 s) video segments. Video segments with high motion, regardless of complexity level (i.e., high or low), tended towards lower MOS results.

A strong objective component was used by [46] to determine perceptual quality and its impact on bit rate. The authors investigated the impact of frame size, frame rate, and signal-to-noise (quantization) on the resulting bit rate. The authors point to perceptual quality testing based on spatial, temporal, and amplitude resolution (STAR) changes and their influence on a bit rate model. Tests used well-known H.264 video sequences (such as akiyo and crew). The variables changed were the frame rate, from 1.875 up to 30 Hz, and the quantizer, which varied over essentially the whole quantizer spectrum from the upper teens to one hundred. However, the resolution remained fixed for the tests, since resolution typically is not continuously varied within a video sequence. The authors recommend heavy use of frame-rate adaptive control to maintain the constant bit rate required by the video sequence application.

Saliency models, as shown by [47], use a computational bottom-up model and human visual estimates. Human visual fixation behavior is driven by the person's sensory system, which is "bottom-up", and/or by higher-order task-specific goals driven by the purpose of the task, which is "top-down". Visual saliency refers to distinct image details the human being notices naturally without a task or purpose at hand, commonly known as free viewing. Visual saliency is believed to drive human fixation during free viewing. Human studies have shown visual saliency is a good predictor of attention during free viewing for both images and videos [48]. The human visual fixation area has been commonly defined as the "conspicuity area," the spatial region around the center of gaze where the target can be detected or identified in the background within a single fixation. The practical value of the visual conspicuity concept was limited by the fact that the associated psychophysical measurement procedures were primarily a manual endeavor, intricate and time-consuming. The research team developed an automated method to track eye movements and correlate them to the conspicuity area. The authors review 12 saliency models and propose an additional model called Multiscale Contrast Conspicuity. All models were tested with standard images, and the psychophysical conspicuity area was determined by the developed method. The Multiscale Contrast Conspicuity estimated gaze area correlated with the actual result at over 0.643, which performed very well when compared to the other 12 saliency models.

In the last decade, commercially available eye-tracking systems have become economically accessible and have allowed easily quantifiable fixation data. [49] performed testing with an eye-tracking system to visually track the regions of interest as determined by the fixation point. The human vision system samples its environment by linking fixation points between saccades, which are fast and sudden movements. Test subjects were shown a two-and-a-half minute movie trailer and their eye motions were tracked. Scene changes were noted to cause large differences in fixation point locations between the test subjects.

Many psychophysical experiments find that viewing quality does not correlate well with PSNR but is significantly influenced by the viewing conditions (e.g., display size and viewing distance) [50]. The author demonstrates that major picture quality evaluation schemes are not suitable for subjective-quality-driven video adaptation, due in part to their inability to track the relationship between viewing condition and viewing quality. The author investigates a novel approach that performs video adaptation, such as modifying distortion algorithms, with respect to the target display scenario, such as a mobile environment, in order to maximize the viewing quality. Viewing ratio (VR) and perceptual quality are translated into a computationally feasible algorithm for video adaptation. Given the amount of mobile video traffic expected in the next decade, video adaptation for the mobile environment can potentially save the precious transmission bandwidth that mobile devices rely on. Subjective testing found that the viewing experience with mobile devices is significantly influenced by the viewing conditions, which include viewing distance, video resolution, display size, and content type. Statistical data show that the viewing distance of mobile video with small displays is usually fixed at arm's length [51]. As a result, display resolution and display size are very important aspects that determine the mobile video viewing experience. VR is defined as the ratio of viewing distance to display height. A low VR means being placed closer to the displayed object. For example, a typical big screen TV has a VR of 3 to 4; cell phone video has a VR of 10. The video sequence distortion can be modified if the VR is known by the transmitting device. The transmitting device can reduce quality for devices with a high VR, such as mobile devices. The authors' subjective testing was able to exploit video encoding parameters when using VR as an input to the encoding parameters.
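The viewing ratio itself is a simple quotient; the device dimensions below are hypothetical examples chosen to reproduce the TV and cell phone VRs quoted above:

```python
def viewing_ratio(viewing_distance, display_height):
    """Viewing ratio (VR): viewing distance divided by display height,
    both in the same units."""
    return viewing_distance / display_height
```

A 2.8-inch-tall phone screen held 28 inches away gives VR = 10, while a 30-inch-tall TV watched from 105 inches gives VR = 3.5, matching the ranges cited above.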

4.1.2.1 H.264 vs. HEVC subjective evaluation

The Joint Collaborative Team on Video Coding (JCT-VC), a joint team between MPEG and ITU, reported subjective test results for 27 test candidates in April 2010 [52]. The purpose was to evaluate the candidates for the next generation video coding standard, HEVC. Two anchor encodings were generated to assist with the test candidate evaluations. The anchor encodings were included in the formal subjective tests and were directed through the same evaluation criteria as the test candidates. The H.264 encoder used for anchor file generation was JM16.2. These anchor reference points were used to define the behavior of current and accepted video encoding technologies for side-by-side comparison with the test candidates. Video resolutions within the tests ranged from 416x240 to 2560x1600. Encoded bit rates ranged from 256 kbit/s to 14 Mbit/s. The video bit rates chosen were dependent on the video resolution, as shown in Table 4. The test methods used within the test sessions were the Double Stimulus Continuous Quality Scale (DSCQS) and Double Stimulus Impairment Scale (DSIS) evaluation methods, as defined by ITU-R BT.500-13 [53]. DSIS test evaluation methods were used for class C (832x480 video sequences), D (416x240), E (1280x720), and the lower bit rates for class B (1920x1080). DSCQS test evaluation methods were used for the higher encoding rates within class B sequences. Results from the evaluation showed a 50% bit rate improvement could be achieved by multiple test candidates while achieving a similar mean opinion score (MOS).

Table 4: Class definition for resolution, frame rate, and bit rates

Class   Resolution   Frame Rate   Bit Rate (min.)   Bit Rate (max.)
A       2560x1600    30           2.5 Mbit/s        14 Mbit/s
B       1920x1080    24-60        1 Mbit/s          10 Mbit/s
C       832x480      30-60        384 kbit/s        2 Mbit/s
D       416x240      30-60        256 kbit/s        1.5 Mbit/s
E       1280x720     60           256 kbit/s        1.5 Mbit/s

[54] objectively and subjectively measured performance and made comparisons between HM5.0 and JM18. Tests were conducted with high-efficiency (HE), low-complexity (LC), and low-complexity combinations of rate-distortion optimized quantization (RDOQ), adaptive loop filter (ALF), and sample adaptive offset (SAO). Coding tools were manipulated to trade Bjøntegaard delta rate (BD-rate) against encoding and decoding time. The study shows video encoded with the HE configuration yielded subjectively indistinguishable results when compared to LC with RDOQ and SAO. Nine video sequences from classes B and C were encoded with QP = 32 and 37 for both random-access and low-delay configurations. The target bit rates were predominantly 500 kbps to 4,000 kbps. The test method was DSIS variant I, which shows the reference and impaired video once to the test subject before voting takes place. Informal subjective tests used HM 5.0 encoded video sequences at half the bit rate of the JM 18.2 encoded video sequences. Observer votes showed that HEVC is preferred over H.264 from 56% to 83% of the time, with the preference percentage depending on the video sequence encoding configuration used for the subjective test.

Additional subjective comparisons were performed by the JCT-VC ad hoc group comparing HM5 with a similarly configured JM encoder/decoder, as reported by Ohm et al. [55]. The goal was to quantify the feasible rate savings that yield similar subjective quality when comparing HEVC and similarly configured H.264. Tests were performed with the nine video sequences for classes B and C. The JM QP settings were 27, 30, 33, and 36. The research team determined the HM QP settings should be four higher than the JM settings; therefore QP_HM = QP_JM + 4, which gives HM QP settings of 31, 34, 37, and 40. Subjective tests were performed using the Double Stimulus Impairment Scale (DSIS) method with the same approach described in [52]. After some linear interpolation on RD graphs, a gross average rate reduction of 67% for class B sequences and 49% for class C sequences was deduced.

Informal subjective tests for 720p and 1080p resolutions, specifically targeting low-delay applications, were performed by Horowitz et al. [56]. H.264/AVC JM version 18.3 and x264 version core 122 r2184 were compared with HEVC (HM version 7.1). Encoders were configured for low-delay operation and 8-bit-per-sample video encoding. The H.264 videos were encoded at double the rate of the HEVC videos. QP was selected to ensure lossy, but still good quality, video. Both quality extremes were avoided: very high quality video would yield experiment results where both video sequences are indistinguishably excellent, and extremely low quality video would make it difficult for the observer to indicate a preference. Results indicate the HM encoded sequences were favored 46% of the time for 720p sequences and 86% of the time for 1080p sequences.

Resolutions beyond HDTV are the main emphasis of the subjective analysis by [57]. Tests were conducted on a high-performance quad full high definition (QFHD) liquid crystal display (LCD). Video sequences were evaluated for spatial information (SI) and temporal information (TI) indexes. Class A video sequences were used, since these sequences have the highest resolution. The test video sequences were also augmented with two other high resolution sequences at a video resolution of 3840x1744 (or higher). Bit rates for the tests ranged from 768 kbps to 20 Mbps. Double Stimulus Impairment Scale (DSIS) variant II, as defined by [53], was the test method, and the test session was divided into two 15-minute sessions with a rest period in between. Test results showed a bit rate reduction of over 50% is achieved with HEVC over AVC for high resolution sequences at equivalent subjective performance.

H.264 vs. HEVC subjective comparisons were performed by JCT-VC and reported in May 2012 [58]. Results were weighted 6:1:1 across YUV PSNR as an imperfect substitute for subjective assessment. The authors mention the subjective results may actually be better than the PSNR-based results reported within the study. Bit rate savings ranged from 22% to 36%, depending on the HEVC configuration used. The tool configurations compared H.264 and HEVC base configurations with minor referencing structure changes. For example, the low-delay configurations, which are low delay P and low delay B, were set for a 1+3 reference structure.

An earlier JCT-VC subjective report [55], released in February 2012, suggests roughly a 49% to 67% bit rate reduction for HEVC over H.264 at similar subjective performance as measured by MOS. Class B (1920x1080) and Class C (832x480) video sequences were measured within this report. A modified version of the JM (i.e., H.264/AVC) software was used that includes non-normative improvements to allow more normalized comparisons with HEVC HM5.0.

4.2 Complexity Reduction

Complexity reduction has been, and remains, a major focus among researchers, since encoding time and resources are heavily consumed by complexity-intensive calculations during the coding phases. Figure 14 shows a processor utilization estimate for the H.264 encoding process as presented by [59].

Figure 14: H.264 processor utilization

Motion estimation and rate distortion optimization take up around 75% of the processing utilization. For mobile devices, a reduction in processing utilization will significantly reduce processing time, which leads directly to reduced battery usage and a longer lasting mobile device. HEVC encoder and decoder complexity assessment is the research focus of [60], where the different HEVC tools are analyzed in terms of performance and computational complexity. In [61], H.264 use in bandwidth-limited environments is dissected. The paper recognizes the need for coding standard enhancements to further improve video compression over limited bandwidths and specifically points out the need for improved motion estimation techniques and coding tools in low bandwidth environments.

Figure 15 shows a simplified hybrid video encoder that is representative of H.264 and HEVC. The items with red dashed rectangles are the focus of the majority of the H.264 and HEVC complexity reduction literature.

Figure 15: Simplified H.264 and HEVC encoder block diagram

4.2.1 H.264

In H.264 video encoding, predictive coding accounts for a significant share of encoding computation. In [62], the focus is complexity reduction of the Discrete Cosine Transform (DCT) and quantization (Q) stages of the H.264 video encoder. The author introduces a prediction algorithm that reduces the redundant computations within the DCT/Q and inverse (i.e., IQ/IDCT) processes. The approach exploits the DCT by reducing the effect of high frequency coefficients, which are typically zeroed out after quantization. This is especially true when the quantization parameters are large, as in low-bit-rate video applications. In addition, the DCT coefficients can be represented more loosely, since the transformed coefficients will be quantized coarsely with a large Q factor.
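The coarse-quantization effect exploited here can be sketched in a few lines of Python: with a large quantization step, the high-frequency DCT coefficients of a smooth block quantize to zero, so an encoder that predicts this can skip computing them. The block values and step size below are invented for illustration, and the unnormalized DCT is a simplification of the integer transforms used in practice.

```python
import math

def dct_1d(block):
    """Naive, unnormalized 1-D DCT-II of a sample block (illustration only)."""
    n = len(block)
    return [sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(block))
            for k in range(n)]

def quantize(coeffs, qstep):
    """Uniform quantization; a coarse step zeroes out small high-frequency terms."""
    return [round(c / qstep) for c in coeffs]

# A smooth 8-sample block: energy concentrates in the low-frequency coefficients.
block = [52, 54, 57, 60, 62, 63, 64, 64]
levels = quantize(dct_1d(block), qstep=40)

# With a large quantization step the trailing (high-frequency) levels are zero,
# so an encoder that predicts this can skip computing them entirely.
print(levels)
```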

An encoding strategy sensitive to the needs of wireless services is presented by [59]. The author uses knowledge of the video context from the video sequences. Unimportant regions in the frames are isolated and unnecessary processing is avoided. Therefore, battery power consumption can be reduced through H.264 complexity reduction while maintaining reasonable frame quality and low bit rate. User input and prior knowledge of the context are taken into consideration in deciding a frame's relevance and the significance of each section within the frame. Each frame throughout the sequence is segmented into non-overlapping regions of varying significance. The most significant areas are the foreground, and the remaining areas are the background. For example, in video-conferencing applications, the speaker's head-and-shoulders region is the foreground and the remaining frame is the background. Using this method, the complexity is reduced by more than 40%.

An H.264 reduction algorithm for search window sizes within ME is proposed by [63]. This algorithm decreases encoder complexity by reducing the number of sum of absolute difference (SAD) calculations, which is beneficial in low complexity sequences such as video surveillance and video telephony. Encoder complexity reduction is achieved by a binary assessment against a given motion threshold, which yields a block-wise difference image based on blob coloring. The current frame is compared with the latest I-frame, block by block. If the image difference within a block is greater than the threshold, the block is set as active; otherwise the block is set as inactive. The active area uses a larger search window for ME, while the inactive area uses a smaller search window, as shown in Figure 16. The algorithm reaches an encoding time savings of 50% to 60% with a PSNR decrease of less than 0.05 dB and a bit stream size increase of 0.09% or less.

Figure 16: Active and inactive areas in binary form
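The binary classification driving the search window choice can be sketched as below; the mean-absolute-difference metric, threshold, and window sizes are assumptions for illustration rather than the paper's exact values.

```python
def block_activity(cur_block, iframe_block, threshold=10):
    """Mark a block 'active' when its mean absolute difference from the
    co-located block in the latest I-frame exceeds a threshold (assumed metric)."""
    mad = sum(abs(a - b) for a, b in zip(cur_block, iframe_block)) / len(cur_block)
    return mad > threshold

def search_window(active, large=32, small=8):
    """Active blocks get a large ME search window; inactive blocks a reduced one."""
    return large if active else small

static = block_activity([100] * 16, [101] * 16)  # barely changed -> inactive
moving = block_activity([100] * 16, [160] * 16)  # large change   -> active
print(search_window(static), search_window(moving))
```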

Machine learning techniques for optimization of a low complexity H.264 encoder are presented by [64]. The approach is to make computationally expensive encoder decisions, such as the MB coding mode, using features derived from uncompressed video. A machine learning algorithm is used to obtain a classifier decision tree based on such features. The decision tree is trained, and the encoder coding mode decisions, which usually evaluate all possible coding options, are replaced with the decision tree. The author proposes a three-level tree topology for the Inter mode decision. The first level decides among an early Skip decision, Intra 16x16, and the rest of the modes. The second level separates Inter 8x8 from the remaining Inter 16x16 modes and sub-modes. Finally, the third level decides among the remaining Inter modes and sub-modes. Figure 17 shows the decision tree.


Figure 17: Decision tree for MB encoding

In [65], the authors propose a complexity reduction in macroblock mode selection. Macroblock modes inter8×8 and intra4×4 have the highest complexity, and these are the focus of the proposed methods. Two complexity reduction methods, for inter8×8 and intra4×4, use the costs of the other macroblock modes. For inter8×8 reduction, the costs of inter macroblock modes increase or decrease according to block direction. With this assumption, the selectable sub-macroblock modes are reduced by using the MV costs and reference costs of inter16×16, inter16×8, and inter8×16, as shown in Figure 18. If 8x8 DCT is not selected, then intra4×4 is compared with intra16x16. If the difference between the RD costs of intra16x16 and intra4x4 is small, per the authors' cost computation, then the remaining RD cost computation of intra4×4 is performed. Simulation results showed the methods achieved 57.7% total encoding time savings with a PSNR decrease of 0.05 dB.

Figure 18: Restriction of selectable sub-macroblock modes

An H.264 power-aware complexity reduction method was presented by [66].

The encoding complexity is adapted depending on available power: more available power allows the use of more complex motion estimation modes. The algorithm has two main elements: the first is region of interest (ROI) determination, and the second is the mode set allowed for the ME computation, as shown in equation (1). The ROI is motion-based; motion above a threshold gives a block an ROI value of 1, otherwise the ROI value is set to zero. When power is freely available, all mode sets are available to any ROI value, which yields the highest coding quality. As power availability decreases, the mode set is reduced for low ROI blocks. At medium available power, low ROI value blocks can only select from mode sets 0 and 1 for ME. At low available power, low ROI value blocks will use only mode set 0. For low available power, encoding time savings of over 50% with a PSNR loss of less than 0.4 dB was achieved.

Mode Set = { ModeSet 0 (low complexity modes); ModeSet 1 (medium complexity modes); ModeSet 2 (high complexity modes) }    (1)
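A minimal sketch of this power-aware selection, assuming three power levels and the ROI rules described above; the partition names inside each set are hypothetical, and the function returns the highest mode set index allowed.

```python
# Hypothetical mode sets mirroring equation (1); the H.264 partition names in
# each set are assumptions, ordered by increasing ME complexity.
MODE_SETS = {
    0: ["SKIP", "16x16"],                         # ModeSet 0: low complexity
    1: ["SKIP", "16x16", "16x8", "8x16"],         # ModeSet 1: medium complexity
    2: ["SKIP", "16x16", "16x8", "8x16", "8x8"],  # ModeSet 2: high complexity
}

def mode_set_index(power, roi):
    """Map available power and the block's ROI flag to the highest allowed
    mode set index. ROI blocks (roi == 1) always keep the full set; background
    blocks are throttled as power drops."""
    if power == "high" or roi == 1:
        return 2
    return 1 if power == "medium" else 0

print(mode_set_index("high", 0), mode_set_index("medium", 0), mode_set_index("low", 0))
```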

A variable ME search area is proposed by [67]. The basic premise is that the sum of absolute differences (SAD) between the current block and blocks from the five previous reference frames is computed at the zero motion vector and compared with a threshold. If below the threshold, full search (FS) is used; otherwise a three step search is used during ME, as shown in Figure 19.


Figure 19: Flexible search method flow chart
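The threshold test can be sketched as follows; the block data, threshold value, and the use of a minimum over the five reference frames are assumptions standing in for the paper's exact comparison.

```python
def sad(a, b):
    """Sum of absolute differences between two co-located blocks."""
    return sum(abs(x - y) for x, y in zip(a, b))

def choose_search(cur_block, ref_blocks, threshold=200):
    """Pick the ME strategy from the zero-motion-vector SAD against the
    previous reference frames, as described for [67]."""
    zero_mv_sad = min(sad(cur_block, ref) for ref in ref_blocks)
    return "full_search" if zero_mv_sad < threshold else "three_step_search"

refs = [[10] * 64 for _ in range(5)]       # five previous reference frames
print(choose_search([10] * 64, refs))      # static content
print(choose_search([90] * 64, refs))      # fast-moving content
```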

A study conducted by [68] analyzed permissible perceptual distortions and assigned a heavier weighting to the regions that are perceptually less sensitive to human vision. The main concept of the proposed speed-dependent motion-estimation algorithm is that computational time savings occur as it assigns a larger intermode value to regions that are perceptually less sensitive to distortion. The proposed model aims to reduce computation time by skipping certain MBs in perceptually less sensitive areas while maintaining RD performance. The human visual system must be significantly accounted for in understanding perceptual video coding: human beings cannot perceive tiny variations in visual signals because of the human visual system's psycho-visual properties. The proposed algorithm uses motion vectors to determine the speed of the object within the block, and the quality of the block is adjusted according to predetermined thresholds. The authors' proposed algorithm can reduce the computational complexity of motion estimation by up to 47.16% while maintaining high compression efficiency.

4.2.2 HEVC

HEVC adopts a well-known Rate Distortion Optimization (RDO) model [69] that tries to achieve a balance between quality, complexity, and coding efficiency. HEVC encoder and decoder complexity under the HEVC base configurations has been assessed in an ITU-reported research study [60], where different HEVC tools are analyzed in terms of performance and computational complexity on different hardware platforms. RDO reaches the optimal partitioning by evaluating all combinations of CU size, prediction unit (PU), and TU. This approach increases encoder computational complexity and makes real time encoding implementations more difficult, especially for portable and mobile devices where power consumption is one of the key factors. To reduce the RDO complexity, some fast algorithms have been published, mainly focused on reducing the number of coding block (CB), prediction block (PB), and TU sizes to evaluate. Some of these algorithms are discussed within this section.
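The brute-force RDO that these fast algorithms try to avoid amounts to minimizing the Lagrangian cost J = D + λR over every candidate partition. A minimal sketch, with invented distortion/rate numbers and an arbitrary λ:

```python
def rd_cost(distortion, rate, lmbda):
    """Lagrangian rate-distortion cost J = D + lambda * R used by the RDO model."""
    return distortion + lmbda * rate

def best_partition(candidates, lmbda=4.0):
    """Exhaustive RDO: evaluate every (partition, D, R) candidate and keep the
    minimum-cost one. The candidate numbers are made up for illustration."""
    return min(candidates, key=lambda c: rd_cost(c[1], c[2], lmbda))[0]

candidates = [
    ("64x64", 900.0, 20.0),   # (partition, distortion, rate in bits)
    ("32x32", 500.0, 60.0),
    ("16x16", 450.0, 150.0),
]
print(best_partition(candidates))
```

The fast algorithms below keep this cost function but prune the candidate list before it is evaluated.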

In [70], HEVC coding complexity is reduced by limiting HEVC options and recording the bit rate delta. Several tool modifications were analyzed, including (1) reducing angular intra prediction to eight directions, similar to H.264, (2) limiting maximum CU size, (3) limiting maximum TU size, (4) changing intra mode coding approaches, and (5) enabling/disabling SAO. The largest bit rate impacts, which can be similarly stated as coding efficiency losses, were from the angular prediction, CU size, and TU size limitations. Minimal bit rate impacts were observed for the remaining changes. HEVC coding advantages are greater at higher resolutions and where strong directionalities exist within the video sequence. Strong directional textures are noted for sequences with large homogeneous regions, which allow effective use of HEVC 64x64 block sizes with accurate prediction. Video sequences that take advantage of this are Kimono, Johnny, and Kristen and Sara. The last two are Class E, which is video conferencing content with large static background areas and regular motion from talking people in the foreground. The former, Kimono, is a sequence where the background is non-moving but the camera is panning. This indicates that large CU sizes, such as 64x64, can be used with a motion vector to accurately estimate the background CU.

In another paper, transform skipping within HEVC is explored [71]. The main emphasis is exploring the transform tool advancements from H.264 to HEVC. The video sequences in the study include both camera and graphical content, such as computer generated material shared over the internet. The usual implementation of the 1D or 2D transform is based on integer implementations of the DCT and DST. HEVC has adopted the DST for intra residuals of certain directional predictions. However, certain types of residual can benefit from skipping the transform step entirely, as presented by the author. HEVC TU intra coding consists of several parts. First, the reconstructed pixels are used to predict pixels in a specific direction. Then, the corresponding TU transform is applied to the prediction residue, followed by quantizing the transform coefficients and coding the quantized coefficients with CABAC. Lastly, the quantized coefficients are reconstructed and used to predict later TUs. The author performed tests where 2D, 1D, or no TU transforms were used, as shown in Figure 20. Savings of up to 30% BD-rate were observed when TU transforms were skipped. The largest gains were observed for intra and low complexity configurations.


Figure 20: Transform choices by transform skip mode
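The per-TU choice among the transform options can be sketched as picking the minimum-cost mode; the cost numbers below are invented and simply illustrate why screen content with sharp edges may favor skipping the transform.

```python
# Transform options evaluated per TU, following the choices described for [71]:
# full 2-D, horizontal-only 1-D, vertical-only 1-D, or no transform at all.
TRANSFORM_MODES = ("2D", "1D_horizontal", "1D_vertical", "skip")

def pick_transform(costs):
    """Choose the TU transform mode with the lowest (hypothetical) RD cost."""
    return min(costs, key=costs.get)

# Invented costs for a screen-content TU with sharp graphical edges.
costs = {"2D": 120.0, "1D_horizontal": 110.0, "1D_vertical": 115.0, "skip": 90.0}
print(pick_transform(costs))
```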

TU coefficient analysis is performed by [72]. The emphasis is on weighting the quantization of TU coefficients. HEVC transformed coefficients are equally quantized per the selected QP. The author argues this is inefficient, since the TUs are not equally distributed within the CTU. The method researched weights the TU quantizer depending on the level (depth) at which the TU resides within the CTU. This Nth-level quantization increases the quantization factor as scanning progresses within the TU. The authors' proposed method showed a bit rate gain of 0.3 to 0.6%.

Deblocking filtering and decisions were investigated as potential encoding time savings by [73]. Deblocking is conceptually the same in H.264 and HEVC: the CB boundaries are compared for hard spatial changes on both sides of the boundary. A major change is that H.264 applies deblocking on a 4x4 grid, while HEVC applies it on an 8x8 grid. When deblocking is turned on, the edges are smoothed to deemphasize the spatial change. Typically, the luma CB is modified to perform the boundary smoothing. Care must be taken with deblocking, since the smoothing effect can significantly enhance or adversely affect subjective quality. Desired details with a small pixel footprint can be deblocked (i.e., smoothed out) to the point where the desired detail is not as discernible. The author performed objective and subjective tests with deblocking enabled and disabled. Bit rate increased from 1.3% to 3.4%, dependent on the configuration used, when deblocking is turned on vs. off. Low delay configurations tend to have a higher bit rate; the author comments this is due to the low delay configurations having one intra-coded frame at the beginning of the video sequence. The most noticeable subjective effect was with high QP (QP=37) encoding of "Kristen and Sara". Deblocking improves the subjective quality in both small and large CBs.

A unique approach to video coding was proposed by [74], [75]. The authors' concept is to address decoding techniques that can be used in lower bit rate environments to reduce mobile device power consumption without impacting subjective quality, and then to apply the resulting requirements to the encoder. The core component of the approach is to decompose the video sequence into two spatial elements with significantly different quality. This leads to, as the authors define them, a "low resolution" and a "high resolution" component. The video frame is composed of a checkerboard pattern alternating between the "low" and "high" resolution components, as shown in Figure 21.

Figure 21: Decomposition using checkered pattern
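Generating the checkerboard assignment is straightforward; the sketch below alternates a fine and a coarse quantizer over the block grid, with QP values chosen arbitrarily for illustration rather than taken from [74], [75].

```python
def checkerboard_qp(blocks_w, blocks_h, qp_fine=22, qp_coarse=37):
    """Assign alternating 'high resolution' (fine QP) and 'low resolution'
    (coarse QP) quantizers in a checkerboard pattern over the block grid."""
    return [[qp_fine if (x + y) % 2 == 0 else qp_coarse
             for x in range(blocks_w)]
            for y in range(blocks_h)]

qp_map = checkerboard_qp(4, 2)
for row in qp_map:
    print(row)
```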

Leveraging the checkerboard decomposition within the encoder requires a method to code blocks with profoundly different quantizers for the "low" and "high" resolution components. The use case is HD video (1080p) or higher resolution decoded by a mobile device. The mobile device incurs a longer decode time, which adversely impacts battery powered mobile devices. In addition, on mobile devices with smaller screen sizes, user perception will not benefit from HD or higher resolutions; however, HD (or higher) decoding capability, and its decoding time overhead, must still be maintained.

Figure 22: Template and block matching vectors

Cho and Kim [76] proposed an HEVC algorithm for Intra frames that performs early decisions for CU splitting or pruning. This is interesting, since most other conference papers focus on only one of the complementary methods, either CU splitting or CU pruning. Early CU splitting and pruning decisions are made with the Bayes decision rule based on RD costs. The decision model is updated regularly, from the previous frames, to adapt to the video sequence. Prediction coding for Intra frames is based on two processing operations: CU splitting is a depth-first, top-down approach, while CU pruning is performed in a depth-first, bottom-up manner. Both methods occur during the encoding process and are depicted in Figure 23. The HEVC computational complexity of intra prediction coding comes from CU splitting, which computes full RD costs for candidate intra prediction modes of CUs at all depth levels. To help with computational complexity reduction, HEVC allows skipping the full RD cost computation for a CU and even allows terminating the subsequent CU splitting and pruning process. The authors' proposed method uses a Bayesian decision, with the statistical parameters of known or estimated random variables, within the RD cost phase to expedite the decision process. The experimental results show an encoding speedup of 50.2% with just a 0.6% BD-rate increase, and also a 63.5% speedup with a 3.5% BD-rate increase.


Figure 23: CU splitting and pruning process
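The Bayes decision rule at the heart of such a method can be sketched as comparing class posteriors for "split" vs. "don't split" given an observed RD cost; the Gaussian models, priors, and numbers below are invented stand-ins for the statistics the authors estimate online from previous frames.

```python
import math

def gaussian(x, mean, var):
    """Gaussian likelihood used to model RD cost under each hypothesis."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def decide_split(rd_cost, stats):
    """Bayes decision rule sketch: split the CU when the posterior for 'split'
    exceeds that for 'keep'. All statistics here are illustrative."""
    p_split = gaussian(rd_cost, *stats["split"]["rd"]) * stats["split"]["prior"]
    p_keep = gaussian(rd_cost, *stats["keep"]["rd"]) * stats["keep"]["prior"]
    return p_split > p_keep

stats = {
    "split": {"rd": (800.0, 200.0 ** 2), "prior": 0.4},  # high-cost CUs tend to split
    "keep": {"rd": (300.0, 150.0 ** 2), "prior": 0.6},   # low-cost CUs tend to stop
}
print(decide_split(250.0, stats), decide_split(900.0, stats))
```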

A nice analysis and proposed complexity reduction is presented by [77]. The analysis is between the HEVC configurations low delay P-frames (LDP) and low delay B-frames (LDB). The main trade-offs between the two configurations are between computational complexity, coding efficiency, and Bjøntegaard delta bit rate (BD-rate). LDP has lower computational complexity and higher coding efficiency; however, the BD-rate is lower by roughly 6%. The author calculated LDB and LDP mode statistics for two test sequences, Kimono and BQTerrace, with a quantization parameter (QP) of 27. Inter frame data such as uni-prediction merge mode, bi-prediction inter mode, and uni-prediction inter mode were captured. More than 50% of modes are bidirectional prediction in LDB. The author proposed a feature-reduced version of LDB, where some of the precision motion compensation algorithms are removed. The result is a configuration with performance between LDP and LDB. The main advantage of the proposed configuration is that it retains the bi-directional feature but with a lower complexity algorithm, which helps reduce computational complexity and reduces the need for precious hardware resources, such as memory space.

A Context-based Adaptive Binary Arithmetic Coding (CABAC) enhancement in the bypass mode section is proposed by [78] as a complexity reduction without loss of compression performance. CABAC is an entropy coder adopted by the H.264 coding standard, and it obtains high compression efficiency by using probability estimation. However, CABAC is also a main source of decoder complexity and processing time. The author presents an alternate bypass mode called pass through mode, which is less complex coding. In the pass through mode, the probability estimation process and arithmetic coding process are skipped, so processing steps are reduced. The main emphasis is to simplify the binary symbol (bin) strings; each bin is then encoded using the binary arithmetic coder in either regular or bypass encoding mode. Bins identified with a probability of 0.5, indicating that 0 or 1 will occur with equal probability, are encoded by the reduced complexity arithmetic coding mode.

In [79], the author presents a faster intra prediction mode decision algorithm. The algorithm takes into account the neighboring PUs' modes and uses edge information of the current PU to choose a reduced set of directions. The algorithm selects the 9 most often used directional modes from previous PUs and only uses those for intra prediction evaluation. The reduced directional set consequently makes the HEVC intra prediction mode decision computationally more efficient; however, there is a PSNR quality degradation. There is a strong suspicion that the reduced direction set will closely resemble the H.264 direction set; the author's example was a reduced set representative of the H.264 directional set. The proposed method decreased intra prediction processing time by up to 32.08%, with a bit rate increase of 0.9% (on average) and a 0.02 dB reduction in PSNR values.

A fast algorithm for sub-pixel motion estimation is developed by [80] as a complexity reduction method. The author's algorithm approximates the error surface of the sub-pixel position and predicts the minimum point by minimizing the function. This is followed by a second order approximation within a smaller area to predict the best sub-pixel position. The typical ME process has two notable stages: integer-pixel search within a search area, and sub-pixel search around the best integer pixel position. The direct way to determine the optimal position is the full search (FS) algorithm. FS checks all points within the search range and selects the best point. However, FS computational complexity is undesirable and leads to very long encoding times. The author proposed several error surface models ranging from 5-term to 9-term error models, eventually settling on the 5-term model for the first order approximation and the 6-term model for the second order approximation. The proposed method's test results showed a PSNR reduction of 0.04 dB or less and reduced encoding time by 19% to 67%.
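A common closed form for this kind of first order approximation fits a separable 5-term surface E(x, y) = c0 + c1·x + c2·y + c3·x² + c4·y² through the centre cost and its four integer neighbours, then solves for the minimum; the exact model terms in [80] may differ, so treat this as a sketch with invented cost samples.

```python
def subpel_minimum(e_c, e_l, e_r, e_t, e_b):
    """Fit E(x, y) = c0 + c1*x + c2*y + c3*x^2 + c4*y^2 through the centre
    cost e_c and its left/right/top/bottom integer neighbours, then return
    the surface minimum as the predicted sub-pixel offset."""
    c1 = (e_r - e_l) / 2.0
    c3 = (e_r + e_l - 2.0 * e_c) / 2.0
    c2 = (e_b - e_t) / 2.0
    c4 = (e_b + e_t - 2.0 * e_c) / 2.0
    # Minimum of the separable parabola in each direction: x* = -c1 / (2*c3).
    return (-c1 / (2.0 * c3), -c2 / (2.0 * c4))

# Invented SAD samples around the best integer position (centre is lowest).
x_off, y_off = subpel_minimum(e_c=100.0, e_l=140.0, e_r=120.0, e_t=130.0, e_b=130.0)
print(round(x_off, 3), round(y_off, 3))
```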

In [81], a fast decision method to reduce the encoder complexity of high efficiency video coding was proposed: an early detection of SKIP mode at the CU level based on the differential motion vector (DMV) and coded block flag (CBF). SKIP mode has a high probability of occurrence; therefore it is preferable to detect SKIP mode as early as possible. The test video sequences had resolutions ranging from 416x240 to 1920x1080. The SKIP mode occurrence probability averaged from 0.817 to 0.843, depending on the configuration. The early detection of SKIP mode utilized the DMV and CBF of the inter 2Nx2N mode. The method selects the best inter 2Nx2N mode as the one with the minimum RD cost. For the best inter 2Nx2N mode, if the DMV is equal to (0,0) and the CBF is equal to zero, then the best mode is SKIP mode and the remaining PU modes are not searched further. Experimental results show that the encoding complexity can be reduced by up to 34.55% in the random access (RA) configuration and 36.48% in the low delay (LD) configuration, with only a small bit rate increase of 0.4%. A similar SKIP mode algorithm is offered by [82].
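The early SKIP test described for [81] is a two-condition check and can be sketched directly; the dictionary layout for the best inter 2Nx2N result is an assumption for illustration.

```python
def early_skip(best_2nx2n):
    """Early SKIP detection per [81]: after testing inter 2Nx2N, declare SKIP
    (and stop searching the remaining PU modes) when the differential motion
    vector is (0, 0) and the coded block flag is zero (no residual)."""
    return best_2nx2n["dmv"] == (0, 0) and best_2nx2n["cbf"] == 0

static_cu = {"dmv": (0, 0), "cbf": 0}   # perfectly predicted, no residual
moving_cu = {"dmv": (2, -1), "cbf": 1}  # needs motion refinement and residual
print(early_skip(static_cu), early_skip(moving_cu))
```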

Choi, Park, and Jang [82] provided early HEVC studies on early termination, presented at the International Telecommunication Union's (ITU) Joint Collaborative Team on Video Coding (JCT-VC) conference. The body of work suggested opportunities in determining the best SKIP mode for CU early termination. Choi proposed coding tree pruning based on SKIP mode detection: if SKIP mode is selected as the best prediction mode, then no further processing of smaller CB sizes is performed. Depending on the content, the research showed that well over 90% of CU depth selections can be skipped. A conditional probability is placed on the depth selection, and the determination to split the CU is guided by this probability. Results show a 42% encoding time reduction with a luma PSNR gain of 0.6% or less.

A complexity reduction method that speeds up decision making within the RDO process is proposed by [83]. The proposed method is a fast RDO algorithm based on two techniques: (1) Top Skip, which selects the initial CU size, and (2) Early Termination, which limits the smaller CB sizes. This reduces the encoding time by 40% with a bit rate increase of 2%. During the RDO process, the encoder tests all the possible coding modes and block partitions and keeps those providing the smallest rate distortion (RD) cost. The large number of available modes and partitions leads to a high computational cost, which is time consuming and may not be suitable for real-time applications. For Top Skip, the larger CU sizes are avoided by selecting a starting CTB depth (higher than zero), corresponding to a given level of CU quad-tree splitting, which takes its cue from the previous frame. Top Skip selects the starting depth by observing that there exists a high correlation between the minimum depth of the current Coding Tree Block (CTB) and that of the co-located CTB in the previous frame. Therefore, the depth starting point is the same as the previous frame's resultant depth, if the previous frame's QP is the same. The Early Termination technique stops the CU splitting process when the best RD cost is already lower than a given threshold. This means that an acceptable coding cost has been obtained, and continued searching for smaller CUs may only slightly improve the RD performance; checking smaller CU sizes is thus avoided when they are not likely to be selected by the brute force RDO process. The value for the threshold is adaptively computed and trades off complexity reduction against negligible RD performance loss. In addition, the threshold computation exploits both the spatial and temporal correlations in the video inter-frames using a Gaussian weighting function.
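The two shortcuts can be sketched together as below; the depth costs are invented, and the fixed threshold stands in for the adaptively computed, Gaussian-weighted threshold of the paper.

```python
def rdo_with_shortcuts(rd_cost_at_depth, colocated_min_depth, max_depth=3,
                       early_stop_threshold=500.0):
    """Sketch of the two shortcuts in [83]: Top Skip starts the quad-tree
    search at the co-located CTB's minimum depth from the previous frame, and
    Early Termination stops splitting once the best RD cost falls below a
    threshold (fixed here; adaptively computed in the paper).

    rd_cost_at_depth: callable depth -> RD cost of the best CU at that depth."""
    best_depth, best_cost = None, float("inf")
    for depth in range(colocated_min_depth, max_depth + 1):  # Top Skip
        cost = rd_cost_at_depth(depth)
        if cost < best_cost:
            best_depth, best_cost = depth, cost
        if best_cost < early_stop_threshold:                 # Early Termination
            break
    return best_depth, best_cost

# Invented cost curve: depth 2 is already acceptable, so depth 3 is never tested.
costs = {0: 2000.0, 1: 900.0, 2: 450.0, 3: 430.0}
print(rdo_with_shortcuts(costs.get, colocated_min_depth=1))
```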

Another RDO scheme uses a decision algorithm that dynamically adjusts the depth of the CU quad-tree structures, as proposed by [84]. The authors' proposed method is similar to earlier papers presented in this section, where computational complexity is reduced by trimming the CU options within the RDO process. As with earlier papers, the aim is to eliminate the maximum number of CUs tested during the RDO process, avoiding the processing of CUs at large tree depths, which was found to have a high complexity cost that brings little encoding gain. When frames are being encoded, the algorithm does not allow the RDO process to test all the possible optimization possibilities. Instead, the algorithm uses the information in a history table of RDO limits, restricting the current frame's RDO process in each 64x64 area to the maximum tree depth saved in the history table. If the limit is reached, then the search is completed and the current results are used. Since surrounding CUs in neighbor frames tend to have similar maximum depths, it is expected that coding efficiency will improve and RDO results will not be much affected. The experimental results showed a potential complexity reduction of 40% to 80%, with a PSNR drop lower than 0.8 dB and a bit rate increase of less than 5.7%.

A similar RDO complexity reduction approach was taken by the same authors in [85] and [86]; however, the history table is limited to neighboring CTB depths. This led to encoding time savings, while the PSNR was not nearly as degraded. Experiments showed a computational complexity reduction of 40% with a PSNR drop of 0.1% and a bit rate increase of 3.5%.

Motion vector merging (MVM), proposed by Sampaio et al. [87], prunes the PU partition size decision by checking the neighboring CUs that border the current CU. The left CU and above CU are queried for the PU shapes 2Nx2N, 2NxN, and Nx2N. Each query is analyzed and a heuristic defines whether the partitions can be merged or not. If merging occurs, the rate-distortion cost is evaluated for the decided PU partition, producing sufficient information for the MVM decision. When identical motion vectors are calculated for all NxN, 2NxN, or Nx2N partitions and the neighbor CU, the current PU is selected for encoding. On average, 2Nx2N PU partitions are chosen in 85% of the decisions. The asymmetrical PU partitions, 2NxN and Nx2N, are the best option in approximately 7% of cases each. The NxN PU partition is relegated to use only with 8x8 CU sizes, corresponding to being selected only 0.2% of the time. This proposed method reduced execution time by 34% with losses of less than 0.08 dB.

A CU splitting early termination algorithm is proposed by Shen and Yu [88], where the CU splitting optimization in HEVC is formulated as a binary classification problem and solved by support vector classification. It embeds the model training into the predictive model selection process with a simple greedy search. For the predictive model, features useful for building a good predictor were chosen by two types of feature selection approaches: filter and wrapper approaches. The wrapper method was based on the F-score; the filter methods, based on correlation or mutual information ranking, are easy to implement. However, selecting the most relevant variables is usually sub-optimal for building a predictor, especially when the variables are redundant. The proposed algorithm performed well across different configurations and various video contents. The CU splitting early termination model was trained offline, and the proposed algorithm was computationally simple. Experimental results showed a 44.7% reduction in computational complexity with a 1.35% BD-rate increase for the "random access, main" configuration, and a 41.9% complexity reduction with a 1.66% BD-rate increase in the "low delay, main" configuration.

A CU size decision method is proposed by Shen et al. [89], where different methods are tested to resolve the optimal CU size. All methods lead to adapting the maximum depth allowed. The first method checks for motion homogeneity by querying the above and left neighbor CUs' motion vector (MV) X and Y components. The difference between the current CU's MVs and the neighbor MVs is compared against a threshold. When the motion homogeneity measure is smaller than the threshold, the current CU is considered to have homogeneous motion; otherwise the CU is considered to have complex motion. The threshold is set to tolerate minor noisy MVs in a homogeneous motion region. The second method is based on RD cost checking. Spatial and temporal neighboring CTUs usually show a similar RD cost distribution; therefore, the RD cost based correlation is used to determine the early termination threshold. When the RD cost of the current CU size is smaller than the calculated threshold, the next depth level's motion estimations can be skipped. The third method is based on skip mode checking. The algorithm introduces skip mode checking to avoid unnecessary ME on smaller CU sizes, accomplished by utilizing the prediction mode information in the upper depth level and the current depth level. Typically, choosing a small CU size results in a lower energy residual after motion compensation but requires a larger number of bits to signal the MVs and prediction type. Smooth and slow motion can be predicted more accurately by using a larger CU size. Selecting skip mode as the best prediction mode for the current CU size indicates that the current CU is located in a region with homogeneous motion or a static region. This should result in a lower energy residual after motion compensation compared to other prediction modes; thus, no further processing of sub-CUs is necessary. Experimental results showed a 28% to 52% reduction in computational complexity with a 0.90% to 3.63% BD-rate increase.

Leng et al. [90] presented a method which skips ME at some depths depending on data from the co-located CU in the previous frame and neighbor CUs. The method exploits similarities across several consecutive frames: features such as the QP, moving speed, and resolution stay the same, and the detailed and homogeneous parts stay the same as well. This indicates there are mode correlations among consecutive frames, which allows skipping specific depths that were rarely used in the previous frames for all the CUs in the current frame. No more than two depth levels were skipped for any given CU. The current CU depth starting point can be set to a new depth as indicated by the previous frames' depth usage. This method provided an average of 45% time savings with a PSNR loss of 0.11 dB or less.

The table below summarizes the complexity reduction methods discussed within this section. The figure following the table shows the major areas of discussion within the encoder block diagram.

Table 5: Complexity reduction reviewed research
(Columns: Cite | STD | Type (Mobile / Intra-Frame / Inter-Frame) | Manipulation | Method | Description)

[62] | H.264 | x x | T/Q | Q | Faster resolving transform. Remove high frequency quicker.
[59] | H.264 | x x x | ME/T | Search/Tree Pruning | Pre-determine foreground and background. Reduce options for background.
[63] | H.264 | x | ME | Search | Search window size change due to threshold.
[64] | H.264 | x | MODE | Tree Pruning | Use decision tree for ME mode.
[65] | H.264 | x | MODE | Tree Pruning | Macroblock mode selection for inter8×8 and intra4×4.
[66] | H.264 | x x | MODE | Tree Pruning | Power-aware. ROI determines mode selection. ROI determined per MV.
[67] | H.264 | x | ME | Search | Full search or three step search depending on 5-reference-frame SAD threshold.
[69] | H.264 | x x | RDO | RDO function | Mainly an RDO model for GOP. Some ME and mode changes.
[70] | HEVC | x x | ME/MODE/Config. | MV limit/Tree Pruning/SAO | Intra block coding only. Restricted angular directions. Varied depth searches. Configuration limits.
[71] | HEVC | x x | T | Tree Pruning | TU method for 2D, 1D, or skip.
[72] | HEVC | x x | T/Q | Q | Faster resolving transform by varying Q by the CU depth.
[73] | HEVC | x x | T | Other | De-blocking filter to intra luma block in PU or TU.
[74]/[75] | HEVC | x x x | T/Q | Q | Use high/low resolution quantizer in checkerboard pattern.
[76] | HEVC | x | MODE | Tree Pruning | Intra frame only. CU tree splitting/pruning using Bayesian decision model.
[77] | HEVC | x | ME | Tree Pruning | Low delay B-frame with reduced set.
[78] | HEVC | x x | Entropy | Other | CABAC enhancement in bypass mode.
[79] | HEVC | x | ME/MODE | Tree Pruning | Derive from neighboring PU edge. Reduce directions from 33 to 9.
[80] | HEVC | x | ME | Math Model | Simple sub-pixel ME equation.
[81] | HEVC | x | MODE | Tree Pruning | Early skip mode detection per motion vector threshold.
[82] | HEVC | x | MODE | Tree Pruning | Early skip mode detection per motion vector threshold.
[83] | HEVC | x x | MODE | Tree Pruning | Top Skip and Early Termination.
[84]/[91] | HEVC | x x | MODE | Tree Pruning | RDO complexity reduction ratio limit and previous frame history.


Figure 24: Literature research on hybrid coding map

5 HEVC AND H.264 SUBJECTIVE EVALUATION

This chapter compares the quality of H.264 and HEVC encoded video in low bandwidth mobile environments. In this study, the focus within the mobile environment is smart phones. The key characteristics of a smart phone are a smaller screen size, usually 3.5 to 5.0 inches diagonal for high-end smart phones, and typical cellular network bandwidth of 3G or faster. Subjective evaluations were conducted to evaluate the user experience on a mobile device with a small screen size and video coded at 200 and 400 Kbps. The studies showed compelling evidence that a user's experience in low bandwidth mobile environments is very similar between HEVC and H.264. The results suggest the benefits of HEVC over H.264 in a mobile environment with lower video bitrates and resolutions are not as clear.

5.1 Background

The mobile compute environment has evolved rapidly in the last few years and smart phones have penetrated the consumer market extensively. Smart phones are being used as extensions of consumer electronics devices such as TVs, Blu-ray players, and audio receivers. Smart phone display performance has progressed significantly and cellular network bandwidth has improved in recent years. This has allowed streaming multimedia adoption in these traditionally loss-prone environments [92]. Display technologies in the mobile market space have benefited from strong design investment by smart phone manufacturers and significant research and development by liquid crystal display (LCD) manufacturers. This has enabled mobile LCDs to improve steadily in performance aspects such as: (a) resolution, (b) power consumption, and (c) viewing angles. Mobile phone connectivity also benefits significantly from wireless infrastructure improvements, from WiFi availability at home, work, and business locations to cellular network technology improvements with the increasing bandwidth provided by 3G and LTE.

The evolution of encoding methods from H.264 to HEVC optimizes visual quality for larger resolution images, especially Ultra High Definition [20], with the main beneficiaries being internet and broadcast networks [93]. However, HEVC is also expected to provide compression gains over H.264 in the mobile environment [92]. The significance of these gains for mobile devices playing low bitrate video has not been studied. The main goal of this work is to evaluate the subjective quality of HEVC and H.264 at mobile bitrates and to determine whether the additional gains from HEVC encoding are perceivable by end users on mobile device displays.

This chapter presents subjective quality evaluation studies that compare videos coded with H.264 and HEVC. The studies were conducted using videos coded at typical mobile bitrates of 200 and 400 Kbps. Subjective evaluations showed that H.264 and HEVC result in a similar quality of experience. Bandwidth reduction alone may not sufficiently justify the cost of deploying HEVC in mobile devices targeting low bandwidth applications. Design decisions and trade-offs based on the reported results can improve consumer electronics designs and user experience. Using a simpler codec can reduce complexity in mobile environments, which can lead to lower power consumption [84] and better video quality [77].

A significant amount of HEVC quality evaluation has been limited to performance evaluation at high resolutions and high bitrates. An evaluation of candidates for HEVC standardization was conducted in a joint collaboration between the ISO and ITU video experts groups [94]. Test results showed that a 50% bit rate improvement over H.264 can be achieved with the proposed coding schemes at a similar mean opinion score (MOS) [52]. This evaluation study provided the groundwork that was needed to standardize HEVC.

Subjective comparison of HEVC and H.264 at higher bitrates and resolutions has shown that HEVC outperforms H.264, yielding average bit-rate savings of 58% [93]. The objective and subjective results produced by this study confirmed that the goal of developing an HEVC video coding standard that delivers the same visual quality as H.264/MPEG-4 AVC High profile at only half the bit rate was accomplished. However, the question of HEVC performance over H.264 for low bitrate mobile applications was unanswered.

Tan et al. reported subjective comparisons of HEVC and H.264 [54]. Tests were conducted with HEVC high-efficiency (HE) and low-complexity (LC) combinations with rate distortion optimized quantization (RDOQ), adaptive loop filter (ALF), and sample adaptive offset (SAO). Nine sequences, referred to as Class B and C sequences in JVT evaluations [52], were encoded with QP = 32 and 37 for random access and low delay. The class B sequences have a resolution of 1920x1080 and the class C sequences have a resolution of 832x480. The first test method is Double-Stimulus Continuous Quality-Scale (DSCQS), except the observer is asked to make one of three choices: (a) A is better than B, (b) B is better than A, and (c) A and B are the same. The second test method is Double Stimulus Impairment Scale (DSIS). Tests compared HEVC coded video at half the bit rate of H.264. The bit rates for subjective testing varied from 500 kbps to 4,000 kbps.

Test results indicate the observers chose HEVC HE over H.264 56% to 83% of the time, and HEVC with LC-RDOQ-SAO 58% to 75% of the time [54]. A 50% preference indicates the coding methods are viewed as having similar subjective quality; greater than 50% indicates a preference for HEVC over H.264.

Additional subjective comparisons were performed by the JCT-VC ad hoc group comparing HEVC and a similarly configured H.264 encoder/decoder [55]. Tests were performed for class B and class C sequences. The subjective tests yielded a gross average bit rate reduction of 67% for class B sequences and 49% for class C sequences.

Subjective tests specifically targeting low-delay applications have been performed for 720p and 1080p resolutions [56]. The HEVC and H.264 encoders were configured for low-delay-P-main operation. An "IPPPP..." coding structure was used, which has only one intra frame (the first frame of each sequence) for both encoders. These tests compared H.264 encoding at twice the bit rate of HEVC. Results show HEVC encoded sequences were favored 48% of the time for 720p sequences and 90% of the time for 1080p sequences. This clearly shows better HEVC performance at higher resolutions and bitrates.

Subjective tests for very high resolution videos (3840x1744) again show that HEVC outperforms H.264 [57]. The majority of bit rates evaluated were over 1,000 kbps. The test method was DSIS Variant II with two 15-minute test sessions with a rest period in between. Test results convincingly show that a bit rate reduction of over 50% is achieved with HEVC over AVC for high resolution sequences.

Yamamura, Iwasaki and Matsuo reported a subjective quality assessment for video sequences with blocking artifacts [95]. The analysis included several coding standards, such as HEVC, H.264, and MPEG, with bit rates varied from 500 kbps to 2 Mbps. The analysis revealed that HEVC performed best with the periodic gaps caused by the blocking artifacts.

In summary, a significant amount of H.264 and HEVC subjective testing has been performed in high bit rate and high resolution environments (832x480 or higher). However, HEVC has not been sufficiently examined to determine its performance at lower bit rates on lower resolution devices, which is a typical use case for low-end smartphones. Understanding the subjective test implications may allow an effective choice of codecs for mobile video services and other consumer electronics devices with video capabilities.

Preliminary results from the present study comparing HEVC and H.264 in the mobile environment showed that H.264 can perform competitively at the lower bitrates typical in mobile applications [3], [4].

5.2 Evaluation Methods

The mobile encoding bitrates and resolutions used within this study follow industry best practices of multimedia streaming service providers and guidelines referenced by content providers. The industry best practices define "higher quality and resolution" as 640x360 (16:9) at 400 kbps and "medium quality" as 400x300 resolution at 200 kbps [96]. A combination of WiFi resolution (640x360) and higher bandwidth cellular (400 kbps) was selected.

The encoder implementations used for the comparison are the H.264 reference software JM 18.3 and High Efficiency Video Coding (HEVC) reference software HM 6.0. H.264 was configured to closely mimic the HEVC coding structure (based on HM-like configurations available in JM 18.3). Only the first frame is an I-frame; the remaining frames are coded as P frames. The key encoder settings used in HEVC are: intra period set to -1, largest coding unit size of 64x64, maximum depth of 4, fast search set to EPZS, search range of 64, rate distortion optimization enabled, internal bit depth of eight, SAO enabled, ALF disabled, and AMP disabled. These are the default settings within the HEVC HM 6.0 configuration file "encoder_lowdelay_P_main.cfg". Similar settings were used in previously reported subjective evaluation studies [56].

The video sequences used were selected from the sequence set used during HEVC development. The frame rates of the chosen video test sequences are 24, 30, 50, or 60 fps. All video test sequences were scaled and, if necessary, cropped to obtain videos at 640x360 resolution. The interpolation for resizing is performed with the 3-lobed Lanczos window function. The interpolation algorithm uses source image intensities at 36 pixels in the neighborhood of the target pixel. The frames are then cropped to 640x360 by trimming the edges when needed.
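As an illustration of the resizing step, the following is a minimal pure-Python sketch of 3-lobed Lanczos interpolation (the function names are ours, and the study's actual resizing tool may differ in edge handling); it shows why each output pixel draws on a 6x6 = 36-pixel source neighborhood.

```python
import math

def lanczos3(x):
    """Lanczos-3 (3-lobed) window kernel: sinc(x) * sinc(x/3) for |x| < 3."""
    if x == 0.0:
        return 1.0
    if abs(x) >= 3.0:
        return 0.0
    px = math.pi * x
    return 3.0 * math.sin(px) * math.sin(px / 3.0) / (px * px)

def resample_pixel(image, sx, sy):
    """Interpolate image intensity at fractional source position (sx, sy).

    The kernel support is 6 taps per axis, so each output pixel draws on a
    6x6 = 36-pixel neighborhood of the source image, as described above.
    """
    h, w = len(image), len(image[0])
    x0, y0 = math.floor(sx), math.floor(sy)
    total = weight_sum = 0.0
    for j in range(y0 - 2, y0 + 4):        # 6 source rows
        wy = lanczos3(sy - j)
        for i in range(x0 - 2, x0 + 4):    # 6 source columns
            wx = lanczos3(sx - i)
            # clamp coordinates at the image edges
            ci = min(max(i, 0), w - 1)
            cj = min(max(j, 0), h - 1)
            total += image[cj][ci] * wx * wy
            weight_sum += wx * wy
    return total / weight_sum
```

Normalizing by the weight sum keeps flat regions exactly flat, which is why the kernel's negative lobes sharpen edges without shifting the average intensity.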

The eight sequences used in the experiment pool are: Basketball Drill, Flowervase, Keiba, Kimono, Johnny, People on Street, Race Horses, and Traffic. The videos were selected to cover a breadth of low to high motion as well as varying numbers of people or objects in the video sequence. The video sequence properties are summarized in Table 6.

Table 6: Video sequence information

Video | Original Resolution | Test Resolution | Frames | FPS
Basketball Drill | 832x480 | 640x360 | 500 | 50
Flowervase | 832x480 | 640x360 | 300 | 30
Keiba | 832x480 | 640x360 | 300 | 30
Kimono | 1920x1080 | 640x360 | 240 | 24
Johnny | 1280x720 | 640x360 | 600 | 60
People on Street | 2560x1600 | 640x360 | 150 | 30
Race Horses | 832x480 | 640x360 | 300 | 30
Traffic | 2560x1600 | 640x360 | 150 | 30

The 640x360 video test sequences were encoded at various quality levels. For H.264, QP values of 27-51 were used; for HEVC, QP values of 24-51. From this effort, the video sequences with bit rates closest to the target bit rates of 400 and 200 Kbps were chosen for subjective evaluation. Also, care was taken to minimize the bit rate delta between the H.264 and HEVC video sequences in order to avoid unwanted subjective bias. See Table 7 and Table 8 for the 200 kbps and 400 kbps bit rate results and delta, which is the HEVC bitrate minus the H.264 bitrate. An example rate distortion curve for Basketball Drill is shown in Figure 25.
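The selection step can be sketched as follows; the helper name and the sweep data are hypothetical, but the logic, picking the QP whose encode lands closest to the target bitrate, follows the procedure described above.

```python
def pick_closest_encode(encodes, target_kbps):
    """From a QP sweep, pick the encode whose bitrate is closest to target.

    `encodes` maps QP -> (psnr_y_db, bitrate_kbps); returns (qp, psnr, rate).
    """
    qp = min(encodes, key=lambda q: abs(encodes[q][1] - target_kbps))
    return (qp, *encodes[qp])

# Hypothetical sweep results (QP -> (PSNR-Y dB, bitrate kbps)):
sweep = {27: (31.0, 620.0), 30: (29.5, 380.5), 33: (27.8, 240.0)}
print(pick_closest_encode(sweep, 400))   # (30, 29.5, 380.5)
```

Running the same selection for both codecs against the same target also keeps the H.264/HEVC bit rate delta small, which is the bias-avoidance goal stated above.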

Table 7: 400 Kbps data for H.264 and HEVC

Test | PSNR(Y) (dB) | ΔPSNR(Y) (dB) | Bitrate (kbps) | ΔBitrate (kbps)
BBDrill-H264 | 29.5 | 2 | 380.5 | 21.2
BBDrill-HEVC | 31.5 | | 401.7 |
Flowervase-H.264 | 35.4 | 2.2 | 349.3 | 36.8
Flowervase-HEVC | 37.6 | | 386.1 |
Johnny-H264 | 39 | 0.4 | 398.7 | 0.3
Johnny-HEVC | 39.4 | | 399 |
Keiba-H.264 | 33 | 1.3 | 435 | -13.6
Keiba-HEVC | 34.3 | | 421.4 |
Kimono-H264 | 32.7 | 1.1 | 408.9 | 29.7
Kimono-HEVC | 33.9 | | 438.6 |
People-H.264 | 20.8 | 0.4 | 369 | -15.8
People-HEVC | 21.2 | | 353.2 |
RaceHorses-H.264 | 28.6 | 1 | 390.2 | -5.3
RaceHorses-HEVC | 29.6 | | 384.9 |
Traffic-H.264 | 27.3 | 1.1 | 387.7 | -32.1
Traffic-HEVC | 28.3 | | 355.6 |

Table 8: 200 Kbps data for H.264 and HEVC

Test | PSNR(Y) (dB) | ΔPSNR(Y) (dB) | Bitrate (kbps) | ΔBitrate (kbps)
BBDrill-H264 | 26.9 | 2.5 | 198.8 | 2.8
BBDrill-HEVC | 29.4 | | 201.6 |
Flowervase-H.264 | 33.3 | 1.8 | 204.7 | -14.5
Flowervase-HEVC | 35.2 | | 190.2 |
Johnny-H264 | 36.6 | 1.1 | 193.4 | 5.7
Johnny-HEVC | 37.7 | | 199.1 |
Keiba-H.264 | 29.7 | 1.9 | 211.1 | -6
Keiba-HEVC | 31.6 | | 205.1 |
Kimono-H264 | 30.4 | 0.8 | 198.4 | -17.5
Kimono-HEVC | 31.2 | | 180.9 |
People-H.264 | 19.2 | 0.6 | 203.1 | -0.5
People-HEVC | 19.7 | | 202.6 |
RaceHorses-H.264 | 26.4 | 1.3 | 188.2 | 10.4
RaceHorses-HEVC | 27.6 | | 198.6 |
Traffic-H.264 | 25.5 | 1.2 | 193.1 | -0.6
Traffic-HEVC | 28.3 | | 355.6 |


Figure 25: Basketball drill PSNR(Y) rate distortion curve

5.3 Experiments

Video sequences were shown on a 4.3” LCD with 480x272 resolution. This resolution represents the low-to-mid range smart phone resolutions in the market. As noted earlier, video sequences are encoded to 640x360, which is a recommended video encoding resolution [96]. Content providers typically encode video at a few resolutions and mismatch between display resolution and video resolution is common and is common in mobile video services. The mobile device scales the video accordingly for the mobile device’s display.

The observer was approximately 12” to 18” from the display and the viewing angle is approximately 10 degree (+/- 5 degree) above normal as shown in Figure 26. The

LCD backlight luminance was approximately 45cd/m2. Room lighting was roughly 800 cd / m2.

Twenty-five observers were used in the evaluation experiments. According to ITU specification P.910, the number of observers in a viewing test can vary from 4 to 40. P.910 further specifies that at least 15 observers should participate in subjective testing to obtain reliable results [97]. The observers were 18 to 50 years of age and all observers were in good health with normal or corrected-to-normal vision.

Figure 26: Observer to LCD viewing definition

Subjective evaluations were conducted in accordance with the Double-Stimulus Impairment Scale (DSIS) Variant II as defined by ITU-R BT.500-13 [53]. The test methods were similar to the test conditions used in HEVC evaluations [98], [3]. The double stimulus method is cyclic: an observer is presented with an unimpaired reference followed by the same but impaired video. Variant II was chosen to allow the observer a second viewing of both video sequences. The presentation order is 3 seconds of solid grey video, followed by 5 or 10 seconds of the reference sequence, followed by 3 seconds of solid grey video, followed by 5 or 10 seconds of the impaired sequence; these four viewing events are then repeated, followed by the voting cycle. Essentially, two video sequences are shown per test, where the reference sequence and the corresponding impaired sequence are each shown twice. Then, the observer was asked to rate the quality.

DSIS Variant II was used for the presentation structure of the test material. This allows the observer two viewings of each video sequence (reference and impaired) before subjective grading. Also, the video sequences were shown in a random order to reduce observer bias. Grade scores are on a scale from 1 to 5 and are defined in Table 9. A grade of "one" is poor (very annoying) and "five" is excellent (imperceptible), as defined by ITU-R BT.500-13.

Table 9: Subjective grading scale

Score | Definition
5 | imperceptible
4 | perceptible, but not annoying
3 | slightly annoying
2 | annoying
1 | very annoying

5.4 Results

A summary of the subjective results is presented in Table 10, Table 11, Figure 27, and Figure 28. In Figure 27 and Figure 28, the vertical axis is the mean opinion score (MOS) on a scale of 1 to 5. The horizontal axis shows each video sequence pair for H.264 and HEVC. Table 10 and Table 11 show the maximum, minimum, and average MOS for each video sequence. Data is listed in video sequence pairs. H.264 and HEVC were compared at the same bit rate (i.e., 400 kbps or 200 kbps). This lends the subjective results to a more realistic consumer use case, where the available bandwidth is the same regardless of the coding standard used by the hardware.

In Figure 27 and Table 10, the 400 kbps MOS results showed 85% of grade scores were "5" or "4" for the impaired video sequences, indicating observer feedback of "imperceptible" or "perceptible, but not annoying", respectively. Observer feedback indicates the impaired video sequences are acceptable for the mobile environment, with its smaller screen size and lower bit rate. In Figure 28 and Table 11, as expected, the 200 kbps MOS results showed a wider spread. MOS variance between HEVC and H.264 was much higher at this bitrate. For example, the Basketball Drill and Race Horses video sequences, which had several MOS results in the 3 to 4 range at 400 kbps, had significantly lower results at 200 kbps, with several results below a score of 3. Further analysis of these results is presented in the next section.

Figure 27: Mean opinion score (MOS) for 400 kbps bit rate

Table 10: Mean opinion score (MOS) for 400 kbps rate

Test | Max | Min | Average
BBDrill-H264 | 5 | 2 | 3.4
BBDrill-HEVC | 5 | 3 | 4.3
Flowervase-H.264 | 5 | 3 | 4.5
Flowervase-HEVC | 5 | 3 | 4.7
Johnny-H264 | 5 | 4 | 4.6
Johnny-HEVC | 5 | 3 | 4.6
Keiba-H.264 | 5 | 4 | 4.9
Keiba-HEVC | 5 | 4 | 4.7
Kimono-H264 | 5 | 4 | 4.7
Kimono-HEVC | 5 | 4 | 4.8
People-H.264 | 5 | 3 | 4
People-HEVC | 5 | 3 | 4.1
RaceHorses-H.264 | 5 | 3 | 3.7
RaceHorses-HEVC | 5 | 4 | 4.6
Traffic-H.264 | 5 | 3 | 4.5
Traffic-HEVC | 5 | 3 | 4.5

Figure 28: Mean opinion score (MOS) for 200 kbps bit rate (graph)

Table 11: Mean opinion score (MOS) for 200 kbps rate

Test | Max | Min | Average
BBDrill-H264 | 4 | 1 | 2.1
BBDrill-HEVC | 4 | 2 | 3.4
Flowervase-H.264 | 5 | 3 | 4.1
Flowervase-HEVC | 5 | 3 | 4.4
Johnny-H264 | 5 | 2 | 4.4
Johnny-HEVC | 5 | 4 | 4.7
Keiba-H.264 | 5 | 1 | 4.3
Keiba-HEVC | 5 | 1 | 4.5
Kimono-H264 | 5 | 3 | 4.2
Kimono-HEVC | 5 | 2 | 4.2
People-H.264 | 4 | 1 | 2.1
People-HEVC | 4 | 1 | 2.5
RaceHorses-H.264 | 3 | 1 | 2
RaceHorses-HEVC | 5 | 2 | 3.3
Traffic-H.264 | 5 | 3 | 3.7
Traffic-HEVC | 5 | 2 | 3.9

5.5 Discussion

The performance of HEVC and H.264 can be evaluated by comparing the difference of MOS scores (ΔMOS) against the difference of the corresponding PSNR values (ΔPSNR(Y)). This comparison is shown in Figure 29. Positive values show HEVC has better performance; negative values show H.264 has better performance. The upper right quadrant shows HEVC to be superior in both ΔMOS and ΔPSNR(Y). The lower left quadrant shows H.264 to be superior in both ΔMOS and ΔPSNR(Y). In the upper left quadrant, HEVC has the better ΔMOS and H.264 the better ΔPSNR(Y); the opposite is true for the lower right quadrant.
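The quadrant reading of Figure 29 can be captured in a short classifier (a sketch; the function name and the 0.3 tie threshold, drawn from the informal observer feedback discussed in this chapter, are our framing):

```python
def quadrant(d_psnr, d_mos, mos_tie=0.3):
    """Classify a (ΔPSNR(Y), ΔMOS) point; deltas are HEVC minus H.264.

    |ΔMOS| below `mos_tie` is treated as no clear subjective preference,
    matching the ~0.3 MOS threshold reported by observers in this study.
    """
    if abs(d_mos) < mos_tie:
        return "no clear subjective preference"
    if d_psnr >= 0 and d_mos > 0:
        return "HEVC better objectively and subjectively"
    if d_psnr < 0 and d_mos < 0:
        return "H.264 better objectively and subjectively"
    if d_mos > 0:
        return "HEVC better subjectively, H.264 better objectively"
    return "H.264 better subjectively, HEVC better objectively"

# Basketball Drill at 400 kbps: ΔPSNR ~ +2.0 dB, ΔMOS ~ +0.9
print(quadrant(2.0, 0.9))   # "HEVC better objectively and subjectively"
# Keiba at 400 kbps: ΔPSNR ~ +1.3 dB, ΔMOS ~ 0
print(quadrant(1.3, 0.0))   # "no clear subjective preference"
```

The second example illustrates the chapter's central observation: a clear objective (PSNR) win does not guarantee a perceivable subjective win.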

As shown in Figure 29, for the same PSNR difference, the MOS difference varies with content. A typical PSNR(Y) improvement of 0.5 dB to 2 dB is observed for HEVC over H.264 for all video sequences. This is an expected result, with HEVC always yielding better compression performance. However, the MOS ratings show that the subjective quality of HEVC is not always better than H.264. MOS ratings show several instances where HEVC and H.264 are essentially considered equal or very close to each other in terms of subjective feedback. A common claim about HEVC performance is that HEVC produces equivalent quality video at about half the bitrate of H.264 [55], [56], [57]. To examine this claim, HEVC at 200 kbps was compared with H.264 at 400 kbps. As shown in Figure 29, at these lower bit rates, H.264 at 400 Kbps is clearly superior to HEVC at 200 Kbps in both PSNR and subjective quality.

Figure 29: ΔPSNR(Y) (dB) vs. ΔMOS

The following discussion covers HEVC to H.264 comparisons at the same bit rate. Subjective quality is a function of the PSNR of the decoded video and the content of the video. A large PSNR difference may not produce a large difference in subjective quality. For example, the "Flowervase" and "Keiba" video sequences yielded a PSNR difference of approximately 1.3 dB to 2.2 dB, with HEVC yielding the higher PSNR(Y) compared to H.264. However, the MOS results indicate subjective performance is about the same, with a MOS difference of 0.3 or less. In fact, the observers gave H.264 a higher average MOS for the "Keiba" 400 kbps video sequence. Informal discussion with the observers suggests an HEVC to H.264 MOS difference of 0.3 or less will not cause the observer to prefer one impaired video sequence over the other.

Also, the "People On Street" and "Traffic" video sequences show an HEVC PSNR(Y) advantage over H.264 of approximately 0.4 dB to 1.2 dB. The MOS results indicate subjective performance is about the same, with a MOS difference of 0.30 or less for three of the four video bitrates from these two sequences. This indicates the observer will tend not to have a preference between the HEVC and H.264 impaired video sequences. The fourth video bitrate ("People On Street" at 200 kbps) had a ΔMOS of 0.55; observers slightly favored the HEVC encoded video sequence.

The "Kimono" video sequences had an HEVC PSNR(Y) advantage over H.264 of approximately 1.1 dB or less. The MOS difference between HEVC and H.264 is just under 0.4, with HEVC having the higher rating, although observers rated the 200 kbps H.264 MOS results as higher quality. The MOS results suggest the observer is likely to find the subjective difference predominantly acceptable. However, this observation was made through informal dialog with the observers and further study is warranted before making stronger claims.

Two video sequences, "Basketball Drill" and "Race Horses", have an HEVC PSNR(Y) advantage over H.264 of approximately 1.0 dB to 2.0 dB, which is in line with all the sequences evaluated. The MOS differential, however, is between 0.5 and 1.2, which is greater than for the other video sequences, where the delta is less than 0.4 for five video sequences and just under 0.5 for one video sequence.

Informal observer comments suggest an observer preference for the HEVC impaired video sequence over the H.264 impaired video sequence. Both the "Basketball Drill" and "Race Horses" video sequences have motion in large contiguous areas that are relatively uniform (either the same color and/or pattern). For "Race Horses", the main continuous color is the horse's coat, which is affected by changes in the coat color as the horse moves. A frame from this video sequence is shown in Figure 30. For "Basketball Drill", the floor is a repeating pattern that is affected by the basketball players' shadows as the players move about the court.

An interesting note is that the "Keiba" video sequence has significant large contiguous areas affected by the tree trunks and branches in the video foreground, as shown in Figure 31. However, no observer commented on this as a problem area. We believe this is because the "point of attention" is on the horse and rider, so the video coding artifacts in the tree area go unnoticed.

The main comment behind the preference for HEVC was that continuous patterns or colors looked significantly worse in H.264 encoded video, while HEVC showed a milder effect. For example, in the Race Horses video sequence, the horse's brown coat did not look proper when in motion, as shown in Figure 30, but the grass in the background did not bother observers and was not commented on as a quality issue. For Basketball Drill, the floor showed many blocking artifacts for H.264 when the players' shadows moved across the basketball court.


Figure 30: Race horses sequence. Horse coat color is bothersome to observer

Figure 31: Foreground tree detail loss not a concern in Keiba sequence

Table 12 shows the video sequence, ΔMOS (HEVC MOS minus H.264 MOS), and the observer preference. The correlation between the ΔMOS for 200 kbps and 400 kbps and the observer comments is shown, in addition to the points of attention (POA) [3].

Table 12: Video sequence preference

Video | Preference | POA | ΔMOS 400kbps | ΔMOS 200kbps
Basketball Drill | HEVC | 5 | 0.67 | 1.11
Flowervase | None | 1 | 0.17 | 0.29
Keiba | None | 1 | 0 | 0.44
Kimono | None | 1 | -0.14 | 0.25
Johnny | None | 1 | 0.33 | -0.33
People on Street | None/HEVC | 10+ | 0.08 | 0.55
Race Horses | HEVC | 4 | 0.86 | 1.18
Traffic | None | 10+ | -0.09 | 0

Analysis of the experimental results shows that subjective quality assessment is influenced by content. Specifically, the number of points of visual attention and the spatial and temporal complexity of the content have a direct influence on the quality of user experience. To begin understanding these influences, the notions of temporal information (Ti) and spatial information (Si), as defined by ITU specification P.910 [97], were used. The Ti value for a video sequence gives a measure of temporal changes in a video; videos with large motion have a large Ti value. Similarly, Si gives a measure of spatial complexity and is based on the number of edges in each frame of a video. Points of visual attention were determined empirically. Figure 32 and Figure 33 recast the data as graphs of HEVC MOS, points of attention, and ΔMOS for video sequences with temporal information greater than 10 [3]. The temporal information and spatial information (TiSi) plot is shown in Figure 34. A clearer picture emerges of the expected H.264 and HEVC MOS relationships. The size of each circle represents the ΔMOS; a larger circle indicates a larger ΔMOS, i.e., observers gave a higher MOS value to videos coded with HEVC compared to H.264. The videos with "1 to 3" points of attention have smaller ΔMOS, while the ΔMOS values in the "4 to 7" region are high. The ΔMOS values in the "8+" region are high at 400 kbps, as shown in Figure 32, and mixed at 200 kbps, as shown in Figure 33. As indicated earlier, observers rate the encoders as roughly equivalent for video sequences that fall within the "1 to 3" and "8+" regions; the "4 to 7" points-of-attention region is where the larger ΔMOS occurs.
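As a sketch of the P.910 measures (a simplified, single-plane implementation; P.910 derives Si from the Sobel-filtered luma plane and Ti from frame differences, taking the maximum over time of the spatial standard deviation):

```python
import math

def _std(values):
    """Population standard deviation of a flat list of values."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def sobel_magnitude(frame):
    """Gradient magnitudes via 3x3 Sobel filters (interior pixels only)."""
    h, w = len(frame), len(frame[0])
    out = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (frame[y-1][x+1] + 2*frame[y][x+1] + frame[y+1][x+1]
                  - frame[y-1][x-1] - 2*frame[y][x-1] - frame[y+1][x-1])
            gy = (frame[y+1][x-1] + 2*frame[y+1][x] + frame[y+1][x+1]
                  - frame[y-1][x-1] - 2*frame[y-1][x] - frame[y-1][x+1])
            out.append(math.hypot(gx, gy))
    return out

def spatial_information(frames):
    """Si: max over time of the spatial std-dev of the Sobel-filtered frame."""
    return max(_std(sobel_magnitude(f)) for f in frames)

def temporal_information(frames):
    """Ti: max over time of the spatial std-dev of frame differences."""
    diffs = []
    for prev, cur in zip(frames, frames[1:]):
        diffs.append([c - p for row_p, row_c in zip(prev, cur)
                      for p, c in zip(row_p, row_c)])
    return max(_std(d) for d in diffs)
```

A static, flat sequence yields Ti = Si = 0; high-motion sequences such as Race Horses sit high on the Ti axis of the TiSi plot.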

Figure 32: MOS vs. POA with ΔMOS bubble @ 400 kbps. Ti > 10

Figure 33: MOS vs. POA with ΔMOS bubble @ 200 kbps. Ti > 10


Figure 34: TiSi Plot

Video sequences with few points of attention tend to have higher MOS results for both H.264 and HEVC, with little difference between the MOS averages. This is shown by the MOS results for the Keiba, Kimono, Flowervase, and Johnny video sequences in Figure 32 and Figure 33. Increased visual artifacts tend not to impact observers' scores. Applications such as video conferencing have fewer points of attention, and both H.264 and HEVC give equivalent results at the lower bitrates.

5.6 Concluding Remarks

For the mobile conditions evaluated in this study, a 4.3" screen size and average bit rates of 200 kbps and 400 kbps, the user (i.e., observer) experience is not significantly different when comparing H.264 and HEVC compressed video sequences. Standard-compliant subjective evaluations with 25 observers show that both encoding methods are adequate in low bitrate mobile environments. The results show that content dependencies affect perceived quality; specifically, the number of points of visual attention influences user experience.

6 HEVC DECISION OPTIMIZATION

This chapter presents a model and approach to select an efficient set of HEVC encoding options for mobile devices. The main goal is to reduce encoding complexity without significantly affecting the quality of video conferencing applications. Video target bit rates from 100 kbps to 600 kbps were used within this study. Experimental results show that by carefully selecting the coding unit size, coding unit depth and, to a lesser degree, transform unit size, the encoder computational complexity can be reduced for the target bit rates while maintaining an allowable additional PSNR loss. Results show a 36.5% complexity reduction on average with a negligible PSNR loss of less than 0.25 dB.

6.1 Background

High Efficiency Video Coding (HEVC) is the latest video coding standard

finalized in January 2013 by ISO and ITU’s Joint Collaborative Team on Video Coding

(JCT-VC). This chapter’s working environment is for HEVC video conferencing

applications in the smart phone market segment with cellular bit rates, such as 200kbps

and 400kbps. The primary focus is to reduce coding complexity, in terms of time

savings, while maintaining high quality video.

The HEVC standard has adopted three profiles for a wide range of services, such as broadcast, mobile communications, and video streaming. Recent assessments show that HEVC can achieve subjective quality equivalent to H.264/MPEG-4 AVC with 50% less bit rate [21], by introducing new tools like the block partitioning structure [22] and variable block-size prediction and transform coding [23], among others.

Undoubtedly, an effective tool included in HEVC is the new Coding Tree Block (CTB) partitioning structure. This can be thought of as a more generic version of the Macroblock (MB) coding unit used in previous standards. The square CTB is restricted to a maximum size of 64x64 pixels and can be split into smaller quad-blocks called Coding Units (CU), with a minimum allowed size of 8x8. This means the maximum allowed CU depth is 4.

Transform coding also uses a quadtree structure, called the Residual QuadTree (RQT), that splits a Transform Unit (TU) into smaller transform units, limiting its size to a maximum of 32x32 and a minimum of 4x4. A 2Nx2N CU can use a maximum TU depth of 3, taking 2Nx2N, NxN, and N/2xN/2 sizes.
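The reachable CU sizes follow directly from the LCU size and maximum depth; a small sketch (the function name is ours):

```python
def cu_sizes(lcu=64, max_depth=4, min_cu=8):
    """Enumerate the square CU sizes reachable from the LCU.

    Each depth step splits a CU into four quad-blocks, halving its side:
    64 -> 32 -> 16 -> 8 for the default HEVC settings above.
    """
    sizes = []
    size = lcu
    for _ in range(max_depth):
        if size < min_cu:
            break
        sizes.append(size)
        size //= 2
    return sizes

print(cu_sizes())            # [64, 32, 16, 8]
print(cu_sizes(lcu=16))      # [16, 8]
```

Constraining the LCU or the maximum depth, as explored later in this chapter, simply shortens this list and therefore shrinks the encoder's search space.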

HEVC encoder and decoder complexity assessment is a research topic in [85], [99], [100], [60], where the different HEVC tools are analyzed in terms of performance and computational complexity.

As with H.264, HEVC uses the well-known Rate Distortion Optimization (RDO) model [69] to achieve the best coding efficiency. RDO reaches the optimal partitioning by evaluating all CU sizes, prediction unit (PU) sizes, and TU sizes for each of the CU-PU combinations. This approach greatly increases the computational complexity of encoders and makes real time encoding implementations more difficult, especially for portable and mobile devices where power consumption is one of the key factors. To reduce the RDO complexity, fast algorithms have been proposed that focus on reducing the number of coding blocks (CB), prediction blocks (PB), and TU sizes to evaluate.
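To make the cost concrete, here is a rough count of the CU candidates an exhaustive quadtree search visits (our illustration; the PU shapes and TU depths tried inside each CU multiply this further):

```python
def quadtree_nodes(max_depth):
    """CU candidates in one CTB's exhaustive RDO search:
    one node at depth 0, four at depth 1, ..., 4**d at depth d."""
    return sum(4 ** d for d in range(max_depth))

# With the HEVC defaults (64x64 LCU, CU depth 4), a full RDO pass
# evaluates 1 + 4 + 16 + 64 = 85 CU candidates per CTB.
print(quadtree_nodes(4))   # 85
```

This geometric growth is why capping the LCU size or maximum CU depth, the approach pursued in this chapter, yields such large time savings.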

Schwartz et al. show the effects of constraining the Largest Coding Unit (LCU) for the Random Access (RA) and Low Delay (LD) configurations of the main profile [21]. Decreasing the LCU from 64x64 to 32x32 realizes an 18% and 17% time savings with a small bit rate penalty of 2.2% and 3.7%, respectively. On the other hand, using an LCU of 16x16 instead of 64x64 reduces encoding time by 42% but with bit rate increases of 11% and 17.4%, respectively. Also, reducing the maximum RQT depth from 3 to 2 yields a time saving of 10% with a slight bit rate increase of 0.3% for RA and 0.4% for LD configurations.

Finally, a complexity control algorithm based on coding tree depth selection for the CTB is proposed in [21], which obtains a computational complexity reduction of 40% with a bit rate increase of 3.5%. In [91], a dynamic adjustment of the maximum CU-depth is proposed; testing showed a 40% complexity reduction, a 0.1 dB PSNR loss, and a 3% bit rate increase.

6.2 Method

The mobile device working environment within this research was based on multimedia streaming guidelines from service and content providers. The content is defined as "higher quality" for 640x360 resolution at 400 kbps and "medium quality" for 400x300 resolution at 200 kbps [101], [102]. We chose a superset of the recommended target bit rates in order to observe video quality over a range of bitrates. The bit rates within this study are 100, 200, 300, 400, and 600 kbps. We also decided to use several video resolutions; the lower resolutions made available by JCT-VC and defined in [103] were used. The HEVC video sequences chosen for the experiments were from Classes C, D, and E as defined by JCT-VC. The video sequences used for the experiments are listed in Table 13.

Table 13: Video sequence definition

Class | Video Sequence | Resolution | FPS | Frames
C | BasketballDrill | 832x480 | 50 | 500
C | BQMall | 832x480 | 60 | 600
C | PartyScene | 832x480 | 50 | 500
C | RaceHorses | 832x480 | 30 | 300
D | BasketballPass | 416x240 | 50 | 500
D | BlowingBubbles | 416x240 | 50 | 500
D | BQSquare | 416x240 | 60 | 600
D | RaceHorses | 416x240 | 30 | 300
E | FourPeople | 1280x720 | 60 | 600
E | Johnny | 1280x720 | 60 | 600
E | KristenAndSara | 1280x720 | 60 | 600

The study's experiments used HEVC encoder reference software version HM8.0 [104] with the Main profile and Low-delay configuration as the baseline. This configuration is meant for real-time personal video communications and uses the previous frame for motion estimation. The common JCT-VC test conditions described in [105] were used with the aim of analyzing HEVC performance. The input variables "Bit Rate", "Largest CU (LCU) size", "CU-depth", and "TU-inter-depth" were varied between tests.

The HM8.0 source code was modified to output additional CU and TU data for analysis. The collected data included the CU mode (i.e., intra or inter), LCU, CU-depth, and TU-inter-depth for each CU block within all frames of the video sequences. A total of 1170 video sequence experiments were run during the study, and this data was used for analysis. Allowances were made to run experiments over multiple computers; however, each video sequence experiment ran on the same computer to eliminate any comparison disconnects.

Each sequence was encoded with a baseline configuration (all options on) and additional configurations where, for a given bit rate, the LCU, CU-depth, and TU-inter-depth were varied. All video sequences and input variable combinations are as follows:

- LCU: 64x64, 32x32 and 16x16

- CU-depth: 4, 3, 2, 1

- TU-inter-depth: 3, 2, 1

- Bit Rate: 100, 200, 300, 400, and 600 kbps

All videos were 10 seconds long with frame rates of 30, 50, or 60 FPS. Each frame had 66 data elements tracked, yielding over 37 million data elements to review and analyze. The key outputs observed for performance analysis were the PSNR difference, identified as Δ PSNR, and the encoding time difference, identified as Time-savings, between the baseline configuration and each of the experimental configurations.
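The data volume above can be checked with a back-of-the-envelope calculation, assuming every one of the 1170 experiments encoded one full sequence with the frame counts from Table 13:

```python
# Rough check of the "over 37 million data elements" figure.
# Frame counts per sequence come from Table 13; the 1170 experiments
# and 66 tracked elements per frame are stated in the text.
frames_per_sequence = [500, 600, 500, 300, 500, 500, 600, 300, 600, 600, 600]
avg_frames = sum(frames_per_sequence) / len(frames_per_sequence)  # ~509 frames
experiments = 1170
elements_per_frame = 66

total_elements = experiments * avg_frames * elements_per_frame
print(f"approx. total data elements: {total_elements/1e6:.1f} million")
```

Under these assumptions the estimate is roughly 39 million elements, consistent with the "over 37 million" stated above.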

6.3 Prediction Modeling

To assist with data analysis and create a prediction model, the Waikato Environment for Knowledge Analysis (WEKA) Version 3.6.4 [106] was used for data mining the experimental results. WEKA is an effective software tool for sifting through large data sets and determining the data relevant to the chosen prediction model.

The datasets used for prediction model training were chosen as one dataset per class; all remaining video sequence datasets were used for testing. From these experiments, the Time-savings and Δ PSNR were recorded and used as the desired outputs for the derived equations. Time-savings and Δ PSNR are detailed later in this report.

The WEKA analysis classifier was set to "linear regression" for prediction, and the attribute selection method was set to "No attribute selection", which allows all inputs to be used in the modeling equation. Therefore, no pruning attempts were made to simplify the equations. The attributes used for the prediction model were: resolution, bit rate, LCU, CU-depth, and TU-inter-depth. Attributes such as "number of frames" and "frame rate" were removed from the attribute pool, as they are irrelevant to the calculations. The remaining tool configurations were left at default settings.
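Linear regression without attribute selection amounts to an ordinary least-squares fit over all five attributes plus an intercept. The sketch below illustrates the procedure with synthetic data; the attribute ranges follow the study's configurations, but the weights and data are made up for illustration only:

```python
import numpy as np

# Illustrative stand-in for WEKA's "linear regression, no attribute
# selection": fit a response from the five attributes named above.
# All data and weights here are synthetic; only the procedure is the point.
rng = np.random.default_rng(0)

# Attribute columns: height, bit rate, LCU, CU-depth, TU-inter-depth.
X = rng.uniform([240, 100, 16, 1, 1], [720, 600, 64, 4, 3], size=(200, 5))
true_w = np.array([-1.5e-6, 4.1e-5, -7.7e-4, -0.20, -0.031])  # made-up weights
y = X @ true_w + 0.95  # synthetic linear response, no noise

# Append an intercept column and solve by least squares.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)  # recovers true_w plus the intercept
```

Because no attribute selection is applied, every attribute keeps a coefficient in the fitted equation, exactly as in the WEKA configuration described above.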

6.4 Results

Data was gathered for all video sequences. Not all of the data gathered within the study can be presented in this report; however, representative examples are used to explain the underlying concepts.

6.4.1 Data Reading Primer

For the rest of this document, "LCU" / "CU-depth" / "TU-inter-depth" will be shown as numbers separated by a slash (i.e., /). Therefore, 64/4/3 represents LCU = 64x64, CU-depth = 4, TU-inter-depth = 3.
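A small helper makes this shorthand concrete (the function name is illustrative, not from the study's tooling):

```python
# Tiny parser for the "LCU/CU-depth/TU-inter-depth" shorthand used in the
# rest of this chapter, e.g. "64/4/3".
def parse_config(tag: str) -> dict:
    lcu, cu_depth, tu_depth = (int(p) for p in tag.split("/"))
    return {"LCU": f"{lcu}x{lcu}",
            "CU-depth": cu_depth,
            "TU-inter-depth": tu_depth}

print(parse_config("64/4/3"))
# {'LCU': '64x64', 'CU-depth': 4, 'TU-inter-depth': 3}
```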

The graph shown in Figure 35 is "Time-savings vs. Δ PSNR" for the "KristenAndSara" video sequence at a bit rate of 100 kbps. Each point is defined by "LCU"/"CU-depth"/"TU-inter-depth". The curve within the graph is the outline curve for all points; it represents the effective trade-off between Δ PSNR and Time-savings. The closer a point is to the curve, the better the trade-off decision.


Figure 35: Kristen and Sara 100kbps time-savings vs. Δ PSNR

The most beneficial location is the upper right corner, which has maximum Time-savings and minimum Δ PSNR loss. This area is identified in the graph as the "Preferred location". Within the data set, the best Δ PSNR loss is 0 dB and is held by the 64/4/3 data point. This data point is shown with an arrow in the upper left corner and is identified as Max-quality. As expected, this encoded sequence has the longest encoding time, which leads to 0% Time-savings. Moving along the outline curve there are two points, defined as the "Left-option" point and the "Right-option" point. These are the optimal points as determined by us. A perceptual video quality assessment was carried out, consisting of subjectively reviewing each sequence to determine the preferred video sequence. In instances where the subjective review returned the same result, the selection went to the sequence that yielded a higher Δ PSNR × Time-savings value.

From this effort, the optimal points mainly consisted of the minimum CU-depth and TU-inter-depth for LCUs of 64x64 and 32x32. The majority of the optimal cases were clearly these points, so we decided to simplify the option point selection and set it to the minimal configuration for LCUs of 64x64 and 32x32. The data point on the lower right has the most restrictive settings and is identified as Min-quality. As expected, this point contends for the best Time-savings and the largest Δ PSNR loss, although there are frequent instances where this is not the case, as shown in Figure 35.
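The outline curve is, in effect, a Pareto frontier: a configuration belongs on it when no other configuration offers both more Time-savings and less PSNR loss. The sketch below illustrates the idea with made-up points loosely shaped like Figure 35 (they are not measured data):

```python
# Pareto-frontier sketch of the "outline curve" idea. Each point is
# (Time-savings %, delta-PSNR dB); values are illustrative only.
points = {
    "64/4/3": (0.0, 0.0),     # Max-quality
    "64/3/1": (25.0, -0.07),  # Left-option
    "32/2/1": (45.0, -1.03),  # Right-option
    "16/1/1": (55.0, -2.83),  # Min-quality
    "64/2/2": (20.0, -0.50),  # dominated example
}

def pareto(points):
    """Keep points not dominated in both Time-savings and delta-PSNR."""
    best = {}
    for name, (ts, dpsnr) in points.items():
        dominated = any(ots >= ts and odp >= dpsnr and (ots, odp) != (ts, dpsnr)
                        for ots, odp in points.values())
        if not dominated:
            best[name] = (ts, dpsnr)
    return best

print(sorted(pareto(points)))
```

Here the hypothetical 64/2/2 point is dominated by 64/3/1 (more savings, less loss) and drops off the frontier, mirroring how points far from the outline curve represent poor trade-off decisions.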

6.5 Data Analysis

The graph in Figure 35 is a typical representation of the data gathered from the video sequences within this study. The Left-option point's Time-savings is approximately 25% with a Δ PSNR of -0.07 dB. The Right-option point has a Time-savings of approximately 45% with a Δ PSNR of -1.03 dB.

The Left-option and Right-option points vary depending on resolution. For 832x480 (Class C), the two points are 64/2/1 and 32/1/1. For 1280x720 (Class E) and 416x240 (Class D), the two points are 64/3/1 and 32/2/1. Table 14 shows the Δ PSNR and Time-savings for Classes C, D, and E; the C, D, and E average is shown as "All". Further table explanations are given later within this section.

Table 14: Class C, D, E, and All video sequence, average Δ PSNR and Time-savings

Class  Description       Max.  Left-option  Right-option  Min.
C      Δ PSNR (dB)       0     -0.31        -0.59         -1.36
C      Time-savings (%)  0.0%  41.8%        58.1%         55.6%
D      Δ PSNR (dB)       0     -0.21        -0.37         -0.7
D      Time-savings (%)  0.0%  24.2%        36.4%         56.4%
E      Δ PSNR (dB)       0     -0.07        -1.03         -3.26
E      Time-savings (%)  0.0%  23.6%        45.6%         67.2%
All    Δ PSNR (dB)       0     -0.25        -0.72         -1.82
All    Time-savings (%)  0.0%  36.5%        55.4%         69.2%

Video sequences that match the points from Figure 35 were evaluated by us. A frame for each graph point is shown in Figure 36 and Figure 37. The example chosen is KristenAndSara from the Class E video sequences. The Max-quality video, which is the reference video sequence, is itself impaired, and loss of detail is observed in the Figure 36 left side picture, which is 64/4/3. However, compared to the Left-option point, which is 64/3/1 and on the right side, the degradation in the picture is minimal. The Right-option point to compare with is the Figure 37 left side picture, which is 32/2/1. More degradation is noticed when compared to the reference (i.e., Max-quality) video sequence. Depending on user and environment needs, this may be acceptable and will yield additional encoding time savings.

Figure 36: Kristen and Sara sequence, 100kbps. Max-quality (64/4/3, left) and Left-option (64/3/1, right) points

Figure 37: Kristen and Sara sequence, 100kbps. Right-option (32/2/1, left) and Min-quality (16/1/1, right) points

Table 14 shows the averages for Δ PSNR and Time-savings. As noted earlier, the 64/4/3 data point has a Δ PSNR of 0 dB and a Time-savings of 0%; this reference point represents the Max-quality option. All graphs are normalized to this point for consistency and clarity in data analysis. As expected, the PSNR loss increased as the encoder options were reduced; encoder flexibility was limited by reducing the LCU, CU-depth, or TU-inter-depth options.

The major contributors affecting Time-savings are either LCU or CU-depth, depending on resolution. CU-depth changes have more of an impact at smaller resolutions, while LCU changes have a bigger impact at larger resolutions. This observation is noticed when the test results for each sequence are ordered by Time-savings: at smaller resolutions the ordering lined up according to the CU-depth selections, and at larger resolutions it lined up according to the LCU selections. This is an observation, and the detail is not shown within this report.

For a given video sequence, the effect of bit rate on Time-savings and Δ PSNR is shown in Figure 38 and Table 15. As expected, the reference video sequence's (i.e., Max-quality's) absolute PSNR improves at higher bit rates, and tool option changes have a greater impact on Δ PSNR. All video sequences in the study showed a similar shift.

Figure 38: Kristen and Sara 100kbps to 600kbps shift for time-savings vs. Δ PSNR (diamond = 100 kbps, square = 600 kbps)

Table 15: Kristen and Sara 64/4/3 PSNR and Δ PSNR for 100 and 600 kbps

                              Δ PSNR
Bitrate (kbps)  64/4/3 PSNR   Max.  Left-option  Right-option  Min.
100             27.35         0     -0.10        -1.00         -2.83
600             33.63         0     -0.12        -0.93         -3.44

6.5.1 Prediction Model Analysis

The data analysis showed that the video sequence data relate fairly well to one another when sorted by resolution. However, all resolutions combined, as shown by the "All" rows of Table 14, will be the comparison point for the predictive model explored in this chapter. There are content dependencies that were not evaluated within this chapter, as noted within this section.

From the data set gathered from the experiments, a data mining exercise took place using WEKA Version 3.6.4 in order to generate a prediction model. This model could potentially assist with selecting the optimal settings that maximize Time-savings while staying within the allowable Δ PSNR for the given user-defined environment. The prediction model equations are shown in equations (2) and (3). The equations are discussed in the form of tables and graphs that correlate with the results.

Δ PSNR = (A*Height + B*bit rate + C*LCU + D*CU-depth + E*TU-inter-depth + F) / G    (2)
where A = -2132, B = -316, C = 29834, D = 156664, E = 3339, F = -1261163, G = 1,000,000

Time-savings = (H*Height + I*bit rate - J*LCU - K*CU-depth - L*TU-inter-depth + M) / G    (3)
where H = -1.52, I = 41, J = 769, K = 203991, L = 31219, M = 949434, G = 1,000,000
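The two equations are straightforward to implement. In the sketch below, J, K, and L are taken as the positive magnitudes subtracted in (3); this is the sign convention that reproduces the reported ~0% baseline Time-savings for the 64/4/3 point. Height is the video height in pixels, bit rate is in kbps, and LCU is the edge length (64, 32, or 16):

```python
# Equations (2) and (3) with the WEKA-derived coefficients quoted above.
G = 1_000_000

def delta_psnr(height, bitrate, lcu, cu_depth, tu_depth):
    A, B, C, D, E, F = -2132, -316, 29834, 156664, 3339, -1261163
    return (A*height + B*bitrate + C*lcu + D*cu_depth + E*tu_depth + F) / G

def time_savings(height, bitrate, lcu, cu_depth, tu_depth):
    H, I, J, K, L, M = -1.52, 41, 769, 203991, 31219, 949434
    return (H*height + I*bitrate - J*lcu - K*cu_depth - L*tu_depth + M) / G

# KristenAndSara-like case (720p, 100 kbps). The Max-quality point 64/4/3
# predicts ~0% Time-savings; model outputs are normalized by shifting this
# point to (0, 0) as described in the text.
base  = (720, 100, 64, 4, 3)
right = (720, 100, 32, 2, 1)
print(round(time_savings(*base), 3))                     # ~0.0 (baseline)
print(round(time_savings(*right), 3))                    # ~0.489 (~49%)
print(round(delta_psnr(*right) - delta_psnr(*base), 2))  # ~-1.27 dB
```

Under this reading, the model predicts roughly 26% Time-savings for 64/3/1 and 49% for 32/2/1 at 720p/100 kbps, consistent with the Left-option and Right-option behavior described around Figure 39.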

Using the WEKA-derived equations, the Left-option and Right-option points are valid, as shown in Figure 39. The arrows identify the same points as in Figure 35. The Time-savings from the WEKA graphs are an accurate representation. However, the Δ PSNR has more variability when compared to Figure 35, although the PSNR points relative to one another within the WEKA graph are representative. To get better correlation with the actual data, the WEKA-generated Δ PSNR and Time-savings graph is normalized by shifting the 64/4/3 data point to (0, 0); all other points are shifted by the same amount. Given this, the shape of the points within the WEKA model can be used to determine the optimal points.


Figure 39: Kristen and Sara 100kbps time-savings vs. Δ PSNR, WEKA derived

From the WEKA-derived equations, the video sequences' Δ PSNR and Time-savings were calculated and averaged for all classes. Results are shown in Table 16 and have been normalized to the Max-quality data point (i.e., 64/4/3).

Table 16: Video sequences WEKA estimate for Δ PSNR and Time-savings

Description       Max.  Left-option  Right-option  Min.
Δ PSNR (dB)       0.00  -0.22        -1.33         -1.91
Time-savings (%)  0.0%  34.1%        56.9%         71.1%

A comparison between the actual values and the WEKA-derived results is shown in Table 17. Results from Table 14 are referred to as "Actual" and results from Table 16 as "WEKA".

The "Actual" versus "WEKA" Δ PSNR comparison shows a difference no greater than 0.61 dB for the optimal points, as shown in Table 17, demonstrating very good estimates from WEKA. The Time-savings comparison between "Actual" and "WEKA" shows even better results: the worst case for the optimal points for any class is a 2.4% difference.

The correlation between the "Actual" and "WEKA" values was derived, and the results are shown in Table 18. This data shows that the WEKA-derived data has an excellent correlation for encoding time savings. Therefore, by using the prediction model, the developer can confidently estimate the encoding time savings changes from tweaking LCU, CU-depth, and TU-inter-depth to suit user and environment needs. The Δ PSNR correlation is good, at 0.77 or higher, indicating the WEKA model is a fairly good indicator for Δ PSNR. The Δ PSNR variation is mainly due to bit rate changes and video content dependencies.

By using the derived models, the developer can determine the impact that video encoding changes could have on video quality (via PSNR) and on encoding time savings. In addition, the developer can update the model during video encoding to use additional parameters derived from previous frames. With this additional real-time data, the model's quality improves and yields more accurate results: the correlation improves to 0.95 for Δ PSNR and 0.99 for encoding time savings when block types of previous frames are used within the prediction model.
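The agreement between "Actual" and "WEKA" values can be quantified as a Pearson correlation. The sketch below computes it over the four class-average Time-savings columns from Tables 14 and 16 ("All" rows); note this four-point vector is only an illustration and will not equal the per-class correlation reported in Table 18:

```python
import numpy as np

# Pearson correlation between measured and model-predicted Time-savings.
# Values are the "All" rows of Tables 14 (actual) and 16 (WEKA estimate),
# ordered Max., Left-option, Right-option, Min.
actual    = np.array([0.0, 0.365, 0.554, 0.692])
predicted = np.array([0.0, 0.341, 0.569, 0.711])

r = np.corrcoef(actual, predicted)[0, 1]
print(round(r, 3))
```

Even this coarse four-point comparison yields a correlation above 0.99, in line with the excellent Time-savings agreement reported in Table 18.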

Table 17: Video sequences actual vs. WEKA difference

Description         Max.  Left-option  Right-option  Min.
Δ PSNR diff.        0.00  -0.03        0.61          0.09
Time-savings diff.  0.0%  -2.4%        1.5%          1.9%

Table 18: Correlation between actual and WEKA results

Description         Correlation
Δ PSNR diff.        0.77
Time-savings diff.  0.98

6.6 Application

With the concepts within this study, the developer can estimate the PSNR impact of option changes. A no-reference PSNR estimator, as presented in [107], can be used as a "coarse" PSNR estimate; the Δ PSNR calculated by the prediction model then "fine tunes" the absolute PSNR value for the projected video sequence. The developer can thus make informed decisions about video quality for the transmitted video sequence. Also, using the prediction model, the developer can estimate how much the encoder will be taxed in terms of encoding time. Note that each encoder implementation will derive different formulas, and the method defined within this study will need to be re-run to determine the system's performance; this only needs to be characterized once for the hardware and software setup of the system.

Predictive modeling software, such as WEKA, can be used effectively to estimate the Δ PSNR and encoding time savings (i.e., Time-savings) changes from the base, which is the encoded video with all the desired tool options enabled. With this data, the developer can effectively manage video encoding for the user and environment, such as transmission bandwidth limitations. Video conferencing environments will benefit greatly from the modeling, which gives the developer a reliable method to estimate PSNR loss and encoding time savings. This allows the developer to maximize video quality (based on the PSNR loss estimate) and finely tune to the available bandwidth (by using the encoding time savings estimate). The model derived within the study also has applicability for content-independent use, which is especially useful when the prediction model uses real-time data, such as block types used in previous frames, as part of the model.

7 ADAPTING LOW BIT RATE SKIP MODE IN MOBILE ENVIRONMENT

A method is introduced to enhance the HEVC early skip decision process. The proposed method uses human visual system factors, such as visual acuity and smooth pursuit eye movement, to determine where coding complexity and bit rate reductions can occur without significantly impacting the perception of video quality. In this study, the focus is mobile devices, such as smart phones, where displays are less than 5.0" in diagonal and bandwidth availability is low. The proposed method exploits the mismatch between the video resolution, display resolution, and visual acuity of users, reducing encoding time with minimal subjective impact in low bit rate environments. Mobile devices are the main beneficiary of the proposed method since battery consumption will be greatly lessened. The proposed method reduces both encoding time and bit rate; results show an average 21.7% decrease in encoding time, a 13.4% decrease in average bit rate, and insignificant impact on subjective quality evaluations.

7.1 Background

Use of mobile devices, such as smart phones, has grown dramatically in the last

few years and has penetrated the consumer space [10]. A significant reason is the

applications that are used on these devices. One application type that has gained popularity is real-time video sharing between mobile devices, such as video conferencing

or video “chatting”. Quality of service is a factor in the adoption of real-time video

sharing applications. Research shows users adopt a new technology if it delivers a quality experience the majority of the time, and the converse is true [11]. The adoption and success of these mobile video sharing applications depends on the quality and reliability of services delivered over the public internet and wireless wide area networks (WWANs). Bit rate optimization is thus an important tool in delivering video services.

In this chapter, mobile resolutions and constrained bandwidths as observed in

WWANs are examined. The proposed method improves High Efficiency Video Coding

(HEVC) [108] encoders by simultaneously reducing encoding complexity and target

bitrate for equivalent subjective quality. The properties of the Human Visual System

(HVS) exploited in the proposed approach are visual acuity and high motion perception.

Visual acuity depends on the person's photoreceptor density: higher photoreceptor density means better acuity. In addition, photoreceptor density is limited, which means the HVS will not perceive differences above a certain resolution for a given screen size viewed at the same distance. HVS motion perception depends on the eye's ability to smoothly track motion; when motion exceeds the HVS's ability to smoothly track, saccadic eye response starts to dominate. There are opportunities to take advantage of

these elements without affecting perceived video quality. The properties of the HVS are used along with motion vector comparisons between the current coding unit (CU) and neighbor CUs to determine early termination. Early termination prevents any additional prediction unit (PU) or sub-CU calculations. Such early termination affects the quality of the video pixels for that section of the video. The impact of this reduced pixel quality on perceived quality, however, depends on the video resolution among other environmental factors such as display size and viewing distance.

In essence, there are three elements that, when combined in a mobile environment, can be effectively exploited, as shown in Figure 1. The differences between the encoded video resolution, display resolution, and HVS photoreceptor density limit the HVS's ability to discern resolution over a given area. The proposed method works well for low bit rate environments.

The key contributions of this chapter are: (1) a perceptually-aware method for video encoding that exploits visual acuity and motion perception, (2) implementation and evaluation of the proposed method in HEVC encoding, and (3) subjective evaluations on mobile devices. The proposed method reduces both encoding time and bit rate; results show an average 21.7% decrease in encoding time, a 13.4% decrease in average bit rate, and insignificant impact on subjective quality evaluations.

As shown earlier, there exists a significant body of work on complexity reduction. Most of it has the similar goal of reducing complexity while minimizing the impact on the measured element, which is typically PSNR loss. However, there is limited work in the mobile environment using HVS elements for complexity reduction, and limited work showing the perceptual gains via subjective testing.

The proposed method focuses on mobile device-to-device communication, such as real-time conferencing. The method shows how HVS factors can aid decision-making for effective complexity reduction without significantly affecting MOS, while providing major coding time reductions and bit rate savings.

7.2 HEVC Elements

7.2.1 Quad-Tree

The current HEVC standard provides a basic quad-tree block structure with recursive splitting from a CU size of 64x64 down to an 8x8 CU size. The 64x64 CU breaks down into smaller CUs, where the RD cost is calculated; each smaller CU is in turn broken down into smaller pieces and RD costs are calculated. This continues until an 8x8 CU size is reached. HM10.0 CU compression estimation starts with the largest coding unit (LCU). The compression estimation routine recursively calls the CU compression subroutine for smaller CU sizes within the LCU. The CU compression subroutine determines the RD cost for each PU mode. This computation is extremely time intensive, since the RD cost is calculated for all possible cases.

Potentially, there are up to 85 CU calculations for each Largest Coding Unit (LCU). Each LCU's recursive split of one 64x64 into four 32x32, sixteen 16x16, and sixty-four 8x8 CUs is calculated as follows: 1 + 4 + 4x4 + 4x4x4 = 85. An example CU quad-tree with recursive splitting is shown in Figure 40. In addition, each CU leaf node is divided once more as a PU that can be one of four shapes for a CU of 64x64, or eight shapes for CU sizes of 32x32 or less [87], [20]; the potential shapes are shown in Figure 7.

Figure 40: Recursive CU splitting example
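The 85-CU count above is simply the number of nodes in a full quad-tree of four levels (64x64 down to 8x8):

```python
# Number of CU evaluations in a full quad-tree with the given number of
# levels: 1 + 4 + 16 + 64 for the four HEVC levels 64x64 .. 8x8.
def cu_nodes(levels: int) -> int:
    return sum(4**d for d in range(levels))

print(cu_nodes(4))  # 85
```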

7.2.2 Skip Mode Method

When early skip detection (ESD) is enabled within HM 10.0, 2Nx2N merge evaluation is allowed for early skip determination in inter-frames. Within the CU compression process, ESD is used to determine whether further prediction unit calculations for the 2Nx2N can be skipped. ESD essentially checks two elements: (a) the NxN RD costs and (b) the NxN MVs. If the four NxN RD costs are each zero, then early skip is performed. Likewise, if the four NxN MVs are each zero, then early skip is performed. At this point the NxN PUs are the leaf nodes that are merged to form the 2Nx2N. In addition, if early skip is performed, then any further CU splitting is abandoned and the 2Nx2N is used as the CU for compression with the NxNs as PU nodes.

ESD makes it possible to skip RD cost calls within the CU compression process for certain cases. The 2Nx2N merge parameters are calculated to determine whether the remaining NxN PU RD costs, which are not yet calculated, can be skipped. The HM 10.0 skip mode flowchart is shown in Figure 41.


Figure 41: Mode decision process

The determination for early skip is made within the "Merge 2Nx2N Cost" subroutine through a series of comparisons and checks. HM10.0 currently allows ESD skip to occur if the CU's merge flag is set or if the CU's 2Nx2N motion vector in the x and y directions is zero. The merge costs are determined for the 2Nx2N CU subparts, which are the four NxN CUs within the 2Nx2N. The subpart RD costs and residual cost are calculated. If the residual cost from this activity is zero, then the merge flag is set, which sets the early skip flag before exiting the subroutine. In addition, the 2Nx2N absolute motion vectors in the x and y directions are calculated; if the sum of the x and y motion vectors is zero, then the early skip flag is set [8].

7.3 Proposed Method

The proposed method adds a third decision method within the "Merge 2Nx2N Cost" subroutine for early skip determination. As with the other two methods, once the criterion is met, no further RD cost calculations are needed. Essentially, when the proposed method's condition is determined to be valid, the method terminates any further RD calculations for the current CU and sub-CUs by setting the early skip flag. The proposed method uses HVS factors, such as acuity and SPEM, within the decision process along with the video resolution and display resolution. This allows exploitation of the different elements while keeping the impact on MOS results minimal. These factors lead to the calculation of a threshold limit for the method. The threshold limit is compared against the neighboring LCUs to determine whether the early skip criteria are met.

The threshold depends on two factors: (a) an acuity limit and (b) a motion limit. The threshold equation is shown in (4). The acuity limit is the first (top) term of the equation and the motion limit is the second (bottom) term. The threshold is characterized in terms of pixels for the given video.

Threshold = sqrt((Pixels in Width)^2 + (Pixels in Height)^2) / (Foveola Acuity * Viewing Cone)
            + (|MV| * frame rate * transparent motion sensitivity) / (SPEM velocity saturation)    (4)

The acuity limit factors in the HVS acuity given the video resolution, viewing cone, and foveola acuity. The display dimension of interest is the display diagonal in pixels, calculated from the width in pixels and the height in pixels. The foveola acuity is the visual acuity.

The acuity limit essentially determines the ability to perceive pixel granularity for the given HVS acuity, video resolution, and display characteristics. The video resolution is taken from the video sequence. The viewing cone is calculated from the viewing distance and the measured display diagonal length; the viewing cone geometry is shown in Figure 42. The foveola has the highest cone receptor density in the retina, giving the best acuity of the central island. This is the best case for the HVS and yields a lower threshold compared with acuity levels from other areas within the retina.

Figure 42: Viewing cone

The motion limit is a function of the MV magnitude, frame rate, viewing distance, transparent motion sensitivity, and SPEM velocity saturation. The purpose of this term is to allow additional threshold allowance for high-motion CUs. The current MV magnitude is calculated for the PU that has been selected; the MV is for the current frame and is in terms of pixels per frame. Frame rate is the frames per second of the given video sequence. Transparent motion sensitivity is the HVS's ability to distinguish motion from small display-distance changes of the object in motion. The display distance change, in conjunction with the viewing distance, is converted to a degree change for the purposes of expressing transparent motion sensitivity. Motion sensitivity for each frame is (display resolution / viewing cone * motion sensitivity). Motion sensitivity guidance is from the research of Ning Qian et al. [35]. SPEM velocity saturation is the limit of smooth pursuit eye movement.

The method checks the neighboring left and above LCUs' motion vectors (MVs) along the edge adjacent to the current CU. The neighbor MV closest to the current CU is used for the comparisons between the MVs, as shown in Figure 43.


Figure 43: Neighbor LCU motion vector

If the MV difference is within the threshold, then the early skip flag is set to true per the equation shown in (5). The threshold is calculated for each CU; it determines the allowed MV deviation between the current CU MV and the neighbor LCU edge MVs. An allowed MV deviation greater than zero permits additional minor video quality impairments. These additional impairments were determined to have minimal impact on the HVS, with the benefit of the encoding time savings derived by this method.

EarlySkip = 1, if |MV_CU,x - MV_leftCU,x| + |MV_CU,y - MV_leftCU,y| <= Threshold
               AND |MV_CU,x - MV_aboveCU,x| + |MV_CU,y - MV_aboveCU,y| <= Threshold    (5)
          = 0, otherwise
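The decision can be sketched as below, following one plausible reading of equations (4) and (5): the acuity term as display diagonal in pixels divided by (foveola acuity × viewing cone), plus the motion term. The 20-degree viewing cone and the motion vector values are illustrative assumptions; the acuity parameters are the "High Acuity" settings from Table 20:

```python
import math

# Sketch of the proposed early-skip decision. Units follow the text:
# foveola acuity in receptors/degree, transparent motion sensitivity in
# arc-minutes/degree, SPEM velocity saturation in degrees/second.
def threshold(width, height, viewing_cone_deg, foveola_acuity,
              mv_mag, frame_rate, motion_sensitivity, spem_saturation):
    # Acuity limit: display diagonal in pixels vs. receptors in the cone.
    acuity_limit = math.hypot(width, height) / (foveola_acuity * viewing_cone_deg)
    # Motion limit: extra allowance for high-motion CUs.
    motion_limit = (mv_mag * frame_rate * motion_sensitivity) / spem_saturation
    return acuity_limit + motion_limit

def proposed_early_skip(mv_cur, mv_left, mv_above, thr):
    # Equation (5): L1 distance of the current CU's MV to each neighbor
    # edge MV must be within the threshold for both neighbors.
    left_ok  = abs(mv_cur[0] - mv_left[0])  + abs(mv_cur[1] - mv_left[1])  <= thr
    above_ok = abs(mv_cur[0] - mv_above[0]) + abs(mv_cur[1] - mv_above[1]) <= thr
    return left_ok and above_ok

# Hypothetical case: 416x240 sequence, 30 fps, assumed 20-degree viewing
# cone, "High Acuity" parameters (120 receptors/deg, 1.5, 30 deg/s).
thr = threshold(416, 240, 20, 120, mv_mag=2, frame_rate=30,
                motion_sensitivity=1.5, spem_saturation=30)
print(proposed_early_skip((4, 1), (4, 2), (3, 1), thr))
```

With these assumed inputs the threshold is a few pixels, so small MV deviations from both neighbors trigger the early skip, while a large deviation from either neighbor does not.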

7.4 Experiments

Objective Testing

The video sequences chosen are from the ITU JCT-VC test video sequences; in particular, sequences from Classes C, D, and E as defined in Table 19. These sequences were chosen for mobile resolutions of 1280x720 or lower, which is consistent with industry recommendations [101].

Table 19: Test sequences

Sequence          Class  WxH       Frames  Frame Rate
Blowing Bubbles   D      416x240   500     50
Basketball Pass   D      416x240   500     50
BQ Square         D      416x240   600     60
Race Horses       D      416x240   300     30
Basketball Drill  C      832x480   500     50
BQ Mall           C      832x480   600     60
Race Horses       C      832x480   300     30
Four People       E      1280x720  600     60
Johnny            E      1280x720  600     60
Kristen and Sara  E      1280x720  600     60

Video sequence encoding for data generation used HM10.0 with the "low-delay main" configuration, with ESD enabled (i.e., set to "1") and the remaining settings at their defaults. This is considered the base configuration, without any of the proposed method's enhancements. The proposed method's data was generated for two test conditions, "High Acuity" and "Low Acuity"; the condition differences are listed in Table 20. Video sequences with various acuity settings were reviewed by us, and we determined that the two selected conditions of "High Acuity" and "Low Acuity" were adequate for an HVS with just above average acuity versus an HVS with just below average acuity. The "High Acuity" and "Low Acuity" conditions translate simply to the desired quality of service, with "High Acuity" having the higher level of quality.

Table 20: Acuity test conditions

Video Sequence Type  ESD  Foveola Acuity(a)  Transparent Motion Sensitivity(b)  SPEM Velocity Saturation(c)
Base                 1    -                  -                                  -
High Acuity          1    120                1.5                                30
Low Acuity           1    40                 2.5                                20

(a) Foveola acuity in receptors per degree.
(b) Transparent motion sensitivity in arc-minutes per degree.
(c) SPEM velocity saturation in degrees per second.

Decode times were reduced with the proposed method. Video sequences with high acuity parameters showed a 6.2% decode time reduction, while video sequences with low acuity parameters showed a decode time reduction of 10.2%. The reduction in decode times is due mainly to the increase in skips from the proposed method. For high acuity, the RD loop calls, which is where the RD calculations occur, averaged 80.8% of the loop calls of the baseline encoding method; low acuity RD loop calls averaged 64.4% of the baseline. The complete RD loop calls are shown for each sequence in Table 21.

Table 21: Rate distortion loop calls

                  BASELINE               HIGH ACUITY            LOW ACUITY
Sequence          RDLoop Count  (%)      RDLoop Count  (%)      RDLoop Count  (%)
Blowing Bubbles   7869707       100.0%   7814087       99.3%    4028945       51.2%
Basketball Pass   8366629       100.0%   7700148       92.0%    5876481       70.2%
BQ Square         8741026       100.0%   8514559       97.4%    3637965       41.6%
Race Horses       5305176       100.0%   5025313       94.7%    3535752       66.6%
Basketball Drill  29181423      100.0%   23029283      78.9%    22186646      76.0%
BQ Mall           33692358      100.0%   18905260      56.1%    16890417      50.1%
Race Horses       18591537      100.0%   9979762       53.7%    8672796       46.6%
Four People       72998991      100.0%   65470125      89.7%    65497801      89.7%
Johnny            72707649      100.0%   56039623      77.1%    53681260      73.8%
Kristen & Sara    72393262      100.0%   57255203      79.1%    56285012      77.7%
Average           32984776      100.0%   25973336      81.8%    24029308      64.4%

Subjective Testing

The video sequences chosen for subjective testing used the same quantizer as the base configuration video sequences with bit rates closest to 400 kbps. A bit rate of 400 kbps is considered a higher bit rate for WWAN devices [4]. In addition, video sequences were generated with reduced depth. There is a significant body of work that uses depth reduction as a means to reduce encoding time; we added the depth-reduced sequences to the subjective test pool for comparison with the other methods defined in this chapter.

Observers (test subjects) were shown the impaired sequences on displays with resolutions of 480 x 272 or 960 x 540. The tests were conducted in accordance with the single-stimulus adjectival categorical judgment method, Variant I, as defined by ITU-R BT.500-13 [53]. The observer is presented an impaired sequence followed by the same sequence with a different impairment. Variant I was chosen because it allows one viewing of the video sequences. For tests within this study, a one second grey background is inserted between each video sequence, along with a sequence ordering identification that identifies whether the video is the first or second video within the test sequence. The video sequence presentation order is shown in Figure 44. The subjective evaluation test starts with one second of grey frames with the test sequence number, then one second of grey frames with the letter "A", followed by ten seconds of the first impaired video sequence, one second of grey frames with the letter "B", ten seconds of the second impaired video sequence, and finally the five second voting cycle.

Figure 44: Video sequence order
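The presentation order above can be sketched as a list of timed segments. This is an illustrative data structure only (the segment names and the derived trial count are not from the study protocol, just the durations given in the text):

```python
# Sketch of the subjective-test presentation order: segment names and
# durations (seconds) as described in the text.

SEGMENTS = [
    ("grey frames + test sequence number", 1),
    ("grey frames + letter 'A'",           1),
    ("impaired video sequence A",          10),
    ("grey frames + letter 'B'",           1),
    ("impaired video sequence B",          10),
    ("voting cycle",                       5),
]

trial_seconds = sum(duration for _, duration in SEGMENTS)
print(trial_seconds)  # 28 seconds per trial

# A 13-minute session therefore fits at most this many trials
# (a derived bound, not a number stated in the text):
print(13 * 60 // trial_seconds)  # 27
```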

Observers were given a short explanation of the test structure and instructions for the viewing and voting sections. Observers were instructed to vote only during the voting cycle. Test sessions were limited to 13 minutes per observer. The technical literature generally recommends a session length of 30 minutes or less; a 13 minute session was chosen to significantly reduce the chance of observer fatigue.

The impairment scale is a scale of -3 to 3 in accordance with ITU-R BT.500-13 Table 4 [53]. A rating of 3 indicates that impaired video B is much better than impaired video A. By contrast, a rating of -3 indicates that impaired video B is much worse than impaired video A. A rating of 0 indicates the videos were perceptually equivalent. Observers were required to select whole number ratings. Figure 45 shows the rating scale.
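Aggregating the whole-number votes into the per-sequence averages reported later (Tables 25 through 32) can be sketched as follows. The helper `mean_opinion_score` is a hypothetical illustration, not code from the study:

```python
# Sketch: validate whole-number votes on the -3..3 comparison scale and
# average them into a per-sequence score. Positive scores favor video B,
# negative favor video A, and 0 means perceptually equivalent.

def mean_opinion_score(votes: list[int]) -> float:
    for v in votes:
        if not (-3 <= v <= 3):
            raise ValueError(f"vote {v} is outside the -3..3 impairment scale")
    return sum(votes) / len(votes)

# Example votes from eight hypothetical observers for one sequence pair.
print(round(mean_opinion_score([1, 0, -1, 2, 0, 1, -2, 1]), 2))  # 0.25
```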


Figure 45: Observer voting

Video sequences were shown on 4.3" LCDs with a resolution of 480 x 272 or 960 x 540. The observer was positioned approximately 14" from the display, as shown in Figure 46. Twenty-five (25) observers participated in the test over two separate days. Observers were 14 to 57 years of age and in good health. Corrective lenses were allowed and were used during the test if the lenses were prescribed.


Figure 46: Display viewing setup

The critical metrics for the threshold value within the proposed method are the viewing distance and the display diagonal. These two measurements determine the viewing cone in degrees. Observers' environmental conditions were controlled, as much as permitted, in order to stay within the intended viewing distance.
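The viewing cone implied above is the angle the display diagonal subtends at the observer's eye. A minimal sketch of that geometry, using the 4.3" display and 14" viewing distance from the test setup (the function name is an illustration, not from the dissertation):

```python
# Sketch: viewing cone (degrees) subtended by the display diagonal,
# computed from viewing distance and display diagonal (both in inches).

import math

def viewing_cone_deg(diagonal_in: float, distance_in: float) -> float:
    """Full angle, in degrees, subtended by the display diagonal."""
    # Half the diagonal and the viewing distance form a right triangle;
    # doubling the half-angle gives the full viewing cone.
    return math.degrees(2 * math.atan((diagonal_in / 2) / distance_in))

# 4.3" display viewed from approximately 14", as in the test setup.
print(round(viewing_cone_deg(4.3, 14.0), 1))  # 17.5
```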

7.5 Results

Objective Results

Video sequences were encoded with Q values of 22, 27, 32, and 37. Objective results are provided in table format for the Q value with bitrate closest to 400 kbps. The sequences listed in the tables are the video sequences used in the subjective testing. As noted earlier in this chapter, the depth reduced results were generated only for the base video sequences at the same Q value. Table 22 shows the bitrate and bitrate savings for the four configuration conditions of each video sequence. Table 23 shows the encoding times and encoding time savings for the four configuration conditions of each video sequence.

Table 22: Bitrates for subjective videos near 400kbps

Sequence            Q    Baseline           Depth              High Acuity        Low Acuity
                         kbps    Sav.(%)    kbps    Sav.(%)    kbps    Sav.(%)    kbps    Sav.(%)
Blowing Bubbles     32   352.5   0.0        363.0   -3.0       344.9   2.2        275.8   21.8
Basketball Pass     32   415.6   0.0        433.7   -4.3       390.5   6.1        353.2   15.0
BQ Square           32   307.0   0.0        317.0   -3.3       297.1   3.2        233.9   23.8
Race Horses (416)   32   297.6   0.0        311.3   -4.6       278.5   6.4        244.3   17.9
Basketball Drill    37   416.5   0.0        452.7   -8.7       329.0   21.0       326.6   21.6
BQ Mall             37   457.2   0.0        503.4   -10.1      375.8   17.8       363.0   20.6
Race Horses (832)   37   455.7   0.0        494.0   -8.4       348.7   23.5       332.4   27.0
Four People         32   422.9   0.0        428.7   -1.4       348.0   17.7       344.2   18.6
Johnny              27   415.3   0.0        423.8   -2.0       334.0   19.6       317.9   23.5
Kristen & Sara      32   315.6   0.0        318.9   -1.1       263.4   16.5       257.3   18.5

Bitrate in kbps; bitrate savings (%) relative to baseline.

Table 23: Encoding time for subjective videos near 400kbps

Sequence            Q    Baseline             Depth                High Acuity          Low Acuity
                         Time (s)   Sav.(%)   Time (s)   Sav.(%)   Time (s)   Sav.(%)   Time (s)   Sav.(%)
Blowing Bubbles     32   2128.86    0.0       1104.24    48.1      2086.41    2.0       1056.11    50.4
Basketball Pass     32   2618.68    0.0       1349.66    48.5      2367.07    9.6       1812.32    30.8
BQ Square           32   2248.68    0.0       1159.28    48.4      2166.20    3.7       1082.25    51.9
Race Horses (416)   32   1928.24    0.0       1000.48    48.1      1794.78    6.9       1290.90    33.1
Basketball Drill    37   4566.88    0.0       2461.49    46.1      2962.52    35.1      2879.67    36.9
BQ Mall             37   5000.21    0.0       2703.37    45.9      3268.81    34.6      2985.25    40.3
Race Horses (832)   37   3975.12    0.0       2141.12    46.1      2310.35    41.9      2031.83    48.9
Four People         32   7608.52    0.0       5767.68    24.2      5999.98    21.1      5927.75    22.1
Johnny              27   7804.40    0.0       5941.88    23.9      5301.06    32.1      4881.03    37.5
Kristen & Sara      32   7784.20    0.0       5938.61    23.7      5415.72    30.4      5213.09    33.0

Encoding time in seconds; time savings (%) relative to baseline.
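The savings columns in Tables 22 and 23 are computed relative to the baseline as (baseline - test) / baseline; a negative value means the test configuration used more bits (or time) than the baseline. A minimal sketch (the helper name is an illustration):

```python
# Sketch: the savings columns in Tables 22 and 23, relative to baseline.

def savings_pct(baseline: float, test: float) -> float:
    """Percent savings of a test configuration relative to the baseline."""
    return 100.0 * (baseline - test) / baseline

# "Blowing Bubbles" row: bitrate (kbps) and encoding time (s).
print(round(savings_pct(352.5, 363.0), 1))      # -3.0  (depth uses more bits)
print(round(savings_pct(2128.86, 1104.24), 1))  # 48.1  (depth encodes faster)
```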

Most complexity reduction papers reviewed in this chapter had ESD disabled. Most showed a 35% to 45% complexity reduction, typically with less than 0.10 dB loss. HEVC, in HM 10.0, natively provides an average encoding time savings of 34.46% and bitrate savings of 0.20% when ESD is enabled. The proposed method yields an additional 21.7% complexity reduction with a 13.4% decrease in bitrate on average, with minimal impact on subjective quality. The subjective test results are covered in the next section.

Subjective Evaluation Results

Subjective test results are separated into two sections according to the display resolution: section (1) has 960x540 MOS results in table and graph format, and section (2) has 480x272 MOS results in the same format. A positive average value means the proposed method was preferred when compared to the competing base (normal or depth reduced) method. Table 24 breaks down the competing acuity configurations and shows the average and standard deviation separately for each display resolution.

Table 24: Acuity average and standard deviation

Acuity Configuration     960x540           480x272           Bit Rate   Encoding Time
                         Avg.   Std.Dev.   Avg.   Std.Dev.   Savings    Savings
Base vs. High Acuity     -0.01  0.34       -0.03  0.18       13.4%      21.7%
Base vs. Low Acuity      -0.39  0.34       0.03   0.11       20.8%      38.5%
Depth vs. High Acuity    -0.18  0.39       -0.04  0.22       18.1%      -18.5%
Depth vs. Low Acuity     -0.26  0.27       -0.06  0.17       25.5%      -1.8%

The high acuity configuration vs. base configuration yielded consistent results between the display resolutions. 960x540 data is shown in Table 25 and Figure 47, and 480x272 data is shown in Table 29 and Figure 51. The high acuity configuration was viewed as essentially subjectively equivalent, since the average results are within a few hundredths of zero for both display resolutions. The encoding time savings is 21.7% and the bitrate savings is 13.4% for approximately the same subjective quality.

The low acuity configuration vs. base configuration had mixed results. Results for 960x540 are shown in Table 26 and Figure 48, and results for 480x272 are shown in Table 30 and Figure 52. The 480x272 display resolution had favorable results, with a subjective average of 0.03; however, the 960x540 display resolution yielded a subjective average of -0.39. This indicates the base video sequences are subjectively slightly better than the low acuity video sequences. The main reason for this significant difference is that the acuity limit within the threshold was significantly higher for video sequences viewed on the 960x540 display, which allowed the average HVS to notice subtle differences between the video sequences.

The high acuity configuration vs. depth reduced configuration gave subjective results slightly in favor of the depth reduced configuration: -0.18 for the 960x540 display resolution and -0.04 for the 480x272 display resolution. However, the depth reduced configuration paid a significant penalty in bitrate savings; the high acuity configuration used 18.1% less bandwidth. 960x540 results are shown in Table 27 and Figure 49, and 480x272 results are shown in Table 31 and Figure 53.

The low acuity configuration vs. depth reduced configuration had similar results. Table 28 and Figure 50 show 960x540 data, and Table 32 and Figure 54 show 480x272 data. Again, the depth reduced configuration paid a significant penalty in bitrate savings; the low acuity configuration used 25.5% less bandwidth. The encoding time savings were less than 1.8% apart, which is too small a difference to warrant the depth reduced configuration, especially when considering the loss in bitrate savings.

Display Resolution (960 x 540)

Table 25: MOS for base vs. high acuity at 960x540 and 400 kbps bit rate

Test                     Max   Min   Average
Blowing Bubbles          1     -2    0.12
Basketball Pass          1     -1    -0.22
BQ Square                1     -2    -0.38
Race Horses (416x240)    2     -1    0.38
Basketball Drill         2     -2    0.25
BQ Mall                  1     -2    -0.41
Race Horses (832x480)    1     -2    -0.53
Four People              2     -1    0.56
Johnny                   1     -1    0.00
Kristen & Sara           2     -1    0.13

Table 26: MOS for base vs. low acuity at 960x540 and 400 kbps bit rate

Test                     Max   Min   Average
Blowing Bubbles          2     -2    -0.24
Basketball Pass          1     -2    -0.78
BQ Square                0     -1    -0.38
Race Horses (416x240)    1     -2    -0.56
Basketball Drill         1     -2    -0.50
BQ Mall                  1     -2    -0.12
Race Horses (832x480)    1     -3    -1.12
Four People              1     -2    -0.22
Johnny                   1     -1    0.13
Kristen & Sara           0     -1    -0.13

Table 27: MOS for depth reduced vs. high acuity at 960x540 and 400 kbps bit rate

Test                     Max   Min   Average
Blowing Bubbles          2     -1    0.35
Basketball Pass          1     -1    0.22
BQ Square                1     -1    -0.13
Race Horses (416x240)    2     -1    0.31
Basketball Drill         1     -2    -0.63
BQ Mall                  1     -2    -0.41
Race Horses (832x480)    2     -3    -0.71
Four People              1     -2    -0.67
Johnny                   1     -1    -0.13
Kristen & Sara           0     0     0.00

Table 28: MOS for depth reduced vs. low acuity at 960x540 and 400 kbps bit rate

Test                     Max   Min   Average
Blowing Bubbles          2     -3    -0.29
Basketball Pass          1     -2    -0.22
BQ Square                0     -1    -0.38
Race Horses (416x240)    1     -3    -0.81
Basketball Drill         2     -3    -0.31
BQ Mall                  2     -2    -0.06
Race Horses (832x480)    1     -3    -0.59
Four People              3     -1    0.11
Johnny                   1     -2    0.00
Kristen & Sara           0     0     0.00

Figure 47: MOS for base vs. high acuity (960x540 display resolution and 400 kbps)

Figure 48: MOS for base vs. low acuity (960x540 display resolution and 400 kbps)


Figure 49: MOS for depth reduced vs. high acuity (960x540 display resolution and 400 kbps)

Figure 50: MOS for depth reduced vs. low acuity (960x540 display resolution and 400 kbps)

Display Resolution (480 x 272)

Table 29: MOS for base vs. high acuity at 480x272 and 400 kbps

Test                     Max   Min   Average
Blowing Bubbles          2     -1    0.27
Basketball Pass          0     -1    -0.17
BQ Square                0     0     0.00
Race Horses (416x240)    1     -1    -0.20
Basketball Drill         1     -2    -0.30
BQ Mall                  2     -1    0.27
Race Horses (832x480)    2     -2    0.00
Four People              0     0     0.00
Johnny                   0     -1    -0.20
Kristen & Sara           0     0     0.00

Table 30: MOS for base vs. low acuity at 480x272 and 400 kbps

Test                     Max   Min   Average
Blowing Bubbles          1     -2    -0.18
Basketball Pass          1     -1    0.00
BQ Square                1     -1    0.00
Race Horses (416x240)    2     -2    0.20
Basketball Drill         1     -1    0.00
BQ Mall                  1     -1    0.09
Race Horses (832x480)    1     -1    0.00
Four People              0     0     0.00
Johnny                   1     -1    0.00
Kristen & Sara           1     0     0.20

Table 31: MOS for depth reduced vs. high acuity at 480x272 and 400 kbps

Test                     Max   Min   Average
Blowing Bubbles          1     0     0.27
Basketball Pass          0     -1    -0.33
BQ Square                0     -1    -0.20
Race Horses (416x240)    1     -1    -0.20
Basketball Drill         1     -1    -0.10
BQ Mall                  1     -1    0.36
Race Horses (832x480)    2     -1    0.18
Four People              1     -1    0.00
Johnny                   0     -1    -0.20
Kristen & Sara           1     -1    -0.20

Table 32: MOS for depth reduced vs. low acuity at 480x272 and 400 kbps

Test                     Max   Min   Average
Blowing Bubbles          1     -1    0.27
Basketball Pass          1     0     0.17
BQ Square                0     -1    -0.20
Race Horses (416x240)    1     -1    -0.20
Basketball Drill         1     -1    -0.30
BQ Mall                  1     -1    -0.09
Race Horses (832x480)    1     -1    -0.09
Four People              0     0     0.00
Johnny                   0     0     0.00
Kristen & Sara           0     -1    -0.20


Figure 51: MOS for base vs. high acuity (480x272 display resolution and 400 kbps)

Figure 52: MOS for base vs. low acuity (480x272 display resolution and 400 kbps)


Figure 53: MOS for depth reduced vs. high acuity (480x272 display resolution and 400 kbps)

Figure 54: MOS for depth reduced vs. low acuity (480x272 display resolution and 400 kbps)

High acuity encoded sequences were compared with sequences where early skip selections were made at random. Naturally, the expectation is that the randomly selected early skip sequences will perform poorly when compared against the proposed method, and this is indeed the case. Using the same scoring scale of -3 to 3, subjective results averaged a score of 0.42 in favor of the proposed method. This indicates the proposed method is subjectively slightly better than the random skip video sequences.

7.6 Concluding Remarks

For the mobile conditions examined in this study, significant additional encoding time savings and bitrate savings can be obtained. The proposed method augments HM 10.0 with an early skip detection algorithm. The computational complexity reductions achieved by the high acuity configuration yielded an encoding time savings of 21.7% and a bitrate savings of 13.4% for approximately the same subjective quality. These are significant savings that warrant HVS consideration within encoding decisions.

8 CONCLUSION

Mobile devices have penetrated the consumer market in great strides during the last 10 years. Applications such as mobile device to mobile device video conferencing have gained popularity within the consumer community on smart mobile devices. With the increase of data use for video conferencing, resources such as bandwidth and video coding processing can be at a premium. Therefore, methods that reduce the strain on these resources without affecting the user's perception are ideal.

Mobile devices offer a unique environment for video streaming use by consumers. The display screen is comparatively smaller than that of other devices, such as notebooks and televisions. The mobile device typically uses wireless WAN service, which implies a propensity for bandwidth limitations from data transmission on these networks. These mobile device characteristics lead to boundaries inherent to this environment that were taken into consideration. In addition, by adding HVS characteristics into the models, several factors could be exploited. Indeed, the experiments performed within the scope of this research showed very promising results.

Optimizing video services is the central theme of this dissertation. The research showed the subjective impact between H.264 and HEVC in a mobile environment. Little or no difference was observed between H.264 and HEVC where uniform motion existed and adequate mobile bandwidth, such as 400 kbps, was available. A joint research effort with the Universidad de Castilla-La Mancha led to an evaluation of coding tool sets using predictive modeling techniques for use within the mobile device to adjust video compression factors, such as CU size, to work optimally within video conferencing applications. Predictive modeling was used to estimate ΔPSNR and encoding time savings, which led to optimal use of HEVC encoding options. The model maximized quality while reducing encoding time. Mobile video conferencing environments can benefit from the predictive modeling, which gives a reliable method to estimate PSNR loss and encoding time savings. In addition, by taking advantage of the mismatches between the display resolution, video resolution, and HVS acuity and motion sensitivity, further complexity reduction can be obtained. This area of research focused on the existing early skip method available within HEVC. The proposed methods showed there is significant opportunity to reduce video coding complexity and improve bit rates without significant impact to the subjective results. For mobile conditions, the early skip decision making was enhanced to obtain additional encoding time savings and bitrate savings for essentially the same subjective quality. The significant savings warrant HVS consideration within encoding decisions.

The research methods used and developed within this work will become significant contributors to HEVC and to the next generation coding standard. Folding HVS elements into decision making and predictive analysis will be key to reducing complexity while limiting subjective impact. By highlighting the factors that successfully contribute to this effort, this work will hopefully increase awareness of the effectiveness of these methods.

9 FUTURE WORK

As shown in the earlier sections, there is ample opportunity in combining HVS factors with HEVC optimization strategies. The continuation of this approach warrants further research. More specifically, research into improving the motion threshold within the enhanced skip mode is needed. Motion sensitivity and velocity sensitivity factors offer the most compelling elements for achieving further optimization improvements.

10 REFERENCES

[1] Jeremy Wolfe, Keith Kluender, Dennis Levi, Linda M. Bartoshuk, Rachel S. Herz,

Roberta L. Klatzky, and Susan J. Lederman, "Sensation & Perception," Sinauer

Associates, Third Edition, Jan. 2006.

[2] Peter K. Ahnelt, Helga Kolb, and Renate Pflug, "Identification of a subtype of cone

photoreceptor, likely to be blue sensitive, in the human retina," The Journal of

comparative neurology, vol.255, issue.1, pp.18-34, Jan. 1987.

[3] Ray Garcia and Hari Kalva, "Subjective Evaluation of HEVC in Mobile Devices,"

IS&T/SPIE Electronic Imaging, vol.8667, session.4, pp.86670L, Mar. 2013.

[4] Ray Garcia and Hari Kalva, "Human Mobile-Device Interaction on HEVC and H.264

Subjective Evaluation for Video Use in Mobile Environment," IEEE International

Conference on Consumer Electronics (ICCE 2013), pp.639-640, Jan. 2013.

[5] Ray Garcia and Hari Kalva, "Subjective Evaluation of HEVC and AVC/H.264 in

Mobile Environments," IEEE Transactions on Consumer Electronics, vol.60,

issue.1, pp.1-8, Feb. 2014.

[6] Ray Garcia, Damian Ruiz-Coll, Hari Kalva, and Gerardo Fernández-Escribano,

"HEVC Decision Optimization for Low Bandwidth in Video Conferencing

Applications in Mobile Environments," 2013 IEEE International Conference on

Multimedia and Expo Workshops (ICMEW), pp.1-6, Jul. 2013.

125 [7] Damián Ruiz, Velibor Adzic, Ray Garcia, Hari Kalva, Gerardo Fernández, J.Luis

Martínez, and Pedro Cuenca, "Algoritmo de baja complejidad para la predicción

Intra-Frame en HEVC," Jornadas Sarteco, pp.1-6, Sep. 2013.

[8] Ray Garcia and Hari Kalva, "HEVC Inter-frame Skip Enhancement At Low Bit

Rates," IEEE International Conference on Consumer Electronics (ICCE 2014),

pp.1-2, Jan. 2014.

[9] Cisco White Paper, "Cisco Visual Networking Index: Forecast and Methodology,

2010-2015," Retrieved on 2011November06 from

http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827

/white_paper_c11-481360.pdf, pp.1-16, Jun. 2011.

[10] Gartner, "Gartner Says Asia/Pacific Led Worldwide Mobile Phone Sales to Growth

in First Quarter of 2013," Retrieved on 2013June22 from

http://www.gartner.com/newsroom/id/2482816, May. 2013.

[11] Frost & Sullivan, "Overcoming the Challenges of Mobile Video Conferencing," A

Frost & Sullivan Whitepaper, pp.1-8, May. 2013.

[12] Marco Jacobs and Jonah Probell, "A Brief History of Video Coding," ARC

International Whitepaper, Jan. 2007.

[13] Raymond Davis Kell, "Transmission and Reception of Pictures," US Patent

1,992,009 (Filing: Apr. 15, 1929, Issue: Feb 19, 1935), pp.1-3, Apr. 1929.

[14] ITU-T, "Codecs for Videoconferencing using Primary Digital Group Transmission,"

ITU-T Recommendation H.120, pp.1-66, Mar. 1993.

[15] ITU-T, "Line Transmission of Non-telephone Signals - Video CODEC for Audio

Visual at px64 kbits," ITU-T H.261, pp.1-29, Mar. 1993.

126 [16] ITU-T, "SERIES H: Audiovisual and Multimedia Systems - Video coding for low

bit rate communication," ITU-T Recommendation H.263, pp.1-226, Jan. 2005.

[17] ITU-T, "SERIES H: Audiovisual and Multimedia Systems - Advanced Video

Coding for Generic Audiovisual Services," ITU-T Recommendation H.264, pp.1-

680, Jan. 2012.

[18] , Gary J. Sullivan, Gisle Bjøntegaard, and Ajay Luthra, "Overview

of the H.264/AVC Video Coding Standard," IEEE Transactions on Circuits and

Systems for Video Technology, vol.13, issue.7, pp.560-576, Jul. 2003.

[19] Benjamin Bross, Woo-Jin Han, Jens-Rainer Ohm, Gary J. Sullivan, Ye-Kui Wang,

and Thomas Wiegand, "High Efficiency Video Coding (HEVC) text specification

draft 10 (for FDIS & Last Call)," Joint Collaborative Team on Video Coding

(JCT-VC) of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29/WG 11, JCTVC-

L1003_v34, pp.1-310, Jan. 2013.

[20] Gary J. Sullivan, Jens-Rainer Ohm, Woo-Jin Han, Member, and Thomas Wiegand,

"Overview of the High Efficiency Video Coding (HEVC) Standard," IEEE

Transactions on Circuits and Systems for Video Technology, vol. 22, issue. 12,

pp.1649-1668, Dec. 2012.

[21] Jens-Rainer Ohm, Gary J. Sullivan, Heiko Schwarz, Thiow Keng Tan, and Thomas

Wiegand, "Comparison of the Coding Efficiency of Video Coding Standards

Including High Efficiency Video Coding (HEVC)," IEEE Transactions on

Circuits and Systems for Video Technology, vol.22, no.12, pp.1669-1684, Dec.

2012.

127 [22] Junghye Min, Tammy Lee, Woo-Jin Han, and JeongHoon Park, "Block Partitioning

Structure in the HEVC Standard," IEEE Transactions on Circuits and Systems for

Video Technology, vol.22, issue.12, pp.1697-1706, Dec. 2012.

[23] G. J. Sullivan and J.-R. Ohm, "Recent developments in standardization of high

efficiency video coding (HEVC)," SPIE Applications of Digital Image Processing

XXXIII, vol.7798, pp.1-7, Aug. 2010.

[24] Kimihiko Kazui, Junpei Koyama, and Akira Nakagawa, "Industry needs of very low

delay coding," Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T

SG16 WP3 and ISO/IEC JTC1/SC29/WG11, JCTVC-F147, pp.1, Jul. 2011.

[25] Helga Kolb, "How the Retina Works," American Scientist, vol.91, pp.28-35, Feb.

2003.

[26] Helga Kolb, Ralph Nelson, Eduardo Fernandez, and Bryan Jones, "The Organization

of the Retina and Visual System," Retrieved on 2013Oct19 from

http://webvision.med.utah.edu/book/, Apr. 2012.

[27] Stephen L. Polyak, "The Retina," History of Science Society, vol.34, issue.3,

pp.234-235, Sep. 1943.

[28] Christine A. Curcio, Kenneth R. Sloan, Robert E. Kalina, and Anita E. Hendrickson,

"Human Photoreceptor Topography," Journal of Comparative Neurology,

vol.292, issue.4, pp.497-523, Feb. 1990.

[29] Sean McCarthy, "Quantitative Evaluation of Human Visual Perception for Multiple

Screens and Multiple Codecs," SMPTE Mot. Imag J., vol.122, no.4, pp.36-42,

Jun. 2013.

128 [30] Raymond Dodge, "Five Types of Eye Movement in the Horizontal Meridian Plane

of the Field of Regard," American Journal of Physiology, vol.8, issue.4, pp.307-

329, Jan. 1903.

[31] D. A. Robinson, "The Mechanics of Human Smooth Pursuit Eye Movement," The

Journal of Physiology, vol.180, no.3, pp.569-591, Jan. 1965.

[32] Bernd Girod, "Eye Movements and Coding of Video Sequences," Visual

Communications and Image Processing '88, 398, pp.1-8, Oct. 1988.

[33] Velibor Adzic, Hari Kalva, Lai-Tee Cheok, "Adapting Vdeo Delivery Based on

Motion Triggered Visual Attention," Applications of Digital Image Processing

XXXV, 84991L, pp.1-6, Aug. 2012.

[34] Ken Nakayama, "Differential Motion Hyperacuity under Conditions of Common

Image Motion," Vision Research, vol.21, issue.10, pp.1475-1482, Mar. 1981.

[35] Ning Qian, Richard A. Andersen, and Edward H. Adelsson, "Transparent Motion

Perception as Detection of Unbalanced Motion Signals," The Journal of

Neuroscience, vol.14, issue.12, pp.7357-66, Dec. 1994.

[36] M. Narwaria, W. Lin, and A. Liu, "Low-Complexity Video Quality Assessment

Using Temporal Quality Variations," IEEE Transactions on Multimedia, vol.14,

issue.3, pp.525-536, Jun. 2012.

[37] Anush Krishna Moorthy and Alan Conrad Bovik, "Efficient Video Quality

Assessment Along Temporal Trajectories," IEEE Transactions on Circuits and

Systems for Video Technology, vol.20, issue.11, pp.1653-1658, Nov. 2010.

129 [38] Manish Narwaria, Weisi Lin, "Scalable Image Quality Assessment Based on

Structural Vectors," IEEE International Workshop on Multimedia Signal

Processing, 2009. MMSP '09., vol., issue., pp.1-6, Oct. 2009.

[39] Damon M. Chandler and Sheila S. Hemami, "VSNR: A Wavelet-Based Visual

Signal-to-Noise Ratio for Natural Images," IEEE Transactions on Image

Processing, vol.16, issue.9, pp.2284-2298, Sep. 2007.

[40] A.K. Moorthy, K. Seshadrinathan, R. Soundararajan, and A.C.Bovik,, "Wireless

Video Quality Assessment: A Study of Subjective Scores and Objective

Algorithms," IEEE Transactions on Circuits and Systems for Video Technology,

vol.20, issue.4, pp. 587-599, Apr. 2010.

[41] Kai Zeng, Abdul Rehman, Jiheng Wang and Zhou Wang, "From H.264 TO HEVC:

Coding Gain Predicated by Objective Video Quality Assessment Models,"

International Workshop on Video Processing and Quality Metrics for Consumer

Electronics (2013), pp.1-6, Feb. 2013.

[42] Margaret H. Pinson and Stephen Wolf, "A New Standardized Method for

Objectively Measuring Video Quality," IEEE Transactions on Broadcasting,

vol.50, issue.3, pp.312-322, Sep. 2004.

[43] Kalpana Seshadrinathan and Alan Conrad Bovik, "Motion Tuned Spatio-Temporal

Quality Assessment of Natural Videos," IEEE Transactions on Image Processing,

vol.19, issue.2, pp.335-350, Feb. 2010.

[44] P. Ndjiki-Nyaa, D. Doshkova, H. Kaprykowskya, F. Zhangc, D. Bullc, and T.

Wieganda, "Perception-oriented video coding based on image analysis and

130 completion: A review," Signal Processing: Image Communication, vol.27,

issue.6, pp.579-594, Jan. 2012.

[45] J. O. Fajardo, I. Taboada and F. Liberal, "Quality assessment for mobile media-

enriched services: impact of video lengths," Communications in Mobile

Computing 2012; http://www.comcjournal.com/content/1/1/2, vol.1, issue.2, Jan.

2012.

[46] Z. Ma, M. Xu, Y. Ou, and Y. Wang, "Modeling of Rate and Perceptual Quality of

Compressed Video as Functions of Frame Rate and Quantization Stepsize and its

Applications," IEEE Transactions on Circuits and Systems for Video Technology,

issue.99, Nov. 2011.

[47] A. Toet, "Computational versus Psychophysical Bottom-Up Image Saliency: A

Comparative Evaluation Study," IEEE Transactions on Pattern Analysis and

Machine Intelligence, vol.3, issue.11, Nov. 2011.

[48] L. Itti, "Quantifying the contribution of low-level saliency to human eye movements

in dynamic scenes," Visual Cognition, vol.12, issue.6, Jan. 2005.

[49] F. Ribeiro, and D. Florencio, "Region of Interest Determination Using Human

Computation," IEEE 13th International Workshop on Multimedia Signal

Processing (MMSP 2011), pp.1-5, Oct. 2011.

[50] J. Xue and C. W. Chen, "Towards Viewing Quality Optimized Video Adaptation,"

IEEE International Conference on Multimedia and Expo (ICME 2011), pp.1 - 6,

Jul. 2011.

131 [51] Hendrik Knoche, John D. Mccarthy, and M. Angela Sasse, "How low can you go?

The effect of low resolutions on shot types in mobile TV," Multimedia Tools and

Applications, vol.36, issue.1-2, pp.145-166, Jan. 2008.

[52] Vittorio Baroncini, Jens-Rainer Ohm, and Gary Sullivan, "Report of Subjective Test

Results of Responses to the Joint Call for Proposals (CfP) on Video Coding

Technology for High Efficiency Video Coding (HEVC)," Joint Collaborative

Team on Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC

JTC1/SC29/WG11, JCTVC-A204, pp.1-33, Apr. 2010.

[53] ITU-R, "Methodology for the subjective assessment of the quality of television

pictures," Recommendation ITU-R BT.500-13, ITU-R BT.500-13, pp.1-46, Jan.

2012.

[54] TK Tan, A. Fujibayashi, Y. Suzuki, and J.Takiue, "Objective and subjective

evaluation of HM5.0.," Joint Collaborative Team on Video Coding (JCT-VC) of

ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, JCTVC-H0116, 1-24, Feb.

2012.

[55] Jens-Rainer Ohm, Gary Sullivan, Frank Bossen, Thomas Wiegand, Vittorio

Baroncini, Mathias Wien, and Jizheng Xu, "JCT-VC AHG report: HM subjective

quality investigation (AHG22)," Joint Collaborative Team on Video Coding

(JCT-VC) of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, JCTVC-

H0022r1, 1-3, Feb. 2012.

[56] Michael Horowitz, Faouzi Kossentini, Nader Mahdi, Shilin Xu, Hsan Guermazi,

Hassene Tmar, Bin Li, Gary Sullivan, Jizheng Xu, "Informal Subjective Quality

Comparison of Video Compression Performance of the HEVC and H.264 /

132 MPEG-4 A VC Standards for Low-Delay Applications," SPIE Applications of

Digital Image Processing XXXV, vol.8499, 84990W, pp.1-6, Aug. 2012.

[57] Philippe Hanhart, Martin Rerabek, Francesca De Simone, and Touradj Ebrahimi,

"Subjective Quality Evaluation of the Upcoming HEVC Video Compression

Standard," Retrieved on 2012October10 from http://www.itu.int/ITU-

T/studygroups/com16/jct-vc/, pp.1-13, Aug. 2012.

[58] Bin Li, Gary J. Sullivan, and Jizheng Xu, "Comparison of Compression Performance

of HEVC Draft 6 with AVC High Profile," Joint Collaborative Team on Video

Coding (JCT-VC) of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29/WG 11,

JCTVC-I0409, pp.1-6, May. 2012.

[59] L. Sahafi, T.S. Randhawa, R.H.S. Hardy, "Context-based Complexity Reduction of

H.264 in Video over Wireless Applications," IEEE 6th Workshop on Multimedia

Signal Processing (2004), pp.23-26, Oct. 2004.

[60] Tea Anselmo, Daniele Alfonso, "JCTVC-G262 HM decoder complexity assessment

on ARM," Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T SG16

WP3 and ISO/IEC JTC1/SC29/WG11 Geneva November 2011, JCTVC-G262,

pp.1-10, Nov. 2011.

[61] C.P. Singh, N. Singh, and R. Tripathi, "Optimization of Standards for Video

Compression Tools over Wireless Networks," International Conference on Recent

Advances in Information Technology (RAIT) 2012 1st, pp.114-118, Mar. 2012.

[62] C.-L. Hsu C.-H. Cheng, "Reduction of Discrete Cosine Transform/

Quantisation/Inverse Quantisation/Inverse Discrete Cosine Transform

Computational Complexity in H.264 Video Encoding by Using an Efficient

133 Prediction Algorithm," IET Image Processing, vol.3, issue.4, pp.177-187, Aug.

2009.

[63] Gianluca Bailo, Massimo Bariani, Ivano Barbieri, and Murco Raggio, "Search

Windows Size Decision for Motion Estimation Algorithm in H.264 Video

Coder," International Conference on Image Processing (ICIP 2004), vol.3,

pp.1453-1456, Oct. 2004.

[64] Paula Carrillo, Tao Pin, and Hari Kalva, "Low Complexity H.264 Video Encoder

Design Using Machine Learning Techniques," Digest of Technical Papers

International Conference on Consumer Electronics (ICCE 2010), pp.461-462, Jan.

2010.

[65] Donghyung Kim and Jechang Jeong, " A Fast Mode Selection Algorithm in H.264

Video Coding," IEEE International Conference on Multimedia and Expo (2006),

pp.1709-1712, Jul. 2006.

[66] Avin Kumar Kannur and Baoxin Li, "Power-Aware Content-Adaptive H.264 Video

Encoding," IEEE International Conference on Acoustics, Speech and Signal

Processing (ICASSP 2009), pp.925-928, Apr. 2009.

[67] Anwarul Kaium Patwary and Mohamed Othman, "Fast Adaptive Motion Estimation

for H.264," International Conference on Multimedia Technology (ICMT 2011),

pp.6330-6333, Jul. 2011.

[68] Byoungman An, Youngseop Kim, and Oh Jin Kwon, "Low-complexity motion

estimation for H.264/AVC through perceptual video coding," KSII Transactions

on Internet and Information Systems, vol.5, issue.8, pp.1444-1456, Aug. 2011.

[69] Xiang Li, Mathias Wien, and Jens-Rainer Ohm, "Rate-Complexity-Distortion Optimization for Hybrid Video Coding," IEEE Transactions on Circuits and Systems for Video Technology, vol.21, issue.7, pp.957-970, Jul. 2011.

[70] Jani Lainema, Frank Bossen, Woo-Jin Han, Junghye Min, and Kemal Ugur, "Intra Coding of the HEVC Standard," IEEE Transactions on Circuits and Systems for Video Technology, issue.99, pp.1-11, Oct. 2012.

[71] M. Mrak and J. Xu, "Improving Screen Content Coding in HEVC By Transform Skipping," European Signal Processing Conference (EUSIPCO 2012), pp.1209-1213, Aug. 2012.

[72] RyeongHee Gweon and Yung-Lyul Lee, "N-Level Quantization in HEVC," IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB 2012), pp.1-5, Jun. 2012.

[73] Andrey Norkin, Kenneth Andersson, Arild Fuldseth, and Gisle Bjontegaard, "HEVC Deblocking Filtering and Decisions," SPIE Applications of Digital Image Processing XXXV, vol.8499, 849912, pp.1-8, Aug. 2012.

[74] Z. Ma and A. Segall, "Low Resolution Decoding For High-Efficiency Video Coding," URL: http://vision.poly.edu/~zma03/paper/Zhan_Segall_SIP11.pdf, Dec. 2011.

[75] Zhan Ma and Andrew Segall, "System for Graceful Power Degradation," Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, JCTVC-B114, pp.1-8, Jul. 2010.

[76] Seunghyun Cho and Munchurl Kim, "Fast CU Splitting and Pruning for Suboptimal CU Partitioning in HEVC Intra Coding," IEEE Transactions on Circuits and Systems for Video Technology, vol.PP, issue.99, Feb. 2013.

[77] Jaemoon Kim, Jaehyun Kim, Kiwon Yoo, and Kyohyuk Lee, "Analysis and Complexity Reduction of High Efficiency Video Coding for Low-Delay Communication," IEEE International Conference on Consumer Electronics - Berlin (ICCE-Berlin 2012), pp.11-12, Sep. 2012.

[78] Ryoji Hattori, Kazuo Sugimoto, Yusuke Itani, Shun-ichi Sekiguchi, and Tokumichi Murakami, "Fast Bypass Mode for CABAC," Picture Coding Symposium (PCS 2012), pp.417-420, May 2012.

[79] Thaísa L. da Silva, Luciano V. Agostini, and Luis A. da Silva Cruz, "Fast HEVC Intra Prediction Mode Decision Based on Edge Direction Information," Proceedings of the 20th European Signal Processing Conference (EUSIPCO 2012), pp.1214-1218, Aug. 2012.

[80] Wei Dai, Oscar C. Au, Chao Pang, Lin Sun, Ruobing Zou, and Sijin Li, "A Novel Fast Two Step Sub-pixel Motion Estimation Algorithm in HEVC," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2012), pp.1197-1200, Mar. 2012.

[81] Jaehwan Kim, Jungyoup Yang, Kwanghyun Won, and Byeungwoo Jeon, "Early Determination of Mode Decision for HEVC," Picture Coding Symposium (PCS 2012), pp.449-452, May 2012.

[82] Kiho Choi, Sang-Hyo Park, and Euee S. Jang, "JCTVC-F092: Coding tree pruning based CU early termination," Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, Torino, July 2011, JCTVC-F092, pp.1-11, Jul. 2011.

[83] M. B. Cassa, M. Naccari, and F. Pereira, "Fast Rate Distortion Optimization for the Emerging HEVC Standard," Picture Coding Symposium (PCS 2012), pp.493-496, May 2012.

[84] Guilherme Corrêa, Pedro Assuncao, Luciano Agostini, and Luis A. da Silva Cruz, "Complexity Control of High Efficiency Video Encoders for Power-Constrained Devices," IEEE Transactions on Consumer Electronics, vol.57, issue.4, pp.1866-1874, Nov. 2011.

[85] Guilherme Corrêa, Pedro Assuncao, Luis A. da Silva Cruz, and Luciano Agostini, "Adaptive Coding Tree for Complexity Control of High Efficiency Video Encoders," Picture Coding Symposium (PCS 2012), pp.425-428, May 2012.

[86] Guilherme Correa, Pedro Assuncao, Luis A. da Silva Cruz, and Luciano Agostini, "Dynamic Tree-Depth Adjustment for Low Power HEVC Encoders," IEEE International Conference on Electronics, Circuits and Systems (ICECS 2012), pp.564-567, Dec. 2012.

[87] Felipe Sampaio, Sergio Bampi, Mateus Grellert, Luciano Agostini, and Julio Mattos, "Motion Vectors Merging: Low Complexity Prediction Unit Decision Heuristic for the Inter-Prediction of HEVC Encoders," IEEE International Conference on Multimedia and Expo (ICME 2012), pp.657-662, Jul. 2012.

[88] Xiaolin Shen and Lu Yu, "CU Splitting Early Termination Based on Weighted SVM," EURASIP Journal on Image and Video Processing, vol.2013, issue.1, pp.1-11, Jan. 2013.

[89] Liquan Shen, Zhi Liu, Xinpeng Zhang, Wenqiang Zhao, and Zhaoyang Zhang, "An Effective CU Size Decision Method for HEVC Encoders," IEEE Transactions on Multimedia, vol.15, issue.2, pp.465-470, Feb. 2013.

[90] Jie Leng, Lei Sun, Takeshi Ikenaga, and Shinichi Sakaida, "Content Based Hierarchical Fast Coding Unit Decision Algorithm for HEVC," International Conference on Multimedia and Signal Processing (CMSP 2011), vol.1, pp.56-59, May 2011.

[91] Guilherme Correa, Pedro Assuncao, Luciano Agostini, and Luis A. da Silva Cruz, "Motion Compensated Tree Depth Limitation for Complexity Control of HEVC Encoding," 19th IEEE International Conference on Image Processing (ICIP 2012), pp.217-220, Sep. 2012.

[92] J. Nightingale, Q. Wang, and C. Grecos, "HEVStream: A Framework for Streaming and Evaluation of High Efficiency Video Coding (HEVC) Content in Loss-Prone Networks," IEEE Transactions on Consumer Electronics, vol.58, issue.2, pp.404-412, May 2012.

[93] M.T. Pourazad, C. Doutre, M. Azimi, and P. Nasiopoulos, "HEVC: The New Gold Standard for Video Compression: How Does HEVC Compare with H.264/AVC?," IEEE Consumer Electronics Magazine, vol.1, issue.3, pp.36-46, Jul. 2012.

[94] ITU-T, "Joint Call for Proposals on Video Compression Technology," ITU-T Q6/16 (Kyoto, 22 January 2010), VCEG-AM91, pp.1-19, Jan. 2010.

[95] Yuta Yamamura, Shinya Iwasaki, Yasutaka Matsuo, and Jiro Katto, "Quality Assessment of Compressed Video Sequences Having Blocking Artifacts by Cepstrum Analysis," IEEE International Conference on Consumer Electronics (ICCE 2013), pp.494-495, Jan. 2013.

[96] Brightcove, Boston, MA, "Encoding for Mobile Delivery," white paper, Jul. 2008.

[97] ITU-T, "Subjective video quality assessment methods for multimedia applications," Recommendation ITU-T P.910, pp.1-42, Apr. 2008.

[98] Tobias Oelbaum, Vittorio Baroncini, Thiow Keng Tan, and Charles Fenimore, "Subjective Quality Assessment of the Emerging AVC/H.264 Video Coding Standard," International Broadcast Conference (IBC), Sep. 2004.

[99] Frank Bossen, "On software complexity," Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, Geneva, November 2011, JCTVC-G757r1, pp.1-7, Nov. 2011.

[100] Marko Viitanen, Jarno Vanne, Timo D. Hämäläinen, Moncef Gabbouj, and Jani Lainema, "Complexity analysis of next-generation HEVC decoder," IEEE International Symposium on Circuits and Systems (ISCAS 2012), pp.882-885, May 2012.

[101] Brightcove, "Encoding for Mobile Delivery," retrieved July 8, 2012 from http://support.brightcove.com/en/docs/encoding-mobile-delivery, Jul. 2012.

[102] Apple, "iOS Developer Library - Preparing Media for Delivery to iOS-Based Devices," retrieved July 8, 2012 from https://developer.apple.com/library/ios/#documentation/NetworkingInternet/Conceptual/StreamingMediaGuide/UsingHTTPLiveStreaming/UsingHTTPLiveStreaming.html#//apple_ref/doc/uid/TP40008332-CH102-SW8, Jul. 2012.

[103] Frank Bossen, "Common test conditions and software reference configurations," Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, JCTVC-I1100, pp.1-3, May 2012.

[104] Fraunhofer Institut Nachrichtentechnik Heinrich-Hertz Institut, "High Efficiency Video Coding (HEVC) version 6.0," retrieved May 6, 2012 from http://hevc.hhi.fraunhofer.de/, May 2012.

[105] Frank Bossen, "Common test conditions and software reference configurations," Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, JCTVC-J1100, pp.1-3, Jul. 2012.

[106] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten, "The WEKA Data Mining Software: An Update," SIGKDD Explorations, vol.11, issue.1, pp.1-9, Jun. 2009.

[107] Bumshik Lee and Munchurl Kim, "No-Reference PSNR Estimation for HEVC Encoded Video," IEEE Transactions on Broadcasting, vol.59, issue.1, pp.20-27, Mar. 2013.

[108] Fraunhofer Institut Nachrichtentechnik Heinrich-Hertz Institut, "High Efficiency Video Coding (HEVC) version 10.0," retrieved March 17, 2013 from http://hevc.hhi.fraunhofer.de/, Mar. 2013.
