HEVC OPTIMIZATION IN MOBILE ENVIRONMENTS
by
Ray Garcia
A Dissertation Submitted to the Faculty of
The College of Engineering and Computer Science
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
Florida Atlantic University
Boca Raton, FL
May 2014
Copyright by Ray Garcia 2014
ACKNOWLEDGEMENTS
As we journey through life, individual achievements are rarely truly individual; they are a
collection of the help, encouragement, and support of a multitude of people involved directly or
indirectly in the endeavor. My scholastic effort on this dissertation is no different.
First and foremost I want to thank my wife for her patience and support. Without this the
manuscript would not have been possible. In addition, the staff at Florida Atlantic University
was invaluable in providing guidance and recommendations. I am very thankful to my
advisor, Dr. Hari Kalva, my committee members Dr. Borko Furht, Dr. Imad Mahgoub, Dr.
Daniel Raviv and graduate department staff Jean Mangiaracina.
I am truly grateful to all.
ABSTRACT
Author: Ray Garcia
Title: HEVC Optimization in Mobile Environments
Institution: Florida Atlantic University
Dissertation Advisor: Dr. Hari Kalva
Degree: Doctor of Philosophy
Year: 2014
Recently, multimedia applications and their use have grown dramatically in popularity, in large part due to mobile device adoption by the consumer market. Applications such as video conferencing have gained popularity. These applications and others have a strong video component that draws on the mobile device's resources: processing time, network bandwidth, memory, and battery life.
The goal is to reduce the need for these resources by reducing the complexity of the coding process. Mobile devices offer unique characteristics that can be exploited for optimizing video codecs. The combination of small display size, video resolution, and human vision factors, such as acuity, allows encoder optimizations that will not (or only minimally) impact subjective quality.
The focus of this dissertation is optimizing video services in mobile environments. Industry has begun migrating from H.264 video coding to the more resource-intensive but more compression-efficient High Efficiency Video Coding (HEVC). However, there has been no proper evaluation and optimization of HEVC for mobile environments.
Subjective quality evaluations were performed to assess relative quality between H.264
and HEVC. This will allow for better use of device resources and migration to new
codecs where it is most useful. Complexity of HEVC is a significant barrier to adoption
on mobile devices and complexity reduction methods are necessary. Optimal use of encoding options is needed to maximize quality and compression while minimizing encoding time. Methods for optimizing coding mode selection for HEVC were
developed. Complexity of HEVC encoding can be further reduced by exploiting the
mismatch between the resolution of the video, resolution of the mobile display, and the
ability of the human eyes to acquire and process video under these conditions. The
perceptual optimizations developed in this dissertation use the properties of spatial
(visual acuity) and temporal information processing (motion perception) to reduce the
complexity of HEVC encoding. A unique feature of the proposed methods is that they
reduce encoding complexity and encoding time.
The proposed HEVC encoder optimization methods reduced encoding time by
21.7% and bitrate by 13.4% with insignificant impact on subjective quality evaluations.
These methods can easily be implemented today within HEVC.
HEVC OPTIMIZATION IN MOBILE ENVIRONMENTS
LIST OF TABLES
LIST OF FIGURES
1 INTRODUCTION
1.1 Motivation
1.2 Contribution
1.3 Outline
2 PROBLEM DESCRIPTION
2.1 H.264 vs. HEVC Subjective Evaluation
2.2 Decision Optimization
2.3 Complexity Reduction with HVS Factors
3 BACKGROUND
3.1 Overview of Video Compression
3.1.1 HEVC overview
3.2 Overview of HVS
3.2.1 Retina
3.2.2 Smooth Pursuit Eye Movement
3.2.3 Transparent Motion Perception
4 LITERATURE REVIEW
4.1 Subjective Quality
4.1.1 Metrics for Quality Evaluation
4.1.2 Subjective Evaluation
4.2 Complexity Reduction
4.2.1 H.264
4.2.2 HEVC
5 HEVC AND H.264 SUBJECTIVE EVALUATION
5.1 Background
5.2 Evaluation Methods
5.3 Experiments
5.4 Results
5.5 Discussion
5.6 Concluding Remarks
6 HEVC DECISION OPTIMIZATION
6.1 Background
6.2 Method
6.3 Prediction Modeling
6.4 Results
6.4.1 Data Reading Primer
6.5 Data Analysis
6.5.1 Prediction Model Analysis
6.6 Application
7 ADAPTING LOW BIT RATE SKIP MODE IN MOBILE ENVIRONMENT
7.1 Background
7.2 HEVC Elements
7.2.1 Quad-Tree
7.2.2 Skip Mode Method
7.3 Proposed Method
7.4 Experiments
7.5 Results
7.6 Concluding Remarks
8 CONCLUSION
9 FUTURE WORK
10 REFERENCES
LIST OF TABLES
Table 1: Major differences for portable devices
Table 2: Small screen complexity reduction opportunities
Table 3: Retina topography data
Table 4: Class definition for resolution, frame rate, and bit rates
Table 5: Complexity reduction reviewed research
Table 6: Video sequence information
Table 7: 400 kbps data for H.264 and HEVC PSNR(Y)
Table 8: 200 kbps data for H.264 and HEVC PSNR(Y)
Table 9: Subjective grading scale
Table 10: Mean opinion score (MOS) for 400 kbps rate
Table 11: Mean opinion score (MOS) for 200 kbps rate
Table 12: Video sequences preference
Table 13: Video sequence definition
Table 14: Class C, D, E, and All video sequence, average Δ PSNR and time-savings
Table 15: Kristen and Sara 64/4/3 PSNR and Δ PSNR for 100 and 600 kbps
Table 16: Video sequences WEKA estimate for Δ PSNR and time-savings. Max
Table 17: Video sequences actual vs. WEKA difference
Table 18: Correlation between actual and WEKA results
Table 19: Test sequences
Table 20: Acuity test conditions
Table 21: Rate distortion loop calls
Table 22: Bitrates for subjective videos near 400 kbps
Table 23: Encoding time for subjective videos near 400 kbps
Table 24: Acuity average and standard deviation
Table 25: MOS for base vs. high acuity at 960x540 and 400 kbps bit rate
Table 26: MOS for base vs. low acuity at 960x540 and 400 kbps bit rate
Table 27: MOS for depth reduced vs. high acuity at 960x540 and 400 kbps bit rate
Table 28: MOS for depth reduced vs. low acuity at 960x540 and 400 kbps bit rate
Table 29: MOS for base vs. high acuity at 480x272 and 400 kbps
Table 30: MOS for base vs. low acuity at 480x272 and 400 kbps
Table 31: MOS for depth reduced vs. high acuity at 480x272 and 400 kbps
Table 32: MOS for depth reduced vs. low acuity at 480x272 and 400 kbps
LIST OF FIGURES
Figure 1: Mobile environment elements
Figure 2: H.261 hybrid coding
Figure 3: Scope of H.264 standard
Figure 4: H.264 video encoder block diagram (decoder is within dashed box)
Figure 5: HEVC video encoder block diagram (with decoder elements in grey)
Figure 6: CTU (or CTB) subdivision into CUs (or CBs)
Figure 7: Modes for splitting a CB into PB
Figure 8: Intra picture directional orientations
Figure 9: Receptor density in retina [1]
Figure 10: Tangential section of photoreceptors through human fovea [2]
Figure 11: Retina topography
Figure 12: Viewing cone
Figure 13: Scatter plots
Figure 14: H.264 processor utilization
Figure 15: Simplified H.264 and HEVC encoder block diagram
Figure 16: Activate and inactivate area. Activate and inactivate area in binary
Figure 17: Decision tree for MB encoding
Figure 18: Restriction of selectable sub-macroblock modes
Figure 19: Flexible search method flow chart
Figure 20: Transform choices by transform skip mode
Figure 21: Decomposition using checkered pattern
Figure 22: Template and block matching vectors
Figure 23: CU splitting and pruning process
Figure 24: Literature research on hybrid coding map
Figure 25: Basketball drill PSNR(Y) rate distortion curve
Figure 26: Observer to LCD viewing definition
Figure 27: Mean opinion score (MOS) for 400 kbps bit rate
Figure 28: Mean opinion score (MOS) for 200 kbps bit rate (graph)
Figure 29: PSNR(Y) (dB) vs. MOS (dB)
Figure 30: Race horses sequence. Horse coat color is bothersome to observer
Figure 31: Foreground tree detail loss not a concern in Keiba sequence
Figure 32: MOS vs. POA with MOS bubble @ 400 kbps. Ti > 10
Figure 33: MOS vs. POA with MOS bubble @ 200 kbps. Ti > 10
Figure 34: TiSi plot
Figure 35: Kristen and Sara 100 kbps time-savings vs. PSNR
Figure 36: Kristen and Sara sequence, 100 kbps. Max-quality and left-option points
Figure 37: Kristen and Sara sequence, 100 kbps. Right-option and min-quality points
Figure 38: Kristen and Sara 100 kbps to 600 kbps shift for time-savings vs. Δ PSNR
Figure 39: Kristen and Sara 100 kbps time-savings vs. PSNR WEKA derived
Figure 40: Recursive CU splitting example
Figure 41: Mode decision process
Figure 42: Viewing cone
Figure 43: Neighbor LCU motion vector
Figure 44: Video sequence order
Figure 45: Observer voting
Figure 46: Display viewing setup
Figure 47: MOS for base vs. high acuity (960x540 display resolution and 400 kbps)
Figure 48: MOS for base vs. low acuity (960x540 display resolution and 400 kbps)
Figure 49: MOS for depth reduced vs. high acuity (960x540 display resolution and 400 kbps)
Figure 50: MOS for depth reduced vs. low acuity (960x540 display resolution and 400 kbps)
Figure 51: MOS for base vs. high acuity (480x272 display resolution and 400 kbps)
Figure 52: MOS for base vs. low acuity (480x272 display resolution and 400 kbps)
Figure 53: MOS for depth reduced vs. high acuity (480x272 display resolution and 400 kbps)
Figure 54: MOS for depth reduced vs. low acuity (480x272 display resolution and 400 kbps)
1 INTRODUCTION
Multimedia applications have developed extensively during the last 20 years. In conjunction with these multimedia advances has come the evolution of video encoding and decoding standards. From the onset, subjective evaluation has been a strong driver in video coding standards development, and a significant amount of human and computational resources has been expended on it. From this effort, standards have evolved rather successfully to support the consumption of multimedia with the technologies available in the industry.
One form of technology that has become very popular is the use of mobile compute devices for consuming and generating multimedia data. In mobile multimedia generation, a significant amount of computational and wireless transmission resources is dedicated to video. Therefore, video compression efficiency and effectiveness are very important, and subjective performance on the mobile compute platform needs proper care to maximize performance.
Even though the video compression standards clearly provide direction for mobile multimedia use, current video compression research has focused significantly on entertainment consumption of multimedia data, mainly internet use on desktop, notebook, and tablet computers and television. The video resolution tends to be 1280 x 720 or higher, and available bandwidth tends to be above one megabit per second, usually several megabits per second. The vast majority of research to date is highly focused on larger-scale video resolution and consumption essentially over wired internet, where transmission bandwidth is not as constrained as in the mobile arena. During the next decade, more focus will be given to mobile multimedia consumption and to the best methods for managing a mobile device's resources to enhance its performance for the consumer.
1.1 Motivation
My research focus is to address some underserved aspects of video consumption on mobile devices. The main emphasis is to optimize video services in mobile environments. By optimization, the target is to reduce coding complexity without penalizing the observer's perception.
Mobile environments have inherent discrepancies between (a) the mobile device display, such as display resolution and display size, (b) encoded video characteristics, such as encoded resolution, and (c) the human vision system, such as acuity, smooth pursuit eye movement, and viewing distance. In essence, these three elements, shown in Figure 1, can be effectively exploited when combined in a mobile environment. Current research does not adequately address the opportunities within this environment.
Figure 1: Mobile environment elements
Subjective analysis in mobile environments between HEVC, the latest coding standard, and H.264 is very limited. HEVC's target is a twofold improvement over H.264; that is, for the same subjective quality, HEVC should use half the bit rate. The targeted HEVC gains are not as clear in mobile environments. This needs addressing to give the mobile developer comparative data on the performance of HEVC and H.264 in the mobile use case.
Mode selection for optimal HEVC performance is needed so that any optimization technique can adequately select a mode for use in the mobile environment. Models are needed to effectively handle decision optimization for mode selection. Prediction modeling is a widely known optimization technique that can be used to reduce the complexity of video coding on mobile devices.
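As a hypothetical illustration of such prediction modeling, the sketch below trains a one-feature decision stump that predicts whether a coding unit's full rate-distortion search can be skipped. The feature (residual energy), the threshold learning, and the training data are invented for illustration; they are not the dissertation's models.

```python
# Hypothetical sketch: a one-feature decision stump predicting whether a
# CU's full rate-distortion (RD) search can be skipped.  Feature values,
# labels, and the threshold rule are illustrative assumptions.

def train_stump(samples):
    """samples: list of (feature_value, can_skip_bool).
    Returns the threshold minimizing misclassifications, assuming low
    feature values (e.g. low residual energy) imply 'can skip'."""
    best_thr, best_err = None, float("inf")
    for thr in sorted({f for f, _ in samples}):
        err = sum((f <= thr) != skip for f, skip in samples)
        if err < best_err:
            best_thr, best_err = thr, err
    return best_thr

def predict_skip(feature_value, threshold):
    return feature_value <= threshold

# Toy training data: (residual energy, whether full RD search was unnecessary)
data = [(5, True), (8, True), (12, True), (40, False), (55, False), (70, False)]
thr = train_stump(data)
print(predict_skip(6, thr), predict_skip(60, thr))
```

A stump is the simplest possible decision tree; richer models (e.g. the WEKA-derived models discussed later) follow the same train-then-predict pattern.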
1.2 Contribution
The main contributions of this dissertation are:
1. Conducted subjective evaluation experiments comparing H.264 and HEVC compressed video under mobile video delivery conditions. This work has been published in peer-reviewed conferences [3], [4] and journals [5].
2. Developed a complexity reduction method using a predictive model to estimate Δ PSNR and encoding time savings, with techniques to efficiently choose HEVC coding tools (CU size, CU depth, TU depth) in video conferencing applications. This work has been published in peer-reviewed conferences [6], [7].
3. Developed complexity reduction methods using a perceptually-aware method for video encoding that exploits mismatches between video resolution, display resolution, visual acuity, and motion perception. This work has been published in a peer-reviewed conference [8].
4. Developed joint complexity and bitrate reduction methods for HEVC encoding; the method has negligible impact on subjective quality evaluations.
1.3 Outline
The rest of the dissertation is organized as follows. Section 2 describes the problem. Section 3 provides background, including an overview of video compression. Section 4 reviews the literature in the multimedia field, focusing mainly on subjective quality and complexity reduction. Section 5 presents quality evaluation on mobile devices. Section 6 identifies optimal mode selection for HEVC using predictive modeling techniques. Section 7 exploits visual perception for complexity and bitrate reduction.
2 PROBLEM DESCRIPTION
As presented earlier, the focus is optimizing video services in mobile environments. Video coding complexity is a significant problem. The mobile environment has unique characteristics that allow new approaches to complexity reduction. Today's mobile devices have small displays with high resolutions. In addition, the human vision system (HVS) has limitations that can be exploited when encoded video is viewed on a small display.
There are several subjective studies within the video coding realm. However, very few studies address the impact in mobile compute environments. Several of the earlier studies compressed the older codec's video at twice the bit rate of HEVC; these results predispose the outcome by using a different bit rate for each coding standard. In addition, there have been substantial complexity reduction efforts in the last few years with the HEVC standard. However, significant focus has been given to higher resolutions, which are the main target of the broadcast industry. The higher resolutions are mainly for larger-screen devices, from desktop displays to televisions.
Limited data is available for mobile-phone screen sizes, which are less than 6” and typically range from 3.5” to 5”.
Research is needed in which the mobile environment dictates the setup restrictions, which then apply to all coding standards under subjective evaluation. From this effort, feasible modifications, such as complexity reduction, can be determined for mobile-sensitive resources, such as power and bandwidth.
Mobile devices, such as mobile phones, have unique requirements that have not been sufficiently studied and exploited by earlier works. Table 1 shows typical values for the major differences among common portable devices in the consumer industry.
Table 1: Major differences for portable devices

Type          Screen Size   Resolution   Wireless Medium
Mobile phone  < 5.0”        < 1080P      Wireless WAN
Tablet        ~ 10”         ~ 1080P      Wi-Fi
Notebook      > 14”         > 1080P      Wi-Fi
Mobile multimedia demand will increase significantly during the next few years. [9] states that mobile data and internet traffic will increase from 237 PetaBytes (PB)/month in 2010 to an estimated 6,254 PB/month by 2015. The increase is due not only to higher data use on existing phones, but also to a significant deployment of multimedia-capable phones on wireless wide area networks throughout the world.
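As a quick sanity check on the scale of that forecast, the implied growth factor and compound annual rate can be computed directly from the two cited figures:

```python
# Implied growth of the cited mobile traffic forecast:
# 237 PB/month in 2010 to 6,254 PB/month in 2015.
start, end, years = 237.0, 6254.0, 5
total_growth = end / start                 # ~26.4x over the five years
cagr = total_growth ** (1 / years) - 1     # compound annual growth rate
print(f"{total_growth:.1f}x total, {cagr:.0%} per year")
```

The forecast therefore corresponds to traffic roughly doubling every year over the period.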
Mobile compute platforms, with small screen sizes, have unique characteristics that can be exploited by the developer. The complexity reduction research is geared to exploit these characteristics and employ methods that reduce encoding time with minimal impact on subjective quality. Some potential coding elements to research are listed in Table 2. In addition, encoding time reduction can dramatically reduce the power drain on the mobile device, a problem actively being addressed in today's mobile devices.
Table 2: Small screen complexity reduction opportunities

Item                      Potential
Optimized mode selection  (1) Reduce ME time by eliminating modes not conducive to smaller screens (tree pruning). (2) Use the previous frame's modes if criteria, such as an MV threshold, are met.
Depth selection           Use the previous frame's CU and TU depths if criteria are met.
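The second opportunity in Table 2, reusing the previous frame's mode when motion is low, can be sketched as below. The MV threshold value, the quarter-pel units, and the function shape are illustrative assumptions, not the dissertation's method.

```python
# Hedged sketch: reuse the co-located CU's mode from the previous frame when
# its motion vector magnitude is below a threshold.  Threshold and data
# layout are illustrative assumptions.

def choose_mode(prev_mode, prev_mv, full_search, mv_threshold=4.0):
    """prev_mv is (dx, dy) in quarter-pel units; full_search is a callable
    performing the expensive mode decision when reuse is not safe."""
    dx, dy = prev_mv
    if (dx * dx + dy * dy) ** 0.5 <= mv_threshold:
        return prev_mode            # low motion: inherit mode, skip search
    return full_search()            # high motion: fall back to full search

searches = []
def full_search():
    searches.append(1)              # count how often the costly path runs
    return "INTER_2Nx2N"

m1 = choose_mode("SKIP", (1, 2), full_search)    # |mv| ~ 2.24 -> reuse
m2 = choose_mode("SKIP", (8, 6), full_search)    # |mv| = 10  -> search
print(m1, m2, len(searches))
```

The encoding-time saving comes from how often the cheap branch is taken, which is why the MV threshold directly trades complexity against quality.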
2.1 H.264 vs. HEVC Subjective Evaluation
Mobile compute environments provide a unique set of user needs and expectations that designers must consider. The focus within the mobile compute environment is smart phones. Multimedia use in smart phones has increased with the surge in popularity of these devices. With increased multimedia use in mobile environments, the video encoding methods within the smart phone market segment are key factors that contribute to a positive user experience. Currently available display resolutions and expected cellular bandwidth are major factors the designer must consider when determining which encoding methods should be supported. Recent mobile devices released in the consumer market show that display technology has progressed strongly. The displays range from high-end mobile phones with resolutions up to 1280x720 for 5.0” diagonal screen sizes to entry-level smart phones with resolutions around 480 x 270 for 3.5”.
A comparative evaluation of user-experience subjective quality between the HEVC and H.264 video coding standards is needed to provide guidance to the developer for design within the mobile environment. The desired goal is to maximize the consumer experience, reduce cost, and reduce time to market.
2.2 Decision Optimization
HEVC has many configuration options for encoding. A model and approach are needed to select an efficient set of HEVC encoding options for devices in mobile environments, as described earlier. A main goal is to reduce the encoding complexity without significantly affecting the quality of video conferencing applications. A real-time adaptive configuration is needed to optimize for the available bandwidth within the mobile environment. Basic configuration options offered by HEVC are coding unit size, coding unit depth, and transform unit size. There are many other options, but these are the most commonly manipulated options in related works. The encoder computational complexity can be reduced for the target bit rates while maintaining an allowable additional PSNR loss.
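To make the size of this decision space concrete, the sketch below enumerates (CU size, CU depth, TU depth) combinations against a toy complexity budget. The option values and the cost model are invented for illustration and do not reflect actual HEVC encoder costs.

```python
# Illustrative enumeration of an encoder configuration space.  Option values
# and the cost model are hypothetical assumptions for demonstration only.
from itertools import product

cu_sizes = [16, 32, 64]
cu_depths = [1, 2, 3, 4]
tu_depths = [1, 2, 3]

def relative_cost(cu_size, cu_depth, tu_depth):
    # Toy model: deeper trees and larger CTBs mean more RD evaluations.
    return (cu_size / 16) * cu_depth * tu_depth

configs = [(s, cd, td)
           for s, cd, td in product(cu_sizes, cu_depths, tu_depths)
           if relative_cost(s, cd, td) <= 8]
print(len(configs))
```

Even this toy space has 36 combinations before filtering, which is why a prediction model (Section 6) is preferable to exhaustively testing configurations at encode time.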
2.3 Complexity Reduction with HVS Factors
Encoding complexity reduction without degrading video quality is a common goal for video coding researchers. Indeed, several authors presented within the literature review strove to achieve complexity reduction while minimally affecting video quality.
However, two elements are usually not addressed for mobile environments, which offer unique characteristics that can be exploited. For example, mobile devices, such as smart phones, typically have displays that are 3.5” to 5.5” diagonal and use bandwidth-constrained networks, such as cellular, as the main data transmission method. In addition, human vision system factors, such as visual acuity and smooth pursuit eye movement, can be exploited to identify where complexity and bit rate reductions can occur without significantly impacting the perception of video quality.
Use of mobile devices, such as smart phones, has grown dramatically in the last few years and has penetrated the consumer space [10]. Applications, such as real-time video sharing, have gained popularity. Quality of service is a factor in application adoption, and research has shown adoption is successful if the technology delivers a predominantly positive experience [11]. A successful mobile real-time video application will need to work well on low bit rate and lossy networks.
3 BACKGROUND
3.1 Overview of Video Compression
Video encoding in its’ basic essence was patented by Raymond Davis Kell in
1929 [12]. Kell’s U.S. Patent revealed the benefits of transmitting only changes in subsequent images [13], which are the main elements predictive compression. Now, the proposed implementation method of separate light sources to display the changes using analog methods is impractical, however the basic idea for video compression was born.
An early digital method for video compression was released by the International Telegraph and Telephone Consultative Committee (CCITT), now the International Telecommunications Union (ITU), in 1984, in the form of the standard recommendation H.120 [14]. This standard was directed at point-to-point transmission, with video conference service as a main beneficiary. The standard's decisions were heavily influenced by the analog video transmission ecosystem available at the time. The release supported the National Television System Committee (NTSC) system of 525 lines and the Phase Alternating Line (PAL) system of 625 lines. The implementation was largely unsuccessful due to inadequate video quality. However, several basic concepts developed within this standard are evident in today's video compression techniques, such as quantization and code tables.
In 1990, a much improved coding standard was released as H.261 [15]. As with H.120, a major emphasis included videophone and video conference audio-visual services. The main transmission target was ISDN lines at a 64 kbit/s bitrate; however, the standard made allowances for p x 64 kbit/s services, where p can be anywhere from 1 to 30. Common video resolutions used in H.261 implementations include the common intermediate format (CIF), a resolution of 352 x 288 pixels, and the quarter common intermediate format (QCIF), a resolution of 176 x 144 pixels. H.261 further improved coding methods by introducing the hybrid video coding method that is still a key basis for modern coding standards. Hybrid video coding combines two methods: (1) frame-to-frame motion is estimated and compensated for using data from previously encoded frames, and (2) spatial domain data is decoupled and transformed to the frequency domain, where it can be quantized. The hybrid model is shown in Figure 2.
Figure 2: H.261 hybrid coding. Legend: T = transform; Q = quantizer; P = picture memory with motion-compensated variable delay; F = loop filter; CC = coding control; p = INTRA/INTER flag; t = flag for transmitted or not; qz = quantizer indication; q = quantizing index for transform coefficients; v = motion vector; f = loop filter on/off switch.
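The two hybrid-coding steps above can be sketched in miniature: an integer-shift motion search against the previous frame, followed by a transform and uniform quantization of the residual. The 1-D signals, the search range, and the 2-point sum/difference transform are toy stand-ins for real 2-D blocks, motion search, and the DCT.

```python
# Toy sketch of hybrid coding: (1) motion-compensated prediction from the
# previous frame, (2) transform + quantization of the residual.  All sizes,
# signals, and the sum/difference "transform" are illustrative assumptions.

def motion_search(ref, block, pos, search_range=2):
    """Find the integer shift in the reference that best predicts `block`."""
    best_mv, best_sad = 0, float("inf")
    for mv in range(-search_range, search_range + 1):
        start = pos + mv
        if start < 0 or start + len(block) > len(ref):
            continue
        pred = ref[start:start + len(block)]
        sad = sum(abs(b - p) for b, p in zip(block, pred))
        if sad < best_sad:
            best_mv, best_sad = mv, sad
    return best_mv

def transform_quantize(residual, qstep=2):
    # 2-point sum/difference transform, then uniform quantization
    coeffs = [residual[0] + residual[1], residual[0] - residual[1]]
    return [round(c / qstep) for c in coeffs]

ref = [0, 0, 5, 9, 0, 0]
block = [5, 9]                      # the content moved right by one sample
mv = motion_search(ref, block, pos=1)
pred = ref[1 + mv:1 + mv + 2]
residual = [b - p for b, p in zip(block, pred)]
print(mv, residual, transform_quantize(residual))
```

When prediction succeeds, the residual collapses to near-zero values that quantize to almost nothing; this is the efficiency the hybrid model is built on.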
A parallel effort was started in 1988 by the Moving Pictures Expert Group (MPEG); this group leveraged many concepts from H.261 and released the MPEG-1 (ISO/IEC 11172) standard in 1993. The standard's focus included multimedia distribution. As with H.261, MPEG-1 is hybrid coding. New techniques were developed to enhance video sequence reconstruction quality, such as the group of pictures (GOP). This introduced three picture frame types: the intra-coded (I) frame, the predictive-coded (P) frame, and the bi-directional-predictive-coded (B) frame. A GOP must include an I frame, positioned such that the remaining frames can be decoded with reference to it. Another improvement was increasing the maximum allowable macroblock (MB) size to 16x16 from the 8x8 used in H.261. The hybrid video coding approach represents each coded picture in block-shaped units of associated luma and chroma samples, which are called MBs.
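A fixed GOP pattern can be sketched as follows. The pattern parameters (GOP size 8, two B frames between anchor frames) are an illustrative assumption, since encoders are free to choose other structures.

```python
# Sketch of assigning picture types in a fixed GOP pattern (parameters are
# an assumption for illustration; real encoders choose patterns adaptively).
def gop_types(n_frames, gop_size=8, b_between=2):
    """I at each GOP start; P every (b_between + 1) frames; B in between."""
    types = []
    for i in range(n_frames):
        if i % gop_size == 0:
            types.append("I")
        elif (i % gop_size) % (b_between + 1) == 0:
            types.append("P")
        else:
            types.append("B")
    return types

print("".join(gop_types(10)))
```

The I frames anchor each GOP so decoding can start there, while P and B frames carry only predicted differences, which is where the compression gain comes from.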
The next release, in 1995, was MPEG-2 / H.262, the first joint effort between the ISO and ITU standards organizations, maintained jointly with the ITU-T Video Coding Experts Group (VCEG). The focus included multimedia distribution for digital video broadcasting and consumer-purchased multimedia, such as DVD. The combined committee built upon and advanced the coding capabilities of each organization's standards and greatly improved video content distribution for use in transmission environments from 3 to 10 Mbit/s. The standard strove to balance quality and complexity at these higher rates to provide studio-quality digital video. In addition, the committee was forward looking, as shown by defining HDTV/SDTV and supporting frame sizes up to 2^14 x 2^14. As the coding technology advanced, the consumer industry benefited greatly from a unified direction for video coding. This greatly reduced adoption risk for encoder developers, and more development resources could be dedicated to implementing the encoding protocol of one standard. The MPEG-2 / H.262 standard is backward compatible with MPEG-1.
The next ITU-T VCEG standard, H.263 [16], returned its focus to video conferencing applications; more specifically, the standard's scope is compressing the moving picture component of audio-visual services at low bit rates. This standard was released in 1996 and was widely adopted by video conferencing and cell phone codecs. The design target was video conferencing applications on mobile devices where low bit rate restrictions exist. H.263 is utilized by many video telephony, video conferencing, and internet conferencing standards, such as H.324, H.323, H.320, RTSP, and SIP.
Crafting of the MPEG-4 Part 2 standard began in 1995, near the release of MPEG-2 / H.262, and the standard was released in 1999. As with H.263, MPEG-4 Part 2's original goal was use within low bit rate environments. However, the scope expanded, and several novel ideas introduced within this standard allow a wide range of compression quality and bit rate compromises. Many consider this a major facet, since continued enhancements have evolved through new profiles that include original coding concepts such as interactive graphics, object and shape coding, scalable coding, and 3D graphics, among other techniques.
In the last decade, ITU-T VCEG released H.264 / MPEG-4 Part 10 Advanced Video Coding (AVC) [17], commonly known as H.264 or H.264/AVC. Since its release in 2003, this coding standard has been well received. As with other coding standards, the scope of H.264 is a decoder specification, as shown in Figure 3. Even though only a decoder is specified, this guides the encoder on the expected data format for transmission between the encoder and decoder.
Figure 3: Scope of H.264 standard
The H.264 standard leveraged prior standards heavily and enhanced known coding techniques, such as variable block-size motion compensation, decoupling reference order from display order, and hierarchical block transforms, among other enhancements [18]. The H.264 coding block diagram is shown in Figure 4. A main goal was superior compression performance across a wide range of bit rates. This standard has been adopted by several well-known consumer products, such as the Apple iPod and Sony PlayStation. Also, several consumer media formats, such as HD-DVD and Blu-ray, have adopted this standard for video compression.
Figure 4: H.264 video encoder block diagram (decoder is within dashed box)
3.1.1 HEVC overview
Recently, in late January 2013, the High Efficiency Video Coding (HEVC) standard [19] was released by ITU-T VCEG. The main target is to achieve the same subjective video performance at half the H.264 data rate. HEVC uses the same hybrid coding approach that has been successful in earlier standards, and several H.264 tools were enhanced to achieve this goal [20], [21]. More details are given below within this section.
The HEVC standard [19] has adopted one profile with three configurations: “intra”, “low-delay” coding, and “random-access” coding. This supports a wide range of services, such as broadcast, mobile, and streaming. Recent assessments show that HEVC can achieve equivalent subjective quality to H.264 using approximately 50% less bit rate [21]. Low-delay coding typically uses the previous frame as the reference frame for inter prediction. Random-access coding uses both past and future frames as reference frames. Intra coding does not use other frames for prediction, so intra frames have no temporal elements.
The bit rate improvements are achieved, in part, by the introduction of new tools, such as the block partitioning structure [22] and variable block sizes for prediction and transform coding [23]. Although the coding efficiency of low-delay coding is worse than that of random-access coding, low-delay coding is widely required in industry because of the importance of real-time video applications [24]. The remainder of the HEVC discussion in this section gives an overview of the HEVC structure and tools of interest in the presented research. For reference, Figure 5 shows the HEVC hybrid video encoder block diagram.
Figure 5: HEVC video encoder block diagram (with decoder elements in grey)
As with earlier standards, the frame is partitioned into square blocks for compression management. The largest block with luma and chroma data is called the largest coding unit (LCU). The LCU is partitioned into smaller coding units (CUs) under encoder direction, and the partitioning is mapped into a coding tree unit (CTU), as shown in Figure 6. The CTU size is selected by the encoder and consists of a luma coding tree block (CTB) and the corresponding chroma CTBs. The size LxL of a luma CTB can be chosen as L = 16, 32, or 64. For example, in 4:2:0 color sampling, there is one LxL luma CTB and two corresponding L/2xL/2 chroma CTBs. A larger LxL size typically enables better compression for higher resolution video sequences, since a 64x64 LCU covers a smaller portion of the overall frame and has a higher likelihood of containing homogeneous pixel data. HEVC then supports partitioning of the CTBs into smaller coding blocks (CBs) using a tree structure with quadtree signaling, as shown in Figure 6, following a “Z” pattern during the split process.
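The recursive CTB-to-CB split can be sketched as a quadtree decision. In the sketch below, a pixel-variance threshold stands in for the encoder's actual rate-distortion test, so the function and its thresholds are purely illustrative:

```python
import numpy as np

def split_ctb(block, min_size=8, var_thresh=100.0):
    """Recursively split a square luma block into a quadtree of CBs.

    A block is kept whole when its pixel variance is below var_thresh
    (a stand-in for the encoder's rate-distortion decision); otherwise
    it is split into four quadrants visited in "Z" order, as HEVC
    signals. Returns a leaf size or a nested list mirroring the tree.
    """
    size = block.shape[0]
    if size <= min_size or np.var(block) < var_thresh:
        return size  # leaf CB: report its size
    h = size // 2
    # Z-pattern: top-left, top-right, bottom-left, bottom-right
    return [split_ctb(block[:h, :h], min_size, var_thresh),
            split_ctb(block[:h, h:], min_size, var_thresh),
            split_ctb(block[h:, :h], min_size, var_thresh),
            split_ctb(block[h:, h:], min_size, var_thresh)]

# A flat 64x64 CTB stays one CB; a noisy one splits recursively.
flat = np.zeros((64, 64))
noisy = np.random.default_rng(0).integers(0, 255, (64, 64)).astype(float)
print(split_ctb(flat))         # 64
print(type(split_ctb(noisy)))  # nested list of leaf sizes
```

A real encoder replaces the variance test with a rate-distortion comparison between coding the block whole and coding its four children.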
Figure 6: CTU (or CTB) subdivision into CUs (or CBs)
A CTB may contain only one CU or may be split to form multiple CUs. Each CU has an associated partitioning into prediction units (PUs) and a tree of transform units
(TUs).
For prediction units (PUs) and prediction blocks (PBs), the decision whether to code a picture area using inter-picture or intra-picture prediction is made at the CU level. HEVC supports variable PB sizes from 64x64 down to 4x4 samples. CBs are split into PBs according to partitioning “modes,” which can include asymmetrical shapes as shown in Figure 7. Each PB is tied to a motion vector during the motion estimation (ME) process, which is discussed further in a later section.
Figure 7: Modes for splitting a CB into PB
TUs and transform blocks (TBs) take the prediction residual and code it using block transforms. A TU tree structure has its root at the CU level. HEVC supports variable TB sizes from 32x32 down to 4x4. The HEVC design allows a TB to span multiple PBs for inter-picture-predicted CUs to maximize the potential coding efficiency of TB partitioning.
Motion vector computation uses advanced motion vector prediction (AMVP), which includes derivation of several most likely candidates based on data from adjacent PBs and the reference picture. A merge mode for motion vector (MV) coding may also be used, which allows the inheritance of MVs from temporally or spatially neighboring PBs. Compared to H.264/MPEG-4 AVC, improved skipped and direct motion inference was specified.
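A simplified, non-normative sketch of the AMVP idea: collect candidate MVs from neighboring PBs and a temporally co-located PB, keep a short candidate list, and signal only a candidate index plus a motion vector difference. The candidate order and cost function below are illustrative stand-ins for the standard's derivation rules:

```python
def mv_predictor(spatial_neighbors, temporal_mv, target_mv):
    """Build a simplified AMVP-style candidate list and pick the best.

    spatial_neighbors are MVs from adjacent PBs (left, above, ...);
    temporal_mv comes from the co-located PB in a reference picture.
    Candidates are deduplicated, then the one closest to the actual MV
    is chosen, so only a small index plus an MV difference is coded.
    Illustrative sketch only, not the normative HEVC derivation.
    """
    candidates = []
    for mv in spatial_neighbors + [temporal_mv]:
        if mv is not None and mv not in candidates:
            candidates.append(mv)
        if len(candidates) == 2:  # HEVC AMVP keeps two candidates
            break
    def cost(c):  # bits for the MV difference grow with its magnitude
        return abs(target_mv[0] - c[0]) + abs(target_mv[1] - c[1])
    best = min(candidates, key=cost)
    mvd = (target_mv[0] - best[0], target_mv[1] - best[1])
    return candidates.index(best), mvd

idx, mvd = mv_predictor([(4, 0), (4, 0), (0, 2)], (5, 1), (5, 0))
print(idx, mvd)  # 0 (1, 0): the (4, 0) neighbor wins, only (1, 0) is coded
```

Merge mode goes one step further: the chosen candidate's MV is inherited outright and no MV difference is transmitted at all.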
Motion compensation uses quarter-sample precision for MVs, and 7-tap or 8-tap filters are used for interpolation of fractional-sample positions. As with H.264, multiple reference pictures can be used. A PB can be associated with one or two motion vectors, resulting in uni-predictive or bi-predictive coding, respectively. Uni-predictive coding uses a single reference frame, while bi-predictive coding combines two references, which may be previous or future frames. As with H.264, a scaling and offset operation is supported.
Intra prediction uses only spatial prediction, meaning the decoded boundary samples of adjacent blocks are used as reference data for prediction. Intra-picture prediction supports 33 directional modes, compared with eight directional modes in H.264, in addition to planar (surface fitting) and DC (flat) prediction modes. The selected intra-picture prediction modes are encoded by deriving most probable modes and prediction directions from those of previously decoded neighboring PBs.
Figure 8: Intra picture directional orientations
Quantization is similar to H.264 where quantization scaling matrices are supported for the various transform block sizes.
Entropy coding is very similar to the context adaptive binary arithmetic coding (CABAC) scheme in H.264. However, several improvements have been developed that increase throughput, which parallel-processing architectures can exploit, and improve compression performance. In addition, the improvements allowed a reduction in context memory requirements.
A deblocking filter similar to the one used in H.264 operates within the inter-picture prediction loop. The purpose of deblocking is to smooth the block edges with neighboring blocks in order to obtain a subjectively better picture. A major difference in HEVC over H.264 is that the design has simplified decision-making and filtering processes and is more conducive to parallel processing.
Sample adaptive offset (SAO) is a new feature. A nonlinear amplitude mapping is introduced within the inter-picture prediction loop after the deblocking filter. The goal is to better reconstruct the original signal amplitudes by using a look-up table and a few additional parameters determined by histogram analysis at the encoder.
3.2 Overview of HVS
This section will present the properties of the HVS exploited in the dissertation.
This understanding, in combination with video compression techniques, will yield new methods to exploit HVS for improving video compression.
3.2.1 Retina
The retina consists of many layers of specialized tissue, with each layer playing a unique and critical role in human vision [25], [26]. The layer that will get further discussion is the photoreceptor layer, which gathers light and turns it into neural signals. This layer is a dense mosaic of cell bodies that contains rods and cones. Rods are achromatic and are primarily responsible for vision in low light. Cones are primarily responsible for color vision. There are three types of color cone receptors: S-cones (absorb blue), M-cones (absorb green), and L-cones (absorb red).
Cones dominate and are densely populated near the fovea, which is the area necessary for sharp and detailed vision. Cones are primarily used for high-light (such as daylight) conditions. The fovea is also commonly referred to as the focal point of vision and is used for detailed vision capture, such as reading.
The fovea’s minute center is called the central island and has the highest density of photoreceptors within a 0.2° (12 minutes of arc) area. In the central island, where vision is sharpest, there are only red and green cones (no blue cones or rods) [27], [28], [29]. As one moves away from the fovea’s central island, visual acuity drops off; the most significant drop-off occurs near the central island edges. The retinal acuity topography forms roughly concentric zones, each with a maximum resolvable spatial frequency determined by the number of photoreceptors in the zone. The central island has approximately 120 receptors per degree. Surrounding the central island and within the fovea is the foveola, which spans approximately 1.2° of visual angle. As with the central island, the foveola is free of rods and blood vessels. The foveola photoreceptor coverage is roughly 70 to 120 receptors per degree. The fovea includes the foveola, has a visual field span of 6°, and consists of rods and cones. Cone density decreases and rod density increases farther from the central island. The fovea has approximately 50 to 70 receptors per degree. Receptor density mapping over the retina is shown in Figure 9. The central island is centered at 0 degrees.
Figure 9: Receptor density in retina [1]
For the central island area, photoreceptors are positioned in a hexagonal pattern, similar to a honeycomb, as shown in Figure 10. The central island has approximately 120 photoreceptors per degree. Per Nyquist sampling theory, this translates to a theoretical resolution capability of 60 cycles per degree: it takes two receptors to distinguish that a change (i.e., a cycle) has occurred, so the resolvable frequency is one-half of the sample (photoreceptor) density. Using the same method, the foveola has a resolution capacity of 35 to 60 cycles per degree, and the fovea has a resolution capacity of approximately 25 to 35 cycles per degree. The retina topography data is shown in Table 3.
Figure 10: Tangential section of photoreceptors through human fovea [2]
Table 3: Retina topography data
Topography Location   Span (Degrees)   Photoreceptors/Degree   Cycles/Degree
Central Island        0.2              120                     60
Foveola               1.2              70-120                  35-60
Fovea                 6.0              50-70                   25-35
Parafovea             7.0              20-50                   10-25
Perifovea             -                0-20                    0-10
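The cycles-per-degree column of Table 3 follows directly from the Nyquist relation described above; a one-line computation reproduces it:

```python
def max_cycles_per_degree(receptors_per_degree):
    """Nyquist: two receptors are needed to register one cycle, so the
    resolvable spatial frequency is half the receptor density."""
    return receptors_per_degree / 2

# Values from Table 3
print(max_cycles_per_degree(120))  # central island -> 60.0
print(max_cycles_per_degree(70))   # foveola lower bound -> 35.0
print(max_cycles_per_degree(50))   # fovea lower bound -> 25.0
```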
Retina topography showing location and relative size from the central island to periphery is shown in Figure 11.
Figure 11: Retina topography
“20/20” vision is a Snellen fraction, developed in 1862, used to express normal visual acuity measured at a distance of 20 feet (6 meters). Basically, 20/20 vision means one can see clearly at 20 feet what should normally be seen at 20 feet. For example, 20/200 vision means one must be 20 feet away to clearly see what a person with normal vision can see at 200 feet. 20/20 vision correlates with 30 cycles per degree of acuity resolving capacity [29].
Recapping this section, the retina’s photoreceptor density affects the ability to distinguish change between adjacent photoreceptors, expressed in cycles per degree. Using Nyquist sampling theory, the cycles per degree is one-half of the photoreceptor density per degree. HVS acuity is based on cycles per degree.
When a display’s resolution per degree is greater than the HVS acuity, individual pixel changes will be difficult for the HVS to detect. One example is viewing a display such as a television from a long distance: the television’s resolution is packed into a tighter viewing cone (see Figure 12 for the viewing cone definition). Mobile users have a similar environment, where a high-resolution display sits within a tight viewing cone. This surplus of resolution can be used for HEVC encoder complexity reduction with minimal impact on subjective quality.
Figure 12: Viewing cone
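The viewing-cone argument can be made concrete by computing a display's pixels per degree from its width, resolution, and viewing distance, and comparing against the roughly 120 pixels per degree that 60 cycles/degree acuity implies by Nyquist. The phone dimensions below are hypothetical:

```python
import math

def pixels_per_degree(pixels_across, display_width_m, viewing_distance_m):
    """Pixels packed into one degree of the viewing cone.

    Acuity of 60 cycles/degree corresponds to ~120 pixels/degree by
    Nyquist; above that, single-pixel detail cannot be resolved and
    the encoder can drop detail with little subjective impact.
    """
    cone_degrees = 2 * math.degrees(
        math.atan(display_width_m / (2 * viewing_distance_m)))
    return pixels_across / cone_degrees

# Hypothetical 1080-pixel-wide phone display, 62 mm wide, at arm's length
ppd = pixels_per_degree(1080, 0.062, 0.40)
print(round(ppd, 1))  # ~122 pixels/degree: right at the edge of HVS acuity
```

Under these assumed dimensions the display sits near the acuity limit, so detail finer than one pixel is largely wasted on the viewer.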
3.2.2 Smooth Pursuit Eye Movement
Eye pursuit movement was first reported by Raymond Dodge [30] in 1903 and by his contemporaries. This observation was later defined as Smooth Pursuit Eye Movement (SPEM): the eye movement in which the line of observation follows an object moving across the field of vision. Human beings instinctively follow objects from early infancy. This instinct is so persistent that in adult life it is very difficult to keep the eyes from moving when an object moves. Analysis showed that when an object moves, the ocular muscle response follows shortly after to track the object.
Later works distinguished between saccadic and smooth pursuit eye movements. Saccadic eye movement has received much attention, primarily due to the very rapid motion of the eyes within a very short period of time. In contrast, smooth pursuit movements occupy a significant portion of ocular activity [31]. Per Robinson, smooth pursuit can occur at velocities up to 25 to 30 degrees per second. He noted that smooth pursuit velocity overshoot is noticeable at a movement rate of 5 degrees per second, while no noticeable overshoot occurs at 15 degrees per second. Overshoot is the condition where the eye movement keeps tracking on the same path after the object has settled. Saccades were reported at a velocity threshold of 25 to 30 degrees per second. Girod [32] reported saccades as large rapid eye movements, with an angular velocity of up to 830 degrees per second, that align the point of regard with an interesting target, while several earlier works typically reported saccades as 30 to 60 degrees per second. In addition, if the target is moving, the eyes can compensate for this motion by SPEM with a maximum velocity around 20 to 30 degrees per second. Adzic [33] reported a threshold, defined as the saccadic suppression detection threshold, of 48 to 60 degrees per second. Velocity saturation is where smooth pursuit briefly becomes disconnected from the object during the pursuit movement. Saccadic eye movement can briefly occur during the eye pursuit of the object.
3.2.3 Transparent Motion Perception
Transparent motion perception concerns a minute change in object position over a short period of time where the change in distance is not perceived by the HVS. Nakayama [34] performed subjective tests in which HVS sensitivity was measured for horizontally moving, randomly generated dots. Dots were displaced for 12 msec, 100 msec, or 200 msec and then returned to the original position, and the observers were asked if motion occurred. For all three displacement times, the transparent motion sensitivity, which is the just-noticeable displacement, is approximately 2 arc-minutes: observers noticed movement for displacement distances of 2 arc-minutes or greater. Subjective tests were performed by Qian, Andersen, and Adelson [35] in which the observer was asked to identify motion between randomly generated dots or paired dots. Several experiments were performed; one in particular indicates where motion is detected between pixel offsets. Each pixel in this experiment subtends 0.028° (1.68 arc-minutes). The transparent motion sensitivity lies between 1 and 2 pixels: no positive indications were given for 1-pixel movement, and about a 10% positive indication was given for 2-pixel movement.
4 LITERATURE REVIEW
4.1 Subjective Quality
Multimedia content is produced to be viewed and enjoyed. The ultimate purpose of video compression is to reduce resources, such as bandwidth, while maintaining the highest level of video quality as judged by human beings; the ultimate critic is the end user of the multimedia content. The purpose of subjective quality assessment is to determine which parameters are important, and which are not, to the end user. This section has two main bodies of discussion: (1) the quality evaluation metrics currently available and (2) a review of recent subjective evaluations.
4.1.1 Metrics for Quality Evaluation
Objective video quality assessment (VQA) refers to computational models that evaluate video quality in line with the perception of the human visual system (HVS) [36]. The method in [36] is strongly computational and uses Structural Similarity (SSIM) as one component of the subjective portion of the assessment. SSIM was used along with an alternate weighting method between motion compensation blocks as a measure of temporal distortion, known as motion compensated SSIM (MC-SSIM) [37]. The second subjective component is a singular value decomposition (SVD) based image quality metric for computing the spatial scores [38]. Both scores (MC-SSIM and SVD) are computed and combined into a single score, computed for both reference and impaired frames. In addition, other objective quality scores were generated for the dataset, such as Peak Signal-to-Noise Ratio (PSNR), Visual Signal-to-Noise Ratio (VSNR) [39], and Visual Information Fidelity (VIF) [40]. The subjective testing was performed on H.264 encoded video sequences at resolutions of 352x288 and 768x432. A predictive model was generated between each calculated quality score and Mean Opinion Score (MOS) results. The authors’ model was better at predicting subjective quality for the test cases used.
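Since SSIM recurs throughout the studies reviewed here, a single-window version of the standard SSIM formula may help fix ideas (practical implementations average SSIM over sliding or Gaussian-weighted local windows; this global version only shows the formula):

```python
import numpy as np

def ssim(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Global (single-window) SSIM between two 8-bit image patches.

    Standard form, combining luminance, contrast, and structure:
    SSIM = (2*mu_x*mu_y + C1)(2*cov_xy + C2)
           / ((mu_x^2 + mu_y^2 + C1)(var_x + var_y + C2))
    """
    x, y = x.astype(np.float64), y.astype(np.float64)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return (((2 * mx * my + c1) * (2 * cov + c2))
            / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))

ref = np.tile(np.arange(64, dtype=np.uint8) * 4, (64, 1))
noisy = np.clip(ref + np.random.default_rng(1).normal(0, 10, ref.shape), 0, 255)
print(ssim(ref, ref))          # 1.0 for identical images
print(ssim(ref, noisy) < 1.0)  # True: distortion lowers the score
```

MC-SSIM extends this by evaluating SSIM between motion-compensated blocks of consecutive frames, turning a spatial metric into a measure of temporal distortion.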
A very good comparison of a few well-known video quality assessment (VQA) models was conducted by [41]. The study compares H.264 and HEVC MOS (subjective VQA) with five objective VQA models: (1) Structural Similarity (SSIM), (2) Multi-Scale SSIM index (MS-SSIM), (3) Video Quality Metric (VQM) [42], (4) MOtion-based Video Integrity Evaluation index (MOVIE) [43], and (5) PSNR. The encoding configuration of HM5.0 was set to random-access high-efficiency, and the JM18.3 configuration was adjusted to best match that of HM5.0. The scatter plots of objective scores versus MOS are shown in Figure 13. All four non-PSNR objective VQA models clearly outperform PSNR for quality assessment estimation, and MS-SSIM had slightly better results than the remaining three.
Figure 13: Scatter plots of objective scores versus MOS
4.1.2 Subjective Evaluation
Mainstream video coding methods use block-based and region-based coding approaches, where statistical features of the video sequence are exploited to detect redundancies and obtain bit rate reduction. These methods decompose video sequences into coherent regions based on features such as motion, color, or texture. In the Image Analysis and Completion Video Coding approach [44], regions in a video sequence are subdivided into two classes: perceptually relevant and perceptually irrelevant. Perceptually irrelevant regions are highly textured regions, and perceptually relevant regions are the remainder. The authors’ idea is to represent perceptually irrelevant parts by imperceptible approximations, which may be achieved at a lower bit rate than with other objective/statistical compression methods. The authors’ main contributions were to identify image completion approximations with subjective consideration and to break down a video sequence into the mix of the two region classes, perceptually relevant and irrelevant. A few image completion methods investigated by the authors were autoregressive, autoregressive moving average, Laplace, and non-linear, among others.
In [45], the authors’ focus is mobile quality of experience (QoE) for multimedia-enriched web services. The paper includes experiments that analyzed QoE for different media-enriched web-based services in a mobile environment. The user display in the experiments is 2.8” with a resolution of 320x240. SSIM and MOS results were compared for all video sequences. Tests included video clips ranging from 2 seconds to 120 seconds, regenerated multiple times with different levels of transmission degradation as measured by SSIM. MOS results showed that observers’ tolerance is significantly higher, as shown by higher MOS results, for longer (100 s to 120 s) video segments. Video segments with high motion, regardless of complexity level (i.e., high or low), tended toward lower MOS results.
A strong objective component was used by [46] to determine perceptual quality and its impact on bit rate. The authors investigated the impact of frame size, frame rate, and signal-to-noise (quantization) on the resulting bit rate, pointing to perceptual quality testing based on spatial, temporal, and amplitude resolution (STAR) changes and their influence on the bit rate model. Tests used well-known H.264 video sequences (such as akiyo and crew). The variables changed were the frame rate, from 1.875 up to 30 Hz, and the quantizer, which varied over essentially the whole quantizer spectrum, from the upper teens to one hundred. The resolution remained fixed for the tests, since resolution typically is not continuously varied within a video sequence. The authors recommend heavy use of adaptive frame rate control to maintain the constant bit rate required by the video sequence application.
Saliency models, as shown by [47], use a computational bottom-up model and human visual estimates. Human visual fixation behavior is driven by the person’s sensory system, which is “bottom-up,” and/or by higher-order task-specific goals driven by the purpose of the task, which is “top-down.” Visual saliency refers to distinct image details the human being notices naturally without a task or purpose at hand, commonly known as free viewing; visual saliency is believed to drive human fixation during free viewing. Human studies have shown visual saliency to be a good predictor of attention during free viewing for both images and videos [48]. The human visual fixation area has commonly been defined as the “conspicuity area”: the spatial region around the center of gaze where the target can be detected or identified against the background within a single fixation. The practical value of the visual conspicuity concept was limited by the fact that the associated psychophysical measurement procedures were primarily a manual endeavor, intricate and time-consuming. The research team developed an automated method to track eye movements and correlate them to the conspicuity area. The authors review 12 saliency models and propose an additional model called Multiscale Contrast Conspicuity. All models were tested with standard images, and the psychophysical conspicuity area was determined by the developed method. The Multiscale Contrast Conspicuity estimated gaze area correlated with the actual result at over 0.643, performing very well compared to the other 12 saliency models.
In the last decade, eye-tracking systems have become commercially and economically available, allowing easily quantifiable fixation data. [49] performed testing with an eye tracking system to visually track the regions of interest as determined by the fixation point. The human vision system samples its environment by linking fixation points between saccades, which are fast and sudden movements. Test subjects were shown a two-and-a-half-minute movie trailer and their eye motions were tracked. Scene changes were noted to cause large differences in fixation point locations between the test subjects.
Many psychophysical experiments find that viewing quality does not correlate well with PSNR but is significantly influenced by the viewing conditions (e.g., display size and viewing distance) [50]. The author demonstrates that major picture quality evaluation schemes are not suitable for subjective-quality-driven video adaptation, due in part to their inability to track the relationship between viewing condition and viewing quality. The author investigates a novel approach that performs video adaptation, such as modifying distortion algorithms, with respect to the target display scenario, such as a mobile environment, in order to maximize viewing quality. Viewing ratio (VR) and perceptual quality are translated into a computationally feasible algorithm for video adaptation. Given the amount of mobile video traffic expected in the next decade, video adaptation for the mobile environment can potentially save the precious transmission bandwidth that mobile devices rely on. Subjective testing found that the viewing experience with mobile devices is significantly influenced by the viewing conditions, which include viewing distance, video resolution, display size, and content type. Statistical data show that the viewing distance of mobile video with small displays is usually fixed at arm’s length [51]. As a result, display resolution and display size are very important aspects that determine the mobile video viewing experience. VR is defined as the ratio of viewing distance to display height; a low VR means being placed closer to the displayed object. For example, a typical big-screen TV has a VR of 3 to 4, while cell phone video has a VR of 10. The video sequence distortion can be modified if the VR is known by the transmitting device, which can adjust to lower quality for devices with high VR, such as mobile devices. The authors’ subjective testing was able to exploit video encoding parameters when using VR as an input to encoding parameters.
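The VR-driven adaptation described above reduces to a simple rule once the transmitting device knows the VR. The quality tiers and the VR threshold below are hypothetical illustrations, not values taken from [50]:

```python
def viewing_ratio(viewing_distance, display_height):
    """VR = viewing distance / display height (units cancel)."""
    return viewing_distance / display_height

def pick_quality(vr, high_vr_threshold=8.0):
    """Crude adaptation rule in the spirit of [50]: at a high VR
    (small or distant display) fine distortion is invisible, so a
    lower-quality tier can be transmitted. Threshold is hypothetical."""
    return "low-bitrate" if vr >= high_vr_threshold else "high-bitrate"

print(viewing_ratio(3.0, 1.0), pick_quality(3.0))    # 3.0 high-bitrate (TV)
print(viewing_ratio(0.5, 0.05), pick_quality(10.0))  # 10.0 low-bitrate (phone)
```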
4.1.2.1 H.264 vs. HEVC subjective evaluation
The Joint Collaborative Team on Video Coding (JCT-VC), a joint team between
MPEG and ITU, reported subjective test results for 27 test candidates in April 2010 [52].
The purpose was to evaluate the candidates for the next generation video coding standard,
HEVC. Two anchor encodings were generated to assist with the test candidate evaluations. Anchor encodings were included in the formal subjective tests and were directed through the same evaluation criteria as the test candidates. The H.264 encoder used for anchor file generation was JM16.2. These anchor reference points were used to define the behavior of current, accepted video encoding technologies for side-by-side comparison with the test candidates. Video resolutions within the tests ranged from 416x240 to 2560x1600, and encoded bit rates ranged from 256 kbit/s to 14 Mbit/s. The video bit rates chosen were dependent on video resolution, as shown in Table 4. The test methods used within the test sessions were the Double Stimulus Continuous Quality Scale (DSCQS) and Double Stimulus Impairment Scale (DSIS) evaluation methods as defined by ITU-R BT.500-13 [53]. DSIS was used for class C (832x480), class D (416x240), class E (1280x720), and the lower bit rates of class B (1920x1080) sequences; DSCQS was used for the higher encoding rates within class B sequences. Results from the evaluation showed that a 50% bit rate improvement can be achieved by multiple test candidates while achieving a similar mean opinion score (MOS).
Table 4: Class definition for resolution, frame rate, and bit rates
Class   Resolution    Frame Rate   Bit Rate (min.)   Bit Rate (max.)
A       2560x1600     30           2.5 Mbit/s        14 Mbit/s
B       1920x1080     24-60        1 Mbit/s          10 Mbit/s
C       832x480       30-60        384 kbit/s        2 Mbit/s
D       416x240       30-60        256 kbit/s        1.5 Mbit/s
E       1280x720      60           256 kbit/s        1.5 Mbit/s
[54] objectively and subjectively measured performance and made comparisons between HM5.0 and JM18. Tests were conducted with high-efficiency (HE), low-complexity (LC), and low-complexity combinations of rate distortion optimized quantization (RDOQ), adaptive loop filter (ALF), and sample adaptive offset (SAO). Coding tools were manipulated to trade Bjøntegaard delta rate (BD-rate) against encoding and decoding time. The study shows that video encoded with the HE configuration yielded subjectively indistinguishable results when compared to LC with RDOQ and SAO. Nine video sequences from classes B and C were encoded with QP = 32 and 37 for both random-access and low-delay configurations. The target bit rates were predominantly 500 kbps to 4,000 kbps. The test method was DSIS variant I, which shows the reference and impaired video once to the test subject before voting takes place. Informal subjective tests used HM5.0 encoded video sequences at half the bit rate of the JM18.2 encoded video sequences. Observer votes showed that HEVC is preferred over H.264 from 56% to 83% of the time, with the preference percentage depending on the video sequence and encoding configuration used for the subjective test.
Additional subjective comparisons were performed by the JCT-VC ad hoc group comparing HM5 and a similarly configured JM encoder/decoder, reported by Ohm et al. [55]. The goal was to quantify the feasible rate savings that yield similar subjective quality when comparing HEVC and similarly configured H.264. Tests were performed with the nine video sequences of classes B and C. JM QP settings were 27, 30, 33, and 36. The research team determined that the HM QP settings should be four higher, i.e., QPHM = QPJM + 4, giving HM QP settings of 31, 34, 37, and 40. Subjective tests were performed using the Double Stimulus Impairment Scale (DSIS) method with the same approach described in [52]. After some linear interpolation on RD graphs, gross average rate reductions of 67% for class B sequences and 49% for class C sequences were deduced.
Informal subjective tests at 720p and 1080p resolutions, specifically targeting low-delay applications, were performed by Horowitz et al. [56]. H.264/AVC JM version 18.3 and x264 core 122 r2184 were compared with HEVC (HM version 7.1). Encoders were configured for low-delay operation and 8-bit-per-sample video encoding. The H.264 videos were encoded at double the rate of the HEVC videos. QP was selected to ensure lossy but still good-quality video; both quality extremes were avoided, since very high quality video yields results where both sequences are indistinguishably excellent, and extremely low quality video makes it difficult for the observer to indicate a preference. Results indicate HM encoded sequences were favored 46% of the time for 720p sequences and 86% of the time for 1080p sequences.
Resolutions beyond HDTV are the main emphasis of the subjective analysis in [57]. Tests were conducted on a high-performance quad full high definition (QFHD) liquid crystal display (LCD). Video sequences were evaluated for spatial information (SI) and temporal information (TI) indexes. Class A video sequences were used, since these have the highest resolution, and the test set was augmented with two other high-resolution sequences at a video resolution of 3840x1744 (or higher). Bit rates for the tests ranged from 768 kbps to 20 Mbps. Double Stimulus Impairment Scale (DSIS) variant II, as defined by [53], was the test method, and the test session was divided into two 15-minute sessions with a rest period in between. Test results showed a bit rate reduction of over 50% is achieved with HEVC over AVC for high-resolution sequences at equivalent subjective performance.
H.264 vs. HEVC comparisons were performed by JCT-VC and reported in May 2012 [58]. Results were weighted 6:1:1 for YUV PSNR as an imperfect substitute for subjective assessment; the authors note the subjective results may actually be better than the PSNR-based results reported in the study. Bit rate savings ranged from 22% to 36%, depending on the HEVC configuration used. The tool configurations were the H.264 and HEVC base configurations with minor referencing structure changes. For example, the low-delay configurations, which are low-delay P and low-delay B, were set for a 1+3 reference structure.
An earlier JCT-VC subjective report [55], released in February 2012, suggests a bit rate reduction of roughly 49% to 67% for HEVC over H.264 at similar subjective performance as measured by MOS. Class B (1920x1080) and class C (832x480) video sequences were measured in this report. A modified version of the JM (i.e., H.264/AVC) software was used that includes non-normative improvements allowing more normalized comparisons with HEVC HM5.0.
4.2 Complexity Reduction
Complexity reduction has been, and remains, a major focus among researchers, since encoding time and resources are heavily consumed by complexity calculations during the coding phases. Figure 14 shows a processor utilization estimate for the H.264 encoding process as presented by [59].
Figure 14: H.264 processor utilization
Motion estimation and rate distortion optimization take up around 75% of the processor utilization. For mobile devices, a reduction in processor utilization will significantly reduce processing time, which leads directly to reduced battery usage and a longer-lasting mobile device. HEVC encoder and decoder complexity assessment is the research focus of [60], where the different HEVC tools are analyzed in terms of performance and computational complexity. In [61], H.264 use in bandwidth-limited environments is dissected; the paper recognizes the need for coding standard enhancements to further improve video compression over limited bandwidths, and specifically points out the need for improved motion estimation techniques and coding tools in low-bandwidth environments.
Figure 15 shows a simplified hybrid video encoder representative of both H.264 and HEVC. The items in red dashed rectangles are the focus of the majority of the H.264 and HEVC complexity reduction literature.
Figure 15: Simplified H.264 and HEVC encoder block diagram
4.2.1 H.264
In H.264 video encoding, predictive coding takes up significant time in encoding computations. In [62], the focus is complexity reduction of the Discrete Cosine Transform (DCT) and quantization (Q) stages of the H.264 video encoder. The authors introduce a prediction algorithm that reduces the redundant computations within the DCT/Q and inverse (i.e., IQ/IDCT) processes. The approach exploits the DCT by reducing the effect of high-frequency coefficients, which are typically zeroed out after quantization. This is especially true when the quantization parameters are large, as in low-bit-rate video applications. In addition, the DCT coefficients can be represented more loosely, since the transformed coefficients will be quantized coarsely with a large Q factor.
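The observation that coarse quantization zeroes most transform coefficients can be demonstrated directly. The sketch below uses a textbook orthonormal DCT-II rather than the H.264 integer transform, and a smooth test block, so it illustrates the effect rather than reproducing the method of [62]:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix (rows are basis frequencies)."""
    k = np.arange(n)
    m = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    m[0] /= np.sqrt(2)
    return m * np.sqrt(2.0 / n)

def quantized_zeros(block, q):
    """Forward 2D DCT, uniform quantization with step q, and a count of
    the coefficients that quantize to zero. The coarser the quantizer,
    the more (mostly high-frequency) coefficients vanish; computing
    them is the redundancy the DCT/Q pruning in [62] avoids."""
    d = dct_matrix(block.shape[0])
    coeffs = d @ block @ d.T  # separable 2D DCT-II
    return int((np.round(coeffs / q) == 0).sum())

# Smooth 8x8 ramp: energy compacts into the low-frequency coefficients
ramp = np.add.outer(np.arange(8.0), np.arange(8.0)) * 16.0
fine = quantized_zeros(ramp, q=4)
coarse = quantized_zeros(ramp, q=64)
print(fine, coarse)  # the zero count grows with the quantization step
```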
An encoding strategy sensitive to the needs of wireless services is presented by [59]. The authors use knowledge of the video context from the video sequences: unimportant regions in the frames are isolated and unnecessary processing is avoided. Therefore, battery power consumption can be reduced through H.264 complexity reduction while maintaining reasonable frame quality and a low bit rate. User input and prior knowledge of the context are taken into consideration in deciding the frame’s relevance and the significance of each section within the frame. Each frame throughout the sequence is segmented into non-overlapping regions of varying significance: the most significant areas are the foreground and the remaining areas are the background. For example, in video-conferencing applications, the speaker’s head and shoulders are the foreground and the remainder of the frame is the background. Using this method, the complexity is reduced by more than 40%.
An H.264 reduction algorithm for search window sizes within ME is proposed in [63]. This algorithm decreases encoder complexity by reducing the number of sum of absolute difference (SAD) calculations, which is beneficial in low complexity sequences such as video surveillance and video telephony. Complexity reduction is achieved by a binary assessment against a given motion threshold, which yields a block-wise difference image based on blob coloring. The current frame is compared with the latest I-frame, block by block. If the image difference within a block is greater than the threshold, the block is marked active; otherwise it is marked inactive. The active area uses a larger search window for ME, while the inactive area uses a smaller search window, as shown in Figure 16. The algorithm reaches an encoding time savings of 50% to 60% with a PSNR decrease of less than 0.05 dB and a bit stream size increase of 0.09% or less.
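The block classification of [63] can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's implementation; the block size, motion threshold, and window sizes are assumed values.

```python
def block_mad(cur, ref, x0, y0, size):
    """Mean absolute difference between co-located blocks of two frames
    (frames given as 2-D lists of luma samples)."""
    total = 0
    for y in range(y0, y0 + size):
        for x in range(x0, x0 + size):
            total += abs(cur[y][x] - ref[y][x])
    return total / (size * size)

def classify_blocks(cur, last_i_frame, block=16, threshold=10.0):
    """Mark a block 'active' when it differs from the latest I-frame by
    more than the threshold; active blocks get the larger ME search
    window, inactive blocks the smaller one."""
    h, w = len(cur), len(cur[0])
    return [[block_mad(cur, last_i_frame, bx, by, block) > threshold
             for bx in range(0, w, block)]
            for by in range(0, h, block)]

def search_window(is_active, large=32, small=8):
    """Map the activity flag to a search window size (illustrative values)."""
    return large if is_active else small
```

For a surveillance-style scene where only one corner moves, only that corner's blocks would be classified active and searched with the larger window.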
Figure 16: Active and inactive areas in the binary difference image
Machine learning techniques for optimization of a low complexity H.264 encoder are presented in [64]. The approach is to make computationally expensive encoder decisions, such as the MB coding mode, using features derived from uncompressed video. A machine learning algorithm is used to obtain a classifier decision tree based on such features. The decision tree is trained, and the encoder's coding mode decisions, which usually evaluate all possible coding options, are replaced with the decision tree. The author proposes a three-level tree topology for the Inter mode decision. The first level, which provides the main speedup, decides among an early Skip, Intra 16x16, and the remaining modes. The second level separates Inter 8x8 from the remaining Inter 16x16 modes and sub-modes. Finally, the third level decides among the remaining Inter modes and sub-modes. Figure 17 shows the decision tree.
Figure 17: Decision tree for MB encoding
In [65], the authors propose a complexity reduction in macroblock mode selection. Macroblock modes inter8×8 and intra4×4 have the highest complexity and are the focus of the proposed methods. The two complexity reduction methods, for inter8×8 and intra4×4 respectively, use the costs of the other macroblock modes. For inter8×8 reduction, the costs of the inter macroblock modes increase or decrease according to the block direction. With this assumption, the selectable sub-macroblock modes are reduced by using the MV costs and reference costs of inter16×16, inter16×8, and inter8×16, as shown in Figure 18. If the 8x8 DCT is not selected, intra4×4 is compared with intra16x16. If the difference between the RD costs of intra16x16 and intra4x4 is small, per the author's cost computation, then the remaining RD cost computation of intra4×4 is performed. Simulation results showed the methods achieve 57.7% total encoding time savings with a PSNR decrease of 0.05 dB.
Figure 18: Restriction of selectable sub-macroblock modes

An H.264 power-aware complexity reduction method was presented in [66]. The encoding complexity is adapted depending on available power: more available power allows the use of more complex motion estimation modes. The algorithm has two main elements: region of interest (ROI) determination and the mode set allowed for the ME computation, as shown in equation (1). The ROI is motion based; motion above a threshold gives the block an ROI value of 1, otherwise the ROI value is set to zero. When power is freely available, all mode sets are available for any ROI value, which yields the highest coding quality. As power availability decreases, the mode set is reduced for low-ROI blocks. At medium available power, low-ROI blocks can only select from mode sets 0 and 1 for ME. At low available power, low-ROI blocks use only mode set 0. For low available power, encoding time savings of over 50% with a PSNR loss of less than 0.4 dB were achieved.
Mode Set ∈ { ModeSet 0 (low complexity modes), ModeSet 1 (medium complexity modes), ModeSet 2 (high complexity modes) }    (1)
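A minimal sketch of the power/ROI mode-set policy described above, in Python. The mode names inside each set are assumptions for illustration, since [66] does not enumerate them here.

```python
def select_mode_set(power_level, roi):
    """Pick the allowed ME modes from available power and the block's ROI
    flag (1 = motion above threshold, 0 = background), following the
    policy described for [66]. Mode-set contents are illustrative."""
    MODE_SETS = {
        0: ["SKIP", "16x16"],                            # low complexity
        1: ["SKIP", "16x16", "16x8", "8x16"],            # medium complexity
        2: ["SKIP", "16x16", "16x8", "8x16", "8x8sub"],  # high complexity
    }
    if power_level == "high" or roi == 1:
        allowed = [0, 1, 2]      # full power or high-ROI block: all mode sets
    elif power_level == "medium":
        allowed = [0, 1]         # low-ROI blocks restricted to sets 0 and 1
    else:
        allowed = [0]            # low power: low-ROI blocks use set 0 only
    return sorted({m for s in allowed for m in MODE_SETS[s]})
```

High-ROI (moving) blocks keep the full mode repertoire at every power level; only background blocks lose modes as the battery drains.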
A variable ME search area is proposed in [67]. The basic premise is to compute the sum of absolute differences (SAD) at the zero motion vector between the current block and blocks from the five previous reference frames, and compare it with a threshold. If below the threshold, full search (FS) is used; otherwise a three-step search is used during ME, as shown in Figure 19.
Figure 19: Flexible search method flow chart
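The decision in Figure 19 reduces to a few lines. The sketch below assumes flattened blocks and an illustrative SAD threshold.

```python
def sad(cur_block, ref_block):
    """Sum of absolute differences between two equally sized blocks,
    given as flat lists of luma samples."""
    return sum(abs(c - r) for c, r in zip(cur_block, ref_block))

def choose_search(cur_block, ref_blocks, threshold=256):
    """Per [67]: compute the SAD at the zero motion vector against the
    five previous reference frames; if the minimum is below the
    threshold, use full search (FS), otherwise fall back to a
    three-step search (TSS). The threshold value is an assumption."""
    zero_mv_sad = min(sad(cur_block, ref) for ref in ref_blocks[:5])
    return "FS" if zero_mv_sad < threshold else "TSS"
```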
A study conducted in [68] analyzed permissible perceptual distortions and assigned a heavier weighting to regions that are perceptually less sensitive to human vision. The main concept of the proposed speed-dependent motion estimation algorithm is that computational time savings occur as it assigns a larger inter-mode value to regions that are perceptually less sensitive to distortion. The proposed model aims to reduce computation time by skipping certain MBs in perceptually less sensitive areas while maintaining RD performance. The human visual system must be accounted for in perceptual video coding: human beings cannot perceive tiny variations in visual signals because of the psycho-visual properties of the human visual system. The proposed algorithm uses motion vectors to determine the speed of the object within the block, and the quality of the block is adjusted according to predetermined thresholds. The author's proposed algorithm can reduce the computational complexity of motion estimation by up to 47.16% while maintaining high compression efficiency.
4.2.2 HEVC
HEVC adopts a well-known Rate Distortion Optimization (RDO) model [69] that tries to achieve a balance between quality, complexity, and coding efficiency. HEVC encoder and decoder complexity for the HEVC base configurations has been assessed in an ITU-reported research study [60], where different HEVC tools are analyzed in terms of performance and computational complexity on different hardware platforms. RDO reaches the optimal partitioning by evaluating all combinations of CU size, prediction unit (PU), and TU. This approach increases encoder computational complexity and makes real-time encoding implementations more difficult, especially for portable and mobile devices where power consumption is one of the key factors. To reduce RDO complexity, several fast algorithms have been published, mainly focused on reducing the number of coding block (CB), prediction block (PB), and TU sizes to evaluate. Some of these algorithms are discussed within this section.
In [70], HEVC coding complexity is reduced by limiting HEVC options and recording the bit rate delta. Several tool modifications were analyzed, including (1) reducing angular intra prediction to eight directions, similar to H.264, (2) limiting the maximum CU size, (3) limiting the maximum TU size, (4) changing intra mode coding approaches, and (5) enabling/disabling SAO. The largest bit rate impacts, which can be similarly stated as coding efficiency losses, came from the angular prediction, CU size, and TU size limitations. Minimal bit rate impacts were observed for the remaining changes. HEVC coding advantages are greater at higher resolutions and where strong directionalities exist within the video sequence. Strong directional textures are noted for sequences with large homogeneous regions, which allow effective use of HEVC 64x64 block sizes with accurate prediction. Video sequences that take advantage of this are Kimono, Johnny, and Kristen and Sara. The last two are Class E, which is for video conferencing, with large static background areas and regular motion from talking people in the foreground. The former, Kimono, is a sequence where the background is non-moving but the camera is panning. This indicates large CU sizes, such as 64x64, can be used with a motion vector to accurately estimate the background CU.
In another paper, transform skipping within HEVC is explored [71]. The main emphasis is exploring the transform tool advancements from H.264 to HEVC. The video sequences in the study include both camera and graphical content, such as computer-generated material shared over the internet. The usual implementation of the 1D or 2D transform is based on integer implementations of the DCT and DST. HEVC has adopted the DST for intra residuals of certain directional predictions. However, certain types of residual can benefit from skipping the transform step entirely, as presented by the author. HEVC TU intra coding consists of several parts. First, the reconstructed pixels are used to predict pixels in a specific direction. Then, the corresponding TU transform is applied to the prediction residue. This is followed by quantizing the transform coefficients and coding the quantized coefficients with CABAC. Lastly, the quantized coefficients are reconstructed and used to predict later TUs. The author performed tests where 2D, 1D, or no TU transforms were used, as shown in Figure 20. Savings of up to 30% BD-rate were observed when TU transforms were skipped. The largest gains were observed for intra and low complexity configurations.
Figure 20: Transform choices by transform skip mode
TU coefficient analysis is performed in [72]. The emphasis is weighting the quantization of TU coefficients. HEVC transform coefficients are quantized equally per the selected QP. The author argues this is inefficient, since the TUs are not equally distributed within the CTU. The method researched weights the TU quantizer depending on the level N at which the TU resides within the CTU. Nth-level quantization increases the quantization factor as scanning progresses within the TU. The authors' proposed method showed a bit rate gain of 0.3% to 0.6%.
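A hypothetical one-line model of the depth-weighted quantizer; the linear step per level is an assumption, since [72] does not specify the exact weighting here.

```python
def weighted_qp(base_qp, tu_depth, step=1):
    """Weight the quantization parameter by the TU's depth inside the
    CTU, increasing quantization as scanning progresses, in the spirit
    of [72]. The linear step per depth level is an assumed model."""
    return base_qp + tu_depth * step
```

Under this model a TU three levels deep in the CTU would be quantized with a QP three steps coarser than the CTU root.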
Deblocking filtering and its decision process were investigated as a potential encoding time savings in [73]. Deblocking is conceptually the same in H.264 and HEVC: the CB boundaries are compared for hard spatial changes on both sides of the boundary. A major change is that H.264 applies deblocking on a 4x4 grid, while HEVC applies it on an 8x8 grid. When deblocking is turned on, the edges are smoothed to deemphasize the spatial change. Typically, the luma CB is modified to perform the boundary smoothing. Care must be taken with deblocking, since the smoothing effect can significantly enhance or adversely affect subjective quality. Desired details with a small pixel footprint can be deblocked (i.e., smoothed out) to the point where they are no longer discernible.
The author performed objective and subjective tests with deblocking enabled and disabled. Bit rate increased from 1.3% to 3.4%, depending on the configuration used, when deblocking was turned on versus off. Low delay configurations tend to have a higher bit rate; the author comments this is due to the low delay configurations having only one intra-coded frame at the beginning of the video sequence. The most noticeable subjective effect was with high QP (QP=37) encoding of "Kristen and Sara". Deblocking improves subjective quality in both small and large CBs.
A unique approach to video coding was proposed in [74], [75]. The authors' concept is to address decoding techniques that can be used in lower bit rate environments to reduce mobile device power consumption without impacting subjective quality, and then to apply those requirements to the encoder. The core component of the approach is to decompose the video sequence into two spatial elements with significantly different quality. This leads to, as the author defines them, "low resolution" and "high resolution" components. The video frame is composed of a checkerboard pattern alternating between the "low" and "high" resolution components, as shown in Figure 21.
Figure 21: Decomposition using checkered pattern
Leveraging the checkerboard decomposition within the encoder requires a method to code blocks with profoundly different quantizers for the "low" and "high" resolution components. The use case is HD (1080p) or higher resolution video decoded on a mobile device. The mobile device incurs a longer decode time, which adversely impacts battery-powered devices. In addition, on mobile devices with smaller screen sizes, user perception will not benefit from HD or higher resolutions; however, HD (or higher) decoding, with its decoding time overhead, must still be maintained.
Figure 22: Template and block matching vectors
Cho and Kim [76] proposed an HEVC algorithm for intra frames that performs early decisions for CU splitting or pruning. This is interesting, since most other conference papers focus on only one of the complementary methods, either CU splitting or CU pruning. Early CU splitting and pruning decisions are made with the Bayes decision rule based on RD costs. The decision model is updated regularly from the previous frames to adapt to the video sequence. Prediction coding for intra frames is based on two processing operations: CU splitting is a depth-first, top-down approach, while CU pruning is performed in a depth-first, bottom-up manner. Both methods occur during the encoding process and are depicted in Figure 23. The HEVC computational complexity of intra prediction coding comes from the CU splitting required to compute full RD costs for candidate intra prediction modes of CUs at all depth levels. To help with computational complexity reduction, HEVC allows skipping the full RD cost computation for a CU and even allows terminating the subsequent CU splitting and pruning process. The authors' proposed method applies a Bayesian decision, with the statistical parameters of known or estimated random variables, within the RD cost phase to expedite the decision process. The experimental results show an encoding speedup of 50.2% with just a 0.6% BD-rate increase, and a 63.5% speedup with a 3.5% BD-rate increase.
Figure 23: CU splitting and pruning process
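The Bayes decision rule over RD costs can be sketched with a single-feature, equal-prior classifier. This is a simplified stand-in for the model in [76]; the minimum-distance rule and online mean updates are assumptions of the sketch, not the paper's exact statistics.

```python
class BayesCuDecision:
    """Online sketch of a Bayes split/no-split decision on RD cost, in
    the spirit of [76]: costs observed for each outcome in previous
    frames are summarized by their sample means, and a new CU is
    assigned to the class whose mean is closer (equal-variance,
    equal-prior Bayes rule)."""

    def __init__(self):
        self.costs = {"split": [], "no_split": []}

    def update(self, rd_cost, was_split):
        # Statistics are refreshed from previously coded frames.
        self.costs["split" if was_split else "no_split"].append(rd_cost)

    def _mean(self, label):
        data = self.costs[label]
        return sum(data) / len(data)

    def decide_split(self, rd_cost):
        """Early-split when the 'split' class is the nearer one."""
        return abs(rd_cost - self._mean("split")) < abs(rd_cost - self._mean("no_split"))
```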
A nice analysis and proposed complexity reduction are presented in [77]. The analysis compares the HEVC low delay P-frame (LDP) and low delay B-frame (LDB) configurations. The main trade-offs between the configurations involve computational complexity, coding efficiency, and Bjøntegaard delta bit rate (BD-rate). LDP has lower computational complexity, but its coding efficiency is lower, with a BD-rate difference of roughly 6%. The author calculated LDB and LDP mode statistics for two test sequences, Kimono and BQTerrace, with a quantization parameter (QP) of 27. Inter frame data such as uni-prediction merge mode, bi-prediction inter mode, and uni-prediction inter mode were captured. More than 50% of modes are bidirectional prediction in LDB. The author proposed a feature-reduced version of LDB, where some of the precision motion compensation algorithms are removed. The result is a configuration with performance between LDP and LDB. The main advantage of the proposed configuration is that it retains the bi-directional feature but with a lower complexity algorithm, which helps reduce computational complexity and reduces the need for precious hardware resources such as memory.
A Context-based Adaptive Binary Arithmetic Coding (CABAC) enhancement in the bypass mode is proposed in [78] as a complexity reduction without loss of compression performance. CABAC is an entropy coder adopted by the H.264 coding standard; it obtains high compression efficiency by using probability estimation. However, CABAC is also a main source of decoder complexity and processing time. The author presents an alternate bypass mode, called pass-through mode, which is a less complex coding path. In pass-through mode, the probability estimation and arithmetic coding processes are skipped, reducing the processing steps. The main emphasis is to simplify the binary symbol (bin) strings; each bin is then encoded by the binary arithmetic coder in either regular or bypass mode. Bins identified with a probability of 0.5, indicating that 0 and 1 occur with equal probability, are encoded with the reduced complexity arithmetic coding mode.
In [79], the author presents a faster intra prediction mode decision algorithm. The algorithm takes into account the modes of neighboring PUs and uses edge information of the current PU to choose a reduced set of directions. The algorithm selects the 9 most often used directional modes from previous PUs and only uses those for intra prediction evaluation. The reduced directional set consequently makes the HEVC intra prediction mode decision computationally more efficient; however, there is some PSNR quality degradation. There is a strong suspicion that the reduced direction set will closely resemble the H.264 direction set; the author's example was a reduced set representative of the H.264 directional set. The proposed method decreased intra prediction processing time by up to 32.08%, with a bit rate increase of 0.9% (on average) and a 0.02 dB reduction in PSNR.
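Selecting the 9 most frequent directions from previously coded PUs might look like the following sketch (mode numbering 0-34 follows HEVC intra modes; the source of the mode history is an assumption of this sketch).

```python
from collections import Counter

def reduced_intra_modes(neighbor_modes, keep=9):
    """In the spirit of [79]: build a reduced candidate set from the
    intra directions most often used by previously coded neighboring
    PUs, keeping only the `keep` most frequent instead of evaluating
    all 33 angular modes."""
    counts = Counter(neighbor_modes)
    return [mode for mode, _ in counts.most_common(keep)]
```

The full-RD evaluation then runs only over the returned candidates, which is where the processing time reduction comes from.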
A fast algorithm for sub-pixel motion estimation is developed in [80] as a complexity reduction method. The author's algorithm approximates the error surface of the sub-pixel positions and predicts the minimum point by minimizing the function. This is followed by a second-order approximation within a smaller area to predict the best sub-pixel position. The typical ME process has two notable stages: an integer-pixel search within a search area and a sub-pixel search around the best integer pixel position. The direct way to determine the optimal position is the full search (FS) algorithm, which checks all points within the search range and selects the best point. However, FS computational complexity is undesirable and leads to very long encoding times. The author proposed several error surface models, ranging from 5-term to 9-term models, eventually settling on the 5-term model for the first-order approximation and the 6-term model for the second-order approximation. The proposed method's test results showed a PSNR reduction of 0.04 dB or less and a reduced encoding time of 19% to 67%.
In [81], a fast decision method to reduce HEVC encoder complexity was proposed: an early detection of SKIP mode at the CU level based on the differential motion vector (DMV) and coded block flag (CBF). SKIP mode has a high probability of occurrence; therefore it is preferable to detect SKIP mode as early as possible. The test video sequences had resolutions ranging from 416x240 to 1920x1080, and the SKIP mode occurrence probability averaged from 0.817 to 0.843 depending on the configuration. The early detection of SKIP mode utilizes the DMV and CBF of the inter 2Nx2N mode. The method selects the best inter 2Nx2N mode as the one with the minimum RD cost. For the best inter 2Nx2N mode, if the DMV is equal to (0,0) and the CBF is equal to zero, then the best mode is SKIP mode and the remaining PU modes are not searched further. Experimental results show that the encoding complexity can be reduced by up to 34.55% in the random access (RA) configuration and 36.48% in the low delay (LD) configuration, with only a slight bit rate increase of 0.4%. A similar SKIP mode algorithm is offered by [82].
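The early SKIP test of [81] reduces to two comparisons. The dictionary standing in for encoder state is an assumption of this sketch.

```python
def early_skip(best_inter_2Nx2N):
    """Early SKIP detection per [81]: after evaluating inter 2Nx2N, if
    the differential motion vector (DMV) is (0,0) and the coded block
    flag (CBF) is zero, declare SKIP the best mode and stop searching
    the remaining PU modes."""
    if best_inter_2Nx2N["dmv"] == (0, 0) and best_inter_2Nx2N["cbf"] == 0:
        return "SKIP"   # terminate the PU mode search early
    return None         # keep evaluating the other PU modes
```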
Choi, Park, and Jang [82] provided early HEVC studies for early termination, presented at the International Telecommunication Union's (ITU) Joint Collaborative Team on Video Coding (JCT-VC) meeting. The work suggested opportunities in determining the best SKIP mode for CU early termination. Choi proposed coding tree pruning based on SKIP mode detection: if SKIP mode is selected as the best prediction mode, then no further processing of smaller CB sizes is performed. Depending on the content, the research showed that well over 90% of CU depth selections can be skipped. A conditional probability is placed on the depth selection, and the determination to split the CU is guided by that probability. Results show a 42% encoding time reduction with a luma PSNR gain of 0.6% or less.
A complexity reduction method that speeds up decision making within the RDO process is proposed in [83]. The proposed method is a fast RDO algorithm based on two techniques: (1) Top Skip, which selects the initial CU size, and (2) Early Termination, which limits the smaller CB sizes. This reduces the encoding time by 40% with a bit rate increase of 2%. During the RDO process, the encoder tests all possible coding modes and block partitions and keeps those providing the smallest rate distortion (RD) cost. The large number of available modes and partitions leads to a high computational cost, which is time consuming and may not be suitable for real-time applications. For Top Skip, the larger CU sizes are avoided by selecting a starting CTB depth (higher than zero) corresponding to a given level of CU quad-tree splitting, which takes its cue from the previous frame. The Top Skip technique selects the starting depth by observing that there exists a high correlation between the minimum depth of the current Coding Tree Block (CTB) and that of the co-located CTB in the previous frame. Therefore, the starting depth is the same as the previous frame's resultant depth, provided the previous frame's QP is the same. The Early Termination technique halts the CU splitting process when the best RD cost is already lower than a given threshold. This means that an acceptable coding cost has already been obtained and continued searching over smaller CUs would only slightly improve RD performance; checking of smaller CU sizes that are unlikely to be selected by the brute force RDO process is thereby avoided. The threshold value is adaptively computed and trades off complexity reduction against negligible RD performance loss. In addition, the threshold computation exploits both the spatial and temporal correlations in the video inter-frames using a Gaussian weighting function.
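The interplay of Top Skip and Early Termination can be sketched as a bounded depth loop. The cost table and fixed threshold below stand in for the paper's RD evaluation and adaptive threshold computation.

```python
def rdo_with_top_skip(rd_cost_at_depth, prev_frame_min_depth, threshold,
                      max_depth=3):
    """Sketch of the two techniques of [83]. Top Skip: start the
    quad-tree search at the minimum depth of the co-located CTB in the
    previous frame instead of depth 0. Early Termination: stop
    splitting once the RD cost at a depth falls below the threshold.
    `rd_cost_at_depth` maps depth -> RD cost for this sketch."""
    best_depth, best_cost = None, float("inf")
    for depth in range(prev_frame_min_depth, max_depth + 1):  # Top Skip
        cost = rd_cost_at_depth[depth]
        if cost < best_cost:
            best_depth, best_cost = depth, cost
        if cost < threshold:                                  # Early Termination
            break
    return best_depth, best_cost
```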
Another RDO scheme uses a decision algorithm that dynamically adjusts the depth of the CU quad-tree structures, as proposed in [84]. The authors' proposed method is similar to earlier papers presented in this section, where computational complexity is reduced by trimming the CU options within the RDO process. As with the earlier papers, the aim is to eliminate the maximum number of CUs tested during the RDO process, avoiding the processing of CUs at large tree depths, which was found to have a high complexity cost for a small encoding gain. When frames are being encoded, the algorithm does not allow the RDO process to test all possible optimization options. Instead, the algorithm uses the information in a history table: the current frame's RDO process is limited, in each 64x64 area, to the maximum tree depth saved in the history table. If the limit is reached, the search is completed and the current results are used. Since co-located CUs in neighboring frames tend to have similar maximum depths, it is expected that coding efficiency will improve and RDO results will not be affected much. The experimental results showed a potential complexity reduction of 40% to 80%, with a PSNR drop lower than 0.8 dB and a bit rate increase of less than 5.7%.
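A minimal history-table sketch, assuming one entry per 64x64 area keyed by CTU position; the optional slack parameter is an addition for illustration, not part of the paper's method.

```python
def update_history(history, ctu_index, depth_used):
    """Record the maximum quad-tree depth actually chosen in each 64x64
    area of the previous frame (sketch of the history table in [84])."""
    history[ctu_index] = max(history.get(ctu_index, 0), depth_used)

def allowed_depths(history, ctu_index, absolute_max=3, slack=0):
    """Limit the current frame's RDO to the depth saved for the
    co-located 64x64 area, plus an assumed number of slack levels.
    Areas with no history yet may use the full depth range."""
    limit = min(history.get(ctu_index, absolute_max) + slack, absolute_max)
    return range(0, limit + 1)
```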
A similar RDO complexity reduction approach was taken by the same authors in [85] and [86]; however, the history table is limited to neighboring CTB depths. This led to encoding time savings, while the PSNR was not nearly as degraded. Experiments showed a computational complexity reduction of 40% with a PSNR drop of 0.1% and a bit rate increase of 3.5%.
Motion vector merging (MVM), proposed by Sampaio et al. [87], prunes the PU partition size decision by checking the neighboring CUs that border the current CU. The left and above CUs are queried for PU shapes of 2Nx2N, 2NxN, and Nx2N. Each query is analyzed, and a heuristic decides whether the partitions can be merged. If merging occurs, the rate-distortion cost is evaluated for the decided PU partition, producing sufficient information for the MVM decision. When identical motion vectors are calculated for all NxN, 2NxN, or Nx2N partitions and the neighbor CU, the current PU is selected for encoding. On average, 2Nx2N PU partitions are chosen in 85% of the decisions. The asymmetrical PU partitions, 2NxN and Nx2N, are each the best option in approximately 7% of decisions. The NxN PU partition is relegated to use only with 8x8 CU sizes, corresponding to being selected only 0.2% of the time. This proposed method reduced execution time by 34% with PSNR losses of less than 0.08 dB.
A CU splitting early termination algorithm is proposed by Shen and Yu [88], where CU splitting optimization in HEVC is formulated as a binary classification problem and solved by support vector classification. The approach embeds model training into the predictive model selection process using a simple greedy search. To find features useful for building a good predictor, two types of feature selection approaches were considered: filter and wrapper approaches. The wrapper method was based on the F-score; filter methods, based on correlation or mutual information ranking, are easy to implement, but selecting the most relevant variables this way is usually sub-optimal for building a predictor, especially when the variables are redundant. The proposed algorithm performed well across different configurations and various video contents. The CU splitting early termination model was trained offline, and the proposed algorithm is computationally simple. Experimental results showed a 44.7% reduction in computational complexity with a 1.35% BD-rate increase for the "random access, main" configuration, and a 41.9% complexity reduction with a 1.66% BD-rate increase in the "low delay, main" configuration.
A CU size decision method is proposed by Shen et al. [89], where different methods are tested to resolve the optimal CU size. All methods lead to adapting the maximum allowed depth. The first method checks for motion homogeneity by querying the motion vector (MV) X and Y components of the above and left neighbor CUs. The differences between the current CU's MVs and the neighbor MVs are compared against a threshold. When the motion homogeneity measure is smaller than the threshold, the current CU is considered to have homogeneous motion; otherwise the CU is considered to have complex motion. The threshold is set to tolerate minor noisy MVs in a region of homogeneous motion. The second method is based on RD cost checking. Spatially and temporally neighboring CTUs usually show a similar RD cost distribution; therefore, the RD cost correlation is used to determine an early termination threshold. When the RD cost of the current CU size is smaller than the calculated threshold, the next depth level's motion estimations can be skipped. The third method is based on skip mode checking. The algorithm introduces skip mode checking to avoid unnecessary ME on smaller CU sizes, utilizing the prediction mode information at the upper depth level and the current depth level. Typically, choosing a small CU size results in a lower energy residual after motion compensation but requires a larger number of bits to signal the MVs and prediction type. Smooth and slow motion can be predicted more accurately using a larger CU size. Selecting skip mode as the best prediction mode for the current CU size indicates that the current CU is located in a region with homogeneous motion or a static region, which should result in a lower energy residual after motion compensation compared to other prediction modes; thus, no further processing of sub-CUs is necessary. Experimental results showed a 28% to 52% reduction in computational complexity with a 0.90% to 3.63% BD-rate increase.
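The first two checks can be sketched as follows; the thresholds (a fixed MV threshold, and the mean of neighboring RD costs) are illustrative assumptions standing in for the paper's calculations.

```python
def motion_homogeneous(cur_mv, above_mv, left_mv, threshold=4):
    """First method of [89]: compare the current CU's motion vector with
    the above and left neighbors; motion is homogeneous when the summed
    component differences stay under a small threshold that tolerates
    noisy MVs. MVs are (x, y) tuples; the threshold is an assumption."""
    diff = sum(abs(cur_mv[i] - above_mv[i]) + abs(cur_mv[i] - left_mv[i])
               for i in (0, 1))
    return diff < threshold

def skip_deeper_me(rd_cost, neighbor_costs):
    """Second method of [89]: derive an early-termination threshold from
    neighboring CTU RD costs (here simply their mean) and skip the next
    depth level's motion estimation when the current cost is below it."""
    threshold = sum(neighbor_costs) / len(neighbor_costs)
    return rd_cost < threshold
```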
Leng et al. [90] presented a method which skips ME at some depths depending on data from the co-located CU in the previous frame and from neighbor CUs. The method exploits similarities across several consecutive frames: some features stay the same, such as the QP, moving speed, and resolution, and the detailed and homogeneous parts stay the same as well. This indicates there are mode correlations among consecutive frames, which allows skipping specific depths that were rarely used in the previous frames for all CUs in the current frame. No more than two depth levels are skipped for any given CU. The current CU's depth starting point can be set to a new depth as indicated by the previous frames' depth usage. This method provided an average of 45% time savings with a PSNR loss of 0.11 dB or less.
Table 5 below summarizes the complexity reduction methods discussed within this section. Figure 24, which follows the table, shows the major areas of discussion within the encoder block diagram.
Table 5: Complexity reduction reviewed research

Cite      | STD   | Manipulation   | Method                        | Description
[62]      | H.264 | T/Q            | Q                             | Faster-resolving transform; removes high frequencies quicker.
[59]      | H.264 | ME/T           | Search / Tree Pruning         | Pre-determined foreground and background; reduced options for background.
[63]      | H.264 | ME             | Search                        | Search window size changed per difference threshold.
[64]      | H.264 | MODE           | Tree Pruning                  | Decision tree for ME mode.
[65]      | H.264 | MODE           | Tree Pruning                  | Macroblock mode selection for inter8×8 and intra4×4.
[66]      | H.264 | MODE           | Tree Pruning                  | Power-aware; ROI (determined per MV) selects mode set.
[67]      | H.264 | ME             | Search                        | Full search or three-step search per 5-reference-frame SAD threshold.
[69]      | H.264 | RDO            | RDO function                  | Mainly an RDO model for the GOP; some ME and mode changes.
[70]      | HEVC  | ME/MODE/Config | MV limit / Tree Pruning / SAO | Intra block coding only; restricted angular directions; varied depth searches; configuration limits.
[71]      | HEVC  | T              | Tree Pruning                  | TU method for 2D, 1D, or skip.
[72]      | HEVC  | T/Q            | Q                             | Faster-resolving transform by varying Q with CU depth.
[73]      | HEVC  | T              | Other                         | De-blocking filter on intra luma block in PU or TU.
[74]/[75] | HEVC  | T/Q            | Q                             | High/low resolution quantizers in a checkerboard pattern.
[76]      | HEVC  | MODE           | Tree Pruning                  | Intra frames only; CU tree splitting/pruning via Bayesian decision model.
[77]      | HEVC  | ME             | Tree Pruning                  | Low delay B-frame with reduced feature set.
[78]      | HEVC  | Entropy        | Other                         | CABAC enhancement in bypass mode.
[79]      | HEVC  | ME/MODE        | Tree Pruning                  | Directions derived from neighboring PU edges; reduced from 33 to 9.
[80]      | HEVC  | ME             | Math Model                    | Simple sub-pixel ME equation.
[81]      | HEVC  | MODE           | Tree Pruning                  | Early SKIP mode detection per motion vector threshold.
[82]      | HEVC  | MODE           | Tree Pruning                  | Early SKIP mode detection per motion vector threshold.
[83]      | HEVC  | MODE           | Tree Pruning                  | Top Skip and Early Termination.
[84]/[91] | HEVC  | MODE           | Tree Pruning                  | RDO complexity-reduction ratio limit using previous-frame history.
Figure 24: Literature research on hybrid coding map
5 HEVC AND H.264 SUBJECTIVE EVALUATION
This chapter compares the quality of H.264 and HEVC encoded video in low bandwidth mobile environments. In this study, the focus within the mobile environment is smart phones. The key characteristics of a smart phone are a smaller screen size, usually 3.5 to 5.0 inches diagonal for high-end smart phones, and typical cellular network bandwidth, which is 3G or faster. Subjective evaluations were conducted to evaluate the user experience on a mobile device with a small screen size and video coded at 200 and 400 Kbps. The studies showed compelling evidence that a user's experience in low bandwidth mobile environments is very similar between HEVC and H.264. The results suggest the benefits of HEVC over H.264 in a mobile environment, at lower video bitrates and resolutions, are not as clear.
5.1 Background
The mobile compute environment has evolved rapidly in the last few years and smart phones have penetrated the consumer market extensively. Smart phones are being used as extensions of consumer electronics devices such as TVs, Blu-ray players, and audio receivers. Smart phone display performance has progressed significantly and cellular network bandwidth has improved in recent years. This has allowed streaming multimedia adoption in these traditionally loss-prone environments [92]. Display technologies in the mobile market space have benefited from strong design investment by smart phone manufacturers and significant research and development by
58 liquid crystal display (LCD) manufacturers. This has enabled the mobile LCDs to
improve steadily in performance aspects such as: (a) resolution, (b) power consumption,
and (c) viewing angles. Mobile phone connectivity also benefits significantly from
wireless infrastructure improvements, beginning with WiFi availability at home, work,
and business locations, to cellular network technology improvements with increasing
bandwidth provided by 3G and LTE.
The evolution of encoding methods from H.264 to HEVC optimizes visual quality
on larger resolution images, especially Ultra High Definition [20], with the main beneficiaries being internet and broadcast networks [93]. However, HEVC is also
expected to provide compression gains over H.264 in the mobile environment [92]. The
significance of these gains in mobile devices, playing low bitrate video, has not been
studied. The main goal of this work is to evaluate the subjective quality of HEVC and
H.264 at mobile bitrates and to determine whether the additional gains from HEVC
encoding are perceivable by end users on mobile device displays.
This chapter presents subjective quality evaluation studies that compare videos
coded with H.264 and HEVC. The studies were conducted using videos coded at typical mobile bitrates of 200 and 400 Kbps. Subjective evaluations showed that H.264 and
HEVC result in a similar quality of experience. Bandwidth reduction alone may not sufficiently justify the cost of deploying HEVC in mobile devices targeting low bandwidth applications. Design decisions and trade-offs based on the reported results can improve consumer electronics designs and user experience. Using a simpler codec can reduce complexity in mobile environments which can lead to lower power consumption [84] and better video quality [77].
A significant amount of HEVC quality evaluation has been limited to performance evaluation at high resolutions and high bitrates. An evaluation of candidates for HEVC standardization was conducted as a joint collaboration between the ISO and ITU video experts groups [94]. Test results showed that a 50% bit rate improvement over
H.264 can be achieved with the proposed coding schemes with similar mean opinion score (MOS) [52]. This evaluation study provided the groundwork that was needed to standardize HEVC.
Subjective comparison of HEVC and H.264 at higher bitrates and resolutions has
shown that HEVC outperforms H.264, yielding average bit-rate savings of 58% [93].
The objective and subjective results produced by this study confirmed that the goal of developing an HEVC video coding standard that delivers the same visual quality as H.264/MPEG-4 AVC high profile at only half the bit rate was accomplished.
However, the question of performance of HEVC over H.264 for low bitrate mobile
applications was unanswered.
Tan et al. reported subjective comparisons of HEVC and H.264 [54]. Tests
were conducted with HEVC high-efficiency (HE) and low-complexity (LC) combinations with rate distortion optimized quantization (RDOQ), adaptive loop filter
(ALF), and sample adaptive offset (SAO). Nine sequences, referred to as Class B and C sequences in JVT evaluations [52], were encoded with QP = 32 and 37 for random access and low delay. The class B sequences have a resolution of 1920x1080 and the class C sequences have a resolution of 832x480. The first test method is Double-Stimulus
Continuous Quality-Scale (DSCQS), except the observer is asked to make one of three choices. These are: (a) A is better than B, (b) B is better than A, and (c) A and B are
the same. The second test method is the Double Stimulus Impairment Scale (DSIS). Tests
compared HEVC coded video at half the bit rate of H.264. The bit rates for subjective
testing were varied from 500 kbps to 4,000 kbps.
Test results indicate the observers chose HEVC HE 56% to 83% of the time over
H.264 and HEVC with LC-RDOQ-SAO 58% to 75% of the time [54]. A 50% preference indicates the coding methods are viewed as having similar subjective quality; greater than 50% indicates a preference for HEVC over H.264.
Additional subjective comparisons were performed by the JCT-VC ad hoc group
comparing HEVC and a similarly configured H.264 encoder/decoder [55]. Tests were performed for class B and class C sequences. The subjective tests indicated a gross average bit rate reduction of 67% for class B sequences and 49% for class C sequences.
Subjective tests specifically targeting low-delay applications have been performed
for 720p and 1080p resolutions [56]. The HEVC and H.264 encoders were configured
for low-delay-P-main operation. An “IPPPP…” coding structure was used, which has
only one intra frame (the first frame of each sequence) for both encoders. These tests
compared H.264 encoding at twice the bit rate of HEVC. Results show HEVC encoded
sequences were favored 48% of the time for 720p sequences and 90% of the time for
1080p sequences. This clearly shows better HEVC performance at higher resolutions and bitrates.
Subjective tests for very high resolution videos (3840x1744) again show that HEVC outperforms H.264 [57]. The majority of bit rates evaluated were over 1,000 kbps. The test method was DSIS Variant II, with two 15-minute test sessions and a rest period in between. Test results convincingly show a bit rate reduction of over 50% is
achieved with HEVC over AVC for high resolution sequences.
Yamamura, Iwasaki and Matsuo reported subjective quality assessment for video
sequences with blocking artifacts [95]. The analysis included several coding standards, such as HEVC, H.264, and MPEG, with bit rates varying from 500 kbps to 2 Mbps. The analysis revealed that HEVC performed best with the periodic gaps caused by the blocking artifacts.
In summary, a significant amount of H.264 and HEVC subjective testing has been performed in high bit rate and high resolution environments (832x480 or higher).
However, HEVC has not been sufficiently examined to determine its performance at lower bit rates on lower resolution devices, which is a typical use case for low-end smartphones. Understanding the subjective test implications may allow an effective choice of codecs for mobile video services and other consumer electronics devices with video capabilities.
Preliminary results from the present study comparing HEVC and H.264 in mobile environment showed that H.264 can perform competitively at lower bitrates typical in mobile applications [3], [4].
5.2 Evaluation Methods
Mobile encoding bitrates and resolutions used within this study follow industry best practices from multimedia streaming service providers and guidelines referenced by content providers. The industry best practices define “higher quality and resolution” as 640x360
(16:9) at 400kbps and “medium quality” as 400x300 resolution at 200kbps [96]. A
combination of WiFi resolution (640x360) and higher bandwidth cellular (400 kbps) was
selected.
The encoder implementations used for the comparison are H.264 reference
software JM 18.3 and High Efficiency Video Coding (HEVC) version HM 6.0. H.264
was configured to closely mimic HEVC coding structure (based on HM-like
configurations available in JM 18.3). Only the first frame is an I-frame; the remaining
frames are coded as P frames. The key encoder settings used in HEVC are: intra period
is set to -1, largest coding unit size of 64x64, maximum depth of 4, fast search set to
EPZS, search range of 64, rate distortion optimization is enabled, internal bit depth is
eight, SAO is enabled, ALF is disabled, AMP is disabled. These settings are listed as
default settings within the HEVC version HM6.0 configuration file
“encoder_lowdelay_P_main.cfg”. Similar settings were used in previously reported
subjective evaluation studies [56].
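A sketch of these settings as they might appear in an HM-style configuration file. The key names shown are assumptions modeled on typical HM configuration files and should be checked against the actual “encoder_lowdelay_P_main.cfg” shipped with HM 6.0:

```
IntraPeriod       : -1   # -1: only the first frame is intra coded
MaxCUWidth        : 64   # largest coding unit is 64x64
MaxCUHeight       : 64
MaxPartitionDepth : 4    # CU depth up to 4 (64x64 down to 8x8)
FastSearch        : 1    # fast motion estimation enabled (EPZS per the text)
SearchRange       : 64
InternalBitDepth  : 8
SAO               : 1    # sample adaptive offset enabled
ALF               : 0    # adaptive loop filter disabled
AMP               : 0    # asymmetric motion partitions disabled
```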
The video sequences used were selected from the sequence set used during HEVC
development. The frame rates of video test sequences chosen are either 24, 30, 50, or 60
fps. All video test sequences were scaled and, if necessary, cropped to obtain videos at
640x360 resolution. The interpolation for resizing is performed with the 3-lobed Lanczos
Window Function. The interpolation algorithm uses source image intensities at 36 pixels in the neighborhood of the target pixel. The frames are then cropped to 640x360 by trimming the edges when needed.
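A minimal NumPy sketch of the 3-lobed Lanczos kernel (an illustration, not the study's actual resampler): with a = 3 the window spans six taps per axis, so separable 2-D interpolation draws on the 6 x 6 = 36 source pixels around each target pixel, matching the neighborhood described above.

```python
import numpy as np

def lanczos3(x):
    """3-lobed Lanczos window: sinc(x) * sinc(x/3) for |x| < 3, else 0."""
    x = np.asarray(x, dtype=float)
    out = np.sinc(x) * np.sinc(x / 3.0)   # np.sinc is the normalized sinc
    return np.where(np.abs(x) < 3.0, out, 0.0)

def lanczos3_weights(frac):
    """Weights for the 6 source samples around a target position with
    fractional offset `frac` in [0, 1); separable use in 2-D touches
    6 x 6 = 36 source pixels per target pixel."""
    offsets = np.arange(-2, 4)            # the 6 nearest integer taps
    w = lanczos3(offsets - frac)
    return w / w.sum()                    # normalize so weights sum to 1
```

For frac = 0 the weights collapse to a single unit tap at the center sample, as expected of an interpolating kernel.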
The eight sequences used in the experiment pool are: Basketball Drill,
Flowervase, Keiba, Kimono, Johnny, People on Street, Race Horses, and Traffic. The videos were selected to get a breadth of low motion to high motion as well as varying
number of people or objects in the video sequence. The video sequence properties are summarized in Table 6.
Table 6: Video sequence information
Video | Original Resolution | Test Resolution | Frames | FPS
Basketball Drill | 832x480 | 640x360 | 500 | 50
Flowervase | 832x480 | 640x360 | 300 | 30
Keiba | 832x480 | 640x360 | 300 | 30
Kimono | 1920x1080 | 640x360 | 240 | 24
Johnny | 1280x720 | 640x360 | 600 | 60
People on Street | 2560x1600 | 640x360 | 150 | 30
Race Horses | 832x480 | 640x360 | 300 | 30
Traffic | 2560x1600 | 640x360 | 150 | 30
The 640x360 video test sequences were encoded at various quality levels. For
H.264, QP values from 27 to 51 were used; for HEVC, QP values from 24 to 51. From this sweep, the video sequences with bit rates closest to the target bit rates of 400 and 200
Kbps were chosen for subjective evaluation. Also, care was taken to minimize the bit rate delta between the H.264 and HEVC video sequences in order to avoid unwanted subjective bias. See Table 7 and Table 8 for 200kbps and 400kbps bit rate results and delta, which is HEVC bitrate minus H.264 average bitrate. An example rate distortion curve for Basketball Drill is shown in Figure 25.
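The selection step described above can be sketched as follows; the (QP, bitrate, PSNR) tuple layout is a hypothetical structure used only for illustration.

```python
def closest_to_target(encodes, target_kbps):
    """From a QP sweep -- a list of (qp, bitrate_kbps, psnr_y) tuples --
    pick the encode whose bitrate is closest to the target bitrate."""
    return min(encodes, key=lambda e: abs(e[1] - target_kbps))

def bitrate_delta(hevc_encodes, h264_encodes, target_kbps):
    """Delta as reported in Tables 7 and 8: HEVC bitrate minus H.264
    bitrate for the encodes chosen at the same target."""
    hevc = closest_to_target(hevc_encodes, target_kbps)
    h264 = closest_to_target(h264_encodes, target_kbps)
    return hevc[1] - h264[1]
```

With the Table 7 numbers for Basketball Drill, the chosen pair yields a delta of 401.7 - 380.5 = 21.2 kbps.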
Table 7: 400 Kbps data for H.264 and HEVC

Test | PSNR(Y) (dB) | Bitrate (kbps) | ΔPSNR(Y) (dB) | ΔBitrate (kbps)
BBDrill-H264 | 29.5 | 380.5 | |
BBDrill-HEVC | 31.5 | 401.7 | 2 | 21.2
Flowervase-H.264 | 35.4 | 349.3 | |
Flowervase-HEVC | 37.6 | 386.1 | 2.2 | 36.8
Johnny-H264 | 39 | 398.7 | |
Johnny-HEVC | 39.4 | 399 | 0.4 | 0.3
Keiba-H.264 | 33 | 435 | |
Keiba-HEVC | 34.3 | 421.4 | 1.3 | -13.6
Kimono-H264 | 32.7 | 408.9 | |
Kimono-HEVC | 33.9 | 438.6 | 1.1 | 29.7
People-H.264 | 20.8 | 369 | |
People-HEVC | 21.2 | 353.2 | 0.4 | -15.8
RaceHorses-H.264 | 28.6 | 390.2 | |
RaceHorses-HEVC | 29.6 | 384.9 | 1 | -5.3
Traffic-H.264 | 27.3 | 387.7 | |
Traffic-HEVC | 28.3 | 355.6 | 1.1 | -32.1
Table 8: 200 Kbps data for H.264 and HEVC

Test | PSNR(Y) (dB) | Bitrate (kbps) | ΔPSNR(Y) (dB) | ΔBitrate (kbps)
BBDrill-H264 | 26.9 | 198.8 | |
BBDrill-HEVC | 29.4 | 201.6 | 2.5 | 2.8
Flowervase-H.264 | 33.3 | 204.7 | |
Flowervase-HEVC | 35.2 | 190.2 | 1.8 | -14.5
Johnny-H264 | 36.6 | 193.4 | |
Johnny-HEVC | 37.7 | 199.1 | 1.1 | 5.7
Keiba-H.264 | 29.7 | 211.1 | |
Keiba-HEVC | 31.6 | 205.1 | 1.9 | -6
Kimono-H264 | 30.4 | 198.4 | |
Kimono-HEVC | 31.2 | 180.9 | 0.8 | -17.5
People-H.264 | 19.2 | 203.1 | |
People-HEVC | 19.7 | 202.6 | 0.6 | -0.5
RaceHorses-H.264 | 26.4 | 188.2 | |
RaceHorses-HEVC | 27.6 | 198.6 | 1.3 | 10.4
Traffic-H.264 | 25.5 | 193.1 | |
Traffic-HEVC | 28.3 | 355.6 | 1.2 | -0.6
Figure 25: Basketball drill PSNR(Y) rate distortion curve
5.3 Experiments
Video sequences were shown on a 4.3” LCD with 480x272 resolution. This resolution represents the low-to-mid range of smart phone resolutions in the market. As noted earlier, the video sequences were encoded at 640x360, which is a recommended video encoding resolution [96]. Content providers typically encode video at a few resolutions, and a mismatch between display resolution and video resolution is common in mobile video services. The mobile device scales the video to fit its display.
The observer was approximately 12” to 18” from the display and the viewing angle was approximately 10 degrees (+/- 5 degrees) above normal, as shown in Figure 26. The LCD backlight luminance was approximately 45 cd/m2. Room lighting was roughly 800 cd/m2.
Twenty-five observers were used in the evaluation experiments. According to the ITU specification P.910, the possible number of observers in a viewing test can vary from 4 to 40. P.910 further specifies that at least 15 observers should participate in subjective testing to obtain reliable results [97]. The observers were 18 to 50 years of age and all were in good health with normal or corrected-to-normal vision.
Figure 26: Observer to LCD viewing definition
Subjective evaluations were conducted in accordance with the Double-Stimulus Impairment Scale (DSIS) Variant II as defined by ITU-R BT.500-13 [53]. The test methods were similar to the test conditions used in earlier HEVC evaluations [98], [3]. The double stimulus method is cyclic: the observer is presented with an unimpaired reference followed by the same video impaired. Variant II was chosen to allow the observer a second viewing of both video sequences. The presentation order is 3 seconds of solid grey video, 5 or 10 seconds of the reference sequence, 3 seconds of solid grey video, and 5 or 10 seconds of the impaired sequence; these four viewing events are repeated, and the voting cycle follows. Essentially, two video sequences are shown per test, with the reference sequence and the corresponding impaired sequence each shown twice. The observer was then asked to rate the quality.
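The presentation order described above can be sketched as a playlist generator (the clip names are placeholders; the study used 5- or 10-second clips):

```python
def dsis_variant2_playlist(ref, impaired, clip_seconds=10):
    """One DSIS Variant II trial as described in the text: grey,
    reference, grey, impaired -- shown twice -- then the voting cycle.
    Returns a list of (content, duration_seconds) steps."""
    cycle = [("grey", 3), (ref, clip_seconds),
             ("grey", 3), (impaired, clip_seconds)]
    return cycle * 2 + [("vote", None)]
```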
DSIS Variant II was used for the presentation structure of the test material. This allows the user two viewings of each video sequence (reference and impaired) before
subjective grading. Also, the video sequences were shown in random order to reduce observer bias. Grade scores are on a scale from 1 to 5, defined in Table 9.
A grade of one is poor (very annoying) and five is excellent (imperceptible), as defined by ITU-R BT.500-13.
Table 9: Subjective grading scale
Score | Definition
5 | imperceptible
4 | perceptible, but not annoying
3 | slightly annoying
2 | annoying
1 | very annoying
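The per-sequence statistics reported later in Tables 10 and 11 (maximum, minimum, and average MOS) follow directly from the collected 1-to-5 grades; a minimal sketch:

```python
def mos_summary(grades):
    """Max, min, and mean opinion score from a list of 1-5 grades,
    as reported per sequence in Tables 10 and 11."""
    if not all(1 <= g <= 5 for g in grades):
        raise ValueError("grades must be on the 1-5 ITU-R BT.500 scale")
    return max(grades), min(grades), sum(grades) / len(grades)
```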
5.4 Results
Subjective results are summarized in Table 10, Table 11, Figure 27, and Figure 28. In Figure 27 and Figure 28, the vertical axis is the mean opinion score (MOS) on a scale of 1 to 5. The horizontal axis shows each video sequence pair for H.264 and HEVC. Table 10 and Table 11 show the maximum, minimum, and average MOS for each video sequence. Data is listed in video sequence pairs. H.264 and HEVC were compared at the same bit rate (i.e., 400 kbps or 200 kbps). This lends the subjective results to a more realistic consumer scenario, where the available bandwidth is the same regardless of the coding standard used by the hardware.
In Figure 27 and Table 10, the 400 kbps MOS results showed 85% of grade scores were “5” or “4” for the impaired video sequence, indicating the observer feedback was “imperceptible” or “perceptible, but not annoying”, respectively. Observer feedback indicates the impaired video sequence is acceptable for the mobile environment, with its smaller screen size and lower bit rate. In Figure 28 and Table 11, as expected, the
200kbps MOS results showed a wider spread of the results. MOS variance between
HEVC and H.264 was much higher at this bitrate. For example, the Basketball Drill and
Race Horse video sequences, which had several MOS results in the 3 to 4 range within
400kbps results, had significantly lower results at 200 kbps. These sequences had several
200kbps results below a score of 3. Further analysis of these results is presented in the next section.
Figure 27: Mean opinion score (MOS) for 400 kbps bit rate
Table 10: Mean opinion score (MOS) for 400 kbps rate

Test | Max | Min | Average
BBDrill-H264 | 5 | 2 | 3.4
BBDrill-HEVC | 5 | 3 | 4.3
Flowervase-H.264 | 5 | 3 | 4.5
Flowervase-HEVC | 5 | 3 | 4.7
Johnny-H264 | 5 | 4 | 4.6
Johnny-HEVC | 5 | 3 | 4.6
Keiba-H.264 | 5 | 4 | 4.9
Keiba-HEVC | 5 | 4 | 4.7
Kimono-H264 | 5 | 4 | 4.7
Kimono-HEVC | 5 | 4 | 4.8
People-H.264 | 5 | 3 | 4
People-HEVC | 5 | 3 | 4.1
RaceHorses-H.264 | 5 | 3 | 3.7
RaceHorses-HEVC | 5 | 4 | 4.6
Traffic-H.264 | 5 | 3 | 4.5
Traffic-HEVC | 5 | 3 | 4.5
Figure 28: Mean opinion score (MOS) for 200 kbps bit rate (graph)
Table 11: Mean opinion score (MOS) for 200 kbps rate

Test | Max | Min | Average
BBDrill-H264 | 4 | 1 | 2.1
BBDrill-HEVC | 4 | 2 | 3.4
Flowervase-H.264 | 5 | 3 | 4.1
Flowervase-HEVC | 5 | 3 | 4.4
Johnny-H264 | 5 | 2 | 4.4
Johnny-HEVC | 5 | 4 | 4.7
Keiba-H.264 | 5 | 1 | 4.3
Keiba-HEVC | 5 | 1 | 4.5
Kimono-H264 | 5 | 3 | 4.2
Kimono-HEVC | 5 | 2 | 4.2
People-H.264 | 4 | 1 | 2.1
People-HEVC | 4 | 1 | 2.5
RaceHorses-H.264 | 3 | 1 | 2
RaceHorses-HEVC | 5 | 2 | 3.3
Traffic-H.264 | 5 | 3 | 3.7
Traffic-HEVC | 5 | 2 | 3.9
5.5 Discussion
Performance of HEVC and H.264 can be evaluated by comparing the difference of MOS values (ΔMOS) against the difference of the corresponding PSNR values (ΔPSNR). This comparison is shown in Figure 29. Positive values show HEVC has better performance; negative values show H.264 has better performance. The upper right quadrant shows HEVC to be superior in both MOS and PSNR(Y). The lower left quadrant shows H.264 to be superior in both MOS and PSNR(Y). In the upper left quadrant, HEVC has the better MOS and H.264 the better PSNR(Y); the opposite is true for the lower right quadrant.
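The quadrant reading of Figure 29 can be expressed as a small helper; the labels are paraphrases of the quadrant descriptions above.

```python
def quadrant(d_mos, d_psnr):
    """Classify a (dMOS, dPSNR) point as in Figure 29; positive deltas
    favor HEVC, negative deltas favor H.264."""
    if d_mos >= 0 and d_psnr >= 0:
        return "HEVC better in both MOS and PSNR"
    if d_mos < 0 and d_psnr < 0:
        return "H.264 better in both MOS and PSNR"
    if d_mos >= 0:
        return "HEVC better MOS, H.264 better PSNR"
    return "H.264 better MOS, HEVC better PSNR"
```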
As shown in Figure 29, for the same PSNR difference, the MOS difference varies
with content. A typical PSNR(Y) improvement of 0.5dB to 2dB is observed for HEVC over H.264 for all video sequences. This is an expected result with HEVC always
resulting in better compression performance. However, the MOS ratings show that the subjective quality of HEVC is not always better than that of H.264. The MOS ratings show several instances where HEVC and H.264 are considered essentially equal or very close in terms of subjective feedback. A common claim about HEVC performance is that
HEVC produces equivalent quality video at about half the bitrate of H.264 [55], [56],
[57]. To verify this claim, HEVC at 200 kbps was compared against H.264 at 400 kbps. As shown in Figure 29, at these lower bit rates H.264 at 400 kbps is clearly superior to HEVC at 200 kbps in both PSNR and subjective quality.
Figure 29: ΔPSNR(Y) (dB) vs. ΔMOS
The following discussion concerns HEVC to H.264 comparisons at the same bit rate.
Subjective quality is a function of the PSNR of the decoded video and the content of the
video. Large PSNR difference may not produce a large difference in subjective quality.
For example, the “Flowervase” and “Keiba” video sequences yielded a PSNR difference of approximately 1.3 dB to 2.2 dB, with HEVC yielding the higher PSNR(Y) compared to H.264. However, the MOS results indicate subjective performance is about the same, with a MOS difference of 0.3 or less. In fact, the observers gave H.264 a higher average MOS for the “Keiba” 400 kbps video sequence. Informal discussion with the observers suggests a HEVC to H.264 MOS difference of 0.3 or less will not cause the
observer to prefer one impaired video sequence over the other.
Also, “People On Street” and “Traffic” video sequences show a difference of
HEVC PSNR(Y) over H.264 PSNR(Y) of approximately 0.4dB to 1.2dB. The MOS
results indicate subjective performance is about the same with a MOS difference of 0.30
or less for three of the four video bitrates from “People On Street” and “Traffic” video
sequences. This indicates the observer will tend not to have a preference between the HEVC impaired video sequence and the H.264 impaired video sequence. The fourth video bitrate
(“People On Street” at 200 kbps) had a MOS difference of 0.55; observers slightly favored the HEVC encoded video sequence.
“Kimono” video sequences had a delta of HEVC PSNR(Y) over H.264 PSNR(Y) of approximately 1.1 dB or less. The MOS difference between HEVC and H.264 is just under
0.4, where HEVC has the higher rating. Observers rated 200kbps H.264 MOS results as
higher quality. MOS results suggest the observer is likely to find the subjective
difference predominantly acceptable. However, this observation was made through
informal dialog with the observers and further study is warranted before making stronger
claims.
Two video sequences for “Basketball Drill” and “Race Horses” have an HEVC
PSNR(Y) over H.264 PSNR(Y) difference of approximately 1.0dB to 2.0dB, which is in
line with all the sequences evaluated. The MOS differential is between 0.5 and 1.2,
which is slightly greater than for the other video sequences, where the delta is less than 0.4 for
five video sequences and just under 0.5 for one video sequence.
Informal observer comments suggest an observer preference for HEVC impaired
video sequence over H.264 impaired video sequence. Both “Basketball Drill” and “Race
Horses” video sequences have motion in large contiguous areas that are relatively
uniform (either the same color and/or pattern). For “Race Horses”, the main continuous color is the horse's coat, which is affected by changes in coat color as the horse moves.
A frame from this video sequence is shown in Figure 30. For “Basketball Drill”, the
floor is a repeating pattern that gets affected by the basketball players’ shadows as the
players move about the basketball court.
An interesting note is that the “Keiba” video sequence has significant large contiguous areas affected, namely the tree trunks and branches in the video foreground, as shown in Figure 31. However, no observer commented on this as a problem area. The belief is that this is because the “point of attention” is on the horse rider, so the video coding artifacts in the tree area go unnoticed.
The main comment supporting a preference for HEVC was that continuous patterns or colors looked significantly worse in H.264 encoded video, while HEVC showed a milder effect. For example, in the Race Horses video sequence, the horse's brown coat did not look proper when in motion, as shown in Figure 30. But the grass in the background did not bother observers and was not commented on as a quality issue. For Basketball
Drill, the floor showed many blocking artifacts for H.264 when the players’ shadows moved across the basketball court.
Figure 30: Race horses sequence. Horse coat color is bothersome to observer
Figure 31: Foreground tree detail loss not a concern in Keiba sequence
Table 12 shows, for each video sequence, the observer preference, the number of points of attention (POA) [3], and the ΔMOS (HEVC MOS – H.264 MOS) at 400 kbps and 200 kbps.
Table 12: Video sequences preference

Video | Preference | POA | ΔMOS 400kbps | ΔMOS 200kbps
Basketball Drill | HEVC | 5 | 0.67 | 1.11
Flowervase | None | 1 | 0.17 | 0.29
Keiba | None | 1 | 0 | 0.44
Kimono | None | 1 | -0.14 | 0.25
Johnny | None | 1 | 0.33 | -0.33
People on Street | None/HEVC | 10+ | 0.08 | 0.55
Race Horses | HEVC | 4 | 0.86 | 1.18
Traffic | None | 10+ | -0.09 | 0
Analysis of the experimental results shows that subjective quality assessment is influenced by content. Specifically, the number of points of visual attention, the spatial complexity, and the temporal complexity of the content have a direct influence on the quality of user experience. To begin understanding these influences, the notions of temporal information (Ti) and spatial information (Si), as defined by the ITU specification P.910 [97], are used. The Ti value for a video sequence gives a measure of temporal changes in a video; videos with large motion have a large Ti value. Similarly, Si gives a measure of spatial complexity and is based on the number of edges in each frame of a video. Points of visual attention were determined empirically. Figure 32 and Figure 33 recast the data into graphs of HEVC MOS versus points of attention, with ΔMOS shown, for video sequences with temporal information greater than 10 [3]. The temporal information and spatial information (TiSi) plot is shown in Figure 34. A clearer picture emerges of the expected H.264 and HEVC MOS relationships. The size of the circle represents the ΔMOS; a larger circle indicates a larger ΔMOS, i.e., observers gave a higher MOS value to videos coded with HEVC compared to H.264. The videos with “1 to 3” and “8+” points of attention have smaller ΔMOS, while the ΔMOS values in the “4 to 7” region are high. The MOS values in the “8+” region are high at 400kbps, as shown in Figure 32, and mixed at 200kbps, as shown in Figure 33. As indicated earlier, observers rate the encoders as roughly equivalent for the video sequences that fall within the “1 to 3” and “8+” regions. The region with “4 to 7” points of attention is where the larger ΔMOS occurs.
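The P.910 measures referenced above can be sketched with NumPy. This is a minimal interpretation (border handling and any clipping are simplified): Si is taken as the maximum over frames of the spatial standard deviation of the Sobel-filtered luma plane, and Ti as the maximum over time of the spatial standard deviation of successive frame differences.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)

def _filter3(img, k):
    """Tiny 'valid' 3x3 filtering (sign convention is immaterial for
    the gradient magnitude used below)."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(3):
        for j in range(3):
            out += k[i, j] * img[i:i + h - 2, j:j + w - 2]
    return out

def spatial_information(frames):
    """Si: max over frames of the spatial std-dev of the Sobel
    gradient magnitude of the luma plane."""
    vals = []
    for f in frames:
        gx = _filter3(f, SOBEL_X)
        gy = _filter3(f, SOBEL_X.T)
        vals.append(np.hypot(gx, gy).std())
    return max(vals)

def temporal_information(frames):
    """Ti: max over time of the spatial std-dev of the frame
    difference F(n) - F(n-1)."""
    return max((b - a).std() for a, b in zip(frames, frames[1:]))
```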
Figure 32: MOS vs. POA with ΔMOS bubble @ 400kbps. Ti > 10
Figure 33: MOS vs. POA with ΔMOS bubble @ 200kbps. Ti > 10
Figure 34: TiSi Plot
Video sequences with few points of attention tend to have higher MOS results for both H.264 and HEVC, with little difference between MOS averages. This is shown by the MOS results for the video sequences Keiba, Kimono, Flowervase, and Johnny in Figure 32 and Figure 33. Increased visual artifacts tend not to impact observers' scores.
Applications such as video conferencing have fewer points of attention, and both H.264 and HEVC result in equivalent quality at these lower bitrates.
5.6 Concluding Remarks
For the mobile conditions used in the study, i.e., a 4.3” screen size and average bit rates of 200 kbps and 400 kbps, the user (observer) experience is not significantly different when comparing H.264 and HEVC compressed video sequences.
Standard-compliant subjective evaluations with 25 observers show that both encoding methods are adequate in low bitrate mobile environments. Results show that content dependencies affect perceived quality; specifically, the number of points of visual attention influences user experience.
6 HEVC DECISION OPTIMIZATION
This chapter presents a model and approach to select an efficient set of HEVC
encoding options for mobile devices. The main goal is to reduce the encoding
complexity without significantly affecting the quality of video conferencing applications.
Video target bit rates from 100kbps to 600kbps were used within this study.
Experimental results show that by carefully selecting the coding unit size, coding unit depth, and, to a lesser degree, transform unit size, the encoder computational complexity can be reduced at the target bit rates while keeping the additional PSNR loss within an allowable bound. Results show a 36.5% complexity reduction on average with a negligible PSNR loss of less than 0.25 dB.
6.1 Background
High Efficiency Video Coding (HEVC) is the latest video coding standard
finalized in January 2013 by ISO and ITU’s Joint Collaborative Team on Video Coding
(JCT-VC). This chapter’s working environment is for HEVC video conferencing
applications in the smart phone market segment with cellular bit rates, such as 200kbps
and 400kbps. The primary focus is to reduce coding complexity, in terms of time
savings, while maintaining high quality video.
The HEVC standard has adopted three profiles for a wide range of services, such as broadcast, mobile communications, and video streaming. Recent assessments show that HEVC can achieve equivalent subjective quality to H.264/MPEG-4 AVC with 50% less bit rate [21], by introducing new tools like the block partitioning structure [22] and variable block-size prediction and transform coding [23], among others.
Undoubtedly, an effective tool included in HEVC is the new Coding Tree Block
(CTB) partitioning structure. This can be thought of as a more generic version of the macroblock (MB) coding unit used in previous standards. The square CTB is restricted to a maximum size of 64x64 pixels and can be split into smaller quad-blocks called Coding Units (CU), with a minimum allowed size of 8x8. This means the maximum allowed CU depth is 4.
The transform coding also uses a quadtree structure called Residual QuadTree
(RQT), splitting a Transform Unit (TU) into smaller transform units and limiting its size to a maximum of 32x32 and a minimum of 4x4. A 2Nx2N CU can use a maximum TU depth of 3, taking 2Nx2N, NxN, and N/2xN/2 sizes.
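The partitioning limits described above can be enumerated; an illustrative sketch, not HM code.

```python
def cu_sizes(lcu=64, max_depth=4):
    """CU sizes reachable from a given LCU: a 64x64 CTB with maximum
    depth 4 yields 64, 32, 16, and 8 (the minimum allowed CU size)."""
    return [lcu >> d for d in range(max_depth)]

def tu_sizes(cu_size, max_tu_depth=3):
    """Distinct TU sizes for a 2Nx2N CU with maximum TU depth 3
    (2Nx2N, NxN, N/2xN/2), clamped to the RQT limits of 32x32 and 4x4."""
    sizes = {max(4, min(32, cu_size >> d)) for d in range(max_tu_depth)}
    return sorted(sizes, reverse=True)
```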
HEVC encoder and decoder complexity assessment has been a research topic in [85], [99], [100], and [60], where the different HEVC tools are analyzed in terms of performance and computational complexity.
As with H.264, HEVC uses a well-known Rate Distortion Optimization model
(RDO) [69] to achieve the best coding efficiency. RDO reaches the optimal partitioning by evaluating all CU sizes, prediction unit (PU) sizes, and TU sizes for each CU-PU combination. This approach greatly increases the computational complexity of encoders, making real-time encoding implementations more difficult, especially for portable and mobile devices where power consumption is one of the key factors. To reduce the RDO complexity, fast algorithms have been proposed that focus on reducing the number of coding blocks (CB), prediction blocks (PB), and TU sizes to evaluate.
Schwartz et al. show the effects of constraining the Largest Coding Unit (LCU) for the Random Access (RA) and Low Delay (LD) configurations of the main profile [21]. Decreasing the LCU from 64x64 to 32x32 realizes 18% and 17% time savings with small bit rate penalties of 2.2% and 3.7%, respectively. On the other hand, using an LCU of 16x16 instead of 64x64 reduces encoding time by 42%, but with bit rate increases of 11% and 17.4%, respectively. Also, reducing the maximum RQT depth from 3 to 2 yields a time saving of 10% with a slight bit rate increase of 0.3% for RA and 0.4% for LD configurations.
Finally, a complexity control algorithm based on coding tree depth selection for the CTB is proposed in [21], which obtains a computational complexity reduction of 40% with a bit rate increase of 3.5%. In [91], dynamic adjustment of the maximum CU depth is proposed; testing showed a 40% complexity reduction, a 0.1 dB PSNR loss, and a 3% bit rate increase.
6.2 Method
The mobile device working environment within this research was based on guidelines from multimedia streaming and content providers. The content is defined as “higher quality” at 640x360 resolution and 400 kbps and “medium quality” at 400x300 resolution and 200 kbps [101], [102]. We chose a superset of the recommended target bit rates in order to observe video quality over a range of bitrates. The bit rates within this study are 100, 200, 300, 400, and 600 kbps. We also used several video resolutions: the lower resolutions made available by JCT-VC and defined in [103]. The HEVC video sequences chosen for the experiments were
from Class C, D, and E as defined by JCT-VC. Video sequences used for the
experiments are defined in Table 13.
Table 13: Video sequence definition
Class | Video Sequence | Resolution | FPS | Frames
C | BasketballDrill | 832x480 | 50 | 500
C | BQMall | 832x480 | 60 | 600
C | PartyScene | 832x480 | 50 | 500
C | RaceHorses | 832x480 | 30 | 300
D | BasketballPass | 416x240 | 50 | 500
D | BlowingBubbles | 416x240 | 50 | 500
D | BQSquare | 416x240 | 60 | 600
D | RaceHorses | 416x240 | 30 | 300
E | FourPeople | 1280x720 | 60 | 600
E | Johnny | 1280x720 | 60 | 600
E | KristenAndSara | 1280x720 | 60 | 600
The study’s experiments used HEVC encoder reference software version HM8.0
[104] with the Main Profile and Low-delay configuration as the baseline. This configuration is meant for real-time personal video communications and uses the previous frame for motion estimation. The common JCT-VC test conditions described in
[105] were used with the aim of analyzing the HEVC performance. The input variables
“Bit Rate”, “Largest CU (LCU) size”, “CU-depth”, and “TU-inter-depth” were modified
between tests.
The HM8.0 source code was modified to output additional CU and TU data that
was used for analysis. The collected data included the CU mode (i.e., Intra or Inter), LCU, CU-depth, and TU-inter-depth for each CU block within every frame of the video sequences. A total of 1170 video sequence experiments were run during the study, and this data was used for analysis. Allowances were made to run experiments across multiple computers; however, all runs for a given video sequence occurred on the same computer so that encoding times remained directly comparable.
Each sequence was encoded with a baseline configuration (all options on) and with additional configurations in which, for a given bit rate, the LCU, CU-depth, and TU-inter-depth were varied. All video sequences and input variable combinations are as follows:
- LCU: 64x64, 32x32 and 16x16
- CU-depth: 4, 3, 2, 1
- TU-inter-depth: 3, 2, 1
- Bit Rate: 100, 200, 300, 400, and 600 kbps
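The parameter space above can be enumerated as a simple Cartesian product. The sketch below is illustrative only (the names are mine, not the study's); note that the full product is a superset of what was encoded, since the study reports 1170 runs in total and some combinations were presumably excluded.

```python
from itertools import product

# Input variables exactly as listed above.
LCU_SIZES = (64, 32, 16)
CU_DEPTHS = (4, 3, 2, 1)
TU_INTER_DEPTHS = (3, 2, 1)
BITRATES_KBPS = (100, 200, 300, 400, 600)

def test_matrix():
    """Every (LCU, CU-depth, TU-inter-depth, bitrate) combination."""
    return list(product(LCU_SIZES, CU_DEPTHS, TU_INTER_DEPTHS, BITRATES_KBPS))

configs = test_matrix()   # 3 * 4 * 3 * 5 = 180 combinations per sequence
```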
All videos were 10 seconds long with frame rates of 30, 50, or 60
FPS. Each frame had 66 data elements tracked, yielding over 37 million data elements to review and analyze. The key outputs observed for performance analysis were the PSNR difference, identified as Δ PSNR, and the encoding time difference, identified as Time-savings, between the baseline configuration and each of the experimental configurations.
6.3 Prediction Modeling
To assist with data analysis and create a prediction model, the Waikato
Environment for Knowledge Analysis (WEKA) Version 3.6.4 [106] was used for data mining the results of the experiments. WEKA is an effective software tool for sifting through large data sets and extracting the data relevant to the chosen prediction model.
The datasets used for prediction model training were chosen as one dataset per class. All remaining video sequence datasets were used for testing. From these experiments, the Time-savings and Δ PSNR were recorded and used as the desired outputs for the equations to be determined. Time-savings and Δ PSNR are discussed in more detail later in this report.
The WEKA software tool’s analysis classifier was set to “linear regression” for
prediction, and the attribute selection method was set to “No attribute selection”, which allows all inputs to be used in the modeling equation; therefore, no pruning was attempted to simplify the equations. The attributes used for the prediction model were resolution, bit rate, LCU, CU-depth, and TU-inter-depth. Attributes such as “number of frames” and “frame rate” were removed from the attribute pool because they are irrelevant to the calculations. The remaining tool configurations were left at default settings.
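WEKA itself is a Java tool, but the configuration described above (linear regression with attribute selection disabled) is equivalent to ordinary least squares over all attributes. The following sketch reproduces that behavior with numpy on synthetic data; the function name and the synthetic coefficients are illustrative, not taken from the study.

```python
import numpy as np

def fit_linear_model(X, y):
    """Ordinary least squares over *all* attributes plus an intercept --
    the numpy analogue of WEKA's LinearRegression classifier when
    attribute selection is disabled (no pruning of inputs)."""
    A = np.column_stack([X, np.ones(len(X))])   # append intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef   # attribute weights followed by the intercept

# Synthetic sanity check: data generated from a known linear relationship
# over five attributes (standing in for resolution height, bit rate, LCU,
# CU-depth, TU-inter-depth) is recovered exactly.
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 100.0, size=(20, 5))
true_w = np.array([-0.002, -0.0003, 0.03, 0.15, 0.003])
y = X @ true_w - 1.26
w = fit_linear_model(X, y)
```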
6.4 Results
Data was gathered for all video sequences. Not all of the data gathered within the study can be presented in this report; however, representative examples are used to explain the underlying concepts.
6.4.1 Data Reading Primer
For the rest of this document, “LCU” / “CU-depth” / “TU-inter-depth” values are written as numbers separated by slashes. For example, 64/4/3 represents LCU = 64x64, CU-depth = 4, TU-inter-depth = 3.
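This shorthand is used throughout the remaining tables and figures. For concreteness, a minimal parser for it might look as follows (the helper name is mine, purely illustrative):

```python
def parse_config(label: str) -> dict:
    """Parse the chapter's shorthand, e.g. '64/4/3' meaning LCU = 64x64,
    CU-depth = 4, TU-inter-depth = 3."""
    lcu, cu_depth, tu_depth = (int(part) for part in label.split("/"))
    return {"LCU": f"{lcu}x{lcu}",
            "CU-depth": cu_depth,
            "TU-inter-depth": tu_depth}
```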
The graph shown in Figure 35 plots Time-savings vs. PSNR for the
KristenAndSara video sequence at a bit rate of 100 kbps. Each point is labeled “LCU”/“CU-depth”/“TU-inter-depth”. The curve within the graph outlines all points and represents the effective trade-off between PSNR and Time-savings; the closer a point is to the curve, the better the trade-off it offers.
(Annotated points in the figure: Max-quality, Preferred location, Left-option, Right-option, Min-quality)
Figure 35: Kristen and Sara 100kbps time-savings vs. PSNR
The most beneficial location is the upper right corner, which has maximum Time-savings and minimum PSNR loss. This area is labeled “Preferred location” in the graph. Within the data set, the best PSNR loss is 0 dB, held by the 64/4/3 data point; it is shown with an arrow in the upper left corner and identified as Max-quality. As expected, this encoded sequence has the longest encoding time, which leads to 0% Time-savings. Moving along the outline curve there are two points, defined as the “Left-option” and “Right-option” points. These are the optimal points as determined by us through a perceptual video quality assessment, which consisted of subjectively reviewing each sequence to determine the preferred video sequence. In instances where the subjective review returned the same result, the selection went to the sequence that yielded the higher PSNR x Time-savings value.
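The tie-break rule just described can be sketched in a few lines. The function name and the candidate values below are hypothetical, chosen only to illustrate the selection criterion:

```python
def pick_option(candidates):
    """Tie-break rule from the study: when subjective review ranks two
    encodings equally, prefer the larger PSNR x Time-savings product.
    Candidates are (label, psnr_db, time_savings) tuples."""
    return max(candidates, key=lambda c: c[1] * c[2])

# Hypothetical tie between two configurations at the same subjective rank:
tied = [("64/2/1", 30.2, 0.25), ("32/1/1", 29.1, 0.45)]
best = pick_option(tied)   # 29.1 * 0.45 > 30.2 * 0.25, so "32/1/1" wins
```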
From this effort, the optimal points consisted mainly of the minimum CU-depth and TU-inter-depth for LCUs of 64x64 and 32x32. Since the majority of the optimal cases were clearly these points, we simplified the option point selection by fixing it to the minimal configuration for LCUs of 64x64 and 32x32. The data point on the lower right has the most restrictive settings and is identified as Min-quality. As expected, this point contends for the best Time-savings and the largest PSNR loss, although there are frequent instances where this is not the case, as shown in Figure 35.
6.5 Data Analysis
The graph in Figure 35 is a typical representation of the data gathered from the
video sequences within this study. The Left-option point’s Time-savings is
approximately 25% with a Δ PSNR of -0.07 dB. The Right-option point has a Time-savings
of approximately 45% with a Δ PSNR of -1.03 dB.
The Left-option and Right-option points vary depending on resolution. For 832 x
480 (Class C), the two points are 64/2/1 and 32/1/1. For 1280x720 (Class E) and
416x240 (Class D), the two points are 64/3/1 and 32/2/1. Table 14 shows the Δ PSNR and Time-savings for Class C, D, and E. The C, D, and E average is shown as “All”.
Further explanation of the table is given later in this section.
Table 14: Class C, D, E, and All video sequence average Δ PSNR and Time-savings

Class  Description       Max.   Left-option  Right-option  Min.
C      Δ PSNR (dB)       0      -0.31        -0.59         -1.36
C      Time-savings (%)  0.0%   41.8%        58.1%         55.6%
D      Δ PSNR (dB)       0      -0.21        -0.37         -0.70
D      Time-savings (%)  0.0%   24.2%        36.4%         56.4%
E      Δ PSNR (dB)       0      -0.07        -1.03         -3.26
E      Time-savings (%)  0.0%   23.6%        45.6%         67.2%
All    Δ PSNR (dB)       0      -0.25        -0.72         -1.82
All    Time-savings (%)  0.0%   36.5%        55.4%         69.2%
Video sequences matching the points from Figure 35 were evaluated by us. A
frame for each graph point is shown in Figure 36 and Figure 37. The example chosen is KristenAndSara from the Class E video sequences. The Max-quality video, which is the reference video sequence, is itself impaired: loss of detail is observed in the left-side picture of Figure 36, which is 64/4/3. However, compared to the Left-option point (64/3/1, on the right side), the additional degradation in the picture is minimal. The Right-option point for comparison is the left-side picture of Figure 37, which is 32/2/1. More degradation is noticed when compared to the reference (i.e., Max-quality) video sequence. Depending on user and environment needs, this may be acceptable and will yield additional encoding time savings.
(Left panel: 64x64/4/3. Right panel: 64x64/3/1.)
Figure 36: Kristen and Sara sequence, 100kbps. Max-quality and left-option points
(Left panel: 32x32/2/1. Right panel: 16x16/1/1.)
Figure 37: Kristen and Sara sequence, 100kbps. Right-option and min-quality points
Table 14 shows the averages for Δ PSNR and Time-savings. As noted earlier, the
64/4/3 data point has a Δ PSNR of 0 dB and Time-savings of 0%; this reference point
represents the Max-quality option. All graphs are normalized in this way for
consistency and clarity in the data analysis. As expected, the PSNR loss increased as the encoder options were reduced; encoder flexibility was limited by reducing the LCU, CU-depth, or TU-inter-depth options.
The major contributors affecting Time-savings are either LCU or CU-depth, depending on resolution: CU-depth changes have more impact at smaller resolutions, while LCU changes have a bigger impact at larger resolutions. This becomes apparent when the test results for each sequence are ordered by Time-savings: at smaller resolutions the ordering lines up with the CU-depth selections, and at larger resolutions it lines up with the LCU selections. This is an observation only; the detail is not shown within this report.
For a given video sequence, the effect of bitrate on Time-savings and Δ PSNR is
shown in Figure 38 and Table 15. As expected, the reference video sequence’s (i.e., Max-quality) absolute PSNR improves with bitrate, while tool option changes have a greater impact on Δ PSNR. All video sequences in the study showed a similar shift.
(Diamonds: 100 kbps. Squares: 600 kbps.)
Figure 38: Kristen and Sara 100kbps to 600kbps shift for time-savings vs. Δ PSNR
Table 15: Kristen and Sara 64/4/3 PSNR and Δ PSNR for 100 and 600 kbps
                                  --------------- Δ PSNR (dB) ---------------
Bitrate (kbps)  64/4/3 PSNR (dB)  Max.   Left-option  Right-option  Min.
100             27.35             0      -0.10        -1.00         -2.83
600             33.63             0      -0.12        -0.93         -3.44
6.5.1 Prediction Model Analysis
The data analysis showed that the video sequence data relate fairly well to each
other when sorted by resolution size. However, all resolutions combined, as
shown in the “All” rows of Table 14, will be the comparison point for the predictive model
explored in this chapter. There are content dependencies not evaluated within this
chapter, as noted within this section.
From the data set gathered in the experiments, a data mining exercise was carried
out using WEKA Version 3.6.4 in order to generate a prediction model. Such a model could
potentially assist in selecting the optimal settings to maximize Time-savings while
staying within the allowable PSNR for a given user-defined environment. The prediction model equations are shown in equations (2) and (3). The equations are discussed in the form of tables and graphs that correlate with the results.
Δ PSNR = (A*Height + B*bit rate + C*LCU + D*CU-depth + E*TU-inter-depth + F) / G    (2)

where A = -2132, B = -316, C = 29834, D = 156664, E = 3339, F = -1261163, G = 1,000,000

Time-savings = (H*Height + I*bit rate - J*LCU - K*CU-depth - L*TU-inter-depth + M) / G    (3)

where H = -1.52, I = 41, J = -769, K = -203991, L = -31219, M = 949434, G = 1,000,000
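Equations (2) and (3) transcribe directly into code. The sketch below reproduces the coefficients and signs exactly as printed, with the normalization the study applies to its graphs (shifting the 64/4/3 point to zero); the function names and the sample inputs are mine.

```python
# Prediction models from equations (2) and (3), transcribed verbatim.
# "height" is the video height in pixels (e.g. 720 for Class E),
# "bitrate" in kbps, "lcu" the LCU edge length.
G = 1_000_000

def delta_psnr(height, bitrate, lcu, cu_depth, tu_depth):
    A, B, C, D, E, F = -2132, -316, 29834, 156664, 3339, -1261163
    return (A*height + B*bitrate + C*lcu + D*cu_depth + E*tu_depth + F) / G

def time_savings(height, bitrate, lcu, cu_depth, tu_depth):
    H, I, J, K, L, M = -1.52, 41, -769, -203991, -31219, 949434
    return (H*height + I*bitrate - J*lcu - K*cu_depth - L*tu_depth + M) / G

def normalized(model, height, bitrate, lcu, cu_depth, tu_depth):
    """Shift estimates so the Max-quality point (64/4/3) maps to zero,
    matching the normalization applied to the study's graphs."""
    return (model(height, bitrate, lcu, cu_depth, tu_depth)
            - model(height, bitrate, 64, 4, 3))

# Raw (un-normalized) model output for 720p at 100 kbps, config 64/4/3:
raw = delta_psnr(720, 100, 64, 4, 3)   # -0.281754
```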
Using the WEKA-derived equations, the Left-option and Right-option points are confirmed as valid, as shown in Figure 39; the arrows identify the same points as in Figure 35. The
Time-savings from the WEKA graphs are an accurate representation. The Δ PSNR has more variability when compared to Figure 35, but the PSNR points relative to one another within the WEKA graph are representative. To obtain a better correlation to the actual data, the WEKA-generated Δ PSNR and Time-savings graph is normalized by shifting the
64/4/3 data point to (0, 0); all other points are shifted by the same amount. Given this, the shape of the points within the WEKA model can be used to determine the optimal points.
Figure 39: Kristen and Sara 100kbps time-savings vs PSNR WEKA derived
From the WEKA-derived equations, the video sequences’ Δ PSNR and Time-savings were calculated and averaged over all classes. Results are shown in Table 16, normalized to the Max-quality data point (i.e., 64/4/3).
Table 16: Video sequences WEKA estimate for Δ PSNR and Time-savings

Description       Max.   Left-option  Right-option  Min.
Δ PSNR (dB)       0.00   -0.22        -1.33         -1.91
Time-savings (%)  0.0%   34.1%        56.9%         71.1%
A comparison between the actual values and the WEKA-derived results is shown in Table
17. Results from Table 14 are referred to as “Actual” and results from Table 16 as “WEKA”.
The “Actual” versus “WEKA” Δ PSNR comparison shows a difference no greater than 0.61 dB for the optimal points, as shown in Table 17, indicating that WEKA provides very good estimates. The Time-savings comparison between “Actual” and “WEKA” shows even better results: the worst case for the optimal points in any class is a 2.4% difference.
The correlation between the “Actual” and “WEKA” values was derived, and the results are shown in Table 18. The data shows that the WEKA-derived model has an excellent correlation for encoding time savings. Therefore, using the prediction model, the developer can confidently estimate the encoding-time-savings changes obtained by tweaking
LCU, CU-depth, and TU-inter-depth to suit user and environment needs. The Δ PSNR correlation of 0.77 or higher indicates the WEKA model is a fairly good indicator for Δ PSNR; the Δ PSNR variation is mainly due to bit rate changes and video content dependencies.
By using the derived models, the developer can determine the impact that video encoding changes would have on video quality (via PSNR) and on encoding time savings. In addition, the developer can update the model during video encoding to use additional parameters derived from previous frames. With this additional real-time data, the model’s quality improves and yields more accurate results: the correlation improves to 0.95 for PSNR and 0.99 for encoding time savings when the block types of previous frames are used within the prediction model.
Table 17: Video sequences actual vs. WEKA difference

Description         Max.   Left-option  Right-option  Min.
Δ PSNR diff. (dB)   0.00   -0.03        0.61          0.09
Time-savings diff.  0.0%   -2.4%        1.5%          1.9%
Table 18: Correlation between actual and WEKA results

Description         Correlation
Δ PSNR diff.        0.77
Time-savings diff.  0.98
6.6 Application
With the concepts within this study, the developer can estimate the PSNR under option changes. A no-reference PSNR estimator, as presented in [107], can be used as a
“coarse” PSNR estimate; the PSNR calculated by the prediction model then “fine tunes” the absolute PSNR value for the projected video sequence. The developer can thus make informed video quality decisions for the transmitted video sequence. Using the prediction model, the developer can also estimate how heavily the encoders will be taxed in terms of encoding time. Note that each encoder implementation will yield different formulas, so the method defined within this study must be re-applied to determine a given system’s performance; this needs to be characterized only once for the hardware and software setup of the system.
Predictive modeling software such as WEKA can be used effectively to estimate the Δ PSNR and encoding-time-savings (i.e., Time-savings) changes from the base, which is encoded video with all the desired tool options enabled. With this data, the developer can effectively manage video encoding for the user and environment, such as under transmission bandwidth limitations. Video conferencing environments will benefit greatly from the modeling, which gives the developer a reliable method to estimate PSNR loss changes and encoding time savings gains. This allows the developer to maximize video quality (based on the PSNR loss estimate) and finely tune to the available bandwidth (using the encoding time savings estimate). The model derived within the study also has applicability for content-independent use, which is especially useful when the prediction model uses real-time data, such as the block types used in previous frames, as part of the model.
7 ADAPTING LOW BIT RATE SKIP MODE IN MOBILE ENVIRONMENT
A method is introduced to enhance the HEVC early skip decision process.
The proposed method uses human vision system factors, such as visual acuity and
smooth-pursuit eye movement, to determine where coding complexity and bit rate reductions
can occur without significantly impacting the perception of video quality. The focus of this study is mobile devices, such as smart phones, where displays are less than 5.0” in diagonal and bandwidth availability is low. The proposed method exploits the mismatch between the video resolution, the display resolution, and the visual acuity of users. It
significantly reduces encoding time for low bit rate environments with minimal subjective
impact. Mobile devices are the main beneficiary of the proposed method, since
battery consumption will be greatly lessened. The proposed method reduces both
encoding time and bitrate; results show an average 21.7% decrease in encoding time,
a 13.4% decrease in average bitrate, and an insignificant impact on subjective quality
evaluations.
7.1 Background
The use of mobile devices, such as smart phones, has grown dramatically in the last
few years and has penetrated the consumer space [10]. A significant reason is the
applications used on these devices. One application category that has gained popularity is
real-time video sharing between mobile devices, such as video conferencing
or video “chatting”. Quality of service is a factor in the adoption of real-time video
sharing applications: research shows users adopt a new technology if it delivers a quality
experience the majority of the time, and the converse is also true [11]. The adoption and
success of these mobile video sharing applications depends on the quality and reliability
of services delivered over the public internet and wireless wide area networks (WWAN).
Bitrate optimization is thus an important tool in delivering video services.
In this chapter, mobile resolutions and the constrained bandwidths observed in
WWANs are examined. The proposed method improves High Efficiency Video Coding
(HEVC) [108] encoders by simultaneously reducing encoding complexity and target
bitrate for equivalent subjective quality. The properties of the Human Visual System
(HVS) exploited in the proposed approach are visual acuity and high-motion perception.
Visual acuity depends on a person’s photoreceptor density: higher photoreceptor
density means better acuity. Because photoreceptor density is limited, the HVS
will not perceive differences above a certain resolution for a given screen size
viewed at the same distance. HVS motion perception depends on the eye’s ability to
smoothly track motion; when motion exceeds the HVS’s ability to track smoothly,
saccadic eye response starts to dominate. There are opportunities to take advantage of
these elements without affecting perceived video quality. The properties of the HVS are
used along with motion vector comparisons between the current coding unit (CU) and
neighbor CUs to determine early termination. Early termination prevents any additional
prediction unit (PU) or sub-CU calculations. Such early termination affects the quality
of the video pixels for that section of the video. The impact of this reduced pixel quality
on perceived quality, however, depends on the video resolution among other
environmental factors such as display size and viewing distance.
In essence, there are three elements that, when combined in a mobile environment, can
be effectively exploited, as shown in Figure 1: the differences between encoded video resolution, display resolution, and HVS photoreceptor density combine to limit the HVS’s ability to discern resolution for a given area. The proposed method works well for low bit rate environments.
The key contributions of this chapter are: (1) a perceptually-aware method for video encoding that exploits visual acuity and motion perception; (2) implementation and evaluation of the proposed method in HEVC encoding; (3) subjective evaluations on mobile devices. The proposed method reduces both encoding time and bitrate; results show an average 21.7% decrease in encoding time, a 13.4% decrease in average bitrate,
and insignificant impact on subjective quality evaluations.
As shown earlier, there exists a significant body of work on complexity
reduction. Most of it shares the goal of reducing complexity while minimizing the
impact on the measured element, which is typically PSNR loss. However, there is limited
work in the mobile environment using HVS elements for complexity reduction, and
limited work demonstrating the perceptual gains via subjective testing.
The proposed method focuses on mobile-environment device-to-device
communication, such as real-time conferencing. The method shows how HVS factors can be
used to aid decision-making for effective complexity reduction without significantly
affecting MOS, while providing major coding time reductions and bit rate savings.
7.2 HEVC Elements
7.2.1 Quad-Tree
The current HEVC standard provides a basic quad-tree block structure with recursive splitting from a CU size of 64x64 down to a potential 8x8 CU size. Essentially, the 64x64 CU is broken down into smaller CUs, for which the RD cost is calculated; each smaller CU is in turn broken down further and its RD costs calculated, and this continues
until an 8x8 CU size is reached. HM10.0 CU compression estimation starts with the largest coding unit (LCU). The compression estimation routine recursively calls the CU compression subroutine for smaller CU sizes within the LCU. The CU compression subroutine determines the RD cost for each PU mode. This computation is extremely time intensive, since the RD cost is calculated for all possible cases.
Potentially, there are up to 85 CU calculations for each Largest Coding Unit
(LCU). The recursive calculation for each LCU, from 64x64 to 32x32 CUs to 16x16 CUs to
8x8 CUs, is counted as 1 + 4 + 4x4 + 4x4x4 = 85. An example CU quad-tree with recursive splitting is shown in Figure 40. In addition, each CU leaf node is divided once more into a PU, which can take one of four shapes for a 64x64 CU or eight shapes for CU sizes of 32x32 or less [87], [20]; the potential shapes are shown in Figure 7.
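The 85-node count generalizes to any LCU size and minimum CU size: each quad-tree level quadruples the node count. A small sketch of that arithmetic (function name is illustrative):

```python
def max_cu_evaluations(lcu: int = 64, min_cu: int = 8) -> int:
    """Count the CU nodes in a fully split quad-tree from the LCU size
    down to min_cu: each level quadruples the node count, giving
    1 + 4 + 4x4 + 4x4x4 = 85 for a 64x64 LCU split down to 8x8."""
    levels = 0
    size = lcu
    while size >= min_cu:
        levels += 1
        size //= 2
    return sum(4**depth for depth in range(levels))
```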
Figure 40: Recursive CU splitting example
7.2.2 Skip Mode Method
When early skip detection (ESD) is enabled within HM 10.0, 2Nx2N merge evaluation is allowed for early skip determination in inter-frames. Within the CU compression process, ESD is used to determine whether further prediction unit calculations for the 2Nx2N can be skipped. ESD essentially checks two elements: (a) the NxN RD costs and (b) the NxN MVs. If the four NxN RD costs are individually zero, then early skip is performed. Likewise, if the four NxN MVs are individually equal to zero, then early skip is performed. At this point the PU NxNs are the leaf nodes that are “merged” to form the
2Nx2N. In addition, if early skip is performed, any further CU splitting is abandoned and the 2Nx2N is used as the CU for compression with the NxNs as PU nodes.
ESD thus makes it possible to skip RD cost calls within the CU compression process in certain cases: the 2Nx2N merge parameters are calculated to determine whether the remaining NxN PU RD costs, which have not yet been calculated, can be skipped. The HM
10.0 skip mode flowchart is shown in Figure 41.
Figure 41: Mode decision process
The determination for early skip is made within the “Merge 2Nx2N Cost” subroutine
through a series of comparisons and checks. HM10.0 currently allows ESD
skip to occur if the CU’s merge flag is set or if the CU’s 2Nx2N motion vector is zero
in both the x and y directions. The merge costs are determined for the 2Nx2N CU subparts, which are the four NxN CUs within the 2Nx2N. The subpart RD costs and residual cost are calculated; if the resulting residual cost is zero, the merge flag is set, which sets the early skip flag before exiting the subroutine. In addition, the 2Nx2N
absolute motion vectors in the x and y directions are calculated; if the sum of the x and y motion vectors is zero, the early skip flag is set [8].
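The two ESD conditions described above can be summarized in a few lines. This is a simplified sketch of the logic as described, not the actual HM 10.0 source:

```python
def early_skip(nxn_rd_costs, nxn_mvs):
    """ESD test as described above: early skip if all four NxN RD
    (residual) costs are zero, or if all four NxN motion vectors
    are (0, 0)."""
    if all(cost == 0 for cost in nxn_rd_costs):
        return True   # zero residual cost for every subpart -> merge/skip
    if all(mv == (0, 0) for mv in nxn_mvs):
        return True   # no motion in any subpart -> skip further PU work
    return False
```

When `early_skip` fires, the remaining NxN RD cost calculations and any further CU splitting are abandoned, which is where the encoding time savings come from.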
7.3 Proposed Method
The proposed method adds a third decision method within the “Merge 2Nx2N
Cost” subroutine for early skip determination. As with the other two methods, when an RD cost
reaches the target, no further RD cost calculations are needed. Essentially, when the
proposed method is determined to be valid, it terminates any further RD
calculations for the current CU and sub-CUs by setting the early skip flag. The proposed
method uses HVS factors such as acuity and smooth pursuit eye movement (SPEM) within the decision process, along
with the video resolution and display resolution. This allows the different
elements to be exploited while keeping the impact on MOS results minimal. These factors lead to the
calculation of a threshold limit for the method. The threshold limit is compared against the
neighboring LCUs to determine whether the early SKIP criteria are met.
The threshold depends on two factors: (a) an acuity limit and (b) a motion
limit. The threshold equation is shown in (4); the acuity limit is the top half of the
equation and the motion limit is the bottom half. The threshold is
characterized in terms of pixels for the given video.