
MASTER'S THESIS

Compression of High Dynamic Range Video

Simon Ekström 2015

Master of Science in Engineering Technology Computer Science and Engineering

Luleå University of Technology
Department of Computer Science, Electrical and Space Engineering

Abstract

For a long time the main interest in the TV industry has been in increasing the resolution of the video. However, we are getting to a point where there is little benefit in increasing it even further. New technologies are quickly rising as a result of this, and High Dynamic Range (HDR) video is one of these. The goal of HDR video is to provide a greater range of luminosity to the end consumer. MPEG (Moving Picture Experts Group) wants to know if there is potential for improvements to the HEVC (High Efficiency Video Coding) standard, specifically for HDR video, and in early 2015 the group issued a Call for Evidence (CfE) to find evidence of whether improvements can be made to the existing video coding standard. This work presents the implementation and analysis of three suggested improvements: bit shifting at the coding unit level, histogram-based value mapping, and modifications to the existing Sample Adaptive Offset (SAO) in-loop filter in HEVC. Out of the three suggestions, the histogram-based color value mapping is shown to provide significant improvements to the coding efficiency, both objectively and subjectively. The thesis concludes the work with a discussion and possible directions for future work.

Acknowledgements

I am very grateful for the opportunity to perform my thesis work at Ericsson Research's Visual Technology unit in Kista and I would like to thank all the people there that have assisted me throughout this work. I would like to especially thank my external supervisor at Ericsson, Martin Pettersson, for assisting me and providing me with valuable feedback throughout the whole process. I would also like to thank my supervisor at Luleå University of Technology, Anders Landström, for showing an interest in the work and providing valuable guidance. Finally I would like to thank family and friends, both new and old, for all the support I have been given during the work and especially the move to a new town.

Contents

1 Introduction
   1.1 Background
   1.2 Purpose
   1.3 Delimitations
   1.4 Related Work
   1.5 Contribution

2 Theory
   2.1 High Dynamic Range
      2.1.1 Transfer Functions
         2.1.1.1 Philips TF
         2.1.1.2 PQ-TF
   2.2 Color Models
      2.2.1 RGB
      2.2.2 YCbCr
   2.3 Color Spaces
      2.3.1 CIE 1931
      2.3.2 CIELAB
      2.3.3 Wide Color Gamut
      2.3.4 BT.709
      2.3.5 BT.2020
      2.3.6 DCI P3
   2.4 Chroma Subsampling
      2.4.1 4:4:4 to 4:2:0
      2.4.2 4:2:0 to 4:4:4
   2.5 File Formats
      2.5.1 EXR
      2.5.2 TIFF
   2.6 Video Coding
      2.6.1 Encoder
      2.6.2 Decoder
      2.6.3 Rate-Distortion Optimization
      2.6.4 Video Coding Artifacts
      2.6.5 HEVC Standard
         2.6.5.1 Quantization Parameter
         2.6.5.2 Coding Tree Units
         2.6.5.3 Deblocking Filter
         2.6.5.4 Sample Adaptive Offset
         2.6.5.5 Profiles
   2.7 Quality Measurement
      2.7.1 PSNR
      2.7.2 tPSNR
      2.7.3 CIEDE2000
      2.7.4 mPSNR
      2.7.5 Bjøntegaard-Delta Bit-Rate Measurements

3 Method
   3.1 Test Sequences
   3.2 Processing Chain
      3.2.1 Preprocessing
      3.2.2 Postprocessing
      3.2.3 Anchor Settings
      3.2.4 Conversion of TIFF Input Files
   3.3 Evaluation
      3.3.1 Objective Evaluation
      3.3.2 Subjective Evaluation
   3.4 HEVC Profile Tests
   3.5 Bitshifting at the CU Level
      3.5.1 Variation 1
      3.5.2 Variation 2
   3.6 Histogram Based Color Value Mapping
      3.6.1 Preprocessing
      3.6.2 Postprocessing
      3.6.3 Parameters
   3.7 SAO XYZ

4 Results
   4.1 HEVC Profile Tests
      4.1.1 Main-RExt, 12 bits, 4:2:0
      4.1.2 Main-RExt, 10 bits, 4:4:4
      4.1.3 Main-RExt, 12 bits, 4:4:4
      4.1.4 Main-RExt, QP offsets
   4.2 Bitshifting at the CU Level
      4.2.1 Variation 1
      4.2.2 Variation 2
   4.3 Histogram Based Color Value Mapping
      4.3.1 Objective Results
      4.3.2 Subjective Results
   4.4 SAO XYZ

5 Discussion
   5.1 Reflections
   5.2 Conclusions
   5.3 Future Work
      5.3.1 Bitshifting at the CU Level
      5.3.2 Histogram Based Color Value Mapping
      5.3.3 SAO XYZ

Chapter 1

Introduction

The TV industry is developing quickly. Until now the main interest has been in increasing the resolution of the video content. Ultra HDTV (High Definition Television) provides a number of improvements, including increased resolution, higher frame rate, and an improved pixel format. However, we are currently approaching the point where there is little benefit in increasing the resolution for ordinary TV sets. Therefore, there is a rising interest in other technologies which can be used to increase the visual experience. One technology introduced is High Dynamic Range (HDR) video, and content providers such as Amazon are already providing HDR content [1]. The goal of HDR video is to provide a greater range of luminosity to the end consumer.

Today's television systems only provide Standard Dynamic Range (SDR), which is a dynamic range of about 1000:1 (the ratio between the brightest and darkest pixels), with a luminosity between 0.1 and 100 candela per square metre (cd/m2). As an example, the sky near the horizon at noon on a clear day has a luminance level of approximately 10 000 cd/m2. HDR is defined as a dynamic range greater than 65 536:1.

HDR video, however, may require changes throughout the video chain, affecting everything from video capturing to the TV sets that are meant to display the video. The content providers want to produce and distribute content that actually utilizes this new feature, but they also want to be able to distribute it as efficiently as possible. This may require new tools that are specific for compression of HDR video.

This thesis will present three main ideas for improvements to existing tools which improve the coding of HDR video:

• Bit depth shifting at the CU level,

• Histogram based color value mapping,


• Sample adaptive offset in XYZ domain.

These ideas will be described in detail and analyzed throughout the report.

Chapter 1 will provide an introduction to the work, providing background, purpose and delimitations. Chapter 2 provides the theory necessary to get an understanding of the presented ideas. It will cover areas such as HDR, color theory, and video coding. Chapter 3 presents the three approaches, describing them in detail. This chapter also provides an overview of the method used for evaluation of the ideas. Chapter 4 will present the results for each of the ideas and also a brief analysis of the results. The last chapter, chapter 5, provides a discussion of the results, including a more detailed analysis of the results and overall conclusions.

1.1 Background

HEVC (High Efficiency Video Coding) is a video compression standard [2] and version 3 was approved in April 2015 [3]. MPEG (Moving Picture Experts Group) issued a Call for Evidence (CfE) for HDR and WCG (Wide Color Gamut) video coding in the spring of 2015. This process has a clear purpose: MPEG wants to explore if the coding efficiency and/or the functionality of the HEVC standard can be improved for HDR.

The introduction of HDR video presents a number of challenges not previously considered. Many of the methods used for compressing ordinary video may not work as well for HDR video, both in terms of quality and compression rate.

There is an interest in how to efficiently represent the pixels to fit the given number of bits per pixel, as well as in how to perform the coding as efficiently as possible. The former is typically performed in the pre- and post-processing stages of the processing chain.

Standard video is usually represented in the sRGB color space, which gives a more efficient use of the available bits compared to just storing the pixel values in a linear color space. However, the sRGB gamut, i.e. the complete subset of colors which can be represented within the color space, is restricted and the non-linear gamma model used in sRGB is not well-suited for HDR imagery, as the input and output ranges are unknown [4].

1.2 Purpose

The purpose of this work is first and foremost to study the concepts of HDR video and wide color gamut to get a basic understanding of what they are, how they differ from today's technology, and how they may affect the video coding process. The next goal is, given the guidelines of the MPEG standardization process, to explore if there are any possible changes and/or additions that can improve the video coder for HDR and WCG video. Any suggestions for changes are then to be implemented and evaluated.

1.3 Delimitations

This work is connected to a standardization process led by MPEG. The process has a clear purpose and so does this work: to find compression efficiency improvements for the existing video coder. MPEG has limited the proposals to three different categories:

1. Normative changes to the HEVC standard. Proposals in this category need to be justified with significant improvements to the performance.

2. Backwards compatibility. This category covers backward compatibility and how to present HDR content on older systems not supporting HDR.

3. Optimization using the existing standardized Main 10 profiles, described in section 2.6.5.5. This category consists of two subcategories covering non-normative changes, i.e. changes that do not have an impact on the decoding process, to

(a) the Main 10 profile and (b) the Scalable Main 10 profile.

To limit the work, the thesis is restricted to categories 1 and 3a. The work mainly focuses on finding improvements to the processing chain presented by the CfE and will not be specifically limited to either normative or non-normative changes.

1.4 Related Work

HDR video, not to be confused with HDR photography, is still a quite new concept, but there has been previous work done in the area. Lu et al. [5] discussed the implications of distributing HDR and WCG content, and Zhang et al. [6] provided a review of HDR image and video compression. Banitalebi-Dehkordi et al. [7] provided a comparison of H.264/AVC and HEVC, showing that HEVC performs better when it comes to compressing HDR video. As mentioned, there are no standards specified for broadcasting yet. However, HDR was recently standardized for the Blu-ray disc format [8].

There is currently a lot of work going on: MPEG is in the process of standardizing HDR for HEVC [9], hardware manufacturers are introducing HDR displays [10], and content providers are starting to provide their users with HDR content [1].

MPEG is not the only organization working on HDR; other groups such as SMPTE, DVB, ATSC, and EBU are also working on specifying standards relating to HDR video.

One of the initial problems with HDR video is how to cope with the extended range. This requires more efficient usage of the available bits. BBC, for instance, has done a lot of work in this area [11], trying to find an efficient transfer function for HDR. Dolby has also presented work in this area [12], introducing the PQ transfer function. Zhang et al. [13] present a method for reducing the required bit depth for HDR video, resulting in efficiency improvements, and in [14], several other methods for reducing the required bit depth are proposed, all based on an adaptive uniform re-quantization applied prior to the encoding.

There has also been work on how to perform the evaluation of HDR video compression methods. In [15], a comparison of four objective metrics is presented: mPSNR, tPSNR, and PSNR∆E, as included in the CfE, as well as the HDR-VDP-2 metric [16].

In addition to the efficiency improvements, there has been work done on how to provide backwards compatibility. Dai et al. [17] presented techniques that were shown to both provide efficiency improvements for HDR video and provide backwards compatibility by allowing tone mapping algorithms to be applied, reducing the contrast and luminance.

1.5 Contribution

This thesis presents three different ideas for improvements to video coding of HDR content. Two of these provide no significant gains, but the work itself should provide a small base for further work in the area.

The third proposal, histogram based color value mapping, provides significant gains both objectively and subjectively. This proposal is closely related to the transfer functions suggested by BBC, Dolby, and Philips. It tries to improve the utilization of available bits, which is also the purpose of the transfer functions. However, the key difference is that the proposal in this thesis looks at the actual video content and tries to optimize the mapping for individual sequences while the transfer functions are designed as a generic solution by looking at the properties of the human visual system.

There are existing similar techniques, but they are not completely identical, so the proposed idea is still worth considering. For instance, the transfer functions and the proposed mapping technique are not mutually exclusive; in this thesis they are used together to improve the efficiency even further. The mapping technique provides efficiency improvements for HDR and possibly SDR, and the technique could be worth continuing to develop or taking inspiration from.

Chapter 2

Theory

This chapter covers the background required to get an understanding of HDR, color theory, and video coding in general. It will begin by covering the theory and tools used for coding HDR, including Wide Color Gamut (WCG). It will then continue on by giving a general understanding of the HEVC video coder. The chapter will not go into full detail about the inner workings of the coder, but it will cover what is necessary to get an understanding of the ideas proposed in this thesis.

2.1 High Dynamic Range

High Dynamic Range (HDR) imaging is a set of techniques used in imaging and photography that allows for a greater dynamic range of luminosity compared to what is possible with standard digital imaging techniques. Dynamic range can be described as the ratio between the maximum and the minimum luminous intensity in a scene. Luminance is a measure of luminous intensity per unit area and the SI unit for this measure is candela per square metre (cd/m2); another term for the same unit is "nit".

In photography the dynamic range is commonly measured in terms of f-stops, which describes the light range by powers of 2.

• 10 f-stops = 2^10 : 1 = 1024 : 1.

• 16 f-stops = 2^16 : 1 = 65 536 : 1.

The human eye can approximately see a difference of 100 000 : 1 in a scene with no adaptation [18]. Table 2.1 shows five examples of luminance values in common scenarios [19][20].


Table 2.1: Approximate luminance levels in common scenarios.

Environment                    Luminance Level (cd/m2)
Frosted bulb 60 W              120 000
White fluorescent lamp         11 000
Clear sky at noon              10 000
Cloudy sky at noon             1 000
Night sky with full moon       0.01

Standard Dynamic Range (SDR) Today’s television systems only provide SDR, which is less than or equal to 10 f-stops. SDR typically supports a range of luminance of around 0.1 to 100 cd/m2. Table 2.1 indicates that SDR is far from being able to provide the luminance levels that the human eye is used to.

Enhanced Dynamic Range (EDR) EDR is an enhanced version of SDR which supports a dynamic range between 10 and 16 f-stops.

High Dynamic Range (HDR) HDR supports a dynamic range of more than 16 f-stops. This means that the range of HDR is significantly bigger than the one of SDR. Using a SIM2 HDR display [10] which supports a brightness up to 4000 cd/m2, it would be possible to reproduce the brightness of a cloudy sky at noon.

2.1.1 Transfer Functions

When capturing video with a camera the colors are captured in the linear domain. This means that the color values are linearly proportional to the amount of luminance. The linear domain, however, is not suitable for the quantization required before video coding; there are typically too few bits available to represent the colors without causing visible errors. The video is therefore typically transferred to a perceptual domain using a Transfer Function (TF) before the encoding. The video is then transformed back to the linear domain after decoding, using the inverse transfer function.

Barten's model [21] is a model of the human eye's sensitivity to contrast at different levels of luminance. Comparing a transfer function to the model shows how likely it is that the function will cause visible banding artifacts, as described in section 2.6.4, and how efficiently the bits are used. Figure 2.1 is a graph of Barten's curve showing the contrast sensitivity of the human eye. Noticeable when looking at Barten's curve is the fact that the human eye is less sensitive to contrast in dark regions. This fact has been used a lot in traditional gamma models when trying to optimize the usage of bits when encoding images. Anything beneath the curve would not cause any visible artifacts, but putting the whole transfer function underneath the curve would require a larger number of available bits.

Figure 2.1: 8 and 10 bit BT.1886 compared to the Barten curve (minimum contrast step in % as a function of luminance in nits).

BT.1886 [22] is a gamma model suggested for HDTV and two versions of this model are visible in figure 2.1, an 8 bit version and a 10 bit version. However, this model is designed for a limited dynamic range, in this case 0.1 to 100 nits.

As BT.1886 is not suitable for the increased range of HDR [12], three new transfer functions for coding HDR have been discussed in MPEG: the BBC TF, the Philips TF, and the Dolby Perceptual Quantizer Electro-Optical TF (PQ-EOTF, or simply PQ-TF) [12]. These three transfer functions, together with the Barten curve, can be seen in figure 2.2. The BBC model is very similar to BT.1886 up to a certain level, after which an exponential curve is used. Philips and Dolby each proposed a transfer function of their own, both of which follow the Barten curve more smoothly.

2.1.1.1 Philips TF

The Philips transfer function is defined as [9]

PhilipsTF(x, y) = \frac{\log\left(1 + (\rho - 1)\,(r \cdot x)^{1/\gamma}\right)}{\log(\rho) \cdot M},    (2.1)

where \rho = 25, \gamma = 2.4, r = y/5000, and

M = \frac{\log\left(1 + (\rho - 1)\, r^{1/\gamma}\right)}{\log(\rho)}.

Figure 2.2: Barten's model with BBC TF, Philips TF, and PQ-TF (minimum contrast step in % as a function of luminance in nits).
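As a minimal illustration, the Philips transfer function can be implemented directly from equation 2.1. The Python sketch below uses a function name of my own choosing and assumes that x is a linear-light value in [0, 1] and y is the peak-luminance parameter (e.g. 10 000); these assumptions are mine and not part of the CfE text.

    import math

    def philips_tf(x, y, rho=25.0, gamma=2.4):
        """Philips transfer function from equation 2.1.

        x is assumed to be linear light in [0, 1] and y a peak-luminance
        parameter, so that philips_tf(1.0, y) is 1.0 for any y.
        """
        r = y / 5000.0
        M = math.log(1.0 + (rho - 1.0) * r ** (1.0 / gamma)) / math.log(rho)
        return math.log(1.0 + (rho - 1.0) * (r * x) ** (1.0 / gamma)) / (math.log(rho) * M)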

2.1.1.2 PQ-TF

The Dolby PQ-TF is defined as [9]

PQ TF(L) = \left(\frac{c_1 + c_2 L^{m_1}}{1 + c_3 L^{m_1}}\right)^{m_2},    (2.2)

where

m_1 = \frac{2610}{4096} \cdot \frac{1}{4},    (2.3)

m_2 = \frac{2523}{4096} \cdot 128,    (2.4)

c_1 = c_3 - c_2 + 1 = \frac{3424}{4096},    (2.5)

c_2 = \frac{2413}{4096} \cdot 32,    (2.6)

c_3 = \frac{2392}{4096} \cdot 32.    (2.7)

The inverse of the PQ-TF is defined as [9]

PQ TF^{-1}(N) = \left(\frac{\max\left(N^{1/m_2} - c_1,\ 0\right)}{c_2 - c_3 N^{1/m_2}}\right)^{1/m_1}.    (2.8)
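As a minimal sketch (Python, with names of my own choosing), the PQ-TF and its inverse follow directly from equations 2.2-2.8, with L being linear light normalized so that 1.0 corresponds to 10 000 cd/m2:

    # Constants from equations 2.3-2.7
    M1 = 2610.0 / 4096.0 / 4.0     # ~0.1593
    M2 = 2523.0 / 4096.0 * 128.0   # ~78.84
    C2 = 2413.0 / 4096.0 * 32.0    # ~18.85
    C3 = 2392.0 / 4096.0 * 32.0    # ~18.69
    C1 = C3 - C2 + 1.0             # = 3424/4096 ~0.8359

    def pq_tf(L):
        """Forward PQ-TF (equation 2.2); L is linear light with 1.0 = 10 000 cd/m^2."""
        Lm1 = L ** M1
        return ((C1 + C2 * Lm1) / (1.0 + C3 * Lm1)) ** M2

    def pq_tf_inverse(N):
        """Inverse PQ-TF (equation 2.8); N is a non-linear value in [0, 1]."""
        Nm2 = N ** (1.0 / M2)
        return (max(Nm2 - C1, 0.0) / (C2 - C3 * Nm2)) ** (1.0 / M1)

Round-tripping a value, pq_tf_inverse(pq_tf(0.5)), should return approximately 0.5.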

2.2 Color Models

A is a mathematical model that describes the way colors can be represented using a predefined number of components. Examples of color models are RGB and CMYK. A color model together with a reference color space and an associated mapping function results in a set of colors referred to as a color gamut, where the gamut refers to a subset of the complete reference color space.

2.2.1 RGB

Figure 2.3: Picture separated into the R, G, and B channels.

The RGB color model splits the color information into the three primary colors: red, green, and blue. Figure 2.3 depicts an example picture and the three color channels. This model is an additive color model, meaning that the three colors, added together, can reproduce any of the colors in the color space.

Figure 2.4 depicts how the three primary colors can be mixed to represent other colors. For instance, adding green to red will result in , and adding all primary colors together will result in white.

An RGB color space is defined by three additive primaries: red, green, and blue. Plotting an RGB color space on a chromaticity diagram, the color space will be visualized by a triangle, as seen in figure 2.6 with the BT.709 [23] and BT.2020 [24] color spaces. The triangle's corners are defined by the chosen color primaries of that color space and any color within the triangle can be reproduced. A complete specification of an RGB color space will also require a white point and a gamma curve to be defined. Not shown in the figure is the sRGB color space [25]. sRGB shares the same color primaries as BT.709, which means they both share the same color gamut. sRGB, however, explicitly specifies an output gamma of 2.2.

Figure 2.4: Additive color mixing, depicting the mixing of the three primary colors.

RGB is used when displaying colors on a number of common display types, such as Cathode Ray Tube (CRT), Liquid Crystal Display (LCD), plasma displays, or Organic Light Emitting Diode (OLED). Each pixel on the display consists of three different light sources, one for each color. From a normal viewing distance, the separate sources will be indistinguishable, giving the appearance of a single color.

2.2.2 YCbCr

Figure 2.5: Picture separated into the Y, Cb, and Cr channels.

YCbCr is a family of color spaces where a color value is represented by three components: Y, Cb, and Cr. Y represents the brightness (luminance), while Cb and Cr are the blue and red chroma components holding the color information. Figure 2.5 depicts a picture separated into the three channels. YCbCr should be distinguished from Y'CbCr, where the Y' component (luma), compared to Y, is in a non-linear domain, for instance encoded by gamma correction.

Y’CbCr is a relative color space derived from an RGB color space. The color primaries are provided by a color space such as BT.709 or BT.2020. For conversions between Y’CbCr and R’G’B’, see section 2.3.4 or 2.3.5 depending on color primaries used.

Y’CbCr is preferred when doing video coding as the separation of the luma and chroma components allows for operations such as storing the components at different resolutions. This is to take advantage of the human visual system and the fact that the chromatic visual acuity is lower than the achromatic acuity [26]. This means that the chroma components can be stored at a lower resolution than the luma component without any major visual impact. The same does not apply for the RGB color model as each of the three channels is of equal importance.

2.3 Color Spaces

To be able to capture, store, and display video with colors, the color information needs to be represented in some way. For this purpose color spaces are used. Color spaces allow for a reproducible color representation.

2.3.1 CIE 1931

The CIE 1931 color spaces [27][28], and the CIE 1931 XYZ color space specifically, describe all colors visible to the human eye and can be seen as the color gamut of the human visual system. The CIE 1931 XYZ color space is depicted as the complete colored area in the chromaticity diagram in figure 2.6. The chromaticity diagram is a simplification of the color space; the color space is actually expressed as a 3D hull, and the X, Y, and Z components of the color space are then coordinates of this 3D hull. Given the properties of the human eye, the model defines the Y component as the luminance.

2.3.2 CIELAB

CIELAB or CIE L*a*b* [28] is a color space describing all the colors in the gamut of human vision, and it was specified mainly to serve as a device-independent reference model. The color space consists of three components, L*, a*, and b*, where L* represents the lightness of the color while a* and b* are the color components.

Figure 2.6: Gamuts of BT.709, DCI P3, and BT.2020 on the CIE 1931 chromaticity diagram.

Equation 2.9 defines the CIELAB color space and how to convert a color value from the CIE 1931 XYZ color space.

L^* = 116 f(Y/Y_n) - 16,    (2.9a)
a^* = 500 [f(X/X_n) - f(Y/Y_n)],    (2.9b)
b^* = 200 [f(Y/Y_n) - f(Z/Z_n)],    (2.9c)

where

f(t) = \begin{cases} t^{1/3} & \text{if } t > (24/116)^3 \\ (841/108) \cdot t + 16/116 & \text{otherwise} \end{cases}

and X_n, Y_n, and Z_n are the tristimulus values of a specified white object color stimulus.

In this case Y_n = 100, X_n = Y_n · 0.95047, and Z_n = Y_n · 1.08883 [9].
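A minimal Python sketch of equation 2.9 (the function name and argument convention are mine; X, Y, and Z are assumed to be scaled so that Y_n = 100 corresponds to the white point above):

    def xyz_to_lab(X, Y, Z, Xn=95.047, Yn=100.0, Zn=108.883):
        """CIE 1931 XYZ to CIELAB according to equation 2.9."""
        def f(t):
            if t > (24.0 / 116.0) ** 3:
                return t ** (1.0 / 3.0)
            return (841.0 / 108.0) * t + 16.0 / 116.0
        L = 116.0 * f(Y / Yn) - 16.0
        a = 500.0 * (f(X / Xn) - f(Y / Yn))
        b = 200.0 * (f(Y / Yn) - f(Z / Zn))
        return L, a, b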

2.3.3 Wide Color Gamut

Color gamut, as mentioned previously, describes a subset of colors. This could for instance be the range of colors that the human eye may perceive or the range supported by a particular output device. This section covers three important color spaces: BT.709, DCI P3, and BT.2020. These color spaces all have their own gamut and figure 2.6 compares these gamuts on top of the CIE 1931 color space.

In addition to HDR, one could increase the realism even further by using a wider color gamut. Ideally the color gamut used would cover the complete color gamut of the human visual system, as in the CIE 1931 color space, but there are still limitations in the video chain. A color gamut larger than the one of BT.709 is typically referred to as a Wide Color Gamut (WCG), and as such, both BT.2020 and DCI P3 are referred to as wide color gamuts. These gamuts give a closer rendition of human color perception and together with HDR they allow for very bright saturated colors.

2.3.4 BT.709

BT.709 [23] is a standard defining the format parameters for High Definition Television (HDTV). It specifies parameters such as aspect ratio, supported resolutions, frame rates, and the color space. The color space of this standard is what will be covered in this section.

Color Space Conversion

The equations below define the conversion from Y'CbCr to R'G'B' with BT.709 primaries [9]:

R' = Y' + 1.57480 · Cr,    (2.10a)
G' = Y' - 0.18733 · Cb - 0.46813 · Cr,    (2.10b)
B' = Y' + 1.85563 · Cb,    (2.10c)

where R', G', and B' are non-linear RGB in the perceptual domain, as a result of transforming the color values using the PQ transfer function. The perceptual domain is connected to the perceptual properties of the human visual system and the purpose of using it is that it makes for a more efficient representation of the color values. These variables can be defined as

R' = PQ TF(max(0, min(R/10000, 1))),    (2.11a)
G' = PQ TF(max(0, min(G/10000, 1))),    (2.11b)
B' = PQ TF(max(0, min(B/10000, 1))),    (2.11c)

where PQ TF is defined in equation 2.2 in section 2.1.1.2.

Additionally, the conversion from R’G’B’ to Y’CbCr for BT.709 can be approximated as [9]

Y' = 0.212600 · R' + 0.715200 · G' + 0.072200 · B',    (2.12a)
Cb = -0.114572 · R' - 0.385428 · G' + 0.500000 · B',    (2.12b)
Cr = 0.500000 · R' - 0.454153 · G' - 0.045847 · B'.    (2.12c)

The equations below define how to convert RGB with BT.709 primaries to the CIE 1931 XYZ color space [9]:

X = 0.412391 · R + 0.357584 · G + 0.180481 · B,    (2.13a)
Y = 0.212639 · R + 0.715169 · G + 0.072192 · B,    (2.13b)
Z = 0.019331 · R + 0.119195 · G + 0.950532 · B.    (2.13c)
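Since these conversions are plain linear combinations, they translate directly into code. The Python sketch below (function names are mine) applies equations 2.12 and 2.13; the same structure holds for BT.2020 and DCI P3 with the coefficients of equations 2.15-2.17:

    def rgb_to_ycbcr_bt709(Rp, Gp, Bp):
        """Non-linear R'G'B' to Y'CbCr with BT.709 coefficients (equation 2.12)."""
        Y  =  0.212600 * Rp + 0.715200 * Gp + 0.072200 * Bp
        Cb = -0.114572 * Rp - 0.385428 * Gp + 0.500000 * Bp
        Cr =  0.500000 * Rp - 0.454153 * Gp - 0.045847 * Bp
        return Y, Cb, Cr

    def rgb_to_xyz_bt709(R, G, B):
        """Linear RGB with BT.709 primaries to CIE 1931 XYZ (equation 2.13)."""
        X = 0.412391 * R + 0.357584 * G + 0.180481 * B
        Y = 0.212639 * R + 0.715169 * G + 0.072192 * B
        Z = 0.019331 * R + 0.119195 * G + 0.950532 * B
        return X, Y, Z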

2.3.5 BT.2020

While BT.709 is the standard for HDTV, the BT.2020 [24] standard defines the format parameters for Ultra High Definition Television (UHDTV). The color spaces of both BT.709 and BT.2020 are visualized in figure 2.6.

Color Space Conversion

The equations below define the conversion from Y'CbCr to R'G'B' with BT.2020 primaries [9]:

R' = Y' + 1.47460 · Cr,    (2.14a)
G' = Y' - 0.16455 · Cb - 0.57135 · Cr,    (2.14b)
B' = Y' + 1.88140 · Cb,    (2.14c)

where R', G', and B' are non-linear RGB defined as in equation 2.11.

Additionally, the conversion from R’G’B’ to Y’CbCr for BT.2020 can be approximated as [9]

Y' = 0.262700 · R' + 0.678000 · G' + 0.059300 · B',    (2.15a)
Cb = -0.139630 · R' - 0.360370 · G' + 0.500000 · B',    (2.15b)
Cr = 0.500000 · R' - 0.459786 · G' - 0.040214 · B'.    (2.15c)

The equations below define how to convert RGB with BT.2020 primaries to the CIE 1931 XYZ color space [9]:

X = 0.636958 · R + 0.144617 · G + 0.168881 · B,    (2.16a)
Y = 0.262700 · R + 0.677998 · G + 0.059302 · B,    (2.16b)
Z = 0.000000 · R + 0.028073 · G + 1.060985 · B.    (2.16c)

2.3.6 DCI P3

DCI P3 [29] is the specification of the color space used in digital cinemas and it is meant as a standard for modern digital projection. All modern digital cinema projectors are capable of displaying the color space. However, there are not many commercially available monitors that support the DCI P3 color gamut.

The color gamut of DCI P3 is smaller than BT.2020, but larger than the one of BT.709, as seen in figure 2.6. The color gamut of P3 is therefore referred to as a wide color gamut.

The equations below define how to convert RGB with DCI P3 primaries to the CIE 1931 XYZ color space [9]:

X = 0.486571 · R + 0.265668 · G + 0.198217 · B,    (2.17a)
Y = 0.228975 · R + 0.691739 · G + 0.079287 · B,    (2.17b)
Z = 0.000000 · R + 0.045113 · G + 1.043944 · B.    (2.17c)

2.4 Chroma Subsampling

As mentioned previously, Y'CbCr separates the luma component from the chroma components. Having a lower resolution for the chroma components allows for a lower bit rate without lowering the overall subjective image quality significantly.

The chroma subsampling formats [30] are commonly expressed using a three part ratio a:b:c, which specifies the ratio between the luma and chroma samples.

• a is the Y’ horizontal sampling reference, defining the width of the sampling region.

• b specifies the horizontal subsampling of Cb and Cr, which is the number of Cb and Cr samples in the first row.

• c is the vertical subsampling for Cb and Cr. Either same as b or zero, indicating that Cb and Cr are subsampled 2:1 vertically.

The two types that are the most important to understand for this thesis are the 4:2:0 and the 4:4:4.

• 4:4:4 specifies that there is no subsampling used, meaning we have the same number of samples for all components.

• 4:2:0 specifies a subsampling by a factor of 2 for the chroma components, both horizontally and vertically. This means that the resolution is a quarter of the original resolution for the chroma components.


Figure 2.7: The 4:4:4 and 4:2:0 chroma subsampling formats.

Figure 2.7 depicts for 4:4:4 and 4:2:0 how the luma (Y') and the chroma (Cb+Cr) samples are merged to produce the resulting pixels. 4:4:4 will not give any compression gains as it results in 3 samples per pixel, similar to an ordinary picture in an RGB color space. 4:2:0, on the other hand, will lower the amount of data required as we go from 8 chroma samples to 2 chroma samples for every macroblock.

2.4.1 4:4:4 to 4:2:0

Chroma downsampling from 4:4:4 to 4:2:0 is done in two steps, first the picture is downsampled horizontally down to 4:2:2, and then the picture is downsampled from 4:2:2 to 4:2:0, as follows [9]: Chapter 2. Theory 18

• First perform the horizontal downsampling down to 4:2:2. Let the input picture be s[i][j], while W and H are the width and height in chroma samples. For i = [0, H - 1] and j = [0, W/2 - 1] the 4:2:2 samples, f[i][j], are derived as follows:

f[i][j] = \sum_{k=-1}^{1} c_1[k] \cdot s[i][\mathrm{Clip3}(0, W-1, 2 \cdot j + k)],    (2.18)

where c1[−1] = 1, c1[0] = 6, c1[1] = 1, and

\mathrm{Clip3}(x, y, z) = \begin{cases} x & \text{if } z < x \\ y & \text{if } z > y \\ z & \text{otherwise} \end{cases}

• Perform the vertical downsampling. For i = [0, H/2 - 1] and j = [0, W/2 - 1], the output 4:2:0 samples, r[i][j], are derived as follows:

r[i][j] = \left( \sum_{k=-1}^{1} c_2[k] \cdot f[\mathrm{Clip3}(0, H-1, 2 \cdot i + k)][j] + \mathrm{offset} \right) \gg \mathrm{shift},    (2.19)

where c2[−1] = 0, c2[0] = 4, c2[1] = 4, shift = 6, offset = 32, and ≫ is the right bitshift operator.
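A minimal Python sketch of the two downsampling steps, assuming s is a list of H rows of W integer chroma samples (the helper and function names are mine):

    def clip3(x, y, z):
        """Clip3(x, y, z) as defined above."""
        return x if z < x else (y if z > y else z)

    def downsample_420(s, W, H):
        """4:4:4 -> 4:2:0 downsampling of one chroma plane (equations 2.18 and 2.19)."""
        c1 = {-1: 1, 0: 6, 1: 1}
        # Horizontal downsampling to 4:2:2 (no normalization yet).
        f = [[sum(c1[k] * s[i][clip3(0, W - 1, 2 * j + k)] for k in (-1, 0, 1))
              for j in range(W // 2)] for i in range(H)]
        c2 = {-1: 0, 0: 4, 1: 4}
        offset, shift = 32, 6
        # Vertical downsampling to 4:2:0; the shift by 6 normalizes both filter stages.
        return [[(sum(c2[k] * f[clip3(0, H - 1, 2 * i + k)][j] for k in (-1, 0, 1)) + offset) >> shift
                 for j in range(W // 2)] for i in range(H // 2)]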

2.4.2 4:2:0 to 4:4:4

Chroma upsampling from 4:2:0 to 4:4:4 is performed in a similar fashion to the downsampling: first vertical filtering is performed, and then horizontal. The steps are as follows [9]:

• Let H and W be the dimensions of the input picture s[i][j] in chroma samples. For i = [0, H - 1] and j = [0, W - 1] the intermediate samples, f[i][j], are derived as follows:

f[2 \cdot i][j] = \sum_{k=-2}^{1} d_0[k] \cdot s[\mathrm{Clip3}(0, H-1, i + k)][j],
f[2 \cdot i + 1][j] = \sum_{k=-2}^{1} d_1[k] \cdot s[\mathrm{Clip3}(0, H-1, i + k + 1)][j],    (2.20)

where the coefficients are defined as in table 2.2. Chapter 2. Theory 19

Table 2.2: Chroma upsampling coefficients.

Phase    −2    −1     0     1
d0[k]    −2    16    54    −4
d1[k]    −4    54    16    −2

• For i = [0, 2 · H - 1] and j = [0, W - 1], the output samples r[i][j] are derived as

r[i][2 \cdot j] = (f[i][j] + \mathrm{offset}_1) \gg \mathrm{shift}_1,
r[i][2 \cdot j + 1] = \left( \sum_{k=-2}^{1} c[k] \cdot f[i][\mathrm{Clip3}(0, W-1, j + k + 1)] + \mathrm{offset}_2 \right) \gg \mathrm{shift}_2,    (2.21)

where c[−2] = −4, c[−1] = 36, c[0] = 36, c[1] = −4, shift1 = 6, offset1 = 32,

shift2 = 12, and offset2 = 2048.
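The corresponding upsampling can be sketched in the same way (Python, integer samples assumed, reusing the clip3 helper from the downsampling sketch above; the function name is mine):

    def upsample_444(s, W, H):
        """4:2:0 -> 4:4:4 upsampling of one W x H chroma plane (equations 2.20 and 2.21)."""
        d0 = {-2: -2, -1: 16, 0: 54, 1: -4}
        d1 = {-2: -4, -1: 54, 0: 16, 1: -2}
        # Vertical upsampling: intermediate plane f of size 2H x W.
        f = [[0] * W for _ in range(2 * H)]
        for i in range(H):
            for j in range(W):
                f[2 * i][j] = sum(d0[k] * s[clip3(0, H - 1, i + k)][j] for k in range(-2, 2))
                f[2 * i + 1][j] = sum(d1[k] * s[clip3(0, H - 1, i + k + 1)][j] for k in range(-2, 2))
        c = {-2: -4, -1: 36, 0: 36, 1: -4}
        offset1, shift1, offset2, shift2 = 32, 6, 2048, 12
        # Horizontal upsampling: output plane r of size 2H x 2W.
        r = [[0] * (2 * W) for _ in range(2 * H)]
        for i in range(2 * H):
            for j in range(W):
                r[i][2 * j] = (f[i][j] + offset1) >> shift1
                r[i][2 * j + 1] = (sum(c[k] * f[i][clip3(0, W - 1, j + k + 1)]
                                       for k in range(-2, 2)) + offset2) >> shift2
        return r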

2.5 File Formats

Working with video requires formats for representing and storing the data and it is important that this can be performed with minimal losses of information. The format needs to support a larger color gamut and an increased dynamic range compared to the traditional video formats. There are a number of formats for HDR video to choose from, all with different capabilities [4].

This section presents two file formats that are used in this thesis work. It will focus on the bit encoding on a pixel level and not how the full image compression is performed.

2.5.1 EXR

OpenEXR [31] is an open source image format created by Industrial Light and Magic [32] with the purpose of being used as an image format for special effects rendering and compositing. The format is a general purpose wrapper for the 16 bit half-precision floating-point data type Half [4]. The Half format, or binary16, is specified in the IEEE 754-2008 standard [33]. OpenEXR also supports other formats, such as floating-point and integer 32 bit formats [31]. Using the Half data type the format will have 16 bits per channel, or 48 bits per pixel.

The OpenEXR format is able to cover the entire visible gamut and a range of about 10.7 orders of magnitude with a relative precision of 0.1%. Based on the fact that the human eye can see no more than 4 orders of magnitude simultaneously, OpenEXR makes for a good candidate for archival image storage [4]. Chapter 2. Theory 20

2.5.2 TIFF

TIFF (Tagged Image File Format) [34] is a widely supported and flexible image format. It provides support for a wide range of image formats wrapped into one file. The format allows the user to specify the type of image (CMYK, YCbCr, etc), compression methods, and also to specify the usage of any of the extensions provided for TIFF.

2.6 Video Coding

For the purpose of this thesis we see video as a sequence of frames, i.e. a series of pictures. Each frame (picture) consists of a number of pixels where every pixel stores color information. The color information is typically stored either in an RGB color space or in the YCbCr color space. The frames store the color information using one of the various types of chroma subsampling. The most important subsampling types to consider for this thesis are 4:2:0 and 4:4:4.

Uncompressed video requires a very high data rate, and to meet the limitations of today's networks and typical storage devices there is a need for the video to be compressed. To understand the need for video compression we can look at the size of an uncompressed video sequence in HD. A sequence in Full HD (1920x1080) with a frame rate of 30 frames per second (60i, i.e. an interlaced frame rate of 60 fields per second [35]), 10 bits per color channel, and the sampling rate 4:2:0 has a data rate of 932 Mbps [35] or a total of 410 GB of data for every hour of video. It is clear that this huge amount of data will be impossible to distribute and store efficiently for the average consumer.

There are two types of compression: lossy compression and lossless compression, where a majority of the existing algorithms use lossy compression. For lossy compression techniques there is a trade-off between video quality, data rate, and coding complexity. A high quality video stream will require a high data rate, while the required data rate can be lowered by reducing the video quality. The complexity of the coder is also a big factor; a complex coder is able to perform a lot of optimizations when coding, which may increase quality and decrease data rate. However, the time it takes to code a sequence will increase with increased complexity.

A typical compression algorithm tries to reduce the required data rate by removing redundant data in the stream, both in terms of spatial and temporal data. Another important concept is perceptual video coding. This concept is about understanding and using the human perception to enhance the perceptual quality of the coded video. A good example of this are the transfer functions presented in section 2.1.1; they take advantage of the human visual system and its properties to utilize the available bits more efficiently.

2.6.1 Encoder

Figure 2.8: Block diagram of an encoder.

Figure 2.8 shows a typical video encoder which consists of three main units: a temporal model, a spatial model, and an entropy encoder.

In a video sequence there are typically two types of redundancies that the coding process tries to reduce:

• Temporal redundancy, which is similarities between multiple frames, for instance if two sequential frames have the same values in a given region.

• Spatial redundancy, which is similarities or patterns within the same frame, for instance if a picture consists of a solid color or a repeated pattern.

Temporal Model The temporal model attempts to reduce temporal redundancy by finding similarities between neighboring video frames. It then constructs a prediction of the current video frame by looking at these similarities. The input for this step is the uncompressed video sequence and there are two outputs: the residual and the motion vectors. The residual is the difference between the prediction and the actual frame, and the motion vectors describe the motion in reference to the neighboring frames.

Spatial Model The spatial model attempts to reduce any spatial redundancy. Compared to the temporal model this model only references the current frame. It makes use of similarities in the local picture. One way to reduce the redundancy is to use transform coding. The residual samples are transformed into the frequency domain, in which the video signal is looked at with respect to frequency bands. The signal is then represented by transform coefficients. The coefficients are then quantized to reduce the number of insignificant values.

Entropy Encoder The entropy encoder takes the motion vectors and the transform coefficients as input. This step uses a more general compression approach; it tries to compress the input using entropy coding to even further reduce any redundant data.

2.6.2 Decoder

The decoder is similar to the encoder but works in reverse. It takes the bit stream generated by the encoder as input and then tries to reproduce the original sequence of frames. First the process decodes the motion vectors and the quantized transform coefficients. The coefficients are rescaled to invert the quantization performed in the encoder; however, as this is a lossy process the coefficients will not be equal to the original coefficients.

The residual data is then restored by doing an inverse transform on the coefficients. Due to the losses in the process the resulting residual data will not be the same as the original. The picture will then be reconstructed by adding the decoded residual data to the predicted picture generated by the motion vectors together with any previous reference frames.

2.6.3 Rate-Distortion Optimization

When compressing video the coder wants to provide high quality video; however, there is a trade-off between video quality and the data rate required. Rate-Distortion Optimization (RDO) refers to optimizing the amount of distortion in the video against the data rate required.

Rate-distortion optimization is utilized a lot within a typical video coder. This allows the coder to try out a number of various techniques for coding the video and then comparing the cost for each, making sure the most cost effective technique is used. This will however increase the complexity at the encoder side as it will try to code the video in a number of different ways.

The typical video coder splits the input video into smaller regions, macroblocks in older standards or coding tree units in HEVC, allowing the rate-distortion optimization to determine the best type of prediction and mode on a region to region basis. Chapter 2. Theory 23

2.6.4 Video Coding Artifacts

As the video coder tries to reduce the data rate as much as possible, as mentioned previously, there is a clear trade-off between data rate and video quality. There are usually very noticeable artifacts on highly compressed video. This section describes some of the more common types of artifacts encountered in video coding.

Figure 2.9: Illustration of color banding.

Color Banding Banding is an artifact that causes inaccurate colors in an image. This artifact is produced when there is not a sufficient number of bits to represent the colors in an image. Natural gradients are typical examples where this artifact may be visible: the number of bits is not sufficient to represent the complete gradient without abrupt changes between two colors. Figure 2.9 shows three versions of the same image, one with a very low number of bits per channel (leftmost) and visible banding artifacts, and one with a higher bit count (rightmost) that appears to be smooth. There are several ways to avoid or hide this type of artifact.

• One could increase the bits per pixel. However, it is not always possible to increase the bit depth.

• Try to encode the available bits more efficiently, as described in section 2.1.1.

• Attempt to hide the artifact by applying intentional noise (dither) to the image, see middle image in figure 2.9.

Blurring Transform coding is typically used in video compression and as a way to control the quality of the video stream the resulting coefficients are quantized. For low quality video coding the coefficients are quantized very coarsely and this may zero out the high frequency components [36]. This yields a low-pass like effect and the resulting video may be perceived as low resolution and blurry. Figure 2.10 shows an example of this artifact with a clear loss of detail in the middle region of the white tent. Chapter 2. Theory 24

Figure 2.10: Illustration of blurring artifacts.

Figure 2.11: Illustration of blocking artifacts.

Blocking Blocking is common when using macroblocks, or as in HEVC, coding tree units, when doing both image and video coding. The use of macroblocks or coding tree units may cause the coder to code neighboring blocks differently. For instance, when performing transform coding each block produces its own set of transform coefficients, and the blurring artifact previously mentioned will then lead to discontinuities at the block boundaries [37]. Figure 2.11 shows an image with block coding artifacts caused by the macroblocking when performing JPEG coding. To reduce this type of artifact the coder typically performs either post filtering or in-loop filtering. In-loop filtering is applied as a part of the encoder loop. HEVC uses two in-loop filters in an attempt to minimize this type of artifact: the deblocking filter and the so-called Sample Adaptive Offset (SAO) filter [2].

Ringing Ringing artifacts are fundamentally associated with the Gibbs phenomenon and are as such typically produced along high-contrast edges in areas that are generally smooth [37]. The artifact typically appears as a rippling outwards from the edge. Figure 2.12 illustrates several examples of ringing, the clearest being visible around the edges of the cube. This type of artifact is closely related to the blurring artifact as they are both caused by quantization of the transform coefficients [37]. The SAO filter was partly designed to correct these types of errors [38].

Figure 2.12: Illustration of ringing artifacts.

2.6.5 HEVC Standard

The High Efficiency Video Coding (HEVC) [2] standard is a successor to the MPEG-4 H.264/AVC standard [39]. HEVC can provide significantly increased coding efficiency compared to previous standards [40].

The second version of the standard includes a range extension (RExt) which supports higher bit depth and additional chroma sampling formats on top of 4:2:0 (4:0:0, 4:2:2 and 4:4:4) [3].

HEVC uses the same hybrid approach as many of the previous standards, using a combination of inter-/intra-picture prediction and 2-D transform coding [2].

Figure 2.13 depicts a block diagram of a typical HEVC video encoder. The encoder also duplicates the decoding process, and the decoder elements are the shaded blocks in the figure. This allows the encoder to generate predictions identical to the ones of the decoder, which allows for better inter-picture prediction. The Sample Adaptive Offset (SAO) filter [38] also uses the generated predictions to determine suitable parameters that help correct various errors and artifacts.

Input Video This is the input video that the coder is encoding. The encoder first proceeds by splitting each picture of the input video into block-shaped regions called coding tree units [2]. The coder then goes on to decide which type of prediction to use.


Figure 2.13: Block diagram of the HEVC encoder (blocks shaded in gray are decoder elements).

Intra-Picture Estimation Intra-picture prediction is the first of the two types of predictions used and it performs predictions based only on data available in the same picture. Therefore intra-picture prediction has no dependence on other pictures. Intra-picture prediction is the only possible prediction mode when coding the first picture of the sequence or the first picture of a random access point [2].

Motion Compensation For the remaining pictures of the sequence, inter-picture prediction is typically used for the majority of the blocks [2]. In this mode predictions are made based on adjacent pictures in the sequence. The encoder side predicts motion vectors for the blocks that the decoder then will compensate for.

Mode Decision The encoder will then have to decide which mode to use, intra-picture prediction or motion compensation. If the picture does not happen to be a picture where intra-picture prediction is forced (i.e. first picture of the sequence or a random access point), the type of prediction is typically determined by performing RDO [2]. The prediction is decoded and the result is subtracted from the original picture to create a residual. The data needed to perform the predictions are also sent to the CABAC module, either the motion vectors of the inter-picture prediction or the intra-picture prediction data depending on what decision was made. Chapter 2. Theory 27

Transform, Scaling & Quantization The residual signal of the intra- and inter-picture prediction is then coded using transform coding. This is done by first transforming the signal by a linear spatial transform. The transform coefficients are scaled, quantized and entropy coded before getting sent to the CABAC module. The quantized transform coefficients are then inverse transformed to duplicate the decoded approximation of the residual signal. The residual signal is then added to the predicted signal and the resulting signal is fed into the deblocking and SAO (Sample Adaptive Offset) filters.

Filter Control When reconstructing the picture, the deblocking and the SAO filters of the decoder also need to be duplicated. The purpose of these filters is to smooth out any artifacts caused by the block-wise processing and quantization. In this step the encoder also determines the parameters for the SAO filter which will be used in the real decoding process, so the resulting parameters are sent to the CABAC module. After the reconstructed signal has gone through the two filters it will be saved in a buffer of decoded pictures. This is the buffer that will be used when doing prediction on subsequent pictures.

CABAC Any data that is about to be a part of the bitstream is run through an entropy coder. In this case Context Adaptive Binary Arithmetic Coding (CABAC) is used [2]. This module will code all the coefficients, motion vectors, intra-picture prediction data, filter parameters, and any other data necessary before constructing the resulting bitstream.

2.6.5.1 Quantization Parameter

The quantization performed on the transform coefficients is determined by a Quantization Parameter (QP) [2] that is set when doing the coding in a way to control the quality or the data rate of the coder. The range of the QP values is defined from 0 to 51. An increase of 1 in the QP means an increase of the quantization step size by approximately 12% and an increase of 6 means an increase by exactly a factor of 2. It can also be noticed that a change of quantization step size by 12% also means a reduction of roughly 12% in bit rate [39].
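As a rough illustration (this is a commonly used approximation rather than the exact integer arithmetic of the standard), the step size can be modelled as Qstep(QP) ≈ 2^((QP−4)/6), which reproduces the two relations above:

    def approx_qstep(qp):
        """Approximate quantization step size as a function of QP."""
        return 2.0 ** ((qp - 4) / 6.0)

    print(approx_qstep(23) / approx_qstep(22))  # ~1.12, i.e. roughly +12 % per +1 QP
    print(approx_qstep(28) / approx_qstep(22))  # 2.0, i.e. a doubling per +6 QP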

2.6.5.2 Coding Tree Units

In previous standards such as H.264/AVC [39], the picture was typically split into macroblocks, consisting of a 16x16 block of luma samples and two 8x8 blocks of chroma samples in the case of 4:2:0 subsampling. HEVC introduces a new concept replacing the typical macroblock with the Coding Tree Unit (CTU) [2]. Compared to a macroblock, the CTU is not of fixed size; the size is selected by the encoder and it can be larger than a traditional macroblock, up to 64x64 pixels. A CTU consists of three Coding Tree Blocks (CTBs), one for the luma samples and two for the corresponding chroma samples, as shown by figure 2.14. The tree structure of the CTB allows for partitioning into smaller blocks called Coding Blocks (CBs) using quadtree-like signaling [41].

Figure 2.14: Overview of the coding tree unit (CTU).

The prediction type for blocks are then coded into Coding Units (CUs), where each CU consists of three CBs, one for luma and two for chroma. Each CU will also have an associated partitioning into Prediction Units (PUs) and Transform Units (TUs).

The decision whether to use interpicture or intrapicture prediction is made at the CU level. Depending on decisions in the prediction process, the CBs can be further split in size and predicted by the Prediction Blocks (PBs) in the PUs.

The residual from the prediction is coded using transform coding. A TU is a tree structure with its root at the CU level consisting of Transform Blocks (TBs). A TB may be of the same size as a CB residual, or it may be split into smaller TBs.

2.6.5.3 Deblocking Filter

Similar to the H.264/AVC standard, HEVC also uses an in-loop deblocking filter [42]. This filter operates within the encoding and decoding loops and is used to reduce the visible artifacts at the block boundaries caused by the block-based coding. The filter detects the artifacts and it then makes decisions on whether to use filtering or not, and subsequently what filtering mode to use. Chapter 2. Theory 29

2.6.5.4 Sample Adaptive Offset

In addition to the deblocking filter, HEVC introduces a new in-loop filtering technique, Sample Adaptive Offset (SAO) [38]. This filter is applied after the deblocking filter and its purpose is to improve the reconstruction of the original by correcting various errors caused by the encoding process. This filter is applied on a CTB level, and given that a CTU has a CTB for every component, one for luma and two for chroma, the filter will be applied for every color component.

At the encoder the filter will classify each reconstructed sample into one of two categories, Edge Offset (EO) or Band Offset (BO). This is done similarly to how RDO is performed, determining which mode and offsets are optimal. The offsets are optional and will only be applied if they have the possibility to increase the quality of the final picture. The offsets are determined and signaled through the bitstream to the decoder, which applies these offsets to the samples when reconstructing the picture.

Edge Offset For the EO mode the sample is classified by comparing the sample to two of its eight neighboring samples in one of four directional patterns: horizontal, vertical, and two diagonal patterns [38]. EO allows for both smoothing and sharpening of edges in the picture and it helps correcting errors such as ringing artifacts. A positive offset results in smoothing while a negative offset would make the edge sharper. However, based on statistical analysis, HEVC disallows sharpening and only sends absolute values of offsets [38].

Band Offset For the BO mode the offsets are selected based on the amplitude of the sample. The full sample range is divided into 32 bands and the sample is categorized into one of these bands. Four offsets are determined for four consecutive bands and are then signaled to the decoder. At the decoder one offset will be applied to all samples of the specific band [38]. Using only four consecutive bands helps correcting banding artifacts, as these typically appear in smooth areas where the sample amplitudes tend to be concentrated in only a few of the bands [2].
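A simplified sketch of the band offset idea (Python; this is illustrative pseudocode with names of my own choosing, not the HEVC reference software API):

    def sao_band_offset(sample, band_position, offsets, bit_depth=10):
        """Add a band offset to one reconstructed sample.

        The sample range is split into 32 equal bands; only the four consecutive
        bands starting at band_position carry signalled offsets.
        """
        band = sample >> (bit_depth - 5)  # 5 bits of band index -> 32 bands
        if band_position <= band < band_position + 4:
            sample += offsets[band - band_position]
        return max(0, min((1 << bit_depth) - 1, sample))  # clip to the valid range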

SAO also provides a number of ways to reduce the information needed to be transmitted between the encoder and the decoder, such as allowing multiple CTUs to share the same SAO parameters. Chapter 2. Theory 30

2.6.5.5 Profiles

The standard defines a number of different profiles [3]. A profile defines a range of bit depths, supported chroma sampling formats, and a set of coding tools that conforms to that profile [2]. The encoder may choose which settings and coding tools to use as long as they conform to the specification of the profile. The decoder on the other hand is required to support all coding tools that the profile supports.

In the second version of the standard, the format range extensions (RExt) were included. These extensions allow for profiles with higher bit depths and additional chroma sampling formats. The 12 bits per channel and 4:4:4 profiles are examples of the formats supported by RExt.

Main The Main profile is the most common profile and it allows for a bit depth of 8 bits per sample and 4:2:0 chroma subsampling. This is the most common format of video used [3].

Main 10 Main 10 is similar to the Main profile with 4:2:0 chroma subsampling, but it allows for a bit depth of up to 10 bits per sample [3]. The extra two bits per sample compared to the 8 bits of the Main profile is a big benefit and it also allows for larger color spaces [43]. This thesis, with the requirements of HDR and WCG, will focus mainly on using this profile. It is also stated that the Main 10 profile with 10 bits per sample provides a higher picture quality at the same bit rate as the Main profile [44].

Main 12 Main 12 allows for a bit depth between 8 and 12 bits per sample with 4:0:0 and 4:2:0 chroma subsampling [3].

Main 4:4:4 10 Main 4:4:4 10 only allows a bit depth of 10 bits per sample, just as Main 10, but in addition to 4:2:0 it also supports 4:0:0, 4:2:2, and 4:4:4 chroma subsampling [3].

Main 4:4:4 12 Main 4:4:4 12 supports the same chroma subsampling formats as Main 4:4:4 but it allows for a bit depth up to 12 bits per sample [3]. Chapter 2. Theory 31

2.7 Quality Measurement

To measure the performance of a video coder you can either perform subjective measurements or objective measurements. When doing subjective measurements you will have human observers watching and rating the quality of the video. For objective measurements, on the other hand, you will have mathematical models designed to approximate the results you would get when doing subjective measuring.

Given that the video is ultimately to be consumed by a human being, subjective measurements are of more value. However, they are usually very costly and time-consuming to gather. Therefore objective measurements are commonly used as a preliminary quality measurement.

Objective measurements can be done using a number of different models. This thesis will focus mainly on the tPSNR and the mPSNR measurements introduced for the MPEG CfE, both of which are variations of the PSNR (Peak Signal-to-Noise Ratio) measure. The reason for introducing these measures is that the non-linear behavior of the human visual system makes PSNR an ill-fitted measurement when it comes to image compression [45]. Despite its drawbacks it has been widely used for Standard Dynamic Range (SDR) video. However, there seems to be a general understanding that the measurement works much worse for HDR video.

2.7.1 PSNR

Peak signal-to-noise ratio (PSNR) defines the ratio between the original video and the error introduced by the compression. The PSNR is calculated as

PSNR = 10 \log_{10} \frac{255^2}{MSE},   (2.22)

where MSE is the mean square error, defined as

MSE = \frac{1}{WH} \sum_{y=1}^{H} \sum_{x=1}^{W} \left[ F_o(x, y) - F_r(x, y) \right]^2.   (2.23)

Here W and H are the width and height of the video, $F_o(x, y)$ is the original frame, and $F_r(x, y)$ is the reconstructed frame.
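As a small illustration, equations 2.22 and 2.23 can be transcribed directly into Python (frames given as 2-D lists of samples, 8-bit peak value as in the formula above):

import math

def psnr(original, reconstructed):
    """PSNR between two equally sized frames given as 2-D lists of samples."""
    height = len(original)
    width = len(original[0])
    sse = 0.0
    for y in range(height):
        for x in range(width):
            diff = original[y][x] - reconstructed[y][x]
            sse += diff * diff
    mse = sse / (width * height)                     # equation 2.23
    return 10.0 * math.log10(255.0 ** 2 / mse)       # equation 2.22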

2.7.2 tPSNR

When calculating the tPSNR [9] measurement, an average of the PQ and the Philips transfer functions is used. This is to give a result closer to the subjective results and to avoid biasing the measurement towards any specific transfer function.

First, both transfer functions are required to be normalized to support 10 000 cd/m^2. The content to be transformed is expected to be in linear-light 4:4:4 RGB EXR; if not, the content first has to be converted.

To calculate the measurement there are a number of steps applied for each sample of the two contents to compare.

• Each sample needs to be normalized to support a luminance of 10 000 cd/m^2; this is done by dividing the values by 10 000.

• The samples are then converted to XYZ, see equation 2.13, 2.16, or 2.17 depending on the color space of the samples (BT.709, BT.2020, or DCI P3).

• Apply the transfer functions for each sample:

X' = \frac{PQ\_TF(X) + PhilipsTF(X, 10000)}{2},
Y' = \frac{PQ\_TF(Y) + PhilipsTF(Y, 10000)}{2},
Z' = \frac{PQ\_TF(Z) + PhilipsTF(Z, 10000)}{2},

where PQ_TF(x) is defined in equation 2.2 and PhilipsTF(x, y) is defined in equation 2.1.

• Four sums of square error (SSE) values are computed between the two contents: SSE_X, SSE_Y, SSE_Z, and SSE_XYZ, where SSE_XYZ = (SSE_X + SSE_Y + SSE_Z)/3.

• Finally, the PSNR values are computed for each SSE as

tPSNR = 10 \cdot \log_{10} \frac{nbSamples}{SSE},

where nbSamples = 1024^2 when having an input with 10 bits per color channel, and the SSEs are clipped to 1e-20.
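A sketch of the per-sample steps above in Python follows. Here pq_tf and philips_tf are assumed helpers implementing equations 2.2 and 2.1, rgb_to_xyz the appropriate RGB-to-XYZ matrix, and nbSamples is taken as the number of compared samples; this is a sketch under those assumptions, not the HDRTools implementation:

import math

def tpsnr_xyz(orig_rgb, test_rgb, rgb_to_xyz, pq_tf, philips_tf):
    """orig_rgb/test_rgb: lists of linear-light (R, G, B) samples in cd/m^2.
    Returns [tPSNR_X, tPSNR_Y, tPSNR_Z, tPSNR_XYZ]."""
    def transform(sample):
        x, y, z = rgb_to_xyz(tuple(c / 10000.0 for c in sample))   # normalize to 10 000 cd/m^2
        avg = lambda v: (pq_tf(v) + philips_tf(v, 10000)) / 2.0    # average of the two TFs
        return avg(x), avg(y), avg(z)

    sse = [0.0, 0.0, 0.0]
    for o, t in zip(orig_rgb, test_rgb):
        to, tt = transform(o), transform(t)
        for i in range(3):
            sse[i] += (to[i] - tt[i]) ** 2

    n = len(orig_rgb)
    sse_xyz = sum(sse) / 3.0
    clip = lambda s: max(s, 1e-20)                                  # avoid log of zero
    return [10.0 * math.log10(n / clip(s)) for s in sse + [sse_xyz]]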

2.7.3 CIEDE2000

The CIEDE2000 [46] formula is used to compute the difference, ∆E, or distance, between two colors. The color difference is a metric of great interest in color science, and the purpose of the metric here is to provide a measurement of the difference between two pictures.

This section describes how to compute an objective measurement based on the CIEDE2000 [9], which will be used when evaluating the quality of a particular implementation.

Firstly, the process requires the contents that are to be compared to be in the linear-light 4:4:4 RGB EXR format. For instance, if the content is in the Y'CbCr 4:2:0 format, it first needs to be upsampled to 4:4:4 (see section 2.4.2) and then converted to linear-light RGB according to equation 2.10 for contents with BT.709 primaries, or 2.14 for BT.2020 primaries.

Subsequently, the following steps are to be applied for each (R, G, B) sample of the two contents to compare, both the original and the test material.

• Convert the samples from RGB to the XYZ color space according to equation 2.13 for BT.709 primaries, or 2.16 for BT.2020 primaries.

• Convert from XYZ to the CIELAB color space according to equation 2.9.

• Given the two samples to compare, (L_1^*, a_1^*, b_1^*) and (L_2^*, a_2^*, b_2^*), the CIEDE2000 color difference ∆E is calculated as follows [46][47]:

1. Calculate the modified chroma, C'_i, and angle, h'_i:

C_{i,ab}^* = \sqrt{(a_i^*)^2 + (b_i^*)^2} for i = 1, 2   (2.25)

\bar{C}_{ab}^* = \frac{C_{1,ab}^* + C_{2,ab}^*}{2}   (2.26)

G = 0.5 \left( 1 - \sqrt{ \frac{(\bar{C}_{ab}^*)^7}{(\bar{C}_{ab}^*)^7 + 25^7} } \right)   (2.27)

a'_i = (1 + G) a_i^* for i = 1, 2   (2.28)

C'_i = \sqrt{(a'_i)^2 + (b_i^*)^2} for i = 1, 2   (2.29)

h'_i = 0 if b_i^* = a'_i = 0, otherwise h'_i = \tan^{-1}(b_i^* / a'_i), for i = 1, 2   (2.30)

2. Calculate the difference in lightness, ∆L', chroma, ∆C', and hue, ∆H':

∆L' = L_2^* - L_1^*   (2.31)

∆C' = C'_2 - C'_1   (2.32)

∆h' = 0 if C'_1 C'_2 = 0; ∆h' = h'_2 - h'_1 if |h'_2 - h'_1| ≤ 180°; ∆h' = (h'_2 - h'_1) - 360° if (h'_2 - h'_1) > 180°; ∆h' = (h'_2 - h'_1) + 360° if (h'_2 - h'_1) < -180°   (2.33)

∆H' = 2 \sqrt{C'_1 C'_2} \sin\!\left(\frac{∆h'}{2}\right)   (2.34)

3. Calculate the CIEDE2000 color difference, ∆E:

\bar{L}' = (L_1^* + L_2^*)/2   (2.35)

\bar{C}' = (C'_1 + C'_2)/2   (2.36)

\bar{h}' = (h'_1 + h'_2) if C'_1 C'_2 = 0; \bar{h}' = (h'_1 + h'_2)/2 if |h'_1 - h'_2| ≤ 180°; \bar{h}' = (h'_1 + h'_2 + 360°)/2 if |h'_1 - h'_2| > 180° and (h'_1 + h'_2) < 360°; \bar{h}' = (h'_1 + h'_2 - 360°)/2 if |h'_1 - h'_2| > 180° and (h'_1 + h'_2) ≥ 360°   (2.37)

T = 1 - 0.17 \cos(\bar{h}' - 30°) + 0.24 \cos(2\bar{h}')   (2.38)
    + 0.32 \cos(3\bar{h}' + 6°) - 0.20 \cos(4\bar{h}' - 63°)   (2.39)

∆θ = 30 \exp\!\left( -\left( \frac{\bar{h}' - 275°}{25} \right)^2 \right)   (2.40)

R_C = 2 \sqrt{ \frac{\bar{C}'^7}{\bar{C}'^7 + 25^7} }   (2.41)

S_L = 1 + \frac{0.015 (\bar{L}' - 50)^2}{\sqrt{20 + (\bar{L}' - 50)^2}}   (2.42)

S_C = 1 + 0.045 \bar{C}'   (2.43)

S_H = 1 + 0.015 \bar{C}' T   (2.44)

R_T = -\sin(2∆θ) R_C   (2.45)

∆E = \sqrt{ \left( \frac{∆L'}{k_L S_L} \right)^2 + \left( \frac{∆C'}{k_C S_C} \right)^2 + \left( \frac{∆H'}{k_H S_H} \right)^2 + R_T \left( \frac{∆C'}{k_C S_C} \right)\!\left( \frac{∆H'}{k_H S_H} \right) },   (2.46)

where S_L, S_C, S_H, and R_T are weighting functions correcting the lack of perceptual uniformity in CIELAB. The parameters k_L, k_C, and k_H are correction terms accounting for variations in experimental conditions. In this case they are all set to 1, which corresponds to reference conditions [47].

• Finally, a PSNR-based value is derived as

PSNR_{∆E} = 10 \cdot \log_{10} \frac{10000}{∆E}.

2.7.4 mPSNR

The mPSNR [48] measurement is calculated by creating several new images at different exposures from the original and the reconstructed images, similar to taking photographs at different exposures. First a number of c-values are calculated where each c-value will result in one exposure. A PSNR measure will then be calculated between each exposure pair (original and reconstructed). The resulting mPSNR will then be an average of the PSNR values for all the exposure pairs.

The measurement requires the data to be linear RGB values. This means that if the PQ transfer function has been applied to the values, the inverse transfer function needs to be applied first, converting the values back to linear space.

Given the original data F_o = (R_o, G_o, B_o) and the reconstructed data to compare against, F_r = (R_r, G_r, B_r), the following steps are applied for each sample:

• Clip all values to the range [0, 65504]. This is done both for the original data and the reconstructed data.

• Determine the largest component of the original,

colMax = max(Ro,Go,Bo). (2.47)

• Find the smallest integer c-value that will give a non-zero contribution,

cMin = \left\lceil γ \cdot \log_2\!\left(\frac{0.5}{255}\right) - \log_2(colMax) \right\rceil,   (2.48)

where γ = 2.2, which is the display gamma.

• Find the largest c-value that will give a non-saturated contribution,

cMax = \left\lfloor γ \cdot \log_2\!\left(\frac{254.5}{255}\right) - \log_2(colMax) \right\rfloor.   (2.49)

• Generate a set of c-values (exposures), containing all integer values between and including cMin and cMax.

• For each sample and c-value, calculate the squared error and add that to the total error,

R_{oL} = clip\!\left(0, 255, 255 \cdot (2^c R_o)^{1/γ}\right),   (2.50)
G_{oL} = clip\!\left(0, 255, 255 \cdot (2^c G_o)^{1/γ}\right),   (2.51)
B_{oL} = clip\!\left(0, 255, 255 \cdot (2^c B_o)^{1/γ}\right),   (2.52)
R_{rL} = clip\!\left(0, 255, 255 \cdot (2^c R_r)^{1/γ}\right),   (2.53)
G_{rL} = clip\!\left(0, 255, 255 \cdot (2^c G_r)^{1/γ}\right),   (2.54)
B_{rL} = clip\!\left(0, 255, 255 \cdot (2^c B_r)^{1/γ}\right),   (2.55)

SSE = SSE + (R_{oL} - R_{rL})^2 + (G_{oL} - G_{rL})^2 + (B_{oL} - B_{rL})^2   (2.56)

• Calculate the mean squared error (MSE) by dividing the error by the total number of samples times the number of c-values,

MSE = \frac{SSE}{3 \cdot numSamples \cdot (cMax - cMin + 1)}.   (2.57)

• Finally, calculate the resulting mPSNR as

mPSNR = 10 \cdot \log_{10}\!\left(\frac{255^2}{MSE}\right).   (2.58)
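The steps above can be sketched in Python for a single picture. The exposure range is derived per sample as in equations 2.48-2.49, and the MSE is taken over all (sample, exposure) pairs, which matches equation 2.57 when every sample uses the same exposure range; this is a sketch of the procedure, not the HDRTools implementation:

import math

def mpsnr(orig, recon, gamma=2.2):
    """orig/recon: lists of linear-light (R, G, B) tuples for one picture."""
    clip = lambda lo, hi, v: max(lo, min(hi, v))
    sse, pairs = 0.0, 0
    for so, sr in zip(orig, recon):
        so = tuple(clip(0, 65504, c) for c in so)
        sr = tuple(clip(0, 65504, c) for c in sr)
        col_max = max(so)                                               # eq. 2.47
        if col_max <= 0:
            continue                                                    # no valid exposure for all-black samples
        c_min = math.ceil(gamma * math.log2(0.5 / 255) - math.log2(col_max))     # eq. 2.48
        c_max = math.floor(gamma * math.log2(254.5 / 255) - math.log2(col_max))  # eq. 2.49
        for c in range(c_min, c_max + 1):
            for co, cr in zip(so, sr):
                o = clip(0, 255, 255 * (2 ** c * co) ** (1 / gamma))    # eqs. 2.50-2.52
                r = clip(0, 255, 255 * (2 ** c * cr) ** (1 / gamma))    # eqs. 2.53-2.55
                sse += (o - r) ** 2                                     # eq. 2.56
            pairs += 1
    mse = sse / (3 * pairs)                                             # eq. 2.57, per sample-exposure pair
    return 10 * math.log10(255 ** 2 / mse)                              # eq. 2.58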

2.7.5 Bjøntegaard-Delta Bit-Rate Measurements

Evaluating an implementation is done with a set of QP values for the encoder, resulting in metrics for a number of different bit rates. Combining any of the PSNR metrics with the resulting bit rate yields a rate-distortion curve, i.e. the distortion as a function of the bit rate. To compare two implementations, and how well one performs relative to the other, the curves of the implementations can be used.

To compare a set of rate-distortion curves the Bjøntegaard Delta bit-rate (BD-rate) [49] measurement is used. This results in an objective measurement expressing the average change in bit rate for a comparable quality. A positive measure implies an increase in bit rate while a negative one implies a decrease in bit rate for a certain level of quality.

The method for calculating the BD-rate between two rate-distortion curves, RD_1 and RD_2, is divided into three steps [49]:

• Fit two curves through the data points of the rate-distortion results, one curve for RD_1 and one for RD_2.

• Find an expression for the integral of the two curves.

• The average difference is the difference between the two integrals divided by the integration interval.

The two rate-distortion curves are fitted to a third order polynomial, and as it is considered more appropriate to perform the integration with the bit rate on a logarithmic scale [49], the polynomial can be expressed as

\log(R(D)) = a + b D + c D^2 + d D^3,   (2.59)

where D is the distortion (PSNR) value and R(D) is the bit rate as a function of the distortion.

The average difference, or the BD-rate, between the two fitted curves can then be expressed as [49]

∆R = 10^{\frac{1}{B - A} \int_A^B [\log(R_2(D)) - \log(R_1(D))] \, dD} - 1,   (2.60)

where A and B specify the integration interval and can be expressed as

A = \max(\min(D_1), \min(D_2)),   (2.61)

B = \min(\max(D_1), \max(D_2)).   (2.62)
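As an illustration of the three steps, a small Python sketch using NumPy follows. Each curve is given as (bitrate, PSNR) pairs for the four QP points; the cubic fit of log-rate versus distortion follows equation 2.59 and the average difference equation 2.60. This is a sketch, not the reference BD-rate spreadsheet implementation:

import numpy as np

def bd_rate(rd1, rd2):
    """rd1/rd2: lists of (bitrate, psnr) points; returns the average bitrate
    change of rd2 relative to rd1 in percent (negative = bitrate saving)."""
    r1, d1 = np.log10([p[0] for p in rd1]), np.array([p[1] for p in rd1])
    r2, d2 = np.log10([p[0] for p in rd2]), np.array([p[1] for p in rd2])
    # Fit log(R) as a third order polynomial in the distortion D (eq. 2.59).
    p1 = np.polyfit(d1, r1, 3)
    p2 = np.polyfit(d2, r2, 3)
    # Integration interval shared by both curves (eqs. 2.61-2.62).
    a = max(d1.min(), d2.min())
    b = min(d1.max(), d2.max())
    int1 = np.polyval(np.polyint(p1), b) - np.polyval(np.polyint(p1), a)
    int2 = np.polyval(np.polyint(p2), b) - np.polyval(np.polyint(p2), a)
    # Average difference of the two integrals over the interval (eq. 2.60).
    return (10 ** ((int2 - int1) / (b - a)) - 1) * 100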

Chapter 3

Method

This chapter starts by presenting the method used for evaluating the suggested implementations. It then continues by describing the three ideas for improving the video coding chain: bit shifting at the CU level, histogram based color value mapping, and modifications to the SAO filter.

The suggestions presented in this report were all evaluated against an anchor provided by MPEG for the CfE [9]. This anchor consists of a fixed processing chain with the pre- and postprocessing steps together with the reference coder for HEVC/H.265 (HM 16.2) and a fixed configuration. In this case the anchor runs using a 4:2:0, 10 bits per color channel configuration. The purpose of the anchor is to provide a point of comparison. For the comparisons to be fair, the anchor chain needs to produce identical results every time it is run.

3.1 Test Sequences

To evaluate the proposals twelve test sequences provided by MPEG for the CfE were used. They all have the resolution 1920x1080 progressive and the color format RGB 4:4:4. The sequences are split into six different classes: A, B, C, D, G, and A’.

Class A, G Native content has BT.709 color primaries but uses a BT.2020 container, i.e. the colors are stored in the BT.2020 format but only the colors covered by BT.709 are used. Sequences are stored in the linear domain in EXR files.

Class B, C, D Native content has P3D65 color primaries, stored in a BT.2020 container, using a 12 bit PQ-TF representation in TIFF files.

Class A’ Similar to class A but only uses a BT.709 container.

Table 3.1 lists all the sequences with their respective class and native color primaries.

Table 3.1: Video sequences used in the evaluation.

Class   Gamut    Sequence name
A       BT.709   FireEater2Clip4000r1
        BT.709   Tibul2Clip4000r1
        BT.709   Market3Clip4000r2
B       P3D65    AutoWeldingClip4000
        P3D65    BikeSparklersClip4000
C       P3D65    ShowGirl2TeaserClip4000
D       P3D65    StEM MagicHour
        P3D65    StEM WarmNight
G       BT.709   BalloonFestival
A'      BT.709   FireEater2Clip4000r1
        BT.709   Tibul2Clip4000r1
        BT.709   Market3Clip4000r2

FireEater This sequence shows a fire show with two persons swinging and blowing fire. The scene is quite dark and does not contain a lot of color, but it contains some high-luminance flames. In terms of motion the scene is quite static, but there is some complex motion around the fires.

Tibul In this computer-generated sequence the camera follows a small spaceship-like object through an illuminated cave. The colors are quite limited to the fire/lava look of the cave.

Market This sequence has a lot of light as it is filmed during a bright day. It shows a static scene of a market with some slow motion. It also contains a wide color spectrum, with a bright sky and colorful cloths hanging around.

AutoWelding Dark sequence showing a welding torch in action. It contains limited colors, a large contrast between the welding torch and the rest of the scene, and only small motions.

BikeSparklers This sequence captures two persons biking around inside a medium-bright warehouse with a lot of small bright sparkles flying around, causing a lot of motion.

ShowGirl This sequence begins by showing a girl in front of a theatre dressing room mirror. The surroundings are quite dark but the mirror has a number of high-luminance lights around it. There is a lot of color and detail throughout the picture. The sequence continues by showing the girl turning around; suddenly a bright light shines upon her face, highlighting a lot of complex features of the face.

MagicHour This sequence consists of two scenes. Overall it is filmed during early night but the scenes are still quite bright. The first one shows a fully covered table with a lot of colorful flowers and accessories. The second scene shows a number of people walking along a path. The people wear colorful clothes with a lot of complex patterns on them.

WarmNight This sequence shows the first scene also shown in the beginning of MagicHour, with the big table. Now there are a lot of people around the table and a lot of small motions. The scene is still very colorful.

BalloonFestival This sequence shows a grass plain full of people walking around and three large air balloons. There is a lot of color on the people and the balloons and the scene is filmed during daylight with high illumination. Behind the plain is a large mountain landscape.

3.2 Processing Chain

[Figure omitted: block diagram of the preprocessing chain (input video → TF → R'G'B' to Y'CbCr → quantization to 10 bits → 4:4:4 to 4:2:0 → HM encoder → bitstream) and the postprocessing chain (bitstream → HM decoder → 4:2:0 to 4:4:4 → inverse quantization → Y'CbCr to R'G'B' → inverse TF → output video).]

Figure 3.1: End-to-end video coding chain.

Figure 3.1 shows the full end-to-end processing chain used when generating the anchors provided by the CfE. This chain also works as a base for evaluating the proposals of this thesis. For the pre- and postprocessing HDRTools 0.9 was used, which is a tool used within MPEG for the CfE. HDRTools provides a wide range of tools for converting and processing HDR video; all color conversions, transfer functions, and chroma subsampling were done through this tool. It also provides tools used for computing the various metrics used for the evaluation. To perform the actual coding the reference software for HEVC, HM 16.2 (HEVC Test Model) [50], was used.

3.2.1 Preprocessing

The video input to the preprocessing chain is assumed to be in the EXR format, with RGB values in a 16 bit floating-point format. This means that inputs in other formats, such as the sample sequences of classes B, C, and D in the 12 bit PQ-TF format, need to be converted according to section 3.2.4.

• The preprocessing starts off by applying a transfer function on the input video; in this case the PQ-TF, see equation 2.2, is used.

• The processing goes on by converting the video data from R'G'B' to Y'CbCr; the Y'CbCr values can be expressed as in equation 2.12 or 2.15, depending on the color primaries of the video data.

• Then quantization is performed, converting the video data from 16 bit floating- point values to 10 bit integers.

• Finally, chroma downsampling from 4:4:4 to 4:2:0 is applied on the video data. This process is described in detail in section 2.4.1.

3.2.2 Postprocessing

The postprocessing is done in a similar fashion to the preprocessing but in reverse. The chain expects decoded video in the 10 bit 4:2:0 Y'CbCr format and will convert and output the final video as 16 bit floating-point RGB.

• Firstly, the video is upsampled from 4:2:0 to 4:4:4. This process is described in detail in section 2.4.2.

• The quantization is inversed, converting the video from 10 bit integers to 16 bit floating-point values.

• The video is converted back from Y’CbCr to R’G’B’ according to 2.10.

• Finally, the inverse PQ-TF (equation 2.8) is applied on the content.

3.2.3 Anchor Settings

The anchor and the proposals are all based on the reference software for HEVC, HM 16.2 [50]. The configuration used for the anchor generation and the evaluations is based on a configuration file provided with HM, encoder_randomaccess_main10.cfg [1].

The difference between the configuration provided together with HM and the one provided for the CfE is that the latter specifies a profile level. The profile level defines the maximum bit rate and some other properties for the coder. The CfE specifies the profile level to be 4.1, which allows for a picture resolution of 1920x1080 at a maximum frame rate of 64 frames per second [3]. This is enough to be able to code all the provided test sequences of the CfE.

The anchor settings were used as the base for the configuration of the HM coder in the evaluation of all the proposed ideas. The same configuration as for the anchor, with Main 10, 10 bits, 4:2:0, was used for all evaluated ideas except where stated otherwise.

3.2.4 Conversion of TIFF Input Files

Conversion from the 12 bit PQ-TF P3D65 format to 16 bit RGB BT.2020 format is performed as follows [9]:

• First perform inverse quantization on the content (D'_R, D'_G, D'_B) from 12 bit PQ-TF into normalized PQ-TF, according to

R' = \frac{D'_R - rl}{rh - rl},   (3.1a)
G' = \frac{D'_G - rl}{rh - rl},   (3.1b)
B' = \frac{D'_B - rl}{rh - rl},   (3.1c)

where rl = 16 and rh = 4076.

• Content is converted from R’G’B’ to RGB using equation 2.8.

• Content is converted from RGB with P3D65 primaries to the XYZ color space according to equation 2.17 in section 2.3.6.

[1] https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/tags/HM-16.2/cfg/encoder_randomaccess_main10.cfg [Accessed: February 2015]

• Finally, the content is converted from the XYZ color space to RGB with BT.2020 primaries,

R_{2020} = 1.716651 \cdot X - 0.355671 \cdot Y - 0.253366 \cdot Z,   (3.2a)

G_{2020} = -0.666684 \cdot X + 1.616481 \cdot Y + 0.015768 \cdot Z,   (3.2b)

B_{2020} = 0.017640 \cdot X - 0.042771 \cdot Y + 0.942103 \cdot Z.   (3.2c)
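A sketch of the full conversion for one 12 bit PQ-TF P3D65 sample is given below. Here inverse_pq_tf is an assumed helper implementing equation 2.8, and the P3D65-to-XYZ matrix values are the commonly published ones, which may differ slightly in rounding from equation 2.17; the XYZ-to-BT.2020 matrix is taken directly from equation 3.2:

def tiff_to_bt2020(d_rgb, inverse_pq_tf, rl=16, rh=4076):
    """One (D'R, D'G, D'B) 12 bit PQ-TF P3D65 sample -> linear BT.2020 RGB."""
    # Inverse quantization to normalized PQ-TF values (eq. 3.1).
    r_p, g_p, b_p = ((d - rl) / (rh - rl) for d in d_rgb)
    # Back to linear light with the inverse PQ transfer function (eq. 2.8).
    r, g, b = inverse_pq_tf(r_p), inverse_pq_tf(g_p), inverse_pq_tf(b_p)
    # P3D65 RGB -> XYZ (eq. 2.17); standard matrix, rounding may differ.
    x = 0.486571 * r + 0.265668 * g + 0.198217 * b
    y = 0.228975 * r + 0.691739 * g + 0.079287 * b
    z = 0.000000 * r + 0.045113 * g + 1.043944 * b
    # XYZ -> BT.2020 RGB (eq. 3.2).
    r2020 = 1.716651 * x - 0.355671 * y - 0.253366 * z
    g2020 = -0.666684 * x + 1.616481 * y + 0.015768 * z
    b2020 = 0.017640 * x - 0.042771 * y + 0.942103 * z
    return r2020, g2020, b2020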

3.3 Evaluation

Given the lack of an available HDR monitor during the beginning of the work, and the extra work required for performing subjective testing, the ideas were mainly evaluated using objective testing. However, for the Histogram Based Color Value Mapping (section 3.6), subjective testing was also used.

3.3.1 Objective Evaluation

The objective evaluation is performed using an internal framework at Ericsson. Due to the amount of time it takes to run a single sequence through the process chain, the framework uses a cluster to perform a large number of simultaneous tests. All of the test sequences presented in section 3.1 are run through the complete process chain, with each sequence being coded multiple times using a set of four QP values. The QP values used for the evaluation can be found in table 3.2 and they were defined by the CfE [9] to meet certain bitrate requirements.

Table 3.2: QP values of the test sequences.

Sequence                  QP1  QP2  QP3  QP4
FireEater2Clip4000r1      20   23   26   29
Tibul2Clip4000r1          19   24   29   34
Market3Clip4000r2         21   25   29   33
AutoWeldingClip4000       21   25   29   33
BikeSparklersClip4000     23   25   29   33
ShowGirl2TeaserClip4000   21   25   29   33
StEM MagicHour            21   25   29   33
StEM WarmNight            21   25   29   33
BalloonFestival           18   22   26   30

At the end of each test run dispatched by the framework, three metrics are calculated: tPSNR, PSNR∆E, and mPSNR, all described in section 2.7. These PSNR-based values are calculated between the two endpoints of the coding chain (see figure 3.1). The metrics for all sequences and QP values, together with their corresponding bitrates, are presented in a spreadsheet at the end of a complete run.

For each sequence, using the results from all four QP values, there is one resulting rate-distortion curve. The curve for each sequence was compared to the corresponding curve of the anchor using the BD-rate presented in section 2.7.5, to give an objective measurement of the change in coding efficiency for the new tested coding chain.

3.3.2 Subjective Evaluation

The subjective evaluation performed was a very simple process. The testing was performed on a single SIM2 HDR monitor [10], comparing the interesting sequences to the anchor sequences sequentially. The video sequences to show were selected so as to give as equal a bit rate as possible, preferably with the anchor bit rate being lower than that of the test sequence.

This empirical evaluation was simply intended to point out visible errors, and therefore no standardized testing procedure was used.

3.4 HEVC Profile Tests

First, tests were performed to evaluate the settings of the anchor. The anchor was run using the Main 10 profile with a bit depth of 10 bits per channel and a chroma subsampling ratio of 4:2:0. This is an increase from the 8 bits per channel typically used in broadcast television today. It was earlier stated that the extra 2 bits per channel do not increase the total bit rate [44]. Could this possibly mean there would be an increase in quality, without any significant increase in bit rate, when increasing the bit depth even further?

In addition to the increased bit depth, the 4:4:4 chroma subsampling format was also evaluated against the anchor settings to get an understanding of the benefits and increased costs. The anchor was run and compared to three different coder settings using the range extensions (RExt) of HEVC. The three configurations evaluated were

• Main-RExt, 12 bits per channel, 4:2:0,

• Main-RExt, 10 bits per channel, 4:4:4,

• Main-RExt, 12 bits per channel, 4:4:4.

3.5 Bitshifting at the CU Level

The anchor is run using a bit depth of 10 bits per color channel to be able to support the extended range of HDR and an increased color gamut compared to traditional video. A thought that instantly comes to mind is whether the coding cost is increased with the increased bit depth. The advantage of 10 bits per color channel may only be useful in areas of the picture where the extended range, or colors outside of the traditional BT.709 color gamut, are actually used. For instance, dark regions of a picture would have very low luminance values and the range of the color values would typically be very limited.

Figure 3.2: Example picture with dark regions highlighted.

Figure 3.2 tries to illustrate an example of this. The area within the red rectangle has a very low luminance level and a very small range of colors and therefore 10 bits per color channel may not be needed. What if this area were to be coded using only 9 bits per channel? Would this provide any gains overall?

The proposed idea is to make use of the RDO process described in section 2.6.3 to perform coding for two different bit depths simultaneously, both 9 and 10 bits per channel. This is performed on a per CU basis, with the encoder performing coding of each CU first with a 9 bit version of the input data and then with the original 10 bit version. The encoder then chooses which of the two bit depths provides the biggest gain.

For simplicity the shifting is only performed for intra-picture prediction, leaving the inter-picture prediction intact.

Figure 3.3 shows a simplified overview of the compression of intra-prediction pictures in HEVC. It is a recursive process where the CUs are compressed at all levels of the tree. The process will calculate and compare the costs of every compressed CU, always keeping track of the CU with the lowest cost. When the process has gone through all levels, the compressed CU with the lowest cost will be the one that gets coded to the resulting bitstream.

[Figure omitted: recursive coding structure where the picture is split into CUs, each CU is compressed, and each CU may in turn be split into smaller CUs that are compressed in the same way.]

Figure 3.3: Simplified overview of the coding of intra-prediction pictures.

The technique presented here makes changes to the compress CU step in figure 3.3, adding an additional step for performing the coding with the 9 bit input data. The resulting process of the new compress CU procedure is described in algorithm 1.

Algorithm 1 Compress CU
 1: bestCU ← 0
 2: bestCost ← ∞
 3: tempCU ← 0
 4: tempCost ← ∞
 5:
 6: // First compress 10 bit version
 7: (bestCU, bestCost) ← PerformIntra(inputCU)
 8:
 9: // Convert and compress 9 bit version
10: inputCU9b ← ConvertTo9b(inputCU)
11: (tempCU, tempCost) ← PerformIntra(inputCU9b)
12:
13: if tempCost < bestCost then
14:     bestCU ← tempCU
15:     bestCost ← tempCost
16: end if
17: return (bestCU, bestCost)

To avoid having additional inputs to the actual encoder itself, the 9 bit version is produced on the fly (ConvertTo9b in algorithm 1). The best compressed CU and its cost (bestCU and bestCost in algorithm 1) for that level will then be passed to the compression of the next level of CUs, comparing the costs and choosing the cheapest one for the resulting bitstream.

This proposal comes in two variations. The first one provides the coder with a true 9 bits per channel version of the original data, while the other is a simpler version where the color values are shifted to 9 bits and then back to 10 bits, basically zeroing out the rightmost bit. For both versions the RDO process is similar; the big difference lies in the format of the data that is fed to the intra-prediction process. Looking at algorithm 1, the main difference between the two variations is that the ConvertTo9b procedure is defined differently. The first variation may also require some tweaking of the cost computation in the PerformIntra procedure.

3.5.1 Variation 1

In this variation the PerformIntra procedure is fed a true 9 bits per color channel version of the original data. In this case ConvertTo9b simply shifts all color values of the CU down to 9 bits. This method also requires changes to the decoder as the picture data will have to be shifted up during decoding. For the decoder to know whether a certain CU is 10 bit or 9 bit, a new flag is added to the bitstream that is set every time the RDO chooses a 9 bit CU. This flag is coded to the bitstream through the CABAC coder and indicates that the data needs to be shifted back to 10 bits during decoding.

The big challenge with this proposal is how to make a fair cost computation when comparing the 10 bit version to the 9 bit version during the RDO. One could (1) compute the distortion either by shifting up the coded 9 bit version to 10 bit and then comparing it to the original 10 bit version, or (2) compare it to the 9 bit original picture. Both suggestions would require changes to the cost computations of PerformIntra.

3.5.2 Variation 2

This variation is a lot simpler and avoids the problem of the cost computation. There is also no need to perform any changes to the decoder. Here the ConvertTo9b procedure first shifts the color values of the CU one bit to the right, and then one bit to the left. This means that the two original pictures are both in a 10 bit format and the only difference is that the picture representing the 9 bit version has the last bit zeroed out. Removing the rightmost bit should remove some of the noise from the image, potentially making it easier to encode.

Here the cost computation is performed exactly the same way; both coded versions are compared to the original 10 bit picture. The distortion of the shifted version would most likely be higher due to the loss of information, but the potential decrease in bit rate caused by the removed noise could possibly give a smaller total cost compared to the original 10 bit version.
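For illustration, a minimal sketch of the two ConvertTo9b variants for one CU of 10 bit samples (the sample container is simplified to a flat list here): variation 1 produces true 9 bit data to be shifted back up at the decoder, while variation 2 stays in 10 bits with the least significant bit zeroed:

def convert_to_9b_variation1(cu_samples):
    """True 9 bit data: each 10 bit sample is shifted down one bit."""
    return [s >> 1 for s in cu_samples]

def convert_to_9b_variation2(cu_samples):
    """Still 10 bit data, but with the rightmost bit zeroed out."""
    return [(s >> 1) << 1 for s in cu_samples]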

3.6 Histogram Based Color Value Mapping

The video coder expects the color values of the input video to be represented by integers, and in the case of the anchor the bit depth is set to 10 bits per color channel. This means that the range of available values, also referred to as available code words, is restricted to 2^10 = 1024 code words. Typically for SDR video a bit depth of 8 bits is used, which means 2^8 = 256 available code words.

In general the code words are not utilized very well, and in a given picture there are many code words that are used very seldom or possibly not at all. This is a problem as the wasted code words could be utilized to increase the precision of the color values that are used the most.

[Figure omitted: histograms of the Y', Cb, and Cr code words (0-1023) for a preprocessed frame of the Market sequence; the counts are on the order of 10^4.]

Figure 3.4: Histograms for the color components of the preprocessed Market sequence.

Figure 3.4 shows histograms of the color values in a frame from the Market video sequence that has gone through the preprocessing of the anchor coding chain. The PQ transfer function does a good job at optimizing the usage of the bits based on how humans perceive color; however, noticeable in the histograms is the fact that there are still a lot of unused code words.

The Market sequence is quite bright and has a wide range of luminance in it. In the histogram this is shown by the wide usage of the code words for the Y' component. The chroma components, on the other hand, utilize only a small portion of the available code words; the rest are completely wasted.

The proposed idea is to perform color value mapping (CVM) on the input values, similar to what the PQ transfer function does. However, the mapping curve here is generated based on the input data and the frequency of the different color values, with one curve generated per color component. The PQ transfer function is still very effective on its own when it comes to utilizing the properties of the human visual system, so the proposed technique will still be using the PQ transfer function.

[Figure omitted: the proposed preprocessing chain (input video → PQ-TF → R'G'B' to Y'CbCr → quantization to 16 bits → 4:4:4 to 4:2:0 → CVM to 10 bits → encoder) and the corresponding inverse postprocessing chain, with the intermediate signals labeled A-F and A'-F' and the mapping-curve metadata passed from the preprocessing to the postprocessing.]

Figure 3.5: Processing chains for the proposed CVM.

The proposed technique will perform pre- and postprocessing on the video before and after the video encoder. Figure 3.5 shows a diagram depicting the preferred pre- and postprocessing chains. The preprocessing chain consists of five components:

• Applying the PQ transfer function on content.

• Conversion from R’G’B’ color space to Y’CbCr.

• Quantization, conversion from floating-point to 16 bit integers.

• Chroma subsampling, converting from 4:4:4 to 4:2:0.

• Color Value Mapping (CVM), mapping component values from 16 bits to 10 bits.

The quantization to 16 bits, rather than 10 bits as in the anchor chain, together with the CVM block represent the proposed color value mapping technique. The postprocessing looks similar, performing the inverse of the preprocessing.

As visible in the figure, the technique also requires metadata describing the mapping curve to be distributed together with the video for the inverse color value mapping in the postprocessing.

3.6.1 Preprocessing

The CVM preprocessing is split into two subprocesses: first the generation of the mapping curve, and then the actual mapping of the color values. As input to the process (step E in figure 3.5) we have data in the 16 bit Y'CbCr 4:2:0 format.

Mapping Curve Generation

The mapping curves consist of a number of intervals with a predefined size, intervalSize, and the number of intervals is determined by the input bit depth. For example, in the case of an input bit depth of 16 and an interval size of 1024, the resulting number of intervals would be 2^16/1024 = 64. The curves are generated based on the histograms of the three input components Y', Cb, and Cr. The histograms are computed with one bin for every interval in the resulting curve, providing a lookup table for the number of values in a specific interval.

After the histograms have been built the algorithm will start building the curve by going through every interval. Every interval will be assigned a slope based on the ratio between the values in the current interval and the total number of values. This gives intervals with a higher number of values a higher slope, resulting in higher precision for the color intervals that are used the most. A predefined minimum slope value, C, is also added for every interval.

When the curves have been constructed they are normalized along the Y-axis to match the range the curve should map to. This depends on the properties of the coding chain; in this case the mapping should be done to 10 bits, but the technique allows for any output range, for instance if it were to be used in an 8 bit coding chain.

Algorithm 2 describes the procedure of generating the mapping curves:

Algorithm 2 Generate Mapping Curve
 1: intervalCount ← 2^inputBitDepth / intervalSize
 2: for each component do
 3:     BuildHistogram()
 4:
 5:     totalSum ← [Total number of values]
 6:     lastY ← 0
 7:     for i in range(0, intervalCount) do
 8:         x1 ← i ∗ intervalSize
 9:         x2 ← (i + 1) ∗ intervalSize
10:
11:         subSum ← sum of values in range(x1, x2)
12:         ratio ← subSum/totalSum
13:         y1 ← lastY
14:         y2 ← lastY + ratio + C
15:         lastY ← y2
16:         interval ← {x1, y1, x2, y2}
17:
18:         AddIntervalToCurve(component, interval)
19:     end for
20:
21:     // Normalize Y to range (0, 2^outputBitDepth)
22:     NormalizeIntervals()
23: end for

As the color frequencies of a video are expected to change with scene changes, etc., the procedure will not use the same mapping curve for all pictures. However, to minimize the amount of metadata transmitted and to avoid causing problems when pictures reference each other in inter-picture prediction, the curve is not generated for every frame. The proposed solution generates a new curve every GOP (Group Of Pictures) [2]; this assures that the curve is generated on a frame that uses intra-picture prediction and avoids any issues with inter-picture prediction frames referencing frames that use a different mapping curve.

Value Mapping

Figure 3.6 shows the process of mapping a 16 bit value, in, to a 10 bit value, out, using a generated mapping curve. First the interval containing the value to map needs to be located. When the interval has been located, the out value is computed using linear interpolation between the two points that define the interval.

The pseudocode below (algorithm 3) shows the process of performing color value mapping on an input picture.

[Figure omitted: a 16 bit input value in on the x-axis is mapped to a 10 bit output value out on the y-axis via the interval endpoints (x1, y1) and (x2, y2).]

Figure 3.6: Mapping curve.

Algorithm 3 Color Value Mapping
 1: for each component do
 2:     for i in range(0, number of values) do
 3:         x ← input[component][i]
 4:
 5:         // Find the interval containing the value we want to map
 6:         {x1, y1, x2, y2} ← FindInterval(intervals, x)
 7:
 8:         // Perform linear interpolation
 9:         y ← y1 + (y2 − y1) ∗ (x − x1)/(x2 − x1)
10:
11:         output[component][i] ← y
12:     end for
13: end for

3.6.2 Postprocessing

The postprocessing starts by performing the color value mapping in reverse. In this step (F' in figure 3.5), if the mapping in the preprocessing maps to 10 bits, this step maps from 10 bits back to 16 bits.

Given the metadata provided with the video, the unmapping process has a copy of the mapping curve used for the mapping. The process inverts this curve, simply swapping the axes, and then performs the mapping on the inverted curve in the same way as the preprocessing did with the original curve (algorithm 4).

Algorithm 4 Color Value Unmapping
 1: for each component do
 2:     for i in range(0, number of values) do
 3:         x ← input[component][i]
 4:
 5:         // Find the interval (on the y-axis) containing the value
 6:         {x1, y1, x2, y2} ← FindIntervalOnY(intervals, x)
 7:
 8:         // Perform linear interpolation
 9:         y ← x1 + (x2 − x1) ∗ (x − y1)/(y2 − y1)
10:
11:         output[component][i] ← y
12:     end for
13: end for

3.6.3 Parameters

The curve generation has two main parameters controlling how the curve will be shaped: the interval size and the base slope for the intervals.

Interval size A large interval size will result in a very linear curve, which is not as beneficial in cases where there are very large but narrow spikes in the frequency of color values in the input. A really small interval size will be able to provide more precision when it comes to the mapping. However, a small interval size will require more metadata to be shared between the pre- and postprocessing chains.

Base slope This is the base slope added to all intervals. Having the curve completely flat in regions not containing any values would be beneficial in terms of code word utilization, but the encoder would have a hard time coding the data efficiently with flat regions, as they amplify a lot of the noise in the original picture; therefore a base slope is required.

For these tests an interval size of 1024 and a base slope of 0.5 were used.
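Putting algorithms 2 and 3 together, the following is a compact Python sketch of the curve generation and forward mapping for one component, using the interval size and base slope from the tests above; it is a simplified sketch of the technique, not the evaluated implementation:

def generate_curve(values, interval_size=1024, base_slope=0.5,
                   input_bits=16, output_bits=10):
    """Build a piecewise-linear mapping curve from the value histogram
    of one component (mirrors algorithm 2)."""
    n_intervals = (1 << input_bits) // interval_size
    hist = [0] * n_intervals
    for v in values:
        hist[v // interval_size] += 1                      # one histogram bin per interval
    total = float(len(values))
    curve, last_y = [], 0.0
    for i in range(n_intervals):
        y2 = last_y + hist[i] / total + base_slope         # slope ~ value frequency + base slope C
        curve.append((i * interval_size, last_y, (i + 1) * interval_size, y2))
        last_y = y2
    scale = ((1 << output_bits) - 1) / last_y              # normalize y to the 10 bit output range
    return [(x1, y1 * scale, x2, y2 * scale) for x1, y1, x2, y2 in curve]

def map_value(curve, x, interval_size=1024):
    """Map one input value through the curve (mirrors algorithm 3)."""
    x1, y1, x2, y2 = curve[x // interval_size]
    return y1 + (y2 - y1) * (x - x1) / (x2 - x1)           # linear interpolation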

3.7 SAO XYZ

The video coding chain typically does not operate in the linear domain, but in the non-linear video domain (or gamma domain). This introduces some problems, with the result from the processing not corresponding to the correct result you would get if the processing were performed in the linear domain. The problem has not been that significant so far, and today's broadcasting infrastructure typically processes in the video domain. However, with the introduction of the greater range of HDR and the new accompanying transfer functions, which are much more non-linear than the existing gamma curves, processing in the video domain becomes more problematic.

This issue has still not been worked out, and even the anchor coding chain described in section 3.2, which is used in this thesis, performs processing in the video domain. The problems are especially caused by doing the chroma subsampling in the highly non-linear perceptual domain, which may cause various visible artifacts even with the actual compression step removed. The ideal solution would be to perform the processing in the linear domain, simply changing the order of the PQ-TF and the chroma subsampling. However, various obstacles such as backwards compatibility and issues with coding efficiency prevent this.

This section presents a proposal that is intended to correct the errors caused by the previously mentioned problem. The main idea is to modify the existing SAO filter to take into account the errors that may have occurred during the processing.

[Figure omitted: block diagram where the encoder with the modified SAO filter takes two inputs, the preprocessed 4:2:0 YCbCr video and the original unprocessed 4:4:4 RGB video, and outputs the bitstream.]

Figure 3.7: Block diagram of the new video chain.

Figure 3.7 shows a diagram of the proposed coding chain. In addition to the standard preprocessed video in the 4:2:0 YCbCr format, the encoder also takes the original unprocessed video in the 4:4:4 RGB format as input. The purpose of this is that the new modified SAO filter uses the unprocessed original as a reference, rather than the processed one, when determining the mode and offsets to use. This allows the filter to correct errors caused by the conversions and processing steps during the preprocessing.

In the HM reference software, the SAO process consists of two steps.

• First, statistics are collected about the picture to correct, comparing it to the original picture. This includes computing the differences between samples of the original and the reconstructed picture for all the different EO modes and BO bands.

• Secondly, the process makes decisions on which modes to use and derives the offsets based on the previously gathered statistics. For each SAO mode an offset is computed, together with a cost for that specific offset derived from the resulting distortion and the number of bits used for transmitting the offset. The coder then chooses the offset with the lowest cost for each sample, in an RDO fashion.

The proposed idea requires modification of both the mentioned steps. In the first step, in addition to the sample differences between the original and the reconstructed picture, a difference is calculated in the XYZ domain between the unprocessed original picture and the reconstructed picture. This requires the pictures to first be converted to the XYZ domain. The XYZ differences are then used when calculating the costs of the offsets.

Algorithm 5 shows the new proposed process for the SAO filter.

Algorithm 5 SAO Process
 1: // Convert unprocessed RGB original to X'Y'Z'
 2: orgXYZ ← ConvertRGBToXYZ(orgRGB)
 3: // Convert reconstructed Y'CbCr picture to X'Y'Z'
 4: recXYZ ← ConvertYCbCrToXYZ(recYCbCr)
 5:
 6: // Gather statistics
 7: stats ← GetSAOStats(orgYCbCr, recYCbCr, orgXYZ, recXYZ)
 8:
 9: // Decide SAO modes for all CTUs using RDO
10: DecideSAOModes(stats)

ConvertRGBToXYZ This procedure converts a linear-space 4:4:4 floating-point RGB picture to a 4:4:4 integer X’Y’Z’ picture. The conversion performs quantization of the original picture and therefore the resulting picture is converted to a non-linear format (using the PQ-TF). The conversion process is performed as follows:

• The PQ-TF (equation 2.2) is applied on the content.

• The content is converted from R’G’B’ to X’Y’Z’ using equation 2.16.

• The content is then quantized to a 10 bits per channel format.
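A minimal sketch of ConvertRGBToXYZ for one linear-light BT.2020 RGB sample is given below. The PQ constants are those of the standard PQ curve and the matrix is the standard BT.2020 RGB-to-XYZ matrix, so they may differ slightly in rounding from equations 2.2 and 2.16; full-range quantization to 10 bits is assumed here:

def convert_rgb_to_xyz(r, g, b, bit_depth=10):
    """Linear-light BT.2020 RGB (cd/m^2) -> 10 bit non-linear X'Y'Z' sample."""
    m1, m2 = 2610.0 / 16384, 2523.0 / 4096 * 128
    c1, c2, c3 = 3424.0 / 4096, 2413.0 / 4096 * 32, 2392.0 / 4096 * 32

    def pq_tf(v):                                   # eq. 2.2; input normalized to 10 000 cd/m^2
        v = min(max(v / 10000.0, 0.0), 1.0) ** m1
        return ((c1 + c2 * v) / (1 + c3 * v)) ** m2

    rp, gp, bp = pq_tf(r), pq_tf(g), pq_tf(b)       # apply the PQ-TF per channel
    x = 0.636958 * rp + 0.144617 * gp + 0.168881 * bp   # R'G'B' -> X'Y'Z' (eq. 2.16)
    y = 0.262700 * rp + 0.677998 * gp + 0.059302 * bp
    z = 0.000000 * rp + 0.028073 * gp + 1.060985 * bp
    max_code = (1 << bit_depth) - 1
    return tuple(int(round(c * max_code)) for c in (x, y, z))   # quantize to 10 bit codes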

ConvertYCbCrToXYZ This procedure converts a 4:2:0 Y’CbCr picture to a 4:4:4 integer X’Y’Z’ picture. The conversion process is performed as follows:

• The content is first chroma upsampled from 4:2:0 to 4:4:4 (see section 2.4.2).

• The Y’CbCr content is converted to R’G’B’, according to equation 2.10 for color samples in the BT.709 color space and equation 2.14 for color samples in BT.2020 color space.

• The content is converted from R'G'B' to X'Y'Z' using equation 2.16.

GetSAOStats The process of gathering the differences between the original picture and the reconstructed picture for the different types of offset modes. The differences for both the ordinary Y'CbCr space and the X'Y'Z' space are gathered.

DecideSAOModes This is the RDO process that makes decisions of which modes and offsets to use. It computes the offsets for all the different modes using a sum of the differences gathered earlier. The actual offset is computed using only the differences calculated in the Y’CbCr space, as this is what will be applied on the reconstructed Y’CbCr picture later. However, the cost of using an actual offset is determined using the differences in the X’Y’Z’ space.

Given the difference between the two spaces (Y'CbCr and X'Y'Z'), only the differences in the Y' component of X'Y'Z' are used when performing the RDO. For the other components (Cb and Cr) their ordinary differences are used, just as in the original version of the SAO process. The reason Y' of X'Y'Z' can be used is that both Y' of X'Y'Z' and Y' of Y'CbCr represent the luminance level.
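As a rough illustration of the modified luma decision, the sketch below derives the offset from the Y'CbCr statistics but evaluates its rate-distortion cost on the X'Y'Z' luma differences. This is only a sketch of the idea, not the HM code; lambda and the rate estimate (rate_bits) are assumed to come from the encoder, and the placeholder value used here is purely illustrative:

def sao_luma_cost(offset, n, sum_diff_xyz_y, rate_bits, lmbda):
    """Rate-distortion cost of applying 'offset' to the n samples of one SAO
    class, with the distortion change measured on the Y' of X'Y'Z' differences."""
    # Distortion change: sum((e - offset)^2) - sum(e^2) = n*offset^2 - 2*offset*sum(e)
    delta_dist = n * offset * offset - 2 * offset * sum_diff_xyz_y
    return delta_dist + lmbda * rate_bits

def decide_luma_offset(stats_ycbcr, stats_xyz, lmbda):
    """stats_*: (sum_of_differences, sample_count) for one class in each domain."""
    sum_y, n = stats_ycbcr
    sum_xyz, _ = stats_xyz
    if n == 0:
        return 0
    offset = int(round(sum_y / n))                  # offset derived in the Y'CbCr domain
    # Keep the offset only if it is cheaper (in the X'Y'Z' sense) than sending none.
    if sao_luma_cost(offset, n, sum_xyz, rate_bits=4, lmbda=lmbda) < 0:
        return offset
    return 0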

Chapter 4

Results

This chapter presents the results from the evaluation of all the implementations presented in chapter 3. That chapter also describes how the various results were gathered.

The three metrics used for the evaluation are tPSNR, PSNR∆E, and mPSNR. The reason for using all three metrics was that, before the CfE process was initiated, MPEG had no clear favorite among the three. However, after some analysis it has been identified that tPSNR and mPSNR are less suitable metrics, while PSNR∆E is considered to be a relatively suitable metric [15]. These evaluations will still look at all metrics, but this should be kept in mind.

4.1 HEVC Profile Tests

This section presents the results from the evaluation of the Main-RExt profiles of HEVC. The results are compared against the Main10 profile used by the anchor, with the 10 bits per channel, 4:2:0, configuration. Only objective results are presented here as there was no access to the hardware required for a subjective evaluation.

4.1.1 Main-RExt, 12 bits, 4:2:0

Table 4.1 shows the objective results from the evaluation of using 12 bits per color channel compared to the usual 10 bits for the anchor. Each column shows the BD-rate (see section 2.7.5) for each of the PSNR-based metrics presented in section 2.7 between the anchor and the coding chain with a 12 bits per channel configuration. A negative value means an increase in efficiency, as it signals a decrease in bitrate for a comparable quality, while a positive value means a decrease in efficiency. The colored cells indicate values worth taking a look at, with green marking an increase and red marking a decrease in efficiency.

Table 4.1: Anchor compared to Main-RExt 4:2:0 with 12 bits per channel.

Class  Sequence             tPSNR-X  tPSNR-Y  tPSNR-Z  tPSNR-XYZ  PSNR∆E  mPSNR
A      FireEaterClip4000r1  -1.5%    -1.3%    -5.6%    -2.7%      -3.2%   -11.0%
       Market3Clip4000r2    -0.2%    -0.4%    -0.2%    -0.2%      1.2%    -0.2%
       Tibul2Clip4000r1     -1.1%    -0.8%    -2.9%    -1.4%      -3.4%   -2.2%
B      AutoWelding          -1.3%    -0.8%    -3.2%    -1.9%      -5.3%   -8.8%
       BikeSparklers        -0.1%    0.0%     -0.7%    -0.3%      -3.6%   -1.8%
C      ShowGirl2Teaser      -0.3%    -0.2%    -0.8%    -0.4%      -5.3%   -1.9%
D      StEM MagicHour       -0.1%    -0.1%    -0.5%    -0.3%      -1.2%   -0.4%
       StEM WarmNight       0.1%     0.0%     -0.5%    -0.2%      -1.3%   -0.5%
G      BalloonFestival      -0.4%    -0.3%    -0.2%    -0.3%      -3.3%   -0.2%
       Overall              -0.5%    -0.4%    -1.6%    -0.9%      -2.8%   -3.0%

As visible in the table there are some small gains. The biggest gains can be seen in the sequences with a lot of dark areas, such as FireEater and AutoWelding. The more complex sequences, on the other hand, have almost negligible gains; Market even shows a small loss in the PSNR∆E metric.

One interesting fact, which may contradict the typical view of an increased bit depth, is that the bitrate for the sequences actually decreased when using 12 bits per channel. Table 4.2 shows the bitrates for all QPs of the two sequences with the largest gains, FireEater and AutoWelding.

Table 4.2: Bitrate of anchor compared to Main-RExt 4:2:0 with 12 bits per channel.

Sequence      QP   Anchor (kbps)   12 bits (kbps)
FireEater     20   1921.6          1906.3
              23   1259.9          1255.5
              26   811.8           809.0
              29   520.7           518.8
AutoWelding   21   3156.8          3131.8
              25   1382.7          1375.8
              29   778.0           775.0
              33   454.4           452.9

In conclusion, these results are not significant enough to make it worth considering permanent changes to the coding chain. There are a lot of additional costs when using a 12 bit chain, such as added complexity to the coder process, with an increase in coding time of about 20% compared to the anchor, and larger storage space requirements for intermediate formats.

Table 4.3: Anchor compared to Main-RExt 4:4:4 with 10 bits per channel.

Class  Sequence             tPSNR-X  tPSNR-Y  tPSNR-Z  tPSNR-XYZ  PSNR∆E  mPSNR
A      FireEaterClip4000r1  -16.7%   -1.9%    -44.3%   -21.9%     -34.1%  -37.2%
       Market3Clip4000r2    -1.1%    3.4%     -18.3%   -7.6%      -47.0%  -9.2%
       Tibul2Clip4000r1     -18.6%   -7.7%    -50.3%   -22.3%     -37.4%  -29.0%
B      AutoWelding          12.9%    26.1%    -24.1%   -1.6%      -35.4%  -6.3%
       BikeSparklers        17.0%    29.6%    -18.3%   3.0%       -35.2%  2.8%
C      ShowGirl2Teaser      -3.9%    4.5%     -29.5%   -12.0%     -46.5%  -15.8%
D      StEM MagicHour       2.5%     11.6%    -24.0%   -10.4%     -38.3%  -13.8%
       StEM WarmNight       -2.3%    9.5%     -22.0%   -9.3%      -37.7%  -24.2%
G      BalloonFestival      -8.1%    9.5%     -31.2%   -16.8%     -50.0%  -16.4%
       Overall              -2.0%    9.4%     -29.1%   -11.0%     -40.2%  -16.6%

4.1.2 Main-RExt, 10 bits, 4:4:4

Table 4.3 shows the objective results from coding the video using 4:4:4 chroma subsampling. This shows a significant gain compared to 4:2:0, as could be expected. Visible when looking at the tPSNR metric is that the gains are not evenly distributed over the components, and there is even some indicated loss for the Y component. This is caused by the fact that the total bitrate is increased compared to the anchor, while the gains have not increased equally for each component. The Y component shows a decrease in distortion, but the metric presented in the table is the BD-rate, which also takes the bitrate into account.

Using 4:4:4 means that there is an increased resolution for the chroma components. This is the reason for the smaller gains for the Y component, as the Y component is defined as the luminance of the scene. What is interesting is that there are still gains for the Y component, even though 4:4:4 only improves the chroma components.

Another interesting fact is the big differences in the three metrics. Looking at the results for the BikeSparklers sequence the PSNR∆E metric implies that there are gains while the mPSNR implies the opposite.

Similar to the 12 bit version, there are a lot of extra costs in replacing 4:2:0 in the coding chain. The use of 4:4:4 increases the total bitrate, and it is not clear from simply looking at these results whether there actually are any significant subjective gains. In this case the coding time is increased by about 50% compared to the anchor.

4.1.3 Main-RExt, 12 bits, 4:4:4

Table 4.4 shows the results for the 12 bits per channel, 4:4:4 setup. These results do not show anything interesting that has not already been discussed for the 12 bits per channel and the 4:4:4 runs. The results are basically a mix of the two previous results, as could be expected.

Table 4.4: Anchor compared to Main-RExt 4:4:4 with 12 bits per channel.

Class  Sequence             tPSNR-X  tPSNR-Y  tPSNR-Z  tPSNR-XYZ  PSNR∆E  mPSNR
A      FireEaterClip4000r1  -18.1%   -3.1%    -48.1%   -24.1%     -36.8%  -45.1%
       Market3Clip4000r2    -1.2%    3.3%     -18.4%   -7.7%      -47.9%  -9.3%
       Tibul2Clip4000r1     -20.0%   -8.9%    -53.4%   -24.1%     -40.5%  -30.7%
B      AutoWelding          11.7%    25.3%    -26.0%   -2.9%      -40.2%  -13.3%
       BikeSparklers        16.9%    29.5%    -18.4%   3.0%       -38.4%  1.9%
C      ShowGirl2Teaser      -4.2%    4.2%     -30.2%   -12.4%     -48.4%  -17.2%
D      StEM MagicHour       2.3%     11.5%    -24.2%   -10.6%     -38.9%  -13.9%
       StEM WarmNight       -2.4%    9.4%     -22.3%   -9.4%      -38.7%  -24.4%
G      BalloonFestival      -8.4%    9.2%     -31.3%   -17.0%     -50.6%  -16.4%
       Overall              -2.6%    8.9%     -30.3%   -11.7%     -42.3%  -18.7%

4.1.4 Main-RExt, QP offsets

HEVC allows one to specify offsets for the QP values per component. To investigate the results from the 4:4:4, 10 bits per channel test, another test was performed using positive QP offsets for the two chroma components.

Table 4.5: Anchor compared to Main-RExt 4:4:4 with 10 bits per channel and QP offsets set.

Class  Sequence             tPSNR-X  tPSNR-Y  tPSNR-Z  tPSNR-XYZ  PSNR∆E  mPSNR
A      FireEaterClip4000r1  -9.0%    -3.5%    -9.8%    -7.8%      -11.8%  -8.8%
       Market3Clip4000r2    0.1%     -0.3%    4.9%     2.0%       12.7%   2.1%
       Tibul2Clip4000r1     -12.7%   -6.8%    -25.6%   -13.6%     -19.3%  -19.1%
B      AutoWelding          -0.1%    -1.1%    5.7%     2.1%       10.5%   1.3%
       BikeSparklers        -0.3%    -1.2%    3.5%     0.9%       8.7%    1.6%
C      ShowGirl2Teaser      0.5%     0.0%     -0.9%    -0.2%      3.3%    0.8%
D      StEM MagicHour       -0.2%    -0.9%    2.1%     0.8%       4.6%    0.1%
       StEM WarmNight       0.6%     -0.3%    2.9%     1.5%       6.0%    -1.4%
G      BalloonFestival      -4.2%    -1.0%    -3.9%    -3.5%      -3.1%   -1.5%
       Overall              -2.8%    -1.7%    -2.3%    -2.0%      1.3%    -2.8%

Table 4.5 shows the anchor compared to the 10 bits per channel, 4:4:4 setup with a QP offset of 6 for the Cb component and a QP offset of 7 for the Cr component. These results show that the QP offsets can be used to lower the total bitrate at the cost of decreased gains for the chroma components.

4.2 Bitshifting at the CU Level

This section presents some objective results and a small discussion of the bitshifting proposal.

4.2.1 Variation 1

The first variation had a problem defined already from the start: the cost computation for the RDO of the coder. The method used and evaluated was to shift the coded 9 bits per channel version up to 10 bits before comparing it to the 10 bit original to compute the distortion of the coded CU.

The results from running tests using this method were identical to the results of the anchor. Some further analysis showed that the 9 bit CU version was never actually used, as the coding cost was always larger for 9 bits.

During the evaluation phase of this proposal there was some doubt about how beneficial it would actually be, as there was a very small difference between coding 8, 10, and 12 bits per channel. Because of this, the time felt best spent on other areas; the second proposal for the cost computation was not implemented, and variation 2 of the idea was implemented instead.

4.2.2 Variation 2

The main purpose of the second variation was to see if the shifting would be able to remove some noise by removing the least significant bit. The evaluation was first performed on a single frame and it showed some promise with a gain of −3.1% for the AutoWelding sequence. The results for the other sequences, however, were insignificant.

Table 4.6: Anchor compared to bitshifting solution.

Class  Sequence             tPSNR-X  tPSNR-Y  tPSNR-Z  tPSNR-XYZ  PSNR∆E  mPSNR
A      FireEaterClip4000r1  -0.1%    -0.1%    0.5%     0.1%       0.1%    0.2%
       Market3Clip4000r2    -0.1%    -0.1%    0.0%     -0.1%      0.4%    -0.1%
       Tibul2Clip4000r1     0.1%     0.0%     0.5%     0.1%       0.3%    0.5%
B      AutoWelding          -0.2%    -0.2%    0.4%     0.0%       1.0%    1.7%
       BikeSparklers        -0.3%    -0.2%    0.1%     -0.1%      -0.3%   0.1%
C      ShowGirl2Teaser      -0.2%    -0.1%    0.3%     0.0%       -0.8%   0.0%
D      StEM MagicHour       -0.1%    -0.1%    0.0%     0.0%       -0.1%   0.1%
       StEM WarmNight       0.0%     0.0%     -0.1%    -0.1%      0.0%    -0.3%
G      BalloonFestival      -0.2%    -0.2%    0.0%     -0.1%      -0.2%   -0.1%
       Overall              -0.1%    -0.1%    0.2%     0.0%       0.0%    0.2%

Table 4.6 shows the objective results of running the full sequences through the coding chain. Here the results for all sequences are insignificant, with neither gains nor losses. This means the result from the initial test was most likely just a coincidence.

Due to the lack of any gainful results this proposal was not investigated further.

4.3 Histogram Based Color Value Mapping

This section presents results from both objective and subjective tests for the histogram based color value mapping (CVM) implementation. The subjective tests performed only resulted in some simple comparisons between the anchor and the implemented idea.

4.3.1 Objective Results

Table 4.7: Anchor compared to histogram based color value mapping implementation.

Class  Sequence             tPSNR-X  tPSNR-Y  tPSNR-Z  tPSNR-XYZ  PSNR∆E  mPSNR
A      FireEaterClip4000r1  -7.2%    -0.2%    -17.7%   -8.6%      -15.3%  -42.0%
       Market3Clip4000r2    -2.0%    0.4%     -9.7%    -4.6%      -42.2%  -5.5%
       Tibul2Clip4000r1     -3.6%    0.8%     -15.3%   -4.4%      -11.6%  -8.4%
B      AutoWelding          4.1%     12.5%    -10.9%   -0.3%      -30.1%  -25.8%
       BikeSparklers        2.4%     7.0%     -10.6%   -1.5%      -35.0%  -9.1%
C      ShowGirl2Teaser      -1.3%    3.8%     -15.2%   -5.1%      -34.6%  -12.4%
D      StEM MagicHour       -0.4%    4.8%     -10.5%   -4.6%      -29.3%  -11.5%
       StEM WarmNight       -3.6%    3.4%     -10.6%   -5.2%      -31.3%  -19.1%
G      BalloonFestival      -2.9%    0.9%     -5.6%    -3.4%      -24.0%  -4.8%
       Overall              -1.6%    3.7%     -11.8%   -4.2%      -28.1%  -15.4%

Table 4.7 shows the objective results of the color value mapping, comparing the anchor to the implemented idea. The results show significant gains, at least when looking at the PSNR∆E and mPSNR metrics. Similar to the previous results there is a big difference between the PSNR∆E and mPSNR metrics, and PSNR∆E is the suggested metric to look at. The tPSNR metrics, however, are not as clear.

As the tPSNR metrics are calculated using the PQ-TF and Philips-TF transfer functions, they are biased towards contents transformed using any of these transfer functions. The color value mapping still uses the PQ-TF but maps the values even further, transferring them away from the PQ-TF. For this reason the tPSNR metrics are not as suitable for evaluating this feature.

Looking at the results it is clear that the sequences that benefit the most are the more complex sequences with a lot of colors, such as Market and BikeSparklers; Market with some complexity and a lot of color, and BikeSparklers with a lot of movement.

Worth mentioning is the fact that these results do not take into account the metadata that needs to be transmitted together with the video bitstream for the mapping process. However, the mapping data is only sent once every GOP (every 24 frames), and the data is quite compact, so it should not make a significant impact.

4.3.2 Subjective Results

This section presents some subjective results and points out some of the errors that could be seen by looking at the sequences. The pictures presented here are exposures of the original HDR video, since it is impossible to show the full range of HDR in this report, but they still point out the errors visible on an HDR display. The evaluation was done comparing three versions of the sequences:

• an original uncompressed version,

• a version compressed using the anchor coding chain,

• and a version compressed using the coding chain together with the proposed CVM.

These versions were converted and displayed sequentially on the SIM2 HDR display. The two compressed versions were compared at comparable resulting bitrates to get a fair comparison.

Figure 4.1: Comparison of the Market sequence. (a) Original, (b) Anchor, (c) CVM.

Figure 4.1 shows a part of the Market sequence from the three versions used. Both coded versions used a QP value of 33, and the bitrate of the anchor was slightly lower than that of the CVM version, see Table 4.8. Comparing the anchor to the CVM version, two errors are more visible in the anchor.

• Around the head of the person closest to the camera there is a clear discoloration. The error is visible in both compressed versions, but it is clearer in the anchor.

• The person partially covered by the car has a yellow tint, and it is visible that color information is lost in the anchor compared to the CVM version.

Figure 4.2: Another comparison of the Market sequence. (a) Original, (b) Anchor, (c) CVM.

Figure 4.2 shows another comparison of the Market sequence using the same QPs. Here there is a discoloration (a blue tint) on the tent in the anchor picture, while the CVM version keeps the original color.

Figure 4.3: Comparison of the MagicHour sequence. (a) Original, (b) Anchor, (c) CVM.

Figure 4.3 shows a comparison with an example of color information loss in the anchor. This is a comparison of the MagicHour sequence coded using a QP of 33 for both the anchor and CVM. The bitrates of the two versions can be found in Table 4.8. Comparing the anchor to CVM, the loss of color information is visible, and even though both compressed pictures are blurry compared to the original, more of the original color information is kept in the CVM picture.

Table 4.8: Bitrates of the evaluated sequences.

            Anchor (kbps)   CVM (kbps)
Market          1247,9        1282,8
MagicHour        771,0         829,5

Table 4.8 presents the bitrates of the compared sequences, both with a QP of 33.

From these results, the CVM method gives subjective improvements to the coding process. The big benefits seem to be in the color components; compared to the anchor, both discolorations and loss of color information are prevented in several cases. The types of errors presented here were only visible, and corrected by CVM, in some of the sequences. However, it is worth mentioning that CVM did not introduce any major new artifacts, possibly only some lesser ones that would be hidden in the anchor due to the loss of color information.

4.4 SAO XYZ

Table 4.9: Anchor compared to SAO XYZ implementation.

         Sequence              tPSNR X  tPSNR Y  tPSNR Z  tPSNR XYZ  PSNR∆E   mPSNR
class A  FireEaterClip4000r1     5,0%     5,5%     3,9%     4,8%      4,5%     6,7%
         Market3Clip4000r2       2,9%     3,3%     2,9%     3,0%      2,0%     3,5%
         Tibul2Clip4000r1        6,6%     6,8%     3,7%     6,1%      6,2%     6,1%
class B  AutoWelding            15,2%    16,7%    12,0%    14,3%      7,5%    33,5%
         BikeSparklers           7,2%     6,7%     2,7%     5,3%      3,6%     8,4%
class C  ShowGirl2Teaser         8,1%     8,5%     4,8%     7,0%      5,1%     9,4%
class D  StEM MagicHour          6,8%     7,3%     6,6%     6,8%      4,4%    11,4%
         StEM WarmNight          2,2%     2,7%     2,5%     2,5%      2,1%     2,9%
class G  BalloonFestival         1,6%     1,2%     1,6%     1,5%      0,8%     1,3%
         Overall                 6,2%     6,5%     4,5%     5,7%      4,0%     9,2%

Table 4.9 presents the results for the SAO XYZ proposal. Looking at the results there seems to be no benefit in using the proposed implementation; it even incurs significant losses compared to the anchor. However, this idea should generally be considered a work in progress, and it may still have some teething problems that need to be sorted out.

One reflection that instantly comes to mind is the fact that the actual offsets used are not calculated based on the unprocessed picture; the unprocessed picture is only used for determining the cost of a specific offset. Another is that the method is only implemented for the Y' component. However, this should not result in losses compared to the anchor, since it means that the same offsets as for the anchor are used for the chroma components.

Chapter 5 will present and discuss some future work and improvements that could possibly improve the results of this idea.

Chapter 5

Discussion

This chapter provides conclusions from the ideas and results presented in this thesis, together with a future work section providing possible directions for continued work in the area.

5.1 Reflections

A big issue during this work was the lack of a definitive metric for measuring the gains of a specific implementation. MPEG presented three metrics for the CfE in the hope that the CfE process would point out which metric was the best. Looking at the results in this report it is clear that there are differences between the metrics.

Looking at the results for the color value mapping, where some subjective testing was actually performed, the PSNR∆E metric seems to be the most accurate one.

This aligns with the fact that PSNR∆E was considered the most suitable one in [15].

The sequences with the best PSNR∆E values, such as Market with −42.2%, seem to be the ones with the biggest subjective gains. As mentioned, the tPSNR metrics may not be as suitable for the color value mapping case because of their bias towards content encoded with the Philips and PQ transfer functions, but the mPSNR metric should not be affected.

The objective metrics used in this work may not have been completely reliable. One typical problem when doing objective testing is finding a metric that corresponds well to the subjective results. The properties of HDR made the old metrics unsuitable, as the objective results deviated too far from the subjective results. Therefore new methods need to be developed, and that is why one of the goals of the CfE was to evaluate these new metrics. They did still give some indication of the results, but it is important to remember that these are still objective metrics and that, in a case like this, the subjective results are what matter.

5.2 Conclusions

This thesis presented an analysis of HDR video and its requirements, together with three different suggestions for improving the coding chain for HDR video presented by MPEG for the CfE [9], all intended to increase the coding efficiency:

• bitshifting at the CU level,

• histogram based color value mapping,

• and SAO optimized for XYZ.

The results of the suggestions were mixed, with the bitshifting resulting in neither gains nor losses and the modified SAO filter resulting in losses. The color value mapping suggestion, however, resulted in a significant overall gain of around −28.1% looking at the PSNR∆E metric. In addition to the objective gains it also showed its potential through clearly visible subjective improvements. However, since the suggestion changes the decoder side of the coding chain, it would be classified as a normative change if it were ever to become part of the standard. To justify normative changes, really significant improvements to the performance are required, and even though there are gains, they are not that big.

This work provides possible directions to follow for the future, and even though the CfE process has ended, the standardization process has just begun. This thesis has shown that there is room for performance improvements, not only in the coder itself: changes in the pre- and postprocessing alone have been shown to provide significant gains. Hopefully this work and its presented suggestions will help in the search for even better solutions.

5.3 Future Work

Generally there is a lot of potential for improving the coding efficiency of HDR video. There are a number of possible directions to go in order to find improvements to the existing standard. This thesis presented ideas in three completely different areas. Most success was found in improving the utilization of the available codewords, using the presented color value mapping method. It could be suitable to continue in this direction, as one of the problems introduced by HDR and WCG compared to typical video coding is the increased range of color values that needs to be represented.

All the suggestions presented in this thesis show a potential for improvements, even though some base ideas seem more valuable to pursue than others. The suggestion that would probably benefit the most from continued work is the SAO filter optimized for XYZ, as this should generally be considered a work in progress.

5.3.1 Bitshifting at the CU Level

During the development of this suggestion the biggest problem to tackle was how to perform the cost computation for the RDO in variation 1. A fair distortion computation between the CU coded in a 9 bits per channel format and the original 10-bit CU was needed.
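One possible way to make such a comparison fair is to rescale the lower-bit-depth reconstruction back to the 10-bit domain, with a rounding offset, before accumulating the squared error, so that both variants are measured on the same scale. The sketch below shows this idea; the function name and the choice of rounding offset are assumptions for illustration, not the computation used in the HM software.

```python
import numpy as np

def cu_distortion_10bit(orig_10bit, recon, recon_bitdepth):
    """Sum of squared error between a 10-bit original CU and a reconstruction
    coded at a lower bit depth, measured in the 10-bit domain.
    Hypothetical helper; the rounding offset is one possible design choice."""
    shift = 10 - recon_bitdepth
    if shift > 0:
        # Scale up and add half a step so the lower-bit-depth reconstruction is
        # not systematically biased downwards relative to the 10-bit original.
        recon = (recon.astype(np.int64) << shift) + (1 << (shift - 1))
    diff = orig_10bit.astype(np.int64) - recon
    return int(np.sum(diff * diff))
```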

However, another thing worth looking into is whether there actually are any cases where there are gains when coding in 9 bits compared to 10 bits; this suggestion would be fruitless if there were no such cases. Also, comparing the results from coding in 8, 10, and 12 bits suggests that the benefits, if any, would be very small for coding in 9 bits.

5.3.2 Histogram Based Color Value Mapping

For the color value mapping there are a number of improvements that could be implemented and analyzed to see if they could possibly improve the performance of the chain even further.

• One thing to consider is whether the parameters used for the mapping curve generation could be optimized even further, as the parameters used in this thesis were derived from only a few test runs.

• One could possibly avoid the initial quantization in the preprocessing by simply allowing the color value mapping to map directly from floating-point values. This would require no major changes to the existing implementation; the only change required is to modify the X-axis of the mapping curve to be in whatever range the floating-point values are in (see the sketch after this list).

• For the SAO XYZ suggestion the problem of performing the processing in a non-linear domain was presented. One could avoid this issue by removing the PQ-TF at the beginning of the chain, avoiding the quantization as mentioned in the previous suggestion, and generating the mapping curve using a weighted mix between the PQ-TF curve and the histogram-based curve (see the sketch after this list).

• Another idea would be to perform the mapping curve generation in a similar fashion to the RDO of the encoder, optimizing the parameters to get the most beneficial mapping curve.
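To make the second and third points above more concrete, the sketch below generates a mapping curve as a weighted mix of the PQ curve and a histogram-based (CDF) curve, with the X-axis kept in the floating-point domain, and applies it by linear interpolation. The function names, the number of pivot points and the 50/50 weighting are illustrative assumptions and not the implementation evaluated in this thesis.

```python
import numpy as np

def pq_oetf(L):
    # SMPTE ST 2084 (PQ); L normalized to [0, 1] where 1.0 corresponds to 10000 cd/m2.
    m1, m2 = 2610/16384, 2523/4096*128
    c1, c2, c3 = 3424/4096, 2413/4096*32, 2392/4096*32
    Lm1 = np.power(np.clip(L, 0.0, 1.0), m1)
    return np.power((c1 + c2*Lm1) / (1.0 + c3*Lm1), m2)

def build_mapping_curve(linear_samples, n_pivots=64, weight=0.5):
    """Pivot points of a mapping curve mixing the PQ curve with a
    histogram-equalizing (CDF-based) curve. weight = 1.0 gives pure PQ."""
    x = np.linspace(0.0, 1.0, n_pivots)                  # X-axis in the float domain
    hist, _ = np.histogram(np.clip(linear_samples, 0.0, 1.0),
                           bins=n_pivots, range=(0.0, 1.0))
    cdf = np.cumsum(hist) / max(hist.sum(), 1)           # histogram-based curve
    y = weight * pq_oetf(x) + (1.0 - weight) * cdf
    return x, y

def apply_mapping(linear_samples, x, y):
    # Map floating-point values directly, skipping the initial quantization step.
    return np.interp(np.clip(linear_samples, 0.0, 1.0), x, y)
```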

However, the main thing to actually implement would be the signaling of the metadata to the decoder. This could be used to evaluate the actual cost of the suggestion, since the metadata is not accounted for in this thesis, and it would also be required if the suggestion were to be used in a real-life scenario. The signaling could be performed in a number of different ways; one example would be to use the SEI [2] messaging specified in the HEVC standard.
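To give a feel for how compact such metadata could be, the sketch below packs the pivot values of one mapping curve into a raw byte payload. This is a purely hypothetical payload layout used for size estimation; an actual deployment would have to express the data in the SEI syntax defined by the HEVC standard [2], which is not reproduced here.

```python
import struct

def pack_mapping_curve(pivots_y):
    """Hypothetical byte layout: a count followed by the curve values quantized
    to 16 bits each; not actual HEVC SEI syntax."""
    quantized = [min(65535, max(0, int(round(v * 65535)))) for v in pivots_y]
    return struct.pack(f">H{len(quantized)}H", len(quantized), *quantized)

payload = pack_mapping_curve([i / 63.0 for i in range(64)])
print(len(payload), "bytes per curve")   # 130 bytes for 64 pivots
```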

5.3.3 SAO XYZ

This suggestion probably has a lot of teething problems that need fixing. Two possible directions for future work are:

• Investigate the use of the unprocessed X'Y'Z' picture for calculating the offsets; currently it is only used for computing the cost of a specific offset. This would of course require solving the problem of computing the offsets in the right space, as the offsets are expected to be in a Y'CbCr space (a minimal sketch of a per-band offset derivation is given after this list).

• Currently the method only works for the Y' component. This is quite simple, as the Y' of X'Y'Z' and the Y' of Y'CbCr both refer to the luminance, but how would one implement the method for the chroma components?
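For the first point, a natural starting point is the way band offsets are normally derived in SAO: the offset of a band is the mean difference between a reference picture and the reconstruction over the samples falling in that band. The sketch below shows that derivation in its simplest form, with the reference taken as the unprocessed picture; the four-band selection, clipping and rate cost of the actual HEVC SAO are omitted, and the X'Y'Z' versus Y'CbCr domain problem discussed above is not addressed.

```python
import numpy as np

def band_offsets(reference, reconstruction, bit_depth=10, num_bands=32):
    """Mean (reference - reconstruction) difference per band, for integer sample
    arrays. Samples are classified by their most significant bits, as in HEVC
    band offset mode; band selection, clipping and rate cost are omitted."""
    shift = bit_depth - 5                      # 32 bands over the full sample range
    bands = reconstruction >> shift
    diff = reference.astype(np.int64) - reconstruction.astype(np.int64)
    offsets = np.zeros(num_bands)
    for b in range(num_bands):
        mask = bands == b
        if mask.any():
            offsets[b] = diff[mask].mean()
    return offsets
```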

In addition to performance improvements, this suggestion would also benefit a lot from removing the requirement of an additional input to the encoder. This problem would be hard to solve, but solving it would most likely be required for the suggestion to be widely accepted as a possible solution, since the additional input would require major changes to existing coding chains.

Bibliography

[1] High Dynamic Range Now Available on Amazon Instant Video Exclusively for Prime Members. http://phx.corporate-ir.net/phoenix.zhtml?c=176060&p=irol-newsArticle&ID=2062190. Accessed: July 2015.

[2] Sullivan, Gary J and Ohm, Jens and Han, Woo-Jin and Wiegand, Thomas. Overview of the high efficiency video coding (HEVC) standard. IEEE Transactions on Circuits and Systems for Video Technology, 22(12):1649–1668, 2012.

[3] ITU Telecommunication Standardization. ITU-T Recommendation H.265: High Efficiency Video Coding. Telecommunication Standardization Sector, 2015.

[4] High dynamic range image encodings. http://www.anyhere.com/gward/hdrenc/hdr_encodings.html. Accessed: January 2015.

[5] Taoran Lu, Fangjun Pu, Peng Yin, Tao Chen, and Walt Husak. Implication of high dynamic range and wide color gamut content distribution. In SPIE Optical Engineering+ Applications, pages 95990B–95990B. International Society for Optics and Photonics, 2015.

[6] Yang Zhang, Dimitris Agrafiotis, and David R Bull. High Dynamic Range image & video compression a review. In 18th International Conference on Digital Signal Processing (DSP), pages 1–7. IEEE, 2013.

[7] Amin Banitalebi-Dehkordi, Mani Azimi, Mahsa T Pourazad, and Panos Nasiopoulos. Compression of high dynamic range video using the HEVC and H.264/AVC standards. In 10th International Conference on Heterogeneous Networking for Quality, Reliability, Security and Robustness (QShine), pages 8–12. IEEE, 2014.

[8] Blu-ray Disc Format - General 4th Edition. http://www.blu-raydisc.com/Assets/Downloadablefile/White_Paper_General_4th_20150817_clean.pdf. Accessed: September 2015.

[9] Ajay Luthra, Edouard François, Walt Husak. Call for Evidence (CfE) for HDR and WCG Video Coding, February 2015. URL http://mpeg.chiariglione.org/standards/exploration/high-dynamic-range-and-wide-colour-gamut-content-distribution/call-evidence. Accessed: February 2015.

[10] SIM2 Multimedia. SIM2 - High Dynamic Range Display Series. URL http://www.sim2hdr.com/. Accessed: July 2015.

[11] Tim Borer. Non-linear Opto-Electrical Transfer Functions for High Dynamic Range Television. 2014.

[12] Scott Miller, Mahdi Nezamabadi, and Scott Daly. Perceptual signal coding for more efficient usage of bit codes. In SMPTE Conferences, volume 2012, pages 1–9. Society of Motion Picture and Television Engineers, 2012.

[13] Yang Zhang, Erik Reinhard, and David Bull. Perception-based high dynamic range video compression with optimal bit-depth transformation. In Image Processing (ICIP), 2011 18th IEEE International Conference on, pages 1321–1324. IEEE, 2011.

[14] Mikael Le Pendu, Christine Guillemot, and Dominique Thoreau. Adaptive re-quantization for high dynamic range video compression. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7367–7371. IEEE, 2014.

[15] Koohyar Minoo, Zhouye Gu, David Baylon, and Ajay Luthra. On metrics for objective and subjective evaluation of high dynamic range video. In SPIE Optical Engineering+ Applications, pages 95990F–95990F. International Society for Optics and Photonics, 2015.

[16] Rafał Mantiuk, Kil Joong Kim, Allan G Rempel, and Wolfgang Heidrich. HDR-VDP-2: a calibrated visual metric for visibility and quality predictions in all luminance conditions. In ACM Transactions on Graphics (TOG), volume 30, page 40. ACM, 2011.

[17] Wei Dai, Madhu Krishnan, and Pankaj Topiwala. Chroma sampling and modulation techniques in high dynamic range video coding. In SPIE Optical Engineering+ Applications, pages 95990D–95990D. International Society for Optics and Photonics, 2015.

[18] Ajay Luthra, Edouard François, Walt Husak. Requirements and Use Cases for HDR and WCG Content Coding, February 2015. URL http://mpeg.chiariglione.org/standards/exploration/high-dynamic-range-and-wide-colour-gamut-content-distribution/requirements-and. Accessed: February 2015.

[19] Hahn, Lance. Photometric Units. URL http://retina.anatomy.upenn.edu/~rob/lance/units_photometric.html. Accessed: July 2015.

[20] Mischler, Georg. Lighting Design Glossary – Luminance. URL http://www.schorsch.com/en/kbase/glossary/luminance.html. Accessed: July 2015.

[21] Peter GJ Barten. Contrast sensitivity of the human eye and its effects on image quality, volume 72. SPIE press, 1999.

[22] Recommendation ITU-R BT.1886-0, Reference electro-optical transfer function for flat panel displays used in HDTV studio production. ITU, 2011.

[23] Recommendation ITU-R BT.709-5, Parameter values for the HDTV standards for production and international programme exchange. ITU, 2002.

[24] Recommendation ITU-R BT.2020-1, Parameter values for ultra-high definition television systems for production and international programme exchange. ITU, 2014.

[25] A Standard Default Color Space for the Internet - sRGB. http://www.w3.org/Graphics/Color/sRGB. Accessed: February 2015.

[26] Winkler, Stefan and Kunt, Murat and van den Branden Lambrecht, Christian J. Vision and video: models and applications. In Vision Models and Applications to Image and Video Processing, page 209. Springer, 2001.

[27] Smith, Thomas and Guild, John. The CIE colorimetric standards and their use. Transactions of the Optical Society, 33(3):73, 1931.

[28] János Schanda. Colorimetry: understanding the CIE system. John Wiley & Sons, 2007.

[29] SMPTE, RP. 431-2. Reference projector and environment for display of DCDM in review rooms and theaters, 2006.

[30] Poynton, Charles. Chroma subsampling notation. URL http://www.poynton.com/PDFs/Chroma_subsampling_notation.pdf. Accessed: March 2015.

[31] OpenEXR. http://www.openexr.com/. Accessed: February 2015.

[32] Industrial Light & Magic. http://www.ilm.com/. Accessed: February 2015.

[33] IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1–70, Aug 2008.

[34] TIFF Revision 6.0. http://partners.adobe.com/public/developer/en/tiff/TIFF6.pdf. Accessed: February 2015.

[35] Understanding HD Formats. http://www.microsoft.com/windows/windowsmedia/howto/articles/understandinghdformats.aspx. Accessed: March 2015.

[36] Andreas Unterweger. Compression artifacts in modern video coding and state-of-the-art means of compensation. Multimedia Networking and Coding, page 28, 2012.

[37] Michael Yuen and HR Wu. A survey of hybrid MC/DPCM/DCT video coding distortions. Signal processing, 70(3):247–278, 1998.

[38] Fu, Chih-Ming and Alshina, Elena and Alshin, Alexander and Huang, Yu-Wen and Chen, Ching-Yeh and Tsai, Chia-Yang and Hsu, Chih-Wei and Lei, Shaw-Min and Park, Jeong-Hoon and Han, Woo-Jin. Sample adaptive offset in the HEVC standard. IEEE Transactions on Circuits and Systems for Video Technology, 22(12):1755–1764, 2012.

[39] Wiegand, Thomas and Sullivan, Gary J and Bjontegaard, Gisle and Luthra, Ajay. Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):560–576, 2003.

[40] Ohm, J-R and Sullivan, Gary J and Schwarz, Heiko and Tan, Thiow Keng and Wiegand, Thomas. Comparison of the coding efficiency of video coding standards—including high efficiency video coding (HEVC). IEEE Transactions on Circuits and Systems for Video Technology, 22(12):1669–1684, 2012.

[41] Samet, Hanan. The quadtree and related hierarchical data structures. ACM Computing Surveys (CSUR), 16(2):187–260, 1984.

[42] Norkin, Andrey and Bjontegaard, Gisle and Fuldseth, Arild and Narroschke, Matthias and Ikeda, Masaru and Andersson, Kenneth and Zhou, Minhua and Van der Auwera, Geert. HEVC deblocking filter. IEEE Transactions on Circuits and Systems for Video Technology, 22(12):1746–1754, 2012.

[43] The emergence of HEVC and 10-bit colour formats. http://blog.imgtec.com/powervr-video/the-emergence-of-hevc-and-10-bit-colour-formats. Accessed: July 2015.

[44] Focus on HEVC: The background behind the game-changing standard - Ericsson. http://www.ericsson.com/tv-media/blog/focus-hevc-background-behind-game-changing-standard-ericsson/. Accessed: July 2015.

[45] Rerabek, Martin and Hanhart, Philippe and Korshunov, Pavel and Ebrahimi, Touradj. Subjective and objective evaluation of HDR video compression. In 9th International Workshop on Video Processing and Quality Metrics for Consumer Electronics (VPQM), number EPFL-CONF-203874, 2015.

[46] Sharma, Gaurav and Wu, Wencheng and Dalal, Edul N. The CIEDE2000 color-difference formula: Implementation notes, supplementary test data, and mathematical observations. Color Research & Application, 30(1):21–30, 2005.

[47] CIE. 142-2001. Improvement to Industrial Colour-Difference Evaluation. Vienna: Central Bureau, 2001.

[48] Munkberg, Jacob and Clarberg, Petrik and Hasselgren, Jon and Akenine-Möller, Tomas. High Dynamic Range Texture Compression for Graphics Hardware. ACM Transactions on Graphics (TOG), 25(3):698–706, 2006.

[49] Gisle Bjontegaard. Calculation of average PSNR differences between RD-curves. Doc. VCEG-M33 ITU-T Q6/16, Austin, TX, USA, 2-4 April 2001, 2001.

[50] HEVC reference software. https://hevc.hhi.fraunhofer.de/. Accessed: February 2015.