2 Acquisition, Representation, Display, and Perception of Image and Video Signals

In digital video communication, we typically capture a natural scene by a camera, transmit or store data representing the scene, and finally reproduce the captured scene on a display. The camera converts the light emitted or reflected from objects in a three-dimensional scene into arrays of discrete-amplitude samples. In the display device, the arrays of discrete-amplitude samples are converted into light that is emitted from the display and perceived by human beings. The primary task of video coding is to represent the sample arrays generated by the camera and used by the display device with a small number of bits, suitable for transmission or storage. Since the achievable compression for an exact representation of the sample arrays recorded by a camera is not sufficient for most applications, the sample arrays are modified in a way that they can be represented with a given maximum number of bits or bits per time unit. Ideally, the degradation of the perceived image quality due to the modifications of the sample arrays should be as small as possible. Hence, even though video coding eventually deals with mapping arrays of discrete-amplitude samples into a bitstream, the quality of the displayed video is largely influenced by the way we acquire, represent, display, and perceive visual information.


Certain properties of human visual perception have in fact a large impact on the construction of cameras, the design of displays, and the way visual information is represented as sample arrays. And even though today's video coding standards have been mainly designed from a signal processing perspective, they provide features that can be used for exploiting some properties of human vision. A basic knowledge of human vision, the design of cameras and displays, and the representation formats used is essential for understanding the interdependencies of the various components in a video communication system. For designing video coding algorithms, it is also important to know what impact changes in the sample arrays, which are eventually coded, have on the perceived quality of images and video.

In the following section, we start with a brief review of basic properties of image formation by lenses. Afterwards, we discuss certain aspects of human vision and describe raw data formats that are used for representing visual information. Finally, an overview of the design of cameras and displays is given. For additional information on these topics, the reader is referred to the comprehensive overview in [70].

2.1 Fundamentals of Image Formation

In digital cameras, a three-dimensional scene is projected onto an image sensor, which measures physical quantities of the incident light and converts them into arrays of samples. For obtaining an image of the real world on the sensor's surface, we require a device that projects all rays of light that are emitted or reflected from an object point and fall through the opening of the camera into a point in the image plane. The simplest such device is the pinhole, which blocks basically all light coming from a particular object point, except a single pencil of rays, from reaching the light-sensitive surface. Due to their poor optical resolution and extremely low light efficiency, pinhole optics are not used in practice; lenses are used instead. In the following, we review some basic properties of image formation using lenses. For more detailed treatments of the topic of optics, we recommend the classic references by Born and Wolf [6] and Hecht [33].

2.1.1 Image Formation with Lenses

Lenses consist of transparent materials such as glass. They change the direction of light rays falling through the lens due to refraction at the boundary between the lens material and the surrounding air. The shape of a lens determines how the wavefronts of the light are deformed. Lenses that project all light rays originating from an object point into a single image point have a hyperbolic shape on both sides [33]. This is, however, only valid for monochromatic light and a single object point; there are no lens shapes that form perfect images of objects. Since it is easier and less expensive to manufacture lenses with spherical surfaces, most lenses used in practice are spherical lenses. Aspheric lenses are, however, often used for minimizing aberrations in lens systems.

Thin Lenses. We restrict our considerations to paraxial approximations (the angles between the light rays and the optical axis are very small) for thin lenses (the thickness is small compared to the radii of curvature). Under these assumptions, a lens projects an object at a distance s from the lens onto an image plane located at a distance b on the other side of the lens, see Figure 2.1(a). The relationship between the object distance s and the image distance b is given by
\[
\frac{1}{s} + \frac{1}{b} = \frac{1}{f},  \tag{2.1}
\]
which is known as the Gaussian lens formula (a derivation is, for example, given in [33]). The quantity f is called the focal length and represents the distance from the lens plane at which light rays that are parallel to the optical axis are focused into a single point. For focusing objects at different locations, the distance b between lens and image sensor can be modified. Far objects (s → ∞) are in focus if the distance b is approximately equal to the focal length f. As illustrated in Figure 2.1(b), for a given image sensor, the focal length f of the lens determines the field of view. With d representing the width, height, or diagonal of the image sensor, the angle of view is given by
\[
\theta = 2 \arctan\!\left(\frac{d}{2f}\right).  \tag{2.2}
\]
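As a quick numerical illustration of (2.1) and (2.2), the following Python sketch computes the image distance for a focused object and the diagonal angle of view. The focal length, object distance, and sensor diagonal are arbitrary example values, not taken from the text.

```python
import math

def image_distance(f, s):
    """Image distance b from the Gaussian lens formula 1/s + 1/b = 1/f, eq. (2.1)."""
    return 1.0 / (1.0 / f - 1.0 / s)

def angle_of_view(f, d):
    """Angle of view (radians) for sensor dimension d and focal length f, eq. (2.2)."""
    return 2.0 * math.atan(d / (2.0 * f))

# Example values (assumed): 50 mm lens, object at 10 m, sensor diagonal of about 43.3 mm.
f = 0.050      # focal length in meters
s = 10.0       # object distance in meters
d = 0.0433     # sensor diagonal in meters

b = image_distance(f, s)
theta = angle_of_view(f, d)
print(f"image distance b = {b * 1000:.2f} mm")            # slightly larger than f
print(f"diagonal angle of view = {math.degrees(theta):.1f} degrees")
```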

Figure 2.1: Image formation with lenses: (a) Object and image location for a thin convex lens; (b) Angle of view; (c) Aperture; (d) Relationship between the acceptable diameter c for the circle of confusion and the depth of field D.

Aperture. Besides the focal length, lenses are characterized by their aperture, which is the opening of a lens. As illustrated in Figure 2.1(c), the aperture determines the bundle of rays focused in the image plane. In camera lenses, typically adjustable apertures with an approximately circular shape are used. The aperture diameter a is commonly notated as f/F , where F is the so-called f-number,

F = f/a. (2.3)

For example, an aperture of f/4 corresponds to an f-number of 4 and specifies that the aperture diameter a is equal to 1/4 of the focal length. For a given distance b between lens and sensor, only object points that are located in a plane at a particular distance s are focused on the sensor. As shown in Figure 2.1(d), object points located at distances s + Δs_F and s − Δs_N would be focused at image distances b − Δb_F and b + Δb_N, respectively. On the image sensor, at the distance b, these object points appear as blur spots, which are called circles of confusion. If the blur spots are small enough, the projected objects still appear to be sharp in a photo or video. Given a maximum acceptable diameter c for the circles of confusion, we can derive the range of object distances for which we obtain a sharp projection on the image sensor. By considering similar triangles at the image side in Figure 2.1(d), we get
\[
\frac{\Delta b_N}{b + \Delta b_N} = \frac{c}{a} = \frac{F\,c}{f}
\qquad\text{and}\qquad
\frac{\Delta b_F}{b - \Delta b_F} = \frac{c}{a} = \frac{F\,c}{f}.  \tag{2.4}
\]

Using the Gaussian lens formula (2.1) for representing b, b + Δb_N, and b − Δb_F as functions of the focal length f and the corresponding object distances, and solving for Δs_F and Δs_N yields
\[
\Delta s_F = \frac{F\,c\,s\,(s-f)}{f^2 - F\,c\,(s-f)}
\qquad\text{and}\qquad
\Delta s_N = \frac{F\,c\,s\,(s-f)}{f^2 + F\,c\,(s-f)}.  \tag{2.5}
\]
The distance D between the nearest and farthest objects that appear acceptably sharp in an image is called the depth of field. It is given by
\[
D = \Delta s_F + \Delta s_N
  = \frac{2\,F\,c\,f^2\,s\,(s-f)}{f^4 - F^2 c^2\,(s-f)^2}
  \approx \frac{2\,F\,c\,s^2}{f^2}.  \tag{2.6}
\]
For the simplification at the right side of (2.6), we used the often valid approximations s ≫ f and c ≪ f²/s. The maximum acceptable diameter c for the circle of confusion could be defined as the distance between two photocells on the image sensor. Based on considerations about the resolution capabilities of the human eye and the typical viewing angle for a photo or video, it is, however, common practice to define c as a fraction of the sensor diagonal d, for example, c ≈ d/1500. By using this rule and applying (2.2), we obtain the approximation
\[
D \approx 0.005 \cdot \frac{F\,s^2}{d} \cdot \tan^2\!\left(\frac{\theta}{2}\right),  \tag{2.7}
\]
where θ denotes the diagonal angle of view. Note that the depth of field increases with decreasing sensor size. When we film a scene with a given camera, the depth of field can basically only be influenced by changing the aperture of the lens. As an example, if we use a 36 mm × 24 mm sensor and a 50 mm lens with an aperture of f/1.4 and focus an object at a distance of s = 10 m, all objects in the range from 8.6 m to 11.9 m appear acceptably sharp. By decreasing the aperture to f/8, the depth of field is increased to a range of about 5 m to 122 m.
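The depth-of-field example above can be reproduced with a small Python sketch based on (2.5); the function name is ours, and the circle of confusion follows the d/1500 rule mentioned in the text.

```python
import math

def depth_of_field_limits(f, F, s, c):
    """Near and far limits of acceptable sharpness from eq. (2.5); all lengths in mm."""
    ds_far  = F * c * s * (s - f) / (f**2 - F * c * (s - f))
    ds_near = F * c * s * (s - f) / (f**2 + F * c * (s - f))
    return s - ds_near, s + ds_far

# Example from the text: 36 mm x 24 mm sensor, 50 mm lens, object focused at 10 m.
d = math.hypot(36.0, 24.0)    # sensor diagonal in mm
c = d / 1500.0                # acceptable circle of confusion
for F in (1.4, 8.0):
    near, far = depth_of_field_limits(f=50.0, F=F, s=10_000.0, c=c)
    print(f"f/{F}: sharp from {near / 1000:.1f} m to {far / 1000:.1f} m")
# Expected output: roughly 8.6-11.9 m at f/1.4 and about 5-122 m at f/8.
```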

Figure 2.2: Perspective projection of the 3-dimensional space onto an image plane.

Projection by Lenses. As we have discussed above, a lens actually generates a three-dimensional image of a scene, and the image sensor basically extracts a plane of this three-dimensional image. For many applications, the projection of the three-dimensional world onto the image plane can be reasonably well approximated by the perspective projection model. If we define the world and image coordinate systems as illustrated in Figure 2.2, a point P at world coordinates (X, Y, Z) is projected into a point p at the image coordinates (x, y), given by
\[
x = \frac{b}{Z}\,X \approx \frac{f}{Z}\,X
\qquad\text{and}\qquad
y = \frac{b}{Z}\,Y \approx \frac{f}{Z}\,Y.  \tag{2.8}
\]
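A minimal sketch of the perspective projection (2.8), with b approximated by the focal length f; the numeric values are arbitrary examples.

```python
def project_point(X, Y, Z, f):
    """Perspective projection (2.8) of a world point onto the image plane (b ~ f)."""
    if Z <= 0:
        raise ValueError("point must lie in front of the camera (Z > 0)")
    return (f * X / Z, f * Y / Z)

# Example (assumed values): 50 mm lens, point 2 m to the right, 1 m up, 20 m away.
x, y = project_point(X=2.0, Y=1.0, Z=20.0, f=0.050)
print(f"image coordinates: x = {x * 1000:.2f} mm, y = {y * 1000:.2f} mm")  # 5.00 mm, 2.50 mm
```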

2.1.2 Diffraction and Optical Resolution

Until now, we assumed that rays of light in a homogeneous medium propagate in rectilinear paths. Experiments show, however, that light rays are bent when they encounter small obstacles or openings. This phenomenon is called diffraction and can be explained by the wave character of light. As we will discuss in the following, diffraction effects limit the resolving power of optical instruments such as cameras. A mathematical theory of diffraction was formulated by Kirchhoff [59] and later modified by Sommerfeld [78].

Figure 2.3: Diffraction in cameras: (a) Diffraction of a plane wave at an aperture; (b) Diffraction in cameras can be modeled using Fraunhofer diffraction.

As shown in Figure 2.3(a), we consider a plane wave with wavelength λ that encounters an aperture with the pupil function g(ζ, η). The pupil function is defined in a way that values of g(ζ, η) = 0 specify opaque points and values of g(ζ, η) = 1 specify transparent points in the aperture plane. The irradiance I(x, y) observed on a screen at distance z depends on the spatial position (x, y). For z ≫ a²/λ, with a being the largest dimension of the aperture, the phase differences between the individual contributions that are superposed on the screen only depend on the viewing angles given by sin φ = x/R and sin θ = y/R, with R = √(x² + y² + z²). This far-field approximation is referred to as Fraunhofer diffraction. Since a lens placed behind an aperture focuses parallel light rays in a point, as illustrated in Figure 2.3(b), diffraction observed in cameras can be modeled using Fraunhofer diffraction. The observed irradiance pattern [33] is given by

\[
I(x, y) = C \cdot \left| G\!\left(\frac{x}{\lambda R}, \frac{y}{\lambda R}\right) \right|^2,  \tag{2.9}
\]
where C is a constant and G(u, v) represents the two-dimensional Fourier transform of the pupil function g(ζ, η). For a camera with a circular aperture, the diffraction pattern on the sensor [33] at a distance z ≈ f is given by

\[
I(r) = I_0 \cdot \left( \frac{2\,J_1(\beta r)}{\beta r} \right)^{\!2}
\quad\text{with}\quad
\beta = \frac{\pi a}{\lambda R} \approx \frac{\pi a}{\lambda f} = \frac{\pi}{\lambda F},  \tag{2.10}
\]
where r = √(x² + y²) represents the distance from the optical axis, I₀ = I(0) is the maximum irradiance, a, f, and F = f/a denote the aperture diameter, focal length, and f-number of the lens, respectively, and J₁(x) represents the Bessel function of the first kind and order one. The diffraction pattern (2.10), which is illustrated in Figure 2.4(a), is called the Airy pattern and its bright central region is called the Airy disk.

Figure 2.4: Optical resolution: (a) Airy pattern; (b) Two just resolved image points; (c) Modulation transfer function of a diffraction-limited lens with a circular aperture.

Optical Resolution. The imaging quality of an optical system can be described by the point spread function (PSF) or line spread function. They specify the projected patterns for a focused point or line source. For large object distances, the wave fronts encountering the aperture are approximately planar. If we have a circular aperture and assume that diffraction is the only source of blurring, the PSF is given by the Airy pattern (2.10). For off-axis points, the Airy pattern is centered around the image point given by (2.8). Optical systems for which the imaging quality is only limited by diffraction are referred to as diffraction-limited or perfect optics. In real lenses, we have additional sources of blurring caused by deviations from the paraxial approximation (2.1). The PSF of an optical system determines its ability to resolve details in the image. Two image points are said to be just resolvable when the center of one diffraction pattern coincides with the first minimum of the other diffraction pattern. This rule is known as the Rayleigh criterion and is illustrated in Figure 2.4(b). For cameras with diffraction-limited lenses and circular apertures, two image points are resolvable if the distance Δr between the centers of the Airy patterns satisfies
\[
\Delta r \;\geq\; \Delta r_{\min} = \frac{x_1}{\pi}\,\lambda F \;\approx\; 1.22\,\lambda F,  \tag{2.11}
\]
where x₁ ≈ 3.8317 represents the first zero of J₁(x)/x. As an example, we consider a camera with a 13.2 mm × 8.8 mm sensor and an aperture of f/4 and assume a wavelength of λ = 550 nm (in the middle of the visible spectrum). Even with a perfect lens, we cannot discriminate more than 4918 × 3279 points (or 16 megapixels) on the image sensor.
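The sensor example just given can be checked with a short Python sketch based on the Rayleigh criterion (2.11); the function name is ours.

```python
def rayleigh_limit(wavelength, F):
    """Minimum resolvable distance on the sensor, eq. (2.11), in the unit of wavelength."""
    return 1.22 * wavelength * F

# Example from the text: 13.2 mm x 8.8 mm sensor, f/4, lambda = 550 nm.
dr_min = rayleigh_limit(wavelength=550e-9, F=4.0)        # in meters
nx = round(13.2e-3 / dr_min)
ny = round(8.8e-3 / dr_min)
print(f"minimum spot separation: {dr_min * 1e6:.2f} um")
print(f"resolvable points: {nx} x {ny} (~{nx * ny / 1e6:.0f} megapixel)")
# Expected output: about 2.68 um and 4918 x 3279 points (~16 megapixel).
```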

The number of discriminable points increases with decreasing f-number and increasing sensor size. By considering (2.7), we can, however, conclude that for a given picture (same field of view and depth of field), the number of distinguishable points is independent of the sensor size.

Modulation Transfer Function. The resolving capabilities of lenses are often specified in the frequency domain. The optical transfer function (OTF) is defined as the two-dimensional Fourier transform of the point spread function, OTF(u, v) = FT{PSF(x, y)}. The amplitude spectrum MTF(u, v) = |OTF(u, v)| is referred to as the modulation transfer function (MTF). Typically, only a one-dimensional slice MTF(u) of the modulation transfer function MTF(u, v) is considered, which corresponds to the Fourier transform of the line spread function. The contrast C of an irradiance pattern shall be defined by
\[
C = \frac{I_{\max} - I_{\min}}{I_{\max} + I_{\min}},  \tag{2.12}
\]
where I_min and I_max represent the minimum and maximum irradiances. The modulation transfer MTF(u) specifies the reduction in contrast C for harmonic stimuli with a spatial frequency u,

\[
\mathrm{MTF}(u) = C_{\mathrm{image}} / C_{\mathrm{object}},  \tag{2.13}
\]
where C_object and C_image denote the contrasts in the object and image domain, respectively. The OTF of diffraction-limited optics can also be calculated as the normalized autocorrelation function of the pupil function g(ζ, η) [28]. For a camera with a diffraction-limited lens and a circular aperture with the f-number F, the MTF is given by
\[
\mathrm{MTF}(u) =
\begin{cases}
\dfrac{2}{\pi}\left( \arccos\dfrac{u}{u_0} - \dfrac{u}{u_0}\sqrt{1 - \left(\dfrac{u}{u_0}\right)^{2}} \right) & : \; u \le u_0 \\[2ex]
0 & : \; u > u_0
\end{cases},  \tag{2.14}
\]
where u₀ = 1/(λF) represents the cut-off frequency. This function is illustrated in Figure 2.4(c). The MTF for real lenses generally lies below that for diffraction-limited optics. Furthermore, for real lenses, the MTF additionally depends on the position in the image plane and the orientation of the harmonic pattern.
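A minimal sketch of the diffraction-limited MTF (2.14); the f-number, wavelength, and evaluation frequencies are arbitrary example values.

```python
import math

def diffraction_limited_mtf(u, wavelength, F):
    """MTF of a diffraction-limited lens with a circular aperture, eq. (2.14).

    u is the spatial frequency; wavelength is given in the reciprocal length unit
    of u (e.g., mm for cycles/mm); F is the f-number of the lens.
    """
    u0 = 1.0 / (wavelength * F)              # cut-off frequency
    if u >= u0:
        return 0.0
    r = u / u0
    return (2.0 / math.pi) * (math.acos(r) - r * math.sqrt(1.0 - r * r))

# Example (assumed values): f/4 lens at 550 nm, frequencies in cycles/mm.
lam_mm = 550e-6                              # 550 nm expressed in mm
for u in (0.0, 50.0, 100.0, 200.0):
    print(f"MTF at {u:5.1f} cycles/mm: {diffraction_limited_mtf(u, lam_mm, 4.0):.3f}")
```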

Figure 2.5: Aberrations: (a) Spherical Aberration; (b) Field curvature; (c) Coma; (d) Astigmatism; (e) Distortion; (f) Axial(1) and lateral(2) chromatic aberration.

2.1.3 Optical Aberrations

We analyzed aspects of the image formation with lenses using the Gaussian lens formula (2.1). Since this formula represents only an approximation for thin lenses and paraxial rays, it does not provide an accurate description of real lenses. Deviations from the predictions of Gaussian optics that are not caused by diffraction are called aberrations. There are two main classes of aberrations: monochromatic aberrations, which are caused by the geometry of lenses and occur even with monochromatic light, and chromatic aberrations, which occur only for light consisting of multiple wavelengths. The five primary monochromatic aberrations, which are also called Seidel aberrations, are:

• Spherical aberration: The focal point of light rays depends on their distance to the optical axis, see Figure 2.5(a);
• Field curvature: Points in a flat object plane are focused on a curved surface instead of a flat image plane, see Figure 2.5(b);
• Coma: The projections of off-axis object points appear as comet-shaped blur spots instead of points, see Figure 2.5(c);
• Astigmatism: Light rays that propagate in perpendicular planes are focused at different distances, see Figure 2.5(d);
• Distortion: Straight lines in the object plane appear as curved lines in the image plane, and objects are deformed, see Figure 2.5(e).

Chromatic aberrations arise from the fact that the phase velocity of an electromagnetic wave in a medium depends on its frequency, a phenomenon called dispersion. As a result, light rays of different wavelengths (or frequencies) are refracted at different angles. Typically, two types of chromatic aberration are distinguished:

• Axial (or longitudinal) chromatic aberration: The focal length depends on the wavelength, see Figure 2.5(f), case (1);
• Lateral chromatic aberration: For off-axis object points, different wavelengths are focused at different positions in the image plane, see Figure 2.5(f), case (2).

The image quality in cameras is often additionally degraded by a brightness reduction at the periphery compared to the image center, an effect referred to as vignetting. Aberrations can be reduced by combining multiple lenses of different shapes and materials. Typical camera lenses consist of about 10 to 20 lens elements, including aspherical lenses and lenses of extra-low dispersion materials.

2.2 Visual Perception

In all areas of digital image communication, whether it be photography, television, home entertainment, video streaming or video conferencing, the photos and videos are eventually viewed by human beings. The way humans perceive visual information determines whether a reproduction of a real-world scene in the form of a printed photograph or pictures displayed on a monitor or television screen looks realistic and truthful. In fact, certain aspects of human vision are not only taken into account for designing cameras, displays and printers, but are also exploited for digitally representing and coding still and moving pictures. In the following, we give a brief overview of the human visual system with particular emphasis on the perception of color. We will mainly concentrate on aspects that influence the way we capture, represent, code and display pictures. For more details on human vision, the reader is referred to the books by Wandell [90] and Palmer [68]. The topic of colorimetry is comprehensively treated in the classic reference by Wyszecki and Stiles [95] and the book by Koenderink [60].

Figure 2.6: Basic structure of the human eye.

2.2.1 The Human Visual System

The human eye has components similar to those of a camera. Its basic structure is illustrated in Figure 2.6. The cornea and the crystalline lens, which is embedded in the ciliary muscle, form a two-lens system. They act like a single convex lens and project an image of real-world objects onto a light-sensitive surface, the retina. The photoreceptor cells in the retina convert absorbed photons into neural signals that are further processed by the neural circuitry in the retina and transmitted through the optic nerve to the visual cortex of the brain. The area of the retina that provides the sharpest vision is called the fovea. We always move our eyes such that the image of the object we look at falls on the fovea. The iris is a sphincter muscle that controls the size of the hole in its middle, called the pupil, and thus the amount of light reaching the retina.

Human Optics. In contrast to cameras, the distance between lens and retina cannot be modified for focusing objects at varying distances. Instead, focusing is achieved by adapting the shape of the crystalline lens by the ciliary muscle. This process is referred to as accommodation. In the eyes of young people, the resulting focal length of the two-lens optics can be modified between about 14 and 17 mm [30], making it possible to focus objects at distances from approximately 8 cm to infinity. Similarly as in cameras, the image projected onto the retina is actually inverted. The optical quality of the human eye was evaluated by measuring line spread, point spread, or the corresponding modulation transfer functions (see Section 2.1.2) for monochromatic light [12, 61, 64]. These investigations show that the eye is far from being perfect optics.

Figure 2.7: Illustration of the distribution of photoreceptor cells along the horizontal meridian of the human eye (plotted using experimental data of [21]).

While for very small pupil sizes, the human optical system is nearly diffraction-limited, for larger pupil sizes, the imperfections of the cornea and crystalline lens cause significant monochromatic aberrations, much larger than those of camera lenses. The sharpest image on the retina is obtained for pupil diameters of about 3 mm, which is the typical pupil size for looking at a white paper in good reading light. The dispersion of the substances inside the eye also leads to significant chromatic aberrations. In typical lighting conditions, the green range of the spectrum, which the eye is most sensitive to, is sharply focused on the retina, while the focal planes for the blue and red range are in front of and behind the retina, respectively. This axial chromatic aberration has the strongest effect for the short wavelength range of the visible light [30]. Lateral chromatic aberration increases with the distance from the optical axis; in the fovea, its effect can be neglected.

Human Photoreceptors. The retina contains two classes of photoreceptor cells, the rods and cones, which are sensitive to different light levels. Under well-lit viewing conditions (daylight, luminance greater than about 10 cd/m²), only the cones are effective. This case is referred to as photopic vision. At very low light levels, between the visual threshold and a luminance of about 5 · 10⁻³ cd/m² (somewhat lower than the lighting in a full moon night), only the rods contribute to the visual perception; this case is called scotopic vision. Between these two cases, both the rods and cones are active and we talk of mesopic vision.

There are about 100 million rods and 5 million cones in each eye, which are very differently distributed throughout the retina [67, 21], see Figure 2.7. The rods are mainly concentrated in the periphery. The fovea does not contain any rods, but by far the highest concentration of cones. At the location of the optic nerve, also referred to as the blind spot, there are no photoreceptors. Although the retina contains many more rods than cones, the visual acuity of scotopic vision is much lower than that of photopic vision. The reason is that the photocurrent responses of many rods are combined into a single neural response, whereas each cone signal is further processed by several neurons in the retina [90].

Spectral Sensitivity. The sensitivity of the human eye depends on the spectral characteristics of the observed light stimulus. Based on the data of several brightness-matching experiments, for example [27], the Commission Internationale de l'Eclairage (CIE) defined the so-called CIE luminous efficiency function V(λ) for photopic vision [14] in 1924. This function characterizes the average spectral sensitivity of human brightness perception¹. Two light stimuli with different radiance spectra Φ(λ) are perceived as equally bright if the corresponding values ∫₀^∞ Φ(λ) V(λ) dλ are the same. V(λ) determines the relation between radiometric and photometric quantities. For example, the analogous photometric quantity of the radiance Φ = ∫₀^∞ Φ(λ) dλ is the luminance I = K ∫₀^∞ V(λ) Φ(λ) dλ, where K is a constant (683 lumen per watt). The SI unit of the luminance is candela per square meter (cd/m²). Viewing experiments under scotopic conditions led to the definition of a scotopic luminous efficiency function V′(λ) [16]. The luminous efficiency functions V(λ) and V′(λ) are depicted in Figure 2.8(a). The phenomenon that the wavelength range of highest sensitivity is different for photopic and scotopic vision is referred to as the Purkinje effect. Both luminous efficiency functions are noticeably greater than zero in the range from about 390 to 700 nm. For that reason, electromagnetic radiation in this part of the spectrum is commonly called visible light.

¹ The CIE 1924 photopic luminous efficiency function V(λ) has been reported to underestimate the contribution of the short wavelength range. Improvements were suggested by Judd [56], Vos [89], and more recently by Sharpe et al. [74, 75].
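The weighting of a radiance spectrum by V(λ) can be sketched as a simple numerical integration. The V(λ) samples below are only a coarse approximation of the photopic luminous efficiency function at 50 nm spacing, and the spectrum is an arbitrary example; neither is taken from an official table.

```python
# Minimal sketch: luminance as K * integral of V(lambda) * Phi(lambda) d(lambda).
K = 683.0  # lm/W (maximum luminous efficacy for photopic vision)

wavelengths = [400, 450, 500, 550, 600, 650, 700]           # nm
V   = [0.0004, 0.038, 0.323, 0.995, 0.631, 0.107, 0.0041]   # approx. photopic V(lambda)
Phi = [0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02]            # example spectral radiance, W/(sr*m^2*nm)

d_lambda = 50.0  # nm
luminance = K * sum(v * p for v, p in zip(V, Phi)) * d_lambda
print(f"luminance ~ {luminance:.1f} cd/m^2")
```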

Figure 2.8: Spectral sensitivity of human vision: (a) CIE luminous efficiency functions for photopic and scotopic vision (the dashed curve represents the correction suggested in [75]); (b) Spectral sensitivity of the human photoreceptors.

In low light (scotopic vision), we can only discriminate between different brightness levels, but under photopic (and mesopic) conditions, we are able to see colors. The reason is that the human retina contains only a single rod type, but three types of cones, each with a different spectral sensitivity. The existence of three types of photoreceptors had already been postulated in the 19th century by Young [96] and Helmholtz [34]. In the 1960s, direct measurements on single photoreceptor cells of the human retina [9] confirmed the Young-Helmholtz theory of trichromatic vision. The cone types are typically referred to as L-, M- and S-cones, where L, M and S stand for the long, medium and short wavelength range, respectively, and characterize the peak sensitivities. On average only about 6% of the human cones are S-cones. This low density of S-cones is consistent with the large blur of short-wavelength components due to axial chromatic aberration. While the percentage of S-cones is roughly constant for different individuals, the ratio of L- and M-cones varies significantly [36]. The spectral sensitivity of cone cells was determined by measuring photocurrent responses [5, 72]. For describing color perception, we are, however, more interested in spectral sensitivities with respect to light entering the cornea, which are different, since the short wavelength range is strongly absorbed by different components of the eye before reaching the retina. Such sensitivity functions, which are also called cone fundamentals, can be estimated by comparing color-matching data

(see Section 2.2.2) of individuals with normal vision with that of individuals lacking one or two cone types. In Figure 2.8(b), the cone fundamentals estimated by Stockman et al. [82, 81] are depicted together with the spectral sensitivity function for the rods, which is the same as the scotopic luminous efficiency function V′(λ).

Luminance Sensitivity. The sensing capabilities of the human eye span a luminance range of about 11 orders of magnitude, from the visual threshold of about 10⁻⁶ cd/m² to about 10⁵ cd/m² [30], which roughly corresponds to the luminance level on a sunny day. However, in each moment, only luminance levels in a range of about 2 to 3 orders of magnitude can be distinguished. In order to cover the huge range of ambient light levels, the human eye adapts its sensitivity to the lighting conditions. A fast adaptation mechanism is the pupillary light reflex, which controls the pupil size depending on the luminance on the retina. The main factors, however, which are also responsible for the transition between rod and cone vision, are photochemical reactions in the pigments of the rod and cone cells and neural processes. These mechanisms are much slower than the pupillary light reflex; the adaptation from very high to very low luminance levels can take up to 30 minutes. To a large extent, the sensitivities of the three cone types are independently controlled. As a consequence, the human eye does not only adjust to the luminance level, but also to the spectral composition of the incident light. In connection with certain properties of the neural processing, this aspect causes the phenomenon of color constancy, which describes the effect that the perceived colors of objects are relatively independent of the spectral composition of the illuminating light. Another property of human vision is that our ability to distinguish two areas with the same color but a particular difference in luminance depends on the brightness of the viewed scene. Let I and ΔI denote the background luminance, to which the eye is adapted, and the just perceptible increase in luminance, respectively. Within a wide range of luminance values I, from about 50 to 10⁴ cd/m² [30], the relative sensitivity ΔI/I is nearly constant (approximately 1-2%). This behavior is known as the Weber-Fechner law.

Opponent Colors. The theory of opponent colors was first formulated by Hering [35]. He found that certain hues are never perceived to occur together. While colors can be perceived as a combination of, for example, yellow and red (orange), red and blue (purple), or green and blue (cyan), there are no colors that are perceived as a combination of red and green or yellow and blue. Hering concluded that the human color perception includes a mechanism with bipolar responses to red-green and blue-yellow. These hue pairs are referred to as opponent colors. According to the opponent color theory, any light stimulus is perceived as containing either one or the other hue of an opponent pair, or, if both contributions cancel out, neither of them.

For a long time, the opponent color theory seemed to be irreconcilable with the Young-Helmholtz theory. In the 1950s, Jameson and Hurvich [55, 37] performed hue-cancellation experiments by which they estimated the spectral sensitivities of the opponent-color mechanisms. Furthermore, measurements of electrical responses in the retina of goldfish [83, 84] and the lateral geniculate nucleus of the macaque monkey [23] showed the existence of neural signals that were consistent with the bipolar responses formulated by Hering. These and other experimental findings resulted in a wide acceptance of the modern theory of opponent colors, according to which the responses of the three cones to light stimuli are not directly transmitted to the brain. Instead, neurons along the visual pathways transform the cone responses into three opponent signals, as illustrated in Figure 2.9(a). The transformation can be considered as approximately linear and the outputs are an achromatic signal, which corresponds to a relative luminance measure, as well as a red-green and a yellow-blue color difference signal.

Since the cone sensitivities are to a large extent independently adjusted, the spectral sensitivities of the opponent processes depend on the present illumination. Estimates of the spectral sensitivity curves for the eye adapted to equal-energy white (same spectral radiance for all wavelengths) are shown in Figure 2.9(b). The depicted curves represent linear combinations, suggested in [80], of the Stockman and Sharpe cone fundamentals [81].

Figure 2.9: Opponent color theory: (a) Simplified model for the neural processing of the cone responses; (b) Estimates [80] of the spectral sensitivities of the opponent-color processes (for the eye adapted to equal-energy white).

As an example, let Φ(λ) denote the radiance spectrum of a light stimulus and let c̄_rg(λ) represent the spectral sensitivity curve for the red-green process. If the integral ∫₀^∞ Φ(λ) c̄_rg(λ) dλ is positive, the light stimulus is perceived as containing a red component; if it is negative, the stimulus appears to include a green component. As has been shown in [11], the conversion of the cone responses into opponent signals is effectively a decorrelation. It can be interpreted as a way of improving the neural transmission of color information.
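The sign test just described can be sketched numerically. Both the spectrum and the red-green sensitivity samples below are hypothetical placeholders chosen only for illustration, not measured opponent-process data.

```python
# Minimal sketch: a stimulus is judged to contain a reddish component if the
# integral of Phi(lambda) * c_rg(lambda) is positive, and a greenish component
# if it is negative. All sample values are hypothetical placeholders.
wavelengths = [450, 500, 550, 600, 650]          # nm
c_rg = [0.10, -0.35, -0.20, 0.45, 0.25]          # hypothetical red-green sensitivity samples
Phi  = [0.01, 0.02, 0.03, 0.08, 0.05]            # hypothetical radiance spectrum samples

d_lambda = 50.0
response = sum(c * p for c, p in zip(c_rg, Phi)) * d_lambda
label = "reddish component" if response > 0 else "greenish component"
print(f"{label} (response = {response:.3f})")
```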

Neural Processing. The neural responses of the photoreceptor cells are first processed by the neurons in the retina and then transmitted to the visual cortex of the brain, where the visual information is further processed and eventually interpreted, yielding the images of the world we perceive every day. The mechanisms of the neural processing are extremely complex and not yet fully understood. Nonetheless, the understanding of the processing in the visual cortex is continuously improving and many aspects are already known. One important property of human visual perception is that our visual system permanently compares the information obtained through the eyes with memorized knowledge, which finally yields an interpretation of the viewed real-world scene. Many examples of visual illusions impressively demonstrate that the human brain always interprets the received visual information.

Although many more aspects of visual perception than the ones mentioned in this section are already known, we will not discuss them in this monograph, since they are virtually not exploited in today's video communication applications. The main reason that most properties of human vision are neglected in image and video coding is that no simple and sufficiently accurate model has been found that allows the perceived image quality to be quantified based on the samples of an image or video.

2.2.2 Color Perception

While the previous section gave a brief overview of the human visual system, we will now further analyze and quantitatively describe the perception and reproduction of color information. In particular, we will discuss the colorimetric standards of the CIE, which are widely used as basis for specifying color in image and video representation formats.

Metamers. It is a well-known fact that, by using a prism, a ray of sunlight can be split into components of different wavelengths, which we perceive to have different colors, ranging from violet through blue, cyan, green, yellow, and orange to red. We can conclude that light with a particular spectral composition induces the perception of a particular color, but the converse is not true. Two light stimuli that appear to have the same color can have very different spectral compositions. Color is not a physical quantity, but a sensation in the viewer's mind induced by the interaction of electromagnetic waves with the human cones. A light stimulus emitted or reflected from the surface of an object and falling through the pupil of the eye can be physically characterized by its radiance spectrum, specifying its composition of electromagnetic waves with different wavelengths. The light falling on the retina excites the three cone types in different ways. Let l̄(λ), m̄(λ) and s̄(λ) represent the normalized spectral sensitivity curves of the L-, M- and S-cones, respectively, which have been illustrated in Figure 2.8. Then, a radiance spectrum Φ(λ) is effectively mapped to a three-dimensional vector

  ∞ ¯  L Z `(λ) Φ(λ) M = m¯ (λ) dλ, (2.15) Φ0 S 0 s¯(λ) where Φ0 >0 represents an arbitrarily chosen reference radiance, which is introduced for making the vector (L, M, S) dimensionless. 26 Acquisition, Representation, Display, and Perception

Figure 2.10: Metamers: All four radiance spectra shown in the diagram induce the same cone excitation responses and are perceived as the same color (orange).

If two light stimuli with different radiance spectra yield the same cone excitation response (L, M, S), they cannot be distinguished by the human visual system and are therefore perceived as having the same color. Light stimuli with that property are called metamers. As an example, the radiance spectra shown in Figure 2.10 are metamers. Metameric color matches play a very important role in all color reproduction techniques. They are the basis for color photography (see Section 2.4), color printing, color displays (see Section 2.5) as well as for the representation of color images and videos (see Section 2.3).

For specifying the cone excitation responses in (2.15), we used normalized spectral sensitivity functions without paying attention to the different peak sensitivities. Actually, this aspect does not have any impact on the characterization of metamers. If two vectors (L₁, M₁, S₁) and (L₂, M₂, S₂) are the same, the appropriately scaled versions (αL₁, βM₁, γS₁) and (αL₂, βM₂, γS₂), with non-zero scaling factors α, β and γ, are also the same, and vice versa. An aspect that is, however, neglected in equation (2.15) is the chromatic adaptation of the human eye, i.e., the changing of the scaling factors α, β and γ depending on the spectral properties of the observed light (see Section 2.2.1). For the following considerations, we assume that the eye is adapted to a particular viewing condition and, thus, the mapping between radiance spectra and cone excitation responses is linear, as given by (2.15).

Another point to note is that the so-called quality of a color, typically characterized by the hue and saturation, is solely determined by the ratio L : M : S. Two colors given by the cone response vectors (L₁, M₁, S₁) and (L₂, M₂, S₂) = (αL₁, αM₁, αS₁), with α > 1, have the same quality, i.e., the same hue and saturation, but the luminance² of the color (L₂, M₂, S₂) is by a factor of α larger than that of (L₁, M₁, S₁).

Although (2.15) could be directly used for quantifying the perception of color, the colorimetric standards are based on empirical data obtained in color-matching experiments. One reason is that the spectral sensitivities of the human cones had not been known at the time when the standards were developed. Actually, the cone fundamentals are typically estimated based on data of color-matching experiments [81].
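The metamer test implied by (2.15) can be sketched as a discrete approximation of the integral: two spectra match in color exactly when they produce the same (L, M, S) vector. The cone sensitivity samples and the two spectra below are hypothetical placeholders for illustration only.

```python
# Minimal sketch of the metamer test based on eq. (2.15), with coarse sampling.
wavelengths = [450, 500, 550, 600, 650]                  # nm
l_bar = [0.05, 0.35, 0.90, 0.95, 0.30]                   # hypothetical L-cone samples
m_bar = [0.10, 0.60, 1.00, 0.60, 0.10]                   # hypothetical M-cone samples
s_bar = [0.90, 0.25, 0.02, 0.00, 0.00]                   # hypothetical S-cone samples

def lms(phi, d_lambda=50.0):
    """Approximate (L, M, S) responses for a sampled radiance spectrum phi."""
    return tuple(sum(c * p for c, p in zip(cone, phi)) * d_lambda
                 for cone in (l_bar, m_bar, s_bar))

phi_1 = [0.00, 0.04, 0.02, 0.05, 0.01]                   # hypothetical spectrum 1
phi_2 = [0.01, 0.03, 0.03, 0.04, 0.02]                   # hypothetical spectrum 2

print("LMS(phi_1) =", tuple(round(v, 3) for v in lms(phi_1)))
print("LMS(phi_2) =", tuple(round(v, 3) for v in lms(phi_2)))
# The spectra are metamers exactly when the two vectors coincide.
```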

Mixing of Primary Colors. Since the perceived color of a light stimulus can be represented by three cone excitation levels, it seems likely that, for each radiance spectrum Φ(λ), we can also create a metameric spectrum Φ*(λ) by suitably mixing three primary colors or, more correctly, primary lights. The radiance spectra of the three primary lights A, B and C shall be given by p_A(λ), p_B(λ) and p_C(λ), respectively. With p(λ) = (p_A(λ), p_B(λ), p_C(λ))ᵀ, the radiance spectrum of a mixture of the primaries A, B and C is given by

\[
\Phi^{*}(\lambda) = A \cdot p_A(\lambda) + B \cdot p_B(\lambda) + C \cdot p_C(\lambda) = (A, B, C) \cdot p(\lambda),  \tag{2.16}
\]
where A, B and C denote the mixing factors, which are also referred to as tristimulus values. The radiance spectrum Φ*(λ) of the light mixture is a metamer of Φ(λ) if and only if it yields the same cone excitation responses. Thus, with (L, M, S) being the vector of cone excitation responses for Φ(λ), we require
\[
\begin{pmatrix} L \\ M \\ S \end{pmatrix}
= \int_0^\infty \frac{\Phi^{*}(\lambda)}{\Phi_0}
\begin{pmatrix} \bar{l}(\lambda) \\ \bar{m}(\lambda) \\ \bar{s}(\lambda) \end{pmatrix} d\lambda
= T \cdot \begin{pmatrix} A \\ B \\ C \end{pmatrix},  \tag{2.17}
\]
with the transformation matrix T being given by
\[
T = \int_0^\infty
\begin{pmatrix} \bar{l}(\lambda) \\ \bar{m}(\lambda) \\ \bar{s}(\lambda) \end{pmatrix}
\frac{p(\lambda)^{\mathrm T}}{\Phi_0}\, d\lambda.  \tag{2.18}
\]

² As mentioned in Section 2.2.1, we assume that the luminance can be represented as a linear combination of the cone excitation responses L, M, and S.

If the primaries are selected in a way that the matrix T is invertible, the mapping between the tristimulus values (A, B, C) and (L, M, S) is bijective. In this case, the color of each possible radiance spectrum Φ(λ) can be matched by a mixture of the three primary lights. And therefore, the color description in the (A, B, C) system is equivalent to the description in the (L, M, S) system. A sufficient condition for a suitable selection of the three primaries is that all primaries are perceived as having a different color and the color of none of the primaries can be matched by a mixture of the two other primaries. One aspect that will be discussed later, but should be noted at this point, is that for each selection of real primaries, i.e., primaries with radiance spectra p(λ) ≥ 0, ∀λ, there are stimuli Φ(λ) for which one or two of the mixing factors A, B and C are negative. By combining the equations (2.17) and (2.15), we obtain

    ∞  ∞ A L Z `¯(λ) Z −1 −1 Φ(λ) Φ(λ) B= T M= T m¯ (λ) dλ = c¯(λ) dλ, (2.19) Φ0 Φ0 C S 0 s¯(λ) 0 which specifies the direct mapping of radiance spectra Φ(λ) onto the tristimulus values (A, B, C). The components a¯(λ), ¯b(λ) and c¯(λ) of the vector function c¯(λ) = (a ¯(λ), ¯b(λ), c¯(λ))T are referred to as color- matching functions for the primaries A, B and C, respectively. They represent equivalents to the cone fundamentals `¯(λ), m¯ (λ) and s¯(λ). Thus, if we know the color-matching functions a¯(λ), ¯b(λ) and c¯(λ) for a set of three primaries, we can uniquely describe all perceivable colors by the corresponding tristimulus values (A, B, C). Before we discuss how color-matching functions can be determined, we highlight an important property of color mixing, which is a direct consequence of (2.19). Let Φ1(λ) and Φ2(λ) be the radiance spectra of two lights with the tristimulus values (A1,B1,C1) and (A2,B2,C2), respectively. Now, we mix an amount α of the first with an amount β of the second light. For the tristimulus values (A, B, C) of the resulting radiance spectrum Φ(λ) = α Φ1(λ) + β Φ2(λ), we obtain   ∞     A A1 A2 Z αΦ1(λ) + βΦ2(λ) B= c¯(λ) dλ = αB1 + βB2 . (2.20) Φ0 C 0 C1 C2 2.2. Visual Perception 29

Figure 2.11: Principle of color-matching experiments.

The tristimulus values of a linear combination of multiple lights are given by the linear combination, with the same weights, of the tristimulus values of the individual lights. This property was experimentally discovered by Grassmann [29] and is often called Grassmann's law.

Color-Matching Experiments. In order to experimentally determine the color-matching functions c̄(λ) for three given primary lights, the color of sufficiently many monochromatic lights³ can be matched with a mixture of the primaries. For each monochromatic light with wavelength λ, the radiance spectrum is Φ(λ′) = Φ_λ δ(λ′ − λ), where Φ_λ is the absolute radiance and δ(·) represents the Dirac delta function. According to (2.19), the tristimulus vector is given by

\[
\begin{pmatrix} A \\ B \\ C \end{pmatrix}_{\!\lambda}
= \int_0^\infty \frac{\Phi_\lambda}{\Phi_0}\, \bar{c}(\lambda')\, \delta(\lambda' - \lambda)\, d\lambda'
= \frac{\Phi_\lambda}{\Phi_0} \cdot \bar{c}(\lambda).  \tag{2.21}
\]
Except for a factor, the tristimulus vector for a monochromatic light with wavelength λ represents the value of c̄(λ) for that wavelength. Even though the value of Φ₀ can be chosen arbitrarily, the ratio of the absolute radiances Φ_λ of the monochromatic lights to any constant reference radiance Φ₀ has to be known for all wavelengths.

The basic idea of color-matching experiments is typically attributed to Maxwell [65]. The color-matching data that led to the creation of the widely used CIE 1931 colorimetric standard were obtained in experiments by Wright [94] and Guild [32].

The principle of their color-matching experiments [31, 93] is illustrated in Figure 2.11. At a visual angle of 2°, the observers looked at both a monochromatic test light and a mixture of the three primaries, for which a red, green, and blue light source were used. The amounts of the primaries could be adjusted by the observers. Since not all lights can be matched with positive amounts of the primary lights, it was possible to move any of the primaries to the side of the test light, in which case the amount of the corresponding primary was counted as a negative value⁴. The monochromatic lights were obtained by splitting a beam of white light using a prism and selecting a small portion of the spectrum using a thin slit. For determining the color-matching functions c̄(λ), only the ratios of the amounts of the primary lights were utilized. These data were combined with the already estimated luminous efficiency function V(λ) for photopic vision, assuming that V(λ) can be represented as a linear combination of the three color-matching functions ā(λ), b̄(λ) and c̄(λ). Due to the linear relationship between the tristimulus values (A, B, C) and the cone response vectors (L, M, S), this assumption is equivalent to the often used model (see Section 2.2.1) that the sensation of luminance is generated by linearly combining the cone excitation responses in the neural circuitry of the human visual system. The utilization of the luminous efficiency function V(λ) had the advantage that the effect of luminance perception could be excluded in the experiments and that it was not necessary to know the ratios of the absolute radiances Φ_λ to a common reference Φ₀ for all monochromatic lights (see above). The exact mathematical procedure for determining the color-matching functions c̄(λ) given the mixing ratios and V(λ) is described in [76].

³ In practice, lights with a reasonably small spectrum are used.

Changing Primaries. Before we discuss the results of Wright and Guild, we consider how the color-matching functions for an arbitrary set of primaries can be derived from the measurements for another set of primaries.

⁴ Due to the linearity of color mixing, adding a particular amount of a primary to the test light is mathematically equivalent to subtracting the same amount from the mixture of the other primaries.

Let us assume that we measured the color-matching functions c̄₁(λ) = (ā₁(λ), b̄₁(λ), c̄₁(λ))ᵀ for a first set of primaries given by the radiance spectra p₁(λ) = (p_A1(λ), p_B1(λ), p_C1(λ))ᵀ. Based on these data, we want to determine the color-matching functions c̄₂(λ) for a second set of primaries, which shall be given by the radiance spectra p₂(λ). For each radiance spectrum Φ(λ), the tristimulus vectors t₁ = (A₁, B₁, C₁)ᵀ and t₂ = (A₂, B₂, C₂)ᵀ for the primary sets one and two, respectively, are given by

\[
t_1 = \int_0^\infty \frac{\Phi(\lambda)}{\Phi_0}\, \bar{c}_1(\lambda)\, d\lambda
\qquad\text{and}\qquad
t_2 = \int_0^\infty \frac{\Phi(\lambda)}{\Phi_0}\, \bar{c}_2(\lambda)\, d\lambda.  \tag{2.22}
\]

The radiance spectra Φ(λ), Φ₁(λ) = p₁(λ)ᵀ t₁ and Φ₂(λ) = p₂(λ)ᵀ t₂ are metamers. Consequently, all three spectra correspond to the same color representation for any set of primaries. In particular, we require

\[
t_1 = \int_0^\infty \frac{\Phi_2(\lambda)}{\Phi_0}\, \bar{c}_1(\lambda)\, d\lambda
= \left( \int_0^\infty \bar{c}_1(\lambda)\, \frac{p_2(\lambda)^{\mathrm T}}{\Phi_0}\, d\lambda \right) t_2
= T_{21}\, t_2.  \tag{2.23}
\]
Hence, the tristimulus vector in one system of primaries can be converted into any other system of primaries using a linear transformation. Since this relationship is valid for all radiance spectra Φ(λ), including those of the monochromatic lights, the color-matching functions for the second set of primaries can be calculated according to

\[
\bar{c}_2(\lambda) = T_{21}^{-1}\, \bar{c}_1(\lambda) = T_{12}\, \bar{c}_1(\lambda).  \tag{2.24}
\]

It should be noted that the columns of a matrix T_ik represent the tristimulus vectors (A, B, C) of the primary lights of set i in the primary system k. These values can be directly measured, so that the color-matching functions can be transformed from one into another primary system even if the radiance spectra p₁(λ) and p₂(λ) are unknown.
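A minimal numerical sketch of the change of primaries in (2.23)/(2.24): converting tristimulus vectors between two primary systems is a 3×3 matrix multiplication. The matrix entries below are hypothetical, chosen only so that T21 is invertible.

```python
import numpy as np

# T21: columns are the tristimulus vectors of the primaries of set 2 measured in
# system 1; it maps tristimulus vectors of system 2 to system 1 (eq. 2.23), and
# its inverse maps the other way (eq. 2.24). Hypothetical example values.
T21 = np.array([[0.8, 0.1, 0.1],
                [0.2, 0.7, 0.1],
                [0.0, 0.1, 0.9]])

t2 = np.array([0.3, 0.5, 0.2])       # tristimulus vector in primary system 2
t1 = T21 @ t2                        # the same color described in primary system 1
t2_back = np.linalg.inv(T21) @ t1    # round trip back to system 2
print("t1 =", np.round(t1, 4), " t2 (recovered) =", np.round(t2_back, 4))
```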

CIE Standard Colorimetric Observer. In 1931, the CIE adopted the colorimetric standard known as CIE 1931 2° Standard Colorimetric Observer [15] based on the experimental data of Wright and Guild. Since Wright and Guild used different primaries in their experiments, the data had to be converted into a common primary system. For that purpose, monochromatic primaries with wavelengths of 700 nm

(red), 546.1 nm (green) and 435.8 nm (blue) were selected⁵. Since the tristimulus values for monochromatic lights had been measured in the experiments, the conversion matrices could be derived by interpolating these data at the wavelengths of the new primary system. The ratio of the absolute radiances of the primary lights was chosen so that white light with a constant radiance spectrum is represented by equal amounts of all three primaries. Hence, the to-be-determined color-matching functions r̄(λ), ḡ(λ) and b̄(λ) for the red, green and blue primaries, respectively, had to fulfill the condition
\[
\int_0^\infty \bar{r}(\lambda)\, d\lambda = \int_0^\infty \bar{g}(\lambda)\, d\lambda = \int_0^\infty \bar{b}(\lambda)\, d\lambda.  \tag{2.25}
\]
The experimental data of Wright and Guild were transformed into the new primary system, the results were averaged and some irregularities were removed [76, 8]. The requirement (2.25) resulted in a luminance ratio I_R : I_G : I_B equal to 1 : 4.5907 : 0.0601, where I_R, I_G and I_B represent the luminances of the red, green and blue primaries, respectively. The corresponding ratio Φ_R : Φ_G : Φ_B of the absolute radiances is approximately 1 : 0.0191 : 0.0137. Finally, the normalization factor for the color-matching functions, i.e., the ratio Φ_R/Φ₀, was chosen such that the condition
\[
V(\lambda) = \bar{r}(\lambda) + \frac{I_G}{I_R} \cdot \bar{g}(\lambda) + \frac{I_B}{I_R} \cdot \bar{b}(\lambda)  \tag{2.26}
\]
is fulfilled. The resulting CIE 1931 RGB color-matching functions r̄(λ), ḡ(λ) and b̄(λ) were tabulated for wavelengths from 380 to 780 nm at intervals of 5 nm [15, 76]. They are shown in Figure 2.12(a). It is clearly visible that r̄(λ) has negative values inside the range from 435.8 to 546.1 nm. In fact, for most of the wavelengths inside the range of visible light, one of the color-matching functions is negative, meaning that most of the monochromatic lights cannot be represented by a physically meaningful mixture of the chosen red, green and blue primaries.

The CIE decided to develop a second set of color-matching functions x̄(λ), ȳ(λ), and z̄(λ), which are now known as CIE 1931 XYZ color-matching functions, as the basis for their colorimetric standard.

⁵ The primaries were chosen to be producible in a laboratory.

Figure 2.12: CIE 1931 color-matching functions: (a) RGB color-matching functions, the primaries are marked with R, G and B; (b) XYZ color-matching functions.

Since all sets of color-matching functions are linearly dependent, x̄(λ), ȳ(λ), and z̄(λ) had to obey the relationship

\[
\begin{pmatrix} \bar{x}(\lambda) \\ \bar{y}(\lambda) \\ \bar{z}(\lambda) \end{pmatrix}
= T_{\mathrm{XYZ}} \cdot
\begin{pmatrix} \bar{r}(\lambda) \\ \bar{g}(\lambda) \\ \bar{b}(\lambda) \end{pmatrix},  \tag{2.27}
\]
with T_XYZ being an invertible, but otherwise arbitrary, transformation matrix. For specifying the 3×3 matrix T_XYZ, the following desirable properties were considered:

• All values of x̄(λ), ȳ(λ) and z̄(λ) were to be non-negative;
• The color-matching function ȳ(λ) was to be chosen equal to the luminous efficiency function V(λ) for photopic vision;
• The scaling was to be chosen so that the tristimulus values for an equal-energy spectrum are equal to each other;
• For the long wavelength range, the entries of the color-matching function z̄(λ) were to be equal to zero;
• Subject to the above criteria, the area that physically meaningful radiance spectra occupy inside a plane given by a constant sum X + Y + Z was to be maximized.

By considering these design principles, the transformation matrix
\[
T_{\mathrm{XYZ}} = \frac{1}{0.17697}
\begin{pmatrix}
0.49000 & 0.31000 & 0.20000 \\
0.17697 & 0.81240 & 0.01063 \\
0.00000 & 0.01000 & 0.99000
\end{pmatrix}  \tag{2.28}
\]
was adopted. A detailed description of how this matrix was derived can be found in [76, 26]. The resulting XYZ color-matching functions x̄(λ), ȳ(λ) and z̄(λ) are depicted in Figure 2.12(b). They have been tabulated for the range from 380 to 780 nm, in intervals of 5 nm, and specify the CIE 1931 standard colorimetric observer [15]. The color of a radiance spectrum Φ(λ) can be represented by the tristimulus values

  ∞  X Z x¯(λ) Φ(λ)  Y  =  y¯(λ)  dλ. (2.29) Φ0 Z 0 z¯(λ)

The reference radiance Φ₀ is typically chosen in a way that X, Y, and Z lie in a range from 0 to 1 for the considered viewing condition. Note that, due to the choice ȳ(λ) = V(λ), the value Y represents a scaled and dimensionless version of the luminance I. It is correctly referred to as relative luminance; however, the term "luminance" is often used for both the "absolute" luminance I and the relative luminance Y.

In the 1950s, Stiles and Burch [79] performed color-matching experiments for a visual angle of 10°. Based on these results, the CIE defined the CIE 1964 10° Supplementary Colorimetric Observer [17]. The data by Stiles and Burch are considered the most secure set of existing color-matching functions [7] and have been used as the basis for the Stockman and Sharpe cone fundamentals [81] and the recent CIE proposal [19] of physiologically relevant color-matching functions. Baylor, Nunn and Schnapf [5] measured direct photocurrent responses in the cones of a monkey and could predict the color-matching functions of Stiles and Burch with reasonable accuracy. Nonetheless, the CIE 1931 Standard Colorimetric Observer [15] is still used in most applications. The RGB and XYZ color-matching functions for the CIE standard observers are included in the recent ISO/CIE standard on colorimetry [41] and can also be downloaded from [40].
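Because the relationship (2.27) is linear, the matrix (2.28) maps CIE 1931 RGB tristimulus values to XYZ in the same way it maps the color-matching functions. A minimal sketch; the RGB input vector is an arbitrary example value.

```python
import numpy as np

# CIE RGB -> XYZ conversion using the transformation matrix of eq. (2.28).
T_XYZ = (1.0 / 0.17697) * np.array([[0.49000, 0.31000, 0.20000],
                                    [0.17697, 0.81240, 0.01063],
                                    [0.00000, 0.01000, 0.99000]])

rgb = np.array([0.2, 0.5, 0.3])     # example CIE 1931 RGB tristimulus values
xyz = T_XYZ @ rgb
print("XYZ =", np.round(xyz, 4))
```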

Chromaticity Diagram. The black curve in Figure 2.13(a) shows the locus of monochromatic lights with a particular radiance in the XYZ space. The tristimulus values of all possible radiance spectra represent linear combinations, with non-negative weights, of the (X, Y, Z) values for monochromatic lights.


Figure 2.13: The CIE 1931 chromaticity diagram: (a) Locus of monochromatic lights and the imaginary purple plane in the XYZ space; (b) Space of real radiance spectra with the plane X + Y + Z = 1 and the line of all equal-energy spectra; (c) Chromaticity diagram illustrating the region of all perceivable colors in the x-y plane. The diagram additionally shows the point of equal-energy white (E) as well as the primaries (R, G, B) and the white point (W) of the sRGB color space [38].

They are located inside a cone, which has its apex in the origin and lies completely in the positive octant. The cone's surface is spanned by the locations of the monochromatic lights and an imaginary purple plane, which connects the tangents for the short and long wavelength ends. As mentioned above, the quality of a color is solely determined by the ratio of the tristimulus values X : Y : Z. Hence, all lights that have the same quality of color lie on a line that intersects the origin, as is illustrated by the gray arrow in Figure 2.13(b), which represents the color of equal-energy radiance spectra. For differentiating between the luminance and the quality of a color, it is common to introduce normalized chromaticity coordinates
\[
x = \frac{X}{X + Y + Z}, \qquad
y = \frac{Y}{X + Y + Z}, \qquad
z = \frac{Z}{X + Y + Z}.  \tag{2.30}
\]
The z-coordinate is actually redundant, since it is given by z = 1 − x − y.

The tristimulus values (X, Y, Z) of a color can be represented by the chromaticity coordinates x and y, which specify the quality of the color, and the relative luminance Y. For a given quality of color, i.e., a ratio X : Y : Z, the chromaticity coordinates x and y correspond to the values of X and Y, respectively, inside the plane X + Y + Z = 1, as is illustrated in Figure 2.13(b). The set of color qualities that is perceivable by human beings is called the human gamut. Its location in the x-y coordinate system is shown in Figure 2.13(c).⁶ This plot is referred to as the chromaticity diagram. The human gamut has a horseshoe shape; its boundary is given by the projection of the monochromatic lights, referred to as the spectral locus, and the purple line, which is the projection of the imaginary purple plane. For the spectral locus, the figure includes wavelength labels in nanometers; it also shows the location x = y = 1/3, marked by “E”, of equal-energy spectra.

⁶ The complete human gamut cannot be reproduced on a display or in a print, and the perception of a color depends on the illumination conditions. Thus, the colors shown in Figure 2.13(c) should be interpreted as a rough illustration.
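As a small illustration of (2.30), the following sketch (with an illustrative function name) projects tristimulus values onto the chromaticity plane:

```python
def chromaticity(X, Y, Z):
    """Normalize tristimulus values to chromaticity coordinates, see (2.30)."""
    s = X + Y + Z
    return X / s, Y / s          # z = 1 - x - y is redundant

# Equal-energy white: X = Y = Z gives x = y = 1/3 (point "E" in Figure 2.13(c))
print(chromaticity(0.5, 0.5, 0.5))   # -> (0.333..., 0.333...)
```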

Linear Color Spaces. All color spaces that are linearly related to the LMS cone excitation space shall be called linear color spaces in this monograph. When neglecting measurement errors, the CIE RGB and XYZ spaces are linear color spaces and, hence, there exists a matrix by which the XYZ (or RGB) color-matching functions are transformed into cone fundamentals according to (2.24). Actually, cone fundamentals are typically obtained by estimating such a transformation matrix [81]. While we specified the primary spectra for the CIE 1931 RGB color space, the color-matching functions for the CIE 1931 XYZ color space were derived by defining a transformation matrix, without explicitly stating the primary spectra. Given the color-matching functions c̄(λ) = (x̄(λ), ȳ(λ), z̄(λ))ᵀ, the corresponding primary spectra p(λ) = (p_X(λ), p_Y(λ), p_Z(λ))ᵀ are not uniquely defined. With I denoting the 3×3 identity matrix, they only have to fulfill the condition

$$
\int_0^{\infty} \bar{c}(\lambda)\, p(\lambda)^{\mathrm{T}}\, \mathrm{d}\lambda \;=\; \Phi_0 \cdot \boldsymbol{I},
\tag{2.31}
$$

which is a special case of (2.23). Even though there are infinitely many spectra p(λ) that fulfill (2.31), they all have negative entries and thus represent imaginary primaries.⁷ The same is true for the LMS color space and all other linear color spaces with non-negative color-matching functions. This is often referred to as the primary paradox; it is caused by the fact that the cone fundamentals have overlapping support. There is no physically meaningful radiance spectrum, i.e., with p(λ) ≥ 0 for all λ, that excites the M-cones without also exciting the L- or S-cones. For all real primaries, the corresponding color-matching functions have negative entries. Consequently, not all colors of the human gamut can be represented by a physically meaningful mixture of the primary lights. As an example, the chromaticity diagram in Figure 2.13(c) shows the chromaticity coordinates of the sRGB primaries [38]. Displays that use primaries with these chromaticity coordinates can only represent the colors that are located inside the triangle spanned by the primaries. This set of colors is called the color gamut of the display device.

⁷ This can be verified as follows. For obtaining ∫ ȳ(λ) p_Y(λ) dλ = Φ₀, the spectrum p_Y(λ) has to contain values greater than 0 inside the range for which ȳ(λ) is greater than 0; but since either x̄(λ) or z̄(λ) is also greater than 0 inside this range, the integrals ∫ x̄(λ) p_Y(λ) dλ and ∫ z̄(λ) p_Y(λ) dλ cannot become equal to 0 unless p_Y(λ) also has negative entries.

In cameras, the situation is different. Since the transmittance spectra of the color filters (see Section 2.4), which represent the color-matching functions of the camera color space, are always non-negative, it is, in principle, possible to capture all colors of the human gamut. However, the camera color space is only a linear color space if the transmittance spectra of the color filters represent linear combinations of the cone fundamentals (or, equivalently, the XYZ color-matching functions). In practice, this can only be realized approximately. Nonetheless, a linear transformation is often used for converting the camera data into a linear color space; a suitable transformation matrix can be determined by least-squares linear regression. Since camera color spaces are associated with imaginary primaries, the image data captured by a camera sensor cannot be directly used for operating a display device; they always have to be converted. Several algorithms have been developed for realizing such a conversion.

The simplest variant consists of a linear transformation of the tristimulus values (for changing the primaries) and a subsequent clipping of negative values. For the transmission between the camera and the display, an image or video representation format, such as the above-mentioned sRGB, is used. Typically, the representation formats define linear RGB color spaces for which the primary chromaticity coordinates lie inside the human gamut, and they only allow positive tristimulus values. Hence, the color spaces of representation formats also have a limited color gamut, as has been shown for the sRGB format in Figure 2.13(c). The conversion between an RGB and the XYZ color space can be written as

$$
\begin{bmatrix} X \\ Y \\ Z \end{bmatrix}
=
\begin{bmatrix}
X_r & X_g & X_b \\
Y_r & Y_g & Y_b \\
Z_r & Z_g & Z_b
\end{bmatrix}
\cdot
\begin{bmatrix} R \\ G \\ B \end{bmatrix},
\tag{2.32}
$$

where X_r represents the X-component of the red primary, etc. The RGB color spaces used in representation formats are typically defined by the chromaticity coordinates of the red, green, and blue primaries, which shall be denoted by (x_r, y_r), (x_g, y_g), and (x_b, y_b), respectively, and the chromaticity coordinates (x_w, y_w) of the so-called white point, which represents the quality of color for tristimulus values R = G = B. The chromaticity coordinates of the white point are necessary, because they determine the length ratios of the primary vectors in the XYZ coordinate system. According to (2.30), we can replace X by xY/y and Z by (1 − x − y)Y/y in (2.32). If we then write this equation for the white point given by R = G = B, we obtain

$$
\frac{Y_w}{R}
\begin{bmatrix}
\dfrac{x_w}{y_w} \\[4pt] 1 \\[4pt] \dfrac{1-x_w-y_w}{y_w}
\end{bmatrix}
=
\begin{bmatrix}
\dfrac{x_r}{y_r}\,Y_r & \dfrac{x_g}{y_g}\,Y_g & \dfrac{x_b}{y_b}\,Y_b \\[4pt]
Y_r & Y_g & Y_b \\[4pt]
\dfrac{1-x_r-y_r}{y_r}\,Y_r & \dfrac{1-x_g-y_g}{y_g}\,Y_g & \dfrac{1-x_b-y_b}{y_b}\,Y_b
\end{bmatrix}
\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}.
\tag{2.33}
$$

It should be noted that Y_w/R > 0 is only a scaling factor, which specifies the relative luminance of the stimuli with R = G = B = 1. It can be chosen arbitrarily and is often set equal to 1. Then, the linear equation system can be solved for the unknown values Y_r, Y_g, and Y_b, which finally determine the transformation matrix.
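The following sketch (function name and the normalization Y_w/R = 1 chosen for illustration) solves the linear system (2.33) numerically and assembles the RGB-to-XYZ matrix of (2.32); applied to the BT.709 primaries and the D65 white point, it reproduces the well-known BT.709/sRGB conversion matrix.

```python
import numpy as np

def rgb_to_xyz_matrix(xy_r, xy_g, xy_b, xy_w):
    """Build the RGB-to-XYZ matrix of (2.32) by solving (2.33) with Yw/R = 1."""
    def xyz_column(x, y):
        # (X, Y, Z) of a stimulus with chromaticity (x, y) and relative luminance 1
        return np.array([x / y, 1.0, (1.0 - x - y) / y])

    P = np.column_stack([xyz_column(*xy_r), xyz_column(*xy_g), xyz_column(*xy_b)])
    Y_rgb = np.linalg.solve(P, xyz_column(*xy_w))   # relative luminances Yr, Yg, Yb
    return P * Y_rgb                                # scale the matrix columns

# BT.709 primaries with D65 white point (see Figure 2.20)
M = rgb_to_xyz_matrix((0.640, 0.330), (0.300, 0.600), (0.150, 0.060), (0.3127, 0.3290))
print(np.round(M, 4))   # first row approximately [0.4124, 0.3576, 0.1805]
```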

Figure 2.14: Influence of the illumination: (a) Normalized radiance spectra of a tungsten light bulb (illuminant A) and normal daylight (illuminant D65); (b) Reflectance spectrum of the flower “veronica fruticans” [1]; (c) Normalized radiance spectra of the reflected light for both illuminants; the chromaticity coordinates (x, y) are (0.3294, 0.2166) for the light bulb and (0.1971, 0.1130) for daylight.

Illumination. With the exception of computer monitors, television sets, and mobile phone displays, we rarely look at surfaces that emit light. In most situations, the objects we look at reflect light from one or more illumination sources, such as the sun or an incandescent light bulb. Using a simple model, the radiance spectrum Φ(λ) entering the eye from a particular surface point can be expressed as the product

$$
\Phi(\lambda) = S(\lambda) \cdot R(\lambda)
\tag{2.34}
$$

of the incident spectral radiance S(λ) reaching the surface point from the light source and the reflectance spectrum R(λ) of the surface point. The physical structure of the surface determines the degree of photon absorption for different wavelengths and thus the reflectance spectrum. It typically depends on the angles between the incident and reflected rays of light and the surface normal. The color of an object does not only depend on the physical properties of the object surface, but also on the spectrum of the illumination source. This aspect is illustrated in Figure 2.14, where we consider two typical illumination sources, daylight and a tungsten light bulb, and the reflectance spectrum for the petals of a particular flower. Due to the different spectral properties of the two illuminants, the radiance spectra that are reflected from the flower petals are very different and, as a result, the tristimulus and chromaticity values are also different. It should be noted that two objects that are perceived as having the same color for a particular illuminant can be distinguishable from each other when the illumination is changed.

Figure 2.15: Illumination sources: (a) Black-body radiators; (b) Natural daylight; (c) Chromaticity coordinates of black-body radiators (Planckian locus).

The color of a material can only be described with respect to a given illumination source. For that purpose, several illuminants have been standardized. The radiance spectrum of incandescent light sources, i.e., materials for which the emission of light is caused by their temperature, can be described by Planck's law. A so-called black body at an absolute temperature T emits light with a radiance spectrum given by

$$
\Phi_T(\lambda) \;=\; \frac{2\,h\,c^2}{\lambda^5} \left( e^{\frac{h c}{k_B T \lambda}} - 1 \right)^{-1},
\tag{2.35}
$$

where k_B is the Boltzmann constant, h the Planck constant, and c the speed of light in the medium. The temperature T is also referred to as the color temperature of the emitted light. Figure 2.15(a) illustrates the radiance spectra for three temperatures. For low temperatures, the emitted light mainly includes long-wavelength components. When the temperature is increased, the peak of the radiance spectrum is shifted toward the short-wavelength range. Figure 2.15(c) shows the chromaticity coordinates (x, y) of light emitted by black-body radiators in the CIE 1931 chromaticity diagram. The curve representing the black-body radiators for different temperatures is called the Planckian locus. The radiance spectrum of a black-body radiator of about 2856 K has been standardized as illuminant A [42] by the CIE; it represents the typical light emitted by tungsten filament light bulbs.

There are several light sources, such as fluorescent lamps or light-emitting diodes (LEDs), for which the light emission is not caused by temperature. The chromaticity coordinates of such illuminants often do not lie on the Planckian locus. The light of non-incandescent sources is often characterized by the so-called correlated color temperature. It represents the temperature of the black-body radiator for which the perceived color most closely matches that of the considered light source.

With the goal of approximating the radiance spectrum of average daylight, the CIE standardized the illuminant D65 [42]. It is based on various spectral measurements and has a correlated color temperature of 6504 K. Daylight for different conditions can be well approximated by linearly combining three radiance spectra. The CIE specified these three radiance spectra and recommended a procedure for determining the weights given a correlated color temperature in the range from 4000 to 25000 K. These daylight approximations are also referred to as CIE series-D illuminants. Figure 2.15(b) shows the approximations for average daylight (illuminant D65), morning light (4300 K), and twilight (12000 K). The chromaticity coordinates of the illuminant D65 specify the white point of the sRGB format [38]; they are typically also used as the standard setting for the white point of displays.
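As a sketch of how a point of the Planckian locus in Figure 2.15(c) can be computed, the code below evaluates (2.35) on a wavelength grid and converts the resulting spectrum into chromaticity coordinates via (2.29) and (2.30). The array cmf is assumed to hold the tabulated CIE 1931 color-matching functions on the grid lam_nm; all names are illustrative.

```python
import numpy as np

# Physical constants in SI units
h  = 6.62607015e-34    # Planck constant [J s]
c  = 2.99792458e8      # speed of light [m/s]
kB = 1.380649e-23      # Boltzmann constant [J/K]

def planck_spectrum(lam_nm, T):
    """Black-body radiance spectrum (2.35) at temperature T (constant factors
    are irrelevant for the chromaticity computed below)."""
    lam = lam_nm * 1e-9
    return (2.0 * h * c**2 / lam**5) / (np.exp(h * c / (kB * T * lam)) - 1.0)

def planckian_chromaticity(T, lam_nm, cmf):
    """Chromaticity (x, y) of a black-body radiator, using (2.29) and (2.30)."""
    phi = planck_spectrum(lam_nm, T)
    X, Y, Z = (np.trapz(cmf[:, i] * phi, lam_nm) for i in range(3))
    s = X + Y + Z
    return X / s, Y / s
```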

Chromatic Adaptation. The tristimulus values of light reflected from an object's surface highly depend on the spectral composition of the light source. However, to a large extent, our visual system adapts to the spectral characteristics of the illumination sources. Even though we notice the difference between, for example, the orange light of a tungsten light bulb and the blueish twilight just before dark (see Figure 2.15), a sheet of paper is recognized as being white for a large variety of illumination sources. This aspect of the human visual system is referred to as chromatic adaptation. As discussed above, linear color spaces provide a mechanism for determining whether two light stimuli appear to have the same color, but only under the assumption that the viewing conditions do not change. By modeling the chromatic adaptation of the human visual system, we can, to a certain degree, predict how an object observed under one illuminant looks under a different illuminant. A simple theory of chromatic adaptation, which was first postulated by von Kries [88] in 1902, is that the sensitivities of the three cone types are independently adapted to the spectral characteristics of the illumination sources. With (L₁, M₁, S₁) and (L₂, M₂, S₂) being the cone excitation responses for two different viewing conditions, the von Kries model can be formulated as

$$
\begin{bmatrix} L_2 \\ M_2 \\ S_2 \end{bmatrix}
=
\begin{bmatrix} \alpha & 0 & 0 \\ 0 & \beta & 0 \\ 0 & 0 & \gamma \end{bmatrix}
\cdot
\begin{bmatrix} L_1 \\ M_1 \\ S_1 \end{bmatrix}.
\tag{2.36}
$$

If we assume that the white points, i.e., the LMS tristimulus values of light stimuli that appear white, are given by (L_{w1}, M_{w1}, S_{w1}) and (L_{w2}, M_{w2}, S_{w2}) for the two considered viewing conditions, the scaling factors are determined by

$$
\alpha = L_{w2} / L_{w1}, \qquad \beta = M_{w2} / M_{w1}, \qquad \gamma = S_{w2} / S_{w1}.
\tag{2.37}
$$

Today it is known that the chromatic adaptation of our visual system cannot solely be described by an independent re-scaling of the cone sensitivity functions, but also includes non-linear components as well as cognitive effects. Nonetheless, variations of the simple von Kries method are widely used in practice and form the basis of most modern chromatic adaptation models. A generalized linear model for chromatic adaptation in the CIE 1931 XYZ color space can be written as

$$
\begin{bmatrix} X_2 \\ Y_2 \\ Z_2 \end{bmatrix}
=
\boldsymbol{M}_{\mathrm{CAT}}^{-1} \cdot
\begin{bmatrix} \alpha & 0 & 0 \\ 0 & \beta & 0 \\ 0 & 0 & \gamma \end{bmatrix}
\cdot \boldsymbol{M}_{\mathrm{CAT}} \cdot
\begin{bmatrix} X_1 \\ Y_1 \\ Z_1 \end{bmatrix},
\tag{2.38}
$$

where the matrix M_CAT specifies the transformation from the XYZ color space into the color space in which the von Kries-style chromatic adaptation is applied. If the chromaticity coordinates (x_{w1}, y_{w1}) and (x_{w2}, y_{w2}) of the white points for both viewing conditions are given and we assume that the relative luminance Y shall not change, the scaling factors can be determined by

$$
\alpha = \frac{A_{w2}}{A_{w1}}, \quad
\beta  = \frac{B_{w2}}{B_{w1}}, \quad
\gamma = \frac{C_{w2}}{C_{w1}}
\qquad \text{with} \qquad
\begin{bmatrix} A_{wk} \\ B_{wk} \\ C_{wk} \end{bmatrix}
= \boldsymbol{M}_{\mathrm{CAT}} \cdot
\begin{bmatrix} x_{wk}/y_{wk} \\ 1 \\ (1 - x_{wk} - y_{wk})/y_{wk} \end{bmatrix}.
\tag{2.39}
$$

The transformation specified by the matrix M_CAT is referred to as the chromatic adaptation transform. If we strictly follow von Kries' idea, it specifies the transformation from the XYZ into the LMS color space. On the basis of several viewing experiments, it has been found that transformations into color spaces that are represented by so-called sharpened cone fundamentals yield better results. The chromatic adaptation transform that is suggested in the CIECAM02 color appearance model [18, 62] specified by the CIE is given by the matrix

$$
\boldsymbol{M}_{\mathrm{CAT(CIECAM02)}} =
\begin{bmatrix}
 0.7328 & 0.4296 & -0.1624 \\
-0.7036 & 1.6975 &  0.0061 \\
 0.0030 & 0.0136 &  0.9834
\end{bmatrix}.
\tag{2.40}
$$

For more details about chromatic adaptation transforms and modern color appearance models, the reader is referred to [70, 25].

In contrast to the human visual system, digital cameras do not automatically adjust to the properties of the present illumination; they simply measure the radiance of the light falling through the color filters (see Section 2.4). For obtaining natural-looking images, the raw data recorded by the image sensor have to be processed in order to simulate the chromatic adaptation of the human visual system. The corresponding processing step is referred to as white balancing. It is often based on a standard chromatic adaptation transform and directly incorporated into the conversion from the internal color space of the camera to the color space of the representation format. With (R₁, G₁, B₁) being the recorded tristimulus values and (R₂, G₂, B₂) being the tristimulus values of the representation format, we have

$$
\begin{bmatrix} R_2 \\ G_2 \\ B_2 \end{bmatrix}
=
\boldsymbol{M}_{\mathrm{Rep}}^{-1} \cdot \boldsymbol{M}_{\mathrm{CAT}}^{-1} \cdot
\begin{bmatrix} \alpha & 0 & 0 \\ 0 & \beta & 0 \\ 0 & 0 & \gamma \end{bmatrix}
\cdot \boldsymbol{M}_{\mathrm{CAT}} \cdot \boldsymbol{M}_{\mathrm{Cam}} \cdot
\begin{bmatrix} R_1 \\ G_1 \\ B_1 \end{bmatrix}.
\tag{2.41}
$$

The matrices M_Cam and M_Rep specify the conversion from the camera and representation RGB spaces, respectively, into the XYZ space.

The scaling factors α, β, and γ are determined according to (2.39), where the white point (x_{w2}, y_{w2}) is given by the used representation format. For selecting the white point (x_{w1}, y_{w1}) of the actual viewing condition, cameras typically provide various methods, ranging from selecting the white point among a predefined set (“sunny”, “cloudy”, etc.), over calculating it based on (2.35) for a specified color temperature, to automatically estimating it based on the recorded samples.

Figure 2.16: Example of white balancing: (left) Original picture taken between sunset and dusk, implicitly assuming an equal-energy white point; (right) Picture after white balancing (the white point was defined by a selected area of the boat).

An example of white balancing is shown in Figure 2.16. As a result of the spectral composition of the natural light between sunset and dusk, the original image recorded by the camera has a noticeable purple color cast. After white balancing, which was done by using the chromaticity coordinates of an area of the boat as the white point (x_{w1}, y_{w1}), the color cast is removed and the image looks more natural.
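A minimal sketch of the von Kries-style adaptation in (2.38) and (2.39), using the CAT02 matrix of (2.40); the function names are illustrative, and the white points are assumed to be given by their chromaticity coordinates with unchanged relative luminance.

```python
import numpy as np

# CAT02 matrix of (2.40): transformation from XYZ into the "sharpened" adaptation space
M_CAT = np.array([[ 0.7328, 0.4296, -0.1624],
                  [-0.7036, 1.6975,  0.0061],
                  [ 0.0030, 0.0136,  0.9834]])

def white_xyz(x, y):
    # XYZ tristimulus values of a white point with relative luminance Y = 1, cf. (2.30)
    return np.array([x / y, 1.0, (1.0 - x - y) / y])

def chromatic_adaptation(xyz, xy_w1, xy_w2):
    """von Kries-style adaptation (2.38)/(2.39) from white point w1 to white point w2."""
    gain = (M_CAT @ white_xyz(*xy_w2)) / (M_CAT @ white_xyz(*xy_w1))   # alpha, beta, gamma
    return np.linalg.inv(M_CAT) @ (gain * (M_CAT @ xyz))

# Example: adapt a color from illuminant A (x, y = 0.4476, 0.4074) to D65
xyz_a = np.array([0.35, 0.30, 0.12])
print(chromatic_adaptation(xyz_a, (0.4476, 0.4074), (0.3127, 0.3290)))
```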

Perceptual Color Spaces. The CIE 1931 XYZ color space provides a method for predicting whether two radiance spectra are perceived as the same color. It is, however, not suitable for quantitatively describing the difference in perception between two light stimuli. As a first aspect, the perceived brightness difference between two stimuli does not only depend on the difference in luminance, but also on the luminance level to which the eye is adapted (Weber-Fechner law, see Section 2.2.1). The CIE 1931 chromaticity space is not perceptually uniform either.

The experiments of MacAdam [63] showed that the range of not perceptible chromaticity differences for a given reference chromaticity (x₀, y₀) can be described by an ellipse in the x-y plane centered around (x₀, y₀), but the orientation and size of these so-called MacAdam ellipses⁸ highly depend on the considered reference chromaticity (x₀, y₀). With the goal of defining approximately perceptually uniform color spaces with a simple relationship to the CIE 1931 XYZ color space, the CIE specified the color spaces CIE 1976 L*a*b* [43] and CIE 1976 L*u*v* [44], which are commonly referred to as CIELAB and CIELUV, respectively. Typically, the CIE L*a*b* color space is considered to be more perceptually uniform. Its relation to the XYZ space is given by

 ∗        L 0 116 0 f(X/Xw) 16 ∗  a  =  500 −500 0 · f(Y/Yw)  −  0  (2.42) ∗ b 0 200 −200 f(Z/Zw) 0 with √ ( 3 6 3 t : t > ( 29 ) f(t) = 1 29 2 4 6 3 . (2.43) 3 ( 6 ) t + 29 : t ≤ ( 29 ) The values (L∗, a∗, b∗) do not only depend on the considered point in the XYZ space, but also on the tristimulus values (Xw,Yw,Zw) of the reference white point determined by the present illumination. Hence, the L∗a∗b∗ color space includes a chromatic normalization, which cor- responds to a simple von Kries-style model (2.38) with M CAT equal to the identity matrix. The function f(t) mimics the non-linear behavior of the human visual system. The coordinate L∗ is called the lightness, a perceptually corrected version of the relative luminance Y . The com- ponents a∗ and b∗ represents color differences between reddish-magenta and green and yellow and blue, respectively. Hence, the L∗, a∗ and b∗ values can be interpreted as non-linear versions of the opponent-color processes discussed in Section 2.2.1. Due to the approximate perceptual uniformity of the CIE L∗a∗b∗ color space, the difference between two light stimuli can be quantified by the Euclidean distance between the corresponding (L∗, a∗, b∗) vectors, q ∗ ∗ 2 ∗ ∗ 2 ∗ ∗ 2 ∆E = (L1 − L0) + (a1 − a0) + (b1 − b0) . (2.44)

⁸ MacAdam's description can be extended to the XYZ space [10, 22], in which case the regions of not perceptible color differences are ellipsoids.

There are many other color spaces that have been developed for different purposes. Most of them can be derived from the XYZ color space, which can be seen as a master color space, since it has been specified based on experimental data. For image and video coding, the Y'CbCr color space is particularly important. It has some of the properties of CIELAB and will be discussed in Section 2.3, where we describe representation formats for image and video coding.

2.2.3 Visual Acuity

The ability of the human visual system to resolve fine details is determined by three factors: the resolution of the human optics, the sampling of the projected image by the photoreceptor cells, and the neural processing of the photoreceptor signals. The influence of the first two factors was evaluated in several experiments. Measurements of the modulation transfer function [12, 61, 64] revealed that the human eye has significant aberrations for large pupil sizes (see Section 2.2.1). At high spatial frequencies, however, large pupil sizes provide an improved modulation transfer. The estimated cut-off frequency ranges from about 50 cycles per degree (cpd), for pupil sizes of 2 mm, to 200 cpd, for pupil sizes of 7.3 mm [61]. In the foveal region, the average distance between rows of cones is about 0.5 minutes of arc [92, 21]. This corresponds to a Nyquist frequency of 60 cpd. For the short wavelength range of visible light, the image projected onto the retina is significantly blurred due to axial chromatic aberration. The density of the S-cones is also significantly lower than that of the M- and L-cones; it corresponds to a Nyquist frequency of about 10 cpd [20]. The impact of the neural processing on the visual acuity can only be evaluated in connection with the human optics and the retinal sampling. An ophthalmologist typically checks visual acuity using a Snellen chart. At luminance levels of at least 120 cd/m², a person with normal visual acuity has to be able to read letters covering a visual angle of 5 minutes of arc, for example, letters of 8.73 mm height at a distance of 6 m. The used letters can be considered to consist of basically 3 black and 2 white lines in one direction and, hence, people with normal acuity can resolve spatial frequencies of at least 30 cpd.

Contrast Sensitivity. The resolving capabilities of the human visual system are often characterized by contrast sensitivity functions, which specify the contrast threshold between visible and invisible. The contrast C of a stimulus is typically defined as the Michelson contrast

$$
C = \frac{I_{\max} - I_{\min}}{I_{\max} + I_{\min}},
\tag{2.45}
$$

where I_min and I_max are the minimum and maximum luminance of the stimulus. The contrast sensitivity s_c = 1/C_t is the reciprocal of the contrast C_t at which a pattern is just perceivable. Note that s_c = 1 is the smallest meaningful value: since the Michelson contrast cannot exceed 1, a pattern with a higher threshold is invisible for a human observer regardless of its contrast. For analyzing the visual acuity, the contrast sensitivity is typically measured for spatio-temporal sinusoidal stimuli

$$
I(\alpha, t) = \bar{I} \cdot \bigl( 1 + C \cdot \cos(2\pi u\, \alpha) \cdot \cos(2\pi v\, t) \bigr),
\tag{2.46}
$$

where Ī = (I_min + I_max)/2 is the average luminance, u is the spatial frequency in cycles per visual angle, v is the temporal frequency in Hz, α denotes the visual angle, and t represents the time. By varying the spatial and temporal frequency, a function s_c(u, v) is obtained, which is called the spatio-temporal contrast sensitivity function (CSF).
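For illustration, the following sketch generates the stimulus of (2.46) for a static grating and verifies its Michelson contrast (2.45); all names and parameter values are illustrative.

```python
import numpy as np

def csf_stimulus(mean_lum, contrast, u_cpd, v_hz, alpha_deg, t_s):
    """Spatio-temporal sinusoidal test stimulus according to (2.46)."""
    return mean_lum * (1.0 + contrast
                       * np.cos(2.0 * np.pi * u_cpd * alpha_deg)
                       * np.cos(2.0 * np.pi * v_hz * t_s))

def michelson_contrast(I):
    """Michelson contrast (2.45) of a luminance pattern."""
    I = np.asarray(I, dtype=float)
    return (I.max() - I.min()) / (I.max() + I.min())

alpha = np.linspace(0.0, 2.0, 513)                    # visual angle in degrees
I = csf_stimulus(100.0, 0.2, 4.0, 0.0, alpha, 0.0)    # static 4 cpd grating
assert abs(michelson_contrast(I) - 0.2) < 1e-6
```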

Spatial Contrast Sensitivity. The spatial CSF s_c(u) specifies the contrast sensitivity for sinusoidal stimuli that do not change over time (i.e., for v = 0). It can be considered a psychovisual version of the modulation transfer function. Experimental investigations [87, 13, 91, 86] showed that it highly depends on various parameters, such as the average luminance Ī and the field of view. A model that matches the experimental data well was proposed by Barten [4]. Figure 2.17(a) illustrates the basic form of the spatial CSF for foveal vision and different average luminances Ī. The spatial CSF has a bandpass character. Except for very low luminance levels, the Weber-Fechner law, s_c(u) ≠ f(Ī), is valid in the low frequency range. In the high frequency range, however, the CSF highly depends on the average luminance level Ī. For photopic luminances, the CSF has its peak sensitivity between 2 and 4 cpd; the cut-off frequency is between 40 and 60 cpd.

Figure 2.17: Spatial contrast sensitivity: (a) Contrast sensitivity function for different luminance levels (generated using the model of Barten [4] for a 10° × 10° field of view); (b) Comparison of the contrast sensitivity function for isochromatic and isoluminant stimuli (approximation for the experimental data of Mullen [66]).

In order to analyze the resolving capabilities of the opponent processes in human vision, the spatial CSF was also measured for isoluminant stimuli with varying color [66, 73]. Such stimuli with a spatial frequency u and a contrast C are, in principle, obtained by using two sinusoidal gratings with the same spatial frequency u, average luminance Ī, and contrast C, but different colors, and superimposing them with a phase shift of π. Figure 2.17(b) shows a comparison of the spatial CSFs for isochromatic and isoluminant red-green and blue-yellow stimuli. In contrast to the CSF for isochromatic stimuli, the CSF for isoluminant stimuli has a low-pass shape and the cut-off frequency is significantly lower. This demonstrates that the human visual system is less sensitive to changes in color than to changes in luminance.

Spatio-Temporal Contrast Sensitivity. The influence of temporal changes on the contrast sensitivity was also investigated in several experiments, for example in [71, 57]. A model for the spatio-temporal CSF was proposed in [2, 3]. Figure 2.18(a) illustrates the impact of temporal changes on the spatial CSF s_c(u). By increasing the temporal frequency v, the contrast sensitivity is at first increased for low spatial frequencies and the spatial CSF becomes a low-pass function; a further increase of the temporal frequency results in a decrease of the contrast sensitivity for the entire range of spatial frequencies.

Figure 2.18: Spatio-temporal contrast sensitivity: (a) Spatial CSF s_c(u) for different temporal frequencies v; (b) Temporal CSF s_c(v) for different spatial frequencies u. The shown curves represent approximations for the data of Robson [71].

Similarly, as illustrated in Figure 2.18(b), the temporal CSF s_c(v) also has a band-pass shape for low spatial frequencies. When the spatial frequency is moderately increased, the contrast sensitivity is improved for low temporal frequencies and the shape of s_c(v) becomes more low-pass. By further increasing the spatial frequency, the contrast sensitivity is reduced for all temporal frequencies. It should be noted that the spatial and temporal aspects are not independent of each other. The temporal cut-off frequency at which a temporally changing stimulus starts to have a steady appearance is called the critical flicker frequency (CFF); it is about 50–60 Hz. Investigations of the spatio-temporal CSF for chromatic isoluminant stimuli [58] showed that not only the spatial but also the temporal sensitivity to chromatic stimuli is lower than that for luminance stimuli. For chromatic isoluminant stimuli, the CFF lies in the range of 25–30 Hz.

Pattern Sensitivity. The contrast sensitivity functions provide a description of spatial and temporal aspects of human vision. The human visual system is, however, not linear. Thus, the analysis of the responses to harmonic stimuli is not sufficient to completely describe the resolving capabilities of human vision. There are several neural aspects that influence the way we see and discriminate patterns or track the motion of objects over time. For a further discussion of such aspects the reader is referred to the literature on human vision [90, 68].

2.3 Representation of Digital Images and Video

In the following, we describe data formats that serve as input formats for image and video encoders and as output formats of image and video decoders. These raw data formats are also referred to as representation formats and specify how visual information is represented as arrays of discrete-amplitude samples. At the sender side of a video communication system, the camera data have to be converted into such a representation format, and at the receiver side the output of a decoder has to be correctly interpreted for displaying the transmitted pictures. Important examples of representation formats are the ITU-R Recommendations BT.601 [46], BT.709 [46], and BT.2020 [46], which specify raw data formats for standard definition (SD), high definition (HD), and ultra-high definition (UHD) television, respectively. A discussion of several design aspects for UHD television can be found in [48].

2.3.1 Spatio-Temporal Sampling

In order to process images or videos with a microprocessor or computer, the physical quantities describing the visual information have to be discretized, i.e., they have to be sampled and quantized. The physical quantities that we measure in the image plane of a camera are irradiances observed through color filters. Let c_cont(x, y, t) be a continuous function that represents the irradiance for a particular color filter in the image plane of a camera. In image and video coding applications, orthogonal sampling lattices as illustrated in Figure 2.19(a) are used. The W × H sample array c_n[ℓ, m] representing a color component at a particular time instant t_n is, in principle, given by

$$
c_n[\ell, m] = c_{\mathrm{cont}}( \ell \cdot \Delta x,\; m \cdot \Delta y,\; n \cdot \Delta t ),
\tag{2.47}
$$

where ℓ, m, and n are integer values with 0 ≤ ℓ < W and 0 ≤ m < H, and ∆x, ∆y, and ∆t denote the horizontal, vertical, and temporal sampling distances, respectively.
Figure 2.19: Spatial sampling of images and video: (a) Orthogonal spatial sampling lattice; (b) Top and bottom field samples in interlaced video.

Since the samples delivered by the camera are typically represented with more amplitude levels than in the final representation format, we treat c_n[ℓ, m] as continuous-amplitude samples in the following. Furthermore, it is presumed that the same sampling lattice is used for all color components. If the required image size is different from that given by the image sensor or the sampling lattices are not aligned, the color components have to be re-sampled using appropriate discrete filters.

The size of a discrete picture is determined by the numbers of samples W and H in horizontal and vertical direction, respectively. The spatial sampling lattice is further characterized by the sample aspect ratio (SAR) and the picture aspect ratio (PAR) given by

$$
\mathrm{SAR} = \frac{\Delta x}{\Delta y}
\qquad \text{and} \qquad
\mathrm{PAR} = \frac{W \cdot \Delta x}{H \cdot \Delta y} = \frac{W}{H} \cdot \mathrm{SAR}.
\tag{2.48}
$$

Table 2.1 lists the picture sizes, sample aspect ratios, and picture aspect ratios for some common picture formats. The term overscan refers to a concept from analog television; it describes that some samples at the picture borders are not displayed. The picture size W × H determines the range of viewing angles at which a displayed picture appears sharp to a human observer. For that reason, it is also referred to as the spatial resolution of a picture. The temporal resolution of a video is determined by the frame rate f_t = 1/∆t. Typical frame rates are 24/1.001, 24, 25, 30/1.001, 30, 50, 60/1.001, and 60 Hz.

Table 2.1: Examples for common picture formats.

                          picture size    sample aspect   picture aspect
                          (in samples)    ratio (SAR)     ratio (PAR)     overscan

  standard definition     720 x 576       12:11           4:3             horizontal overscan
                          720 x 480       10:11           4:3             (only 704 samples are
                          720 x 576       16:11           16:9            displayed for each
                          720 x 480       40:33           16:9            scanline)

  high definition         1280 x 720      1:1             16:9
                          1440 x 1080     4:3             16:9            without overscan
                          1920 x 1080     1:1             16:9

  ultra-high definition   3840 x 2160     1:1             16:9            without overscan
                          7680 x 4320     1:1             16:9

The spatio-temporal sampling described above is also referred to as progressive sampling. An alternative that was introduced for saving bandwidth in analog television, but is still used in digital broadcast, is the interlaced sampling illustrated in Figure 2.19(b). The spatial sampling lattice is partitioned into odd and even scan lines. The even scan lines (starting with index zero) form the top field and the odd scan lines form the bottom field of an interlaced frame. The top and bottom fields are scanned alternately at successive time instants. The sample arrays of a field have the size W × (H/2). The number of fields per second, called the field rate, is twice the frame rate.
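A small sketch (with illustrative names, assuming an even frame height) of how a progressive frame can be separated into the top and bottom fields described above:

```python
import numpy as np

def split_fields(frame):
    """Split a frame (H x W array) into the top and bottom fields of size (H/2) x W."""
    top    = frame[0::2, :]   # even scan lines (indices 0, 2, 4, ...)
    bottom = frame[1::2, :]   # odd scan lines  (indices 1, 3, 5, ...)
    return top, bottom

frame = np.arange(8 * 4).reshape(8, 4)    # toy 8x4 "frame"
top, bottom = split_fields(frame)
assert top.shape == bottom.shape == (4, 4)
```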

2.3.2 Color Representation

For a device-independent description of color information, representation formats often include the specification of linear color spaces. As discussed in Section 2.2.2, displays are not capable of reproducing all colors of the human gamut. And since the number of amplitude levels required for representing colors with a given accuracy increases with increasing gamut, representation formats typically use linear color spaces with real primaries and non-negative tristimulus values. The color spaces are described by the CIE 1931 chromaticity coordinates of the primaries and the white point. The 3×3 matrix specifying the conversion between the tristimulus values of the representation format and the CIE 1931 XYZ color space can be determined by solving the linear equation system in (2.33). As examples, Figure 2.20 lists the chromaticity coordinates for selected representation formats and illustrates the corresponding gamuts in the chromaticity diagram. In contrast to the HD and UHD specifications BT.709 and BT.2020, the ITU-R Recommendation BT.601 for SD television does not include the specification of a linear color space.

Figure 2.20: Color spaces of selected representation formats: (left) CIE 1931 chromaticity coordinates for the color primaries and white point; (right) Comparison of the corresponding color gamuts to the human gamut. The tabulated coordinates are:

                 SMPTE 170M   EBU Tech. 3213   ITU-R BT.709   ITU-R BT.2020
                 (SD 525)     (SD 625)         (HD)           (UHD)
  red     x_r    0.6300       0.6400           0.6400         0.7080
          y_r    0.3400       0.3300           0.3300         0.2920
  green   x_g    0.3100       0.2900           0.3000         0.1700
          y_g    0.5950       0.6000           0.6000         0.7970
  blue    x_b    0.1550       0.1500           0.1500         0.1310
          y_b    0.0700       0.0600           0.0600         0.0460
  white   x_w    0.3127       0.3127           0.3127         0.3127
  (D65)   y_w    0.3290       0.3290           0.3290         0.3290

For conventional SD television systems, the linear color spaces specified in EBU Tech. 3213 [24] (for 625-line systems) and SMPTE 170M [77] (for 525-line systems) are used⁹, which are similar to that in BT.709. Due to continuing improvements in display technology, the color primaries for the UHD specification BT.2020 have been selected to lie on the spectral locus, yielding a significantly larger gamut than the SD and HD specifications. As a consequence, BT.2020 also recommends larger bit depths for representing amplitude values.

At the sender side, the color sample arrays captured by the image sensor(s) of the camera have to be converted into the color space of the representation format. For each point (ℓ, m) of the sampling lattice, the conversion can be realized by a linear transform according to (2.41).¹⁰ If the transform yields a tristimulus vector with one or more negative entries, the color lies outside the gamut of the representation format and has to be mapped to a similar color inside the gamut; the easiest way of such a mapping is to set the negative entries equal to zero. It is common practice to scale the transform matrix in a way that the components of the resulting tristimulus vectors have a maximum value of one.

At the receiver side, a similar linear transform is required for converting the color vectors of the representation format into the color space of the display device. In accordance with video coding standards such as H.264 | MPEG-4 AVC [53] or H.265 | MPEG-H HEVC [54], we denote the tristimulus values of the representation format with E_R, E_G, and E_B and presume that their values lie in the interval [0; 1].

⁹ Since the 6th edition, BT.601 lists the chromaticity coordinates specified in EBU Tech. 3213 [24] (625-line systems) and SMPTE 170M [77] (525-line systems).

¹⁰ Note that the white point of the representation format is used both for the determination of the conversion matrix M_Rep and for the calculation of the white-balancing factors α, β, and γ. If the camera captures C > 3 color components, the conversion matrix M_Cam and the combined transform matrix have a size of 3 × C.

2.3.3 Non-linear Encoding

The human visual system has a non-linear response to differences in luminance. As discussed in Sections 2.2.1 and 2.2.3, the perceived brightness difference between two image regions with luminances I₁ and I₂ does not only depend on the difference in luminance ∆I = |I₁ − I₂|, but also on the average luminance Ī = (I₁ + I₂)/2. If we add a certain amount of quantization noise to the tristimulus values of a linear color space, whether by discretizing the amplitude levels or by lossy coding, the noise is more visible in dark image regions. This effect can be circumvented if we introduce a suitable non-linear mapping f_TC(E) for the linear color components E and quantize the resulting non-linear color components E′ = f_TC(E). A corresponding non-linear mapping f_TC is often referred to as transfer function or transfer characteristic. For relative luminances Y with amplitudes in the range [0; 1], the perceived brightness can be approximated by a power law

$$
Y' = f_{TC}(Y) = Y^{\gamma_e}.
\tag{2.49}
$$

For the exponent γ_e, which is called the encoding gamma, a value of about 1/2.2 is typically suggested. The non-linear mapping Y → Y′ is commonly also referred to as gamma encoding or gamma correction. Since a color component E of a linear color space represents the relative luminance of the corresponding primary spectrum, the power law (2.49) can also be applied to the tristimulus values of a linear color space. At the receiver side, it has to be ensured that the luminances I produced on the display are roughly proportional to

$$
Y = f_{TC}^{-1}(Y') = (Y')^{\gamma_d}
\qquad \text{with} \qquad \gamma_d = 1/\gamma_e,
\tag{2.50}
$$

so that the end-to-end relationship between the luminance measured by the camera and the reproduced luminance is approximately linear. The exponent γ_d is referred to as the decoding gamma.

Figure 2.21: Non-linear encoding: (a) Comparison of linearly increasing Y and Y′ using the transfer function f_TC specified in BT.709 and BT.2020; the bottom parts illustrate uniform quantization; (b) Comparison of selected transfer functions.

Interestingly, in cathode ray tube (CRT) displays, the luminance I is proportional to (V + ε)^γ, where V represents the applied voltage, ε is a constant voltage offset, and the exponent γ lies in the range of about 2.35 to 2.55. The original motivation for the development of gamma encoding was to compensate for this non-linear voltage-luminance relationship. In modern image and video applications, however, gamma encoding is applied for transforming the linear color components into a nearly perceptually uniform domain and thus minimizing the bit depth required for representing color information [69]. Since the power law (2.49) has an infinite slope at zero and yields unsuitably high values for very small input values, it is often replaced by a linear function around zero, which yields the piecewise-defined transfer function

$$
E' = f_{TC}(E) =
\begin{cases}
\kappa \cdot E & : \; 0 \le E < b \\
a \cdot E^{\gamma} - (a - 1) & : \; b \le E \le 1
\end{cases}.
\tag{2.51}
$$

The exponent γ and the slope κ are specified in representation formats. The values a and b are determined in a way that both sub-functions of f_TC yield the same value and derivative at the connection point E = b. BT.709 and BT.2020 specify the exponent γ = 0.45 and the slope κ = 4.5, which yields the values a ≈ 1.0993 and b ≈ 0.0181. Representation formats specify the application of the transfer function (2.51) to the linear components E_R, E_G, and E_B, which have amplitudes in the range [0; 1].

The resulting non-linear color components are denoted as E′_R, E′_G, and E′_B; their range of amplitudes is also [0; 1]. In most applications, E_R, E_G, and E_B already have discrete amplitudes. For a reasonable application of gamma encoding, the bit depth of the linear components has to be at least 3 bits larger than the bit depth used for representing the gamma-encoded values. Figure 2.21(a) illustrates the subjective effect of non-linear encoding for the relative luminance Y of an achromatic signal. In Figure 2.21(b), the transfer function f_TC as specified in BT.709 and BT.2020 is compared to the simple power law with γ_e = 1/2.2 and the transfer function used in the CIE L*a*b* color space (see Section 2.2.2).
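A minimal sketch of the transfer function (2.51) with the BT.709/BT.2020 parameters quoted above; the constants a and b are the approximate values given in the text, and the function names are illustrative.

```python
import numpy as np

# Parameters of the piecewise transfer function (2.51) as specified in BT.709/BT.2020
GAMMA = 0.45
KAPPA = 4.5
A = 1.0993      # approximate values; chosen so that value and derivative
B = 0.0181      # of the two sub-functions match at the connection point E = b

def transfer_function(E):
    """Gamma encoding of a linear component E in [0, 1] according to (2.51)."""
    E = np.asarray(E, dtype=float)
    return np.where(E < B, KAPPA * E, A * np.power(E, GAMMA) - (A - 1.0))

def inverse_transfer_function(Ep):
    """Inverse mapping from the non-linear component E' back to the linear component E."""
    Ep = np.asarray(Ep, dtype=float)
    return np.where(Ep < KAPPA * B, Ep / KAPPA, np.power((Ep + A - 1.0) / A, 1.0 / GAMMA))
```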

2.3.4 The Y'CbCr Color Representation

Color television was introduced as a backwards-compatible extension of the existing black-and-white television. This was achieved by transmitting two signals with color difference information in addition to the conventional luminance-related signal. As will be discussed in the following, the representation of color images as a luminance-related signal and two color difference signals has some advantages, due to which it is still widely used in image and video communication applications. Firstly, let us assume that the luminance-related signal, which shall be denoted by L, and the color difference signals C₁ and C₂ represent linear combinations of the linear color components E_R, E_G, and E_B. The mapping between the vectors (L, C₁, C₂) and the CIE 1931 XYZ color space can then be represented by the matrix equation

$$
\begin{bmatrix} X \\ Y \\ Z \end{bmatrix}
=
\begin{bmatrix}
X_r & X_g & X_b \\
Y_r & Y_g & Y_b \\
Z_r & Z_g & Z_b
\end{bmatrix}
\cdot
\begin{bmatrix}
R_{\ell} & R_{c1} & R_{c2} \\
G_{\ell} & G_{c1} & G_{c2} \\
B_{\ell} & B_{c1} & B_{c2}
\end{bmatrix}
\cdot
\begin{bmatrix} L \\ C_1 \\ C_2 \end{bmatrix},
\tag{2.52}
$$

where the first matrix specifies the given mapping between the linear RGB color space of the representation format and the XYZ color space, and the second matrix specifies the mapping from the LC₁C₂ space to the RGB space. We consider the following desirable properties:

• Achromatic signals (x = xw and y = yw) have C1 = C2 = 0;

• Changes in the color difference components C₁ or C₂ do not have any impact on the relative luminance Y.

The first property requires R_ℓ = G_ℓ = B_ℓ. The second criterion is fulfilled if, for k being equal to 1 and 2, we have

$$
Y_r \cdot R_{ck} + Y_g \cdot G_{ck} + Y_b \cdot B_{ck} = 0.
\tag{2.53}
$$

Probably to simplify implementations, early researchers chose R_{c1} = 0 and B_{c2} = 0. With s_ℓ, s_{c1}, and s_{c2} being arbitrary non-zero scaling factors, this choice yields

$$
\begin{aligned}
L   &= s_{\ell} \cdot ( Y_r \cdot E_R + Y_g \cdot E_G + Y_b \cdot E_B ) \\
C_1 &= s_{c1} \cdot ( -Y_r \cdot E_R - Y_g \cdot E_G + (Y_r + Y_g) \cdot E_B ) \\
C_2 &= s_{c2} \cdot ( (Y_g + Y_b) \cdot E_R - Y_g \cdot E_G - Y_b \cdot E_B ).
\end{aligned}
\tag{2.54}
$$

By using Y = Y_r E_R + Y_g E_G + Y_b E_B, we can also write

$$
\begin{aligned}
L   &= s_{\ell} \cdot Y \\
C_1 &= s_{c1} \cdot \bigl( (Y_r + Y_g + Y_b)\, E_B - Y \bigr) \\
C_2 &= s_{c2} \cdot \bigl( (Y_r + Y_g + Y_b)\, E_R - Y \bigr).
\end{aligned}
\tag{2.55}
$$

The component L is, as expected, proportional to the relative luminance Y; the components C₁ and C₂ represent differences between a primary component and the appropriately scaled relative luminance Y.

Y'CbCr. Due to decisions made in the early years of color television, the transformation (2.55) from the RGB color space into a color space with a luminance-related and two color difference components is applied after gamma encoding.¹¹ The transformation is given by

$$
\begin{aligned}
E'_Y    &= K_R \cdot E'_R + (1 - K_R - K_B) \cdot E'_G + K_B \cdot E'_B \\
E'_{Cb} &= (E'_B - E'_Y) \,/\, (2 - 2K_B) \\
E'_{Cr} &= (E'_R - E'_Y) \,/\, (2 - 2K_R).
\end{aligned}
\tag{2.56}
$$

The component E′_Y is called the luma component and the color difference signals E′_Cb and E′_Cr are called chroma components. The terms “luma” and “chroma” have been chosen to indicate that the signals are computed as linear combinations of gamma-encoded color components; the non-linear nature is also indicated by the prime symbol.

¹¹ In the age of CRT TVs, this processing order had the advantage that the decoded E′_R, E′_G, and E′_B signals could be directly fed to a CRT display.

Figure 2.22: Representation of a color image (left) as red, green, and blue components E′_R, E′_G, E′_B (top right) and as luma and chroma components E′_Y, E′_Cb, E′_Cr (bottom right). All components are represented as gray-value pictures; for the signed components E′_Cb and E′_Cr a constant offset (middle gray) is added.

The representation of color images by a luma and two chroma components is referred to as the Y'CbCr or YCbCr color format. Note that, in contrast to linear color spaces, Y'CbCr is not an absolute color space, but rather a way of encoding the tristimulus values of a linear color space. The scaling factors in (2.56) are chosen in a way that the luma component has an amplitude range of [0; 1] and the chroma components have amplitude ranges of [−0.5; 0.5]. If we neglect the impact of gamma encoding, the constants K_R and K_B have to be chosen according to

$$
K_R = \frac{Y_r}{Y_r + Y_g + Y_b}
\qquad \text{and} \qquad
K_B = \frac{Y_b}{Y_r + Y_g + Y_b},
\tag{2.57}
$$

where Y_r, Y_g, and Y_b are, as indicated in (2.52), determined by the chosen linear RGB color space. For BT.709 (K_R = 0.2126, K_B = 0.0722) and BT.2020 (K_R = 0.2627, K_B = 0.0593), the specified values of K_R and K_B can be directly derived from the chromaticity coordinates of the primaries and white point. BT.601, which does not define a color space, specifies the values K_R = 0.299 and K_B = 0.114, which were derived based on the color space of an old NTSC standard [85].¹² In the Y'CbCr format, color images are, in principle, represented by an achromatic signal E′_Y, a blue-yellow difference signal E′_Cb, and a red-green difference signal E′_Cr.

¹² In SD television, there is a discrepancy between the Y'CbCr format and the linear color spaces [24, 77] used in existing systems. As a result, quantization errors in the chroma components have a larger impact on the luminance I of decoded pictures than would be the case with the choice given in (2.57).

Table 2.2: Common color formats for image and video coding.

  format          description
  RGB (4:4:4)     The red, green, and blue components have the same size.
  Y'CbCr 4:4:4    The chroma components have the same size as the luma component.
  Y'CbCr 4:2:2    The chroma components are horizontally subsampled by a factor of two.
                  The height of the chroma components is the same as that of the luma component.
  Y'CbCr 4:2:0    The chroma components are subsampled by a factor of two in both horizontal
                  and vertical direction. Each chroma component contains a quarter of the
                  samples of the luma component.

In that respect, the Y'CbCr format is similar to the CIELAB color space and the opponent processes in human vision. The transformation into the Y'CbCr domain effectively decorrelates the cone responses and thus also the RGB data of typical natural images. When we use the Y'CbCr format as the basis for lossy image or video coding, the components can be treated separately and the quantization errors are still introduced in a perceptually meaningful way (as far as the color representation is concerned). Figure 2.22 illustrates the differences between the RGB and Y'CbCr formats for an example image. As can be seen, the red, green, and blue components are typically highly correlated. In the Y'CbCr format, however, most of the visual information is concentrated in the luma component. Due to these properties, the Y'CbCr format is well suited for lossy coding and is used in nearly all image and video communication applications.
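A minimal sketch of the conversion (2.56) for gamma-encoded components in [0, 1]; the function name is illustrative, and the BT.709 constants from (2.57) are used in the example call.

```python
import numpy as np

def rgb_to_ycbcr(rgb_prime, K_R, K_B):
    """Convert gamma-encoded R'G'B' values in [0, 1] into Y'CbCr according to (2.56)."""
    R, G, B = rgb_prime
    Y  = K_R * R + (1.0 - K_R - K_B) * G + K_B * B
    Cb = (B - Y) / (2.0 - 2.0 * K_B)          # range [-0.5, 0.5]
    Cr = (R - Y) / (2.0 - 2.0 * K_R)          # range [-0.5, 0.5]
    return np.array([Y, Cb, Cr])

# Example with the BT.709 constants of (2.57)
print(rgb_to_ycbcr((0.25, 0.50, 0.75), K_R=0.2126, K_B=0.0722))
```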

Chroma Subsampling. When we discussed contrast sensitivity functions in Section 2.2.3, we noted that human beings are much more sensitive to high-frequency components in isochromatic than in isoluminant stimuli. For saving bit rate, the chroma components are often downsampled. For normal viewing distances, the reduction of the number of chroma samples does not result in any perceivable degradation of image quality. Table 2.2 summarizes the color formats used in image and video coding applications. The most commonly used format is the Y'CbCr 4:2:0 format, in which the chroma sample arrays are downsampled by a factor of two in both horizontal and vertical direction. Representation formats do not specify filters for resampling the chroma components.

Figure 2.23: Nominal locations of chroma samples (indicated by circles) relative to those of the luma samples (indicated by crosses) for different chroma sampling formats; from left to right: 4:4:4, 4:2:2, 4:2:0 (BT.2020), 4:2:0 (MPEG-1), 4:2:0 (MPEG-2).

In order to avoid color fringes in displayed images, the phase shifts of the filters and thus the locations of the chroma samples relative to the luma samples should be known. For the 4:4:4 and 4:2:2 sampling formats, representation formats and video coding standards generally specify that the top-left chroma samples coincide with the top-left luma sample (see Figure 2.23). For the 4:2:0 format, however, different alternatives are used. While BT.2020 specifies that the top-left chroma samples coincide with the top-left luma sample (third picture in Figure 2.23), in the video coding standards MPEG-1 Video [45], H.261 [49], and H.263 [50], the chroma samples are located in the center of the four associated luma samples (fourth picture in Figure 2.23). And in the video coding standards H.262 | MPEG-2 Video [52], H.264 | MPEG-4 AVC [53], and H.265 | MPEG-H HEVC [54], the nominal offset between the top-left chroma and luma samples is zero in horizontal and half a luma sample in vertical direction (fifth picture in Figure 2.23). Video coding standards such as H.264 | MPEG-4 AVC and H.265 | MPEG-H HEVC include syntax for indicating the location of the chroma samples in the 4:2:0 format inside the bitstream.
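As an illustration only, the following sketch downsamples a chroma plane by simple 2×2 averaging; this particular filter corresponds to the MPEG-1 style siting (chroma sample in the center of the four associated luma samples), whereas the other sitings in Figure 2.23 require filters with different phase shifts. Names are illustrative and the plane dimensions are assumed to be even.

```python
import numpy as np

def downsample_chroma_420(chroma):
    """Downsample a chroma plane by 2x2 block averaging (illustrative filter only)."""
    c = np.asarray(chroma, dtype=float)
    return 0.25 * (c[0::2, 0::2] + c[1::2, 0::2] + c[0::2, 1::2] + c[1::2, 1::2])

cb = np.random.rand(8, 8)
assert downsample_chroma_420(cb).shape == (4, 4)
```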

Constant Luminance Y’CbCr. The application of gamma encoding before calculating the Y’CbCr components in (2.56) has the effect that changes in the chroma components due to quantization or subsampling influence the relative luminance of the displayed signal. BT.2020 [47] specifies an alternative format, which is given by the components

$$
E'_{YC} = f_{TC}\bigl( K_R \cdot E_R + (1 - K_R - K_B) \cdot E_G + K_B \cdot E_B \bigr)
\tag{2.58}
$$
$$
E'_{CbC} = (E'_B - E'_{YC}) \,/\, N_B
\tag{2.59}
$$
$$
E'_{CrC} = (E'_R - E'_{YC}) \,/\, N_R.
\tag{2.60}
$$

The sign-dependent scaling factors N_B and N_R are

$$
N_X =
\begin{cases}
2a\,(1 - K_X)^{\gamma} - 2\,(a - 1) & : \; E'_X - E'_{YC} \le 0 \\[2pt]
2a\,(1 - K_X^{\gamma}) & : \; E'_X - E'_{YC} > 0
\end{cases},
\tag{2.61}
$$

where a and γ represent the corresponding parameters of the transfer function f_TC in (2.51). By defining s_Y = Y_r + Y_g + Y_b and using (2.58), we obtain for the relative luminance Y of the decoded signal

$$
\begin{aligned}
Y &= s_Y \cdot \bigl( K_R\, E_R + (1 - K_R - K_B)\, E_G + K_B\, E_B \bigr) \\
  &= s_Y \cdot \Bigl( K_R\, E_R + \bigl( f_{TC}^{-1}(E'_{YC}) - K_R\, E_R - K_B\, E_B \bigr) + K_B\, E_B \Bigr) \\
  &= s_Y \cdot f_{TC}^{-1}(E'_{YC}).
\end{aligned}
\tag{2.62}
$$

The relative luminance Y depends only on E′_YC. For that reason, the alternative format is also referred to as the constant luminance Y'CbCr format. In the document BT.2246 [48], the impact on video coding was evaluated by encoding eight test sequences, given in an RGB format, with H.265 | MPEG-H HEVC [54]. The reconstruction quality was measured in the CIELAB color space using the distortion measure ∆E given in (2.44). It is reported that by choosing the constant luminance format instead of the conventional Y'CbCr format, on average 12% bit rate savings are obtained for the same average distortion. The constant luminance Y'CbCr format has similar properties to the currently dominating standard Y'CbCr format and could replace it in image and video applications without requiring any adjustments apart from the modified transformation from and to the linear RGB color space.

2.3.5 Quantization of Sample Values

Finally, for obtaining discrete-amplitude samples suitable for coding and digital transmission, the luma and chroma components E′_Y, E′_Cb, and E′_Cr are quantized using uniform quantization. The ITU-R Recommendations BT.601, BT.709, and BT.2020 specify that the corresponding integer color components Y, Cb, and Cr are obtained according to

$$
Y  = \bigl[\, (219 \cdot E'_Y + 16) \cdot 2^{B-8} \,\bigr],
\tag{2.63}
$$
$$
Cb = \bigl[\, (224 \cdot E'_{Cb} + 128) \cdot 2^{B-8} \,\bigr],
\tag{2.64}
$$
$$
Cr = \bigl[\, (224 \cdot E'_{Cr} + 128) \cdot 2^{B-8} \,\bigr],
\tag{2.65}
$$

where B denotes the bit depth, in bits per sample, for representing the amplitude values and the operator [ · ] represents rounding to the nearest integer. While BT.601 and BT.709 recommend bit depths of 8 or 10 bits, the UHD specification BT.2020 recommends the usage of 10 or 12 bits per sample. Video coding standards typically support the usage of different bit depths for the luma and chroma components. In the most widely used profiles, however, only bit depths of 8 bits per sample are supported. If the RGB format is used for coding, all three color components are quantized according to (2.63).

Quantization according to (2.63)–(2.65) does not use the entire range of B-bit integer values. The ranges of unused values are referred to as footroom (small values) and headroom (large values). They allow the implementation of signal processing operations, such as filtering or analog-to-digital conversion, without the need for clipping the results. In the xvYCC color space [39], the headroom and footroom are used for extending the color gamut. When using this format, the linear components E as well as the gamma-encoded components E′ are no longer restricted to the interval [0; 1] and the definition of the transfer function f_TC is extended beyond the domain [0; 1]. As an alternative, the video coding standards H.264 | MPEG-4 AVC [53] and H.265 | MPEG-H HEVC [54] provide a syntax element by which it can be indicated that the full range of B-bit integer values is used for representing amplitude values, in which case the quantization equations (2.63)–(2.65) are modified so that the minimum and maximum used integer values are 0 and 2^B − 1, respectively.
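A minimal sketch of the limited-range quantization (2.63)–(2.65); the function name is illustrative.

```python
import numpy as np

def quantize_ycbcr(Ey, Ecb, Ecr, bit_depth=10):
    """Limited-range quantization of Y'CbCr components according to (2.63)-(2.65)."""
    scale = 2 ** (bit_depth - 8)
    Y  = np.round((219.0 * np.asarray(Ey)  +  16.0) * scale).astype(int)
    Cb = np.round((224.0 * np.asarray(Ecb) + 128.0) * scale).astype(int)
    Cr = np.round((224.0 * np.asarray(Ecr) + 128.0) * scale).astype(int)
    return Y, Cb, Cr

# Mid-gray luma with neutral chroma at a bit depth of 10 gives Y = 502, Cb = Cr = 512
print(quantize_ycbcr(0.5, 0.0, 0.0, bit_depth=10))
```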

2.4 Image Acquisition

Modern digital cameras are complex devices that consist of a multitude of components, which often include advanced systems for automatic focusing, exposure control, and white balancing. The most important components are illustrated in Figure 2.24. The camera lens forms an image of a real-world scene on the image sensor, which is located in the image plane of the camera. The lens (or some lens elements) can be moved for focusing objects at different distances.

Figure 2.24: Basic principle of image acquisition with a digital camera.

As discussed in Section 2.1, the focal length of the lens determines the field of view and its aperture regulates the depth of field as well as the illuminance (the photometric equivalent of irradiance) falling on the image sensor. The image sensor basically converts the illuminance pattern observable on its surface into an electric signal. This is achieved by measuring the energy of visible light that falls onto small areas of the image sensor during a certain period of time, which is referred to as the exposure time or shutter speed. The image processor controls the image sensor and converts the electric signal that is output by the image sensor into a digital representation of the captured scene.

The amount of visible light energy per unit area that is used for creating a picture is called exposure; it is given by the product of the illuminance on the image sensor and the exposure time t_e. The illuminance on the sensor is proportional to the area of the entrance pupil and, thus, to the square of the aperture diameter a. But the area of an object's image on the sensor is also approximately proportional to the square of the focal length f. Hence, for a given scene, the illuminance on the image sensor depends only on the f-number F = f/a. The camera settings that influence the exposure are often expressed as the exposure value EV = log₂(F²/t_e). All combinations of aperture and shutter speed that have the same exposure value give the same exposure for a chosen scene. An increment of the exposure value by one, commonly called one "stop", corresponds to halving the amount of visible light energy. Note that different camera settings with the same exposure value still yield different pictures, because the depth of field depends on the f-number and the amount of motion blur on the shutter speed. For video, the exposure time has to be smaller than the reciprocal of the frame rate.
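As a small numerical illustration of the exposure value, the following Python sketch evaluates EV = log₂(F²/t_e) for a few aperture/shutter combinations; the particular f-numbers and exposure times are chosen arbitrarily for the example.

```python
import math

def exposure_value(f_number, exposure_time):
    """Exposure value EV = log2(F^2 / t_e) as defined in the text."""
    return math.log2(f_number ** 2 / exposure_time)

# Opening the aperture by one stop (f-number divided by sqrt(2)) while halving
# the exposure time keeps the exposure value, and thus the exposure, constant.
# Nominal f-numbers are rounded, so the EV values agree only approximately.
for f_number, t_e in [(8.0, 1 / 125), (5.6, 1 / 250), (4.0, 1 / 500)]:
    ev = exposure_value(f_number, t_e)
    print(f"F = {f_number:3.1f}, t_e = 1/{round(1 / t_e):3d} s  ->  EV = {ev:.2f}")
```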

Figure 2.25: Image sensor: (a) Array of light-sensitive photocells; (b) Illustration of the exposure-voltage transfer function for a photocell.

2.4.1 Image Sensor

An image sensor consists of an array of light-sensitive photocells, as illustrated in Figure 2.25(a). Each photocell corresponds to a pixel in the acquired images. Typically, microlenses are located above the photocells. Their purpose is to improve the light efficiency by directing most of the incident light to the light-sensitive parts of the sensor. For some types of sensors, which we will further discuss in Section 2.4.2, color filters that block light outside a particular spectral range are placed between the photocells and microlenses. Another filter is typically inserted between the lens and the sensor. It is used for removing wavelengths to which human beings are not sensitive, but to which the image sensor is sensitive. Without such a filter, the acquired images would have incorrect colors or gray values, since parts of the infrared or ultraviolet spectrum would contribute to the generated image signal.

Modern digital cameras use either charge-coupled device (CCD) or complementary metal-oxide-semiconductor (CMOS) image sensors. Both types of sensors employ the photoelectric effect. When a photon (a quantum of electromagnetic radiation) strikes the semiconductor of a photocell, it creates an electron-hole pair. By applying an electric field, the positive and negative charges are collected during the exposure time and a voltage proportional to the number of incoming photons is generated. At the end of an exposure, the generated voltages are read out, converted to digital signals, and further processed by the image processor. Since the created charges are proportional to the number of incoming photons, the exposure-voltage transfer function for a photocell is basically linear. However, as shown in Figure 2.25(b), there is a saturation level, which is determined by the maximum collectible charge. If the exposure exceeds the saturation level for a significant number of photocells, the captured image is overexposed; the lost image details cannot be recovered by the following signal processing.

Sensor Noise. The number of photons that arrive at a particular photocell during the exposure time is random; it can be well modeled as a random variable with a Poisson distribution. The resulting noise in the captured image is called photon shot noise. The Poisson distribution has the property that the variance σ² is equal to the mean µ. Hence, if we assume a linear relationship between the number of photons and the generated voltage, the signal-to-noise ratio (SNR) of the output signal is proportional to the average number of incoming photons (µ²/σ² = µ). Other types of noise that affect the image quality are:

• Dark current noise: a certain amount of charge per time unit can also be created by thermal vibration;
• Read noise: thermal noise in the readout circuitry;
• Reset noise: some charges may remain after resetting the photocells at the beginning of an exposure;
• Fixed pattern noise: caused by manufacturing variations across the photocells of a sensor.

Most noise sources are independent of the irradiance on the sensor. An exception is the photon shot noise, which becomes predominant above a certain irradiance level. The SNR of a captured image increases with the number of photons arriving at a photocell during the exposure time. Consequently, pictures and videos captured with the small image sensors (and small photocells) of smartphones are considerably noisier than those captured with the large sensors of professional cameras.
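The relation µ²/σ² = µ can be verified with a short simulation. The following sketch draws Poisson-distributed photon counts for several mean values and estimates the resulting SNR; the sample size and the chosen means are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Photon shot noise: the photon count per photocell is Poisson distributed,
# so its variance equals its mean and the SNR (mu^2 / sigma^2) equals mu.
for mean_photons in (10, 100, 1000, 10000):
    counts = rng.poisson(mean_photons, size=100_000)
    snr = counts.mean() ** 2 / counts.var()
    print(f"mean = {mean_photons:5d}  ->  estimated SNR = {snr:8.1f}"
          f"  ({10 * np.log10(snr):5.1f} dB)")
```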

ISO Speed. The ISO speed or ISO sensitivity is a measure that was originally standardized by the International Organization for Standardization (ISO) for specifying the light-sensitivity of photographic films. It is now also used as a measure for the sensitivity of image sensors. Digital cameras typically allow the ISO speed to be selected within a given range. Changing the ISO speed modifies the amplification factor that is applied to the sensor's output signal before analog-to-digital conversion. The ISO system defines a linear and a logarithmic scale. Digital cameras typically use the linear scale (with values of 100, 200, etc.), for which a doubling of the ISO sensitivity corresponds to a doubling of the amplification factor. Note that higher ISO values correspond to lower signal-to-noise ratios, since the noise in the sensor's output is also amplified.

The ISO speed is the third parameter, besides the aperture and the shutter speed, by which the exposure of a picture can be controlled. Typically, an image is considered to be correctly exposed if nearly the entire range of digital amplitude levels is utilized and the portion of saturated photocells or clipped sample values is very small. For a given scene, the photographer or videographer can select one of multiple suitable combinations of aperture, shutter speed, and ISO sensitivity and, thus, control the depth of field, motion blur, and noise level within certain ranges. For filming in dark environments, increasing the ISO sensitivity is often the only way to achieve the required frame rate.
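To illustrate the trade-off, the following sketch models a photocell with Poisson shot noise and additive read noise, amplified by the ISO gain before analog-to-digital conversion. Keeping the output level constant while raising the ISO, and therefore capturing proportionally fewer photons, lowers the SNR. The read-noise level, base ISO, and photon counts are invented for the example; real sensor models are considerably more involved.

```python
import numpy as np

rng = np.random.default_rng(1)

def capture(mean_photons, iso, base_iso=100, read_noise=3.0, n=100_000):
    """Toy photocell model: Poisson shot noise plus Gaussian read noise,
    amplified by the ISO gain (all parameter values are illustrative)."""
    gain = iso / base_iso
    electrons = rng.poisson(mean_photons, n) + rng.normal(0.0, read_noise, n)
    return gain * electrons

# Correct exposure at a fixed output level: doubling the ISO halves the
# number of captured photons, so the noise in the output increases.
for iso in (100, 200, 400, 800):
    signal = capture(mean_photons=1600 * 100 / iso, iso=iso)
    print(f"ISO {iso:4d}: mean = {signal.mean():7.1f}, "
          f"std = {signal.std():5.1f}, SNR = {signal.mean() / signal.std():5.1f}")
```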

2.4.2 Capture of Color Images

The photocells of an image sensor basically only count photons. They cannot discriminate between photons of different wavelengths inside the visible spectrum. As discussed in Section 2.2.2, we need, however, at least three image signals, each for a different range of wavelengths, for representing color images. Consequently, the spectrum of visible light has to be decomposed into three spectral components. There are two dominating techniques in today's cameras: three-sensor systems and single sensors with color filter arrays. A third technique, the multi-layer sensor, is also used in some cameras.

Three-Sensor Systems. As the name suggests, three-sensor systems use three image sensors, each for a different part of the spectrum. The light that falls through the lens is split by a trichroic prism assembly, which consists of two prisms with dichroic coatings (dichroic prisms), as illustrated in Figure 2.26(a). The dichroic optical coatings have the property that they reflect or transmit light depending on the light's wavelength.

Figure 2.26: Color separation: (a) Three-sensor camera with color separation by a trichroic prism assembly; (b) Sensor with color filter array (Bayer pattern).

In the example of Figure 2.26(a), the short wavelength range is reflected at the first coating and directed to the image sensor that captures the blue color component. The remaining light passes through. At the second filter coating, the long wavelength range is reflected and directed to the sensor capturing the red component. The remaining middle wavelength range, which corresponds to the green color component, is transmitted and captured by the third sensor. In contrast to image sensors with color filter arrays, three-sensor systems have the advantages that basically all photons are used by one of the image sensors and that no interpolation is required. As a consequence, they typically provide images with better resolution and lower noise. Three-sensor systems are, however, also more expensive, and they are large and heavy, in particular when large image sensors are used.

Sensors with Color Filter Arrays. Another possibility to distinguish photons of different wavelength ranges is to use a color filter array with a single image sensor. As illustrated in Figure 2.26(b), each photocell is covered by a small color filter that basically blocks photons with wavelengths outside the desired range from reaching the photocell. The color filters are typically placed between the photocell and the microlens, as shown in Figure 2.25(a). The color filter pattern shown in Figure 2.26(b) is called the Bayer pattern. It is the most common type of color filter array and consists of a repeating 2×2 grid with two green, one red, and one blue color filter. The reason for using twice as many green as red or blue color filters is that humans are more sensitive to the middle wavelength range of visible light. Several alternatives to the Bayer pattern are used by some manufacturers. These patterns either use filters of different colors or a different arrangement, or they include filters with a fourth color.

Since each photocell of a sensor can only count photons for one of the wavelength ranges, the sample arrays for the color components contain a significant number of holes. The unknown sample values have to be generated using interpolation algorithms. This processing step is commonly called demosaicing. For a Bayer sensor, actually half of the samples for the green component and three quarters of the samples for the red and blue components have to be generated. If the assumptions underlying the employed demosaicing algorithm do not hold for an image region, interpolation errors can cause visible artifacts. The most frequently observed artifacts are Moiré patterns, which typically appear as false color patterns in image regions with fine details. For reducing demosaicing artifacts, digital image sensors with color filter arrays typically incorporate an optical low-pass filter or anti-aliasing filter, which is placed directly in front of the sensor. Often, this filter consists of two layers of a birefringent material and is combined with an infrared absorption filter. The optical low-pass filter splits every ray of light into four rays, each of which falls on one photocell of a 2×2 cluster. By decreasing the high-frequency components of the irradiance pattern on the photocell array, it reduces the demosaicing artifacts, but also the sharpness of the captured image.

Image sensors with color filter arrays are smaller, lighter, and less expensive than three-sensor systems. But due to the color filters, they have a lower light efficiency. The demosaicing can cause visible interpolation artifacts; in combination with the optical low-pass filtering that is often applied, it also reduces the sharpness.
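The following sketch shows a minimal bilinear demosaicing of an RGGB Bayer mosaic, i.e. the simplest form of the interpolation step described above. The RGGB sample arrangement, the use of SciPy for the convolutions, and the bilinear kernels are assumptions made for this example; demosaicing algorithms in actual cameras are far more sophisticated (typically edge-adaptive) in order to avoid the artifacts mentioned in the text.

```python
import numpy as np
from scipy.ndimage import convolve

def demosaic_bilinear(raw):
    """Bilinear demosaicing of a Bayer (RGGB) mosaic.

    raw: 2-D array sampled through an RGGB color filter array, i.e. R at
    (even, even), G at (even, odd) and (odd, even), B at (odd, odd) positions.
    Returns an array of shape (H, W, 3) with interpolated R, G, B planes.
    """
    h, w = raw.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r_mask = ((yy % 2 == 0) & (xx % 2 == 0)).astype(float)
    b_mask = ((yy % 2 == 1) & (xx % 2 == 1)).astype(float)
    g_mask = 1.0 - r_mask - b_mask

    k_rb = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]]) / 4.0  # R and B planes
    k_g  = np.array([[0, 1, 0], [1, 4, 1], [0, 1, 0]]) / 4.0  # G plane

    def interpolate(mask, kernel):
        # Dividing by the filtered mask normalizes the weights, which also
        # handles the image borders correctly.
        return convolve(raw * mask, kernel) / convolve(mask, kernel)

    return np.dstack([interpolate(r_mask, k_rb),
                      interpolate(g_mask, k_g),
                      interpolate(b_mask, k_rb)])

# A uniform gray mosaic is reconstructed exactly (no color artifacts can occur here).
mosaic = np.full((6, 8), 0.5)
print(np.allclose(demosaic_bilinear(mosaic), 0.5))   # True
```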

Multi-Layer Image Sensors. In a multi-layer sensor, the photocells for different wavelength ranges are not arranged in a two-dimensional, but in a three-dimensional array. At each spatial location, three photodiodes are vertically stacked. The sensor employs the property that the absorption coefficient of silicon is highly wavelength dependent. As a result, each of the three stacked photodiodes at a sample location responds to a different wavelength range. The sample values of the three primary colors (red, green, blue) are generated by processing the captured signals. Since three color samples are captured for each spatial location, optical low-pass filtering and demosaicing are not required and interpolation artifacts do not occur. The spectral sensitivity curves resulting from the employed wavelength separation by absorption are less linearly related to the cone fundamentals than those of typical color filters. As a consequence, it is often reported that multi-layer sensors have a lower color accuracy than sensors with color filter arrays.

2.4.3 Image Processor

The signals that are output by the image sensor have to be further processed and eventually converted into a format suitable for image or video exchange. As a first step, which is required for any further signal processing, the analog voltage signals have to be converted into digital signals. In order to reduce the impact of this quantization on the following processing, typically a bit depth significantly larger than the bit depth of the final representation format is used. The analog-to-digital conversion is sometimes integrated into the sensor.

For converting the obtained digital sensor signals into a representation format, the following processing steps are required: demosaicing (for sensors with color filter arrays, Section 2.4.2); a conversion from the camera color space to the linear color space of the representation format, including white balancing (Section 2.2.2); gamma encoding of the linear color components (Section 2.3.3); optionally, a transform of the gamma-encoded components into a Y'CbCr format (Section 2.3.4); and a final quantization of the sample values (Section 2.3.5). Besides these required processing steps, image processors often also apply algorithms for improving the image quality, for example, denoising and sharpening algorithms or processing steps for reducing the impact of lens imperfections, such as vignetting, geometrical distortions, and chromatic aberrations, on the output images. Particularly in consumer cameras, the raw image data are typically also compressed using an image or video encoder. The outputs of the camera are then bitstreams that conform to a widely accepted coding standard, such as JPEG [51] or H.264 | MPEG-4 AVC [53], and are embedded in a container format.
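The chain of required processing steps can be summarized in a short sketch. The white-balance gains, the camera-to-RGB matrix, the BT.709-style transfer function, and the luma weights used below are placeholders chosen for illustration; a real image processor uses the calibrated, camera- and format-specific operations described in the referenced sections.

```python
import numpy as np

def f_tc(e):
    """Gamma encoding (BT.709-style transfer function, used here as a stand-in)."""
    return np.where(e < 0.018, 4.5 * e, 1.099 * np.power(e, 0.45) - 0.099)

def process(rgb_cam, B=8, K_R=0.2126, K_B=0.0722):
    """Conversion chain after demosaicing: white balancing and conversion to the
    linear RGB space of the representation format, gamma encoding, transform to
    Y'CbCr, and quantization. rgb_cam has shape (H, W, 3) with values in [0, 1]."""
    wb_gains = np.array([1.8, 1.0, 1.4])        # placeholder white-balance gains
    cam_to_rgb = np.eye(3)                      # placeholder camera color matrix
    rgb_lin = np.clip((rgb_cam * wb_gains) @ cam_to_rgb.T, 0.0, 1.0)

    rp, gp, bp = np.moveaxis(f_tc(rgb_lin), -1, 0)    # gamma-encoded R'G'B'

    ey  = K_R * rp + (1 - K_R - K_B) * gp + K_B * bp  # luma
    ecb = (bp - ey) / (2 * (1 - K_B))                 # chroma (blue difference)
    ecr = (rp - ey) / (2 * (1 - K_R))                 # chroma (red difference)

    s = 2 ** (B - 8)                                  # quantization, cf. (2.63)-(2.65)
    Y  = np.round((219 * ey  +  16) * s).astype(int)
    Cb = np.round((224 * ecb + 128) * s).astype(int)
    Cr = np.round((224 * ecr + 128) * s).astype(int)
    return Y, Cb, Cr

Y, Cb, Cr = process(np.random.default_rng(2).random((4, 4, 3)))
```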

2.5 Display of Images and Video

In most applications, we capture and encode visual information for eventually presenting it to human beings. Display devices act as the interface between machine and human. At the present time, a rather large variety of display technologies is available, and new technologies as well as improvements to existing technologies are still being developed. Independent of the actually used technology, for producing the sensation of color, each element of a displayed picture has to be composed of at least three primary colors (see Section 2.2.2). The employed technology determines the chromaticity coordinates of the primary colors and, thus, the display-internal color space. In general, samples of the representation format provided to the display device have to be transformed into the display color representation; this transformation typically includes gamma decoding and a color space conversion. Modern display devices often apply additional signal processing algorithms for improving the perceived quality of natural video. In the following, we briefly review some important display techniques. For a more detailed discussion, the reader is referred to the overview in [70].
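The display-side transformation mentioned above is essentially the inverse of the camera-side chain: dequantization, conversion from Y'CbCr back to gamma-encoded R'G'B', gamma decoding, and a conversion to the display primaries. The sketch below mirrors the camera-side example; the transfer function, the luma weights, and the identity matrix standing in for the conversion to the display primaries are again illustrative assumptions.

```python
import numpy as np

def f_tc_inv(v):
    """Gamma decoding (inverse of the BT.709-style transfer function used above)."""
    return np.where(v < 4.5 * 0.018, v / 4.5, np.power((v + 0.099) / 1.099, 1 / 0.45))

def to_display(Y, Cb, Cr, B=8, K_R=0.2126, K_B=0.0722):
    """Display-side processing: dequantize, convert Y'CbCr to gamma-encoded
    R'G'B', gamma-decode, and map to the (here: identical) display primaries."""
    s = 2 ** (B - 8)
    ey  = (Y  / s -  16) / 219
    ecb = (Cb / s - 128) / 224
    ecr = (Cr / s - 128) / 224

    rp = ey + 2 * (1 - K_R) * ecr
    bp = ey + 2 * (1 - K_B) * ecb
    gp = (ey - K_R * rp - K_B * bp) / (1 - K_R - K_B)

    rgb_lin = f_tc_inv(np.clip(np.stack([rp, gp, bp], axis=-1), 0.0, 1.0))

    rgb_to_display = np.eye(3)     # placeholder conversion to the display primaries
    return np.clip(rgb_lin @ rgb_to_display.T, 0.0, 1.0)
```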

Cathode Ray Tube (CRT) Displays. Some decades ago, all devices for displaying natural pictures were cathode ray tube (CRT) displays. The CRT is the oldest type of electronic display technology, but it has now been nearly completely replaced by more modern technologies. As illustrated in Figure 2.27(a), a CRT display basically consists of electron guns, a deflection system, and a phosphor-coated screen. Each electron gun contains a heated cathode that produces electrons by thermionic emission. By applying electric fields, the electrons are accelerated and focused to form an electron beam. When the electron beam hits the phosphor-coated screen, it causes the emission of photons. The intensity of the emitted light is controlled by varying the electric field in the electron gun. For producing a picture on the screen, the electron beam is swept line by line over the screen, typically 50 or 60 times per second. The direction of the beam is controlled by the deflection system, which consists of magnetic coils. In color CRT displays, three electron guns and three types of phosphors, each emitting photons for one of the primary colors red, green, and blue, are used.


Figure 2.27: Basic principles of display technologies: (a) Cathode ray tube (CRT) display; (b) Liquid crystal display (LCD); (c) Plasma display; (d) OLED display.

The different phosphors are arranged in clusters or stripes. A shadow mask mounted in front of the screen prevents electrons from hitting the wrong phosphor.

Liquid Crystal Displays (LCDs). Liquid crystals used in displays are liquid organic substances with a crystal molecular structure. They are arranged in a layer between two transparent electrodes in such a way that the alignment of the liquid crystals inside the layer can be controlled by the applied voltage. LCDs employ the effect that, depending on the orientation of the liquid crystals inside the layer and, thus, the applied voltage, the polarization direction of the transmitted light is modified. The basic structure of LCDs is illustrated in Figure 2.27(b). The light emitted by the display's backlight is passed through a first polarizer, which is followed by the liquid crystal layer and a second polarizer with a polarization direction perpendicular to that of the first polarizer. By adjusting the voltages applied to the liquid crystal layer, the modification of the polarization direction and, thus, the amount of light transmitted through the second polarizer is controlled. Finally, the light is passed through a layer with color filters (typically red, green, and blue filters) and a color picture perceivable by human beings is generated on the surface of the screen. A disadvantage of LCDs is that a backlight is required and a significant amount of light is absorbed. For that reason, LCDs have a rather large power consumption and do not achieve as good a black level as plasma or OLED displays.

Plasma Displays. Plasma displays are based on the phenomenon of gas discharge. As illustrated in Figure 2.27(c), a layer of cells, typically filled with a mixture of helium and xenon [70], is embedded between two electrodes. The electrode at the front side of the display has to be transparent. When a voltage is applied to a cell, the accelerated electrons may ionize the contained gas for a short duration. When the excited atoms return to their ground state, photons with a wavelength inside the ultraviolet (UV) range are emitted. A part of the UV photons excites the phosphors inside the cell, which eventually emit light in the visible range. The intensity of the emitted light can be controlled by the applied voltage. For obtaining color images, three types of phosphors, which emit light in the red, green, and blue ranges of the spectrum, are used. The corresponding cells are arranged in a suitable spatial layout.

OLED Displays. Organic light-emitting diode (OLED) displays are a relatively new technology. OLEDs consist of organic substances that emit visible light when an electric current is passed through them. A layer of organic semiconductor is situated between two electrodes, see Figure 2.27(d). At least one of the electrodes is transparent. If a voltage is applied, the electrons and holes injected from the cathode and anode, respectively, form electron-hole pairs called excitons. When an exciton recombines, the excess energy is emitted in the form of a photon; this process is called radiative recombination. The wavelength of the emitted photon depends on the band gap of the organic material. The light intensity can be controlled by adjusting the applied voltage. In OLED displays, typically three types of OLEDs, with organic substances that emit light in the red, green, and blue wavelength ranges, are used.

Projectors. In contrast to the display devices discussed so far, projectors, also commonly called beamers, do not display the final image on the light modulator itself, but on a diffusely reflecting screen. Due to the loose coupling of the light modulator and screen, very large images can be displayed. That is why projectors are particularly suitable for large audiences. There are three dominant projection techniques: LCD projectors, DLP projectors, and LCoS projectors.

In LCD projectors, the white light of a bright lamp is first split into red, green, and blue components, either by dichroic mirrors or prisms (see Section 2.4.2). Each of the resulting beams is passed through a separate transparent LCD panel, which modulates the intensity according to the sample values of the corresponding color component. Finally, the modulated beams are combined by dichroic prisms and passed through a lens, which projects the image on the screen.

Digital light processing (DLP) projectors use a chip with microscopic mirrors, one for each pixel. The mirrors can be rotated to send light from a lamp either through the lens or towards a light absorber. By quickly toggling the mirrors, the intensity of the light falling through the lens can be modulated. Color images are typically generated by placing a color wheel between the lamp and the micromirror chip, so that the color components of an image are sequentially displayed.

Liquid crystal on silicon (LCoS) projectors are similar to LCD projectors, but instead of transparent LCD panels, they use reflective light modulators (similar to DLP projectors). The light modulators basically consist of a liquid crystal layer that is fabricated directly on top of a silicon chip. The silicon is coated with a highly reflective metal, which simultaneously acts as electrode and mirror. As in LCD panels, the light modulation is achieved by changing the orientation of liquid crystals.

2.6 Chapter Summary

In this section, we gave an overview of some fundamental properties of image formation and human visual perception, and based on certain aspects of human vision, we reviewed the basic principles that are used for capturing, representing, and displaying digital video signals.

For acquiring video signals, the lens of a camera projects a scene of the three-dimensional world onto the surface of an image sensor. The focal length and aperture of the lens determine the field of view and the depth of field of the projection. Independent of the fabrication quality of the lens, the resolution of the image on the sensor is limited by diffraction; its effect increases with decreasing aperture. For real lenses, the image quality is additionally reduced by optical aberrations such as geometric distortions, spherical aberrations, or chromatic aberrations.

The human visual system has components similar to those of a camera: a lens projects an image onto the retina, where the image is sampled by light-sensitive cells. The photoreceptor responses are sent to the brain, where the visual information is interpreted. Under well-lit conditions, three types of photoreceptor cells with different spectral sensitivities are active. They basically map the infinite-dimensional space of electromagnetic spectra onto a three-dimensional space. Hence, light stimuli with different spectra can be perceived as having the same color. This property of human vision is the basis of all color reproduction techniques; it is employed in capturing, representing, and displaying image and video signals. For defining a common color system, the CIE standardized a so-called standard colorimetric observer by defining color-matching functions based on experimental data. The derived CIE 1931 XYZ color space represents the basis for specifying color in video communication applications. In display devices, colors are typically reproduced by mixing three suitably selected primary lights; it is, however, not possible to reproduce all colors perceivable by humans in this way. Color spaces that are spanned by three primaries are called linear color spaces. The human eye adapts to the illumination of a scene; this aspect has to be taken into account when processing the signals acquired with an image sensor. The acuity of human vision is determined by several factors such as the optics of the eye, the density of photoreceptor cells, and the neural processing. Human beings are more sensitive to details in luminance than to details in color.

Certain properties of human vision are also exploited for efficiently representing visual information. For describing color information, each video picture consists of three sample arrays. The primary colors are specified in the CIE 1931 XYZ color space. Since the human visual system has a non-linear response to differences in luminance, the linear color components are non-linearly encoded. This processing step, also called gamma encoding, yields color components with the property that a certain amount of quantization noise has approximately the same subjective impact on dark and light image regions. In most video coding applications, the gamma-encoded color components are transformed into a Y'CbCr format, in which color pictures are specified using a luminance-related component, called the luma component, and two color difference components, which are called chroma components. By this transformation, the color components are effectively decorrelated. Since humans are significantly more sensitive to details in luminance than to details in color difference data, the chroma components are typically downsampled. In the most common format, the Y'CbCr 4:2:0 format, the chroma components contain only a quarter of the samples of the luma component.
The luma and chroma sample values are typically represented with a bit depth of 8 or 10 bits per sample.

The image sensor in a camera samples the illuminance pattern that is projected onto its surface and converts it into a discrete representation of a picture. Each cell of an image sensor corresponds to an image point and basically counts the photons arriving during the exposure time. For capturing color images, the incident light has to be split into at least three spectral ranges. In most digital cameras, this is achieved either by using a trichroic beam splitter with a separate image sensor for each color component or by mounting a color filter array on top of a single image sensor. In display devices, color images are reproduced by mixing (at least) three primary colors for each image point according to the corresponding sample values. Important display technologies are CRT, LCD, plasma, and OLED displays. For large audiences, as in a cinema, projection technologies are typically used.

References

[1] Sarah E. J. Arnold, Samia Faruq, Vincent Savolainen, Peter W. McOwan, and Lars Chittka. FReD: The Floral Reflectance Database – A Web Portal for Analyses of Flower Colour. PLoS ONE, 5(12):e14287, December 2010. http://reflectance.co.uk.
[2] Peter G. J. Barten. Spatiotemporal model for the contrast sensitivity of the human eye and its temporal aspects. In Jan P. Allebach and Bernice E. Rogowitz, editors, Proc. SPIE, Human Vision, Visual Processing, and Digital Display IV, volume 1913. SPIE, September 1993.
[3] Peter G. J. Barten. Contrast Sensitivity of the Human Eye and Its Effects on Image Quality. SPIE Optical Engineering Press, Bellingham, Washington, 1999.
[4] Peter G. J. Barten. Formula for the contrast sensitivity of the human eye. In Yoichi Miyake and D. Rene Rasmussen, editors, Proc. SPIE, Image Quality and System Performance, volume 5294, pages 231–238. SPIE, December 2003.
[5] D. A. Baylor, B. J. Nunn, and J. L. Schnapf. Spectral sensitivity of cones of the monkey macaca fascicularis. The Journal of Physiology, 390(1):145–160, 1987.
[6] Max Born and Emil Wolf. Principles of Optics: Electromagnetic Theory of Propagation, Interference and Diffraction of Light. Cambridge University Press, Cambridge, UK, 7th (expanded) edition, 1999.


[7] D. H. Brainard and A. Stockman. Colorimetry. In M. Bass, C. DeCusatis, J. Enoch, V. Lakshminarayanan, G. Li, C. Macdonald, V. Mahajan, and E. van Stryland, editors, Vision and Vision Optics, volume III of The Optical Society of America Handbook of Optics, pages 10.1–10.56. McGraw Hill, 3rd edition, 2009.
[8] Arthur D. Broadbent. A critical review of the development of the CIE 1931 RGB color-matching functions. Color Research & Application, 29(4):267–272, August 2004.
[9] P. K. Brown and G. Wald. Visual pigments in single rods and cones of the human retina. Science, 144(3614):45–52, April 1964.
[10] W. R. T. Brown and D. L. MacAdam. Visual sensitivities to combined chromaticity and luminance differences. Journal of the Optical Society of America, 39(10):808–823, 1949.
[11] G. Buchsbaum and A. Gottschalk. Trichromacy, opponent colours coding and optimum colour information transmission in the retina. Proceedings of the Royal Society B: Biological Sciences, 220(1218):89–113, November 1983.
[12] F. W. Campbell and R. W. Gubisch. Optical quality of the human eye. The Journal of Physiology, 186(3):558–578, 1966.
[13] F. W. Campbell and J. G. Robson. Application of Fourier analysis to the visibility of gratings. The Journal of Physiology, 197(3):551–566, 1968.
[14] CIE. CIE Proceedings 1924. Cambridge University Press, Cambridge, 1926.
[15] CIE. CIE Proceedings 1931. Cambridge University Press, Cambridge, 1932.
[16] CIE. CIE Proceedings 1951. Bureau Central de la CIE, Paris, 1951.
[17] CIE. CIE Proceedings 1963. Bureau Central de la CIE, Paris, 1964.
[18] CIE. A Colour Appearance Model for Colour Management Systems: CIECAM02, Publication 159:2004. Bureau Central de la CIE, Vienna, 2004.
[19] CIE. Fundamental Chromaticity Diagram with Physiological Axes – Part 1, Publication 170-1. Bureau Central de la CIE, Vienna, 2006.
[20] Christine A. Curcio, Kimberly A. Allen, Kenneth R. Sloan, Connie L. Lerea, James B. Hurley, Ingrid B. Klock, and Ann H. Milam. Distribution and morphology of human cone photoreceptors stained with anti-blue opsin. The Journal of Comparative Neurology, 312(4):610–624, October 1991.

[21] Christine A. Curcio, Kenneth R. Sloan, Robert E. Kalina, and Anita E. Hendrickson. Human photoreceptor topography. The Journal of Comparative Neurology, 292(4):497–523, February 1990. Data available at http://www.cis.uab.edu/curcio/PRtopo.
[22] H. R. Davidson. Calculation of color differences from visual sensitivity ellipsoids. Journal of the Optical Society of America, 41(12):1052–1055, 1951.
[23] Russell L. De Valois, Israel Abramov, and Gerald H. Jacobs. Analysis of response patterns of LGN cells. Journal of the Optical Society of America, 56(7):966–977, July 1966.
[24] European Broadcasting Union. EBU standard for chromaticity tolerances for studio monitors. EBU Tech. 3213, August 1975.
[25] Mark D. Fairchild. Color Appearance Models. John Wiley & Sons, 3rd edition, 2013.
[26] Hugh S. Fairman, Michael H. Brill, and Henry Hemmendinger. How the CIE 1931 color-matching functions were derived from Wright-Guild data. Color Research & Application, 22(1):11–23, February 1997.
[27] K. S. Gibson and E. P. T. Tyndall. Visibility of radiant energy. Scientific Papers of the Bureau of Standards, 19(475):131–191, 1923.
[28] Joseph W. Goodman. Introduction to Fourier Optics. The McGraw-Hill Companies, Inc., 1996.
[29] H. Grassmann. Zur Theorie der Farbenmischung. Annalen der Physik, 165(5):69–84, 1853.
[30] H. Gross, F. Blechinger, and B. Achtner. Survey of Optical Instruments, volume 4 of Handbook of Optical Instruments. Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim, 2008.
[31] J. Guild. A trichromatic colorimeter suitable for standardisation work. Transactions of the Optical Society, 27(2):106–129, December 1925.
[32] J. Guild. The colorimetric properties of the spectrum. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 230(681-693):149–187, January 1932.
[33] Eugene Hecht. Optics. Addison-Wesley, 4th edition, 2001.
[34] H. Helmholtz. Handbuch der Physiologischen Optik, volume IX of Allgemeine Encyklopädie der Physik. Leopold Voss, Leipzig, 1867.
[35] Ewald Hering. Grundzüge der Lehre vom Lichtsinn. Verlag von Julius Springer, Berlin, 1920.

[36] H. Hofer. Organization of the human trichromatic cone mosaic. Journal of Neuroscience, 25(42):9669–9679, October 2005.
[37] Leo M. Hurvich and Dorothea Jameson. Some quantitative aspects of an opponent-colors theory. II. brightness, saturation, and hue in normal and dichromatic vision. Journal of the Optical Society of America, 45(8):602–616, August 1955.
[38] IEC. Multimedia systems and equipment – Colour measurement and management – Part 2-1: Colour management – Default RGB colour space – sRGB. International Standard 61966-2-1, October 1999.
[39] IEC. Extended-gamut YCC colour space for video applications – xvYCC. IEC 61966-2-4, January 2006.
[40] Institute of Ophthalmology, University College London. Colour & Vision Research laboratory and database. http://www.cvrl.org.
[41] ISO and CIE. Colorimetry – Part 1: CIE standard colorimetric observers. ISO/IEC 11664-1 | CIE S 014-1, 2007.
[42] ISO and CIE. Colorimetry – Part 2: CIE standard illuminants. ISO/IEC 11664-2 | CIE S 014-2, 2007.
[43] ISO and CIE. Colorimetry – Part 4: CIE 1976 L*a*b* colour space. ISO/IEC 11664-4 | CIE S 014-4, 2008.
[44] ISO and CIE. Colorimetry – Part 5: CIE 1976 L*u*v* colour space and u', v' uniform chromaticity scale diagram. ISO/IEC 11664-5 | CIE S 014-5, 2009.
[45] ISO/IEC. Coding of moving pictures and associated audio for digital storage media at up to 1.5 Mbit/s – Part 2: Video. ISO/IEC 11172-2, 1993.
[46] ITU-R. Studio encoding parameters of digital television for standard 4:3 and wide-screen 16:9 aspect ratios. Recommendation ITU-R BT.601-7, March 2011.
[47] ITU-R. Parameter values for ultra-high definition television systems for production and international programme exchange. Recommendation ITU-R BT.2020-1, June 2014.
[48] ITU-R. The present state of ultra-high definition television. Report ITU-R BT.2246-3, March 2014.
[49] ITU-T. Video codec for audiovisual services at p × 64 kbits. ITU-T Recommendation H.261, edition 3, March 1993.
[50] ITU-T. Video coding for low bit rate communication. ITU-T Recommendation H.263, edition 3, January 2005.

[51] ITU-T and ISO/IEC. Digital compression and coding of continuous-tone still images – requirements and guidelines. ITU-T Recommendation T.81 | ISO/IEC 10918-1, September 1992.
[52] ITU-T and ISO/IEC. Generic coding of moving pictures and associated audio information: Video. ITU-T Recommendation H.262 | ISO/IEC 13818-2, edition 2, February 2000.
[53] ITU-T and ISO/IEC. Advanced video coding for generic audiovisual services. ITU-T Recommendation H.264 | ISO/IEC 14496-10, edition 8, April 2013.
[54] ITU-T and ISO/IEC. High efficiency video coding. ITU-T Recommendation H.265 | ISO/IEC 23008-2, edition 1, April 2013.
[55] Dorothea Jameson and Leo M. Hurvich. Some quantitative aspects of an opponent-colors theory. I. chromatic responses and spectral saturation. Journal of the Optical Society of America, 45(7):546–552, July 1955.
[56] D. B. Judd. Report of U.S. Secretariat Committee on Colorimetry and Artificial Daylight. In Proceedings of the Twelfth Session of the CIE, volume 1, part 7, Stockholm, 1951. Bureau Central de la CIE, Paris.
[57] D. H. Kelly. Frequency doubling in visual responses. Journal of the Optical Society of America, 56(11):1628–1632, 1966.
[58] D. H. Kelly. Spatiotemporal variation of chromatic and achromatic contrast thresholds. Journal of the Optical Society of America, 73(6):742–749, 1983.
[59] G. Kirchhoff. Zur Theorie der Lichtstrahlen. Annalen der Physik, 254(4):663–695, 1883.
[60] Jan J. Koenderink. Color for the Sciences. MIT Press, Cambridge, MA, 2010.
[61] Junzhong Liang and David R. Williams. Aberrations and retinal image quality of the normal human eye. Journal of the Optical Society of America A, 14(11):2873–2883, November 1997.
[62] Ming Ronnier Luo and Changjun Li. CIECAM02 and its recent developments. In Christine Fernandez-Maloigne, editor, Advanced Color Image Processing and Analysis, pages 19–58. Springer New York, May 2012.
[63] David L. MacAdam. Visual sensitivities to color differences in daylight. Journal of the Optical Society of America, 32(5):247–273, 1942.
[64] Susana Marcos. Image quality of the human eye. International Ophthalmology Clinics, 43(2):43–62, 2003.

[65] James Clerk Maxwell. Experiments on colour, as perceived by the eye, with remarks on colour-blindness. Transactions of the Royal Society of Edinburgh, 21(02):275–298, January 1857.
[66] Kathy T. Mullen. The contrast sensitivity of human colour vision to red-green and blue-yellow chromatic gratings. The Journal of Physiology, 359:381–400, February 1985.
[67] G. A. Østerberg. Topography of the layer of rods and cones in the human retina. Acta Ophthalmologica, 13(Supplement 6):1–97, 1935.
[68] Stephen E. Palmer. Vision Science: Photons to Phenomenology. A Bradford Book. MIT Press, Cambridge, MA, 1999.
[69] Charles Poynton. Digital Video and HDTV Algorithms and Interfaces. The Morgan Kaufmann Series in Computer Graphics and Geometric Modeling. Morgan Kaufmann Publishers, San Francisco, CA, 2003.
[70] Erik Reinhard, Erum Arif Khan, Ahmet Oğuz Akyüz, and Garrett M. Johnson. Color Imaging: Fundamentals and Applications. A K Peters, Ltd., Wellesley, MA, 2008.
[71] J. G. Robson. Spatial and temporal contrast-sensitivity functions of the visual system. Journal of the Optical Society of America, 56(8):1141–1142, August 1966.
[72] J. L. Schnapf, T. W. Kraft, and D. A. Baylor. Spectral sensitivity of human cone photoreceptors. Nature, 325(6103):439–441, January 1987.
[73] Nobutoshi Sekiguchi, David R. Williams, and David H. Brainard. Aberration-free measurements of the visibility of isoluminant gratings. Journal of the Optical Society of America A, 10(10):2105–2117, 1993.
[74] L. T. Sharpe, A. Stockman, W. Jagla, and H. Jägle. A luminous efficiency function, V*(λ), for daylight adaptation. Journal of Vision, 5(11):948–968, December 2005.
[75] Lindsay T. Sharpe, Andrew Stockman, Wolfgang Jagla, and Herbert Jägle. A luminous efficiency function, V*D65(λ), for daylight adaptation: A correction. Color Research & Application, 36(1):42–46, February 2011.
[76] T. Smith and J. Guild. The C.I.E. colorimetric standards and their use. Transactions of the Optical Society, 33(3):73–134, January 1932.
[77] SMPTE. Composite analog video signal — NTSC for studio applications. SMPTE Standard 170M-2004, November 2004.
[78] A. Sommerfeld. Mathematische Theorie der Diffraction. Mathematische Annalen, 47(2–3):317–374, June 1896.

[79] W. S. Stiles and J. M. Burch. N.P.L. colour-matching investigation: Final report (1958). Optica Acta: International Journal of Optics, 6(1):1–26, January 1959.
[80] A. Stockman and D. H. Brainard. Color vision mechanisms. In M. Bass, C. DeCusatis, J. Enoch, V. Lakshminarayanan, G. Li, C. Macdonald, V. Mahajan, and E. van Stryland, editors, Vision and Vision Optics, volume III of The Optical Society of America Handbook of Optics, pages 11.1–11.86. McGraw Hill, 3rd edition, 2009.
[81] Andrew Stockman and Lindsay T. Sharpe. The spectral sensitivities of the middle- and long-wavelength-sensitive cones derived from measurements in observers of known genotype. Vision Research, 40(13):1711–1737, June 2000.
[82] Andrew Stockman, Lindsay T. Sharpe, and Clemens Fach. The spectral sensitivity of the human short-wavelength sensitive cones derived from thresholds and color matches. Vision Research, 39(17):2901–2927, August 1999.
[83] G. Svaetichin. Spectral response curves from single cones. Acta Physiologica Scandinavica, 39(Supplement 134):17–46, 1956.
[84] Gunnar Svaetichin and Edward F. MacNichol. Retinal mechanisms for chromatic and achromatic vision. Annals of the New York Academy of Sciences, 74(2):385–404, November 1958.
[85] United States National Television System Committee. Recommendation for transmission standards for color television, December 1953.
[86] A. van Meeteren and J. J. Vos. Resolution and contrast sensitivity at low luminances. Vision Research, 12(5):825–833, May 1972.
[87] Floris L. van Nes and Maarten A. Bouman. Spatial modulation transfer in the human eye. Journal of the Optical Society of America, 57(3):401–406, March 1967.
[88] Johannes von Kries. Theoretische Studien über die Umstimmung des Sehorgans. In Festschrift der Albrecht-Ludwigs-Universität in Freiburg, pages 143–158. C. A. Wagner's Universitäts-Buchdruckerei, Freiburg, Germany, 1920.
[89] J. J. Vos. Colorimetric and photometric properties of a 2° fundamental observer. Color Research & Application, 3(3):125–128, July 1978.
[90] Brian A. Wandell. Foundations of Vision. Sinauer Associates, Inc., Sunderland, Massachusetts, 1995.

[91] A. Watanabe, T. Mori, S. Nagata, and K. Hiwatashi. Spatial sine-wave responses of the human visual system. Vision Research, 8(9):1245–1263, September 1968.
[92] David R. Williams. Topography of the foveal cone mosaic in the living human eye. Vision Research, 28(3):433–454, 1988.
[93] W. D. Wright. A trichromatic colorimeter with spectral primaries. Transactions of the Optical Society, 29(5):225–242, May 1928.
[94] W. D. Wright. A re-determination of the trichromatic coefficients of the spectral colours. Transactions of the Optical Society, 30(4):141–164, March 1929.
[95] G. Wyszecki and W. S. Stiles. Color Science: Concepts and Methods, Quantitative Data and Formulae. John Wiley & Sons, Inc., New York, 2nd edition, 1982.
[96] T. Young. The Bakerian Lecture: On the theory of light and colours. Philosophical Transactions of the Royal Society of London, 92(0):12–48, January 1802.