2 Acquisition, Representation, Display, and Perception of Image and Video Signals

In digital video communication, we typically capture a natural scene by a camera, transmit or store data representing the scene, and finally reproduce the captured scene on a display. The camera converts the light emitted or reflected from objects in a three-dimensional scene into arrays of discrete-amplitude samples. In the display device, the arrays of discrete-amplitude samples are converted into light that is emitted from the display and perceived by human beings. The primary task of video coding is to represent the sample arrays generated by the camera and used by the display device with a small number of bits, suitable for transmission or storage. Since the achievable compression for an exact representation of the sample arrays recorded by a camera is not sufficient for most applications, the sample arrays are modified in a way that they can be represented with a given maximum number of bits or bits per time unit. Ideally, the degradation of the perceived image quality due to the modifications of the sample arrays should be as small as possible. Hence, even though video coding eventually deals with mapping arrays of discrete-amplitude samples into a bitstream, the quality of the displayed video is largely influenced by the way we acquire, represent, display, and perceive visual information.


Certain properties of human visual perception have in fact a large impact on the construction of cameras, the design of displays, and the way visual information is represented as sample arrays. And even though today's video coding standards have been mainly designed from a signal processing perspective, they provide features that can be used for exploiting some properties of human vision. A basic knowledge of human vision, the design of cameras and displays, and the representation formats used is essential for understanding the interdependencies of the various components in a video communication system. For designing video coding algorithms, it is also important to know what impact changes in the sample arrays, which are eventually coded, have on the perceived quality of images and video.

In the following section, we start with a brief review of basic properties of image formation by lenses. Afterwards, we discuss certain aspects of human vision and describe raw data formats that are used for representing visual information. Finally, an overview of the design of cameras and displays is given. For additional information on these topics, the reader is referred to the comprehensive overview in [70].

2.1 Fundamentals of Image Formation

In digital cameras, a three-dimensional scene is projected onto an image sensor, which measures physical quantities of the incident light and converts them into arrays of samples. For obtaining an image of the real world on the sensor's surface, we require a device that projects all rays of light that are emitted or reflected from an object point and fall through the opening of the camera into a point in the image plane. The simplest such device is the pinhole, which blocks basically all light coming from a particular object point, except a single pencil of rays, from reaching the light-sensitive surface. Due to their poor optical resolution and extremely low light efficiency, pinhole optics are not used in practice; lenses are used instead. In the following, we review some basic properties of image formation using lenses. For more detailed treatments of the topic of optics, we recommend the classic references by Born and Wolf [6] and Hecht [33].

2.1.1 Image Formation with Lenses

Lenses consist of transparent materials such as glass. They change the direction of light rays falling through the lens due to refraction at the boundary between the lens material and the surrounding air. The shape of a lens determines how the wavefronts of the light are deformed. Lenses that project all light rays originating from an object point into a single image point have a hyperbolic shape on both sides [33]. This is, however, only valid for monochromatic light and a single object point; there are no lens shapes that form perfect images of objects. Since it is easier and less expensive to manufacture lenses with spherical surfaces, most lenses used in practice are spherical lenses. Aspheric lenses are, however, often used for minimizing aberrations in lens systems.

Thin Lenses. We restrict our considerations to paraxial approximations (the angles between the light rays and the optical axis are very small) for thin lenses (the thickness is small compared to the radii of curvature). Under these assumptions, a lens projects an object at a distance s from the lens onto an image plane located at a distance b on the other side of the lens, see Figure 2.1(a). The relationship between the object distance s and the image distance b is given by
\[
\frac{1}{s} + \frac{1}{b} = \frac{1}{f},  \tag{2.1}
\]
which is known as the Gaussian lens formula (a derivation is, for example, given in [33]). The quantity f is called the focal length and represents the distance from the lens plane at which light rays that are parallel to the optical axis are focused into a single point. For focusing objects at different locations, the distance b between lens and image sensor can be modified. Far objects (s → ∞) are in focus if the distance b is approximately equal to the focal length f. As illustrated in Figure 2.1(b), for a given image sensor, the focal length f of the lens determines the field of view. With d representing the width, height, or diagonal of the image sensor, the angle of view is given by
\[
\theta = 2 \arctan\!\left(\frac{d}{2f}\right).  \tag{2.2}
\]
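As a quick numerical illustration of (2.1) and (2.2), the following Python sketch computes the image distance for a focused object and the diagonal angle of view. The focal length, object distance, and sensor diagonal are arbitrary example values, not taken from the text.

```python
import math

def image_distance(f, s):
    """Image distance b from the Gaussian lens formula 1/s + 1/b = 1/f, eq. (2.1)."""
    return 1.0 / (1.0 / f - 1.0 / s)

def angle_of_view(f, d):
    """Angle of view (radians) for sensor dimension d and focal length f, eq. (2.2)."""
    return 2.0 * math.atan(d / (2.0 * f))

# Example values (assumed): 50 mm lens, object at 10 m, sensor diagonal of about 43.3 mm.
f = 0.050      # focal length in meters
s = 10.0       # object distance in meters
d = 0.0433     # sensor diagonal in meters

b = image_distance(f, s)
theta = angle_of_view(f, d)
print(f"image distance b = {b * 1000:.2f} mm")            # slightly larger than f
print(f"diagonal angle of view = {math.degrees(theta):.1f} degrees")
```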

Figure 2.1: Image formation with lenses: (a) Object and image location for a thin convex lens; (b) Angle of view; (c) Aperture; (d) Relationship between the acceptable diameter c for the circle of confusion and the depth of field D.

Aperture. Besides the focal length, lenses are characterized by their aperture, which is the opening of a lens. As illustrated in Figure 2.1(c), the aperture determines the bundle of rays focused in the image plane. In camera lenses, typically adjustable apertures with an approximately circular shape are used. The aperture diameter a is commonly notated as f/F , where F is the so-called f-number,

F = f/a. (2.3)

For example, an aperture of f/4 corresponds to an f-number of 4 and specifies that the aperture diameter a is equal to 1/4 of the focal length. For a given distance b between lens and sensor, only object points that are located in a plane at a particular distance s are focused on the sensor. As shown in Figure 2.1(d), object points located at distances s + Δs_F and s − Δs_N would be focused at image distances b − Δb_F and b + Δb_N, respectively. On the image sensor, at the distance b, these object points appear as blur spots, which are called circles of confusion. If the blur spots are small enough, the projected objects still appear to be sharp in a photo or video. Given a maximum acceptable diameter c for the circles of confusion, we can derive the range of object distances for which we obtain a sharp projection on the image sensor. By considering similar triangles at the image side in Figure 2.1(d), we get
\[
\frac{\Delta b_N}{b + \Delta b_N} = \frac{c}{a} = \frac{F\,c}{f}
\qquad\text{and}\qquad
\frac{\Delta b_F}{b - \Delta b_F} = \frac{c}{a} = \frac{F\,c}{f}.  \tag{2.4}
\]

Using the Gaussian lens formula (2.1) for representing b, b + Δb_N, and b − Δb_F as functions of the focal length f and the corresponding object distances, and solving for Δs_F and Δs_N yields
\[
\Delta s_F = \frac{F\,c\,s\,(s-f)}{f^2 - F\,c\,(s-f)}
\qquad\text{and}\qquad
\Delta s_N = \frac{F\,c\,s\,(s-f)}{f^2 + F\,c\,(s-f)}.  \tag{2.5}
\]
The distance D between the nearest and farthest objects that appear acceptably sharp in an image is called the depth of field. It is given by
\[
D = \Delta s_F + \Delta s_N
  = \frac{2\,F\,c\,f^2\,s\,(s-f)}{f^4 - F^2 c^2\,(s-f)^2}
  \approx \frac{2\,F\,c\,s^2}{f^2}.  \tag{2.6}
\]
For the simplification at the right side of (2.6), we used the often valid approximations s ≫ f and c ≪ f²/s. The maximum acceptable diameter c for the circle of confusion could be defined as the distance between two photocells on the image sensor. Based on considerations about the resolution capabilities of the human eye and the typical viewing angle for a photo or video, it is, however, common practice to define c as a fraction of the sensor diagonal d, for example, c ≈ d/1500. By using this rule and applying (2.2), we obtain the approximation
\[
D \approx 0.005 \cdot \frac{F\,s^2}{d} \cdot \tan^2\!\left(\frac{\theta}{2}\right),  \tag{2.7}
\]
where θ denotes the diagonal angle of view. Note that the depth of field increases with decreasing sensor size. When we film a scene with a given camera, the depth of field can basically only be influenced by changing the aperture of the lens. As an example, if we use a 36 mm × 24 mm sensor and a 50 mm lens with an aperture of f/1.4 and focus an object at a distance of s = 10 m, all objects in the range from 8.6 m to 11.9 m appear acceptably sharp. By decreasing the aperture to f/8, the depth of field is increased to a range of about 5 m to 122 m.
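The depth-of-field example above can be reproduced with a small Python sketch based on (2.5); the function name is ours, and the circle of confusion follows the d/1500 rule mentioned in the text.

```python
import math

def depth_of_field_limits(f, F, s, c):
    """Near and far limits of acceptable sharpness from eq. (2.5); all lengths in mm."""
    ds_far  = F * c * s * (s - f) / (f**2 - F * c * (s - f))
    ds_near = F * c * s * (s - f) / (f**2 + F * c * (s - f))
    return s - ds_near, s + ds_far

# Example from the text: 36 mm x 24 mm sensor, 50 mm lens, object focused at 10 m.
d = math.hypot(36.0, 24.0)    # sensor diagonal in mm
c = d / 1500.0                # acceptable circle of confusion
for F in (1.4, 8.0):
    near, far = depth_of_field_limits(f=50.0, F=F, s=10_000.0, c=c)
    print(f"f/{F}: sharp from {near / 1000:.1f} m to {far / 1000:.1f} m")
# Expected output: roughly 8.6-11.9 m at f/1.4 and about 5-122 m at f/8.
```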

Figure 2.2: Perspective projection of the 3-dimensional space onto an image plane.

Projection by Lenses. As we have discussed above, a lens actually generates a three-dimensional image of a scene, and the image sensor basically extracts a plane of this three-dimensional image. For many applications, the projection of the three-dimensional world onto the image plane can be reasonably well approximated by the perspective projection model. If we define the world and image coordinate systems as illustrated in Figure 2.2, a point P at world coordinates (X, Y, Z) is projected into a point p at the image coordinates (x, y), given by
\[
x = \frac{b}{Z}\,X \approx \frac{f}{Z}\,X
\qquad\text{and}\qquad
y = \frac{b}{Z}\,Y \approx \frac{f}{Z}\,Y.  \tag{2.8}
\]
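A minimal sketch of the perspective projection (2.8), with b approximated by the focal length f; the numeric values are arbitrary examples.

```python
def project_point(X, Y, Z, f):
    """Perspective projection (2.8) of a world point onto the image plane (b ~ f)."""
    if Z <= 0:
        raise ValueError("point must lie in front of the camera (Z > 0)")
    return (f * X / Z, f * Y / Z)

# Example (assumed values): 50 mm lens, point 2 m to the right, 1 m up, 20 m away.
x, y = project_point(X=2.0, Y=1.0, Z=20.0, f=0.050)
print(f"image coordinates: x = {x * 1000:.2f} mm, y = {y * 1000:.2f} mm")  # 5.00 mm, 2.50 mm
```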

2.1.2 Diffraction and Optical Resolution

Until now, we assumed that rays of light in a homogeneous medium propagate in rectilinear paths. Experiments show, however, that light rays are bent when they encounter small obstacles or openings. This phenomenon is called diffraction and can be explained by the wave character of light. As we will discuss in the following, diffraction effects limit the resolving power of optical instruments such as cameras. A mathematical theory of diffraction was formulated by Kirchhoff [59] and later modified by Sommerfeld [78].

Figure 2.3: Diffraction in cameras: (a) Diffraction of a plane wave at an aperture; (b) Diffraction in cameras can be modeled using Fraunhofer diffraction.

As shown in Figure 2.3(a), we consider a plane wave with wavelength λ that encounters an aperture with the pupil function g(ζ, η). The pupil function is defined in a way that values of g(ζ, η) = 0 specify opaque points and values of g(ζ, η) = 1 specify transparent points in the aperture plane. The irradiance I(x, y) observed on a screen at distance z depends on the spatial position (x, y). For z ≫ a²/λ, with a being the largest dimension of the aperture, the phase differences between the individual contributions that are superposed on the screen only depend on the viewing angles given by sin φ = x/R and sin θ = y/R, with R = √(x² + y² + z²). This far-field approximation is referred to as Fraunhofer diffraction. Since a lens placed behind an aperture focuses parallel light rays in a point, as illustrated in Figure 2.3(b), diffraction observed in cameras can be modeled using Fraunhofer diffraction. The observed irradiance pattern [33] is given by

\[
I(x, y) = C \cdot \left| G\!\left(\frac{x}{\lambda R}, \frac{y}{\lambda R}\right) \right|^2,  \tag{2.9}
\]
where C is a constant and G(u, v) represents the two-dimensional Fourier transform of the pupil function g(ζ, η). For a camera with a circular aperture, the diffraction pattern on the sensor [33] at a distance z ≈ f is given by

\[
I(r) = I_0 \cdot \left( \frac{2\,J_1(\beta r)}{\beta r} \right)^{\!2}
\quad\text{with}\quad
\beta = \frac{\pi a}{\lambda R} \approx \frac{\pi a}{\lambda f} = \frac{\pi}{\lambda F},  \tag{2.10}
\]
where r = √(x² + y²) represents the distance from the optical axis, I₀ = I(0) is the maximum irradiance, a, f, and F = f/a denote the aperture diameter, focal length, and f-number of the lens, respectively, and J₁(x) represents the Bessel function of the first kind and order one. The diffraction pattern (2.10), which is illustrated in Figure 2.4(a), is called the Airy pattern and its bright central region is called the Airy disk.

Figure 2.4: Optical resolution: (a) Airy pattern; (b) Two just resolved image points; (c) Modulation transfer function of a diffraction-limited lens with a circular aperture.

Optical Resolution. The imaging quality of an optical system can be described by the point spread function (PSF) or line spread function. They specify the projected patterns for a focused point or line source. For large object distances, the wave fronts encountering the aperture are approximately planar. If we have a circular aperture and assume that diffraction is the only source of blurring, the PSF is given by the Airy pattern (2.10). For off-axis points, the Airy pattern is centered around the image point given by (2.8). Optical systems for which the imaging quality is only limited by diffraction are referred to as diffraction-limited or perfect optics. In real lenses, we have additional sources of blurring caused by deviations from the paraxial approximation (2.1). The PSF of an optical system determines its ability to resolve details in the image. Two image points are said to be just resolvable when the center of one diffraction pattern coincides with the first minimum of the other diffraction pattern. This rule is known as the Rayleigh criterion and is illustrated in Figure 2.4(b). For cameras with diffraction-limited lenses and circular apertures, two image points are resolvable if the distance Δr between the centers of the Airy patterns satisfies
\[
\Delta r \;\geq\; \Delta r_{\min} = \frac{x_1}{\pi}\,\lambda F \;\approx\; 1.22\,\lambda F,  \tag{2.11}
\]
where x₁ ≈ 3.8317 represents the first zero of J₁(x)/x. As an example, we consider a camera with a 13.2 mm × 8.8 mm sensor and an aperture of f/4 and assume a wavelength of λ = 550 nm (in the middle of the visible spectrum). Even with a perfect lens, we cannot discriminate more than 4918 × 3279 points (or 16 megapixels) on the image sensor.
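The sensor example just given can be checked with a short Python sketch based on the Rayleigh criterion (2.11); the function name is ours.

```python
def rayleigh_limit(wavelength, F):
    """Minimum resolvable distance on the sensor, eq. (2.11), in the unit of wavelength."""
    return 1.22 * wavelength * F

# Example from the text: 13.2 mm x 8.8 mm sensor, f/4, lambda = 550 nm.
dr_min = rayleigh_limit(wavelength=550e-9, F=4.0)        # in meters
nx = round(13.2e-3 / dr_min)
ny = round(8.8e-3 / dr_min)
print(f"minimum spot separation: {dr_min * 1e6:.2f} um")
print(f"resolvable points: {nx} x {ny} (~{nx * ny / 1e6:.0f} megapixel)")
# Expected output: about 2.68 um and 4918 x 3279 points (~16 megapixel).
```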

The number of discriminable points increases with decreasing f-number and increasing sensor size. By considering (2.7), we can, however, conclude that for a given picture (same field of view and depth of field), the number of distinguishable points is independent of the sensor size.

Modulation Transfer Function. The resolving capabilities of lenses are often specified in the frequency domain. The optical transfer function (OTF) is defined as the two-dimensional Fourier transform of the point spread function, OTF(u, v) = FT{PSF(x, y)}. The amplitude spectrum MTF(u, v) = |OTF(u, v)| is referred to as the modulation transfer function (MTF). Typically, only a one-dimensional slice MTF(u) of the modulation transfer function MTF(u, v) is considered, which corresponds to the Fourier transform of the line spread function. The contrast C of an irradiance pattern shall be defined by
\[
C = \frac{I_{\max} - I_{\min}}{I_{\max} + I_{\min}},  \tag{2.12}
\]
where I_min and I_max represent the minimum and maximum irradiances. The modulation transfer MTF(u) specifies the reduction in contrast C for harmonic stimuli with a spatial frequency u,

\[
\mathrm{MTF}(u) = C_{\mathrm{image}} / C_{\mathrm{object}},  \tag{2.13}
\]
where C_object and C_image denote the contrasts in the object and image domain, respectively. The OTF of diffraction-limited optics can also be calculated as the normalized autocorrelation function of the pupil function g(ζ, η) [28]. For a camera with a diffraction-limited lens and a circular aperture with the f-number F, the MTF is given by
\[
\mathrm{MTF}(u) =
\begin{cases}
\dfrac{2}{\pi}\left( \arccos\dfrac{u}{u_0} - \dfrac{u}{u_0}\sqrt{1 - \left(\dfrac{u}{u_0}\right)^{2}} \right) & : \; u \le u_0 \\[2ex]
0 & : \; u > u_0
\end{cases},  \tag{2.14}
\]
where u₀ = 1/(λF) represents the cut-off frequency. This function is illustrated in Figure 2.4(c). The MTF for real lenses generally lies below that for diffraction-limited optics. Furthermore, for real lenses, the MTF additionally depends on the position in the image plane and the orientation of the harmonic pattern.
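A minimal sketch of the diffraction-limited MTF (2.14); the f-number, wavelength, and evaluation frequencies are arbitrary example values.

```python
import math

def diffraction_limited_mtf(u, wavelength, F):
    """MTF of a diffraction-limited lens with a circular aperture, eq. (2.14).

    u is the spatial frequency; wavelength is given in the reciprocal length unit
    of u (e.g., mm for cycles/mm); F is the f-number of the lens.
    """
    u0 = 1.0 / (wavelength * F)              # cut-off frequency
    if u >= u0:
        return 0.0
    r = u / u0
    return (2.0 / math.pi) * (math.acos(r) - r * math.sqrt(1.0 - r * r))

# Example (assumed values): f/4 lens at 550 nm, frequencies in cycles/mm.
lam_mm = 550e-6                              # 550 nm expressed in mm
for u in (0.0, 50.0, 100.0, 200.0):
    print(f"MTF at {u:5.1f} cycles/mm: {diffraction_limited_mtf(u, lam_mm, 4.0):.3f}")
```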

Figure 2.5: Aberrations: (a) Spherical Aberration; (b) Field curvature; (c) Coma; (d) Astigmatism; (e) Distortion; (f) Axial(1) and lateral(2) chromatic aberration.

2.1.3 Optical Aberrations

We analyzed aspects of the image formation with lenses using the Gaussian lens formula (2.1). Since this formula represents only an approximation for thin lenses and paraxial rays, it does not provide an accurate description of real lenses. Deviations from the predictions of Gaussian optics that are not caused by diffraction are called aberrations. There are two main classes of aberrations: monochromatic aberrations, which are caused by the geometry of lenses and occur even with monochromatic light, and chromatic aberrations, which occur only for light consisting of multiple wavelengths. The five primary monochromatic aberrations, which are also called Seidel aberrations, are:

• Spherical aberration: The focal point of light rays depends on their distance to the optical axis, see Figure 2.5(a);
• Field curvature: Points in a flat object plane are focused on a curved surface instead of a flat image plane, see Figure 2.5(b);
• Coma: The projections of off-axis object points appear as comet-shaped blur spots instead of points, see Figure 2.5(c);
• Astigmatism: Light rays that propagate in perpendicular planes are focused at different distances, see Figure 2.5(d);
• Distortion: Straight lines in the object plane appear as curved lines in the image plane, and objects are deformed, see Figure 2.5(e).

Chromatic aberrations arise from the fact that the phase velocity of an electromagnetic wave in a medium depends on its frequency, a phenomenon called dispersion. As a result, light rays of different wavelengths (or frequencies) are refracted at different angles. Typically, two types of chromatic aberration are distinguished:

• Axial (or longitudinal) chromatic aberration: The focal length depends on the wavelength, see Figure 2.5(f), case (1);
• Lateral chromatic aberration: For off-axis object points, different wavelengths are focused at different positions in the image plane, see Figure 2.5(f), case (2).

The image quality in cameras is often additionally degraded by a brightness reduction at the periphery compared to the image center, an effect referred to as vignetting. Aberrations can be reduced by combining multiple lenses of different shapes and materials. Typical camera lenses consist of about 10 to 20 lens elements, including aspherical lenses and lenses of extra-low dispersion materials.

2.2 Visual Perception

In all areas of digital image communication, whether it be photography, television, home entertainment, video streaming or video conferencing, the photos and videos are eventually viewed by human beings. The way humans perceive visual information determines whether a reproduction of a real-world scene in the form of a printed photograph or pictures displayed on a monitor or television screen looks realistic and truthful. In fact, certain aspects of human vision are not only taken into account for designing cameras, displays and printers, but are also exploited for digitally representing and coding still and moving pictures. In the following, we give a brief overview of the human visual system with particular emphasis on the perception of color. We will mainly concentrate on aspects that influence the way we capture, represent, code and display pictures. For more details on human vision, the reader is referred to the books by Wandell [90] and Palmer [68]. The topic of colorimetry is comprehensively treated in the classic reference by Wyszecki and Stiles [95] and the book by Koenderink [60].

Figure 2.6: Basic structure of the human eye.

2.2.1 The Human Visual System

The human eye has components similar to those of a camera. Its basic structure is illustrated in Figure 2.6. The cornea and the crystalline lens, which is embedded in the ciliary muscle, form a two-lens system. They act like a single convex lens and project an image of real-world objects onto a light-sensitive surface, the retina. The photoreceptor cells in the retina convert absorbed photons into neural signals that are further processed by the neural circuitry in the retina and transmitted through the optic nerve to the visual cortex of the brain. The area of the retina that provides the sharpest vision is called the fovea. We always move our eyes such that the image of the object we look at falls on the fovea. The iris is a sphincter muscle that controls the size of the hole in its middle, called the pupil, and thus the amount of light reaching the retina.

Human Optics. In contrast to cameras, the distance between lens and retina cannot be modified for focusing objects at varying distances. Instead, focusing is achieved by adapting the shape of the crystalline lens by the ciliary muscle. This process is referred to as accommodation. In the eyes of young people, the resulting focal length of the two-lens optics can be modified between about 14 and 17 mm [30], making it possible to focus objects at distances from approximately 8 cm to infinity. Similarly as in cameras, the image projected onto the retina is actually inverted. The optical quality of the human eye was evaluated by measuring line spread, point spread, or the corresponding modulation transfer functions (see Section 2.1.2) for monochromatic light [12, 61, 64]. These investigations show that the eye is far from being perfect optics.

Figure 2.7: Illustration of the distribution of photoreceptor cells along the horizontal meridian of the human eye (plotted using experimental data of [21]).

While for very small pupil sizes, the human optical system is nearly diffraction-limited, for larger pupil sizes, the imperfections of the cornea and crystalline lens cause significant monochromatic aberrations, much larger than those of camera lenses. The sharpest image on the retina is obtained for pupil diameters of about 3 mm, which is the typical pupil size for looking at a white paper in good reading light. The dispersion of the substances inside the eye also leads to significant chromatic aberrations. In typical lighting conditions, the green range of the spectrum, which the eye is most sensitive to, is sharply focused on the retina, while the focal planes for the blue and red range are in front of and behind the retina, respectively. This axial chromatic aberration has the strongest effect for the short wavelength range of the visible light [30]. Lateral chromatic aberration increases with the distance from the optical axis; in the fovea, its effect can be neglected.

Human Photoreceptors. The retina contains two classes of photoreceptor cells, the rods and cones, which are sensitive to different light levels. Under well-lit viewing conditions (daylight, luminance greater than about 10 cd/m²), only the cones are effective. This case is referred to as photopic vision. At very low light levels, between the visual threshold and a luminance of about 5 · 10⁻³ cd/m² (somewhat lower than the lighting in a full moon night), only the rods contribute to the visual perception; this case is called scotopic vision. Between these two cases, both the rods and cones are active and we talk of mesopic vision.

There are about 100 million rods and 5 million cones in each eye, which are very differently distributed throughout the retina [67, 21], see Figure 2.7. The rods are mainly concentrated in the periphery. The fovea does not contain any rods, but by far the highest concentration of cones. At the location of the optic nerve, also referred to as the blind spot, there are no photoreceptors. Although the retina contains many more rods than cones, the visual acuity of scotopic vision is much lower than that of photopic vision. The reason is that the photocurrent responses of many rods are combined into a single neural response, whereas each cone signal is further processed by several neurons in the retina [90].

Spectral Sensitivity. The sensitivity of the human eye depends on the spectral characteristics of the observed light stimulus. Based on the data of several brightness-matching experiments, for example [27], the Commission Internationale de l'Eclairage (CIE) defined the so-called CIE luminous efficiency function V(λ) for photopic vision [14] in 1924. This function characterizes the average spectral sensitivity of human brightness perception¹. Two light stimuli with different radiance spectra Φ(λ) are perceived as equally bright if the corresponding values ∫₀^∞ Φ(λ) V(λ) dλ are the same. V(λ) determines the relation between radiometric and photometric quantities. For example, the analogous photometric quantity of the radiance Φ = ∫₀^∞ Φ(λ) dλ is the luminance I = K ∫₀^∞ V(λ) Φ(λ) dλ, where K is a constant (683 lumen per watt). The SI unit of the luminance is candela per square meter (cd/m²). Viewing experiments under scotopic conditions led to the definition of a scotopic luminous efficiency function V′(λ) [16]. The luminous efficiency functions V(λ) and V′(λ) are depicted in Figure 2.8(a). The phenomenon that the wavelength range of highest sensitivity is different for photopic and scotopic vision is referred to as the Purkinje effect. Both luminous efficiency functions are noticeably greater than zero in the range from about 390 to 700 nm. For that reason, electromagnetic radiation in this part of the spectrum is commonly called visible light.

¹ The CIE 1924 photopic luminous efficiency function V(λ) has been reported to underestimate the contribution of the short wavelength range. Improvements were suggested by Judd [56], Vos [89], and more recently by Sharpe et al. [74, 75].
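The weighting of a radiance spectrum by V(λ) can be sketched as a simple numerical integration. The V(λ) samples below are only a coarse approximation of the photopic luminous efficiency function at 50 nm spacing, and the spectrum is an arbitrary example; neither is taken from an official table.

```python
# Minimal sketch: luminance as K * integral of V(lambda) * Phi(lambda) d(lambda).
K = 683.0  # lm/W (maximum luminous efficacy for photopic vision)

wavelengths = [400, 450, 500, 550, 600, 650, 700]           # nm
V   = [0.0004, 0.038, 0.323, 0.995, 0.631, 0.107, 0.0041]   # approx. photopic V(lambda)
Phi = [0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02]            # example spectral radiance, W/(sr*m^2*nm)

d_lambda = 50.0  # nm
luminance = K * sum(v * p for v, p in zip(V, Phi)) * d_lambda
print(f"luminance ~ {luminance:.1f} cd/m^2")
```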

Figure 2.8: Spectral sensitivity of human vision: (a) CIE luminous efficiency functions for photopic and scotopic vision (the dashed curve represents the correction suggested in [75]); (b) Spectral sensitivity of the human photoreceptors.

In low light (scotopic vision), we can only discriminate between different brightness levels, but under photopic (and mesopic) conditions, we are able to see colors. The reason is that the human retina contains only a single rod type, but three types of cones, each with a different spectral sensitivity. The existence of three types of photoreceptors had already been postulated in the 19th century by Young [96] and Helmholtz [34]. In the 1960s, direct measurements on single photoreceptor cells of the human retina [9] confirmed the Young-Helmholtz theory of trichromatic vision. The cone types are typically referred to as L-, M- and S-cones, where L, M and S stand for the long, medium and short wavelength range, respectively, and characterize the peak sensitivities. On average only about 6% of the human cones are S-cones. This low density of S-cones is consistent with the large blur of short-wavelength components due to axial chromatic aberration. While the percentage of S-cones is roughly constant for different individuals, the ratio of L- and M-cones varies significantly [36]. The spectral sensitivity of cone cells was determined by measuring photocurrent responses [5, 72]. For describing color perception, we are, however, more interested in spectral sensitivities with respect to light entering the cornea, which are different, since the short wavelength range is strongly absorbed by different components of the eye before reaching the retina. Such sensitivity functions, which are also called cone fundamentals, can be estimated by comparing color-matching data

(see Section 2.2.2) of individuals with normal vision with that of individuals lacking one or two cone types. In Figure 2.8(b), the cone fundamentals estimated by Stockman et al. [82, 81] are depicted together with the spectral sensitivity function for the rods, which is the same as the scotopic luminous efficiency function V′(λ).

Luminance Sensitivity. The sensing capabilities of the human eye span a luminance range of about 11 orders of magnitude, from the visual threshold of about 10⁻⁶ cd/m² to about 10⁵ cd/m² [30], which roughly corresponds to the luminance level on a sunny day. However, in each moment, only luminance levels in a range of about 2 to 3 orders of magnitude can be distinguished. In order to cover the huge range of ambient light levels, the human eye adapts its sensitivity to the lighting conditions. A fast adaptation mechanism is the pupillary light reflex, which controls the pupil size depending on the luminance on the retina. The main factors, however, which are also responsible for the transition between rod and cone vision, are photochemical reactions in the pigments of the rod and cone cells and neural processes. These mechanisms are much slower than the pupillary light reflex; the adaptation from very high to very low luminance levels can take up to 30 minutes. To a large extent, the sensitivities of the three cone types are independently controlled. As a consequence, the human eye does not only adjust to the luminance level, but also to the spectral composition of the incident light. In connection with certain properties of the neural processing, this aspect causes the phenomenon of color constancy, which describes the effect that the perceived colors of objects are relatively independent of the spectral composition of the illuminating light. Another property of human vision is that our ability to distinguish two areas with the same color but a particular difference in luminance depends on the brightness of the viewed scene. Let I and ΔI denote the background luminance, to which the eye is adapted, and the just perceptible increase in luminance, respectively. Within a wide range of luminance values I, from about 50 to 10⁴ cd/m² [30], the relative sensitivity ΔI/I is nearly constant (approximately 1-2%). This behavior is known as the Weber-Fechner law.

Opponent Colors. The theory of opponent colors was first formulated by Hering [35]. He found that certain hues are never perceived to occur together. While colors can be perceived as a combination of, for example, yellow and red (orange), red and blue (purple), or green and blue (cyan), there are no colors that are perceived as a combination of red and green or yellow and blue. Hering concluded that the human color perception includes a mechanism with bipolar responses to red-green and blue-yellow. These hue pairs are referred to as opponent colors. According to the opponent color theory, any light stimulus is perceived as containing either one or the other hue of an opponent pair, or, if both contributions cancel out, neither of them.

For a long time, the opponent color theory seemed to be irreconcilable with the Young-Helmholtz theory. In the 1950s, Jameson and Hurvich [55, 37] performed hue-cancellation experiments by which they estimated the spectral sensitivities of the opponent-color mechanisms. Furthermore, measurements of electrical responses in the retina of goldfish [83, 84] and the lateral geniculate nucleus of the macaque monkey [23] showed the existence of neural signals that were consistent with the bipolar responses formulated by Hering. These and other experimental findings resulted in a wide acceptance of the modern theory of opponent colors, according to which the responses of the three cones to light stimuli are not directly transmitted to the brain. Instead, neurons along the visual pathways transform the cone responses into three opponent signals, as illustrated in Figure 2.9(a). The transformation can be considered as approximately linear and the outputs are an achromatic signal, which corresponds to a relative luminance measure, as well as a red-green and a yellow-blue color difference signal.

Since the cone sensitivities are to a large extent independently adjusted, the spectral sensitivities of the opponent processes depend on the present illumination. Estimates of the spectral sensitivity curves for the eye adapted to equal-energy white (same spectral radiance for all wavelengths) are shown in Figure 2.9(b). The depicted curves represent linear combinations, suggested in [80], of the Stockman and Sharpe cone fundamentals [81].

Figure 2.9: Opponent color theory: (a) Simplified model for the neural processing of the cone responses; (b) Estimates [80] of the spectral sensitivities of the opponent-color processes (for the eye adapted to equal-energy white).

As an example, let Φ(λ) denote the radiance spectrum of a light stimulus and let c̄_rg(λ) represent the spectral sensitivity curve for the red-green process. If the integral ∫₀^∞ Φ(λ) c̄_rg(λ) dλ is positive, the light stimulus is perceived as containing a red component; if it is negative, the stimulus appears to include a green component. As has been shown in [11], the conversion of the cone responses into opponent signals is effectively a decorrelation. It can be interpreted as a way of improving the neural transmission of color information.
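The sign test just described can be sketched numerically. Both the spectrum and the red-green sensitivity samples below are hypothetical placeholders chosen only for illustration, not measured opponent-process data.

```python
# Minimal sketch: a stimulus is judged to contain a reddish component if the
# integral of Phi(lambda) * c_rg(lambda) is positive, and a greenish component
# if it is negative. All sample values are hypothetical placeholders.
wavelengths = [450, 500, 550, 600, 650]          # nm
c_rg = [0.10, -0.35, -0.20, 0.45, 0.25]          # hypothetical red-green sensitivity samples
Phi  = [0.01, 0.02, 0.03, 0.08, 0.05]            # hypothetical radiance spectrum samples

d_lambda = 50.0
response = sum(c * p for c, p in zip(c_rg, Phi)) * d_lambda
label = "reddish component" if response > 0 else "greenish component"
print(f"{label} (response = {response:.3f})")
```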

Neural Processing. The neural responses of the photoreceptor cells are first processed by the neurons in the retina and then transmitted to the visual cortex of the brain, where the visual information is further processed and eventually interpreted, yielding the images of the world we perceive every day. The mechanisms of the neural processing are extremely complex and not yet fully understood. Nonetheless, the understanding of the processing in the visual cortex is continuously improving and many aspects are already known. One important property of human visual perception is that our visual system permanently compares the information obtained through the eyes with memorized knowledge, which finally yields an interpretation of the viewed real-world scene. Many examples of visual illusions impressively demonstrate that the human brain always interprets the received visual information.

Although many more aspects of visual perception than the ones mentioned in this section are already known, we will not discuss them in this monograph, since they are virtually not exploited in today's video communication applications. The main reason that most properties of human vision are neglected in image and video coding is that no simple and sufficiently accurate model has been found that allows the perceived image quality to be quantified based on the samples of an image or video.

2.2.2 Color Perception

While the previous section gave a brief overview of the human visual system, we will now further analyze and quantitatively describe the perception and reproduction of color information. In particular, we will discuss the colorimetric standards of the CIE, which are widely used as basis for specifying color in image and video representation formats.

Metamers. It is a well-known fact that, by using a prism, a ray of sunlight can be split into components of different wavelengths, which we perceive to have different colors, ranging from violet through blue, cyan, green, yellow, and orange to red. We can conclude that light with a particular spectral composition induces the perception of a particular color, but the converse is not true. Two light stimuli that appear to have the same color can have very different spectral compositions. Color is not a physical quantity, but a sensation in the viewer's mind induced by the interaction of electromagnetic waves with the human cones. A light stimulus emitted or reflected from the surface of an object and falling through the pupil of the eye can be physically characterized by its radiance spectrum, specifying its composition of electromagnetic waves with different wavelengths. The light falling on the retina excites the three cone types in different ways. Let l̄(λ), m̄(λ) and s̄(λ) represent the normalized spectral sensitivity curves of the L-, M- and S-cones, respectively, which have been illustrated in Figure 2.8. Then, a radiance spectrum Φ(λ) is effectively mapped to a three-dimensional vector

  ∞ ¯  L Z `(λ) Φ(λ) M = m¯ (λ) dλ, (2.15) Φ0 S 0 s¯(λ) where Φ0 >0 represents an arbitrarily chosen reference radiance, which is introduced for making the vector (L, M, S) dimensionless. 26 Acquisition, Representation, Display, and Perception

Figure 2.10: Metamers: All four radiance spectra shown in the diagram induce the same cone excitation responses and are perceived as the same color (orange).

If two light stimuli with different radiance spectra yield the same cone excitation response (L, M, S), they cannot be distinguished by the human visual system and are therefore perceived as having the same color. Light stimuli with that property are called metamers. As an example, the radiance spectra shown in Figure 2.10 are metamers. Metameric color matches play a very important role in all color reproduction techniques. They are the basis for color photography (see Section 2.4), color printing, color displays (see Section 2.5) as well as for the representation of color images and videos (see Section 2.3).

For specifying the cone excitation responses in (2.15), we used normalized spectral sensitivity functions without paying attention to the different peak sensitivities. Actually, this aspect does not have any impact on the characterization of metamers. If two vectors (L₁, M₁, S₁) and (L₂, M₂, S₂) are the same, the appropriately scaled versions (αL₁, βM₁, γS₁) and (αL₂, βM₂, γS₂), with non-zero scaling factors α, β and γ, are also the same, and vice versa. An aspect that is, however, neglected in equation (2.15) is the chromatic adaptation of the human eye, i.e., the changing of the scaling factors α, β and γ depending on the spectral properties of the observed light (see Section 2.2.1). For the following considerations, we assume that the eye is adapted to a particular viewing condition and, thus, the mapping between radiance spectra and cone excitation responses is linear, as given by (2.15).

Another point to note is that the so-called quality of a color, typically characterized by the hue and saturation, is solely determined by the ratio L : M : S. Two colors given by the cone response vectors (L₁, M₁, S₁) and (L₂, M₂, S₂) = (αL₁, αM₁, αS₁), with α > 1, have the same quality, i.e., the same hue and saturation, but the luminance² of the color (L₂, M₂, S₂) is by a factor of α larger than that of (L₁, M₁, S₁).

Although (2.15) could be directly used for quantifying the perception of color, the colorimetric standards are based on empirical data obtained in color-matching experiments. One reason is that the spectral sensitivities of the human cones had not been known at the time when the standards were developed. Actually, the cone fundamentals are typically estimated based on data of color-matching experiments [81].
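The metamer test implied by (2.15) can be sketched as a discrete approximation of the integral: two spectra match in color exactly when they produce the same (L, M, S) vector. The cone sensitivity samples and the two spectra below are hypothetical placeholders for illustration only.

```python
# Minimal sketch of the metamer test based on eq. (2.15), with coarse sampling.
wavelengths = [450, 500, 550, 600, 650]                  # nm
l_bar = [0.05, 0.35, 0.90, 0.95, 0.30]                   # hypothetical L-cone samples
m_bar = [0.10, 0.60, 1.00, 0.60, 0.10]                   # hypothetical M-cone samples
s_bar = [0.90, 0.25, 0.02, 0.00, 0.00]                   # hypothetical S-cone samples

def lms(phi, d_lambda=50.0):
    """Approximate (L, M, S) responses for a sampled radiance spectrum phi."""
    return tuple(sum(c * p for c, p in zip(cone, phi)) * d_lambda
                 for cone in (l_bar, m_bar, s_bar))

phi_1 = [0.00, 0.04, 0.02, 0.05, 0.01]                   # hypothetical spectrum 1
phi_2 = [0.01, 0.03, 0.03, 0.04, 0.02]                   # hypothetical spectrum 2

print("LMS(phi_1) =", tuple(round(v, 3) for v in lms(phi_1)))
print("LMS(phi_2) =", tuple(round(v, 3) for v in lms(phi_2)))
# The spectra are metamers exactly when the two vectors coincide.
```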

Mixing of Primary Colors. Since the perceived color of a light stimulus can be represented by three cone excitation levels, it seems likely that, for each radiance spectrum Φ(λ), we can also create a metameric spectrum Φ*(λ) by suitably mixing three primary colors or, more correctly, primary lights. The radiance spectra of the three primary lights A, B and C shall be given by p_A(λ), p_B(λ) and p_C(λ), respectively. With p(λ) = (p_A(λ), p_B(λ), p_C(λ))ᵀ, the radiance spectrum of a mixture of the primaries A, B and C is given by

\[
\Phi^{*}(\lambda) = A \cdot p_A(\lambda) + B \cdot p_B(\lambda) + C \cdot p_C(\lambda) = (A, B, C) \cdot p(\lambda),  \tag{2.16}
\]
where A, B and C denote the mixing factors, which are also referred to as tristimulus values. The radiance spectrum Φ*(λ) of the light mixture is a metamer of Φ(λ) if and only if it yields the same cone excitation responses. Thus, with (L, M, S) being the vector of cone excitation responses for Φ(λ), we require
\[
\begin{pmatrix} L \\ M \\ S \end{pmatrix}
= \int_0^\infty \frac{\Phi^{*}(\lambda)}{\Phi_0}
\begin{pmatrix} \bar{l}(\lambda) \\ \bar{m}(\lambda) \\ \bar{s}(\lambda) \end{pmatrix} d\lambda
= T \cdot \begin{pmatrix} A \\ B \\ C \end{pmatrix},  \tag{2.17}
\]
with the transformation matrix T being given by
\[
T = \int_0^\infty
\begin{pmatrix} \bar{l}(\lambda) \\ \bar{m}(\lambda) \\ \bar{s}(\lambda) \end{pmatrix}
\frac{p(\lambda)^{\mathrm T}}{\Phi_0}\, d\lambda.  \tag{2.18}
\]

² As mentioned in Section 2.2.1, we assume that the luminance can be represented as a linear combination of the cone excitation responses L, M, and S.

If the primaries are selected in a way that the matrix T is invertible, the mapping between the tristimulus values (A, B, C) and (L, M, S) is bijective. In this case, the color of each possible radiance spectrum Φ(λ) can be matched by a mixture of the three primary lights. And therefore, the color description in the (A, B, C) system is equivalent to the description in the (L, M, S) system. A sufficient condition for a suitable selection of the three primaries is that all primaries are perceived as having a different color and the color of none of the primaries can be matched by a mixture of the two other primaries. One aspect that will be discussed later, but should be noted at this point, is that for each selection of real primaries, i.e., primaries with radiance spectra p(λ) ≥ 0, ∀λ, there are stimuli Φ(λ) for which one or two of the mixing factors A, B and C are negative. By combining the equations (2.17) and (2.15), we obtain

    ∞  ∞ A L Z `¯(λ) Z −1 −1 Φ(λ) Φ(λ) B= T M= T m¯ (λ) dλ = c¯(λ) dλ, (2.19) Φ0 Φ0 C S 0 s¯(λ) 0 which specifies the direct mapping of radiance spectra Φ(λ) onto the tristimulus values (A, B, C). The components a¯(λ), ¯b(λ) and c¯(λ) of the vector function c¯(λ) = (a ¯(λ), ¯b(λ), c¯(λ))T are referred to as color- matching functions for the primaries A, B and C, respectively. They represent equivalents to the cone fundamentals `¯(λ), m¯ (λ) and s¯(λ). Thus, if we know the color-matching functions a¯(λ), ¯b(λ) and c¯(λ) for a set of three primaries, we can uniquely describe all perceivable colors by the corresponding tristimulus values (A, B, C). Before we discuss how color-matching functions can be determined, we highlight an important property of color mixing, which is a direct consequence of (2.19). Let Φ1(λ) and Φ2(λ) be the radiance spectra of two lights with the tristimulus values (A1,B1,C1) and (A2,B2,C2), respectively. Now, we mix an amount α of the first with an amount β of the second light. For the tristimulus values (A, B, C) of the resulting radiance spectrum Φ(λ) = α Φ1(λ) + β Φ2(λ), we obtain   ∞     A A1 A2 Z αΦ1(λ) + βΦ2(λ) B= c¯(λ) dλ = αB1 + βB2 . (2.20) Φ0 C 0 C1 C2 2.2. Visual Perception 29

Figure 2.11: Principle of color-matching experiments.

The tristimulus values of a linear combination of multiple lights are given by the linear combination, with the same weights, of the tristimulus values of the individual lights. This property was experimentally discovered by Grassmann [29] and is often called Grassmann's law.

Color-Matching Experiments. In order to experimentally determine the color-matching functions c̄(λ) for three given primary lights, the color of sufficiently many monochromatic lights³ can be matched with a mixture of the primaries. For each monochromatic light with wavelength λ, the radiance spectrum is Φ(λ′) = Φ_λ δ(λ′ − λ), where Φ_λ is the absolute radiance and δ(·) represents the Dirac delta function. According to (2.19), the tristimulus vector is given by

\[
\begin{pmatrix} A \\ B \\ C \end{pmatrix}_{\!\lambda}
= \int_0^\infty \frac{\Phi_\lambda}{\Phi_0}\, \bar{c}(\lambda')\, \delta(\lambda' - \lambda)\, d\lambda'
= \frac{\Phi_\lambda}{\Phi_0} \cdot \bar{c}(\lambda).  \tag{2.21}
\]
Except for a factor, the tristimulus vector for a monochromatic light with wavelength λ represents the value of c̄(λ) for that wavelength. Even though the value of Φ₀ can be chosen arbitrarily, the ratio of the absolute radiances Φ_λ of the monochromatic lights to any constant reference radiance Φ₀ has to be known for all wavelengths.

The basic idea of color-matching experiments is typically attributed to Maxwell [65]. The color-matching data that led to the creation of the widely used CIE 1931 colorimetric standard were obtained in experiments by Wright [94] and Guild [32].

The principle of their color-matching experiments [31, 93] is illustrated in Figure 2.11. At a visual angle of 2°, the observers looked at both a monochromatic test light and a mixture of the three primaries, for which a red, green, and blue light source were used. The amounts of the primaries could be adjusted by the observers. Since not all lights can be matched with positive amounts of the primary lights, it was possible to move any of the primaries to the side of the test light, in which case the amount of the corresponding primary was counted as a negative value⁴. The monochromatic lights were obtained by splitting a beam of white light using a prism and selecting a small portion of the spectrum using a thin slit. For determining the color-matching functions c̄(λ), only the ratios of the amounts of the primary lights were utilized. These data were combined with the already estimated luminous efficiency function V(λ) for photopic vision, assuming that V(λ) can be represented as a linear combination of the three color-matching functions ā(λ), b̄(λ) and c̄(λ). Due to the linear relationship between the tristimulus values (A, B, C) and the cone response vectors (L, M, S), this assumption is equivalent to the often used model (see Section 2.2.1) that the sensation of luminance is generated by linearly combining the cone excitation responses in the neural circuitry of the human visual system. The utilization of the luminous efficiency function V(λ) had the advantage that the effect of luminance perception could be excluded in the experiments and that it was not necessary to know the ratios of the absolute radiances Φ_λ to a common reference Φ₀ for all monochromatic lights (see above). The exact mathematical procedure for determining the color-matching functions c̄(λ) given the mixing ratios and V(λ) is described in [76].

³ In practice, lights with a reasonably small spectrum are used.

Changing Primaries. Before we discuss the results of Wright and Guild, we consider how the color-matching functions for an arbitrary set of primaries can be derived from the measurements for another set of primaries.

⁴ Due to the linearity of color mixing, adding a particular amount of a primary to the test light is mathematically equivalent to subtracting the same amount from the mixture of the other primaries.

Let us assume that we measured the color-matching functions c̄₁(λ) = (ā₁(λ), b̄₁(λ), c̄₁(λ))ᵀ for a first set of primaries given by the radiance spectra p₁(λ) = (p_A1(λ), p_B1(λ), p_C1(λ))ᵀ. Based on these data, we want to determine the color-matching functions c̄₂(λ) for a second set of primaries, which shall be given by the radiance spectra p₂(λ). For each radiance spectrum Φ(λ), the tristimulus vectors t₁ = (A₁, B₁, C₁)ᵀ and t₂ = (A₂, B₂, C₂)ᵀ for the primary sets one and two, respectively, are given by

\[
t_1 = \int_0^\infty \frac{\Phi(\lambda)}{\Phi_0}\, \bar{c}_1(\lambda)\, d\lambda
\qquad\text{and}\qquad
t_2 = \int_0^\infty \frac{\Phi(\lambda)}{\Phi_0}\, \bar{c}_2(\lambda)\, d\lambda.  \tag{2.22}
\]

The radiance spectra Φ(λ), Φ₁(λ) = p₁(λ)ᵀ t₁ and Φ₂(λ) = p₂(λ)ᵀ t₂ are metamers. Consequently, all three spectra correspond to the same color representation for any set of primaries. In particular, we require

\[
t_1 = \int_0^\infty \frac{\Phi_2(\lambda)}{\Phi_0}\, \bar{c}_1(\lambda)\, d\lambda
= \left( \int_0^\infty \bar{c}_1(\lambda)\, \frac{p_2(\lambda)^{\mathrm T}}{\Phi_0}\, d\lambda \right) t_2
= T_{21}\, t_2.  \tag{2.23}
\]
Hence, the tristimulus vector in one system of primaries can be converted into any other system of primaries using a linear transformation. Since this relationship is valid for all radiance spectra Φ(λ), including those of the monochromatic lights, the color-matching functions for the second set of primaries can be calculated according to

\[
\bar{c}_2(\lambda) = T_{21}^{-1}\, \bar{c}_1(\lambda) = T_{12}\, \bar{c}_1(\lambda).  \tag{2.24}
\]

It should be noted that the columns of a matrix T_ik represent the tristimulus vectors (A, B, C) of the primary lights of set i in the primary system k. These values can be directly measured, so that the color-matching functions can be transformed from one into another primary system even if the radiance spectra p₁(λ) and p₂(λ) are unknown.
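A minimal numerical sketch of the change of primaries in (2.23)/(2.24): converting tristimulus vectors between two primary systems is a 3×3 matrix multiplication. The matrix entries below are hypothetical, chosen only so that T21 is invertible.

```python
import numpy as np

# T21: columns are the tristimulus vectors of the primaries of set 2 measured in
# system 1; it maps tristimulus vectors of system 2 to system 1 (eq. 2.23), and
# its inverse maps the other way (eq. 2.24). Hypothetical example values.
T21 = np.array([[0.8, 0.1, 0.1],
                [0.2, 0.7, 0.1],
                [0.0, 0.1, 0.9]])

t2 = np.array([0.3, 0.5, 0.2])       # tristimulus vector in primary system 2
t1 = T21 @ t2                        # the same color described in primary system 1
t2_back = np.linalg.inv(T21) @ t1    # round trip back to system 2
print("t1 =", np.round(t1, 4), " t2 (recovered) =", np.round(t2_back, 4))
```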

CIE Standard Colorimetric Observer. In 1931, the CIE adopted the colorimetric standard known as CIE 1931 2° Standard Colorimetric Observer [15] based on the experimental data of Wright and Guild. Since Wright and Guild used different primaries in their experiments, the data had to be converted into a common primary system. For that purpose, monochromatic primaries with wavelengths of 700 nm

(red), 546.1 nm (green) and 435.8 nm (blue) were selected⁵. Since the tristimulus values for monochromatic lights had been measured in the experiments, the conversion matrices could be derived by interpolating these data at the wavelengths of the new primary system. The ratio of the absolute radiances of the primary lights was chosen so that white light with a constant radiance spectrum is represented by equal amounts of all three primaries. Hence, the to-be-determined color-matching functions r̄(λ), ḡ(λ) and b̄(λ) for the red, green and blue primaries, respectively, had to fulfill the condition
\[
\int_0^\infty \bar{r}(\lambda)\, d\lambda = \int_0^\infty \bar{g}(\lambda)\, d\lambda = \int_0^\infty \bar{b}(\lambda)\, d\lambda.  \tag{2.25}
\]
The experimental data of Wright and Guild were transformed into the new primary system, the results were averaged and some irregularities were removed [76, 8]. The requirement (2.25) resulted in a luminance ratio I_R : I_G : I_B equal to 1 : 4.5907 : 0.0601, where I_R, I_G and I_B represent the luminances of the red, green and blue primaries, respectively. The corresponding ratio Φ_R : Φ_G : Φ_B of the absolute radiances is approximately 1 : 0.0191 : 0.0137. Finally, the normalization factor for the color-matching functions, i.e., the ratio Φ_R/Φ₀, was chosen such that the condition
\[
V(\lambda) = \bar{r}(\lambda) + \frac{I_G}{I_R} \cdot \bar{g}(\lambda) + \frac{I_B}{I_R} \cdot \bar{b}(\lambda)  \tag{2.26}
\]
is fulfilled. The resulting CIE 1931 RGB color-matching functions r̄(λ), ḡ(λ) and b̄(λ) were tabulated for wavelengths from 380 to 780 nm at intervals of 5 nm [15, 76]. They are shown in Figure 2.12(a). It is clearly visible that r̄(λ) has negative values inside the range from 435.8 to 546.1 nm. In fact, for most of the wavelengths inside the range of visible light, one of the color-matching functions is negative, meaning that most of the monochromatic lights cannot be represented by a physically meaningful mixture of the chosen red, green and blue primaries.

The CIE decided to develop a second set of color-matching functions x̄(λ), ȳ(λ), and z̄(λ), which are now known as CIE 1931 XYZ color-matching functions, as the basis for their colorimetric standard.

⁵ The primaries were chosen to be producible in a laboratory.

Figure 2.12: CIE 1931 color-matching functions: (a) RGB color-matching functions, the primaries are marked with R, G and B; (b) XYZ color-matching functions.

Since all sets of color-matching functions are linearly dependent, x̄(λ), ȳ(λ), and z̄(λ) had to obey the relationship

\[
\begin{pmatrix} \bar{x}(\lambda) \\ \bar{y}(\lambda) \\ \bar{z}(\lambda) \end{pmatrix}
= T_{\mathrm{XYZ}} \cdot
\begin{pmatrix} \bar{r}(\lambda) \\ \bar{g}(\lambda) \\ \bar{b}(\lambda) \end{pmatrix},  \tag{2.27}
\]
with T_XYZ being an invertible, but otherwise arbitrary, transformation matrix. For specifying the 3×3 matrix T_XYZ, the following desirable properties were considered:

• All values of x̄(λ), ȳ(λ) and z̄(λ) were to be non-negative;
• The color-matching function ȳ(λ) was to be chosen equal to the luminous efficiency function V(λ) for photopic vision;
• The scaling was to be chosen so that the tristimulus values for an equal-energy spectrum are equal to each other;
• For the long wavelength range, the entries of the color-matching function z̄(λ) were to be equal to zero;
• Subject to the above criteria, the area that physically meaningful radiance spectra occupy inside a plane given by a constant sum X + Y + Z was to be maximized.

By considering these design principles, the transformation matrix
\[
T_{\mathrm{XYZ}} = \frac{1}{0.17697}
\begin{pmatrix}
0.49000 & 0.31000 & 0.20000 \\
0.17697 & 0.81240 & 0.01063 \\
0.00000 & 0.01000 & 0.99000
\end{pmatrix}  \tag{2.28}
\]
was adopted. A detailed description of how this matrix was derived can be found in [76, 26]. The resulting XYZ color-matching functions x̄(λ), ȳ(λ) and z̄(λ) are depicted in Figure 2.12(b). They have been tabulated for the range from 380 to 780 nm, in intervals of 5 nm, and specify the CIE 1931 standard colorimetric observer [15]. The color of a radiance spectrum Φ(λ) can be represented by the tristimulus values

  ∞  X Z x¯(λ) Φ(λ)  Y  =  y¯(λ)  dλ. (2.29) Φ0 Z 0 z¯(λ)

The reference radiance Φ₀ is typically chosen in a way that X, Y, and Z lie in a range from 0 to 1 for the considered viewing condition. Note that, due to the choice ȳ(λ) = V(λ), the value Y represents a scaled and dimensionless version of the luminance I. It is correctly referred to as relative luminance; however, the term "luminance" is often used for both the "absolute" luminance I and the relative luminance Y.

In the 1950s, Stiles and Burch [79] performed color-matching experiments for a visual angle of 10°. Based on these results, the CIE defined the CIE 1964 10° Supplementary Colorimetric Observer [17]. The data by Stiles and Burch are considered the most secure set of existing color-matching functions [7] and have been used as the basis for the Stockman and Sharpe cone fundamentals [81] and the recent CIE proposal [19] of physiologically relevant color-matching functions. Baylor, Nunn and Schnapf [5] measured direct photocurrent responses in the cones of a monkey and could predict the color-matching functions of Stiles and Burch with reasonable accuracy. Nonetheless, the CIE 1931 Standard Colorimetric Observer [15] is still used in most applications. The RGB and XYZ color-matching functions for the CIE standard observers are included in the recent ISO/CIE standard on colorimetry [41] and can also be downloaded from [40].
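Because the relationship (2.27) is linear, the matrix (2.28) maps CIE 1931 RGB tristimulus values to XYZ in the same way it maps the color-matching functions. A minimal sketch; the RGB input vector is an arbitrary example value.

```python
import numpy as np

# CIE RGB -> XYZ conversion using the transformation matrix of eq. (2.28).
T_XYZ = (1.0 / 0.17697) * np.array([[0.49000, 0.31000, 0.20000],
                                    [0.17697, 0.81240, 0.01063],
                                    [0.00000, 0.01000, 0.99000]])

rgb = np.array([0.2, 0.5, 0.3])     # example CIE 1931 RGB tristimulus values
xyz = T_XYZ @ rgb
print("XYZ =", np.round(xyz, 4))
```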

Chromaticity Diagram. The black curve in Figure 2.13(a) shows the locus of monochromatic lights with a particular radiance in the XYZ space. The tristimulus values of all possible radiance spectra represent linear combinations, with non-negative weights, of the (X, Y, Z) values for monochromatic lights.


Figure 2.13: The CIE 1931 chromaticity diagram: (a) Locus of monochromatic lights and the imaginary purple plane in the XYZ space; (b) Space of real radiance spectra with the plane X + Y + Z = 1 and the line of all equal-energy spectra; (c) Chromaticity diagram illustrating the region of all perceivable colors in the x-y plane. The diagram additionally shows the point of equal-energy white (E) as well as the primaries (R, G, B) and the white point (W) of the sRGB color space [38].

They are located inside a cone, which has its apex in the origin and lies completely in the positive octant. The cone's surface is spanned by the locations of the monochromatic lights and an imaginary purple plane, which connects the tangents for the short and long wavelength ends. As mentioned above, the quality of a color is solely determined by the ratio of the tristimulus values X : Y : Z. Hence, all lights that have the same quality of color lie on a line that intersects the origin, as is illustrated by the gray arrow in Figure 2.13(b), which represents the color of equal-energy radiance spectra. For differentiating between the luminance and the quality of a color, it is common to introduce normalized chromaticity coordinates
\[
x = \frac{X}{X + Y + Z}, \qquad
y = \frac{Y}{X + Y + Z}, \qquad
z = \frac{Z}{X + Y + Z}.  \tag{2.30}
\]
The z-coordinate is actually redundant, since it is given by z = 1 − x − y.

The tristimulus values (X, Y, Z) of a color can be represented by the chromaticity coordinates x and y, which specify the quality of the color, and the relative luminance Y. For a given quality of color, i.e., a ratio X : Y : Z, the chromaticity coordinates x and y correspond to the values of X and Y, respectively, inside the plane X + Y + Z = 1, as is illustrated in Figure 2.13(b). The set of color qualities that is perceivable by human beings is called the human gamut. Its location in the x-y coordinate system is shown in Figure 2.13(c).⁶ This plot is referred to as the chromaticity diagram. The human gamut has a horseshoe shape; its boundary is given by the projection of the monochromatic lights, referred to as the spectral locus, and the purple line, which is the projection of the imaginary purple plane. For the spectral locus, the figure includes wavelength labels in nanometers; it also shows the location x = y = 1/3, marked by “E”, of equal-energy spectra.

⁶ The complete human gamut cannot be reproduced on a display or in a print, and the perception of a color depends on the illumination conditions. Thus, the colors shown in Figure 2.13(c) should be interpreted as a rough illustration.
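As a small illustration of (2.30), the following sketch (with an illustrative function name) projects tristimulus values onto the chromaticity plane:

```python
def chromaticity(X, Y, Z):
    """Normalize tristimulus values to chromaticity coordinates, see (2.30)."""
    s = X + Y + Z
    return X / s, Y / s          # z = 1 - x - y is redundant

# Equal-energy white: X = Y = Z gives x = y = 1/3 (point "E" in Figure 2.13(c))
print(chromaticity(0.5, 0.5, 0.5))   # -> (0.333..., 0.333...)
```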

Linear Color Spaces. All color spaces that are linearly related to the LMS cone excitation space shall be called linear color spaces in this monograph. When neglecting measurement errors, the CIE RGB and XYZ spaces are linear color spaces and, hence, there exists a matrix by which the XYZ (or RGB) color-matching functions are transformed into cone fundamentals according to (2.24). Actually, cone fundamentals are typically obtained by estimating such a transformation matrix [81]. While we specified the primary spectra for the CIE 1931 RGB color space, the color-matching functions for the CIE 1931 XYZ color space were derived by defining a transformation matrix, without explicitly stating the primary spectra. Given the color-matching functions c̄(λ) = (x̄(λ), ȳ(λ), z̄(λ))ᵀ, the corresponding primary spectra p(λ) = (p_X(λ), p_Y(λ), p_Z(λ))ᵀ are not uniquely defined. With I denoting the 3×3 identity matrix, they only have to fulfill the condition

$$
\int_0^{\infty} \bar{c}(\lambda)\, p(\lambda)^{\mathrm{T}}\, \mathrm{d}\lambda \;=\; \Phi_0 \cdot \boldsymbol{I},
\tag{2.31}
$$

which is a special case of (2.23). Even though there are infinitely many spectra p(λ) that fulfill (2.31), they all have negative entries and thus represent imaginary primaries.⁷ The same is true for the LMS color space and all other linear color spaces with non-negative color-matching functions. This is often referred to as the primary paradox; it is caused by the fact that the cone fundamentals have overlapping support. There is no physically meaningful radiance spectrum, i.e., with p(λ) ≥ 0 for all λ, that excites the M-cones without also exciting the L- or S-cones. For all real primaries, the corresponding color-matching functions have negative entries. Consequently, not all colors of the human gamut can be represented by a physically meaningful mixture of the primary lights. As an example, the chromaticity diagram in Figure 2.13(c) shows the chromaticity coordinates of the sRGB primaries [38]. Displays that use primaries with these chromaticity coordinates can only represent the colors that are located inside the triangle spanned by the primaries. This set of colors is called the color gamut of the display device.

⁷ This can be verified as follows. For obtaining ∫ ȳ(λ) p_Y(λ) dλ = Φ₀, the spectrum p_Y(λ) has to contain values greater than 0 inside the range for which ȳ(λ) is greater than 0; but since either x̄(λ) or z̄(λ) is also greater than 0 inside this range, the integrals ∫ x̄(λ) p_Y(λ) dλ and ∫ z̄(λ) p_Y(λ) dλ cannot become equal to 0 unless p_Y(λ) also has negative entries.

In cameras, the situation is different. Since the transmittance spectra of the color filters (see Section 2.4), which represent the color-matching functions of the camera color space, are always non-negative, it is, in principle, possible to capture all colors of the human gamut. However, the camera color space is only a linear color space if the transmittance spectra of the color filters represent linear combinations of the cone fundamentals (or, equivalently, the XYZ color-matching functions). In practice, this can only be realized approximately. Nonetheless, a linear transformation is often used for converting the camera data into a linear color space; a suitable transformation matrix can be determined by least-squares linear regression. Since camera color spaces are associated with imaginary primaries, the image data captured by a camera sensor cannot be directly used for operating a display device; they always have to be converted. Several algorithms have been developed for realizing such a conversion.

The simplest variant consists of a linear transformation of the tristimulus values (for changing the primaries) and a subsequent clipping of negative values. For the transmission between the camera and the display, an image or video representation format, such as the above-mentioned sRGB, is used. Typically, the representation formats define linear RGB color spaces for which the primary chromaticity coordinates lie inside the human gamut, and they only allow positive tristimulus values. Hence, the color spaces of representation formats also have a limited color gamut, as has been shown for the sRGB format in Figure 2.13(c). The conversion between an RGB and the XYZ color space can be written as

$$
\begin{bmatrix} X \\ Y \\ Z \end{bmatrix}
=
\begin{bmatrix}
X_r & X_g & X_b \\
Y_r & Y_g & Y_b \\
Z_r & Z_g & Z_b
\end{bmatrix}
\cdot
\begin{bmatrix} R \\ G \\ B \end{bmatrix},
\tag{2.32}
$$

where X_r represents the X-component of the red primary, etc. The RGB color spaces used in representation formats are typically defined by the chromaticity coordinates of the red, green, and blue primaries, which shall be denoted by (x_r, y_r), (x_g, y_g), and (x_b, y_b), respectively, and the chromaticity coordinates (x_w, y_w) of the so-called white point, which represents the quality of color for tristimulus values R = G = B. The chromaticity coordinates of the white point are necessary, because they determine the length ratios of the primary vectors in the XYZ coordinate system. According to (2.30), we can replace X by xY/y and Z by (1 − x − y)Y/y in (2.32). If we then write this equation for the white point given by R = G = B, we obtain

$$
\frac{Y_w}{R}
\begin{bmatrix}
\dfrac{x_w}{y_w} \\[4pt] 1 \\[4pt] \dfrac{1-x_w-y_w}{y_w}
\end{bmatrix}
=
\begin{bmatrix}
\dfrac{x_r}{y_r}\,Y_r & \dfrac{x_g}{y_g}\,Y_g & \dfrac{x_b}{y_b}\,Y_b \\[4pt]
Y_r & Y_g & Y_b \\[4pt]
\dfrac{1-x_r-y_r}{y_r}\,Y_r & \dfrac{1-x_g-y_g}{y_g}\,Y_g & \dfrac{1-x_b-y_b}{y_b}\,Y_b
\end{bmatrix}
\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}.
\tag{2.33}
$$

It should be noted that Y_w/R > 0 is only a scaling factor, which specifies the relative luminance of the stimuli with R = G = B = 1. It can be chosen arbitrarily and is often set equal to 1. Then, the linear equation system can be solved for the unknown values Y_r, Y_g, and Y_b, which finally determine the transformation matrix.
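The following sketch (function name and the normalization Y_w/R = 1 chosen for illustration) solves the linear system (2.33) numerically and assembles the RGB-to-XYZ matrix of (2.32); applied to the BT.709 primaries and the D65 white point, it reproduces the well-known BT.709/sRGB conversion matrix.

```python
import numpy as np

def rgb_to_xyz_matrix(xy_r, xy_g, xy_b, xy_w):
    """Build the RGB-to-XYZ matrix of (2.32) by solving (2.33) with Yw/R = 1."""
    def xyz_column(x, y):
        # (X, Y, Z) of a stimulus with chromaticity (x, y) and relative luminance 1
        return np.array([x / y, 1.0, (1.0 - x - y) / y])

    P = np.column_stack([xyz_column(*xy_r), xyz_column(*xy_g), xyz_column(*xy_b)])
    Y_rgb = np.linalg.solve(P, xyz_column(*xy_w))   # relative luminances Yr, Yg, Yb
    return P * Y_rgb                                # scale the matrix columns

# BT.709 primaries with D65 white point (see Figure 2.20)
M = rgb_to_xyz_matrix((0.640, 0.330), (0.300, 0.600), (0.150, 0.060), (0.3127, 0.3290))
print(np.round(M, 4))   # first row approximately [0.4124, 0.3576, 0.1805]
```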

Figure 2.14: Influence of the illumination: (a) Normalized radiance spectra of a tungsten light bulb (illuminant A) and normal daylight (illuminant D65); (b) Reflectance spectrum of the flower “veronica fruticans” [1]; (c) Normalized radiance spectra of the reflected light for both illuminants; the chromaticity coordinates (x, y) are (0.3294, 0.2166) for the light bulb and (0.1971, 0.1130) for daylight.

Illumination. With the exception of computer monitors, television sets, and mobile phone displays, we rarely look at surfaces that emit light. In most situations, the objects we look at reflect light from one or more illumination sources, such as the sun or an incandescent light bulb. Using a simple model, the radiance spectrum Φ(λ) entering the eye from a particular surface point can be expressed as the product

$$
\Phi(\lambda) = S(\lambda) \cdot R(\lambda)
\tag{2.34}
$$

of the incident spectral radiance S(λ) reaching the surface point from the light source and the reflectance spectrum R(λ) of the surface point. The physical structure of the surface determines the degree of photon absorption for different wavelengths and thus the reflectance spectrum. It typically depends on the angles between the incident and reflected rays of light and the surface normal. The color of an object does not only depend on the physical properties of the object surface, but also on the spectrum of the illumination source. This aspect is illustrated in Figure 2.14, where we consider two typical illumination sources, daylight and a tungsten light bulb, and the reflectance spectrum for the petals of a particular flower. Due to the different spectral properties of the two illuminants, the radiance spectra that are reflected from the flower petals are very different and, as a result, the tristimulus and chromaticity values are also different. It should be noted that two objects that are perceived as having the same color for a particular illuminant can be distinguishable from each other when the illumination is changed.

Figure 2.15: Illumination sources: (a) Black-body radiators; (b) Natural daylight; (c) Chromaticity coordinates of black-body radiators (Planckian locus).

The color of a material can only be described with respect to a given illumination source. For that purpose, several illuminants have been standardized. The radiance spectrum of incandescent light sources, i.e., materials for which the emission of light is caused by their temperature, can be described by Planck's law. A so-called black body at an absolute temperature T emits light with a radiance spectrum given by

$$
\Phi_T(\lambda) \;=\; \frac{2\,h\,c^2}{\lambda^5} \left( e^{\frac{h c}{k_B T \lambda}} - 1 \right)^{-1},
\tag{2.35}
$$

where k_B is the Boltzmann constant, h the Planck constant, and c the speed of light in the medium. The temperature T is also referred to as the color temperature of the emitted light. Figure 2.15(a) illustrates the radiance spectra for three temperatures. For low temperatures, the emitted light mainly includes long-wavelength components. When the temperature is increased, the peak of the radiance spectrum is shifted toward the short-wavelength range. Figure 2.15(c) shows the chromaticity coordinates (x, y) of light emitted by black-body radiators in the CIE 1931 chromaticity diagram. The curve representing the black-body radiators for different temperatures is called the Planckian locus. The radiance spectrum of a black-body radiator of about 2856 K has been standardized as illuminant A [42] by the CIE; it represents the typical light emitted by tungsten filament light bulbs.

There are several light sources, such as fluorescent lamps or light-emitting diodes (LEDs), for which the light emission is not caused by temperature. The chromaticity coordinates of such illuminants often do not lie on the Planckian locus. The light of non-incandescent sources is often characterized by the so-called correlated color temperature. It represents the temperature of the black-body radiator for which the perceived color most closely matches that of the considered light source.

With the goal of approximating the radiance spectrum of average daylight, the CIE standardized the illuminant D65 [42]. It is based on various spectral measurements and has a correlated color temperature of 6504 K. Daylight for different conditions can be well approximated by linearly combining three radiance spectra. The CIE specified these three radiance spectra and recommended a procedure for determining the weights given a correlated color temperature in the range from 4000 to 25000 K. These daylight approximations are also referred to as CIE series-D illuminants. Figure 2.15(b) shows the approximations for average daylight (illuminant D65), morning light (4300 K), and twilight (12000 K). The chromaticity coordinates of the illuminant D65 specify the white point of the sRGB format [38]; they are typically also used as the standard setting for the white point of displays.
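As a sketch of how a point of the Planckian locus in Figure 2.15(c) can be computed, the code below evaluates (2.35) on a wavelength grid and converts the resulting spectrum into chromaticity coordinates via (2.29) and (2.30). The array cmf is assumed to hold the tabulated CIE 1931 color-matching functions on the grid lam_nm; all names are illustrative.

```python
import numpy as np

# Physical constants in SI units
h  = 6.62607015e-34    # Planck constant [J s]
c  = 2.99792458e8      # speed of light [m/s]
kB = 1.380649e-23      # Boltzmann constant [J/K]

def planck_spectrum(lam_nm, T):
    """Black-body radiance spectrum (2.35) at temperature T (constant factors
    are irrelevant for the chromaticity computed below)."""
    lam = lam_nm * 1e-9
    return (2.0 * h * c**2 / lam**5) / (np.exp(h * c / (kB * T * lam)) - 1.0)

def planckian_chromaticity(T, lam_nm, cmf):
    """Chromaticity (x, y) of a black-body radiator, using (2.29) and (2.30)."""
    phi = planck_spectrum(lam_nm, T)
    X, Y, Z = (np.trapz(cmf[:, i] * phi, lam_nm) for i in range(3))
    s = X + Y + Z
    return X / s, Y / s
```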

Chromatic Adaptation. The tristimulus values of light reflected from an object's surface highly depend on the spectral composition of the light source. However, to a large extent, our visual system adapts to the spectral characteristics of the illumination sources. Even though we notice the difference between, for example, the orange light of a tungsten light bulb and the blueish twilight just before dark (see Figure 2.15), a sheet of paper is recognized as being white for a large variety of illumination sources. This aspect of the human visual system is referred to as chromatic adaptation. As discussed above, linear color spaces provide a mechanism for determining whether two light stimuli appear to have the same color, but only under the assumption that the viewing conditions do not change. By modeling the chromatic adaptation of the human visual system, we can, to a certain degree, predict how an object observed under one illuminant looks under a different illuminant. A simple theory of chromatic adaptation, which was first postulated by von Kries [88] in 1902, is that the sensitivities of the three cone types are independently adapted to the spectral characteristics of the illumination sources. With (L₁, M₁, S₁) and (L₂, M₂, S₂) being the cone excitation responses for two different viewing conditions, the von Kries model can be formulated as

$$
\begin{bmatrix} L_2 \\ M_2 \\ S_2 \end{bmatrix}
=
\begin{bmatrix} \alpha & 0 & 0 \\ 0 & \beta & 0 \\ 0 & 0 & \gamma \end{bmatrix}
\cdot
\begin{bmatrix} L_1 \\ M_1 \\ S_1 \end{bmatrix}.
\tag{2.36}
$$

If we assume that the white points, i.e., the LMS tristimulus values of light stimuli that appear white, are given by (L_{w1}, M_{w1}, S_{w1}) and (L_{w2}, M_{w2}, S_{w2}) for the two considered viewing conditions, the scaling factors are determined by

$$
\alpha = L_{w2} / L_{w1}, \qquad \beta = M_{w2} / M_{w1}, \qquad \gamma = S_{w2} / S_{w1}.
\tag{2.37}
$$

Today it is known that the chromatic adaptation of our visual system cannot solely be described by an independent re-scaling of the cone sensitivity functions, but also includes non-linear components as well as cognitive effects. Nonetheless, variations of the simple von Kries method are widely used in practice and form the basis of most modern chromatic adaptation models. A generalized linear model for chromatic adaptation in the CIE 1931 XYZ color space can be written as

$$
\begin{bmatrix} X_2 \\ Y_2 \\ Z_2 \end{bmatrix}
=
\boldsymbol{M}_{\mathrm{CAT}}^{-1} \cdot
\begin{bmatrix} \alpha & 0 & 0 \\ 0 & \beta & 0 \\ 0 & 0 & \gamma \end{bmatrix}
\cdot \boldsymbol{M}_{\mathrm{CAT}} \cdot
\begin{bmatrix} X_1 \\ Y_1 \\ Z_1 \end{bmatrix},
\tag{2.38}
$$

where the matrix M_CAT specifies the transformation from the XYZ color space into the color space in which the von Kries-style chromatic adaptation is applied. If the chromaticity coordinates (x_{w1}, y_{w1}) and (x_{w2}, y_{w2}) of the white points for both viewing conditions are given and we assume that the relative luminance Y shall not change, the scaling factors can be determined by

$$
\alpha = \frac{A_{w2}}{A_{w1}}, \quad
\beta  = \frac{B_{w2}}{B_{w1}}, \quad
\gamma = \frac{C_{w2}}{C_{w1}}
\qquad \text{with} \qquad
\begin{bmatrix} A_{wk} \\ B_{wk} \\ C_{wk} \end{bmatrix}
= \boldsymbol{M}_{\mathrm{CAT}} \cdot
\begin{bmatrix} x_{wk}/y_{wk} \\ 1 \\ (1 - x_{wk} - y_{wk})/y_{wk} \end{bmatrix}.
\tag{2.39}
$$

The transformation specified by the matrix M_CAT is referred to as the chromatic adaptation transform. If we strictly follow von Kries' idea, it specifies the transformation from the XYZ into the LMS color space. On the basis of several viewing experiments, it has been found that transformations into color spaces that are represented by so-called sharpened cone fundamentals yield better results. The chromatic adaptation transform that is suggested in the CIECAM02 color appearance model [18, 62] specified by the CIE is given by the matrix

$$
\boldsymbol{M}_{\mathrm{CAT(CIECAM02)}} =
\begin{bmatrix}
 0.7328 & 0.4296 & -0.1624 \\
-0.7036 & 1.6975 &  0.0061 \\
 0.0030 & 0.0136 &  0.9834
\end{bmatrix}.
\tag{2.40}
$$

For more details about chromatic adaptation transforms and modern color appearance models, the reader is referred to [70, 25].

In contrast to the human visual system, digital cameras do not automatically adjust to the properties of the present illumination; they simply measure the radiance of the light falling through the color filters (see Section 2.4). For obtaining natural-looking images, the raw data recorded by the image sensor have to be processed in order to simulate the chromatic adaptation of the human visual system. The corresponding processing step is referred to as white balancing. It is often based on a standard chromatic adaptation transform and directly incorporated into the conversion from the internal color space of the camera to the color space of the representation format. With (R₁, G₁, B₁) being the recorded tristimulus values and (R₂, G₂, B₂) being the tristimulus values of the representation format, we have

$$
\begin{bmatrix} R_2 \\ G_2 \\ B_2 \end{bmatrix}
=
\boldsymbol{M}_{\mathrm{Rep}}^{-1} \cdot \boldsymbol{M}_{\mathrm{CAT}}^{-1} \cdot
\begin{bmatrix} \alpha & 0 & 0 \\ 0 & \beta & 0 \\ 0 & 0 & \gamma \end{bmatrix}
\cdot \boldsymbol{M}_{\mathrm{CAT}} \cdot \boldsymbol{M}_{\mathrm{Cam}} \cdot
\begin{bmatrix} R_1 \\ G_1 \\ B_1 \end{bmatrix}.
\tag{2.41}
$$

The matrices M_Cam and M_Rep specify the conversion from the camera and representation RGB spaces, respectively, into the XYZ space.

The scaling factors α, β, and γ are determined according to (2.39), where the white point (x_{w2}, y_{w2}) is given by the used representation format. For selecting the white point (x_{w1}, y_{w1}) of the actual viewing condition, cameras typically provide various methods, ranging from selecting the white point among a predefined set (“sunny”, “cloudy”, etc.), over calculating it based on (2.35) for a specified color temperature, to automatically estimating it based on the recorded samples.

Figure 2.16: Example of white balancing: (left) Original picture taken between sunset and dusk, implicitly assuming an equal-energy white point; (right) Picture after white balancing (the white point was defined by a selected area of the boat).

An example of white balancing is shown in Figure 2.16. As a result of the spectral composition of the natural light between sunset and dusk, the original image recorded by the camera has a noticeable purple color cast. After white balancing, which was done by using the chromaticity coordinates of an area of the boat as the white point (x_{w1}, y_{w1}), the color cast is removed and the image looks more natural.
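A minimal sketch of the von Kries-style adaptation in (2.38) and (2.39), using the CAT02 matrix of (2.40); the function names are illustrative, and the white points are assumed to be given by their chromaticity coordinates with unchanged relative luminance.

```python
import numpy as np

# CAT02 matrix of (2.40): transformation from XYZ into the "sharpened" adaptation space
M_CAT = np.array([[ 0.7328, 0.4296, -0.1624],
                  [-0.7036, 1.6975,  0.0061],
                  [ 0.0030, 0.0136,  0.9834]])

def white_xyz(x, y):
    # XYZ tristimulus values of a white point with relative luminance Y = 1, cf. (2.30)
    return np.array([x / y, 1.0, (1.0 - x - y) / y])

def chromatic_adaptation(xyz, xy_w1, xy_w2):
    """von Kries-style adaptation (2.38)/(2.39) from white point w1 to white point w2."""
    gain = (M_CAT @ white_xyz(*xy_w2)) / (M_CAT @ white_xyz(*xy_w1))   # alpha, beta, gamma
    return np.linalg.inv(M_CAT) @ (gain * (M_CAT @ xyz))

# Example: adapt a color from illuminant A (x, y = 0.4476, 0.4074) to D65
xyz_a = np.array([0.35, 0.30, 0.12])
print(chromatic_adaptation(xyz_a, (0.4476, 0.4074), (0.3127, 0.3290)))
```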

Perceptual Color Spaces. The CIE 1931 XYZ color space provides a method for predicting whether two radiance spectra are perceived as the same color. It is, however, not suitable for quantitatively describing the difference in perception between two light stimuli. As a first aspect, the perceived brightness difference between two stimuli does not only depend on the difference in luminance, but also on the luminance level to which the eye is adapted (Weber-Fechner law, see Section 2.2.1). The CIE 1931 chromaticity space is not perceptually uniform either.

The experiments of MacAdam [63] showed that the range of not perceptible chromaticity differences for a given reference chromaticity (x₀, y₀) can be described by an ellipse in the x-y plane centered around (x₀, y₀), but the orientation and size of these so-called MacAdam ellipses⁸ highly depend on the considered reference chromaticity (x₀, y₀). With the goal of defining approximately perceptually uniform color spaces with a simple relationship to the CIE 1931 XYZ color space, the CIE specified the color spaces CIE 1976 L*a*b* [43] and CIE 1976 L*u*v* [44], which are commonly referred to as CIELAB and CIELUV, respectively. Typically, the CIE L*a*b* color space is considered to be more perceptually uniform. Its relation to the XYZ space is given by

 ∗        L 0 116 0 f(X/Xw) 16 ∗  a  =  500 −500 0 · f(Y/Yw)  −  0  (2.42) ∗ b 0 200 −200 f(Z/Zw) 0 with √ ( 3 6 3 t : t > ( 29 ) f(t) = 1 29 2 4 6 3 . (2.43) 3 ( 6 ) t + 29 : t ≤ ( 29 ) The values (L∗, a∗, b∗) do not only depend on the considered point in the XYZ space, but also on the tristimulus values (Xw,Yw,Zw) of the reference white point determined by the present illumination. Hence, the L∗a∗b∗ color space includes a chromatic normalization, which cor- responds to a simple von Kries-style model (2.38) with M CAT equal to the identity matrix. The function f(t) mimics the non-linear behavior of the human visual system. The coordinate L∗ is called the lightness, a perceptually corrected version of the relative luminance Y . The com- ponents a∗ and b∗ represents color differences between reddish-magenta and green and yellow and blue, respectively. Hence, the L∗, a∗ and b∗ values can be interpreted as non-linear versions of the opponent-color processes discussed in Section 2.2.1. Due to the approximate perceptual uniformity of the CIE L∗a∗b∗ color space, the difference between two light stimuli can be quantified by the Euclidean distance between the corresponding (L∗, a∗, b∗) vectors, q ∗ ∗ 2 ∗ ∗ 2 ∗ ∗ 2 ∆E = (L1 − L0) + (a1 − a0) + (b1 − b0) . (2.44)

⁸ MacAdam's description can be extended to the XYZ space [10, 22], in which case the regions of not perceptible color differences are ellipsoids.

There are many other color spaces that have been developed for different purposes. Most of them can be derived from the XYZ color space, which can be seen as a master color space, since it has been specified based on experimental data. For image and video coding, the Y'CbCr color space is particularly important. It has some of the properties of CIELAB and will be discussed in Section 2.3, where we describe representation formats for image and video coding.

2.2.3 Visual Acuity

The ability of the human visual system to resolve fine details is determined by three factors: the resolution of the human optics, the sampling of the projected image by the photoreceptor cells, and the neural processing of the photoreceptor signals. The influence of the first two factors was evaluated in several experiments. Measurements of the modulation transfer function [12, 61, 64] revealed that the human eye has significant aberrations for large pupil sizes (see Section 2.2.1). At high spatial frequencies, however, large pupil sizes provide an improved modulation transfer. The estimated cut-off frequency ranges from about 50 cycles per degree (cpd), for pupil sizes of 2 mm, to 200 cpd, for pupil sizes of 7.3 mm [61]. In the foveal region, the average distance between rows of cones is about 0.5 minutes of arc [92, 21]. This corresponds to a Nyquist frequency of 60 cpd. For the short wavelength range of visible light, the image projected onto the retina is significantly blurred due to axial chromatic aberration. The density of the S-cones is also significantly lower than that of the M- and L-cones; it corresponds to a Nyquist frequency of about 10 cpd [20]. The impact of the neural processing on the visual acuity can only be evaluated in connection with the human optics and the retinal sampling. An ophthalmologist typically checks visual acuity using a Snellen chart. At luminance levels of at least 120 cd/m², a person with normal visual acuity has to be able to read letters covering a visual angle of 5 minutes of arc, for example, letters of 8.73 mm height at a distance of 6 m. The used letters can be considered to consist of basically 3 black and 2 white lines in one direction and, hence, people with normal acuity can resolve spatial frequencies of at least 30 cpd.

Contrast Sensitivity. The resolving capabilities of the human visual system are often characterized by contrast sensitivity functions, which specify the contrast threshold between visible and invisible. The contrast C of a stimulus is typically defined as the Michelson contrast

$$
C = \frac{I_{\max} - I_{\min}}{I_{\max} + I_{\min}},
\tag{2.45}
$$

where I_min and I_max are the minimum and maximum luminance of the stimulus. The contrast sensitivity s_c = 1/C_t is the reciprocal of the contrast C_t at which a pattern is just perceivable. Note that s_c = 1 is the smallest meaningful value: since the Michelson contrast cannot exceed 1, a pattern with a higher threshold is invisible for a human observer regardless of its contrast. For analyzing the visual acuity, the contrast sensitivity is typically measured for spatio-temporal sinusoidal stimuli

$$
I(\alpha, t) = \bar{I} \cdot \bigl( 1 + C \cdot \cos(2\pi u\, \alpha) \cdot \cos(2\pi v\, t) \bigr),
\tag{2.46}
$$

where Ī = (I_min + I_max)/2 is the average luminance, u is the spatial frequency in cycles per visual angle, v is the temporal frequency in Hz, α denotes the visual angle, and t represents the time. By varying the spatial and temporal frequency, a function s_c(u, v) is obtained, which is called the spatio-temporal contrast sensitivity function (CSF).
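For illustration, the following sketch generates the stimulus of (2.46) for a static grating and verifies its Michelson contrast (2.45); all names and parameter values are illustrative.

```python
import numpy as np

def csf_stimulus(mean_lum, contrast, u_cpd, v_hz, alpha_deg, t_s):
    """Spatio-temporal sinusoidal test stimulus according to (2.46)."""
    return mean_lum * (1.0 + contrast
                       * np.cos(2.0 * np.pi * u_cpd * alpha_deg)
                       * np.cos(2.0 * np.pi * v_hz * t_s))

def michelson_contrast(I):
    """Michelson contrast (2.45) of a luminance pattern."""
    I = np.asarray(I, dtype=float)
    return (I.max() - I.min()) / (I.max() + I.min())

alpha = np.linspace(0.0, 2.0, 513)                    # visual angle in degrees
I = csf_stimulus(100.0, 0.2, 4.0, 0.0, alpha, 0.0)    # static 4 cpd grating
assert abs(michelson_contrast(I) - 0.2) < 1e-6
```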

Spatial Contrast Sensitivity. The spatial CSF s_c(u) specifies the contrast sensitivity for sinusoidal stimuli that do not change over time (i.e., for v = 0). It can be considered a psychovisual version of the modulation transfer function. Experimental investigations [87, 13, 91, 86] showed that it highly depends on various parameters, such as the average luminance Ī and the field of view. A model that matches the experimental data well was proposed by Barten [4]. Figure 2.17(a) illustrates the basic form of the spatial CSF for foveal vision and different average luminances Ī. The spatial CSF has a bandpass character. Except for very low luminance levels, the Weber-Fechner law, s_c(u) ≠ f(Ī), is valid in the low frequency range. In the high frequency range, however, the CSF highly depends on the average luminance level Ī. For photopic luminances, the CSF has its peak sensitivity between 2 and 4 cpd; the cut-off frequency is between 40 and 60 cpd.

Figure 2.17: Spatial contrast sensitivity: (a) Contrast sensitivity function for different luminance levels (generated using the model of Barten [4] for a 10° × 10° field of view); (b) Comparison of the contrast sensitivity function for isochromatic and isoluminant stimuli (approximation for the experimental data of Mullen [66]).

In order to analyze the resolving capabilities of the opponent processes in human vision, the spatial CSF was also measured for isoluminant stimuli with varying color [66, 73]. Such stimuli with a spatial frequency u and a contrast C are, in principle, obtained by using two sinusoidal gratings with the same spatial frequency u, average luminance Ī, and contrast C, but different colors, and superimposing them with a phase shift of π. Figure 2.17(b) shows a comparison of the spatial CSFs for isochromatic and isoluminant red-green and blue-yellow stimuli. In contrast to the CSF for isochromatic stimuli, the CSF for isoluminant stimuli has a low-pass shape and the cut-off frequency is significantly lower. This demonstrates that the human visual system is less sensitive to changes in color than to changes in luminance.

Spatio-Temporal Contrast Sensitivity. The influence of temporal changes on the contrast sensitivity was also investigated in several experiments, for example in [71, 57]. A model for the spatio-temporal CSF was proposed in [2, 3]. Figure 2.18(a) illustrates the impact of temporal changes on the spatial CSF s_c(u). By increasing the temporal frequency v, the contrast sensitivity is at first increased for low spatial frequencies and the spatial CSF becomes a low-pass function; a further increase of the temporal frequency results in a decrease of the contrast sensitivity for the entire range of spatial frequencies.

Figure 2.18: Spatio-temporal contrast sensitivity: (a) Spatial CSF s_c(u) for different temporal frequencies v; (b) Temporal CSF s_c(v) for different spatial frequencies u. The shown curves represent approximations for the data of Robson [71].

Similarly, as illustrated in Figure 2.18(b), the temporal CSF s_c(v) also has a band-pass shape for low spatial frequencies. When the spatial frequency is moderately increased, the contrast sensitivity is improved for low temporal frequencies and the shape of s_c(v) becomes more low-pass. By further increasing the spatial frequency, the contrast sensitivity is reduced for all temporal frequencies. It should be noted that the spatial and temporal aspects are not independent of each other. The temporal cut-off frequency at which a temporally changing stimulus starts to have a steady appearance is called the critical flicker frequency (CFF); it is about 50–60 Hz. Investigations of the spatio-temporal CSF for chromatic isoluminant stimuli [58] showed that not only the spatial but also the temporal sensitivity to chromatic stimuli is lower than that for luminance stimuli. For chromatic isoluminant stimuli, the CFF lies in the range of 25–30 Hz.

Pattern Sensitivity. The contrast sensitivity functions provide a description of spatial and temporal aspects of human vision. The human visual system is, however, not linear. Thus, the analysis of the responses to harmonic stimuli is not sufficient to completely describe the resolving capabilities of human vision. There are several neural aspects that influence the way we see and discriminate patterns or track the motion of objects over time. For a further discussion of such aspects the reader is referred to the literature on human vision [90, 68].

2.3 Representation of Digital Images and Video

In the following, we describe data formats that serve as input formats for image and video encoders and as output formats of image and video decoders. These raw data formats are also referred to as representation formats and specify how visual information is represented as arrays of discrete-amplitude samples. At the sender side of a video communication system, the camera data have to be converted into such a representation format, and at the receiver side the output of a decoder has to be correctly interpreted for displaying the transmitted pictures. Important examples of representation formats are the ITU-R Recommendations BT.601 [46], BT.709 [46], and BT.2020 [46], which specify raw data formats for standard definition (SD), high definition (HD), and ultra-high definition (UHD) television, respectively. A discussion of several design aspects for UHD television can be found in [48].

2.3.1 Spatio-Temporal Sampling

In order to process images or videos with a microprocessor or computer, the physical quantities describing the visual information have to be discretized, i.e., they have to be sampled and quantized. The physical quantities that we measure in the image plane of a camera are irradiances observed through color filters. Let c_cont(x, y, t) be a continuous function that represents the irradiance for a particular color filter in the image plane of a camera. In image and video coding applications, orthogonal sampling lattices as illustrated in Figure 2.19(a) are used. The W × H sample array c_n[ℓ, m] representing a color component at a particular time instant t_n is, in principle, given by

$$
c_n[\ell, m] = c_{\mathrm{cont}}( \ell \cdot \Delta x,\; m \cdot \Delta y,\; n \cdot \Delta t ),
\tag{2.47}
$$

where ℓ, m, and n are integer values with 0 ≤ ℓ < W and 0 ≤ m < H, and ∆x, ∆y, and ∆t denote the horizontal, vertical, and temporal sampling distances, respectively.
Figure 2.19: Spatial sampling of images and video: (a) Orthogonal spatial sampling lattice; (b) Top and bottom field samples in interlaced video.

Since the samples delivered by the camera are typically represented with more amplitude levels than in the final representation format, we treat c_n[ℓ, m] as continuous-amplitude samples in the following. Furthermore, it is presumed that the same sampling lattice is used for all color components. If the required image size is different from that given by the image sensor or the sampling lattices are not aligned, the color components have to be re-sampled using appropriate discrete filters.

The size of a discrete picture is determined by the numbers of samples W and H in horizontal and vertical direction, respectively. The spatial sampling lattice is further characterized by the sample aspect ratio (SAR) and the picture aspect ratio (PAR) given by

$$
\mathrm{SAR} = \frac{\Delta x}{\Delta y}
\qquad \text{and} \qquad
\mathrm{PAR} = \frac{W \cdot \Delta x}{H \cdot \Delta y} = \frac{W}{H} \cdot \mathrm{SAR}.
\tag{2.48}
$$

Table 2.1 lists the picture sizes, sample aspect ratios, and picture aspect ratios for some common picture formats. The term overscan refers to a concept from analog television; it describes that some samples at the picture borders are not displayed. The picture size W × H determines the range of viewing angles at which a displayed picture appears sharp to a human observer. For that reason, it is also referred to as the spatial resolution of a picture. The temporal resolution of a video is determined by the frame rate f_t = 1/∆t. Typical frame rates are 24/1.001, 24, 25, 30/1.001, 30, 50, 60/1.001, and 60 Hz.

Table 2.1: Examples for common picture formats.

                          picture size    sample aspect   picture aspect
                          (in samples)    ratio (SAR)     ratio (PAR)     overscan

  standard definition     720 x 576       12:11           4:3             horizontal overscan
                          720 x 480       10:11           4:3             (only 704 samples are
                          720 x 576       16:11           16:9            displayed for each
                          720 x 480       40:33           16:9            scanline)

  high definition         1280 x 720      1:1             16:9
                          1440 x 1080     4:3             16:9            without overscan
                          1920 x 1080     1:1             16:9

  ultra-high definition   3840 x 2160     1:1             16:9            without overscan
                          7680 x 4320     1:1             16:9

The spatio-temporal sampling described above is also referred to as progressive sampling. An alternative that was introduced for saving bandwidth in analog television, but is still used in digital broadcast, is the interlaced sampling illustrated in Figure 2.19(b). The spatial sampling lattice is partitioned into odd and even scan lines. The even scan lines (starting with index zero) form the top field and the odd scan lines form the bottom field of an interlaced frame. The top and bottom fields are scanned alternately at successive time instants. The sample arrays of a field have the size W × (H/2). The number of fields per second, called the field rate, is twice the frame rate.
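A small sketch (with illustrative names, assuming an even frame height) of how a progressive frame can be separated into the top and bottom fields described above:

```python
import numpy as np

def split_fields(frame):
    """Split a frame (H x W array) into the top and bottom fields of size (H/2) x W."""
    top    = frame[0::2, :]   # even scan lines (indices 0, 2, 4, ...)
    bottom = frame[1::2, :]   # odd scan lines  (indices 1, 3, 5, ...)
    return top, bottom

frame = np.arange(8 * 4).reshape(8, 4)    # toy 8x4 "frame"
top, bottom = split_fields(frame)
assert top.shape == bottom.shape == (4, 4)
```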

2.3.2 Color Representation

For a device-independent description of color information, representation formats often include the specification of linear color spaces. As discussed in Section 2.2.2, displays are not capable of reproducing all colors of the human gamut. And since the number of amplitude levels required for representing colors with a given accuracy increases with increasing gamut, representation formats typically use linear color spaces with real primaries and non-negative tristimulus values. The color spaces are described by the CIE 1931 chromaticity coordinates of the primaries and the white point. The 3×3 matrix specifying the conversion between the tristimulus values of the representation format and the CIE 1931 XYZ color space can be determined by solving the linear equation system in (2.33). As examples, Figure 2.20 lists the chromaticity coordinates for selected representation formats and illustrates the corresponding gamuts in the chromaticity diagram. In contrast to the HD and UHD specifications BT.709 and BT.2020, the ITU-R Recommendation BT.601 for SD television does not include the specification of a linear color space.

Figure 2.20: Color spaces of selected representation formats: (left) CIE 1931 chromaticity coordinates for the color primaries and white point; (right) Comparison of the corresponding color gamuts to the human gamut. The tabulated coordinates are:

                 SMPTE 170M   EBU Tech. 3213   ITU-R BT.709   ITU-R BT.2020
                 (SD 525)     (SD 625)         (HD)           (UHD)
  red     x_r    0.6300       0.6400           0.6400         0.7080
          y_r    0.3400       0.3300           0.3300         0.2920
  green   x_g    0.3100       0.2900           0.3000         0.1700
          y_g    0.5950       0.6000           0.6000         0.7970
  blue    x_b    0.1550       0.1500           0.1500         0.1310
          y_b    0.0700       0.0600           0.0600         0.0460
  white   x_w    0.3127       0.3127           0.3127         0.3127
  (D65)   y_w    0.3290       0.3290           0.3290         0.3290

For conventional SD television systems, the linear color spaces specified in EBU Tech. 3213 [24] (for 625-line systems) and SMPTE 170M [77] (for 525-line systems) are used⁹, which are similar to that in BT.709. Due to continuing improvements in display technology, the color primaries for the UHD specification BT.2020 have been selected to lie on the spectral locus, yielding a significantly larger gamut than the SD and HD specifications. As a consequence, BT.2020 also recommends larger bit depths for representing amplitude values.

At the sender side, the color sample arrays captured by the image sensor(s) of the camera have to be converted into the color space of the representation format. For each point (ℓ, m) of the sampling lattice, the conversion can be realized by a linear transform according to (2.41).¹⁰ If the transform yields a tristimulus vector with one or more negative entries, the color lies outside the gamut of the representation format and has to be mapped to a similar color inside the gamut; the easiest way of such a mapping is to set the negative entries equal to zero. It is common practice to scale the transform matrix in a way that the components of the resulting tristimulus vectors have a maximum value of one.

At the receiver side, a similar linear transform is required for converting the color vectors of the representation format into the color space of the display device. In accordance with video coding standards such as H.264 | MPEG-4 AVC [53] or H.265 | MPEG-H HEVC [54], we denote the tristimulus values of the representation format with E_R, E_G, and E_B and presume that their values lie in the interval [0; 1].

⁹ Since the 6th edition, BT.601 lists the chromaticity coordinates specified in EBU Tech. 3213 [24] (625-line systems) and SMPTE 170M [77] (525-line systems).

¹⁰ Note that the white point of the representation format is used both for the determination of the conversion matrix M_Rep and for the calculation of the white-balancing factors α, β, and γ. If the camera captures C > 3 color components, the conversion matrix M_Cam and the combined transform matrix have a size of 3 × C.

2.3.3 Non-linear Encoding

The human visual system has a non-linear response to differences in luminance. As discussed in Sections 2.2.1 and 2.2.3, the perceived brightness difference between two image regions with luminances I₁ and I₂ does not only depend on the difference in luminance ∆I = |I₁ − I₂|, but also on the average luminance Ī = (I₁ + I₂)/2. If we add a certain amount of quantization noise to the tristimulus values of a linear color space, whether by discretizing the amplitude levels or by lossy coding, the noise is more visible in dark image regions. This effect can be circumvented if we introduce a suitable non-linear mapping f_TC(E) for the linear color components E and quantize the resulting non-linear color components E′ = f_TC(E). A corresponding non-linear mapping f_TC is often referred to as transfer function or transfer characteristic. For relative luminances Y with amplitudes in the range [0; 1], the perceived brightness can be approximated by a power law

$$
Y' = f_{TC}(Y) = Y^{\gamma_e}.
\tag{2.49}
$$

For the exponent γ_e, which is called the encoding gamma, a value of about 1/2.2 is typically suggested. The non-linear mapping Y → Y′ is commonly also referred to as gamma encoding or gamma correction. Since a color component E of a linear color space represents the relative luminance of the corresponding primary spectrum, the power law (2.49) can also be applied to the tristimulus values of a linear color space. At the receiver side, it has to be ensured that the luminances I produced on the display are roughly proportional to

$$
Y = f_{TC}^{-1}(Y') = (Y')^{\gamma_d}
\qquad \text{with} \qquad \gamma_d = 1/\gamma_e,
\tag{2.50}
$$

so that the end-to-end relationship between the luminance measured by the camera and the reproduced luminance is approximately linear. The exponent γ_d is referred to as the decoding gamma.

Figure 2.21: Non-linear encoding: (a) Comparison of linearly increasing Y and Y′ using the transfer function f_TC specified in BT.709 and BT.2020; the bottom parts illustrate uniform quantization; (b) Comparison of selected transfer functions.

Interestingly, in cathode ray tube (CRT) displays, the luminance I is proportional to (V + ε)^γ, where V represents the applied voltage, ε is a constant voltage offset, and the exponent γ lies in the range of about 2.35 to 2.55. The original motivation for the development of gamma encoding was to compensate for this non-linear voltage-luminance relationship. In modern image and video applications, however, gamma encoding is applied for transforming the linear color components into a nearly perceptually uniform domain and thus minimizing the bit depth required for representing color information [69]. Since the power law (2.49) has an infinite slope at zero and yields unsuitably high values for very small input values, it is often replaced by a linear function around zero, which yields the piecewise-defined transfer function

$$
E' = f_{TC}(E) =
\begin{cases}
\kappa \cdot E & : \; 0 \le E < b \\
a \cdot E^{\gamma} - (a - 1) & : \; b \le E \le 1
\end{cases}.
\tag{2.51}
$$

The exponent γ and the slope κ are specified in representation formats. The values a and b are determined in a way that both sub-functions of f_TC yield the same value and derivative at the connection point E = b. BT.709 and BT.2020 specify the exponent γ = 0.45 and the slope κ = 4.5, which yields the values a ≈ 1.0993 and b ≈ 0.0181. Representation formats specify the application of the transfer function (2.51) to the linear components E_R, E_G, and E_B, which have amplitudes in the range [0; 1].

The resulting non-linear color components are denoted as E′_R, E′_G, and E′_B; their range of amplitudes is also [0; 1]. In most applications, E_R, E_G, and E_B already have discrete amplitudes. For a reasonable application of gamma encoding, the bit depth of the linear components has to be at least 3 bits larger than the bit depth used for representing the gamma-encoded values. Figure 2.21(a) illustrates the subjective effect of non-linear encoding for the relative luminance Y of an achromatic signal. In Figure 2.21(b), the transfer function f_TC as specified in BT.709 and BT.2020 is compared to the simple power law with γ_e = 1/2.2 and the transfer function used in the CIE L*a*b* color space (see Section 2.2.2).
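A minimal sketch of the transfer function (2.51) with the BT.709/BT.2020 parameters quoted above; the constants a and b are the approximate values given in the text, and the function names are illustrative.

```python
import numpy as np

# Parameters of the piecewise transfer function (2.51) as specified in BT.709/BT.2020
GAMMA = 0.45
KAPPA = 4.5
A = 1.0993      # approximate values; chosen so that value and derivative
B = 0.0181      # of the two sub-functions match at the connection point E = b

def transfer_function(E):
    """Gamma encoding of a linear component E in [0, 1] according to (2.51)."""
    E = np.asarray(E, dtype=float)
    return np.where(E < B, KAPPA * E, A * np.power(E, GAMMA) - (A - 1.0))

def inverse_transfer_function(Ep):
    """Inverse mapping from the non-linear component E' back to the linear component E."""
    Ep = np.asarray(Ep, dtype=float)
    return np.where(Ep < KAPPA * B, Ep / KAPPA, np.power((Ep + A - 1.0) / A, 1.0 / GAMMA))
```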

2.3.4 The Y'CbCr Color Representation

Color television was introduced as a backwards-compatible extension of the existing black-and-white television. This was achieved by transmitting two signals with color difference information in addition to the conventional luminance-related signal. As will be discussed in the following, the representation of color images as a luminance-related signal and two color difference signals has some advantages, due to which it is still widely used in image and video communication applications. Firstly, let us assume that the luminance-related signal, which shall be denoted by L, and the color difference signals C₁ and C₂ represent linear combinations of the linear color components E_R, E_G, and E_B. The mapping between the vectors (L, C₁, C₂) and the CIE 1931 XYZ color space can then be represented by the matrix equation

$$
\begin{bmatrix} X \\ Y \\ Z \end{bmatrix}
=
\begin{bmatrix}
X_r & X_g & X_b \\
Y_r & Y_g & Y_b \\
Z_r & Z_g & Z_b
\end{bmatrix}
\cdot
\begin{bmatrix}
R_{\ell} & R_{c1} & R_{c2} \\
G_{\ell} & G_{c1} & G_{c2} \\
B_{\ell} & B_{c1} & B_{c2}
\end{bmatrix}
\cdot
\begin{bmatrix} L \\ C_1 \\ C_2 \end{bmatrix},
\tag{2.52}
$$

where the first matrix specifies the given mapping between the linear RGB color space of the representation format and the XYZ color space, and the second matrix specifies the mapping from the LC₁C₂ space to the RGB space. We consider the following desirable properties:

• Achromatic signals (x = xw and y = yw) have C1 = C2 = 0;

• Changes in the color difference components C₁ or C₂ do not have any impact on the relative luminance Y.

The first property requires R_ℓ = G_ℓ = B_ℓ. The second criterion is fulfilled if, for k being equal to 1 and 2, we have

$$
Y_r \cdot R_{ck} + Y_g \cdot G_{ck} + Y_b \cdot B_{ck} = 0.
\tag{2.53}
$$

Probably to simplify implementations, early researchers chose R_{c1} = 0 and B_{c2} = 0. With s_ℓ, s_{c1}, and s_{c2} being arbitrary non-zero scaling factors, this choice yields

$$
\begin{aligned}
L   &= s_{\ell} \cdot ( Y_r \cdot E_R + Y_g \cdot E_G + Y_b \cdot E_B ) \\
C_1 &= s_{c1} \cdot ( -Y_r \cdot E_R - Y_g \cdot E_G + (Y_r + Y_g) \cdot E_B ) \\
C_2 &= s_{c2} \cdot ( (Y_g + Y_b) \cdot E_R - Y_g \cdot E_G - Y_b \cdot E_B ).
\end{aligned}
\tag{2.54}
$$

By using Y = Y_r E_R + Y_g E_G + Y_b E_B, we can also write

$$
\begin{aligned}
L   &= s_{\ell} \cdot Y \\
C_1 &= s_{c1} \cdot \bigl( (Y_r + Y_g + Y_b)\, E_B - Y \bigr) \\
C_2 &= s_{c2} \cdot \bigl( (Y_r + Y_g + Y_b)\, E_R - Y \bigr).
\end{aligned}
\tag{2.55}
$$

The component L is, as expected, proportional to the relative luminance Y; the components C₁ and C₂ represent differences between a primary component and the appropriately scaled relative luminance Y.

Y'CbCr. Due to decisions made in the early years of color television, the transformation (2.55) from the RGB color space into a color space with a luminance-related and two color difference components is applied after gamma encoding.¹¹ The transformation is given by

$$
\begin{aligned}
E'_Y    &= K_R \cdot E'_R + (1 - K_R - K_B) \cdot E'_G + K_B \cdot E'_B \\
E'_{Cb} &= (E'_B - E'_Y) \,/\, (2 - 2K_B) \\
E'_{Cr} &= (E'_R - E'_Y) \,/\, (2 - 2K_R).
\end{aligned}
\tag{2.56}
$$

The component E′_Y is called the luma component and the color difference signals E′_Cb and E′_Cr are called chroma components. The terms “luma” and “chroma” have been chosen to indicate that the signals are computed as linear combinations of gamma-encoded color components; the non-linear nature is also indicated by the prime symbol.

¹¹ In the age of CRT TVs, this processing order had the advantage that the decoded E′_R, E′_G, and E′_B signals could be directly fed to a CRT display.

Figure 2.22: Representation of a color image (left) as red, green, and blue components E′_R, E′_G, E′_B (top right) and as luma and chroma components E′_Y, E′_Cb, E′_Cr (bottom right). All components are represented as gray-value pictures; for the signed components E′_Cb and E′_Cr a constant offset (middle gray) is added.

The representation of color images by a luma and two chroma components is referred to as the Y'CbCr or YCbCr color format. Note that, in contrast to linear color spaces, Y'CbCr is not an absolute color space, but rather a way of encoding the tristimulus values of a linear color space. The scaling factors in (2.56) are chosen in a way that the luma component has an amplitude range of [0; 1] and the chroma components have amplitude ranges of [−0.5; 0.5]. If we neglect the impact of gamma encoding, the constants K_R and K_B have to be chosen according to

$$
K_R = \frac{Y_r}{Y_r + Y_g + Y_b}
\qquad \text{and} \qquad
K_B = \frac{Y_b}{Y_r + Y_g + Y_b},
\tag{2.57}
$$

where Y_r, Y_g, and Y_b are, as indicated in (2.52), determined by the chosen linear RGB color space. For BT.709 (K_R = 0.2126, K_B = 0.0722) and BT.2020 (K_R = 0.2627, K_B = 0.0593), the specified values of K_R and K_B can be directly derived from the chromaticity coordinates of the primaries and white point. BT.601, which does not define a color space, specifies the values K_R = 0.299 and K_B = 0.114, which were derived based on the color space of an old NTSC standard [85].¹² In the Y'CbCr format, color images are, in principle, represented by an achromatic signal E′_Y, a blue-yellow difference signal E′_Cb, and a red-green difference signal E′_Cr.

¹² In SD television, there is a discrepancy between the Y'CbCr format and the linear color spaces [24, 77] used in existing systems. As a result, quantization errors in the chroma components have a larger impact on the luminance I of decoded pictures than would be the case with the choice given in (2.57).

Table 2.2: Common color formats for image and video coding.

  format          description
  RGB (4:4:4)     The red, green, and blue components have the same size.
  Y'CbCr 4:4:4    The chroma components have the same size as the luma component.
  Y'CbCr 4:2:2    The chroma components are horizontally subsampled by a factor of two.
                  The height of the chroma components is the same as that of the luma component.
  Y'CbCr 4:2:0    The chroma components are subsampled by a factor of two in both horizontal
                  and vertical direction. Each chroma component contains a quarter of the
                  samples of the luma component.

In that respect, the Y'CbCr format is similar to the CIELAB color space and the opponent processes in human vision. The transformation into the Y'CbCr domain effectively decorrelates the cone responses and thus also the RGB data of typical natural images. When we use the Y'CbCr format as the basis for lossy image or video coding, the components can be treated separately and the quantization errors are still introduced in a perceptually meaningful way (as far as the color representation is concerned). Figure 2.22 illustrates the differences between the RGB and Y'CbCr formats for an example image. As can be seen, the red, green, and blue components are typically highly correlated. In the Y'CbCr format, however, most of the visual information is concentrated in the luma component. Due to these properties, the Y'CbCr format is well suited for lossy coding and is used in nearly all image and video communication applications.
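A minimal sketch of the conversion (2.56) for gamma-encoded components in [0, 1]; the function name is illustrative, and the BT.709 constants from (2.57) are used in the example call.

```python
import numpy as np

def rgb_to_ycbcr(rgb_prime, K_R, K_B):
    """Convert gamma-encoded R'G'B' values in [0, 1] into Y'CbCr according to (2.56)."""
    R, G, B = rgb_prime
    Y  = K_R * R + (1.0 - K_R - K_B) * G + K_B * B
    Cb = (B - Y) / (2.0 - 2.0 * K_B)          # range [-0.5, 0.5]
    Cr = (R - Y) / (2.0 - 2.0 * K_R)          # range [-0.5, 0.5]
    return np.array([Y, Cb, Cr])

# Example with the BT.709 constants of (2.57)
print(rgb_to_ycbcr((0.25, 0.50, 0.75), K_R=0.2126, K_B=0.0722))
```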

Chroma Subsampling. When we discussed contrast sensitivity functions in Section 2.2.3, we noted that human beings are much more sensitive to high-frequency components in isochromatic than in isoluminant stimuli. For saving bit rate, the chroma components are often downsampled. For normal viewing distances, the reduction of the number of chroma samples does not result in any perceivable degradation of image quality. Table 2.2 summarizes the color formats used in image and video coding applications. The most commonly used format is the Y'CbCr 4:2:0 format, in which the chroma sample arrays are downsampled by a factor of two in both horizontal and vertical direction. Representation formats do not specify filters for resampling the chroma components.

Figure 2.23: Nominal locations of chroma samples (indicated by circles) relative to those of the luma samples (indicated by crosses) for different chroma sampling formats; from left to right: 4:4:4, 4:2:2, 4:2:0 (BT.2020), 4:2:0 (MPEG-1), 4:2:0 (MPEG-2).

In order to avoid color fringes in displayed images, the phase shifts of the filters and thus the locations of the chroma samples relative to the luma samples should be known. For the 4:4:4 and 4:2:2 sampling formats, representation formats and video coding standards generally specify that the top-left chroma samples coincide with the top-left luma sample (see Figure 2.23). For the 4:2:0 format, however, different alternatives are used. While BT.2020 specifies that the top-left chroma samples coincide with the top-left luma sample (third picture in Figure 2.23), in the video coding standards MPEG-1 Video [45], H.261 [49], and H.263 [50], the chroma samples are located in the center of the four associated luma samples (fourth picture in Figure 2.23). And in the video coding standards H.262 | MPEG-2 Video [52], H.264 | MPEG-4 AVC [53], and H.265 | MPEG-H HEVC [54], the nominal offset between the top-left chroma and luma samples is zero in horizontal and half a luma sample in vertical direction (fifth picture in Figure 2.23). Video coding standards such as H.264 | MPEG-4 AVC and H.265 | MPEG-H HEVC include syntax for indicating the location of the chroma samples in the 4:2:0 format inside the bitstream.
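As an illustration only, the following sketch downsamples a chroma plane by simple 2×2 averaging; this particular filter corresponds to the MPEG-1 style siting (chroma sample in the center of the four associated luma samples), whereas the other sitings in Figure 2.23 require filters with different phase shifts. Names are illustrative and the plane dimensions are assumed to be even.

```python
import numpy as np

def downsample_chroma_420(chroma):
    """Downsample a chroma plane by 2x2 block averaging (illustrative filter only)."""
    c = np.asarray(chroma, dtype=float)
    return 0.25 * (c[0::2, 0::2] + c[1::2, 0::2] + c[0::2, 1::2] + c[1::2, 1::2])

cb = np.random.rand(8, 8)
assert downsample_chroma_420(cb).shape == (4, 4)
```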

Constant Luminance Y’CbCr. The application of gamma encoding before calculating the Y’CbCr components in (2.56) has the effect that changes in the chroma components due to quantization or subsampling influence the relative luminance of the displayed signal. BT.2020 [47] specifies an alternative format, which is given by the components

$$
E'_{YC} = f_{TC}\bigl( K_R \cdot E_R + (1 - K_R - K_B) \cdot E_G + K_B \cdot E_B \bigr)
\tag{2.58}
$$
$$
E'_{CbC} = (E'_B - E'_{YC}) \,/\, N_B
\tag{2.59}
$$
$$
E'_{CrC} = (E'_R - E'_{YC}) \,/\, N_R.
\tag{2.60}
$$

The sign-dependent scaling factors N_B and N_R are

$$
N_X =
\begin{cases}
2a\,(1 - K_X)^{\gamma} - 2\,(a - 1) & : \; E'_X - E'_{YC} \le 0 \\[2pt]
2a\,(1 - K_X^{\gamma}) & : \; E'_X - E'_{YC} > 0
\end{cases},
\tag{2.61}
$$

where a and γ represent the corresponding parameters of the transfer function f_TC in (2.51). By defining s_Y = Y_r + Y_g + Y_b and using (2.58), we obtain for the relative luminance Y of the decoded signal

$$
\begin{aligned}
Y &= s_Y \cdot \bigl( K_R\, E_R + (1 - K_R - K_B)\, E_G + K_B\, E_B \bigr) \\
  &= s_Y \cdot \Bigl( K_R\, E_R + \bigl( f_{TC}^{-1}(E'_{YC}) - K_R\, E_R - K_B\, E_B \bigr) + K_B\, E_B \Bigr) \\
  &= s_Y \cdot f_{TC}^{-1}(E'_{YC}).
\end{aligned}
\tag{2.62}
$$

The relative luminance Y depends only on E′_YC. For that reason, the alternative format is also referred to as the constant luminance Y'CbCr format. In the document BT.2246 [48], the impact on video coding was evaluated by encoding eight test sequences, given in an RGB format, with H.265 | MPEG-H HEVC [54]. The reconstruction quality was measured in the CIELAB color space using the distortion measure ∆E given in (2.44). It is reported that by choosing the constant luminance format instead of the conventional Y'CbCr format, on average 12% bit rate savings are obtained for the same average distortion. The constant luminance Y'CbCr format has similar properties to the currently dominating standard Y'CbCr format and could replace it in image and video applications without requiring any adjustments apart from the modified transformation from and to the linear RGB color space.

2.3.5 Quantization of Sample Values

Finally, for obtaining discrete-amplitude samples suitable for coding and digital transmission, the luma and chroma components E′_Y, E′_Cb, and E′_Cr are quantized using uniform quantization. The ITU-R Recommendations BT.601, BT.709, and BT.2020 specify that the corresponding integer color components Y, Cb, and Cr are obtained according to

$$
Y  = \bigl[\, (219 \cdot E'_Y + 16) \cdot 2^{B-8} \,\bigr],
\tag{2.63}
$$
$$
Cb = \bigl[\, (224 \cdot E'_{Cb} + 128) \cdot 2^{B-8} \,\bigr],
\tag{2.64}
$$
$$
Cr = \bigl[\, (224 \cdot E'_{Cr} + 128) \cdot 2^{B-8} \,\bigr],
\tag{2.65}
$$

where B denotes the bit depth, in bits per sample, for representing the amplitude values and the operator [ · ] represents rounding to the nearest integer. While BT.601 and BT.709 recommend bit depths of 8 or 10 bits, the UHD specification BT.2020 recommends the usage of 10 or 12 bits per sample. Video coding standards typically support the usage of different bit depths for the luma and chroma components. In the most widely used profiles, however, only bit depths of 8 bits per sample are supported. If the RGB format is used for coding, all three color components are quantized according to (2.63).

Quantization according to (2.63)–(2.65) does not use the entire range of B-bit integer values. The ranges of unused values are referred to as footroom (small values) and headroom (large values). They allow the implementation of signal processing operations, such as filtering or analog-to-digital conversion, without the need for clipping the results. In the xvYCC color space [39], the headroom and footroom are used for extending the color gamut. When using this format, the linear components E as well as the gamma-encoded components E′ are no longer restricted to the interval [0; 1] and the definition of the transfer function f_TC is extended beyond the domain [0; 1]. As an alternative, the video coding standards H.264 | MPEG-4 AVC [53] and H.265 | MPEG-H HEVC [54] provide a syntax element by which it can be indicated that the full range of B-bit integer values is used for representing amplitude values, in which case the quantization equations (2.63)–(2.65) are modified so that the minimum and maximum used integer values are 0 and 2^B − 1, respectively.
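A minimal sketch of the limited-range quantization (2.63)–(2.65); the function name is illustrative.

```python
import numpy as np

def quantize_ycbcr(Ey, Ecb, Ecr, bit_depth=10):
    """Limited-range quantization of Y'CbCr components according to (2.63)-(2.65)."""
    scale = 2 ** (bit_depth - 8)
    Y  = np.round((219.0 * np.asarray(Ey)  +  16.0) * scale).astype(int)
    Cb = np.round((224.0 * np.asarray(Ecb) + 128.0) * scale).astype(int)
    Cr = np.round((224.0 * np.asarray(Ecr) + 128.0) * scale).astype(int)
    return Y, Cb, Cr

# Mid-gray luma with neutral chroma at a bit depth of 10 gives Y = 502, Cb = Cr = 512
print(quantize_ycbcr(0.5, 0.0, 0.0, bit_depth=10))
```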

2.4 Image Acquisition

Modern digital cameras are complex devices that consist of a multitude of components, which often include advanced systems for automatic focusing, exposure control, and white balancing. The most important components are illustrated in Figure 2.24. The camera lens forms an image of a real-world scene on the image sensor, which is located in the image plane of the camera. The lens (or some lens elements) can be moved for focusing objects at different distances.

Figure 2.24: Basic principle of image acquisition with a digital camera.

As discussed in Section 2.1, the focal length of the lens determines the field of view and its aperture regulates the depth of field as well as the illuminance (the photometric equivalent of irradiance) falling on the image sensor. The image sensor basically converts the illuminance pattern observable on its surface into an electric signal. This is achieved by measuring the energy of visible light that falls onto small areas of the image sensor during a certain period of time, which is referred to as the exposure time or shutter speed. The image processor controls the image sensor and converts the electric signal that is output by the image sensor into a digital representation of the captured scene.

The amount of visible light energy per unit area that is used for creating a picture is called exposure; it is given by the product of the illuminance on the image sensor and the exposure time t_e. The illuminance on the sensor is proportional to the area of the entrance pupil and, thus, to the square of the aperture diameter a. But the area of an object's image on the sensor is also approximately proportional to the square of the focal length f. Hence, for a given scene, the illuminance on the image sensor depends only on the f-number F = f/a. The camera settings that influence the exposure are often expressed as the exposure value EV = log₂(F²/t_e). All combinations of aperture and shutter speed that have the same exposure value give the same exposure for a chosen scene. An increment of the exposure value by one, commonly called one "stop", corresponds to halving the amount of visible light energy. Note that different camera settings with the same exposure value still yield different pictures, because the depth of field depends on the f-number and the amount of motion blur on the shutter speed. For video, the exposure time has to be smaller than the reciprocal of the frame rate.
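As a small numerical illustration of the exposure value, the following Python sketch evaluates EV = log₂(F²/t_e) for a few aperture/shutter combinations; the particular f-numbers and exposure times are chosen arbitrarily for the example.

```python
import math

def exposure_value(f_number, exposure_time):
    """Exposure value EV = log2(F^2 / t_e) as defined in the text."""
    return math.log2(f_number ** 2 / exposure_time)

# Opening the aperture by one stop (f-number divided by sqrt(2)) while halving
# the exposure time keeps the exposure value, and thus the exposure, constant.
# Nominal f-numbers are rounded, so the EV values agree only approximately.
for f_number, t_e in [(8.0, 1 / 125), (5.6, 1 / 250), (4.0, 1 / 500)]:
    ev = exposure_value(f_number, t_e)
    print(f"F = {f_number:3.1f}, t_e = 1/{round(1 / t_e):3d} s  ->  EV = {ev:.2f}")
```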

Figure 2.25: Image sensor: (a) Array of light-sensitive photocells; (b) Illustration of the exposure-voltage transfer function for a photocell.

2.4.1 Image Sensor

An image sensor consists of an array of light-sensitive photocells, as illustrated in Figure 2.25(a). Each photocell corresponds to a pixel in the acquired images. Typically, microlenses are located above the photocells. Their purpose is to improve the light efficiency by directing most of the incident light to the light-sensitive parts of the sensor. For some types of sensors, which we will further discuss in Section 2.4.2, color filters that block light outside a particular spectral range are placed between the photocells and microlenses. Another filter is typically inserted between the lens and the sensor. It is used for removing wavelengths to which human beings are not sensitive, but to which the image sensor is sensitive. Without such a filter, the acquired images would have incorrect colors or gray values, since parts of the infrared or ultraviolet spectrum would contribute to the generated image signal.

Modern digital cameras use either charge-coupled device (CCD) or complementary metal-oxide-semiconductor (CMOS) image sensors. Both types of sensors employ the photoelectric effect. When a photon (a quantum of electromagnetic radiation) strikes the semiconductor of a photocell, it creates an electron-hole pair. By applying an electric field, the positive and negative charges are collected during the exposure time and a voltage proportional to the number of incoming photons is generated. At the end of an exposure, the generated voltages are read out, converted to digital signals, and further processed by the image processor. Since the created charges are proportional to the number of incoming photons, the exposure-voltage transfer function for a photocell is basically linear. However, as shown in Figure 2.25(b), there is a saturation level, which is determined by the maximum collectible charge. If the exposure exceeds the saturation level for a significant number of photocells, the captured image is overexposed; the lost image details cannot be recovered by the following signal processing.

Sensor Noise. The number of photons that arrive at a particular photocell during the exposure time is random; it can be well modeled as a random variable with a Poisson distribution. The resulting noise in the captured image is called photon shot noise. The Poisson distribution has the property that the variance σ² is equal to the mean µ. Hence, if we assume a linear relationship between the number of photons and the generated voltage, the signal-to-noise ratio (SNR) of the output signal is proportional to the average number of incoming photons (µ²/σ² = µ). Other types of noise that affect the image quality are:

• Dark current noise: a certain amount of charge per time unit can also be created by thermal vibration;
• Read noise: thermal noise in the readout circuitry;
• Reset noise: some charges may remain after resetting the photocells at the beginning of an exposure;
• Fixed pattern noise: caused by manufacturing variations across the photocells of a sensor.

Most noise sources are independent of the irradiance on the sensor. An exception is the photon shot noise, which becomes predominant above a certain irradiance level. The SNR of a captured image increases with the number of photons arriving at a photocell during the exposure time. Consequently, pictures and videos captured with the small image sensors (and small photocells) of smartphones are considerably noisier than those captured with the large sensors of professional cameras.
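The relation µ²/σ² = µ can be verified with a short simulation. The following sketch draws Poisson-distributed photon counts for several mean values and estimates the resulting SNR; the sample size and the chosen means are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Photon shot noise: the photon count per photocell is Poisson distributed,
# so its variance equals its mean and the SNR (mu^2 / sigma^2) equals mu.
for mean_photons in (10, 100, 1000, 10000):
    counts = rng.poisson(mean_photons, size=100_000)
    snr = counts.mean() ** 2 / counts.var()
    print(f"mean = {mean_photons:5d}  ->  estimated SNR = {snr:8.1f}"
          f"  ({10 * np.log10(snr):5.1f} dB)")
```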

ISO Speed. The ISO speed or ISO sensitivity is a measure that was originally standardized by the International Organization for Standardization (ISO) for specifying the light-sensitivity of photographic films. It is now also used as a measure for the sensitivity of image sensors. Digital cameras typically allow the ISO speed to be selected within a given range. Changing the ISO speed modifies the amplification factor that is applied to the sensor's output signal before analog-to-digital conversion. The ISO system defines a linear and a logarithmic scale. Digital cameras typically use the linear scale (with values of 100, 200, etc.), for which a doubling of the ISO sensitivity corresponds to a doubling of the amplification factor. Note that higher ISO values correspond to lower signal-to-noise ratios, since the noise in the sensor's output is also amplified.

The ISO speed is the third parameter, besides the aperture and the shutter speed, by which the exposure of a picture can be controlled. Typically, an image is considered to be correctly exposed if nearly the entire range of digital amplitude levels is utilized and the portion of saturated photocells or clipped sample values is very small. For a given scene, the photographer or videographer can select one of multiple suitable combinations of aperture, shutter speed, and ISO sensitivity and, thus, control the depth of field, motion blur, and noise level within certain ranges. For filming in dark environments, increasing the ISO sensitivity is often the only way to achieve the required frame rate.
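To illustrate the trade-off, the following sketch models a photocell with Poisson shot noise and additive read noise, amplified by the ISO gain before analog-to-digital conversion. Keeping the output level constant while raising the ISO, and therefore capturing proportionally fewer photons, lowers the SNR. The read-noise level, base ISO, and photon counts are invented for the example; real sensor models are considerably more involved.

```python
import numpy as np

rng = np.random.default_rng(1)

def capture(mean_photons, iso, base_iso=100, read_noise=3.0, n=100_000):
    """Toy photocell model: Poisson shot noise plus Gaussian read noise,
    amplified by the ISO gain (all parameter values are illustrative)."""
    gain = iso / base_iso
    electrons = rng.poisson(mean_photons, n) + rng.normal(0.0, read_noise, n)
    return gain * electrons

# Correct exposure at a fixed output level: doubling the ISO halves the
# number of captured photons, so the noise in the output increases.
for iso in (100, 200, 400, 800):
    signal = capture(mean_photons=1600 * 100 / iso, iso=iso)
    print(f"ISO {iso:4d}: mean = {signal.mean():7.1f}, "
          f"std = {signal.std():5.1f}, SNR = {signal.mean() / signal.std():5.1f}")
```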

2.4.2 Capture of Color Images

The photocells of an image sensor basically only count photons. They cannot discriminate between photons of different wavelengths inside the visible spectrum. As discussed in Section 2.2.2, we need, however, at least three image signals, each for a different range of wavelengths, for representing color images. Consequently, the spectrum of visible light has to be decomposed into three spectral components. There are two dominating techniques in today's cameras: three-sensor systems and single sensors with color filter arrays. A third technique, the multi-layer sensor, is also used in some cameras.

Three-Sensor Systems. As the name suggests, three-sensor systems use three image sensors, each for a different part of the spectrum. The light that falls through the lens is split by a trichroic prism assembly, which consists of two prisms with dichroic coatings (dichroic prisms), as illustrated in Figure 2.26(a). The dichroic optical coatings have the property that they reflect or transmit light depending on the light's wavelength.

Figure 2.26: Color separation: (a) Three-sensor camera with color separation by a trichroic prism assembly; (b) Sensor with color filter array (Bayer pattern).

In the example of Figure 2.26(a), the short wavelength range is reflected at the first coating and directed to the image sensor that captures the blue color component. The remaining light passes through. At the second filter coating, the long wavelength range is reflected and directed to the sensor capturing the red component. The remaining middle wavelength range, which corresponds to the green color component, is transmitted and captured by the third sensor. In contrast to image sensors with color filter arrays, three-sensor systems have the advantages that basically all photons are used by one of the image sensors and that no interpolation is required. As a consequence, they typically provide images with better resolution and lower noise. Three-sensor systems are, however, also more expensive, and they are large and heavy, in particular when large image sensors are used.

Sensors with Color Filter Arrays. Another possibility to distinguish photons of different wavelength ranges is to use a color filter array with a single image sensor. As illustrated in Figure 2.26(b), each photocell is covered by a small color filter that basically blocks photons with wavelengths outside the desired range from reaching the photocell. The color filters are typically placed between the photocell and the microlens, as shown in Figure 2.25(a). The color filter pattern shown in Figure 2.26(b) is called the Bayer pattern. It is the most common type of color filter array and consists of a repeating 2×2 grid with two green, one red, and one blue color filter. The reason for using twice as many green as red or blue color filters is that humans are more sensitive to the middle wavelength range of visible light. Several alternatives to the Bayer pattern are used by some manufacturers. These patterns either use filters of different colors or a different arrangement, or they include filters with a fourth color.

Since each photocell of a sensor can only count photons for one of the wavelength ranges, the sample arrays for the color components contain a significant number of holes. The unknown sample values have to be generated using interpolation algorithms. This processing step is commonly called demosaicing. For a Bayer sensor, actually half of the samples for the green component and three quarters of the samples for the red and blue components have to be generated. If the assumptions underlying the employed demosaicing algorithm do not hold for an image region, interpolation errors can cause visible artifacts. The most frequently observed artifacts are Moiré patterns, which typically appear as false color patterns in image regions with fine details. For reducing demosaicing artifacts, digital image sensors with color filter arrays typically incorporate an optical low-pass filter or anti-aliasing filter, which is placed directly in front of the sensor. Often, this filter consists of two layers of a birefringent material and is combined with an infrared absorption filter. The optical low-pass filter splits every ray of light into four rays, each of which falls on one photocell of a 2×2 cluster. By decreasing the high-frequency components of the irradiance pattern on the photocell array, it reduces the demosaicing artifacts, but also the sharpness of the captured image.

Image sensors with color filter arrays are smaller, lighter, and less expensive than three-sensor systems. But due to the color filters, they have a lower light efficiency. The demosaicing can cause visible interpolation artifacts; in combination with the optical low-pass filtering that is often applied, it also reduces the sharpness.
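The following sketch shows a minimal bilinear demosaicing of an RGGB Bayer mosaic, i.e. the simplest form of the interpolation step described above. The RGGB sample arrangement, the use of SciPy for the convolutions, and the bilinear kernels are assumptions made for this example; demosaicing algorithms in actual cameras are far more sophisticated (typically edge-adaptive) in order to avoid the artifacts mentioned in the text.

```python
import numpy as np
from scipy.ndimage import convolve

def demosaic_bilinear(raw):
    """Bilinear demosaicing of a Bayer (RGGB) mosaic.

    raw: 2-D array sampled through an RGGB color filter array, i.e. R at
    (even, even), G at (even, odd) and (odd, even), B at (odd, odd) positions.
    Returns an array of shape (H, W, 3) with interpolated R, G, B planes.
    """
    h, w = raw.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r_mask = ((yy % 2 == 0) & (xx % 2 == 0)).astype(float)
    b_mask = ((yy % 2 == 1) & (xx % 2 == 1)).astype(float)
    g_mask = 1.0 - r_mask - b_mask

    k_rb = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]]) / 4.0  # R and B planes
    k_g  = np.array([[0, 1, 0], [1, 4, 1], [0, 1, 0]]) / 4.0  # G plane

    def interpolate(mask, kernel):
        # Dividing by the filtered mask normalizes the weights, which also
        # handles the image borders correctly.
        return convolve(raw * mask, kernel) / convolve(mask, kernel)

    return np.dstack([interpolate(r_mask, k_rb),
                      interpolate(g_mask, k_g),
                      interpolate(b_mask, k_rb)])

# A uniform gray mosaic is reconstructed exactly (no color artifacts can occur here).
mosaic = np.full((6, 8), 0.5)
print(np.allclose(demosaic_bilinear(mosaic), 0.5))   # True
```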

Multi-Layer Image Sensors. In a multi-layer sensor, the photocells for different wavelength ranges are not arranged in a two-dimensional, but in a three-dimensional array. At each spatial location, three photodiodes are vertically stacked. The sensor employs the property that the absorption coefficient of silicon is highly wavelength dependent. As a result, each of the three stacked photodiodes at a sample location responds to a different wavelength range. The sample values of the three primary colors (red, green, blue) are generated by processing the captured signals. Since three color samples are captured for each spatial location, optical low-pass filtering and demosaicing are not required and interpolation artifacts do not occur. The spectral sensitivity curves resulting from the employed wavelength separation by absorption are less linearly related to the cone fundamentals than those of typical color filters. As a consequence, it is often reported that multi-layer sensors have a lower color accuracy than sensors with color filter arrays.

2.4.3 Image Processor

The signals that are output by the image sensor have to be further processed and eventually converted into a format suitable for image or video exchange. As a first step, which is required for any further signal processing, the analog voltage signals have to be converted into digital signals. In order to reduce the impact of this quantization on the following processing, typically a bit depth significantly larger than the bit depth of the final representation format is used. The analog-to-digital conversion is sometimes integrated into the sensor.

For converting the obtained digital sensor signals into a representation format, the following processing steps are required: demosaicing (for sensors with color filter arrays, Section 2.4.2); a conversion from the camera color space to the linear color space of the representation format, including white balancing (Section 2.2.2); gamma encoding of the linear color components (Section 2.3.3); optionally, a transform of the gamma-encoded components into a Y'CbCr format (Section 2.3.4); and a final quantization of the sample values (Section 2.3.5). Besides these required processing steps, image processors often also apply algorithms for improving the image quality, for example, denoising and sharpening algorithms or processing steps for reducing the impact of lens imperfections, such as vignetting, geometrical distortions, and chromatic aberrations, on the output images. Particularly in consumer cameras, the raw image data are typically also compressed using an image or video encoder. The outputs of the camera are then bitstreams that conform to a widely accepted coding standard, such as JPEG [51] or H.264 | MPEG-4 AVC [53], and are embedded in a container format.
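The chain of required processing steps can be summarized in a short sketch. The white-balance gains, the camera-to-RGB matrix, the BT.709-style transfer function, and the luma weights used below are placeholders chosen for illustration; a real image processor uses the calibrated, camera- and format-specific operations described in the referenced sections.

```python
import numpy as np

def f_tc(e):
    """Gamma encoding (BT.709-style transfer function, used here as a stand-in)."""
    return np.where(e < 0.018, 4.5 * e, 1.099 * np.power(e, 0.45) - 0.099)

def process(rgb_cam, B=8, K_R=0.2126, K_B=0.0722):
    """Conversion chain after demosaicing: white balancing and conversion to the
    linear RGB space of the representation format, gamma encoding, transform to
    Y'CbCr, and quantization. rgb_cam has shape (H, W, 3) with values in [0, 1]."""
    wb_gains = np.array([1.8, 1.0, 1.4])        # placeholder white-balance gains
    cam_to_rgb = np.eye(3)                      # placeholder camera color matrix
    rgb_lin = np.clip((rgb_cam * wb_gains) @ cam_to_rgb.T, 0.0, 1.0)

    rp, gp, bp = np.moveaxis(f_tc(rgb_lin), -1, 0)    # gamma-encoded R'G'B'

    ey  = K_R * rp + (1 - K_R - K_B) * gp + K_B * bp  # luma
    ecb = (bp - ey) / (2 * (1 - K_B))                 # chroma (blue difference)
    ecr = (rp - ey) / (2 * (1 - K_R))                 # chroma (red difference)

    s = 2 ** (B - 8)                                  # quantization, cf. (2.63)-(2.65)
    Y  = np.round((219 * ey  +  16) * s).astype(int)
    Cb = np.round((224 * ecb + 128) * s).astype(int)
    Cr = np.round((224 * ecr + 128) * s).astype(int)
    return Y, Cb, Cr

Y, Cb, Cr = process(np.random.default_rng(2).random((4, 4, 3)))
```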

2.5 Display of Images and Video

In most applications, we capture and encode visual information for eventually presenting it to human beings. Display devices act as the interface between machine and human. At the present time, a rather large variety of display technologies is available, and new technologies as well as improvements to existing technologies are still being developed. Independent of the actually used technology, for producing the sensation of color, each element of a displayed picture has to be composed of at least three primary colors (see Section 2.2.2). The employed technology determines the chromaticity coordinates of the primary colors and, thus, the display-internal color space. In general, samples of the representation format provided to the display device have to be transformed into the display color representation; this transformation typically includes gamma decoding and a color space conversion. Modern display devices often apply additional signal processing algorithms for improving the perceived quality of natural video. In the following, we briefly review some important display techniques. For a more detailed discussion, the reader is referred to the overview in [70].
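The display-side transformation mentioned above is essentially the inverse of the camera-side chain: dequantization, conversion from Y'CbCr back to gamma-encoded R'G'B', gamma decoding, and a conversion to the display primaries. The sketch below mirrors the camera-side example; the transfer function, the luma weights, and the identity matrix standing in for the conversion to the display primaries are again illustrative assumptions.

```python
import numpy as np

def f_tc_inv(v):
    """Gamma decoding (inverse of the BT.709-style transfer function used above)."""
    return np.where(v < 4.5 * 0.018, v / 4.5, np.power((v + 0.099) / 1.099, 1 / 0.45))

def to_display(Y, Cb, Cr, B=8, K_R=0.2126, K_B=0.0722):
    """Display-side processing: dequantize, convert Y'CbCr to gamma-encoded
    R'G'B', gamma-decode, and map to the (here: identical) display primaries."""
    s = 2 ** (B - 8)
    ey  = (Y  / s -  16) / 219
    ecb = (Cb / s - 128) / 224
    ecr = (Cr / s - 128) / 224

    rp = ey + 2 * (1 - K_R) * ecr
    bp = ey + 2 * (1 - K_B) * ecb
    gp = (ey - K_R * rp - K_B * bp) / (1 - K_R - K_B)

    rgb_lin = f_tc_inv(np.clip(np.stack([rp, gp, bp], axis=-1), 0.0, 1.0))

    rgb_to_display = np.eye(3)     # placeholder conversion to the display primaries
    return np.clip(rgb_lin @ rgb_to_display.T, 0.0, 1.0)
```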

Cathode Ray Tube (CRT) Displays. Some decades ago, all devices for displaying natural pictures were cathode ray tube (CRT) displays. The CRT is the oldest type of electronic display technology, but it has now been nearly completely replaced by more modern technologies. As illustrated in Figure 2.27(a), a CRT display basically consists of electron guns, a deflection system, and a phosphor-coated screen. Each electron gun contains a heated cathode that produces electrons by thermionic emission. By applying electric fields, the electrons are accelerated and focused to form an electron beam. When the electron beam hits the phosphor-coated screen, it causes the emission of photons. The intensity of the emitted light is controlled by varying the electric field in the electron gun. For producing a picture on the screen, the electron beam is swept line by line over the screen, typically 50 or 60 times per second. The direction of the beam is controlled by the deflection system, which consists of magnetic coils. In color CRT displays, three electron guns and three types of phosphors, each emitting photons for one of the primary colors red, green, and blue, are used.


Figure 2.27: Basic principles of display technologies: (a) Cathode ray tube (CRT) display; (b) Liquid crystal display (LCD); (c) Plasma display; (d) OLED display.

The different phosphors are arranged in clusters or stripes. A shadow mask mounted in front of the screen prevents electrons from hitting the wrong phosphor.

Liquid Crystal Displays (LCDs). Liquid crystals used in displays are liquid organic substances with a crystal molecular structure. They are arranged in a layer between two transparent electrodes in such a way that the alignment of the liquid crystals inside the layer can be controlled by the applied voltage. LCDs employ the effect that, depending on the orientation of the liquid crystals inside the layer and, thus, the applied voltage, the polarization direction of the transmitted light is modified. The basic structure of LCDs is illustrated in Figure 2.27(b). The light emitted by the display's backlight is passed through a first polarizer, which is followed by the liquid crystal layer and a second polarizer with a polarization direction perpendicular to that of the first polarizer. By adjusting the voltages applied to the liquid crystal layer, the modification of the polarization direction and, thus, the amount of light transmitted through the second polarizer is controlled. Finally, the light is passed through a layer with color filters (typically red, green, and blue filters) and a color picture perceivable by human beings is generated on the surface of the screen. A disadvantage of LCDs is that a backlight is required and a significant amount of light is absorbed. For that reason, LCDs have a rather large power consumption and do not achieve as good a black level as plasma or OLED displays.

Plasma Displays. Plasma displays are based on the phenomenon of gas discharge. As illustrated in Figure 2.27(c), a layer of cells, typically filled with a mixture of helium and xenon [70], is embedded between two electrodes. The electrode at the front side of the display has to be transparent. When a voltage is applied to a cell, the accelerated electrons may ionize the contained gas for a short duration. When the excited atoms return to their ground state, photons with a wavelength inside the ultraviolet (UV) range are emitted. A part of the UV photons excites the phosphors inside the cell, which eventually emit light in the visible range. The intensity of the emitted light can be controlled by the applied voltage. For obtaining color images, three types of phosphors, which emit light in the red, green, and blue ranges of the spectrum, are used. The corresponding cells are arranged in a suitable spatial layout.

OLED Displays. Organic light-emitting diode (OLED) displays are a relatively new technology. OLEDs consist of organic substances that emit visible light when an electric current is passed through them. A layer of organic semiconductor is situated between two electrodes, see Figure 2.27(d). At least one of the electrodes is transparent. If a voltage is applied, the electrons and holes injected from the cathode and anode, respectively, form electron-hole pairs called excitons. When an exciton recombines, the excess energy is emitted in the form of a photon; this process is called radiative recombination. The wavelength of the emitted photon depends on the band gap of the organic material. The light intensity can be controlled by adjusting the applied voltage. In OLED displays, typically three types of OLEDs, with organic substances that emit light in the red, green, and blue wavelength ranges, are used.

Projectors. In contrast to the display devices discussed so far, projectors, also commonly called beamers, do not display the final image on the light modulator itself, but on a diffusely reflecting screen. Due to the loose coupling of the light modulator and screen, very large images can be displayed. That is why projectors are particularly suitable for large audiences. There are three dominant projection techniques: LCD projectors, DLP projectors, and LCoS projectors.

In LCD projectors, the white light of a bright lamp is first split into red, green, and blue components, either by dichroic mirrors or prisms (see Section 2.4.2). Each of the resulting beams is passed through a separate transparent LCD panel, which modulates the intensity according to the sample values of the corresponding color component. Finally, the modulated beams are combined by dichroic prisms and passed through a lens, which projects the image on the screen.

Digital light processing (DLP) projectors use a chip with microscopic mirrors, one for each pixel. The mirrors can be rotated to send light from a lamp either through the lens or towards a light absorber. By quickly toggling the mirrors, the intensity of the light falling through the lens can be modulated. Color images are typically generated by placing a color wheel between the lamp and the micromirror chip, so that the color components of an image are sequentially displayed.

Liquid crystal on silicon (LCoS) projectors are similar to LCD projectors, but instead of transparent LCD panels, they use reflective light modulators (similar to DLP projectors). The light modulators basically consist of a liquid crystal layer that is fabricated directly on top of a silicon chip. The silicon is coated with a highly reflective metal, which simultaneously acts as electrode and mirror. As in LCD panels, the light modulation is achieved by changing the orientation of liquid crystals.

2.6 Chapter Summary

In this section, we gave an overview of some fundamental properties of image formation and human visual perception, and based on certain aspects of human vision, we reviewed the basic principles that are used for capturing, representing, and displaying digital video signals.

For acquiring video signals, the lens of a camera projects a scene of the three-dimensional world onto the surface of an image sensor. The focal length and aperture of the lens determine the field of view and the depth of field of the projection. Independent of the fabrication quality of the lens, the resolution of the image on the sensor is limited by diffraction; its effect increases with decreasing aperture. For real lenses, the image quality is additionally reduced by optical aberrations such as geometric distortions, spherical aberrations, or chromatic aberrations.

The human visual system has components similar to those of a camera: a lens projects an image onto the retina, where the image is sampled by light-sensitive cells. The photoreceptor responses are sent to the brain, where the visual information is interpreted. Under well-lit conditions, three types of photoreceptor cells with different spectral sensitivities are active. They basically map the infinite-dimensional space of electromagnetic spectra onto a three-dimensional space. Hence, light stimuli with different spectra can be perceived as having the same color. This property of human vision is the basis of all color reproduction techniques; it is employed in capturing, representing, and displaying image and video signals. For defining a common color system, the CIE standardized a so-called standard colorimetric observer by defining color-matching functions based on experimental data. The derived CIE 1931 XYZ color space represents the basis for specifying color in video communication applications. In display devices, colors are typically reproduced by mixing three suitably selected primary lights; it is, however, not possible to reproduce all colors perceivable by humans in this way. Color spaces that are spanned by three primaries are called linear color spaces. The human eye adapts to the illumination of a scene; this aspect has to be taken into account when processing the signals acquired with an image sensor. The acuity of human vision is determined by several factors such as the optics of the eye, the density of photoreceptor cells, and the neural processing. Human beings are more sensitive to details in luminance than to details in color.

Certain properties of human vision are also exploited for efficiently representing visual information. For describing color information, each video picture consists of three sample arrays. The primary colors are specified in the CIE 1931 XYZ color space. Since the human visual system has a non-linear response to differences in luminance, the linear color components are non-linearly encoded. This processing step, also called gamma encoding, yields color components with the property that a certain amount of quantization noise has approximately the same subjective impact on dark and light image regions. In most video coding applications, the gamma-encoded color components are transformed into a Y'CbCr format, in which color pictures are specified using a luminance-related component, called the luma component, and two color difference components, which are called chroma components. By this transformation, the color components are effectively decorrelated. Since humans are significantly more sensitive to details in luminance than to details in color difference data, the chroma components are typically downsampled. In the most common format, the Y'CbCr 4:2:0 format, the chroma components contain only a quarter of the samples of the luma component.
The luma and chroma sample values are typically represented with a bit depth of 8 or 10 bits per sample.

The image sensor in a camera samples the illuminance pattern that is projected onto its surface and converts it into a discrete representation of a picture. Each cell of an image sensor corresponds to an image point and basically counts the photons arriving during the exposure time. For capturing color images, the incident light has to be split into at least three spectral ranges. In most digital cameras, this is achieved either by using a trichroic beam splitter with a separate image sensor for each color component or by mounting a color filter array on top of a single image sensor. In display devices, color images are reproduced by mixing (at least) three primary colors for each image point according to the corresponding sample values. Important display technologies are CRT, LCD, plasma, and OLED displays. For large audiences, as in a cinema, projection technologies are typically used.

References

[1] Sarah E. J. Arnold, Samia Faruq, Vincent Savolainen, Peter W. McOwan, and Lars Chittka. FReD: The Floral Reflectance Database – A Web Portal for Analyses of Flower Colour. PLoS ONE, 5(12):e14287, December 2010. http://reflectance.co.uk.
[2] Peter G. J. Barten. Spatiotemporal model for the contrast sensitivity of the human eye and its temporal aspects. In Jan P. Allebach and Bernice E. Rogowitz, editors, Proc. SPIE, Human Vision, Visual Processing, and Digital Display IV, volume 1913. SPIE, September 1993.
[3] Peter G. J. Barten. Contrast Sensitivity of the Human Eye and Its Effects on Image Quality. SPIE Optical Engineering Press, Bellingham, Washington, 1999.
[4] Peter G. J. Barten. Formula for the contrast sensitivity of the human eye. In Yoichi Miyake and D. Rene Rasmussen, editors, Proc. SPIE, Image Quality and System Performance, volume 5294, pages 231–238. SPIE, December 2003.
[5] D. A. Baylor, B. J. Nunn, and J. L. Schnapf. Spectral sensitivity of cones of the monkey macaca fascicularis. The Journal of Physiology, 390(1):145–160, 1987.
[6] Max Born and Emil Wolf. Principles of Optics: Electromagnetic Theory of Propagation, Interference and Diffraction of Light. Cambridge University Press, Cambridge, UK, 7th (expanded) edition, 1999.


[7] D. H. Brainard and A. Stockman. Colorimetry. In M. Bass, C. DeCusatis, J. Enoch, V. Lakshminarayanan, G. Li, C. Macdonald, V. Mahajan, and E. van Stryland, editors, Vision and Vision Optics, volume III of The Optical Society of America Handbook of Optics, pages 10.1–10.56. McGraw Hill, 3rd edition, 2009.
[8] Arthur D. Broadbent. A critical review of the development of the CIE 1931 RGB color-matching functions. Color Research & Application, 29(4):267–272, August 2004.
[9] P. K. Brown and G. Wald. Visual pigments in single rods and cones of the human retina. Science, 144(3614):45–52, April 1964.
[10] W. R. T. Brown and D. L. MacAdam. Visual sensitivities to combined chromaticity and luminance differences. Journal of the Optical Society of America, 39(10):808–823, 1949.
[11] G. Buchsbaum and A. Gottschalk. Trichromacy, opponent colours coding and optimum colour information transmission in the retina. Proceedings of the Royal Society B: Biological Sciences, 220(1218):89–113, November 1983.
[12] F. W. Campbell and R. W. Gubisch. Optical quality of the human eye. The Journal of Physiology, 186(3):558–578, 1966.
[13] F. W. Campbell and J. G. Robson. Application of Fourier analysis to the visibility of gratings. The Journal of Physiology, 197(3):551–566, 1968.
[14] CIE. CIE Proceedings 1924. Cambridge University Press, Cambridge, 1926.
[15] CIE. CIE Proceedings 1931. Cambridge University Press, Cambridge, 1932.
[16] CIE. CIE Proceedings 1951. Bureau Central de la CIE, Paris, 1951.
[17] CIE. CIE Proceedings 1963. Bureau Central de la CIE, Paris, 1964.
[18] CIE. A Colour Appearance Model for Colour Management Systems: CIECAM02, Publication 159:2004. Bureau Central de la CIE, Vienna, 2004.
[19] CIE. Fundamental Chromaticity Diagram with Physiological Axes – Part 1, Publication 170-1. Bureau Central de la CIE, Vienna, 2006.
[20] Christine A. Curcio, Kimberly A. Allen, Kenneth R. Sloan, Connie L. Lerea, James B. Hurley, Ingrid B. Klock, and Ann H. Milam. Distribution and morphology of human cone photoreceptors stained with anti-blue opsin. The Journal of Comparative Neurology, 312(4):610–624, October 1991.

[21] Christine A. Curcio, Kenneth R. Sloan, Robert E. Kalina, and Anita E. Hendrickson. Human photoreceptor topography. The Journal of Comparative Neurology, 292(4):497–523, February 1990. Data available at http://www.cis.uab.edu/curcio/PRtopo.
[22] H. R. Davidson. Calculation of color differences from visual sensitivity ellipsoids. Journal of the Optical Society of America, 41(12):1052–1055, 1951.
[23] Russell L. De Valois, Israel Abramov, and Gerald H. Jacobs. Analysis of response patterns of LGN cells. Journal of the Optical Society of America, 56(7):966–977, July 1966.
[24] European Broadcasting Union. EBU standard for chromaticity tolerances for studio monitors. EBU Tech. 3213, August 1975.
[25] Mark D. Fairchild. Color Appearance Models. John Wiley & Sons, 3rd edition, 2013.
[26] Hugh S. Fairman, Michael H. Brill, and Henry Hemmendinger. How the CIE 1931 color-matching functions were derived from Wright-Guild data. Color Research & Application, 22(1):11–23, February 1997.
[27] K. S. Gibson and E. P. T. Tyndall. Visibility of radiant energy. Scientific Papers of the Bureau of Standards, 19(475):131–191, 1923.
[28] Joseph W. Goodman. Introduction to Fourier Optics. The McGraw-Hill Companies, Inc., 1996.
[29] H. Grassmann. Zur Theorie der Farbenmischung. Annalen der Physik, 165(5):69–84, 1853.
[30] H. Gross, F. Blechinger, and B. Achtner. Survey of Optical Instruments, volume 4 of Handbook of Optical Instruments. Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim, 2008.
[31] J. Guild. A trichromatic colorimeter suitable for standardisation work. Transactions of the Optical Society, 27(2):106–129, December 1925.
[32] J. Guild. The colorimetric properties of the spectrum. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 230(681-693):149–187, January 1932.
[33] Eugene Hecht. Optics. Addison-Wesley, 4th edition, 2001.
[34] H. Helmholtz. Handbuch der Physiologischen Optik, volume IX of Allgemeine Encyklopädie der Physik. Leopold Voss, Leipzig, 1867.
[35] Ewald Hering. Grundzüge der Lehre vom Lichtsinn. Verlag von Julius Springer, Berlin, 1920.

[36] H. Hofer. Organization of the human trichromatic cone mosaic. Journal of Neuroscience, 25(42):9669–9679, October 2005.
[37] Leo M. Hurvich and Dorothea Jameson. Some quantitative aspects of an opponent-colors theory. II. brightness, saturation, and hue in normal and dichromatic vision. Journal of the Optical Society of America, 45(8):602–616, August 1955.
[38] IEC. Multimedia systems and equipment – Colour measurement and management – Part 2-1: Colour management – Default RGB colour space – sRGB. International Standard 61966-2-1, October 1999.
[39] IEC. Extended-gamut YCC colour space for video applications – xvYCC. IEC 61966-2-4, January 2006.
[40] Institute of Ophthalmology, University College London. Colour & Vision Research laboratory and database. http://www.cvrl.org.
[41] ISO and CIE. Colorimetry – Part 1: CIE standard colorimetric observers. ISO/IEC 11664-1 | CIE S 014-1, 2007.
[42] ISO and CIE. Colorimetry – Part 2: CIE standard illuminants. ISO/IEC 11664-2 | CIE S 014-2, 2007.
[43] ISO and CIE. Colorimetry – Part 4: CIE 1976 L*a*b* colour space. ISO/IEC 11664-4 | CIE S 014-4, 2008.
[44] ISO and CIE. Colorimetry – Part 5: CIE 1976 L*u*v* colour space and u', v' uniform chromaticity scale diagram. ISO/IEC 11664-5 | CIE S 014-5, 2009.
[45] ISO/IEC. Coding of moving pictures and associated audio for digital storage media at up to 1.5 Mbit/s – Part 2: Video. ISO/IEC 11172-2, 1993.
[46] ITU-R. Studio encoding parameters of digital television for standard 4:3 and wide-screen 16:9 aspect ratios. Recommendation ITU-R BT.601-7, March 2011.
[47] ITU-R. Parameter values for ultra-high definition television systems for production and international programme exchange. Recommendation ITU-R BT.2020-1, June 2014.
[48] ITU-R. The present state of ultra-high definition television. Report ITU-R BT.2246-3, March 2014.
[49] ITU-T. Video codec for audiovisual services at p × 64 kbits. ITU-T Recommendation H.261, edition 3, March 1993.
[50] ITU-T. Video coding for low bit rate communication. ITU-T Recommendation H.263, edition 3, January 2005.

[51] ITU-T and ISO/IEC. Digital compression and coding of continuous-tone still images – requirements and guidelines. ITU-T Recommendation T.81 | ISO/IEC 10918-1, September 1992.
[52] ITU-T and ISO/IEC. Generic coding of moving pictures and associated audio information: Video. ITU-T Recommendation H.262 | ISO/IEC 13818-2, edition 2, February 2000.
[53] ITU-T and ISO/IEC. Advanced video coding for generic audiovisual services. ITU-T Recommendation H.264 | ISO/IEC 14496-10, edition 8, April 2013.
[54] ITU-T and ISO/IEC. High efficiency video coding. ITU-T Recommendation H.265 | ISO/IEC 23008-2, edition 1, April 2013.
[55] Dorothea Jameson and Leo M. Hurvich. Some quantitative aspects of an opponent-colors theory. I. chromatic responses and spectral saturation. Journal of the Optical Society of America, 45(7):546–552, July 1955.
[56] D. B. Judd. Report of U.S. Secretariat Committee on Colorimetry and Artificial Daylight. In Proceedings of the Twelfth Session of the CIE, volume 1, part 7, Stockholm, 1951. Bureau Central de la CIE, Paris.
[57] D. H. Kelly. Frequency doubling in visual responses. Journal of the Optical Society of America, 56(11):1628–1632, 1966.
[58] D. H. Kelly. Spatiotemporal variation of chromatic and achromatic contrast thresholds. Journal of the Optical Society of America, 73(6):742–749, 1983.
[59] G. Kirchhoff. Zur Theorie der Lichtstrahlen. Annalen der Physik, 254(4):663–695, 1883.
[60] Jan J. Koenderink. Color for the Sciences. MIT Press, Cambridge, MA, 2010.
[61] Junzhong Liang and David R. Williams. Aberrations and retinal image quality of the normal human eye. Journal of the Optical Society of America A, 14(11):2873–2883, November 1997.
[62] Ming Ronnier Luo and Changjun Li. CIECAM02 and its recent developments. In Christine Fernandez-Maloigne, editor, Advanced Color Image Processing and Analysis, pages 19–58. Springer New York, May 2012.
[63] David L. MacAdam. Visual sensitivities to color differences in daylight. Journal of the Optical Society of America, 32(5):247–273, 1942.
[64] Susana Marcos. Image quality of the human eye. International Ophthalmology Clinics, 43(2):43–62, 2003.

[65] James Clerk Maxwell. Experiments on colour, as perceived by the eye, with remarks on colour-blindness. Transactions of the Royal Society of Edinburgh, 21(02):275–298, January 1857.
[66] Kathy T. Mullen. The contrast sensitivity of human colour vision to red-green and blue-yellow chromatic gratings. The Journal of Physiology, 359:381–400, February 1985.
[67] G. A. Østerberg. Topography of the layer of rods and cones in the human retina. Acta Ophthalmologica, 13(Supplement 6):1–97, 1935.
[68] Stephen E. Palmer. Vision Science: Photons to Phenomenology. A Bradford Book. MIT Press, Cambridge, MA, 1999.
[69] Charles Poynton. Digital Video and HDTV Algorithms and Interfaces. The Morgan Kaufmann Series in Computer Graphics and Geometric Modeling. Morgan Kaufmann Publishers, San Francisco, CA, 2003.
[70] Erik Reinhard, Erum Arif Khan, Ahmet Oğuz Akyüz, and Garrett M. Johnson. Color Imaging: Fundamentals and Applications. A K Peters, Ltd., Wellesley, MA, 2008.
[71] J. G. Robson. Spatial and temporal contrast-sensitivity functions of the visual system. Journal of the Optical Society of America, 56(8):1141–1142, August 1966.
[72] J. L. Schnapf, T. W. Kraft, and D. A. Baylor. Spectral sensitivity of human cone photoreceptors. Nature, 325(6103):439–441, January 1987.
[73] Nobutoshi Sekiguchi, David R. Williams, and David H. Brainard. Aberration-free measurements of the visibility of isoluminant gratings. Journal of the Optical Society of America A, 10(10):2105–2117, 1993.
[74] L. T. Sharpe, A. Stockman, W. Jagla, and H. Jägle. A luminous efficiency function, V*(λ), for daylight adaptation. Journal of Vision, 5(11):948–968, December 2005.
[75] Lindsay T. Sharpe, Andrew Stockman, Wolfgang Jagla, and Herbert Jägle. A luminous efficiency function, V*D65(λ), for daylight adaptation: A correction. Color Research & Application, 36(1):42–46, February 2011.
[76] T. Smith and J. Guild. The C.I.E. colorimetric standards and their use. Transactions of the Optical Society, 33(3):73–134, January 1932.
[77] SMPTE. Composite analog video signal — NTSC for studio applications. SMPTE Standard 170M-2004, November 2004.
[78] A. Sommerfeld. Mathematische Theorie der Diffraction. Mathematische Annalen, 47(2–3):317–374, June 1896.

[79] W. S. Stiles and J. M. Burch. N.P.L. colour-matching investigation: Final report (1958). Optica Acta: International Journal of Optics, 6(1):1–26, January 1959.
[80] A. Stockman and D. H. Brainard. Color vision mechanisms. In M. Bass, C. DeCusatis, J. Enoch, V. Lakshminarayanan, G. Li, C. Macdonald, V. Mahajan, and E. van Stryland, editors, Vision and Vision Optics, volume III of The Optical Society of America Handbook of Optics, pages 11.1–11.86. McGraw Hill, 3rd edition, 2009.
[81] Andrew Stockman and Lindsay T. Sharpe. The spectral sensitivities of the middle- and long-wavelength-sensitive cones derived from measurements in observers of known genotype. Vision Research, 40(13):1711–1737, June 2000.
[82] Andrew Stockman, Lindsay T. Sharpe, and Clemens Fach. The spectral sensitivity of the human short-wavelength sensitive cones derived from thresholds and color matches. Vision Research, 39(17):2901–2927, August 1999.
[83] G. Svaetichin. Spectral response curves from single cones. Acta Physiologica Scandinavica, 39(Supplement 134):17–46, 1956.
[84] Gunnar Svaetichin and Edward F. MacNichol. Retinal mechanisms for chromatic and achromatic vision. Annals of the New York Academy of Sciences, 74(2):385–404, November 1958.
[85] United States National Television System Committee. Recommendation for transmission standards for color television, December 1953.
[86] A. van Meeteren and J. J. Vos. Resolution and contrast sensitivity at low luminances. Vision Research, 12(5):825–833, May 1972.
[87] Floris L. van Nes and Maarten A. Bouman. Spatial modulation transfer in the human eye. Journal of the Optical Society of America, 57(3):401–406, March 1967.
[88] Johannes von Kries. Theoretische Studien über die Umstimmung des Sehorgans. In Festschrift der Albrecht-Ludwigs-Universität in Freiburg, pages 143–158. C. A. Wagner's Universitäts-Buchdruckerei, Freiburg, Germany, 1920.
[89] J. J. Vos. Colorimetric and photometric properties of a 2° fundamental observer. Color Research & Application, 3(3):125–128, July 1978.
[90] Brian A. Wandell. Foundations of Vision. Sinauer Associates, Inc., Sunderland, Massachusetts, 1995.

[91] A. Watanabe, T. Mori, S. Nagata, and K. Hiwatashi. Spatial sine-wave responses of the human visual system. Vision Research, 8(9):1245–1263, September 1968.
[92] David R. Williams. Topography of the foveal cone mosaic in the living human eye. Vision Research, 28(3):433–454, 1988.
[93] W. D. Wright. A trichromatic colorimeter with spectral primaries. Transactions of the Optical Society, 29(5):225–242, May 1928.
[94] W. D. Wright. A re-determination of the trichromatic coefficients of the spectral colours. Transactions of the Optical Society, 30(4):141–164, March 1929.
[95] G. Wyszecki and W. S. Stiles. Color Science: Concepts and Methods, Quantitative Data and Formulae. John Wiley & Sons, Inc., New York, 2nd edition, 1982.
[96] T. Young. The Bakerian Lecture: On the theory of light and colours. Philosophical Transactions of the Royal Society of London, 92(0):12–48, January 1802.