DEGREE PROJECT IN THE FIELD OF TECHNOLOGY ENGINEERING PHYSICS
AND THE MAIN FIELD OF STUDY ELECTRICAL ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2017

Calibration using a general homogeneous depth camera model

DANIEL SJÖHOLM

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION

Master’s Thesis at CSC/CVAP
Supervisor: Magnus Burenius
Supervisor at KTH: Patric Jensfelt
Examiner: Joakim Gustafson

Abstract

Being able to accurately measure distances in depth images is important for reconstructing objects faithfully. But the measurement of depth is a noisy process, and depth sensors benefit from additional correction even after factory calibration. We regard the pair of depth sensor and image sensor as one single unit returning complete 3D information. The 3D information is combined by relying on the more accurate image sensor for everything except the depth measurement. We present a new linear method of correcting depth distortion, using an empirical model built around the constraint of only modifying depth data while keeping planes planar. The depth distortion model is implemented and tested on the Intel RealSense SR300 camera. The results show that the model is viable and generally decreases depth measurement errors after calibration, with an average improvement in the 50 % range on the tested data sets.

Referat

Calibration using a general homogeneous depth camera model

Being able to measure distances accurately in depth images is important for making good reconstructions of objects. But this measurement process is noisy, and today's depth sensors benefit from additional correction after factory calibration. We regard the pair of a depth sensor and an image sensor as a single unit that returns complete 3D information. The 3D information is built up from the two sensors by relying on the more precise image sensor for everything except the depth measurement. We present a new linear method for correcting depth distortion using an empirical model, built around modifying only the depth data while keeping planar surfaces planar. The depth distortion model was implemented and tested on the Intel RealSense SR300 camera. The results show that the model works and generally reduces the depth measurement error after calibration, with an average improvement of around 50 % for the tested data sets.

Acknowledgments

I would like to thank Magnus Burenius for offering the thesis project and supervising me. I am especially grateful that he shared his devised linear depth distortion model, allowing me to write my thesis on it. I would in addition like to thank everyone who has taken part in discussions regarding the camera internals and possible physical causes of the distortion.

Contents

1 Introduction
  1.1 Objectives
  1.2 Outline

2 Background
  2.1 Projective geometry
  2.2 Geometric transformations
    2.2.1 Notation
    2.2.2 Rotations
    2.2.3 Euclidean transformations
    2.2.4 Similarity transformation
    2.2.5 Affine transformation
    2.2.6 Projective transformation
    2.2.7 Extension to 3D
  2.3 Camera model
    2.3.1 Extension to depth cameras
    2.3.2 Lens correction model
    2.3.3 Calibration outline
  2.4 Depth sensor
    2.4.1 Structured light
    2.4.2 Depth versus lens distortion
  2.5 Accuracy metrics
    2.5.1 Reprojection error
    2.5.2 Depth accuracy
    2.5.3 Depth and image combined

3 Related work
  3.1 Camera calibration
  3.2 Depth cameras

4 Theory
  4.1 Homogeneous depth camera model
  4.2 Projective depth distortion
  4.3 General homogeneous depth camera model
  4.4 Calibrating the depth distortion parameters analytically
  4.5 Iterative reweighted least squares
  4.6 Calibrating the depth-distortion parameters non-linearly
  4.7 Parameter interpretation
  4.8 Parameter visualization

5 Implementation
  5.1 Physical hardware
  5.2 Calibration pattern
  5.3 Data collection
  5.4 Calibration algorithm
  5.5 Evaluation
    5.5.1 Dot error
    5.5.2 Plane error

6 Results
  6.1 Data set list
  6.2 Complete model with lens distortion
    6.2.1 Parameter stability
    6.2.2 Difference images
    6.2.3 Mean image error
  6.3 2 parameter model (x, w) with lens distortion
  6.4 Model comparison
  6.5 Multi camera comparison
    6.5.1 Camera 2
    6.5.2 Camera 3
    6.5.3 Camera 4

7 Discussion and conclusions
  7.1 Discussion
    7.1.1 Lens distortion
    7.1.2 Parameter stability
    7.1.3 Error metrics
    7.1.4 Measurement error
  7.2 Future work
  7.3 Conclusions

Appendices

A Numerical results camera 1

B Depth distortion properties

C Ethical considerations

Bibliography

Chapter 1

Introduction

The camera has been a huge part of society ever since its invention. Its progress into the digital world, with digital cameras and small embedded optics in today's smartphones, shows that it is not going to disappear anytime soon. As is well known, a camera captures a two dimensional image of our three dimensional world. From this image it is in some situations possible to reconstruct parts of the captured 3D world. However, this reconstruction is generally not well defined. Scale and even perspective can be completely off from the true situation. A way to enable a reconstruction which is correct both in perspective and in scale is to calibrate the camera. Calibration establishes a connection between the metric properties of the camera, such as its position in space and the distance between the lens and the optical center (focal length), and the images captured by it.

More than one 2D image is required to infer general information about a 3D scene. This can be solved by capturing multiple images of the scene from different angles or by extending the information present in the image with data from another sensor. Using a stereo setup of cameras, one can begin to reconstruct depth by matching features between the two images. A well textured scene is however required in order to match features between the images and draw conclusions about the depth. The distance to the object can also not be too great in comparison to the inter-camera distance, or the views become too similar.

A quite recent development in the direction of distance measuring by cameras is to bundle a depth sensor with a camera into a so called RGB-D camera (Red, Green, Blue and Depth), which provides even more information. The depth sensing is implemented as a separate sensor, which internally might utilize stereo vision or another technique. This has received new attention especially after the release of the Microsoft Kinect camera for the Xbox gaming system, whose cheap hardware with acceptable performance attracted new research. The bundled depth sensing is of course welcome in all kinds of applications, as mimicking human vision requires acquiring images as well as information about the distance to the imaged objects. Fields taking explicit advantage of the depth information include creating 3D models of real life objects [21], which in turn can be useful for tracking, taking measurements on the object or re-creating it by 3D printing. The depth information can assist in visual odometry [25] and its extension to Simultaneous Localization and Mapping (SLAM) [26, 34]. It is also of use whenever image segmentation based on depth is required, such as in object recognition [16] and related applications [16, 22].

Mapping using a camera coupled with a depth sensor is a procedure which instantaneously captures a dense representation of the view. Two alternatives to this are to scan a static scene with a laser scanner or to measure objects through tactile measurements. These methods generally produce depth measurements of higher accuracy than the RGB-D system in focus in this thesis, but the measurements take a long time, which makes them unsuitable for dynamic scenes, leads to long wait times, and they generally cost a lot more than an RGB-D camera setup. These methods are also generally adapted either for long-range measurements or for contact measurement at very close range. RGB-D cameras fit in the middle, allowing relatively precise measurements in the single meter distance range.

The act of calibrating a camera is essentially to model it mathematically. This is done by gathering data to determine the parameters of the mathematical model, so that more information about the real world can be extracted from the images captured by the camera.

1.1 Objectives

This thesis looks at improving the spatial accuracy of the depth data returned by an active stereo vision system, by regarding the camera system as a complete 3D imaging device rather than an imaging device coupled with a depth sensor. The end goal is to enable more accurate measurements by modeling distortions in the depth data. We achieve this through a joint camera model of image and depth. The model contains a linear correction for the depth data, utilizing the image data in order to improve the depth data. This previously unpublished linear correction model of depth is presented and tested in practice. Achieving the above requires calibrating the camera, which is simply another name for determining the parameters in the model.

1.2 Outline

The report is structured as follows. Chapter 2 summarizes the necessary theory and gives a brief description of active stereo cameras. Chapter 3 reviews previous work on both ordinary camera calibration and depth cameras. The thesis contributions start in Chapter 4, where the new theory is presented. In Chapter 5 the experimental setup is explained together with the methods used for evaluating results, and the results are presented in Chapter 6. Finally, conclusions are drawn and potential future work is suggested in Chapter 7.

Chapter 2

Background

This section introduces fundamental concepts and theory required to understand this thesis. Most of this section is inspired by [9]. It first introduces projective geometry, which is helpful for understanding the upcoming section on the camera model, continues with the linear transformations which relate the physical world to the camera, and ends with the various properties utilized for calibrating the camera.

2.1 Projective geometry

Scenes in reality often have parts in them that are effectively at infinity, such as the horizon or the point where parallel lines appear to converge, as in Figure 2.1. This is a consequence of the perspective nature of vision. In order to describe this in a mathematical framework, one can introduce homogeneous coordinates by extending our 2D and 3D euclidean geometry with an additional dimension. The extra dimension allows us to represent even infinite points with finite coordinates. In homogeneous coordinates, we extend the euclidean 2D plane (x, y) by adding another dimension, (x, y, w), where w is any scalar.

Figure 2.1: Example of a point at infinity where the parallel railroad tracks converge. (Photo: Martin Winkler, Fotoworkshop4You on pixabay.com, CC0 license.)


Homogeneous coordinates are only defined up to a scale, and as such the value of the scalar (assumed $\neq 0$) is of less importance. That is, the point (1, 2) in 2D euclidean space turns into (1, 2, 1) in homogeneous coordinates, but is also equal to (3, 6, 3) = 3(1, 2, 1) due to the scale invariance.

The shorthand notation for euclidean N-dimensional space is $\mathbb{E}^N = \mathbb{R}^N$. The corresponding projective space is $\mathbb{P}^N = \mathbb{R}^{N+1} \setminus (0, \ldots, 0)$, which has N + 1 dimensions. For the two dimensional case the matching spaces are $\mathbb{E}^2$ and $\mathbb{P}^2$, where $\dim \mathbb{E}^2 = 2$ and $\dim \mathbb{P}^2 = 3$. To convert back from homogeneous coordinates into euclidean, one divides all elements in the vector by the last value, and subsequently discards the last dimension,
\[
\begin{pmatrix} x \\ y \\ w \end{pmatrix} \mapsto \begin{pmatrix} x/w \\ y/w \end{pmatrix}
\tag{2.1}
\]
which is a transformation from $\mathbb{P}^2 \mapsto \mathbb{E}^2$. Homogeneous coordinates allow representing points at infinity by setting the last dimension equal to zero, which captures infinite distance in a concise representation without having to actually divide by 0 in euclidean space. The conversion keeps its structure when working in three dimensional spaces,
\[
\begin{pmatrix} x \\ y \\ z \\ w \end{pmatrix} \mapsto \begin{pmatrix} x/w \\ y/w \\ z/w \end{pmatrix}
\tag{2.2}
\]
which is a transformation from $\mathbb{P}^3 \mapsto \mathbb{E}^3$.
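As an illustration, a minimal numpy sketch of the conversion in Eqs. (2.1)-(2.2) might look as follows (the function names are our own and not part of the thesis):

import numpy as np

def to_homogeneous(points):
    """Append a w = 1 coordinate to Euclidean points (one point per row)."""
    points = np.atleast_2d(points)
    return np.hstack([points, np.ones((points.shape[0], 1))])

def to_euclidean(points_h):
    """Divide by the last coordinate and drop it, as in Eqs. (2.1)/(2.2)."""
    points_h = np.atleast_2d(points_h)
    return points_h[:, :-1] / points_h[:, -1:]

# (1, 2, 1) and (3, 6, 3) represent the same Euclidean point (1, 2)
print(to_euclidean(np.array([[1.0, 2.0, 1.0], [3.0, 6.0, 3.0]])))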

2.2 Geometric transformations

Geometric transformations used in this thesis will be linear, invertible 2- and 3-dimensional transformations. They can be ordered by their complexity in terms of degrees of freedom (dof) and the properties that remain invariant after transformation. The transformations are outlined here more thoroughly in two dimensions for easier mental visualization, but preserve key characteristics when utilized in three dimensions. Extending into 3D, the degrees of freedom increase and with them the number of invariant properties decrease.

The non-invariant properties of a transformation can be used to undo a transformation by identifying it, such as the perspective transformation a scene undergoes when observed by our eyes or through a camera. Planes in the real world that aren't parallel to the image plane can also be rectified to look as if they were, since they are related by a homography. See for example Figure 2.2.

All linear transformations in this thesis will be performed as matrix multiplications from the left with points stored as column vectors.


Figure 2.2: Example of a perspective distortion and planar rectification. (a) Original image; (b) rectified to the plane of one of the walls.

For a given transformation $T : \mathbb{P}^2 \to \mathbb{P}^2$, source point $P = (X, Y, W)^T \in \mathbb{P}^2$ and transformed point $P' \in \mathbb{P}^2$, this becomes
\[
P' = T P = T \begin{pmatrix} X \\ Y \\ W \end{pmatrix}
\tag{2.3}
\]

2.2.1 Notation

The notation of this thesis is as follows. Scalars are written as lowercase letters, like 1 or $d_x$. Matrices are written as uppercase letters, like $R$. Vectors are lowercase and bold, like $\mathbf{t}$. An exception to this is points, which may be written in uppercase as well in order to be more distinct. Points in either the real world or the image are represented like $X$ or $x$. Points and vectors are written in column form unless otherwise stated. Placeholders in equations and their explanations are marked by the sign $\bullet$.

2.2.2 Rotations

Rotations can introduce ambiguity as to how the coordinate system is oriented after a performed rotation. Rotations from the special orthogonal group counteract the ambiguity by imposing constraints on the rotation matrix to preserve orientation and scale. The special orthogonal group enforces that a rotation matrix $R$ must fulfill $\det R = 1$ and $R^T R = I$. All rotation matrices in this thesis belong to this group. A rotation matrix in the 2D plane which rotates points counter clockwise by the angle $\theta$ is written as
\[
R = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}
\tag{2.4}
\]

for points in Cartesian coordinates, and

\[
T = \begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix} = \begin{pmatrix} R & \mathbf{0} \\ \mathbf{0}^T & 1 \end{pmatrix}
\tag{2.5}
\]
for points in homogeneous coordinates. As can be seen, this transformation provides 1 degree of freedom, which is represented by the variable $\theta$. The rest of the transformations in this section will only be described for points in homogeneous coordinates.

2.2.3 Euclidean transformations

Euclidean transformations model rigid motion as translations and rotations. They have the fewest degrees of freedom of the transformation classes, offering only 3 degrees of freedom: 2 from translation and 1 from rotation.

\[
T = \begin{pmatrix} \cos\theta & -\sin\theta & t_x \\ \sin\theta & \cos\theta & t_y \\ 0 & 0 & 1 \end{pmatrix} = \begin{pmatrix} R & \mathbf{t} \\ \mathbf{0}^T & 1 \end{pmatrix}
\tag{2.6}
\]

2.2.4 Similarity transformation

Similarity transformations extend euclidean transformations by allowing a uniform scale factor. They model translations, rotations and scaling, offering 4 degrees of freedom, one more than the euclidean transformation thanks to the scale factor.

\[
T = \begin{pmatrix} s\cos\theta & -s\sin\theta & t_x \\ s\sin\theta & s\cos\theta & t_y \\ 0 & 0 & 1 \end{pmatrix} = \begin{pmatrix} sR & \mathbf{t} \\ \mathbf{0}^T & 1 \end{pmatrix}
\tag{2.7}
\]

2.2.5 Affine transformation

Affine transformations no longer require a pure rotation in the upper-left block, but instead allow arbitrary entries. They model translations, rotations, skew and per-axis scaling, which adds up to 6 degrees of freedom.

\[
T = \begin{pmatrix} a_{11} & a_{12} & t_x \\ a_{21} & a_{22} & t_y \\ 0 & 0 & 1 \end{pmatrix} = \begin{pmatrix} A & \mathbf{t} \\ \mathbf{0}^T & 1 \end{pmatrix}
\tag{2.8}
\]


2.2.6 Projective transformation

In 2D projective space, the most general transformation is a so called projective transform or homography [9],
\[
H = \begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{pmatrix}
\tag{2.9}
\]
which has 8 degrees of freedom. The matrix offers 9 entries, but loses one degree of freedom due to the scale invariance of homogeneous coordinates. Among the few properties preserved by a projective transformation is collinearity of points; parallelism of lines is not preserved.

2.2.7 Extension to 3D

When extending to three dimensions the transformation matrices retain their general shape, but grow from 3 × 3 to 4 × 4 matrices. The block-matrix contents look the same, but the vectors and matrices grow to accommodate the added dimension. For example, an affine transformation in 3D looks like
\[
T = \begin{pmatrix} a_{11} & a_{12} & a_{13} & t_x \\ a_{21} & a_{22} & a_{23} & t_y \\ a_{31} & a_{32} & a_{33} & t_z \\ 0 & 0 & 0 & 1 \end{pmatrix} = \begin{pmatrix} A & \mathbf{t} \\ \mathbf{0}^T & 1 \end{pmatrix}
\tag{2.10}
\]
and as such the transformations retain their block-matrix representation.
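To make the block structure concrete, here is a small numpy sketch (our own illustration, not code from the thesis) that builds 2D homogeneous transformation matrices of increasing generality and applies one to a homogeneous point:

import numpy as np

def euclidean_2d(theta, tx, ty):
    """Rigid motion, Eq. (2.6): rotation by theta plus translation (tx, ty)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0,  0,  1]])

def similarity_2d(scale, theta, tx, ty):
    """Similarity, Eq. (2.7): adds a uniform scale factor."""
    T = euclidean_2d(theta, tx, ty)
    T[:2, :2] *= scale
    return T

def affine_2d(A, tx, ty):
    """Affine, Eq. (2.8): arbitrary 2x2 block A plus translation."""
    T = np.eye(3)
    T[:2, :2] = A
    T[:2, 2] = [tx, ty]
    return T

# Apply to a homogeneous point (column vector), as in Eq. (2.3)
p = np.array([1.0, 2.0, 1.0])
print(similarity_2d(2.0, np.pi / 2, 1.0, 0.0) @ p)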

2.3 Camera model

A camera creates a mapping, or image, of our 3D world onto a 2D camera plane. Under the ideal assumption that there is no lens and thus no distortions affecting the light entering the sensor, a camera can be modeled as a pinhole camera, where all light enters through one single point, the optical center. The model allows us to map any ray of light from the real world into an image point on the plane, and vice versa, by drawing a straight line from the 3D point and finding its intersection with the image plane of the camera, see Figure 2.3. In a physical camera the actual image plane is located at a distance equal to the focal length $f$ behind the optical center, but we can model it as being in front for ease of calculation without issues.

Let us create a 3D coordinate system with the origin in the optical center and call it the Camera Coordinate Frame, with the $Z$ axis facing towards the image plane. A point in the 3D world with coordinates $(X, Y, Z)^T$ is then mapped into a point $(fX/Z, fY/Z, f)^T$ on the image plane, which can be seen from similar triangles [9]. Note however that this relation is non-linear due to the division by $Z$, something that we can adjust by using homogeneous coordinates.


Figure 2.3: Pinhole camera model with image plane in front of camera center C.

By disregarding the last dimension of the image coordinates (which is always equal to $f$, by construction), we have that
\[
\begin{pmatrix} X \\ Y \\ Z \end{pmatrix} \mapsto \frac{f}{Z} \begin{pmatrix} X \\ Y \end{pmatrix}
\tag{2.11}
\]
and we now have a pure 2D point. We can observe that any points on a line through the optical center will map to the same point in the image plane, and all depth information will be lost. Introducing homogeneous coordinates from Section 2.1, a world point will be projected upon our image coordinate $(x, y)^T$ according to
\[
\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \propto \lambda \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = \begin{pmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}
\tag{2.12}
\]
where $\lambda = Z$ is the scaling factor, multiplied to the other side of the equation from Eq. (2.11). It can be disposed of since the homogeneous representation is only defined up to a scale factor, and the representation is now fully linear.

Actual digital image coordinates usually start from one of the corners, however, rather than in the middle as specified here. The camera may also have various geometrical properties such as non-square pixels, or skewed or non-orthogonal axes, which is all captured by the so called camera calibration matrix,
\[
K = \begin{pmatrix} f_x & s & x_0 \\ 0 & f_y & y_0 \\ 0 & 0 & 1 \end{pmatrix}
\tag{2.13}
\]
where $x_0, y_0$ represent the shift of origin, $f_x, f_y$ the focal lengths in the two directions, which are ideally identical, and $s$ the skew. As can be seen, this matrix has 5 degrees of freedom and represents all the intrinsic parameters of the distortion free pinhole camera.

We return to representing the world and image coordinates in homogeneous coordinates. Let $X$ be a point in the world coordinate system, and $x$ the corresponding point in the image. We can now write the complete correspondence between $X$ and $x$ as
\[
x = \underset{3\times 3}{K}\,[\, \underset{3\times 3}{I} \mid \underset{3\times 1}{0} \,]\, X
\tag{2.14}
\]
which maps 3D world points expressed in the camera coordinate system into points on an image. For the extrinsic parameters, such as position and rotation of the camera relative to a world coordinate system, we introduce the translation vector $\mathbf{t}$ of the camera center relative to the world coordinate system, and its $3 \times 3$ rotation matrix $R$, which transforms Eq. (2.14) into

\[
x = \underset{3\times 3}{K}\,[\, \underset{3\times 3}{R} \mid \underset{3\times 1}{\mathbf{t}} \,]\, X = \underset{3\times 4}{P} X
\tag{2.15}
\]
where we call the product of the intrinsic and extrinsic parameters the projection matrix $P$. In terms of degrees of freedom, $R$ has 3 and the translation vector $\mathbf{t}$ contributes 3, totaling 6 degrees of freedom for position and rotation of the camera and bringing the total degrees of freedom for the entire system up to 11 when multiplied together with $K$. The extrinsic parameters are required to describe the camera position relative to the real world, which could be defined as a certain position on a calibration object, or relative to another camera in multi-camera systems, see Figure 2.4. Keeping track of points in the real world coordinate system and capturing from several different orientations then allows solving the equation with enough constraints on the solution.
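As a concrete illustration of Eq. (2.15), the following numpy sketch (our own, with invented example values) projects world points into pixel coordinates with a calibration matrix K and extrinsics [R | t]:

import numpy as np

def project(K, R, t, X_world):
    """Project Nx3 world points to Nx2 pixel coordinates via x = K [R | t] X, Eq. (2.15)."""
    X_h = np.hstack([X_world, np.ones((X_world.shape[0], 1))])   # to homogeneous
    P = K @ np.hstack([R, t.reshape(3, 1)])                      # 3x4 projection matrix
    x_h = (P @ X_h.T).T                                          # homogeneous image points
    return x_h[:, :2] / x_h[:, 2:]                               # divide by the scale factor

# Hypothetical intrinsics: 600 px focal length, 640x480 image; identity pose
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
print(project(K, R, t, np.array([[0.1, -0.05, 1.0]])))   # a point 1 m in front of the camera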

Figure 2.4: Camera coordinate systems and the world coordinate system.


The projection matrix $P$ is a $3 \times 4$ matrix, but can be regarded as a composition of homographies and a $\mathbb{P}^3 \mapsto \mathbb{P}^2$ projection,
\[
P = [\,3 \times 3 \text{ homography}\,] \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} [\,4 \times 4 \text{ homography}\,]
\tag{2.16}
\]
giving another view of what the camera projection does mathematically.

2.3.1 Extension to depth cameras

A way of extending this 2D imaging to account for 3D measurements is simply to consider the homogeneous coordinate in the image view to be the depth. That is,
\[
\begin{pmatrix} x \\ y \\ z \end{pmatrix} = K\,[\, R \mid \mathbf{t} \,] \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}
\tag{2.17}
\]

2.3.2 Lens correction model

The lens in front of the camera adds distortions to the image, which are not modeled by the pinhole camera model described up until now. There are two directions to keep in mind for distortion correction, out of which only one mirrors the true physical behavior of light. The physically accurate representation assumes that light enters through the pinhole model, which gives rise to a perfect image without distortions. In order to represent the actual image, this perfect image is distorted (which physically represents the influence of the lens and everything else not accounted for in the pinhole model) and the true distorted points in the image arise. The non-physical representation is the reverse mapping of the above, which takes the distorted points and undistorts them into the non-existing perfect points.

We thus seek to model the distortions caused by the lens. The lens distortion model is usually split up into a part correcting the radial displacements of image points due to lens distortion and a tangential part, correcting non-perfect alignment of the lens [11]. A common way to model lens distortion is using polynomials [20]. Usually only one direction is specified and then approximately inverted (since there is no analytical solution to the inversion problem) to yield the other [10, 20]. Modeling the distortions according to [4] as a forward model, taking undistorted points $x, y$ into their distorted counterparts $x', y'$, we have
\[
\begin{aligned}
x' &= x(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + 2 p_1 x y + p_2 (r^2 + 2 x^2) \\
y' &= y(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + p_1 (r^2 + 2 y^2) + 2 p_2 x y
\end{aligned}
\tag{2.18}
\]
where $r^2 = x^2 + y^2$, $k_i$ are the radial distortion coefficients and $p_i$ the tangential distortion coefficients. This polynomial method of distortion correction is also known as the Brown-Conrady model.
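A direct transcription of the forward model in Eq. (2.18) into numpy might look as follows (a sketch of the standard Brown-Conrady polynomial operating on normalized image coordinates; the coefficient values below are invented for illustration):

import numpy as np

def distort(x, y, k1, k2, k3, p1, p2):
    """Forward Brown-Conrady model, Eq. (2.18): ideal pinhole points -> distorted points."""
    r2 = x**2 + y**2
    radial = 1.0 + k1 * r2 + k2 * r2**2 + k3 * r2**3
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x**2)
    y_d = y * radial + p1 * (r2 + 2.0 * y**2) + 2.0 * p2 * x * y
    return x_d, y_d

# Example with small, made-up coefficients
print(distort(0.2, -0.1, k1=-0.1, k2=0.01, k3=0.0, p1=0.001, p2=-0.0005))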


2.3.3 Calibration outline

There are two main categories of calibration methods: auto-calibration and camera resectioning or photogrammetric calibration [31]. Auto-calibration uses several images of an unknown scene and determines the internal parameters of the camera by correlating the images with one another under the assumption of a static scene. Camera calibration/resectioning requires a known calibration object to be placed in the scene during calibration.

Auto-calibration derives all required properties from several images of the scene, without any prior knowledge of the scene. This is done by identifying constraints imposed by invariants of geometric transformations. Auto-calibration will not be studied in this thesis, due to a desire for a simpler relation between calibration scene and end result. Any further mentions of calibration will regard camera resectioning using a known calibration object, unless otherwise noted.

For camera resectioning using point targets, a good detection of points is necessary, in general down to sub-pixel accuracy (by interpolating the data that should be between the discrete pixels), with the number of points varying depending on the chosen method. A human can perform this sub-pixel selection of points, but it is tedious work and prone to mistakes, and improving it is an important task for increasing calibration accuracy [7, 6]. Making the extraction automatic and as precise as possible is one of the difficulties of high-accuracy calibration, but it pays off as it generates better data for the optimization. Once points have been extracted, a non-linear optimization procedure called bundle adjustment has to be conducted in order to find the physical parameters of the camera, which accounts for most of the computational time of calibration. Non-linear optimization also introduces a risk of finding a local minimum instead of the global one, requiring some thought when initializing the optimization.

Non-linear optimization

Non-linear optimization procedures work in an iterative manner in order to reach a minimum of a so called cost function. For bundle adjustment and camera calibration, the cost function often defines how large the error is between points in a captured image and the theoretical model of how the camera works.

An example formulation of a bundle adjustment problem follows, which minimizes a sum of a cost function over all the points in all images part of the procedure. We have $N_c$ points $\hat{p}_i = (x_i, y_i)$ in each image, and corresponding points $P_i = (X_i, Y_i, Z_i)$ in the world, for $i = 0, \ldots, N_c$. Let $K$ be the camera intrinsics of Eq. (2.13), and $R_c$ and $\mathbf{t}_c$ the camera extrinsics of Eq. (2.15) for image $c$, with $c = 0, \ldots, C$; thus $E_c = [\, R_c \mid \mathbf{t}_c \,]$. Lens distortions are modeled as in Section 2.3.2 as a function $L_f(p) \mapsto p'$, which transforms ideal pinhole points into points affected by lens distortions. The general cost function formulation is $f(\hat{p}, p) \mapsto \text{scalar}$. An often used metric is the euclidean distance between the two points, which will be detailed in Section 2.5.1. Minimizing over all the parameters $K, \{E_c\}, L_f$, the complete bundle adjustment problem looks like


\[
\underset{K, \{E_c\}, L_f}{\arg\min} \; \sum_{c}^{C} \sum_{i}^{N_c} f\bigl(\hat{p}_{i,c},\, L_f(K E_c P_{i,c})\bigr)
\tag{2.19}
\]
where $f$ commonly looks similar to $f(a, b) = \lVert a - b \rVert$. The actual optimization procedure carried out for finding the minimum is often a non-linear least squares procedure, which iteratively linearizes the function around the current best solution and then alters the parameters to decrease the value of the linearized function, which becomes the next iteration's best solution. Several different methods are available; one of the more popular for bundle adjustment is the Levenberg-Marquardt algorithm.
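As a sketch of how such a cost could be minimized in practice (our own illustration, not the procedure used in the thesis), one can hand the stacked residuals of Eq. (2.19) to a generic non-linear least squares solver such as scipy's least_squares. For brevity the sketch handles a single view and omits lens distortion; the input point arrays are assumed to be given.

import numpy as np
from scipy.optimize import least_squares

def residuals(params, world_pts, image_pts):
    """Residuals for one view of a distortion-free pinhole model.
    params = [fx, fy, cx, cy, rvec(3), t(3)]."""
    fx, fy, cx, cy = params[:4]
    rvec, t = params[4:7], params[7:10]
    # Rodrigues formula for the rotation matrix
    theta = np.linalg.norm(rvec)
    if theta < 1e-12:
        R = np.eye(3)
    else:
        k = rvec / theta
        Kx = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
        R = np.eye(3) + np.sin(theta) * Kx + (1 - np.cos(theta)) * Kx @ Kx
    cam = world_pts @ R.T + t                    # points in the camera frame
    proj = np.column_stack([fx * cam[:, 0] / cam[:, 2] + cx,
                            fy * cam[:, 1] / cam[:, 2] + cy])
    return (proj - image_pts).ravel()

# world_pts: Nx3 known calibration points, image_pts: Nx2 detections (assumed given)
# result = least_squares(residuals, x0, args=(world_pts, image_pts), method="lm")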

2.4 Depth sensor

There are several different technologies available for depth sensing, such as stereo vision, time of flight or structured light. This work focuses on structured light sensors, but is in general sensor agnostic.

Stereo vision enables depth sensing in a way similar to our human vision, but requires two spatially separated cameras. In addition, the object being viewed must have enough texture to allow uniquely matching points between the two views. The accuracy of depth measurements also degrades heavily when the distance to the object becomes much larger than the distance between the cameras, as their views start to coincide.

Time of flight cameras determine the depth by emitting a light pulse and measuring the time it takes to return. These types of cameras typically outperform structured light for long range uses, but come with their own set of limitations. Work on these sensors that is relevant here includes [3], which presents a non-parametric model of the depth error, with a constraint of keeping the resulting map smooth, managing to reduce the average error for their sensor from the scale of 40 mm down to 5 mm at ranges of 2-9 m.

Regardless of which depth sensor is used, the sensors must be calibrated extrinsically against each other, in order to be able to match the data from the depth sensor with the RGB camera. This can be done using similar techniques as for camera calibration, by estimating the extrinsics relating the two sensors.

2.4.1 Structured light

Structured light depth sensors are a combination of a light projector and a camera pointed in the same direction. Depth from structured light can be recovered using any wavelength of light, but using infrared has the benefit of being invisible to the human eye. The Intel RealSense camera used in this thesis uses infrared light. See Figure 2.5 for a schematic view.

Structured light is in essence stereo vision, with depth measured through triangulation, but with one camera replaced with an active system.

Thus there is no mapping between two captured images, but rather a mapping between the image plane of the projector and the captured image in the camera. The projected pattern is known, and how the scene lights up is recorded by the camera. By geometry the depth can then be computed, just like in the stereo vision case.

Figure 2.5: Structured light RGB-D camera, consisting of an IR laser projector, an IR camera and a color camera.

A typical issue of structured light sensors is strong illumination in the spectrum used for sensing, causing either a complete failure or unreliable output. The spectrum is typically either the visible spectrum or the near infrared spectrum. Border issues also arise when the projector illuminates one side of an object and the camera captures another [18]. There may also be issues with temperature dependent measurements, and distortions in the returned depth values that depend both on the radial distance and the actual depth [13]. Ways to address the depth distortions have included measuring planes at various distances and then applying an offset [13], multiplying the returned value with trained constant maps [26] or fitting a B-spline [19, 17].

Possible distortions stem to a large part from the stereo vision nature of the sensor, since the projector is essentially a reversed camera. The projector has an image chip, which in the analogy to a reversed camera would be equal to the image plane. Light passes through the image chip from behind, and is projected into the world through a lens. As this principle is similar to that of a camera, the same theory for radial and tangential distortions applies. Similarly, the rest of the internal camera parameters affect the final image. In addition to camera calibration there is also the possibility of uneven light intensity through the projected image, which may need to be considered depending on the algorithm's robustness to lighting conditions.

Thanks to the known projected pattern, additional matching methods are applicable compared to stereo vision. The list of used patterns is extensive, and contains both grayscale as well as color patterns [8]. Other major differentiators between patterns are the spatial resolution offered, as well as the information given per image. An increase in spatial resolution generally requires more versions of a pattern projected sequentially into the scene, but comes at the cost of requiring a static scene in order to be able to use the measurements spread out in time. Other patterns are optimized for capturing dynamic scenes and may only require one projected pattern, but at the cost of spatial resolution.

Two pattern techniques that are applicable with the illumination offered in the infrared spectrum are gray coding and phase shifting.

Gray Coding

Gray coded structured light uses a sequence of binary stripe patterns (light/dark) of decreasing width projected over time. This enables absolute positioning down to the resolution offered by the projector/camera imaging chips and the number of patterns used. See Figure 2.6 for an example of width 64 (pixels), with a comparison to normal binary coding. Each row corresponds to one projected image pattern (imagine the row stretched out vertically), and each column to a horizontal position within the image. The benefit of gray coding over normal binary coding is that it is more robust, since any neighboring values only differ by 1 bit.

Figure 2.6: Binary and gray code coding compared. The red line marks reading pixel position 24. Note how being off-by-one in the binary code results in several changed bits, whereas the gray code only has a change in a single bit.

By comparing the locations of the code words in the projected image and in the captured one, one can calculate the disparity and depth of the scene. This is the technique that the Intel RealSense camera seems to be using, judging from the images presented in an Intel blog post [15].
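To illustrate the binary-to-gray-code relationship described above, here is a small Python sketch (our own, not from the thesis) of the standard conversion and of the single-bit-difference property:

def binary_to_gray(n: int) -> int:
    """Convert a binary-coded integer to its gray code."""
    return n ^ (n >> 1)

def gray_to_binary(g: int) -> int:
    """Invert the gray code by successively folding in higher bits."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

# Neighboring positions differ in exactly one bit in gray code
for pos in (23, 24, 25):
    print(pos, format(binary_to_gray(pos), "06b"))
assert all(gray_to_binary(binary_to_gray(i)) == i for i in range(64))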

Phase Shift

Another method using a sequence of projected patterns is fringe projection, or phase shift. It uses a set of sinusoidal (or trapezoidal or other continuous) functions, projected into the scene with varying phase, allowing depth reconstruction from as little as three patterns. The matching between projected patterns and captured images is done by comparing the phase, globally over the entire image. The phase is continuous throughout the image, but due to the phase periodicity this cannot be seen directly in the image. Phase unwrapping is thus performed, to map the several single-period waves into a single continuously increasing phase shift.

However, phase unwrapping has issues when the depth causes a shift of more than 2π, limiting the depth reconstruction to depths within one period [0, 2π], with results outside of that range being ambiguous to a depth within it. This method can also be combined with gray coding to remove the phase ambiguity.

2.4.2 Depth versus lens distortion

The difference between depth and lens distortion might not be apparent from the beginning and is thus explained here. Lens distortion (Eq. 2.18) moves pixels around in the image, without modifying their values. Depth distortion does not alter the position of the pixels, only their values. In depth images the values correspond to the distance from the camera; in an ordinary image the values would correspond to intensity. The depth sensor is also affected by lens distortion, since the depth sensor in the active stereo case also consists of a camera with a lens. However, we classify all distortions into the distinct groups of depth or lens distortion, based on whether they alter the values or the positions of pixels, respectively.

2.5 Accuracy metrics

Optimization of the camera parameters requires a cost function which defines the error between the current solution and what one theoretically wishes to achieve. A common metric for 2D camera calibration is the reprojection error.

2.5.1 Reprojection error

Reprojection error is one of the most commonly used metrics for ordinary camera calibration. It is computed by taking the known points in the world coordinate system, computing their projection through the current model of the camera, and comparing the resulting position with the position in the captured image. The distance (error) for each point is then summed in a mean square error fashion, and returned as a scalar for the entire procedure.

Let $\hat{x}_I$ be the observed image point of $X$. This point is measured by the camera, and is as such affected by measurement errors and noise. Let $x$ be the point projected from the known point $X$ using the camera projection matrix $P$. The projected point is calculated according to the current intrinsic and extrinsic parameters of the camera calibration with the actual 3D coordinates of a point. When observing the calibration pattern, the 3D coordinates are known by construction,
\[
x = P X
\]
The reprojection error is the euclidean distance between the Cartesian representations of the actual image point $\hat{x}_I$ and the theoretical point $x$,

\[
\text{error} = d(\hat{x}_I, x) = \left\lVert \frac{1}{\hat{x}_{Iw}} \hat{x}_I - \frac{1}{x_w} x \right\rVert
\tag{2.20}
\]

where in two dimensions
\[
d(a, b) = \left\lVert \frac{1}{a_w} a - \frac{1}{b_w} b \right\rVert = \sqrt{ \left(\frac{a_x}{a_w} - \frac{b_x}{b_w}\right)^2 + \left(\frac{a_y}{a_w} - \frac{b_y}{b_w}\right)^2 + \underbrace{\left(\frac{a_w}{a_w} - \frac{b_w}{b_w}\right)^2}_{=0} }
\tag{2.21}
\]

This enables correcting the camera parameters to more accurately reflect the measured data in the captured image. For 2D images this is very straightforward, since pixel distances are compared. When appending the depth dimension, the units make the comparison harder, since each "pixel" in the depth image has the value of the depth of that "pixel" once back-projected into the three dimensional world. The units turn into (pixel, pixel, depth/disparity), where depth (or disparity) is measured in some real-world distance unit (inverted). Reprojection error is as such not directly applicable for 3D cameras due to the different units.
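A minimal numpy sketch of Eqs. (2.20)-(2.21) (our own illustration, with made-up point values) for a set of observed and projected homogeneous 2D points:

import numpy as np

def reprojection_errors(x_obs_h, x_proj_h):
    """Per-point Euclidean distance between observed and projected homogeneous 2D points."""
    obs = x_obs_h[:, :2] / x_obs_h[:, 2:]     # normalize by the last coordinate
    proj = x_proj_h[:, :2] / x_proj_h[:, 2:]
    return np.linalg.norm(obs - proj, axis=1)

x_obs = np.array([[100.0, 50.0, 1.0], [202.0, 80.0, 2.0]])   # observed points (hypothetical)
x_proj = np.array([[101.0, 49.0, 1.0], [100.0, 41.0, 1.0]])  # projected by the current model
print(reprojection_errors(x_obs, x_proj))   # their mean square drives the calibration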

2.5.2 Depth accuracy

For depth measurements, a common way is to compare the resulting point cloud of distance measurements to a known object. Separate ground truth knowledge is thus required. The object need not be geometrically complex; it could be a simple plane or a cube. With simpler primitives, plane fitting can be applied as well in order to reduce the impact of random noise, for a clearer image of the systematic error. When combining this with a cube known to hold a certain fabrication accuracy, the angles between the planes also serve as features for accuracy evaluation.

2.5.3 Depth and image combined

For evaluating the complete RGB-D camera pair at once, there are no good universal numeric methods. Visual inspection is the leading method of evaluation for general scenes, and the focus is on the boundaries between objects, where the textures of two different objects at various depths easily overlap or get assigned to the wrong object. When utilizing an object with known (defined) world coordinates, evaluation becomes possible by forward projecting the object points, or back projecting the measured points into the world and comparing with the object. Utilizing reprojection error for 3D cameras is thus reasonable, as long as attention is paid to the units and their relative magnitude when working in the 3D space of the camera.

Chapter 3

Related work

The need for calibrating a camera to reconstruct metric properties from images arose early, but interest has surged in recent times for computer vision and computational photography. Calibrated cameras are beneficial for example for determining geodetic information from aerial photographs, and in robotics, where metric information is of interest in subfields such as navigation [25] and reconstruction of real world objects into 3D models [34, 16]. This chapter is split into two sections based on the area of focus. It begins with a look at methods of camera calibration in Section 3.1, which contains some history of camera calibration including lens distortions, and finishes with a study of previous works on depth cameras in Section 3.2.

3.1 Camera calibration

The currently most popular way of calibrating a camera is the one developed in [31], which received widespread adoption after inclusion in both MATLAB and OpenCV, making its usage simple. Other methods have been proposed since then, but the lack of a publicly released toolbox makes the step to usage a lot higher than for [31], which is accepted as performing well enough for most applications. It does however generally take a lot of time to capture all the images required, and as such is not so tempting to use when one has to calibrate many cameras. Camera calibration using circles has been studied in [33], where two co-planar circles are all that is required to refine a previously done calibration. Their method focuses solely on calibrating the extrinsic parameters as well as the focal length.

There has been an extensive search for various geometric objects to calibrate from, the currently most popular one being a chessboard-like 2D plane. There are however plenty of methods utilizing other constraints. Approaches using only rigidity constraints of a natural scene, without any specialized calibration object, are generally called self-calibration, and can result in a camera calibration with accuracy comparable to standard methods; they exist in both sparse (feature based) and dense (whole scene) variants.

Since there is no special calibration object involved, [32] calls these 0D approaches. [32] explores calibrating using markers on a string, as a 1D object alternative to the other approaches, noting that a freely moving 1D object can never provide enough information to fully calibrate a camera, but that fixing one end of it achieves worse, but comparable, results to chessboard calibration.

Correcting lens distortions is very important, and most related work bases the correction on a simple polynomial model, commonly called a Brown-Conrady model. Corrections usually account for both radial and tangential distortion, but sometimes focus only on the radial part due to its higher impact [20]. A way of giving more physical meaning to the parameters in the lens distortion correction is given in [27], where they decouple and couple equations depending on whether they are physically coupled in the actual camera construction or not. The result is a model that performs as well as the classical polynomial Brown-Conrady model, but with, they argue, a better physical explanation.

Most of the popular calibration methods use information from multiple images in order to get enough information for calibration. Under certain assumptions on the camera and the scene, however, calibration can be performed by capturing just one image. One such approach is laid out in [12], which proposes two different methods of single image calibration from a specially constructed 3D object, attaining similar but not identical calibration compared to a multiple image approach. On the subject of single image calibration using planar circle-based markers, [22] argue that using two concentric semi-circles gives a more robust identification of points, ultimately leading to a better calibration than using two complete concentric circles. A survey of further contributions to the field of single-view camera calibration can be found in [28], which details developments up to the year 2009.

An approach to calibrating from the outline of a sphere (which is a conic) can be found in [1]. They provide a detailed analysis of the number of constraints posed per image as well as the number of variables to determine for intrinsic vs. extrinsic calibration. Their solution requires three images of the same sphere in various locations in order to solve the system of equations. They note that calibration based on ellipses has drawbacks depending on the region of the image in which the ellipse is captured: if the ellipse is almost in the center of the image, the aspect ratio is almost one and calibration is difficult, whereas close to the borders of the image the lens distortions start to have significant impact.

In an attempt at higher accuracy, [7] improve earlier methods by replacing squares and checkerboards with rings. They do not try to detect them accurately in the captured image, but rather project what they should look like given the current camera model. The optimization is done around the error between the measured and constructed views, citing reprojection errors on the scale of 0.05 pixels. The presented method is however very sensitive to planarity errors and suffers greatly already at 1 mm planarity distortions. [6] improves matching accuracy by performing a fronto-planar rectification and matching different kinds of markers on the rectified image.

By cross-correlating templates of the marker with the fronto-planar image, they are able to increase precision further than point detection and sub-pixel refinement in the projectively distorted image can.

The popular method of chessboard-like calibration requires manual alignment of the chessboard. This requires knowledge from the operator in order to place the planes in good orientations, but still is not foolproof. The most popular countermeasure to the impact of operator error is usually to capture more images than necessary, which also helps decrease the impact of noise in the images. A method to simplify the procedure was developed in [24], which displays the video stream from the camera being calibrated. The video stream is utilized in an augmented reality fashion to show where to place the calibration pattern next, based on where it would currently improve the calibration the most. They base it on their newly proposed metric, which is computed by regarding all the parameters of the model as probabilistic Gaussians around the true value and sampling from them, based on the data already collected. Depending on the uncertainty or spread of the distributions, different alignments of the calibration board are proposed. This helps to calibrate the camera faster than by capturing a lot of images. It also adds some accuracy guarantees.

3.2 Depth cameras

Many implementations of improving depth data are based on the Microsoft Kinect camera, initially released in 2010 for the Xbox 360 entertainment system. This made a cheap RGB-D sensor combination available to the public, sparking researcher interest. The first version of the Kinect used structured light, whereas later versions switched to time of flight implementations.

An impactful work on Kinect calibration is [13] by Herrera et al., which studies the issue of calibrating the color camera and depth sensor individually or together as a pair. They also study the distortions present and propose a correction of the depth. Their work is based on the idea that calibrating both sensors jointly gives access to more data, and thus more information for optimizing the model. The depth-dependent radial distortions are corrected as well, by fitting a correction function from measuring the disparity to planes set up at various distances. This correction function consists of per-pixel constants and an exponential decay with increasing distance from the camera. It is as such a dense correction, working on a map of image plane depth errors, modified in amplitude by the distance to the camera.

Herrera et al.'s work is extended by Raposo et al. in [23], which focuses on lowering the execution time of the calibration procedure while keeping accuracy high. They note the need for distortion correction in the depth images, as otherwise accuracy is negatively affected. The lower run time is partly achieved by a significant reduction in the number of images required for a successful calibration. Where Herrera et al. [13] propose 60 images for calibration, Raposo et al. manage with as few as 4 and reach comparable performance after 6 to 10 images.

However, most of the run time difference comes from the difference in distortion correction, which is done open-loop in [23] instead of alternating between optimizing spatial depth corrections and the reprojection error.

The distortions of Kinect-like cameras are also studied in [2], which notes that the approach by Herrera et al. in [13] is not as suitable for long distances and extends upon it. By using the myopic property of the depth sensing error, they split up the error into a function describing the systematic depth error, that is the error in average depth over the entire sensor at a certain distance, and a local function describing the error in each pixel as a function of depth. The two are as such coupled, but by splitting them they can calculate so called undistortion maps for the localized error and later estimate the systematic error. Data is collected by measuring various distances to a wall, and the procedure ends with an approximately 1 hour long optimization scheme.

A comparison between three different Kinect calibration methods was done in [30], which compares the supervised approach of Herrera et al. [13], a checkerboard approach using no depth correction, and the SLAM procedure in [26]. They show that distortion correction in depth measurements is perhaps the most important aspect of getting a good measurement, arguing that Herrera's method outperforms the others thanks to its more rigorous correction. Distortion mapping is also part of [34], where they manage to get a Simultaneous Localization and Calibration routine up to real time performance on a modern computer. [17] extended the plane calibration into a 2.5D domain by punching holes in the plane, allowing calibration of depth sensors at the same time. They use a specialized hole pattern in order to be able to determine the orientation of the board even when parts of the pattern are occluded.

Chapter 4

Theory

The new contributions of this thesis start with this chapter, which details the theory behind the implementation. This thesis investigates what extra information can be gained from regarding an RGB-D camera as a 3D imaging device, rather than an imaging device coupled with a depth sensor. Instead of regarding the image as residing in the 3-dimensional space $\mathbb{P}^2$ and extending it with depth data outside of the camera model, we consider the image and depth data together as a 4-dimensional space in $\mathbb{P}^3$. This space can in turn have been distorted by projective deformations, which we can rectify using scene knowledge.

4.1 Homogeneous depth camera model

The camera model of Section 2.3 works very well for normal 2D imaging devices. However, since we assume a sensor returning depth data as well, we would like to extend it to also explain 3D measurements of the world in an equally linear fashion.

It is possible to use the 2D model for 3D data as is, as in Section 2.3.1. This is done by disregarding homogeneity and assuming that $x_3$ is in fact the depth in the scene, thus turning it into a model going from projective space into the corresponding euclidean space, $\mathbb{P}^3 \mapsto \mathbb{E}^3$. This decreases the flexibility of the model, turning it into a pseudo-homogeneous representation, and requires additional work for applying the inverse transformation, since the matrix representing the transformation cannot be invertible.

Instead, we would like to remain in $\mathbb{P}^3$ by extending the camera model to stay in projective 3D space. We do this by embedding the normal $K$ matrix of internal camera parameters into a bigger representation. By extending the projection matrix as follows,
\[
M = \underbrace{\begin{pmatrix} f_x & \alpha & c_1 & 0 \\ 0 & f_y & c_2 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}}_{K} \underbrace{\begin{pmatrix} R & \mathbf{t} \\ \mathbf{0}^T & 1 \end{pmatrix}}_{E}
\tag{4.1}
\]


we stay in $\mathbb{P}^3$. A homogeneous point $(X_1, X_2, X_3, X_4)$ in the 3D world is now projected to the depth image as
\[
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} = M \begin{pmatrix} X_1 \\ X_2 \\ X_3 \\ X_4 \end{pmatrix}
\tag{4.2}
\]
and returning back to Cartesian coordinates is performed as usual by normalizing by the fourth coordinate,
\[
\begin{pmatrix} x_1^c \\ x_2^c \\ x_3^c \end{pmatrix} = \frac{1}{x_4} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
\tag{4.3}
\]
where the third Cartesian coordinate $x_3^c$ is now equal to the disparity, that is, the inverse of the depth.
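To make the structure of M concrete, here is a small numpy sketch (our own, with invented intrinsic values) that builds M from Eq. (4.1) and projects a world point into (pixel, pixel, disparity) as in Eqs. (4.2)-(4.3):

import numpy as np

def depth_camera_matrix(fx, fy, alpha, c1, c2, R, t):
    """4x4 homogeneous depth camera matrix M = K E, Eq. (4.1)."""
    K = np.array([[fx, alpha, c1, 0.0],
                  [0.0, fy, c2, 0.0],
                  [0.0, 0.0, 0.0, 1.0],
                  [0.0, 0.0, 1.0, 0.0]])
    E = np.eye(4)
    E[:3, :3], E[:3, 3] = R, t
    return K @ E

# Hypothetical intrinsics; identity extrinsics
M = depth_camera_matrix(600.0, 600.0, 0.0, 320.0, 240.0, np.eye(3), np.zeros(3))
X = np.array([100.0, -50.0, 1000.0, 1.0])   # world point 1000 mm in front of the camera
x = M @ X
print(x[:3] / x[3])   # (u, v, disparity); disparity is 1/depth = 1/1000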

4.2 Projective depth distortion

The 2D image returned by current RGB-D cameras is typically of higher quality than the depth image, and its distortions have been well studied (see Section 3.1 for summaries of a few studies). With this in mind, we seek to formulate a correction that affects depth data only, while leaving the 2D image the same. It is also beneficial for inverse calculations if the model is linear and invertible. An additional constraint that can be placed on the correction is that it should keep planes planar. This is beneficial since a lot of man-made objects contain planar surfaces, which can thus be used for identification. It can be shown (see Appendix B) that the only distortion model fulfilling the above is a special 3D homography of the form
\[
D = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ d_1 & d_2 & d_3 & d_4 \end{pmatrix}
\tag{4.4}
\]

The nominal values expected from a perfect system are $d_1 = d_2 = d_3 = 0$ and $d_4 = 1$, which turns $D$ into the identity matrix.

4.3 General homogeneous depth camera model

Combining the projective depth distortion with the homogeneous depth camera model, we arrive at the complete projection matrix for the transformation.

\[
x = K D E X
\tag{4.5}
\]


where $x$ is a point in the “3D image” space of the camera, and $X$ is the corresponding world point being imaged.
\[
P = K D E = \begin{pmatrix} f_x & \alpha & c_1 & 0 \\ 0 & f_y & c_2 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ d_1 & d_2 & d_3 & d_4 \end{pmatrix} E = \begin{pmatrix} f_x & \alpha & c_1 & 0 \\ 0 & f_y & c_2 & 0 \\ d_1 & d_2 & d_3 & d_4 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} R & \mathbf{t} \\ \mathbf{0}^T & 1 \end{pmatrix}
\tag{4.6}
\]
Comparing this model to the normal 2D camera projection of Eq. (2.13), we note that our projection matrix has grown by one row and now keeps the results in $\mathbb{P}^3$. The first two coordinates remain the same in both models. But whereas Eq. (2.13) loses homogeneity upon using depth data, our model preserves it. The proposed model is fully linear, under the strong assumption that the data is captured ideally without radial or tangential distortions. An equivalent situation is undistorting all images before using them for the perspective depth correction.

4.4 Calibrating the depth distortion parameters analytically

In the case of known intrinsics and extrinsics, such as those returned by calibrating the camera, solving for the depth distortion parameters analytically is simple. With at least 3 point correspondences, 3 of the parameters can be solved for. In order to solve for all 4 parameters, 4 point correspondences are required, but they cannot all lie in a plane, as this does not provide enough information to solve for all 4 parameters uniquely. The solution is based on stacking relations derived from each correspondence, in order to later solve them in a least squares sense. For one single point correspondence between world point $X_W$ and image captured point $\hat{X}_I$, with intrinsic parameters $K$ and extrinsic parameters $E$, the distortion affected relation is
\[
\hat{X}_I = K D E X_W
\tag{4.7}
\]

Let $X_C$ be the world point in the camera-centered coordinate system,

\[
X_C = E X_W
\tag{4.8}
\]
Then
\[
K^{-1} \hat{X}_I = D X_C \;\Rightarrow\; \bigl\{ \tilde{X}_I = K^{-1} \hat{X}_I \bigr\} \;\Rightarrow\; \tilde{X}_I = D X_C
\tag{4.9}
\]
\[
\begin{pmatrix} \tilde{X}_{I1} \\ \tilde{X}_{I2} \\ \tilde{X}_{I3} \\ \tilde{X}_{I4} \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ d_1 & d_2 & d_3 & d_4 \end{pmatrix} \begin{pmatrix} X_{C1} \\ X_{C2} \\ X_{C3} \\ X_{C4} \end{pmatrix}
\tag{4.10}
\]

Inspecting the fourth row yields

\[
\tilde{X}_{I4} \sim d_1 X_{C1} + d_2 X_{C2} + d_3 X_{C3} + d_4 X_{C4} = X_C^T \mathbf{d}
\tag{4.11}
\]

where $\mathbf{d} = (d_1, d_2, d_3, d_4)^T$. Here $\sim$ is used instead of $=$ in order to clarify that we are looking at a scalar from a homogeneous vector, which is only defined up to a scale factor. Dividing through by $\tilde{X}_{I3}$ gets rid of the unknown scale factor,

\[
\frac{\tilde{X}_{I4}}{\tilde{X}_{I3}} = \frac{X_C^T}{X_{C3}} \mathbf{d} \;\Rightarrow\; \underbrace{\frac{\tilde{X}_{I4} X_{C3}}{\tilde{X}_{I3}}}_{y} = \underbrace{X_C^T}_{\mathbf{a}^T} \mathbf{d}
\tag{4.12}
\]
where the left hand side is a scalar, and the right hand side a product of a row and a column vector. The division by $\tilde{X}_{I3}$ and $X_{C3}$ is okay, since $X_{C3} \sim \tilde{X}_{I3} = 0$ if and only if the point is in the camera optical center, which it cannot be. Each point correspondence thus provides one equation. Repeating the procedure for all captured points and stacking Eq. (4.12) vertically, $Y = (y_1, y_2, \ldots, y_N)^T$, $A = [\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_N]^T$, we get

\[
\underset{N\times 1}{Y} = \underset{N\times 4}{A}\; \underset{4\times 1}{\mathbf{d}}
\tag{4.13}
\]
which has the least squares solution

\[
\mathbf{d} = (A^T A)^{-1} A^T Y
\tag{4.14}
\]
and solves
\[
\mathbf{d} = \underset{\mathbf{d}}{\arg\min} \sum_{j=0}^{N} \left( X_C^{[j]T} \mathbf{d} - \frac{\tilde{X}_{I4}^{[j]} X_{C3}^{[j]}}{\tilde{X}_{I3}^{[j]}} \right)^2
\tag{4.15}
\]
where $\bullet^{[j]}$ refers to number $j$ of the point $\bullet$.
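A numpy sketch of this stacking and solve (our own illustration; rather than forming the normal equations of Eq. (4.14) explicitly, the numerically stabler np.linalg.lstsq is used):

import numpy as np

def solve_depth_distortion(X_tilde_I, X_C):
    """Solve Eqs. (4.13)-(4.14) for d, given Nx4 arrays of normalized image points
    X_tilde_I = K^{-1} X_hat_I and camera-frame world points X_C."""
    y = X_tilde_I[:, 3] * X_C[:, 2] / X_tilde_I[:, 2]   # left hand side of Eq. (4.12)
    A = X_C                                             # each row is a^T = X_C^T
    d, *_ = np.linalg.lstsq(A, y, rcond=None)
    return d

# For an undistorted camera the recovered d should be close to (0, 0, 0, 1).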

4.5 Iterative reweighted least squares

To account for outliers and noisy data, the analytical solution can optionally be further refined by iterative reweighted least squares (IRLS) [29]. IRLS instead iteratively minimizes

d^{(t+1)} = \arg\min_{d} \sum_{j=1}^{N} w_j^{(t)} \left( X_C^{[j]T} d - \frac{\tilde{X}_{I4}^{[j]}\, X_{C3}^{[j]}}{\tilde{X}_{I3}^{[j]}} \right)^{2}    (4.16)

where \bullet^{(t)} indicates the variable at iteration t.


Using an L1 norm and regularizing yields

w_j^{(t)} = \frac{1}{\max\!\left(\delta,\; \left| X_C^{[j]T} d^{(t)} - \dfrac{\tilde{X}_{I4}^{[j]}\, X_{C3}^{[j]}}{\tilde{X}_{I3}^{[j]}} \right| \right)}    (4.17)

with δ a small regularization constant, for instance set to 0.0001.
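A sketch of the IRLS refinement of Eqs. (4.16)-(4.17), reusing the A and y construction from the previous sketch. The fixed iteration count is an illustrative choice; one could equally stop when d changes by less than some tolerance.

```python
import numpy as np

def irls_depth_distortion(A, y, n_iter=20, delta=1e-4):
    """Iteratively reweighted least squares for d, with L1-type weights (Eq. 4.17)."""
    d, *_ = np.linalg.lstsq(A, y, rcond=None)            # start from the analytical solution
    for _ in range(n_iter):
        residuals = A @ d - y
        w = 1.0 / np.maximum(delta, np.abs(residuals))   # weights of Eq. (4.17)
        sw = np.sqrt(w)
        # Weighted least squares: minimize sum_j w_j (a_j^T d - y_j)^2, Eq. (4.16)
        d, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return d
```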

4.6 Calibrating the depth-distortion parameters non-linearly

In an attempt to obtain more accurate depth distortion parameters, we also formulate the problem as a non-linear least squares problem over the point positions in 3D. Since the modeled distortion does not affect the x, y image positions, only the depth distortion parameters are included in the optimization. Back-projecting points into the world using the distortion parameters does change their believed position in the world. However, we consider our defined world points to have much higher accuracy than those measured by the camera, and ignore this effect. We have image-measured points \hat{p}_{ic}, with i = 1 \dots N_c points in every camera view c = 1 \dots C. The corresponding points in the world are P_{ic}. All points are here described as points in P^3, with the last entry equal to 1. For each view c we have the extrinsics E_c. The intrinsics K are fixed for all views. The optimization problem is

\arg\min_{D} \sum_{c=1}^{C} \sum_{i=1}^{N_c} d\!\left( P_{ic},\; E_c^{-1} D^{-1} K^{-1} \hat{p}_{ic} \right)    (4.18)

where D is the matrix form of d = (d_x, d_y, d_z, d_w). Here d(a, b) is the distance measure detailed in Section 5.5.1; in short, we divide each vector by its last entry and then take the norm of the difference between the corresponding Euclidean vectors. For restricted models, entries that are not solved for are instead fixed to the value at the corresponding position in the 4 × 4 identity matrix. For no depth distortion, this corresponds to setting D to the identity matrix.
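A sketch of the non-linear refinement in Eq. (4.18) using scipy.optimize.least_squares. Only the four distortion parameters are optimized; K, the per-view extrinsics and the point correspondences are assumed to be given, and all names are illustrative rather than taken from the thesis implementation.

```python
import numpy as np
from scipy.optimize import least_squares

def refine_depth_distortion(d0, K, extrinsics, points_img, points_world):
    """Non-linear refinement of d = (dx, dy, dz, dw) over all views.

    K            : (4, 4) intrinsic matrix of the homogeneous depth camera model.
    extrinsics   : list of (4, 4) matrices E_c, one per view.
    points_img   : list of (N_c, 4) measured homogeneous points per view.
    points_world : list of (N_c, 4) defined homogeneous world points per view.
    """
    K_inv = np.linalg.inv(K)

    def residuals(d):
        D = np.eye(4)
        D[3, :] = d
        D_inv = np.linalg.inv(D)
        res = []
        for E, p_img, P_world in zip(extrinsics, points_img, points_world):
            E_inv = np.linalg.inv(E)
            X = (E_inv @ D_inv @ K_inv @ p_img.T).T          # back-project into the world
            X = X / X[:, 3:4]                                # homogenize
            P = P_world / P_world[:, 3:4]
            res.append(np.linalg.norm(X[:, :3] - P[:, :3], axis=1))  # d(a, b)
        return np.concatenate(res)

    return least_squares(residuals, d0, method="lm").x
```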

4.7 Parameter interpretation

The 4 depth distortion parameters, d_1, d_2, d_3, d_4, or with more descriptive subscripts d_x, d_y, d_z, d_w, are presented as an empirical correction of the depth data. They share the same kind of end result, since they all act as scale factors: an increase in parameter magnitude increases the correction performed. They differ, however, in what drives the correction. Since the x and y coordinates can be both positive and negative, the signs of d_x and d_y indicate in which direction the correction is tilted with respect to the image plane, and their magnitudes the strength of the tilt. Since the coordinate system is based in the camera center with the positive Z-axis pointing forward, there can be no negative Z values. A sign change of the d_z parameter thus moves the depth data further from or closer to the camera, depending on the already measured depth. The effect of d_w is similar, offering a general scale increase or decrease independent of the points' measured positions. The effect of d_w compared to a general scale s is inverted, s = d_w^{-1}, so that d_w < 1 implies an increased scale, s > 1, and d_w > 1 implies s < 1.

We can imagine some possible underlying physical causes, although they have not been the focus of this thesis. First and foremost, the parameters account for the deviation between the model used to calculate the depth (stereo vision or an extension thereof) and reality. This could for example be manufacturing errors, timing errors, or errors arising from finite precision calculations. Perhaps the image planes are not entirely parallel to each other, so that the projected pattern is slightly tilted compared to the camera; this could be corrected with d_x and d_y depending on the situation. If the image planes are parallel to each other but offset in distance from the baseline, or the baseline has a slightly different length than assumed, it could be corrected by d_w or d_z. The magnitude of the parameters is of course highly dependent on the coordinate system used, and will for the purpose of this thesis always refer to an orthogonal coordinate system with millimeters as the unit along each axis.

The proposed 4 parameters have similarities to the depth-multiplier images proposed by some authors (e.g. [13, 26]), which generally consist of the product of one or more dense maps of per-pixel constants similar to d_w, and a scalar function dependent only on the measured depth, to increase or decrease their impact. These methods result in parameterisations of the distortion with parameter counts in the 10000 range. The 4 parameters in our presented model most likely correct for the distortions in a similar way, with the obvious disadvantage of being unable to capture non-linearities perfectly. The big advantage, however, is that calibration does not require full-view images of a plane parallel to the image sensor, as is required for determining the per-pixel constants.

4.8 Parameter visualization

In order to get a better grasp of what the distortion parameters actually affect, a couple of visualizations with points located at physically possible locations are shown. All visualizations show a cube with a side of 16 cm, placed in front of the camera at a distance of 38 cm and centered around the Z axis. The camera center and coordinate frame are represented by the axes at the origin, with Z pointing forwards. The effect of d_x and d_y is identical along their respective axes, and is in Figure 4.1 visualized for the Y axis from the side, as well as from the camera's perspective. The difference between d_z and d_w can be seen in Figure 4.2. The scaling induced by d_w is independent of the distance to the camera, whereas d_z causes a shift dependent on the distance.


(a) Side view of modifying dy

(b) Camera view, identical image regardless of the distortion parameters

Figure 4.1: Impact of modifying dy with quite extreme values. To the left, original points. To the right, distorted points.

The careful reader might notice in the side views that the spheres have different radii. The radii of the spheres in these visualizations depend linearly on the distance to the camera center. This is to reflect that we are changing the camera's view of the world, and not the world itself. The depth distortion moves all points along the ray from the camera center through the point, which naturally affects the size of the imaged spheres when viewed under a perspective projection as in these images. The points in the real world do not change when the depth distortion parameters are applied, but the measurements do. These images are as such not different views of the world, but different views of the camera's measurement volume, with the camera view in Figure 4.1b showing the camera's rendering of the volume. Another way of reasoning is that each view represents a different real world situation that is imaged identically by the camera, given that the camera is affected by the corresponding depth distortion. Finding the distortion parameters then allows us to find the correct situation, which here is the cube.


Figure 4.2: Different impacts of dz and dw. From top to bottom: points in their original position, points modified by the 4th distortion parameter dw, points modified by the 3rd distortion parameter dz. Note that the distance between XY -planes in the dz case increases with Z distance, whereas it remains constant for dw.

Chapter 5

Implementation

This chapter outlines the experimental setup of the thesis. It includes some of the considerations taken and rough implementation details.

5.1 Physical hardware

The physical camera used in this thesis is an Intel RealSense SR300 camera. It uses an IR projector coupled with an IR camera to generate the depth image in an active stereo setup. Since the depth data is measured in the same coordinate frame as the IR camera, the IR image will be used instead of the color image, to simplify calculations by considering the pair as one camera returning depth as well as a grayscale image. An extension to working with color data as well would require keeping track of its coordinate system and its relation to the coordinate system of the depth camera.

5.2 Calibration pattern

Detecting points for calibration accurately is a large part of ordinary camera calibration, which puts constraints on the calibration object. The structured light sensor used for the practical implementation relies on an IR projector, which requires that the material is visible in the IR spectrum for the depth sensor to function correctly. The sensor applies filtering and smoothing in order to improve the quality of the captured data. This is not without its faults, leading to disturbances near depth discontinuities, as well as near regions of differing infrared reflectivity. Supported and frequently used calibration patterns in the OpenCV library are checkerboards and dot patterns, with black printed on white paper. Checkerboards produced in this manner contain frequent reflectivity changes throughout, which leads to less reliable depth data. Thus the (asymmetric) circle grid pattern from OpenCV was selected to reduce depth disturbances. A picture of it can be seen in Figure 5.1. This does not ensure depth data at the image-measured marker positions, but it allows for measuring the plane that the pattern lies in over a relatively large area, as accurately as the camera allows. This in turn enables a plane fit, and interpolation of the depth data to the measured marker positions.

Figure 5.1: Pattern used for calibration

Detection of the pattern is done in the IR image by blob detection in OpenCV. Since circles under perspective deformation look like ellipses, the detected center of a blob does not match the actual center of the ideal circle [11]. The positions are therefore refined by rectifying the image of the pattern to a fronto-planar position and applying template matching, as presented in [6]. See Figure 5.2a and the resulting fronto-planar rectified pattern in Figure 5.2b respectively.
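A sketch of the detection step with OpenCV's blob detector and circle grid finder. The grid size and blob detector settings are illustrative guesses that would need to match the actual printed pattern, and the fronto-planar rectification and template matching refinement from [6] is not reproduced here.

```python
import cv2
import numpy as np

def detect_circle_grid(ir_image, grid_size=(4, 11)):
    """Detect an asymmetric circle grid in an IR image; returns (N, 2) centers or None."""
    # Blob detector parameters are rough guesses for a printed pattern.
    params = cv2.SimpleBlobDetector_Params()
    params.filterByArea = True
    params.minArea = 20
    params.maxArea = 5000
    detector = cv2.SimpleBlobDetector_create(params)

    found, centers = cv2.findCirclesGrid(
        ir_image, grid_size, None, cv2.CALIB_CB_ASYMMETRIC_GRID, detector)
    return centers.reshape(-1, 2) if found else None
```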

(a) Example of image used for calibration. (b) The same image rectified to fronto-planar position with the pattern cut out.

Figure 5.2: An image from the “general1” data set

Detecting the pattern directly in the depth data is difficult in this case, since the data usually is not clean enough to consistently detect the markers. Even if consistent detection were possible, the holes in the depth data caused by the markers are usually not anywhere near round, and thus less suitable for use. An implementation working directly with only the depth data would benefit from using a pattern or object with even surfaces for consistent measurements.

5.3 Data collection

Gathering data for running the algorithms was conducted as simply as possible. The planar calibration target was glued onto a piece of plywood. To check how reasonable the assumption of a perfect plane is, a flat metallic ruler was slid across the plywood, confirming that it was not quite planar. The deviation from a perfect plane was estimated to be at most 1 mm. The calibration target was placed where it could be seen clearly from several directions. Data capture starts by requesting images from the camera and throwing the first frames away, in order to let the on-board processing stabilize, which among other things seems to carry out a correction based on the internal temperature of the camera. Data capture then proceeded by recording a video stream of the target from different directions and distances. After capture, a subset of frames, on the order of 5 to 30, was extracted from each video sequence and stored as a data set. Several kinds of data sets were selected. There are data sets with the pattern captured at positions and angles spread out as well as possible in the measurement volume, which should also provide good data for the ordinary camera calibration. Other data sets had more constraints on them, such as the camera being held parallel and at the same distance to the pattern at all times, or the pattern only being visible in the center of the image. The camera was allowed to cool down between each data set capture. Each capture was preceded by a capture of at least 100 frames that were thrown away, in the hope of the camera internals reaching a steady state. Depth data is converted into millimeter units using librealsense [14]. The pattern is defined in millimeter units in the world, and thus the entire world coordinate system is measured in millimeters. The calibrated ordinary camera intrinsics thus map between millimeters in the world and pixels in the image. Figure 5.3 shows an example input image pair, together with an enlarged cut-out of the depth data from the pattern. An additional set of example images can be seen in Fig. 6.1, which shows some of the different characteristics of the images in the data sets.

5.4 Calibration algorithm

Trying to perform single image calibration with the help of the depth data returns too little information to be able to determine all desired camera parameters. We thus turn to the planar method of Zhang [31] to calibrate the normal 2D camera parameters. The method proceeds in two steps, where the first step acts as an initializer for the second step.



Figure 5.3: Example input images. Lens distortion is visible at the bottom edge. (a) IR image with increased brightness. (b) Depth image presented using color to signify distance; dark blue represents missing data. (c) Close-up of the pattern in more detailed coloring. Note the uneven hole sizes from the markers and the bumpy surface where the camera does manage to capture data at and near the markers.

The first step is an analytical initial estimate of the intrinsic and extrinsic camera parameters. The initial estimate is found by forming a homography between each image and the known calibration plane. The second step is bundle adjustment, the non-linear optimization over all camera parameters and world point positions. We extend this method by analytically solving for the depth distortion parameters in a linear least squares sense, using all control points in all the images, as described in Section 4.4. In an attempt to make the adapted parameters more stable, the result of the linear least squares can be run through an iteratively reweighted least squares procedure, which theoretically should offer better resilience to outliers (Section 4.5). We vary the model depending on whether we consider lens distortion or not, as well as on which depth distortion parameters we consider. A further refinement is to extend the previous method by refining the linear least squares solution with a second, non-linear least squares optimization, detailed in Section 4.6.
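The 2D part of this pipeline corresponds closely to what OpenCV's calibrateCamera exposes (a homography-based initial estimate followed by bundle adjustment over all parameters). The sketch below is a minimal illustration with hypothetical variable names; the depth distortion steps of Sections 4.4-4.6 would then run on its output.

```python
import cv2
import numpy as np

def calibrate_2d(object_points, image_points, image_size):
    """Zhang-style planar calibration of the 2D camera parameters.

    object_points : list of (N, 3) float32 arrays, pattern points in mm (Z = 0).
    image_points  : list of (N, 1, 2) float32 arrays, detected centers per image.
    image_size    : (width, height) of the IR image.
    """
    rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
        object_points, image_points, image_size, None, None)
    # rms          : reprojection error in pixels
    # K            : 3x3 intrinsic matrix, dist : (k1, k2, p1, p2, k3)
    # rvecs, tvecs : per-view extrinsics (rotation vectors and translations)
    return rms, K, dist, rvecs, tvecs
```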

5.5 Evaluation

To compare accuracy of the calibration method and the impact of the correction for projective depth distortion, two metrics are used.

5.5.1 Dot error

The proposed model is fully linear. This enables inverting the model without losing information, retaining the physical meaning of the transformation, with errors occurring as part of the imaging process (that is, the process causes noisy measurements of 3D points) rather than from the 3D points themselves being noisy. We rely on the assumption that the world points are defined to a much higher accuracy than what is measured by the camera. Considering only the depth data, which has errors on the scale of several millimeters, this is a reasonable assumption even if the calibration board is uneven. The dot error is defined as

error_{dot}(\check{X}, X) = d(\check{X}, X)    (5.1)
= \sqrt{ \left( \frac{\check{X}_x}{\check{X}_w} - \frac{X_x}{X_w} \right)^{2} + \left( \frac{\check{X}_y}{\check{X}_w} - \frac{X_y}{X_w} \right)^{2} + \left( \frac{\check{X}_z}{\check{X}_w} - \frac{X_z}{X_w} \right)^{2} }    (5.2)

with \hat{x} the actual measured point, \check{X} = E^{-1} D^{-1} K^{-1} \hat{x} its corresponding projection into the world using the model, and X the defined world point. This sparse error metric is used for the black markers in the calibration pattern, seen in Fig. 5.1. Only the centroid of each marker is considered. For the mean value over all markers in the pattern, we take the arithmetic mean,

\text{mean dot error} = \frac{1}{N} \sum_{i=1}^{N} error_{dot}(\check{X}_i, X_i)    (5.3)

for all i = 1 \dots N point correspondences \check{X}_i \leftrightarrow X_i.
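A direct numpy transcription of Eqs. (5.1)-(5.3), assuming the measured points have already been back-projected into the world through E^{-1} D^{-1} K^{-1}; names are illustrative.

```python
import numpy as np

def mean_dot_error(X_back, X_world):
    """Mean Euclidean distance between back-projected and defined points.

    X_back  : (N, 4) homogeneous back-projected measurements.
    X_world : (N, 4) homogeneous defined world points.
    """
    a = X_back[:, :3] / X_back[:, 3:4]      # homogenize to Euclidean 3D
    b = X_world[:, :3] / X_world[:, 3:4]
    errors = np.linalg.norm(a - b, axis=1)  # Eq. (5.2) per point
    return errors.mean()                    # Eq. (5.3)
```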

5.5.2 Plane error

Since only the markers have a truly defined position in the world, a different metric is used for densely comparing the whole pattern. The most characteristic criterion for the pattern in the world coordinate system is that it is defined to lie in a certain Z = constant plane. As such, the dense error for the pattern is easily defined as the difference between the measured Z coordinate and the Z = constant plane, for all points with X, Y coordinates inside the pattern boundaries. We disregard pattern data that lies outside the bounding box of the outermost markers.

error_{plane}(\check{X}) = \check{X}_Z - Z_{const}    (5.4)

with \hat{x} the actual measured point and \check{X} = E^{-1} D^{-1} K^{-1} \hat{x} its corresponding projection into the world using the model. This metric is valid for all \check{X} whose X, Y coordinates are within the defined world pattern area, except in the vicinity of the markers, where the depth is unreliable. This dense metric is used to present difference images of how the depth error in the world Z coordinate looks before and after depth correction. To present a summary statistic we take the mean absolute error over all valid points, since this error is defined with a sign.

\text{mean absolute plane error} = \frac{1}{N} \sum_{i} \left| error_{plane}(\check{X}_i) \right|    (5.5)
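A corresponding sketch of Eqs. (5.4)-(5.5), where masking out points near the markers and outside the pattern bounding box is assumed to have been done beforehand.

```python
import numpy as np

def mean_absolute_plane_error(X_back, z_const):
    """Mean |Z - Z_const| for back-projected depth pixels inside the pattern area.

    X_back  : (M, 4) homogeneous back-projected depth pixels within the pattern.
    z_const : Z value of the pattern plane in the world frame (mm).
    """
    z = X_back[:, 2] / X_back[:, 3]           # world Z coordinate of each pixel
    plane_errors = z - z_const                # Eq. (5.4), signed
    return np.abs(plane_errors).mean()        # Eq. (5.5)
```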


Chapter 6

Results

This chapter contains the main results and analysis of the thesis. It starts out with several sections dedicated to one single camera, and ends with a briefer analysis of 3 other cameras of the same type. All cameras except the first, in-depth studied camera are represented with numbers to distinguish them from one another. The full model is calibrated with lens distortion in Section 6.2, where both infrared and depth images are undistorted before processing. In Section 6.3 a reduced model using only two parameters is presented. In Section 6.4 a quick summary of different parameter models is presented, including the end-to-end completely linear model that does not consider lens distortion. In Section 6.5 the distortion model is tested on additional cameras of the same camera model and summarized results are presented, to see how general the results are. The camera was calibrated on a data set consisting of images of the calibration board taken from several different angles and distances; an example image can be seen in Figure 5.2a. This was done in order to provide a well-determined system for the ordinary camera calibration, including lens distortion, as well as accurate determination of the depth distortion parameters. We evaluate the results with the help of the sparse dot error explained in Section 5.5.1 and the dense plane error explained in Section 5.5.2. Unless otherwise noted, all results are reported for a different data set than the one the camera was calibrated on. This is in order to correspond to real-life usage, where the camera would be calibrated once and then used many times, without enough information to re-calibrate the parameters.

6.1 Data set list

We summarize the data sets used, and their general characteristics.

• The “centered1” data set is 18 images large, taken with the pattern centered in the view but at different distances and angles to the camera. Images exemplary to this data set are Figs. 6.1g to 6.1i and 6.1k.


• The “general1” data set is 27 images large, taken at varying angles and distances to the pattern. Images exemplary to this data set are all shown in Fig. 6.1.

• The “longplywood1” data set is 12 images large, taken in general position. Images exemplary to this data set are all in Fig. 6.1.

• The “longplywood2” data set is 6 images large, taken in general position. Images exemplary to this data set are shown in Figs. 6.1e, 6.1j and 6.1l.

• The “plywood1” data set is 7 images large, taken in general position. Images exemplary to this data set are shown in Figs. 6.1e, 6.1j and 6.1l.

• The “sameheight1” data set is 13 images large. The pattern was on a carpet that is invisible in IR. The images are taken with the camera approximately 1 m from the ground, with the pattern parallel to the image plane. Images exemplary to this data set are shown in Figs. 6.1a to 6.1c.

• The “sameheight2” data set is 27 images large, taken approximately 0.5 m from the camera. The pattern was held by hand and captured at varying positions in the image with the pattern parallel to the image plane. Images exemplary to this data set are shown in Figs. 6.1d to 6.1f.

• The “vertical1” data set is 25 images large, taken in general position, but almost exclusively from the left half of the image plane. Images exemplary to this data set are shown in Figs. 6.1e, 6.1j and 6.1l.

6.2 Complete model with lens distortion

Calibrating on the “general1” data set while taking lens distortion into account, we get the intrinsics

K = \begin{pmatrix} 476.9171 & 0 & 312.9668 & 0 \\ 0 & 476.9093 & 242.4875 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}

with lens distortion parameters (k_1, k_2, p_1, p_2, k_3) equal to

(-0.1468, 0.0463, 0.0012, -0.0007, -0.0756)

and depth distortion parameters, here represented as the last row of D,

d = (0.00013406, -0.00000352, -0.00000699, 1.00604654)

since the rest of the D matrix equals the corresponding part of the identity matrix. The d_z value indicates that points are moved away from the camera depending on their depth, while the d_w value has a counteracting effect of moving all points closer to the camera, albeit very little.


(a) Far away (b) Far away, close to border (c) Far away, close to border

(d) Off center bottom right (e) Off center bottom left (f) Touching left border

(g) Close up (h) Tilted up (i) Tilted right

(j) Close, angled (k) Tilted down (l) Oblique angle

Figure 6.1: Example images from the data sets, showing some possible characteristics. The images have been modified for improved visibility in the printed version.


The reprojection error for the OpenCV calibration was 0.44 pixels. The focal length and principal point position correspond well to the intrinsics used by librealsense for the IR/depth camera (f = f_x = f_y = 476.87, p_x = 311.13, p_y = 245.99). These intrinsics are used to naively undistort the lens distortions in both the infrared and depth images before any further computations. We thus consider the lens-undistorted images to be captured by an ideal pinhole camera. Several data sets now use a subset of their original images, since the lens undistortion moves pattern points too close to the image border for the pattern finder to locate them. These images are thus part of calibrating the intrinsics including lens distortion, but not part of the depth distortion calibration.
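The naive lens undistortion described above could be sketched with OpenCV as follows. Applying the same remap to the IR and depth images assumes both live on the IR camera's pixel grid, and nearest-neighbour interpolation is a cautious choice for depth so that no interpolated depth values are created across discontinuities; the function and variable names are illustrative.

```python
import cv2
import numpy as np

def undistort_pair(ir_image, depth_image, K3x3, dist_coeffs):
    """Undistort the IR image and the depth image with the IR camera's lens model."""
    h, w = ir_image.shape[:2]
    map1, map2 = cv2.initUndistortRectifyMap(
        K3x3, dist_coeffs, None, K3x3, (w, h), cv2.CV_32FC1)
    ir_u = cv2.remap(ir_image, map1, map2, cv2.INTER_LINEAR)
    # Nearest neighbour for depth so that no new depth values are invented.
    depth_u = cv2.remap(depth_image, map1, map2, cv2.INTER_NEAREST)
    return ir_u, depth_u
```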

Figure 6.2: Mean dot error [mm] over all points in each data set, before and after perspective correction, with image mean standard deviations. Calibration data set marked with (calibration).

As can be seen in the plots of both the sparse dot error (Figure 6.2) and the dense plane error (Figure 6.3), the depth distortion correction improves the situation for all data sets. The numerical data behind these plots can be seen in Table 6.1. The two plots and their numerical data are very similar, showing that the two metrics correspond well to each other, despite one working on a few individual points (27 per image) and the other on thousands of pixels. Without correction, the measurement error of the points (Figure 6.2) is slightly larger than the measurement error of the entire plane (Figure 6.3). This could be because the points' depths are not actually measured, but derived from a plane fit.


Figure 6.3: Mean absolute plane error [mm] in each data set, before and after perspective correction, with image mean standard deviations. Calibration data set marked with (calibration).

It could also be due to the different metric, which uses not only the world Z coordinate as in the plane error metric, but the Euclidean distance in XYZ from the measured to the defined point as in the dot error metric. The fact that data sets other than the calibration data set “general1” perform significantly better before correction is most likely a result of them not being as well spread out in the scene as the “general1” data set.

6.2.1 Parameter stability

It is interesting to see how stable the depth distortion parameters are, depending on the data set they are calibrated on. Keeping the intrinsics from the full calibration, and re-calibrating the depth distortion parameters using analytical linear least squares on all data sets that are part of the verification, shows that the depth distortion parameters fluctuate somewhat and do not remain entirely stable. The results are presented in Table 6.2. The analytical calibration of the depth distortion parameters returns similar, but not identical, values for all data sets. The signs of d_y and d_z change between data sets. One data set also shows deviations in d_x and d_w, which both see a small but clear reduction compared to the other data sets. In Table 6.3 we compare the three different implemented methods of determining the depth distortion parameters, again keeping the intrinsics from the original calibration but applying each method to all data sets. The methods are the analytical linear least squares, the iteratively reweighted least squares (IRLS), and the non-linear least squares optimization previously described.


Table 6.1: Mean absolute plane and mean dot error with standard deviation, for each data set. Calibration data set marked with *, calibrated with non-linear least squares.

data set        depth correction    dot error/mm        plane error/mm
                                    mean    std         mean    std
centered1       no                  5.3     3.7         3.8     3.0
                yes                 2.8     1.0         2.2     1.3
general1*       no                  9.7     5.1         8.7     4.5
                yes                 1.4     1.0         2.3     1.9
longplywood1    no                  4.8     3.1         3.8     2.4
                yes                 1.7     1.3         1.6     1.4
longplywood2    no                  4.3     2.5         3.8     2.3
                yes                 3.9     1.9         3.6     1.9
plywood1        no                  6.1     3.5         4.5     2.4
                yes                 1.8     0.9         1.7     0.9
sameheight1     no                  22.3    8.0         21.2    7.4
                yes                 8.3     2.5         8.2     4.0
sameheight2     no                  4.5     3.0         3.9     2.6
                yes                 1.5     0.8         1.3     0.8

The comparison reveals that the three methods return similar results within each data set, but some signs of inter-method instability are visible. We see in Table 6.3 that the d_x parameter is in general 1-2 orders of magnitude larger than the d_y and d_z parameters, while d_w remains close to 1. The d_x parameter is the most stable parameter, closely matched by d_w. It is interesting to note that d_x is aligned with the baseline of the camera, indicating a correction in the direction that provides the basis for the depth calculations. The “sameheight1” data set is the worst data set without correction. It was captured at an approximately constant distance of 1 m, against a background that was invisible in the IR spectrum, and with the pattern along the edges of the image. A possible explanation for why the measurements are off is that the camera got confused by the invisible background. No further study was however done on the possible impact of the background.


Table 6.2: D parameters if intrinsics are kept from the original calibration, but distortion parameters are re-calibrated on the specified data set using the analytical linear least squares

data set        dx            dy            dz            dw
centered1       1.3 × 10^-4   1.5 × 10^-6   1.8 × 10^-6   1.0038
general1*       1.3 × 10^-4  -1.0 × 10^-6  -3.2 × 10^-6   1.0038
longplywood1    1.2 × 10^-4   3.5 × 10^-6  -4.7 × 10^-6   1.0064
longplywood2    9.3 × 10^-5   3.4 × 10^-5   2.1 × 10^-5   0.9940
plywood1        1.3 × 10^-4   5.6 × 10^-6  -4.1 × 10^-6   1.0084
sameheight1     1.1 × 10^-4  -2.6 × 10^-5   5.0 × 10^-6   1.0016
sameheight2     1.2 × 10^-4  -1.0 × 10^-6   1.1 × 10^-6   1.0012

Table 6.3: D parameters if intrinsics are kept from the original calibration, but distortion parameters are re-calibrated on the specified data set using the given method. Calibration data set marked by *.

data set        method        dx            dy            dz            dw
centered1       analytical    1.3 × 10^-4   1.5 × 10^-6   1.8 × 10^-6   1.0038
                IRLS          1.3 × 10^-4  -3.7 × 10^-6   8.4 × 10^-6   0.9996
                non-linear    1.3 × 10^-4   4.4 × 10^-7   5.0 × 10^-6   1.0017
general1*       analytical    1.3 × 10^-4  -1.0 × 10^-6  -3.2 × 10^-6   1.0038
                IRLS          1.3 × 10^-4  -1.5 × 10^-6  -4.0 × 10^-6   1.0039
                non-linear    1.3 × 10^-4  -3.5 × 10^-6  -7.0 × 10^-6   1.0061
longplywood1    analytical    1.2 × 10^-4   3.5 × 10^-6  -4.7 × 10^-6   1.0064
                IRLS          1.2 × 10^-4   6.4 × 10^-6   5.7 × 10^-8   1.0037
                non-linear    1.1 × 10^-4   3.2 × 10^-6  -2.8 × 10^-6   1.0053
longplywood2    analytical    9.3 × 10^-5   3.4 × 10^-5   2.1 × 10^-5   0.9940
                IRLS          9.7 × 10^-5   1.7 × 10^-5   1.3 × 10^-5   0.9983
                non-linear    9.1 × 10^-5   3.1 × 10^-5   1.9 × 10^-5   0.9945
plywood1        analytical    1.3 × 10^-4   5.6 × 10^-6  -4.1 × 10^-6   1.0084
                IRLS          1.3 × 10^-4  -3.1 × 10^-6  -1.1 × 10^-5   1.0112
                non-linear    1.2 × 10^-4   6.1 × 10^-6  -3.0 × 10^-6   1.0079
sameheight1     analytical    1.1 × 10^-4  -2.6 × 10^-5   5.0 × 10^-6   1.0016
                IRLS          1.1 × 10^-4  -4.0 × 10^-5   5.1 × 10^-5   0.9623
                non-linear    1.1 × 10^-4  -3.0 × 10^-5  -1.7 × 10^-5   1.0180
sameheight2     analytical    1.2 × 10^-4  -1.0 × 10^-6   1.1 × 10^-6   1.0012
                IRLS          1.2 × 10^-4   3.3 × 10^-6  -3.1 × 10^-5   1.0134
                non-linear    1.2 × 10^-4   2.8 × 10^-6  -7.9 × 10^-5   1.0325


6.2.2 Difference images

Difference images were generated by comparing the raw depth data and the corrected depth data to the known (defined) ground truth position of the pattern. The slight positional change arises because the data is viewed in the world coordinate system, where a change of distance to the camera in general also brings about a change of the X and Y position. The color corresponds to the value of the plane error defined in Section 5.5.2.

Figure 6.4: Difference image of camera calibration from the “general1” data set accounting for lens distortion, applied on data set “centered1” images 1 to 3. The correction clearly removes an error going from the left side being too far away from the camera to the right side being too close, into an error that is equal over the pattern.

Looking at some images from the “centered1” data set, we see in Figure 6.4 a correction that visually removes the tilt of the error, leaving the pattern a few millimeters too close to the camera. We can also see that the error over the entire pattern decreases, which is clear from the middle row in Fig. 6.4. A small scale difference from the uncorrected image to the corrected image is visible at the right border of the pattern. This is caused by the view being taken as a top-down view of the pattern, regarded as lying in the X,Y-plane with the Z value as the error.


These are two very distinct corrections that are easy to explain. Similar corrections occur throughout all images in all data sets.

Figure 6.5: Original depth data in blue, corrected in green, data from one image in the “centered1” data set after calibrating on “general1”. Camera coordinate frame represented by the axis. Points in the image have been moved closer to the camera the further left in the image they are.

Looking at the data in 3D, in Fig. 6.5 we see the entire scene with the calibration board held by hand. Blue points indicate data before correction, green after. The hard to see blue spheres located in the calibration board represent the actual points used for calibration, green spheres the positions of the calibration points after applying the correction. We see how the left hand side of the image has been moved closer to the camera, and the right hand side further away. In Fig. 6.6 we see the same image but captured from above, which more clearly shows the effect of the correction. In Fig. 6.7 we have an example of when the correction moves all points further away from the camera combined with points on the right hand side being moved further away as well.


Figure 6.6: Original depth data in blue, corrected in green, data from one image in the “centered1” data set after calibrating on “general1”. Same image as in Fig. 6.5 but from above. Points to the left have been moved closer to the camera.

Figure 6.7: Original depth data in blue, corrected in green. The correction has moved points further away from the camera, and points on the right hand side of the image further away than those on the left.


6.2.3 Mean image error

Figure 6.8: Mean absolute plane error [mm] per image when measuring on the “centered1” data set, after calibrating on “general1” accounting for lens distortion, before and after perspective correction, with standard deviations.

Looking at the “centered1” data set in Figure 6.8, we see that the correction makes all mean measurements more accurate. The standard deviations of the measurements are however now overlapping. We see that image 2 performs exceptionally well, with a mean depth error of about 0.5 mm. This image was captured in the middle of the frame, almost parallel to the image plane, at a distance of 0.5 m, occupying about a quarter of the area; in other words, a quite ideal placement. There is however no clear pattern in why the depth error varies as it does over the images in the “centered1” data set shown in Figure 6.8. The first two images, with the largest errors, are quite far away from the camera and parallel to it, which could simply mean that the depth camera is less accurate the further away the target is. But images 7 and 8 are equally far away, and show a much smaller uncorrected depth error. Figure 6.8 also shows that the first two difference images of Figure 6.4 are the two worst uncorrected images of the “centered1” data set. For the “sameheight1” data set in Figure 6.9 we see that the correction generally improves the situation, but not always. Images 8 and 10 become worse. They both have the pattern at the top of the image, very close to the image border. Image 10 additionally has the pattern in the top left corner of the image. The “sameheight2” data set in Figure 6.10 shows a larger gain from applying the depth undistortion than the other two highlighted data sets. This data set gives a quite even response for all images after correction. It contains images both at the left edge and the right edge of the image, without any clear outliers.


Figure 6.9: Mean absolute plane error [mm] per image when measuring on the “sameheight1” data set, after calibrating on “general1” accounting for lens distortion, before and after perspective correction, with standard deviations.

Figure 6.10: Mean absolute plane error [mm] per image when measuring on the “sameheight2” data set, after calibrating on “general1” accounting for lens distortion, before and after perspective correction, with standard deviations. Empty image numbers signify images that were discarded due to pattern finder difficulties.


6.3 2 parameter model (x, w) with lens distortion

Conducting everything as in Section 6.2 but restricting the depth distortion model to only calibrate dx and dw, keeping dy = dz = 0, we get the following results. The summary plots Fig. 6.11 and Fig. 6.12 show that we have almost equal performance to the 4 parameter case shown in Fig. 6.2 and Fig. 6.3. This can also be seen in Table 6.4 and Table 6.1. From the table of parameters in Table 6.5 we see that dx fluctuates more than in the 4 parameter case, whereas dw now fluctuates less for all calibration methods.

Figure 6.11: Mean dot error [mm] over all points in each data set, before and after perspective correction, with image mean standard deviations. Calibration data set marked with (calibration).


Figure 6.12: Mean absolute plane error [mm] over all pattern points in each data set, before and after perspective correction, with image mean standard deviations. Calibration data set marked with (calibration).

Table 6.4: Mean absolute plane error and mean dot error with standard deviation, for each data set. Calibration data set marked with *, calibrated with non-linear least squares.

data set        depth correction    dot error/mm        plane error/mm
                                    mean    std         mean    std
centered1       no                  5.3     3.7         3.8     3.0
                yes                 2.7     1.0         2.1     1.3
general1*       no                  9.7     5.1         8.7     4.5
                yes                 1.5     1.0         2.3     1.9
longplywood1    no                  4.8     3.1         3.8     2.4
                yes                 1.8     1.2         1.7     1.4
longplywood2    no                  4.3     2.5         3.8     2.3
                yes                 3.9     1.9         3.6     1.9
plywood1        no                  6.1     3.5         4.5     2.4
                yes                 2.2     0.9         2.0     0.9
sameheight1     no                  22.3    8.0         21.2    7.4
                yes                 8.4     2.6         8.3     4.1
sameheight2     no                  4.5     3.0         3.9     2.6
                yes                 1.3     0.9         1.1     0.8


Table 6.5: D parameters if intrinsics are kept from the original calibration, but distortion parameters are re-calibrated on the specified data set using the given method. Calibration data set marked by *.

data set        method        dx            dy    dz    dw
centered1       analytical    1.3 × 10^-4   0     0     1.0048
                IRLS          1.4 × 10^-4   0     0     1.0052
                non-linear    1.3 × 10^-4   0     0     1.0050
general1*       analytical    1.3 × 10^-4   0     0     1.0020
                IRLS          1.3 × 10^-4   0     0     1.0019
                non-linear    1.3 × 10^-4   0     0     1.0018
longplywood1    analytical    1.2 × 10^-4   0     0     1.0039
                IRLS          1.2 × 10^-4   0     0     1.0034
                non-linear    1.1 × 10^-4   0     0     1.0037
longplywood2    analytical    8.6 × 10^-5   0     0     1.0033
                IRLS          9.1 × 10^-5   0     0     1.0043
                non-linear    8.4 × 10^-5   0     0     1.0035
plywood1        analytical    1.3 × 10^-4   0     0     1.0062
                IRLS          1.3 × 10^-4   0     0     1.0059
                non-linear    1.2 × 10^-4   0     0     1.0063
sameheight1     analytical    1.0 × 10^-4   0     0     1.0091
                IRLS          1.1 × 10^-4   0     0     1.0104
                non-linear    1.0 × 10^-4   0     0     1.0089
sameheight2     analytical    1.2 × 10^-4   0     0     1.0017
                IRLS          1.2 × 10^-4   0     0     1.0014
                non-linear    1.2 × 10^-4   0     0     1.0018


6.4 Model comparison

In this section we summarize results in a more readable format. These are the results of the presented models as well as other combinations of the depth distortion parameters not presented above. In Table 6.6 we have mean measurement errors using both metrics. We see that the model utilizing just the dx and dw parameters is the best performing model. The full model is very close in performance, but performs slightly worse.

Table 6.6: Mean absolute plane and mean dot errors for different models calibrated with non-linear least squares, averaged over all data sets except the calibration data set. With(out) lens is short for considering lens distortion (or not).

model                               dot error/mm        plane error/mm
                                    mean    std         mean    std
with lens, all parameters           3.1     1.4         2.9     2.0
with lens, only xy                  3.3     1.5         3.0     2.0
with lens, only xz                  3.2     1.5         3.0     2.0
with lens, only xw                  3.1     1.5         2.9     2.0
with lens, only yw                  8.2     4.5         7.1     3.9
with lens, only x                   3.6     1.5         3.3     2.0
with lens, only y                   7.7     4.4         6.7     3.8
with lens, only z                   7.9     4.4         6.9     3.9
with lens, only w                   7.8     4.4         6.8     3.8
with lens, no depth correction      7.8     4.4         6.8     3.8
without lens, all parameters        9.1     4.6         7.7     3.8

A comparison with all models summarized over each individual data set is available in Table A.1 in Appendix A.

6.5 Multi camera comparison

In order to see how much of the results are reproducible on other cameras, three additional Intel RealSense SR300 cameras were tested. The results in this section are presented in a more summarized manner and with less analysis. For each camera, 6 data sets were collected. The cameras were allowed to warm up before saving images, and at least 1000 frames were discarded before capturing the data sets. Four of the data sets were taken in general position, like the “general1” data set previously described, and two at a fixed distance to the calibration pattern, with image plane approximately parallel to the pattern, like the “sameheight” data sets previously described. The data set names are prefixed with “Ci_”, i = 2, 3, 4 equal to the camera number, in order to distinguish them from the previous camera.


The data sets are thus called “Ci_generalX” with X = 1, 2, 3, 4 and “Ci_sameX” with X = 1, 2. They are unique for each camera, but share the same names as the underlying idea behind them is the same. As such, no data is shared at all between “general1” which was presented previously, and “C2_general1” for camera 2 or “C3_general1” for camera 3. The “Ci_sameX” data sets differ in at what distance the camera was held from the pattern, one at approximately 0.5 m and one at approximately 1 m. The “Ci_generalX” data sets are four different attempts at capturing the pattern in varying positions.

6.5.1 Camera 2

Camera 2 was calibrated on its “C2_general2” data set, which gave a reprojection error of 0.23 pixels. The resulting intrinsics are

K = \begin{pmatrix} 487.63 & 0 & 316.54 & 0 \\ 0 & 487.75 & 245.37 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}    (6.1)

with lens distortion coefficients

(-0.113493, -0.085626, 0.001575, 0.001248, 0.094735)    (6.2)

and depth distortion parameters

d = (0.00017984, 0.00001908, -0.00003684, 1.0393758)    (6.3)

Intel calibrated intrinsics are f_x = f_y = 475.77, c_x = 311.13, c_y = 245.84. We see that we have a quite large focal length difference, and a noticeable difference in the principal point's x coordinate. The plane error is slightly lower than the dot error shown in Fig. 6.13, but very similar in magnitude and is thus left out for brevity. The number of images in each data set varies from 12 to 34. The improvement using different models is summarized in Table 6.7. We see that the full model provides the best correction for this camera and the captured data sets, but that the absolute correction is quite low, dropping from around 6.9 mm to 4.0 mm with the full model. The majority of this correction comes from the x parameter alone, lowering the mean dot error to 4.4 mm.


Figure 6.13: Mean dot error [mm] over all points in each data set for camera 2, before and after perspective correction, with image mean standard deviations. Calibration data set marked with (calibration).

Table 6.7: Mean absolute plane and mean dot errors for different models calibrated with non-linear least squares, averaged over all data sets except the calibration data set. With(out) lens is short for considering lens distortion (or not). Reported for camera 2.

model                               dot error/mm        plane error/mm
                                    mean    std         mean    std
with lens, all parameters           2.5     1.6         2.6     1.7
with lens, only xy                  10.3    2.3         9.6     2.4
with lens, only xz                  6.2     2.2         5.8     2.2
with lens, only xw                  4.5     2.0         4.4     2.0
with lens, only yw                  16.9    7.0         14.8    5.6
with lens, only x                   10.2    2.3         9.5     2.4
with lens, only y                   17.7    7.1         15.6    5.6
with lens, only z                   17.3    7.2         15.3    5.7
with lens, only w                   17.0    7.1         14.9    5.6
with lens, no depth correction      17.8    7.1         15.7    5.6
without lens, all parameters        14.2    8.0         11.4    5.9


6.5.2 Camera 3

Camera 3 was calibrated on its “C3_general1” data set, yielding a reprojection error of 0.24 pixels. The resulting intrinsics are

K = \begin{pmatrix} 476.85 & 0 & 314.35 & 0 \\ 0 & 476.06 & 244.32 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}    (6.4)

with lens distortion parameters

(-0.129139, -0.021613, -0.001209, 0.00017, -0.005435)    (6.5)

and depth distortion parameters

d = (0.00005374, -0.00001949, 0.00000946, 0.99313368)    (6.6)

Intel calibrated intrinsics are f_x = f_y = 476.25, c_x = 313.46, c_y = 246.03. We see that we in general have a good correspondence to the Intel calibrated intrinsics, with a slightly larger deviation in the principal point's y coordinate. The plane error is slightly lower than the dot error shown in Fig. 6.14, but very similar in magnitude and is thus left out for brevity. The number of images in each data set varies from 18 to 30.

Figure 6.14: Mean dot error [mm] over all points in each data set for camera 3, before and after perspective correction, with image mean standard deviations. Calibration data set marked with (calibration).

The improvement using different models is summarized in Table 6.8. We see that the full model provides the best correction for this camera and the captured data sets, and that no two-parameter model comes close in performance.


Table 6.8: Mean absolute plane and mean dot errors for different models calibrated with non-linear least squares, averaged over all data sets except the calibration data set. With(out) lens is short for considering lens distortion (or not). Reported for camera 3.

model                               dot error/mm        plane error/mm
                                    mean    std         mean    std
with lens, all parameters           4.0     2.0         3.5     2.0
with lens, only xy                  4.3     2.0         3.8     2.0
with lens, only xz                  4.7     2.1         4.2     2.1
with lens, only xw                  4.6     2.1         4.1     2.1
with lens, only yw                  6.7     2.7         6.1     2.5
with lens, only x                   4.4     2.1         4.0     2.1
with lens, only y                   6.7     2.7         6.2     2.5
with lens, only z                   7.0     2.7         6.3     2.6
with lens, only w                   7.1     2.7         6.4     2.6
with lens, no depth correction      7.0     2.7         6.3     2.6
without lens, all parameters        11.1    6.9         9.4     5.6

6.5.3 Camera 4

Camera 4 was calibrated on its “C4_general1” data set, yielding a reprojection error of 0.18 pixels. The resulting intrinsics are

K = \begin{pmatrix} 476.82 & 0 & 311.17 & 0 \\ 0 & 476.76 & 240.12 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}    (6.7)

with lens distortion parameters

(-0.117859, -0.073007, 0.001899, 0.000341, 0.055911)    (6.8)

and depth distortion parameters

d = (0.00004329, 0.00000009, 0.00000114, 1.0045781)    (6.9)

Intel calibrated intrinsics are f_x = f_y = 476.7, c_x = 312.2, c_y = 245.96, i.e. we have a noticeable difference in the principal point's y coordinate. The plane error is slightly lower than the dot error shown in Fig. 6.15, but similar in magnitude and is thus left out for brevity. The number of images in each data set varies from 21 to 45. The improvement using different models is summarized in Table 6.9. We see that the full model provides the best correction for this camera and the captured data sets. As with the first camera analyzed, the two-parameter xw model provides the same level of correction.


Figure 6.15: Mean dot error [mm] over all points in each data set for camera 4, before and after perspective correction, with image mean standard deviations. Calibration data set marked with (calibration).


Table 6.9: Mean absolute plane and mean dot errors for different models calibrated with non-linear least squares, averaged over all data sets except the calibration data set. With(out) lens is short for considering lens distortion (or not). Reported for camera 4.

model                               dot error/mm        plane error/mm
                                    mean    std         mean    std
with lens, all parameters           2.5     1.3         2.7     1.9
with lens, only xy                  3.5     1.4         4.0     2.0
with lens, only xz                  2.8     1.4         2.7     1.9
with lens, only xw                  2.5     1.3         2.7     1.9
with lens, only yw                  4.1     2.0         4.2     2.3
with lens, only x                   3.7     1.2         4.2     2.0
with lens, only y                   4.5     1.7         4.8     2.2
with lens, only z                   3.8     1.7         4.1     2.2
with lens, only w                   3.5     1.6         3.8     2.1
with lens, no depth correction      4.5     1.7         4.7     2.2
without lens, all parameters        10.5    5.9         9.1     4.9

Chapter 7

Discussion and conclusions

7.1 Discussion

In this section we mention concerns of validity and considerations of possible issues in the work. These considerations also give rise to suggestions for further work, which are presented at the end.

7.1.1 Lens distortion

This work has not paid much attention to distortions caused by the lens or by imperfect determination of the camera intrinsics. The primary reason for this is uncertainty as to what effects it would have on the depth, since such a correction affects the way depth is calculated. The IR image and depth image have clear lens distortions present. A naive correction of the distortion, by remapping the infrared and depth data with distortion parameters found from calibration of the infrared camera, does make the distortions smaller in the infrared image, but the validity of the depth data is uncertain. A possible future work could be to study this effect and propose a way of rectifying it after the depth calculation. However, one could hope that the error from handling lens distortion as presented is small, due to the fact that librealsense released by Intel does a similar thing.

7.1.2 Parameter stability

It has been observed that the distortion parameters fluctuate quite a bit even for the same camera, and so far the reasons are unknown. What does seem to hold, however, is that the parameters calibrated on one data set improve the depth data for most other data sets. Whether this remains true under varying temperatures, ambient illumination and other possible factors affecting the camera has not been studied. During the various data set captures the ambient illumination and the room temperature have been approximately constant, but they could still have an impact. The fluctuations will of course to a degree also vary with the data. Data sets with points spanning the entire measurement volume should logically provide a much more general adaptation of the depth distortion parameters than data sets involving multiple near-degenerate cases, such as groups of planes in approximately the same location and orientation.

To the author's knowledge, Intel has not released any public information on how the IR projector in the RealSense SR300 actually works. However, people dismantling and studying it have stated that the Intel RealSense SR300 seems to use a very small mirror, vibrating with a frequency in the kHz range [5], combined with a line laser as the IR projector. The mirror seems to vibrate in the same direction as the baseline, as seen in [15]. An error source could here be a slight mistiming of the circuitry, causing the pattern to be projected with a slight angular error compared to what was expected. Considering the construction of the camera, a plausible explanation for the parameter instability is that something occurs inside the camera between captures. The laser projector in the camera has a moving mirror, which could affect the measurements depending on the orientation of the camera or movements of the camera while capturing. Small timing fluctuations between the IR camera shutter and the laser projector could also influence the measurements and thus the parameters. For two of the tested cameras the d_x and d_w parameters are the most stable of the 4 parameters and offer equal performance to using the complete model. This could indicate that there are indeed issues with the measured baseline or timing errors with the laser, or an issue with the calibrated principal point in the x direction, as these effects are expected to show up in these parameters.

Calibrating with the non-linear optimization seems to be less prone to fluctuations between data sets, but they still occur. The results of the non-linear optimization are usually very similar to those of the linear least squares and the iteratively reweighted least squares. The non-linear optimization seems to perform better than the other methods in almost degenerate cases, such as when the combined set of all points in a data set almost forms a plane.

7.1.3 Error metrics

All of the dense plane errors are taken along the Z-coordinate in the world coordinate system, as explained in Section 5.5.2. In other words, how large a part of this error that actually is depth error varies with the angle to the pattern. The pattern is however not imaged at angles larger than 45 degrees, which somewhat limits the impact of this error source. This problem exists for the dense plane error, but not for the sparse dot error metric. The sparse dot error takes the 3-dimensional error into account, as explained in Section 5.5.1. It therefore does not have the same problematic dependence on the angle to the imaged pattern.


7.1.4 Measurement error

No weighting of the measurements has been done depending on their distance to the camera. This would potentially be beneficial, since the depth calculation depends on triangulation, and the precision of triangulation naturally decreases the further away the measured object is from the sensor. It is also quite possible that the printed pattern has an uneven error, or a larger error along one axis, leading to a bias in all measurements. The IR image is also regarded as perfect in comparison to the depth image, and the model is adapted to make the depth data correspond with the IR image. The IR image is however also affected by noise, which can bias the pattern detection and the further calibration of the model. Some close-up pictures of the pattern have also been removed, due to the uneven illumination from the laser causing detected marker positions that are visibly off from the marker centers. It has not been studied to what extent the reprojection errors vary in each image, or whether there is any clear bias.

7.2 Future work

We suggest the following extensions of the work:

Physical interpretation: As mentioned earlier in the report, the proposed parameters have a purely empirical derivation to account for detected errors. Ideas for their possible underlying cause have been shared, but their true physical interpretation is not clear. This would be of interest for a better understanding of what is being corrected for.

Parameter stability: The parameters vary between data sets for an unknown reason. A further study, possibly coupled tightly with the physical interpretation, would be to gauge the outcome more thoroughly and find an optimal parameter set for a given camera.

Depth correction comparison: There have been several published articles ([13, 17, 2, 26] to name a few) correcting the data returned from depth sensors. Methods of correction include B-splines, depth-multiplier images and bias correction. It would be interesting to do a comparison between the methods and the improvement of accuracy that can be achieved with each. It would also be interesting to study combinations of correction methods, and see if a combination is better than any individual method.


7.3 Conclusions

This thesis has investigated a new linear empirical method of correcting depth from a depth sensor with image information available in the same reference frame. The depth distortion correction is formulated as a restricted homography in P^3 with only 4 variable parameters. It is based around the idea of altering only the measured depth of pixels, but not their positions in the image, while keeping planes planar. Determination of the depth distortion parameters requires at least 4 points not all positioned in the same plane. The method was tested on images and depth data from Intel RealSense SR300 cameras. The results indicate that calibrating once and using the result for further recordings provides a noticeable improvement in the accuracy of the depth data, without requiring adjustments after the initial calibration. The average uncorrected measurement error over the collected data sets for the primary camera is 7.8 ± 4.4 mm, and around 3.1 ± 1.4 mm when corrected. Utilizing only the d_x and d_w parameters of the model provides similar correction, showing that 2 parameters are sufficient for this camera. Testing on additional cameras revealed that the reduced model only works on some cameras, but that the full model corrected measurements from 17.8 ± 7.1 mm to 2.5 ± 1.6 mm, 6.9 ± 2.7 mm to 4.0 ± 2.0 mm and 4.5 ± 1.7 mm to 2.5 ± 1.3 mm. This indicates that the accuracy of the cameras varies quite widely, despite them being of the same model.

Appendix A

Numerical results camera 1

Summary statistics for all models over all data sets for camera 1 can be seen in Table A.1. These results are also averaged over all data sets in Table 6.6.

Table A.1: Mean absolute dense and point errors for different models calibrated with non-linear least squares. With(out) lens is short for considering lens distortion (or not). Calibration data set marked with *.

                                                  dot error/mm      plane error/mm
data set       model                              mean    std       mean    std

centered1      with lens, all parameters           2.8    1.0        2.2    1.3
               with lens, only xy                  3.1    1.0        2.5    1.3
               with lens, only xz                  2.7    1.0        2.2    1.3
               with lens, only xw                  2.7    1.0        2.1    1.3
               with lens, only yw                  5.7    4.1        4.3    3.2
               with lens, only x                   3.3    1.0        2.6    1.3
               with lens, only y                   5.2    3.7        3.8    2.9
               with lens, only z                   5.4    3.9        4.0    3.1
               with lens, only w                   5.3    3.8        3.9    3.0
               with lens, no depth correction      5.3    3.7        3.8    3.0
               without lens, all parameters        6.5    3.4        4.4    2.1

general1*      with lens, all parameters           1.4    1.0        2.3    1.9
               with lens, only xy                  1.6    1.1        2.4    2.0
               with lens, only xz                  1.5    1.0        2.4    2.0
               with lens, only xw                  1.5    1.0        2.3    1.9
               with lens, only yw                  9.7    5.1        8.7    4.5
               with lens, only x                   1.7    1.1        2.6    2.0
               with lens, only y                   9.7    5.1        8.6    4.5
               with lens, only z                   9.7    5.1        8.7    4.5
               with lens, only w                   9.7    5.1        8.7    4.5
               with lens, no depth correction      9.7    5.1        8.7    4.5
               without lens, all parameters        6.7    3.7        5.7    3.2

longplywood1   with lens, all parameters           1.7    1.3        1.6    1.4
               with lens, only xy                  2.2    1.4        2.1    1.5
               with lens, only xz                  1.9    1.2        1.8    1.4
               with lens, only xw                  1.8    1.2        1.7    1.4
               with lens, only yw                  5.0    3.1        4.0    2.4
               with lens, only x                   2.3    1.3        2.2    1.5
               with lens, only y                   4.7    3.1        3.8    2.4
               with lens, only z                   4.8    3.1        3.8    2.3
               with lens, only w                   4.8    3.1        3.9    2.4
               with lens, no depth correction      4.8    3.1        3.8    2.4
               without lens, all parameters        8.4    3.8        7.5    3.3

longplywood2   with lens, all parameters           3.9    1.9        3.6    1.9
               with lens, only xy                  4.3    2.0        4.0    2.0
               with lens, only xz                  4.0    1.9        3.7    1.9
               with lens, only xw                  3.9    1.9        3.6    1.9
               with lens, only yw                  4.3    2.7        3.8    2.4
               with lens, only x                   4.6    2.0        4.2    2.0
               with lens, only y                   4.5    2.7        3.9    2.4
               with lens, only z                   4.2    2.5        3.6    2.2
               with lens, only w                   4.2    2.5        3.7    2.2
               with lens, no depth correction      4.3    2.5        3.8    2.3
               without lens, all parameters        5.9    3.2        5.3    3.0

plywood1       with lens, all parameters           1.8    0.9        1.7    0.9
               with lens, only xy                  3.0    1.0        2.5    1.0
               with lens, only xz                  2.5    0.9        2.2    0.9
               with lens, only xw                  2.2    0.9        2.0    0.9
               with lens, only yw                  7.0    3.7        5.2    2.4
               with lens, only x                   3.0    0.9        2.6    1.0
               with lens, only y                   6.1    3.5        4.5    2.4
               with lens, only z                   6.2    3.5        4.6    2.4
               with lens, only w                   6.2    3.5        4.6    2.4
               with lens, no depth correction      6.1    3.5        4.5    2.4
               without lens, all parameters        5.6    1.7        4.5    1.3

sameheight1    with lens, all parameters           8.3    2.5        8.2    4.0
               with lens, only xy                  7.6    2.4        7.7    4.0
               with lens, only xz                  8.3    2.6        8.3    4.1
               with lens, only xw                  8.4    2.6        8.3    4.1
               with lens, only yw                 23.4    8.0       22.2    7.4
               with lens, only x                   9.1    2.5        9.0    4.1
               with lens, only y                  22.1    7.9       21.1    7.4
               with lens, only z                  22.7    8.0       21.6    7.5
               with lens, only w                  22.4    8.0       21.4    7.4
               with lens, no depth correction     22.3    8.0       21.2    7.4
               without lens, all parameters       20.7    7.7       18.1    6.3

sameheight2    with lens, all parameters           1.5    0.8        1.3    0.8
               with lens, only xy                  1.4    1.0        1.2    0.9
               with lens, only xz                  1.3    0.9        1.1    0.9
               with lens, only xw                  1.3    0.9        1.1    0.8
               with lens, only yw                  4.4    3.0        3.8    2.6
               with lens, only x                   1.4    0.9        1.2    0.9
               with lens, only y                   4.4    3.0        3.8    2.6
               with lens, only z                   4.4    3.0        3.8    2.6
               with lens, only w                   4.4    3.0        3.8    2.6
               with lens, no depth correction      4.5    3.0        3.9    2.6
               without lens, all parameters        7.4    4.5        6.5    3.8


Appendix B

Depth distortion properties

We would like to show that the depth distortion keeps planes planar and only alters the depth of points while retaining their image x, y positions. The plane may change orientation and position in space, as long as all points that previously lay in a plane still lie in a plane after the transformation.

We work in $\mathbb{P}^3$, with world points $\mathbf{X} = (X, Y, Z, W)^T$. The only transformations of this space that keep planes planar are $4 \times 4$ homographies; this is because lines remain lines under a homography, see [9, p. 64], which follows from the results in the two-dimensional case on [9, p. 33]. As such, we only need to consider transformations that are $4 \times 4$ invertible matrices. We now want to see which $4 \times 4$ matrices yield the same image points under the homogeneous pinhole camera projection $K$. We have

$$K = \begin{pmatrix} f_x & s & c_x & 0 \\ 0 & f_y & c_y & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix} \tag{B.1}$$

and the general $4 \times 4$ distortion matrix

$$D = \begin{pmatrix} d_{11} & d_{12} & d_{13} & d_{14} \\ d_{21} & d_{22} & d_{23} & d_{24} \\ d_{31} & d_{32} & d_{33} & d_{34} \\ d_{41} & d_{42} & d_{43} & d_{44} \end{pmatrix}.$$

We want the projection of a world point $\mathbf{X}$ to look the same with and without this distortion, i.e. $K\mathbf{X} \sim KD\mathbf{X}$. The imaged point follows $\mathbf{x} = (x, y, d, w)^T$, since the projection matrix turns depth into disparity. The combined transformation matrix is

$$KD = \begin{pmatrix}
f_x d_{11} + s d_{21} + c_x d_{31} & f_x d_{12} + s d_{22} + c_x d_{32} & f_x d_{13} + s d_{23} + c_x d_{33} & f_x d_{14} + s d_{24} + c_x d_{34} \\
f_y d_{21} + c_y d_{31} & f_y d_{22} + c_y d_{32} & f_y d_{23} + c_y d_{33} & f_y d_{24} + c_y d_{34} \\
d_{41} & d_{42} & d_{43} & d_{44} \\
d_{31} & d_{32} & d_{33} & d_{34}
\end{pmatrix} \tag{B.2}$$


Comparing the action of Eq. (B.1) and Eq. (B.2) on a world point $\mathbf{X}$ element by element, $K\mathbf{X} \sim KD\mathbf{X}$, we get

1st element, x

$$\begin{aligned} X f_x + Y s + Z c_x \sim{} & X \left( f_x d_{11} + s d_{21} + c_x d_{31} \right) \\ & + Y \left( f_x d_{12} + s d_{22} + c_x d_{32} \right) \\ & + Z \left( f_x d_{13} + s d_{23} + c_x d_{33} \right) \\ & + W \left( f_x d_{14} + s d_{24} + c_x d_{34} \right), \end{aligned}$$

where $\sim$ denotes equality up to a multiplicative scale factor (but not up to an additive bias), since we are dealing with homogeneous coordinates. For this relation to hold for a general point $\mathbf{X}$ and camera matrix $K$, every $d_{ij}$ that is multiplied by a variable not present on the left-hand side must equal zero. We thus require $d_{21} = d_{31} = d_{12} = d_{32} = d_{13} = d_{23} = d_{14} = d_{24} = d_{34} = 0$. This yields

$$X f_x + Y s + Z c_x \sim X f_x d_{11} + Y s d_{22} + Z c_x d_{33},$$

where it is clear that we require $d_{11} = d_{22} = d_{33}$ for the relation to hold for all points $\mathbf{X}$ without modifying the image information:

$$X f_x + Y s + Z c_x \sim d_{11}\left( X f_x + Y s + Z c_x \right).$$

2nd element, y

$$\begin{aligned} Y f_y + Z c_y \sim{} & X \left( f_y d_{21} + c_y d_{31} \right) \\ & + Y \left( f_y d_{22} + c_y d_{32} \right) \\ & + Z \left( f_y d_{23} + c_y d_{33} \right) \\ & + W \left( f_y d_{24} + c_y d_{34} \right) \end{aligned}$$

provides no new information compared to the x analysis. Removing the entries that need to be zero gives

$$Y f_y + Z c_y \sim Y f_y d_{22} + Z c_y d_{33}$$

and the same constraint as before for the two remaining distortion matrix entries.

3rd element, the disparity d

We do want to modify the disparity in order to correct it, so the following relation does not need to hold:

$$W \sim X d_{41} + Y d_{42} + Z d_{43} + W d_{44}.$$

We can see that keeping $d_{4i}$, $i = 1, 2, 3, 4$, variable gives us the most degrees of freedom for modifying the disparity.

4th element, w

$$Z \sim X d_{31} + Y d_{32} + Z d_{33} + W d_{34}$$ provides no new information compared to the x and y analysis.

Results

The matrix $D$ now takes the form

$$D = \begin{pmatrix} d_{11} & 0 & 0 & 0 \\ 0 & d_{11} & 0 & 0 \\ 0 & 0 & d_{11} & 0 \\ d_{41} & d_{42} & d_{43} & d_{44} \end{pmatrix},$$

which corresponds to the matrix form used in this thesis. Since the matrix is homogeneous it is only defined up to a scale factor, so we can set $d_{11} = 1$ without loss of generality.
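As a quick numerical sanity check of this result (a sketch with arbitrary example intrinsics and parameter values, not an experiment from the thesis), the snippet below verifies that the restricted D leaves the projected pixel position untouched and only changes the disparity:

```python
import numpy as np

# Homogeneous pinhole projection mapping (X, Y, Z, W) to (x, y, d, w) as in
# Eq. (B.1); the intrinsic values are arbitrary example numbers.
fx, fy, s, cx, cy = 600.0, 600.0, 0.0, 320.0, 240.0
K = np.array([[fx,  s,  cx, 0.0],
              [0.0, fy, cy, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])

# Restricted distortion: identity except for the free last row (d11 = 1).
D = np.eye(4)
D[3] = [1e-4, -2e-4, 5e-5, 1.02]        # example (d41, d42, d43, d44)

X = np.array([0.3, -0.1, 1.5, 1.0])     # a homogeneous world point

p = K @ X                               # (x, y, d, w) without distortion
q = K @ D @ X                           # (x, y, d, w) with distortion

assert np.allclose(p[:2] / p[3], q[:2] / q[3])   # pixel position unchanged
print("disparity changes from", p[2] / p[3], "to", q[2] / q[3])
```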


Appendix C

Ethical considerations

Camera surveillance is ever increasing in popularity and finds more and more uses in everyday life, in areas such as protecting property, protecting individuals and analyzing production. Stores and banks use cameras to ensure that potential burglars are deterred, or caught on film so that they can be identified after a crime has been committed. Commuter transit systems use cameras to monitor for crimes as well as to prevent people from ending their lives in front of trains. Industry and factories use cameras to make sure that a product looks as expected and has not been damaged along the way, or to enable robotic sorting of items. These are just a few examples of where cameras are in use. Most of these situations would be assisted by extending the image with depth information, but widespread adoption has so far not happened, likely due to the increased cost as well as possible technical issues of depth cameras in comparison to normal cameras.

When cameras offer a possible intrusion into people's private lives, however, some considerations need to be made. The balance between security and privacy is a delicate one: ideally, no one should be tracked except those who would hurt others, damage property or break the law severely enough to warrant tracking their movements. As artificial intelligence applied to image understanding is on the rise, these issues only grow larger. A human is perhaps no longer watching the footage directly; instead a computer labels what is happening, tracks individuals and searches through enormous amounts of data at speeds that are not possible for a human.

Regarding the ethical aspects of this work, the direct consequences are minimal to non-existent. Using calibrated cameras instead of uncalibrated ones is beneficial whenever metric measurements and general distortion correction are wanted, which can be utilized in the reconstruction of physical objects.

From a societal perspective, this work is primarily of interest for people conducting measurements using cameras. Surveillance is one mentioned area where measurements are of interest, but industrial settings such as production facilities or waste-sorting plants are other examples. However, augmented reality is on the rise and being used by the general public. While accurate measurements may not

be the primary goal of augmented reality, there might come a point in the near future where increased accuracy is desired. This could broaden the scope of parties interested in this thesis.

We see no direct issues with this thesis from a sustainability perspective; rather, we see advantages. Cameras and similar sensors for measuring depth already exist, and there are no signs that the use of cameras will decrease, rather the contrary. Given this assumption, it should be better that each camera measures as accurately as possible, reducing the need to complement it with additional sensors. A possible issue could be a positive feedback effect, increasing demand for cameras and competing with other sensors that might have a lower environmental footprint. Should this occur, however, there are most likely other issues stopping the competing sensor from reaching market adoption, such as cost or ease of use. This thesis thus poses no issues in terms of sustainability or economics.

For civilian or industrial uses, the primary risk of this work is that someone can now take a more accurate measurement from an image or an image sequence thanks to the added calibration. But methods of calibrating a camera have been known for years and have already been adapted to situations where privacy intrusion could occur, such as metric reconstruction of a person's body from surveillance (RGB-D) footage. So although this work aids in accurate calibration for more accurate measurements, which could aid a dubious application, it does not in general pose any ethical issues of an environmental, scientific or societal character.

Bibliography

[1] Motilal Agrawal and Larry S Davis. Camera calibration using spheres: A semi-definite programming approach. In Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on, pages 782–789. IEEE, 2003.

[2] Filippo Basso, Alberto Pretto, and Emanuele Menegatti. Unsupervised intrinsic and extrinsic calibration of a camera-depth sensor couple. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pages 6244–6249. IEEE, 2014.

[3] Amira Belhedi, Steve Bourgeois, Vincent Gay-Bellile, Patrick Sayd, Alberto Bartoli, and Kamel Hamrouni. Non-parametric depth calibration of a ToF camera. In Image Processing (ICIP), 2012 19th IEEE International Conference on, pages 549–552. IEEE, 2012.

[4] G. Bradski. The OpenCV library. Dr. Dobb's Journal of Software Tools, 2000.

[5] Sinjin Dixon-Warren (Chipworks). Inside the Intel RealSense gesture camera. http://www.chipworks.com/about-chipworks/overview/blog/inside-the-intel-realsense-gesture-camera. Accessed 2016-10-31.

[6] Ankur Datta, Jun-Sik Kim, and Takeo Kanade. Accurate camera calibration using iterative refinement of control points. In Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on, pages 1201–1208. IEEE, 2009.

[7] Damien Douxchamps and Kunihiro Chihara. High-accuracy and robust localization of large control markers for geometric camera calibration. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(2):376–383, 2009.

[8] Jason Geng. Structured-light 3D surface imaging: a tutorial. Adv. Opt. Photon., 3(2):128–160, Jun 2011.

[9] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.

[10] Janne Heikkilä. Geometric camera calibration using circular control points. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(10):1066–1077, 2000.


[11] Janne Heikkilä and Olli Silvén. A four-step camera calibration procedure with implicit image correction. In Computer Vision and Pattern Recognition, 1997. Proceedings., 1997 IEEE Computer Society Conference on, pages 1106–1112. IEEE, 1997.

[12] Jussi Heikkinen and Keijo Inkilä. The comparison of single view calibration methods. In SPIE Optical Metrology, pages 80850Q–80850Q. International Society for Optics and Photonics, 2011.

[13] C Herrera, Juho Kannala, Janne Heikkilä, et al. Joint depth and color camera calibration with distortion correction. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(10):2058–2064, 2012.

[14] Intel. librealsense. https://github.com/IntelRealSense/librealsense. Accessed 2016-10-31.

[15] Tim Duncan (Intel). Can your webcam do this? Exploring the Intel RealSense 3D camera (F200). https://software.intel.com/en-us/blogs/2015/01/26/can-your-webcam-do-this. Accessed: 2016-10-31.

[16] Shahram Izadi, David Kim, Otmar Hilliges, David Molyneaux, Richard Newcombe, Pushmeet Kohli, Jamie Shotton, Steve Hodges, Dustin Freeman, Andrew Davison, et al. KinectFusion: Real-time 3D reconstruction and interaction using a moving depth camera. In Proceedings of the 24th annual ACM symposium on User Interface Software and Technology, pages 559–568. ACM, 2011.

[17] Jiyoung Jung, Joon-Young Lee, Yekeun Jeong, and In So Kweon. Time-of-flight sensor calibration for a color and depth camera pair. IEEE Trans. Pattern Anal. Mach. Intell., 37(7):1501–1513, Jul 2015.

[18] Kourosh Khoshelham and Sander Oude Elberink. Accuracy and resolution of Kinect depth data for indoor mapping applications. Sensors, 12(12):1437–1454, Feb 2012.

[19] Marvin Lindner, Ingo Schiller, Andreas Kolb, and Reinhard Koch. Time-of-flight sensor calibration for accurate range sensing. Computer Vision and Image Understanding, 114(12):1318–1328, Dec 2010.

[20] John Mallon and Paul F Whelan. Precise radial un-distortion of images. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, volume 1, pages 18–21. IEEE, 2004.

[21] Iason Oikonomidis, Manolis Lourakis, and Antonis Argyros. Evolutionary quasi-random search for hand articulations tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3422–3429, 2014.


[22] Hanhoon Park and Jong-Il Park. Linear camera calibration from a single view of two concentric semicircles for augmented reality applications. In Electronic Imaging 2005, pages 353–361. International Society for Optics and Photonics, 2005.

[23] Carolina Raposo, Joao P Barreto, and Urbano Nunes. Fast and accurate calibration of a Kinect sensor. In 3D Vision-3DV 2013, 2013 International Conference on, pages 342–349. IEEE, 2013.

[24] Andrew Richardson, Johannes Strom, and Edwin Olson. AprilCal: Assisted and repeatable camera calibration. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), November 2013.

[25] Davide Scaramuzza and Friedrich Fraundorfer. Visual odometry [tutorial]. Robotics & Automation Magazine, IEEE, 18(4):80–92, 2011.

[26] Alex Teichman, Stephen Miller, and Sebastian Thrun. Unsupervised intrinsic calibration of depth sensors via SLAM. In Robotics: Science and Systems. Citeseer, 2013.

[27] Jianhua Wang, Fanhuai Shi, Jing Zhang, and Yuncai Liu. A new calibration model of camera lens distortion. Pattern Recognition, 41(2):607–615, Feb 2008.

[28] Zhuo Wang. Pin-hole modelled camera calibration from a single image. 2009.

[29] Wikipedia. Iteratively reweighted least squares — Wikipedia, the free encyclopedia. https://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares. Online, Accessed: 2016-10-28.

[30] Wei Xiang, Christopher Conly, Christopher D McMurrough, and Vassilis Athitsos. A review and quantitative comparison of methods for Kinect calibration. In Proceedings of the 2nd International Workshop on Sensor-based Activity Recognition and Interaction, page 3. ACM, 2015.

[31] Zhengyou Zhang. A flexible new technique for camera calibration. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(11):1330–1334, 2000.

[32] Zhengyou Zhang. Camera calibration with one-dimensional objects. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 26(7):892–899, 2004.

[33] Yinqiang Zheng and Yuncai Liu. The projective equation of a circle and its application in camera calibration. In Pattern Recognition, 2008. ICPR 2008. 19th International Conference on, pages 1–4. IEEE, 2008.

[34] Qian-Yi Zhou and Vladlen Koltun. Simultaneous localization and calibration: Self-calibration of consumer depth cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 454–460, 2014.
