DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2017

Visual-Inertial Odometry for Autonomous Ground Vehicles

AKSHAY KUMAR BURUSA

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION

Visual-Inertial Odometry for Autonomous Ground Vehicles

Akshay Kumar Burusa [email protected]

Master's Thesis at the School of Computer Science and Communication
KTH Supervisor: John Folkesson
Scania Supervisor: Zhan Wang
Examiner: Patric Jensfelt

September 2017

Abstract

Monocular cameras are prominently used for estimating the motion of Unmanned Aerial Vehicles. With growing interest in autonomous vehicle technology, the use of monocular cameras in ground vehicles is on the rise. This is especially favorable for localization in situations where the Global Navigation Satellite System (GNSS) is unreliable, such as open-pit mining environments. However, most monocular camera based approaches suffer from obscure scale information. Ground vehicles impose a greater difficulty due to high speeds and fast movements. This thesis aims to estimate the scale of monocular vision data by using an inertial sensor in addition to the camera. It is shown that the simultaneous estimation of pose and scale in autonomous ground vehicles is possible by the fusion of visual and inertial sensors in an Extended Kalman Filter (EKF) framework. However, the convergence of scale is sensitive to several factors including the initialization error. An accurate estimation of scale allows the accurate estimation of pose. This facilitates the localization of ground vehicles in the absence of GNSS, providing a reliable fall-back option.

Sammanfattning

Monocular cameras are often used for motion estimation of unmanned aerial vehicles. With the increased interest in autonomous vehicles, the use of monocular cameras in ground vehicles has also grown. This is particularly advantageous in situations where satellite navigation (Global Navigation Satellite System, GNSS) is unreliable, for example in open-pit mines. Most systems that use monocular cameras have problems estimating scale. This estimation becomes even harder because of a ground vehicle's higher speeds and faster movements. The aim of this degree project is to estimate the scale of image data from a monocular camera by complementing it with data from inertial sensors. It is shown that simultaneous estimation of position and scale for a vehicle is possible through the fusion of image and inertial sensor data using an Extended Kalman Filter (EKF). The convergence of the estimate depends on several factors, including the initialization error. An accurate estimate of the scale also enables an accurate estimate of the position. This enables localization of vehicles in the absence of GNSS and thereby offers increased redundancy.

Acknowledgements

This thesis was completed at the department of Autonomous Transport Solution, REPF, at Scania CV AB.

I would like to thank Zhan Wang for providing me the opportunity to work on this thesis. I thank him for his continuous guidance and for giving me great insight regarding how to approach problems. I thank Lars Hjorth for supporting and encouraging me to complete the thesis. I thank Bengt Boberg for providing me the necessary materials that helped me achieve a greater understanding of concepts. I also thank Patricio Valenzuela for all the interesting and significant discussions.

I would like to thank my supervisor at KTH, John Folkesson, for encouraging me and for all his suggestions which kept me focused on the important problems.

Finally, I thank my dear family and amazing friends who were very supportive of my decisions and helped me throughout my studies. This thesis would not be possible without them.

Akshay Kumar Burusa
Stockholm, September 2017

Contents

Abstract

Sammanfattning

Acknowledgements

Glossary

1 Introduction
  1.1 Problem Statement
  1.2 Objectives
  1.3 Organization

2 Background
  2.1 Quaternions
  2.2 Sensors
    2.2.1 Pose from Visual Sensor
    2.2.2 Pose from IMU
  2.3 Fusion of IMU and Vision
    2.3.1 Bayesian Inference
    2.3.2 Kalman Filter
    2.3.3 Extended Kalman Filter
    2.3.4 Error-State Kalman Filter

3 Related Work
  3.1 Visual Odometry
    3.1.1 Feature-based Approach
    3.1.2 Direct Approach
    3.1.3 Hybrid Approach
  3.2 Visual-Inertial Odometry
    3.2.1 Loosely-coupled
    3.2.2 Tightly-coupled

4 Implementation
  4.1 Coordinate Frames
  4.2 Vector Notation
  4.3 Visual-Inertial System
  4.4 Extended Kalman Filter design
    4.4.1 Continuous-time Nonlinear State Model
    4.4.2 Discrete-time Nonlinear State Model
    4.4.3 Prediction Step
    4.4.4 Measurement Model
    4.4.5 Update Step

5 Results
  5.1 Malaga Dataset
  5.2 Experimental Setup
    5.2.1 Malaga Track 7 Results
    5.2.2 Malaga Track 8 Results
  5.3 Summary

6 Conclusion

Bibliography

Appendices

A Sustainability, Ethics and Social Impact

B Detailed Results
  B.1 Malaga Track 7
  B.2 Malaga Track 8

Glossary

Pose The combination of position and orientation of an object with respect to a coordinate frame.

Odometry Use of data from sensors to estimate egomotion of an object over time. This estimation with the use of cameras is termed as visual odometry.

Scale Scale refers to the ratio between the size of estimated vehicle trajectory and the size of true trajectory.

Epipolar Geometry Projective geometry between two different views of a Camera.

Photometric Error The intensity error between a pixel in one image and its corresponding projection in another image. Pixels are matched based on intensity.

Reprojection Error The geometric error between the 2D location of a feature in an image and the corresponding projection of the 3D feature in another image.

Chapter 1

Introduction

Experiments for developing autonomous vehicles started as early as the 1920s [15]. Almost a century later, autonomous vehicle technology has come a long way and is almost ready for large-scale commercial production.

In 1995, the NavLab 5 team at Carnegie Mellon University successfully developed an autonomous vehicle [21] which drove from Pittsburgh to San Diego without much human intervention. However, further development was required to make the vehicles safe and robust in all conditions. These vehicles are required to navigate through a complex environment of urban streets. A major contribution to autonomous vehicles came from the robotic technology that was developed for the DARPA challenge [28], an autonomous vehicle challenge funded by the US Department of Defense which started in 2004. Soon all major automotive companies started adopting this technology, and today driverless vehicles have gained huge popularity, although concerns about safety still remain.

Autonomous vehicle technology is not yet a solved problem, in the sense that these vehicles are not completely ready for the roads. They fail to handle environments such as crowded streets or bad weather conditions. Regarding the safety of human life, there are several ethical issues, such as the famous “Trolley problem” [27] and how an autonomous vehicle would react under such circumstances.

1.1 Problem Statement

Autonomous vehicles rely heavily on their sensors for information about the environment. One obvious drawback here is that the failure of one sensor can result in serious consequences. This is addressed by having multiple sensors

that can perceive similar information. For example, obstacle detection can be achieved using cameras and ultrasonic sensors. The combined use of these sensors is highly desirable for higher accuracy and also to handle cases of sensor failure.

This thesis focuses on the problem of localization and odometry of autonomous ground vehicles, in particular, autonomous trucks. Global Navigation Satellite Systems (GNSS) are most commonly used for this purpose. However, GNSS is also prone to failure. It is very common for GNSS receivers to lose satellite signals in remote locations such as open-pit mining environments. A high-accuracy GNSS sensor is very expensive, and hence using multiple GNSS sensors is not practical. This implies that we require another set of sensors that can perform the same role. Lately, cameras are being used for localization and odometry in Unmanned Aerial Vehicles (UAVs).

The use of a monocular camera in ground vehicles introduces some challenges, such as fast movements and obscure scale information. This thesis investigates the possibility to simultaneously estimate vehicle position and recover scale information by combining data from a monocular camera and an Inertial Measurement Unit (IMU).

1.2 Objectives

• Investigate a robust real-time visual odometry algorithm for autonomous trucks in an open-pit mining scenario.

• Investigate the possibility of accurate scale recovery with the fusion of monocular camera and IMU.

• Analyze the algorithm on real data that closely represents an open-pit mining scenario.

1.3 Organization

Chapter 2: Background An introduction to odometry using cameras and IMU is given, followed by an overview of probabilistic state estimation and sensor fusion techniques.

Chapter 3: Related Work The several methods available for sensor fusion, particularly in visual-inertial odometry, are the focus of this chapter. Some of the state-of-the-art works are introduced.


Chapter 4: Implementation An algorithm for visual-inertial odometry for autonomous vehicles is implemented using an Extended Kalman Filter (EKF) framework.

Chapters 5, 6 : Results and Conclusion The algorithm was tested on an online dataset and the estimation of scale for monocular visual data was analyzed.


Chapter 2

Background

This chapter begins with the introduction of quaternions, which are widely used to represent 3D orientation of autonomous vehicles. This is followed by techniques of pose estimation using visual and inertial data. Estimation theory is then introduced that leads to the implemented filter. Details of implementation will follow in Chapter 4.

2.1 Quaternions

Quaternions are a set of hypercomplex numbers that are most prominently used in 3D mechanics. They are suitable for attitude representation in 3-dimensional space and are used to represent orientation in pose estimation. There are several conventions for quaternions with their own set of rules and formulas. The Hamilton convention is followed in this thesis, as explained by Kuipers [14].

A quaternion is a 4-dimensional vector in the space of quaternions H of the form,

q = qw + qxi + qyj + qzk ∈ H, (2.1) where the imaginary components i, j, k follow the rule

i2 = j2 = k2 = ijk = −1, (2.2)

ij = −ji = k, jk = −kj = i, ki = −ik = j. (2.3)

The quaternion can be represented as,


q = \begin{bmatrix} q_w \\ \mathbf{q}_v \end{bmatrix} = \begin{bmatrix} q_w \\ q_x \\ q_y \\ q_z \end{bmatrix}, \qquad (2.4)

where q_w is the real or scalar part and q_v = q_x i + q_y j + q_z k is the imaginary or vector part of the quaternion.

Properties

• Sum The quaternion sum is given by,

"p # "q # "p ± q # p ± q = w ± w = w w . (2.5) pv qv pv ± qv The sum is commutative and associative.

p ± q = q ± p, (2.6)

p ± (q ± r) = (p ± q) ± r. (2.7)

• Product The quaternion product is given by,

" p q − pT q # p ⊗ q = w w v v . (2.8) pwqv + qwpv + pv × qv The product is not commutative but it is associative.

p ⊗ q ≠ q ⊗ p, (2.9)

p ⊗ (q ⊗ r) = (p ⊗ q) ⊗ r. (2.10)

• Rotation matrix
The quaternion q_A^B between two coordinate frames represents the rotation from frame A to frame B. The rotation matrix C_{q_A^B} defined by it rotates a vector from frame B to frame A, that is, x_A = C_{q_A^B} x_B.

 2 2 2 2  qw + qx − qy − qz 2(qxqy − qwqz) 2(qxqz + qwqy)  2 2 2 2  Cq =  2(qxqy + qwqz) qw − qx + qy − qz 2(qyqz − qwqx)  (2.11) 2 2 2 2 2(qxqz − qwqy) 2(qyqz + qwqx) qw − qx − qy + qz


• Quaternion product matrices Matrix representation for multiplication of quaternion q and angular velocity ω.

\Omega(\omega) = \begin{bmatrix} 0 & -\omega_x & -\omega_y & -\omega_z \\ \omega_x & 0 & \omega_z & -\omega_y \\ \omega_y & -\omega_z & 0 & \omega_x \\ \omega_z & \omega_y & -\omega_x & 0 \end{bmatrix} \qquad (2.12)

\bar{\Omega}(q) = \begin{bmatrix} -q_x & -q_y & -q_z \\ q_w & -q_z & q_y \\ q_z & q_w & -q_x \\ -q_y & q_x & q_w \end{bmatrix}. \qquad (2.13)
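To make the conventions above concrete, the following minimal NumPy sketch (not part of the thesis implementation; function names are illustrative) evaluates the rotation matrix of Equation (2.11), the product matrix of Equation (2.12) and the Hamilton product of Equation (2.8) for quaternions stored as [w, x, y, z].

```python
import numpy as np

def rotation_matrix(q):
    """Rotation matrix C_q of Eq. (2.11) for a unit quaternion q = [qw, qx, qy, qz]."""
    qw, qx, qy, qz = q
    return np.array([
        [qw*qw + qx*qx - qy*qy - qz*qz, 2*(qx*qy - qw*qz),             2*(qx*qz + qw*qy)],
        [2*(qx*qy + qw*qz),             qw*qw - qx*qx + qy*qy - qz*qz, 2*(qy*qz - qw*qx)],
        [2*(qx*qz - qw*qy),             2*(qy*qz + qw*qx),             qw*qw - qx*qx - qy*qy + qz*qz]])

def omega(w):
    """Quaternion product matrix Omega(omega) of Eq. (2.12) for w = [wx, wy, wz]."""
    wx, wy, wz = w
    return np.array([
        [0.0, -wx, -wy, -wz],
        [wx,  0.0,  wz, -wy],
        [wy, -wz,  0.0,  wx],
        [wz,  wy, -wx,  0.0]])

def quat_multiply(p, q):
    """Hamilton quaternion product p (x) q of Eq. (2.8), both as [w, x, y, z]."""
    pw, pv = p[0], np.asarray(p[1:])
    qw, qv = q[0], np.asarray(q[1:])
    return np.concatenate(([pw*qw - pv @ qv], pw*qv + qw*pv + np.cross(pv, qv)))
```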

2.2 Sensors

Estimation of motion can be achieved by several sensors. The sensors that are of interest in this thesis are camera and Inertial Measurement Unit (IMU). The estimation of pose (position + orientation) using these sensors is discussed in this section.

2.2.1 Pose from Visual Sensor

A camera is an optical device that captures incident light rays with an optical system, and a light-sensitive electronic sensor stores this information in a digital format. This digital image is then processed by algorithms for applications such as motion estimation. Estimating the motion of a vehicle using visual cues from cameras attached to it is called Visual Odometry (VO).

Figure 2.1: General Pipeline for Incremental Structure from Motion.

Feature-based methods for visual odometry are a well-researched topic [7] [18] [20] [22] [25]. The general pipeline for these methods is shown in Figure 2.1

and explained below. It involves determining the epipolar geometry between two image frames, estimating the relative pose between the frames, followed by the optimization of the estimated pose using bundle adjustment. These steps require functions such as feature extraction and tracking, finding feature correspondences and minimization of errors. Initially, these processes required heavy computation which forced the optimization to be done offline. Klein et al. [11] proposed parallel threads for tracking and mapping which made online optimization in a bundle adjustment framework possible. Most algorithms today continue to use the parallel approach.

Visual odometry is the estimation of camera motion from visual data and Structure from Motion (SfM) involves the simultaneous estimation of motion and reconstruction of a 3D environment from visual cues. These are briefly introduced here. More details can be found in the description by Cipolla [22].

STEP 1: Image Features

Image features are uniquely identifiable areas in the image that are associated with some 3D object in the world. They should be recognizable even as the scene changes. A feature detector extracts interesting features in the image, such as corners or textured areas. Feature descriptors are vectors that are useful to differentiate between two features, that is, they can be considered as numerical fingerprints.

Together, feature detectors and descriptors help in storing the most relevant and unique information about a scene. They are also important to solve the correspondence problem, that is, identifying the same set of features between multiple views of the same scene.

STEP 2: Epipolar Geometry

This is also known as the two-view geometry which defines the relation between a 3D point and its projection onto two different image frames, as described by Figure 2.2.

The epipolar constraint can be obtained from the essential matrix E which relates the corresponding projections in two image frames. Consider a point X with respect to the center of projection C; its location X′ with respect to the center of projection C′ is

X = RX' + t, \qquad (2.14)

Figure 2.2: When the projection of a 3D point X in one image is known to be x, then the projection x′ on the second image will be restricted to the epipolar line m′. The 3D point X and the centers of projection C and C′ belong to the same plane.

where R and t are the rotation and translation matrices between the coordinates of C and C′. Pre-multiplying Equation (2.14) by X^T [t]_×¹ gives,

X^T [t]_\times R X' = 0. \qquad (2.15)

So, the essential matrix is defined as E = [t]×R, which implies

X^T E X' = 0. \qquad (2.16)

This is also true for the projected image points, which gives

x^T E x' = 0. \qquad (2.17)

¹ [a]_\times is the skew-symmetric matrix. For a = (a_x\ a_y\ a_z)^T, \; [a]_\times = \begin{bmatrix} 0 & -a_z & a_y \\ a_z & 0 & -a_x \\ -a_y & a_x & 0 \end{bmatrix}


The essential matrix applies only to calibrated cameras. If the cameras are not calibrated, the intrinsic parameters of the cameras are required to determine the epipolar geometry, which is then given by the fundamental matrix F.

F = K'^{-T} E K^{-1}, \qquad (2.18)

where K and K′ are the camera calibration matrices.

Once the epipolar geometry is determined, it is possible to estimate the relative pose (rotation and translation) between two camera frames when a set of feature correspondences are known.

STEP 3: Motion Reconstruction

The relative pose between two frames can be obtained from the singular value decomposition of the essential matrix associated with the two frames, as proposed by Hartley [9],

E = U\Sigma V^T, \qquad (2.19)

where U and V are orthogonal matrices and Σ is a rectangular diagonal matrix. The rotation R and translation t are given by,

0 −1 0   T R = U 1 0 0 V , (2.20) 0 0 1  0 1 0   T [t]× = U −1 0 0 U . (2.21) 0 0 0

The camera projection matrix, which describes the mapping of a camera from 3D world points to 2D image points, can be obtained from the rotation and translation matrices. Considering the projection matrix P associated with first frame to be

P = K[I \mid 0], \qquad (2.22)

the projection matrix P′ associated with the second frame is

P' = K'[R \mid t]. \qquad (2.23)


There are four possible values for P′ due to the choice of signs for R and t. One way to determine the best solution is to triangulate the four solutions and calculate the reprojection error.
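As a hedged illustration of Equations (2.19)–(2.21), the NumPy sketch below decomposes an essential matrix into the four candidate (R, t) pairs; selecting the physically valid pair (e.g. by triangulation and a cheirality check) is left to the caller, and the helper name is hypothetical.

```python
import numpy as np

def decompose_essential(E):
    """Return the four candidate (R, t) pairs from E = U Sigma V^T (Eqs. 2.19-2.21)."""
    U, _, Vt = np.linalg.svd(E)
    # Enforce proper rotations (determinant +1).
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0,  0.0, 0.0],
                  [0.0,  0.0, 1.0]])
    R1 = U @ W @ Vt
    R2 = U @ W.T @ Vt
    t = U[:, 2]          # translation direction, known only up to sign and scale
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]
```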

STEP 4: Triangulation and Reprojection error

Triangulation refers to calculating the location of a 3D point in space using the projections of the point onto two or more images. In an ideal case, the 3D point will lie on the intersection of the back-projected rays. However, almost always these rays do not intersect at the true 3D point due to the presence of noise and measurement errors, as shown in Figure 2.3.


Figure 2.3: Best estimate of the triangulated 3D point X.

So we estimate the location of the 3D point that minimizes the reprojection error, that is, the squared error between all the estimated and measured image points. This is formulated as a minimization problem,

X = \arg\min_{X} \sum_i \|x_i - \hat{x}_i(P_i, X)\|^2. \qquad (2.24)


At the end of this step, we will have a good estimate of the relative pose between two image frames. When this process is applied to two consecutive frames of a camera, we can estimate the motion of the camera.
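A minimal sketch (assumed, not taken from the thesis) of linear (DLT) triangulation and the reprojection error of Equation (2.24); P1 and P2 are 3x4 projection matrices and x1, x2 are pixel coordinates of the same point in the two views.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point from two views; returns a 3-vector."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1]])
    _, _, Vt = np.linalg.svd(A)
    Xh = Vt[-1]                       # homogeneous solution (null space of A)
    return Xh[:3] / Xh[3]

def reprojection_error(P, X, x):
    """Squared reprojection error ||x - x_hat(P, X)||^2 as in Eq. (2.24)."""
    xh = P @ np.append(X, 1.0)
    return np.sum((x - xh[:2] / xh[2]) ** 2)
```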

STEP 5: Bundle Adjustment

Bundle adjustment optimizes the 3D geometry, the relative pose and also the intrinsic camera parameters at the same time. The optimization formulation is based on minimizing the weighted sum of squared reprojection errors.

Imagine there are ‘p’ 3D points captured from ‘q’ different views, and let xij be the ith projection on the jth frame. Then the minimization follows as,

\min_{X,P} \sum_{i=1}^{p} \sum_{j=1}^{q} b_{ij}\, \|x_{ij} - \hat{x}_{ij}(P_j, X_i)\|^2, \qquad (2.25)

where b_{ij} is a binary indicator that is '1' if point i is visible from frame j.

A local bundle adjustment applies the optimization to a set of co-visible frames, while the rest of the frames remain fixed. A global bundle adjustment optimizes all the frames and 3D points assuming only the initial frame to be fixed.

2.2.2 Pose from IMU

An Inertial Measurement Unit (IMU) measures the linear acceleration, angular velocity and magnetic field acting on a body. A combination of accelerometer, gyroscope and magnetometer is used for this purpose.

Accelerometer

An accelerometer is a sensor that measures linear accelerations. It has 3 degrees of freedom.

The measured accelerations are influenced by noise and bias. Suppose that a_t is the true acceleration, void of any external disturbance, and a_m is the measured acceleration from the sensor; then a_t and a_m are related by,

a_m = C_q^{-1}(a_t - g) + b_a - n_a, \qquad (2.26)

12 2.2. SENSORS

where b_a is the acceleration bias, n_a is the measurement noise modelled as Gaussian white noise, g is the vector corresponding to the acceleration due to gravity and C_q is the rotation matrix that records how the body is oriented with respect to a stationary coordinate system.

The acceleration bias is not constant but varies with time. It is modelled as a random walk with bias noise n_{b_a} modelled as Gaussian white noise with zero mean,

\dot{b}_a = n_{b_a}. \qquad (2.27)

Gyroscope

A gyroscope is a sensor that measures angular velocities. It has 3 degrees of freedom.

Similar to the accelerometer, a gyroscope measurement is also influenced by noise and bias. Suppose that ω_t is the true angular velocity, void of any external disturbance, and ω_m is the measured angular velocity from the sensor; then ω_t and ω_m are related by,

\omega_m = \omega_t + b_\omega - n_\omega, \qquad (2.28)

where b_ω is the angular velocity bias and n_ω is the measurement noise modelled as Gaussian white noise with zero mean.

The angular velocity bias is not constant but varies with time. It is modelled as a random walk with bias noise n_{b_ω} modelled as Gaussian white noise,

\dot{b}_\omega = n_{b_\omega}. \qquad (2.29)
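The sensor models of Equations (2.26)–(2.29) are easy to simulate; the sketch below generates biased, noisy gyroscope readings with a random-walk bias. It is only an illustration: the noise strengths, sample rate and the function name are assumptions, not values from the thesis.

```python
import numpy as np

def simulate_gyro(omega_true, dt, sigma_n=0.01, sigma_bn=1e-4, seed=0):
    """Generate biased, noisy gyro readings from true angular rates (Eqs. 2.28-2.29)."""
    rng = np.random.default_rng(seed)
    b = np.zeros(3)                                   # bias starts at zero
    measurements = []
    for w_t in omega_true:
        b = b + sigma_bn * np.sqrt(dt) * rng.standard_normal(3)  # bias random walk, Eq. (2.29)
        n = sigma_n * rng.standard_normal(3)                     # white measurement noise
        measurements.append(w_t + b - n)                         # Eq. (2.28)
    return np.array(measurements)

# Example: 100 samples of a constant 0.1 rad/s yaw rate at 100 Hz.
gyro = simulate_gyro(np.tile([0.0, 0.0, 0.1], (100, 1)), dt=0.01)
```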

Coordinate Frames

There are two different coordinate frames that need to be introduced when working with IMUs.

• Body frame (B) or IMU frame (I): The body coordinate frame is rigidly attached to the IMU. When an IMU is attached to a vehicle, this coor- dinate frame moves according to the motion of the vehicle.


• Navigation frame (N) or World frame (W): The navigation coordinate frame is firmly attached to the earth's surface. Usually, this is the location where the vehicle starts from. The motion of the vehicle is registered with respect to this frame.

Navigation Equations

The motion of the IMU is modelled with a set of navigation equations. Consider the position of the IMU in the World frame p_w^i, the velocity of the IMU in the World frame v_w^i, the attitude quaternion describing the rotation from the World frame to the IMU frame q_w^i, and the accelerometer and gyro biases b_a and b_ω. The IMU kinematics and bias model are as follows,

\dot{p}_w^i = v_w^i, \qquad (2.30)
\dot{v}_w^i = C_{q_w^i}(a_m - b_a + n_a) - g, \qquad (2.31)
\dot{q}_w^i = \tfrac{1}{2}\,\Omega(\omega_m - b_\omega + n_\omega)\, q_w^i, \qquad (2.32)
\dot{b}_a = n_{b_a}, \qquad (2.33)
\dot{b}_\omega = n_{b_\omega}. \qquad (2.34)

where C_{q_w^i} represents the rotation matrix that transforms a vector from the Inertial frame to the World frame, Ω(ω) is the quaternion product matrix as defined in Section 2.1 and g is the vector corresponding to the acceleration due to gravity. The pose of the vehicle is represented by [p_w^i, q_w^i]. The integration of these differential equations helps us estimate the pose from IMU measurements.

2.3 Fusion of IMU and Vision

The goal of sensor fusion is to estimate the state x of a system given observa- tions z from different sensors and possibly a control sequence u acting on the system.

Generally, several sensors are used to gather information of interest regarding a system. These sensor observations are influenced by noise and it is not straight-forward to determine a model that accurately maps the observations to the system state. Moreover, the noise acting on the system and uncertainty of the model are of stochastic nature.


This situation is dealt with by formulating an approximate system model. The parameters of this model are estimated using observed data such that the noise and uncertainty in the system are minimized.

Consider a general discrete-time state space system,

xk = f(x1:k−1, u1:k−1, w1:k−1), (2.35)

zk = h(x1:k, v1:k). (2.36) Here, x is the state vector describing the system, u is the control input, w is the noise related to the uncertainty in the system, z is the measurement vector, v is the measurement noise and k is the time step.

2.3.1 Bayesian Inference

Bayesian inference is an important tool for statistical estimation and integration of disparate information in an optimal way. This process uses the prior belief about the system state, combines it with the current observations and determines an optimal belief regarding the current system state.

Mathematically, this can be determined with the conditional probability den- sity p(xk|z1:k) where xk is the state at time k and z1:k are all the observations made until time k. The Markov property will be assumed here.

The probability density of the current state given the previous observations depends on the probability associated with the transition from state x_{k−1} to state x_k and the probability density of the previous state. This is the prior belief,

p(x_k \mid z_{1:k-1}) = \int p(x_k \mid x_{k-1})\, p(x_{k-1} \mid z_{1:k-1})\, dx_{k-1}. \qquad (2.37)

The probability density of the current state after the current observation has been made depends on the measurement likelihood and the prior belief,

p(x_k \mid z_{1:k}) = \frac{p(z_k \mid x_k)\, p(x_k \mid z_{1:k-1})}{p(z_k \mid z_{1:k-1})}. \qquad (2.38)

The term p(z_k \mid z_{1:k-1}) is determined by,

p(z_k \mid z_{1:k-1}) = \int p(z_k \mid x_k)\, p(x_k \mid z_{1:k-1})\, dx_k. \qquad (2.39)

The Equations (2.37) to (2.39) give us a recursive Bayesian estimation to determine the conditional probability density p(xk|z1:k).
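On a discretized state space, Equations (2.37)–(2.39) translate directly into a histogram (grid) Bayes filter. The toy 1D sketch below is only an illustration of the recursion; the random-walk transition model and Gaussian likelihood are assumed for the example and are not part of the thesis.

```python
import numpy as np

def bayes_update(prior, transition, likelihood):
    """One cycle of Eqs. (2.37)-(2.39) on a discretized state space.
    prior:      p(x_{k-1} | z_{1:k-1}),  shape (N,)
    transition: p(x_k | x_{k-1}),        shape (N, N), columns sum to 1
    likelihood: p(z_k | x_k),            shape (N,)
    """
    predicted = transition @ prior                # Eq. (2.37): prior belief
    posterior = likelihood * predicted            # numerator of Eq. (2.38)
    return posterior / posterior.sum()            # Eq. (2.39) provides the normalization

# Toy example on a 1D grid.
grid = np.linspace(0.0, 10.0, 101)
prior = np.full(grid.size, 1.0 / grid.size)                          # uniform initial belief
trans = np.exp(-0.5 * ((grid[:, None] - grid[None, :]) / 0.5) ** 2)  # random-walk motion model
trans /= trans.sum(axis=0, keepdims=True)
like = np.exp(-0.5 * ((grid - 4.0) / 1.0) ** 2)                      # measurement near x = 4
posterior = bayes_update(prior, trans, like)
```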


2.3.2 Kalman Filter

The Kalman filter is a special case of a recursive Bayesian filter for multivariate normal distributions, where the system is discrete-time linear and the noise parameters are assumed to be additive white Gaussian. The filter is named after Rudolf E. Kálmán who contributed towards its development.

The filter consists of two parts, prediction and correction. The prediction step uses the system equations to determine a prior estimate of the state at the current time step. The correction step updates the prior estimate based on observations made. The Kalman filter follows Bayesian inference and estimates the states based on the uncertainties in the system and measurement models.

Consider the system described by Equations (2.35) and (2.36). The linear discrete-time system can be described as,

xk = Ak−1xk−1 + Bk−1uk−1 + wk−1, (2.40)

zk = Hkxk + vk, (2.41) where Ak is the state transition model, Bk is the control input model and Hk is the sensor observation model.

The system and measurement noises w_k and v_k are assumed to be additive white Gaussian. The uncertainty introduced by these noises is captured by the covariance matrices Q_k and R_k respectively, where

Q_k = E[w_k w_k^T], \qquad (2.42)
R_k = E[v_k v_k^T], \qquad (2.43)

where E[X] indicates the expected value of X.

The error covariance matrix Pk captures the uncertainty in the estimated state xˆk, which is defined as

P_k = E[(x_k - \hat{x}_k)(x_k - \hat{x}_k)^T]. \qquad (2.44)

The recursive algorithm is as follows,


• Prior estimation of states: An estimate of states at time step k is determined using the system model based on the states and control input at time k − 1,

xˆk|k−1 = Ak−1xˆk−1|k−1 + Bk−1uk−1. (2.45) • Prior estimation of error covariance: The uncertainty of states varies as the system evolves due to model un- certainty and external noise. The error covariance distribution is scaled and transformed according to the state transition matrix and system noise covariance matrix, T Pk|k−1 = Ak−1Pk−1|k−1Ak−1 + Qk−1. (2.46) • Predicted measurement The observed quantity can be estimated from the current system state. This is given by, zˆk = Hkxˆk|k−1. (2.47) T The uncertainty related to this estimate is HkPkHk . • Refining the estimate with measurements: There are two distributions, the predicted measurement with mean and T variance (Hkxˆk,HkPkHk ) and the observed measurement with mean and variance (zk,Rk). The best state estimate will include these two measurements. The combination of these gives us the distribution, 0 Mean Hkxˆk|k = Hkxˆk|k−1 + Kk(zk − Hkxˆk|k−1) (2.48) T T 0 T Covariance HkPk|kHk = HkPk|k−1Hk − KkHkPk|k−1Hk , (2.49)

where K'_k depends on the predicted measurement covariance and the observed measurement covariance, given by K'_k = H_k P_{k|k-1} H_k^T (H_k P_{k|k-1} H_k^T + R_k)^{-1}.

• Update the state estimate: From Equations (2.48) and (2.49), we can determine the new state estimate and the associated error covariance as

\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k(z_k - H_k\hat{x}_{k|k-1}) \qquad (2.50)

P_{k|k} = P_{k|k-1} - K_k H_k P_{k|k-1} \qquad (2.51)
K_k = P_{k|k-1} H_k^T (H_k P_{k|k-1} H_k^T + R_k)^{-1}, \qquad (2.52)

where Kk is the Kalman gain. The algorithm is then repeated with the updated values.


Summary

The Kalman filter algorithm can be summarized as,

Prediction Step

\hat{x}_{k|k-1} = A_{k-1}\hat{x}_{k-1|k-1} + B_{k-1}u_{k-1} \qquad (2.53)
P_{k|k-1} = A_{k-1}P_{k-1|k-1}A_{k-1}^T + Q_{k-1}. \qquad (2.54)

Update Step

\tilde{z}_k = z_k - H_k\hat{x}_{k|k-1} \qquad (2.55)
K_k = P_{k|k-1}H_k^T(H_k P_{k|k-1}H_k^T + R_k)^{-1} \qquad (2.56)

\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k\tilde{z}_k \qquad (2.57)
P_{k|k} = (I - K_k H_k)P_{k|k-1}(I - K_k H_k)^T + K_k R_k K_k^T. \qquad (2.58)
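The summary above maps almost line-for-line onto code. The following is a compact NumPy sketch of Equations (2.53)–(2.58), not the thesis implementation; matrix names follow the text and the Joseph form of Equation (2.58) is used for the covariance update.

```python
import numpy as np

def kf_predict(x, P, A, B, u, Q):
    """Prediction step, Eqs. (2.53)-(2.54)."""
    x = A @ x + B @ u
    P = A @ P @ A.T + Q
    return x, P

def kf_update(x, P, z, H, R):
    """Update step, Eqs. (2.55)-(2.58)."""
    y = z - H @ x                                  # innovation
    S = H @ P @ H.T + R                            # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)                 # Kalman gain
    x = x + K @ y
    I_KH = np.eye(P.shape[0]) - K @ H
    P = I_KH @ P @ I_KH.T + K @ R @ K.T            # Joseph-form covariance update
    return x, P
```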

2.3.3 Extended Kalman Filter

An optimal solution is reached only when all noise is additive white Gaussian with zero mean and the system is perfectly linear. Real-world systems are almost always non-linear and are better represented in the form

xk = f(xk−1, uk−1) + wk−1, (2.59)

zk = h(xk) + vk. (2.60)

The general Kalman filter can be applied in this case if the system can be approximately linearized at every time step. The extended Kalman filter linearizes the system about an estimate of the current mean. Using the first order Taylor expansion for f(.) and h(.) about the mean gives,

f(x_{k-1}, u_{k-1}) \approx f(\hat{x}_{k-1}, u_{k-1}) + \underbrace{\left.\frac{\partial f(x_{k-1}, u_{k-1})}{\partial x_{k-1}}\right|_{x_{k-1}=\hat{x}_{k-1}}}_{F_{k-1}} (x_{k-1} - \hat{x}_{k-1}) \qquad (2.61)

h(x_k) \approx h(\hat{x}_k) + \underbrace{\left.\frac{\partial h(x_k)}{\partial x_k}\right|_{x_k=\hat{x}_k}}_{H_k} (x_k - \hat{x}_k). \qquad (2.62)


The prior estimate of the state is determined by the expected value of f(xk−1, uk−1) conditioned by zk−1,

E[f(x_{k-1}, u_{k-1}) \mid z_{k-1}] \approx f(\hat{x}_{k-1}, u_{k-1}) + F_{k-1}E[e_{k-1} \mid z_{k-1}], \qquad (2.63)

where e_{k-1} = x_{k-1} - \hat{x}_{k-1}. Since E[e_{k-1} \mid z_{k-1}] = 0, the prior state estimate is

\hat{x}_{k|k-1} = f(\hat{x}_{k-1|k-1}, u_{k-1}). \qquad (2.64)

The state error gives us

e_k = x_k - \hat{x}_k = f(x_{k-1}, u_{k-1}) + w_{k-1} - f(\hat{x}_{k-1}, u_{k-1}) \approx F_{k-1}e_{k-1} + w_{k-1}. \qquad (2.65)

Using Equation (2.65), we can determine the prior state covariance as,

P_{k|k-1} = E[e_k e_k^T] = F_{k-1}E[e_{k-1}e_{k-1}^T]F_{k-1}^T + E[w_{k-1}w_{k-1}^T] = F_{k-1}P_{k-1|k-1}F_{k-1}^T + Q_{k-1}. \qquad (2.66)

Summary

The extended Kalman filter algorithm can be summarized as,

Prediction Step

\hat{x}_{k|k-1} = f(\hat{x}_{k-1|k-1}, u_{k-1}) \qquad (2.67)
P_{k|k-1} = F_{k-1}P_{k-1|k-1}F_{k-1}^T + Q_{k-1}. \qquad (2.68)

Update Step

\tilde{z}_k = z_k - h(\hat{x}_{k|k-1}) \qquad (2.69)
K_k = P_{k|k-1}H_k^T(H_k P_{k|k-1}H_k^T + R_k)^{-1} \qquad (2.70)

\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k\tilde{z}_k \qquad (2.71)
P_{k|k} = (I - K_k H_k)P_{k|k-1}(I - K_k H_k)^T + K_k R_k K_k^T. \qquad (2.72)
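The same structure carries over to the EKF of Equations (2.67)–(2.72): the nonlinear models f, h and their Jacobians F, H are supplied by the caller and evaluated at the current estimate. The sketch below is a generic illustration only, not the filter implemented in this thesis.

```python
import numpy as np

def ekf_step(x, P, u, z, f, h, F_jac, H_jac, Q, R):
    """One predict/update cycle of the extended Kalman filter (Eqs. 2.67-2.72)."""
    # Prediction with the nonlinear model and its Jacobian at the previous estimate.
    F = F_jac(x, u)
    x = f(x, u)
    P = F @ P @ F.T + Q
    # Update with the linearized measurement model at the predicted estimate.
    H = H_jac(x)
    y = z - h(x)                                   # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)                 # Kalman gain
    x = x + K @ y
    I_KH = np.eye(P.shape[0]) - K @ H
    P = I_KH @ P @ I_KH.T + K @ R @ K.T            # Joseph-form covariance update
    return x, P
```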


2.3.4 Error-State Kalman Filter

In most Kalman filter implementations involving pose estimation, it is common to use the error-state Kalman filter formulation, which overcomes the drawbacks of a full-state Kalman filter. Roumeliotis et al. [23] introduce the error-state Kalman filter. Most systems exhibit high-frequency non-linear dynamics that are best described only by a non-linear model, and an accurate linearization of such a system is not possible. However, the error-state Kalman filter considers only the perturbations in the state, which remain close to zero mean, can be integrated linearly and are suitable for linear Gaussian filtering. This leads to better accuracy in error-state Kalman filters.

Another motivation for choosing the error-state filter is the additional constraint on the quaternion used to represent the orientation. The quaternion is a four-element representation, whereas the orientation in space has only three degrees of freedom. The fourth element is dependent on the other three elements through a constraint. This makes the corresponding covariance matrix singular and causes numerical instability [26].

The error-state is simply the arithmetic difference of the estimated state x̂ from the true state x, that is, x̃ = x − x̂. This rule does not apply to the quaternion, which uses a quaternion difference instead of an arithmetic error to define the error-state. Hence, an error quaternion δq is defined as,

\delta q = q \otimes \hat{q}^{-1} = \begin{bmatrix} \delta q_w \\ \delta \mathbf{q}_v \end{bmatrix} \qquad (2.73)
= \begin{bmatrix} \cos(\delta\theta/2) \\ \mathbf{k}\sin(\delta\theta/2) \end{bmatrix} \qquad (2.74)
\approx \begin{bmatrix} 1 \\ \tfrac{1}{2}\mathbf{k}\,\delta\theta \end{bmatrix}, \qquad (2.75)

where q_v is the imaginary part and q_w is the real part of the quaternion q. The error quaternion is approximated by k δθ, which is a three-dimensional vector that represents the error angle.

The nominal and error states are predicted in parallel, as shown by Equation (2.76) and Equation (2.77). The nominal state ignores all noises in the system, while the error-state tracks the perturbations and noises. Note that the mean of the error-state is always reset to zero.

\hat{x}^n_{k+1} = F^n_k(x, u)\,\hat{x}^n_k \qquad (2.76)
\delta\hat{x}_{k+1} = F^x_k(x, u, w)\,\delta\hat{x}_k. \qquad (2.77)


Corrections are made to the error-state using measurement data. The error-state mean is then added to the nominal state to determine the corrected state estimates, shown by

\hat{x}_k = \hat{x}^n_k + \delta\hat{x}_k, \qquad (2.78)

where \hat{x}_k is the state estimate, x^n_k is the nominal state and δx_k is the error-state. F^n_k and F^x_k are the Jacobians of the nominal and error states respectively.

Based on the discussion above, the error-state Kalman filter is chosen in this thesis for implementing the fusion of visual and inertial data.
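To illustrate Equations (2.73)–(2.78) for the attitude part of the state, the following sketch builds the small-angle error quaternion and injects a corrected attitude error into the nominal quaternion (Hamilton convention, [w, x, y, z] order, global error as in Eq. (2.73)). The helper names are illustrative and not taken from the thesis code.

```python
import numpy as np

def quat_multiply(p, q):
    """Hamilton product, quaternions stored as [w, x, y, z]."""
    pw, pv = p[0], np.asarray(p[1:])
    qw, qv = q[0], np.asarray(q[1:])
    return np.concatenate(([pw*qw - pv @ qv], pw*qv + qw*pv + np.cross(pv, qv)))

def small_angle_quat(dtheta):
    """Error quaternion of Eq. (2.75): delta_q ~ [1, dtheta/2], then normalized."""
    dq = np.concatenate(([1.0], 0.5 * np.asarray(dtheta)))
    return dq / np.linalg.norm(dq)

def inject_attitude_error(q_nominal, dtheta):
    """Attitude part of Eq. (2.78): since delta_q = q (x) q_hat^{-1}, q = delta_q (x) q_nominal."""
    return quat_multiply(small_angle_quat(dtheta), q_nominal)
```

After the injection, the error-state mean is reset to zero, consistent with the discussion above.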


Chapter 3

Related Work

This chapter introduces several visual odometry and visual-inertial odometry techniques. The need for visual-inertial fusion is discussed and the strengths of these techniques are analyzed, which provides motivation for the implemented method.

3.1 Visual Odometry

Estimating the motion of a vehicle using visual cues from cameras attached to it is Visual Odometry (VO). This has applications in the field of robotics, and recently it is being used in autonomous vehicles for localization. We analyze three approaches to visual odometry.

3.1.1 Feature-based Approach

The feature-based method is a classical approach to the visual odometry problem. This method relies on unique interest points in the image, as illustrated by Figure 3.1. These points are tracked and matched over consecutive images. The general pipeline for this was introduced in Section 2.2.1.

ORB-SLAM is a visual odometry and mapping algorithm. Mur-Artal et al. [18] [20] use Oriented FAST and Rotated BRIEF (ORB) features proposed by Rublee et al. [24]. ORB features are rotation invariant but not scale invariant. ORB-SLAM runs tracking, local mapping and loop closing in separate threads. The tracking thread is responsible for feature matching between consecutive frames and estimating pose by minimizing the reprojection error for matched features. The mapping thread performs a local optimization of camera poses to achieve optimal reconstruction of the surroundings. The loop closure thread is constantly searching for loops, fuses duplicate data and performs a pose


graph optimization to achieve global consistency. The ORB-SLAM method is quite robust but loses tracking during pure rotations in the monocular case.

Figure 3.1: Feature extraction and tracking from an image frame. (Reprinted from ORB-SLAM [20]).

Another feature-based method is LibViso, proposed by Geiger et al. [7]. This method focuses on 3D reconstruction of the environment using Stereo cameras. Features are extracted by Blob and Corner masks and are matched using a circular matching approach between consecutive stereo images. The camera motion is estimated by minimizing the reprojection error using Gauss-Newton method.

3.1.2 Direct Approach

Feature-based methods involve a process of feature extraction and matching. The direct method suggests an approach that compares pixels directly, hence skipping the feature extraction step. Direct methods use the pixel intensity information for estimating motion between two frames, achieved using photometric error minimization such as the one used by Engel et al. [5]. This means that every pixel introduces a constraint in the optimization problem, leading to an accurate pose and also a dense map of the environment.

These methods work on the assumption that the projection of a point in both frames has the same intensity. This assumption often fails due to lighting changes, sensor noise, pose errors and dynamic objects. Hence, direct methods require a high frame rate which minimizes the intensity changes. The light intensity can also be handled by normalizing the global intensity, as done by Gonçalves and Comport [8]. Another issue is high computation due to the use of all pixels over all frames. This is addressed by only using pixels with sufficient information, that is, with high intensity gradient as proposed by Engel et al. [5]. This produces a semi-dense map that mostly contains edges


and corners, as illustrated by Figure 3.2. Concha et al. [2] proposed regularizing the image intensity error in addition to minimizing the photometric error. This enforces a smooth solution and aids in reconstructing low-gradient pixels. However, this method is based on the assumption that homogeneous color regions from a super-pixel are planar, which works decently for indoor scenarios and handheld motions only. Also, direct methods are more sensitive to camera calibration as compared to feature-based methods [4].

Figure 3.2: Extraction of pixel intensity data at corners and edges. (Reprinted from LSD-SLAM [5]).

3.1.3 Hybrid Approach

A hybrid approach can be used to combine the benefits of feature-based and direct methods. Hybrid methods require moderate computation, can handle large inter-frame motion and produce a semi-dense map. Forster et al. [6] use a 4x4 patch around features and estimate the camera pose by minimizing the photometric error of these feature patches. In order to refine the result, the reprojection error of each feature is determined with respect to the keyframe that has observed the feature at roughly the same angle. This produces a reprojection residual which is minimized in order to estimate the camera pose.

A combined feature-based and direct method for stereo visual odometry was proposed by Krombach et al. [13]. This method initializes a keyframe using stereo correspondence. Small motions are then estimated using the feature-based method. When there is a fast or large motion, the direct method is used to assign a keyframe, and this is optimized with stereo measurements and also the propagated depth from the previous keyframe. The method focuses on combining the speed of feature-based methods and the dense map from direct methods to fill the voids caused by large motions.


3.2 Visual-Inertial Odometry

Visual odometry methods are well defined; however, they suffer from some drawbacks.

• Loss of Tracking Visual odometry performs best when the frame rate of the camera is high, which means there is sufficient overlap of features between two images. This is not practical since it requires high computation speed. So when there is a sudden movement of the camera, there isn’t enough overlap of features due to low frame rate. This might cause loss of tracking.

• Scale Ambiguity When using a single camera for visual odometry, scale of the scene is unknown to the algorithm. Hence, it generates a result that is accurate but not scaled correctly, as shown by Figure 3.3. Scale here refers to the ratio between the size of the vehicle trajectory estimated from visual odometry and the size of true trajectory.

An IMU provides good odometry information for large sudden movements across a small time interval. The information from IMU is also scaled correctly. These factors motivate us to use an Inertial Measurement Unit (IMU) along with the visual sensor. On the downside, the IMU data is prone to noise and drift. So, the fusion of Visual and Inertial sensor is a suitable choice.

3.2.1 Loosely-coupled

A loosely-coupled approach for visual-inertial systems keeps the visual and inertial frameworks as independent entities. Weiss and Siegwart [31] proposed such a loosely-coupled implementation, where the metric scale estimation remains independent of the visual algorithm. Kelly and Sukhatme [10] implemented a loosely-coupled visual-inertial approach using an Unscented Kalman Filter (UKF), which is better designed to handle non-linearity in the system compared to the Extended Kalman Filter (EKF). Their solution self-calibrates the transformation between a camera and IMU, while simultaneously localizing the body and mapping the environment. Lynen et al. [17] designed a multi-sensor fusion in an EKF framework with Micro Aerial Vehicle (MAV) data, which is robust to long-term missions. All these filter-based implementations consider the visual odometry algorithm as a black-box and only use the pose


Figure 3.3: The trajectory estimated by Monocular Visual odometry is accurate up to a scale. Here, the true scale 0.5 indicates that the Visual odometry data is in the ratio 1:2 with the Ground truth.

Figure 3.4: General framework for Loosely-coupled Visual Inertial odometry.

and covariance result provided by it. The general framework is shown by Figure 3.4.

The loosely-coupled framework is efficient as it requires less computation, which is also beneficial for real-time operation. It is also easier to expand the system to include other sensors. However, it loses in terms of accuracy because it neglects the cross-coupling between the visual and inertial parameters.

3.2.2 Tightly-coupled

The tightly-coupled approach for visual-inertial systems combines the visual and inertial parameters under a single optimization problem and their states are jointly estimated. Leutenegger et al. [16] proposed one such formulation,

J(x) = \underbrace{\sum_{i=1}^{I}\sum_{k=1}^{K}\sum_{j\in\mathcal{J}(i,k)} {e_r^{i,j,k}}^T W_r^{i,j,k}\, e_r^{i,j,k}}_{\text{visual}} + \underbrace{\sum_{k=1}^{K-1} {e_s^{k}}^T W_s^{k}\, e_s^{k}}_{\text{inertial}}, \qquad (3.1)

where J(x) is the cost function that contains the weighted reprojection errors e_r and the weighted temporal errors from the IMU e_s; i is the camera index, k is the frame index and j is the image feature index.

Figure 3.5: General framework for Tightly-coupled Visual Inertial odometry.

Such an approach can be formulated for direct visual odometry methods too, as proposed by Concha et al. [3]. The optimization function minimizes the IMU residual error, IMU pose error and photometric error. The optimization between two frames assumes the previous frame to be constant and adjusts the current frame. A similar energy function that considers IMU and Image errors is also proposed by Usenko et al. [29], which is used to estimate the camera pose and translational velocity along with IMU biases.


A tightly-coupled approach considers the correlation between both the sensors, leading to better accuracy. However, it is hard to debug and computationally costlier than the loosely-coupled method.

Choice of method

A loosely-coupled visual-inertial odometry framework is chosen for implementation since it is modular, flexible and computationally less demanding. This framework requires a visual odometry algorithm and a sensor fusion algorithm. For the visual odometry algorithm, the feature-based method is preferred over the direct method, since the direct method is computationally expensive and unable to handle large inter-frame motions. Ground vehicle applications involve high speeds, and it is crucial for the visual odometry algorithm to handle fast movements.


Chapter 4

Implementation

In this chapter, the implementation of visual-inertial odometry using an error-state Kalman filter is discussed. The chapter begins with a brief description of the coordinate frames involved and goes on to detail the models for the inertial and visual sensor data.

4.1 Coordinate Frames


Figure 4.1: Different coordinate frames in the system and the transformations that relate them.


The visual-inertial system consists of four major coordinate frames. Understanding these frames and the transformations between them is crucial for successful filter design.

• IMU Frame: The IMU frame is rigidly attached to the IMU body and moves along with it. The measurements obtained from the IMU are according to this coordinate frame. For the system used, the IMU coordinate frame is defined as +X forward, +Y right, +Z up.

• World Frame: The World frame is a static reference frame with respect to which the IMU motion is estimated. The World frame is defined as +X forward, +Y right, +Z up.

• Camera Frame: This describes the coordinate system of the camera which moves along with the vehicle. It is essential for performing computer vision operations. For the system used, the Camera coordinate frame is defined as +X right, +Y down, +Z forward.

• Vision Frame: This is the static reference frame with respect to which the Camera motion is observed. For the system used, the Vision frame is defined as +X right, +Y down, +Z forward.

4.2 Vector Notation

A vector related to a sensor S and represented in the coordinate frame F is given by v_F^S.

4.3 Visual-Inertial System

The Visual system consists of a monocular camera, since they are cheaper and easier to calibrate compared to stereo cameras. The visual odometry algorithm used in this thesis is ORB-SLAM, which works with Oriented FAST and Rotated BRIEF (ORB) features. ORB-SLAM can handle high speeds and large inter-frame motions. Also, it has an open-source implementation available [19] and showed robust tracking during the preliminary tests. Pose information of the camera is obtained as the output.


The inertial system consists of data from Inertial Measurement Unit, that is, linear accelerations am and angular velocities ωm acting on the vehicle body. The pose of the vehicle is estimated from this data.

The two pose estimates, from camera and IMU, are transformed to a common navigation reference frame (World frame) and fused using extended Kalman filter framework.

4.4 Extended Kalman Filter design

4.4.1 Continuous-time Nonlinear State Model

The motion of the IMU is modelled with a set of navigation equations as described in Section 2.2.2. The states of the filter consist of the position of the IMU in the World frame p_w^i, the velocity of the IMU in the World frame v_w^i, the attitude quaternion describing the rotation from the World frame to the IMU frame q_w^i, the accelerometer and gyro biases b_a and b_ω, and a scaling factor λ between the Vision and the World coordinates. λ will help us estimate the scale of the visual odometry output.

x_k = \begin{bmatrix} p_w^i & v_w^i & q_w^i & b_a & b_\omega & \lambda \end{bmatrix}^T \qquad (4.1)

The IMU data are inputs to the system, that is, the linear acceleration a_m and angular velocity ω_m. The different noises acting on the system are the acceleration noise n_a, gyro noise n_ω, acceleration bias noise n_{b_a} and gyro bias noise n_{b_ω}. Hence the input u_k and system noise w_k are defined as,

u_k = \begin{bmatrix} a_m & \omega_m \end{bmatrix}^T \quad \text{and} \quad w_k = \begin{bmatrix} n_a & n_\omega & n_{b_a} & n_{b_\omega} \end{bmatrix}^T. \qquad (4.2)

The IMU kinematics and bias model is as follows,

\dot{p}_w^i = v_w^i \qquad (4.3)
\dot{v}_w^i = C_{q_w^i}(a_m - b_a + n_a) - g \qquad (4.4)
\dot{q}_w^i = \tfrac{1}{2}\,\Omega(\omega_m - b_\omega + n_\omega)\,q_w^i \qquad (4.5)
\dot{b}_a = n_{b_a} \qquad (4.6)
\dot{b}_\omega = n_{b_\omega} \qquad (4.7)
\dot{\lambda} = 0, \qquad (4.8)


where C_{q_w^i} represents the rotation matrix that transforms a vector from the Inertial frame to the World frame, Ω(ω) is the quaternion product matrix as defined in Section 2.1 and g is the vector corresponding to the acceleration due to gravity. The symbol ⊗ represents quaternion multiplication as discussed in Section 2.1.

4.4.2 Discrete-time Nonlinear State Model

Using a first-order Euler discretization, we obtain the discrete-time (nominal) state-space equations,

p_w^i(k) = p_w^i(k-1) + v_w^i(k-1)\,\Delta t \qquad (4.9)
v_w^i(k) = v_w^i(k-1) + \left[C_{q_w^i(k-1)}(a_m - b_a + n_a) - g\right]\Delta t \qquad (4.10)
q_w^i(k) = q_w^i(k-1) \otimes q\{(\omega_m - b_\omega + n_\omega)\,\Delta t\} \qquad (4.11)

b_a(k) = b_a(k-1) + n_{b_a}\Delta t \qquad (4.12)

b_\omega(k) = b_\omega(k-1) + n_{b_\omega}\Delta t \qquad (4.13)
\lambda(k) = \lambda(k-1). \qquad (4.14)
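A minimal sketch of the nominal-state propagation of Equations (4.9)–(4.14): noise terms are dropped (the nominal model ignores them), the state is kept in a plain dictionary, and the gravity sign convention and helper names are assumptions rather than thesis code.

```python
import numpy as np

G = np.array([0.0, 0.0, 9.81])   # gravity vector in the World frame (sign convention assumed)

def quat_mult(p, q):
    pw, pv, qw, qv = p[0], p[1:], q[0], q[1:]
    return np.concatenate(([pw*qw - pv @ qv], pw*qv + qw*pv + np.cross(pv, qv)))

def quat_from_rotvec(phi):
    angle = np.linalg.norm(phi)
    if angle < 1e-12:
        return np.array([1.0, 0.0, 0.0, 0.0])
    axis = phi / angle
    return np.concatenate(([np.cos(angle / 2)], np.sin(angle / 2) * axis))

def rot_from_quat(q):
    w, x, y, z = q
    return np.array([
        [w*w+x*x-y*y-z*z, 2*(x*y-w*z),     2*(x*z+w*y)],
        [2*(x*y+w*z),     w*w-x*x+y*y-z*z, 2*(y*z-w*x)],
        [2*(x*z-w*y),     2*(y*z+w*x),     w*w-x*x-y*y+z*z]])

def propagate_nominal(s, a_m, w_m, dt):
    """Nominal-state propagation, Eqs. (4.9)-(4.14), with noise terms omitted."""
    C = rot_from_quat(s["q"])                              # rotation IMU -> World
    s["p"] = s["p"] + s["v"] * dt                          # Eq. (4.9)
    s["v"] = s["v"] + (C @ (a_m - s["ba"]) - G) * dt       # Eq. (4.10)
    s["q"] = quat_mult(s["q"], quat_from_rotvec((w_m - s["bw"]) * dt))  # Eq. (4.11)
    # Biases (4.12)-(4.13) and the scale (4.14) stay constant in the nominal model.
    return s
```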

As discussed in Section 2.3.4, the error-state Kalman filter is used to handle the quaternion in its minimal representation for numerical stability. The error-states are,

\delta x = \begin{bmatrix} \delta p_w^i & \delta v_w^i & \delta\theta_w^i & \delta b_a & \delta b_\omega & \delta\lambda \end{bmatrix}^T. \qquad (4.15)

The dynamics of the error-state are given by,

\delta p_w^i = \delta p_w^i + \delta v_w^i\,\Delta t \qquad (4.16)
\delta v_w^i = \delta v_w^i - \left([C_{q_w^i}(a_m - b_a + n_a)]_\times\,\delta\theta_w^i + C_{q_w^i}\,\delta b_a - C_{q_w^i}\,\delta n_a - g\right)\Delta t \qquad (4.17)
\delta\theta_w^i = \delta\theta_w^i - \left(C_{q_w^i}\,\delta b_\omega - C_{q_w^i}\,\delta n_\omega\right)\Delta t \qquad (4.18)

\delta b_a = \delta b_a + n_{b_a}\Delta t \qquad (4.19)

\delta b_\omega = \delta b_\omega + n_{b_\omega}\Delta t \qquad (4.20)
\delta\lambda = \delta\lambda. \qquad (4.21)

4.4.3 Prediction Step

As mentioned in Section 2.3.3, the system must be linearized in order to use the Kalman filter framework. Jacobian matrices are computed at each time step to obtain the local linearized model around the current state estimate.


The discrete-time nonlinear error state equations can be represented as,

\delta x_k = f(\delta x_{k-1}, u_{k-1}, w_{k-1}), \qquad (4.22)

where f is the state prediction function, x is the state vector, δx is the error-state vector, u is the input vector and w is the system noise vector. The Jacobians are defined as,

F_{k-1} = \left.\frac{\partial f}{\partial \delta x_{k-1}}\right|_{\hat{x}_{k-1}} \quad \text{and} \quad G_{k-1} = \left.\frac{\partial f}{\partial w_{k-1}}\right|_{w_{k-1}}. \qquad (4.23)

The Jacobian with respect to the state estimates is computed to be,

F_{k-1} = \begin{bmatrix} I_3 & \Delta t I_3 & 0_3 & 0_3 & 0_3 & 0_{3\times1} \\ 0_3 & I_3 & -\Delta t[C_{q_w^i} a_i]_\times & -\Delta t C_{q_w^i} & 0_3 & 0_{3\times1} \\ 0_3 & 0_3 & I_3 & 0_3 & -\Delta t C_{q_w^i} & 0_{3\times1} \\ 0_3 & 0_3 & 0_3 & I_3 & 0_3 & 0_{3\times1} \\ 0_3 & 0_3 & 0_3 & 0_3 & I_3 & 0_{3\times1} \\ 0_{1\times3} & 0_{1\times3} & 0_{1\times3} & 0_{1\times3} & 0_{1\times3} & 1 \end{bmatrix}, \qquad (4.24)

where the true acceleration ai = am −ba is the measured acceleration corrected for bias. The Jacobian with respect to the system noise is computed to be,

  03 03 03 03 ∆tC i 0 0 0   qw 3 3 3     03 ∆tCqi 03 03  G =  w  . (4.25) k−1  0 0 ∆tI 0   3 3 3 3     03 03 03 ∆tI3 03 03 03 03

The matrix Q_{k-1} is the system noise covariance matrix and is defined as shown in Equation (4.26). The vectors σ_{n_a,c}, σ_{n_ω,c}, σ_{n_{b_a},c}, σ_{n_{b_ω},c} represent the noise strengths of n_a, n_ω, n_{b_a}, n_{b_ω} respectively for the continuous-time model.

 2  σna_c/∆t 03 03 03  2   03 σnω_c/∆t 03 03  Qk−1 =  2  . (4.26)  03 03 σ /∆t 03  nba _c  2  03 03 03 σ /∆t nbω _c


Once the F_{k-1} and G_{k-1} matrices are determined, the states and corresponding covariances are predicted using the equations described in Section 2.3.3.

Nominal state: \quad \hat{x}_{k|k-1} = f(\hat{x}_{k-1|k-1}, u_{k-1}) \qquad (4.27)
Covariance: \quad P_{k|k-1} = F_{k-1}P_{k-1|k-1}F_{k-1}^T + G_{k-1}Q_{k-1}G_{k-1}^T. \qquad (4.28)
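A sketch of the covariance prediction of Equation (4.28) using the block structure of Equations (4.24)–(4.26). The 16-dimensional error-state ordering [δp, δv, δθ, δb_a, δb_ω, δλ] follows Equation (4.15); the noise-strength arguments are placeholders, not tuned values from the thesis.

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix [v]_x."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def propagate_covariance(P, C, a_i, dt, sig_a, sig_w, sig_ba, sig_bw):
    """Covariance prediction of Eq. (4.28) with F, G, Q built per Eqs. (4.24)-(4.26)."""
    I3 = np.eye(3)
    F = np.eye(16)
    F[0:3, 3:6] = dt * I3
    F[3:6, 6:9] = -dt * skew(C @ a_i)      # effect of attitude error on velocity
    F[3:6, 9:12] = -dt * C                 # effect of accelerometer-bias error
    F[6:9, 12:15] = -dt * C                # effect of gyro-bias error
    G = np.zeros((16, 12))
    G[3:6, 0:3] = dt * C
    G[6:9, 3:6] = dt * C
    G[9:12, 6:9] = dt * I3
    G[12:15, 9:12] = dt * I3
    Q = np.diag(np.concatenate([np.full(3, sig_a**2), np.full(3, sig_w**2),
                                np.full(3, sig_ba**2), np.full(3, sig_bw**2)])) / dt
    return F @ P @ F.T + G @ Q @ G.T
```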

4.4.4 Measurement Model

The resultant pose from the visual odometry algorithm is used as measurement data for the loosely-coupled EKF.

The camera position z_p is defined as the position of the camera in the Vision frame, p_v^c, given by

z_p = p_v^c = C_{q_v^w}\left(p_w^i\,\lambda + C_{q_w^i}\,p_i^c - p_w^v\right) + n_{z_p}, \qquad (4.29)

where n_{z_p} is the position measurement noise.

Similarly, the camera orientation z_q is defined as the rotation from the Vision frame to the Camera frame, q_v^c, given by

z_q = q_v^c = q_v^w \otimes q_w^i \otimes q_i^c + n_{z_q}, \qquad (4.30)

where q_v^w is the orientation of the World frame with respect to the Vision frame, q_i^c is the orientation of the Camera frame with respect to the IMU frame, and n_{z_q} is the orientation measurement noise. It is important to note that Equation (4.30) depends on the quaternion convention used; Equation (4.30) follows the Hamilton convention for the order of multiplication.

Equations (4.29) and (4.30) relate the measurements to the system states. The innovation or residual is computed as,

\tilde{z}_p = z_p - \hat{z}_p, \qquad \tilde{z}_q = z_q \otimes \hat{z}_q^{-1}, \qquad (4.31)

where z_p and z_q are the actual measured observations while \hat{z}_p and \hat{z}_q are the estimated observations.
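The following sketch evaluates the measurement model of Equations (4.29)–(4.31): the predicted camera pose in the Vision frame is computed from the current state and the residual is formed as a vector difference for position and a quaternion difference for orientation. The calibration terms (p_i^c, q_i^c, q_v^w, p_w^v), the exact frame conventions and the helper names are assumptions for illustration, not the thesis code.

```python
import numpy as np

def qmult(p, q):
    pw, pv, qw, qv = p[0], p[1:], q[0], q[1:]
    return np.concatenate(([pw*qw - pv @ qv], pw*qv + qw*pv + np.cross(pv, qv)))

def qconj(q):
    return np.array([q[0], -q[1], -q[2], -q[3]])

def rot(q):
    w, x, y, z = q
    return np.array([[w*w+x*x-y*y-z*z, 2*(x*y-w*z), 2*(x*z+w*y)],
                     [2*(x*y+w*z), w*w-x*x+y*y-z*z, 2*(y*z-w*x)],
                     [2*(x*z-w*y), 2*(y*z+w*x), w*w-x*x-y*y+z*z]])

def predict_measurement(p_wi, q_wi, lam, q_vw, p_wv, p_ic, q_ic):
    """Predicted camera pose in the Vision frame, Eqs. (4.29)-(4.30)."""
    z_p = rot(q_vw) @ (p_wi * lam + rot(q_wi) @ p_ic - p_wv)
    z_q = qmult(qmult(q_vw, q_wi), q_ic)
    return z_p, z_q

def residual(z_p, z_q, zhat_p, zhat_q):
    """Innovation of Eq. (4.31): position difference and small-angle attitude error."""
    dq = qmult(z_q, qconj(zhat_q))     # quaternion residual (unit-quaternion inverse = conjugate)
    dtheta = 2.0 * dq[1:]              # small-angle approximation, cf. Eq. (2.75)
    return z_p - zhat_p, dtheta
```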


4.4.5 Update Step

Jacobian matrices are computed to obtain the local linearized measurement model. This is used for the filter correction. The observation model can be represented as,

zk = h(δxk, vk), (4.32) where h is the measured observation function, δx is the error-state vector and v is the measurement noise vector.

x_k = \begin{bmatrix} p_w^i & v_w^i & q_w^i & b_a & b_\omega & \lambda \end{bmatrix}^T \quad \text{and} \quad v_k = \begin{bmatrix} n_{z_p} & n_{z_q} \end{bmatrix}^T. \qquad (4.33)

The Jacobians are defined as,

H_k = \left.\frac{\partial h}{\partial \delta x_k}\right|_{\hat{x}_{k|k-1}} = \left.\frac{\partial h}{\partial x_k}\,\frac{\partial x_k}{\partial \delta x_k}\right|_{\hat{x}_{k|k-1}} \qquad (4.34)
M_k = \left.\frac{\partial h}{\partial v_k}\right|_{v_{k|k-1}}. \qquad (4.35)

The Jacobian of the measured observation with respect to the state estimates is as follows,

" c i # w w i w Cqv λ 03 −Cqv [Cqw pi ]× 03 03 Cqv pw Hk = ¯ , (4.36) 03 03 Hq 03 03 03×1 where H¯ q is obtained by approximation of Hq,

H_q = \frac{\partial(q_v^w \otimes q_w^i \otimes q_i^c)}{\partial q_w^i} \approx \begin{bmatrix} 1 & 0_{1\times3} \\ 0_{3\times1} & \bar{H}_q \end{bmatrix}. \qquad (4.37)

The Jacobian with respect to the measurement noise is as follows,

M_k = \begin{bmatrix} I_3 & 0_3 \\ 0_3 & I_3 \end{bmatrix} \qquad (4.38)

The matrix R_k is the measurement noise covariance matrix and is defined as shown in Equation (4.39). The vectors σ_{n_{z_p}}, σ_{n_{z_q}} represent the standard deviations of the noises n_{z_p}, n_{z_q} respectively.

" 2 # σ 03×4 R = nzp . (4.39) k 0 σ2 4×3 nzq


Once the Hk and Mk matrices are determined, the state estimates are updated using the equations described in Section 2.3.3.

Innovation: \quad \tilde{z}_k = z_k - h(\hat{x}_{k|k-1}) \qquad (4.40)
Innovation covariance: \quad S_k = H_k P_{k|k-1} H_k^T + M_k R_k M_k^T \qquad (4.41)
Kalman gain: \quad K_k = P_{k|k-1} H_k^T S_k^{-1} \qquad (4.42)

Correction: \quad \tilde{x}_k = K_k\tilde{z}_k \qquad (4.43)

The state covariance matrix is updated using,

P_{k|k} = (I - K_k H_k)P_{k|k-1}(I - K_k H_k)^T + K_k M_k R_k M_k^T K_k^T \qquad (4.44)
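Putting Equations (4.40)–(4.44) together, a hedged sketch of the update step: the error-state correction is computed from the stacked residual (position error and attitude error angle) and the covariance is updated in Joseph form. The function name and argument layout are illustrative, not the thesis implementation.

```python
import numpy as np

def esekf_update(P, H, M, R, residual):
    """Error-state update, Eqs. (4.40)-(4.44); residual stacks [position error, dtheta]."""
    S = H @ P @ H.T + M @ R @ M.T                       # innovation covariance (4.41)
    K = P @ H.T @ np.linalg.inv(S)                      # Kalman gain (4.42)
    dx = K @ residual                                   # error-state correction (4.43)
    I_KH = np.eye(P.shape[0]) - K @ H
    P = I_KH @ P @ I_KH.T + K @ (M @ R @ M.T) @ K.T     # covariance update (4.44)
    return dx, P
```

The correction dx is then folded into the nominal state: additively for position, velocity, biases and scale, and via the error quaternion of Equation (2.75) for the attitude, after which the error-state mean is reset to zero.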

Chapter 5

Results

The implemented filter is tested with an online dataset and the results are presented in this chapter. The experimental setup is described first followed by results obtained using two different vehicle trajectories. The ability to recover scale is then discussed.

5.1 Malaga Dataset

The Malaga dataset has been used for evaluating the visual-inertial filter on data from real sensors. This dataset has semi-urban and highway driving conditions, which are most likely to be faced by autonomous trucks during transportation between the mining site and the processing plant. This dataset was collected by Blanco-Claraco et al. [1] in several urban scenarios and provides GPS, stereo camera, and IMU sensor data. Tracks 7 and 8 of this dataset have been used for conducting tests.

The inertial unit used is an xSens-MTi-28A53G35 running at a rate of 100 Hz. Point Grey Research's Bumblebee 2 stereo camera is used for capturing images at 20 fps. The synchronized and rectified images of 800×600 resolution are used. The ORB-SLAM framework [18] [20] is used as the visual odometry black-box. The ground truth is provided by the mmGPS sensor from Topcon, providing positional data at 1 Hz.

5.2 Experimental Setup

The purpose of this experiment is to evaluate the ability of the Visual-Inertial EKF framework to estimate the scale. Scale here refers to the ratio between

the size of the vehicle trajectory estimated from visual odometry and the size of the true trajectory.

The idea is to test the filter for scale estimation under conditions that are most likely and reasonable. The first step is to choose a range of true scale values. This differs for different applications and sensor setups. The visual odometry can be studied to check the most common scale range. A range of 0.5 to 5.0 is chosen, that is, the visual odometry data is half the scale of ground truth to five times the scale, shown by Equation (5.1). The choice of this range enables the comparison of results with the work of Weiss et al. [31].

The next step is to choose a range of initialization errors for the scale state. The work of Weiss [30] is referred to, which studied the ability to estimate the scale state under different initialization errors and system input excitations (linear acceleration and angular velocity). This is depicted by Figures 5.1 and 5.2, which show that it is important to have a good initialization for the scale state. A deviation in the position, velocity and acceleration bias estimates can be observed for a scale RMSE over 0.2. So, an initial scale offset¹ of 0.7 to 1.3 is chosen, where the other system states are less influenced by the initialization error in scale, shown by Equation (5.2). Note that the experiments in [31] use an initial scale offset of 1.0 to 2.0, which only corresponds to cases where the visual scale is greater than the true scale. The effect of scaling down is also important to test, as it is equally likely that the visual scale is lower than the true scale. Hence, the error offsets from 0.7 to 1.0 were also chosen.

For each value of true scale, all values of initial scale offset are tested. This gives a total of 42 test cases.

True Scale = \begin{bmatrix} 0.5 & 1.0 & 2.0 & 3.0 & 4.0 & 5.0 \end{bmatrix} \qquad (5.1)
Initial Scale Offset = \begin{bmatrix} 0.7 & 0.8 & 0.9 & 1.0 & 1.1 & 1.2 & 1.3 \end{bmatrix}. \qquad (5.2)

Various scale settings were applied to Malaga Track 7 and Malaga Track 8. These two tracks have different linear acceleration and angular velocity excitations, which provide variation in the test conditions. For the sake of brevity, only one case out of 42 is shown for each track, where the true scale is 3.0 with an initial scale offset of 0.8. Each of them is analyzed for filter performance with respect to all states.

¹ An offset is usually an additive error. However, the scale offset here is multiplied with the true scale to attain the required deviation. An initial scale offset of 0.8 implies that the scale state is initialized using a value 0.8 times the true scale.


Figure 5.1: Different excitation levels on the system input (linear accelerations and angular velocities) affect the convergence of the scale state. High excitation of the linear accelerations is favourable for faster convergence. (Reprinted from the work of Weiss et al. [30]).

Figure 5.2: The evolution of the state RMSE is shown as the initialization error of the scale increases. Higher RMSE indicates a longer convergence time. It is difficult to converge when the initial scale state error is beyond 4.0 times the true value. (Reprinted from the work of Weiss [30].)


5.2.1 Malaga Track 7 Results

Figure 5.3: Comparison of the vehicle position estimate in the navigation frame with the GPS ground truth. The ability of the filter to adjust the scale and follow the ground truth can be observed. (a) 2D trajectory in the X-Y plane. (b) Position along the three axes plotted against time.

The position estimate of the filter is shown in Figure 5.3. We notice deviations during rotations, but these reduce soon after, which could imply that the scale state is adjusting to the effect of the gyroscope excitation. Figure 5.4 exhibits the ability of the filter to estimate the acceleration and gyroscope biases along with their uncertainty. The region of the 3σ bounds converges to a constant value, indicating a steady state of the filter.

The velocity and yaw state estimates of the filter are shown in Figure 5.5. The velocity estimate has an average value of 8.0650 m/s over the course of the trajectory, which is a reasonable velocity for a vehicle. The yaw estimate depicts the rotation of the vehicle about the z-axis. From the plot, it is easy to see that the yaw estimate agrees with the trajectory taken by the vehicle. On a practical note, the estimates of orientation using the gyroscope and visual odometry are accurate, which could be the reason for the confident 3σ bound in Figure 5.5.

Finally, the scale convergence plot in Figure 5.6 shows that the scale does not converge to the true value but slowly approaches it. The uncertainty converges to a steady value, but the true scale of 3.0 is not within the 3σ bounds, indicating that the scale estimate is over-confident.
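Such an over-confidence check can be stated compactly in terms of the filter output: the true scale should fall within the estimate plus or minus three standard deviations taken from the corresponding entry of the EKF covariance. The sketch below uses hypothetical variable names and a hypothetical variance value; it is not taken from the implementation.

import numpy as np

def scale_is_consistent(scale_est, scale_var, true_scale):
    """Return True if the true scale lies within the 3-sigma bounds of the
    scale estimate; scale_var is the corresponding diagonal entry of the
    EKF covariance matrix."""
    return abs(true_scale - scale_est) <= 3.0 * np.sqrt(scale_var)

# Track 7 example: estimate 2.7741 vs. true scale 3.0 (the variance value
# below is purely hypothetical).
print(scale_is_consistent(2.7741, 0.002, 3.0))  # False -> over-confident estimate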


Figure 5.4: Estimation of the (a) acceleration bias and (b) gyroscope bias states of the visual-inertial filter along the three axes. The 3σ bounds of uncertainty are also shown.


Figure 5.5: Estimation of the velocity and yaw of the vehicle. The 3σ bounds of uncertainty are also shown. The velocity estimate has a reasonable range and the yaw estimate agrees with the trajectory of the vehicle.

Figure 5.6: State estimation plot for the scale, showing convergence to a value of 2.7741 while the true scale is 3.0. The 3σ bounds of uncertainty are also shown.


5.2.2 Malaga Track 8 Results

Figure 5.7: Comparison of the vehicle position estimate in the navigation frame with the GPS ground truth. The ability of the filter to adjust the scale and follow the ground truth can be observed. (a) 2D trajectory in the X-Y plane. (b) Position along the three axes plotted against time.

The position estimate of the filter is shown in Figure 5.7. The x-position deviates from the ground truth due to a failure in the visual odometry black-box. The failure occurs near the end of the trajectory due to insufficient image features. This shows that the visual-inertial filter accumulates errors when the visual odometry black-box fails. Hence, it is important to have a failure detection mechanism. Figure 5.8 exhibits the ability of the filter to estimate the acceleration and gyroscope biases along with their uncertainty.
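The thesis does not implement such a detector, but a minimal heuristic could, for example, flag frames in which the position increment reported by the visual odometry black-box implies an implausibly high speed. The sketch below is only one possible heuristic with an assumed frame rate and speed threshold; it is not the method used in this work.

import numpy as np

def detect_vo_spikes(vo_positions, dt=1.0 / 20.0, max_speed=30.0):
    """Flag visual odometry frames whose position increment implies an
    implausible speed (default: above 30 m/s between 20 fps frames).

    vo_positions : (N, 3) positions reported by the VO black-box
    Returns a boolean array of length N-1, True where a spike is suspected.
    """
    steps = np.linalg.norm(np.diff(vo_positions, axis=0), axis=1)
    return steps > max_speed * dt

# Flagged measurements could then be skipped instead of being fused in the EKF.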

The velocity and yaw state estimates of the filter are shown in Figure 5.9. The velocity estimate has an average value of 9.0751 m/s over the course of the trajectory, which is a reasonable velocity for a ground vehicle. The yaw estimate depicts the rotation of the vehicle about the z-axis. Similar to Track 7, the yaw estimate agrees with the trajectory of the vehicle and has a confident 3σ bound.

Finally, the scale convergence plot in Figure 5.10 shows convergence to the true value. The uncertainty also converges to a steady value, and the true scale of 3.0 is within the 3σ bounds.


Figure 5.8: Estimation of the (a) acceleration bias and (b) gyroscope bias states of the visual-inertial filter along the three axes. The 3σ bounds of uncertainty are also shown.


Figure 5.9: Estimation of the velocity and yaw of the vehicle. The 3σ bounds of uncertainty are also shown. The velocity estimate has a reasonable range and the yaw estimate agrees with the trajectory of the vehicle.

Figure 5.10: State estimation plot for the scale, showing convergence to a value of 2.9977 while the true scale is 3.0. The 3σ bounds of uncertainty are also shown.


5.3 Summary

The convergence behaviour of the scale estimate for all 42 test cases on Malaga Track 7 and Track 8 is summarized in Figure 5.11 and Figure 5.12, respectively. For each value of the true scale, initial scale offsets from 0.7 to 1.3 are applied and the convergence of the scale state is noted.

Figure 5.11: Convergence behaviour plot for Malaga Track 7. The true scale varies from 0.5 to 5.0. The initial values of the scale have an initialization error offset varying from 0.7 to 1.3 times the true scale value. For each of these experimental settings, the red circles depict the final filter values to which the scale estimates have converged.

Figure 5.12: Convergence behaviour plot for Malaga Track 8. The true scale varies from 0.5 to 5.0. The initial values of the scale have an initialization error offset varying from 0.7 to 1.3 times the true scale value. For each of these experimental settings, the red circles depict the final filter values to which the scale estimates have converged.

The major observation is that the scale estimate shows a general tendency to approach the true scale in all cases. The estimated value of the scale always lies between the initial value and the true value, both when the initialization error is greater than the true scale and when it is lower. Despite some inaccuracies, this shows that it is possible to estimate the true scale using the visual-inertial filter.


The scale estimates do not converge to the true value in all cases. The scale estimate for Malaga Track 7 exhibits a cumulative RMS error of 0.3638, while that of Malaga Track 8 exhibits a cumulative RMS error of 0.2572. These RMSE values do not necessarily indicate an error in convergence, but rather a slow rate of convergence. From the convergence plots in Figure 5.6 and Figure 5.10, it can be noted that the filter roughly estimates the scale factor within the first 30 seconds and then slowly moves towards the true value. This holds for all cases, which implies that the filter removes the effect of incorrect initialization. The rate of convergence depends on the initialization error of the scale state and the excitation level of the linear acceleration inputs, as studied by Weiss [30] and illustrated by Figure 5.1 and Figure 5.2.
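The exact definition of the cumulative RMS error is not restated here; one plausible way to compute it, assumed purely for illustration, is over the converged scale values of all test cases of a track:

import numpy as np

def cumulative_rmse(final_scales, true_scales):
    """RMS error between converged scale estimates and their true values,
    taken over all test cases of one track (assumed definition)."""
    final_scales = np.asarray(final_scales, dtype=float)
    true_scales = np.asarray(true_scales, dtype=float)
    return float(np.sqrt(np.mean((final_scales - true_scales) ** 2)))

# e.g. for Track 7 this quantity would evaluate to roughly 0.3638, given the
# converged values of all 42 test cases (variable names are hypothetical).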

Moreover, the convergence of the scale is also influenced by other factors, such as the calibration of the extrinsic parameters between the camera and the IMU, the calibration of the IMU noise parameters, and failures in the visual odometry black-box. Improving these factors could result in better and faster convergence of the scale estimate.

Considering these factors, the results are promising, since the scale estimate shows a tendency to approach the true value. The filter is also able to track other states, such as velocity and orientation, in agreement with the trajectory taken by the vehicle. On the other hand, the convergence rate is slower than desired, which could lead to a significant error in the estimation of the vehicle pose. To avoid this, it is important to have a good initial estimate of the scale state, a better calibration of the sensor parameters, and failure detection for the visual odometry data.


Chapter 6

Conclusion

In this thesis, a monocular visual-inertial odometry algorithm with scale recovery is examined. An error-state Kalman filter was implemented for the fusion of monocular vision and inertial data, which estimates system states such as position, velocity, sensor biases, and scale.

This system is developed for vehicular applications, in particular autonomous trucks in open-pit mining scenarios where GNSS might not be reliable. The system is tested with the Malaga dataset, a publicly available semi-urban dataset collected from a car. This dataset was chosen since it largely coincides with the highway and semi-urban scenarios that an autonomous truck could experience.

The main contribution of this thesis is the application of monocular visual-inertial odometry with scale recovery to ground vehicles. This application provides different excitation conditions, and testing on data from real sensors makes the problem challenging.

The major issue with monocular visual-inertial odometry is the obscurity of the scale of the visual data. Hence, the ability to accurately estimate the scale was studied in this thesis. The estimation of scale was tested on Track 7 and Track 8 of the Malaga dataset under different scale conditions and initialization errors. It is found that an accurate estimation of scale is possible with the designed filter, as the results show a general tendency to approach the true value.

It is possible to implement this framework for the localization of autonomous ground vehicles despite the challenges of high speeds and obscure scale information.


Future Work

Future work should focus on achieving faster convergence of the scale estimate. One way to tackle this issue is to obtain a good initial guess for the scale. A method suggested by Kneip et al. [12] uses only three consecutive camera images and feature correspondences to achieve this. This approach has already been implemented and tested by Weiss [30], showing promising results for simulated and real data in a laboratory environment. Moreover, a failure detection step is also included there, which identifies irregular spikes in the visual odometry data. Such a check reduces the introduction of errors from the visual odometry black-box.
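The closed-form solution of Kneip et al. [12] is not reproduced here, but the underlying idea of bootstrapping the scale from inertial data can be illustrated with a much rougher heuristic: compare the distance obtained by double-integrating gravity-compensated accelerations over a short window with the distance reported by visual odometry over the same window. The sketch below implements only this simplified heuristic, not the method of [12] or [30], and all variable names are assumptions.

import numpy as np

def rough_initial_scale(acc_world, dt, vo_positions):
    """Rough initial scale guess: unscaled distance reported by visual
    odometry divided by the metric distance obtained from the IMU, matching
    the scale definition used in Section 5.2 (visual size / true size).

    acc_world    : (N, 3) gravity-compensated accelerations in the navigation frame
    dt           : IMU sampling period in seconds
    vo_positions : (M, 3) visual odometry positions over the same time window
    """
    # Double-integrate the acceleration; this assumes zero initial velocity
    # and is only usable over a short window because of drift.
    vel = np.cumsum(acc_world * dt, axis=0)
    pos = np.cumsum(vel * dt, axis=0)
    metric_dist = np.sum(np.linalg.norm(np.diff(pos, axis=0), axis=1))
    vo_dist = np.sum(np.linalg.norm(np.diff(vo_positions, axis=0), axis=1))
    return vo_dist / metric_dist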

As part of future work, the filter could be tested in a real-time implementation with actual data from a truck in an open-pit mining scenario. The system could then be tested for robustness to different environments, such as bad weather, poor lighting conditions, or a lack of image features. The filter currently depends heavily on the visual odometry black-box to function well. This dependence could be reduced by considering multi-sensor fusion with GNSS and LIDAR.

Bibliography

[1] Blanco-Claraco, José-Luis, Moreno-Dueñas, Francisco-Ángel, and González-Jiménez, Javier. “The Málaga urban dataset: High-rate stereo and LiDAR in a realistic urban scenario”. In: The International Journal of Robotics Research 33.2 (2014), pp. 207–214.
[2] Concha, Alejo and Civera, Javier. “DPPTAM: Dense piecewise planar tracking and mapping from a monocular sequence”. In: Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on. IEEE. 2015, pp. 5686–5693.
[3] Concha, Alejo et al. “Visual-inertial direct SLAM”. In: Robotics and Automation (ICRA), 2016 IEEE International Conference on. IEEE. 2016, pp. 1331–1338.
[4] Engel, Jakob, Koltun, Vladlen, and Cremers, Daniel. “Direct sparse odometry”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (2017).
[5] Engel, Jakob, Schöps, Thomas, and Cremers, Daniel. “LSD-SLAM: Large-scale direct monocular SLAM”. In: European Conference on Computer Vision. Springer. 2014, pp. 834–849.
[6] Forster, Christian, Pizzoli, Matia, and Scaramuzza, Davide. “SVO: Fast semi-direct monocular visual odometry”. In: Robotics and Automation (ICRA), 2014 IEEE International Conference on. IEEE. 2014, pp. 15–22.
[7] Geiger, Andreas, Ziegler, Julius, and Stiller, Christoph. “Stereoscan: Dense 3d reconstruction in real-time”. In: Intelligent Vehicles Symposium (IV), 2011 IEEE. IEEE. 2011, pp. 963–968.
[8] Gonçalves, Tiago and Comport, Andrew I. “Real-time direct tracking of color images in the presence of illumination variation”. In: Robotics and Automation (ICRA), 2011 IEEE International Conference on. IEEE. 2011, pp. 4417–4422.


[9] Hartley, Richard I. “Estimation of relative camera positions for uncalibrated cameras”. In: European Conference on Computer Vision. Springer. 1992, pp. 579–587.
[10] Kelly, Jonathan and Sukhatme, Gaurav S. “Visual-inertial sensor fusion: Localization, mapping and sensor-to-sensor self-calibration”. In: The International Journal of Robotics Research 30.1 (2011), pp. 56–79.
[11] Klein, Georg and Murray, David. “Parallel tracking and mapping for small AR workspaces”. In: Mixed and Augmented Reality, 2007. ISMAR 2007. 6th IEEE and ACM International Symposium on. IEEE. 2007, pp. 225–234.
[12] Kneip, Laurent et al. “Closed-form solution for absolute scale velocity determination combining inertial measurements and a single feature correspondence”. In: Robotics and Automation (ICRA), 2011 IEEE International Conference on. IEEE. 2011, pp. 4546–4553.
[13] Krombach, Nicola, Droeschel, David, and Behnke, Sven. “Combining feature-based and direct methods for semi-dense real-time stereo visual odometry”. In: International Conference on Intelligent Autonomous Systems. Springer. 2016, pp. 855–868.
[14] Kuipers, Jack B et al. Quaternions and rotation sequences. Vol. 66. Princeton University Press, 1999.
[15] LaFrance, Adrienne. “Your Grandmother’s Driverless Car”. In: The Atlantic (June 2016). url: https://www.theatlantic.com/technology/archive/2016/06/beep-beep/489029/ (visited on 09/11/2017).
[16] Leutenegger, Stefan et al. “Keyframe-based visual–inertial odometry using nonlinear optimization”. In: The International Journal of Robotics Research 34.3 (2015), pp. 314–334.
[17] Lynen, Simon et al. “A robust and modular multi-sensor fusion approach applied to MAV navigation”. In: Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on. IEEE. 2013, pp. 3923–3929.
[18] Mur-Artal, Raul, Montiel, Jose Maria Martinez, and Tardos, Juan D. “ORB-SLAM: a versatile and accurate monocular SLAM system”. In: IEEE Transactions on Robotics 31.5 (2015), pp. 1147–1163.
[19] Mur-Artal, Raul, Montiel, Jose Maria Martinez, and Tardos, Juan D. ORB-SLAM Project Webpage. url: http://webdiis.unizar.es/~raulmur/orbslam/ (visited on 09/11/2017).


[20] Mur-Artal, Raul and Tardós, Juan D. “ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras”. In: IEEE Transactions on Robotics (2017).
[21] Pomerleau, Dean. “RALPH: Rapidly adapting lateral position handler”. In: Intelligent Vehicles ’95 Symposium, Proceedings of the. IEEE. 1995, pp. 506–511.
[22] Robertson, Duncan P and Cipolla, Roberto. “Structure from motion”. In: (2008).
[23] Roumeliotis, Stergios I, Sukhatme, Gaurav S, and Bekey, George A. “Circumventing dynamic modeling: Evaluation of the error-state Kalman filter applied to mobile robot localization”. In: Robotics and Automation, 1999. Proceedings. 1999 IEEE International Conference on. Vol. 2. IEEE. 1999, pp. 1656–1663.
[24] Rublee, Ethan et al. “ORB: An efficient alternative to SIFT or SURF”. In: Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE. 2011, pp. 2564–2571.
[25] Scaramuzza, Davide and Fraundorfer, Friedrich. “Visual odometry [tutorial]”. In: IEEE Robotics & Automation Magazine 18.4 (2011), pp. 80–92.
[26] Shuster, MD, Lefferts, EJ, and Markley, FL. “Kalman filtering for spacecraft attitude estimation”. In: AIAA 20th Aerospace Sciences Meeting, Orlando, Florida. Vol. 232. 1982.
[27] Thomson, Judith Jarvis. “The trolley problem”. In: The Yale Law Journal 94.6 (1985), pp. 1395–1415.
[28] Thrun, Sebastian et al. “Stanley: The robot that won the DARPA Grand Challenge”. In: Journal of Field Robotics 23.9 (2006), pp. 661–692.
[29] Usenko, Vladyslav et al. “Direct visual-inertial odometry with stereo cameras”. In: Robotics and Automation (ICRA), 2016 IEEE International Conference on. IEEE. 2016, pp. 1885–1892.
[30] Weiss, Stephan M. “Vision based navigation for micro helicopters”. PhD thesis. 2012.
[31] Weiss, Stephan and Siegwart, Roland. “Real-time metric state estimation for modular vision-inertial systems”. In: Robotics and Automation (ICRA), 2011 IEEE International Conference on. IEEE. 2011, pp. 4531–4537.


Appendix A

Sustainability, Ethics and Social Impact

In this thesis, a visual-inertial algorithm has been developed for autonomous vehicle technology, which certainly has ethical and societal impacts.

The algorithm is used to estimate the position of the vehicle. Consider a scenario where the vehicle is on a collision course and needs to take preventive measures. It relies on the position data, and it is crucial to have an accurate position estimate; a mishap due to an inaccurate position estimate is inexcusable. It is therefore crucial that the algorithm is tested for different scenarios and environments and provides a reliable position estimate. In case the algorithm is not certain about the position, it should provide a degree of uncertainty for the estimate so that other collision avoidance methods can be incorporated.

One of the advantages of autonomous driving is a higher level of efficiency. Urban traffic is regulated better and heavy transport trucks use optimized paths. This reduces fuel consumption or improves battery life in electric vehicles. Moreover, the concept of ride-sharing will reduce the overall number of vehicles required in cities. A positive social impact of autonomous vehicles is improved accessibility for the elderly and handicapped. However, the growing amount of city space and resources used for autonomous vehicles is a concern.


Appendix B

Detailed Results

B.1 Malaga Track 7

Various scale settings were applied to Malaga Track 7. The resultant trajectories for different true scale values with an initial scale offset of 1.1 are summarized in Figure B.1. The resultant trajectories for a true scale of 3.0 with different initial scale errors are summarized in Figure B.2.


Figure B.1: Variation in the output trajectory for different true scales with a +10% error in the initial estimates. Panels (a)–(f): initial scale 0.55, 1.1, 2.2, 3.3, 4.4, and 5.5 (1.1 error offset).


Figure B.2: Variation in the output trajectory for a true scale of 3.0 with different initial estimates. Panels (a)–(f): initial scale 2.1, 2.4, 2.7, 3.3, 3.6, and 3.9 (error offsets 0.7, 0.8, 0.9, 1.1, 1.2, and 1.3).


B.2 Malaga Track 8

Various scale settings were applied to Malaga Track 8. The resultant trajectories for different true scale values with an initial scale offset of 1.1 are summarized in Figure B.3. The resultant trajectories for a true scale of 3.0 with different initial scale errors are summarized in Figure B.4.


Figure B.3: Variation in the output trajectory for different true scales with a 1.1 error offset in the initial estimates. Panels (a)–(f): initial scale 0.55, 1.1, 2.2, 3.3, 4.4, and 5.5.


Figure B.4: Variation in the output trajectory for a true scale of 3.0 with different initial estimates. Panels (a)–(f): initial scale 2.1, 2.4, 2.7, 3.3, 3.6, and 3.9 (error offsets 0.7, 0.8, 0.9, 1.1, 1.2, and 1.3).
