Tracking an RGB-D Camera on Mobile Devices Using an Improved Frame-To-Frame Pose Estimation Method
Total Page:16
File Type:pdf, Size:1020Kb
Tracking an RGB-D Camera on Mobile Devices Using an Improved Frame-to-Frame Pose Estimation Method Jaepung An∗ Jaehyun Lee† Jiman Jeong† Insung Ihm∗ ∗Department of Computer Science and Engineering †TmaxOS Sogang University, Korea Korea fajp5050,[email protected] fjaehyun lee,jiman [email protected] Abstract tion between two time frames. For effective pose estima- tion, several different forms of error models to formulate a The simple frame-to-frame tracking used for dense vi- cost function were proposed independently in 2011. New- sual odometry is computationally efficient, but regarded as combe et al. [11] used only geometric information from rather numerically unstable, easily entailing a rapid accu- input depth images to build an effective iterative closest mulation of pose estimation errors. In this paper, we show point (ICP) model, while Steinbrucker¨ et al. [14] and Au- that a cost-efficient extension of the frame-to-frame tracking dras et al. [1] minimized a cost function based on photo- can significantly improve the accuracy of estimated camera metric error. Whereas, Tykkal¨ a¨ et al. [17] used both geo- poses. In particular, we propose to use a multi-level pose metric and photometric information from the RGB-D im- error correction scheme in which the camera poses are re- age to build a bi-objective cost function. Since then, sev- estimated only when necessary against a few adaptively se- eral variants of optimization models have been developed lected reference frames. Unlike the recent successful cam- to improve the accuracy of pose estimation. Except for the era tracking methods that mostly rely on the extra comput- KinectFusion method [11], the initial direct dense methods ing time and/or memory space for performing global pose were applied to the framework of frame-to-frame tracking optimization and/or keeping accumulated models, the ex- that estimates the camera poses by repeatedly registering tended frame-to-frame tracking requires to keep only a few the current frame against the last frame. While efficient recent frames to improve the accuracy. Thus, the resulting computationally, the frame-to-frame approach usually suf- visual odometry scheme is lightweight in terms of both time fers from substantial drift due to the numerical instability and space complexity, offering a compact implementation caused mainly by the low precision of consumer-level RGB- on mobile devices, which do not still have sufficient com- D cameras. In particular, the errors and noises in their depth puting power to run such complicated methods. measurements are one of the main sources that hinder a sta- ble numerical solution of the pose estimation model. In order to develop a more stable pose estimation 1. Introduction method, the KinectFusion system [11] adopted a frame- to-model tracking approach that registers every new depth The effective estimation of 6-DOF camera poses and re- measurement against an incrementally accumulated dense construction of a 3D world from a sequence of 2D images scene geometry, represented in a volumetric truncated captured by a moving camera has a wide range of com- signed distance field. By using higher-quality depth im- puter vision and graphics applications, including robotics, ages that are extracted on the fly from the fully up-to-date virtual and augmented reality, 3D games, and 3D scanning. 3D geometry, it was shown that the drift of the camera With the availability of consumer-grade RGB-D cameras can decrease markedly while constructing smooth dense 3D such as the Microsoft Kinect sensor, direct dense methods, models in real-time using a highly parallel PC GPU imple- which estimate the motion and shape parameters directly mentation. A variant of the frame-to-model tracking was from raw image data, have recently attracted a great deal of presented by Keller et al. [9], in which aligned depth im- research interest because of their real-time applicability in ages were incrementally fused into a surfel-based model, robust camera tracking and dense map reconstruction. instead of a 3D volume grid, offering a relatively more The basic element of the direct dense visual localization efficient memory implementation. While producing more and mapping is the optimization model which, derived from accurate pose estimates than the frame-to-frame tracking, pixel-wise constraints, allows an estimate of rigid-body mo- the frame-to-model tracking techniques must manipulate the incrementally updated 3D models during camera track- higher memory efficiency without trading off the pose esti- ing, whether they are stored via a volumetric signed dis- mation accuracy, several surfel-based techniques were also tance field or a surfel-based point cloud. This inevitably in- proposed to represent the dynamically fused scene geome- creases the time and space complexity of the camera track- try, for example, [9,15,21], for which efficient point manip- ing method, often making the resulting implementation in- ulation and splatting algorithms were needed. efficient on low-end platforms with limited computational Whichever base tracking method is used to register cap- resources such as mobile devices. tured images, small pose estimation errors eventually ac- In this paper, we show that a cost-efficient extension cumulate in the trajectory over time, demanding periodic of the simple, drift-prone, frame-to-frame tracking can im- optimizations of the estimated camera trajectory and maps prove the accuracy of pose estimation markedly. The pro- for global consistency, which usually involves the nontriv- posed multi-level pose error correction scheme decreases ial computations of keyframe selection, loop-closure detec- the estimation errors by re-estimating the camera poses on tion, and global optimization. Several algorithms suitable the fly only when necessary against a few reference frames for low-cost RGB-D cameras have been proposed. For in- that are selected adaptively according to the camera mo- stance, Endres et al. [5], Henry et al. [7], and Dai et al. tion. It differs from the visual odometry techniques, such [4] extracted sparse image features from the RGB-D mea- as [10], which estimate camera poses with respect to adap- surements to estimate spatial relations between keyframes, tively switched keyframes but without any error correction. from which a pose graph was incrementally built and op- Our method is based on the observation that supplying a timized. A surfel-based matching likelihood measure was better initial guess for the rigid-body motion between two explored by Stuckler¨ and Behnke [15] to infer spatial rela- frames results in a numerically more stable cost-function tions between the nodes of a pose graph. Whereas, Kerl et optimization. By using high-quality initial guesses derived al. [10] presented an entropy-based method to both select from available pose estimates, we show that our error cor- keyframes and detect loop closures for the pose graph con- rection scheme produces a significantly enhanced camera struction. Whelan et al. [19] combined the pose graph opti- trajectory. The extra cost additionally required for the com- mization framework with non-rigid dense map deformation putationally cheap frame-to-frame tracking is quite low in for efficient map correction. Whelan et al. [21] also pro- terms of both computation time and memory space, thus posed a more map-centric dense method for building glob- enabling an efficient implementation on such low-end plat- ally consistent maps by frequent refinement of surfel-based forms as mobile devices. We describe our experience with models through non-rigid surface deformation. this technique and demonstrate how it can be applied to de- In spite of the recent advances in mobile computing tech- veloping an effective mobile visual odometry system. nology, implementing the above methods on mobile plat- forms is still quite challenging, and very few studies ad- 2. Related work dressed this issue. Kahler¨ et al. [8] presented an efficient mobile implementation, based on the voxel block hashing A variant of the frame-to-model tracking algorithm of technique [12], achieving high frame rates and tracking ac- Newcombe et al. [11] was presented by Bylow et al. [2], curacy on mobile devices with an IMU sensor. where the distances of back-projected 3D points to scene geometry were directly evaluated using the signed distance 3. Improved frame-to-frame pose estimation field. While effective, these frame-to-model tracking ap- proaches needed to manipulate a (truncated) 3D signed dis- In this section, we first explain our multi-level adaptive tance field during the camera tracking, incurring substan- error correction scheme which, as an inherently frame-to- tial memory overhead usually on the GPU. To resolve the frame method, enables to implement an efficient camera limitation caused by a fixed, regular 3D volume grid, Roth tracking system on mobile devices with limited comput- and Vona [13] and Whelan et al. [20] proposed the use of ing power and memory capability. Consider a live input a dynamically varying regular grid, allowing the camera stream produced by a moving RGB-D camera, where each to move freely in an unbounded extended area. A more frame at time i (i = 0;1;2;···) provides an augmented im- memory-efficient, hierarchical volume grid structure was age Fi = (Ii;Di) that consists of an intensity image Ii(u) and coupled with the GPU-based pipeline of the KinectFusion a depth map Di(u), respectively seen through every pixel by Chen et al. [3] in an attempt to support large-scale recon- u 2 U ⊆ R2. We assume that the intensity image has been structions without sacrificing the fine details of the recon- aligned to the depth map, although the reverse would also be structed 3D models. Nießner et al. [12] employed a spatial possible. The basic operation in the frame-to-frame camera hashing technique for the memory-efficient representation tracking is, for two given augmented images Fi and Fj from of the volumetric truncated signed distance field, which, two different, not necessarily consecutive, time steps i and because of its simple structure, allowed higher frame rates j (i > j), to estimate the rigid transformation Ti; j 2 SE(3) for 3D reconstruction via the GPU.