Semi-Dense Visual Odometry for AR on a Smartphone

Thomas Schöps∗ Jakob Engel† Daniel Cremers‡

Technische Universität München

∗e-mail: [email protected] †e-mail: [email protected] ‡e-mail: [email protected]

Figure 1: From left to right: AR demo application with simulated car. Corresponding estimated semi-dense depth map. Estimated dense collision mesh, fixed and shown from a different perspective. Photo of running system. The attached video shows the system in action.

ABSTRACT

We present a direct monocular visual odometry system which runs in real-time on a smartphone. Being a direct method, it tracks and maps on the images themselves instead of on extracted features such as keypoints. New images are tracked using direct image alignment, while geometry is represented in the form of a semi-dense depth map. Depth is estimated by filtering over many small-baseline, pixel-wise stereo comparisons. This leads to significantly fewer outliers and makes it possible to map and use all image regions with sufficient gradient, including edges. We show how a simple world model for AR applications can be derived from semi-dense depth maps, and demonstrate the practical applicability in the context of an AR application in which simulated objects can collide with real geometry.

Keywords: Semi-Dense, Direct Visual Odometry, Tracking, Mapping, AR, Mobile Devices, 3D Reconstruction, NEON

1 INTRODUCTION

Estimating the movement of a monocular camera and the 3D structure of the environment is amongst the most prominent challenges in computer vision. Commonly referred to as monocular SLAM or structure from motion, it is a key enabler for many augmented reality applications: only if the precise pose of the camera is available in real-time can virtual objects be rendered into the scene as if they were part of it. Further, knowledge about the geometry of the scene allows virtual objects to interact with it: in an augmented reality game, game characters can collide with, be occluded by, or be placed on top of real obstacles. To assist with furnishing or redecorating a room, a piece of furniture could be reconstructed from a video taken with a smartphone and virtually rendered into different locations in the room. Figure 1 shows an example AR application realized on top of our direct Visual Odometry (VO) system.

Apart from marker based methods [24, 23, 6] – which allow for precise and fast camera pose estimation at the cost of having to manually place one or more physical markers into the scene – state-of-the-art monocular SLAM methods generally operate on features. While this makes it possible to estimate the camera movement in real-time on mobile platforms [12, 15], the resulting feature based maps hardly provide sufficient information about the 3D geometry of the scene for physical interaction.

At the same time, recent advances have shown the high potential of direct methods for monocular SLAM [16, 5, 7, 17]: instead of operating on features, these methods perform both tracking and mapping directly on the image intensity values. Fundamentally different from feature based methods, direct methods not only allow for fast, sub-pixel accurate camera tracking, but also provide substantially more information about the 3D structure of the environment, are less susceptible to outliers, and are more robust in environments with little texture [5].

1.1 Related Work

In this section we give an overview of existing monocular SLAM and VO methods, divided into feature based and direct methods. While there exists a large number of feature based methods for mobile phones, existing direct methods are computationally expensive and require a powerful GPU to run in real-time. Figure 2 summarizes the main differences between feature based and direct methods.

Feature Based. The basic idea behind features is to split the overall problem – estimating geometric information from images – into two separate, sequential steps: First, a set of feature observations is extracted from the image, typically independently of one another. This can be done using a large variety of methods, including different corner detectors and descriptors, as well as fast matching methods and outlier detection schemes like RANSAC. Second, camera position and scene geometry are computed as a function of these feature observations only. Again, there exists a variety of methods to do this, including bundle adjustment based approaches [11] and filtering based approaches [3, 14].

While decoupling image based (photometric) estimation from subsequent geometric estimation simplifies the overall problem, it comes with an important limitation: Only information that conforms to the feature type and parametrization can be used. In particular, when using keypoints, information contained in edges is discarded. Today, there are several keypoint based monocular VO and SLAM methods which run in real-time on mobile devices [14, 12].

In order to obtain a denser 3D reconstruction, one approach is to perform two-frame or multi-frame stereo on selected frames, where the camera pose is obtained from a full feature based SLAM system running in the background [22]. However, even though the computed dense depth maps are often more accurate and precise than the feature based map, they cannot directly be fed back into the SLAM system running in the background, thereby discarding valuable information.

Figure 2: Both pipelines take images as input. Feature based methods abstract images to feature observations (extract and match features, e.g. SIFT / SURF; track by minimizing the reprojection error between point distances; map by estimating feature parameters such as 3D points and normals) and discard all other information. In contrast, the proposed direct approach keeps the full images without abstraction and maps and tracks directly on image intensities (track by minimizing the photometric error on intensity differences; map by estimating a semi-dense per-pixel depth map): this makes it possible to (1) use all information, including e.g. edges, and (2) directly obtain rich, semi-dense information about the geometry of the scene.

Figure 3: Examples of semi-dense depth maps estimated in real-time on a smartphone. See also the attached video.

Direct. Direct approaches circumvent these limitations by directly optimizing the camera poses and scene geometry on the raw images. This makes it possible to use all information in the image, leading to higher accuracy and robustness, in particular in indoor environments with only few features. Early direct or semi-direct approaches were based on scene representations by sets of planar patches: [19] presents such a system which simultaneously estimates the motion, scene structure and illumination. [8] also combines tracking and reconstruction and especially discusses local optima in the error function.

Images are tracked by direct minimization of the per-pixel photometric error (see Sec. 2.1), which is well established for tracking RGB-D or stereo sensors [10, 1]. In a monocular setting, the required per-pixel depth values are in turn computed from stereo on previous frames: in [5], a pixel-wise filtering formulation was proposed, which fuses information from many small-baseline stereo comparisons. This approach makes it possible to obtain accurate and precise semi-dense depth maps in real-time on a CPU. It has recently been extended to LSD-SLAM, a large-scale direct monocular SLAM system [4] including loop-closures. Another approach is to compute fully dense depth maps using a variational formulation [20, 16, 17], which however is computationally demanding and requires a state-of-the-art GPU to run in real-time. A common feature of direct methods is their inherent parallelism: as many operations are defined on a per-pixel basis, they can ideally be parallelized using GPGPUs or SIMD (Single Instruction Multiple Data) instructions, achieving considerable speed-ups in practice.

1.2 Contributions and Outline

In this paper we present a direct monocular VO system based on [5] which runs in real-time on a smartphone. In addition to accurately computing the camera pose at over 30 Hz, the proposed method provides rich information about the environment in the form of a semi-dense depth map of the currently visible scene. In particular we (1) describe the modifications required to run the algorithm in real-time on a smartphone, and (2) propose a method to derive a dense world model suitable for basic physical interaction of simulated objects with the real world. We demonstrate the capabilities of the proposed approach with a simple game, in which a simulated car drives through the environment and can collide with real obstacles in the scene.

The paper is organized as follows: We describe the proposed semi-dense, direct VO method in Sec. 2. In particular, in Sec. 2.3, we describe the steps required for real-time performance on a smartphone. Following this, we show how collision meshes are computed from the semi-dense depth map in Sec. 3, and how they are used in a simple AR application. Finally, we qualitatively evaluate the resulting system in different environments in Sec. 4.

2 SEMI-DENSE DIRECT VISUAL ODOMETRY

The proposed monocular VO algorithm does not use features at any stage, but instead operates directly on the raw intensity images: The map is represented as a semi-dense inverse depth map, which contains a Gaussian probability distribution (mean and variance) for the inverse depth of a subset of pixels, hence "semi-dense". An example is shown in Fig. 3: pixels that have a depth hypothesis are shown in color (encoding the depth).

The whole system is divided into two parts (see Fig. 4), running in parallel: tracking and mapping. In Sec. 2.1, we describe tracking using direct image alignment. In Sec. 2.2, we present the mapping part, which simultaneously estimates and propagates the depth map. The system is closely based on the approach by Engel et al. [5] for real-time operation on a consumer laptop.

Figure 4: Semi-Dense Visual Odometry: Tracking (minimizing the photometric error, at 30 Hz on a smartphone for 320×240 input video) and mapping (estimating the semi-dense depth map by propagation, update and regularization, at ~15 Hz on a smartphone) are performed in parallel, operating on a semi-dense inverse depth map as central data structure. It contains an inverse depth hypothesis (a Gaussian probability distribution over the inverse depth) for all pixels close to sufficiently strong image gradient. It is continuously propagated to new frames, updated with new stereo observations and spatially regularized. More details on the respective steps are given in Sec. 2.

Notation. We represent an image as a function I : Ω → ℝ. Similarly, we represent the inverse depth map and the inverse depth variance map as functions D : Ω_D → ℝ⁺ and V : Ω_D → ℝ⁺, where Ω_D contains all pixels which have a valid depth hypothesis. Note that D and V denote mean and variance of the inverse depth, as this approximates the uncertainty of stereo much better than assuming a Gaussian-distributed depth.

Initialization. We initialize the map with random depth values and large variance for the first frame. When moving the camera slowly and parallel to the image plane, the algorithm (running normally) typically locks onto a consistent depth configuration and quickly converges to a valid map. This is in contrast to [5], in which a keypoint based initializer was used. While the process is successful in most cases, we observed one distinct failure case which results in an inverse estimate of both the depth map and the camera motion; this is further discussed in [8]. A numerical evaluation of the initialization success and convergence rate is given in Sec. 4.
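For concreteness, the central data structure can be sketched as follows. This is a minimal C++ illustration of the semi-dense inverse depth map described above, not the authors' actual implementation; the type and member names are our own:

```cpp
#include <vector>

// One per-pixel hypothesis: a Gaussian over INVERSE depth. Inverse depth
// approximates the uncertainty of small-baseline stereo far better than a
// Gaussian over the depth itself (see the Notation paragraph above).
struct DepthHypothesis {
    float inv_depth = 0.f;      // mean D(x), defined up to global scale
    float inv_depth_var = 0.f;  // variance V(x)
    bool valid = false;         // true iff x is in Omega_D
};

// Semi-dense inverse depth map: stored as a dense pixel grid, but only
// pixels with sufficient intensity gradient ever carry a valid hypothesis.
struct SemiDenseDepthMap {
    int width = 0, height = 0;
    std::vector<DepthHypothesis> data;  // row-major, width * height entries

    SemiDenseDepthMap(int w, int h) : width(w), height(h), data(w * h) {}
    DepthHypothesis& at(int x, int y) { return data[y * width + x]; }
};
```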

2.1 Tracking

The pose of new frames is estimated using direct image alignment: given the current map {I_M, D_M, V_M}, the relative pose ξ ∈ SE(3) of a new frame I is obtained by directly minimizing the photometric error

E(\xi) := \sum_{x \in \Omega_{D_M}} \| I_M(x) - I(\omega(x, D_M(x), \xi)) \|_\delta ,    (1)

where ω : Ω_{D_M} × ℝ × SE(3) → Ω projects a point from the reference image into the new frame, and ‖·‖_δ is the Huber norm to account for outliers. Global brightness changes due to auto-shutter typically have little effect, as we only use image regions with strong gradient. The minimum is computed using iteratively re-weighted Levenberg-Marquardt minimization, as described in [4].
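To illustrate eq. (1), the following sketch evaluates the photometric error for one candidate pose. It is a simplified illustration under our own assumptions (pinhole intrinsics, nearest-neighbour lookup instead of the bilinear interpolation used in practice, and the pose given directly as R, t rather than computed from the twist ξ via the exponential map), not the paper's implementation:

```cpp
#include <cmath>
#include <vector>

struct Intrinsics { float fx, fy, cx, cy; };

struct GrayImage {
    int w = 0, h = 0;
    std::vector<float> px;  // row-major intensities
    float at(int x, int y) const { return px[y * w + x]; }
};

// Huber norm ||r||_delta of eq. (1), down-weighting outlier residuals.
inline float HuberNorm(float r, float delta) {
    float a = std::fabs(r);
    return a < delta ? (r * r) / (2.f * delta) : a - delta / 2.f;
}

// Photometric error E for one candidate pose (R, t). inv_depth holds the
// semi-dense map D_M of the reference frame; values <= 0 mean "no hypothesis".
float PhotometricError(const GrayImage& ref, const std::vector<float>& inv_depth,
                       const GrayImage& cur, const Intrinsics& K,
                       const float R[3][3], const float t[3], float delta) {
    float error = 0.f;
    for (int v = 0; v < ref.h; ++v) {
        for (int u = 0; u < ref.w; ++u) {
            float id = inv_depth[v * ref.w + u];
            if (id <= 0.f) continue;  // pixel not in Omega_{D_M}: skip

            // Un-project x = (u, v) with depth 1/id into 3D (warp omega, part 1).
            float z = 1.f / id;
            float X = (u - K.cx) / K.fx * z;
            float Y = (v - K.cy) / K.fy * z;

            // Rigid-body motion into the new frame (warp omega, part 2).
            float Xc = R[0][0]*X + R[0][1]*Y + R[0][2]*z + t[0];
            float Yc = R[1][0]*X + R[1][1]*Y + R[1][2]*z + t[1];
            float Zc = R[2][0]*X + R[2][1]*Y + R[2][2]*z + t[2];
            if (Zc <= 0.f) continue;  // point behind the camera

            // Project into the current image (warp omega, part 3).
            int uc = static_cast<int>(K.fx * Xc / Zc + K.cx + 0.5f);
            int vc = static_cast<int>(K.fy * Yc / Zc + K.cy + 0.5f);
            if (uc < 0 || uc >= cur.w || vc < 0 || vc >= cur.h) continue;

            // Per-pixel photometric residual, robustified by the Huber norm.
            error += HuberNorm(ref.at(u, v) - cur.at(uc, vc), delta);
        }
    }
    return error;
}
```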
Image Pyramid. To handle larger inter-frame motions, we use a pyramid scheme: Each new frame is first tracked on a very low resolution image and depth map; the tracked pose is then used as initialization for the next higher resolution. Depth maps are down-sampled by factors of two, using a weighted average of the inverse depth. To account for the strong correlation between neighbouring pixels, we average the information (inverse variance), giving

D_{l+1}(x) := \frac{\sum_{x' \in \Omega_x} D_l(x') / V_l(x')}{\sum_{x' \in \Omega_x} 1 / V_l(x')} ,    (2)

V_{l+1}(x) := \frac{|\Omega_x|}{\sum_{x' \in \Omega_x} 1 / V_l(x')} ,    (3)

where l is the index of the pyramid level, and Ω_x denotes the set of valid pixels (i.e., with depth value) contained in pixel x at the next higher resolution. Note that a pixel at low resolution has an associated depth hypothesis if at least one of the contained high-resolution pixels has a depth hypothesis. This strategy automatically lets the semi-dense depth maps become denser on lower resolutions (see Fig. 5), leading to more robust low-resolution tracking results without affecting the accuracy of the final result on the highest resolution. Averaging the inverse depth effectively averages the optical flow, causing this strategy to work well for minimizing the photometric error (1). If used for reconstruction purposes however, it will create undesired points between the front and back surface around depth discontinuities.

Figure 5: Image Pyramid. Top: intensity image I_l, middle: color-coded inverse depth D_l, bottom: inverse depth variance V_l. The given percentage corresponds to the density, i.e., the percentage of pixels that have a depth value: 39% at 320×240, 46% at 160×120, 56% at 80×60, 71% at 40×30 and 87% at 20×15. The described down-sampling strategy causes the depth maps to become significantly denser on higher pyramid levels.
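A minimal sketch of this down-sampling step, implementing eqs. (2) and (3) directly. This is our own illustration, not the paper's code; a non-positive variance marks pixels without a hypothesis:

```cpp
#include <vector>

// One pyramid level of the inverse depth map; var <= 0 marks "no hypothesis".
struct DepthLevel {
    int w, h;
    std::vector<float> inv_depth;  // D_l
    std::vector<float> var;        // V_l
};

// Downsample by a factor of two following eqs. (2) and (3): average the
// inverse depth weighted by inverse variance (information), so that a
// low-resolution pixel is valid as soon as ONE of its four source pixels is.
DepthLevel Downsample(const DepthLevel& fine) {
    const int w = fine.w / 2, h = fine.h / 2;
    DepthLevel coarse{w, h, std::vector<float>(w * h, 0.f),
                            std::vector<float>(w * h, -1.f)};
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            float sum_d_over_v = 0.f;  // sum of D_l(x') / V_l(x')
            float sum_inv_v = 0.f;     // sum of 1 / V_l(x')
            int n = 0;                 // |Omega_x|, number of valid sources
            for (int dy = 0; dy < 2; ++dy) {
                for (int dx = 0; dx < 2; ++dx) {
                    int idx = (2 * y + dy) * fine.w + (2 * x + dx);
                    if (fine.var[idx] <= 0.f) continue;
                    sum_d_over_v += fine.inv_depth[idx] / fine.var[idx];
                    sum_inv_v    += 1.f / fine.var[idx];
                    ++n;
                }
            }
            if (n == 0) continue;  // pixel stays invalid
            int out = y * w + x;
            coarse.inv_depth[out] = sum_d_over_v / sum_inv_v;  // eq. (2)
            coarse.var[out]       = n / sum_inv_v;             // eq. (3)
        }
    }
    return coarse;
}
```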

2.2 Mapping

Depth maps are estimated by filtering over small-baseline pixel-wise stereo comparisons, interleaved with spatial regularization and propagation to a new frame, as first proposed in [5]. Each mapping iteration consists of three main steps:

1. Propagation: Depth hypotheses are projected into the most recently tracked frame, giving an initialization for the new depth map (the prediction step in an extended Kalman filter (EKF)).

2. Update: New depth measurements are obtained from a large number of pixel-wise stereo comparisons with previous frames, and merged into the existing depth map by filtering (the observation step in an EKF). We use the propagated prior hypothesis to constrain the search interval, greatly accelerating the search and reducing the probability of false observations in repetitive image regions. Stereo is only performed for a subset of suitable pixels, that is, pixels where the expected accuracy is sufficiently high. This depends on the intensity gradient at that point as well as the camera motion, and is efficiently determined as proposed in [5]. In particular, regions with little image gradient are never updated, as no accurate stereo measurements can be obtained there. Ω_D contains all pixels that have a depth hypothesis, either propagated from previous frames or observed in that frame.

3. Regularization: In a last step, the depth map is spatially regularized and outliers are removed.

Mapping runs in a continuous loop, in each iteration propagating the depth map to the most recently tracked frame, potentially skipping some frames. The runtime of one iteration varies in practice, as it depends on the density of the current depth map and the camera motion; an experimental runtime evaluation is given in Sec. 4.
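The filtering in the update step amounts to multiplying two Gaussians over inverse depth. A minimal, self-contained sketch of this fusion (our own illustration; the epipolar stereo search that produces the observation, and the outlier test comparing it against the prior, are omitted):

```cpp
// An inverse depth hypothesis as a Gaussian N(mean, var) over inverse depth.
struct Hypothesis { float mean; float var; };

// Mapping step 2 in EKF terms: fuse the propagated prior with a new
// small-baseline stereo observation by multiplying the two Gaussians.
// In the full system, the stereo search itself is constrained to the
// interval prior.mean +/- k * sqrt(prior.var) along the epipolar line.
Hypothesis FuseObservation(Hypothesis prior, Hypothesis obs) {
    float fused_var  = prior.var * obs.var / (prior.var + obs.var);
    float fused_mean = fused_var * (prior.mean / prior.var +
                                    obs.mean / obs.var);
    return {fused_mean, fused_var};
}
```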
2.3 Implementation on Mobile Phones

Current smartphone cameras have a rolling shutter, which introduces systematic distortions and, during quick motion, can have strong effects on the accuracy of stereo observations. While there exist methods to correctly model this in an off-line reconstruction setting [18], or to approximate it in real-time [9, 13], we found that ignoring the rolling shutter still gives very good results in practice, and significantly saves computational time.

All experiments are conducted on a Sony Xperia Z1, which is equipped with a 2.3 GHz quad-core CPU. While the processing power of mobile devices has increased rapidly in recent years, mobile processors based on the ARM architecture are generally still much slower than their desktop counterparts; we currently do not use GPU or DSP features for tracking or mapping. In order to achieve real-time performance, i.e., tracking with at least 30 fps under these conditions, two steps were crucial: (1) separation of mapping and tracking resolution and the choice of a suitable compromise, and (2) NEON optimization of computation-heavy algorithmic steps.

Image Resolution. While current desktop CPUs easily allow for real-time operation at VGA resolution (640×480), our mobile implementation performs mapping at 320×240. As tracking performance is crucial for a smooth AR experience, we further reduce the maximum resolution used for tracking down to 160×120. While this greatly reduces the computation time, the effect on accuracy is relatively small (see Sec. 4). This can be explained by the sub-pixel accuracy of direct image alignment: In practice, inaccuracies from motion blur, rolling shutter and other model violations (e.g. reflections, occlusions, specular highlights) dominate the error.

NEON Parallelization. Many parts of the tracking stage are well suited for optimization using SIMD parallelization. We use NEON instructions, which offer this functionality on ARM processors, leading to greatly improved performance and thereby being a vital step in achieving real-time performance on mobile processors. There are two algorithmic steps in tracking which particularly benefit from NEON optimization:

(1) Calculating the approximated Hessian H and gradient g of the error, required for building the linear system that yields the pose increment Δξ, that is

\underbrace{\sum_{x \in \Omega_D} J_x^T J_x w_x}_{H} \cdot \Delta\xi \;=\; \underbrace{\sum_{x \in \Omega_D} J_x^T r_x w_x}_{g} ,    (4)

where r_x, J_x and w_x are, for one pixel x, the pixel's residual, its Jacobian with respect to the pose update, and the computed Huber weight. Using NEON optimization, four elements of the sum – which goes over all pixels that have depth – can be processed at once, resulting in a significant speed-up.

(2) Calculating the weights and the residual sum: again, four pixels can be processed at the same time. In addition, NEON offers fast inverse approximations, which help to reduce processing time.

Both of the above steps are required in every Levenberg-Marquardt iteration, thereby making up a large part of the overall tracking time. Details on the runtime of our implementation, with and without NEON acceleration, are given in Sec. 4.
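As an illustration of step (1), the following sketch accumulates one entry of H and g from eq. (4) with NEON intrinsics, four pixels per iteration. This is our own simplified example (the paper's implementation uses hand-written assembly), shown for a single pair of Jacobian components:

```cpp
#include <arm_neon.h>
#include <cstddef>

// Accumulates one entry of the normal equations of eq. (4), four pixels at
// a time: h_ij += sum(w * J_i * J_j) and g_i += sum(w * J_i * r), where J_i,
// J_j are two components of the per-pixel Jacobians, r the residuals and w
// the Huber weights. In a full tracker this runs for all 21 unique Hessian
// entries and 6 gradient entries of the 6-DoF pose update.
void AccumulateNormalEquations(const float* Ji, const float* Jj,
                               const float* r, const float* w,
                               size_t n, float* h_ij, float* g_i) {
    float32x4_t h_acc = vdupq_n_f32(0.f);
    float32x4_t g_acc = vdupq_n_f32(0.f);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        float32x4_t ji = vld1q_f32(Ji + i);
        float32x4_t jj = vld1q_f32(Jj + i);
        float32x4_t ri = vld1q_f32(r + i);
        float32x4_t wi = vld1q_f32(w + i);
        float32x4_t wji = vmulq_f32(wi, ji);  // w * J_i
        h_acc = vmlaq_f32(h_acc, wji, jj);    // += w * J_i * J_j
        g_acc = vmlaq_f32(g_acc, wji, ri);    // += w * J_i * r
    }
    // Horizontal sums (pairwise adds keep this ARMv7-compatible).
    float32x2_t h2 = vpadd_f32(vget_low_f32(h_acc), vget_high_f32(h_acc));
    float32x2_t g2 = vpadd_f32(vget_low_f32(g_acc), vget_high_f32(g_acc));
    float h = vget_lane_f32(vpadd_f32(h2, h2), 0);
    float g = vget_lane_f32(vpadd_f32(g2, g2), 0);
    for (; i < n; ++i) {  // scalar tail for n not divisible by four
        h += w[i] * Ji[i] * Jj[i];
        g += w[i] * Ji[i] * r[i];
    }
    *h_ij = h;
    *g_i = g;
}
```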

3 AUGMENTED REALITY APPLICATION

We demonstrate a simple AR game using the computed semi-dense depth maps, in which a simulated car can be driven through the environment. For this, we construct a low-resolution collision mesh from the semi-dense depth map, which is used for real-time physics simulation with the free Bullet library [2]. We assume that the scene has a well-defined ground plane, which we estimate with the help of the IMU. The full processing pipeline for augmented reality is shown in Fig. 7.

Figure 7: AR Processing Pipeline: Camera images are retrieved and displayed directly at full resolution (30 Hz). Simultaneously, a downsampled (320×240) version is computed and processed in the VO pipeline (pose estimation at 30 Hz, mapping at ~15 Hz) to allow real-time tracking and mapping. The estimated pose of that frame, as well as the generated world model, are then used to render virtual objects into the scene (AR frame rendering at 30 Hz). The world model is updated asynchronously at a lower frequency (~15 Hz). All components run in parallel.

3.1 Collision Mesh Generation

We first compute a fully dense low-resolution (15×20) depth map using a variational in-painting approach. As data term for valid pixels we use the hypothesis from the corresponding level of the semi-dense depth map. Additionally, to cover up large unconstrained regions, we assume that pixels that do not have a depth hypothesis lie on the estimated ground plane π. As regularizer we use the Huber norm

\|x\|_\delta := \begin{cases} \|x\|_2^2 / (2\delta) & \text{if } \|x\|_2 < \delta \\ \|x\|_1 - \delta/2 & \text{otherwise} \end{cases}    (5)

of the inverse depth gradient. The Huber norm is a combination of a quadratic regularizer favouring smooth surfaces, and the total variation (TV), which allows sharp transitions at occluding edges. The combined energy to be minimized with respect to the resulting inverse depth map u is hence given by

E(u) := \int_{\Omega_D} \frac{(u(x) - D(x))^2}{V(x)} \, dx \;+\; \int_{\Omega \setminus \Omega_D} \frac{(u(x) - \pi(x))^2}{V_\pi} \, dx \;+\; \alpha \int_{\Omega} \|\nabla u\|_\delta \, dx ,    (6)

where π(x) denotes the inverse depth of pixel x assuming it lies on the estimated ground plane, while V_π and α are parameters of the energy functional. This is a convex energy, and at the used resolution it can be minimized globally and quickly using gradient descent; a sketch of one descent step is given at the end of this section. Fig. 6 shows some results.

Figure 6: Variational Inpainting: the top row shows a number of full-resolution depth maps from live operation, the bottom row shows the computed low-resolution, regularized version used to build the collision mesh.

Afterwards, a triangle mesh is generated from the resulting depth map by interpreting the depth pixels as corners of a regular triangle grid. Some examples in different scenes are shown in Fig. 9. A desirable effect of this approach is that the collision meshes naturally have a higher resolution in close-by than in far-away regions: close by, the un-projected mesh vertices are more tightly spaced, and at the same time the depth map is more accurate.

3.2 Ground plane estimation

We estimate the ground plane normal by low-pass filtering accelerometer measurements, which are available on all modern smartphones, giving the direction of gravity. To determine the plane height, we search for the lowest height which is supported by a certain minimum number of depth map samples. The maximum height of all supporting samples is then taken as the ground plane: this assures that small bumps, caused by inaccurate height estimates of individual samples, are covered up with a smooth ground surface to drive on.
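As referenced in Sec. 3.1, here is a minimal sketch of one gradient-descent step on energy (6). The discretization (forward differences for the gradient, backward differences for the divergence, the 2-norm variant of the Huber derivative) and all names are our own assumptions; the two data terms are merged into one quadratic term by storing either D(x), V(x) or π(x), V_π per pixel:

```cpp
#include <cmath>
#include <vector>

// One gradient-descent step on the inpainting energy of eq. (6) over a
// low-resolution W x H grid u. data_mean/data_var hold D(x), V(x) where a
// hypothesis exists and the ground-plane prior pi(x), V_pi elsewhere.
// alpha is the smoothness weight, delta the Huber threshold, tau a small
// step size.
void GradientStep(std::vector<float>& u, const std::vector<float>& data_mean,
                  const std::vector<float>& data_var, int W, int H,
                  float alpha, float delta, float tau) {
    auto idx = [W](int x, int y) { return y * W + x; };

    // q = derivative of ||grad u||_delta w.r.t. grad u: the scaled gradient
    // in the quadratic region, the normalized gradient in the TV region.
    std::vector<float> qx(W * H, 0.f), qy(W * H, 0.f);
    for (int y = 0; y < H; ++y) {
        for (int x = 0; x < W; ++x) {
            float gx = (x + 1 < W) ? u[idx(x + 1, y)] - u[idx(x, y)] : 0.f;
            float gy = (y + 1 < H) ? u[idx(x, y + 1)] - u[idx(x, y)] : 0.f;
            float norm = std::sqrt(gx * gx + gy * gy);
            float scale = (norm < delta) ? 1.f / delta : 1.f / norm;
            qx[idx(x, y)] = gx * scale;
            qy[idx(x, y)] = gy * scale;
        }
    }

    // Descent: du = 2 (u - data) / var - alpha * div(q), with the
    // divergence discretized by backward differences.
    for (int y = 0; y < H; ++y) {
        for (int x = 0; x < W; ++x) {
            int i = idx(x, y);
            float div = qx[i] + qy[i];
            if (x > 0) div -= qx[idx(x - 1, y)];
            if (y > 0) div -= qy[idx(x, y - 1)];
            float grad = 2.f * (u[i] - data_mean[i]) / data_var[i]
                         - alpha * div;
            u[i] -= tau * grad;
        }
    }
}
```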

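For the ground plane estimation of Sec. 3.2, a sketch under our own assumptions: the gravity direction is tracked with a simple exponential low-pass filter, and "supported" is interpreted as a minimum number of samples within a height tolerance (the paper does not spell out this detail):

```cpp
#include <algorithm>
#include <vector>

struct Vec3 { float x, y, z; };

// Gravity direction from accelerometer readings via an exponential
// low-pass filter (alpha well below 1; normalize the result before use).
Vec3 UpdateGravity(Vec3 filtered, Vec3 accel, float alpha) {
    return {filtered.x + alpha * (accel.x - filtered.x),
            filtered.y + alpha * (accel.y - filtered.y),
            filtered.z + alpha * (accel.z - filtered.z)};
}

// Plane height search: heights[i] is the signed distance of depth-map
// sample i along the (unit) gravity direction. Finds the lowest height
// supported by at least min_support samples within `window`, and returns
// the maximum height among those supporting samples, so that single low
// outliers cannot pull the ground plane down.
float EstimatePlaneHeight(std::vector<float> heights, int min_support,
                          float window) {
    if (heights.empty() || min_support <= 0) return 0.f;
    std::sort(heights.begin(), heights.end());
    for (size_t i = 0;
         i + static_cast<size_t>(min_support) <= heights.size(); ++i) {
        size_t j = i + static_cast<size_t>(min_support) - 1;
        if (heights[j] - heights[i] <= window)
            return heights[j];  // max height of the supporting samples
    }
    return heights.front();  // fallback: lowest sample
}
```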
4 RESULTS

Initialization. We evaluate the success rate of random initialization by running our system on many subsequences of the fr2/desk sequence of the TUM RGB-D benchmark [21]. Note that this includes subsequences with all types of motion, in particular strong rotation or forward-translation, which are ill-conditioned for initialization – causing the initialization to fail more often than in a hand-held case. A run is classified as successful if the mean relative depth error after 3 seconds is at most 16%, and the final relative translational drift (computed over 15 frames) is at most 60%. We observed that the success rate strongly depends on the movement speed of the camera: if the camera does not move sufficiently fast, the depth filters become over-confident and get stuck at wrong values – this however can be avoided by only mapping on a subset of frames, e.g. every 4th frame. Overall, the measured initialization success rate with this configuration is 67%. Figure 8 shows the evolution of the relative translational drift as well as the mean relative depth error over the first 10 s of all successful runs. Note that these results are obtained using the resolutions employed on the smartphone, and some of the sequences contain strong motion blur and rolling shutter artifacts.

Figure 8: Initialization evaluation: development of errors in successful random initialization runs (average in blue, all samples in gray) over the first 10 seconds after initialization, evaluated on subsequences of the TUM RGB-D benchmark fr2/desk sequence. The depth error shows the relative deviation from the ground truth depth, after choosing the optimal scale. The tracking error shows the translational drift per 0.5 seconds, relative to the traveled distance. The largest reduction in depth error happens already within the first 0.5 seconds; the tracking error requires longer to stabilize.

Accuracy. We numerically evaluate the tracking accuracy of the proposed approach for different resolutions using the TUM RGB-D benchmark. To be independent of initialization issues and to obtain the correct scale, we use the very first depth image for initialization, while for the remainder of the sequences only the provided intensity images are used. Table 1 shows the results. Notably, the accuracy changes only very little with decreasing image resolution, allowing smooth yet accurate operation on a smartphone.

Table 1: Tracking accuracy (as drift per second) for two sequences from the TUM RGB-D benchmark [21] at different resolutions. For comparison, we also include results obtained with PTAM [11]. The sequences contain significant rolling shutter and motion blur effects, which reduce the tracking accuracy of PTAM significantly. The accuracy of direct tracking degrades only very little with decreasing image resolution.

  method   mapping    tracking    fr2/xyz            fr2/desk
                                  (cm/s)  (deg/s)    (cm/s)  (deg/s)
  PTAM     640×480    640×480     8.2     3.2        fail    fail
  ours     640×480    640×480     0.50    0.31       2.2     0.96
  ours     640×480    320×240     0.58    0.32       3.6     1.25
  ours     320×240    320×240     0.58    0.32       3.3     0.96
  ours     320×240    160×120     0.62    0.33       4.9     1.38
  ours     160×120    160×120     0.68    0.37       fail    fail
  ours     160×120    80×60       1.58    0.71       fail    fail
Speed. With NEON optimizations, at a resolution of 320×240 our system is able to track the camera pose at usually well over 30 Hz on current-generation smartphones. See Table 2 for timing values measured on a Sony Xperia Z1.

Table 2: Performance of tracking and mapping for the first 600 frames of the fr2/desk sequence, on a Sony Xperia Z1 (mean and standard deviation, in ms per iteration). The given resolution is the mapping resolution; tracking is performed one pyramid level higher (i.e., at half that resolution). Mapping is not NEON-optimized; the frames were played back at 30 fps. Note that tracking and mapping run in parallel.

             C++             ASM with NEON     ASM with NEON
             320×240         640×480           320×240
  Tracking   30.7 (±11.2)    39.2 (±15.9)      14.7 (±6.8)
  Mapping    46.6 (±35.4)    184.4 (±123.3)    52.6 (±37.7)

Qualitative Results. We extensively tested the system in real-time operation; Fig. 9 shows some examples of augmented scenes. A full sequence is shown in the attached video.

Figure 9: Qualitative evaluation of the system in different, challenging scenes. The collision mesh is fixed and shown from a different perspective, together with its original viewport, augmented objects and the ground plane. The screenshots were taken by the smartphone during live operation.

5 CONCLUSION

The presented direct monocular visual odometry algorithm is able to operate in real-time on a modern smartphone, with tracking rates of well above 30 Hz at a mapping resolution of 320×240. It operates fully without features; instead, it is based on direct image alignment for tracking, and on semi-dense depth estimation by pixel-wise filtering over many small-baseline stereo comparisons for mapping. This makes it possible to use much more of the information in the images (including e.g. edges) and drastically reduces the number of outliers. In addition to accurately and robustly estimating the camera pose, the estimated semi-dense depth maps can be used to build a physical world model for AR with little additional computational effort. We demonstrated this with a small example application.

As future work, using a more sophisticated regularizer for depth map in-painting (e.g. total generalized variation) would eliminate the need for estimating a ground plane. At the same time, more sophisticated minimization schemes, as e.g. in [20], will allow world modeling at higher resolutions. Integrating the recent extension towards full, large-scale direct monocular SLAM (LSD-SLAM) [4] will make it possible to merge collision meshes and scale the method to larger environments.

REFERENCES

[1] A. Comport, E. Malis, and P. Rives. Accurate quadri-focal tracking for robust 3D visual odometry. In ICRA, 2007.
[2] E. Coumans et al. Bullet physics library. http://bulletphysics.org.
[3] A. Davison, I. Reid, N. Molton, and O. Stasse. MonoSLAM: Real-time single camera SLAM. Trans. on Pattern Analysis and Machine Intelligence (TPAMI), 29, 2007.
[4] J. Engel, T. Schöps, and D. Cremers. LSD-SLAM: Large-scale direct monocular SLAM. In ECCV, 2014.
[5] J. Engel, J. Sturm, and D. Cremers. Semi-dense visual odometry for a monocular camera. In ICCV, 2013.
[6] M. Fiala. ARTag, a fiducial marker system using digital techniques. In CVPR, 2005.
[7] C. Forster, M. Pizzoli, and D. Scaramuzza. SVO: Fast semi-direct monocular visual odometry. In ICRA, 2014.
[8] O. Kähler and J. Denzler. Tracking and reconstruction in a combined optimization approach. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(2):387–401, 2012.
[9] A. Karpenko, D. Jacobs, J. Baek, and M. Levoy. Digital video stabilization and rolling shutter correction using gyroscopes. CSTR, 1:2, 2011.
[10] C. Kerl, J. Sturm, and D. Cremers. Dense visual SLAM for RGB-D cameras. In Intelligent Robot Systems (IROS), 2013.
[11] G. Klein and D. Murray. Parallel tracking and mapping for small AR workspaces. In Mixed and Augmented Reality (ISMAR), 2007.
[12] G. Klein and D. Murray. Parallel tracking and mapping on a camera phone. In Mixed and Augmented Reality (ISMAR), 2009.
[13] M. Li, B. Kim, and A. Mourikis. Real-time motion estimation on a cellphone using inertial sensing and a rolling-shutter camera. In ICRA, 2013.
[14] M. Li and A. Mourikis. High-precision, consistent EKF-based visual-inertial odometry. International Journal of Robotics Research, 32:690–711, 2013.
[15] A. Mourikis and S. Roumeliotis. A multi-state constraint Kalman filter for vision-aided inertial navigation. In IEEE International Conference on Robotics and Automation, 2007.
[16] R. Newcombe, S. Lovegrove, and A. Davison. DTAM: Dense tracking and mapping in real-time. In ICCV, 2011.
[17] M. Pizzoli, C. Forster, and D. Scaramuzza. REMODE: Probabilistic, monocular dense reconstruction in real time. In ICRA, 2014.
[18] O. Saurer, K. Köser, J.-Y. Bouguet, and M. Pollefeys. Rolling shutter stereo. In ICCV, 2013.
[19] G. Silveira, E. Malis, and P. Rives. An efficient direct approach to visual SLAM. Robotics, IEEE Transactions on, 24(5):969–979, 2008.
[20] J. Stuehmer, S. Gumhold, and D. Cremers. Real-time dense geometry from a handheld camera. In Pattern Recognition (DAGM), 2010.
[21] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In Intelligent Robot Systems (IROS), 2012.
[22] P. Tanskanen, K. Kolev, L. Meier, F. C. Paulsen, O. Saurer, and M. Pollefeys. Live metric 3D reconstruction on mobile phones. In ICCV, 2013.
[23] D. Wagner, T. Langlotz, and D. Schmalstieg. Robust and unobtrusive marker tracking on mobile phones. In Mixed and Augmented Reality (ISMAR), 2008.
[24] D. Wagner and D. Schmalstieg. ARToolKitPlus for pose tracking on mobile devices. In Proceedings of the 12th Computer Vision Winter Workshop (CVWW'07), 2007.