Online Surfel-Based Mesh Reconstruction
This is the author’s version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication. The final version of record is available at http://dx.doi.org/10.1109/TPAMI.2019.2947048

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. X, SEPTEMBER 2018

SurfelMeshing: Online Surfel-Based Mesh Reconstruction

Thomas Schöps, Torsten Sattler, and Marc Pollefeys

Abstract—We address the problem of mesh reconstruction from live RGB-D video, assuming a calibrated camera and poses provided externally (e.g., by a SLAM system). In contrast to most existing approaches, we do not fuse depth measurements in a volume but in a dense surfel cloud. We asynchronously (re)triangulate the smoothed surfels to reconstruct a surface mesh. This novel approach makes it possible to maintain a dense surface representation of the scene during SLAM which can quickly adapt to loop closures, by deforming the surfel cloud and asynchronously remeshing the surface where necessary. The surfel-based representation also naturally supports strongly varying scan resolution; in particular, it reconstructs colors at the input camera’s resolution. Moreover, in contrast to many volumetric approaches, ours can reconstruct thin objects, since objects do not need to enclose a volume. We demonstrate our approach in a number of experiments, showing that it produces reconstructions that are competitive with the state of the art, and we discuss its advantages and limitations. The algorithm (excluding loop closure functionality) is available as open source at https://github.com/puzzlepaint/surfelmeshing.

Index Terms—3D Modeling and Scene Reconstruction, RGB-D SLAM, Real-Time Dense Mapping, Applications of RGB-D Vision, Depth Fusion, Loop Closure, Surfels.
1 INTRODUCTION

The availability of fast programmable GPUs has enabled handling the massive amount of RGB-D data generated by depth sensors such as the Kinect in real time. This has led to dense Simultaneous Localization and Mapping (SLAM) systems [1], [2], online 3D scene perception approaches [3], [4], and applications such as Augmented Reality (AR) [5], [6] building on top of SLAM and dense mapping.

Performing dense mapping in real time is particularly important for interactions with the environment in AR or in robotics. For example, robots navigating through unknown scenes must map their surroundings quickly to be able to advance their path planning [7], [8], [9]. Complex AR applications, e.g., AR games in which users can race cars through their living room, need more information than is provided by a single depth map, which would be sufficient for occlusion handling but does not provide any information about the occluded areas.

Consumer applications benefit from online mapping, which avoids waiting times for users. Online mapping is also useful as a live preview for offline 3D reconstructions, e.g., those created with hand-held mobile devices. Especially non-expert users benefit from such feedback about which parts of the scene have already been reconstructed in good quality and where additional data is needed.

A central question in the context of SLAM and online dense mapping is the choice of scene representation. On the one side, applications such as collision detection in physics simulations [10], occlusion handling in AR applications [5], [11], and obstacle detection and path planning for autonomous navigation [7], [8], [9] benefit from continuous surface representations, e.g., meshes. On the other side, scene representations need to be flexible to handle spatial deformations caused by loop closure events in SLAM. It is important that they can be updated efficiently when new data becomes available to allow for real-time processing. Ideally, a scene representation should also be adaptive in terms of resolution: it should be able to model both large structures and small or thin objects without too much memory overhead.

The predominant scene representation for RGB-D SLAM and real-time dense 3D reconstruction is a voxel volume [12]. Each voxel in the volume stores the truncated signed distance to the closest surface, which can be updated very efficiently in parallel. A continuous surface representation in the form of a mesh can then be extracted from this volume with the Marching Cubes algorithm [13]. However, volume-based methods lack flexibility: handling loop closures during online operation is expensive, since accurate compensation can imply changing the whole volume. At the same time, the resolution of the voxel volume is often fixed in practice for efficiency reasons [2], [14], [15], which limits the adaptiveness of the representation (cf. Fig. 2, left). In addition, objects cannot be reconstructed if they are small or thin with respect to the voxel size.

Another popular approach is to represent the scene with a set of points or with surfels [16], [17], i.e., oriented discs (cf. Fig. 2, middle right). This scene representation is flexible, since point or surfel coordinates can be updated very efficiently for the whole reconstruction. It is also highly adaptive, as measurements at a higher resolution lead to denser point resp. surfel clouds, and it easily handles thin objects. The main drawback of this representation is its discrete nature, which can be resolved by meshing. However, global methods such as Delaunay triangulation [18], [19], [20] or Poisson reconstruction [21] are too computationally expensive to run in real time on the dense point clouds generated by RGB-D sensors. In contrast, efficient local meshing methods [22], [23] are very sensitive to noise, leading to both noisy surface reconstructions and holes in the meshes.

In this paper, we show that online meshing of reconstructed dense surfel clouds is possible through denoising and adaptive remeshing (cf. Fig. 1). Our algorithm is a novel combination of surfel reconstruction [17], [24] (Sec. 3.1) with fast point set triangulation [22], [23] (Sec. 3.3). The key step for enabling local triangulation is a novel surfel denoising scheme (Sec. 3.2) to handle the noisy input data (cf. Fig. 8). For efficiency, flexibility, and adaptiveness, we introduce a remeshing algorithm (Sec. 3.4) that recomputes the mesh locally where required. In contrast to volume-based methods, our approach easily adapts to loop closures and leads to higher-resolution models (cf. Fig. 2) while automatically handling thin objects (cf. Fig. 15). In contrast to purely point resp. surfel cloud-based approaches, our method reconstructs high-resolution meshes during online operation. To the best of our knowledge, ours is the first approach to provide a continuous surface reconstruction while allowing flexible deformations for loop closures, being adaptive to different resolutions, and allowing efficient updates.

Looking at the complete system, there is a fundamental question about how tightly the SLAM and 3D reconstruction components should be integrated with each other. A tight integration allows sharing data structures, letting both systems influence each other while reducing duplicated effort. Voxel-based methods such as KinectFusion [1] commonly use the volume for both tracking and mapping. A loose integration makes it easy to exchange the 3D reconstruction and SLAM parts independently. This improves portability: e.g., one could use ARCore resp. ARKit when porting a system to Android resp. iOS. In this article, we focus on the 3D reconstruction part and present a meshing pipeline that is rather loosely coupled, while offering the potential for tighter integration. In fact, making a step towards a tight integration between both parts, similar

Fig. 1. Triangle mesh reconstructed with our method with loop closures handled during reconstruction, colored (left) and shaded.

Fig. 2. Left to right: voxel-based meshes by InfiniTAM [29] and FastFusion [27] (each with default voxel size settings), surfel splat rendering by ElasticFusion [17], our surfel-based mesh. The same camera trajectory is used with all approaches. Notice the sharper colors produced by our method.

• Thomas Schöps and Marc Pollefeys are with the Department of Computer Science, ETH Zurich, Switzerland. E-mail: fi[email protected]
• Marc Pollefeys is additionally with Microsoft, Zurich.
• Torsten Sattler is with Chalmers University of Technology. E-mail: [email protected]

arXiv:1810.00729v2 [cs.CV] 20 Nov 2019
Copyright (c) 2019 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected].

create consistent scene models. We exclude offline methods, as the focus of our work is on (real-time) online operation.

Voxel volume-based methods build on truncated signed distance function (TSDF) fusion [12]. Popularized by KinectFusion [1], many following works focused on scaling this approach to larger scenes [25], [26], adding multi-resolution capability [27], [28], or improving efficiency [29], [30]. Mesh extraction typically runs at a lower frame rate than TSDF fusion. In contrast to our method, these works do not handle loop closures during reconstruction.

Kintinuous [2] uses a volume that moves with the camera. Scene parts leaving this volume are triangulated [23], and loops are closed by deforming the mesh. Compared to our method, [2] does not merge meshes after loop closures, resulting in multiple (potentially inconsistent) meshes per surface. Kähler et al. [14] use many small subvolumes that move independently to handle loop closures, combining overlapping subvolumes by weighted averaging during rendering.
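To make the TSDF fusion underlying the voxel-based methods above concrete, the per-voxel update is a running weighted average of truncated signed distances [12]. The following is a minimal sketch under stated assumptions, not the implementation of any cited system: the function name, the constant per-observation weight, and the simple truncation/carving rule are illustrative choices.

```python
import numpy as np

def fuse_depth_into_tsdf(tsdf, weights, voxel_depths, measured_depth,
                         truncation=0.05, max_weight=100.0):
    """Weighted-average TSDF update for one depth observation.

    tsdf, weights:  per-voxel arrays of current signed distances and fusion weights
    voxel_depths:   depth of each voxel center along the camera ray
    measured_depth: measured surface depth along the same ray
    """
    # Signed distance of each voxel to the observed surface, truncated.
    sdf = np.clip(measured_depth - voxel_depths, -truncation, truncation)
    # Only update voxels in front of or within the truncation band of the
    # surface; voxels far behind it are not observed by this measurement.
    mask = (measured_depth - voxel_depths) >= -truncation
    w_new = 1.0  # constant per-observation weight (an illustrative assumption)
    tsdf[mask] = (weights[mask] * tsdf[mask] + w_new * sdf[mask]) \
                 / (weights[mask] + w_new)
    weights[mask] = np.minimum(weights[mask] + w_new, max_weight)
    return tsdf, weights
```

The weight cap (`max_weight`) keeps old measurements from dominating forever, which is one reason moving-average TSDF volumes adapt slowly, but not instantly, to scene changes; correcting a loop closure still requires rewriting the affected voxels.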
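The surfel representation discussed above can be sketched in a similarly compact way: each surfel is an oriented disc whose attributes are refined by weighted averaging as new associated measurements arrive, in the spirit of point-based fusion [16], [17]. The data layout, the helper name `integrate_measurement`, and the keep-the-smaller-radius rule below are hypothetical illustrations, not the scheme of any specific system.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Surfel:
    position: np.ndarray  # 3D center of the disc
    normal: np.ndarray    # unit orientation of the disc
    radius: float         # disc radius, tied to measurement resolution
    weight: float         # accumulated fusion weight

def integrate_measurement(s: Surfel, p: np.ndarray, n: np.ndarray,
                          r: float, w: float = 1.0) -> Surfel:
    """Fuse one associated measurement (point p, normal n, radius r) into
    surfel s by weighted averaging. Hypothetical sketch; real systems use
    more elaborate association and weighting."""
    total = s.weight + w
    pos = (s.weight * s.position + w * p) / total
    nrm = s.weight * s.normal + w * n
    nrm = nrm / np.linalg.norm(nrm)  # re-normalize the averaged normal
    rad = min(s.radius, r)  # closer, higher-resolution scans shrink the disc
    return Surfel(pos, nrm, rad, total)
```

Because each surfel is updated independently, applying a loop closure amounts to transforming surfel positions and normals, which is what makes this representation cheap to deform compared to rewriting a voxel volume.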