Real-Time Scalable Dense Surfel Mapping
Total Page:16
File Type:pdf, Size:1020Kb
Real-time Scalable Dense Surfel Mapping Kaixuan Wang, Fei Gao and Shaojie Shen Abstract— In this paper, we propose a novel dense surfel mapping system that scales well in different environments with only CPU computation. Using a sparse SLAM system to estimate camera poses, the proposed mapping system can fuse intensity images and depth images into a globally consistent model. The system is carefully designed so that it can build from room-scale environments to urban-scale environments using (a) Reconstruct using stereo cameras (b) Reconstruct details depth images from RGB-D cameras, stereo cameras or even a monocular camera. First, superpixels extracted from both intensity and depth images are used to model surfels in the system. superpixel-based surfels make our method both run- time efficient and memory efficient. Second, surfels are further organized according to the pose graph of the SLAM system to achieve O(1) fusion time regardless of the scale of reconstructed models. Third, a fast map deformation using the optimized (c) Reconstruct using a monocular camera (d) Reconstruct details pose graph enables the map to achieve global consistency in Fig. 1. Our proposed dense mapping method can fuse low-quality depth real-time. The proposed surfel mapping system is compared maps to reconstruct large-scale globally-consistent environments in real-time with other state-of-the-art methods on synthetic datasets. The without GPU acceleration. The top row shows the reconstruction of KITTI performances of urban-scale and room-scale reconstruction are odometry 00 using stereo cameras and the detail of a looped corner. The demonstrated using the KITTI dataset [1] and autonomous bottom row shows the reconstruction using only one monocular camera with aggressive flights, respectively. The code is available for the depth prediction [2]. benefit of the community1. efficiency (e.g. CHISEL [6]), and the global consistency (e.g. I. INTRODUCTION BundleFusion [7]) of TSDF-based methods. Surfel-based methods model the environment as a collection of surfels. Estimating the surrounding 3D environment is one of the For example, ElasticFusion [8] uses surfels to reconstruct fundamental abilities for robots to navigate safely or operate the scene and achieves global consistency. Although all these high-level tasks. To be usable in mobile robot applications, methods achieve impressive results using RGB-D cameras, the mapping system needs to fulfill the following four re- extending them to fulfill all four requirements and to be quirements. First, the 3D reconstruction has to densely cover usable in different robot applications is non-trivial. the environment in order to provide sufficient information for navigation. Second, the mapping system should have In this paper, we propose a mapping method that fulfills good scalability and efficiency so that it can be deployed all four requirements and can be applied to a range of in different environments using limited onboard computa- mobile robotic systems. Our system uses state-of-the-art tion resources. From room-scale (several meters) to urban- sparse visual SLAM systems to track camera poses and scale (several kilometers) environments, the mapping system fuses intensity images and depth images into a globally should maintain both run-time efficiency and memory effi- consistent model. Unlike ElasticFusion [8] that treats each ciency. Third, global consistency is required in the mapping pixel as a surfel, we use superpixels to represent surfels. systems. If loops are detected, the system should be able Pixels are clustered into superpixels if they share similar to deform the map in real-time to maintain consistency intensity, depth, and spatial locations. Modeling superpixels between different visits. Fourth, to be usable in different as surfels greatly reduces the memory requirement of our robot applications, the system should be able to fuse depth system and enables the system to fuse noisy depth maps maps of different qualities from RGB-D cameras, stereo from stereo cameras or a monocular camera. Surfels are cameras or even monocular cameras. organized according to the keyframes they are last observed In recent years, many methods have been proposed to in. Using the pose graph of the SLAM systems, we further reconstruct the environment using RGB-D cameras focusing find locally consistent keyframes and surfels that the relative on several requirements mentioned above. KinectFusion [3] drift between each other is negligible. Only locally consistent is a pioneering work that uses the truncated signed distance surfels are fused with input images, achieving O(1) fusion field (TSDF) [4] to represent 3D environments. Many follow- time and local accuracy. Global consistency is achieved by ing works improve the scalability (e.g. Kintinuous [5]), the deforming surfels according to the optimized pose graph. Thanks to the careful design, our system can be used to All authors are with the Department of Electronic and Computer Engi- reconstruct globally consistent urban-scale environments in neering, The Hong Kong University of Science and Technology, Hong Kong, SAR China. fkwangap, fgaoaa, [email protected] real-time without GPU acceleration. 1https://github.com/HKUST-Aerial-Robotics/DenseSurfelMapping In summary, the main contributions of our mapping method are the following. into multiple low-drift submaps and merge them into a global • We use superpixels extracted from both intensity and model using updated poses. depth images to model surfels in the system. Superpix- Different methods have been proposed to accelerate the els enable our method to fuse low-quality depth maps. fusion process. Steinbrucker¨ et al. [14] use an octree as the Run-time efficiency and memory efficiency are also data structure to represent the environment. Voxblox [15] gained by using superpixel-based surfels. is designed for planning that both TSDF and the Euclidean • We further organize surfels accordingly to the pose signed distance fields are calculated. Voxblox [15] proposes graph of the sparse SLAM systems. Using this or- a grouped raycasting to speed up the integration, and a novel ganization, locally consistent maps are extracted for weighting strategy to deal with the distortion caused by large fusion, and the fusion time maintains O(1) regardless of voxel sizes. FlashFusion [16] uses valid chunk selection to the reconstruction scale. Fast map deformation is also speed up the fusion step and achieves global consistency proposed based on the optimized pose graph so that the based on the reintegration method. Most of the surfel-based system can achieve global consistency in real-time. mapping systems require GPUs to render index maps for • We implement the proposed dense mapping system data association. MREMap [17] defines octree-organized using only CPU computation. We evaluate the method voxels as surfels so that it does not need GPUs. However, using public datasets and demonstrate its usability us- the reconstructed model of MREMap is voxels instead of ing autonomous aggressive flights. To the best of our meshes. knowledge, the proposed method is the first online depth fusion approach that achieves global consistency III. SYSTEM OVERVIEW in urban-scale using only CPU computation. Surfel Mapping System II. RELATED WORK Intensity, Depth images Sensors Superpixel Extraction Surfel Initialization Most online dense reconstruction methods take depth maps Tracking Results from RGB-D cameras as input. In this section, we introduce Localization Surfel Fusion System Pose Graph Pose Graph different methods to extend the scalability, global consistency Updated? and run-time efficiency of these mapping systems. Yes Local Map Kintinuous [5] extends the scalability of mapping systems Map Deformation Extraction by using a cyclical buffer. The TSDF volume is virtually transformed according to the movement of the camera. Voxel Globally- Updated Surfels hashing proposed by Nießner et al. [9] is another solution to consistent Model Map Database improve the scalability. Due to the sparsity of the surfaces Fig. 2. The system architecture of the proposed method. Sensors are in the space, only valid voxels are stored using hashing determined by the robot application. Our system can fuse depth maps functions. DynSLAM [10] reconstructs urban-scale models from active RGB-D cameras, stereo cameras, or monocular cameras. The using hashed voxels and a high-end GPU to accelerate. localization system is used to track the camera pose, detect loops, and provide optimized pose graph of keyframes. Each surfel in the system is Surfel-based methods are relatively scalable compared with attached to one keyframe, and all the surfels are stored in the map database. voxel-based methods because only surfaces are stored in the system. Without the explicit data optimization, Elastic- The system architecture is shown in Fig. 2. Our system Fusion [8] can build room-scalable environments in detail. fuses intensity and depth image pairs into a globally con- Fu et al. [11] further increase the scalability of surfel-based sistent model. We use a state-of-the-art sparse visual SLAM methods by maintaining a local surfel set. system (e.g. ORB-SLAM2 [18] or VINS-MONO [19]) as To remove the drift from camera tracking and maintain the localization system to track the motion of the camera, global consistency, mapping systems should be able to fast detect loop closures, and optimize the pose graph. The keys deform the model when loops are detected. Whelan et al. [12] to our mapping system are (1) superpixel-based surfels, (2) improved Kintinuous [5] with point clouds deformation. A pose graph-based surfel fusion, and (3) fast map deformation. deformation graph is constructed incrementally as the camera For each intensity and depth image input, the localization moves. When loops are detected, the deformation graph system generates camera tracking results and provides an is optimized and applied to the point clouds. Surfel-based updated pose graph. If the pose graph is optimized, our methods usually deform the map using similar methods. system first deforms all the surfels in the map database BundleFusion [7] introduces another solution to achieve to ensure global consistency.