Real-time 3D Reconstruction in Dynamic Scenes using Point-based Fusion

Maik Keller (pmdtechnologies), Damien Lefloch (University of Siegen), Martin Lambers (University of Siegen), Shahram Izadi (Microsoft Research), Tim Weyrich (University College London), Andreas Kolb (University of Siegen)

Abstract

Real-time or online 3D reconstruction has wide applicability and receives further interest due to the availability of consumer depth cameras. Typical approaches use a moving sensor to accumulate depth measurements into a single model which is continuously refined. Designing such systems is an intricate balance between reconstruction quality, speed, spatial scale, and scene assumptions. Existing online methods either trade scale to achieve higher quality reconstructions of small objects/scenes, or they handle larger scenes by trading real-time performance and/or quality, or by limiting the bounds of the active reconstruction. Additionally, many systems assume a static scene, and cannot robustly handle scene motion or reconstructions that evolve to reflect scene changes. We address these limitations with a new system for real-time dense reconstruction with equivalent quality to existing online methods, but with support for additional spatial scale and robustness in dynamic scenes. Our system is designed around a simple and flat point-based representation, which directly works with the input acquired from range/depth sensors, without the overhead of converting between representations. The use of points enables speed and memory efficiency, directly leveraging the standard graphics pipeline for all central operations; i.e., camera pose estimation, data association, outlier removal, fusion of depth maps into a single denoised model, and detection and update of dynamic objects. We conclude with qualitative and quantitative results that highlight robust tracking and high quality reconstructions of a diverse set of scenes at varying scales.

1. Introduction and Background

Online 3D reconstruction receives much attention as inexpensive depth cameras (such as the Microsoft Kinect, Asus Xtion or PMD CamBoard) become widely available. Compared to offline 3D scanning approaches, the ability to obtain reconstructions in real time opens up a variety of interactive applications including: augmented reality (AR), where real-world geometry can be fused with 3D graphics and rendered live to the user; autonomous guidance for robots to reconstruct and respond rapidly to their environment; or even to provide immediate feedback to users during 3D scanning.

The first step of the reconstruction process is to acquire depth measurements, either using sequences of regular 2D images (e.g. [19]) or with active sensors, such as laser scanners or depth cameras, based on triangulation or time-of-flight (ToF) techniques. Unlike methods that focus on reconstruction from a complete set of 3D points [5, 7], online methods require fusion of many overlapping depth maps into a single 3D representation that is continuously refined. Typically, methods first find correspondences between depth maps (data association) and register or align depth maps to estimate the sensor's egomotion [1, 24]. The fusion method typically involves removal of outliers, e.g. by visibility testing between depth maps [16], observing freespace violations [2], or photo-consistency [12], and merging of measurements into the global model, e.g. using simple weighted averaging [2] or more costly spatial regularization [25, 12].

Recent online systems [6, 11] achieve high-quality results by adopting the volumetric fusion method of Curless and Levoy [2]. This approach supports incremental updates, exploits redundant samples, makes no topological assumptions, approximates sensor uncertainty, and fusion is performed using a simple weighted average. For active sensors, this method produces very compelling results [2, 9, 6, 11]. The drawbacks are the computational overheads needed to continuously transition between different data representations: point-based input is converted to a continuous implicit function, discretized within a regular grid data structure, and converted back to an (explicit) form using expensive polygonization [10] or raycasting [14] methods. A further drawback is the memory overhead imposed by using a regular grid, which represents both empty space and surfaces densely, and thus greatly limits the size of the reconstruction volume.

These memory limitations have led to moving-volume systems [17, 23], which still operate on a very restricted volume but free up space as the sensor moves; or hierarchical volumetric data structures [26], which incur additional computational and data structure complexity for only limited gains in terms of spatial extent.

Beyond volumetric methods, simpler representations have also been explored. Height-map representations [3] work with compact data structures allowing scalability, and are especially suited for modeling large buildings with floors and walls, since these appear as clear discontinuities in the height-map. Multi-layered height-maps support reconstruction of more complex 3D scenes such as balconies, doorways, and arches [3]. While these methods support compression of surface data for simple scenes, the 2.5D representation fails to model complex 3D environments efficiently.

Point-based representations are more amenable to the input acquired from depth/range sensors. [18] used a point-based method and a custom structured light sensor to demonstrate in-hand online 3D scanning. Online model rendering required an intermediate volumetric data structure. Interestingly, an offline volumetric method [2] was used for higher quality final output, which nicely highlights the computational and quality trade-offs between point-based and volumetric methods. [22] took this one step further, demonstrating higher quality scanning of small objects using a higher resolution custom structured light camera, sensor drift correction, and higher quality surfel-based [15] rendering. These systems however focus on scanning single small objects. Further, the sensors produce less noise than consumer depth cameras (due to dynamic rather than fixed structured light patterns), making model denoising less challenging.

Beyond reducing computational complexity, point-based methods lower the memory overhead associated with volumetric (regular grid) approaches, as long as overlapping points are merged. Such methods have therefore been used in larger sized reconstructions [4, 20]. However, a clear trade-off becomes apparent in terms of scale versus speed and quality. For example, [4] allow for reconstructions of entire floors of a building (with support for loop closure and global pose optimization), but the frame rate is limited (∼3 Hz) and an unoptimized surfel map representation for merging 3D points can take seconds to compute. [20] use a multi-level surfel representation that achieves interactive rates (∼10 Hz), but it requires an intermediate octree representation which limits scalability and adds computational complexity.

In this paper we present an online reconstruction system also based around a flat, point-based representation, rather than any spatial data structure. A key contribution is that our system is memory-efficient, supporting spatially extended reconstructions, but without trading reconstruction quality or frame rate. As we will show, the ability to directly render the representation using the standard graphics pipeline, without converting between multiple representations, enables efficient implementation of all central operations, i.e., camera pose estimation, data association, denoising and fusion through data accumulation, and outlier removal.

A core technical contribution is leveraging a fusion method that closely resembles [2] but removes the voxel grid altogether. Despite the lack of a spatial data structure, our system still captures many benefits of volumetric fusion, with competitive performance and quality to previous online systems, allowing for the accumulation of denoised 3D models over time that exploit redundant samples, model measurement uncertainty, and make no topological assumptions.

The simplicity of our approach allows us to tackle another fundamental challenge of online reconstruction systems: the assumption of a static scene. Most previous systems make this assumption or treat dynamic content as outliers [18, 22]; only KinectFusion [6] is at least capable of reconstructing moving objects in a scene, provided a static pre-scan of the background is first acquired. Instead, we leverage the immediacy of our representation to design a method that not only robustly segments dynamic objects in the scene, which greatly improves the robustness of the camera pose estimation, but also continuously updates the global reconstruction, regardless of whether objects are added or removed. Our approach is further able to detect when a moving object has become static or a stationary object has become dynamic.

The ability to support reconstructions at a quality comparable to the state of the art, without trading real-time performance, and with the addition of extended spatial scale and support for dynamic scenes, provides unique capabilities over prior work. We conclude with results from reconstructing a variety of static and dynamic scenes of different scales, and an experimental comparison to related systems.
2. System Overview

Our high-level approach shares commonalities with existing incremental reconstruction systems (presented previously): we use samples from a moving depth sensor; first preprocess the depth data; then estimate the current six degree-of-freedom (6DoF) pose of the sensor relative to the scene; and finally use this pose to convert depth samples into a unified coordinate space and fuse them into an accumulated global model. Unlike prior systems, we adopt a purely point-based representation throughout our pipeline, carefully designed to support data fusion with quality comparable to online volumetric methods, whilst enabling real-time reconstructions at extended scales and in dynamic scenes.

Our choice of representation makes our pipeline extremely amenable to implementation using commodity graphics hardware. The main system pipeline, as shown in Fig. 1, is based on the following steps:

Figure 1. Main system pipeline.

Preprocessing: Using the intrinsic parameters of the camera, each input depth map from the depth sensor is transformed into a set of 3D points, stored in a 2D vertex map. Corresponding normals are computed from central differences of the denoised vertex positions, and per-point radii are computed as a function of depth and gradient (stored in respective normal and radius maps).

Depth Map Fusion: Given a valid camera pose, input points are fused into the global model. The global model is simply a list of 3D points with associated attributes. Points evolve from unstable to stable status based on the confidence they have gathered (essentially a function of how often they are observed by the sensor). Data fusion first projectively associates each point in the input depth map with the set of points in the global model, by rendering the model as an index map. If corresponding points are found, the most reliable point is merged with the new point estimate using a weighted average. If no reliable corresponding points are found, the new point estimate is added to the global model as an unstable point. The global model is cleaned up over time to remove outliers due to visibility and temporal constraints. Sec. 4 discusses our point-based data fusion in detail.

Camera Pose Estimation: All established (high confidence) model points are passed to the visualization stage, which reconstructs dense surfaces using a surface splatting technique (see Sec. 5). To estimate the 6DoF camera pose, the model points are projected from the previous camera pose, and a pyramid-based dense iterative closest point (ICP) [11] alignment is performed using this rendered model map and the input depth map. This provides a new relative rigid 6DoF transformation that maps from the previous to the new global camera pose. Pose estimation occurs prior to data fusion, to ensure the correct projection during data association.

Dynamics Estimation: A key feature of our method is the automatic detection of dynamic changes in the scene, to update the global reconstruction and support robust camera tracking. Dynamic objects are initially indicated by outliers in point correspondences during ICP. Starting from these areas, we perform a point-based region growing procedure to identify dynamic regions. These regions are excluded from the camera pose estimate, and their corresponding points in the global model are reset to unstable status, leading to a natural propagation of scene changes into our depth map fusion. For more detail, see Sec. 6.
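To make the flat representation concrete, the following minimal sketch shows one way the global point list described above could be laid out as a structure of arrays, mirroring the attributes listed in Sec. 4 (position, normal, radius, confidence counter, time stamp). The class and method names and the fixed capacity are illustrative assumptions, not the authors' implementation, which keeps these buffers on the GPU.

```python
import numpy as np

# Illustrative sketch only: the unstructured global point list as a flat
# structure of arrays. A real system would hold equivalent buffers on the GPU.
class GlobalModel:
    def __init__(self, capacity=2_000_000):
        self.size = 0
        self.positions   = np.zeros((capacity, 3), dtype=np.float32)
        self.normals     = np.zeros((capacity, 3), dtype=np.float32)
        self.radii       = np.zeros(capacity, dtype=np.float32)
        self.confidences = np.zeros(capacity, dtype=np.float32)
        self.timestamps  = np.zeros(capacity, dtype=np.int32)

    def append(self, v, n, r, confidence, t):
        """Add a new (initially unstable) point and return its index k."""
        k = self.size
        assert k < len(self.radii), "capacity exceeded (a real system would grow the buffers)"
        self.positions[k], self.normals[k] = v, n
        self.radii[k], self.confidences[k], self.timestamps[k] = r, confidence, t
        self.size += 1
        return k

    def stable_mask(self, c_stable=10.0):
        """Boolean mask of points considered stable (confidence >= c_stable)."""
        return self.confidences[:self.size] >= c_stable
```

A flat layout like this is what makes the later mark-sort-delete cleanup (Sec. 4.3) and the direct rendering of all model points inexpensive.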
3. Depth Map Preprocessing

We denote a 2D pixel as u = (x, y)^T ∈ R². D_i ∈ R is the raw depth map at time frame i. Given the intrinsic camera calibration matrix K_i, we transform D_i into a corresponding vertex map V_i by converting each depth sample D_i(u) into a vertex position v_i(u) = D_i(u) K_i^{-1} (u^T, 1)^T ∈ R³ in camera space. A corresponding normal map N_i is determined from central differences of the vertex map. A copy of the depth map (and hence the associated vertices and normals) is also denoised using a bilateral filter [21] (for camera pose estimation later).

The 6DoF camera pose transformation comprises a rotation matrix R_i ∈ SO(3) and a translation vector t_i ∈ R³, computed per frame i as T_i = [R_i, t_i] ∈ SE(3). A vertex is converted to global coordinates as v_i^g(u) = T_i v_i(u); the associated normal is converted to global coordinates as n_i^g(u) = R_i n_i(u). Multi-scale pyramids V_i^l and N_i^l are computed from the vertex and normal maps for hierarchical ICP, where l ∈ {0, 1, 2} and l = 0 denotes the original input resolution (e.g. 640×480 for the Kinect or 200×200 for the PMD CamBoard).

Each input vertex also has an associated radius r_i(u) ∈ R (collectively stored in a radius map R_i ∈ R), determined as in [22]. To prevent arbitrarily large radii from oblique views, we clamp radii for grazing observations exceeding 75°.

In the remainder, we omit time frame indices i for clarity, unless we refer to two different time frames at once.
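As an illustration of this preprocessing stage, the NumPy sketch below back-projects a depth map into a vertex map and derives per-pixel normals from central differences, assuming a standard pinhole intrinsic matrix K. The bilateral filtering, the pyramid construction and the radius computation of [22] are omitted, and border handling is deliberately simplified.

```python
import numpy as np

def depth_to_vertex_map(D, K):
    """Back-project a depth map D (H x W, in metres) into a vertex map with
    v(u) = D(u) * K^-1 * (u, 1)^T, assuming a pinhole camera model."""
    H, W = D.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    xs, ys = np.meshgrid(np.arange(W), np.arange(H))
    X = (xs - cx) / fx * D
    Y = (ys - cy) / fy * D
    return np.dstack([X, Y, D]).astype(np.float32)          # H x W x 3

def vertex_map_to_normal_map(V):
    """Per-pixel normals from central differences of the vertex map
    (border pixels are handled crudely by wrap-around for brevity)."""
    dx = np.roll(V, -1, axis=1) - np.roll(V, 1, axis=1)
    dy = np.roll(V, -1, axis=0) - np.roll(V, 1, axis=0)
    N = np.cross(dx, dy)
    return N / np.maximum(np.linalg.norm(N, axis=2, keepdims=True), 1e-8)
```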
4. Depth Map Fusion

Our system maintains a single global model, which is simply an unstructured set of points P̄_k, each with an associated position v̄_k ∈ R³, normal n̄_k ∈ R³, radius r̄_k ∈ R, confidence counter c̄_k ∈ R, and time stamp t̄_k ∈ N, stored in a flat array indexed by k ∈ N.

New measurements v are either added as or merged with unstable points, or they get merged with stable model points. Merging v with a point P̄_k in the global model increases the confidence counter c̄_k. Eventually an unstable point changes its status to stable: points with c̄_k ≥ c_stable are considered stable (in practice c_stable = 10). Under specific temporal or geometric conditions, points are removed from the global model.

4.1. Data Association

After estimation of the camera pose of the current input frame (see Sec. 5), each vertex v^g and its associated normal and radius are integrated into the global model.

In a first step, for each valid vertex v^g, we find potential corresponding points in the global model. Given the inverse global camera pose T^{-1} and the intrinsics K, each point P̄_k in the global model can be projected onto the image plane of the current physical camera view, where the respective point index k is stored: we render all model points into a sparse index map I. Unlike the splat-based dense surface reconstruction renderer used in other parts of our pipeline (see Sec. 5), this stage renders each point index into a single pixel to reveal the actual surface sample distribution.

As nearby model points may project onto the same pixel, we increase the precision of I by supersampling, representing I at 4×4 times the resolution of the input depth map. We start identifying model points near v^g(u) by collecting point indices within the 4×4 neighborhood around each input pixel location u (suitably coordinate-transformed from D to I). Amongst those points, we determine a single corresponding model point by applying the following criteria:

1. Discard points whose distance from v^g(u) along the viewing ray (the sensor line of sight) exceeds ±δ_depth, with δ_depth adapted according to sensor uncertainty (i.e. as a function of depth for triangulation-based methods [13]).
2. Discard points whose normals have an angle larger than δ_norm to the normal n^g(u). (We use δ_norm = 20°.)
3. From the remaining points, select the ones with the highest confidence count.
4. If multiple such points exist, select the one closest to the viewing ray through v^g(u).

4.2. Point Averaging with Sensor Uncertainty

If a corresponding model point P̄_k is found during data association, it is averaged with the input vertex v^g(u) and normal n^g(u) as follows:

  v̄_k ← (c̄_k v̄_k + α v^g(u)) / (c̄_k + α),    n̄_k ← (c̄_k n̄_k + α n^g(u)) / (c̄_k + α),    (1)

  c̄_k ← c̄_k + α,    t̄_k ← t,    (2)

where t is a new time stamp. Our weighted average is distinct from the original KinectFusion system [11], as we introduce an explicit sample confidence α. It applies a Gaussian weight to the current depth measurement, α = exp(−γ²/(2σ²)), where γ is the normalized radial distance of the current depth measurement from the camera center, and σ = 0.6 is derived empirically. This approach weights measurements based on the assumption that measurements closer to the sensor center increase in accuracy [2]. As shown in Fig. 2, modeling this sensor uncertainty leads to higher quality denoising.

Figure 2. Weighted averaging of points using our method (left) and the method of [11] (right).

Since the noise level of the input measurement increases as a function of depth [13], we apply Eqs. (1) only if the radius of the new point is not significantly larger than the radius of the model point, i.e., if r(u) ≤ (1 + δ_r) r̄; we empirically chose δ_r = 1/2. This ensures that we always refine details, but never coarsen the global model. We apply the time stamp and confidence counter updates according to Eqs. (2) irrespectively.

If no corresponding model point has been identified, a new unstable point is added to the global model with c̄_k = α, containing the input vertex, normal and radius.
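The following sketch spells out the confidence-weighted update of Eqs. (1)-(2), reusing the GlobalModel arrays sketched after Sec. 2. The normalization of the radial distance γ by the image half-diagonal is an assumption made here for illustration, as is the exact interface; data association is assumed to have already selected the model index k.

```python
import numpy as np

def sample_confidence(u, image_size, sigma=0.6):
    """Gaussian sample confidence alpha = exp(-gamma^2 / (2 sigma^2)).
    gamma is the radial distance of pixel u = (x, y) from the image center,
    normalized here by the half-diagonal (an assumption)."""
    H, W = image_size
    center = np.array([W / 2.0, H / 2.0])
    gamma = np.linalg.norm(np.asarray(u, dtype=float) - center) / np.linalg.norm(center)
    return float(np.exp(-gamma ** 2 / (2.0 * sigma ** 2)))

def merge_point(model, k, v_g, n_g, r, alpha, t, delta_r=0.5):
    """Eqs. (1)-(2): average the new measurement (v_g, n_g) into model point k.
    Position and normal are only refined if the new radius is not significantly
    larger than the stored one; confidence and time stamp are updated regardless."""
    c = model.confidences[k]
    if r <= (1.0 + delta_r) * model.radii[k]:                 # refine, never coarsen
        model.positions[k] = (c * model.positions[k] + alpha * v_g) / (c + alpha)
        model.normals[k]   = (c * model.normals[k]   + alpha * n_g) / (c + alpha)
    model.confidences[k] = c + alpha                          # Eq. (2)
    model.timestamps[k]  = t
```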
4.3. Removing Points

So far we have merged or added new measurements to the global model. Another key step is to remove points from our global model under various conditions:

1. Points that remain in the unstable state for a long time are likely outliers or artifacts from moving objects and are removed after t_max time steps.
2. For stable model points that are merged with new data, we remove all model points that lie in front of these newly merged points, as these are free-space violations. To find the points to remove, we use the index map again and search the neighborhood around the pixel location that the merged point projects onto.¹ This is similar in spirit to the free-space carving method of [2], but avoids expensive voxel space traversal.
3. If, after averaging, a stable point has neighboring points (identified again via the index map) with very similar position and normal, and their radii overlap, then we merge these redundant neighboring points to further simplify the model.

Points are first marked for removal from P̄_k; in a second pass, the list is sorted (using a fast radix sort implementation), moving all marked points to the end, where they are finally deleted.

¹ Backfacing points that are close to the merged points remain protected; such points may occur in regions of high curvature or around thin geometry in the presence of noise and slight registration errors. Furthermore, we protect points that would be consistent with direct neighbor pixels in D, to avoid spurious removal of points around depth discontinuities.

5. Camera Pose Estimation

Following the approach of KinectFusion [11], our camera pose estimation uses dense hierarchical ICP to align the bilateral-filtered input depth map D_i (of the current frame i) with the reconstructed model, by rendering the model into a virtual depth map, or model map, D̂_{i−1}, as seen from the previous frame's camera pose T_{i−1}. We use 3 hierarchy levels, with the finest level at the camera's resolution; unstable model points are ignored. The registration transformation provides the relative change from T_{i−1} to T_i.

While KinectFusion employs raycasting of the (implicit) voxel-based reconstruction, we render our explicit, point-based representation using a simple surface-splatting technique: we render overlapping, disk-shaped surface splats that are spanned by the model point's position v̄, radius r̄ and orientation n̄. Unlike more refined surface-splatting techniques, such as EWA Surface Splatting [27], we do not perform blending and analytical prefiltering of splats, but trade local surface reconstruction quality for performance by simply rendering opaque splats.

We use the same point-based renderer for user feedback, but add Phong shading of surface splats, and also overlay the dynamic regions of the input depth map.
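For illustration, the sketch below shows the core of one linearized point-to-plane ICP update, the textbook building block underlying the dense pyramid alignment described above. It assumes correspondences have already been established by projecting into the rendered model map, and it omits outlier rejection, the pyramid iteration and the ICP status map; it is a generic formulation, not the authors' exact GPU implementation.

```python
import numpy as np

def point_to_plane_icp_step(src_pts, dst_pts, dst_normals):
    """One linearized point-to-plane update: minimize sum(((R s + t - d) . n)^2)
    under a small-angle approximation. src_pts are input vertices; dst_pts and
    dst_normals are the associated model points/normals (e.g. looked up in the
    rendered model map). Returns a 4x4 incremental transform."""
    c = np.cross(src_pts, dst_normals)                       # rows: s_i x n_i
    A = np.hstack([c, dst_normals])                          # N x 6 system matrix
    b = np.einsum('ij,ij->i', dst_pts - src_pts, dst_normals)
    x, *_ = np.linalg.lstsq(A, b, rcond=None)                # (w_x, w_y, w_z, t_x, t_y, t_z)
    wx, wy, wz, tx, ty, tz = x
    T = np.eye(4)
    T[:3, :3] = np.array([[1.0, -wz,  wy],                   # I + [w]_x; re-orthonormalize
                          [ wz, 1.0, -wx],                   # (or use the exponential map)
                          [-wy,  wx, 1.0]])                  # in practice
    T[:3, 3] = [tx, ty, tz]
    return T
```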
6. Dynamics Estimation

The system as described above already has limited support for dynamic objects, in that unstable points must gain confidence to be promoted to stable model points, and so fast moving objects will be added to and then deleted from the global model. In this section we describe additional steps that lead to an explicit classification of observed points as being part of a dynamic object. In addition, we aim at segmenting entire objects whose surface is only partially moving, and at removing them from the global point model.

We build upon an observation by Izadi et al. [6]: when performing ICP, failure of data association to find model correspondences for input points is a strong indication that these points are depth samples belonging to dynamic objects. Accordingly, we retrieve this information by constructing an ICP status map S (with elements s_i(u)) that encodes, for each depth sample, the return state of ICP's search for a corresponding model point in the data association step:

no_input: v(u) is invalid or missing.
no_cand: No stable model points in proximity of v(u).
no_corr: Stable model points in proximity of, but no valid ICP correspondence for, v(u).
corr: Otherwise, ICP found a correspondence.

Input points marked as no_corr are a strong initial estimate of parts of the scene that move independently of camera motion, i.e. dynamic objects in the scene.

We use these points to seed our segmentation method based on hierarchical region growing (see below). It creates a dynamics map X, storing flags x_i(u), that segments the current input frame into static and dynamic points. The region growing aims at marking complete objects as dynamic even if only parts of them actually move. (Note that this high-level view on dynamics is an improvement over the limited handling of dynamics in previous approaches, e.g., [6].)

In the depth map fusion stage, model points that are merged with input points marked as dynamic are potentially demoted to unstable points using the following rule:

  if x_i(u) ∧ c̄_k ≥ c_stable + 1 then c̄_k ← 1    (3)

Thus, the state change from static to dynamic is reflected immediately in the model. A critical aspect is the offset of +1 in Eq. (3): it ensures that any dynamic point that has sufficiently grown in confidence (potentially because it is now static) is allowed to be added to the global model for at least one iteration; otherwise, a surface that has once been classified as dynamic would never be re-added to the global model, as it would always be inconsistent with the model, leading to a no_corr classification.

For the bulk of the time, however, dynamic points remain unstable and as such are not considered for camera pose estimation (see Sec. 5), which greatly improves the accuracy and robustness of T.

Hierarchical Region Growing

The remainder of this section explains the region growing-based segmentation approach that computes the map X.

The goal is essentially to find connected components in D. In the absence of explicit neighborhood relations in the point data, we perform region growing based on point attribute similarity. Starting from the seed points marked in X, we agglomerate points whose position and normal are within given thresholds of the vertex v(u) and normal n(u) of a neighbor with x(u) = true.

To accelerate the process, we start at a downsampled X^2, and repeatedly upsample until we reach X^0 = X, each time resuming region growing. (We reuse the input pyramids built for camera pose estimation.)

We improve robustness to camera noise and occlusions by removing stray no_corr points through morphological erosion at the coarsest pyramid level X^2 after initializing it from S. This also ensures that X^2 covers only the inner region of dynamic objects.
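A minimal, single-level version of this region growing could look as follows; the threshold values and the integer encoding of the no_corr label are illustrative assumptions, and the hierarchical coarse-to-fine pass and the morphological erosion of the seed map are left out.

```python
import numpy as np
from collections import deque

NO_CORR = 2   # assumed integer label for the "no_corr" ICP status

def grow_dynamics_map(status, V, N, pos_thresh=0.05, dot_thresh=np.cos(np.radians(20.0))):
    """Single-level sketch of the region growing: seed the dynamics map X at
    no_corr pixels, then agglomerate 4-neighbours whose vertex and normal lie
    within the given (illustrative) thresholds of an already-marked neighbour."""
    H, W = status.shape
    X = (status == NO_CORR)
    queue = deque(zip(*np.nonzero(X)))
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            yy, xx = y + dy, x + dx
            if 0 <= yy < H and 0 <= xx < W and not X[yy, xx]:
                close   = np.linalg.norm(V[yy, xx] - V[y, x]) < pos_thresh
                aligned = float(np.dot(N[yy, xx], N[y, x])) > dot_thresh
                if close and aligned:
                    X[yy, xx] = True
                    queue.append((yy, xx))
    return X
```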
7. Results

We have tested our system on a variety of scenes (see Table 1). Fig. 3 shows the synthetic scene Sim: we generated rendered depth maps for a virtual camera rotating around this scene and used these as input to our system. This gave us ground truth camera transformations T_i^GT and ground truth scene geometry. Using T_i^GT, the points in the resulting global model have a mean position error of 0.019 mm, which demonstrates only minimal error for our point-based data fusion approach. The camera transformations T_i obtained from ICP have a mean position error of 0.87 cm and a mean viewing direction error of 0.1 degrees; this results in a mean position error of 0.20 cm for global model points.

Table 1. Results from test scenes obtained on a PC equipped with an Intel i7 8-core CPU and an NVidia GTX 680 GPU. Input frames have a size of 640 × 480 pixels, except for the PMD scene, which uses a frame size of 200 × 200.

Scene          #frames input/processed (fps in./proc.)   #model points   Avg. timings [ms]: ICP / Dyn-Seg. / Fusion
Sim            950/950 (15/15)                            467,200         18.90 / 2.03 / 11.50
Flowerpot      600/480 (30/24)                            496,260         15.87 / 1.90 / 6.89
Teapot         1000/923 (30/27)                           191,459         15.20 / 1.60 / 5.56
Large Office   11892/6704 (30/17)                         4,610,800       21.75 / 2.39 / 13.90
Moving Person  912/623 (30/20)                            210,500         15.92 / 3.23 / 16.61
Ballgame       1886/1273 (30/21)                          350,940         16.74 / 3.15 / 17.66
PMD            4101/4101 (27/27)                          280,050         10.70 / 0.73 / 3.06

Figure 3. The synthetic scene Sim. Left: error in the final global model based on ground truth camera transformations. Right: final error based on ICP pose estimation.²
² Rendered using CloudCompare, http://www.danielgm.net/cc/.

The Flowerpot and Teapot scenes shown in Fig. 4 were recorded by Nguyen et al. [13]. Objects are placed on a turntable which is rotated around a stationary Kinect camera. A Vicon system is used for ground truth pose estimation of the Kinect; these poses are compared to the ICP estimates of our method and of the original KinectFusion system (Fig. 5).

Figure 4. The scenes Flowerpot (top row) and Teapot (bottom row). A and B show reconstruction results of the original KinectFusion system. The other images show our method (middle: Phong-shaded surfels, right: model points colored with surface normals).

Figure 5. Tracking errors for the original KinectFusion system compared to our point-based approach. Tracking results were computed on the Flowerpot sequence, by subtracting Vicon ground truth data from the resulting per-frame 3D camera position. For each system, error is computed as the absolute distance between the estimated camera position and the ground truth position (after aligning both coordinate spaces manually). Where the error of the original KinectFusion exceeds that of the new system, the gap is colored blue; where the error of our method exceeds the original, the gap is colored red. Note our method is similar in performance, with the largest delta being ∼1 cm.

Fig. 6 shows that the number of global model points for these scenes remains roughly constant after one full turn of the turntable. This demonstrates that new points are not continuously added; the global model is refined but kept compact. Note that one Kinect camera input frame provides up to 307,200 input points, but the total number of points in the final global teapot model is less than 300,000.

Figure 6. The number of global model points stored on the GPU plotted over time for the Flowerpot and Teapot scenes. Note that after the completion of one full turn of the turntable, the number of points converges instead of continuously growing.
Figure 7. The Large Office scene, consisting of two large rooms and connecting corridors. A: overview; B and C: dynamically moving objects during acquisition. Note the millimeter scale of the phone's keypad. Other close-ups are also shown (right column: RGB textured).

The Large Office scene shown in Fig. 7 consists of two rooms with a total spatial extent of approximately 10 m × 6 m × 2.5 m. A predefined volumetric grid with 32-bit voxels and 512 MB of GPU memory would result in a voxel size of more than 1 cm³. In contrast, our system does not define the scene extents in advance: the global model grows as required. Furthermore, it does not limit the size of representable details; Fig. 7 shows close-ups of details on the millimeter scale (e.g. the telephone keys). The 4.6 million global model points reported in Tab. 1 can be stored in 110 MB of GPU memory using 3 floating point values for the point position, 2 for the normalized point normal, 1 for the radius, and one extra byte for the confidence counter. Additionally, RGB colors can be stored for each global point, to texture the final model (see Fig. 7, far right). Rather than merge RGB samples, we currently simply store the last one.
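A quick sanity check of the quoted footprint, assuming 4-byte floats and no padding beyond the layout stated above:

```python
# 3 floats position + 2 floats normal + 1 float radius = 24 bytes, plus a 1-byte confidence counter
bytes_per_point = 6 * 4 + 1
print(4_610_800 * bytes_per_point / 2**20)   # ~109.9 MiB, consistent with the ~110 MB stated above
```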

Figure 8. The Moving Person scene. A person sits on a chair, is reconstructed, and then moves. Dynamic parts occupy much of the field of view and cause ICP errors with previous approaches (top row). Segmenting the dynamics (A) and ignoring them during pose estimation (B) allows increased robustness (bottom row).

In the Moving Person scene shown in Fig. 8, the person first sits in front of the sensor and is reconstructed before moving out of view. Since the moving person occupies much of the field of view, leaving only few reliable points for ICP, camera tracking fails with previous approaches (see e.g. Izadi et al. [6], Fig. 8). Our system segments the moving person and ignores dynamic scene parts in the ICP stage, thereby ensuring robustness to dynamic motion.

Figure 9. The Ballgame scene consists of two people moving a ball across a table. A: global model colored with surface normals; B: raw input data of the previously static ball being picked up; C: segmentation of dynamic parts. Bottom row: reconstructed result (model points + dynamic parts).

The Ballgame scene shown in Fig. 9 shows two people playing with a ball across a table. Our region growing approach segments dynamics on the object level instead of just the point level: each person is recognized as dynamic even if only parts of their body are actually moving. Static objects that start moving are marked as dynamic and their model points are demoted to unstable status, while dynamic objects that stop moving eventually reach stable status in the global model once the observed points gain enough confidence.

Figure 10. A: the PMD scene acquired with a PMD ToF camera. B and C: close-ups using per-pixel intensity values for coloring.

Most scenes shown throughout this paper were recorded with a Microsoft Kinect camera in near mode, but our method is agnostic to the type of sensor used. Fig. 10 shows an example scene recorded with a PMD CamBoard ToF camera, which exhibits significantly different noise and error characteristics [8]. In this example, we used the per-pixel amplitude information provided by PMD sensors in the computation of the sample confidence α (see Sec. 4.2).

8. Conclusion

We have presented a new system for online 3D reconstruction which demonstrates new capabilities beyond the state of the art. Our system has been explicitly designed to allow for a single point-based representation to be used throughout our pipeline, one which closely fits the sensor input and is amenable to rendering (for visualization and data association) through the standard graphics pipeline.

Despite the lack of a spatial data structure, our system still captures many benefits of volumetric fusion, allowing for the accumulation of denoised 3D models over time that exploit redundant samples, model measurement uncertainty, and make no topological assumptions. This is achieved using a new point-based fusion method based on [2]. Reconstructions at this scale, quality and speed, and with the ability to deal with scene motion and dynamic updates, have yet to be demonstrated by other point-based methods, and are core contributions of our work.
There are many areas for future work. For example, whilst our system scales to large scenes, there is the additional possibility of adding mechanisms for streaming subsets of points (from GPU to CPU), especially once they are significantly far away from the current pose. This would help increase performance, and the point-based data would clearly incur low overhead in terms of CPU-GPU bandwidth. Another issue is sensor drift, which we do not currently tackle, instead focusing on the data representation. Drift in larger environments can become an issue and remains an interesting direction for future work. Here again the point-based representation might be more amenable to correction after loop closure detection than resampling a dense voxel grid.

Acknowledgements

This research has partly been funded by the German Research Foundation (DFG), grant GRK-1564 Imaging New Modalities, and by the FP7 EU collaborative project BEAMING (248620). We thank Jens Orthmann for his work on the GPU framework osgCompute.

References

[1] P. Besl and N. McKay. A method for registration of 3-D shapes. IEEE Trans. Pattern Anal. and Mach. Intell., 14(2):239-256, 1992.
[2] B. Curless and M. Levoy. A volumetric method for building complex models from range images. In Proc. Comp. Graph. & Interact. Techn., pages 303-312, 1996.
[3] D. Gallup, M. Pollefeys, and J.-M. Frahm. 3D reconstruction using an n-layer heightmap. In Pattern Recognition (Proc. DAGM), pages 1-10. Springer, 2010.
[4] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox. RGB-D mapping: Using Kinect-style depth cameras for dense 3D modeling of indoor environments. Int. J. Robotics Research, 31:647-663, Apr. 2012.
[5] H. Hoppe, T. DeRose, T. Duchamp, J. McDonald, and W. Stuetzle. Surface reconstruction from unorganized points. Computer Graphics (Proc. SIGGRAPH), 26(2), 1992.
[6] S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davison, and A. Fitzgibbon. KinectFusion: real-time 3D reconstruction and interaction using a moving depth camera. In Proc. ACM Symp. User Interface Softw. & Tech., pages 559-568, 2011.
[7] M. Kazhdan, M. Bolitho, and H. Hoppe. Poisson surface reconstruction. In Proc. EG Symp. Geom. Proc., 2006.
[8] A. Kolb, E. Barth, R. Koch, and R. Larsen. Time-of-flight cameras in computer graphics. Computer Graphics Forum, 29(1):141-159, 2010.
[9] M. Levoy, K. Pulli, B. Curless, S. Rusinkiewicz, D. Koller, L. Pereira, M. Ginzton, S. Anderson, J. Davis, J. Ginsberg, et al. The digital Michelangelo project: 3D scanning of large statues. In Proc. Comp. Graph. & Interact. Techn., pages 131-144, 2000.
[10] W. Lorensen and H. Cline. Marching cubes: A high resolution 3D surface construction algorithm. Computer Graphics (Proc. SIGGRAPH), 21(4):163-169, 1987.
[11] R. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In Proc. IEEE Int. Symp. Mixed and Augm. Reality, pages 127-136, 2011.
[12] R. Newcombe, S. Lovegrove, and A. Davison. DTAM: Dense tracking and mapping in real-time. In Proc. IEEE Int. Conf. Comp. Vision, pages 2320-2327, 2011.
[13] C. Nguyen, S. Izadi, and D. Lovell. Modeling Kinect sensor noise for improved 3D reconstruction and tracking. In Proc. Int. Conf. 3D Imaging, Modeling, Processing, Vis. & Transmission, pages 524-530, 2012.
[14] S. Parker, P. Shirley, Y. Livnat, C. Hansen, and P.-P. Sloan. Interactive ray tracing for isosurface rendering. In Proc. IEEE Vis., pages 233-238. IEEE, 1998.
[15] H. Pfister, M. Zwicker, J. van Baar, and M. Gross. Surfels: Surface elements as rendering primitives. In Proc. Conf. Comp. Graphics & Interact. Techn., pages 335-342, 2000.
[16] M. Pollefeys, D. Nistér, J. Frahm, A. Akbarzadeh, P. Mordohai, B. Clipp, C. Engels, D. Gallup, S. Kim, P. Merrell, et al. Detailed real-time urban 3D reconstruction from video. Int. J. Comp. Vision, 78(2):143-167, 2008.
[17] H. Roth and M. Vona. Moving volume KinectFusion. In Proc. British Machine Vision Conf., 2012.
[18] S. Rusinkiewicz, O. Hall-Holt, and M. Levoy. Real-time 3D model acquisition. ACM Trans. Graph. (Proc. SIGGRAPH), 21(3):438-446, 2002.
[19] S. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. In Proc. IEEE Conf. Comp. Vision & Pat. Rec., volume 1, pages 519-528. IEEE, 2006.
[20] J. Stückler and S. Behnke. Integrating depth and color cues for dense multi-resolution scene mapping using RGB-D cameras. In Proc. IEEE Int. Conf. Multisensor Fusion & Information Integration, pages 162-167, 2012.
[21] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In Proc. Int. Conf. Computer Vision, pages 839-846, 1998.
[22] T. Weise, T. Wismer, B. Leibe, and L. Van Gool. In-hand scanning with online loop closure. In Proc. IEEE Int. Conf. Computer Vision Workshops, pages 1630-1637, 2009.
[23] T. Whelan, M. Kaess, M. Fallon, H. Johannsson, J. Leonard, and J. McDonald. Kintinuous: Spatially extended KinectFusion. Technical report, CSAIL, MIT, 2012.
[24] C. Yang and G. Medioni. Object modelling by registration of multiple range images. Image and Vision Computing, 10(3):145-155, 1992.
[25] C. Zach. Fast and high quality fusion of depth maps. In Proc. Int. Symp. on 3D Data Processing, Visualization and Transmission (3DPVT), volume 1, 2008.
[26] M. Zeng, F. Zhao, J. Zheng, and X. Liu. Octree-based fusion for realtime 3D reconstruction. Graph. Models, 75(3):126-136, 2013.
[27] M. Zwicker, H. Pfister, J. van Baar, and M. Gross. Surface splatting. In Computer Graphics (Proc. SIGGRAPH), pages 371-378, 2001.