
A Survey of 3D Modeling Using Depth Cameras

Hantong XU, Jiamin XU, Weiwei XU

State Key Lab of CAD&CG, Zhejiang University

Abstract- 3D modeling is an important topic in computer graphics and computer vision. In recent years, the introduction of consumer-grade depth cameras has brought profound advances in 3D modeling. Starting with the basic data structures, this survey reviews the latest developments of 3D modeling based on depth cameras, including research on camera tracking, 3D object and scene reconstruction, and high-quality texture reconstruction. We also discuss future work and possible directions for 3D modeling based on depth cameras.

Keywords: 3D Modeling, Depth Camera, Camera Tracking, Signed Distance Function

1. Introduction

3D modeling is an important research area in computer graphics and computer vision. Its target is to capture the 3D shape and appearance of objects and scenes in high quality so as to simulate 3D interaction and perception in the digital space. 3D modeling methods can be mainly categorized into three types: (1) Modeling based on professional software, such as 3DS Max, Maya, etc. This requires the user to be proficient with the software; the learning curve is usually long, and the creation of 3D content is usually time-consuming. (2) Modeling based on 2D RGB images, such as multi-view stereo (MVS) [SCD*06] and structure from motion (SFM) [A03, GD07]. Since image sensors are low-cost and easy to deploy, 3D modeling from multi-view 2D images can significantly automate and simplify the modeling process, but it relies heavily on image quality and is easily affected by lighting and camera resolution. (3) Modeling based on professional, active 3D sensors [RHL02], including 3D scanners and depth cameras. These sensors actively project structured light or laser stripes onto the object and then reconstruct the spatial shape of objects and scenes, i.e., sample a large number of 3D points from the surface, according to the spatial information decoded from the structured light. The advantage of active 3D sensors is that they are robust to interference in the environment and able to scan 3D models at high precision automatically. However, professional 3D scanners are expensive, which limits their applications for ordinary users.

In recent years, the development of depth (or RGB-D) cameras has led to the rapid development of 3D modeling. The depth camera is not only low-cost and compact, but also capable of capturing pixel-level color and depth information at sufficient resolution and frame rate. Because of these characteristics, depth cameras have a huge advantage in consumer-oriented applications compared to other, expensive scanning devices. The KinectFusion algorithm, proposed by Imperial College London and Microsoft Research in 2011, opened up the opportunity for 3D modeling with depth cameras [IKH11, NIH*11]. It is capable of generating 3D models with high-resolution details in real time using the color and depth images obtained by depth cameras, which has attracted wide attention. Researchers in computer graphics and computer vision have continuously innovated on the algorithms and systems for 3D modeling based on depth cameras and achieved rich research results.

This survey reviews and compares the latest methods of 3D modeling based on depth cameras. We first introduce two basic data structures widely used in 3D modeling with depth cameras in Section 2. In Section 3, we focus on the progress of geometric reconstruction based on depth cameras. Then, in Section 4, the reconstruction of high-quality texture is reviewed. Finally, we conclude and discuss future work in Section 5.

1.1 Preliminaries of Depth Cameras

Each frame of data obtained by the depth camera contains the distance from each sampled point in the real scene to the vertical plane where the depth camera is located. This distance is called the depth value, and the depth values at the sampled pixels constitute the depth image of that frame.

Usually, the color image and the depth image are registered, and thus, as shown in Figure 1, there is a one-to-one correspondence between their pixels.
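As a concrete illustration of how a registered depth pixel yields a colored 3D point, the sketch below back-projects a pixel through the pinhole camera model; the intrinsic parameters (fx, fy, cx, cy) are illustrative values, not taken from the survey.

```python
import numpy as np

def backproject(u, v, d, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth d (meters) to a camera-space 3D point."""
    x = (u - cx) * d / fx
    y = (v - cy) * d / fy
    return np.array([x, y, d])

# Because the color image is registered to the depth image, the color of this
# 3D point is simply the RGB value stored at the same pixel (u, v).
p = backproject(u=400, v=240, d=1.5, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
```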

Currently, most depth cameras are based either on structured light, such as Microsoft's Kinect 1.0 and Occipital's Structure Sensor, or on ToF (time-of-flight) technology, such as Microsoft's Kinect 2.0. Structured-light depth cameras first project infrared structured light onto the surface of an object and then receive the light pattern reflected by that surface. Since the received pattern is modulated by the 3D shape of the object, the spatial information of the 3D surface can be calculated from the position and degree of modulation of the pattern in the image.

Figure 1: RGB image and depth value. M: the 3D point. XM: the projection of M on the image plane.

The structured-light camera is suitable for scenes with insufficient illumination; it can achieve high precision within a certain range and produce high-resolution depth images. The disadvantage of structured light is that it is easily affected by strong natural light outdoors, which can overwhelm the projected coded light and render it unusable. It is also susceptible to specular reflections. The ToF camera continuously emits light pulses (generally invisible light) onto the observed object and then receives the pulses reflected back from it. The distance from the measured object to the camera is calculated by detecting the flight (round-trip) time of the light pulse. Overall, the measurement error of the ToF depth camera does not grow with the measurement distance, and its anti-interference ability is strong, which makes it suitable for relatively long measurement distances (such as automatic driving). The disadvantages are that the obtained depth image is of low resolution, and the measurement error at close range is larger than that of other depth cameras.
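The ToF principle reduces to a one-line range equation: the distance is half the distance traveled by light during the measured round-trip time. A minimal sketch (the 20 ns pulse time is a made-up example):

```python
C = 299_792_458.0  # speed of light in m/s

def tof_distance(round_trip_seconds):
    """Distance to the object given the measured round-trip time of the light pulse."""
    return C * round_trip_seconds / 2.0

d = tof_distance(20e-9)  # a 20 ns round trip corresponds to roughly 3 m
```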

1.2 The Pipeline of 3D Modeling Based on Depth Cameras

Since the depth data obtained by the depth camera inevitably contains noise, we first need to preprocess the depth maps, that is, reduce noise and remove outliers. Next, the camera poses are estimated to register the captured depth images into a consistent coordinate frame. The classic approach is to find corresponding points between the depth maps (i.e., data association) and then estimate the camera's motion by registering the depth images. Finally, the current depth image is merged into the global model based on the estimated camera pose. By repeating these steps, we can obtain the geometry of the scene. However, to completely reconstruct the appearance of the scene, it is obviously not enough to capture only the geometry; we also need to give the model color and texture.

Figure 2: The pipeline of volumetric representation based reconstruction [NIH*11]: measurement (compute 3D vertices and normals at pixels), pose estimation (ICP to register the measured surface to the global model), update of the reconstruction (integrate the surface measurement into the global TSDF), and surface prediction (ray-cast the TSDF to compute the surface prediction).

According to the data structure used to represent the 3D modeling result, we can roughly categorize 3D modeling into volumetric representation based modeling and surfel-based modeling. The steps of 3D reconstruction vary with the underlying data structure. This review clarifies how each method solves the classic problems in 3D reconstruction by reviewing and comparing the different strategies adopted by the various modeling methods.

2. Data Structure for 3D Surfaces

2.1 Volumetric Representation

The concept of volumetric representation was first proposed in 1996 by Curless and Levoy [CL96]. They introduced representing a 3D surface using a signed distance function (SDF) sampled at discrete grid points, where the SDF value is defined as the distance of a 3D point to the surface. Thus, the surface of an object or a scene, according to this definition, is the iso-distance surface whose SDF is 0. The free space, i.e., the space outside of the object, is indicated by positive SDF values, since the distance increases as a point moves along the surface normal towards the free space. In contrast, the occupied space (the space inside of the object) is indicated by negative SDF values. These SDF values are stored in a cube surrounding the object, and the cube is rasterized into voxels. The reconstruction algorithm must define the size and spatial extent of the cube. Other data, such as colors, are usually stored as voxel attributes. Since only voxels close to the actual surface are important, researchers often use the truncated signed distance function (TSDF) to represent the model. Figure 2 shows the general pipeline of reconstruction based on the volumetric representation.

Since the TSDF value implicitly represents the surface of the object, it is necessary to extract the surface of the object by means of ray casting before estimating the pose of the camera. The fusion step is a simple weighted average operation at each voxel center, which is very effective for eliminating the noise of the depth camera.
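The surface-extraction step can be illustrated with a minimal 1D sketch of ray casting: march along a ray in fixed steps, detect the sign change of the sampled TSDF, and refine the crossing by linear interpolation. A real system samples a trilinearly interpolated 3D volume; the lambda below is a toy stand-in for that sampler.

```python
def raycast(tsdf, t_min, t_max, step):
    """March along the ray; return the interpolated depth of the zero crossing, or None."""
    t, prev = t_min, tsdf(t_min)
    while t + step <= t_max:
        cur = tsdf(t + step)
        if prev > 0.0 >= cur:  # sign change: the surface lies between the two samples
            return t + step * prev / (prev - cur)
        t, prev = t + step, cur
    return None

# Toy TSDF of a surface located at depth 2.0 along the ray.
hit = raycast(lambda t: 2.0 - t, t_min=0.0, t_max=5.0, step=0.1)
```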

Uniform voxel grids are very inefficient in terms of memory consumption and are limited by the predefined volume and resolution. In the context of real-time scene reconstruction, most methods rely heavily on the processing power of modern GPUs, so the spatial extent and resolution of voxel grids are often limited by GPU memory. To support a larger spatial extent, researchers have proposed a variety of methods to improve the memory efficiency of volumetric-representation-based algorithms. To prevent data loss caused by depth images acquired outside of the predefined cube, Whelan et al. [WKF12] proposed dynamically shifting the voxel grid so that it follows the motion of the depth camera. When the voxel grid moves, the method extracts the parts outside the current grid and stores them separately, adding their camera poses to the global poses. Although this achieves a larger scan volume, it requires a large amount of external memory, and the scanned surface cannot be arbitrarily revisited. Henry et al. [RV12] proposed a similar idea.

Voxel hierarchies, such as octrees, are another way to efficiently store the geometry of 3D surfaces. An octree can adaptively subdivide the scene space according to the complexity of the scene, effectively utilizing the computer's memory [FPRJ00]. Although its definition is simple, the sparsity of its nodes makes it difficult to exploit the parallelism of the GPU. Sun et al. [SZKS08] established an octree with only leaf nodes to store voxel data and accelerated tracking based on the octree. Zhou et al. [ZGHG11] used the GPU to construct a complete octree structure to perform Poisson reconstruction on scenes with 300,000 vertices at an interactive rate. Zeng et al. [ZZZL13] implemented a 9-to-10-level octree for KinectFusion and extended the reconstruction to a medium-sized office. Chen et al. [CBI13] proposed a similar 3-level hierarchy with equally good resolution. Steinbrücker et al. [SSC14] represent the scene in a multi-resolution data structure for real-time accumulation on the CPU, outputting a mesh model at approximately 1 Hz. Nießner et al. [NZIS13] proposed a voxel hashing structure. In this method, the index corresponding to the spatial position of a voxel block is stored in a linearized spatial hash table and addressed by a spatial hash function. Since only voxels containing surface information are stored in memory, this strategy greatly reduces memory consumption and theoretically allows voxel blocks of a certain size and resolution to represent infinitely large spaces. Compared with voxel hierarchies, the hashing structure has a great advantage in data insertion and access, whose time complexity is O(1). However, the hash function inevitably maps some distinct spatial voxel blocks to the same location in the hash table. Kahler et al. [KPR15] proposed a different hashing method to reduce the number of hash collisions. Hash-based 3D reconstruction has the advantages of a small memory footprint and efficient computation, making it suitable even for mobile devices, such as Google's Tango [DKSX17].
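The spatial hash of [NZIS13] maps an integer voxel-block coordinate to a hash-table bucket by XOR-ing the coordinates multiplied by three large primes. A minimal sketch (the primes follow the paper; the table size is an illustrative choice):

```python
P1, P2, P3 = 73856093, 19349669, 83492791  # large primes, as in the voxel hashing paper
TABLE_SIZE = 1 << 20                        # illustrative bucket count

def block_hash(x, y, z):
    """Hash an integer voxel-block coordinate (x, y, z) to a bucket index."""
    return ((x * P1) ^ (y * P2) ^ (z * P3)) % TABLE_SIZE
```

Collisions, where two distinct blocks share a bucket, are unavoidable and must be resolved inside the bucket, which is why [KPR15] focuses on reducing their number.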

2.2 Surfel

Surfel is another important surface representation, originally used as the primitive for model rendering [PZVG00]. As shown in Figure 3, a surfel usually contains the following attributes: position p ∈ R3, normal vector n ∈ R3, color c ∈ R3, weight (or confidence) w, radius r, and timestamp t. The weight of a surfel is used for the weighted averaging of neighboring 3D points and for judging whether a point is stable or unstable. It is generally initialized by the following formula: w = e^(−γ²/(2σ²)), where γ is the normalized radial distance from the current depth measurement to the camera center, and σ is usually taken as 0.6 [KLL*13].
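In code, this weight initialization is a direct transcription of the formula:

```python
import math

def initial_weight(gamma, sigma=0.6):
    """Surfel confidence w = exp(-gamma^2 / (2 * sigma^2)); gamma is the
    normalized radial distance of the measurement from the camera center."""
    return math.exp(-gamma ** 2 / (2.0 * sigma ** 2))

w_center = initial_weight(0.0)  # a measurement at the image center gets full weight 1.0
w_border = initial_weight(1.0)  # measurements near the image border are down-weighted
```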

The radius of a surfel is determined by the distance of the scene surface from the optical center of the camera: the larger the distance, the larger the radius of the surfel. Compared to the volumetric representation, surfel-based 3D reconstruction is more compact and intuitive for input from depth cameras, eliminating the need to switch back and forth between different representations. Similar to the volumetric representation, octrees can be used to improve surfel-based data structures [SB12, SB14].

Figure 3: Schematic diagram of a surfel.

Pros and Cons: The volumetric representation is more convenient for depth denoising and surface rendering, since the TSDF computed from each frame can be efficiently fused at each voxel, as discussed in the next section. In contrast, the surfel representation can be viewed as an extension of the point cloud representation. When vertex correspondences are needed during registration, such information can be directly computed from surfels, while the volumetric representation has to be converted to a 3D mesh in this case.

3. 3D Reconstruction Based On Depth Cameras

3.1 Depth Map Preprocessing

The depth image obtained by the depth camera inevitably contains noise or even outliers, due to complex lighting conditions, geometric variations and spatially varying materials of the objects. Mallick et al. [MDM 14] classify these errors into three categories. (1) Missing depth: when the object is too close to or too far from the depth camera, the camera sets the depth value to zero because it cannot measure the depth. Note that even within the measurement range of the camera, zero depth may be produced by surface discontinuities, highlights, shadows or other causes. (2) Depth error: when a depth camera records a depth value, its accuracy depends on the depth itself. (3) Temporal inconsistency: even if the depth of the scene is constant, the measured depth value changes over time.

In most cases, bilateral filtering [TM98] is used to reduce the noise of the depth image. After noise reduction, KinectFusion [IKH11] applies mean downsampling to obtain a 3-layer depth map pyramid, in order to estimate the camera pose from coarse to fine in the next step. From the denoised depth image and the intrinsic parameters of the depth camera, the 3D coordinates of each pixel in the camera coordinate system can be calculated by inverse projection to form a vertex map. The normal of each vertex can be obtained as the cross product of the vectors formed by the vertex and its adjacent vertices. For the surfel representation, it is also necessary to calculate the radius of each point, which represents the local surface area around the point while minimizing the visible holes between adjacent points, as shown in Figure 3. [SGKD14] calculates the vertex radius as r = √2 · d / f, where d is the depth value of the vertex and f is the focal length of the depth camera.
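The back-projection, normal estimation, and radius computation described above can be sketched as follows (illustrative intrinsics; the bilateral filtering of the raw depth map is assumed to have already happened):

```python
import numpy as np

def vertex_map(depth, fx, fy, cx, cy):
    """Back-project an HxW depth map into an HxWx3 map of camera-space vertices."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.dstack([x, y, depth])

def normal_map(verts):
    """Per-pixel normals from the cross product of neighboring vertex differences."""
    n = np.zeros_like(verts)
    dx = verts[1:-1, 2:] - verts[1:-1, :-2]  # horizontal neighbor difference
    dy = verts[2:, 1:-1] - verts[:-2, 1:-1]  # vertical neighbor difference
    cross = np.cross(dx, dy)
    norm = np.linalg.norm(cross, axis=-1, keepdims=True)
    n[1:-1, 1:-1] = cross / np.maximum(norm, 1e-12)
    return n

def surfel_radius(depth, f):
    """Surfel radius r = sqrt(2) * d / f, as in [SGKD14]."""
    return np.sqrt(2.0) * depth / f

depth = np.full((5, 5), 2.0)  # toy input: a flat wall 2 m away
verts = vertex_map(depth, fx=500.0, fy=500.0, cx=2.0, cy=2.0)
normals = normal_map(verts)
```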

3.2 Camera Tracking

Usually, the 6-DoF camera pose of a captured depth image is represented by a rigid transformation Tg. This transformation maps the camera coordinate system to the world coordinate system. Given the position of a point pl in the local camera coordinate system, it can be converted to its 3D coordinate pg in the world coordinate system by the following equation: pg = Tg · pl.

Camera tracking, i.e., estimating the camera pose, is a prerequisite for fusing the depth image of the current frame into the global model, which is represented by either the volumetric representation or surfels depending on the 3D reconstruction system. The iterative closest point (ICP) algorithm might be the most important algorithm for relative camera pose estimation. In the early days, the ICP algorithm was applied to 3D shape registration [BM92, CM91]. The rigid transformation is calculated by finding matching points between two adjacent frames and minimizing the sum of squared Euclidean distances between these pairs. This step is iterated until some convergence criterion is met. A serious problem with this frame-to-frame strategy is that the registration error between frames accumulates as the scan progresses.
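Given a set of matched point pairs, the rigid transformation minimizing the sum of squared distances has a closed-form solution via SVD (the Kabsch/Umeyama solution evaluated inside each ICP iteration). A minimal sketch:

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Return R, t minimizing sum_i ||R @ src_i + t - dst_i||^2."""
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    H = (src - cs).T @ (dst - cd)            # cross-covariance of centered points
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cd - R @ cs

# Recover a known translation-only motion from four matched points.
src = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
dst = src + np.array([0.1, -0.2, 0.3])
R, t = best_rigid_transform(src, dst)
```

In a full ICP loop, this solve alternates with re-computing the matches until the motion update falls below a threshold.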

In order to alleviate this problem, depth-camera-based reconstruction methods adopted a frame-to-model camera tracking strategy [NIH*11, WKF12, NZIS13]. Although frame-to-model tracking significantly reduces the per-frame drift, it does not completely solve the problem of error accumulation, as tracking errors still build up over time. The accumulated drift eventually leads to misalignment of the reconstructed surface at the loop closure of the trajectory.

Figure 4: Camera tracking and the optimized trajectory [KSC13].

Therefore, researchers have introduced the idea of global pose optimization. Henry et al. [HKH12] proposed the concept of the keyframe: a new keyframe is produced whenever the accumulated camera pose exceeds a threshold. Loop closure detection is performed only when a keyframe occurs, and the RANSAC algorithm [FB81] is used to match feature points between the current keyframe and previous keyframes. After a loop closure is detected, there are two global optimization strategies to choose from. The first is pose graph optimization, in which nodes represent camera poses and the edges between frames encode geometric constraints. The second is sparse bundle adjustment, which globally minimizes the re-projection error of feature points matched across all frames [TMHF00, LA09, K10]. Zhou et al. [ZMK13] proposed to use interest points to preserve local details and combined this with global pose optimization to evenly distribute registration errors over the scene. Zhou et al. [ZK13] proposed to partition the video sequence into segments, reconstruct locally accurate scene fragments from each segment using frame-to-model integration, establish dense correspondences between overlapping fragments, and optimize a global energy function to align the fragments. The optimization can subtly deform the fragments to correct inconsistencies caused by low-frequency distortion in the input images.

Whelan et al. [WLS*15] used depth and color information to perform global pose estimation for each input frame. The depth and color predictions used for registration are obtained by applying the surface splatting algorithm [PZVG00] to the global model. They divide the loop closure problem into global loop closures and local loop closures. Global loop closure detection uses the randomized fern encoding method [GSCI15]. For loop closure optimization, they use a spatial deformation method based on the embedded deformation graph [SSP07]. A deformation graph consists of a set of nodes and edges distributed over the model to be deformed. Each surfel is influenced by a set of nodes, with weights inversely proportional to the distance of the surfel to each node.

Regardless of the registration strategy, finding corresponding points between two frames is an essential step in camera tracking: the set of corresponding point pairs enters the objective function whose optimization yields the camera pose. Methods for finding corresponding points can be divided into sparse and dense methods according to the point pairs used. Sparse methods only match feature points, while dense methods search for correspondences for all points.

For feature extraction and matching, the SIFT algorithm [LOWE04] is a popular choice. The SURF algorithm [BETV08] improves on SIFT and speeds up feature detection. The ORB feature descriptor is an alternative to SIFT and SURF that is faster but correspondingly less stable. Endres et al. [EHE*12] exploited these three feature descriptors for extracting feature points and compared their effects. Whelan et al. [WKF*12] explored feature-point-based fast odometry from vision (FOVIS) instead of the iterative closest point method in KinectFusion and adopted SURF descriptors for loop closure detection. Zhou et al. [ZK15] proposed to track the camera based on object contours, introducing corresponding points of contour features to make the tracking more stable. The recent BundleFusion method [DNZ*17] uses SIFT features for coarse registration first and dense methods for fine registration.

For dense methods, traditional ways of finding corresponding points are time-consuming. The projective data association algorithm [BL95] greatly speeds up this process. This strategy projects the 3D coordinates of an input point to a pixel on the target depth map according to the camera pose and intrinsic parameters, and obtains the best corresponding point through a search in the neighborhood of that pixel. There are many ways to measure the error of corresponding points, such as the point-to-point metric [BM92] and the point-to-plane metric [CM91]. Compared to the point-to-point metric, the point-to-plane metric leads to faster convergence in camera tracking and is thus widely used. In addition to geometric distances, many methods use photometric differences (i.e., color differences) between corresponding points as constraints, such as [SKC13, DNZ*17]. Lefloch et al. [LKS*17] took curvature as an independent surface property for real-time reconstruction, considering the curvature of the input depth map and the global model when finding dense corresponding points.
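One common way to use the point-to-plane metric is to linearize the rotation for small angles, which turns each correspondence (source point p, target point q with normal n) into one linear equation in the 6 pose parameters. The sketch below shows a single Gauss-Newton step under this standard small-angle assumption; it is illustrative, not code from the cited papers.

```python
import numpy as np

def point_to_plane_step(src, dst, normals):
    """Solve the linearized point-to-plane system; returns (alpha, beta, gamma, tx, ty, tz)."""
    A = np.zeros((len(src), 6))
    b = np.zeros(len(src))
    for i, (p, q, n) in enumerate(zip(src, dst, normals)):
        A[i, :3] = np.cross(p, n)  # derivative of n.(R p + t - q) w.r.t. the rotation
        A[i, 3:] = n               # derivative w.r.t. the translation
        b[i] = n @ (q - p)
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x

# Pure translation along z, observed against planes facing z: recovers tz = 0.05.
src = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 1.1]])
dst = src + np.array([0.0, 0.0, 0.05])
normals = np.tile(np.array([0.0, 0.0, 1.0]), (4, 1))
x = point_to_plane_step(src, dst, normals)
```

In a full tracker, the correspondences themselves come from the projective data association described above, and the step is iterated.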

3.3 Fusion of Depth Images

Volumetric Representation: Compared to computing a complete signed distance function for all voxels, the truncated signed distance function (TSDF) is usually calculated because its computational cost is more affordable. After camera tracking, each voxel center is first projected onto the input depth map to compute the TSDF value corresponding to the current frame. Then, the new TSDF value is fused into the TSDF of the global model by weighted averaging, which is effective in removing depth noise.
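The weighted-average update at a single voxel can be written in a few lines (the weight cap w_max is an illustrative choice; real systems also skip voxels outside the truncation band):

```python
def fuse_voxel(tsdf, weight, tsdf_new, w_new=1.0, w_max=100.0):
    """Incrementally fuse a new TSDF observation into the running weighted average."""
    fused = (tsdf * weight + tsdf_new * w_new) / (weight + w_new)
    return fused, min(weight + w_new, w_max)

# Three noisy observations of the same voxel; the average converges to 0.02.
d, w = 0.0, 0.0
for obs in (0.03, 0.01, 0.02):
    d, w = fuse_voxel(d, w, obs)
```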

Surfel: After estimating the camera pose of the current input frame, each vertex, together with its normal and radius, is integrated into the global model. The fusion of depth maps using surfels has three steps: data association, weighted averaging, and removal of invalid points [KLL*13]. First, corresponding points are found by projecting the vertices of the current 3D model onto the image plane of the current camera. Since several model points might be projected onto the same pixel, an upsampled index map is used to improve the accuracy. If a corresponding point is found, the most reliable point is merged with the new point estimate using a weighted average. If not, the new point estimate is added to the global model as an unstable point. Over time, the global model is cleaned up to eliminate outliers according to visibility and temporal constraints.
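A minimal sketch of the merge step for one associated surfel/measurement pair, using the confidence weights from Section 2.2 (the exact attribute set and weighting details vary between systems):

```python
import numpy as np

def merge_surfel(p, n, c, w, p_new, n_new, c_new, w_new):
    """Weighted average of position/normal/color; the confidence weight accumulates."""
    wt = w + w_new
    p = (w * p + w_new * p_new) / wt
    n = w * n + w_new * n_new
    n = n / np.linalg.norm(n)  # re-normalize the averaged normal
    c = (w * c + w_new * c_new) / wt
    return p, n, c, wt

p, n, c, w = merge_surfel(
    np.array([0.0, 0.0, 1.0]), np.array([0.0, 0.0, 1.0]), np.array([0.5, 0.5, 0.5]), 1.0,
    np.array([0.0, 0.0, 1.02]), np.array([0.0, 0.0, 1.0]), np.array([0.7, 0.5, 0.5]), 1.0)
```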

3.4 3D Reconstruction of Dynamic Scene

The reconstruction of dynamic scenes is more challenging than static scenes because of moving objects. We have to deal with not only the overall movement of the objects but also their deformations. The key to dynamic reconstruction is to develop a fast and robust non-rigid registration algorithm to handle non-rigid motions in the scene.

Volumetric Representation: DynamicFusion [NFS15], proposed by Newcombe et al., is the first method to reconstruct a non-rigidly deforming scene in real time using the input of a depth camera. It reconstructs an implicit volumetric representation, similar to the KinectFusion method, and simultaneously optimizes the rigid and non-rigid motion of the scene based on a warp field parameterized by a sparse deformation graph [SSP07]. The main problem with DynamicFusion is that it cannot handle topology changes that occur during capturing. Dou et al. proposed the Fusion4D method [DKD*16]. It uses 8 depth cameras to acquire multi-view input, combines volumetric fusion with the estimation of a smooth deformation field, and proposes a scheme for fusing the depth information of multiple cameras that can be applied to complex scenes. In addition, it proposes a machine-learning-based correspondence detection method that can robustly handle fast frame-to-frame motion. Recently, Yu et al. proposed a new real-time reconstruction system, DoubleFusion [YZK*18]. This system combines dynamic reconstruction based on the volumetric representation with the skeletal body model SMPL [KMR*15], and can reconstruct detailed geometry, non-rigid motion, and the inner body shape simultaneously from a single depth camera. One of the key contributions of this method is the two-layer representation, which consists of a fully parametric inner shape and a gradually fused outer surface layer. A predefined node graph on the body surface parameterizes the non-rigid deformation near the body, and a free-form dynamically changing graph parameterizes the outer surface layer far from the body for more general reconstruction. Furthermore, a joint motion tracking method based on the two-layer representation is proposed to make the motion tracking robust and fast. Guo et al. [KFT*2017] proposed a scheme that leverages appearance information for motion estimation, where the computed albedo is fused into the volume. The accuracy of the reconstructed motion of the dynamic object is significantly improved due to the incorporation of the albedo information.

Figure 5: A moving person in the scene [KLL*13].

Surfel: Keller et al. [KLL*13] used a foreground segmentation method to process dynamic scenes. The proposed algorithm removes moving surfaces from the global model so that the dynamic parts do not affect camera tracking. As shown in Figure 5, a person sitting on a chair is being reconstructed and then begins to move. The system segments the moving person in the ICP step and ignores the dynamic part of the scene (A), thus ensuring the robustness of camera tracking to dynamic motion.

Rünz et al. [RA17] proposed to use motion or semantic cues to segment the scene into the background and different rigid foreground objects, simultaneously tracking and reconstructing their 3D geometry over time; this is one of the earliest real-time methods for dense tracking and fusion of multiple objects. They propose two segmentation strategies: motion segmentation (grouping points that move consistently in 3D) and object instance segmentation (detecting and segmenting individual objects in RGB images for a given semantic tag). Once detected and segmented, objects are added to the list of active models, then tracked, and their 3D shape models are updated by fusing the data belonging to each object. Gao et al. [GT18] propose to use a deformation field to align the depth input of the current moment with a reference geometry (template). The deformation field is initialized with that of the previous frame and then optimized by an ICP algorithm similar to DynamicFusion. Once the deformation field is updated, data fusion accumulates the current depth observations into the global geometric model. According to the updated global model, the global deformation field is inversely computed for the registration of the next frame.

3.5 Scene Understanding

Scene understanding refers to analyzing a scene by considering the geometry and semantic context of the scene content and the intrinsic relationships between them, including object detection and recognition, the relationships between objects, and semantic segmentation. Research in recent years has shown that scene understanding, such as plane detection and semantic segmentation, can provide significant assistance to the 3D modeling of scenes.

Observing that there are many planes in indoor scenes, such as walls, floors, and desktops, Zhang et al. [ZXTZ15] proposed a region-growing method to detect and label planes, using planarity and orthogonality constraints to reduce camera tracking drift and the bending of flat regions.

Figure 6: The recognition, tracking and mapping results of MaskFusion [RA18].

Dzitsiuk et al. [DSM*17] fit planes to the signed distance function values of the volume using a robust least-squares method and merge the detected local planes to find globally continuous planes. By using the detected planes to correct the signed distance function values, the noise on flat surfaces can be significantly reduced. Further, they use plane priors to directly obtain object-level and semantic segmentations (such as walls) of the indoor environment. Shi et al. [SXN*18] propose a method for predicting the co-planarity of image patches for RGB-D image registration. They use a self-supervised approach to train a deep network to predict whether two image patches are coplanar, and incorporate coplanarity constraints when solving for the camera poses. Experiments have shown that this use of co-planarity remains effective when feature-point-based methods cannot find loop closures.

SLAM++ [SNS*13], proposed by Salas-Moreno et al., is one of the earliest object-oriented mapping methods. They use point-pair features to detect objects and reconstruct 3D maps at the object level. Although this scheme extends the SLAM system to object-level loop closure processing, detailed 3D models of the objects must be processed in advance to learn the 3D feature descriptors. Tateno et al. [TTN16] incrementally segment the reconstructed regions in TSDF volumes and use the OUR-CVFH [ATRV12] feature descriptor to directly match objects in a database. Xu et al. [XSZ*16] proposed a recursive 3D attention model for active 3D object recognition by robots. The model can perform both instance-level shape classification and continuous next-best-view regression. They insert the retrieved 3D model into the scene and build the 3D scene model step by step. In [SKQ*19], the 3D indoor scene scanning task is decomposed into concurrent subtasks for multiple robots to explore the unknown regions simultaneously, where the target regions of the subtasks are computed via optimal transportation optimization.

Figure 7: Primitive abstraction and texture enhancement in [HDGN17].

In recent years, due to the development of 2D semantic image segmentation algorithms, many methods combine them with the 3D modeling methods to improve the reconstruction quality.

McCormac et al. [MHDL17] adopt the deconvolution network of Noh et al. [HSH15] to perform 2D semantic segmentation. A Bayesian framework for 2D-3D label transfer is used to fuse the 2D semantic segmentation labels into the 3D model, which is then post-processed using a CRF. Rünz et al. [RA18] adopt the Mask R-CNN model [HGDG17] to detect object instances (see Figure 6). They build on ElasticFusion [WLS*15], using surfels to represent each object and the static background, for real-time tracking and dense reconstruction of dynamic instances. Different from the above two methods, which label the scene at the object level, Hu et al.

[HWK*18] propose a method for reconstructing a single object based on labeling each of its parts. To minimize the user's labeling effort, they use an active self-learning method to generate the data needed by the neural network. Schönberger et al. [SPGS18] obtain 3D descriptors by jointly encoding high-level 3D geometric information and semantic information. Experiments show that such descriptors enable reliable localization even under extreme variations in viewing angle, illumination, and geometry.
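As a concrete illustration of the Bayesian 2D-to-3D label transfer used in SemanticFusion-style pipelines [MHDL17]: each surface element keeps a class distribution that is recursively multiplied by the per-pixel CNN prediction and renormalized. This is a schematic sketch under a simplified naive-Bayes assumption, not the authors' code.

```python
import numpy as np

def fuse_labels(surfel_probs, pixel_probs):
    """One recursive Bayesian update: multiply the stored per-surfel
    class distribution by the new per-pixel CNN prediction, then
    renormalize so the result is again a probability distribution."""
    fused = np.asarray(surfel_probs) * np.asarray(pixel_probs)
    return fused / fused.sum()

# Start from a uniform prior over 3 classes and fuse two CNN
# predictions that both favor class 0; the belief concentrates there.
belief = np.full(3, 1.0 / 3.0)
for prediction in ([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]):
    belief = fuse_labels(belief, prediction)
```

In practice each update is applied only to surfels visible in the current frame, using the pixel they project to, and the CRF regularization then smooths the per-surfel distributions over the surface.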

4. Texture Reconstruction

In addition to the 3D geometry of objects and scenes, which is of great interest in many applications, surface color and general appearance information can significantly enhance the sense of reality of the reconstructed geometry in AR/VR applications, which is important for simulating the real world in virtual environments. In the following, offline and online texture reconstruction methods are reviewed and discussed.

4.1 Offline Texture Reconstruction

The goal of offline texture reconstruction is to reconstruct a high-quality, globally consistent texture for the 3D model from multi-view RGB images after the capturing process is finished. These methods rely on optimization algorithms to remove the ghosting and over-smoothing effects that arise in the direct blending of the captured images. The frequently explored optimization objectives include color consistency [BMR01, PS00], the alignment of image and geometric features [LHS01, SA02], and mutual information maximization between the projected images [CDPS09, CDG*13]. While these methods can effectively handle camera calibration inaccuracies, they cannot handle inaccurate geometry and optical distortions in the RGB images, which is a common problem in the data captured by consumer-level depth cameras. Zhou and Koltun [ZK14] solved these problems by optimizing the RGB camera poses and a non-rigid warp of each associated image to maximize photo consistency. In contrast, Bi et al. [BKR17] proposed a patch-based optimization method, which synthesizes a set of feature-aligned color images to produce high-quality textures even with large geometric errors. Huang et al. [HDGN17] made a number of improvements upon Zhou and Koltun [ZK14] for 3D indoor scene reconstruction (see Figure 7). After calculating a primitive abstraction of the scene, they first correct color values by compensating for the varying exposure and white balance, and then align the RGB images by an optimization based on sparse features and dense photo consistency. Finally, a temporally coherent sharpening algorithm is applied to obtain a clean texture for the reconstructed 3D model.
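The objective behind color-map optimization [ZK14] can be sketched as follows: project each mesh vertex into every image and penalize the difference between the vertex's target color and the sampled image color; the poses (and, in the full method, a per-image non-rigid warp) are then optimized to minimize this energy. The helper names and the grayscale simplification here are our own assumptions.

```python
import numpy as np

def project(K, T, v):
    """Project a 3D vertex v into an image with intrinsics K and
    world-to-camera pose T (4x4 matrix). Returns pixel coordinates."""
    p = (T @ np.append(v, 1.0))[:3]
    uvw = K @ p
    return uvw[:2] / uvw[2]

def photo_consistency(vertices, colors, images, poses, K, sample):
    """E = sum_i sum_v (C(v) - I_i(project(T_i, v)))^2, the kind of
    energy that color-map optimization minimizes over camera poses.
    `sample` is a bilinear image-lookup function supplied by the caller."""
    E = 0.0
    for I, T in zip(images, poses):
        for v, c in zip(vertices, colors):
            E += (c - sample(I, project(K, T, v))) ** 2
    return E

def update_vertex_colors(vertices, images, poses, K, sample):
    """Closed-form update of C(v), holding the poses fixed: the average
    of the colors observed at the projections of v."""
    return np.array([np.mean([sample(I, project(K, T, v))
                              for I, T in zip(images, poses)])
                     for v in vertices])
```

Alternating the closed-form color update with pose refinement is what removes ghosting that plain weighted blending leaves behind.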

4.2 Online Texture Reconstruction

Online texture reconstruction generates high-quality textures in parallel with the scanning of the 3D shape. Its advantage is that the user can see the textured model in real time during scanning, but the texture quality may be inferior, since images not yet captured cannot be exploited.

Whelan et al. [WKJ*15] decline to fuse the color values at the edges of an object into the model, as doing so would result in artifacts or discontinuities. In subsequent work [WSG*16], they estimate the position and orientation of the light source in the scene to further reject fusing samples that contain only highlights. Although this significantly improves the visual quality of the texture, artifacts may still occur. Since high dynamic range (HDR) images provide more image details than low dynamic range (LDR) images, they have also been adopted in 3D reconstruction based on RGB-D cameras [MBC13, LHZC16, APZV17]. First, a pre-calibrated response curve is used to linearize the observed intensity values. Then, the HDR pixel colors are computed based on the exposure time and fused into the texture of the 3D model in real time. A recent contribution by Xu et al. [LZL*2019] proposes a dynamic atlas texturing scheme for warping and updating dynamic texture on the fused geometry in real time, which significantly improves the visual quality of 3D human modeling.
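The HDR fusion step described above can be sketched as: map each observed LDR intensity through the pre-calibrated inverse response curve, divide by the exposure time to obtain radiance, and average the per-frame radiances (weights that down-weight saturated pixels are optional). The lookup-table representation of the response curve is an illustrative assumption.

```python
import numpy as np

def fuse_hdr(intensities, exposures, inv_response, weights=None):
    """Fuse LDR observations of one surface point into an HDR radiance:
    linearize each 8-bit intensity via the inverse camera response
    curve, divide out the exposure time, then take a (optionally
    weighted) average of the per-frame radiances."""
    radiance = np.array([inv_response[i] / t
                         for i, t in zip(intensities, exposures)])
    w = np.ones_like(radiance) if weights is None else np.asarray(weights, float)
    return float((w * radiance).sum() / w.sum())

# With a linear response curve, an intensity of 64 at half the exposure
# corresponds to the same radiance as 128 at full exposure.
inv_response = np.arange(256) / 255.0   # hypothetical calibrated curve
r = fuse_hdr([128, 64], [1.0, 0.5], inv_response)
```

The fused radiance, rather than the raw LDR color, is what gets accumulated into the texture of the 3D model.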

5. Conclusion and Future Work

In the past few years, 3D modeling using depth cameras has evolved fast. Its current development covers the entire reconstruction pipeline and brings together innovations at various levels, from depth camera hardware to high-level applications such as augmented reality and teleconferencing.

Research works on 3D modeling using depth cameras, including camera tracking, 3D object and scene reconstruction, and high-quality texture reconstruction, are reviewed and discussed in this paper.

Despite the achieved success, 3D modeling based on depth cameras still faces many challenges. First, depth cameras are short-range 3D sensors, so it is still time-consuming to apply them to large-scale scene reconstruction. Thus, it is necessary to investigate how to integrate data from long-range sensors, such as laser scanners, with depth data to speed up 3D scene reconstruction. A multi-source, large-scale 3D data registration algorithm, including correspondence detection and subsequent distance minimization, should be developed in this case.

We hypothesize that domain decomposition techniques, which alternately minimize the objective function over a subset of the optimization variables in each iteration, can be applied to handle such large-scale registration problems. It is also interesting to investigate active vision approaches that drive robots carrying depth cameras through a scene for complete scanning, as in [ZXY*17]. Second, moving objects in the scene are problematic for depth camera tracking. We need to explore how to accurately separate moving objects from the static scene in the captured depth data, so as to improve the accuracy of camera tracking based on the static scene. One possible solution is to integrate object detection and motion flow, where objects whose motion conforms to the camera motion should be treated as static with respect to the camera. Third, to support the application of depth cameras integrated into mobile phones, it is necessary to develop real-time and energy-efficient depth data processing and camera tracking algorithms for mobile phone hardware. Convergence acceleration algorithms, such as Anderson acceleration, can be explored to speed up the camera tracking [YBJ*18].
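The alternating-minimization idea suggested above can be sketched as block-coordinate descent: cycle through the variable blocks and take a descent step on one block while the others are held fixed. The toy quadratic objective below stands in for a real multi-source registration energy; everything here is an illustrative assumption.

```python
def block_coordinate_descent(grad_blocks, x_blocks, iters=300, lr=0.1):
    """Minimize a coupled objective by cycling through variable blocks,
    taking a gradient step on block k while the other blocks are fixed."""
    x = list(x_blocks)
    for _ in range(iters):
        for k, grad in enumerate(grad_blocks):
            x[k] = x[k] - lr * grad(x)   # update block k only
    return x

# Toy coupled objective f(x, y) = (x - a)^2 + (y - b)^2 + (x - y)^2,
# whose minimizer is x* = (2a + b) / 3, y* = (a + 2b) / 3.
a, b = 1.0, 4.0
grads = [lambda x: 2 * (x[0] - a) + 2 * (x[0] - x[1]),   # df/dx
         lambda x: 2 * (x[1] - b) + 2 * (x[1] - x[0])]   # df/dy
x_opt, y_opt = block_coordinate_descent(grads, [0.0, 0.0])
```

In a registration setting, a "block" would be the poses of one scan cluster or one sensor modality, so each sub-problem stays small enough to solve quickly.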

Acknowledgement

Weiwei Xu is partially supported by NSFC 61732016.

REFERENCES

[A03] Davison A. J.: Real-time simultaneous localisation and mapping with a single camera. In Proceedings of the International Conference on Computer Vision (ICCV), 2003.
[APZV17] Alexandrov S. V., Prankl J., Zillich M., Vincze M.: Towards dense SLAM with high dynamic range colors. In Computer Vision Winter Workshop (CVWW), 2017.
[ATRV12] Aldoma A., Tombari F., Rusu R. B., Vincze M.: OUR-CVFH: Oriented, unique and repeatable clustered viewpoint feature histogram for object recognition and 6DOF pose estimation. In Joint DAGM-OAGM Pattern Recognition Symposium, 2012.
[BM92] Besl P. J., McKay N. D.: A method for registration of 3-D shapes. In IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 14, 2 (1992), pp. 239–256.
[BETV08] Bay H., Ess A., Tuytelaars T., Van Gool L.: SURF: Speeded Up Robust Features. In Computer Vision and Image Understanding 110, 3 (2008), pp. 346–359.
[BL95] Blais G., Levine M. D.: Registering multiview range data to create 3D computer objects. In IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 17, 8 (1995), pp. 820–824.
[BMR01] Bernardini F., Martin I. M., Rushmeier H.: High-quality texture reconstruction from multiple scans. In IEEE TVCG 7, 4 (2001), pp. 318–332.
[BKR17] Bi S., Kalantari N. K., Ramamoorthi R.: Patch-based optimization for image-based texture mapping. In ACM Trans. on Graphics (Proc. SIGGRAPH) 36, 4 (2017).
[CM91] Chen Y., Medioni G.: Object modeling by registration of multiple range images. In IEEE International Conference on Robotics and Automation (ICRA), 1991, pp. 2724–2729.
[CBI13] Chen J., Bautembach D., Izadi S.: Scalable real-time volumetric surface reconstruction. In ACM Trans. on Graphics (Proc. SIGGRAPH) 32, 4 (2013), pp. 113:1–113:16.
[CDG*13] Corsini M., Dellepiane M., Ganovelli F., Gherardi R., Fusiello A., Scopigno R.: Fully automatic registration of image sets on approximate geometry. In IJCV 102, 1-3 (2013), pp. 91–111.
[CDPS09] Corsini M., Dellepiane M., Ponchio F., Scopigno R.: Image-to-geometry registration: A mutual information method exploiting illumination-related geometric properties. In CGF 28, 7 (2009), pp. 1755–1764.
[CL96] Curless B., Levoy M.: A volumetric method for building complex models from range images. In Proc. Comp. Graph. & Interact. Techn. (SIGGRAPH), 1996, pp. 303–312.
[DKD*16] Dou M., Khamis S., Degtyarev Y., Davidson P., Fanello S. R., Kowdle A., Escolano S. O., Rhemann C., Kim D., Taylor J., Kohli P., Tankovich V., Izadi S.: Fusion4D: Real-time performance capture of challenging scenes. ACM Trans. on Graphics 35, 4 (2016), pp. 114:1–114:13.
[DKSX17] Dryanovski I., Klingensmith M., Srinivasa S. S., Xiao J.: Large-scale, real-time 3D scene reconstruction on a mobile device. In Auton. Robots 41, 6 (2017), pp. 1423–1445.
[DNZ*17] Dai A., Nießner M., Zollhöfer M., Izadi S., Theobalt C.: BundleFusion: Real-time globally consistent 3D reconstruction using on-the-fly surface re-integration. In ACM Transactions on Graphics (TOG) 36, 4 (2017).
[DSM*17] Dzitsiuk M., Sturm J., Maier R., Ma L., Cremers D.: De-noising, stabilizing and completing 3D reconstructions on-the-go using plane priors. In International Conference on Robotics and Automation (ICRA), 2017.
[EHE*12] Endres F., Hess J., Engelhard N., Sturm J., Cremers D., Burgard W.: An evaluation of the RGB-D SLAM system. In Proceedings of the IEEE Int. Conf. on Robotics and Automation (ICRA), 2012.
[FB81] Fischler M., Bolles R.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24 (1981), pp. 381–395.
[FPRJ00] Frisken S. F., Perry R. N., Rockwood A. P., Jones T. R.: Adaptively sampled distance fields: A general representation of shape for computer graphics. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), 2000, pp. 249–254.
[GD07] Klein G., Murray D.: Parallel tracking and mapping for small AR workspaces. In Proceedings of the International Symposium on Mixed and Augmented Reality (ISMAR), 2007.
[GSCI15] Glocker B., Shotton J., Criminisi A., Izadi S.: Real-time RGB-D camera relocalization via randomized ferns for keyframe encoding. IEEE Transactions on Visualization and Computer Graphics 21, 5 (2015), pp. 571–583.
[GT18] Gao W., Tedrake R.: SurfelWarp: Efficient non-volumetric single view dynamic reconstruction. In Robotics: Science and Systems, 2018.
[HDGN17] Huang J., Dai A., Guibas L., Nießner M.: 3DLite: Towards commodity 3D scanning for content creation. ACM Transactions on Graphics 36, 6 (2017), No. 203.
[HGDG17] He K., Gkioxari G., Dollár P., Girshick R.: Mask R-CNN. In Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2017, pp. 2980–2988.
[HKH12] Henry P., Krainin M., Herbst E., Ren X., Fox D.: RGB-D mapping: Using depth cameras for dense 3D modeling of indoor environments. In International Symposium on Experimental Robotics (ISER), 2010, pp. 477–491.
[HSH15] Noh H., Hong S., Han B.: Learning deconvolution network for semantic segmentation. In Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2015, pp. 1520–1528.
[HWK*18] Hu R., Wen C., van Kaick O., Chen L., Lin D., Cohen-Or D., Huang H.: Semantic object reconstruction via casual handheld scanning. In ACM Transactions on Graphics (Proc. SIGGRAPH Asia) 37, 6 (2018), No. 219.
[IKH11] Izadi S., Kim D., Hilliges O., Molyneaux D., Newcombe R., Kohli P., Shotton J., Hodges S., Freeman D., Davison A., Fitzgibbon A.: KinectFusion: Real-time 3D reconstruction and interaction using a moving depth camera. In Proc. ACM Symp. User Interface Softw. & Tech. (UIST), 2011, pp. 559–568.
[K10] Konolige K.: Sparse sparse bundle adjustment. In Proceedings of the British Machine Vision Conference (BMVC), 2010.
[KFT*2017] Guo K., Xu F., Yu T., Liu X., Dai Q., Liu Y.: Real-time geometry, albedo and motion reconstruction using a single RGBD camera. ACM Transactions on Graphics 36, 4 (2017), No. 32.
[KLL*13] Keller M., Lefloch D., Lambers M., Izadi S., Weyrich T., Kolb A.: Real-time 3D reconstruction in dynamic scenes using point-based fusion. In Proc. Int. Conf. 3D Vision (3DV), 2013, IEEE Computer Society, pp. 1–8.
[KPR15] Kähler O., Prisacariu V. A., Ren C. Y., Sun X., Torr P., Murray D.: Very high frame rate volumetric integration of depth images on mobile devices. In IEEE Trans. on Visualization and Computer Graphics 21, 11 (2015), pp. 1241–1250.
[KMR*15] Loper M., Mahmood N., Romero J., Pons-Moll G., Black M. J.: SMPL: A skinned multi-person linear model. ACM Transactions on Graphics 34, 6 (2015), No. 248.
[KSC13] Kerl C., Sturm J., Cremers D.: Dense visual SLAM for RGB-D cameras. In Intl. Conf. on Intelligent Robot Systems (IROS), 2013.
[LA09] Lourakis M., Argyros A.: SBA: A software package for generic sparse bundle adjustment. ACM Transactions on Mathematical Software 36 (2009), pp. 1–30.
[LHS01] Lensch H., Heidrich W., Seidel H.: A silhouette-based algorithm for texture registration and stitching. In Graphical Models 63, 4 (2001), pp. 245–262.
[LHZC16] Li S., Handa A., Zhang Y., Calway A.: HDRFusion: HDR SLAM using a low-cost auto-exposure RGB-D sensor. In Proc. Int. Conf. 3D Vision (3DV), 2016, IEEE, pp. 314–322.
[LKS*17] Lefloch D., Kluge M., Sarbolandi H., Weyrich T., Kolb A.: Comprehensive use of curvature for robust and accurate online surface reconstruction. IEEE Trans. Pattern Anal. Mach. Intell. (2017).
[LOWE04] Lowe D.: Distinctive image features from scale-invariant keypoints. In International Journal of Computer Vision 60 (2004), pp. 91–110.
[LZL*2019] Xu L., Su Z., Han L., Yu T., Liu Y., Fang L.: UnstructuredFusion: Realtime 4D geometry and texture reconstruction using commercial RGBD cameras. To appear in IEEE TPAMI, 2019.
[MBC13] Meilland M., Barat C., Comport A.: 3D high dynamic range dense visual SLAM and its application to real-time object re-lighting. In Proc. IEEE Int. Symp. Mixed and Augmented Reality (ISMAR), 2013, IEEE, pp. 143–152.
[MDM14] Mallick T., Das P., Majumdar A. K.: Characterizations of noise in Kinect depth images: A review. In IEEE Sensors J. 14, 6 (2014), pp. 1731–1740.
[MHDL17] McCormac J., Handa A., Davison A., Leutenegger S.: SemanticFusion: Dense 3D semantic mapping with convolutional neural networks. In Proc. IEEE Int. Conf. Robot. Autom. (ICRA), 2017, pp. 4628–4635.
[NIH*11] Newcombe R. A., Izadi S., Hilliges O., Molyneaux D., Kim D., Davison A. J., Kohli P., Shotton J., Hodges S., Fitzgibbon A.: KinectFusion: Real-time dense surface mapping and tracking. In International Symposium on Mixed and Augmented Reality (ISMAR), 2011, pp. 127–136.
[NFS15] Newcombe R. A., Fox D., Seitz S. M.: DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[NZIS13] Nießner M., Zollhöfer M., Izadi S., Stamminger M.: Real-time 3D reconstruction at scale using voxel hashing. ACM Transactions on Graphics 32, 6 (2013), No. 169.
[PGV04] Pollefeys M., Gool L. V., Vergauwen M., Verbiest F., Cornelis K., Tops J., Koch R.: Visual modeling with a hand-held camera. In International Journal of Computer Vision 59, 3 (2004), pp. 207–232.
[PS00] Pulli K., Shapiro L. G.: Surface reconstruction and display from range and color data. In Graphical Models 62, 3 (2000), pp. 165–201.
[PZVG00] Pfister H., Zwicker M., Van Baar J., Gross M.: Surfels: Surface elements as rendering primitives. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), 2000, ACM Press/Addison-Wesley, pp. 335–342.
[RA17] Rünz M., Agapito L.: Co-Fusion: Real-time segmentation, tracking, and fusion of multiple objects. In IEEE Intl. Conf. on Robotics and Automation (ICRA), 2017, pp. 4471–4478.
[RA18] Rünz M., Agapito L.: MaskFusion: Real-time recognition, tracking and reconstruction of multiple moving objects. arXiv preprint arXiv:1804.09194, 2018.
[RHL02] Rusinkiewicz S., Hall-Holt O., Levoy M.: Real-time 3D model acquisition. In ACM Transactions on Graphics, 2002.
[RV12] Roth H., Vona M.: Moving volume KinectFusion. In British Machine Vision Conference (BMVC), 2012, pp. 1–11.
[SA02] Stamos I., Allen P.: Geometry and texture recovery of scenes of large scale. In Computer Vision and Image Understanding 88, 2 (2002), pp. 94–118.
[SB12] Stückler J., Behnke S.: Integrating depth and color cues for dense multi-resolution scene mapping using RGB-D cameras. In IEEE Intl. Conf. on Multisensor Fusion and Information Integration (MFI), 2012.
[SB14] Stückler J., Behnke S.: Multi-resolution surfel maps for efficient dense 3D modeling and tracking. Journal of Visual Communication and Image Representation 25, 1 (2014), pp. 137–147.
[SCD*06] Seitz S. M., Curless B., Diebel J., Scharstein D., Szeliski R.: A comparison and evaluation of multi-view stereo reconstruction algorithms. In Proc. IEEE Conf. Comp. Vision and Pat. Rec. (CVPR), vol. 1, 2006, pp. 519–528.
[SGKD14] Salas-Moreno R., Glocker B., Kelly P. H. J., Davison A. J.: Dense planar SLAM. In Proceedings of the International Symposium on Mixed and Augmented Reality (ISMAR), 2014.
[SKC13] Steinbrücker F., Kerl C., Cremers D.: Large-scale multi-resolution surface reconstruction from RGB-D sequences. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2013, pp. 3264–3271.
[SKQ*19] Dong S., Xu K., Zhou Q., Tagliasacchi A., Xin S., Nießner M., Chen B.: Multi-robot collaborative dense scene reconstruction. ACM Transactions on Graphics 38, 4 (2019).
[SNS*13] Salas-Moreno R. F., Newcombe R. A., Strasdat H., Kelly P. H., Davison A. J.: SLAM++: Simultaneous localisation and mapping at the level of objects. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2013, pp. 1352–1359.
[SPGS18] Schönberger J., Pollefeys M., Geiger A., Sattler T.: Semantic visual localization. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018.
[SSC14] Steinbrücker F., Sturm J., Cremers D.: Volumetric 3D mapping in real-time on a CPU. In Proc. IEEE Int. Conf. Robotics and Automation (ICRA), 2014, IEEE, pp. 2021–2028.
[SSP07] Sumner R. W., Schmid J., Pauly M.: Embedded deformation for shape manipulation. In ACM TOG 26, 3 (2007).
[SXN*18] Shi Y., Xu K., Nießner M., Rusinkiewicz S., Funkhouser T.: PlaneMatch: Patch coplanarity prediction for robust RGB-D reconstruction. In Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 750–766.
[SZKS08] Sun X., Zhou K., Stollnitz E., Shi J., Guo B.: Interactive relighting of dynamic refractive objects. In ACM Trans. Graph. 27, 3 (2008), No. 35.
[TM98] Tomasi C., Manduchi R.: Bilateral filtering for gray and color images. In Proc. IEEE Int. Conf. on Computer Vision (ICCV), 1998, pp. 839–846.
[TMHF00] Triggs B., McLauchlan P., Hartley R., Fitzgibbon A.: Bundle adjustment - a modern synthesis. Vision Algorithms: Theory and Practice 1883 (2000), pp. 153–177.
[TTN16] Tateno K., Tombari F., Navab N.: When 2.5D is not enough: Simultaneous reconstruction, segmentation and recognition on dense SLAM. In Proc. IEEE Int. Conf. Robot. Automat. (ICRA), 2016, pp. 2295–2302.
[WKF*12] Whelan T., Kaess M., Fallon M., Johannsson H., Leonard J., McDonald J.: Kintinuous: Spatially extended KinectFusion. In Proc. RSS Workshop on RGB-D: Advanced Reasoning with Depth Cameras, 2012.
[WKJ*15] Whelan T., Kaess M., Johannsson H., Fallon M., Leonard J. J., McDonald J.: Real-time large-scale dense RGB-D SLAM with volumetric fusion. Int. Journal of Robotics Research 34, 4-5 (2015), pp. 598–626.
[WLS*15] Whelan T., Leutenegger S., Salas-Moreno R. F., Glocker B., Davison A. J.: ElasticFusion: Dense SLAM without a pose graph. In Robotics: Science and Systems XI, Sapienza University of Rome, 2015.
[WSG*16] Whelan T., Salas-Moreno R. F., Glocker B., Davison A. J., Leutenegger S.: ElasticFusion: Real-time dense SLAM and light source estimation. In Int. J. Robot. Res. 35, 14 (2016), pp. 1697–1716.
[XSZ*16] Xu K., Shi Y., Zheng L., Zhang J., Liu M., Huang H., Su H., Cohen-Or D., Chen B.: 3D attention-driven depth acquisition for object identification. ACM Transactions on Graphics 35, 6 (2016), No. 238.
[YBJ*18] Peng Y., Deng B., Zhang J., Geng F., Qin W., Liu L.: Anderson acceleration for geometry optimization and physics simulation. ACM Transactions on Graphics 37, 4 (2018).
[YZG*18] Yu T., Zheng Z., Guo K., Zhao J., Dai Q., Li H., Pons-Moll G., Liu Y.: DoubleFusion: Real-time capture of human performance with inner body shape from a depth sensor. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2018.
[ZGHG11] Zhou K., Gong M., Huang X., Guo B.: Data-parallel octrees for surface reconstruction. IEEE Transactions on Visualization and Computer Graphics 17, 5 (2011), pp. 669–681.
[ZMK13] Zhou Q.-Y., Miller S., Koltun V.: Elastic fragments for dense scene reconstruction. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2013, pp. 473–480.
[ZK13] Zhou Q.-Y., Koltun V.: Dense scene reconstruction with points of interest. In ACM Trans. on Graphics 32, 4 (2013), No. 112.
[ZK14] Zhou Q.-Y., Koltun V.: Color map optimization for 3D reconstruction with consumer depth cameras. ACM Trans. on Graphics 33, 4 (2014), No. 155.
[ZK15] Zhou Q.-Y., Koltun V.: Depth camera tracking with contour cues. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2015, pp. 632–638.
[ZXTZ15] Zhang Y., Xu W., Tong Y., Zhou K.: Online structure analysis for real-time indoor scene reconstruction. In ACM Trans. on Graphics 34, 5 (2015), pp. 159–171.
[ZZZL13] Zeng M., Zhao F., Zheng J., Liu X.: Octree-based fusion for realtime 3D reconstruction. Graphical Models 75, 3 (2013), pp. 126–136.
[ZXY*17] Zheng L., Xu K., Yan Z., Yan G., Zhang E., Nießner M., Deussen O., Cohen-Or D., Huang H.: Autonomous reconstruction of unknown indoor scenes guided by time-varying tensor fields. ACM Transactions on Graphics 36, 6 (2017).