A Survey of 3D Modeling Using Depth Cameras
Hantong XU, Jiamin XU, Weiwei XU
State Key Lab of CAD&CG, Zhejiang University

Abstract: 3D modeling is an important topic in computer graphics and computer vision. In recent years, the introduction of consumer-grade depth cameras has brought profound advances in 3D modeling. Starting from the basic data structures, this survey reviews the latest developments in 3D modeling based on depth cameras, including research on camera tracking, 3D object and scene reconstruction, and high-quality texture reconstruction. We also discuss future work and possible solutions for 3D modeling based on depth cameras.

Keywords: 3D Modeling, Depth Camera, Camera Tracking, Signed Distance Function, Surfel

1. Introduction

3D modeling is an important research area in computer graphics and computer vision. Its goal is to capture the 3D shape and appearance of objects and scenes in high quality, so as to simulate 3D interaction and perception in the digital space. 3D modeling methods can be roughly categorized into three types:

(1) Modeling based on professional software, such as 3DS Max, Maya, etc. This requires the user to be proficient with the software; the learning curve is usually long, and the creation of 3D content is time-consuming.

(2) Modeling based on 2D RGB images, such as multi-view stereo (MVS) [SCD*06] and structure from motion (SFM) [A03, GD07]. Since image sensors are low-cost and easy to deploy, 3D modeling from multi-view 2D images can significantly automate and simplify the modeling process, but it relies heavily on image quality and is easily affected by lighting and camera resolution.

(3) Modeling based on professional, active 3D sensors [RHL02], including 3D scanners and depth cameras. These sensors actively project structured light or laser stripes onto the object surface and then reconstruct the spatial shape of objects and scenes, i.e., they sample a large number of 3D points from the surface according to the spatial information decoded from the structured light. The advantage of active 3D sensors is that they are robust to interference in the environment and can scan 3D models automatically with high precision. However, professional 3D scanners are expensive, which limits their use by ordinary users.

In recent years, the development of depth (or RGB-D) cameras has led to rapid progress in 3D modeling. Depth cameras are not only low-cost and compact, but also capable of capturing pixel-level color and depth information at sufficient resolution and frame rate. Because of these characteristics, depth cameras have a huge advantage in consumer-oriented applications compared to more expensive scanning devices. The KinectFusion algorithm, proposed by Imperial College London and Microsoft Research in 2011, opened up the opportunity for 3D modeling with depth cameras [IKH11, NIH*11]. It can generate 3D models with high-resolution detail in real time using the color and depth images captured by a depth camera, which attracted the attention of researchers in the field of 3D reconstruction. Researchers in computer graphics and computer vision have since continuously improved the algorithms and systems for 3D modeling based on depth cameras and achieved rich research results.

This survey reviews and compares the latest methods of 3D modeling based on depth cameras. We first present the two basic data structures widely used in 3D modeling with depth cameras in Section 2.
In Section 3, we focus on the progress of geometric reconstruction based on depth cameras. Then, in Section 4, the reconstruction of high-quality texture is reviewed. Finally, we conclude and discuss future work in Section 5.

1.1 Preliminaries of Depth Cameras

Each frame of data obtained by a depth camera contains the distance from each point in the real scene to the vertical plane in which the depth camera lies. This distance is called the depth value, and the depth values at the sampled pixels constitute the depth image of a frame. Usually, the color image and the depth image are registered so that, as shown in Figure 1, there is a one-to-one correspondence between their pixels.

Figure 1: RGB image and depth value. M: a 3D point; XM: the projection of M on the image plane. (The figure also labels the image plane, camera center, image center, focal length, and depth value.)

Currently, most depth cameras are based either on structured light, such as Microsoft's Kinect 1.0 and Occipital's Structure Sensor, or on ToF (time-of-flight) technology, such as Microsoft's Kinect 2.0. A structured-light depth camera first projects infrared structured light onto the surface of an object and then receives the light pattern reflected by the surface. Since the received pattern is modulated by the 3D shape of the object, the spatial information of the 3D surface can be calculated from the position and degree of modulation of the pattern in the image. Structured-light cameras are suitable for scenes with insufficient illumination, and they can achieve high precision within a certain range and produce high-resolution depth images. The disadvantage of structured light is that it is easily affected by strong outdoor sunlight, which can overwhelm the projected coded light and render it unusable. It is also susceptible to interference from mirror-like reflections.

A ToF camera continuously emits light pulses (generally invisible light) onto the observed object and then receives the pulses reflected back from the object. The distance from the measured object to the camera is calculated by measuring the flight (round-trip) time of the light pulses. Overall, the measurement error of a ToF depth camera does not increase with the measurement distance, and its anti-interference ability is strong, which makes it suitable for relatively long measurement distances (such as autonomous driving). The disadvantages are that the depth images it produces have relatively low resolution, and its measurement error at close range is larger than that of other depth cameras.

1.2 The Pipeline of 3D Modeling Based on Depth Cameras

Since the depth data obtained by a depth camera inevitably contains noise, we first need to preprocess the depth map, that is, reduce noise and remove outliers. Next, the camera poses are estimated in order to register the captured depth images into a consistent coordinate frame. The classic approach is to find corresponding points between the depth maps (i.e., data association) and then estimate the camera's motion by registering the depth images.

Figure 2: The pipeline of volumetric-representation-based reconstruction [NIH*11]: measurement (compute 3D vertices and normals at the pixels of the RGB-D frame), pose estimation (ICP to register the measured surface to the global model), reconstruction update (integrate the surface measurement into the global TSDF), and surface prediction (ray-cast the TSDF to compute a surface prediction).
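To make the geometry of Figure 1 concrete, the following minimal sketch back-projects a depth image into a per-pixel map of camera-space 3D points (the "compute 3D vertices at pixels" step of the pipeline) using a standard pinhole model. The function name and the intrinsic parameters fx, fy, cx, cy are illustrative assumptions, not values tied to any specific camera.

import numpy as np

def backproject_depth(depth, fx, fy, cx, cy):
    """Back-project a depth image (in meters) to an HxWx3 map of camera-space 3D points.

    fx, fy: focal lengths in pixels; cx, cy: image center (principal point).
    Pixels with depth <= 0 are treated as invalid and map to NaN.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth.astype(np.float64)
    z[z <= 0] = np.nan                               # mark missing measurements
    x = (u - cx) * z / fx                            # X = (u - cx) * Z / fx
    y = (v - cy) * z / fy                            # Y = (v - cy) * Z / fy
    return np.stack([x, y, z], axis=-1)

# Example call with hypothetical Kinect-class intrinsics:
# points = backproject_depth(depth_image, fx=525.0, fy=525.0, cx=319.5, cy=239.5)

Per-pixel normals, also required by the pipeline, can then be estimated from the cross product of differences between neighboring points in this map.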
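For the pose estimation stage, the sketch below shows one linearized Gauss-Newton step of point-to-plane ICP, assuming point correspondences between the measured surface and the model have already been established (e.g., by projective data association). The function name and the small-angle approximation are illustrative choices for this example, not the exact formulation used by any particular system.

def point_to_plane_icp_step(src, dst, dst_normals):
    """One linearized point-to-plane ICP step for already-associated Nx3 point pairs.

    Minimizes sum(((R*p + t - q) . n)^2) with R approximated as I + [omega]x,
    and returns the incremental 4x4 rigid transform.
    """
    a = np.hstack([np.cross(src, dst_normals), dst_normals])   # N x 6 Jacobian rows
    b = np.sum(dst_normals * (dst - src), axis=1)              # N residuals
    x, *_ = np.linalg.lstsq(a, b, rcond=None)                  # solve for [omega; t]
    wx, wy, wz, tx, ty, tz = x
    # Small-angle rotation plus translation assembled into a homogeneous matrix.
    return np.array([[1.0, -wz,  wy, tx],
                     [ wz, 1.0, -wx, ty],
                     [-wy,  wx, 1.0, tz],
                     [0.0, 0.0, 0.0, 1.0]])

In practice this step is iterated, re-associating points and accumulating the incremental transforms until convergence.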
Finally, the current depth image is merged into the global model based on the estimated camera pose. By repeating these steps, we obtain the geometry of the scene. However, in order to completely reconstruct the appearance of the scene, capturing the geometry alone is obviously not enough; we also need to give the model color and texture.

According to the data structure used to represent the 3D modeling result, we can roughly categorize 3D modeling into volumetric-representation-based modeling and surfel-based modeling. The steps of 3D reconstruction vary with the data structure. This survey clarifies how each method solves the classic problems in 3D reconstruction by reviewing and comparing the different strategies adopted by the various modeling methods.

2. Data Structures for 3D Surfaces

2.1 Volumetric Representation

The volumetric representation was first proposed in 1996 by Curless and Levoy [CL96]. They represent a 3D surface using a signed distance function (SDF) sampled at voxels, where the SDF value of a 3D point is its signed distance to the surface. Thus, according to this definition, the surface of an object or a scene is the iso-surface where the SDF equals 0. Free space, i.e., the space outside the object, has positive SDF values, since the distance increases as a point moves along the surface normal into free space. In contrast, the occupied space (the space inside the object) has negative SDF values. These SDF values are stored in a cube surrounding the object, and the cube is rasterized into voxels. The reconstruction algorithm must define the voxel size and the spatial extent of the cube. Other data, such as colors, are usually stored as voxel attributes. Since only voxels close to the actual surface matter, researchers often use a truncated signed distance function (TSDF) to represent the model.

Figure 2 shows the general pipeline of reconstruction based on the volumetric representation. Since the TSDF implicitly represents the surface of the object, the surface must be extracted by ray casting before estimating the camera pose. The fusion step is a simple weighted average operation at each voxel, which is very effective for suppressing the noise of the depth camera.

Uniform voxel grids are very inefficient in terms of memory consumption and are limited to a predefined volume and resolution. In the context of real-time scene reconstruction, most methods rely heavily on the processing power of modern GPUs, and the spatial extent and resolution of the voxel grid are often limited by GPU memory. In order to support larger spatial extents, researchers have proposed a variety of methods to improve the memory efficiency of volumetric-representation-based algorithms.
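As a rough illustration of the TSDF fusion described above, the sketch below integrates one depth frame into a uniform voxel grid using the per-voxel running weighted average employed by KinectFusion-style systems. The function name, truncation distance, and array layout are assumptions made for this example, not details of any particular implementation.

def integrate_depth(tsdf, weight, voxel_centers, depth, K, cam_pose, trunc=0.05):
    """Fuse one depth frame into a TSDF volume by per-voxel weighted averaging.

    tsdf, weight:  flat arrays with one entry per voxel (initialized to 1.0 and 0.0).
    voxel_centers: Nx3 world-space coordinates of the voxel centers.
    depth:         HxW depth image in meters; K: 3x3 intrinsics; cam_pose: 4x4 world-to-camera.
    """
    h, w = depth.shape
    # Transform voxel centers into camera space and project them into the image.
    pts_cam = (cam_pose[:3, :3] @ voxel_centers.T + cam_pose[:3, 3:4]).T
    z = pts_cam[:, 2]
    z_safe = np.where(z > 1e-6, z, 1.0)              # avoid division by zero; filtered below
    u = np.round(K[0, 0] * pts_cam[:, 0] / z_safe + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * pts_cam[:, 1] / z_safe + K[1, 2]).astype(int)
    valid = (z > 1e-6) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    d = np.zeros_like(z)
    d[valid] = depth[v[valid], u[valid]]             # observed depth along each voxel's ray
    valid &= d > 0
    # Truncated signed distance along the viewing ray, normalized to [-1, 1].
    sdf = np.clip(d - z, -trunc, trunc) / trunc
    # Only update voxels in front of, or within one truncation band behind, the surface.
    update = valid & (d - z >= -trunc)
    # Running weighted average: tsdf = (W*tsdf + w_new*sdf) / (W + w_new), with w_new = 1.
    w_new = 1.0
    tsdf[update] = (weight[update] * tsdf[update] + w_new * sdf[update]) / (weight[update] + w_new)
    weight[update] += w_new
    return tsdf, weight

In real-time systems this update runs in parallel over all voxels on the GPU, and the per-voxel weight is typically capped so that the model can still adapt to changes in the scene.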