Kinect-Based Easy 3D Object Reconstruction

Di Xu1, Jianfei Cai2, Tat-Jen Cham2, Philip Fu2, and Juyong Zhang1

1 BeingThere Centre, Institute for Media Innovation, Nanyang Technological University, Singapore
2 School of Computer Engineering, Nanyang Technological University, Singapore

Abstract. Inspired by the recently developed KinectFusion technique, which is able to reconstruct a 3D scene in real time with a moving Kinect, we consider improving KinectFusion for the 3D reconstruction of a real object. We make some adaptations to KinectFusion so as to identify the object-of-interest and separate the 3D object model from the entire 3D scene. Moreover, considering that the 3D object model generated by KinectFusion often contains clearly visible outliers due to the noisy Kinect data, we propose a refinement scheme to remove these outliers. Our basic idea is to make use of an existing powerful 2D segmentation tool to refine the silhouette in each color image and then form a visual hull from the refined dense silhouettes to improve the 3D object model. Experimental results show improved performance.

Keywords: 3D reconstruction, Kinect, 2D segmentation, visual hull.

1 Introduction

It is of great practical value to enable easy creation of 3D models of real objects. Many technologies have been developed towards this goal. Among them, multi-view stereo (MVS) [1] is the most popular one, which builds 3D models of real objects from multi-view images. Despite great advances, most MVS systems are still at the prototype level: they are limited to lab environments, are not user-friendly, often require a few hours of computation, and make some impractical assumptions such as assuming that the silhouettes of the object are known.

In 2010, Microsoft launched the Kinect sensor for game applications. Kinect is equipped with an infrared camera and an RGB camera. The infrared camera can generate depth information easily by capturing the continuously projected infrared structured light. With the assistance of this additional depth information, many challenging problems can now be simplified and tackled in an efficient manner. Kinect has been used in 3D reconstruction recently [2–6]. In particular, in [2], multiple fixed Kinects are used for fully dynamic real-time 3D scene capture for room-based telepresence systems. The depth data from each Kinect are first denoised and then merged together, weighted according to the angle and the distance to the camera.

Although the 3D model of the entire scene can be generated in real time, the reconstruction is not of good quality. In [4], Cui and Stricker proposed a 3D object scanning scheme, where a Kinect is slowly moved around an object to capture different views. A super-resolution technique is applied to improve the quality of the raw data from Kinect. The method can achieve high-quality object reconstruction, at the cost of high computational complexity and long processing time. The recently developed KinectFusion [5] is a system for accurate real-time mapping of indoor scenes, using only a moving low-cost depth camera and commodity graphics hardware. The robustness of this system lies in that it fuses all of the depth data streamed from a Kinect sensor into a single global implicit surface model of the observed scene in real time. Similar to other techniques, it first de-noises the input raw data with a bilateral filter and a multi-resolution method. Then the truncated signed distance function (TSDF) is used as the data structure for later processing. The global fusion of all depth maps is formed by the weighted average of all individual TSDFs. The resulting 3D model from KinectFusion is of reasonable quality.

In this paper, we apply KinectFusion for the easy 3D reconstruction of real objects. First, considering that KinectFusion is designed for scene reconstruction, we make some adaptations to KinectFusion so as to identify the object-of-interest and separate the 3D object model from the entire 3D scene. Second, due to the noisy Kinect data, the 3D object model generated by KinectFusion often contains clearly visible outliers. We propose a refinement scheme to remove these outliers. Our basic idea is to make use of an existing powerful 2D segmentation tool to refine the silhouette in each color image and then form a visual hull from the refined dense silhouettes to improve the 3D object model. Experimental results show improved performance.

The rest of the paper is organized as follows. We describe the proposed system in Section 2. The experimental results are shown in Section 3. Finally, we conclude the paper in Section 4.

2 Proposed System

Fig. 1 shows the proposed easy 3D object reconstruction system. The primary inputs to the system are the color and depth videos captured by Kinect, and the output of the system is the reconstructed 3D model. In the first stage, considering the real-time 3D reconstruction capability of KinectFusion, we choose it to generate an initial 3D object model. Since Kinect data is very noisy and KinectFusion only makes use of the depth information, the reconstructed 3D object model is of limited quality and often contains clearly visible errors. Thus, in the second stage, we propose to obtain dense and accurate silhouettes in the color images via a powerful 2D segmentation technique, and use them to remove the outliers in the initial 3D object model generated by KinectFusion. The second stage consists of three iterative steps: 3D-to-2D projection, silhouette refinement by 2D cut, and 3D model refinement by visual hull.

Fig. 1. The system diagram of the proposed easy 3D object reconstruction

The iteration ensures that the 2D segmentations performed in the individual images are consistent and converge with the visual hull projections. In the following, we elaborate the two main stages in detail.

2.1 3D Object Reconstruction Using KinectFusion

KinectFusion in its original form cannot be directly used for 3D object reconstruction, since it is designed for reconstructing the entire scene. One common solution is to assume that the object is always the closest one to the viewer and to use some thresholding to separate its 3D reconstruction from the background 3D reconstruction. However, in this way, the object is hard to separate from its supporting entity, since the object has to be placed on top of an entity such as the ground or a table. The KinectFusion paper [5] suggests another solution, i.e., obtaining the 3D object model by subtracting the 3D reconstructions with and without the object, but no implementation detail is provided. In this research, we follow the idea in [5] to generate an initial 3D object model. In particular, the object-of-interest is first placed in the scene and the user, holding a Kinect, scans the scene to obtain the entire 3D scene reconstruction using KinectFusion. Later, after some repeated scene scanning, the user removes the object from the scene, and the final KinectFusion reconstruction is the 3D scene without the object. By subtracting the final 3D reconstruction from the initial one containing the 3D object, we obtain the 3D object model we want. Note that the scanning is a non-stop process till the end, so as to ensure the same global coordinate system between the two reconstructions and avoid the alignment problem. Fig. 2 shows an example of the initial 3D object model generation.
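As a minimal illustration of the subtraction idea, the sketch below assumes that the two reconstructions are exported as TSDF voxel grids sampled on the same global grid; the array names and the surface test are illustrative assumptions, not the exact KinectFusion data layout.

```python
import numpy as np

def extract_object_voxels(tsdf_with, tsdf_without, eps=0.01):
    # tsdf_with / tsdf_without: float arrays of identical shape holding the
    # truncated signed distances of the scene with and without the object.
    # Voxels near the zero crossing (|tsdf| < eps) lie on a surface.
    surface_with = np.abs(tsdf_with) < eps        # surface of the full scene
    surface_without = np.abs(tsdf_without) < eps  # surface of the empty scene
    # Keep only voxels that belong to a surface when the object is present
    # but not when it is absent, i.e. the object itself.
    return surface_with & ~surface_without
```

Extracting a mesh from the remaining voxels (e.g. via marching cubes) then gives the initial object model.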

2.2 3D Model Refinement via 2D Segmentation

The initial 3D object model obtained by KinectFusion often contains some outliers due to the noisy depth data. Thus, in this second stage, we make use of a powerful 2D segmentation tool to generate dense silhouettes that help remove the outliers of the 3D model.


Fig. 2. An illustration of generating the initial 3D object model using KinectFusion. We produce the mesh of the whole scene in (a). After the object is removed, another 3D model is generated as in (b). By subtracting the model in (b) from that in (a), we can get the initial model of the robot.

Since the 2D object segmentation requires some initial contour, we first perform 3D-to-2D projection. The initial 3D mesh is projected to each of the 2D images using the corresponding projection matrices generated by KinectFusion, which results in a binary mask in each image. As expected, because of the inaccurate initial 3D model as well as the inaccurate projection matrices, the generated initial 2D contours typically suffer a segmentation error of up to 20 pixels for an image of size 640×480. This can be observed in Fig. 3(c), where the boundary of the binary mask is not snapped to the silhouette of the object. Therefore, we next apply our recently developed robust convex active contour tool [7] for silhouette refinement. The tool is able to evolve the initial contour to snap to the geometric features/edges in an image. Besides, it is fast, since it can be solved by convex optimization.
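As an illustration of the 3D-to-2D projection step, the following sketch rasterizes the projected mesh into a binary mask. It assumes the mesh is given as a vertex array V (N x 3) and a triangle index array F (M x 3), and that P is the 3 x 4 projection matrix of the frame; visibility and clipping are ignored, so this is only a sketch of the idea rather than our exact implementation.

```python
import numpy as np
import cv2

def project_to_mask(V, F, P, width=640, height=480):
    # Homogeneous projection of all mesh vertices: x = P [X Y Z 1]^T
    Vh = np.hstack([V, np.ones((V.shape[0], 1))])   # N x 4
    x = (P @ Vh.T).T                                 # N x 3
    uv = x[:, :2] / x[:, 2:3]                        # perspective divide

    # The silhouette is the union of all projected triangles
    mask = np.zeros((height, width), np.uint8)
    for tri in F:
        pts = np.round(uv[tri]).astype(np.int32)
        cv2.fillPoly(mask, [pts], 255)
    return mask
```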

The convex active contour model can be expressed as

\min_{0 \le u \le 1} \left( \int_{\Omega} g_b\,|\nabla u|\,dx + \lambda_1 \int_{\Omega} h_r\,u\,dx \right), \qquad (1)

where u is a function on the image domain \Omega and takes a value between 0 and 1 at each pixel location x in the image, g_b is typically an edge detection function, and h_r is a region function that measures the coherence of the inside and outside regions. (1) consists of two terms, where the first term is a boundary term and the second term is a region term. The boundary term favors a segmentation along the places where the edge detection function reaches its minimum, i.e. detected edges, and also favors a segmentation with a smooth boundary curve. The region term ensures that the segmentation complies with the region coherence criterion defined by h_r.

In particular, for the initial binary mask obtained by 3D projection, it is reasonable to assume that the areas far away from the initial boundary are likely to be correctly classified. The area within a threshold of D pixels from the initial boundary is treated as the unknown region. Each pixel is then given a probability, where the foreground and the background pixels are set to 1 and 0, respectively, and a pixel in the unknown region is given a probability value that is proportional to its distance from the initial contour. In this way, we obtain an initial probability map P(x), and we let u(x) = P(x) in (1) for initialization. In addition, based on the foreground and background pixels, we also build local foreground and background Gaussian Mixture Model (GMM) color models for each image.
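One simple way to build such an initial probability map from the projected binary mask is via a distance transform, as sketched below; this is only an illustrative sketch, not necessarily our exact implementation, with D denoting the width of the unknown band in pixels.

```python
import numpy as np
import cv2

def initial_probability_map(mask, D=20):
    # mask: uint8 binary mask (non-zero = projected foreground)
    fg = (mask > 0).astype(np.uint8)
    # Distance of every pixel to the mask boundary, measured separately
    # inside and outside the mask
    dist_in = cv2.distanceTransform(fg, cv2.DIST_L2, 3)
    dist_out = cv2.distanceTransform(1 - fg, cv2.DIST_L2, 3)
    signed = dist_in - dist_out                      # > 0 inside, < 0 outside
    # 1 deep inside, 0 deep outside, linear ramp within the +-D band
    return np.clip(0.5 + signed / (2.0 * D), 0.0, 1.0)
```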

With the local GMM models, the region term h_r is then defined as

h_r(x) = \alpha \left( P_B(x) - P_F(x) \right) + (1 - \alpha)\left( 1 - 2P(x) \right), \qquad (2)

where P_F(x) and P_B(x) are the normalized foreground and background likelihoods, respectively, and \alpha \in [0, 1] is a tradeoff factor. The first term, P_B(x) - P_F(x), in (2) ensures that the active contour evolves towards the one complying with the local GMM color models. The second term, 1 - 2P(x), in (2) prevents the refined contour from drifting too far away from the initial segmentation.

Once the optimization of the convex active contour model (1) is solved, we obtain the solution u(x), which represents the probability of pixel x belonging to the foreground. By thresholding u(x) against a threshold (typically 0.5), we obtain a refined 2D segmentation. After that, a visual hull containing the object is generated from the dense refined silhouettes using the shape-from-silhouette method in [8]. The visual hull acts as a hard constraint on the initial 3D model. Any part outside the visual hull is deemed background and thus cut away from the initial 3D model.

The last three steps of 3D-to-2D projection, 2D segmentation and visual hull construction are tightly coupled and together form an iterative process. In each iteration, the refined silhouettes provide a better visual hull to cut the 3D model, and in return the refined 3D model produces better initial silhouettes for the next-round 2D segmentation.
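The carving step can be sketched as follows: a point of the initial model is kept only if it projects inside the refined silhouette in every view. This is an illustrative sketch; a practical shape-from-silhouette implementation such as [8] works on a voxel grid, and here points projecting outside an image are simply treated as outside the hull.

```python
import numpy as np

def inside_visual_hull(points, silhouettes, projections):
    # points: K x 3 vertices (or voxel centres) of the initial 3D model
    # silhouettes: list of H x W binary masks refined by the 2D segmentation
    # projections: list of 3 x 4 projection matrices, one per view
    keep = np.ones(len(points), dtype=bool)
    Ph = np.hstack([points, np.ones((len(points), 1))])   # K x 4
    for sil, P in zip(silhouettes, projections):
        x = (P @ Ph.T).T
        u = np.round(x[:, 0] / x[:, 2]).astype(int)
        v = np.round(x[:, 1] / x[:, 2]).astype(int)
        h, w = sil.shape
        visible = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        inside = np.zeros(len(points), dtype=bool)
        inside[visible] = sil[v[visible], u[visible]] > 0
        # A point outside the silhouette in any view is outside the hull
        keep &= inside
    return keep
```

The initial 3D model is then cropped by discarding all points (and the faces attached to them) for which this test fails.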

3 Experimental Results

We test our proposed system on two objects, i.e. a chair and a robot. In our experiments, one video clip usually contains hundreds of frames, so we are able to generate dense silhouettes from those frames. Theoretically, the more silhouettes we use, the finer the visual hull we can obtain. To avoid heavy computation, we uniformly sample the video frames, and each final data set contains about 100 views of 640×480 color and depth images.

Fig. 3 shows the results of the chair example. It can be seen that the initial silhouette projected from the initial 3D model does not align with the chair boundary in the image. With our 2D segmentation technique, the chair's silhouette can be accurately refined. Moreover, the refined 3D object model is also of high quality. Fig. 4 shows the 3D model of the final reconstructed chair from one view.



Fig. 3. The results of the chair example. (a) Original color image; (b) Original depth image; (c) Projected image from KinectFusion; (d) Refined image via 2D segmentation.

(a) Point cloud (b) Mesh

Fig. 4. An illustration of the 3D reconstruction result of the chair.

The proposed system still has some limitations. Fig. 5 gives an example of reconstructing the robot. We can see that although our 2D segmentation tool improves the silhouette, there are still some clearly visible errors. This is mainly because the color of the object is similar to parts of the background.



Fig. 5. The results of the robot example. (a) Original color image; (b) Original depth image; (c) Projected image from KinectFusion; (d) Refined image via 2D segmentation.

4 Conclusion

In this paper, we have proposed a system to easily reconstruct the 3D model of a real object. The system is a coherent integration of several modern techniques, including KinectFusion, the convex active contour, and the visual hull. The experimental results have demonstrated that the proposed system is a practical and effective tool that can refine the 3D object model generated by KinectFusion. Future work includes further improving the reconstruction quality and designing better schemes to fuse the depth and color information.

Acknowledgement. This research, which is carried out at BeingThere Centre, is supported by the Singapore National Research Foundation under its International Research Centre @ Singapore Funding Initiative and administered by the IDM Programme Office.

References

1. Seitz, S., Curless, B., Diebel, J., Scharstein, D., Szeliski, R.: A comparison and evaluation of multi-view stereo reconstruction algorithms. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 519–528. IEEE (2006)
2. Maimone, A., Fuchs, H.: Encumbrance-free telepresence system with real-time 3D capture and display using commodity depth cameras. In: 2011 10th IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pp. 137–146. IEEE (2011)
3. Kuster, C., et al.: FreeCam: A hybrid camera system for interactive free-viewpoint video. In: Proceedings of Vision, Modeling, and Visualization (VMV) (2011)
4. Cui, Y., Stricker, D.: 3D shape scanning with a Kinect. In: ACM SIGGRAPH 2011 Posters, p. 57. ACM (2011)
5. Izadi, S., Kim, D., Hilliges, O., Molyneaux, D., Newcombe, R., Kohli, P., Shotton, J., Hodges, S., Freeman, D., Davison, A., et al.: KinectFusion: real-time 3D reconstruction and interaction using a moving depth camera. In: Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, pp. 559–568. ACM (2011)
6. Steinbrucker, F., Sturm, J., Cremers, D.: Real-time visual odometry from dense RGB-D images. In: 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pp. 719–722. IEEE (2011)
7. Bresson, X., Esedoglu, S., Vandergheynst, P., Thiran, J., Osher, S.: Fast global minimization of the active contour/snake model. Journal of Mathematical Imaging and Vision 28, 151–167 (2007)
8. Laurentini, A.: The visual hull concept for silhouette-based image understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence 16, 150–162 (1994)