Kinect-Based Easy 3D Object Reconstruction

Di Xu1, Jianfei Cai2, Tat-Jen Cham2, Philip Fu2, and Juyong Zhang1

1 BeingThere Centre, Institute for Media Innovation, Nanyang Technological University, Singapore
2 School of Computer Engineering, Nanyang Technological University, Singapore

Abstract. Inspired by the recently developed KinectFusion technique, which is able to reconstruct a 3D scene in real time with a moving Kinect, we consider improving KinectFusion for the 3D reconstruction of a real object. We make some adaptations to KinectFusion so as to identify the object-of-interest and separate the 3D object model from the entire 3D scene. Moreover, considering that the 3D object model generated by KinectFusion often contains clearly visible outliers due to the noisy Kinect data, we propose a refinement scheme to remove these outliers. Our basic idea is to make use of an existing powerful 2D segmentation tool to refine the silhouette in each color image and then form a visual hull from the refined dense silhouettes to improve the 3D object model. Experimental results show improved performance.

Keywords: 3D reconstruction, Kinect, 2D segmentation, visual hull.

1 Introduction

It is of great practical value to enable easy creation of 3D models of real objects. Many technologies have been developed towards this goal. Among them, multi-view stereo (MVS) [1] is the most popular one, which builds 3D models of real objects from multi-view images. Despite great advances, most MVS systems are still at the prototype level: they are limited to lab environments, are not user-friendly, often require a few hours of computation, and make some impractical assumptions such as assuming that the silhouettes of the object are known.

In 2010, Microsoft launched the Kinect sensor for game applications. Kinect is equipped with an infrared camera and an RGB camera. The infrared camera can generate depth information easily by capturing the continuously projected infrared structured light. With the assistance of this additional depth information, many challenging problems can now be simplified and tackled in an efficient manner. Kinect has been used in 3D reconstruction recently [2–6]. In particular, in [2], multiple fixed Kinects are used for fully dynamic real-time 3D scene capture for room-based telepresence systems. The depth data from each Kinect are first denoised and then merged together, weighted according to the angle and the distance to the camera.

Although the 3D model of the entire scene can be generated in real time, the reconstruction is not of good quality. In [4], Cui and Stricker proposed a 3D object scanning scheme, where a Kinect is slowly moved around an object to capture different views. A super-resolution technique is applied to improve the quality of the raw data from Kinect. The method can achieve high-quality object reconstruction, at the cost of high computational complexity and long processing time. The recently developed KinectFusion [5] is a system for accurate real-time mapping of indoor scenes, using only a moving low-cost depth camera and commodity graphics hardware. The robustness of this system lies in that it fuses all of the depth data streamed from a Kinect sensor into a single global implicit surface model of the observed scene in real time. Similar to other techniques, it first de-noises the input raw data with a bilateral filter and a multi-resolution method. Then the truncated signed distance function (TSDF) is used as the data structure for later processing. The global fusion of all depth maps is formed by the weighted average of all individual TSDFs. The resulting 3D model from KinectFusion is of reasonable quality.

In this paper, we apply KinectFusion for the easy 3D reconstruction of real objects. First, considering that KinectFusion is designed for scene reconstruction, we make some adaptations to KinectFusion so as to identify the object-of-interest and separate the 3D object model from the entire 3D scene. Second, due to the noisy Kinect data, the 3D object model generated by KinectFusion often contains clearly visible outliers. We propose a refinement scheme to remove these outliers. Our basic idea is to make use of an existing powerful 2D segmentation tool to refine the silhouette in each color image and then form a visual hull from the refined dense silhouettes to improve the 3D object model. Experimental results show improved performance.

The rest of the paper is organized as follows. We describe the proposed system in Section 2. The experimental results are shown in Section 3. Finally, we conclude the paper in Section 4.

2 Proposed System

Fig. 1 shows the proposed easy 3D object reconstruction system. The primary inputs to the system are the color and depth videos captured by Kinect, and the output of the system is the reconstructed 3D model. In the first stage, considering the real-time 3D reconstruction capability of KinectFusion, we choose it to generate an initial 3D object model. Since Kinect data is very noisy and KinectFusion only makes use of the depth information, the reconstructed 3D object model is of limited quality and often contains clearly visible errors. Thus, in the second stage, we propose to obtain dense and accurate silhouettes in the color images via a powerful 2D segmentation technique, and use them to remove the outliers in the initial 3D object model generated by KinectFusion. The second stage consists of three iterative steps: 3D-to-2D projection, silhouette refinement by 2D cut, and 3D model refinement by visual hull.

Fig. 1. The system diagram of the proposed easy 3D object reconstruction

The iteration ensures that the 2D segmentations performed in the individual images are consistent and converge with the visual hull projections. In the following, we elaborate the two main stages in detail.

2.1 3D Object Reconstruction Using KinectFusion

KinectFusion in its original form cannot be directly used for 3D object reconstruction, since it is designed for reconstructing the entire scene. One common solution is to assume that the object is always the closest one to the viewer and to use some thresholding to separate its 3D reconstruction from the background 3D reconstruction. However, in this way, the object is hard to separate from its supporting entity, since the object has to be placed on top of an entity such as the ground or a table. The KinectFusion paper [5] suggests another solution, i.e., obtaining the 3D object model by subtracting the 3D reconstructions with and without the object, but no implementation detail is provided. In this research, we follow the idea in [5] to generate an initial 3D object model. In particular, the object-of-interest is first placed in the scene and the user, holding a Kinect, scans the scene to obtain the entire 3D scene reconstruction using KinectFusion. Later, after some repeated scene scanning, the user removes the object from the scene, and the final KinectFusion reconstruction is the 3D scene without the object. By subtracting the final 3D reconstruction from the initial one containing the 3D object, we obtain the 3D object model we want. Note that the scanning is a non-stop process till the end, so as to ensure the same global coordinate system between the two reconstructions and avoid the alignment problem. Fig. 2 shows an example of the initial 3D object model generation.
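As a minimal illustration of the subtraction idea, the sketch below assumes that the two reconstructions are exported as TSDF voxel grids sampled on the same global grid; the array names and the surface test are illustrative assumptions, not the exact KinectFusion data layout.

```python
import numpy as np

def extract_object_voxels(tsdf_with, tsdf_without, eps=0.01):
    # tsdf_with / tsdf_without: float arrays of identical shape holding the
    # truncated signed distances of the scene with and without the object.
    # Voxels near the zero crossing (|tsdf| < eps) lie on a surface.
    surface_with = np.abs(tsdf_with) < eps        # surface of the full scene
    surface_without = np.abs(tsdf_without) < eps  # surface of the empty scene
    # Keep only voxels that belong to a surface when the object is present
    # but not when it is absent, i.e. the object itself.
    return surface_with & ~surface_without
```

Extracting a mesh from the remaining voxels (e.g. via marching cubes) then gives the initial object model.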

2.2 3D Model Refinement via 2D Segmentation

The initial 3D object model obtained by KinectFusion often contains some outliers due to the noisy depth data. Thus, in this second stage, we make use of a powerful 2D segmentation tool to generate dense silhouettes that help remove the outliers of the 3D model.


Fig. 2. An illustration of generating the initial 3D object model using KinectFusion. We produce the mesh of the whole scene in (a). After the object is removed, another 3D model is generated as in (b). By subtracting the model in (b) from that in (a), we can get the initial model of the robot.

Since the 2D object segmentation requires some initial contour, we first perform 3D-to-2D projection. The initial 3D mesh is projected to each of the 2D images using the corresponding projection matrices generated by KinectFusion, which results in a binary mask in each image. As expected, because of the inaccurate initial 3D model as well as the inaccurate projection matrices, the generated initial 2D contours typically suffer a segmentation error of up to 20 pixels for an image of size 640×480. This can be observed in Fig. 3(c), where the boundary of the binary mask is not snapped to the silhouette of the object. Therefore, we next apply our recently developed robust convex active contour tool [7] for silhouette refinement. The tool is able to evolve the initial contour to snap to the geometric features/edges in an image. Besides, it is fast, since it can be solved by convex optimization.
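As an illustration of the 3D-to-2D projection step, the following sketch rasterizes the projected mesh into a binary mask. It assumes the mesh is given as a vertex array V (N x 3) and a triangle index array F (M x 3), and that P is the 3 x 4 projection matrix of the frame; visibility and clipping are ignored, so this is only a sketch of the idea rather than our exact implementation.

```python
import numpy as np
import cv2

def project_to_mask(V, F, P, width=640, height=480):
    # Homogeneous projection of all mesh vertices: x = P [X Y Z 1]^T
    Vh = np.hstack([V, np.ones((V.shape[0], 1))])   # N x 4
    x = (P @ Vh.T).T                                 # N x 3
    uv = x[:, :2] / x[:, 2:3]                        # perspective divide

    # The silhouette is the union of all projected triangles
    mask = np.zeros((height, width), np.uint8)
    for tri in F:
        pts = np.round(uv[tri]).astype(np.int32)
        cv2.fillPoly(mask, [pts], 255)
    return mask
```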

The convex active contour model can be expressed as

\min_{0 \le u \le 1} \left( \int_{\Omega} g_b\,|\nabla u|\,dx + \lambda_1 \int_{\Omega} h_r\,u\,dx \right), \qquad (1)

where u is a function on the image domain \Omega and takes a value between 0 and 1 at each pixel location x in the image, g_b is typically an edge detection function, and h_r is a region function that measures the coherence of the inside and outside regions. (1) consists of two terms, where the first term is a boundary term and the second term is a region term. The boundary term favors a segmentation along the places where the edge detection function reaches its minimum, i.e. detected edges, and also favors a segmentation with a smooth boundary curve. The region term ensures that the segmentation complies with the region coherence criterion defined by h_r.

In particular, for the initial binary mask obtained by 3D projection, it is reasonable to assume that the areas far away from the initial boundary are likely to be correctly classified. The area within a threshold of D pixels from the initial boundary is treated as the unknown region. Each pixel is then given a probability, where the foreground and the background pixels are set to 1 and 0, respectively, and a pixel in the unknown region is given a probability value that is proportional to its distance from the initial contour. In this way, we obtain an initial probability map P(x), and we let u(x) = P(x) in (1) for initialization. In addition, based on the foreground and background pixels, we also build local foreground and background Gaussian Mixture Model (GMM) color models for each image.
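One simple way to build such an initial probability map from the projected binary mask is via a distance transform, as sketched below; this is only an illustrative sketch, not necessarily our exact implementation, with D denoting the width of the unknown band in pixels.

```python
import numpy as np
import cv2

def initial_probability_map(mask, D=20):
    # mask: uint8 binary mask (non-zero = projected foreground)
    fg = (mask > 0).astype(np.uint8)
    # Distance of every pixel to the mask boundary, measured separately
    # inside and outside the mask
    dist_in = cv2.distanceTransform(fg, cv2.DIST_L2, 3)
    dist_out = cv2.distanceTransform(1 - fg, cv2.DIST_L2, 3)
    signed = dist_in - dist_out                      # > 0 inside, < 0 outside
    # 1 deep inside, 0 deep outside, linear ramp within the +-D band
    return np.clip(0.5 + signed / (2.0 * D), 0.0, 1.0)
```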

With the local GMM models, the region term h_r is then defined as

h_r(x) = \alpha \left( P_B(x) - P_F(x) \right) + (1 - \alpha)\left( 1 - 2P(x) \right), \qquad (2)

where P_F(x) and P_B(x) are the normalized foreground and background likelihoods, respectively, and \alpha \in [0, 1] is a tradeoff factor. The first term, P_B(x) - P_F(x), in (2) ensures that the active contour evolves towards the one complying with the local GMM color models. The second term, 1 - 2P(x), in (2) prevents the refined contour from drifting too far away from the initial segmentation.

Once the optimization of the convex active contour model (1) is solved, we obtain the solution u(x), which represents the probability of pixel x belonging to the foreground. By thresholding u(x) against a threshold (typically 0.5), we obtain a refined 2D segmentation. After that, a visual hull containing the object is generated from the dense refined silhouettes using the shape-from-silhouette method in [8]. The visual hull acts as a hard constraint on the initial 3D model. Any part outside the visual hull is deemed background and thus cut away from the initial 3D model.

The last three steps of 3D-to-2D projection, 2D segmentation and visual hull construction are tightly coupled and together form an iterative process. In each iteration, the refined silhouettes provide a better visual hull to cut the 3D model, and in return the refined 3D model produces better initial silhouettes for the next-round 2D segmentation.
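The carving step can be sketched as follows: a point of the initial model is kept only if it projects inside the refined silhouette in every view. This is an illustrative sketch; a practical shape-from-silhouette implementation such as [8] works on a voxel grid, and here points projecting outside an image are simply treated as outside the hull.

```python
import numpy as np

def inside_visual_hull(points, silhouettes, projections):
    # points: K x 3 vertices (or voxel centres) of the initial 3D model
    # silhouettes: list of H x W binary masks refined by the 2D segmentation
    # projections: list of 3 x 4 projection matrices, one per view
    keep = np.ones(len(points), dtype=bool)
    Ph = np.hstack([points, np.ones((len(points), 1))])   # K x 4
    for sil, P in zip(silhouettes, projections):
        x = (P @ Ph.T).T
        u = np.round(x[:, 0] / x[:, 2]).astype(int)
        v = np.round(x[:, 1] / x[:, 2]).astype(int)
        h, w = sil.shape
        visible = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        inside = np.zeros(len(points), dtype=bool)
        inside[visible] = sil[v[visible], u[visible]] > 0
        # A point outside the silhouette in any view is outside the hull
        keep &= inside
    return keep
```

The initial 3D model is then cropped by discarding all points (and the faces attached to them) for which this test fails.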

3 Experimental Results

We test our proposed system on two objects, i.e. a chair and a robot. In our experiments, one video clip usually contains hundreds of frames, so we are able to generate dense silhouettes from those frames. Theoretically, the more silhouettes we use, the finer the visual hull we can obtain. To avoid heavy computation, we uniformly sample the video frames, and each final data set contains about 100 views of 640×480 color and depth images.

Fig. 3 shows the results of the chair example. It can be seen that the initial silhouette projected from the initial 3D model does not align with the chair boundary in the image. With our 2D segmentation technique, the chair's silhouette can be accurately refined. Moreover, the refined 3D object model is also of high quality. Fig. 4 shows the 3D model of the final reconstructed chair from one view.



Fig. 3. The results of the chair example. (a) Original color image; (b) Original depth image; (c) Projected image from KinectFusion; (d) Refined image via 2D segmentation.

(a) Point cloud (b) Mesh

Fig. 4. An illustration of the 3D reconstruction result of the chair.

The proposed system still has some limitations. Fig. 5 gives an example of reconstructing the robot. We can see that although our 2D segmentation tool improves the silhouette, there are still some clearly visible errors. This is mainly because the color of the object is similar to parts of the background.



Fig. 5. The results of the robot example. (a) Original color image; (b) Original depth image; (c) Projected image from KinectFusion; (d) Refined image via 2D segmentation.

4 Conclusion

In this paper, we have proposed a system to easily reconstruct the 3D model of a real object. The system is a coherent integration of several modern techniques, including KinectFusion, the convex active contour, and the visual hull. The experimental results have demonstrated that the proposed system is a practical and effective tool that can refine the 3D object model generated by KinectFusion. Future work includes further improving the reconstruction quality and designing better schemes to fuse the depth and color information.

Acknowledgement. This research, which is carried out at BeingThere Centre, is supported by the Singapore National Research Foundation under its International Research Centre @ Singapore Funding Initiative and administered by the IDM Programme Office.

References

1. Seitz, S., Curless, B., Diebel, J., Scharstein, D., Szeliski, R.: A comparison and evaluation of multi-view stereo reconstruction algorithms. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 519–528. IEEE (2006)
2. Maimone, A., Fuchs, H.: Encumbrance-free telepresence system with real-time 3D capture and display using commodity depth cameras. In: 2011 10th IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pp. 137–146. IEEE (2011)
3. Kuster, C., et al.: FreeCam: A hybrid camera system for interactive free-viewpoint video. In: Proceedings of Vision, Modeling, and Visualization (VMV) (2011)
4. Cui, Y., Stricker, D.: 3D shape scanning with a Kinect. In: ACM SIGGRAPH 2011 Posters, p. 57. ACM (2011)
5. Izadi, S., Kim, D., Hilliges, O., Molyneaux, D., Newcombe, R., Kohli, P., Shotton, J., Hodges, S., Freeman, D., Davison, A., et al.: KinectFusion: real-time 3D reconstruction and interaction using a moving depth camera. In: Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, pp. 559–568. ACM (2011)
6. Steinbrucker, F., Sturm, J., Cremers, D.: Real-time visual odometry from dense RGB-D images. In: 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pp. 719–722. IEEE (2011)
7. Bresson, X., Esedoglu, S., Vandergheynst, P., Thiran, J., Osher, S.: Fast global minimization of the active contour/snake model. Journal of Mathematical Imaging and Vision 28, 151–167 (2007)
8. Laurentini, A.: The visual hull concept for silhouette-based image understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence 16, 150–162 (1994)