
Interactive Model-based Reconstruction of the Human Head using an RGB-D Sensor

M. Zollhöfer, J. Thies, M. Colaianni, M. Stamminger, G. Greiner
Computer Graphics Group, University Erlangen-Nuremberg, Germany

Abstract

We present a novel method for the interactive markerless reconstruction of human heads using a single commodity RGB-D sensor. Our entire reconstruction pipeline is implemented on the GPU and allows us to obtain high-quality reconstructions of the human head using an interactive and intuitive reconstruction paradigm. The core of our method is a fast GPU-based non-linear Quasi-Newton solver that allows us to leverage all information of the RGB-D stream and fit a statistical head model to the observations at interactive frame rates. By jointly solving for shape, albedo and illumination parameters, we are able to reconstruct high-quality models including illumination corrected textures. All obtained reconstructions have a common topology and can be directly used as assets for games, films and various virtual reality applications. We show motion retargeting, retexturing and relighting examples. The accuracy of the presented algorithm is evaluated by a comparison against ground truth data.

Keywords: Virtual Avatars, Model-based Reconstruction, 3D Scanning, Non-linear Optimization, GPU, Statistical Head Models

Figure 1: Hardware Setup: The RGB-D stream of a single PrimeSense Carmine is used to reconstruct high-quality head models using an interactive and intuitive reconstruction paradigm.

1 Introduction

The release of the Microsoft Kinect made a cheap consumer level RGB-D sensor available for home use. Therefore, 3D scanning technology is no longer restricted to a small group of professionals, but is also accessible to a broad audience. This has had a huge impact on research in this field, shifting the focus to intuitive and interactive paradigms that are easy to use.

This work presents a model-based reconstruction system for the human head that is intuitive and leverages a commodity sensor's RGB-D stream interactively. Using the presented system, a single user is able to capture a high-quality facial avatar (see Figure 1) by moving his head freely in front of the sensor. During a scanning session the user receives interactive feedback from the system showing the current reconstruction state. Involving the user in the reconstruction process makes the system immersive and allows the result to be refined. Our system effectively creates a high-quality virtual clone with a known semantical and topological structure that can be used in various applications ranging from virtual try-on to teleconferencing.

Figure 2: Comparison: Model-free approaches (Kinect Fusion, 256³ voxel grid) are prone to oversmoothing. In contrast to this, our model-based approach allows us to estimate fine-scale surface details.

Figure 3: Denoising: Statistical noise removal (right) deals better with noisy input than spatial filtering approaches (left). Fine-scale features are retained, while noise is still handled effectively.

In the following, we discuss related work (Section 2) and present an overview of our reconstruction system (Section 3) based on a fast GPU tracking and fitting pipeline (Sections 4-6). We sum up by showing reconstruction results, applications and ground truth comparisons (Section 7) and give ideas for future work (Section 8).
2 Related Work

3D-Reconstruction from RGB-D images and streams is a well studied topic in the geometry, vision and computer graphics communities. Due to the extensive amount of literature in this field, we have to restrict our discussion to approaches closely related to this work. Therefore, we will focus on model-free and model-based algorithms that are suitable for capturing a detailed digital model (shape and albedo) of a human head. We compare these approaches based on their generality and applicability and motivate our decision for a model-based reconstruction method.

2.1 Model-free 3D-Reconstruction

3D-Reconstruction is mainly about the acquisition of a real world object's shape and albedo. This includes capturing and aligning multiple partial scans [1, 2, 3] to obtain one complete reconstruction, data accumulation or fusion [4] and a final surface extraction step [5] to obtain a mesh representation. Systems based on a direct accumulation of the input sample points [6, 7] preserve information, but scale badly with the length of the input stream. In contrast, systems based on volumetric fusion [4] accumulate the data directly into a consistent representation, but do not keep the raw input for further postprocessing steps. The Kinect Fusion framework [8, 9] is such a system and made real-time 3D-Reconstruction with a moving RGB-D camera viable for the first time. Because this approach deals with noise by spatial and temporal filtering, it is prone to oversmoothing (see Figure 2). Model-free approaches allow digitizing arbitrary real world objects, with the drawback that the output is only a polygon soup [5] with no topological and semantical information attached. Therefore, these reconstructions cannot be automatically animated or used in virtual reality applications.

2.2 Model-based 3D-Reconstruction

In contrast to model-free approaches, model-based methods heavily rely on statistical priors and are restricted to a certain class of objects (i.e., heads or bodies). This clear disadvantage in generality is compensated by leveraging the class-specific information built into the prior [10, 11, 12, 13]. In general, this leads to higher reconstruction quality, because noise can be statistically regularized (see Figure 3) and information can be propagated to yet unseen and/or unobservable regions. These properties make model-based reconstruction algorithms the first choice for applications that are focused on one specific object class.

Blanz and colleagues reconstruct 3D models from RGB and RGB-D input [10, 11] by fitting a statistical head model. These methods require user input during initialization, and registration is performed in a time consuming offline process. Statistical models have also been extensively used to reconstruct templates for tracking facial animations [14, 15, 16]. While the tracking is real-time, the reconstruction is performed offline. In addition, these methods only use depth and do not consider the RGB channels, which would allow the illumination to be estimated jointly and used to improve tracking [17, 18]. Other methods specifically focused on head reconstruction are either offline or do not use all the data of the RGB-D stream [19, 20, 21, 22, 23, 24, 25]. In many cases, they only rely on a single input frame.

In contrast, our method utilizes all data provided by the RGB-D stream and gives the user immediate feedback. We specifically decided for a model-based approach because of its superior reconstruction quality and better reusability of the created models. Applications able to use our reconstructions range from animation retargeting [26, 14, 15, 16] to face identification [27, 28], as well as virtual aging [29] and try-on [30]. The three main contributions of this work are:

• An intuitive reconstruction paradigm that is suitable even for inexperienced users

• The first interactive head reconstruction system that leverages all available information of the RGB-D stream

• A fast non-linear GPU-based Quasi-Newton solver that jointly solves for shape, albedo and illumination
3 Pipeline Overview

Figure 4: Per-Frame Pipeline (from left to right): The RGB-D input stream is preprocessed, the rigid head pose is estimated, data is fused and a joint optimization problem for shape, albedo and illumination parameters is solved iteratively.

Our reconstruction system (Figure 4) has been completely implemented on the GPU with an interactive application in mind. The user sits in front of a single RGB-D sensor (see Figure 1) and can freely move his head to obtain a complete and high-quality reconstruction. In the preprocessing stage, the captured RGB-D stream is bilaterally filtered [31] to remove high-frequency noise. We back-project the depth map to camera space and compute normals at the sample points using finite differences. We track the rigid motion of the head using a dense GPU-based iterative closest point (ICP) algorithm. After the global position and orientation of the head have been determined, we use a non-rigid registration method that flip-flops between data fusion and model fitting. We fuse the unfiltered input data into a consistent mesh-based representation that shares its topology with the statistical prior. This step allows for super-resolution reconstructions, closes holes and fairs the data using a fast GPU-based thin-plate regularizer [32]. The resulting faired displacements define the position constraints for non-rigidly fitting the statistical model. After the best fitting model has been computed, we use the solution to initialize the next flip-flop step. This allows us to temporally fair and stabilize the target correspondences.
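For illustration, the preprocessing stage maps directly to a few lines of array code. The sketch below is ours, not the authors': it assumes a simple pinhole camera with intrinsics fx, fy, cx, cy (which the paper does not specify) and computes camera-space points and finite-difference normals:

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Lift an (H, W) depth map in meters to (H, W, 3) camera-space points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.dstack((x, y, depth))

def finite_difference_normals(points):
    """Per-pixel normals from central differences of neighboring points."""
    dx = np.roll(points, -1, axis=1) - np.roll(points, 1, axis=1)
    dy = np.roll(points, -1, axis=0) - np.roll(points, 1, axis=0)
    n = np.cross(dx, dy)
    return n / np.maximum(np.linalg.norm(n, axis=2, keepdims=True), 1e-8)
```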
4 Head Pose Estimation

We compute an initial guess for the global head pose using the Procrustes algorithm [33]. The required feature points are automatically detected using Haar Cascade Classifiers [34] for the eyes, the nose and the mouth. Corresponding features on the model have been manually preselected and stay constant. Starting from this initialization, we use a fast GPU-based implementation of a dense ICP algorithm in the spirit of [8, 9] to compute the best fitting rigid transformation $\Phi^t(x) = R^t x + t^t$. We rasterize the model under the last rigid transformation $\Phi^{t-1}$ to generate synthetic position images $p_{i,j}$ and normal images $n_{i,j}$. Projective correspondence association is used between the input and the synthetic images. The registration error between the rendered model positions and the target correspondences $t_{X(i,j)}$ under a point-to-plane metric is:

$$ \arg\min_{\hat{\Phi}} \sum_{i,j} w_{i,j} \left\langle n_{i,j},\, \hat{\Phi}(p_{i,j}) - t_{X(i,j)} \right\rangle^2 . $$

The corresponding 6×6 least squares system is constructed in parallel on the GPU and solved via SVD. We set the correspondence weights $w_{i,j}$ based on distance and normal deviation and prune correspondences ($w_{i,j} = 0$) if they are too far apart (> 2 cm), the normals do not match (> 20°) or the pixels are not associated with the head. The valid region for correspondence search is selected by a binary mask that specifies the part of the head that stays almost rigid (red) under motion. The predicted head pose of the current frame, $\Phi^t = \hat{\Phi} \circ \Phi^{t-1}$, is used as starting point for the reconstruction of the non-rigid shape.
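A minimal CPU sketch of one linearized point-to-plane step may help make the 6×6 system concrete. Under a small-angle assumption the rigid update is parameterized by a twist (rx, ry, rz, tx, ty, tz); all names below are ours, and the paper's GPU implementation assembles the same normal equations in parallel and solves them via SVD:

```python
import numpy as np

def point_to_plane_step(src, tgt, nrm, w):
    """One linearized point-to-plane ICP step.
    src: (N, 3) rendered model points, tgt: (N, 3) target correspondences,
    nrm: (N, 3) target normals, w: (N,) weights (0 prunes a pair).
    Returns the 6-vector twist (rx, ry, rz, tx, ty, tz)."""
    J = np.hstack((np.cross(src, nrm), nrm))      # (N, 6) Jacobian rows
    r = np.einsum('ij,ij->i', tgt - src, nrm)     # (N,) residuals
    A = (w[:, None] * J).T @ J                    # 6x6 normal equations
    b = (w[:, None] * J).T @ r
    return np.linalg.lstsq(A, b, rcond=None)[0]   # SVD-based solve
```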

5 Data Fusion

Depth data of consumer level RGB-D sensors has a low resolution, contains holes and a lot of noise. We use a fusion scheme similar to Kinect Fusion [8, 9] to achieve super-resolution reconstructions and effectively deal with the noisy input. A per-vertex displacement map is defined on the template model to temporally accumulate the input RGB-D stream. Target scalar displacements are found by ray marching in normal direction, followed by four bisection steps to refine the solution. The resulting displacement map is faired by computing the best fitting thin-plate. We approximate the non-linear thin-plate energy [32] by replacing the fundamental forms with partial derivatives:

$$ -\lambda_s \Delta d + \lambda_b \Delta^2 d = 0 . $$

The parameters $\lambda_s$ and $\lambda_b$ control the stretching and bending resistance of the surface, and $d$ are the faired scalar displacements. These displacements are accumulated using an exponential average. A fast GPU-based preconditioned gradient descent with a Jacobi preconditioner is used to solve the resulting least squares problem. The preconditioner is constant for a fixed topology and can be precomputed; the gradient evaluation is performed on-the-fly by iteratively applying the Laplacian kernel to compute the bi-Laplacian gradient component. This nicely regularizes out noise and fills holes in the data. For RGB, we use a one-frame integration scheme to deal with illumination changes.
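The fairing step can be sketched with a generic graph Laplacian L standing in for the paper's mesh Laplacian; the damping factor is our addition for stability, but the on-the-fly bi-Laplacian evaluation and the precomputable Jacobi preconditioner follow the description above:

```python
import numpy as np

def fair_displacements(L, d_target, w, lam_s, lam_b, iters=5, step=0.5):
    """Screened thin-plate fairing of per-vertex scalar displacements:
    a data term w*(d - d_target)^2 balanced against the linearized
    thin-plate energy -lam_s*Lap(d) + lam_b*Lap^2(d). The bi-Laplacian
    is evaluated on the fly as L @ (L @ d); the (approximate) Jacobi
    preconditioner depends only on the fixed topology. Entries with
    w = 0 (no ray-marched target) are filled in by the smoothness terms.
    L: (N, N) graph Laplacian (dense or sparse), d_target, w: (N,)."""
    d = np.where(w > 0, d_target, 0.0)
    diag_L = np.asarray(L.diagonal()).ravel()
    precond = 1.0 / np.maximum(w + lam_s * diag_L + lam_b * diag_L**2, 1e-8)
    for _ in range(iters):
        grad = w * (d - d_target) + lam_s * (L @ d) + lam_b * (L @ (L @ d))
        d -= step * precond * grad
    return d
```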

6 Estimating Model Parameters

By projecting the accumulated data into the space of statistically plausible heads, noise can be regularized, artifacts can be removed and information can be propagated into yet unseen regions. We pose the estimation of the unknown shape α, albedo β and illumination γ parameters as a joint non-linear optimization problem. Shape and albedo are statistically modeled using the Basel Face Model [10, 27]; illumination is approximated using spherical harmonics. In the following, we give details on the used statistical model and the objective function, and show how to efficiently compute best fitting parameters using a fast GPU-based non-linear Quasi-Newton solver.

6.1 Statistical Shape Model

The used statistical shape model encodes the shape (albedo) of 200 heads by assuming an underlying Gaussian distribution with mean $\mu_\alpha$ ($\mu_\beta$) and standard deviation $\sigma_\alpha$ ($\sigma_\beta$). The principal components $E_\alpha$ ($E_\beta$) are the directions of highest variance and span the space of plausible heads. New heads can be synthesized by specifying suitable model parameters α (β):

$$ \text{Shape: } M(\alpha) = \mu_\alpha + E_\alpha \alpha , \qquad \text{Albedo: } C(\beta) = \mu_\beta + E_\beta \beta . $$

Synthesis is implemented using compute shaders. We use one warp per vertex and a fast warp reduction to compute the synthesized position (albedo).
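Numerically, synthesis is just an affine map in the model parameters. A small sketch with randomly generated stand-in data (illustrative only, not the Basel Face Model):

```python
import numpy as np

n_verts, m = 10000, 160                       # illustrative sizes only
rng = np.random.default_rng(0)
mu = rng.standard_normal(3 * n_verts)         # stacked mean positions
E = rng.standard_normal((3 * n_verts, m))     # principal components
sigma = np.linspace(1.0, 0.01, m)             # per-mode standard deviations

def synthesize(mu, E, params):
    """Linear statistical model: mu + E @ params, reshaped to (n, 3)."""
    return (mu + E @ params).reshape(-1, 3)

# A statistically plausible random head: alpha_i ~ N(0, sigma_i^2).
alpha = rng.standard_normal(m) * sigma
head = synthesize(mu, E, alpha)
```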

6.2 Objective Function

Finding the instance that best explains the accumulated input observations is cast as a joint non-linear optimization problem:

$$ E(\mathbf{P}) = \lambda_d E_d(\mathbf{P}) + \lambda_c E_c(\mathbf{P}) + \lambda_r E_r(\mathbf{P}) . $$

The individual objectives $E_d$, $E_c$ and $E_r$ represent the depth, color and statistical regularization constraints. The empirically determined weights $\lambda_d = \lambda_c = 10$ and $\lambda_r = 1$ remain fixed and have been used for all shown examples. The parameter vector $\mathbf{P} = (\alpha, \beta, \gamma)$ encodes the degrees of freedom in the model. In the following, we will discuss the different objectives and their role in the optimization problem in more detail.

6.2.1 Depth Fitting Term

The depth fitting term incorporates the accumulated geometric target positions $t_i$ into the optimization problem:

$$ E_d(\mathbf{P}) = \sum_{i=1}^{n} \left\| \Phi^t(M_i(\alpha)) - t_i \right\|^2 . $$

This functional only depends on the shape parameters α and measures the geometric point-point alignment error for every model vertex ($n$ vertices). Minimizing this objective on its own is a linear least squares problem in the unknowns α due to the linearity of the model $M$.
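Because M is affine in α, fixing the rigid pose Φ = (R, t) turns the depth term into an ordinary linear least squares problem. A sketch in our own notation, restricted to the first 40 components as in the initialization of Section 6.3:

```python
import numpy as np

def fit_shape_to_targets(mu, E, R, t, targets, n_modes=40):
    """Minimize sum_i ||R @ M_i(alpha) + t - t_i||^2 over alpha.
    mu: (3n,) stacked mean, E: (3n, m) components, R: (3, 3), t: (3,),
    targets: (n, 3) accumulated target positions."""
    n = mu.shape[0] // 3
    Ek = E[:, :n_modes].reshape(n, 3, n_modes)
    # Rotate each vertex block of the basis: A[3i:3i+3] = R @ E_i.
    A = np.einsum('ab,nbm->nam', R, Ek).reshape(3 * n, n_modes)
    b = (targets - mu.reshape(n, 3) @ R.T - t).reshape(-1)
    return np.linalg.lstsq(A, b, rcond=None)[0]
```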

6.2.2 Color Fitting Term

The visual similarity of the synthesized model and the input RGB data is modeled by the color fitting term:

$$ E_c(\mathbf{P}) = \sum_{i=1}^{n} \left\| I(t_i) - R(v_i, c_i, \gamma) \right\|^2 , $$

with $v_i = \Phi^t(M_i(\alpha))$ being the current vertex position, $c_i = C_i(\beta)$ the current albedo and $I(t_i)$ the RGB color assigned to the target position. This part of the objective function is non-linear in the shape α and linear in the illumination γ and albedo β. Illumination is modeled using spherical harmonics (spherical environment map). We assume a purely diffuse material and no self-shadowing:

$$ R(v_i, c_i, \gamma) = c_i \sum_{j=1}^{k} \gamma_j H_j(v_i) . $$

$H_j$ is the projection of the angular cosine falloffs onto the spherical harmonics basis. We use 3 spherical harmonics bands ($k = 9$ coefficients per channel) and sample 128 random directions (Hammersley sampler) for the numerical evaluation of $H_j$.
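The shading model is easy to reproduce: evaluate the first three real spherical harmonics bands at the vertex normals and take a per-channel dot product with the γ coefficients. In this sketch (ours) the cosine-lobe convolution constants, which the paper bakes into H_j via Monte Carlo sampling, are assumed to be folded into γ:

```python
import numpy as np

def sh_basis(n):
    """First 9 real spherical harmonics at unit normals n: (N, 3) -> (N, 9)."""
    x, y, z = n[:, 0], n[:, 1], n[:, 2]
    return np.stack([
        np.full_like(x, 0.282095),
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3.0 * z * z - 1.0),
        1.092548 * x * z,
        0.546274 * (x * x - y * y),
    ], axis=1)

def shade(albedo, normals, gamma):
    """Diffuse shading R(v, c, gamma) = c * sum_j gamma_j H_j(v).
    albedo: (N, 3), normals: (N, 3) unit length, gamma: (9, 3) per channel."""
    return albedo * (sh_basis(normals) @ gamma)
```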
6.2.3 Statistical Regularization

The heart of this method is a statistical regularizer [10] that takes the probability of the synthesized instances into account. This prevents overfitting the input data. Assuming a Gaussian distribution of the input, approximately 99% of the heads can be reproduced using parameters $x_i \in [-3\sigma_{x_i}, 3\sigma_{x_i}]$. Therefore, the parameters are constrained to be statistically small:

$$ E_r(\mathbf{P}) = \sum_{i=1}^{m} \left[ \frac{\alpha_i^2}{\sigma_{\alpha_i}^2} + \frac{\beta_i^2}{\sigma_{\beta_i}^2} \right] + \sum_{i=1}^{k} \left( \frac{\gamma_i}{\sigma_{\gamma_i}} \right)^2 . $$

The standard deviations $\sigma_{\alpha_i}$ and $\sigma_{\beta_i}$ are known from the shape model; $\sigma_{\gamma_i} = 1$ encodes the variability in the illumination and has been empirically determined. $m$ specifies the number of used principal components.
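The regularizer is a Tikhonov-style penalty and transcribes directly (parameter names are ours):

```python
import numpy as np

def e_reg(alpha, beta, gamma, s_alpha, s_beta, s_gamma=1.0):
    """E_r: squared parameters, normalized by their standard deviations.
    Large values indicate a statistically implausible head."""
    return (np.sum((alpha / s_alpha) ** 2)
            + np.sum((beta / s_beta) ** 2)
            + np.sum((gamma / s_gamma) ** 2))
```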

6.3 Parameter Initialization

The objective function $E$ is non-linear in its parameters $\mathbf{P}$. Therefore, a good initial guess is required to guarantee convergence to a suitable optimum. Current state-of-the-art methods heavily rely on user input in the form of sparse marker or silhouette constraints to guide the optimizer through the complex energy landscape. In this work, when tracking is started, we initialize the parameters by decoupling the optimization problem into three separate linear problems that can be solved independently. We start by fitting the model against the detected sparse set of markers to roughly estimate the size of the head. As mentioned earlier, the depth objective on its own is a linear least squares problem in the unknown shape parameters. After searching correspondences, a good approximation for α can be computed by solving the linear system that corresponds to the first 40 principal components. Then, the illumination parameters γ are estimated separately by assuming a constant average albedo. Finally, the albedo parameters β are initialized by assuming the computed shape and illumination to be fixed. Once the parameters have been initialized, a joint non-linear optimizer is used to refine the solutions.

6.4 Joint Non-Linear GPU Optimizer

To refine the solutions of the uncoupled optimization problems, we jointly solve for the parameters in each new input frame. We use a fast GPU-based implementation of a Quasi-Newton method to iteratively compute the best fit:

$$ \mathbf{P}^{n+1} = \mathbf{P}^n - \lambda \, \left( H_E(\mathbf{P}^n) \right)^{-1} \nabla E(\mathbf{P}^n) . $$

$H_E(\mathbf{P}^n)$ is the Hessian matrix and $\nabla E$ the gradient of $E$; λ controls the step size and is adaptively computed based on the change in the residual. We use a simple approximation of the inverse Hessian for scaling the descent directions. This is similar to preconditioning [35]. Because of the global support of the principal components, the derivatives with respect to α and β are influenced by all constraints. Therefore, we use one block per variable and a fast block reduction in shared memory to evaluate the derivatives. Per flip-flop step we perform 2 Quasi-Newton steps. During optimization we slowly increase the number of used principal directions to avoid local minima in the energy landscape.
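A compact sketch of the damped iteration (structure ours): a diagonal approximation D of the inverse Hessian scales the gradient, acting like a preconditioner [35], and the step size is adapted to the change in the residual:

```python
import numpy as np

def quasi_newton(P0, grad, d_inv_hess, energy, steps=2, lam=1.0):
    """Damped Quasi-Newton: P <- P - lam * D(P) * grad(P), with D(P) a
    diagonal inverse-Hessian approximation. lam is halved whenever a
    step would increase the energy (adaptive step size)."""
    P, e_prev = P0.copy(), energy(P0)
    for _ in range(steps):
        P_try = P - lam * d_inv_hess(P) * grad(P)
        e_try = energy(P_try)
        if e_try > e_prev:
            lam *= 0.5                # residual grew: damp and retry
        else:
            P, e_prev = P_try, e_try
    return P
```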

7 Results

In this section, we discuss the runtime behaviour of our system, compare the reconstruction results to ground truth data obtained by a high-quality structured light scanner and present applications that leverage the known semantical structure of the reconstructed models.

7.1 Runtime Evaluation

Figure 5: Per-Frame Runtime: Runtime of the different stages in our reconstruction pipeline (in ms).

Our reconstruction pipeline is completely implemented on the GPU to allow for interactive reconstruction sessions. The average per-frame runtime for the examples in Figure 9 is shown in Figure 5. We used an Intel Core i7-3770 CPU with an Nvidia Geforce GTX 780. Note that for the person in the fourth row the beard could not be faithfully recovered by the model coefficients; this is due to the fact that facial hair is not contained in the model. But the custom-generated texture captures and adds these details to the reconstruction. For all presented examples, we used 2 flip-flop steps (with 2 Quasi-Newton steps each) and 5 iterations of the thin-plate solver. We start with 40 eigenvectors and slowly increase the number to 120. Note that our system always remains interactive and gives the user direct feedback during the reconstruction process. For the results in Figure 9 the complete scanning sessions took about 3-5 seconds each. In most cases, moving the head once from right to left is sufficient to compute a high quality model.

Figure 9: 3D-Reconstruction Results (from left to right): Input color image, overlaid model (fitted color), overlaid model (textured), filtered input positions, overlaid model (Phong shaded), reconstructed virtual avatar (fitted color, textured, Phong shaded).

7.2 Reconstruction Quality

To evaluate the accuracy of the presented system we compare our reconstructions (PrimeSense Carmine 1.09) with high-quality 3D scans captured by a structured light scanner. The scanning sessions with the structured light setup took several minutes. As can be seen in Figure 6, the actual shape difference is small and our geometry has comparable quality. Because the eyes had to remain closed during structured light scanning, most of the error is located in the eye region. The mean error was 1.09 mm, 0.81 mm and 1.19 mm respectively (from top to bottom).

Figure 6: Ground Truth Comparison: Distance (right) between our reconstructions (left) and high-quality structured light scans (middle).

We also compare our reconstructions to [19] (see Figure 7). In contrast to their approach, we can reconstruct the complete head, illumination-correct the textures and obtain higher resolution geometry through super-resolution fusion.

Figure 7: Comparison to [19]: Their reconstruction (left), ours (right).

7.3 Applications

Figure 10: Applications: The textured models can be relighted, retextured and used for virtual try-on.

In this section, we discuss some applications that can directly use the reconstructed models, see Figure 10. We compute a complete texture by fusing multiple frames using pyramid blending [36]; the required mask is automatically computed using depth and angular thresholds. The illumination parameters allow us to compute illumination corrected textures and relight the head. Because of the known semantical structure, we can place a hat on the model (virtual try-on) and add additional textures. The known topological structure of the models allows us to easily retarget animations (Figure 8).

Figure 8: Animation Retargeting: The known topology of the reconstructed models allows to easily retarget animation sequences. The input sequence has been taken from [26].

8 Conclusion

We have presented a complete system for the reconstruction of high-quality models of the human head using a single commodity RGB-D sensor. A joint optimization problem is solved interactively to estimate shape, albedo and illumination parameters using a fast GPU-based non-linear solver. We have shown that the obtained quality is comparable to offline structured light scans. Because of the known topological and semantical structure of the models, they can be directly used as input for various virtual reality applications. In the future, we plan to add motion tracking to our system to animate the reconstructions. We hope that we can leverage the reconstructed albedo to make non-rigid tracking more robust.

Acknowledgements

We thank the reviewers for their insightful feedback. The head model is provided courtesy of Prof. Dr. T. Vetter (Department of Computer Science) and the University of Basel. This work was partly funded by GRK 1773.

References

[1] Paul J. Besl and Neil D. McKay. A method for registration of 3-D shapes. IEEE Trans. Pattern Anal. Mach. Intell., 14(2):239-256, February 1992.

[2] Szymon Rusinkiewicz and Marc Levoy. Efficient variants of the ICP algorithm. In Third International Conference on 3D Digital Imaging and Modeling (3DIM), June 2001.

[3] Yang Chen and Gérard Medioni. Object modelling by registration of multiple range images. Image Vision Comput., 10(3):145-155, April 1992.

[4] Brian Curless and Marc Levoy. A volumetric method for building complex models from range images. In Proc. of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, pages 303-312. ACM, 1996.

[5] William E. Lorensen and Harvey E. Cline. Marching cubes: A high resolution 3D surface construction algorithm. In Proc. of the 14th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '87, pages 163-169, New York, NY, USA, 1987. ACM.

[6] Thibaut Weise, Bastian Leibe, and Luc J. Van Gool. Accurate and robust registration for in-hand modeling. In CVPR. IEEE Computer Society, 2008.

[7] T. Weise, T. Wismer, B. Leibe, and L. Van Gool. In-hand scanning with online loop closure. In IEEE International Workshop on 3-D Digital Imaging and Modeling, October 2009.

[8] Richard A. Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J. Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In Proc. ISMAR, pages 127-136, 2011.

[9] S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davison, and A. Fitzgibbon. KinectFusion: Real-time 3D reconstruction and interaction using a moving depth camera. In Proc. UIST, pages 559-568, 2011.

[10] Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3D faces. In Proc. of the 26th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '99, pages 187-194, New York, NY, USA, 1999. ACM Press/Addison-Wesley Publishing Co.

[11] Volker Blanz, Kristina Scherbaum, and Hans-Peter Seidel. Fitting a morphable model to 3D scans of faces. In CVPR, pages 1-8, 2007.

[12] Sami Romdhani and Thomas Vetter. Estimating 3D shape and texture using pixel intensity, edges, specular highlights, texture constraints and a prior. In Proc. of Computer Vision and Pattern Recognition, pages 986-993, 2005.

[13] David C. Schneider and Peter Eisert. Fitting a morphable model to pose and shape of a point cloud. In Marcus A. Magnor, Bodo Rosenhahn, and Holger Theisel, editors, VMV, pages 93-100. DNB, 2009.

[14] Thibaut Weise, Sofien Bouaziz, Hao Li, and Mark Pauly. Realtime performance-based facial animation. In ACM SIGGRAPH 2011 Papers, SIGGRAPH '11, pages 77:1-77:10, New York, NY, USA, 2011. ACM.

[15] Hao Li, Jihun Yu, Yuting Ye, and Chris Bregler. Realtime facial animation with on-the-fly correctives. ACM Trans. Graph., 32(4):42:1-42:10, July 2013.

[16] Chen Cao, Yanlin Weng, Stephen Lin, and Kun Zhou. 3D shape regression for real-time facial animation. ACM Trans. Graph., 32(4):41:1-41:10, July 2013.

[17] Thibaut Weise, Hao Li, Luc Van Gool, and Mark Pauly. Face/Off: Live facial puppetry. In Proceedings of the 2009 ACM SIGGRAPH/Eurographics Symposium on Computer Animation (Proc. SCA'09), ETH Zurich, August 2009. Eurographics Association.

[18] Chen Cao, Yanlin Weng, Shun Zhou, Yiying Tong, and Kun Zhou. FaceWarehouse: A 3D facial expression database for visual computing. IEEE Trans. Vis. Comput. Graph., 20(3):413-425, 2014.

[19] Michael Zollhöfer, Michael Martinek, Günther Greiner, Marc Stamminger, and Jochen Süßmuth. Automatic reconstruction of personalized avatars from 3D face scans. Computer Animation and Virtual Worlds (Proceedings of CASA 2011), 22(2-3):195-202, 2011.

[20] Li-an Tang and Thomas S. Huang. Automatic construction of 3D human face models based on 2D images. In ICIP (3), pages 467-470. IEEE, 1996.

[21] Pia Breuer, Kwang In Kim, Wolf Kienzle, Bernhard Schölkopf, and Volker Blanz. Automatic 3D face reconstruction from single images or video. In FG, pages 1-8. IEEE, 2008.

[22] Jinho Lee, Hanspeter Pfister, Baback Moghaddam, and Raghu Machiraju. Estimation of 3D faces and illumination from single photographs using a bilinear illumination model. In Oliver Deussen, Alexander Keller, Kavita Bala, Philip Dutré, Dieter W. Fellner, and Stephen N. Spencer, editors, Rendering Techniques, pages 73-82. Eurographics Association, 2005.

[23] Axel Weissenfeld, Nikolce Stefanoski, Shen Qiuqiong, and Joern Ostermann. Adaptation of a generic face model to a 3D scan. In ICOB 2005 - Workshop on Immersive Communication and Broadcast Systems, Berlin, Germany, 2005.

[24] Won-Sook Lee and Nadia Magnenat-Thalmann. Head modeling from pictures and morphing in 3D with image metamorphosis based on triangulation. In Proc. of the International Workshop on Modelling and Motion Capture Techniques for Virtual Environments, CAPTECH '98, pages 254-267, London, UK, 1998. Springer-Verlag.

[25] T. Goto, S. Kshirsagar, and N. Magnenat-Thalmann. Automatic face cloning and animation. IEEE Signal Processing Magazine, 18(3):17-25, May 2001.

[26] Robert W. Sumner and Jovan Popović. Deformation transfer for triangle meshes. In ACM SIGGRAPH 2004 Papers, SIGGRAPH '04, pages 399-405, New York, NY, USA, 2004. ACM.

[27] P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter. A 3D face model for pose and illumination invariant face recognition. Genova, Italy, 2009. IEEE.

[28] Sami Romdhani, Volker Blanz, and Thomas Vetter. Face identification by fitting a 3D morphable model using linear shape and texture error functions. In European Conference on Computer Vision, pages 3-19, 2002.

[29] Kristina Scherbaum, Martin Sunkel, Hans-Peter Seidel, and Volker Blanz. Prediction of individual non-linear aging trajectories of faces. Comput. Graph. Forum, 26(3):285-294, 2007.

[30] Kristina Scherbaum, Tobias Ritschel, Matthias B. Hullin, Thorsten Thormählen, Volker Blanz, and Hans-Peter Seidel. Computer-suggested facial makeup. Comput. Graph. Forum, 30(2):485-492, 2011.

[31] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In Proc. of the Sixth International Conference on Computer Vision, ICCV '98, pages 839-846, Washington, DC, USA, 1998. IEEE Computer Society.
[32] M. Botsch and O. Sorkine. On linear variational surface deformation methods. IEEE Trans. Vis. Comp. Graph., 14(1):213-230, 2008.

[33] J. Gower. Generalized Procrustes analysis. Psychometrika, 40(1):33-51, March 1975.

[34] G. Bradski. OpenCV 2.4.5 (Open Source Computer Vision). Dr. Dobb's Journal of Software Tools, 2000.

[35] M.Y. Waziri, W.J. Leong, M.A. Hassan, and M. Monsi. A new Newton's method with diagonal Jacobian approximation for systems of nonlinear equations. Journal of Mathematics and Statistics, 6:246-252, 2010.

[36] C. H. Anderson, J. R. Bergen, P. J. Burt, and J. M. Ogden. Pyramid Methods in Image Processing. 1984.