MoFA: Model-based Deep Convolutional Face Autoencoder for Unsupervised Monocular Reconstruction

Ayush Tewari(1), Michael Zollhöfer(1), Hyeongwoo Kim(1), Pablo Garrido(1), Florian Bernard(1,2), Patrick Pérez(3), Christian Theobalt(1)
(1) Max-Planck-Institute for Informatics   (2) LCSB, University of Luxembourg   (3) Technicolor

arXiv:1703.10580v2 [cs.CV] 7 Dec 2017

[Teaser figure] Our model-based deep convolutional face autoencoder enables unsupervised learning of semantic pose, shape, expression, reflectance and lighting parameters. The trained encoder predicts these parameters from a single monocular image, all at once.

Abstract

In this work we propose a novel model-based deep convolutional autoencoder that addresses the highly challenging problem of reconstructing a 3D human face from a single in-the-wild color image. To this end, we combine a convolutional encoder network with an expert-designed generative model that serves as decoder. The core innovation is the differentiable parametric decoder that encapsulates image formation analytically based on a generative model. Our decoder takes as input a code vector with exactly defined semantic meaning that encodes detailed face pose, shape, expression, skin reflectance and scene illumination. Due to this new way of combining CNN-based with model-based face reconstruction, the CNN-based encoder learns to extract semantically meaningful parameters from a single monocular input image. For the first time, a CNN encoder and an expert-designed generative model can be trained end-to-end in an unsupervised manner, which renders training on very large (unlabeled) real-world data feasible. The obtained reconstructions compare favorably to current state-of-the-art approaches in terms of quality and richness of representation.

1. Introduction

Detailed, dense 3D reconstruction of the human face from image data is a longstanding problem in computer vision and computer graphics. Previous approaches have tackled this challenging problem using calibrated multi-view data or uncalibrated photo collections [26, 45]. Robust and detailed three-dimensional face reconstruction from a single arbitrary in-the-wild image, e.g., downloaded from the Internet, is still an open research problem due to the high degree of variability of uncalibrated photos in terms of resolution and employed imaging device. In addition, in unconstrained photos, faces show a high variability in global pose and facial expression, and are captured under diverse and difficult lighting. Detailed 3D face reconstruction is the foundation for a broad scope of applications, which range from robust face recognition and emotion estimation to complex image manipulation tasks. In many applications, faces should ideally be reconstructed in terms of meaningful low-dimensional model parameters, which facilitates interpretation and manipulation of reconstructions (cf. [54]).

Recent monocular reconstruction methods broadly fall into two categories: generative and regression-based. Generative approaches fit a parametric face model to image and video data, e.g., [3, 2, 12], by optimizing the alignment between the projected model and the image [14, 54, 51, 52, 25]. State-of-the-art generative approaches capture very detailed and complete 3D face models on the basis of semantically meaningful low-dimensional parameterizations [14, 54]. Unfortunately, the fitting energies are usually highly non-convex. Good results thus require an initialization close to the global optimum, which is only possible with some level of control during image capture or with additional input data, e.g., detected landmarks.

Only recently have the first regression-based approaches for dense 3D face reconstruction based on deep convolutional neural networks been proposed. Richardson et al. [41] use iterative regression to obtain a high-quality estimate of the pose, shape and expression of a face model, and of finer-scale surface detail [42]. The expression-invariant regression approach of Tran et al. [55] obtains high-quality estimates of shape and skin reflectance. Unfortunately, these approaches can only be trained in a supervised fashion on corpora of densely annotated facial images, the creation of which is a major obstacle in practice. In particular, the creation of a training corpus of photo-realistic synthetic facial images that include facial hair, parts of the upper body and a consistent background is challenging. While the refinement network of Richardson et al. [42] can be trained in an unsupervised manner, their coarse shape regression network requires synthetic ground-truth data for training. Also, the quality and richness of representation (e.g., illumination and colored reflectance in addition to geometry) of these methods does not match that of the best generative approaches. However, trained networks are efficient to evaluate and can be trained to achieve remarkable robustness under difficult real-world conditions.

This paper contributes a new type of model-based deep convolutional autoencoder that joins the strengths of state-of-the-art generative and CNN-based regression approaches for dense 3D face reconstruction via a deep integration of the two on an architectural level. Our network architecture is inspired by recent progress on deep convolutional autoencoders, which, in their original form, couple a CNN encoder and a CNN decoder through a code layer of reduced dimensionality [18, 33, 61]. Unlike previously used CNN-based decoders, our convolutional autoencoder deeply integrates an expert-designed decoder. This layer implements, in closed form, an elaborate generative, analytically differentiable image formation model on the basis of a detailed parametric 3D face model [3]. Some previous fully CNN-based autoencoders have tried to disentangle the code-layer parameters [28, 15], but could not fully guarantee their semantic meaning. In our new network, the exact semantic meaning of the code vector, i.e., the input to the decoder, is ensured by design. Moreover, our decoder is compact and does not require training of enormous sets of unintuitive CNN weights.

Unlike previous CNN regression-based approaches for face reconstruction, a single forward pass of our network estimates a much more complete face model, including pose, shape, expression, skin reflectance, and illumination, at high quality. Our new network architecture allows, for the first time, combined end-to-end training of a sophisticated model-based (generative) decoder and a CNN encoder, with error backpropagation through all layers. It also allows, for the first time, unsupervised training of a network that reconstructs dense and semantically meaningful faces from unlabeled in-the-wild images via a dense photometric training loss. In consequence, our network generalizes better to real-world data than networks trained on synthetic face data [41, 42].
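To make the data flow of this architecture concrete, the following minimal PyTorch-style sketch mirrors the structure described above: a CNN encoder regresses a code vector that is split into semantically named parts, a fixed differentiable decoder re-renders an image from those parameters, and a dense photometric loss compares the rendering to the input, so no labels are needed. The code dimensions, the toy encoder, and the placeholder decoder body are illustrative assumptions, not the components used in the paper.

```python
# Illustrative sketch only (not the authors' code): encoder -> semantic code ->
# fixed differentiable decoder -> photometric loss, trained without labels.
import torch
import torch.nn as nn

# Assumed sizes of the semantic code parts (placeholders, not the paper's exact values).
CODE_DIMS = {"pose": 6, "shape": 80, "expression": 64, "reflectance": 80, "illumination": 27}

class Encoder(nn.Module):
    """Small stand-in CNN that maps an image to the concatenated code vector."""
    def __init__(self, code_dim=sum(CODE_DIMS.values())):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, code_dim)

    def forward(self, image):
        return self.fc(self.features(image).flatten(1))

def split_code(code):
    """Slice the flat code vector into its named semantic parts."""
    parts, start = {}, 0
    for name, dim in CODE_DIMS.items():
        parts[name] = code[:, start:start + dim]
        start += dim
    return parts

def model_based_decoder(parts, image_size=64):
    """Placeholder for the fixed, expert-designed image-formation layer.
    The real decoder poses and shades a parametric 3D face model and renders it;
    here we only return a dummy differentiable image of the right shape."""
    batch = parts["shape"].shape[0]
    brightness = parts["illumination"].mean(dim=1, keepdim=True)  # toy dependence on the code
    return brightness[:, :, None, None].expand(batch, 3, image_size, image_size)

# One unsupervised training step: photometric loss between input and re-rendered image.
encoder = Encoder()
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)
images = torch.rand(4, 3, 64, 64)           # stand-in for an unlabeled in-the-wild batch
optimizer.zero_grad()
rendered = model_based_decoder(split_code(encoder(images)))
loss = ((rendered - images) ** 2).mean()    # dense photometric loss (no annotations needed)
loss.backward()
optimizer.step()
```

Because the loss is defined purely on pixels, gradients flow from the rendering through the fixed decoder into the encoder weights, which is what makes end-to-end unsupervised training on unlabeled images possible.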
2. Related Work

In this section, we summarize previous works that are related to our approach. We focus on parametric model fitting and CNN approaches in the context of monocular face reconstruction. For further work on general template-based mesh tracking, please refer to [25, 26, 51, 52, 31].

Parametric Face Models. Active Appearance Models (AAMs) use a linear model for jointly capturing shape and texture variation in 2D [9]. Matching an AAM to an image is a registration problem, usually tackled via energy optimization. A closely related approach to AAMs is the 3D morphable model of faces (3DMM) [3], which has been used for learning facial animations from 3D scans [2]. In [12], a parametric head model has been employed to modify the relative head pose and camera parameters of portrait photos.
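As a point of reference for the parametric models discussed above, the following minimal sketch shows the kind of linear parameterization a 3DMM-style model uses: per-vertex geometry and reflectance are expressed as a mean plus a linear combination of basis vectors weighted by low-dimensional coefficients. The mesh size, the random bases and the basis dimensions are illustrative assumptions, not the actual model of [3].

```python
# Toy 3DMM-style linear face model (illustrative sizes and random bases).
import numpy as np

num_vertices = 1000                                   # toy mesh size (assumption)
mean_shape = np.zeros(3 * num_vertices)               # stacked (x, y, z) coordinates
shape_basis = np.random.randn(3 * num_vertices, 80)   # identity shape basis
expression_basis = np.random.randn(3 * num_vertices, 64)
mean_reflectance = np.full(3 * num_vertices, 0.5)     # stacked (r, g, b) albedo
reflectance_basis = np.random.randn(3 * num_vertices, 80)

def synthesize(alpha, delta, beta):
    """Return per-vertex geometry and reflectance for the given coefficients."""
    geometry = mean_shape + shape_basis @ alpha + expression_basis @ delta
    reflectance = mean_reflectance + reflectance_basis @ beta
    return geometry.reshape(-1, 3), reflectance.reshape(-1, 3)

vertices, albedo = synthesize(np.zeros(80), np.zeros(64), np.zeros(80))
```

Conceptually, the shape, expression and reflectance parts of the code vector in the sketch from Section 1 would play the role of alpha, delta and beta here.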
Monocular Optimization-based Reconstruction. Many monocular reconstruction approaches solve an optimization problem to fit a model to a given image. For example, the 3DMM has been used for monocular reconstruction [43] and image collection-based reconstruction [45]. In [14], high-quality 3D face rigs are obtained from monocular RGB video based on a multi-layer model. Even real-time facial reconstruction and reenactment has been achieved [54, 20]. Compared to optimization-based approaches, ours differs in two main regards. First, our network efficiently regresses model parameters without requiring iterative optimization. Second, given a cropped face image, our method does not require an initialization of the model parameters, which is a significant advantage over optimization-based techniques.

Deep Learning for Coarse Face Reconstruction. The detection of facial landmarks in images is an active area of research [57, 23]. Various approaches are based on deep learning, including convolutional neural network (CNN) cascades [50, 63], a deep face shape model based on Restricted Boltzmann Machines [58], a recurrent network with long short-term memory [61], a recurrent encoder-decoder network for real-time facial landmark detection in video [38], and a two-stage convolutional part heatmap regression approach [5]. In [40], a multi-task CNN is trained to predict several face-related parameters (e.g., pose, gender, age) in addition to facial landmarks. These deep learning approaches share common limitations: they are trained in a supervised manner and predict only sparse information. In contrast, our approach works unsupervised and obtains a dense reconstruction by regressing generative model parameters.

Deep Learning for Dense Face Reconstruction. Apart from the approaches mentioned above, there exist several dense deep learning approaches. A multilayer generative model based on deep belief networks for the generation of