AvatarMe: Realistically Renderable 3D Facial Reconstruction "in-the-wild"

Alexandros Lattas1,2  Stylianos Moschoglou1,2  Baris Gecer1,2  Stylianos Ploumpis1,2  Vasileios Triantafyllou2  Abhijeet Ghosh1  Stefanos Zafeiriou1,2
1Imperial College London, UK  2FaceSoft.io
{a.lattas,s.moschoglou,b.gecer,s.ploumpis,ghosh,s.zafeiriou}@imperial.ac.uk  [email protected]

Figure 1: From left to right: Input image; Predicted reflectance (diffuse albedo, diffuse normals, specular albedo and specular normals); Rendered reconstruction in different environments, with detailed reflections; Rendered result with head completion.

Abstract

Over the last years, with the advent of Generative Adversarial Networks (GANs), many face analysis tasks have accomplished astounding performance, with applications including, but not limited to, face generation and 3D face reconstruction from a single "in-the-wild" image. Nevertheless, to the best of our knowledge, there is no method which can produce high-resolution photorealistic 3D faces from "in-the-wild" images, and this can be attributed to: (a) the scarcity of available data for training, and (b) the lack of robust methodologies that can successfully be applied to very high-resolution data. In this paper, we introduce AvatarMe, the first method that is able to reconstruct photorealistic 3D faces from a single "in-the-wild" image with an increasing level of detail. To achieve this, we capture a large dataset of facial shape and reflectance, build on a state-of-the-art 3D texture and shape reconstruction method, and successively refine its results while generating the per-pixel diffuse and specular components that are required for realistic rendering. As we demonstrate in a series of qualitative and quantitative experiments, AvatarMe outperforms existing methods by a significant margin and reconstructs authentic, 4K-by-6K-resolution 3D faces from a single low-resolution image that, for the first time, bridge the uncanny valley.

1. Introduction

The reconstruction of 3D face geometry and texture is one of the most popular and well-studied fields at the intersection of computer vision, graphics and machine learning. Apart from its countless applications, it demonstrates the power of recent developments in scanning, learning and synthesizing 3D objects [3, 44]. Recently, mainly due to the advent of deep learning, tremendous progress has been made in the reconstruction of a smooth 3D face geometry, even from images captured in arbitrary recording conditions (also referred to as "in-the-wild") [13, 14, 33, 36, 37]. Nevertheless, even though the geometry can be inferred somewhat accurately, rendering a reconstructed face in arbitrary virtual environments requires much more information than a smooth 3D geometry, i.e., skin reflectance as well as high-frequency normals. In this paper, we propose a meticulously designed pipeline for the reconstruction of high-resolution render-ready faces from "in-the-wild" images captured in arbitrary poses, lighting conditions and occlusions. A result from our pipeline is showcased in Fig. 1.

The seminal work in the field is the 3D Morphable Model (3DMM) fitting algorithm [3]. The facial texture and shape reconstructed by the 3DMM algorithm always lies in a space spanned by a linear basis learned by Principal Component Analysis (PCA). The linear basis, even though remarkable at representing the basic characteristics of the reconstructed face, fails to reconstruct high-frequency details in texture and geometry. Furthermore, the PCA model fails to represent the complex structure of facial texture captured "in-the-wild"; therefore, 3DMM fitting usually fails on "in-the-wild" images. Recently, 3DMM fitting has been extended to use a PCA model on robust features, i.e., Histograms of Oriented Gradients (HoGs) [8], for representing facial texture [4]. The method has shown remarkable results in reconstructing the 3D facial geometry from "in-the-wild" images; nevertheless, it cannot reconstruct facial texture as accurately.

With the advent of deep learning, many regression methods using an encoder-decoder structure have been proposed to infer 3D geometry, reflectance and illumination [6, 14, 33, 35, 36, 37, 39, 44]. Some of these methods demonstrate that it is possible to reconstruct shape and texture even in real time on a CPU [44]. Nevertheless, due to various factors, such as the use of basic reflectance models (e.g., the Lambertian reflectance model), the use of synthetic data, or mesh convolutions on colored meshes, the methods of [33, 35, 36, 37, 39, 44] fail to reconstruct highly detailed texture and shape that are render-ready. Furthermore, in many of the above methods the reconstructed texture and shape lose many of the identity characteristics of the original image.

Arguably, the first generic method to demonstrate that high-quality texture and shape can be reconstructed from single "in-the-wild" images is the recently proposed GANFIT method [14]. GANFIT can be described as an extension of the original 3DMM fitting strategy, with the following differences: (a) instead of a PCA texture model, it uses a Generative Adversarial Network (GAN) [23] trained on large-scale, high-resolution UV maps, and (b) in order to preserve the identity in the reconstructed texture and shape, it uses features from a state-of-the-art face recognition network [11]. However, the reconstructed texture and shape are not render-ready, because (a) the texture contains baked illumination, and (b) the method cannot reconstruct high-frequency normals or specular reflectance.

Early attempts to infer photorealistic, render-ready information from single "in-the-wild" images have been made in the line of research of [6, 20, 32, 42]. Arguably, some of the results showcased in the above papers are of high quality. Nevertheless, these methods do not generalize, since: (a) they directly manipulate and augment the low-quality and potentially occluded input facial texture instead of reconstructing it, so the quality of the final reconstruction always depends on the input image; (b) the employed 3D model is not very representative; and (c) only a very small number of subjects (e.g., 25 in [42]) were available for training the high-frequency details of the face. Thus, while closest to our work, these approaches focus on easily creating a digital avatar rather than on high-quality, render-ready face reconstruction from "in-the-wild" images, which is the goal of our work.

In this paper, we propose the first, to the best of our knowledge, methodology that produces high-quality, render-ready face reconstructions from arbitrary images. In particular, our method builds upon recent reconstruction methods (e.g., GANFIT [14]) and, contrary to [6, 42], does not apply high-frequency estimation algorithms to the original input, which could be of very low quality, but to a GAN-generated high-quality texture. Using a light stage, we have collected a large-scale dataset with reflectance and geometry samples of over 200 subjects, and we train image translation networks that estimate (a) diffuse and specular albedo, and (b) diffuse and specular normals. We demonstrate that it is possible to produce render-ready faces from arbitrary inputs (pose, occlusion, etc.), including portraits and face sketches, which can be realistically relit in any environment.

2. Related Work

2.1. Facial Geometry and Reflectance Capture

Debevec et al. [9] first proposed employing a specialized light stage setup to acquire the reflectance field of a human face for photorealistic image-based relighting applications. They also employed the acquired data to estimate a few view-dependent reflectance maps for rendering. Weyrich et al. [41] employed an LED sphere and 16 cameras to densely record facial reflectance, and computed view-independent estimates of facial reflectance from the acquired data, including per-pixel diffuse and specular albedos and per-region specular roughness parameters. These initial works employed dense capture of facial reflectance, which is somewhat cumbersome and impractical.

Ma et al. [27] introduced polarized spherical gradient illumination (using an LED sphere) for efficient acquisition of separated diffuse and specular albedos and photometric normals of a face using just eight photographs, and demonstrated high-quality facial geometry, including skin mesostructure, as well as realistic rendering with the acquired data. It was, however, restricted to frontal-viewpoint acquisition due to the view-dependent polarization pattern on the LED sphere. Subsequently, Ghosh et al. [15] extended polarized spherical gradient illumination to multi-view facial acquisition by employing two orthogonal spherical polarization patterns. Their method allows the capture of separated diffuse and specular reflectance and photometric normals from any viewpoint around the equator of the LED sphere, and can be considered the state of the art in high-quality facial capture. Recently, Kampouris et al. [22] demonstrated how to employ unpolarized binary spherical gradient illumination for

Figure 2: Overview of the proposed method. A 3DMM is fitted to an "in-the-wild" input image and a completed UV texture is synthesized, while optimizing for the identity match between the rendering and the input. The texture is up-sampled 8 times to synthesize plausible high-frequency details. We then use an image translation network to de-light the texture and obtain the diffuse albedo with high-frequency details. Then, separate networks infer the specular albedo, diffuse normals and specular normals (in tangent space) from the diffuse albedo and the 3DMM shape normals. The networks are trained on 512 × 512 patches, and inference is run on 1536 × 1536 patches with a sliding window.
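The tiled inference described in the caption of Fig. 2 can be sketched as follows: a network trained on small patches is slid over the full-resolution texture with overlapping windows, and overlapping predictions are averaged. This is a minimal illustration of the general strategy only; the function name and the `net` callable are hypothetical and this is not the authors' code.

```python
import numpy as np

def sliding_window_inference(image, net, patch=1536, stride=768):
    """Run a patch-based network over a large texture map.

    `image` is an (H, W, C) array with H, W >= patch; `net` maps a
    (patch, patch, C) tile to an array of the same shape. Overlapping
    predictions are accumulated and averaged. Hypothetical helper
    illustrating the tiling strategy from Fig. 2, not the authors' code.
    """
    h, w, c = image.shape
    out = np.zeros((h, w, c), dtype=np.float64)
    weight = np.zeros((h, w, 1), dtype=np.float64)
    ys = list(range(0, h - patch + 1, stride))
    xs = list(range(0, w - patch + 1, stride))
    # Ensure the last window reaches the right/bottom border.
    if ys[-1] != h - patch:
        ys.append(h - patch)
    if xs[-1] != w - patch:
        xs.append(w - patch)
    for y in ys:
        for x in xs:
            tile = image[y:y + patch, x:x + patch]
            out[y:y + patch, x:x + patch] += net(tile)
            weight[y:y + patch, x:x + patch] += 1.0
    return out / weight
```

Averaging the overlaps hides seams between tiles; in practice a feathered (e.g., Gaussian) weight per tile would suppress boundary artifacts further.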
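The spherical gradient illumination principle behind the capture systems of Section 2.1 admits a compact numerical sketch: for a Lambertian surface lit by linear gradient patterns P_i(ω) = (1 + ω_i)/2 along each axis, the difference 2·I_i − I_c (where I_c is the image under constant full-on illumination) is proportional to the i-th component of the surface normal, so normalizing the stacked differences recovers per-pixel photometric normals and cancels the albedo. The helper below is an illustrative sketch of that principle (cf. Ma et al. [27]), not the authors' capture pipeline; function and argument names are hypothetical.

```python
import numpy as np

def photometric_normals(i_x, i_y, i_z, i_c, eps=1e-8):
    """Estimate per-pixel unit normals from spherical gradient images.

    i_x, i_y, i_z: (H, W) images under gradient patterns (1 + w_i)/2
    along x, y, z; i_c: (H, W) image under constant illumination.
    For Lambertian reflection, 2*I_i - I_c is proportional to n_i,
    so normalizing the stacked differences yields the unit normal.
    """
    g = np.stack([2 * i_x - i_c, 2 * i_y - i_c, 2 * i_z - i_c], axis=-1)
    norm = np.linalg.norm(g, axis=-1, keepdims=True)
    return g / np.maximum(norm, eps)
```

The normalization step is why the recovered direction is independent of the (unknown) diffuse albedo; polarization, as used in [27, 15], additionally separates the diffuse and specular contributions before this computation.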