Geometry-Aware Face Completion and Editing
The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Linsen Song, Jie Cao, Lingxiao Song, Yibo Hu, Ran He∗
National Laboratory of Pattern Recognition, CASIA
Center for Research on Intelligent Perception and Computing, CASIA
Center for Excellence in Brain Science and Intelligence Technology, CAS
University of Chinese Academy of Sciences, Beijing 100190, China
[email protected], fjie.cao,[email protected], flingxiao.song,[email protected]

∗Ran He is the corresponding author.
Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Face completion is a challenging generation task because it requires generating visually pleasing new pixels that are semantically consistent with the unmasked face region. This paper proposes a geometry-aware Face Completion and Editing NETwork (FCENet) by systematically studying facial geometry from the unmasked region. First, a facial geometry estimator is learned to estimate facial landmark heatmaps and parsing maps from the unmasked face image. Then, an encoder-decoder generator completes the face image and disentangles its mask areas, conditioned on both the masked face image and the estimated facial geometry images. In addition, since manually labeled masks have a low-rank property, a low-rank regularization term is imposed on the disentangled masks, which forces our completion network to handle occlusion areas of various shapes and sizes. Furthermore, our network can generate diverse results from the same masked input by modifying the estimated facial geometry, which provides a flexible means to edit the completed face appearance. Extensive experimental results qualitatively and quantitatively demonstrate that our network is able to generate visually pleasing face completion results and edit face attributes as well.

Figure 1: Face completion and attributes editing results. The first and third rows show the visual quality of face completion under the guidance of inferred facial geometry. In the second row, the girl's eyes become smaller after her facial geometry is changed. In the fourth row, the boy's mouth becomes much bigger after the mouth patch in the facial geometry images is replaced with a bigger mouth patch.

Introduction

Face completion, also known as face inpainting, aims to complete a face image with a masked region or missing content. As a common face image editing technique, it can also be used to edit face attributes. The generated face image can either be as accurate as the original face image or be coherent with the surrounding context, so that the completed image looks visually realistic. Most traditional methods (Barnes et al. 2009; Huang et al. 2014; Darabi et al. 2012) rely on low-level cues to search for patches to synthesize missing content. These image completion algorithms are good at filling backgrounds that contain similar textures, but they do not excel at specific image objects such as faces, since prior domain knowledge is not well incorporated.

Recently, CNN-based image completion methods with an adversarial training strategy have significantly boosted the image completion performance (Pathak et al. 2016; Yeh et al. 2017; Li et al. 2017). These methods take a masked image as the network input to learn context features, then restore the missing content from the learned features. Some challenges remain when these methods are applied to real-world face completion problems. First, unlike general objects, human faces have a distinct geometry distribution, and hence a face geometry prior is more likely to facilitate face completion (Yeh et al. 2017). Few existing methods utilize facial prior knowledge and incorporate it well into an end-to-end network. Second, these algorithms are incapable of modifying the face attributes of the filled region (Li et al. 2017), e.g., editing the eye shape or mouth size as in Figure 1.

To address these two problems, this paper studies a geometry-aware face completion and editing network (FCENet) by exploring facial geometry priors, including facial landmark heatmaps and parsing maps. A landmark heatmap is an image composed of 68 points labeled with different values to distinguish different face components. Similarly, a facial parsing map is an image with different values representing different face components (e.g., eyes, nose, mouth, hair, cheek, and background). They not only provide a hint of the target face's geometry for face completion but also point to a novel way of modifying the face attributes of masked regions. The FCENet consists of three stages. In the first stage, facial parsing maps and landmark heatmaps are inferred from masked face images. In the second stage, we concatenate the masked image, the inferred landmark heatmaps, and the parsing maps as the input of a face completion generator to restore the uncorrupted face images and the mask. In the third stage, two discriminators distinguish generated face images from real ones globally and locally, to force the generated face images to be as realistic as possible. Furthermore, the low-rank regularization helps the completion network disentangle more complex masks from the corrupted face image. Our FCENet can efficiently generate face images with a variety of attributes, depending on the facial geometry images concatenated with the masked face image. A few face completion and editing examples of FCENet are shown in Figure 1.

The main contributions of this paper are summarized as follows:

1. We design a novel network called the facial geometry estimator to estimate reasonable facial geometry from masked face images. This facial geometry is leveraged to guide the face completion task. Several experiments systematically demonstrate the performance improvements from different facial geometry, including facial landmarks, facial parsing maps, and both together.

2. The FCENet allows interactive facial attribute editing of the generated face image by simply modifying its inferred facial geometry images. Therefore, face completion and face attribute editing are integrated into one unified framework.

3. Our face completion generator simultaneously accomplishes face completion and mask disentanglement from the masked face image. A novel low-rank loss regularizes the disentangled mask to further enhance the similarity between the disentangled mask and the original mask, which enables our method to handle various masks with different shapes and sizes.

4. Experiments on the CelebA (Liu et al. 2015b) and Multi-PIE (Gross et al. 2010) datasets demonstrate the superiority of our approach over the existing approaches.

Related Work

Generative Adversarial Network (GAN). As one of the most significant image generation techniques, GAN (Goodfellow et al. 2014) utilizes a mini-max adversarial game to train a generator and a discriminator alternately. The adversarial training method encourages the generator to generate realistic outputs to fool the discriminator. Radford et al. propose a deep convolutional GAN (DCGAN) (Radford, Metz, and Chintala 2016) to generate high-quality images on multiple datasets. It is the first work to integrate deep models into GAN. To address the training instability of GAN, Arjovsky et al. (Arjovsky, Chintala, and Bottou 2017) analyze the causes of training instability theoretically and propose Wasserstein GAN. Mao et al. (Mao et al. 2017) put forward LSGAN to avoid the vanishing gradient problem during the learning process. Recently, improved GAN architectures have achieved great success on super-resolution (Ledig et al. 2017), style transfer (Li and Wand 2016), image inpainting (Pathak et al. 2016), face attribute manipulation (Shen and Liu 2017), and face rotation (Huang et al. 2017; Hu et al. 2018). Motivated by these successful solutions, we develop the FCENet based on GAN.

Image Completion. Image completion techniques can be broadly divided into three categories. The first category exploits the diffusion equation to iteratively propagate low-level features from the context region to the missing area along the boundaries (Bertalmio et al. 2000; Elad et al. 2005). These methods excel at inpainting small holes and superimposed text or lines but are limited in reproducing large textured regions. The second category is patch-based methods, which observe the context of the missing content and search for similar patches in the same image or in external image databases (Darabi et al. 2012; Bertalmio et al. 2003; Criminisi, Pérez, and Toyama 2004; Hays and Efros 2007). These methods can achieve ideal completion results on backgrounds like sky and grass, but they fail to generate semantically meaningful new pixels if suitable patches do not exist in the databases. The third category is CNN-based methods, which train an encoder-decoder network to extract image features from the context and generate the missing content according to those features (Pathak et al. 2016; Iizuka, Simo-Serra, and Ishikawa 2017). Pathak et al. (Pathak et al. 2016) present an unsupervised visual feature learning algorithm called the context encoder to produce a plausible hypothesis for the missing part conditioned on its surroundings. A pixel-wise reconstruction loss and an adversarial loss regularize the filled content to resemble its surrounding area in both appearance and semantic segmentation. Iizuka et al. (Iizuka, Simo-Serra, and Ishikawa 2017) put forward an approach that generates inpainted images that are both locally and globally consistent by training a global and a local context discriminator. Their method is able to complete images of arbitrary resolutions and various obstructed shapes by training a fully-convolutional neural network.

Face Completion. Human face completion is much more challenging than general image completion tasks because facial components (e.g., eyes, nose, and mouth) are highly structured and contain large appearance variations. Besides, human faces exhibit a symmetrical structure, like many natural objects. Second, compared to general object image completion, face completion needs to pay more