Recovering Facial Reflectance and Geometry from Multi-view Images

Guoxian Song∗, Jianmin Zheng, Jianfei Cai, Tat-Jen Cham
Nanyang Technological University
∗The corresponding author.

ABSTRACT

While the problem of estimating shapes and diffuse reflectances of human faces from images has been extensively studied, there is relatively less work done on recovering the specular albedo. This paper presents a lightweight solution for inferring photorealistic facial reflectance and geometry. Our system processes video streams from two views of a subject, and outputs two reflectance maps for the diffuse and specular albedos, as well as a vector map of surface normals. A model-based optimization approach is used, consisting of the three stages of multi-view face model fitting, facial reflectance inference and facial geometry refinement. Our approach is based on a novel formulation built upon the 3D morphable model (3DMM) for representing 3D textured faces, in conjunction with the Blinn-Phong reflection model. It has the advantage of requiring only a simple setup with two video streams, and is able to exploit the interaction between the diffuse and specular reflections across multiple views as well as time frames. As a result, the method is able to reliably recover high-fidelity facial reflectance and geometry, which facilitates various applications such as generating photorealistic facial images under new viewpoints or illumination conditions.

Figure 1: Given image streams from two views, our method can infer the 3D geometry as well as the diffuse and specular albedos of faces, together with the environment lighting. This allows rendering to new views with or without specular reflections.

CCS CONCEPTS
• Computing methodologies → Motion capture;

KEYWORDS
3D facial reconstruction; specular estimation; multi-view

1 INTRODUCTION

This paper considers the problem of recovering facial reflectance and geometry from images, with emphasis on inferring the specular albedo, which is important as real faces often exhibit bright highlights due to specular reflection. Accurately recovering the specular component facilitates more realistic rendering of digital faces in new environments, which is useful in many applications such as augmented and virtual reality, games, entertainment [16, 23, 30] and telepresence [6, 12, 20, 33]. It also helps in the recovery of high-fidelity 3D geometry for faces, which is a fundamental problem in computer vision and graphics.

Modeling high-quality 3D geometry and reflectance from images is a very challenging task due to the complex yet ambiguous interaction across facial geometry, reflectance properties, illumination and camera parameters. Great success has been achieved via sophisticated systems that require specialized hardware, such as appearance measurement devices, in studio environments where the lighting can be fully controlled [9, 14, 21]. However, these professional configurations are cumbersome and costly. In contrast, various shape-from-X algorithms [15, 33] requiring less stringent capture conditions may alternatively be used. Due to the highly nonlinear effects of the specular component, most of these methods assume approximately Lambertian surfaces and thus ignore specular reflection, which is detrimental to the accuracy of the modeling.

There are some approaches that handle non-Lambertian objects, but many of these have very limiting assumptions, or focus only on removing specularities. For example, Kumar et al. [19] proposed a method for reconstructing human faces that takes both diffuse and specular reflections into account. Magda and Zickler proposed two approaches to reconstruct surfaces with arbitrary BRDFs [22]. However, both works require constraints on the locations of point light sources, or a large collection of images generated from a fixed viewpoint with a single point light source.
There are also works that infer facial albedo or geometry based on spherical harmonics lighting. For example, Saito et al. [26] proposed to infer photorealistic texture details from a high-resolution face texture database using a deep neural network. However, it does not reconstruct fine-scale geometric detail for photorealistic rendering. Guo et al. [15] proposed real-time CNN-based 3D face capture from an RGB camera in a coarse-to-fine manner, which uses a deep neural network to regress a coarse shape and refine the details based on the input image. Recently, Yamaguchi et al. [35] proposed a learning-based technique to infer high-quality facial reflectance and geometry from a single unconstrained image. It acquired training data by employing 7 high-resolution DSLR cameras and used polarized gradient spherical illumination to obtain diffuse and specular albedos, as well as medium- and high-frequency geometric displacement maps, but this setup demands high effort and cost.

Our work is inspired by the success of [15] and [35]. However, different from [15], which relies on PCA-based diffuse albedo estimation and view-based geometric refinement, our method refines the diffuse albedo and estimates the specular albedo from the input images via the Blinn-Phong model, and we also introduce a geometric normal refinement method to improve the 3D normal displacement map. Instead of expensively collecting high-quality training data as in [35] using a sophisticated multi-view system, our method is more widely applicable to cost-sensitive scenarios, and does not require prior data collection.
Diffuse and specular reflections behave differently under viewpoint changes. As specular reflection depends on the angle of the surface normal to the camera but diffuse reflection does not, we propose to take frames from video streams to separate the effects caused by the diffuse albedo and the surface normal. Moreover, we build our formulation upon the 3D morphable model (3DMM), considering that this PCA (principal component analysis) based facial parametric model provides a compact representation for 3D faces and encodes some intrinsic structures of the faces. By non-trivially integrating all these new ideas, we provide a reliable and lightweight solution to recovering facial reflectance (including diffuse albedo and specular albedo) and geometry from two-view images. In addition, we collected a multi-view facial expression dataset of 15 subjects from different countries with different skin conditions for evaluation. The experiments demonstrate that our approach can substantially improve face recovery fidelity and produce photorealistic rendering at new viewpoints (see Figure 1).

The main contributions of the paper are twofold:

• We propose a reliable and lightweight solution for faithfully recovering facial appearances with specularity, which combines multi-view face model fitting, facial reflectance inference and geometry refinement. The solution is built on the 3D morphable model (3DMM) for representing 3D textured faces, and uses a simple setup of two-view video streams as input for separating the behavior of diffuse and specular reflections across multiple views and time frames.

• We propose two optimization algorithms to recover the specular albedo based on the Blinn-Phong reflection model and to refine the normal map for surfaces generated by PCA-based methods. Our idea of using the Blinn-Phong model for specular faces can be incorporated into many existing Lambertian-based 3D facial modeling works and applications. In addition, we collected a test dataset with video streams of 15 subjects with different ethnicities, which will be released.

2 RELATED WORK

This section briefly reviews some relevant work.

Capturing facial appearance. Capturing facial appearance, including detailed geometry and reflectance, is of great interest in both academia and industry, and has been studied extensively. In particular, various sophisticated systems have been designed. For example, Debevec et al. developed the Light Stage technique, which can capture fine facial appearances [9]. The technique was improved by using specialized devices such as camera arrays and LED spheres [13, 14, 21], which provide production-level measurement of reflectance. Recently, there has been an effort to design deployable and portable systems for photorealistic facial capture. Tan et al. [31] introduced a cost-efficient framework based on two RGB-D inputs with depth sensors such as Kinect to align point clouds from two cameras for high-quality 3D facial reconstruction. Our method simply needs two color webcams.

Parametric facial models. There has also been a lot of research done on representing human faces using parametric models. The earliest works include Eigenfaces [34] and the active appearance model (AAM) [10], which represent facial appearance using linear models and facilitate many applications. Smith et al. proposed several works focusing on recovering facial shape using surface normals from a parametric model [28, 29]. A well-known and widely-used model is the 3D morphable model (3DMM) proposed by Blanz and Vetter [2], which models textured 3D faces by analysis-by-synthesis and disentangles a 3D face into three sets of parameters: facial identity, expression and albedo. The basic idea of 3DMM leverages Principal Component Analysis (PCA) and transforms the shape and texture of example faces into a vector representation. 3DMM can be computed efficiently, but it is limited to the linear space spanned by the training samples and fails to capture fine geometric and appearance details.

Reflection models. Synthesizing an image of a face requires reflection modeling, which is used to determine the colour in a particular view. Reflectance models usually contain different reflection components such as diffuse colour, specular colour and ambient colour. Separating each component from an input image is a well-studied problem. Early approaches [1] recover diffuse and specular colours by analysis of colour histogram distributions, under the piecewise-constant surface colour assumption. A more recent approach [32] derives a pseudo diffuse image that exhibits the same geometric profile as the diffuse component of the input image, and then iteratively removes highlights and propagates the maximum diffuse component to neighboring pixels. Some approaches [17, 36] employ a dark channel prior in generating the pseudo diffuse images.
Learning-based reflectance recovery. Recently, learning-based techniques have been developed to infer facial geometry and colour properties from images. For example, Guo et al. [15] synthesized large datasets and trained a coarse-to-fine CNN framework for real-time 3D face reconstruction based on 3DMM with RGB inputs. Yu et al. [37] presented a neural network for dense facial correspondences in highly unconstrained RGB images. These works achieve impressive results on facial geometric fitting or reconstruction, but their reflectance shading assumption is Lambertian. Our work extends those methods to consider specular effects, which leads to more photorealistic rendering. Yamaguchi et al. [35] used a deep learning approach to infer high-fidelity facial reflectance and geometry from a single image based on a large-scale dataset with ground truth, which is costly to obtain.

3 OVERVIEW

The pipeline of our approach is illustrated in Figure 2. The whole process consists of three stages, shown from left to right. The first stage is multi-view 3D face model fitting, which fits a 3D face model to two streaming views. We adapt an inverse-rendering based optimization method [15, 33] to generate an initial diffuse albedo and normal map for each time instance. The second stage is facial reflectance inference, which estimates lighting and iteratively extracts the specular albedo map and a refined diffuse albedo map, by minimizing image differences once the Blinn-Phong reflection model is introduced to cater for specular reflection. The third stage is geometry refinement, which refines the initial normal map based on the inferred reflectance maps and refined lighting conditions. The output of the whole process is a 3D face mesh model with a fine (detailed) geometry normal map and two reflectance maps for the diffuse and specular albedos. For subsequent new two-view time frames, we use the previously inferred reflectance maps in fitting to the newly updated facial geometry together with a refined normal map, and are thus able to adapt to changes in facial expressions. In the following sections, we describe the three stages in detail.

Figure 2: The pipeline of our approach.

4 MULTI-VIEW 3D FACE MODEL FITTING

Given a pair of unconstrained images from two views, we calculate a 3D face mesh S, an initial diffuse albedo map C_d, and the rigid face pose \{R, t, s\} denoting the rotation, translation and scale, respectively. In this paper we only consider two views, but the method can be easily extended to multiple views. The fitting module is designed upon the 3DMM model [3], similar to [15, 33], but with two-view images as input. Once the 3D face mesh is created, we parameterize it to a UV map using a common parametrization method. Then we generate a normal map of the 3D face mesh in the UV map domain and project the diffuse albedo of each visible view into the UV domain as well. We explain the fitting module below.

4.1 Illumination and shading

Many previous reconstruction works [15, 26, 33] assume surfaces to be Lambertian. To simplify the computation, in the initial stage we also follow this assumption and temporarily ignore the specular component. Thus the shading model for generating facial images in this stage consists of two components:

    I_{ren} = I_a + I_d    (1)

where I_a and I_d represent the ambient and diffuse components. We approximate the global illumination using second-order spherical harmonic (SH) functions [25]. The diffuse component can be expressed as:

    I_d(p, n, C_d) = C_d(p) \cdot \sum_{k=1}^{B^2} l_k \, \phi_k(n(p))    (2)

where p denotes the pixel, C_d(p) is the diffuse albedo, n(p) is the normal vector on the 3D facial surface corresponding to the pixel p in the UV space, \phi_k are the SH basis functions, and l_k are the SH coefficients to be estimated. We use the first B = 3 bands for the global illumination.
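To make Eqs. (1)-(2) concrete, below is a minimal NumPy sketch of the second-order SH shading. The names (`sh_basis`, `render_lambertian`, `l_sh`) and the array layout are our own illustration rather than the authors' code, and the constants assume the standard real-valued SH basis of [25], with any cosine-lobe convolution factors folded into the coefficients l_k.

```python
import numpy as np

def sh_basis(n):
    """Real-valued SH basis for the first B = 3 bands (B^2 = 9 functions),
    evaluated at unit normals n of shape (..., 3)."""
    x, y, z = n[..., 0], n[..., 1], n[..., 2]
    return np.stack([
        0.282095 * np.ones_like(x),                     # band 0
        0.488603 * y, 0.488603 * z, 0.488603 * x,       # band 1
        1.092548 * x * y, 1.092548 * y * z,             # band 2
        0.315392 * (3.0 * z ** 2 - 1.0),
        1.092548 * x * z, 0.546274 * (x ** 2 - y ** 2),
    ], axis=-1)

def render_lambertian(C_d, normals, l_sh, I_a):
    """Eqs. (1)-(2): I_ren = I_a + C_d(p) * sum_k l_k * phi_k(n(p)).
    C_d: (H, W, 3) albedo map; normals: (H, W, 3); l_sh: (9,) coefficients."""
    shading = sh_basis(normals) @ l_sh          # (H, W) scalar shading
    return I_a + C_d * shading[..., None]       # broadcast over RGB channels
```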
4.2 Parametric 3D face model

We represent the 3D face geometry and albedo using a 3D morphable model (3DMM) as the parametric face model, with z = 30k vertices and 59k faces. The 3D face shape geometry S and diffuse colour albedo C_d are compactly encoded via PCA:

    S = \bar{S} + A^{id} x^{id} + A^{exp} x^{exp}    (3)
    C_d = \bar{C}_d + A^{alb} x^{alb}    (4)

where \bar{S} and \bar{C}_d denote respectively the 3D shape and diffuse albedo of the average face, A^{id} \in \mathbb{R}^{3z \times 100} and A^{alb} \in \mathbb{R}^{3z \times 100} are the principal axes extracted from a set of textured 3D meshes with a neutral expression from the Basel Face Model (BFM) [24], A^{exp} \in \mathbb{R}^{3z \times 79} represents the principal axes trained on the offsets between the expression meshes and the neutral meshes of individual persons from FaceWarehouse [7], and x^{id} \in \mathbb{R}^{100}, x^{exp} \in \mathbb{R}^{79} and x^{alb} \in \mathbb{R}^{100} are the corresponding coefficient vectors that characterize a specific 3D face model.
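As a sanity check on Eqs. (3)-(4), a linear 3DMM synthesis step is just two matrix-vector products. The sketch below uses hypothetical array names of our own choosing and assumes the bases are stored as dense matrices of the stated sizes.

```python
import numpy as np

def synthesize_face(S_mean, Cd_mean, A_id, A_exp, A_alb, x_id, x_exp, x_alb):
    """Eq. (3): S = S_mean + A_id x_id + A_exp x_exp, and
    Eq. (4): C_d = Cd_mean + A_alb x_alb.
    Shapes: S_mean, Cd_mean (3z,); A_id, A_alb (3z, 100); A_exp (3z, 79)."""
    S = S_mean + A_id @ x_id + A_exp @ x_exp
    C_d = Cd_mean + A_alb @ x_alb
    # reshape the stacked coordinates into per-vertex positions / colours
    return S.reshape(-1, 3), C_d.reshape(-1, 3)
```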

4.3 3D face model fitting

In fitting a 3D face model to the 2D views, the unknown parameters to be estimated are X = \{x^{id}, x^{alb}, x^{exp}, R, t, s, l, I_a\}, where I_a is a scalar value for the ambient colour. We estimate them by minimizing the following objective function:

    E(X) = E_{con}(X) + w_l E_{lan}(X) + w_r E_{reg}(X)

where w_l, w_r are trade-off parameters balancing the photo-consistency term E_{con}, the landmark term E_{lan}, and the regularization term E_{reg}. We set w_l = 10 and w_r = 5 \cdot 10^{-5} in our experiments.

The photo-consistency term E_{con} expresses the desire to minimize the difference between the synthetic face images and the input images. Since we have two view images as input, we formulate the term E_{con} of [33] to include these two views:

    E_{con}(X) = \sum_{view} \frac{1}{|M_{view}|} \sum_{p \in M_{view}} \| I_{ren} - I_v \|^2,

where I_v is one view of the real face and I_{ren} is the corresponding rendered face from Eq. (1), p \in M_{view} denotes a facial pixel in the facial pixel set M_{view} of the view image, and view represents the left or right view.

The landmark term E_{lan} encourages the projected facial key vertices of the shape to be close to the corresponding detected landmarks. It is formulated as

    E_{lan}(X) = \sum_{view} \frac{1}{|F_{view}|} \sum_{f_i \in F_{view}} \| f_i - \Pi_{view}(R S_i + t) \|^2,

where f_i \in F_{view} is a 2D facial landmark detected in each view using the method of [5], and \Pi_{view} denotes the projection from the 3D face onto the corresponding view image plane. Using two view images improves the robustness of the fitting process, especially in situations with large pose variation.

The regularization term E_{reg} penalizes facial parameters that are outliers, based on the normal distribution implied by PCA:

    E_{reg}(X) = \sum_{i=1}^{100} \left[ \left( \frac{x_i^{id}}{\sigma_i^{id}} \right)^2 + \left( \frac{x_i^{alb}}{\sigma_i^{alb}} \right)^2 \right] + \sum_{i=1}^{79} \left( \frac{x_i^{exp}}{\sigma_i^{exp}} \right)^2.

We solve the minimization problem with a Gauss-Newton iterative solver implemented in a parallel toolkit [8]. After solving for X for each pair of input images, we extract the visible-view albedo and normal map and project them into the UV texture maps.
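The energy E(X) is straightforward to assemble once a renderer and a camera projection are available. The sketch below is our own pseudocode-level illustration in NumPy; `render_view` and `project_landmarks` are hypothetical stand-ins for the inverse-rendering and projection steps, and the paper minimizes this energy with Gauss-Newton [8] rather than the naive evaluation shown here.

```python
import numpy as np

def fitting_energy(X, views, render_view, project_landmarks,
                   sigma, w_l=10.0, w_r=5e-5):
    """E(X) = E_con + w_l * E_lan + w_r * E_reg (Section 4.3).
    Each view dict carries an image, a facial-pixel mask M and landmarks F."""
    E_con = E_lan = 0.0
    for v in views:
        I_ren = render_view(X, v)                       # Eq. (1) rendering
        diff = (I_ren - v["image"])[v["mask"]]          # pixels in M_view
        E_con += np.mean(np.sum(diff ** 2, axis=-1))
        f_proj = project_landmarks(X, v)                # Pi_view(R S_i + t)
        E_lan += np.mean(np.sum((v["landmarks"] - f_proj) ** 2, axis=-1))
    # PCA prior over the stacked identity / albedo / expression coefficients
    x = np.concatenate([X["x_id"], X["x_alb"], X["x_exp"]])
    E_reg = np.sum((x / sigma) ** 2)
    return E_con + w_l * E_lan + w_r * E_reg
```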

5 FACIAL REFLECTANCE INFERENCE

In this section, we infer the specular albedo and refine the previously estimated diffuse albedo and lighting, while fixing the facial geometry. To this end, the Blinn-Phong reflectance model [4] is used to describe the shading, and an optimization framework is formulated on the two-view images.

5.1 Blinn-Phong reflection model

The Blinn-Phong reflection model is one of the common reflectance models that account for specular reflection. It computes the specular reflection at p by

    I_s(p, n, v, C_s) = \begin{cases} L(p) \, C_s(p) \, (n(p) \cdot h(p))^m, & n \cdot h > 0 \\ 0, & n \cdot h \le 0 \end{cases}    (5)

where L(p) is the incident lighting intensity, C_s(p) is the specular albedo, h(p) = \frac{L + v(p)}{|L + v(p)|} is the halfway vector between the normalized view vector v(p) towards the viewer (camera) and the normalized lighting vector L towards the light source (see Figure 3), and m is a shininess constant for the material, which controls the concentration of the specular reflection.

Figure 3: Illustration of the vectors used in the Blinn-Phong model.

For the lighting represented by spherical harmonic functions, we use Sloan's method [27] to extract the dominant lighting direction. The local lighting intensity can then be calculated by dividing the spherical harmonics function value by the dot product of the dominant direction and the surface normal.

Eq. (1) can then be combined with the specular component to give the total reflection:

    I_{total} = I_a + I_d + I_s    (6)

where I_a, I_d and I_s are the ambient, diffuse and specular reflection intensities, respectively.
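A per-pixel sketch of Eq. (5) and the combined shading of Eq. (6), again under our own naming: `light_dir` is assumed to be the dominant lighting direction extracted with Sloan's method [27], and all direction maps are assumed unit-normalized.

```python
import numpy as np

def blinn_phong_specular(L_int, C_s, n, v, light_dir, m=5.0):
    """Eq. (5): I_s = L * C_s * (n . h)^m with h = (L + v) / |L + v|.
    n, v: (H, W, 3) unit normal / view-vector maps; light_dir: (3,) unit vector."""
    h = light_dir + v
    h = h / np.linalg.norm(h, axis=-1, keepdims=True)
    n_dot_h = np.clip(np.sum(n * h, axis=-1), 0.0, None)  # zero where n . h <= 0
    return L_int * C_s * n_dot_h ** m

def render_total(I_a, I_d, I_s):
    """Eq. (6): total reflection = ambient + diffuse + specular."""
    return I_a + I_d + I_s
```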
5.2 Optimization

To extract the specular albedo, we introduce a scalar map C_s which stores the specular albedos. To refine the diffuse albedo, we introduce a diffuse displacement map \delta_{C_d} that is an offset for the initial diffuse albedo map C_d. We then estimate the diffuse displacement map and the specular albedo map, together with the lighting, for a subject across time t. This is done by finding Y = \{\delta_{C_d}, C_s, I_a, l\}, with l = \{l_k\} representing the SH coefficients of the lighting, as the solution to the following problem:

    \arg\min_Y \; E_O + \lambda_S E_S + \lambda_H E_H
    \quad \text{subject to} \quad 0 \le C_d + \delta_{C_d} \le 255, \;\; 0 \le C_s \le 255    (7)

where E_O, E_S and E_H are the data term, specular difference term and smoothness term, respectively, while \lambda_S and \lambda_H are two trade-off factors balancing the terms. Below we explain E_O, E_S and E_H in detail.

Data term. The rendering equation of Eq. (6) can be re-written with the specular albedo C_s and the new diffuse albedo C_d + \delta_{C_d}:

    I_{total}(Y, p) = I_a + I_d(C_d + \delta_{C_d}, l, p) + I_s(C_s, l, p)

The data term E_O measures the difference between the rendered image I_{total}(Y, p) and the corresponding observed images I_{t,view} of each view at time t:

    E_O(Y) = \sum_t \sum_{view} \sum_{p \in M_{view}} \| I_{t,view}(p) - I_{total}(Y, p) \|_2^2    (8)

All images of the same subject are recorded in a consistent environment. Thus the ambient and lighting conditions are estimated globally for all images.

Specular difference loss term. Since ambient and diffuse reflections are view independent, the image difference between the two views is mainly caused by the difference in specular reflection. We use this property to estimate the specular reflection in the overlapping region of the two views. To get a full-face specular albedo map, we construct the objective function from several pairs (I_{l,t}, I_{r,t}) obtained across different timestamps t. We aim to minimize the specular difference loss \| \tilde{I}_t - \tilde{I}_{s,t} \|, where \tilde{I}_t = I_{l,t} - I_{r,t} is the difference between the captured images in the two views, while \tilde{I}_{s,t} = I_s(p, n_t, v_{l,t}, C_s) - I_s(p, n_t, v_{r,t}, C_s) is the difference between the rendered images in the two views, with the left and right view vectors given by v_{l,t} and v_{r,t} respectively. The view vector maps can be obtained after the 3D face model fitting described in the previous section. The overall loss is thus written as follows:

    E_S(C_s, l) = \sum_t \sum_{p \in M_r \cap M_l} \| \tilde{I}_t(p) - \tilde{I}_{s,t}(p, l, C_s(p)) \|^2

Smoothness term. To constrain the gradients of the diffuse displacements as well as the specular albedos to be locally smooth, the smoothness term E_H is introduced:

    E_H(\delta_{C_d}, C_s) = \sum_p \left( \| \Delta \delta_{C_d}(p) \|_1 + \| \Delta C_s(p) \|_1 \right),

where \Delta is the Laplacian operator. Here the l_1 norm is used to better preserve sharp discontinuities.

To solve the minimization problem of Eq. (7), the parameters are estimated sequentially. We first update the lighting l and ambient colour I_a, followed by the specular albedo map C_s, and finally the diffuse displacement \delta_{C_d}. The optimization is iterated until convergence. We set the trade-off parameters \{\lambda_S, \lambda_H\} to \{0.1, 0.001\} and empirically set the shininess exponent m to 5 in our experiments. For timestamps, we choose 3 or 4 pairs of frames of a subject with different poses. The initial diffuse albedo, SH coefficients and ambient scalar are set to the mean values derived from the X solved for all frames in the previous section. The initial specular albedo is set to a zero map. The energy function is optimized by the Adam [18] solver in parallel.
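The sequential update scheme can be read as block-coordinate descent, each block solved with Adam [18]. The following PyTorch sketch is illustrative only: `energy` is a hypothetical stand-in returning E_O + λ_S E_S + λ_H E_H built from differentiable renderings of Eqs. (6) and (8), the in-place clamps implement the box constraints of Eq. (7), and tensor-valued clamp bounds assume a recent PyTorch version.

```python
import torch

def solve_reflectance(energy, Y, C_d, iters=200, lr=1e-2):
    """Block-coordinate minimization of Eq. (7): lighting/ambient first,
    then the specular map C_s, then the diffuse displacement dC_d.
    Y maps names {"l", "I_a", "C_s", "dC_d"} to leaf tensors with grads."""
    for block in (["l", "I_a"], ["C_s"], ["dC_d"]):
        opt = torch.optim.Adam([Y[k] for k in block], lr=lr)
        for _ in range(iters):
            opt.zero_grad()
            energy(Y).backward()
            opt.step()
            with torch.no_grad():               # box constraints of Eq. (7)
                Y["C_s"].clamp_(0.0, 255.0)
                Y["dC_d"].clamp_(-C_d, 255.0 - C_d)
    return Y
```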
6 GEOMETRY REFINEMENT

Once facial reflectance inference is completed, we proceed to refine the normal map obtained from the multi-view face model fitting. For this purpose, we introduce a correction map \delta_n that is used to correct the normal vectors stored in the normal map, so as to improve the matching between the rendered images and the input images. The correction map is computed by minimizing the following objective function:

    E(\delta_n) = \sum_{view} \sum_{p \in M_{view}} \| I_{view}(p) - I_{total}(n(p) + \delta_n(p)) \|^2 + w_1 \| \Delta \delta_n \|_1 + w_2 \| \delta_n \|_2^2,    (9)

where I_{total}(n(p) + \delta_n(p)) is the image generated by the rendering equation Eq. (6) with the new normal vector n(p) + \delta_n(p), the term \| \delta_n \|_2^2 penalizes large perturbations, the Laplacian term in the l_1 norm, \| \Delta \delta_n \|_1, is a smoothness term that still allows sharp discontinuities, and w_1 and w_2 are trade-off parameters. We set w_1 and w_2 to 10^{-3} and 0.3 in our experiments, and use spherical coordinates to represent normals for a smoother space. The minimization problem is solved using the Adam [18] solver in parallel.

Once the normal map is refined, we can update the vertex positions of the facial mesh to improve the geometry by a least-squares process.

Note that the diffuse and specular albedos are independent of time and view. Once they are computed, they remain the same for newly arriving images. Thus for a new incoming pair of images, whose geometry may change due to different facial expressions or other factors, we only need to recover the facial geometry. This can be done by our proposed procedure with fixed diffuse and specular albedos, which greatly simplifies the computation.
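For Eq. (9), the correction map can likewise be optimized with Adam. This sketch is our own illustration, not the authors' implementation: it parameterizes δ_n as a 2-channel spherical-coordinate offset map, and `render_from_normals` is a hypothetical stand-in for a differentiable rendering of Eq. (6).

```python
import torch
import torch.nn.functional as F

def laplacian_l1(t):
    """|Delta t|_1 via a discrete 5-point Laplacian on a (C, H, W) map."""
    k = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])
    k = k.view(1, 1, 3, 3).repeat(t.shape[0], 1, 1, 1)
    return F.conv2d(t[None], k, padding=1, groups=t.shape[0]).abs().sum()

def refine_normals(render_from_normals, views, n0, iters=100, w1=1e-3, w2=0.3):
    """Eq. (9): photo error + w1 * |Delta dn|_1 + w2 * |dn|_2^2, solved for dn.
    n0: (2, H, W) spherical-coordinate normal map; each view dict holds an
    image and a {0,1} facial-pixel mask broadcastable against it."""
    dn = torch.zeros_like(n0, requires_grad=True)
    opt = torch.optim.Adam([dn], lr=1e-2)
    for _ in range(iters):
        opt.zero_grad()
        loss = sum((((render_from_normals(n0 + dn, v) - v["image"]) ** 2)
                    * v["mask"]).sum() for v in views)
        loss = loss + w1 * laplacian_l1(dn) + w2 * (dn ** 2).sum()
        loss.backward()
        opt.step()
    return (n0 + dn).detach()
```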

7 EXPERIMENTS

We recorded videos from three views for each subject: two views are used as input to recover the facial reflectance and geometry, and the third view is used as the ground-truth view for evaluation. All the videos were captured simultaneously at 30 FPS with 680x480 resolution. For synchronization, each camera is controlled through an individual PCI port with a UDP communication module.

More specifically, we recorded videos of 15 persons (12 male and 3 female) with different ethnicities. Each subject was required to rotate his or her head and perform different facial actions based on the Facial Action Coding System (FACS) [11], such as opening and closing the mouth, and various expressions. We used the first few frames to infer the reflectance model, which includes the global lighting, ambient colour, diffuse albedo texture and specular albedo. For the remaining frames, we obtained the initial geometry normals and further refined the normal directions to recover subtle skin deformations caused by facial movement. Finally, we used the reflectance model and the refined geometry normals to render the face in the ground-truth viewpoint, for purposes of evaluation and comparison.

PCA vs Ours. The PCA-based Lambertian model (Section 4.3) is widely used in 3D facial reconstruction. Figure 4 shows the diffuse albedo recovered by our method and by the PCA-based Lambertian model. It can be seen that our diffuse albedos reveal high-frequency skin features such as moles, while the PCA model only recovers a low-frequency diffuse albedo map that is unable to capture the fine details. We further compare the geometry refinement in Figure 5. The PCA-based approach uses a low-dimensional basis to represent geometry, while ours is able to further refine the mesh normals so as to better capture detailed surface variation (e.g., regions around the mouth).

Figure 4: Comparison of our method with the PCA method in recovery of diffuse albedo. "Ours-" is rendered with our diffuse albedo. "PCA" is rendered using the diffuse albedo obtained by the PCA method, where appearance features like moles are missing in the rectangle regions.

Figure 5: Comparison in recovering geometry. From left to right are the ground truth image, diffuse colour rendering using the geometry normals recovered by the PCA based method [2], diffuse colour rendering (Ours-) using our refined geometry normals, and diffuse and specular colour rendering (Ours) using our refined geometry normals.

Specular effect. Figure 6 shows that, given a pair of frames and the inferred reflectance textures, our method can recover photorealistic faces with specular effects in new viewpoints. Compared to the view-independent Lambertian model, our approach can synthesize more plausible results under view changes or facial movement. More results can be found in the supplementary video.

Figure 6: Rendering results at different view angles. The top images show the two-view inputs with inferred reflectance textures and the corresponding fitted base mesh result. The bottom images show the rendered faces (with refined normals) as a virtual camera rotates about a vertical axis from −35° to 35°: the first row is rendered with specular effects and the second row is rendered without specular effects.

7.1 Face tracking comparison

While face tracking is not the focus of our work, Figure 7 shows that our multi-view 3D face tracking can handle extreme facial poses compared to a state-of-the-art single-view method. Guo et al. [15] use a network to regress 3DMM parameters and reconstruct 3D meshes from an RGB image. In most cases, both methods produce similar high-quality results (these can be found in the supplementary video). The closeups in Figure 7 show that our method yields a closer face fit in an extreme pose, which is particularly visible at the silhouette of the input face. In general, our multi-view fitting can handle more challenging poses and is more stable than methods using only monocular input.

Figure 7: Comparison of 3D face fitting with the CoarseNet of Guo et al. [15] in an extreme pose. From left to right are the fitting overlap results on the left view, the right view, and a zoomed-in detail of the green window. Note that Guo et al. [15] perform fitting twice, once for each view, while our method performs the fitting simultaneously for both views.

7.2 Capture performance comparison

In Figure 8, we qualitatively compare our approach with classic PCA-based 3D face reconstruction [2] and the state-of-the-art CNN-based FineNet [15] for 8 subjects. The PCA-based method reconstructs 3D faces and textures using regressed coefficients for a low-dimensional basis, and has difficulty in synthesizing skin appearance details like wrinkles and moles. FineNet [15] first regresses coarse 3D faces from the input images and then refines the 3D face geometry details, but it does not refine the texture. Moreover, both PCA and FineNet assume Lambertian surfaces and cannot render specular effects. Our method handles geometry and texture refinement as well as specular effects, which leads to more photorealistic results. A video comparison can be found in the supplementary video.

Figure 8: Rendering results of different subjects. On the left side are the input two-view frames, recovered reflectance maps and fitted base mesh projected to the ground-truth view. On the right side are the ground-truth view and overlapped facial rendering results obtained from the PCA based method [2], FineNet [15], our method without specular effects, and our method with specular effects. The RMSE is given in the lower-right corner of each rendered image.

In Table 1, we quantitatively measure the ability of our approach to faithfully render photorealistic faces. We tested 1500 frames from two views of 15 subjects with different expressions and synthesized a third view, for which we had the corresponding ground truth images. Compared to the PCA based method [2] and FineNet [15], our approach obtains the highest PSNR and the lowest RMSE.

Table 1: Quantitative results of face rendering, measured using the PSNR and the root-mean-square error (RMSE) in the same overlap facial region¹.

    Methods                          PSNR   RMSE
    PCA [2]                          24.6   15.0
    FineNet [15]                     25.5   13.6
    Our method (without specular)    26.8   11.6
    Our method (with specular)       28.3    9.8

¹The facial mesh in FineNet is larger than ours. For a fair comparison, we calculate the error only within the same facial region from our face projection.
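For reference, PSNR and RMSE over a shared facial mask (per the footnote of Table 1) can be computed as follows. This is a generic metric sketch, not the authors' evaluation script, and it assumes 8-bit images.

```python
import numpy as np

def masked_psnr_rmse(I_pred, I_gt, mask):
    """PSNR / RMSE restricted to the overlap facial region (cf. Table 1).
    I_pred, I_gt: (H, W, 3) uint8 images; mask: (H, W) boolean facial region."""
    diff = I_pred.astype(np.float64)[mask] - I_gt.astype(np.float64)[mask]
    rmse = np.sqrt(np.mean(diff ** 2))
    psnr = 20.0 * np.log10(255.0 / rmse)
    return psnr, rmse
```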

7.3 Timing

All our experiments were conducted on a PC with an Intel Core i7-7700K CPU at 3.5 GHz, equipped with two GeForce GTX 1080 GPUs with 8 GB memory. Following the pipeline in Figure 2, our multi-view 3D face model fitting takes less than 2 seconds, the reflectance inference consumes about 42 seconds for one subject, and the geometry refinement with 100 iterations takes 2 seconds per pair of views.

8 CONCLUSIONS

We have described a model-based method to recover photorealistic facial reflectance and geometry from two video streams of a subject in two views. By jointly estimating the initial facial geometry and texture map, jointly inferring specular and diffuse reflectance, and further refining the geometry while utilizing the two-view information, our approach demonstrates appreciable improvements over the prior art. It allows for better reconstruction of the shape of faces with specular reflections and more compelling rendering of faces with specular effects under new viewpoints.

Our approach has some limitations. For example, we use an iterative optimization approach, which is not as fast as single-pass network-based approaches. However, the data generated by our method can be used as pseudo ground truth for improving the training of such network-based methods. In addition, our method assumes a consistent lighting environment. For dynamic lighting, we would need to re-estimate the lighting condition for each frame. Furthermore, our approach may not accurately recover unmodeled details such as kerchiefs and hair, which do not correspond to semantic features of human faces and may interfere with the fitting process used to recover the corresponding texture maps.

9 ACKNOWLEDGMENTS

We would like to thank Yudong Guo and Juyong Zhang from the University of Science and Technology of China for providing comparison results. We also thank Chuanxia Zhen, Yujun Cai, Zhijie Zhang, Ayan Kumar Bhunia, and others for data collection. This research is supported by the Singtel Cognitive and Artificial Intelligence Lab for Enterprises at NTU.

REFERENCES
[1] Ruzena Bajcsy, Sang Wook Lee, and Aleš Leonardis. 1996. Detection of diffuse and specular interface reflections and inter-reflections by color image segmentation. International Journal of Computer Vision 17 (1996), 241–272.
[2] Volker Blanz and Thomas Vetter. 1999. A Morphable Model for the Synthesis of 3D Faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '99). ACM Press/Addison-Wesley Publishing Co., New York, NY, USA, 187–194. https://doi.org/10.1145/311535.311556
[3] Volker Blanz and Thomas Vetter. 1999. A morphable model for the synthesis of 3D faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques. ACM Press/Addison-Wesley Publishing Co., 187–194.
[4] James F. Blinn. 1977. Models of Light Reflection for Computer Synthesized Pictures. SIGGRAPH Comput. Graph. 11, 2 (July 1977), 192–198. https://doi.org/10.1145/965141.563893
[5] Adrian Bulat and Georgios Tzimiropoulos. 2017. How far are we from solving the 2D & 3D Face Alignment problem? (and a dataset of 230,000 3D facial landmarks). In International Conference on Computer Vision.
[6] Chen Cao, Derek Bradley, Kun Zhou, and Thabo Beeler. 2015. Real-time High-fidelity Facial Performance Capture. ACM Trans. Graph. 34, 4, Article 46 (July 2015), 9 pages. https://doi.org/10.1145/2766943
[7] Chen Cao, Yanlin Weng, Shun Zhou, Yiying Tong, and Kun Zhou. 2014. FaceWarehouse: A 3D facial expression database for visual computing. IEEE Transactions on Visualization and Computer Graphics 20, 3 (2014), 413–425.
[8] Shane Cook. 2013. CUDA Programming: A Developer's Guide to Parallel Computing with GPUs (1st ed.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
[9] Paul Debevec, Tim Hawkins, Chris Tchou, Haarm-Pieter Duiker, Westley Sarokin, and Mark Sagar. 2000. Acquiring the Reflectance Field of a Human Face. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '00). ACM Press/Addison-Wesley Publishing Co., New York, NY, USA, 145–156. https://doi.org/10.1145/344779.344855

[10] G. J. Edwards, C. J. Taylor, and T. F. Cootes. 1998. Interpreting Face Images Using Active Appearance Models. In Proceedings of the 3rd International Conference on Face & Gesture Recognition (FG '98). IEEE Computer Society, Washington, DC, USA, 300–. http://dl.acm.org/citation.cfm?id=520809.796067
[11] Paul Ekman. 1978. Facial Action Coding System. Consulting Psychologists Press.
[12] Pablo Garrido, Michael Zollhöfer, Dan Casas, Levi Valgaerts, Kiran Varanasi, Patrick Pérez, and Christian Theobalt. 2016. Reconstruction of Personalized 3D Face Rigs from Monocular Video. ACM Trans. Graph. 35, 3, Article 28 (May 2016), 15 pages. https://doi.org/10.1145/2890493
[13] Abhijeet Ghosh, Graham Fyffe, Borom Tunwattanapong, Jay Busch, Xueming Yu, and Paul Debevec. 2011. Multiview Face Capture Using Polarized Spherical Gradient Illumination. ACM Trans. Graph. 30, 6, Article 129 (Dec. 2011), 10 pages. https://doi.org/10.1145/2070781.2024163
[14] Abhijeet Ghosh, Tim Hawkins, Pieter Peers, Sune Frederiksen, and Paul Debevec. 2008. Practical Modeling and Acquisition of Layered Facial Reflectance. ACM Trans. Graph. 27, 5, Article 139 (Dec. 2008), 10 pages. https://doi.org/10.1145/1409060.1409092
[15] Yudong Guo, Juyong Zhang, Jianfei Cai, Boyi Jiang, and Jianmin Zheng. 2018. 3DFaceNet: Real-time Dense Face Reconstruction via Synthesizing Photo-realistic Face Images. IEEE Transactions on Pattern Analysis and Machine Intelligence abs/1708.00980 (2018). arXiv:1708.00980 http://arxiv.org/abs/1708.00980
[16] Liwen Hu, Shunsuke Saito, Lingyu Wei, Koki Nagano, Jaewoo Seo, Jens Fursund, Iman Sadeghi, Carrie Sun, Yen-Chun Chen, and Hao Li. 2017. Avatar Digitization from a Single Image for Real-time Rendering. ACM Trans. Graph. 36, 6, Article 195 (2017), 14 pages. https://doi.org/10.1145/3130800.3130887
[17] Hyeongwoo Kim, Hailin Jin, Sunil Hadap, and In-So Kweon. 2013. Specular Reflection Separation Using Dark Channel Prior. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, 1460–1467.
[18] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. CoRR abs/1412.6980 (2014). arXiv:1412.6980 http://arxiv.org/abs/1412.6980
[19] Ritwik Kumar, Angelos Barmpoutis, Arunava Banerjee, and Baba C. Vemuri. 2011. Non-Lambertian reflectance modeling and shape recovery of faces using tensor splines. IEEE Transactions on Pattern Analysis and Machine Intelligence 33, 3 (2011), 533–567.
[20] Hao Li, Laura Trutoiu, Kyle Olszewski, Lingyu Wei, Tristan Trutna, Pei-Lun Hsieh, Aaron Nicholls, and Chongyang Ma. 2015. Facial Performance Sensing Head-Mounted Display. ACM Transactions on Graphics (Proceedings SIGGRAPH 2015) 34, 4 (July 2015).
[21] Wan-Chun Ma, Tim Hawkins, Pieter Peers, Charles-Felix Chabert, Malte Weiss, and Paul Debevec. 2007. Rapid Acquisition of Specular and Diffuse Normal Maps from Polarized Spherical Gradient Illumination. In Proceedings of the 18th Eurographics Conference on Rendering Techniques (EGSR '07). Eurographics Association, Aire-la-Ville, Switzerland, 183–194. https://doi.org/10.2312/EGWR/EGSR07/183-194
[22] Sebastian Magda, David J. Kriegman, Todd Zickler, and Peter N. Belhumeur. 2001. Beyond Lambert: Reconstructing surfaces with arbitrary BRDFs. In Proceedings of the Eighth IEEE International Conference on Computer Vision, Vol. 2. IEEE, 391–398.
[23] Kyle Olszewski, Joseph J. Lim, Shunsuke Saito, and Hao Li. 2016. High-Fidelity Facial and Speech Animation for VR HMDs. ACM Transactions on Graphics (Proceedings SIGGRAPH Asia 2016) 35, 6 (December 2016).
[24] Pascal Paysan, Reinhard Knothe, Brian Amberg, Sami Romdhani, and Thomas Vetter. 2009. A 3D face model for pose and illumination invariant face recognition. In Advanced Video and Signal Based Surveillance (AVSS '09), Sixth IEEE International Conference on. IEEE, 296–301.
[25] Ravi Ramamoorthi and Pat Hanrahan. 2001. An Efficient Representation for Irradiance Environment Maps. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '01). ACM, New York, NY, USA, 497–500. https://doi.org/10.1145/383259.383317
[26] Shunsuke Saito, Lingyu Wei, Liwen Hu, Koki Nagano, and Hao Li. 2016. Photorealistic Facial Texture Inference Using Deep Neural Networks. CoRR abs/1612.00523 (2016). arXiv:1612.00523 http://arxiv.org/abs/1612.00523
[27] Peter-Pike Sloan. 2008. Stupid Spherical Harmonics (SH) Tricks. (2008). http://www.ppsloan.org/publications/StupidSH36.pdf
[28] William A. P. Smith and Edwin R. Hancock. 2008. Facial Shape-from-shading and Recognition Using Principal Geodesic Analysis and Robust Statistics. Int. J. Comput. Vision 76, 1 (Jan. 2008), 71–91. https://doi.org/10.1007/s11263-007-0074-8
[29] W. A. P. Smith and E. R. Hancock. 2006. Recovering Facial Shape Using a Statistical Model of Surface Normal Direction. IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 12 (Dec. 2006), 1914–1930. https://doi.org/10.1109/TPAMI.2006.251
[30] Guoxian Song, Jianfei Cai, Tat-Jen Cham, Jianmin Zheng, Juyong Zhang, and Henry Fuchs. 2018. Real-time 3D Face-Eye Performance Capture of a Person Wearing VR Headset. In Proceedings of the 26th ACM International Conference on Multimedia (MM '18). ACM, New York, NY, USA, 923–931. https://doi.org/10.1145/3240508.3240570
[31] Fuwen Tan, Chi-Wing Fu, Teng Deng, Jianfei Cai, and Tat-Jen Cham. 2017. FaceCollage: A Rapidly Deployable System for Real-time Head Reconstruction for On-The-Go 3D Telepresence. In Proceedings of the 2017 ACM on Multimedia Conference (MM '17). ACM, New York, NY, USA, 64–72. https://doi.org/10.1145/3123266.3123281
[32] Robby T. Tan and Katsushi Ikeuchi. 2005. Separating Reflection Components of Textured Surfaces Using a Single Image. IEEE Trans. Pattern Anal. Mach. Intell. 27, 2 (Feb. 2005), 178–193. https://doi.org/10.1109/TPAMI.2005.36
[33] J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner. 2016. Face2Face: Real-time Face Capture and Reenactment of RGB Videos. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE.
[34] M. Turk and A. Pentland. 1991. Eigenfaces for recognition. Journal of Cognitive Neuroscience 3, 1 (1991), 71–86.
[35] Shuco Yamaguchi, Shunsuke Saito, Koki Nagano, Yajie Zhao, Weikai Chen, Kyle Olszewski, Shigeo Morishima, and Hao Li. 2018. High-fidelity Facial Reflectance and Geometry Inference from an Unconstrained Image. ACM Trans. Graph. 37, 4, Article 162 (July 2018), 14 pages. https://doi.org/10.1145/3197517.3201364
[36] Qingxiong Yang, Shengnan Wang, and Narendra Ahuja. 2010. Real-time Specular Highlight Removal Using Bilateral Filtering. In Proceedings of the 11th European Conference on Computer Vision: Part IV (ECCV '10). Springer-Verlag, Berlin, Heidelberg, 87–100. http://dl.acm.org/citation.cfm?id=1888089.1888097
[37] Ronald Yu, Shunsuke Saito, Haoxiang Li, Duygu Ceylan, and Hao Li. 2017. Learning Dense Facial Correspondences in Unconstrained Images. CoRR abs/1709.00536 (2017). arXiv:1709.00536 http://arxiv.org/abs/1709.00536
