High-Quality Face Capture Using Anatomical Muscles

Michael Bao1,2  Matthew Cong2,†  Stéphane Grabli2,†  Ronald Fedkiw1,2
1Stanford University  2Industrial Light & Magic
1{mikebao,rfedkiw}@stanford.edu  †{mcong,sgrabli}@ilm.com

Abstract

Muscle-based systems have the potential to provide both anatomical accuracy and semantic interpretability as compared to blendshape models; however, a lack of expressivity and differentiability has limited their impact. Thus, we propose modifying a recently developed rather expressive muscle-based system in order to make it fully-differentiable; in fact, our proposed modifications allow this physically robust and anatomically accurate muscle model to conveniently be driven by an underlying blendshape basis. Our formulation is intuitive, natural, as well as monolithically and fully coupled such that one can differentiate the model from end to end, which makes it viable for both optimization and learning-based approaches for a variety of applications. We illustrate this with a number of examples including both shape matching of three-dimensional geometry as well as the automatic determination of a three-dimensional facial pose from a single two-dimensional RGB image without using markers or depth information.

1. Introduction

Muscle simulation-based animation systems are attractive due to their ability to preserve important physical properties such as volume conservation as well as their ability to handle contact and collision. Moreover, utilizing an anatomically motivated set of controls provides a straightforward way of extracting semantic meaning from the control values. Unfortunately, even though [42] was able to automatically compute muscle activation values given sparse data, muscle-based animation models have proven to be significantly less expressive and harder to control than their blendshape-based counterparts [30].

Recently, [14] introduced a novel method that significantly improved upon the expressiveness of muscle-based animation systems. They introduced the concept of "muscle tracks" to control the deformation of the underlying musculature. This concept gives the muscle simulation enough expressiveness to target arbitrary shapes, which allowed it to be used in high-quality movie productions such as Kong: Skull Island where it was used both to aid in the creation of blendshapes and to offer physically-based corrections to artist-created animation sequences [15, 29]. While [14] alleviates the problems of muscle-based simulation in regards to expressiveness and control, the method is geared towards generative problems and is thus not amenable to estimating a facial pose from a two-dimensional image as is common for markerless performance capture. One could iterate between solving for a performance using blendshapes and then using a muscle-based solution to correct the blendshapes; however, this iterative method is lossy as the muscle simulation does not have access to the raw data and may thus hallucinate details or erase details of the performance.

In this paper, we extend [14] by combining the ease of use and differentiability of traditional blendshape models with expressive, physically-plausible muscle track simulations in order to create a differentiable simulation framework that can be used interchangeably with traditional blendshape models for facial performance capture and animation. Instead of relying on a non-differentiable per-frame volumetric morph to drive the muscle track deformation as in [14], we create a state-of-the-art blendshape model for each muscle, which is then used to drive its volumetric deformation. Our model maintains the expressiveness of [14] while preserving crucial physical properties. Furthermore, our new formulation is differentiable from end to end, which allows it to be used to target three-dimensional facial poses as well as two-dimensional RGB images. We demonstrate that our blendshape muscle tracks method shows significant improvements in anatomical plausibility and semantic interpretability when compared to state-of-the-art blendshape-based methods for targeting three-dimensional geometry and two-dimensional RGB images.

2. Related Work

Face Models: Although our work does not directly address the modeling part of the pipeline, it relies on having a pre-existing model of the face. For building a realistic digital double of an actor, multi-view stereo techniques can

be used to collect high-quality geometry and texture information in a variety of poses [5, 6, 16]. Artists can then use this data to create the final blendshape model. In state-of-the-art models, the deformation model will include non-linear skinning/enveloping in addition to linear blendshapes to achieve more plausible deformations [30]. On the other hand, more generalized digital face models would be more useful in cases where the target actor is not known beforehand. Such models include the classic Blanz and Vetter model [8], the Basel Face Model (BFM) [36, 37], FaceWarehouse [11], and the Large Scale Facial Model (LSFM) [9]. Recent models such as the FLAME model [31] have begun to introduce non-linear deformations by using skinning and corrective blendshapes. These models tend to be geared towards real-time applications and as a result have a low number of vertices.

Face Capture: A more comprehensive review of facial performance capture techniques can be found in [53]. To date, marker based techniques have been the most popular for capturing facial performances for both real-time applications and feature films. Helmet mounted cameras (HMCs) are often used to stereo track a sparse set of markers on the face. These markers are then used as constraints in an optimization to find blendshape weights [7]. In many real-time applications, pre-applied markers are generally not an option so 2D features [10, 12, 50], depth images [12, 26, 49], or low-resolution RGB images [48] are often used instead. More recently, methods using neural networks have been used to reconstruct face geometry [24, 41] and estimate facial control parameters [25, 27]. Analysis-by-synthesis techniques have also been explored for capturing facial performances [38].

Face Simulation: [42] was one of the first to utilize quasistatic simulation to drive the deformation of a 3D face, especially for motion capture. There has also been interest in using quasistatic simulations to drive muscle deformations in the body [23, 45, 46]. However, in general, facial muscle simulations tend to be less expressive than their artist-driven blendshape counterparts. More recently, significant work has been done to make muscle simulations more expressive [14, 21]. While these methods can be used to target data in the form of geometry, it is unclear how to cleanly transfer these methods to target non-geometry data such as two-dimensional RGB images. Other work has been done to try to introduce physical simulations into the blendshape models themselves [3, 4, 22, 28]; however, these works do not focus on the inverse problem.

3. Blendshape Model

As discussed in Section 2, there are many different types of blendshape models, and we refer interested readers to [30] for a more thorough overview of the existing literature. We focus on the state-of-the-art hybrid blendshape deformation model that is the basis of our method introduced in Section 6. A hybrid blendshape model refers to a deformation model that uses both linear blendshapes and linear blend skinning to deform the vertices of the mesh. Our model contains a single 6-DOF joint for the jaw. We can succinctly write the model given the blendshape parameters b and joint parameters j as

x(b, j) = T(j)(n + Bb)    (1)

where n is the neutral shape, B is the blendshape deltas matrix, and T(j) contains the linear blend skinning matrix, i.e. a transformation matrix due to a change in the jaw joint, for each vertex. Note that n + Bb is often referred to as the pre-skinning shape and Bb as the pre-skinning displacements. More complex animation systems include corrective shapes and intermediate controls, and thus we let w denote a broader set of animator controls which we treat as our independent variable, rewriting Equation 1 as

x(w) = T(j(w))(n + Bb(w))    (2)

where j(w) and b(w) may include non-linearities such as non-linear corrective blendshapes.

4. Muscle Model

We create an anatomical model of the face consisting of the cranium, jaw, and a tetrahedralized flesh mesh with embedded muscles for a given actor using the method of [13]. Since we desire parity with the facial model used to deform the face surface, we define the jaw joint as a 6-DOF joint equivalent to the one used to skin the face surface in Section 3. Traditionally, face simulation models have been controlled using a vector of muscle activation parameters which we denote as a. We use the same constitutive model for the muscle as [45, 46], which consists of an isotropic Mooney-Rivlin term, a quasi-incompressibility term, and an anisotropic passive/active muscle response term. The finite-volume method [45, 47] is used to compute the force on each vertex of the tetrahedralized flesh mesh given the current 1st Piola-Kirchhoff stress computed using the constitutive model and the current deformation gradient. Some vertices of the flesh mesh X^C are constrained to kinematically follow along with the cranium/jaw, and the steady state position is implicitly defined as the positions of the unconstrained flesh mesh vertices X^U which make the sum of all relevant forces identically 0, i.e. f(X^C, X^U) = 0. One can decompose the forces into a sum of the finite-volume forces and collision penalty forces

f_fvm(X^C, X^U, a) + f_collisions(X^C, X^U) = 0.    (3)

One can further break down the finite-volume forces into the passive force f_p and active force f_a. Then using the
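Equation 1 is simple to prototype. The following is a minimal NumPy sketch of the hybrid blendshape evaluation; the array shapes and function name are our own illustrative assumptions rather than the paper's implementation, and the per-vertex 3x4 skinning transforms T(j) are assumed to be precomputed from the jaw pose j:

```python
import numpy as np

def evaluate_blendshape(n, B, b, T):
    """Hybrid blendshape model, Equation 1: x(b, j) = T(j)(n + Bb).

    n : (V, 3)    neutral shape
    B : (V, 3, K) blendshape deltas for K shapes
    b : (K,)      blendshape weights
    T : (V, 3, 4) per-vertex affine skinning transforms for the jaw pose j
    """
    pre_skin = n + B @ b  # pre-skinning shape n + Bb
    homog = np.concatenate([pre_skin, np.ones((len(n), 1))], axis=1)  # (V, 4)
    return np.einsum('vij,vj->vi', T, homog)  # apply each vertex's 3x4 matrix
```

With identity transforms and b = 0, this reduces to the neutral shape n, consistent with the definition of the pre-skinning shape above.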

fact that the active muscle response is scaled linearly by the muscle activation a [51], we can rewrite the finite-volume force as

f_fvm(X^C, X^U, a) = f_p(X^C, X^U) + a f_a(X^C, X^U).    (4)

We refer interested readers to [42, 45, 46, 47] for derivations of the aforementioned forces and their associated Jacobians with respect to the flesh mesh vertices. Given a vector of muscle activations and cranium/jaw parameters, Equation 3 can be solved using the Newton-Raphson method to compute the unconstrained flesh mesh vertex positions X^U.

5. Muscle Tracks

The muscle tracks simulation introduced by [14] modifies the framework described in Section 4 such that the muscle deformations are primarily controlled by a volumetric morph [1, 13] rather than directly using muscle activation values. [14] first creates a correspondence between the neutral pose n of the blendshape system and the outer boundary surface of the tetrahedral mesh X^b. Then, given a blendshape target expression x*(b, j) with surface mesh displacements x* − n, [14] creates target displacements δX^b for the outer boundary of the tetrahedral mesh. Using δX^b as Dirichlet boundary conditions, [14] solves a Poisson equation for the displacements δX = X − X_0, i.e. ∇²δX = 0, where X_0 are the rest-state vertex positions consistent with the neutral pose n. Neumann boundary conditions are used on the inner boundary of the tetrahedral mesh. Afterwards, zero-length springs are attached between the tetrahedralized flesh mesh vertices interior to each muscle and their corresponding target locations resulting from the Poisson equation. The muscle track force resulting from the zero-length springs for each muscle m has the form

f_tracks,m = K_m(M_m − I_m X^U)    (5)

where K_m is the per-muscle spring stiffness matrix, I_m is the selector matrix for the flesh mesh vertices interior to the muscle, and M_m are the target locations resulting from the volumetric morph. Thus the expanded quasistatics equation can be written as

f_fvm + f_collisions + f_tracks = 0    (6)

where f_tracks includes Equation 5 for every muscle. Since the activation values a are no longer specified manually, they must be computed automatically given the final post-morph shape of a muscle to reintroduce the effects of muscle tension into the simulation. [14] barycentrically embeds a piecewise linear curve into each muscle and uses the length of that curve to determine an appropriate activation value.

6. Blendshape-Driven Muscle Tracks

The morph from Section 5 was designed in the spirit of the computer graphics pipeline, and as such, does not allow for the sort of full end-to-end coupling that facilitates differentiability, inversion, and other typical inverse problem methodologies. Thus, our key contribution is to replace the morphing step with a blendshape deformation in the form of Equation 1 to drive the muscle volumes and their center-line curves, thereby creating a direct functional connection between the animator controls w and the muscle tracks target locations M_m and activation values a.

For each muscle, we create a tetrahedralized volume M_m^0 and piecewise linear center-line curve C_m^0 in the neutral pose. Furthermore, for each blendshape in the face surface model, we use the morph from [14] to create a corresponding shape for each muscle's tetrahedralized volume M_m^k and center-line curve C_m^k, where k is used to denote the kth blendshape. Alternatively, one could morph and subsequently simulate as in Section 5 using tracks in order to create M_m^k and C_m^k. In addition, we assign skinning weights to each vertex in M_m^0 and C_m^0 and assemble them into linear blend skinning transformation matrices T_m^M and T_m^C. This allows us to write

M_m(b, j) = T_m^M(j)(M_m^0 + Σ_k M_m^k b_k)    (7)

C_m(b, j) = T_m^C(j)(C_m^0 + Σ_k C_m^k b_k)    (8)

which parallel Equation 1. Notably, we are able to obtain Equations 7 and 8 in part because we solve the Poisson equation on the pre-skinning neutral as compared to [14] which uses the post-skinning neutral. In addition, this better prevents linearized rotation artifacts from diffusing into the tetrahedralized flesh mesh. Finally, we can write the length of each center-line curve as

L(C_m(b, j)) = Σ_i ||C_m,i(b, j) − C_m,i−1(b, j)||_2    (9)

where C_m,i(b, j) is the ith vertex of the piecewise linear center-line curve for the mth muscle.

To justify our approach, we can write the linear system to solve the Poisson equation as A^U(X_0)δX = A^C(X_0)Bb where A^U(X_0) is the portion of the Laplacian matrix discretized on the tetrahedralized volume at rest using the method of [52] for the unconstrained vertices. Similarly, A^C(X_0) is the portion for the constrained vertices post-multiplied by the linear correspondence between the neutral pose n of the blendshape system and the outer boundary of the tetrahedral mesh X^b. Equivalently, we may write

A^U(X_0)δX = A^C(X_0)B Σ_k e_k b_k    (10)
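Computationally, Equations 7-9 amount to a per-muscle blendshape evaluation followed by a polyline length. A small NumPy sketch under the same assumed data layout as the face-surface model (all function and variable names are hypothetical):

```python
import numpy as np

def muscle_targets(M0, Mk, b, T):
    """Equations 7/8: blendshape-driven muscle (or center-line) targets.

    M0 : (V, 3)    neutral-pose muscle vertices M_m^0
    Mk : (V, 3, K) per-blendshape deltas M_m^k
    b  : (K,)      blendshape weights
    T  : (V, 3, 4) per-vertex linear blend skinning transforms T_m(j)
    """
    pre_skin = M0 + Mk @ b
    homog = np.concatenate([pre_skin, np.ones((len(M0), 1))], axis=1)
    return np.einsum('vij,vj->vi', T, homog)

def centerline_length(C):
    """Equation 9: length of a piecewise linear center-line curve C (V, 3)."""
    return np.linalg.norm(np.diff(C, axis=0), axis=1).sum()
```

The center-line length is what feeds the activation-length curve that determines each muscle's activation value.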

(where e_k are the standard basis vectors) which is equivalent to doing k solves of the form

A^U(X_0)δX_k = A^C(X_0)Be_k    (11)

and then subsequently summing both sides to obtain δX = Σ_k δX_k b_k. That is, the linearity of the Poisson equation allows us to precompute its action for each blendshape and subsequently obtain the exact result on any combination of blendshapes by simply summing the results obtained on the individual blendshapes.

In summary, for each of the k blendshapes, we solve a Poisson equation (Equation 11) to precompute M_m^k and C_m^k, and then given animator controls w which yield b and j, we obtain M_m and C_m via Equations 7 and 8. This replaces the morphing step, allowing us to proceed with the quasistatic muscle simulation using tracks driven entirely by the animator parameters w.

7. End-to-End Differentiability

In this section, we outline the derivative of the simulated tetrahedral mesh vertex positions with respect to the blendshape parameters b and jaw controls j that parameterize the simulation results as per Section 6. The derivatives of b and j with respect to the animator controls w depend on the specific controls and can be post-multiplied. If one cares about the resulting vertices of a rendered mesh embedded in or constrained to the tetrahedral mesh, then this embedding, typically linear, can be pre-multiplied.

Although the constrained nodes X^C typically only depend on the joint parameters, one may wish, at times, to simulate only a subset of the tetrahedral flesh mesh. In such instances, the constrained nodes can appear on the unsimulated boundary which in turn can be driven by the blendshape parameters b; thus, we write X^C(b, j) and concatenate it with X^U(b, j) to obtain X(b, j) for the purposes of this section. The collision forces only depend on the nodal positions, and we may write f_collisions(X(b, j)). The finite volume force depends on both the nodal positions and activations, and the activations are determined from an activation-length curve where the length is given in Equation 9. Our precomputation makes C_m only a function of b and j and notably independent of X, and so we may write a_m(L_m(C_m(b, j))), combining the activation-length curve with Equations 8 and 9. We stress that the activations are independent of the positions X. Thus, we may write f_fvm(X(b, j), C(b, j)). Similarly, we may write f_tracks(X(b, j), M(b, j)). Therefore, all the forces in Equation 6 are a function of X, C, and M which are in turn a function of b and j.

Using the aforementioned dependencies, we can take the total derivative of the forces f_T = f_fvm + f_collisions in Equation 3 with respect to a single blendshape parameter b_k to obtain (∂f_T/∂X)(∂X/∂b_k) + (∂f_T/∂C)(∂C/∂b_k) = 0, which is equivalent to (∂f_T/∂X)(∂X/∂b_k) + (∂f_fvm/∂C)(∂C/∂b_k) = 0 since f_collisions is independent of C. Since our activations are still independent of X just as they were in [42], ∂f_T/∂X here is identical to that discussed in [42], and thus their quasistatic solve can be used to determine ∂X/∂b_k by solving (∂f_T/∂X)(∂X/∂b_k) = −(∂f_fvm/∂C)(∂C/∂b_k). To compute the right hand side, note that ∂C/∂b_k can be obtained from Equation 8. To obtain ∂f_fvm/∂C, we compute ∂f_fvm/∂C = (∂f_fvm/∂a)(∂a/∂L)(∂L/∂C). ∂f_fvm/∂a are simply the active forces f_a in Equation 4, ∂a/∂L is the local slope of the activation-length curve, and ∂L/∂C is readily computed from Equation 9. The ∂X/∂j_k are determined similarly.

One may take a similar approach to Equation 6, obtaining ∂X/∂b_k by solving (∂f_T/∂X)(∂X/∂b_k) = −(∂f_fvm/∂C)(∂C/∂b_k) − (∂f_tracks/∂M)(∂M/∂b_k). We stress that the coefficient matrix ∂f_T/∂X of the quasistatic solve is now augmented by ∂f_tracks/∂X (see Equation 5) and is the same quasistatic coefficient matrix as in [14]. ∂f_tracks/∂M and ∂M/∂b_k are obtained from Equations 5 and 7 respectively. Again, the ∂X/∂j_k are found similarly. In summary, finding ∂X/∂b_k and ∂X/∂j_k involves solving the same quasistatics problem of [42] with the slight augmentation to the coefficient matrix from [14], merely with different right hand sides. Although this requires a quasistatic solve for each b_k and j_k, they are all independent and can thus be done in parallel.

8. Experiments

We use the Dogleg optimization [35] as implemented by the Chumpy autodifferentiation library [33] in order to target our face model to both three-dimensional geometry and two-dimensional RGB images to demonstrate the efficacy of our end-to-end fully differentiable formulation. Other optimization methods and/or applications may similarly be pursued. Our nonlinear least squares optimization problems generally have the form

min_w ||F* − F(x_R(w))||_2^2 + λ||w||_2^2    (12)

where w are the animator controls that deform the face, x(w) are the positions of the vertices on the surface of the face deformed using the full blendshape-driven muscle simulation system as described in Section 6, F(x_R(w)) is a function of those vertex positions, and F* is the desired output of that function. R(θ) and t are an additional rigid rotation and translation, respectively, where θ represents Euler angles, i.e. x_R(w) = R(θ)x(w) + t. We use a standard L2 norm regularization on the animator controls ||w||_2^2, where λ is set experimentally to avoid overfitting.
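The derivative computation above is implicit differentiation of the equilibrium f(X(b), C(b)) = 0: each ∂X/∂b_k is obtained from one linear solve against the quasistatic coefficient matrix with a different right hand side. The dense toy sketch below illustrates only that pattern (the paper uses its sparse quasistatic solver; all names here are ours, not the paper's):

```python
import numpy as np

def dX_dbk(df_dX, df_dC, dC_dbk):
    """Implicit differentiation of the equilibrium f(X(b), C(b)) = 0:
    solve (df/dX)(dX/db_k) = -(df/dC)(dC/db_k) for dX/db_k."""
    return np.linalg.solve(df_dX, -df_dC @ dC_dbk)
```

For the augmented system of Equation 6, the same pattern applies with the additional −(∂f_tracks/∂M)(∂M/∂b_k) term added to the right hand side.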

Figure 1. The eight viewpoints used to reconstruct the facial geometry for a particular pose.

Figure 3. The geometry reconstructed by applying the multi-view stereo algorithm described in [5, 6] to the input images shown in Figure 1.

8.1. Model Creation

The blendshape system is created from the neutral pose n as well as FACS-based expressions [17] using the methods of [5, 6]. Eight black and white cameras from varying viewpoints (see Figure 1) are used to reconstruct the geometry of the actor. Artists clean up these scans and use them as inspiration for a blendshape model and to calibrate the linear blend skinning matrices for the face surface (see Figure 3). Of course, any reasonable method could be used to create the blendshape system. A subset of the full face surface model with 52,228 surface vertices is used in the optimization.

We use the neutral pose of the blendshape system and the method of [13] to create the tetrahedral flesh mesh X_0, tetrahedral muscle volumes M_m^0, and muscle center-line curves C_m^0 by morphing them from a template asset. Our simulation mesh has 302,235 vertices and 1,470,102 tetrahedra. We use 60 muscles with a total of 50,710 vertices and 146,965 tetrahedra (some tetrahedra are duplicated between muscles due to overlap). The linear blend skinning weights used to form T_j on the face surface are propagated to the surface of the tetrahedral mesh and used as boundary conditions in a Poisson equation solve, again as in [1, 13], to obtain linear blend skinning weights throughout the volumetric tetrahedral mesh as well as for the muscles and center-line curves, thus defining skinning transformation matrices T_m^M and T_m^C. Figure 2 shows the muscles in the neutral pose M_m^0 as well as the result after skinning with the jaw open, i.e. Equation 7 with all b_k identically 0.

Finally, for each shape in the blendshape system, we solve a Poisson equation (Equation 11) for the vertex displacements δX_k which are then transferred to the muscle volumes and center-line curves to obtain M_m^k and C_m^k. This allows us full use of Equations 7 and 8 parameterized by the blendshapes b_k. Figure 4 shows some examples of the muscles evaluated using Equation 7 for a variety of expressions.

8.2. Targeting 3D Geometry

Oftentimes, one has captured a facial pose in the form of a three-dimensional mesh; however, this data is generally noisy, and it is desirable to convert this data into a lower dimensional representation. Using a lower dimensional representation facilitates editing, extracting semantic informa-

(a) Neutral (b) Jaw Open
Figure 2. The underlying anatomical model of the face in the neutral pose as well as the jaw open pose using linear blend skinning.

(a) Smile (b) Pucker (c) Funneler
Figure 4. The anatomical model of the face performing a variety of expressions using only the blendshape deformation from Equation 7.

tion, and performing statistical analysis. In our case, the lower dimensional representation is the parameter space of the blendshape or simulation model.

In general, extracting a lower dimensional representation from an arbitrary mesh requires extracting a set of correspondences between the mesh and the face model. However, for simplicity, we assume that the correspondence problem has been solved beforehand and that each vertex of the incoming mesh captured by a system using the methods of [5, 6] has a corresponding vertex on our face surface. We can thus use an optimization problem in the form of Equation 12 to solve for the model parameters where F* are the vertex positions of the target geometry, and F(x) = x is the identity function.

While a rigid alignment between F* and the neutral mesh n, i.e. R(θ) and t, is created as a result of [5, 6], we generally found it to be inaccurate. As a result, we also allow the optimization to solve for θ and t. Our optimization problem for targeting three-dimensional geometry thus has the form

min_{w,θ,t} ||F* − x_R(w)||_2^2 + λ||w||_2^2    (13)

where λ = 1 × 10^−6 is set experimentally.

We demonstrate the efficacy of our method on a pose where the actor has his mouth slightly open and is making a pucker shape. We compare the results of targeting three-dimensional geometry when it is driven using simulation via the blendshape muscle tracks as described in Section 6 versus when it is driven using the pure blendshape model described in Section 3. Traditionally, pucker shapes have been difficult for activation-based muscle simulations to hit. See Figure 5. Although neither inversion quite captures the tightness of the mouth's pucker, the muscle simulation results demonstrate how the simulation's volume preservation property significantly improves upon the blendshape results where the top and bottom lips seem to shrink. This property is also useful in preserving the general shape of the philtrum; the blendshape model's inversion causes the part of the philtrum near the nose to incorrectly bulge significantly. Furthermore, the resulting muscle activation values are easier to draw semantic meaning from due to their sparsity and anatomical meaning, as seen in Figure 6.

Note that errors in the method of [5, 6] in performing multi-view reconstruction will cause the vertices of the target geometry to contain noise and potentially be in physically implausible locations. Additionally, errors in finding correspondences between the target geometry and the face surface will result in an inaccurate objective function. Furthermore, there is no guarantee that our deformation model x(w) is able to hit all physically attainable poses even when the capture and correspondence are perfect. This demonstrates the efficacy of introducing physically-based priors into the optimization. Additional comparisons and results are shown in the supplementary material and video.

8.3. Targeting Monocular RGB Images

To further demonstrate the efficacy of our approach, we consider facial reconstruction from monocular RGB images. The images were captured using a 100mm lens attached to an ARRI Alexa XT Studio running at 24 frames-per-second with a 180 degree shutter angle at ISO 800.
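The geometry-targeting objective in Equation 13 can be prototyped with any nonlinear least-squares solver. The sketch below uses SciPy's least_squares with a toy linear stand-in x(w) = n + Bw in place of the full blendshape-driven muscle simulation (the paper uses Dogleg via Chumpy); all names and shapes are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import least_squares

def euler_to_matrix(theta):
    """Rotation matrix R(theta) from Euler angles theta = (rx, ry, rz)."""
    rx, ry, rz = theta
    Rx = np.array([[1, 0, 0], [0, np.cos(rx), -np.sin(rx)], [0, np.sin(rx), np.cos(rx)]])
    Ry = np.array([[np.cos(ry), 0, np.sin(ry)], [0, 1, 0], [-np.sin(ry), 0, np.cos(ry)]])
    Rz = np.array([[np.cos(rz), -np.sin(rz), 0], [np.sin(rz), np.cos(rz), 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def fit_geometry(F_star, n, B, lam=1e-6):
    """min_{w,theta,t} ||F* - (R(theta) x(w) + t)||^2 + lam ||w||^2
    with the toy deformation model x(w) = n + B w."""
    K = B.shape[2]

    def residuals(p):
        w, theta, t = p[:K], p[K:K + 3], p[K + 3:]
        x_R = (n + B @ w) @ euler_to_matrix(theta).T + t  # x_R(w) = R x(w) + t
        # stack the data term with the sqrt-weighted L2 regularizer on w
        return np.concatenate([(F_star - x_R).ravel(), np.sqrt(lam) * w])

    p = least_squares(residuals, np.zeros(K + 6)).x
    return p[:K], p[K:K + 3], p[K + 3:]
```

Solving jointly for w, θ, and t mirrors the paper's choice of refining the inaccurate rigid alignment inside the same optimization.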

(a) Blendshape (b) Simulation (c) Target
Figure 5. We target the geometry shown in (c) using purely blendshapes shown in (a) versus the blendshape driven muscle simulation model shown in (b). While neither method exactly matches the target geometry, in general, we found that the simulation results preserve key physical properties such as volume preservation around the lips. A close-up of the lips is shown in the bottom row where it is more apparent how the pure blendshape inversion has significant volume loss around the lips.

(a) Blendshape Weights (b) Muscle Activations
Figure 6. The blendshape solve results in blendshape weights that are dense, overdialed, and hard to decipher. Whereas all 129 (of 146; shapes for the neck, etc. were not used) blendshapes used have non-zero values, only 13 of the available 60 muscles have non-zero activation values. The top four most activated muscles are related to the frontalis, indicating that the eyebrows are raised [43]. The activations of the incisivus labii superioris and orbicularis oris muscles are also among the top activated muscles, properly indicating a compression of the lips [20, 43].

We refer to images captured by the camera as the "plates." The original plates have a resolution of 2880 × 2160, but we downsample them to 720 × 540. The camera was calibrated using the method of [19] and the resulting distortion parameters are used to undistort the plate to obtain F*.

F(x) renders the face geometry in its current pose with a set of camera, lighting, and material parameters. We use a simple pinhole camera with extrinsic parameters determined by the camera calibration step. The rigid transformation of the face is determined by manually tracking features on the face in the plate. The face model is lit with a single spherical harmonics light with 9 coefficients γ, see [39], and is shaded with Lambertian diffuse shading. Each vertex i also has an RGB color c_i associated with it. We solve for γ and all c_i using a non-linear least squares optimization of the form

min_{γ,c} ||F* − F(x_R(0), γ, c)||_2^2 + λ||S(c)||_2^2    (14)

where the per-vertex colors are regularized using S(c) = Σ_i Σ_{j∈N(i)} (c_i − c_j) where N(i) are the neighboring vertices of vertex i. This lighting and albedo solve is done as a preprocess on a neutral or close to neutral pose with λ = 2500 set experimentally. OpenDR [34] is used to differentiate F(x) to solve Equation 14; however, any other differentiable renderer (e.g. [32]) can be used instead. We then assume that γ and c stay constant throughout the performance. See Figure 7.

We solve for the parameters w in two steps. Given curves around the eyes and lips on the three-dimensional neutral face mesh, a rotoscope artist draws corresponding curves on the two-dimensional film plate. Then, we solve for an initial guess ŵ via an optimization problem of the form

min_ŵ ||E_1(ŵ)||_2^2 + λ_1||ŵ||_2^2    (15)

where λ_1 = 3600 is set experimentally. E_1(ŵ) is the two-dimensional Euclidean distance between the points on the rotoscoped curves on the plate and the corresponding points on the face surface x(w) projected into the image plane. See Figure 8. We then use ŵ to initialize a shape from shading solve

min_w ||E_2(w)||_2^2 + λ_1||E_1(w)||_2^2 + λ_2||w − ŵ||_2^2    (16)

to determine the final parameters w where λ_1 = 1 × 10^−4 and λ_2 = 1 are set experimentally. Here, E_2 = G(F* − F(x_R(w), γ, c)) is a three-level Gaussian pyramid of the per-pixel differences between the plate and the synthetic render.

We demonstrate the efficacy of our approach on 66 frames of a facial performance. As in Section 8.2, we compare the results of solving Equations 15 and 16 using x(w) driven by a simulation model versus a blendshape model. In particular, we choose four frames with particularly challenging facial expressions (frames 1112, 1160, 1170) as well as capture conditions such as motion blur (frame 1134). We note that a significant portion of the facial expression is captured using the rotoscoped curves and the shape-from-shading step primarily helps to refine the expression and the contours of the face. Both E_1 and E_2 (Equations 15 and 16) require end-to-end differentiability through our blendshape driven method. See Figure 9. While the general expressions are similar, we note that the simulation's surface geometry tends to be more physically plausible due to the simulation's ability to preserve volume, especially around the lips. This regularization is especially prominent on frame 1134. As shown in the supplementary material, the resulting muscle activation values are also comparatively sparser, which leads to an increased ability to extract semantic meaning out of the performance. Additional comparisons and results are shown in the supplementary material and video.

(a) Plate (b) Lighting/Albedo
Figure 7. Before estimating the facial pose, we first estimate lighting and albedo on a neutral or close to neutral pose.

(a) Blendshapes (b) Simulation (c) Roto Curves
Figure 8. We use rotoscoped curves on the plate to solve for an initial estimate of the face pose.

9. Conclusion and Future Work

Although promising anatomically based muscle simulation systems have existed for some time and have had the ability to target data as in [42], they have lacked the high-end efficacy required to produce compelling results.

Figure 9. We target the raw image data using our face model, using both simulation and blendshapes, on a number of frames (1112, 1134, 1160, and 1170) of an actor's performance. (a) Blendshapes, (b) Simulation, (c) Plate.

Even though the recently proposed [14] does produce quite compelling results, it requires a full face shape as input and is not differentiable. In this paper, we alleviated the aforementioned difficulties, extending [14] with end-to-end differentiability of both the morphing system and the blendshape parameters. This blendshape-driven morph removes the need for a full face surface mesh as a target. We demonstrate the efficacy of our approach by targeting pre-existing three-dimensional geometry as well as two-dimensional RGB images. To the best of our knowledge, we are the first to target the quasistatic simulation of a muscle model to RGB images. Both sets of results suffer from some depth ambiguity due to only using monocular two-dimensional data in the optimization. We note that methods such as [42] could be used in place of the optimizations presented in this paper (as outlined in the second to last paragraph of Section 7); however, the resulting simulation would be less expressive and not be able to effectively reproduce the desired expressions.

Although the computer vision community expends great efforts in regards to identifying faces in images, cleanly segmenting them from their surroundings, identifying their shape, and even semantically understanding what such faces are doing or intend to express or feel, semantic image labeling and annotation is still mostly in its infancy. The ability to express a facial pose or image using an anatomically-motivated muscle activation basis provides a preliminary way to extract semantic information. Even without extensive anatomical calibration, our model's muscle activations have shown to be useful for extracting anatomically-based semantic information. Additionally, muscle activations could also be used as a basis for statistical/deep learning instead of semantically meaningless combinations of blendshape weights. This is a promising avenue for future work.

Finally, one of the more philosophical questions in deep learning seems to revolve around what should or should not be considered a "learning crime" (drawing similarities to variational crimes [44]). For example, in [2], the authors learn a perturbation of the whole shape as opposed to a perturbation of the linear blend skinning, assuming that the perturbation is spatially correlated and/or lower-dimensional and thus easier to learn. The authors in [18, 40] use spatially correlated information to train networks, once again, under the assumption that spatially correlated information is easier to learn. It seems that adding as much domain knowledge, priors, procedural methods, etc. as possible generates a network that is easier to train and generalizes better. Our anatomically-based simulation system incorporates physical properties such as contact, collision, and volume preservation so that a network would not need to learn or explain them; instead, the network need only learn what further perturbations are required to match the data.

Acknowledgements

Research supported in part by ONR N00014-13-1-0346, ONR N00014-17-1-2174, ARL AHPCRC W911NF-07-0027, and generous gifts from Amazon and Toyota. In addition, we would like to thank both Reza and Behzad at ONR for supporting our efforts into computer vision and machine learning, as well as Cary Phillips, Kiran Bhat, and Industrial Light & Magic for supporting our efforts into facial performance capture. M. Bao was supported in part by the VMWare Fellowship in Honor of Ole Agesen. We would also like to thank Paul Huston for his acting and Jane Wu for her help in preparing the supplementary video.

References

[1] Dicko Ali-Hamadi, Tiantian Liu, Benjamin Gilles, Ladislav Kavan, François Faure, Olivier Palombi, and Marie-Paule Cani. Anatomy transfer. ACM Transactions on Graphics (TOG), 32(6):188, 2013.
[2] Stephen W Bailey, Dave Otte, Paul DiLorenzo, and James F O'Brien. Fast and deep deformation approximations. ACM Transactions on Graphics (TOG), 37(4):119, 2018.
[3] V Barrielle and N Stoiber. Realtime performance-driven physical simulation for facial animation. In Computer Graphics Forum. Wiley Online Library, 2018.
[4] Vincent Barrielle, Nicolas Stoiber, and Cédric Cagniart. Blendforces: A dynamic framework for facial animation. In Computer Graphics Forum, volume 35, pages 341–352. Wiley Online Library, 2016.
[5] Thabo Beeler, Bernd Bickel, Paul Beardsley, Bob Sumner, and Markus Gross. High-quality single-shot capture of facial geometry. In ACM Transactions on Graphics (TOG), volume 29, page 40. ACM, 2010.
[6] Thabo Beeler, Fabian Hahn, Derek Bradley, Bernd Bickel, Paul Beardsley, Craig Gotsman, Robert W Sumner, and Markus Gross. High-quality passive facial performance capture using anchor frames. In ACM Transactions on Graphics (TOG), volume 30, page 75. ACM, 2011.
[7] Kiran S Bhat, Rony Goldenthal, Yuting Ye, Ronald Mallet, and Michael Koperwas. High fidelity facial animation capture and retargeting with contours. In Proceedings of the 12th ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 7–14. ACM, 2013.
[8] Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3d faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, pages 187–194. ACM Press/Addison-Wesley Publishing Co., 1999.
[9] James Booth, Anastasios Roussos, Stefanos Zafeiriou, Allan Ponniah, and David Dunaway. A 3d morphable model learnt from 10,000 faces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5543–5552, 2016.
[10] Chen Cao, Yanlin Weng, Stephen Lin, and Kun Zhou. 3d shape regression for real-time facial animation. ACM Transactions on Graphics (TOG), 32(4):41, 2013.
[11] Chen Cao, Yanlin Weng, Shun Zhou, Yiying Tong, and Kun Zhou. Facewarehouse: A 3d facial expression database for visual computing. IEEE Transactions on Visualization and Computer Graphics, 20(3):413–425, 2014.
[12] Yen-Lin Chen, Hsiang-Tao Wu, Fuhao Shi, Xin Tong, and Jinxiang Chai. Accurate and robust 3d facial capture using a single rgbd camera. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 3615–3622. IEEE, 2013.
[13] Matthew Cong, Michael Bao, Jane E, Kiran S Bhat, and Ronald Fedkiw. Fully automatic generation of anatomical face simulation models. In Proceedings of the 14th ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 175–183. ACM, 2015.
[14] Matthew Cong, Kiran S Bhat, and Ronald Fedkiw. Art-directed muscle simulation for high-end facial animation. In Symposium on Computer Animation, pages 119–127, 2016.
[15] Matthew Cong, Lana Lan, and Ronald Fedkiw. Muscle simulation for facial animation in kong: Skull island. In ACM SIGGRAPH 2017 Talks, page 21. ACM, 2017.
[16] Paul Debevec. The light stages and their applications to photoreal digital actors. SIGGRAPH Asia, 2(4), 2012.
[17] Paul Ekman. Facial action coding system (facs). A human face, 2002.
[18] Yao Feng, Fan Wu, Xiaohu Shao, Yanfeng Wang, and Xi Zhou. Joint 3d face reconstruction and dense alignment with position map regression network. arXiv preprint arXiv:1803.07835, 2018.
[19] Janne Heikkilä and Olli Silvén. A four-step camera calibration procedure with implicit image correction. In Computer Vision and Pattern Recognition, 1997. Proceedings., 1997 IEEE Computer Society Conference on, pages 1106–1112. IEEE, 1997.
[20] Mi-Sun Hur. Anatomical features of the incisivus labii superioris muscle and its relationships with the upper mucolabial fold, labial glands, and modiolar area. Scientific Reports, 8(1):12879, 2018.
[21] Alexandru-Eugen Ichim, Petr Kadleček, Ladislav Kavan, and Mark Pauly. Phace: physics-based face modeling and animation. ACM Transactions on Graphics (TOG), 36(4):153, 2017.
[22] Alexandru Eugen Ichim, Ladislav Kavan, Merlin Nimier-David, and Mark Pauly. Building and animating user-specific volumetric face rigs. In Symposium on Computer Animation, pages 107–117, 2016.
[23] Geoffrey Irving, Joseph Teran, and Ronald Fedkiw. Invertible finite elements for robust simulation of large deformation. In Proceedings of the 2004 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 131–140. Eurographics Association, 2004.
[24] Aaron S Jackson, Adrian Bulat, Vasileios Argyriou, and Georgios Tzimiropoulos. Large pose 3d face reconstruction from a single image via direct volumetric cnn regression. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 1031–1039. IEEE, 2017.
[25] Amin Jourabloo and Xiaoming Liu. Large-pose face alignment via cnn-based dense 3d model fitting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4188–4196, 2016.
[26] Vahid Kazemi, Cem Keskin, Jonathan Taylor, Pushmeet Kohli, and Shahram Izadi. Real-time face reconstruction from a single depth image. In 3D Vision (3DV), 2014 2nd International Conference on, volume 1, pages 369–376. IEEE, 2014.
[27] Hyeongwoo Kim, Michael Zollhöfer, Ayush Tewari, Justus Thies, Christian Richardt, and Christian Theobalt. Inversefacenet: Deep monocular inverse face rendering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4625–4634, 2018.
[28] Yeara Kozlov, Derek Bradley, Moritz Bächer, Bernhard Thomaszewski, Thabo Beeler, and Markus Gross. Enriching facial blendshape rigs with physical simulation. In Computer Graphics Forum, volume 36, pages 75–84. Wiley Online Library, 2017.
[29] Lana Lan, Matthew Cong, and Ronald Fedkiw. Lessons from the evolution of an anatomical facial muscle model. In Proceedings of the ACM SIGGRAPH Digital Production Symposium, page 11. ACM, 2017.
[30] John P Lewis, Ken Anjyo, Taehyun Rhee, Mengjie Zhang, Frederic H Pighin, and Zhigang Deng. Practice and theory of blendshape facial models. Eurographics (State of the Art Reports), 1(8), 2014.
[31] Tianye Li, Timo Bolkart, Michael J Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4d scans. ACM Transactions on Graphics (TOG), 36(6):194, 2017.
[32] Tzu-Mao Li, Miika Aittala, Frédo Durand, and Jaakko Lehtinen. Differentiable monte carlo ray tracing through edge sampling. ACM Trans. Graph. (Proc. SIGGRAPH Asia), 37(6):222:1–222:11, 2018.
[33] M Loper. Chumpy autodifferentiation library. http://chumpy.org, 2014.
[34] Matthew M Loper and Michael J Black. Opendr: An approximate differentiable renderer. In European Conference on Computer Vision, pages 154–169. Springer, 2014.
[35] MIA Lourakis and Antonis A Argyros. Is levenberg-marquardt the most efficient optimization algorithm for implementing bundle adjustment? In Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, volume 2, pages 1526–1531. IEEE, 2005.
[36] Marcel Lüthi, Thomas Gerig, Christoph Jud, and Thomas Vetter. Gaussian process morphable models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[37] Pascal Paysan, Reinhard Knothe, Brian Amberg, Sami Romdhani, and Thomas Vetter. A 3d face model for pose and illumination invariant face recognition. In Advanced Video and Signal Based Surveillance, 2009. AVSS'09. Sixth IEEE International Conference on, pages 296–301. IEEE, 2009.
[38] Frédéric Pighin, Richard Szeliski, and David H Salesin. Resynthesizing facial animation through 3d model-based tracking. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, volume 1, pages 143–150. IEEE, 1999.
[39] Ravi Ramamoorthi and Pat Hanrahan. An efficient representation for irradiance environment maps. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, pages 497–500. ACM, 2001.
[40] Anurag Ranjan, Timo Bolkart, Soubhik Sanyal, and Michael J Black. Generating 3d faces using convolutional mesh autoencoders. arXiv preprint arXiv:1807.10267, 2018.
[41] Matan Sela, Elad Richardson, and Ron Kimmel. Unrestricted facial geometry reconstruction using image-to-image translation. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 1585–1594. IEEE, 2017.
[42] Eftychios Sifakis, Igor Neverov, and Ronald Fedkiw. Automatic determination of facial muscle activations from sparse motion capture marker data. ACM Transactions on Graphics (TOG), 24(3):417–425, 2005.
[43] Susan Standring. Gray's anatomy e-book: the anatomical basis of clinical practice. Elsevier Health Sciences, 2015.
[44] Gilbert Strang. Variational crimes in the finite element method. In The Mathematical Foundations of the Finite Element Method with Applications to Partial Differential Equations, pages 689–710. Elsevier, 1972.
[45] Joseph Teran, Sylvia Blemker, V Hing, and Ronald Fedkiw. Finite volume methods for the simulation of skeletal muscle. In Proceedings of the 2003 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 68–74. Eurographics Association, 2003.
[46] Joseph Teran, Eftychios Sifakis, Silvia S Blemker, Victor Ng-Thow-Hing, Cynthia Lau, and Ronald Fedkiw. Creating and simulating skeletal muscle from the visible human data set. IEEE Transactions on Visualization and Computer Graphics, 11(3):317–328, 2005.
[47] Joseph Teran, Eftychios Sifakis, Geoffrey Irving, and Ronald Fedkiw. Robust quasistatic finite elements and flesh simulation. In Proceedings of the 2005 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 181–190. ACM, 2005.
[48] Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. Face2face: Real-time face capture and reenactment of rgb videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2387–2395, 2016.
[49] Thibaut Weise, Sofien Bouaziz, Hao Li, and Mark Pauly. Realtime performance-based facial animation. In ACM Transactions on Graphics (TOG), volume 30, page 77. ACM, 2011.
[50] Chenglei Wu, Derek Bradley, Markus Gross, and Thabo Beeler. An anatomically-constrained local deformation model for monocular face capture. ACM Transactions on Graphics (TOG), 35(4):115, 2016.
[51] Felix E Zajac. Muscle and tendon: properties, models, scaling, and application to biomechanics and motor control. Critical Reviews in Biomedical Engineering, 17(4):359–411, 1989.
[52] Wen Zheng, Bo Zhu, Byungmoon Kim, and Ronald Fedkiw. A new incompressibility discretization for a hybrid particle mac grid representation with surface tension. Journal of Computational Physics, 280:96–142, 2015.
[53] M Zollhöfer, J Thies, P Garrido, D Bradley, T Beeler, P Pérez, M Stamminger, M Nießner, and C Theobalt. State of the art on monocular 3d face reconstruction, tracking, and applications. Eurographics (State of the Art Reports), 37(2), 2018.
