CREATING A PHOTOREAL DIGITAL ACTOR: THE DIGITAL EMILY PROJECT

Oleg Alexander 1, Mike Rogers 1, William Lambeth 1, Matt Chiang 2, Paul Debevec 2

1 Image Metrics, 1918 Main St, 2nd Floor, Santa Monica, CA 90405, USA. e-mail: {oleg.alexander,mike.rogers,william.lambeth}@image-metrics.com

2 University of Southern California Institute for Creative Technologies, 13274 Fiji Way, Marina del Rey, CA 90292, USA. e-mail: {chiang,debevec}@ict.usc.edu

Abstract

The Digital Emily Project is a collaboration between facial animation company Image Metrics and the Graphics Laboratory at the University of Southern California's Institute for Creative Technologies to achieve one of the world's first photorealistic digital facial performances. The project leverages latest-generation techniques in high-resolution face scanning, character rigging, video-based facial animation, and compositing. An actress was first filmed on a studio set speaking emotive lines of dialog in high definition. The lighting on the set was captured as a high dynamic range light probe image. The actress' face was then three-dimensionally scanned in thirty-three facial expressions showing different emotions and mouth and eye movements using a high-resolution facial scanning process accurate to the level of skin pores and fine wrinkles. Lighting-independent diffuse and specular reflectance maps were also acquired as part of the scanning process. Correspondences between the 3D expression scans were formed using a semi-automatic process, allowing a blendshape facial animation rig to be constructed whose expressions closely mirrored the shapes observed in the rich set of facial scans; animated eyes and teeth were also added to the model. Skin texture detail showing dynamic wrinkling was converted into multiresolution displacement maps also driven by the blend shapes. A semi-automatic video-based facial animation system was then used to animate the 3D face rig to match the performance seen in the original video, and this performance was tracked onto the facial motion in the studio video. The final face was illuminated by the captured studio illumination and shaded using the acquired reflectance maps with a skin translucency shading algorithm. Using this process, the project was able to render a synthetic performance which was generally accepted as being a real face.

Keywords: Digital Actors; Facial Animation; 3D Scanning

1 Introduction

Creating a photoreal digital actor with computer graphics has been a central goal of the field for at least thirty years [Parke 1972]. The Digital Emily project undertaken by Image Metrics and the University of Southern California's Institute for Creative Technologies (USC ICT) attempted to achieve an animated, photoreal digital face by bringing together latest-generation results in 3D facial capture, modeling, animation, and rendering. The project aimed to cross the "uncanny valley" [20], producing a computer-generated face which appeared to be a real, relatable, animated person. Some of the key technologies employed included a fast high-resolution digital face scanning process using the light stage at USC ICT, and the Image Metrics video-based facial animation system. The result of the project was by several accounts the first public demonstration of a photoreal computer-generated face able to convincingly speak and emote in a medium closeup.

2 Previous Efforts at Photoreal Digital Humans

A variety of laudable efforts have been made to create realistic digital actors over the last decade, each leveraging numerous advances in computer graphics technology and artistry. In this section, we overview some of these key efforts in order to compare and contrast them with the Digital Emily project.

The SIGGRAPH 1999 Electronic Theater featured "The Jester" [15], a short animation by Life/FX of a woman reading poetry in a jester's cap. The actor's face was three-dimensionally laser scanned and textured using photographic textures stitched together with artistic effort to minimize the original shading and specular reflectance effects, and her performance was recorded with a traditional arrangement of facial markers. The motion capture dots were used to drive a volumetric finite element model which allowed a high-resolution facial mesh to produce simulated buckling and wrinkling from anisotropic stress at a scale significantly more detailed than the original motion capture data was able to record. While the skin shading lacked realistic specular reflectance and pore detail, the face conveyed realistic motion and emotion in significant part due to the skin dynamics model. The process was later extended with significant additional artistic effort to create an aged version of the actor in a follow-up animation "Young at Heart".

Disney's "Human Face Project" developed technology to show an older actor encountering a younger version of himself [27, 12]. A facial mold of the actor's face was taken and the resulting cast was scanned using a high-resolution face scanning process, and twenty-two Cyberware scans in various expressions were also acquired. A medium-resolution animated facial rig was sculpted to mimic the expression scans used as reference. A multi-camera facial capture setup was employed to film the face under relatively flat cross-polarized lighting. Polarization difference images isolating specular reflections of the face [5] were acquired to reveal high-resolution texture detail, and analysis of this texture was used to add skin pores and fine wrinkle detail as displacement maps to the scans. Optical flow performed on video of the actor's performance was used to drive the digital face, achieving remarkably close matches to the original performance. A younger version of the actor was artistically modeled based on photographs of him in his twenties. HDR lighting information was captured on set so that the digital actor could be rendered with image-based lighting [4] to match the on-set illumination. Final renderings achieved convincing facial motion and lighting in a two-shot but less convincing facial reflectance; a significant problem was that no simulation of the skin's translucency [14] had been performed.

The Matrix Sequels (2003) used digital actors for many scenes where Keanu Reeves acrobatically fights many copies of Hugo Weaving. High-quality facial casts of Reeves and Weaving were acquired and scanned with a high-resolution laser scanning system (the XYZRGB system based on technology from the Canadian National Research Council) to provide 3D geometry accurate to the level of skin pores and fine wrinkles. A six-camera high-definition facial capture rig was used to film facial performance clips of the actors under relatively flat illumination. The six views in the video were used to animate a facial rig of the actor and, as in [10], to provide time-varying texture maps for the dynamic facial appearance. Still renderings using image-based lighting [4] and a texture-space approximation to subsurface scattering [2] showed notably realistic faces which greatly benefitted from the high-resolution geometric detail texture found in the XYZRGB scans. However, the animated facial performances in the film were shown in relatively wide shots and exhibited less realistic skin reflectance, perhaps having lost some appearance of geometric detail during normal map filtering. The animated texture maps, however, provided a convincing degree of dynamic shading and albedo changes to the facial performances. A tradeoff to the use of video textures is that the facial model could not easily generate novel performances unless they too were captured in the complete performance capture setup.

Spider Man 2 (2004) built digital stunt doubles for villain Doc Ock (Alfred Molina) and hero Spider-Man (Tobey Maguire) using facial reflectance scans in USC ICT's Light Stage 2 device [24]. Each actor was filmed with four synchronized 35mm film cameras in several facial expressions from 480 lighting directions. Colorspace techniques as in [5] were used to separate diffuse and specular reflections. The relightable texture information was projected onto a 3D facial rig based on geometry from a traditional laser scan, and illuminated variously by HDRI image-based lighting [4] and traditional CG light sources using a custom shading algorithm for approximately 40 digital double shots. Using additional cameras, the technique was also used to construct digital actors for Superman Returns (2006), Spider Man 3 (2007), and Hancock (2008). Due to the extensive reflectance information collected, the technique yielded realistic facial reflectance for the digital characters, including close-up shots with mild degrees of facial animation, especially in Superman Returns. However, results of the process did not demonstrate emotive facial performances in closeup; significant facial animation was shown only in wide shots.

Beowulf (2007) used a multitude of digital characters for a fully computer-rendered film. The performance-capture-based film followed the approach of the 2001 film Final Fantasy of constructing characters in as much detail as possible and then driving them with motion capture, keyframe animation, and simulation. Beowulf substantially advanced the state of the art in this area by leveraging greater motion capture fidelity and performance volume, employing more complex lighting simulation, and using better skin shading techniques. While some characters were based closely in appearance on their voice and motion capture actors (e.g. Angelina Jolie and Anthony Hopkins), other characters bore little resemblance (e.g. Ray Winstone), requiring additional artistic effort to model them. Static renderings of the faces achieved impressive levels of realism, although the single-layer subsurface scattering model produced a somewhat "waxy" appearance. Also, the artist-driven character creation process did not leverage high-resolution 3D scans of each actor in a multitude of facial expressions. According to Variety [3], the "digitized figures in 'Beowulf' look eerily close to storefront mannequins ... suspended somewhere between live-action and animation, fairy tale and videogame."

The Curious Case of Benjamin Button (2008) was the first feature film to feature a photoreal human virtual character. The aged version of Brad Pitt seen in the film's first 52 minutes was created by visual effects studio Digital Domain leveraging "recently developed offerings from Mova, Image Metrics, and the [USC] Institute for Creative Technologies" [23]. Silicone maquettes of Brad Pitt as an old man were constructed by Kazuhiro Tsuji at Rick Baker's makeup studio and used as the basis for the digital character. Detailed facial reflectance capture was performed for the age 70 maquette in USC ICT's Light Stage 5 device [26] from eight angles and 156 lighting directions. Medium-resolution meshes of Brad Pitt's face in a multitude of facial expressions were captured using Mova's Contour facial capture system [22], using a mottled pattern of glow-in-the-dark makeup to create facial geometry from multi-view stereo. Digital Domain used special software tools and manual effort to create a rigged digital character whose articulations were based closely on the facial shapes captured in the Mova system. Image Metrics' video-based facial animation system was used to provide animation curves for the Benjamin facial rigs by analyzing frontal, flat-lit video of Brad Pitt performing each scene of the film in a studio; these animation curves were frequently refined by animators at Digital Domain to create the final performances seen in the film. Extensive HDRI documentation was taken on each film set so that advanced image-based lighting techniques based on [4] could be used to render the character with illumination matching the on-set lighting. Image-based relighting techniques as in [5] were used to simulate the reflectance of the scanned maquette in each illumination environment as cross-validation of the digital character's lighting and skin shaders. The final renderings were accurately tracked onto the heads of slight, aged actors playing the body of Benjamin in each scene. The quality of the character was universally lauded as a breakthrough in computer graphics, winning the film an Academy Award for Best Visual Effects. An estimate of more than two hundred person-years of effort was reportedly spent [23] creating the character in this groundbreaking work.

3 Acquiring high-resolution scans of the actor in various facial expressions

Emily O'Brien, an Emmy-nominated actress from the American daytime drama "The Young and the Restless", was cast for the project. After being filmed seated describing aspects of the Image Metrics facial animation process on an informal studio set, Emily came to USC ICT to be scanned in its Light Stage 5 device on the afternoon of March 24, 2008. A set of 40 small facial dots were applied to Emily's face with a dark makeup pencil to assist with facial modeling. Emily then entered Light Stage 5 for approximately 90 minutes during which data for thirty-seven high-resolution facial scans were acquired. Fig. 1 shows Emily in the light stage during a scan, with all 156 of its white LED lights turned on.

Figure 1: Actress Emily O'Brien being scanned in Light Stage 5 at USC ICT for the Digital Emily project.

The light stage scanning process used for Digital Emily was described in [18]. In contrast to earlier light stage processes (e.g. [5, 11]), which photograph the face under hundreds of illumination directions, this newer capture process requires only fifteen photographs of the face under different lighting conditions, as seen in Fig. 2, to capture geometry and reflectance information for a face. The photos are taken with a stereo pair of Canon EOS 1D Mark III digital still cameras, and the images are sufficiently few that they can be captured in the cameras' "burst mode" in under three seconds, before any data needs to be written to the compact flash cards.

Figure 2: The fifteen photographs taken by a frontal stereo pair of cameras for each Emily scan. The photo sets, taken in under three seconds, record Emily's diffuse albedo and normals (top row), specular albedo and normals (second row), and base 3D geometry (bottom row).

3.1 Estimating Subsurface and Specular Albedo and Normals

Most of the images are shot with essentially every light in the light stage turned on, but with different gradations of brightness. All of the light stage lights have linear polarizer film placed on them, affixed in a special pattern of orientations, which permits the measurement of the specular and subsurface reflectance components of the face independently by changing the orientation of a polarizer on the camera.

The top two rows of Fig. 2 show Emily's face under four spherical gradient illumination conditions and then a point-light condition; the images in the top row are cross-polarized to eliminate the shine from the surface of her skin – her specular component. What remains is the skin-colored "subsurface" reflection, often referred to as the "diffuse" component. This is light which scatters within the skin enough to become depolarized before re-emerging. Since this light is depolarized, approximately half of it can pass through the horizontal polarizer on the camera. The top right image is lit by a frontal flash, also cross-polarizing out the specular reflection.

The middle row of Fig. 2 shows parallel-polarized images of the face, where the polarizer on the camera is rotated vertically so that the specular reflection returns, at double strength compared to the attenuated subsurface reflection. We can then reveal the specular reflection on its own by subtracting the first row of images from the second row, yielding the specular-only images shown in Fig. 3.

Figure 3: Emily's specular component under the four gradient lighting conditions (all, left, top, front) and single frontal flash condition, obtained by subtracting the cross-polarized images from the parallel-polarized images.

Figure 4: Closeups of Emily’s (a) cross-polarized subsurface component, (b) parallel-polarized subsurface plus specular component, and (c) isolated specular component formed by subtracting (a) from (b). The black makeup dots on her face are easily removed digitally and help with aligning and corresponding her scans.

Fig. 4(a) is a closeup of the "diffuse-all" image of Emily. Every light in the light stage is turned on to equal intensity, and the polarizer on the camera is oriented to block the specular reflection from every single one of the polarized LED light sources. Even the highlights of the lights in Emily's eyes are eliminated. This is about as flat-lit an image of a person's face as can conceivably be photographed, and thus it is nearly a perfect image to use as the diffuse texture map for the face in building a digital actor. The one issue is that it is affected to some extent by self-shadowing and interreflections, making the concavities around the eyes, under the nose, and between the lips appear somewhat darker and more color-saturated than they inherently are. Depending on the rendering technique chosen, having these effects of occlusion and interreflection "baked in" to the texture map is either a problem or an advantage. For real-time rendering, the effects can add realism (as if the environmental light were entirely uniform) given that simulating them accurately might be computationally prohibitive. If new lighting is being simulated on the face using a more accurate global illumination technique, then it is problematic to calculate self-shadowing of a surface whose texture map already has self-shadowing present; likewise for interreflections. In this case, one could perform inverse rendering by using the actor's 3D geometry and approximate reflectance to predict the effects of self-shadowing and/or interreflections, and then divide these effects out of the texture image as in [6].

Fig. 4(a) also shows the makeup dots we put on Emily's face, which help us to align the images in the event there is any drift in her position or expression over the fifteen images; they are relatively easy to remove digitally. Emily was extremely good at staying still for the three-second scans and many of her datasets required no motion compensation at all. (Faster capture times are already possible: 24fps capture of such data using high-speed video cameras is described in [19].)

The shinier image in Fig. 4(b) is also lit by all of the light stage lights, but the orientation of the polarizer has been turned 90 degrees, which allows the specular reflections to return. Her skin exhibits a specular sheen, and the reflections of the lights are now evident in her eyes. In fact, the specular reflection is seen at double the strength of the subsurface (or diffuse) reflection, since the polarizer on the camera blocks about half of the unpolarized subsurface reflection.

Fig. 4(b) thus shows the combined effect of specular reflection and subsurface reflection. For modeling facial reflectance, we would ideally observe the specular reflection independently. As a useful alternative, we can simply subtract the diffuse-only image Fig. 4(a) from this one. Taking the difference between the diffuse-only image and the diffuse-plus-specular image yields an image of primarily the specular reflection of the face, as in Fig. 4(c). A polarization difference process was used previously for facial reflectance analysis in [5], but only for a single point light source and not for the entire sphere of illumination. The image is mostly colorless since this light has reflected specularly off the surface of the skin, rather than entering the skin and having its blue and green colors significantly absorbed by skin pigments and blood before reflecting back out.

This image provides a useful starting point for building a digital character's specular intensity map, or "spec map"; it shows for each pixel the intensity of the specular reflection at that pixel. However, the specular reflection becomes amplified near grazing angles, such as at the sides of the face, according to the denominator of Fresnel's equations. We generally model and compensate for this effect using Fresnel's equations but also discount regions of the face at extreme grazing angles. The image also includes some of the effects of "reflection occlusion" [16]. The sides of the nose and innermost contour of the lips appear to have no specular reflection since self-shadowing prevents the lights from reflecting in these angles. This effect can be an asset for real-time rendering, but should be manually painted out for offline rendering.
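As a purely illustrative sketch of the polarization-difference arithmetic described above (the function and image names are our own, not part of the published pipeline), the separation amounts to a per-pixel scale and subtraction:

```python
import numpy as np

def separate_reflectance(cross_polarized, parallel_polarized):
    """Sketch of polarization-difference separation.

    cross_polarized:    HDR image with the specular reflection blocked;
                        it records roughly half of the depolarized
                        subsurface light.
    parallel_polarized: HDR image with the specular reflection passed at
                        full strength plus the same half of the
                        subsurface light.
    Both are float arrays of shape (H, W, 3) in linear radiance.
    """
    # The camera polarizer passes about half of the depolarized
    # subsurface light, so scale the cross-polarized image back up.
    subsurface = 2.0 * cross_polarized
    # Everything that appears only in the parallel-polarized image is
    # treated as first-surface (specular) reflection.
    specular = np.clip(parallel_polarized - cross_polarized, 0.0, None)
    return subsurface, specular
```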
Recent work [9] reports that this sort of polarization difference image also contains the effects of single scattering, wherein light refracts into the skin but scatters exactly once before refracting back toward the camera. Such light can pick up the color of the skin's melanocytes, adding some color to the specular image. However, the image is dominated by the specular component's first-surface reflection, which allows us to reconstruct high-resolution facial geometry.

The four difference images of the face's specular reflection under the gradient illumination patterns (Fig. 3) let us derive a high-resolution normal map for the face: a map of its local surface orientation vector at each pixel. If we examine the intensity of one pixel across this four-image sequence, its brightness in the X, Y, and Z images divided by its brightness in the fully-illuminated image uniquely encodes the direction of the light stage light reflected in that pixel. From the simple formula in [18] involving image ratios, we can derive the reflection vector at each pixel, and from the camera orientations (calibrated with the technique of [28]) we also know the view vector.

Computing the vector halfway between the reflection vector and the view vector yields a surface normal estimate for the face based on the specular reflection. Fig. 5 shows the face's normal map visualized using the common color map where red, green, and blue indicate the X, Y, and Z components of the surface normal. The normal map contains detail at the level of skin pores and fine wrinkles. The point-lit polarization difference image (Fig. 3, far right) provides a visual indication of the BRDF's specular lobe shape on the nose, forehead, cheeks, and lips. This image provides visual reference for choosing specular roughness parameters for the skin shaders. The image can also be used to drive a data-driven specular reflectance model as in [9].

Figure 5: The specular normal map for one of Emily's expressions derived from the four specular reflection images under the gradient illumination patterns.
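A minimal sketch of this normal estimation, assuming the gradient-ratio form of [18] in which each image ratio encodes one component of the reflection vector in [-1, 1]; the function signature and epsilon handling are our own illustrative choices:

```python
import numpy as np

def specular_normals(img_x, img_y, img_z, img_full, view_dir):
    """Estimate per-pixel normals from spherical-gradient specular images.

    img_x, img_y, img_z: specular-only images (H, W) lit by gradients
                         ramping from 0 to 1 along the X, Y, Z axes.
    img_full:            specular-only image (H, W) with all lights on.
    view_dir:            unit view vector per pixel, shape (H, W, 3).
    """
    eps = 1e-6
    # Each ratio maps back to a reflection-vector component in [-1, 1].
    rx = 2.0 * img_x / (img_full + eps) - 1.0
    ry = 2.0 * img_y / (img_full + eps) - 1.0
    rz = 2.0 * img_z / (img_full + eps) - 1.0
    reflect = np.stack([rx, ry, rz], axis=-1)
    reflect /= np.linalg.norm(reflect, axis=-1, keepdims=True) + eps
    # The half-vector between the reflection and view directions is the
    # surface normal estimate.
    normal = reflect + view_dir
    normal /= np.linalg.norm(normal, axis=-1, keepdims=True) + eps
    return normal
```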
3.2 Deriving High-Resolution 3D Geometry

The last set of images in the scanning process (Fig. 2, bottom row) are a set of colored stripe patterns from a video projector which allow a stereo correspondence algorithm to robustly compute pixel correspondences between the left and right viewpoints of the face. The projector is also cross-polarized so that the stereo pair of images consists of only subsurface reflection and lacks specular highlights; such highlights would shift in position between the two viewpoints and thus complicate the stereo correspondence process. The patterns form a series of color-ramp stripes of different frequencies so that a given facial pixel receives a unique set of RGB irradiance values over the course of the sequence [17]. The first projected image, showing full-on illumination, is used to divide out the cross-polarized facial BRDF from the remaining stripe images; this also helps ensure that pixels in one camera have very nearly the same pixel values as in the other camera, facilitating correspondence. From these correspondences and the camera calibration [28], we can triangulate a three-dimensional mesh of the face with vertices at each pixel, applying bilateral mesh denoising [8] to produce a smooth mesh without adding blur to geometric features.

However, the surface resolution observable from the diffuse reflection of skin is limited by the scattering of the incident light beneath the skin. As a result, the geometry appears relatively smooth and lacks the skin texture detail that we wish to capture in our scans. We add in the skin texture detail by embossing the specular normal map onto the 3D mesh. By doing this, a high-resolution version of the mesh is created, and the vertices of each triangle are allowed to move forward and back until they best exhibit the same surface normals as the normal map. (The ICT Graphics Lab first described this process – using diffuse normals estimated in Light Stage 2 – in 2001 [13]; more recent work in the area includes Nehab et al. [21].) This creates a notably high-resolution 3D scan, showing different skin texture detail clearly observable in different areas of the face (Fig. 6).
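The embossing step can be posed as a small linear least-squares problem in the spirit of [13] and [21]: solve for a per-vertex offset along the base normal so that mesh edges become perpendicular to the target normals. The following sketch is our own illustrative formulation (the edge constraint, regularization weight, and solver choice are assumptions, not the production tool):

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.linalg import lsqr

def emboss_normals(verts, faces, base_normals, target_normals, smooth=0.1):
    """Displace vertices along their base normals so that edges become
    perpendicular to the (averaged) high-resolution target normals."""
    # Unique undirected edges of the triangle mesh.
    e = np.vstack([faces[:, [0, 1]], faces[:, [1, 2]], faces[:, [2, 0]]])
    e = np.unique(np.sort(e, axis=1), axis=0)
    i, j = e[:, 0], e[:, 1]
    n_ij = target_normals[i] + target_normals[j]
    n_ij /= np.linalg.norm(n_ij, axis=1, keepdims=True) + 1e-12

    # For edge (i, j):  d_j (n_j . n_ij) - d_i (n_i . n_ij) = -(p_j - p_i) . n_ij
    a_j = np.einsum('ij,ij->i', base_normals[j], n_ij)
    a_i = -np.einsum('ij,ij->i', base_normals[i], n_ij)
    b = -np.einsum('ij,ij->i', verts[j] - verts[i], n_ij)

    rows = np.repeat(np.arange(len(e)), 2)
    cols = np.stack([i, j], axis=1).ravel()
    vals = np.stack([a_i, a_j], axis=1).ravel()

    # Regularization rows keep the offsets small (stay near the base mesh).
    nv = len(verts)
    rows = np.concatenate([rows, len(e) + np.arange(nv)])
    cols = np.concatenate([cols, np.arange(nv)])
    vals = np.concatenate([vals, smooth * np.ones(nv)])
    b = np.concatenate([b, np.zeros(nv)])

    A = coo_matrix((vals, (rows, cols)), shape=(len(e) + nv, nv))
    d = lsqr(A, b)[0]
    return verts + d[:, None] * base_normals
```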

3.3 Scanning a Multitude of Expressions

Emily was captured in thirty-three different facial expressions based loosely on Paul Ekman's Facial Action Coding System (FACS) [7], as seen in Fig. 7; fourteen of these individual scans are shown in Fig. 8. By design, there is a great deal of variety in the shape of her skin and the pose of her lips, eyes, and jaw across the scans. Emily was fortunately very good at staying still for all of the expressions. Two of the expressions, one with eyes closed and one with eyes open, were also scanned from the sides with Emily's face rotated to the left and right, as seen in the inset figures of the neutral-mouth-closed and neutral-mouth-open scans. This allowed us to merge together a 3D model of the face with geometry stopping just short of her ears to create the complete "master mesh" (Fig. 12(a)) and to extrapolate full ear-to-ear geometry for each partial (frontal-only) facial scan as in Fig. 15(c).

Figure 6: (a) Facial geometry obtained from the diffuse (or subsurface) component of the reflection. (b) Far more detailed facial geometry obtained by embossing the specular normal map onto the diffuse geometry.

Figure 7: The thirty-three facial expressions scanned for creating the Digital Emily character.

We note that building a digital actor from 3D scans of multiple facial expressions is a commonly practiced technique; for example, the Animatable Facial Reflectance Fields project [11] followed this approach in scanning actress Jessica Vallot in approximately forty facial expressions. Going further back, visual effects company Industrial Light + Magic acquired several Cyberware 3D scans of actress Mary Elizabeth Mastrantonio in different expressions to animate the face of the water creature in 1989's The Abyss.

3.4 Phenomena Observed in the Facial Scans

The fourteen faces in Fig. 8 show a sampling of the high-resolution scans taken of Emily in different facial expressions, and offer an opportunity to observe three-dimensional facial dynamics in more detail than has previously been easy to do. A great deal of dynamic behavior can be observed as a face moves, exemplified in the detailed images in Fig. 9. For example, the skin pore detail on the cheek in Fig. 9(a) changes dramatically when Emily pulls her mouth to the side in Fig. 9(b): the pores significantly elongate and become shallower. When Emily stretches her face vertically, small veins pop out from her eyelid in Fig. 9(c). When Emily raises her eyebrows, the relatively isotropic skin pores of her forehead transform into rows of fine wrinkles in Fig. 9(d) – her skin is too elastic to develop deep furrows. When she scowls in Fig. 9(e), indentations above her eyebrows appear where her muscles attach beneath the skin. And when she winces in Fig. 9(f), the muscles in her forehead bulge out. Examining the progression through Figs. 9(d,e,f), we see how the bridge of Emily's nose significantly shrinks and expands as the rest of the face pushes and pulls tissue into and out of the area. While we may not consciously notice these kinds of phenomena when interacting with others, they are present and visible in all faces, and failing to reproduce them accurately imperils the realism of a digital character.

Figure 8: High-resolution 3D geometry from fourteen of the thirty-three facial scans. Each mesh is accurate to 0.1mm resolution and contains approximately three million polygons.

3.5 Scanning Emily's Teeth

Finally, we also scanned a plaster cast of Emily's teeth, which required adapting the 3D scanning system to work with greater accuracy in a smaller scanning volume. Fig. 10(a) shows the cast and Fig. 10(b) shows a rendering of Emily's digital teeth model; the upper and lower teeth each were the result of merging eight sinusoid-pattern structured light scans to form meshes with approximately 600,000 polygons.

4 Building the digital character from the scans

4.1 Constructing the Animatable Base Mesh

The stitched ear-to-ear neutral-expression scan of Emily (Figure 11(a)) was remeshed to create a 4,000 polygon animatable mesh as seen in Figure 11(b). This drastic polygon reduction from the several million polygons of the original scan was done to make animation of the mesh tractable and to ease corresponding the geometry across scans; the geometric skin texture detail would be added back using displacement maps calculated from the high-resolution scans. This was done principally using the commercial product ZBrush to create the facial topology and then Autodesk's Maya package to create the interior of the mouth and the eye sockets. Then, UV texture coordinates for the neutral animatable mesh were mapped out in Maya, yielding the complete master mesh.

4.2 Building the Blendshapes

Image Metrics originally planned to use the scans captured in the Light Stage as artistic reference for building the blendshapes. However, the scans proved to be much more useful than just reference. The tiny stabilization dots drawn on Emily's face during the scanning session were visible in every texture map of every scan. Rather than sculpt the blendshape meshes artistically and then project (or "snap") the vertices to the corresponding scans, we used the dots directly to warp the neutral animatable mesh into the different expressions. This achieved not only an accurate shape for each blendshape, but also accurate skin movement between blendshapes.

The challenge in constructing blendshapes from these data was to use the sparse set of stabilization dots to find a dense correspondence between the animatable master mesh and each expression scan. Figure 12 shows an example of three such partial frontal scans alongside the animatable master mesh. Although highly detailed and accurate, the expression scans required some pre-processing before appropriate blendshapes could be constructed. For example, many meshes have irregular edges with poor triangulation and mesh artifacts around the teeth and eye regions. Also, many of the scans contained surface regions not represented in the master mesh, such as the surface of the eyes and teeth. These regions should not be corresponded with master mesh vertices. Finally, the scans, captured from a frontal stereo camera pair, did not cover the full facial region of the animatable master mesh. Figures 13 and 14 show examples of each of these issues. These aspects of the data meant that a fully automatic correspondence algorithm would be unlikely to provide adequate results.

The correspondence method we used required some simple manual data cleaning and annotation to produce data more amenable to automatic processing. For each expression scan, we removed regions from both the master mesh and the expression mesh to achieve rough edge consistency. For example, we removed the neck region from the expression scans and the face sides from the master to produce consistent facial coverage. We also removed uncorrespondable regions from the meshes by deleting vertices; these were commonly the teeth and eyeball regions of the expression scans, as such regions are not present in the master mesh. This process also removed most of the mesh artifacts associated with discontinuities in the expression scans. The 3D locations of the stabilization dots were annotated manually and served as a sparse set of known points of correspondence between the master mesh and each expression scan. For each expression, this sparse correspondence information was used to initialize and stabilize a proprietary automatic method of determining a dense correspondence. The method used a 3D spatial location and mesh normal agreement measure within a 2D conformal mapping frame (such as in [25]) to obtain the required dense correspondence. This process resulted in a mapping of each master mesh vertex to a position on the surface of the appropriate expression mesh.

With the correspondence between our partial, cleaned-up meshes it was possible to calculate the shape change required to transform the neutral master mesh to each expression. However, a rigid translation/rotation between the master and expression scans was also generally present in the data. This rigid transformation must be calculated to ensure that only motion due to the changing facial expression, and not residual head motion, is represented in each blendshape. To do this, a 3D rigid transformation for each expression was calculated using a manually selected subset of texture markers. The texture markers were chosen independently for each expression so as to be the least affected by the skin and jaw motion for that expression, and thus provide a good basis for calculating a rigid transformation between scans. Once the rigid transformations were calculated and factored out of each blendshape, blendshape deltas for each vertex in the partial master mesh were calculated. This produced partial blendshapes as shown in Figure 15(a,b).
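A minimal sketch of the rigid stabilization just described, using the standard SVD-based (Kabsch) solution on the manually selected marker subset; the function and variable names are ours, not Image Metrics' internal tools:

```python
import numpy as np

def rigid_transform(markers_neutral, markers_expression):
    """Least-squares rotation R and translation t mapping expression-scan
    markers (N, 3) onto the corresponding neutral markers (N, 3), so that
    residual head motion can be factored out before taking deltas."""
    mu_n = markers_neutral.mean(axis=0)
    mu_e = markers_expression.mean(axis=0)
    H = (markers_expression - mu_e).T @ (markers_neutral - mu_n)
    U, _, Vt = np.linalg.svd(H)
    # Guard against reflections.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = mu_n - R @ mu_e
    return R, t

# Usage sketch: stabilize a whole expression scan, then take deltas.
# R, t = rigid_transform(dots_neutral, dots_expression)
# stabilized = expression_verts @ R.T + t
# blendshape_delta = stabilized - neutral_verts
```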


Figure 11: (a) Stitched ear-to-ear neutral-expression mesh comprising several million polygons, made from merging left, front, and right scans. (b) Remeshed neutral mesh with 4,000 polygons and animation-friendly topology.


Figure 9: Details of the 3D face scans showing various dynamic facial deformation phenomena.

Figure 10: (a) A plaster cast of Emily's upper and lower teeth. (b) Resulting merged 3D model before remeshing. (c) Remeshed model with 10,000 polygons each for the upper and lower teeth.

Figure 12: (a) Ear-to-ear master mesh with (b,c,d) three partial expression scans to be used as blend shapes.

Any jaw movement had to be subtracted from each blendshape; this step was very important because it eliminated redundancy in the rig. One example was the lipSuck shape, where both the top and bottom lips are sucked inside the mouth. In order to capture the maximum amount of data for the lips, this expression was scanned and modeled with the jaw slightly open. After the blendshape was modeled, the jaw movement was subtracted by going negative with the jawOpen shape.

A custom algorithm was used to map the partial blendshapes onto the full master mesh. The missing parts of the full master mesh that were not observed in the expression scan were interpolated using the assumption that the extreme edges of the face remained static, which we found to be a reasonable assumption for most expressions. Where this assumption was not appropriate, the results were artistically corrected to give the desired shape. The unknown internal parts of the master mesh, such as the inner mouth and lips, were interpolated to move appropriately with the known vertices. Figure 15(c) illustrates the result of this interpolation process.

The final results of the automatic blendshape creation process were then artistically cleaned up by a small team of rigging artists to provide a full set of quality-assured expression blendshapes.

After the blendshape modeling and cleanup process, Image Metrics ended up with approximately 30 "pre-split" blendshapes, one blendshape corresponding to each scan. However, most of the scans were captured with "doubled up" facial expressions. For example, the browRaise and chinRaise were captured in the same scan. This was done to get through the scanning session quicker and to save processing time. But now the blendshapes had to be split up into localized shapes – the "post-split" shape set.

To do this, we used the Paint Blendshape Weights feature in Maya, creating a set of normalized "split maps" for each facial region. For example, this is how the browRaise blendshape was split up. Three normalized split maps were painted for the browRaise blendshape: browRaise_L, browRaise_C, and browRaise_R, roughly following the eyebrow raisers described in [7]. A custom MEL script was then run which applied each split map to the browRaise shape in the Paint Blendshape Weights tool. The left, center, and right browRaise blendshapes were then duplicated out. Splitting up the browRaise shapes in this way allowed the animator to control every region of the brows independently. However, because the split maps were normalized, turning on all three regions together summed to the original pre-split browRaise shape. All the pre-split blendshapes underwent this process, resulting in a post-split shape set of roughly 75 blendshapes. This shape set gave the Image Metrics animators an unprecedented amount of control over each region of the face.
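In array form, the split-map idea reduces to weighting a pre-split delta by each normalized per-vertex map, a sketch of which is shown below (our own illustration, not the MEL script that was actually used; the names are hypothetical):

```python
import numpy as np

def split_blendshape(delta, split_maps):
    """delta: (V, 3) pre-split blendshape delta, e.g. browRaise.
    split_maps: dict of per-vertex weights in [0, 1], e.g.
        {'browRaise_L': wL, 'browRaise_C': wC, 'browRaise_R': wR},
    normalized so the weights sum to 1 on every affected vertex."""
    return {name: w[:, None] * delta for name, w in split_maps.items()}

# Because the maps are normalized, turning all the split shapes on at
# full weight reproduces the original pre-split shape:
#   sum(split_blendshape(delta, maps).values()) == delta  (up to rounding)
```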

Figure 15: (a) A partial master mesh with (b) a partial blendshape expression mesh. (c) A complete expression scan, created by finding correspondences from (b) to (a) to the complete master mesh.

Figure 13: Mesh artifacts and irregular edges in the raw Emily scans.

Figure 14: Inconsistent coverage at the periphery of the face and uncorrespondable regions (the eyeball and eyelid, also the teeth) in the raw Emily scans.

4.3 The Facial Rig's User Interface

The rig user interface (Fig. 16) was inspired by the facial muscles and their "direction of pull". Many of the controls were represented as arrow-shaped NURBS curves, with each arrow representing a certain facial muscle. Pulling an arrow toward its point was equivalent to contracting that muscle. Additional animation controls were placed in the Maya channel box allowing numerical input.

Figure 16: The user interface (right) for the Digital Emily facial rig (left).

4.4 Soft Eyes and Sticky Lips

The Emily facial rig included two notable special effects: soft eyes and sticky lips. Soft eyes is the effect of the rotation of the cornea pushing and tugging the eyelids and skin around the eye. The soft eyes setup created for Emily was relatively simple in that each eye had a separate blendshape node with four shapes: lookUp, lookDown, lookLeft, and lookRight. The envelope of this blendshape node was negated by the blink shape. In other words, whenever the eye blinked, the soft eyes effect was turned off. This was done to prevent a conflict between the blink and lookDown blendshapes.

Sticky lips is the subtle effect of the lips peeling apart during speech, and has been a part of several high-end facial rigs, including the Gollum character built at WETA Digital for the Lord of the Rings sequels. The Emily rig had a relatively elaborate sticky lips setup involving a series of Maya deformers which provided animators a "sticky" control for both corners of the lips.

4.5 Adding Blendshape Displacement Maps

In order to preserve as much of the scan data as possible, the Emily render had 30 animated displacement maps. The displacement maps were extracted using Pixologic's ZBrush software by placing each pre-split blendshape on top of its corresponding scan and calculating the difference between the two. Then each displacement map was cleaned up in Photoshop and divided according to the same normalized split maps used to split the blendshapes. This yielded a displacement map for each blendshape in the rig (although only a subset of these were used in the final renderings).

It is important to note that the highest-frequency, pore-level detail displacement map came only from the neutral scan. All the other displacement maps had a median filter applied to them to remove any high frequency detail while still keeping the wrinkles. This was done because adding two maps with high frequency detail would result in the pores "doubling up" in the render, since the high-resolution scans had not been aligned to each other at the level of skin pores and fine wrinkles. Therefore, only the neutral displacement map had the pore detail; all the other maps had only wrinkles without pores. The displacement map animation was driven directly by the corresponding blendshape animation.
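A rough approximation of that cleanup step is sketched below using a generic median filter; the kernel size is a guess, and the production cleanup was done by hand in Photoshop:

```python
import numpy as np
from scipy.ndimage import median_filter

def wrinkle_only_displacement(disp, kernel=15):
    """Keep the broad wrinkle signal of an expression displacement map and
    suppress pore-scale detail, so that pores come only from the neutral
    map and do not 'double up' when the maps are combined in the render."""
    return median_filter(disp, size=kernel)

def split_displacement(disp, split_maps):
    # Divide a per-expression displacement map (2D, UV space) with the
    # same normalized split maps used for the blendshapes.
    return {name: w * disp for name, w in split_maps.items()}
```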
4.6 Adding the Teeth

Plaster casts of Emily's upper and lower teeth were made using standard dental casting techniques. The two casts were scanned with a structured light scanning system to produce two 600,000 polygon meshes from sixteen merged scans acquired using structured lighting patterns based on Gray codes. These teeth scans were manually remeshed to have far fewer triangles and smoother topology. The top and bottom teeth meshes were remeshed to 10,000 polygons each as in Fig. 10(c). We then carefully placed the teeth geometry inside the neutral mesh using the high-resolution smile scans as reference for the teeth positions.

The teeth scanning was done relatively late in the facial model construction process. Until the teeth were scanned, a generic teeth model was used in the animated character. We found this to be significantly less believable than using the model of Emily's actual teeth.

5 Video-Based Facial Animation

Digital Emily's facial animation was created using Image Metrics' proprietary video analysis and animation system, which allows animators to associate character poses with a small subset of performance frames. The analysis of the performance requires video from a single standard video camera and is designed to capture all the characteristics of the actor. The animation technology then uses the example poses provided by an animator to generate predictions of the required character pose for each frame of the performance. The animator can then iteratively refine this prediction by adding more example poses until the desired animation is achieved. As the process is example driven, it presents a natural framework to allow artistic and stylistic interpretation of an actor's performance, for example when using a human to drive a cartoon or animal character. However, in this case the technology was used for the purpose of producing a faithful one-to-one reproduction of Emily's performance.

The Image Metrics facial animation process has several advantages over traditional performance-driven animation techniques which significantly added to the realism of the digital character. First, the process is based on video of the actor performing, which provides a great deal of information regarding the motion of the actor's eyes and mouth; these are problematic areas to capture faithfully with traditional motion capture markers, but are the most important part of the face for communicating an actor's performance. Second, the process leverages an appropriate division of labor between the animator and the automated algorithms. Because an artist is part of the process, they can ensure that each of the key animation poses reads in an emotionally faithful way to the appearance of the actor in the corresponding frame of video. This process would be difficult to automate, since it involves reading and comparing the emotional content of faces in significant detail. Conversely, the trajectories and timing of the facial motion between key poses are successfully derived from the automated video analysis; achieving realistic timing in photoreal facial animation manually requires a great deal of skill and time for an animator. As a result, the process combines what an animator can do quickly and well with what an automatic process can accomplish, yielding facial animation with higher quality and greater efficiency than either fully manual or fully automatic techniques currently allow.

6 Tracking, Lighting, Rendering, and Compositing

Emily's performance was shot from two angles using high-definition cameras: a front-on closeup for use in the Image Metrics analysis pipeline, and a medium shot in a three-quarter view to provide the background plate for the final video. The final renderings needed to be match-moved to Emily's head position in the three-quarter shot. Complicating the process (but representative of many practical production scenarios), the cameras were uncalibrated and there were no markers on Emily's face to assist in tracking.
Match moving requires sub-pixel accuracy and temporal consistency of pose, otherwise unwanted "floating" effects become easily visible. To achieve this, we developed a manually guided match-moving process. We manually set the 3D pose of the animated character on several example frames so that the projected location of the model closely matched that of Emily in the video frame. Using these frames as templates, we applied a model-based optical flow algorithm [1] to calculate the required character pose in the intermediate frames. To ensure a smooth temporal path within the pose space, we developed a novel weighted combination of template frames that smoothly favors the temporally closest set of templates. Any errors were manually corrected and added as a new template to derive a new pose-space tracking result, until the results no longer exhibited artifacts.
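The paper does not give the weighting formula; as a purely illustrative sketch, assuming inverse-temporal-distance weights over the nearest templates, the pose prediction before optical-flow refinement could look like:

```python
import numpy as np

def blended_template_pose(frame, template_frames, template_poses, k=2, eps=1e-3):
    """Weighted combination of manually set template poses that smoothly
    favors the temporally closest templates.  Poses are simple parameter
    vectors here (e.g. translation plus rotation parameters); the real
    system refines the result with model-based optical flow [1]."""
    t = np.asarray(template_frames, dtype=float)
    d = np.abs(t - frame)
    nearest = np.argsort(d)[:k]
    w = 1.0 / (d[nearest] + eps)          # inverse temporal distance
    w /= w.sum()
    return sum(wi * template_poses[i] for wi, i in zip(w, nearest))
```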

Figure 17: (a) The light probe image acquired to simulate the live-action lighting conditions on Digital Emily's face. (b) A frame from a facial contour pass used in the compositing process.

Figure 18: A final rendering from the Digital Emily animation. The face is completely computer-rendered, including its eyes, teeth, cheeks, lips, and forehead. Emily's ears, hair, neck, arms, hands, body, and clothing come from the original background plate. Animated results may be seen at: http://gl.ict.usc.edu/Research/DigitalEmily/

Since Emily’s face would be composited onto live-action apply the process again would likely require significantly fewer background plate including real video of her hair, ears, neck, resources to achieve results of equal or even better quality. and body, it was imperative for the rendered version of Emily’s face to look completely convincing. Refining the rendering Based on the experiences of the Digital Emily project, five parameters took a rendering artist approximately three months sufficient (and perhaps necessary) steps for achieving a to perfect. The rendering was done in Mental Ray using the photoreal digital actor are: Fast Subsurface Scattering skin shader. The lighting was based on an high dynamic range light probe image captured 1. Sufficient facial scanning resolution accurate to the on the day of the shoot [4] as seen in Figure 17(a). Many level of skin pores and fine wrinkles, achieved with the different passes were rendered, including diffuse, specular, scanning process of [18] matte, and contour passes. In particular, the contour pass (Fig. 17(b)) was very important for visually integrating (or 2. 3D geometry and appearance data from a wide range of ”marrying”) the different components of facial geometry using facial expressions, also achieved through [18] different amounts of blur in the composite. Otherwise, for 3. Realistic facial animation, as based on a real actor’s example, the line where the eyelid meets the eyeball would performance, including detailed motion of the eyes and appear too sharp and thus unrealistic. Compositing was mouth, which was achieved through the semi-automatic done in the Eyeon Fusion package. Emily’s fingers were Image Metrics video-based facial animation process rotoscoped whenever she moved them in front of her face so they could also obscure her digital face. Small paint fixes 4. Realistic skin reflectance including translucency, were done to the final renderings around the eye highlights. leveraging a subsurface scattering technique based on Two frames from a final Emily animation are shown in [14] Fig. 18, and animated results may be seen at the web site http://gl.ict.usc.edu/Research/DigitalEmily/. 5. Accurate lighting integration with the actor’s environment, done using HDRI capture and image- based lighting as described in [4] 7 Discussion

7 Discussion

Some timeframes and personnel for the Digital Emily Project were:

• Scanning: 1.5 hours, 3 seconds per scan, 3 technicians
• Scan Processing: 10 days, 37 processed scans, 1 artist
• Rig Construction: 3 months, 75 blendshapes, 1 artist
• Animation: 2 weeks, 90 seconds of animation, 2 animators
• Rendering/Compositing: 3 months, 1 artist
• Output: 24fps, 1920x1080 pixel resolution

A great deal of information was learned and many tools were developed and improved in the context of this project, so to apply the process again would likely require significantly fewer resources to achieve results of equal or even better quality.

Based on the experiences of the Digital Emily project, five sufficient (and perhaps necessary) steps for achieving a photoreal digital actor are:

1. Sufficient facial scanning resolution, accurate to the level of skin pores and fine wrinkles, achieved with the scanning process of [18]

2. 3D geometry and appearance data from a wide range of facial expressions, also achieved through [18]

3. Realistic facial animation, as based on a real actor's performance, including detailed motion of the eyes and mouth, which was achieved through the semi-automatic Image Metrics video-based facial animation process

4. Realistic skin reflectance including translucency, leveraging a subsurface scattering technique based on [14]

5. Accurate lighting integration with the actor's environment, done using HDRI capture and image-based lighting as described in [4]

7.1 Conclusion: Lessons Learned

The experience of creating Digital Emily taught us numerous lessons which will be useful in the creation of future digital characters. These lessons included:

• Having consistent surface (u,v) coordinates across the scans, accurate to skin pore detail, would be of significant use in the process; a great deal of the effort involved was forming correspondences between the scans.

• Although the Image Metrics facial animation system requires no facial markers for animation, including at least a few facial markers as the actor performs would make the head tracking process much easier.

• Giving the digital character their own teeth, accurately scanned and placed within the head and jaw, is important for the believability of the character and their resemblance to the original person.

References

[1] Simon Baker and Iain Matthews. Lucas-Kanade 20 years on: A unifying framework. Int. J. Comput. Vision, 56(3):221–255, 2004.
[2] G. Borshukov and J. P. Lewis. Realistic human face rendering for "The Matrix Reloaded". In ACM SIGGRAPH 2003 Sketches & Applications, 2003.
[3] Chang. Beowulf. Variety, Nov 2007.
[4] Paul Debevec. Rendering synthetic objects into real scenes: Bridging traditional and image-based graphics with global illumination and high dynamic range photography. In Proceedings of SIGGRAPH 98, Computer Graphics Proceedings, Annual Conference Series, pages 189–198, July 1998.
[5] Paul Debevec, Tim Hawkins, Chris Tchou, Haarm-Pieter Duiker, Westley Sarokin, and Mark Sagar. Acquiring the reflectance field of a human face. In Proceedings of ACM SIGGRAPH 2000, Computer Graphics Proceedings, Annual Conference Series, pages 145–156, July 2000.
[6] Paul Debevec, Chris Tchou, Andrew Gardner, Tim Hawkins, Charis Poullis, Jessi Stumpfel, Andrew Jones, Nathaniel Yun, Per Einarsson, Therese Lundgren, Marcos Fajardo, and Philippe Martinez. Estimating surface reflectance properties of a complex scene under captured natural illumination. Technical Report ICT-TR-06.2004, USC ICT, Marina del Rey, CA, USA, June 2004. http://gl.ict.usc.edu/Research/reflectance/Parth-ICT-TR-06.2004.pdf.
[7] P. Ekman and W. Friesen. Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press, Palo Alto, 1978.
[8] Shachar Fleishman, Iddo Drori, and Daniel Cohen-Or. Bilateral mesh denoising. ACM Transactions on Graphics, 22(3):950–953, July 2003.
[9] Abhijeet Ghosh, Tim Hawkins, Pieter Peers, Sune Frederiksen, and Paul Debevec. Practical modeling and acquisition of layered facial reflectance. ACM Transactions on Graphics, 27(5):139:1–139:10, December 2008.
[10] Brian Guenter, Cindy Grimm, Daniel Wood, Henrique Malvar, and Frédéric Pighin. Making faces. In Proceedings of SIGGRAPH 98, Computer Graphics Proceedings, Annual Conference Series, pages 55–66, July 1998.
[11] Tim Hawkins, Andreas Wenger, Chris Tchou, Andrew Gardner, Fredrik Göransson, and Paul Debevec. Animatable facial reflectance fields. In Rendering Techniques 2004: 15th Eurographics Workshop on Rendering, pages 309–320, June 2004.
[12] Walter Hyneman, Hiroki Itokazu, Lance Williams, and Xinmin Zhao. Human face project. In ACM SIGGRAPH 2005 Course #9: Digital Face Cloning, New York, NY, USA, July 2005. ACM.
[13] ICT Graphics Laboratory. Realistic human face scanning and rendering. Web site, 2001. http://gl.ict.usc.edu/Research/facescan/.
[14] Henrik Wann Jensen, Stephen R. Marschner, Marc Levoy, and Pat Hanrahan. A practical model for subsurface light transport. In Proceedings of ACM SIGGRAPH 2001, Computer Graphics Proceedings, Annual Conference Series, pages 511–518, August 2001.
[15] Debra Kaufman. Photo genesis. WIRED, 7(7), July 1999. http://www.wired.com/wired/archive/7.07/jester.html.
[16] Hayden Landis. Production-ready global illumination. In Notes for ACM SIGGRAPH 2002 Course #16: RenderMan in Production, New York, NY, USA, July 2002. ACM.
[17] Wan-Chun Ma. A Framework for Capture and Synthesis of High Resolution Facial Geometry and Performance. PhD thesis, National Taiwan University, 2008.
[18] Wan-Chun Ma, Tim Hawkins, Pieter Peers, Charles-Felix Chabert, Malte Weiss, and Paul Debevec. Rapid acquisition of specular and diffuse normal maps from polarized spherical gradient illumination. In Rendering Techniques, pages 183–194, 2007.
[19] Wan-Chun Ma, Andrew Jones, Jen-Yuan Chiang, Tim Hawkins, Sune Frederiksen, Pieter Peers, Marko Vukovic, Ming Ouhyoung, and Paul Debevec. Facial performance synthesis using deformation-driven polynomial displacement maps. ACM Transactions on Graphics, 27(5):121:1–121:10, December 2008.
[20] Masahiro Mori. Bukimi no tani (The uncanny valley). Energy, 7(4):33–35, 1970.
[21] Diego Nehab, Szymon Rusinkiewicz, James Davis, and Ravi Ramamoorthi. Efficiently combining positions and normals for precise 3D geometry. ACM Transactions on Graphics, 24(3):536–543, August 2005.
[22] Steve Perlman. Volumetric cinematography: The world no longer flat. Mova White Paper, Oct 2006.
[23] Barbara Robertson. What's old is new again. Computer Graphics World, 32(1), Jan 2009.
[24] Mark Sagar, John Monos, John Schmidt, Dan Ziegler, Sing-Choong Foo, Remington Scott, Jeff Stern, Chris Waegner, Peter Nofz, Tim Hawkins, and Paul Debevec. Reflectance field rendering of human faces for "Spider-Man 2". In SIGGRAPH '04: ACM SIGGRAPH 2004 Technical Sketches, New York, NY, USA, 2004. ACM.
[25] Yang Wang, Mohit Gupta, Song Zhang, Sen Wang, Xianfeng Gu, Dimitris Samaras, and Peisen Huang. High resolution tracking of non-rigid motion of densely sampled 3D data using harmonic maps. Int. J. Comput. Vision, 76(3):283–300, 2008.
[26] Andreas Wenger, Andrew Gardner, Chris Tchou, Jonas Unger, Tim Hawkins, and Paul Debevec. Performance relighting and reflectance transformation with time-multiplexed illumination. ACM Transactions on Graphics, 24(3):756–764, August 2005.
[27] Ellen Wolff. Creating virtual performers: Disney's human face project. Millimeter magazine, April 2003.
[28] Zhengyou Zhang. A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell., 22(11):1330–1334, 2000.