<<

US009036898B1

(12) United States Patent - Beeler et al.
(10) Patent No.: US 9,036,898 B1
(45) Date of Patent: May 19, 2015

(54) HIGH-QUALITY PASSIVE PERFORMANCE CAPTURE USING ANCHOR FRAMES

(75) Inventors: Thabo Beeler, Zurich (CH); Bernd Bickel, Zurich (CH); Fabian Hahn, Regensdorf (CH); Derek Bradley, Zurich (CH); Paul Beardsley, Zurich (CH); Bob Sumner, Zurich (CH); Markus Gross, Zurich (CH)

(73) Assignee: DISNEY ENTERPRISES, INC., Burbank, CA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 535 days.

(21) Appl. No.: 13/287,774

(22) Filed: Nov. 2, 2011

Related U.S. Application Data
(60) Provisional application No. 61/433,926, filed on Jan. 18, 2011.

(51) Int. Cl.: G06T 17/00 (2006.01)
(52) U.S. Cl.: CPC G06T 17/00 (2013.01)
(58) Field of Classification Search: CPC G06T 17/00; USPC 382/154. See application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS
2009/0003686 A1* 1/2009 Gu ........................... 382/154
2009/0016435 A1* 1/2009 Brandsma et al. ...... 375/240.12
2009/0153655 A1  6/2009 Ike et al. .................. 348/77
2010/0189342 A1* 7/2010 Parr et al. ................ 382/154
2010/0284607 A1* 11/2010 Van Den Hengel et al. ... 382/154

OTHER PUBLICATIONS
"High Resolution Passive Facial Performance Capture," Derek Brad...
Alexander, O., et al., "The digital emily project: photoreal facial modeling and animation," ACM SIGGRAPH, 2009, Courses 1-15.
Anuar, N., et al., "Extracting animated meshes with adaptive motion estimation," Proc. Vision, Modeling, and Visualization, 2004, pp. 63-71.
Beeler, T., et al., "High-quality single-shot capture of facial geometry," ACM Trans. Graphics, 2010, Proc. SIGGRAPH, 9 pages.
(Continued)

Primary Examiner - Stephen R Koziol
Assistant Examiner - Delomia Gilliard
(74) Attorney, Agent, or Firm - Kilpatrick Townsend & Stockton LLP

(57) ABSTRACT
High-quality passive performance capture using anchor frames derives in part from a robust tracking algorithm. Tracking is performed in image space and uses an integrated result to propagate a single reference mesh to performances represented in the image space. Image-space tracking is computed for each camera in multi-camera setups. Thus, multiple hypotheses can be propagated forward in time. If one flow computation develops inaccuracies, the others can compensate. This yields results superior to mesh-based tracking techniques because image data typically contains much more detail, facilitating more accurate tracking. Moreover, the problem of error propagation due to inaccurate tracking in image space can be dealt with in the same domain in which it occurs. Therefore, there is less complication of distortion due to parameterization, a technique used frequently in mesh processing algorithms.

21 Claims, 20 Drawing Sheets

[FIG. 2: frame sequence → Image Acquisition Stage 210 → Mesh Reconstruction Stage 220 → ... → Propagation Stage 250 → Mesh Refinement Stage 260]

(56) References Cited

OTHER PUBLICATIONS

Bickel, B., et al., "Multi-scale capture of facial geometry and motion," ACM Trans. Graphics, 2007, Proc. SIGGRAPH, vol. 33, 10 pages.
Blanz, V., et al., "Reanimating faces in images and video," Computer Graphics Forum, 2003, Proc. Eurographics, vol. 22, No. 3, pp. 641-650.
Bradley, D., et al., "High resolution passive facial performance capture," ACM Trans. Graphics, 2010, Proc. SIGGRAPH, 10 pages.
DeCarlo, D., et al., "The integration of optical flow and deformable models with applications to human face shape and motion estimation," CVPR, 1996, vol. 231, 8 pages.
Essa, I., et al., "Modeling, tracking and interactive animation of faces and heads using input from video," Proceedings of Computer Animation, 1996, vol. 68, 12 pages.
Furukawa, Y., et al., "Dense 3d motion capture for human faces," CVPR, 2009, 8 pages.
Guenter, B., et al., "Making faces," Computer Graphics, 1998, ACM Press, New York, SIGGRAPH 98 Proceedings, pp. 55-66.
Hernandez, C., et al., "Self-calibrating a real-time monocular 3d facial capture system," Proceedings International Symposium on 3D Data Processing, Visualization and Transmission, 2010, 8 pages.
Kraevoy, V., et al., "Cross-parameterization and compatible remeshing of 3d models," ACM Trans. Graph., 2004, vol. 23, pp. 861-869.
Li, H., et al., "3-d motion estimation in model-based facial image coding," IEEE Trans. Pattern Anal. Mach. Intell., 1993, vol. 15, No. 6, pp. 545-555.
Ma, W., et al., "Facial performance synthesis using deformation-driven polynomial displacement maps," ACM Trans. Graphics, 2008, Proc. SIGGRAPH Asia, vol. 27, No. 5, 10 pages.
Pighin, F. H., et al., "Resynthesizing facial animation through 3d model-based tracking," ICCV, 1999, pp. 143-150.
Popa, T., et al., "Globally consistent space-time reconstruction," Eurographics Symposium on Geometry Processing, 2010, 10 pages.
Rav-Acha, A., et al., "Unwrap mosaics: A new representation for video editing," ACM Transactions on Graphics, 2008, SIGGRAPH, Aug. 2008, 12 pages.
Sumner, R. W., et al., "Deformation transfer for triangle meshes," ACM Trans. Graph., 2004, vol. 23, pp. 399-405.
Wand, M., et al., "Efficient reconstruction of nonrigid shape and motion from real-time 3d scanner data," ACM Trans. Graph., 2009, vol. 28, No. 2, pp. 1-15.
Wang, Y., et al., "High resolution acquisition, learning and transfer of dynamic 3-d facial expressions," Computer Graphics Forum, 2004, vol. 23, No. 3, pp. 677-686.
Williams, L., "Performance-driven facial animation," Computer Graphics, 1990, Proceedings of SIGGRAPH 90, vol. 24, pp. 235-242.
Winkler, T., et al., "Mesh massage," The Visual Computer, 2008, vol. 24, pp. 775-785.
Zhang, L., et al., "Spacetime faces: High resolution capture for modeling and animation," ACM Transactions on Graphics, 2004, vol. 23, No. 3, pp. 548-558.
Ekman, P., et al., "Facial Action Coding System: The Manual on CD Rom, HTML Demonstration Version," [online], published by A Human Face, 2002, ISBN 0-931835-01-1, retrieved on Oct. 3, 2013, retrieved from the internet.

[FIG. 1: block diagram of system 100 with Design Computer(s) 110, Object Libraries 120, Object Modeling System(s) 130, Object Articulation System(s) 140, Object Animation System(s) 150, Object Simulation System(s) 160, Object Rendering System(s) 170]


[FIG. 3, method 300: Begin 310 → Capture performance to generate a sequence of frames 320 → Identify objects in each frame in the sequence of frames 330 → Generate geometry for each identified object 340 → Store geometry generated for each identified object 350 → End 360]

[FIG. 4, method 400: Begin 410 → Determine a reference frame in a sequence of frames capturing a performance 420 → Identify anchor frames in sequence of frames based on reference frame 430 → Generate clips based on anchor frames 440 → Store generated clips 450 → End 460]

[FIG. 5: Begin 510 → Determine feature set of reference frame 520 → Receive plurality of target frames 530 → Perform correspondence matching between feature set of reference frame and each target frame 540 → Determine amount of correspondence between feature set of reference frame and each target frame 550 → Generate information identifying each target frame as an anchor frame when amount of correspondence satisfies selection criteria 560 → End 570]


[FIG. 8, method 800: Begin 810 → Receive plurality of clips 820 → For each clip in the plurality of clips, perform image-space tracking between reference frame and anchor frames of clip to generate motion estimation 830 → For each clip in the plurality of clips, perform image-space tracking between reference frame and unanchored frames of clip to generate motion estimation 840 → Generate tracking information for each clip in the plurality of clips 850 → End 860]
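Because every clip is bounded by its own anchor frames, the tracking enumerated in FIG. 8 can be computed for each clip independently and therefore in parallel. The sketch below is not part of the patent; it only illustrates that fan-out, and track_clip is a hypothetical stand-in for steps 830 through 850.

    from concurrent.futures import ProcessPoolExecutor

    def track_clip(clip):
        """Hypothetical stand-in for steps 830-850: track reference -> anchor
        frames, then anchor -> unanchored frames, for one clip."""
        # A real implementation would return per-frame motion estimates here.
        return {"clip": clip, "tracks": {}}

    def track_all_clips(clips, max_workers=8):
        # Clips are independent of one another, so each can be tracked on its
        # own core (or machine) and the results gathered afterwards.
        with ProcessPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(track_clip, clips))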

[FIG. 9: Begin 910 → Perform feature matching between reference frame and anchor frame of a clip to generate forward motion estimation and backward motion estimation 920 → Filter feature matches based on selection criteria 930 → Perform re-matching of unmatched features 940 → Perform match refinement 950 → Generate feature tracking information from reference frame to anchor frame 960 → End 970]
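FIG. 9 matches features in both directions between the reference frame and an anchor frame and then filters the matches against selection criteria. One common filter of this kind, shown below purely as an illustrative assumption rather than the criteria actually claimed, is a forward-backward consistency check: composing the forward motion with the backward motion should bring a pixel back close to where it started.

    import numpy as np

    def forward_backward_consistency(flow_fwd, flow_bwd, max_error=1.0):
        """Return a boolean mask of pixels whose forward and backward flows agree.

        flow_fwd, flow_bwd: (H, W, 2) arrays of (dx, dy) displacements.
        """
        h, w = flow_fwd.shape[:2]
        ys, xs = np.mgrid[0:h, 0:w]
        # Where does each pixel land in the target image under the forward flow?
        tx = np.clip(xs + flow_fwd[..., 0], 0, w - 1).astype(int)
        ty = np.clip(ys + flow_fwd[..., 1], 0, h - 1).astype(int)
        # Round trip: add the backward flow sampled at the landing position.
        round_trip = flow_fwd + flow_bwd[ty, tx]
        error = np.linalg.norm(round_trip, axis=-1)
        return error < max_error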

[FIG. 10, method 1000: Begin 1010 → Perform incremental frame-by-frame feature matching from a preceding anchor frame to an unanchored frame of a clip to generate forward motion estimation 1020 → Perform match refinement 1030 → Perform incremental frame-by-frame feature matching from a succeeding anchor frame to the unanchored frame of a clip to generate backward motion estimation 1040 → Perform match refinement 1050 → Choose forward motion estimation or backward motion estimation based on selection criteria 1060 → Generate feature tracking information from anchor frame to the unanchored frame 1070 → End 1080]
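FIG. 10 derives, for each unanchored frame, a forward motion estimate propagated from the preceding anchor frame and a backward estimate propagated from the succeeding anchor frame, and keeps whichever satisfies the selection criteria. The sketch below shows only that final choice; the per-feature positions and error values are hypothetical inputs, not the patent's actual error measure.

    def select_motion_estimate(forward_pos, forward_err, backward_pos, backward_err,
                               max_error=None):
        """Pick, per tracked feature, the forward or backward estimate with lower error.

        All inputs are dicts keyed by feature id; positions are (x, y) tuples and
        errors are floats. max_error, if given, drops features whose best estimate
        is still unreliable.
        """
        selected = {}
        for fid in forward_pos.keys() & backward_pos.keys():
            if forward_err[fid] <= backward_err[fid]:
                pos, err = forward_pos[fid], forward_err[fid]
            else:
                pos, err = backward_pos[fid], backward_err[fid]
            if max_error is None or err <= max_error:
                selected[fid] = pos
        return selected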


[FIG. 14: panels 1400, 1410, 1420]

[FIG. 15]

[FIG. 16, method 1600: Begin 1610 → Receive geometry associated with object in reference frame 1620 → Receive information identifying a target frame that includes the object 1630 → Receive feature tracking information from reference frame to target frame 1640 → Generate information propagating the geometry associated with the object based on the feature tracking information 1650 → End 1660]
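FIG. 16 propagates the reference-frame geometry to a target frame using the per-camera feature tracking information. When the tracked image position of a vertex is known in several calibrated cameras, one simple way to recover its new 3D position is linear (DLT) triangulation, sketched below. This is an illustrative reconstruction step under assumed calibrated cameras, not necessarily the propagation actually claimed.

    import numpy as np

    def triangulate_vertex(projections, observations):
        """Linear least-squares triangulation of one vertex from its tracked
        2D positions in several cameras.

        projections: list of 3x4 camera projection matrices P_c.
        observations: list of (x, y) tracked image positions, one per camera.
        Returns the 3D point minimizing the algebraic reprojection error.
        """
        rows = []
        for P, (x, y) in zip(projections, observations):
            rows.append(x * P[2] - P[0])
            rows.append(y * P[2] - P[1])
        A = np.stack(rows)
        # The solution is the right singular vector with the smallest singular value.
        _, _, vt = np.linalg.svd(A)
        X = vt[-1]
        return X[:3] / X[3]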

[FIG. 17: tracking information for frame 1730, tracking information for frame 1750, tracking information for frame 1770]

[FIG. 18, method 1800: Begin 1810 → Receive information propagating geometry of an object across clip 1820 → Perform per-frame refinement 1830 → Perform across-frame refinement 1840 → End 1850]
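FIG. 18 follows per-frame refinement with an across-frame refinement that, per the description later in this disclosure, enforces temporal coherence of the deforming surface. The sketch below illustrates one very simple temporal-coherence prior (blending each frame's vertex positions toward the average of their temporal neighbors) and is an assumption for illustration, not the refinement actually claimed.

    import numpy as np

    def temporal_smooth(vertex_tracks, weight=0.25):
        """Blend each frame's vertex positions toward the average of its
        temporal neighbors.

        vertex_tracks: (T, V, 3) array of V vertex positions over T frames.
        """
        smoothed = vertex_tracks.copy()
        neighbors = 0.5 * (vertex_tracks[:-2] + vertex_tracks[2:])
        smoothed[1:-1] = (1.0 - weight) * vertex_tracks[1:-1] + weight * neighbors
        return smoothed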


US 9,036,898 B1

HIGH-QUALITY PASSIVE PERFORMANCE CAPTURE USING ANCHOR FRAMES

...niques for use in CGI and computer-aided animation, some of which may be discussed herein.

CROSS-REFERENCES TO RELATED BRIEF SUMMARY APPLICATIONS The following portion of this disclosure presents a simpli This Application claims the benefit of and priority to U.S. fied Summary of one or more innovations, embodiments, Provisional Application No. 61/433,926, filed Jan. 18, 2011 and/or examples found within this disclosure for at least the and entitled “High-Quality Passive Facial Performance Cap purpose of providing a basic understanding of the Subject 10 matter. This Summary does not attempt to provide an exten ture Using Anchor Frames, the entire disclosure of which is sive overview of any particular embodiment or example. hereby incorporated by reference for all purposes. Additionally, this summary is not intended to identify key/ critical elements of an embodiment or example or to delineate BACKGROUND the scope of the Subject matter of this disclosure. Accordingly, 15 one purpose of this Summary may be to present some inno This disclosure relates to computer-generated imagery Vations, embodiments, and/or examples found within this (CGI) and computer-aided animation. More specifically, this disclosure in a simplified form as a prelude to a more detailed disclosure relates to techniques for high-quality passive per description presented later. formance capture using anchor frames. In one aspect, high-quality passive performance capture With the wide-spread availability of computers, computer using anchor frames derives in part from a robust tracking graphics artists and animators can rely upon computers to algorithm. Tracking is performed in image space and uses an assist in production for creating animations and com integrated result to propagate a single reference mesh to per puter-generated imagery (CGI). The production of CGI and formances represented in the image space. Image-space computer-aided animation may involve the extensive use of tracking is computed for each camera in multi-camera setups. various computer graphics techniques to produce a visually 25 Thus, multiple hypotheses can be propagated forward intime. appealing image from the geometric description of an object If one flow computation develops inaccuracies, the others can that may be used to convey an essential element of a story or compensate. This yields results Superior to mesh-based track provide a desired special effect. One of the challenges in ing techniques because image data typically contains much creating these visually appealing images can be the balancing more detail, facilitating more accurate tracking. Moreover, of a desire for a highly-detailed image of a character or other 30 the problem of error propagation due to inaccurate tracking in object with the practical issues involved in allocating the image space can be dealt with in the same domain in which it resources (both human and computational) required to pro occurs. Therefore, there is less complication of distortion due duce those visually appealing images. to parameterization, a technique used frequently in mesh Therefore, one issue with the production process is the time processing algorithms. 35 In various embodiments, methods for manipulating com and effort involved when a user undertakes to model the puter-generated objects can include receiving a plurality of geometric description of an object and the models associated images bounded by a first image and a second image. Track avars, rigging, shader variables, paint data, or the like. 
It may ing information is then generated for a third image in the take several hours to several days for the user to design, rig, plurality of images based on how a first portion of an actor's pose, paint, or otherwise prepare a model that can be used to 40 face in the first image corresponds in image space to a first produce the visually desired look. This involvement in time portion of the actor's face in the third image and how a first and effort can limit that ability of the user to create even a portion of the actor's face in the second image corresponds in single scene to convey a specific performance or to provide image space to a second portion of the actor's face in the third the desired visual effect. Additionally, artistic control over the image. Information that manipulates a computer-generated look of a model or its visual effect when placed in a scene may 45 object is generated based on the tracking information for the also be lost by some attempts at reducing the time and effect third image to reflect a selected portion of the actor's face in in preparing a model that rely too much on automated proce the third image. dural creation of models. In one aspect, generating the tracking information for the Thus, in CGI and computer-aided animation, games, inter third image may include determining how the first portion of active environments, synthetic storytelling, and virtual real 50 the actor's face in the first image corresponds incrementally ity, performance capture is one research topic of critical in image space to a portion of the actor's face in each image importance as it can reduce the time and effect in preparing a in a set of images in the plurality of images that occur in a model while preserving artistic control over the performance sequence of images in-between the first image and the third of the model. However, the complexity as well as the skills image. Generating the tracking information for the third and familiarity in interpreting real-life performances make 55 image may further include determining how the first portion the problem exceptionally difficult. A performance capture of the actor's face in the second image corresponds incremen result must exhibit a great deal of spatial fidelity and temporal tally in image space to a portion of the actor's face in each accuracy in order to be an authentic reproduction of a real image in of a set of images in the plurality of images that occur actor's performance. Numerous technical challenges, such as in a sequence of images in-between the second image and the robust tracking under extreme deformations and error accu 60 third image. mulation over long performances, contribute to the problems In another aspect, generating the tracking information for difficulty. the third image may include determining a first error associ Accordingly, what is desired is to solve one or more prob ated with how the first portion of the actor's face in the first lems relating to performance capture techniques for use in image corresponds in image space to the first portion of the CGI and computer-aided animation, some of which may be 65 actor's face in the third image. Generating the tracking infor discussed herein. 
Additionally, what is desired is to reduce mation for the third image may further include determining a one or more drawbacks relating to performance capture tech second error associated with how the first portion of the US 9,036,898 B1 3 4 actor's face in the second image corresponds in image space computer-generated objects includes code for receiving a to the second portion of the actor's face in the third image. On plurality of images bounded by a first image and a second which of the first image and the second image to base in part image, code for generating tracking information for a third the tracking information for the third image may be identified image in the plurality of images based on how a first portion in response to a comparison between the first error and the of an actor's face in the first image corresponds in image second error. space to a first portion of the actor's face in the third image and how a first portion of the actor's face in the second image In some embodiments, the second image satisfies similar corresponds in image space to a second portion of the actor's ity criteria associated with the first image. The first image and face in the third image, and code for generating information the second image may also satisfy similarity criteria associ that manipulates a computer-generated object based on the ated with a reference image of the actor. In some aspects, 10 tracking information for the third image to reflect a selected tracking information for the first image in the plurality of portion of the actor's face in the third image. images may be generated based on how a first portion of the In one embodiment, system for manipulating computer actor's face in a reference image of the actor corresponds in generated objects includes a and a memory storing image space to a second portion of the actor's face in the first code for receiving a plurality of images bounded by a first image. Tracking information for the second image in the 15 image and a second image, code for generating tracking infor plurality of images maybe generated based on how a second mation for a third image in the plurality of images based on portion of the predetermined image of the actor corresponds how a first portion of an actor's face in the first image corre in image space to a second portion of the actor's face in the sponds in image space to a first portion of the actor's face in second image. The tracking information for the third image the third image and how a first portion of the actor's face in the may further be generated based on at least one of the tracking second image corresponds in image space to a second portion information for the first image and the tracking information of the actor's face in the third image, and code for generating for the second image. Generating the tracking information for information that manipulates a computer-generated object the first image or the second image may include a bi-direc based on the tracking information for the third image to reflect tional comparison to and from one or more portions of the a selected portion of the actor's face in the third image. actor's face in the reference image of the actor. 25 In a still further embodiment, for capturing facial perfor In further embodiments, a first set of features is identified mances and expressions of actors, a sequence of frames asso in the first image. 
A first set of features is then identified in the ciated with one or more facial captures of an actor may be third image that corresponds to the first set of features in the received. Each frame may include a plurality of stereo first image. Information indicative of how the first portion of images. A mesh is generated for each frame in the sequence of the actor's face in the first image corresponds in image space 30 frames that represents the facial position of all or part of the to the first portion of the actor's face in the third image can actor's face in the frame. One or more clips may be deter then be generated based on the first set of features in the third mined based on the sequence of frames and similarity criteria image that corresponds to the first set of features in the first associated with one or more reference frames. Each clip in the image. In further aspects, a first set of features is identified in plurality of clips is bounded by a first frame and a second the second image. A second set offeatures is then identified in 35 frame that satisfy the similarity criteria associated with a the third image that corresponds to the first set of features in reference frame in the one or more reference frames. A the second image. Information indicative of how the first motion field may be determined for at least one frame in a first portion of the actor's face in the second image corresponds in clip associated with a first reference frame based on an image image space to the second portion of the actor's face in the space tracking of pixels between each image in the at least one third image can then be generated based on the second set of 40 frame and each image in the first frame of the first clip and features in the third image that corresponds to the first set of between each image in the at least one frame and each image features in the second image. Identifying the first set of fea in the second frame of the first clip. Information that manipu tures in the first image or the second image may include lates a mesh for the first reference frame of the first clip may identifying one or more of a pixel, an image structure, a be generated based on the motion field for the at least one geometric structure, or a facial structure. 45 frame in the first clip. In at least one aspect, determining the In some embodiments, criteria is received configured to motion field for the at least one frame in the first clip may designate images as being similar to a reference image of the include determining a forward motion field and forward error actor. In response to a sequence of images, the plurality of from the image space tracking of pixels between each image images may be generated as a clip bounded by the first image in the at least one frame and each image in the first frame of and the second image based on the criteria configured to 50 the first clip and determining a backward motion field and a designate images as being similar to the reference image of backward error from the image space tracking of pixels the actor. between each image in the at least one frame and each image In still further embodiments, generating the information in the second frame of the first clip. 
Either the forward motion that manipulates the computer-generated object to reflect the field or the backward motion field may be selected based on selected portion of the actor's face in the third image may 55 whether the forward error or the backward error satisfies a include generating information configured to propagate a predetermined threshold. mesh from a first configuration to a second configuration. The A further understanding of the nature of and equivalents to information that manipulates the computer-generated object the subject matter of this disclosure (as well as any inherent or may further be modified in response to shape information express advantages and improvements provided) should be associated with one or more portions of the actor's face in the 60 realized in addition to the above section by reference to the third image. The information that manipulates the computer remaining portions of this disclosure, any accompanying generated object may further be modified in response to a drawings, and the claims. measure of temporal coherence over a set of two or more images in the plurality of images that includes the third BRIEF DESCRIPTION OF THE DRAWINGS image. 65 In one embodiment, a non-transitory computer-readable In order to reasonably describe and illustrate those inno medium storing computer-executable code for manipulating Vations, embodiments, and/or examples found within this US 9,036,898 B1 5 6 disclosure, reference may be made to one or more accompa FIG. 17 is an illustration depicting how tracking informa nying drawings. The additional details or examples used to tion for individual frames is used to pose a mesh associated describe the one or more accompanying drawings should not with a reference frame in one embodiment. be considered as limitations to the scope of any of the claimed FIG. 18 is a simplified flowchart of a method for refining inventions, any of the presently described embodiments and/ the propagated geometry in one embodiment. or examples, or the presently understood best mode of any FIG. 19 depicts images of example frames taken from innovations presented within this disclosure. multiple different sequences of an actor in one embodiment. FIG. 1 is a simplified block diagram of a system for creat FIG. 20 depicts images of several frames from a sequence ing computer graphics imagery (CGI) and computer-aided in which significant occlusion and reappearance of structure animation that may implement or incorporate various 10 around the mouth is handled using tracking information embodiments or techniques disclosed herein for high-quality obtained from the bi-directional image-space tracking in one passive performance capture using anchor frames. embodiment. FIG. 2 is an illustration depicting stages of a high-quality FIG. 21 depicts image of several frames from a sequence in passive performance capture process implemented by the which fast facial motion and motion blur is handled using 15 tracking information obtained from the bi-directional image system of FIG. 1 in one embodiment. space tracking in one embodiment. FIG.3 is a flowchart of a method for constructing geometry FIG.22 is a block diagram of a computer system or infor associated with objects in images of a captured performance mation processing device that may incorporate an embodi in one embodiment. ment, be incorporated into an embodiment, or be used to FIG. 
4 is a flowchart of a method for generating image practice any of the innovations, embodiments, and/or sequence clips of the captured performance based on a refer examples found within this disclosure. ence frame in one embodiment. FIG. 5 is a flowchart of a method for identifying anchor DETAILED DESCRIPTION frames of the clips using the reference frame in one embodi ment. 25 Introduction FIG. 6 is an illustration depicting results of a comparison between the reference frame and individual frames in the Performance capture techniques are disclosed that deliver captured performance used to identify anchor frames in one a single, consistent mesh sequence deforming over time to embodiment. precisely or Substantially match actors’ performances. 30 Resulting meshes exhibit extremely fine pore-level geometric FIG. 7 is an illustration of an exemplary clip in one embodi details, offering higher spatial fidelity than previously pos ment having anchor frames corresponding to the reference sible. The disclosed techniques are robust to expressive and frame. fast performance motions, reproducing extreme deforma FIG. 8 is a flowchart of a method for performing image tions with minimal drift. Additionally, temporally varying space tracking of clips of the captured performance in one 35 textures can be derived directly from the captured perfor embodiment. mance without the need for the actor to utilize makeup or FIG. 9 is a flowchart of a method for performing image other markers. Computation is also designed to be paralleliz space tracking between a reference frame and anchor frames able so that long sequences can be reconstructed efficiently of a clip in one embodiment. using multi-core and multi-processor implementations. FIG. 10 is a flowchart of a method for performing image 40 In one aspect, high-quality results derive in part from a space tracking between a reference frame and unanchored robust tracking algorithm. Tracking is performed in an image frames of a clip in one embodiment. space obtained using a multi-camera setup and passive illu FIG. 11 is an illustration depicting a bi-directional image mination and uses an integrated result to propagate a single space tracking in one embodiment. reference mesh to performances represented in the image FIG. 12 is an illustration depicting a sequence of frames 45 space. Image-space tracking is computed for each camera in pulled out of different times steps of the captured perfor the multi-camera setup. Thus, multiple hypotheses can be mance demonstrating low temporal drift resulting from the propagated forward intime. If one flow computation develops bi-directional image-space tracking by consistent attachment inaccuracies, the others can compensate. This yields results of a Superimposed grid pattern in one embodiment. Superior to mesh-based tracking techniques because image FIG. 13 is an illustration depicting another sequence of 50 data typically contains much more detail, facilitating more frames pulled out of different times steps of another captured accurate tracking. Moreover, the problem of error propaga performance demonstrating low temporal drift resulting from tion due to inaccurate tracking in image space can be dealt the bi-directional image-space tracking by consistent attach with in the same domain in which it occurs. Therefore, there ment of a Superimposed grid pattern in one embodiment. is no complication of distortion due to parameterization, a FIG. 
14 depicts images of a Zoom onto the forehead of an 55 technique used frequently in mesh processing algorithms. actor in an input image, a 3D reconstruction of the actor's As suggested above, integration error can eventually accu forehead according to the prior art, and a 3D reconstruction of mulate when reconstructing long capture sessions. In another the actor's forehead according to one embodiment. aspect, by employing an “anchor frame concept long capture FIG. 15 depicts images of the 3D reconstructions of the sessions and individual capture sessions taken at different actor's forehead at three time steps according to a prior art 60 times can be arranged into shortersequences or clips. It can be reconstruction on the left half of the images and a reconstruc observed that a lengthy performance typically contains main tion according to one embodiment on the right half of the frames that are similar in appearance. For example, when images. speaking, an actor's face naturally returns to a resting pose FIG. 16 is a flowchart of a method for propagating geom between sentences or during speech pauses. In various etry corresponding to objects in a captured performance using 65 embodiments, one frame (e.g., a frame that includes an actor tracking information obtained from the bi-directional image in a resting pose or other desirable pose) is designated as a space tracking in one embodiment. reference frame and then used to mark or otherwise identify US 9,036,898 B1 7 8 as “anchor frames' all other frames similar to the reference proven to yield impressive facial performance capture results frame. Due to the similarity between poses in the reference Ma et al. 2008: Alexander et al. 2009. frame and the anchor frames, an image tracker can compute Recently, research has focused on entirely passive capture, the flow from the reference frame to each anchor frame inde without requiring markers, structured light or expensive hard pendently and with higher accuracy. ware. Beeler et al. 2010 reconstruct pore-scale facial geom Additionally, each sequence between two consecutive etry for static frames. Bradley et al. 2010 perform passive anchors can be treated independently, integrating the tracking performance capture with automatic temporal alignment, but from both sides and enforcing continuity at the anchor bound they lackpore-scale geometry and fail when confronted with aries. The accurate tracking of each anchor frame prevents expressive motions. Other passive deformable surface recon error accumulation in lengthy performances. And, since the 10 struction techniques have been applied to faces Wand et al. computation of the track between two anchors is independent, 2009; Popa et al. 2010. Wandet al. 2009 reconstruct from the process can be parallelized across multiple cores or CPUs. point cloud data by fitting a template model. However, their Accordingly, anchor frames can span multiple capture ses method tends to lead to loss of geometric detail which is sions of the same Subject on different occasions without any necessary for realistic facial animation. This is partially additional special processing. This can be used to 'splice' 15 resolved in recent work by Popa et al. 2010 who use a and “mix and match' unrelated clips, adding a powerful new gradual change prior in a hierarchical reconstruction frame capability to the editorial process. work to propagate a mesh structure across frames. 
Facial Animation While image data, due to its Superior resolution and detail, Data-driven facial animation has come a long way since the can be an enormous help in tracking 3D points over time, it is marker-based techniques introduced over two decades ago not always available, and sometimes just a sequence of 3D Williams 1990: Guenter et al. 1998. Current state of the art point clouds or incompatibly-triangulated 3D meshes are the methods are now passive and almost fully automatic Bradley input. A number of authors have addressed the problem of et al. 2010; Popa et al. 2010. However, a technique for tracking a 3D mesh over time based on pure geometry. An capturing highly detailed expressive performances that com early approach was described by Anuar and Guskov 2004. pletely avoids temporal drift has yet to be realized. Towards 25 They start with an initial template mesh that is then propa this goal, an anchored reconstruction approach is disclosed gated through the frames of the animation, based on a 3D for capturing actor performances while limiting tracker drift adaptation of the Bayesian multi-scale differential optical and robustly handling occlusions and motion blur. flow algorithm. Since this flow is invariant to motion in the One approach to performance capture is to start with a tangent plane, it is notable to eliminate “swimming artifacts deformable model (or template), for example of a face, and 30 during the sequence. Inspired by the deformation transfer then determine the parameters that best fit the model from method of Sumner and Popovic 2004, which allows to images or videos of a performing actor Lietal. 1993: Essa et "copy and paste’ mesh geometry from one shape to another, al. 1996; DeCarlo and Metaxas 1996: Pighin et al. 1999; Winkler et al. 2008 present an optimization method to track Blanz et al. 2003. Using this approach, the approximate 3D triangle geometry over time combining terms measuring data shape and pose of the deforming face can be determined. 35 fidelity, and preservation of triangle shape. Shape preserva However, the deformable face tends to be very generic, so the tion is achieved by the use of mean-value barycentric coor resulting animations often do not resemble the captured actor. dinates. This also addresses motion in the tangent plane, The face model must also be low-resolution for the fitting eliminating the artifacts present in the results of Anuar and methods to be tractable, and so it is usually not possible to Guskov. obtain the fine Scale details that make a performance expres 40 Tracking triangle geometry is intimately related to the sive and realistic. problem of cross-parameterization, or compatible remeshing Another common approach for performance capture is to of shapes, where the objective is to impose a triangle mesh track a sparse number of hand-placed markers or face paint representing one shape onto another, in a manner which mini using one or more video cameras Williams 1990: Guenter et mizes distortion between the two. Not Surprisingly, shape al. 1998: Lin and Ouhyoung 2005; Bickel et al. 2007: 45 preserving coordinates seem to be an effective tool in achiev Furukawa and Ponce 2009. While these techniques can yield ing this goal too, as demonstrated by Kraevoy and Sheffer robust tracking of very expressive performances and are usu 2004). ally Suitable for a variety of lighting environments, the More recent work by Scharfet. al 2008 tracks pure geom manual placement of markers can be tedious and invasive. 
etry over time using a Volumetric representation. The main Furthermore, the markers must be digitally removed from the 50 idea behind their method is that the volume of an object videos if face color or texture is to be acquired. Also, the should be approximately constant over time, thus the flow marker resolution is naturally limited, and detailed pore-scale must be “incompressible'. This assumption, along with other performance capture has not been demonstrated with this standard continuity assumptions, regularizes the solution Suf approach. ficiently to provide a good track. An alternative to placing markers on the face is to project 55 System Overview active illumination on the Subject using one or more projec Further to these works, techniques are provided for an tors Wang et al. 2004; Zhang et al. 2004. While this anchor-based reconstruction with robust image-space track approach requires less manual setup, it can be equally inva ing. A multi-camera setup and passive illumination are used sive to the actor. Acquiring face color also poses a problem to deliver a single, consistent mesh sequence deforming over with these methods, as uniform illumination must be tempo 60 time to precisely match actors’ performances captured by the rally interleaved with the structured light, sacrificing tempo multi-camera setup. Resulting meshes exhibit extremely fine ral resolution. A related active-light technique is proposed by pore-level geometric details, offering higher spatial fidelity Hernandez and Vogiatzis Harnández and Vogiatzis 2010, than previously possible. The disclosed techniques are robust who use tri-colored illumination with both photometric and to expressive and fast facial motions, reproducing extreme multi-view stereo to obtain facial geometry in real-time, how 65 deformations with minimal drift. Additionally, temporally ever without temporal correspondence. Finally, combining varying textures can be derived directly from the captured markers and structured light with an expensive light stage has performance without the need for actors to utilize makeup or US 9,036,898 B1 9 10 other markers. Computation is also designed to be paralleliz cues, simulation data, texture data, lighting data, shader code, able so that long sequences can be reconstructed efficiently or the like. An object stored in object library 120 can include using multi-core and multi-processor implementations. Thus, any entity that has an n-dimensional (e.g., 2D or 3D) surface similar facial expressions in a sequence form anchor frames geometry. The shape of the object can include a set of points that are used to limit temporal drift, to match and reconstruct 5 or locations in space (e.g., object space) that make up the sequences in parallel, and to establish correspondences objects Surface. Topology of an object can include the con between multiple sequences of the same actor captured on nectivity of the Surface of the object (e.g., the genus or num different occasions. ber of holes in an object) or the vertex/edge/face connectivity FIG. 1 is a simplified block diagram of system 100 for of an object. creating computer graphics imagery (CGI) and computer 10 aided animation that may implement or incorporate various The one or more object modeling systems 130 can include embodiments or techniques for high-quality performance hardware and/or software elements configured for modeling capture using anchor frames. In this example, system 100 can one or more computer-generated objects. 
Modeling can include one or more design computers 110, object library include the creating, sculpting, and editing of an object. The 120, one or more object modeler systems 130, one or more 15 one or more object modeling systems 130 may be invoked by object articulation systems 140, one or more object animation or used directly by a user of the one or more design computers systems 150, one or more object simulation systems 160, and 110 and/or automatically invoked by or used by one or more one or more object rendering systems 170. processes associated with the one or more design computers The one or more design computers 110 can include hard 110. Some examples of software programs embodied as the ware and software elements configured for designing CGI one or more object modeling systems 130 can include com and assisting with computer-aided animation. Each of the one mercially available high-end 3D computer graphics and 3D or more design computers 110 may be embodied as a single modeling software packages 3D STUDIO MAX and computing device or a set of one or more computing devices. AUTODESK MAYA produced by Autodesk, Inc. of San Some examples of computing devices are PCs, laptops, work Rafael, Calif. stations, mainframes, cluster computing system, grid com 25 In various embodiments, the one or more object modeling puting systems, cloud computing systems, embedded systems 130 may be configured to generated a model to devices, computer graphics devices, gaming devices and con include a description of the shape of an object. The one or soles, consumer electronic devices having programmable more object modeling systems 130 can be configured to processors, or the like. The one or more design computers 110 facilitate the creation and/or editing of features, such as non may be used at various stages of a production process (e.g., 30 uniform rational B-splines or NURBS, polygons and subdi pre-production, designing, creating, editing, simulating, ani vision surfaces (or SubDivs), that may be used to describe the mating, rendering, post-production, etc.) to produce images. shape of an object. In general, polygons are a widely used image sequences, motion pictures, video, audio, or associated model medium due to their relative stability and functionality. effects related to CGI and animation. Polygons can also act as the bridge between NURBS and In one example, a user of the one or more design computers 35 SubDivs. NURBS are used mainly for their ready-smooth 110 acting as a modeler may employ one or more systems or appearance and generally respond well to deformations. Sub tools to design, create, or modify objects within a computer Divs are a combination of both NURBS and polygons repre generated Scene. The modeler may use modeling software to senting a smooth Surface via the specification of a coarser sculpt and refine a neutral 3D model to fit predefined aesthetic piecewise linear polygon mesh. A single object may have needs of one or more character designers. The modeler may 40 several different models that describe its shape. design and maintain a modeling topology conducive to a The one or more object modeling systems 130 may further storyboarded range of deformations. In another example, a generate model data (e.g., 2D and 3D model data) for use by user of the one or more design computers 110 acting as an other elements of system 100 or that can be stored in object articulator may employ one or more systems or tools to library 120. 
The one or more object modeling systems 130 design, create, or modify controls or animation variables 45 may be configured to allow a user to associate additional (avars) of models. In general, rigging is a process of giving an information, metadata, color, lighting, rigging, controls, or object, Such as a character model, controls for movement, the like, with all or a portion of the generated model data. therein “articulating its ranges of motion. The articulator The one or more object articulation systems 140 can may work closely with one or more animators in rig building include hardware and/or software elements configured to to provide and refine an articulation of the full range of 50 articulating one or more computer-generated objects. Articu expressions and body movement needed to support a charac lation can include the building or creation of rigs, the rigging ter's acting range in an animation. In a further example, a user ofan object, and the editing of rigging. The one or more object of design computer 110 acting as an animator may employ articulation systems 140 may be invoked by or used directly one or more systems or tools to specify motion and position of by a user of the one or more design computers 110 and/or one or more objects over time to produce an animation. 55 automatically invoked by or used by one or more processes Object library 120 can include hardware and/or software associated with the one or more design computers 110. Some elements configured for storing and accessing information examples of software programs embodied as the one or more related to objects used by the one or more design computers object articulation systems 140 can include commercially 110 during the various stages of a production process to available high-end 3D computer graphics and 3D modeling produce CGI and animation. Some examples of object library 60 software packages 3D STUDIO MAX and AUTODESK 120 can include a file, a database, or other storage devices and MAYA produced by Autodesk, Inc. of San Rafael, Calif. mechanisms. Object library 120 may be locally accessible to In various embodiments, the one or more articulation sys the one or more design computers 110 or hosted by one or tems 140 can be configured to enable the specification of more external computer systems. rigging for an object, Such as for internal skeletal structures or Some examples of information stored in object library 120 65 eternal features, and to define how input motion deforms the can include an object itself, metadata, object geometry, object object. One technique is called "skeletal animation in which topology, rigging, control data, animation data, animation a character can be represented in at least two parts: a Surface US 9,036,898 B1 11 12 representation used to draw the character (called the skin)and engine can include a computer program that simulates one or a hierarchical set of bones used for animation (called the more physics models (e.g., a Newtonian physics model), skeleton). using variables such as mass, Velocity, friction, wind resis The one or more object articulation systems 140 may fur tance, or the like. The may simulate and pre ther generate articulation data (e.g., data associated with con dict effects under different conditions that would approxi trols or animations variables) for use by other elements of mate what happens to an object according to the physics system 100 or that can be stored in object library 120. The one model. 
The one or more object simulation systems 160 may or more object articulation systems 140 may be configured to be used to simulate the behavior of objects, such as hair, fur, allow a user to associate additional information, metadata, and cloth, in response to a physics model and/or animation of color, lighting, rigging, controls, or the like, with all or a 10 one or more characters and objects within a computer-gener portion of the generated articulation data. ated Scene. The one or more object animation systems 150 can include The one or more object simulation systems 160 may fur hardware and/or Software elements configured for animating ther generate simulation data (e.g., motion and position of an one or more computer-generated objects. Animation can object over time) for use by other elements of system 100 or include the specification of motion and position of an object 15 that can be stored in object library 120. The generated simu overtime. The one or more object animation systems 150 may lation data may be combined with or used in addition to be invoked by or used directly by a user of the one or more animation data generated by the one or more object animation design computers 110 and/or automatically invoked by or systems 150. The one or more object simulation systems 160 used by one or more processes associated with the one or may be configured to allow a user to associate additional more design computers 110. Some examples of software information, metadata, color, lighting, rigging, controls, or programs embodied as the one or more object animation the like, with all or a portion of the generated simulation data. systems 150 can include commercially available high-end 3D The one or more object rendering systems 170 can include computer graphics and 3D modeling Software packages 3D hardware and/or software element configured for “rendering STUDIO MAX and AUTODESK MAYA produced by or generating one or more images of one or more computer Autodesk, Inc. of San Rafael, Calif. 25 generated objects. “Rendering can include generating an In various embodiments, the one or more animation sys image from a model based on information Such as geometry, tems 150 may be configured to enable users to manipulate viewpoint, texture, lighting, and shading information. The controls or animation variables or utilized character rigging to one or more object rendering systems 170 may be invoked by specify one or more key frames of animation sequence. The or used directly by a user of the one or more design computers one or more animation systems 150 generate intermediary 30 110 and/or automatically invoked by or used by one or more frames based on the one or more key frames. In some embodi processes associated with the one or more design computers ments, the one or more animation systems 150 may be con 110. One example of a software program embodied as the one figured to enable users to specify animation cues, paths, or the or more object rendering systems 170 can include PhotoRe like according to one or more predefined sequences. The one alistic RenderMan, or PRMan, produced by Pixar Animations or more animation systems 150 generate frames of the ani 35 Studios of Emeryville, Calif. mation based on the animation cues or paths. 
In further In various embodiments, the one or more object rendering embodiments, the one or more animation systems 150 may be systems 170 can be configured to render one or more objects configured to enable users to define animations using one or to produce one or more computer-generated images or a set of more animation languages, morphs, deformations, or the like. images over time that provide an animation. The one or more The one or more object animations systems 150 may fur 40 object rendering systems 170 may generate digital images or ther generate animation data (e.g., inputs associated with raster graphics images. controls or animations variables) for use by other elements of In various embodiments, a rendered image can be under system 100 or that can be stored in object library 120. The one stood in terms of a number of visible features. Some examples or more object animations systems 150 may be configured to of visible features that may be considered by the one or more allow a user to associate additional information, metadata, 45 object rendering systems 170 may include shading (e.g., tech color, lighting, rigging, controls, or the like, with all or a niques relating to how the color and brightness of a Surface portion of the generated animation data. varies with lighting), texture-mapping (e.g., techniques relat The one or more object simulation systems 160 can include ing to applying detail information to surfaces or objects using hardware and/or software elements configured for simulating maps), bump-mapping (e.g., techniques relating to simulat one or more computer-generated objects. Simulation can 50 ing Small-scale bumpiness on Surfaces), fogging/participat include determining motion and position of an object over ing medium (e.g., techniques relating to how light dims when time in response to one or more simulated forces or condi passing through non-clear atmosphere or air) shadows (e.g., tions. The one or more object simulation systems 160 may be techniques relating to effects of obstructing light), soft shad invoked by or used directly by a user of the one or more design ows (e.g., techniques relating to varying darkness caused by computers 110 and/or automatically invoked by or used by 55 partially obscured light sources), reflection (e.g., techniques one or more processes associated with the one or more design relating to mirror-like or highly glossy reflection), transpar computers 110. Some examples of Software programs ency or opacity (e.g., techniques relating to sharp transmis embodied as the one or more object simulation systems 160 sions of light through solid objects), translucency (e.g., tech can include commercially available high-end 3D computer niques relating to highly scattered transmissions of light graphics and 3D modeling software packages 3D STUDIO 60 through solid objects), refraction (e.g., techniques relating to MAX and AUTODESK MAYA produced by Autodesk, Inc. bending of light associated with transparency), diffraction of San Rafael, Calif. 
(e.g., techniques relating to bending, spreading and interfer In various embodiments, the one or more object simulation ence of light passing by an object or aperture that disrupts the systems 160 may be configured to enables users to create, ray), indirect illumination (e.g., techniques relating to Sur define, or edit simulation engines, such as a physics engine or 65 faces illuminated by light reflected off other surfaces, rather physics processing unit (PPU/GPGPU) using one or more than directly from a light source, also known as global illu physically-based numerical techniques. In general, a physics mination), caustics (e.g., a form of indirect illumination with US 9,036,898 B1 13 14 techniques relating to reflections of light off a shiny object, or In image-space tracking stage 240, one goal is to track focusing of light through a transparent object, to produce features from the reference frame (e.g., image data Such as bright highlights on another object), depth offield (e.g., tech pixels, lines, curves or objects in the image data Such as niques relating to how objects appear blurry or out of focus shapes, detected structures, facial features, etc.) to corre when too far in front of or behind the object in focus), motion sponding features in each frame in the sequences of frames blur (e.g., techniques relating to how objects appear blurry from the one or more capture sessions. For example, given due to high-speed motion, or the motion of the camera), that a frame F, is a collection of images I of a set of cameras non-photorealistic rendering (e.g., techniques relating to ren c=1 ... n at time t, tracking features from frame F, to frame F, dering of Scenes in an artistic style, intended to look like a means tracking features in each image pair (I, I) of all painting or drawing), or the like. 10 cameras c. One of the most straightforward approaches is to The one or more object rendering systems 170 may further begin by tracking features from the reference frame to the render images (e.g., motion and position of an object over anchor frames, because the image appearance in both frames time) for use by other elements of system 100 or that can be is, by definition, similar. This then can be used to guide the stored in object library 120. The one or more object rendering tracking of features from the reference frame to all other systems 170 may be configured to allow a user to associate 15 frames in a specific sequence of frames. additional information or metadata with all or a portion of the In mesh propagation stage 250, the tracked features rendered image. obtained in image-space tracking stage 240 provide a way to Passive Facial Performance Capture propagate geometry associated with the reference frame to all In various embodiments, system 100 may include one or frames in the sequence. "Mesh propagation of a reference more hardware elements and/or software elements, compo mesh is used to mean the computation of new positions in nents, tools, or processes, embodied as the one or more design space of the mesh vertices, in correspondence with physical computers 110, object library 120, the one or more object movement of a captured performance. 
In refinement stage modeler systems 130, the one or more object articulation 260, each propagation of the geometry associated with the systems 140, the one or more object animation systems 150, reference frame to each frame in the sequence provides an the one or more object simulation systems 160, and/or the one 25 initial estimate for a refinement process that then updates the or more object rendering systems 170 that provide one or geometry (e.g., mesh vertices), enforcing consistency with more tools for high-quality passive performance capture. the image data while applying priors on spatial and temporal In one aspect, System 100 receives input capturing an coherence of the deforming Surface. actor's performance. System 100 may receive a time-based Thus, system 100 can receive information capturing an sequence of one or more “frames. A “frame” means a set of 30 actor's performance and generate an output as a sequence of one or more images acquired at one timestep. In various meshes (e.g., 2D or 3D), one per captured frame, of the actor's embodiments, each frame consists of a predetermined num performance which move and deform in correspondence with ber of stereo images acquired at each timestep in a sequence. the physical activity of the actor captured in the performance Based on input frames, system 100 generates an output as a and which further take into account a prior on spatial coher sequence of meshes (e.g., 2D or 3D), one per frame, which 35 ence of any dynamically deforming Surface and a prior on move and deform in correspondence with the physical activ temporal coherence. ity of the actor captured in a performance. According to some Initial Meshes embodiments, each mesh vertex corresponds to a fixed physi FIG.3 is a flowchart of method 300 for constructing geom cal point on the actor and maintains that correspondence etry associated with objects in images of a captured perfor throughout the sequence. Computation of the output meshes 40 mance in one embodiment. Implementations of or processing takes into account the image data, a prior on spatial coherence in method 300 depicted in FIG. 3 may be performed by of the Surface, and a prior on temporal coherence of the Software (e.g., instructions or code modules) when executed dynamically deforming Surface. by a (CPU or processor) of a logic FIG. 2 is an illustration depicting stages of a high-quality machine, Such as a computer system or information process passive performance capture process implemented by system 45 ing device, by hardware components of an electronic device 100 of FIG. 1 in one embodiment. The exemplary high or application-specific integrated circuits, or by combinations quality passive performance capture process of FIG. 2 of software and hardware elements. Method 300 depicted in includes image acquisition stage 210, mesh reconstruction FIG. 3 begins in step 310. stage 220, anchoring stage 230, image-space tracking stage In step 320, a performance is captured to generate a 240, mesh propagation stage 250, and refinement stage 260. 50 sequence of frames. For example, system 100 may receive In general, in image acquisition stage 210, System 100 input obtained from one or more capture sessions. In various receives sequences of frames from one or more capture ses embodiments, the one or more capture sessions can produce sions. In mesh reconstruction stage 220, each frame in a one or more sequences of one or more frames. 
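Because every clip shares the reference frame as a common root, the tracking, propagation, and refinement stages can be applied to each clip independently. A minimal orchestration sketch follows; the stage functions are supplied by the caller and the names are placeholders, not the claimed implementation.

```python
from typing import Callable, List, Sequence


def process_clip(reference_frame, reference_mesh, clip_frames: Sequence,
                 track: Callable, propagate: Callable, refine: Callable) -> List:
    """Apply image-space tracking (stage 240), mesh propagation (stage 250) and
    per-frame refinement (stage 260) to one clip.  Distinct clips share no state
    other than the reference frame, so they can run in parallel."""
    refined_meshes = []
    for frame in clip_frames:
        motion = track(reference_frame, frame)      # features: reference -> frame
        mesh = propagate(reference_mesh, motion)    # move the reference vertices accordingly
        refined_meshes.append(refine(mesh, frame))  # pull the mesh back onto the image data
    return refined_meshes
```

Dispatching each clip to a separate worker process is then straightforward, which is the parallelization benefit that anchor frames provide.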
Capture ses sequence of frames is processed (e.g., independently) togen sions can occur sequentially, with each capture sessions pro erate a first estimate for geometry (e.g., a mesh) of the frame 55 ducing one or more sequences of frames capturing perfor corresponding to each of one or more objects represented in mances of the actor, where conditions (e.g., lighting, the frame (actors, actor faces and parts, animals, inanimate backgrounds, costumes, makeup, number of imaging objects, etc.). In anchoring stage 230, one frame is identified Sources, etc.) remain Substantially similar from capture ses as a reference frame (e.g., marked “R” in FIG. 2). Frames in sion to capture session. Capture sessions can also occur sepa the sequences from the one or more capture sessions that 60 rately from one another such that conditions vary from one satisfy similarity criteria associated with the reference frame, capture sessions to the next capture session. Such as having similar image appearance (e.g., similar fea As discussed above, each frame in a sequence of frames tures, facial expressions, head orientations, etc.) are detected can include a set of one or more images. An image may automatically and marked or otherwise labeled as anchor include image data captured by an individual camera or by frames (e.g., marked 'A' in FIG. 2). Anchor frames provide a 65 multiple cameras. Images may further include single image way to partition the sequences of frames from the one or more data, multiple image, data, Stereo image data, and the like. capture sessions into clips of frames for further processing. Images forming a frame may include image data captured US 9,036,898 B1 15 16 from imaging Sources at different angles, perspectives, light substantially similar to one in the reference frame. Some ing conditions or filters, resolutions, and the like. Some examples of selection criteria for a reference frame is whether images may include common image data while others may an actor's face is in its natural rest expression or a desired include image data capturing unique portions of an actor's pose. For example, a reference frame may be chosen based on performance. a desired pose or look on an actor's face. In step 330, one or more objects are identified in each frame In step 430, anchor frames are identified in the sequence in the sequence of frames. For example, system 100 may based on the reference frame. For example, frames that are employ one or more object identification techniques to iden similar to the reference frame are labeled as anchors. A frame tify objects in image data associate with a frame. Some object can be deemed similar in one embodiment if features are identification techniques can include computer-vision 10 present in both the frame and the reference frame. Similarity approaches, appearance-based matching, such as edge detec criteria may be used to determine how close a frame is to tion and shape identification, feature-based matching, tem matching a reference frame. One method for identifying plate matching, gradient histograms, intraclass transfer learn anchor frames is described further with respect to FIG. 5. ing, explicit and implicit object models, global scene In step 440, clips are generated based on the anchor frame. representations, shading, reflectance, texture, grammars, bio 15 A clip is a sequence of frames bounded by two anchor frames. 
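As one simple illustration of seeding a sparse feature set, poorly textured pixels track badly, so a trackability score such as local gradient magnitude can be used to pick candidate features. The sketch below assumes a single grayscale image and is only a stand-in for the richer detection techniques listed above.

```python
import numpy as np


def detect_feature_set(image: np.ndarray, max_features: int = 500, border: int = 8) -> np.ndarray:
    """Return (row, col) coordinates of well-textured pixels to use as a sparse
    feature set; squared gradient magnitude serves as the trackability score."""
    img = image.astype(np.float64)
    gy, gx = np.gradient(img)        # derivatives along rows and columns
    score = gx * gx + gy * gy
    score[:border, :] = 0.0          # ignore a border strip where blocks around
    score[-border:, :] = 0.0         # a feature would fall off the image
    score[:, :border] = 0.0
    score[:, -border:] = 0.0
    best = np.argsort(score, axis=None)[::-1][:max_features]
    rows, cols = np.unravel_index(best, score.shape)
    return np.stack([rows, cols], axis=1)
```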
logically inspired object recognition, window-based detec In one aspect, an anchor frame might be both the last frame of tion, image data cues, and the like. In various embodiments, one clip and the first frame of the next clip. A clip can include system 100 identifies one or more features of an object in each any number of frames in addition to the anchor frames. In step frame. Some examples of features can include image data 450, the generated clips are stored. FIG. 4 ends in step 460. features, such as pixels or Subpixel, and more complex object Drift (or error accumulation) in tracking is a key issue features, such as object edges, object structures, and complex when processing a long sequence. Anchor frames generated physical features (e.g., noses, ears, cheeks, mouth, etc.). by system 100 provide a way to decompose sequences of In step 340, geometry is generated for each identified frames obtained from one or more capture sessions into clips, object. For example, system 100 can generate information effectively allowing multiple starting points for further pro describing the geometry of an n-D shape. System 100 may 25 cessing while still having a common root in the reference generate geometric or mathematical descriptions that define frame. One immediate benefit of using anchor frames is that points, lines, arcs, curves, triangles, rectangles, circles, com the clips they form naturally allow for parallelization of sub posites, or any variety of polygons, and their connectivity. In sequent computations on the individual clips. Another impor Some embodiments, system 100 generates a 3D polygonal tant benefit is that using anchor frames results in periodic mesh for each identified object using the 3D reconstruction 30 resets that prevent the accumulation of drift when tracking method in Beeler et al. 2010. along a sequence. Using anchor frames also confines cata At this point, a sequence of meshes has been generated by strophic failures in tracking, which could arise from occlu system 100. There is no temporal correspondence in the sion/motion blur/face outside the image, to the frames in resulting sequence of meshes, (i.e., their mesh structure which they Occur. (number of Vertices and triangulation structure) is totally 35 FIG. 5 is a flowchart of method 500 for identifying anchor unrelated). As discussed further below, one goal of subse frames of clips using a reference frame in one embodiment. quent stages will be to modify the meshes so that they are Implementations of or processing in method 500 depicted in compatible, (i.e., share a common vertex set and mesh struc FIG. 5 may be performed by software (e.g., instructions or ture). FIG. 3 ends in step 360. code modules) when executed by a central processing unit Anchoring 40 (CPU or processor) of a logic machine, Such as a computer In various embodiments, system 100 identifies and other system or information processing device, by hardware com wise designates one frame in a sequence of frames as a ref ponents of an electronic device or application-specific inte erence frame. Thereafter, system 100 can automatically grated circuits, or by combinations of software and hardware detect and label frames with similar image appearance (e.g., elements. Method 500 depicted in FIG. 5 begins in step 510. similar face expression and orientation) in the sequence and 45 In step 520, a feature set of a reference frame is determined. in other sequences as anchor frames. 
In one aspect, anchor For example, system 100 detects a feature set S in stereo frames partition a sequence of frames into clips as shown in image I* in a reference frame, for stereo cameras c=1 ... n. FIG. 2. Anchoring is further utilized in image-space tracking A feature is a uniquely identifiable element of an image in the at the level of each clip, reducing the need to track more frame. Some examples of features include image data in the lengthy sequences of frames. 50 form of pixels, groups of pixels, lines, curves, edges, and the FIG. 4 is a flowchart of method 400 for generating image like. Other examples of features include objects represented sequence clips of the captured performance based on a refer in image data, Such as structures, physical attributes of an ence frame in one embodiment. Implementations of or pro object, distinguishable attributes of a form, body parts, and cessing in method 400 depicted in FIG. 4 may be performed the like. by Software (e.g., instructions or code modules) when 55 In step 530, a plurality of target frames is received. In step executed by a central processing unit (CPU or processor) of a 540, correspondence matching is performed between the fea logic machine, such as a computer system or information ture set of the reference frame and each target frame in the processing device, by hardware components of an electronic plurality of target frames. For example, system 100 performs device or application-specific integrated circuits, or by com a correspondence matching of S. between I and I. in a binations of software and hardware elements. Method 400 60 target frame F, for stereo cameras c=1 ... n. In step 540, an depicted in FIG. 4 begins in step 410. amount of correspondence is determined between the feature In step 420, a reference frame in a sequence of frames set of the reference frame and each target frame. In one capturing a performance is determined. For example, a refer embodiment, system 100 computes an error Score E as a Sum ence frame can be chosen manually or procedurally using a of cross-correlation scores over all features in all feature sets variety of selection criteria. The reference frame then can 65 S c produce anchor frames along the sequence whenever the In step 550, information identifying each target frame as an actor's face returns to an expression in a target frame that is anchor frame is generated when the amount of correspon US 9,036,898 B1 17 18 dence satisfies selection criteria. Selection criteria may FIG.9 is a flowchart of method 900 for performing image include how similar the reference frame is to each target space tracking between a reference frame and anchor frames frame (a vice-versa). FIG. 5 ends in step 560. of a clip in one embodiment. Implementations of or process FIG. 6 is an illustration depicting results of a comparison ing in method 900 depicted in FIG.9 may be performed by between reference frame 610 and sequence 620 of individual Software (e.g., instructions or code modules) when executed frames in a captured performance used to identify anchor by a central processing unit (CPU or processor) of a logic frames in one embodiment. In this example, reference frame machine, Such as a computer system or information process 610 is compared with all frames in sequence 620. Variation in ing device, by hardware components of an electronic device E is shown for a few example frames. 
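Assuming the per-frame error score E (the sum of cross-correlation errors against the reference feature sets) has already been computed, the anchoring and clip generation described above reduce to a thresholding step and a pairing step. The threshold rule and helper names below are illustrative, not the claimed criteria.

```python
from typing import List, Tuple

import numpy as np


def label_anchor_frames(error_scores: np.ndarray, threshold: float) -> np.ndarray:
    """Indices of frames whose error score E against the reference frame is low
    enough for the frame to be labeled an anchor (cf. FIGS. 5 and 6)."""
    return np.flatnonzero(error_scores < threshold)


def split_into_clips(anchor_indices: np.ndarray) -> List[Tuple[int, int]]:
    """Clips are bounded by consecutive anchor frames; an anchor may serve as the
    last frame of one clip and the first frame of the next."""
    anchors = sorted(set(int(i) for i in anchor_indices))
    return list(zip(anchors[:-1], anchors[1:]))
```

For the example of FIG. 6, the frames around frame 60 would fall below the threshold and become anchors, while the frames between 100 and 120 would not.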
Specifically, for several or application-specific integrated circuits, or by combinations frame surrounding frame number 60 the variation in E is 10 small as indicated by a relatively low error score 630. For of software and hardware elements. Method 900 depicted in several frames between frames 100 to 120, the variation in E FIG.9 begins in step 910. is larger as indicated by a relatively larger error score, Such as In step 920, feature matching is performed between a ref score 640. Thus, frames where the image appearance is simi erence frame and an anchor frame of a clip to generate a lar to reference frame 610 can be labeled as anchor frames. 15 forward motion estimation and a backward motion estima FIG. 7 is an illustration of clip 700 in one embodiment tion. For example, the orientation and expression of a face in having anchor frames 710 and 720 corresponding to a refer anchor frames may be similar to that in an reference frame, ence frame. but its location within the frame might differ substantially. Image-Space Tracking Thus, system 100 performs feature matching to determine As discussed above, one goal of image-space tracking is to any changes in locations within the frame of corresponding track image features from a reference frame to each frame in features. a sequence. There can be two distinct situations for matching. In some embodiments, system 100 uses an image pyramid For example, anchor frames by definition have similar image to detect motions of corresponding features. System 100 may appearance to the reference frame. Thus image correspon start the process at a coarsest level of a pyramid and match a dence can be straightforward. Unanchored frames, however, 25 sparse set of features to estimate extremal motions m, fol may be quite different from the reference and may not be lowed by a dense motion estimation method using a search reliably matched directly. window of size (m,"-m)x(m,"-m). System 100 then FIG. 8 is a flowchart of method 800 for performing image may upsample the resulting motion field to the next higher space tracking of clips of a captured performance in one resolution where dense motion estimation is repeated with a embodiment. Implementations of or processing in method 30 800 depicted in FIG.8 may be performed by software (e.g., search window of fixed size (e.g., 3x3). System 100 repeats instructions or code modules) when executed by a central until the highest resolution layer is reached. This is done for processing unit (CPU or processor) of a logic machine. Such each anchor frame to provide the motion fields u' from as a computer system or information processing device, by the reference frame R to anchor frame A for stereo cameras c=1 ... n. hardware components of an electronic device or application 35 specific integrated circuits, or by combinations of Software In one embodiment, system 100 provides a motion estima and hardware elements. Method 800 depicted in FIG. 8 tion as an extension to 2D of the matching introduced in begins in step 810. Beeleretal. 2010. For example, system 100 matches feature In step 820, a plurality of clips are received. In step 830, for X (e.g., apixel) in image I. to one or more features x' in image each clip in the plurality of clips, image-space tracking is 40 I.". System 100 may match features defined as pixels using a performed between a reference frame and anchor frames of block-based normalized cross-correlation (e.g., with the the clip to generate a motion estimation. In various embodi above 3x3 search window). 
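The hierarchical matching just described can be sketched as a coarse-to-fine loop: a wide search window is used only at the coarsest pyramid level, after which the motion field is upsampled and refined with a small fixed window at each finer level. The dense per-level matcher is supplied by the caller; the 2x2 average pooling and nearest-neighbour upsampling below are simplifications for illustration.

```python
from typing import Callable

import numpy as np


def downsample(img: np.ndarray) -> np.ndarray:
    """2x2 average pooling, a simple stand-in for a proper pyramid filter."""
    h, w = (img.shape[0] // 2) * 2, (img.shape[1] // 2) * 2
    img = img[:h, :w]
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2] + img[0::2, 1::2] + img[1::2, 1::2])


def coarse_to_fine_flow(src: np.ndarray, dst: np.ndarray, levels: int, coarse_window: int,
                        dense_match: Callable[[np.ndarray, np.ndarray, np.ndarray, int], np.ndarray]
                        ) -> np.ndarray:
    """Estimate a dense motion field from src to dst over an image pyramid.
    dense_match(src_level, dst_level, initial_flow, window) refines the flow at one level."""
    pyr_src, pyr_dst = [src.astype(np.float64)], [dst.astype(np.float64)]
    for _ in range(levels - 1):
        pyr_src.append(downsample(pyr_src[-1]))
        pyr_dst.append(downsample(pyr_dst[-1]))

    flow = np.zeros(pyr_src[-1].shape + (2,))
    for level in reversed(range(levels)):
        window = coarse_window if level == levels - 1 else 3   # wide search only at the top
        flow = dense_match(pyr_src[level], pyr_dst[level], flow, window)
        if level > 0:
            # upsample the motion field to the next finer resolution and double it
            h, w = pyr_src[level - 1].shape
            up = 2.0 * np.kron(flow, np.ones((2, 2, 1)))
            pad_h, pad_w = max(h - up.shape[0], 0), max(w - up.shape[1], 0)
            flow = np.pad(up, ((0, pad_h), (0, pad_w), (0, 0)), mode="edge")[:h, :w]
    return flow
```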
Based on these matches, system ments, correspondences from the reference frame are first 100 generates a forward motion estimation u=X-X'. System obtained for the anchor frames. Correspondences then can be 100 also matches feature y in image I." to one or more fea obtained from the anchor frames for the reference frame. One 45 turesy" in image I.". System may match features defined as example of a method for performing image-space tracking pixels again using a block-based normalized cross-correla between a reference frame and anchor frames of a clip is tion (e.g., with the above 3x3 search window). Based on these discussed further with respect to FIG. 9. In various embodi matches, system 100 generates a backward motion estimation ments, system 100 generate the motion estimation to include v=y-y'. information providing how a set of features in the reference 50 In step 930, feature matches are filtered based on selection frame move between the reference frame and the anchor criteria. Feature matches may be filtered for further process frames. This can include rigid body motions as well has ing, either as confirmed matches or for re-matching. For changes in shape. example, system 100 may not accept a match between a In step 840, for each clip in the plurality of clips, image feature in image I. and a feature in image I." that fails to space tracking is performed between the reference frame of 55 satisfy the selection criteria. In one embodiment, system 100 the clip and unanchored frames to generate a motion estima discards matches where u+VI is larger than a predetermined tion. In one aspect, the correspondences from the reference threshold (e.g., one pixel). In step 940, re-matching is per frame to the anchor frames are propagated from the anchor formed for unmatched features. For example, system 100 frames bounding the clip to the intermediate unanchored may again perform a cross-correlation between features in frames within the clip, both forwards and backwards. One 60 image I. and features in image I.". In some aspects, system example of a method for performing image-space tracking 100 may utilize neighboring and confirmed matches for guid between a reference frame and unanchored frames of a clip is aCC. discussed further with respect to FIG. 10. In step 950, match refinement is performed. For example, In step 850, tracking information is generated for each clip system 100 may refine computed matches using one or more in the plurality of clips. An immediate benefit of this tech 65 refinement terms or criteria. Some examples of refinement nique is that it naturally allows parallelization of the compu terms or criteria are photometric consistency of a match tations. FIG. 8 ends in step 860. between a plurality of images and depth maps. Depth maps US 9,036,898 B1 19 20 can be obtained from initial geometry determined mesh reconstruction stage 220 of FIG. 2 and FIG.3 and reprojected back onto a set of images. ikt One formulation of refined motion in Beeler et al. 2010 5 includes a convex combination u'-(wu-w.u.)/(w-w,). In step 1030, match refinement is performed. For example, This formulation, in one aspect, is extended such that the system 100 may refine computed matches using one or more regularization term u is modified based on a depth map 8 and refinement terms or criteria, such as described about with a matching error S. respect to FIG. 9. 
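A minimal sketch of the block-based normalized cross-correlation matching and of the forward/backward consistency filter (a match is dropped when |u + v| exceeds roughly one pixel). The block and search sizes are the illustrative values mentioned above, features are assumed to lie far enough from the image border for full blocks to exist, and none of the helper names come from the disclosed system.

```python
import numpy as np


def ncc(block_a: np.ndarray, block_b: np.ndarray, eps: float = 1e-9) -> float:
    """Normalized cross-correlation of two equally sized image blocks."""
    a = block_a.astype(np.float64).ravel()
    b = block_b.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def match_block(src: np.ndarray, dst: np.ndarray, x: int, y: int,
                block: int = 5, search: int = 3) -> np.ndarray:
    """Offset (dx, dy) inside a (2*search+1)^2 window of dst that maximizes NCC
    against the block centred on (x, y) in src."""
    half = block // 2
    ref = src[y - half:y + half + 1, x - half:x + half + 1]
    best, best_score = np.zeros(2), -np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cand = dst[y + dy - half:y + dy + half + 1, x + dx - half:x + dx + half + 1]
            if cand.shape != ref.shape:
                continue  # candidate block falls off the image
            score = ncc(ref, cand)
            if score > best_score:
                best_score, best = score, np.array([dx, dy], dtype=np.float64)
    return best


def is_consistent(u: np.ndarray, v: np.ndarray, threshold: float = 1.0) -> bool:
    """Keep a match only if the forward motion u and the backward motion v of the
    matched feature roughly cancel, i.e. the round trip returns to the start."""
    return float(np.linalg.norm(u + v)) <= threshold
```

Applying match_block from the reference image to the anchor image gives u, applying it back from the matched point gives v, and is_consistent implements the |u + v| test of step 930.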
In some aspects, this allows system 100 to 10 correct the motion field for small drift with respect to the reference frame (track-to-first principle). its (x, y) = w/y : u...y In step 1040, incremental frame-by-frame feature match ing is performed from a succeeding anchor frame to the unanchored frame of the clip to generate a backward motion where 15 estimation. For example, system 100 performs frame-by frame feature matching (e.g., using a similar image-space tracking technique as described above with respect to FIG.9) |o, y - 0, y || starting from an anchor frame I." to the next frame in the clip. Wy -e Olof (1-6, y) The process repeats with each intermediate clip beginning from the anchor frame until reaching the target unanchored frame I. Thus, starting from a succeeding anchor frame in and N denotes the neighborhood of (x,y). The value of Ö may the clip from the unanchored frame, system 100 generates a be defined as a predetermined depth range, (e.g., 1 mm). backward motion estimation for the unanchored frame: In step 960, feature tracking information from the refer ence frame to the anchor frame is generated. The feature 25 tracking information can include information describing how a feature moves or otherwise is transformed between the ist reference frame and the anchor frame. Accordingly, system 100 may detect motions, transformations, shape changes, etc. of corresponding features to provide, for example, motion 30 In step 1050, match refinement is performed. Again, for example, system 100 may refine computed matches using one fields u.' ' from the reference frame R to anchor frame A. or more refinement terms or criteria, such as described about FIG.9 ends in step 970. with respect to FIG. 9. FIG. 10 is a flowchart of a method for performing image In step 1060, the forward motion estimation or the back space tracking between a reference frame and unanchored 35 ward motion estimation is chosen based on selection criteria. frames of a clip in one embodiment. Implementations of or Specifically, a bi-directional matching (e.g., both a forward processing in method 1000 depicted in FIG. 10 may be per and backward matching) is performed. Each clip is bound by formed by Software (e.g., instructions or code modules) when a start and endanchor frame, and system 100 performs feature executed by a central processing unit (CPU or processor) of a tracking from the start anchor in the forward direction and logic machine, such as a computer system or information 40 from the endanchor in the backward direction. As the forward processing device, by hardware components of an electronic and backward motion fields may differ due to error, system device or application-specific integrated circuits, or by com 100 chooses one to suitably represent feature motion. In one binations of software and hardware elements. Method 1000 embodiment, system 100 resolves the selection by computing depicted in FIG. 10 begins in step 1010. an error field for each motion field, smoothing the error fields Direct tracking of features from a reference frame to unan 45 to remove local perturbation, and then taking lowest chored frames of a clip is not reliable because image appear smoothed error to obtain the best match at each feature. ance can differ substantially between the two. Instead, in In step 1070, feature tracking information from anchor various embodiments, system 100 utilizes the matching frame to the unanchored frame is generated. 
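Two parts of the clip tracking lend themselves to a short sketch: chaining the incremental frame-to-frame motion fields into a motion field from an anchor to the target frame, and the per-pixel choice between the forward- and backward-propagated results by smoothed matching error (step 1060). The smoothing uses SciPy's uniform_filter; the nearest-neighbour flow lookup and all names are illustrative simplifications.

```python
import numpy as np
from scipy.ndimage import uniform_filter


def compose_flows(flow_ab: np.ndarray, flow_bc: np.ndarray) -> np.ndarray:
    """Chain two dense motion fields: the a->c motion at a pixel is its a->b motion
    plus the b->c motion sampled where that pixel lands in frame b."""
    h, w = flow_ab.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    xb = np.clip(np.round(xs + flow_ab[..., 0]).astype(int), 0, w - 1)
    yb = np.clip(np.round(ys + flow_ab[..., 1]).astype(int), 0, h - 1)
    return flow_ab + flow_bc[yb, xb]


def select_bidirectional_flow(flow_fwd: np.ndarray, err_fwd: np.ndarray,
                              flow_bwd: np.ndarray, err_bwd: np.ndarray,
                              smooth_size: int = 5) -> np.ndarray:
    """Per pixel, keep whichever of the forward and backward motion fields has the
    lower matching error after the error fields are smoothed to remove local
    perturbations."""
    e_f = uniform_filter(err_fwd, size=smooth_size)
    e_b = uniform_filter(err_bwd, size=smooth_size)
    use_forward = (e_f <= e_b)[..., None]   # broadcast over the two flow channels
    return np.where(use_forward, flow_fwd, flow_bwd)
```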
The feature between the reference frame and the anchor frames to aid the tracking information can include information describing how process. Frames are tracked incrementally within a clip start 50 a feature moves or otherwise is transformed between the ing from the relevant anchor frames. Feature tracking infor reference frame and the chosen anchor frame and intermedi mation from the reference frame to the anchor frame, plus any ate frames (e.g., either forward or backward motion selec incremental frame-to-frame matching, is then used by System tion). Accordingly, system 100 may detect motions, transfor 100 to infer feature tracking from the reference frame to each mations, shape changes, etc. of corresponding features to individual unanchored frame. 55 provide, for example, motion fields u. from the reference In step 1020, incremental frame-by-frame feature match frame R to the unanchored framet. FIG. 10 ends in step 1080. ing is performed from a preceding anchor frame to an unan FIG. 11 is an illustration depicting a bi-directional image chored frame of a clip to generate a forward motion estima space tracking in one embodiment. System 100 uses refer tion. For example, system 100 performs frame-by-frame ence frame 1110 to generate one or more clip each bounded feature matching (e.g., using a similar image-space tracking 60 by anchor frames that satisfy similarity criteria with reference technique as described above with respect to FIG. 9) starting frame 1110. In this example, a clip has been generated (e.g., from an anchor frame I." to the next frame in the clip I. The represented by images in the rectangle) that begins with process repeats with each intermediate clip beginning from anchor frame 1120 and ends with anchor frame 1130. System the anchor frame until reaching the target unanchored frame 100 performs image-space tracking of feature between refer I. Thus, starting from a preceding anchor frame in the clip 65 ence frame 1110 and each of anchor frames 1120 and 1130. from the unanchored frame, system 100 generates a forward Specifically, system 100 provides a forward motion estima motion estimation for the unanchored frame: tion for features in reference frame 1110 to anchor frame US 9,036,898 B1 21 22 1120 and a backward motion estimation for features in anchor Mesh Propagation frame 1120 to reference frame 1110 (e.g., represented by The anchor-based reconstruction and bi-directional image bi-directional arrow 1140A). From these two estimations, space tracking, as illustrated above, can be a powerful tool for system 100 can evaluate whether selected matches provide integrating multiple face performances of an actor over an the best correspondences. System 100 also provides a for extended period. A reference frame can be taken from one ward motion estimation for features in reference frame 1110 sequence but used to generate anchor frames in another to anchor frame 1130 and a backward motion estimation for sequence. A single mesh associated with the reference frame features in anchor frame 1130 to reference frame 1110 (e.g., can be propagated across different capture sessions for the represented by bi-directional arrow 1140B). From these two actor (including the case where camera positions or calibra 10 tion may have changed somewhat between sessions) and used estimations, system 100 can evaluate whether selected to embed the full corpus of facial performance data for the matches provide the best correspondences. actor into a single coordinate frame. 
System 100 further performs image-space tracking of fea FIG. 16 is a flowchart of method 1600 for propagating ture between both anchor frames 1120 and 1130 and target geometry corresponding to objects in a captured performance frame 1150. Specifically, system 100 provides a forward 15 using tracking information obtained from the bi-directional motion estimation for features in anchor frame 1120 to target image-space tracking in one embodiment. Implementations frame 1150 using incrementally determined motion estima of or processing in method 1600 depicted in FIG.16 may be tions in each intervening frame between anchor frame 1120 performed by Software (e.g., instructions or code modules) and target frame 1150 (e.g., represented by bi-directional when executed by a central processing unit (CPU or proces arrow 1160A). System 100 also provides a backward motion sor) of a logic machine, such as a computer system or infor estimation for features in anchor frame 1130 to target frame mation processing device, by hardware components of an 1150 using incrementally determined motion estimations in electronic device or application-specific integrated circuits, each intervening frame between anchor frame 1130 and target or by combinations of software and hardware elements. frame 1150 (e.g., represented by bi-directional arrow 1160B). Method 1600 depicted in FIG. 16 begins in step 1610. From these two estimations, system 100 can evaluate whether 25 In step 1620, geometry associated with an object in a selected matches provide the best correspondences. Thus, reference frame is received. For example, one or more image with the anchor-based reconstruction and bi-directional based meshing techniques may be used to extract N-dimen image-space tracking, system 100 generates tracking infor sional object geometry from a set of one or more images. In mation that limits temporal drift. other embodiments, one or more preconfigured or generic FIG. 12 is an illustration depicting a sequence of frames 30 meshes or geometric primitives may be adapted or trans formed based on object information obtained from the refer 1200 pulled out of different times steps of a captured perfor ence frame. mance demonstrating low temporal drift resulting from the In step 1630, information is received identifying a target bi-directional image-space tracking of system 100 by consis frame that includes the object. For example, system 100 may tent attachment of a Superimposed grid pattern in one 35 receive information specifying or otherwise identifying one embodiment. or more anchor frames or unanchored target frames in a clip. FIG. 13 is an illustration depicting another sequence of In step 1640, feature tracking information from the reference frames 1300 pulled out of different times steps of another frame to the frame target is received. For example, system 100 captured performance demonstrating low temporal drift may determine feature tracking information from the refer resulting from the bi-directional image-space tracking of sys 40 ence frame to an anchor frame as described in FIG. 9. In tem 100 by consistent attachment of a superimposed grid another example, system 100 may determine feature tracking pattern in one embodiment. information from the reference frame to an unanchored frame FIG. 14 depicts images of a Zoom onto the forehead of an as described in FIG. 10. 
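One plausible way to realize the propagation, in the spirit of the cited approach of Bradley et al. [2010], is to project each reference-mesh vertex into every camera, displace the projection by that camera's tracked motion from the reference frame to the target frame, and re-triangulate the displaced observations. The sketch assumes calibrated 3x4 projection matrices, projections that stay inside the image, and a nearest-neighbour lookup into the motion fields; it is an illustration, not the patented formulation.

```python
import numpy as np


def project(P: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Project a 3D point with a 3x4 camera matrix; returns pixel coordinates (x, y)."""
    xh = P @ np.append(X, 1.0)
    return xh[:2] / xh[2]


def triangulate(Ps, pixels) -> np.ndarray:
    """Linear (DLT) triangulation of one point from its pixel positions in several cameras."""
    rows = []
    for P, (x, y) in zip(Ps, pixels):
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.asarray(rows))
    Xh = vt[-1]
    return Xh[:3] / Xh[3]


def propagate_vertex(X_ref: np.ndarray, Ps, flows) -> np.ndarray:
    """Move one reference vertex to the target frame: project it into every camera,
    displace each projection by that camera's reference-to-target motion field,
    and re-triangulate the displaced observations."""
    displaced = []
    for P, flow in zip(Ps, flows):
        x, y = project(P, X_ref)
        u = flow[int(round(y)), int(round(x))]   # (dx, dy) at the projected pixel
        displaced.append((x + u[0], y + u[1]))
    return triangulate(Ps, displaced)
```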
actor in an input image, a 3D reconstruction of the actor's In step 1650, information propagating the geometry asso forehead in the prior art, and a 3D reconstruction of the actor's 45 ciated with the object in the reference frame is generated forehead according to one embodiment. In this example, based on the feature tracking information. For example, a image 1400 depicts the forehead of the actoran represents an reference mesh M is obtained consisting of a set of vertices input image to system 100. Image 1410 depicts a 3D recon X, for a reference frame F. Each vertex in the set of vertices struction of the actors forehead with lower fidelity. Image X, for the reference frame F represents a physical point in 1420 depicts a 3D reconstruction of the actor's forehead with 50 the reference frame F (e.g., a point on an actor's face). In higher fidelity. various embodiments, system 100 implements mesh propa FIG. 15 depicts images of the 3D reconstructions of the gation to a target frame F in a sequence of frames in a clip by actor's forehead at three time steps in a prior art reconstruc finding a transformed 3D position X', of each vertex X? due tion on the left half of the images and a reconstruction accord to any motion and deformation of an object (e.g., the actor's ing to one embodiment on the right half of the images. In this 55 face) from the reference frame F to the target frame F. In example, the top center image depicts an unwrinkled fore various embodiments, the feature tracking information head. The bottom left image depicts a wrinkled forehead and includes motion fields from reference frame F to the target the bottom right image depicts a transition back to an frame F. One method for using motion fields to estimate the unwrinkled forehead. Changes in the attachment of a Super propagated vertices X, is given in Bradley et al. 2010). imposed grid to the 3D reconstructions between the top center 60 System 100 completes mesh propagation and Vertices X? a image and last images demonstrate drift in conventional in correspondence with vertices X, in the target frame F. tracking methods in area 1510 and drift anchor-based recon FIG. 16 ends in step 1660. struction and bi-directional image-space tracking showing FIG. 17 is an illustration depicting how tracking informa significantly less drift in area 1520. In particular, less drift tion for individual frames is used to pose a mesh associated occurs vs. conventional methods as more changes are notice 65 with a reference frame in one embodiment. In this example, a able in area 1530 in the attachment of the Superimposed grid reference mesh M is obtained consisting of a set of vertices to the 3D reconstructions than on the right half of the images. X,*, for a reference frame 1710. System 100 propagates each US 9,036,898 B1 23 24 Vertex X?, due to the motion and deformation of an actor's In various embodiments, to render the refinement process face from reference frame 1710 to target frame 1720 by robust and efficient, motion and shape are treated separately finding a transformed 3D positionX, of each vertex X, using as suggested in Furukawa et al. 2009. System 100, in some feature tracking information 1730 from reference frame 1710 aspects, provides the refinement as an iterative process in to target frame 1720. System 100 also propagates each vertex 2.5D that interleaves motion and shape refinement. X? 
due to the motion and deformation of an actor's face from For example, in various embodiments, system 100 refines reference frame 1710 to target frame 1740 by finding a trans shape along the normal. System 100 may employ the refine formed 3D position X', of each vertex X? using feature track ment framework in Beeler et al. 2010 for shape refinement: ing information 1750 from reference frame 1710 to target 10 frame 1740. Further, system 100 propagates each vertex X?, where photometric X, smooth X, and mesoscopic X, posi due to the motion and deformation of an actor's face from tions as well as the photometric w and mesoscopic w, reference frame 1710 to target frame 1760 by finding a trans weights are the same as defined in Beeler et al. 2010. To formed 3D position X, of each vertex X? using feature track produce Smoother solutions in areas of higher matching error ing information 1770 from reference frame 1710 to target 15 (e.g., eye-brows, nostrils, etc.), in some aspects, system 100 frame 1760. modifies the Smoothness weight W to be: Mesh Refinement In the stages above, system 100 has generated a propaga tion of a reference mesh to each frame in a sequence of frames where S is the matching error and w is a user defined smooth in a clip. This is a step closer to having temporal correspon ness coefficient vector. Some examples of w include 0.03, 0.7, and 1000. dence of meshes along the sequence. However, the propa In further embodiments, system 100 refines motion in a gated meshes can be computed with different methods (e.g., tangent plane spanned by e. e. Similar to shape refinement, for anchor frames and unanchored frames) and computed system 100 may use a linear combination of a data term and independently for each timestep. In various embodiments, a regularization term. For example, a photometric position system 100 updates the meshes to ensure a uniform treatment 25 in the computation of all frames and to apply temporal coher estimate X, is the position on the tangent plane e, e, that ence. In some aspects, this is accomplished in two stages—a maximizes photometric consistency between a current frame refinement that acts independently on each frame and can and a reference frame. System 100 uses a normalized cross thus be parallelized and a refinement that aims for temporal correlation as a measure of consistency and computes it by coherence between frames. 30 reprojecting corresponding Surface patches into a reference FIG. 18 is a simplified flowchart of a method for refining image I. and a target image I. for all cameras c. In another the propagated geometry in one embodiment. Implementa example, a regularized position estimate X assumes local tions of or processing in method 1800 depicted in FIG. 18 rigidity of the surface and tries to preserve the local structure may be performed by Software (e.g., instructions or code using Laplacian coordinates, similar to Bradley et al. 2010. modules) when executed by a central processing unit (CPU or 35 A photometric confidence w, is the sum of the matching processor) of a logic machine. Such as a computer system or errors for the neighboring positions on the tangent plane: information processing device, by hardware components of w-0.25(s.d.-- xylidy) an electronic device or application-specific integrated cir A regularized confidence w employs a polynomial: cuits, or by combinations of software and hardware elements. Method 1800 depicted in FIG. 18 begins in step 1810. 40 w, hot-wis, thos. 
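The per-vertex shape update can be sketched as the convex combination of the photometric, smoothed, and mesoscopic position estimates described above, with the smoothness weight increased where the matching error is high. The quadratic dependence on the matching error below is an assumption made to accommodate the three example coefficients (0.03, 0.7, 1000); the exact weighting of the claimed system may differ, and all names are illustrative.

```python
import numpy as np


def refine_vertex_position(x_photometric: np.ndarray,
                           x_smooth: np.ndarray,
                           x_mesoscopic: np.ndarray,
                           w_photometric: float,
                           w_mesoscopic: float,
                           matching_error: float,
                           smooth_coeffs=(0.03, 0.7, 1000.0)) -> np.ndarray:
    """One refinement update for a single vertex.  A large local matching error
    (eyebrows, nostrils, ...) inflates the smoothness weight, pulling the result
    towards the smooth estimate; well-matched regions follow the photometric and
    mesoscopic estimates instead."""
    c0, c1, c2 = smooth_coeffs
    w_smooth = c0 + c1 * matching_error + c2 * matching_error ** 2  # assumed form
    total = w_photometric + w_smooth + w_mesoscopic
    return (w_photometric * x_photometric
            + w_smooth * x_smooth
            + w_mesoscopic * x_mesoscopic) / total
```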
In step 1820, information propagating geometry of an Some examples of include 0.5, 1, and 8000. object across a clip is received. For example, a reference mesh Referring again to FIG. 18, in step 1840, across-frame M is obtained consisting of a set of vertices X,*, for a refer refinement is performed. This refinement aims for temporal ence frame 1710 of FIG. 17. System 100 propagates each coherence between frames. As discussed above, per-frame Vertex X?, due to the motion and deformation of an actor's 45 refinement can operate on each frame independently, and can face from reference frame 1710 to target frame 1720 by thus be run in parallel. System 100 does not guarantee the finding a transformed 3D positionX, of each vertex X/ using results to be temporally consistent but will be very similar by feature tracking information 1730 from reference frame 1710 construction. While temporal error in the motion estimate is to target frame 1720. mostly not perceivable, small differences in shape between In step 1830, per-frame refinement is performed. As dis 50 Successive frames can cause large changes in Surface nor cussed above, this refinement acts independently on each mals, which will produce noticeable flickering in a visualiza frame and canthus be parallelized. In one aspect, information tion. To avoid this, system 100 performs a final pass over the is generated that finds for one or more of the vertices in a set complete sequence. In one embodiment, system 100 averages of vertices X, for a reference frame the position in space that the Laplacian coordinates in a -1, +1 temporal window. optimizes fidelity criteria. Some examples of fidelity criteria 55 FIG. 18 ends in step 1850. include spatial image fidelity, temporal image fidelity, mesh Performance Reconstruction Results fidelity, geometry Smoothness, and the like. For spatial image In several experiments, the inventors reconstructed several fidelity, system 100 attempts to ensure that the reprojections performances given by three different actors using the dis in all visible cameras for a target frame F are similar or closed techniques. The results of these experiments cover a otherwise satisfy similarity criteria. For temporal image fidel 60 wide range of expressiveness, include highly-detailed pore ity, system 100 attempts to ensure that the reprojections in a level geometry, and demonstrate robustness of the disclosed target frame F and a reference frame F“ for each visible techniques to motion blur and occlusions, thus outperforming camera are similar or otherwise satisfy similarity criteria. For previous approaches. mesh fidelity, system 100 attempts to ensure that a trans All of datasets were acquired using seven cameras (Dalsa formed mesh M be locally similar to a reference mesh M“. 65 Falcon 4M60) capturing images with resolution 1176x864 at For geometry Smoothness, system 100 attempts to ensure that 42 frames per second. The actors were lit uniformly using transformed geometry G be locally smooth. LED-lights. US 9,036,898 B1 25 26 FIG. 19 depicts images of example frames taken from sequence. A Subset of the sequence can be matched to one multiple different sequences of an actor in one embodiment. reference frame, followed by a change of the reference frame FIG. 
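The final temporal pass described above can be sketched directly: compute each frame's Laplacian coordinates and average them over a [-1, +1] temporal window; a reconstruction step would then re-solve the vertex positions from the averaged coordinates. The uniform one-ring Laplacian and the helper names are simplifications for illustration.

```python
from typing import List, Sequence

import numpy as np


def laplacian_coordinates(vertices: np.ndarray, neighbors: Sequence[Sequence[int]]) -> np.ndarray:
    """Uniform Laplacian coordinate of every vertex: its position minus the mean of
    its one-ring neighbours (each vertex is assumed to have at least one neighbour)."""
    lap = np.empty_like(vertices)
    for i, ring in enumerate(neighbors):
        lap[i] = vertices[i] - vertices[list(ring)].mean(axis=0)
    return lap


def temporally_averaged_laplacians(mesh_sequence: List[np.ndarray],
                                   neighbors: Sequence[Sequence[int]]) -> List[np.ndarray]:
    """Average the Laplacian coordinates of each frame with those of its immediate
    temporal neighbours; the window simply shrinks at the first and last frame."""
    laps = [laplacian_coordinates(v, neighbors) for v in mesh_sequence]
    return [np.mean(laps[max(t - 1, 0):t + 2], axis=0) for t in range(len(laps))]
```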
19 shows a number of frames from the actor, including to one of the processed frames, if the would yield a a series of input image 1910 (first row), resulting geometry better anchor frame distribution for the remainder of the 1920 (second row), the geometry rendered with a grid pattern sequence. The result shown in the latter three frames of FIG. 1930 to show the temporal motion over time (third row), and 19 is generated in this way, where the reference frame starts the final result 1940 rendered using per-frame color informa off as a neutral expression and then to a pose with the tion projected from the input images. This result, while dem mouth open, since the mouth remains open for most of the onstrating the expressive quality of the reconstructions, also Sequence. illustrates how system 100 can match a reference frame 10 Therefore, a performance capture algorithm is disclosed across multiple capture sequences. Specifically, the first three that can acquire expressive facial performances with pore expressions in FIG. 19 come from sequences that were cap level geometric details. In some embodiments, image-space tured on different occasions, however they are still in full tracking requires a stereo deployment where all cameras cap Vertex correspondence. ture the full face. This could be extended to a stereo deploy Reconstructed sequences from two additional actors are 15 ment like that of Bradley et al 2010, where the cameras are shown in previously discussed FIGS. 12 and 13 where it is optically Zoomed to capture Small patches of the face by shown several frames in chronological order, illustrating how combining the stereo images into one image, for example the temporal reconstruction by system 100 remains faithful to using the “unwrap mosaics' method of Rav-Acha et al. the true facial motion using an overlaid grid pattern. The 2008. result in FIG. 12 particularly highlights the fine-scale spatio Infurther embodiments, a reference frame may be matched temporal details that system 100 is able to produce using the to an anchor frames in a non-image space. In one extension, disclosed techniques. By dealing with error propagation matching may be made in 3D via the meshes computed in directly in image space, system 100 is able to produce more mesh-reconstruction stage 220 of FIG. 2. Thus, matching accurate motion reconstruction with less drift than techniques would succeed whenever object expression is the same in the that rely on a potentially distorted parameterization domain 25 reference and anchor frames, and would no longer addition for drift correction, such as the approach of Bradley et al. ally require a similar orientation relative to the cameras. 2010. Furthermore, anchoring can be a powerful tool for integrat Furthermore, with the application of anchor frames system ing multiple face performances of an actor over an extended 100 addresses the problems associated with sequential period. As discussed above, a reference frame can be taken motion processing, such as the unavoidable tracker drift over 30 from one sequence but used to generate anchor frames in long sequences, and complete tracking failure caused by very another sequence. This provides a way to propagate a single fast motion or occlusions. With an anchor frame reconstruc mesh across different capture sessions for an actor (including tion framework, system 100 can recover from such tracking the case where the camera positions or calibration may have failure. changed somewhat between the sessions), and to embed the FIG. 
20 depicts images of several frames from a sequence 35 full corpus of facial performance data for the actor into a in which significant occlusion and reappearance of structure single coordinate frame. An extension of this would be to use around the mouth is handled using tracking information multiple reference frames simultaneously. For example, obtained from the bi-directional image-space tracking in one facial performance capture could be applied to a sequence in embodiment. In this example, FIG. 20 shows a sequence of lip which an actor adapts a set of FACS poses Ekman and movements in input images 2010 in which the upper lip is 40 Friesen 1978 with careful supervision to yield best possible occluded by the lower lip in the third input image. Since results. The frames with the FACS poses could then be used as tracking of the upper lip fails, system 100 may incorrectly a set of high quality reference frames that could be used predict that the motion of the upper lip drifts down onto the simultaneously when processing Subsequent sequences (be lower lip, indicated by the overlaid grid. Sequential tracking cause the meshes have consistent triangulation). methods have trouble recovering from this situation. How 45 ever, due to an anchor frame located later in the sequence, CONCLUSION system 100 is able to successfully track the upper lip back wards from the anchor frame to the occluded frame, to auto A new passive technique has been disclosed for high-qual matically restore correct tracking after the occlusion. Thus, ity facial performance capture. In one aspect, a robust track the combination of robust image space tracking and anchor 50 ing algorithm is employed that integrates all feature tracking frames allows us to Successfully reconstruct very fast in image space and uses the integrated result to propagate a motions, even those containing motion blur. single reference mesh to each target frame in parallel. In FIG.21 depicts image of several frames from a sequence in another aspect, leveraging the fact that performances tend to which fast facial motion and motion blur is handled using contain repetitive motions, "anchor frames' are defined as tracking information obtained from the bi-directional image 55 those where the expression is similar to a reference frame. space tracking in one embodiment. In this example, FIG. 21 After locating the anchor frames, we compute feature track depicts a short sequence of input images 2110 in which an ing directly from the reference frame to anchor frames. By actor is opening his mouth very quickly. Images 2120 and using the anchor frames to partition the sequence into clips 2130 compare result using the method of Bradley et al. 2010 and independently matching clips, tracker drift is bound with (center) and the disclosed techniques (right). System 100 is 60 the correct handling of occlusion and motion blur. Frames can able to produce a more accurate reconstruction. also be matched between multiple capture sessions recorded While Some presentations of the concept of anchoring on different occasions, yielding a single deformable mesh involve a single reference frame. It may happen that a given that corresponds to every performance an actor gives. reference frame does not yield a good distribution of anchor FIG. 22 is a block diagram of computer system 2200 that frames in the whole sequence, so that the benefits of anchor 65 may incorporate an embodiment, be incorporated into an based reconstruction are lost in places. 
In this case, it is not embodiment, or be used to practice any of the innovations, necessary to maintain the same reference frame for the entire embodiments, and/or examples found within this disclosure. US 9,036,898 B1 27 28 FIG.22 is merely illustrative of a computing device, general shading operations, or the like. The one or more graphics purpose computer system programmed according to one or processors or graphical processing units (GPUs) 2210 may more disclosed techniques, or specific information process include any number of registers, logic units, arithmetic units, ing device for an embodiment incorporating an invention caches, memory interfaces, or the like. The one or more data whose teachings may be presented herein and does not limit 5 processors or central processing units (CPUs) 2205 may fur the scope of the invention as recited in the claims. One of ther be integrated, irremovably or moveably, into one or more ordinary skill in the art would recognize other variations, motherboards or daughter boards that include dedicated video memories, frame buffers, or the like. modifications, and alternatives. Memory subsystem 2215 can include hardware and/or Computer system 2200 can include hardware and/or soft Software elements configured for storing information. ware elements configured for performing logic operations 10 Memory subsystem 2215 may store information using and calculations, input/output operations, machine commu machine-readable articles, information storage devices, or nications, or the like. Computer system 2200 may include computer-readable storage media. Some examples of these familiar computer components, such as one or more one or articles used by memory subsystem 2270 can include random more data processors or central processing units (CPUs) access memories (RAM), read-only-memories (ROMS), 2205, one or more graphics processors or graphical process 15 Volatile memories, non-volatile memories, and other semi ing units (GPUs) 2210, memory subsystem 2215, storage conductor memories. In various embodiments, memory Sub subsystem 2220, one or more input/output (I/O) interfaces system 2215 can include performance capture data and pro 2225, communications interface 2230, or the like. Computer gram code 2240. system 2200 can include system 2235 interconnecting the Storage subsystem 2220 can include hardware and/or soft above components and providing functionality, Such connec ware elements configured for storing information. Storage tivity and inter-device communication. Computer system Subsystem 2220 may store information using machine-read 2200 may be embodied as a computing device. Such as a able articles, information storage devices, or computer-read personal computer (PC), a workstation, a mini-computer, a able storage media. Storage Subsystem 2220 may store infor mainframe, a cluster or farm of computing devices, a laptop, mation using storage media 224.5. Some examples of storage a notebook, a netbook, a PDA, a Smartphone, a consumer 25 media 224.5 used by storage subsystem 2220 can include electronic device, a gaming console, or the like. floppy disks, hard disks, optical storage media such as CD The one or more data processors or central processing units ROMS, DVDs and bar codes, removable storage devices, networked storage devices, or the like. 
In some embodiments, (CPUs) 2205 can include hardware and/or software elements all or part of performance capture data and program code configured for executing logic or program code or for provid 2240 may be stored using storage subsystem 2220. ing application-specific functionality. Some examples of 30 In various embodiments, computer system 2200 may CPU(s) 2205 can include one or more (e.g., include one or more hypervisors or operating systems, such as single core and multi-core) or micro-controllers. CPUs 2205 WINDOWS, WINDOWS NT WINDOWS XP. VISTA, may include 4-bit, 8-bit, 12-bit, 16-bit, 32-bit, 64-bit, or the WINDOWS 7 or the like from Microsoft of Redmond, Wash., like architectures with similar or divergent internal and exter Mac OS or Mac OS X from Apple Inc. of Cupertino, Calif., nal instruction and data designs. CPUs 2205 may further 35 SOLARIS from Sun Microsystems, LINUX, UNIX, and include a single core or multiple cores. Commercially avail other UNIX-based or UNIX-like operating systems. Com able processors may include those provided by of Santa puter system 2200 may also include one or more applications Clara, Calif. (e.g., , x86 64, PENTIUM, CELERON, configured to execute, perform, or otherwise implement tech CORE, CORE 2, CORE ix, , XEON, etc.), by niques disclosed herein. These applications may be embodied of Sunnyvale, Calif. (e.g., x86, 40 as performance capture data and program code 2240. Addi AMD 64, ATHLON, DURON, TURION, ATHLONXP/64, tionally, computer programs, executable computer code, OPTERON, PHENOM, etc). Commercially available pro human-readable source code, shader code, rendering engines, cessors may further include those conforming to the or the like, and data, Such as image files, models including Advanced RISC Machine (ARM) architecture (e.g., ARMv7 geometrical descriptions of objects, ordered geometric 9), POWER and POWERPC architecture, architec 45 descriptions of objects, procedural descriptions of models, ture, and or the like. CPU(s) 2205 may also include one or scene descriptor files, or the like, may be stored in memory more field-gate programmable arrays (FPGAs), application subsystem 2215 and/or storage subsystem 2220. The one or more input/output (I/O) interfaces 2225 can specific integrated circuits (ASICs), or other microcontrol include hardware and/or software elements configured for lers. The one or more data processors or central processing performing I/O operations. One or more input devices 2250 units (CPUs) 2205 may include any number of registers, logic 50 and/or one or more output devices 2255 may be communica units, arithmetic units, caches, memory interfaces, or the like. tively coupled to the one or more I/O interfaces 2225. The one or more data processors or central processing units The one or more input devices 2250 can include hardware (CPUs) 2205 may further be integrated, irremovably or and/or Software elements configured for receiving informa moveably, into one or more motherboards or daughterboards. tion from one or more sources for computer system 2200. 
The one or more graphics processor or graphical process 55 Some examples of the one or more input devices 2250 may ing units (GPUs) 2210 can include hardware and/or software include a computer mouse, a trackball, a track pad, a joystick, elements configured for executing logic or program code a wireless remote, a drawing tablet, a Voice command system, associated with graphics or for providing graphics-specific an eye tracking system, external storage systems, a monitor functionality. GPUs 2210 may include any conventional appropriately configured as a touch screen, a communica , such as those provided by conven 60 tions interface appropriately configured as a transceiver, or tional video cards. Some examples of GPUs are commer the like. In various embodiments, the one or more input cially available from , ATI, and other vendors. In devices 2250 may allow a user of computer system 2200 to various embodiments, GPUs 2210 may include one or more interact with one or more non-graphical or graphical user vector or parallel processing units. These GPUs may be user interfaces to enter a comment, select objects, icons, text, user programmable, and include hardware elements for encoding/ 65 interface widgets, or other user interface elements that appear decoding specific types of data (e.g., video data) or for accel on a monitor/display device via a command, a click of a erating 2D or 3D drawing operations, texturing operations, button, or the like. US 9,036,898 B1 29 30 The one or more output devices 2255 can include hardware selecting a Subset of images of the time-based sequence of and/or software elements configured for outputting informa the plurality of images, wherein the Subset of images is tion to one or more destinations for computer system 2200. bounded by a first anchor image and a second anchor Some examples of the one or more output devices 2255 can image; include a printer, a fax, a feedback device for a mouse or 5 generating, with one or more processors associated with joystick, external storage systems, a monitor or other display the computer system, tracking information for a third device, a communications interface appropriately configured image in the plurality of images based on how a first as a transceiver, or the like. The one or more output devices portion of an actor's face in the first anchor image cor 2255 may allow a user of computer system 2200 to view responds in image space to the first portion of the actor's objects, icons, text, user interface widgets, or other user inter 10 face in the third image and how the first portion of the face elements. actor's face in the second anchor image corresponds in A display device or monitor may be used with computer image space to the second portion of the actor's face in system 2200 and can include hardware and/or software ele the third image, wherein generating the tracking infor ments configured for displaying information. Some examples 15 mation for the third image comprises: include familiar display devices, such as a television monitor, determining a first error associated with how the first a cathode ray tube (CRT), a liquid crystal display (LCD), or portion of the actor's face in the first image corre the like. sponds in image space to the first portion of the actor's Communications interface 2230 can include hardware and/ face in the third image: or software elements configured for performing communica determining a second error associated with how the first tions operations, including sending and receiving data. 
What is claimed is:

1. A method for manipulating computer-generated objects, the method comprising:
   receiving a time-based sequence of a plurality of images;
   identifying a plurality of anchor frames within the time-based sequence of the plurality of images;
   selecting a subset of images of the time-based sequence of the plurality of images, wherein the subset of images is bounded by a first anchor image and a second anchor image;
   generating, with one or more processors associated with the computer system, tracking information for a third image in the plurality of images based on how a first portion of an actor's face in the first anchor image corresponds in image space to the first portion of the actor's face in the third image and how the first portion of the actor's face in the second anchor image corresponds in image space to the second portion of the actor's face in the third image, wherein generating the tracking information for the third image comprises:
      determining a first error associated with how the first portion of the actor's face in the first image corresponds in image space to the first portion of the actor's face in the third image;
      determining a second error associated with how the first portion of the actor's face in the second image corresponds in image space to the second portion of the actor's face in the third image; and
      identifying on which of the first image and the second image to base in part the tracking information for the third image in response to a comparison between the first error and the second error;
   generating, with the one or more processors associated with the computer system, information that manipulates a computer-generated object based on the tracking information for the third image to reflect a selected portion of the actor's face in the third image; and
   storing the information that manipulates the computer-generated object in a storage device associated with the one or more computer system.
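By way of illustration only, the error comparison recited in claim 1 can be sketched in a few lines of Python: each anchor image yields a correspondence hypothesis for an in-between frame, and whichever hypothesis reproduces that frame with the smaller matching error is the one the tracking information is based on. The sum-of-absolute-differences error and all names below are assumptions made for the example, not limitations of the claim.

```python
# Illustrative sketch only (not the patented implementation): choose, for an
# in-between frame, whichever anchor-based correspondence hypothesis explains
# that frame with the smaller matching error, mirroring the comparison of the
# first and second errors recited in claim 1.
import numpy as np

def matching_error(anchor, frame, flow):
    """Mean absolute difference between `frame` and `anchor` warped by `flow`.

    `anchor`, `frame`: 2-D grayscale arrays; `flow`: (H, W, 2) offsets mapping
    each pixel of `frame` to its corresponding position in `anchor`.
    """
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xa = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    ya = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return float(np.abs(anchor[ya, xa].astype(float) - frame).mean())

def select_hypothesis(anchor_a, anchor_b, frame_t, flow_from_a, flow_from_b):
    """Return the label and flow of the hypothesis with the smaller error."""
    err_a = matching_error(anchor_a, frame_t, flow_from_a)
    err_b = matching_error(anchor_b, frame_t, flow_from_b)
    return ("first anchor", flow_from_a) if err_a <= err_b else ("second anchor", flow_from_b)
```

A real system might make this choice per pixel, per mesh vertex, or per region rather than per frame; the global choice shown here keeps the sketch short.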
2. The method of claim 1 wherein generating, with the one or more processors associated with the computer system, the tracking information for the third image comprises:
   determining how the first portion of the actor's face in the first image corresponds incrementally in image space to a portion of the actor's face in each image in a set of images in the plurality of images that occur in a sequence of images in-between the first image and the third image; and
   determining, with the one or more processors associated with the computer system, how the first portion of the actor's face in the second image corresponds incrementally in image space to a portion of the actor's face in each image in a set of images in the plurality of images that occur in a sequence of images in-between the second image and the third image.
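Claim 2 recites computing the correspondence incrementally through the images that lie between an anchor image and the third image. The sketch below shows one way such frame-to-frame composition could look; the `pairwise_flow` helper is an assumed stand-in for any per-frame correspondence estimator and is not part of the claim.

```python
# Hedged sketch of the incremental correspondence recited in claim 2: short
# frame-to-frame correspondences are concatenated through every intermediate
# frame instead of matching the anchor directly against a distant frame.
# `pairwise_flow(a, b)` is an assumed helper returning an (H, W, 2) field that
# maps each pixel of `b` to its position in `a`; any optical-flow routine
# could play this role.
import numpy as np

def compose_flows(flow_ab, flow_bc):
    """Compose backward-mapping flows so the result maps frame C back to frame A."""
    h, w, _ = flow_bc.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Where each pixel of C lands in B.
    xb = np.clip(np.round(xs + flow_bc[..., 0]).astype(int), 0, w - 1)
    yb = np.clip(np.round(ys + flow_bc[..., 1]).astype(int), 0, h - 1)
    composed = np.empty_like(flow_bc)
    # Add the B-to-A offsets sampled at those positions.
    composed[..., 0] = flow_bc[..., 0] + flow_ab[yb, xb, 0]
    composed[..., 1] = flow_bc[..., 1] + flow_ab[yb, xb, 1]
    return composed

def track_from_anchor(frames, anchor_idx, target_idx, pairwise_flow):
    """Accumulate image-space correspondences from the anchor frame to the target."""
    step = 1 if target_idx >= anchor_idx else -1
    h, w = frames[anchor_idx].shape
    total = np.zeros((h, w, 2))
    for i in range(anchor_idx, target_idx, step):
        total = compose_flows(total, pairwise_flow(frames[i], frames[i + step]))
    return total
```

Composing short-baseline correspondences in this way is what lets each anchor frame propagate a hypothesis to temporally distant frames without matching them directly.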
3. The method of claim 1 wherein the second image satisfies similarity criteria associated with the first image.

4. The method of claim 1 wherein the first image and the second image satisfy similarity criteria associated with a reference image of the actor.

5. The method of claim 1 further comprising:
   generating, with the one or more processors associated with the computer system, tracking information for the first image in the plurality of images based on how a first portion of the actor's face in a reference image of the actor corresponds in image space to a second portion of the actor's face in the first image;
   generating, with the one or more processors associated with the computer system, tracking information for the second image in the plurality of images based on how a second portion of the predetermined image of the actor corresponds in image space to a second portion of the actor's face in the second image; and
   generating, with the one or more processors associated with the computer system, the tracking information for the third image further based on at least one of the tracking information for the first image and the tracking information for the second image.

6. The method of claim 5 wherein generating the tracking information for the first image or the second image comprises a bi-directional comparison to and from one or more portions of the actor's face in the reference image of the actor.
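Claims 5 and 6 recite relating images to a reference image of the actor using a bi-directional comparison. Shown below, purely as a hedged sketch with invented names, is one common form such a check can take: correspondences are followed from the reference image into the other image and back again, and only matches that approximately return to their origin are kept.

```python
# Hedged illustration of the bi-directional comparison of claims 5 and 6. The
# flow convention, the tolerance, and the function name are assumptions, not
# claim language.
import numpy as np

def bidirectional_consistency(flow_ref_to_anchor, flow_anchor_to_ref, tol=1.0):
    """Boolean (H, W) mask of pixels whose round-trip displacement nearly cancels."""
    h, w, _ = flow_ref_to_anchor.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Positions reached in the anchor image when following the forward flow.
    xa = np.clip(np.round(xs + flow_ref_to_anchor[..., 0]).astype(int), 0, w - 1)
    ya = np.clip(np.round(ys + flow_ref_to_anchor[..., 1]).astype(int), 0, h - 1)
    # A consistent correspondence returns (approximately) to where it started.
    round_trip_x = flow_ref_to_anchor[..., 0] + flow_anchor_to_ref[ya, xa, 0]
    round_trip_y = flow_ref_to_anchor[..., 1] + flow_anchor_to_ref[ya, xa, 1]
    return np.hypot(round_trip_x, round_trip_y) < tol
```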
7. A method for manipulating computer-generated objects, the method comprising:
   receiving a time-based sequence of a plurality of images;
   identifying a plurality of anchor frames within the time-based sequence of the plurality of images;
   selecting a subset of images of the time-based sequence of the plurality of images, wherein the subset of images is bounded by a first anchor image and a second anchor image;
   generating, with one or more processors associated with the computer system, tracking information for a third image in the plurality of images based on how a first portion of an actor's face in the first anchor image corresponds in image space to the first portion of the actor's face in the third image and how the first portion of the actor's face in the second anchor image corresponds in image space to the second portion of the actor's face in the third image;
   generating, with the one or more processors associated with the computer system, information that manipulates a computer-generated object based on the tracking information for the third image to reflect a selected portion of the actor's face in the third image;
   storing the information that manipulates the computer-generated object in a storage device associated with the one or more computer system;
   identifying, with the one or more processors associated with the computer system, a first set of features in the first image;
   identifying, with the one or more processors associated with the computer system, a first set of features in the third image that corresponds to the first set of features in the first image;
   generating, with the one or more processors associated with the computer system, information indicative of how the first portion of the actor's face in the first image corresponds in image space to the first portion of the actor's face in the third image based on the first set of features in the third image that corresponds to the first set of features in the first image;
   identifying, with the one or more processors associated with the computer system, a first set of features in the second image;
   identifying, with the one or more processors associated with the computer system, a second set of features in the third image that corresponds to the first set of features in the second image; and
   generating, with the one or more processors associated with the computer system, information indicative of how the first portion of the actor's face in the second image corresponds in image space to the second portion of the actor's face in the third image based on the second set of features in the third image that corresponds to the first set of features in the second image.

8. The method of claim 7 wherein identifying the first set of features in the first image or the second image comprises identifying one or more of a pixel, an image structure, a geometric structure, or a facial structure.

9. The method of claim 1 further comprising:
   receiving, at the computer system, criteria configured to designate images as being similar to a reference image of the actor;
   receiving, at the computer system, a sequence of images; and
   generating, with the one or more processors associated with the computer system, the plurality of images as a clip bounded by the first image and the second image based on the criteria configured to designate images as being similar to the reference image of the actor.

10. The method of claim 1 wherein generating, with the one or more processors associated with the computer system, the information that manipulates the computer-generated object to reflect the selected portion of the actor's face in the third image comprises generating information configured to propagate a mesh from a first configuration to a second configuration.
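Claim 10 recites generating information configured to propagate a mesh from a first configuration to a second configuration. The following sketch merely moves the projected positions of mesh vertices according to an image-space correspondence field; it is an illustration under assumed conventions, not the patented method.

```python
# Hedged sketch for the mesh propagation of claim 10: displace the projected
# 2-D positions of reference-mesh vertices by a dense image-space
# correspondence field, giving the mesh's projection in its new configuration.
# In the multi-camera setting summarized in the abstract, the displaced
# projections from several cameras would then be triangulated back to 3-D;
# that step is omitted here, and all names are illustrative.
import numpy as np

def propagate_mesh_projection(vertices_uv, flow_to_target):
    """Move (N, 2) projected vertex positions by an (H, W, 2) forward flow field."""
    vertices_uv = np.asarray(vertices_uv, dtype=float)
    h, w, _ = flow_to_target.shape
    moved = vertices_uv.copy()
    for i, (u, v) in enumerate(vertices_uv):
        # Nearest-neighbour lookup of the flow; a real system would interpolate.
        x = int(np.clip(np.round(u), 0, w - 1))
        y = int(np.clip(np.round(v), 0, h - 1))
        moved[i, 0] = u + flow_to_target[y, x, 0]
        moved[i, 1] = v + flow_to_target[y, x, 1]
    return moved
```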
11. The method of claim 1 further comprising modifying, with the one or more processors associated with the computer system, the information that manipulates the computer-generated object in response to shape information associated with one or more portions of the actor's face in the third image.

12. The method of claim 1 further comprising modifying, with the one or more processors associated with the computer system, the information that manipulates the computer-generated object in response to a measure of temporal coherence over a set of two or more images in the plurality of images that includes the third image.

13. A non-transitory computer-readable medium storing computer-executable code for manipulating computer-generated objects, the non-transitory computer-readable medium comprising:
   code for receiving a time-based sequence of a plurality of images;
   code for identifying a plurality of anchor frames within the time-based sequence of the plurality of images;
   code for selecting a subset of images of the time-based sequence of the plurality of images, wherein the subset of images is bounded by a first anchor image and a second anchor image;
   code for generating tracking information for a third image in the plurality of images based on how a first portion of an actor's face in the first anchor image corresponds in image space to the first portion of the actor's face in the third image and how the first portion of the actor's face in the second anchor image corresponds in image space to a second portion of the actor's face in the third image, wherein the code for generating the tracking information for the third image comprises:
      code for determining a first error associated with how the first portion of the actor's face in the first image corresponds in image space to the first portion of the actor's face in the third image;
      code for determining a second error associated with how the first portion of the actor's face in the second image corresponds in image space to the second portion of the actor's face in the third image; and
      code for identifying on which of the first image and the second image to base in part the tracking information for the third image in response to a comparison between the first error and the second error; and
   code for generating information that manipulates a computer-generated object based on the tracking information for the third image to reflect a selected portion of the actor's face in the third image.

14. The non-transitory computer-readable medium of claim 13 wherein the code for generating the tracking information for the third image comprises:
   code for determining how the first portion of the actor's face in the first image corresponds incrementally in image space to a portion of the actor's face in each image in a set of images in the plurality of images that occur in a sequence of images in-between the first image and the third image; and
   code for determining how the first portion of the actor's face in the second image corresponds incrementally in image space to a portion of the actor's face in each image in a set of images in the plurality of images that occur in a sequence of images in-between the second image and the third image.

15. A non-transitory computer-readable medium storing computer-executable code for manipulating computer-generated objects, the non-transitory computer-readable medium comprising:
   code for receiving a time-based sequence of a plurality of images;
   code for identifying a plurality of anchor frames within the time-based sequence of the plurality of images;
   code for selecting a subset of images of the time-based sequence of the plurality of images, wherein the subset of images is bounded by a first anchor image and a second anchor image;
   code for generating tracking information for the first image in the plurality of images based on how a first portion of the actor's face in a reference image of the actor corresponds in image space to a second portion of the actor's face in the first image;
   code for generating tracking information for the second image in the plurality of images based on how a second portion of the predetermined image of the actor corresponds in image space to a second portion of the actor's face in the second image, wherein the code for generating the tracking information for the first image or the second image comprises a bi-directional comparison to and from one or more portions of the actor's face in the reference image of the actor; and
   code for generating tracking information for a third image in the plurality of images based on how a first portion of an actor's face in the first anchor image corresponds in image space to the first portion of the actor's face in the third image, how the first portion of the actor's face in the second anchor image corresponds in image space to a second portion of the actor's face in the third image, and at least one of the tracking information for the first image and the tracking information for the second image; and
   code for generating information that manipulates a computer-generated object based on the tracking information for the third image to reflect a selected portion of the actor's face in the third image.

16. The non-transitory computer-readable medium of claim 13 further comprising:
   code for identifying a first set of features in the first image;
   code for identifying a first set of features in the third image that corresponds to the first set of features in the first image;
   code for generating information indicative of how the first portion of the actor's face in the first image corresponds in image space to the first portion of the actor's face in the third image based on the first set of features in the third image that corresponds to the first set of features in the first image;
   code for identifying a first set of features in the second image;
   code for identifying a second set of features in the third image that corresponds to the first set of features in the second image; and
   code for generating information indicative of how the first portion of the actor's face in the second image corresponds in image space to the second portion of the actor's face in the third image based on the second set of features in the third image that corresponds to the first set of features in the second image.

17. The non-transitory computer-readable medium of claim 16 wherein the code for identifying the first set of features in the first image or the second image comprises code for identifying one or more of a pixel, an image structure, a geometric structure, or a facial structure.

18. The non-transitory computer-readable medium of claim 13 further comprising:
   code for receiving criteria configured to designate images as being similar to a reference image of the actor;
   code for receiving a sequence of images; and
   code for generating the plurality of images as a clip bounded by the first image and the second image based on the criteria configured to designate images as being similar to the reference image of the actor.

19. The non-transitory computer-readable medium of claim 13 wherein the code for generating the information that manipulates the computer-generated object to reflect the selected portion of the actor's face in the third image comprises code for generating information configured to propagate a mesh from a first configuration to a second configuration.

20. The non-transitory computer-readable medium of claim 13 further comprising code for modifying the information that manipulates the computer-generated object in response to shape information associated with one or more portions of the actor's face in the third image.

21. The non-transitory computer-readable medium of claim 13 further comprising code for modifying the information that manipulates the computer-generated object in response to a measure of temporal coherence over a set of two or more images in the plurality of images that includes the third image.

* * * * *

UNITED STATES PATENT AND TRADEMARK OFFICE
CERTIFICATE OF CORRECTION

PATENT NO.      : 9,036,898 B1
APPLICATION NO. : 13/287,774
DATED           : May 19, 2015
INVENTOR(S)     : Beeler et al.

Page 1 of 1
It is certified that error appears in the above-identified patent and that said Letters Patent is hereby corrected as shown below:

On the title page, item (73), insert --ETH Zürich (Eidgenössische Technische Hochschule Zürich), Zürich (CH)--

Signed and Sealed this Fifteenth Day of December, 2015

Michelle K. Lee
Director of the United States Patent and Trademark Office