A Practical and Configurable Lip Sync Method for Games

Yuyu Xu ∗ Andrew W. Feng† Stacy Marsella‡ Ari Shapiro§

USC Institute for Creative Technologies

Figure 1: Accurate lip synchronization results for multiple characters can be generated using the same set of phone bigram blend curves. Our method uses artist-driven data to produce high quality lip synchronization in multiple languages. Our method is well-suited for game pipelines, since it uses static facial poses or blendshapes and can be directly edited by an animator, and modified as needed on a per-character basis.

ABSTRACT

We demonstrate a lip animation (lip sync) algorithm for real-time applications that can be used to generate synchronized facial movements with audio generated from natural speech or a text-to-speech engine. Our method requires an animator to construct animations using a canonical set of visemes for all pairwise combinations of a reduced phoneme set (phone bigrams). These animations are then stitched together to construct the final animation, adding velocity and lip-pose constraints. This method can be applied to any character that uses the same, small set of visemes. Our method can operate efficiently in multiple languages by reusing phone bigram animations that are shared among languages, and specific word sounds can be identified and changed on a per-character basis. Our method uses no machine learning, which offers two advantages over techniques that do: 1) data can be generated for non-human characters whose faces can not be easily retargeted from a human speaker's face, and 2) the specific facial poses or shapes used for animation can be specified during the setup and rigging stage, and before the lip animation stage, thus making it suitable for game pipelines or circumstances where the speech target poses are predetermined, such as after acquisition from an online 3D marketplace.

Index Terms: I.3.2 [Computing Methodologies]: Computer Graphics—Methodology and Techniques; I.3.7 [Computing Methodologies]: Computer Graphics—Three-Dimensional Graphics and Realism; I.3.8 [Computing Methodologies]: Computer Graphics—Applications

1 MOTIVATION

Synchronizing the lip and mouth movements naturally along with animation is an important part of convincing 3D character performance. In this paper, we present a simple, portable and editable lip-synchronization method that works for multiple languages, requires no learning, can be constructed by a skilled animator, is effective for real-time simulations such as games, and can be personalized for each character. Our method associates animation curves designed by an animator on a fixed set of static facial poses with sequential pairs of phonemes (phone bigrams), and then stitches these animations together to create a set of curves for the facial poses, along with constraints that ensure that key poses are properly played. Diphone- and triphone-based methods have been explored in various previous works, often requiring machine learning. However, our experiments have shown that animating phoneme pairs (such as phone bigrams or diphones), as opposed to phoneme triples or longer sequences of phonemes, is sufficient for many types of animated characters. Also, our experiments have shown that skilled animators can sufficiently generate the data needed for good quality results. Thus our algorithm does not need any specific rules about coarticulation, such as dominance functions or language rules. Such rules are implicit within the artist-produced data. In order to produce a tractable set of data, our method reduces the full set of 40 English phonemes to a smaller set of 21, which are then annotated by an animator. Once the full animation set has been generated, it can be reused for multiple characters. Each additional character requires a small set of static poses or blendshapes that match the original pose set.
Lip sync data needed for a new language requires a new set of phone bigrams for each language, although similar phonemes among languages can share the same phone bigram curves. We show how to reuse our English phone bigram animation set to adapt to a Mandarin set. Our method works with both natural speech and text-to-speech engines.

Our artist-driven approach is useful for lip syncing non-human characters whose lip or face configuration doesn't match a human's, which would make it difficult to retarget. Animators can generate a lip animation data set for any character or class of characters, regardless of the facial setup specified. We demonstrate such an example on an animated, talking crab that we acquired from an online 3D content source, and whose facial configuration does not match any existing set of lip sync poses. In addition, our method can be used with any set of facial poses or blend shapes as input, thus making it suitable for game pipelines where the characters must adhere to specific facial configurations. This differs from many machine learning algorithms where the retargeting efforts must be done once the learned data is generated, or where specific facial poses are output from the machine learning process, but differ according to the input data. By allowing the setup of the character to be determined in advance of generating the lip animation data, our system can be adjusted to be compatible with any existing lip animation setup, thus making it ideally suitable for game pipelines. We demonstrate this capability by using the same facial poses as a commercial lip animation tool, FaceFX [19], whose results we can then directly compare against. In addition, control over our method can be achieved by reanimating specific phoneme pairs (such as L-Oh, as in the word 'low').
Thus the various parts of the algorithm are both intuitive and controllable.

It is difficult to compare the quality of the results between one lip syncing method and another, since the models, data sets, and characteristics are often different between methods. With some exceptions [27], few research methods attempt to compare their own results to other established research methods or commercial results. There are many reasons for this, including the difficulty in exactly reproducing other methods and the lack of availability of data sets. Consequently, comparisons are made to simpler methods [34]. By contrast, we present a direct comparison with a popular commercial lip syncing engine, FaceFX [19], using identical models and face shapes. We thus maintain that our method generally produces good results with very low requirements, while simultaneously being controllable and editable.

2 RELATED WORK

There are many facial animation techniques, including 2D image-based methods and 3D geometry-based methods. A background on many of these techniques can be found in the survey by Parke and Waters [31]. However, the focus of this paper is on lip syncing to audio, which we summarize below.

2.1 Visual Speech Animation

The synthesis of realistic visual speech animations corresponding to novel text or prerecorded acoustic speech input has been a major research problem for decades. A common approach is to map one or more individual phonemes to a corresponding viseme and generate the animation by interpolating the visemes given phoneme sequences [18], as demonstrated in a system using blendshapes [35]. However, naively interpolating viseme sequences tends to generate poor results since it does not consider co-articulation.
Visual speech co-articulation is a phenomenon in which the current viseme shape is not an independent face shape, but rather is affected by adjacent phonemes. The early work by Cohen et al introduced dominance functions [11, 28] as a parameterization method to deal with co-articulation. Their pioneering work is followed by more research works aimed at improving the dominance function model [13, 24, 12]. In their work each parameter curve controls a time-varying deformation over a small region on the face model. Although this parameterization method is intuitive to use, it is difficult to generalize across different faces since a set of parameters is bound to a certain facial topology. Therefore a lot of manual effort is required for each new face. Our method parameterizes the face animation as a set of blend curves using a canonical set of viseme shapes. Therefore the same set of parameter curves can be applied to different characters, as long as they use the same canonical definition for the viseme shapes.

Many data-driven methods have been developed to learn co-articulation patterns from animation data. Ma et al learned variable length units based on a corpus [26]. Deng et al proposed a method that learns diphone and triphone models represented as weight curves from motion captured data [16]. Dynamic programming is applied to calculate the best co-articulation viseme sequences based on speech input. Although their method can produce natural speech animation with learned co-articulation, it requires large amounts of motion data for training, and the result is highly dependent on the training data. Cao et al transferred captured data to a model to synthesize speech motions [6]. The recent work by Taylor et al [34] also introduced a dynamic visemes method, which extracts and clusters visual gestures from video input of human subjects. To synthesize dynamic visemes from a phoneme sequence, it looks for the best mapping through the probability graph learned from visual gesture data and stitches the viseme sequences together. Their method produces good results with co-articulation effects, but requires significant animator effort to set up around 150 viseme animations for each rigged character's face. By contrast, our method requires only a small set of static facial poses for each character.

Blendshapes are a widely used technique due to their simplicity, where 3D shapes are interpolated and blended together in customized ways. One of the drawbacks when using blendshapes is the difficulty in defining a set of face shapes that are orthogonal to each other. This in turn causes the influence from one face shape to degrade other shapes. [25, 15] provided methods to avoid such interference. Human performance also provides a way to produce high quality and realistic facial animations. Waters and coworkers [36] applied a radial laser scanner to capture facial geometry from the subject and used the scanned markers to drive the underlying muscle system. Chai et al [10] used optical flow tracking to capture the performance from a subject to drive facial animations.

2.2 Expressive Facial Animation

Expressive facial animations, including eye gaze and head movements, greatly enhance the realism of a virtual character. Cassell et al developed a rule-based automatic system [8]. Brand et al [5] constructed a facial internal state machine driven by voice input using a Hidden Markov Model (HMM). Deng et al extended the work on the co-articulation model [16] and added expressive facial motion learned from motion data to wrap it along with the speech motion [17]. Cao et al used learning techniques to generate separate emotion components from speech [7]. More recent research has used text-to-speech (TTS) to drive expressive speech [29].

2.3 Software

FaceFX is lip syncing software that has been widely adopted in many video games and simulations. It takes speech audio as input and generates a set of blending curves. It then additively combines a set of simple component animations based on the blending curves to produce facial animations. In this paper we will compare our result with its result based on lip-syncing accuracy and naturalness. In addition to prerecorded audio files, we also make use of text-to-speech (TTS) engines such as Microsoft TTS [22], Festival [4], and Cereproc [9] to produce mid-quality speech audio and phoneme timing on the fly.

Our method differs from many of the existing machine learning methods in that the data can be constructed and modified by a skilled animator, and therefore does not require any performance capture or speech corpus. In addition, each phone bigram generated by the animator has local effects and can be edited or changed individually without affecting other parts of the system. Such a process can be repeated for new characters, and thus the fact that our system can be changed and modified easily represents a strong advantage over machine learning methods, which cannot learn and change the results for individual sounds per character as our system can.

3 METHOD

Our method involves two phases: an offline phase where an animator constructs animation curves associated with all pairwise combinations of phonemes, and a runtime phase that produces the speech animations by stitching, smoothing and constraint satisfaction based on the input timings of the phonemes.

3.1 Offline Phase

Our goal is to construct a set of animations that are associated with pairs of phonemes. We choose phoneme pairs as our canonical unit of animation, since pairs of phonemes allow for coarticulation effects that are not possible by associating animation with individual phonemes. Diphones are commonly used in machine learning techniques, and represent timings between the middle of one phoneme and the middle of the next. Animators construct phone bigrams, which represent the timings from the start of one phoneme to the end of the following one. We choose phone bigrams, and not diphones, since an animation of a phone bigram is intuitive, whereas an animation of a diphone has no intuitive representation. Phone bigrams can be constructed by animating a short word with a single syllable that represents the two phonemes.
For example, when animating the phone bigram for L-Oh, an animator can create an animation that represents the word 'Low'. By contrast, there is no intuitive representation of the mouth movements for a diphone, which would require the animator to start the motion from the middle of the 'L' sound to the middle of the 'O' sound. By using phone bigrams, animators are able to create a better quality set of animations.

We also choose phoneme pairs, rather than combinations of three phonemes or higher order sequences, in order to reduce the amount of data that needs to be produced by an animator. We expect that better results could be obtained by animating or learning combinations of three phonemes or even longer sequences. However, our goal is to identify a tractable set of data that can be quickly generated so that high quality lip syncing can be produced easily.

3.1.1 Phoneme Selection

Phonemes differ from each other according to their sounds. However, similar-looking facial movements can produce multiple phonemes. For example, a person will make a similar-looking face of folding their lips together when producing the 'b' and 'p' sounds. Thus, in order to further reduce the amount of data necessary for our method, we map the 40 English phonemes [37] into a smaller phoneme set, which we call the Common Phoneme Set, in turn greatly reducing the number of pairwise phonemes that must be produced during the offline phase. Table 1 shows the mapping to our common set of phonemes.

The mapping of phonemes (for example, mapping all 'b' and 'm' sounds to the common 'bmp' phoneme) somewhat reduces the subtleties that result from the differences between those sounds. However, our method allows for the inclusion of additional detail as needed by remapping the phonemes. Thus, the 'b' and 'm' phonemes could be separated, resulting in an additional number of phone bigrams to animate.

English phonemes | Common Phoneme Set | Examples
ae, ah, ax | ah | cat, cut, ago
aa | aa | father
ao | ao | dog
ey, eh | eh | ate, pet
er | er | fur
ih, iy | ih | feel, fill, debit
w, uw, uh | w | with, too, book
ow | ow | go
aw | aw | foul
oy | oy | toy
ay | ay | bite
h | h | help
r | r | red
l | l | lid
s, z | z | sit, zap
sh, ch, jh, zh, y | sh | she, chin, joy, pleasure, yard
th, dh | th | thin, then
f, v | f | fork, vat
d, t, n, ng | d | dig, talk, no, sing
k, g | kg | cut, gut
p, b, m | bmp | put, big, mat

Table 1: Forty English phonemes mapped to our common set of phonemes. The left column lists the full set of English phonemes, the right column lists all the canonical visemes, and the second right column lists all the pairwise phoneme combinations for the given phoneme schedules.

A language with n phonemes uses n^2 pairwise phoneme combinations. Thus, English has 40 phonemes and 40^2 = 1600 pairwise phoneme combinations. By contrast, our reduced Common Phoneme Set of 21 phonemes contains 21^2 = 441 pairwise phoneme combinations. Thus, animators only need to generate approximately 25% of the animations that would ordinarily be constructed if the entire English phoneme set were used.

Using this reduced phoneme set, we generated phone bigrams from a corpus of approximately 200 utterances of varying length, from a single word to utterances composed of several sentences. Table 2 shows the frequencies of the Common Phoneme Set pairwise phoneme combinations when generated from the Microsoft TTS Engine.

Note that the d Common Phoneme Set phoneme appears frequently, since it encompasses four phonemes: d, t, n, ng. In addition, 104 of the 441 Common Set phone bigrams never appeared, and thus nearly 99.1% of the phone bigrams could be generated from the first 283 most common phone bigrams. The phone bigram distribution is shown in Figure 2. It is possible that different phoneme sequencers would generate slightly different results based on their own analysis of words and their mapping to the phonemes. However, we would expect them to be mostly aligned with our results, since to do otherwise would indicate a lack of synchronization with the SAMPA [37] phoneme set.
Rank | Common Set Phoneme Pair | Percentage
1 | ah - d | 4.62
2 | d - ih | 3.44
3 | ih - d | 2.68
4 | d - d | 2.64
5 | d - ah | 2.40
6 | eh - d | 1.83
7 | d - w | 1.66
8 | ah - r | 1.58
9 | ih - w | 1.50
10 | d - z | 1.43
11 | ih - z | 1.37
12 | ih - kg | 1.32
13 | aa - d | 1.27
14 | z - ih | 1.22
15 | r - ih | 1.21
16 | th - ah | 1.19
17 | z - d | 1.16
18 | kg - ah | 1.15
19 | ah - l | 1.10
20 | z - ah | 1.04
21 | ah - bmp | 0.98

Table 2: Frequency of Common Phoneme Set phoneme pairs, generated from a TTS engine on a corpus of approximately 200 utterances.
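To make the phoneme reduction and the bigram counting concrete, the following Python sketch (our own illustration, not the authors' implementation) encodes the Table 1 mapping as a dictionary, groups a phoneme sequence into phone bigrams, and tallies bigram frequencies over a corpus of phoneme sequences. The function names and the example input are hypothetical.

    from collections import Counter

    # Mapping from the full English phoneme set to the reduced Common Phoneme Set,
    # transcribed from Table 1 (40 phonemes map onto 21 common phonemes).
    COMMON_PHONEME_MAP = {
        "ae": "ah", "ah": "ah", "ax": "ah",
        "aa": "aa", "ao": "ao",
        "ey": "eh", "eh": "eh", "er": "er",
        "ih": "ih", "iy": "ih",
        "w": "w", "uw": "w", "uh": "w",
        "ow": "ow", "aw": "aw", "oy": "oy", "ay": "ay",
        "h": "h", "r": "r", "l": "l",
        "s": "z", "z": "z",
        "sh": "sh", "ch": "sh", "jh": "sh", "zh": "sh", "y": "sh",
        "th": "th", "dh": "th",
        "f": "f", "v": "f",
        "d": "d", "t": "d", "n": "d", "ng": "d",
        "k": "kg", "g": "kg",
        "p": "bmp", "b": "bmp", "m": "bmp",
    }

    def to_common_set(phonemes):
        # Map a phoneme sequence (e.g. from a TTS engine) onto the Common Phoneme Set.
        return [COMMON_PHONEME_MAP.get(p, p) for p in phonemes]

    def phone_bigrams(phonemes):
        # Group adjacent phonemes into pairs (phone bigrams) after the mapping.
        common = to_common_set(phonemes)
        return list(zip(common, common[1:]))

    def bigram_frequencies(corpus):
        # Count phone bigram frequencies over a corpus of phoneme sequences.
        counts = Counter()
        for utterance in corpus:
            counts.update(phone_bigrams(utterance))
        return counts

    # Hypothetical input: the phonemes of the word "fat" (f - ae - t).
    print(phone_bigrams(["f", "ae", "t"]))  # [('f', 'ah'), ('ah', 'd')]

Feeding a corpus of phoneme sequences, such as the roughly 200 utterances described above, through bigram_frequencies yields the kind of distribution summarized in Table 2.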

Figure 3: Curves for the F-Ah phone bigram. The animator selects three facial poses: FV, open and wide, and constructs animation curves over normalized time. The animation can be directly played on the character, or the character's lip animation could be driven using TTS or recorded audio.

Figure 2: Distribution of Common Phoneme Set phoneme pairs using a text-to-speech engine. The horizontal axis shows the number of pairwise phonemes. The vertical axis shows the percentage of all pairwise phonemes.

3.1.2 Facial Pose Selection and Phone Bigram Animation

Our method sequences a series of animations that are associated with our Common Phoneme Set phoneme pairs. Therefore, instead of having animators create animations from scratch, we choose to represent phone bigram animations as a set of blend curves using a canonical set of face poses. This not only greatly simplifies the task for animators, but also allows our animations to be portable to different characters. As long as a new character has the same set of canonical face poses, our phone bigram curves can be applied on that character to generate high quality lip syncing animations. To produce a Common Phoneme Set of phone bigram animations, animators construct facial movements by combining several static face poses over time using a curve editor or similar tool. The animation curves are normalized, and will be timewarped over the length of the phone bigrams.

In order to reduce the effort required from animators when creating blend curves, we developed a novel phone bigram curve editor that assists the user in quickly creating curves and evaluating the quality of the resulting facial animations. As shown in Figure 3, our editing tool [32] allows the animator to choose pairwise phoneme combinations and edit their corresponding blend curves. The animator specifies a piecewise linear curve l_j^k(t) for each face pose f_k and for each phoneme pair d_j. The linear curve is the control curve for a smooth B-spline curve c_j^k(t), which is used to adjust the influence of pose f_k for a specific phone bigram d_j. Here each pose f_k represents a displacement from the neutral pose, and therefore the weights sum_k c_j^k(t) do not necessarily sum to one. The editor can also generate a speech animation for a specific sentence automatically from a TTS engine or from a prerecorded audio file. The sentence is analyzed on the fly and transformed into the individual phone bigrams. We show an example of a set of curves associated with the F-Ah phone bigram (for example, when saying the beginning of the word 'fat'). After the animator updates the blend curves for a specific phone bigram, he can immediately evaluate its quality in a speech animation by typing words or sentences that contain the phoneme pair. This feedback can also help the animator to quickly identify the subset of phone bigrams that need improvement. By testing various words and sentences, problematic results can be effectively found, providing information for the animator to fill in missing phone bigrams or improve existing ones.

Since we rely on blend curves to produce character-independent phone bigram animations, it is crucial to choose a good set of canonical face poses. In our method, we choose a set of facial poses that allows the generation of nearly any facial expression through various combinations of the facial poses. For compatibility with existing pipelines and for purposes of comparison, we chose the same facial poses that are used by the FaceFX software [20], detailed in Figure 4. These shapes include 5 face shapes and 3 tongue shapes. We expect that any canonical set of facial poses could be used, since the animator decides how and when to activate the various components. The facial poses need to be able to model sophisticated facial movements such as pursing the lips and tongue movements.

Figure 4: From first to last: fv, open, pbm, shch, w, wide. In addition, three tongue positions (up, down and back) are used, as well as a neutral pose.

The animators construct curves in normalized time. The phone bigram animation will be used by the realtime algorithm by timewarping the animation curve over the length of each phone bigram, as detailed in the section below. Each language, such as English, requires its own set of animations, although other language sets can reuse the same phone bigram animations, since many phonemes are shared between languages.
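The blend-curve representation described above can be sketched as follows (Python; a minimal illustration with our own naming, not the authors' implementation). Each phone bigram stores per-pose keyframes in normalized time; the paper smooths the piecewise linear control curve with a B-spline, for which plain linear interpolation stands in here, and the keyframe values below are hypothetical.

    import bisect

    class PhoneBigramAnimation:
        # Blend curves for one phone bigram, keyed over normalized time [0, 1].
        # `curves` maps a canonical face pose name (e.g. 'open', 'FV', 'wide')
        # to a list of (normalized_time, weight) keyframes authored by an animator.
        def __init__(self, curves):
            self.curves = {pose: sorted(keys) for pose, keys in curves.items()}

        def evaluate(self, pose, u):
            # Piecewise-linear evaluation at normalized time u in [0, 1].
            keys = self.curves.get(pose, [])
            if not keys:
                return 0.0
            times = [t for t, _ in keys]
            i = bisect.bisect_right(times, u)
            if i == 0:
                return keys[0][1]
            if i == len(keys):
                return keys[-1][1]
            (t0, v0), (t1, v1) = keys[i - 1], keys[i]
            return v0 + (v1 - v0) * (u - t0) / (t1 - t0)

        def warped(self, pose, t, t_start, t_end):
            # Evaluate the curve time-warped onto the actual span [t_start, t_end].
            u = (t - t_start) / (t_end - t_start)
            return self.evaluate(pose, min(max(u, 0.0), 1.0))

    # Hypothetical curves for the F-Ah bigram over the FV, open and wide poses.
    f_ah = PhoneBigramAnimation({
        "FV":   [(0.0, 0.0), (0.2, 0.9), (0.5, 0.2), (1.0, 0.0)],
        "open": [(0.0, 0.0), (0.5, 0.7), (1.0, 0.1)],
        "wide": [(0.0, 0.0), (0.6, 0.3), (1.0, 0.0)],
    })
    print(f_ah.warped("open", 0.25, t_start=0.1, t_end=0.4))  # 'open' weight at t = 0.25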
3.2 Runtime Phase

Figure 5 shows a flow chart for our runtime system. The runtime component operates on a sequence of phonemes generated from a phoneme scheduler. The phoneme schedule can be extracted directly from a TTS engine, extracted offline from recorded audio, or online via various phoneme translator tools. We have tested our TTS path using the Microsoft [30], Cereproc TTS [9], and Festival [3] TTS engines. Phonemes from recorded audio can be extracted using various commercial [19] or noncommercial tools [23, 33]. The details of such phoneme scheduling are outside of the scope of this work. Our system expects a sequence of English phonemes and timings for each phoneme.

3.2.1 Curve Stitching and Smoothing

Our method then groups the sequence of phonemes into phoneme pairs, and maps those to the Common Phoneme Set. As shown in Figure 6, assume the input phoneme schedule p_0, p_1, ..., p_n occurs at times t_0, t_1, ..., t_n. Our method first constructs a sequence of phoneme pairs consisting of adjacent pairs of phonemes (p_0, p_1), (p_1, p_2), ..., (p_{n-2}, p_{n-1}) and their corresponding time spans (t_0, t_2), (t_1, t_3), ..., (t_{n-2}, t_n).

Each phoneme pair (p_i, p_{i+1}) is also associated with a set of phone bigram curves c_i^k(t), t_i <= t <= t_{i+2}, for each canonical face pose f_k. We stretch or compress the normalized curves according to the time span of the phoneme pair to generate c_i^k(t). We then stitch the overlapping curves for the same face pose f_k over adjacent time spans to produce one single continuous blend curve c^k(t), t_0 <= t <= t_n. The stitching process eliminates overlapping areas between adjacent curves c_i^k and c_{i+1}^k by retaining the curve with the largest value, as shown in Figure 7. For example, c^k(t) = max(c_i^k(t), c_{i+1}^k(t)) for t_{i+1} <= t <= t_{i+2}. We choose the maximum value over linear blending or averaging in the overlapping area so that the original information is preserved in the resulting curves that activate the face poses. Applying other blending methods may damp the curves and produce undesired lip movements during transitions.

We also perform a smoothing pass over each curve using a user-specified window to scan through the temporal domain and find local maxima. Figure 8 shows the resulting curves after the smoothing process. Here we define a sliding window, with its size 2t_w defined by the user. The smoothing window slides over the curve c^k(t) from t_0 to t_n and detects the local maxima of c^k(t). If at any time instant t there are two or more local maxima c^k(t_a) and c^k(t_b) inside the window from t - t_w to t + t_w, we smooth out the curve values between t_a and t_b by interpolating the two maxima c^k(t_a) and c^k(t_b) with a spline curve. Specifically, since c^k(t_a) and c^k(t_b) are the values of spline curves associated with piecewise linear curves whose values are l^k(t_a) and l^k(t_b) at times t_a and t_b, the new spline curve can be obtained by linearly interpolating l^k(t_a) and l^k(t_b) such that l^k(t) = ((t_b - t) l^k(t_a) + (t - t_a) l^k(t_b)) / (t_b - t_a) from time t_a to t_b. This new linear curve thus defines updated values for c^k(t) from t_a to t_b. Intuitively, this process smoothes out the "valley" between two maxima and therefore helps reduce high frequency lip movements that are not natural for a speech animation.
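A minimal sketch of the stitching and smoothing passes just described, operating on curves sampled as (time, value) pairs. The max-based merge and the valley interpolation follow the formulas above, while the function names and sample data are our own and the sketch assumes the overlapping curves are sampled at matching time steps.

    def stitch_max(curve_a, curve_b):
        # Stitch two overlapping sampled curves for the same face pose by
        # keeping the larger value wherever both curves contribute.
        merged = {}
        for t, v in curve_a + curve_b:
            merged[t] = max(v, merged.get(t, 0.0))
        return sorted(merged.items())

    def smooth_valleys(samples, window):
        # Sliding-window smoothing: if two local maxima fall within `window`
        # (= 2 * t_w) of each other, linearly interpolate across the valley.
        times = [t for t, _ in samples]
        values = [v for _, v in samples]
        peaks = [i for i in range(1, len(values) - 1)
                 if values[i] >= values[i - 1] and values[i] >= values[i + 1]]
        out = values[:]
        for a, b in zip(peaks, peaks[1:]):
            if times[b] - times[a] <= window:
                ta, tb = times[a], times[b]
                va, vb = values[a], values[b]
                for i in range(a + 1, b):
                    out[i] = ((tb - times[i]) * va + (times[i] - ta) * vb) / (tb - ta)
        return list(zip(times, out))

    # Hypothetical 'open' pose curve with a valley between two nearby maxima.
    curve = [(0.0, 0.1), (0.1, 0.8), (0.2, 0.2), (0.3, 0.7), (0.4, 0.1)]
    print(smooth_valleys(curve, window=0.25))  # the dip at t = 0.2 is lifted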
3.2.2 Constraint Satisfaction

Since facial poses are activated based on parametric values, any two poses could be arbitrarily combined together to form a new face shape. However, certain poses, when activated simultaneously, produce unnatural results. For example, the face pose for the 'open' viseme should not be activated along with the pose for 'bmp'. Since one pose is an open mouth and the other is a closed mouth, combining them together may cause interference and cancel each other out. This can cause important information to be eliminated from the resulting facial motion. To fix this problem, we add constraints between face pose pairs to filter out undesired poses. A constraint threshold C(a, b) is defined for each pose pair (f_a, f_b). Priority values P(a) and P(b) are also defined for each face pose f_a and f_b to determine which pose should have its parametric value reduced during a conflict. The threshold C(a, b) determines whether the two poses f_a and f_b may interfere with each other, and it is activated when c_a(t) + c_b(t) > C(a, b). Once the constraint is activated, the parametric value from the lower priority curve is reduced to satisfy the constraint c'_a(t) + c'_b(t) <= C(a, b). Assuming P(a) > P(b), we define c'_a(t) = c_a(t) and c'_b(t) = c_b(t) - (c_a(t) + c_b(t) - C(a, b)). Figure 9 shows the resulting curves after constraint adjustments. This simple heuristic ensures that incompatible curves will not affect each other, and thus improves the quality and expression of the resulting animations. The method is not restricted to a standard set of face poses; it can be extended to work with any arbitrary, non-conventional set of face poses, since the animator can define suitable thresholds and priority values based on the given face pose set. In our example, we found that setting the constraints C(open, FV) = C(open, pbm) = C(open, ShCh) = 0.5, C(a, b) = 2.0 for all other face pose pairs, and priorities P(open) < P(W) < P(wide) < P(ShCh) < P(FV) < P(pbm) works well for the standard FaceFX face pose set.

3.2.3 Parametric Speed Limits

To produce natural lip syncing animation, a character should only move his lips at a reasonable speed. However, the animator may create a bigram curve with a high slope, and the time warping could also compress the curve to be excessively sharp. This can cause fast activation or deactivation of various static face poses or shapes, resulting in popping or similar artifacts. We cap the parametric speed dl(t)/dt of the piecewise linear curve l(t) by a value l_max. This prevents overly fast changes to any shape, and is intended to track the speed at which a person's face can change (for example, how quickly the mouth can be opened or closed). We have found that the value l_max = 15 provides reasonable results. Figure 10 shows the resulting curves after limiting the parametric speeds.

Once the speed limit phase has been completed, we transfer the curves to our animation system, which blends the facial poses or blendshapes according to their activation values dictated by the curves.

3.2.4 Editing

By constructing animation curves that are driven by a fixed set of shapes or poses, the animator constructs a data set that can be reused on other characters which use similar facial poses or blendshapes. Thus, during the offline phase, the animator creates character-independent animations. In addition, our method is well-suited for an animation pipeline since there are no black-box components whose results are difficult to interpret or modify. Specific segments of the animations can be directly mapped back to their originating phone bigrams and subsequently changed. In addition, character-specific animations can be added to the original set of phone bigram curves to adjust the motions per character. Thus, our method allows two types of adjustments: changing the static poses or blendshapes, as well as changing specific phone bigram animations on a per-character basis.
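The pose-pair constraint and speed-limit passes described in Sections 3.2.2 and 3.2.3 can be sketched as below (Python; illustrative only, not the authors' implementation). The reduction clamps the lower-priority curve so that the pair's summed activation stays within C(a, b), and the speed cap limits the per-step change of a sampled curve; the example values mirror the C(open, FV) = 0.5 setting and priority ordering given above, while the sample arrays are hypothetical.

    def enforce_pair_constraint(c_a, c_b, threshold, a_has_priority=True):
        # Reduce the lower-priority curve wherever the summed activation of two
        # conflicting poses exceeds the pair's constraint threshold C(a, b).
        # c_a and c_b are equal-length lists of samples at the same time steps.
        out_a, out_b = list(c_a), list(c_b)
        for i, (va, vb) in enumerate(zip(c_a, c_b)):
            excess = va + vb - threshold
            if excess > 0.0:
                if a_has_priority:
                    out_b[i] = max(vb - excess, 0.0)
                else:
                    out_a[i] = max(va - excess, 0.0)
        return out_a, out_b

    def limit_speed(samples, dt, l_max=15.0):
        # Cap the parametric speed |dl/dt| of a uniformly sampled curve at l_max.
        out = [samples[0]]
        for v in samples[1:]:
            delta = v - out[-1]
            max_step = l_max * dt
            delta = max(-max_step, min(max_step, delta))
            out.append(out[-1] + delta)
        return out

    # Example pair constraint from the text: C(open, FV) = 0.5, with P(FV) > P(open),
    # so the 'open' curve is the one reduced on conflict.
    fv_curve   = [0.0, 0.4, 0.5, 0.1]
    open_curve = [0.0, 0.6, 0.8, 0.3]
    fv_curve, open_curve = enforce_pair_constraint(fv_curve, open_curve, threshold=0.5)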


Figure 7: Curve visualization from our interactive tool. Second step: Overlapping curves are stitched together; parametric values of curves are compared in the overlapped regions and only the ones with the largest values are retained (shown in blue). The original stitched curves from the previous step are shown in green.

Figure 5: Runtime phase flow chart. A phoneme scheduler produces phonemes and their respective timings. Animations previously created by an artist are associated with each phone bigram animation. Each facial pose then stitches together those animations, and runs a smoothing process, speed and pose constraints. The final curves are animated by the system.

Figure 8: Curve visualization from our interactive tool. Third step: The stitched curves are smoothed by finding local maxima with a sliding window through the temporal domain. The valleys of the curves are removed by connecting two local maxima inside the sliding window. The smoothed curves are shown in yellow.

Figure 6: Curve visualization from our interactive tool. First step: Curves of the phone bigram animations from the 'open' viseme are shown and placed together on the timeline (shown in red). Phoneme boundaries are shown as vertical lines. The time axis is shown in the first row of the black boxes underneath the curves. Phonemes and their corresponding times are shown in the second and third rows.

4 STUDY

It can be difficult to evaluate the relative quality of various lip syncing algorithms against each other because of the difficulty in implementing each solution and the supporting data that is needed, such as character models, rigs and configuration parameters. Thus, most lip syncing algorithms are typically indirectly compared with each other, taking into account those differing aspects. However, we are able to compare our results directly with a popular commercial lip syncing solution, FaceFX. For our experiment, we use the same character, the same phoneme scheduler, and we construct phone bigram animations using the set of FaceFX static poses. When producing results from our algorithm, the FaceFX phonemes are then mapped to our Common Phoneme Set. Thus our algorithm differs from the results in FaceFX only in the interpretation and translation of phonemes into animation curves. Therefore we can directly compare the two methods.

4.1 Method: Comparison With Other Methods

We performed a study comparing the results from both algorithms. The study includes 4 characters: one cartoon character, one high quality (near photo-realistic) character, one medium quality female character and one medium quality male character, as shown in Figure 11. We captured 10 movie clip pairs for each character; each pair is one random utterance from the pool, generated using our method and using FaceFX. A total of 80 people participated in the study, with 20 observers assigned to each character's 10 movie clip pairs, so 200 choices are made for each character's lip animation performance. The positioning of the movie clips inside a pair is randomized.

The survey was done on Amazon Mechanical Turk [2], asking observers to choose the preferred clip between the paired clips as well as state the strength of their preference, using a scale from 1 (weak) to 5 (strong). Subjects were asked to choose based on the general performance of the lip animation accuracy and naturalness. Accuracy indicates that the mouth configuration is synchronized with the words heard from the audio. The higher the accuracy, the more you can understand what the character is saying by just reading the lip syncing, even without sound. Naturalness indicates how closely the mouth configuration resembles that of a human's during speech.

Figure 9: Curve visualization from our interactive tool. Fourth step: The curves are adjusted by enforcing the constraints between each face pose pair. Conflicting face pose curves are adjusted according to their priority values. The resulting curves after adjustments are shown in cyan.

Figure 11: The four characters used in our study: (top left) a cartoony character, (top right) a high quality character, (bottom left) a medium quality male character, (bottom right) a medium quality female character.

4.2 Study Results

All the results are shown in Figure 12. Our method is preferred over FaceFX with 47%, 34%, 23%, and 39% stronger preference respectively for the cartoony character, the high quality character, the medium quality male character, and the medium quality female character, and there is a significant difference using a chi-square goodness of fit test on preferences: χ²(2) = 44.14, p < .005; χ²(2) = 23.12, p < .005; χ²(2) = 10.58, p < .005; χ²(2) = 30.42, p < .005. When it comes to average strength of preferences, our method versus FaceFX is 4.03/3.70, 3.97/3.80, 3.54/3.60, and 3.62/3.7 respectively.

4.3 Study Discussion

As expected, the results show our method is favored more by the participants across all types of characters. Participants that preferred our technique preferred it on average at greater than medium strength, with a preference strength that is greatest in the case of the cartoon character (4.03) and the high quality character (3.97). Interestingly, participants that preferred FaceFX, though fewer in number, also preferred it on average at greater than medium strength. This suggests the need for a follow-on study where we tease apart what factors are leading to these judgments across techniques as well as across character types. Specifically in reference to our technology, there is a question of why the cartoony and high quality characters have stronger preferences than the medium quality ones. Is it just the quality of the character, or is there an interaction between character quality and the realization of the visemes?

Figure 10: Curve visualization from our interactive tool. Fifth step: The curves are further adjusted to satisfy the speed limits. The slopes of the curves are reduced if their parametric speeds are over a user-defined threshold. The final curves are shown in magenta.

Figure 12: Comparison between our method and FaceFX using cartoony, high quality, and medium quality characters. (Left) preference comparison; (right) average strength of preference comparison. We use Amazon Mechanical Turk to collect viewer ratings from 80 participants.

5 DISCUSSION

In this paper, we present a lip syncing method that can achieve high-quality results using only artist-driven data. We observe that only phoneme pairs are needed, and that higher-order phoneme sequences are not necessary for generating reasonably synchronized lip movements with audio. We also observe that artist-designed animation curves work well at run-time to synthesize high quality speech animations, and that machine learning is not necessary. A key advantage of our method is that it can be constructed from any set of static facial poses or shapes. This allows the construction of lip animation for non-human characters using non-standard face poses, as well as allowing the construction of character face rigs separately from the animation, since the facial setup needed can be determined in advance, and does not have to be explicitly retargeted.

We also demonstrate that such a method can be effective on a range of character styles, including cartoon-like and realistic looking models. High levels of facial realism are becoming increasingly popular, thus spurring the need for animation methods suitable for such levels of fidelity.

5.1 Data-Driven Methods

We observe that many previous methods attempt to solve the lip syncing problem by extracting co-articulation effects from captured data. The goal of these methods can be considered as generating a correct facial animation for each diphone or triphone combination, either through heuristics such as dominance functions or through machine learning methods. Therefore, on a higher level, they attempt to achieve a similar goal as our method: to fill out the phoneme pair or phoneme trio (diphones/phone bigrams, or triphones/phone trigrams) tables to account for co-articulations.
The limitation with data-driven methods is that there is no guarantee that all the required combinations exist or could be extracted effectively from the training data. Therefore the missing combinations directly affect the quality of the resulting animations. Moreover, there is no easy way to modify the learned model to account for missing combinations. The goal of our method is to provide a transparent framework for the animator to explicitly fill out the phoneme pair/diphone/phone bigram tables. Thus the animator has direct control over the quality of the resulting speech animations. Note that although we did not choose to solve the problem via data-driven methods, our framework can be extended to make use of captured facial animations. Diphone curves, instead of phone bigrams, can be extracted from captured animations by fitting the animation data with the canonical face poses.

5.2 Offline Data Generation

Our offline phase requires an animator to generate 441 short animations for the English language, which can be used by any character that utilizes the same speech targets used during the animation generation phase. Our animators were able to generate a single phone bigram animation in just a few minutes. Our supplemental video shows an example of generating such animations. Thus, if we conservatively estimate that each phone bigram animation takes 5 minutes to generate, an entire language set can be generated in approximately 441 * 5 = 2205 minutes, or roughly 37 hours. Thus, a new language data set can be generated by a single animator using standard tools in less than 1 week. This language data set can be used for all characters which utilize the same static facial poses. In addition, we provide the English language data set openly to the community (http://smartbody.ict.usc.edu) so that others may use it directly in their experiments or works, thus further lowering the barrier to implementation for this method.

In addition, since nearly 25% of the Common Phoneme Set phoneme pairs never appeared in our speech corpus, a mostly complete animation set in English can be generated in 75% of that time, or around 3 days. Short sentences can require around 30-40 unique phoneme pairs to be annotated, which typically take only a few hours to construct. Thus, our lip sync method can be tested on entirely new 3D characters that have distinctive facial poses or shapes in a relatively short amount of time.

5.3 Configurability

A key advantage of our method is the configurability of the facial setup for the lip syncing setup. Any set of facial poses can be used to construct a language set, which will be effective for all characters that use the same matching facial poses. Machine learning methods generally dictate the facial setup needed based on the learned data; for example, by generating shapes based on a statistical analysis such as PCA. Our method can adapt to any set of input data. This can be effective when the facial requirements have been dictated to the animator, such as when acquiring models from a 3D marketplace where the facial poses have already been established. This is also useful for non-human characters whose facial configurations either don't match a human's or contain a different set of poses that activate various non-human capabilities. With our method, it would be possible to construct a different phone bigram animation set to match any standard lip syncing configuration, thus allowing direct compatibility with other lip syncing setups.

We purchased a rigged crab model from an online marketplace [1] which contained several non-standard facial poses, such as 'bare teeth', 'frown', 'smile', 'mouth open', 'mouth OO' and various tongue positions, such as those seen in Figure 13.
We then constructed the phone bigrams necessary to animate the crab, since our method allows the generation of such animation for use in lip synching from any set of facial poses. Our supplementary video shows the results. Note that we are not performing any remodeling or rigging changes in order to create the lip sync data set.

In addition, phoneme pairs can be individually identified and replaced as needed on a per-character basis. These phoneme pairs are easy to identify, since their effect can be seen at the times corresponding to the phoneme activation. As mentioned above, this enables the partial construction of a lip sync animation set for testing purposes, without having to complete the entire set (for example, to preview the quality of the animation on a single sentence), since the parts of the lip sync that need to be constructed are readily identifiable from their phoneme schedule.

Recent work by Taylor et al [34] has also utilized hand-crafted animations to generate lip sync data. Note that our method requires only a small set of static facial poses, while their method requires the construction of 150 short animations per character. We believe that the quality of our results is comparable to theirs while requiring over an order of magnitude less data per character.

5.4 Multilanguage Efforts

Our method can operate on multiple languages by creating a new set of animations based on the phonemes for the language, or by reusing existing phone bigram animations. We were able to generate a Mandarin phone bigram animation set by mapping the phonemes associated with Mandarin pinyin [38] to our Common Phoneme Set in English, then adding one additional phoneme unique to Mandarin. The additional phoneme required adding 43 additional diphone animations, which were generated in less than one day. Thus our Mandarin lip synchronization could be generated in a short amount of time.
We expect that other languages could be added through similar means.

5.5 Direct Method Comparisons

We were able to compare our method directly with FaceFX using a nearly identical pipeline. We do not claim that our method yields superior results to all other methods, but rather that our results are comparable to many others that have much larger requirements. Other methods that attempt to quantify the quality of animation are limited to learning-based methods [27] or comparisons between the original data and the reconstructed data, which are not applicable to animator-driven methods. McGurk studies [14] can be performed to see how the McGurk effect on real humans compares to that of synthesized speech. Likewise, noise studies, where words are randomly removed from utterances and must be recovered by reading lips, can help judge the quality of the articulation. We believe that the best criterion for comparison is side-by-side comparison in subjective studies. We believe that our method is accessible, simple enough, and adaptable to be used for comparison in future research.

5.6 Limitations

Our method does not address the issue of emotional content during speech. Many recent research works have identified expressive speech as a more interesting problem than simple lip animation, and thus have avoided the problem of generating high quality lip animation on its own by pursuing high quality emotional facial performances, presuming that a high quality emotional facial performance would include a high quality lip sync result. In doing so, however, there has been no consensus among the research community regarding the best lip sync methods for 3D characters. Lip animation independent of facial expression remains an important component of many games and simulations, as well as for methods that use the upper face as a means to express emotional content.

We anticipate that the use of phone trigrams or triphones (or longer sequences of phonemes) would result in slightly better results, but would require too much data to be annotated by an animator. Triphones would require p^3 animations, where p is the number of phonemes in our Common Phoneme Set. Thus, manually annotating phone trigrams or triphones would make our method unwieldy. Also, slightly better results could be obtained by expanding the Common Phoneme Set, for example, by separating the 'b' and 'm' phonemes. Our selection of the Common Phoneme Set is based on general practices and expert knowledge within the animation community.

5.7 On Achieving Good Lip Synching

There are many factors that contribute towards high quality lip syncing, including the model rendering, the facial setup including rigging, and the input data. Proper phoneme alignment is critical to generating properly timed lip syncing, which is why text-to-speech algorithms or slow, even-toned utterances can generate good results, while fast talking, emotional speech, stutters, partial word expressions and other non-words can be difficult to map properly. Research has shown that the best phoneme alignment can be determined only within a margin of error by agreement among experts [21].

Figure 13: Some face poses from the cartoon crab acquired from an online marketplace. Our method is able to use any set of facial poses to construct a lip animation data set.

ACKNOWLEDGEMENTS

Thanks to Teresa Dey for her extensive artistic contributions in support of this work. Thanks also to Matt Liewer, Joe Yip and Oleg Alexander for artistic and technical support. Thanks to Jarvis McGee for his work on the curve editor. Thanks to Dominic Massaro for discussions about lip animation and dominance function methods.

REFERENCES

[1] Daz3d, content marketplace, http://www.daz3d.com, 2013.
[2] Amazon. Amazon Mechanical Turk, 2012.
[3] A. Black, P. Taylor, and R. Caley. The Festival system, 1998.
[4] A. W. Black and P. A. Taylor. The Festival Speech Synthesis System: System documentation. Technical Report HCRC/TR-83, Human Communication Research Centre, University of Edinburgh, Scotland, UK, 1997. Available at http://www.cstr.ed.ac.uk/projects/festival.html.
[5] M. Brand. Voice puppetry. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques, SIGGRAPH '99, pages 21-28, New York, NY, USA, 1999. ACM Press/Addison-Wesley Publishing Co.
[6] Y. Cao, P. Faloutsos, E. Kohler, and F. Pighin. Real-time speech motion synthesis from recorded motions. In Proceedings of the 2004 ACM SIGGRAPH/Eurographics symposium on Computer animation, pages 345-353. Eurographics Association, 2004.
[7] Y. Cao, P. Faloutsos, and F. Pighin. Unsupervised learning for speech motion editing. In Proceedings of the 2003 ACM SIGGRAPH/Eurographics symposium on Computer animation, pages 225-231. Eurographics Association, 2003.
[8] J. Cassell, C. Pelachaud, N. Badler, M. Steedman, B. Achorn, B. Douville, S. Prevost, and M. Stone. Animated conversation: Rule-based generation of facial expression, gesture and spoken intonation for multiple conversational agents. pages 413-420, 1994.
[9] Cereproc. Cerevoice text-to-speech synthesis, http://www.cereproc.com, 2012.
[10] J.-x. Chai, J. Xiao, and J. Hodgins. Vision-based control of 3d facial animation. In Proceedings of the 2003 ACM SIGGRAPH/Eurographics symposium on Computer animation, pages 193-206. Eurographics Association, 2003.

[11] M. M. Cohen and D. W. Massaro. Modeling coarticulation in synthetic visual speech. In Models and Techniques in Computer Animation, pages 139-156. Springer-Verlag, 1993.
[12] M. M. Cohen, D. W. Massaro, and R. Clark. Training a talking head. In Proceedings of the 4th IEEE International Conference on Multimodal Interfaces, ICMI '02, pages 499-, Washington, DC, USA, 2002. IEEE Computer Society.
[13] P. Cosi, E. Caldognetto, G. Perin, and C. Zmarich. Labial coarticulation modeling for realistic facial animation. In Multimodal Interfaces, 2002. Proceedings. Fourth IEEE International Conference on, pages 505-510, 2002.
[14] D. Cosker, S. Paddock, D. Marshall, P. L. Rosin, and S. Rushton. Toward perceptually realistic talking heads: Models, methods, and McGurk. ACM Transactions on Applied Perception (TAP), 2(3):270-285, 2005.
[15] Z. Deng, P.-Y. Chiang, P. Fox, and U. Neumann. Animating blendshape faces by cross-mapping motion capture data. In Proceedings of the 2006 symposium on Interactive 3D graphics and games, I3D '06, pages 43-48, New York, NY, USA, 2006. ACM.
[16] Z. Deng, J. Lewis, and U. Neumann. Synthesizing speech animation by learning compact speech co-articulation models. In Computer Graphics International 2005, pages 19-25, June 2005.
[17] Z. Deng, T.-Y. Kim, M. Bulut, and S. Narayanan. Expressive facial animation synthesis by learning speech co-articulation and expression spaces. IEEE Transactions on Visualization and Computer Graphics, 12, 2006.
[18] T. Ezzat and T. Poggio. Miketalk: A talking facial display based on morphing visemes. In Proceedings of the Computer Animation Conference, pages 96-102, 1998.
[19] FaceFX. FaceFX software, http://www.facefx.com/, 2012.
[20] FaceFX. FaceFX speech targets, http://www.facefx.com/documentation/2010/w76, 2012.
[21] J.-P. Hosom. Speaker-independent phoneme alignment using transition-dependent states. Speech Communication, 51(4):352-368, 2009.
[22] X. Huang, A. Acero, H. Hon, Y. Ju, J. Liu, S. Meredith, and M. Plumpe. Recent improvements on Microsoft's trainable text-to-speech system - Whistler. In Acoustics, Speech, and Signal Processing, 1997. ICASSP-97., 1997 IEEE International Conference on, volume 2, pages 959-962, 1997.
[23] D. Huggins-Daines, M. Kumar, A. Chan, A. Black, M. Ravishankar, and A. Rudnicky. Pocketsphinx: A free, real-time continuous speech recognition system for hand-held devices. In Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on, volume 1, pages I-I. IEEE, 2006.
[24] S. A. King and R. E. Parent. Creating speech-synchronized animation. IEEE Transactions on Visualization and Computer Graphics, 11(3):341-352, May 2005.
[25] J. P. Lewis, J. Mooser, Z. Deng, and U. Neumann. Reducing blendshape interference by selected motion attenuation. In Proceedings of the 2005 symposium on Interactive 3D graphics and games, I3D '05, pages 25-29, New York, NY, USA, 2005. ACM.
[26] J. Ma, R. Cole, B. L. Pellom, W. Ward, and B. Wise. Accurate visible speech synthesis based on concatenating variable length motion capture data. IEEE Transactions on Visualization and Computer Graphics, 12:266-276, 2006.
[27] X. Ma and Z. Deng. A statistical quality model for data-driven speech animation. IEEE Transactions on Visualization and Computer Graphics, 18(11):1915-1927, 2012.
[28] D. W. Massaro, M. M. Cohen, M. Tabain, J. Beskow, and R. Clark. Animated speech: research progress and applications. pages 309-345, 2012.
[29] W. Mattheyses, L. Latacz, and W. Verhelst. Comprehensive many-to-many phoneme-to-viseme mapping and its application for concatenative visual speech synthesis. Speech Communication, 2013.
[30] Microsoft. Microsoft English phonemes, http://msdn.microsoft.com/en-us/library/ms717239, 2012.
[31] F. I. Parke and K. Waters. Computer Facial Animation. A K Peters Ltd, second edition, 2008.
[32] A. Shapiro. Building a character animation system. Motion in Games, pages 98-109, 2011.
[33] S. Sutton, R. Cole, J. D. Villiers, J. Schalkwyk, P. Vermeulen, M. Macon, Y. Yan, E. Kaiser, B. Rundle, K. Shobaki, P. Hosom, A. Kain, J. Wouters, D. Massaro, and M. Cohen. Universal speech tools: The CSLU toolkit. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), pages 3221-3224, 1998.
[34] S. L. Taylor, M. Mahler, B.-J. Theobald, and I. Matthews. Dynamic units of visual speech. In Proceedings of the 2012 ACM SIGGRAPH/Eurographics symposium on Computer animation, July 2012.
[35] A. Wang, M. Emmi, and P. Faloutsos. Assembling an expressive facial animation system. In Proceedings of the 2007 ACM SIGGRAPH symposium on Video games, pages 21-26. ACM, 2007.
[36] K. Waters and D. Terzopoulos. Modeling and animating faces using scanned data. The Journal of Visualization and Computer Animation, 2(4):123-128, 1991.
[37] J. Wells et al. SAMPA computer readable phonetic alphabet. Handbook of standards and resources for spoken language systems, 4, 1997.
[38] P. H. Zein. Mandarin Chinese phonetics, http://www.zein.se/patrick/chinen8p.html, 2012.