A Practical and Configurable Lip Sync Method for Games

Yuyu Xu ∗ Andrew W. Feng† Stacy Marsella‡ Ari Shapiro§

USC Institute for Creative Technologies

Figure 1: Accurate lip synchronization results for multiple characters can be generated using the same set of phone bigram blend curves. Our method uses artist-driven data to produce high quality lip synchronization in multiple languages. Our method is well-suited for game pipelines, since it uses static facial poses or blendshapes and can be directly edited by an animator, and modified as needed on a per-character basis.

ABSTRACT

We demonstrate a lip animation (lip sync) algorithm for real-time applications that can be used to generate synchronized facial movements with audio generated from natural speech or a text-to-speech engine. Our method requires an animator to construct animations using a canonical set of visemes for all pairwise combinations of a reduced phoneme set (phone bigrams). These animations are then stitched together to construct the final animation, adding velocity and lip-pose constraints. This method can be applied to any character that uses the same, small set of visemes. Our method can operate efficiently in multiple languages by reusing phone bigram animations that are shared among languages, and specific word sounds can be identified and changed on a per-character basis. Our method uses no machine learning, which offers two advantages over techniques that do: 1) data can be generated for non-human characters whose faces can not be easily retargeted from a human speaker's face, and 2) the specific facial poses or shapes used for animation can be specified during the setup and rigging stage, and before the lip animation stage, thus making it suitable for game pipelines or circumstances where the speech target poses are predetermined, such as after acquisition from an online 3D marketplace.

Index Terms: I.3.2 [Computing Methodologies]: Computer Graphics—Methodology and Techniques; I.3.7 [Computing Methodologies]: Computer Graphics—Three-Dimensional Graphics and Realism; I.3.8 [Computing Methodologies]: Computer Graphics—Applications

1 MOTIVATION

Synchronizing the lip and mouth movements naturally along with animation is an important part of convincing 3D character performance. In this paper, we present a simple, portable and editable lip-synchronization method that works for multiple languages, requires no learning, can be constructed by a skilled animator, is effective for real-time simulations such as games, and can be personalized for each character. Our method associates animation curves designed by an animator on a fixed set of static facial poses with sequential pairs of phonemes (phone bigrams), and then stitches these animations together to create a set of curves for the facial poses, along with constraints that ensure that key poses are properly played. Diphone- and triphone-based methods have been explored in various previous works, often requiring machine learning. However, our experiments have shown that animating phoneme pairs (such as phone bigrams or diphones), as opposed to phoneme triples or longer sequences of phonemes, is sufficient for many types of animated characters. Also, our experiments have shown that skilled animators can sufficiently generate the data needed for good quality results. Thus our algorithm does not need any specific rules about coarticulation, such as dominance functions or language rules. Such rules are implicit within the artist-produced data. In order to produce a tractable set of data, our method reduces the full set of 40 English phonemes to a smaller set of 21, which are then annotated by an animator. Once the full animation set has been generated, it can be reused for multiple characters. Each additional character requires a small set of static poses or blendshapes that match the original pose set.
Lip sync data needed for a new language requires a new set of phone bigrams for each language, although similar phonemes among languages can share the same phone bigram curves. We show how to reuse our English phone bigram animation set to adapt to a Mandarin set. Our method works with both natural speech and text-to-speech engines.

Our artist-driven approach is useful for lip syncing non-human characters whose lip or face configuration doesn't match a human's, which would make it difficult to retarget. Animators can generate a lip animation data set for any character or class of characters, regardless of the facial setup specified. We demonstrate such an example on an animated, talking crab that we acquired from an online 3D content source, and whose facial configuration does not match any existing set of lip sync poses. In addition, our method can be used with any set of facial poses or blend shapes as input, thus making it suitable for game pipelines where the characters must adhere to specific facial configurations. This differs from many machine learning algorithms where the retargeting efforts must be done once the learned data is generated, or where specific facial poses are output from the machine learning process, but differ according to the input data. By allowing the setup of the character to be determined in advance of generating the lip animation data, our system can be adjusted to be compatible with any existing lip animation setup, thus making it ideally suitable for game pipelines. We demonstrate this capability by using the same facial poses as a commercial lip animation tool, FaceFX [19], whose results we can then directly compare against. In addition, control over our method can be achieved by reanimating specific phoneme pairs (such as L-Oh, as in the word 'low').
Thus the various parts of the algorithm are both intuitive and controllable.

It is difficult to compare the quality of the results between one lip syncing method and another, since the models, data sets, and characteristics are often different between methods. With some exceptions [27], few research methods attempt to compare their own results to other established research methods or commercial results. There are many reasons for this, including the difficulty in exactly reproducing other methods and the lack of availability of data sets. Consequently, comparisons are made to simpler methods [34]. By contrast, we present a direct comparison with a popular commercial lip syncing engine, FaceFX [19], using identical models and face shapes. We thus maintain that our method generally produces good results with very low requirements, while simultaneously being controllable and editable.

2 RELATED WORK

There are many facial animation techniques, including 2D image-based methods and 3D geometry-based methods. A background on many of these techniques can be found in the survey by Parke and Waters [31]. However, the focus of this paper is on lip syncing to audio, which we summarize below.

2.1 Visual Speech Animation

The synthesis of realistic visual speech animations corresponding to novel text or prerecorded acoustic speech input has been a major research problem for decades. A common approach is to map one or more individual phonemes to a corresponding viseme and generate the animation by interpolating the visemes given phoneme sequences [18], as demonstrated in a system using blendshapes [35]. However, naively interpolating viseme sequences tends to generate poor results since it does not consider co-articulation.
Visual speech co-articulation is a phenomenon in which the current viseme shape is not an independent face shape, but rather is affected by adjacent phonemes. The early work by Cohen et al introduced dominance functions [11, 28] as a parameterization method to deal with co-articulation. Their pioneering work is followed by more research works aimed at improving the dominance function model [13, 24, 12]. In their work each parameter curve controls a time-varying deformation over a small region on the face model. Although this parameterization method is intuitive to use, it is difficult to generalize across different faces since a set of parameters is bound to a certain facial topology. Therefore a lot of manual effort is required for each new face. Our method parameterizes the face animation as a set of blend curves using a canonical set of viseme shapes. Therefore the same set of parameter curves can be applied to different characters, as long as they use the same canonical definition for the viseme shapes.

Many data-driven methods have been developed to learn co-articulation patterns from animation data. Ma et al learned variable length units based on a corpus [26]. Deng et al proposed a method that learns diphone and triphone models represented as weight curves from motion captured data [16]. Dynamic programming is applied to calculate the best co-articulation viseme sequences based on speech input. Although their method can produce natural speech animation with learned co-articulation, it requires large amounts of motion data for training, and the result is highly dependent on the training data. Cao et al transferred captured data to a model to synthesize speech motions [6]. The recent work by Taylor et al [34] also introduced a dynamic visemes method, which extracts and clusters visual gestures from video input of human subjects. To synthesize dynamic visemes from a phoneme sequence, it looks for the best mapping through the probability graph learned from visual gesture data and stitches the viseme sequences together. Their method produces good results with co-articulation effects, but requires significant animator effort to set up around 150 viseme animations for each rigged character's face. By contrast, our method requires only a small set of static facial poses for each character.

Blendshapes are a widely used technique due to their simplicity, where 3D shapes are interpolated and blended together in customized ways. One of the drawbacks when using blendshapes is the difficulty in defining a set of face shapes that are orthogonal to each other. This in turn causes the influence from one face shape to degrade other shapes. [25, 15] provided methods to avoid such interference. Human performance also provides a way to produce high quality and realistic facial animations. Waters and coworkers [36] applied a radial laser scanner to capture facial geometry from the subject and used the scanned markers to drive the underlying muscle system. Chai et al [10] used optical flow tracking to capture the performance from a subject to drive facial animations.

2.2 Expressive Facial Animation

Expressive facial animations, including eye gaze and head movements, greatly enhance the realism of a virtual character. Cassell et al developed a rule-based automatic system [8]. Brand et al [5] constructed a facial internal state machine driven by voice input using a Hidden Markov Model (HMM). Deng et al extended the work on the co-articulation model [16] and added expressive facial motion learned from motion data to wrap it along with the speech motion [17]. Cao et al used learning techniques to generate separate emotion components from speech [7]. More recent research has used text-to-speech (TTS) to drive expressive speech [29].

2.3 Software

FaceFX is lip syncing software that has been widely adopted in many video games and simulations. It takes speech audio as input and generates a set of blending curves. It then additively combines a set of simple component animations based on the blending curves to produce facial animations. In this paper we will compare our result with its result based on lip-syncing accuracy and naturalness. In addition to prerecorded audio files, we also make use of text-to-speech (TTS) engines such as Microsoft TTS [22], Festival [4], and Cereproc [9] to produce mid-quality speech audio and phoneme timing on the fly.

Our method differs from many of the existing machine learning methods in that the data can be constructed and modified by a skilled animator, and therefore does not require any performance capture or speech corpus. In addition, each phone bigram generated by the animator has local effects and can be edited or changed individually without affecting other parts of the system. Such a process can be repeated for new characters, and thus the fact that our system can be changed and modified easily represents a strong advantage over machine learning methods, which cannot learn and change the results for individual sounds per character as our system can.

3 METHOD

Our method involves two phases: an offline phase where an animator constructs animation curves associated with all pairwise combinations of phonemes, and a runtime phase that produces the speech animations by stitching, smoothing and constraint satisfaction based on the input timings of the phonemes.

3.1 Offline Phase

Our goal is to construct a set of animations that are associated with pairs of phonemes. We choose phoneme pairs as our canonical unit of animation, since pairs of phonemes allow for coarticulation effects that are not possible by associating animation with individual phonemes. Diphones are commonly used in machine learning techniques, and represent timings between the middle of one phoneme and the middle of the next. Animators construct phone bigrams, which represent the timings from the start of one phoneme to the end of the following one. We choose phone bigrams, and not diphones, since an animation of a phone bigram is intuitive, whereas an animation of a diphone has no intuitive representation. Phone bigrams can be constructed by animating a short word with a single syllable that represents the two phonemes.
For example, when animating the phone bigram for L-Oh, an animator can create an animation that represents the word 'Low'. By contrast, there is no intuitive representation of the mouth movements for a diphone, which would require the animator to start the motion from the middle of the 'L' sound to the middle of the 'O' sound. By using phone bigrams, animators are able to create a better quality set of animations.

We also choose phoneme pairs, rather than combinations of three phonemes or higher order sequences, in order to reduce the amount of data that needs to be produced by an animator. We expect that better results could be obtained by animating or learning combinations of three phonemes or even longer sequences. However, our goal is to identify a tractable set of data that can be quickly generated so that high quality lip syncing can be produced easily.

3.1.1 Phoneme Selection

Phonemes differ from each other according to their sounds. However, similar-looking facial movements can produce multiple phonemes. For example, a person will make a similar-looking face of folding their lips together when producing the 'b' and 'p' sounds. Thus, in order to further reduce the amount of data necessary for our method, we map the 40 English phonemes [37] into a smaller phoneme set, which we call the Common Phoneme Set, in turn greatly reducing the number of pairwise phonemes that must be produced during the offline phase. Table 1 shows the mapping to our common set of phonemes.

The mapping of phonemes (for example, mapping all 'b' and 'm' sounds to the common 'bmp' phoneme) somewhat reduces the subtleties that result from the differences between those sounds. However, our method allows for the inclusion of additional detail as needed by remapping the phonemes. Thus, the 'b' and 'm' phonemes could be separated, resulting in an additional number of phone bigrams to animate.

English phonemes | Common Phoneme Set | Examples
ae, ah, ax | ah | cat, cut, ago
aa | aa | father
ao | ao | dog
ey, eh | eh | ate, pet
er | er | fur
ih, iy | ih | feel, fill, debit
w, uw, uh | w | with, too, book
ow | ow | go
aw | aw | foul
oy | oy | toy
ay | ay | bite
h | h | help
r | r | red
l | l | lid
s, z | z | sit, zap
sh, ch, jh, zh, y | sh | she, chin, joy, pleasure, yard
th, dh | th | thin, then
f, v | f | fork, vat
d, t, n, ng | d | dig, talk, no, sing
k, g | kg | cut, gut
p, b, m | bmp | put, big, mat

Table 1: Forty English phonemes mapped to our common set of phonemes. The left column lists the full set of English phonemes, the right column lists all the canonical visemes, and the second right column lists all the pairwise phoneme combinations for the given phoneme schedules.

A language with n phonemes uses n^2 pairwise phoneme combinations. Thus, English has 40 phonemes and 40^2 = 1600 pairwise phoneme combinations. By contrast, our reduced Common Phoneme Set of 21 phonemes contains 21^2 = 441 pairwise phoneme combinations. Thus, animators only need to generate approximately 25% of the animations that would ordinarily be constructed if the entire English phoneme set were used.

Using this reduced phoneme set, we generated phone bigrams from a corpus of approximately 200 utterances of varying length, from a single word to utterances composed of several sentences. Table 2 shows the frequencies of the Common Phoneme Set pairwise phoneme combinations when generated from the Microsoft TTS Engine.

Note that the d Common Phoneme Set phoneme appears frequently, since it encompasses four phonemes: d, t, n, ng. In addition, 104 of the 441 Common Set phone bigrams never appeared, and thus nearly 99.1% of the phone bigrams could be generated from the first 283 most common phone bigrams. The phone bigram distribution is shown in Figure 2. It is possible that different phoneme sequencers would generate slightly different results based on their own analysis of words and their mapping to the phonemes. However, we would expect them to be mostly aligned with our results, since to do otherwise would indicate a lack of synchronization with the SAMPA [37] phoneme set.
Rank | Common Set Phoneme Pair | Percentage
1 | ah - d | 4.62
2 | d - ih | 3.44
3 | ih - d | 2.68
4 | d - d | 2.64
5 | d - ah | 2.40
6 | eh - d | 1.83
7 | d - w | 1.66
8 | ah - r | 1.58
9 | ih - w | 1.50
10 | d - z | 1.43
11 | ih - z | 1.37
12 | ih - kg | 1.32
13 | aa - d | 1.27
14 | z - ih | 1.22
15 | r - ih | 1.21
16 | th - ah | 1.19
17 | z - d | 1.16
18 | kg - ah | 1.15
19 | ah - l | 1.10
20 | z - ah | 1.04
21 | ah - bmp | 0.98

Table 2: Frequency of Common Phoneme Set phoneme pairs, generated from a TTS engine on a corpus of approximately 200 utterances.
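To make the phoneme reduction and the bigram counting concrete, the following Python sketch (our own illustration, not the authors' implementation) encodes the Table 1 mapping as a dictionary, groups a phoneme sequence into phone bigrams, and tallies bigram frequencies over a corpus of phoneme sequences. The function names and the example input are hypothetical.

    from collections import Counter

    # Mapping from the full English phoneme set to the reduced Common Phoneme Set,
    # transcribed from Table 1 (40 phonemes map onto 21 common phonemes).
    COMMON_PHONEME_MAP = {
        "ae": "ah", "ah": "ah", "ax": "ah",
        "aa": "aa", "ao": "ao",
        "ey": "eh", "eh": "eh", "er": "er",
        "ih": "ih", "iy": "ih",
        "w": "w", "uw": "w", "uh": "w",
        "ow": "ow", "aw": "aw", "oy": "oy", "ay": "ay",
        "h": "h", "r": "r", "l": "l",
        "s": "z", "z": "z",
        "sh": "sh", "ch": "sh", "jh": "sh", "zh": "sh", "y": "sh",
        "th": "th", "dh": "th",
        "f": "f", "v": "f",
        "d": "d", "t": "d", "n": "d", "ng": "d",
        "k": "kg", "g": "kg",
        "p": "bmp", "b": "bmp", "m": "bmp",
    }

    def to_common_set(phonemes):
        # Map a phoneme sequence (e.g. from a TTS engine) onto the Common Phoneme Set.
        return [COMMON_PHONEME_MAP.get(p, p) for p in phonemes]

    def phone_bigrams(phonemes):
        # Group adjacent phonemes into pairs (phone bigrams) after the mapping.
        common = to_common_set(phonemes)
        return list(zip(common, common[1:]))

    def bigram_frequencies(corpus):
        # Count phone bigram frequencies over a corpus of phoneme sequences.
        counts = Counter()
        for utterance in corpus:
            counts.update(phone_bigrams(utterance))
        return counts

    # Hypothetical input: the phonemes of the word "fat" (f - ae - t).
    print(phone_bigrams(["f", "ae", "t"]))  # [('f', 'ah'), ('ah', 'd')]

Feeding a corpus of phoneme sequences, such as the roughly 200 utterances described above, through bigram_frequencies yields the kind of distribution summarized in Table 2.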

Figure 3: Curves for the F-Ah phone bigram. The animator selects three facial poses: FV, open and wide, and constructs animation curves over normalized time. The animation can be directly played on the character, or the character's lip animation could be driven using TTS or recorded audio.

Figure 2: Distribution of Common Phoneme Set phoneme pairs using a text-to-speech engine. The horizontal axis shows the number of pairwise phonemes. The vertical axis shows the percentage of all pairwise phonemes.

3.1.2 Facial Pose Selection and Phone Bigram Animation

Our method sequences a series of animations that are associated with our Common Phoneme Set phoneme pairs. Therefore, instead of having animators create animations from scratch, we choose to represent phone bigram animations as a set of blend curves using a canonical set of face poses. This not only greatly simplifies the task for animators, but also allows our animations to be portable to different characters. As long as a new character has the same set of canonical face poses, our phone bigram curves can be applied on that character to generate high quality lip syncing animations. To produce a Common Phoneme Set of phone bigram animations, animators construct facial movements by combining several static face poses over time using a curve editor or similar tool. The animation curves are normalized, and will be timewarped over the length of the phone bigrams.

In order to reduce the effort required from animators when creating blend curves, we developed a novel phone bigram curve editor that assists the user in quickly creating curves and evaluating the quality of the resulting facial animations. As shown in Figure 3, our editing tool [32] allows the animator to choose pairwise phoneme combinations and edit their corresponding blend curves. The animator specifies a piecewise linear curve l_j^k(t) for each face pose f_k and for each phoneme pair d_j. The linear curve is the control curve for a smooth B-spline curve c_j^k(t), which is used to adjust the influence of pose f_k for a specific phone bigram d_j. Here each pose f_k represents a displacement from the neutral pose, and therefore the weights sum_k c_j^k(t) do not necessarily sum to one. The editor can also generate a speech animation for a specific sentence automatically from a TTS engine or from a prerecorded audio file. The sentence is analyzed on the fly and transformed into the individual phone bigrams. We show an example of a set of curves associated with the F-Ah phone bigram (for example, when saying the beginning of the word 'fat'). After the animator updates the blend curves for a specific phone bigram, he can immediately evaluate its quality in a speech animation by typing words or sentences that contain the phoneme pair. This feedback can also help the animator to quickly identify the subset of phone bigrams that need improvement. By testing various words and sentences, problematic results can be effectively found, providing information for the animator to fill in missing phone bigrams or improve existing ones.

Since we rely on blend curves to produce character-independent phone bigram animations, it is crucial to choose a good set of canonical face poses. In our method, we choose a set of facial poses that allows the generation of nearly any facial expression through various combinations of the facial poses. For compatibility with existing pipelines and for purposes of comparison, we chose the same facial poses that are used by the FaceFX software [20], detailed in Figure 4. These shapes include 5 face shapes and 3 tongue shapes. We expect that any canonical set of facial poses could be used, since the animator decides how and when to activate the various components. The facial poses need to be able to model sophisticated facial movements such as pursing the lips and tongue movements.

Figure 4: From first to last: fv, open, pbm, shch, w, wide. In addition, three tongue positions (up, down and back) are used, as well as a neutral pose.

The animators construct curves in normalized time. The phone bigram animation will be used by the realtime algorithm by timewarping the animation curve over the length of each phone bigram, as detailed in the section below. Each language, such as English, requires its own set of animations, although other language sets can reuse the same phone bigram animations, since many phonemes are shared between languages.
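The blend-curve representation described above can be sketched as follows (Python; a minimal illustration with our own naming, not the authors' implementation). Each phone bigram stores per-pose keyframes in normalized time; the paper smooths the piecewise linear control curve with a B-spline, for which plain linear interpolation stands in here, and the keyframe values below are hypothetical.

    import bisect

    class PhoneBigramAnimation:
        # Blend curves for one phone bigram, keyed over normalized time [0, 1].
        # `curves` maps a canonical face pose name (e.g. 'open', 'FV', 'wide')
        # to a list of (normalized_time, weight) keyframes authored by an animator.
        def __init__(self, curves):
            self.curves = {pose: sorted(keys) for pose, keys in curves.items()}

        def evaluate(self, pose, u):
            # Piecewise-linear evaluation at normalized time u in [0, 1].
            keys = self.curves.get(pose, [])
            if not keys:
                return 0.0
            times = [t for t, _ in keys]
            i = bisect.bisect_right(times, u)
            if i == 0:
                return keys[0][1]
            if i == len(keys):
                return keys[-1][1]
            (t0, v0), (t1, v1) = keys[i - 1], keys[i]
            return v0 + (v1 - v0) * (u - t0) / (t1 - t0)

        def warped(self, pose, t, t_start, t_end):
            # Evaluate the curve time-warped onto the actual span [t_start, t_end].
            u = (t - t_start) / (t_end - t_start)
            return self.evaluate(pose, min(max(u, 0.0), 1.0))

    # Hypothetical curves for the F-Ah bigram over the FV, open and wide poses.
    f_ah = PhoneBigramAnimation({
        "FV":   [(0.0, 0.0), (0.2, 0.9), (0.5, 0.2), (1.0, 0.0)],
        "open": [(0.0, 0.0), (0.5, 0.7), (1.0, 0.1)],
        "wide": [(0.0, 0.0), (0.6, 0.3), (1.0, 0.0)],
    })
    print(f_ah.warped("open", 0.25, t_start=0.1, t_end=0.4))  # 'open' weight at t = 0.25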
3.2 Runtime Phase

Figure 5 shows a flow chart for our runtime system. The runtime component operates on a sequence of phonemes generated from a phoneme scheduler. The phoneme schedule can be extracted directly from a TTS engine, extracted offline from recorded audio, or online via various phoneme translator tools. We have tested our TTS path using the Microsoft [30], Cereproc TTS [9], and Festival [3] TTS engines. Phonemes from recorded audio can be extracted using various commercial [19] or noncommercial tools [23, 33]. The details of such phoneme scheduling are outside of the scope of this work. Our system expects a sequence of English phonemes and timings for each phoneme.

3.2.1 Curve Stitching and Smoothing

Our method then groups the sequence of phonemes into phoneme pairs, and maps those to the Common Phoneme Set. As shown in Figure 6, assume the input phoneme schedule p_0, p_1, ..., p_n occurs at times t_0, t_1, ..., t_n. Our method first constructs a sequence of phoneme pairs consisting of adjacent pairs of phonemes (p_0, p_1), (p_1, p_2), ..., (p_{n-2}, p_{n-1}) and their corresponding time spans (t_0, t_2), (t_1, t_3), ..., (t_{n-2}, t_n).

Each phoneme pair (p_i, p_{i+1}) is also associated with a set of phone bigram curves c_i^k(t), t_i <= t <= t_{i+2}, for each canonical face pose f_k. We stretch or compress the normalized curves according to the time span of the phoneme pair to generate c_i^k(t). We then stitch the overlapping curves for the same face pose f_k over adjacent time spans to produce one single continuous blend curve c^k(t), t_0 <= t <= t_n. The stitching process eliminates overlapping areas between adjacent curves c_i^k and c_{i+1}^k by retaining the curve with the largest value, as shown in Figure 7. For example, c^k(t) = max(c_i^k(t), c_{i+1}^k(t)) for t_{i+1} <= t <= t_{i+2}. We choose the maximum value over linear blending or averaging in the overlapping area so that the original information is preserved in the resulting curves that activate the face poses. Applying other blending methods may damp the curves and produce undesired lip movements during transitions.

We also perform a smoothing pass over each curve using a user-specified window to scan through the temporal domain and find local maxima. Figure 8 shows the resulting curves after the smoothing process. Here we define a sliding window, with its size 2t_w defined by the user. The smoothing window slides over the curve c^k(t) from t_0 to t_n and detects the local maxima of c^k(t). If at any time instant t there are two or more local maxima c^k(t_a) and c^k(t_b) inside the window from t - t_w to t + t_w, we smooth out the curve values between t_a and t_b by interpolating the two maxima c^k(t_a) and c^k(t_b) with a spline curve. Specifically, since c^k(t_a) and c^k(t_b) are the values of spline curves associated with piecewise linear curves whose values are l^k(t_a) and l^k(t_b) at times t_a and t_b, the new spline curve can be obtained by linearly interpolating l^k(t_a) and l^k(t_b) such that l^k(t) = ((t_b - t) l^k(t_a) + (t - t_a) l^k(t_b)) / (t_b - t_a) from time t_a to t_b. This new linear curve thus defines updated values for c^k(t) from t_a to t_b. Intuitively, this process smoothes out the "valley" between two maxima and therefore helps reduce high frequency lip movements that are not natural for a speech animation.
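A minimal sketch of the stitching and smoothing passes just described, operating on curves sampled as (time, value) pairs. The max-based merge and the valley interpolation follow the formulas above, while the function names and sample data are our own and the sketch assumes the overlapping curves are sampled at matching time steps.

    def stitch_max(curve_a, curve_b):
        # Stitch two overlapping sampled curves for the same face pose by
        # keeping the larger value wherever both curves contribute.
        merged = {}
        for t, v in curve_a + curve_b:
            merged[t] = max(v, merged.get(t, 0.0))
        return sorted(merged.items())

    def smooth_valleys(samples, window):
        # Sliding-window smoothing: if two local maxima fall within `window`
        # (= 2 * t_w) of each other, linearly interpolate across the valley.
        times = [t for t, _ in samples]
        values = [v for _, v in samples]
        peaks = [i for i in range(1, len(values) - 1)
                 if values[i] >= values[i - 1] and values[i] >= values[i + 1]]
        out = values[:]
        for a, b in zip(peaks, peaks[1:]):
            if times[b] - times[a] <= window:
                ta, tb = times[a], times[b]
                va, vb = values[a], values[b]
                for i in range(a + 1, b):
                    out[i] = ((tb - times[i]) * va + (times[i] - ta) * vb) / (tb - ta)
        return list(zip(times, out))

    # Hypothetical 'open' pose curve with a valley between two nearby maxima.
    curve = [(0.0, 0.1), (0.1, 0.8), (0.2, 0.2), (0.3, 0.7), (0.4, 0.1)]
    print(smooth_valleys(curve, window=0.25))  # the dip at t = 0.2 is lifted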
3.2.2 Constraint Satisfaction

Since facial poses are activated based on parametric values, any two poses could be arbitrarily combined together to form a new face shape. However, certain poses, when activated simultaneously, produce unnatural results. For example, the face pose for the 'open' viseme should not be activated along with the pose for 'bmp'. Since one pose is an open mouth and the other is a closed mouth, combining them together may cause interference and cancel each other out. This can cause important information to be eliminated from the resulting facial motion. To fix this problem, we add constraints between face pose pairs to filter out undesired poses. A constraint threshold C(a, b) is defined for each pose pair (f_a, f_b). Priority values P(a) and P(b) are also defined for each face pose f_a and f_b to determine which pose should have its parametric value reduced during a conflict. The threshold C(a, b) determines whether the two poses f_a and f_b may interfere with each other, and it is activated when c_a(t) + c_b(t) > C(a, b). Once the constraint is activated, the parametric value from the lower priority curve is reduced to satisfy the constraint c'_a(t) + c'_b(t) <= C(a, b). Assuming P(a) > P(b), we define c'_a(t) = c_a(t) and c'_b(t) = c_b(t) - (c_a(t) + c_b(t) - C(a, b)). Figure 9 shows the resulting curves after constraint adjustments. This simple heuristic ensures that incompatible curves will not affect each other, and thus improves the quality and expression of the resulting animations. The method is not restricted to a standard set of face poses; it can be extended to work with any arbitrary, non-conventional set of face poses, since the animator can define suitable thresholds and priority values based on the given face pose set. In our example, we found that setting the constraints C(open, FV) = C(open, pbm) = C(open, ShCh) = 0.5, C(a, b) = 2.0 for all other face pose pairs, and priorities P(open) < P(W) < P(wide) < P(ShCh) < P(FV) < P(pbm) works well for the standard FaceFX face pose set.

3.2.3 Parametric Speed Limits

To produce natural lip syncing animation, a character should only move his lips at a reasonable speed. However, the animator may create a bigram curve with a high slope, and the time warping could also compress the curve to be excessively sharp. This can cause fast activation or deactivation of various static face poses or shapes, resulting in popping or similar artifacts. We cap the parametric speed dl(t)/dt of the piecewise linear curve l(t) by a value l_max. This prevents overly fast changes to any shape, and is intended to track the speed at which a person's face can change (for example, how quickly the mouth can be opened or closed). We have found that the value l_max = 15 provides reasonable results. Figure 10 shows the resulting curves after limiting the parametric speeds.

Once the speed limit phase has been completed, we transfer the curves to our animation system, which blends the facial poses or blendshapes according to their activation values dictated by the curves.

3.2.4 Editing

By constructing animation curves that are driven by a fixed set of shapes or poses, the animator constructs a data set that can be reused on other characters which use similar facial poses or blendshapes. Thus, during the offline phase, the animator creates character-independent animations. In addition, our method is well-suited for an animation pipeline since there are no black-box components whose results are difficult to interpret or modify. Specific segments of the animations can be directly mapped back to their originating phone bigrams and subsequently changed. In addition, character-specific animations can be added to the original set of phone bigram curves to adjust the motions per character. Thus, our method allows two types of adjustments: changing the static poses or blendshapes, as well as changing specific phone bigram animations on a per-character basis.
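The pose-pair constraint and speed-limit passes described in Sections 3.2.2 and 3.2.3 can be sketched as below (Python; illustrative only, not the authors' implementation). The reduction clamps the lower-priority curve so that the pair's summed activation stays within C(a, b), and the speed cap limits the per-step change of a sampled curve; the example values mirror the C(open, FV) = 0.5 setting and priority ordering given above, while the sample arrays are hypothetical.

    def enforce_pair_constraint(c_a, c_b, threshold, a_has_priority=True):
        # Reduce the lower-priority curve wherever the summed activation of two
        # conflicting poses exceeds the pair's constraint threshold C(a, b).
        # c_a and c_b are equal-length lists of samples at the same time steps.
        out_a, out_b = list(c_a), list(c_b)
        for i, (va, vb) in enumerate(zip(c_a, c_b)):
            excess = va + vb - threshold
            if excess > 0.0:
                if a_has_priority:
                    out_b[i] = max(vb - excess, 0.0)
                else:
                    out_a[i] = max(va - excess, 0.0)
        return out_a, out_b

    def limit_speed(samples, dt, l_max=15.0):
        # Cap the parametric speed |dl/dt| of a uniformly sampled curve at l_max.
        out = [samples[0]]
        for v in samples[1:]:
            delta = v - out[-1]
            max_step = l_max * dt
            delta = max(-max_step, min(max_step, delta))
            out.append(out[-1] + delta)
        return out

    # Example pair constraint from the text: C(open, FV) = 0.5, with P(FV) > P(open),
    # so the 'open' curve is the one reduced on conflict.
    fv_curve   = [0.0, 0.4, 0.5, 0.1]
    open_curve = [0.0, 0.6, 0.8, 0.3]
    fv_curve, open_curve = enforce_pair_constraint(fv_curve, open_curve, threshold=0.5)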


Figure 7: Curve visualization from our interactive tool. Second step: Overlapping curves are stitched together; parametric values of curves are compared in the overlapped regions and only the ones with the largest values are retained (shown in blue). The original stitched curves from the previous step are shown in green.

Figure 5: Runtime phase flow chart. A phoneme scheduler produces phonemes and their respective timings. Animations previously created by an artist are associated with each phone bigram animation. Each facial pose then stitches together those animations, and runs a smoothing process, speed and pose constraints. The final curves are animated by the system.

Figure 8: Curve visualization from our interactive tool. Third step: The stitched curves are smoothed by finding local maxima with a sliding window through the temporal domain. The valleys of the curves are removed by connecting two local maxima inside the sliding window. The smoothed curves are shown in yellow.

Figure 6: Curve visualization from our interactive tool. First step: Curves of the phone bigram animations from the 'open' viseme are shown and placed together on the timeline (shown in red). Phoneme boundaries are shown as vertical lines. The time axis is shown in the first row of the black boxes underneath the curves. Phonemes and their corresponding times are shown in the second and third rows.

4 STUDY

It can be difficult to evaluate the relative quality of various lip syncing algorithms against each other because of the difficulty in implementing each solution and the supporting data that is needed, such as character models, rigs and configuration parameters. Thus, most lip syncing algorithms are typically indirectly compared with each other, taking into account those differing aspects. However, we are able to compare our results directly with a popular commercial lip syncing solution, FaceFX. For our experiment, we use the same character, the same phoneme scheduler, and we construct phone bigram animations using the set of FaceFX static poses. When producing results from our algorithm, the FaceFX phonemes are then mapped to our Common Phoneme Set. Thus our algorithm differs from the results in FaceFX only in the interpretation and translation of phonemes into animation curves. Therefore we can directly compare the two methods.

4.1 Method: Comparison With Other Methods

We performed a study comparing the results from both algorithms. The study includes 4 characters: one cartoon character, one high quality (near photo-realistic) character, one medium quality female character and one medium quality male character, as shown in Figure 11. We captured 10 movie clip pairs for each character; each pair is one random utterance from the pool, generated using our method and using FaceFX. A total of 80 people participated in the study, with 20 observers assigned to each character's 10 movie clip pairs, so 200 choices are made for each character's lip animation performance. The positioning of the movie clips inside a pair is randomized.

The survey was done on Amazon Mechanical Turk [2], asking observers to choose the preferred clip between the paired clips as well as state the strength of their preference, using a scale from 1 (weak) to 5 (strong). Subjects were asked to choose based on the general performance of the lip animation accuracy and naturalness. Accuracy indicates that the mouth configuration is synchronized with the words heard from the audio. The higher the accuracy, the more you can understand what the character is saying by just reading the lip syncing, even without sound. Naturalness indicates how closely the mouth configuration resembles that of a human's during speech.

Figure 9: Curve visualization from our interactive tool. Fourth step: The curves are adjusted by enforcing the constraints between each face pose pair. Conflicting face pose curves are adjusted according to their priority values. The resulting curves after adjustments are shown in cyan.

Figure 11: The four characters used in our study: (top left) a cartoony character, (top right) a high quality character, (bottom left) a medium quality male character, (bottom right) a medium quality female character.

4.2 Study Results

All the results are shown in Figure 12. Our method is preferred over FaceFX with 47%, 34%, 23%, and 39% stronger preference respectively for the cartoony character, the high quality character, the medium quality male character, and the medium quality female character, and there is a significant difference using a chi-square goodness of fit test on preferences: χ²(2) = 44.14, p < .005; χ²(2) = 23.12, p < .005; χ²(2) = 10.58, p < .005; χ²(2) = 30.42, p < .005. When it comes to average strength of preferences, our method versus FaceFX is 4.03/3.70, 3.97/3.80, 3.54/3.60, and 3.62/3.7 respectively.

4.3 Study Discussion

As expected, the results show our method is favored more by the participants across all types of characters. Participants that preferred our technique preferred it on average at greater than medium strength, with a preference strength that is greatest in the case of the cartoon character (4.03) and the high quality character (3.97). Interestingly, participants that preferred FaceFX, though fewer in number, also preferred it on average at greater than medium strength. This suggests the need for a follow-on study where we tease apart what factors are leading to these judgments across techniques as well as across character types. Specifically in reference to our technology, there is a question of why the cartoony and high quality characters have stronger preferences than the medium quality ones. Is it just the quality of the character, or is there an interaction between character quality and the realization of the visemes?

Figure 10: Curve visualization from our interactive tool. Fifth step: The curves are further adjusted to satisfy the speed limits. The slopes of the curves are reduced if their parametric speeds are over a user-defined threshold. The final curves are shown in magenta.

Figure 12: Comparison between our method and FaceFX using cartoony, high quality, and medium quality characters. (Left) preference comparison; (right) average strength of preference comparison. We use Amazon Mechanical Turk to collect viewer ratings from 80 participants.

5 DISCUSSION

In this paper, we present a lip syncing method that can achieve high-quality results using only artist-driven data. We observe that only phoneme pairs are needed, and that higher-order phoneme sequences are not necessary for generating reasonably synchronized lip movements with audio. We also observe that artist-designed animation curves work well at run-time to synthesize high quality speech animations, and that machine learning is not necessary. A key advantage of our method is that it can be constructed from any set of static facial poses or shapes. This allows the construction of lip animation for non-human characters using non-standard face poses, as well as allowing the construction of character face rigs separately from the animation, since the facial setup needed can be determined in advance, and does not have to be explicitly retargeted.

We also demonstrate that such a method can be effective on a range of character styles, including cartoon-like and realistic looking models. High levels of facial realism are becoming increasingly popular, thus spurring the need for animation methods suitable for such levels of fidelity.

5.1 Data-Driven Methods

We observe that many previous methods attempt to solve the lip syncing problem by extracting co-articulation effects from captured data. The goal of these methods can be considered as generating a correct facial animation for each diphone or triphone combination, either through heuristics such as dominance functions or through machine learning methods. Therefore, on a higher level, they attempt to achieve a similar goal as our method: to fill out the phoneme pair or phoneme trio (diphones/phone bigrams, or triphones/phone trigrams) tables to account for co-articulations.
The limitation with data-driven methods is that there is no guarantee that all the required combinations exist or could be extracted effectively from the training data. Therefore the missing combinations directly affect the quality of the resulting animations. Moreover, there is no easy way to modify the learned model to account for missing combinations. The goal of our method is to provide a transparent framework for the animator to explicitly fill out the phoneme pair/diphone/phone bigram tables. Thus the animator has direct control over the quality of the resulting speech animations. Note that although we did not choose to solve the problem via data-driven methods, our framework can be extended to make use of captured facial animations. Diphone curves, instead of phone bigrams, can be extracted from captured animations by fitting the animation data with the canonical face poses.

5.2 Offline Data Generation

Our offline phase requires an animator to generate 441 short animations for the English language, which can be used by any character that utilizes the same speech targets used during the animation generation phase. Our animators were able to generate a single phone bigram animation in just a few minutes. Our supplemental video shows an example of generating such animations. Thus, if we conservatively estimate that each phone bigram animation takes 5 minutes to generate, an entire language set can be generated in approximately 441 * 5 = 2205 minutes, or roughly 37 hours. Thus, a new language data set can be generated by a single animator using standard tools in less than 1 week. This language data set can be used for all characters which utilize the same static facial poses. In addition, we provide the English language data set openly to the community (http://smartbody.ict.usc.edu) so that others may use it directly in their experiments or works, thus further lowering the barrier to implementation for this method.

In addition, since nearly 25% of the Common Phoneme Set phoneme pairs never appeared in our speech corpus, a mostly complete animation set in English can be generated in 75% of that time, or around 3 days. Short sentences can require around 30-40 unique phoneme pairs to be annotated, which typically take only a few hours to construct. Thus, our lip sync method can be tested on entirely new 3D characters that have distinctive facial poses or shapes in a relatively short amount of time.

5.3 Configurability

A key advantage of our method is the configurability of the facial setup for the lip syncing setup. Any set of facial poses can be used to construct a language set, which will be effective for all characters that use the same matching facial poses. Machine learning methods generally dictate the facial setup needed based on the learned data; for example, by generating shapes based on a statistical analysis such as PCA. Our method can adapt to any set of input data. This can be effective when the facial requirements have been dictated to the animator, such as when acquiring models from a 3D marketplace where the facial poses have already been established. This is also useful for non-human characters whose facial configurations either don't match a human's or contain a different set of poses that activate various non-human capabilities. With our method, it would be possible to construct a different phone bigram animation set to match any standard lip syncing configuration, thus allowing direct compatibility with other lip syncing setups.

We purchased a rigged crab model from an online marketplace [1] which contained several non-standard facial poses, such as 'bare teeth', 'frown', 'smile', 'mouth open', 'mouth OO' and various tongue positions, such as those seen in Figure 13.
We then constructed the phone bigrams necessary to animate the crab, since our method allows the generation of such animation for use in lip synching from any set of facial poses. Our supplementary video shows the results. Note that we are not performing any remodeling or rigging changes in order to create the lip sync data set.

In addition, phoneme pairs can be individually identified and replaced as needed on a per-character basis. These phoneme pairs are easy to identify, since their effect can be seen at the times corresponding to the phoneme activation. As mentioned above, this enables the partial construction of a lip sync animation set for testing purposes, without having to complete the entire set (for example, to preview the quality of the animation on a single sentence), since the parts of the lip sync that need to be constructed are readily identifiable from their phoneme schedule.

Recent work by Taylor et al [34] has also utilized hand-crafted animations to generate lip sync data. Note that our method requires only a small set of static facial poses, while their method requires the construction of 150 short animations per character. We believe that the quality of our results is comparable to theirs while requiring over an order of magnitude less data per character.

5.4 Multilanguage Efforts

Our method can operate on multiple languages by creating a new set of animations based on the phonemes for the language, or by reusing existing phone bigram animations. We were able to generate a Mandarin phone bigram animation set by mapping the phonemes associated with Mandarin pinyin [38] to our Common Phoneme Set in English, then adding one additional phoneme unique to Mandarin. The additional phoneme required adding 43 additional diphone animations, which were generated in less than one day. Thus our Mandarin lip synchronization could be generated in a short amount of time.
We expect that other languages could be added through similar means.

5.5 Direct Method Comparisons

We were able to compare our method directly with FaceFX using a nearly identical pipeline. We do not claim that our method yields superior results to all other methods, but rather that our results are comparable to many others that have much larger requirements. Other methods that attempt to quantify the quality of animation are limited to learning-based methods [27] or comparisons between the original data and the reconstructed data, which are not applicable to animator-driven methods. McGurk studies [14] can be performed to see how the McGurk effect on real humans compares to that of synthesized speech. Likewise, noise studies, where words are randomly removed from utterances and must be recovered by reading lips, can help judge the quality of the articulation. We believe that the best criterion for comparison is side-by-side comparison in subjective studies. We believe that our method is accessible, simple enough, and adaptable to be used for comparison in future research.

5.6 Limitations

Our method does not address the issue of emotional content during speech. Many recent research works have identified expressive speech as a more interesting problem than simple lip animation, and thus have avoided the problem of generating high quality lip animation on its own by pursuing high quality emotional facial performances, presuming that a high quality emotional facial performance would include a high quality lip sync result. In doing so, however, there has been no consensus among the research community regarding the best lip sync methods for 3D characters. Lip animation independent of facial expression remains an important component of many games and simulations, as well as for methods that use the upper face as a means to express emotional content.

We anticipate that the use of phone trigrams or triphones (or longer sequences of phonemes) would result in slightly better results, but would require too much data to be annotated by an animator. Triphones would require p^3 animations, where p is the number of phonemes in our Common Phoneme Set. Thus, manually annotating phone trigrams or triphones would make our method unwieldy. Also, slightly better results could be obtained by expanding the Common Phoneme Set, for example, by separating the 'b' and 'm' phonemes. Our selection of the Common Phoneme Set is based on general practices and expert knowledge within the animation community.

5.7 On Achieving Good Lip Synching

There are many factors that contribute towards high quality lip syncing, including the model rendering, the facial setup including rigging, and the input data. Proper phoneme alignment is critical to generating properly timed lip syncing, which is why text-to-speech algorithms or slow, even-toned utterances can generate good results, while fast talking, emotional speech, stutters, partial word expressions and other non-words can be difficult to map properly. Research has shown that the best phoneme alignment can be determined only within a margin of error by agreement among experts [21].

Figure 13: Some face poses from the cartoon crab acquired from an online marketplace. Our method is able to use any set of facial poses to construct a lip animation data set.

ACKNOWLEDGEMENTS

Thanks to Teresa Dey for her extensive artistic contributions in support of this work. Thanks also to Matt Liewer, Joe Yip and Oleg Alexander for artistic and technical support. Thanks to Jarvis McGee for his work on the curve editor. Thanks to Dominic Massaro for discussions about lip animation and dominance function methods.

REFERENCES

[1] Daz3d, content marketplace, http://www.daz3d.com, 2013.
[2] Amazon. Amazon Mechanical Turk, 2012.
[3] A. Black, P. Taylor, and R. Caley. The Festival system, 1998.
[4] A. W. Black and P. A. Taylor. The Festival Speech Synthesis System: System documentation. Technical Report HCRC/TR-83, Human Communication Research Centre, University of Edinburgh, Scotland, UK, 1997. Available at http://www.cstr.ed.ac.uk/projects/festival.html.
[5] M. Brand. Voice puppetry. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques, SIGGRAPH '99, pages 21-28, New York, NY, USA, 1999. ACM Press/Addison-Wesley Publishing Co.
[6] Y. Cao, P. Faloutsos, E. Kohler, and F. Pighin. Real-time speech motion synthesis from recorded motions. In Proceedings of the 2004 ACM SIGGRAPH/Eurographics symposium on Computer animation, pages 345-353. Eurographics Association, 2004.
[7] Y. Cao, P. Faloutsos, and F. Pighin. Unsupervised learning for speech motion editing. In Proceedings of the 2003 ACM SIGGRAPH/Eurographics symposium on Computer animation, pages 225-231. Eurographics Association, 2003.
[8] J. Cassell, C. Pelachaud, N. Badler, M. Steedman, B. Achorn, B. Douville, S. Prevost, and M. Stone. Animated conversation: Rule-based generation of facial expression, gesture and spoken intonation for multiple conversational agents. pages 413-420, 1994.
[9] Cereproc. Cerevoice text-to-speech synthesis, http://www.cereproc.com, 2012.
[10] J.-x. Chai, J. Xiao, and J. Hodgins. Vision-based control of 3d facial animation. In Proceedings of the 2003 ACM SIGGRAPH/Eurographics symposium on Computer animation, pages 193-206. Eurographics Association, 2003.

[11] M. M. Cohen and D. W. Massaro. Modeling coarticulation in synthetic visual speech. In Models and Techniques in Computer Animation, pages 139-156. Springer-Verlag, 1993.
[12] M. M. Cohen, D. W. Massaro, and R. Clark. Training a talking head. In Proceedings of the 4th IEEE International Conference on Multimodal Interfaces, ICMI '02, pages 499-, Washington, DC, USA, 2002. IEEE Computer Society.
[13] P. Cosi, E. Caldognetto, G. Perin, and C. Zmarich. Labial coarticulation modeling for realistic facial animation. In Multimodal Interfaces, 2002. Proceedings. Fourth IEEE International Conference on, pages 505-510, 2002.
[14] D. Cosker, S. Paddock, D. Marshall, P. L. Rosin, and S. Rushton. Toward perceptually realistic talking heads: Models, methods, and McGurk. ACM Transactions on Applied Perception (TAP), 2(3):270-285, 2005.
[15] Z. Deng, P.-Y. Chiang, P. Fox, and U. Neumann. Animating blendshape faces by cross-mapping motion capture data. In Proceedings of the 2006 symposium on Interactive 3D graphics and games, I3D '06, pages 43-48, New York, NY, USA, 2006. ACM.
[16] Z. Deng, J. Lewis, and U. Neumann. Synthesizing speech animation by learning compact speech co-articulation models. In Computer Graphics International 2005, pages 19-25, June 2005.
[17] Z. Deng, T.-Y. Kim, M. Bulut, and S. Narayanan. Expressive facial animation synthesis by learning speech co-articulation and expression spaces. IEEE Transactions on Visualization and Computer Graphics, 12, 2006.
[18] T. Ezzat and T. Poggio. Miketalk: A talking facial display based on morphing visemes. In Proceedings of the Computer Animation Conference, pages 96-102, 1998.
[19] FaceFX. FaceFX software, http://www.facefx.com/, 2012.
[20] FaceFX. FaceFX speech targets, http://www.facefx.com/documentation/2010/w76, 2012.
[21] J.-P. Hosom. Speaker-independent phoneme alignment using transition-dependent states. Speech Communication, 51(4):352-368, 2009.
[22] X. Huang, A. Acero, H. Hon, Y. Ju, J. Liu, S. Meredith, and M. Plumpe. Recent improvements on Microsoft's trainable text-to-speech system - Whistler. In Acoustics, Speech, and Signal Processing, 1997. ICASSP-97., 1997 IEEE International Conference on, volume 2, pages 959-962, 1997.
[23] D. Huggins-Daines, M. Kumar, A. Chan, A. Black, M. Ravishankar, and A. Rudnicky. Pocketsphinx: A free, real-time continuous speech recognition system for hand-held devices. In Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on, volume 1, pages I-I. IEEE, 2006.
[24] S. A. King and R. E. Parent. Creating speech-synchronized animation. IEEE Transactions on Visualization and Computer Graphics, 11(3):341-352, May 2005.
[25] J. P. Lewis, J. Mooser, Z. Deng, and U. Neumann. Reducing blendshape interference by selected motion attenuation. In Proceedings of the 2005 symposium on Interactive 3D graphics and games, I3D '05, pages 25-29, New York, NY, USA, 2005. ACM.
[26] J. Ma, R. Cole, B. L. Pellom, W. Ward, and B. Wise. Accurate visible speech synthesis based on concatenating variable length motion capture data. IEEE Transactions on Visualization and Computer Graphics, 12:266-276, 2006.
[27] X. Ma and Z. Deng. A statistical quality model for data-driven speech animation. IEEE Transactions on Visualization and Computer Graphics, 18(11):1915-1927, 2012.
[28] D. W. Massaro, M. M. Cohen, M. Tabain, J. Beskow, and R. Clark. Animated speech: research progress and applications. pages 309-345, 2012.
[29] W. Mattheyses, L. Latacz, and W. Verhelst. Comprehensive many-to-many phoneme-to-viseme mapping and its application for concatenative visual speech synthesis. Speech Communication, 2013.
[30] Microsoft. Microsoft English phonemes, http://msdn.microsoft.com/en-us/library/ms717239, 2012.
[31] F. I. Parke and K. Waters. Computer Facial Animation. A K Peters Ltd, second edition, 2008.
[32] A. Shapiro. Building a character animation system. Motion in Games, pages 98-109, 2011.
[33] S. Sutton, R. Cole, J. D. Villiers, J. Schalkwyk, P. Vermeulen, M. Macon, Y. Yan, E. Kaiser, B. Rundle, K. Shobaki, P. Hosom, A. Kain, J. Wouters, D. Massaro, and M. Cohen. Universal speech tools: The CSLU toolkit. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), pages 3221-3224, 1998.
[34] S. L. Taylor, M. Mahler, B.-J. Theobald, and I. Matthews. Dynamic units of visual speech. In Proceedings of the 2012 ACM SIGGRAPH/Eurographics symposium on Computer animation, July 2012.
[35] A. Wang, M. Emmi, and P. Faloutsos. Assembling an expressive facial animation system. In Proceedings of the 2007 ACM SIGGRAPH symposium on Video games, pages 21-26. ACM, 2007.
[36] K. Waters and D. Terzopoulos. Modeling and animating faces using scanned data. The Journal of Visualization and Computer Animation, 2(4):123-128, 1991.
[37] J. Wells et al. SAMPA computer readable phonetic alphabet. Handbook of standards and resources for spoken language systems, 4, 1997.
[38] P. H. Zein. Mandarin Chinese phonetics, http://www.zein.se/patrick/chinen8p.html, 2012.