A Practical Model for Live Speech-Driven Lip-Sync
Li Wei and Zhigang Deng ■ University of Houston

A simple, efficient, yet practical phoneme-based approach to generating realistic speech animation in real time based on live speech input starts with decomposing lower-face movements and ends with applying motion blending. Experiments and comparisons demonstrate the realism of this synthesized speech animation.

The signal-processing and speech-understanding communities have proposed several approaches to generate speech animation based on live acoustic speech input. For example, based on a real-time recognized phoneme sequence, researchers use simple linear smoothing functions to produce corresponding speech animation.1,2 Other approaches train statistical models (such as neural networks) to encode the mapping between acoustic speech features and facial movements.3,4 These approaches have demonstrated their real-time runtime efficiency on an off-the-shelf computer, but their performance is highly speaker-dependent because of the individual-specific nature of the chosen acoustic speech features. Furthermore, the visual realism of these approaches is insufficient, so they're less suitable for graphics and animation applications.

Live speech-driven lip-sync involves several challenges. First, it poses additional technical difficulties compared with prerecorded speech, where expensive global optimization techniques can help find the most plausible speech motion corresponding to novel spoken or typed input. In contrast, it's extremely difficult, if not impossible, to directly apply such global optimization techniques to live speech-driven lip-sync applications, because the forthcoming (not yet available) speech content can't be exploited during the synthesis process. Second, live speech-driven lip-sync algorithms must be highly efficient to ensure real-time speed on an off-the-shelf computer, whereas offline speech animation synthesis algorithms don't need to meet such tight time constraints. The last challenge comes from the low accuracy of state-of-the-art live speech phoneme recognition systems (such as the Julius system [http://julius.sourceforge.jp] and the HTK toolkit [http://htk.eng.cam.ac.uk]) compared with forced phoneme alignment for prerecorded speech.

To quantify the phoneme recognition accuracy gap between the prerecorded and live speech cases, we randomly selected 10 prerecorded sentences and extracted their phoneme sequences using the Julius system, first running it as a forced phoneme-alignment tool on the clips (called offline phoneme alignment) and then as a real-time phoneme recognition engine. By simulating each prerecorded speech clip as live speech, the system generated phoneme output sequentially while the speech was being fed into it. Then, by taking the offline phoneme alignment results as the ground truth, we were able to compute the accuracy of the live speech phoneme recognition in our experiment. As Figure 1 illustrates, the live speech phoneme recognition accuracy of the same Julius system varies from 45 to 80 percent. Further empirical analysis didn't show any patterns among the incorrectly recognized phonemes (that is, phonemes that are often recognized incorrectly in live speech), implying that to produce satisfactory live speech-driven animation results, any phoneme-based algorithm must take the relatively low phoneme recognition accuracy (for live speech) into design consideration. Moreover, such an algorithm should be able to perform certain self-corrections at runtime, because some phonemes could be incorrectly recognized and input into the algorithm in a less predictable manner.

Figure 1. Live speech phoneme recognition accuracy of the Julius system (y-axis) for each of the 10 test sentences (x-axis: sentence no.); the accuracy varied from 45 to 80 percent.
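The accuracy measurement itself is simple to reproduce. The sketch below is a minimal illustration, not the authors' evaluation code: it assumes both the offline alignment and the live recognition output are lists of (phoneme, start, end) tuples in seconds, and it scores agreement at a 10-ms frame step, treating the offline alignment as ground truth. The helper names and the frame-level scoring scheme are assumptions for illustration.

```python
"""Sketch: estimate live phoneme recognition accuracy against an
offline forced alignment treated as ground truth (assumed data layout)."""

def to_frame_labels(segments, num_frames, step=0.01):
    """Expand (phoneme, start_sec, end_sec) segments into per-frame labels."""
    labels = ["sil"] * num_frames          # default to silence
    for phoneme, start, end in segments:
        first = int(round(start / step))
        last = min(int(round(end / step)), num_frames)
        for i in range(first, last):
            labels[i] = phoneme
    return labels

def live_recognition_accuracy(offline_alignment, live_phonemes, step=0.01):
    """Frame-level agreement between live output and the offline ground truth."""
    duration = max(end for _, _, end in offline_alignment)
    num_frames = int(round(duration / step))
    truth = to_frame_labels(offline_alignment, num_frames, step)
    live = to_frame_labels(live_phonemes, num_frames, step)
    matches = sum(1 for t, l in zip(truth, live) if t == l)
    return matches / num_frames

# Toy usage with made-up timings (phoneme, start, end) in seconds:
offline = [("sil", 0.00, 0.10), ("hh", 0.10, 0.18), ("eh", 0.18, 0.30),
           ("l", 0.30, 0.38), ("ow", 0.38, 0.55)]
live    = [("sil", 0.00, 0.12), ("hh", 0.12, 0.20), ("ah", 0.20, 0.31),
           ("l", 0.31, 0.40), ("ow", 0.40, 0.55)]
print(f"live recognition accuracy: {live_recognition_accuracy(offline, live):.2f}")
```

The paper reports per-sentence accuracies; other scoring conventions (for example, per-phoneme rather than per-frame agreement) would change the numbers but not the overall comparison.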
Inspired by these research challenges, we propose a practical phoneme-based approach for live speech-driven lip-sync. Besides generating realistic speech animation in real time, our phoneme-based approach can straightforwardly handle speech input from different speakers, which is one of the major advantages of phoneme-based approaches over acoustic speech feature-driven approaches (see the "Related Work in Speech Animation Synthesis" sidebar).3,4 Specifically, we introduce an efficient, simple algorithm to compute each motion segment's priority and select the plausible segment based on the phoneme information that's sequentially recognized at runtime. Compared with existing lip-sync approaches, the main advantages of our method are its efficiency, simplicity, and capability of handling live speech input in real time.

Data Acquisition and Preprocessing
We acquired a training facial motion dataset for this work by using an optical motion capture system and attaching more than 100 markers to the face of a female native English speaker. We eliminated 3D rigid head motion by using a singular value decomposition (SVD)-based statistical shape analysis method.5 The subject was guided to speak a phoneme-balanced corpus consisting of 166 sentences with a neutral expression; the obtained dataset contains 72,871 motion frames (about 10 minutes of recording, at 120 frames per second). Phoneme labels and durations were automatically extracted from the simultaneously recorded speech data.

Figure 2. Facial motion dataset. (a) Among the 102 markers, we used the 39 green markers in this work. (b) Illustration of the average face markers.

As Figure 2a illustrates, we use 39 markers in the lower face region in this work, which results in a 117-dimensional feature vector for each motion frame. We apply principal component analysis (PCA) to reduce the dimension of the motion feature vectors, which allows us to obtain a compact representation by retaining only a small number of principal components. We keep the five most significant principal components, which cover 96.6 percent of the motion dataset's variance. All the other processing steps described here are performed in parallel in each of the five retained principal component spaces.
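To make this dimensionality reduction step concrete, the following sketch fits PCA to 117-dimensional marker frames and projects them onto the leading components. It is a minimal NumPy example under assumed data layouts (an N x 117 frame matrix), not the authors' implementation; the 5-component setting mirrors the description above, while the random data is only a stand-in.

```python
import numpy as np

def fit_pca(frames, num_components=5):
    """Fit PCA on motion frames (N x 117: 39 lower-face markers x 3 coords).

    Returns the mean frame, the top principal components, and the
    fraction of variance they retain.
    """
    mean = frames.mean(axis=0)
    centered = frames - mean
    # SVD of the centered data gives the principal directions in vt.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    variance = singular_values ** 2
    retained = variance[:num_components].sum() / variance.sum()
    return mean, vt[:num_components], retained

def project(frames, mean, components):
    """Map 117-D frames to the compact PCA representation (N x num_components)."""
    return (frames - mean) @ components.T

def reconstruct(coeffs, mean, components):
    """Map PCA coefficients back to marker space for rendering."""
    return coeffs @ components + mean

# Toy usage: random stand-in for the 72,871 x 117 motion matrix.
rng = np.random.default_rng(0)
frames = rng.normal(size=(1000, 117))
mean, components, retained = fit_pca(frames, num_components=5)
coeffs = project(frames, mean, components)
print(f"retained variance: {retained:.1%}, reduced shape: {coeffs.shape}")
```

On real, highly correlated facial motion data the five leading components capture most of the variance (96.6 percent in the authors' dataset); the random toy matrix here will not.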
Motion Segmentation
For each recorded sentence, we segment its motion sequence based on its phoneme alignment and extract a motion segment for each phoneme occurrence. For the motion blending that occurs later, we keep an overlapping region between two neighboring motion segments. We set the length of the overlapping region to one frame (see Figure 3).

Figure 3. Motion segmentation. The grids represent motion frames, where si denotes the motion segment corresponding to phoneme pi. When we segment a motion sequence based on its phoneme timing information, we keep one overlapping frame (grids filled with slashes) between two neighboring segments. We evenly sample five frames (grids in red) to represent a motion segment.

Motion Segment Normalization
Because a phoneme's duration is typically short (in our dataset, the average phoneme duration is 109 milliseconds), we can use a small number of evenly sampled frames to represent the original motion. We downsample a motion segment by evenly selecting five representative frames (see Figure 3); a minimal code sketch of this segmentation and downsampling follows the sidebar below.

Related Work in Speech Animation Synthesis
The essential part of visual speech animation synthesis is determining how to model the speech co-articulation effect. Conventional viseme-driven approaches need users to first carefully design a set of visemes (or a set of static mouth shapes that represent different phonemes) and then employ interpolation functions1 or co-articulation rules to compute in-between frames for speech animation synthesis.2 However, a fixed phoneme-viseme mapping scheme is often insufficient to model the co-articulation phenomenon in human speech production. As such, the resulting speech animations often lack variance and realistic articulation.

Instead of interpolating a set of predesigned visemes, one category of data-driven approaches generates speech animations by optimally selecting and concatenating motion units from a precollected database based on various cost functions.3–7 The second category of data-driven speech animation approaches learns statistical models from data.8,9 All these data-driven approaches have demonstrated noticeable successes for offline prerecorded speech animation synthesis. Unfortunately, they can't be straightforwardly extended for live speech-driven lip-sync. The main reason is that they typically utilize some form of global optimization technique to synthesize the most optimal speech-synchronized facial motion. In contrast, in live speech-driven

References
1. M.M. Cohen and D.W. Massaro, "Modeling Coarticulation in Synthetic Visual Speech," Models and Techniques in Computer Animation, Springer, 1993, pp. 139–156.
2. A. Wang, M. Emmi, and P. Faloutsos, "Assembling an Expressive Facial Animation System," Proc. SIGGRAPH Symp. Video Games, 2007, pp. 21–26.
3. C. Bregler, M. Covell, and M. Slaney, "Video Rewrite: Driving Visual Speech with Audio," Proc. SIGGRAPH 97, 1997, pp. 353–360.
4. Y. Cao et al., "Expressive Speech-Driven Facial Animation," ACM Trans. Graphics, vol. 24, no. 4, 2005, pp. 1283–1302.
5. Z. Deng and U. Neumann, "eFASE: Expressive Facial Animation Synthesis and Editing with Phoneme-Level Controls," Proc. ACM SIGGRAPH/Eurographics Symp. Computer Animation, 2006, pp. 251–259.
6. S.L. Taylor, "Dynamic Units of Visual Speech," Proc. 11th ACM SIGGRAPH/Eurographics Conf. Computer Animation, 2012, pp. 275–284.
7. X. Ma and Z. Deng, "A Statistical Quality Model for Data-Driven Speech Animation," IEEE Trans. Visualization and Computer Graphics, vol. 18, no. 11, 2012, pp. 1915–1927.
8. M. Brand, "Voice Puppetry," Proc. 26th Annual Conf. Computer Graphics and Interactive Techniques, 1999, pp. 21–28.
9. T. Ezzat, G.
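Returning to the Motion Segmentation and Motion Segment Normalization steps described before the sidebar, the sketch below cuts a motion sequence into per-phoneme segments with a one-frame overlap and evenly resamples each segment to five representative frames. It is a minimal illustration under assumed data layouts (per-frame PCA coefficients at 120 fps and contiguous phoneme timings in seconds), not the authors' code; the nearest-frame resampling is one simple choice among several.

```python
import numpy as np

FPS = 120                 # capture rate reported for the dataset
NUM_SAMPLES = 5           # representative frames per segment

def segment_motion(frames, phoneme_timings, fps=FPS):
    """Cut a motion sequence (N x d coefficient matrix) into per-phoneme
    segments, keeping one overlapping frame between neighboring segments.

    phoneme_timings: list of contiguous (phoneme, start_sec, end_sec) tuples.
    Returns a list of (phoneme, segment_frames) pairs.
    """
    segments = []
    for phoneme, start, end in phoneme_timings:
        first = int(round(start * fps))
        last = int(round(end * fps))
        # Extend by one frame so neighboring segments share a frame.
        last = min(last + 1, len(frames))
        segments.append((phoneme, frames[first:last]))
    return segments

def normalize_segment(segment, num_samples=NUM_SAMPLES):
    """Downsample a segment to a fixed number of evenly spaced frames."""
    indices = np.linspace(0, len(segment) - 1, num_samples)
    # Round to the nearest existing frame; linear interpolation also works.
    return segment[np.round(indices).astype(int)]

# Toy usage: 1 second of 5-D PCA coefficients and three phonemes.
rng = np.random.default_rng(1)
frames = rng.normal(size=(120, 5))
timings = [("hh", 0.00, 0.30), ("eh", 0.30, 0.62), ("l", 0.62, 1.00)]
for phoneme, seg in segment_motion(frames, timings):
    rep = normalize_segment(seg)
    print(phoneme, seg.shape, "->", rep.shape)
```

With this normalization, every segment is represented by the same number of frames regardless of phoneme duration, which simplifies the later priority computation and motion blending.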