
Issues with Lip Sync: Can You Read My Lips?

Rick Parent, Department of Computer and Information Science, Ohio State University, [email protected]
Scott King, Computer Science Department, University of Otago, [email protected]
Osamu Fujimura, Department of Speech and Hearing Science, Ohio State University, [email protected]

Abstract

Lip-sync animation is complex and challenging. It promises to be important in natural human-computer interfaces and entertainment, as well as an aid in the education of the deaf. It is an important component in creating a realistic human figure. Speech is based on principles from anatomy, physics, and psychophysiology. We discuss some of the issues that make speech so complex to model visually.

1. Introduction

One of the great challenges for Computer Animation is the creation of a realistic synthetic human figure. The ability to model, animate, and render such figures would be useful for entertainment, education, design, etc. There are many problems to be solved to achieve this goal, among them clothes, hair, skin, and facial animation. Our recent efforts have been directed at one aspect of facial animation: realistic lip-sync animation. This paper discusses the complexities that make accurate lip-sync animation challenging.

One of the most natural forms of communication is speaking with someone face-to-face. It follows, then, that one of the more natural human-computer interfaces would be one that simulates this face-to-face communication. To this end, speech recognition and speech synthesis have emerged as important areas of research in Human-Computer Interfaces (HCI). But face-to-face communication between individuals is multi-modal; it depends not just on sounds but also on the visual cues that augment what is being said. It is well known that visual cues aid in the intelligibility of speech. Some visual cues, such as facial expressions and hand gestures, can also modify or enhance speech. In addition, a talking-head interface has the potential of making communication with computers easier for the segment of the population not comfortable with technology, or for those unable to communicate with the computer by more traditional means. And finally, realistic visual aspects of speech take us another step closer to the creation of a synthetic human figure indistinguishable from life.

Recently, research has emerged that strives to simulate the visual aspect of interpersonal communication. Facial expressions such as eyebrow raising, winking, and head nods can be incorporated into a talking-head model during speech production. For full-figure imagery, hand and arm gestures can create emphasis and convey the emotional state of the speaker. Our efforts are concentrated on accurate motion of the lips and tongue. We are striving to produce very precise and realistic control of the visible anatomical mechanisms responsible for speech: the lips, tongue, and jaw. Even though parts of the speech mechanism, such as the vocal folds, are hidden from view, many of the muscles associated with these hidden articulators can influence the surface form of the head and neck region. These subtle motions all contribute to the realism of the figure.

Because facial movement is so familiar, people are very critical of synthetic representations of it. Any motion not true to form can be distracting and can do more to confuse the communication than to aid it. To accurately address lip-sync animation, we must understand how speech is produced. As with all human behavior, there are various ways to analyze speech production. One way is to look at the cognitive processes that

produce the behavior. Another is to look at the transfer function that converts an intent into motion strategies. A third way is to look at the mechanical aspects of the physical system. Here, we examine speech in the second and third ways. We begin by describing the human system of sound production, the vocal apparatus.

2. The vocal apparatus

The human vocal tract consists of a number of structures that are responsible for producing the basic sound and for modifying it once it is produced. Some of these structures are buried relatively deep under the skin surface and therefore seem not to contribute to the visual appearance of the face. However, under some conditions even these relatively deep structures can contribute to the visual representation. For example, deaf listeners seem to be able to perceive vowel quality partly via visual perception of the cheeks, which reflect internal tongue gestures such as fronting and retraction.

The main components of the sound production system are the vocal folds (glottis), velum (soft palate), nasal cavity, oral cavity (surrounded by the palate, teeth, cheeks, and tongue), jaw (mandible), and lips. See Figure 1. The vocal folds are fleshy surfaces that can control the flow of air between them, through the glottal passage. The velum is a flexible flap extending from the hard palate, which contains bone. The palate forms the roof of the mouth.

Figure 1. Human sound production system

Sound is produced by vibration. The vocal folds are responsible for creating the basic sounds associated with vowels and some consonants; such sounds are referred to as voiced. If a sound is produced without the vibration of the vocal folds, such as a [p], then it is voiceless. The airflow used to produce a sound originates in the respiratory system. Air flows up the trachea through the glottal opening, the space between the vocal folds. A (voiced) sound is produced if the vocal folds vibrate, chopping the steady (dc) air stream. This sound propagates into the nasal cavity or the oral cavity and is modified in spectral quality by resonance modes that are characteristic of the articulation.

Sounds can also be produced using the tongue or lips to create source signals. Air flowing through a narrow constriction of the vocal tract produces turbulence that results in hissing sounds. Such sounds are called fricatives. Sounds such as 's', soft 'ch', and 'sh' are examples of fricatives and affricates that involve such frication noise. Another sound-producing mechanism is the stop, or plosive: air flowing through the vocal tract is stopped, pressure builds up in the oral cavity behind the obstruction, and then the air is suddenly released to produce an explosive sound. Airflow is typically stopped by placing the tongue against the teeth or the palate, or by closing the upper and lower lips together. Sounds such as [p], [k], and [t] are produced in this manner.

Sound is modified primarily by changing the shape of the vocal tract, the tube extending from the glottis to the lip opening. The shape of the oral cavity is manipulated by the lips, jaw, and tongue; the different vowel sounds are produced this way. The lips modify sound by changing the shape of the passage out of the mouth. Sound can also be modified by diverting air around the oral cavity and through the nasal cavity by lowering the velum, the fleshy flap-like extension of the palate; [n] is such a nasal sound.
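This source-filter view of speech can be made concrete with a small numerical sketch. The following Python fragment is our illustration, not part of any system described in this paper; the fundamental frequency, formant frequencies, and bandwidths are rough, textbook-style values chosen only to show the mechanism. A voiced source is modeled as an impulse train (the folds chopping the air stream) and shaped by a few vocal-tract resonances; replacing the impulse train with noise gives a fricative-like hiss.

```python
import numpy as np
from scipy.signal import lfilter

SR = 16000  # sample rate (Hz)

def glottal_source(f0, dur):
    """Voiced source: an impulse train at the fundamental frequency,
    a crude stand-in for the vocal folds chopping the airstream."""
    src = np.zeros(int(SR * dur))
    src[::int(SR / f0)] = 1.0
    return src

def resonator(signal, freq, bandwidth):
    """Apply one vocal-tract resonance (formant) as a two-pole filter."""
    r = np.exp(-np.pi * bandwidth / SR)
    theta = 2 * np.pi * freq / SR
    a = [1.0, -2 * r * np.cos(theta), r * r]
    return lfilter([1.0], a, signal)

# A rough schwa-like vowel: voiced source shaped by three formants.
sound = glottal_source(f0=120, dur=0.5)
for formant, bw in [(500, 80), (1500, 100), (2500, 120)]:
    sound = resonator(sound, formant, bw)

# A fricative-like hiss: turbulence (noise) instead of voicing.
hiss = resonator(np.random.randn(int(SR * 0.3)), 4000, 1000)
```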

The tongue is one of the main agents that modifies sound, and it is extremely deformable. The tongue is positioned within the oral cavity by four extrinsic muscles, which are exterior to the tongue and pull on it from various directions in the mouth (Figure 2). The tongue also has four intrinsic muscles, contained inside the tongue, that control its shape (Figure 3). The tongue can elongate and push forward or pull backward. It can raise the tip up or curl it back. It can press its side edges up or press the central (midsagittal) part down, forming a narrow groove that secures an acoustic tube in the front part of the vocal tract. The jaw, which aids the tongue in modifying the shape of the oral cavity, not only rotates for opening and closing, but is also capable of some limited translation, both front-back and left-right, and of lateral rotation (rocking).

Figure 2. Extrinsic muscles of tongue

Figure 3. Intrinsic muscles of tongue (showing a cross-sectional slice of the tongue)

The lips are also extremely flexible. They are controlled by approximately twenty muscles of the face that can pull up or down on the middle or either side. In addition, there is a sphincter-like muscle, the orbicularis oris, which wraps around the lips to constrict the labial (relating to the lips) opening, i.e., the mouth, and to protrude the lips. See Figure 4. For example, the lip rounding for the vowel [o] is very different from the consonantal lip constriction in [p] or [f], which involves no protrusion.

Figure 4. Facial muscles involved in speech

While many of these processes are hidden from view, such as the vibration of the vocal folds and some movements of the tongue, their movement and the associated muscles can produce visible changes on the surface in some cases. The larynx, for instance, changes its height considerably and visibly when voicing is initiated or stopped and when voice pitch is changed. All these visible changes contribute to the naturalness of the facial image as one talks, and sometimes reinforce the correct perception of the speech sounds being uttered.

3. Synthesizing speech

Linguistics is a broad term used to refer to the study of language. Phonetics, often considered a branch of linguistics, is concerned with the sounds of a language: how they are produced and how they are perceived. Because we are interested in producing visuals that correspond to sounds, we are concerned with how the vocal apparatus is used to produce sounds. This is called articulatory phonetics.

Phonetics describes sounds in various ways. One way is in terms of articulation, with labels such as stop, fricative, aspirated, voiced, and nasal. Phonetics uses a phonetic alphabet (e.g., [5]) to unambiguously describe the sounds of a language. A common approach to describing the sounds of a language is to break it down into the most basic sounds produced in the spoken language: typically, one of the sound elements in a word can be replaced by another, and the resultant form is another word in the same language. These basic sounds are referred to as phonemes. There are about 40 phonemes in the English language. A spoken phoneme, as a waveform, is assumed to have a certain duration, amplitude, and fundamental frequency (pitch), along with its characteristic articulation.

In order to automate the speech production process, text is often used as the form of input. Many text-to-speech systems are based on this phonemic decomposition. Systems exist which decompose text into constituent phonemes along with basic timing information and associated waveforms. First, a database of sounds is created: typically, a person is recorded speaking each phoneme in a neutral voice (such databases are freely available), and the spoken phonemes are stored as primitive waveforms. The text-to-speech system blends these waveforms together to form a continuous flow of speech (Figure 5). In producing the final speech signal, two forces are at work: one that combines adjacent signals into a continuous flow (coarticulation) and one that modifies the speech based on the emotion or intent of the speaker (prosody).

Figure 5. Text-to-speech system
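Before turning to those two forces, a minimal sketch of the basic pipeline may help. This is our illustration only: the phoneme symbols and the waveform database are hypothetical stand-ins for the recorded inventory described above, and a bare crossfade does none of the real work of coarticulation or prosody.

```python
import numpy as np

SR = 16000
XFADE = int(0.010 * SR)  # 10 ms linear crossfade between units

def synthesize(phonemes, database):
    """Naive concatenation: look up each phoneme's recorded waveform
    and blend adjacent units with a short crossfade."""
    out = database[phonemes[0]].copy()
    for ph in phonemes[1:]:
        nxt = database[ph].copy()
        fade = np.linspace(0.0, 1.0, XFADE)
        out[-XFADE:] = out[-XFADE:] * (1 - fade) + nxt[:XFADE] * fade
        out = np.concatenate([out, nxt[XFADE:]])
    return out

# Hypothetical database: phoneme symbol -> recorded waveform.
db = {ph: np.random.randn(SR // 5) for ph in ["k", "aa", "t"]}
signal = synthesize(["k", "aa", "t"], db)  # the word "cot"
```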

3.1 Coarticulation

In synthesizing speech, phonemes are combined with adjacent phonemes. The problem with phonemes is that they do not sound (or look) the same in all contexts: a phoneme's sound depends on what is said immediately before and immediately after it. The modified variations of a phoneme are called its allophones. Thus the [k] in the word 'key' is a palatal allophone of the phoneme /k/, while the [k] in 'cool' is a velar allophone of the same phoneme /k/. Also, a consonant or a vowel does not change its properties uniformly as a whole. The physical properties of the sound are affected by neighboring sounds continuously, so the vowel [a] in 'cot' is most strongly affected by [k] in its beginning part and by [t] in its ending part. This modification based on context is referred to as coarticulation. Coarticulation is basically a smoothing operation resulting from physical constraints of the human speech-producing apparatus. When we examine articulatory control, speech is actually a continuous system with no hard boundaries between phonemes, syllables, or words, contrary to what the written form suggests. Occasional phonetic stops and pauses between phrases are the only breaks in the acoustic waveform, except that the oscillation of the vocal folds typically exhibits rather discontinuous changes in signal periodicity. It is typically very difficult to identify word boundaries in spectrograms of continuously spoken speech. Luckily, our auditory system is trained to recognize linguistic units such as words when a specific target is approximated.

3.2 Syllables

Consonants and vowels often cannot be spoken by themselves: they are components of utterable units called syllables. It was in fact a remarkable invention in the history of human culture, about 4,000 years ago, when the Greeks began using alphabetical symbols to represent individual consonants and vowels as abstract units of speech. The technical convenience of the alphabetic font for printing has made this transcription system overwhelmingly popular in the world ([6], p. 225). From a phonetic point of view, however, the smallest robust unit of utterance is the syllable, typically centered around a vowel, often with preceding and (in some languages, like English) succeeding consonants. When a waveform representing the speech signal is given, it is relatively easy to segment the continuous time function into syllables, even though the edges of the time domain for each syllable are often affected by coarticulation with the adjacent portion of the preceding or succeeding syllable.
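A crude version of this segmentation can be sketched in a few lines. This is our illustration, with invented window sizes and threshold: since syllables are centered on high-energy vowels, one can place cuts at the energy minima between successive energy peaks, accepting that coarticulation makes the edges approximate.

```python
import numpy as np

def syllable_boundaries(signal, sr, win=0.025, hop=0.010):
    """Crude syllable segmentation: find short-time energy peaks
    (vowel nuclei) and cut at the energy minima between them."""
    w, h = int(win * sr), int(hop * sr)
    frames = [signal[i:i + w] for i in range(0, len(signal) - w, h)]
    energy = np.array([np.sum(f * f) for f in frames])
    smooth = np.convolve(energy, np.ones(5) / 5, mode="same")
    peaks = [i for i in range(1, len(smooth) - 1)
             if smooth[i] > smooth[i - 1] and smooth[i] >= smooth[i + 1]
             and smooth[i] > 0.2 * smooth.max()]
    # Boundary: the lowest-energy frame between consecutive peaks,
    # converted back to a sample index.
    return [int((p + np.argmin(smooth[p:q])) * h)
            for p, q in zip(peaks, peaks[1:])]
```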

In synthesis, if we have an inventory of speech signals representing individual syllables, we can usually select appropriate syllabic units and concatenate them to produce intelligible speech for any word or phrase. The signal sounds fairly natural if we use a parametric representation of the spectral properties of the sound as a set of control time functions, such as Linear Predictive Coding coefficients or formant frequencies (resonances); the signal can then be resynthesized after concatenation and smoothing is performed parameter by parameter. In order to achieve reasonably natural and readily comprehensible synthetic speech, we need to add an appropriate prosodic modulation, as discussed below. The operating units for prosodic modulation are syllables. Therefore, syllables are the critical speech units for any high-quality signal processing in speech technology.

3.3 Prosody

The prosodic control of the speech signal is physically observed in various sound properties: voice fundamental frequency (F0); other aspects of voice quality, including sound intensity; voice source signal characteristics such as spectral tilt; articulatory modulation, such as jaw opening control resulting in formant frequency modifications; and temporal compression/stretching of individual syllables and boundary phenomena such as silent pauses. In English, individual words are identified by their sequence of specific syllables, each of which in turn can be identified as a linear sequence of phonemes or a specified organization of features.

Besides the identity of each syllable in the string in terms of the so-called segmental specification, i.e., the designated consonants and vowels, the linear string of syllables shows a pattern as an inherent property of the word, phrase, etc., as a whole. For example, the verb 'import' is different from the noun 'import' in the way the word is pronounced with respect to what is called its stress pattern. While it is rather rare to find a pair of English words that contrast with each other only in terms of stress pattern, English words generally have inherent stress patterns which help identify the word in question when a listener hears it in conversation. In some languages, the stress pattern is fixed and the same for all words, thus contributing to the identification of word boundaries rather than distinguishing words from each other.

A value of stress is attached to each syllable to create a distinct stress pattern for a linguistic form, such as a word or phrase, as a whole unit. When a strong stress is attached to a syllable, the duration of the syllable is made longer. At the same time, a stressed syllable is usually pronounced with a higher voice pitch (fundamental frequency of vocal fold vibration) than unstressed syllables in the same word or phrase. Such a designation of a syllable in a word with stress is called lexical accent. Depending on the language, however, a lexical accent may not manipulate the stress pattern, which reflects the general amount of effort in articulation and is invariably associated with durational expansion. In some languages, like Japanese, the primary physical property observed for the accent designation of a syllable is raised voice pitch, without accompanying durational expansion; this form of lexical accent designation is often called pitch accent. A similar pitch accent specification is used in English as a property of some phrases, e.g., raising the voice pitch toward the end of a sentence if it is a question expecting yes or no as the answer. Such phrasal voice pitch modulation is often called intonation, and it conveys subtle nuances of the message delivery as well as linguistically distinct semantic values of the sentence.
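The stress and pitch-accent behavior described above lends itself to a small illustration. This sketch is ours; the data structure, the scale factors, and the phoneme symbols are invented for clarity, not measured values from any prosody model.

```python
from dataclasses import dataclass

@dataclass
class Syllable:
    phonemes: list   # segmental content
    stressed: bool   # lexical stress / accent flag
    duration: float  # seconds, before prosodic modulation
    f0: float        # baseline pitch in Hz

def apply_prosody(syllables, question=False,
                  stretch=1.3, accent_ratio=1.2):
    """Toy prosodic modulation: a stressed syllable is lengthened and
    given a higher F0 (English-style stress accent); a yes/no question
    raises pitch on the final syllable (intonation)."""
    for s in syllables:
        if s.stressed:
            s.duration *= stretch
            s.f0 *= accent_ratio
    if question and syllables:
        syllables[-1].f0 *= accent_ratio
    return syllables

# The noun 'import' and the verb 'import' differ only in which
# syllable carries the stress.
noun = [Syllable(["ih", "m"], True, 0.20, 110),
        Syllable(["p", "ao", "r", "t"], False, 0.25, 110)]
apply_prosody(noun)
```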
4. Text-to-audiovisual speech

Our work concentrates on the visuals associated with the sounds produced by text-to-speech systems. We have developed a new tongue model (Figure 6) and a new lip model (Figure 7). These models are capable of the articulations necessary for speech. When animating a face for speech, the lip model replaces the lips of the face and is grafted onto the surrounding skin (Figure 8). For most facial models, only the exterior surface is defined; in order to prepare the model for speech, the oral cavity must also be defined. The tongue, teeth, palate, and fleshy surfaces are inserted into the oral cavity (Figure 9). Once the facial model is complete, there are 27 parameters that control these articulators of speech.
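We do not enumerate the 27 parameters here, but a hypothetical slice of such a parameter set might look like the following. The names, ranges, and values are invented for illustration and are not the actual parameters of our model.

```python
from dataclasses import dataclass

@dataclass
class ArticulatorState:
    """Hypothetical articulation parameters, normalized to 0..1."""
    jaw_rotation: float = 0.0
    jaw_thrust: float = 0.0        # limited front-back translation
    lip_protrusion: float = 0.0    # orbicularis oris contraction
    lip_opening: float = 0.2
    tongue_tip_height: float = 0.0
    tongue_body_front: float = 0.5
    velum_lowered: float = 0.0     # opens the nasal passage

neutral = ArticulatorState()
oh = ArticulatorState(jaw_rotation=0.4, lip_protrusion=0.8,
                      lip_opening=0.5)  # rounded vowel [o]
```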

Figure 6. The Tongue Model

Figure 7. The Lip Model

Figure 8. Grafting Lips onto a Model

Figure 9. The Finished Lips and Mouth

In order to produce appropriate visuals, each phoneme can be associated with a pose of the geometric model. A pose is a specification of a value for each degree of freedom (DOF) of a mechanism; we also consider a pose to be the specification of values for some structurally significant subset of the DOFs of a mechanism. A pose of a facial model associated with a phoneme is referred to as a viseme. Thus a viseme is the visual equivalent of a phoneme. While it is intuitive to speak of a single pose defining a viseme, we have found that the viseme must be treated as a dynamic shaping of the vocal tract. For a given articulation parameter, we define one or more control points over the duration of a phoneme in order to generate a curve, parameterized by time relative to the start of the phoneme, that specifies the value of the parameter throughout the phoneme's duration. Designing these visemes is not an easy task. The relative timing of the motion of the articulators against the audio track has to be taken into account, and it is especially important for gestures such as stops and fricatives. These visemes and the associated waveforms are input into the facial modeling system in order to generate the sequence of images representing the spoken text. The resulting system, called TalkingHead, is diagrammed in Figure 10.
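A minimal sketch of this curve-based viseme representation follows. It is our illustration: the control-point values for the lip-closure parameter of a [p] are invented, and a real viseme drives many parameters at once.

```python
import numpy as np

def viseme_curve(control_points, duration, n=50):
    """A viseme as a dynamic shaping rather than a single pose:
    control points (relative_time, value) over the phoneme's duration
    are interpolated into a curve for one articulation parameter."""
    ts = np.array([p[0] for p in control_points]) * duration
    vs = np.array([p[1] for p in control_points])
    t = np.linspace(0.0, duration, n)
    return t, np.interp(t, ts, vs)

# Hypothetical lip-closure parameter for a [p]: closed through the
# stop, releasing near the end of the phoneme.
t, closure = viseme_curve([(0.0, 0.9), (0.7, 1.0), (1.0, 0.2)],
                          duration=0.12)
```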

Figure 10. TalkingHead System Overview

4.1 Coarticulation

Viewing a model parameter, such as jaw rotation, as a function of time, referred to as a track, the visemes define key points on the track. The speech production system is a target-oriented system, with the vocal apparatus trying to reach specific target configurations at specified times [3]. These targets occur relatively frequently, at a rate of approximately 10 Hz. Because of the physical limits of the mechanical system, there are limits to how fast a model parameter should change; not all of the key positions may be reachable by interpolation, so a compromise must be established between the key positions and the curve. One of the most effective techniques for modeling coarticulation is due to Cohen and Massaro [2]. They define dominance curves that are used to blend the effects of phonemes by controlling the relative importance of poses. Each phoneme's influence over an articulator is given a dominance curve that indicates how important that articulator value is to that phoneme; the effects that adjacent phonemes have on an articulator can then be balanced against one another based on this importance. This has been a big step in the right direction, although the dominance functions are hard to use because of the large number of parameters.

Coarticulation has proven difficult to compute at the phoneme level. It is easier to handle when larger chunks of speech are used as the fundamental building blocks. Diphones (phoneme pairs) and triphones (phoneme triples) have been used instead of single phonemes. Using larger building blocks decreases the coarticulation effects that must be computed, but it increases the number of segmental units: if there are 40 phonemes, there are potentially 40x40 = 1600 diphones, although some of these do not occur in a given language. In fact, most coarticulation effects occur within syllables (as opposed to between syllables), so using syllables as building blocks would almost eliminate coarticulation computations. There are about ten thousand syllables.
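The dominance-curve idea can be sketched schematically as follows. This is our simplified illustration: the negative-exponential dominance shape is in the spirit of Cohen and Massaro's model [2], but the rate constant and the jaw-opening targets are invented, and their full formulation has many more parameters per phoneme.

```python
import numpy as np

def dominance(t, center, magnitude=1.0, rate=8.0):
    """A dominance curve that peaks at the phoneme's center and
    decays exponentially into its neighbors."""
    return magnitude * np.exp(-rate * np.abs(t - center))

def blend_track(times, segments):
    """Blend per-phoneme targets for one articulator parameter.
    `segments` is a list of (center_time, target_value) pairs; each
    phoneme's weight at time t is its dominance there."""
    track = []
    for t in times:
        w = np.array([dominance(t, c) for c, _ in segments])
        v = np.array([target for _, target in segments])
        track.append(np.sum(w * v) / np.sum(w))
    return np.array(track)

# Jaw-opening targets (invented values) for the phonemes of "cot":
# the open vowel dominates mid-word, the stops pull it closed.
segments = [(0.05, 0.1), (0.15, 0.8), (0.28, 0.0)]  # [k], [aa], [t]
times = np.linspace(0.0, 0.35, 36)
jaw_open = blend_track(times, segments)
```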

4.2 Prosody

Prosody, like coarticulation, modifies the articulators during speech production. However, prosody contributes to the semantic content of what is spoken. Some work has been done on incorporating prosody into facial animation [1][7]. In a text-to-speech system, the incorporation of prosodic effects requires that the phonemes be marked with the associated amount of stress. These tags can be generated automatically from the text or supplied externally. Most of the visual work on prosody has involved augmenting speech with facial expressions such as eyebrow raising and winking. Prosodic effects on the timing of utterances may be handled by the coarticulation model, but our recent experience with song suggests that changes in relative duration are not handled well by current coarticulation models. Unfortunately, other effects of prosody, for example its effects on lip and tongue motion, are not well understood, and there is little empirical data on which to base an approach.

4.3 Geometric Concerns

Text-based systems are useful if they can maintain interactive rates: a text database can be accessed when responding to user queries to provide timely information, but only if the response, including the speech, can be generated quickly. Speech generation is a challenging task in this situation. The speech apparatus is a physical system and is subject to physical constraints, as we have discussed with regard to coarticulation.

It is a complex system with some rigid articulation, but primarily it is based on deformable structures that contact each other to produce the shapes necessary for the correct sounds. The lips and tongue assume various shapes in order to produce sounds such as 'a', 'e', and 'w'. The lips contact each other to form the shapes for consonants such as 'b', 'p', and 'm'. The lips contact the teeth for 'f' and 'v'. The tongue contacts the teeth or palate for 'th', 'l', and 'n' sounds and deforms as a result of this contact. The tongue deforms itself into various shapes for many sounds, such as 'sh' and 'ch'. Speech analysis tells us that, for a certain sound, the tongue touches the palate; however, it does not specify whether this is accomplished by rotation of the jaw, by motion of the tongue, or by some combination of the two.

Simulating the physically accurate deformations that are produced as a result of these collisions and muscle contractions is a daunting task. But two features of lip-sync animation provide us with computational short-cuts. First, most of these deformations are hidden from view; the deformations produced in the tongue only have to be approximate in order to be convincing. Second, only a relatively small number of such deformations take place during the course of speech. This suggests the possibility of precomputing the various shapes of the tongue and lips that are produced as a result of these collisions.
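A sketch of this precomputation short-cut is given below. It is our illustration only: the contact categories, the mesh representation, and the offline solver are placeholders for whatever collision and muscle simulation is actually available.

```python
import numpy as np

# Hypothetical contact deformations that recur during speech.
CONTACTS = ["tongue_teeth", "tongue_palate", "lip_lip", "lip_teeth"]

def bake_deformations(rest_mesh, simulate):
    """Run the (expensive, offline) collision simulation once per
    contact type and store the resulting vertex offsets."""
    return {c: simulate(rest_mesh, c) - rest_mesh for c in CONTACTS}

def apply_deformation(rest_mesh, cache, contact, amount=1.0):
    """Interactive-rate lookup: blend the baked offsets back in."""
    return rest_mesh + amount * cache[contact]

# Placeholder "simulation": in practice this would be the offline
# solver; here it just nudges vertices so the sketch runs.
rest = np.zeros((100, 3))
cache = bake_deformations(rest, lambda mesh, contact: mesh + 0.01)
tongue_on_palate = apply_deformation(rest, cache, "tongue_palate", 0.5)
```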

5. References

1. Justine Cassell, Catherine Pelachaud, Norman Badler, Mark Steedman, Brett Achorn, Tripp Becket, Brett Douville, Scott Prevost, and Matthew Stone. "Animated conversation: rule-based generation of facial expression, gesture and spoken intonation for multiple conversational agents," Proceedings of SIGGRAPH 94, Orlando, Florida, July 24-29, 1994 (Andrew Glassner, editor), pages 413-420. ACM SIGGRAPH, ACM Press.

2. Michael Cohen and Dominic Massaro. "Modeling coarticulation in synthetic visual speech," in Models and Techniques in Computer Animation (Nadia Magnenat-Thalmann and Daniel Thalmann, editors), pages 139-156. Springer-Verlag, Tokyo, 1993.

3. Donald Dew and Paul J. Jensen. Phonetic Processing: The Dynamics of Speech. Charles E. Merrill Publishing Company, Columbus, Ohio, 1977.

4. Anthony Fox. Prosodic Features and Prosodic Structure: The Phonology of Suprasegmentals. Oxford University Press, Oxford, 2000.

5. Michael Gourlay. "Modified SAMPA Phonetic Alphabet," www.colorado-research.com/~gourlay/audio/mjg-diphones-text/node2.html, 2002.

6. Peter Ladefoged. A Course in Phonetics, Fourth Edition. Harcourt Brace, 2001.

7. Catherine Pelachaud, Norman I. Badler, and Mark Steedman. "Linguistic issues in facial animation," in Computer Animation '91 (Nadia Magnenat-Thalmann and Daniel Thalmann, editors), pages 15-30. Springer-Verlag, Tokyo, 1991.