Issues with Lip Sync Animation: Can You Read My Lips?
Rick Parent
Department of Computer and Information Science
Ohio State University

Scott King
Computer Science Department
University of Otago

Osamu Fujimura
Department of Speech and Hearing Science
Ohio State University

Proceedings of the Computer Animation 2002 (CA 2002) 1087-4844/02 $17.00 © 2002 IEEE

Abstract

Lip-sync animation is complex and challenging. It promises to be important in natural human-computer interfaces and entertainment, as well as to aid in the education of the deaf. It is an important component in creating a realistic human figure. Speech is based on principles from anatomy, physics, and psychophysiology. We discuss some of the issues that make speech so complex to model visually.

1. Introduction

One of the great challenges for Computer Animation is the creation of a realistic synthetic human figure. The ability to model, animate, and render such figures would be useful for entertainment, education, design, etc. There are many problems to be solved to achieve this goal, among them: clothes, hair, skin, and facial animation. Our recent efforts have been directed at solving one aspect of facial animation: realistic lip-sync animation. This paper discusses the complexities that make accurate lip-sync animation challenging.

One of the most natural forms of communication is speaking with someone face-to-face. It follows, then, that one of the more natural human-computer interfaces would be one that simulates this face-to-face communication. To this end, speech recognition and speech synthesis have emerged as important areas of research in Human-Computer Interfaces (HCI). But face-to-face communication between individuals is multi-modal; it depends not just on sounds but also on the visual cues that augment what is being said. It is well known that visual cues aid in the intelligibility of speech. Some visual cues, such as facial expressions and hand gestures, can also modify or enhance speech. In addition, a talking head interface has the potential of making communication with computers easier for the segment of the population not comfortable with technology or for those unable to communicate with the computer using more traditional means. And finally, realistic visual aspects of speech take us another step closer to the creation of a synthetic human figure indistinguishable from life.

Recently, research in computer animation has emerged that strives to simulate the visual aspect of interpersonal communication. Facial expressions such as eyebrow raising, winking, and head nods can be incorporated into a talking head model during speech production. For full-figure imagery, hand and arm gestures can create emphasis and convey the emotional state of the speaker. Our efforts are concentrated on accurate motion of the lips and tongue. We are striving to produce very precise and realistic control of the visible anatomical mechanisms responsible for speech: the lips, tongue, and jaw. Even though parts of the speech mechanism such as the vocal folds are hidden from view, many of the muscles associated with these hidden articulators can influence the surface form of the head and neck region. These subtle motions all contribute to the realism of the figure.

Because facial movement is so familiar, people are very critical of synthetic representations of facial animation. Any motion not true to form can be distracting and can do more to confuse the communication than to aid it. To accurately address lip-sync animation, we must understand how speech is produced. As with all human behavior, there are various ways to analyze speech production. One way is to look at the cognitive processes that produce the behavior. Another is to look at the transfer function that converts an intent into motion strategies. A third way is to look at the mechanical aspect of the physical system. Here, we examine speech in the second and third ways. We begin by describing the human system of sound production, the vocal apparatus.

2. The vocal apparatus

The human vocal tract consists of a number of structures that are responsible for producing and modifying the basic sound once it is produced. Some of these structures are buried relatively deep under the skin surface and therefore seem not to contribute to the visual appearance of the face. However, under some conditions, even these relatively deep structures can contribute to the visual representation. For example, deaf listeners seem to be able to perceive vowel quality partly via visual perception of the cheek conditions reflecting internal tongue gestures such as fronting and retraction.

The main components of the sound production system are the vocal folds (glottis), velum (soft palate), nasal cavity, oral cavity (surrounded by the palate, teeth, cheeks, and tongue), jaw (mandible), and lips. See Figure 1. The vocal folds are fleshy surfaces that can control the flow of air between them, through the glottal passage. The velum is a flap that forms a flexible extension of the hard palate, which contains bone. The palate forms the roof of the mouth.

Figure 1. Human sound production system

The tongue is one of the main agents that modifies sound. It is extremely deformable. The tongue is controlled by four extrinsic muscles. Extrinsic muscles are exterior to the tongue and position it inside the oral cavity. These muscles pull on it from various directions in the mouth (Figure 2). The tongue also has four intrinsic muscles. These are muscles that are contained inside the tongue and control its shape (Figure 3). The tongue can elongate and push forward or pull backward. It can raise the tip up or curl it back. The tongue can raise its side edges or depress the central (midsagittal) part, forming a narrow groove to secure an acoustic tube in the front part of the vocal tract. The jaw, which aids the tongue in modifying the shape of the oral cavity, not only rotates for opening and closing, but is also capable of some limited translation, both front-back and left-right, and of lateral rotation (rocking).

Figure 2. Extrinsic muscles of tongue

Figure 3. Intrinsic muscles of tongue (showing cross section slice of tongue).

The lips are also extremely flexible. They are controlled by approximately twenty muscles of the face that can pull up or down on the middle or either side. In addition, there is a sphincter-like muscle, the orbicularis oris, which wraps around the lips to constrict the labial (relating to the lips) opening, i.e., the mouth, and protrude the lips. See Figure 4. For example, the lip rounding for the vowel [o] is very different from a consonantal lip constriction in [p] or [f] that involves no protrusion.

Figure 4. Facial muscles involved in speech

Sound is produced by vibration. The vocal folds are responsible for creating the basic sounds associated with vowels and some consonants. Such sounds are referred to as voiced. If a sound is produced without the vibration of the vocal folds,

Air flows up the trachea through the glottal opening, the space between the vocal folds. A (voiced) sound is produced if the vocal folds vibrate, chopping the steady (DC) air stream. This sound propagates into either the nasal cavity or the oral cavity and is modified in spectral quality by resonance modes that are characteristic of the articulation. Sounds can also be produced using the tongue or lips to create source signals. Air flowing through a narrow constriction of the vocal tract produces turbulence that results in hissing sounds. Such sounds are called fricatives. Sounds such as ‘s’, ‘ch’ (soft), and ‘sh’ are examples of fricatives and affricates that involve such a frication noise. Another sound producing mechanism is called a stop or a plosive. Air flowing through the vocal tract is stopped, pressure is built up in the oral cavity behind the obstruction, and then the air is suddenly released to produce an explosive sound. Air flow is typically stopped by placing the tongue against the teeth or the palate or by closing the upper and lower lips together. Sounds such as [p], [k], and [t] are produced in this manner.

Sound is modified primarily by changing the shape of the vocal tract. The vocal tract is a tube extending from the glottis to the lip opening. The shape of the oral cavity is manipulated by the lips, jaw, and tongue. The different vowel sounds are produced this way. The lips modify sound by changing the shape of the passage out of the mouth. As previously mentioned, sound can also be modified by diverting it around the oral cavity and through the nasal cavity by lowering the velum, a fleshy flap-like extension of the palate. For example, [n] is such a nasal sound.

While many of these processes are hidden from view, such as the vibration of the vocal folds and some movements of the tongue, their movement and the associated muscles can produce visible changes on the surface in some cases. The larynx changes its height considerably and visibly when voicing is initiated or stopped and when voice pitch is changed. All these visible changes contribute to the naturalness of the facial image as one talks, and sometimes help reinforce the correct perception of the speech sounds being uttered.

3. Synthesizing speech

Linguistics is a broad term used to refer to the study of language. Phonetics, often considered as a category of linguistics, is concerned with the sounds of a language, how they are produced, and how they are perceived.
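
The production mechanisms described in Section 2 (voicing, frication, stops, nasal diversion, and lip gestures) can be summarised as a small lookup table of the kind a lip-sync system might consult. The following Python sketch is illustrative only and is not the authors' implementation. The names `Manner`, `LipGesture`, `Sound`, `SOUND_TABLE`, `symbols_by_manner`, and `describe` are hypothetical. The table covers only the sounds used as examples in the text; the voicing values are standard phonetic classifications assumed here rather than stated in the paper, and a lip gesture is recorded only where the text describes one.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Manner(Enum):
    """How the sound source or modification is produced."""
    VOWEL = "vowel"
    FRICATIVE = "fricative"   # turbulence at a narrow constriction
    AFFRICATE = "affricate"   # stop released into frication
    STOP = "stop"             # air flow blocked, then released
    NASAL = "nasal"           # velum lowered, air through nasal cavity


class LipGesture(Enum):
    """Lip gestures explicitly described in the text."""
    ROUNDED_PROTRUDED = "rounded and protruded"
    CONSTRICTED_NO_PROTRUSION = "constricted without protrusion"


@dataclass(frozen=True)
class Sound:
    symbol: str
    manner: Manner
    voiced: bool  # True if the vocal folds vibrate (assumed, standard phonetics)
    lip_gesture: Optional[LipGesture] = None  # None: not described in the text


# Hypothetical table restricted to the example sounds mentioned in the text.
SOUND_TABLE = {
    s.symbol: s
    for s in (
        Sound("s", Manner.FRICATIVE, voiced=False),
        Sound("sh", Manner.FRICATIVE, voiced=False),
        Sound("ch", Manner.AFFRICATE, voiced=False),
        Sound("p", Manner.STOP, voiced=False,
              lip_gesture=LipGesture.CONSTRICTED_NO_PROTRUSION),
        Sound("k", Manner.STOP, voiced=False),
        Sound("t", Manner.STOP, voiced=False),
        Sound("f", Manner.FRICATIVE, voiced=False,
              lip_gesture=LipGesture.CONSTRICTED_NO_PROTRUSION),
        Sound("n", Manner.NASAL, voiced=True),
        Sound("o", Manner.VOWEL, voiced=True,
              lip_gesture=LipGesture.ROUNDED_PROTRUDED),
    )
}


def symbols_by_manner(table):
    """Group sound symbols by manner, preserving table order."""
    groups = {}
    for sound in table.values():
        groups.setdefault(sound.manner, []).append(sound.symbol)
    return groups


def describe(symbol, table=SOUND_TABLE):
    """Return a one-line description of a sound in the table."""
    sound = table[symbol]
    voicing = "voiced" if sound.voiced else "voiceless"
    text = f"[{sound.symbol}]: {voicing} {sound.manner.value}"
    if sound.lip_gesture is not None:
        text += f", lips {sound.lip_gesture.value}"
    return text


if __name__ == "__main__":
    for symbol in SOUND_TABLE:
        print(describe(symbol))
```

Keeping the lip gesture optional is a deliberate choice in this sketch: it avoids asserting a visible lip shape for sounds whose constriction the text attributes to the tongue.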