Graphically Speaking
Editor: Miguel Encarnação

Carnival—Combining Speech Technology and Computer Animation

Michael A. Berger, Speech Graphics Ltd.
Gregor Hofer, Speech Graphics Ltd.
Hiroshi Shimodaira, University of Edinburgh

Speech is powerful information technology and the basis of human interaction. By emitting streams of buzzing, popping, and hissing noises from our mouths, we transmit thoughts, intentions, and knowledge of the world from one mind to another. We're accustomed to thinking of speech as an acoustic, auditory phenomenon. However, speech is also visible. Although the primary function of speech is to manipulate air in the vocal tract to produce sound, this action has an ancillary effect of changing the face's appearance. In particular, the action of the lips and jaw during speech causes constant deformation of the facial surface, generating a robust visual signal highly correlated with the acoustic one1—that is, visual speech.

In computer animation, this means that speech is something that must be studied, understood, and simulated. Speech animation, or lip synchronization, is a major challenge to animators, owing to its intrinsic complexity and viewers' innate sensitivity to the face. (Actually, the term lip synchronization—lip sync—incorrectly implies that only the lips move, whereas almost all of the facial surface below the eyes is deformed during speech.) At the same time, demand for lip sync is sharply increasing, in terms of both realism and quantity. Automated solutions are now absolutely necessary. (For more on why this is the case, see the sidebar.)

The past two decades have seen the emergence of techniques for animating speech automatically using speech technology—an interdisciplinary concept called visual speech synthesis. Since the late 1980s, two applications have been in development. Audio-driven animation automatically synthesizes facial animation from audio. Text-driven animation (or audiovisual text-to-speech synthesis) synthesizes both auditory and visual speech from text. The former is used for automatic lip sync with recorded audio, the latter for entirely text-based avatars.

But speech technology and computer graphics remain worlds apart, and the development of visual speech synthesis suffers from lack of a unified conceptual and technological framework. To meet this need, researchers at Speech Graphics (www.speech-graphics.com) and the University of Edinburgh's Centre for Speech Technology Research (CSTR; www.cstr.ed.ac.uk) are developing Carnival, an object-oriented environment for integrating speech processing with real-time graphics. Carnival comprises an unlimited number of modules that can be dynamically loaded and assembled into a mutable animation production system.

Visual Speech Synthesis

Both audio- and text-driven animation involve a series of operations converting a representation of speech from an input form to an output form.

Audio-Driven Animation

In audio-driven animation (see Figure 1), the first step is acoustic analysis to extract useful information from the audio signal. This information might be of two kinds:

■ continuous acoustic parameters, such as pitch, intensity, or mel-frequency cepstral coefficients; or
■ discrete speech categories, such as phonemes or visemes.

Figure 1. A typical processing pipeline for audio-driven facial animation (audio → acoustic analysis → acoustic parameters/speech categories → motion synthesis → motion parameters → adaptation → animation parameters → rendering → animation). Acoustic analysis extracts continuously and categorically valued representations of the audio. Both can be used as input to motion synthesis, which produces audio-synchronous motion in some parameter space, which must be mapped to a facial model's animation parameters (adaptation). From these parameters, we can render the animation using standard methods.

Both can be the basis for the next step, synthesizing audio-synchronous motion. Given some regression model, we can map continuous acoustic parameters directly to motion parameters. On the other hand, a categorical analysis (see Figure 2) provides a semantic description of speech events. This description abstracts the speech from the audio domain, allowing its reconstruction in the motion domain.

Figure 2. The waveform and categorical analysis of the utterance "Quick shots rang out" (phoneme labels k w ih k sh aa t s r ae ng aw t, between roughly 13.65 and 14.77 seconds). Such analyses provide a semantic description of speech events that we can use to reconstruct speech in the motion domain.
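To make these first two stages concrete, the following is a minimal sketch in Python. The article does not prescribe any particular toolkit, so the sketch assumes the librosa library for MFCC extraction and stands in for the regression model with a pre-trained linear mapping (a weight matrix W and bias b); the per-phoneme timings for "Quick shots rang out" are illustrative, with only the overall 13.65–14.77 s span taken from Figure 2.

# Minimal sketch of acoustic analysis and motion synthesis for audio-driven
# animation. Assumptions (not from the article): librosa for MFCC extraction;
# a pre-trained linear regression (W, b) standing in for the regression model.
from dataclasses import dataclass

import librosa
import numpy as np


@dataclass
class PhoneSegment:
    """One entry of a categorical analysis: a phoneme label with its timing."""
    label: str
    start: float  # seconds
    end: float    # seconds


def continuous_analysis(wav_path: str, hop_s: float = 0.01) -> np.ndarray:
    """Extract continuous acoustic parameters: 13 MFCCs per 10-ms frame."""
    audio, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13,
                                hop_length=int(hop_s * sr))
    return mfcc.T  # shape: (num_frames, 13)


def synthesize_motion(acoustic: np.ndarray,
                      W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Map acoustic parameters to motion parameters with a regression model.

    W (13 x num_motion_params) and b would come from training on parallel
    audio and facial motion-capture data; here they are simply assumed.
    """
    return acoustic @ W + b  # shape: (num_frames, num_motion_params)


# A categorical analysis of "Quick shots rang out" (see Figure 2). Only the
# overall 13.65-14.77 s span comes from the figure; per-phoneme boundaries
# here are illustrative.
quick_shots = [
    PhoneSegment("k", 13.65, 13.73), PhoneSegment("w", 13.73, 13.80),
    PhoneSegment("ih", 13.80, 13.88), PhoneSegment("k", 13.88, 13.97),
    # ... remaining phonemes (sh aa t s r ae ng aw t) through 14.77 s
]

Either representation—the frame-level parameter matrix or the timed phoneme sequence—can then drive motion synthesis as described above.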
After synthesizing facial motion in some form, we must still map it to a facial model's animation parameters, determined by its deformers. In a 3D facial rig with blendshapes and bones (also called joints), the parameters are blendshape weights and bone transformation parameters. Using these parameters, we can render the animation.
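Because adaptation depends entirely on the target rig, the sketch below is only hypothetical: a rig-specific retargeting matrix converts motion parameters into blendshape weights clamped to [0, 1], and one motion parameter, assumed to track jaw opening, drives a jaw-joint rotation. The parameter, blendshape, and joint names are invented for illustration.

# Hypothetical sketch of the adaptation step: mapping per-frame motion
# parameters onto a specific rig's animation parameters (blendshape weights
# and bone transforms). The retargeting matrix and the jaw-joint scaling
# are placeholders for whatever the rig actually defines.
import numpy as np


def adapt_to_rig(motion: np.ndarray, retarget: np.ndarray,
                 blendshape_names: list[str]) -> list[dict]:
    """Convert motion parameters into blendshape weights and a jaw rotation.

    motion:   (num_frames, num_motion_params) from motion synthesis
    retarget: (num_motion_params, num_blendshapes), rig-specific
    """
    weights = np.clip(motion @ retarget, 0.0, 1.0)  # blendshape weights in [0, 1]
    frames = []
    for t, w in enumerate(weights):
        frames.append({
            "frame": t,
            "blendshapes": dict(zip(blendshape_names, w.tolist())),
            # Bone transform: drive a jaw-joint rotation (in degrees) from the
            # first motion parameter, assumed here to track jaw opening.
            "bones": {"jaw_joint": {"rotate_x": float(20.0 * motion[t, 0])}},
        })
    return frames

A renderer or game engine would then consume these per-frame blendshape weights and bone transforms to deform and render the face, completing the pipeline in Figure 1.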
Why Automate Speech?

"Realistic facial synthesis is one of the most fundamental problems in computer graphics—and one of the most difficult."1

Traditionally, lip synchronization (lip sync) has been done manually, by keyframing or rotoscoping. However, as 3D animation reaches increasing heights of realism, all aspects of the animation industry must keep up, including lip sync. And realistic lip sync is extremely labor intensive and difficult to achieve manually. This difficulty is due to four characteristics of visual speech: dynamic complexity, audio synchronicity, high sensitivity, and high volume.

Dynamic Complexity

Speech is arguably one of the most complex human motor activities. Our alphabetic writing system can deceive us into thinking that speech is just a succession of discrete, sound-producing events corresponding to letters. But as people discovered in the 19th century with the advent of instrumental acoustics, this isn't the physical reality. Speech is a continuous activity with no real "units" of any kind.

Like other task-oriented motor behaviors, speech is highly efficient: energy expenditure is minimized. So instead of producing one sound after another sequentially, we begin producing each sound well before concluding the previous one. The movements of the tongue, lips, and jaw in speech are like an athlete's coordinated movements: different body parts acting in concert, future movements efficiently overlapping with current ones, all efficiently compressed in time. This simultaneous production of sounds, called coarticulation, means that what we think to be a given consonant or vowel is actually realized quite differently depending on the sounds preceding and following it. Consequently, it's difficult or impossible to define units of speech in a context-invariant way. Such dynamic complexity is understandably difficult to reproduce by hand.

Audio Synchronicity

Unlike other animated behaviors such as walking, visual speech must be tightly and continuously synchronized with an audio channel. This synchronization makes speech animation a uniquely double-edged problem. The visual speech must be not only dynamically realistic in itself but also sufficiently synchronous and commensurate with the auditory speech to create the illusion that the two signals are physically tied—that is, that the face we're watching is the source of the sound we hear.

High Sensitivity

Beyond the intrinsic difficulties of synthesizing visual speech—complexity and audio synchronicity—there's an extrinsic perceptual problem. Humans are innately well attuned to faces, which makes us sensitive to unrealistic facial animation or bad lip sync.

This sensitivity might serve a communicative function. With our highly expressive faces, we seem designed for face-to-face communication. This obviously includes nonverbal communication: facial expressions modify the spoken word's meaning and transmit emotional states and signals in the absence of speech. But faces are also integral to speech communication. Humans have an innate ability to lip-read, or recognize words by sight; this is true for both hearing-impaired and normally hearing people. The next time you're in a noisy place such as a bar, notice how much you rely on seeing someone's face in order to "hear" them. Even in noise-free environments, speech perception is still a function of both auditory and visual channels. Visual speech so strongly influences speech perception that it can even override the auditory percept, causing us to hear a sound different from the one the ear received—the famous McGurk effect.2

High Volume

As the bar for realism in 3D animation continues to rise, and with it the demand for higher-quality lip sync, the quantity of speech and dialogue in animation is also rising exponentially. These are antagonistic sources of pressure: animators can't satisfy quantity demands without sacrificing quality, and vice versa.

A case in point is the video game industry. Video game characters are becoming increasingly realistic, in both static appearance and behavior. Poor lip sync will be reflected in game reviews, which often include rants about lip sync. The multiplayer online game The Old Republic, to be released in 2012 by Bioware, will feature some "hundreds of thousands of lines of dialogue," or the equivalent of about 40 novels.4 At just one-third of the way through the voice work for the game, the amount of audio recorded reportedly had already exceeded the entire six-season run of The Sopranos.5 As if that weren't enough, most games are released in multiple languages; to avoid unsynchronized or "dubbed" speech, all the speech animation must be redone for every language. Video game developers simply can't do this by hand; they need automated solutions.

References
1. F. Pighin et al., "Synthesizing Realistic Facial Expressions from Photographs," Proc. Siggraph, ACM Press, 1998, pp. 75–84.
2. H. McGurk and J. MacDonald, "Hearing Lips and Seeing Voices," Nature, vol. 264, 1976, pp. 746–748.
3. "Rockstar Games' Dan Houser on Grand Theft Auto IV and