University of Pennsylvania ScholarlyCommons

Departmental Papers (CIS) Department of Computer & Information Science

July 2002

Creating Interactive Virtual Humans: Some Assembly Required

Jonathan Gratch USC Institute for Creative Technologies

Jeff Rickel USC Information Sciences Institute

Elisabeth André University of Augsburg

Justine Cassell MIT Media Lab

Eric Petajan Face2Face Animation

Follow this and additional works at: https://repository.upenn.edu/cis_papers
See next page for additional authors

Recommended Citation
Jonathan Gratch, Jeff Rickel, Elisabeth André, Justine Cassell, Eric Petajan, and Norman I. Badler, "Creating Interactive Virtual Humans: Some Assembly Required", July 2002.

Copyright © 2002 IEEE. Reprinted from IEEE Intelligent Systems, Volume 17, Issue 4, July-August 2002, pages 54-63. Publisher URL:http://ieeexplore.ieee.org/xpl/tocresult.jsp?isNumber=22033&puNumber=5254

This material is posted here with permission of the IEEE. Such permission of the IEEE does not in any way imply IEEE endorsement of any of the University of Pennsylvania's products or services. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by writing to [email protected]. By choosing to view this document, you agree to all provisions of the copyright laws protecting it.

This paper is posted at ScholarlyCommons. https://repository.upenn.edu/cis_papers/17 For more information, please contact [email protected].

Abstract
Discusses some of the key issues that must be addressed in creating virtual humans, or androids. As a first step, we overview the issues and available tools in three key areas of virtual human research: face-to-face conversation, emotions and personality, and human figure animation. Assembling a virtual human is still a daunting task, but the building blocks are getting bigger and better every day.


Author(s)
Jonathan Gratch, Jeff Rickel, Elisabeth André, Justine Cassell, Eric Petajan, and Norman I. Badler

This journal article is available at ScholarlyCommons: https://repository.upenn.edu/cis_papers/17

Workshop Report

Creating Interactive Virtual Humans: Some Assembly Required

Jonathan Gratch, USC Institute for Creative Technologies Jeff Rickel, USC Information Sciences Institute Elisabeth André, University of Augsburg Justine Cassell, MIT Media Lab Eric Petajan, Face2Face Animation Norman Badler, University of Pennsylvania

Science fiction has long imagined a future populated with artificial humans—human-looking devices with human-like intelligence. Although Asimov's benevolent robots and the Terminator movies' terrible war machines are still a distant fantasy, researchers across a wide range of disciplines are beginning to work together toward a more modest goal—building virtual humans. These software entities look and act like people and can engage in conversation and collaborative tasks, but they live in simulated environments. With the untidy problems of sensing and acting in the physical world thus dispensed, the focus of virtual human research is on capturing the richness and dynamics of human behavior.

The potential applications of this technology are considerable. History students could visit ancient Greece and debate Aristotle. Patients with social phobias could rehearse threatening social situations in the safety of a virtual environment. Social psychologists could study theories of communication by systematically modifying a virtual human's verbal and nonverbal behavior. A variety of applications are already in progress, including education and training,1 therapy,2 marketing,3,4 and entertainment.5,6

Building a virtual human is a multidisciplinary effort, joining traditional artificial intelligence problems with a range of issues from computer graphics to social science. Virtual humans must act and react in their simulated environment, drawing on the disciplines of automated reasoning and planning. To hold a conversation, they must exploit the full gamut of natural language processing research, from speech recognition and natural language understanding to natural language generation and speech synthesis. Providing human bodies that can be controlled in real time delves into computer graphics and animation. And because an agent looks like a human, people expect it to behave like one as well and will be disturbed by, or misinterpret, discrepancies from human norms. Thus, virtual human research must draw heavily on psychology and communication theory to appropriately convey nonverbal behavior, emotion, and personality.

This broad range of requirements poses a serious problem. Researchers working on particular aspects of virtual humans cannot explore their component in the context of a complete virtual human unless they can understand results across this array of disciplines and assemble the vast range of software tools (for example, speech recognizers, planners, and animation systems) required to construct one. Moreover, these tools were rarely designed to interoperate and, worse, were often designed with different purposes in mind. For example, most computer graphics research has focused on high-fidelity offline image rendering that does not support the fine-grained interactive control that a virtual human must have over its body.

In the spring of 2002, about 30 international researchers from across disciplines convened at the University of Southern California to begin to bridge this gap in knowledge and tools (see www.ict.usc.edu/~vhumans). Our ultimate goal is a modular architecture and interface standards that will allow researchers in this area to reuse each other's work. This goal can only be achieved through a close multidisciplinary collaboration. Towards this end, the workshop gathered a collection of experts representing the range of required research areas, including

• Human figure animation
• Facial animation
• Perception
• Cognitive modeling
• Emotions and personality
• Natural language processing
• Speech recognition and synthesis
• Nonverbal communication
• Distributed simulation
• Computer games

Here we discuss some of the key issues that must be addressed in creating virtual humans. As a first step, we overview the issues and available tools in three key areas of virtual human research: face-to-face conversation, emotions and personality, and human figure animation.

Face-to-face conversation

Human face-to-face conversation involves both language and nonverbal behavior. The behaviors during conversation don't just function in parallel, but interdependently. The meaning of a word informs the interpretation of a gesture, and vice versa. The time scales of these behaviors, however, are different—a quick look at the other person to check that they are listening lasts for less time than it takes to pronounce a single word, while a hand gesture that indicates what the word "caulk" means might last longer than it takes to say, "I caulked all weekend."

Coordinating verbal and nonverbal conversational behaviors for virtual humans requires meeting several interrelated challenges. How speech, intonation, gaze, and head movements make meaning together, the patterns of their co-occurrence in conversation, and what kinds of goals are achieved by the different channels are all equally important for understanding the construction of virtual humans. Speech and nonverbal behaviors do not always manifest the same information, but what they convey is virtually always compatible.7 In many cases, different modalities serve to reinforce one another through redundancy of meaning. In other cases, semantic and pragmatic attributes of the message are distributed across the modalities.8 The compatibility of meaning between gestures and speech recalls the interaction of words and graphics in multimodal presentations.9 For patterns of co-occurrence, there is a tight synchrony among the different conversational modalities in humans. For example, people accentuate important words by speaking more forcefully, illustrating their point with a gesture, and turning their eyes toward the listener when coming to the end of a thought. Meanwhile, listeners nod within a few hundred milliseconds of when the speaker's gaze shifts. This synchrony is essential to the meaning of conversation. When it is destroyed, as in low-bandwidth videoconferencing, satisfaction and trust in the outcome of a conversation diminish.10

Regarding the goals achieved by the different modalities, in natural conversation speakers tend to produce a gesture with respect to their propositional goals (to advance the conversation content), such as making the first two fingers look like legs walking when saying "it took 15 minutes to get here," and speakers tend to use eye movement with respect to interactional goals (to ease the conversation process), such as looking toward the other person when giving up the turn.7 To realistically generate all the different verbal and nonverbal behaviors, then, computational architectures for virtual humans must control both the propositional and interactional structures. In addition, because some of these goals can be equally well met by one modality or the other, the architecture must deal at the level of goals or functions, and not at the level of modalities or behaviors. That is, giving up the turn is often achieved by looking at the listener. But if the speaker's eyes are on the road, he or she can get a response by saying, "Don't you think?"

Constructing a virtual human that can effectively participate in face-to-face conversation requires a control architecture with the following features,4 illustrated by the sketch after this list:

• Multimodal input and output. Because humans in face-to-face conversation send and receive information through gesture, intonation, and gaze as well as speech, the architecture should also support receiving and transmitting this information.
• Real-time feedback. The system must let the speaker watch for feedback and turn requests, while the listener can send these at any time through various modalities. The architecture should be flexible enough to track these different threads of communication in ways appropriate to each thread. Different threads have different response-time requirements; some, such as feedback and interruption, occur on a sub-second time scale. The architecture should reflect this by allowing different processes to concentrate on activities at different time scales.
• Understanding and synthesis of propositional and interactional information. Dealing with propositional information—the communication content—requires building a model of the user's needs and knowledge. The architecture must include a static domain knowledge base and a dynamic discourse knowledge base. Presenting propositional information requires a planning module for presenting multi-sentence output and managing the order of presentation of interdependent facts. Understanding interactional information—about the processes of conversation—on the other hand, entails building a model of the current state of the conversation with respect to the conversational process (to determine who is the current speaker and listener, whether the listener has understood the speaker's contribution, and so on).
• Conversational function model. Functions, such as initiating a conversation or giving up the floor, can be achieved by a range of different behaviors, such as looking repeatedly at another person or bringing your hands down to your lap. Explicitly representing conversational functions, rather than behaviors, provides both modularity and a principled way to combine different modalities. Functional models influence the architecture because the core system modules operate exclusively on functions, while other system modules at the edges translate input behaviors into functions, and functions into output behaviors. This also produces a symmetric architecture because the same functions and modalities are present in both input and output.
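The sketch below illustrates the function-based design just described. It is a minimal Python illustration with invented names and mappings; the systems cited here do not expose this particular interface. Edge modules translate observed behaviors into functions and functions back into output behaviors, so core modules reason only about functions.

    # Illustrative conversational function model (hypothetical names and mappings).
    # Edge modules translate behaviors <-> functions; core modules see only functions.
    INPUT_BEHAVIOR_TO_FUNCTION = {
        ("gaze", "look_at_listener"): "give_turn",
        ("speech", "rising_intonation"): "request_feedback",
        ("head", "nod"): "acknowledge",
    }

    FUNCTION_TO_OUTPUT_BEHAVIORS = {
        "give_turn": [("gaze", "look_at_user"), ("hands", "relax_to_lap")],
        "take_turn": [("gaze", "look_away"), ("speech", "start_utterance")],
        "acknowledge": [("head", "nod")],
    }

    def interpret(behavior_event):
        """Edge module: map an observed (modality, behavior) pair to a function."""
        return INPUT_BEHAVIOR_TO_FUNCTION.get(behavior_event)

    def realize(function, available_modalities):
        """Edge module: choose output behaviors realizing a function, restricted
        to whatever modalities are currently available."""
        return [(m, b) for (m, b) in FUNCTION_TO_OUTPUT_BEHAVIORS.get(function, [])
                if m in available_modalities]

    # The same function is realized differently when a modality is unavailable;
    # a fuller model would fall back to speech ("Don't you think?") here.
    print(realize("give_turn", {"gaze", "hands"}))
    print(realize("give_turn", {"speech"}))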

To capture different time scales and the importance of co-occurrence, input to a virtual human must be incremental and time stamped. For example, incremental speech recognition lets the virtual human give feedback (such as a quick nod) right as the real human finishes a sentence, therefore influencing the direction the human speaker takes. At the very least, the system should report a significant change in state right away, even if full information about the event has not yet been processed. This means that if speech recognition cannot be incremental, at least someone speaking or finished speaking should be relayed immediately, even in the absence of a fully recognized utterance. This lets the virtual human give up the turn when the real human claims it and signal reception after being addressed. When dealing with multiple modalities, fusing interpretations of the different input events is important to understand what behaviors are acting together to convey meaning.12 For this, a synchronized clock across modalities is crucial so events such as exactly when an emphasis beat gesture occurs can be compared to speech, word by word. This requires, of course, that the speech recognizer supply word onset times.

Similarly, for the virtual human to produce a multimodal performance, the output channels also must be incremental and tightly synchronized. Incremental refers to two properties in particular: seamless transitions and interruptible behavior. When producing certain behaviors, such as gestures, the virtual human must reconfigure its limbs in a natural manner, usually requiring that some time be spent on interpolating from a previous posture to a new one. For the transition to be seamless, the virtual human must give the animation system advance notice of events such as gestures, so that it has time to bring the arms into place. Sometimes, however, behaviors must be abruptly interrupted, such as when the real human takes the turn before the virtual human has finished speaking. In that case, the current behavior schedule must be scrapped, the voice halted, and new attentive behaviors initiated—all with reasonable seamlessness.

Synchronicity between modalities is as important in the output as the input. The virtual human must align a graphical behavior with the uttering of particular words or a group of words. The temporal association between the words and behaviors might have been resolved as part of the behavior generation process, as is done in SPUD (Sentence Planning Using Description),8 but it is essential that the speech synthesizer provide a mechanism for maintaining synchrony through the final production stage. There are two types of mechanisms, event based or time based. A text-to-speech engine can usually be programmed to send events on phoneme and word boundaries. Although this is geared towards supporting lip synch, other behaviors can be executed as well. However, this does not allow any time for behavior preparation. Preferably, the TTS engine can provide exact start times for each word prior to playing back the voice, as Festival does.13 This way, we can schedule the behaviors, and thus the transitions between behaviors, beforehand, and then play them back along with the voice for a perfectly seamless performance.

On the output side, one tool that provides such tight synchronicity is the Behavior Expression Animation Toolkit system.11 Figure 1 shows BEAT's architecture. BEAT has the advantage of automatically annotating text with hand gestures, eye gaze, eyebrow movement, and intonation. The annotation is carried out in XML, through interaction with an embedded word ontology module, which creates a set of hypernyms that broadens a knowledge base search of the domain being discussed. The annotation is then passed to a set of behavior generation rules. Output is scheduled so that tight synchronization is maintained among modalities.

Figure 1. Behavior Expression Animation Toolkit text-to-nonverbal behavior module. (The pipeline runs language tagging, behavior suggestion, behavior selection, and behavior scheduling into a translator, drawing on a discourse model, knowledge base, and word timing; generator and filter sets feed the animation.)
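The timing idea can be made concrete with a minimal sketch, assuming word start times are available before playback (the data structures and numbers below are invented for illustration; Festival's actual interface differs). Gestures are launched early enough that their stroke lands on the word they accompany.

    # Minimal sketch: schedule nonverbal behaviors against word start times
    # reported by a TTS engine before audio playback (values are hypothetical).
    word_timings = [            # (word, start time in seconds)
        ("I", 0.00), ("caulked", 0.18), ("all", 0.55), ("weekend", 0.70),
    ]

    behavior_requests = [       # behavior, word it should co-occur with, prep time
        {"behavior": "iconic_gesture_caulk", "word": "caulked", "prep": 0.25},
        {"behavior": "eyebrow_raise",        "word": "weekend", "prep": 0.10},
    ]

    def build_schedule(words, requests):
        starts = dict(words)
        schedule = []
        for r in requests:
            stroke = starts[r["word"]]              # gesture stroke lands on the word
            launch = max(0.0, stroke - r["prep"])   # start early enough to prepare
            schedule.append((launch, r["behavior"]))
        return sorted(schedule)

    for t, behavior in build_schedule(word_timings, behavior_requests):
        print(f"t={t:.2f}s  start {behavior}")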
Emotions and personality

People infuse their verbal and nonverbal behavior with emotion and personality, and modeling such behavior is essential for building believable virtual humans. Consequently, researchers have developed computational models for a wide range of applications. Computational approaches might be roughly divided into communication-driven and simulation-based approaches.

In communication-driven approaches, a virtual human chooses its emotional expression on the basis of its desired impact on the user. Catherine Pelachaud and her colleagues use facial expressions to convey affect in combination with other communicative functions.14 For example, making a request with a sorrowful face can evoke pity and motivate an affirmative response from the listener. An interesting feature of their approach is that the agent deliberately plans whether or not to convey a certain emotion. Tutoring applications usually also follow a communication-driven approach, intentionally expressing emotions with the goal of motivating the students and thus increasing the learning effect. The Cosmo system, where the agent's pedagogical goals drive the selection and sequencing of emotive behaviors, is one example.15 For instance, a congratulatory act triggers a motivational goal to express admiration that is conveyed with applause. To convey appropriate emotive behaviors, agents such as Cosmo need to appraise events not only from their own perspective but also from the perspective of others.

The second category of approaches aims at a simulation of "true" emotion (as opposed to deliberately conveyed emotion). These approaches build on appraisal theories of emotion, the most prominent being Andrew Ortony, Gerald Clore, and Allan Collins' cognitive appraisal theory—commonly referred to as the OCC model.16 This theory views emotions as arising from a valenced reaction to events and objects in the light of agent goals, standards, and attitudes. For example, an agent watching a game-winning move should respond differently depending on which team is preferred.3
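A toy appraisal sketch, not the OCC model itself, makes the idea concrete: an event is checked against the agent's goals, and its valence and intensity follow from how the event affects those goals. The goal names, numbers, and the intensity rule (goal importance weighted by the change in probability of achievement, a simplification used by several systems) are illustrative assumptions.

    # Illustrative appraisal sketch: events are appraised against the agent's goals.
    # Desirable events yield joy/hope; undesirable ones yield distress/fear.
    goals = {
        "win_game":  {"importance": 0.9, "p_success": 0.5},
        "stay_safe": {"importance": 0.6, "p_success": 0.95},
    }

    def appraise(event, goals):
        """event: dict mapping goal name -> change in probability of achievement."""
        emotions = []
        for goal, delta_p in event.items():
            g = goals[goal]
            intensity = g["importance"] * abs(delta_p)
            if delta_p > 0:
                label = "joy" if g["p_success"] + delta_p >= 1.0 else "hope"
            else:
                label = "distress" if g["p_success"] + delta_p <= 0.0 else "fear"
            emotions.append((label, round(intensity, 2), goal))
        return emotions

    # A game-winning move by the preferred team raises p(win_game) sharply;
    # a fan of the other team appraises the same move as lowering it.
    print(appraise({"win_game": +0.4}, goals))
    print(appraise({"win_game": -0.4}, goals))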

Recent work by Stacy Marsella and Jonathan Gratch integrates the OCC model with coping theories that explain how people cope with strong emotions.17 For example, their agents can engage in either problem-focused coping strategies, selecting and executing actions in the world that could improve the agent's emotional state, or emotion-focused coping strategies, improving emotional state by altering the agent's mental state (for example, dealing with guilt by blaming someone else). Further simulation approaches are based on the observation that an agent should be able to dynamically adapt its emotions through its own experience, using learning mechanisms.6,18

Appraisal theories focus on the relationship between an agent's world assessment and the resulting emotions. Nevertheless, they are rather vague about the assessment process. For instance, they do not explain how to determine whether a certain event is desirable. A promising line of research is integrating appraisal theories with AI-based planning approaches,19 which might lead to a concretization of such theories. First, emotions can arise in response to a deliberative planning process (when relevant risks are noticed, progress assessed, and success detected). For example, several approaches derive an emotion's intensity from the importance of a goal and its probability of achievement.20,21 Second, emotions can influence decision-making by allocating cognitive resources to specific goals or threats. Plan-based approaches support the implementation of decision and action selection mechanisms that are guided by an agent's emotional state. For example, the Inhabited Market Place application treats emotions as filters to constrain the decision process when selecting and instantiating dialogue operators.3

In addition to generating affective states, we must also express them in a manner easily interpretable to the user. Effective means of conveying emotions include body gestures, acoustic realization, and facial expressions (see Gary Collier's work for an overview of studies on emotive expressions22). Several researchers use Bayesian networks to model the relationship between emotion and its behavioral expression. Bayesian networks let us deal explicitly with uncertainty, which is a great advantage when modeling the connections between emotions and the resulting behaviors. Gene Ball and Jack Breese presented an example of such an approach. They constructed a Bayesian network that estimates the likelihood of specific body postures and gestures for individuals with different personality types and emotions.23 For instance, a negative emotion increases the probability that an agent will say "Oh, you again," as opposed to "Nice to see you!"
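A toy version of that idea is sketched below. The table of probabilities and the two-variable conditioning are invented for illustration and are far simpler than Ball and Breese's actual network; the point is only that wording (or posture) is sampled from a distribution conditioned on emotion and personality.

    # Toy conditional model in the spirit of Ball and Breese:
    # P(friendly greeting | valence, extroversion). Numbers are invented.
    import random

    P_FRIENDLY = {
        ("positive", "extrovert"): 0.95,
        ("positive", "introvert"): 0.80,
        ("negative", "extrovert"): 0.40,
        ("negative", "introvert"): 0.15,
    }

    def choose_greeting(valence, extroversion, rng=random):
        if rng.random() < P_FRIENDLY[(valence, extroversion)]:
            return "Nice to see you!"
        return "Oh, you again."

    random.seed(0)
    print(choose_greeting("negative", "introvert"))   # usually the curt wording
    print(choose_greeting("positive", "extrovert"))   # usually the friendly wording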
Recent work by Catherine Pelachaud and colleagues employs Bayesian networks to resolve conflicts that occur when different communicative functions need to be shown on different channels of the face, such as eyebrows, mouth shape, gaze direction, head direction, and head movements (see Figure 2).14 In this case, the Bayesian network estimates the likelihood that a face movement overrides another. Bayesian networks also offer a possibility to model how emotions vary over time. Even though neither Ball and Breese nor Pelachaud and colleagues took advantage of this feature, the extension of the two approaches to dynamic Bayesian networks seems obvious.

Figure 2. Pelachaud and colleagues use an MPEG-4 compatible facial animation system to investigate how to resolve conflicts that arise when different communication functions need to be shown on different channels of the face.

While significant progress has been made on the visualization of emotive behaviors, automated speech synthesis still has a long way to go. The most natural-sounding approaches rely on a large inventory of human speech units (for example, combinations of phonemes) that are subsequently selected and combined based on the sentence to be synthesized. These approaches do not yet provide much ability to convey emotion through speech (for example, by varying prosody or intensity). Marc Schröder provides an overview of speech manipulations that have been successfully employed to express several basic emotions.25 While the interest in affective speech synthesis is increasing, hardly any work has been done on conveying emotion through sentence structure or word choice. An exception includes Eduard Hovy's pioneering work on natural language generation that addresses not only the goal of information delivery, but also pragmatic aspects, such as the speaker's emotions.26 Marilyn Walker and colleagues present a first approach to integrating acoustic parameters with other linguistic phenomena, such as sentence structure and wording.27

Obviously, there is a close relationship between emotion and personality. Dave Moffat differentiates between personality and emotion using the two dimensions duration and focus.28 Whereas personality remains stable over a long period of time, emotions are short-lived. Moreover, while emotions focus on particular events or objects, factors determining personality are more diffuse and indirect. Because of this obvious relationship, several projects aim to develop an integrated model of emotion and personality. As an example, Ball and Breese model dependencies between emotions and personality in a Bayesian network.23

To enhance the believability of animated agents beyond reasoning about emotion and personality, Helmut Prendinger and colleagues model the relationship between an agent's social role and the associated constraints on emotion expression, for example, by suppressing negative emotion when interacting with higher-status individuals.29
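A minimal sketch of such a social filter, under the assumption of a simple status-difference rule (the thresholds and scaling are invented, not Prendinger and Ishizuka's model): the intensity of a displayed negative emotion is damped when the interlocutor has higher status.

    # Minimal social-filter sketch (invented rule): suppress displayed negative
    # emotion in proportion to the interlocutor's relative status.
    def displayed_intensity(felt_intensity, valence, own_status, other_status):
        if valence == "negative" and other_status > own_status:
            suppression = min(1.0, 0.5 * (other_status - own_status))
            return felt_intensity * (1.0 - suppression)
        return felt_intensity

    # Mild annoyance at a peer is shown; the same annoyance at a superior is damped.
    print(displayed_intensity(0.6, "negative", own_status=1, other_status=1))  # 0.6
    print(displayed_intensity(0.6, "negative", own_status=1, other_status=3))  # 0.0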

Another line of research aims at providing an enabling technology to support affective interactions. This includes both the definition of standardized languages for specifying emotive behaviors, such as the Affective Presentation Markup Language14 or the Emotion Markup Language (www.vhml.org), as well as the implementation of toolkits for affective computing combining a set of components addressing affective knowledge acquisition, representation, reasoning, planning, communication, and expression.30

Human figure animation

By engaging in face-to-face conversation, conveying emotion and personality, and otherwise interacting with the synthetic environment, virtual humans impose fairly severe behavioral requirements on the underlying animation system that must render their physical bodies. Most production work involves animator effort to design or script movements or direct performer motion capture. Replaying movements in real time is not the issue; rather, it is creating novel, contextually sensitive movements in real time that matters. Interactive and conversational agents, for example, will not enjoy the luxury of relying on animators to create human time-frame responses. Animation techniques must span a variety of body systems: locomotion, manual gestures, hand movements, body pose, faces, eyes, speech, and other physiological necessities such as breathing, blinking, and perspiring. Research in human figure animation has addressed all of these modalities, but historically the work focuses either on the animation of complete body movements or on animation of the face.

Body animation methods

In body animation, there are two basic ways to gain the required interactivity: use motion capture and additional techniques to rapidly modify or re-target movements to immediate needs,31 or write procedural code that allows program control over important movement parameters.32 The difficulty with the motion capture approach is maintaining environmental constraints such as solid foot contacts and proper reach, grasp, and observation interactions with the agent's own body parts and other objects. To alleviate these problems, procedural approaches parameterize target locations, motion qualities, and other movement constraints to form a plausible movement directly. Procedural approaches consist of kinematic and dynamics techniques. Each has its preferred domain of applicability; kinematics is generally better for goal-directed activities and slower (controlled) actions, and dynamics is more natural for movements directed by application of forces, impacts, or high-speed behaviors.33 The wide range of human movement demands that both approaches have real-time implementations that can be procedurally selected as required.

Animating a human body form requires more than just controlling skeletal rotation angles. People are neither skeletons nor robots, and considerable human qualities arise from intelligent movement strategies, soft deformable surfaces, and clothing. Movement strategies include reach or constrained contacts, often achieved with goal-directed inverse kinematics.34 Complex workplaces, however, entail more complex planning to avoid collisions, find free paths, and optimize strength availability.
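Goal-directed inverse kinematics can be illustrated with the classic planar two-link arm: given a shoulder-relative reach target, solve for the shoulder and elbow angles. This is only a sketch of the underlying idea; the full-body, real-time techniques cited above handle many more joints and constraints.

    import math

    def two_link_ik(target_x, target_y, upper_len, fore_len):
        """Planar two-link IK: return (shoulder_angle, elbow_angle) in radians
        placing the wrist at the target, or None if the target is unreachable."""
        d2 = target_x**2 + target_y**2
        d = math.sqrt(d2)
        if d > upper_len + fore_len or d < abs(upper_len - fore_len):
            return None  # out of reach
        # Law of cosines for the elbow, then offset the shoulder toward the target.
        cos_elbow = (d2 - upper_len**2 - fore_len**2) / (2 * upper_len * fore_len)
        elbow = math.acos(max(-1.0, min(1.0, cos_elbow)))
        k1 = upper_len + fore_len * math.cos(elbow)
        k2 = fore_len * math.sin(elbow)
        shoulder = math.atan2(target_y, target_x) - math.atan2(k2, k1)
        return shoulder, elbow

    print(two_link_ik(0.4, 0.3, upper_len=0.3, fore_len=0.25))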
The suppleness of human skin and the underlying tissue biomechanics lead to shape changes caused by internal muscle actions as well as external contact with the environment. Modeling and animating the local, muscle-based deformation of body surfaces in real time is possible through shape morphing techniques,35,36 but providing appropriate shape changes in response to external forces is a challenging problem. "Skin-tight" texture-mapped clothing is prevalent in computer game characters, but animating draped or flowing garments requires dynamic simulation, fast collision detection, and appropriate collision response.37,38

Accordingly, animation systems build procedural models of these various behaviors and execute them on human models. The diversity of body movements involved has led to building more consistent agents: procedural animations that affect and control multiple body communication channels in coordinated ways.11,24,39,40 The particular challenge here is constructing computer graphics human models that balance sufficient articulation, detail, and motion generators to effect both gross and subtle movements with realism, real-time responsiveness, and visual acceptability. And if that isn't enough, consider the additional difficulty of modeling a specific real individual. Computer graphics still lacks effective techniques to transfer even captured motion into features that characterize a specific person's mannerisms and behaviors, though machine-learning approaches could prove promising.41

Figure 3. PeopleShop and DI-Guy are used to create scenarios for ground combat training. This scenario was used at Ft. Benning to enhance situation awareness in experiments to train US Army officers for urban combat. Image courtesy of Boston Dynamics.

Implementing an animated human body is complicated by a relative paucity of generally available tools. Body models tend to be proprietary (for example, Extempo.com, Ananova.com), optimized for real time and thus limited in body structure and features (for example, DI-Guy, BDI.com, illustrated in Figure 3), or constructions for particular animations built with standard animator tools such as Poser, Maya, or 3DSMax. The best attempt to design a transportable, standard avatar is the Web3D Consortium's H-Anim effort (www.h-anim.org). With well-defined body structure and feature sites, the H-Anim specification has engendered model sharing and testing not possible with proprietary approaches. The present liability is the lack of an application programming interface in the VRML language binding of H-Anim. A general API for human models is a highly desirable next step, the benefits of which have been demonstrated by Norman Badler's research group's use of the software API in Jack (www.ugs.com/products/efactory/jack), which allows feature access and provides plug-in extensions for new real-time behaviors.
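What such an API might look like is sketched below. H-Anim standardizes joint names and a body hierarchy, not a programming interface, so the classes and methods here are hypothetical; only the joint names follow the H-Anim naming convention.

    # Hypothetical human-model API sketch: named joints in a hierarchy with
    # rotation access that a behavior module could drive. The interface is
    # invented; H-Anim itself specifies the joint set and structure only.
    class Joint:
        def __init__(self, name, parent=None):
            self.name, self.parent, self.children = name, parent, []
            self.rotation = (0.0, 0.0, 0.0)   # Euler angles, radians
            if parent:
                parent.children.append(self)

    class HumanModel:
        def __init__(self):
            root = Joint("HumanoidRoot")
            chest = Joint("vt5", root)
            l_shoulder = Joint("l_shoulder", chest)
            l_elbow = Joint("l_elbow", l_shoulder)
            l_wrist = Joint("l_wrist", l_elbow)
            self.joints = {j.name: j for j in (root, chest, l_shoulder, l_elbow, l_wrist)}

        def set_rotation(self, joint_name, rotation):
            self.joints[joint_name].rotation = rotation

    model = HumanModel()
    model.set_rotation("l_elbow", (0.0, 0.0, 1.2))   # bend the elbow about 69 degrees
    print(model.joints["l_elbow"].rotation)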
Face animation methods

A computer-animated human face can evoke a wide range of emotions in real people because faces are central to human reality. Unfortunately, modeling and rendering artifacts can easily produce a negative response in the viewer. The great complexity and psychological depth of the human response to faces causes difficulty in predicting the response to a given animated face model. The partial or minimalist rendering of a face can be pleasing as long as it maintains quality and accuracy in certain key dimensions. The ultimate goal is to analyze and synthesize humans with enough fidelity and control to pass the Turing test, create any kind of virtual being, and enable total control over its virtual appearance. Eventually, surviving technologies will be combined to increase accuracy and efficiency of the capture, linguistic, and rendering systems. Currently the approaches to animating the face are disjoint and driven by production costs and imperfect technology. Each method presents a distinct "look and feel," as well as advantages and disadvantages.

Facial animation methods fall into three major categories. The first and earliest method is to manually generate keyframes and then automatically interpolate frames between the keyframes (or use less skilled animators). This approach is used in traditional cell animation and in 3D animated feature films. Keyframe and morph target animation provides complete artistic control but can be time consuming to perfect. The second method is to synthesize facial movements from text or acoustic speech. A TTS algorithm, or an acoustic speech recognizer, provides a translation to phonemes, which are then mapped to visemes (visual phonemes). The visemes drive a speech articulation model that animates the face. The convincing synthesis of a face from text has yet to be accomplished. The state of the art provides understandable acoustic and visual speech and facial expressions.42,43

The third and most recent method for animating a face model is to measure human facial movements directly and then apply the motion data to the face model. The model can capture facial motions using one or more cameras and can incorporate face markers, structured light, laser range finders, and other face measurement modes. Each facial motion capture approach has limitations that might require postprocessing to overcome. The ideal motion-capture data representation supports sufficient detail without sacrificing editability (for example, MPEG-4 Facial Animation Parameters). The choice of modeling and rendering technologies ranges from 2D line drawings to physics-based 3D models with muscles, skin, and bone.44,45 Of course, textured polygons (nonuniform rational b-splines and subdivision surfaces) are by far the most common. A variety of surface deformation schemes exist that attempt to simulate the natural deformations of the human face while driven by external parameters.46,47

MPEG-4, which was designed for high-quality visual communication at low bitrates coupled with low-cost graphics rendering systems, offers one existing standard for human figure animation. It contains a comprehensive set of tools for representing and compressing content objects and the animation of those objects, and it treats virtual humans (faces and bodies) as a special type of object. The MPEG-4 Face and Body Animation standard provides anatomically specific locations and animation parameters. It defines Face Definition Parameter feature points and locates them on the face (see Figure 4). Some of these points only serve to help define the face's shape. The rest of them are displaced by Facial Animation Parameters, which specify feature point displacements from the neutral face position. Some FAPs are descriptors for visemes and emotional expressions. Most remaining FAPs are normalized to be proportional to neutral face mouth width, mouth-nose distance, eye separation, iris diameter, or eye-nose distance. Although MPEG-4 has defined a limited set of visemes and facial expressions, designers can specify two visemes or two expressions with a blend factor between the visemes and an intensity value for each expression. The normalization of the FAPs gives the face model designer freedom to create characters with any facial proportions, regardless of the source of the FAPs. They can embed MPEG-4 compliant face models into decoders, store them on CD-ROM, download them as an executable from a Web site, or build them into a Web browser.

Figure 4. The set of MPEG-4 Face Definition Parameter (FDP) feature points. (The figure distinguishes feature points affected by FAPs from other feature points, covering the eyes, nose, mouth, teeth, and tongue.)
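The normalization idea can be sketched compactly. In the illustration below, a feature point is displaced by a FAP value expressed in face-proportional units derived from the neutral face; the point names, unit values, and scaling are illustrative assumptions rather than the full standard's encoding.

    # Illustrative FAP application: FAP values are expressed in face-specific units
    # (for example, mouth-nose separation or mouth width of this particular model),
    # so the same FAP stream can animate faces of any proportion.
    neutral_face_units = {
        "MNS": 0.02,   # mouth-nose separation of this model, in meters (assumed)
        "MW":  0.05,   # mouth width (assumed)
    }
    feature_points = {
        "lower_lip_mid": [0.0, -0.01, 0.0],   # neutral position (x, y, z)
    }

    def apply_fap(point_name, axis, fap_value, unit_name):
        """Displace a feature point by fap_value face-proportional units along one axis."""
        displacement = fap_value * neutral_face_units[unit_name]
        feature_points[point_name][axis] += displacement
        return feature_points[point_name]

    # Open the jaw slightly: push the lower-lip point down by 0.3 mouth-nose units.
    print(apply_fap("lower_lip_mid", 1, -0.3, "MNS"))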

Integration challenges

Integrating all the various elements described here into a virtual human is a daunting task. It is difficult for any single research group to do it alone. Reusable tools and modular architectures would be an enormous benefit to virtual human researchers, letting them leverage each other's work. Indeed, some research groups have begun to share tools, and several standards have recently emerged that will further encourage sharing. However, we must confront several difficult issues before we can readily plug-and-play different modules to control a virtual human's behavior. Two key issues discussed at the workshop were consistency and timing of behavior.

Consistency

When combining a variety of behavioral components, one problem is maintaining consistency between the agent's internal state (for example, goals, plans, and emotions) and the various channels of outward behavior (for example, speech and body movements). When real people present multiple behavior channels, we interpret them for consistency, honesty, and sincerity, and for social roles, relationships, power, and intention. When these channels conflict, the agent might simply look clumsy or awkward, but it could appear insincere, confused, conflicted, emotionally detached, repetitious, or simply fake. To an actor or an expert animator, this is obvious. Bad actors might fail to control gestures or facial expressions to portray the demeanor of their persona in a given situation. The actor might not have internalized the character's goals and motivations enough to use the body's own machinery to manifest these inner drives as appropriate behaviors. A skilled animator (and actor) knows that all aspects of a character must be consistent with its desired mental state because we can control only voice, body shape, and movement for the final product. We cannot open a dialog with a pre-animated character to further probe its mind or its psychological state. With a real-time embodied agent, however, we might indeed have such an opportunity.

One approach to remedying this problem is to explicitly coordinate the agent's internal state with the expression of body movements in all possible channels. For example, Norman Badler's research group has been building a system, EMOTE, to parameterize and modulate action performance.24 It is based on Laban Movement Analysis, a human movement observation system. EMOTE is not an action selector per se; it is used to modify the execution of a given behavior and thus change its movement qualities or character. EMOTE's power arises from the relatively small number of parameters that control or affect a much larger set, and from new extensions to the original definitions that include non-articulated face movements. The same set of parameters controls many aspects of manifest behavior across the agent's body and therefore permits experimentation with similar or dissimilar settings. The hypothesis is that behaviors manifest in separate channels with similar EMOTE parameters will appear consistent to some internal state of the agent; conversely, dissimilar EMOTE parameters will convey various negative impressions of the character's internal consistency. Most computer-animated agents provide direct evidence for the latter view:

• Arm gestures without facial expressions look odd.
• Facial expressions with neutral gestures look artificial.
• Arm gestures without torso involvement look insincere.
• Attempts at emotions in gait variations look funny without concomitant body and facial affect.
• Otherwise carefully timed gestures and speech fail to register with gesture performance and facial expressions.
• Repetitious actions become irritating because they appear unconcerned about our changing (more negative) feelings about them.
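The flavor of this kind of parameterization is sketched below. The parameters are named after Laban effort qualities, but the numeric mapping to timing and amplitude is invented for illustration and is not the EMOTE model; the point is that one small set of parameters, applied uniformly across channels, reshapes how a gesture is performed.

    # Illustrative effort-style modulation: a few parameters rescale a gesture's
    # keyframes. Numeric mapping is invented; EMOTE's Effort/Shape model is richer.
    def modulate(keyframes, time=0.0, weight=0.0, flow=0.0):
        """keyframes: list of (t_seconds, amplitude). time > 0 means sudden
        (compressed), time < 0 sustained (stretched); weight scales amplitude;
        bound flow (flow < 0) delays and eases the onset."""
        stretch = 1.0 - 0.4 * time
        scale = 1.0 + 0.5 * weight
        out = []
        for i, (t, a) in enumerate(keyframes):
            t2 = t * stretch
            if flow < 0 and i == 0:
                t2 += 0.1 * abs(flow)
            out.append((round(t2, 3), round(a * scale, 3)))
        return out

    beat_gesture = [(0.0, 0.0), (0.2, 1.0), (0.5, 0.0)]
    print(modulate(beat_gesture, time=+0.8, weight=+0.5))   # quick, forceful
    print(modulate(beat_gesture, time=-0.8, weight=-0.5))   # slow, light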

Timing

In working together toward a unifying architecture, timing emerged as a central concern at the workshop. A virtual human's behavior must unfold over time, subject to a variety of temporal constraints. For example, speech-related gestures must closely follow the voice cadence. It became obvious during the workshop that previous work focused on a specific aspect of behavior (for example, speech, reactivity, or emotion), leading to architectures that are tuned to a subset of timing constraints and cannot straightforwardly incorporate others. During the final day of the workshop, we struggled with possible architectures that might address this limitation.

For example, BEAT schedules speech-related body movements using a pipelined architecture: a text-to-speech system generates a fixed timeline to which a subsequent gesture scheduler must conform. Essentially, behavior is a slave to the timing constraints of the speech synthesis tool. In contrast, systems that try to physically convey a sense of emotion or personality often work by altering the time course of gestures. For example, EMOTE works later in the pipeline, taking a previously generated sequence of gestures and shortening or drawing them out for emotional effect. Essentially, behavior is a slave to the constraints of emotional dynamics. Finally, some systems have focused on making the character highly reactive and embedded in the synthetic environment. For example, Mr. Bubb of Zoesis Studios (see Figure 5) is tightly responsive to unpredictable and continuous changes in the environment (such as mouse movements or bouncing balls). In such systems, behavior is a slave to environmental dynamics. Clearly, if these various capabilities are to be combined, we must reconcile these different approaches.

Figure 5. Mr. Bubb is an interactive character developed by Zoesis Studios that reacts continuously to the user's social interactions during a cooperative play experience. Image courtesy of Zoesis Studios.

One outcome of the workshop was a number of promising proposals for reconciling these competing constraints. At the very least, much more information must be shared between components in the pipeline. For example, if BEAT had more access to timing constraints generated by EMOTE, it could do a better job of up-front scheduling. Another possibility would be to specify all of the constraints explicitly and devise an animation system flexible enough to handle them all, an approach the motion graph technique suggests.48 Norman Badler suggests an interesting pipeline architecture that consists of "fat" pipes with weak uplinks. Modules would send down considerably more information (and possibly multiple options) and could poll downstream modules for relevant information (for example, how long would it take to look at the ball, given its current location). Exploring these and other alternatives is an important open problem in virtual human research.
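One way to read the "fat pipes with weak uplinks" proposal is sketched below. The interface is hypothetical: upstream modules pass rich, possibly multi-option data downstream and can poll downstream modules for cheap estimates (such as how long an action would take) before committing to a schedule.

    # Hypothetical "fat pipe with weak uplinks" interface sketch.
    class AnimationModule:
        def estimate_duration(self, action, **context):
            # Weak uplink: a cheap query, not a full animation request.
            if action == "look_at":
                return 0.2 + 0.1 * context.get("gaze_shift_radians", 0.0)
            return 0.5

        def execute(self, action, **context):
            print(f"executing {action} with {context}")

    class GestureScheduler:
        def __init__(self, downstream):
            self.downstream = downstream

        def schedule(self, word_start, action, **context):
            lead = self.downstream.estimate_duration(action, **context)
            return max(0.0, word_start - lead)   # launch early enough to land on the word

    animator = AnimationModule()
    scheduler = GestureScheduler(animator)
    print(scheduler.schedule(word_start=1.0, action="look_at", gaze_shift_radians=1.5))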

The future of androids remains to be seen, but realistic interactive virtual humans will almost certainly populate our near future, guiding us toward opportunities to learn, enjoy, and consume. The move toward sharable tools and modular architectures will certainly hasten this progress, and, although significant challenges remain, work is progressing on multiple fronts. The emergence of animation standards such as MPEG-4 and H-Anim has already facilitated the modular separation of animation from behavioral controllers and sparked the development of higher-level extensions such as the Affective Presentation Markup Language. Researchers are already sharing behavioral models such as BEAT and EMOTE. We have outlined only a subset of the many issues that arise, ignoring many of the more classical AI issues such as perception, planning, and learning. Nonetheless, we have highlighted the considerable recent progress towards interactive virtual humans and some of the key challenges that remain. Assembling a new virtual human is still a daunting task, but the building blocks are getting bigger and better every day.

Acknowledgments

We are indebted to Bill Swartout for developing the initial proposal to run a Virtual Human workshop, to Jim Blake for helping us arrange funding, and to Julia Kim and Lynda Strand for handling all the logistics and support, without which the workshop would have been impossible. We gratefully acknowledge the Department of the Army for their financial support under contract number DAAD 19-99-D-0046. We also acknowledge the support of the US Air Force F33615-99-D-6001 (D0 #8), Office of Naval Research K-5-55043/3916-1552793, NSF IIS99-00297, and NASA NRA NAG 9-1279. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of these agencies.

References

1. W.L. Johnson, J.W. Rickel, and J.C. Lester, "Animated Pedagogical Agents: Face-to-Face Interaction in Interactive Learning Environments," Int'l J. AI in Education, vol. 11, Nov. 2000, pp. 47–78.

2. S.C. Marsella, W.L. Johnson, and C. LaBore, "Interactive Pedagogical Drama," Proc. 4th Int'l Conf. Autonomous Agents, ACM Press, New York, 2000, pp. 301–308.

3. E. André et al., "The Automated Design of Believable Dialogues for Animated Presentation Teams," Embodied Conversational Agents, MIT Press, Cambridge, Mass., 2000, pp. 220–255.

4. J. Cassell et al., "Human Conversation as a System Framework: Designing Embodied Conversational Agents," Embodied Conversational Agents, MIT Press, Cambridge, Mass., 2000, pp. 29–63.

5. J.E. Laird, "It Knows What You're Going To Do: Adding Anticipation to a Quakebot," Proc. 5th Int'l Conf. Autonomous Agents, ACM Press, New York, 2001, pp. 385–392.

6. S. Yoon, B. Blumberg, and G.E. Schneider, "Motivation Driven Learning for Interactive Synthetic Characters," Proc. 4th Int'l Conf. Autonomous Agents, ACM Press, New York, 2000, pp. 365–372.

7. J. Cassell, "Nudge, Nudge, Wink, Wink: Elements of Face-to-Face Conversation for Embodied Conversational Agents," Embodied Conversational Agents, MIT Press, Cambridge, Mass., 2000, pp. 1–27.

8. J. Cassell, M. Stone, and H. Yan, "Coordination and Context-Dependence in the Generation of Embodied Conversation," Proc. INLG 2000, 2000.

9. M.T. Maybury, ed., Intelligent Multimedia Interfaces, AAAI Press, Menlo Park, Calif., 1993.

10. B. O'Conaill and S. Whittaker, "Characterizing, Predicting, and Measuring Video-Mediated Communication: A Conversational Approach," Video-Mediated Communication: Computers, Cognition, and Work, Lawrence Erlbaum Associates, Mahwah, N.J., 1997, pp. 107–131.

11. J. Cassell, H. Vilhjálmsson, and T. Bickmore, "BEAT: The Behavior Expression Animation Toolkit," to be published in Proc. SIGGRAPH '01, ACM Press, New York, 2001.

12. M. Johnston et al., "Unification-Based Multimodal Integration," Proc. 35th Ann. Meeting Assoc. Computational Linguistics (ACL 97/EACL 97), Morgan Kaufmann, San Francisco, 2001, pp. 281–288.

13. P. Taylor et al., "The Architecture of the Festival Speech Synthesis System," 3rd ESCA Workshop Speech Synthesis, 1998.

14. C. Pelachaud et al., "Embodied Agent in Information Delivering Application," Proc. Autonomous Agents and Multiagent Systems, ACM Press, New York, 2002.

15. J.C. Lester et al., "Deictic and Emotive Communication in Animated Pedagogical Agents," Embodied Conversational Agents, MIT Press, Cambridge, Mass., 2000, pp. 123–154.

16. A. Ortony, G.L. Clore, and A. Collins, The Cognitive Structure of Emotions, Cambridge Univ. Press, Cambridge, UK, 1988.

17. S. Marsella and J. Gratch, "A Step Towards Irrationality: Using Emotion to Change Belief," to be published in Proc. 1st Int'l Joint Conf. Autonomous Agents and Multi-Agent Systems, ACM Press, New York, 2002.

18. M.S. El-Nasr, T.R. Ioerger, and J. Yen, "Peteei: A Pet with Evolving Emotional Intelligence," Proc. 3rd Int'l Conf. Autonomous Agents, ACM Press, New York, 1999, pp. 9–15.

19. S. Marsella and J. Gratch, "Modeling the Interplay of Emotions and Plans in Multi-Agent Simulations," Proc. 23rd Ann. Conf. Cognitive Science Society, Lawrence Erlbaum, Mahwah, N.J., 2001, pp. 294–599.

20. A. Sloman, "Motives, Mechanisms and Emotions," Cognition and Emotion, The Philosophy of Artificial Intelligence, Oxford Univ. Press, Oxford, UK, 1990, pp. 231–247.

21. J. Gratch, "Emile: Marshalling Passions in Training and Education," Proc. 4th Int'l Conf. Autonomous Agents, ACM Press, New York, 2000.

22. G. Collier, Emotional Expression, Lawrence Erlbaum, Hillsdale, N.J., 1985.

23. G. Ball and J. Breese, "Emotion and Personality in a Conversational Character," Embodied Conversational Agents, MIT Press, Cambridge, Mass., 2000, pp. 189–219.

24. D. Chi et al., "The EMOTE Model for Effort and Shape," ACM SIGGRAPH '00, ACM Press, New York, 2000, pp. 173–182.

25. M. Schröder, "Emotional Speech Synthesis: A Review," Proc. Eurospeech 2001, ISCA, Bonn, Germany, 2001, pp. 561–564.

26. E. Hovy, "Some Pragmatic Decision Criteria in Generation," Natural Language Generation, Martinus Nijhoff Publishers, Dordrecht, Germany, 1987, pp. 3–17.

27. M. Walker, J. Cahn, and S.J. Whittaker, "Improving Linguistic Style: Social and Affective Bases for Agent Personality," Proc. Autonomous Agents '97, ACM Press, New York, 1997, pp. 96–105.

28. D. Moffat, "Personality Parameters and Programs," Creating Personalities for Synthetic Actors, Springer Verlag, New York, 1997, pp. 120–165.

29. H. Prendinger and M. Ishizuka, "Social Role Awareness in Animated Agents," Proc. 5th Conf. Autonomous Agents, ACM Press, New York, 2001, pp. 270–377.

30. A. Paiva et al., "Satira: Supporting Affective Interactions in Real-Time Applications," CAST 2001: Living in Mixed Realities, FhG-ZPS, Schloss Birlinghoven, Germany, 2001, pp. 227–230.

31. M. Gleicher, "Comparing Constraint-based Motion Editing Methods," Graphical Models, vol. 62, no. 2, Feb. 2001, pp. 107–134.

32. N. Badler, C. Phillips, and B. Webber, Simulating Humans: Computer Graphics, Animation, and Control, Oxford Univ. Press, New York, 1993.

33. M. Raibert and J. Hodgins, "Animation of Dynamic Legged Locomotion," ACM Siggraph, vol. 25, no. 4, July 1991, pp. 349–358.

34. D. Tolani, A. Goswami, and N. Badler, "Real-Time Inverse Kinematics Techniques for Anthropomorphic Limbs," Graphical Models, vol. 62, no. 5, May 2000, pp. 353–388.

35. J. Lewis, M. Cordner, and N. Fong, "Pose Space Deformations: A Unified Approach to Shape Interpolation and Skeleton-driven Deformation," ACM Siggraph, July 2000, pp. 165–172.

36. P. Sloan, C. Rose, and M. Cohen, "Shape by Example," Symp. Interactive 3D Graphics, Mar. 2001, pp. 135–144.

37. D. Baraff and A. Witkin, "Large Steps in Cloth Simulation," ACM Siggraph, July 1998, pp. 43–54.

38. N. Magnenat-Thalmann and P. Volino, Dressing Virtual Humans, Morgan Kaufmann, San Francisco, 1999.

39. K. Perlin, "Real Time Responsive Animation with Personality," IEEE Trans. Visualization and Computer Graphics, vol. 1, no. 1, Jan. 1995, pp. 5–15.

40. M. Byun and N. Badler, "FacEMOTE: Qualitative Parametric Modifiers for Facial Animations," Symp. Computer Animation, ACM Press, New York, 2002.

41. M. Brand and A. Hertzmann, "Style Machines," ACM Siggraph, July 2000, pp. 183–192.

42. E. Cosatto and H.P. Graf, "Photo-Realistic Talking-Heads from Image Samples," IEEE Trans. Multimedia, vol. 2, no. 3, Mar. 2000, pp. 152–163.

43. T. Ezzat, G. Geiger, and T. Poggio, "Trainable Videorealistic Speech Animation," to be published in Proc. ACM Siggraph, 2002.

44. D. Terzopoulos and K. Waters, "Analysis and Synthesis of Facial Images Using Physical and Anatomical Models," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 15, no. 6, June 1993, pp. 569–579.

45. F.I. Parke and K. Waters, Computer Facial Animation, AK Peters, Wellesley, Mass., 1996.

46. T.K. Capin, E. Petajan, and J. Ostermann, "Efficient Modeling of Virtual Humans in MPEG-4," IEEE Int'l Conf. Multimedia, 2000, pp. 1103–1106.

47. T.K. Capin, E. Petajan, and J. Ostermann, "Very Low Bitrate Coding of Virtual Human Animation in MPEG-4," IEEE Int'l Conf. Multimedia, 2000, pp. 1107–1110.

48. L. Kovar, M. Gleicher, and F. Pighin, "Motion Graphs," to be published in Proc. ACM Siggraph 2002, ACM Press, New York.

Jonathan Gratch is a project leader for the stress and emotion project at the University of Southern California's Institute for Creative Technologies and is a research assistant professor in the department of computer science. His research interests include cognitive science, emotion, planning, and the use of simulation in training. He received his PhD in computer science from the University of Illinois. Contact him at the USC Inst. for Creative Technologies, 13274 Fiji Way, Marina del Rey, CA 90292; [email protected]; www.ict.usc.edu/~gratch.

Jeff Rickel is a project leader at the University of Southern California's Information Sciences Institute and a research assistant professor in the department of computer science. His research interests include intelligent agents for education and training, especially animated agents that collaborate with people in virtual reality. He received his PhD in computer science from the University of Texas at Austin. Contact him at the USC Information Sciences Inst., 4676 Admiralty Way, Ste. 1001, Marina del Rey, CA 90292; [email protected]; www.isi.edu/~rickel.

Elisabeth André is a full professor at the University of Augsburg, Germany. Her research interests include embodied conversational agents, multimedia interfaces, and the integration of vision and natural language. She received her PhD from the University of Saarbrücken, Germany. Contact her at Lehrstuhl für Multimedia Konzepte und Anwendungen, Institut für Informatik, Universität Augsburg, Eichleitnerstr. 30, 86135 Augsburg, Germany; [email protected]; www.informatik.uni-augsburg.de/~lisa.

Justine Cassell is an associate professor at the MIT Media Lab, where she directs the Gesture and Narrative Language Research Group. Her research interests are bringing knowledge about verbal and nonverbal aspects of human conversation and storytelling to computational systems design, and the role that technologies—such as a virtual storytelling peer that has the potential to encourage children in creative, empowered, and independent learning—play in children's lives. She holds a master's degree in literature from the Université de Besançon (France), a master's degree in linguistics from the (Scotland), and a dual PhD from the in psychology and linguistics. Contact her at [email protected].

Eric Petajan is chief scientist and founder of face2face animation and chaired the MPEG-4 Face and Body Animation (FBA) group. Prior to forming face2face, he was a Bell Labs researcher, where he developed facial motion capture, HD video coding, and interactive graphics systems. Starting in 1989, he was a leader in the development of HDTV technology and standards leading up to the US HDTV Grand Alliance. He received a PhD in EE in 1984 and an MS in physics from the University of Illinois, where he built the first automatic lipreading system. He is also associate editor of the IEEE Transactions on Circuits and Systems for Video Technology. Contact him at [email protected].

Norman Badler is a professor of computer and information science at the University of Pennsylvania and has been on that faculty since 1974. Active in computer graphics since 1968, his research focuses on human figure modeling, real-time animation, embodied agent software, and computational connections between language and action. He received a BA in creative studies mathematics from the University of California at Santa Barbara in 1970, an MSc in mathematics in 1971, and a PhD in computer science in 1975, both from the University of Toronto. He directs the Center for Human Modeling and Simulation and is the Associate Dean for Academic Affairs in the School of Engineering and Applied Science at the University of Pennsylvania. Contact him at [email protected].
