UC Merced Proceedings of the Annual Meeting of the Cognitive Science Society

UC Merced Proceedings of the Annual Meeting of the Cognitive Science Society Title Using listener gaze to augment speech generation in a virtual 3D environment Permalink https://escholarship.org/uc/item/7vn7t95m Journal Proceedings of the Annual Meeting of the Cognitive Science Society, 34(34) ISSN 1069-7977 Authors Staudte, Maria Koller, Alexander Garoufi, Konstantina et al. Publication Date 2012 Peer reviewed eScholarship.org Powered by the California Digital Library University of California Using listener gaze to augment speech generation in a virtual 3D environment Maria Staudte Alexander Koller Konstantina Garoufi Matthew Crocker Saarland University University of Potsdam University of Potsdam Saarland University Abstract ments. A natural language generation (NLG) system, that ex- Listeners tend to gaze at objects to which they resolve referring ploited this information to provide direct feedback, commu- expressions. We show that this remains true even when these nicated its intended referent to the listener more effectively objects are presented in a virtual 3D environment in which lis- than similar systems that did not draw on listener gaze. Gaze- teners can move freely. We further show that an automated speech generation system that uses eyetracking information based feedback was further shown to increase listener atten- to monitor listener’s understanding of referring expressions tion to potential target objects in a scene, indicating a gen- outperforms comparable systems that do not draw on listener erally more focused and task-oriented listener behavior. This gaze. system is, to our knowledge, the first NLG system that adjusts Introduction its referring expressions to listener gaze. In situated spoken interaction, there is evidence that the gaze Related work of interlocutors can augment both language comprehension and production processes. For example, speaker gaze to ob- Previous research has shown that listeners align with speak- jects that are about to be mentioned (Griffin & Bock, 2000) ers by visually attending to mentioned objects (Tanenhaus has been shown to benefit listener comprehension by direct- et al., 1995) and, if possible, to what the speaker attends to ing listener gaze to the intended visual referents (Hanna & (Richardson & Dale, 2005; Hanna & Brennan, 2007; Staudte Brennan, 2007; Staudte & Crocker, 2011; Kreysa & Knoe- & Crocker, 2011). Little is known, however, about speaker ferle, 2011). Even when speaker gaze is not visible to the adaptation to the listener’s (gaze) behavior, in particular when listener, however, listeners are known to rapidly attend to this occurs in dynamic and goal-oriented situations. Typi- mentioned objects (Tanenhaus, Spivey-Knowlton, Eberhard, cally, Visual World experiments have used simple and static & Sedivy, 1995). This gaze behavior on the part of listeners visual scenes and disembodied utterances and have analyzed potentially provides speakers with useful feedback regarding the recorded listener gaze off-line (e.g., Altmann & Kamide, the communicative success of their utterances: By monitor- 1999; Knoeferle, Crocker, Pickering, & Scheepers, 2005). ing listener gaze to objects in the environment, the speaker Although studies involving an embodied speaker inherently can determine whether or not a referring expression (RE) they include some dynamics in their stimuli, this is normally con- have just produced was correctly understood or not, and po- strained to speaker head and eye movements (Hanna & Bren- tentially use this information to adjust subsequent production. nan, 2007; Staudte & Crocker, 2011). Besides simplifying In this paper we investigate the hypothesis that speaker the physical environment to a static visual scene, none of use of listener gaze can potentially enhance interaction, even these approaches can capture the reciprocal nature of interac- when situated in complex and dynamic scenes that simulate tion. That is, they do not take into account that the listeners’ physical environments. In order to examine this hypothesis in eye movements may, as a signal of referential understanding a controlled and consistent manner, we monitor listener per- to the speaker, change the speaker’s behavior and utterances formance in the context of a computer system that generates on-line and, as such, affect the listener again. spoken instructions to direct the listener through a 3D virtual One study that emphasized interactive communication in environment with the goal of finding a trophy. Successful a dynamic environment was conducted by Clark and Krych completion of the task requires listeners to press specific but- (2004). In this experiment, two partners assembled Lego tons. Our experiment manipulated whether or not the com- models: The directing participant advised the building par- puter system could follow up its original RE with feedback ticipant on how to achieve that goal. It was manipulated based on the listener’s gaze or movement behavior, with the whether or not the director could see the builder’s workspace aim of shedding light on the following two questions: and, thus, use the builder’s visual attention as feedback for directions. Clark and Krych found, for instance, that the vis- • Do listener eye movements provide a consistent and useful ibility of the listener’s workspace led to significantly more indication of referential understanding, on a per-utterance deictic expressions by the speaker and to shorter task com- basis, and when embedded in a dynamic and complex, pletion times. However, the experimental setting introduced goal-driven scenario? large variability in the dependent and independent variables, making controlled manipulation and fine-grained observa- • What effect does gaze-based feedback have on listeners’ tions difficult. In fact, we are not aware of any previ- (gaze-)behavior and does it increase the more general ef- ous work that has successfully integrated features of natu- fectiveness of an interaction? ral environments—realistic, complex and dynamic scenes in We show that the listeners’ eye movements are a reliable which the visual salience of objects can change as a result of predictor of referential understanding in our virtual environ- the listener’s moves in the environment—with the reciprocal 1007 nature of listener-speaker-adaptation while also being able to carefully control and measure relevant behavioral data. Recently, researchers have examined eye gaze of speakers and listeners in the scenes of Tangram puzzle simulations on computer screens (Kuriyama et al., 2011; Iida, Yasuhara, & Tokunaga, 2011). In these experiments, eye gaze features are found to be useful for a machine learning model of reference resolution. However, this setting is restricted in its dynamics, as it does not embed the objects into physical scenes or involve any updates to the spatial and visual context of the objects in the scenes. In contrast, by generating REs and ask- ing the subjects to resolve them, rather than resolving human- produced REs itself, the system we propose here can provide more control over the language that is used in the interaction. Figure 1: A screenshot of one of the virtual 3D environments. Computational models of gaze behavior are frequently implemented in embodied conversational agents as part of non- verbal behavior that aims at improving the human-computer hidden in a wall safe. They must press certain buttons in the interaction (see e.g. Foster, 2007). Such agents do not typi- correct sequence in order to open the safe; since they do not cally employ listener gaze tracking for the generation of ap- have prior knowledge of which buttons to press, they rely on propriate REs, though. One work that focuses on situated RE instructions and REs generated by the NLG system in order to generation is Denis (2010), which takes the visual focus of carry out the task. A room may contain several buttons other objects into account for the gradual discrimination of refer- than the target, which is the button that the user must press ents from distractors in a series of utterances. However, vi- next. These other buttons are called distractors and are there sual focus in Denis’ work is modeled by visibility of objects to make the RE resolution task more challenging. Rooms on screen rather than eye gaze. To our knowledge, there exists also contain a number of landmark objects, such as chairs no prior RE generation algorithm that is informed directly by and plants, which cannot be interacted with, but may be used listener gaze. in REs to nearby targets. For our experiment we use three Finally, gaze as a modality of interaction has been inves- different virtual environments designed by Gargett, Garoufi, tigated in virtual reality games before, e.g. by Hulsmann,¨ Koller, and Striegnitz (2010), which differ in what objects Dankert, and Pfeiffer (2011). However, most such settings they contain and where they are located. do not use language as a further modality. One virtual game- like setting which focuses on language is the recent Challenge Generation systems on Generating Instructions in Virtual Environments (GIVE; Koller et al., 2010), which evaluates NLG systems that pro- We implemented three different NLG systems for generating duce natural-language instructions in virtual environments. In instructions in these virtual environments. All systems gen- this work we use the freely available open-source software in- erate navigation instructions, which guide the user to a spe- frastructure provided by GIVE1 to set up our experiment. cific location, as well as object manipulation instructions such as “press the blue button” containing REs such as “the blue Methods button”. The generated instructions are converted to speech by the MARY text-to-speech system (Schroder¨ & Trouvain, In the GIVE setting (Koller et al., 2010; Striegnitz et al., 2003) and presented via loudspeaker. At any point, the user 2011), a human user can move about freely in a virtual in- may press the ‘H’ key on their keyboard to indicate that they door environment featuring several interconnected corridors are confused. This will cause the NLG system to generate a and rooms. A 3D view of the environment is displayed on a clarified instruction.

UC Merced Proceedings of the Annual Meeting of the Cognitive Science Society

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support