University of Augsburg
Faculty of Applied Computer Science
Department of Computer Science
Master’s Program in Computer Science

Master’s Thesis

Speech and Gesture Input for Different Interaction Tasks in a Virtual World

submitted by Kathrin Janowski on 09.11.2012

Supervisor: Prof. Dr. Elisabeth André

Adviser: Dipl.-Inf. Felix Kistler

Abstract

In recent years, the technologies for gesture and speech input have advanced far enough to become part of consumer products, such as video game consoles or smartphones. Both are the main interaction mechanisms between a human and the real world, which also makes them attractive for controlling virtual world applications for both training and entertainment. However, virtual worlds fully controlled by both are still rare, and the existing systems tend to use only one of them or simplify the world model to allow only certain types of interaction. This thesis examines the use of speech and gestures for four distinct categories of interactions which are common to most simulated worlds. In particular, these are the navigation in a three-dimensional scene, the selection of interactive entities, dialogue with virtual characters and the manipulation of virtual objects. For each of these tasks, existing approaches are summarized. In order to confirm which modality suits each task best, an application was then developed for interacting with a virtual environment, relying on the Kinect sensor for full body gestures and a keyword spotting approach for speech recognition. Finally, a user study was conducted with this system. The results showed that a walking metaphor was preferred for navigation, whereas for dialogue, speech was preferred over choosing sentences by pointing. No clear preferences were obtained for selection and manipulation, but the latter category revealed two distinct personas which should be examined more closely.


Acknowledgments

I would like to thank the following people for various forms of help.

• Felix Kistler for guiding me through the whole process, for all the help with many different aspects, and, most importantly, for taking care of problems in the FUBI recognition that kept haunting me, such as the persistent “lost user” bug which was tracked down just in time for the user study.

• Professor Dr. André for pointing me towards the works of Norman, Chai and Krahmer, for helping me spot the flaws in the experimental procedure and for reminding me that a chaotic test run was not the end of the world.

• Gregor Mehlmann for discussing the multimodality aspect with me and for pointing me towards relevant Oviatt papers.

• Johannes Wagner for letting me look at the speech recognition he was working with, although that one did not meet the requirements I had.

• All those people at the chair of Human Centered Multimedia who took part in the user study, offered helpful feedback or otherwise showed interest in my work and the results.

• My music teachers and sword trainer for helping me escape now and then, for understanding that those hobbies tended to get left behind and for listening to all that stuff completely unrelated to them.

• My family for letting me vent my frustrations, for getting infected with my enthusiasm, and for helping me get this far by providing me with all sorts of support and resources.


Statement and Declaration of Consent

Statement

Hereby I confirm that this thesis is my own work and that I have documented all sources used.

Kathrin Janowski

Augsburg, 09.11.2012

Declaration of Consent

Herewith I agree that my thesis will be made available through the library of the Computer Science Department.

Kathrin Janowski

Augsburg, 09.11.2012


Contents

Contents

1 Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Outline

2 State of the Art
  2.1 Overview
  2.2 General Concepts
    2.2.1 Input Technologies
    2.2.2 Multimodality
    2.2.3 Usability Principles
  2.3 Interaction Tasks
    2.3.1 Navigation
    2.3.2 Selection
    2.3.3 Dialogue
    2.3.4 Object Manipulation

3 Practical Work
  3.1 Overview
  3.2 Application
    3.2.1 Available Interactions
    3.2.2 Input
    3.2.3 Visualization
    3.2.4 Implementation
  3.3 User Study
    3.3.1 Objectives
    3.3.2 Experimental Setting
    3.3.3 Results and Discussion

4 Conclusion and Future Work
  4.1 Conclusion
  4.2 Future Work
    4.2.1 Analysis of the Video Data
    4.2.2 Role Playing versus Efficiency
    4.2.3 Direct Gesture Mapping
    4.2.4 Expansion of the Speech Recognition
    4.2.5 Multimodal Commands
    4.2.6 Expansion of the Smart Object Infrastructure

Bibliography

A Questionnaire
  A.1 German
  A.2 Translated to English

List of Figures

List of Tables

Chapter 1

Introduction

1.1 Motivation

In recent years, natural input technologies have gradually become available to the average user. The three major manufacturers of video game consoles all provide motion recognition by now: the Nintendo Wii, which is based around gestural input, the PlayStation Move controller for Sony’s PlayStation 3, and the Kinect sensor for the Xbox 360 [28]. Another important modality, speech recognition, has also been around for some time, usually taking the form of dictation or language trainer software. This functionality is included in Windows Vista and Windows 7 as part of their accessibility options [26] and is also becoming an alternative input method for smartphones. The most famous example for these devices is the “Siri” assistant on Apple’s iPhone [1], while the Samsung Galaxy S III, which was released this year, features a similar interface called “S Voice” on the Android platform.

Combined with the graphical capabilities of current computers and consoles, the sophisticated virtual worlds found in modern video games, increasingly lifelike virtual characters and, finally, stereoscopic technology being integrated into consumer television sets and handheld devices, it seems as if the vision of a Star Trek holodeck for both entertainment and serious simulations is about to become reality in the near future.

However, speaking of reality, these input technologies are still fairly new, at least when it comes to widespread use. In several articles dating from 2010 [32, 34] and 2012 [33], Donald Norman observed that, although these technologies have existed for quite a while, they only recently matured enough to be released to the public and therefore many aspects of their usage are still under development. Different companies are developing different guidelines for these input methods while end users are only now getting used to them, and applications which use both modalities to their full potential are rare.

This can easily be confirmed by browsing the shelves of electronics stores. Searching for gesture-controlled games brings up mostly sports and dancing applications aimed at casual players, or fitness applications meant to motivate exercising. While these examples are heavily based on natural movement, only a few offer something resembling a complex interactive world. Such worlds are still mostly found in traditional games designed for mouse, keyboard or regular game controllers with buttons and joysticks.

Speech support, on the other hand, is slowly becoming part of those more sophisticated titles. For example, various games marked with the “Better with Kinect” label allow the player to speak certain commands for a higher degree of immersion. However, the phrasing “better with” also indicates that the Kinect functionality is optional rather than required while the regular controller still serves as the primary input method.

But the potential is already there, only spread out across different applications and research areas. Various approaches exist for the key elements of an application fully controlled by speech and gestures, so it is time to have a closer look at the available findings and gather them in one place.

1.2 Objectives

This thesis examines the application of two natural input technologies, namely gesture and speech recognition, to interactive virtual worlds. It focuses on the different demands of four distinct interaction tasks which are common to such environments - navigation, selection, dialogue with non-player characters (NPCs) and manipulation of objects. For these tasks, the following questions will be dealt with.

World Representation

In what way is the setting or representation of the world connected to possible inputs?

Input

What kinds of input metaphors are proposed for either modality? What kind of vocabulary is required? Which modality do the users prefer?

Visualization

What information does the user need in order to give the right commands? How is this information presented to them?

In addition to gathering information from literature, a study was conducted to confirm the theoretical assumptions. Since many key aspects had already been examined in Wizard-of-Oz studies by various researchers, the idea was to confront users with a system that actually processes gestures and speech and to see whether the resulting preferences were similar. This, in turn, went hand in hand with developing a testing environment which can be controlled by both modalities.

1.3 Outline

Chapter 2 will present the current state of the art, grouped into general aspects and those which concern particular interaction tasks. The general section will provide some basic information about the relevant sensor technologies, multimodality aspects common to all tasks and important usability principles. Afterwards, the four task categories are examined. These subsections will provide details about their role in a virtual world application, suitable gesture and speech inputs for the associated actions and visualization mechanisms which make them accessible to the user. The practical part of this thesis, consisting of the application and user study, will be dealt with in chapter 3. The application section starts with the scenario and the interactions it provides. After that, it will present the chosen inputs for each task, the necessary visualizations and, finally, some details about the implementation. For the user study, the section consists of the objectives, the experimental setup and the presentation and discussion of the results. Finally, chapter 4 contains the conclusion and possible future work.

Chapter 2

State of the Art

2.1 Overview

In general, interaction with a virtual world covers a wide range of topics and concerns various components of the underlying application. There are many possible and often fundamentally different ways in which the user might need or want to interact with such a system, and these in turn depend on the representation of the world and the available technologies. Furthermore, usability requirements have to be met because the user must be able to discover these interaction possibilities and the inputs which will trigger them. So first of all, it is necessary to identify and organize those aspects which are relevant for this thesis. The focus here will be on the different interaction tasks which are likely to be found in virtual reality applications. They can be grouped into four main categories according to similarities in their goals and execution, and one objective of this thesis is to examine the differences between these categories. Apart from their objective, the four major tasks differ in what kinds of inputs are considered suitable for them. There are many input technologies available, but only gesture and speech recognition will be taken into account in this work. Because most actions in the real world are performed by speaking or by physical manipulation, these two modalities are also very important for immersive simulations. Here, they will mostly be considered separately, providing alternative input methods for the same task. Because their multimodal integration is a complex topic in itself, it will be included but not dealt with in depth.


One last element that is tightly coupled to the tasks and inputs is the set of clues which inform the user about possible interactions. Ideally, finding and performing input commands should be intuitive, meaning that no further explanations are necessary for operating the system. In reality, however, what can be called “intuitive” is not always straightforward where users with different experience are concerned, and an application with good usability must provide some guidance to ensure that its users can interact as desired.

This chapter consists of two parts. First, some general aspects concerning all input tasks will be presented, namely basic characteristics of current input technologies, the concept of multimodality and important usability principles. The second section will then deal with the individual input tasks. For each of them, it will provide details about their requirements for the world representation and present current research on appropriate input commands as well as the interface clues by which users will identify them. The literature in question will cover virtual world applications as well as others dealing with the same input modalities.

2.2 General Concepts

2.2.1 Input Technologies

In relation to the ORIENT system [23], Kurdyukova et al. state that input technologies should be embedded into the simulation in a plausible way. As input metaphors based on gestures and speech resemble interactions in the real world, they are, in theory, both easier to learn and less likely to destroy the illusion.

However, these technologies come with certain limitations. For instance, one common problem is a noticeable delay between the user’s command and the system’s reaction, which has several reasons. First of all, the signal processing and semantic interpretation required to parse the sensor data are comparatively expensive in computational terms. Fortunately, this is less of an issue nowadays, as computer hardware is constantly evolving, multicore processors are already the standard for consumer systems and much research goes into optimizing the necessary algorithms to meet real-time demands. Another cause for this delay cannot be solved easily by technological progress. In order to interpret a command, the system has to wait for the user to finish it. This can be compared to telling another person to perform a certain task instead of doing it oneself: in addition to choosing and executing an action, time has to be sacrificed for formulating and transmitting an instruction. The length of this time span depends on the complexity of the input. For low-level commands, for example when the current position of a body part is directly mapped to its position in the virtual world, it will be hardly noticeable, whereas symbolic gestures or spoken descriptions of more abstract interactions take significantly longer to perform. One way to work around this issue is to form and discard hypotheses during the observation and commit the one with the highest probability as soon as the user has stopped moving or speaking. This kind of behavior can be observed, for example, in the Microsoft Speech Platform recognition engine used for the practical part of this thesis [27]. However, detecting the end of an input usually consists of waiting for a pause in the sensor data. This pause also has to be longer than a certain threshold, especially for speech recognition, which needs to distinguish the gaps between sentences from those between words or dependent clauses. Obviously, this contributes to the existing delay.
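To illustrate the end-of-input detection described above, the following minimal sketch commits the current recognition hypothesis once no activity has been observed for longer than a pause threshold. It is a hypothetical helper, not taken from any of the cited engines; the class and member names are placeholders.

    #include <chrono>

    // Minimal sketch of end-of-input detection via a pause threshold.
    // The recognizer keeps updating its best hypothesis while input arrives;
    // the hypothesis is only committed once no activity has been observed
    // for longer than the given threshold.
    class EndpointDetector {
    public:
        explicit EndpointDetector(std::chrono::milliseconds pauseThreshold)
            : threshold_(pauseThreshold) {}

        // Call for every sensor frame; 'active' means speech or motion was detected.
        // Returns true exactly once, when the current input should be committed.
        bool update(bool active, std::chrono::steady_clock::time_point now) {
            if (active) {
                lastActivity_ = now;
                inputOngoing_ = true;
                return false;
            }
            if (inputOngoing_ && (now - lastActivity_) > threshold_) {
                inputOngoing_ = false;   // commit the best hypothesis here
                return true;
            }
            return false;
        }

    private:
        std::chrono::milliseconds threshold_;
        std::chrono::steady_clock::time_point lastActivity_{};
        bool inputOngoing_ = false;
    };

The longer the threshold is chosen, the more reliably sentence-internal pauses are bridged, but the more it adds to the delay discussed above.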

Other limitations arise from the specific input technology which is being used. This mostly concerns gesture input because, apart from a microphone with sufficient quality, speech recognition does not need any particular sensors. There are several different technologies available for recognizing gestures. One method mentioned in various sources is to attach magnetic tracking devices to important body parts like the dominant hand [19, 18], the entire dominant arm [8], the head [8] or the torso [24]. Accelerometers are another frequently used type of sensor and are, for example, included in the Wiimote controller. The fact that this controller is part of a famous consumer product also makes it attractive for researchers because test subjects are more likely to enjoy using it [23]. However, both methods have the disadvantage of limiting the gestures to those body parts to which the sensors are attached. Vision-based gesture recognition, on the other hand, makes it possible to observe the entire body as long as the relevant limbs are not occluded, without the need to wear or handle physical devices. Such devices might be uncomfortable for the user or cause them to worry about breaking them, which in turn could affect their experience with the system negatively [17]. Examples of this type of input technology can be found, among others, in VisTA-walk by Kadobayashi et al. [17], the Transfiction engine used by Cavazza et al. in [4] and applications using Microsoft’s Kinect sensor, like the “Sugarcane Island” game by Kistler et al. [21].

While steering an application without physical contact has its advantages, it should nevertheless be considered that the lack of input devices can also be a problem. While describing the “Agent Common Environment”, a smart object based system which offered user interaction by means of a data glove, Kallmann noted that the lack of haptic feedback can make the direct manipulation of virtual objects frustrating [19, 18]. Norman mentioned that additional devices are often necessary because body gestures alone are not always sufficient for controlling complex systems [32]. An example he gave was the act of opening the grip on an object, which could be done by releasing a button on a physical controller, specifically the Wiimote, but would be more difficult to implement without such a device.

2.2.2 Multimodality

When different modalities are being used in the same application, the concept of multimodality is not far away. Even though this thesis and the application developed for the study are not aiming at multimodal integration, some aspects still need to be mentioned. First of all, different modalities are not necessarily interchangeable, as Sharon Oviatt noted in her article “Ten Myths Of Multimodal Interaction” [36]. Each has its strengths and weaknesses when it comes to expressing different types of information, and a precise expression in one modality might require a more complex one in another. For example, describing a shape verbally is more complicated than drawing the same shape with a pen. In the context of reference generation, van der Sluis and Krahmer explained that different modalities complemented each other [46]. Users would compensate for a less precise expression in one modality by adding the missing information in a different, more suitable channel. Specifically, they tend to make the most of each modality’s advantages in a way that the combined expression requires minimal effort to produce but still contains all the necessary information.

Corradini and Cohen conducted a study in which ten test subjects played “Myst III - Exile” in a Wizard-of-Oz experiment after being told they could control the game via speech and gesture [8]. One of their findings was that six of these subjects tended to combine both input modalities rather than choosing a single one. While one subject supposed they were talking subconsciously in the same way they would talk to their car, two others explained that they preferred gesture but used speech for additional clarification. This type of expression is not always necessary but rather depends on the complexity of the task. In [36], Oviatt refers to studies which showed that simple, general commands were given unimodally about 99% of the time, whereas complex instructions containing spatial information, like the location, orientation and dimensions of an object, were usually given multimodally. A later study by Oviatt et al. [37] confirmed that a higher level of difficulty also led to higher rates of multimodal expressions, in this case consisting of speech and pen gestures. Four levels were compared in a scenario which required organizing entities on a map, with each level adding one more piece of spatial information. This caused significant increases in multimodal input between the two easiest as well as between the two most difficult levels. The same study also showed that new dialogue content was established multimodally 77.1% of the time, whereas follow-up expressions referring to known entities fell back to 18.6%.

Additional clarification can also be linked to error handling. In “Ten Myths Of Multimodal Interaction” [36], it was pointed out that, if given the choice, users would choose those modalities which they believed to be less error-prone in the given context. Also, they would switch modalities on encountering recognition errors. Together with the tendency to choose simple and concise expressions in each channel [36, 46], distributing the information across different modalities can reduce the overall probability of recognition errors.

2.2.3 Usability Principles

New input technologies pose new challenges, both to the user and the designer. In their article “Gestural Interfaces: A Step Backwards In Usability” [34], Norman and Nielsen warned that, while it makes sense to develop new interaction methods for these technologies, established usability principles must not be ignored. They criticized the tendency to confront consumers with creative but confusing new interface designs, partially because developers lacked patience in testing them before releasing their product. Another reason given is the patent conflicts between rival companies which prevent different platforms from using the same standards [34, 33].

This confusion makes it difficult for the user to become familiar with non-traditional input technologies. It is often used as a marketing argument that gesture and speech are intuitive ways of controlling a system, which suggests that there is one sensible, easily found way to perform the desired command. In practice, however, finding a suitable command is not always straightforward. The designer of an input method usually has background knowledge and experience different from that of the end users, and these end users often have quite different backgrounds as well, each with their own expectations and assumptions. Norman points out that many gestures we use in everyday communication are actually abstract symbols which need to be learned and differ between cultures [32]. Even those gestures which do have a natural counterpart often need to be adapted in order to prevent undesired side effects. As an example, Norman refers to the phenomenon of Wii users accidentally throwing the Wiimote at their television screen while trying to release a virtual bowling ball [32]. The difference in this case is that the user is not allowed to fully open their grasp but only to reduce the pressure until a certain button on the gesture controller is no longer pressed. But because the rest of the gesture is designed to closely match the physical action, this difference runs the risk of being forgotten while the player is absorbed in the game and focused on their goal. Similar problems exist for speech recognition. Natural languages offer far greater flexibility than computer systems can handle today. Although recognizers keep becoming better at adapting to unexpected situations, they are still limited to a comparatively small subset of human communication with pre-defined, domain-dependent phrases. In general, it can be said that both modalities suffer from a small vocabulary. Modeling a realistic interactive world is still a complex task, and most of the available interactions and the rules governing them need to be defined by human designers. This leads to limitations in how many and what kind of actions the user can perform. In reality, it is often possible to solve a problem creatively by finding a new use for the tools at hand, whereas items in adventure games usually serve a fixed purpose and do not allow for improvisation. Visible parts of a simulated machine may be necessary to depict it realistically but are not always meant to be manipulated, which can lead to confusion when the user tries to examine them. Likewise, virtual conversation partners only answer questions and give information that the author thought of, since it is difficult to implement appropriate reactions to unpredictable, dynamically occurring events.

To bridge this gap between the capabilities of a simulated world and the user’s expectations about what can be done, additional clues are necessary to guide them towards those interactions which are intended by the designer. Independently of the technologies involved, the following basic principles must be respected.

Visibility

The user must be able to detect possible actions and the commands necessary to trigger them. Actions inherent in the design of an object or system are called affordances. Norman distinguishes between real affordances, which are possible regardless of whether they are visible, and perceived affordances, which are suggested by certain design elements and therefore known to the user [30]. The visible clues which draw attention to these are called signifiers [31, 34] and need to guide the user towards the actions which can be performed in the given context, and only to those. As an example, Norman contrasts clicking anywhere on a screen, which is physically possible but meaningless, with clicking on a particular button to trigger an action [30]. Also, Norman explains that constraints help to make the right action stand out. Physical constraints render the input impossible when it is not expected, for example by hiding the relevant interface elements. Logical constraints let the user deduce the correct input by showing the conceptual model behind it. Finally, cultural constraints let the user decipher the meaning of familiar signifiers and remember the correct behavior for their context.

Discoverability

Visibility is directly related to another important aspect, the user’s ability to discover all possible interactions by examining the system [34]. They must be aware of its functions in order to use them, and making this information accessible in the application itself removes the need to memorize instructions. So-called feedforward mechanisms [2] use the signifiers mentioned above to inform the user about their options and how to perform the necessary inputs. They can cover the entire set of commands at once or only those relevant under the current circumstances. Menus, for example, serve to group and display available operations [34, 2]. The user then only needs to read their entries to find all the necessary information for handling the system, which is easier and more reliable than remembering the commands from a manual or a training session.

Feedback

In order to give the right commands for the intended action, the user must be able to deduce the effect of their action and receive confirmation of what exactly happened [33]. In particular, they must be able to identify differences between their execution of an input and the one expected by the system, especially when the recognizer pays attention to different aspects than the user would [2]. According to Bau and Mackay [2], this confirmation can range from a simple notification that a particular command was recognized to more detailed information such as accuracy and probability values. It can take place after the input was completed or already during the process, helping the user to correct errors before it is too late. Furthermore, the presentation of recognized commands can also serve as feedforward for future inputs by showing an idealized version instead of the user’s actual command. What kind of feedback is available depends strongly on the actual recognition algorithm and the kind of output it provides.

Reliability

Operations should work as suggested by the design and be causally linked to the user’s inputs. Actions should not occur at random because the user must feel that they are in control [34]. This is again linked to the feedback which enables the user to determine when and why an error occurred. If nothing happens, they must be able to understand whether the system simply failed to recognize the input or whether the command they were trying to issue is not even available in the first place [33]. As for the input itself, every technology has its weaknesses and inaccuracies which need to be counterbalanced by the application’s design [34]. With gesture and speech input, accidental inputs are even easier to make than with a mouse or keyboard [34, 32]. One key advantage of the new technologies is the fact that inputs are based on a user’s natural behavior, but the downside is that unintentional parts of the same behavior, such as casually moving the hand or thinking aloud, cannot always be distinguished from actual inputs. Consequently, every system needs methods to restore the previous state after unintentional changes. Most traditional applications contain an undo function for this purpose, and similar functions must be both present and visible in the new systems [32, 34, 41].

Consistency

In order to avoid confusion, similar actions and effects need to be triggered by similar inputs. In particular, designers must respect established standards and conventions, which ideally should be kept up across different platforms and systems. Users become familiar with these conventions by using similar applications themselves or learning from those who did, and then rely on this knowledge when they approach an unknown system. The cultural constraints Norman mentioned in [30] are in fact cultural conventions. They do not enforce any actions, but they encourage the right behavior and discourage experimenting because the user has already learned what to expect in similar contexts. However, this also indicates that certain standards depend on the end user’s background. For example, relying on similarities to other applications and technologies only works if they are familiar with those systems, and cultural conventions such as particular symbols and color codes can vary between said cultures.

2.3 Interaction Tasks

The available actions are dependent on the definition of the world, and vice versa: certain elements in a simulated environment, such as characters or obstacles, have to be dealt with in a particular way in order to appear realistic, and the way they can be dealt with has to be encoded somewhere in the world representation. When comparing frequently occurring interaction tasks and the context they arise in, the differences and similarities between them lead to several categories. For instance, there are forms of interaction which mostly change or narrow down a user’s options, while others are about retrieving information or actually changing the world state. Another criterion is the action’s target. Most interactions in a virtual world relate to an entity which resembles either a living being or an inanimate object. Like those entities, they are often based on their real-world counterparts in order to increase immersion, which means that they also inherit their differences. In this thesis, the following four categories will be analyzed.

Navigation

Depending on the size and architectural layout of the simulated environment, navigation actions are often necessary for reaching a different set of possible interactions. They mostly involve changing the position and perspective of the user by steering a camera around the virtual scenery.

Selection

Selection is responsible for narrowing down the choices which are currently available, as opposed to the following two categories which actually influence the simulated world. In particular, it places the focus on one of several entities which will later be targeted by a specific command.

Dialogue

One much researched type of entity is the embodied conversational agent, so it seems plausible to create a separate group for dialogue with virtual agents. Given that the targets often appear human-like and the interaction itself is modeled after real human-to-human communication, a naive assumption would be that users choose speech over gesture in this case.

Object Manipulation

Finally, there is the physical interaction with virtual objects. As mentioned before, it is often modeled after the action a human would perform on a real object. Complex rules are required to cover as many of the interactions that would be possible in reality as feasible, which often leads to similarly complex and more varied input gestures than for the other categories.

The following sections will deal with each category in detail, referring to current research and applications. First, they will explain the relation between the tasks at hand and the representation of the virtual world. Afterwards, suitable input metaphors for both modalities are presented and, finally, the usability aspect is examined with regard to signifiers and feedback for these inputs.

2.3.1 Navigation

Virtual environments can range from simple three-dimensional displays with a steerable camera to large and complex simulated worlds where multiple objects and agents can be manipulated by the user or influence each other. One example of those simpler cases is the VisTA-walk system [17]. It was developed by Kadobayashi et al. in order to enhance the traditional museum experience and enables the user to walk through the virtual reconstruction of an ancient village. Users can explore the settlement and retrieve information about individual buildings by selecting them, but in order to reach all points of interest, they have to change their position and perspective. Similar exploration of an artificial world occurs in the more recent ORIENT system [23], which is a part of the EU funded eCIRCUS project. This interactive storytelling application was designed for teaching teenagers about cultural differences and empathy. It provides three users with different roles, one of which is the navigation officer. He or she is responsible for moving through the setting of an alien planet, where the group would encounter various non-player characters (NPCs) they had to interact with. In both cases, navigation actions expand the interaction space beyond the small region that a fixed camera could show. But apart from visibility, the position relative to the target must be taken into account as well. While in theory nothing would prevent a user from manipulating an entity that appears far away in the scene, it makes more sense if they are placed at a suitable distance and facing the right direction. Kallmann’s and Thalmann’s definition of smart objects [18, 19] includes such reference positions as part of their interaction description. The application presented in [18] is called the “Agent Common Environment” and mostly controls virtual characters in a simulation. However, it also provides the option of including a human actor who is equipped with a data glove. As this technology restricts the user’s movement to a small area, Kallmann notes that a navigation metaphor becomes necessary for interacting with a larger environment. In [19], it is specifically stated that such a metaphor enables the user to reach different reference locations. But a large virtual world does not only make navigation necessary. The way it is stored and accessed also influences how this task can be performed. Most worlds in games or other simulations are rendered in real time today, allowing for a continuous world as opposed to pre-rendered scenes. Therefore, the camera’s perspective can be changed in any way the user desires, often only limited by collision detection with obstacles. This flexibility in the simulated world is directly reflected in the commands which have to be considered for the navigation task. Applications based on pre-rendered perspectives, for example the first four games of the “Myst” series [9], only allow transitions between screens by selecting pre-defined passage ways, such as doors, stairways or a farther part of the floor. Moving closer to objects in order to examine them is only possible if the specific close-up view was prepared beforehand. In contrast, navigation in a real-time 3D environment relies more on general direction commands which correspond to translation or rotation vectors. These vectors are largely independent of the current position and orientation within the world.
Passage ways still play a role, but mostly with regard to collision detection, while the decision to pass through them happens implicitly by steering the camera in their direction. Information about where they lead does, however, become necessary for one kind of speech navigation which chooses the movement target directly.

Suitable Inputs

There are several possible ways of mapping user inputs to navigation commands. For gesture input, the VisTA-walk paper [17] lists three major control schemes which could be applied to a user standing in front of a screen: the mouse principle, the joystick principle and the steering wheel and accelerator principle. All three are based on a walking metaphor, which has the advantage of occupying only the feet and leaving the hands free for other tasks such as selection or object manipulation [17, 24, 23].

Figure 2.1: Navigation command mappings for the three gesture control schemes.

The mouse principle directly maps the user’s coordinates within the sensor space to the world coordinates, which makes it mostly applicable when the virtual environment is small enough. An example for such a small area would be the “Space Pop” game included in “Kinect Adventures” [42], where the player steers their avatar within the 3D representation of a small cell without gravity in order to reach and pop a number of bubbles. Movement in the first two dimensions is done by stepping forward, backward and sideways on the floor, and the avatar directly copies the user’s position. It is difficult to decide whether this counts more as navigation or as object manipulation because the movement actions indirectly trigger the interaction with the bubbles while the perspective and available actions remain rather similar. However, this scenario already demonstrates the most important principle, namely walking within the sensor area in order to move in the virtual world. One problem with this control scheme lies in the scaling factor. While direct mapping may feel closer to reality, a large simulated world would require an equally large area for the user to move in. This was one of the reasons why the mouse approach was discarded for the VisTA-walk application. However, the problem can be addressed by adapting the scaling factor between both areas. LaViola Jr. et al. [24] used additional gestures for scaling a map of the environment, which was projected below the user’s feet and aligned to their position. This solution lets them trade off travelling speed against accuracy dynamically by zooming out for bigger changes and zooming in for higher precision. After setting the desired scaling factor, the user could walk to their target location on this miniature representation and confirm the change by another gesture. The other two approaches are rather similar in that they both use velocity vectors for different directions, depending on the user’s current position in comparison to a known neutral position. The minimal set of commands for arriving at any accessible position consists of moving forward and rotating on the spot. Additional commands include moving backward, sideways or vertically. With the joystick control scheme, the vector between the user’s position and the origin is used as the direction vector. As soon as the user steps far enough away from their starting position, the avatar or the camera will move directly in the indicated direction. This enables the user to reach any location they wish while requiring only a comparatively small region for physical movement. A similar technique was implemented in [24] by LaViola Jr. et al., in addition to the coarser mouse-like control scheme described earlier. After using the scaled miniature map to arrive close to the target, the user would walk or lean their upper body in the intended direction in order to trigger smaller movements within the actual environment. The necessary amount of leaning depends on the position within the sensor area, based on the observation that users would mainly lean when they could not walk any further, but might not intend leaning as an input gesture while standing comfortably at the center. In order to change the orientation, additional gestures are needed for rotation commands. Kadobayashi et al. suggested using the rotation of the body [17]. This of course means that the gesture input technology must be able to recognize the user’s orientation, which was not the case for their system.
Instead, the control scheme implemented in the VisTA-walk system uses only walking gestures, based on a “steering wheel and accelerator” metaphor [17]. The forward axis is mapped to acceleration, so stepping closer to the screen or farther away moves the camera along the same axis, just as the joystick metaphor would. The sideways axis, however, is interpreted as turning a steering wheel to that side, which means that stepping to the left or right will rotate the camera in the respective direction. Since sideways movement can still be achieved by turning in the desired direction, moving forward and then turning the other way again, this decision allowed them to keep a smaller gesture vocabulary. On the other hand, requiring three gestures for stepping sideways can also be a disadvantage if such movements occur frequently.

The scaling factor which was mentioned for the mouse control scheme also affects the latter two. The walking and turning speed in VisTA-walk is proportional to the actual distance between the current and neutral positions. During their study [17], Kadobayashi et al. observed that a slower speed allows for precise navigation but runs the risk of making the user impatient. Higher speeds, on the other hand, let the users arrive quickly, but at the same time it becomes more difficult to stop at the target location. In this case users travel over a greater distance because they often pass over the intended position and have to move back again in order to correct this mistake. Their results after comparing five different walking speeds indicate that the factor should lie between 2.5 and 5.0 m/s in the virtual world for 80 cm in reality. As for rotation, it depends on the field of view provided by the virtual environment. LaViola et al. [24] noted that in a fully immersive display consisting of four walls, the user could simply turn towards the part of the world they were interested in, whereas for displays which surrounded them only partially, their rotation had to be amplified. This could be achieved by turning the camera in the opposite direction so that the user would see the region originally behind them on arriving at their physically possible orientation. The drawback of this solution was that users felt uncomfortable with this mix of realistic turning movements and unnatural change in perspective, so in their final implementation, the technique would only be used beyond a certain rotation threshold depending on the user’s location. Other systems such as VisTA-walk [17], ORIENT [23] or Kinect games intended for home use have only one screen at their disposal, which limits the user’s rotation even further. In these cases, all of the rotation is done by the camera and the user’s rotation angle is comparatively small because they have to keep the single screen in sight. Furthermore, gesture recognition is often based on processing the video stream of a fixed camera and might not even allow the user to rotate permanently. Therefore, the user must eventually return to their neutral orientation before other input gestures can be performed. The steering wheel approach used in VisTA-walk [17] is unaffected by this problem, since it uses steps to the side for rotation, which means that the user keeps facing the screen. Apart from turning in the walking direction, a last action to consider is looking up or down. VisTA-walk allows the user to change the viewpoint in this way in order to increase the immersion in the simulated village [17]. For lowering the viewing position, Kadobayashi et al. used a crouching gesture which has this effect in reality. For looking up, however, they resorted to the symbolic gesture of raising both hands. It is easy to deduce that a system capable of recognizing rotations, for example of the torso or the head, could use a more direct mapping for changing the camera pitch. As mentioned by LaViola Jr. et al. [24], the head also moves about in order to glance at different regions of the screen, so the torso would again be the more sensible choice as it is more robust to accidental rotation commands.
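As an illustration of the two velocity-based schemes, the following sketch maps the user's offset from the neutral position to camera motion. The gain, dead zone and turning constants are assumptions chosen in the spirit of the speed range reported by Kadobayashi et al., not values taken from any of the cited systems.

    #include <cmath>

    // Sketch of the two velocity-based control schemes discussed above.
    // 'x' and 'z' are the user's offset from the neutral spot in metres
    // (x = sideways, z = towards the screen).
    struct CameraMotion {
        float forwardSpeed;  // m/s along the viewing direction
        float strafeSpeed;   // m/s sideways (joystick scheme only)
        float turnSpeed;     // rad/s around the vertical axis
    };

    CameraMotion joystickScheme(float x, float z, float torsoYawRate) {
        const float gain = 4.0f;       // roughly 3.2 m/s per 0.8 m displacement
        const float deadZone = 0.15f;  // ignore small shifts around neutral
        auto clip = [&](float v) { return std::fabs(v) < deadZone ? 0.0f : v; };
        // Translation follows the offset vector, rotation follows the torso.
        return { gain * clip(z), gain * clip(x), torsoYawRate };
    }

    CameraMotion steeringWheelScheme(float x, float z) {
        const float gain = 4.0f;
        const float turnGain = 1.0f;   // rad/s per metre of sideways offset
        const float deadZone = 0.15f;
        auto clip = [&](float v) { return std::fabs(v) < deadZone ? 0.0f : v; };
        // Forward axis acts as the accelerator, sideways axis as the wheel.
        return { gain * clip(z), 0.0f, turnGain * clip(x) };
    }

In the joystick scheme the sideways offset translates the camera, whereas in the steering wheel scheme the same offset turns it; this is exactly the trade-off between vocabulary size and the number of gestures needed for a sideways movement discussed above.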

Spoken commands for navigation are harder to find in literature than their gesture counterparts. One reason could be that walking is the most obvious metaphor for movement, whereas speech is rarely used for this task in real life. Another reason might be related to Oviatt’s research about multimodality. In [36] as well as [37], spatial information is a prominent example of the type of content which users prefer to transmit multimodally. [36] gives a probability of 95-100% for multimodal interaction in that case. It is stated that users avoid speaking complex spatial descriptions, which tend to be less fluent and more likely to contain errors, when they can use alternative input methods, for example a pen, which allow them to express shapes or mark locations more easily. While the sources in question do not deal with immersive worlds, this implies that speech in general may not be sufficient for spatial tasks like navigation. From what can be found, there seem to be two main starting points for speech navigation. It is either possible to specify the movement parameters, which is similar to the joystick and steering wheel control schemes for gestures, or the target could be indicated directly, which resembles the mouse principle. However, both approaches are often mixed, probably due to the flexibility of natural speech. The QuickSet system by Cohen et al. [7] is a multimodally controlled map on a handheld device which helps the user navigate a 3D environment. Similar to choosing a location by walking on the scaled map in [24], it is possible to name an entity and request such a movement verbally. Two examples given in [7] follow the pattern “take/fly me to <target>”, where the target can be the name of an entity or its type combined with a pointing gesture at the specific instance. A similar input pattern can be found in the Wizard-of-Oz study by Corradini and Cohen which examined how users would control a game like “Myst III” by speech or gestures [8]. The attempted inputs were recorded and transcribed in order to analyze what kind of inputs the subjects would choose intuitively, and one of these transcribed actions was a command for crossing a bridge in the game world. Apart from pointing along the bridge, the user utters the command “go across the bridge”. Both the gesture and speech input single out the passage way to be used for the movement. This type of target-based speech input overlaps greatly with the general selection task, in particular the aspect of reference resolution which will be elaborated on in section 2.3.2. Like the mouse principle, this method can provide useful shortcuts to different locations, but in order to benefit from this, the application must be able to map the spoken designation to the correct entity or position. This means that either the set of potential targets has to be limited to certain points of interest with known names, or the names of arbitrary entities must be derivable from the world representation, preferably with a certain robustness to synonyms. Additionally, a path finding algorithm may be required in scenarios where teleportation or movement through objects would break the realism in an undesired way. The other approach, specifying the parameters of the movement, can be seen in a third example in [7]. In this case, a path is marked by a gesture and the spoken part of the input consists not only of a command to fly along this route, but also of the speed to use for this movement. Schuller et al.
noted that navigation by speech can be improved by using more complex commands [41] which include parameters similar to this. As their paper focuses on the technical side of the parsing process, comparing three different setups for the recognition engine, there is once again little information about the actual vocabulary used for navigation. However, they mention that the movement distance is measured in discrete steps, for which the user can specify the number and, apparently, the size. The direction is another obvious parameter and therefore part of the only example given in the text, the command “please once more step three times to the left”. It can be assumed that the second approach makes use of the same basic vocabulary as its gesture equivalent, namely translation in the four directions, rotating on the spot and tilting the camera, in order to reach any desired viewpoint. Other parameters, such as speed and distance, can be added in order to tweak the movement. General terms such as “slow” or “fast” can roughly be compared to stepping away a short or long distance from the neutral position in gesture-based navigation. Since speech allows for the input of numerical values, these additional parameters can be argued to be more precise than mapping the user’s physical position to a speed vector, but at the same time they might be less intuitive because the user needs to know these values first.
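A keyword-spotting parser for such parameterised commands can be sketched as follows. The vocabulary and the MoveCommand structure are purely illustrative; the cited systems do not publish their full grammars, so this only shows how a direction and a repetition count could be extracted from an utterance like "step three times to the left".

    #include <map>
    #include <optional>
    #include <sstream>
    #include <string>

    // Illustrative keyword-spotting parse of a parameterised movement command.
    struct MoveCommand { std::string direction; int steps = 1; };

    std::optional<MoveCommand> parseMoveCommand(const std::string& utterance) {
        static const std::map<std::string, int> numbers{
            {"one", 1}, {"two", 2}, {"three", 3}, {"four", 4}, {"five", 5}};
        static const std::map<std::string, std::string> directions{
            {"left", "left"}, {"right", "right"},
            {"forward", "forward"}, {"back", "back"}, {"backward", "back"}};

        MoveCommand cmd;
        bool found = false;
        std::istringstream words(utterance);
        for (std::string w; words >> w;) {
            if (auto n = numbers.find(w); n != numbers.end()) cmd.steps = n->second;
            if (auto d = directions.find(w); d != directions.end()) {
                cmd.direction = d->second;  // only commit once a direction is heard
                found = true;
            }
        }
        if (!found) return std::nullopt;
        return cmd;
    }

A real grammar would of course need number words beyond five, synonyms for the directions and a strategy for utterances in which no direction keyword is recognized.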

Visualization

Navigation in a three-dimensional environment is a complex task with several degrees of freedom, and each of them requires distinct commands. There are two dimensions for the walking direction and at least two possible axes for rotating the view, already adding up to eight commands for a gesture control scheme based on the joystick principle. It is possible to reduce this vocabulary by using more abstract control schemes, for example the steering wheel and accelerator method or the mouse scheme, but this in turn makes it necessary to combine several actions for reaching the same result. In the end, the user either has to remember a large vocabulary or complex rules for using a smaller one to the same effect. Furthermore, moving through the setting can take up a great part of the interaction, depending on the size and density of the simulated world. On the one hand this means that the user must be able to handle this task reliably in order to deal with the others; on the other hand its ubiquity can offer more chances for practicing the necessary inputs. The first step was already taken by choosing an input metaphor and a direct mapping which resemble real-world movement and therefore ensure a basic familiarity with the commands involved. If users can derive them from everyday behavior, like walking somewhere or giving movement directions to another person, it is easier to learn these commands and fewer signifiers are needed.

A common approach used in Kinect games is to inform the user about the input vocabulary before starting the actual application. “Kinect Adventures” [42] displays an animated instruction while loading a game like the “River Rush” rafting course, showing the sideways movement for steering the boat. The “Sonic Free Riders” demo [43] embeds its instructions for basic movement into the calibration phase. During the game no further clues are given for how these movements should be performed; only the obstacles are shown. It should be noted that forward movement in current Kinect games is often automated and leaves mostly the sideways and sometimes vertical steering to the user, limiting their freedom in order to make use of a smaller and easier set of input gestures.

Map-based navigation, whether for speech or gesture input, obviously needs to display such a map. It can be shown on an additional device as in the QuickSet system [7], which has the disadvantage of occupying the hands, or on the same screen as the 3D environment itself. The scalable map used in [24], called “Step WIM”, can be summoned by the user whenever they need to move to a distant location. “Step WIM” stands for a World In Miniature controlled by stepping gestures, which means that it displays the same three-dimensional environment the user wants to navigate in, albeit at a smaller scale. After the user has arrived at their target location, the map is hidden again and does not occlude any part of the virtual environment. Additionally, occlusion is reduced by projecting it on the floor of a Cave environment.

Apart from illustrating the navigation commands, visual clues are also important for highlighting those points of interest the user should navigate to. These locations can be marked on the aforementioned map or directly in the 3D environment. The reference positions stored in Kallmann’s smart objects, for instance, can be displayed when an interaction requires the user to move to a particular spot [19, 18]. Although the “Agent Common Environment” did not include a navigation metaphor for human users and these positions rather served to animate virtual characters, their depiction in the smart object editor suggests a small sphere with an arrow for the direction the user should face.

As far as feedback is concerned, the very goal of navigation actions is to change what is shown on the screen. Since most virtual worlds either use a first person perspective [17, 24, 23, 18, 9] or a third person perspective with the camera directly following the user’s avatar [42, 38, 39, 43], the direction indicated by the user can be directly mapped to the movement of the displayed viewport. This makes it easy to predict the outcome of a navigation action. In real-time applications the effect of each input can be observed immediately afterwards, or at least after the delay inherent in the recognition technology. The way the camera’s perspective changes shows which command was recognized, and the analogy to what the user’s eyes would see while walking through the real world helps to verify that it was the correct command. Because navigation does not change the world state directly, most commands are also easily undone by moving in the opposite direction or, in target-based control schemes, specifying the previous location. Maps similar to those used for that type of navigation also help the user find their current position in the world, especially in larger or more complex environments. When the Step WIM in [24] is displayed, the user’s current position on the map is aligned to their physical position on the projection surface. Additionally, it is marked by a green icon while they leave that spot to either navigate or examine the map. Such a depiction allows them to understand the geographic and architectural properties of the simulation which are important for verifying that they have indeed arrived at their destination.

2.3.2 Selection

Selection becomes necessary as soon as several options are accessible and the user has to choose one before they can continue. Technically, all of the four interaction tasks could be seen as a form of selection. For navigation, the user has to choose where to go and which path to take, whereas dialogue is about selecting a phrase to speak to the other character and, finally, object manipulation means the user has to decide which action to perform on said object. The major difference between these and the other categories is that all of the others directly influence the simulated world. Navigation changes the user’s location within the environment. Dialogue actions advance a conversation and also the plot if some kind of narrative is involved. Object manipulation influences the state of certain entities in order to reach a certain goal. In contrast, the purpose of a selection is to single out the option or entity which currently has the user’s attention before the actual interaction takes place. This information then makes it possible to narrow down what the user can and will do next by discarding unnecessary alternatives. Moreover, focusing on one target is important when the subsequent action can only be applied to one entity at a time, is only supposed to affect a subset of those currently available or when its outcome depends on the specific target’s properties.

Suitable Inputs

A selection is made by a referring expression from the user which has to be parsed and matched to a specific entity. This parsing process is called reference resolution [6] and depends on the modalities involved as well as the current context of an application. Van der Sluis and Krahmer explained that in order to identify an object it is necessary to determine the subset of its properties which is not shared by any of the alternatives [46]. Depending on these alternatives, which are also called "distractors", and their similarity to the desired entity, the referring expression can be very simple or rather complex.

Gesture references, also called deictic gestures, usually consist of pointing at the target [17, 6, 46]. For pen input, alternatives such as marking the target [37] or circling it [6] are also possible. The latter can also serve to select multiple entities at the same time. The precision of a pointing gesture can vary greatly, and while drawing a pen mark on a nearby screen leaves little ambiguity, pointing towards a remote object can mark a larger area containing several entities, similar to a flashlight cone lighting up different regions depending on the distance and the direction angle [46].

What degree of precision is available depends on the actual recognition technology. Immersive virtual worlds with gesture input use contact-free technologies rather than direct pen input, so a certain distance between the user and the target objects is usually inherent in these setups. Vision-based systems make it necessary to stand at an appropriate distance so that all necessary body parts are within the camera's field of view. For example, the "Sugarcane Island" game needed to recognize gestures from kicking to moving the hands in front of the head, so players were placed between 1.5 and 3 meters away from a Kinect sensor [21]. For commercial Kinect games, the recommended distance is about 1.8 meters for one and 2.1 meters for two players [28]. The camera's field of view alone only covers two dimensions of hand movement, but in order to correctly locate the pointing hand, the depth information becomes necessary as well. In VisTA-walk, pointing gestures were restricted to either left or right as that system used only a 2D camera on top of the screen and was not able to detect hands stretched towards it. However, the authors noted that this problem can be avoided with a stereoscopic camera setup [17], like the one used in [25].

Also, for any type of recognition technology, it is necessary to account for the offsets between the sensor, the user's hands and the actual screen displaying the target entities. One possible approach here is to make the pointing gesture relative to the user's position rather than have them point directly at the screen position. For pointing in "Sugarcane Island" [21], the vector between the user's hand and shoulder is used to move a cursor on the screen which then acts as a pointing device touching the target directly. This still leaves some uncertainty as to where exactly the cursor will appear when it is first displayed on the screen, and the user has to learn how their real movements affect the cursor.
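The relative pointing metaphor just described can be sketched as follows. The joint coordinates, the assumed arm reach and the screen resolution are illustrative values, not the parameters used in [21].

```python
# Minimal sketch of relative pointing: the shoulder-to-hand vector is scaled
# onto screen coordinates so the user does not have to aim at the physical display.
# The reach estimate and the screen size below are illustrative assumptions.

def hand_to_cursor(shoulder, hand, screen_w=1920, screen_h=1080, reach=0.45):
    """Map 3D shoulder and hand positions (metres, x right / y up) to pixels."""
    dx = hand[0] - shoulder[0]          # lateral offset of the hand
    dy = hand[1] - shoulder[1]          # vertical offset of the hand
    # Normalize by an assumed comfortable reach and clamp to [-1, 1].
    nx = max(-1.0, min(1.0, dx / reach))
    ny = max(-1.0, min(1.0, dy / reach))
    # The screen centre corresponds to the hand resting in front of the shoulder.
    cursor_x = (nx + 1.0) * 0.5 * screen_w
    cursor_y = (1.0 - (ny + 1.0) * 0.5) * screen_h   # screen y grows downwards
    return cursor_x, cursor_y

# Example: hand 30 cm to the right of and level with the shoulder.
print(hand_to_cursor((0.0, 1.4, 2.0), (0.3, 1.4, 2.0)))
```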

The imprecisions of pointing gestures are a potential cause of accidental selections, so most applications require an additional confirmation after the gesture itself. Several Kinect games, such as "Kinect Adventures" [42] or "Kinect Sports" [39], wait for the user to hold the gesture for a short period before the selection is processed. The "Sugarcane Island" application adopted the same method for one game mode in which quick time events were handled by selecting buttons, with a confirmation interval of 1.5 seconds [21]. Alternatively, additional gestures may follow after pointing at the target. "Sugarcane Island", for example, offers a push gesture to skip the waiting period. The Kinect fitness game "Your Shape: Fitness Evolved" [44] displays a smaller confirmation button below the actual target when the selection starts, and the game "Dance Central" [14] expects a sideways swiping gesture while the hand is pointed at the desired menu item.

Another approach is to calculate a probability value for the recognized pointing gesture, as it was done by Chai et al. in the so-called "Responsive Information Architect" infrastructure which was embedded in an information system about residential properties [6]. When the application recognizes a gesture as indicating a single point, an entity's likelihood of being selected is calculated as a function of the distance between these coordinates and the nearest point of its boundary. If this distance exceeds a given limit, the object is ignored. For circular selection gestures, the selection probability is given by the percentage of the object's area which is covered by the drawn shape.
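The probability-based target scoring described for the "Responsive Information Architect" could be approximated as in the following sketch. Chai et al. only state that the likelihood is a function of the distance to the object's boundary [6]; the linear fall-off and the distance limit used here are assumptions.

```python
# Rough sketch of probability-based target scoring for a recognized pointing gesture.
# The linear fall-off and the 50-pixel cut-off are assumptions for illustration only.
import math

def selection_probability(point, boundary_points, max_dist=50.0):
    """Score an entity by the distance from the pointed coordinates
    to the nearest point of its (sampled) boundary."""
    nearest = min(math.dist(point, b) for b in boundary_points)
    if nearest > max_dist:
        return 0.0                      # too far away: the entity is ignored
    return 1.0 - nearest / max_dist     # closer boundary -> higher probability

def circle_probability(covered_area, object_area):
    """For circling gestures, use the fraction of the object's area
    covered by the drawn shape."""
    return covered_area / object_area if object_area > 0 else 0.0

# Pick the most likely entity among several candidates.
candidates = {"house": [(100, 100), (140, 100)], "station": [(400, 380)]}
scores = {name: selection_probability((120, 110), pts) for name, pts in candidates.items()}
print(max(scores, key=scores.get), scores)
```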

As for speech, references consist of naming the choice and, if necessary, adding a description which contains the distinguishing properties, such as a color or a spatial relationship to another entity [6, 46]. Compared to a pointing gesture, which usually consists of only one movement, this description can become very complex if the objects in question are too similar and many characteristics or relationships must be included to reduce the ambiguity.

The type of object also plays a role in this context. When comparing spoken references for colored geometrical shapes to those necessary for black-and-white portraits of people, van der Sluis and Krahmer observed that test subjects used more words for the latter category [46]. They attributed this difference to the fact that the appearance of a human is defined by far more parameters than that of a basic circle or polygon. Other results were that the descriptions contained significantly more location information and were delivered less fluently when a difficult object had to be indicated. These findings were confirmed in a second experiment presented in the same article, this time involving countries on a map with the description difficulty depending on size and neighborhood.

One way to make this process easier for the user is to fill in missing information from the selection history and the current context. An example scenario for the "Responsive Information Architect" mentioned above consists of the user being prompted to clarify a previous selection of a house which resulted in two possible candidates. Since the currently available subset of entities rules out the other red objects on the screen, the incomplete statement "the red one" is sufficient in this case [5]. Similarly, the system can also use domain-specific knowledge about the visible entities to implicitly select the one which possesses the property the user wants to know about. For instance, when the user asks about the price of an ambiguously chosen entity, a house is selected instead of the train station next to it which is not for sale [6].

Still, one key aspect of that system is that it does not rely on speech alone but also integrates gesture input. Since both modalities have their drawbacks when trying to produce precise references on their own, a lot of literature can be found on combining them for better results. Both of the examples above follow or are accompanied by a pen-based selection which already makes much of the verbal information unnecessary. Also, the two studies by van der Sluis and Krahmer [46] included pointing gestures as well, albeit with two levels of gesture precision enforced by distance. In particular, the complex spoken references were the result of the users being unable to use a precise pointing gesture from 2.5 meters away, whereas far simpler phrases were used when the subject was close enough to touch the target effortlessly with a pen or a stick they had been given for that purpose.

Earlier in that article, van der Sluis and Krahmer had argued that users trade off the costs of producing a referring expression in such a way that the result is precise enough to identify the target correctly but also requires the least effort in any modality involved. Their experiments confirmed that precise pointing gestures were accompanied by short verbal expressions, on average consisting of 0.81 words for shapes and people and 2.76 words for countries, in contrast to the longer expressions necessary when the gesture could only indicate the general region, which required 4.19 and 21.42 words respectively. These findings also match Oviatt's observation that people tend to use simpler language constructs when a more suitable modality such as a pen for drawing is available [36]. In all these scenarios, fewer words need to be spoken, which is more comfortable and at the same time reduces the risk of misunderstandings.

Visualization

First of all, it is difficult to implement functionality for every entity in a virtual world. Consequently, going by the principle of visibility explained in 2.2.3, those which actually are interactive need to stand out from the background and static content. Also, the selected object will have to be marked afterwards. These visual clues do not depend on any particular input method, so various highlighting symbols used in regular applications and games can be taken as examples. However, unobtrusive hints should be preferred where possible to avoid disturbing the user’s immersion in the simulation, just as interaction devices should be logically embedded into the scenario [23].

For pointing gestures, one major convention is the cursor, known from traditional mouse interfaces and adapted to contact-free gesture input as well. The previous section already listed it as a means to enable precise pointing while standing at a distance, and it serves as both a feedforward and a feedback mechanism at the same time. Norman noted that different cursor shapes can be used as cultural constraints which prevent clicking when it is not possible [30], and in analogy to that, computer users can expect to see a cursor when pointing becomes necessary. Since the cursor acts like a pointing device, this is also similar to van der Sluis and Krahmer equipping their test subjects with an electronic pen or a stick in order to encourage pointing gestures [46].

Moreover, this movable symbol marks the location which the system currently recognizes as the gesture's target and thereby helps the user to adjust their hand position for the correct selection. Those applications which use resting on the target for confirmation also display the progress of that delay on the cursor [42, 39, 21], for example, by drawing a circle around the cursor which gradually fills up [29].
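Dwell-based confirmation of this kind reduces to a timer that fills up while the cursor stays on the same target. The 1.5-second interval in the sketch below matches the value reported for "Sugarcane Island" [21]; everything else is an assumption.

```python
# Sketch of dwell-time confirmation: the selection is triggered once the cursor
# has rested on the same target for a fixed interval; the returned progress value
# can be drawn as a circle gradually filling up around the cursor.

class DwellSelector:
    def __init__(self, dwell_seconds=1.5):
        self.dwell_seconds = dwell_seconds
        self.current_target = None
        self.elapsed = 0.0

    def update(self, target, dt):
        """Call once per frame with the entity under the cursor (or None)."""
        if target != self.current_target:
            self.current_target = target    # cursor moved on: restart the timer
            self.elapsed = 0.0
        elif target is not None:
            self.elapsed += dt
        progress = min(1.0, self.elapsed / self.dwell_seconds)
        confirmed = target is not None and self.elapsed >= self.dwell_seconds
        return progress, confirmed

selector = DwellSelector()
for frame in range(60):                     # 60 frames at ~30 fps on the same button
    progress, confirmed = selector.update("open_door", 1 / 30)
print(progress, confirmed)                  # ~2.0 s elapsed -> confirmed
```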

As for speech, it is crucial to indicate what can be included in the referring expression. Traditional point-and-click adventure games tend to reveal items of interest only while they are under the cursor, which still follows the principle of discoverability but trades off visibility against the goals of creating challenging gameplay as well as an immersive, realistic world where labels would appear out of place. The downside is that in the worst case the cursor would have to scan the entire screen for selectable entities. While tedious, this is still feasible for two dimensions, but the search space for verbal expressions is far larger. This makes it even more important to highlight interactive objects as mentioned earlier.

Even if the visible objects are different enough to be selected by a single noun, the user may still have to choose from a set of synonyms which may or may not be known to the speech recognizer. A similar problem can be illustrated by a Wizard-of-Oz experiment based on the "Sugarcane Island" application, which compared two modes of letting the subjects speak their choice for progressing the story [21]. In particular, the "indicated" mode displayed the possible keywords whereas "freestyle" contained no such visual clues, but offered the possibility to ask for the available options. Although this example falls somewhere between the different task categories defined for this thesis, it is nevertheless useful in this context because the users in this study found the indicated mode significantly easier to use and more convenient, and also expected others to learn this mode more quickly. Transferred to speech selection, this signifies the user's need for guidance when they are trying to speak a choice.

The most obvious clue would be to label each selectable object with the designation known to the speech recognizer. This works well for a simple reference, but the more flexible expressions described by Chai et al. [5, 6] are a more complicated case. Some of the distinguishing characteristics, like a house's color and position, can easily be displayed on the screen and discovered by the user. The problem, however, is that such a depiction can also include more than the recognized properties, so the user needs to know which of these actually are within the recognizer's scope. Problems with expressions outside the vocabulary are mentioned for this system [6], but no details are given about how the user was informed about its capabilities. Since providing visual cues about complex grammars would take up a lot of screen space and distract from the actual application, the focus in such a case would be on powerful speech recognition which provides some robustness to unexpected referring expressions, covering as much of the already apparent properties as possible.

2.3.3 Dialogue

Simulated worlds, especially in games, are usually populated with non-player characters, or NPCs for short. Some may serve as background detail which makes the world appear alive, but others hold valuable information for the user or are ready to follow their orders. The user then has to talk to these entities in order to accomplish a goal or follow a narrative. In the ORIENT system [23], for example, communicating with the virtual inhabitants is the key to solving the scenario as well as its educational motivation.

Dialogue in virtual worlds often has an effect on how the plot of an interactive storytelling application unfolds, which sets such applications apart from information-providing systems like the one using the "Responsive Information Architect" [5, 6] mentioned in 2.3.2. Cavazza et al., for example, developed a mixed reality application which lets the user take the role of the villain in a James Bond adventure [4], and their answers and statements determine which options are available to the NPC's planning algorithm. If, for instance, the user refuses to give out a necessary piece of information, the NPC must try to get it from a different source.

Although both types of application are based on the conversation paradigm and dynamically adapt to the user's utterances, the style of these interactions is different. While Oviatt observed that humans tend to use a rather conversational tone even when they are merely placing units on a map [37], virtual worlds usually involve an additional degree of role-playing, be it for passing a training scenario or experiencing a story.

The role a user has to play can have a radical influence not only on the topics and vocabulary, but also on the style of expression. In the ORIENT example [23] the players are faced with a culture which is literally alien to them, and these cultural differences are reflected, among other things, in an artificial language barrier. Regardless of their home culture and mother tongue, the users are forced to learn an unfamiliar mode of communication. Other settings, like the Bond example above, go the opposite way and rely on the user's knowledge of stereotypical plot developments to guide them through the narrative [4]. The language in this case is natural and, unlike the one used for information requests, can also convey aspects of the portrayed personality and their attitude towards the NPCs.

Furthermore, plain requests for information may depend on context information such as the selection history [5], but usually nothing prevents the user from asking the same questions over and over again. This may even be desired, for example when comparing a new entity to one examined earlier. In a story-based scenario, however, the overall context marches on and earlier dialogue can rarely be repeated once that topic has been dealt with.

Suitable Inputs

Communicating with a virtual character bears a strong resemblance to a conversation between humans, and the simulation of lifelike agent behavior is an active research topic of its own. By definition, dialogue with such a character simulates talking to them, so the use of speech seems straightforward for this task. In [4], Cavazza et al. called speech "the only practical mode of communication [...] in an interactive storytelling context, in addition to its being part of the narrative itself." But speech, especially natural speech, is still difficult to parse and interpret for a computer system, considering that even real humans have trouble understanding each other when the words are acoustically or semantically ambiguous. There are several approaches to speech recognition and interpretation, some more limiting on the user's expressions than others.

The easiest way is to enforce simple speech patterns in order to reduce the potential for unexpected inputs. Sentences in ORIENT consist of three parts, based on the subject-verb-object structure which is known from many natural languages like English and German [23]. The first component is the addressed person's name, recognized by keyword spotting on a mobile phone which represents a suitable communication device for the science fiction setting. This is followed by a symbolic gesture for the verb and scanning the RFID code of a physical object. The scanned code stands in for the object's name which can't be spoken because the players do not know the aliens' term. But despite this rigid structure, some users still confused the order because their native languages were more flexible in this respect.

Another method, mentioned by Cavazza et al. [4], is multikeyword spotting. A system using this approach waits for the user to speak any single keyword or fixed sequence from a given set but ignores the rest of the sentence. This gives the users more leeway for phrasing their input while the most important information such as greetings, orders, or requested topics can still be identified. However, this freedom comes at the price of losing the keywords' semantic context.

The approach Cavazza et al. used in their own application is based on template matching [4]. When a certain keyword such as an action verb or the subject is detected, the rest of the utterance is searched for additional parameters expected in this context. If no template matches, typical positive or negative phrases are used to identify general tendencies like friendliness or denial. These intermediate results are then compared to the current plot progression in order to determine the so-called "speech act", by which the authors mean expressions which can be mapped to specific narrative actions. Such a more complex, grammar-like parsing still allows for some flexibility from the user's side while the sentence's actual meaning can be identified more reliably.
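In its simplest form, multikeyword spotting scans the recognized utterance for any entry of a fixed keyword set and ignores the remaining words. The keyword list and the mapping to dialogue acts in the following sketch are invented for illustration and are not taken from any of the cited systems.

```python
# Minimal sketch of multikeyword spotting: any known keyword or fixed phrase in the
# utterance is reported, the remaining words are ignored. The keyword set and the
# mapping to dialogue acts are invented for this example.

KEYWORDS = {
    "hello": "greeting", "good evening": "greeting",
    "open": "order", "follow me": "order",
    "ship": "topic_ship", "engine": "topic_engine",
}

def spot_keywords(utterance):
    text = utterance.lower()
    hits = []
    for phrase, act in KEYWORDS.items():
        if phrase in text:                  # fixed sequences are matched as substrings
            hits.append((phrase, act))
    return hits

print(spot_keywords("Good evening, could you tell me something about the engine?"))
# -> [('good evening', 'greeting'), ('engine', 'topic_engine')]
```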

As for gesture input, there are different types of gestures which play important roles in human communication [4, 32, 3, 40].

Section 2.3.2 already dealt with deictic gestures used in referring expressions. Just like those used for selection, they signal that a certain object has the user's attention, but in this case, the information is meant for the dialogue partner instead of the overall system. Iconic gestures, which depict certain properties of objects such as their shape, can also be used to identify them.

For their acting application, Cavazza et al. focused especially on semiotic gestures, the category which contains gestures for various messages, such as greetings or certain attitudes [4]. The authors held the belief that these actions can be mapped to speech acts in a similar way as their spoken counterparts. Those gestures which Norman pointed out as learned, culture-dependent conventions, such as shaking one's head or waving at another person [32], all fall into this category as they are linked to a known expression. Additionally, artificial gestures like those depicting verbs in ORIENT [23], both for user input and between the NPCs themselves, serve the same purpose and can be counted as well. However, both sources mention that these symbols are not necessarily easy to learn or remember [23, 32]. Furthermore, many of these are ambiguous and highly dependent on the context they are displayed in. For example, Cavazza et al. wrote that the same gesture of opening one's arms can have a number of different meanings, such as welcoming someone, signalling ignorance or challenging the other person [4].

In their system, the gestures were seen as an optional addition to the spoken utterance, assisting in interpreting its meaning and vice versa. Again, both modalities are combined for a more robust form of communication like they are for the selection expressions described in section 2.3.2.

Visualization

Section 2.3.2 already contained the example of the "Sugarcane Island" study which showed that users preferred to see what they could say to the system [21], true to the principle of visibility. In that application, the keywords were displayed on the screen, similar to the way traditional computer games display the available sentences which are then selected by mouse or keyboard. Apart from prompting the user to choose one, this kind of option display also often gives an impression of what effect the utterance will have on the dialogue.

Another way to support the user in forming an utterance is to show information they need to include. For example, NPCs in the ORIENT application are labeled with their name, which can be seen on an illustration in [23] and in the trailer video downloadable from the eCIRCUS project pages [11]. This overlaps with displaying information necessary for selection.

But apart from that, not much is to be found in the literature. As with the reference generation for selection, systems with near-natural speech input seem to rely on the domain and the recognizer's ability to handle unexpected input instead of presenting the user with information about the vocabulary. Neither is there information about visual clues for the gestures. It is mentioned that the ORIENT gestures were learnt in advance, so there is probably no visual reminder included [23].

The result of a dialogue input is usually signalled by the NPC's reaction, going by the very definition of dialogue as an exchange of messages. Such reactions can include spoken content, facial expressions or actions, depending on the scenario and the application. The behavior of virtual agents in simulated conversation is a research topic of its own and will not be covered here. However, the user can compare this reaction to the one they intended with their input. For example, if the answer does not match their question, there has probably been an interpretation error.

As for correcting one's input before it is too late, the mixed reality system by Cavazza et al. provides the user with a way to observe their gesture input from the outside [4]. Their system is based on a "magic mirror" metaphor and projects the user's image into the virtual world, segmented in real-time and displayed on a screen in front of them. While this is not explicitly mentioned as a feedback mechanism, it allows the user to watch and adjust their body language during the interaction.

2.3.4 Object Manipulation

The actual interaction with virtual entities is a complex topic. In most applications, such as games and training simulations, the world and the objects it contains have to be manipulated, either to observe their behavior under a given set of circumstances or in order to reach a specific goal. However, real world objects can have any number of possible interactions which are still difficult to model in a computer simulation.

Designers either have to predefine every possible functionality or combination, which is hardly a realistic goal, or they have to develop a similarly complex set of rules which allows the program to derive these functionalities on its own, at the cost of increased computation time. The former approach usually focuses on those functions which are easily discovered, which can leave a user to wonder why other actions they know from reality are not available in the simulation. The latter is often found in physics simulations, for example collision detection on arbitrarily chosen objects, which have to be simplified to meet the real-time requirements of a virtual reality application, especially when additional computation is required for AI reasoning and input processing.

Jasper Uijlings' work on the "Virtual Storyteller" [45] provides a detailed example of the second approach. Although the "Virtual Storyteller" does not primarily deal with user interaction, it requires a world model from which it can derive actions and reactions to drive its narrative. One of his goals was to keep this model as complete and universal as possible so it could be reused for different story settings, which makes it a useful basis for simulated worlds in general. Uijlings defines such a story world as a number of hierarchically connected locations containing different objects and agents. Two ontologies are provided to organize these objects and the actions they are involved in, by associating them with semantic categories and subcategories.

For creating the action ontology, Uijlings analyzed several types of actions which were deemed relevant in a storytelling context. As the abstraction level for these actions, he chose the so-called semantic level which only includes physical actions affecting the environment, but neither details like moving particular body parts nor the underlying motivations and long-term goals. Another way to reduce the complexity was to leave out those actions which only result in information gain or the change of emotional and social states. Based on these assumptions, Uijlings arrived at several action categories which are summarized below [45]. The TransitMove category will be ignored here because it relates more to navigation than to object interaction.

CorpuscularObjectMove This category deals with movements onto, into and out of solid objects which are not large enough to represent a location of their own, such as containers and furniture. The directions of movement form the major subcategories, with movement off the target handled as movement back onto the ground. The MoveOn-action also differentiates between the postures of the agent, who can be lying, sitting or standing on the target object. Preconditions include that the target is accessible, which depends on its proximity, its height and whether it is open.

Transfer Transfer actions change the spatial relationship between an object and an agent, and the main distinction is made between the two directions to and away from the agent. Each direction has subcategories for solid objects, wearable clothes and transfer involving a container, which is especially important for non-solid substances. The different types of container lead to more specific actions for substances. Preconditions for those actions are related to holding either the object or its container, which means that the actor must possess the necessary strength for its weight and be able to grasp an object of the given dimensions. If something is to be transferred into or out of a container, this container has to be open and big enough. Finally, the objects involved have to be within reach of the actor.

Drag Dragging consists of pushing and pulling an object along a given path or over other objects, with no subcategories. In contrast to Transfer, it depends on the path of an object which must be passable and smooth enough. Weight and dimensions play an important part in this category because it is used for objects which cannot be handled by the regular transfer actions due to these properties. However, an actor still needs to be strong enough and, in case of the pulling action, able to grasp the object in order to perform a dragging action.

Consume Consumption is divided into eating and drinking for solid objects and fluids, respectively. Digestible objects and substances are classified as food and affect the actor's body in a way specified by the nutrients they contain; others remain unchanged.

Control Controlling another entity, such as a machine or a riding animal, enables an actor to use its actions. Specifically, the category contains actions for taking and losing control. While an actor is in control, they temporarily lose their own actions and gain those of the object. Different actions were created for whether the object is sentient and for whether the actor is supported or contained by them, because different preconditions apply for these types of objects.

Manipulate Manipulation actions change attributes of an object, which can also influence the availability of their actions. The category contains switching objects on, closing and locking them, and the reverse actions for each one of these. There is also a folding action which transforms a functional object into its non-functional folded version, but unlike the transforming action in the "create" category it is implied to be reversible. For these actions, the objects require status attributes and information about which other changes depend on their value.

Attack Attacking a solid object damages it by either disabling available actions or by modifying property values, such as the quality of a tool, the protectiveness of a piece of armour or the beauty of an artifact. The category is subdivided into different forms of attack, namely kicking, punching, shooting and striking.

Attach The Attaching category contains actions which connect or disconnect two objects. The subcategories are related to the different types of devices enabling those connections, similar to the containers used for transferring substances. The attachment strength, which can depend on the skill of the agent performing the action, is stored as a property of the new relation and serves as a necessary precondition for the respective detaching action. For attaching, preconditions can be related to the attachment device in use, for example the length of a rope.

Create These actions deal with transforming, assembling and disassembling objects into something else. A transformation is considered irreversible while the other actions may undo each other's effects with varying success. Preconditions include the skill of the actor, which is compared to a given difficulty, and necessary components and tools. However, Uijlings notes that tools are often interchangeable and it is difficult to determine all objects which qualify as a tool for a specific task. For disassembling, the strength of individual parts can influence the damage they sustain during the process.
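Read as a data model, the categories above amount to a small ontology of action types with preconditions. The following sketch shows one possible encoding; it is not Uijlings' actual representation, and the action names and precondition strings are chosen for illustration only.

```python
# Sketch of how Uijlings' action categories could be encoded: each action type
# names its category and the preconditions that must hold before it is applicable.
# This is one possible encoding, not the representation used in [45].
from dataclasses import dataclass, field

@dataclass
class ActionType:
    name: str
    category: str                     # e.g. "Transfer", "Drag", "Manipulate"
    preconditions: list = field(default_factory=list)

ACTIONS = [
    ActionType("TakeFrom", "Transfer",
               ["object within reach", "actor strong enough", "actor can grasp object"]),
    ActionType("Pull", "Drag",
               ["path passable and smooth", "actor strong enough", "actor can grasp object"]),
    ActionType("Open", "Manipulate",
               ["object has 'open' state", "object not locked"]),
]

def applicable(action, facts):
    """An action is applicable when all of its preconditions are satisfied."""
    return all(p in facts for p in action.preconditions)

facts = {"object within reach", "actor strong enough", "actor can grasp object"}
print([a.name for a in ACTIONS if applicable(a, facts)])   # -> ['TakeFrom']
```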

The actions defined by Uijlings cover the majority of interactions inside a simulated world, but they only include the necessary information for using them in written or spoken text. In his thesis [45] he already noted that the so-called realisation level would be more suitable for animated stories, since it describes the actual steps the actor has to take while performing an action. The same applies to user input because interaction via speech or gesture requires that both the user and the system know which commands will trigger the actions and how these inputs have to be carried out.

Some information can already be derived from the preconditions, such as the necessity of grasping an object in order to move it. Other actions, like those of the "attack" category, directly refer to physical movements, so it would be obvious to map them to the related gestures. But the actual details of such a gesture, like the movement of each body part, cannot be found in these definitions. Therefore, the world representation needs another layer of information for the realisation level.

One way to store this information is by using smart objects, which is a common approach for letting virtual characters interact with a dynamic world. Examples for this kind of simulation can be found in [18] by Kallmann (the "Agent Common Environment" mentioned in the navigation section), in [22] by Kistler et al. or in [15] by Heckel and Youngblood. The last two also refer to "The Sims" as a famous representative of commercial games using this principle.

Kallmann defines many different interaction features which are stored in the smart objects of the "Agent Common Environment" [18, 19]. Among those are information about their parts, locations and state variables, actions which can change both of these for any part, and commands which will call these actions with the appropriate parameters. Furthermore, behaviour scripts provide step-by-step instructions not only for the changes to apply to the object, but also for the interaction which is expected from the user. These instructions can then be interpreted according to the current application, which means they are played as animations in the case of a virtual character and compared to sensor data when input from a human user is required.
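The interaction features listed by Kallmann can be pictured as a small data structure like the one below. The field names and the example object are assumptions made for illustration; they do not reproduce the actual format of the "Agent Common Environment".

```python
# Sketch of a smart object in the spirit of Kallmann's interaction features:
# parts, interaction positions, state variables, and commands whose behaviour
# scripts describe both the object's changes and the expected user input.
# Field names and the example object are assumptions, not the actual ACE format.
from dataclasses import dataclass, field

@dataclass
class SmartObject:
    name: str
    parts: list = field(default_factory=list)            # movable sub-geometries
    positions: dict = field(default_factory=dict)        # named interaction spots
    state: dict = field(default_factory=dict)            # state variables
    commands: dict = field(default_factory=dict)         # command -> behaviour script

door = SmartObject(
    name="container_door",
    parts=["door_leaf", "handle"],
    positions={"stand_here": (1.2, 0.0, 0.4), "grab_handle": (1.0, 1.1, 0.3)},
    state={"open": False},
    commands={
        # each step can be played as an animation for an NPC
        # or matched against sensor data for a human user
        "open": [("move_hand_to", "grab_handle"),
                 ("rotate_part", "door_leaf", 90),
                 ("set_state", "open", True)],
    },
)
print(door.commands["open"][0])
```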

Suitable Inputs

Virtual objects in a three-dimensional, computer generated world are supposed to represent real objects, which influences the way people interact with them. Corradini and Cohen examined the kind of gestures people would use to control "Myst III" [8] and presented their findings on manipulative gestures which were directly related to changing an object's state, as opposed to deictic gestures. They noted that six of their ten subjects used gestures manipulatively during multimodal interaction, with five of those performing movements based on the affordances of the object's real counterpart. For unimodal gesture input, those four who used manipulative gestures also moved their hand in such a way.

Similar results were obtained by Kistler et al. when they compared two different gesture input modes for quick time events in "Sugarcane Island" [21]. The first mode had the users point at randomly sized buttons which were placed at randomly chosen screen coordinates. In contrast, the second mode prompted them to perform a full body gesture resembling the action which was described by the narrator. For example, climbing a cliff would be done by alternately moving the hands up and down, or a kick could be performed by raising one foot. The subjects in this study rated the latter mode as significantly easier to use and learn, more comfortable and less inconvenient, and also more fun and satisfying. Although the virtual world in that application was only narrated and no actual 3D objects were shown besides a talking head character, this further supports the idea that users prefer movements which are directly related to the game world.

The most basic form of manipulating an object in 3D space is changing its position and orientation. In terms of Uijlings' categories, these gestures could be used to perform "drag" actions. Other actions, such as turning a wheel, could be attributed to the "manipulate" category as they change an object's orientation property.

Manders et al. implemented two-handed gestures for this task which were centered around the user's chin [25]. In order to translate an object along one of the three axes, both hands are moved in the respective direction. For forward and backward movement, an offset along the z-axis compensates for the fact that the user's body blocks the path. Roll and yaw are set by moving the hands in opposite directions with regard to the origin (including the offset just mentioned), which has some similarity to grabbing the sides of an object and physically rotating it around the relevant axis. However, the starting position consisted of both hands being held in front of the face and parallel to the sideways pointing x-axis. On this basis, Manders et al. were not able to find an intuitive pitch gesture which would fit the style of the other inputs.

In addition to the direction, all five gestures also express the desired speed of the transformation. The acceleration for translational movements is directly proportional to the average distance between the hands' current and neutral positions. For rotations, the distance between the hands on the respective axis was mapped to the angular velocity.
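The mapping reported by Manders et al. can be approximated as follows. Only the proportionality between hand displacement and movement speed is taken from [25]; the neutral hand positions and the gain factors are assumptions.

```python
# Sketch of two-handed object transformation in the style of Manders et al.:
# translation speed follows the average hand displacement from a neutral pose,
# yaw/roll follow the difference between the hands on the relevant axis.
# Neutral pose, z-offset and gains are assumptions; only the proportionality
# itself is taken from [25].

NEUTRAL_LEFT = (-0.15, 1.5, 0.3)    # assumed resting positions in front of the face
NEUTRAL_RIGHT = (0.15, 1.5, 0.3)    # (x right, y up, z towards the screen), metres
GAIN_TRANSLATE = 2.0                # assumed scaling to velocity units
GAIN_ROTATE = 3.0

def object_motion(left, right):
    # Average displacement of both hands drives translation along each axis.
    translation = tuple(
        GAIN_TRANSLATE * 0.5 * ((l - nl) + (r - nr))
        for l, r, nl, nr in zip(left, right, NEUTRAL_LEFT, NEUTRAL_RIGHT)
    )
    # Opposite hand movement drives rotation: yaw from the z-difference,
    # roll from the y-difference between the hands.
    yaw_rate = GAIN_ROTATE * ((right[2] - NEUTRAL_RIGHT[2]) - (left[2] - NEUTRAL_LEFT[2]))
    roll_rate = GAIN_ROTATE * ((right[1] - NEUTRAL_RIGHT[1]) - (left[1] - NEUTRAL_LEFT[1]))
    return translation, yaw_rate, roll_rate

# Both hands pushed 10 cm forward: pure forward translation, no rotation.
print(object_motion((-0.15, 1.5, 0.4), (0.15, 1.5, 0.4)))
```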

In the virtual reality system described by Kallmann [19, 18], the user can manipulate objects in more detail by means of a data glove to which a magnetic tracker had been attached. The tracked position is projected into the virtual environment where it is compared to the interaction positions stored within the smart object representation. As soon as the distance is smaller than a certain threshold and the hand's orientation is similar enough to the one specified, the interaction plan associated with that position is enacted. For example, individual parts of the object are moved or state variables are changed.

Such a plan can also involve multiple hand positions for longer and more complex actions, in which case the system waits for the user to move their hand accordingly before the rest of the behavior script is triggered. Furthermore, the stored hand positions also include information about the rotation and the desired hand shape. The shape is ignored for most interactions, but when two trigger positions are too close to each other, this additional data is compared to the glove's deformation and used to resolve the ambiguity between the actions in question.

Since Kallmann's goal was to create a high-level interaction metaphor, the actions are only triggered but low-level details of the movement, such as flexing individual fingers or following a moving part of the object, are ignored for data glove interactions. This information is rather used for animating virtual actors. However, gesture trajectories could be defined as sequences of separate steps if necessary, and hand shapes could be used to signal that an object has been grasped before it is manipulated.
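The matching step described here boils down to comparing the tracked hand pose against the trigger positions stored in the smart object. The thresholds and the similarity measure in this sketch are assumptions, and the hand-shape disambiguation mentioned above is omitted.

```python
# Sketch of triggering a smart-object interaction from a tracked hand pose:
# the plan fires once the hand is close enough to a stored position and its
# orientation is similar enough. Thresholds and similarity measures are
# assumptions; hand-shape disambiguation is left out.
import math

def angle_between(a, b):
    """Angle in degrees between two unit direction vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot))))

def matching_trigger(hand_pos, hand_dir, triggers,
                     max_dist=0.15, max_angle=30.0):
    """triggers: list of (name, position, unit direction); returns the closest match."""
    candidates = []
    for name, pos, direction in triggers:
        dist = math.dist(hand_pos, pos)
        if dist <= max_dist and angle_between(hand_dir, direction) <= max_angle:
            candidates.append((dist, name))
    return min(candidates)[1] if candidates else None

triggers = [("turn_handle", (1.0, 1.1, 0.3), (0.0, 0.0, 1.0)),
            ("push_door",   (1.0, 1.2, 0.3), (0.0, 0.0, 1.0))]
print(matching_trigger((1.02, 1.12, 0.28), (0.0, 0.0, 1.0), triggers))  # -> turn_handle
```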

Compared to gestures, speech commands possess a simpler structure. Corradini and Cohen presented an excerpt of a session's transcript which showed some of the commands issued by one of their test subjects [8]. Apart from references to the object of interest, the sentences in this example included the verb for the desired action and resembled orders given to a personification of the computer system, like the phrases "grab it and pull it down" or "turn it". This style of expression is similar to the utterances Oviatt observed when users were managing virtual entities in her studies [37]. For example, users would say "close Morrison bridge at the East end" or "put a sandbag wall along [a certain path]".

However, one more important finding of the Myst study was that users strongly preferred multimodal commands [8]. While three of the ten subjects used speech alone for more than 60% of the time, six of them used multimodal expressions to that extent. As mentioned in 2.2.2, multimodal interaction often stems from the users' tendency to trade off the advantages and disadvantages of the modalities at hand. In the case of that study, several users explained that speech was less exhausting and worked better for them, whereas most of the subjects added gestures to their verbal expressions. This is probably similar to the reason behind multimodal references, as explained in section 2.3.2, because certain interactions, such as pulling down part of an object, include spatial components which are more difficult to describe with words. As for the opposite direction, moving body parts does consume more energy than simply standing still and talking. Again, the goal for producing these expressions would be to balance precision and effort for each channel.

Visualization

The users' tendency to compare virtual objects to their real-world counterparts already helps them to discover available actions. However, as it was mentioned earlier in 2.2.3 and at the beginning of 2.3.4, neither their functionality nor the way it is activated can be taken for granted. One fundamental rule is to design the virtual objects in a way that draws attention to their actual affordances, similar to how the real objects would be designed. But when they are supposed to look realistic, it is not always possible to omit misleading signifiers.

Furthermore, signifying the difference between the real action and the input to a computer system is not always possible using only clues on the object itself. For instance, a virtual bowling ball like the one Norman referenced in [32] can be modeled after real bowling balls, including the same finger holes which signify grabbing it, but there is no obvious clue when it comes to letting it go. From the virtual ball alone, the user could not deduce whether the hand should be closed or open when the arm swings forward.

Consequently, most applications need to add artificial feedforward mechanisms to ensure that the user detects the right interaction possibilities as well as the correct input. This information is often presented before the user begins interacting with the system. For example, "Kinect Sports" [39] displays animated human outlines before the game starts which show relevant body movements for handling the virtual sports equipment. The moving body parts are marked in green. Within the game, the user is reminded of these gestures by text clues or graphical symbols at the appropriate moment.

Such in-game clues are another important element. Ideally, they should not only signify that a known action is expected but also remind the user of the way it should be performed. "Sugarcane Island", for instance, displays symbols which represent a key posture of the required gesture whenever a quick time event is encountered [21].

A second reason for presenting these clues in context is the number and diversity of interactions. For example, the smart objects described by Kallmann [19, 18] can have any number of interactive parts, and keeping all of them in mind can be difficult. Moreover, one advantage of the smart object concept is that any of them can be loaded into any scene without adapting the simulation's infrastructure, which on the other hand means that the set of interactions might change easily and be difficult to predict. Therefore, Kallmann proposes to use the input information stored within the smart object, namely the hand positions and postures, to indicate nearby interaction possibilities by displaying an appropriately posed 3D hand in these locations, to which the user should align their own.

Since static symbols alone, like those used for "Sugarcane Island", lack important information like the trajectory and the movement speed, other applications use animations instead. For example, "Kinect Joy Ride" [38] plays animations when the so-called "boost" gesture is expected. A cartoon drawing of a driver is then shown pulling their arms back and pushing them forward to boost their vehicle's speed. Games based on body movement without object interaction, like the fitness application "The Biggest Loser - Ultimate Workout" [13] or the dancing game "Dance Central" [14], tend to display NPCs whose movements the user has to mimic, seamlessly embedding the instructions into the game world.

One major problem in this context is the mapping between the user's movements and those of the animated figure. The cartoon animation in "Kinect Joy Ride" [38] is seen from the side, which is suitable since both arms are moved along one easily identified axis. But in many cases the correct orientation is difficult to determine. Obviously, the presenting character cannot be turned in the same direction as a user facing the screen because their back would occlude the demonstrated arm movements. The animations in "Kinect Sports" [39] are basically seen from the front but also slightly turned to the side, with the angle depending on the relevant limbs and the movement direction for that particular gesture. The depicted character uses the same hand as the user will, which means that a movement of the right arm will be seen on the left side. Since these instructions are shown before the game starts, the user has enough time to understand which limb is supposed to move. For applications with an arbitrarily high number of gestures, however, it is necessary to present these in context.

Another approach, especially when arbitrary gestures are only displayed in context, is to use a mirror metaphor. "Dance Central" [14] explicitly instructs the user to imagine the avatar as their mirror image. This kind of mapping is well-known from everyday actions like brushing one's teeth and therefore saves the user the mental effort of rotating the depiction to match their own body. Nevertheless, it is important to inform the user when they are dealing with a mirror-inverted image so they know that this additional matching is not expected.

Like for the inputs themselves, visualization examples for speech are rarer in the literature, but many of the concepts above can be applied to this modality as well. Ideally, the keywords should follow logically from the actions they trigger and need no additional clarification. If this is not possible, for example if the action itself is not easy to discover in the first place, the speech commands need to be advertised to the user at the correct moment, most likely by displaying the expected keywords as it is done for selection or dialogue. Clues for speech even lack some of the complexities inherent in gesture presentation because speaking itself does not depend on orientation, trajectories or the choice of body parts which would make the necessary clue difficult to understand.

For feedback, there are again several approaches available. The most basic form of feedback is still to notify the user that an input was recognized, for example by visually highlighting the associated option. In "Sugarcane Island" this is done by drawing a check mark over the completed gesture's symbol [21]. Furthermore, some feedback is already inherent in the objects themselves, as the state variables changed by manipulation actions are usually reflected in their appearance. If, for instance, a part of an object starts to move, the user can compare this movement to the one they expected, regardless of the input modality.

But apart from signalling success, feedback is also important for identifying and correcting errors. This is especially important for gesture input since gesture commands for the current task category are as varied as the objects in the simulated world. For vision based systems the raw image is sometimes displayed for comparison, for example in "Dance Central" [14]. In "The Biggest Loser: Ultimate Workout", the depth image is colored according to how well it matches the current exercise and additional text instructions provide details about specific problems [13]. In "Your Shape: Fitness Evolved", or at least its "Virtual Smash" mode, this depth image is directly used as the user's avatar which then collides with breakable blocks [44]. Other applications, such as "Kinect Adventures" [42] and "Kinect Sports" [39], make use of modeled avatars which copy the user's movement, reversing the mechanism which was used for instructions. A third option, also seen in "Dance Central" [14], is to highlight those body parts which move incorrectly on the character displaying the flawless movement, which at the same time draws attention to what the correct movement should look like.

Chapter 3

Practical Work

3.1 Overview

The practical part of this thesis involved a user study to confirm the theoretical findings detailed in the previous chapter. Its goal was to offer both input methods to the user and let them choose the one they preferred for each task. For this experiment, an application was developed which allowed them to interact with a virtual world using only gesture or speech input.

This chapter will first present this application, starting with the scenario and the interactions it offers, followed by the inputs chosen for each task, the visual clues provided for the user and, finally, the implementation. The second section deals with the study itself. It will explain the objectives, hypotheses and the experimental setting before the results are presented and discussed.

3.2 Application

Examining user interaction with a virtual world obviously requires such an interactive environment. The world used in this thesis contains a short scenario which resembles traditional point and click adventure games but uses speech and gesture recognition instead of a mouse and keyboard. Because the study focused on the inputs, the story in question is predefined and mostly linear. This was not only simpler to implement but was also meant to keep the sessions of different subjects comparable.

The scenario uses a first person perspective like, for example, the games of the "Myst" series [9], which is meant to increase immersion by basically letting the user see through the protagonist's eyes. At the same time, this perspective removes the need to replay user movements on a 3D avatar. The avatar approach can be seen in other games such as "Kinect Sports" [39], and while it can provide helpful feedback, it also runs the risk of looking unnatural and confusing. Automated inverse kinematics may not always give the intended result when mapping the movements of an arbitrary human skeleton onto that of a fixed computer-generated character. Pre-defined animations, on the other hand, limit the number of behaviors which can be replayed and are therefore not flexible enough for different scenarios with different sets of input gestures.

Speaking of different scenarios, the technical part was designed to be as reusable as possible. It is independent of any particular setting, which means that the world can easily be replaced by a simpler training environment or future, more complex scenarios. Once the program is running, all inputs including menu handling are meant to be done by either speech or gestures, but the actual triggering of all interactions was kept separate from input processing. This has several advantages. First, it makes it possible to choose freely between the available inputs. Second, it allowed for testing the basic infrastructure with mouse and keyboard input. And third, it also leaves the door open for the future addition of different input technologies or a form of multimodal integration.

The integration of real input technologies is another key aspect of the application. In contrast to the Wizard-of-Oz setups used in comparable studies, like [37] by Oviatt et al. and [8] by Corradini et al., an important goal here was to use actual speech and gesture recognition. Those studies were conducted in 2004 and 2005, respectively, and in the meantime the technologies in question have matured enough to become part of consumer products. So it seemed appropriate to use currently available input technologies, even if this carried the risk of recognition errors. These errors were expected to give an impression of today's technical capabilities, and either modality's reliability was taken into account as a factor which would influence the user's choice.

The following subsections will elaborate on various aspects of this application, starting with the virtual world and the interactions it had to provide. After this, the input metaphors chosen for each of the four tasks will be explained, followed by their signifiers and feedback visualizations. The fourth subsection will then provide details of the technical implementation, in particular smart objects in the Horde3D GameEngine [35], Kinect gestures via the "Full Body Interaction Framework" (FUBI) [20] and speech input via the Microsoft Speech Platform [27].
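The separation between input recognition and interaction triggering mentioned above can be pictured as a thin command layer which all front-ends feed into. The class and command names in this sketch are invented for illustration and do not mirror the actual implementation.

```python
# Sketch of decoupling input recognition from interaction triggering: speech,
# gesture and debug (keyboard) front-ends all emit the same abstract commands,
# so the scenario logic never needs to know which modality was used.
# Names are invented for illustration and do not mirror the actual implementation.

class InteractionManager:
    def __init__(self):
        self.handlers = {}                    # command name -> callback

    def register(self, command, handler):
        self.handlers[command] = handler

    def trigger(self, command, **params):
        handler = self.handlers.get(command)
        if handler:
            handler(**params)

manager = InteractionManager()
manager.register("navigate", lambda direction: print("walking", direction))
manager.register("select", lambda target: print("selected", target))

# Any front-end simply maps its recognition results to the same calls:
manager.trigger("navigate", direction="forward")   # e.g. from a lean or step gesture
manager.trigger("select", target="guard")          # e.g. from the spoken keyword "guard"
```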

3.2.1 Available Interactions

The storyline of the scenario is that of an extraterrestrial who was captured after their ship's engine malfunctioned and the vessel was forced to crash-land. The player's objective is to escape their cell, reach their ship and repair the engine in order to leave again. Similar to a regular adventure game, this scenario covers all four of the examined tasks. Test subjects must navigate in order to reach different locations, select conversation partners and objects, talk to virtual characters in order to advance the plot, and manipulate virtual tools and devices to change the world state in their favor.

Table 3.1: The interactions in the virtual world, comprising 18 navigation commands (for example "turn to guard", "move to technician" and "enter ship"), 17 selection targets (such as "guard", "bed", "window", "tool" and "crate"), 17 dialogue options (such as "where am I", "why imprisoned" and "end conversation") and 18 object interactions (such as "push to window", "unlock window", "pick up", "attach tool" and "insert spare part").

Environment

The world in which this story takes place is a continuous, three-dimensional environment. Like the example systems mentioned in 2.3.1, this allows for free movement within the simulation's setting. This particular setting offers a variety of locations, from the open area between container buildings and the crashed space ship to the medium-sized interior of the storage container and finally the smaller spaces inside the prison cell and the ship itself. Both the ship and the storage container are entered through doors which require some precision to hit, and the ship's entrance also adds a narrow ramp which can't be accessed from the sides.

In addition to moving along the floor, the user can look up or down in order to find entities or examine them more closely. Although interactive objects were placed at a comfortable height whenever it was possible, some could not be designed this way without losing the sense of realism. For example, the prison cell contains a single camp bed with a height of 47 cm, and a tool needed for the engine repair is placed on a table with a working height of 70 cm. Both objects' measurements are based on those of actual furniture, and both entities are placed on the floor as they would be in reality. This means they are below the default eye level of 160 cm and disappear into the blind spot when the user is standing right in front of them. Just as in the real world, this is the kind of situation where a human will turn their gaze downwards to keep the target within their field of vision.

Virtual characters

Unlike the third Myst game used by Corradini et al. [8], this setting required conversations with NPCs in order to confirm the preferred modality for this particular task. There are two virtual characters in the scenario, a guard in front of the cell door and a technician next to the space ship. As the study was about the inputs rather than the storytelling, the test subjects were not supposed to depend on the information obtained from these agents. Nevertheless, these conversations were designed to resemble those typically found in interactive narratives, which are one of the intended use cases for the examined input technologies. The guard's dialogue contains a few basic answers about the plot, embedded in short, impatient sentences. Talking to him is a precondition for escaping from the cell because this character only leaves the container after the conversation. After their escape, the player encounters the technician who provides more details about the background story. Furthermore, this NPC's answers contain information about the following steps to take.

Both conversations are pre-defined and branch out from one or two compulsory sentences to a number of different options, according to the dialogue's progress. Part of these options must be taken before the user can end the conversation and continue with other types of interaction. The user's contribution to the conversation mainly consists of simple questions. This is partly due to technical limitations, as the speech recognition event occurs when the user pauses between sentences. Another reason is that these are straightforward requests for information which follow directly from the background plot. There are, however, two more complex statements and one more detailed question about an item not mentioned before. Additionally, most of the user's phrases contain filler words which are not strictly necessary but make them sound more lifelike.
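Such a pre-defined, branching conversation can be represented as a simple graph of dialogue options, some of which are compulsory and some of which unlock further topics. The following sketch is illustrative: the option names are taken from Table 3.1, but the structure does not reproduce the exact dialogue graphs used in the application.

```python
# Sketch of a pre-defined branching dialogue: compulsory lines unlock further
# optional questions, and the conversation can only be ended once all compulsory
# lines were taken. The structure is illustrative, not the actual dialogue graph.

GUARD_DIALOGUE = {
    "where am I":       {"compulsory": True,  "unlocks": ["why imprisoned"]},
    "why imprisoned":   {"compulsory": True,  "unlocks": ["leaving"]},
    "leaving":          {"compulsory": False, "unlocks": ["end conversation"]},
    "end conversation": {"compulsory": False, "unlocks": []},
}

def available_options(dialogue, spoken):
    """Options are available once unlocked and not yet used."""
    unlocked = {next(iter(dialogue))}             # the first line starts the dialogue
    for line in spoken:
        unlocked.update(dialogue[line]["unlocks"])
    return [o for o in unlocked if o not in spoken]

def can_end(dialogue, spoken):
    """The conversation may only be ended after all compulsory lines were taken."""
    return all(name in spoken for name, node in dialogue.items() if node["compulsory"])

print(available_options(GUARD_DIALOGUE, []))              # -> ['where am I']
print(available_options(GUARD_DIALOGUE, ["where am I"]))  # -> ['why imprisoned']
print(can_end(GUARD_DIALOGUE, ["where am I"]))            # -> False
```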

Interactive objects

The objects and actions in this scenario were chosen with regard to the most relevant action categories identified by Uijlings [45]. Some categories become very similar once they are broken down into the actual steps necessary for performing the action on virtual objects, while others involve complex implementation tasks which would distract from the main topic of this thesis. Therefore, not all of them are covered, but some can still be identified.

For the "transfer" category, there are three objects which need to be picked up, and two of these have to be placed in their pre-defined target locations. Pushing the bed towards the window can be considered a "drag" action, although the properties of the ground and the bed's mass are not taken into account for the current application. The "corpuscularObjectMove" category is covered by the escape from the cell, specifically stepping onto the bed and climbing through the window. Unlike all the other actions, these two involve the legs for gesture input.

One of the "transfer" actions results in a spanner being attached to a screw, so it can also be counted as an "attach" action. Going by Uijlings' definitions, the spanner would then be considered an attaching device as it more or less grabs the screw. There are no "attach" actions which require additional fasteners, such as glue or nails, because the implemented infrastructure does not yet support actions which involve more than one object at the same time, such as holding one item in place while applying the necessary attaching device. Moreover, dealing with the combined object and its emergent properties would be a research topic of its own and go beyond the scope of this thesis. For similar reasons, no "create" actions exist so far. However, part of the necessary functionality could already be achieved by attaching one object after the other and changing the parts' properties to make certain actions available.

Most of the actions fall into the "manipulate" category. They mainly involve moving parts of the selected object, such as opening a door or turning handles and levers. Furthermore, they change object states in a way that influences the availability of other actions. This can happen implicitly, for example when opening the crate reveals an item which can then be picked up, or explicitly by setting the smart object's property values those other actions depend on. The "attack" category was left out because it basically serves the same purpose: a physical action changes critical properties within the object, in this case a health value, and the change is represented by a deformation. There is no real difference between animating an object to open or to fall apart after a user input, and changing the material to a dented or scorched version is not that different from changing an indicator lamp's color. Overall, the distinction between both categories is more important for narrative purposes than for user interaction.

In a way, starting the repaired engine counts as a "control" action because it lets the player borrow the space ship's ability to fly away. However, on the abstraction level used in this application, the act of taking control is broken down into several navigation actions and the manipulation of objects on the dashboard.

"Consume" actions were ignored, partly because there was no obvious way to work them into the plot and partly because they can be seen as a kind of "transfer" action. Picking an item up to store it in the inventory already works very similarly, except that the motion target in that case is some representation of the actor's pockets instead of their mouth. One major characteristic Uijlings named for this category is the effects a food item has on the actor [45].
These could, in theory, affect their ability to perform certain interactions, but as with the “attack” category, this is more of a narrative tool and less relevant for general virtual reality applications.

Apart from the actions they provide, the objects varied greatly in size, from the small and thin spanner to the medium-sized crate and the bed, which was about two meters long. Some had no distracting neighbor objects, some were next to each other but with enough space between them and, finally, some were close to and overlapping with each other. These spatial properties were meant to provide realistic variations for the selection task.

3.2.2 Input

This section describes the input metaphors and mappings chosen for the four tasks. For each action the user has to perform, a gesture and a spoken phrase were defined which are both associated with the same command. This system does not include any sort of multimodal fusion, but both recognition mechanisms are active at the same time and the action is triggered as soon as the command is detected on either channel.
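As a minimal sketch of this dual mapping, the following C++ fragment shows one possible way to register a gesture name and a keyword phrase under the same command and to trigger the associated action from whichever recognizer reports it first. The names (Command, CommandRegistry, onGesture, onSpeech) and the example entries are hypothetical and do not come from the actual KinectWorld code.

```cpp
#include <functional>
#include <map>
#include <string>

// One abstract command, reachable through a gesture and a spoken phrase.
struct Command {
    std::string gestureName;      // name of the associated gesture recognizer
    std::string keywords;         // keyword string for the speech recognizer
    std::function<void()> action; // effect triggered on recognition
};

class CommandRegistry {
public:
    void add(const Command& cmd) {
        byGesture_[cmd.gestureName] = cmd.action;
        byKeywords_[cmd.keywords] = cmd.action;
    }
    // Called by the gesture recognizer with the name of a detected gesture.
    void onGesture(const std::string& gestureName) {
        auto it = byGesture_.find(gestureName);
        if (it != byGesture_.end()) it->second();
    }
    // Called by the speech recognizer; both channels end in the same action.
    void onSpeech(const std::string& keywords) {
        auto it = byKeywords_.find(keywords);
        if (it != byKeywords_.end()) it->second();
    }
private:
    std::map<std::string, std::function<void()>> byGesture_;
    std::map<std::string, std::function<void()>> byKeywords_;
};

int main() {
    CommandRegistry registry;
    registry.add({"openDoor", "öffne tür", [] { /* play the door animation */ }});
    registry.onSpeech("öffne tür"); // has the same effect as onGesture("openDoor")
}
```

Whichever channel reports the command first simply triggers it; no fusion step is involved.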

Navigation

This application only uses low-level navigation by specifying individual movements, as opposed to selecting points of interest and being brought there automatically. The first method is closer to reality and therefore more relevant for general simulations, including those which are meant for training rather than entertainment, whereas the other can be seen as an optional convenience added on top of these basic controls.

Gesture input for this task is based on the joystick control scheme. Since the gesture technology used here can detect the orientation of the torso, there is no need for abstract turning gestures like those of the "steering wheel and accelerator" approach. To walk, the user steps away from the neutral position in any of the four main directions, which causes their virtual representation to start moving in the same direction. Turning starts when the torso or the whole body is rotated to the left or the right, and looking up or down is triggered by leaning back or forward, respectively. Any movement can be stopped by returning to the neutral position and posture.

This neutral position is set automatically whenever the simulation leaves its pause mode and the user starts interacting with it, which means that it can easily be changed during runtime by switching to the menu, moving to a more suitable spot in the sensor area and switching back. In case a new user enters the sensor area and replaces the current one, the application will pause on its own for this very purpose.

In order to avoid accidental gestures while standing comfortably, as mentioned in [24], these inputs are only triggered beyond certain thresholds. For walking, the tolerance was set to 15 centimeters in all four directions, which allows an average human to shift their weight to either leg. Turning to the side starts at 30 degrees in either direction. The necessary angles for leaning forward and back are closer to neutral, with absolute values of 20 and 10 degrees respectively. Rotating the torso around this axis is less comfortable than turning it sideways because its center of mass is no longer supported by the feet, which calls for easier gestures in this case and at the same time makes accidental leaning less likely. Additionally, all rotated poses need to be outside these thresholds for 0.2 seconds before they are accepted as inputs.
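As an illustration, the sketch below evaluates a user pose against these thresholds and only reports a rotation command once the pose has been held for the required 0.2 seconds; walking reacts immediately to leaving the neutral zone. The structure, names and sign conventions are hypothetical and merely mirror the rules described above, not the FUBI-based implementation itself.

```cpp
#include <cmath>
#include <string>

// Thresholds as described in the text.
const float WALK_TOLERANCE_M   = 0.15f; // 15 cm in all four directions
const float TURN_THRESHOLD_DEG = 30.0f; // torso rotation to either side
const float LEAN_FWD_DEG       = 20.0f; // leaning forward triggers looking down
const float LEAN_BACK_DEG      = 10.0f; // leaning back triggers looking up
const float HOLD_TIME_S        = 0.2f;  // rotations must persist this long

struct NavigationInput {
    float neutralX = 0.0f, neutralZ = 0.0f; // neutral position on the floor
    float rotationTimer = 0.0f;             // how long a rotated pose was held

    // Returns the navigation command for the current frame ("" = none yet).
    // Sign conventions (which direction is positive) are arbitrary here.
    std::string update(float posX, float posZ, float torsoYawDeg,
                       float torsoPitchDeg, float dt) {
        // Walking reacts as soon as the user steps out of the neutral zone.
        float dx = posX - neutralX;
        float dz = posZ - neutralZ;
        if (std::fabs(dz) > WALK_TOLERANCE_M)
            return dz > 0 ? "walkForward" : "walkBackward";
        if (std::fabs(dx) > WALK_TOLERANCE_M)
            return dx > 0 ? "stepRight" : "stepLeft";

        bool rotated = std::fabs(torsoYawDeg) > TURN_THRESHOLD_DEG ||
                       torsoPitchDeg > LEAN_FWD_DEG ||
                       torsoPitchDeg < -LEAN_BACK_DEG;
        if (!rotated) {
            rotationTimer = 0.0f;
            return "stop"; // back in the neutral pose and posture
        }
        rotationTimer += dt;
        if (rotationTimer < HOLD_TIME_S)
            return ""; // not held long enough to count as an input
        if (std::fabs(torsoYawDeg) > TURN_THRESHOLD_DEG)
            return torsoYawDeg > 0 ? "turnLeft" : "turnRight";
        return torsoPitchDeg > 0 ? "turnDown" : "turnUp";
    }
};
```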

The speech commands for walking consist of the words for each direction. For turning to either side or looking up and down these are combined with the verb for this action, and finally, stopping is done by saying "stop". These keywords are mapped to the same commands as their gesture counterparts. One additional advantage in this case is that, at least for the movement domain and in the German language used for this study, most synonyms are rather similar to the main command. They are either derived from a common root, as in "rückwärts" and "zurück" for moving backwards, or contain rather similar vowels. For example, the keywords "hoch" and "nach oben" for the "turnUp" command share a long "o" sound which is not present in their opposites "runter" and "nach unten". In this implementation, the recognizer's vocabulary is limited to only those keywords available at the given moment, so for now, these similarities are sufficient for mapping most unexpected utterances to the desired commands, even without a more sophisticated grammar explicitly handling the synonyms.

The total vocabulary for the navigation task contains nine different commands which are available throughout the scenario. So far, movements are only triggered without any speed parameters, although these could be implemented by examining the body posture more closely and adding adverbs to the keywords. But while they would give the user finer control over their movement, they would also increase the complexity of the navigation task.

Given that there are three more tasks the test subjects have to deal with, these parameters are left for future studies.

There is, however, one strong argument in favor of flexible speeds. Regardless of the task, speech inputs suffer from a noticeable delay which can take about 0.5 to 1.0 seconds. This is less of an issue for the other tasks since the world state is rather stable while they are performed. But once a movement has been triggered, the user's position changes constantly and timing becomes far more critical. As Kadobayashi et al. pointed out for VisTA-walk [17], the movement speed must allow the user to arrive quickly but also let them stop at precisely the intended location. These two goals are difficult to achieve with a fixed velocity.

As a compromise, movements in this application start with a generous acceleration phase of five seconds. Assuming a reaction time of about one second for the user themselves and another second for the speech recognizer, and assuming they decided to stop as soon as they saw that they were moving, they would be travelling at nearly half the target speed by the time that command was recognized. On the other hand, it can be assumed that a user wants to travel a greater distance if they keep moving for several seconds, so they can still benefit from the higher speed in that case.

The target speed here is set to 2.0 meters per second for walking and 30 degrees per second for turning. Assuming an average step size of 50 centimeters, this gets close to the VisTA-walk mapping which suggests a movement speed between 2.5 and 5.0 meters per second for a physical distance of 80 centimeters [17]. Going by the example above, the user would have moved 0.8 meters or rotated 12 degrees during those 2.0 seconds. This still makes it difficult to aim for a door, but more feasible than it would be without the acceleration phase. On the other hand, a distance of about 10 meters (a bit shorter than that between the prison and the space ship) could still be travelled in about 7.5 seconds. Rotating 90 degrees takes 5.5 seconds, and a 180 degree turn can be finished in 8.5 seconds.
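These figures follow from a simple calculation, assuming that the speed ramps up linearly from zero to the target value $v_{max}$ over the acceleration time $T = 5$ s, which is consistent with the numbers given above. The distance covered after $t \leq T$ seconds is then

\[ d(t) = \frac{v_{max}}{T} \cdot \frac{t^2}{2}, \]

so after the assumed 2.0 seconds of reaction and recognition delay the user has walked $d(2) = \frac{2.0}{5} \cdot \frac{4}{2} = 0.8$ m or turned $\frac{30}{5} \cdot \frac{4}{2} = 12$ degrees. The complete ramp covers $\frac{v_{max} T}{2}$, i.e. 5 m or 75 degrees, and every further second adds another $v_{max}$. A distance of 10 m therefore takes $5 + \frac{10 - 5}{2.0} = 7.5$ s, a 90 degree turn $5 + \frac{90 - 75}{30} = 5.5$ s and a 180 degree turn $5 + \frac{180 - 75}{30} = 8.5$ s.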

Selection

Objects can only be selected when the user stands in front of them. First of all, they must be visible on the screen so that the user knows which referring expressions can be made. Also, manipulating an object does not make much sense if the effect cannot be observed and confirmed.

The second limitation is the distance. Selectable objects must be within a certain radius around the user's position in the virtual world. In real life, this radius would be determined by the length of their arm, but in this setup, things are more complicated than that. There is a discrepancy between the distance observed by the virtual camera and the user's physical distance to the objects' depiction, since the Kinect sensor requires a distance of approximately two meters but the camera is supposed to show what the avatar's eyes see. However, when the camera moves too close to an object, too little is seen of the environment. The interaction radius for smart objects was eventually set to 2.5 meters, which means that the distance between the user's virtual representation and the reachable objects is similar to the physical distance between the user and the screen. For a different setup, for example one using a head-mounted display, a shorter distance would be more appropriate.

The gesture input consists of pointing at the desired object, using a cursor to facilitate precise pointing over the comparatively large distance. Cursor movements faster than a small threshold are ignored in order to avoid accidental selections along the way to the real target, and the selection is locked when the cursor lingers for more than 0.5 seconds on the same smart object. The speech equivalent for this command consists of stating the object's name. For this scenario, the objects available at any given time can be clearly distinguished from each other, so it is not necessary to produce longer referring expressions from spatial relations or appearance details.
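The dwell-based part of this selection could look roughly like the following sketch, which resets its timer whenever the cursor moves too fast or leaves the current object and reports a selection after the 0.5 second linger time. The class and the speed threshold value are hypothetical and only illustrate the behaviour described above.

```cpp
#include <cmath>
#include <string>

// Selects the smart object under the cursor once the cursor has rested on it
// long enough, ignoring fast movements on the way to the intended target.
class DwellSelector {
public:
    // objectUnderCursor is empty when the cursor is over background geometry.
    // Returns the name of the newly selected object, or "" if none.
    std::string update(float cursorX, float cursorY,
                       const std::string& objectUnderCursor, float dt) {
        if (dt <= 0.0f) return "";
        float dx = cursorX - lastX_, dy = cursorY - lastY_;
        float speed = std::sqrt(dx * dx + dy * dy) / dt;
        lastX_ = cursorX;
        lastY_ = cursorY;

        // Reset the dwell timer when the cursor moves fast or changes its target.
        if (speed > maxSpeed_ || objectUnderCursor != candidate_) {
            candidate_ = objectUnderCursor;
            dwellTime_ = 0.0f;
            return "";
        }
        dwellTime_ += dt;
        if (!candidate_.empty() && dwellTime_ >= 0.5f) { // 0.5 s linger time
            dwellTime_ = 0.0f;
            return candidate_; // lock the selection to this smart object
        }
        return "";
    }
private:
    float lastX_ = 0.0f, lastY_ = 0.0f;
    float dwellTime_ = 0.0f;
    float maxSpeed_ = 200.0f; // screen units per second; placeholder threshold
    std::string candidate_;
};
```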

Dialogue

The speech inputs for this task are based on the multikeyword spotting approach. First of all, each dialogue action consists of a natural sentence which is the user's contribution to the conversation. The actual input is defined by a list of keywords which need to be detected within the same utterance. In theory, they could be chosen on a whim, as long as they sound sufficiently different from those of all other sentences available at the same time. But the more transparent method is to choose those words of semantic importance, such as interrogative pronouns, the action verb or a noun which stands for the current topic. Other, less important words can be added if this is necessary for making the set phonetically unique.

Apart from triggering the recognition event, this structure acts as a logical constraint by highlighting the intended meaning and encouraging the user to stay close to it. But ultimately, the user is free to construct their own phrase around this skeleton while the default sentence serves as a suggestion and provides the context necessary to deduce the effect of this action.
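Conceptually, an utterance matches a dialogue action if all of its keywords occur in the given order, with arbitrary filler words before, between and after them. The function below expresses that rule directly; in the actual system the equivalent pattern is built as a recognizer grammar (see section 3.2.4), so this is only an illustration with a made-up keyword list.

```cpp
#include <string>
#include <vector>

// True if all keywords appear in the utterance in the given order,
// allowing arbitrary filler words before, between and after them.
bool matchesDialogueAction(const std::string& utterance,
                           const std::vector<std::string>& keywords) {
    std::size_t pos = 0;
    for (const std::string& keyword : keywords) {
        pos = utterance.find(keyword, pos);
        if (pos == std::string::npos)
            return false;      // a required keyword is missing
        pos += keyword.size(); // keep searching after this keyword
    }
    return true;
}

// Example (hypothetical keywords): the default sentence "wo bin ich denn hier"
// and the shorter "wo bin ich" both match the list {"wo", "bin", "ich"}.
```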

Unfortunately, there was no natural option for unimodal gesture input which could be used throughout the scenario. The current implementation contains no method for multimodal fusion, so each modality had to function on its own and needed to express the same intention. But as Cavazza et al. explained, many conversational gestures are ambiguous on their own and need to be analyzed in context [4]. As there was not enough time to implement a complex dialogue manager, the only available context is the current sentence.

Moreover, it is not always possible to find gestures for arbitrary content in the first place. While a greeting can easily be matched to waving one's hand, there is no obvious equivalent for interrogative pronouns, which means that most of the sentences in this scenario, like the very first question "Where am I?", could not be expressed that way. Objects like the space ship could be referred to with iconic gestures, for example by indicating its triangular shape, but there is no straightforward symbol for asking about its state after the forced landing.

Some messages, such as agreement, disagreement or certain attitudes, are indeed associated with well-known gestures, such as shaking one's head for signalling "no". For this study, the cultural background would not have been an obstacle to using those. However, the gesture framework used here is not capable of detecting the rotation of an end joint, such as the head or a hand. Arm movements, like shrugging to show ignorance or lack of interest, can be implemented with the current system, but a nod or a headshake could not, which limits the available vocabulary even further.

One last alternative would have been to use some kind of sign language, similar to the verb gestures in the ORIENT application [23]. Such a gesture set would be able to express the desired content, but on the other hand, it would be unfamiliar to the user and therefore far more difficult to learn than the spoken alternative. As a result, these signs would probably be ignored in favor of the latter.

To sum it up, most of the sentences would have been left without a gesture command, which was not an option for this study. However, the way dialogue actions are implemented here does offer the possibility to associate some of them with gestures, so this should be picked up for future studies.

The fallback solution for the current application was to treat the task of choosing the desired phrase as a general selection. The user controls the same cursor as for the selection task, this time aiming at a target field associated with the sentence. While this method is far more abstract, its advantages are that it allows for unambiguous input and the user is already familiar with its mechanisms. Furthermore, if they used a pointing gesture to select the dialogue partner, their hand is already in place and they can continue almost seamlessly.

Object Manipulation

The gesture input for object manipulation was made to resemble those movements a user would perform in real life. They mostly involve the right hand, but two include the left hand as well and two include the legs. So far, this application only offers right-handed input, but the gesture definitions themselves are stored in a separate file which can easily be exchanged.

Each of the gestures consists of one or several linear strokes, usually following the path of a moving part on the manipulated object. Curved trajectories are approximated by two separate strokes since the FUBI recognition used here does not support them. Another thing this setup lacked was a grasping metaphor, due to the purely vision-based input technology which could not recognize the hand's shape. FUBI already has some functionality for detecting and counting individual fingers, but this would require the user to stand far closer to the sensor and spread them clearly. Consequently, performing a gesture requires a certain minimum speed to distinguish it from unintentional movements.

These gestures are only provided for one position and orientation relative to the object. For this scenario, this was not much of a problem because most objects were placed in such a way that they were only accessible from the proper direction. For example, the crate which has to be opened is placed in front of a wall with its front side facing the user, and the door of the storage container swings open far enough that it is hardly visible from the inside. For some actions, most notably pushing the bed, this restriction was not possible, whereas a few, such as picking up the transfer objects, did not have that problem anyway.

The speech options simply consisted of the verb associated with that action.

Global Actions

In addition to the four tasks, two global actions were defined for managing the application. They are based on the same mechanics as the regular object actions.

One is the "cancel" command, which is used for deselecting an object or NPC, for example after an accidental selection or after seeing that the desired action is currently unavailable. Since it is needed in the same context as the regular actions, using a physical gesture for this purpose keeps the input more consistent. Like the "undo" command in [41], it has the highest priority while the simulation is running. The other global action pauses the game and displays a simple menu. Since both commands are supposed to be available at any time, the "menu" command is implemented the same way as the "cancel" command, using a gesture rather than a selectable button.

However, finding suitable gestures for these two turned out to be difficult. As the "cancel" command was expected to be used rather frequently, it had to be comfortable and simple to perform. But most importantly, neither of the global commands was allowed to interfere with those gestures which belonged to the regular actions.

The first approach was to give them more complex shapes with at least two diagonal strokes. The symbol drawn for "cancel" resembled a simplified arrow to the left, shaped like a left angular bracket, and relied on the same metaphor as the backspace key or the "return to previous" buttons found in numerous applications, for example web browsers. The "menu" shape was based on an "X", similar to those seen on the "close" button of application windows. But as the scenario was filled with its content and more varied gestures, it turned out that the arrow shape could still overlap with the regular gestures. In particular, pushing the bed to the right side requires the user to move their arms to the starting position on the left and bring them back to the right side, which was very similar to the angular symbol. On the other hand, the "X" shape for the menu was rather complicated to learn and perform.

Shortly before the study it was suggested to use the left hand instead, but this was not possible because this hand was also involved in two gestures. Eventually, the global gestures were redesigned to use both hands, with the left one serving as a kind of modifier, similar to pressing the "Ctrl" key to change the meaning of a mouse or keyboard input. The final version of the "cancel" command consisted of raising the left hand above the shoulder and moving the right hand to the left, still using the metaphor of the left arrow. For the menu, the right hand was moved in the other direction.

3.2.3 Visualization

The application was designed to work without any mouse or keyboard input, which also meant that the user was faced with a lot of new commands necessary to control it this way. Therefore, every task they had to perform needed its own feedforward and feedback mechanisms, which are detailed below.

To avoid confusion, there is a common color code for the GUI elements. Information which is not crucial, such as the filler words in a sentence, and commands which are currently unavailable are displayed in white or a light gray. Light blue is used to highlight the currently allowed inputs, and finally, recognized commands are highlighted in green for some time afterwards. In the case of dialogue and manipulation actions, this time period also marks the cool-down phase.

Navigation

Navigation was meant to be trained beforehand because there were many different commands to remember. The size of the vocabulary for both modalities and the fact that navigation was available for the majority of the scenario meant that displaying the commands for reference would clutter the screen, probably distracting from the world itself. In this case, usability relied on the fact that the gesture commands were chosen to resemble natural movement and the spoken keywords were familiar direction commands, so for either modality, they were expected to be remembered not only from the instruction phase but also from the real world.

Feedback for navigation mostly relies on the fact that the camera movement can be observed immediately. In the case of stepping sideways, the movement vectors are updated as soon as the user steps outside the neutral zone. Similarly, turning gestures are triggered as soon as the torso is rotated beyond the respective thresholds, but as these are more generous to avoid accidental rotation, it takes slightly longer to reach them. As an additional visual clue, the user's physical position relative to the neutral area is marked as a dot on a four-sided arrow which is displayed on the upper part of the screen. The dot is colored green during movement, regardless of the modality used to start it, and returns to blue after the user stopped to show that the system is ready for navigation commands. When an entity is selected and walking is disabled, the arrow icon is hidden altogether.

Speech commands are confirmed by a label below the arrow icon whenever a navigation command was spoken. As it displays the recognized keyword, this can also help to remind the user of the correct vocabulary after a synonym was used. Since the world for this study was rather small, a map is not included in the current implementation and points of interest are only marked within the environment itself. While the player is standing still, interactive smart objects are labeled with their name. During movement, these labels are replaced by more subtle brackets to avoid cluttering the screen, and both elements are colored blue as soon as the user comes close enough to interact with the object in question.

Figure 3.1: Left: The arrow icon, showing the neutral position (blue) and forward movement (green). Right: The brackets displayed during movement (gray: out of reach, blue: in reach).

Selection

Figure 3.2: Upper left: An object which is out of reach. Lower left: An object in reach which is being selected. Right: The actions available for the selected object.

The name labels actually belong to the selection task, displaying the keyword the user has to speak. Additionally, a line connects this label to the 3D object to show which mesh the cursor has to indicate.

As for the cursor itself, several shapes were considered. Some Kinect games use an open hand shape, which is fitting for a projection of the user's own hand [42, 39]. The disadvantage, however, is the lack of an obvious hotspot, which makes precise selections more difficult. A pointing finger comes closer to that, but usually ends in a rounded shape which again lacks some precision. Considering that the target objects would be life-sized but seen from two meters away, the cursor was supposed to be as helpful as possible. This left the options of a crosshair shape, which seemed more appropriate for combat-oriented games, and a basic arrow pointer which was eventually chosen for its clear and familiar shape.

This cursor appears when the user raises their hand towards the screen, but only when a pointing gesture is expected. During the confirmation interval the cursor fills up in blue to show the progress of the selection. After the selection succeeded, the interaction options for this particular entity are displayed whereas all selection clues are hidden. Only the line between this object and its label is kept, now connecting it to an options panel which resembles a context menu popping up in a traditional application. This makes it obvious that the object is now the center of attention and offers the interactions displayed on said panel.

Dialogue

For dialogue actions, the entire sentence is displayed with the relevant keywords highlighted in blue. A rectangular field to the right of it, designed with a bevel to look like a button, provides the target for the pointing option. This pointing option uses the same cursor as the object selection. Additionally, the animation of filling it up with color is inverted during the cooldown phase, this time using green to mark the recognition. At the same time the recognized sentence is highlighted in green while the other sentences are hidden.

Object Manipulation

Even before it is selected, an object's appearance gives a first impression about its functionality. Some of them are modeled to resemble their real counterparts and inherit their most relevant affordances. The window, for instance, has a handle which needs to be turned to unlock it. Its initial orientation leaves only one direction for turning because otherwise it would collide with the wall. The door is grabbed by its handle and rotated around its hinges, which makes the gesture's trajectory evident. The spanner is known as a tool for turning something that interlocks with its shape, so once it sits in place, its purpose can be deduced.

Figure 3.3: Upper image: Available dialogue actions with highlighted keywords and the cursor filling up for selection. Lower image: The recognized sentence with the cursor gradually returning to white.

Designing objects aboard the space ship was more difficult since they were not supposed to look too familiar. However, some basic indicators were built into their shape as well. The lever for closing the door and the handle on the slider are indented to indicate that they can be gripped, whereas the upper side of the damaged part is flush with the surrounding surface. This is even easier to see on the intact parts next to it. The fact that it offers no grip at the moment hints at the need to change something else first. Furthermore, both the door lever and the alien screw which needs to be turned are marked with the same triangles. When the object in question is unlocked, these point in the direction in which movement is possible, which either means walking through the door or taking the part from its casing.

As for the actual gestures, animated humanoid figures present the movements to illustrate what is expected from the user. These animations are derived from the same gesture descriptions which are used for recognition, displaying the average speed, state duration and transition time allowed for this input. They are turned to face the user and mirror-inverted for easier recognition. Since depth information is more difficult to see with this orientation, the third dimension is emphasized by a color ramp. Limbs closer to the camera appear green and eventually blue, whereas those further back will go from yellow to red.

Unfortunately, the gesture recognition used here offers no straightforward way to determine the source of an error, as the input either activates at least one matching recognizer or none at all. Movement speeds outside the given limits are ignored, and missing strokes or pauses longer than the threshold cause the failure of a posture combination recognizer. However, the raw input data is available in the form of the user's depth image and the currently recognized position of their joints. These are displayed as a small overlay at a fixed position on the screen, and since the orientation of the gesture dummies was chosen to match that of the depth image, this sensor data can be used to find the differences between the user's movement and the expected input.

Figure 3.4: Left: The images from the animation for turning a lever counterclockwise. Right: The depth image of the user with the recognized skeleton.

Finally, the voice commands for the actions are displayed next to the animations. Just like the dialogue keywords, they appear blue when the action is available and turn green after recognition. Those actions which are not possible at the moment are shown nevertheless, in order to inform the user that they can be accessed at a later time. To set them apart from the relevant inputs, they are colored gray and no animation is displayed for them.

3.2.4 Implementation

The application, going by the working title of "KinectWorld", was implemented in the Horde3D GameEngine which was developed at the University of Augsburg [35]. This engine is based on a modular principle and consists of separate components with specialized functionality which are added to those entities which need them. They communicate with each other and the main application using an event-based infrastructure, and this makes it easy to implement new components and integrate them into the system. Apart from rendering the three-dimensional environment, this engine already provided various useful components by different authors which could be reused. In particular, those included keyframe animations, inverse kinematics, rigid body physics based on the Bullet library, text-to-speech output, socket communication and the AnimatedOverlay component which could already transform pointing gestures to cursor positions.

In order to create the application described here, a new component was created which added the required smart object infrastructure for handling all the interactions. The main application interprets gesture input which it receives from Kistler's "Full Body Interaction Framework" [20], FUBI for short, which was also behind the AnimatedOverlayComponent and the gesture recognition in the "Sugarcane Island" application [21]. For speech input, an external recognizer program was written based on the Microsoft Speech Platform [27]. This SDK was only available for C# whereas the main application is written in C++, so it had to be connected via the socket component. Another major addition was the GestureAnimator class which calculates animations from the same gesture definitions used by FUBI. These four elements are described in the following subsections.

Smart Objects

The new component is called "KWSmartObjectComponent", as in "smart object for the 'KinectWorld' application". Similar to the smart objects described by Kallmann [19, 18], they contain internal state variables, actions that can be performed on them, conditions and effects for these actions and, most importantly, information about the inputs which will trigger them. All entities which do not serve as background scenery are implemented as smart objects, including the player and the NPCs, inspired by the decision of Jorissen and Lamotte to treat all parts of a simulation equally [16]. An entity's description in the XML scene graph file contains specific subnodes for each of the components it uses. In the case of the smart object component this subnode includes all the data to store, which means that changes can quickly be made inside a text editor without changing the program itself.

The state variables, called properties, are indexed by their name and can have three different types: boolean truth values, floating point numbers or plain text. They can not only be accessed from within the smart object itself but also from outside by using the GameEngine's event system. The latter is mostly done by other smart objects when their actions depend on external conditions or when they apply the effects of an action.

An action consists of its name, the name of the gesture recognizer and a keyword string for speech input, as well as a list of conditions which need to be fulfilled and a list of effects which will be applied afterwards. Every interactive smart object stores a list of these which is requested by the main application whenever that object is selected. In that case it hands over the core data for each action, which consists of its name, the input information and whether this action is currently enabled. Dialogue actions are implemented as a subclass which adds the whole sentence and a list of several keywords, sorted in the order in which they are spoken. Gesture names can still be stored for these actions but are not used in the current implementation. These actions are stored separately and requested when the object's "isTalking" property is set to true. When the main application recognizes an input which belongs to one of these actions, it reports that action's name back to the component which then applies its effects.

Conditions store the tested property's name, its type, the value it is compared to and the type of that comparison, which can be "less", "greater", "different" or "equal". By default, this property is searched within the same smart object, but it is also possible to assign a reference to a different entity, which means that this one's property is checked instead. Effects, just like the conditions, can be applied to the object itself or a different target. The currently implemented effects are shown in table 3.2.

setProperty: Sets a property's value, which affects the conditions of other actions.
relativeNumber: Modifies a numerical value using a factor and an offset. Also affects the conditions.
speak: Speaks a pre-defined sentence, the typical effect of a dialogue action.
animation: Plays a pre-defined animation for a given duration. Used for most actions.
move: Moves the object during a given time, either over a certain distance or to an absolute position.
attach: Attaches the object to another entity or a part of one. Used for the transfer actions.

Table 3.2: The available effect types for smart object actions.
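The data carried by such a component can be pictured as a small set of structures. The sketch below is a simplified rendering of the concepts described in this subsection, with hypothetical type names rather than the actual KWSmartObjectComponent classes.

```cpp
#include <string>
#include <vector>

// A property value is either a boolean, a number or plain text.
struct Property {
    enum class Type { Boolean, Number, Text } type = Type::Boolean;
    bool boolValue = false;
    float numberValue = 0.0f;
    std::string textValue;
};

// A condition compares one property against a reference value.
struct Condition {
    std::string targetEntity; // empty: check this smart object's own property
    std::string propertyName;
    enum class Comparison { Less, Greater, Different, Equal } comparison = Comparison::Equal;
    Property reference;
};

// An effect applied after an action, e.g. setProperty, speak, animation,
// move or attach (see table 3.2).
struct Effect {
    std::string type;         // one of the effect types in table 3.2
    std::string targetEntity; // empty: apply to this smart object itself
    std::string parameters;   // effect-specific parameters
};

// An action with its triggering inputs, preconditions and effects.
struct SmartObjectAction {
    std::string name;
    std::string gestureRecognizer; // name of the FUBI recognizer
    std::string speechKeywords;    // keyword string for speech input
    std::vector<Condition> conditions;
    std::vector<Effect> effects;
    bool enabled = false;          // whether all conditions are currently met
};
```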

Gesture Recognition

Gesture input is recognized using FUBI [20], which in turn accesses the Kinect sensor's data using the "OpenNI" framework, the "NITE" middleware∗ and the Avin2 driver†. The Kinect itself senses depth using an infrared laser and camera to project a structured light pattern [12, 10].

FUBI defines different types of gesture recognizers. A JointOrientationRecognizer is activated when the rotation of a joint lies within the minimum and maximum angles around a particular axis, while a JointRelationRecognizer monitors the distance between two joints. The LinearMovementRecognizer is defined by a direction vector, which can be related to another joint, and the allowed speed range. Finally, the basic recognizers can be combined using a PostureCombinationRecognizer, which defines a sequence of states with a minimum and maximum duration and a maximum transition time.

In this application, all input gestures are defined as PostureCombinationRecognizers so that the duration of each stroke can be limited. All gesture definitions are loaded from an XML file when the main application is initialized. At runtime, only those recognizers which belong to currently available actions are enabled in order to avoid false alerts. Whenever the application checks for inputs, the names of the recognized gestures are compared to those linked to the available actions, which can either be manipulation actions or rotation commands for navigation.

The walking itself is not based on a particular recognizer but accesses the user's absolute position instead. This vector is compared to the neutral position mentioned in section 3.2.2, and depending on the result the respective navigation action is triggered directly.

The third kind of gesture input is the pointing gesture. For this purpose, the existing AnimatedOverlayComponent is used to obtain the cursor position on the screen, but instead of the component's cursor the main application draws its own overlay at this position.

Speech Recognition

This application had three main requirements for speech recognition. First, it had to support the German language since this was the mother tongue of the people who were to be recruited for the study. Second, it had to work reliably without training because it was going to be used by a number of different people. And third, it had to be capable of changing the grammar at runtime, since inactive commands had to be disabled in order to reduce the search space and to avoid misinterpretations.

∗OpenNI homepage, "http://www.openni.org/"
†Avin2 repository, "https://github.com/avin2"

The Microsoft Speech Platform [27] turned out to satisfy all three requirements, and unlike commercial solutions, it had the additional advantage of being available for free. The only downside was that the SDK was only available for C#, so it could not be directly integrated into the main application which is written in C++.

Whenever the set of actions changes, the main application encodes the keyword lists for each action and the names of the selectable objects in a message which is sent to the recognizer application using a UDP socket. The recognizer then constructs a simple grammar based on this information. In this context, dialogue and regular actions are treated the same way since both include at least one keyword and are only activated when the user stops speaking.

For each action, the recognized pattern consists of the keywords in the list, separated by wildcard tokens to allow for the filler words in a sentence. A wildcard at the end ensures that the sentence is not recognized too early if it ends in filler words. This unit is then linked to a semantic value which consists of the action's name. All of the action utterances are then grouped in a subgrammar with "action" as their semantic key. A second subgrammar is created for the selectable objects, this time using the object names directly. This grammar is then linked to the semantic key "selection". The third and last component of the recognizer grammar is the garbage token. This one consists of a wildcard and is supposed to catch unintentional speech which is not similar to any of the real inputs.
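The concrete layout of this message is an implementation detail; the sketch below only illustrates the kind of information that crosses the socket, using a made-up delimiter format and hypothetical names rather than the protocol actually used between KinectWorld and the recognizer.

```cpp
#include <string>
#include <vector>

struct ActionKeywords {
    std::string actionName;
    std::vector<std::string> keywords; // in the order they have to be spoken
};

// Builds one text message describing the currently active vocabulary,
// e.g. "ACTIONS;openDoor=öffne,tür;pickUp=nimm|OBJECTS;Tür;Kiste".
std::string buildGrammarMessage(const std::vector<ActionKeywords>& actions,
                                const std::vector<std::string>& objectNames) {
    std::string msg = "ACTIONS";
    for (const ActionKeywords& a : actions) {
        msg += ";" + a.actionName + "=";
        for (std::size_t i = 0; i < a.keywords.size(); ++i) {
            if (i > 0) msg += ",";
            msg += a.keywords[i];
        }
    }
    msg += "|OBJECTS";
    for (const std::string& name : objectNames)
        msg += ";" + name;
    return msg; // handed to the socket component and sent via UDP
}
```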

When the recognizer detects a matching utterance, it sends a message back to the main application which contains the type of command and the name of the action or object, respectively. The main application will then trigger that action or lock the selection to the object named by the user.

Gesture Animator

While the majority of the visual clues could easily be displayed as text labels or other overlays, the animations for the manipulation gestures were far more complex. One main goal in developing this application was to make it as flexible and independent from the setting as possible, which also meant that it should be easy to tweak the gesture inputs without the need to re-export an animation. Furthermore, preparing animations for every single action in a virtual environment takes up time that should rather be invested in other development tasks.

To solve this problem, the GestureAnimator class parses the gesture description from the same XML file used by FUBI and calculates the trajectories which are then displayed on a small human figure. Currently, only LinearMovementRecognizers within PostureCombinationRecognizers are animated, and only for the hands and legs.

The starting posture for each animation depends on the joints which are involved. The basic posture sees the figure standing with both arms hanging down, whereas additional positions are defined for the hands, based on the assumption that the user's hand will be placed in front of the chest while they interact with objects on the screen. Consequently, animations involving a hand are based on such a position. Additionally, the starting point of the first stroke is shifted in the opposite direction by 25% of its length in order to make room for the full movement.

The length of the stroke itself is calculated from the expected speed and duration, which lie within the allowed ranges as specified by their respective scaling factors. For this study, these factors were all set to 0.5, resulting in the average speed and duration. Displaying an average version of the movement rather than an extreme case is supposed to make it easier for the user to copy the movement since the tolerance is maximized in both directions. For each of the moving joints these strokes are stored as a starting point and a translation vector.

Leg movements are slightly different from hand movements because they are based on the knee joints which are more likely to be in the sensor's field of view. Consequently, the animator class has to explicitly store a following movement for the foot, this time assuming that the lower part of the leg is hanging down without any movement of its own.

The strokes for each body part are then replayed on the gesture dummies using inverse kinematics. On each update, the GestureAnimator checks the current state to display, calculates the time that has passed since the state changed and moves each joint to an intermediate position, calculated from the respective stroke's starting point and translation vector. After the stroke is finished, the joint rests in this position until the next one starts. One more thing animated by this class is the shader responsible for the dummy's coloring. While the Horde3D engine already takes care of mapping the depth information to the color range, the GestureAnimator sets a shader parameter to darken these colors between animation loops so that the beginning and the end of each gesture are clearly visible.
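At its core, this replay is a linear interpolation per stroke. The sketch below shows the idea with hypothetical types; the real GestureAnimator additionally handles the chained foot movement, the 25% start offset and the shader parameter described above.

```cpp
#include <algorithm>

struct Vec3 { float x, y, z; };

// One linear stroke of an animated gesture, derived from the recognizer data.
struct Stroke {
    Vec3 start;       // starting point (already shifted back by 25% of the length)
    Vec3 translation; // direction scaled by average speed times average duration
    float duration;   // average state duration from the gesture definition
};

// Returns the joint position at the given time within the stroke. After the
// stroke has finished, the joint rests at its end point until the next state.
Vec3 jointPosition(const Stroke& stroke, float timeInState) {
    float t = std::min(timeInState / stroke.duration, 1.0f);
    return { stroke.start.x + t * stroke.translation.x,
             stroke.start.y + t * stroke.translation.y,
             stroke.start.z + t * stroke.translation.z };
}
```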

3.3 User Study

3.3.1 Objectives

The goal of this study was to see which input modality subjects would prefer for each task when they were faced with a system that offered both. These preferences were expected to show in their usage and to be related to their perception of the method's qualities.

Main Hypotheses

Users prefer different modalities for different tasks.

First of all, there is evidence that gesture input is more suitable for some tasks whereas others work better with speech input. Depending on the task, one of them tends to be covered more thoroughly in literature, with the other more likely to play a supportive role during multimodal interaction. In particular, the following preferences are assumed for the tasks, based on the theoretical findings detailed in 2.3.

> The walking metaphor is preferred for low-level navigation.

Although speech could prove useful for specifying additional parameters, such as a precise velocity or a particular target, it appears less suitable for general movement. A later implementation might use both modalities to their full potential, but for now, this study examines only the core vocabulary for navigation. This is the same for both input technologies and already contains nine commands, which was deemed sufficiently complex for the current experiment. Seeing that the walking metaphor is more straightforward with the real world in mind, users are believed to choose this modality for navigation. Another possible reason is that gesture input for the movement directions reacts more quickly, as it monitors a numerical distance value which only needs to be checked against the stored neutral position. The speech recognizer, in contrast, comes with a noticeable delay which makes it more difficult to stop at the intended location.

> Speech is preferred for dialogue with virtual characters.

Although there are conversational gestures which play an important role in human communication, they are hardly sufficient for a more complex conversation. In particular, most of them are ambiguous and change meaning when combined with different utterances. Because this experiment does not involve any form of multimodal fusion, each modality in this application had to function on its own, which left only the abstract "selection by pointing" metaphor for gesture input. Speech, on the other hand, is the more conventional information channel used between humans, so it is believed that users will transfer this preference to human-like computer entities.

> Gesture input is preferred for object manipulation.

Several works deal with natural input gestures whereas little material was found for language commands. In particular, people were shown to interact with virtual objects by movements matching the affordances of their real-world counterparts [8] or to prefer natural gestures to more abstract ones for general story-related interactions [21]. This suggests that the direct interaction metaphor is more attractive than commanding an action only by speech, even if this would be quicker.

For selection, there is no clear tendency yet. In this application, there is no ambiguity between the objects available for selection, so a single keyword is sufficient for identifying the one the user is interested in. Selection gestures here rely on moving a cursor to facilitate precise pointing, and the game engine component which handles the pointing gesture applies a smoothing factor to reduce jittering. According to the findings that users trade off precision and effort between both modalities, they can be expected to switch to speech when aiming the cursor correctly proves too difficult, for example on smaller objects like the tool. Speech might be chosen on principle, since it is quicker and less exhausting than raising an arm, but on the other hand, reaching for an object can be seen as the first step of physically interacting with it, so the hand would be stretched towards it anyway.

Secondary Hypotheses

The preferred modality is rated more favorably.

There are some general assumptions about the relation between usability and the chosen modality. A user's perception of an input method is likely to reflect their preference, explaining why they chose that way and which qualities are relevant for the decision.

> Users prefer commands which are easier to remember or to discover.

This one is almost self-explanatory, as people can only use commands they are aware of. Furthermore, people tend to avoid unnecessary efforts, mental as well as physical, which is the reason behind multimodal expressions for difficult references [46] or issuing complex commands [37]. Consequently, they are less likely to search for a particular input when an equally expressive alternative was already found.

> Users prefer commands which they perceive as more natural.

In a way, this hypothesis can be linked to the previous one. If users can transfer their experience with the real world to the simulation, it helps them to figure out which inputs follow logically from the properties of the virtual objects. Also, natural inputs can be argued to add to the immersion and enjoyment when interacting with the system.

> Users prefer the modality which is less tiring to them.

Again, this assumption is related to the findings that people minimize the costs for producing an expression [46]. Apart from the mental effort necessary for choosing the correct command, these costs also depend on the physical effort of moving various limbs or speaking longer phrases. In other words, users are likely to avoid inputs which would exhaust them and prefer those which are easier to perform.

> Users prefer the modality which they perceive as more reliable.

As Oviatt pointed out, users tend to avoid information channels that are likely to produce recognition errors [36]. These require the user to repeat their input or look for an alternative, which goes against their general goal of avoiding unnecessary efforts. Unless the scenario itself provides an explanation for such a failure, like the cultural differences in the ORIENT setting do [23], these errors can also distract the user from their role in the simulation which serves as another reason to prefer the reliable option.

3.3.2 Experimental Setting

System Setup

The virtual environment was displayed on a 50 inch screen, with the Kinect sensor placed below it and the neutral position marked as a square approximately two meters away. This distance ensured that the subjects' legs would be recognized for gesture input. While the microphone built into the Kinect would be sufficient in most cases, it had turned out to require comparatively loud speech and to be less reliable over this distance. For example, the speech recognizer had problems distinguishing the keywords "Tür" for selecting the door and "rückwärts" for moving backwards when running on this input device. Therefore, test subjects were equipped with a wireless headset, the "Sound Blaster Tactic 3D Wrath" by Creative.

Successful inputs were automatically logged by the application. For each input, it printed the game engine's internal time, the modality, the command and the task category to a text file which could later be pasted into an Excel sheet. Global and menu actions were classified as specialized tasks and excluded from the analysis.

Although redundant log entries were supposed to be avoided, in particular by logging only those gestures which resulted in a change of direction, these files would occasionally contain long sequences of navigation gestures which reported exactly the same command. This was probably due to the flags which were necessary to prevent a neutral body position from canceling out spoken commands. To prevent these repetitions from distorting the results, duplicates were removed if they were less than two seconds apart, so that only one gesture marked the act of stepping or turning in that direction. Another error encountered in the logs were selections of non-selectable objects, which were technically impossible. They sometimes appeared near the real selections, probably related to aborted selection processes when the cursor left the boundary of a smart object. Since the name of the selected object was printed as the selection command, they could easily be distinguished from the correct selections and were removed.

Additionally, the sessions were recorded on camera if the subject had agreed to this. While the logs can only show those inputs which were recognized first but not whether they were intentional or accompanied by their equivalent in the other modality, these videos are supposed to be annotated and analyzed after this thesis for more accurate and detailed insights. For this purpose, the camera was placed at an angle which captured both the subject's actions and the screen showing the system's reaction.
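The duplicate removal described here amounts to a simple filter over consecutive log entries; one possible form is sketched below, with a hypothetical LogEntry structure instead of the actual text format.

```cpp
#include <string>
#include <vector>

struct LogEntry {
    double time;          // game engine's internal time in seconds
    std::string modality; // "gesture" or "speech"
    std::string command;  // e.g. "walkForward"
    std::string task;     // navigation, selection, dialogue or manipulation
};

// Drops entries that repeat the previous command on the same modality
// less than two seconds later, keeping only the first occurrence.
std::vector<LogEntry> removeDuplicates(const std::vector<LogEntry>& entries) {
    std::vector<LogEntry> cleaned;
    for (const LogEntry& e : entries) {
        if (!cleaned.empty()) {
            const LogEntry& last = cleaned.back();
            if (last.command == e.command && last.modality == e.modality &&
                e.time - last.time < 2.0)
                continue; // duplicate within two seconds, skip it
        }
        cleaned.push_back(e);
    }
    return cleaned;
}
```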

Sessions

Twelve test subjects were recruited at the chair of Human Centered Multimedia. This group consisted of ten staff members, one student assistant and one student working on their Master's thesis. Eleven of them were male, one was female, and all were right-handed and either native speakers or fluent in German. They were between 24 and 35 years old with an average age of 29.5 years.

The subjects were first introduced to the various controls and allowed to practice them in a simpler setting. This scene contained a few walls, one lever which would move a simple door and one NPC who told the user to turn the lever when they asked what they should do. About 5 to 10 minutes were needed for this instruction phase. Afterwards, the subjects played the space ship scenario and were free to choose either modality for any interaction they encountered. The scenario itself took about 15 to 20 minutes to finish.

The actions advancing the plot were displayed on a poster next to the screen so that the subjects could focus on the input instead of puzzle solving. Due to the linear storyline, all of them were faced with the same interaction possibilities. When they got stuck on a problem, for example when they searched for the storage container in the wrong direction or failed to understand that the tool had to be turned after being attached to the screw, they were given hints on how to proceed.

Before their session, each subject filled in the first page of a questionnaire which asked about demographic information and experience with either of the two input technologies. The remaining four pages of the questionnaire contained more detailed questions about the separate task categories which had to be answered after the space ship scenario. For each task, the subjects were to state their preferred modality as gesture, speech or neither. Furthermore, there were four statements which had to be rated on a five-point Likert scale according to how much the subjects agreed with them. For each of the two modalities, they answered how difficult the commands were to remember or discover, how natural they felt for these actions, how tiring they were to perform and how reliable the recognition was.

3.3.3 Results and Discussion

Hypothesis: Users prefer different modalities for different tasks.

The basic assumption for this study was that the two modalities are not equally suitable for every type of interaction. Therefore, user preferences were expected to be different between tasks, and indeed these differences can be seen in the results. For both gesture and speech there is one task category which was almost exclusively controlled by that modality, whereas the remaining two were more varied. This first impression was confirmed by a repeated measures ANOVA (analysis of variance) comparing the logged percentage of speech inputs between the tasks, with Bonferroni correction used for the post-hoc test.

According to the ANOVA results, the differences between navigation and selection (p=0.000, effect size = 0.872) as well as navigation and dialogue (p=0.000, effect size = 0.985) were highly significant with p ≤ 0.001. Furthermore, they were very significant for navigation and manipulation (p=0.005, effect size = 0.797) with p ≤ 0.01 and significant for dialogue and manipulation (p=0.015, effect size = 0.748) with p ≤ 0.05. The effect sizes were large for all four of these combinations.

Figure 3.5: Preferred modality for each task as stated in the questionnaire.

Figure 3.6: Average use of both modalities for each task, calculated from the successful inputs in the log files.

The detailed results for every task are presented and discussed below. For each of them, two-sided paired t-tests were used to compare the logged usage and the results of the four agreement questions between both modalities.

Navigation

            speech            gesture
            µ       σ²        µ       σ²        p             effect size
usage       4.97%   0.003     95.03%  0.003     0.00000***    0.993
difficult   2.50    1.545     1.42    0.447     0.04097*      0.572
natural     2.92    1.538     3.75    2.205     0.15668
tiring      2.33    1.333     2.67    2.242     0.47385
reliable    4.17    0.697     3.92    0.629     0.49098

Table 3.3: T-test results for navigation. Significance: *p ≤ 0.05, ***p ≤ 0.001.

Figure 3.7: Left: Logged navigation inputs per subject. Right: Average agreement ratings for navigation. Significance: *p≤ 0.05.

Hypothesis: The walking metaphor is preferred for low-level navigation.

This hypothesis was confirmed since almost all subjects preferred gestures and used them most of the time.

11 out of the 12 test subjects (91.67%) stated they preferred gesture input for navigation, and the log files showed an average use of 95.03% gesture (minimum 80.82%) and 4.97% speech. This difference was highly significant (p ≤ 0.001, large effect size > 0.5). As for the perception of the input methods, gesture was rated significantly less difficult to remember or discover than speech (p ≤ 0.05, large effect size > 0.5). It was also rated as more natural, less reliable and more tiring than speech, but none of these three differences were significant (p > 0.05).

The strong tendency for gesture input could already be observed in the training setting, where many subjects needed to be reminded that they should try the speech option as well. On average, the difficulty of remembering the correct gesture command was rated as 1.42 on the five-point Likert scale. This is very close to the minimum and confirms that the walking metaphor and the chosen mapping were as straightforward as expected. One subject explicitly stated that walking was a physical action, so gestures seemed more natural to them and they used them almost all the time. But although the walking gestures were perceived as rather natural on average, they were not rated as completely natural either. This makes sense because the joystick control scheme does sacrifice some realism, mostly by forcing the user to return to the neutral position and turn their body away from the perceived orientation.

The physics-based collision detection caused some annoying problems, making several users fall off the bed, trip over the ship's ramp or, in at least two cases, even get stuck in a wall.

On the input side, there were occasional problems when a user rotated their body while walking, as these commands tended to cancel each other out, and sometimes the distinction between rotating sideways and changing the camera pitch was not as clear as it should have been. But when the subjects separated their inputs more carefully, the steering worked as expected. It should also be noted that some users tried to correct problematic movements not by stopping but by immediately steering in the other direction, especially when the current movement involved both axes. Due to the way the acceleration was implemented, these direction changes took considerably longer.

Despite all those problems, the overall recognition was rated as rather reliable for both modalities, which showed that it worked sufficiently well for this task.

Selection

            speech            gesture
            µ       σ²        µ       σ²        p             effect size
usage       66.51%  0.118     33.49%  0.118     0.12415
difficult   1.33    0.788     2.08    0.811     0.05584       0.542
natural     4.25    1.841     3.50    1.182     0.20160
tiring      1.42    0.629     2.33    1.333     0.05009       0.553
reliable    4.58    0.447     3.83    0.879     0.05584       0.542

Table 3.4: T-test results for selection.

Figure 3.8: Left: Logged selection inputs per subject. Right: Average agreement ratings for selection.

For this task, there was no clear hypothesis to confirm. The results show a tendency towards speech, with 7 of the 12 subjects (58.33%) stating a preference for this modality, 4 (33.33%) being undecided and 1 (8.33%) preferring gesture. 66.51% of the logged inputs used speech, but this difference did not turn out to be significant (p > 0.05). The average ratings for all four statements are in favor of speech, with higher values for the positive qualities and lower values for the negative ones, but none of these differences were significant, either (p > 0.05). However, the values are rather close to the significance threshold for "difficult" and "reliable", and in the case of "tiring" the threshold is missed very narrowly. The effect sizes, too, can be considered sufficiently large (> 0.5), which means that this task category should be examined again in order to get clearer results.

Compared to the effort of raising an arm, aiming it and holding that pose, speech input only requires saying a few syllables, which probably explains why the results hint at this modality. This can be compared to the statement by van der Sluis and Krahmer that users aim to reduce the effort necessary for producing a referring expression [46]. On the other hand, the usage and stated preference are rather mixed, which could be caused by differences in the targets themselves. Three subjects mentioned that they preferred reaching for the object they wanted to interact with, confirming the earlier argument against speech selection, while two of these preferred addressing the NPCs by speech. Multimodality might be another factor in this case, considering that one subject noticed how they often added the name to their gesture. It is possible that analyzing the video data will show other people doing the same.

Another aspect that should be examined in the recordings is the difference caused by the target's size. Also, the fact that some objects were placed very close to each other was commented upon. The current study considered only the overall modality use, but going by the findings of van der Sluis and Krahmer [46], who confirmed that users compensated for an imprecise pointing gesture by adding spoken descriptions, the expectation would be that the more difficult objects were selected by speech. However, one observation during this study was that, at least at first, gesture users would rather focus on careful hand movements than switch to speech. This could nevertheless change over the course of the scenario.

Again, both modalities were rated as very reliable, but this time the difference was more noticeable. This is probably related to the target difficulty mentioned above. Furthermore, at least two subjects tried to point at the name label instead of the object's mesh, probably mistaking the panel behind the text for a kind of button. This was rare because the panels were consciously designed as flat, translucent and without a conspicuous border, in contrast to the solid, beveled buttons used for dialogue and menu options. However, it may have contributed to the existing difficulty.

Dialogue

Hypothesis: Speech is preferred for dialogue with virtual characters.

This was clearly confirmed, as speech was used almost exclusively and all ratings were in this modality's favor.

            speech            gesture
            µ       σ²        µ       σ²        p             effect size
usage       93.23%  0.024     6.77%   0.024     0.00000***    0.946
difficult   1.08    0.083     2.25    1.477     0.00607**     0.714
natural     4.92    0.083     2.83    0.879     0.00002***    0.909
tiring      1.17    0.152     2.58    0.811     0.00020***    0.854
reliable    4.83    0.152     3.67    0.606     0.00052***    0.825

Table 3.5: T-test results for dialogue. Significance: **p ≤ 0.01, ***p ≤ 0.001.

Figure 3.9: Left: Logged dialogue inputs per subject. Right: Average agreement ratings for dialogue. Significance: **p ≤ 0.01, ***p ≤ 0.001.

All 12 subjects stated that they preferred speech, and 93.23% of the dialogue inputs were spoken, compared to only 6.77% gesture use. The minimum amount of speech was 50.00%, but 9 subjects actually used speech for every sentence and no gestures at all. Unsurprisingly, the difference was highly significant (p ≤ 0.001, large effect size > 0.5). Furthermore, this modality's ratings for the four agreement questions were very close to the extremes, with almost full agreement for the positive statements and almost none for the negative properties. There were highly significant differences for "natural" (p ≤ 0.001, large effect size > 0.5), "reliable" (p ≤ 0.001, large effect size > 0.5) and "tiring" (p ≤ 0.001, large effect size > 0.5), and a very significant difference for "difficult" (p ≤ 0.01, large effect size > 0.5).

Given that the gesture alternative was highly abstract as opposed to actively taking part in the conversation, the preference for speech was completely expected. Again, this was already indicated by the subjects' behavior in the training phase. One subject noted that they mostly used speech but would use gestures for things like greeting, which serves as a reminder that a following implementation should include at least some conversational gestures. Some of the reasons the subjects gave were directly reflected in the ratings, such as the fact that speech was clearly more natural than the pointing gesture. The latter was rated below but close to neutral, due to the fact that most users did not even try this option.

As for reliability, one subject found that the recognition had always worked as expected, except in a few cases for which they blamed themselves because they got it right on the second attempt. Other users did run into problems, as a few sentences were occasionally misinterpreted or not recognized. Also, it could happen that a filler word was interpreted as the cancel command, although care had been taken to reduce this risk as far as possible. When these errors occurred, users tended to focus on the highlighted keyword rather than switch to the gesture option. One of them did use a pointing gesture after the recognizer failed to understand them, but used the former approach for the rest of the dialogue. Nevertheless, the majority of the sentences were recognized without problems, and leaving out the filler words worked well for handling the errors described above. As can be seen from the questionnaire results, these did not have much of an impact on the perceived reliability.

Another observation in this context was that natural dialogue runs the risk of prompting unexpected utterances from the user, but fortunately, they were not much of an issue in this study. One subject tended to acknowledge the NPC's sentences with "mm-hm" sounds which were ignored by the recognizer. Another reacted with short, spontaneous sentences before turning to the options they were presented with, but only at the first two opportunities. As soon as the speech recognizer interpreted the second utterance as an input, they stopped this behavior. Perhaps a more complex speech recognition might be able to counter this effect in a constructive way instead of prohibiting it. Occasionally, subjects would think aloud, comment on the scenario or ask questions about it, but most of these utterances were not similar enough to the active vocabulary and were therefore identified as garbage.

Recognizing the available commands was rated as not difficult at all. In this case, the limitation to predefined sentences was an advantage, as all the options were visible on the screen. Users mostly kept to the default sentence, but occasionally varied the filler words, and in case of error handling, they instantly identified the real keywords. Speech input was not perceived as tiring, either, which can be attributed to the fact that no physical actions were involved.

Object Manipulation

            speech            gesture
            µ       σ²        µ       σ²        p             effect size
usage       49.72%  0.132     50.28%  0.132     0.97926
difficult   1.25    0.205     2.67    1.152     0.00086***    0.807
natural     3.42    1.356     4.00    1.273     0.30568
tiring      1.33    0.242     2.25    1.295     0.00864**     0.693
reliable    4.92    0.083     3.83    0.879     0.00157**     0.783

Table 3.6: T-test results for manipulation. Significance: **p ≤ 0.01, ***p ≤ 0.001.

Figure 3.10: Left: Logged manipulation inputs per subject. Right: Average agreement ratings for manipulation. Significance: **p≤ 0.01, ***p≤ 0.001.

Hypothesis: Gesture input is preferred for object manipulation.

This could not be confirmed because both modalities were used almost equally.

The explicit preferences in this case happen to be evenly split, with each modality being preferred by 5 subjects (41.67%) and the remaining 2 (16.67%) being undecided. Going by the log files, gesture was used only slightly more, which was not significant. As far as the agreement ratings are concerned, only the one for "natural" was better for gesture input, but not significantly different (p > 0.05). The remaining three were actually in favor of speech, with very significant differences for "reliable" (p ≤ 0.01, large effect size > 0.5) and "tiring" (p ≤ 0.01, large effect size > 0.5). For "difficult", this difference was even highly significant (p ≤ 0.001, large effect size > 0.5).

In a way, these results match those of Corradini and Cohen [8], who mostly observed preferences for multimodal (60%) or spoken (30%) interaction. On the other hand, their findings were not sorted according to task categories, and they mentioned that speech-only navigation might distort the results due to the number of commands. Looking at the significantly positive ratings for speech input, one would assume that there was also a clear preference for this modality, but surprisingly, that was not the case.

But what is remarkable about the manipulation task is the almost equal distribution of preferences, together with very different observations about the subjects' behavior. Speech users tended to be more focused on quickly fulfilling the scenario's goals, often calling the actions as soon as they appeared on screen, whereas some gesture users exhibited role-playing behavior, such as worrying about being watched by the NPCs or talking more playfully during dialogue. This hints at two distinct personas which should be examined more thoroughly in a different study.

For now, the first step was to take a closer look at those subjects who clearly preferred one of the modalities. Their data was re-grouped based on the preference stated in the questionnaire, which led to two sets of 5 subjects each. Those without an explicit preference were ignored. Additional two-sided paired t-tests were then performed on either set, again comparing the data between both modalities but this time restricted to those subjects who preferred the same modality.
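The re-grouping step itself is simple; a sketch under the same illustrative assumptions as above (hypothetical column names and file, not the actual analysis script) could look as follows.

```python
# Sketch of the preference-based re-grouping for the manipulation task:
# undecided subjects are dropped, then the paired t-tests are repeated
# within each group. Column names and the CSV file are hypothetical.
import pandas as pd
from scipy import stats

manip = pd.read_csv("manipulation_ratings.csv")  # one row per subject
for preference, group in manip.groupby("stated_preference"):
    if preference == "undecided":
        continue  # the two undecided subjects are ignored
    t, p = stats.ttest_rel(group["speech_usage"], group["gesture_usage"])
    print(f"{preference} users (n={len(group)}): t={t:.3f}, p={p:.4f}")
```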

As expected, the log files showed a significantly higher gesture use for those who preferred that modality (p ≤ 0.05, large effect size > 0.5). For speech users, the preference was even confirmed with a highly significant difference (p ≤ 0.001, large effect size > 0.5). Since the speech group used their modality 87.58% of the time and the other used gestures for 81.02%, it is easy to see why these ratios canceled each other out when the average was calculated over all subjects.

                        speech            gesture
                        µ       σ²        µ       σ²        p             effect size
speech    usage         87.58%  0.003     12.42%  0.003     0.00010***    0.992
users     difficult     1.20    0.200     3.40    0.300     0.00418**     0.947
          natural       4.00    1.000     2.80    0.200     0.03268*      0.849
          tiring        1.40    0.300     2.80    1.200     0.05161       0.808
          reliable      4.80    0.200     3.60    0.800     0.03268*      0.849
gesture   usage         18.98%  0.043     81.02%  0.043     0.02849*      0.859
users     difficult     1.20    0.200     2.00    0.500     0.09930
          natural       3.00    2.000     5.00    0.000     0.03411*      0.845
          tiring        1.00    0.000     1.40    0.300     0.17781
          reliable      5.00    0.000     4.40    0.800     0.20800

Table 3.7: T-test results for manipulation, separated according to the stated preference. Significance: *p ≤ 0.05, **p ≤ 0.01, ***p ≤ 0.001.

Speech users rated their modality as more positive and less negative for all four agreement questions. The values for "natural" (p ≤ 0.05, large effect size > 0.5) and "reliable" (p ≤ 0.05, large effect size > 0.5) were significantly higher for speech than for gesture, and the lower value for "difficult" was very significant (p ≤ 0.01, large effect size > 0.5). For "tiring", the results were still close to the significance threshold. On the gesture side, only "natural" was rated significantly higher for gesture input than for speech (p ≤ 0.05, large effect size > 0.5). The other three properties were still better for speech, but for this group the differences were not significant.

Interestingly, the speech users did not perceive the manipulation gestures as particularly natural, as the neutral rating shows, whereas all five gesture users fully agreed with the statement.

Figure 3.11: Average agreement ratings for manipulation, separated according to the stated preferences. Left: Speech users only. Right: Gesture users only. Significance: *p ≤ 0.05, **p ≤ 0.01.

Since the input gestures were modeled after the real actions, only lacking the grasping part, the explanation would be that the gesture users either had more problems understanding the animations or did not care about that alternative in the first place. For those who did perceive the gestures as natural, this may have been their main reason for choosing that modality, especially considering that their other ratings would point in the other direction. One gesture user, who had also stated that they preferred reaching for the selection target, explained that they would rather see the objects move because of their own hand movements.

Both groups found the gesture commands more difficult to discover than the action keywords, but while that value was indeed above neutral for the speech users, the gesture users rated the difficulty as rather low. One of these remarked that although the animations themselves were not always easy to read, they could be understood rather quickly due to their similarity to the real world. The animations do have some shortcomings, mainly because they need to be shown on a rather small scale and the depth information is still difficult to see despite the color coding.

Furthermore, users tended to be confused by those gestures which started with an arm movement to a given position, such as reaching for the window before they could lift their leg to climb, or raising their hand to begin waving. Although it was shown in the animation, they often failed to understand that the first stroke was a necessary part of the gesture, so when the system did not react, they usually repeated only the later part with the hand still in place. Apparently, defining most gestures without the help of fixed postures and basing the animation on assumed starting positions was more of a liability in that context, even if it worked for most of the actions in this scenario. As for the rest of the animation, at least one gesture user paid little attention to the separate strokes and the speed displayed by the figure, usually making a rather fast movement in the general direction. Since the FUBI recognition is rather tolerant with regard to the actual direction, this still worked most of the time.

While it is not related to the actual inputs, it should also be noted here that several users attempted to perform inactive actions, repeatedly speaking gray keywords or being confused about the conditions they had to fulfill. A future implementation should make these conditions more transparent, probably by requesting an action's conditions from the smart object along with the input data.

Gestures were rated as very reliable by those who preferred them and still above neutral by the other group, which confirms that the overall recognition worked as expected. A few gestures were triggered by accident due to the tolerance mentioned above, whereas on some occasions the recognizer definition was too strict and the users had to try several times. However, the latter was probably related to getting the speed right with only a few seconds of animation for guidance. Speech users were rarely facing such problems, which is reflected in the fact that both groups rated this modality as almost perfectly reliable.
Consequently, speech also served as a fallback solution for the gesture users when they had problems with a particular movement, and sometimes they were seen alternating between both modalities. Despite the physical effort and the occasional need to repeat their input, gesture users did not perceive their preferred modality as tiring. This aspect should be examined again with a longer scenario and a larger test group, but it can already be seen as a positive sign. Speech was not seen as tiring by its users, either, and even less so by the gesture group, so it would be a suitable alternative if gestures should become tiring during a longer session.

Chapter 4

Conclusion and Future Work

This last chapter presents a summary of the work done for this thesis and ends with those ideas which are left for future research.

4.1 Conclusion

Four distinct task categories have been identified for interacting with a virtual world, specifically navigation, selection, dialogue and manipulation. Each of them plays a different role in such an application and has its own requirements for user input. Findings from existing literature were sorted according to these tasks in order to identify these differences. In particular, the relevant input metaphors for speech and gesture were presented, along with the visual cues that help the user to identify and perform them.

Based on these approaches, a three-dimensional virtual environment was implemented in which all of these tasks can be performed by either gesture or speech commands. So far, no multimodal integration is used, and the inputs for both modalities were designed to be interchangeable. In contrast to the Wizard-of-Oz systems described in several examined sources, this application makes use of currently available technologies for actual input recognition. It recognizes full body gestures via the Kinect sensor and the "Full Body Interaction Framework" by Felix Kistler [20], while the Microsoft Speech Platform [27] provides the functionality for spotting one or several relevant keywords in an utterance.


The interaction information itself is stored in smart objects, imple- mented in the Horde3D GameEngine [35], which will make it easier to reuse the system for future research.

The finished application was the testing ground for a user study with the goal of confirming the suitable modality for each task. Twelve subjects played a short scenario with various interactions, and for each of those they were free to choose either of the input alternatives.

First of all, the study confirmed that the recognition technology worked as expected, as the subjects perceived both modalities as mostly or highly reliable for all four tasks. The inputs themselves turned out to cause fewer problems than unrelated aspects of the implementation did, most notably the physics engine used for collision detection.

For navigation, users preferred the option based on walking gestures, which they rated as easier to remember than the spoken alternative. Dialogue was almost exclusively done by speech, which was rated as almost completely natural and reliable. Furthermore, the commands were hardly perceived as difficult to discover or tiring to perform. Another reason for this clear preference was probably the fact that the gesture alternative in this case was highly abstract, since there was no natural gesture scheme available that could function unimodally and still express the same commands.

There was no clear result for selection, which was probably due to the small test group, because the perception results were rather close to the significance threshold. There is a tendency towards speech, but some subjects also remarked that they had different preferences for objects than for the NPCs, physically reaching for the former but addressing the latter verbally. Finally, the manipulation category showed no overall preference because each modality was chosen by almost half of the group. Analyzing this (admittedly small) sample further showed that speech users found their preferred input method more reliable and also more natural, whereas the other group perceived the manipulation gestures as perfectly natural. The fact that they were based on realistic hand movements apparently helped them overcome the difficulty of understanding the animations, whereas those users who preferred speech rated that aspect as far more difficult.

By grouping the relevant approaches and confirming some of the tendencies in a system with working speech and gesture recognition, this thesis presents a useful starting point for complex interaction with a virtual world using only these natural input technologies.

Additionally, the application developed here already offers the basic functionality, which can be extended for future research in order to fill in the information that could not be gained from the current study.

4.2 Future Work

There are several interesting aspects left which either could not be included in the current work or were discovered in the process.

4.2.1 Analysis of the Video Data

The time frame for this thesis did not allow for a proper analysis of the video files in addition to the other work necessary for the study. Data about the actual inputs during the scenario is approximated by the log files, which are more limited in comparison. For instance, if a subject made several attempts to get an input right before switching to the other modality, the log files could only record that last successful command. The same applies to accidental inputs caused by thinking aloud or making unintentional hand movements. The video recordings, on the other hand, can provide more information about the actual recognition rate which, so far, can only be judged from the subjects' perception of the system's reliability.

In addition, the error handling strategies could be examined more closely. So could the commands which included both modalities, although they were rarer in this scenario due to the explicit separation. Nevertheless, there were some occurrences, such as a user saying "stop" while moving back to the neutral position or another naming the object they were pointing at. Also, this scenario included situations of varying difficulty in order to be more representative of arbitrary simulations. While the overall number of interactions is probably too small for reliable results, examining the users' behavior for different objects could give a first impression of how this influenced their decision during the study. For example, error handling and multimodal expressions are more likely in these cases.

4.2.2 Role Playing versus Efficiency

The mixed results for the manipulation category revealed two separate personas or focus groups which seem to have different priorities when interacting with virtual objects. As mentioned in 3.3.3, speech users tended to interact in a more straightforward, goal-oriented way than gesture users.

One speech user who had displayed notable efficiency-based behavior, speaking action commands as soon as the object was selected, was explicitly asked about a connection between this behavior and their input choice. Their answer was that they intuitively chose the kind of input which had worked well for the guard dialogue because they were interested in quickly fulfilling the scenario's goals, just as they would be with any regular computer game. In contrast, one of the gesture users explained that they had expected to be using speech but changed their preference as soon as they were drawn into the story and thought about the guard hearing them escape. Another subject, who had mostly used gestures but did not state a particular preference, already closed the ship's door before repairing the engine, saying that they were suspicious of the technician's helpfulness and did not want him to watch.

Such observations indicate that realistic physical interaction may be more important for role playing users who want to be immersed in a story, whereas goal-oriented players would prefer a quick speech command in order to move on sooner. This will be an interesting topic for future research which may reveal more factors that contribute to the choice of modality. For example, it could be examined by comparing a realistic training situation to a competitive game, or by adding different levels of immersion to the same setting, such as sounds or NPC behavior. The idea that the presence of NPCs might persuade the user to choose differently had actually been considered during the planning phase of this thesis. In fact, one motivation for making the guard NPC leave after the conversation was to ensure that the test subjects were not influenced in their choice. Seeing that there were indeed subjects who gave this as a reason for their behavior, this aspect should be picked up again.

4.2.3 Direct Gesture Mapping

The input gestures themselves are far from perfect. At the moment, they rely on the user standing in a particular position and orientation, which cannot be guaranteed and can lead to confusing situations, such as the bed moving in a direction perpendicular to the path of the user's hands or the fact that a door could not be manipulated from the other side. This is partly caused by the way gestures are recognized, since they follow a pre-defined pattern which cannot be adapted at run-time, at least not without repeatedly destroying and recreating its description. In addition to that, this fixed structure needs to be defined in such a way that it rejects unintentional movements but at the same time leaves some tolerance so that it can be performed easily.

However, it turned out that the current FUBI recognition is very lenient with respect to the actual direction vector, usually activating any recognizer which shares a movement axis with the input, regardless of the remaining vector entries. For example, a diagonal motion to the left and upwards would lead to the recognition of all gestures that partially moved upwards as well as all those including any movement to the left. This was a major problem in defining gestures which had to be available at the same time, leading to some being more restricted than others.

A different approach may be more useful in this case, such as projecting the hand into the environment like it was done by Kallmann [19, 18]. Instead of performing a symbol of the gesture which only acts as a trigger, the virtual hand could be used to actually grasp parts of the object and move them according to the degrees of freedom given in its scene graph, as sketched below. Such a solution would also eliminate the need for both defining recognizer patterns and displaying them as animations which are hard to understand. The downside is that, obviously, this approach would require a grasping metaphor which works with a vision-based system or an unobtrusive additional device.
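To make the idea more concrete, the following sketch shows how a tracked hand position could drive a single degree of freedom directly, for instance a hinge joint such as the lever or a door. Everything in it (the joint description, the per-frame update, the reference direction) is a hypothetical illustration of the approach, not part of the implemented system.

```python
# Sketch of direct manipulation of one degree of freedom: instead of matching
# a pre-defined gesture pattern, the tracked hand position is projected onto
# the plane of a hinge joint and converted into a joint angle every frame.
# The joint description is hypothetical and not part of the implemented system.
import numpy as np

class HingeJoint:
    def __init__(self, pivot, axis, min_angle, max_angle):
        self.pivot = np.asarray(pivot, dtype=float)  # world position of the hinge
        self.axis = np.asarray(axis, dtype=float)    # rotation axis (unit vector)
        self.min_angle = min_angle
        self.max_angle = max_angle
        self.angle = 0.0

    def update_from_hand(self, hand_pos, reference_dir):
        # Project the hand onto the plane perpendicular to the hinge axis.
        offset = np.asarray(hand_pos, dtype=float) - self.pivot
        offset -= self.axis * np.dot(offset, self.axis)
        if np.linalg.norm(offset) < 1e-6:
            return self.angle  # hand is on the axis, keep the old angle
        offset /= np.linalg.norm(offset)
        # Signed angle between the reference direction and the projected hand.
        angle = np.arctan2(np.dot(np.cross(reference_dir, offset), self.axis),
                           np.dot(reference_dir, offset))
        self.angle = float(np.clip(angle, self.min_angle, self.max_angle))
        return self.angle

# Per frame: lever = HingeJoint(pivot=[0, 1, 2], axis=[0, 0, 1], -1.5, 1.5)
# would be updated with lever.update_from_hand(hand_position, [1, 0, 0]).
```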

4.2.4 Expansion of the Speech Recognition

The currently implemented speech recognition uses only a fraction of what the Microsoft Speech Platform [27] can offer, especially when it comes to semantic tagging. Semantics are already used to distinguish between selections and actions, but they could also help to combine them and skip the selection phase altogether, similar to the implicit selection mentioned in section 2.3.2. For example, the actions of all reachable objects could be pre-loaded and the system would then listen for either the name of a unique action or a phrase containing both an action and a reference to its object. The system would then select that object automatically before triggering the action (a sketch of this logic is given below).

During dialogue, a more sophisticated grammar could be used for more natural input by accepting synonyms for the fundamental keywords or offering more general patterns for talking about the known topics. Also, the filler words might include meaningful information about a user's attitude. Finally, navigation in this study did not include the option to specify the target directly, which was a notable input approach for speech-based navigation [7, 8]. Such a command could be implemented by combining navigation verbs with the names of visible objects. This might also be useful for a study pitting immersion against efficiency, as such shortcuts can appear unrealistic in a virtual world and therefore might rather be used by goal-oriented players than by role-playing types.
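The sketch below illustrates the proposed implicit selection: the actions of all reachable smart objects are pre-loaded, and a recognized action keyword, optionally combined with an object name, is mapped to a selection plus a trigger. The object and action names as well as the keyword-set input format are hypothetical and only serve to demonstrate the idea.

```python
# Sketch of implicit selection from a combined utterance such as
# "turn the lever": the recognized keywords are matched against the actions
# of all reachable objects, and the object is selected automatically before
# the action is triggered. All names and the keyword format are hypothetical.
REACHABLE = {
    "lever": {"turn"},
    "door": {"open", "close"},
    "spanner": {"take"},
}

def resolve(keywords: set):
    """Return (object, action) if the keywords identify them uniquely."""
    candidates = [
        (obj, act)
        for obj, actions in REACHABLE.items()
        for act in actions
        if act in keywords and (obj in keywords or
                                sum(act in a for a in REACHABLE.values()) == 1)
    ]
    if len(candidates) == 1:
        return candidates[0]
    return None  # ambiguous or unknown, fall back to explicit selection

print(resolve({"turn", "lever"}))   # ('lever', 'turn')
print(resolve({"open"}))            # ('door', 'open') - unique action name
print(resolve({"take", "close"}))   # None - two possible interpretations
```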

4.2.5 Multimodal Commands

In most cases, the input would benefit from a form of multimodal integration. Most notably, the input for dialogue does not yet include conversational gestures because they could not be used as effectively in an unimodal context. This neglects an important information channel which should be added in the future. Multimodal integration was already mentioned in the selection context in section 2.3.2. Since there was at least one subject who combined both modalities for that task and other occurrences might turn up during the analysis of the video files, this information might as well be joined to simplify the reference generation, as is explained in existing literature [5, 6, 46]; a simple fusion step of this kind is sketched below. Other possible uses may be the addition of speed parameters to navigation commands, which may be more efficient by speech than by adapting the physical distance, or improving the recognition of manipulation actions when the gesture input is ambiguous or outside the allowed thresholds.
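One simple form of such an integration is late fusion within a time window: a pointing gesture and a spoken reference that arrive close enough in time are merged into a single selection, with the spoken name used to disambiguate imprecise pointing. The event structures and the window length below are illustrative assumptions rather than part of the implemented system.

```python
# Sketch of a late-fusion step for selection: a pointing event (carrying the
# objects near the ray) and a speech event (carrying a spoken name) are merged
# if they occur within a short time window. Event format and window length
# are illustrative assumptions, not part of the implemented system.
from dataclasses import dataclass

FUSION_WINDOW = 1.5  # seconds, assumed

@dataclass
class PointingEvent:
    timestamp: float
    candidates: list     # objects close to the pointing ray, best match first

@dataclass
class SpeechEvent:
    timestamp: float
    spoken_name: str     # recognized object name, or "" if none

def fuse(pointing: PointingEvent, speech: SpeechEvent):
    if abs(pointing.timestamp - speech.timestamp) > FUSION_WINDOW:
        # Too far apart in time: treat the pointing gesture unimodally.
        return pointing.candidates[0] if pointing.candidates else None
    # Inside the window: prefer a pointing candidate that matches the name.
    if speech.spoken_name in pointing.candidates:
        return speech.spoken_name
    return pointing.candidates[0] if pointing.candidates else None

print(fuse(PointingEvent(10.0, ["container", "crate"]),
           SpeechEvent(10.4, "crate")))   # -> "crate"
```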

4.2.6 Expansion of the Smart Object Infrastructure

Finally, regardless of the input, the smart object approach has much more potential than what is currently implemented, especially considering the complex ruleset Uijlings defined for general world representations [45]. At the moment, the objects picked up by the user are only displayed for confirmation but cannot be accessed directly like they would be in traditional game inventories. Unlike the actions defined by Uijlings, the ones here do not include parameters for the user or the tool. This means that they only work on known entities, so selecting a particular inventory item is of no use. For example, it is not possible to put the damaged part back into the engine because this action is reserved for the spare. The only interaction which uses a tool, namely unlocking the part with the spanner, had to be broken down into two steps, which caused some users to expect the result too early. A possible extension of the action structure is sketched below.

Furthermore, the available actions are still pre-defined by hand. Some actions, such as picking up smaller items or pushing around those of medium size, do have specific conditions and effects which could be assigned automatically, for example based on the size of their bounding box. But since the current scenario was rather small, this was not required yet. On the contrary, offering unnecessary actions like pushing the spanner across the table would have distracted from the main topic of this thesis. Seeing that some test subjects were already confused by gray labels for potentially available options, leaving out automatically generated actions was obviously a good choice. Nevertheless, this idea might be interesting for speeding up the content creation process, especially if this system and its input capabilities are to be reused for a more complex setting.
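As a sketch of what such an extension could look like, the following hypothetical data structure adds a tool parameter plus explicit conditions and effects to a smart object action, loosely following the Uijlings-style rules referenced above. It is not the structure used in the current implementation; all field and state names are invented for illustration.

```python
# Sketch of a parameterized smart object action with explicit conditions and
# effects, so that a tool (e.g. the spanner) can be part of the definition.
# Hypothetical extension, not the structure used in the current implementation.
from dataclasses import dataclass, field

@dataclass
class Action:
    name: str
    requires_tool: str = ""                           # e.g. "spanner"
    conditions: dict = field(default_factory=dict)    # e.g. {"screw_state": "attached"}
    effects: dict = field(default_factory=dict)       # e.g. {"screw_state": "loose"}

    def available(self, world_state: dict, held_tool: str) -> bool:
        if self.requires_tool and held_tool != self.requires_tool:
            return False
        return all(world_state.get(k) == v for k, v in self.conditions.items())

    def apply(self, world_state: dict) -> None:
        world_state.update(self.effects)

unlock = Action("unlock part", requires_tool="spanner",
                conditions={"screw_state": "attached"},
                effects={"screw_state": "loose"})

state = {"screw_state": "attached"}
if unlock.available(state, held_tool="spanner"):
    unlock.apply(state)
print(state)   # {'screw_state': 'loose'}
```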

Bibliography

[1] Apple. Siri - frequently asked questions. http://www.apple.com/ios/siri/siri-faq/. [cited at p. 1]

[2] Olivier Bau and Wendy E. Mackay. Octopocus: a dynamic guide for learning gesture-based command sets. In Proc. of ACM UIST, pages 37–46, 2008. [cited at p. 11, 12]

[3] Kirsten Bergmann, Hannes Rieser, and Stefan Kopp. Regulating dialogue with gestures: towards an empirically grounded simulation with conversational agents. In Proceedings of the SIGDIAL 2011 Conference, SIGDIAL '11, pages 88–97, Stroudsburg, PA, USA, Jun 2011. Association for Computational Linguistics. [cited at p. 30]

[4] M. Cavazza, F. Charles, S.J. Mead, O. Martin, X. Marichal, and A. Nandi. Multimodal acting in mixed reality interactive storytelling. Multimedia, IEEE, 11(3):30–39, Jul-Sep 2004. [cited at p. 8, 29, 30, 31, 32, 53]

[5] J. Chai, Shimei Pan, M.X. Zhou, and K. Houck. Context-based multimodal input understanding in conversational systems. In Multimodal Interfaces, 2002. Proceedings. Fourth IEEE International Conference on, pages 87–92, 2002. [cited at p. 26, 28, 29, 88]

[6] Joyce Y. Chai, Pengyu Hong, and Michelle X. Zhou. A probabilistic approach to reference resolution in multimodal user interfaces. In Proceedings of the 9th international conference on Intelligent user interfaces, IUI '04, pages 70–77, New York, NY, USA, 2004. ACM. [cited at p. 23, 24, 25, 26, 28, 29, 88]

[7] P. Cohen, D. McGee, S. Oviatt, L. Wu, J. Clow, R. King, S. Julier, and L. Rosenblum. Multimodal interaction for 2d and 3d environments [virtual reality]. Computer Graphics and Applications, IEEE, 19(4):10–13, Jul/Aug 1999. [cited at p. 19, 20, 22, 87]


[8] A. Corradini and P. Cohen. On the Relationships Among Speech, Gestures, and Object Manipulation in Virtual Environments: Initial Evidence. In Proceedings of the International CLASS Workshop on Natural, Intelligent and Effective Interaction in Multimodal Dialogue Systems, 2002. [cited at p. 7, 8, 20, 36, 38, 44, 45, 67, 79, 87]

[9] Cyan, Inc. Myst Complete. Ubisoft, 1997-2005. [cited at p. 15, 22, 43]

[10] Julian Dibbell. MIT technology review - gestural interfaces. http://www2.technologyreview.com/article/423687/gestural-interfaces/, May/Jun 2011. [cited at p. 63]

[11] eCircus. eCircus - Media download page. http://www.macs.hw.ac.uk/EcircusWeb/index.php?module=pagemaster&PAGE_user_op=view_page&PAGE_id=26. [cited at p. 31]

[12] Barak Freedman, Alexander Shpunt, Meir Machline, and Yoel Arieli. Patent: Depth mapping using projected patterns. http://www.freepatentsonline.com/y2010/0118123.html, May 2010. 20100118123. [cited at p. 63]

[13] Blitz Games. The Biggest Loser - Ultimate Workout (Demo). THQ, Nov 2010. [cited at p. 40, 41]

[14] Harmonix Music Systems, Inc. Dance Central (Demo). MTV Games, Nov 2010. [cited at p. 25, 40, 41]

[15] Frederick Heckel and G. Michael Youngblood. Contextual affordances for intelligent virtual characters. In Hannes Vilhjálmsson, Stefan Kopp, Stacy Marsella, and Kristinn Thórisson, editors, Intelligent Virtual Agents, volume 6895 of Lecture Notes in Computer Science, pages 202–208. Springer Berlin / Heidelberg, 2011. 10.1007/978-3-642-23974-8_22. [cited at p. 36]

[16] Pieter Jorissen and Wim Lamotte. A framework supporting general object interactions for dynamic virtual worlds. In Andreas Butz, Antonio Krüger, and Patrick Olivier, editors, Smart Graphics, volume 3031 of Lecture Notes in Computer Science, pages 154–158. Springer Berlin / Heidelberg, 2004. 10.1007/978-3-540-24678-7_17. [cited at p. 61]

[17] Rieko Kadobayashi, Kazushi Nishimoto, and Kenji Mase. Design and evaluation of gesture interface of an immersive walk-through application for exploring cyberspace. In Automatic Face and Gesture Recognition, 1998. Proceedings. Third IEEE International Conference on, pages 534–539, Apr 1998. [cited at p. 7, 8, 14, 16, 17, 18, 19, 22, 24, 51]

[18] Marcelo Kallmann. Object Interaction in Real-Time Virtual Environments. PhD thesis, École Polytechnique Fédérale de Lausanne, Jan 2001. [cited at p. 7, 8, 15, 22, 36, 37, 39, 61, 87]

[19] Marcelo Kallmann and Daniel Thalmann. Direct 3d interaction with smart objects. In Proceedings of the ACM symposium on Virtual reality software and technology, VRST ’99, pages 124–130, New York, NY, USA, 1999. ACM. [cited at p. 7, 8, 15, 22, 36, 37, 39, 61, 87]

[20] Felix Kistler. Fubi - full body interaction framework. http://www.informatik.uni-augsburg.de/lehrstuehle/hcm/projects/tools/fubi/. [cited at p. 44, 61, 62, 83]

[21] Felix Kistler, Dominik Sollfrank, Nikolaus Bee, and Elisabeth André. Full body gestures enhancing a game book for interactive story telling. In Mei Si, David Thue, Elisabeth André, James Lester, Joshua Tanenbaum, and Veronica Zammitto, editors, Interactive Storytelling, volume 7069 of Lecture Notes in Computer Science, pages 207–218. Springer Berlin / Heidelberg, 2011. 10.1007/978-3-642-25289-1_23. [cited at p. 8, 24, 25, 27, 28, 31, 37, 39, 41, 61, 67]

[22] Felix Kistler, Michael Wißner, and Elisabeth André. Level of detail based behavior control for virtual characters. In Jan Allbeck, Norman Badler, Timothy Bickmore, Catherine Pelachaud, and Alla Safonova, editors, Intelligent Virtual Agents, volume 6356 of Lecture Notes in Computer Science, pages 118–124. Springer Berlin / Heidelberg, 2010. 10.1007/978-3-642-15892-6_12. [cited at p. 36]

[23] Ekaterina Kurdyukova, Elisabeth André, and Karin Leichtenstern. Introducing multiple interaction devices to interactive storytelling: Experiences from practice. In Ido Iurgel, Nelson Zagalo, and Paolo Petta, editors, Interactive Storytelling, volume 5915 of Lecture Notes in Computer Science, pages 134–139. Springer Berlin / Heidelberg, 2009. 10.1007/978-3-642-10643-9_18. [cited at p. 6, 7, 15, 16, 18, 22, 27, 28, 29, 30, 31, 32, 53, 68]

[24] Joseph J. LaViola, Jr., Daniel Acevedo Feliz, Daniel F. Keefe, and Robert C. Zeleznik. Hands-free multi-scale navigation in virtual environments. In Proceedings of the 2001 symposium on Interactive 3D graphics, I3D ’01, pages 9–15, New York, NY, USA, 2001. ACM. [cited at p. 7, 16, 17, 18, 19, 22, 23, 50]

[25] Corey Manders, Farzam Farbiz, Tang Ka Yin, Yuan Miaolong, and Chua Gim Guan. A gesture control system for intuitive 3d interaction with virtual objects. Computer Animation and Virtual Worlds, 21(2):117–129, 2010. [cited at p. 24, 37]

[26] Microsoft. Common commands in speech recognition. http://windows.microsoft.com/en-US/windows-vista/Common-commands-in-Speech-Recognition. [cited at p. 1]

[27] Microsoft. Msdn library - microsoft speech platform. http://msdn.microsoft.com/en-us/library/hh361572.aspx. [cited at p. 7, 44, 61, 64, 83, 87]

[28] Microsoft. Xbox support - get started with kinect. http://www.xbox.com/en-US/Kinect/GetStarted. [cited at p. 1, 24]

[29] Microsoft. Xbox support - kinect gestures. http://support.xbox.com/en-US/xbox-360/kinect/body-controller. [cited at p. 27]

[30] Donald A. Norman. Affordance, conventions, and design. interactions, 6(3):38–43, May 1999. [cited at p. 11, 13, 27]

[31] Donald A. Norman. The way i see it: Signifiers, not affordances. interactions, 15(6):18–19, Nov 2008. [cited at p. 11]

[32] Donald A. Norman. Natural user interfaces are not natural. interactions, 17(3):6–10, May 2010. [cited at p. 1, 8, 10, 12, 30, 31, 39]

[33] Donald A. Norman. Yet another technology cusp: confusion, vendor wars, and opportunities. Commun. ACM, 55(2):30–32, Feb 2012. [cited at p. 1, 9, 12]

[34] Donald A. Norman and Jakob Nielsen. Gestural interfaces: a step backward in usability. interactions, 17(5):46–49, Sep 2010. [cited at p. 1, 9, 11, 12]

[35] University of Augsburg. Horde3d gameengine. http://mm-werkstatt.informatik.uni-augsburg.de/projects/GameEngine/doku.php. [cited at p. 44, 60, 84]

[36] Sharon Oviatt. Ten myths of multimodal interaction. Commun. ACM, 42(11):74–81, Nov 1999. [cited at p. 8, 9, 19, 26, 68]

[37] Sharon Oviatt, Rachel Coulston, and Rebecca Lunsford. When do we interact multimodally?: cognitive load and multimodal communication patterns. In Proceedings of the 6th international conference on Multimodal interfaces, ICMI '04, pages 129–136, New York, NY, USA, 2004. ACM. [cited at p. 9, 19, 24, 29, 38, 44, 68]

[38] Big Park. Kinect Joy Ride (Demo). Microsoft Game Studios, Nov 2010. [cited at p. 22, 40]

[39] Rare. Kinect Sports (Demo). Microsoft Game Studios, Nov 2010. [cited at p. 22, 25, 27, 39, 40, 41, 44, 58]

[40] Siddharth Rautaray, Anand Kumar, and Anupam Agrawal. Human computer interaction with hand gestures in virtual environment. In Malay Kundu, Sushmita Mitra, Debasis Mazumdar, and Sankar Pal, editors, Perception and Machine Intelligence, volume 7143 of Lecture Notes in Computer Science, pages 106–113. Springer Berlin / Heidelberg, 2012. 10.1007/978-3-642-27387-2_14. [cited at p. 30]

[41] B. Schuller, F. Althoff, G. McGlaun, and M. Lang. Navigation in virtual worlds via natural speech. In Proceedings of the 9th International Conference on Human-Computer Interaction (HCI International 2001), Poster Sessions: Abridged Proceedings, pages 19–21, New Orleans, Louisiana, USA, Aug 2001. Lawrence Erlbaum Ass., New Jersey, 2001. [cited at p. 12, 20, 55]

[42] Good Science Studio. Kinect Adventures! Microsoft Game Studios, Nov 2010. [cited at p. 16, 21, 22, 25, 27, 41, 58]

[43] . Sonic Free Riders (Demo). , Nov 2010. [cited at p. 21, 22]

[44] Ubisoft. Your Shape: Fitness Evolved (Demo). Ubisoft, Nov 2010. [cited at p. 25, 41]

[45] Jasper Uijlings. Designing a virtual environment for story generation. Master's thesis, University of Amsterdam, Jun 2006. [cited at p. 33, 35, 47, 49, 88]

[46] Ielka van der Sluis and Emiel Krahmer. Generating multimodal references. Discourse Processes, 44(3):145–174, 2007. [cited at p. 8, 9, 23, 24, 25, 26, 27, 68, 75, 88]

Appendices


Appendix A

Questionnaire

A.1 German

1. Allgemeine Fragen

1.1 Alter:
1.2 Geschlecht: O männlich O weiblich
1.3 Beruf/Studiengang:
1.4 Rechts- oder Linkshänder? O rechts O links
1.5 Wie oft haben Sie die folgenden Technologien bereits genutzt?
    (nie / 1-10x / >10x / regelmäßig)
    Sprachsteuerung (z.B. Diktieranwendung)
    Gestensteuerung (z.B. Wii, Kinect)

2. Fragen zur Aufzeichnung

2.1 Sind Sie damit einverstanden, wenn Ihre Teilnahme zur genaueren Auswertung als Video aufgezeichnet wird? O ja O nein
2.2 Falls ja, sind Sie damit einverstanden, wenn Teile dieses Videos veröffentlicht werden? O ja O nein

Unterschrift:


3. Studie

Nach einer kurzen Einweisung werden Sie ein Szenario durchspielen, das sich sowohl über Sprache als auch über Gestik steuern lässt. Für jede Aktion können Sie frei wählen, welche der beiden Eingabemöglichkeiten Sie verwenden. Die Aktionen teilen sich dabei in vier Bereiche auf: Navigation, Selektion, Dialog und Objekt-Manipulation.

4. Fragen zu den vier Aufgabenbereichen

4.1 Navigation (Fortbewegung durch die Welt, Änderung der Ansicht)
4.1.1 Welche Eingabemethode bevorzugen Sie für diese Aufgabe?
    Sprache O   Gestik O   egal O
4.1.2 Wie sehr stimmen Sie den folgenden Aussagen zu?
a) Es war schwierig, die Kommandos für die gewünschte Aktion zu erkennen bzw. sich an sie zu erinnern.

    st. nicht zu — neutral — stimme zu
    Sprache O O O O O
    Gestik O O O O O
b) Die Kommandos für diese Aktionen fühlten sich natürlich an. (die gleiche Skala)
c) Es war ermüdend, die Kommandos zu geben. (die gleiche Skala)
d) Die Erkennung funktionierte zuverlässig. (die gleiche Skala)
4.1.3 Haben Sie weitere Anmerkungen zu dieser Aufgabe?
4.2 Selektion (Auswahl von Charakteren und Objekten) (die gleichen Fragen)
4.3 Dialog mit virtuellen Charakteren (die gleichen Fragen)
4.4 Manipulation von Objekten (die gleichen Fragen)

A.2 Translated to English

1. General Questions

1.1 Age:
1.2 Gender: O male O female
1.3 Profession/Course of Studies:
1.4 Right- or left-handed? O right O left
1.5 How many times have you used the following technologies so far?
    (never / 1-10x / >10x / regularly)
    speech control (e.g. dictation software)
    gesture control (e.g. Wii, Kinect)

2. Questions about the recording

2.1 Do you agree with your session being recorded on video for a more detailed analysis? O yes O no
2.2 If so, do you agree with parts of this video being published? O yes O no

Signature:

3. Study

After a short introduction you will be playing a short scenario which can be controlled by speech as well as gestures. For each action you can freely choose which input method you want to use. The actions are divided into four categories: navigation, selection, dialogue and object manipulation.

4. Questions about the four tasks

4.1 Navigation through the world, changing the perspective
4.1.1 Which input method do you prefer for this task?
    speech O   gesture O   undecided O

4.1.2 How strongly do you agree with the following statements? a) It was difficult to recognize or remember the commands for the desired action.

    disagree — neutral — agree
    speech O O O O O
    gesture O O O O O
b) The commands for these actions felt natural. (the same scale)
c) It was tiring to give the commands. (the same scale)
d) The recognition worked reliably. (the same scale)
4.1.3 Do you have any more comments about this task?
4.2 Selection of characters and objects (the same questions)
4.3 Dialogue with virtual characters (the same questions)
4.4 Manipulation of objects (the same questions)

List of Figures

2.1 Navigation command mappings for the three gesture control schemes...... 16

3.1 Left: The arrow icon, showing the neutral position (blue) and forward movement (green). Right: The brackets displayed during movement (gray: out of reach, blue: in reach)...... 57
3.2 Upper left: An object which is out of reach. Lower left: An object in reach which is being selected. Right: The actions available for the selected object...... 57
3.3 Upper image: Available dialogue actions with highlighted keywords and the cursor filling up for selection. Lower image: The recognized sentence with the cursor gradually returning to white...... 59
3.4 Left: The images from the animation for turning a lever counterclockwise. Right: The depth image of the user with the recognized skeleton...... 60
3.5 Preferred modality for each task as stated in the questionnaire...... 71
3.6 Average use of both modalities for each task, calculated from the successful inputs in the log files...... 71
3.7 Left: Logged navigation inputs per subject. Right: Average agreement ratings for navigation. Significance: *p ≤ 0.05...... 72
3.8 Left: Logged selection inputs per subject. Right: Average agreement ratings for selection...... 74
3.9 Left: Logged dialogue inputs per subject. Right: Average agreement ratings for dialogue. Significance: **p ≤ 0.01, ***p ≤ 0.001...... 76
3.10 Left: Logged manipulation inputs per subject. Right: Average agreement ratings for manipulation. Significance: **p ≤ 0.01, ***p ≤ 0.001...... 78


3.11 Average agreement ratings for manipulation, separated according to the stated preferences. Left: Speech users only. Right: Gesture users only. Significance: *p ≤ 0.05, **p ≤ 0.01...... 81

List of Tables

3.1 The interactions in the virtual world...... 46
3.2 The available effect types for smart object actions...... 62
3.3 T-test results for navigation. Significance: *p ≤ 0.05, ***p ≤ 0.001...... 72
3.4 T-test results for selection...... 74
3.5 T-test results for dialogue. Significance: **p ≤ 0.01, ***p ≤ 0.001...... 76
3.6 T-test results for manipulation. Significance: **p ≤ 0.01, ***p ≤ 0.001...... 78
3.7 T-test results for manipulation, separated according to the stated preference. Significance: *p ≤ 0.05, **p ≤ 0.01, ***p ≤ 0.001...... 80
