University of Augsburg
Faculty of Applied Computer Science
Department of Computer Science
Master's Program in Computer Science

Master's Thesis

Speech and Gesture Input for Different Interaction Tasks in a Virtual World

submitted by Kathrin Janowski on 09.11.2012

Supervisor: Prof. Dr. Elisabeth André
Adviser: Dipl.-Inf. Felix Kistler

Abstract

In recent years, the technologies for gesture and speech input have advanced far enough to become part of consumer products such as video game consoles or smartphones. Speech and gesture are the main mechanisms by which humans interact with the real world, which also makes them attractive for controlling virtual world applications for both training and entertainment. However, virtual worlds fully controlled by both are still rare, and the existing systems tend to use only one of them or to simplify the world model so that only certain types of interaction are possible.

This thesis examines the use of speech and gestures for four distinct categories of interaction which are common to most simulated worlds: navigation in a three-dimensional scene, selection of interactive entities, dialogue with virtual characters, and manipulation of virtual objects. For each of these tasks, existing approaches are summarized. In order to determine which modality suits each task best, an application was then developed for interacting with a virtual environment, relying on the Kinect sensor for full-body gestures and a keyword spotting approach for speech recognition. Finally, a user study was conducted with this system. The results showed that a walking metaphor was preferred for navigation, whereas for dialogue, speech was preferred over choosing sentences by pointing. No clear preferences emerged for selection and manipulation, but the latter category revealed two distinct personas which should be examined more closely.

Acknowledgments

I would like to thank the following people for various forms of help.

Felix Kistler, for guiding me through the whole process, for all the help with many different aspects, and, most importantly, for taking care of problems in the FUBI recognition that kept haunting me, such as the persistent "lost user" bug which was tracked down just in time for the user study.

Professor Dr. André, for pointing me towards the works of Norman, Chai and Krahmer, for helping me spot the flaws in the experimental procedure, and for reminding me that a chaotic test run was not the end of the world.

Gregor Mehlmann, for discussing the multimodality aspect with me and for pointing me towards relevant Oviatt papers.

Johannes Wagner, for letting me look at the speech recognition he was working with, although that one did not meet the requirements I had.

All those people at the Chair of Human Centered Multimedia who took part in the user study, offered helpful feedback or otherwise showed interest in my work and the results.

My music teachers and sword trainer, for helping me escape now and then, for understanding that those hobbies tended to get left behind, and for listening to all that stuff completely unrelated to them.
My family, for letting me vent my frustrations, for getting infected with my enthusiasm, and for helping me get this far by providing me with all sorts of support and resources.

Statement and Declaration of Consent

Statement

I hereby confirm that this thesis is my own work and that I have documented all sources used.

Kathrin Janowski
Augsburg, 09.11.2012

Declaration of Consent

I herewith agree that my thesis will be made available through the library of the Computer Science Department.

Kathrin Janowski
Augsburg, 09.11.2012

Contents

1 Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Outline
2 State of the Art
  2.1 Overview
  2.2 General Concepts
    2.2.1 Input Technologies
    2.2.2 Multimodality
    2.2.3 Usability Principles
  2.3 Interaction Tasks
    2.3.1 Navigation
    2.3.2 Selection
    2.3.3 Dialogue
    2.3.4 Object Manipulation
3 Practical Work
  3.1 Overview
  3.2 Application
    3.2.1 Available Interactions
    3.2.2 Input
    3.2.3 Visualization
    3.2.4 Implementation
  3.3 User Study
    3.3.1 Objectives
    3.3.2 Experimental Setting
    3.3.3 Results and Discussion
4 Conclusion and Future Work
  4.1 Conclusion
  4.2 Future Work
    4.2.1 Analysis of the Video Data
    4.2.2 Role Playing versus Efficiency
    4.2.3 Direct Gesture Mapping
    4.2.4 Expansion of the Speech Recognition
    4.2.5 Multimodal Commands
    4.2.6 Expansion of the Smart Object Infrastructure
Bibliography
A Questionnaire
  A.1 German
  A.2 Translated to English
List of Figures
List of Tables

Chapter 1

Introduction

1.1 Motivation

In recent years, natural input technologies have gradually become available to the average user. The three major manufacturers of video game consoles all provide motion recognition by now: the Nintendo Wii, which is based around gestural input, the PlayStation Move controller for Sony's PlayStation 3, and the Kinect sensor for the Microsoft Xbox 360 [28].

Another important modality, speech recognition, has also been around for some time, usually taking the form of dictation or language trainer software. This functionality is included in Windows Vista and Windows 7 as part of their accessibility options [26] and is also becoming an alternative input method for smartphones. The most famous example on these devices is the "Siri" assistant on Apple's iPhone [1], while the Samsung Galaxy S III, which was released this year, features a similar interface called "S Voice" on the Android platform.
Together with the graphical capabilities of current computers and consoles, the sophisticated virtual worlds found in modern video games, increasingly lifelike virtual characters and, finally, stereoscopic technology being integrated into consumer television sets and handheld devices, it seems as if the vision of a Star Trek holodeck for both entertainment and serious simulations is about to become reality in the near future.

However, speaking of reality, these input technologies are still fairly new, at least when it comes to widespread use. In several articles dating from 2010 [32, 34] and 2012 [33], Donald Norman observed that, although these technologies have existed for quite a while, they only recently matured enough to be released to the public, and therefore many aspects of their usage are still under development. Different companies are developing different guidelines for these input methods while end users are only now getting used to them, and applications which use both modalities to their full potential are rare.

This can easily be confirmed by browsing the shelves of electronics stores. Searching for gesture-controlled games brings up mostly sports and dancing applications aimed at casual players, or fitness applications meant to motivate exercising. While these examples are heavily based on natural movement, only few offer something resembling a complex interactive world. Such worlds are still mostly found in traditional games designed for mouse, keyboard or regular game controllers with buttons and joysticks. Speech support, on the other hand, is slowly becoming part of those more sophisticated titles. For example, various Xbox 360 games marked with the "Better with Kinect" label allow the player to speak certain commands for a higher degree of immersion. However, the phrasing "better with" also indicates that the Kinect functionality is optional rather than required, while the regular controller still serves as the primary input method.

But the potential is already there, only spread out across different applications and research areas. Various approaches exist for the key elements of an application fully controlled by speech and gestures, so it is time to take a closer look at the available findings and gather them in one place.

1.2 Objectives

This thesis examines the application of two natural input technologies, namely gesture and speech recognition, to interactive virtual worlds. It focuses on the different demands of four distinct interaction tasks which are common to such environments: navigation, selection, dialogue with non-player characters (NPCs) and manipulation of objects. For these tasks, the following questions will be dealt with.

World Representation: In what way is the setting or representation of the world connected to possible inputs?

Input: What kind of input metaphors are