Ulm University | 89069 Ulm | Germany
Faculty of Engineering, Computer Sciences and Psychology
Institute of Media Informatics

Multimodal Input Fusion for Companion Technology

A thesis submitted to attain the degree of Dr. rer. nat. of the Faculty of Engineering, Computer Sciences and Psychology at Ulm University

Author: Felix Schüssel from Erlangen

2017 Version of November 28, 2017

© 2017 Felix Schüssel

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/de/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.

Typeset: PDF-LaTeX 2ε
Date of Graduation: November 28, 2017
Dean: Prof. Dr. Frank Kargl
Advisor: Prof. Dr.-Ing. Michael Weber
Advisor: Prof. Dr. Dr.-Ing. Wolfgang Minker

A Note on Copyrighted Material:

[citation] For added clarity, parts adopted verbatim from the author's own work published elsewhere are emphasized with markings in the page margins, as exemplified on the left. These markings are made generously to minimize impacts on layout and readability. In consequence, they may occasionally contain information that obviously does not stem from the original publication and is not exempted from the marking (headings, references within this thesis, etc.).

The above applies to parts of this thesis that have already been published in the following articles, given by bibliography reference and copyright notice as mandated by the respective publisher:

[144]: Schüssel, F., Honold, F., and Weber, M. Using the Transferable Belief Model for Multimodal Input Fusion in Companion Systems. In Multimodal Pattern Recognition of Social Signals in Human-Computer-Interaction, vol. 7742 of Lecture Notes in Computer Science, 2013, pp. 100–115. © Springer-Verlag Berlin Heidelberg 2013, with permission of Springer.

[142]: Schüssel, F., Honold, F., and Weber, M. Influencing factors on multimodal interaction during selection tasks. Journal on Multimodal User Interfaces, vol. 7, no. 4, 2012, pp. 299–310. © OpenInterface Association 2012, with permission of Springer.

[59]: © 2014 IEEE. Reprinted, with permission, from Honold, F., Schüssel, F., and Weber, M. The Automated Interplay of Multimodal Fission and Fusion in Adaptive HCI. In IE’14: Proceedings of the 10th International Conference on Intelligent Environments, July 2014.

[140]: Schüssel, F., Honold, F., Schmidt, M., Bubalo, N., Huckauf, A., and Weber, M. Multimodal Interaction History and its use in Error Detection and Recovery. In Proceedings of the 16th ACM International Conference on Multimodal Interaction, ICMI ’14, ACM (New York, NY, USA, 2014), pp. 164–171. © 2014 Association for Computing Machinery, Inc. Reprinted by permission.

[139]: Schüssel, F., Honold, F., Bubalo, N., and Weber, M. Increasing Robustness of Multimodal Interaction via Individual Interaction Histories. In Proceedings of the Workshop on Multimodal Analyses Enabling Artificial Agents in Human-Machine Interaction, MA3HMI ’16, ACM (New York, NY, USA, 2016), pp. 56–63. © 2016 Association for Computing Machinery, Inc. Reprinted by permission.

Abstract

Today’s computer systems provide a vast variety of functionalities but at the same time lack an intelligent and natural user interface. The idea of companion technology is to complement a system with an intelligent interface that adapts itself to the individual user and the context of use. One important task of the interface is to take the inputs of all available devices and determine what the user has intended to do. This process is called multimodal input fusion. Being a topic of research for over 30 years now, it has still not achieved broad deployment. The question is: What are the known approaches lacking, especially regarding their use as companion technology?

From the state of the art, two major weaknesses are identified: the insufficient handling of uncertainties, given that a variety of sensors provide inputs, and the lack of adaptation to the individual user to reduce error rates and increase the overall robustness. This work’s contribution is a fully implemented multimodal input fusion that employs evidential reasoning and individual user adaptation to provide a comprehensive and robust decision on all available inputs. It is not about progressing specific input technologies and devices, but about making the most of what they provide on a generic level.

Based on an existing idea of applying evidential reasoning, an input fusion is developed that provides functionalities of combination, disambiguation, reinforcement and conflict detection. A theoretical evaluation proves the benefits of the formalism compared to the usually employed n-best lists. Additional benefits arise when the realized component is embedded into a companion system, closely cooperating with a multimodal fission component in charge of arbitrating the available output modalities. Two perspectives are taken regarding individual user adaptation. First, the user’s decision to adopt a specific input modality is investigated. A user study identifies previously unknown predictors, such as personality traits and technology affinity. Though useful in system design, the complexity of influencing factors hinders their direct application for input fusion. Taking a system perspective, however, temporal data provided by input sensors is known to bear important information, and research has promised its usefulness for some time. Nevertheless, no practical application has been presented to date. In this work, an original approach to individual user adaptation is developed based on the notion of temporal interaction histories, to complement and extend the input fusion with individually adaptive error recognition capabilities. An evaluation study proves a significant decrease in error rates compared to previous ideas suggested in literature.

As a comprehensive concept of individually adaptive input fusion, the presented work can be used not only within companion systems, but in any type of system as a technology for realizing robust multimodal inputs. Thus, it could be the basis for multimodal interfaces to reach a maturity at which they in fact advance HCI for end users.

Contents

Introduction 1

1 Fundamentals 5
1.1 Companion Technology 5
1.1.1 Planning & Reasoning 7
1.1.2 Knowledge Base 7
1.1.3 Perception & Recognition 8
1.1.4 Dialog & Interaction 9
1.2 Multimodal Interaction 9
1.2.1 The Human Action Cycle and the Stormy Tree 10
1.2.2 The Term Modality 13
1.2.3 Conceptual Models of Multimodal Interaction 21
1.2.4 Benefits and Misconceptions of Multimodal Interaction 23
1.3 Introduction to Input Fusion 29
1.3.1 The Term Fusion 30
1.3.2 Types of Multimodal Input Fusion 31
1.4 Summary 32

2 State of the Art 35
2.1 Multimodal Input Fusion Approaches 35
2.1.1 Procedural Approaches 36
2.1.2 Frame-Based 39
2.1.3 Unification-Based 43
2.1.4 Hybrid and Statistical 48
2.1.5 Classification and Summary 52
2.2 Description Languages for Input Fusion Engines 56
2.2.1 EMMA 56
2.2.2 MIML 57


2.2.3 SMUIML 58
2.2.4 XISL 60
2.2.5 XMMVR 63
2.2.6 Summary on Description Languages 64
2.3 W3C Standards for Multimodal Systems 66
2.3.1 Multimodal Interaction Framework (MMIF) 66
2.3.2 Multimodal Architecture and Interfaces (MMI Arch) 68
2.4 Summary 70

3 Goals and Requirements 73
3.1 Functional Requirements 74
3.1.1 Combination 74
3.1.2 Disambiguation 74
3.1.3 Reinforcement 74
3.1.4 Conflict Detection 75
3.2 Non-functional Requirements 75
3.2.1 Sound Handling of Uncertainty 75
3.2.2 Account for User Individuality 76
3.2.3 Support Synchronization of Inputs 76
3.2.4 Focus on Logical Interaction 76
3.2.5 Facilitate Modularity 77

4 Fusion of User Inputs Using the Transferable Belief Model 79
4.1 N-best Lists, Bayesian Inference or Dempster-Shafer 79
4.2 Dempster-Shafer Theory and the Transferable Belief Model 81
4.2.1 Basic Notions 82
4.2.2 Combination of Evidence 83
4.2.3 Decision Making 84
4.3 Input Fusion With the Transferable Belief Model 85
4.3.1 Modification of the Rule of Combination 85
4.3.2 Theoretical Evaluation 87
4.4 Fusion Specification via Graphs in XML 92
4.4.1 The Frame of Discernment as Undirected Graph 92
4.4.2 Event Data and Semantic Combination 94


4.5 Runtime Integration 97
4.5.1 The Multimodal Input Fusion Component 97
4.5.2 Cooperation with Multimodal Fission in a Companion System 103
4.6 Summary and Comparison to the State of the Art 114

5 Individual User Adaptation 119
5.1 User-Sided Preferences for Modality Use 119
5.1.1 State of Knowledge 120
5.1.2 User Study on Selection Tasks 126
5.1.3 User Specific Factors 131
5.1.4 System Specific Factors 136
5.1.5 Discussion 138
5.2 System-Sided Adaptation 139
5.2.1 State of Knowledge 139
5.2.2 User Study on Interaction Histories 142
5.2.3 Integration Patterns and Individual Behavior 144
5.2.4 Error Types of Multimodal Input 147
5.2.5 Adapting to Individual Histories 149
5.2.6 Evaluation 153
5.2.7 Discussion 163
5.3 Summary 165

6 Conclusion and Outlook 167
6.1 Contributions and Limitations 167
6.2 Further Research 171
6.3 Concluding Remarks 173

Appendix 175
A.1 XML Schema Definitions 175
A.2 Additional Figures 179

Acronyms 181

Bibliography 185


Introduction

Today’s computer systems provide a vast variety of functionalities and have become an important part of everyday life, be it for private use or in the context of work. People interact with mobile and stationary systems in various contexts and perform complex tasks with multiple devices for input and output. The functionalities offered by systems are often not only complex, but sometimes even intelligent. Today, functionality is not the problem; the problem lies with the interface to this functionality. All too often users are overwhelmed by the complexity presented by the interface and need a great amount of time to familiarize themselves with the many options and menus necessary to perform a task, often resulting in failure and frustration. The possibilities to interact are still restricted to a few devices and lack any form of intelligence, rendering them anything but natural when compared to interactions between humans. The idea of companion technology is to complement a system’s functional intelligence – i. e. the task that is performed by the system – with an equally intelligent interface that adapts itself to the individual user and the context of use. The rationale is that the best system is useless when the user cannot use it, or in other words, when it is unable to understand the user. A concept that promises greater naturalness and expressiveness is that of multimodal interaction, where multiple modalities are used in a complementary (or synergistic) way for input and output alike. That is why enabling and facilitating multimodal interaction is a central goal of companion technology.

Focusing on multimodal user input, the so-called input fusion plays an essential role in understanding what the user is doing, as its task is to take the inputs of all available devices and determine what the user has intended to do. Fusion of user inputs has quite a history in research, from early ideas in the 1980s up to a climax of research activities in the early 2000s that gave birth to a broad range of input fusion approaches. Despite all these approaches, real multimodal input is still rarely found in end-user systems. Most of them content themselves with offering different ways to perform the same inputs. In this respect, input fusion has not reached the level of maturity that is sometimes claimed in literature (cf. [78]). While there are emerging technologies that enable a more comprehensive view on what the user is doing, such as vision- and speech-based technologies, they still lack the robustness expected by users and require a fair amount of training until the user has adapted to the system, whereas it should be the other way around.


The question is: What are the known approaches lacking, especially regarding their use as companion technology? Or in other words, can we design an input fusion approach worthy of the name companion technology, laying the foundations for multimodal inputs being deployed in end-user systems someday?

The methodical approach of this work is as follows. Based on a comprehensive review of the research activities on multimodal input fusion of the past 30 years, potential weaknesses in regard to companion technology and ideas worth adopting are identified. Most importantly, this includes a sound handling of uncertainty and an adaptation to the individual user. On that basis, first a new approach to input fusion is conceptualized based on existing ideas on evidential reasoning, and then implemented and evaluated theoretically. It is then complemented and extended by an original approach to individual user adaptation that significantly increases the overall robustness of multimodal input fusion. This is finally evaluated in a realistic user study.

In the present thesis, these activities are preceded by a chapter on the fundamentals that establish a basis for the later work, including a new definition of the term modality. The state of the art chapter then presents the comprehensive literature review, also serving as a source of reference. The conclusions drawn from the state of the art are then formulated as generic goals of uncertainty handling and individual user adaptation in Chapter 3, and transformed into well-defined functional and non-functional requirements to be fulfilled. Based on these, Chapter 4 presents the approach to input fusion as well as the underlying rationales. In addition, the chapter elaborates on the embedding of the approach in a prototypical companion system. A comparison to the state of the art then concludes the chapter. Chapter 5 then focusses on the second goal of this work, namely individual user adaptation. It not only extends the state of knowledge on factors that influence the users’ choice of input modalities with its own user study, but also presents the original approach to individual user adaptation based on the notion of temporal interaction histories. This approach, integrated into the input fusion engine, is evaluated in a final user study employing a realistic smartwatch scenario. The final chapter then summarizes and draws conclusions from the conducted work, including known problems and limitations, before presenting future areas of scholarship.

In summary, this work has two main contributions. First, a fully implemented approach to multimodal input fusion based on evidential reasoning is presented that allows a more comprehensive representation of uncertainties in the form of beliefs. Compared to the state of the art, it provides superior reinforcement and conflict detection capabilities when deciding on the user’s observed actions. Moreover, the use of GraphML with application-specific data extensions and XSLT for specifying multimodal inputs facilitates deployment and reuse in arbitrary domains. Second, a novel approach for individual user adaptation based on the notion of temporal interaction histories is presented. It utilizes temporal constraints that dynamically adapt to each individual’s behavior. This makes it possible to detect and avoid erroneous inputs during multimodal interaction, resulting in a significant reduction of error rates and increased user satisfaction compared to the state of the art. Besides these major contributions, the work comprises several aspects that additionally extend the state of knowledge on multimodal interaction. This includes a new definition of the term “modality” as so-called interaction modality, which is accompanied by a graphical notation and allows easy representation of any form of interaction with unimodal, complementary, and redundant forms of modality use. A dedicated user study provides new insights into influencing factors of modality adoption during selection tasks regarding gender, personality traits, and technology affinity, which can be useful in system design. Furthermore, an original classification of multimodal input errors is provided that can serve as a basis for assessing the robustness of systems.


1 Fundamentals

In this chapter, all relevant fundamentals that are necessary to comprehend the later chapters and contributions of this work are presented. It starts by describing companion technology as the driving force from which this work originates, showing multimodal input fusion as a fundamental part and locating it within the broader context of companion research. Then it is shown how the superordinate concept of multimodal interaction is derived from generic principles of human action, leading to a generic model of multimodal interaction called the Stormy Tree. Before getting more concrete, it is important to establish a clear understanding of the basic term modality. Since understandings vary considerably throughout the literature, known definitions of modality (and multimodality) are critically reviewed, leading to a new definition that is used for the work at hand. Then, with CARE and CASE, more concrete conceptual models of multimodal interaction that are often referred to by other researchers are presented. In order to complete the section on multimodal interaction and to put it into perspective, it is shown what benefits can arise from applying it, but also what misconceptions exist. Being an essential part of multimodal interaction, input fusion is then introduced, starting from a historical perspective and making necessary differentiations of the term fusion. Finally, today’s common categorization of approaches to multimodal input fusion, which classifies them by their underlying principles of processing, is presented.

1.1 Companion Technology

The Transregional Collaborative Research Center SFB/TRR 62, from which this work originates, is an interdisciplinary consortium of computer scientists, engineers, physicians, neuroscientists and psychologists, who are involved with the systematic exploration of cognitive abilities and their integration into technical systems (so-called companion technology). The vision that guides the consortium is as follows:

“Technical systems of the future are Companion-systems – cognitive technical systems, with their functionality completely individually adapted to each user: They are geared to his abilities, preferences, requirements and current needs, and they reflect his situation and emotional state. They are always available, cooperative and trustworthy, and interact with their users as competent and cooperative service partners.” [173]

While present-day systems possess a growing functional intelligence – in terms of the provided services – they lack this intelligence when providing their functionality to the user [18]. The resulting gap is to be bridged with companion technology, by complementing and integrating the functional intelligence with an equivalent intelligence in interacting with the user. Note that companion systems and the underlying technology are not to be confused with what is sometimes understood by the notion of companions as conversational software agents, which accompany their owner over a (life-)long period (cf. [174]), or even robot companions as mostly autonomous, embodied systems (cf. [84]). Companion systems in the context of this work are more of a superordinate concept or generalization of these very specific notions. As defined in [18], there are several characteristics that are exhibited by a companion system, namely competence, individuality, adaptability, availability, cooperativeness, and trustworthiness. These characteristics result from a well-orchestrated interplay of advanced perception, planning, reasoning, and interaction capabilities. Figure 1.1 illustrates this vision and all involved partitions. In the following sections, a summary of [18] on the basic idea behind companion systems and the roles of the involved components is given, together with references to further reading on the current state of research.

Figure 1.1: Interaction with a companion system (adapted with permission of Springer Science and Business Media B.V., from [18]; permission conveyed through Copyright Clearance Center, Inc.).


1.1.1 Planning & Reasoning

While interacting with a companion system, users are pursuing a goal – they want to solve a problem and place certain needs on the functionality of the system. The resulting mental model of the user has individual preconceptions and expectations of the system’s functionality. This mental model determines the behavioral strategy the user will follow to achieve the goal. On the system side, objectives and problems – either of the system or the user – are expressed as formal specifications that are automatically transformed into plans of action by the planning & reasoning components (cf. Figure 1.1). These drive the system and suggest courses of action to the user. This user-centered planning is based on a hybrid planning approach [17] that combines key principles humans also rely on when making plans: stepwise refining complex tasks into executable courses of action and considering causal relationships between actions [16]. Since planning for humans may not always result in a completely ordered sequence of actions (sometimes the order of actions may seem arbitrary), the planning component has to select a suitable sequence for step-wise presentation that meets the user’s requirements and expectations. This is done by finding user-friendly linearizations as described in [54]. Once such a sequence of actions is derived, the execution is monitored by a plan execution component that is able to detect errors, i. e. deviations from the expected execution outcomes. In case a deviation is detected, a plan repair mechanism is activated that incorporates the execution error [14]. In particular after such execution errors, the user may wonder about the new order of actions. To obtain transparency and to increase the user’s trust in the system, the new behavior of the system must be explained. Therefore, the purpose of any action within a plan may be automatically explained to the user in natural language [145].
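To give an intuition of the stepwise-refinement principle mentioned above (and only of that principle; the cited hybrid planning approach [17] is considerably more elaborate and additionally handles causal links and plan repair), the following minimal sketch decomposes an abstract task into a sequence of primitive actions. All task names and the decomposition table are hypothetical.

```python
# Minimal illustration of stepwise task refinement; NOT the hybrid planner of [17].
# Task names and decomposition methods are purely hypothetical.

# Decomposition methods: abstract task -> ordered list of subtasks.
METHODS = {
    "setup_home_theater": ["connect_devices", "configure_audio"],
    "connect_devices": ["plug_in_hdmi", "plug_in_power"],
    "configure_audio": ["select_audio_source", "run_speaker_test"],
}

def refine(task: str) -> list[str]:
    """Recursively refine an abstract task into a sequence of primitive actions."""
    if task not in METHODS:              # primitive action: presentable as one step
        return [task]
    plan: list[str] = []
    for subtask in METHODS[task]:        # refinement keeps the given order
        plan.extend(refine(subtask))
    return plan

print(refine("setup_home_theater"))
# ['plug_in_hdmi', 'plug_in_power', 'select_audio_source', 'run_speaker_test']
```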

1.1.2 Knowledge Base

The whole system is constituted as a knowledge-based system (cf. Figure 1.1), where all components have access to a complex knowledge model comprising knowledge of the technical functions of the system, the user and current environmental conditions. A large portion of the information required within such a system is not observable directly, but has to be inferred from background knowledge and sensory data – sometimes passing different abstraction layers, and often resulting in uncertain predictions. On an abstract level, a companion system is a Cognitive Technical System (CTS) that controls a set of effectors based on observations made through a set of sensors. So the basic task of a CTS is to solve a decision problem while maximizing desirable properties. Such properties could be the characteristics of a companion system mentioned above (competence, individuality, adaptability etc.) as well as any properties that may arise from the concrete functionality of the system. Formally, such problem solving can be described as a Markov Decision Process (MDP), which is a model for time-discrete, sequential decision problems within a fully observable, stochastic environment [9]. But in the case of companion systems, the environment is not fully observable, but instead is concealed from the system, allowing only limited observation. So an extended version of an MDP, a so-called Partially Observable Markov Decision Process (POMDP) [70], is the model behind a CTS. Since specifying and building a joint POMDP that contains the task, the environment, and the user in a single Markov process would be very difficult due to the many unknowns, usually some form of modularization is applied. Although modularization imposes constraints that may rule out the optimal solution of the problem, it provides conceptual independence and enables working on one local problem at a time, which can be key to being able to construct such a system at all [25]. In a companion system, this modularization is evident in the conceptual parts of Figure 1.1. A working prototype of such a system is described in [10], where a human user is assisted in setting up the devices of a home theater system. In summary, the knowledge base need not exist as a single component, but usually is made up of functionality from various components – primarily from the perception & recognition section – that uses observations produced by sensors to estimate the current world state. This information is then used by the other parts of a companion system to make their decisions.
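For reference (general textbook notation, not a formalization taken from the cited works), a POMDP can be written as the tuple
\[
(S, A, T, R, \Omega, O),
\]
with states $S$, actions $A$, transition model $T(s' \mid s, a)$, reward function $R(s, a)$, observations $\Omega$ and observation model $O(o \mid s', a)$. Since the true state is hidden, the system maintains a belief $b(s)$ over states, which is updated after executing action $a$ and observing $o$ as
\[
b'(s') = \eta \, O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s),
\]
where $\eta$ is a normalizing constant. Decisions are thus made on beliefs rather than on directly observed states, which is exactly the situation of a companion system described above.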

1.1.3 Perception & Recognition

Recognition of the user’s state and the environmental conditions is the task of the perception & recognition components (cf. Figure 1.1). Therein, a huge portion of research is dedicated to sensing the user’s emotional state. This includes analysis of prosodic and linguistic characteristics of speech [148], facial expressions [104], psychobiological data [155] and combining multiple sensory sources [50, 69, 77, 169], as well as generic work on human emotions in HCI [121, 156]. Beyond emotion recognition, there is also extensive work on recognizing general body movements [81] and spatial conditions of the surroundings [130]. Using background knowledge like the history of typical behaviors and emotional patterns, such information is interpreted and subsequently transferred to the ever-changing total state description of the knowledge base. In case the gathered information is relevant for the interaction between the user and the system, it may also be transferred to the dialog & interaction components in the form of inputs. Research itself is facilitated by advanced annotation and labelling tools like ikannotate [19] and ATLAS [96] that emerged from the work within the consortium.


1.1.4 Dialog & Interaction

Finally, the components of dialog & interaction build up the interface to the human user (cf. Figure 1.1). The interaction of user and system is understood as a dialog that is made up of a cumulative series of interaction steps. Here, dialog includes verbal dialogs as well as multimodal non-verbal interaction. It is the task of the dialog management (DM) to control the structure, content, and flow of the dialog between human and computer. It communicates with the application and both gathers from and provides to the user all information needed to accomplish a specific task. In a full-featured companion system the DM splits its responsibilities with the planning & reasoning components described above. Once the course of actions representing the solution for the given planning problem is executed by the plan execution component and user interaction is required, the current plan step is passed to the DM. The DM then decomposes and refines the plan step into a dialog, where the structure and the content of the dialog can be adapted to the individual user. Instead of using a rule-based dialog with rigidly predefined finite-state automatons like in [13], the current approach uses a POMDP-based decision tree to provide probabilistic augmentation of dialog steps. This means that, based on uncertain observations about the user’s state, additional information and explanation dialog steps are inserted at runtime [108, 109]. These observations usually stem from implicit data, like the user’s affective state or location, that are interpreted by the perception & recognition components.

Once a dialog step is ready to be communicated with the user, it is the task of the interaction management (IM) to finally realize this communication. In order for users to perceive the system as reliable and constantly available, all available input/output devices must be utilized in an individual and adaptive way. Within the IM this is done via two processes, namely fission and fusion. While the fission uses information of the knowledge base about the user, the surroundings, as well as available output devices to create an adaptive user interface that contains all information that has to be communicated to the user (like in [58]), the fusion has to gather information from all available input devices to robustly recognize the intended input from the user. The latter is what this thesis is all about.

1.2 Multimodal Interaction

In this section the conceptual basics for the work presented in this thesis are introduced. From a generic model of human action and more specific models for Human-Computer Interaction (HCI), through the term modality, to reference models for multimodal interaction, all relevant terms are described and, where necessary, new definitions are given. Finally, potential benefits and major pitfalls of applying multimodal interaction are investigated.

1.2.1 The Human Action Cycle and the Stormy Tree

As the human individual is the focus of companion technology, it is reasonable to start from a human perspective. Each and every action performed by a human being can be described by the so-called Human Action Cycle (HAC) as defined by Don Norman [107], which is a psychological model of the different steps that are performed by a human when interacting with the world. It consists of three stages that are built from seven steps, as depicted in Figure 1.2.


Figure 1.2: The Human Action Cycle by Don Norman (cf. [107]).

• Evaluation stage:

– perceiving the state of the world: perceive the results of previously executed actions

– interpreting the perception: interpret the perceived outcomes on the basis of the expected outcomes

– evaluation of interpretations: compare the results with what was intended by the goal

• Goal Formation Stage:

– forming a goal: based on the results of the evaluation, form a goal to be achieved next

• Execution stage:

– intention to act: from the goal follows an intention to perform a number of actions to achieve the goal


– sequence of actions: formulate a meaningful action sequence from the unordered actions

– execution of the action sequence: execute the action sequence to influence the world

Although any form of human action can be described by the HAC, it is primarily intended to support the evaluation of user interfaces. When replacing the generic “World” with a specific technical system, or more specifically with a computer, the result is not merely a human’s action anymore, but an actual inter-action of two parties. On the computer side, however, the action steps are usually defined differently to accommodate the more functional than intentional nature of a computer. A popular model for describing the functional aspects of a computer system is the Arch model originally proposed by Bass et al. [5] that is derived from a generalized form, referred to as the Slinky metamodel. According to Gram and Cockton [51], its focus lies on the transformations of information that flows between users and the interactive system. Figure 1.3 shows the Arch model using Gram and Cockton’s slightly amended terminology compared to the original and its corresponding evaluation/execution stages from the HAC. The functional components of the Arch span both directions of information flow and are defined as follows (based on [51]):

• Physical Interaction: This group of functions implements the physical interaction between the user and the computer. It deals with input and output devices and corresponds to the perceiving and execution steps of the HAC respectively. Realization often is done in a user-interface toolkit and/or a proprietary interface library.

• Logical Interaction: This group of functions mediates between physical interaction and dialog. It provides the dialog with abstract interpretations of the physical inputs and presentation-specific functions to realize the dialog’s output actions.


Figure 1.3: The Arch model of a computer system and its evaluation/execution stages (cf. [51]).


• Dialog: This group of functions controls task sequencing and mediates between domain-specific and presentation-specific functions. It corresponds roughly to the HAC’s interpretation and evaluation steps for input and action building/sequencing steps for output.

• Functional Core Adapter: This group of functions mediates between dialog and functional core by providing more generic work domain concepts. They may aggregate system data into domain-oriented structures, provide a unified interface to heterogeneous functional cores, perform semantic checks on data and trigger domain-initiated dialog tasks.

• Functional Core: This group of functions implements work domain features. They are also often called the ‘application’, but Gram and Cockton regard that term as ambiguous [51].

Although HAC and Arch do not show a one-to-one correspondence, the basic stages of evaluation for input and execution for output appear in both models. What also holds true for both models is the fact that a typical information flow passes all steps/components (performing a complete ‘U-turn’). In both models, there can be shortcuts, though. In the HAC, this happens when reflexive/intuitive human reactions occur; in the Arch, whenever system-generated feedback occurs (e. g. cursor tracking).

The Arch model as presented above does not cover the way humans interact with the system, as different ways of performing input and output are subsumed within the functional components. A model that explicitly emphasizes this aspect is that of the Stormy Tree [58] shown in Figure 1.4. It is a modified arch that illustrates multiple ways of input and output. For the input, different channels (so-called modalities, as clarified in Section 1.2.2) are gradually merged in the physical and logical interaction components to come up with a joint, abstract meaning for the dialog component. This is what is referred to as multimodal fusion in this work. The two transitions are also known as early fusion at the feature level and late fusion at the semantic level, as described in [114]. A more detailed explanation of the term fusion is given later in Section 1.3.1.

The other way around, i. e. for the output, abstract information from the dialog is split up in the logical and physical interaction components to be finally realized on different devices. This is what is referred to as multimodal fission in this work. Early fission subsumes tasks for information partitioning on an abstract level, referred to as “content selection and structuring” [48] or “semantic fission” by Rousseau et al. [132]. With a given set of abstract information, the next step is to determine the concrete output representation on a device for each piece of abstract information. This step is called modality arbitration or late fission. Finally, the selected devices render the information in the designated modalities for the user. The meaning of modality and related terms in this context is covered in the following section.


Figure 1.4: The Stormy Tree model of a computer system for multimodal interaction with early and late fusion and fission (cf. [58]).
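As a purely illustrative sketch of this input path (not the component interfaces actually developed later in this thesis; all names are invented, and the naive confidence product merely stands in for the belief-based combination of Chapter 4), early fusion turns device features into recognized events, and late fusion merges the semantics of temporally related events into one abstract input for the dialog:

```python
# Illustrative sketch of the Stormy Tree input path; names and logic are invented.
from dataclasses import dataclass

@dataclass
class RecognizedEvent:
    """Result of early (feature-level) fusion for one modality, e.g. a speech recognizer."""
    meaning: str          # e.g. "select" or "object 42"
    confidence: float     # recognizer confidence in [0, 1]

def late_fusion(events: list[RecognizedEvent]) -> dict:
    """Late (semantic-level) fusion: combine the partial meanings of temporally
    overlapping events into one abstract input for the dialog component."""
    overall = 1.0
    for event in events:
        overall *= event.confidence   # naive placeholder for a proper uncertainty calculus
    return {"parts": [e.meaning for e in events], "confidence": overall}

# A pointing gesture and a spoken command fused into one dialog-level input:
print(late_fusion([RecognizedEvent("select", 0.9), RecognizedEvent("object 42", 0.8)]))
# {'parts': ['select', 'object 42'], 'confidence': 0.7200000000000001}
```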

1.2.2 The Term Modality

This section is concerned with the way information is exchanged between the communicating partners in HCI via so-called modalities. But what does the term modality actually mean? And what is meant by multi-modal interaction? Before giving an overview of related work that deals with definitions of the term modality and multimodality, it is necessary to differentiate it from two notions that often appear in this context: mode and medium.

The term mode, despite its common meaning “way of something”, is usually used in HCI to denote human communication channels and sensory systems [38,132]. So mode describes the way humans perceive information (e. g. hearing, seeing, feeling). While this could serve as a basis for defining multimodality (i. e. the usage of multiple modes), it is focussed on human information reception and remains quite a weak word, as it only accounts for the way information is transported, leaving out any distinctions that could be made by the kind of transferred information.

Another term that often appears in HCI is the term medium. Medium describes the physical way information is delivered. In literature, different authors often use the term medium together with the term modality in different ways. In [90] medium is utilized to specify the output of a system by means of both the physical objects (e. g. paper, video) and the means by which information is conveyed (e. g. text on a sheet of paper), whereas modality refers to the human senses used to process incoming information. In [113] the term medium is used likewise, but the term modality is used in a contrasting way, solely for the input from a user to a system. In [175] Wilson et al. make the distinction that a multimedia system uses different presentation media, without a commitment to the underlying information representation, while a multimodal system is one that also uses different presentation media, but is committed to a single internal information representation (no matter what media is later used for presentation).

In this work, the terms medium and mode will not be used at all except for differentiating purposes. This is because some kind of physical transport medium is always involved when exchanging information, and both – medium and mode – do not provide enough detail to allow a precise differentiation; diversely encoded information can be transmitted by the same medium and/or mode. A tree, for example, is represented via the printed word “tree” or via a picture of a real tree in the same medium (paper) and mode (seeing), although the amount of information is highly different.

In the following, different definitions of (multi-) modality known from literature are discussed, one from a more general perspective and three more HCI-specific definitions. After that a new definition for the purpose of this work is presented.

The Human Sensory System and its Channels

The most obvious definition of modality, which is also easily understood by non-professionals, is that of the human senses. This notion is taken from the medical domain and describes modality as one of the human sensory channels: vision, audition, tactition, proprioception, gustation, olfaction, thermoception, nociception and equilibrioception [135]. Most information during communication with others is perceived through vision, audition and tactition. The other senses rather deliver information on the context of the communication to a person than being directly involved in the information exchange itself [34]. Schmidt and Lang [135] refer to Johannes Peter Müller’s law of specific nervous energies, proposed in 1826. It states that “a human sensory modality is not determined by a certain stimulus, but by the stimulated sensory organ”. The human eye, for instance, also responds to physical pressure by emitting a light sensation (one can see stars).

A major drawback of this definition for the purpose of this work is its restriction to human perception, something it shares with the term mode. The reverse information flow generated by human utterances is completely left out. Apart from that, the definition is too blurry to be applied in interactive systems. Just having a vision modality, for example, is too coarse to describe the diversity of visual information offered by modern graphical user interfaces.

Information via a Physical Medium

A modality can also be defined as “a way of representing information in some physical medium” [12], similarly in [11, 21]. This definition by Bernsen opens up a huge set of different modalities for a single human sense: using the medium light, for example, visually perceived information can be represented in different ways that differ in expressiveness and adequateness for a specific purpose. Further distinction is made by separating input modalities produced by the user and output modalities produced by the system. The two classes of all possible input/output modalities are obviously asymmetrical. Based on these considerations, Bernsen builds up a taxonomy of unimodal input and output modalities with hierarchical levels of abstraction [11, 12] as shown in Figure 1.5. At the most abstract super level, modalities are divided into four different classes: linguistic, analogue (in the sense of similar), arbitrary, and explicit. At the generic level the taxonomy differentiates static and dynamic representations for each of the three media of graphics, acoustics and haptics. At lower abstraction levels (called atomic and sub-atomic), groups of these distinctions are concretized in categories like written (typed or hand-written), spoken, images, maps, graphs etc. to name but a few examples.

While being very concrete, the classification often leads to irrelevant types of modalities when all their values are combined. Over one third (18) of all 48 combinatorially possible modalities is dismissed due to their irrelevance in “intelligent multimedia presentation systems” [11]. While in [12] it is claimed to cover both input and output modalities, the taxonomy was initially targeted at modeling output modalities only [11]. It still heavily focusses on the output of multimedia systems and limits its scope to media perceivable with three of the human sensory channels: vision, audition and tactition. This guarantees symmetry between input/output modalities, but makes it difficult to classify such prominent input schemes as mouse or eye gaze input. Modalities that are not perceivable by human senses are completely out of scope, e. g. biophysiological measures like heart rate or body temperature that can be used as implicit input modalities.

According to this definition of modality, Bernsen defines multimodal interactive systems as systems which use at least two different modalities (physical media) for input and/or output [12]. With this definition, every system equipped with a Graphical User Interface (GUI) is multimodal, because it takes haptic input and presents graphical output. A unimodal interactive system on the other hand would be a system which uses the same single modality for input and output, e. g. an over-the-phone spoken dialog system (ibid.).

Figure 1.5: An hierarchical taxonomy of unimodal modalities (cf. [12]).

Coupling of Device and Language

In [106] Nigay and Coutaz call a modality an interaction technique and define it as the coupling of a physical device d with an interaction language L: ⟨d, L⟩. In this definition, a physical device means “an artifact of the system that acquires (input device) or delivers (output device) information” [106], e. g. keyboard, mouse and screen. An interaction language is a well-formed set of expressions (or symbols) that convey meaning and are generated by actions on physical devices. Examples mentioned by the authors are written natural language as the tuple ⟨keyboard, NL⟩ (where NL is defined by a specific grammar), spoken natural language as ⟨microphone, NL⟩, or graphic input as ⟨mouse, graphics⟩.

This definition naturally supports the distinction between input and output of information in HCI through devices and goes beyond human sensory channels by using the (rather vague) notion of language. But the reliance on devices as a distinctive feature of modalities can also raise difficulties. When are devices considered to be different? Are LCD displays the same type of devices as projectors? In [12] Bernsen also claims the independence of modalities from the rapidly changing field of devices: “...interaction devices come and go but modalities remain unchanged” (ibid.). More important than the actually used device for information exchange are the peculiarities associated with the exchange. Although they can be influenced by the device, they are not solely determined by the device. These peculiarities become important when we consider the selection of modalities for a specific task, either on the system side for output or the user side for input. This selection will always depend on the features of a modality (or device) that make it more or less applicable in a specific context of use.

Derived from their understanding of modality as an interaction technique, Nigay and Coutaz define a multimodal interactive system as a system that supports multiple interaction techniques through multiple devices and/or languages [106].
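Read literally, this coupling and the resulting notion of a multimodal system can be paraphrased in a few lines of code; the sketch below is only an illustration, and the concrete device and language names are assumptions rather than examples taken from [106]:

```python
# A modality as the coupling <device, language> after Nigay and Coutaz [106];
# the concrete device and language names are illustrative.
from typing import NamedTuple

class Modality(NamedTuple):
    device: str      # physical artifact that acquires or delivers information
    language: str    # interaction language generated by actions on that device

spoken_nl  = Modality("microphone", "natural language")
written_nl = Modality("keyboard", "natural language")
pointing   = Modality("mouse", "graphical language")

def is_multimodal(supported: set[Modality]) -> bool:
    """Multiple interaction techniques through multiple devices and/or languages [106]."""
    devices = {m.device for m in supported}
    languages = {m.language for m in supported}
    return len(devices) > 1 or len(languages) > 1

print(is_multimodal({spoken_nl, written_nl}))  # True: two devices, one shared language
print(is_multimodal({pointing}))               # False: a single interaction technique
```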

A new Definition¹

All definitions of modality (and multimodality) presented above are valid, although they are fundamentally different. It becomes obvious that there is no established and widely accepted definition of modality, as it often depends on the point of view (human or system side) and purpose (system output design, human input design, general reasoning about modalities) which definition suits best. For this work, which deals with modeling and understanding human input, the definitions focussing on the human sensory channels are too limiting. The definition of interaction technique on the other hand is too vague and leaves a lot of room for interpretation. What is needed is a definition of modality that allows describing any interaction between two parties, no matter what devices/senses are used and, more importantly, allows describing multimodality in any of its potential variants, not merely deciding on the presence or absence thereof, as with the definitions above. To begin with, an abstract definition of interaction as the active information exchange between two partners is needed:

DEFINITION 1: An (explicit) interaction is an intentional bidirectional information flow, consisting of action and reaction that are related to each other. An interaction represents the basic unit of a dialog.

Using the notions of source and sink, information exchange is then performed by interpreting an information at a source into some kind of representation that can be transferred to a sink side. On the sink side, in turn, this representation is interpreted back into information so that the original concept arises again. Over a longer period of time this information exchange is conducted several times while the sink and source sides of information swap between the interaction partners.

¹ The following considerations and definitions have evolved in close collaboration with Frank Honold and therefore may appear in identical or similar form in his nascent PhD thesis [56].

On the one hand, the benefit of this abstraction lies in its variability of interaction partners, not being limited to HCI; on the other hand, it provides independence of actual devices. Considering the interaction between a system and a human, the interpreters reside within the system logic connected to input and output devices and within the human brain areas which process the information from sensory channels and towards actuatory channels. With this abstract definition of interaction, one can now formulate a new definition of modality, or rather, interaction modality:

DEFINITION 2: An interaction modality is an information channel built by interpreters between a source and a sink during an interaction. The encoding interpreter on the source side (encoder) transforms the information into a representation which can be decoded by the interpreter at the sink side (decoder) to convey meaning. Interaction modalities are employed to perform an information mapping between two interacting partners, either as action or reaction during a dialog.²

So, interaction modality denotes only the information channel used during an interaction. The actual exchange of information using such an interaction modality is called expression and is defined as:

DEFINITION 3: An expression is an information flow from a source to a sink via an interaction modality. Depending on the point of view (source or sink), expressions can be denoted as output or input.

To better describe real world interactions with different occurrences of interaction modalities used in expressions, a graphical notation can be used. The graphical representation for an interaction consists of information and interpreter items on each side (source and sink). The direction of the information flow is represented by an arrow pointing from the encoder to the decoder (see Figure 1.6).


Figure 1.6: A graph-based representation of a system input expression. In this abstract example the information flows from the left (human) to the right (system).

² For the sake of brevity and unless otherwise noted, all following mentions of modality in this work refer to this definition of interaction modality.


Obviously, when the sink side of interaction is a human, the interpreters at the source side must always transform the information into a form perceivable by the human sensory system. Although the notation denotes a single source and sink of information, it is not restricted to single user systems. Logically the users of a multi-user system can be combined into a single source or sink respectively. The same applies to collaborative systems consisting of two or more independent components.

Forms of Modality Use

With the definition of modality and the graphical notation for interactions, one can now define several variants of modality use. In the following several real-world examples are used to illustrate the different variants.

The most basic form is a unimodal expression. As an example, one can imagine a typewriter where users can only interact via the keyboard to enter text. The intended text is encoded as haptic output on the human side and decoded by the keyboard on the system side (cf. Figure 1.6).

DEFINITION 4: A unimodal expression is a single input- or output-mapping to deliver a certain information from the source to the sink. The information is encoded in one, and only one way at the source and gets decoded in one, and only one way at the sink.

When multiple modalities are involved at the same time (i. e. within a temporal window) during an expression, it is a multimodal expression. This can further be divided into redundant and complementary multimodal expressions, depending on the information that is transferred.

DEFINITION 5: A multimodal expression is a multiple input- or output-mapping to deliver a certain information from the source to the sink using different modalities within a temporal window.

DEFINITION 6: A redundant multimodal expression is a multimodal expression, where the different modalities are used to transfer the same information.

DEFINITION 7: A complementary multimodal expression is a multimodal expression, where the different modalities are used to transfer different parts of the information.

An example of a redundant multimodal expression would be a system that shows text on the screen while reading the same text via Text-To-Speech (TTS). This is shown in Figure 1.7. An example of a complementary multimodal expression is that of a user performing a pointing gesture and uttering a selection command via speech to select an object. This is shown in Figure 1.8.


Using these three basic types of expression (unimodal, redundant and complementary multimodal), most interactions in HCI can be described, and we can also define what a multimodal system is.

DEFINITION 8: A multimodal system is a system where at least one information is transferred via a multimodal expression (be it redundant or complementary).

With this definition, it becomes obvious that classical desktop systems equipped with mouse and keyboard are not multimodal, at least during normal usage. Although they provide multiple modalities to deliver the same information (like moving the cursor via mouse or keyboard), this usually does not occur within a single expression. The same holds true for contemporary user interfaces that heavily rely on touch input, e. g. smart phones. While a single modality is used to deliver different information (the opposite of desktop systems) and often voice interaction is provided as well, all expressions are performed in a unimodal way. Such systems are called multi-unimodal systems.

DEFINITION 9: A multi-unimodal system is a system where all (=multiple) information is transferred via unimodal expressions.

So it does not matter whether multiple pieces of information are expressed with a single modality or vice versa. The important factor here is time: although there may exist multiple modalities, they are not used at the same time.

Figure 1.7: A redundant multimodal system output expression. The same information is redundantly transferred via different modalities (visualized by framing the encoders and decoders).


Figure 1.8: A complementary multimodal system input expression for selecting an object. One part of the information is transferred via a pointing gesture, the other via speech.

Only under special conditions, e. g. when a screen reader is used, some form of redundant multimodal output occurs, and only then can such a system be considered a multimodal system, albeit for output expressions only.

While the above definitions allow describing “most” interactions in HCI, the phrasing suggests that there can be special forms as well. One example that occurs every day is that of a phone set to vibration alert only. Although the intended perception of the tactile system output on an incoming call is tactile as well, we mostly perceive the vibration as acoustic input due to the created sound (see Figure 1.9). Sometimes we even see the phone moving. Such an expression can be regarded as implicit multimodal. Rousseau et al. [132] use the phrases primary relation and secondary relation to describe such kinds of side effects.

DEFINITION 10: Expressions are called implicit multimodal, if one interpreting source is linked with two or more interpreting sinks.

With the above definitions, even more complex types of expressions seem theoretically possible, such as implicit complementary multimodal or implicit unimodal, but their occurrence seems limited to very special cases and so does their practical relevance. That is why the definitions made above are fully sufficient for the work presented here.

1.2.3 Conceptual Models of Multimodal Interaction

Apart from the formal definition of modality itself, there have been attempts to formally describe the possible combinations of modalities, independent of their actual definition. These conceptual models, which are often referred to in the literature, are described in the following.

The first model, a design space for describing multimodal systems, was proposed by Nigay and Coutaz [105]. In this model, multimodal systems are defined along two dimensions for input and output: Use of Modalities and Fusion. Figure 1.10 illustrates these dimensions and the corresponding classes of systems placed within the design space.

Figure 1.9: An implicit (redundant) multimodal system output expression created by a vibrating mobile phone.

Figure 1.10: A design space for multimodal systems. Four classes of multimodal systems (Exclusive, Alternate, Concurrent, and Synergistic) can be built along the two dimensions Use of Modalities (sequential vs. parallel) and Fusion (cf. [105]).

Here, “Fusion” deals with the possible combination of data from/to input/output (I/O) devices. When the data associated with a modality is not “combined” with data from other modalities, it is called “independent”. “Use of Modalities” deals with the temporal constraints imposed on information processing from/to I/O devices, i. e. the temporal occurrence of multiple modalities. Modalities can either be used in a “sequential” (one after another) or “parallel” (simultaneously) manner. The four classes of systems that result from combining those two dimensions are: “Exclusive”, “Alternate”, “Concurrent”, and “Synergistic”. For placing an actual system in the design space, one has to consider a set of relevant features of a system. As an example, one set of such features could be the commands supported by a system. Every feature (command) is now placed within the design space at one of the four possible positions by assessing its use (Exclusive, Alternate, Concurrent, or Synergistic). By weighting each feature, for example by defining the weight as the frequency of use, one can calculate the center of gravity to come up with the position of the system as a whole within the design space. Building an acronym from the four basic classes, the model is also known as the CASE model [45].
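To make the placement procedure concrete, the following Python sketch computes such a center of gravity. The coordinates assigned to the four classes and the example commands with their weights are illustrative assumptions, not values from [105]:

```python
# Illustrative sketch of placing a system in the CASE design space.
# The coordinates assigned to the four classes and the example weights
# are assumptions for demonstration, not values from the original model.

CLASS_POSITIONS = {
    "exclusive":   (0.0, 0.0),  # sequential use, independent fusion
    "alternate":   (0.0, 1.0),  # sequential use, combined fusion
    "concurrent":  (1.0, 0.0),  # parallel use, independent fusion
    "synergistic": (1.0, 1.0),  # parallel use, combined fusion
}

def center_of_gravity(features):
    """features: list of (class_name, weight), e.g. weight = frequency of use."""
    total = sum(weight for _, weight in features)
    x = sum(CLASS_POSITIONS[name][0] * weight for name, weight in features) / total
    y = sum(CLASS_POSITIONS[name][1] * weight for name, weight in features) / total
    return x, y  # position along (Use of Modalities, Fusion)

if __name__ == "__main__":
    # Hypothetical system: three commands with assumed usage frequencies.
    commands = [("exclusive", 5), ("alternate", 2), ("synergistic", 3)]
    print(center_of_gravity(commands))  # -> (0.3, 0.5)
```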

Together with other researchers, Nigay and Coutaz later proposed another more formal model known as the CARE model [32]. In contrast to the previous CASE model, its focus lies more on the user level of multimodal interaction [45] and is not intended as a design space, but rather as a conceptual basis for reasoning about multimodal interaction. The underlying definition of a modality is that of an interaction technique (cf. Section 1.2.2, p. 15). In CARE, four properties of multimodal interaction are formally defined:


Complementarity: Modalities are used in a complementary way, if all of them are necessary to convey an intended meaning and are used within the same temporal window. Note that this is analogous to a complementary multimodal expression (see Definition 7, p. 19).

Assignment: A modality is assigned, if it is the only modality used to convey an intended meaning. This expresses the absence of choice. Either there is no other modality at all, or there is a choice, but the agent (a user, the system, or a component of the system) always opts for the same modality in a specific situation. Note that this is analogous to a unimodal expression (see Definition 4, p. 19).

Redundancy: Modalities are used redundantly, if they have the same expressive power (i. e. they are equivalent, see below) and are used within the same temporal window (either parallel or sequential). This is analogous to a redundant multimodal expression (see Definition 6, p. 19).

Equivalence: Modalities are equivalent, if it is necessary and sufficient to use any one of them. Equivalence expresses the availability of choice, but does not impose any temporal constraints.

According to the above definitions, redundancy requires equivalence, but equivalence only stipulates the existence of multiple modalities with the same expressive power, not their use within the same temporal window. So the definitions of equivalence and redundancy are somewhat asymmetrical. If a system supports equivalence, but not redundancy, it may either ignore one of the modalities when they are used in parallel, or might even get confused when more than one modality is used at the same time, thus limiting the overall robustness, as stated in [32].

While assignment and equivalence deal with the use of a single modality in unimodal expressions, complementarity and redundancy consider the combined use of modalities in multimodal expres- sions as given by Definition 5. Figure 1.11 shows these relations according to the choice and use of modalities.
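To make these relations more tangible, the following Python sketch offers a rough operationalization of the four properties. It is an invented simplification, not part of the CARE formalism in [32]; the parameter names and the decision logic are assumptions for illustration only:

```python
# Rough operationalization of the CARE properties (invented simplification).

def care_property(num_modalities_used, choice_available, same_expressive_power,
                  same_temporal_window):
    """
    num_modalities_used:   how many modalities convey the intended meaning
    choice_available:      True if other, equally expressive modalities exist
    same_expressive_power: True if each used modality could convey the full
                           meaning on its own
    same_temporal_window:  True if the used modalities overlap temporally
    """
    if num_modalities_used == 1:
        # Unimodal use: the property depends on whether a choice existed.
        return "Equivalence" if choice_available else "Assignment"
    if same_temporal_window:
        # Multimodal use within one temporal window.
        return "Redundancy" if same_expressive_power else "Complementarity"
    return "separate unimodal expressions"

if __name__ == "__main__":
    # Pointing gesture plus the spoken command "select", overlapping in time:
    print(care_property(2, choice_available=True,
                        same_expressive_power=False, same_temporal_window=True))
    # -> Complementarity
```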

1.2.4 Benefits and Misconceptions of Multimodal Interaction

Now that multimodality is defined and general terms of multimodal interaction are established, the question remains what all this is actually good for. In this section, proven benefits of multimodal interaction are presented and put into perspective with common misconceptions that could hinder successful application.


Figure 1.11: The CARE properties related to the choice of modalities (present or absent) and their unimodal and multimodal use.

Greater Expressiveness and Efficiency

One of the most prominent benefits of multimodal interaction is its increased expressiveness compared to unimodal interaction, which can result in increased efficiency. This is especially true for tasks that include spatial information. In the domain of CAD systems, Ren et al. quantitatively showed that multimodal combinations of pen, speech, and mouse input were significantly faster during drawing and modification tasks [129]. Similar results are reported for multimodal pen and speech inputs for trip planning within a map application (ibid.). In the domain of map applications, similar results have been reported by Oviatt et al., with a performance increase of 10% for inputs containing spatial information [111]. Due to its obviousness, this increased efficiency is often not quantitatively proven, but merely presumed in other application domains as well, like text editing [7], form filling [35, 167], robot control [1, 127], and smart homes [8, 124]. What applies to multimodal input expressions also holds true for multimodal output expressions generated by a system. Examples that make use of the expressive power of complementary multimodal output expressions can be found in [2, 103, 168]. A qualitative evaluation that demonstrates the efficiency increase of redundant multimodal output expressions is presented by Lee and Spence [82] using smartphone interaction tasks during a driving simulation.


Increased User Preference

Generally speaking, multimodal inputs that offer advantages compared to unimodal ones are also rewarded by the users and thus often preferred. For instance, in addition to the increased efficiency, Ren et al. also report increased user satisfaction for multimodal inputs compared to unimodal ones [129]. Similarly, Oviatt et al. report that 95% of users stated to prefer multimodal interactions during map tasks [111, 118]. User preferences are of course largely dependent on the involved task. In [37], Dey et al. compared unimodal multitouch (direct manipulation) with complementary expressions of touch and speech for moving and rotating images on a display. This time, users did not prefer multimodal interactions and instead found unimodal multitouch to be more intuitive and engaging due to the direct and continuous feedback. So it becomes clear that multimodal interaction must be carefully designed for the task at hand to be actually advantageous, and that not all tasks are suited for multimodal interactions.

Enhanced Flexibility

Another advantage of multimodal interaction comes with the added flexibility regarding the choice of interactions for the users. Systems redundantly offering more than one way of interaction allow users to choose the one that suits their current needs best. This is already present in current systems with unimodal inputs, e. g. desktop systems often allow menu selection with a mouse as well as keyboard shortcuts to perform the same tasks, suiting novice and expert users alike. This flexibility is further enhanced by offering additional multimodal ways to interact, including alternate ways of output the users might choose from. Multiple interaction options also allow users to avoid fatigue by alternating input modalities, e. g. switching from gesture to voice inputs. And finally, this flexibility allows users to switch input modes according to the perceived liability to errors and hence reduce the risk of erroneous inputs in the first place.

Increased Robustness

Multimodal inputs can also be used to improve recognition rates and reduce the overall error rate of a system. For example, Oviatt et al. showed that mutual disambiguation of speech and gesture inputs in a map application [112] can significantly reduce error rates (a reduction of the speech error rate by 13.3% compared to standalone speech is reported). Sun et al. report a significant increase of recognition accuracy from 70.6% to 73.7% using a multimodal parser to disambiguate speech inputs with the help of gesture inputs within a traffic control system [153]. Even without disambiguation, simply using multimodal interaction can lead to fewer errors. Such error reduction is for example reported in [111], with 36% fewer errors in multimodal map interactions compared to speech-only ones, due to shorter utterances containing less complex spatial descriptions.

Wider Applicability

Another major benefit of multimodal interactions is their generally wider applicability, as they make it possible to support a wide range of users, tasks, and environments. Not only do they allow different user preferences to be satisfied (as already mentioned above), but they also allow handicapped users to interact more easily, like in [7], where a multimodal text editor for blind users is presented. With their increased robustness and their offered flexibility, multimodal interactions can be used even in adverse scenarios (e. g. noisy environments and mobile settings). Multimodal interfaces can also be realized in an adaptive manner, where the interface adapts to changing contexts and environmental conditions, which is essentially what multimodal fission is about.

Misconceptions about Multimodal Interaction

While the benefits listed above clearly speak in favor of widely adopting multimodal interactions, there are some pitfalls that could hinder successful application. In her article “Ten Myths of Multimodal Interaction” [113], Sharon Oviatt describes common misconceptions that surround multimodal interaction, which – if not thoroughly considered – could lead to suboptimal results. These myths are summarized and further commented on in the following:

Myth #1: If you build a multimodal system, users will interact multimodally. Just because users are offered multimodal interaction possibilities does not mean that they will always prefer them. Their preferences largely depend on the task and the context, and consequently on how advantageous it is to use multimodal expressions. As already stated above, there are domains that are predestined for multimodal interaction, especially when there is spatial information involved, e. g. in map applications. However, there are tasks where users prefer to interact unimodally: in [118], although being in the domain of map applications, 80% of all inputs were performed unimodally, because only a minority of tasks were actually spatial location commands, whereas the majority consisted of other commands without any spatial component (e. g. printing, scrolling, task assignment etc.).

Myth #2: Speech and pointing is the dominant multimodal integration pattern. Speak-and-point is often viewed as the prototypical form of multimodal interaction, like when making a deictic reference with “that”. As Oviatt states, this “represents the persistence of an old mouse-oriented metaphor” [113]. Instead, system designers should also take other, more richly expressive forms of multimodality into account, like written input, gestures, facial expressions, and other forms of nonverbal communication (see [122] for further reading and [57] for a toolkit specialized in processing nonverbal cues). Only this will make interactions truly natural and more similar to human-human communication.

Myth #3: Multimodal input involves simultaneous signals. One could easily come up with the idea that multimodal expressions consist of temporally overlapping input signals (e. g. as reported in [83] for virtual object manipulation). But there is empirical evidence that often enough, signals belonging together do in fact exhibit a temporal gap between them. In [118], about half of the time, pen input preceded speech input with a gap between the signals when part of a complementary multimodal expression. In a study by Haas et al., even all multimodal inputs were performed that way [53].

Myth #4: Speech is the primary input mode in any multimodal system that includes it. Sometimes, speech is regarded as the primary input that is only accompanied by other modalities like gestures and head/body movements. However, as already mentioned in Myth #2, there are input modalities that convey information not present in the speech signal. This is especially true for spatial information, which is most easily given by mouse, pen, or direct pointing gestures. Oviatt concludes that multimodal systems ignoring the sources of such information would systematically fail to recognize many types of spontaneous multimodal constructions. Speech is also often not “primary” with regard to input order, as in [118] the vast majority of inputs started with pen input followed by speech. So speech is neither the exclusive carrier of important information, nor does it usually precede other inputs.

Myth #5: Multimodal language does not differ linguistically from unimodal language. There could be the assumption that language is linguistically the same, whether it is part of a multimodal expression or expressed unimodally, like when speaking or writing. But in fact, multimodal language is often substantially simplified when part of a multimodal expression, as demonstrated in [111], where much of the complexity of natural speech is omitted by users and a more command-like style is used instead. This was regardless of the fact that the experiment was realized using the Wizard of Oz (WOZ) technique – where a human operator controls all interactions – and therefore any form of language would have easily been recognized. Of course, this could be due to the fact that users merely presumed the system would have difficulties understanding naturally expressed language and therefore simplified their inputs in the first place without ever trying.


Myth #6: Multimodal integration involves redundancy of content between modes. It is sometimes claimed that multimodal expressions contain a high degree of redundancy. However, as experiments show, multimodal expressions are mostly complementary, and redundant information is given only rarely. In [118], even in case of error, when users are motivated to clarify and reinforce the delivered information, less than 1% expressed redundant information. Oviatt also refers to linguists having documented that in human-human communication, spontaneous speech and gesturing do not involve duplicate information (cf. [113]).

Myth #7: Individual error-prone recognition technologies combine multimodally to produce even greater unreliability. As already stated above, well-engineered multimodal systems are usually able to reduce uncertainty, and the overall robustness is increased due to mutual disambiguation, simpler language, and users’ natural behavior of avoiding presumably erroneous inputs.

Myth #8: All users’ multimodal commands are integrated in a uniform way. Although there exists work reporting that users show very similar temporal behaviors when performing multimodal inputs [83], this is often not the case. As repeatedly shown by Oviatt et al. themselves [116–119] and others [53, 63], there are huge individual differences in how and when modalities are combined in a multimodal expression. Oviatt speaks of “dominant integration patterns” (simultaneous and sequential) that users exhibit and that persist throughout a session. She implies that systems detecting and adapting to those patterns could lead to considerably improved recognition rates [113].

Myth #9: Different input modes are capable of transmitting comparable content. This is the opposite extreme to Myth #4 (speech is primary) and assumes that different modalities are fully able to transmit comparable propositional content. If this were true, one could easily build a system whose input and output suits every user, as it could be tailored to any physical, perceptual, or cognitive peculiarities. The problem is that modalities usually differ in the type of transmitted information, their communicative function, their way of being combined with other modalities, and their suitability for different styles of interfaces. Although there are relatively similar modalities (e. g. speech and writing), it seems more than naive to believe such a “multimodal translation device” could be built without taking all these disparities into account.

Myth #10: Enhanced efficiency is the main advantage of multimodal systems. Sometimes it is assumed that efficiency is the main argument in favor of multimodal interaction. As already described at the beginning of this section, there are other benefits of multimodal interaction that are sometimes even more important than pure efficiency: user preference, flexibility, robustness, and applicability.

1.3 Introduction to Input Fusion

Now that the fundamentals of this work, namely companion technology, multimodal interaction, and related concepts have been clarified, the focus lies on multimodal inputs. This section introduces the task of multimodal input fusion and its foundations from over 30 years ago. First, an introduction is given starting from a historic perspective and elaborating the term fusion. Then, a basic categorization of input fusion approaches is presented that lays the foundations for the state of the art in the next chapter.

The first time research envisioned a system that incorporates a combination of modalities was in 1980 with Richard A. Bolt’s “‘Put That There’: Voice and Gesture at the Graphics Interface” [20] at the MIT. In this work, a so-called Media Room is presented where the user is situated in front of a wall-sized screen and can interact with a Spatial Data-Management system (a map application) using speech and pointing gesture inputs (see Figure 1.12). With simple voice commands and pointing gestures used in multimodal expressions, the user could create colored geometric objects on the map (“Create a blue square there”) and later on manipulate them. The user could change an object’s attributes (“Make the blue square bigger”) and change its position (“Move the blue square to the right of the green square”). The user could also use deictic references (“Make that smaller”, the eponymous “Put that there”). Although Bolt did not provide any information on how this combination of inputs was actually implemented, it is the first system that illustrates what is nowadays called multimodal input fusion and laid the basis for this area of research. Before presenting related work, the term fusion and its meaning in the context of multimodal input is explained, along with a basic categorization of multimodal input fusion approaches.

Figure 1.12: A user in the Media Room performing a pointing gesture that is captured via an arm-strapped sensor receiving signals from the transmitter mounted on the pedestal to the right of the user (image from [20]).

1.3.1 The Term Fusion

As the term fusion is rather vague and can have various meanings, it is important to put it in perspective before elaborating on the fusion of multimodal user inputs in particular. The term fusion itself means nothing more than combining information from different sources and deducing some meaning from it, and there exist many approaches that can be described in such a way. To distinguish fusion approaches, basically two characteristics are used: when is information fused, and what kind of information is fused. For the temporal aspect, a basic distinction is made between early and late fusion [4, 114, 157] that is also shown in the Stormy Tree model (see Figure 1.4 in Section 1.2.1). Early fusion means that information is fused at the earliest time possible within the input processing chain of a system. That is, when raw data or features built from raw data are combined, resulting in a new feature vector on which a decision is made. Late fusion, on the other hand, takes place when separate decisions made in multiple early fusion steps are combined to make a new decision on a higher abstraction level.

As mentioned above, the other characteristic used to distinguish fusion approaches is, what kind of information is fused. Three types of information, called levels, are distinguished: data level, feature level and decision level [147, 157]. The lowest, data level fusion, is the processing of raw data of a specific modality (or multiple instances of the same modality), where equivalent types of data are combined in terms of early fusion. Feature level fusion is the fusion of tightly coupled or time synchronized modalities (e. g. fusion of speech and lip movements), often handled by adaptive systems like artificial Neural Networks, Gaussian Mixture Models, or Hidden Markov Models [147]. Feature level fusion also belongs to the early fusion category. Decision level (or semantic) fusion is used in multimodal applications to manage loosely coupled modalities like pen and speech, where decisions already made by independent “sensors” are combined on a higher abstraction level, just like in terms of late fusion. Figure 1.13 illustrates these distinctions.
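The difference between early and late fusion can be sketched in a few lines of Python. The placeholder classifier and combiner below are invented stand-ins for real recognizers and are only meant to contrast where the combination happens:

```python
# Illustrative contrast of early (feature-level) and late (decision-level) fusion.
# The "classifiers" are trivial placeholders standing in for real recognizers.

def early_fusion(features_a, features_b, classifier):
    """Combine raw features first, then decide once on the joint feature vector."""
    joint_features = features_a + features_b          # e.g. concatenation
    return classifier(joint_features)

def late_fusion(decision_a, decision_b, combiner):
    """Combine decisions that were already made independently per modality."""
    return combiner(decision_a, decision_b)

if __name__ == "__main__":
    # Placeholder classifier: decides "select" if the summed features exceed a threshold.
    classifier = lambda f: "select" if sum(f) > 1.0 else "reject"
    print(early_fusion([0.4, 0.3], [0.6], classifier))   # -> "select"

    # Placeholder combiner: accept only if both modality-level decisions agree.
    combiner = lambda a, b: a if a == b else "ambiguous"
    print(late_fusion("select", "select", combiner))     # -> "select"
```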

In practice, these categorizations are often not sharp and depend on the considered scope. Often, there are multiple types of fusion found within a single approach, which is then referred to as hybrid (cf. [4]). The fusion of multimodal user inputs discussed in this work usually belongs to the category of late decision level fusion, as it tries to derive meaning from different, already individually processed input expressions on a high level of abstraction. This particularly includes complementary multimodal expressions as given by Definition 7. The term input fusion is used with that meaning in mind. This is to be differentiated from the general fusion of multimodal data that is not limited to HCI and the user generated inputs occurring there. In particular, this includes any form of universal intention recognition and associated frameworks (e. g. [40, 52, 94, 95]). These fuse much more general information gathered from (multimodally) observing humans, e. g. in human-human interaction, in order to analyze actions and conversations, and derive basic intentions of the acting persons.

Figure 1.13: Approaches to information fusion. In early fusion, raw data or features (F1, ..., Fn) are combined (F1,n) before a decision is made (e. g. D1). In late fusion, previous decisions (D1, ..., Dn) are combined (D1,n) before a decision is made (D). Depending on the type of information combined, fusion can be further subdivided into three levels: data, feature, and decision level fusion.

1.3.2 Types of Multimodal Input Fusion

Focussing on multimodal input fusion as described above, researchers currently differentiate four types to classify the way fusion engines perform their fusion task [27, 45, 78]. The four types are briefly described in the following, and the first representatives found in literature are given. Their meaning will become clearer in Chapter 2, when concrete examples are described.

The earliest approach found in literature is so-called procedural fusion, whose first representative came from Neal et al. [103] in 1989. This type of fusion uses an algorithmic approach, where all possible input states and the transitions between those states are explicitly represented and predefined. So some form of state-based algorithm is used in this type.

Another quite popular type of fusion founded by Nigay et al. [105] using the term melting pot in 1993 is the so-called frame-based fusion. Instead of dealing with a possibly infinite amount of input states, this approach relies on simple data structures in the form of attribute-value pairs called frames. Starting with initially empty attribute values, the fusion engine tries to fill out as many attributes as possible in the course of interaction. As soon as the first frame is complete, the fusion triggers the application with the resulting input.
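The general mechanism can be sketched as follows in Python (a minimal illustration of the frame idea, not the original implementation; the slot names and input events are invented):

```python
# Minimal sketch of frame-based input fusion: attribute-value slots are filled
# from input events until the frame is complete. Slot and event names are invented.

class Frame:
    def __init__(self, name, slots):
        self.name = name
        self.slots = {slot: None for slot in slots}

    def fill(self, slot, value):
        if slot in self.slots and self.slots[slot] is None:
            self.slots[slot] = value

    def is_complete(self):
        return all(value is not None for value in self.slots.values())

def fuse(frame, events, trigger):
    """events: iterable of (slot, value) pairs produced by the input recognizers."""
    for slot, value in events:
        frame.fill(slot, value)
        if frame.is_complete():
            trigger(frame)          # forward the completed command to the application
            return frame
    return frame                    # incomplete; would expire after its life span

if __name__ == "__main__":
    move = Frame("move_object", ["action", "object", "target"])
    events = [("action", "move"), ("object", "blue square"), ("target", (120, 40))]
    fuse(move, events, trigger=lambda f: print("execute", f.name, f.slots))
```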

As such frames become more complex and eventually get typed, similar to the classical object-oriented programming paradigm, fusion is called unification-based. In this type, frames have evolved into so-called typed feature structures that are filled using some form of unification operation. In fact, such a form of fusion had already been mentioned by Wahlster in his XTRA system [167] from 1991, two years earlier than Nigay’s melting pots.

The fourth and most recent types of fusion are called hybrid and statistical approaches (Lalanne et al. just used the term “hybrid”). According to Lalanne et al., hybrid approaches mix frame-based fusion with unification [78]. The first to apply this form of fusion were Flippo et al. [47] in 2003. In this work, the category is extended by approaches that use statistical analysis (e. g. probability of occurrence) or even machine learning methods, as they mostly rely on a hybrid fusion.

1.4 Summary

This chapter dealt with the basic concepts and terms for the work at hand. First, the research field of companion technology was introduced, giving details on the four basic areas this research is concerned with: planning & reasoning, knowledge base, perception & recognition, and dialog & interaction. Having located multimodal input fusion in this scope, it is important to elaborate on the superordinate concept of multimodal interaction, since input fusion has to be considered as only one part of what constitutes an actual interaction. This was introduced starting with Don Norman’s Human Action Cycle followed by the Arch model of computer systems, and finally – derived from that – the Stormy Tree as generic model for multimodal interaction. Since it is essential to have a common understanding of the central terms modality and multimodality, known definitions from literature were presented next, identifying the need for a new definition based on input and output expressions accompanied by a graphical model. Continuing with multimodal interaction, the CASE and CARE models were presented as prominent conceptual models for describing combinations of modalities and thus multimodal interaction. The section on multimodal interaction was then concluded by providing known potential benefits and pitfalls of applying this type of interaction. This is of interest to anyone assessing whether multimodal interaction should be applied in a particular context in the first place. Having clarified the context of multimodal input fusion, an introduction on this very topic was given that covered its origins, the meaning of the term fusion in the context of this work (and its distinction from other areas of research), and the four basic types of fusion engines that are known from literature. This established the basis needed for the state of the art presented in the next chapter that follows this categorization.


2 State Of the Art

This chapter gives an overview of today’s state of the art in multimodal input fusion. It is divided into three parts. In the first section, the most representative approaches for each basic category of input fusion approaches (cf. Section 1.3.2) are explained in greater detail with a presentation of the respective strengths and weaknesses of each category. This part concludes with a classification of all covered approaches and a summary highlighting the issues of uncertainty handling and individual user adaption to be addressed in this work. The second section then focuses on description languages used for fusion engines, as these constitute the interface to other parts of a system and the way a fusion engine is “tailored” for work. Again, a summary of this section is given that derives requirements for a description language to be used in the context of this work. The third section then widens the scope towards embedding multimodal input fusion within the architecture of a multimodal system, as it presents the W3C’s current recommendations for such architectures and locates multimodal input fusion within these. A compact summary with the most relevant findings of all three parts then concludes the chapter, serving as basis for the formulation of goals and requirements in the subsequent chapter.

2.1 Multimodal Input Fusion Approaches

Given the four types of multimodal input fusion, this section presents the earliest approaches found in literature within each category to exemplify their general principles. In addition, for each category a more recent approach is explained to show the state of the art. This is followed by all representatives known to the author in chronological order together with their most distinctive features. Each subsection concludes with a summary of the pros and cons of the respective category. Finally, for the sake of clarity, a summarizing classification of all approaches is given and discussed.


2.1.1 Procedural Approaches

The first known approach to perform a procedural kind of multimodal input fusion is the CUBRICON project by Neal et al. [103] from 1989. The domain at hand is military mission planning that incorporates a geographic map, as well as tables and forms presented on different graphics displays. Regarding the fusion of multimodal inputs, the system can fuse inputs from three devices, namely speech input, keyboard, and mouse. For this, the so-called Multi-Media Parser Interpreter is realized as a generalized Augmented Transition Network (ATN) that parses a unified input stream which contains temporally ordered tokens from any device. The main purpose of the fusion system is the resolution of references made in spoken input by inspecting objects targeted with pointing gestures. This referent determination process deals with ambiguities or even missed targets of pointing gestures by implementing sorting and filtering upon objects’ types and properties stored in a knowledge base. The following example illustrates this process (adapted from [103]):

USER: “What is the status of this airbase?"

Assuming that the user did not exactly point to an object on the map, the system can still find the correct referent by searching for the closest object in the vicinity of the pointing gesture that is of the type “airbase”.
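A minimal Python sketch of such a type-constrained referent search is given below. The scene objects, coordinates, and the distance threshold are invented for illustration; CUBRICON’s actual sorting and filtering on a knowledge base is considerably more elaborate:

```python
# Sketch of referent resolution for an imprecise pointing gesture:
# among the objects near the pointing location, pick the closest one
# whose type matches the type mentioned in the spoken input.

from math import dist

def resolve_referent(point, spoken_type, objects, max_distance=50.0):
    """objects: list of dicts with 'id', 'type' and 'position' (x, y)."""
    candidates = [
        obj for obj in objects
        if obj["type"] == spoken_type and dist(point, obj["position"]) <= max_distance
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda obj: dist(point, obj["position"]))

if __name__ == "__main__":
    scene = [
        {"id": "ab1", "type": "airbase", "position": (10.0, 12.0)},
        {"id": "ab2", "type": "airbase", "position": (80.0, 75.0)},
        {"id": "t1",  "type": "tank",    "position": (11.0, 13.0)},
    ]
    # "What is the status of this airbase?" with an imprecise pointing gesture:
    print(resolve_referent((14.0, 15.0), "airbase", scene))  # -> object 'ab1'
```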

A more recent approach to procedural input fusion is that of Cutugno et al. [33]. It represents a more generic approach targeted at mobile interaction, exemplified with a geographic information system on an Android mobile device allowing speech and touch input. Instead of using Finite State Machines (FSMs), the use of so-called Nondeterministic Finite Automata (NFAs) is proposed. As stated by Cutugno et al., FSM based approaches can run into problems when parsing inputs “on the fly”, as it may not be clear which FSM should be used to validate a not yet complete input. The example used to clarify this is a multimodal command to move an object. This can happen in two flavors. First, the user could press the mouse button down to move an object around by dragging it, until the button is released. The second possibility is to perform a mouse click (consisting of a mouse down and a mouse up) on an object, then verbally state a move command, followed by clicking on the target location. At the moment the mouse button is pressed down, the system cannot decide whether the former or the latter version will be used. Instead of modeling two separate FSMs that would have to be activated in parallel, a single NFA can be used to model both versions. The first state of such an NFA is then nondeterministic, as illustrated in Figure 2.1. Instead of using predefined or hardcoded models, the system is completely configurable using an XML-based markup language borrowing some elements from the SMUIML modeling language [44]. Thus, the approach is not intended to be a complete application on its own, but rather can be plugged into existing applications that can thereby be enhanced to allow multimodal interaction. The conceptual basis of multimodal interaction used in the approach are the CARE properties defined by Coutaz et al. [32], as discussed in Section 1.2.3.

Figure 2.1: Modeling a movement command in two variants using a single NFA (cf. [33]).
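The underlying idea of keeping several possible states alive can be sketched as follows. State and event names loosely follow Figure 2.1, but the transition table is an invented simplification and not the XML-configurable engine of Cutugno et al.:

```python
# Sketch of nondeterministic parsing of the two "move" variants from Figure 2.1:
# drag (mouse_down, mouse_move*, mouse_up) vs.
# click-speak-click (mouse_down, mouse_up, speech_move, mouse_down, mouse_up).
# Transitions and state names are illustrative.

TRANSITIONS = {
    # (state, event) -> set of successor states
    ("start", "mouse_down"): {"drag_1", "click_1"},   # nondeterministic branch
    ("drag_1", "mouse_move"): {"drag_1"},
    ("drag_1", "mouse_up"): {"move_by_drag"},
    ("click_1", "mouse_up"): {"click_2"},
    ("click_2", "speech_move"): {"click_3"},
    ("click_3", "mouse_down"): {"click_4"},
    ("click_4", "mouse_up"): {"move_by_click_speak"},
}
ACCEPTING = {"move_by_drag", "move_by_click_speak"}

def run_nfa(events):
    states = {"start"}
    for event in events:
        # Follow all transitions of all currently possible states.
        states = set().union(*(TRANSITIONS.get((s, event), set()) for s in states))
        if not states:
            return None          # no variant matches the input sequence
    accepted = states & ACCEPTING
    return accepted.pop() if accepted else None

if __name__ == "__main__":
    print(run_nfa(["mouse_down", "mouse_move", "mouse_up"]))
    print(run_nfa(["mouse_down", "mouse_up", "speech_move", "mouse_down", "mouse_up"]))
```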

Other approaches from the procedural type of multimodal input fusion can be found in [1, 23, 62, 66, 79, 88, 89, 92, 120]. In [88, 89], a theoretical framework for describing cooperation between modalities called TYCOON is described. Using this framework as a basis, a multimodal fusion module working on guided propagation networks is presented. In contrast to FSMs, these networks don’t work with states, but with variables that are filled over time and finally trigger some action in the underlying application. Key features of the presented approach are parallel detection of independent multimodal commands and improvement of recognitions through predictions, i. e. lowering recognition thresholds for expected inputs.

Bourguet [23] focuses on authoring tools for creating simple multimodal applications. An editor (called interaction model builder) is presented, that allows creating simple FSMs that are then stored in XML format. A multimodal engine is then used to provide these FSMs with event based inputs from any multimodal recognizers.

In [79], ATNs are extended by temporal constraints, resulting in so-called temporal ATNs (tATNs). The approach focusses on Virtual Reality (VR) applications and is actually not event based, but is triggered by the steps of the underlying VR simulation. The tATNs are created using a specific XML based markup language called MIML (multimodal interaction markup language) for easier specification and manipulation, which will be discussed separately in Section 2.2.2.

In [101, 102, 120], Navarre and others present a formal description technique that is designed for modeling complete interactive applications, but parts of it can also be used for describing input fusion. The specification technique is called ICO (Interactive Cooperative Objects) and heavily relies on the use of high-level Petri nets (where tokens can hold values) to describe the dynamic behavior of an interactive system. Using such Petri nets, one is able to formally describe the fusion of user inputs using several types of transitions, places (that hold tokens), as well as different kinds of connections (called arcs) between them. This makes it possible to formally describe concurrency and temporal constraints, and to avoid the combinatory explosion that can happen in simpler state machines [120]. One specific feature of the ICO approach is its formality, as it allows formal reasoning about the correctness of a modeled system, which is of particular importance for the targeted safety critical applications (including air traffic management and military command and control). This may also be the reason why uncertainty is not considered.

In [66], Johnston and Bangalore present an approach using a weighted finite state machine. It is based on a context-free grammar that incorporates speech and gesture lattices as inputs and derives a semantic interpretation from them. According to the authors, this allows for “higher levels of compensation for speech recognition errors” [66], as gesture recognition can directly influence the speech recognition search. Although the FSMs used are less powerful than ATNs (like in [103]), due to their inability to model constraints and actions associated with nodes, they are considerably less computationally demanding and are therefore better suited for mobile and less powerful devices.

Ameri et al. [1] present a framework for incremental processing of multimodal inputs. It consists of a proprietary grammar definition language named COLD (COactive Language Definition) that uses the Prolog language to specify semantics and directly trigger user feedback. Temporal aspects are not modeled directly in COLD, but are implicitly achieved via postponed modality parsing (a fixed possible delay of 2000 ms is used).

In contrast to the approaches above, the work of Hoste et al. [62] is more generic in nature. It considers all levels of information fusion (cf. Section 1.3.1) in a unified approach called MUDRA. It is categorized as a procedural approach here, because fusion specification is done via grammar-like production rules. The idea is that any information processing, no matter on which level it takes place, can add facts to a global fact base in the form of attribute-value pairs. These facts can then be referred to by the production rules to produce other facts or, finally, events for the application. As such, the framework could be more powerful (and more complex) than all approaches presented above. Yet, MUDRA remains a more or less hypothetical approach and no proof of concept is provided.

A similarly generic approach that additionally combines modeling of input fusion and dialog management is presented by Mehlmann and André [92]. Its basis is built from a multimodal event logic (i. e. predicate logic) that stores events with predefined attributes (source modality, timing information, semantic content, and confidence) in an event history that represents some kind of short-term memory. With the definition of predicates, one is able to define relations (e. g. temporal and semantic constraints) and perform logic inference, i. e. finding an event (or even a set of events) for which a specific relation holds true. This logic based model of events is combined with state charts to create some form of dialog management. Transitions of state charts can be annotated with expressions and variable assignments from the event logic. This not only allows incremental and parallel parsing of continuous and discrete inputs, but also some form of disambiguation using logic inference, like “use the most referred object” in the event history, in case there were references to multiple objects. Although no proof of concept in an actual application is provided, the authors mention an implementation in Prolog for the event logic and integration into the state chart mechanism of a self developed framework called SceneMaker [91].
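The flavor of such predicate-based matching over an event history can be conveyed with a small Python sketch. It is a drastic simplification of the Prolog-based event logic; the event attributes follow the description above, while the concrete values and the example constraint are assumptions:

```python
# Sketch of an event history with predicate-style queries combining temporal
# and semantic constraints. Attributes follow the description in the text
# (source modality, timing, semantic content, confidence); values are invented.

from dataclasses import dataclass

@dataclass
class Event:
    modality: str
    start: float        # seconds
    end: float
    content: dict
    confidence: float

HISTORY = [
    Event("speech",  1.0, 1.8, {"action": "select"},     0.8),
    Event("gesture", 1.2, 1.4, {"object": "lamp"},       0.9),
    Event("gesture", 5.0, 5.2, {"object": "thermostat"}, 0.7),
]

def query(history, predicate):
    """Return all events in the history for which the predicate holds."""
    return [event for event in history if predicate(event)]

if __name__ == "__main__":
    speech = HISTORY[0]
    # Constraint: a gesture that temporally overlaps the speech event
    # and provides an object reference missing in the speech content.
    overlapping_reference = lambda e: (
        e.modality == "gesture"
        and e.start <= speech.end and e.end >= speech.start
        and "object" in e.content
    )
    print(query(HISTORY, overlapping_reference))   # -> the "lamp" gesture only
```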

Summing up, procedural approaches are one of the more powerful ways to realize multimodal fusion when it comes to representing complex multimodal interactions. Using specifically modeled states, almost any form of interaction, even with complex relationships, is realizable. But this comes at a cost, in addition to the sometimes complex notations. Most of the time, those state machines or Petri nets are predefined at design time, and often they are closely tied to the specific input modalities, with direct access to intermediate data that is used in the state transitions. This can be problematic when the applied input modalities are some form of third party components with limited access. Another problem is the limited ability to represent uncertainty. Most approaches do not respect uncertainty at all or use simple thresholds, missing a well-founded processing of uncertainties. Another weakness is the fact that procedural approaches tend to redundantly model states of the application, often in a one-way fashion, where the application is only a passive receiver of events. This can be error prone and may lead to inconsistencies.

2.1.2 Frame-Based

Nigay et al. perform semantic fusion in a frame-based approach using so-called melting pots [105, 106]. A melting pot is a 2-D structure (a kind of table), where one dimension represents the structure of commands the system should understand while the other dimension describes the temporal sequence of commands. The fusion itself is done by different agents, each of them being responsible for a certain function in the dialog. In a bottom-up approach, each melting pot is filled up with data applying a set of heuristic rules until all its structural parts are complete. Figure 2.2 illustrates the processing of melting pots using Bolt’s well known “Put That There” command [20]. Additionally, each melting pot has a specific life span restricting the time during which the system tries to fill it up. This mechanism allows fusion of commands that are somehow spread in time or even disrupted by others.

Figure 2.2: Processing of multimodal inputs from speech and mouse using melting pots (cf. [105]). At the bottom of the hierarchy, the workspace agents translate the mouse select actions into a graphical object Obj and a position Pos, respectively. Additionally, Workspace 1 performs a first level of fusion by combining “put that there” with the selected object. A second fusion is performed by the Root agent, obtaining the complete command to be forwarded to the application.
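The core idea of a melting pot with a life span can be sketched as follows (an illustrative reconstruction rather than Nigay et al.’s agent-based implementation; slot names, time stamps, and the expiry interval are assumed):

```python
# Sketch of a melting pot: structural slots are filled from time-stamped inputs,
# and the pot expires if it is not completed within its life span.

class MeltingPot:
    def __init__(self, slots, created_at, life_span=4.0):
        self.slots = {slot: None for slot in slots}   # structural dimension
        self.created_at = created_at                  # temporal dimension
        self.life_span = life_span

    def add(self, slot, value, timestamp):
        if timestamp - self.created_at > self.life_span:
            return "expired"                          # too late, pot is discarded
        if slot in self.slots and self.slots[slot] is None:
            self.slots[slot] = (value, timestamp)
        return "complete" if all(self.slots.values()) else "pending"

if __name__ == "__main__":
    # "Put that there": speech provides the command, mouse selections fill Obj and Pos.
    pot = MeltingPot(["command", "object", "position"], created_at=0.0)
    print(pot.add("command", "put", 0.1))          # pending
    print(pot.add("object", "obj_42", 0.8))        # pending
    print(pot.add("position", (120, 40), 1.5))     # complete -> forward to application
```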

A more recent approach to frame-based fusion is presented by Dumas et al. [41–43] as part of the HephaisTK toolkit. Here, in contrast to the melting pots of Nigay et al., the structural parts and their temporal sequence are specified separately using an XML markup language called SMUIML (Synchronized Multimodal User Interfaces Markup Language). The structural parts (commands) are defined as triggers that specify the modalities and their required values, while the expected temporal order of triggers is defined by so-called transitions. Four temporal relations (parallel-and, sequential-and, parallel-or, sequential-or) can be used therein to describe an expected temporal order and finally a resulting action that will be triggered when the transition (i. e. the frame) is complete. In Listing 2.1, a simplified excerpt of such a specification is shown for a simple music player control. SMUIML itself will be discussed separately in Section 2.2.3.

Listing 2.1: Simplified specification of a frame for a music player control in HephaisTK. By using a transition referencing triggers and actions, a command to play the track following the selected one is specified (cf. [41]).

Other approaches from the frame-based type of multimodal input fusion can be found in [6, 22, 74, 76, 87, 110, 133]. In [74], the focus is on constraint solving when trying to fill in the slots of multimodal frames coming from speech, gestures, and deictic gaze within a geographic map application.

In [6], Bellik highlights the importance of knowing the temporal distribution of events and their temporal proximity for their correct interpretation. In addition she describes problems relating to different response times of input devices and stresses the importance of knowing beginning and end times of instances produced in different modalities and the exact state the recognizers are in (e. g. currently recognizing).

Krahnstoever et al. present a system that actually uses time intervals for associating speech and gesture inputs [76]. Timeouts for associating such multimodal events are constants learned from training data.

The FACET system presented in [22] relies on a component-based approach called Interaction-CARE (ICARE), explicitly realizing the CARE properties (see Section 1.2.3). So there can be components for complementary, redundant and redundant/equivalent inputs that perform frame-based input fusion. For each task the system should support, an explicit combination of specific components is defined, creating a specific input chain for each task. In addition to time stamps as in previous approaches, there is also a confidence factor associated with every produced piece of data, which is, however, only used for ranking redundant and equivalent information. Mansoux et al. [87] use the same ICARE approach for input fusion (as described for [22] above), but extend it to perform multimodal output as well.

The approach described in [133] combines frames (that can be specified in a dictionary) with a so-called semantic network. The idea is that, instead of having agents or explicit trigger definitions that fill the attributes of a frame, individual attributes of a frame are linked to a node in the semantic network. Those nodes get activated by user inputs and thereby fill in the linked frame attributes. As soon as a node is activated, all connected nodes get activated by a specific weighting factor until a frame is completely filled with attributes. This way, Russ et al. avoid explicit temporal constraints and allow an arbitrary order of inputs for a frame. A fading mechanism applied to the activation values of the nodes is used to constrain the temporal distribution of inputs that are considered to belong together. In order to reflect uncertainties provided by the input recognizers, Russ et al. mention the use of so-called n-best lists (ordered lists of possible interpretations) “between the user inputs from the different modalities and the semantic network”, although further details are not given.
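The principle of activation spreading with fading can be illustrated with the following sketch. Node names, link weights, the decay rate, and the threshold are invented; it is not the actual system of Russ et al.:

```python
# Sketch of a semantic network whose node activations spread over weighted links,
# fill linked frame slots when activated, and fade over time. All values are invented.

DECAY = 0.8          # activation multiplier applied per time step (fading)
THRESHOLD = 0.5      # activation needed for a node to fill its linked frame slot

# node -> list of (neighbor, link weight)
NETWORK = {
    "lamp":      [("switch_on", 0.8)],
    "switch_on": [("lamp", 0.8)],
}

def step(activations):
    """Spread activation along links, then let all activations fade."""
    spread = dict(activations)
    for node, value in activations.items():
        for neighbor, weight in NETWORK.get(node, []):
            spread[neighbor] = max(spread.get(neighbor, 0.0), value * weight)
    return {node: value * DECAY for node, value in spread.items()}

if __name__ == "__main__":
    activations = {"lamp": 1.0}            # a user input activates the "lamp" node
    for _ in range(3):
        activations = step(activations)
        filled = [n for n, v in activations.items() if v >= THRESHOLD]
        print(activations, "-> slots filled by:", filled)
```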

Olmedo et al. present an XML markup language called XMMVR (eXtensible language for MultiModal interaction with VR worlds) that is solely intended for the creation of complete virtual reality applications [110]. Using the metaphor of motion picture film, XMMVR allows the definition of a cast with actors and stages, multiple scenes, and behaviors. For specifying scenes and dialogs, existing markup languages like VRML [171], X3D [172] and VoiceXML [165] can be integrated, respectively. Further details of XMMVR will be discussed separately in Section 2.2.5. The behavior description is where the fusion of inputs is specified. There, inputs from different modalities can be mapped to frame-like parameters and values to realize actions the user can perform in a scene, although integrating the specification of interactions, dialogs, and the underlying application into a single markup file seems to be quite complex. Handling of any form of temporal constraints or uncertainty is not mentioned by the authors.

In contrast to procedural approaches, the discussed frame-based approaches have several advantages. The notion of frames with attribute-value pairs is more abstract and often allows application to many different modalities in a similar fashion, decoupling input fusion from the underlying devices. Second, frame-based approaches usually apply an agent- or component-based software architecture facilitating reuse and distributed processing. Third, frame-based approaches often apply a two-way communication with the underlying application or dialog management, where the application state defines the context (i. e. the frames) the fusion currently works on. This supersedes redundant modeling of application states within the fusion system and so mitigates the risk of inconsistencies. A weakness of frame-based approaches is their tendency to ignore the uncertain nature of input components, as the structural parts of frames can only be filled or empty. Although the ICARE approach [22] introduces a confidence factor for each piece of information, its use seems limited to a very basic ambiguity resolution. The concrete usage of the n-best lists mentioned by Russ et al. [133] remains unclear.

2.1.3 Unification-Based

The first approaches by Wahlster [167] and Wilson et al. [175] that may be categorized as unification-based are driven by natural language processing and mostly deal with referent resolution from sequential inputs of speech and subsidiary pointing gestures. While the former gives no further details on the underlying fusion mechanism other than stating the use of a unification-based parser, the latter introduces an important aspect of actual multimodal fusion found in later approaches, namely a Common Meaning Representation (CMR) independent of the underlying modality. In their MMI2 system (Man-Machine Interface for Multi-Modal Interaction with knowledge based systems), the basic assumption is that fusion is achieved by an integrated management of a single generalized discourse context. This means that the input of all modalities is initially converted into an abstract description language, the CMR. Once converted, inputs lead to the creation of new entities in the discourse world or refer to already existing entities by describing them. The applied mechanism of fusion, at that time called mode integration, is adopted from natural language processing, extended to non-linguistic modalities (gestures, direct manipulation), and is limited to referent resolution. In CICERO, Arens and Hovy [3] work model-based and follow a similar approach as in Wilson’s MMI2, as all models are represented in a standard knowledge representation language called Loom. Concerning input, virtual device models are connected to specific media drivers that transform the inputs of different media into the standard knowledge representation format. The main task of fusion is again described as the handling of deictic references (such as “this” and “here” in language). The concept of such references is transferred from natural language processing to other media, similar to Wilson’s approach. When such a deictic referent is encountered in an input of one medium, the most recent inputs of all media are inspected to find matching entries. This is done by applying heuristic rules such as “deictically signaled information of the same conceptual type belongs together”; for example, the spatial utterance “below” and the spatial information given by a pointing device.


Thereby, the search for referents is performed on a tree structure that represents the discourse context and contains a record of every possible referent so far encountered in the dialogue, ordered by importance and recency.

The first approach that is clearly unification-based in the sense used here (i. e. not primarily language driven) is that of QuickSet by Cohen et al. [29, 30, 67]. QuickSet applies an agent-based distributed architecture to realize distributed multimodal applications. As a CMR, QuickSet introduces Typed Feature Structure(s) (TFS) [26] to the task of multimodal fusion, representing information input from different modalities (in QuickSet: speech and gesture inputs). Instead of consisting of attribute-value pairs like frames, feature structures consist of feature-value pairs, where the values of features can not only be atomic (like with frames), but variables or even other feature structures as well. Unification of two feature structures obeys the following rules:

• Each feature of the source structures is present in the unified structure.

• If both source structures share a feature, it must not clash in its value.

• If the values of such a “shared” feature are atoms, they must be identical.

• If one value of such a “shared" feature is a variable, it becomes bound to the value of the feature in the other structure (i. e. the variable is assigned a value).

• If both values are variables, they are bound together, i. e. they will always receive the same value, when unified with another feature structure.

• If both values are themselves feature structures, unification is applied recursively.

A feature structure is called typed when a hierarchically ordered type system is applied to feature structures and atoms. For the unification of such TFS, pairs of feature structures or pairs of atoms that are being unified must be compatible in type, i. e. must be of equal types, or one must be a subtype of the other. After unification, the result is the more specific type in the type hierarchy. In the sense of input fusion, unification of TFS allows the combination of complementary and redundant input expressions, whereas conflicting inputs are not allowed to be combined. To give an example of complementary combination, we re-use one from [30] of a map-based military mission planning application. When the user utters “M1A1 Platoon” in order to create an entity at the indicated position, the speech and gesture agents produce the TFS given in Figure 2.3. The speech input is interpreted as a partially specified creation command, where the location feature is empty, but requires the type point. The gesture input provides this location of type point in its feature structure. After unification, a complete TFS with the location from the gesture input is forwarded to the application.


Figure 2.3: Typed feature structures (TFS) created by the interpretations of the speech input “M1A1 Platoon” and a point interpretation of a gesture (cf. [30]).
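A minimal Python sketch of such a unification operation, applied to the “M1A1 Platoon” example, is given below. Feature structures are modeled as nested dictionaries and an unbound variable as None; the type hierarchy of real TFS is omitted for brevity, so this is a simplification rather than QuickSet’s actual implementation:

```python
# Minimal sketch of unifying feature structures, modeled as nested dictionaries.
# A None value plays the role of an unbound variable; real TFS additionally carry
# a type hierarchy, which is omitted here for brevity.

def unify(a, b):
    """Return the unified structure, or None if the two structures clash."""
    if a is None:
        return b                               # unbound variable gets bound
    if b is None:
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)                       # every feature of a is kept
        for feature, value in b.items():
            if feature in result:
                merged = unify(result[feature], value)
                if merged is None and value is not None:
                    return None                # clash on a shared feature
                result[feature] = merged
            else:
                result[feature] = value        # feature only present in b
        return result
    return a if a == b else None               # atoms must be identical

if __name__ == "__main__":
    speech = {"command": "create_unit",
              "object": {"type": "M1A1", "echelon": "platoon"},
              "location": None}                # required, but not yet bound
    gesture = {"location": {"xcoord": 42, "ycoord": 903}}
    print(unify(speech, gesture))              # complete command with location
```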

The unification of feature structures is also used for ambiguity resolution in QuickSet. In the case of multiple possible interpretations, multiple feature structures that contain different types are generated. For example, when the gesture could also be interpreted as another feature structure where the location is of type line, both interpretations would be forwarded for unification. But only the point-typed interpretation could successfully be unified with the speech feature structure from the above example. Another possibility for ambiguity resolution mentioned by Cohen et al. is the use of probability values provided by the input agents. These can be used to build n-best lists of successfully unified interpretations by multiplying all input probabilities. The maximum is then selected as the joint interpretation. In QuickSet, the unification of TFS is combined with a time constraint mechanism based on timestamps of the beginning and end of input streams. Speech is supposed to follow gesture inputs within a short time interval (based on findings from [118]), otherwise the gesture is interpreted unimodally or even ignored, if no unimodal interpretation is applicable.
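The probability-based disambiguation can be sketched in a similar fashion (a standalone simplification in which type compatibility is reduced to a simple equality check; all interpretations and probability values are invented):

```python
# Sketch of n-best disambiguation in the style described for QuickSet:
# combine every pairing of speech and gesture interpretations whose types are
# compatible, score each combination by the product of the input probabilities,
# and select the maximum. All interpretations and probabilities are invented.

def best_joint_interpretation(speech_nbest, gesture_nbest):
    """Each n-best list holds (required/provided location type, payload, probability)."""
    candidates = []
    for s_type, s_payload, s_prob in speech_nbest:
        for g_type, g_payload, g_prob in gesture_nbest:
            if s_type == g_type:                       # type clash -> combination rejected
                joint = {**s_payload, **g_payload}
                candidates.append((joint, s_prob * g_prob))
    return max(candidates, key=lambda c: c[1]) if candidates else None

if __name__ == "__main__":
    # Speech requires a point-typed location; the gesture is ambiguous (point or line).
    speech_nbest = [("point", {"command": "create_unit", "object": "M1A1 platoon"}, 0.7)]
    gesture_nbest = [("point", {"location": (42, 903)}, 0.6),
                     ("line",  {"location": [(42, 903), (50, 910)]}, 0.3)]
    print(best_joint_interpretation(speech_nbest, gesture_nbest))
    # -> ({'command': 'create_unit', 'object': 'M1A1 platoon', 'location': (42, 903)}, 0.42)
```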

Other approaches from the unification-based type of multimodal input fusion can be found in [55, 93, 100, 123, 152, 166]. Wachsmuth focusses on the communicative rhythm of interaction, especially the temporal coordination between speech and gestural inputs [166]. In her agent-based system, computation of input fusion is performed at fixed time cycles of 2 seconds, mainly for the purpose of resolving deictic references. Fusion then happens either within a time cycle, or spanning time cycles, when referent resolution is incomplete. Wachsmuth expects such “rhythmic” systems to have advantages in users’ comfort, because of their anticipatory response time, in contrast to classical systems that respond as fast as possible but at indeterminate time.

Holzapfel et al. [55] present a system that allows a more formal and explicit control over the fusion process than QuickSet. It adds a rule system to the input fusion to explicitly constrain the parsing of inputs via temporal, modality, content, or script constraints. While temporal constraints can access time stamp information (which is not stored in the TFS themselves), modality constraints can be

used to restrict fusion to specific modalities. Content constraints can be used to define the minimum information that needs to be present in a matching TFS. In addition to these rather simple constraints, which are also specified via TFS, script constraints can be defined in Python and can access any available information to produce a boolean value indicating the fulfillment of the constraint. Such script constraints can also make use of variables, a property of TFS not used in previous approaches. Thus even complex types of fusion could be specified that go beyond the possibilities of the basic unification of TFS used in QuickSet, although Holzapfel et al. provide no examples that actually exploit this. As in QuickSet, n-best lists built from input probabilities are used to resolve ambiguities.

The approach called PATE (Production rule system based on Activation and Typed feature structure Elements) from Pfleger [123] is similar to that of Holzapfel et al. in extending the unification of TFS by a rule system. The system uses production rules that operate on all incoming data stored as TFS in a Working Memory (WM). In addition to the WM, there is also a so-called goal stack that represents the current focus of the system and supports the identification of rules that potentially fire in a particular situation. The rules contain conditions and actions that are executed when a rule is fired. The conditions define patterns that have to be met by objects of the WM, including qualitative and temporal restrictions, similar to the constraints of the system by Holzapfel et al., but missing the power of the script constraints defined there. The actions either produce some output of the system or change the configuration of the working memory by pushing, popping, updating, adding, or deleting elements of the goal stack or the WM, respectively. N-best lists built from input probabilities are used to rank elements of the WM via a so-called activation property that fades over time.

In [100], Milota deals with the specifics of graphic design applications with a high demand on accuracy (like CAD systems, 3D modelers, etc.). Using a complex parsing system based on a BNF-style grammar, the modality fusion builds alternative parses with a plausibility score. The author does not describe in greater detail how such scores are computed. In contrast to other (mostly map-based) applications, Milota stresses the difference in how hypotheses from speech and gesture inputs are combined. While other systems come up with the final solution by combining n-best lists of both inputs to decide on the most plausible result, or use gesture inputs to constrain possible speech inputs, the system described by Milota uses the alternative hypotheses provided by the speech input to constrain the search for plausible gesture interpretations (i. e. exactly the other way around). Milota motivates this with the fact that in graphical design applications there are often impractically large numbers of potential interpretations. The described system also allows for variables, like the other approaches mentioned before. Although not relying on typed feature structures, this approach is classified as unification-based, as the complex parsing is some form of unification. In

contrast to the previously mentioned approaches, Milota does not provide any information on using temporal information for input fusion.

The approach of Melichar et al. [93] is purely based on the Wizard of Oz (WOZ) technique, where a human (the wizard) controls all information flow between the user and the system. Based on a unimodal vocal dialog management, Melichar et al. describe its extension to some form of multimodal dialog management, where previously voice-only generic dialogue nodes, i. e. the underlying task model, were extended with explicit information for modalities typically found in graphical user interfaces (like graphical widgets for output, and mouse and keyboard for input). Although their model of representing user inputs is similar to a typed feature structure, as it contains attribute-value pairs and a simple type hierarchy, the approach's classification as unification-based (as in [78]) remains doubtful, besides the fact that any multimodal input fusion seems to be performed by a human wizard, on which Melichar et al. do not elaborate in their work.

In [152], Sun et al. present a so-called Mountable Unification-based Multimodal Input Fusion within the context of a map-based application for traffic control. Its main feature is the ability to mount (i. e. load) a specific unification grammar based on the current context (i. e. the state of the application). The grammar is used to specify the unification of TFS coming from two input modalities, in this case speech and gesture inputs. Sun et al. demonstrate that unification can handle all possible temporal patterns of inputs, i. e. speech preceding gesture, gesture preceding speech, and temporally overlapping inputs as well. Uncertainties from the input modalities are considered in a subsequent work of Sun et al. [153] (the approach is now called Polynomial Unification-based Multimodal Parsing Processor (PUMPP)), where speech input n-best lists are combined with (assumed certain) gesture inputs to select the final input result, similar to the way of QuickSet described above. In [154], Sun et al. extend their original approach to deal with so-called spare inputs. These inputs may stem from unintended user utterances (in any modality), a failure in the recognition of a single modality, or from completely unexpected inputs. They describe a technique to skip these spare inputs by creating a matrix representation of all possible valid combinations and selecting the most appropriate result from these. This way, spare inputs are automatically ignored (if possible) and a valid input is forwarded to the dialog management (i. e. the application). According to the authors, this could avoid the need for the user to repeat inputs wrongly ignored by the system due to subsequent spare inputs.
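The core idea of skipping spare inputs can be illustrated by the following sketch, which simply enumerates sub-combinations of the received inputs rather than building Sun et al.'s matrix representation; the vocabulary and the validity test are purely illustrative.

# Sketch of the "skip spare inputs" idea (not Sun et al.'s actual matrix
# algorithm): enumerate sub-combinations of the received inputs, keep those
# that a validity test accepts, and forward the largest valid one.
from itertools import combinations

received = ["zoom", "cough", "*circle-gesture*", "here"]   # "cough" is spare

VOCAB = {"zoom", "pan", "*circle-gesture*", "here"}

def valid(combo):
    # stand-in for grammar/unification: only known tokens are allowed, and a
    # command word plus a gesture must both be present
    return set(combo) <= VOCAB and "zoom" in combo and "*circle-gesture*" in combo

candidates = [c for r in range(len(received), 0, -1)
                for c in combinations(received, r) if valid(c)]
print(candidates[0] if candidates else None)
# -> ('zoom', '*circle-gesture*', 'here'); the spare "cough" is skipped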

Summing up, unification-based approaches differ from frame-based approaches by introducing a much more formal and abstract model of inputs. This is done in different "flavors", be it a CMR in a proprietary notation, the formalism of TFS, or some form of grammar. Instead of the mostly hard-coded fusion found in frame-based approaches, this category relies on a more generic

and reusable unification operation, i. e. a fusion mechanism that follows generic rules, often utilizing features of a type hierarchy (like sub-typing) to implicitly construct valid fusion results. This way, not only binding and assignment of variables, but also constraints (e. g. temporal, modality-specific, or even scripted) can be applied in the unification operation. Often, unification is extended by some form of uncertainty handling, e. g. in the form of n-best lists. Some approaches, like the one from Sun et al. [152–154], also allow context-specific changes of the unification operation's rules, explicitly promoting collaboration between the input fusion and the dialog management (or application). This helps separating and delegating responsibilities among the components of a system. However, the increased power of unification-based approaches is also accompanied by risks originating from the increased complexity. This includes a potentially difficult search for the root cause in case of unexpected fusion results; something that can easily happen with large type hierarchies and side effects of variable binding. Besides, there may be many application domains that do not require such complex, error-prone modeling of inputs and input fusion.

2.1.4 Hybrid and Statistical

The category of hybrid and statistical approaches is one of the more recent types of input fusion. It contains approaches that perform a unification operation on frame-like data structures and/or apply basic statistical analysis or more sophisticated machine learning techniques. The first work that belongs to this category is that of Flippo et al. [47] from 2003. They present a framework that focusses on reusability to minimize the effort for implementing a multimodal interface. For this, they use the notion of frames similar to the frame-based approaches discussed above, but apply a unification-like fusion that is configurable via an XML file. It contains declarative specifications of the frames, pointers to modality-specific reference resolvers to fill a frame's slots, and calls to the application's API in case a frame is complete. Listing 2.2 shows an exemplary declaration of such a frame with pointers to various resolvers and the resulting API call to the underlying map application. Similar to other approaches that focus on geographic map applications, speech input is the main intended interaction modality, whereas other modalities (so-called context providers like gesture and mouse inputs) play a subsidiary role and are mostly used for reference resolution. Due to this, the system's main input is a parse tree coming from the speech interface that is transformed into slots of a frame using reference resolution when applicable. Figure 2.4 depicts this form of unification, whose result is forwarded to a dialog manager that is finally responsible for making calls to the underlying application's API. Uncertainty is handled via weighted confidence scores from the context providers resulting in some form of n-best list, but further details are not given by the authors. The same holds


for temporal constraints, as only the context provider's ability to "... provide timestamps ... [that] can be used by the resolvers" [47] is mentioned.

if (frame.glyph) { application.api.deleteGlyph(frame.glyph.glyph); }

Listing 2.2: Specifying a frame for a map application in the approach of Flippo et al. The slots of a frame can be filled by various resolvers that can be assigned weights for decision making (cf. [47]).

Figure 2.4: The fusion process where a parse tree from the speech interface is transformed to slots of frames using modality-specific resolving agents (cf. [47]).

In [124], Portillo et al. present a hybrid approach based on frame-like structures called DTACs (Dialog move, Type, Arguments, Contents) that contain attributes specific to spoken command language dialogs. The unification of these DTACs is performed based on a set of rules that includes a basic inspection of temporal constraints (here: the proximity of events in time). No references to the usage of probabilities are given.


The approach by Duarte et al. called FAME (Framework for Adaptive Multimodal Environments) [39] does not contain an actual input fusion system, but takes a broader perspective on the software engineering of multimodal and adaptive systems, which is outside the scope of this work. Due to the basic idea of continuously observing changes explicitly made by the user (e. g. on the layout of a GUI) or the environment (e. g. noise level), and reflecting these in the future behavior of the system, Lalanne et al. categorize it as a hybrid approach [78]. However, Duarte et al. do not provide any details on how this adaptation could actually be realized for the input fusion, except mentioning weights for modalities and an unspecified influence of "patterns of integration".

An approach from the human-robot interaction domain is presented by Reddy & Basir [127]. For modeling events and their possible combinations, so-called conceptual graphs are used. These are bipartite graphs containing all possible events as nodes and their valid combinations as edges. In contrast to other fusion approaches that use simple probabilities and n-best lists, Reddy & Basir propose the use of evidential reasoning on beliefs using the Transferable Belief Model (TBM) [149]. The advantage of this formalism of uncertainty is that modality recognizers can assign probabilities (called beliefs) not only to single events, but to multiple events at the same time. The underlying semantics is that a recognizer cannot distinguish between those events and equally believes that one of them happened, without the necessity to distribute confidence among all candidates, as is usually the case when working with probabilities. When two or more modalities express their beliefs, they can be formally combined in a unification operation to come up with the most probable result. Although the approach seems promising, it remains in a conceptual state and some important aspects of multimodal input fusion, like the actual combination of events (in addition to calculating a combination's belief) and temporal aspects, are not covered by the authors.
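The key ingredient of such evidential reasoning, namely the combination of beliefs expressed over sets of events, can be sketched with the TBM's unnormalized conjunctive rule as follows; the events and mass values are invented for illustration and the sketch does not reproduce Reddy & Basir's conceptual-graph approach.

# Sketch of the TBM's unnormalised conjunctive combination (illustrative data):
# each modality assigns mass not only to single events but to sets of events it
# cannot distinguish; the combined mass of a set C sums m1(A)*m2(B) over all
# pairs with A ∩ B = C, with mass on the empty set expressing conflict.
from collections import defaultdict

speech  = {frozenset({"play", "pause"}): 0.7,          # cannot tell play/pause apart
           frozenset({"stop"}): 0.3}
gesture = {frozenset({"play"}): 0.6,
           frozenset({"play", "stop"}): 0.4}

combined = defaultdict(float)
for a, ma in speech.items():
    for b, mb in gesture.items():
        combined[a & b] += ma * mb

for events, mass in combined.items():
    print(set(events) or "conflict (empty set)", round(mass, 2))
# play: 0.42 + 0.28 = 0.70, stop: 0.12, conflict: 0.18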

One approach that uses actual machine learning for multimodal input fusion is that of Dumas et al. [46]. Based on their previous work on frame-based fusion with the HephaisTK toolkit (see Section 2.1.2), they propose an automatic creation of Hidden Markov Models (HMMs) from an SMUIML description, which are used to assess the temporal conformity of input sequences. The applied machine learning thus focusses on temporal aspects, while the semantic fusion of inputs is not affected. As described by the authors, "Different application states [...] are modeled with help of one HMM for each state" [46], thus allowing only a single type of command to be handled at once. This may be useful for some applications (e. g. drawing applications with separate modes for different commands like free line drawing or shape editing), but may not be feasible where users could possibly perform different types of inputs in a single application state. In addition, the approach seems to favor recognizing a certain temporal sequence of inputs (as stated in the SMUIML specification), but often different sequences of inputs should be valid as well. As a concrete example, Bolt's "Put


Figure 2.5: Two valid input sequences of the "Put That There" command (cf. [46]).

That There" command [20] could be modeled by different sequences of inputs. One model could be a succession of the following input events: “put" *point* “that" *point* “there". Another sequence could be: *point* “there" “put" *point* “that" being equally valid from a user’s perspective. Fig- ure 2.5 illustrates these different sequences by the red dashed and blue dotted arrows, respectively. It remains unclear, whether and how a single HMM can be effectively used for both sequences. Regarding the actual implementation of the approach, the unknown parameters of each HMM are directly derived from the SMUIML specifications, resulting in weighted state machines, that can be directly used. Training with actual user data is not reported, as well as the usage of uncertainties provided by individual modality recognizers.

One important aspect mentioned by Dumas et al. is the inclusion of user-induced automatic adaptation according to the taxonomy of adaptive user interfaces from Malinkowski et al. [86]. This means allowing the user to signal that a system's output was not as expected. Assessing the actual output referred to by the user, possible corrections, and their execution is left to the machine. Such direct user feedback seems a valuable addition to input fusion approaches. Although describing initial tests with their HMM-based approach on this matter, Dumas et al. state the following open challenges for future work in multimodal interfaces (cf. [46]): identification of the problematic fusion result, identification of the error source, creation of a meaningful correction, and the possible need for undo operations. Moreover, the authors acknowledge the possibility of recognition mistakes of particular modalities and the need for collaboration between the input fusion and the individual modality recognizers: "[...] downgrading a fusion result at the fusion algorithm level could worsen the fusion results rather than improving them. On the other hand, [...], the individual modality recognisers [should] have the possibility to make use of a user's feedback to correct their results." [46].


The authors conclude that future tests could identify (unspecified) thresholds at which such user feedback should be delegated either to the input fusion or to an individual recognizer.

Summing up, hybrid approaches that combine frame-based modeling with a unification operation still rely on reusable, generic rules that often allow the specification of constraints, but dismiss the complexity of typed feature structures (and some of their features, like variable binding and sub-typing) that may often be unnecessary overhead. Though the unification operation of hybrid approaches often misses the formal background of the "real" unification-based approaches, they seem to be easier to apply. Approaches that (additionally) apply some form of statistical processing promise an adaptive and robust form of input fusion, though actual proof is still missing. Although some approaches can be used out of the box (e. g. [46]), they would still require an adequate amount of training data [80] and some form of user feedback to reach their potential. This has also been acknowledged by Dumas et al., who mention the need for collaboration between input fusion and the modality recognizers once user feedback indicates an error [46]. Furthermore, they are the first to acknowledge the problem of recognition mistakes of particular modalities (ibid.). However, when machine learning is applied as a black box, it may be hard to pinpoint the reason for unexpected behavior.

2.1.5 Classification and Summary

To give a compact overview, all fusion approaches are classified in Table 2.1 according to the following criteria. The approaches are first sorted by the applied Input Fusion Category, resulting in four clusters (procedural, frame-based, unification-based, hybrid/statistical); within each cluster, a chronological order is maintained. In addition to the Reference(s) and Year of (first) publication, the name of the approach or the deployed system is given in the Name/System column. The Notation column shows the language or formalism used to specify the fusion system. The Overall Scope column shows pictograms representing the coverage of the Arch/Stormy Tree models, namely physical interaction, logical interaction, dialog, functional core adapter, and functional core (see Section 1.2.1). This should give an idea of what functional aspects of an interactive system can/must be realized within the respective approach and what aspects can/must be realized otherwise. The Exemplified Modalities column lists the input modalities that are described within the approach, be it as examples or actually implemented.

Uncertainty Handling shows whether and how an approach deals with ambiguous and uncertain inputs. Many of the approaches tend to completely ignore uncertain inputs (marked by none). This may not be a problem when dealing with inputs that are certain by nature, like keyboard, mouse, or pen. But

more often than not, multimodal interaction should facilitate the use of more "natural" inputs that rely on classification-based recognition techniques like speech, gesture, and natural pointing, which are often uncertain in nature. Handling of such uncertainties can be done using simple probability thresholds, where inputs not surpassing a certain threshold are ignored. Some approaches take such uncertainty into account by creating n-best lists. This is usually done by multiplying probabilities (or confidence scores) of combined inputs and building a list of all possible combinations, from which the most probable combination is then used as the final result. With the TBM, only Reddy & Basir propose using a more advanced formalism for dealing with uncertainties and incorporating it within the fusion process itself.

The Time Handling column exhibits the way time is dealt with within the approach. Two criteria are given here: Annotation and Restriction. Annotation denotes the way modality inputs (e. g. recognition events) are internally annotated with quantitative timing information. Some approaches indicate no explicit annotation at all (marked by none), i. e. they handle inputs as soon as and in the order they are received by the system. Other approaches use time stamps to denote the exact time an input was created. This can be used for time calculations (e. g. temporal windows) and sometimes to reconstruct the original order of events, regardless of any delays that may have been introduced by the different recognizers or by communication. Another possibility is to use time intervals that allow a more in-depth analysis of temporal aspects, as they contain both the start (onset) and the end (offset) of an input event. Thus, different measures like temporal overlap or onset distances between inputs can be calculated. The Restriction criterion is used to break down the way input fusion is temporally restricted, i. e. how possibly combinable inputs are decided to be merged or processed separately depending on temporal aspects. Approaches that are temporally unconstrained do not impose any temporal restrictions at all and therefore allow combination of inputs in any order and temporal distribution. Approaches marked with sequence consider the temporal sequence of inputs, but not their temporal distribution, e. g. when a grammar is used to parse inputs. When time windows are used, a maximum time delta between combined events can be specified and is often also used for defining a temporal order. When restriction is done by time cycles as in [166], combination is restricted to events occurring within a predefined number of time cycles. Temporal fading is used in [133], where no strict time limit is specified; instead, an event's initial activation value is gradually decreased by a time-dependent function. As a consequence, the probability of combining that event with others decreases as well. The most complex form of temporal restriction can be achieved by using temporal constraints that can make use of arbitrary calculations to decide whether inputs are to be combined.
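The difference between time stamps, time intervals, and a time-window restriction can be made explicit with a small sketch; the event times, the window size, and all helper names are hypothetical.

# Sketch of the time-handling notions used in Table 2.1 (hypothetical values):
# time stamps only allow distance and order checks, while intervals additionally
# allow overlap measures; a time-window restriction compares onset distances.
from dataclasses import dataclass

@dataclass
class Input:
    onset: float    # seconds
    offset: float

speech  = Input(onset=0.2, offset=1.1)
gesture = Input(onset=0.9, offset=1.4)

def within_window(a, b, max_delta=4.0):
    """Time-window restriction on onset distance (time stamps would suffice)."""
    return abs(a.onset - b.onset) <= max_delta

def overlap(a, b):
    """Temporal overlap in seconds (requires intervals, not just time stamps)."""
    return max(0.0, min(a.offset, b.offset) - max(a.onset, b.onset))

print(within_window(speech, gesture), round(overlap(speech, gesture), 2))  # True 0.2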


The last column lists Exemplified Applications that are used to elucidate the approach. As with the exemplified modalities criterion, this includes both conceptualized and implemented/prototyped applications.

As shown in the above classification, there exists a vast number of approaches covering a wide variety of application scenarios. When inspecting the respective coverage of the Arch/Stormy Tree models, it is noticeable that many approaches cover more than just the logical interaction, but also extend to dialog management or even cover the complete functional scope of an interactive system including physical interaction and functional core (cf. Section 1.2.1). While this can be useful when creating a distinct application from scratch, it can cause problems when input fusion is to be used as part of a more complex system. In such systems, like companion systems, there exist specialized components for the functionalities of physical interaction and dialog management that are created for that very purpose. The same holds true for the functional core and its adjacent functional core adapter, as companion systems are supposed to rely on technologies that can be reused for a vast number of applications.

Thus, multimodal input fusion for companion technology should focus on logical interaction and have minimal interfaces to the applied physical input devices and the dialog management, respectively. Only three of the reviewed approaches limit themselves to logical interaction, namely those of Sun [152–154], Bellik [6], and Reddy & Basir [127]. Sun's approach relies on grammar-based parsing that can only handle strict sequences of inputs and uses simple n-best lists for dealing with uncertainty. Though the latter two approaches are only conceptual, at least the sophisticated handling of uncertainties from Reddy & Basir seems promising. Uncertainties occurring in user inputs are especially relevant for companion systems, as striving for natural ways of interaction often goes along with recognition-based technologies that are subject to uncertainty. Thus, none of the existing approaches can be used directly as input fusion for companion technology, and a new approach is necessary. Another aspect that can be identified as a weakness of the state of the art, at least with companion technology in mind, is the fact that known approaches tend to ignore users' individuality and are intended as "one fits all" systems. Not respecting individual preferences regarding modality use and temporal correlations between modalities may unnecessarily limit the overall robustness and usability for individual users. While most of the reviewed approaches allow modeling of temporal restrictions, their concrete instantiations often rest on heuristics and best guesses derived from limited user tests. Both aspects, focussing on the logical interaction with formally correct handling of uncertainties, and user individuality, are the basis for the work at hand.

Table 2.1: Overview of approaches for multimodal input fusion. For each of the 34 approaches the table lists Reference(s), Year, Name/System, Input Fusion Category (procedural, frame-based, unification, hybrid/statistical), Notation, Overall Scope (Arch/Stormy Tree coverage), Exemplified Modalities, Uncertainty Handling (none, probability threshold, n-best lists, TBM), Time Handling (Annotation: none, time stamps, time intervals; Restriction: unconstrained, sequence, time windows, time cycles, temporal fading, temporal constraints), and Exemplified Applications.


2.2 Description Languages for Input Fusion Engines

Apart from the general approach to multimodal input fusion as presented in the previous section, another important aspect is how a fusion engine is configured or provided with data. This is usually done through a description language, as this facilitates reuse and modularity. This section presents different languages surveyed in the literature; most of them are markup-based. While some of them are User Interface Description Languages (UIDLs) that cover different aspects of a system's UI in a broader sense, others clearly focus on the task of multimodal input fusion.

2.2.1 EMMA

EMMA (Extensible MultiModal Annotation markup language) is an XML markup language for describing and annotating user input of a multimodal system, developed by the W3C [162]. EMMA descriptions are supposed to be generated by signal interpreters such as speech and gesture recognition for use by components that act on those inputs, such as an input fusion engine. The output of a fusion engine may also be in EMMA format to be passed on to higher-level components like an interaction or dialog manager. Examples of interpretations of user input are a transcription into words of a raw signal, for instance derived from speech, pen, or keystroke input, a set of attribute/value pairs describing their meaning, or a set of attribute/value pairs describing a gesture. Listing 2.3 shows an example of an ambiguous speech input to a flight reservation application (taken from [162]). In this example there are two speech recognition results and associated semantic representations. The system is uncertain whether the user meant "flights from Boston to Denver" or "flights from Austin to Denver". The annotations to be captured are timestamps and confidence scores for the two inputs.

While EMMA is a very powerful and flexible interchange format between components of a multimodal system, it is not intended to describe the actual fusion of multimodal inputs. That is why some researchers use their own language to describe the fusion engine's task and use EMMA only for describing initial inputs and expected fusion results, as in [41] with SMUIML. Nevertheless, EMMA has a feature worth mentioning: it explicitly allows embedding application-specific markup inside its tags using XML namespaces (see the xmlns attribute of the root element in Listing 2.3), so it can easily be used for diverse application scenarios.


Listing 2.3: Example of an EMMA interpretation containing ambiguous input (cf. [162]).
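As an illustration of how such an EMMA document might be consumed, the following sketch embeds a simplified approximation of the flight example of Listing 2.3 (the element and attribute names follow the EMMA 1.0 recommendation; the concrete values and the selection logic are illustrative) and picks the most confident interpretation, as a fusion engine might do.

# Simplified approximation of the ambiguous EMMA input of Listing 2.3 and a
# consumer that picks the most confident interpretation (illustrative values).
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"
doc = f"""
<emma:emma version="1.0" xmlns:emma="{EMMA_NS}">
  <emma:one-of id="r1" emma:medium="acoustic" emma:mode="voice"
               emma:start="1000" emma:end="2500">
    <emma:interpretation id="int1" emma:confidence="0.75">
      <origin>Boston</origin><destination>Denver</destination>
    </emma:interpretation>
    <emma:interpretation id="int2" emma:confidence="0.68">
      <origin>Austin</origin><destination>Denver</destination>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>"""

root = ET.fromstring(doc)
interpretations = root.findall(f".//{{{EMMA_NS}}}interpretation")
best = max(interpretations, key=lambda i: float(i.get(f"{{{EMMA_NS}}}confidence")))
print(best.get("id"), [(c.tag, c.text) for c in best])
# -> int1 [('origin', 'Boston'), ('destination', 'Denver')]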

Figure 2.6: Excerpt of a tATN showing multimodal possibilities to rotate an object in VR (cf. [79]).

2.2.2 MIML

MIML is an XML-based language developed by Latoschik [79] for describing multimodal interactions for VR applications in a procedural approach (see Section 2.1.1). It serves as an XML representation of a tATN. In contrast to classical FSMs, transitions of an ATN can contain conditions that guard their activation. For this, registers that store state-dependent values can be read out. tATNs are ATNs extended by temporal constraints. Figure 2.6 shows an excerpt of such a tATN describing the interactions of a VR construction application. Here, the user can rotate an object either using speech in different flavors (branch R1->R2->R32) or using a combination of a modal adverb (e. g. "this way") and a temporally overlapping rotation gesture (branch R1->R2->R41). The MIML description of such a user interaction is done with the following XML elements:


This element encapsulates an interaction and contains the elements described below.

Contains all parameters relevant for the interaction in terms of parameter elements:

Describes a single parameter by its name and type that is optionally complemented by a default value.

Contains the actual description for fusing multimodal events specified by the following elements:

A temporal constraint that must hold for the contained elements (before, sequential, overlap).

Branches to a sub-tATN.

Listing 2.4: MIML description of a multimodal interaction for rotating an object (cf. [79]).

2.2.3 SMUIML

SMUIML is an XML-based language by Dumas et al. [41] used to describe multimodal interactions for the HephaisTK toolkit (see Section 2.1.2). The event level describes incoming triggers per modality and outgoing actions. In the dialog level, the actual fusion of multimodal inputs is described by means of FSMs that perform actions as soon as the specified triggers fire. The declaration of the triggers that activate a transition allows for the specification of temporal constraints between modalities using four different synchronization elements partially corresponding to the CARE properties (cf. Section 1.2.3):


All triggers have to appear in the defined time window, not taking their order into account. This represents Complementarity and describes complementary multimodal expres- sions.

All triggers have to appear in the defined time window and in the given order. This has no direct correspondence in the CARE properties and is a special case of multi-unimodal expressions.

Each single trigger is sufficient for the transition to be executed, but it allows parallel appearance. This represents Redundancy and describes redundant multimodal expressions.

Each single trigger is sufficient for the transition to be executed, but only one of them is allowed within the given time window. This represents Equivalence and describes multi- unimodal expressions.

Listing 2.5 shows a SMUIML document modeling an interaction with a music player application. The user can use pointing gestures to point at a track and speech to issue commands for playing ("play") and skipping to the next track ("next").

The single transition modeled in the given example is that of pointing at a track and asking to play the next track after it, when gesture and speech triggers appear in that order within a time window of 1000 milliseconds. Similar to MIML, SMUIML models the parts of the dialog, even more explicitly than the former. It accents modality relationships and synchronization aspects with fine-grained temporal relations and seems more readable and versatile, as long as one assumes the set of possible modalities to be extensible. In addition, it better decouples the actual implementation of the application, as no direct function calls but only messages are defined in SMUIML. Again, any form of representing or handling uncertainty is not envisaged.
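The four synchronization semantics described above can be approximated by simple predicates over time-stamped trigger events, as in the following sketch; the function names are hypothetical and do not reproduce SMUIML's element names.

# Sketch of the four SMUIML-style synchronisation semantics described above,
# as predicates over (trigger_name, timestamp_ms) events; the function names
# are hypothetical and do not reproduce SMUIML's element names.
def in_window(events, window_ms):
    times = [t for _, t in events]
    return bool(events) and max(times) - min(times) <= window_ms

def complementary(events, required, window_ms):     # all triggers, any order
    return set(required) <= {n for n, _ in events} and in_window(events, window_ms)

def ordered(events, required, window_ms):           # all triggers, given order
    names = [n for n, _ in sorted(events, key=lambda e: e[1])]
    return names == list(required) and in_window(events, window_ms)

def redundant(events, accepted, window_ms):         # each trigger sufficient, parallel ok
    return any(n in accepted for n, _ in events) and in_window(events, window_ms)

def equivalent(events, accepted, window_ms):        # exactly one trigger in the window
    hits = [n for n, _ in events if n in accepted]
    return len(hits) == 1 and in_window(events, window_ms)

events = [("point_track", 0), ("say_play_next", 700)]
print(ordered(events, ["point_track", "say_play_next"], window_ms=1000))   # True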

2.2.4 XISL

With XISL (eXtensible Interaction Scenario Language), Katsurada et al. developed an XML-based language for web-based multimodal interaction systems [72]. By reusing tags from VoiceXML [165] and SMIL [164], the language offers dialog flow/transitions, synchronization of input/output modalities, and modality extensibility. The basic dialog with the user is defined using a

tag that contains elements. Those, in turn, are composed of a element for initial system output, an element for user input, and an element for the system’s reaction to the input. For time synchronization of input modalities the comb attribute can take three different values:

par: All inputs have to appear simultaneously. That is a complementary multimodal expression.
alt: Inputs are accepted alternatively.
seq: Inputs have to appear in document order.

Listing 2.6 shows an example document of an online shopping system for a PC terminal, where the user is supposed to select an item from a list. First, the system prompts the item list in a browser and vocally requests a selection using TTS. Then it requires the user to select an item from the list using either touch or speech. Upon acceptance, the system confirms the selected item vocally and proceeds to the next dialog.

Listing 2.5: SMUIML description of one interaction with a multimodal music player (cf. [41]).



Listing 2.6: XISL description of two alternative inputs to select an item from a list of an online shopping system (cf. [72]).

Modality extensibility in XISL is achieved by changing the descriptions of the input and output tags to account for different modalities in different execution systems. In [72], Katsurada et al. provide a corresponding example for a voice-only interaction via a telephone.

Summing up, in contrast to MIML and SMUIML, XISL not only specifies user inputs, but also describes system outputs while separating content (e. g. stored in HTML) from the interaction (described in XISL) [151]. Nevertheless, regarding the capabilities of input fusion, some details remain unclear. Regarding time synchronization, XISL does not seem to support redundant multimodal inputs (see Definition 6); only complementary multimodal and multi-unimodal inputs are described (see Definitions 7 and 9). Furthermore, Katsurada et al. do not explain how the described synchronization elements are actually realized, i. e. how the system decides if inputs are parallel or sequential. The same holds for the handling of uncertainty, as it is not mentioned in [72].

2.2.5 XMMVR

XMMVR is a markup language by Olmedo et al. for their frame-based approach to multimodal input fusion for VR environments [110]. For this, XMMVR uses the movie metaphor to stage components of a 3D environment in a scene and describe their behavior and their interaction. In contrast to XISL, where tags from other languages are only reused (with sometimes altered semantics), XMMVR directly allows using existing languages for specifying 3D worlds (VRML [171], X3D [172], or W3D files from Adobe Shockwave) and spoken dialogs (VoiceXML [165] or X+V [161]) through references to the respective files. A complete specification in XMMVR consists of three elements:

Defines all entities of the VR world. It is a set of actor and stage elements whose graphical appearance is specified by a VRML, X3D, or W3D file. While a stage represents only the surroundings of the 3D world, an actor can have its own named parameters with values, methods (function names) that can be called on the underlying application, and a behavior that describes the actor's specific behavior.

This element represents the actual world in which the application occurs and consists of scene elements executed in their order of appearance. A scene in turn must indicate a single stage and one or multiple actor instances from the cast, and can describe a behavior for itself.

Defines system variables, parameters, and speech dialogs using VoiceXML or X+V.

For the purpose of this work, the description of behaviors in XMMVR is worth examining in greater detail. A behavior describes actions (of actors or scenes) that are performed when a particular event happens under a certain condition. For this, an event element is associated with an element that contains a cond attribute describing the constraint that must hold and one to several action elements that are executed thereafter. For events and actions, "XXX" in the element names can be GUI (graphical), VUI (voice), SYS (system), or ACT (actor instance), depending on the source or target, respectively. To make this easier to understand, we reuse an example given by Olmedo et al. of Bolt's "Put That There" interaction [20] taking place in a 3D multimodal museum

of informatics. The user is able to place a "Macintosh" item in the upper/middle/lower shelf of the museum using GUI (mouse click) or VUI (voice) input. Listing 2.7 shows an XMMVR excerpt of the "Macintosh" item and the "UpperShelf" and their elements that represent the item movement when the user first clicks the item and then the shelf while saying "Put that there". As shown in the listing, Olmedo et al. use semaphores to track variable assignment ('what' and 'where') until the actual movement is executed (see the "clickUpperShelf2" behavior). What is interesting about the example is the fact that even though Olmedo et al. claim to show cooperation between modalities with it, they only demonstrate using different modalities to fill in the 'what' and 'where' variables (GUI only as in the listing, VUI only, GUI+VUI, and VUI+GUI are also shown in [110]). This is not the same as fusing modalities in a complementary way as in a deictic reference (vocal "that" + mouse click), which is what other researchers usually refer to when employing Bolt's "Put That There" (cf. [46, 105]). The example shown in the listing actually does not take speech into account at all.

Although XMMVR can be used to specify different ways to interact, actual complementary multimodal expressions (see Definition 7) have not been demonstrated by Olmedo et al. and could become quite complex, if possible at all. Leaving the focus on VR scenarios aside, some other aspects of XMMVR could also be considered weaknesses, such as the use of predefined modalities (currently only VUI and GUI interaction modes) and an intermix of the overall dialog flow specified within XMMVR and externally referenced voice dialogs. Temporal synchronization between inputs and handling of uncertainties are not considered.

2.2.6 Summary on Description Languages

There are basically two aspects relevant for multimodal input fusion that can be covered using a description language: the data that enters and leaves input fusion, and the specification of the fusion itself. Concerning the data that enters or leaves input fusion, most of the languages discussed above define their own model, often using predefined and fixed modalities (e. g. MIML, XMMVR), not taking any form of uncertainty into account, and leaving out temporal synchronization of inputs (e. g. by means of timestamps). While EMMA is very powerful and versatile in this regard, its definition is quite complex and may lead to lengthy documents hardly readable by humans. One aspect that sets EMMA apart from the other description languages is its explicit use of application-specific markup that facilitates use in arbitrary scenarios using XML namespaces. Without exception, the other languages are isolated definitions that define their own mechanics and markup elements for making this


connection, be it in the form of explicit function names (MIML, XMMVR), singular message strings (SMUIML), or even a complete application specification within the description language (XISL).

Listing 2.7: XMMVR excerpt of the "Put that there" interaction to move item "Macintosh" into the "UpperShelf" (cf. [110]).

Regarding the actual fusion specification, most of the existing languages use some form of event-condition-action specification that relates multimodal input events to an output. These specifications usually include control of the dialog flow covering complete interactions (see Definition 1), an aspect not relevant for input fusion in the context of companion systems, as one could expect a designated

dialog management component for this task. Thus, dialog control should be left out of the fusion specification, which should instead focus on fusing individual multimodal expressions (see Definition 5). This also applies to postulating a specific sequence of inputs (e. g. put→that→there), as this is normally associated with some form of system output (e. g. at least feedback) and thus lies in the domain of dialog flow. What is essential is supporting unimodal, complementary, and redundant input expressions within a temporal window, thus also covering the CARE properties.

2.3 W3C Standards for Multimodal Systems

Apart from actual input fusion approaches and corresponding description languages, there also exist architectural standards for multimodal systems, i. e. for the integration of various components (including input fusion) into an overall architecture. Focussing on the well-known W3C and its Multimodal Interaction Working Group1, there are two works relevant here: the Multimodal Interaction Framework (MMIF) [160], a W3C Note from 2003, and the newer W3C Recommendation Multimodal Architecture and Interfaces (MMI Arch) [163]. While the MMIF is "a level above an architecture" [160], the MMI Arch is an actual architecture of the MMIF [163]. The following gives a compact overview of each.

2.3.1 Multimodal Interaction Framework (MMIF)

The MMIF "identifies the major components for multimodal systems [. . . and] identifies markup languages used to describe information required by components and for data flowing among components" [160]. Figure 2.7 illustrates the basic components of the MMIF described afterwards (ibid.).

Figure 2.7: The W3C Multimodal Interaction Framework (cf. [160]).

1https://www.w3.org/2002/mmi/, available online, February 20, 2017.

66 2.3 W3C Standards for Multimodal Systems

User: A user who enters input into the system and observes and hears information presented by the system.

Input: Multiple input modes such as audio, speech, handwriting, keyboarding, and others. Input processing consists of recognition, interpretation, and integration (see Figure 2.8). While recognition captures the natural input from the user (often using some kind of grammar described by a markup language), interpretation extracts the semantics of the input. The output of each interpretation component may be expressed using EMMA (cf. Section 2.2.1). Integration finally combines the outcomes of several interpretation components. The distribution of these functionalities is not predefined. For example, audio-visual speech recognition may integrate lip movement recognition and speech recognition as part of a lip reading component, as part of the speech recognition component, or within a separate integration component. Integration can also be performed within the interaction manager component.

Figure 2.8: Input components of the W3C MMIF (cf. [160]).

Output: One or more modes of output, such as speech, text, graphics, audio files, and animation. Analogous to input processing, output processing consists of generation, styling, and rendering. The generation component determines which output modes (uni- or multimodal) will be used to present the information from the interaction manager. Styling adds information about the layout of the information and outputs markup for specific modes (e. g. XHTML, SVG, SMIL). The final rendering components convert the information from the styling components into a format understandable by the user (e. g. a speech synthesizer component).

67 2 State Of the Art

Specification of input and output components: Any component of the user interface (be it input or output) is specified as a DOM2 object that exposes an interface pertaining to that object's functionality. Those interfaces expose properties, methods, and events, and should be formally described in a definition language not further specified by the standard.

Interaction Manager: The interaction manager is the logical component that coordinates data and manages execution flow from various input and output modality component interface objects. It is contained within a host environment that provides data management and flow control to interface objects. Using hierarchical composition, a hierarchical interaction management can be created that reflects the task hierarchy of the overall application.

Session Component: The session component provides an interface to the interaction manager to support state management, and temporary and persistent sessions for multimodal applications. It is a means to simplify authoring of multimodal applications. It provides functionalities for creating, replicating and synchronizing sessions (potentially across devices), and provides a high level interface to query resources with their properties and interfaces.

System and Environment: This component enables the interaction manager to find out about and respond to changes in device capabilities, user preferences and environmental conditions. It is expected to make use of standardized ways of expressing such information (e. g. CC/PP using RDF).

The MMIF explicitly considers fusion and fission within its tripartite input and output components. The different levels of input processing adhere well to the notions of data-level, feature-level, and decision-level fusion described in Section 1.3.1, or to the early and late fusion of the Stormy Tree model (cf. Section 1.2.1). Thus, the multimodal input fusion targeted in this work can be regarded as an integration component of the MMIF.

2.3.2 Multimodal Architecture and Interfaces (MMI Arch)

The MMI Arch is described as "a loosely coupled architecture for multimodal user interfaces, which allows for co-resident and distributed implementations, and focuses on the role of markup and scripting, and the use of well defined interfaces between its constituents." [163]. Taking the interaction manager and modality components from the MMIF, MMI Arch makes a basic distinction between the design-time view and the run-time view of a multimodal system. At design time, a system consists of multiple so-called presentation documents (since they contain markup to interact with the

2https://www.w3.org/DOM/, available online, February 20, 2017.

user) that are orchestrated by a controller document for the interaction manager. At run time, such markup is executed and the software components act as black boxes, communicating solely by exchanging events/messages. The runtime architecture of MMI Arch is depicted in Figure 2.9, where the responsibilities of the constituents are as follows (ibid.).

Figure 2.9: The W3C Multimodal Interaction Architecture runtime diagram (cf. [163]).

Interaction Manager (IM): The IM receives all events from the modality components and is responsible for sending events to them. Hierarchical composition allows modality components to have their own IMs; however, these are not visible to the top-level interaction manager. The controller document instructs the IM on how to respond to events and defines the interaction logic of the application. Examples of languages that could be used for that purpose are SMIL, CCXML, SCXML, or ECMAScript. Additionally, the IM is responsible for synchronizing data and focus across different modality components, maintaining application data, and communicating with external entities and back-end systems. Though logically belonging to a single IM component, these varying responsibilities could be split into separate constituents.

Data Component: This component is responsible for storing application data and is accessed and updated by the IM. Instead of being a separate component, it could also be placed inside the IM. Modality components might have their own internal data components, though sharing of data between components is only possible via the IM.

Modality Components (MC): These control the various input and output modalities and handle all interaction with the user. A single MC may also handle multiple modalities (e. g. when tight coupling is necessary). In addition, general processing can be performed by modality


components, e. g. dialog flow or natural language processing. There is no set of standard modalities or events, but an MC has to implement the interface to the IM defined by life cycle events, as described below.

Runtime Framework: It contains all infrastructure services for a multimodal application. The framework is used for starting components, handling communication, logging, etc. For handling the communication, an event transport layer must be provided by the framework that (1) reliably delivers events, (2) reports an error when delivery fails, and (3) delivers events in the order in which they were generated by a component. In [163], an example for using HTTP as transport layer is given.

The MMI Arch specification does not consider fusion/fission as explicitly as the MMIF and drops some of the components and requirements of the MMIF (e. g. the session component and the DOM object specification). Components may be nested inside other components to an arbitrary depth, following the principle of a Russian doll. Thus, multimodal input fusion could be a special kind of modality component acting as interaction manager of lower modality components, as proposed by Schnelle-Walka et al. [136]. The decisive factor of MMI Arch is the specification of a concrete interface between the interaction manager and the modality components in the form of life cycle events. Such events have some common fields (e. g. a context URI that identifies a single interaction, source and target URIs, status fields) and are usually specified in the form of request/response pairs. Specified life cycle events include events for the IM to control processing of MCs (prepare, start, cancel, pause, resume), events for MCs to signal the end of processing (done notifications), as well as events to send application-specific data (extension notifications). A complete specification, including XML Schema definitions and examples, is given on the W3C's homepage for the MMI Arch [163].
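The pairing of life cycle requests and responses via the common fields mentioned above can be sketched as follows; the sketch only models the message flow with illustrative URIs and dictionaries and is not the recommendation's normative XML syntax.

# Sketch of MMI Arch life-cycle event pairing using the common fields named
# above (context, source, target, requestID, status); illustrative only, not
# the recommendation's normative XML syntax.
import uuid

def start_request(im_uri, mc_uri, context):
    return {"event": "StartRequest", "source": im_uri, "target": mc_uri,
            "context": context, "requestID": str(uuid.uuid4())}

def respond(request, status="success"):
    # a response mirrors the request, with source/target swapped
    return {"event": request["event"].replace("Request", "Response"),
            "source": request["target"], "target": request["source"],
            "context": request["context"], "requestID": request["requestID"],
            "status": status}

def done_notification(request, data=None):
    # signals the end of processing, optionally carrying application data
    return {"event": "DoneNotification", "source": request["target"],
            "target": request["source"], "context": request["context"],
            "requestID": request["requestID"], "status": "success", "data": data}

req = start_request("im://dialog", "mc://speech", context="ctx-1")
print(respond(req)["event"], done_notification(req, data={"utterance": "play"})["event"])
# -> StartResponse DoneNotification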

2.4 Summary

This chapter presented the state of the art of multimodal input fusion from several perspectives. First, approaches to input fusion itself were surveyed. Divided into the four categories of input fusion (procedural, frame-based, unification-based, and hybrid & statistical), the best-known representatives were presented first, followed by a short description of other approaches and a summary of strengths and weaknesses of the respective category. This survey was concluded by a classification of all 34 approaches in the form of a tabular overview as well as thoughts on their suitability for companion technology and companion systems, identifying several weaknesses of the state of the art. This includes that common approaches pay little attention to the handling of uncertainties and that the user's individuality is not taken into account; instead, "one fits all" systems based on best guesses are common practice.

The second part of the chapter then concerned the interface of an input fusion to adjacent components and the specification thereof in the form of a description language. Several description languages known from literature, including W3C's EMMA, were presented with their respective pros and cons, highlighting specific needs in the context of this work. In summary, a suitable specification language should support uncertain inputs, temporal synchronization, and application-specific markup. Finally, anticipating integration of multimodal input fusion within a multimodal system, the W3C's efforts in standardizing such systems were presented and multimodal input fusion was located within the MMIF and the MMI Arch. The multimodal input fusion approach developed in this work should be related to those standards, so that anyone relying on them is able to assess the usefulness and eligibility of the approach for their own purposes.


3 Goals and Requirements

In the state of the art described in the previous chapter, there are two major points that can be considered weaknesses and will be examined explicitly in this work. First, current approaches to multimodal input fusion usually do not pay much attention to the handling of uncertainties. Either they ignore it entirely, use a probability threshold, or use n-best lists with simple probability multiplication. Not only does this lack a well-defined formalism, it actually discards information contained in uncertainties. Correct handling of uncertainties is a vital factor when there are potentially numerous input modalities available, which should be the case with companion systems. This is particularly relevant for modalities that are inferred from sensory observations inherently prone to noise and thus subject to uncertainty, like speech and gesture recognition.

Second, the user's individuality is not taken into account. Most approaches favor a "one fits all" strategy that may limit usability and robustness for individual users. Further research is needed on generic principles of users' modality adoption and on ways of exploiting and adapting to individual usage patterns. This point of view is concordant with open issues raised by other researchers as well. Lalanne et al. state: "[...] the dynamic adaptation (adaptivity) of fusion engines to usage patterns and preferences should be further studied. For example machine learning techniques could enable fusion engines to adapt to users, as well as to detect context and user's behaviour patterns and changes." [78]. Similarly, Dumas et al. conclude: "But future work clearly is needed to work toward the design of usable adaptive multimodal interfaces." [45]. And Turk postulates: "Fundamental improvements based on machine learning techniques are necessary for improved performance, personalization and adaptability." [157].

Consequently, the two main goals of this work are to create a complete fusion approach that puts focus on a sound handling of uncertainty, and to progress the state of knowledge on users' modality adoption and to explore ways of exploiting it individually. However, since these goals are rather general, they must be operationalized as concrete requirements for a new fusion approach suitable for companion technology. The following sections provide these as functional (what) and non-functional (how) requirements [65], derived from the conclusions of the previous chapter.


3.1 Functional Requirements

The proposed approach to multimodal input fusion must of course be applicable in classical unimodal systems, i. e. it must support unimodal expressions (see Definition 4, p. 19). Put in terms of the popular CARE properties, this is the same as assignment. This must be extended by the following functional requirements to allow real multimodal interactions.

3.1.1 Combination

First and foremost, input fusion must be able to derive meaning from complementary multimodal expressions (see Definition 7, p. 19). This means it must be able to semantically combine partial information from different modalities into one unified, complete piece of information. This is not restricted to only two modalities, but potentially includes the combination of an arbitrary number of modalities.

3.1.2 Disambiguation

As a special case of combination, this must go along with disambiguation of multimodal inputs. Ambiguity arises when an input modality processed by an interpreter of the system yields multiple potential results (or interpretations), e. g. from speech recognition. These may come in the form of a list of alternatives, sometimes accompanied by confidence or probability scores, depending on the underlying semantics. Disambiguation then means that a fusion approach must be able to resolve this ambiguity and come up with a single – hopefully correct – interpretation. Usually, this is implicitly achieved by the ability to specify valid combinations, automatically excluding invalid ones (cf. [67]). In its simplest form this means that the ambiguity of one modality is resolved using information of another, unambiguous modality during combination. When the modalities to be combined all suffer from ambiguity that can be resolved by filtering out invalid combinations, this is also known as mutual disambiguation (cf. [112]).

3.1.3 Reinforcement

The multimodal input fusion system must support redundant multimodal expressions (see Definition 6, p. 19). But in addition to merely allowing redundant inputs, it should also be able to use this redundancy for reinforcement. This means that when multiple modalities independently from one another provide the same semantic input, this input should be reinforced, i. e. ranked higher than when provided by a single modality only.

3.1.4 Conflict Detection

Another potentially important aspect that is ignored by many approaches also follows from the support for redundant expressions, namely contradiction or conflict. This occurs when there are redundant modalities – that can provide the same set of inputs – but the inputs contradict each other. To give a simple example, there could be a number entry that can redundantly be provided via keyboard and speech input. A conflict arises when keyboard and speech provide different numbers at the same time. Irrespective of the reason for the conflict, which may be a recognition error or the user, an input fusion should be able to detect it in order to allow handling and resolving such conflicts in subsequent interactions.

3.2 Non-functional Requirements

Besides the functional requirements described above that define what a fusion approach must be capable of, one can also specify how such capabilities should be realized and which other boundary conditions should be met. These non-functional requirements – or "quality criteria" – are described in the following.

3.2.1 Sound Handling of Uncertainty

One major goal of this work – and a necessity for all functional requirements – is a sound handling of uncertainty. In contrast to known approaches that rely on n-best lists (see Table 2.1, p. 55) and have no defined concept of uncertainty, the proposed approach should provide a clear definition for its concept of uncertainty and a formally correct handling thereof. To give an example, when performing disambiguation and invalid input combinations are discarded, the confidence scores of discarded inputs should not simply be thrown away, as seems to be general practice [67, 112]. Instead, being valid interpretations given by sensor inputs, they should still be considered in the overall assessment and may be indicators for conflicting inputs. And even if there is no apparent conflict, such invalid inputs may be an indicator for the input sensors' inability to interpret their observations, so instead of forcing a result, the result may as well be that nothing (meaningful) happened at all. In this case, subsequent dialog steps could be triggered in order to clarify the user's intended meaning. Some approaches circumvent this at the cost of naturalness, e. g. by introducing a tap-to-speak interface like in Quickset [30].

3.2.2 Account for User Individuality

As stated in this chapter’s introduction, a fusion approach for companion technology should take better account of the individuality of each and every user. This can be separated into two aspects: first, the user’s natural behavior, i. e. when and how people interact multimodally, and second, the ability of a system to take this into account. The former is primarily aimed at providing yet unknown details and models of human behavior in the sense of fundamental research. The latter has to deal with its practical use in an actual system and evaluating the expected benefits as compared to a “one fits all” approach.

3.2.3 Support Synchronization of Inputs

Detecting and adapting to users' behavior patterns imposes the requirement of temporally synchronizing inputs. Only this allows inferring the actual order and temporal alignment of inputs. In practical terms, the challenge is to base all timestamps on a global timeline, independent of the actual topology of components. This includes components distributed over some form of network infrastructure, and independence of any delays imposed by a component's own processing and the time required for communication. Whenever possible, inputs should not only have a singular timestamp (e. g. indicating the occurrence of an event), but be provided with separate timestamps marking an input's beginning and end, thus allowing a more detailed assessment of temporal relations.

3.2.4 Focus on Logical Interaction

The proposed approach should focus on logical interaction (in terms of the Stormy Tree) and the meaning of the observed multimodal inputs. This is in contrast to many approaches of the state of the art that – sometimes implicitly – model parts of the dialog within the multimodal fusion according to the Arch/Stormy Tree models (see column 'Overall Scope' in Table 2.1, p. 55). Bolt's "Put That There" is a good example, as many approaches referring to it model the complete dialog containing three interactions: a unimodal 'put', a complementary multimodal expression of the deictic reference 'that' combined with a pointing gesture, and another complementary multimodal expression of the deictic reference 'there' combined with a second pointing gesture (e. g. in [46, 105]). In a cognitive system, a dedicated component for dialog control would be responsible for following this sequence and providing intermediate feedback to the user, as well as for informing the fusion engine about the interaction currently executed. Even in a simplistic system where there is no dedicated dialog control, there still would be an underlying application that should take care of this. The same applies to the physical interaction part of the Stormy Tree, as input fusion should be independent from the physical devices and combined modalities (cf. [78]) and hence use adequate mechanisms of abstraction.

3.2.5 Facilitate Modularity

Related to the focus on logical interaction is an explicit facilitation of modularity. Although this requirement seems rather standard and is part of almost every approach in any thesis, it nevertheless particularly applies in the context of this work. Since an input fusion for companion systems should not be limited to a specific application and hardware setup, but instead should be deployable in various system contexts as a reusable technology, it is important to support modular architectures. This is in line with Lalanne et al.'s postulation: "[...] engineering aspects of fusion engines must be further studied, including the genericity (i. e. engine independent of the combined modalities), [...] and configuring fusion engines to a particular application by the designer or by the end-users." [78]. This requires minimal interfaces to adjacent components, like physical input devices and the dialog management or application. Assuming there is a description language – the de facto standard – that is used to configure a fusion engine as well as to define the data entering and leaving it from/towards adjacent components, this language shall be adaptable to those components. This is especially true when using off-the-shelf components (e. g. for speech input) with fixed interfaces. For a description language this means that it should support specific markup for the connection to the dialog/application and to the sensory inputs, thus enabling the adaptation of the fusion engine to the adjacent components, not vice versa.


4 Fusion of User Inputs Using the Transferable Belief Model

This chapter focuses on the first goal of this work, that is, developing a fusion approach that provides a sound handling of uncertainty (cf. Chapter 3). The chapter is organized as follows: The first section elucidates the decision for a mathematical model of uncertainty, i. e. why the Dempster-Shafer theory of evidence is preferred over the alternatives of n-best lists and Bayesian inference. Then the theoretical foundations thereof are introduced along with the applied variant of Dempster-Shafer theory called the Transferable Belief Model. Afterwards, necessary adaptations for its use for input fusion are explained and theoretically evaluated. Once these formal basics are established, a new method for configuring input fusion for concrete inputs and their semantic combination is introduced, based on the formalism of graphs specified in XML as description language. Section 4.5 then deals with the integration of the proposed fusion approach into a running system in general and its cooperation with multimodal fission as part of a companion system in particular. Finally, the approach is summarized with regard to the requirements of the previous chapter and compared to the state of the art, including its relations to W3C's open standards for the architecture of multimodal systems.

4.1 N-best Lists, Bayesian Inference or Dempster-Shafer

From a formal perspective, three ways of describing and handling uncertainties used previously (cf. Section 2.1) are chosen as possible candidates, namely n-best lists, Bayesian inference, and the Dempster-Shafer theory of evidence. Each of them is introduced in the following, before the decision for the latter is justified.

Regarding uncertainty when combining multiple inputs, approaches using n-best lists (cf. Chapter 2) seem to apply the product rule of probability ([15] p. 14) for independent probabilistic decisions about n events of two modalities A and B, and thus compute the overall probability p(A = a_i, B = b_j) with:

    p(a_i, b_j) = p(a_i) \cdot p(b_j)    (4.1)


This computes the joint probability of the event being a_i and b_j. From all possible combinations, the event with the highest probability is then selected as the final decision. While this allows combination and reinforcement of probabilities from different sources, disambiguation is usually performed by excluding invalid combinations, discarding probabilities that could bear important information (cf. Section 3.2.1). In addition, conflict detection is not provided with such a simple model, as conflicting inputs would simply result in low overall probabilities.
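The following minimal Python sketch illustrates this style of n-best list fusion via the product rule of Equation (4.1): every cross-modal pair is scored by multiplying probabilities, invalid pairs are discarded, and the best remaining pair wins. The interpretation names and the set of valid combinations are illustrative only and not taken from any particular surveyed system.

    from itertools import product

    speech = [("delete", 0.6), ("put_that", 0.4)]               # n-best list, modality A
    gesture = [("single_object", 0.7), ("multiple_objects", 0.3)]  # n-best list, modality B
    valid = {("delete", "single_object"), ("delete", "multiple_objects"),
             ("put_that", "single_object")}

    scored = [((a, b), pa * pb) for (a, pa), (b, pb) in product(speech, gesture)
              if (a, b) in valid]        # probabilities of excluded pairs are simply lost here
    best = max(scored, key=lambda pair: pair[1])
    print(best)  # (('delete', 'single_object'), 0.42)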

Another way of fusing inputs is to view all inputs as a joint observation over all the available modalities and compute the probability that an event E_i has occurred given this observation. This is the basic idea of Bayesian inference. Named posthumously after Thomas Bayes, an English statistician of the 18th century, Bayesian inference updates the probability of an event given prior knowledge about the likelihood of the event (the a priori probability) when making new observations (cf. [4, 176]). Assuming we get a joint observation O containing all observations (o_1 ... o_n) from the available modalities, then the so-called posterior probability of event E_i is computed with Bayes' theorem as follows ([15] p. 15):

    p(E_i|O) = \frac{p(O|E_i)\, p(E_i)}{p(O)}    (4.2)

Where p(E_i) is the a priori probability that the event E_i occurred and p(O|E_i) is the conditional probability of observing O given that event E_i has occurred. The denominator p(O), i. e. the a priori probability of observing O, is a normalization constant used to ensure that the sum of the posterior probabilities (left-hand side of Equation (4.2)) over all events E_i equals one.
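As a small numeric illustration of Equation (4.2), the following sketch computes a posterior over two candidate events given a joint observation; the event names, priors and likelihoods are invented for illustration only.

    priors = {"select_object": 0.7, "delete_object": 0.3}       # p(E_i)
    likelihoods = {"select_object": 0.4, "delete_object": 0.8}  # p(O | E_i)

    evidence = sum(priors[e] * likelihoods[e] for e in priors)  # p(O), the normalizer
    posterior = {e: priors[e] * likelihoods[e] / evidence for e in priors}
    print(posterior)  # {'select_object': 0.538..., 'delete_object': 0.461...}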

Bayesian inference has various advantages [4, 176], especially when compared to the simplistic product rule used with n-best lists: (1) it can be used incrementally to update the posterior probability of an event as soon as there are new observations, (2) any new decision can be used to update the a priori probability, and (3) when there is not enough knowledge to compute the a priori probability, subjective probability estimates (guessing) are permitted.

But the advantages of Bayesian inference can also be seen as disadvantages, especially in the context of input fusion, because it requires the a priori probabilities of the events and the conditional probabilities of the observations to be well defined. Further, even if subjective probability estimates are used as a starting point, learning such probabilities will only work for a specific context (e. g. in a laboratory setup). In real world scenarios, there is a potentially unlimited number of contexts and parameters that have an influence on the a priori probabilities, some of which will be presented in Section 5.1. That is why Bayesian inference is not chosen for this work. Nevertheless, Bayesian inference has in fact been applied in the approach of Dumas et al. presented in [46], as described in Section 2.1.4. But instead of actually fusing inputs, only the sequentiality of inputs is assessed using an HMM, a model of sequential observations and a special case of a so-called Dynamic Bayesian Network [4] that relies on Bayesian inference.

Another option for dealing with uncertainty is the Dempster-Shafer theory (DS) [146]. Instead of taking a frequency-based (also called frequentist) view of probabilities as in Bayesian inference described above, this theory views independent evidences from different sources as beliefs and combines them to establish a notion of probability defined by lower and upper bounds, called total belief and plausibility, respectively. After computing such kinds of probabilities for all potential events, a more informed decision can be made. Anticipating the more detailed and formal explanations of the upcoming sections, the advantages of applying DS over using n-best lists and Bayesian inference are [4, 28]: (1) no a priori knowledge is required, (2) a value for ignorance can be expressed (the gap between total belief and plausibility), giving information on the overall uncertainty of a situation, (3) a value for the total amount of conflict between sources of evidence is computed, and (4) it allows assigning evidence to unions of events, instead of classical probability's restriction to distributing probabilities among singular events only.

These are the reasons for choosing DS for the approach of input fusion proposed in this work. Using DS has already been proposed by Reddy & Basir for the fusion of user inputs in the robotic domain [127], although only on a conceptual level (cf. Section 2.1.4). The following sections introduce the formal concepts of DS and the reasons for applying a specific interpretation of DS, called the Transferable Belief Model (TBM), for the proposed fusion of user inputs. In addition, differences and extensions to the work of Reddy & Basir are pointed out.

4.2 Dempster-Shafer Theory and the Transferable Belief Model

Dempster-Shafer’s theory of evidence was introduced in the 1960s by Arthur Dempster [36] and developed further by Glenn Shafer [146]. It is a mechanism for reasoning under uncertainty and quantifies beliefs of sensors in propositions (events). A belief only states the confidence of the sensor about the event, not the actual probability that an event happened. Using this notion, evidential reasoning can explicitly represent uncertainty in the form of a disjunction like "event A or event B happened with a belief of m", without the necessity to assign probabilities to the individual events.


In classical (frequentist) probability one can only assign the same probability to both events, which has a slightly different meaning. It presumes that only one of both events could have happened, thus dividing the probability amongst the candidates, whereas evidential reasoning allows postponing this decision and assigning one undivided belief for both. The relevant notions of DS theory are given next, before the most relevant part for this work, the rule of combination and its modified version from the TBM, is discussed in Section 4.2.2. Section 4.2.3 then is concerned with deriving a final decision from combined beliefs. Note that Sections 4.2.1 and 4.2.3 are a revised version of published work [144].

4.2.1 Basic Notions

Frame of Discernment: The Frame of Discernment (FOD) Ω is a finite set of elementary propositions (or events) on which beliefs can be constructed. It is also called universe of discourse or domain of reference. Let 2^Ω be its power set. For example, if Ω = {a, b, c}, then beliefs can be constructed on all subsets of Ω given as 2^Ω = {∅, {a}, {b}, {c}, {a, b}, {a, c}, {b, c}, {a, b, c}}.

Basic Belief Assignment, Total Belief and Plausibility: A Basic Belief Assignment (BBA) is a function m: 2^Ω → [0, 1] such that \sum_{A: A \subseteq \Omega} m(A) = 1. m(A) is also called the basic belief mass and represents the belief that the proposition A (a subset of Ω) is true, without supporting any specific subset of A. One can say, it is the belief that event A happened. Because any BBA to a subset of A also supports the event being in set A, the total belief one can assign to A is given as a function bel: 2^Ω → [0, 1] with:

    bel(A) = \sum_{B \subseteq A,\, B \neq \emptyset} m(B)    (4.3)

m(∅) is not included in bel(A) because it does not explicitly support A, but also supports ¬A. The total belief can be interpreted as the minimal support for A. The maximal support for a proposition is called plausibility and is defined as:

    pl(A) = \sum_{B: B \cap A \neq \emptyset} m(B)    (4.4)

Note the difference between belief bel and plausibility pl, where bel needs B to be a subset of A, while pl only requires B to share some elements with A. This formula differs from that given by Reddy & Basir [126, 127], as they also use ⊆ as opposed to ∩ in pl, which seems contradictory to what is stated by Dempster (cf. [36]). In any case bel(A) ≤ pl(A). So bel(A) and pl(A) represent lower and upper belief bounds, and the interval [bel(A), pl(A)] represents the belief range that can be interpreted as ignorance or overall uncertainty. The relationships between those values can best be expressed graphically, as shown in Figure 4.1.
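A small Python sketch of Equations (4.3) and (4.4), representing focal sets as frozensets of elementary symbols; the example BBA is chosen freely to show a non-trivial belief interval.

    def bel(bba, a):
        """Total belief of set A (Eq. 4.3): masses of all non-empty subsets of A."""
        return sum(m for b, m in bba.items() if b and b <= a)

    def pl(bba, a):
        """Plausibility of A (Eq. 4.4): masses of all sets intersecting A."""
        return sum(m for b, m in bba.items() if b & a)

    # Example over Omega = {a, b, c}:
    bba = {frozenset("a"): 0.5, frozenset("bc"): 0.3, frozenset("abc"): 0.2}
    A = frozenset("ab")
    print(bel(bba, A), pl(bba, A))  # 0.5 and 1.0, i.e. the belief interval [0.5, 1.0]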

It can be seen that in the case of only having elementary propositions A out of Ω (i. e. single events), DS reduces to classical probability distributions as used in the n-best lists and Bayesian inference approaches, since then bel(A) = pl(A). This case can be defined as m(A) = 0 ∀A ⊆ Ω, |A| > 1.


Figure 4.1: The belief interval defined by bel and pl showing the overall uncertainty of an event defined by the set A.

4.2.2 Combination of Evidence

In addition to constructing total beliefs and plausibilities for different sets from a single BBA, DS also provides a way to combine two BBAs and compute previously unknown beliefs for the intersection of sets with Dempster's Rule of Combination. If m_1 and m_2 are two BBAs on subsets of Ω, i. e. two distinct pieces of evidence, then:

    m(C) = \frac{1}{1-K} \sum_{A \cap B = C} m_1(A) \cdot m_2(B)    (4.5)

where K represents the amount of conflict (or contradiction) between the two pieces of evidence, defined as:

    K = \sum_{A \cap B = \emptyset} m_1(A) \cdot m_2(B)    (4.6)

This normalization proportional to the amount of conflict can lead to counterintuitive results once the conflict is large (near 1). Then combined beliefs get very large as well, even if the initial beliefs were very small. This stems from the closed-world assumption made in original DS, which postulates that the actual event that happened has to lie within the FOD and could not be the empty set. So DS has the restriction of m(∅) = 0. As Smets points out in [149], it is much more plausible to adopt an open-world assumption and accept the fact that the actual event could be an unknown event. That is why the TBM proposed by Smets drops this normalization factor from Equation (4.5). This open-world assumption, and thus the TBM, also suits the domain of HCI targeted in this work much better, simply because humans often do something unanticipated.


The following example (adapted from [149]) should illustrate the rule of combination and the meaning of conflicting evidences, in DS as well as in TBM. In the murder of a Mr. White there are three suspects: Henry, Tom, and Sarah. Thus the FOD is Ω = {Henry, Tom, Sarah}. There are two witnesses expressing their beliefs about who might be the culprit. The first witness states "Henry and Tom could both be the murderer. Sarah is less likely to be the murderer." Whereas the second witness states "Tom is the one with the strongest motive for murder. Sarah's motive is weaker, but Henry definitely cannot be the murderer." Table 4.1 shows the BBAs of each witness and the results of applying the rule of combination for both, DS and TBM.

Witness 1 \ Witness 2 | Henry (0.0)                  | Tom (0.8)                     | Sarah (0.2)
Henry,Tom (0.8)       | Henry: 0.0 (DS) / 0.0 (TBM)  | Tom: 0.94 (DS) / 0.64 (TBM)   | Ø: 0.0 (DS) / 0.16 (TBM)
Sarah (0.2)           | Ø: 0.0 (DS) / 0.0 (TBM)      | Ø: 0.0 (DS) / 0.16 (TBM)      | Sarah: 0.06 (DS) / 0.04 (TBM)

Table 4.1: Combining evidences from two witnesses of a murder for DS (with K = 0.32) and TBM.

As can be seen, DS gives a very strong belief in Tom (0.94), a marginal belief in Sarah (0.06), and none in Henry (0.0). TBM gives a more informative result with the strongest belief in Tom (0.64), followed by the reasonable belief (resulting from summing up the conflicts) that none of the three suspects did it (0.32), followed by Sarah (0.04) and Henry (0.0). This example shows the usefulness of the open-world assumption and of keeping a value for m(∅). Note that for the sake of clarity this simple example does not result in an overall uncertainty, since the combined beliefs contain only single events (elementary propositions), so that bel = pl for all events. In order to contain some overall uncertainty, the example could be modified so that the intersections of the witnesses' reported beliefs contain more than a single event, e. g. if witness 2 also reported a belief on the set {Henry, Tom}. In this case, the resulting beliefs after applying the rule of combination could not directly be used for making a decision, since there may be some uncertainty that must first be resolved as described in the next section.
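The witness example can be reproduced with a few lines of Python; the sketch below implements Equations (4.5) and (4.6), once with normalization (classical DS) and once without (the unnormalized TBM rule keeping the conflict on the empty set).

    def combine(m1, m2, normalize):
        """Dempster's rule of combination; normalize=True is classical DS,
        normalize=False the TBM variant under the open-world assumption."""
        out = {}
        for a, ma in m1.items():
            for b, mb in m2.items():
                c = a & b                      # set intersection of focal sets
                out[c] = out.get(c, 0.0) + ma * mb
        if normalize:
            k = out.pop(frozenset(), 0.0)      # conflict mass K
            out = {c: m / (1.0 - k) for c, m in out.items()}
        return out

    w1 = {frozenset({"Henry", "Tom"}): 0.8, frozenset({"Sarah"}): 0.2}
    w2 = {frozenset({"Henry"}): 0.0, frozenset({"Tom"}): 0.8, frozenset({"Sarah"}): 0.2}
    print(combine(w1, w2, normalize=True))   # DS:  Tom ~0.94, Sarah ~0.06, Henry 0.0
    print(combine(w1, w2, normalize=False))  # TBM: Tom 0.64, Sarah 0.04, conflict 0.32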

4.2.3 Decision Making

Up to now, beliefs on sets of propositions have been combined and belief intervals can be computed to show the overall uncertainties, but still a final decision has to be made. However, beliefs and belief intervals are not directly suited to derive a decision. This is better justified by classical probabilities for each elementary event out of the FOD. So belief functions must be transformed into probability functions over Ω. This is done by applying Smets' so-called pignistic transformation [150] (from Latin 'pignus': a bet) that has been modified by Reddy [126]. It is given by BetP:

    BetP(\omega) = \sum_{A: \omega \in A \subseteq \Omega} \frac{m(A)}{|A|} \quad \forall \omega \in \Omega    (4.7)

Where |A| denotes the number of elementary propositions in set A. Note that the probability BetP(ω) is delimited by the belief interval [bel(ω), pl(ω)], such that bel(ω) ≤ BetP(ω) ≤ pl(ω). The pignistic transformation simply distributes the mass of a set A, and with it the overall uncertainty, equally on all its single elements. It can be regarded as distributing the uncertainty to the best of one's knowledge. Once BetP has been computed for all events in the FOD, the result is a kind of "global" n-best list, on which a final decision can be made.
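A direct Python rendering of Equation (4.7): the mass of every focal set is spread equally over its members, while the conflict mass on the empty set is left untouched. The dictionary-based encoding (focal sets as frozensets of events) is an assumption of this sketch, not a prescribed data model.

    def pignistic(bba):
        """Pignistic transformation BetP (Eq. 4.7)."""
        betp = {}
        for focal, mass in bba.items():
            if not focal:            # empty focal set: conflict, not distributed
                continue
            for event in focal:
                betp[event] = betp.get(event, 0.0) + mass / len(focal)
        return betp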

4.3 Input Fusion With the Transferable Belief Model

Using the rule of combination and the pignistic transformation described above, one can combine evidences from different sensors and make a decision on individual events defined in the FOD. However, for multimodal input fusion, we are not only interested in individual events, but also in combinations thereof that have a meaning on their own. To elucidate this, imagine a user that wants to select an object using a deictic reference “this one” and a pointing gesture. This would result in beliefs about the individual events a = ’select’ and b = ’object’. What we want from a fusion system is a belief about the combined event ab = ’select object’. Using the rule of combination from the TBM as is would result in belief values for the individual events a and b, but not taken together. But it can be modified to allow such a real “fusion” of multimodal inputs. The following section, stemming from own work published in [144], describes those necessary modifications based on work by Reddy & Basir [126, 127]. After that, a theoretical evaluation compares the modified TBM with traditional DS and n-best lists.

4.3.1 Modification of the Rule of Combination

Key to the modification of TBM's rule of combination is the introduction of tuples as a representation of a combined concept. In the above example, instead of just having two events a ('select') and b ('object'), the tuple (a, b) is used to denote the combined concept. It is important to note the difference between the interpretation of the set {a, b} and the tuple (a, b). While m({a, b}) is the basic belief that the event was either a or b, m({(a, b)}), or short m({ab}), is the belief in the combined event ab. Mathematically speaking, tuples are the result of a set multiplication operation {a} × {b} = {(a, b)}. Using this Cartesian product, the rule of combination from Equation (4.5) can now be rewritten [127]:

    m(C) = \sum_{(A \times B) \cap \Omega = C} m_1(A) \cdot m_2(B)    (4.8)

Where (A × B) ∩ Ω allows only those combinations, out of the Cartesian product A × B, that are defined in the FOD. One benefit of evidential reasoning is its easy scalability. Extended to an arbitrary number of sources of evidence the equation becomes:

    m(C) = \sum_{(E_1 \times E_2 \times \cdots \times E_n) \cap \Omega = C} m_1(E_1)\, m_2(E_2) \cdots m_n(E_n)    (4.9)

Similarly, the computation of conflict from Equation (4.6) becomes:

    m(\emptyset) = \sum_{(E_1 \times E_2 \times \cdots \times E_n) \cap \Omega = \emptyset} m_1(E_1)\, m_2(E_2) \cdots m_n(E_n)    (4.10)

Using the strict mathematical definition of a Cartesian product, this new combination rule would only produce beliefs over combined concepts (tuples) and not elementary events anymore. To avoid this, Reddy & Basir introduce a neutral evidence ’*’ that must be part of every sensor’s input. So the belief distribution of a sensor must in any case at least contain a belief for ’*’, meaning that nothing has been detected. Additionally, the combination of evidences obeys the following rules:

• Combination of an evidence with the neutral evidence ’*’ results in that evidence itself. So {a} × {∗} results in {a}.

• The combination of an evidence with itself results in that evidence itself. So {a} × {a} results in {a}.

• The order of evidences in a combined concept does not matter. That is, {ab} is the same as {ba}.

To elucidate the outcome of these rules, a short example shall be given. Let the FOD be Ω = {a, b, c, d, e, ae, eb, ∗}. So there are five single events and two combined concepts. Assume we have two sensors that return belief distributions over the following subsets of Ω:


Sensor 1: {a}, {a, b}, {c}, {∗}
Sensor 2: {a}, {d, a}, {a, e}, {∗}

The combination of these sensors' outputs using the rules discussed above then results in Table 4.2. As can be seen in the last row and column, the neutral element preserves beliefs over the original single events. The combinations in the other cells are the result of intersecting the Cartesian products with the FOD. Using the pignistic transformation from Equation (4.7), probabilities for all events of the FOD are computed, and a final decision on the event being either a single event, a combined concept, the neutral event, or even a conflict can be made.

Sensor 1 \ Sensor 2 | {a}   | {d, a}  | {a, e}     | {∗}
{a}                 | a     | a       | a, ae      | a
{a, b}              | a     | a       | a, ae, eb  | a, b
{c}                 | Ø     | Ø       | Ø          | c
{∗}                 | a     | d, a    | a, e       | ∗

Table 4.2: Example of the adapted rule of combination of the TBM based on the Cartesian product of sets (cf. [144]).
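The adapted rule of combination can be sketched compactly in Python. The encoding is an assumption of this sketch: an event is a frozenset of elementary symbols (so the combined concept ae is frozenset({'a','e'})), the neutral evidence '∗' is the empty frozenset, a focal element is a frozenset of events, and the FOD is the set of allowed events including the neutral one. With this encoding, the set union implements all three combination rules above (neutrality of '∗', idempotent self-combination, irrelevance of order).

    NEUTRAL = frozenset()   # the neutral evidence '*'

    def merge(u, v):
        """Tuple-building combination of two events (union of their parts)."""
        return u | v

    def fuse(m1, m2, fod):
        """Modified rule of combination (Eqs. 4.9/4.10) for two BBAs; masses of
        combinations outside the FOD accumulate as conflict on the empty key."""
        out = {}
        for A, ma in m1.items():
            for B, mb in m2.items():
                C = frozenset(c for c in (merge(u, v) for u in A for v in B)
                              if c in fod)
                out[C] = out.get(C, 0.0) + ma * mb
        return out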

4.3.2 Theoretical Evaluation

Although the previous sections provide some theoretical aspects indicating that the TBM with the modified rule of combination is applicable for multimodal input fusion, the practical advantages should be demonstrated as well. This is done similarly to [126], but in addition to comparing the approach with original DS, n-best lists are regarded as well, as they are the most frequently used way of dealing with uncertainties in the state of the art (cf. Table 2.1). The domain for the evaluation is based on Bolt's "Put That There" [20], where there are two input sensors, namely a speech recognizer and a gesture recognizer. Speech allows the inputs "put that", "put these", "there", and a "delete" command. There are two possible gestures: pointing at a specific location/object and encircling an area/multiple objects. Reference resolution of the gesture input is omitted for the sake of simplicity, so only single or multiple object gesturing is distinguished. So the system allows relocating and deleting either a single or multiple objects. To allow for a compact notation, these inputs are coded as single letters according to Table 4.3.


Accordingly, the frame of discernment containing all single inputs, all meaningful combinations, and the neutral element is defined as Ω = {a, b, c, d, x, y, xa, xc, xd, yb, yd, ∗}. So, for example, xa means starting a move command by pointing at a single object and uttering "put that", and yd means deleting a group of objects by encircling them and uttering "delete".

        | Speech                                  | Gesture
Input   | put that | put these | there | delete   | pointing | encircling
Coding  | a        | b         | c     | d        | x        | y

Table 4.3: Coding for the different inputs of the domain used for evaluation.

Test Cases

Different synthetic test cases geared to the functional requirements described in Section 3.1 are used for the evaluation. These are:

1. Combination of complementary inputs: In this scenario, the user wants to delete a single object and hence says "delete" while pointing at a single object. Both sensors have no problem interpreting the respective inputs and provide them as given in Table 4.4. Input fusion should be able to combine the complementary inputs.

   Speech        | Gesture
   input | m(A)  | input | m(B)
   d     | 0.7   | x     | 0.8
   *     | 0.3   | *     | 0.2

Table 4.4: Coherent inputs of the different sensors for complementary combination.

2. Disambiguation of a sensor's ambiguity: This time, the user wants to move a single object. Due to background noise, the speech recognition has a lot of ambiguity between "put that" and "put these", while the gesture is clearly for a single object. The provided inputs are given in Table 4.5. Any fusion system should be able to resolve the speech recognition's ambiguity using the gesture input and come up with the correct concept of moving a single object.

88 4.3 Input Fusion With the Transferable Belief Model

Speech        | Gesture
input | m(A)  | input | m(B)
a     | 0.3   | x     | 0.8
b     | 0.3   | y     | 0.1
a,b   | 0.3   | *     | 0.1
*     | 0.1   |       |

Table 4.5: One sensor has a lot of ambiguity that should be resolvable using the second sensor's input.

3. Reinforcement of redundant inputs: When there are two sensors independently providing inputs (in part) for the same events, fusion should be able to use this information for reinforcement. For this scenario, one could imagine a gaze tracking sensor that provides similar inputs to the gesture recognizer. Given a situation where both sensors are quite uncertain about the event being either a single pointing, an encircling of multiple objects, or nothing at all, they could provide inputs as given in Table 4.6.

   Gesture       | Gaze Tracking
   input | m(A)  | input | m(B)
   x     | 0.5   | x     | 0.6
   y     | 0.4   | *     | 0.4
   *     | 0.1   |       |

Table 4.6: Two independent sensors provide uncertain inputs in part for the same events. Such redundant inputs should reinforce each other.

4. Conflict detection of contradicting inputs: This situation can arise when two sensors provide inputs that have no meaningful semantic interpretation. In the given scenario, this could happen when the speech recognition detects "put these", while the gesture only recognizes pointing at a single object. These inputs are shown in Table 4.7. Independent of the reasons why this could happen, a fusion system should be able to detect this conflict.

   Speech        | Gesture
   input | m(A)  | input | m(B)
   b     | 0.7   | x     | 0.8
   *     | 0.3   | *     | 0.2

Table 4.7: Two sensors provide contradicting inputs that cannot be combined to form a meaningful result. This situation should be recognized by a fusion system.


Results

The set of cases was tested with the proposed rule of combination (Equation (4.8)), with traditional DS (Equation (4.5)), and with n-best lists (Equation (4.1)). Table 4.8 presents the results for case 1, after applying the pignistic transformation for the proposed approach, and normalization for n-best lists. It can be seen that traditional DS outputs probabilities only for the individual concepts, while the proposed approach is able to actually combine inputs while at the same time preserving probabilities for the individual concepts. N-best lists provide similar results, but have no meaning for the neutral event.

1. Combination of complementary inputs
Inputs                 | Fused Results
Sensor 1  | Sensor 2   | Proposed Approach | Traditional DS | N-best Lists
x 0.8     | d 0.7      | xd 0.56           | x 0.63         | x,d 0.596
* 0.2     | * 0.3      | x  0.24           | d 0.37         | x   0.255
          |            | d  0.14           |                | d   0.149
          |            | *  0.06           |                |

Table 4.8: Results for the different approaches on test case 1.
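As a cross-check, test case 1 can be run through the fuse and pignistic sketches given above (Sections 4.3.1 and 4.2.3, respectively); the resulting values match the "Proposed Approach" column of Table 4.8. The helper names below are specific to those sketches.

    E = lambda s: frozenset(s)      # an event from its elementary symbols
    STAR = frozenset()              # the neutral event '*'
    fod = {E("x"), E("d"), E("xd"), STAR}

    sensor1 = {frozenset({E("x")}): 0.8, frozenset({STAR}): 0.2}   # gesture: pointing
    sensor2 = {frozenset({E("d")}): 0.7, frozenset({STAR}): 0.3}   # speech: "delete"

    print(pignistic(fuse(sensor1, sensor2, fod)))
    # BetP: xd = 0.56, x = 0.24, d = 0.14, '*' = 0.06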

Results for case 2 are given in Table 4.9. While the proposed approach and n-best lists are able to reduce the initial ambiguity and correctly come up with the combined event xa of moving a single object, traditional DS still has some ambiguity between events a and b left, while rating the single event x highest.

2. Disambiguation of a sensor's ambiguity
Inputs                 | Fused Results
Sensor 1  | Sensor 2   | Proposed Approach | Traditional DS | N-best Lists
x 0.8     | a   0.3    | xa 0.48           | x 0.44         | x,a 0.615
y 0.1     | b   0.3    | Ø  0.27           | a 0.25         | x   0.137
* 0.1     | a,b 0.3    | x  0.08           | b 0.25         | a   0.077
          | *   0.1    | yb 0.06           | y 0.06         | b   0.077
          |            | a  0.045          |                | y,b 0.077
          |            | b  0.045          |                | y   0.017
          |            | y  0.01           |                |
          |            | *  0.01           |                |

Table 4.9: Results for the different approaches on test case 2.


Table 4.10 shows the results for test case 3, where the sensors provide inputs on partially the same events. While all approaches provide reinforcement of the event being x, only the proposed approach gives a more informed result that contains information on the amount of conflict arising from the non-combinable events x and y.

3. Reinforcement of redundant inputs
Inputs                 | Fused Results
Sensor 1  | Sensor 2   | Proposed Approach | Traditional DS | N-best Lists
x 0.6     | x 0.5      | x 0.56            | x 0.78         | x 0.78
* 0.4     | y 0.4      | Ø 0.24            | y 0.22         | y 0.22
          | * 0.1      | y 0.16            |                |
          |            | * 0.04            |                |

Table 4.10: Results for the different approaches on test case 3.

In Table 4.11, results for case 4 are shown. As can be seen, only the proposed approach is able to detect the overall conflict, while the other two approaches both rate the single event x highest. This is due to the closed-world assumption made in both of them.

4. Conflict detection of contradicting inputs
Inputs                 | Fused Results
Sensor 1  | Sensor 2   | Proposed Approach | Traditional DS | N-best Lists
x 0.8     | b 0.7      | Ø 0.56            | x 0.63         | x 0.63
* 0.2     | * 0.3      | x 0.24            | b 0.37         | b 0.37
          |            | b 0.14            |                |
          |            | * 0.06            |                |

Table 4.11: Results for the different approaches on test case 4.

Discussion

As can be seen from the results, the proposed approach fulfills all functional requirements from Section 3.1, at least on a formal level. In addition, compared to DS and the often used n-best lists, the proposed approach gives more plausible results, using all information present in the sensors' inputs without discarding relevant information. So it seems reasonable to conclude that the approach also fulfills the non-functional requirement of a sound handling of uncertainty (see Section 3.2.1). One possible weakness that the proposed approach shares with DS and n-best lists is the assumption of independence between sensors, which may not always be true. However, a correct handling of dependencies would require knowledge about the exact nature of those dependencies and their correct formalization, which seems rather difficult without gathering large amounts of data (similar to the problem of Bayesian inference discussed in Section 4.1). Apart from this, common HCI scenarios usually rely on a single sensor processing a single modality, or at least on information available in different channels of one modality, e. g. lip movements gathered through a camera and sound waves gathered by a microphone. Thus, assuming independence in the domain at hand seems largely unproblematic.

4.4 Fusion Specification via Graphs in XML

So far, the formal basis is laid for a fusion system relying on the modified rule of combination in Equation (4.9) and the pignistic transformation from Equation (4.7). But how is the necessary frame of discernment specified? And how are events actually combined, not only on the level of beliefs and probabilities, but also on a semantic level? Those questions are answered in the following.

4.4.1 The Frame of Discernment as Undirected Graph

For the specification of the FOD, Reddy & Basir propose the use of simplified Conceptual Graphs (CGs) [126]. Such a CG is a representation of domain knowledge that states which elementary events exist and which of these can be combined. It is defined as an undirected graph (V, E), where V is the set of vertices in the graph (representing the elementary events), and E ⊆ V × V is the set of edges linking some of those events as tuples. The FOD then simply contains all vertices and all edges of the CG, supplemented by the neutral evidence '∗'. To give an example, the FOD used for the evaluation in Section 4.3.2 represented as a CG is given in Figure 4.2.

While Reddy & Basir settle for using CGs as a descriptive representation of FODs, an actual operationalization in the form of a description language is needed for this work. Recalling the requirement to facilitate modularity (see Section 3.2.5), such a language should support specific markup for the connection to the dialog/application and to the sensory inputs. Instead of creating a proprietary language for undirected graphs, the free XML-based language GraphML¹ [24], licensed under Creative Commons Attribution 3.0², is chosen. GraphML provides the necessary features, like supporting undirected graphs and even hypergraphs (necessary when three or more events should be combined), as well as associating additional data with vertices (called nodes) and edges. Continuing the evaluation example, Listing 4.1 shows the CG from Figure 4.2 straightforwardly specified in GraphML. Such a graph can directly be used as FOD for the input fusion.

¹ http://graphml.graphdrawing.org/index.html, available online, January 28, 2017
² https://creativecommons.org/licenses/by/3.0/, available online, January 28, 2017

Listing 4.1: GraphML specification of a FOD.

The Necessity for Multiple Parallel Fusions

But what if there are not only combinable events, but also events that have a meaning on their own, or disjoint groups of combinable events? Then the graph is not a connected graph but contains several disjoint subgraphs. Applying Equation (4.10) to such independent events that can happen concurrently would lead to the maximum belief being assigned to the conflict ∅. This is not what should happen in such a situation. Consequently, input fusion cannot be performed on a single FOD, but must be performed with multiple fusions and multiple FODs in parallel, one for each disjoint subgraph.
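Splitting the CG into its connected components is straightforward; the following sketch uses a small union-find over plain Python stand-ins for the parsed GraphML nodes and edges (the actual parsing and data structures of the implementation may differ).

    def connected_components(nodes, edges):
        """Union-find over the undirected CG; returns one node set per component."""
        parent = {n: n for n in nodes}
        def find(n):
            while parent[n] != n:
                parent[n] = parent[parent[n]]   # path halving
                n = parent[n]
            return n
        for u, v in edges:
            parent[find(u)] = find(v)
        comps = {}
        for n in nodes:
            comps.setdefault(find(n), set()).add(n)
        return list(comps.values())

    nodes = ["a", "b", "c", "d", "x", "y"]
    edges = [("x", "a"), ("x", "c"), ("x", "d"), ("y", "b"), ("y", "d")]
    print(connected_components(nodes, edges))
    # One component here; an isolated node or disjoint group would get its own
    # FOD (vertices plus edges of that subgraph, supplemented by '*').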

Figure 4.2: A Conceptual Graph representing the FOD Ω = {a, b, c, d, x, y, xa, xc, xd, yb, yd, ∗}.


4.4.2 Event Data and Semantic Combination

Such a simple specification as in Listing 4.1 only contains identifiers for events and no actual data yet, which may be feasible for a very limited portion of applications only. To allow application-specific data to be associated with nodes, GraphML provides that the content of <data> elements can be extended. So one has to extend the type data-extension.type in XML Schema as shown in Listing 4.2 (and Listing A.1 in the Appendix). The relevant portion is the <xs:any> element, which enables adding any element not specified in the original schema. The processContents="strict" attribute ensures that the new elements are validated against their respective schemas. So the given XML Schema is universal in the sense that it can be used for any application-specific markup within nodes and edges of a GraphML document. Now, a GraphML specification for a FOD can contain arbitrary data for events. Assuming that there is a reference resolution for the gestures of the evaluation example, the specification can be extended with such data as shown in Listing 4.3 (relevant portions in bold). Runtime events from sensors and nodes in the graph (events of the FOD) are associated by matching IDs (e. g. pointing and encircling). Note that the data given in the GraphML specification is actually not necessary to build the FOD and can be regarded as examples for event data to expect at runtime, facilitating human readability.

What is finally missing is a way to actually combine event data in edges of the graph. Since there is no predefined model for the data of events, such semantic combination can also not be described in a proprietary, predefined language. Instead, a language is needed that is able to process any type of XML information and transform it into a combined output. Such a language is XSLT [159]. Reusing the data extension mechanism of GraphML, semantic combination is defined by XSLT specifications as data of edges, as shown in Listing 4.4 (relevant portions in bold). In this case, the result of the combination is a "move_single" element with the referenced data as its content.
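To illustrate the principle, the following sketch applies such an edge transformation with the lxml library (an assumed tool choice, not the thesis implementation). The element names "move_single" follows the text; the shape of the input data (pointing event with a reference) and the stylesheet itself are illustrative guesses rather than the verbatim content of Listing 4.4.

    from lxml import etree

    XSLT_SPEC = b"""<xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="/">
        <move_single><xsl:copy-of select="//reference"/></move_single>
      </xsl:template>
    </xsl:stylesheet>"""

    transform = etree.XSLT(etree.XML(XSLT_SPEC))
    # Hypothetical combined data of the events 'pointing' and 'put_that':
    source = etree.XML(b'<pointing><reference id="object_7"/></pointing>')
    print(etree.tostring(transform(source)))
    # roughly: <move_single><reference id="object_7"/></move_single>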

Using GraphML extended with event data in application-specific XML and semantic combination via XSLT finally represents a complete specification for configuring the input fusion. This mechanism clearly fulfills the requirement of facilitating modularity (see Section 3.2.5), as it allows describing data entering and leaving the input fusion in a flexible and reusable way (via XML schemas) for adjacent components like dialog/application and sensory inputs. In fact, compared to the state of the art (cf. Section 2.2), it is the first approach to provide a description language for fusion specification with arbitrary data models. As such, the fusion specification also implicitly fulfills the requirement to focus on logical interaction (see Section 3.2.4), as events can be defined in an abstract manner, independent of the used devices and modalities.


Listing 4.2: Extending the type data-extension.type of GraphML allows adding application-specific markup to the <data> elements of nodes and edges.

Listing 4.3: GraphML specification with application-specific data associated with events that contain references (pointing and encircling).


Result of the XSLT transformation for the combination of "pointing" and "put_that".

Listing 4.4: GraphML specification with XSLT transformation for the combination of event data (pointing and put_that).


The described GraphML specification has no means of defining temporal sequences of inputs. What may seem a drawback at first is actually a conscious decision in favor of the aforementioned requirement. While GraphML allows the definition of additional attributes that could easily be used for that purpose, not providing this ability enforces the strict separation from dialog control.

4.5 Runtime Integration

Now that the fusion approach can be configured via GraphML as explained in the previous section, it is time to actually build a multimodal input fusion component that can be integrated at runtime, and to answer the issues emerging from such an integration. In the following sections, this will be explained in a bottom-up approach. First, the inner logic of the proposed fusion component will be elaborated. Then this component is embedded in the broader architecture of a companion system, focusing on the cooperation with a multimodal fission component.

4.5.1 The Multimodal Input Fusion Component

The following is an updated and extended version of what has been described in [144]. Figure 4.3 illustrates the relevant parts of the input fusion component when integrated with generic sensors and applications. While the input fusion itself is reusable and application-independent, interfaces for inputs and outputs (sensor pre-processings and the application interface) must be implemented specifically. Note that the sensors shown in Figure 4.3 are only exemplary.

As depicted, the overall fusion processing is twofold. First, there is a configuration phase that dynamically reconfigures the fusion process for the currently possible interactions based on a state description provided by the application ( I to III in Figure 4.3). Second, there is the input fusion pipeline itself that receives recognition events from the connected input sensors and passes them through the actual fusion process ( 1 to 5 ). Both parts of processing are described below, followed by thoughts on two important aspects of runtime integration, i. e. time synchronization and temporal fading of inputs.



Figure 4.3: The main parts of the input fusion and adjacent components. Configured by a state description of the application/dialog management ( I to III ), multiple parallel fusions based on evidential reasoning are performed ( 1 to 5 ). Based on the fused results, a final decision is made and interaction events are passed over to the application (cf. [144]).

Input Fusion Configuration

When the used input sensors are generic and only provide raw inputs (e. g. screen coordinates), it may be necessary to perform some pre-processing (e. g. reference resolution, filtering on currently relevant inputs) that depends on the current state of the application (e. g. what is visible on the screen). This is done by sensor-specific pre-processings that are configured by any information from the application that may be relevant for such a task (see I in Figure 4.3).

The input fusion itself requires a fusion specification as described in Section 4.4. So the application must at least provide a description of its current state. In its simplest form, such a state description may only be an identifier that corresponds to a fusion specification in GraphML, predefined at design time. To be more dynamic, an application could as well use any form of description (e. g. HTML, XAML, or even a proprietary notation as described in [144]), as long as there exists a transformation into a GraphML specification. Once a specification exists, the FODs for the actual fusion can be created (see II in Figure 4.3). It may be necessary to create a mapping back into application-specific interaction events in the application interface at the end of the fusion process (see III ). An application specifically designed with the input fusion component in mind may as well directly send a GraphML specification and understand the events defined therein, rendering this configuration step unnecessary.


Listing 4.5: An interaction input from a gesture sensor that contains two belief assignments in the form of observations with confidences. For the corresponding XML Schema, see Listing A.2 in the Appendix.

Input Fusion Pipeline

Once the input fusion pipeline is configured, it is finally ready to receive and fuse inputs. Depending on the type and format of information delivered by a sensor, some pre-processing may be necessary to transform the new event into an interaction input containing basic belief assignments (as observations with confidences) that can be handled by the fusion (see 1 ). As with the fusion specification in GraphML, such interaction inputs are defined in XML, as shown in Listing 4.5 for an input from a gesture sensor. As shown, the interaction input consists of a list of observations that are assigned confidence values and encapsulate application-specific data within data elements, just as in the fusion specification. Each event must contain an ID that has to correspond to the events defined in the GraphML specification (see Section 4.4.2). In addition, optional timing information on when the event was raised and when the actual event started and ended can be given (Time, Onset, Offset). Besides a SensorID that uniquely identifies the source of an input, a so-called FadingTime can be associated with an interaction input. This defines the time the given input remains valid (in milliseconds); details on its processing are given later on.
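A rough Python counterpart of such an interaction input is sketched below. The field names follow the text (SensorID, FadingTime, Time, Onset, Offset); the exact XML element names and types of Listing 4.5 are not reproduced here, so this is an illustrative data model rather than the implemented one.

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class Observation:
        event_id: str                 # must match an event ID in the GraphML FOD
        confidence: float             # basic belief mass assigned by the sensor
        data: Optional[str] = None    # application-specific payload (e.g. an XML fragment)

    @dataclass
    class InteractionInput:
        sensor_id: str
        observations: List[Observation]
        time: Optional[float] = None          # when the sensor raised the event
        onset: Optional[float] = None         # when the underlying input started
        offset: Optional[float] = None        # when it ended
        fading_time_ms: Optional[int] = None  # how long the input remains valid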

Whenever a sensor signals a new interaction input, the fusion pre-processing (see 2 ) distributes the incoming basic belief assignments among the currently defined FODs and corresponding fusion instances. This means, whenever a signaled belief of a sensor in an event matches an event of a fusion’s FOD, the belief is added to the input of that fusion instance. Finally, any belief mass

99 4 Fusion of User Inputs Using the Transferable Belief Model missing for one gets assigned to the neutral event ’*’, in order to preserve valid belief distributions P that satisfy A:A⊆Ω m(A) = 1 (cf. Section 4.2). Note that in this step, a sensor’s previous input may not necessarily be replaced by the new input of the same sensor. Instead it may be added as additional input, if it yields potential for combination, i. e. it reports beliefs for other events that are not conflicting with previous ones. This permits not only fusion between different sensors, but also between subsequent inputs of a single sensor.

Once the beliefs of a new interaction input are distributed, the affected fusion instances perform the combination of evidences as described in Section 4.3 using Equations (4.9) and (4.10) to compute combined beliefs and the overall conflict (see 3 ). Although not necessary for the fusion computa- tion, total beliefs and plausibilities (bel and pl from Equations (4.3) and (4.4)) are also computed for visualization and debugging purposes. Finally the pignistic transformation BetP from Equa- tion (4.7) is applied resulting in a list of events with probabilities. From these list, the event (or events) with the highest probability is then selected as final result of the input fusion for further pro- cessing, except when it is the neutral event ’*’ indicating that nothing happened. In case the result is a combined event, the semantic combination as defined by the XSLT transformation in the fusion specification is carried out for the data of the source events (cf. Section 4.4.2).

Once a fusion instance reports a final result, it is handed over to the fusion post-processing (see 4 ). This is the moment of input fusion, where the actual decision of the performed decision-level fusion is made (cf. Section 1.3.1). Here, a fusion output in the form of an XML document is created, as shown in Listing 4.6 for the evaluation example introduced in Section 4.3.2. A dedicated element indicates the type of result and can have the following values:

normal: As the term suggests, this is the normal case, where the result is a single most probable event given by the corresponding result element.

ambiguous: This indicator is used when there are multiple events that share the same highest probability. Then there are multiple result elements, one for each of the ambiguous results.

conflict: When the most probable event is that of a conflict, the events that provoked the conflict are given instead, so the application or dialog management can take the necessary steps to resolve the conflict.

redundant: This explicit result mode is necessary because the formal ability of the fusion approach to reinforce redundant inputs often does not apply at runtime. This is due to two facts: first, it operates on events that technically cannot be handled at the same time, and second, it produces results as soon as possible to avoid unnecessary delays for the user. As humans often tend to perform

redundant inputs (see Definition 6), it can easily happen that multiple sensors report on the same event, but with a short delay. Such non-fused redundant events are recognized by the fact that the current event has already been part of the result of a previous fusion within a heuristic time interval of 1.5 seconds, but with a differing set of sources. This result mode can then be used to prevent multiple reactions of the application to redundant inputs.
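The following sketch shows how the result mode could be derived from the outcome of a fusion instance; it is a simplification under the stated assumptions, and in particular the redundancy check only looks at the 1.5 second window without comparing the contributing sources.

using System.Collections.Generic;
using System.Linq;

// Sketch of how the post-processing could derive the result mode described above.
public static class ResultModes
{
    public enum Mode { Normal, Ambiguous, Conflict, Redundant }

    public static Mode Determine(
        IList<string> winners,                  // most probable event(s) of the fusion
        bool winnerIsConflict,                  // true if the conflict outcome won
        IEnumerable<string> recentResultEvents) // events reported within the last 1.5 s
    {
        if (winnerIsConflict) return Mode.Conflict;
        if (winners.Count > 1) return Mode.Ambiguous;
        if (recentResultEvents.Contains(winners[0])) return Mode.Redundant;
        return Mode.Normal;
    }
}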

Listing 4.6: A fusion output created whenever a fusion instance reports a result. Besides an indicator for the type of result (normal, ambiguous, conflict, redundant), it contains the most probable event(s) with their corresponding sources and data. For the corresponding XML Schema, see Listing A.3 in the Appendix.

Such a fusion output is either directly sent to the application/dialog management as XML, or an additional transformation and mapping back into application-specific interaction events is applied in the application interface (see 5 ). After input fusion, interaction inputs that contributed to the final result are removed from the system when one of the following two conditions holds:

• The result is a combined event (an edge in the GraphML specification)

• The result is an elementary, independent event (an unconnected node in the GraphML specification)

Otherwise, i. e. if the resulting event can still be part of a combined event, its interaction inputs are kept for potential combination with future inputs. Inputs thus potentially have an unlimited lifetime, but the temporal fading explained below eventually voids them.
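A sketch of this retention rule is given below; it assumes the GraphML specification has been parsed into a set of edge events and a map of node degrees, and all names are illustrative.

using System.Collections.Generic;

// Sketch of the retention rule above: inputs contributing to a result are removed when the
// result is a combined event (an edge of the GraphML specification) or an isolated node;
// otherwise they are kept for possible combination with future inputs.
public static class InputRetention
{
    public static bool ShouldRemoveContributingInputs(
        string resultEventId,
        ISet<string> edgeEventIds,            // events defined as edges (semantic combinations)
        IDictionary<string, int> nodeDegrees) // node events with their number of incident edges
    {
        if (edgeEventIds.Contains(resultEventId))
            return true;

        return nodeDegrees.TryGetValue(resultEventId, out int degree) && degree == 0;
    }
}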

Time Synchronization and Temporal Fading of Inputs

Fusion of inputs from different sensors that are part of a potentially distributed topology and run in separate runtime environments imposes the need for synchronization (cf. Section 3.2.3). Combination of evidences should be based on the actual times when the respective events happened,

independent of any communication or processing delays. That is why the interaction inputs discussed above have (optional) timestamps not only for the time when the event was raised by the sensor, but also for when the underlying input occurred (with onset and offset). All sensors delivering interaction inputs should take care to report timestamps based on a common, global timebase. In [139], where a speech recognizer running on the input fusion's host computer and an arm lifting sensor running on a smart watch were used, synchronization was accomplished using the Simple Network Time Protocol (SNTP) [99] to synchronize on the host computer's clock. SNTP allows simple synchronization with an offset between clocks that can be expected not to exceed a few milliseconds [158], which is sufficient for the purpose of this work. When no means of synchronization is available, no timestamps should be given in the interaction input of a sensor. The fusion component then uses the arrival times of the events as a fallback.

Despite the synchronization of timestamps, there is another problem to be solved at runtime when related inputs should be fused. As already mentioned above, input events are never received simultaneously, even when issued simultaneously. In addition to technical reasons of event handling, it may as well be that the user performs inputs in a sequential manner (e. g. using integration patterns [114]), or that a sensor's processing takes a while before an input is actually recognized. Since the evidential reasoning applied in the proposed input fusion system is ad hoc (there is no formal notion of time), there is a need to extend the temporal scope of such events and at the same time limit their lifetime. One solution to this problem is the use of temporal fading [144], similar to activation values decreasing over time as used by other researchers [123, 133]. When basic belief assignments have been made in step 2 of the fusion pipeline, the belief in the neutral evidence '*' is continually increased over time, while the reported beliefs for actual events are decreased. This fading happens in a cubic manner, continuously recalculating the belief in '*' as:

$$m(\ast) = (1 - m_{\mathrm{initial}}(\ast)) \cdot \left(\frac{t}{t_{\mathrm{fading}}}\right)^{3} + m_{\mathrm{initial}}(\ast) \qquad (4.11)$$

where $m_{\mathrm{initial}}(\ast)$ is the initial belief in '*' (with $0 \le m_{\mathrm{initial}}(\ast) < 1$), $t$ is the age of the interaction input, and $t_{\mathrm{fading}}$ is the time span the fading should take. Until $m(\ast)$ reaches 1 at $t = t_{\mathrm{fading}}$, the beliefs in the other events are decreased proportionately. Thus, fading of beliefs happens in a plausible manner, slowly at the beginning and increasing over time. The default time span for fading is 2.5 seconds, but it can be specified individually for each interaction input via the FadingTime attribute (see Listing 4.5). Temporal fading thus allows multiple temporally distributed events to be fused in the first place. When there are multiple inputs ready to be fused, time calculations are based on the latest offset of all currently available inputs and belief fading is recalculated as necessary.
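A compact sketch of how Equation (4.11) can be applied to a basic belief assignment is given below; the names are illustrative and the input is assumed to be a mapping from event IDs (including '*') to belief masses.

using System;
using System.Collections.Generic;

// Sketch of the temporal fading of Equation (4.11): the belief in '*' grows cubically with
// the age of an input, and the beliefs of all other events are scaled down proportionately
// so that the assignment still sums to 1.
public static class TemporalFading
{
    public static Dictionary<string, double> Apply(
        IDictionary<string, double> initialBba, double ageMs, double fadingTimeMs)
    {
        double mInitialStar = initialBba.TryGetValue("*", out double s) ? s : 0.0;
        double ratio = Math.Min(1.0, ageMs / fadingTimeMs);
        double mStar = (1.0 - mInitialStar) * Math.Pow(ratio, 3) + mInitialStar;

        // Scale the remaining events so that the faded assignment still sums to 1.
        double scale = mInitialStar < 1.0 ? (1.0 - mStar) / (1.0 - mInitialStar) : 0.0;

        var faded = new Dictionary<string, double>();
        foreach (var kv in initialBba)
            faded[kv.Key] = kv.Key == "*" ? mStar : kv.Value * scale;
        return faded;
    }
}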


The entire fusion pipeline described above is implemented in C# using Windows Presentation Foundation (WPF) from Microsoft's .NET framework. Figure 4.4 shows a visualization of the actual system with temporal fading of sensor inputs.

Figure 4.4: Visualization of the input fusion system showing time faded sensor inputs in the top left, the fusion’s FOD in the top right, information on the actual fusion in the lower left, and pignistic values in the lower right (cf. [144]).

4.5.2 Cooperation with Multimodal Fission in a Companion System

In the previous subsection, the complete runtime processing within the multimodal input fusion approach has been described with generic adjacent components as input providers and output receivers in mind. However, as the fusion approach is particularly aimed at being used as companion technology, it is time to consider its integration into a companion system. Recalling the dialog & interaction components from Section 1.1.4, the modules the input fusion directly interacts with are the input devices, the dialog management (DM), and first of all the Multimodal Fission component. Multimodal fission and fusion make up the so-called Interaction Management of a companion system. The cooperation of a probabilistic fission with adaptive reasoning and the fusion approach of this work results in a highly adaptive and model-driven interactive system that is described in the following. While this section focusses on the cooperation of both components and is based on work already published in [59, 143], multimodal fission is comprehensively addressed in Frank Honold's PhD thesis [56].


The main tasks in multimodal fission are concerned with the following four questions [48, 132]: (1) What is the information to present? (2) Which modalities should be used to present this information? (3) How to present the information using these modalities? (4) And then, how to handle the evolution of the resulting UI configuration?

To reason about the best UI configuration in a certain Context of Use (CoU) is a challenging task. Some approaches provide meta UIs where the user can specify a certain UI configuration, e. g. via an additional touch device (cf. [131]). Based on that, the system is able to respect the user's demands and can distribute the UI via the referenced device components. The approach of fission referred to here is elaborately described in [58]. It uses real-world data afflicted with uncertainty to perform a continuous UI adaptation that respects the ongoing changes in the CoU. Behind it is a rule-based approach that uses model-driven UI generation to realize adaptive UIs.

There are three aspects where fission and fusion directly cooperate with each other in a companion system:

1. The fission component provides the fusion component with all relevant information about the resulting generic UI. This is equivalent to the state description in the generic case described previously (see Section 4.5.1). The fusion pipeline is thus tailored to the currently possible inputs, and variable properties (e. g. positions of objects on the screen) can be considered by the input device components.

2. User-initiated demands for specific ways of information presentation (e. g. that something should be presented in a particular modality) can be identified by the fusion component and can directly influence the fission's reasoning process. This way, such demands can be considered directly, whenever possible.

3. As described in Section 4.5.1, user inputs can result in ambiguity or conflict as a fusion output. In such situations the interaction management can directly respond to the user and ask for clarification. Fusion and fission can cooperate at runtime to handle the disambiguation in a clarifying interaction with the user. This process can be handled completely within the interaction management, without depending on the dialog management's functionality.

To help elucidate the details of this cooperation, a short scenario is described next. Then the architecture illustrating the interplay of fission and fusion within a prototypical companion system is presented. After that, the interplay regarding the aforementioned aspects using this architecture is explained and concrete examples from the scenario are given. Finally, a conclusion states the implications of this cooperation and gives hints for future research.


Scenario Imagine the situation in which a random person shall set up a home cinema system. The user's challenging task is to wire up all the different devices. The system's task is to advise and explain alternative connections between different devices and to support the user in the decision on how to connect them. In the scenario we look at a situation in which the system has to realize a selection dialog that offers the user a choice between three different cables. The examples used in the following sections deal with different user reactions. One time the user just selects a cable. Another time, additional information about a specific cable is requested together with a demand on how it should be presented. Yet another time, the user performs an unclear input because he mixes up the cable types.

Architecture

From an architectural perspective within the companion system, (1.) is realized by a so-called Content Manager, (2.) by a so-called Nomination Manager, and (3.) by a Bypass to the dialog management. Figure 4.5 shows the architecture and the major components that are involved during the HCI loop. For a comprehensive and in-depth architecture of a reference companion system as a whole, see [61]. The communication between all components of the companion system is realized using a message broker system based on the SEMAINE project [137]. Using this, all components communicate via XML-based messages.

The system’s dialog management initiates the HCI loop as described in the scenario with a dialog [59] output Dout. This abstract output consists of a modality-independent selection offer of three cables as shown in Listing 4.7. The interaction management’s fission component is in charge to infer a modality-specific UI description based on the abstract description of the selection offer in the received Dout. The fission’s resulting interaction output Iout (see Listing 4.8) is then passed on to the involved device components that render the UI. The system can utilize displays of different sizes at different locations in this scene as well as use speakers for text-to-speech synthesis (TTS). Now the UI is ready to accept diverse user inputs (cf. Figure 4.8).

In the scenario, the fission infers a UI where the user is free to use speech, pointing gestures, and touch interaction as explicit user inputs. Beyond that, the system's prototypical implementation is also able to process implicit inputs (cf. [134]) like recognized user disposition and user location shifts to adapt the UI. Once the input device components have sensed any explicit user inputs up


[Figure 4.5 depicts the companion system's interaction management: the user interacts via input and output device components, which connect to the Multimodal Fission and Multimodal Fusion components (exchanging Dout/Iout and Iin/Din), the Content Manager, the Nomination Manager, and a Bypass to the Dialog Management; further components are the Application, Task Planning, the Knowledge Base, and Environment Recognition.]

Figure 4.5: Interplay of multimodal fission and fusion within the companion system using a Content Manager, a Nomination Manager and a Bypass to the Dialog Management (cf. [59]).

Listing 4.7: An exemplary abstract and modality-independent dialog output. The output contains a topic and a selection (one of three items) including a selection prompt. The dialog’s control flow is influenced by the object IDs. The interaction’s information flow is defined by the information IDs (cf. [59]).

to a certain decision level of abstraction, they provide the fusion component with their modality-specific interaction input Iin. Up to this point, the input components work independently of and isolated from each other. After combining these inputs by the fusion component as described in Section 4.5.1, the most probable event is transformed into an abstract and modality-independent dialog input Din by the application interface (cf. 5 in Figure 4.3) and passed on to the dialog management. This marks the end of the current HCI cycle. The HCI loop can continue with another dialog output.

In the application scenario, for example, the user can perform a pointing gesture on the HDMI cable visualized on the screen while saying “this one” as a deictic reference. In this case, two input device components would raise Iin events. One event is raised by the gesture recognition as a reference to the HDMI cable. The other input event is raised by the Automatic Speech Recognition (ASR) component because of the verbal “this one” trigger. The fusion of these events then results in a dialog input Din with the selection of the HDMI cable, which is sent to the dialog management. The following two paragraphs provide insights into the work of the dialog management component and into one of the possible input device components, the component for ASR.

Dialog Management: The dialog management component serves as the link between a system's application logic and the user interface. In the scenario, a hierarchical dialog model is used, where complex tasks can be communicated via sequences of individual dialog acts [60]. The acts result from user-specific decompositions of the hierarchical dialog model based on the course of the dialog. Within the hierarchy they are structured using guards and effects as pre- and post-conditions.

Input Device Component for ASR: The device component for ASR recognizes the user's verbal utterances and tries to convert them into corresponding speech interaction inputs. The ASR component's recognizer is able to work in four different ways. (1) Recognition can be done based on certain given parameters. In this case the ASR component builds up the grammar from the given parameters, e. g. based on the choices in the second half of Listing 4.8. (2) Nomination templates are added to the grammar, which work like a matching algorithm for regular expressions. This allows the recognition of utterances like those starting with: “(do not)[show|tell|give] me ([more|additional]) information about . . . ”, and thereby allows the system to respect user-initiated UI specifications and dialog requests. (3) In addition, the ASR component enhances the grammar with items to support deictic references (e. g. “this one”, “that”, . . . ). This allows the fusion to fuse complementary multimodal expressions like pointed references in combination with verbal triggers (cf. Definition 7). (4) Finally, the ASR component is able to analyze inputs or input fragments in dictation mode. Using these techniques the ASR component is able to analyze the sentence “Show me more information


about the HDMI cable.” Based on the sentence's starting sequence, the component can infer (i) a possible nomination for a given channel as well as (ii) an explanation request. Subsequently, the ASR recognizes (iii) the HDMI cable (either by identifying a parameter via a grammar (1) or via dictation mode (4)). In combination, the ASR device component is able to send an interaction input including diverse input parameters plus confidence scores (see Listing 4.9).

Interplay of Fission and Fusion

With the above, a common HCI loop has been described that could be realized in other approaches, too. Next, the realization of the three aspects of interplay between fission and fusion is described by extending the architecture with the Content Manager, the Nomination Manager, and the possibility to realize a Bypass to the dialog management. Concrete examples from the scenario elucidate the interplay.

Tailoring the fusion to an adaptive UI: Fission and fusion both work on different models and abstraction levels that best fit their respective purposes. Once the fission component has decided on how to present an interface to the user, the fusion component needs to be informed about the resulting interaction possibilities. Put in the words of the generic case, the input fusion must be informed about the current state of the application in the form of a fusion specification in GraphML (see Section 4.4). In the cooperation with fission, where interaction possibilities are highly dynamic, a dedicated component in the form of the Content Manager (cf. Figure 4.5) is responsible for providing this kind of information. After the fission has reasoned about the concrete output configuration, the content

manager inspects the resulting Iout (see Listing 4.8) and identifies all objects that can be part of a user interaction. The content manager then uses this information to create an abstract interaction model in the form of a GraphML specification as visualized in Figure 4.6.

The graph shows that the user can trigger selections, can make references to objects, and can perform requests for additional information. To elucidate this, imagine the situation where the user states “Give me more information about that” and at the same time points at the HDMI cable. In such a situation, the ASR component would raise an input containing just a request of type 'explanation'. The gesture component would raise an input containing a reference to the objectID 'hdmi' and the informationID 'hdmi_information'. As defined in the graph via the edges between the input nodes, the fusion component is able to combine these two inputs and create a complete request that contains the type 'explanation' as well as the objectID and informationID of the HDMI cable. In addition, the confidence values given by the input components are taken into account to make sure only the most probable input results are forwarded to the dialog management.
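For illustration, the sketch below shows the effect of such a semantic combination on the object level; in the system itself this is defined per edge of the fusion specification via XSLT, and the key names, which follow Figure 4.6, are purely illustrative.

using System.Collections.Generic;

// Illustration of the effect of the semantic combination defined per edge of the fusion
// specification (realized there via XSLT): the partial request from the ASR (only a type)
// and the reference from the gesture sensor (objectID, informationID) are merged into one
// complete request.
public static class SemanticCombination
{
    public static Dictionary<string, string> CombineRequestWithReference(
        IDictionary<string, string> requestData,   // e.g. { "type": "explanation" }
        IDictionary<string, string> referenceData) // e.g. { "objectID": "hdmi", "informationID": "hdmi_information" }
    {
        var combined = new Dictionary<string, string>(requestData);
        foreach (var kv in referenceData)
            combined[kv.Key] = kv.Value;           // the reference completes the request
        return combined;
    }
}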


Listing 4.8: Excerpt from the interaction output for the dialog output from Listing 4.7. Besides the visual output to be presented on a touchscreen, the model provides recognition choices to build up a grammar for ASR (cf. [59]).

[Figure 4.6 shows input event nodes such as Selection, Reference, Request, and Req_Nom with attributes like type, objectID, informationID, and desiredOutputChannel, connected by edges representing their semantic combinations.]

Figure 4.6: Visualization of the GraphML fusion specification created by the content manager from the interaction output of Listing 4.8. The graph expresses all possible input events (nodes) and the results of their semantic combination (edges). Relevant application-specific data is shown in nodes and edges (cf. [59]).


User demands for information presentation: The nomination manager (cf. Figure 4.5) is used to tailor the interaction's information flow towards the user. Respecting users' explicit demands for a particular modality is expected to increase the perceived credibility and reliability of the companion system. It is the fission component that is able to respect those user-initiated UI demands at runtime. But it is the fusion component that is able to identify those nominations on a semantic level. In the second example the user utters the wish: “Show me more information about the HDMI cable.”

The ASR component analyzes the utterance and sends two interaction inputs Iin (see Listing 4.9). The first one is a belief assignment to the user request including the nomination (belief 0.95). The second one contains references to the HDMI cable (belief 0.82) and the Cinch cable (belief 0.18).

Listing 4.9: The two interaction input messages as sent by PC_7’s ASR component after the user said: “Show me more information about the HDMI cable.” (cf. [59]).

As described earlier, the ASR component can make use of different pre-defined recognition templates, which refer to different desires or dislikes of a nomination. Based on the different confidence values, the fusion decides that the current input represents a request concerning HDMI with a nomination for the visual channel. Accordingly, the fusion informs the nomination manager with a new nomination containing the identified desire (see Listing 4.10). The desire's probability is set to 1.0, since the fusion does not have further indications for any other desired output channel concerning the requested HDMI explanation. The nomination manager is able to aggregate different nominations for any specified dialog output (referenced by a dialog ID) or information (referenced by an information ID).


Listing 4.10: The nomination for the desire to perceive the hdmi_explanation via the visual channel based on the user’s demand (cf. [59]).

Right after sending the nomination message, the fusion passes the modality-independent HDMI explanation request to the dialog manager. In turn, the dialog manager responds with a suitable dialog output containing the HDMI explanation. The fission inspects the output and identifies the dialog ID hdmi_explanation. While reasoning about the dialog's modality-specific representation, the fission consults the nomination manager concerning nominations for the actual dialog output. The fission's reasoning algorithm is able to respect any stored nomination with a certain dialog or information ID that occurs within the processed dialog output description. The resulting visual rendering is displayed in Figure 4.7.

Figure 4.7: The visual HDMI explanation as requested by the user. Alternatively, the system is able to adapt the UI to other specified nominations (e. g. including additional audio comments as multimodal output) (taken from [59], © 2014 IEEE).

In that way, the nomination manager supports the fission with additional knowledge for its reasoning process: knowledge that can be inferred using the fusion's capability to decide whether multiple inputs contradict or reinforce each other. Beyond the given scenario, the nomination manager is able to handle nominations that reference certain desired device components or that even express a certain dislike.


Resolution of ambiguities using a bypass: In contrast to conventional GUI input technologies like mouse and keyboard, emerging technologies used for multimodal interaction, such as speech or gesture recognition, often provide inputs afflicted with uncertainty. And it is the interaction management that must deal with this new kind of ambivalent data, be it that sensors report false or unclear interpretations or that the users themselves perform ambiguous or even conflicting inputs. Accidentally or on purpose, it can easily happen that the input fusion cannot clearly decide on the user's intentions. In such cases, an intelligent system should deal with the situation by providing helpful feedback to the user and offering the possibility to resolve existing ambiguities.

In the scenario, imagine the situation where a user wants to select the SCART cable. He is pointing at the correct cable but mixes the cable names up and says: "HDMI cable" (as illustrated in Figure 4.8). This results in interaction inputs from the ASR component as well as from the gesture recognition component. Both inputs contain different object references. The input fusion component then identifies these as conflicting inputs and hence is not able to derive a definite user input.

In order to deal with such a situation, fusion and fission can directly cooperate to resolve the ambivalent input. This allows bypassing the dialog management, resulting in a shortened HCI loop (cf. Figure 4.5). As all necessary information is already present within the fusion module, a generic dialog output as shown in Listing 4.11 can automatically be constructed by the fusion's application interface and forwarded to the fission module. Rendering this output results in the screen shown in Figure 4.9. This relieves the dialog management from the necessity to explicitly model such additional dialogs, which do not contribute to the overall dialog flow. At the time when the conflict is

Figure 4.8: The rendered selection dialog act. A user performs conflicting inputs at the selection task, pointing at the SCART cable but accidentally saying: “The HDMI cable.” The red colored overlay is not displayed by the system (taken from [59], © 2014 IEEE).

resolved by the user, the fusion detects a valid input and sends a dialog input to the dialog management. In turn, the normal dialog sequence is continued. This bypass approach addresses interaction-related ambiguities within the interaction management according to the principle: solve the problem where it occurs. This generic approach works without presuming the dialog management's ability to resolve ambiguities.
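As an illustration of the conflict case only, the following sketch shows what such an automatically constructed clarification could look like on the object level; it is not the actual XML of Listing 4.11, and the type names and the prompt wording are assumptions.

using System.Collections.Generic;

// Sketch (not the actual XML of Listing 4.11) of what the fusion's application interface
// could assemble when a conflict is detected: a selection dialog offering the conflicting
// alternatives, forwarded directly to the fission via the bypass.
public class ClarificationDialog
{
    public string Prompt { get; set; }
    public List<string> AlternativeEventIds { get; } = new List<string>();
}

public static class ConflictResolution
{
    public static ClarificationDialog BuildClarification(IEnumerable<string> conflictingEventIds)
    {
        var dialog = new ClarificationDialog
        {
            Prompt = "Your inputs did not match. Which of the following did you mean?" // illustrative wording
        };
        dialog.AlternativeEventIds.AddRange(conflictingEventIds);
        return dialog;
    }
}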

Listing 4.11: An intermediate dialog output created by the fusion module when conflicting user inputs are detected. Using a bypass, this dialog output is directly forwarded to the fission module (cf. [59]).

Figure 4.9: Ad hoc conflict resolution by the interaction management. The inferred interaction output as displayed on the touch screen (taken from [59], © 2014 IEEE).

Conclusion

The presented architecture extends the interaction management of existing approaches by introducing three conceptual connections between fission and fusion. The architecture is realized in a running companion system that exemplarily supports the user in setting up a home cinema system by advising on and explaining alternative connections between different devices. The presented approach is domain-independent, purely model-driven, and realizes adaptive behavior in real time. The Content Manager allows for a flexible and dynamic system output while preserving a robust input recognition tailored to the current output configuration. Using a dedicated component for this task facilitates

domain independence, helps separating responsibilities, provides clear interfaces, and facilitates debugging. The concept of the Nomination Manager makes it possible to affect the fission's decision process. It works like a structural facade design pattern, which provides a uniform interface to influence the fission process. The proposed Bypass in the HCI loop for resolving inconsistencies in the user input stems from the fact that recognition-based input methods are applied (e. g. speech and gesture recognition). These do not offer the decidedness of classic mouse and keyboard inputs and can lead to situations where inputs are ambiguous or even contradict each other. In addition, it may be the users themselves who perform such inconsistent inputs. Using a direct connection between the inconsistency-detecting input fusion and the output-generating fission component, such situations can be resolved in specific ad hoc interactions with the user. Separating the dialog management from the used input techniques dismantles the obligation to explicitly model these additional dialogs that do not contribute to the overall dialog flow. Currently, the generated dialogs are quite simple and just present a list of possible alternatives from which the user is supposed to select. Additional information that reveals the reasons for the system's current indecisiveness might be useful. This might also increase a companion system's credibility and perceived trustworthiness. Further investigation is needed to check whether this is appropriate for all kinds of misunderstandings or whether an additional component (like the nomination manager) that exclusively handles such queries could be advantageous.

4.6 Summary and Comparison to the State of the Art

This chapter presented a comprehensive approach to multimodal input fusion based on the Transferable Belief Model in order to attain the first goal of this work, i. e. creating an approach with a sound handling of uncertainty. It started by stating the drawbacks of using n-best lists and Bayesian Inference, motivating the use of Dempster-Shafer theory as the basic model of uncertainty. After describing the formal basics, the original DS rule of combination was introduced and it was explained why the Transferable Belief Model variant is preferable, due to its more plausible open-world assumption and the retention of a value for conflict. Then, for making a final decision after evidences from different sources have been combined, the pignistic transformation back into probabilities was introduced (Equation (4.7)). To make the TBM applicable for input fusion, the rule of combination was adapted by introducing a neutral evidence and the use of the Cartesian product (Equation (4.9)), based on work by Reddy & Basir [126, 127]. The fusion approach based on these equations was then theoretically compared to the original DS and to n-best lists. It has been shown that it delivers more plausible results and fulfills the functional requirements of combination, disambiguation, reinforcement, and conflict detection elaborated in Section 3.1. Being based on a formally well-defined

concept of uncertainty, the approach also fulfills the non-functional requirement of a sound handling of uncertainty (cf. Section 3.2.1).

In order to make the approach actually deployable, a fusion specification based on the notion of graphs using GraphML and the semantic combination of application-specific data using XSLT were introduced. Being the first approach to provide a fusion specification language for arbitrary data models by describing the data entering and leaving the input fusion in a flexible and reusable way (via XML schemas), it fulfills the requirement to facilitate modularity (cf. Section 3.2.5). In addition, the ability to define events in an abstract manner and the decision not to provide means of specifying temporal sequences fulfill the requirement to focus on logical interaction (cf. Section 3.2.4).

Section 4.5 was then first concerned with the runtime integration of the approach and the related question of how to deal with temporal aspects at runtime. The requirement to support the synchronization of inputs (cf. Section 3.2.3) is fulfilled by correctly handling timestamps and event onsets, and by a means of temporal fading for events. Then, the integration of the component into an actual companion system and the cooperation with the multimodal fission component were described, where unique benefits arise from the cooperation when it comes to user demands for information presentation (so-called nominations) and the resolution of ambiguities.

Compared to the state of the art of fusion approaches presented in Chapter 2, the presented one belongs to the category of hybrid and statistical approaches. Though it is based on the idea of Reddy & Basir to use evidential reasoning in the form of the TBM [126, 127], it differs from and extends their approach in several aspects. In the theoretical evaluation (see Section 4.3.2), the proposed approach is not only compared to traditional DS, but also to the widely adopted use of n-best lists. In Reddy & Basir's work, though a value for conflict is calculated, it is ignored when making a decision and the system instead opts for the event with the next highest pignistic value. The approach presented here actually uses the conflict value, e. g. in Section 4.5.2 it is shown how conflicts can be resolved in cooperation with a multimodal fission component. While Reddy & Basir settle for using conceptual graphs as a descriptive representation of all possible events on an abstract level, the proposed use of GraphML and its extensions for application-specific data and semantic combinations through XSLT is a genuine added value (see Section 4.4). Besides presenting an actual implementation of the fusion approach, it is also shown how to deal with temporal aspects of inputs (see Section 4.5.1). This aspect is almost completely ignored by Reddy & Basir, as they only mention recognition of command sequences as future work.


Put into the classification scheme of Section 2.1.5, the approach developed in this work yields the row shown in Table 4.12. To facilitate referencing, FERMI ("Fusion with Evidential Reasoning for Multimodal Interaction") is entered in the name column.

Name/System: FERMI
Category: Hybrid/Statistical
Notation: GraphML (XML) with XSLT
Uncertainty Handling: Transferable Belief Model (TBM)
Time Annotation: Time Intervals
Time Handling: Temporal Fading
Exemplified Modalities: Speech, Pointing
Application Examples: Wiring Assistant
Further classification codes (cf. Table 2.1): LI, D, PI, FC, FCA

Table 4.12: The proposed approach in the classification scheme for multimodal input fusion approaches (cf. Table 2.1).

It should not go unmentioned that the proposed approach has its peculiarities. When used in simple systems with only a few input sensors that only assign beliefs to elementary events (as is the case for the majority of off-the-shelf sensors), it behaves almost like n-best lists, though it still provides a value for conflict and has a means to express that "nothing happened". The approach performs best when there are many sensors that assign beliefs to multiple events at the same time. Then the combination and transferring of beliefs come into full effect. Another aspect worth mentioning is the assumption of independence between sensors that the approach shares with many others. However, it is believed to be largely unproblematic due to the disjoint sensor processing found in common HCI scenarios, as described in the discussion of Section 4.3.2.

Regarding future supplements of the approach, there are several aspects worth mentioning. The temporal fading of initial beliefs currently used (also when extended by the interaction history approach) is purely based on heuristics, with a decrease in a cubic manner over the course of 5 seconds. Though there already is the possibility to specify this time on a per-input basis, this feature is currently not applied. It should be studied whether such a fading actually corresponds to what is presumed by humans and whether different ways of fading for different inputs (or modalities) should be applied. As argued in this work, the focus of the developed input fusion lies on the logical interaction in terms of the Stormy Tree model. That is why the fusion specification in GraphML does not define temporal constraints; instead, a continuous update of this specification initiated by the dialog management is assumed, e. g. to realize an interaction like Bolt's "Put That There". Nevertheless, if necessary, GraphML's extensibility features could be used to annotate the edges within the graph of a fusion specification with temporal constraints. Of course, the fusion component would then have to be extended to take such constraints into account as well.

Relation to W3C Standards for Multimodal Systems

As mentioned in Chapter 2, the multimodal input fusion approach developed in this work and its integration into a companion system architecture should be related to W3C's standards for multimodal

systems, so that anyone relying on them is able to assess the usefulness and eligibility of the approach for their own purposes.

Regarding W3C’s MMIF (cf. Section 2.3.1), the input fusion component of this work adheres its scheme of input processing quite well. Recalling the respective section (4.5.1) and Figure 4.3, the in- put sensors represent recognition components and the sensor’s pre-processings perform the semantic interpretation (if not already done within a sensor). The input fusion itself represents the integration component of MMIF. Although not explicitly mentioned in Section 4.5.1, semantic interpretations could as well be described using EMMA as proposed by MMIF, as it contains all information re- quired to construct observations expected by the fusion component, including application-specific markup. Though being out of this work’s scope, what is missing to MMIF is the specification of the user interface components as DOM objects.

Taking the broader view of a companion system, things get more complicated. The modality arbitration performed by the fission component clearly matches MMIF's generation. The device components perform styling (though not purely markup-based) and rendering. The dialog management has no direct match in the interaction manager of MMIF, as there is no dedicated host environment and input and output components are not managed by the dialog management. There is also no dedicated session management in the companion system of the previous section, as this is not required by the exemplified application. The knowledge base of a companion system is the equivalent of the system and environment component of MMIF, as it provides information about the context of use, though not using a standardized markup language. To be fair, it has to be said here that it is not the objective of companion technology to come up with a system for authoring multimodal applications as intended by the MMIF, but instead to develop and showcase technologies that enable systems to have characteristics of competence, individuality, adaptability, availability, cooperativeness, and trustworthiness (cf. Section 1.1).

Regarding W3C’s newer MMI Arch (cf. Section 2.3.2), the input fusion component of this work, as well as the fission component described in the previous section, can be regarded as a special kinds of modality components of the MMI Arch, as generally proposed by Schnelle-Walka et al.[136]. In there, it is also set out how the interaction manager can interact with fission and fusion compo- nents, creating a “messaging circle”, where all requests from IM to any lower MC should be passed through the fission component, whereas results from the lower MCs are passed through the fusion component before going back to the IM. This is similar to the HCI loop of a companion system (cf. Figure 4.5). In contrast to the purely tree like structure defined in MMI Arch, a modality com- ponent will then have two upper interaction managers, namely fission and fusion, yet this has only

minor consequences for the messaging concept (ibid.). The fusion component of this work could be used in that scheme, though the life cycle events of MMI Arch are currently not implemented. Only a rudimentary life cycle management has been realized for the integration into exemplary companion systems (prepare, start, and stop processing). Regarding the runtime framework required by MMI Arch, the reference architecture for companion systems completely relies on a message-oriented middleware based on the SEMAINE project [61]. The used middleware is a cross-platform, topic-based message broker system using a publish/subscribe mechanism. This means messages are sent to well-defined, hierarchical topics to which any component can subscribe in order to receive events. This simplifies communication between multiple components and at the same time fulfills the requirements of MMI Arch regarding the transport layer.
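For illustration only, the following sketch shows the topic-based publish/subscribe style in the abstract; it is not the SEMAINE API, and the topic names in the comments are purely hypothetical.

using System;

// Generic illustration of the topic-based publish/subscribe pattern described above; this
// is not the SEMAINE API, only a sketch of the messaging style the components rely on.
public interface IMessageBroker
{
    void Publish(string topic, string xmlMessage);
    void Subscribe(string topic, Action<string> onMessage);
}

// A component such as the input fusion would subscribe to the topics of the input device
// components and publish its fusion outputs to a well-defined topic of its own, e.g.:
//   broker.Subscribe("companion/input/asr", HandleInteractionInput);
//   broker.Publish("companion/fusion/output", fusionOutputXml);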

5 Individual User Adaptation

Recalling the fundamentals of companion technology (cf. Section 1.1), one essential goal is to provide "functionality completely individually adapted to each user" [173], which was also identified as a weakness of the state of the art (cf. Chapter 2). While the fusion approach developed in the previous chapter is technically fit for companion technology, primarily due to its sound handling of uncertainty and its overall flexibility, it is not (yet) individually adaptive to its users. That is why, concordant with suggestions of other researchers, taking the users' individuality into account is the second major goal of this work (cf. Chapter 3). The goal is reflected by the non-functional requirement to account for user individuality (cf. Section 3.2.2). As already stated there, this can be separated into two aspects: first, the user's natural behavior, i. e. when and how people interact multimodally, and second, the ability of a system to take this into account. This chapter follows this outline. First, user-sided preferences of modality use are explored in terms of influencing factors. From the state of knowledge in this field, open issues are identified, such as the transferability of findings and the only marginal knowledge about the role of the user. These are addressed by a user study that expands our knowledge on users' adoption of modalities and influencing factors for the basic task of selection. Second, system-sided adaptation is investigated. This finally results in an extension of the fusion approach that allows user-individual adaptation, the merits of which are evaluated in another user study. Finally, the chapter is summarized and suggestions on directions for future research are given.

5.1 User-Sided Preferences for Modality Use

This section is concerned with the question of how users adopt (multi-)modalities and what influences their choice. When reviewing the state of knowledge in this field, it becomes clear that most of the knowledge is limited to specific scenarios and there is still no general understanding of the role the users themselves play when deciding which modalities to use. This leads to the design of a user study, described in the subsequent section, that focusses on the generic and recurrent task of selections. From the results of the study, various factors that have an influence on modality choice

are presented, including user-specific factors like gender and personality traits. The discussion then summarizes the findings and draws conclusions regarding the applicability to the task of multimodal input fusion. Note that this section is based on and extends work already published in [141, 142].

5.1.1 State of Knowledge

Various research has been conducted on factors that influence how people choose among the provided modalities. In the following, factors that have been the object of research are presented, including the task at hand, the presentation, and the situational context. From these findings we can see that most results are restricted to quite complex applications, make use of outdated modalities, and only marginally cover the role of the users themselves. As described in the summary at the end, this supports the need for a study that uses a single selection as a simple yet generic task and that also explores individual factors like gender and personality.

Application Domain and Task

Obviously, the application domain and hence the tasks it involves have a momentous influence on the modalities users choose to interact with. In order to compare interaction modalities, Ren et al. performed two user studies with a CAD and a map application, respectively [129]. In the CAD system, users were provided pen, speech, and mouse input modalities to draw an engineering plan that should match a given sample plan. Tasks involved drawing, location, modification, and property selection (thickness/color etc.). The system randomly chose one of four input possibilities: mouse, pen, pen + speech, and pen + speech + mouse. Results show that the pen + speech + mouse mode was significantly faster and received the highest satisfaction rating from the users, though the differences were not reported quantitatively. In the second study involving a map application, users had to perform trip planning, involving tasks of distance calculation, object location, filtering, and information retrieval. Again, the allowed input modes were chosen randomly from keyboard, mouse, pen, and speech. Single-device use with unimodal expressions, as well as multi-device use supposedly allowing multimodal expressions, were tested, though further details were not given by the authors. Results show that pen + speech was fastest and most preferred by the users, outpacing the other bi-modality modes as well as all single and tri-modality modes.

Mignot et al. compared touch and speech input in a flat furnishing application that involved different tasks of object placement [98]. Use of the two input modalities was unrestricted and multimodal input expressions were interpreted by human operators in a WOZ design. Results indicate large

differences in individual user preferences, as two of the eight subjects preferred unimodal input expressions, while the others also applied complementary and redundant multimodal expressions for increased speed and reliability. In complementary expressions, touch was used for denoting positions and objects, while speech was used for more abstract actions and relations (e. g. "... the north wall", "Switch the couch and the TV set.").

Ratzka compared interaction modalities used in an e-mail organizer on PDA and desktop devices that involved tasks of filtering, forwarding, and answering e-mails [125]. After an initial training with each of the available input modes (speech and pen for the PDA; speech, mouse, and keyboard for the desktop system), users were free to use unimodal and multimodal input expressions as they liked. Overall, results show that most users preferred combinations of modalities over traditional unimodal inputs for both PDA and desktop systems. A more detailed analysis not only revealed differences between users, but also significant differences among many of the tested variables, e. g. the device, the subtask (name search, text search, reply, send, navigate), the task complexity, the screen design etc. Additionally, Ratzka compared "perfect" wizard and stand-alone versions of speech recognition, showing that users preferred traditional interaction when recognition rates were low (60.8% for command input, 50.2% for dictation).

As the results of Ratzka already indicate, the complexity of a task also plays a decisive role. Specific examinations of this aspect were done by Oviatt et al. in a map interaction application with speech and pen input that involved tasks like object placement, information retrieval, route creation, and display manipulations [116]. Instructions were given by imaginary headquarters and differed in complexity, ranging from simple tasks like "Situate a volunteer area near Marquam Bridge" to complex ones like "Place a maintenance shop near the intersection of I-405 and Hwy 30 just east of Good Samaritan". While overall, multimodal input expressions were the most frequent ones (61.8%), there was also a significant increase of multimodal expressions with increasing task complexity. While simple tasks were performed multimodally in 59.2% of the cases, the most complex ones were performed multimodally in 75.0% of the cases. Oviatt et al. also report individual differences in unimodal modality use, i. e. six of ten users exclusively used one of the provided modalities, while the others kept switching between them.

Regarding task complexity, similar results were also reported by Jöst et al., who deployed a handheld device for pedestrian navigation assistance involving tourist domain tasks with increasing complexity, ranging from panning and zooming the map, over information retrieval about sights, to tour planning [68]. Speech and pen input were available and participants could freely choose between unimodal and multimodal inputs. Results of subjective usability ratings for the two tested settings


(laboratory indoor and in the field outdoor) show that the higher the task complexity, the higher the rating for the system, though unimodal and multimodal interactions were not assessed separately.

Presentation and Output Modalities

Besides the task itself, the way it is presented is another factor that influences the users' choice of input modalities. De Angeli et al. investigated multimodal user behavior in a form-filling task in which users were able to use mouse and keyboard input modalities [35]. To test the influence of presentation, the labelling of entry fields and the form of mouse pointing feedback were varied. Labelling was either complete, i. e. every entry field was labelled by a different letter to support direct verbal references, or it was only partial, i. e. two thirds of the entry fields were labelled by asterisks, so verbal references could only be given in relation to letter-labelled fields (e. g. "the field on the left of field B"). The second factor, pointing feedback, was either present, i. e. pointing at a field resulted in a color shift, or absent, i. e. no visual change occurred. The actual form-filling was not designed in the classical way but in the form of a conversational dialog. Instead of focussing a field with the mouse and then typing the intended content, users had to use typed natural language in a dialog window to ask for instructions on the data to be inserted in the form window and then provide the information, which was then automatically inserted. So typing was only allowed in the dialog window, while mouse pointing was restricted to the form window. The results of the study show that multimodal inputs (deictic references like "here" combined with mouse pointing) were used more frequently when labelling was only partial, due to the difficulty of verbally describing the referred field. When labelling was complete, it encouraged users to make redundant references, i. e. the co-occurrence of mouse pointing at an entry field and a fully determined verbal specification. Presence of feedback led to an increased frequency of unimodal mouse pointing with no verbal referencing.

Not only the way an output modality is used (as in the study above), but also the very presence of an output modality changes the modalities users adopt for input. Using a smart home environment, Bellik et al. investigated the influence of output modalities on the users' adopted input modalities [8]. Within a smart room, users had to perform simple tasks like light switching, music listening (start/stop), and increasing/decreasing light and sound levels. The system could use text, graphics (icons), and speech output modalities, while users were able to use speech, a touch screen, or a remote control for input. Each interaction started with a redundant multimodal output of the system that introduced the task to be conducted. After the user started the task, the system asked a question about the parameters of the task using only one of the available output modalities. It was then observed which modality the user adopted for answering the question, before the system finally performed the

action. Results show that the initial multimodal output usually resulted in speech input (55.6% of all inputs used speech). For the unimodal system query, however, the results indicate a momentous influence of the applied modality. Verbal output modalities (text and speech) led to an increased use of speech input (63.3% and 96.7%, respectively), while graphics output promoted touch screen use (66.7%). Bellik et al. assume some kind of "influence power" of each output modality on the subsequently adopted input modalities, and that these get combined when multiple modalities are used for output.

Situational Context

Another important factor is the context in which the interaction takes place. The study by Jöst et al. with a pedestrian navigation system (as described previously) also assessed the willingness to use multimodal interaction of the laboratory participants and the outdoor participants of the study [68]. The participants of both settings were asked whether they would use multimodal interaction to plan a tour in different environments (e. g. walking/standing in a pedestrian zone, standing at a viewpoint, sitting in a tram) and in different social contexts defined by accompanying persons (alone, with a mate, family, friends, and strangers). While there are marginal differences between both groups and the different environments, the overall results indicate a declining willingness to use multimodal interaction when there is less intimacy between the user and other persons around. This means the willingness to use multimodal interaction is greatest when people are alone, and smallest when they are with strangers.

Wasinger and Krüger conducted two studies with an instrumented shelf displaying electronic devices for sale, one in a laboratory setup, the other one in a real-world retail shop [170]. Given a PDA and a headset, participants had to perform information retrieval tasks on different features of the displayed products (digital cameras). Participants could use speech, handwriting, pointing via touch screen, and picking up the actual objects in the shelf to interact with the system and compare the offered products. They had to perform input expressions for the feature and the object of interest, respectively. These inputs could either be performed in a single complementary input expression (e. g. speech for the feature and pointing for the object), using a unimodal and a redundant input expression (e. g. speech for the feature, redundant speech and pointing for the object), or using two redundant input expressions (e. g. speech and handwriting for the feature, speech and picking up for the object). In total, 12 complementary interactions and 11 inputs involving redundant input expressions were tested. After each interaction, participants were asked whether they would use the interaction and to give the answer on a four-point scale ("prefer not", "maybe not", "maybe yes", "prefer yes"). In the real-world setup, participants preferred the single complementary inputs over the inputs involving

redundancies, while no such significant preference is reported for the laboratory setup. Interestingly, when focussing on the modalities used to give the feature information within such complementary inputs, users in the real-world setup preferred pointing on the touch screen, while users in the laboratory preferred using speech. When looking at the complete task and the rankings of the available modality combinations, users in the real-world study preferred modality combinations containing handwriting and pointing, while in the laboratory setup, combinations containing speech and picking up objects were the best rated ones. This indicates the tendency to prefer non-observable modalities in a public environment (though they were rated less efficient in the study) and the more observable modalities (which were rated more efficient) in a private environment.

Different contextual settings were also assessed by Reis et al., who conducted a case study in a home environment, a public park, on a subway, and while driving in a car [128]. Users had to fill out a questionnaire using either a tablet computer or a PDA (depending on the environment) that contained five open questions on personal information (age, gender, address, etc.) and on the task itself (how users performed the task, whether they encountered any difficulties). A final closed question had the users rate how hard it was to perform the task on a six-point scale ranging from “impossible” to “very easy”. Users could use the touch screen, the devices’ keypad, and speech to perform inputs. To fill in the questionnaire, users had to select a question and then give the corresponding answer. Selection could happen through direct touch on the respective field, or via directional commands that could be given via on-screen buttons, voice commands, touch gestures, or the keypad. Answers to open questions could be given by typing on a virtual keyboard or by performing an audio recording. Analogous to the results of the previously mentioned studies, users showed the tendency to use the more effective voice recording to answer questions when they were alone (at home, while driving). In the park, when surrounded by few strangers, only the private questions were answered by typing, while users completely switched to typing when surrounded by many strangers (as on the subway). These results also indicate that there is an interplay of contextual factors and the type of information entered by users.

The User

In addition to the aforementioned extrinsic factors, the users themselves can play a decisive role in the interaction. For instance, De Angeli et al. (see above) also report on a second study with the form-filling task that aimed at investigating the impact of computer literacy (expert vs. naive users) [35]. This time – as opposed to the first study – the presentation of the form was fixed, with descriptive labels for every entry field and constantly provided pointing feedback.

Analysis of the interactions revealed that users categorized as experts preferred to interact multimodally (84% of their inputs were multimodal), while naive users showed no clear preference between unimodal and multimodal inputs (47% vs. 53%). Focusing on the way entry fields were referenced, experts preferred the more efficient unimodal mouse pointing, while naive users preferred unimodal verbal references.

The preference of computer experts to adopt multimodal inputs is also supported by observations of Reis et al. in their aforementioned study on contextual settings. All participants of their study are described as being familiar with computer and mobile phone use and are thus likely to be categorized as computer experts. In their discussion, the authors explicitly mention the users’ tendency “to interact with our application through the modalities that interest them the most from a technological and innovational point of view” [128].

Even though gender is such a basic attribute of users, of all the studies mentioned above, only Wasinger and Krüger have examined its role in the interaction. In the real-world setup of their instrumented shelf at a retail shop, preference ratings for 20 out of the 23 tested ways of input were higher when given by males than when given by females [170]. The biggest differences were observed for modality combinations involving speech, which the authors attribute to the obtrusive nature of speech. These results indicate that women could be more sensitive to the obtrusiveness of input modalities.

Summary & Conclusions

Summing up, with the task, the presentation, the context, and the users themselves, there are many known factors that influence the adoption of input modalities, with complex, yet unknown interdependencies. The studies presented above investigated those factors mostly in usability studies and WOZ experiments of systems involving quite complex tasks and subtasks. Hence, the transferability of the presented results to new domains seems difficult. Either the tasks are too specialized, or the researched interaction modalities have meanwhile been replaced by other ones, such as pen input, which is hardly used today in favor of direct touch. From a companion technology point of view, surprisingly little is known about the users themselves, except for marginal influences of computer literacy [35] and gender [170]. For these reasons, a study has been conducted that builds upon these findings and comprehensively investigates individual user factors such as gender, technology affinity, and personality on a basic recurrent task in the form of object selections.


5.1.2 User Study on Selection Tasks

As apparent by the aforementioned state of knowledge, there are at least five factors that have been shown to influence the choice of input modalities: task, presentation, situational context, computer literacy, and gender. In the study presented here, some of these factors are fixed, while others, especially those regarding individual user characteristics, are investigated more broadly.

The basic task to be investigated is a simple selection, as a common, i. e. domain independent, and prevalent task in almost every system. Since the possibilities to select an item (e. g. by verbally describing it) strongly depend on the way it is presented, this factor is varied in terms of task and item presentation. The situational context is kept fixed, as participants are alone in a silent laboratory environment that is not expected to limit the use of obtrusive input expressions. With respect to state of the art technologies and the increasing success of natural ways of interaction offered by touch screens, gesture, and voice recognizers, only those input modalities are considered that do not require auxiliary devices the user has to operate. This means touch, gestures, and speech are available as input modalities. As well-known output modalities of the system, visual display and speech output are employed.

To investigate user characteristics, besides gender, two standardized metrics are used. Instead of a simple two-class metric of computer literacy as used by De Angeli et al. [35], Karrer’s TA-EG (Technology Affinity – Electronic Devices) [71] is chosen as a more differentiated consideration of the user’s attitude towards technical systems. The NEO-Five Factor Inventory by Costa & McCrae [31] is chosen to characterize users in a non-technology-related way in terms of personality. The basic question to be answered by the study is: Which modality do users choose naturally, if they are not constrained by device limits or by an a priori defined interaction language, i. e. strictly limited interaction possibilities in each interaction step? The ultimate goal of the study presented here is to come up with robust indicators that allow the development of a user-adaptive input fusion system. This could be achieved by respecting predetermined plausibilities of inputs to avoid recognition errors and improve the overall robustness. What follows is a description of the experimental design of the study, including possible limitations, to allow a comprehensive assessment of the results. The results of the study are then presented in the upcoming sections, divided into user specific and system specific factors. The final discussion section then assesses the findings with regard to their usefulness in designing a user-adaptive input fusion.


Experimental Design

The study was performed on a stationary, wall-mounted system resembling the setup of an interactive wall display. Basically, all scenarios involving such a setup, such as public information systems and interactive TV sets, have selection as a fundamental task to be performed. In order to create an almost perfect system, a WOZ setup was implemented, where an operator in a separate control room interpreted all user inputs utilizing audio and video transmission from where the interaction takes place (see Figure 5.1(a)). Using a remote desktop connection to the test system, an interactive presentation was controlled by the operator as response to observed user actions. Figure 5.1(b) shows a participant as he performs a selection via a pointing gesture.

(a) An operator in the control room observing the user interaction to trigger the system’s reaction. (b) A user in front of the test system performing a selection using a pointing gesture.

Figure 5.1: The WOZ setup of the experiment (subimages taken from [142], c OpenInterface Association 2012, with permission of Springer).

Tasks: The selection tasks to be performed consisted of graphical icons of apples as a content neutral abstraction of selectable items. The selection tasks were spread over two screens, and the participants were free to move in space, to also allow for long distance interactions. There were 23 selection tasks in total, with varied amounts of items, arrangements, and presentation styles. Figure 5.2 shows some examples of actual selection tasks. Participants were totally free in which apple to select (but exactly one) and how to perform the selection. The task description to select a single object (e. g. “Select one, please!”) was randomly given either by text only, by voice only, or redundantly using both output modalities. Presentation of the selection objects varied in ways of labelling (none, numerical, alphabetical, or descriptive) and in the distinguishability of the apples, i. e. some selections contained labels and different apples, others consisted of unlabeled identical apples or different apples presented in black-and-white, making it hard to directly specify one of them without using spatial information (e. g. “The third one from the left.”).

In order to record facial expressions and posture data for research colleagues, there were 27 other tasks not in focus of this work that were randomly mixed with the selection tasks. These consisted of different riddles and puzzles like mathematical word problems, picture puzzles, or “find the hidden number” pictures that also helped to distract subjects from the selections, so that they would not reflect too much on the way they performed them.

Figure 5.2: Eight out of the 23 selection tasks, differing in the amount of items, arrangement and presentation styles. The lower right is an example for a selection task spread over two screens (taken from [142], c OpenInterface Association 2012, with permission of Springer).

Participants: Overall, 53 paid volunteers took part in the study, with an average age of 21.9 and a standard deviation of 2.4 years. There were 27 female and 26 male participants, mostly students with no background in computer science to avoid uncovering of the WOZ technique (47.2% medicine, 13.2% mathematics, 9.4% biology/chemistry, 7.5% physics, 7.5% mathematical biometrics, and 15.2% others).

Experimental Procedure: After being greeted, the participants were informed about the research consortium and its general goals. It was stated that a novel system was to be tested that should be capable of understanding almost everything the user is doing, due to its large number of sensors (cameras, laser scanners, directional microphones, etc. used as props). Then they were asked to sign a consent form and completed the standardized questionnaires on personality traits and their technology affinity towards electronic devices, as described later in the data recording paragraph. After that, they were equipped with a radio headset microphone and the system was introduced.


They were told that the system would present them with different self-explanatory tasks and that all of the natural interaction modalities were feasible, explicitly stating that their way of interacting with the system was not prescribed. After the examiner left the participants alone, the test itself began with a greeting by the system and a faked calibration sequence, which introduced the different kinds of possible interactions (speech, touch, pointing gestures, and combinations of these). Then, in total 50 tasks (including the 23 selection tasks) were presented in randomized order. Regarding the selection tasks, selections were triggered by a direct touch of the user on the object, by discriminating it verbally (e. g. “the red one”, “the upper one”, etc.), or by performing a pointing gesture on an apple (clearly remaining on one object or triggering the selection by an additional utterance like “this one”). When such a selection was triggered by the operator, the object was highlighted on screen and a confirmative sound was played, before proceeding to the next task. In cases where the operator was not sure about the selected item or simply did not understand the participant, several predefined sentences could be spoken to the user via a text-to-speech system, e. g. “Have I marked the right object?” or “I could not understand you. Please speak louder.”. When the test was over, the examiner re-entered the room, participants’ demographics were gathered, and they were informed about the WOZ setup and asked to maintain confidentiality until the end of the entire study.

Data Recording: For the purpose of this study the following data was recorded:

• Log files: The actions triggered by the wizard were written into a log file containing the resulting task order and all system events (as reaction to the observed subject behavior) in chronological order.

• Video recordings: All sessions were observed by two cameras, one in front of the participants, and another from an overhead perspective behind the participants.

• Audio recordings: Two audio streams were recorded, one from the lightweight radio headset microphone, and another from two directional microphones covering the area in front of the system.

• Questionnaires: For the personality traits, participants completed the German version [49] of the NEO-Five Factor Inventory (NEO-FFI) by Costa & McCrae [31], resulting in scores for neuroticism, extraversion, openness, agreeableness, and conscientiousness with integer values ranging from 0 to 48. For measuring their affinity towards technical devices, they filled out the TA-EG (Technology Affinity – Electronic Devices) questionnaire [71], resulting in scores for enthusiasm, competence, as well as negative and positive attitude with floating-point values ranging from 1.0 to 5.0.


Modality Labeling of Data: To obtain the values for the type of modality used in a selection’s input expression, each subject’s recording (audio and video) was labeled using the video annotation tool ANVIL [73]. As a result, each selection performed by a participant got one of the following labels:

• Speech, e. g. just saying: "The rightmost one"

• Gesture, e. g. just pointing at an object

• Touch, e. g. just touching the object of choice

• Any multimodal combination of the above, e. g. Speech+Gesture when using a complementary expression like "this one" together with a pointing gesture.

Statistical Analysis: For the statistical analysis, solely non-parametric tests were applied to avoid the risk of falsely assuming normal distributions. Dependencies between the adopted input modalities and the way selections are presented were calculated using the Wilcoxon matched-pairs signed-rank test. For correlations between the input modalities and personality traits or technology affinity, Spearman’s rank correlation coefficient was used. Significance of gender differences was analyzed with Mann-Whitney U tests. The applied level of significance was 5% across all tests.
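To make the analysis procedure concrete, the following minimal Python sketch shows how the three named non-parametric tests could be computed with SciPy. All variable names and values are illustrative placeholders; this is not the code or the data used in the study.

from scipy import stats

# Hypothetical per-user aggregates (placeholders, not the study's data):
touch_textual = [61.0, 55.0, 70.0]   # % touch selections per user, textual task description
touch_spoken  = [52.0, 50.0, 61.0]   # % touch selections per user, spoken task description
extraversion  = [28, 35, 22]         # NEO-FFI extraversion score per (male) user
speech_count  = [5, 12, 2]           # number of selections via speech per (male) user
male_touch    = [40.0, 45.0, 50.0]   # % touch selections per male user
female_touch  = [80.0, 85.0, 90.0]   # % touch selections per female user

# Paired comparison of the two presentation conditions (Wilcoxon matched-pairs signed-rank test).
w_stat, w_p = stats.wilcoxon(touch_textual, touch_spoken)

# Monotonic association between a personality trait and modality use (Spearman's rank correlation).
rho, rho_p = stats.spearmanr(extraversion, speech_count)

# Independent comparison of the two gender groups (Mann-Whitney U test).
u_stat, u_p = stats.mannwhitneyu(male_touch, female_touch, alternative="two-sided")

alpha = 0.05  # 5% level of significance across all tests
print(w_p < alpha, rho_p < alpha, u_p < alpha)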

Limitations: The chosen experimental design described above and the participants’ demographics have some limitations that should be borne in mind when assessing the results. Embedding the selection tasks into the other tasks indeed prevented repetitive tasks and distracted subjects from the way of performing a selection. On the other hand, some influence on the adoption of input modalities cannot be excluded, since some of the other tasks promoted the use of specific input modalities (e. g. mathematical word problems had to be solved using speech). Nevertheless, this influence was minimized as far as possible by an even ratio of tasks to be performed by speech and touch. With an average age of 21.9 years and a standard deviation of 2.4 years, the participants are quite young. Also, the educational background is very homogeneous, with nearly half of them being medical students. As interaction with technical systems certainly depends on habit and daily experience of the observed population, statistical inference on the parent population of users seems difficult. Additionally, the validity period of such a user study’s results is limited, as technology progresses and market penetration of standards changes. It is reasonable to assume, for instance, that nearly all participants had previous experience with touch screen devices like smartphones, whereas familiarity with natural voice interaction could still be regarded as quite limited.


5.1.3 User Specific Factors

This section presents the results and discussions for the user specific factors gathered from the 53 participants that performed 1219 selections during the study. As indicated by the state of knowledge, the gender specific analysis reveals significant differences, not only in the overall adoption of modalities, but also in the scores of the personality traits (NEO-FFI) and technology affinity (TA-EG). This underpins the necessity to perform the analyses of the latter factors – in terms of correlations – on a per gender basis. For both genders, significant and new findings are reported that substantially extend the state of knowledge.

The Role of Gender

The percentage distribution of modality use differs greatly between male and female participants. Figure 5.3 [142] illustrates these differences. While females highly favor touch interaction to select objects (82.7% touch, 10.1% speech, 4.7% gesture), males tend to use the other modalities far more often (42.8% touch, 33.4% speech, 17.9% gesture). Multimodal ways of input for selection tasks are quite rare for both genders. The mean numbers of touch, speech, and gesture uses differ highly significantly (N = 26m/27f, Mann-Whitney: U = 152.5, p < .001; U = 184, p = .002; U = 225.5, p = .01).

[Figure: bar chart of the mean percentage of use (0–100%) per selection modality – Touch, Gesture, Speech+Touch, Speech, Speech+Gesture – grouped by gender (m/f).]

Figure 5.3: Mean percentages of modality use for male and female subjects performing a selection (taken from [142], c OpenInterface Association 2012, with permission of Springer).

Regarding the role of gender, a remarkable influence is evident. Wasinger and Krüger’s finding from retrospective interviews that women feel less comfortable using speech [170] is confirmed by the real interactions in the present study.


Women’s reliance on touch was much greater than that of males. While 83% of all female selections were done by touch, males performed more selections by speech and gesture (51% altogether) than by touch (43%). This shows that males tend to use the different input possibilities much more often than females do. Analyzed on an individual basis, 63% of the female subjects never used speech, compared to only 27% of the males.

What holds for both genders is the fact that multimodal inputs were rarely used and that touch was the preferred input modality when performing the selections. This could have multiple reasons. First of all, the selections in the study were quite simple, probably too simple to promote multimodal inputs. This is in line with the findings of Jöst et al. [68] and Oviatt et al. [116], who already reported lower uses of multimodality on simple tasks with a low cognitive load. Another factor could be the participants’ demographics. The fact that all participants were quite young and probably used to touch screen technology could have promoted the use of touch input even further. A third factor could be the situational context in terms of the (perceived) level of privacy during the study, which has already been shown to have a momentous influence (cf. [68, 128, 170]). Although subjects were on their own during the interaction, one cannot rule out that the knowledge about the audio and video recordings and the clearly visible technical equipment led to some feeling of surveillance similar to situations where strangers are present. This could additionally have led to lower usage rates of speech and gesture inputs in the study than one could expect in truly private contexts.

Gender differences in personality traits and technology affinity:

[142] Figures 5.4 and 5.5 show the results of the NEO-FFI and TA-EG questionnaires for male and female participants. Throughout all personality traits of the NEO-FFI, females score higher values than males (similar to the German norm [75]), although only the difference in agreeableness is significant (N = 26m/27f, Mann-Whitney: U = 206, p = .01). Within the technology affinity scores of the TA-EG, the situation is reversed, with males scoring higher than females throughout all values. The male scores for excitement and competence are notably higher, with significant differences (N = 26m/27f, Mann-Whitney: U = 70, p < .001 and U = 223, p = .02).

Personality Traits

Can personality traits serve as a predictor for the adoption of input modalities? To answer this question, correlation analyses on the results of the NEO-FFI questionnaire have been performed. Due to the differences in modality usage rates and the mean NEO-FFI scores, separate analyses were performed for each gender. Outliers within the NEO-FFI scores were removed from the data (two female subjects, cf. Figure 5.4).


[Figure: box plots of NEO-FFI scores (0–48) for neuroticism, openness, conscientiousness, extraversion, and agreeableness, grouped by gender (m/f).]

Figure 5.4: NEO-FFI values for male and female subjects (taken from [142], c OpenInterface Association 2012, with permission of Springer).

[Figure: box plots of TA-EG scores (1.0–5.0) for excitement, negative attitude, competence, and positive attitude, grouped by gender (m/f).]

Figure 5.5: TA-EG values for male and female subjects (taken from [142], c OpenInterface Association 2012, with permission of Springer).


Since correlations need a fair amount of data to give meaningful results, only those input modalities were taken into account that did not merely consist of outliers. Thus, touch, speech, and gesture modalities were analyzed for males and only touch for females.

[142] For the all-female group of participants, there is no significant correlation between any of the personality traits and the number of selections via touch. For the male subjects, there are positive correlations between extraversion and openness and the number of speech uses, which are statistically significant by Spearman’s rank correlation (N = 26, rs(24) = .429, p = .029 for extraversion; rs(24) = .391, p = .048 for openness). Figure 5.6(a) and Figure 5.6(b) illustrate these ties.

[Figure: two scatter plots of the number of selections via speech for male subjects. (a) Extraversion as a predictor for the use of speech; the regression line shows the positive relationship. (b) Openness as a predictor for the use of speech; the regression line shows the positive relationship.]

Figure 5.6: Personality traits of male users correlate with speech use (subfigures taken from [142], c OpenInterface Association 2012, with permission of Springer).

So personality traits can be a predictor of modality use, at least for males. The higher the values for extraversion and openness, the higher the probability of using speech to select objects. It thus seems that the findings of De Angeli et al. [35] and Reis et al. [128], that computer experts are more inclined to explore a system and its available input modalities, could be generalized to the extraversion and openness of users. Another possible explanation could be that the need to feel in a private context when adopting speech input is less pronounced for such users. Such ties could not be proven for females, since their use of speech was too rare, but it seems reasonable to expect similar behavior.


Technology Affinity

Similar to the personality traits, gender-specific correlation analyses were performed on the results of the TA-EG questionnaire measuring the technology affinity of a person. Again, outliers within the TA-EG scores were removed (three male and two female subjects, cf. Figure 5.5). The same input modalities were taken into account, i. e. touch, speech, and gesture for males, and only touch for females.

[142] This time, no significant correlations exist for male subjects. However, female subjects show a significant positive correlation between competence and the number of selections via touch (N = 25, rs(23) = .421, p = .036). At the same time, there is a significant negative correlation between negative attitude and the use of touch (N = 27, rs(25) = -.461, p = .016). Figure 5.7(a) and Figure 5.7(b) illustrate these ties.

[Figure: two scatter plots of the number of selections via touch for female subjects. (a) Competence in technology as a predictor for the use of touch input; the regression line shows the positive relationship. (b) Negative attitude towards technology as a predictor for the use of touch input; the regression line shows the negative relationship.]

Figure 5.7: Technology affinity values of female users correlate with touch use (subfigures taken from [142], c OpenInterface Association 2012, with permission of Springer).

So women show an interesting correlation between the competence and negative attitude scores of the TA-EG questionnaire. While competence slightly increases the probability of touch input, negative attitude clearly reduces it. The positive correlation of competence and touch input could be a consequence of practice, as women familiar with using state of the art technology, like touch-enabled smartphones, are likely to judge themselves as being more competent with electronic devices.


The inverse could explain why negative attitude reduces the frequency of touch inputs. A more direct, but also more speculative explanation could be a reduced willingness of women to directly interact with something they do not like by physically touching it. It remains unclear why such significant correlations could not be found for male subjects.

5.1.4 System Specific Factors

As indicated in the state of knowledge, the system and its way of presenting a task can have a momentous influence on modality adoption. In the present study, with the basic task of a selection, three aspects of presentation are studied: the modality of the task description, the presentation of the selection items (in terms of labelling), and the distinguishability of the items. The analyses of these three factors presented in the following reveal much smaller influences than could be expected from the state of knowledge.

Task Description

[142] The task description was realized as text on the screen, e. g. "Select one, please!", as a verbal system output "Which apple do you want?", or in a mixed way. Four tasks (two matched pairs) were selected for the analysis: two of them had a purely spoken task description, the other two a textual one. As illustrated in Figure 5.8, touch interaction predominates the other modalities under both conditions. Nevertheless, the mean percentage of selections per user via touch decreases from 61.3% for the textual description to 52.8% for the spoken one. At the same time, speech interaction percentages increase from 22.6% to 34.9%. The Wilcoxon test reveals that these differences are significant (N = 53, speech: Z = -2.71, p = .007; touch: Z = -2.18, p = .029). The other modalities, gesture and multimodal ones, show no significant changes. A per gender analysis shows no significant effects, due to the sample sizes becoming too small when analyzing only four tasks.

These findings roughly correspond to those of Bellik et al., namely that speech output promotes speech input [8], although the effect is less pronounced in the present study.

Item Labelling

[142] To investigate whether the presence or absence of textual labels on the selection objects has an influence on the use of input modalities, four tasks (two matched pairs that offered the same apples) were selected for analysis. Two tasks contained visible labels (one with alphabetical labels and one with textual labels), while the other two showed no labels.


[Figure: bar chart of the mean percentage of selections per user by modality (Touch, Speech, Gesture, Others) for the textual and spoken task description conditions; touch 61.3% vs. 52.8% and speech 22.6% vs. 34.9% as reported in the text, gesture and the remaining modalities in the range of 2.8–13.2%.]

Figure 5.8: Mean percentages of used selection modalities per user in textual and spoken task description conditions (taken from [142], c OpenInterface Association 2012, with permission of Springer).

[142] The average percentages of used selection modalities for both conditions do not significantly differ from each other. In both cases, touch interaction clearly predominates the other modalities in usage percentages, with 63.2% in labeled and 65.1% in unlabeled conditions, respectively. With 22.6% and 19.8% usage, the values for the speech modality differ only slightly, likewise for the gesture modality with 11.3% and 10.4%. Again, the usage of the other modalities and multimodal combinations is only marginal, and the per gender analysis gives no further insights.

Although the task of selecting an item is very different from the form-filling used by De Angeli et al. [35], their results suggest a much greater difference in modality adoption than was observed between the two conditions here. When there were visible labels, one could have expected a more frequent use of speech or multimodal inputs, since these did not require moving within arm’s reach.

Distinguishability of Items

To test whether the distinguishability of selection objects has any influence on the user’s applied modality, four tasks with discriminable objects and four tasks with objects hard to differentiate were chosen for comparison.

While the former contained (any kind of) labels and different apples, the latter consisted of unlabeled identical apples or different apples presented in black-and-white only. Similar to the labelling-only analysis, subjects show no significant change in modality use. However, the use of speech increases slightly from 20.8% for non-discriminable items to 24.1% for discriminable items, since the latter made verbal descriptions very easy. The overall usage distributions are as follows: 63.2% touch, 20.8% speech, 12.7% gesture, and 3.3% others for the tasks with non-discriminable objects. For the tasks with discriminable objects the results are: 61.8% touch, 24.1% speech, 11.3% gesture, and 2.8% others. The per gender analysis does not reveal any significance, either.

5.1.5 Discussion

The presented study extends the state of knowledge on modality adoption in several respects. First, the gender differences are greater than expected. Not only do the genders differ in their general adoption of modalities, they also show interesting intra-gender effects regarding technology affinity and personality traits. Second, the influence of presentation was found to be less pronounced than indicated by previous research. Though the used output modalities affect the adopted input modalities for the selection task (i. e. speech output promotes speech input), other factors of presentation, like labelling and the general ability to directly name an item, show only marginal influences and seem to be superposed by the natural affordance of the task to use touch input.

In summary, the study reveals that there is no simple set of factors to robustly predict a user’s input modality, and that the interdependencies between factors are more complex than presumed, even for such a simple task as selecting an item. Additionally, there are many more characteristics of a user not explicitly in focus of the study that could be of importance, e. g. age, expertise, level of education, and so on. This leads to the conclusion that there may never be a way to robustly predict what input modality (or modalities) a user will choose to perform a task; only rough estimates for frequencies of use seem reasonable. Although the knowledge of user-sided preferences can be of value in the overall design of interactive systems, it seems inapplicable in designing a user-adaptive input fusion, since frequencies of use derived from such “external” factors cannot be taken as plausibility for a single input. What remains is investigating ways of system-sided adaptation derived from directly observing user behavior when performing multimodal inputs, as described in the following section.


5.2 System-Sided Adaptation

As became clear in the previous section, the mere knowledge about the adoption of modalities by a user is insufficient for deciding whether an input is intended or not. Consequently, efforts should not focus on the fact that modalities are used, but on how modalities are used by an individual. With the so-called integration patterns, there is already some evidence that a user’s multimodal input behavior is distinctive and even stable over time. These patterns are identified using only information that is already available during normal interaction, i. e. using temporal information from the interaction history of a user. In the following state of knowledge section, such integration patterns in their original definition by Oviatt et al. are described and related literature, which sometimes reveals contradictory findings, is presented. Concluding from that, a study on interaction histories – again for the basic task of selections – is described. The results of the study show that integration patterns with their simple classification scheme are inadequate for such kinds of basic interactions. Instead, the behavior of a user has to be assessed on an individual basis. In addition, the study investigates errors that occur during multimodal interaction, resulting in a novel classification of error types.

From these findings, an approach for adapting to individual interaction histories is proposed and integrated into the fusion system of this work. Evaluation is carried out using a realistic smartwatch scenario with a controlled induction of input errors, which demonstrates significant benefits of the approach. The final discussion section then critically reviews the results and gives suggestions for future directions of research.

Note that Sections 5.2.1 to 5.2.4 are based on work already published in [140], while Sections 5.2.5 and 5.2.6 are an updated and extended version of work published in [138, 139].

5.2.1 State of Knowledge

A number of studies have investigated the temporal alignment of inputs during multimodal interaction with different modalities and tasks at hand. These are presented in the following, and open issues are summarized at the end.

Oviatt et al. were the first to coin the term multimodal integration pattern [115]. Such an integration pattern classifies a user’s combined use of two modalities (e. g. speech and pen) – be it complementary or redundant – as either sequential or simultaneous. A combination is called sequential when there is no temporal overlap of the multimodal signal intervals, and it is called simultaneous when they do overlap. In practical terms, the temporal gap between the on- and offsets of the two involved modalities is calculated.


A positive temporal gap then means sequential integration, a negative one means simultaneous integration, as depicted in Figure 5.9. Furthermore, a user can be categorized as being either a predominantly sequential or simultaneous integrator, if 60% or more of the user’s multimodal integrations belong to the same category. When a user shows less preference for one type of integration, he or she is categorized as non-dominant.

[Figure: speech and pen signal intervals illustrating the two cases – temporal gap ≥ 0 (no overlap): sequential integration; temporal gap < 0 (overlap): simultaneous integration.]

Figure 5.9: Sequential and simultaneous integration of two combined input modalities.
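A minimal Python sketch of this classification scheme follows, assuming each input is given as an (onset, offset) pair in milliseconds; the function names are illustrative and not taken from the thesis’ implementation.

def temporal_gap_ms(a, b):
    # a, b: (onset_ms, offset_ms) tuples; order the two inputs by onset,
    # then measure the gap between the first offset and the second onset.
    (on1, off1), (on2, off2) = sorted([a, b])
    return on2 - off1

def integration_type(gap_ms):
    # Figure 5.9: gap >= 0 (no overlap) -> sequential, gap < 0 (overlap) -> simultaneous.
    return "sequential" if gap_ms >= 0 else "simultaneous"

def categorize_integrator(gaps_ms, dominance=0.6):
    # Predominant category per the 60% criterion; also returns the consistency,
    # i.e. the share of inputs matching the dominant type.
    labels = [integration_type(g) for g in gaps_ms]
    sim_share = labels.count("simultaneous") / len(labels)
    consistency = max(sim_share, 1.0 - sim_share)
    if sim_share >= dominance:
        return "simultaneous integrator", consistency
    if 1.0 - sim_share >= dominance:
        return "sequential integrator", consistency
    return "non-dominant", consistency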

[140] In [116, 117, 119], speech and pen input behavior is investigated with respect to different aspects. The task at hand in all three publications is a major flood scenario, where users had to coordinate emergency resources using a map display. The subtasks involved multimodal placing of items (“place a barge here”, while pointing at a location), specification of movement routes, and several others. In [117], the consistency of users within their dominant pattern is reported to be very high (97%) and quite resistant against selective reinforcement via manipulated error rates of the opposite pattern. The generated errors were of unspecific type, i. e. the users were only told that the system was unable to understand the previous input. In [116], the consistency of integration patterns is reported to be 93.5%, and Oviatt et al. conclude that even if users like the ability to use multimodal inputs, they do not always do so when given the choice. In [119], it is shown for the same tasks and modalities that integration patterns are even stable over time (three sessions over a time period of four to six weeks).

Integration patterns are also investigated in [63, 64], where machine learning based approaches to predict a user’s integration pattern were presented, based on the data of [116]. It is shown that 15 samples per user are enough to predict a user’s integration pattern with an accuracy of 81%, so integration patterns can be detected relatively fast, even in an automated way.

Lee and Billinghurst [83] performed an experiment involving speech and gesture for a multimodal augmented reality system, where users had to alter the color, shape, and position of 3D objects. The results of their WOZ experiment show that speech input usually consists of short command-like inputs. Regarding input timings, in 94% of all multimodal inputs, gesture preceded speech input, and the gesture was held several seconds longer after the speech input was finished.

Although [140] Lee and Billinghurst did not explicitly analyze it, their reported data clearly suggests a dominant simultaneous integration pattern for most of the users.

Dey et al. investigated speech and multi-touch input for the manipulation of digital photos [37]. The users had to perform a fixed sequence of manipulation tasks, where touch input was limited to the selection of photos and issuing a fixed increase or decrease of the size by performing a pinch gesture. Speech input was also limited to a fixed set of commands, like “rotate clockwise” or “zoom in”. In contrast to the WOZ experiments mentioned before, a completely autonomous system was realized. For the chosen tasks, multimodal input was less popular than unimodal input (a touch-only interface served for comparison). Dey et al. did not calculate integration patterns as defined by Oviatt et al., but instead measured the temporal gap between the onsets of speech and touch input. Most users performed the speech command after the touch command (72%), with a large overall variance of about ±1500 milliseconds. Further insights, e. g. on the inter-individual differences between users, or the intra-individual consistency in input behavior of a single user, are not reported.

In [53], speech and single touch inputs are investigated for the task of placing different object types on a specific location of a map from a military scenario. The main measures that may be relevant are the input order (touch before speech, or speech before touch) and the time distances between the two modalities. The analyzed 11 participants split up into two groups, namely those that always started with speech input (7), and those that always started with touch input (4). None of them changed the input order during the experiment. Regarding the time distances between the two inputs, touch-first participants exhibited significantly lower temporal gaps to the speech input than speech-first participants exhibited to the touch input. Users’ individual consistency was not analyzed. In contrast to the findings of Oviatt et al., where typically simultaneous integration patterns were much more often observed, here, all of the inputs were sequential ones. Haas et al. [53] conclude that this may be due to the fact that their tasks involved much shorter (command-like) speech inputs than those of Oviatt et al.

Summing up, these experiments indicate different – depending on the involved modalities and tasks – but quite consistent user behavior. This not only applies to the order of multimodal inputs, but also to the temporal alignment between them. As already assumed by Oviatt et al. in [119], it should be possible to use such temporal information to create a multimodal input fusion system that uses adaptive temporal thresholds to increase the overall robustness and performance of a system. Although this idea is mentioned by other researchers as well [37, 53], still little is known about what information is actually relevant for which tasks – be it input orders, onset distances, integration patterns, or some other metrics – and how such an adaptation should actually be realized.

Furthermore, there seems to be no clear understanding of the types of errors that occur during multimodal interaction and that should be avoided by such an adaptation.

5.2.2 User Study on Interaction Histories

In order to get a deeper understanding of what type of temporal information is actually relevant, e. g. for such a basic task as a selection, and what types of errors can occur during multimodal interaction, a dedicated user study is necessary. The presented study uses a gamified selection task to keep the participants motivated over many similar inputs, and provides touch and speech input modalities that had to be used in a complementary input expression. Before the results are presented in the upcoming sections, this section describes the applied research method.

Task and System

The participants were sitting in front of a touch screen that presented a matrix of numbered geometric shapes (circles, triangles, etc.) that could be either red, blue, or green. The task was to find the one object that exhibited a unique combination of shape and color, while the other objects shared shape and color at least once. Participants were asked to indicate the location as well as the color of this unique object as quickly as possible, enforced by a ticking clock sound and the decrease of a time/money bar. A timer, indicating the remaining time left and money to be won, was shown on the side of the screen corresponding to the participant’s dominant hand, while three buttons in the three aforementioned colors were depicted on the other side (see Figure 5.10). Input could be given by using exclusive touch (touching the object plus the corresponding color button), exclusive speech (naming the position number of the object and its color), or a combination of those modalities (touching the object and naming its color or vice versa). Thus, the input resembled an abstract version of a multimodal interaction that involves a selection and a command on the selected object. The faster the participants completed the task, the more money they earned. When they indicated a false location or color, they did not get any money for this particular turn.

The system was realized as a fully functional system with components for automatic speech recognition (ASR) and touch input that provided the input fusion component with single events for color or location, containing temporal on- and offsets, together with a confidence value. After fusion, a combined event was forwarded to the actual application, which reacted accordingly with positive or negative audiovisual feedback before proceeding to the next task. Further technical details of the system are out of scope here, but can be found in [140].
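As an illustration of the kind of event data exchanged between the recognizers and the fusion component, a minimal sketch follows; the class and field names are assumptions made for this example, not the actual interfaces of the implemented system.

from dataclasses import dataclass

@dataclass
class SingleInputEvent:
    slot: str          # what the event contributes: "location" or "color"
    modality: str      # originating sensor: "speech" or "touch"
    value: str         # recognized content, e.g. "7" or "red"
    onset_ms: int      # start of the signal interval
    offset_ms: int     # end of the signal interval
    confidence: float  # recognizer confidence in [0, 1]

@dataclass
class CombinedEvent:
    location: SingleInputEvent  # fused location part
    color: SingleInputEvent     # fused color part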



Figure 5.10: A user in front of the system indicating the individual item’s location using touch and its color by using speech (taken from [140], c 2014 ACM).

Participants

[140] 42 volunteers, 33 male and 9 female, participated in the experiment and were paid according to their success in the game (between 10.75 and 16.30 Euro). 88.1% of the participants were right-handed. The average age in the participant pool was 27.17 years (standard deviation, SD = 9.18), with the youngest participant being 16 and the oldest being 63. 95.2% had prior experience with multi-touch interfaces like smartphones or tablets. One participant’s data was not usable because of an extremely low speech recognition rate.

Experimental Procedure

After giving his/her demographic information, each participant was introduced to the system and the game mechanics through trial runs for all possible multimodal combinations before the actual experiment started. The experiment was split into two separate blocks. The first block of the experiment (“history” block) consisted of 90 trials, which in turn were segmented into three parts with increasing task difficulty (3x3, 4x4, and 5x5 sized object matrix). All tasks within the first block had to be solved with a specific modality combination in a between-subjects design, resulting in eleven participants using location-touch + color-speech, and ten using either speech-only, touch-only, or location-speech + color-touch, respectively. The second block of the experiment (“free interaction” block) then presented 45 trials with randomized difficulty in which the participant had a free choice of which modalities to use.


Data Recording

[140] All system components (ASR, touch, fusion, and application) contributed to a central XML-based log file with all their relevant events. Video was recorded from a third person’s perspective (cf. Figure 5.10) and frontally from the top of the screen. Audio was recorded separately from the headset microphone and the system’s audio output.

Acquired Variables and Labeling of Data

The following variables were extracted from each participant’s log file: event times with on- and offsets for each modality’s single input events; the modality combination of the fused combined inputs (location-touch + color-speech, location-speech + color-touch, speech-only, touch-only). In addition, the data was semi-automatically labeled with the following information: system behavior as intended by the user; error type; actual intended input (when available in the event log).

5.2.3 Integration Patterns and Individual Behavior

The data from the “history” block is used as the basis for analyzing the temporal behavior. It contains 90 tasks for each of the 41 users, who were instructed to use one of the four modality combinations throughout the entire block. From these 3690 interactions overall, the 3317 in which the intended inputs were correctly recognized by the system represent the actual interaction history discussed in this section.

As stated in the related work (cf. Section 5.2.1), Oviatt et al.’s multimodal integration patterns may be a distinctive measure for any user’s multimodal interaction behavior. Table 5.1 shows the integration patterns for the eleven users that were instructed to give the location input via touch and the color input via speech. While all eleven users can be categorized as being either simultaneous or sequential integrators (i. e. more than 60% of inputs were uttered in the respective manner), the average consistency of 84.5% is considerably lower than those reported by Oviatt et al. (97% in [117], 93.5% in [116], and 95-96% in [119]). This means that for some users, the overall categorization can be quite misleading. User 9, for example, performed about a third of his/her inputs in the opposite way to that suggested by his/her overall categorization. Calculation of integration patterns for the other modality combinations shows similar results with only moderate internal consistencies (location-speech + color-touch: 86.6%, touch-only: 88.5%), except for speech-only inputs, as they naturally can only be performed in a sequential manner and therefore have a consistency of 100% with only sequential integrators.


User | % simultaneous constructions | % sequential constructions
SIM integrators:
37 | 100.0 | 0.0
29 | 93.4 | 6.6
17 | 92.0 | 7.9
1 | 91.7 | 8.3
25 | 84.6 | 15.4
21 | 82.5 | 17.5
5 | 82.2 | 17.8
13 | 64.5 | 35.5
SEQ integrators:
9 | 32.8 | 67.2
33 | 27.5 | 72.5
41 | 1.2 | 98.8
Average consistency: 84.5%

Table 5.1: Simultaneous and sequential integrators with modality combination location-touch + color-speech (cf. [140]). The average consistency is considerably lower than those reported in the literature [116, 117, 119].

At least for the modalities and relatively short inputs of this experiment, it seems that integration patterns are not distinctive enough to be used as a basis for an individual adaptation. Even if a predominantly simultaneous integrator such as user 5 performs a sequential input, the system should usually accept it, since otherwise almost every fifth input (17.8%) would be falsely rejected. The same holds true for the order of inputs, which Haas et al. [53] found to be very stable, for example. For the given experiment, the data shows no clear distinctiveness either. When dividing participants into groups that predominantly (>60% of all inputs) started with the color input and those that predominantly started with the location input, four users showed no dominant order, while 15 predominantly started with color and 22 with location inputs. Similar to the results of calculating integration patterns, a good portion of users (24.4% of the 41 participants) showed a consistency of less than 80%.

These results indicate that a simple classification of users into few classes can generally not be used to represent typical user behavior and that a more detailed analysis is required. One idea is to use the actual temporal distributions of metrics that can be built from the onsets and offsets of multimodal inputs. One metric that is also used for integration patterns is the temporal gap between two inputs, i. e. the time difference between the offset of the first input and the onset of the second input. Another one could be the onset distance between two inputs. Figure 5.11 shows the actual distributions of both metrics between the location and color inputs for the same users as in Table 5.1. As one can see, these distributions differ greatly between users.

[Figure: per-user histograms of the temporal gap between inputs (in milliseconds) and of the onset distance location–color (in milliseconds) for the users of Table 5.1.]

Figure 5.11: Histograms for users with modality combination color-speech + location-touch clearly show the individuality of users (taken from [140], c 2014 ACM).

There are a number of metrics that can be built from two inputs with their respective onsets and offsets. Figure 5.12 shows five simple metrics based on the distance between two timestamps, and an idea for a more complex metric called Integration Index. It is different from the other metrics, as it is a relative measure for the position (center) of the shorter input in relation to the whole time interval of the longer input. This way, it comprises more information in a single value than the other metrics.

Note that the three distance-based metrics can be calculated in two different ways, i. e. either using their absolute values or with algebraic signs.
The basic idea of an individual adaptation is then to use the distributions of such metrics to identify unusual inputs as errors and then either ignore them or react accordingly. The next section provides a classification of errors that can occur during multimodal interaction, before the actual approach for detecting such errors and its evaluation are presented.


[Figure: illustration of six metrics derived from two input intervals – Temporal Gap (offset to onset), Onset Distance, Offset Distance, Total Duration, Center Distance, and Integration Index (ranging from -1 to +1).]

Figure 5.12: Six different metrics created from the onset and offset information of two input intervals (cf. [140]).
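The following Python sketch computes the six metrics for two input intervals given as (onset, offset) pairs in milliseconds. The distance metrics are returned signed; the formula for the Integration Index is one plausible reading of the figure (the shorter input’s center mapped onto the longer input’s interval, with -1 at its onset, 0 at its center, and +1 at its offset) and is therefore an assumption rather than the thesis’ exact definition.

def temporal_metrics(a, b):
    # Order the two intervals by onset.
    (on1, off1), (on2, off2) = sorted([a, b])
    metrics = {
        "temporal_gap":    on2 - off1,                      # offset of first to onset of second
        "onset_distance":  on2 - on1,
        "offset_distance": off2 - off1,
        "total_duration":  max(off1, off2) - on1,           # earliest onset to latest offset
        "center_distance": (on2 + off2) / 2 - (on1 + off1) / 2,
    }
    # Integration Index: relative position of the shorter input's center within the longer input.
    (lon, loff), (son, soff) = sorted([a, b], key=lambda iv: iv[1] - iv[0], reverse=True)
    metrics["integration_index"] = 2.0 * ((son + soff) / 2 - lon) / max(loff - lon, 1) - 1.0
    return metrics

For instance, a touch interval (1000, 1200) combined with a speech interval (900, 1500) yields a temporal gap of -500 ms (i. e. an overlap) and an integration index of about -0.33, meaning the touch lies somewhat before the middle of the speech interval.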

5.2.4 Error Types of Multimodal Input

For the analysis of error types, data from the “free interaction” block of the study, consisting of 1845 trials, is used to ensure unbiased results that are not influenced by prescribing the way of multimodal inputs as in the “history” block. Video analysis and manual labeling show that there are situations in which the system did not behave as intended by the users. This applies to 4.9% (90 cases) of all inputs on average, ranging up to 17.8% for some individuals. This overall quite good performance can be attributed to the efforts spent on making sensory inputs (especially speech inputs) as robust as possible. For example, the users had to use a dedicated high quality headset microphone for voice input, different speech recognition profiles were used for female and male participants, and there was effectively no ambient noise. In a more realistic setting, one could expect a considerably higher error rate. Taking a deeper look at the video data, the following types of errors were identified, which mostly stem from erroneous sensor inputs:

False negatives: This happens when the system does not react at all or is not confident enough about an intended input. The user has to repeat the input using the same or even other modalities.

False positives: This happens when the system reacts to an event in the environment (e. g. noise) or to an action of the user that was not intended as input. This can even trigger a totally wrong action.


Conflicts: This is a special case of the false positive error, where a correctly recognized input of the user is disturbed by a false positive from another sensor. Lacking any information to resolve it, the current system implementation discards both contradicting inputs, so that the user has to repeat the intended input.

Misunderstandings: The system reacts to an intended input in an unintended way. This happens when the system (e. g. a sensor) misinterprets an input. As with false positives, this can trigger a totally wrong action.

Other: This category contains unclear error types, like a combination of two or more errors from above, as well as user-initiated errors. These can be, for example, self-corrections that are missed by the system, or incomplete inputs by the user.
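For later reference in an implementation, this taxonomy could be captured as a simple enumeration; the sketch below is illustrative only, and the identifier names are not taken from the thesis’ code.

from enum import Enum, auto

class MultimodalErrorType(Enum):
    FALSE_NEGATIVE   = auto()  # intended input not (confidently) recognized at all
    FALSE_POSITIVE   = auto()  # unintended event or action treated as an input
    CONFLICT         = auto()  # correct input contradicted by a false positive from another sensor
    MISUNDERSTANDING = auto()  # intended input recognized, but interpreted wrongly
    OTHER            = auto()  # combinations of the above, self-corrections, incomplete inputs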

Figure 5.13 shows the frequency of each error type within the “free interaction” block of the experiment. Although these frequencies certainly depend on the task and system used, as well as on the environmental conditions, the given five error types are expected to cover all relevant errors that can occur during any multimodal input interaction. It has already been shown in [140] that an off-line approach based on a perfect interaction history (taken from the “history” block of the experiment) can be used to detect such error types, and a potential reduction of the error rate from overall 4.9% to 1.2% is reported on the data of the “free interaction” block. A refined, complete approach suitable for on-line processing at runtime is presented in the next section.

[Figure: bar chart of error type frequencies – False Negative 54.4%, Conflict 16.7%, False Positive 10.0%, Misunderstanding 8.9%, Other 10.0%.]

Figure 5.13: Frequency per error type within the “free interaction” block. False negative is by far the most occurring error (taken from [140], c 2014 ACM).



5.2.5 Adapting to Individual Histories

The basic idea of using a known individual temporal interaction history of different metrics (cf. Section 5.2.3) is to only allow inputs that are in line with such a history. In contrast to using temporal windows or predefined hard temporal constraints as in the state of the art of Chapter 2 – and complementing the temporal fading of the fusion approach presented in Chapter 4 – the goal is to develop a more intelligent and in particular an adaptive way of applying temporal constraints. The following describes the developed approach and its realization with the input fusion of this work, before it is evaluated in the upcoming section.

First of all, using an interaction history for individual adaptation can be applied to any input fusion system, as long as the following prerequisites are met:

• The current user must be identified.

• The input fusion must maintain time intervals of the inputs’ onsets and offsets, as well as actual event times (i. e. when the input event was finally raised), to provide the interaction history with the necessary data.

• These timestamps must be respected within the fusion process, including the ability to deal with delayed events (potentially requiring some form of rewind).

• Input sensors ideally provide those onsets and offsets, although single timestamps (e.g. offsets only) are also applicable, albeit reducing the number of possible metrics.

• All components involved must be time synchronized, i. e. provide timestamps relatable to a global time base.

• Input sensors should maintain a buffer of their raw input data covering a few seconds, to be able to reprocess their data and search for missed events when requested (false negative detection).

The input fusion approach of this work can be used as is (cf. Section 4.5.1), except for the required identification of the current user. This is currently signaled by a manually created message within the message broker system. Before describing the actual use of an interaction history, it should be clarified what such a history looks like. Once the current user is known, an interaction history is stored as an XML file that contains multimodal input data in a record specific to the user, the application, and the combination of inputs. It consists of identifiers for each single input event and the respective sensory source, and the event time with relative onsets and offsets. Note that there is no need to store actual distributions of specific temporal metrics, as these are calculated at runtime.


Listing 5.1: An exemplary interaction history for user “007” containing two multimodal inputs for a “Put That There” application. Inputs are combined from verbal Deixis events (such as “That”, “This”, etc.) originating from a speech sensor and Pointing events from a gesture sensor. Timestamps are given in milliseconds. For the corresponding XML Schema, see Listing A.4 in the Appendix.

Also, the actual data of the inputs or their combinations is not required in such a record. Listing 5.1 shows an exemplary record containing two fused inputs for an application that makes use of complementary multimodal expressions consisting of a verbal deictic reference combined with a pointing gesture.
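Since the XML of Listing 5.1 is not reproduced at this point, the following Python sketch merely illustrates an equivalent in-memory structure holding the fields described above. The identifiers, field names, and the reading of onsets and offsets as values relative to the event time are assumptions for illustration, not the actual schema of Listing A.4.

from dataclasses import dataclass, field

@dataclass
class InputEvent:
    event_id: str     # e.g. "Deixis" or "Pointing" (hypothetical identifiers)
    sensor: str       # originating sensory source
    event_time: int   # absolute event time in ms
    onset: int        # onset relative to the event time, in ms
    offset: int       # offset relative to the event time, in ms

@dataclass
class MultimodalInput:
    events: list[InputEvent]

@dataclass
class HistoryRecord:
    """One record per user, application, and input combination; no payload
    data of the inputs themselves is stored."""
    user: str
    application: str
    combination: tuple[str, ...]
    inputs: list[MultimodalInput] = field(default_factory=list)

record = HistoryRecord(user="007", application="PutThatThere",
                       combination=("Deixis", "Pointing"))
record.inputs.append(MultimodalInput([
    InputEvent("Deixis", "SpeechSensor", event_time=1000, onset=-450, offset=0),
    InputEvent("Pointing", "GestureSensor", event_time=1100, onset=-600, offset=-50),
]))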

Recalling the error types of multimodal input given in the previous section, namely false negatives, false positives, conflicts, and misunderstandings, the latter cannot be detected directly, since their temporal relations are usually perfectly valid and do not differ from normal inputs. The other error types can be detected before and after the fusion process, as depicted in Figure 5.14 and described in the following.

History Initialization and Maintenance (Figure 5.14 a)

Figure 5.14: The process of using an interaction history surrounding the input fusion to detect and recover from multimodal input errors (cf. [139]).

Starting with an empty history, every successfully fused multimodal input is stored in the history (see Listing 5.1), creating a new record for the user, the application, and the input combination when necessary, making the history’s implementation application independent and reusable. The actual distributions for specific metrics are not stored directly in the history, but computed on-the-fly, reducing storage needs and increasing flexibility, as the used metrics can be changed anytime. Whenever the history is updated with a new record, the mean and standard deviation (SD) of each implemented metric are recomputed. Currently, the six metrics from Figure 5.12 are implemented, using absolute values for the distance-based metrics. To keep these metric-specific distributions up to date, prevent them from becoming too broad, and exclude extreme outliers, they are constrained to values within 1.5 ∗ SD of the mean. In consequence, inputs yielding a too high deviation from the mean are excluded from processing. Before being actually used, the distributions must of course contain a high enough number of SD-constrained inputs. Currently, a heuristic threshold of 20 entries per metric is used before the history is activated within the fusion process and applied in the following processing steps.
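As an illustration of this maintenance step, the following Python sketch keeps a per-metric distribution, constrains it to 1.5 ∗ SD around the mean, and activates it once 20 constrained entries are available. Class and method names are hypothetical and not taken from the actual implementation.

from statistics import mean, stdev

ACTIVATION_THRESHOLD = 20   # entries per metric before the history is used
CONSTRAINT_FACTOR = 1.5     # entries beyond 1.5 * SD of the mean are excluded

class MetricDistribution:
    """On-the-fly distribution of one temporal metric; raw distributions are
    not persisted but recomputed from the stored history records."""

    def __init__(self) -> None:
        self.values: list[float] = []

    def add(self, value: float) -> None:
        self.values.append(value)

    def constrained(self) -> list[float]:
        if len(self.values) < 2:
            return list(self.values)
        m, sd = mean(self.values), stdev(self.values)
        return [v for v in self.values if abs(v - m) <= CONSTRAINT_FACTOR * sd]

    @property
    def active(self) -> bool:
        return len(self.constrained()) >= ACTIVATION_THRESHOLD

    def stats(self) -> tuple[float, float, float, float]:
        """mean, SD, min, and max of the SD-constrained distribution."""
        vals = self.constrained()
        return mean(vals), stdev(vals), min(vals), max(vals)

onset_distance = MetricDistribution()
for v in [210, 250, 190, 2300, 230]:    # 2300 ms is an extreme outlier
    onset_distance.add(v)
print(onset_distance.constrained())     # the outlier is excluded by the 1.5 * SD rule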

False Negative Detection (Figure 5.14 b)

Whenever a sensor input arrives that can be part of a combined input, the maximum event distance stored in the history is calculated (independent of the implemented metrics). When no further input for combination arrives up to that time, a false negative is assumed and the potential input sensors are requested to reprocess their data in the time window given by the history data. Such reprocessing should happen with heightened sensitivity to increase the chance of detecting the previously missed input.
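A minimal sketch of this check is given below. The two callbacks, for waiting on a combinable input and for triggering sensor reprocessing, are hypothetical placeholders for the message-based communication of the actual system.

def check_false_negative(arrival_time_ms: float,
                         history_event_distances: list[float],
                         wait_for_second_input,     # callable: timeout_ms -> bool
                         request_reprocessing):     # callable: (start_ms, end_ms) -> None
    """If no combinable input arrives within the largest event distance ever
    observed for this user and combination, presume a false negative and ask
    the other sensors to re-analyze their buffered raw data (ideally with
    heightened sensitivity)."""
    window = max(history_event_distances)
    if not wait_for_second_input(window):
        request_reprocessing(arrival_time_ms, arrival_time_ms + window)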

Conflict Resolution (Figure 5.14 c)

When the input fusion component has completed its processing and detects a conflict between two unimodal inputs that cannot be combined with each other but with another, third input, the interaction history can be used to automatically resolve such a conflict and decide on the correct coupling. First, the system has to wait for the third input that the two contradicting inputs can both be fused with. Then a measure is needed to compare the possible fusion alternatives. For this purpose, the relative conformity with the interaction history is computed for each alternative as follows. Given the value $I_m$ of a fused input for a specific metric $m$, and the mean, min, and max values $H_m^{mean}$, $H_m^{min}$, and $H_m^{max}$ of the history for this metric, the conformity is defined as (from [140]):

$$
con(I_m) = \begin{cases}
1 - \dfrac{|H_m^{mean} - I_m|}{H_m^{mean} - H_m^{min}} & \text{if } I_m \le H_m^{mean}\\[1.5ex]
1 - \dfrac{|H_m^{mean} - I_m|}{H_m^{max} - H_m^{mean}} & \text{if } I_m > H_m^{mean}
\end{cases} \qquad (5.1)
$$

So the conformity of a fused input is 1.0 if it exactly meets the mean of the history, becomes 0.0 at the min and max boundaries, and negative outside. The most probable event combination is then selected by calculating (from [139]):

$$p = \max_m\bigl(con(I_m)\bigr) \cdot C_{si} \qquad (5.2)$$

where $C_{si}$ is the confidence score of the doubtful single input event, provided by the respective unimodal sensor component, and $\max_m()$ selects the highest conformity among all considered metrics. The input combination with the higher plausibility $p$ is then considered the intended combined input, while the other conflicting input is regarded as a false positive and therefore discarded.
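The following Python sketch transcribes Equations (5.1) and (5.2); the history statistics and confidence values in the usage example are invented numbers for illustration.

def conformity(value: float, h_mean: float, h_min: float, h_max: float) -> float:
    """Eq. (5.1): 1.0 at the historic mean, 0.0 at the min/max boundaries,
    negative outside the known range."""
    if value <= h_mean:
        return 1.0 - abs(h_mean - value) / (h_mean - h_min)
    return 1.0 - abs(h_mean - value) / (h_max - h_mean)

def plausibility(metric_values: dict, history_stats: dict, sensor_confidence: float) -> float:
    """Eq. (5.2): the best conformity over all considered metrics, weighted by
    the confidence score of the doubtful unimodal input."""
    best = max(conformity(metric_values[m], *history_stats[m]) for m in metric_values)
    return best * sensor_confidence

# resolving a conflict: keep the alternative with the higher plausibility
history = {"temporal_gap": (150.0, 20.0, 400.0)}   # (mean, min, max) per metric
alt_a = plausibility({"temporal_gap": 160.0}, history, sensor_confidence=0.8)
alt_b = plausibility({"temporal_gap": 900.0}, history, sensor_confidence=0.9)
print("keep A" if alt_a > alt_b else "keep B")      # prints "keep A"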

False Positive Detection (Figure 5.14 d)

In the “normal” case, where the input fusion provides a combined result, it is time to check for a false positive. Using the metrics’ current means ($H_m^{mean}$) and their standard deviations ($SD_m$), the mean relative deviation $d$ for a fused input is computed as follows:

$$d = \frac{1}{|M|} \sum_{m \in M} \frac{|H_m^{mean} - I_m|}{SD_m} \qquad (5.3)$$

where $M$ denotes the set of all used metrics $m$, and $I_m$ the input combination’s value for metric $m$. A fused input is then considered to contain a false positive when $d > 2$. In other words: if the fused input on average deviates from the mean more than about 95% of the known values do (i. e. more than $2 \cdot SD$), it is considered not to come from the user and thus must contain a false positive. The threshold for $d$ controls how well an input in question must fit the history data to be accepted. When a false positive is presumed, the less probable of the two unimodal input events is considered to be the false positive and is discarded, while the other one remains available to be combined with a subsequent input.
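A compact Python sketch of Equation (5.3) and the resulting decision; the history values used in the example are again invented for illustration.

from statistics import mean

FALSE_POSITIVE_THRESHOLD = 2.0   # d > 2, i.e. more than 2 * SD on average

def mean_relative_deviation(metric_values: dict, history: dict) -> float:
    """Eq. (5.3): average deviation from the historic means, expressed in
    standard deviations, over all used metrics."""
    return mean(abs(h_mean - metric_values[m]) / h_sd
                for m, (h_mean, h_sd) in history.items())

def contains_false_positive(metric_values: dict, history: dict) -> bool:
    return mean_relative_deviation(metric_values, history) > FALSE_POSITIVE_THRESHOLD

history = {"temporal_gap": (150.0, 40.0), "onset_distance": (220.0, 60.0)}  # (mean, SD)
fused = {"temporal_gap": 310.0, "onset_distance": 480.0}   # a suspiciously late pairing
print(contains_false_positive(fused, history))             # True, since d is about 4.2 > 2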


5.2.6 Evaluation

In order to verify the applicability and prove the benefits of the approach, another user study has been conducted, whose goals are threefold. First, the implementation of the approach presented in the previous section is to be evaluated with a realistic application and a sufficiently high induced error rate as a basis for quantitative analyses, including effects on the users’ perceived quality. Second, the approach is to be compared to the common principle of using a fixed temporal window and to Oviatt et al.’s idea of integration patterns, in order to quantify the potential benefit of building and maintaining users’ individual interaction histories. Third, it is still unclear which of the possible temporal metrics from Figure 5.12 are actually relevant and in which variants. While currently only the absolute values of the six metrics are implemented, the study data should provide further insights on this question.

To overcome the drawback of the rather abstract scenario of the previous study (cf. Section 5.2.2), a more realistic smartwatch scenario is chosen for several reasons. First, smartwatches offer very limited interaction possibilities due to their small screen size, also limiting the usefulness of touch interactions. That is why they could benefit considerably from multimodal interactions. Second, as smartwatches are very personal devices, they are a good choice when it comes to user individual adaption. Third, the typical tasks performed on a smartwatch are short and therefore transferability of the previous findings in [140] seems reasonable.

Task and System

The basic task chosen is that of quick information retrieval [139]. The most natural way of retrieving this information is to raise one’s arm and tell the smartwatch which information should be displayed. Nine different types of information are offered: email inbox, weather status, traffic information, news headlines, flight status, sports scores, current heart rate, today’s step counter, and upcoming tasks. The multimodal interaction consists of two unimodal interactions: the lifting of the arm and a speech command specifying the kind of information requested. After several pretests showed the necessity to keep users involved and avoid routine, the information retrieval was embedded into a short term memory task, where users had to answer information-specific questions as soon as they lowered their arm (see Figure 5.15). The type of information to retrieve and the answer to the memory task were both realized on one of three displays in the user’s vicinity (see Figure 5.16). For each task, the next display was chosen randomly, so users had to move from time to time.


Both input sensors, namely the arm-lifting sensor of the smartwatch and the ASR component, were time synchronized via SNTP and provided the input onsets and offsets necessary for the interaction history. Further details on the implementation are out of scope here, but can be found in [139].


Figure 5.15: The basic task used for the user study is a quick information retrieval on a smartwatch embedded into a short term memory task (taken from [139], c 2016 ACM).

Figure 5.16: A user answering the question on previously retrieved information on one of the 3 displays that are chosen randomly for each task (taken from [139], c 2016 ACM).

Error Induction and User Feedback

[139] In order to induce a high enough error rate for quantitative analysis, pre-recorded speech samples were played via speakers attached to the platform. The samples consisted of the exact same utterances the users were supposed to perform and were recorded at three different speaking rates from a male and a female speaker. Samples were played on a random basis, with random pauses ranging from 4 to 7.5 seconds between two played samples. Thus, especially the rate of speech false positives should be raised. False negatives (for speech and arm-lifting inputs) and misunderstandings were not artificially influenced. Note that the system could not produce errors of type conflict, as those would require semantically equal inputs from different sensors, which was not given in the applied setting.


In the now likely event of the watch showing information not intended by the user, a simple yet effective way to inform the system was implemented: the user could simply swipe away the information, as is common in Android Wear. Such feedback towards the system was used in two ways. The user could directly repeat the speech input for the correct information, without having to restart the complete interaction, thus avoiding unnecessary arm lowerings and liftings. For the input fusion system, such feedback was used to prevent the last interaction from being inserted into the interaction history. This way, the portion of actually correct user inputs in the history was greatly increased, which shortened the time needed to reach the necessary 20 interactions per metric before the interaction history is applied.

Experimental Procedure

[139] Each participant was welcomed and asked to fill out an introductory questionnaire consisting of basic demographic information (e. g. sex, age, handedness, occupation) and previous experience with smartwatches. Then, the participant was introduced to the system and the tasks of multimodal information retrieval and working memory. Before the experiment was actually started, participants had the opportunity to practice without the error induction. It is worth mentioning that their multimodal input behavior was not constrained in any way and they were told to perform arm liftings and speech inputs as they pleased. Although the participants were told that the research is about input errors that are obviously being induced, further details were not given and they were asked to ignore the ambient sound if possible.

In order to evaluate the performance of the history approach, the experiment was split into four blocks. In the first block, the interaction history is built up as described in Section 5.2.5, including the user feedback mechanism. When the history is ready to be applied (usually after about 30 tasks), the second block, in which the history is used, is started, consisting of exactly 20 tasks. In order to mitigate the risk of users inferring a change in system behavior and to avoid learning effects, two counterbalanced blocks of activated/deactivated interaction history followed, again containing 20 tasks each. When the history was deactivated, only the fusion component’s relatively long fixed time window of 5 seconds was used to constrain the combination of multimodal inputs.

After each block, users had to fill out a short questionnaire containing five-point Likert items on the user’s appraisal of the system’s error rate (“very low”, “low”, “neutral”, “high”, “very high”) and on whether she/he is overall satisfied with the system (“disagree”, “somewhat disagree”, “neutral”, “somewhat agree”, “agree”). The latter was taken from the PSSUQ usability questionnaire [85]. During all four blocks, the speech error induction was activated. After the experiment, participants were paid and the intentions behind the study were explained.

Participants and Data Recording

[139] 20 volunteers recruited on the university campus, 17 male and 3 female, successfully completed the experiment and were paid €10 and some candy. The average age was 26.3 (SD = 5.5), with the youngest participant being 21 and the oldest being 44. Nine participants had prior experience with smartwatch interaction, though none of the participants actually owned one.

All system components (ASR, movement sensor, input fusion, interaction history, Android Wear application, and the working memory task component) contributed to a central XML-based log file with all their relevant events. Video was recorded from a third person’s perspective (cf. Figure 5.1(b)) using a standard webcam. Speech inputs’ raw audio signal was separately recorded from the directional microphones. Android Wear raw sensor data (orientation and linear acceleration) was recorded as well. The recorded log files for each participant were processed by a script for automatic .csv file generation and for additional manual labelling. The acquired variables are as follows:

• input events with timing information: arm liftings, arm lowerings, speech inputs, and user feedback

• outputs of fusion and history components: input fusion results, presumed false positives, false negatives and reprocessing requests.

• outputs of the working memory task component: requested information, correctness of the given answer

• outputs of the Android Wear application: information type being displayed

• user-specific variables: percentage of straightaway successful information retrievals and of unsuccessful attempts for every condition (history/no history) and experimental block

• questionnaire results on the five-point Likert items, with a value range of −2..2

Using this information and the video footage, the actual error types of inputs (false positive, false negative, misunderstanding, and mixed type) and history decisions (correct acceptance, false rejection) were manually labeled. In addition, a ground truth for all multimodal inputs was created, where each possible combination (independent of the decisions of the system) is categorized either as to accept or to reject by a hypothetical “perfect” error detection mechanism. That is, actual user inputs (including misunderstandings) are marked as to be accepted, whereas false positives and other mixed types of errors are to be rejected. Where applicable, the described labelling was conducted for all of the 4 blocks of the experiment.

Results

With 20 participants, data for a total of 2061 information retrieval tasks containing 2711 multimodal inputs was available for analysis. The following paragraphs present the results of the study for each of the three goals. Unless otherwise noted, the exact Wilcoxon matched-pairs signed-rank test with correction for ties was used for tests on significant differences, where M = mean, Mdn = median, p = significance, and r = effect size estimate.

System Performance and User Judgement: System performance can be measured by the percentage of straightaway successful information retrievals. In order to avoid ordering and learning effects, the data used here stems from the third and fourth block of the experiment consisting of 20 tasks for each condition (history not used / history used) per user. Figure 5.17 shows the results for both conditions on a user individual basis.


Figure 5.17: Boxplots of straightaway successful information retrievals when not using vs. using the interaction history. The line plot in the background shows the individual results for each of the 20 users (taken from [139], c 2016 ACM).

The means of straightaway successful retrievals change from 50.5% when no history is used (Mdn = 50.0%) to 71.8% when the history is used (Mdn = 75.0%). This difference is highly significant with p < .001, r = −.99. As shown in the line-plot for each individual, the percentage increased for all participants except one (red line in Figure 5.17).

An analysis of the mean portions of different error types in the two counterbalanced blocks reveals the rationale behind the overall better performance. In addition to the known five types of multimodal input errors (cf. Section 5.2.4), two additional subtypes are used for clarification: false rejections for inputs that were wrongly presumed to be false positives by the interaction history, and lucky false positives that denote actual false positives that luckily happened to be the same as what the user intended. As shown in Figure 5.18, the percentage of false positives significantly decreases when the history is used, from 35.3% (Mdn = 37.0%) to 12.5% (Mdn = 9.5%) with p < .001, r = −1.03. Of course, the same holds true for lucky false positives with p < .001, r = −.81. Interestingly, the percentage of misunderstandings significantly increases from 1.6% when no history is used (Mdn = 0.0%) to 4.2% when the history component is used (Mdn = 1.8%) with p = .008, r = −.59. The reason for this might be that the more induced speech samples are correctly rejected by the interaction history component, the more of the actual user inputs are taken into account, some of which were misunderstood by the ASR. The other error types show no significant changes in their frequency of occurrence. This includes false negatives that, against expectations, could not be reduced significantly, even though reprocessing was implemented as described in Section 5.2.5. This may be due to the used ASR component, which provided no means to heighten its sensitivity.

While the above measures are objective, they do not necessarily lead to an increased perceived quality. Figure 5.19 shows the participants’ appraisals of the system’s error rate and overall satisfaction with the system for both conditions.

[139] The perceived error rate was significantly lower when the history was used (M = −0.5, Mdn = −1.0) than when it was not used (M = 0.25, Mdn = 0), p = .005, r = −.64. Similarly, the overall satisfaction increases, albeit not significantly at the 5% level (M = 0.55, Mdn = 0.5 without history vs. M = 1.0, Mdn = 1.0 with history, p = .057, r = −.43). This may stem from the high rate of induced errors, which seems to remain at an unsatisfactorily high level, even when significantly reduced by applying the interaction history.

Comparison to a Fixed Temporal Window and Integration Patterns: The second goal of the study was to compare the interaction history approach to the common principle of using a fixed temporal window and to using integration patterns. For the temporal window, the onset distance between two inputs was used as underlying metric, i. e. whenever two inputs start within the temporal window, they are treated as being valid.



Figure 5.18: Connected bar-plots showing the change of overall error rates between both conditions. Significant changes are marked according to their level of significance (∗∗ ≤ 0.01, ∗∗∗ ≤ 0.001) (taken from [139], c 2016 ACM).

Figure 5.19: Distributions of users’ appraisals of the system’s error rate (“How do you assess the system’s error rate?”) and overall satisfaction (“Overall, I’m satisfied with this system.”) for both conditions on a five-point Likert scale (taken from [139], c 2016 ACM).

The exact value for the temporal window was taken from the averaged history data of all users at the end of the first block, i. e. at the exact same moment the interaction history was applied for the first time. A deviation of 2 ∗ SD was used as the upper limit of the temporal window, the same threshold that is used to detect false positives in the history approach (cf. Section 5.2.5). This resulted in a temporal window of 2272 milliseconds.

The integration patterns were also calculated based on the data at the end of the first block of the experiment. This resulted in twelve participants being predominantly simultaneous integrators, five being sequential ones, and three having no dominant pattern, with an overall average consistency of 78%. Whenever an input follows the user’s dominant pattern, it is accepted, and it is rejected otherwise. For non-dominant integrators, all inputs are accepted.
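For clarity, the two baseline decision rules can be sketched in a few lines of Python; the function names are hypothetical and the window value is the one derived above.

from typing import Optional

TEMPORAL_WINDOW_MS = 2272   # mean onset distance + 2 * SD at the end of block one

def accept_by_window(onset_a_ms: float, onset_b_ms: float) -> bool:
    """Fixed temporal window: accept whenever both inputs start close enough."""
    return abs(onset_a_ms - onset_b_ms) <= TEMPORAL_WINDOW_MS

def accept_by_pattern(dominant_pattern: Optional[str], inputs_overlap: bool) -> bool:
    """Integration patterns: accept only inputs matching the user's dominant
    pattern; users without a dominant pattern get every input accepted."""
    if dominant_pattern is None:
        return True
    observed = "simultaneous" if inputs_overlap else "sequential"
    return observed == dominant_pattern

print(accept_by_window(0, 2500))                  # False: outside the window
print(accept_by_pattern("simultaneous", False))   # False: pattern mismatch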

The manually labelled ground truth starting from the second block of the experiment provided the test data for the compared variants. Since the ground truth contains inputs that should either be accepted or rejected by a perfect system, the analysis can be treated as a binary classification, where an input could also be either accepted or rejected. This results in each input being either a true positive (i. e. a correct acceptance), a false positive (i. e. a false acceptance), a true negative (i. e. a correct rejection), or a false negative (i. e. a false rejection). For the actual comparison, the accuracy for each variant was calculated, which relates the number of true positives and true negatives to the total number of inputs. Figure 5.20 shows the results for the individual history approach, the temporal window, the integration patterns, and additionally the variant where no history is used, i. e. every input within 5 seconds of temporal fading is accepted.
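Casting the comparison as binary classification, the accuracy of a variant can be computed on the labelled ground truth as sketched below; the example data is invented.

def accuracy(decisions: list, ground_truth: list) -> float:
    """Each input should either be accepted (True) or rejected (False);
    accuracy relates correct acceptances (true positives) and correct
    rejections (true negatives) to the total number of inputs."""
    correct = sum(d == g for d, g in zip(decisions, ground_truth))
    return correct / len(ground_truth)

truth    = [True, True, False, True, False, True, True, False]
decision = [True, True, True,  True, False, True, False, False]
print(f"{accuracy(decision, truth):.1%}")   # 75.0%: one false acceptance, one false rejection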


Figure 5.20: Boxplots of the accuracies resulting from the ground truth test data of the 20 participants for the no history, temporal window, integration patterns, and interaction history variants. White diamonds show the means, the circles the individual participants’ results.


The individual interaction history scored best with a mean accuracy of 84.7% (Mdn = 86.5%), followed by the integration patterns (M = 76.0%, Mdn = 80.0%), the temporal window (M = 75.5%, Mdn = 76.4%), and the no-history variant (M = 61.3%, Mdn = 61.9%). A Friedman’s ANOVA confirms that the accuracy significantly changes between the variants, χ2(3) = 42.0, p < .001. Post hoc tests were used with Bonferroni correction applied. All three variants of constraining inputs are significantly better than using no history. In addition, there is a significant difference between the interaction history and the temporal window variant (mean rank difference (mrd) = 27.5), but not compared to the integration patterns variant (mrd = 21.0). The integration patterns variant itself is also not significantly better than the temporal window variant (mrd = 6.5). In all cases, the critical mrd was 21.5 (α = .05 corrected for the number of tests). From this, one can conclude that the interaction history approach presented in this work is clearly the best choice, while integration patterns and a fixed temporal window are effectively tied. Also worth noting is the huge variance between users in the integration patterns variant, which supports the finding of the previous study that integration patterns – i. e. using the dominant form of integration as a classifier – are often a poor representation of user behavior (cf. Section 5.2.3).

What in fact are the relevant metrics for an interaction history? The third goal of the study was to shed some light on the relevance of the different metrics. While the above results already prove strong benefits for the approach using the six implemented metrics, there might be other variants of the metrics and combinations that could provide even better results. The analyses are performed as above, i. e. measuring the accuracies given the ground truth data starting from the second block of the experiment. Note that this time no tests on significant differences are reported, as most differences are quite small anyway and, with the sheer amount of tests, any significance found at the 5% level would be suspected of being a type I error (detecting an effect that is not present). As a first step, one can compare the six implemented metrics taken separately and in every possible combination, which amounts to 63 combinations in this case. The results are shown in Figure 5.21, where the single metrics, their combination as used in the experiment, and the best scoring combination are compared. Of the single metrics, temporal gap scored best (M = 85.0%, Mdn = 86.5%), even better than the combination of all six metrics used in the experiment (M = 84.7%, Mdn = 86.5%). Surprisingly, the integration index, which was intended to be a superior metric, scored worst (M = 78.6%, Mdn = 82.8%), but still better than using a temporal window or integration patterns (cf. Figure 5.20). The best of all 63 possible combinations of the six metrics is that of temporal gap and center distance, with a mean accuracy of 85.7% (Mdn = 87.8%).
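Such an offline analysis could look like the following Python sketch, which scores every non-empty subset of the metrics by reusing the d ≤ 2 acceptance rule of Equation (5.3); whether the original analysis applied exactly this rule per combination is an assumption.

from itertools import combinations
from statistics import mean

def accept(metric_values: dict, history: dict, metrics: tuple) -> bool:
    """Accept an input if its mean relative deviation over the chosen metrics
    stays within 2 * SD (cf. Equation (5.3))."""
    d = mean(abs(history[m][0] - metric_values[m]) / history[m][1] for m in metrics)
    return d <= 2.0

def best_combination(inputs: list, ground_truth: list, history: dict, all_metrics: list):
    """Exhaustively score every non-empty subset of metrics (63 subsets for six
    metrics) by the accuracy of its accept/reject decisions."""
    best = None
    for r in range(1, len(all_metrics) + 1):
        for combo in combinations(all_metrics, r):
            decisions = [accept(values, history, combo) for values in inputs]
            acc = mean(d == g for d, g in zip(decisions, ground_truth))
            if best is None or acc > best[1]:
                best = (combo, acc)
    return best   # (best metric subset, its accuracy)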


Figure 5.21: Boxplots of the accuracies for the six implemented metrics taken separately (integration index: 78.6%, onset distance: 82.8%, total duration: 83.1%, offset distance: 84.0%, center distance: 84.8%, temporal gap: 85.0%), the implemented combination of all six metrics (84.7%), and the best scoring combination of temporal gap and center distance (85.7%). White diamonds show the means, the circles the individual participants’ results.

As already stated, the current implementation uses absolute values for the distance-based metrics (onset, offset, and center distance). Therefore, they are insensitive to the ordering of inputs, although this ordering may bear important information. Using their ordering-sensitive variants with algebraic signs, i. e. always calculating distances relative to a certain modality, may result in even better accuracies. Performing the analysis with these variants of the metrics – and dropping the inferior integration index – gives the results shown in Figure 5.22. While the values for total duration and temporal gap are obviously the same as before, the ordering-sensitive variants of the distance metrics perform considerably better now, with an increase of well over 3% compared to their absolute-valued counterparts. Now the best scoring combination of all 31 possible combinations is a triple consisting of temporal gap, onset distance, and center distance, with a mean accuracy of 88.3% (Mdn = 87.6%)1. Compared to the metric combination used in the experiment, this not only represents an increase of 3.6% in mean accuracy, but also results in a much smaller variance between the users. If the experiment were conducted with this metric combination, one could expect even better results in system performance and user judgement than already reported above.

1The interested reader can find the top ten of all 31 possible metric combinations in Figure A.1 in the appendix.


Figure 5.22: Boxplots of the accuracies with ordering-sensitive variants of the five basic metrics (total duration: 83.1%, temporal gap: 85.0%, ordered onset distance: 86.4%, ordered offset distance: 87.3%, ordered center distance: 88.0%), the combination used in the experiment (84.7%), and the best scoring combination of temporal gap, ordered onset distance, and ordered center distance (88.3%). White diamonds show the means, the circles the individual participants’ results.

5.2.7 Discussion

Related literature showed some evidence that the temporal relationships users exhibit within multimodal interactions could be useful for system-sided adaptation to create more robust multimodal systems. However, the experimental data on a gamified abstraction of selections (cf. Section 5.2.2) shows that integration patterns as defined by Oviatt et al. are too simplified to be suited as a representation of a temporal interaction history, at least for this kind of task. The overall consistencies are much lower and show a more divergent characteristic per user than reported in [116, 119]. The same holds for a user classification based on the order of inputs [37, 53, 83]. None of these simple measures showed enough consistency that one could expect a significant reduction of input errors based on them. Instead, the usage of actual temporal distributions (in various metrics based on multimodal event timings) in the form of an interaction history is proposed, as they hold much more information than the classification-based approaches known so far. In addition, the experimental data allowed the definition of five basic types of errors that can occur during multimodal interactions.

Based on these findings, an approach for creating and using a temporal interaction history at runtime was presented that integrates into the process of input fusion (cf. Section 5.2.5) and is automatically applied for any redundant or complementary multimodal input expression that involves two modalities. The interaction history itself is stored as an application-independent XML file and built on-the-fly at runtime. Using a constraint mechanism based on the standard deviations of the used metrics, such a history can be maintained even when the user changes her/his temporal behavior over time. The implemented approach was then evaluated in a realistic smartwatch scenario with complementary multimodal input expressions, where sensor errors were artificially induced for the speech input by pre-recorded speech samples. Although this might not be as realistic as an in-the-wild study, it still provides a controlled and defined mechanism for error induction in a laboratory environment. The results not only show significant decreases in error rates, but also clear benefits in actual user experience. Compared to the well-known use of fixed temporal windows found in many approaches of input fusion, as well as to the notion of integration patterns suggested by Oviatt et al., the presented approach shows a significant improvement in overall accuracy in differentiating errors from actual user inputs. This proves the benefits of adapting to individual user behavior and reinforces the assumption that building and maintaining an interaction history is well worth the effort. While the implemented version used for the evaluation study used a fixed set of six temporal metrics, further analysis of the data showed that additional metrics, especially those that are sensitive to the order of inputs, could provide even better results. The best mean accuracy was achieved by a combination of three metrics (temporal gap, onset distance, and center distance), though this may be specific to the given task. As a reference, other combinations that performed similarly well can be found in the appendix (cf. Figure A.1).

However, the approach has some limitations. First, an individual history has to be created before it can be used for error detection. To overcome this, one could start with a generic history gathered in advance, before the user’s individual one is applied. Second, the approach presupposes that there are more successful interactions than those with errors, since only then are the temporal distributions a good representation of intended interactions. That is why a feedback mechanism that allowed users to explicitly denote erroneous inputs was integrated into the smartwatch scenario. Third, in case a user abruptly changes his/her temporal input behavior, the system is likely to falsely reject intended inputs for quite a while before the distributions have picked up the change. That is why the implemented system provided a fall-back mechanism that accepted any identical input that was observed twice in succession (i. e. when the user repeats an input), assuming it could not be a false positive then.

Regarding future work, instead of using a fixed set of metrics for all users, one could imagine using an individual set of metrics per user that best describes his or her temporal behavior, though it remains unclear how such a set could be identified at runtime. It also remains unclear whether such an interaction history should be maintained forever (as currently implemented), or whether applying some form of sliding window would be preferable. Regarding the calculations on the history data, there are many aspects that could be subject to further studies. For example, the relationship between the threshold for constraining the history (1.5 ∗ SD) and the threshold for presuming a false positive (d > 2, see Equation (5.3)) should be further investigated. Instead of the current approach that heavily relies on standard deviations, one could also imagine other ways, e. g. using percentiles to better reflect skewed distributions. Regarding the explicit feedback mechanism used in the evaluation study, it certainly speeds up the creation and improves the quality of the history data. When no explicit feedback is available, the history data is certainly less accurate, so other ways of feedback, such as implicit detection through emotion recognition, may prove useful, as envisioned in [97]. Going one step further, instead of using a procedural approach for the error detection as presented here, classical machine learning approaches such as support vector machines could also prove useful. Future work should not only explore more tasks and other modalities, but also ways of extending the interaction history to more than bi-modal inputs and of making it available to more complex interactions as well, i. e. creating an interaction history for sequences of multiple (also unimodal) input expressions.

5.3 Summary

This chapter pursued the second goal of this work, namely progressing the state of knowledge on users’ modality adoption and exploring ways of adapting to individual usage patterns (cf. Chapter 3), which resulted in the requirement to account for user individuality (cf. Section 3.2.2). The first part of the chapter took the user’s perspective and presented the state of knowledge on user-sided preferences for modality use, identifying several parameters already known to influence a user’s choice of input modalities. These include the task at hand, the presentation, and the context, while the users themselves have been investigated only superficially. This led to an own user study that investigated the user’s role more thoroughly within the basic task of selections. The results show that gender plays a substantial role, while the presentation of the selection tasks showed only marginal influence. For example, females tend to use touch input much more often than males (83% vs. 43%), who adopt more novel ways of input like speech and gesture just as well, rather independently of the way the system presented the selection objects. Furthermore, differences within genders could be traced to certain aspects of personality and technology affinity. These findings lead to a much more differentiated understanding of users’ modality adoption and to the conclusion that the prediction of a user’s modality choice may only be possible on a rough statistical basis. While this can be used in the design of a system, it is not enough to justify modification of a single input’s plausibility at the level of input fusion.


The second part of the chapter then explored ways of system-sided adaptation to observed user behavior. While the literature already came up with some ideas, e. g. Oviatt et al.’s integration patterns or the mere order of modalities during multimodal inputs, their actual use for improving a system’s robustness remained unclear. A user study that investigated an abstract version of multimodal selections in a gamified setup then revealed not only what kinds of errors can occur during multimodal interactions, but also the inability of the known measures to properly reflect a user’s multimodal behavior from a temporal perspective. Based on these findings, an approach for individual adaptation to a user’s temporal interaction history (in the form of temporal distributions of redundant and complementary multimodal inputs) and its use for detecting and preventing common input errors, like false negatives and false positives, was developed. The implementation within the process of input fusion, in a reusable and application-independent manner, was then evaluated in another study. This time, a more realistic scenario of using a personal smartwatch device was employed. Results show substantial benefits of such an adaptation to user individuality, both in terms of reduced error rates – increasing the success rate of inputs from 50.5% to 71.8% – and in terms of users’ perceived satisfaction. Comparing the accuracy in differentiating errors from actual user inputs, the proposed approach yields a significantly better result (84.7%) than the usually employed “one fits all” approach of a fixed temporal window (75.5%) and outperforms integration patterns (76.0%) as well. It is therefore safe to conclude that maintaining an individual interaction history is well worth the extra effort.

6 Conclusion and Outlook

Restating the problem given in the introduction, today’s systems provide a lot of intelligence concerning their core functionality, but lack this intelligence at the interface, especially when it comes to user inputs. The idea of companion technology is to provide the user with an intelligent and individually adaptive interface, in order to render interaction more intuitive, natural, and therefore satisfying. Multimodal inputs are the natural method of choice here, as they allow greater expressiveness, enhanced flexibility, and increased robustness, and are often preferred by users. There are many approaches enabling multimodal inputs, so-called input fusion engines; however, they still lack the intelligence and robustness companion technology is striving for.

The objective of this work was to provide an approach for input fusion that performs its task of combining and deciding on all available user inputs in a more sophisticated and robust way compared to the state of the art, so that it is suited as companion technology. In the long term, this should be considered a step towards actually deployable multimodal systems that end users truly benefit from.

6.1 Contributions and Limitations

At a glance, the contributions of the work to the state of knowledge on multimodal interaction can be condensed as:

• Interaction modality as new definition of “modality” for multimodal interaction

• A reference source for multimodal input fusion engines of the past 30 years

• An input fusion approach based on evidential reasoning with GraphML as fusion specification

• An exploration of influencing factors for modality use during selections regarding gender, personality traits, and technology affinity

• A classification of input errors during multimodal interactions

• A concept of temporal interaction histories for individual user adaptation


The following paragraphs summarize the conducted work in chronological order, highlight its original contributions, and state known limitations in greater detail.

Chapter 1 “Fundamentals” introduced all relevant notions and set the context this work is embedded into, including the benefits of multimodal interaction known from literature. It provided a new definition of “modality” called interaction modality, which is based on the concept of input and output expressions that transfer information via modalities with source and sink interpreters. Accompanied by a graphical notation, it easily allows the representation of any form of interaction with unimodal, complementary, and redundant forms of modality use.

The state of the art presented in Chapter 2 not only represents a comprehensive reference source for multimodal input fusion engines of the past 30 years (cf. Table 2.1), but also identifies common weaknesses and potentially valuable ideas regarding input fusion for companion technology. For instance, known approaches often redundantly perform some form of dialog management, uncertainties that are supposed to be common with recognition-based input technologies are mostly handled only superficially with simple n-best lists, and any form of individual user adaptation is missing. In addition to the capabilities of known fusion engines, the chapter also assesses markup languages necessary to configure such engines, and known architectural standards for multimodal systems that should be referred back to.

With the findings from the state of the art, the general objective to provide a multimodal input fusion for companion technology was then substantiated in Chapter 3 “Goals and Requirements”. Two major subgoals were defined: (1) Create a complete fusion approach that puts focus on a sound handling of uncertainty, and (2) progress the state of knowledge on users’ modality adoption and explore ways of exploiting it individually. These subgoals were then operationalized in concrete and verifiable functional and non-functional requirements for a corresponding approach as follows:

Functional Requirements:
1. Combination
2. Disambiguation
3. Reinforcement
4. Conflict Detection

Non-functional Requirements:
1. Sound handling of uncertainty
2. Account for user individuality
3. Support synchronization of inputs
4. Focus on logical interaction
5. Facilitate modularity


Addressing the first subgoal of this work, Chapter 4 “Fusion of User Inputs Using the Transferable Belief Model” then presents the input fusion approach based on evidential reasoning as the first major contribution. Following a bottom-up methodology, it is first argued why Dempster-Shafer theory of evidence (DS) is chosen as the way of describing and handling uncertainties, rather than n-best lists and Bayesian inference. Based on ideas from Reddy & Basir [126, 127], the specific interpretation of DS called Transferable Belief Model is then introduced and adaptations for its use for input fusion are presented, fulfilling the non-functional requirement of sound handling of uncertainty (1.). A theoretical evaluation confirms the fulfillment of the four aforementioned functional requirements on a formal level and shows its advantages over traditional DS and n-best lists, especially concerning reinforcement and conflict detection. Now that the inner formalism for deriving a decision based on beliefs and probabilities is set, it has to be provided with a specification of which inputs are to be expected and how they should be combined on a semantic level. The proposed use of GraphML with application-specific event data in XML and their combination specification via XSLT not only fulfills the non-functional requirements of focusing on logical interaction (4.) and facilitating modularity (5.), but is also the first approach to provide a description language for fusion specification with arbitrary data models. Section 4.5 then shows the realization of the aforementioned concepts in a runtime component (cf. Figure 4.3). Communication with input devices and an adjacent dialog management/application is exemplified with concrete examples, and runtime aspects like time synchronization and temporal fading of inputs are discussed, fulfilling the non-functional requirement of supporting synchronization (3.). Finally, the integration of the component into a prototypical companion system and the cooperation with a multimodal fission component – regarding conflict resolution and user-initiative demands for presentation – is elucidated with concrete examples taken from an application scenario.

In comparison to the state of the art, the proposed approach is the first fully realized input fusion based on the concept of belief theory. It provides a sound concept of handling uncertainties and thus allows a more informed decision on uncertain inputs compared to n-best lists. Admittedly, the full effect of evidential reasoning is only achieved when there are (many) input sensors that provide belief distributions for sets of events and not merely probabilities split among singular events, as applies to most off-the-shelf sensors. Nevertheless, even in the latter case, the approach outperforms n-best lists by quantifying the inherent amount of conflict between inputs. Another aspect that could prove problematic under certain circumstances is the assumption of independence between input sensors that is made in evidential reasoning. While this assumption is made in all other approaches of the state of the art as well, it may not always be true. Nevertheless, the assumption is made not only because quantifying possible dependencies seems nearly impossible, but also since such independence seems reasonable in common HCI scenarios, where sensors usually process different types of signals. Referring back to the architectural standards for multimodal systems of the W3C, the integration of the approach in a companion system adheres well to MMIF when used with inputs specified in EMMA, only lacking DOM specifications of the involved components. Regarding the MMI Arch, and following a suggestion of Schnelle-Walka et al. [136], the input fusion can be regarded as a special kind of modality component. What is in fact missing is an implementation of the life cycle events defined by the standard.

Chapter 5 “Individual User Adaptation” addresses the second subgoal of the work, which is not only a general concern of companion technology, but at the same time an important and still open issue of multimodal input fusion raised by other researchers [45, 78, 157]. It is approached from two perspectives, namely the user and the system. First, user-sided preferences of modality use and their influencing factors are explored. The state of knowledge shows that only little is known about the role of the users themselves and almost nothing is known for basic forms of multimodal interactions that could be transferred between applications and domains. That is why a study has been conducted, investigating user-specific factors, like gender, personality traits, and technology affinity, on the basic task of selections. Results show a yet unknown and remarkable influence of gender. Not only do both genders differ in their general adoption of modalities, but they also show interesting intra-gender effects regarding technology affinity and personality traits. The influence of presentation was found to be less pronounced than indicated by previous research. Though it is argued that such influencing factors are not directly applicable for adaptation, they nevertheless substantially extend and deepen the state of knowledge. The second part of the chapter then focused on the perspective of the system and the information the system already gathers while interacting with the user. The state of knowledge already suggests that temporal relationships exhibited by users during multimodal interactions could be a valuable source of information for system-sided adaptation, such as Oviatt et al.’s notion of integration patterns [115]. Yet, no actual use of such information for improving a system’s robustness is found in the literature. Since the findings reported in the literature provided no clear picture, a dedicated study was conducted employing a gamified and abstract variant of multimodal selections. Results show that integration patterns and simple classifications yield consistencies too low to expect significant benefits in the tested task. Therefore, instead of a categorization, the use of actual distributions is proposed. Moreover, the study data were analyzed with regard to errors that occurred during multimodal interactions, resulting in an original classification of errors during multimodal interaction (false positives, false negatives, conflicts, and misunderstandings). Based on these findings, an approach for using individual temporal distributions (called temporal interaction history) has been developed in order to fulfill the non-functional requirement of accounting for user individuality (2.). It completely integrates into the process of input fusion and represents the second major contribution of this work. The implemented approach was then evaluated in a realistic smartwatch scenario. Results not only prove a significant reduction in error rates, but also an increase in users’ perceived satisfaction. Furthermore, the approach outperforms integration patterns and the widely used “one fits all” approach of a fixed temporal window. Thus, it is concluded that maintaining an individual interaction history and adapting to it at runtime is well worth the extra effort.

While the approach for individual adaptation based on temporal interaction histories is a very promising step towards more robust multimodal systems, it also has limitations. An obvious drawback is the fact that such a history can only be applied after some time, before errors can effectively be reduced. To overcome this, one could start with a generic history gathered in advance, before the user’s individual one is applied. Besides the absolute prerequisite that error-free inputs happen more often than erroneous ones, another problem is posed when a user abruptly changes his/her temporal input behavior. Then the system is likely to falsely reject intended inputs for quite a while, before the distributions have picked up the change. However, this was mitigated in the implemented system by a fall-back mechanism that accepted any input once the user repeated it. Another limitation lies in the fact that the implementation is currently limited to bimodal interactions, though considering the state of the art, it seems reasonable to assume these to be the most relevant and frequent form of multimodal inputs.

Taken as a complete concept of individually adaptive input fusion, the presented work can not only be used within companion systems, but can be applied in any type of system as a technology for realizing robust multimodal inputs, as long as the necessities described in Section 4.5.1 are met by the adjacent components (sensors and application). Thus, it could be the basis for multimodal interfaces reaching a maturity that actually advances HCI for end users.

6.2 Further Research

The conducted work and its contributions open up a variety of further research. The following paragraphs not only give hints for further supplements of the input fusion and individual user adaptation – in addition to those already mentioned in the respective discussions (cf. Sections 4.6 and 5.2.7) – but also take a broader view on input fusion and multimodal interaction as a whole.

Research on input sensors: Regarding the input sensors, further research is needed towards sensors that provide actual belief distributions directly suited for evidential reasoning. The available sensors often make a decision on their own, instead of passing on well-defined uncertainties to higher levels of abstraction. It remains to be seen which sensors are best suited for constructing beliefs.

Other concepts of uncertainty: In terms of modeling uncertainty and fusing uncertain inputs, as is done using Dempster-Shafer theory here, one could also explore other concepts like fuzzy logic, and compare differences in meaning and applicability. This could provide an even more versatile way of representing uncertainty that may be advantageous for certain input modalities, like rather vague implicit inputs derived from facial expressions and gestures.

Basic multimodal interactions: Another aspect that should be subject to further studies is to explore other types of basic multimodal interactions, in addition to the selections and the commands on denoted objects this work has investigated. Potential candidates are possibilities such as entering values or more complex operations like altering values, as well as “meta” interactions such as scrolling and zooming, or user nominations intended for a fission module. In addition to GUI-based applications, interactions with real-world entities found in other contexts such as smart homes or driving scenarios should be considered as well. The long-term goal could be to create a set of basic multimodal interactions any application can be broken down into, in order to increase the transferability of research findings.

Interaction-specific metric combinations: Though the evaluation results of the interaction history approach indicate an optimal combination of metrics (namely temporal gap, center distance, and onset distance), its transferability remains unclear and should be subject to further studies. It is to be expected that other types of interaction are best served by other metric combinations.
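For illustration, the following sketch computes plausible versions of the five tested metrics from the onset and offset timestamps of two input events; the exact definitions used in the evaluation are given earlier in the thesis and may differ in detail:

```python
def temporal_metrics(a_on: float, a_off: float, b_on: float, b_off: float) -> dict[str, float]:
    """Plausible temporal metrics for a pair of (possibly overlapping) input events,
    given their onset/offset timestamps in milliseconds. Illustrative definitions only."""
    return {
        # gap between the two events; 0 if they overlap
        "temporal_gap": max(0.0, max(a_on, b_on) - min(a_off, b_off)),
        "onset_distance": abs(a_on - b_on),
        "offset_distance": abs(a_off - b_off),
        "center_distance": abs((a_on + a_off) / 2 - (b_on + b_off) / 2),
        "total_duration": max(a_off, b_off) - min(a_on, b_on),
    }

# Example: speech from 0-800 ms, touch from 950-1000 ms
print(temporal_metrics(0, 800, 950, 1000))
```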

Context of use: A possible extension of the approach could take the context of use into account, which is another central concern of companion technology. It seems plausible to assume that the temporal interaction behavior changes under certain circumstances, e.g. when the user is under stress. Conditioning the history on such context could avoid the risk of the interaction history becoming too vague, and ensure its individual fit in all situations.
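One straightforward realization, sketched below with hypothetical names, would be to key the interaction history by a coarse context label, so that distributions are learned per situation and a context-independent record serves as a fall-back:

```python
from collections import defaultdict

class ContextualHistory:
    """Keeps one temporal interaction history per (user, context) pair instead of a
    single one per user, so that e.g. a 'stressed' context does not dilute the
    distributions learned under normal conditions. Purely illustrative sketch."""

    def __init__(self) -> None:
        # maps (user_id, context_label) -> list of recorded metric vectors
        self._records: dict[tuple[str, str], list[dict[str, float]]] = defaultdict(list)

    def record(self, user_id: str, context: str, metric_vector: dict[str, float]) -> None:
        self._records[(user_id, context)].append(metric_vector)

    def history_for(self, user_id: str, context: str) -> list[dict[str, float]]:
        # fall back to the context-independent record if the specific one is still empty
        return self._records[(user_id, context)] or self._records[(user_id, "default")]
```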

Complete feedback loop: In order to increase the robustness of multimodal inputs even further, research should examine ways of introducing a complete feedback loop for inputs recognized as erroneous, one that reaches down to the level of the sensors instead of being handled only within the input fusion. That way, adaptive input sensors (such as ASR) could benefit from the error recognition capabilities introduced at higher levels of abstraction and continually increase their individual recognition rates.
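Conceptually, such a feedback loop could be as small as a callback interface from the fusion back to the registered sensors, as in the following hypothetical sketch (not part of the presented architecture):

```python
from typing import Protocol

class AdaptiveSensor(Protocol):
    """Hypothetical interface a sensor would implement to receive feedback."""
    def on_input_judged(self, input_id: str, was_erroneous: bool) -> None: ...

class FusionFeedback:
    """Propagates the fusion's error judgment down to all registered sensors,
    e.g. so an ASR engine could adapt its models per user over time."""

    def __init__(self) -> None:
        self._sensors: list[AdaptiveSensor] = []

    def register(self, sensor: AdaptiveSensor) -> None:
        self._sensors.append(sensor)

    def report(self, input_id: str, was_erroneous: bool) -> None:
        for sensor in self._sensors:
            sensor.on_input_judged(input_id, was_erroneous)
```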


Language of multimodal interaction: Taking a more general perspective, future work could explore and define a "language" of multimodal interaction, such that inputs are given a general semantics reflecting their purpose within communication. The fusion of user inputs would then be more about actually understanding the users' actions, similar to natural language understanding, instead of merely combining predefined, application-specific inputs. This would certainly require a more psychological and linguistic view on multimodal interaction.

Research on implicit information: Considering what information is available to a multimodal system, more research is necessary on implicit information accessible via gaze, gestures, non-verbal utterances, and facial expressions, to name only a few. Especially with regard to the capabilities of the input fusion presented in this work, such additional information could prove valuable in the combination and decision-making process.

6.3 Concluding Remarks

Multimodality in general, and multimodal input fusion in particular, should not be taken as a panacea for poor HCI. If not applied adequately and reasonably, multimodal inputs could do more harm than good when it comes to intuitiveness and robustness. Moreover, even a perfect input fusion system is of no use if the underlying sensors do not already provide a high level of maturity. As such, multimodal input fusion should be seen as one opportunity to add intuitiveness and naturalness to interactions once the potential of unimodal inputs has largely been exploited. Furthermore, it should never replace well-known and well-working unimodal interactions, but rather complement them. Of course, extensive observation of the user that exploits implicit information such as facial expressions, gestures, and even emotions raises significant privacy and security issues that must be considered early on. Acceptance can only be expected if users can be sure that the collected data is not misused.

Finally, multimodal input must not only be approached from the user's side; on the system side there is still a lot to be done as well. IDEs, SDKs, and UI libraries must provide the abstractions necessary to easily integrate multimodal fusion engines. Today, most of this underlying software still relies heavily on common unimodal inputs such as mouse, keyboard, and touch. Until multimodal systems become standard in software development and the general user benefits from multimodal interactions, they may first be introduced in expert systems, whose developers and users are willing to accept a certain amount of extra work and practice in order to gain an extra benefit. The presented approach for an adaptive multimodal input fusion is clearly an important contribution to the research field of multimodal interaction, but presumably not the last word on the matter.


Appendix

A.1 XML Schema Definitions

Listing A.1: XML Schema for the input fusion specification that extends the schema of GraphML to allow adding application-specific markup to nodes and edges.


Listing A.2: XML Schema for a sensor’s interaction input as observations to be processed by the input fusion.


Listing A.3: XML Schema for a fusion output that contains the overall result mode and the most probable fused event(s) of the fused interaction inputs.


Listing A.4: XML Schema for the interaction history that stores all single input events that occurred for any combined input in a user- and application-specific record.


A.2 Additional Figures

[Figure A.1; legend: TG = temporal gap, CD = center distance, OnD = onset distance, OffD = offset distance, (ord.) = ordered variant of a metric, i.e. with algebraic signs. Mean accuracies of the top ten combinations range from 87.95 % to 88.29 %.]

Figure A.1: Boxplots of the ten best scoring metric combinations regarding their mean accuracy. Of the five metrics tested, “total duration” did not make it into the top ten. White diamonds show the means, the circles the individual participants’ results. The underlying data stems from the ground truth of the smartwatch experiment used to evaluate the approach for adapting to individual interaction histories.


Acronyms

API Application Programming Interface
ASR Automatic Speech Recognition
ATN Augmented Transition Network

BBA Basic Belief Assignment
BNF Backus–Naur Form

CAD Computer-Aided Design
CARE Complementarity, Assignment, Redundancy, Equivalence
CASE Concurrent, Alternate, Synergistic, Exclusive
CC/PP Composite Capability/Preference Profiles
CCXML Call Control XML
CG Conceptual Graph
CMR Common Meaning Representation
COLD COactive Language Definition
CTS Cognitive Technical System
CUBRICON Calspan-UB Research center Intelligent CONversationalist

DM Dialog Management
DOM Document Object Model
DS Dempster-Shafer Theory
DTAC Dialog move, Type, Arguments, Contents

ECMA European Computer Manufacturers Association
EMMA Extensible MultiModal Annotation markup language

FAME Framework for Adaptive Multimodal Environments
FOD Frame of Discernment
FSM Finite State Machine

GUI Graphical User Interface

HAC Human Action Cycle
HCI Human-Computer Interaction
HDMI High-Definition Multimedia Interface
HMM Hidden Markov Model
HTML HyperText Markup Language
HTTP Hypertext Transfer Protocol

I/O Input/Output
ICARE Interaction-CARE
ICO Interactive Cooperative Objects
IDE Integrated Development Environment
IM Interaction Management

LCD Liquid Crystal Display

MDP Markov Decision Process
MIML Multimodal Interaction Markup Language
MIT Massachusetts Institute of Technology
MMI2 Man-Machine Interface for Multi-Modal Interaction with knowledge based systems
MMIF Multimodal Interaction Framework

NFA Nondeterministic Finite Automaton

PATE Production rule system based on Activation and Typed feature structure Elements
POMDP Partially Observable Markov Decision Process
PUMPP Polynomial Unification-based Multimodal Parsing Processor

RDF Resource Description Framework

SCXML State Chart XML
SDK Software Development Kit
SMIL Synchronized Multimedia Integration Language
SMUIML Synchronized Multimodal User Interaction Modeling Language
SNTP Simple Network Time Protocol

tATN temporal Augmented Transition Network
TBM Transferable Belief Model
TFS Typed Feature Structure(s)
TTS Text-To-Speech
TYCOON TYpes and goals of COOperatioN

UI User Interface
UIDL User Interface Description Language
URI Uniform Resource Identifier

VoiceXML Voice eXtensible Markup Language
VR Virtual Reality
VRML VR Modeling Language

W3C World Wide Web Consortium
WM Working Memory
WOZ Wizard of Oz
WPF Windows Presentation Foundation

X+V XHTML + Voice
X3D eXtensible 3D
XAML eXtensible Application Markup Language
XISL eXtensible Interaction Scenario Language
XML eXtensible Markup Language
XMMVR eXtensible language for MultiModal interaction with VR worlds
XSL eXtensible Stylesheet Language
XSLT XSL Transformation
XTRA eXpert TRAnslator

Bibliography

[1]A MERI EKHTIARABADI,A.,AKAN,B.,ÇÜRÜKLU, B., and ASPLUND, L. A General Framework for Incremental Processing of Multimodal Inputs. In Proceedings of the 13th International Conference on Multimodal Interfaces, ICMI ’11, ACM (New York, NY, USA, 2011), 225–228.

[2] ANDRÉ, E., FINKLER, W., GRAF, W., RIST, T., SCHAUDER, A., and WAHLSTER, W. WIP: the Automatic Synthesis of Multimodal Presentations. In Intelligent Multimedia Interfaces, American Association for Artificial Intelligence (Menlo Park, CA, USA, 1993), 75–93.

[3]A RENS, Y., and HOVY, E. The Design of a Model-Based Multimedia Interaction Manager. Artificial Intelligence Review 9, 2-3 (1995), 167–188.

[4]A TREY, P., HOSSAIN,M.A.,EL SADDIK, A., and KANKANHALLI, M. Multimodal Fusion for Multimedia Analysis: A Survey. Multimedia Systems 16 (2010), 345–379.

[5]B ASS,L.,FANEUF,R.,LITTLE,R.,MAYER,N.,PELLEGRINO,B.,REED,S.,SEACORD, R.,SHEPPARD, S., and SZCZUR, M. R. A Metamodel for the Runtime Architecture of an Interactive System: The UIMS Tool Developers Workshop. SIGCHI Bull. 24, 1 (Jan. 1992), 32–37.

[6]B ELLIK, Y. Media Integration in Multimodal Interfaces. In IEEE First Workshop on Multi- media Signal Processing (1997), 31–36.

[7] BELLIK, Y., and BURGER, D. Multimodal Interfaces: New Solutions to the Problem of Computer Accessibility for the Blind. In Conference Companion on Human Factors in Computing Systems, CHI ’94, ACM (New York, NY, USA, 1994), 267–268.

[8]B ELLIK, Y., REBAÏ,I.,MACHROUH,E.,BARZAJ, Y., JACQUET,C.,PRUVOST, G., and SANSONNET, J.-P. Multimodal Interaction within Ambient Environments: An Exploratory Study. In Proceedings of the 12th IFIP TC 13 International Conference on Human-Computer Interaction: Part II, INTERACT ’09, Springer-Verlag (Berlin, Heidelberg, 2009), 89–92.


[9]B ELLMAN, R. A Markovian Decision Process. Tech. Rep. P-1066, RAND Corp Santa Monica Ca, 1957.

[10]B ERCHER, P., RICHTER, F., HÖRNLE, T., GEIER, T., HÖLLER,D.,BEHNKE,G.,NOTH- DURFT, F., HONOLD, F., MINKER, W., WEBER, M., and BIUNDO, S. A Planning-Based Assistance System for Setting Up a Home Theater. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, AAAI Press (2015), 4264–4265.

[11]B ERNSEN, N. O. A Reference Model for Output Information in Intelligent Multimedia Pre- sentation Systems, 1996.

[12]B ERNSEN, N. O. Multimodality Theory. In Multimodal User Interfaces, D. Tzovaras, Ed., Signals and Communication Technology. Springer-Verlag Berlin / Heidelberg, 2008, ch. 2, 5–29.

[13]B ERTRAND,G.,NOTHDURFT, F., and MINKER, W. "What Do You Want to Do Next?" Providing the User with More Freedom in Adaptive Spoken Dialogue Systems. In 8th Inter- national Conference on Intelligent Environments (IE), IEEE (2012), 290–296.

[14]B IDOT,J.,SCHATTENBERG, B., and BIUNDO, S. Plan Repair in Hybrid Planning. In Proceedings of the 31st Annual German Conference on Advances in Artificial Intelligence, KI ’08, Springer-Verlag (Berlin, Heidelberg, 2008), 169–176.

[15]B ISHOP,C.M. Pattern Recognition and Machine Learning (Information Science and Statis- tics). Springer-Verlag New York, Secaucus, NJ, USA, 2006.

[16]B IUNDO,S.,BERCHER, P., GEIER, T., MÜLLER, F., and SCHATTENBERG, B. Advanced User Assistance Based on AI Planning. Cognitive Systems Research 12, 3-4 (April 2011), 219–236.

[17]B IUNDO, S., and SCHATTENBERG, B. From Abstract Crisis to Concrete Relief—a Prelim- inary Report on Combining State Abstraction and HTN Planning. In Proceedings of the 6th European Conference on Planning (ECP), AAAI Press (2014), 157–168.

[18]B IUNDO, S., and WENDEMUTH,A. Companion Technology – A Paradigm Shift in Human- Technology Interaction. Cognitive Technologies. Springer, 2017.

[19]B ÖCK,R.,SIEGERT,I.,HAASE,M.,LANGE, J., and WENDEMUTH, A. ikannotate - a Tool for Labelling, Transcription, and Annotation of Emotionally Coloured Speech. In Proceed- ings of the 4th International Conference on Affective Computing and Intelligent Interaction - Volume Part I, ACII’11, Springer-Verlag (Berlin, Heidelberg, 2011), 25–34.


[20]B OLT, R. A. “Put-that-there”: Voice and Gesture at the Graphics Interface. In SIGGRAPH ’80: Proceedings of the 7th Annual Conference on Computer Graphics and Interactive Tech- niques, ACM (New York, NY, USA, 1980), 262–270.

[21]B ORDEGONI,M.,FACONTI,G.,FEINER,S.,MAYBURY, M. T., RIST, T., RUGGIERI,S., TRAHANIAS, P., and WILSON, M. A Standard Reference Model for Intelligent Multimedia Presentation Systems. Computer Standards & Interfaces 18, 6-7 (1997), 477–496.

[22]B OUCHET,J.,NIGAY, L., and GANILLE, T. ICARE Software Components for Rapidly Developing Multimodal Interfaces. In ICMI ’04: Proceedings of the 6th International Con- ference on Multimodal Interfaces, ACM (New York, NY, USA, 2004), 251–258.

[23]B OURGUET, M.-L. A Toolkit for Creating and Testing Multimodal Interface Designs. In Posters and Demos from the 15th Annual ACM Symposium on User Interface Software and Technology (October 2002), 29–30.

[24]B RANDES,U.,EIGLSPERGER,M.,HERMAN,I.,HIMSOLT, M., and MARSHALL,M. GraphML Progress Report: Structural Layer Proposal. In Graph Drawing, P. Mutzel, M. Jünger, and S. Leipert, Eds., vol. 2265 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2002, 501–512.

[25]B RUSONI,S.,MARENGO,L.,PRENCIPE, A., and VALENTE, M. The Value and Costs of Modularity: A Cognitive Perspective. SPRU Working Paper Series 123, SPRU - Science and Technology Policy Research, University of Sussex, 2004.

[26]C ARPENTER,B. The Logic of Typed Feature Structures. Cambridge University Press, New York, NY, USA, 1992.

[27]C ASCHERA,M.C.,D’ULIZIA,A.,FERRI, F., and GRIFONI, P. On the Move to Meaning- ful Internet Systems: OTM 2015 Workshops: Confederated International Workshops: OTM Academy, OTM Industry Case Studies Program, EI2N, FBM, INBAST, ISDE, META4eS, and MSC 2015, Rhodes, Greece, October 26-30, 2015. Proceedings. Springer International Pub- lishing, Cham, 2015, ch. Multimodal Systems: An Excursus of the Main Research Questions, 546–558.

[28]C HEN, Q., and AICKELIN, U. Anomaly Detection Using the Dempster-Shafer Method. In Proceedings of the 2006 International Conference on Data Mining, DMIN (Las Vegas, Nevada, USA, 2006), 232–240.


[29]C OHEN, P., MCGEE,D.,OVIATT,S.,WU,L.,CLOW,J.,KING,R.,JULIER, S., and ROSENBLUM, L. Multimodal Interaction for 2D and 3D Environments. IEEE Computer Graphics and Applications 19 (July 1999), 10–13.

[30]C OHEN, P. R., JOHNSTON,M.,MCGEE,D.,OVIATT,S.,PITTMAN,J.,SMITH,I.,CHEN, L., and CLOW, J. QuickSet: Multimodal Interaction for Distributed Applications. In Pro- ceedings of the fifth ACM International Conference on Multimedia, MULTIMEDIA ’97, ACM (New York, NY, USA, 1997), 31–40.

[31]C OSTA, P. T., and MCCRAE,R.R. Professional Manual: Revised NEO Personality In- ventory (NEO-PI-R) and NEO Five-Factor Inventory (NEO-FFI). Psychological Assessment Resources, 1992.

[32]C OUTAZ,J.,NIGAY,L.,SALBER,D.,BLANDFORD,A.,MAY, J., and YOUNG, R. M. Four Easy Pieces for Assessing the Usability of Multimodal Interaction: The CARE Properties. In Proceedings of INTERACT95 (Lillehammer, June 1995), 115–120.

[33]C UTUGNO, F., LEANO, V. A., RINALDI, R., and MIGNINI, G. Multimodal Framework for Mobile Interaction. In Proceedings of the International Working Conference on Advanced Visual Interfaces, AVI ’12, ACM (New York, NY, USA, 2012), 197–203.

[34]D AHM,M. Grundlagen der Mensch-Computer-Interaktion. Pearson Studium, München, 2005.

[35]D E ANGELI,A.,GERBINO, W., CASSANO, G., and PETRELLI, D. Visual Display, Pointing, and Natural Language: The Power of Multimodal Interaction. In Proceedings of the working Conference on Advanced Visual Interfaces, AVI ’98, ACM (New York, NY, USA, 1998), 164–173.

[36]D EMPSTER, A. P. Upper and Lower Probabilities Induced by a Multivalued Mapping. The Annals of Mathematical Statistics 38, 2 (1967), 325–339.

[37]D EY, P., MADHVANATH,S.,RANJAN, A., and DAS, S. An Exploration of Gesture-Speech Multimodal Patterns for Touch Interfaces. In Proceedings of the 3rd International Conference on Human Computer Interaction, IndiaHCI ’11, ACM (New York, NY, USA, 2011), 79–83.

[38]D IX,A.,FINLAY,J.,ABOWD, G., and BEALE,R. Human-Computer Interaction, 2 ed. British Library Cataloguing in Publication Data, 2006.


[39]D UARTE, C., and CARRIÇO, L. A Conceptual Framework for Developing Adaptive Multi- modal Applications. In IUI ’06: Proceedings of the 11th International Conference on Intelli- gent User Interfaces, ACM (New York, NY, USA, 2006), 132–139.

[40]D’U LIZIA,A.,FERRI, F., and GRIFONI, P. On the Move to Meaningful Internet Sys- tems: OTM 2008 Workshops: OTM Confederated International Workshops and Posters, ADI, AWeSoMe, COMBEK, EI2N, IWSSA, MONET, OnToContent + QSI, ORM, PerSys, RDDS, SEMELS, and SWWS 2008, Monterrey, Mexico, November 9-14, 2008. Proceedings. Springer, Berlin, Heidelberg, 2008, ch. Toward the Development of an Integrative Framework for Mul- timodal Dialogue Processing, 509–518.

[41]D UMAS,B.,INGOLD, R., and LALANNE, D. Benchmarking Fusion Engines of Multimodal Interactive Systems. In ICMI-MLMI ’09: Proceedings of the 2009 International Conference on Multimodal Interfaces, ACM (New York, NY, USA, 2009), 169–176.

[42]D UMAS,B.,LALANNE,D.,GUINARD,D.,KOENIG, R., and INGOLD, R. Strengths and Weaknesses of Software Architectures for the Rapid Creation of Tangible and Multimodal Interfaces. In TEI ’08: Proceedings of the 2nd International Conference on Tangible and Embedded Interaction, ACM (New York, NY, USA, 2008), 47–54.

[43]D UMAS,B.,LALANNE, D., and INGOLD, R. Prototyping Multimodal Interfaces with the SMUIML Modeling Language. In CHI 2008 Workshop on User Interface Description Lan- guages for Next Generation User Interfaces, CHI 2008 (Frienze, Italy, April 2008), 63–66.

[44]D UMAS,B.,LALANNE, D., and INGOLD, R. Description Languages for Multimodal Inter- action: A set of Guidelines and its Illustration with SMUIML. Journal on Multimodal User Interfaces 3 (2010), 237–247.

[45]D UMAS,B.,LALANNE, D., and OVIATT, S. Multimodal Interfaces: A Survey of Principles, Models and Frameworks. In Human Machine Interaction – Research Results of the MMI Pro- gram, D. Lalanne and J. Kohlas, Eds., vol. 5440/2009 of Lecture Notes in Computer Science. Springer-Verlag, Berlin, Heidelberg, March 2009, ch. 1, 3–26.

[46]D UMAS,B.,SIGNER, B., and LALANNE, D. Fusion in Multimodal Interactive Systems: An HMM-Based Algorithm for User-induced Adaptation. In Proceedings of the 4th ACM SIGCHI Symposium on Engineering Interactive Computing Systems, EICS ’12, ACM (New York, NY, USA, 2012), 15–24.


[47]F LIPPO, F., KREBS, A., and MARSIC, I. A Framework for Rapid Development of Multi- modal Interfaces. In Proceedings of the 5th International Conference on Multimodal Inter- faces, ICMI ’03, ACM (New York, NY, USA, 2003), 109–116.

[48]F OSTER, M. E. State of the art review: Multimodal Fission (Deliverable 6.1). Public Deliv- erable 6.1, University of Edinburgh, September 2002.

[49]G ERHARD, U. Borkenau, P. & Ostendorf, F. (1993). NEO-Fünf-Faktoren Inventar (NEO- FFI) nach Costa und McCrae. Zeitschrift für Klinische Psychologie und Psychotherapie 28, 2 (1999), 145–146.

[50]G LODEK,M.,REUTER,S.,SCHELS,M.,DIETMAYER, K., and SCHWENKER, F. Kalman Filter Based Classifier Fusion for Affective State Recognition. In 11th International Workshop on Multiple Classifier Systems (MCS2013), vol. 7872 of Lecture Notes in Computer Science, Springer Berlin / Heidelberg (2013), 85–94.

[51]G RAM, C., and COCKTON,G. Design principles for Interactive Software. Chapman & Hall, London, 1996.

[52]G RIOL,D.,GARCÍA-HERRERO, J., and MOLINA, J. M. A Novel Approach for Data Fusion and Dialog Management in User-Adapted Multimodal Dialog Systems. In Information Fusion (FUSION), 2014 17th International Conference on (July 2014), 1–7.

[53]H AAS,E.C.,PILLALAMARRI,K.S.,STACHOWIAK, C. C., and MCCULLOUGH, G. Tem- poral Binding of Multimodal Controls for Dynamic map Displays: A Systems Approach. In Proceedings of the 13th International Conference on Multimodal Interfaces, ICMI ’11, ACM (New York, NY, USA, 2011), 409–416.

[54]H ÖLLER,D.,BERCHER, P., RICHTER, F., SCHILLER,M.,GEIER, T., and BIUNDO,S. Finding User-Friendly Linearizations of Partially Ordered Plans. In Proceedings of the 28th PuK Workshop “Planen, Scheduling und Konfigurieren, Entwerfen” (PuK 2014) (2014).

[55]H OLZAPFEL,H.,NICKEL, K., and STIEFELHAGEN, R. Implementation and Evaluation of a Constraint-Based Multimodal Fusion System for Speech and 3D Pointing Gestures. In Proceedings of the 6th International Conference on Multimodal Interfaces, ICMI ’04, ACM (New York, NY, USA, 2004), 175–182.

[56]H ONOLD, F. Interaktionsmanagement und Modalitätsarbitrierung für adaptive und multi- modale Mensch-Computer Interaktion. PhD thesis, Ulm University, 2017.


[57]H ONOLD, F., SCHÜSSEL, F., PANAYOTOVA, K., and WEBER, M. The Nonverbal Toolkit: Towards a Framework for Automatic Integration of Nonverbal Communication into Virtual Environments. In Proceedings of the 8th International Conference on Intelligent Environ- ments - IE’12 (June 2012), 243–250.

[58]H ONOLD, F., SCHÜSSEL, F., and WEBER, M. Adaptive Probabilistic Fission for Multimodal Systems. In Proceedings of the 24th Australian Computer-Human Interaction Conference, OzCHI’12, ACM, New York (Melbourne, Australia, November, 26–30 2012), 222–231.

[59]H ONOLD, F., SCHÜSSEL, F., and WEBER, M. The Automated Interplay of Multimodal Fis- sion and Fusion in Adaptive HCI. In IE’14: Proceedings of the 10th International Conference on Intelligent Environments, IEEE Computer Society (Shanghai, China, July 2014), 170–177.

[60]H ONOLD, F., SCHUSSEL, F., WEBER,M.,NOTHDURFT, F., BERTRAND, G., and MINKER, W. Context Models for Adaptive Dialogs and Multimodal Interaction. In Proceedings of the 9th International Conference on Intelligent Environments (IE) (July 2013), 57–64.

[61]H ÖRNLE, T., TORNOW,M.,HONOLD, F., SCHWEGLER,R.,HEINEMANN,R.,BIUNDO, S., and WENDEMUTH, A. Companion-Systems: A Reference Architecture. In Biundo and Wendemuth [18], 450.

[62]H OSTE,L.,DUMAS, B., and SIGNER, B. Mudra: A Unified Multimodal Interaction Frame- work. In Proceedings of the 13th International Conference on Multimodal Interfaces, ICMI ’11, ACM (New York, NY, USA, 2011), 97–104.

[63]H UANG, X., and OVIATT, S. L. Toward Adaptive Information Fusion in Multimodal Sys- tems. In Machine Learning for Multimodal Interaction, Second International Workshop, MLMI 2005, Edinburgh, UK, July 11-13, 2005, Revised Selected Papers, S. Renals and S. Bengio, Eds., vol. 3869/2006 of Lecture Notes in Computer Science, Springer (2005), 15–27.

[64]H UANG,X.,OVIATT, S. L., and LUNSFORD, R. Combining User Modeling and Machine Learning to Predict Users’ Multimodal Integration Patterns. In Machine Learning for Multi- modal Interaction, Third International Workshop, MLMI 2006, Bethesda, MD, USA, May 1-4, 2006, Revised Selected Papers, S. Renals, S. Bengio, and J. G. Fiscus, Eds., vol. 4299/2006 of Lecture Notes in Computer Science, Springer (2006), 50–62.

[65] ISO/IEC/IEEE. Systems and Software Engineering – Vocabulary. ISO/IEC/IEEE 24765:2010(E) (Dec 2010), 1–418.


[66]J OHNSTON, M., and BANGALORE, S. Finite-State Multimodal Integration and Understand- ing. Natural Language Engineering 11 (5 2005), 159–187.

[67]J OHNSTON,M.,COHEN, P. R., MCGEE,D.,OVIATT,S.L.,PITTMAN, J. A., and SMITH, I. Unification-Based Multimodal Integration. In Proceedings of the Eighth Conference on European Chapter of the Association for Computational Linguistics, EACL ’97, Association for Computational Linguistics (Stroudsburg, PA, USA, 1997), 281–288.

[68]J ÖST,M.,HÄUSSLER,J.,MERDES, M., and MALAKA, R. Multimodal Interaction for Pedestrians: An Evaluation Study. In Proceedings of the 10th International Conference on Intelligent User Interfaces, IUI ’05, ACM (New York, NY, USA, 2005), 59–66.

[69]K ÄCHELE,M.,GLODEK,M.,ZHARKOV,D.,MEUDT, S., and SCHWENKER, F. Fusion of Audio-Visual Features Using Hierarchical Classifier Systems for the Recognition of Affective States and the State of Depression. In Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods, ICPRAM 2014, SCITEPRESS - Science and Technology Publications, Lda (Portugal, 2014), 671–678.

[70]K AELBLING, L. P., LITTMAN, M. L., and CASSANDRA, A. R. Planning and Acting in Partially Observable Stochastic Domains. Artificial intelligence 101, 1 (1998), 99–134.

[71]K ARRER,K.,GLASER,C.,CLEMENS, C., and BRUDER, C. Technikaffinität erfassen – der Fragebogen TA-EG. Der Mensch im Mittelpunkt technischer Systeme 8. Berliner Werkstatt Mensch-Maschine-Systeme 7. bis 9. Oktober 2009 (ZMMS Spektrum) 22, 29 (2009), 196–201.

[72]K ATSURADA,K.,NAKAMURA, Y., YAMADA, H., and NITTA, T. XISL: A Language for De- scribing Multimodal Interaction Scenarios. In ICMI ’03: Proceedings of the 5th International Conference on Multimodal Interfaces, ACM (New York, NY, USA, 2003), 281–284.

[73]K IPP, M. ANVIL - A Generic Annotation Tool for Multimodal Dialogue. In Proceedings of EUROSPEECH 2001 (Aalborg, Denmark, 2001), 1367–1370.

[74]K OONS,D.B.,SPARRELL, C. J., and THORISSON, K. R. Integrating Simultaneous Input from Speech, gaze, and Hand Gestures. In Intelligent Multimedia Interfaces, M. T. Maybury, Ed. American Association for Artificial Intelligence, Menlo Park, CA, USA, 1993, ch. 11, 257–276.

[75]K ÖRNER,A.,DRAPEAU,M.,ALBANI,C.,GEYER,M.,SCHMUTZER, G., and BRÄH- LER, E. Deutsche Normierung des NEO-Fünf-Faktoren-Inventars (NEO-FFI). Zeitschrift für Medizinische Psychologie 17, 2-3 (2008), 133–144.


[76]K RAHNSTOEVER,N.,KETTEBEKOV,S.,YEASIN, M., and SHARMA, R. A Real-Time Framework for Natural Multimodal Interaction with Large Screen Displays. In Proceedings of the 4th IEEE International Conference on Multimodal Interfaces, ICMI ’02, IEEE Computer Society (Washington, DC, USA, 2002), 349–354.

[77]K RELL,G.,GLODEK,M.,PANNING,A.,SIEGERT,I.,MICHAELIS,B.,WENDEMUTH, A., and SCHWENKER, F. Fusion of Fragmentary Classifier Decisions for Affective State Recognition. In Proceedings of the First International Conference on Multimodal Pattern Recognition of Social Signals in Human-Computer-Interaction, MPRSS’12, Springer-Verlag (Berlin, Heidelberg, 2013), 116–130.

[78]L ALANNE,D.,NIGAY,L.,PALANQUE, P., ROBINSON, P., VANDERDONCKT, J., and LADRY, J.-F. Fusion Engines for Multimodal Input: A Survey. In Proceedings of the 2009 International Conference on Multimodal Interfaces, ICMI-MLMI ’09, ACM (New York, NY, USA, 2009), 153–160.

[79]L ATOSCHIK, M. E. Designing Transition Networks for Multimodal VR-Interactions Using a Markup Language. In Proceedings of the 4th IEEE International Conference on Multimodal Interfaces, ICMI ’02, IEEE Computer Society (Washington, DC, USA, 2002), 411–416.

[80]L AVIOLA JR.,J.J.,BUCHANAN, S., and PITTMAN,C. Multimodal Input for Perceptual User Interfaces. John Wiley & Sons, Ltd, 2014, ch. 9, 285–312.

[81]L AYHER,G.,GIESE, M. A., and NEUMANN, H. Learning Representations of Animated Motion Sequences: A Neural Model. Topics in Cognitive Science 6, 1 (2014), 170–182.

[82]L EE, J.-H., and SPENCE, C. Assessing the Benefits of Multimodal Feedback on Dual-Task Performance Under Demanding Conditions. In BCS-HCI ’08: Proceedings of the 22nd British HCI Group Annual Conference on HCI 2008, British Computer Society (Swinton, UK, UK, 2008), 185–192.

[83]L EE, M., and BILLINGHURST, M. A Wizard of Oz Study for an AR Multimodal Interface. In Proceedings of the 10th International Conference on Multimodal Interfaces, ICMI ’08, ACM (New York, NY, USA, 2008), 249–256.

[84]L EITE,I.,MARTINHO, C., and PAIVA, A. Social Robots for Long-Term Interaction: A Survey. International Journal of Social Robotics 5, 2 (2013), 291–308.

[85]L EWIS, J. R. Psychometric Evaluation of the PSSUQ Using Data from Five Years of Usabil- ity Studies. International Journal of Human-Computer Interaction 14, 3-4 (2002), 463–488.


[86]M ALINOWSKI,U.,KÜHME, T., DIETERICH, H., and SCHNEIDER-HUFSCHMIDT,M.A Taxonomy of Adaptive User Interfaces. In Proceedings of the Conference on People and Computers VII, HCI’92, Cambridge University Press (New York, NY, USA, 1992), 391–414.

[87]M ANSOUX,B.,NIGAY, L., and TROCCAZ, J. Output Multimodal Interaction: The Case of Augmented Surgery. In People and Computers XX – Engage, N. Bryan-Kinns, A. Blanford, P. Curzon, and L. Nigay, Eds., no. 5 in BCS Conference Series, Springer London and ACM Press (2006), 177–192.

[88]M ARTIN,J.,VELDMAN, R., and BÉROULE, D. Developing Multimodal Interfaces: A The- oretical Framework and Guided Propagation Networks. In Multimodal Human-Computer Communication, H. Bunt, R.-J. Beun, and T. Borghuis, Eds., vol. 1374 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 1998, 158–187.

[89]M ARTIN, J.-C. TYCOON: Theoretical Framework and Software Tools for Multimodal In- terfaces. In John Lee (Ed.), Intelligence and Multimodality in Multimedia Interfaces, AAAI Press (1998).

[90]M AYBURY, M. T. Research in Multimedia and Multimodal Parsing and Generation. Artificial Intelligence Review 9, 2-3 (1995), 103–127.

[91]M EHLMANN,G.,ENDRASS, B., and ANDRÉ, E. Modeling Parallel State Charts for Mul- tithreaded Multimodal Dialogues. In Proceedings of the 13th International Conference on Multimodal Interfaces, ICMI ’11, ACM (New York, NY, USA, 2011), 385–392.

[92]M EHLMANN, G. U., and ANDRÉ, E. Modeling Multimodal Integration with Event Logic Charts. In Proceedings of the 14th ACM International Conference on Multimodal Interaction, ICMI ’12, ACM (New York, NY, USA, 2012), 125–132.

[93]M ELICHAR, M., and CENEK, P. From Vocal to Multimodal Dialogue Management. In Proceedings of the 8th International Conference on Multimodal Interfaces, ACM (2006), 59– 67.

[94]M ENDONÇA, H. High Level Data Fusion on a Multimodal Interactive Application Platform. In EICS ’09: Proceedings of the 1st ACM SIGCHI Symposium on Engineering Interactive Computing Systems, ACM (New York, NY, USA, 2009), 333–336.

[95]M ENDONÇA,H.,LAWSON, J.-Y. L., VYBORNOVA,O.,MACQ, B., and VANDERDONCKT, J. A Fusion Framework for Multimodal Interactive Applications. In Proceedings of the 2009 International Conference on Multimodal Interfaces, ICMI-MLMI ’09, ACM (New York, NY, USA, 2009), 161–168.


[96]M EUDT,S.,BIGALKE, L., and SCHWENKER, F. ATLAS - An Annotation Tool for HCI Data Utilizing Machine Learning Methods. In Proceedings of the 1st International Conference on Affective and Pleasurable Design (APD’12), CRC Press (2012), 5347–5352.

[97]M EUDT,S.,SCHMIDT-WACK,M.,HONOLD, F., SCHÜSSEL, F., WEBER,M., SCHWENKER, F., and PALM,G. Going Further in Affective Computing: How Emotion Recognition Can Improve Adaptive User Interaction, vol. 1 of Intelligent Systems Reference Library (ISRL). Springer, March 2016, 29.

[98]M IGNOT,C.,VALOT, C., and CARBONELL, N. An Experimental Study of Future "Natu- ral" Multimodal Human-Computer Interaction. In INTERACT ’93 and CHI ’93 Conference Companion on Human Factors in Computing Systems, CHI ’93, ACM (New York, NY, USA, 1993), 67–68.

[99] MILLS, D. Simple Network Time Protocol (SNTP) Version 4 for IPv4, IPv6 and OSI, https://tools.ietf.org/html/rfc4330. Available online, February 11, 2017.

[100]M ILOTA, A. D. Modality Fusion for Graphic Design Applications. In Proceedings of the 6th International Conference on Multimodal Interfaces, ICMI ’04, ACM (New York, NY, USA, 2004), 167–174.

[101]N AVARRE,D.,PALANQUE, P., BASTIDE,R.,SCHYN,A.,WINCKLER,M.,NEDEL, L. P., and FREITAS, C. M. A Formal Description of Multimodal Interaction Techniques for Im- mersive Virtual Reality Applications. In Human-Computer Interaction-INTERACT 2005. Springer, 2005, 170–183.

[102]N AVARRE,D.,PALANQUE, P., LADRY, J.-F., and BARBONI, E. ICOs: A Model-Based User Interface Description Technique Dedicated to Interactive Systems Addressing Usability, Reliability and Scalability. ACM Transactions on Computer-Human Interaction 16, 4 (Nov. 2009), 18:1–18:56.

[103]N EAL,J.G.,THIELMAN, C. Y., DOBES,Z.,HALLER, S. M., and SHAPIRO, S. C. Natural Language with Integrated Deictic and Graphic Gestures. In Proceedings of the workshop on Speech and Natural Language, HLT ’89, Association for Computational Linguistics (Strouds- burg, PA, USA, 1989), 410–423.

[104]N IESE,R.,AL-HAMADI,A.,HEUER,M.,MICHAELIS, B., and MATUSZEWSKI, B. Ma- chine Vision Based Recognition of Emotions Using the Circumplex Model of Affect. In International Conference on Multimedia Technology (ICMT), IEEE (2011), 6424–6427.


[105]N IGAY, L., and COUTAZ, J. A Design Space for Multimodal Systems: Concurrent Processing and Data Fusion. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (1993), 172–178.

[106]N IGAY, L., and COUTAZ, J. A Generic Platform for Addressing the Multimodal Challenge. In CHI ’95: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ACM Press (1995), 98–105.

[107]N ORMAN,D.A. The Design of Everyday Things. iBasic Books, Inc., New York, NY, USA, 2002.

[108]N OTHDURFT, F., and MINKER, W. Justification and Transparency Explanations in Dialogue Systems to Maintain Human-Computer Trust. In Proceedings of the 4th International Work- shop on Spoken Dialogue Systems (IWSDS), Springer (2014).

[109]N OTHDURFT, F., RICHTER, F., and MINKER, W. Probabilistic Human-Computer Trust Handling. In Proceedings of the Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL) (2014), 51–59.

[110]O LMEDO,H.,ESCUDERO, D., and CARDEÑOSO, V. Multimodal Interaction with Vir- tual Worlds XMMVR: eXtensible Language for Multimodal Interaction with Virtual Reality Worlds. Journal on Multimodal User Interfaces 9, 3 (2015), 153–172.

[111]O VIATT, S. Multimodal Interactive Maps: Designing for Human Performance. Human- Computer Interaction 12, 1 (Mar. 1997), 93–129.

[112]O VIATT, S. Mutual Disambiguation of Recognition Errors in a Multimodal Architecture. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’99, ACM (New York, NY, USA, 1999), 576–583.

[113]O VIATT, S. Ten Myths of Multimodal Interaction. Communications of the ACM 42, 11 (1999), 74–81.

[114]O VIATT,S. Multimodal Interfaces, 2 ed. CRC Press, 2007, ch. 21, 413–432.

[115]O VIATT,S.,COHEN, P., WU,L.,VERGO,J.,DUNCAN,L.,SUHM,B.,BERS,J.,HOLZ- MAN, T., WINOGRAD, T., LANDAY,J.,LARSON, J., and FERRO, D. Designing the User Interface for Multimodal Speech and Pen-Based Gesture Applications: State-of-the-art Sys- tems and Future Research Directions. Human-Computer Interaction 15, 4 (2000), 263–322.

[116] OVIATT, S., COULSTON, R., and LUNSFORD, R. When do we Interact Multimodally?: Cognitive Load and Multimodal Communication Patterns. In ICMI ’04: Proceedings of the 6th International Conference on Multimodal Interfaces, ACM (New York, NY, USA, 2004), 129–136.

[117]O VIATT,S.,COULSTON,R.,TOMKO,S.,XIAO,B.,LUNSFORD,R.,WESSON, M., and CARMICHAEL, L. Toward a Theory of Organized Multimodal Integration Patterns During Human-Computer Interaction. In ICMI ’03: Proceedings of the 5th International Conference on Multimodal Interfaces, ACM (New York, NY, USA, 2003), 44–51.

[118]O VIATT,S.,DEANGELI, A., and KUHN, K. Integration and Synchronization of Input Modes During Multimodal Human-Computer Interaction. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’97, ACM (New York, NY, USA, 1997), 415– 422.

[119]O VIATT,S.,LUNSFORD, R., and COULSTON, R. Individual Differences in Multimodal Integration Patterns: What are They and Why do They Exist? In CHI ’05: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ACM (New York, NY, USA, 2005), 241–249.

[120]P ALANQUE, P. A., and SCHYN, A. A Model-Based Approach for Engineering Multimodal Interactive Systems. In INTERACT, vol. 3 (2003), 543–550.

[121]P ALM, G., and GLODEK,M. Towards Emotion Recognition in Human Computer Interaction. Springer, Berlin, Heidelberg, 2013, 323–336.

[122]P ANAYOTOVA, K. A Framework for Integration of Real World Nonverbal Communication Cues into Virtual Environments. Master’s thesis, University of Ulm, June 2009.

[123]P FLEGER, N. Context Based Multimodal Fusion. In ICMI ’04: Proceedings of the 6th International Conference on Multimodal Interfaces, ACM (New York, NY, USA, 2004), 265– 272.

[124]P ORTILLO, P. M., GARCÍA, G. P., and CARREDANO, G. A. Multimodal Fusion: A new Hybrid Strategy for Dialogue Systems. In Proceedings of the 8th International Conference on Multimodal Interfaces, ICMI ’06, ACM (New York, NY, USA, 2006), 357–363.

[125]R ATZKA, A. Explorative Studies on Multimodal Interaction in a PDA- and Desktop-Based Scenario. In Proceedings of the 10th International Conference on Multimodal Interfaces, ICMI ’08, ACM (New York, NY, USA, 2008), 121–128.

[126]R EDDY, B. S. Evidential Reasoning for Multimodal Fusion in Human Computer Interaction. Master thesis, University of Waterloo, Waterloo, Ontario, Canada, 2007.


[127]R EDDY, B. S., and BASIR, O. A. Concept-Based Evidential Reasoning for Multimodal Fusion in Human-Computer Interaction. Applied Soft Computing 10, 2 (Mar. 2010), 567– 577.

[128]R EIS, T., DE SÁ, M., and CARRIÇO, L. Multimodal Interaction: Real Context Studies on Mobile Digital Artefacts. In Haptic and Audio Interaction Design, A. Pirhonen and S. Brew- ster, Eds., vol. 5270 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2008, 60–69.

[129]R EN,X.,ZHANG, G., and DAI, G. An Experimental Study of Input Modes for Multimodal Human-Computer Interaction. In Advances in Multimodal Interfaces – ICMI 2000, T. Tan, Y. Shi, and W. Gao, Eds., vol. 1948 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2000, 49–56.

[130]R EUTER,S. Multi-Object Tracking Using Random Finite Sets. PhD thesis, Universität Ulm, 2014.

[131]R OSCHER,D.,BLUMENDORF, M., and ALBAYRAK, S. A Meta User Interface to Control Multimodal Interaction in Smart Environments. In Proceedings of the 14th International Conference on Intelligent User Interfaces, IUI ’09, ACM (New York, NY, USA, 2009), 481– 482.

[132]R OUSSEAU,C.,BELLIK, Y., VERNIER, F., and BAZALGETTE, D. A Framework for the Intelligent Multimodal Presentation of Information. Signal Processing 86, 12 (2006), 3696– 3713.

[133]R USS,G.,SALLANS, B., and HARETER, H. Semantic Based Information Fusion in a Mul- timodal Interface. In International Conference on Human-Computer Interaction, HCI2005, CSREA Press (2005), 94–102.

[134]S CHMIDT, A. Implicit Human Computer Interaction Through Context. Personal and Ubiq- uitous Computing 4 (2000), 191–199.

[135]S CHMIDT, R. F., and LANG, F. Physiologie des Menschen, vol. 30. Springer Medizin Verlag Heidelberg, 1994.

[136]S CHNELLE-WALKA,D.,DUARTE, C., and RADOMSKI, S. Multimodal Fusion and Fission within the W3C MMI Architectural Pattern. In Multimodal Interaction with W3C Standards: Toward Natural User Interfaces to Everything, D. A. Dahl, Ed. Springer International Pub- lishing, Cham, 2017, 393–415.


[137]S CHRÖDER, M. The SEMAINE API: Towards a Standards-Based Framework for Building Emotion-Oriented Systems. Advances in Human-Computer Interaction 2010 (January 2010), 21.

[138]S CHÜSSEL, F., HONOLD, F., BUBALO,N.,HUCKAUF, A., and WEBER, M. Getting rid of “OK Google”: Individual Multimodal Input Adaption in Real World Applications. In Pro- ceedings of the International Symposium on Companion Technology (ISCT 2015) (September 2015), 1–6.

[139]S CHÜSSEL, F., HONOLD, F., BUBALO, N., and WEBER, M. Increasing Robustness of Multimodal Interaction via Individual Interaction Histories. In Proceedings of the Workshop on Multimodal Analyses Enabling Artificial Agents in Human-Machine Interaction, MA3HMI ’16, ACM (New York, NY, USA, 2016), 56–63.

[140]S CHÜSSEL, F., HONOLD, F., SCHMIDT,M.,BUBALO,N.,HUCKAUF, A., and WEBER,M. Multimodal Interaction History and its use in Error Detection and Recovery. In Proceedings of the 16th ACM International Conference on Multimodal Interaction, ICMI ’14, ACM (New York, NY, USA, 2014), 164–171.

[141]S CHÜSSEL, F., HONOLD, F., and WEBER, M. Influencing Factors on Multimodal Interaction at Selection Tasks. Technical Report UIB-2012-03, University of Ulm, Faculty of Engineering and Computer Science, July 2012.

[142]S CHÜSSEL, F., HONOLD, F., and WEBER, M. Influencing Factors on Multimodal Interaction During Selection Tasks. Journal on Multimodal User Interfaces 7, 4 (2012), 299–310.

[143]S CHÜSSEL, F., HONOLD, F., and WEBER, M. Adaptive Multimodal HCI with Uncertain Data by Collaborative Fission and Fusion. In HCI International 2013 - Posters Extended Abstracts, C. Stephanidis, Ed., vol. 373 of Communications in Computer and Information Science. Springer Berlin / Heidelberg, 2013, 372–375.

[144]S CHÜSSEL, F., HONOLD, F., and WEBER, M. Using the Transferable Belief Model for Multimodal Input Fusion in Companion Systems. In Multimodal Pattern Recognition of So- cial Signals in Human-Computer-Interaction, F. Schwenker, S. Scherer, and L.-P. Morency, Eds., vol. 7742 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2013, 100–115.

[145] SEEGEBARTH, B., MÜLLER, F., SCHATTENBERG, B., and BIUNDO, S. Making Hybrid Plans More Clear to Human Users - A Formal Approach for Generating Sound Explanations. In Proceedings of the 22nd International Conference on Automated Planning and Scheduling (ICAPS), AAAI Press (2012), 225–233.

[146]S HAFER,G. A Mathematical Theory of Evidence, vol. 1. Princeton University Press, 1976.

[147]S HARMA,R.,PAVLOVIC, V., and HUANG, T. Toward Multimodal Human-Computer Inter- face. Proceedings of the IEEE 86, 5 (may 1998), 853–869.

[148]S IEGERT,I.,PHILIPPOU-HÜBNER,D.,HARTMANN,K.,BÖCK, R., and WENDEMUTH, A. Investigation of Speaker Group-Dependent Modelling for Recognition of Affective States from Speech. Cognitive Computation 6, 4 (2014), 892–913.

[149]S METS, P. The Combination of Evidence in the Transferable Belief Model. IEEE Transac- tions on Pattern Analysis and Machine Intelligence 12, 5 (may 1990), 447–458.

[150]S METS, P. Data Fusion in the Transferable Belief Model. In Proceedings of the Third Inter- national Conference on Information Fusion. FUSION 2000., vol. 1, IEEE Computer Society (july 2000), PS21 – PS33.

[151]S TANCIULESCU,A. A Methodology for Developing Multimodal User Interfaces of Informa- tion Systems. PhD thesis, Université catholique de Louvain, 2008.

[152]S UN, Y., CHEN, F., SHI, Y., and CHUNG, V. A Novel Method for Multi-Sensory Data Fusion in Multimodal Human Computer Interaction. In Proceedings of the 18th Australasian Con- ference on Computer-Human Interaction: Design: Activities, Artefacts and Environments, OZCHI ’06, ACM (New York, NY, USA, 2006), 401–404.

[153]S UN, Y., SHI, Y., CHEN, F., and CHUNG, V. An Efficient Unification-Based Multimodal Language Processor in Multimodal Input Fusion. In Proceedings of the 19th Australasian Conference on Computer-Human Interaction: Entertaining User Interfaces, OZCHI ’07, ACM (New York, NY, USA, 2007), 215–218.

[154]S UN, Y., SHI, Y., CHEN, F., and CHUNG, V. Skipping Spare Information in Multimodal Inputs During Multimodal Input Fusion. In IUI ’09: Proceedings of the 13th International Conference on Intelligent User Interfaces, ACM (New York, NY, USA, 2009), 451–456.

[155]T AN, J.-W., WALTER,S.,SCHECK,A.,HRABAL,D.,HOFFMANN,H.,KESSLER, H., and TRAUE, H. C. Repeatability of Facial Electromyography (EMG) Activity over Corrugator Supercilii and Zygomaticus Major on Differentiating Various Emotions. Journal of Ambient Intelligence and Humanized Computing 3, 1 (2012), 3–10.


[156]T RAUE,H.C.,OHL, F., BRECHMANN,A.,SCHWENKER, F., KESSLER,H.,LIMBRECHT, K.,HOFFMAN,H.,SCHERER,S.,KOTZYBA,M.,SCHECK, A., and WALTER, S. A Frame- work for Emotions and Dispositions in Man-Companion Interaction. In Coverbal Synchrony in Human-Machine Interaction, M. Rojc and N. Campbell, Eds. CRC Press, 2013, 98–140.

[157]T URK, M. Multimodal Interaction: A Review. Pattern Recognition Letters 36 (2014), 189– 195.

[158]U SSOLI, M., and PRYTZ, G. SNTP Time Synchronization Accuracy Measurements. In 2013 IEEE 18th Conference on Emerging Technologies Factory Automation (ETFA) (Sept 2013), 1–4.

[159] W3C CONSORTIUM. XSL Transformations (XSLT) Version 1.0 - W3C Recommendation 16 November 1999, https://www.w3.org/TR/xslt/. Available online, February 1, 2017.

[160] W3C CONSORTIUM. Multimodal Interaction Framework - W3C Note 06 May 2003, https://www.w3.org/TR/mmi-framework/. Available online, February 20, 2017.

[161] W3C CONSORTIUM. XHTML+Voice (X+V) Profile 1.0 - W3C Note 21 December 2001, https://www.w3.org/TR/xhtml+voice/. Available online, October 10, 2016.

[162] W3C CONSORTIUM. EMMA: Extensible Multimodal Annotation Markup Language - W3C Recommendation 10 February 2009, https://www.w3.org/TR/emma/. Available on- line, October 4, 2016.

[163] W3C CONSORTIUM. Multimodal Architecture and Interfaces - W3C Recommendation 25 October 2012, http://www.w3.org/TR/mmi-arch/. Available online, October 4, 2016.

[164] W3C CONSORTIUM. SMIL Specifications, https://www.w3.org/AudioVideo/. Available online, October 7, 2016.

[165] W3C CONSORTIUM. Voice Extensible Markup Language (VoiceXML) 2.1 - W3C Recom- mendation 19 June 2007, https://www.w3.org/TR/voicexml21/. Available online, October 7, 2016.

[166]W ACHSMUTH, I. Communicative Rhythm in Gesture and Speech. In Proceedings of the International Gesture Workshop on Gesture-Based Communication in Human-Computer In- teraction, GW ’99, Springer-Verlag (London, UK, 1999), 277–289.


[167]W AHLSTER, W. User and Discourse Models for Multimodal Communication. In Intelligent User Interfaces, J. W. Sullivan and S. W. Tyler, Eds. ACM, New York, NY, USA, 1991, 45–67.

[168]W AHLSTER, W., ANDRÉ,E.,FINKLER, W., PROFITLICH, H.-J., and RIST, T. Plan-Based Integration of Natural Language and Graphics Generation. Artificial Intelligence 63, 1-2 (1993), 387–427.

[169]W ALTER,S.,SCHERER,S.,SCHELS,M.,GLODEK,M.,HRABAL,D.,SCHMIDT,M., BÖCK,R.,LIMBRECHT,K.,TRAUE, H. C., and SCHWENKER, F. Multimodal Emotion Classification in Naturalistic User Behavior. Springer, Berlin, Heidelberg, 2011, 603–611.

[170]W ASINGER, R., and KRÜGER, A. Modality Preferences in Mobile and Instrumented Envi- ronments. In Proceedings of the 11th International Conference on Intelligent User Interfaces, IUI ’06, ACM (New York, NY, USA, 2006), 336–338.

[171] WEB3D CONSORTIUM. The Virtual Reality Modeling Language, http://www.web3d.org/documents/specifications/14772/V2.0. Available online, October 7, 2016.

[172] WEB3D CONSORTIUM. X3D Standards for Version V3.3, http://www.web3d.org/standards/version/V3.3. Available online, October 7, 2016.

[173]W ENDEMUTH, A., and BIUNDO, S. A Companion Technology for Cognitive Technical Sys- tems. In Cognitive Behavioural Systems, A. Esposito, A. M. Esposito, A. Vinciarelli, R. Hoff- mann, and V.C. Müller, Eds., vol. 7403 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2012, 89–103.

[174]W ILKS, Y., Ed. Close Engagements with Artificial Companions: Key Social, Psychological, Ethical and Design Issues, vol. 8 of Natural Language Processing. John Benjamins Publish- ing, 2010.

[175]W ILSON,M.D.,SEDLOCK,D.,BINOT, J.-L., and FALZON, P. An Architecture for Multi- modal Dialogue. In Proceedings of the second Vencona Workshop on Multi-Modal Dialogue, M. M. Taylor, F. Neel, and D. G. Bouwhuis, Eds. (1991).

[176] WU, H. Sensor Data Fusion for Context-Aware Computing Using Dempster-Shafer Theory. PhD thesis, The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, 2003.

Acknowledgments

At the end of this thesis, I would like to thank all those people who made it possible and an unforgettable period of my life.

First of all, I express my gratitude to the SFB/Transregio 62 and the DFG, which made this thesis possible in the first place. I thank all the consortium's co-researchers for the years of ups and downs we went through together, for the inspiring discussions, and for the great friendships I have gained. Special thanks to Peter Kurzok and Thilo Hörnle for their great support in the lab studies.

I thank my dear colleague Frank Honold for the close and fruitful cooperation throughout the years. My sincere thanks to Julian, Katja, and Marcel for proofreading and linguistic revision of this thesis in an incredibly short time. I owe not only them, but all my colleagues at the Institute of Media Informatics, as well as Georg, Minne, and Sascha, heartfelt thanks for the great time we spent together and the laughter we shared, not only at work. I also thank Spotify for their “Beats to think to” playlist that accompanied me through many hours of writing, and our cats Amar and Pinu for their constant but lovable attempts to keep me from it.

Finally, I take this opportunity to sincerely express my deep gratitude to my beloved parents, especially my mother, and to my siblings Katrin and Martin for their love and continuous support. I thank my partner Daniel for his sacrifices, his love, and his seemingly endless patience.

In loving memory of my sister Andrea.