MIAMI: A Taxonomy of Multimodal Interaction in the Human Information Processing System

A Report of the Esprit Project 8579 MIAMI – WP 1 –

February, 1995

Written by L. Schomaker∗, J. Nijtmans (NICI), A. Camurri, F. Lavagetto, P. Morasso (DIST), C. Benoît, T. Guiard-Marigny, B. Le Goff, J. Robert-Ribes, A. Adjoudani (ICP), I. Defée (RIIT), S. Münch∗ (UKA), K. Hartung∗, J. Blauert (RUB)

∗Editors

Abstract

This document has been prepared in the ESPRIT BRA No. 8579, Multimodal Integration for Advanced Multimedia Interfaces — in the following referred to as MIAMI — in order to serve as a basis for future work. The basic terms which will be used in MIAMI will be defined and an overview of man-machine interfaces will be given. The term “taxonomy” is used in the following sense, adapted from [217]: “1: the study of the general principles of scientific classification: SYSTEMATICS; 2: CLASSIFICATION; specif: orderly classification of plants and animals according to their presumed natural relationships”; but instead of plants and animals, we attempt to classify input and output modalities.

Contents

1 Introduction
  1.1 Definitions of Basic Terms
    1.1.1 The basic model for human-computer interaction
    1.1.2 Levels of observation
    1.1.3 (Multi-) Modality
    1.1.4 Multimodal vs. multimedia vs. virtual reality system
    1.1.5 Communication channels
  1.2 Additional Notes and Caveats
    1.2.1 An extra information loop?
    1.2.2 Notes with respect to intention
    1.2.3 Topics which had to be excluded from this report
  1.3 Structure of the Document

2 Perception
  2.1 Human Input Channels
    2.1.1 Input modalities
    2.1.2 Vision
    2.1.3 Hearing
    2.1.4 Somatic senses
  2.2 Computer Output Media
    2.2.1 Output modalities
    2.2.2 Visual output
    2.2.3 Acoustical output
    2.2.4 Tactile/haptic output
  2.3 Bi- and Multimodal Perception
    2.3.1 Visual-acoustical perception
    2.3.2 Visual-speech perception

3 Control and Manipulation
  3.1 Human Output Channels
    3.1.1 Cybernetics: Closed-loop control
    3.1.2 Open-loop models
    3.1.3 Coordinative structure models
    3.1.4 Relevance for multimodal interaction
  3.2 Computer Input Modalities
    3.2.1 Keyboards
    3.2.2 Mice
    3.2.3
    3.2.4 Cameras
    3.2.5 Microphones
    3.2.6 3D input devices
    3.2.7 Other input devices
    3.2.8 Generalized input devices
  3.3 Event Handling Architectures in CIM
    3.3.1 Within-application event loops: GEM, X11
    3.3.2 Event-routine binding: Motif, Tcl/Tk
  3.4 Bi- and Multimodal Control
    3.4.1 Visual-gestural control
    3.4.2 Handwriting-visual control
    3.4.3 Handwriting-speech control
    3.4.4 Visual-motoric control


4 Interaction
  4.1 Architectures and Interaction Models
  4.2 Input/Output Coupling
  4.3 Synchronization
    4.3.1 Object synchronization
    4.3.2 Complexity of information
  4.4 Virtual Reality?
  4.5 Analysis of Interaction

5 Cognition
  5.1 Cognition in Humans
    5.1.1 Symbolic, subsymbolic, and analogical
    5.1.2 High-level representations
    5.1.3 Human learning and adaptation
    5.1.4 Hybrid interactive systems
  5.2 (Intelligent) Agents and Multimedia
    5.2.1 Application Scenarios

6 Scenarios & Dreams
  6.1 The Multimodal Orchestra
  6.2 Multimodal Mobile Robot Control

A An Introduction to Binaural Technology
  A.1 The Ears-and-Head Array: Physics of Binaural Hearing
    A.1.1 Binaural recording and authentic reproduction
    A.1.2 Binaural measurement and evaluation
    A.1.3 Binaural simulation and displays
  A.2 Psychophysics of Binaural Hearing
    A.2.1 Spatial hearing
    A.2.2 Binaural psychoacoustic descriptors
    A.2.3 Binaural signal enhancement
  A.3 Psychology of Binaural Hearing

B Audio-Visual Speech Synthesis
  B.1 Visual Speech Synthesis from Acoustics
    B.1.1 Articulatory description
    B.1.2 Articulatory synthesis
  B.2 Audio-Visual Speech Synthesis from Text
    B.2.1 Animation of synthetic faces
    B.2.2 Audio-visual speech synthesis

C Audio-Visual Speech Recognition
  C.1 Integration Models of Audio-Visual Speech by Humans
    C.1.1 General principles for integration
    C.1.2 Five models of audio-visual integration
    C.1.3 Conclusion
    C.1.4 Taxonomy of the integration models
  C.2 Audio-Visual Speech Recognition by Machines
    C.2.1 Audio-visual speech perception by humans
    C.2.2 Automatic visual speech recognition
    C.2.3 Automatic audio-visual speech recognition
    C.2.4 Current results obtained at ICP
    C.2.5 Forecast for future works

D Gesture Taxonomies
  D.1 Hand Gestures Taxonomy

E Two-dimensional Movement in Time
  E.1 The -based CIM/HOC
  E.2 Textual Data Input
    E.2.1 Conversion to ASCII
    E.2.2 Graphical text storage
  E.3 Command Entry
    E.3.1 Widget selection
    E.3.2 Drag-and-drop operations
    E.3.3 Pen gestures
    E.3.4 Continuous control
  E.4 Handwriting and Pen Gestures COM
  E.5 Graphical Pattern Input
    E.5.1 Free-style drawings
    E.5.2 Flow charts and schematics
    E.5.3 Miscellaneous symbolic input
  E.6 Known Bimodal Experiments in Handwriting
    E.6.1 Speech command recognition and pen input
    E.6.2 and speech synthesis

Bibliography

Chapter 1

Introduction

In this chapter we will introduce our underlying model of human-computer interaction, which will influence the whole structure of this document as well as our research in MIAMI. The next step will be the definition of basic terms (more specific terms will follow in later sections) and a statement of what will be included in and excluded from the project. Some 'philosophical' considerations will follow in section 1.2, and finally we will give an overview of the structure of this report.

1.1 Definitions of Basic Terms

One thing most publications on multimodal systems have in common is that each author uses the basic terms in his or her own way. Therefore, we want to state at the beginning of this report what our basic model looks like, how we are going to use several terms within this document, and which levels of perception and control we are considering in MIAMI.

1.1.1 The basic model for human-computer interaction

In order to lay out a taxonomy of multimodal human-computer interaction we will have to clarify a number of concepts and issues. The first assumption is that there are minimally two separate agents involved, one human and one machine. They are physically separated, but are able to exchange information through a number of information channels. As schematically shown in figure 1.1, we will make the following definitions. There are two basic processes involved on the side of the human user: Perception and Control. Note that we take the perspective of the human process throughout this document. With respect to the Perceptive process, we can make a distinction between:


[Figure 1.1 appears here: a block diagram linking the Computer ("cognition", with Computer Input Modalities (CIM) and Computer Output Media (COM)) via the Interface to the Human (Cognition, with Human Output Channels (HOC) and Human Input Channels (HIC)); the human intrinsic perception/action loop and the interaction information flow are indicated.]

Figure 1.1: A model for the identification of basic processes in human-computer interaction. Note that in fact two loops exist: the intrinsic feedback loop, as in eye-hand coordination, and the extrinsic loop, imposed by the computer.

• Human Input Channels (HIC) and

• Computer Output Media (COM)

Within the Control process, we can make a distinction between:

• Human Output Channels (HOC) and

• Computer Input Modalities (CIM)

Then, within both of the agents, a cognitive or computational component can be identified, which processes the incoming information and prepares the output. Also, at this intermediate cognitive level, intentional parameters will influence the processing, either implicitly, such as by design, in the case of non-intelligent agents, or explicitly, as in humans or in more sophisticated agents containing an explicit representation of goals and “beliefs”. With respect to the machine, it should be noted that here the design is known, whereas for the human cognitive apparatus the architecture cannot be observed directly and must be inferred.


Instead of the word modality at the human input side, and the word media at the human output side, we have chosen the word channel, which also allows for a clearer distinction between the abbreviations (HOC → CIM → COM → HIC → HOC → . . . ), which can also be pronounced as:

• HOC: human output to computer

• CIM: computer input from man

• COM: computer output to man

• HIC: human input from computer

The two halves of the loop will be further investigated in sections 2, Perception, and 3, Control and Manipulation, respectively. The interaction process between the operator (man) and the computer (machine) will be considered in section 4, Interaction, whereas the internal processing is described in 5, Cognition.

1.1.2 Levels of observation

For all four components listed in 1.1.1, a number of epistemological levels of observation can be defined, relevant to both human and machine:

• Physical/Physiological

• Information Theoretical

• Cognitive

• Intentional

At the Physical/Physiological level of observation, device characteristics may be identified, such as resolution (temporal, spatial), delay time characteristics, physical measures such as dB, Lux, force etc. Also analog bandwidth pertains to this level of observation. Examples are the bandwidth of the auditory channels, the bandwidth of the motor output channels (HOC and COM), etc. The system characteristics and process phenomena at this level provide the basic constraints and capabilities in human-computer interaction. Apart from the unimodal constraints, an interesting phenomenon at this level is the improved perception for a given modality under multimodal conditions.

At the Information Theoretical observation level, digital bandwidth and digital communication concepts are relevant, describing the pure informational characteristics of the

channels (e. g., entropy). At this level of observation, issues of data compression, but also of optimal search processes, play a role. An example is the number of bits/s needed to faithfully transmit the recorded images of a human talking face over a COM→HIC channel.

At the Cognitive level, representational and procedural aspects have to be made explicit, in syntax and semantics. Also involved are all those components of pattern recognition processes which exceed the level of the basic signal features and require information of a more abstract nature. An example is the use of linguistic “top-down” information in the disambiguation of speech or handwriting, or the use of high-level information in the interpretation of facial movements. Another typical topic at the cognitive level is the notion of short-term memory with its limited capacity in the human: 7 ± 2 conceptual items [223, 43]. And, last but not least, an essential property of the Cognitive level is the process of learning. Because the Cognitive level mediates between low-level processing and the high level of intention, to be clarified in the next paragraph, it is at the core of the MIAMI project.

At the Intentional level, the goals and beliefs of the involved agents have to be made explicit. It is still one of the problems of current computer science and AI that this level is often ignored, whereas it is essential to the human user. When asked what he/she is currently doing on the computer, the answer of the user will most likely not be on the level of the tools (e. g., “I’m pressing this button”), but on the level of goals and intentions (“I’m looking for this nice picture in the Louvre homepage”). Typically, much of the research in human-computer interaction has involved experimental conditions of goal-less behavior; at best, goals artificially imposed on the user subjects are used. However, there is a fundamental difference between “playful browsing activity” and “goal-oriented search in a financial database under time pressure”. Such a difference at the intentional level will have a large influence on low-level measures such as movement times and reaction times, and most probably even on perceptual acuity. Another area where the disregard of the goal-oriented behavior of human users leads to serious problems is the automatic recognition of HOC data. Most pattern recognition methods require training of the algorithms on a database of human-produced data. However, training speech or handwriting recognizers on isolated words collected outside a “natural” application context will yield data which does not generalize to the type of human output (HOC) data produced in an application context involving realistic goals. The problem of intention also becomes clearly apparent in the case of teleoperation applications, where the goals of the user must be continuously inferred by the computer from the user's actions.
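
To make the notion of digital bandwidth at this level concrete, the following Python fragment is a minimal, purely illustrative sketch (the numeric values are assumed examples, not figures from this report): it computes the raw bit rate of an uncompressed image sequence and the Shannon entropy of a discrete symbol source, the quantity that bounds how far such a stream can losslessly be compressed.

    import math

    def raw_video_bitrate(width, height, bits_per_pixel, frames_per_second):
        """Raw digital bandwidth of an uncompressed image sequence, in bit/s."""
        return width * height * bits_per_pixel * frames_per_second

    def shannon_entropy(probabilities):
        """Entropy in bit/symbol of a discrete source with the given symbol probabilities."""
        return -sum(p * math.log2(p) for p in probabilities if p > 0)

    # e.g. a 384 x 288 grey-level image sequence at 8 bit/pixel and 25 frames/s
    print(raw_video_bitrate(384, 288, 8, 25))          # about 22 Mbit/s uncompressed
    # a four-symbol source with skewed probabilities needs fewer than 2 bit/symbol
    print(shannon_entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bit/symbol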


1.1.3 (Multi-) Modality

According to [67], modality is defined as follows:

“Perception via one of the three perception channels. One can distinguish the three modalities: visual, auditive, and tactile (physiology of the senses).”
visual: concerned with, used in seeing (compare: optical)
auditive: related to the sense of hearing (compare: acoustical)
tactile: experienced by the sense of touch
haptic: most authors use “tactile” and “haptic” as synonyms; in [120], however, tactile as a perception modality is distinguished from haptic as an output manner.
optics/optical: optics is the theory of light as well as of infrared and ultraviolet radiation. The attribute “optical” thus refers to physical quantities and laws rather than to physiological ones. (compare: visual)
acoustics/acoustical: acoustics is the theory of vibrations and oscillations in elastic media, especially of sound: its generation, propagation, and reception. The attribute “acoustical” thus refers to physical rather than to physiological quantities and laws. (compare: auditive)

The author of [67] defines only three modalities and associates them with three of the human senses. Although these are the three that will be considered in MIAMI, physiology defines several more senses:

Sensory perception    Sense organ             Modality
Sense of sight        Eyes                    Visual
Sense of hearing      Ears                    Auditive
Sense of touch        Skin                    Tactile
Sense of smell        Nose                    Olfactory
Sense of taste        Tongue                  Gustatory
Sense of balance      Organ of equilibrium    Vestibular

Table 1.1: Different senses and their corresponding modalities (taken from: [311])


In our opinion, the sense of smell and the sense of taste are not very interesting for our concerns (see 2.1.1 for a deeper discussion). However, the sense of balance seems to become more and more interesting with respect to virtual reality environments (see 4.4); it is already used today in flight simulators, for example. Whenever more than one of these modalities is involved, we will speak of multimodality. To be more precise, in some cases we will also use the term bimodal (or bimodality) to denote the usage of exactly two different modalities. In this sense, every human-computer interaction has to be considered as multimodal, because the user looks at the monitor, types in some commands or moves the mouse (or some other device) and clicks at certain positions, hears the reaction (beeps, key clicks, etc.) and so on. Therefore, in MIAMI our understanding of multimodality is restricted to those interactions which comprise more than one modality on either the input (i. e., perception) or the output (i. e., control) side of the loop and the use of more than one device on either side. Thus, the combination of, e. g., visual, auditive, and tactile feedback which is experienced when typing on a keyboard is explicitly excluded, whereas the combination of visual and auditive output produced by the monitor and a loudspeaker when an error occurs is a 'real' multimodal (or, in this case, bimodal) event.

1.1.4 Multimodal vs. multimedia vs. virtual reality system

Notice the following paragraph in [243]:

“Multimodal System: A Definition

In the general sense, a multimodal system supports communication with the user through different modalities such as voice, gesture, and typing. Literally, ‘multi’ refers to ‘more than one’ and the term ‘modal’ may cover the notion of ‘modality’ as well as that of ‘mode’.

• Modality refers to the type of communication channel used to convey or acquire information. It also covers the way an idea is expressed or perceived, or the manner an action is performed.

• Mode refers to a state that determines the way information is interpreted to extract or convey meaning.

In a communication act, whether it be between humans or between a computer system and a user, both the modality and the mode come into play. The modality defines the type of data exchanged whereas the mode determines the context in which the data is interpreted. Thus, if we take a system-centered


view, multimodality is the capacity of the system to communicate with a user along different types of communication channels and to extract and convey meaning automatically. We observe that both multimedia and multimodal systems use multiple communication channels. But in addition, a multimodal system is able to automatically model the content of the information at a high level of abstraction. A multimodal system strives for meaning.”

In MIAMI, we have decided to adopt this perspective. In particular, the common notion of a multimedia system, which is often merely a PC with a sound card, a loudspeaker, and a CD-ROM drive (if the hardware is considered), or a kind of hypertext/hypermedia environment combining text, video, and sound (if the software is considered), is not what we want to address in this project. An explanation of the term multimedia along these lines is given by Steinmetz:

“Multimedia, from the user’s point of view, means that information can also be represented as audio signals or moving images.” [322, page 1] (translation by the authors)

A definition given by Buxton states the following:

“[. . . ] ’Multimedia’ focuses on the medium or technology rather than the ap- plication or user.” [195, page 2]

The second aspect, namely the concentration on the user, is what we want to pursue in this project. The first aspect, the concentration on the application, is characteristic of research performed in the VR domain. Therefore, a distinction between a multimodal and a Virtual Reality (VR) system has to be made. A 'minimal' definition of a VR system is provided by Gigante:

“The illusion of participation in a synthetic environment rather than external observation of such an environment. VR relies on three-dimensional (3D), stereoscopic, head-tracked displays, hand/body tracking and binaural sound. VR is an immersive, multisensory experience.” [86, page 3]

In this project, we take the main difference to be the intention behind the two research directions: VR aims at the imitation of reality for establishing immersive audio-visual illusions, whereas multimodality attempts to enhance the throughput and the naturalism of man-machine communication. The audio-visual illusion of VR is just a trick

for triggering the natural synergy among sensory-motor channels which is apparent in our brain, but it is not the only possibility of efficiently exploiting the parallelism and associative nature of human perceptual-motor processes. From this point of view, VR research is a subset of multimodality research. For example, in a following section we suggest two views of multimodal systems, computer-as-tool and computer-as-dialogue-partner, which are further alternatives to the computer-as-audiovisual-illusion that is typical of VR systems. A discussion of VR will follow in section 4.4.

Having in mind our basic model as well as Nigay and Coutaz' statement that “A multimodal system strives for meaning.”, one aspect that distinguishes our research from both multimedia and VR research is the concentration on the internal processing steps, known as cognition. Whereas synchronization of different modalities (or media) plays a key role in a multimedia system, this will not be enough for a multimodal system. The fusion of modalities as well as of cognitive processes has to be considered in order to find the meaning or the intention of the user's actions (see also 1.1.2 and 1.2.2). This aspect of multimodality will be further investigated in section 5.

1.1.5 Communication channels

Charwat defines communication channel as follows:

“In general, a communication channel is a connection between one sending source and one receiving sink. In the human body a communication channel is formed by the sense organs, the nerve tracts, the cerebrum, and the muscles. In case of competing sensations at different sense organs there is one channel preferred. There is no total independence between the communication channels. For ex- ample: the capacity of received information coded with combined visual and auditive signals does not reach the sum of the single channels’ capacities.” [67]

The following figure shows estimated capacities of human information reception, processing, and output, summarized over all communication channels [311]:

input (≈ 10^9 bit/s)  −→  processing (≈ 10^1 – 10^2 bit/s)  −→  output (≈ 10^7 bit/s)

Some channel capacities are mentioned in [67] (unfortunately, Charwat does not describe the experiments leading to these numbers):


Physical quantity              Distinguishable steps    Channel width [bits]
Musical pitch                  ≈ 6                      2.6
Musical volume                 ≈ 5                      2.3
Estimated extent of squares    ≈ 5                      2.3
Color saturation               ≈ 9                      3.1
Length of lines                ≈ 6 – 8                  2.6 – 3.0
Angle of lines                 ≈ 7 – 10                 2.8 – 3.3
Vibration intensity            ≈ 4                      2.0
Vibration duration             ≈ 5                      2.3

Table 1.2: Capacities of different human input channels (taken from [67])
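
The “channel width” column is simply the binary logarithm of the number of distinguishable steps, i. e. the number of bits needed to identify one step. A minimal sketch (our own illustration, not part of Charwat's data):

    import math

    def channel_width_bits(distinguishable_steps):
        """Channel width in bits for a given number of absolutely distinguishable steps."""
        return math.log2(distinguishable_steps)

    print(round(channel_width_bits(6), 1))   # musical pitch: about 2.6 bits
    print(round(channel_width_bits(4), 1))   # vibration intensity: 2.0 bits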

Most probably, these channel capacities originated from the Weber, Fechner, and Stevens laws of psychophysics. Therefore, their work will be reviewed here very briefly.

The Psychophysical Law

Building upon the classical works of Weber and Fechner in the last century about relative perceptual thresholds, which culminated in the model of the logarithmic law, a better fit to the experimental data for a number of sensory channels has been demonstrated by Stevens with a psychophysical power law [325]:

ψ = k · S^m                                                    (1.1)

where ψ is the subjective magnitude, S is the physical stimulus intensity, and m is a modality-dependent exponent. Table 1.3 summarizes some values of the exponent estimated by Stevens:

Uncertainty and discrimination

Information theory has been particularly useful in dealing with experiments on absolute judgment. In these experiments, the observer is considered as the communication channel and the question asked is: “What is the amount of information that the observer can transmit in an absolute-judgment task?” Or, more specifically, “What is the channel capacity of the perceiver with respect to the number of items that can be absolutely discriminated?” Table 1.4 summarizes some of the classical results on the matter. Details of the experiments are explained in [81].


Modality          Exponent
Brightness        0.5
Loudness          0.6
Vibration         0.95
Duration          1.1
Temperature       1.6
Heaviness         1.45
Electric shock    3.5

Table 1.3: Exponents for the psychophysical law (taken from [325])
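
As an illustration of equation (1.1), the following sketch (our own, with k arbitrarily set to 1) shows how differently a doubling of the physical stimulus intensity is experienced for three of the exponents in Table 1.3:

    def stevens_magnitude(stimulus_intensity, exponent, k=1.0):
        """Subjective magnitude psi = k * S**m (Stevens' power law, eq. 1.1)."""
        return k * stimulus_intensity ** exponent

    # ratio of sensations when the physical intensity S is doubled
    for modality, m in [("brightness", 0.5), ("loudness", 0.6), ("electric shock", 3.5)]:
        ratio = stevens_magnitude(2.0, m) / stevens_magnitude(1.0, m)
        print(modality, round(ratio, 2))   # 1.41, 1.52, 11.31 respectively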

Stimulus dimension     Investigator    Channel capacity [bits]
Brightness             [97]            1.7
Duration               [236]           2.8
Hue                    [66]            3.6
Loudness               [116]           2.3
Pitch                  [274]           2.5
Vibration intensity    (from [81])     1.6

Table 1.4: Amount of information in absolute judgements of several stimulus dimensions (taken from [81])

1.2 Additional Notes and Caveats

1.2.1 An extra information loop?

The model proposed in 1.1.1 has the advantage of providing the basic, isolated concepts. However, it should be noted that the proposed schema obscures the fact that perception and motor control in the human are tightly coupled. The clearest example is vision, in which a coherent internal image of the outside world is reconstructed from a series of controlled eye movements (saccades) and fixations, directed at details in the environment. The eye musculature plays an essential role here: the image would fade within seconds if these muscles were paralyzed. Also, processes like continuous eye-hand coordination require an intrinsic Output→Input feedback loop within the human, which shunts the main human-machine information pathway (figure 1.1). This issue will be elaborated in the coming chapters, notably chapter 4 on Interaction.


1.2.2 Notes with respect to intention

A recurring theme in theorizing about human-computer interaction is the duality of the following two views on the computer:

1. Computer-as-Tool

2. Computer-as-Dialogue-Partner

Ad 1. In this view, the intention (more simply, the goal-directed behavior) is assumed to be present in the human user, whereas the machine is a passive tool, transparent to a high degree, and merely supportive with respect to the human goals. This is in fact the status of most current applications, and it is also the basic philosophy proclaimed by well-known researchers in the field of human-computer interaction (e. g., [310]). In fact, there is a trend away from the earlier anthropomorphizing error messages and dialogue fragments towards more neutral, tool-like behavior. Initial experimental designs of the now well-known teller machines displayed anthropomorphic announcements like “How are you today, can I help you?”, a method which in practice was quickly replaced by more formal, brief texts and menus.

Ad 2. Alternatively, there is a trend in which goal-oriented and intentional behavior is explicitly introduced in so-called intelligent agents (see also section 5.2), which behave according to a logic containing “beliefs” about the world and about the goals of the user. Synergistic with this trend is the quickly growing capability with which human-like output (speech, faces, movement) can be more or less faithfully produced by computers. Critics of this latter approach pejoratively call it “animism” and point out that user-friendly design is helped much more by giving the user the idea that he/she is fully in control of the situation than by having to negotiate with a potentially uncooperative anthropomorphic servant. However, the anthropomorphic dialogue may also be compelling, and the metaphors of “partner”, “colleague” or “coach” may be used to make clear to the user what the characteristics of the dialogue are [344], in terms of protocol, attitude, and etiquette.

Rather than deciding a priori for either one of these basic interface philosophies, different types of interaction and different application conditions must be defined within this project which are suited to either the Computer-as-Tool or the Computer-as-Dialogue-Partner approach, combining aspects of both approaches where possible. As an example: whereas it may be acceptable to a user to organize the planning of daily activities in a dialogue with an anthropomorphic partner, it may be quite unacceptable


to use a human-style dialogue for detailed low-level aspects of text editing or computer programming. Within the MIAMI project, the concepts introduced here will be substantially refined.

1.2.3 Topics which had to be excluded from this report

Due to limitations in time and other resources, not all aspects of multimodality and multimedia systems could be included in this taxonomy report. It is, of course, also not possible to enumerate all the topics that had to be excluded. Nevertheless, we want to name some of them in order to clarify what you will find in the following sections (see the next section) and what you will look for in vain. Excluded topics are, among others:

• Data storage aspects

• Compression algorithms

• Operating systems

• Basic software tools

• Communication protocols

• Database systems

Although we will make use of most of these technologies, we are not going to develop any of them ourselves.

1.3 Structure of the Document

First we will focus our attention on the human perceptual process, its input channels (HIC) and the characteristics of computer output media (COM) (Chapter 2). Then, we will approach the process of human control and manipulation, dwelling on the characteristics of the human output channels (HOC) and the computer input modalities (CIM) (Chapter 3). In the next chapter (4) the issue of the interaction process will be dealt with, addressing aspects of input/output coupling and temporal and spatial synchronization.


Gradually moving from basic processes to more abstract levels, we will describe aspects of cognition and learning, their representational architecture in machine and man, as far as possible, and the issue of interacting agents (Chapter 5). Finally, in the last chapter, a number of scenarios, or rather “dreams”, will be depicted, elucidating fundamental aspects put forward in this taxonomy.


Chapter 2

Perception

In this chapter we will deal with perception, i. e., the process of transforming sensory information into higher-level representations which can be used in associative processes (memory access) and cognitive processes such as reasoning. In the perceptual process, two basic components can be identified, as described earlier (figure 1.1, right part). First, there is the human sensory system with its typical characteristics and constraints, providing the “human input channel” (HIC) processing functions. Second, there are those components of a computer system which provide a multidimensional signal for the human user: the “computer output media” (COM). According to our basic model introduced in section 1, we use the term perception to describe the communication from a machine to a human. First, we will show the different kinds of human input channels through which perception becomes possible and give a short overview of these channels at the neurophysiological level (2.1). The second part of this chapter (2.2) deals with devices and methods used by computers to address these human senses. Finally, in 2.3 we will present results on bi- and multimodal machine-to-man communication.

2.1 Human Input Channels

In this introductory section on perception, we will review the concept of modality from the neurobiological point of view [153, 308], gradually narrowing the scope to those modalities relevant to research within MIAMI. As remarked by Shepherd [308], the notion of sensory modality can be traced back to the 1830s, in particular to the monumental “Handbook of Human Physiology”, published in Berlin by Johannes Müller, who promulgated the “law of specific nerve energies”. This

states that we are aware not of objects themselves but of signals about them transmitted through our nerves, and that there are different kinds of nerves, each nerve having its own “specific nerve energy”. In particular, Müller adopted the five primary senses that Aristotle had recognized: seeing, hearing, touch, smell, taste. The specific nerve energy, according to Müller, represented the sensory modality that each type of nerve transmitted. The modern notion, beyond a great degree of terminological confusion, is not very much different: we recognize that there are specific receptor cells, each tuned to be sensitive to a particular form of physical energy in the environment, which serves as the stimulus for that receptor cell. A table in Shepherd's book illustrates the point. The table can be simplified and re-written in our framework as follows (Table 2.1).

Sensory modality            Form of energy            Receptor organ             Receptor cell
Chemical (internal):
  blood oxygen              O2 tension                carotid body               nerve endings
  glucose                   carbohydrate oxidation    hypothalamus               gluco-receptors
  pH (cerebrospinal fluid)  ions                      medulla                    ventricle cells
Chemical (external):
  taste                     ions & molecules          tongue & pharynx           taste bud cells
  smell                     molecules                 nose                       olfactory receptors
Somatic senses:
  touch                     mechanical                skin                       nerve terminals
  pressure                  mechanical                skin & deep tissue         encapsulated nerve endings
  temperature               thermal                   skin, hypothalamus         peripheral & central
  pain                      various                   skin & various organs      nerve terminals
Muscle sense, kinesthesia:
  muscle stretch            mechanical                muscle spindles            nerve terminals
  muscle tension            mechanical                tendon organs              nerve terminals
  joint position            mechanical                joint capsule & ligaments  nerve terminals
Sense of balance:
  linear acceleration       mechanical                sacculus/utriculus         hair cells
  angular acceleration      mechanical                semicircular canal         hair cells
Hearing                     mechanical                cochlea                    hair cells
Vision                      light                     retina                     photoreceptors

Table 2.1: An overview of input channels at the neurophysiological level

The different sensory modalities used by human beings are not processed in isolation. Multimodal areas exist in cortical and sub-cortical structures, such as the posterior parietal cortex (areas 5 and 7) and the superior colliculus. The integration of the different channels

is essential, among other things, for allowing the brain to reconstruct an internal body model and an internal representation of external Euclidean space [228].

2.1.1 Input modalities

Obviously, not all of these channels/modalities are of the same interest for MIAMI. As outlined in the Technical Annex, the project will mainly address the senses of vision, hearing, and the somatic senses. The remaining question is: do we consider these senses as “atomic”, or do we also take into account the sub-modalities which have been partly presented in table 2.1? We decided to follow the first approach, so in the following paragraphs we will exclude some possible input modalities from the project. It is possible to distinguish sub-modalities according to different sensitivities, e. g. with respect to peak frequency or cut-off frequency. In the eye, there are rods and three types of cone receptors in the retina, each with their specific properties. Similarly, in the motor system, there are the bag and chain fibers in the muscle spindles, etc. However, these very low levels of the infrastructure of the perceptual system can be skipped within the context of MIAMI. Some of the sensory modalities, however, do not have a cortical representation (sense of balance, chemical senses) or have only a very reduced one (taste) and do not give rise to “conscious perception”, whatever controversial meaning we attribute to this concept; thus we cannot speak of “perceptual channels” for them.

The senses of smell and taste might not be very interesting for MIAMI (and human information processing in general). This is not due to the fact that “the corresponding output devices are missing”, but to the facts (i) that taste is not a very useful channel of man-machine interaction and (ii) that smell has some practical problems. However, it should be emphasized that the sense of smell has great potential, particularly if we consider man-machine interaction with mobile robots: in nature, indeed, odors are not only important from the point of view of “chemical” analysis, but also from the navigation point of view, for “marking” the territory and setting “landmarks” which are of great help in path planning. Also, biological memory is linked to odors, most probably because the phylogenetically oldest systems of territory representation are based on the chemical senses. Spatial memory is probably related to the hippocampus, which is a cortical area in the immediate neighborhood of the olfactory cortex. Some experiments on robotic path planning by following the gradient of an odor are reported in the literature, and it seems reasonable to consider odor as a creative bidirectional channel of communication between man and a moving machine. Unlike sound, olfactory marks have a physical persistence,

like visual traces, but can be invisible themselves and may thus be used in parallel to visible traces. According to [279], there are many different ways to use odors to create a striking sense of presence:

“The technology of delivering odors is well-developed [343], in trials at Southwest Research Institute. The odors are all Food and Drug Administration approved and delivered at low concentration. The system uses a micro-encapsulation technique that can be dry packaged in cartridges that are safe and easy to handle. Human chemical senses such as taste and smell create particularly salient memories.” [279]

In the context of MIAMI, however, other input modalities are deemed more relevant: vision, hearing, and the somatic senses. Due to the importance of these senses, they will be considered in more detail in the next sections.

2.1.2 Vision

vision: [. . . ] 2b (1): mode of seeing or conceiving; [. . . ] 3a: the act or power of seeing: SIGHT; 3b: the special sense by which the qualities of an object [. . . ] constituting its appearance are perceived and which is mediated by the eye; [. . . ] [217]

As an input modality for information processing, vision plays the most important role. The complexity and sophistication of the visual sense are to a large extent beyond our full comprehension, but a large body of experimental knowledge about the properties of vision has been collected. A detailed review of all these properties would be voluminous [278]; in general, one can say that the responses span very considerable dynamic ranges. At the same time, the amount of information processed is cleverly reduced, giving the illusion of a photographic registration of reality combined with fast and precise extraction of very complex information. From the application point of view, the basic and important properties of vision are the light intensity response, the color response, the temporal response, and the spatial responses. There are many different levels at which these properties can be studied. At the receptor level, there are retinal receptors sensitive to light intensity and color. However, these raw sensitivities have little in common with perceived light and color, and the information

from the receptors undergoes numerous transformations, with the result that the visual system is mainly sensitive to changes in light intensity and tries to preserve color constancy under changing illumination. These effects can be quite elaborate, depending also on higher-level aspects of visual perception and memory. While the sensitivity of the receptor system is very high, for practical purposes it suffices to assume that it covers an 8-10 bit range of amplitudes for light intensity and for each of the primary colors.

The temporal response of the visual system is responsible for many effects, like the perception of light intensity changes and the rendering of motion. The response can be very high under specific conditions, but in practice it can be considered to be limited to a maximum of 100 Hz for very good motion rendering and a few tens of Hz for light intensity changes. This depends on the type of visual stimulation, distance, and lighting conditions, and it has as a direct consequence that the picture repetition frequency in TV and cinema is about 50 Hz, whereas this is insufficient for computer displays, which require 70-80 Hz or more to eliminate picture flickering effects.

The spatial response of the visual system concerns visual resolution, the width of the field of view, and spatial vision. These factors have a direct and strong impact on visual perception. While in normal scenes the resolution of details need not be very high (twice the normal TV resolution is considered “high definition”), in specific situations the eye is very sensitive to resolution, which is the reason why magazine printing might require a hundred times higher resolution than TV. A very important perceptual factor is the width of the field of view. While the central visual field, which brings most information, is essential, there is a much wider peripheral vision system which has to be activated in order to increase the perceptual involvement (the cinema vs. TV effect). On top of this there is a sophisticated spatial vision system which is partially based on binocular vision and partially on spatial feature extraction from monocular images.

The full visual effect coming from the fusion of visual responses to the different stimulations is rich and integrated to provide optimum performance for very complex scenes. Usually, optimum performance means extremely quick and efficient detection, processing, and recognition of patterns and parameters of the visual scenes.

2.1.3 Hearing

hearing: 1: to perceive or apprehend by the ear; [. . . ] 1: to have the capacity of apprehending sound; [. . . ] 1: the process, function, or power of perceiving sound; specif: the special sense by which noises and tones are received as stimuli; [. . . ] [217]


The main attributes used for describing a hearing event are:

Pitch is the auditory attribute on the basis of which tones may be ordered on a musical scale. Two aspects of the notion of pitch can be distinguished in music: one related to the frequency (or fundamental frequency) of a sound, which is called pitch height, and the other related to its place in a musical scale, which is called pitch chroma. Pitch height varies directly with frequency over the range of audible frequencies. This 'dimension' of pitch corresponds to the sensation of 'high' and 'low'. Pitch chroma, on the other hand, embodies the perceptual phenomenon of octave equivalence, by which two sounds separated by an octave (and thus relatively distant in terms of pitch height) are nonetheless perceived as being somehow equivalent. This equivalence is demonstrated by the fact that almost all scale systems in the world in which the notes are named give the same names to notes that are roughly separated by an octave. Thus pitch chroma is organized in a circular fashion, with octave-equivalent pitches considered to have the same chroma. Chroma perception is limited to the frequency range of musical pitch (50-4000 Hz) [218].

Loudness is the subjective intensity of a sound. Loudness depends mainly on five stimulus variables: intensity, spectral content, time, background, and spatial distribution of sound sources (binaural loudness) [296].

Timbre, also referred to as sound quality or sound color. The classic negative definition of timbre is: the perceptual attribute of sound that allows a listener to distinguish among sounds that are otherwise equivalent in pitch, loudness, and subjective duration. Contemporary research has begun to decompose this attribute into several perceptual dimensions of a temporal, spectral, and spectro-temporal nature [218].

Spatial attributes of a hearing event may be divided into distance and direction:

The perception of the direction of a sound source depends on the differences between the signals at the two ears (interaural cues: the interaural level difference (ILD) and the interaural time difference (ITD)) and on the spectral shape of the signal at each ear (monaural cues). Interaural and monaural cues are produced by reflections, diffractions, and damping caused by the body, head, and pinna. The transfer function from a position in space to a position in the ear canal is called the head-related transfer function (HRTF). The perception of distance is influenced by changes in timbre and by distance dependencies in the HRTF. In echoic environments, the time delays and directions of the direct sound and of the reflections affect the perceived distance.
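
As a rough indication of the size of the ITD cue, the following sketch uses the classical spherical-head (Woodworth) approximation; the formula and the head radius are textbook assumptions, not values taken from this report:

    import math

    def interaural_time_difference(azimuth_deg, head_radius_m=0.0875, speed_of_sound=343.0):
        """Approximate ITD in seconds for a distant source at the given azimuth,
        using the spherical-head formula ITD = (a/c) * (theta + sin(theta))."""
        theta = math.radians(azimuth_deg)
        return (head_radius_m / speed_of_sound) * (theta + math.sin(theta))

    print(round(interaural_time_difference(90.0) * 1e6))  # about 660 microseconds for a source at the side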


2.1.4 Somatic senses

somatic: 1: of, relating to, or affecting the body [. . . ]; 2. of or relating to the wall of the body [217]

The main keywords related to the somatic senses are tactile and haptic, which are both related to the sense of touch (see 1.1.3). There is, however, more to this sense than touch alone. For example, [120] distinguishes five “senses of the skin”: the sense of pressure, of touch, of vibration, of cold, and of warmth. [120] indicates two more senses, called the “sense of position” and the “sense of force”, related to the proprioceptors. The proprioceptors are receptors (special nerve cells receiving stimuli) within the human body. They are attached to muscles, tendons, and joints. They measure, for example, the activity of muscles, the tension of tendons, and the angular position of joints. This sense of proprioception is called kinesthesis, and [120] calls the accompanying modality kinesthetical:

Kinesthesis (perception of body movements): (physiology, psychology) Kinesthesis is the perception that enables a person to perceive the movements of his or her own body. It is based on the fact that movements are reported to the brain (feedback), such as:

• angle of joints
• activities of muscles
• head movements (reported by the vestibular organ within the inner ear)
• position of the skin, relative to the touched surface
• movements of the person within the environment (visual kinesthesis)

Kinesthesis supports the perception of the sense organs. If information delivered by a sense organ and by kinesthesis is contradictory, the brain will prefer the information coming from the sense organ. Human-computer interaction makes use of kinesthesis, e. g. when a key has a perceptible pressure point or when the hand moves the mouse to position the mouse pointer on the screen, and so on [67].

For a multimodal system, kinesthesis is not as relevant as it is for a VR system, because aspects like an external influence on the sense of balance, e. g. when wearing a head-mounted display (HMD), will not be a major topic here. Therefore, the most relevant

somatic sense is the sense of touch, which can be addressed by special output devices (or, in most cases, input devices with additional output features) with either tactile or force feedback (see 2.2.4). According to [309], four different types of touch receptors (or mechanoreceptors) are present in the human hand; these are especially relevant for sensing vibrations. They have different characteristics which will not be addressed within the scope of this report. As tests have shown, the output response of each receptor decreases over time for a given input stimulus (so-called stimulation adaptation). The two-point discrimination ability is also very important: the index finger pulp can resolve two points that are more than 2 mm apart, whereas in the center of the palm two points that are less than 11 mm apart feel like only one. Other relevant factors are the amplitude and the vibration frequency of the contactor.

2.2 Computer Output Media

The second part of the machine-to-man communication is mainly influenced by the devices and methods that are used to address the senses described in sections 2.1.2 – 2.1.4. Despite the technical limitations that still exist today, there are a number of ways for machines to communicate with their users. Some of them will be presented in the following sections.

2.2.1 Output modalities

Under output modalities or computer output media, we understand the media (devices) and modalities (communication channels) which are used by computers to communicate with humans. As the communication channels for computer output are the same as those for human input (see figure 1.1), the following subsections are structured to parallel those in section 2.1. Therefore, we refrain from repeating what has been said there and will instead investigate the devices and methods for the three “main senses” in the following.

2.2.2 Devices and methods for visual output

As stated before (2.1.2), the human visual system has large dynamic ranges. Presently there are many tools available for stimulating many properties of the visual sense, with computer displays being the predominant output communication channel. Modern computer display technology can generate stimulations which cover the full


or a significant part of the practical dynamic ranges of visual properties like color and temporal resolution. While computer displays are far from being completely 'transparent', they get high marks for their overall perceptual impact. In the near future one can expect full integration of high-quality video and TV-like capabilities with computer-like content, integrating and improving the information output from the different sources. The biggest single deficiency of output devices lies in the spatial vision stimuli, as displays capable of covering a high-resolution wide field of view, and especially of 3D visualization, are uncommon. Display resolution will keep growing, and the aspect ratio may change in the future from the current 4:3 to 16:9, allowing for greater perceptual involvement. However, stereoscopic and 3D visualization need new solutions to become practical (the difference between stereoscopic vision and 3D is that the latter also conveys motion parallax, that is, objects look different from different viewing angles). The lack of spatial effects diminishes the sensation of reality and the involvement in the content of the scene. A more advanced aspect of this problem is the lack of full immersion in visual scenes, which virtual reality devices attempt to solve. Virtual reality technology in its present form is non-ergonomic in the sense that it can be used only for a limited time during one session. Wide-field, high-resolution visualization with spatial effects would be optimal from the ergonomic point of view.

2.2.3 Devices and methods for acoustical output

In this section, methods for presenting information via acoustical means will be presented. In real life, humans obtain a lot of information via the auditory modality. Communication by means of speech is practiced every day. Even if we are sometimes not conscious of it, we use many non-speech sounds in daily life to orient ourselves and to obtain precise knowledge of the state of the environment. When acoustical output is used in technical interfaces, the way we use sound in everyday life is adapted to serve as a powerful tool for information transfer. In the following, the underlying concepts and technologies will be presented. It is not the aim of this text to list different synthesis algorithms or to discuss the underlying hardware problems, although these limit the widespread use of some interesting methods. The existing audio hardware found in PCs and recent workstations will be presented briefly.


Speech output

If the medium is speech, the whole repertoire of learned vocabulary and meaning is exploited. In order to maintain reasonable intelligibility the bandwidth should not be below three kHz, but a higher bandwidth leads to better quality. In a teleconference environment, speech is generated by the participants, compressed, transmitted, and decompressed for presentation. Speech can also be used as a machine-to-man interface. Recorded words or sentences can be played to inform the user about the status of the machine. Also, information stored in textual form can be transformed into speech by means of sophisticated synthesis algorithms. Hardware and software for this purpose are commercially available. Another interesting application is the use of speech inside text documents. Many text editors allow non-textual items like pictures and sound documents to be attached. Some tools for electronic mail support sound attachments, which might be a speech recording made by the sender. Some text editors allow annotations to be made in the text: an icon appears at the position of the annotation, and if the icon is activated, either a textual or a spoken annotation is presented.

Non-speech audio output

For non-speech output, four main categories may be identified. Because the same terms are often used in different ways in the recent literature, the definitions of the terms as used in this report will be presented first.

Auditory Icon Everyday sounds that convey information about events in the computer or in a remote environment by analogy with everyday sound-producing events [118]. Examples:

• sound of a trashcan −→ successful deletion of a file
• machine sound −→ computer process running

Earcons Tone-based symbol sets, wherein verbal language is replaced with combinations of pitch and rhythmic structures, each of which denotes something in the data or computer environment. They have language-like characteristics of rules and internal references [26]. Abstract, synthetic tones that can be used in structured combinations to create sound messages to represent parts of an interface [42].

Example:


• simple melody for indicating an error status

Sonification Use of data to control a sound generator for the purpose of monitoring and analysis of the data (a minimal code sketch follows this list). Examples:

• mapping data to pitch, brightness, loudness, or spatial position [162]
• changes in bond market data −→ changes in the brightness of a sound

Audification A direct translation of a data waveform into the audible domain for the purpose of monitoring and comprehension. Example:

• listening to the waveforms of an electroencephalogram, a seismogram, or radio telescope data [162]
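
The sonification sketch announced above: a purely illustrative Python fragment (names and parameter values are our own assumptions) that maps each value of a data series linearly onto a pitch and renders one short sine tone per value.

    import numpy as np

    def sonify(values, f_min=220.0, f_max=880.0, tone_dur=0.2, sample_rate=44100):
        """Map each data value linearly onto a frequency between f_min and f_max
        and return the concatenated sine tones as one array of samples in [-1, 1]."""
        values = np.asarray(values, dtype=float)
        lo, hi = values.min(), values.max()
        norm = (values - lo) / (hi - lo) if hi > lo else np.zeros_like(values)
        t = np.arange(int(tone_dur * sample_rate)) / sample_rate
        tones = [np.sin(2 * np.pi * (f_min + n * (f_max - f_min)) * t) for n in norm]
        return np.concatenate(tones)

    # rising bond-market figures become a rising sequence of pitches
    samples = sonify([100.0, 102.0, 101.0, 105.0, 110.0])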

Sound spatialization

An individual sound can be filtered so that it appears to emanate from a specified direction when played over headphones. The filters simulate the distortion of the sound caused by the body, head, and pinna. Today, fast DSP (digital signal processing) hardware allows real-time filtering and dynamic updating of the filter properties, thus allowing interactive processes to be created for real-time spatialization (e. g. allowing head movement and movement of virtual sources). The number of sound sources which can be presented simultaneously is limited by the processing capacity of the DSP hardware. Presently available hardware like the Motorola 56001 can process two spatial sound sources in real time. Spatial sound is presently used in the auralization of models for room acoustics. Spatial sound gives an enhanced situational awareness and aids in the separation of different data streams (“cocktail party effect”). Therefore, future applications could be in the fields of telerobotic control, air traffic control, data auralization, and teleconferencing.
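
The core of such a static binaural display is simply the convolution of a mono signal with a measured pair of head-related impulse responses (HRIRs). The following minimal sketch is our own illustration using generic numpy/scipy calls; the HRIR arrays are assumed to come from some measured HRTF set. An interactive renderer would additionally have to switch or interpolate the filter pair as the head or the virtual source moves.

    import numpy as np
    from scipy.signal import fftconvolve

    def spatialize(mono_signal, hrir_left, hrir_right):
        """Filter a mono signal with a pair of head-related impulse responses and
        return a two-channel (left, right) signal for headphone playback."""
        left = fftconvolve(mono_signal, hrir_left, mode="full")
        right = fftconvolve(mono_signal, hrir_right, mode="full")
        return np.stack([left, right], axis=-1)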

Hardware platforms

The different types of computers on which multimedia applications will run range from personal computers based on the x86 family without any audio output hardware at the low-price end to RISC-based workstations equipped with audio input and output hardware and digital signal processors dedicated to audio processing. In the low-cost segment, many suppliers sell additional hardware for audio output. The quality of the audio output ranges from cheap hardware with 8-bit resolution and a sampling frequency

of 8 kHz up to cards with two or four audio channels, 16-bit resolution and up to 48 kHz sampling rate. Most of these sound cards are also equipped with hardware for the synthesis of sound. These built-in synthesizers support GMIDI (General Musical Instrument Digital Interface), which is an extension of the MIDI protocol. In the workstation market, most workstations such as the SUN Sparc 20 and the Indigo use multimedia codecs which allow sampling rates from 8 kHz to 48 kHz, 8- and 16-bit resolution and two-channel audio input and output. With this hardware, sampled sound stored in files on the harddisk or CD-ROM, or on Audio-CDs, can be played back. Additional MIDI hardware for sound generation can be connected if the serial port supports a transmission rate of 31.25 kBaud. Depending on the computational power of the CPU or signal processing hardware, sound may be manipulated or synthesized on-line. Sound is presented either via built-in loudspeakers or via headphones. Loudspeakers may be built into the monitor or into separate boxes. Most hardware realizations only use one speaker, but some Apple computers have two speakers built into the monitor for stereo sound output. This overview of existing hardware shows that there is great diversity in the hardware platforms in use, which results in big differences in perceived quality and in the amount of information which can be transmitted. Future development will lead to sound output with two channels in HiFi quality, with designated hardware for signal synthesis and signal processing.
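The differences in the amount of transmitted information are easy to quantify: the raw PCM data rate is simply the product of sampling rate, sample size and number of channels. A small illustrative calculation (values as in the configurations mentioned above):

    def pcm_rate_bytes_per_s(sample_rate_hz, bits, channels):
        """Raw (uncompressed) PCM data rate."""
        return sample_rate_hz * (bits // 8) * channels

    low_end = pcm_rate_bytes_per_s(8000, 8, 1)     #   8,000 bytes/s (8 kHz, 8 bit, mono)
    hifi    = pcm_rate_bytes_per_s(48000, 16, 2)   # 192,000 bytes/s (48 kHz, 16 bit, stereo)

Thus the "HiFi" configuration carries 24 times as much raw data per second as the low-end one, which also indicates the difference in storage and bus bandwidth requirements.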

2.2.4 Devices and methods for tactile/haptic output

Many devices have been developed in the last 30 years in order to address the somatic senses of the human operator, but only few have become widely available. The most probable reason is that the devices are either not very useful or very expensive (from US $10,000 up to more than US $1,000,000). By "devices for tactile/haptic output" we mean devices that have been especially designed for this purpose. In some sense, a standard keyboard and mouse also provide some kind of haptic feedback, namely the so-called breakaway force when a key or button, respectively, has been pressed. Although this feedback is important (tests have shown that it increases the input rate in the case of a keyboard), we do not consider these devices within this section. Devices with tactile, haptic, or force output address the somatic senses of the user (see 2.1.4). This can be done by the following methods (taken from [309]):

Pneumatic stimulation This can be achieved by air jets, air pockets, or air rings. Problems arise due to muscular fatigue and the pressure or squeezing effect, which means that the ability to sense is temporarily disabled. Another drawback of pneumatic devices is their low bandwidth.


Vibrotactile stimulation Vibrations can be generated by blunt pins, voice coils, or piezoelectric crystals. These devices seem to be the best ones to address the somatic senses because they can be built very small and lightweight and can achieve a high bandwidth.

Electrotactile stimulation Small electrodes are attached to the user's fingers and provide electrical pulses. First results are promising, but further investigation is needed in this area.

Functional neuromuscular stimulation (FMS) In this approach, the stimulation is provided directly to the neuromuscular system of the operator. Although very interesting, this method is definitely not appropriate for the standard user.

Other methods do not address the somatic senses directly. For example, a force-reflecting joystick can be equipped with motors that apply forces in either of two directions. The same method is used for the Exoskeleton. In the following paragraphs, mainly devices with indirect stimulation methods will be described. Nearly all devices with tactile output have been developed for either graphical or robotic applications3. Many different design principles have been investigated, but the optimal solution has not been found yet. Most probably, the increasing number and popularity of Virtual Reality systems will push the development of force feedback devices to a new dimension. In the following, the most popular devices will be reviewed very briefly in chronological order:

The Ultimate Display (1965) Ivan Sutherland described his vision of an "ultimate display" intended to reflect the internal world of the machine as closely as possible. He proposed to develop a force-reflecting joystick [334].

GROPE (1967 – 1988) In the project GROPE, several "haptic displays" for scientific visualization were developed in different stages of the project [48]. Starting in 1967, a 2D device for continuous force feedback (similar to an X/Y-plotter) was developed. In the next stage (1976), a 6D device was used in combination with stereo glasses for operations in 3D space. The device was a kind of master manipulator with force feedback. In the third (and last) phase of GROPE, which started in the late 1980s, the hardware was improved, but the principal layout of the system was not changed. The results in the domain of molecular engineering seem to be very promising, i. e. the performance has been increased significantly by the use of haptic feedback.

3An exception which will not be further investigated here is an output device for the blind: small pins or needles are used to output text in Braille.


Joystring (1986) A very special design of a 6D manipulator with force feedback has been realized independently by Agronin [4] and Staudhamer (mentioned in [103], further developed by Feldman). A T-shaped grip is installed inside a box with three strings on each of its three ends4. In Agronin's version, the strings' tension is controlled by three linear motors, whereas the second version uses servo motors. No results have been published.

4Obviously, the name "joystring" has been derived from the two words "joystick" and "string".

The Exoskeleton (1988) For telerobotic applications, different skeletons which have to be mounted on the operator's arm have been built at the JPL [150], the EXOS company [87], and the University of Utah (in [269]). The systems are used as a kind of master-slave combination, and forces are applied by motors at the joints. Unfortunately, these devices are usually very heavy, and can therefore only be used in special applications. EXOS, Inc. has developed a "light" version for NASA, but this system does not have any force feedback.

The Compact Master Manipulator (1990) The master manipulator that is presented in [148] is based on flight simulator technology. All three translational and rotational axes can be controlled and are equipped with force feedback. Additionally, the thumb, the index finger, and the other three fingers control a grip which also applies forces to the user, thus yielding a 9-DOF device with force feedback. Unfortunately, the paper only covers the design of the manipulator and does not contain any results.

A 3D Joystick with Force Feedback (1991) A joystick for positioning tasks in 3D space has been developed by Lauffs [174]. The three axes of the stick can be controlled independently, but rotations are not possible. Force feedback has been realized by the use of a pneumatic system, which is very robust but too slow for most applications.

A Force Feedback Device for 2D Positioning Tasks (1991) A device that is very similar to the one developed in GROPE I (see above) has been realized by Fukui and Shimojo [111]. Instead of a knob, the X/Y-recorder is moved with the finger tip. The resulting force is calculated and sent to the application, and if a collision is detected, one or both axes are blocked. This device has been developed for contour tracking operations; its advantage is its almost friction-free and mass-free operation.

PUSH (1991) In [131], a Pneumatic Universal Servo Handcontroller is described. It has been developed for the control of industrial robots. By using cardan joints and an orthogonal coordinate system which is placed in the operator's hand, all axes can be controlled independently. The device, which is rather large, can apply forces by pneumatic cylinders. Like the 3D joystick described above, it is very slow.

Teletact (1991) A data glove with tactile feedback has been developed by Stone [326] and is used for output to the user, whereas a second data glove is used for input to the computer. The input glove is equipped with 20 pressure sensors, and the output glove with 20 air pads, controlled by 20 pneumatic pumps. The major drawback of this system is the very low resolution; in addition, force feedback is missing completely. The next generation, Teletact-II, has been equipped with 30 air pads and is now available on the market.

Force Dials (1992) A dial with force feedback has been realized by computer scientists and chemists for simple molecular modeling tasks [127]. The force is controlled by a motor. The main advantage of this device is its low price and its robustness but due to its simplicity it will not be very useful for a multimodal system.

Multimodal Mouse with Tactile and Force Feedback (1993) In [6], an interesting approach to equipping a standard input device with output capabilities has been described. A common mouse has been equipped with an electromagnet and a small pin in its left button. This idea is especially appealing because it is cheap and easy to realize. First results are very promising and have shown that the performance in positioning tasks can be increased by about 10% with this kind of feedback.

PHANToM (1993) In his master's thesis, Massie developed a 3D input device which can be operated by the finger tip [214]5. It realizes only translational axes, but it has many advantages compared to other devices, like low friction, low mass, and minimized unbalanced weight. Therefore, even stiffness and textures can be experienced.

A 2D Precision Joystick with Force Feedback (1994) Sutherland’s basic idea has been realized by researchers at the University of New Brunswick [10]. The joystick has been made very small in order to achieve a very high precision in its control. Results have shown that the accuracy in a contour modification task can be increased (44%), but the time will increase (64%), too.

5In the meantime, Massie founded his own company which manufactures the PHANToM. Its price is US$ 19,500, which is cheap compared to other ones.


2.3 Bi- and Multimodal Perception

2.3.1 Visual-acoustical perception

In biological systems, acoustical and visual input devices and processing systems have evolved to a very high degree of sophistication, enabling the reception and extraction of precise information about the current state of the environment. Acoustical and visual sensory systems also serve complex communication tasks which, at least in the case of humans, can carry information of almost any complexity. Taken separately, the biological functions of the acoustical and visual systems are implemented by neural systems showing remarkable organization, which is not yet fully understood. The lack of a full understanding of each of the systems makes the task of investigating their cooperation very difficult. However, looking at the integrative aspects of the acoustical and visual systems may in fact lead to a better understanding of their fundamental principles. It can be seen that in many cases the integrative functions of the acoustical and visual systems are similar and complement each other. One of the first tasks of the neural system is to build a representation of the environment, composed of objects placed in a physical space (practically Cartesian) with three spatial dimensions and time. Both the visual and the acoustical system are extremely well fitted for recovering and building spatial and temporal representations. They can both represent space, by mechanisms like spatial vision and spatial hearing. These mechanisms are separate, but their operation is integrated in subtle ways in order to build the most probable and consistent representation of the environment. The consistency of the representation brings up interesting theoretical problems. One fundamental problem is the nature of the integrated audiovisual representation: whether and how it is built on top of the single-modality representations. Another problem is the usual bottom-up versus top-down division in the organization of neural processing, with both organizations participating in a precisely tuned way. Sensory integration aims at building a consistent representation and interpretation of the information flowing from the different senses. This integration can deal with various aspects of information and can take place at different levels of processing. As we do not know exactly what the different aspects and levels of processing are, it is hard to devise a precise taxonomy of the integration. The taxonomy can be approximately devised by looking into the functions and mechanisms of the different sensory systems. One of the basic functions is the representation of spatial and temporal properties of the environment. Time and the spatial dimensions are basic variables for every representation, and they obviously have to be represented consistently. Tuning the measurements coming from the different senses concerning space and time in order to build the most consistent representation is the sensory

integration process. In spatial representation, space filled with objects is considered, while for time representation we talk about events. One can talk about spatiotemporal objects, taking into account both the temporal and the spatial properties of objects. The main problem is how the operation of both systems is integrated into a single audiovisual spatiotemporal representation. It is widely recognized that our knowledge about audiovisual integration is quite limited. This is because of the complexity of the systems involved. Both the acoustical and the visual system taken separately have their own sophisticated organizations, and the integration of their functions is done on top of them. The current status of our understanding can be described as being at a basic experimental level. In [321] there is a review of experimental and physiological facts concerning sensory integration. The topic has been studied mostly by psychological experiments which try to reveal specific properties without providing explanations for the mechanisms responsible. From the application and multimedia point of view these experiments are interesting as background, but they seem to have a narrow scope, depending very much on the particular experimental conditions. This stems from the fact that the underlying mechanisms are usually well hidden, as emphasized by Radeau [280]. To uncover the mechanisms, one has to devise experiments putting them in conflict, or conditions for cross-modal effects as in [353, 318, 249, 282]. There is one basic issue with these approaches, namely that the results are highly dependent on the test signals and conditions. At present only one rule can be formulated which seems to hold: the higher the interaction and complexity of the signals, the more prominent are the cross-modal effects between the acoustical and visual system. A relevant taxonomy of interaction and signals is given in section 4.3, Synchronization. Because of this rule, the cross-modal effects tend to be strongest in the case of speech [208, 128].

2.3.2 Visual-speech perception

The intrinsic bimodality of speech communication

In 1989, Negroponte [242] predicted that "the emphasis in user interfaces will shift from the direct manipulation of objects on a virtual desktop to the delegation of tasks to three-dimensional, intelligent agents parading across our desks", and that "these agents will be rendered holographically, and we will communicate with them using many channels, including speech and non-speech audio, gesture, and facial expressions." Historically, the talking machine with a human face has been a mystical means to power for charlatans and shamans. In that vein, the first speaking robots were probably the famous statues in

ancient Greek temples, whose power as oracles derived from a simple acoustic tube! The statues were inanimate at that time, even though their impressionable listeners attributed a soul (anima) to them because of their supposed speech competence. If this simple illusion already made them seem alive, how much more powerful would it have been if the statues' faces were animated? One can only wonder how children would perceive Walt Disney's or Tex Avery's cartoon characters if their facial movements were truly coherent with what they are meant to say, or with its dubbing into another language. Of course, these imaginary characters are given so many other extraordinary behavioral qualities that we easily forgive their peculiar mouth gestures. We have even become accustomed to ignoring the asynchrony between Mickey's mouth and his mouse voice. What about natural speech? When a candidate for the presidency of the United States of America exclaims "Read my lips!", he is not asking his constituency to lip-read him; he is simply using a classical English formula so that his audience must believe him, as if it were written on his lips: if they cannot believe their ears, they can believe their eyes! But even though such expressions are common, people generally underestimate the actual amount of information that is transmitted through the optic channel. Humans produce speech through the actions of several articulators (vocal folds, velum, tongue, lips, jaw, etc.), of which only some are visible. The continuous speech thus produced is not, however, continuously audible: it also contains significant parts of silence, during voiceless plosives and during pauses, while the speaker makes gestures in order to anticipate the following sound. To sum up, parts of speech movements are only visible, parts are only audible, and parts are not only audible but also visible. Humans take advantage of the bimodality of speech: from the same source, information is simultaneously transmitted through two channels (the acoustic and the optic flow), and the outputs are integrated by the perceiver. In the following discussion, I will pinpoint the importance of the visual intelligibility of speech for normal hearers, and discuss some of the most recent issues in the bimodal aspects of speech production and perception.

Intelligibility of visible speech

It is well known that lip-reading is necessary in order for the hearing impaired to (partially) understand speech, specifically by using the information recoverable from visual speech. But as early as 1935, Cotton [73] stated that “there is an important element of visual hearing in all normal individuals”. Even if the auditory modality is the most important for speech perception by normal hearers, the visual modality may allow subjects to better understand speech. Note that visual information, provided by movements of the lips, chin, teeth, cheeks, etc., cannot, in itself, provide normal speech intelligibility. However, a view

of the talker's face enhances spectral information that is distorted by background noise. A number of investigators have studied this effect of noise distortion on speech intelligibility according to whether the message is heard only, or heard with the speaker's face also provided [330, 241, 21, 93, 95, 331].

[Figure 2.1 plots intelligibility scores (in %) against the signal-to-noise ratio (in dB, from -30 to 0 dB), with one curve for the "Audio only" condition and one for the "Audio + Video" condition.]

Figure 2.1: Improved intelligibility of degraded speech through vision of the speaker's face. The box indicates the mean, and the whiskers the standard deviation.

Figure 2.1 replots articulation scores obtained in French by Benoît et al. on 18 nonsense words by 18 normal hearers in two test conditions: audition only and audition plus vision [17]. We observe that vision is basically unnecessary in rather clear acoustic conditions (S/N > 0 dB), whereas seeing the speaker's face allows the listener to understand around 12 items out of 18 under highly degraded acoustic conditions (S/N = -24 dB) where the auditory-alone message is not understood at all. One may reply that such conditions are seldom found in our everyday lives, only occurring in very noisy environments such as discotheques, some streets, or industrial plants. But using visual speech is not merely a matter of increasing acoustic intelligibility for hearers/viewers: it is also a matter of making speech more comprehensible, i. e., easier to understand. It is well known that information is more easily retained by an audience when transmitted over television than over radio. To confirm this, Reisberg et al. [286] reported that passages read from Kant's Critique of Pure Reason were better understood by listeners (according to the proportion of correctly repeated words in a shadowing task) when the speaker's face was provided to them. Even if people usually do not speak the way Immanuel Kant wrote, this last finding is a clear argument in favor of the general overall

improvement of linguistic comprehension through vision. Therefore, it also allows us to better appreciate the advantage of TtAVS synthesis for the understanding of automatically read messages, assuming that human-machine dialogue will be much more efficient under bimodal presentation of spoken information to the user. An average 11 dB "benefit of lip-reading" was found by MacLeod and Summerfield [199]. This corresponds to the average difference between the lowest signal-to-noise ratios at which test sentences are understood, given presence or absence of visual information. This finding must obviously be tempered by the conditions of visual presentation. Ostberg et al. tested the effects of six sizes of videophone display on the intelligibility of noisy speech [253]. They presented running speech to subjects who were asked to adjust the noise level so that the individual words in the story appeared at the borderline of being intelligible; they observed an increase in the mean benefit of lip-reading from 0.4 to 1.8 dB with increasing display size. This observation confirms the intuitive idea that the better the visual information, the greater the improvement in intelligibility.

The need for coherence between facial gestures and speech sounds

The main problem researchers have to deal with in the area of speech production and bimodal speech perception (by ear and by eye) is the coherence of the acoustic and the visual signals (see [84, 208, 63] for extended discussions of this phenomenon). I will briefly present experimental results obtained from perceptual studies where various kinds of coherence were not respected: when the auditory and visual information channels have spatial, temporal, or source differences.

Spatial coherence It has been established that either modality influences spatial localization of the source through the other [20]: subjects who are instructed to point at a visual source of information deviate slightly from it if a competing acoustic source is heard from another spatial position, and conversely, subjects deviate more from the original acoustic source if a competing optical source interferes from another location. In speech, such a capture of the source is well known and widely used by ventriloquists, as the audience is much more attracted by the dummy, whose facial gestures are more coherent with what they hear than those of its animator [354]! Even four- to five-month-old infants, presented simultaneously with two screens displaying video films of the same human face, are preferentially attracted by a face pronouncing the sounds heard rather than a face pronouncing something else [165]. This demonstrates a very early capacity of humans to identify coherence between facial gestures and the corresponding acoustic production. This capacity is frequently used by listeners in order to improve the intelligibility of a single

person in a conversation group, when the well-known "cocktail party effect" occurs.

Temporal coherence The second problem which arises from the bimodal aspect of speech perception is due to the inherent synchrony between acoustically and optically transmitted information. Dixon and Spitz experimentally observed that subjects were unable to detect asynchrony between the visual and auditory presentation of speech when the acoustic signal was presented less than 130 ms before or 260 ms after the continuous video display of the speaker's face [83].6 Mainly motivated by the applied problem of speech perception through the visiophone (where the unavoidable image coding/decoding process delays the transmission of optical information), recent studies have tried to quantify the loss of intelligibility due to delayed visual information. For example, Smeele and Sittig measured the intelligibility of phonetically balanced lists of nonsense CVC words acoustically degraded by background interfering prose [315]. They measured a mean intelligibility of 20% in the auditory-alone condition of presentation and of 65% in the audio-visual condition. However, if the facial presentation was delayed more than 160 ms after the corresponding audio signal, there was no significant improvement of audio-visual presentation over audio alone. In the other direction, Smeele (personal communication) more recently observed a rather constant intelligibility of around 40% when speech was presented in a range of 320 to 1500 ms after vision. In a similar experiment, Campbell and Dodd had previously discovered that the disambiguation effects of speech-reading on noisy isolated words were observed with up to 1.5 s of desynchrony between seen and heard speech, but they indicated that this benefit occurred whichever modality was leading [54]. On the other hand, Reisberg et al. failed to observe any visual benefit in a shadowing task using the above-mentioned text by Kant with the modalities desynchronized by 500 ms [286]. These somewhat divergent findings strongly support the idea that audition and vision influence each other in speech perception, even if the extent of the phenomenon is yet unclear (i. e., does it operate on the acoustic feature, the phoneme, the word, the sentence, etc.?) and even if the role of short-term memory, auditory and visual, in their integration remains a mystery. I would simply suggest that the benefit of speech-reading is a function not only of the acoustic degradation, but also of the linguistic complexity of the speech material. The greater the redundancy (from nonsense words through isolated words to running speech), the more the high-level linguistic competence is solicited (in order to take advantage of the lexicon, syntax, semantics, etc., in a top-down process), and the more this cognitive strategy dominates the low-level bottom-up decoding process of speech-reading.

6Note that this delay sensitivity is much more accurate in the case of a punctual event, such as a hammer hitting an anvil, where the range is from 75 ms before to 190 ms after.


Source coherence Roughly speaking, the phoneme realizations that are most easily discriminable by the ears are those which are most difficult to distinguish by the eyes, and vice versa. For instance, /p/, /b/, and /m/ look alike in many languages, although they obviously sound unlike, and are often grouped together as one viseme. On the other hand, speech recognizers often confuse /p/ and /k/, whereas they look very different on the speaker's lips. This implies that a synthetic face can easily improve the intelligibility of a speech synthesizer or of a character's voice in a cartoon if the facial movements are coherent with the acoustic flow that is supposed to be produced by them. If not, any contradictory information processed during the bimodal integration by the viewer/listener may greatly damage the intelligibility of the original message. This dramatic effect can unfortunately result if the movements of the visible articulators are driven by the acoustic flow, e. g., through an acoustic-phonetic decoder. Such a device might involuntarily replicate the well-known McGurk effect [220], where the simultaneous presentation of an acoustic /ba/ and of a visual /ga/ (a predictable decoder error) makes the viewer/listener perceive a /da/! I must emphasize that the McGurk effect is very compelling, as even subjects who are well aware of the nature of the stimuli fall for the illusion. Moreover, Green et al. found little difference in the magnitude of the McGurk effect between subjects for whom the sex of the voice and the face presented were either matched or mismatched [129]. They concluded that the mechanism for integrating speech information from the two modalities is insensitive to certain incompatibilities, even when they are perceptually apparent.

The specific nature of speech coherence between acoustics and optics

Speaking is not the process of uttering a sequence of discrete units. Coarticulation systematically occurs in the transitions between the realizations of phonological units. Anticipation or perseveration of articulator gestures in the vocal tract across phonetic units is well known for its acoustic consequences, i. e., for the differences between allophones of a single phoneme. In French, for instance, the /s/, which is considered a non-rounded consonant, is spread in /si/ but protruded in /sy/, due to regressive assimilation; conversely, /i/, which has the phonological status of a spread phoneme, is protruded in /Si/, due to progressive assimilation (which is less frequent). Such differences in the nature of allophones of the same phonemes are auditorily pertinent [65] and visually pertinent [62]. A classic example of anticipation in lip rounding was first given by Benguerel and Cowan, who observed an articulatory influence of the /y/ on the first /s/ in /istrstry/, which occurred in the French sequence une sinistre structure [14] (though this has since been revised by Abry & Lallouache [3]). In fact, the longest effect of anticipation is observed

during pauses, when no acoustic cues are provided, so that subjects are able to visually identify /y/ an average of 185 ms before it is pronounced, during a 460 ms pause in /i#y/ [64]. In an experiment where subjects simply had to identify the final vowel in /zizi/ or /zizy/ [98], Escudier et al. showed that subjects visually identified the /y/ in /zizy/ from a photo of the speaker's face taken around 80 ms before the time at which they were able to identify it auditorily (from gated excerpts of various lengths of the general form /ziz/). They also observed no difference in the time at which subjects could identify /i/ or /y/, auditorily or visually, in the transitions /zyzi/ or /zyzy/. This asymmetric phenomenon is due to non-linearities between articulatory gestures and their acoustic consequences [323]. In this example, French speakers can round their lips (and they do so!) before the end of the /i/ in /zizy/ without acoustic consequences, whereas spreading the /y/ too early in /zyzi/ would lead to a mispronunciation and therefore to a misidentification. To acoustically produce a French /y/, the lips have to be rounded so that their interolabial area is less than 0.8 cm2, above which value it is perceived as /i/ [2]. Lip control is therefore much more constrained for /y/ than for /i/, leading to an anticipation of lip rounding in /i/ → /y/ transitions longer than that of lip spreading in /y/ → /i/ transitions. We see from these observations that coarticulation plays a great role in the possibilities for subjects to process visual information before, or in the absence of, acoustic information. This natural asynchrony between the two modes of speech perception depends upon the intrinsic nature of the phonetic units, as well as on the speech rate and the individual strategy of the speaker. It is obvious that the increase of intelligibility given by vision to audition relies on it.

The bimodality of speech

In the following two paragraphs, two different aspects of the bimodality of speech will be presented: synergetic and specific bimodality.

The synergetic bimodality of speech Setting communication parameters at threshold level, Risberg and Lubker observed that when a speaker appeared on a video display, but with the sound turned off, subjects relying on speech-reading correctly perceived 1% of the test words [290]. When the subjects could not see the display, but were presented with a low-pass filtered version of the speech sound, they got 6% correct. Presented with the combined information channels, performance jumped to 45% correctly perceived test words. This observation exemplifies the remarkable synergy of the two modes of speech perception. However, little is known about the process that integrates the cues across

modalities, although a variety of approaches and models for the multimodal integration of speech perception have been proposed [208, 332, 40] and tested [211, 213, 128, 210, 291]. Our understanding of this process is still relatively crude, but its study is very active and controversial at present (see the 21 remarks to and in [209]!). For an extensive presentation of the various integration models proposed and tested in the literature, see the ICP-MIAMI 94-2 Report by Robert-Ribes.

The specific bimodality of speech To sum up these various psycholinguistic findings: concerning speaker localization, vision is dominant over audition; for speech comprehension, vision greatly improves intelligibility, especially when the acoustics are degraded and/or the message is complex; this speech-reading benefit generally holds even when the channels are slightly desynchronized; due to articulatory anticipation, the eye often receives information before the ear, and seems to take advantage of it; and finally, as for localization, vision can bias auditory comprehension, as in the McGurk effect. The Motor Theory of speech perception supposes that we have an innate knowledge of how to produce speech [187]. Recently, in a chapter of a book devoted to the reexamination of this theory, Summerfield suggested that the human ability to lip-read could also be innate [333]. His assumption allows a partial explanation of the large variability observed in human speech-reading performance, as this ability seems to be related to the visual performance capacities of the subject [76, 295]. Summerfield also hypothesized that evolutionary pressure could have led to refined auditory abilities for biologically significant sounds, but not for lip-reading abilities. Therefore, whatever the innate encoding of speech, whether in an auditory or visual form, an intermediate stage of motor command coding allowing us to perceive speech would provide us not only with the coherence of acoustic and visual signals in a common metric, but also with an improvement in the processing of the speech percept (whose final storage pattern is still an open question). This is my interpretation of the famous formula "Perceiving is acting", recently revised by Viviani and Stucchi [347] into "Perceiving is knowing how to act".

Chapter 3

Control and Manipulation

In this chapter, we will describe the processes of control and manipulation as performed by users of a computer system. Again, two components can be discerned (Figure 1.1, left part). The first component we will approach consists of the “human output channels” (HOC). Theories on motor control will be reviewed briefly. The second component consists of the “computer input modalities” (CIM) that are suitable to be interfaced to the human output activity patterns. In addition, event handling architectures which are relevant for different computer input modalities will be investigated. Finally, our approaches towards bi- and multimodal control schemes will be introduced.

3.1 Human Output Channels

First we will describe some relevant theories of human motor control from a historical perspective. In the Appendix, there will be sections on the HOC channels studied in MIAMI (speech, facial movements, handwriting, gesturing, and object manipulation). There are three theoretical mainstreams in this area. First, the field of Cybernetics, i. e., the sub-field which led to Systems Theory, has exerted a profound influence on our insight into motor control as a process involving target-directed activities and error correction by feedback. Second, Cognitive Motor Theory exposes the need for a description of the representational and computational aspects of motor control. Third, the Systems Dynamics Approach is powerful in explaining a number of motor control phenomena, especially oscillatory behavior.


3.1.1 Cybernetics: Closed-loop control

Cybernetics is defined as "[. . . ] the study of control and communication in the machine or in the animal [. . . ]" [362]. In the current context, we will use the term cybernetics in a narrower sense, i. e., as referring to the study of control systems. The communication and information-theoretical aspects that were originally partly embedded in cybernetics are currently studied in a different field, called informatics or computer science. The name cybernetics comes from the Greek word for steersman (κυβερνήτης), and in fact does not have connotations with respect to communication or information. More specifically still, we will only speak of systems controlling a physical parameter. The following quotation clarifies how Wiener himself envisaged the control problem:

“[. . . ] what we will is to pick the pencil up. Once we have determined on this, our motion proceeds in such a way that [. . . ] the amount by which the pencil is not yet picked up is decreased at each stage [. . . ]” [362, page 14]

Since the development of technical servo systems [362], interest in cybernetics as a paradigm has alternately increased and faded in a number of fields, varying from engineering, biology and psychology to economics. Research in cybernetics has led to powerful mathematical tools and theoretical concepts, gathered under the heading of Control Theory. In theories of motor control, this "closed-loop model" considers sensorial feedback (tactile and/or acoustic) to be the intrinsic source of control, e. g., of the articulatory activities during speech or handwriting production. The articulatory trajectories are planned on-line, depending on the measured feedback, until the target (spatial or acoustic) is reached. Although this model can explain a number of phenomena in motor control, its basic shortcoming is the fact that propagation delays in the biological system are substantial, necessitating other concepts to explain principles like feedforward and predictive control. Tables 3.1 and 3.2 show a number of typical biological reaction and delay times.
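The effect of such propagation delays can be illustrated with a toy closed-loop simulation (a sketch with purely illustrative, non-physiological numbers): a controller repeatedly corrects a position towards a target, but only sees feedback that is several time steps old.

    def closed_loop(target=1.0, gain=0.4, delay=3, steps=40):
        """Discrete proportional feedback control with delayed position feedback."""
        pos = 0.0
        history = [0.0] * (delay + 1)     # buffer of past positions (the 'nerve delay')
        trace = []
        for _ in range(steps):
            perceived = history[0]                    # stale feedback
            pos += gain * (target - perceived)        # correction based on old information
            history = history[1:] + [pos]
            trace.append(pos)
        return trace

    print(closed_loop(delay=0)[-1])   # short delay: smooth convergence to the target
    print(closed_loop(delay=6)[-1])   # long delay: overshoot and growing oscillation

With the delay set to zero the position settles smoothly on the target, whereas a delay of a few control cycles at the same gain produces overshoot and oscillation; this is precisely why pure feedback cannot be the whole story for fast biological movements.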

3.1.2 Open-loop models

Cybernetics elegantly describes some existing phenomena in motor control. In speech, pointing, and handwriting movements, the existence of feedback loops such as the monosynaptic reflex arc and the cortico-cerebellar loops introduces a self-organizing autonomy into the effector system. At the same time, however, the concept of feedback is insufficient to explain the corrective properties of motor control in the case of absent or delayed sensory information.


The simple reaction time between seeing a light flash and pushing a button with the index finger (Donders’s a-type reaction) is, in adults [154]: ...... 230 ms

The reaction time between seeing a light flash bar and pushing a button with the index finger in a binary decision task is, in adults [172]: ...... 419 ms

The 95th percentile reaction time to brake a car in case of an unexpected obstacle [251]: ...... 1600 ms

Table 3.1: Typical reaction times

The time between a passive stretch of an arm muscle and the arrival at the alpha motoneurons in the spine of the afferent neuroelectric burst coming from the muscle spindles is [207]: ...... 23 ms

The time between a discharge of an alpha motoneuron and the peak twitch force of the muscle fibers belonging to the motor unit is [130]: ...... 30–100 ms

The effective duration of a motor unit twitch is (ibid.): ...... > 60 ms

Estimated time between the occurrence of a motor discharge at the motor cortex and the discharge of hand alpha motoneurons (type A fiber, conduction speed 100 m/s) is: ...... ≈ 10 ms

Idem, measured in monkeys [299]: ...... 11 ms

The time between presentation of a light flash and its arrival at the visual cortex [369]: ...... 35 ms

Table 3.2: Typical delay times

Also, the origin of complex patterns like writing is left implicit in a pure cybernetical theory. Experiments by [23, 24] played an essential role in the paradigmatic shift in which feedback as such was increasingly considered to be inadequate as a general explanation of motor control. It was shown that in fast aiming movements of the head or the arm [349], final targets could be reached in the absence of essential feedback information (visual, vestibular, or proprioceptive feedback). The explanation for this phenomenon that was put forward, and that is still accepted for the greater part today, is that the central nervous system determines, in advance of such an aiming movement, the ratios of muscle activation (co-contraction) levels. In this view, the motor apparatus is a combination of tunable mass-spring systems. The role of the existing feedback loops was consequently limited to (1) slow and fine adjustment as in posture control, (2) adaptation to new or strange postures [350] not internalized by the "programming" system, and (3) learning. More cognitively oriented researchers interpret the "open-loop model" in terms of Computer Science and Artificial Intelligence concepts (like the Turing machine concept) and utilize the metaphor of the "brain as a computer". It is assumed that each articulatory activity is controlled through a sequence of instructions [106, 107] (Magno-Caldognetto and Croatto 1986; Harris 1987; Keller 1987; Wilson 1987; Wilson and Morton 1990). In the extreme (purist) case, the articulatory trajectories are pre-programmed (off-line) point by point and are executed deterministically. However, it is well known that only a limited number of parameters is needed to produce curved trajectories. The impulse response of the biomechanical output system, and the well-timed production of intrinsically primitive electromyographical burst patterns, lead to smooth trajectories by themselves. The details of such a movement do not have to be stored explicitly in the brain. What the neural control system does, in fact, is to represent the inverse transform of the non-linear output system in such a way that (1) non-linearities may be counteracted, and (2) useful properties, like the energy present in a mechanical oscillation, may be exploited. The third group of theories elaborates on this latter idea.
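The mass-spring view can be made concrete with a small simulation (a sketch only; stiffness, damping and target values are arbitrary): the controller pre-sets the spring's rest position and stiffness before the movement starts, and the trajectory then unfolds without any on-line error correction.

    def open_loop_reach(target=0.10, mass=1.0, stiffness=60.0, damping=12.0,
                        dt=0.001, duration=0.5):
        """Open-loop 'equilibrium point' movement of a damped mass-spring system."""
        pos, vel = 0.0, 0.0
        trace = []
        for _ in range(int(duration / dt)):
            force = stiffness * (target - pos) - damping * vel   # spring + viscous damping
            vel += (force / mass) * dt
            pos += vel * dt
            trace.append(pos)
        return trace

    final_pos = open_loop_reach()[-1]   # ends close to the 0.10 target without feedback

The smooth approach to the target emerges from the tuned dynamics themselves, which is the essential point of the open-loop account.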

3.1.3 Coordinative structure models

The "coordinative model" [339, 106, 163, 164, 307] (Scott Kelso 1983; Saltzman 1986, 1991; Keller 1990) is based on the metaphor of generalized physics models. Other names referring to the same group of theories are "Synergetics" or "Systems Dynamics" (as applied to motor control). This theoretical direction developed in contrast to cybernetic and algorithmic models. It is based on coordinative structures, that is, functional groups of muscles behaving like coordinated units according to rules learnt by training. The articulatory trajectories are the result of the interactive work of coordinative structures aimed at reaching a "qualitative" target in a flexible and adaptive way (Saltzman 1991).


Using tools from non-linear systems dynamics, oscillatory behaviors like walking and juggling balls can be nicely described. However, the parsimony of the approach breaks down in complex patterning tasks. As an example we can take cursive handwriting. In this motor task, discrete action pattern units are concatenated. Even if we succeeded in describing an oscillator configuration for a single letter, or even two letters, how then are the basic action units concatenated? Is this done by an associative process, or by "buffering"? The same argument holds for speech and manipulation tasks. In order to solve this problem, cognitive representational concepts will eventually be needed in a theory of motor control.
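The flavour of such descriptions can be conveyed by two coupled phase oscillators (a generic Kuramoto-type sketch with arbitrary parameter values, not a model of any specific limb pair): with sufficient coupling, two 'limb' oscillators with slightly different preferred frequencies lock into a fixed relative phase.

    import math

    def relative_phase(coupling, w1=2.0, w2=2.3, dt=0.01, steps=5000):
        """Integrate two coupled phase oscillators and return their final relative phase."""
        p1, p2 = 0.0, 1.0
        for _ in range(steps):
            p1 += dt * (w1 + coupling * math.sin(p2 - p1))
            p2 += dt * (w2 + coupling * math.sin(p1 - p2))
        return (p2 - p1) % (2 * math.pi)

    print(relative_phase(coupling=1.5))   # strong coupling: phase-locked (stable relative phase)
    print(relative_phase(coupling=0.0))   # no coupling: the relative phase drifts continuously

Stable phase and frequency locking of this kind is exactly what the approach captures well; concatenating discrete, letter-like action units is what it does not.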

3.1.4 Relevance of these theories for multimodal interaction

In the context of theorizing on multimodal man-machine interaction, it must be pointed out that phenomena in motor control (HOC) at the physical/physiological level may not and cannot be disconnected from the higher, cognitive and intentional levels. Control problems at first considered intractable, like the degrees-of-freedom problem (inverse kinetics and kinematics), may become quite tractable if task-level constraints are taken into consideration. A simple example is a planar, linearly linked arm with three degrees of freedom, for which the mapping from (x, y) to joint angles is an ill-posed problem, unless the object approach angle of the end effector (the last link) is also known for a specific grasp task (a sketch of this case is given at the end of this subsection).

The second observation is related to the consequences of multi-channel HOC. What happens if more human output channels are used in human-computer interaction? Two hypotheses can be compared:

Hypothesis (1): The combination of different human output channels is functionally interesting because it effectively increases the bandwidth of the human→machine channel. Examples are the combination of bimanual teleoperation with added speech control. A well-known example is typing on a keyboard, where key sequences which require alternating left- and right-hand keying are produced 25% faster (50 ms) [250] than within-hand keying sequences. The reason is thought to reside in the independent, parallel control of the left and right hand by the right and left hemispheres, respectively. It may be hypothesized that other forms of multimodality profit in a similar way, if their neural control is sufficiently independent between the modalities involved.

Hypothesis (2): The alternative hypothesis goes like this: adding an extra output modality requires more neurocomputational resources and will lead to deteriorated output quality, resulting in a reduced effective bandwidth. Two types of effects are usually observed: (a) a slowing down of all the output processes, and (b) interference errors due

to the fact that selective attention cannot be divided between the increased number of output channels. Examples are writing errors due to phonemic interference when speaking at the same time, or the difficulty people may have in combining a complex motor task with speaking, such as simultaneously driving a car and speaking, or playing a musical instrument and speaking at the same time. This type of reasoning is typical of the cognition-oriented models of motor control and may provide useful guidelines for experimentation. At the physical level, however, the combination of output channels may also result in a system which can be described as a set of coupled non-linear oscillators. In the latter case, it may be better to use the coordinative structure models to try to explain inter-channel interaction phenomena, rather than trying to explain the phenomena at too high a (cognitive) level. And finally, the effective target area of cybernetical, closed-loop theories will be those processes in multimodal motor control which can be classified as tracking behaviour or continuous control (as opposed to discrete selection tasks).
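Returning to the degrees-of-freedom example given above: the following sketch (link lengths and the choice of elbow configuration are arbitrary assumptions) shows how fixing the approach angle phi of the last link turns the otherwise ill-posed planar three-joint inverse problem into a closed-form computation.

    import math

    def planar_3link_ik(x, y, phi, l1=0.30, l2=0.25, l3=0.10):
        """Joint angles of a planar 3-link arm reaching (x, y) with approach angle phi."""
        # Task constraint: the wrist must lie one last-link length 'behind' the target.
        wx, wy = x - l3 * math.cos(phi), y - l3 * math.sin(phi)
        d2 = wx * wx + wy * wy
        c2 = (d2 - l1 * l1 - l2 * l2) / (2 * l1 * l2)
        if abs(c2) > 1.0:
            raise ValueError("target outside the reachable workspace")
        t2 = math.acos(c2)                         # one of the two elbow solutions
        t1 = math.atan2(wy, wx) - math.atan2(l2 * math.sin(t2), l1 + l2 * math.cos(t2))
        t3 = phi - t1 - t2                         # last joint realizes the approach angle
        return t1, t2, t3

    print(planar_3link_ik(0.4, 0.2, math.radians(30)))

Without the task-level constraint on phi, infinitely many joint configurations reach the same (x, y); with it, the solution reduces to the standard two-link geometry plus one subtraction.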

3.2 Computer Input Modalities

The human output channels are capable of producing physical energy patterns varying over time: potential, kinetic and electrophysiological energy. Of these, the resulting force, movement and air pressure signals are the most suitable candidates for the development of transducer devices, as has been done in practice already: keys, switches, mouse, XY-digitizer, microphone, or teleoperation manipulandum. As regards the electrophysiological signals, the following can be said. From the development of prosthetic devices for disabled persons, it is known that the peripheral nervous signals are very difficult to analyse and to interpret in a meaningful way. This is because the basic biomechanical transfer function characteristic is not as yet applied to the measured neuro-electric (ENG) or myoelectric (EMG) signals. Some recent advances have been made, incorporating the use of artificial neural network models to transform the electrophysiological signals, but a number of electrodes must be placed on the human skin, and often needle or axial electrodes are needed to tap the proprioceptive activity, which is usually needed to solve the transfer function equation. As regards the development of electrophysiological transducers measuring central nervous system signals, it can safely be assumed that the actual use of these signals in CIM is still fiction (as is functional neuromuscular stimulation in the case of COM, page 27). Surface EEG electrodes only measure the lumped activity in large brain areas, and the measured patterns are seemingly unrelated to the implementation details of

willful and conscious activity. As an example, the Bereitschaftspotential is related to willful activity, in the sense that "something" is going to happen, but it cannot be inferred from the signal what that motor action will be. Disregarding their impracticality for the sake of argument, electrode arrays implanted into the central nervous system would allow for a more subtle measurement of neural activity, but here the decoding problem is even more difficult than in the case of the electrophysiological signals derived from the peripheral nervous system. Therefore, we will concentrate on those types of computer input modalities where input devices measure the force, movement, and sound patterns which are willfully generated by the human user by means of his skeleto-muscular system, under conscious control.

However, before looking in detail at all kinds of possible input and output interfaces that are possible on a computer, it makes sense to look first at what has already been achieved in modern user interfaces. The first interactive computers only had a keyboard and a typewriter in the form of a Teletype (Telex) device. Later in the development, a CRT screen became the standard output. But the keyboard is not the most reasonable form of input for all applications. Many systems nowadays have a 'mouse', more or less as a standard feature. In the near future more input and output devices will be connected to computers. Already, the microphone for audio input is emerging as a standard facility on workstations and on the more expensive personal computers. Video cameras will probably be considered a standard feature in the near future. It is good to consider what kinds of operations are needed in many different applications, and what devices can handle these operations effectively. Input devices can be divided into multiple groups, selected for functionality:

• Pointing devices (mouse, pen tablet, touch screen, data glove, . . . )

• Keyboard (ASCII (Qwerty, Dvorak), Numeric keypad, Cursor keys, MIDI, . . . )

• Sound input (microphone)

• Image input (camera)

• Others (sensors)

In fact, devices other than pointing devices can also be used to input coordinates. For instance, many systems allow the keyboard TAB key to select input fields in a form, which has the same effect as pointing to the appropriate field. Image input devices are currently not part of a user interface. They are used to input

images for further processing. Video cameras, however, are gradually starting to be used interactively in videoconferencing applications. Some input devices supply coordinates indirectly: the position of a joystick indicates movement in some direction, which can be translated into coordinate changes. Quick jumps from one place to another (as are possible with a pen) are difficult with a joystick. The arrow or cursor keys can also be used to change the position of the cursor (in the GEM window manager, the ALT key in combination with the cursor keys is used as an alternative to moving the mouse). In principle, it is also possible to translate two speech parameters (e. g. volume and pitch) into screen coordinates. This might be useful for some applications for disabled people, who would be able to control the position of a screen object by voice only. The simplest pointing device supplies two coordinates (x & y) and a button status. Mice might have more buttons, and a pen might have a continuous pressure/height parameter. Pens might have additional parameters describing the angles they make with the tablet (in the x and y directions) and the rotation of the pen. But these are only likely to be used in special applications.
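The indirect, rate-controlled mapping mentioned above (joystick deflection, or in principle a speech parameter, interpreted as a cursor velocity) can be sketched as follows; the gain and dead-zone values are arbitrary illustrative choices:

    def update_cursor(cursor_xy, deflection_xy, dt, gain=400.0, dead_zone=0.1):
        """Integrate a normalized deflection in [-1, 1] per axis into a cursor position (pixels)."""
        x, y = cursor_xy
        dx, dy = deflection_xy
        vx = gain * dx if abs(dx) >= dead_zone else 0.0   # ignore small, unintended deflections
        vy = gain * dy if abs(dy) >= dead_zone else 0.0
        return x + vx * dt, y + vy * dt

    pos = (320.0, 240.0)
    for _ in range(50):                    # half a second of full-right deflection at 100 Hz
        pos = update_cursor(pos, (1.0, 0.0), dt=0.01)
    # pos is now about 200 pixels to the right of the starting point

The dead zone and the integration step are what make quick, precise jumps awkward with such devices, in contrast to the absolute positioning of a pen.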

3.2.1 Keyboards

Five possible groups of operations that can be performed by the keyboard are:

1. Text Input. Typing (mostly Western-language) characters which are displayed directly.

2. Action. Keys like ENTER/ESCAPE generally have the function of accepting/cancelling the prepared action. Keys like HELP, UNDO, DEL, BACKSPACE and the function keys also belong to this group.

3. Mode change. These are similar to the Action keys, but only work as on/off or selection switches. Examples are the CAPS-LOCK and the INS-key. In fact the SHIFT, CTRL and ALT-key could also be in this group.

4. Navigation. This is performed by the cursor keys, TAB, HOME, END.

5. Others (e. g., MIDI-keyboards)

Keys sometimes have different functions in different modes, or even in different applica- tions. For instance on most keyboards CTRL-’[’ is the same as the ESCAPE key.


The most common keyboard is the Qwerty keyboard, used on most computers and typewriters. The meaning of a key is given by its label, not by its position. This is also true for the Dvorak keyboard. Even minor variations of key location between different keyboards are very annoying to the user. There also exist chord keyboards, in which the simultaneous stroking of a combination of keys gives some predefined output. The function is generally the same as for a normal keyboard, only fewer keys are necessary. Letter-order ambiguities in a chord are solved by an internal language model (example: ". . . [txr] . . . " → ". . . xtr . . . "), such that entering abbreviations or text in another language is not very well possible. In these cases, the user must fall back on single-key entry. Another kind of keyboard exists for a completely different purpose, but still belongs to the same group of input devices: MIDI keyboards. MIDI (Musical Instrument Digital Interface) is a standard for exchanging music-related information. MIDI keyboards have keys as on a piano or organ. The only labels are the colors (black and white), and the further meaning of a key is fully signified by its position. These keyboards are often pressure-sensitive, allowing the volume of each sound to be controlled by the keypress alone.
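The disambiguation step can be illustrated with a toy sketch (the bigram table is hypothetical and far too small for real use; an actual chord keyboard would rely on a full language model or a fixed chord table): the chord delivers an unordered set of letters, and the statistically most plausible ordering is chosen.

    from itertools import permutations

    # Tiny, hypothetical bigram scores standing in for a real language model.
    BIGRAM_SCORE = {"th": 5, "he": 4, "er": 4, "re": 3, "tr": 2, "rt": 1}

    def score(word):
        return sum(BIGRAM_SCORE.get(word[i:i + 2], 0) for i in range(len(word) - 1))

    def disambiguate(chord_letters):
        """Return the highest-scoring ordering of the simultaneously pressed letters."""
        return max(("".join(p) for p in permutations(chord_letters)), key=score)

    print(disambiguate({"t", "h", "e"}))   # -> "the" under this toy table

Exactly because the model encodes the statistics of one language, chords from abbreviations or foreign words tend to be reordered wrongly, which is why single-key entry remains necessary as a fall-back.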

3.2.2 Mice

Mice are the most common input devices for graphical user interfaces (GUIs). They come in different shapes and are equipped with one, two, or three buttons which are used for positioning and selection tasks. Mice can work either mechanically or optically. Within this report, the method of construction is not as interesting as the functionality of the mouse. Only three of the above mentioned operations are possible with the mouse:

1. Action On screen, some action buttons are displayed, like OK or CANCEL. Moving the mouse pointer to these buttons and clicking the mouse button starts the desired action.

2. Mode change This can be performed by clicking check-boxes or radio-buttons. But also the clicking on windows to select them, can be seen as a mode change.

3. Navigation. This is the most natural use of the mouse. Through software, however, the mouse can do more:

• Dragging. Selecting an item while keeping the mouse button down, moving the mouse to a new location, and releasing the mouse button. The effect of this is moving or copying the item to a new location.

• Numerical ('Value') input. By selecting a scroll bar and dragging it, windows can be scrolled. Using sliders, a continuous input value can be given by dragging the slider control.

3.2.3 Pens

For the measurement of the pen-tip position in two dimensions, i. e., the writing plane, a wide variety of techniques exists. Using resistive, pressure-sensitive, capacitive, or electromagnetic transducer technology, the position of a pen can be accurately measured. Sometimes, even acceleration transducers inside the pen are used to measure pen movement [185]. Table 3.3 gives an overview of parameters that may be controlled with a pen.

x, y        position (velocity, acceleration, etc.)
p           pen force ("pressure"), sensed by a binary pen-up/pen-down switch or an analog axial force transducer
z           height
φx, φy      angles
Switching   by thresholding of pen force p (above) or with additional button(s) on the pen

Table 3.3: Parameters controlled by a pen
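A possible in-memory representation of one pen sample, covering the parameters of Table 3.3, could look as follows (a sketch; field names, types, and units are illustrative assumptions):

// One sample from a pen/digitizer, following the parameters of Table 3.3.
struct PenSample {
    double t;         // time stamp (s); velocity/acceleration follow by differencing
    double x, y;      // pen-tip position in the writing plane (mm)
    double p;         // axial pen force ("pressure"), e.g. in newton
    double z;         // height above the writing plane (mm), if sensed
    double phi_x;     // pen angle with respect to the writing plane, x direction (rad)
    double phi_y;     // pen angle, y direction (rad)
    bool   pen_down;  // binary pen-up/pen-down switch, or thresholded force
    bool   button;    // optional barrel button state
};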

Identifying the “inking” condition

An essential issue when using pens as an input device is the detection of the "inking" condition. This is not a trivial problem. In a normal pen, such as the ballpoint, the inking process starts when the pen tip touches the paper and the ball starts to roll, smearing the paper surface continually with ink. In electronic pens, we can make a distinction between two dimensions: the height of the pen tip above the writing plane, and the force exerted by the pen on the writing plane after contact is made. If the height of the pen tip is within a threshold, say 1 mm above the paper, this is usually identified as the "pen-near" condition. However, height is not a reliable identifier for inking and leads to annoying visible hooks in the measured trajectory at the pen-down and pen-up stages.


A more reliable method is the measurement of the normal force. Direct measurement of the normal force can be done by touch tablets or by force transducers under the writing surface. Another technique is using the axial pen force. The pen may be equipped with a force transducer, which measures the force component exerted in the longitudinal direction of the stylus. Either the transducer is positioned at the end of the stylus within the barrel, or it is integrated within the stylus.

If the analog axial force is used to identify the "inking" condition, a force threshold must be used, above which we may safely assume that the pen is on the paper. However, analog amplifier drift and mechanical problems like stiction and hysteresis pose a problem. Writers have their typical average pen force (0.5 N – 2.5 N), such that the threshold cannot be fixed to a single small value. In practice, thresholds between 0.15 N and 0.4 N have been used for pen-up/pen-down identification. If necessary, the normal force may be calculated from the axial pen force if the pen angle with respect to the writing surface is known. Pen force is a useful signal for writer identification, but the force variations are mostly unrelated to character shape [302].

However, the most often used technique for identifying the inking condition is an axial on/off switch inside the barrel. In the case of writing with an electronic pen with a ballpoint refill on paper, there may be an inconsistency between the actual ink path on paper and the recorded pen-up/pen-down sequence because the writer did not produce sufficient force to activate the switch. In such a case, pieces of the trajectory are misidentified as a pen-up sequence. Another problem with binary tip switches is that their axial movement is noticeable at each pen-down event, which disturbs the writing, drawing or menu selection movements of the user. Axial force transducers which infer axial force from the amount of compression (axial displacement) of the transducer itself also suffer from this problem. An acceptable axial displacement at the time of contact is 0.2 mm; larger displacements are unacceptable for serious use. The use of pens with an axial displacement which is larger than 0.2 mm is restricted to 'sketching'-like drawing.
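A minimal sketch of pen-up/pen-down identification from the analog axial force is given below. It uses two thresholds (a hysteresis band) so that amplifier drift and stiction do not cause spurious transitions; the plain single-threshold scheme described above is the special case in which both values coincide. The 0.15 N and 0.4 N defaults lie within the range mentioned in the text, but the exact choice is an application-dependent assumption.

// Pen-up/pen-down identification from the analog axial pen force.
class InkingDetector {
    bool   pen_down_ = false;
    double down_threshold_;   // force above which we assume the pen is on the paper
    double up_threshold_;     // force below which we assume the pen has been lifted
public:
    explicit InkingDetector(double down_N = 0.4, double up_N = 0.15)
        : down_threshold_(down_N), up_threshold_(up_N) {}

    // Call once per force sample; returns the current inking state.
    bool update(double axial_force_N) {
        if (!pen_down_ && axial_force_N > down_threshold_) pen_down_ = true;
        else if (pen_down_ && axial_force_N < up_threshold_) pen_down_ = false;
        return pen_down_;
    }
};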

Integrated display and digitizer

The "inking" problems mentioned above are solved by using Electronic Paper (EP), which consists of an integrated display (e. g., a Liquid Crystal Display, LCD) and an XY-digitizer. One of the earliest EP devices consisted of a 12"x12", 1024x1024 pixel plasma device with an integrated XY-digitizer [44]. In the case of EP, a non-inking ("untethered") stylus refill is used for the pen. By providing 'electronic ink', i. e., a trace of legible pixels directly below the pen tip, writers will automatically adapt the exerted force to produce a clean ink path, due to the immediate graphical feedback.

Electronic paper currently has a number of other problems, which have mainly to do with the quality of the LCD screen, the writing surface texture, and the parallax problem. The parallax problem refers to the perceived disparity between pen-tip position and ink pixels, due to the thickness of the display glass or plastic cover. Another problem is the friction of the writing surface. Plain glass, for instance, would be much too slippery for normal use, since writers are used to the higher friction of the paper surface. In writing on paper, the friction is assumed to be 4% of the normal force [85]. For this reason, the writing surface of electronic paper is treated to be non-glaring and to have an acceptable friction. Finally, the resolution and image quality of even the best LCD display is incomparable to the high resolution of ballpoint ink on white paper. A typical dimensioning is 640 pixels over 192 mm. This yields a pixel width of 0.3 mm, which is three times worse than the average whole-field digitizer accuracy (0.1 mm), and 15 times larger than the local digitizer resolution (0.02 mm). Particularly problematic are unlit monochromatic (black & white) LCD screens. At least 64 gray levels are needed, which also allows anti-aliasing techniques to be used for the graphical inking process. Currently, color TFT versions of electronic paper are appearing, but the relatively high-level signals of the in-screen transistors inject noise into the position sensing apparatus [312].

Signal processing issues

Using current state-of-the-art technology [221, 352], a typical resolution of 0.02 mm and an overall accuracy of 0.1 mm can be achieved. Taking into account only the bandwidth of handwriting movements [338], a Nyquist frequency of 10-14 Hz would indicate a sampling rate of 20-28 Hz. However, since this means that there are only about two sample points per stroke in a character, interpolation (e. g. by convolution) is absolutely necessary to recover the shape, and the delay between samples will also be annoying in interactive use. Using higher sampling rates is generally a cheaper solution. The most often used sampling rate is 100 Hz for handwriting and 200 Hz for signature verification, and the minimum for interactive use is probably about 50 Hz, yielding a delay of 20 ms, which happens to be 20% of a stroke movement. Another solution is to use spatial sampling, generating coordinates only if a position change above a given threshold is detected. This is similar to most types of sampling in the mouse. However, the disadvantage of spatial sampling is that the result is not a proper time function, such that operations like digital filtering are ill-defined. This drawback is sometimes counteracted by adding time stamps to each detected position change.
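The spatial-sampling idea with added time stamps can be sketched as follows; the 0.1 mm threshold and all names are illustrative assumptions, not values prescribed by this report.

// Spatial sampling of pen coordinates: a point is kept only when the pen has
// moved more than a given distance since the last kept point. The time stamp
// is retained so that the data can still be treated as a (non-uniform) time
// function, e.g. for later digital filtering after resampling.
#include <cmath>

struct TimedPoint { double t, x, y; };   // seconds, millimetres

class SpatialSampler {
    double     threshold_;               // minimum displacement, e.g. 0.1 mm
    bool       has_last_ = false;
    TimedPoint last_{};
public:
    explicit SpatialSampler(double threshold_mm) : threshold_(threshold_mm) {}

    // Feed every raw sample (e.g. at 100 Hz); returns true if the point is kept.
    bool accept(const TimedPoint& p) {
        if (has_last_) {
            const double dx = p.x - last_.x, dy = p.y - last_.y;
            if (std::sqrt(dx * dx + dy * dy) < threshold_) return false;
        }
        last_ = p;
        has_last_ = true;
        return true;
    }
};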

Pen ergonomics

Originally, pens provided with digitizer tablets had a diameter of 10-14 mm and were connected to the tablet by a wire. Such a pen thickness is unusable for handwriting, especially if the pen tip is not visible while writing. From the ergonomical point of view, a good guideline for the pen thickness is a diameter of 8 mm [194]. Fortunately, current technology allows for wireless pens, since the wire was a continual nuisance (curling, risk of connection damage, etc.). In the case of touch-sensitive technology, any object with a sharp tip may be used (as long as it does not scratch the surface). In the case of electromagnetic tablets, a transponder technique is used. Formerly, either the tablet grid or the pen coil was the transmitter, and the other component was the receiver of a high-frequency carrier (50-150 kHz). The peak of the electromagnetic field strength corresponds to the (estimated) pen-tip position. With the transponder technique, modern tablets transmit in short bursts and receive the emitted resonant signal from a coil-and-capacitor circuit which is fitted inside the pen and tuned to the carrier signal. The tablet thus alternates between transmission and reception. Axial pen force can be coded, e. g., by capacitively de-tuning the circuit in the pen. A disadvantage of the wireless pen in electromagnetic devices is that it is expensive and may be easily lost, or mixed up with regular pens. For this reason, pen-based Personal Digital Assistants are equipped with touch-sensitive tablets, on which any pen-shaped object may be used.

In addition to the internal "inking" switch, electronic pens may have a barrel switch at the side of the stylus. It is still unclear whether this unnatural addition, as compared to the normal pen-and-paper setup, is ergonomically acceptable. Two problems are present. First, the index finger is normally used for ensuring a stable pen grip. Applying force to the barrel button disturbs this posture-stability function of the index finger. Especially the release movement from the button is incompatible with normal pen manipulation: the finger must 'find' a free area on the pen after the switching action. The second problem is more on the software and application side. Since there is no standard approach for using the side switch across applications, the user may forget the function of the side switch, much as mouse users often forget the function of the second (and/or third) mouse button.

Other signals

In acceleration-based pens, the pen angle and rotation may be derived from the acceleration signals. These pens are very expensive and usually have a deviant shape and/or size. In electromagnetic systems, the skewness of the sensed electromagnetic field function is indicative of the pen angles with respect to the writing plane [194]. This property is also used in commercial systems to improve the accuracy of the pen-tip position estimation.


3.2.4 Cameras

This device can, depending on the frame rate, supply single images or moving-image video to the computer. Because the amount of data coming from such a device is large, only live display in a screen window, or data compression and storage for later playback, are currently possible on the computer in practical settings. For more advanced features, much more processing power is necessary than is generally available today. Within the context of MIAMI, the camera will be used for lip reading and videophone experiments. Current camera technology makes use of optically sensitive charge-coupled devices (CCDs). A density of 400k pixels is standard (e. g., 752x582). Optical signal-to-noise ratios may be 50 dB, with a minimum lighting condition of 0.9 lux. More and more, camera parameters like tilt, pan, zoom, focus, and diaphragm may be remotely controlled from the computer.

3.2.5 Microphones

The microphone and the camera have in common that recording, storage and playback are already useful (in fact the same holds for the pen, too). Because audio has lower demands than video, much more is possible. Some applications could be controlled by speech recognition, or the incoming sound itself could be processed in some way (e. g., filtered). The most useful application of sound in computers will be speech recognition. Several methods have already proved to be capable of reasonable recognition in real time. Especially the Hidden Markov Model (HMM) approach is becoming popular.

3.2.6 3D input devices

Because 3D input devices are mainly used for special applications, they will not be described here in detail. Nevertheless, some devices will be reviewed briefly:

Spaceball A device which provides three translational and three rotational degrees of freedom (DOFs). It can be used in CAD and robotic applications, for instance. Typical representatives are the SpaceMaster and the SpaceMouse (see also section 3.4.4).

Data glove The data glove is a glove which is equipped with a tracking sensor. The (unconstrained) movements of the user’s hand in 3D space are sensed and used to control the application. Data gloves are most often used in VR applications. They might be equipped with tactile feedback capabilities, see the Teletact on page 29.


Tracker The tracker is very similar to the data glove, but instead of a glove it has to be held in the hand. Therefore, no feedback can be applied. The most popular one is the Polhemus tracker, which is used in VR applications. Both the data glove and the tracker are often used in combination with head-mounted displays (HMDs).

Joystring See section 2.2.4, page 28 for a discussion of the joystring.

3.2.7 Other input devices

Other input devices which will not be further reviewed within this report include the following:

Dial A dial is an input device with only one rotational DOF which is mainly used in CAD applications to control the view of a model or a scene. Usually, several dials are used to control more DOFs.

Paddle/Slider Paddles and sliders also have only one DOF, but in this case it is a translational one. They are very cheap and are mainly used in the entertainment industry.

Trackball The trackball's functionality is comparable with that of the mouse (see section 3.2.2). Trackballs are mostly used in portable computers (laptops and notebooks) in order to save the space that is needed to move a mouse.

Joystick Joysticks are mainly used in the entertainment industry. They are used to control cursor movements in 2D and are equipped with one or more buttons.

Digitizer Digitizers are 2D input devices for the exact input of points, usually from already existing sketches. Their main purpose is to digitize hand-drawings, i. e. to get 2D scenes into a format which can be processed by a CAD program.

Eyetracker Eyetrackers are controlled by the movements of the user's eyes, which are sensed with infrared light. It is possible for the user to control the cursor by simply looking in the desired direction.

Touchpad/ Both devices may be operated by a pen or by the user’s fingertips. With the touchscreen, the user can directly point to objects on the screen. Both devices are mainly used for simple pointing and selection tasks.


3.2.8 Generalized input devices

In fact we see a translation of 'user input actions' to 'events'. Clicking a 'cancel' button with the mouse has exactly the same result as pressing the ESC key. This suggests that it is useful to design an extendable set of events, independent of the input device used, such as:

Action events    OK, CANCEL, HELP, DELETE, UNDO
Navigation       Select an object visible on screen
Dragging         Move an object to another place; copy an object to another place
Value input      Type the value with the keyboard (or use +/-); use a slider

Input from other input devices (pen, speech, external knobs, joysticks, etc.) can all be translated to such events, such that it makes no difference for the application where the input comes from (see also the Meta Device Driver, figure 3.3, on page 70). This is the first step of implementing multimodality. Some examples are:

PEN     Certain gestures are translated to action events (OK, CANCEL).
        Clicking on a screen object means selecting it.
        Pointing at an object, keeping the pen down, then writing a line means moving the object to another place.
        Writing characters can be used for direct ASCII input.

SPEECH  Speaking some words means an action (OK, CANCEL).
        Other recognized words can be translated to direct ASCII input.

In order for this translation to be correct, a lot of interaction between input device and application is needed. Pressing a mouse button has different results, depending on the coordinates. Translation into actions is only possible if the 'mouse driver' knows which screen area is covered by the appropriate button. Similarly, it is very useful for a speech recognition system to know which words are expected in some application. The resulting schematic is shown in figure 3.1. The arrows in this figure represent different flows of information. The applications perform some initialization (e. g. choice of pen resolution), or supply information to devices when their input is requested. If for instance an application cannot accept sound input, it is not necessary for the audio input system to supply any data. A similar argument holds for XY tablets. In general, switching a peripheral sampling process on and off may save CPU time.


Mouse, Keyboard, Pen, ... →(input actions)→ Device Drivers →(events)→ Application 1 ... Application n

Figure 3.1: A schematic model for generalized input devices

A basic function of drivers is to translate data into a hardware-independent form. As an example, mouse-button clicks may be translated into keyboard key presses. Also, spoken words may be processed by a speech recognition system to simulate mouse-button or keyboard-key presses too. As an additional functionality, drivers might translate hardware coordinates (like tablet XY coordinates) into application coordinates (in this example the screen bitmap). In practice, it is always necessary that there is two-way communication between the applications and the input devices, usually with some drivers in between. These drivers in principle can take any form and degree of software complexity, from simple pass-through interfaces to complex handwriting or speech recognition systems.
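A minimal sketch of this driver layer is given below: raw input actions are translated into device-independent events, so the application cannot tell whether CANCEL came from the ESC key, a mouse click on a CANCEL button, or a pen gesture. All event names and driver functions are illustrative assumptions, not an interface defined in this report.

// Generalized input events: device drivers translate hardware-specific input
// actions into a small, device-independent event vocabulary.
#include <iostream>
#include <string>

enum class EventType { Ok, Cancel, Help, Delete, Undo, Select, Drag, Value };

struct GeneralEvent {
    EventType type;
    double    value = 0.0;   // for Value events (e.g. a slider position)
    int       x = 0, y = 0;  // for Select/Drag events, in application coordinates
};

// Hypothetical keyboard driver: keys become action or value events.
GeneralEvent fromKeyboard(int key) {
    if (key == 27)   return {EventType::Cancel};          // ESC
    if (key == '\r') return {EventType::Ok};              // ENTER
    return {EventType::Value, static_cast<double>(key)};  // plain text input
}

// Hypothetical pen driver: recognized gestures become the same events.
GeneralEvent fromPenGesture(const std::string& gesture, int x, int y) {
    if (gesture == "cross-out") return {EventType::Delete, 0.0, x, y};
    return {EventType::Select, 0.0, x, y};                // a tap means selection
}

// The application only sees GeneralEvents, never the devices themselves.
void application(const GeneralEvent& ev) {
    if (ev.type == EventType::Cancel)
        std::cout << "action cancelled\n";
    else if (ev.type == EventType::Delete)
        std::cout << "delete object at " << ev.x << "," << ev.y << "\n";
}

int main() {
    application(fromKeyboard(27));                     // ESC key
    application(fromPenGesture("cross-out", 40, 60));  // deletion gesture
}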

Note that the same schematic can be applied to general output devices (COM):

Application 1 ... Application n →(events)→ Device Drivers →(output actions)→ Screen, Printer, Sound, ...

Figure 3.2: A schematic model for generalized output devices


3.3 Event Handling Architectures in CIM

Most current state-of-the-art applications are event-driven, for example all applications written for X11 or GEM. Every time something happens (e. g. a key is pressed or the mouse is moved), an event is generated. An event is characterized by a unique identity code and a data structure containing information about this event. Usually, the system maintains a queue of events, in order to ensure that all events are handled, independent of CPU usage fluctuations. This means that "immediate response but probability of missed events" is traded off against "possibly delayed response but all events are handled". There exist standard procedures for getting an event from the event queue. The application is responsible for performing some action that corresponds to the event. As an example, each application is responsible for the integrity of its screen content. Applications send each other events in case they modify the screen content. The receiving application must then decide whether a portion of the screen must be "repainted".

The effect on the application is that long calculations are forbidden, because if events are not processed for a long time, the user cannot do anything but wait. Events such as the Ctrl-C or ESCAPE key cannot be used anymore for interrupting the calculation process, unless they are purposely processed inside the calculation loop. This is not a good way of programming, because the application then has to take care of all possible events.

In MS-Windows, a slightly different approach is taken. Events are called 'messages', and they are used for all inter-application communications. Because MIAMI does not use MS-Windows, this is not considered further.

Two levels of event handling can be considered. At the lower level (GEM, X11), events are generated by the hardware, together with some low-level information. In Motif and Tcl/Tk, a layer is put on top of this, relieving the application of a lot of housekeeping chores.

3.3.1 Within-application event loops: GEM, X11

GEM and X11 applications usually have one loop that cyclically gets an event from the event queue and handles it. The event-handler procedure usually consists of a hierarchy of case statements testing for each possible event and performing the desired action. These actions must be short, because during the event handling no other events can be processed. Lengthy calculations in response to a button click cannot be done in the event handler.
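For illustration, a stripped-down Xlib-style event loop is sketched below (window size, event mask, and the reaction to each event are arbitrary choices for the example; error handling is omitted).

// Minimal within-application event loop in the X11 (Xlib) style.
// Compile with: c++ eventloop.cpp -lX11
#include <X11/Xlib.h>
#include <X11/keysym.h>
#include <cstdio>

int main() {
    Display* dpy = XOpenDisplay(nullptr);          // connect to the X server
    if (!dpy) return 1;
    Window win = XCreateSimpleWindow(dpy, DefaultRootWindow(dpy),
                                     0, 0, 400, 300, 1,
                                     BlackPixel(dpy, 0), WhitePixel(dpy, 0));
    XSelectInput(dpy, win, ExposureMask | KeyPressMask | ButtonPressMask);
    XMapWindow(dpy, win);

    for (;;) {                                     // the central event loop
        XEvent ev;
        XNextEvent(dpy, &ev);                      // blocks until an event arrives
        switch (ev.type) {                         // hierarchy of case statements
        case Expose:                               // repaint request
            /* redraw the window content here */
            break;
        case ButtonPress:                          // mouse click
            std::printf("click at (%d, %d)\n", ev.xbutton.x, ev.xbutton.y);
            break;
        case KeyPress:                             // keep handlers short
            if (XLookupKeysym(&ev.xkey, 0) == XK_Escape) {
                XCloseDisplay(dpy);
                return 0;
            }
            break;
        }
    }
}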


3.3.2 Event-routine binding: Motif, Tcl/Tk

As the application grows, the event handler becomes more and more complex. Usually the application consists of many parts (menus, dialog boxes, etc.), while the event handler has to process all events meant for all of these parts. Both Motif and Tcl/Tk are libraries using X11. Because the event loop is part of the library, it is invisible to the application. Tables are kept internally which store a procedure pointer for each possible event, window, class, or object. The application can read and write these tables at any time, and thereby influence the handling of every event. This solution is called "event binding". Because every window keeps its own event binding table, it is easy for an application to serve many windows at the same time.
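The essence of event binding can be pictured as a per-window table that maps event descriptions to routines; the hidden event loop only dispatches. The sketch below mimics the idea behind Motif callbacks and the Tcl/Tk bind command in C++; the class and event names are illustrative, not an existing API.

// Conceptual sketch of event-routine binding: each window owns a table of
// event -> handler entries, which the (library-internal) event loop consults.
#include <functional>
#include <iostream>
#include <map>
#include <string>

struct Event { std::string name; int x = 0, y = 0; };

class BoundWindow {
    std::map<std::string, std::function<void(const Event&)>> bindings_;
public:
    // The application can add or change bindings at any time.
    void bind(const std::string& eventName,
              std::function<void(const Event&)> handler) {
        bindings_[eventName] = std::move(handler);
    }
    // Called by the toolkit's hidden event loop, not by the application.
    void dispatch(const Event& ev) const {
        auto it = bindings_.find(ev.name);
        if (it != bindings_.end()) it->second(ev);
    }
};

int main() {
    BoundWindow w;
    w.bind("ButtonPress-1", [](const Event& e) {
        std::cout << "object selected at " << e.x << "," << e.y << "\n";
    });
    w.dispatch({"ButtonPress-1", 120, 45});   // normally done by the toolkit
}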

3.4 Bi- and Multimodal Control

3.4.1 Visual-gestural control

Gestural control is a relevant aspect in multimedia systems. With the current state-of-the-art human movement tracking technology, it is possible to represent most of the degrees of freedom of a (part of the) human body in real time. This allows for a growing number of applications, from advanced human-computer interfaces in multimedia systems to new kinds of interactive multimedia systems (e. g., [175, 247]). The classification of gestural information patterns in human-machine interaction is an important research topic. A short overview of the existing literature is given in Appendix D, mainly referring to robotics and cognitive/psychological studies. The recent research on virtual environments has strongly accelerated the research in this field — see for example [247]. From another point of view, it is also important to analyze human activities which involve significant gestural communication. A significant field is related to the study of the languages developed over centuries for dance and music. This field is important also from the point of view of the study of the involved emotional communication [252, 11]. The main areas concern

(i) dance gesture languages and classifications of movement patterns [55, 297, 51, 336, 340];

(ii) languages for music conducting [232];

(iii) the study of the movement involved in musical instrument performance [9]; and

(iv) the recent research efforts on "Hyper-Instruments" at MIT and in several other research centers [234, 196, 360], to simulate existing musical instruments and to find new routes for communicating movement and emotional content in artistic performance.

A complete survey of all these four aspects is given in [234]. Different taxonomies of gesture can be defined with respect to different aspects of gesture communication. The following are some main examples, explored in WP 2.3:

1. Types of movement detected:

• gross-grain movements (typically full-body);

• medium-grain movements (arms, legs; e. g., dance patterns);

• fine-grain movements (e. g., hand gestures).

2. Types of sensing devices:

• inertial sensors (inclinometers, accelerometers, gyrometers and gyroscopes);

• exoskeletons or compliant manipulators, grasped by users or linked to joints of his/her body. Such devices are controlled by a tunable compliance matrix;

• position, distance sensors (e. g., Costel, MacReflex, V-scope), e. g. based on infrared and ultrasound sensor technology, which should be able to detect in real-time the three-dimensional position of a number of markers, typically worn by the user;

• devices for gross-movement detection, e. g. a non-linear matrix of active infrared sensors together with a sensorized floor, covering a single-user 3D area. Examples of gross movements include rhythmic (part-of-the-)body patterns and the energy of the movement (which/how many sensors are activated per time unit).1

3. Semantics:

The important aspect here is the meaning associated with different gestures. The use of metaphors and analogies with other modalities is fundamental for devising evocative associations to gestures.

1This last case is adopted in the SoundCage IDMS, a joint project of SoundCage Ltd. and the DIST Computer Music Lab.


3.4.2 Handwriting-visual control

Bimodality in Handwriting-Visual Control can be interpreted in at least two ways:

1. Handwriting as a human output channel (HOC), controlling visual (i. e., graphical) properties of the computer output (COM)

2. Handwriting (HOC), combined with other forms of visible human behavior (HOC)

Ad 1. Originally, in the proposal stages of the MIAMI project, handwriting-visual control was more concerned with handwriting as a human output channel (HOC), controlling graphical properties of the computer output (COM) other than handwriting or drawing ink patterns (WT 2.4, [300, page 21]). Furthermore, the accent lies on discrete, symbolic interaction, as opposed to continuous control, which is dealt with elsewhere (WT 2.3, [300, page 20]). More specifically, one may consider the control of graphical objects on a CRT screen by using a pen for discrete selection, i. e., by pointing and tapping (cf. Appendix E.3.1), and the subsequent modification of graphical objects on the screen by the performance of pen gestures (Appendix E.3.3). An interesting aspect of these pen gestures is that they may or may not have iconic properties with respect to the graphical effect which is intended by the user. As an example, the user may draw an 'L'-shape in the vicinity of a rounded corner in order to sharpen the most adjacent vertex of the shown polyhedron, given the current projection on screen (iconic use). Alternatively, a pen gesture may have an arbitrary shape, bound to a command or function, which must be memorized by the user (referential use). The advantage of pen gestures is that no screen space is used up by widgets such as tool bars and menus. Experienced CAD users already make use of so-called 'Marker' menus, instead of continually referring to possibly remote widget objects on screen.

In the area of Personal Digital Assistants (PDAs), experiments are done with the iconic type of pen gestures, in that users are allowed to produce stylized scribbles which iconically depict larger graphical documents. In the user interface, a user-defined scribble then becomes the 'icon' by which the document may be referred to in later usage. If it is on the screen, selection by tapping on the icon may be performed. However, if the icon is off the screen, the user may produce the (memorized) pen gesture to find the corresponding document. As a more specific example, in a multimedial encyclopedia, a list of pictures of castles may be produced by entering a stylized castle icon with the pen. Artists have shown interest in interfaces in which simple sketched compositions allow for the retrieval of paintings with a similar composition. Although such applications are far from possible with the current state of the art, it is one of the purposes of MIAMI to explore the consequences of these ideas, and to uncover basic mechanisms.

Ad 2. Handwriting as a human output channel (HOC), combined with other forms of visible human behavior (HOC) in control and manipulation. Although not originally foreseen, possibilities are present in the area of teleoperation and audio control, if one combines the two-dimensional pen movement in time with another visible signal such as the vertical distance between upper and lower lip (∆L) as observed by a camera pointed at the user's face. Consider for instance a robotic arm, of which the end-effector position in two dimensions is controlled by the pen on an electronic paper device (see 3.2.3), whereas the effector opening is controlled by the lip distance ∆L. The third dimension may be controlled by the diameter of the user's face as observed by the camera. For initial experiments, special coloring of the background and the lips may be used to allow for easier processing of the camera data. However, as stated earlier, the control of graphical parameters of objects visualized by the computer, by means of pen gestures, will be the main topic of research in MIAMI.

An interesting application refers to the use of a stylized face to represent the state of a handwriting recognition agent. Most recognition systems deliver reliability estimates for the generated word or character class hypotheses. Usually, hypotheses with a probability below threshold are rejected. The facial expression of the miniature face representing the recognizer agent may smile in the case of neat handwriting input, and frown in the case of "Rejects". Research will have to decide whether this information is actually picked up by the user, or is considered as irrelevant marginal graphics of an application.

Although there is a healthy trend towards hiding technical details of a system from the user, there may be system components, such as intelligent agents, whose (intermediate) decisions have to be made explicit in meaningful ways. The reason for this is that such decisions may be erroneous. It is extremely demotivating for users of speech and handwriting recognition software not to know why recognition sometimes fails. In handwriting-visual control, the use of facial expressions in a miniature live icon may be a good way of externalizing aspects of the internal system state, in a non-invasive manner. Such solutions fall under the "anthropomorphic" or "animistic" category, as mentioned in 1.2.2.

3.4.3 Handwriting-speech control

In bimodal handwriting & speech control, the user combines the Human Output Channels (HOCs) of speech and handwriting to achieve a specific goal. A distinction must be made between textual input and command input (see Appendix E).


In textual input, the goal is to enter linguistic data into a computer system, either in the form of digitized signals or as (ASCII-)coded strings. In command input, the user selects a specific command to be executed and adds arguments and qualifiers to it. The term handwriting in the title includes pen-gesture control for the current purpose. Handwriting & speech bimodality in the case of textual input means a potentially increased bandwidth and reliability, provided that the user is able to deal with the combined speech and pen control. Handwriting & speech bimodality in the case of command input allows for a flexible choice [99]. As an example, the user may say /erase/ and circle or tap an object with the pen (i. e., erase "this"). Alternatively, the user may draw a deletion gesture and say the name of the object to be deleted.

In the remainder of this section we will consider bimodality in speech and handwriting from two viewpoints: (i) the automatic recognition and artificial synthesis of these HOC data; and (ii) the mere storage and replay of these HOC data. The accent will be on "Control", but we have added some information on computer output media (COM), because of the often encountered confusion with respect to the concepts of recognition vs. synthesis. Furthermore, with speech we mean the audio signal representing spoken text, and with ink we mean the XY-trajectory representing written text. Both signals are functions of time.

Automatic recognition and artificial synthesis Table 3.4 shows an overview of paradigmatic application examples. They will be described in more detail in the following paragraphs.

                             Speech Recognition               Speech Synthesis
Handwriting Recognition      The system improves the          The user gets voice feedback on
                             throughput in bulk text entry    handwriting recognition results
Handwriting Synthesis        The user dictates a              Multimedial text communication
                             synthesized handwritten letter   by the system

Table 3.4: Bimodality in handwriting and speech: paradigmatic applications in automatic recognition and artificial synthesis.

Handwriting Recognition/Speech Recognition: Improved text entry The reliability of recognition systems in the isolated speech or handwriting modality is improved due to the complementary (orthogonal) properties of both human output channels.

HOC (a) hand/arm musculature: position and compliance control (b) speech musculature, vocal chords, respiratory system: vocal sound production

CIM (a) XY digitizer, handwriting recognition algorithm (b) Microphone, speech recognition algorithm

COM Text (character fonts) are presented on the CRT

HIC The user gets: (a) Immediate feedback on speech and handwriting by the intrinsic feedback loop (Figure 1.1) (b) Mostly visual feedback in the form of text

An excellent overview of on-line handwriting recognition is given in [335]. A comparison between two approaches in on-line cursive script recognition is given in [301]. As regards the algorithmic architectures for integrating handwriting and speech recognition, similar problems occur as in merging speech recognition with facial speech movements. Several forms of merging recognizer output may be considered, of which two will be given:

1. Merging of final output word list on the basis of rank order

2. Merging of an intermediate character search space

Technically, (1) is easiest to implement, but it does not make use of the fact that "the other modality" may fill in ambiguous fragments in the character search space of a given modality. Therefore, merging of hypothesis search spaces as in (2) is the more powerful method. However, since the mapping from phonemes to characters is not one-to-one, this is not a trivial task. Within MIAMI, we will try to solve the problem by searching for common singularities in both modalities. As an example, the silence preceding the "/ba/" sound is a relatively easy speech feature to detect, as is the large ascending stroke of the written "b".
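A minimal sketch of option (1), merging the two final word lists on the basis of rank order, is given below. Each recognizer is assumed to deliver an ordered list of word hypotheses; a word's combined score is the sum of its ranks in the two lists, with a fixed penalty rank for a list in which it does not occur. The penalty scheme and all names are illustrative assumptions, not the method fixed in MIAMI.

// Rank-order merging of the word hypothesis lists of a handwriting and a
// speech recognizer (lower combined rank = better hypothesis).
#include <algorithm>
#include <map>
#include <string>
#include <vector>

std::vector<std::string> mergeByRank(const std::vector<std::string>& handwriting,
                                     const std::vector<std::string>& speech) {
    // Penalty rank for words missing from one of the lists.
    const int miss = static_cast<int>(std::max(handwriting.size(), speech.size()));

    auto rankOf = [miss](const std::vector<std::string>& list, const std::string& w) {
        auto it = std::find(list.begin(), list.end(), w);
        return it == list.end() ? miss : static_cast<int>(it - list.begin());
    };

    // Collect the union of both hypothesis lists.
    std::vector<std::string> all = handwriting;
    all.insert(all.end(), speech.begin(), speech.end());
    std::sort(all.begin(), all.end());
    all.erase(std::unique(all.begin(), all.end()), all.end());

    // Score each word by the sum of its ranks, then sort by that score.
    std::map<std::string, int> score;
    for (const auto& w : all) score[w] = rankOf(handwriting, w) + rankOf(speech, w);
    std::sort(all.begin(), all.end(),
              [&score](const std::string& a, const std::string& b) {
                  return score[a] < score[b];
              });
    return all;
}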

Handwriting Recognition/Speech Synthesis: Handwriting recognizer feedback by synthesized speech

HOC (a) hand/arm musculature: position and compliance control (b) speech musculature, vocal chords, respiratory system: vocal sound production


CIM (a) XY digitizer, handwriting recognition algorithm (b) Microphone, speech recognition algorithm

COM Voice-like sound is being produced by the computer, representing recognized text or the state of the recognition agent

HIC The user gets: (a) Immediate feedback on speech and handwriting by the intrinsic feedback loop (see Figure 1.1) (b) Auditory feedback

At NICI, a number of experiments have been performed with an experimentation platform called PenBlock, running under Unix. In the case of isolated characters (handprint), a spoken spelled letter can be produced after each character that is entered. If the character is not recognized with sufficient likelihood, an "uh?" sound is produced. The meaning of this is picked up quickly by the subjects, and it makes more explicit the fact that an active deciding 'agent' is trying to assess the incoming handwriting products, instead of an infallible machine-like tool. However, the usability of feedback in the form of spoken recognized letters must be studied in more detail. A more neutral "click" sound after entering each character was considered more acceptable by a small number of informal subjects. Variations on this theme can be designed, giving speech feedback only in case of doubt by the recognition agent.

Other experiments have been performed on a PC, in a small project dubbed "PenTalk", together with Tulip Computers. In this project, the user wrote isolated cursive words (Dutch) which were pronounced after on-line recognition by a speech synthesizer program. Preliminary experiments indicated that the use of speech feedback is limited, especially if the delay is more than about half a second after writing the word. In later versions of the program, synthesized speech feedback was deferred to a drag-and-drop type of function. The philosophy was that if it could not be immediate, speech feedback was better placed under active user control. The user decides when and what must be spoken. This has the advantage that the speech synthesis algorithm can pronounce sentences and paragraphs, effectively coding the prosody of speech, which is not possible with isolated words.

Based on these preliminary findings it is concluded that speech as immediate feedback in handwriting recognition is probably most useful as a way of multimodally informing the user about the state of the recognition process or the quality of the handwriting input. Auditory feedback in any case has the advantage that the user does not have to look away from the pen tip, i. e., where the action takes place. Other forms of (non-vocal) auditory feedback may be considered in MIAMI.


Speech Recognition/Handwriting Synthesis: The user dictates a “written” letter

HOC Speech musculature, vocal chords, respiratory system: vocal sound production

CIM Microphone, speech recognition algorithm

COM Synthesized handwriting in the form of pixel ink, or a handwriting trajectory production animation is presented to the user

HIC The user reads the handwriting from the screen, directly at the time of spoken text dictation, and at a time of later reading, potentially also by another user

In this form of bimodality, spoken words are transformed into synthesized written text. It is unclear whether the underlying algorithm may make use of the fact that handwriting is in fact richer than an intermediate representation like ASCII. Handwriting does contain some parameters, like pen pressure, speed, or character size and slant, which could be inferred from the spoken text. Examples: loud in speech → big in handwriting; hasty speech → handwriting more slanted to the right. A writer-dependent handwriting synthesis model takes much time to develop, although better tools and models are becoming more and more available [303, 271]. Depending on the type of handwriting synthesis model, shape simulation or combined shape-and-movement simulation can be performed. The first type (shape only) is already available in commercial form.

Handwriting Synthesis/Speech Synthesis: Improved communication through bimodality The combined presentation of linguistic information in the form of speech and handwriting has long been known by movie directors to be effective. Typically, a letter is shown on the screen, and an off-screen voice reads it aloud. The ambiguity in the handwriting style is reduced by the spoken words, and vice versa.

HOC Any form of earlier text entry

CIM Any text-oriented modality

COM (a) Synthesized handwriting in the form of pixel ink, or a handwriting trajectory production animation, is presented to the user (b) synthesized speech, pronouncing the same content as the handwriting, is presented to the user

HIC The user reads the handwriting from the screen, and listens to the synthesized speech


Issues here are the optional synchronization of handwriting and speech presentation. Possible modes are: (i) completely asynchronous, where the handwriting can be displayed almost immediately and is continually present on screen while the speech production takes its time; (ii) synchronous and incremental, where the handwritten words follow the spoken words; and (iii) synchronous, by highlighting the 'current word' in the handwritten ink on screen. These paradigms are relatively well known for the case of speech synthesis and machine-printed fonts on the screen.

Recording and replay of speech and handwriting In this section we will consider the case of stored HOC data. Permutations with the synthesized media of handwriting and speech as above are possible, but will not be considered here. The essential difference with the preceding section is that the media will be stored for later human use, and retrieval or browsing is only possible in a superficial way (i. e., not on the basis of content). Table 3.5 shows an overview of paradigmatic application examples. They will not be described in more detail here, but will be addressed in MIAMI at a later stage. Suffice it to say that from the point of view of goal-oriented user behavior, these applications are often perfectly acceptable, and no recognition or synthesis is needed at all.

                             Speech Recording                 Speech Replay
Handwriting Recording        The system improves the          The user annotates voice mail with
                             throughput in bulk text entry    notes in ink, for later human use
Handwriting Display          The user summarizes vocally      A multimedial stored speech and ink
                             the content of handwritten       E-mail message is read by a human
                             notes received over E-mail       recipient

Table 3.5: Bimodality in handwriting and speech: paradigmatic applications, using plain recording/replaying of HOC data.


3.4.4 Visual-motoric control

For humans, the most important sensory system is the visual system2. Nearly all actions which are performed are fed back and supervised by observation. Therefore, the combination of the two modalities vision and motoric control is a very natural and intuitive one, leading to bimodal visual-motoric control. In this section, we do not deal with low-level visual-motoric coupling, like the muscular control of the eye which is necessary to fixate an object over time (see 1.2.1), but with the interaction of visual and tactile feedback in motoric control tasks.

With today's standard computer equipment, every human-computer interaction includes some kind of visual-motoric coupling, no matter whether the user types in some text with a keyboard, performs click or drag-and-drop actions with a mouse or trackball, draws a model in a CAD system with a graphic tablet or a 6D input device, or controls a manipulator or mobile robot with a master-slave manipulator or a 6D input device. In any case, the effects of the actions are — at least — observed on a monitor. But, as far as we know, the influence of visual and tactile feedback on these standard control tasks has not been sufficiently investigated yet. Although several people have performed experiments, usually only small numbers of subjects3 have been used and only few aspects of device/task combinations have been analyzed and evaluated. Even worse, most researchers did not take any tactile feedback into account, e. g. [102, 58, 12, 13, 68, 235, 198, 197, 100, 152], with the exception of, e. g., [48, 111, 6]. Therefore, we designed several experiments which will be directed towards the

1. Analysis and evaluation of the effect of different input devices for several interaction/manipulation tasks and the

2. Analysis and evaluation of input devices with tactile feedback.

In order to get sufficient sample data, comprehensive tests with a large number of subjects have to be carried out. Otherwise, statistical errors will be introduced and the results obtained might not be transferable. Unfortunately, the number of experiments grows with each additional free variable because of the combinatorial explosion. A simple example might illustrate the situation:

If we limit one of our experiments (a positioning task, see below) to 2D space (#Dimensions Dim = 1), take three different angles (#Angles θD = 3), three distances (#Distances D = 3), and use objects with five different sizes (#Sizes S = 5), we will need

#Scenes = Dim ∗ θD ∗ D ∗ S = 1 ∗ 3 ∗ 3 ∗ 5 = 45,

i. e. 45 different scenes (graphical setups). When each of our five devices is used (see below), and each one is tested with all combinations of feedback modes, we get 18 different device/feedback-mode combinations (#Modes = 18). Because every test has to be carried out by at least 15 subjects (#Subjects = 15, see above), the total number of tests will be

#Tests = #Scenes ∗ #Modes ∗ #Subjects = 45 ∗ 18 ∗ 15 = 12 150,

which is a rather large number, even under the limitations given above. And it is the number of tests for only one experiment!

2This is obviously not true for blind people, who are not considered further within this section.
3Usually 5–10, but at least 15 are necessary for each parameter combination in order to compensate for statistical errors (see, e. g., [36]).

The hypotheses which we want to test are, among others, that

• tactile feedback will reduce the execution time and the accuracy in simple 2D positioning tasks, if the same device is used with and without tactile feedback;

• tactile feedback will significantly reduce the execution time and increase the accu- racy if the target region is very small;

• the changes in execution time and accuracy will be independent of the angle and distance to the target region;

• the changes described above are more significant if the objects are not highlighted when they are reached by the cursor, i. e. without any visual feedback;

• Fitts’ law4 (see [102]) will hold for input devices with tactile feedback as well.

4Fitts' law states that the movement time (MT) of a target-oriented movement to an object with width W and distance D depends linearly on the index of difficulty (ID): MT = a + b ∗ ID, with a, b constant and ID = log2(2D/W).

Various experiments have been designed and implemented to cover basic operations — in which the variable test parameters can be measured exactly — as well as interaction and manipulation tasks which are more oriented towards common applications, like selection from a menu or dragging an icon. They will be carried out in 2D and 3D space, respectively. Some experiments will be described below:

Positioning: A pointer shall be placed as fast and as accurately as possible on a rectangular region (with width W and height H) at an angle θD, thereby covering distance D. This will be investigated in 2D as well as in 3D space. The influence of visual and tactile feedback will be determined. The applicability of the well-known Fitts' law [102] will be analyzed under the conditions described above (a small numerical sketch of the Fitts' law prediction is given after this list). The results will be relevant for all kinds of graphical interaction tasks.

Positioning and selection: Positioning of an object at a specified target region in a fixed plane which is presented in 2D or 3D space. This is a basic experiment for any drag-and-drop operation.

Selection, positioning, grasping, and displacement: One of several objects shall be grasped, retracted, and put at another position. This includes the features of the experiments described above and extends them with respect to robotic applications like assembly and disassembly.

Positioning and rotation with two-handed input: The first (dominant) hand controls the movement of a mobile robot, whereas the second hand controls the direction of view of a stereo camera system which is mounted on the mobile robot. The task is to find specific objects in the robot's environment. This is a very specialized experiment in which repelling forces of obstacles can be used and in which the possibility to move the camera might be directly related to the velocity of the robot and the potential danger of the situation.5

5This is a very advanced task and it is not clear at the moment whether it may be realized, but the technical equipment (mobile robot Priamos with vision system Kastor) is available at the University of Karlsruhe, see also 6.2.
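As a small numerical sketch of the Fitts' law prediction used in the positioning experiment (see footnote 4 above), the following computes ID = log2(2D/W) and MT = a + b ∗ ID; the constants a and b are illustrative placeholders that would in practice be fitted from the measured movement times.

// Fitts' law: index of difficulty and predicted movement time.
#include <cmath>
#include <iostream>

double indexOfDifficulty(double distance, double width) {
    return std::log2(2.0 * distance / width);            // ID = log2(2D/W)
}

double predictedMovementTime(double distance, double width,
                             double a = 0.1, double b = 0.15) {   // s, s/bit
    return a + b * indexOfDifficulty(distance, width);   // MT = a + b * ID
}

int main() {
    // e.g. a 10 mm wide target at a distance of 160 mm: ID = 5 bits.
    std::cout << predictedMovementTime(160.0, 10.0) << " s\n";
}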

Because input devices with tactile feedback are either very simple, not available on the market, or really expensive (see 2.2.4), two input devices with tactile and force feedback, respectively, have been designed and built:

Mouse with tactile feedback: Following the idea of Akamatsu and Sato [6], a standard 2-button mouse for an IBM PS/2 personal computer has been equipped with two electromagnets in its base and a pin in the left button. For input, the standard mouse driver is used; for output, the magnets and the pin can be controlled by a bit combination over the parallel printer port by our own software, so that the magnets will attract the iron mouse pad and the pin will move up and down. Both magnets and the pin can be controlled independently. In order to make the mouse usable with our SGI workstation, communication between the PC and the workstation is established over the serial communication line. In principle, any standard mouse can be easily equipped with this kind of tactile feedback.

Joystick with force feedback: A standard analog joystick has been equipped with two servo motors and a micro controller board. Communication between the joystick controller and a computer is realized over a serial communication line. The joystick’s motors can be controlled in order to impose a force on the stick itself, thus making force reflection possible.

Another device, Logitech's CyberMan, has been bought. It is the cheapest device on the market (< 200,- DM) with tactile feedback, although in this case there is only a vibration of the device itself. For the experiments, five devices are available at the moment: the mouse with tactile feedback, the joystick with force feedback, the CyberMan, and two 6D input devices, the SpaceMaster and the SpaceMouse. An interesting question is how the tactile feedback will be used — considering the hardware as well as the software — in different applications. Some suggestions and comments will be given in the following paragraphs.

Obviously, the devices which are equipped with tactile feedback capabilities realize this feedback in completely different ways. The mouse with tactile feedback uses its two electromagnets as a kind of "brake", i. e. if a current is applied to them, the movement of the mouse becomes more difficult for the user, depending on the current. In addition, the pin in the left mouse button can be raised and lowered at a high frequency, causing a kind of vibration. This will motivate the user to press the button. Although in principle the current of the magnets and the frequency of the pin vibration can be controlled continuously, this will usually not be done, and we therefore call this kind of feedback binary. Logitech's CyberMan can also only generate binary feedback: if a special command is sent to the device, it starts to vibrate. Again, the frequency and duration of the vibration can be controlled with parameters, but continuous feedback is not possible.

The situation changes completely when the joystick with force feedback is considered. Here, two servo motors control the position of the joystick, thus allowing continuous control in the x/y-plane. When the user pushes the stick but the servo motors try to move it in the opposite direction, the user gets the impression of force feedback, because the movement becomes more difficult or even impossible.

In order to make the usage of the different devices as easy as possible, a common "meta device driver" (MDD) has been developed for all tools (see figure 3.3).


The parameters which are sent to the devices follow the same structure, as do the values received from them. This concept has been developed in order to hide the specific characteristics of a device behind a common interface (cf. figures 3.1 and 3.2). It has been realized as a C++ library and can be linked to any application. If more devices become available, the MDD can easily be extended.

Figure 3.3: Schematic description of the Meta Device Driver (MDD). Applications (e. g., the experiments for WT 2.6) communicate with the MDD, which dispatches to the individual device drivers DD-1 to DD-5 for the mouse, the joystick, the CyberMan, the SpaceMaster, and the SpaceMouse.

With respect to the software, several different possibilities exist to give the user visual and/or tactile feedback. Visual feedback is used by every window manager, e. g. the border of a window is highlighted when it is entered by the mouse cursor. In order to study the effect of tactile feedback, various feedback schemes have been developed. Two of them will be described in more detail below:

Figure 3.4: A typical scene which is used for simple 2D positioning tasks with visual and tactile feedback. The circle marks the start position, the black object is the target, and all grey objects are used as obstacles.

1. The first scheme is used for simple objects in 2D that are divided into targets and obstacles for the experiments. Figure 3.4 shows a typical scene with five obstacles and one target. Whenever the cursor enters an obstacle or the target region, the tactile feedback is launched.

For the mouse, the magnets and the pin (or a combination of both) may be used. For the CyberMan, the vibration is switched on. For the joystick, things get more complicated: a force function, like the one shown in figure 3.5, needs to be implemented. In this case, the user "feels" some resistance when entering the object, but if the center is approached, the cursor will be dragged into it.

Figure 3.5: A force function which may be applied to objects in 2D space in order to control the joystick

2. The second scheme is applied to objects in 3D space which are treated as obstacles, e. g. walls in a mobile robot collision avoidance task. The magnets of the mouse can be used to stop further movement against an obstacle, and the CyberMan's vibration can be switched on for the same purpose. Again, the joystick has to be explicitly programmed with a predefined, parametrized function in order to prevent the "mobile robot" from being damaged. Figure 3.6 shows the principle of the implementation of this function.

Figure 3.6: A force function which may be applied to objects in 3D space in order to control the joystick. The x-axis denotes the distance between the cursor and the object, and the y-axis the applied force.
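A parametrized force function of the kind shown in figure 3.6 could be sketched as follows: no force while the "mobile robot" cursor is far from the obstacle, and a force growing towards a maximum as the distance shrinks to zero. The linear ramp and the parameter names are illustrative assumptions; the actual shape used in the experiments may differ.

// Repelling force as a function of the cursor-to-obstacle distance, used to
// drive the force-feedback joystick away from obstacles.
double repellingForce(double distance,           // cursor-to-obstacle distance
                      double influence_radius,   // no force beyond this distance
                      double max_force) {        // force applied at contact
    if (distance >= influence_radius) return 0.0;
    if (distance <= 0.0) return max_force;
    return max_force * (1.0 - distance / influence_radius);   // linear ramp
}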


Chapter 4

Interaction

The two complementary views of human-computer interaction, i. e. perception and control, have been described in isolation in the two previous chapters. Their combination — with respect to multimodal systems — is the topic of this chapter. We will start with an overview of existing architectures and interaction models. In the second section, these theoretical considerations will be directed towards practical issues of input/output coupling, followed by a section on synchronization. In the last two sections, aspects of Virtual Reality systems and an analysis of interaction will be discussed.

4.1 Architectures and Interaction Models for Multimodal Systems

In the literature, many different architectures for and interaction models of multimodal (and other complex) systems can be found. Some of them will be reviewed briefly within this section. The first six paragraphs will mainly deal with different architectures for multimodal systems. The next one presents a layered model for interaction in Virtual Environments which can easily be adapted to a multimodal system. In the following three paragraphs, the focus is on the user interface, and in the last paragraph a technique for the development of UIs for multimodal systems will be presented. The basic idea of this section is to give a short overview of existing architectures and models in order to simplify the development of the architecture that will be developed within MIAMI.


A design space for multimodal systems Following [243], concurrency of processing and the fusion of input/output data are the two main features of a multimodal system¹. Consequently, Nigay and Coutaz took these features as the basis for their design space and classification method. The design space has been defined along three dimensions²: Levels of abstraction, Use of modalities, and Fusion. In order to keep the model as simple as possible, the permitted values of each dimension have been limited. The result is shown in figure 4.1. From a maximum of eight combinations (3 dimensions, 2 values each: 2³ = 8), four have been named and are further investigated in the article.

                            USE OF MODALITIES
                            Sequential              Parallel
FUSION     Combined         ALTERNATE               SYNERGISTIC
           Independent      EXCLUSIVE               CONCURRENT
LEVELS OF ABSTRACTION       Meaning / No Meaning    Meaning / No Meaning

Figure 4.1: The multi-feature system design space (Taken from [243])

The design space is then used for a classification scheme for multimodal systems. By introducing weighted features of a system, its location within the design space can easily be determined. This is useful for testing the effect of different features and commands, thus “measuring” the usability of an interface. The second part of the paper deals with data fusion and concurrent processing. To this end, a multi-agent architecture is introduced which will not be further reviewed here. For data fusion, three levels of fusion have been identified:

Lexical fusion: Binding of hardware primitives to software events; only temporal issues (e. g., data synchronization) are involved

Syntactic fusion: Sequencing of events; combination of data to obtain a complete command

¹ Their definition of multimodality can be found in 1.1.4, page 6.
² Compare this to the design space of Frohlich, page 75.


Semantic fusion: Functional combination of commands in order to generate new, more complex commands
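None of the names in the following C++ fragment come from [243]; it is a purely illustrative sketch of how the three fusion levels could be pictured in code: raw device events carry timestamps (lexical level), events are sequenced into complete commands (syntactic level), and commands are combined into compound commands (semantic level).

    #include <cstdint>
    #include <string>
    #include <vector>

    // Lexical level: a hardware primitive bound to a software event with a timestamp.
    struct Event {
        std::string   device;   // e.g. "mouse", "speech"
        std::string   token;    // e.g. "click", a word hypothesis
        std::uint64_t timeMs;   // needed for temporal alignment of modalities
    };

    // Syntactic level: a sequence of events forming one complete command.
    struct Command {
        std::string verb;                // e.g. "delete"
        std::vector<std::string> args;   // e.g. the object that was clicked
    };

    // Semantic level: commands functionally combined into a more complex command.
    struct CompoundCommand {
        std::vector<Command> parts;      // e.g. "select group" + "move to X"
    };

    // Syntactic fusion (illustrative): a spoken verb and a pointing event that
    // occur within a small time window are merged into one command.
    Command fuse(const Event& spoken, const Event& pointed)
    {
        Command c;
        c.verb = spoken.token;                        // "delete"
        if (pointed.timeMs + 1500 > spoken.timeMs &&
            pointed.timeMs < spoken.timeMs + 1500)    // assumed 1.5 s window
            c.args.push_back(pointed.token);          // e.g. "object_42"
        return c;
    }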

The main benefits of the introduced concepts, which are also of interest for the prototype systems that are going to be developed within the scope of MIAMI, are, according to [243]:

• the introduction of four salient classes of systems which can be used as the extrema of a reference space, and

• the chance to characterize and reason about the I/O properties of interactive systems by using their reference space.

A framework for the design space of interfaces In [109], Frohlich presents a framework for describing the design space of human-computer interfaces. His framework is based on four columns: the mode, the channel, the medium, and the style. Inside these, the following items have been identified:

• Modes: language — action

• Channel: audio — visual — haptic/kinesthetic
All of these can be combined with both modes.

• Medium:

– Media for the language mode are: speech — text — gesture
– Media for the action mode are: sound — graphics — motion

• Style:

– Styles in the language mode are: programming language — command language — natural language — field filling — menu selection
– Styles in the action mode are: window — iconic — pictorial


Frohlich states that the information in interfaces based on the language mode is used symbolically, whereas in interfaces based on the action mode it is used more literally. The application of each style to each medium within a given modality suggests that styles are modality-specific rather than medium-specific, although they are not completely medium-independent. The two main advantages of this framework, according to Frohlich, are

1. that they “help to clarify the terminology used to describe interfaces and various aspects of their design.” and

2. that they can be used in “classifying past and present interface designs.”

In the end, he argues that his framework may be extended in several ways, naming three potential candidates for new dimensions: tasks, techniques, and devices. For a multimodal human-computer interface, these extensions definitely have to be applied.
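Purely as an illustration of how Frohlich's four columns could be operationalized, the following sketch encodes an interface design point as a tuple of mode, channel, medium, and style; all names below are our own choices, not Frohlich's.

    #include <string>
    #include <vector>

    // Illustrative encoding of Frohlich's design space: a concrete interface
    // design point picks a mode, a channel, a medium, and a style.
    enum class Mode    { Language, Action };
    enum class Channel { Audio, Visual, HapticKinesthetic };

    struct DesignPoint {
        Mode        mode;
        Channel     channel;
        std::string medium;   // e.g. "speech", "text", "gesture" for the language mode
        std::string style;    // e.g. "natural language", "menu selection"
    };

    // A multimodal interface is then simply a set of such design points,
    // e.g. spoken natural language combined with iconic direct manipulation.
    std::vector<DesignPoint> exampleInterface()
    {
        return {
            { Mode::Language, Channel::Audio,  "speech",   "natural language" },
            { Mode::Action,   Channel::Visual, "graphics", "iconic"           },
        };
    }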

Architecture of a multimodal dialogue interface In the former Esprit II project 2474, “MMI2: A Multi Mode Interface for Man Machine Interaction with knowledge based systems”, the architecture shown in figure 4.2 has been developed for a multimodal UI [22]. The interesting thing to notice about this approach is that it deals not only with information but also with its meaning.³ In figure 4.2, this is represented by the arrows marked with CMR, the Common Meaning Representation. With respect to multimodality, the MMI2 system supports a graphic and a gesture mode as well as several different language modes (command, Spanish, French, English). One of the main benefits is the integration of all these modes, which are managed by the system itself. One of the basic assumptions that lead to this architecture is that

“mode integration should mainly be achieved by an integrated management of a single, generalized, discourse context.”

The second basic architectural principle is that

“there is a meaning representation formalism, common to all modes, which is used as a vehicle for internal communication of the semantic content of interactions inside the interface itself and also used as a support for semantic and pragmatic reasoning.”

Although this architecture has been especially designed with the application of knowledge-based systems in mind, it can easily be adapted for the purposes of MIAMI. The tasks of

³ Which is exactly what we are going to do in MIAMI.


the ‘experts’ shown in figure 4.2 will not be further described here, with two exceptions: the user modeling expert, which maintains and exploits the user model, and the semantic expert, which has all the knowledge about the general properties of meaning concepts. These two are of special interest to MIAMI because both a user discourse model and a semantic representation might be necessary in order to strive for meaning or to find the user’s intention.

Figure 4.2: General architecture of the MMI2 system [CMR: Common Meaning Representation] (Taken from [22])

Software structure of UIMS The authors of [50] have identified some requirements of systems for the development of user interfaces (UIs): consideration of standards, openness to all interaction styles, and provision of comfortable design tools. They argue that most of the existing systems fulfill only some of those requirements. Therefore, they present a layered model for the interface between an application’s functionality and its UI which will especially be useful for the design of UIs for multi-tasking, multi-windowing systems that support a free choice of input and output media. The two pictures shown in figure 4.3 best illustrate their approach. By introducing these layers, the authors want to achieve “a clear separation of application functionality from dialog functionality, and a clear responsibility for the different actions that are involved in a dialog”, i. e. different actions are located in different layers. The responsibilities (and functionalities, respectively) have been defined as follows:



Figure 4.3: The basic software architecture (left picture) and the corresponding layered model for human-computer interaction (right picture). (Adapted from [50])

presentation layer: output to the screen; handling of inputs from the user; toolkit functionalities

virtual presentation layer: separation of input and output [. . . ] from the dialog; definition of logical devices and virtual terminals; machine- and device-independent presentation; handles all static aspects of the UI

virtual application layer: contains the semantics of the respective application; dialog between the user and the application; handles all dynamic aspects of the UI

application layer: application’s functionality

Another aspect of this implementation is the object-oriented approach, which allows the layers to communicate via messages. Such a well-defined, layered model seems to be a good approach for an independent and thus flexible implementation of a human-computer interface which can also be used for a multimodal application.
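A rough sketch of such message-based communication between layers might look as follows (our own illustration, not code from [50]; class and message names are invented).

    #include <iostream>
    #include <string>

    // Each layer only knows the layer directly below it and communicates
    // through messages, mirroring the four-layer model described above.
    struct Message { std::string type; std::string payload; };

    class Layer {
    public:
        explicit Layer(Layer* below = nullptr) : below_(below) {}
        virtual ~Layer() {}
        // a message travelling "down" from the application towards the screen
        virtual void send(const Message& m) { if (below_) below_->send(m); }
    private:
        Layer* below_;
    };

    class PresentationLayer : public Layer {              // physical presentation
    public:
        void send(const Message& m) override {
            std::cout << "draw: " << m.payload << '\n';
        }
    };

    class VirtualPresentationLayer : public Layer {       // static UI aspects
    public:
        using Layer::Layer;
    };

    class VirtualApplicationLayer : public Layer {        // dynamic UI aspects (dialog)
    public:
        using Layer::Layer;
    };

    int main() {
        PresentationLayer p;
        VirtualPresentationLayer vp(&p);
        VirtualApplicationLayer va(&vp);
        va.send({"output", "highlight window border"});   // request from the application layer
    }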

The cognitive coprocessor architecture In 1989, Robertson et al. developed an architecture for interactive user interfaces in order to solve both the Multiple Agent Problem and the Animation Problem [293]. Their approach is based on the assumption (called the three agent model) that

“The behavior of an interactive system can be described as the product of the interactions of (at least) three agents: a user, a user discourse machine, and a task machine or application.”


First, the two problems mentioned above shall be described:

• The Multiple Agent Problem arises from the fact that in a human-machine dialogue the agents’⁴ capabilities vary significantly.

• Another difficulty will emerge when providing smooth interactive animation and solving the Multiple Agent Problem simultaneously. The problem of balancing the restricted computational power between these two is called the Animation Problem.

Although the architecture has not been designed with multimodal systems in mind, the idea of an interface with 3D and animation support is very appealing for MIAMI, too. The name cognitive coprocessor has been derived from the cognitive assistance it provides the user and other agents, as well as the coprocessing of multiple interacting agents that it supports. In addition to the three agent model, the architecture contains intelligent agents and smooth interactive animation.

Architectural qualities and principles In [141], Hill and his coauthors have defined five qualities for multimodal and multimedia interfaces:

1. Blended modalities: The user should be able to blend modes at any time.

2. Inclusion of ambiguity: Input modes that yield ambiguous or probabilistic input are desirable when appropriate.

3. Protocol of cooperation: Intervention on input and output modules should be available at any time.

4. Full access to the interaction history: The interpreted levels of interaction history must be accessible on-line as well as after the interaction session has finished.

5. Evolution: The interfaces have to be open to improvements without the need for a complete reimplementation.

By investigating traditional interface architectures, they found that semantics and pragmatics are usually not shared across modalities. Another drawback is the absence of “felicitous interruption, information volunteering, and intelligent displays”, which are considered to be important interface building techniques. The different modalities of traditional architectures are therefore “noncommunicating components”, as shown in figure 4.4.

⁴ In this model, the user is an agent, as are several “intelligent” processes running on the computer.


Figure 4.4: The traditional structure of modality cooperation (Taken from [141])

In the new structure proposed by Hill et al., some of the processing levels have been unified across input modalities, whereas others have been opened. Thus, “applications [. . . ] interact with a semantic representation of user activities rather than a syntactic one.” The principal architecture according to this approach is shown in figure 4.5. It is based on the principle of uniform access, which guarantees a separation of interface aspects from the application, and the compromise of almost homogeneous representation, which tries to balance between performance (mainly: speed) achieved by specialized device drivers and access functions on the one hand, and homogeneity that allows the blending of modalities on the other hand.

Figure 4.5: A better structure for modality blending (Taken from [141])

Interactions in Virtual Environments P. Astheimer et al. [7] present interaction diagrams for interactions in 2D (figure 4.6) as well as in 3D (figure 4.7). The different layers of their model are identical, but the entities inside the layers may differ. Although these tables have been especially designed for interactions in Virtual Environments (VE), they can also serve as an underlying model for interactions in a multimodal system. In this case, the application level has to be changed significantly, but interaction techniques like navigation, object manipulation, natural speech recognition, or eye tracking will stay the same.


Applications: Office, Process Control, Solid Modeling, . . .
UI-Toolkits: Open LOOK, OSF/Motif, Windows, . . .
Metaphors: Desktop Metaphor, Direct Manipulation, Windows, . . .
Interaction Tasks: Valuator, Pick, Selection, . . .
Interaction Techniques: Menus, Drag, Click, Icon, Buttons, . . .
Events: Button up/down, Motion, Key Press, x/y Pos., . . .
Input Devices: Mouse, Tablet, Touch-Screen, Joystick, Keyboard, Trackball, . . .

Figure 4.6: Layered model for 2D interactions

Applications: Virtual Reality, Scientific Modeling, Simulation, Architecture, CAD, . . .
UI-Toolkits: GIVEN, Vis-a-Vis, WorldToolKit, . . .
Metaphors: Virtual Reality, Direct Manipulation, Pyramid Metaphor, . . .
Interaction Tasks: Navigation, Move, Identification, Select, Rotation, Scale, Modification, . . .
Interaction Techniques: Grabbing, Release, Pointing, Gesture Language, Virtual Sphere, 3D-Menu, . . .
Events: Gestures, 6D-Motion, Button Click, Force, 2D-Motion, Torque, . . .
Input Devices: DataGlove, SpaceBall, 3D-Mouse, 6D-Tracker, Eye-Tracker, Joystick, . . .

Figure 4.7: Layered model for 3D interactions

Man-machine communication In his book on man-machine communication [120], Geiser gives a good overview of the topic. The chapters comprise theoretical aspects of information processing and ergonomic requirements as well as technical aspects of the computer system. The basic model of a man-machine system, shown in figure 4.8, is very similar to our model (compare with figure 1.1). Geiser has reviewed several different models of human-computer interaction. A very short description of them follows:

The layered model [246]: Norman has introduced a simple conceptual model with seven different layers of user action: On top, the user formulates a goal. In order to execute it, three layers are needed (top-down): planning, action specification, and action execution. To perceive feedback, another three layers are involved (bottom-up): perception, interpretation, and evaluation of the system’s state.



Figure 4.8: Structure of a man-machine system (Taken and translated from [120])

The 3-level model [283]: Rasmussen’s model⁵ is a conceptual model of the user, comprising three hierarchical levels of behavior: skills, rules, and knowledge (SRK). The layers represent different levels of abstraction in the human (cognitive) processing system. A more detailed description of this model follows in the next paragraph.

The GOMS model [60]: The GOMS model describes the user’s task representation as a conceptual model in terms of Goals, Operators, Methods, and Selection rules. It is especially useful for the prediction of the user’s behavior, but it has a major drawback: Abnormal behavior of the user is not considered.

The keystroke-level model (KLM) [59]: This model only considers one aspect of man-machine communication: How long does it take for an experienced user to perform a standard task without any error? The model’s benefit is the quantitative measurement of observable parameters in man-machine interactions. Drawbacks of the KLM are that cognitive aspects are difficult to include and that in many cases the measurements performed are not very accurate.
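To make the flavour of the KLM concrete, the following fragment sums commonly cited approximate operator times for a hypothetical “delete a file with the mouse” task; the task decomposition is our own assumption, not an example taken from [59].

    #include <cstdio>

    // Commonly cited approximate KLM operator times (seconds).
    const double K = 0.2;    // keystroke or button press
    const double P = 1.1;    // point with the mouse to a target
    const double H = 0.4;    // home hands between keyboard and mouse
    const double M = 1.35;   // mental preparation

    int main() {
        // Hypothetical task: move hand to mouse, think, point at the file icon,
        // click, think, point at the "delete" menu entry, click.
        double t = H + M + P + K + M + P + K;
        std::printf("predicted error-free expert time: %.2f s\n", t);   // ~5.7 s
    }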

The theory of cognitive complexity [156]: Based on the GOMS model, Kieras and Polson have developed a theory in order to allow a quantitative analysis of the complexity of the man-machine dialogue. The theory’s purpose is to support the design process of dialogues with respect to the effort and transfer of learning, the execution time, and the user friendliness of the system. The cognitive complexity is defined by the content and the structure of the knowledge that the user needs in order to use a device. Additionally, two different knowledge representation schemes

⁵ Rasmussen is also the coauthor of the article reviewed in the following paragraph.


have been defined: One for the knowledge of the user concerning the task, the other one for the device itself. Unfortunately, the predicted results could not be confirmed in practice, and the model has been refined several times by other researchers.

Ecological interface design A more theoretical approach has been taken by Vicente and Rasmussen. In [346], they introduce a framework called EID (Ecological Interface Design), which is based on the skills, rules, knowledge taxonomy of cognitive control (see the 3-level model in the last paragraph). Their method has especially been developed for complex human-machine systems with direct manipulation interfaces. Therefore, it could be useful for multimodal systems, too. First, they distinguish between three different types of events with which the user has to deal in complex domains:

1. Familiar events: Due to experience, the operator’s skills are sufficient to deal with this type of event.

2. Unfamiliar, but anticipated events: Because this type of event has been anticipated by the interface designer, the operator is sufficiently supported when it occurs.

3. Unfamiliar and unanticipated events: Improvisation by the operator is necessary and might lead to wrong reactions.

The EID framework has been developed in order to deal with all three kinds of events and to offer the operator the most appropriate support in any situation. To this end, an abstraction hierarchy with multiple levels of cognitive control, based on the SRK taxonomy, has been used. SRK stands for the three different levels of cognitive control:

• SBB – Skill-based behavior: Automated behavioral patterns are used at this level. SBB’s principle is as follows:

“To support interaction via time-space signals, the operator should be able to act directly on the display, and the structure of the displayed information should be isomorphic to the part-whole structure of movements.”

• RBB – Rule-based behavior: Cognitive control on this level depends on a set of cue-action mappings with the following principle:

“Provide a consistent one-to-one mapping between the work domain constraints and the cues or signs provided by the interface.”


• KBB – Knowledge-based behavior: At this level of cognition, problem solving operations on a symbolic representation come into play. Hence, KBB’s principle can be formulated as follows:

“Represent the work domain in the form of an abstraction hierarchy to serve as an externalized mental model that will support knowledge-based problem solving.”

The first two behaviors are concerned with perception and action (which “is fast, effortless, and proceeds in parallel”), whereas KBB is used in analytical problem solving on a symbolic representation (which “is slow, laborious, and proceeds in a serial fashion”). The latter will most often be used when unfamiliar situations arise. When designing an interface, the user’s processing level should be kept as low as possible, because operation times will be reduced and the process will be less error-prone.

Intelligent user interfaces

“The great thing about the title Intelligent User Interfaces is that it is ambiguous: is the book about interfaces for intelligent users, or intelligent interfaces for any users? I think the answer isn’t simple.” W. Mark in the foreword of [329].

The book Intelligent User Interfaces, published in 1991, contains many ideas, concepts, and realized systems for sophisticated UIs [329]. Although it is impossible to review all chapters within the scope of this report, a few comments have to be made. The book is of some significance for MIAMI because of its first two parts: The first part, Multimodal Communication, though written by experts of the field, does not deliver what its name promises. The term multimodality is used in a very narrow sense, mainly meaning a combination of natural speech and deictic gesture input (following the “put-that-there” paradigm). In our opinion, this is not enough to justify the word multimodality. The more interesting thing to notice about this part is the description of user and discourse models, which may be useful in MIAMI, too. The latter also holds true for the second part of the book, Models, Plans, and Goals. Modeling aspects of the user, providing the UI with adaptive interaction capabilities, and using intelligent agents for the user’s access to the system’s functions are the main topics of this part. At the moment, it seems to be clear that without any coverage of the higher levels of a UI, the ambitious goal of modeling cognitive aspects cannot be reached (see chapter 5).


Wizard of Oz technique for multimodal systems In [294], Salber and Coutaz describe the application of the Wizard of Oz technique (WOz) to multimodal systems. The basic idea of a WOz system is that a human (the hidden “wizard”) simulates a system, or those parts of its behavior which are not yet available, while this fact is hidden from the user. By analyzing the performed operations, the user’s needs can be identified in advance, which may lead to a better design of the final system. This idea is especially interesting for the design of multimodal systems because at the moment the understanding of how to design such a system is very limited. Therefore, the authors argue that the WOz technique is an appropriate approach to the identification of sound design solutions. The whole article cannot be reviewed here in depth, but the requirements from the wizard’s perspective as well as those from the system perspective shall be mentioned:

• Requirements from the wizard’s perspective:

– Task complexity: In a multimodal system, the wizard must simulate more complex functions than in a traditional system, such as the synergistic combination of modalities.
– Information bandwidth: Due to the chance of using more than one modality for user input, the input processing bandwidth of a multimodal system will exceed that of a traditional system.
– Multiwizard configuration: One way to overcome the two problems mentioned above may be to use more than one wizard.

• Requirements from the system perspective:

– Performance: The system has to provide the effective foundations for efficiency.
– Configuration flexibility: In a multiwizard, multimodal system, one has to deal with a variable number of wizards as well as the variety of input and output devices.
– Communication protocol flexibility: This is a critical aspect because i) multiple wizards may be engaged and ii) information exchange over the communication channels may be on any level of abstraction.

In addition, powerful tools for the analysis and the evaluation of the observed data are necessary in order to successfully apply the Wizard of Oz technique to a multimodal system. This has been identified to be one of the major weak points during the tests performed by Salber and Coutaz. Their second result is that “the organization of the

wizards’ work requires a lot of testing and experiments; [. . . ]”. Nevertheless, the WOz technique could be an appropriate solution in MIAMI when the integration of software packages is not possible due to missing pieces.

4.2 Input/Output Coupling

In fact, the complete dissociation between input and output, or sensory and motor channels, is meaningless, because of the tight coupling between input and output in the human information processing system. First of all, perception of an external world is meaningless in a system which cannot perform actions on that world. Second, it can be argued that none of the sensory modalities is completely independent of motor output. Vision relies to such an extent on the motor output of the oculomotor muscles that a stationary image will disappear within a second after paralysis of these muscles (due to the absence of optic flow). Hearing depends in two ways on motor output: the head orientation helps in sound localization, and the dynamic range of sound perception is regulated by tiny muscles stretching the tympanic membrane. Propriocepsis is controlled by the gamma-efferents, where the central nervous system directly controls the sensitivity of the muscle spindle receptors. Indeed, for each sensory modality or sub-modality it is possible to identify one or more reflexes, which are (more or less automatic) control systems, either open-loop (as the vestibulo-ocular reflex) or closed-loop (as in the stretch reflex). These reflexes (i. e. the appropriate activation/stimulation of them by an artificial system) are important for virtual reality but not necessarily for advanced (multimodal) man-machine interaction. In any case, the strongest form of integrated perception & action occurs in somesthesia, kinesthesia, the sense of touch, and haptic perception. Many people (Aristotle as well as his followers, probably including Müller) take for granted that the sense of touch, as the perceptual channel which operates while we touch/manipulate objects, only has to do with the skin, forgetting the crucial role of the muscle/joint receptors. The pure stimulation of the skin without a concurrent stimulation of joint/muscle receptors, which is being attempted in some “advanced” data gloves, is highly non-physiological and virtually pointless from the man-machine point of view. The concept of haptic perception captures the essence of this problem, although the term is not yet firmly established in neurobiology; Kandel & Schwartz, for example, speak of the “Somatic Sensory System” to include in an integrated way pain, thermal sensation, touch-pressure, position sense and kinesthesia. The main point is that in the tactile exploration of an object it is the “conjunction” of information from skin, muscle, and joint receptors, not their “disjunction”, which is essential. Slight misalignments of the different components can destroy the internal coherence (the Gestalt) of the haptic-

perceptual process. Kandel & Schwartz, differently from Shepherd, do not define “sensory modalities” per se but subdivide the analysis of sensory-perceptual processes into three main “Systems”: the somatic, visual, and auditory sensory systems, respectively. They consider the “sense of balance” as a component of the motor control system and leave out smell and taste, mainly due to their limited cortical representations. We suggest dropping the term “sense of touch” because it is misleading: it can be interpreted as all-inclusive or in an extremely selective way. It is better to distinguish, along with Shepherd, between “somesthesia” and “kinesthesia” and to define “haptic perception” as the conjunction of the two: not a mere sum but some kind of integration process which builds a common internal representation of space and objects. From this point of view, haptic perception is multimodal by itself.

4.3 Synchronization

4.3.1 Object synchronization

As stated in 2.3.1, audiovisual integration depends on the interaction and complexity of the acoustical and visual signals. The following taxonomy captures the essential aspects of this interaction and complexity. We do not provide numerical data characterizing the integration processes here, as they are highly dependent on the particular signals used. Fundamental to the underlying integration and cross-modal effects are the following properties:

Synchronized objects This category refers to audiovisual objects for which all relevant spatial and/or temporal properties, as measured by the single senses, overlap. Synchrony of representations may refer to many different aspects: spatial location, temporal changes, motion. The synchronized representation leads to a fast, strong and ‘natural’ enhancement of the representation, which is perhaps best described as single-sensory object descriptions giving rise to the creation of an ‘enhanced’ multisensory object. This enhancement leads in turn to many specific cross-modal effects; one of them is, for example, the enhanced intelligibility in audio-visual speech perception [208].

Nonsynchronized objects This term refers to the situation in which at least some of the spatial and/or temporal properties of objects, as measured by the single senses, are in conflict. Here it is a reasonable hypothesis that, because its goal is to build a consistent representation, the sensory integration system will put a lot of effort into integrating the conflicting data, even at the cost of sacrificing the precision of a single sense. Only if this process of building an integrated


representation fails will the data be interpreted as coming from many objects. One of the most prominent effects of this kind is ventriloquism [321], where, by means of clever visual stimulation, the ventriloquist uses his own voice but creates the strong illusion of a talking puppet. This is achieved by perfect integration of visual and acoustical stimuli. However, it seems that the integration effect in the spatial domain is much more pronounced than in the temporal domain.

4.3.2 Complexity of information

Apart from building the representation of the environment, the senses play important communication roles. For communication purposes, raw information signals are grouped into units which convey complex content. These groupings can be created, retrieved and manipulated in endless ways, reflecting the unlimited complexity of the information which can be created. We can differentiate at least several levels of information complexity with which the sensory processing system must deal:

Elementary signals These are signals which cannot be broken into simpler ones, as regards their spatial, temporal, or other specific properties.

Simple syntactical units This is a sequence of elementary signals producing an elementary component which serves as a basis for complex information representation.

Semantic sequences A sequence of syntactical units enables the representation of complex information.

All these types of stimulation can appear as inputs to the visual and acoustical systems. The question arises how their integrated audio-visual processing is organized. The information complexity can be paired with the object synchronization properties to produce a picture of audio-visual integration. The following rules can be formulated in this respect:

1. For synchronized audio-visual objects, the higher the complexity of information conveyed, the more enhanced their integrated representation becomes. This results in higher system performance in extracting the information from bimodal perception than from single senses.

2. For nonsynchronized objects, the processes involved are more complicated. In the case of lack of temporal synchronization, the building of an integrated representation is highly disturbed. However, in the case of spatial desynchronization, the building of


an integrated representation is preferred and the conflicting data coming from the single senses are attenuated. Usually the visual system is dominant over the acoustical system, that is, its data carries more weight in building the integrated representation.
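A crude, purely illustrative decision rule in code form, assuming that each unimodal object estimate carries only a position and a time stamp, might look like this (the tolerances and the weighting in favour of vision are assumptions, not measured values).

    #include <cmath>

    struct UnimodalObject {
        double x, y;       // estimated position
        double timeMs;     // estimated time of occurrence
    };

    // Returns true when the two estimates are close enough in space and time
    // to be fused into one multisensory object; otherwise they are treated
    // as separate objects.
    bool fuse(const UnimodalObject& visual, const UnimodalObject& acoustic,
              UnimodalObject& fused)
    {
        const double maxDist   = 30.0;   // assumed spatial tolerance
        const double maxDeltaT = 200.0;  // assumed temporal tolerance (ms)
        const double wVisual   = 0.8;    // vision dominates spatial conflicts

        double d  = std::hypot(visual.x - acoustic.x, visual.y - acoustic.y);
        double dt = std::fabs(visual.timeMs - acoustic.timeMs);
        if (d > maxDist || dt > maxDeltaT)
            return false;                               // interpreted as two objects

        fused.x = wVisual * visual.x + (1.0 - wVisual) * acoustic.x;
        fused.y = wVisual * visual.y + (1.0 - wVisual) * acoustic.y;
        fused.timeMs = 0.5 * (visual.timeMs + acoustic.timeMs);
        return true;                                    // ventriloquism-like capture
    }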

4.4 Virtual Reality?

VRR: What do you think is the greatest obstacle facing the VR industry and why?
Latta: Lack of basic research. The issues of having the most intimate form of human computer interface in Virtual Reality necessitate a thorough understanding of human perceptual, muscle and psychological systems. Yet that research foundation does not exist.
Dr John Latta interviewed in Virtual Reality Report, 2 (7), p. 4. Taken from: [195, page 43]

In real life, humans use machines, e. g. a car or a machine tool, and interact with them in a multimodal way, i. e. by means of perception (vision/hearing/haptic perception) and action (knob/wheel/lever/button control). In the case of virtual reality, the purpose is simply to substitute the sensory-motor flow elicited by a physical ambient environment with a computer-generated sensory-motor flow which imitates/emulates (a) physical reality to a sufficient degree. Flight simulators are a good example. The purpose is not to improve or optimize man-machine interaction but to reproduce the standard one, with the goal of training the user or simply entertaining him. Imitation/emulation is a possible paradigm of man-machine interaction, but it is not the only one. Due to the decoupling of the sensory-motor flow from the physical world, new environments can be generated based on environments which are physical in themselves, but of a completely different scale (molecular surface landscapes vs. galactic environments based on astronomical data), and which would normally never elicit an ambient sensory-motor flow. More generally, one can define paradigms in which multimodality is used for improving the interaction between the user and reality via a computer:

User ↔ Computer ↔ World

Figure 4.9: The computer as a mediator (or agent, see 5.2) between the user and the world


In [123], the following applications of VR are described in detail:

• Operations in hazardous or remote environments

• Scientific visualization

• Architectural visualization

• Design

• Education and training

• Computer supported cooperative work

• Space exploration

• Entertainment

It can be seen from this list that VR will have a great impact in many different domains in the future. The techniques and devices which are needed in order to immerse the user in the virtual world created by a computer are currently under development. In this sense, and because of the money that will be spent on VR research, it may be called one of the driving forces of multimodal technology. Ellis (in [89]) and Null and Jenkins (in [247]) from NASA labs on Virtual Reality (VR) argue that Virtual Environments (they use this name instead of VR), presented via head-mounted computer-driven displays, provide a new medium for man-machine interaction. Like other media, they have both physical and abstract components. Paper, for example, as a medium for communication, is itself one possible physical embodiment of the abstraction of a two-dimensional surface onto which marks may be made. The corresponding abstraction for head-coupled, virtual image, stereoscopic displays that synthesize a coordinated sensory experience is an environment. It is important to situate VR (or VEs, following NASA) with respect to man-machine interaction. In summary, we can understand multimodal interaction as a process characterized as follows: (i) the computer is able to capture the largest possible part of the human motor outflow, (ii) human movements are as little constrained as possible, (iii) the human receives from the computer a perceptual inflow which maximally utilizes the different available channels, (iv) the inflow and the outflow are optimally tuned, in relation to the specific task.


4.5 Analysis of Interaction

A number of formalisms have been developed to describe user behavior in human-computer interaction. Two basic approaches can be identified: (1) formal modeling and (2) statistical modeling. In formal modeling, the ideal user behavior is described in the form of a grammar. Several formalisms have been proposed, such as the well-known Backus-Naur Form (BNF), Task Action Grammar (TAG) [260], Goals/Operators/Methods/Selection rules (GOMS) [60], and others. A more modern (executable) example is SOAR [170]. These formalisms are either conceptual or operational simulation languages. The models can be developed theoretically, without experiments, when the task goals and the application constraints are sufficiently known. A weak point of the formal methods, however, is their inability to handle or describe human error and the differences in user styles (e. g., perceptual/cognitive/motoric styles). The second group of models is more of a statistical nature, using the post-hoc identification of State Transition Networks, Petri nets, or probabilistic grammar inference. Both method categories (formal vs. statistical) are much more suited for symbolical and discrete-action human-computer interaction than for describing details of continuous control, such as the dragging of an object on screen.
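As a toy example of the state-based view (our own illustration, not taken from the cited formalisms), a drag-and-drop interaction can be written as a small state transition network over discrete user events; the limitation regarding continuous control is visible in the closing comment.

    #include <iostream>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    // States of a minimal drag-and-drop interaction.
    enum class State { Idle, Pressed, Dragging, Done };

    int main() {
        // Transition table: (state, event) -> next state.
        std::map<std::pair<State, std::string>, State> next = {
            {{State::Idle,     "button_down"}, State::Pressed},
            {{State::Pressed,  "motion"},      State::Dragging},
            {{State::Dragging, "motion"},      State::Dragging},
            {{State::Dragging, "button_up"},   State::Done},
            {{State::Pressed,  "button_up"},   State::Idle},     // a click, not a drag
        };

        std::vector<std::string> trace = {"button_down", "motion", "motion", "button_up"};
        State s = State::Idle;
        for (const auto& ev : trace) {
            auto it = next.find({s, ev});
            if (it == next.end()) { std::cout << "unexpected event: " << ev << '\n'; break; }
            s = it->second;
        }
        std::cout << "drag completed: " << (s == State::Done) << '\n';
        // The network captures the discrete structure of the task, but not the
        // continuous trajectory of the drag itself, which is the limitation noted above.
    }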


Chapter 5

Cognition

As explained in the introduction, cognition is one of the most important topics for multimodal systems. Without it, data cannot be interpreted sufficiently by a computer, and the meaning and intention of the user’s actions will not be recognized by the machine. Therefore, cognition is the key issue that separates multimodal research from the more technically oriented multimedia and VR research. We start this chapter with a short overview of cognition in humans, followed by a part on learning in humans as well as in computers. In the last section, we will review the concept of intelligent agents, which is today’s most interesting approach to realizing ‘intelligent’, autonomous, and intentional behavior.

5.1 Cognition in Humans

It is essential to be aware of the fact that the “raw data” coming from the senses are not at the epistemological level of interest in MIAMI. Taking for granted the existing and well-known “unimodal” psychophysical constraints at this level, it is what happens at the higher levels of “information integration in bimodality” which needs our focused attention. The whole issue of sensory-motor channel capacities which originates in the field of psychophysics (Weber’s law, Fechner’s law, Stevens’s law from which the channel capacity data are obtained) is virtually useless within MIAMI, because it considers only a single channel at a time, in an isolated manner which tends to greatly underestimate the global or Gestalt-like features of the human perceptual-motor system. The main rationale of MIAMI is to exploit the (hidden) synergies between different channels in a “natural” environment (see the concept of “ecological perception” advocated by J. J. Gibson [122]). Modality, in the neurobiological literature, implies the adjective “sensory”. So it is not possible to speak of motor modalities. As mentioned earlier (Table 1.1), the classification

among different modalities is mainly in terms of the different forms of energy which are transduced and detected. A similar classification can be applied to motor control aspects, which produce different kinds of work (mechanical work, acoustic work, chemical energy, etc.). However, these distinctions are physical and do not capture the informational aspect, which is tied to task and context, i. e., the meaning of an event or action. For example, a sound (emitted and/or detected) can be an utterance (only relevant for its timing), a musical tune (with its pitch/duration/intensity/timbre), or a spoken word. It can be man-generated or machine-generated and, dually, machine-perceived or man-perceived. Similarly, the term “gesture” is too vague and strongly task-dependent. On the other hand, the notion of channel is too restrictive and inappropriate when dealing with multimodal interaction, because the whole point of exploring multimodality is that in biology the ensemble is much more than the pure sum of the parts: emergent properties and functionalities can arise if the parts are carefully matched. As a consequence, the attempt to classify input/output devices is an exercise in futility if it is not grounded in a specific context, i. e. is not made task-dependent. Therefore, the taxonomy document should terminate with a preliminary sketch of the different application paradigms. In order to structure the concepts in this area, the following representation levels are proposed [183], in increasing order of abstraction:

Signals A signal refers to the N-dimensional waveform representation of a modality. It is characterized by spectral content, and a required sampling frequency and resolution can be identified. The signal directly corresponds to a physical entity in a quantitative fashion. In the sound modality, signals refer to the acoustical or waveform representation of a sound. In computer models, signals are digitally represented by an array of numbers. In audio, for CD quality, a sampling rate of 44100 samples/sec and 16 bit resolution is often used. In music research, it is sometimes necessary to perform classical digital signal processing operations on musical signals, such as the Fourier transform or the wavelet transform (see for example [80]).

Perceptual Mappings A perceptual mapping represents transformed relevant aspects of a Signal in a condensed representation. In this framework, a Mapping is assumed to be a state or snapshot of the neural activity in a brain region during a defined time interval. It is modelled as an ordered array of numbers (a vector). For example, the most complete auditory mapping (i. e., the one closest to the Signal) is assumed to occur at the level of the auditory nerve. From this mapping all other mappings can be derived. For example, at the cochlear nucleus, auditory processing becomes differentiated and more specialized, and some neurons perform onset detection.


According to [182], Schemata and Mental Representations should be taken into account for a classification of auditory mappings (indeed, this classification scheme can also be generalized to other modalities).

Schemata A schema is a categorical information structure which reflects the learned functional organization of neurons as a response structure. As a control structure, it performs activity to adapt itself and to guide perception. Basic schema features are presented in [57]. Schemata are multifunctional. In the present framework, adaptation to the environment is seen as a long-term process taking several years. It is data-driven because no schema is needed in adapting other schemata to the environment. Long-term data-driven adaptation is distinguished from short-term schema-driven control. The latter is a short-term activity (e. g. 3 to 5 sec) responsible for recognition and interpretation. It relies on the schema and is therefore called schema-driven. Schema responses to signal-based auditory maps are also considered maps. In that sense, the response of a schema to a map is also a map. But the schema has an underlying response and control structure which is more persistent than maps. The structure contained in a schema is long-term, while the information contained in a map is short-term. The latter is just a snapshot of an information flow. As observed by Leman [182], neurophysiological studies provide evidence for the existence of different kinds of schemata (e. g., [328]). An example of schemata in auditory modeling is Leman’s two-dimensional array of artificial neurons in a Kohonen-type network [182]. This schema has been trained with short pieces of music and is able to classify tone centers and chords in input signals.

Mental Representations Mental representations are knowledge structures that refer to a “mental” world. They are used in solving specific tasks. Techniques of multidimensional scaling depict test data as mental spaces — with both analogical and topological properties. Schemata are computer implementations of mental representations. The latter can serve as metaphors for the schemata. One of the aims of modelling perception and cognition is to show that the representational categories are causally related to each other. Signals are transformed into maps, and maps organize into schemata and are controlled by these schemata. By looking for correlations between the model and psychological data, one may try to relate the cognitive brain maps to mental representations. As such, the notion of mental representation can be incorporated in the knowledge base. It is important to mention that this approach offers, contrary to the static world of mental representations, a dynamic point of view. This dynamics is introduced at two levels: (i) how the


cognitive map comes into existence, (ii) how it is used in a perception task. The categories have been useful in a study on tone semantics [183, 57] and are part of the present framework of a hybrid AI-based signal manipulation system.
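To make the signal and mapping levels a little more tangible, the following illustrative fragment condenses a sampled audio signal (signal level) into a short vector of frame energies, a crude stand-in for a perceptual mapping; the frame size and the energy measure are arbitrary choices, not part of the framework in [183].

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Signal level: raw samples, e.g. 44100 samples/s, 16-bit, here as doubles.
    // Mapping level: a condensed, ordered array of numbers derived from the signal.
    std::vector<double> frameEnergies(const std::vector<double>& signal,
                                      std::size_t frameSize = 1024)
    {
        std::vector<double> map;
        for (std::size_t start = 0; start + frameSize <= signal.size(); start += frameSize) {
            double e = 0.0;
            for (std::size_t i = start; i < start + frameSize; ++i)
                e += signal[i] * signal[i];
            map.push_back(std::sqrt(e / frameSize));   // RMS energy of one frame
        }
        return map;    // a "snapshot" representation, much shorter than the signal
    }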

The remainder of this section describes in more detail the basic requirements of representation and reasoning systems able to model these high-level aspects of cognition. This viewpoint reflects the fact that the taxonomy should not be reduced to the I/O channels alone.

5.1.1 Symbolic, subsymbolic, and analogical

To avoid confusion and misunderstandings, we adopt the terms symbolic, subsymbolic, and analogical with the following meaning. “Symbolic” stands for a representation system in which the atomic constituents of representations are, in their turn, representations. Such a representation system has a compositional syntax and semantics. The typical case of a symbolic system is an interpreted logical theory. We call a representation “subsymbolic” if it is made up of constituent entities that are not representations in their turn, e. g., pixels, sound images as perceived by the ear, signal samples; subsymbolic units in neural networks can be considered particular cases of this category [317]. The term “analogical” refers to a representation in which the constituents and their relations are one-to-one with the represented reality. In this category we include mental models as defined by Johnson-Laird [151], mental imagery and diagrammatic representations. Analogical representations play a crucial role in multimedia knowledge representation. Note that, in this sense, analogical is not the opposite of digital: in the hybrid model we are developing, analogical components are implemented by computer programs.

5.1.2 High-level representations: Basic issues and requirements

We need formalisms able to manage the different structures and levels of abstraction of multimedia objects, from symbolic, abstract representations to subsymbolic representations (e. g., the signal perceived by the human ear). Moreover, different views of the same objects are often necessary: according to the reasoning perspective and the goal to fulfill, multimedia object representations can vary from an atom in a symbolic high-level representation to a stream of low-level signals in the deepest view of the same material.


Metaphors are a crucial issue regarding representation and reasoning capabilities in multimedia systems. Metaphors are widely used in reasoning by humans and are at the basis of languages for the integration of different multimedia knowledge. For example, in the problem of a robot navigating in a three-dimensional space (e. g., a “museal robot” or a “robotic actor” on a theater stage), a bipolar force field can be a useful metaphor: in the mental model, a moving robot can correspond to an electric charge, a target to be reached corresponds to a charge of opposite sign, and obstacles correspond to charges of the same sign. Music languages are rich in metaphors derived from real-world dynamics — see for example [55]. In general, the terms and descriptions in one modality can be used to express intuitively “similar” concepts in other modalities. We deem that metaphors are the basic “glue” for integrating different modalities, e. g., sound/music and movement/dance representations. The issue of reasoning based on metaphors has been widely studied from different points of view in AI, psychology and philosophy. Steps toward an approach to model metaphors can be found for example in [115]: his theory analyzes metaphors in terms of similarities of topological structures between dimensions in a conceptual space.

Furthermore, formalisms able to support users should provide mechanisms for reasoning on actions and plans, and for analyzing alternatives and strategies, starting from user requirements and goals. They should provide both formal and informal analysis capabilities for inspecting the objects represented. Another point is learning, i. e., how to automatically update the system knowledge (new analysis data, new planning strategies), for example by means of generalization processes starting from examples presented by the user. The solutions proposed in the AI literature, such as the purely symbolic approaches and the learning systems based on neural networks, are preliminary attempts in this direction.

Lastly, an emerging aspect regards the modeling and communication of emotions in multimedia systems. Preliminary work is available in the literature [252, 288] and is currently being experimented with at Carnegie Mellon in the framework of interactive agent architectures [11].

5.1.3 Human learning and adaptation

A good review of the different topics in human learning can be found in [192], particularly in chapter 13, where they list a number of “laws of learning”:

1 – Law of effect An action that leads to a desirable outcome is likely to be repeated in similar circumstances.


2 – Law of causal relationship For an organism to learn the relationship between a specific action and an outcome, there must be an apparent causal relation between them.

3 – Law of causal learning (for desirable outcomes) The organism attempts to repeat those particular actions that have an apparent causal relation to the desired outcome.

4 – Law of causal learning (for undesirable outcomes) The organism attempts to avoid particular actions that have an apparent causal relation to the undesirable outcome.

5 – The law of information feedback The outcome of an event serves as information about that event.

As regards the content of the cognitive processes and their acquisition, the fundamental work was done by Piaget [268], who distinguished four stages in the intellectual development of children:

1. Sensorimotor development (0–2 years)

2. Preoperational thought (2–7 years)

3. Concrete operations (7–11 years)

4. Formal operations (11–15 years)

5.1.4 Hybrid interactive systems

Hybrid systems integrate different, complementary representations, each dealing with particular aspects of the domain. Combining different, complementary representations, as discussed in [78, pages 29–30], is therefore a key issue (a multi-paradigm approach, from the point of view of the computational mechanism). For example, taxonomic representations in terms of semantic networks are appropriate for inheritance and classification inference mechanisms [39]; production rules are appropriate for representing logical implications; reasoning on actions and plans requires still further mechanisms, such as temporal reasoning; finally, reasoning on the domain by means of “metaphors” is crucial for integrated multimedia representations. In this framework, integrated agent architectures are promising paradigms for the development of hybrid systems for multimodal integration and multimedia representation [288, 101, 201]. Intelligent agents are briefly reviewed later in this chapter, see 5.2.


5.2 (Intelligent) Agents and Multimedia

Intelligent agents are an emergent research area of particular interest in multimedia systems. Recent proposals — e. g., [101, 288] — regard the development of agent architectures for the distributed control of complex tasks, in the real world and in real time. Operating in the real world means coping with unexpected events at several levels of abstraction, both in time and space. This is one of the main requirements of intelligent agent architectures, and is also the typical scenario for multimodal interaction in multimedia systems. It is therefore of crucial importance to develop and experiment with integrated agent architectures in the framework of multimedia systems. For an overview of intelligent agents, we suggest reading the paper by Wooldridge and Jennings [368]. It covers theoretical aspects of agents as well as agent architectures and programming languages which have been specifically designed to be used with agents. Another interesting survey of existing agent architectures is [101], and a first attempt at an agent architecture developed in the MIAMI project is described in [56]. As stated in [101], “the intention is to provide an implementation of a software control architecture which is competent, functionally rich, behaviourally diverse, and which encourages and readily facilitates extensive experimental evaluation.” MIAMI experiments involve a set of autonomous agents, each delegated to a precise skill (e. g., an input modality) and characterized by flexible and modular integration in the different experimental scenarios. For a multimodal system, several aspects of intelligent agents are especially interesting, e. g. intentional behavior, belief, autonomy, negotiating capabilities, etc. All of them will significantly contribute to an intelligent user interface which is able to interpret the user’s actions, decide for itself the best way to present information to the user, and cope with any of several input modalities selected by the user without the user needing to explicitly tell the computer. Relations between metaphors and diagrammatic or pictorial representations are another interesting recent development [239, 124], which can be useful in the design of multimedia systems. In cognitive musicology, mental models and analogical representations based on metaphors are widespread. They are mostly related to the problem of music imagery. For example, Todd [219] argues that musical phrasing has its origin in the kinematics and the self-stimulation of (virtual) self-movement. This is grounded in the psycho-physical structure of the human auditory system. Another example comes from robot navigation in 3D space. In this task domain, a bipolar force field is a useful metaphor: the moving robot corresponds to an electric charge, and a target to be reached corresponds to a charge of

opposite sign. Obstacles correspond to charges of the same sign. Metaphorical reasoning implies the use of multiple representational levels and environment-based representations.
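A minimal sketch of the force-field metaphor for navigation, purely illustrative (the 2D setting, the gains, and the single obstacle are assumptions):

    #include <cmath>

    struct Vec2 { double x, y; };

    // Attractive force towards the target, repulsive force away from an obstacle,
    // following the electrostatic metaphor: the robot and the obstacles carry
    // charges of the same sign, the target a charge of opposite sign.
    Vec2 fieldForce(const Vec2& robot, const Vec2& target, const Vec2& obstacle)
    {
        const double kAttract = 1.0;   // assumed gains
        const double kRepel   = 2.0;

        auto pull = [&](const Vec2& p, double k, double sign) {
            double dx = p.x - robot.x, dy = p.y - robot.y;
            double d2 = dx * dx + dy * dy + 1e-6;      // avoid division by zero
            double d  = std::sqrt(d2);
            // inverse-square magnitude k/d^2 along the unit direction (dx/d, dy/d)
            return Vec2{ sign * k * dx / (d * d2), sign * k * dy / (d * d2) };
        };

        Vec2 a = pull(target,   kAttract, +1.0);       // towards the target
        Vec2 r = pull(obstacle, kRepel,   -1.0);       // away from the obstacle
        return Vec2{ a.x + r.x, a.y + r.y };
    }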

5.2.1 Application Scenarios

As a general assumption, in MIAMI we follow a bottom-up approach to AI, grounding representation and reasoning models in the psycho-physical and perceptual aspects. Examples of bottom-up approaches starting from the signal (perceptual) level of representation of musical signals can be found in [80, 8, 183]. According to our bottom-up approach to AI and multimedia, we are mainly interested in the study and development of “autonomous” multimedia systems, that is, autonomous agents characterized by multimodal interaction with user(s) in an unstructured and evolving environment [201, 320]. In particular, examples of real-world scenarios that are considered in MIAMI are the following:

• a theatrical machine: a system delegated to manage and integrate sound, music, and either three-dimensional computer animation of humanoid figures (e. g., dance movements) or the movement of real autonomous agents on a theater stage (e. g., a real vehicle on wheels, equipped with on-board sensors, a computer for the low-level processing of sensorial data, etc.). In both cases we have an agent which should be able to move, navigate, react to events happening on stage (e. g., actions performed by the actors), acquire sounds from the environment, and possibly execute musical tasks;

• a museal machine, based on an autonomous robot very similar in its architecture to the theatrical machine, which operates in real time in a museum exhibition area and is able to welcome, entertain, guide, and instruct visitors. See [319] for a different AI-based model developed for a similar museal domain. In both cases, musical competence is part of the requirements for the system.

• interactive systems for the acquisition of human movement (at various levels of abstraction) and its use in various applications, e. g., to control computer tasks ranging from advanced user interfaces to entertainment tools.

The main key points considered in the design of experimental system architectures in these application scenarios are: (i) high-level integration of different modalities and skills; (ii) high-level, multimodal interaction with the real world in real time: for example, such a system builds up representations of, and reasons on, a realistic model of what is happening on stage, e. g., for deciding how to interpret or generate a music object, far more than a

simple "triggering" mechanism; interaction therefore means a deeper mechanism than a simple temporal synchronization of chunks of music or graphics/animation data. Given the complexity of the problem domain, we deem that a single knowledge representation and reasoning formalism is not sufficient for all the aspects of such complex domains.


Chapter 6

Scenarios & Dreams

All previous chapters have more or less introduced the 'state of the art' of their particular topic. In this last chapter, we will dare to look into the future and describe some possible scenarios, or 'dreams', which might be implemented within the scope of MIAMI. The first section deals with the application of multimodal techniques in the domain of music and art. In the second part, a multimodal control architecture for a mobile robot with an active stereo vision system will be described.

6.1 The Multimodal Orchestra

A possible application scenario, relevant for entertainment, education, and clinical evaluation, is one in which the user is immersed in a multimodal experience, but not a conventional virtual reality that he alone can perceive. Rather, we can think of an audio-visual environment which can be communicated to other humans, either other actors participating in the same event or external spectators of the action. For example, we can think of a sound/music/light/image/speech synthesis system which is driven/tuned by the movements of the actors using specific metaphors for reaching, grasping, turning, pushing, navigating, etc. Specific scenarios could include:

• multimodal musical instruments

• multimodal orchestras

• multimodal choreography

• multimodal kinetic painting

• multimodal teaching by showing


Multimodal technology requires an instrumented environment, instrumented clothing, and instrumented tools which make it possible to capture natural human movements. Learning problems have two sides, and at least some of them can be translated into a dimensionality-reduction paradigm:

1. on the human side, as when learning to drive a car, there is the need to perceive the complex but smooth and continuous associations between movements and actions, avoiding the cognitive bottlenecks which may arise with discontinuous and/or symbolic feedback;

2. on the computer side, there is the dimensionality-reduction problem of extracting "principal components" from a high-dimensional space of redundant degrees of freedom, and this may or may not imply the recognition of gestures (see the sketch below).

In any case, we need learning because in multimodal systems there is not, in general, a simple one-to-one translation of signals and events as there is in VR systems.
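As a rough sketch of the dimensionality-reduction side of the problem, principal components of a high-dimensional, redundant movement recording could be extracted with a singular value decomposition; the movement matrix below is an invented stand-in for real sensor data:

import numpy as np

# movements: one row per time sample, one column per sensed degree of freedom
# (random numbers stand in for recorded multimodal movement signals)
movements = np.random.randn(1000, 40)

centered = movements - movements.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

k = 3                                   # keep the first k principal components
components = Vt[:k]                     # directions in the 40-D sensor space
scores = centered @ components.T        # low-dimensional control signals
explained = (s[:k]**2).sum() / (s**2).sum()
print(f"{explained:.0%} of the movement variance kept in {k} dimensions")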

6.2 Multimodal Mobile Robot Control

One of the most interesting research areas in robotics is the development of fully autonomous systems. In order to achieve autonomy, mobile systems have to be equipped with sophisticated sensory systems, massive computing power, and what might be called an "intelligent controller". Because this is not possible with today's technology, many researchers try to incorporate the human operator into the sensing and control loop, thus replacing parts of the system where appropriate. For instance, the recognition and decision capabilities of humans are at present much better than those of a mobile robot. Although a mobile system might be equipped with a vision system, image processing and object recognition still take too much time to drive autonomously at reasonable speed. On the other hand, the locally acting sensory systems of the robot can perceive and process data which is usually not detectable by humans, like the reflection of ultrasound or gas molecules which might be "smelt" by a special detector. Thus, a teleoperation system where parts of the controller run autonomously whereas others need the operator's interaction seems to be a very promising approach.

The control of a mobile vehicle (or robot) with advanced multimodal methods and sophisticated I/O devices is both very attractive and useful. Take, e. g., the mobile platform Priamos (shown in figure 6.1) which is currently under development at the University of Karlsruhe [188]. It is equipped with a multisensor system which includes the following:


Figure 6.1: A sketch and a real picture of the mobile robot system Priamos

Ultrasonic sensors 24 sensors are placed in a ring around the robot. The system offers several modes, and each sensor can be addressed independently.

Active vision system The active stereo vision system Kastor is mounted on top of the platform, allowing the operator to get a panoramic view of the remote world. 18 degrees of freedom can be controlled independently [357].

Structured light Two laser diodes emit structured light which is especially easy to “see” for a third camera due to the filters applied. The intersection of the two laser lines allows an easy detection of obstacles on the floor.


Figure 6.2: The controller configuration of Priamos (taken from [188])

The control concept of Priamos, which is shown in figure 6.2, follows the "shared autonomy" approach (see [142]), i. e. the robot receives its programs and data from a supervisory station and carries out its tasks autonomously. If an unforeseen event occurs, the robot asks the human operator for assistance. In addition, the operator can take over control at

any time. Supervision of the robot's tasks is performed with the sensory data sent to the operator and a simulation model which runs in parallel¹ to the real execution. The software and hardware architecture of Priamos is suitable for extension with multimodal functionalities. One of several possible system configurations will be described in the following scenario (which is only fictitious at the moment):

“The mobile robot platform operates in one of two different modes. In Autonomous Mode, the robot uses an internal map and its sensory system to navigate without any help from the operator, who only supervises the movements. In Supervisory Mode, the operator remotely controls the robot. For this task he uses a joystick which not only controls all degrees of freedom (x- and y-translation, z-rotation) but also reflects several characteristics of the robot's environment. To this end, the joystick has been equipped with force-feedback capabilities. Whenever the robot approaches an obstacle which is sensed by the ultrasonic sensors, the operator can feel it as soon as the distance falls below a certain threshold. Thus, collisions can be avoided without loading the operator's "vision system" with additional information. This is especially necessary because the operator is already burdened with the images of the active stereo vision system Kastor. The two cameras send their images independently to the two displays of the Head-Mounted Display (HMD) that the operator is wearing. Thus, he gets a full 3D impression of the remote world.

The camera head can also operate in autonomous or supervisory mode. In the latter, the operator uses his second hand to move the cameras in the desired direction. By fully controlling all 18 degrees of freedom with a kind of master-slave manipulator or a 6D mouse, the operator is able to focus on any object in the view of the cameras.

When using the helmet, the operator is no longer able to use a keyboard for additional command input. This is no drawback, because all commands can simply be spoken, as the control system is equipped with a natural language processing (NLP) unit. Although it recognizes only keywords and values, it is comfortable and easy to use because it is a speaker-independent system which does not have to be trained. Simple commands like "stop" or "faster" can therefore be executed in a very intuitive way.

Instead of the force feedback described above, another way to inform the operator about nearby obstacles is to convert the signals of the ultrasonic sensors

¹Or even some time ahead, as a kind of prediction mechanism.


into acoustical information. For instance, the reflection of a wall detected by the sensors can be recreated artificially by a sound generator, thus giving the operator the impression of "walking" through the remote environment. Of course, alarm signals are transmitted to him acoustically, too.”
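A minimal sketch of the kind of mapping the scenario implies, assuming hypothetical distance readings from the 24 ultrasonic sensors and an invented proximity threshold; the same normalized intensity could drive either the joystick counter-force or the sound generator:

THRESHOLD = 0.8   # metres (invented); below this the operator should feel/hear the obstacle

def obstacle_cues(distances, threshold=THRESHOLD):
    """Map each ultrasonic reading to a normalized cue intensity in [0, 1].

    0.0 means no feedback; 1.0 means an obstacle touching the robot's skin.
    The same intensity can scale the joystick counter-force or a sound amplitude.
    """
    cues = []
    for d in distances:
        if d >= threshold:
            cues.append(0.0)
        else:
            cues.append(1.0 - d / threshold)
    return cues

# 24 readings, one per sensor in the ring (values in metres, hypothetical):
readings = [3.2, 1.5, 0.6, 0.3] + [4.0] * 20
print(obstacle_cues(readings)[:4])   # -> [0.0, 0.0, 0.25, 0.625]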

The scenario described above has not been realized yet, but it is based exclusively on techniques which either already exist or are under development. The most difficult problem will be the complexity of such a system, which also complicates the integration of all the separate system components. We are convinced that the techniques and methods which are going to be developed in MIAMI will contribute significantly to this integration process.


Appendix A

An Introduction to Binaural Technology

Humans, like most vertebrates, have two ears that are positioned at about equal height on the two sides of the head. Physically, the two ears and the head form an antenna system, mounted on a mobile base. This antenna system receives elastomechanical (acoustic) waves of the medium in which it is immersed, usually air. The two waves received and transmitted by the two ears are the physiologically adequate input to a specific sensory system, the auditory system.

The peripheral parts of the auditory system transform each of the two waves into neural spike trains, after having performed a running spectral decomposition into multiple frequency channels, among other preprocessing. The multi-channel neural spike trains from the two ears are then combined in a sophisticated way to generate a running "binaural-activity pattern" somewhere in the auditory system. This binaural-activity pattern, most probably in combination with monaural-activity patterns rendered individually by each ear's auditory channels, forms the auditory input to the cortex, which represents a powerful biological multi-purpose parallel computer with a huge memory and various interfaces and input and output ports. As an output, the cortex delivers an individual perceptual world and, eventually, neural commands to trigger and control specific motoric expressions.

It goes without saying that a number of constraints must hold for this story to be true. For example, the acoustic waves must be in the range of audibility with respect to frequency range and intensity, the auditory system must be operative, and the cortex must be in a conscious mode, ready to accept and interpret auditory information. Further, it makes sense to assume that multiple sources of feedback are involved in the processes of reception, processing and interpretation of acoustic signals. Feedback clearly occurs

between the modules of the subcortical auditory system, and between this system and the cortex. Obvious feedback from higher centers of the central nervous system to the motoric positioning system of the ears-and-head array can also be observed whenever position-finding movements of the head are induced.

Although humans can hear with one ear only (so-called monaural hearing), hearing with two functioning ears is clearly superior. This fact can best be appreciated by considering the biological role of hearing. Specifically, it is the biological role of hearing to gather information about the environment, particularly about the spatial positions and trajectories of sound sources and about their state of activity. Further, it should be recalled in this context that interindividual communication is predominantly performed acoustically, with brains deciphering meanings as encoded into acoustic signals by other brains. In view of this generic role of hearing, the advantage of binaural as compared to monaural hearing stands out clearly in terms of performance, particularly in the following areas [27]:

1. localization of single or multiple sound sources and, consequently, formation of an auditory perspective and/or an auditory room impression;

2. separation of signals coming from multiple incoherent sound sources spread out spatially or, with some restrictions, coherent ones;

3. enhancement of the signals from a chosen source with respect to further signals from incoherent sources, as well as enhancement of the direct (unreflected) signals from sources in a reverberant environment.

It is evident that the performance features of binaural hearing pose a challenge for engineers in terms of technological application. In this context a so-called Binaural Technology has evolved during the past three decades, which can operationally be defined as follows. Binaural Technology is a body of methods that involve the acoustic input signals to both ears of the listener for achieving practical purposes, e. g., by recording, analyzing, synthesizing, processing, presenting and evaluating such signals.

Binaural Technology has recently gained economic momentum, both on its own and as an enabling technology for more complex applications. A specialized industry for Binaural Technology is rapidly developing. It is the purpose of this chapter to take a brief look at this exciting process and to reflect on the bases on which this technology rests, i. e., on its experimental and theoretical foundations. As has been discussed above, there are basically three "modules" engaged in the reception, perception and interpretation of acoustical signals: the ears-and-head array, the subcortical auditory system, and the cortex. Binaural Technology makes use of knowledge of the functional principles of each. In the following

three sections, particular functions of these three modules are reviewed in the light of their specific application in Binaural Technology.

A.1 The Ears-and-Head Array: Physics of Binaural Hearing

The ears-and-head array is an antenna system with complex and specific transmission characteristics. Since it is a physical structure and sound propagation is a linear process, the array can be considered to be a linear system. By taking an incoming sound wave as the input and the sound-pressure signals at the two eardrums as the output, it is correct to describe the system as a set of two self-adjusting filters connected to the same input. Self-adjusting, in the sense used here, means that the filters automatically provide transfer functions that are specific with regard to the geometrical orientation of the wavefront relative to the ears-and-head array.

Physically, this behavior is explained by resonances in the open cavity formed by pinna, ear canal and eardrum, and by diffraction and reflection by head and torso. These various phenomena are excited differently when a sound wave impinges from different directions and/or with different curvatures of the wavefront. The resulting transfer functions are generally different for the two filters, thus causing "interaural" differences of the sound-pressure signals at the two eardrums. Since the linear distortions superimposed upon the sound wave by the two "ear filters" are very specific with respect to the geometric parameters of the sound wave, it is not far from the mark to say that the ears-and-head system encodes information about the position of sound sources in space, relative to this antenna system, into temporal and spectral attributes of the signals at the eardrums and into their interaural differences.

All manipulations applied to the sound signals by the ears-and-head array are purely physical and linear. It is obvious, therefore, that they can be simulated. As a matter of fact, one important branch of Binaural Technology attempts to do just this. It makes sense at this point to begin the technological discussion with the earliest, and still a very important, application of Binaural Technology, namely, authentic auditory reproduction. Authentic auditory reproduction has been achieved when listeners hear exactly the same in a reproduction situation as they would hear in an original sound field, the latter existing at a different time and/or location. As a working hypothesis, Binaural Technology starts from the assumption that listeners hear the same in a reproduction situation as in an original sound field when the signals at the two eardrums are exactly the same during reproduction as in the original field. Technologically, this goal

is achieved by means of so-called artificial heads, which are replicas of natural heads in terms of acoustics, i. e. they realize two self-adjusting ear filters like natural heads. Artificial heads, in combination with adequate playback equipment, are a basic instrumentation for a number of economically appealing applications. The playback equipment needed for this application is usually based on headphones. Yet, under specific, restricted conditions, loudspeakers can also be used. A first category of application in this context is subsumed under the following section.

A.1.1 Binaural recording and authentic reproduction

These applications exploit the capability of Binaural Technology to archive a sound field in a perceptually authentic way and to make it available for listening at will, e. g., in entertainment, education, instruction, scientific research, documentation, surveillance, and telemonitoring. It should be noted that binaural recordings can be compared in direct sequence (e. g., by A/B comparison), which is often impossible for the original sound situations. Since the sound-pressure signals at the two eardrums are the physiologically adequate input to the auditory system, they are considered the basis for auditory-adequate measurement and evaluation, in both a physical and an auditory way [31]. Consequently, we have the further application category discussed below.

A.1.2 Binaural measurement and evaluation

In physical binaural measurement, physically based procedures are used, whereas in the auditory case, human listeners serve as measuring and evaluating instruments. Current applications of binaural measurement and evaluation can be found in areas such as noise control, acoustic-environment design, sound-quality assessment (for example, in speech technology, architectural acoustics and product-sound design), and in specific measurements on telephone systems, headphones, personal hearing protectors, and hearing aids [30, 305]. For some applications, scaled-up or scaled-down artificial heads are in use, for instance for the evaluation of architectural scale models [91, 92, 371, 373].

Since artificial heads are basically just a specific way of implementing a set of linear filters, one may think of other ways of realizing such filters, e. g., electronically. For many applications this adds additional degrees of freedom, as electronic filters can be controlled at will over a wide range. This idea leads to yet another category of application, as follows.


A.1.3 Binaural simulation and displays

There are many current applications in binaural simulation and displays, with the potential for an ever-increasing number. The following list provides examples: binaural mixing [277], binaural room simulation [180, 181], advanced sound effects (for example, for computer games), provision of auditory spatial-orientation cues (e. g., in the cockpit or for the blind), auditory display of complex data, and auditory representation in teleconference, telepresence and teleoperator systems.

Figure A.1: Binaural-Technology equipment of different complexity: (a) probe-microphone system on a real head, (b) artificial-head system, (c) artificial-head system with signal-processing and signal-analysis capabilities, (d) binaural room-simulation system with head-position tracker for virtual-reality applications.

Figure A.1, by showing Binaural-Technology equipment in order of increasing complexity, is meant to illustrate some of the ideas discussed above. The most basic equipment is obviously the one shown in panel (a). The signals at the two ears of a subject are picked up by (probe) microphones in the subject's ear canals, then recorded, and later played back

to the same subject after appropriate equalization. Equalization is necessary to correct linear distortions induced by the microphones, the recorder, and the headphones, so that the signals in the subject's ear canals during playback correspond exactly to those in the pick-up situation. Equipment of this kind is adequate for personalized binaural recordings. Since a subject's own ears are used for the recording, maximum authenticity can be achieved.

Artificial heads (panel b) have practical advantages over real heads for most applications; for one thing, they allow for auditory real-time monitoring of a different location. One has to realize, however, that artificial heads are usually cast or designed from a typical or representative subject. Their directional characteristics will thus, in general, deviate from those of an individual listener. This fact can lead to a significant decrease in perceptual authenticity. For example, errors such as sound coloration or front-back confusion may appear. Individual adjustment is only partly possible, namely, by equalizing the headphones specifically for each subject. To this end, the equalizer may be split into two components, a head equalizer (1) and a headphone equalizer (2). The interface between the two allows some freedom of choice. Typically, it is defined in such a way that the artificial head features a flat frequency response either for frontal sound incidence (free-field correction) or in a diffuse sound field (diffuse-field correction). The headphones must be equalized accordingly. It is clear that individual adjustment of the complete system, beyond a specific direction of sound incidence, is impossible in principle, unless the directional characteristics of the artificial head and the listener's head happen to be identical.

Panel (c) depicts the set-up for applications where the signals to the two ears of the listener are to be measured, evaluated and/or manipulated. Signal-processing devices are provided to work on the recorded signals. Although real-time processing is not necessary for many applications, real-time playback is mandatory. The modified and/or unmodified signals can be monitored either by a signal analyzer or by binaural listening.

The most complex equipment in this context is represented by panel (d). Here the input signals no longer stem from a listener's ears or from an artificial head, but have been recorded or even generated without the participation of ears or ear replicas. For instance, anechoic recordings via conventional studio microphones may be used. The linear distortions which human ears superimpose on the impinging sound waves, depending on their direction of incidence and wave-front curvature, are generated electronically via a so-called ear-filter bank (electronic head). To be able to assign the adequate head-transfer function to each incoming signal component, the system needs data on the geometry of the sound field. In a typical application, e. g. architectural-acoustics planning, the system contains a sound-field simulation based on data of the room geometry, the absorption features of the materials involved, and the positions of the sound sources and their directional characteristics.

The output of the sound-field modeling is fed into the electronic head, thus producing so-called binaural impulse responses. Subsequent convolution of these impulse responses with anechoic signals generates the binaural signals a subject would observe in a corresponding real room. The complete method is often referred to as binaural room simulation.

To give subjects the impression of being immersed in a sound field, it is important that perceptual room constancy is provided. In other words, when the subjects move their heads around, the perceived auditory world should nevertheless maintain its spatial position. To this end, the simulation system needs to know the head position in order to be able to control the binaural impulse responses adequately. Head-position sensors therefore have to be provided. The impression of being immersed is of particular relevance for applications in the context of virtual reality.

All of the applications discussed in this section are based on the provision of two sound-pressure signals to the eardrums of human beings, or on the use of such signals for measurement and application. They are built on our knowledge of what the ears-and-head array does, i. e., on our understanding of the physics of the binaural transmission chain in front of the eardrum. We shall now proceed to the next section, which deals with the signal processing behind the eardrum and its possible technical applications.
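Before turning to the processing behind the eardrum, the convolution step at the heart of binaural room simulation can be sketched as follows (hypothetical code; the impulse responses are invented stand-ins for the output of a sound-field model and electronic head):

import numpy as np

def binaural_render(anechoic, bir_left, bir_right):
    """Convolve a mono anechoic signal with a binaural impulse-response pair,
    yielding the left/right ear signals of the simulated room."""
    left = np.convolve(anechoic, bir_left)
    right = np.convolve(anechoic, bir_right)
    peak = max(np.abs(left).max(), np.abs(right).max(), 1e-12)
    return left / peak, right / peak     # normalized to avoid clipping

# Invented data: 0.25 s of anechoic signal and 0.1 s impulse responses at 44.1 kHz
fs = 44100
anechoic = np.random.randn(fs // 4)
decay = np.exp(-np.linspace(0, 8, fs // 10))
bir_l = np.random.randn(fs // 10) * decay
bir_r = np.random.randn(fs // 10) * decay
left, right = binaural_render(anechoic, bir_l, bir_r)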

A.2 The Subcortical Auditory System: Psychophysics of Binaural Hearing

As mentioned above, the subcortical auditory system converts incoming sound waves into neural spike trains which are then processed in a very sophisticated way. Among the things that we know from physiological experiments are the following. The signals are decomposed into spectral bands that are maintained throughout the system. Autocorrelation of the signals from each of the ears, as well as cross-correlation of the signals from both ears, are performed. Specific inhibition and excitation effects are extensively present.

Models of the function of the subcortical auditory system take our knowledge of its physiology into account, but are usually oriented primarily towards the modeling of psychoacoustic findings. Most models have a signal-driven, bottom-up architecture. As an output, a (running) binaural-activity pattern is rendered that displays features corresponding to psychoacoustic evidence and/or allows for the explanation of binaural performance features. Since psychoacoustics, at least in the classical sense, attempts to design listening experiments in a "quasi-objective" way, psychoacoustic observations are, as a rule, predominantly associated with processes in the subcortical auditory system.

Figure A.2: Architecture of an application-oriented model of binaural hearing: binaural signals as delivered by the ears-and-head array (or its electronic simulation) are fed into a model of the subcortical auditory system, whose essential modules simulate the function of the cochleae and of binaural interaction. The interface between the subcortical auditory model and the evaluation stages on top of it is provided by a running binaural-activity pattern.

There seems to be the following consensus among model builders: a model of the subcortical auditory system must incorporate at least three functional blocks to simulate binaural performance in the areas listed above (figure A.2):

1. a simulation of the functions of the external ear, including head (skull), torso, pinnae, ear canal, and eardrum, plus, possibly, the middle ear;

2. a simulation of the inner ears, i. e. the cochleae, including receptors and first neurons; plus a set of binaural processors to identify interaurally correlated contents of


the signals from the two cochleae and to measure interaural arrival-time and level differences; along with, possibly, additional monaural processors;

3. a set of algorithms for final evaluation of the information rendered by the preceding blocks with respect to the specific auditory task to be simulated.

The first block corresponds to the ears-and-head array as discussed in the preceding section, with the exception of the middle ear. As a matter of fact, detailed modeling of the middle ear is deemed unnecessary in current Binaural Technology. The middle ear is approximated by a linear time-invariant bandpass, thus neglecting features such as the middle-ear reflex. Nevertheless, more elaborate models of the middle ear are readily available from the literature, if needed [146, 145, 147, 32].

The second block includes two essential modules, cochlea simulation and simulation of subcortical binaural interaction. They will now be discussed in this order. The cochlea model simulates two primary functions, namely, a running spectral analysis of the incoming signals, and a transformation of the (continuous) mechanical vibrations of the basilar membrane into a (discrete) nerve-firing pattern: physiological analog-to-digital conversion. In doing so, it has to be considered that both the spectral selectivity and the A/D conversion depend on the signal amplitude, i. e., behave nonlinearly. The simplest approximation of the spectral selectivity is a bank of adjacent band-pass filters, each, for example, of critical bandwidth. This realization is often used when computing speed is more relevant than precision. More detailed modeling is achieved by including the spectrally selective excitation at each point of the basilar membrane. The amplitude dependence of excitation and selectivity can optionally be included in the model by simulating active processes, which are supposed to be part of the functioning of the inner ear.

A more precise simulation of the physiological A/D conversion requires a stochastic receptor-neuron model to convert the movement of the basilar membrane into neural-spike series. Such models have indeed been implemented for simulations of some delicate binaural effects. However, for practical applications, it is often not feasible to process individual neural impulses. Instead, one can generate deterministic signals that represent the time function of the firing probability of a bundle of nerve fibers. For further simplification, a linear dependence of the firing probability on the receptor potential is often assumed. The receptor potential is sufficiently well described for many applications by the time function of the movement of the basilar membrane, half-wave rectified and fed through a first-order low-pass with an 800 Hz cut-off frequency. This accounts for the fact that, among other things, in the frequency region above about 1.5 kHz, binaural interaction works on the envelopes rather than on the fine structure of the incoming signals.
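A strongly reduced sketch of such a cochlea front end is given below: a band-pass filterbank followed by half-wave rectification and a first-order 800 Hz low-pass. The filter layout and bandwidths are simplifying assumptions, not those of the models cited above.

import numpy as np
from scipy.signal import butter, lfilter

def cochlea_frontend(signal, fs, centre_freqs):
    """Crude cochlea simulation: band-pass filterbank, half-wave rectification,
    first-order low-pass at 800 Hz (envelope / firing-probability stage)."""
    b_lp, a_lp = butter(1, 800.0, btype='low', fs=fs)
    channels = []
    for fc in centre_freqs:
        lo, hi = fc / 2**0.25, fc * 2**0.25        # roughly half-octave band
        b, a = butter(2, [lo, hi], btype='bandpass', fs=fs)
        band = lfilter(b, a, signal)
        rectified = np.maximum(band, 0.0)          # half-wave rectification
        channels.append(lfilter(b_lp, a_lp, rectified))
    return np.array(channels)                      # shape: (bands, samples)

fs = 16000
t = np.arange(fs) / fs
probe = np.sin(2 * np.pi * 500 * t)                # 500 Hz test tone
pattern = cochlea_frontend(probe, fs, centre_freqs=[250, 500, 1000, 2000])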


With regard to the binaural processors, the following description results from work performed in the author's lab at Bochum (e. g., [190, 191, 113]). First, a modified interaural running-cross-correlation function is computed, based on signals originating at corresponding points of the basilar membranes of the two cochlea simulators, i. e., points which represent the same critical frequency. The relevance of cross-correlation to binaural processing has been assumed more than once and is, moreover, physiologically evident. A Bochum modification of cross-correlation consists in the employment of a binaural contralateral-inhibition algorithm. Monaural pathways are further included in the binaural processors to allow for the explanation of monaural-hearing effects.

Some details of the binaural processors are given in the following. The first stage of the processor is based on the well-known coincidence-detector hypothesis. A way to illustrate this is by assuming two complementary tapped delay lines, one coming from each ear, whose taps are connected to coincidence cells which fire on receiving simultaneous excitation from both sides' delay lines. It can be shown that this stage renders a family of running interaural cross-correlation functions as output. Thus we arrive at a three-dimensional pattern (interaural arrival-time difference, critical-band frequency, cross-correlation amplitude) which varies with time and can be regarded as a running binaural-activity pattern.

The generation of the running cross-correlation pattern is followed by the application of a mechanism of contralateral inhibition, based on the following idea. Once a wavefront has entered the binaural system through the two ears, it will give rise to an activity peak in the binaural pattern. Consequently, inhibition will be applied to all other possible positions of activity in each band where excitation has taken place. In each band where signals are received, the first incoming wavefront will thus gain precedence over possible activity created by later sounds which are spectrally similar to the first incoming wavefront, such as reflections. The actual amount of inhibition is determined by specific weights which vary as a function of position and time, so as to fit psychoacoustical data. Inhibition may, for example, continue for a couple of milliseconds and then gradually die away until it is triggered again.

Using this concept, as well as a specific algorithm of contralateral inhibition, in combination with the inclusion of monaural pathways in the processor, the processing of interaural level differences by the binaural system is properly modeled at the same time. For certain combinations of interaural arrival-time and interaural level differences, e. g. "unnatural" ones, the model will produce multiple peaks in the inhibited binaural-activity pattern, thus predicting multiple auditory events, very much in accordance with the psychoacoustical data [114]. To deal with the problem that natural interaural level differences are much higher at high frequencies than at low ones, the binaural processors must be adapted to the external-ear transfer functions used in the model. To this end, additional inhibitory weighting is

implemented on the delay lines of the coincidence networks in such a way that the binaural processors are always excited within their "natural" range of operation. This additional weighting is distributed along the delay lines. The complete set of binaural processors can thus be conceptualized as an artificial neural network, more specifically, as a particular kind of time-delay neural network. The adaptation of this network to the particular set of external-ear transfer functions used is accomplished by means of a supervised learning procedure.

The output of the binaural processor, a running binaural-activity pattern, is assumed to interface to higher nervous centers for evaluation. The evaluation procedures must be defined with respect to the actual, specific task required. Within the scope of our current modeling, the evaluation process is thought of in terms of pattern recognition. This concept can be applied when the desired output of the model system is a set of sound-field parameters, such as the number and positions of the sound sources, the amount of auditory spaciousness, reverberance, coloration, etc. Also, if the desired output of the model system is processed signals, such as a monophonic signal which has been improved with respect to its S/N ratio, the final evaluative stage may produce a set of parameters for controlling further signal processing. Pattern-recognition procedures have so far been projected for various tasks in the field of sound localization and spatial hearing, such as lateralization, multiple-image phenomena, summing localization, auditory spaciousness, binaural signal enhancement, and parts of the precedence effect (see [29] for cognitive components of the precedence effect). Further, effects such as binaural pitch, dereverberation and/or decoloration are within the scope of the model.

We shall now consider the question of whether the physiological and psychoacoustic knowledge of the subcortical auditory system, as manifested in models of the kind described above, can be applied in Binaural Technology. Since we think of the subcortical auditory system as a specific front end to the cortex that extracts and enhances certain attributes of the acoustic waves for further evaluation, signal-processing algorithms as observed in the subcortical auditory system may certainly be applied in technical systems to simulate performance features of binaural hearing. Progress in signal-processor technology makes it feasible to implement some of them on microprocessor hardware for real-time operation. Consequently, a number of interesting technical applications have come within reach of today's technology. A first category is concerned with spatial hearing, as described below.


A.2.1 Spatial hearing

Auditory-like algorithms may decode information from the input signals to the ears that allows assessment of the spatial positions of sound sources. They may further be used for predictions of how humans form the positions and spatial extents of their auditory events, how they establish an auditory perspective, and how they suppress echoes and reverberance. Typical applications are: source-position finders, tools for the evaluation of architectural acoustics and sound systems (such as spaciousness meters, echo detectors, and precedence indicators), tools for the evaluation of auditory virtual environments, and tools for psychoacoustic research.

There are further perceptual features of auditory events, besides position and spatial extent, which are based on binaural rather than monaural information. Following a usage in the field of product-sound design, they may be called binaural psychoacoustic descriptors, as discussed in the next section.
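As an illustration of the kind of computation behind a source-position finder, the following simplified, broadband sketch estimates the interaural arrival-time difference from the peak of the interaural cross-correlation (the running, per-critical-band processing described in the previous section is omitted here):

import numpy as np

def estimate_itd(left, right, fs, max_itd=0.001):
    """Estimate the interaural time difference (seconds) as the lag of the
    cross-correlation peak, searched within +/- 1 ms (physiological range)."""
    max_lag = int(max_itd * fs)
    lags = np.arange(-max_lag, max_lag + 1)
    corr = [np.dot(left[max_lag:-max_lag],
                   right[max_lag + l:len(right) - max_lag + l]) for l in lags]
    return lags[int(np.argmax(corr))] / fs

# Invented test case: a noise burst reaching the right ear 0.5 ms later
fs = 44100
noise = np.random.randn(fs // 10)
delay = int(0.0005 * fs)
left = np.concatenate([noise, np.zeros(delay)])
right = np.concatenate([np.zeros(delay), noise])
print(f"estimated ITD: {estimate_itd(left, right, fs) * 1000:.2f} ms")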

A.2.2 Binaural psychoacoustic descriptors

Binaural psychoacoustic descriptors include binaural loudness, binaural pitch, binaural timbre and binaural sensory-consonance. Algorithms taken from binaural auditory models may be used to generate estimates of these descriptors. There is an increasing demand for such tools, e. g., in the area of sound-quality evaluation. The most tempting field of application for binaural auditory models, however, concerns the ability of binaural hearing to process signals from different sources selectively, and to enhance one of them with regard to the others. A key word for this area of application could be binaural signal enhancement.

A.2.3 Binaural signal enhancement

A well-known term in the context of binaural signal enhancement is the so-called "cocktail-party effect," denoting that, with the aid of binaural hearing, humans can concentrate on one talker in the presence of competing ones. It has further been established that with binaural hearing a desired signal and noise can be separated more effectively than with monaural hearing. Binaural auditory models may help to simulate these capabilities by providing front ends that allow for better separation of a mix of sound sources. In a specific Bochum version of a so-called "cocktail-party" processor, i. e., a processor to enhance speech in a "cocktail-party" situation, the binaural processor of the auditory model is used to control a Wiener filter [35, 34]. This is accomplished by first identifying the position of a desired talker in space, and then estimating its S/N ratio with respect to

competing talkers and other noise signals. The system performs its computation within critical bands. In the case of two competing talkers, the desired signal can be recovered to reasonable intelligibility, even when its level is 15 dB lower than that of the competing one. Application possibilities for this kind of system are numerous, such as tools for editing binaural recordings, front ends for signal-processing hearing aids, speech recognizers and hands-free telephones. In general, binaural signal enhancement may be used to build better "microphones" for acoustically adverse conditions. As stated above, the cues provided by models of the subcortical auditory system, and contained in binaural-activity patterns, must be evaluated in adequate ways. The next section deals with this problem.
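Reduced to its simplest form, the gain computation of such a processor can be sketched as a per-band Wiener gain, assuming that band-power estimates for the desired talker and the interference are already available from the binaural front end (the numbers below are invented):

import numpy as np

def wiener_gains(signal_power, noise_power, floor=0.05):
    """Per-band Wiener gain G = SNR / (1 + SNR), with a small spectral floor
    to limit musical-noise artefacts. Inputs are arrays of band powers."""
    snr = np.asarray(signal_power, float) / (np.asarray(noise_power, float) + 1e-12)
    return np.maximum(snr / (1.0 + snr), floor)

# Invented band-power estimates for 5 critical bands (desired talker vs. rest)
target = np.array([4.0, 1.0, 0.2, 3.0, 0.05])
interference = np.array([1.0, 1.0, 1.0, 0.5, 1.0])
print(np.round(wiener_gains(target, interference), 2))
# -> [0.8  0.5  0.17 0.86 0.05]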

A.3 The Cortex: Psychology of Binaural Hearing

Most models of the subcortical auditory system assume a bottom-up, signal-driven process up to their output, the running binaural-activity pattern. The cortex, consequently, takes this pattern as an input. The evaluation of the binaural-activity pattern can be conceived as a top-down, hypothesis-driven process. According to this line of thinking, cortical centers set up hypotheses, e. g., in terms of expected patterns, and then try to confirm these hypotheses with appropriate means, e. g., with task-specific pattern-recognition procedures. When setting up hypotheses, the cortex draws on cognition, namely, on knowledge and awareness of the current situation and the world in general. Further, it takes into account input from other senses, such as visual or tactile information. After forming hypotheses, higher nervous stages may feed back to more peripheral modules to prompt and control optimum hypothesis testing. They may, for example, induce movement of the ears-and-head array or influence the spectral decomposition process in the subcortical auditory system.

The following two examples help to illustrate the structure of the problems that arise at this point from a technological point of view. First, in a "cocktail-party" situation a human listener can follow one talker and then, immediately, switch his attention to another. A signal-processing hearing aid should be able to do the same thing, deliberately controlled by its user. Second, a measuring instrument to evaluate the acoustic quality of concert halls will certainly take into account psychoacoustic descriptors like auditory spaciousness, reverberance, auditory transparency, etc. However, the general impression of space and quality that a listener develops in a room may be co-determined by visual cues, by the specific kind of performance, by the listener's attitude, and by factors like fashion or taste, among other things.


There is no doubt that the involvement of the cortex in the evaluation process adds a considerable amount of "subjectivity" to binaural hearing, which poses serious problems for Binaural Technology. Engineers, like most scientists, are trained to deal with the object as being independent of the observer (assumption of "objectivity") and prefer to neglect phenomena that cannot be measured or assessed in a strictly "objective" way. They further tend to believe that any problem can be understood by splitting it up into parts and analyzing these parts separately. At the cortical level, however, we deal with percepts, i. e. objects that do not exist as separate entities, but as part of a subject-object (perceiver-percept) relationship. It should also be noted that listeners normally listen in a "gestalt" mode, i. e., they perceive globally rather than segmentally. An analysis of the common engineering type may thus completely miss relevant features. Perceiver and percept interact and may both vary considerably during the process of perception. For example, the auditory events may change when listeners focus on specific components such as the sound of a particular instrument in an orchestra. Further, the attitude of perceivers towards their percepts may vary in the course of an experimental series, thus leading to response modification.

Figure A.3: Schematic of a subject in a listening experiment. Perception as well as judgement are variant, as modeled by the assumption of response-moderating factors.

A simple psychological model of the auditory perception and judgment process, shown in figure A.3, will now be used to elaborate on the variance of listeners’ auditory events in a given acoustic setting and the variance of their respective responses. The schematic symbolizes a subject in a listening experiment. Sound waves impinge upon the two ears, are preprocessed and guided to higher centers of the central nervous system, where they give rise to the formation of an auditory event in the subject’s perceptual space. The

auditory event is a percept of the listener being tested, i. e., only he/she has direct access to it. The rest of the world is only informed about the occurrence of said percept if the subject responds in such a way as to allow conclusions to be drawn from the response to the percept (indirect access). In formal experiments the subject will usually be instructed to respond in a specified way, for example by formal judgement on specific attributes of the auditory event. If the response is a quantitative descriptor of perceptual attributes, we may speak of measurement. Consequently, in listening experiments, subjects can serve as an instrument for the measurement of their own perception, i. e., as both the object of measurement and the "meter". The schematic in figure A.3 features a second input into both the auditory-perception and the judgement blocks, where "response-moderating factors" are fed in to introduce variance into the perception and judgement processes. Following this line of thinking, an important task of auditory psychology is to identify such response-moderating factors and to clarify their role in binaural listening.

Many of these factors represent conventional knowledge or experience from related fields of perceptual acoustics, e. g., noise- and speech-quality evaluation. It is well known that the judgements of listeners on auditory events may depend on the cognitive "image" which the listeners have of the sound sources involved (source-related factors). It may happen, for instance, that the auditory events evoked by sources that are considered aggressive (e. g., trucks) are judged louder than those from other sources (e. g., passenger cars), given the same acoustical signals. The "image" of the source in the listeners' minds may be based, among other things, on cues from other senses (e. g., visual) and/or on prior knowledge. Situative factors are a further determinant in this context, i. e., subjects judge an auditory event bearing in mind the complete (multimodal) situation in which it occurs. Another set of factors is given by the individual characteristics of each listener (personal factors), for example his/her subjective attitude towards a specific sound phenomenon, an attitude that may even change in the course of an experiment.

Response-moderating factors that draw upon cognition tend to be especially effective when the sounds listened to transmit specific information, i. e., act as carriers of meaning. This is obvious in the case of speech sounds, but also in other cases. The sound of a running automobile engine, for instance, may signal to the driver that the engine is operating normally. The fact that response-moderating factors act not only on judgements but also on the process of perception itself may seem less obvious at first glance, but is, nevertheless, also conventional wisdom. We all know that people in a complex sound situation have a tendency to miss what they do not pay attention to and/or do not expect to hear. There is psychoacoustical evidence that, e. g., the spectral selectivity of the cochlea is influenced by attention. At this point, the ability to switch at will between

a global and an analytic mode of listening should also be noted. It is commonly accepted amongst psychologists that percepts are the result both of the actual sensory input at a given time and of expectation.


If we want to build sophisticated Binaural-Technology equipment for complex tasks, there is no doubt that psychological effects have to be taken into account. Let us consider, as an example, a binaural surveillance system for the acoustic monitoring of a factory floor. Such a system must know the relevance and meaning of many classes of signals and must pay selective attention to very specific ones when an abnormal situation has been detected. A system for the evaluation of the acoustic qualities of spaces for musical performances must detect and consider a range of different shades of binaural signals, depending on the kind and purpose of the performances. It might even have to take into account the taste of the local audience or that of the most influential local music reviewer. An intelligent binaural hearing aid should know, to a certain extent, which components of the incoming acoustic signals are relevant to its user, e. g., track a talker who has just uttered the user's name.

As a consequence, we shall see in the future of Binaural Technology that psychological models will be exploited and implemented technologically, though perhaps not, for a while, in the form of massively parallel biological computing as in the cortex. There are already discussions about, and early examples of, combinations of expert systems and other knowledge-based systems with artificial heads, auditory displays and auditory-system models. When we think of applications like complex human/machine interfaces, multimedia systems, interactive virtual environments, and teleoperation systems, it becomes obvious that conventional Binaural Technology must be combined with, or integrated into, systems that are able to make decisions and control actions in an intelligent way. With this view in mind it is clear that Binaural Technology is still at an early stage of development. There are many relevant technological challenges and business opportunities ahead.

Acknowledgement

Many of the ideas discussed in this chapter have evolved from work at the author's lab. The author is especially indebted to his doctoral students over the years who have helped to provide a constant atmosphere of stimulating discussion. Several completed doctoral dissertations which are correlated to this chapter, along with references to tagged publications in English, are given in the following list of citations: M. Bodden [33, 35, 34, 28], J.-P. Col [72, 29], H. Els [90, 91, 92], W. Gaik [112, 114, 113], H. Hudde [144, 146, 145], H. Lehnert [179, 180, 181, 28], U. Letens [184, 32], W. Lindemann [189, 190, 191], W. Pompetzki [275], Ch. Pösselt [276, 306], D. Schlichthärle [298], J. Schröter [304, 30, 305], H. Slatky [314], S. Wolf [365], N. Xiang [372, 371, 373].

Appendix B

Audio-Visual Speech Synthesis

The bimodal information associated with speech is conveyed through audio and visual cues which are coherently produced at phonation time and are transmitted by the auditory and visual channel, respectively. Speech visual cues carry basic information on visible articulatory places and are associated with the shape of the lips, the position of the tongue, and the visibility of the teeth. This kind of information is usually complementary to that conveyed by the acoustic cues of speech, which are mostly correlated with the voicing/unvoicing character.

The synthesis of speech visual cues is a problem with a large variety of possible solutions, ranging from continuous approaches, where a small set of reference mouth images are variously "distorted" to obtain approximations of any possible mouth shape, to parametric approaches, where a synthetic mouth icon is "animated" by simply changing its parameters. Whatever the specific methodology employed, the refresh frequency of the mouth shape must be high enough to guarantee the visual representation of the rapid transients of speech, which are very important for comprehension. The temporal refresh frequency is usually higher than 25 Hz, implying a resolution of 40 ms of speech. The basic articulators which must be reproduced are the lips, teeth and tongue. Secondary indicators like cheeks, chin and nose are usually very useful. A minimum spatial resolution of 100x100 pixels is needed.
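As a sketch of the parametric route, a mouth icon could be animated by interpolating between viseme keyframes at the 25 Hz (40 ms) refresh rate mentioned above; the parameter names and viseme values below are purely illustrative, not a set prescribed by this report:

# Hypothetical viseme keyframes: each is a small set of mouth parameters.
VISEMES = {
    "rest": {"lip_opening": 0.1, "lip_width": 0.5, "teeth_visible": 0.0},
    "a":    {"lip_opening": 0.9, "lip_width": 0.6, "teeth_visible": 0.3},
    "m":    {"lip_opening": 0.0, "lip_width": 0.5, "teeth_visible": 0.0},
}

FRAME_MS = 40  # 25 Hz refresh, i.e. one mouth shape every 40 ms of speech

def interpolate(start, end, t):
    """Linear blend between two viseme parameter sets, t in [0, 1]."""
    return {k: (1 - t) * start[k] + t * end[k] for k in start}

def animate(viseme_track):
    """viseme_track: list of (viseme_name, duration_ms) pairs; the last entry
    only serves as the final interpolation target. Yields one parameter set
    per 40 ms frame."""
    for (name, dur), (next_name, _) in zip(viseme_track, viseme_track[1:]):
        frames = max(1, dur // FRAME_MS)
        for i in range(frames):
            yield interpolate(VISEMES[name], VISEMES[next_name], i / frames)

for frame in animate([("rest", 80), ("m", 120), ("a", 160), ("rest", 80)]):
    pass  # each 'frame' would drive the rendering of the mouth icon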

B.1 Visual Speech Synthesis from Acoustics

Speech production is based on the basic mechanisms of phonation, related to the vibration of the vocal cords, and of vocal articulation, related to the time-varying geometry of the vocal tract, which is responsible for the phonemic structure of speech. Forcing the diaphragm up, air

is pushed out from the lungs into the trachea, glottis (the gap between the two vocal cords) and larynx before reaching the upper part of the vocal tube, called the vocal tract, formed by the pharynx and the nasal and oral cavities. The periodic closure of the glottis interrupts the airflow, generating a periodic variation of the air pressure whose frequency can reach the acoustic range. The harmonic components of this acoustic wave, multiples of the fundamental (pitch frequency), are then modified as the air flows through the vocal tract, depending on its geometry. The vocal tract, in fact, can be shaped variously by moving the jaw, tongue, lips and velum: in this way the vocal tract implements a time-varying system capable of filtering the incoming acoustic wave, reshaping its spectrum and modifying the produced sound.

Speech is the concatenation of elementary units, phones, generally classified as vowels if they correspond to stable configurations of the vocal tract or, alternatively, as consonants if they correspond to transient articulatory movements. Each phone is then characterized by means of a few attributes (open/closed, front/back, oral/nasal, rounded/unrounded) which qualify the articulation manner (fricative like /f/, /s/, plosive like /b/, /p/, nasal like /n/, /m/, . . . ) and the articulation place (labial, dental, alveolar, palatal, glottal). Some phones, like vowels and a subset of consonants, are accompanied by vocal-cord vibration and are called "voiced", while other phones, like plosive consonants, are totally independent of cord vibration and are called "unvoiced". For voiced phones the speech spectrum is shaped, as previously described, in accordance with the geometry of the vocal tract, with characteristic energy concentrations around three main peaks called "formants", located at increasing frequencies F1, F2 and F3.

An observer skilled in lipreading is able to estimate the likely locations of the formant peaks by computing the transfer function from the configuration of the visible articulators. This computation is performed through the estimation of four basic parameters: (i) the length L of the vocal tract; (ii) the distance d between the glottis and the place of maximum constriction; (iii) the radius r of the constriction; (iv) the ratio A/L between the area A of the constriction and the length L. While the length L can be estimated a priori, taking into account the sex and age of the speaker, the other parameters can be inferred, roughly, from the visible configuration. If the maximum constriction is located in correspondence with the mouth, thus involving lips, tongue and teeth as it does for labial and dental phones, this estimate is usually reliable. In contrast, when the maximum constriction is not visible, as in velar phones (/k/, /g/), the estimate is usually very poor.
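For orientation only, the link between vocal-tract length and formant locations can be illustrated with the textbook uniform-tube approximation (a tube closed at the glottis and open at the lips); this is a simplification, not the four-parameter estimation described above:

SPEED_OF_SOUND = 35000.0   # cm/s, approximate value in warm, moist air

def uniform_tube_formants(L_cm, n_formants=3):
    """Resonances of a uniform tube closed at one end and open at the other:
    F_n = (2n - 1) * c / (4L).  For L = 17.5 cm this gives about 500, 1500,
    2500 Hz, close to the neutral-vowel formants F1, F2, F3."""
    return [(2 * n - 1) * SPEED_OF_SOUND / (4 * L_cm)
            for n in range(1, n_formants + 1)]

print(uniform_tube_formants(17.5))   # -> [500.0, 1500.0, 2500.0]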

It consists of a personal database of knowledge and skills, constructed and refined by training, capable of associating virtual sounds with specific mouth shapes, generally called “visemes”, and therefore of inferring the underlying acoustic message. The lipreader’s attention is basically focused on the mouth, including all its components like lips, teeth and tongue, but significant help in comprehension also comes from the entire facial expression. In lipreading a significant amount of processing is performed by the lipreader himself, who is skilled in post-filtering the converted message to recover from errors and from communication lags. Through linguistic and semantic reasoning it is possible to exploit the redundancy of the message and understand by context: this kind of knowledge-based interpretation is performed by the lipreader in real time. Audio-visual speech perception and lipreading rely on two perceptual systems working in cooperation, so that, in case of hearing impairments, the visual modality can efficiently complement or even substitute for the auditory modality. It has been demonstrated experimentally that exploiting the visual information associated with the movements of the talker’s lips improves the comprehension of speech: the gain is equivalent to an increase in the Signal to Noise Ratio (SNR) of up to 15 dB, and auditory failure is most of the time transformed into near-perfect comprehension once the visual cues are available. The visual analysis of the talker’s face provides different levels of information to the observer, improving the discrimination of signal from noise. The opening and closing of the lips is in fact strongly correlated with the signal power and provides useful indications on how the speech stream is segmented. While vowels can be recognized rather easily both through hearing and through vision, consonants are conversely very sensitive to noise, and visual analysis often represents the only route to successful comprehension. The acoustic cues associated with consonants are usually characterized by low intensity, very short duration and fine spectral patterning.


Figure B.1: Auditory confusion of consonant transitions CV in white noise with decreasing Signal to Noise Ratio expressed in dB. (From B. Dodd, R. Campbell, “Hearing by eye: the psychology of lipreading”, Lawrence Erlbaum Ass. Publ.)

B – AUDIO-VISUAL SPEECH SYNTHESIS Esprit bra 8579 Miami 129

The auditory confusion graph reported in figure B.1 shows that cues of nasality and voicing are efficiently discriminated through acoustic analysis, unlike place cues, which are easily distorted by noise. The opposite situation occurs in the visual domain, as shown in figure B.2, where place is recognized far more easily than voicing and nasality.


Figure B.2: Visual confusion of consonant transitions CV in white noise among adult hearing-impaired persons. As the Signal to Noise Ratio decreases, consonants which are initially discriminated are progressively confused and clustered. When the 11th cluster is formed (dashed line), the resulting 9 groups of consonants can be considered distinct visemes. (From B. Dodd, R. Campbell, “Hearing by eye: the psychology of lipreading”, Lawrence Erlbaum Ass. Publ.)

Place cues are associated, in fact, with mid-high frequencies (above 1 kHz), which are usually poorly discriminated in most hearing disorders, in contrast to nasality and voicing cues, which reside in the lower part of the frequency spectrum. Cues of place, moreover, are characterized by a short-time fine spectral structure requiring high frequency and temporal resolution, unlike voicing and nasality cues, which are mostly associated with unstructured power distributions over several tens of milliseconds. In any case, seeing the face of the speaker is evidently of great advantage to speech comprehension and almost necessary in the presence of noise or hearing impairments: vision directs the listener’s attention, adds redundancy to the signal and provides evidence of those cues which would otherwise be irreversibly masked by noise.

In normal verbal communication, the analysis and comprehension of the various articulation movements rely on a bimodal perceptive mechanism for the continuous integration of coherent visual and acoustic stimuli [331]. In case of impairments in the acoustic channel, due to distance, noisy environments, transparent barriers like a pane of glass, or pathologies, the prevalent perceptive task is consequently performed through the visual modality. In this case, only the movements and the expressions of the visible articulatory organs are exploited for comprehension: vertical and horizontal lip opening, vertical jaw displacement, teeth visibility, tongue position and other minor indicators like cheek inflation and nose contractions. Results from experimental phonetics show that hearing-impaired people behave differently from normal-hearing people in lipreading [94, 254]. In particular, visemes like the bilabials /b, p, m/, the fricatives /f, v/ and the occlusive consonants /t, d/ are recognized by both groups, while other visemes like /k, g/ are recognized only by hearing-impaired people. The rate of correct recognition for each viseme also differs between normal-hearing and hearing-impaired people: as an example, hearing-impaired people recognize the nasal consonants /m, n/ much more successfully than normal-hearing people. These two specific differences in phoneme recognition can hardly be explained, since the velum, which is the primary articulator involved in phonemes like /k, g/ or /m, n/, is not visible and its movements cannot be perceived in lipreading. A possible explanation, stemming from recent results in experimental phonetics, relies on the exploitation of secondary articulation indicators commonly unnoticed by the normal observer. If a lipreadable visual synthetic output must be provided through the automatic analysis of continuous speech, much attention must be paid to the definition of suitable indicators, capable of describing the visually relevant articulation places (labial, dental and alveolar) with the least residual ambiguity [205, 204, 206]. This methodological consideration has been taken into account in the proposed technique by extending the analysis-synthesis region of interest to the region around the lips, including the cheeks and the nose.

B.1.1 Articulatory description

Besides the main purpose of describing and quantifying the characteristics of the articulatory modes and places in the production of each single phone in different languages [149, 69, 167], and of providing a fruitful comparison with conventional articulatory classification [168, 169], the use of modern and sophisticated experimental techniques allows the definition of a far more accurate taxonomy of the articulatory parameters. Among the many techniques, those providing the most relevant hints to experimental phonetics are X-ray cinematography, X-ray kymography, electro-palatography, electro-kymography and labiography.


When articulatory movements are correlated with their corresponding acoustic output, the task of associating each phonetic segment with a specific articulatory segment becomes a critical problem. Unlike a purely spectral analysis of speech, where phonetic units exhibit an intelligible structure and can consequently be segmented, articulatory analysis does not, by itself, provide any unequivocal indication of how to perform such a segmentation. A few fundamental aspects of speech bimodality have long inspired interdisciplinary studies in neurology [173, 49, 225], physiology [307], psychology [361, 106], and linguistics [248]. Experimental phonetics has demonstrated that, besides speed and precision in reaching the phonetic target (that is, the articulatory configuration corresponding to a phoneme), speech exhibits high variability due to multiple factors such as:

• psychological factors (emotions, attitudes);

• linguistic factors (style, speed, emphasis);

• articulatory compensation;

• intra-segmental factors;

• inter-segmental factors;

• intra-articulatory factors;

• inter-articulatory factors;

• coarticulatory factors.

To give an idea of the complexity of the interactions among the many speech components, it should be noted that emotions with high psychological activity automatically increase the speed of speech production, that high speed usually determines articulatory reduction (hypo-speech), and that a clearly emphasized articulation is produced (hyper-speech) in case of particular communication needs. Articulatory compensation takes effect when a phono-articulatory organ is working under unusual constraints: for example, when someone speaks while eating or with a cigarette between his lips [200, 104, 1]. Intra-segmental variability indicates the variety of articulatory configurations which correspond to the production of the same phonetic segment, in the same context and by the same speaker.

Inter-segmental variability, on the other hand, indicates the interaction between adjacent phonetic segments [193, 119] and can be expressed in “space”, as a variation of the articulatory place, or in “time”, meaning the extension of the characteristics of a phone to the following ones. Intra-articulatory effects take place when the same articulator is involved in the production of all the segments within the phonetic sequence. Inter-articulatory effects indicate the interdependencies between independent articulators involved in the production of adjacent segments within the same phonetic sequence. Coarticulatory effects indicate the variation, in direction and extension, of the articulator movements during a phonetic transition [19, 134, 155]. Forward coarticulation takes effect when the articulatory characteristics of a segment to follow are anticipated in previous segments, while backward coarticulation happens when the articulatory characteristics of a segment are maintained and extended to following segments. Coarticulation is considered “strong” when two adjacent segments correspond to a visible articulatory discontinuity [110], or “smooth” when the articulatory activity proceeds smoothly between two adjacent segments. The coarticulation phenomenon represents a major obstacle in lipreading as well as in artificial articulatory synthesis, when the movements of the lips must be reconstructed from the acoustic analysis of speech, since there is no strict correspondence between phonemes and visemes [15]. The basic characteristic of these phenomena is the non-linearity between the semantics of the pronounced speech (whatever the particular acoustic unit taken as reference) and the geometry of the vocal tract (representative of the status of each articulatory organ). It is apparent that speech segmentation cannot be performed by means of articulatory analysis alone. Articulators start and complete their trajectories asynchronously, exhibiting both forward and backward coarticulation with respect to the speech wave. Many theories and cognitive models of speech production and perception have been proposed, making reference to the different concepts of spatial, acoustic or orosensorial targets. The spatial target refers to the theoretical approach to speech production based on the hypothesis that speech is produced by adapting the articulators toward a final target configuration, through an embedded spatial reference system. The acoustic target, as opposed to the spatial target, implies that speech is produced by adapting the articulators in order to reach a final acoustic target [245]. The orosensorial target, in contrast with the previous two theories, explains speech production in terms of tactile feedback coming from receptors located in the mouth region, used to control the articulatory adaptation [324, 265, 266].


Three basically different models are currently adopted, namely the closed-loop model, the open-loop model and the coordinative model. The closed-loop model stems from cybernetics (the Cartesian machine) and considers the sensory feedback (tactile and/or acoustic) to be the intrinsic control of the articulatory activities during speech production. The articulatory trajectories are planned on-line, depending on the measured feedback, until the target (spatial or acoustic) is reached. The open-loop model stems from computer science and artificial intelligence (the Turing machine) and is based on the metaphor of the brain as a computer. It assumes that each articulatory activity is controlled through a sequence of instructions [106]. The articulatory trajectories are pre-programmed (off-line) point by point and are executed deterministically. The coordinative model [307, 106, 163] is based on the neurophysiology metaphor, in contrast to the cybernetic and algorithmic models. It relies on coordinative structures, that is, functional groups of muscles behaving like coordinated units according to rules learnt by training. The articulatory trajectories are the result of the interactive work of coordinative structures aimed at reaching a “qualitative” target in a flexible and adaptive way. Independently of the particular approach, a very large set of signal observations is necessary to estimate the articulatory parameters faithfully. Classic methods generally aim at accurate phoneme recognition and at a subsequent synthesis by rule to associate the corresponding viseme. In [176, 270], two estimation algorithms have been proposed for the statistical analysis of the cepstrum space, detecting phonetic clusters representative of stable configurations of the articulatory organs and modeling the transition paths linking the clusters, representative of coarticulation.
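
To make the idea of cepstrum-space clustering concrete, the following minimal sketch (an illustration under simple assumptions, not the algorithm of [176, 270]) finds candidate phonetic clusters with k-means over cepstral frames and counts transitions between successive cluster labels as a crude picture of the paths linking them:

    import numpy as np

    def cluster_cepstra(cepstra, k=16, iters=50, seed=0):
        """Toy k-means over cepstral frames (rows of `cepstra`).
        Returns cluster centres and the per-frame label sequence."""
        rng = np.random.default_rng(seed)
        centres = cepstra[rng.choice(len(cepstra), size=k, replace=False)]
        for _ in range(iters):
            # assign each frame to the nearest centre (Euclidean distance)
            d = np.linalg.norm(cepstra[:, None, :] - centres[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # re-estimate centres; keep the old one if a cluster is empty
            for j in range(k):
                if np.any(labels == j):
                    centres[j] = cepstra[labels == j].mean(axis=0)
        return centres, labels

    def transition_counts(labels, k):
        """Count transitions between successive cluster labels: a crude
        picture of the paths linking stable articulatory configurations."""
        counts = np.zeros((k, k), dtype=int)
        for a, b in zip(labels[:-1], labels[1:]):
            counts[a, b] += 1
        return counts

    # usage with random data standing in for real cepstral frames
    frames = np.random.default_rng(1).normal(size=(500, 12))
    centres, labels = cluster_cepstra(frames, k=8)
    paths = transition_counts(labels, k=8)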

B.1.2 Articulatory synthesis

In face-to-face verbal communication, since the analysis and comprehension of the various articulatory movements rely on a bimodal perceptive mechanism for the continuous integration of coherent visual and acoustic stimuli [331], the most natural way to visualize the cues of speech on a screen is to synthesize speech on a “talking human face”. In this way a message from the computer (output) can be represented according to a bimodal model and visualized through a multimedia interface. Worldwide interest is growing in the possibility of employing anatomical models of a human face, suitable for synthetic animation through speech, for applications such as multimedia man-machine interfaces, new algorithms for very low bit-rate model-based videophony, and advanced applications in the fields of education, culture and entertainment.


Different approaches have been investigated and proposed to define the basic set of parameters necessary to model human mouth expressions, including lip, teeth and tongue movements. In [38], 14 elementary visual units responsible for the mouth shape have been defined and associated with corresponding acoustic classes. Through suitable phonemic classification, a consistent sequence of visemes is obtained to visualize stable acoustic frames, while coarticulation transients are approximated by interpolation. In [139] a phonemic classifier is employed to determine a basic set of mouth parameters through synthesis by rule, while in [231] the intermediate stage of phonemic classification is bypassed and input acoustic parameters are directly associated with suitable output visual parameters. All these approaches, however, are unsuited to specific applications where a very faithful reproduction of the labial expressions is required, such as those oriented to synthetic lipreading. The main problems stem from the lack of a suitable anatomical model of the mouth and from the coarse representation of the coarticulation phenomena. A recent approach [61] based on Time-Delay Neural Networks (TDNNs) employs a novel fast learning algorithm capable of providing very accurate and realistic control of the mouth parameters. As proven by the experimental results, visible coarticulation parameters are correctly estimated by the network and the corresponding visualization is smooth enough to be comprehended by lipreading. In this work human faces have been modeled through a flexible wire-frame mask, suitably adapted to their somatic characteristics with increasing resolution around high-detail features like the eyes and the mouth. This kind of wire-frame is generally referred to as 2D½, meaning that it is neither a 2D nor a true 3D structure, but something in between. A comparison can be made with a carnival mask reproducing the characteristics of a human face on a thin film of clay: the 3D information is incomplete, since only one half of the head is modeled by the mask. The wire-frame structure is organized as a set of subregions, each of them affected by predefined deformation rules producing time-varying deformations to simulate muscle contraction. Animation procedures have been designed on the basis of the psychological studies of Ekman and Friesen [88], formalizing the relationship between the subjective information content of the speaker’s audio-visual communication and his corresponding facial expression. In particular, the set of basic animation units (AUs) defined in [5] for the description and coding of facial mimics has been implemented with reference to the Parke model [257].
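
The way a TDNN maps a window of acoustic frames onto mouth parameters can be sketched as follows; the window length, layer sizes and the four output parameters are illustrative assumptions, and the fixed random weights merely stand in for a trained network (this is not the learning algorithm of [61]):

    import numpy as np

    def tdnn_forward(acoustic, W1, b1, W2, b2, delay=5):
        """Minimal time-delay network forward pass.
        acoustic : (T, d_in) sequence of acoustic feature frames
        W1       : (delay * d_in, d_hidden) weights of the time-delay layer
        W2       : (d_hidden, d_out) weights of the output layer
        Returns (T - delay + 1, d_out) mouth-parameter estimates, one per
        output frame (e.g. lip height, lip width, jaw opening, protrusion)."""
        T, d_in = acoustic.shape
        outputs = []
        for t in range(T - delay + 1):
            window = acoustic[t:t + delay].reshape(-1)   # stack delayed frames
            hidden = np.tanh(window @ W1 + b1)           # time-delay layer
            outputs.append(hidden @ W2 + b2)             # linear read-out
        return np.array(outputs)

    # illustrative shapes: 12 cepstral coefficients in, 4 mouth parameters out
    rng = np.random.default_rng(0)
    W1 = rng.normal(scale=0.1, size=(5 * 12, 32)); b1 = np.zeros(32)
    W2 = rng.normal(scale=0.1, size=(32, 4));      b2 = np.zeros(4)
    frames = rng.normal(size=(100, 12))
    mouth_params = tdnn_forward(frames, W1, b1, W2, b2)  # shape (96, 4)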


B.2 Audio-Visual Speech Synthesis from Text

B.2.1 Animation of synthetic faces

In the last two decades, a variety of synthetic faces have been designed all over the world with the objective of animating them. The quality of facial models ranges from a simple electronic curve on an oscilloscope, through a wide range of pre-stored human face shapes and more or less caricatural 2D vector-driven models of the most salient contours of the human face, to a very natural rendering of a 3D model (for instance by mapping real photographs onto it). I think that, as in the case of acoustic speech synthesis, one must differentiate the final model from its animation technique. I suggest the reader refer to [46] for a tutorial presentation of the techniques and methods used in facial animation. To sum up, figure B.3 classifies the most noticeable publications on designed systems along these two criteria. For simplicity, the table does not consider the work done by investigators to develop or apply the models cited, nor to synchronise them with speech synthesizers. Of course, the control parameters of a rule-driven model may be given by a code-book after analysis of the acoustic wave-form, so the Y-axis could have been presented differently. However, the table aims at showing the basic characteristics of the various approaches by assigning the most suitable animation technique to a given model.

Facial models (columns): Lissajous-like figures (electronics); 2D vectors or shapes; 3D wire-frame or raster graphics. Animation techniques (rows), with the systems cited in each row:

from acoustics (through stochastic networks): Boston, 1973; Erber, 1978; Simons, 1990; Morishima, 1990; Lavagetto, 1994

by rules (control parameters): Parke, 1974; Brooke, 1979; Platt, 1981; Waters, 1987; Magnenat-Thalmann, 1988; Guiard-Marigny, 1992; Cohen, 1993; Guiard-Marigny, 1994

by code-books (quantization) or key-frames (interpolation): Parke, 1972; Montgomery, 1982; Bergeron, 1985; Aizawa, 1987; Matsuoka, 1988; Naha, 1988; Woodward, 1991; Mohamadi, 1993

Figure B.3: Classification of best-known face synthesizers based on the kind of facial model developed (X-axis), and on its underlying method of animation (Y-axis). Only the first author and date of publication are quoted.


Whatever the facial model, it may be animated for speech by three main methods:

1. A direct mapping from acoustics to geometry creates a correspondence between the energy in a filtered bandwidth of the speech signal and the voltage input of an oscilloscope, and therefore with the horizontal or vertical size of an elliptical image [37, 96]. An indirect mapping may also be achieved through a stochastic network which outputs a given image from a parametrization of the speech signal, after the network has been trained to match any possible input with its corresponding output as accurately as possible. Simons and Cox [313] used a Hidden Markov network, whereas Morishima et al. [230] preferred a connectionist network, for instance. In both cases, the inputs of the networks were LPC parameters calculated from the speech signal. (A minimal sketch of such a direct band-energy mapping is given after this list.)

2. When the facial model has previously been designed so that it can be animated through control parameters, it is possible to elaborate rules which simulate the gestural movements of the face. These commands can be based on a geometrical parametrization, such as jaw rotation or mouth width for instance, in the case of the 2D models by Brooke [45] or Guiard-Marigny [132], or the 3D models by Parke [258], Cohen and Massaro [71], or Guiard-Marigny et al. [133]. The commands can also be based on an anatomical description, such as the muscle actions simulated in the 3D models by Platt and Badler [273], Waters [355], Magnenat-Thalmann et al. [202], or Viaud [345].

3. Facial models may only mimic a closed set of human expressions, whatever the tool used to create them:

• a set of 2D vectors [227, 366],
• simplified photos [16] (Mohamadi, 1993),
• hand-drawn mouth shapes [215],
• 3D reconstructions of 2D images [256],
• 3D digitizing of a mannequin using a laser scanner [203, 238, 160, 166], or even
• direct computer-assisted sculpting [255].

If no control parameters can be applied to the structure obtained in order to deform it and so generate different expressions, and if digitizing multiple expressions is impossible, hand modification by an expert is necessary in order to create a set of relevant key-frame images. The pre-stored images can then be concatenated as in cartoons, so that a skilled animator may achieve coherent animation.


Such a technique has been widely employed, since it only requires a superficial model of the face, and little physiological knowledge of speech production is needed to modify the external texture so that natural images can be duplicated (the rotoscoping technique).

I also want to mention two other techniques that directly rely on human gestures:

4. Expression slaving consists of the automatic measurement of geometrical characteristics (reference points or anatomical measurements) of a speaker’s face, which are mapped onto a facial model for local deformation [18, 337, 363, 259, 177]. In the computer-generated animation The Audition [224], the facial expressions of a synthetic dog are mapped onto its facial model from natural deformations automatically extracted from a human speaker’s face [259].

5. Attempts have also been made to map expressions onto a facial model from control commands driven in real time by the hand of a skilled puppeteer [79]. In a famous computer-generated French TV series, Canaille Peluche¹, the body and face gestures of Mat, the synthetic ghost, are created in this way. Finally, a deformation based on the texture rather than on the shape of the model can be achieved by mapping various natural photos onto the model [166].

The basic principles of these various animation techniques are represented in figure B.4.
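
The following minimal sketch illustrates the direct band-energy mapping of method 1 above; the band edges, the reference energy and the ellipse dimensions are assumed values chosen only for illustration:

    import numpy as np

    def band_energy(frame, rate=8000, lo=500.0, hi=3000.0):
        """Energy of one audio frame inside a frequency band (assumed edges)."""
        spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / rate)
        band = (freqs >= lo) & (freqs <= hi)
        return float(np.sum(np.abs(spectrum[band]) ** 2))

    def mouth_ellipse(frame, rate=8000, ref_energy=1e4, max_opening=30.0, width=50.0):
        """Map band energy to the vertical axis of an elliptical mouth shape.
        ref_energy is an assumed calibration constant (full opening)."""
        e = band_energy(frame, rate)
        opening = max_opening * min(1.0, e / ref_energy)
        return width, opening      # horizontal and vertical ellipse axes

    # usage on a synthetic 20 ms frame standing in for speech
    rng = np.random.default_rng(0)
    frame = rng.normal(size=160)   # 20 ms at 8 kHz
    axes = mouth_ellipse(frame)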

B.2.2 Audio-visual speech synthesis

Experiments on natural speech (see ICP-MIAMI report 94-1) allow us to anticipate that similar effects will be obtained with a TtAVS synthesizer: even if the current quality of (most) TtS systems is not as bad as highly degraded speech, it is obvious that under very quiet conditions, synthesizers are much less intelligible than humans. Moreover, it is realistic to predict that in the near future, the spread of speech synthesizers will lead to wide use in noisy backgrounds, such as railway stations. Such adverse conditions will necessitate a synchronized presentation of the information in another modality, for instance the orthographic display of the text, or the animation of a synthetic face (especially for foreigners and illiterates). There are hence several reasons for the study and use of Audio-Visual Speech Synthesis.

• Audio-Visual Speech Synthesis allows investigators to accurately control stimuli for perceptual tests on bimodality:

¹ The series is produced by Canal Plus and Videosystem.


Figure B.4 traces three parallel paths from the possible inputs — the speaker’s audio waveform, text, and the speaker’s face (or a puppeteer’s hand) — through signal processing, grapheme-to-phoneme conversion with prosodic modeling, and image processing or manipulation, respectively. The resulting acoustic parameters, phonetic strings with prosody, or reference points and anatomical measurements are then mapped to the face through a stochastic network, through acoustic-phonetic decoding followed by phoneme-to-viseme and viseme-to-parameter conversion with interpolation, through vector quantization and key framing of pre-stored images, or through a geometrical interface, finally driving the timing of pre-stored images or of control commands to a parametrized model.

Figure B.4: General architecture of Facial Animation showing the possible techniques.

Massaro and Cohen [212] studied how speech perception is influenced by information presented to ear and eye, by dubbing acoustic tokens generated by a speech synthesizer (see [158]) onto a sequence of images generated by a facial model (see [258], as modified by Pearce et al. [261]). Highly controlled synthetic stimuli thus allowed them to investigate the McGurk effect in detail.

• Audio-Visual Speech Synthesis is also a tool for basic research on speech production:

Pelachaud studied the relationship between intonation and facial expression by means of natural speech [262] and the facial model developed by Platt [272].

• Thanks to the increasing capacities of computer graphics, highly natural (or hyper- realistic) rendering of 3D synthetic faces now allows movie producers to create syn- thetic actors whose facial gestures have to be coherent with their acoustic pro- duction, due to their human-like quality and the demands of the audience. Short computer-generated movies clearly show this new trend:


Tony de Peltrie [18], Rendez-vous à Montréal [203], Sextone for President [159], Tin Toy [285], Bureaucrat [356], Hi Fi Mike [370], and Don Quichotte [117], among others.

• It is hence necessary for computer-assisted artists to be equipped with software facilities so that the facial gestures and expressions of their characters are easily, quickly, and automatically generated in a coherent manner. Several attempts to synchronize synthetic faces with acoustic (natural or synthetic) speech may be found in the literature [186, 140, 177, 215, 238, 70, 230, 263, 367, 226], among others, for (British and American) English, Japanese, and French.

• Few audio-visual synthesizers have ever been integrated into a text-to-speech sys- tem [367, 226, 71, 138].

Unfortunately, most of the authors only reported informal impressions from colleagues about the quality of their system, and as far as I am aware none of them has ever quantified the improvement in intelligibility gained by adding visual synthesis to the acoustic waveform. I strongly support the idea that assessment methodologies should be standardized so that the various approaches can be compared to one another. The next report will present results of intelligibility tests run at the ICP with various visual (natural and synthetic) displays of the lips, the jaw and the face under different conditions of background noise added to the original acoustic signal.

Appendix C

Audio-Visual Speech Recognition

Automatic speech recognition (ASR) promises to be of great importance in human- machine interfaces, but despite extensive effort over decades, acoustic-based recognition systems remain too inaccurate for the vast majority of conceivable applications, especially those in noisy environments (automobiles, factory floors, crowded offices, etc.). While in- cremental advances may be expected along the current ASR paradigm, additional, novel approaches — in particular those utilizing visual information as well — deserve serious study. Such hybrid (acoustic and visual) ASR systems have already been shown to have superior recognition accuracy, especially in noisy conditions, just as humans recognize speech better when also given visual information (the “cocktail party effect”).

C.1 Integration Models of Audio-Visual Speech by Humans

The objective of this section is to present existing models of integration specifically proposed for speech perception. It is introduced by a discussion of how the fusion of sensory information basically relies on two disciplines: Cognitive Psychology and Engineering Sciences. This report finally proposes a taxonomy of the integration models that can be found in the literature. Later on, we will make another report on the quantitative implementation of such models. Here again, those techniques come from two scientific domains, namely Cognitive Psychology and Automatic Speech Recognition.


C.1.1 General principles for integration

Multimodal perception is not specific to speech, nor to audition/vision. Information transfer and intermodal integration are classical problems widely studied in psychology and neuropsychology [136, 137, 358, 135, 321]. Interactions between various modalities have therefore been studied in the literature:

• vision and audition [222, 358, 137];

• vision and touch [284, 136, 137, 358];

• vision and grasping [137];

• vision and proprioception [358, 137];

• vision and infrared perception [135];

• audition and grasping [137];

• audition and proprioception [358].

From the literature (see [137] for a review), three main classes of intersensory fusion models can be considered:

1. use of language (or of symbolic attributes) as a basis for fusion,

2. recoding from one modality into another (considered as dominant for a given perceptual task), and

3. use of amodal continuous representations (which are characteristic of the physical properties of the distal source, independently of the proximal sensations; see e.g. the theory of direct realistic perception by Gibson [121]).

We detail these three cases below. In the first case, information is categorized into symbols which provide a support for the interaction. Classification thus precedes fusion: such models can be described as late integration [348]. In order for the transfer between modalities to take place at a symbolic level, the subject is supposed to have learned the symbols. Since infants and animals are capable of visual-tactile interaction, the hypothesis of intermodal transfer through language cannot hold for visual-tactile perception [136]. The second category assumes that one modality dominates the other for a given task.

For instance, it is considered that vision is dominant for spatial localization [281] and that audition is dominant for judgments dealing with the temporal organization of stimuli [137]. The non-dominant (or submissive?) modality is then recoded into the dominant one. It is only after this recoding process that the interaction (transfer, fusion) takes place. According to Hatwell [136, 137], transfer from vision to the tactile modality must be explained by a model which falls into this category:

“Spatial perceptions coming from the dominant modality (vision) remain in their own code, whereas those coming from the non-dominant modality (tac- tile) are, as best as possible, recoded into the dominant modality and adjusted to it”. [136, page 340]

In the third case, it is considered that the physical signal is used by the sensory system in order to retrieve information about the source responsible for its generation. Sensations are thus transcoded into an amodal space before being integrated. This amodal space might be isomorphic to the physical properties of the source, as assumed in the Theory of Direct Perception (cf. Fowler [108]). It might also be made of internal dimensions that are independent of the modality, such as intensity, duration, etc. It is hence assumed that there is no direct transfer between modalities, but that each sensation is projected onto the same amodal space, where interactions then take place.

C.1.2 Five models of audio-visual integration in speech perception

In the area of speech perception, several models have been proposed to account for speech recognition in noise or for the perception of paradoxical stimuli [52, 208, 209, 53, 292, 348]. However, Summerfield addressed the main theoretical issues and discussed them in the light of experimental data and of general theories of speech perception [332]. He described five different architectural structures that may solve the problem of audio-visual fusion in the perception of speech. We here revisit those five models in order to better fit them into the general principles presented in the above section. The proposed structures are presented in the following sections. They are finally put into relation with the general principles.

Direct Identification Model (DI) The Direct Identification model of audio-visual configurations is based upon the Lexical Access From Spectra model of Klatt [157], which is here turned into a Lexical Access from Spectra and Face Parameters. The input signals are directly transmitted to the bimodal classifier.


This classifier may be, for instance, a bimodal (or bivectorial) lexicon in which the prototype closest to the input is selected (see figure C.1). Basics of the model: There is no representation level common to the two modalities between the signal and the percept.


Figure C.1: Schematic of the Direct Identification (DI) model

Separated Identification Model (SI) In the Separated Identification model, the visual and the auditory inputs are separately identified through two parallel identification processes. Fusion of the phonemes or phonetic features coming from each modality occurs after this identification. Fusion can be performed on logical data, as in the VPAM model (Vision Place Auditory Mode), where each modality deals with a distinct set of phonetic features [220, 332]. Thus, the place of articulation of the output is that of the visual input and the mode of articulation (voiced, nasal, etc.) is that of the auditory input (see figure C.2).


Figure C.2: Schematic of the VPAM model (example of an SI structure) (after [332])

Fusion can also be performed on probabilistic data (fuzzy logic). Each input can be matched against (unimodal) prototypes in order to obtain two scores for each category.


Then, each pair of scores corresponding to the same category is fused through probabilistic computation. The FLMP (Fuzzy Logical Model of Perception) falls into this class of models (see [208, 52]). Basics of the model: The inputs are matched against prototypes (or even classified) before being integrated.
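
The probabilistic fusion step of the FLMP is commonly written as a multiplicative combination of the per-category supports from the two modalities, renormalised over all categories; a minimal sketch of that standard formulation, with invented numbers, is:

    import numpy as np

    def flmp_fuse(audio_support, visual_support):
        """FLMP-style fusion: per-category supports from each modality
        are multiplied and renormalised across the categories."""
        a = np.asarray(audio_support, dtype=float)
        v = np.asarray(visual_support, dtype=float)
        joint = a * v
        return joint / joint.sum()

    # invented supports for the categories /ba/, /da/, /ga/
    audio  = [0.70, 0.20, 0.10]   # audition favours /ba/
    visual = [0.05, 0.35, 0.60]   # vision favours /ga/
    print(flmp_fuse(audio, visual))  # fused response leans towards /da/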

Dominant Modality Recoding Model (RD) The model of Recoding into the Dominant Modality regards the auditory modality as dominant in the perception of speech, since it is better suited to it. The visual input is thus recoded into a representation of this dominant modality. The selected representation of the auditory modality is the transfer function of the vocal tract. This transfer function is separately evaluated from the auditory input (e.g. through cepstral processing) and from the visual input (through association). Both evaluations are then fused. The source characteristics (voiced, nasal, etc.) are evaluated from the auditory information only. The pair source / vocal tract so evaluated is then sent to the phonemic classifier (see figure C.3).


Figure C.3: Schematic of the Dominant Modality Recoding (RD) model

Basics of the model: The visual modality is recoded into an auditory metric. Both representations are then integrated into the auditory space where the categorization takes place.

Motor Space Recoding Model (MR) In the Motor Space Recoding model, the two inputs are projected onto an amodal space (which is neither auditory nor visual). They are then fused within this common space. This amodal space is the articulatory space of vocal tract configurations. The visual modality is projected only onto those dimensions of the vocal tract space to which it can contribute information. For instance, the visual modality may bring information on lip rounding, but not on the velum position. The final representation is obtained in one of two ways: when there is a single projection onto a dimension (from the auditory input), this projection makes the final decision.


When information from the two modalities is simultaneously projected onto one dimension (e.g. jaw height), the final value comes from a weighted sum of both inputs. The final representation thus obtained is passed to the phonemic classifier. This architecture fits well with the Motor Theory of Speech Perception [187] and the Realistic Direct Theory of Speech Perception [108].


Figure C.4: Schematic of the Motor Space Recoding (MR) model

Basics of the model: The two modalities are projected upon a common motor space where they are integrated before final categorization.

Model of Motor Dynamics This last model is the one least developed by Summerfield. Each modality gives an estimate of the motor dynamics (mass and force) of the vocal tract. Such an estimate might be based upon kinematic data (position, velocity and acceleration). The two estimates of the motor dynamics (one from each modality) are fused and then categorized, describing the identity of phonemes in terms of time-varying functions rather than articulatory states traversed [332, page 44]. The important difference between this model and the previous one lies in the nature of the motor representation, seen as the projection and integration of the sensory inputs. In fact, the previous model could be given the name AR (Articulatory Recoding), whereas the current model could itself be entitled MR. However, introducing such a distinction, which is relevant when dealing with theories of speech perception, is somewhat disputable when dealing with architectures of audio-visual fusion. We could likewise think of other distinctions that make use of the various propositions found in the literature on the nature of the preprocessing at the input or on the level of the linguistic units at the output. This is why we prefer to consider only the architecture called MR, as described in figure C.4, since it can regroup “static” as well as “dynamic” representations of the motor configurations.

C.1.3 Conclusion

In connection with the general theories of multimodal integration (see section C.1.1), the DI and SI models fall into the first category of models, in which fusion occurs at the symbolic level. The RD model obviously falls into the second category, since the visual modality is recoded into the auditory modality (which is assumed to be dominant). Finally, the MR model falls into the third category of neo-Gibsonian models. In connection with the principles of sensor integration proposed by Crowley and Demazeau [75], it can be noticed that the first three principles (common lexicon and coordinate system) cannot be avoided. However, the fourth and the fifth principles (uncertainty and confidence) are seldom considered, even though they are somehow at the essence of Massaro’s FLMP. The four models are schematically represented in figure C.5.


Figure C.5: Schematic of the four integration models

Several variations of these models could be developed. Continuous links from one model to another could even be elaborated, as is often the case in modeling, so that the differences would be smoothed out as the structures become more complex. On the contrary, and in a spirit of methodological control, we will later fix the architectures in their “canonical” form, so that the different options can be clearly compared.

C.1.4 Taxonomy of the integration models

We will here make an attempt to classify the architectures presented above in a synthetic form and to relate them to the general models of cognitive psychology. To do so, we will organize the architectures around three questions:

1. Does the interaction between modalities imply a common intermediate representation or not?


The answer allows a first opposition between the DI model (for which it is NO) and the other architectures.

2. If there is an intermediate representation, does it rely on the existence of prototypes or of early classification processes, i.e. at a symbolic level? In other words, is it a late or an early integration (see [348, page 25])?

We will say that integration is late when it occurs after the decoding processes (the SI case), even though these processes may deliver continuous data (as with the FLMP). Conversely, integration is early when it applies to continuous representations, common to the two modalities, which are obtained through low-level mechanisms that do not rely on any decoding process: this is the case with the RD and MR models.

3. Is there, finally, a dominant modality which provides the common intermediate representation in an early integration model (RD)? Or is this common representation amodal (as in the MR model)?

These questions lead to the taxonomy presented in figure C.6.

DI: no intermediate representation.
SI: intermediate representation, late integration.
RD: intermediate representation, early integration, dominant modality.
MR: intermediate representation, early integration, motor (amodal) representation.

Figure C.6: Taxonomy of the integration models

C.2 Audio-Visual Speech Recognition by Machines

In this section, we describe some existing automatic speech recognition systems based mainly on optical and acoustical information flows. We will finally present some results obtained with the automatic audio-visual speech recognizer implemented at ICP.


C.2.1 Audio-visual speech perception by humans

Research with human subjects has shown that visible information from the talker’s face provides extensive benefit to speech recognition in difficult listening conditions [330, 21]. Studies have shown that visual information from a speaker’s face is integrated with auditory information during phonetic perception. The McGurk effect demonstrates that an auditory /ba/ presented with a video /ga/ produces the perception of /da/ [220]. It indicates that the perceived place of articulation can be influenced by the visual cues. Other researchers have shown that bilabials and dentals are more easily perceived visually than alveolars and palatals [240]. All these experiments have demonstrated that speech perception is bimodal for all normal hearers, not just for the profoundly deaf, as noticed by Cotton as early as 1935 [73]. For a more extended review of the intelligibility of audio-visual speech by humans, see the ICP-MIAMI 94-1 report by Benoit. For detailed information on the integration of such auditory and visual information by humans, see the ICP-MIAMI 94-2 report by Robert-Ribes.

C.2.2 Automatic visual speech recognition

The visible facial articulation cues can be exploited for automatic visual speech recognition, which could enhance the performance of conventional, acoustical speech recognizers, especially when the acoustic signal is degraded by noise. In this section, we present some prototypical systems that process the optical flow as a video input. The optical automatic speech recognition system developed by Brooke and Petajan in 1986 [47] was based on the magnitude of the radial vector changes measured from the midpoint of a line between the inner lip corners to points on the inner line of the lips at equal angular intervals, with the zero angle at the left corner. The objective of this research was to find an optical automatic speech recognition algorithm using this radial vector measure. The recognition experiment consisted of distinguishing between the three vowels /a/, /i/, /u/ in consonant-vowel-consonant utterances, using a similarity metric on the radial vector measure with two speakers. The results show that, if the threshold of similarity is held sufficiently high and consonant coarticulation effects are non-existent or neglected, perfect recognition is achieved on the three vowels considered. Pentland developed an optical speech recognition system for isolated digit recognition [264]. He used optical flow techniques to capture the movement of muscles in four regions of the lower facial area, as evidenced by the changes in muscle shadows under normal illumination. The recognition rate was 70% for ten digits.

Smith used optical information from the derivatives of the area and the height of the oral-cavity shape to distinguish among four words that an acoustic automatic speech recognizer had confused [316]. Using the two derivatives, he managed to distinguish perfectly among the four acoustically confused words. Goldschen built an optical speech recognizer for continuous speech [126]. The system used a probabilistic approach based on Hidden Markov Models (HMMs). The database consisted of sequences of binary images of contours or edges depicting the nostrils. The recognition rate was 25% of complete sentences.

C.2.3 Automatic audio-visual speech recognition

Although acoustically based automatic speech recognition systems have seen enormous improvements over the past ten years, they still experience difficulties in several areas, including operation in noisy environments. Using visual cues, however, can significantly enhance the recognition rate. In this section, we provide an overall survey of existing audio-visual speech recognizers. The first audio-visual speech recognition system using the oral-cavity region was developed by Petajan [267]. The primary focus of his research was to demonstrate that an existing acoustic automatic speech recognizer could achieve a higher recognition rate when augmented with information from the oral-cavity region. This system used a single speaker to perform isolated-word recognition. Petajan used a commercial acoustic automatic speech recognizer, built the audio-visual recognition system around it, and utilized four (static) features from the oral-cavity region of each image frame, in decreasing order of weight: height, area, width, and perimeter. The commercial acoustic automatic speech recognizer alone had a recognition rate of 65%, fairly typical of the technology commercially available at that time; the audio-visual automatic speech recognizer, however, achieved a recognition rate of 78%. He further expanded his research by building an improved version of the recognizer. In an attempt to approximate real-time performance, the optical processor capturing the images utilized fast contour and region coding algorithms based on an extended variation of predictive differential quantization [143]. The optical automatic speech recognizer utilized vector quantization to build a codebook of oral-cavity shadows. This new system was faster than the one described earlier, while using only the area feature of the oral-cavity region. The entire system was tested in speaker-dependent experiments on isolated words. In two experiments, optical identification was 100% on 16 samples of each digit for one speaker and 97.5% on eight samples of each digit for another.
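
Petajan’s four static oral-cavity features (height, area, width, perimeter) can be approximated from a binary image of the cavity region with a few lines of array code; the toy image and the 4-neighbour perimeter rule below are assumptions for illustration, not the original implementation:

    import numpy as np

    def oral_cavity_features(binary_mouth):
        """Height, area, width and perimeter of the oral-cavity region in a
        binary image (True = cavity pixel); a rough approximation of the
        static features used in early audio-visual recognisers."""
        ys, xs = np.nonzero(binary_mouth)
        if len(ys) == 0:
            return dict(height=0, area=0, width=0, perimeter=0)
        height = int(ys.max() - ys.min() + 1)
        width = int(xs.max() - xs.min() + 1)
        area = int(binary_mouth.sum())
        # perimeter: cavity pixels with at least one non-cavity 4-neighbour
        padded = np.pad(binary_mouth, 1, constant_values=False)
        interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                    padded[1:-1, :-2] & padded[1:-1, 2:])
        perimeter = int((binary_mouth & ~interior).sum())
        return dict(height=height, area=area, width=width, perimeter=perimeter)

    # usage on a toy image: an elliptical "open mouth"
    yy, xx = np.mgrid[0:60, 0:80]
    mouth = ((yy - 30) / 15.0) ** 2 + ((xx - 40) / 30.0) ** 2 <= 1.0
    print(oral_cavity_features(mouth))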


Nishida, while working at the MIT Media Laboratory, used optical information from the oral cavity to find word boundaries for an acoustic automatic speech recognizer [244]. His work focused on the derivative of dark areas as obtained from the difference between two consecutive binary images. Yuhas, in his doctoral research in 1989, used a neural network and both optical and acoustic information to recognize nine phonemes [374]. The goal of his research was to recognize the phonemes accurately from optical information, independently of the acoustic noise level. As the acoustic recognition degraded with noise, the optical system maintained its performance. Stork et al. used a time-delay neural network (TDNN) to achieve speaker-independent recognition of ten phoneme utterances [327]. The optical information consisted of four distances: (i) the distance between the nose and chin (lowering of the jaw), (ii) the height at the center of the upper and lower lips, (iii) the height at the corners of the lips, and (iv) the width between the corners of the lips. The TDNN achieved an optical recognition accuracy of 51% and a combined acoustic and optical recognition accuracy of 91%. Bregler used a TDNN as a preprocessor to the dynamic time warping algorithm to recognize 26 German letters in connected speech [41]. The recognition rate for the combined acoustical and optical information was 97.2%, compared with 97.2% for the acoustic and 46.9% for the optical information alone; the gain from the visual information became much more significant in noisy environments.

C.2.4 Current results obtained at ICP

Experimental paradigm

The main purpose of our research is to build an automatic audio-visual speech recognizer. It is important to note that the corpus used is limited compared to the databases usually used to test speech recognizers. As we will explain further in this report, the system has to identify an item out of only 18 candidates, so that the test conditions are rather simple when compared to a full recognition problem. Nevertheless, our goal is to study the benefit of automatic lipreading for automatic speech recognition. In that sense, any training/test paradigm will be as informative as another. This is why we have first chosen a simple procedure.

Technique used

A probabilistic approach, i.e. HMMs, is used for the main classifier. The system was first developed using an off-line analysis of the lips [171]. In order to extract the necessary visual information from the talker’s face, the speaker’s lips are made up in blue.

A chroma-key turns this colour on-line into saturated black. The vermilion area of the lips is therefore easily detected by thresholding the luminance of the video signal. Finally, these images are used as the input to image-processing software which measures the parameters used by the visual recognition system. Figure C.7 gives an example of the input of our image-processing tool (left panel) and the measured parameters (right panel).

Figure C.7: An example of an image frame and the visual parameters

The parameters are extracted from each video field (50 fields per one-second sequence). They are stored in a file used as the main “optical data” input of our visual speech recognizer. The audio input is a vector composed of a total of 26 elements, such as Mel-frequency cepstral coefficients, energy and delta coefficients. In order to obtain complete synchronization between the audio and visual information cues, a 20 ms audio analysis window has been used. Visual and audio vectors are therefore in synchrony throughout the process. As a first step towards building an audio-visual speech recognizer, we followed the architecture of the Direct Identification model (see MIAMI report 94-2 by Robert-Ribes): the video parameters are appended to each audio vector. Several tests have been run in order to choose the best combination of visual parameters. Finally, only four parameters have been selected as the input of the speech recognition system; they are presented in figure C.7. The corpus is made of 18 nonsense words of the form VCVCV. Each word was uttered nine times, so that the audio database consists of 162 words. In all the following results, we divided the corpus into a training set made of 18×7 utterances and a test set made of the remaining 18×2 utterances. Each HMM is composed of 9 states. For each state, the transition and output probabilities and other parameters are computed using the forward-backward algorithm and Baum-Welch re-estimation.
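
The direct-identification front end just described amounts to synchronising the two streams at 20 ms and appending the visual parameters to each audio vector; a minimal sketch, with synthetic data standing in for the real 26 audio and 4 visual parameters:

    import numpy as np

    def make_audio_visual_vectors(audio_frames, video_fields):
        """Append the visual parameters of each 20 ms video field to the
        audio vector of the same instant (direct-identification front end).
        audio_frames : (T, 26) MFCC/energy/delta vectors, one per 20 ms
        video_fields : (T, 4) lip parameters, one per video field (50 Hz)"""
        T = min(len(audio_frames), len(video_fields))
        return np.hstack([audio_frames[:T], video_fields[:T]])   # (T, 30)

    # toy usage: one second of synthetic data
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(50, 26))
    video = rng.normal(size=(50, 4))
    observations = make_audio_visual_vectors(audio, video)   # fed to the HMMs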

Preliminary results

Two different types of training have been performed in order to check the efficiency of the audio-only and the audio-visual speech recognizers.

We first trained the HMMs using clear acoustic data, whatever the S/N of the test condition. Second, we trained the HMMs on corrupted acoustic data, obtained by adding white noise whose intensity N corresponded to that of each S/N condition during the test.
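
Corrupting the training material with white noise at a prescribed S/N can be done as in the sketch below, which follows the usual definition SNR(dB) = 10 log10(P_signal / P_noise); it is an illustration, not the exact ICP procedure:

    import numpy as np

    def add_white_noise(signal, snr_db, seed=0):
        """Return `signal` plus white Gaussian noise scaled so that the
        resulting signal-to-noise ratio equals `snr_db` (in dB)."""
        rng = np.random.default_rng(seed)
        noise = rng.normal(size=signal.shape)
        p_signal = np.mean(signal ** 2)
        p_noise = p_signal / (10.0 ** (snr_db / 10.0))    # target noise power
        noise *= np.sqrt(p_noise / np.mean(noise ** 2))
        return signal + noise

    # usage: degrade a clean utterance to -6 dB, as in the matched training
    clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000.0)
    noisy = add_white_noise(clean, snr_db=-6.0)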

Results with clear acoustic training In this case, the training process is run with clear audio data and the system is tested with degraded audio data. First the acoustic recognizer is tested, and second, the visual parameters are added to the input. Figure C.8 summarizes these results. The horizontal line represents the score obtained using only the four visual parameters (V). The bottom curve represents the scores obtained with auditory-only data during the training and the test sessions (A). The top curve represents the scores obtained with audio-visual data during the training and the test sessions (AV). To the left of the vertical dashed line, the audio-visual recognition rate (AV) is lower than the recognition rate using the visual parameters only (V). This might be due to the unbalanced weight of the audio and visual data in our HMM vectors: 26 A + 4 V parameters. Only in the right area has a gain in the audio-visual recognition rate (AV) been achieved compared to the visual score alone (V). In all cases, AV scores are higher than A scores. These first results are unsatisfactory, since the visual information is not processed at its best by the HMMs.


Figure C.8: Test scores of the audio (A) and audio-visual (AV) recognizers after training the HMMs in clear acoustic conditions


Results with degraded acoustic training

In this case, for each S/N condition, the HMMs are trained using auditory data degraded with additive white noise of the same intensity as that used during the test session. Figure C.9 shows the results of this experiment. Scores are presented as in figure C.8. The AV score in the left area (1) is strongly influenced by the noisy data in the auditory vector, so that the resulting recognition rate is lower than the V score. In the center area (2), AV scores are higher than both A and V scores. In the right area (3), AV scores are similar to or even lower than A scores. These last results confirm the well-known fact that HMMs have to be trained with (even artificially) degraded data to improve the recognition scores when the test conditions are degraded.


Figure C.9: Test results obtained with the audio-visual (AV) and audio-only (A) recognizers. The system has been trained separately for each test condition.

C.2.5 Forecast for future works

The first results are a significant step towards designing an audio-visual speech recognition system, but many problems still have to be solved. First of all, in our model the weighting factors of the visual and auditory parameters have been neglected, so that the system still relies on unbalanced information (a common remedy, weighting the per-stream scores, is sketched at the end of this section). Second, other models of audio-visual integration have to be implemented in order to select the best way of combining audio and visual information in our case. For a complete on-line audio-visual speech recognition system, we have planned to design a video analysis board capable of extracting the lip contours and computing the four necessary visual parameters within a short delay (< 40 ms).

C – AUDIO-VISUAL SPEECH RECOGNITION 154 Miami Esprit bra 8579 to collect a much larger amount of audio-visual data within a decent period of time. It should allow the audio-visual speech recognizer to run in real-time. This would then be an ideal interface for bimodal speech dialogue on a multimedia platform.
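One commonly used remedy for the unbalanced 26 A + 4 V vector, mentioned here only as an illustration and not as the integration model eventually retained in MIAMI, is to score the audio and the visual stream with separate models and to combine their log-likelihoods with an S/N-dependent weight. A minimal Python sketch, in which the scoring functions are hypothetical stand-ins for separately trained acoustic and visual HMMs:

import math

def combine_stream_scores(log_lik_audio, log_lik_visual, audio_weight):
    # Weighted combination of per-stream log-likelihoods; audio_weight (lambda)
    # would typically be lowered as the S/N ratio degrades, so that the visual
    # stream dominates in heavy noise.  Purely illustrative.
    lam = min(max(audio_weight, 0.0), 1.0)
    return lam * log_lik_audio + (1.0 - lam) * log_lik_visual

def recognize(word_models, audio_obs, visual_obs, audio_weight):
    # `word_models` maps a word to a pair (score_audio, score_visual) of functions
    # returning log-likelihoods for the observation sequences (assumed interface).
    best_word, best_score = None, -math.inf
    for word, (score_audio, score_visual) in word_models.items():
        s = combine_stream_scores(score_audio(audio_obs),
                                  score_visual(visual_obs), audio_weight)
        if s > best_score:
            best_word, best_score = word, s
    return best_word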


Appendix D

Gesture Taxonomies

[289] characterize several classes of gesture:

Symbolic: conventional, context-independent, and typically unambiguous expressions (e. g., “OK” or peace signs)

Deictic: gestures that point to entities, analogous to natural language deixis (e. g. “this is not that” or “put that there”)

Iconic: gestures that are used to display objects, spatial relations, and actions (e. g., illustrating the orientation of two robots at a collision scene)

Pantomimic: gestures that display the use of an invisible object or tool (e. g., making a fist and moving it as if wielding a hammer)

D.1 Hand Gesture Taxonomy

[161] define a system able to parse 3D, time-varying gestures. Their system captures gesture features such as the posture of the hand (straight, relaxed, closed), its motion (moving, stopped), and its orientation (up, down, left, right, forward, backward, derived from the normal and longitudinal vectors of the palm). Over time, a stream of gestures is then abstracted into more general gestlets (e. g., pointing ⇒ attack, sweep, end reference). Similarly, low-level eye-tracking input was classified into event classes (fixations, saccades, and blinks). They integrate these multimodal features in a hybrid representation architecture.
It should be noted that the term and concept of 'gesture' is also used in pen computing. Although the recorded action there takes place in the 2D plane, similar phenomena play a role as in the case of 3D hand gesturing, but with much simpler signal processing involved. Other interesting references are [233, 287, 364, 341].
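As a rough illustration of this kind of feature extraction, the following Python sketch (all names, thresholds, and the coordinate convention are assumptions made here, not details taken from [161]) maps a palm vector and a short position history onto the discrete orientation and motion classes mentioned above:

import numpy as np

# Direction labels for the dominant component of a (unit) palm vector; the
# coordinate convention (x to the right, y up, z towards the user) is assumed.
_AXIS_LABELS = {0: ("right", "left"), 1: ("up", "down"), 2: ("backward", "forward")}

def dominant_direction(vector):
    # Map a 3D palm normal or longitudinal vector to one of six directions.
    v = np.asarray(vector, dtype=float)
    axis = int(np.argmax(np.abs(v)))
    positive, negative = _AXIS_LABELS[axis]
    return positive if v[axis] >= 0 else negative

def motion_state(positions, dt, speed_threshold=0.05):
    # Classify a short window of hand positions (N x 3 array) as moving/stopped.
    speeds = np.linalg.norm(np.diff(positions, axis=0), axis=1) / dt
    return "moving" if speeds.mean() > speed_threshold else "stopped"

For example, dominant_direction([0.1, -0.9, 0.2]) yields "down".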

Appendix E

Two-dimensional Movement in Time: Handwriting, Drawing, and Pen Gestures

Handwriting and Drawing are two different means of human information storage and communication, produced by the same two-dimensional output system: a pointed writing implement, usually driven by the arm, which leaves a visible trace on a flat surface. Handwriting conveys symbolic data, whereas Drawing conveys iconic data. The third data type, Pen Gestures, consists of unique symbols, as used by book editors and in pen computers, requesting a function to be executed.

Contrary to speech, handwriting is not an innate neural function and must be trained over several years. During the training process, handwriting evolves from a slow feedback process involving active attention and eye-hand coordination into a fast, automatic, and ballistic process. The atomic movement unit in handwriting is a stroke, which is a movement trajectory bounded by two points of high curvature and a corresponding dip in the tangential movement velocity. The typical modal stroke duration is of the order of 100 ms and varies less with increased movement amplitude than one would expect: over a large range, writers exert an increased force in order to maintain a preferred rhythm of movement (10 strokes/s, i. e., 5 Hz). Once “fired” cortically, such a stroke cannot be corrected by visual feedback.

The handwriting process evolves as a continuous process of concurrent planning and execution, the planning process being 2–3 characters ahead of the execution process. Writing errors reveal that several processes take place during writing, including phonemic-to-graphemic conversion (spelling) and graphemic-to-allographic conversion (“font choice”). Table E.1 gives an overview of the parameters that may be controlled with a pen.


x, y       pen-tip position (and derived velocity, acceleration, ...)
p          pen force (“pressure”), sensed either as a binary pen-up/pen-down switch or by an analog axial force transducer
z          height
φx, φy     pen angles

Switching: by thresholding of the pen force p (see above) or with additional button(s) on the pen.

Table E.1: Parameters controlled by a pen
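To make the stroke definition given above concrete, the following minimal Python sketch (illustrative only; numpy, the sampling interface, and the relative threshold are assumptions) segments a sampled pen-down trajectory at dips of the tangential velocity:

import numpy as np

def tangential_velocity(x, y, dt):
    # Absolute pen-tip velocity from the sampled coordinates x(t), y(t).
    vx = np.gradient(np.asarray(x, dtype=float), dt)
    vy = np.gradient(np.asarray(y, dtype=float), dt)
    return np.hypot(vx, vy)

def segment_strokes(x, y, dt, min_fraction=0.15):
    # A sample is taken as a stroke boundary when it is a local velocity minimum
    # below `min_fraction` of the peak velocity; with a modal stroke duration of
    # roughly 100 ms, about one boundary per 100 ms of writing is to be expected.
    v = tangential_velocity(x, y, dt)
    threshold = min_fraction * v.max()
    boundaries = [0]
    for i in range(1, len(v) - 1):
        if v[i] < v[i - 1] and v[i] <= v[i + 1] and v[i] < threshold:
            boundaries.append(i)
    boundaries.append(len(v) - 1)
    return [(boundaries[k], boundaries[k + 1]) for k in range(len(boundaries) - 1)]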

E.1 The Pen-based CIM/HOC as a Means of Human-Computer Interaction

In what follows, the accent will lie on the Human Output Channel (HOC) aspects of pen-based interaction. The CIM aspects will occasionally be touched upon. For now it suffices to mention that pen-based computing introduces the concept of “electronic ink” as a fundamental data type, differing substantially from other basic data types such as “pixels”, “vectors”, and “sound”. As such, the ink object has a number of typical properties. It refers to a pen-tip trajectory, possibly including pen angle and axial pen force, and it has a translation along the x and y axes, a scale, and a slant. Further rendering attributes are color, line thickness, and brush. In a limited number of areas, the concept of “invisible ink” is also needed, e. g., for pen-up movements. Aspects of electronic ink processing will be dealt with elsewhere. As regards computer output modalities (COM), there are no fundamental problems in using state-of-the-art computer graphics for rendering electronic ink. Modeling user interfaces in pen-based COM is sometimes referred to as Pen User Interface (PUI) design, analogous to the well-known term Graphical User Interface (GUI) design.

For the time being, however, we will return to the HOC aspects of pen-based user interaction. The mouse is basically an instrument for selecting objects and menu items. Some continuous control, like the dragging of screen objects, can be performed by the user after some training, but continuous control is not a strong point of the mouse; drawing and sketching are very difficult. With a pen, however, the same user actions can be performed as with a single-button mouse, and there are additional functions which are typical for the pen. More elaborate data entry is possible, in the form of linguistic data (text) or graphical data (ink). Also, the user has more direct control over objects on the screen: there is a higher degree of Direct Manipulation, similar to that of finger touch screens, but with a much higher spatial resolution in the case of the pen. A basic distinction in pen input is between (i) Textual Data Input, (ii) Command Input, and (iii) Graphical Input. Signature Verification will not be considered in this context.

E.2 Textual Data Input

Involved are letters, digits, and punctuation marks. There are a number of generic styles:

• block capitals

• isolated handprint

• run-on handprint/mixed-cursive

• fully connected cursive

• combinations of these.

However, there are large differences between writers, especially between writers of differing nationality. Recently, a new artificial style has been proposed by Goldberg, which has the advantage of being recognized with almost 100% accuracy [125]. However, the shapes of these unistroke characters have nothing in common with the familiar alphabet, and they are immediately replaced on the display by their machine-font variant.

E.2.1 Conversion to ASCII (handwriting recognition)

In this case the recognized data can be processed and used as regular computer text, allowing for all storage and retrieval methods, rendering in machine fonts, database searching, and E-mail. The following three methods are sorted in order of increasing reliability.

Free Text Entry

Hierarchically, handwriting consists of the following components: stroke → character → word → sentence → paragraph → page


Larger components have not been approached in current technology. It should be noted that, apart from the already difficult classification of shape (character, word), the segmentation preceding it is also error-prone. Writers will often write the words in a sentence with a horizontal spacing that is of the same magnitude within and between words. The histograms of the horizontal distances ∆X for within-word and between-word spacing will generally overlap considerably. Difficult issues in current recognition and user-interfacing technology are the multitude of styles over writers, the variability of shape within a writer, the processing of delayed pen actions, such as the dotting of i's and j's at a much later stage in the input process, and all kinds of delayed graphical corrections to letters (crossing, adding ink, adding drawings). The following three categories of Free Text Entry may be defined, depending on the number and nature of the existing constraints on the handwriting process.

1. Fully unconstrained (size, orientation, styles) This type of input, like that on ordinary office and household notes or Post-Its, cannot be recognized well at this point in time.

2. Lineated form, no prompting, free order of actions Here there are weak geometrical constraints, leading to a much more regular input quality. There are no temporal constraints.

3. Prompted In order to reduce the error introduced by segmentation into sentences or words, the user is required to indicate the boundaries of an input chunk. On the QWERTY keyboard, a similar segmenting functionality is provided by the Return or Enter key. The following methods are being used:

• “OK” dialog box chunk ending. A “Cancel” button allows for rewriting misspelled input. This method works reliably but interrupts the writing process.

• Time-out. Characters and words can be segmented by a time-out: if the writer lifts the pen and does not produce any ink within a time T, a segmentation event is assumed to have occurred. A typical time-out value for words is 800 ms, but this varies over users. Long time-out values cause merging of ink chunks; short time-out values lead to premature processing of unfinished chunks (a minimal sketch of such a time-out segmenter is given after this list).

• Gesture (see below). In this case, a simple gesture, like tapping with the pen to the right of the last word, indicates the segmentation event.
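The time-out method in the list above can be sketched as follows (a minimal, illustrative Python sketch; class and callback names are assumptions, and the 800 ms default is merely the typical value quoted above):

import time

class TimeoutSegmenter:
    # Collect pen samples into chunks and close a chunk after a pen-up pause of
    # `timeout` seconds, mirroring the time-out segmentation described above.
    def __init__(self, on_chunk, timeout=0.8, clock=time.monotonic):
        self.on_chunk = on_chunk        # called with the finished chunk
        self.timeout = timeout          # user-dependent; 0.8 s is only typical
        self.clock = clock
        self.samples = []
        self.last_ink_time = None

    def add_sample(self, sample):
        self.samples.append(sample)
        self.last_ink_time = self.clock()

    def poll(self):
        # Call periodically; emits the pending chunk once the time-out expires.
        if self.samples and self.clock() - self.last_ink_time >= self.timeout:
            self.on_chunk(self.samples)
            self.samples, self.last_ink_time = [], None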


Boxed Forms

By constraining the position, orientation, and size of the handwritten input, the system can force a reduced variability of input and can make rigid use of context, e. g., by only considering digits in the recognition process if an “amounts” field must be filled in. Often even the handwriting style is prescribed. Mostly, but not exclusively, isolated handprint characters and block-print capitals are allowed with current recognition technology.

Virtual keyboard

The size of the keyboard limits the miniaturization of handheld computers (Personal Digital Assistants, PDAs). A QWERTY keyboard can be presented on the LCD/digitizer, and individual virtual keys can be tapped with the pen. In fact, reasonable speeds can be reached with the proper layout [342]. For LCD/digitizer telephones with a touch-sensitive surface it was observed that users prefer tapping with the pen over pressing with their finger on the glass screen [82].

E.2.2 Graphical text storage

In this type of textual input, there is no need for computer recognition, since the end user of the produced text is again a human. Applications are “ink E-mail”, faxing, and note keeping. Attaching context information to notes (date, time, operator, etc.) already allows for a limited form of automatic processing and retrieval.

E.3 Command Entry

E.3.1 Widget selection (discrete selection)

Here the same user actions are considered as the “clicking” of a mouse; the terminology in pen computing is “tapping”. A special problem of the pen on integrated LCD/digitizers is that the hand may cover parts of the screen, so a distinction must be made between left-handed and right-handed users. A problem in selecting text printed on the screen is the font size in applications of type E.2 (Textual Data Input): the font must be of about the same size as the user's handwriting or larger, otherwise serious text-selection problems occur.


E.3.2 Drag-and-drop operations (continuous control)

As with the mouse, this type of operation is rather difficult and must be learned. The typical error is a premature drop, caused by lifting the pen or by not pressing hard enough.

E.3.3 Pen gestures

Gestures are pen-tip trajectories of limited size and duration and of unique shape, mostly (but not exclusively) different from known character shapes. A problem is the discrimination of character shapes from gesture shapes, both by the recognition algorithms on the CIM side and by the human user (Cognition and HOC side), who must remember the pattern shapes and their meaning. Ideally, a pen-based user interface is “mode-less”, but often a gesture mode must be invoked to address the gesture recognizer explicitly. Alternatively, the gestures can be produced in dedicated screen areas. Two basic types of pen gestures may be defined:

Position-independent gestures

System functions are activated, regardless of the pen position, such as starting an application program.

Position-dependent context gestures

Here the X/Y-position of the pen is used to relate the gesture to objects on the screen. An example is the crossing out of words.

E.3.4 Continuous control

The fact that the pen delivers continuous signals may be used in analog control. For instance, in some applications the pen force is used to encode ink trace thickness.
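As a minimal illustration of such analog control (the mapping and its parameters are assumptions made here, not taken from a particular application), the normalized pen force can be mapped linearly onto the rendered trace width:

def force_to_thickness(force, min_width=0.5, max_width=4.0, max_force=1.0):
    # Clamp the normalized pen force to [0, 1] and interpolate the line width.
    f = min(max(force / max_force, 0.0), 1.0)
    return min_width + f * (max_width - min_width)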

E.4 Handwriting and Pen Gestures Computer Output Media (COM)

The abundance of handwriting synthesis models in the literature allows for the production of handwritten computer output which is “faithful” both as regards the static image quality and as regards the temporal movement characteristics. Not all models are equally suitable in practical applications, and not all are good as a model of the “real” neural handwriting system. For completeness, however, it is good to realize that, much as in speech, both recognition and synthesis can be realized in the current state of the art. Apart from modeling the pen-tip trajectory only, recent models have begun to address the whole movement of forearm, hand, and finger joints, as well as the pen tip. In the context of MIAMI, however, it may suffice to use standard inverse kinematics approaches to produce handwriting with a mechanical or graphical arm where applicable. Storage and graphical rendering of already produced human handwriting is a technical rather than a scientific problem.

E.5 Graphical Pattern Input

By Graphical Pattern Input we mean those geometrical patterns which do not directly refer to linguistic information. The following categories may be discerned, varying in the degree to which symbolic, iconic, and structural information is involved.

E.5.1 Free-style drawings

This category of pen-based input mainly consists of sequentially entered iconic representations. For this task, the pen offers a tremendous advantage over any other form of computer input device. However, there are also some disadvantages compared to normal graphical tools like pencil or brush on real paper. The opaque tablet with an inking ballpoint allows for a more or less natural drawing behavior on paper, but the result on the screen can only be seen by looking up at the CRT monitor. The electronic paper solution (see above) immediately reveals the graphical result in screen resolution and color, but here the use of the pen is not as easy as in the pencil-and-paper situation. The general phenomenon which is missing is the mechanical effect of producing a trace, i. e., proprioceptively “feeling the friction” of the pen-to-surface contact. Two basic principles are worth mentioning: (i) free-style drawing as a process in which pixel ink is deposited in a bitmap, and (ii) free-style drawing as a process in which a position vector is produced in time. The latter type of recording can easily be played back in time. This also allows for a stroke-based “undo” operation during the interaction, which is more difficult to achieve in the case of bitmap-oriented drawing.
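The vector-oriented principle (ii) can be sketched as a minimal “electronic ink” data structure (illustrative Python only; names and the sample format are assumptions) in which playback in time and a stroke-based undo come essentially for free:

class InkDocument:
    # Each stroke is the list of (x, y, t, force) samples between pen-down and
    # pen-up; keeping strokes as time-ordered vectors rather than pixels makes
    # temporal playback and stroke-based undo trivial.
    def __init__(self):
        self.strokes = []
        self._current = None

    def pen_down(self):
        self._current = []

    def add_sample(self, x, y, t, force=0.0):
        if self._current is not None:
            self._current.append((x, y, t, force))

    def pen_up(self):
        if self._current:
            self.strokes.append(self._current)
        self._current = None

    def undo_last_stroke(self):
        if self.strokes:
            self.strokes.pop()

    def replay(self, draw_segment):
        # Play the ink back in recorded order; `draw_segment` is a drawing callback.
        for stroke in self.strokes:
            for a, b in zip(stroke, stroke[1:]):
                draw_segment(a, b)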


E.5.2 Flow charts and schematics

Here, the type of graphical data entered is a mixed iconic and structural representation. Mostly, the drawing itself is of little use without textual annotation and clarification. The algorithms involved here try to detect basic shapes from a library (e. g., lines, triangles, circles, rectangles, squares, etc.) and replace the sloppy input with the nearest neat library shape of the same size. Provisions are needed for closing holes and for interactively improving the drawing. The systems are mostly object-oriented in the sense that the objects have “widgets”: anchor points to tap on for resizing and moving. Basic rules for “neatness” are usually applied, like making connections to a side of a rectangle only at anchor points at 1/2, 1/3, . . . , 1/6 of the length of that side. This property is sometimes called “magnetic joining”. Research in this area dates from the late sixties, when combinations of flow-chart entry and programming in Fortran were studied.
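The “magnetic joining” rule can be sketched as follows (a minimal, illustrative Python sketch; the snap radius and the function names are assumptions, and only the fractions 1/2, 1/3, ..., 1/6 named above are generated):

def anchor_positions(side_length, max_denominator=6):
    # Anchor points at 1/2, 1/3, ..., 1/6 of the side length.
    return sorted(side_length / d for d in range(2, max_denominator + 1))

def snap_to_anchor(offset, side_length, snap_radius=8.0):
    # Snap a connection offset along a rectangle side to the nearest anchor
    # point if it lies within `snap_radius`; otherwise keep the sloppy input.
    nearest = min(anchor_positions(side_length), key=lambda a: abs(a - offset))
    return nearest if abs(nearest - offset) <= snap_radius else offset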

E.5.3 Miscellaneous symbolic input

In this category, the entry of mathematical formulas or musical scores is considered. In both these types of graphical behavior there is an aspect of symbolic as well as structural information. This means that, apart from a symbol identity like “Greek alpha”, its geometrical relation to the other graphical objects plays an essential role. Interestingly, the entering of formulas is sometimes used as an example of the ease of Direct Manipulation. It has become clear, however, that the correct entry of formulas with the pen is a tedious process if rules for graphical style are not applied within the interface algorithm. As a worst-case example, the user may have to enter four similar formulas, each time with tantalizingly accurate positioning of subscripts and superscripts, with the risk of very annoying spacing differences between these formulas. A combination of rule-based approaches, as in LaTeX, and the actual use of interactive pen input is the best solution to this problem. Finally, both in the automatic recognition of mathematical formulas and of musical scores, context information is needed and is in practice actually used to resolve shape classification ambiguities.


E.6 Known Bimodal Experiments in Handwriting

E.6.1 Speech command recognition and pen input

A number of application areas are possible [74]:

1. ink and speech annotation of existing documents

2. pointing by pen, data or modifier input by voice

3. the combined pen/voice typewriter: text input

The interesting functionality derives from the complementarity of the two human output channels. Pointing to objects by speech is dull, slow, and error-prone; thus, for pointing, the pen may be used. Conversely, for symbolic data entry, speech is a fast and natural channel. In text input, the combined recognition of speech and handwriting may considerably improve recognition rates.

E.6.2 Handwriting recognition and speech synthesis

At NICI, experience has been gained with combining cursive handwriting recognition and (Dutch) speech synthesis. The speech synthesis module was produced by Boves et al. Preliminary results indicate that speech and sound may be a good way of giving recognizer feedback to the user, provided that the response is very fast. Problems arise in the prosody coding, which operates at the sentence level and not at the level of single words. Single-character speech feedback is currently only possible with isolated handprint styles.


Bibliography

[1] J. H. Abbs and G. L. Eilenberg. Peripheral mechanisms of speech motor control. In N. J. Lass, editor, Contemporary Issues in Experimental Phonetics, pages 139–166. 1976.

[2] C. Abry, L. J. Boe, and J. L. Schwartz. Plateaus, Catastrophes and the Structuring of Vowel Systems. Journal of Phonetics, 17:47–54, 1989.

[3] C. Abry and M. T. Lallouache. Audibility and stability of articulatory movements: Deciphering two experiments on anticipatory rounding in French. In Proceedings of the XIIth International Congress of Phonetic Sciences, volume 1, pages 220–225, Aix-en-Provence, France, 1991.

[4] M. L. Agronin. The Design and Software Formulation of a 9-String 6-Degree-of-Freedom Joystick for Telemanipulation. Master's thesis, University of Texas at Austin, 1986.

[5] K. Aizawa, H. Harashima, and T. Saito. Model-based analysis-synthesis image coding (mbasic) system for person's face. Image Communication, 1:139–152, 1989.

[6] M. Akamatsu and S. Sato. A multi-modal mouse with tactile and force feedback. Int. Journ. of Human-Computer Studies, 40:443–453, 1994.

[7] P. Astheimer et al. Die Virtuelle Umgebung – Eine neue Epoche in der Mensch-Maschine-Kommunikation. Informatik-Spektrum, 17(6), December 1994.

[8] D. Baggi, editor. Readings in Computer-Generated Music. IEEE CS Press, 1992.

[9] J. Baily. Music structure and human movement. In P. Howell, I. Cross, and R. West, editors, Musical Structure and Cognition, pages 237–258. Academic Press, 1985.

[10] R. Balakrishnan, C. Ware, and T. Smith. Virtual hand tool with force feedback. In C. Plaison, editor, Proc. of the Conf. on Human Factors in Computing Systems, CHI'94, Boston, 1994. ACM/SIGCHI.

[11] J. Bates. The role of emotions in believable agents. Comm. of the ACM, 37(7):122–125, 1994.

[12] R. J. Beaton et al. An Evaluation of Input Devices for 3-D Computer Display Workstations. In Proc. of the SPIE Vol. 761, pages 94–101, 1987.

[13] R. J. Beaton and N. Weiman. User Evaluation of Cursor-Positioning Devices for 3-D Display Workstations. In Proc. of the SPIE Vol. 902, pages 53–58, 1988.

[14] A. P. Benguerel and H. A. Cowan. Coarticulation of upper lip protrusion in French. Phonetica, 30:41–55, 1974.

[15] A. P. Benguerel and M. K. Pichora-Fuller. Coarticulation effects in lipreading. Journal of Speech and Hearing Research, 25:600–607, 1982.


[16] C. Benoît, M. T. Lallouache, T. Mohamadi, and C. Abry. A set of French visemes for visual speech synthesis. In G. Bailly and C. Benoît, editors, Talking Machines: Theories, Models and Designs, pages 485–504. Elsevier Science Publishers B. V., North-Holland, Amsterdam, 1992.

[17] C. Benoît, T. Mohamadi, and S. Kandel. Effects of phonetic context on audio-visual intelligibility of French speech in noise. Journal of Speech & Hearing Research, (in press), 1994.

[18] P. Bergeron and P. Lachapelle. Controlling facial expressions and body movements in the computer generated animated short ’Tony de Peltrie’. In SigGraph ’85 Tutorial Notes, Advanced Computer Animation Course. 1985.

[19] N. Bernstein. The Co-ordination and Regulation of Movements. Pergamon Press, London, 1967.

[20] P. Bertelson and M. Radeau. Cross-modal bias and perceptual fusion with auditory visual spatial discordance. Perception and Psychophysics, 29:578–584, 1981.

[21] C. A. Binnie, A. A. Montgomery, and P. L. Jackson. Auditory and visual contributions to the perception of consonants. Journal of Speech & Hearing Research, 17:619–630, 1974.

[22] J.-L. Binot et al. Architecture of a Multimodal Dialogue Interface for Knowledge-Based Systems. ESPRIT II, Project No. 2474.

[23] E. Bizzi. Central and peripheral mechanisms in motor control. In G. Stelmach and J. Requin, editors, Advances in psychology 1: Tutorials in motor behavior, pages 131–143. Amsterdam: North Holland, 1980.

[24] E. Bizzi, A. Polit, and P. Morasso. Mechanisms underlying achievement of final head position. Journal of Neurophysiology, 39:435–444, 1976.

[25] M. M. Blattner and R. Dannenberg, editors. Multimedia Interface Design (Readings). ACM Press/Addison Wesley, 1992.

[26] M. M. Blattner et al. Sonic enhancement of two-dimensional graphic display. In G. Kramer, editor, Auditory Display, pages 447–470, Reading, Massachusetts, 1994. Santa Fe Institute, Addison Wesley.

[27] J. Blauert. Hearing - Psychological Bases and Psychophysics, chapter Psychoacoustic binaural phenomena. Springer, Berlin New York, 1983.

[28] J. Blauert, M. Bodden, and H. Lehnert. Binaural Signal Processing and Room Acoustics. IEICE Transact. Fundamentals (Japan), E75:1454–1458, 1993.

[29] J. Blauert and J.-P. Col. Auditory Psychology and Perception, chapter A study of temporal effects in spatial hearing, pages 531–538. Pergamon Press, Oxford, 1992.

[30] J. Blauert, H. Els, and J. Schröter. A Review of the Progress in External Ear Physics Regarding the Objective Performance Evaluation of Personal Ear Protectors. In Proc. Inter-Noise '80, pages 643–658, New York, USA, 1980. Noise-Control Found.

[31] J. Blauert and K. Genuit. Sound-Environment Evaluation by Binaural Technology: Some Basic Consideration. Journ. Acoust. Soc. Japan, 14:139–145, 1993.

[32] J. Blauert, H. Hudde, and U. Letens. Eardrum-Impedance and Middle-Ear Modeling. In Proc. Symp. Fed. Europ. Acoust. Soc., FASE, pages 125–128, Lisboa, 1987.


[33] M. Bodden. Binaurale Signalverarbeitung: Modellierung der Richtungserkennung und des Cocktail-Party-Effektes (Binaural signal processing: modeling of direction finding and of the cocktail-party effect). PhD thesis, Ruhr-Universität Bochum, 1992.

[34] M. Bodden. Modeling Human Sound Source Localization and the Cocktail-Party-Effect. Acta Acustica 1, 1:43–55, 1993.

[35] M. Bodden and J. Blauert. Separation of Concurrent Speech Signals: A Cocktail-Party Processor for Speech Enhancement. In Proc. ESCA Workshop on: Speech Processing in Adverse Conditions (Cannes-Mandelieu), pages 147–150. ESCA, 1992.

[36] J. Bortz. Lehrbuch der Statistik. 1977.

[37] D. W. Boston. Synthetic facial animation. British Journal of Audiology, 7:373–378, 1973.

[38] H. H. Bothe, G. Lindner, and F. Rieger. The development of a computer animation program for the teaching of lipreading. In Proc. of the 1st TIDE Conference, Bruxelles, pages 45–49. 1993.

[39] R. J. Brachman and J. G. Schmolze. An overview of the KL-ONE knowledge representation system. Cognitive Science, 9:171–216, 1985.

[40] L. D. Braida. Crossmodal integration in the identification of consonant segments. Quarterly Journal of Experimental Psychology, 43:647–678, 1991.

[41] C. Bregler, H. Hild, S. Manke, and A. Waibel. Improving connected letter recognition by lipreading. In International Conference on Acoustics, Speech and Signal Processing, volume 1, pages 557–560, Minneapolis, MN, 1993.

[42] S. A. Brewster, P. C. Wright, and A. D. N. Edwards. A detailed investigation into the effectiveness of earcons. In G. Kramer, editor, Auditory Display, pages 471–498, Reading, Massachusetts, 1994. Santa Fe Institute, Addison Wesley.

[43] D. E. Broadbent. The magic number seven after fifteen years. In A. Kennedy and A. Wilkes, editors, Studies in long term memory, pages 3–18. London: Wiley, 1975.

[44] E. R. Brocklehurst. The NPL Electronic Paper Project. Technical Report DITC 133/88, National Physical Laboratory (UK), 1994.

[45] N. M. Brooke. Development of a video speech synthesizer. In Proceedings of the British Institute of Acoustics, Autumn Conference, pages 41–44, 1979.

[46] N. M. Brooke. Computer graphics synthesis of talking faces. In G. Bailly and C. Benoît, editors, Talking Machines: Theories, Models and Designs, pages 505–522. Elsevier Science Publishers B. V., North-Holland, Amsterdam, 1992.

[47] N. M. Brooke and E. D. Petajan. Seeing speech: Investigation into the synthesis and recognition of visible speech movement using automatic image processing and computer graphics. In Proceedings of the International Conference on Speech Input and Output, pages 104–109, 1986.

[48] F. P. Brooks, Jr. et al. Project GROPE - Haptic Displays for Scientific Visualization. ACM Computer Graphics, 24(4):177–185, Aug. 1990.

[49] H. W. Buckingham and H. Hollien. A neural model for language and speech. Journal of Phonetics, 6:283, 1993.


[50] J. Burgstaller, J. Grollmann, and F. Kapsner. On the Software Structure of User Interface Management Systems. In W. Hansmann et al., editors, EUROGRAPHICS '89, pages 75–86. Elsevier Science Publishers B. V., 1989.

[51] T. W. Calvert et al. The evolution of an interface for choreographers. In Proc. INTERCHI, 1993.

[52] H. W. Campbell. Phoneme recognition by ear and by eye: a distinctive feature analysis. PhD thesis, Katholieke Universiteit te Nijmegen, 1974.

[53] R. Campbell. Tracing lip movements: making speech visible. Visible Language, 8(1):33–57, 1988.

[54] R. Campbell and B. Dodd. Hearing by eye. Quarterly Journal of Experimental Psychology, 32:509–515, 1980.

[55] A. Camurri et al. Dance and Movement Notation. In P. Morasso and V. Tagliasco, editors, Human Movement Understanding. North-Holland, 1986.

[56] A. Camurri et al. Music and Multimedia Knowledge Representation and Reasoning: The HARP System. Computer Music Journal (to appear), 1995.

[57] A. Camurri and M. Leman. AI-based Music Signal Applications. Techn. rep., IPEM - Univ. of Gent and DIST - Univ. of Genova, 1994.

[58] S. Card, W. K. English, and B. J. Burr. Evaluations of Mouse, Rate-controlled Isometric Joystick, Step Keys, and Text Keys for Text Selection on a CRT. Ergonomics, 21(8):601–613, Aug. 1978.

[59] S. K. Card, T. P. Moran, and A. Newell. The Keystroke-Level Model for User Performance Time with Interactive Systems. Communications of the ACM, 23(7):396–410, 1980.

[60] S. K. Card, T. P. Moran, and A. Newell. The Psychology of Human-Computer Interaction. Lawrence Erlbaum Ass., Publishers, 1983.

[61] E. Casella, F. Lavagetto, and R. Miani. A time-delay neural network for speech to lips movements conversion. In Proc. Int. Conf. on Artificial Neural Networks, Sorrento, Italy, pages 26–27. 1994.

[62] M. A. Cathiard. Identification visuelle des voyelles et des consonnes dans le jeu de la protrusion-retraction des levres en francais. Technical report, Memoire de Maitrise, Departement de Psychologie, Grenoble, France, 1988.

[63] M. A. Cathiard. La perception visuelle de la parole : apercu des connaissances. Bulletin de l'Institut de Phonetique de Grenoble, 17/18:109–193, 1988/1989.

[64] M. A. Cathiard, G. Tiberghien, A. Tseva, M. T. Lallouache, and P. Escudier. Visual perception of anticipatory rounding during pauses: A cross-language study. In Proceedings of the XIIth International Congress of Phonetic Sciences, volume 4, pages 50–53, Aix-en-Provence, France, 1991.

[65] M. Chafcouloff and A. Di Cristo. Les indices acoustiques et perceptuels des consonnes constrictives du francais, application a la synthese. In Actes des 9emes Journees d'Etude sur la Parole, Groupe Communication Parlee du GALF, pages 69–81, Lannion, France, 1978.

[66] A. Chapanis and R. M. Halsey. Absolute Judgements of Spectrum Colors. Journ. of Psychology, pages 99–103, 1956.

[67] H. J. Charwat. Lexikon der Mensch-Maschine-Kommunikation. Oldenbourg, 1992.

[68] M. Chen, S. J. Mountford, and A. Sellen. A Study in Interactive 3-D Rotation Using 2-D Control Devices. ACM Computer Graphics, 22(4):121–129, Aug. 1988.


[69] N. Chomsky and M. Halle. Sound Pattern of English. Harper and Row, New-York, 1968.

[70] M. M. Cohen and D. W. Massaro. Synthesis of visible speech. Behaviour Research Methods, Instruments & Computers, 22(2):260–263, 1990.

[71] M. M. Cohen and D. W. Massaro. Modelling coarticulation in synthetic visual speech. In N. Magnenat-Thalmann and D. Thalmann, editors, Proceedings of Computer Animation '93, Geneva, Switzerland, 1993.

[72] J.-P. Col. Localisation auditiv d’un signal et aspects temporels de l’audition spatiale (Auditory localization of a signal and temporal aspects of spatial hearing). PhD thesis, Marseille, 1990.

[73] J. Cotton. Normal ‘visual-hearing’. Science, 82:582–593, 1935.

[74] H. D. Crane and D. Rtischev. Pen and voice unite: adding pen and voice input to today’s user interfaces opens the door for more natural communication with your computer. Byte, 18:98–102, Oct. 1993.

[75] J. L. Crowley and Y. Demazeau. Principles and techniques for sensor data fusion. Signal Processing, 32:5–27, 1993.

[76] S. D. Visual-neural correlate of speechreading ability in normal-hearing adults: reliability. Journal of Speech and Hearing Research, 25:521–527, 1982.

[77] R. B. Dannenberg and A. Camurri. Computer-Generated Music and Multimedia Computing. In IEEE ICMCS Intl. Conf. on Multimedia Computing and Systems, Proc. ICMCS 94, pages 86–88. IEEE Computer Society Press, 1994.

[78] R. Davis, H. Shrobe, and P. Szolovits. What is a Knowledge Representation? AI Magazine, 14(1), 1993.

[79] B. de Graf. Performance facial animation notes. In Course Notes on State of the Art in Facial Animation, volume 26, pages 10–20, Dallas, 1990. SigGraph ’90.

[80] G. De Poli, A. Piccialli, and C. Roads, editors. Representations of Musical Signals. MIT Press, 1991.

[81] W. N. Dember and J. S. Warn. Psychology of Perception - 2nd Edition. Holt, Rinehart & Winston, New York, 1979.

[82] L. Dikmans. Future intelligent telephone terminals: A method for user interface evaluation early in the design process. Technical report, IPO/Philips rapport ’94, Eindhoven: Institute for Perception Research, 1994.

[83] N. F. Dixon and L. Spitz. The detection of audiovisual desynchrony. Perception, (9):719–721, 1980.

[84] B. Dodd and R. Campbell, editors. Hearing by Eye: The Psychology of Lip-reading, Hillsdale, New Jersey, 1987. Lawrence Erlbaum Associates.

[85] E. H. Dooijes. Analysis of Handwriting Movements. PhD thesis, University of Amsterdam, 1983.

[86] R. A. Earnshaw, M. A. Gigante, and H. Jones, editors. Virtual Reality Systems. Academic Press, 1993.

[87] B. Eberman and B. An. EXOS Research on Force Reflecting Controllers. SPIE Telemanipulator Technology, 1833:9–19, 1992.


[88] P. Ekman and W. V. Friesen. Facial Action Coding System. Consulting Psychologists Press, Stanford University, Palo Alto, 1977.

[89] S. R. Ellis. Nature and Origins of Virtual Environments: A Bibliographical Essay. Computing Systems in Engineering, 2(4):321–347, 1991.

[90] H. Els. Ein Meßsystem für die akustische Modelltechnik (A measuring system for the technique of acoustic modeling). PhD thesis, Ruhr-Universität Bochum, 1986.

[91] H. Els and J. Blauert. Measuring Techniques for Acoustic Models - Upgraded. In Proc. Internoise'85, Schriftenr. Bundesanst. Arbeitsschutz, Vol. Ib 39/II, pages 1359–1362. Bundesanst. Arbeitsschutz, 1985.

[92] H. Els and J. Blauert. A Measuring System for Acoustic Scale Models. In 12th Int. Congr. Acoust., Proc. of the Vancouver Symp. Acoustics & Theatre Planning for the Performing Arts, pages 65–70, 1986.

[93] N. P. Erber. Interaction of audition and vision in the recognition of oral speech stimuli. Journal of Speech & Hearing Research, 12:423–425, 1969.

[94] N. P. Erber. Auditory, visual and auditory-visual recognition of consonants by children with normal and impaired hearing. Journal of Speech and Hearing Research, 15:413–422, 1972.

[95] N. P. Erber. Auditory-visual perception of speech. Journal of Speech & Hearing Disorders, 40:481– 492, 1975.

[96] N. P. Erber and C. L. de Filippo. Voice/mouth synthesis and tactual/visual perception of /pa, ba, ma/. Journal of the Acoustical Society of America, 64:1015–1019, 1978.

[97] C. W. Eriksen and H. W. Hake. Multidimensional Stimulus Differences and Accuracy of Discrimination. Psychological Review, 67:279–300, 1955.

[98] P. Escudier, C. Benoît, and M. T. Lallouache. Identification visuelle de stimuli associes a l'opposition /i/ - /y/ : etude statique. In Proceedings of the First French Conference on Acoustics, pages 541–544, Lyon, France, 1990.

[99] C. Faure. Pen and voice interface for incremental design of graphics documents. In Proceedings of the IEE Colloquium on Handwriting and Pen-based input, Digest Number 1994/065, pages 9/1–9/3. London: The Institution of Electrical Engineers, March 1994.

[100] W. Felger. How interactive visualization can benefit from multidimensional input devices. SPIE, 1668:15–24, 1992.

[101] I. A. Ferguson. TouringMachines: An Architecture for Dynamic, Rational, Mobile Agents. Phd thesis, University of Cambridge, 1992.

[102] P. M. Fitts. The information capacity of the human motor system in controlling the amplitude of movement. Journal of Experimental Psychology, 47(6):381–391, June 1954.

[103] J. D. Foley. Neuartige Schnittstellen zwischen Mensch und Computer. In Spektrum der Wissenschaft, number 12, pages 98–106. Dec. 1987.

[104] J. W. Folkins and J. H. Abbs. Lip and jaw motor control during speech: motor reorganization response to external interference. J. S. H. R., 18:207–220, 1975.


[105] C. Fowler, P. Rubin, R. Remez, and M. E. Turvey. Implications for speech production of a general theory of action. In Language Production, Speech and Talk, volume 1, pages 373–420. Academic Press, London, 1980.

[106] C. A. Fowler. Coarticulation and theories of extrinsic timing. Journal of Phonetics, 1980.

[107] C. A. Fowler. Current perspective on language and speech production: A critical overview. In Speech Science, pages 193–278. Taylor and Francis, London, 1985.

[108] C. A. Fowler. An event approach to the study of speech perception from a direct-realist perspective. Journal of Phonetics, 14:3–28, 1986.

[109] D. M. Frohlich. The Design Space of Interfaces. Technical report, Hewlett-Packard Company, 1991.

[110] O. Fujimura. Elementary gestures and temporal organization. What does an articulatory constraint mean? In The cognitive representation of speech, pages 101–110. North Holland Amsterdam, 1981.

[111] Y. Fukui and M. Shimojo. Edge Tracing of Virtual Shape Using Input Device with Force Feedback. Systems and Computers in Japan, 23(5):94–104, 1992.

[112] W. Gaik. Untersuchungen zur binauralen Verarbeitung kopfbezogener Signale (Investigations into binaural signal processing of head-related signals). PhD thesis, Ruhr-Universität Bochum, 1990.

[113] W. Gaik. Combined Evaluation of Interaural Time and Intensity Differences: Psychoacoustic Results and Computer Modeling. Journ. Acoust. Soc. Am., 94:98–110, 1993.

[114] W. Gaik and S. Wolf. Multiple Images: Psychological Data and Model Predictions. In J. Duifhuis, J. W. Horst, and H. P. Wit, editors, Basic Issues of Hearing, pages 386–393, London, 1988. Academic Press.

[115] P. Gårdenfors. Semantics, Conceptual Spaces and the Dimensions of Music. In Rantala, Rowell, and Tarasti, editors, Essays on the Philosophy of Music, volume 43 of Acta Philosophica Fennica, pages 9–27, Helsinki, 1988.

[116] W. R. Garner. An Informational Analysis of Absolute Judgments of Loudness. Journ. of Experimental Psychology, 46:373–380, 1953.

[117] F. Garnier. Don Quichotte. Computer-generated movie, 2:40, 1991.

[118] W. W. Gaver. Using and creating auditory icons. In G. Kramer, editor, Auditory Display, pages 417–446, Reading, Massachusetts, 1994. Santa Fe Institute, Addison Wesley.

[119] T. H. Gay. Temporal and spatial properties of articulatory movements: evidence for minimum spreading across and maximum effects within syllable boundaries. In The cognitive representation of speech, pages 133–138. North Holland Amsterdam, 1981.

[120] G. Geiser. Mensch-Maschine Kommunikation. Oldenbourg, 1990.

[121] J. J. Gibson. The senses considered as perceptual systems. Houghton Mifflin, Boston, 1966.

[122] J. J. Gibson. The ecological approach to visual perception. Houghton Mifflin, Boston, 1979.

[123] M. A. Gigante. Virtual Reality: Definitions, History and Applications. In R. A. Earnshaw, M. A. Gigante, and H. Jones, editors, Virtual Reality Systems, chapter 1. Academic Press, 1993.

[124] J. Glasgow and D. Papadias. Computational Imagery. Cognitive Science, 16:355–394, 1992.


[125] D. Goldberg and C. Richardson. Touch-Typing with a Stylus. In InterCHI '93 Conference Proceedings, pages 80–87. Amsterdam, 1993.

[126] A. J. Goldschen. Continuous Automatic Speech Recognition by Lip Reading. PhD thesis, School of Engineering and Applied Science of the George Washington University, 1993.

[127] M. Good. Participatory Design of A Portable Torque-Feedback Device. In P. Bauersfeld, J. Bennett, and G. Lynch, editors, Proc. of the Conf. on Human Factors in Computing Systems, CHI'92, pages 439–446. ACM/SIGCHI, 1992.

[128] K. W. Grant and L. D. Braida. Evaluating the articulation index for auditory-visual input. Journal of the Acoustical Society of America, 89:2952–2960, 1991.

[129] K. P. Green, E. B. Stevens, P. K. Kuhl, and A. M. Meltzoff. Exploring the basis of the McGurk effect: Can perceivers combine information from a female face and a male voice? Journal of the Acoustical Society of America, 87:125, 1990.

[130] L. Grimby, J. Hannerz, and B. Hedman. Contraction time and voluntary discharge properties of individual short toe extensors in man. Journal of Physiology, 289:191–201, 1979.

[131] R. Gruber. Handsteuersystem für die Bewegungsführung. PhD thesis, Universität Karlsruhe, 1992.

[132] T. Guiard-Marigny. Animation en temps reel d'un modele parametrise de levres. PhD thesis, Institut National Polytechnique de Grenoble, 1992.

[133] T. Guiard-Marigny, A. Adjoudani, and C. Benoît. A 3-D model of the lips for visual speech synthesis. In Proceedings of the 2nd ESCA-IEEE Workshop on Speech Synthesis, pages 49–52, New Paltz, NY, 1994.

[134] R. Hammarberg. The metaphysics of coarticulation. Journal of Phonetics, 4:353–363, 1976.

[135] P. H. Hartline. Multisensory convergence. In G. Adelman, editor, Encyclopedia of Neuroscience, volume 2, pages 706–709. Birkhauser, 1987.

[136] Y. Hatwell. Toucher l'espace. La main et la perception tactile de l'espace. Technical report, Universitaires de Lille, 1986.

[137] Y. Hatwell. Transferts intermodaux et integration intermodale. 1993.

[138] C. Henton and P. Litwinowicz. Saying and seeing it with feeling: techniques for synthesizing visible, emotional speech. In Proceedings of the 2nd ESCA-IEEE Workshop on Speech Synthesis, pages 73–76, New Paltz, NY, 1994.

[139] D. R. Hill, A. Pearce, and B. Wyvill. Animating speech: an automated approach using speech synthesised by rules. The Visual Computer, 3:176–186, 1988.

[140] D. R. Hill, A. Pearce, and B. Wyvill. Animating speech: an automated approach using speech synthesised by rules. The Visual Computer, 3:277–289, 1989.

[141] W. Hill et al. Architectural Qualities and Principles for Multimodal and Multimedia Interfaces, chapter 17, pages 311–318. ACM Press, 1992.

[142] G. Hirzinger. Multisensory Shared Autonomy and Tele-Sensor-Programming – Key Issues in Space Robotics. Journ. on Robotics and Autonomous Systems, (11):141–162, 1993.

[143] X. D. Huang, Y. Ariki, and M. A. Jack. Hidden Markov Models for Speech Recognition. Edinburgh University Press, 1990.


[144] H. Hudde. Messung der Trommelfellimpedanz des menschlichen Ohres bis 19 kHz (Measurement of eardrum impedance up to 19 kHz). PhD thesis, Ruhr-Universität Bochum, 1980.

[145] H. Hudde. Estimation of the Area Function of Human Ear Canals by Sound-Pressure Measurements. Journ. Acoust. Soc. Am., 73:24–31, 1983.

[146] H. Hudde. Measurement of Eardrum Impedance of Human Ears. Journ. Acoust. Soc. Am., 73:242– 247, 1983.

[147] H. Hudde. Measurement-Related Modeling of Normal and Reconstructed Middle Ears. Acta Acustica, submitted, 1994.

[148] H. Iwata. Artificial Reality with Force-feedback: Development of Desktop Virtual Space with Compact Master Manipulator. Computer Graphics, 24(4):165–170, 1990.

[149] R. Jakobson, G. Fant, and M. Halle. Preliminaries to Speech Analysis. The distinctive features and their correlates. MIT Press, Cambridge MA, 1951.

[150] B. M. Jau. Anthropomorphic Exoskeleton dual arm/hand telerobot controller. pages 715–718, 1988.

[151] P. N. Johnson-Laird. Mental models. Cambridge University Press, Cambridge, 1983.

[152] P. Kabbash, W. Buxton, and A. Sellen. Two-Handed Input in a Compound Task. Human Factors in Computing Systems, pages 417–423, 1994.

[153] E. R. Kandel and J. R. Schwartz. Principles of neural sciences. Elsevier North Holland, 1993. (1981 is the first edition; there is a new one with some update on neurotransmitters and molecular biology aspects, probably dated 1993).

[154] J. A. S. Kelso, D. Southard, and D. Goodman. On the nature of human interlimb coordination. Science, 203:1029–1031, 1979.

[155] R. D. Kent and F. D. Minifie. Coarticulation in recent speech production models. Journal of Phonetics, 5:115–133, 1977.

[156] D. Kieras and P. G. Polson. An Approach to the Formal Analysis of User Complexity. Int. Journ. of Man-Machine Studies, 22:365–394, 1985.

[157] D. H. Klatt. Speech perception: A model of acoustic-phonetic analysis and lexical access. Journal of Phonetics, 7:279–312, 1979.

[158] D. H. Klatt. Software for a cascade/parallel formant synthesizer. Journal of the Acoustical Society of America, 67:971–995, 1980.

[159] J. Kleiser. Sextone for president. Computer-generated movie, 0:28.

[160] J. Kleiser. A fast, efficient, accurate way to represent the human face. Course Notes on State of the Art in Facial Animation, SigGraph ’89, 22:35–40, 1989.

[161] D. B. Koons, C. J. Sparrel, and K. R. Thorisson. Integrating Simultaneous Output from Speech, Gaze, and Hand Gestures. In M. Maybury, editor, Intelligent Multimedia Interfaces, pages 243–261. Menlo Park: AAAI/MIT Press, 1993.

[162] G. Kramer. An introduction to auditory display. In G. Kramer, editor, Auditory Display, pages 1–77, Reading, Massachusetts, 1994. Santa Fe Institute, Addison Wesley.


[163] N. Kugler, J. A. S. Kelso, and M. T. Turvey. On the concept of coordinative structures as dissipative structures: I. Theoretical line. In Tutorials in motor behaviour, pages 32–47. North Holland Amsterdam, 1980.

[164] P. N. Kugler, J. A. S. Kelso, and M. T. Turvey. On control and coordination of naturally developing systems. In J. A. S. Kelso and J. E. Clark, editors, The development of movement control and coordination, pages 5–78. New York: Wiley, 1982.

[165] P. K. Kuhl and A. N. Meltzoff. The bimodal perception of speech in infancy. Science, 218:1138– 1141, 1982.

[166] T. Kurihara and K. Arai. A transformation method for modeling and animation of the human face from photographs. In N. Magnenat-Thalmann and D. Thalmann, editors, Computer Animation '91, pages 45–58. Springer-Verlag, 1991.

[167] P. Ladefoged. Phonetics prerequisites for a distinctive feature theory. In Papers in linguistics and phonetics to the memory of Pierre Delattre, Mouton, The Hague, pages 273–285. 1972.

[168] P. Ladefoged. A course in phonetics. Harcourt Brace Jovanovich Inc., NY, 1975.

[169] P. Ladefoged. What are linguistic sounds made of? Language, 56:485–502, 1980.

[170] J. Laird, A. Newell, and P. Rosenbloom. Soar: an architecture for general intelligence. Artificial Intelligence, 33:1–64, 1987.

[171] M. T. Lallouache. Un poste ”visage-parole” couleur. Acquisition et traitement automatique des contours des levres. PhD thesis, Institut National Polytechnique de Grenoble, 1991.

[172] D. R. J. Laming. Information theory of choice-reaction times. London: Academic Press, 1968.

[173] K. S. Lashley. The problem of serial order in behaviour. In L. A. Jeffres, editor, Cerebral Mechanisms in Behaviour, pages 112–136. 1951.

[174] H. G. Lauffs. Bediengeräte zur 3-D-Bewegungsführung. PhD thesis, RWTH Aachen, 1991.

[175] B. Laurel, R. Strickland, and T. Tow. Placeholder: Landscape and Narrative in Virtual Environments. Computer Graphics, 28(2):118–126, 1994.

[176] D. Lavagetto, M. Arzarello, and M. Caranzano. Lipreadable frame animation driven by speech parameters. In IEEE Int. Symposium on Speech, Image Processing and neural Networks, pages 14–16, Hong Kong, April 1994.

[177] B. Le Goff, T. Guiard-Marigny, M. Cohen, and C. Benoît. Real-time analysis-synthesis and intelligibility of talking faces. In Proceedings of the 2nd ESCA-IEEE Workshop on Speech Synthesis, pages 53–56, New Paltz, NY, 1994.

[178] M. Lee, A. Freed, and D. Wessel. Real time neural network processing of gestural and acoustic signals. In Proc. Intl. Computer Music Conference, Montreal, Canada, 1991.

[179] H. Lehnert. Binaurale Raumsimulation: Ein Computermodell zur Erzeugung virtueller auditiver Umgebungen (Binaural room simulation: A computer model for generation of virtual auditory environments). PhD thesis, Ruhr-Universität Bochum, 1992.

[180] H. Lehnert and J. Blauert. A Concept for Binaural Room Simulation. In Proc. IEEE-ASSP Workshop on Application of Signal Processing to Audio & Acoustics, USA-New Paltz NY, 1989.


[181] H. Lehnert and J. Blauert. Principles of Binaural Room Simulation. Journ. Appl. Acoust., 36:259– 291, 1992.

[182] M. Leman. Introduction to auditory models in music research. Journal of New Music Research, 23(1), 1994.

[183] M. Leman. Schema-Based Tone Center Recognition of Musical Signals. Journ. of New Music Research, 23(2):169–203, 1994.

[184] U. Letens. Über die Interpretation von Impedanzmessungen im Gehörgang anhand von Mittelohr-Modellen (Interpretation of impedance measurements in the ear canal in terms of middle-ear models). PhD thesis, Ruhr-Universität Bochum, 1988.

[185] J. S. Lew. Optimal Accelerometer Layouts for Data Recovery in Signature Verification. IBM Journal of Research & Development, 24(4):496–511, 1980.

[186] J. P. Lewis and F. I. Parke. Automated lip-synch and speech synthesis for character animation. In Proceedings of CHI ’87 and Graphics Interface ’87, pages 143–147, Toronto, Canada, 1987.

[187] A. Liberman and I. Mattingly. The motor theory of speech perception revisited. Cognition, 21:1–36, 1985.

[188] I.-S. Lin, F. Wallner, and R. Dillmann. An Advanced Telerobotic Control System for a Mobile Robot with Multisensor Feedback. In Proc. of the 4th Intl. Conf. on Intelligent Autonomous Systems (to appear), 1995.

[189] W. Lindemann. Die Erweiterung eines Kreuzkorrelationsmodells der binauralen Signalverarbeitung durch kontralaterale Inhibitionsmechanismen (Extension of a cross-correlation model of binaural signal processing by means of contralateral inhibition mechanisms). PhD thesis, Ruhr-Universität Bochum, 1985.

[190] W. Lindemann. Extension of a Binaural Cross-Correlation Model by Means of Contralateral Inhibition. I. Simulation of Lateralization of Stationary Signals. Journ. Acoust. Soc. Am., 80:1608–1622, 1986.

[191] W. Lindemann. Extension of a Binaural Cross-Correlation Model by Means of Contralateral Inhibition. II. The Law of the First Wave Front. Journ. Acoust. Soc. Am., 80:1623–1630, 1986.

[192] P. H. Lindsay and D. A. Norman. Human Information Processing. Academic Press, New York, 1977.

[193] J. F. Lubker. Representation and context sensitivity. In The Cognitive Representation of Speech, pages 127–131. North Holland Amsterdam, 1981.

[194] F. J. Maarse, H. J. J. Janssen, and F. Dexel. A Special Pen for an XY Tablet. In W. S. F.J. Maarse, L.J.M. Mulder and A. Akkerman, editors, Computers in Psychology: Methods, Instrumentation, and Psychodiagnostics, pages 133–139. Amsterdam: Swets and Zeitlinger, 1988.

[195] L. MacDonald and J. Vince, editors. Interacting with Virtual Environments. Wiley Professional Computing, 1994.

[196] T. Machover and J. Chung. Hyperinstruments: Musically intelligent and interactive performance and creativity systems. In Proc. Intl. Computer Music Conference, Columbus, Ohio, USA, 1989.


[197] I. S. MacKenzie and W. Buxton. Extending Fitts’ Law to Two-Dimensional Tasks. In P. Bauersfeld, J. Bennett, and G. Lynch, editors, Human Factors in Computing Systems, CHI’92 Conf. Proc., pages 219–226. ACM/SIGCHI, ACM Press, May 1992.

[198] I. S. MacKenzie, A. Sellen, and W. Buxton. A comparison of input devices in elemental pointing and dragging tasks. In S. P. Robertson, G. M. Olson, and J. S. Olson, editors, Proc. of the ACM CHI'91 Conf. on Human Factors, pages 161–166. ACM-Press, 1991.

[199] A. MacLeod and Q. Summerfield. Quantifying the contribution of vision to speech perception in noise. British Journal of Audiology, 21:131–141, 1987.

[200] P. F. MacNeilage. Motor control of serial ordering of speech. Psychol. Review, 77:182–196, 1970.

[201] P. Maes, editor. Designing autonomous agents: Theory and practice from biology to engineering and back, Cambridge, MA, 1990. The MIT Press/Bradford Books.

[202] N. Magnenat-Thalmann, E. Primeau, and D. Thalmann. Abstract muscle action procedures for human face animation. Visual Computer, 3:290–297, 1988.

[203] N. Magnenat-Thalmann and D. Thalmann. The direction of synthetic actors in the film Rendez-vous à Montréal. IEEE Computer Graphics & Applications, 7(12):9–19, 1987.

[204] E. Magno-Caldognetto et al. Automatic analysis of lips and jaw kinematics in vcv sequences. In Proc. Eurospeech ’92, pages 453–456. 1992.

[205] E. Magno-Caldognetto et al. Liprounding coarticulation in italian. In Proc. Eurospeech ’92, pages 61–64. 1992.

[206] E. Magno-Caldognetto et al. Articulatory dynamics of lips in italian /’vpv/ and /’vbv/ sequences. In Proc. Eurospeech ’93. 1993.

[207] C. Marsden, P. Merton, and H. Morton. Latency measurements compatible with a cortical pathway for the stretch reflex in man. Journal of Physiology, 230:58–59, 1973.

[208] D. W. Massaro. Categorical partition: A fuzzy-logical model of categorization behaviour. 1987.

[209] D. W. Massaro. Multiple book review of ‘speech perception by ear and eye’. Behavioral and Brain Sciences, 12:741–794, 1989.

[210] D. W. Massaro. Connexionist models of speech perception. In Proceedings of the XIIth International Congress of Phonetic Sciences, volume 2, pages 94–97, Aix-en-Provence, France, 1991.

[211] D. W. Massaro and M. M. Cohen. Evaluation and integration of visual and auditory information in speech perception. Journal of Experimental Psychology: Human Perception & Performance, 9:753–771, 1983.

[212] D. W. Massaro and M. M. Cohen. Perception of synthesized audible and visible speech. Psychological Science, 1:55–63, 1990.

[213] D. W. Massaro and D. Friedman. Models of integration given multiple sources of information. Psychological Review, 97:225–252, 1990.

[214] T. H. Massie and J. K. Salisbury. The PHANToM Haptic Interface: a Device for Probing Virtual Objects. In Proc. of the ASME Winter Annual Meeting, Symp. on Haptic Interfaces for Virtual Environment and Teleoperator Systems, Chicago, 1994.


[215] K. Matsuoka, K. Masuda, and K. Kurosu. Speechreading trainer for hearing-impaired children. In J. Patrick and K. Duncan, editors, Training, Human Decision Making and Control. Elsevier Science, 1988.

[216] M. Maybury, editor. Intelligent Multimedia Interfaces. Menlo Park: AAAI/MIT Press, 1993.

[217] N. Mayer. XWebster: Webster’s 7th Collegiate Dictionary, Copyright c 1963 by Merriam-Webster, Inc. On-line access via Internet, 1963.

[218] S. McAdams and E. Bigand, editors. Thinking in Sound - The Cognitive Psychology of Human Audition. Clarendon Press, Oxford, 1993.

[219] N. P. McAngus Todd. The Auditory “Primal Sketch”: A Multiscale Model of Rhythmic Grouping. Journ. of New Music Research, 23(1), 1994.

[220] H. McGurk and J. MacDonald. Hearing lips and seeing voices. Nature, 264:746–748, 1976.

[221] M. L. Meeks and T. T. Kuklinski. Measurement of Dynamic Digitizer Performance. In R. Plamondon and G. G. Leedham, editors, Computer Processing of Handwriting, pages 89–110. Singapore: World Scientific, 1990.

[222] M. A. Meredith and B. E. Stein. Interactions among converging sensory inputs in the superior colliculus. Science, 221:389–391, 1983.

[223] G. A. Miller. The magical number seven, plus or minus two: Some limits on our capacity to process information. Psychological Review, 63:81–97, 1956.

[224] G. S. P. Miller. The audition. Computer-generated movie, 3:10, 1993.

[225] A. G. Mlcoch and D. J. Noll. Speech production models as related to the concept of apraxia of speech. In Speech and Language. Advances in basic research and practise, volume 4, pages 201–238. Academic Press NY, 1980.

[226] T. Mohamadi. Synthèse à partir du texte de visages parlants : réalisation d'un prototype et mesures d'intelligibilité bimodale. PhD thesis, Institut National Polytechnique, Grenoble, France, 1993.

[227] A. A. Montgomery and G. Soo Hoo. ANIMAT: A set of programs to generate, edit and display sequences of vector-based images. Behavioral Research Methods and Instrumentation, 14:39–40, 1982.

[228] P. Morasso and V. Sanguineti. Self-Organizing Topographic Maps and Motor Learning. In Cliff et al., editors, From Animals to Animats 3, pages 214–220. MIT Press, 1994.

[229] S. Morishima, K. Aizawa, and H. Harashima. Model-based facial image coding controlled by the speech parameter. In Proc. PCS-88, Turin, number 4. 1988.

[230] S. Morishima, K. Aizawa, and H. Harashima. A real-time facial action image synthesis driven by speech and text. In Visual Communication and Image Processing '90, Society of Photo-Optical Instrumentation Engineers, volume 1360, pages 1151–1158, 1990.

[231] S. Morishima and H. Harashima. A media conversion from speech to facial image for intelligent man-machine interface. IEEE Journal on Sel. Areas in Comm., 9(4):594–600, 1991.

[232] H. Morita, S. Hashimoto, and S. Otheru. A computer music system that follows a human conductor. IEEE COMPUTER, 24(7):44–53, 1991.

[233] P. Morrel-Samuels. Clarifying the distinction between lexical and gestural commands. Intl. Journ. of Man-Machine Studies, 32:581–590, 1990.

[234] A. Mulder. Virtual Musical Instruments: Accessing the Sound Synthesis Universe as a Performer. In Proc. First Brazilian Symposium on Computer Music, 14th Annual Congress of the Brazilian Computer Society, Caxambu, Minas Gerais, Brazil, 1994.

[235] A. Murata. An Experimental Evaluation of Mouse, Joystick, Joycard, Lightpen, Trackball and Touchscreen for Pointing - Basic Study on Human Interface Design. In H.-J. Bullinger, editor, Human Aspects in Computing: Design and Use of Interactive Systems and Work with Terminals, 1991.

[236] L. E. Murphy. Absolute Judgements of Duration. Journ. of Experimental Psychology, 71:260–263, 1966.

[237] E. D. Mynatt. Auditory representations of graphical user interfaces. In G. Kramer, editor, Auditory Display, pages 533–553, Reading, Massachusetts, 1994. Santa Fe Institute, Addison Wesley.

[238] M. Nahas, H. Huitric, and M. Saintourens. Animation of a B-Spline figure. The Visual Computer, (3):272–276, 1988.

[239] N. H. Narayanan, editor. Special issue on Computational Imagery, volume 9 of Computational Intelligence. Blackwell Publ., 1993.

[240] N. P. Nataraja and K. C. Ravishankar. Visual recognition of sounds in Kannada. Hearing Aid Journal, pages 13–16, 1983.

[241] K. K. Neely. Effect of visual factors on the intelligibility of speech. Journal of the Acoustical Society of America, 28:1275–1277, 1956.

[242] N. Negroponte. From Bezel to Proscenium. In Proceedings of SigGraph ’89, 1989.

[243] L. Nigay and J. Coutaz. A design space for multimodal systems - concurrent processing and data fusion. In INTERCHI ’93 - Conference on Human Factors in Computing Systems, Amsterdam, pages 172–178. Addison Wesley, 1993.

[244] Nishida. Speech recognition enhancement by lip information. ACM SIGCHI Bulletin, 17:198–204, 1986.

[245] S. G. Nooteboom. The target theory of speech production. In IPO Annual Progress Report, volume 5, pages 51–55. 1970.

[246] D. A. Norman. Cognitive Engineering. In D. A. Norman and S. W. Draper, editors, User Centered System Design, pages 31–61. Lawrence Erlbaum Associates, 1986.

[247] C. H. Null and J. P. Jenkins, editors. NASA Virtual Environment Research, Applications, and Technology. A White Paper, 1993.

[248] S. Oehman. Numerical models of coarticulation. J.A.S.A., 41:310–320, 1967.

[249] A. O’Leary and G. Rhodes. Cross Modal Effects on Visual and Auditory Object Perception. Perception and Psychophysics, 35:565–569, 1984.

[250] J. R. Olson and G. Olson. The growth of cognitive modeling in human-computer interaction since GOMS. Human-Computer Interaction, 5:221–265, 1990.

[251] P. L. Olson and M. Sivak. Perception-response time to unexpected roadway hazard. Human Factors, 26:91–96, 1986.

[252] P. O'Rorke and A. Ortony. Explaining Emotions. Cognitive Science, 18(2):283–323, 1994.

[253] O. Ostberg, B. Lindstrom, and P. O. Renhall. Contribution to speech intelligibility by different sizes of videophone displays. In Proc. of the Workshop on Videophone Terminal Design, Torino, Italy, 1988. CSELT.

[254] E. Owens and B. Blazek. Visemes observed by hearing-impaired and normal-hearing adult viewers. Journal of Speech and Hearing Research, 28:381–393, 1985.

[255] A. Paouri, N. Magnenat-Thalmann, and D. Thalmann. Creating realistic three-dimensional human shape characters for computer-generated films. In N. Magnenat-Thalmann and D. Thalmann, editors, Computer Animation'91, pages 89–99. Springer-Verlag, 1991.

[256] F. I. Parke. Computer-generated animation of faces. In Proceedings of ACM National Conference, volume 1, pages 451–457, 1972.

[257] F. I. Parke. Parameterized models for facial animation. IEEE Computer Graphics and Applications, 2:61–68, 1981.

[258] F. I. Parke. Facial animation by spatial mapping. PhD thesis, University of Utah, Department of Computer Sciences, 1991.

[259] E. C. Patterson, P. Litwinowicz, and N. Greene. Facial animation by spatial mapping. In N. Magnenat-Thalmann and D. Thalmann, editors, Computer Animation'91, pages 31–44. Springer-Verlag, 1991.

[260] S. J. Payne. Task action grammar. In B. Shackel, editor, Proc. Interact '84, pages 139–144. Amsterdam: North-Holland, 1984.

[261] A. Pearce, B. Wyvill, G. Wyvill, and D. Hill. Speech and expression: A computer solution to face animation. In Graphics Interface '86, pages 136–140, 1986.

[262] C. Pelachaud. Communication and coarticulation in facial animation. PhD thesis, University of Pennsylvania, USA, 1991.

[263] C. Pelachaud, N. Badler, and M. Steedman. Linguistic issues in facial animation. In N. Magnenat-Thalmann and D. Thalmann, editors, Computer Animation'91, pages 15–30. Springer-Verlag, 1991.

[264] A. Pentland and K. Masi. Lip reading: Automatic visual recognition of spoken words. Technical Report 117, MIT Media Lab Vision Science Technical Report, 1989.

[265] J. S. Perkell. Phonetic features and the physiology of speech production. In Language Production, pages 337–372. Academic Press NY, 1980.

[266] J. S. Perkell. On the use of feedback in speech production. In The Cognitive Representation of Speech, pages 45–52. North Holland Amsterdam, 1981.

[267] E. Petajan. Automatic Lipreading to Enhance Speech Recognition. PhD thesis, University of Illinois at Urbana-Champaign, 1984.

[268] J. Piaget. The Origins of Intelligence in Children. International University Press, New York, 1952.

[269] K. Pimentel and K. Teixeira. Virtual Reality: through the new looking glass. Windcrest Books, 1993.

[270] B. Pinkowski. LPC Spectral Moments for Clustering Acoustic Transients. IEEE Trans. on Speech and Audio Processing, 1(3):362–368, 1993.

[271] R. Plamondon and F. J. Maarse. An evaluation of motor models of handwriting. IEEE Transactions on Systems, Man and Cybernetics, 19:1060–1072, 1989.

[272] S. M. Platt. A structural model of the human face. PhD thesis, University of Pennsylvania, USA, 1985.

[273] S. M. Platt and N. I. Badler. Animating facial expressions. Computer Graphics, 15(3):245–252, 1981.

[274] I. Pollack. The Information of Elementary Auditory Displays. Journ. of the Acoustical Society of America, 25:765–769, 1953.

[275] W. Pompetzki. Psychoakustische Verifikation von Computermodellen zur binauralen Raumsimulation (Psychoacoustical verification of computer models for binaural room simulation). PhD thesis, Ruhr-Universität Bochum, 1993.

[276] C. Pösselt. Einfluss von Knochenschall auf die Schalldämmung von Gehörschützern (Influence of bone conduction on the attenuation of personal hearing protectors). PhD thesis, Ruhr-Universität Bochum, 1986.

[277] C. Pösselt et al. Generation of Binaural Signals for Research and Home Entertainment. In Proc. 12th Int. Congr. Acoust. Vol. I, B1-6, CND-Toronto, 1986.

[278] W. K. Pratt. Digital Image Processing. Wiley, New York, 1991.

[279] J. Psotka, S. A. Davison, and S. A. Lewis. Exploring immersion in virtual space. Virtual Reality Systems, 1(2):70–92, 1993.

[280] M. Radeau. Cognitive Impenetrability in Audio-Visual Interaction. In J. Alegria et al., editors, Analytical Approaches to Human Cognition, pages 183–198. North-Holland, Amsterdam, 1992.

[281] M. Radeau. Auditory-visual spatial interaction and modularity. Cahiers de Psychologie Cognitive, 13(1):3–51, 1994.

[282] M. Radeau and P. Bertelson. Auditory-Visual Interaction and the Timing of Inputs. Psychological Research, 49:17–22, 1987.

[283] J. Rasmussen. Information Processing and Human-Machine Interaction. An Approach to Cognitive Engineering. North-Holland, 1986.

[284] C. M. Reed, W. M. Rabinowitz, N. I. Durlach, and L. D. Braida. Research on the Tadoma method of speech communication. Journal of the Acoustical Society of America, 77(1):247–257, 1985.

[285] W. T. Reeves. Simple and complex facial animation: Case studies. In Course Notes on State of the Art in Facial Animation, volume 26. SigGraph '90, 1990.

[286] D. Reisberg, J. McLean, and A. Goldfield. Easy to hear but hard to understand: A lip-reading advantage with intact auditory stimuli. In B. Dodd and R. Campbell, editors, Hearing by eye: The psychology of lip-reading, pages 97–114. Lawrence Erlbaum Associates, Hillsdale, New Jersey, 1987.

[287] J. Rhyne. Dialogue Management for Gestural Interfaces. Computer Graphics, 21(2):137–142, 1987.

[288] D. Riecken, editor. Special Issue on Intelligent Agents, volume 37 of Communications of the ACM, 1994.

[289] B. Rimé and L. Schiaratura. Gesture and speech. In R. S. Feldman and B. Rimé, editors, Fundamentals of Nonverbal Behaviour, pages 239–281. New York: Press Syndicate of the University of Cambridge, 1991.

[290] A. Risberg and J. L. Lubker. Prosody and speechreading. Quarterly Progress & Status Report 4, Speech Transmission Laboratory, KTH, Stockholm, Sweden, 1978.

[291] J. Robert. Intégration audition-vision par réseaux de neurones: une étude comparative des modèles d'intégration appliqués à la perception des voyelles. Technical report, Rapport de DEA Signal-Image-Parole, ENSER, Grenoble, France, 1991.

[292] J. Robert-Ribes, P. Escudier, and J. L. Schwartz. Modèles d'intégration audition-vision: une étude neuromimétique. Technical report, ICP, 1991. Rapport Interne.

[293] G. G. Robertson, S. K. Card, and J. D. Mackinlay. The Cognitive Coprocessor Architecture for Interactive User Interfaces. ACM, pages 10–18, 1989.

[294] D. Salber and J. Coutaz. Applying the Wizard of Oz Technique to the Study of Multimodal Systems, 1993.

[295] V. J. Samar and D. C. Sims. Visual evoked response components related to speechreading and spatial skills in hearing and hearing-impaired adults. Journal of Speech & Hearing Research, 27:162–172, 1984.

[296] B. Scharf. Loudness, chapter 6, pages 187–242. Academic Press, New York, 1978.

[297] T. Schiphorst et al. Tools for Interaction with the Creative Process of Composition. In Proc. of the CHI ‘90, pages 167–174, 1990.

[298] D. Schlichthärle. Modelle des Hörens - mit Anwendungen auf die Hörbarkeit von Laufzeitverzerrungen (Models of hearing - applied to the audibility of arrival-time distortions). PhD thesis, Ruhr-Universität Bochum, 1980.

[299] E. M. Schmidt and J. S. McIntosh. Excitation and inhibition of forearm muscles explored with microstimulation of primate motor cortex during a trained task. In Abstracts of the 9th Annual Meeting of the Society for Neuroscience, volume 5, page 386, 1979.

[300] L. Schomaker et al. MIAMI — Multimodal Integration for Advanced Multimedia Interfaces. Annex i: Technical annex, Commission of the European Communities, December 1993.

[301] L. R. B. Schomaker. Using Stroke- or Character-based Self-organizing Maps in the Recognition of On-line, Connected Cursive Script. Pattern Recognition, 26(3):443–450, 1993.

[302] L. R. B. Schomaker and R. Plamondon. The Relation between Pen Force and Pen-Point Kinematics in Handwriting. Biological Cybernetics, 63:277–289, 1990.

[303] L. R. B. Schomaker, A. J. W. M. Thomassen, and H.-L. Teulings. A computational model of cursive handwriting. In R. Plamondon, C. Y. Suen, and M. L. Simner, editors, Computer Recognition and Human Production of Handwriting, pages 153–177. Singapore: World Scientific, 1989.

[304] J. Schröter. Messung der Schalldämmung von Gehörschützern mit einem physikalischen Verfahren - Kunstkopfmethode (Measurement of the attenuation of personal hearing protectors by means of a physical technique - dummy-head method). PhD thesis, Ruhr-Universität Bochum, 1983.

[305] J. Schröter. The Use of Acoustical Test Fixtures for the Measurement of Hearing-Protector Attenuation, Part I: Review of Previous Work and the Design of an Improved Test Fixture. Journ. Acoust. Soc. Am., 79:1065–1081, 1986.

[306] J. Schröter and C. Pösselt. The Use of Acoustical Test Fixtures for the Measurement of Hearing-Protector Attenuation, Part II: Modeling the External Ear, Simulating Bone Conduction, and Comparing Test Fixture and Real-Ear Data. Journ. Acoust. Soc. Am., 80:505–527, 1986.

[307] J. A. Scott Kelso. The process approach to understanding human motor behaviour: an introduc- tion. In J. A. Scott Kelso, editor, Human motor behaviour: an introduction, pages 3–19. Lawrence Erlbaum Ass. Pub., Hillsdale NJ, 1982.

[308] G. M. Shepherd. Neurobiology. Oxford Univ. Press, 2nd edition, 1988.

[309] K. B. Shimoga. A Survey of Perceptual Feedback Issues in Dexterous Telemanipulation: Part II. Finger Touch Feedback. In Proc. of the IEEE Virtual Reality Annual International Symposium. Piscataway, NJ : IEEE Service Center, 1993.

[310] B. Shneiderman. Designing the User Interface: Strategies for Effective Human-Computer Interac- tion. New York: Addison-Wesley, 1992.

[311] D. Silbernagel. Taschenatlas der Physiologie. Thieme, 1979.

[312] R. Simon. Pen Computing Futures: A Crystal Ball Gazing Exercise. In Proc. of the IEEE Colloquium on Handwriting and Pen-based Input, Digest Number 1994/065, page 4. London: The Institution of Electrical Engineers, March 1994.

[313] A. D. Simons and S. J. Cox. Generation of mouthshapes for a synthetic talking head. In Proceedings of the Institute of Acoustics, volume 12, pages 475–482, Great Britain, 1990.

[314] H. Slatky. Algorithmen zur richtungsselektiven Verarbeitung von Schallsignalen eines binauralen Cocktail-Party-Prozessors (Algorithms for direction-selective processing of sound signals by means of a binaural cocktail-party processor). PhD thesis, Ruhr-Universität Bochum, 1993.

[315] P. M. T. Smeele and A. C. Sittig. The contribution of vision to speech perception. In Proceedings of 13th International Symposium on Human Factors in Telecommunications, page 525, Torino, 1990.

[316] S. Smith. Computer lip reading to augment automatic speech recognition. Speech Tech, pages 175–181, 1989.

[317] P. Smolensky. A proper treatment of connectionism. Behavioural and Brain Sciences, 11:1–74, 1988.

[318] H. E. Staal and D. C. Donderi. The Effect of Sound on Visual Apparent Movement. American Journal of Psychology, 96:95–105, 1983.

[319] L. Steels. Emergent frame recognition and its use in artificial creatures. In Proc. of the Intl. Joint Conf. on Artificial Intelligence IJCAI-91, pages 1219–1224, 1991.

[320] L. Steels. The artificial life roots of artificial intelligence. Artificial Life, 1(1-2):75–110, 1994.

[321] B. E. Stein and M. A. Meredith. Merging of the Senses. MIT Press, Cambridge, London, 1993.

[322] R. Steinmetz. Multimedia-Technologie. Springer-Verlag, 1993.

[323] K. Stevens. The quantal nature of speech: Evidence from articulatory-acoustic data. In E. E. David Jr. and P. B. Denes, editors, Human communication: A unified view, pages 51–66. McGraw-Hill, New York, 1972.

[324] K. N. Stevens and J. S. Perkell. Speech physiology and phonetic features. In Dynamic aspects of speech production, pages 323–341. University of Tokyo Press, Tokyo, 1977.

[325] S. S. Stevens. On the Psychophysical Law. Psychological Review, 64:153–181, 1957.

[326] R. J. Stone. Virtual Reality & Telepresence – A UK Initiative. In Virtual Reality 91 – Impacts and Applications. Proc. of the 1st Annual Conf. on Virtual Reality, pages 40–45, London, 1991. Meckler Ltd.

[327] D. Stork, G. Wolff, and E. Levine. Neural network lipreading system for improved speech recognition. In International Joint Conference on Neural Networks, Baltimore, 1992.

[328] N. Suga. Auditory neuroethology and speech processing: complex-sound processing by combination-sensitive neurons. In G. M. Edelmann, W. Gall, and W. Cowan, editors, Auditory Function: Neurobiological Bases of Hearing. John Wiley and Sons, New York, 1988.

[329] J. W. Sullivan and S. W. Tyler, editors. Intelligent User Interfaces. ACM Press, Addison-Wesley, 1991.

[330] W. H. Sumby and I. Pollack. Visual contribution to speech intelligibility in noise. Journal of the Acoustical Society of America, 26:212–215, 1954.

[331] A. Q. Summerfield. Use of visual information for phonetic perception. Phonetica, 36:314–331, 1979.

[332] Q. Summerfield. Comprehensive account of audio-visual speech perception. In B. Dodd and R. Campbell, editors, Hearing by eye: The psychology of lip-reading. Lawrence Erlbaum Associates, Hillsdale, New Jersey, 1987.

[333] Q. Summerfield. Visual perception of phonetic gestures. In G. Mattingly and M. Studdert-Kennedy, editors, Modularity and the Motor Theory of Speech Perception. Lawrence Erlbaum Associates, Hillsdale, New Jersey, 1991.

[334] I. E. Sutherland. The ultimate display. In Information Processing 1965, Proc. IFIP Congress, pages 506–508, 1965.

[335] C. C. Tappert, C. Y. Suen, and T. Wakahara. The State of the Art in On-line Handwriting Recognition. IEEE Trans. on Pattern Analysis & Machine Intelligence, 12:787–808, 1990.

[336] L. Tarabella. Special issue on Man-Machine Interaction in Live Performance. In Interface, volume 22. Swets & Zeitlinger, Lisse, The Netherlands, 1993.

[337] D. Terzopoulos and K. Waters. Techniques for realistic facial modeling and animation. In N. Magnenat-Thalmann and D. Thalmann, editors, Computer Animation'91, pages 59–74. Springer-Verlag, 1991.

[338] H. L. Teulings and F. J. Maarse. Digital Recording and Processing of Handwriting Movements. Human Movement Science, 3:193–217, 1984.

[339] M. T. Turvey. Preliminaries to a theory of action with reference to vision. In R. Shaw and J. Bransford, editors, Perceiving, Acting and Knowing: Toward an ecological psychology, pages 211–265. Hillsdale, NJ: Erlbaum, 1977.

[340] T. Ungvary, S. Waters, and P. Rajka. NUNTIUS: A computer system for the interactive composition and analysis of music and dance. Leonardo, 25(1):55–68, 1992.

[341] K. Väänänen and K. Böhm. Gesture Driven Interaction as a Human Factor in Virtual Environments – An Approach with Neural Networks, chapter 7, pages 93–106. Academic Press Ltd., 1993.

[342] T. van Gelderen, A. Jameson, and A. L. Duwaer. Text recognition in pen-based computers: An empirical comparison of methods. In InterCHI '93 Conference Proceedings, pages 87–88, Amsterdam, 1993.

[343] D. Varner. Olfaction and VR. In Proceedings of the 1993 Conference on Intelligent Computer-Aided Training and Virtual Environment Technology, Houston, TX, 1993.

[344] B. Verplank. Tutorial notes. In Human Factors in Computing Systems, CHI’89. New York: ACM Press, 1989.

[345] M.-L. Viaud. Animation faciale avec rides d’expression, vieillissement et parole. PhD thesis, Paris XI-Orsay University, 1993.

[346] K. J. Vicente and J. Rasmussen. Ecological Interface Design: Theoretical Foundations. IEEE Transactions on Systems, Man, and Cybernetics, 22(4):589–606, Aug. 1992.

[347] P. Viviani and N. Stucchi. Motor-perceptual interactions. In J. Requin and G. Stelmach, editors, Tutorials in Motor Behavior, volume 2. Elsevier Science Publishers B. V., North-Holland, Amsterdam, 1991.

[348] J. H. M. Vroomen. Hearing voices and seeing lips: Investigations in the psychology of lipreading. PhD thesis, Katholieke Univ. Brabant, Sep. 1992.

[349] W. J. Wadman. Control mechanisms of fast goal-directed arm movements. PhD thesis, Utrecht University, The Netherlands, 1979. Doctoral dissertation.

[350] W. J. Wadman, W. Boerhout, and J. J. Denier van der Gon. Responses of the arm movement control system to force impulses. Journal of Human Movement Studies, 6:280–302, 1980.

[351] E. A. Wan. Temporal Back-propagation for FIR Neural Networks. In Proc. Int. Joint Conf. on Neural Networks, volume 1, pages 575–580, San Diego CA, 1990.

[352] J. R. Ward and M. J. Phillips. Digitizer Technology: Performance and the Effects on the User Interface. IEEE Computer Graphics and Applications, (April '87):31–44, 1987.

[353] D. H. Warren. Spatial Localization Under Conflicting Conditions: Is There a Single Explanation? Perception and Psychophysics, (8):323–337, 1979.

[354] D. H. Warren, R. B. Welch, and T. J. McCarthy. The role of visual-auditory compellingness in the ventriloquism effect: implications for transitivity among the spatial senses. Perception and Psychophysics, 30:557–564, 1981.

[355] K. Waters. A muscle model for animating three-dimensional facial expression. In Proceedings of Computer Graphics, volume 21, pages 17–24, 1987.

[356] K. Waters. Bureaucrat. Computer-generated movie, 1990.

[357] P. Weckesser and F. Wallner. Calibrating the Active Vision System KASTOR for Real-Time Robot Navigation. In J. F. Fryer, editor, Close Range Techniques and Machine Vision, pages 430–436. ISPRS Commission V, 1994.

[358] R. B. Welch and D. H. Warren. Handbook of Perception and Human Performance, chapter 25: Intersensory interactions. 1986.

[359] E. M. Wenzel. Spatial sound and sonification. In G. Kramer, editor, Auditory Display, pages 127–150, Reading, Massachusetts, 1994. Santa Fe Institute, Addison Wesley.

[360] D. Wessel. Improvisation with highly interactive real-time performance systems. In Proc. Intl. Computer Music Conference, Montreal, Canada, 1991.

[361] W. A. Wickelgren. Context-sensitive coding, associative memory and serial order in speech behaviour. Psycho. Rev., 76:1–15, 1969.

[362] N. Wiener. Cybernetics: or control and communication in the animal and the machine. New York: Wiley, 1948.

[363] L. Williams. Performance driven facial animation. Computer Graphics, 24(3):235–242, 1990.

[364] C. G. Wolf and P. Morrel-Samuels. The use of hand-drawn gestures for text editing. Intl. Journ. on Man-Machine Studies, 27:91–102, 1987.

[365] S. Wolf. Lokalisation von Schallquellen in geschlossenen Räumen (Localisation of sound sources in enclosed spaces). PhD thesis, Ruhr-Universität Bochum, 1991.

[366] P. Woodward. Le speaker de synthèse. PhD thesis, ENSERG, Institut National Polytechnique de Grenoble, France, 1991.

[367] P. Woodward, T. Mohamadi, C. Benoît, and G. Bailly. Synthèse à partir du texte d'un visage parlant français. In Actes des 19èmes Journées d'Étude sur la Parole, Bruxelles, 1992. Groupe Communication Parlée de la SFA.

[368] M. Wooldridge and N. R. Jennings. Intelligent Agents: Theory and Practice. (submitted to:) Knowledge Engineering Review, 1995.

[369] R. H. Wurtz and C. W. Mohler. Organization of monkey superior colliculus enhanced visual response of superficial layer cells. Journal of Neurophysiology, 39:745–765, 1976.

[370] B. L. M. Wyvill and D. R. Hill. Expression control using synthetic speech. In SigGraph ’90 Tutorial Notes, volume 26, pages 186–212, 1990.

[371] N. Xiang and J. Blauert. A Miniature Dummy Head for Binaural Evaluation of Tenth-Scale Acoustic Models. Journ. Appl. Acoust., 33:123–140, 1991.

[372] N. Xiang. Mobile Universal Measuring System for the Binaural Room-Acoustic-Model Technique. PhD thesis, Ruhr-Universit¨at Bochum, 1991.

[373] N. Xiang and J. Blauert. Binaural Scale Modelling for Auralization and Prediction of Acoustics in Auditoria. Journ. Appl. Acoust., 38:267–290, 1993.

[374] B. P. Yuhas, M. H. Goldstein Jr., and T. J. Sejnowski. Integration of acoustic and visual speech signal using neural networks. IEEE Communications Magazine, pages 65–71, 1989.
