Virtual Modality: a Framework for Testing and Building Multimodal Applications

Péter Pál Boda¹
Audio-Visual Systems Laboratory
Nokia Research Center
Helsinki, Finland
[email protected]

Edward Filisko
Spoken Language Systems Group
CSAIL, MIT
Cambridge, Massachusetts, USA
[email protected]

¹ Currently a Visiting Scientist with the Spoken Language Systems Group, CSAIL, MIT.

Abstract

This paper introduces a method that generates simulated multimodal input to be used in testing multimodal system implementations, as well as to build statistically motivated multimodal integration modules. The generation of such data is inspired by the fact that true multimodal data, recorded from real usage scenarios, is difficult and costly to obtain in large amounts. On the other hand, thanks to operational speech-only dialogue system applications, a wide selection of speech/text data (in the form of transcriptions, recognizer outputs, parse results, etc.) is available. Taking the textual transcriptions and converting them into multimodal inputs in order to assist multimodal system development is the underlying idea of the paper. A conceptual framework is established which utilizes two input channels: the original speech channel and an additional channel called Virtual Modality. This additional channel provides a certain level of abstraction to represent non-speech user inputs (e.g., gestures or sketches). From the transcriptions of the speech modality, pre-defined semantic items (e.g., nominal location references) are identified, removed, and replaced with deictic references (e.g., here, there). The deleted semantic items are then placed into the Virtual Modality channel and, according to external parameters (such as a pre-defined user population with various deviations), temporal shifts relative to the instant of each corresponding deictic reference are issued. The paper explains the procedure followed to create Virtual Modality data, the details of the speech-only database, and results based on a multimodal city information and navigation application.
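As an editorial illustration of the procedure outlined in the abstract, the following sketch shows one possible realization in Python. It is not the implementation used in this work: the location list, the choice of deictic words, and the Gaussian timing offset are assumptions introduced only to make the individual steps concrete.

    import random
    from dataclasses import dataclass

    # Hypothetical location references for a city navigation domain; in the
    # paper these come from annotated speech-only transcriptions.
    KNOWN_LOCATIONS = ["Harvard University", "MIT", "Central Square"]
    DEICTIC_WORDS = ["here", "there"]

    @dataclass
    class VirtualModalityEvent:
        semantic_item: str   # the removed semantic item, e.g. "MIT"
        time_offset: float   # shift relative to the deictic reference (seconds)

    def simulate_multimodal_input(transcription: str, mean: float = 0.0, dev: float = 0.3):
        """Replace one location reference with a deictic word and emit the
        removed item on the Virtual Modality channel with a temporal shift.
        The Gaussian offset is an assumed stand-in for the 'pre-defined user
        population with various deviations' mentioned in the abstract."""
        for location in KNOWN_LOCATIONS:
            if location in transcription:
                speech = transcription.replace(location, random.choice(DEICTIC_WORDS), 1)
                event = VirtualModalityEvent(location, random.gauss(mean, dev))
                return speech, event
        return transcription, None  # speech-only turn, no Virtual Modality event

    # Example: "From Harvard University to MIT" might become
    # ("From here to MIT", VirtualModalityEvent("Harvard University", -0.12))
    print(simulate_multimodal_input("From Harvard University to MIT"))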
1 Introduction

Multimodal systems have recently drawn significant attention from researchers, and the reasons for such an interest are many. First, speech recognition based applications and systems have become mature enough for larger-scale deployment. The underlying technologies are gradually exhibiting increased robustness and performance, and from the usability point of view, users can see some clear benefits from speech-driven applications. The next evolutionary step is the extension of the "one-dimensional" (i.e., speech-only) interface capabilities to include other modalities, such as gesture, sketch, gaze, and text. This will lead to a better and more comprehensive user experience.

A second reason is the widely accepted, and expected, mobility and pervasiveness of computers. Devices are getting more and more powerful and versatile; they can be connected anywhere and anytime to networks, as well as to each other. This poses new demands for the user interface. It is no longer sufficient to support only a single input modality. Depending on, for example, the specific application, the given usage scenario, and the context, users should be offered a variety of options by which to interact with the system in an appropriate and efficient way.

Third, as the output capabilities of devices provide ever-increasing multimedia experiences, it is natural that the input mechanism must also deal with various modalities in an intuitive and comprehensive manner. If a map is displayed to the user, it is natural to expect that the user may want to relate to this physical entity, for instance, via gestures, pointing, gazing, or by other, not necessarily speech-based, communicative means.

Multimodal interfaces give the user alternatives and flexibility in terms of the interaction; they are enabling rather than restricting. The primary goal is to fully understand the user's intention, and this can only be realized if all intentional user inputs, as well as any available contextual information (e.g., location, pragmatics, sensory data, user preferences, current and previous interaction histories), are taken into account.

This paper is organized as follows. Section 2 introduces the concept of Virtual Modality and how the multimodal data are generated. Section 3 explains the underlying Galaxy environment and briefly summarizes the operation of the Context Resolution module responsible for, among other tasks, resolving deictic references. The data generation process, as well as related statistics, is covered in Section 4. The experimental methodology is described in Section 5. Finally, the results are summarized and directions for future work are outlined.

2 Virtual Modality

This section explains the underlying concept of Virtual Modality, as well as the motivation for the work presented here.

2.1 Motivation

Multiple modalities and multimedia are an essential part of our daily lives. Human-human communication relies on a full range of input and output modalities, for instance, speech, gesture, vision, gaze, and paralinguistic, emotional, and sensory information. In order to conduct seamless communication between humans and machines, as many such modalities as possible need to be considered.

Intelligent devices, wearable terminals, and mobile handsets will accept multiple inputs from different modalities (e.g., voice, text, handwriting); they will render various media from various sources; and, in an intelligent manner, they will also be capable of providing additional, contextual information about the environment, the interaction history, or even the actual state of the user. Information such as the emotional, affective state of the user, the proximity of physical entities, the dialogue history, and biometric data from the user could be used to facilitate a more accommodating and concise interaction with a system. Once contextual information is fully utilized and multimodal input is supported, the load on the user can be considerably reduced. This is especially important for users with severe disabilities.

Implementing complex multimodal applications represents a chicken-and-egg problem. A significant amount of data is required in order to build and tune a system; on the other hand, without an operational system, no real data can be collected. Incremental and rule-based implementations, as well as quick mock-ups and Wizard-of-Oz setups (Lemmelä and Boda, 2002), aim to address the application development process from both ends; we follow an intermediate approach.

The work presented here is performed under the assumption that testing and building multimodal systems can benefit from a vast amount of multimodal data, even if the data is only a result of simulation. Furthermore, generating simulated multimodal data from textual data is justified by the fact that a multimodal system should also operate in speech-only mode.

2.2 The concept

Most current multimodal systems are developed with particular input modalities in mind. In the majority of cases, the primary modality is speech, and the additional modality is typically gesture, gaze, sketch, or any combination thereof. Once the actual usage scenario is fixed in terms of the available input modalities, subsequent work focuses only on these input channels. On the one hand, this is advantageous; on the other hand, there is a good chance that system development will focus on tiny details related to the modality-dependent nature of the recognizers and their particular interaction in the given domain and application scenario.

Virtual Modality represents an abstraction in this sense. The focus is on what semantic units (i.e., meaningful information from the application point of view) are delivered in this channel and how this channel aligns with the speech channel. Note that the speech channel has no exceptional role; it is equal in every sense with the Virtual Modality. There is only one specific consideration regarding the speech channel, namely, it conveys the deictic references that establish connections with the semantic units delivered by the Virtual Modality channel.
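To make the channel abstraction more tangible, the following sketch shows one possible representation of a single user turn carried over the two channels. The class and field names are illustrative assumptions rather than the paper's data format; the only point is that a deictic reference in the speech channel and a semantic unit in the Virtual Modality channel are linked and carry their own timestamps.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class DeicticReference:
        surface_form: str   # e.g., "here", "there", "this university"
        time: float         # instant of the deictic word in the speech stream
        link_id: int        # ties the reference to a Virtual Modality unit

    @dataclass
    class SpeechChannel:
        words: List[str]                  # recognized or transcribed word sequence
        deictics: List[DeicticReference]  # deictic references found in the sequence

    @dataclass
    class VirtualModalityUnit:
        semantic_item: str  # meaningful information, e.g., a location name
        time: float         # when the unit arrives on the Virtual Modality channel
        link_id: int        # matches the link_id of the corresponding deictic reference

    @dataclass
    class MultimodalTurn:
        speech: SpeechChannel
        virtual: List[VirtualModalityUnit]  # empty for a speech-only turn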
The abstraction provided by the Virtual Modality enables the developer to focus on the interrelation of the speech and the additional modalities, in terms of their temporal correlation, in order to study and experiment with various usage scenarios and usability issues. It also means that we do not care how the information delivered by the Virtual Modality arose, what (actual/physical) recognition process produced it, nor how the recognition processes can influence each other's performance via cross-interaction using early evidence available in one channel or the other. Although we acknowledge that this aspect is important and desirable, as pointed out by Coen (2001) and Haikonen (2003), it has not yet been addressed in the first implementation of the model.

The term "virtual modality" is not used in the multimodal research community, as far as we know. The only occurrence we found is by Marsic and Dorohonceanu (2003); however, with "virtual modality system" they refer to a multimodal management module that manages and controls various applications sharing common modalities in the context of telecollaboration user interfaces.

2.3 Operation

The idea behind Virtual Modality is explained with the help of Figure 1. The upper portion describes how the output of a speech recognizer (or direct natural language […] (Sp) and it delivers certain semantic units to the classifier. These semantic units correspond to a portion of the word sequence in the speech channel. For instance, the original incoming sentence might be, "From Harvard University to MIT." In the case of a multimodal setup, the semantic unit, originally represented by a set of words in the speech channel (e.g., "to MIT"), will be delivered by the Virtual Modality as m_i.

The tie between the two modality channels is a deictic reference in the speech channel (d_i in the bottom of Figure 1). There are various realizations of a deictic reference; in this example it can be, for instance, "here", "over there", or "this university".
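A minimal walk-through of the paper's example, using plain dictionaries and an assumed one-second word rate purely for illustration (the real timing comes from the speech data and the simulated user population), might look as follows.

    # Speech-only rendering of the example utterance: everything is spoken.
    speech_only = "From Harvard University to MIT"

    # Multimodal rendering: the location is reduced to a deictic reference d_i
    # in the speech channel, and the semantic unit m_i travels on the Virtual
    # Modality channel, temporally anchored to d_i.
    WORD_DURATION = 1.0  # assumed seconds per word, for illustration only

    speech_channel = ["From", "Harvard", "University", "to", "there"]
    d_i = {"word": "there",
           "time": speech_channel.index("there") * WORD_DURATION,
           "link": 0}

    virtual_modality_channel = [
        {"semantic_unit": "MIT",     # the information the deictic points to
         "time": d_i["time"] + 0.2,  # assumed shift relative to d_i
         "link": 0},
    ]

    # A later integration step has to resolve d_i against m_i via the shared
    # link (or, in a statistical integrator, via their temporal proximity).
    for unit in virtual_modality_channel:
        if unit["link"] == d_i["link"]:
            print(f'"{d_i["word"]}" at {d_i["time"]:.1f}s -> {unit["semantic_unit"]} '
                  f'at {unit["time"]:.1f}s')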