Multimodal Interaction with XForms
Mikko Honkala and Mikko Pohja
Helsinki University of Technology, TML
P. O. Box 5400, FI-02015 HUT, Finland
[email protected], [email protected]

©2006 by the authors. Reprinted from Proc. of the Sixth International Conference on Web Engineering (ICWE2006), Palo Alto, California, July 2006, pages 201-208. http://doi.acm.org/10.1145/1145581.1145624
Copyright is held by the author/owner(s). ICWE'06, July 11-14, 2006, Palo Alto, California, USA. ACM 1-59593-352-2/06/0007.

ABSTRACT
The increase in connected mobile computing devices has created the need for ubiquitous Web access. In many usage scenarios, it would be beneficial to interact multimodally. Current Web user interface description languages, such as HTML and VoiceXML, concentrate on only one modality. Some languages, such as SALT and X+V, allow combining the aural and visual modalities, but they lack ease of authoring, since both modalities have to be authored separately. Thus, for ease of authoring and maintainability, it is necessary to provide a cross-modal user interface language whose semantic level is higher. We propose a novel model, called XFormsMM, which combines XForms 1.0 with modality-dependent stylesheets and a multimodal interaction manager. The model separates modality-independent parts from modality-dependent parts, thus automatically providing most of the user interface to all modalities. The model allows flexible modality changes, so that the user can decide which modalities to use and when.

Categories and Subject Descriptors
H.5.2 [Information Interfaces and Presentation]: User Interfaces—Voice I/O; Graphical user interfaces (GUI); I.7.2 [Document and Text Processing]: Document Preparation—Markup languages

General Terms
Design, Experimentation, Standardization, Languages

Keywords
Multimodal Interaction, UI Description Language, XForms

1. INTRODUCTION
The diversity of Web applications, and of the devices used to access them, is increasing. Additionally, new usage contexts bring new kinds of requirements for Web applications. Using multiple modalities (e.g., aural and visual input and output) simultaneously can help to fulfill these requirements. For instance, Web applications could be used with voice input and output while driving a car, and a keyboard-less device could use voice input to enhance a graphical application's input capabilities. Studies have also shown that new and complex tasks are solved faster with the use of multiple modalities [4]. Additionally, several countries have enacted legislation that requires public Web services to be accessible to people with disabilities [9] (see W3C WAI Policies, http://www.w3.org/WAI/Policy/).

The problem is that current approaches to authoring multimodal Web applications lack ease of authoring: they concentrate on one modality at a time. For instance, the HTML+ECMAScript approach focuses on the visual UI, and it has well-known accessibility issues when used with speech tools, while VoiceXML is intended only for voice usage. The two main multimodal Web UI approaches, SALT [22] and XHTML+Voice (X+V) [1], allow authors to create and deliver the same user interface for both speech and GUI, but they require each modality to be authored separately, and thus limit ease of authoring. The solution is to raise the semantic level of the user interface description.

The main research problem was efficient authoring of multimodal user interfaces in the Web context. The research was conducted in two phases. First, a literature study was done in order to derive requirements for a multimodal UI language. Second, the requirements were converted into a design, which was then implemented as a proof-of-concept. This approach was then compared with SALT and X+V based on the requirements.

As a result, we propose to use XForms as the user interface description language, because of its modality-independent nature. We define a multimodal UI model called XForms Multimodal (XFormsMM), which includes XForms 1.0 and modality-dependent stylesheets as the UI description language and a definition of a multimodal interaction manager. This solution allows filling and submitting any XForms-compliant form simultaneously with aural and visual input and output. A proof-of-concept implementation was created and integrated into the X-Smiles XML browser. Two use cases were developed to demonstrate the functionality of the implementation, including input and output of several datatypes, as well as navigation within a form. One of the use cases was also compared to an equivalent application realized with X+V.

The paper is organized as follows. The next two Sections give background information about the topic and introduce the related work. The research problem is discussed in Section 4. Section 5 introduces the design of XFormsMM, while Section 6 describes our implementation of it. Use case implementations are shown in Section 7. Section 8 compares XFormsMM to other approaches, Section 9 contains discussion and, finally, Section 10 concludes the paper.

2. BACKGROUND
In a multimodal system, the user communicates with an application using different modalities (e.g., hearing, sight, speech, mouse). The input can be sequential, simultaneous, or composite between modalities. Moreover, uses of multimodality can be either supplementary or complementary. Supplementary means that every interaction can be carried out in each modality as if it were the only available modality, while complementary means that the interactions available to the user differ per modality. Finally, one of the key concepts is the interaction manager, which handles the dialog between the user and the application [10].

As a modality, speech is quite different from a graphical UI. The main difference is navigation within the UI. In the visual modality, more information and more navigation options can be presented simultaneously, while a speech interface has to be serialized. On the other hand, spoken command input allows intuitive shortcuts. An efficiency increase of 20-40 per cent has been shown for speech systems compared with other interface technologies, such as keyboard input [11]. Although speech interaction seems promising, the potential technical problems with speech interfaces may irritate the user and reduce task performance [18]. That is why speech is often combined with other modalities to offset their respective weaknesses [5]. In particular, GUIs are often combined with the speech modality.

Basically, there are three different approaches to realizing a speech interface: command, dialog, and natural language based. Human-computer interaction through speech is discussed in [2]. The premise of that paper is that perfect performance of a speech recognizer is not possible, since the system is used, e.g., in noisy environments and with different speakers. Therefore, dialog and feedback features are important to a recognizer, because both human and computer may misunderstand each other. In conclusion, a speech interface should provide adaptive feedback and imitate human conversation. Natural language speech user interfaces are not necessarily better than dialog- and command-based ones, and it has been shown that 80 % of users prefer a mixture of command- and dialog-based user interface over a natural language user interface [13].

The XForms 1.0 Recommendation [6] is the next-generation Web forms language, designed by the W3C. It solves some of the problems found in HTML forms by separating the purpose from the presentation and using declarative markup to describe the most common operations in form-based applications.
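To make this separation of purpose and presentation concrete, the following is a minimal sketch of our own, not an excerpt from the paper's use cases; the instance data, bind expressions, labels, and submission URI are invented for illustration. The model carries the data, its datatypes, and the submission details, while the abstract form controls only reference that data and can be rendered by any modality:

<xforms:model xmlns:xforms="http://www.w3.org/2002/xforms"
              xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Purpose: the data and its constraints live in the model -->
  <xforms:instance>
    <order xmlns="">
      <email/>
      <quantity>1</quantity>
    </order>
  </xforms:instance>
  <xforms:bind nodeset="email" required="true()"/>
  <xforms:bind nodeset="quantity" type="xsd:integer"/>
  <xforms:submission id="submit-order" method="post"
                     action="http://example.org/orders"/>
</xforms:model>

<!-- Presentation: abstract controls reference the model; a visual
     renderer can draw them as fields, while an aural renderer can
     speak the labels and listen for the values -->
<xforms:input ref="email">
  <xforms:label>E-mail address</xforms:label>
</xforms:input>
<xforms:input ref="quantity">
  <xforms:label>Quantity</xforms:label>
</xforms:input>
<xforms:submit submission="submit-order">
  <xforms:label>Send order</xforms:label>
</xforms:submit>

Because the controls are abstract, the same document can be given modality-dependent stylesheets, which is the property XFormsMM builds on.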
3. RELATED WORK
[...] a cross-modal UI language, which provides single authoring for multiple modalities, is the best approach. They conclude that XForms is well suited to act as this language. There are also other proposals to realize multimodal interaction. Raggett and Froumentin propose extensions to Cascading Stylesheets (CSS) to allow simple multimodal interaction tasks [15]. Their paper defines new CSS properties, which add an aural modality to current HTML applications. In contrast to our proposal, the aural UI must be created separately from the visual UI through the CSS extension.

The XHTML+Voice profile [1] is based on W3C standards and is used to create multimodal dialogs with visual and speech modalities. In addition to XHTML Basic and a modularized subset of VoiceXML 2.0 [12], it consists of the XML Events module, the XHTML scripting module, and a module containing a small number of attribute extensions to both XHTML and VoiceXML 2.0. XHTML is used to define the layout of the graphical content of a document, whereas VoiceXML adds the voice modality to the document. XML Events provides event listeners and associated event handlers for the profile. Finally, the attribute module facilitates the sharing of multimodal input data between the VoiceXML dialog and XHTML input and text elements. VoiceXML is integrated into XHTML using DOM events: when an element is activated, a corresponding event handler is called, and the event handler synthesizes the text obtained from the element that activated it. Also, speech input to an XHTML form field is assigned through VoiceXML elements. XHTML+Voice documents can be styled with both visual and aural style sheets.

Speech Application Language Tags (SALT) is an XML-based language for incorporating speech input and output into other languages [22]. It can be embedded, for instance, in XHTML documents. The speech interaction model is dialog-based. SALT uses the notion of subdialogs, which can be added to a page. SALT does not have an explicit data model, but it binds to external data models, such as XHTML forms, using ids. It has a quite low-level programming model, thus allowing authors to fully control the speech dialog interface.
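To illustrate the dual-authoring pattern that both X+V and SALT share, the following is a rough sketch in the style of X+V. It is our own illustrative fragment, not taken from the profile specification; the handler id, field names, grammar URI, and the exact synchronization expression are assumptions and may differ from real X+V documents. The VoiceXML dialog and the XHTML form describe the same interaction twice and are glued together with XML Events:

<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events">
  <head>
    <!-- Aural modality: a VoiceXML form, authored separately -->
    <vxml:form id="voice-city">
      <vxml:field name="city">
        <vxml:prompt>Which city?</vxml:prompt>
        <vxml:grammar src="city.grxml"/>
        <vxml:filled>
          <!-- copy the recognized value into the visual field -->
          <vxml:assign name="document.getElementById('city').value"
                       expr="city"/>
        </vxml:filled>
      </vxml:field>
    </vxml:form>
  </head>
  <body>
    <!-- Visual modality: the XHTML form; focusing the field is the
         DOM event that activates the VoiceXML dialog -->
    <form action="http://example.org/submit">
      <p>
        City:
        <input type="text" id="city" name="city"
               ev:event="focus" ev:handler="#voice-city"/>
      </p>
    </form>
  </body>
</html>

SALT follows an analogous embed-and-bind pattern: a prompt and a listen element carry the speech dialog and grammar, and a bind element copies the recognition result into the HTML form field it names by id. In both cases the aural interaction is a second, hand-written description of a form that already exists in the visual markup, which is exactly the ease-of-authoring cost that XFormsMM aims to avoid.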
4. RESEARCH PROBLEM AND SCOPE
The research problem was to study efficient authoring of multimodal interfaces for the Web. The scope of this work is Web applications in general, with a special focus on the aural and visual modalities.