
Voice-commanded Scripting Language for Programming Navigation Strategies On-the-fly

Michael Nichols, Qian Wang, Gopal Gupta

Department of Computer Science, University of Texas at Dallas, Richardson, TX 75080, USA

Abstract

We present a voice-based scripting language called ALVIN (Aural Language for VoiceXML Interpretation and Navigation) that allows users to define navigation strategies on-the-fly while browsing VoiceXML pages. The language is intended to be completely voice/audio-based, so as to allow it to be used with voice/audio-only communication devices, such as telephones (land-line or wireless). This paper discusses the various challenges that need to be overcome in designing a voice/audio-based language, and describes how these challenges are overcome in designing ALVIN. In addition, this paper discusses a model implementation of ALVIN that leverages the existing capabilities of VoiceXML to provide a readily deployable solution. The language design is, strictly speaking, a special-purpose language for navigating VoiceXML pages, but the language includes features that would be representative of a proposed class of spoken general-purpose programming languages. A prototype implementation of the language is in progress.

1 Introduction

The phenomenal success of the World-Wide Web (WWW) demonstrates, in a profound way, the ability of computer technology to deliver a cornucopia of diverse services and information with unprecedented speed and breadth of content. As users rely on Web-based services and information sources, the technology is moving in a natural direction toward increased user mobility and ease of access. Traditional visual-interface computing devices place significant constraints on portability and accessibility by requiring the user to possess and see a viewing screen. Voice-based computing has the potential to overcome these constraints by making it unnecessary for a user to view a screen or for the user's hands to be occupied with a keyboard or stylus. Moreover, a voice-based interface makes it possible for a user to communicate with a computing device via (mobile) telephone, thus obviating the need for the user to actually have the computing device in his or her possession. Recent technology standards, such as VoiceXML, are allowing voice-based applications to be developed on a large scale. VoiceXML is a standard mark-up language created by the VoiceXML forum and the W3C consortium to make Internet content and information accessible via voice and audio. VoiceXML documents can be aurally browsed on a desktop computer with microphone and speakers or over the phone. Just as a web browser renders HTML documents visually, a VoiceXML interpreter (or voice browser) renders VoiceXML documents aurally.

Voice-based computing, however, is not without its limitations. For one thing, current speech-recognition technology is imperfect, although rapid advances in speech recognition and signal processing are steadily reducing this limitation. A greater challenge is the serial nature of spoken communication. While a user is free to visually scan over a web page displayed on a screen, a user must listen to the text of a voice-based page as it is spoken in sequence, which can be very inefficient, especially if the page was originally intended to be “read” visually. Additionally, while VoiceXML has facilitated aural browsing of documents, its current design leaves the control of navigation within the document in the hands of the VoiceXML document writer, rather than in the hands of the user who is aurally browsing the document. Thus, the writer of the document is responsible for specifying all possible ways in which a listener might browse the document (the writer typically accomplishes this by asking the listener what he/she wishes to browse next, after each VoiceXML dialog is finished being recited). If the writer fails to anticipate an interaction scenario and leaves it unspecified in the document, the listener is deprived of that particular interaction. Hence, the onus of foreseeing all possible interaction scenarios falls on the author of the document.

One way of getting around this problem is to associate a navigation strategy with a particular page so that only certain features of the page are recited to the user and that those features are recited in a particular order selected by the user. We propose the use of a scripting language to allow a user to dynamically create such navigation strategies on-the-fly while aurally browsing VoiceXML pages. Such a language should itself be spoken by the user, so as to allow for completely voice-based operation. This paper provides a design for such a language (called ALVIN) and discusses the various challenges overcome by the design. ALVIN not only allows a listener to program navigation strategies on-the-fly during aural browsing, it also facilitates navigation based on keyword search of the page contents.

The idea of using a scripting language for dynamically programming navigation strategies rests on the notion of voice anchors [1]. Voice anchors allow listeners to attach speech labels to dialogs of a VoiceXML document. When such a label is uttered by the user later (during the same browsing session), browsing returns to the dialog to which the speech label is attached. Thus, voice anchors can be used by a user to jump around the various parts of the document via simple voice utterances and voice commands. One can be more ambitious and design a navigation language that specifies in advance the order in which various dialogs labelled with voice anchors should be visited and heard. Thus, complex navigation strategies can be orally programmed. This is precisely the motivation behind the design of ALVIN.

This paper also discusses a model implementation of the ALVIN language using voice anchors that leverages the existing capabilities of VoiceXML to provide a readily deployable solution. The language design is, strictly speaking, a special-purpose programming language for navigating VoiceXML pages, but the language includes features that would be representative of a proposed class of spoken general-purpose programming languages.

The research described in this paper makes the following contributions: (i) it presents a viable way of navigating complex aural documents in a meaningful and structured way by using spoken scripting languages; (ii) it presents a concrete spoken language, and an outline of its implementation, for aural documents written in VoiceXML. To the best of our knowledge ours is the first effort in this area, since most spoken scripting languages are limited to making selections from a menu of choices that is read out at various points in the aural document. In contrast, our spoken scripting language allows complex navigation strategies to be orally programmed.

2 Language Design

Designing a computer language that is intended only to be spoken poses a number of unusual problems that must be addressed in order to arrive at a workable language. Many of the assumptions about the programmer that are implicitly made in the design of written computer languages do not apply. Moreover, many arguably desirable aspects of written computer languages are either less important or actually a hindrance to the design of a spoken computer language. These problems, and our proposed solutions to these problems, as exhibited by our model spoken scripting language, are addressed in the following sections. A partial grammar of a scripting language incorporating our ideas, which we call ALVIN (Aural Language for VoiceXML Interpretation and Navigation), is depicted in Figure 1. Example sessions illustrating the use of the language are provided in Figures 2 and 3.

COMPUTER (reading VoiceXML page): Consequently, the reason that the . . .
USER: Stop
COMPUTER: Okay
USER: Set an anchor called foxtrot oscar oscar at the next sentence
COMPUTER: anchor foo set
USER:
COMPUTER: Set anchor named foo at next sentence

Figure 2: Example session

USER: Define bravo alpha romeo as Ask for a number "Give me a number"
(USER pauses)
COMPUTER: Go on
USER: Store it as x-ray. Store one in foxtrot. Repeat the following until x-ray equals one. Take foxtrot. Multiply by x-ray. Store the result in foxtrot. Take x-ray. Subtract one. Store that in x-ray. End.
COMPUTER: Okay
USER: Current block?
COMPUTER: bar
USER: The result is foxtrot.
COMPUTER: Okay
USER: End.
COMPUTER: Bar defined.

Figure 3: A more elaborate session

2.1 Overall Design Philosophy

Our language design is based on three fundamental principles or goals that we feel can significantly enhance the usability of a spoken scripting language. The first is that the ideal spoken scripting language should be as close to natural language as possible. Since we anticipate that spoken scripting languages will be used by lay end-users as much as by trained programmers, this is an important design goal. The second principle is related to the first, but perhaps more controversial: the ideal spoken scripting language should give its users more ways to write correct code than to generate errors. We expect some variation in users' code from user to user and from time to time, particularly in the case of spoken code, which is likely to be more “spontaneous” than written code. We think that it is important that the scripting language or its interpreter/compiler is as forgiving as possible. The third principle is related to the second: all reasonable efforts should be made to prevent the user/programmer from having to revisit and/or edit existing code. The rationale behind this is twofold. First, the kind of “full text” editing most programmers are used to can only be partially emulated over a voice interface and not very efficiently, at that.

Second, revisiting existing code over audio is undesirable because it can require the programmer to commit more information to his or her memory.

2.2 Punctuation/Delimiters and General Syntax

Written programming languages rely heavily on delimiters to separate statements and blocks of code, such as semicolons (as in C, Pascal, and other similar Algol-family languages), line breaks (as in Fortran), parentheses (as in Lisp), periods and commas (as in Prolog), and white space (as in Python). It is very awkward to attempt to reproduce these types of punctuation in speech, however.1 Our language design, therefore, uses a consistent command syntax (or “sentence structure”) to allow individual statements to be distinguished on the basis of their content, rather than on additional punctuation.

An ALVIN program consists of a series of commands. A command may cause browsing to shift to another VoiceXML form, with control never returning to the current form (a go-to-like effect), or it may cause sidetracking, i.e., the control returns to the current form after the target form has been browsed (a procedure-call-like effect). Commands can also be given to repeatedly browse one or more VoiceXML forms, conditionally or unconditionally (see Figure 1). Since each command in our language is structured in the form of an imperative sentence, each command begins with a verb. As in natural language, each verb may be transitive or intransitive. If a verb is transitive, its direct object (usually either some kind of identifier or a value) immediately follows the verb. Other information needed for performing the command can be supplied by a set of zero or more modifiers, which are generally represented in the form of a prepositional phrase (e.g., “at anchor foo,” “as number,” etc.). The modifiers in a given command may be supplied in any order. We feel that this grammatical structure is natural to the user and highly expressive, yet regular enough to be readily parsed by a computer. A brief inspection of the grammar in Figure 1 reveals that this grammatical structure is a fair approximation to the way commands are typically spoken in English.

Although it was not originally apparent to us at the time, our grammar is actually very structurally similar to that of the Japanese language, where the verb has a fixed position in the sentence and the other sentence elements may be more flexibly arranged, because they are accompanied by “particles,” which denote the roles the various words play in the sentence. The most visible difference between our basic grammatical design and that of the Japanese language is the fact that Japanese prefers postfix expressions over our more English-like prefix expressions. The verb in a Japanese sentence goes at the end, and particles follow the words they are associated with.2

Our language employs a number of strategies to allow a user some degree of license to use varying forms of the same commands. For example, the commands “skip,” “jump,” and “go” are synonyms and can be used interchangeably to move from location to location within a VoiceXML document. Also, any of a number of “filler words,” including the articles “the,” “a,” and “an,” may be inserted anywhere within a command and will be ignored. This allows a user to speak more naturally to the computer, with no loss of meaning.
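As an illustration only, the following Prolog DCG fragment sketches this command structure; it is not the grammar of Figure 1, and the token forms, command names, and slot names it uses are our own simplifying assumptions.

```prolog
:- use_module(library(lists)).  % member/2

% Hypothetical sketch of the command structure described above: a command is a verb,
% an optional direct object, and modifiers in any order; articles are filler words.
command(cmd(Verb, Object, Mods)) -->
    fillers, verb(Verb), object(Object), modifiers(Mods).

% "skip", "jump" and "go" are interchangeable synonyms for the same action.
verb(goto) --> [skip].
verb(goto) --> [jump].
verb(goto) --> [go].
verb(set)  --> [set].

object(anchor(Name)) --> fillers, [anchor], opt_called, fillers, label(Name).
object(none)         --> [].

opt_called --> fillers, [called].
opt_called --> [].

% Modifiers are prepositional phrases and may be spoken in any order.
modifiers([M|Ms]) --> modifier(M), modifiers(Ms).
modifiers([])     --> [].

modifier(at(Loc))  --> fillers, [at], location(Loc).
modifier(as(Type)) --> fillers, [as], [Type].

location(next_sentence) --> fillers, [next], fillers, [sentence].

label(Name) --> [Name].

% Articles and similar filler words may appear anywhere and are ignored.
fillers --> [W], { member(W, [the, a, an]) }, fillers.
fillers --> [].
```

Under these assumptions, the query `phrase(command(C), [set, an, anchor, called, foo, at, the, next, sentence])` yields `C = cmd(set, anchor(foo), [at(next_sentence)])`, with the filler words "an" and "the" silently discarded.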

1 Unless you are Victor Borge [2], that is.
2 For example, in the Japanese sentence, “Watashi-wa Tanaka-san-no bengoshi desu,” the particle “wa” placed after Watashi (I) means that Watashi (I) is the subject of the sentence. The “no” following Tanaka-san is a possessive particle (analogous to an apostrophe-s in English). The final two words, bengoshi (attorney) and desu (to be), are the direct object and verb, respectively. Thus, the complete sentence translates as “I am Mr. Tanaka's attorney.”

As shown in Figure 1, the two I/O commands, “ask” and “say,” have grammar rules that end in periods (full stops). This denotes that the two commands record audio clips at compile-time for later playback at run-time. The programmer signifies completion of the recording by pressing a key on his/her touch-tone phone or by waiting for a time-out period to expire.

2.3 Tolerated Ambiguity and Interactive Interpretation/Compilation

One of the major differences between natural languages and typical computer languages is that the latter are generally designed to be inherently unambiguous. While inherently unambiguous computer languages are certainly far from being an obsolete concept, it should be recognized that inherently unambiguous languages are a product of the old batch-processing model of computation and are not strictly required for interactive computing. In a batch process, the computer's instructions must be entirely specified in advance. Thus, no ambiguity in program code can be tolerated, since unpredictable (or, at least, undesirable) results could follow. With interactive computing, however, the computer can simply ask the user for additional information needed to solve the problem at hand. Thus, our language model assumes that an interactive interpreter or compiler will be used to interpret or compile the language. Hence, if an ambiguity arises (either because of difficulty in speech recognition or because of an ambiguous grammar), the interactive interpreter/compiler will immediately notify the user/programmer and request information from the user to permit the interpreter/compiler to resolve the ambiguity. This approach to interaction, where both sides take turns initiating actions, is known as “mixed-initiative interaction.” Ambiguity detection may be elegantly implemented in a logic language, such as Prolog [3], which supports non-deterministic parsing through the facility of definite clause grammars (DCGs). If the DCG finds two or more plausible interpretations for a given command, the interpreter/compiler asks the user for some distinguishing piece of information to allow it to resolve the ambiguity.
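A minimal sketch of this mixed-initiative loop, assuming the hypothetical command//1 grammar sketched in Section 2.2 and simple text placeholders for the speech I/O primitives, might look as follows; it is illustrative only and is not the actual ALVIN interpreter.

```prolog
% Collect every parse of the spoken command; if more than one interpretation
% survives, ask the user to disambiguate instead of guessing.
interpret(Tokens, Command) :-
    findall(C, phrase(command(C), Tokens), Parses),
    sort(Parses, Unique),                       % drop duplicate interpretations
    (  Unique = [Command] -> true
    ;  Unique = []        -> say('I did not understand that command.'), fail
    ;  disambiguate(Unique, Command)
    ).

disambiguate(Choices, Chosen) :-
    say('I heard more than one possible command. Which did you mean?'),
    read_choice(Choices, Chosen).

% Placeholders for the speech I/O primitives, which in the model implementation
% would be realised through VoiceXML prompts and grammars.
say(Text) :- format("~w~n", [Text]).
read_choice(Choices, Chosen) :- member(Chosen, Choices).
```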

2.4 Modifiers and Slots

Modifiers are used to assign values to “slots,” which are, essentially, attributes of a given command. Certain recurring types of slots are “locative,” “nominative,” and “type” slots. Locative slots refer to a location, in this case, to a location within the VoiceXML file being browsed (e.g., next to last sentence). Nominative slots refer to a name given to the direct object of the command (e.g., in the command Store 5 as FOO, the nominative slot is filled with the value “FOO,” representing a variable name or label). Some slots are required to be filled in order for the command to make sense; those slots are labeled with an overbar in Figure 1. If the programmer/user completes the command without filling in all of the required slots (by stating the appropriate modifiers), the interpreter/compiler interrupts the programmer/user and asks a question to resolve the issue. For example, the interpreter or compiler might ask something like, “What name was 5 supposed to be stored as?” At that time, the programmer/user could simply supply the requested answer, “foo,” then resume programming.
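The following fragment is a hypothetical sketch of such slot checking, assuming the cmd(Verb, Object, Modifiers) representation used in the earlier sketch; the required-slot table, slot names, and prompts are illustrative assumptions rather than the actual Figure 1 specification.

```prolog
% Each command type declares which slots are required (the overbarred slots of
% Figure 1); if a parsed command is missing one, the interpreter asks a follow-up
% question before accepting the command.
required_slots(store, [nominative]).            % "Store 5 as FOO" needs a name
required_slots(set_anchor, [nominative, locative]).

slot_of(as(Name), nominative, Name).
slot_of(at(Loc),  locative,   Loc).

check_slots(cmd(Verb, _Obj, Mods), Filled) :-
    required_slots(Verb, Required),
    fill_all(Required, Mods, Filled).

fill_all([], _Mods, []).
fill_all([Slot|Rest], Mods, [Slot-Value|Filled]) :-
    (  member(M, Mods), slot_of(M, Slot, Value)
    -> true
    ;  ask_for_slot(Slot, Value)                % interrupt the user and ask
    ),
    fill_all(Rest, Mods, Filled).

ask_for_slot(Slot, Value) :-
    format("Which ~w should I use?~n", [Slot]),
    read(Value).                                % stands in for a spoken reply
```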

2.5 Serial Programming

Arguably, the primary obstacle faced by the designer of a spoken computer language is the strictly serial nature of voice-based I/O. Because only one language element or word may be spoken at any one time, there is no way for a programmer to refer back to earlier parts of the program, even recent code, while programming. The impact this has on the programmer cannot be overstated. There are many language constructs in written computer languages that implicitly assume that the programmer will have visual access to the source code during the programming process. For example, any form of nested or hierarchical language structure makes this implicit assumption. Parentheses in the LISP programming language are a supreme example of this. Without ready visual confirmation of previously-entered lines of code, it is difficult (and in some cases, practically impossible) to determine where one part of a routine ends and another begins or resumes.

The problem with the traditional approach taken in written computer languages is that it puts too many demands on the programmer's memory [4]. Programmers are generally not consciously aware of the degree to which they rely on the information that a screen display provides. Because it provides constant, immediately perceptible feedback about the context in which a given line of code is being written, the screen display frees the programmer from having to keep careful track of the precise point in the program logic at which the programmer is writing the code. Visual cues, such as brackets, delimiters, and whitespace/indenting, all aid the programmer in determining this and make it possible for a programmer to interrupt a central sequence of instructions (and train of programmer thought) to address some auxiliary sequence of instructions before returning to the main sequence. For example, the C-like loop

do { instruction 1; ... instruction n; } while (condition);

involves two such sequences. The primary sequence is that in which the do-while loop instruction resides. The secondary sequence is the loop body itself. Now, to program this loop in a purely serial fashion using C would require that the programmer first begin the do-loop instruction, suspend the programmer's train of thought about the loop condition temporarily in order to program the loop body, then return to the loop's condition for iteration. As a program becomes more complicated, with many such structures presented in a nested form, it is not difficult to see the confusion that can arise when a programmer is confronted with the problem of determining where his or her original train of thought left off.

Moreover, the traditionally favoured top-down style of design is difficult to achieve using traditional structured written programming languages, such as C. When employing top-down design techniques, a programmer may write a conditional statement (an “if”) then leave a placeholder for the code conditionally executed in response to the conditional statement. At a later time, the programmer will come back to the placeholder and replace the placeholder with the conditionally executed code. This scheme, of course, anticipates that the programmer will be able to return to the placeholder in a random-access fashion, delete the now-unneeded placeholder, and insert new code at that location. Thus, traditional top-down design using structured programming languages naturally assumes that the programmer will be able to view and edit his or her code in a more-or-less random-access fashion. In the rigidly sequential world of audio-based programming, however, this kind of random-access programming is not practical.

What the above discussion reveals, however, is that traditional structured programming languages are not really “top-down” in the sense that the language expresses programming concepts in a top-down fashion. What traditional structured languages express is not “top-down” code but, rather, a depth-first search of a top-down design. True top-down programming would be more akin to a breadth-first search. In a breadth-first search of a graph, one traverses all of the nodes at a particular level before proceeding to the next level. In a breadth-first search, there is no need to return to a higher level before continuing the search, since all of the higher-level nodes have already been visited. Thus, a programmer writing code in a breadth-first manner need not revisit previous levels, either, and has less contextual information to remember.

Our language design allows for breadth-first or top-down writing of code by encouraging the use of forward references. For example, in our spoken language, one might write the previous do-while loop using a forward reference in this form:

do FOO while condition
where FOO is instruction1 instruction2 ... end.

This encourages top-down design through the ability to “write” code in top-down order without having to revisit previously defined pieces of code. Because users are allowed to create their own labels, such as “FOO,” for forward references to code blocks, some mechanism for allowing these labels to be defined must be provided for in the language. Where those labels may be defined in our grammar (Figure 1), we have enclosed the labels in boxes. This means that the user may either define the label name at that point or simply recite the label name itself (if the label has already been defined). To define a new label, the user says “new label,” spells the name letter-by-letter (preferably using the International Phonetic Alphabet, or IPA), then completes the new label by saying “end label” (alternatively, the user could simply begin spelling the label using the IPA, e.g., “foxtrot, oscar, oscar, bravo, alpha, romeo” for “foobar”). After having spelled the new label, the user should be able to simply recite the word itself, without having to spell the word [1].3
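One possible, purely illustrative, representation of such forward references is sketched below: blocks are stored under their labels as they are defined, and a reference to a still-undefined label is simply remembered as pending. The term forms (call/1, pending/1) are our own assumptions, not part of ALVIN itself.

```prolog
:- use_module(library(lists)).        % append/3
:- dynamic block/2.                   % block(Name, ListOfCommands)

% Record (or re-record) the body of a named code block.
define_block(Name, Commands) :-
    retractall(block(Name, _)),
    assertz(block(Name, Commands)).

% expand(+Program, -Flat): resolve forward references, replacing each label with
% its recorded body, or with a pending/1 marker if the block is not yet defined.
expand([], []).
expand([call(Name)|Rest], Flat) :-
    ( block(Name, Body) -> true ; Body = [pending(Name)] ),
    expand(Rest, FlatRest),
    append(Body, FlatRest, Flat).
expand([Cmd|Rest], [Cmd|FlatRest]) :-
    Cmd \= call(_),
    expand(Rest, FlatRest).
```

Under this representation, the loop body FOO in the example above could be referenced as call(foo) before define_block(foo, [...]) has ever been spoken.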

2.6 Instructions with Compile-time Semantics

Three commands, “list,” “current label,” and “pending labels,” have “compile-time semantics.” Compile-time semantics is a concept borrowed from the Forth language, in which certain commands, when entered in the course of writing compiled code, are not compiled into the program, but execute immediately and return a result. These three instructions are used to perform tasks that aid the programmer and cannot be compiled into a program. The list command is used to have the interpreter/compiler recite the source code of a particular named block of code. The current label command recites the name of the block currently being written by the programmer. The pending labels command recites the names of the code blocks that have been used in a forward reference, but which have not yet been defined. This helps facilitate top-down programming by reminding the programmer of the code blocks that need to be defined.
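Continuing the illustrative block/2 representation from the previous sketch (again, an assumption rather than the actual implementation), the pending labels and list commands could be realised along these lines:

```prolog
% A label is "referenced" if it appears as a forward reference inside any stored block.
referenced(Name) :-
    block(_, Commands),
    member(call(Name), Commands).

% Pending labels: referenced somewhere but never themselves defined.
pending_labels(Pending) :-
    findall(N, (referenced(N), \+ block(N, _)), Ns),
    sort(Ns, Pending).

% "list": recite a stored block back to the programmer.
list_block(Name) :-
    (  block(Name, Commands)
    -> format("Block ~w: ~w~n", [Name, Commands])
    ;  format("Block ~w has not been defined yet.~n", [Name])
    ).
```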

2.7 Arithmetic Instructions Based on a “Current Value”

Arithmetic and memory operations are performed in our language with reference to a “current value,” which might be thought of as a kind of accumulator register, as was common in early computer hardware designs. A “take” instruction is used to load an immediate or stored value as the current value. Arithmetic commands each perform an operation on the current value and some other immediate or stored value and store the result as the new current value. This strategy,

3 The initial spelling of the label is necessary, as standard VoiceXML does not provide a convenient way to recognize words that it does not expect to hear (i.e., words that are not part of the grammar specified by the writer of the VoiceXML page). Once a word is spelled, however, the word may be incorporated into a VoiceXML grammar to allow the word to be recognized by the VoiceXML browser. This is one of the limitations of current state-of-the-art voice recognition software.

although admittedly primitive when applied to written arithmetic expressions, is more appropriate for spoken code, because it eliminates the complications associated with reciting algebraic expressions verbally. The current value is not limited to only numerical values, but may also assume values of other types, such as audio clips. In this way, our current value concept is not unlike the “$_” variable in Perl, which is used to perform various operations (generally for string pattern matching purposes) without specifying an explicit variable name.
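A small sketch of this accumulator-style evaluation, using hypothetical term forms for the spoken commands of Figure 3, is shown below; it is illustrative only and is not the ALVIN runtime.

```prolog
:- use_module(library(lists)).        % memberchk/2

% run(+Commands, -FinalEnv): interpret a command list against a single "current
% value" accumulator and an environment of Name-Value pairs.
run(Commands, FinalEnv) :- run(Commands, none, [], _, FinalEnv).

run([], Acc, Env, Acc, Env).
run([C|Cs], Acc, Env, AccOut, EnvOut) :-
    step(C, Acc, Env, Acc1, Env1),
    run(Cs, Acc1, Env1, AccOut, EnvOut).

step(take(X),       _,   Env, V,   Env) :- value(X, Env, V).     % load current value
step(multiply(X),   Acc, Env, V,   Env) :- value(X, Env, N), V is Acc * N.
step(subtract(X),   Acc, Env, V,   Env) :- value(X, Env, N), V is Acc - N.
step(store_in(Var), Acc, Env, Acc, [Var-Acc|Env]).               % newest binding shadows older ones

value(N,   _,   N) :- number(N).
value(Var, Env, V) :- atom(Var), memberchk(Var-V, Env).

% Example, in the spirit of "Take foxtrot. Multiply by x-ray. Store the result in foxtrot.":
% ?- run([take(5), store_in(foo), take(foo), multiply(3), store_in(foo)], Env).
% Env = [foo-15, foo-5].
```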

3 Model Implementation

In our model implementation of the ALVIN language, we have chosen to use a Prolog-based CGI (Common Gateway Interface) program to serve VoiceXML pages augmented with additional tags and grammar information to allow a user to speak commands in ALVIN through a VoiceXML browser. The grammar included in the augmented VoiceXML pages is sufficiently detailed to enable the VoiceXML browser to recognize and tokenize the user's speech. Once a complete ALVIN command is tokenized by the VoiceXML browser, the tokenized command is submitted to the Prolog-based CGI program, which may reside behind a standard HTTP (HyperText Transfer Protocol) server. The CGI program interprets the command and returns a result to the VoiceXML browser in the form of an ALVIN-augmented VoiceXML page. As an additional optimization, the ALVIN-augmented VoiceXML page may include JavaScript or other client-side code that can offload some of the work of the CGI program (for simple commands) to the VoiceXML browser. The system may be combined with an HTML-to-VoiceXML transcoder for sophisticated voice/audio-based web browsing as in [5].
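The sketch below illustrates the server side of this arrangement under stated assumptions: it stands in for the CGI program by using SWI-Prolog's HTTP server libraries instead of CGI proper, assumes the browser submits the tokenized command as a single "tokens" parameter, and replies with a minimal VoiceXML page. The handler path, parameter name, and reply page are all hypothetical, and interpret_tokens/2 is a stand-in for the real ALVIN interpreter.

```prolog
:- use_module(library(http/thread_httpd)).
:- use_module(library(http/http_dispatch)).
:- use_module(library(http/http_parameters)).

:- http_handler(root(alvin), handle_command, []).

server(Port) :- http_server(http_dispatch, [port(Port)]).

handle_command(Request) :-
    http_parameters(Request, [tokens(TokenAtom, [default('')])]),
    atomic_list_concat(Tokens, ' ', TokenAtom),   % split "set anchor foo" into words
    (  interpret_tokens(Tokens, Reply)
    -> true
    ;  Reply = 'Sorry, I did not catch that.'
    ),
    reply_vxml(Reply).

% Stand-in for the real ALVIN interpreter.
interpret_tokens(Tokens, Reply) :-
    format(atom(Reply), 'You said: ~w', [Tokens]).

% Return a minimal ALVIN-augmented VoiceXML page that speaks the reply.
reply_vxml(Prompt) :-
    format('Content-type: application/voicexml+xml~n~n'),
    format('<vxml version="2.0"><form><block><prompt>~w</prompt></block></form></vxml>~n',
           [Prompt]).
```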

4 Related Work

Voice-based programming is a relatively new field, largely due to the fact that voice-recognition technology with sufficient accuracy to permit voice-based programming has only become available in recent years. Arnold et al. of Drexel University and Georgia Tech proposed a system for programming by voice to allow computer users to avoid using a keyboard (in the event of repetitive stress injuries and the like) [6]. Unlike our research, which addresses programming in purely audio environments by defining a new spoken computer language, their voice programming system is a syntax-directed editor intended to be used in front of a computer screen for programming in existing written computer languages, such as C and Java. Alvin Surkan of the University of Nebraska-Lincoln has proposed a voice-directed interface in the APL programming language to permit an agent-based program synthesis system to be controlled via voice-based controls [7]. Ramakrishnan et al. of Virginia Tech have published an interesting paper relating mixed-initiative interaction (including voice-based interaction) to mixed computation (i.e., partial evaluation) [8]. Finally, note that there are purely speech-based commercial systems available as well (such as IBM's Via Voice) that take speech commands. These speech commands are very simple; also, these systems do not allow users to program navigation strategies on-the-fly.

5 Conclusions

In this paper we presented a framework for designing a voice-based scripting language that can be used to program navigation strategies on-the-fly for browsing VoiceXML pages. We used this design framework to develop the ALVIN language, which is representative of a larger potential class or family of voice-based programming languages. We discussed some of the challenges associated with designing a voice-based scripting language and proposed solutions for those challenges, which were incorporated into the design of our experimental language. We also outlined techniques for implementing ALVIN. A prototype implementation of ALVIN is in progress. We anticipate that, as mobile telecommunications devices and portable computing devices become more prevalent, the need for remote and/or hands-free programmability of computing devices will increase. Spoken programming languages like ALVIN will enable users to “write” scripts to program such devices without requiring a keyboard or display. We plan to build on the basic concepts of ALVIN to develop more general-purpose languages for interactive, voice-based programming to fulfill this need.

References

[1] Reddy, H., Annamalai, N., Gupta, G.: Listener-controlled dynamic navigation of VoiceXML documents. In: International Conference on Computers Helping People with Special Needs (ICCHP), Lecture Notes in Computer Science, Springer Verlag (2004) 337-354
[2] Knudsen, W.: Victor Borge sound clips. http://www.kor.dk/borge/b-mus-1.htm (1997)
[3] Sterling, L., Shapiro, E.: The Art of Prolog. MIT Press (1994)
[4] Fry, C.: Programming on an already full brain. Commun. ACM 40 (1997) 55-64
[5] Gupta, G., Raman, S.S., Nichols, M.M., Reddy, H., Annamalai, N.: DAWN: Dynamic aural web navigation. In: HCI 2005 (this proceedings) (2005)
[6] Arnold, S.C., Mark, L., Goldthwaite, J.: Programming by voice, Vocal Programming. In: ASSETS '00: Proceedings of the Fourth International ACM Conference on Assistive Technologies, ACM Press (2000) 149-155
[7] Surkan, A.J.: Spoken-word direction of synthesis. In: APL '00: Proceedings of the International Conference on APL (APL-Berlin-2000), ACM Press (2000) 221-227
[8] Ramakrishnan, N., Capra, R.G., Pérez-Quiñones, M.A.: Mixed-initiative interaction = mixed computation. In: PEPM '02: Proceedings of the 2002 ACM SIGPLAN Workshop on Partial Evaluation and Semantics-Based Program Manipulation, ACM Press (2002) 119-130