
Voice-commanded Scripting Language for Programming Navigation Strategies On-the-fly

Michael Nichols, Qian Wang, Gopal Gupta

Department of Computer Science, University of Texas at Dallas, Richardson, TX 75080, USA

Abstract

We present a voice-based scripting language called ALVIN (Aural Language for VoiceXML Interpretation and Navigation) that allows users to define navigation strategies on-the-fly while browsing VoiceXML pages. The language is intended to be completely voice/audio-based, so as to allow it to be used with voice/audio-only communication devices, such as telephones (land-line or wireless). This paper discusses the various challenges that need to be overcome in designing a voice/audio-based language, and describes how these challenges are overcome in designing ALVIN. In addition, this paper discusses a model implementation of ALVIN that leverages the existing capabilities of VoiceXML to provide a readily deployable solution. The language design is, strictly speaking, a special-purpose language for navigating VoiceXML pages, but the language includes features that would be representative of a proposed class of spoken general-purpose programming languages. A prototype implementation of the language is in progress.

1 Introduction

The phenomenal success of the World-Wide Web (WWW) demonstrates, in a profound way, the ability of computer technology to deliver a cornucopia of diverse services and information with unprecedented speed and breadth of content. As users rely on Web-based services and information sources, the technology is moving in a natural direction toward increased user mobility and ease of access. Traditional visual-interface computing devices place significant constraints on portability and accessibility by requiring the user to possess and see a viewing screen. Voice-based computing has the potential to overcome these constraints by making it unnecessary for a user to view a screen or for the user's hands to be occupied with a keyboard or stylus. Moreover, a voice-based interface makes it possible for a user to communicate with a computing device via (mobile) telephone, thus obviating the need for the user to actually have the computing device in his or her possession. Recent technology standards, such as VoiceXML, are allowing voice-based applications to be developed on a large scale. VoiceXML is a standard mark-up language created by the VoiceXML forum and the W3C consortium to make Internet content and information accessible via voice and audio. VoiceXML documents can be aurally browsed on a desktop computer with microphone and speakers or over the phone. Just as a web browser renders HTML documents visually, a VoiceXML interpreter (or voice browser) renders VoiceXML documents aurally.

Voice-based computing, however, is not without its limitations. For one thing, current speech-recognition technology is imperfect, although rapid advances in speech recognition and signal processing are steadily reducing this limitation. A greater challenge is the serial nature of spoken communication. While a user is free to visually scan over a web page displayed on a screen, a user must listen to the text of a voice-based page as it is spoken in sequence, which can be very inefficient, especially if the page was originally intended to be “read” visually. Additionally, while VoiceXML has facilitated aural browsing of documents, its current design leaves the control of navigation within the document in the hands of the VoiceXML document writer, rather than in the hands of the user who is aurally browsing the document. Thus, the writer of the document is responsible for specifying all possible ways in which a listener might browse the document (the writer typically accomplishes this by asking the listener what he/she wishes to browse next, after each VoiceXML dialog is finished being recited). If the writer fails to anticipate an interaction scenario and leaves it unspecified in the document, the listener is deprived of that particular interaction. Hence, the onus of foreseeing all possible interaction scenarios falls on the author of the document.

One way of getting around this problem is to associate a navigation strategy with a particular page so that only certain features of the page are recited to the user and that those features are recited in a particular order selected by the user. We propose the use of a scripting language to allow a user to dynamically create such navigation strategies on-the-fly while aurally browsing VoiceXML pages. Such a language should itself be spoken by the user, so as to allow for completely voice-based operation. This paper provides a design for such a language (called ALVIN) and discusses the various challenges overcome by the design. ALVIN not only allows a listener to program navigation strategies on-the-fly during aural browsing, it also facilitates navigation based on keyword search of the page contents.

The idea of using a scripting language for dynamically programming navigation strategies rests on the notion of voice anchors [1]. Voice anchors allow listeners to attach speech labels to dialogs of a VoiceXML document. When such a label is uttered by the user later (during the same browsing session), browsing returns to the dialog to which the speech label is attached. Thus, voice anchors can be used by a user to jump around the various parts of the document via simple voice utterances and voice commands. One can be more ambitious and design a navigation language that specifies in advance the order in which various dialogs labelled with voice anchors should be visited and heard. Thus, complex navigation strategies can be orally programmed. This is precisely the motivation behind the design of ALVIN.

This paper also discusses a model implementation of the ALVIN language using voice anchors that leverages the existing capabilities of VoiceXML to provide a readily deployable solution. The language design is, strictly speaking, a special-purpose programming language for navigating VoiceXML pages, but the language includes features that would be representative of a proposed class of spoken general-purpose programming languages.

The research described in this paper makes the following contributions: (i) it presents a viable way of navigating complex aural documents in a meaningful and structured way by using spoken scripting languages; (ii) it presents a concrete spoken language, and an outline of its implementation, for aural documents written in VoiceXML. To the best of our knowledge ours is the first effort in this area, since most spoken scripting languages are limited to making selections from a menu of choices that is read out at various points in the aural document. In contrast, our spoken scripting language allows complex navigation strategies to be orally programmed.

2 Language Design

Designing a computer language that is intended only to be spoken poses a number of unusual problems that must be addressed in order to arrive at a workable language. Many of the assumptions about the programmer that are implicitly made in the design of written computer languages do not apply. Moreover, many arguably desirable aspects of written computer languages are either less important or actually a hindrance to the design of a spoken computer language. These problems, and our proposed solutions to these problems, as exhibited by our model spoken scripting language, are addressed in the following sections. A partial grammar of a scripting language incorporating our ideas, which we call ALVIN (Aural Language for VoiceXML Interpretation and Navigation), is depicted in Figure 1. Example sessions illustrating the use of the language are provided in Figures 2 and 3.

COMPUTER (reading VoiceXML page): Consequently, the reason that the . . .
USER: Stop
COMPUTER: Okay
USER: Set an anchor called foxtrot oscar oscar at the next sentence
COMPUTER: anchor foo set
USER:
COMPUTER: Set anchor named foo at next sentence

Figure 2: Example session

USER: Define bravo alpha romeo as Ask for a number "Give me a number"
(USER pauses)
COMPUTER: Go on
USER: Store it as x-ray. Store one in foxtrot. Repeat the following until x-ray equals one. Take foxtrot. Multiply by x-ray. Store the result in foxtrot. Take x-ray. Subtract one. Store that in x-ray. End.
COMPUTER: Okay
USER: Current block?
COMPUTER: bar
USER: The result is foxtrot.
COMPUTER: Okay
USER: End.
COMPUTER: Bar defined.

Figure 3: A more elaborate session

2.1 Overall Design Philosophy

Our language design is based on three fundamental principles or goals that we feel can significantly enhance the usability of a spoken scripting language. The first is that the ideal spoken scripting language should be as close to natural language as possible. Since we anticipate that spoken scripting languages will be used by lay end-users as much as by trained programmers, this is an important design goal. The second principle is related to the first, but perhaps more controversial: the ideal spoken scripting language should give its users more ways to write correct code than to generate errors. We expect some variation in users' code from user to user and from time to time, particularly in the case of spoken code, which is likely to be more “spontaneous” than written code. We think that it is important that the scripting language or its interpreter/compiler is as forgiving as possible. The third principle is related to the second: all reasonable efforts should be made to prevent the user/programmer from having to revisit and/or edit existing code. The rationale behind this is twofold. First, the kind of “full text” editing most programmers are used to can only be partially emulated over a voice interface and not very efficiently, at that.

Second, revisiting existing code over audio is undesirable because it can require the programmer to commit more information to his or her memory.

2.2 Punctuation/Delimiters and General Syntax

Written programming languages rely heavily on delimiters to separate statements and blocks of code, such as semicolons (as in C, Pascal, and other similar Algol-family languages), line breaks (as in Fortran), parentheses (as in Lisp), periods and commas (as in Prolog), and white space (as in Python). It is very awkward to attempt to reproduce these types of punctuation in speech, however.1 Our language design, therefore, uses a consistent command syntax (or “sentence structure”) to allow individual statements to be distinguished on the basis of their content, rather than on additional punctuation.

An ALVIN program consists of a series of commands. A command may cause browsing to shift to another VoiceXML form, with control never returning to the current form (a go-to-like effect), or it may cause sidetracking, i.e., the control returns to the current form after the target form has been browsed (a procedure-call-like effect). Commands can also be given to repeatedly browse one or more VoiceXML forms, conditionally or unconditionally (see Figure 1). Since each command in our language is structured in the form of an imperative sentence, each command begins with a verb. As in natural language, each verb may be transitive or intransitive. If a verb is transitive, its direct object (usually either some kind of identifier or a value) immediately follows the verb. Other information needed for performing the command can be supplied by a set of zero or more modifiers, which are generally represented in the form of a prepositional phrase (e.g., “at anchor foo,” “as number,” etc.). The modifiers in a given command may be supplied in any order. We feel that this grammatical structure is natural to the user and highly expressive, yet regular enough to be readily parsed by a computer. A brief inspection of the grammar in Figure 1 reveals that this grammatical structure is a fair approximation to the way commands are typically spoken in English.

Although it was not originally apparent to us at the time, our grammar is actually very structurally similar to that of the Japanese language, where the verb has a fixed position in the sentence and the other sentence elements may be more flexibly arranged, because they are accompanied by “particles,” which denote the roles the various words play in the sentence. The most visible difference between our basic grammatical design and that of the Japanese language is the fact that Japanese prefers postfix expressions over our more English-like prefix expressions. The verb in a Japanese sentence goes at the end, and particles follow the words they are associated with.2

Our language employs a number of strategies to allow a user some degree of license to use varying forms of the same commands. For example, the commands “skip,” “jump,” and “go” are synonyms and can be used interchangeably to move from location to location within a VoiceXML document. Also, any of a number of “filler words,” including the articles “the,” “a,” and “an,” may be inserted anywhere within a command and will be ignored. This allows a user to speak more naturally to the computer, with no loss of meaning.
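As an illustration only, the following Prolog DCG fragment sketches this command structure; it is not the grammar of Figure 1, and the token forms, command names, and slot names it uses are our own simplifying assumptions.

```prolog
:- use_module(library(lists)).  % member/2

% Hypothetical sketch of the command structure described above: a command is a verb,
% an optional direct object, and modifiers in any order; articles are filler words.
command(cmd(Verb, Object, Mods)) -->
    fillers, verb(Verb), object(Object), modifiers(Mods).

% "skip", "jump" and "go" are interchangeable synonyms for the same action.
verb(goto) --> [skip].
verb(goto) --> [jump].
verb(goto) --> [go].
verb(set)  --> [set].

object(anchor(Name)) --> fillers, [anchor], opt_called, fillers, label(Name).
object(none)         --> [].

opt_called --> fillers, [called].
opt_called --> [].

% Modifiers are prepositional phrases and may be spoken in any order.
modifiers([M|Ms]) --> modifier(M), modifiers(Ms).
modifiers([])     --> [].

modifier(at(Loc))  --> fillers, [at], location(Loc).
modifier(as(Type)) --> fillers, [as], [Type].

location(next_sentence) --> fillers, [next], fillers, [sentence].

label(Name) --> [Name].

% Articles and similar filler words may appear anywhere and are ignored.
fillers --> [W], { member(W, [the, a, an]) }, fillers.
fillers --> [].
```

Under these assumptions, the query `phrase(command(C), [set, an, anchor, called, foo, at, the, next, sentence])` yields `C = cmd(set, anchor(foo), [at(next_sentence)])`, with the filler words "an" and "the" silently discarded.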

1 Unless you are Victor Borge [2], that is.
2 For example, in the Japanese sentence, “Watashi-wa Tanaka-san-no bengoshi desu,” the particle “wa” placed after Watashi (I) means that Watashi (I) is the subject of the sentence. The “no” following Tanaka-san is a possessive particle (analogous to an apostrophe-s in English). The final two words, bengoshi (attorney) and desu (to be), are the direct object and verb, respectively. Thus, the complete sentence translates as “I am Mr. Tanaka's attorney.”

As shown in Figure 1, the two I/O commands, “ask” and “say,” have grammar rules that end in periods (full stops). This denotes that the two commands record audio clips at compile-time for later playback at run-time. The programmer signifies completion of the recording by pressing a key on his/her touch-tone phone or by waiting for a time-out period to expire.

2.3 Tolerated Ambiguity and Interactive Interpretation/Compilation

One of the major differences between natural languages and typical computer languages is that the latter are generally designed to be inherently unambiguous. While inherently unambiguous computer languages are certainly far from being an obsolete concept, it should be recognized that inherently unambiguous languages are a product of the old batch-processing model of computation and are not strictly required for interactive computing. In a batch process, the computer's instructions must be entirely specified in advance. Thus, no ambiguity in program code can be tolerated, since unpredictable (or, at least, undesirable) results could follow. With interactive computing, however, the computer can simply ask the user for additional information needed to solve the problem at hand. Thus, our language model assumes that an interactive interpreter or compiler will be used to interpret or compile the language. Hence, if an ambiguity arises (either because of difficulty in speech recognition or because of an ambiguous grammar), the interactive interpreter/compiler will immediately notify the user/programmer and request information from the user to permit the interpreter/compiler to resolve the ambiguity. This approach to interaction, where both sides take turns initiating actions, is known as “mixed-initiative interaction.” Ambiguity detection may be elegantly implemented in a logic language, such as Prolog [3], which supports non-deterministic parsing through the facility of definite clause grammars (DCGs). If the DCG finds two or more plausible interpretations for a given command, the interpreter/compiler asks the user for some distinguishing piece of information to allow it to resolve the ambiguity.
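A minimal sketch of this mixed-initiative loop, assuming the hypothetical command//1 grammar sketched in Section 2.2 and simple text placeholders for the speech I/O primitives, might look as follows; it is illustrative only and is not the actual ALVIN interpreter.

```prolog
% Collect every parse of the spoken command; if more than one interpretation
% survives, ask the user to disambiguate instead of guessing.
interpret(Tokens, Command) :-
    findall(C, phrase(command(C), Tokens), Parses),
    sort(Parses, Unique),                       % drop duplicate interpretations
    (  Unique = [Command] -> true
    ;  Unique = []        -> say('I did not understand that command.'), fail
    ;  disambiguate(Unique, Command)
    ).

disambiguate(Choices, Chosen) :-
    say('I heard more than one possible command. Which did you mean?'),
    read_choice(Choices, Chosen).

% Placeholders for the speech I/O primitives, which in the model implementation
% would be realised through VoiceXML prompts and grammars.
say(Text) :- format("~w~n", [Text]).
read_choice(Choices, Chosen) :- member(Chosen, Choices).
```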

2.4 Modifiers and Slots

Modifiers are used to assign values to “slots,” which are, essentially, attributes of a given command. Certain recurring types of slots are “locative,” “nominative,” and “type” slots. Locative slots refer to a location, in this case, to a location within the VoiceXML file being browsed (e.g., next to last sentence). Nominative slots refer to a name given to the direct object of the command (e.g., in the command Store 5 as FOO, the nominative slot is filled with the value “FOO,” representing a variable name or label). Some slots are required to be filled in order for the command to make sense; those slots are labeled with an overbar in Figure 1. If the programmer/user completes the command without filling in all of the required slots (by stating the appropriate modifiers), the interpreter/compiler interrupts the programmer/user and asks a question to resolve the issue. For example, the interpreter or compiler might ask something like, “What name was 5 supposed to be stored as?” At that time, the programmer/user could simply supply the requested answer, “foo,” then resume programming.
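The following fragment is a hypothetical sketch of such slot checking, assuming the cmd(Verb, Object, Modifiers) representation used in the earlier sketch; the required-slot table, slot names, and prompts are illustrative assumptions rather than the actual Figure 1 specification.

```prolog
% Each command type declares which slots are required (the overbarred slots of
% Figure 1); if a parsed command is missing one, the interpreter asks a follow-up
% question before accepting the command.
required_slots(store, [nominative]).            % "Store 5 as FOO" needs a name
required_slots(set_anchor, [nominative, locative]).

slot_of(as(Name), nominative, Name).
slot_of(at(Loc),  locative,   Loc).

check_slots(cmd(Verb, _Obj, Mods), Filled) :-
    required_slots(Verb, Required),
    fill_all(Required, Mods, Filled).

fill_all([], _Mods, []).
fill_all([Slot|Rest], Mods, [Slot-Value|Filled]) :-
    (  member(M, Mods), slot_of(M, Slot, Value)
    -> true
    ;  ask_for_slot(Slot, Value)                % interrupt the user and ask
    ),
    fill_all(Rest, Mods, Filled).

ask_for_slot(Slot, Value) :-
    format("Which ~w should I use?~n", [Slot]),
    read(Value).                                % stands in for a spoken reply
```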

2.5 Serial Programming

Arguably, the primary obstacle faced by the designer of a spoken computer language is the strictly serial nature of voice-based I/O. Because only one language element or word may be spoken at any one time, there is no way for a programmer to refer back to earlier parts of the program, even recent code, while programming. The impact this has on the programmer cannot be overstated. There are many language constructs in written computer languages that implicitly assume that the programmer will have visual access to the source code during the programming process. For example, any form of nested or hierarchical language structure makes this implicit assumption. Parentheses in the LISP programming language are a supreme example of this. Without ready visual confirmation of previously-entered lines of code, it is difficult (and in some cases, practically impossible) to determine where one part of a routine ends and another begins or resumes.

The problem with the traditional approach taken in written computer languages is that it puts too many demands on the programmer's memory [4]. Programmers are generally not consciously aware of the degree to which they rely on the information that a screen display provides. Because it provides constant, immediately perceptible feedback about the context in which a given line of code is being written, the screen display frees the programmer from having to keep careful track of the precise point in the program logic at which the programmer is writing the code. Visual cues, such as brackets, delimiters, and whitespace/indenting, all aid the programmer in determining this and make it possible for a programmer to interrupt a central sequence of instructions (and train of programmer thought) to address some auxiliary sequence of instructions before returning to the main sequence. For example, the C-like loop

do { instruction 1; ... instruction n; } while (condition);

involves two such sequences. The primary sequence is that in which the do-while loop instruction resides. The secondary sequence is the loop body itself. Now, to program this loop in a purely serial fashion using C would require that the programmer first begin the do-loop instruction, suspend the programmer's train of thought about the loop condition temporarily in order to program the loop body, then return to the loop's condition for iteration. As a program becomes more complicated, with many such structures presented in a nested form, it is not difficult to see the confusion that can arise when a programmer is confronted with the problem of determining where his or her original train of thought left off.

Moreover, the traditionally favoured top-down style of design is difficult to achieve using traditional structured written programming languages, such as C. When employing top-down design techniques, a programmer may write a conditional statement (an “if”) then leave a placeholder for the code conditionally executed in response to the conditional statement. At a later time, the programmer will come back to the placeholder and replace the placeholder with the conditionally executed code. This scheme, of course, anticipates that the programmer will be able to return to the placeholder in a random-access fashion, delete the now-unneeded placeholder, and insert new code at that location. Thus, traditional top-down design using structured programming languages naturally assumes that the programmer will be able to view and edit his or her code in a more-or-less random-access fashion. In the rigidly sequential world of audio-based programming, however, this kind of random-access programming is not practical.

What the above discussion reveals, however, is that traditional structured programming languages are not really “top-down” in the sense that the language expresses programming concepts in a top-down fashion. What traditional structured languages express is not “top-down” code but, rather, a depth-first search of a top-down design. True top-down programming would be more akin to a breadth-first search. In a breadth-first search of a graph, one traverses all of the nodes at a particular level before proceeding to the next level. In a breadth-first search, there is no need to return to a higher level before continuing the search, since all of the higher-level nodes have already been visited. Thus, a programmer writing code in a breadth-first manner need not revisit previous levels, either, and has less contextual information to remember.

Our language design allows for breadth-first or top-down writing of code by encouraging the use of forward references. For example, in our spoken language, one might write the previous do-while loop using a forward reference in this form:

do FOO while condition
where FOO is instruction1 instruction2 ... end.

This encourages top-down design through the ability to “write” code in top-down order without having to revisit previously defined pieces of code. Because users are allowed to create their own labels, such as “FOO,” for forward references to code blocks, some mechanism for allowing these labels to be defined must be provided for in the language. Where those labels may be defined in our grammar (Figure 1), we have enclosed the labels in boxes. This means that the user may either define the label name at that point or simply recite the label name itself (if the label has already been defined). To define a new label, the user says “new label,” spells the name letter-by-letter (preferably using the International Phonetic Alphabet, or IPA), then completes the new label by saying “end label” (alternatively, the user could simply begin spelling the label using the IPA, e.g., “foxtrot, oscar, oscar, bravo, alpha, romeo” for “foobar”). After having spelled the new label, the user should be able to simply recite the word itself, without having to spell the word [1].3
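One possible, purely illustrative, representation of such forward references is sketched below: blocks are stored under their labels as they are defined, and a reference to a still-undefined label is simply remembered as pending. The term forms (call/1, pending/1) are our own assumptions, not part of ALVIN itself.

```prolog
:- use_module(library(lists)).        % append/3
:- dynamic block/2.                   % block(Name, ListOfCommands)

% Record (or re-record) the body of a named code block.
define_block(Name, Commands) :-
    retractall(block(Name, _)),
    assertz(block(Name, Commands)).

% expand(+Program, -Flat): resolve forward references, replacing each label with
% its recorded body, or with a pending/1 marker if the block is not yet defined.
expand([], []).
expand([call(Name)|Rest], Flat) :-
    ( block(Name, Body) -> true ; Body = [pending(Name)] ),
    expand(Rest, FlatRest),
    append(Body, FlatRest, Flat).
expand([Cmd|Rest], [Cmd|FlatRest]) :-
    Cmd \= call(_),
    expand(Rest, FlatRest).
```

Under this representation, the loop body FOO in the example above could be referenced as call(foo) before define_block(foo, [...]) has ever been spoken.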

2.6 Instructions with Compile-time Semantics

Three commands, “list,” “current label,” and “pending labels,” have “compile-time semantics.” Compile-time semantics is a concept borrowed from the Forth language, in which certain commands, when entered in the course of writing compiled code, are not compiled into the program, but execute immediately and return a result. These three instructions are used to perform tasks that aid the programmer and cannot be compiled into a program. The list command is used to have the interpreter/compiler recite the source code of a particular named block of code. The current label command recites the name of the block currently being written by the programmer. The pending labels command recites the names of the code blocks that have been used in a forward reference, but which have not yet been defined. This helps facilitate top-down programming by reminding the programmer of the code blocks that need to be defined.
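Continuing the illustrative block/2 representation from the previous sketch (again, an assumption rather than the actual implementation), the pending labels and list commands could be realised along these lines:

```prolog
% A label is "referenced" if it appears as a forward reference inside any stored block.
referenced(Name) :-
    block(_, Commands),
    member(call(Name), Commands).

% Pending labels: referenced somewhere but never themselves defined.
pending_labels(Pending) :-
    findall(N, (referenced(N), \+ block(N, _)), Ns),
    sort(Ns, Pending).

% "list": recite a stored block back to the programmer.
list_block(Name) :-
    (  block(Name, Commands)
    -> format("Block ~w: ~w~n", [Name, Commands])
    ;  format("Block ~w has not been defined yet.~n", [Name])
    ).
```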

2.7 Arithmetic Instructions Based on a “Current Value”

Arithmetic and memory operations are performed in our language with reference to a “current value,” which might be thought of as a kind of accumulator register, as was common in early computer hardware designs. A “take” instruction is used to load an immediate or stored value as the current value. Arithmetic commands each perform an operation on the current value and some other immediate or stored value and store the result as the new current value. This strategy,

3 The initial spelling of the label is necessary, as standard VoiceXML does not provide a convenient way to recognize words that it does not expect to hear (i.e., words that are not part of the grammar specified by the writer of the VoiceXML page). Once a word is spelled, however, the word may be incorporated into a VoiceXML grammar to allow the word to be recognized by the VoiceXML browser. This is one of the limitations of current state-of-the-art voice recognition software.

although admittedly primitive when applied to written arithmetic expressions, is more appropriate for spoken code, because it eliminates the complications associated with reciting algebraic expressions verbally. The current value is not limited to only numerical values, but may also assume values of other types, such as audio clips. In this way, our current value concept is not unlike the “$_” variable in Perl, which is used to perform various operations (generally for string pattern matching purposes) without specifying an explicit variable name.
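A small sketch of this accumulator-style evaluation, using hypothetical term forms for the spoken commands of Figure 3, is shown below; it is illustrative only and is not the ALVIN runtime.

```prolog
:- use_module(library(lists)).        % memberchk/2

% run(+Commands, -FinalEnv): interpret a command list against a single "current
% value" accumulator and an environment of Name-Value pairs.
run(Commands, FinalEnv) :- run(Commands, none, [], _, FinalEnv).

run([], Acc, Env, Acc, Env).
run([C|Cs], Acc, Env, AccOut, EnvOut) :-
    step(C, Acc, Env, Acc1, Env1),
    run(Cs, Acc1, Env1, AccOut, EnvOut).

step(take(X),       _,   Env, V,   Env) :- value(X, Env, V).     % load current value
step(multiply(X),   Acc, Env, V,   Env) :- value(X, Env, N), V is Acc * N.
step(subtract(X),   Acc, Env, V,   Env) :- value(X, Env, N), V is Acc - N.
step(store_in(Var), Acc, Env, Acc, [Var-Acc|Env]).               % newest binding shadows older ones

value(N,   _,   N) :- number(N).
value(Var, Env, V) :- atom(Var), memberchk(Var-V, Env).

% Example, in the spirit of "Take foxtrot. Multiply by x-ray. Store the result in foxtrot.":
% ?- run([take(5), store_in(foo), take(foo), multiply(3), store_in(foo)], Env).
% Env = [foo-15, foo-5].
```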

3 Model Implementation

In our model implementation of the ALVIN language, we have chosen to use a Prolog-based CGI (Common Gateway Interface) program to serve VoiceXML pages augmented with additional tags and grammar information to allow a user to speak commands in ALVIN through a VoiceXML browser. The grammar included in the augmented VoiceXML pages is sufficiently detailed to enable the VoiceXML browser to recognize and tokenize the user's speech. Once a complete ALVIN command is tokenized by the VoiceXML browser, the tokenized command is submitted to the Prolog-based CGI program, which may reside behind a standard HTTP (HyperText Transfer Protocol) server. The CGI program interprets the command and returns a result to the VoiceXML browser in the form of an ALVIN-augmented VoiceXML page. As an additional optimization, the ALVIN-augmented VoiceXML page may include JavaScript or other client-side code that can offload some of the work of the CGI program (for simple commands) to the VoiceXML browser. The system may be combined with an HTML-to-VoiceXML transcoder for sophisticated voice/audio-based web browsing as in [5].
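The sketch below illustrates the server side of this arrangement under stated assumptions: it stands in for the CGI program by using SWI-Prolog's HTTP server libraries instead of CGI proper, assumes the browser submits the tokenized command as a single "tokens" parameter, and replies with a minimal VoiceXML page. The handler path, parameter name, and reply page are all hypothetical, and interpret_tokens/2 is a stand-in for the real ALVIN interpreter.

```prolog
:- use_module(library(http/thread_httpd)).
:- use_module(library(http/http_dispatch)).
:- use_module(library(http/http_parameters)).

:- http_handler(root(alvin), handle_command, []).

server(Port) :- http_server(http_dispatch, [port(Port)]).

handle_command(Request) :-
    http_parameters(Request, [tokens(TokenAtom, [default('')])]),
    atomic_list_concat(Tokens, ' ', TokenAtom),   % split "set anchor foo" into words
    (  interpret_tokens(Tokens, Reply)
    -> true
    ;  Reply = 'Sorry, I did not catch that.'
    ),
    reply_vxml(Reply).

% Stand-in for the real ALVIN interpreter.
interpret_tokens(Tokens, Reply) :-
    format(atom(Reply), 'You said: ~w', [Tokens]).

% Return a minimal ALVIN-augmented VoiceXML page that speaks the reply.
reply_vxml(Prompt) :-
    format('Content-type: application/voicexml+xml~n~n'),
    format('<vxml version="2.0"><form><block><prompt>~w</prompt></block></form></vxml>~n',
           [Prompt]).
```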

4 Related Work

Voice-based programming is a relatively new field, largely due to the fact that voice-recognition technology with sufficient accuracy to permit voice-based programming has only become available in recent years. Arnold et al. of Drexel University and Georgia Tech proposed a system for programming by voice to allow computer users to avoid using a keyboard (in the event of repetitive stress injuries and the like) [6]. Unlike our research, which addresses programming in purely audio environments by defining a new spoken computer language, their voice programming system is a syntax-directed editor intended to be used in front of a computer screen for programming in existing written computer languages, such as C and Java. Alvin Surkan of the University of Nebraska-Lincoln has proposed a voice-directed interface in the APL programming language to permit an agent-based program synthesis system to be controlled via voice-based controls [7]. Ramakrishnan et al. of Virginia Tech have published an interesting paper relating mixed-initiative interaction (including voice-based interaction) to mixed computation (i.e., partial evaluation) [8]. Finally, note that there are purely speech-based commercial systems available as well (such as IBM's Via Voice) that take speech commands. These speech commands are very simple; also, these systems do not allow users to program navigation strategies on-the-fly.

5 Conclusions

In this paper we presented a framework for designing a voice-based scripting language that can be used to program navigation strategies on-the-fly for browsing VoiceXML pages. We used this design framework to develop the ALVIN language, which is representative of a larger potential class or family of voice-based programming languages. We discussed some of the challenges associated with designing a voice-based scripting language and proposed solutions for those challenges, which were incorporated into the design of our experimental language. We also outlined techniques for implementing ALVIN. A prototype implementation of ALVIN is in progress. We anticipate that, as mobile telecommunications devices and portable computing devices become more prevalent, the need for remote and/or hands-free programmability of computing devices will increase. Spoken programming languages like ALVIN will enable users to “write” scripts to program such devices without requiring a keyboard or display. We plan to build on the basic concepts of ALVIN to develop more general-purpose languages for interactive, voice-based programming to fulfill this need.

References

[1] Reddy, H., Annamalai, N., Gupta, G.: Listener-controlled dynamic navigation of VoiceXML documents. In: International Conference on Computers Helping People with Special Needs (ICCHP), Lecture Notes in Computer Science, Springer Verlag (2004) 337-354
[2] Knudsen, W.: Victor Borge sound clips. http://www.kor.dk/borge/b-mus-1.htm (1997)
[3] Sterling, L., Shapiro, E.: The Art of Prolog. MIT Press (1994)
[4] Fry, C.: Programming on an already full brain. Commun. ACM 40 (1997) 55-64
[5] Gupta, G., Raman, S.S., Nichols, M.M., Reddy, H., Annamalai, N.: DAWN: Dynamic aural web navigation. In: HCI 2005 (this proceedings) (2005)
[6] Arnold, S.C., Mark, L., Goldthwaite, J.: Programming by voice, Vocal Programming. In: ASSETS '00: Proceedings of the Fourth International ACM Conference on Assistive Technologies, ACM Press (2000) 149-155
[7] Surkan, A.J.: Spoken-word direction of synthesis. In: APL '00: Proceedings of the International Conference on APL (APL-Berlin-2000), ACM Press (2000) 221-227
[8] Ramakrishnan, N., Capra, R.G., Pérez-Quiñones, M.A.: Mixed-initiative interaction = mixed computation. In: PEPM '02: Proceedings of the 2002 ACM SIGPLAN Workshop on Partial Evaluation and Semantics-Based Program Manipulation, ACM Press (2002) 119-130