SALT: An XML Application for Web-based Multimodal Dialog Management

Kuansan Wang
Speech Technology Group, Microsoft Research
One Microsoft Way, Microsoft Corporation, Redmond, WA, 98006, USA
http://research.microsoft.com/stg

Abstract

This paper describes the Speech Application Language Tags, or SALT, an XML based spoken dialog standard for multimodal or speech-only applications. A key premise in the SALT design is that a speech-enabled user interface shares many of the design principles and computational requirements with the graphical user interface (GUI). As a result, it is logical to introduce into speech the object-oriented, event-driven model that is known to be flexible and powerful enough in meeting the requirements for realizing sophisticated GUIs. By reusing this rich infrastructure, dialog designers are relieved from having to develop the underlying computing infrastructure and can focus more on the core user interface design issues than on the computer and software engineering details. The paper focuses the discussion on the Web-based distributed computing environment and elaborates on how SALT can be used to implement multimodal dialog systems. How advanced dialog effects (e.g., cross-modality reference resolution, implicit confirmation, multimedia synchronization) can be realized in SALT is also discussed.

Introduction

A multimodal interface allows a human user to interact with the computer using more than one input method. GUI, for example, is multimodal because a user can interact with the computer using keyboard, stylus, or pointing devices. GUI is an immensely successful concept, notably demonstrated by the World Wide Web. Although the relevant technologies for the Internet had long existed, it was not until the adoption of GUI for the Web that we witnessed a surge in its usage and rapid improvements in Web applications.

GUI applications have to address the issues commonly encountered in a goal-oriented dialog system. In other words, GUI applications can be viewed as conducting a dialog with the user in an iconic language. For example, it is very common for an application and its human user to undergo many exchanges before a task is completed. The application therefore must manage the interaction history in order to properly infer the user's intention. The interaction style is mostly system initiative because the user often has to follow the prescribed interaction flow in which the allowable branches are visualized in graphical icons. Many applications have introduced mixed initiative features such as type-in help or a search box. However, user-initiated digressions are often recoverable only if they are anticipated by the application designers. The plan-based dialog theory (Sadek et al 1997, Allen 1995, Cohen et al 1989) suggests that, in order for mixed initiative dialog to function properly, the computer and the user should be collaborating partners that actively assist each other in planning the dialog flow. An application will be perceived as hard to use if the flow logic is obscure or unnatural to the user and, similarly, the user will feel frustrated if the methods to express intents are too limited. It is widely believed that spoken language can improve the user interface, as it provides the user a natural and less restrictive way to express intents and receive feedback.

The Speech Application Language Tags (SALT 2002) is a proposed standard for implementing spoken language interfaces. The core of SALT is a collection of objects that enable a software program to listen, speak, and communicate with other components residing on the underlying platform (e.g., discourse manager, other input modalities, telephone interface, etc.). Like their predecessors in the Microsoft Speech Application Programming Interface (SAPI), SALT objects are programming language independent. As a result, SALT objects can be embedded into an HTML or any XML document as the spoken language interface (Wang 2000). Introducing speech capabilities to the Web is not new (Arons 1991, Ly et al 1993, Lau et al 1997). However, it is the utmost design goal of SALT that advanced dialog management techniques (Seneff et al 1998, Rudnicky et al 1999, Lin et al 1999, Wang 1998) can be realized in a straightforward manner.
The rest of the paper is organized as follows. In Sec. 1, we first review the dialog architecture on which the SALT design is based. It is argued that advanced spoken dialog models can be realized using the Web infrastructure. Specifically, various stages of dialog goals can be modeled as Web pages that the user navigates through. Considerations in flexible dialog design have direct implications on the XML document structure, and how SALT implements this document structure is outlined. In Sec. 2, the XML objects providing spoken language understanding and speech synthesis are described. These objects are designed using the event-driven architecture so that they can be included in the GUI environment for multimodal interactions. Finally, in Sec. 3, we describe how SALT utilizes the extensibility of XML to allow new extensions without losing document portability.

1 Dialog Architecture Overview

With the advent of XML Web services, the Web has quickly evolved into a gigantic distributed computer where Web services, communicating in XML, play the role of reusable software components. Using the universal description, discovery, and integration (UDDI) standard, Web services can be discovered and linked up dynamically to collaborate on a task. In other words, Web services can be regarded as the software agents envisioned in the open agent architecture (Bradshaw 1996). Conceptually, the Web infrastructure provides a straightforward means to realize the agent-based approach suitable for modeling highly sophisticated dialog (Sadek et al 1997). This distributed model shares the same basis as the SALT dialog management architecture.

1.1 Page-based Dialog Management

An examination of human-to-human conversation on trip planning shows that experienced agents often guide the customers in dividing the trip planning into a series of more manageable and relatively untangled subtasks (Rudnicky et al 1999). Not only does this observation contribute to the formation of the plan-based dialog theory, but the same principle is also widely adopted in designing GUI-based transactions, where the subtasks are usually encapsulated in visual pages. Take a travel planning Web site for example. The first page usually gathers some basic information about the trip, such as the traveling dates and the originating and destination cities. All the possible travel plans are typically shown in another page, in which the user can negotiate on items such as the price, departure and arrival times, etc. To some extent, the user can alter the flow of interaction. If the user is more flexible on the flight than on the hotel reservation, a well designed site will allow the user to digress and settle the hotel reservation before booking the flight. Necessary confirmations are usually conducted in separate pages before the transaction is executed.

The designers of SALT believe that spoken dialog can be modeled by page-based interaction as well, with each page designed to achieve a sub-goal of the task. There seems no reason why the planning of the dialog cannot utilize the same mechanism that dynamically synthesizes the Web pages today.
1.2 Separation of Data and Presentation

SALT preserves the tremendous flexibility of a page-based dialog system in dynamically adapting the style and presentation of a dialog (Wang 2000). A SALT page is composed of three portions: (1) a data section corresponding to the information the system needs to acquire from the user in order to achieve the sub-goal of the page; (2) a presentation section that, in addition to GUI objects, contains the templates to generate speech prompts and the rules to recognize and parse the user's utterances; (3) a script section that includes the inference logic for deriving the dialog flow in achieving the goal of the page. The script section also implements the necessary procedures to manipulate the presentation section.

This document structure is motivated by the following considerations. First, the separation of the presentation from the rest localizes the natural language dependencies. An application can be ported to another language by changing only the presentation section without affecting the other sections. Also, a good dialog must dynamically strike a balance between system initiative and user initiative styles. However, the need to switch the interaction style does not necessitate changes in the dialog planning. The SALT document structure maintains this type of independence by separating the data section from the rest of the document, so that when there is a need to change the interaction style, the script and the presentation sections can be modified without affecting the data section. The same mechanism also enables the application to switch among various UI modes, such as in mobile environments where the interaction must be able to switch seamlessly between GUI and speech-only modes for hand-eye busy situations. The presentation section may vary significantly among the UI modes, but the rest of the document can remain largely intact.
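To make this three-way division concrete, a page that gathers a travel date could be laid out as sketched below. This is a minimal illustration rather than an excerpt from the SALT specification; the namespace declaration, element names, and grammar URI are all illustrative.

  <html xmlns:salt="http://www.saltforum.org/2002/SALT">
    <!-- (1) data section: the information this page must acquire -->
    <input type="hidden" id="trip_date" />

    <!-- (2) presentation section: GUI objects, prompt templates, recognition rules -->
    <salt:prompt id="ask_date"> Which date are you traveling? </salt:prompt>
    <salt:listen id="reco_date" onreco="onDate()">
      <salt:grammar src="./date_grammar.xml" />
    </salt:listen>

    <!-- (3) script section: inference logic that drives the within-page dialog flow -->
    <script>
      function onDate() {
        // copy the recognized date into the data section and decide the next move
        trip_date.value = reco_date.text;
      }
    </script>
  </html>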
1.3 Semantic Driven Multimodal Integration

SALT follows the common GUI practice and employs an object-oriented, event-driven model to integrate multiple input methods. The technique tracks the user's actions and reports them as events. An object is instantiated for each event to describe its cause. For example, when a user clicks on a graphical icon, a mouse click event is fired. The mouse-click event object contains information such as the coordinates where the click takes place. SALT extends the mechanism to speech input, in which the notion of semantic objects (Wang 2000, Wang 1998) is introduced to capture the meaning of spoken language. When the user says something, speech events, furnished with the corresponding semantic objects, are reported. The semantic objects are structured and categorized. For example, the utterance "Send mail to John" is composed of two nested semantic objects: "John," representing the semantic type "Person," and the whole utterance, representing the semantic type "Email command." SALT therefore enables a multimodal integration algorithm based on semantic type compatibility (Wang 2001). The same command can be manifest in a multimodal expression, as in "Send email to him [click]," where the email recipient is given by a point-and-click gesture. Here the semantic type provides a straightforward way to resolve the cross-modality reference: the handler for the GUI mouse click event can be programmed to produce a semantic object of the type "Person," which can subsequently be identified as a constituent of the "Email command" semantic object. Because the notion of semantic objects is quite generic, dialog designers should find little difficulty employing other multimodal integration algorithms in SALT, such as the unification based approach described in (Johnston et al 1997).

2 Basic Speech Elements in SALT

SALT speech objects encapsulate speech functionality. They resemble GUI objects in many ways. Because they share the same high level abstraction, SALT speech objects interoperate with GUI objects in a seamless and consistent manner. Multimodal dialog designers can elect to ignore the modality of communication, much the same way as they are insulated from having to distinguish whether a text string is entered into a field through a keyboard or cut and pasted with a pointing device.

2.1 The Listen Object

The "listen" object in SALT is the speech input object. The object must be initialized with a speech grammar that defines the language model and the lexicon relevant to the recognition task. The object has a start method that, upon invocation, collects the acoustic samples and performs speech recognition. If the language model is a probabilistic context free grammar (PCFG), the object can return the parse tree of the recognized outcome. Optionally, dialog designers can embed XSLT templates or scripts in the grammar to shape the parse tree into any desired format. The most common usage is to transform the parse tree into a semantic tree composed of semantic objects.

A SALT object is instantiated in an XML document whenever a tag bearing the object name is encountered.
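For example, a listen object can be instantiated as follows (a minimal sketch; the grammar URI is illustrative, and the mode attribute is discussed below):

  <listen id="foo" mode="automatic" onreco="f()" onnoreco="g()">
    <grammar src="./citygrammar.xml" />
  </listen>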

The object, named "foo," is given a speech grammar whose universal resource indicator (URI) is specified via a <grammar> constituent. As in the case of HTML, methods of an object are invoked via the object name. For example, the command to start the recognition is foo.start() in the ECMAScript syntax. Upon a successful recognition and parsing, the listen object raises the event "onreco." The event handler, f(), is associated in the HTML syntax as shown above. If the recognition result is rejected, the listen object raises the "onnoreco" event, which, in the above example, invokes the function g(). As mentioned in Sec. 1.2, these event handlers reside in the script section of a SALT page that manages the within-page dialog flow. Note that SALT is designed to be agnostic to the syntax of the eventing mechanism. Although the examples throughout this article use HTML syntax, SALT can operate with other eventing standards, such as the World Wide Web Consortium (W3C) XML Document Object Model (DOM) Level 2, the ECMA Common Language Infrastructure (CLI), or the upcoming W3C proposal called XML Events.

The listen object has three modes designed to meet different UI requirements. The automatic mode, shown above, automatically detects the end of the utterance and cuts off the audio stream. This mode is most suitable for push-to-talk UI or telephony based systems. Reciprocal to the start method, the listen object also has a stop method for forcing the recognizer to stop listening. The designer can explicitly invoke the stop method and not rely on the recognizer's default behavior. Invoking the stop method becomes necessary when the listen object operates under the single mode, where the recognizer is mandated to continue listening until the stop method is called. Under the single mode, the recognizer is required to evaluate and return hypotheses based on the full length of the audio, even though some search paths may have reached a legitimate end-of-sentence token in the middle of the audio stream. In contrast, the third, multiple mode allows the listen object to keep reporting recognition results as they become available. The single mode is designed for push-hold-and-talk types of UI, while the multiple mode is for real-time or dictation types of applications.

The listen object also has methods to modify the PCFG it contains. Rules can be dynamically activated and deactivated to control the perplexity of the language model. The semantic parsing templates in the grammar can be manipulated to perform simple reference resolution. For example, the grammar below (in SAPI format, shown here in schematic form) demonstrates how a deictic reference can be resolved inside the SALT listen object:

  <rule name="choose_drink" toplevel="active">
    <p>the</p>
    <list propname="drink">
      <p propvalue="coffee">left</p>
      <p propvalue="tea">right</p>
    </list>
    <p>one</p>
  </rule>

In this example, the propname and propvalue attributes are used to generate the semantic objects. If the user says "the left one," the above grammar directs the listen object to return the semantic object "drink" with the value "coffee." This mechanism for composing semantic objects is particularly useful for processing expressions closely tied to how data are presented. The grammar above may be used when the computer asks the user for a choice of drink by displaying pictures of the choices side by side. However, if the display is tiny, the choices may be rendered as a list, to which a user may say "the first one" or "the bottom one." SALT allows dialog designers to approach this problem by dynamically adjusting the speech grammar.
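For instance, if the same choices are rendered vertically as a list, the page can swap in a rule that maps positional references to the same semantic object. A hypothetical variant of the rule above:

  <rule name="choose_drink" toplevel="active">
    <p>the</p>
    <list propname="drink">
      <p propvalue="coffee">first</p>
      <p propvalue="coffee">top</p>
      <p propvalue="tea">bottom</p>
    </list>
    <p>one</p>
  </rule>

Because either layout yields the same semantic object, the rest of the page logic is unaffected by the presentation change.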

2.2 The prompt object

The SALT "prompt" object is the speech output object. Like the listen object, the prompt object has a start method to begin the audio playback. The prompt object can perform text to speech synthesis (TTS) or play pre-recorded audio. For TTS, the prosody and other dialog effects can be controlled by marking up the text with synthesis directives.

Barge-in and bookmark are two events of the prompt object particularly useful for dialog design. The prompt object raises a barge-in event when the computer detects a user utterance during a prompt playback. SALT provides a rich programming interface for the dialog designers to specify the appropriate behaviors when barge-in occurs. Designers can choose whether to delegate SALT to cut off the outgoing audio stream as soon as speech is detected. Delegated cut-off minimizes the barge-in response time, and is close to the expected behavior for users who wish to expedite the progress of the dialog without waiting for the prompt to end. Alternatively, non-delegated barge-in lets the user change playback parameters without interrupting the output. For example, the user can adjust the speed and volume using speech commands while the audio playback is in progress. SALT will automatically turn on echo cancellation in this case so that the playback has minimal impact on the recognition.

The timing of certain user actions, or the lack thereof, often bears semantic implications. Implicit confirmation is a good example, where the absence of an explicit correction from the user is considered a confirmation. The prompt object introduces an event for reporting the landmarks of the playback. The typical way of catching the playback landmarks in SALT is as such (shown schematically; the bookmark name and handler are illustrative):

  <prompt id="ask_flight" onbookmark="f()">
    Traveling to New York?
    <bookmark mark="confirm_city" />
    There are 3 flights available ...
  </prompt>

When the synthesizer reaches the TTS markup <bookmark>, the onbookmark event is raised and the event handler f() is invoked. When a barge-in is detected, the dialog designer can determine whether the barge-in occurred before or after the bookmark by inspecting whether the function f() has been called.

Multimedia synchronization is another main usage for TTS bookmarks. When the speech output is accompanied by, for example, graphical animations, TTS bookmarks are an effective mechanism to synchronize these parallel outputs.

To include dynamic content in the prompt, SALT adopts a simple template-based approach for prompt generation. In other words, the carrier phrases can be either pre-recorded or hard-coded, while the key phrases can be inserted and synthesized dynamically. The prompt object that confirms a travel plan may appear as the following in HTML (shown schematically; the referenced input elements are defined elsewhere on the page):

  <prompt id="confirm_trip">
    Do you want to fly from
    <value targetelement="origin_city" targetattribute="value" />
    to
    <value targetelement="dest_city" targetattribute="value" />
    on
    <value targetelement="dept_date" targetattribute="value" />?
  </prompt>

As shown above, SALT uses a <value> tag inside a prompt object to refer to the data contained in other parts of the SALT page. In this example, the prompt object will insert the values of the HTML input objects in synthesizing the prompt.

2.3 Declarative Rule-based Programming

Although the examples so far use procedural programming to manage the dialog flow control, SALT designers can practice inference programming in a declarative rule-based fashion, in which rules are attached to the SALT objects capturing the user's actions, e.g., the listen object. Instead of authoring procedural event handlers, designers can declare inside the listen object rules that will be evaluated and invoked when the semantic objects are returned. This is achieved through the SALT <bind> element, as demonstrated below.
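The sketch below is one way such rules can look; the confidence threshold and the XPATH paths are illustrative, while prompt_confirm and listen_confirm are the confirmation objects described next:

  <listen id="reco_trip">
    <grammar src="./trip.xml" />
    <!-- rule 1: confidence below threshold, start an explicit confirmation cycle -->
    <bind test="/result[@confidence &lt; 0.5]"
          targetelement="prompt_confirm" targetmethod="start" />
    <bind test="/result[@confidence &lt; 0.5]"
          targetelement="listen_confirm" targetmethod="start" />
    <!-- rule 2: confidence at or above threshold, copy semantic objects into the page -->
    <bind test="/result[@confidence &gt;= 0.5]"
          value="/result/trip/dest_city" targetelement="txt_dest_city" />
  </listen>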
The predicate of each rule is applied in turn against the result of the listen object. The predicates are expressed in the standard XML Pattern language in the "test" clause of the <bind> element. In this example, the first rule checks whether the confidence level is above a threshold. If not, the rule activates a prompt object (prompt_confirm) for explicit confirmation, followed by a listen object (listen_confirm) to capture the user's response. The speech objects are activated via the start method of the respective object. Object activations are specified in the targetElement and targetMethod clauses of the <bind> element. Similarly, the second rule applies when the confidence score exceeds the prescribed level. The rule extracts the relevant semantic objects from the parsed outcome and assigns them to the respective elements in the SALT page. As shown above, SALT reuses the W3C XPATH language for extracting partial semantic objects from the parsed outcome.

3 SALT Extensibilities

Naturally spoken language is a modality that can be used in widely diverse environments where user interface constraints and capabilities vary significantly. As a result, it is only practical to define into SALT the speech functions that are universally applicable and implementable. For example, the basic speech input function in SALT only deals with speaker independent recognition and understanding, even though speaker dependent recognition or speaker verification are in many cases very useful. Extensibility is therefore crucial to a natural language interface like SALT. New functions can be introduced at the component level, or as extensions to the language itself, as described below.

3.1 Component extensibility

SALT components can be extended with new functions individually through the component configuration mechanism of the <param> element. For example, the <listen> element has an event to signal when speech is detected in the incoming audio stream, but the standard does not specify an algorithm for detecting speech. A SALT document author, however, can declare reference cues so that the document can be rendered in a similar way among different processors. The <param> element can be used to set the reference algorithm and threshold for detecting speech in the <listen> object.
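For example, the following configuration selects an energy-based detection method with a threshold of 0.7; the namespace prefix and the child element names are illustrative:

  <listen id="reco_trip">
    <param name="speechdetector" xmlns:sd="urn:xyz.edu/algo-1">
      <sd:method>energy</sd:method>
      <sd:threshold>0.7</sd:threshold>
    </param>
    <grammar src="./trip.xml" />
  </listen>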

Here the parameters are set using an algorithm whose uniform resource name (URN), xyz.edu/algo-1, is declared as an attribute of the XML namespace of <param>. The parameters for configuring this specific speech detection method are further specified in the child elements. A document processor can perform a schema translation on the URN namespace into any schema the processor understands. For example, if the processor implements a speech detection algorithm whose detection threshold has a different range, the adjustment can be easily made when the document is parsed.

The same mechanism is used to extend the <listen> functionality. For instance, the <listen> object can be used for speaker verification, because the algorithm used for verification and the programmatic interfaces share a lot in common with recognition. In SALT, a <listen> object can be extended for speaker verification through configuration parameters.
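Such an extension might be declared as follows; the verification URN and the parameter element names are illustrative:

  <listen id="verify_user" onreco="accept()" onnoreco="reject()">
    <param name="verifier" xmlns:sv="urn:xyz.edu/speaker-verify">
      <sv:cohort>../../data</sv:cohort>
      <sv:threshold>0.85</sv:threshold>
    </param>
    <grammar src="./passphrase.xml" />
  </listen>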

In this example, the <listen> object is extended for speaker verification by comparing the user's voice against a cohort set. The events "onreco" and "onnoreco" are raised when the voice passes or fails the test, respectively. As in the previous case, the extension must be decorated with a URN that specifies the intended behavior of the document author. Being an extension, the document processor might not be able to natively discern the semantics implied by the URN. However, XML based protocols allow the processor to query and employ Web services that can either (1) transform the extended document segment into an XML schema the processor understands, or (2) perform the function described by the URN. By closely following XML standards, SALT documents fully enjoy the extensibility and portability benefits of XML.

3.2 Language extensibility

In addition to component extensibility, the whole language of SALT can be enhanced with new functionality using XML. Communicating with other modalities, input devices, and advanced discourse and context management are just a few potential uses of language-wise extension.

The SALT message extension, or the <smex> element, is the standard conduit between a SALT document and the outside world. The <smex> element takes a <param> element as its child to forge a connection to an external component. Once a link is established, the SALT document can communicate with the external component by exchanging text messages. The <smex> element has a "sent" attribute to which a text message can be assigned. When its value is changed, the new value is regarded as a message intended for the external component and is immediately dispatched. When an incoming message arrives, the <smex> element stores it in a property called "received" and raises the "onreceive" event.

3.2.1 Telephony Interface

Telephones are one of the most important access devices for spoken language enabled Web applications. Call control functions play a central role in a telephony SALT application. The <smex> element in SALT is a perfect match with the telephony call control standard known as ECMA 323. ECMA 323 defines standard XML schemas for the messages of telephony functions, ranging from simple call answering, disconnecting, and transferring, to switching functionality suitable for large call centers. ECMA 323 employs the same message exchange model as the design of <smex>. This allows a SALT application to tap into rich telephony call controls without needing a complicated SALT processor. As shown in (SALT 2002), sophisticated telephony applications can be authored in SALT in a very straightforward manner.

3.2.2 Advanced Context Management

In addition to devices such as telephones, the SALT <smex> object may also be employed to connect to Web services or other software components to facilitate advanced discourse semantic analysis and context management. Such capabilities, as described in (Wang 2001), empower the user to realize the full potential of interacting with computers in natural language. For example, the user can simply say "Show driving directions to my next meeting" without having to explicitly and tediously instruct the computer to first look up the next meeting, obtain its location, and copy the location to another program that maps out a driving route.

The customized semantic analysis can be achieved in SALT as follows.
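One possible wiring of the cascade is sketched below; the parameter element and the binding of the recognition result onto the smex "sent" attribute are illustrative:

  <smex id="resolver" onreceive="h()">
    <param name="server">http://personal.com</param>
  </smex>

  <listen id="reco_command">
    <grammar src="./commands.xml" />
    <!-- cascade the parsed result to the reference-resolution service -->
    <bind value="/result" targetelement="resolver" targetattribute="sent" />
  </listen>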

Here the <listen> element includes a rudimentary grammar to analyze the basic sentence structure of the user utterance. Instead of resolving every reference (e.g., "my next meeting") locally, the result is cascaded to the <smex> element linked to a Web service specializing in resolving personal references.

Summary

In this paper, we describe the design philosophy of SALT in using the existing Web architecture for distributed multimodal dialog. SALT allows flexible and powerful dialog management by fully taking advantage of the well publicized benefits of XML, such as the separation of data from presentation. In addition, XML extensibility allows new functions to be introduced as needed without sacrificing document portability. SALT uses this mechanism to accommodate diverse Web access devices and advanced dialog management.

References

Sadek M.D., Bretier P., Panaget F. (1997) ARTIMIS: Natural dialog meets rational agency, in Proc. IJCAI-97, Japan.

Allen J.F. (1995) Natural Language Understanding, 2nd Edition, Benjamin-Cummings, Redwood City, CA.

Cohen P.R., Morgan J., Pollack M.E. (1989) Intentions in Communication, MIT Press, Cambridge, MA.

SALT Forum (2002) Speech Application Language Tags Specification, http://www.saltforum.org.

Wang K. (2000) Implementation of a multimodal dialog system using extensible markup language, in Proc. ICSLP-2000, Beijing, China.

Arons B. (1991) Hyperspeech: Navigating in speech-only hypermedia, in Proc. Hypertext-91, San Antonio, TX.

Ly E., Schmandt C., Arons B. (1993) Speech recognition architectures for multimedia environments, in Proc. AVIOS-93, San Jose, CA.

Lau R., Flammia G., Pao C., Zue V. (1997) WebGALAXY: Integrating spoken language and hypertext navigation, in Proc. EuroSpeech-97, Rhodes, Greece.

Seneff S., Hurley E., Lau R., Pao C., Schmid P., Zue V. (1998) Galaxy-II: A reference architecture for conversational system development, in Proc. ICSLP-98, Sydney, Australia.

Rudnicky A., Xu W. (1999) An agenda-based dialog management architecture for spoken language systems, in Proc. ASRU-99, Keystone, CO.

Lin B.-S., Wang H.-M., Lee L.-S. (1999) A distributed architecture for cooperative spoken dialog agents with coherent dialog state and history, in Proc. ASRU-99, Keystone, CO.

Bradshaw J.M. (1996) Software Agents, AAAI/MIT Press, Cambridge, MA.

Wang K. (1998) An event driven dialog system, in Proc. ICSLP-98, Sydney, Australia.

Wang K. (2001) Semantic modeling for dialog systems in a pattern recognition framework, in Proc. ASRU-2001, Trento, Italy.

Johnston M., Cohen P.R., McGee D., Oviatt S.L., Pittman J.A., Smith I. (1997) Unification-based multimodal integration, in Proc. 35th ACL, Madrid, Spain.

Wang K. (2001) Natural language enabled Web applications, in Proc. 1st NLP and XML Workshop, Tokyo, Japan.