SALT: An XML Application for Web-based Multimodal Dialog Management
Kuansan Wang
Speech Technology Group, Microsoft Research
One Microsoft Way, Microsoft Corporation
Redmond, WA 98006, USA
http://research.microsoft.com/stg

Abstract

This paper describes the Speech Application Language Tags, or SALT, an XML-based spoken dialog standard for multimodal or speech-only applications. A key premise in the SALT design is that a speech-enabled user interface shares many of the design principles and computational requirements with the graphical user interface (GUI). As a result, it is logical to introduce into speech the object-oriented, event-driven model that is known to be flexible and powerful enough to meet the requirements for realizing sophisticated GUIs. By reusing this rich infrastructure, dialog designers are relieved from having to develop the underlying computing infrastructure and can focus more on the core user interface design issues than on computer and software engineering details. The paper focuses the discussion on the Web-based distributed computing environment and elaborates on how SALT can be used to implement multimodal dialog systems. How advanced dialog effects (e.g., cross-modality reference resolution, implicit confirmation, multimedia synchronization) can be realized in SALT is also discussed.

Introduction

A multimodal interface allows a human user to interact with the computer using more than one input method. GUI, for example, is multimodal because a user can interact with the computer using a keyboard, a stylus, or a pointing device. GUI is an immensely successful concept, notably demonstrated by the World Wide Web. Although the relevant technologies for the Internet had long existed, it was not until the adoption of GUI for the Web that we witnessed a surge in its usage and rapid improvements in Web applications.

GUI applications have to address the issues commonly encountered in a goal-oriented dialog system. In other words, a GUI application can be viewed as conducting a dialog with its user in an iconic language. For example, it is very common for an application and its human user to undergo many exchanges before a task is completed. The application therefore must manage the interaction history in order to properly infer the user's intention. The interaction style is mostly system initiative because the user often has to follow the prescribed interaction flow, in which the allowable branches are visualized in graphical icons. Many applications have introduced mixed-initiative features such as type-in help or a search box. However, user-initiated digressions are often recoverable only if they are anticipated by the application designers. The plan-based dialog theory (Sadek et al. 1997, Allen 1995, Cohen et al. 1989) suggests that, in order for mixed-initiative dialog to function properly, the computer and the user should be collaborating partners that actively assist each other in planning the dialog flow. An application will be perceived as hard to use if the flow logic is obscure or unnatural to the user and, similarly, the user will feel frustrated if the methods to express intents are too limited. It is widely believed that spoken language can improve the user interface, as it provides the user a natural and less restrictive way to express intents and receive feedback.

The Speech Application Language Tags (SALT 2002) is a proposed standard for implementing spoken language interfaces. The core of SALT is a collection of objects that enable a software program to listen, speak, and communicate with other components residing on the underlying platform (e.g., the discourse manager, other input modalities, the telephone interface, etc.). Like their predecessors in the Microsoft Speech Application Programming Interface (SAPI), SALT objects are programming-language independent. As a result, SALT objects can be embedded into an HTML or any XML document as the spoken language interface (Wang 2000). Introducing speech capabilities to the Web is not new (Aron 1991, Ly et al. 1993, Lau et al. 1997). However, it is the utmost design goal of SALT that advanced dialog management techniques (Seneff et al. 1998, Rudnicky et al. 1999, Lin et al. 1999, Wang 1998) can be realized in a straightforward manner in SALT.
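As an illustration of this embedding, a minimal HTML page might declare SALT speech objects alongside ordinary GUI elements as sketched below. The sketch uses the prompt, listen, grammar, and bind elements of the SALT proposal, but the namespace URI, grammar file, field names, and handler functions are illustrative placeholders rather than excerpts from the specification or from this paper.

    <html xmlns:salt="http://www.saltforum.org/2002/SALT">
      <body onload="runAsk()">
        <!-- an ordinary GUI object -->
        <input name="txtDestCity" type="text" />

        <!-- speech output object -->
        <salt:prompt id="askCity">Which city are you flying to?</salt:prompt>

        <!-- speech input object: the grammar constrains recognition and the
             bind rule copies the parsed result into the GUI field -->
        <salt:listen id="recoCity" onreco="onCityReco()">
          <salt:grammar src="city.grxml" />
          <salt:bind targetelement="txtDestCity" value="//city" />
        </salt:listen>

        <script type="text/javascript">
          function runAsk() {
            askCity.Start();    // speak the prompt
            recoCity.Start();   // start recognition
          }
          function onCityReco() {
            // invoked when the listen object reports a recognition event;
            // page-level dialog flow logic would go here
          }
        </script>
      </body>
    </html>

Because the listen and prompt objects raise events just as GUI controls do, the same script that handles mouse and keyboard input can also drive the spoken exchange.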
The rest of the paper is organized as follows. In Sec. 1, we first review the dialog architecture on which the SALT design is based. It is argued that advanced spoken dialog models can be realized using the Web infrastructure; specifically, the various stages of dialog goals can be modeled as Web pages that the user navigates through. Considerations in flexible dialog design have direct implications for the XML document structure, and how SALT implements these document structures is outlined. In Sec. 2, the XML objects providing spoken language understanding and speech synthesis are described. These objects are designed using the event-driven architecture so that they can be included in the GUI environment for multimodal interactions. Finally, in Sec. 3, we describe how SALT, which is based on XML, utilizes the extensibility of XML to allow new extensions without losing document portability.

1 Dialog Architecture Overview

With the advent of XML Web services, the Web has quickly evolved into a gigantic distributed computer where Web services, communicating in XML, play the role of reusable software components. Using the Universal Description, Discovery, and Integration (UDDI) standard, Web services can be discovered and linked up dynamically to collaborate on a task. In other words, Web services can be regarded as the software agents envisioned in the open agent architecture (Bradshaw 1996). Conceptually, the Web infrastructure provides a straightforward means to realize the agent-based approach suitable for modeling highly sophisticated dialog (Sadek et al. 1997). This distributed model shares the same basis as the SALT dialog management architecture.

1.1 Page-based Dialog Management

An examination of human-to-human conversations on trip planning shows that experienced agents often guide the customers in dividing the trip planning into a series of more manageable and relatively untangled subtasks (Rudnicky et al. 1999). Not only does this observation contribute to the formation of the plan-based dialog theory, but the same principle is also widely adopted in designing GUI-based transactions, where the subtasks are usually encapsulated in visual pages. Take a travel planning Web site for example. The first page usually gathers some basic information about the trip, such as the traveling dates and the originating and destination cities. All the possible travel plans are typically shown on another page, in which the user can negotiate items such as the price and the departure and arrival times. To some extent, the user can alter the flow of interaction: if the user is more flexible about the flight than about the hotel reservation, a well-designed site will allow the user to digress and settle the hotel reservation before booking the flight. Necessary confirmations are usually conducted in separate pages before the transaction is executed.

The designers of SALT believe that spoken dialog can be modeled by page-based interaction as well, with each page designed to achieve a sub-goal of the task. There seems to be no reason why the planning of the dialog cannot utilize the same mechanisms that dynamically synthesize Web pages today.

1.2 Separation of Data and Presentation

SALT preserves the tremendous flexibility of a page-based dialog system in dynamically adapting the style and presentation of a dialog (Wang 2000). A SALT page is composed of three portions: (1) a data section corresponding to the information the system needs to acquire from the user in order to achieve the sub-goal of the page; (2) a presentation section that, in addition to GUI objects, contains the templates to generate speech prompts and the rules to recognize and parse the user's utterances; and (3) a script section that includes the inference logic for deriving the dialog flow in achieving the goal of the page. The script section also implements the necessary procedures to manipulate the presentation section.
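As a sketch of how these three portions might be laid out in a single page, the fragment below reuses the travel example from Sec. 1.1: the data sits in an XML island, the prompts, grammars, and GUI controls form the presentation, and the flow logic lives in the script. The element names, data layout, grammar file, and handler functions are hypothetical placeholders, assuming the same SALT elements as in the earlier sketch.

    <html xmlns:salt="http://www.saltforum.org/2002/SALT">
      <body onload="runPage()">

        <!-- (1) Data section: what this page must acquire from the user -->
        <xml id="tripData">
          <trip><origin/><destination/><date/></trip>
        </xml>

        <!-- (2) Presentation section: GUI objects plus prompt templates and
             recognition/parsing rules -->
        <input name="txtOrigin" type="text" />
        <input name="txtDestination" type="text" />
        <salt:prompt id="askTrip">Where and when would you like to travel?</salt:prompt>
        <salt:listen id="recoTrip" onreco="onTripReco()">
          <salt:grammar src="trip.grxml" />
          <salt:bind targetelement="txtOrigin" value="//origin" />
          <salt:bind targetelement="txtDestination" value="//destination" />
        </salt:listen>

        <!-- (3) Script section: inference logic driving the dialog flow
             toward the sub-goal of the page -->
        <script type="text/javascript">
          function runPage() { askTrip.Start(); recoTrip.Start(); }
          function onTripReco() {
            // examine which items of the data section are still missing,
            // re-prompt for them, or submit the page once the sub-goal is met
          }
        </script>
      </body>
    </html>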
This document structure is motivated by the following considerations. First, the separation of the presentation from the rest localizes the natural language dependencies: an application can be ported to another language by changing only the presentation section, without affecting the other sections. Also, a good dialog must dynamically strike a balance between system initiative and user initiative.

… information such as coordinates where the click takes place. SALT extends the mechanism for speech input, in which the notion of semantic objects (Wang 2000, Wang 1998) is introduced to capture the meaning of spoken language. When the user says something, speech events, furnished with the corresponding semantic objects, are reported. The semantic objects are structured and categorized. For example, the utterance "Send mail to John" is composed of two nested semantic objects: "John," representing the semantic type "Person," and the whole utterance, representing the semantic type "Email command." SALT therefore enables a multimodal integration algorithm based on semantic type compatibility (Wang 2001). The same command can be manifested in a multimodal expression, as in "Send email to him [click]," where the email recipient is given by a point-and-click gesture. Here the semantic type provides a straightforward way to resolve the cross-modality reference: the handler for the GUI mouse click event can be programmed to produce a semantic