Annotations for HTML to Voicexml Transcoding: Producing Voice Webpages with Usability in Mind

Annotations for HTML to VoiceXML Transcoding: Producing Voice WebPages with Usability in Mind Zhiyan Shao Robert Capra Manuel A. Pérez-Quiñones Department of Computer Science Virginia Tech Blacksburg, VA Emails: {zshao | rcapra | perez}@vt.edu ABSTRACT Transcoding can be used to convert an HTML document Web pages contain a large variety of information, but are (e.g. a web page) to a VoiceXML document (e.g. a voice largely designed for use by graphical web browsers. interface). Transcoding is a method for translating one type Mobile access to web-based information often requires of code (e.g. HTML) into a different type (e.g. presenting HTML web pages using channels that are VoiceXML). Transcoders exist to convert HTML to limited in their graphical capabilities such as small-screens VoiceXML [12], to convert HTML to a format more or audio-only interfaces. Content transcoding and suitable for display on a PDA [8][2][5], and to convert annotations have been explored as methods for intelligently from a UIML document to multiple language documents presenting HTML documents. Much of this work has [1][14]. focused on transcoding for small-screen devices such as are The process of converting HTML code into VoiceXML found on PDAs and cell phones. Here, we focus on the use code is a fairly straightforward application of transcoding. of annotations and transcoding for presenting HTML However, a difficulty arises not in how to translate from the content through a voice user interface instantiated in original HTML to the target VoiceXML – but instead in VoiceXML. This transcoded voice interface is designed what the resulting VoiceXML code should be? For our with an assumption that it will not be used for extended work, we focus on transcoding an HTML page to produce a web browsing by voice, but rather to quickly gain directed VoiceXML page with high usability. access to information on web pages. We have found repeated structures that are common in the presentation of Transcoding the diverse set of “raw” information stored in data on web pages that are well suited for voice HTML web pages into VoiceXML is a significant presentation and navigation. In this paper, we describe challenge. First, the problem is difficult because much of these structures and their use in an annotation system we the information stored on the web has been specifically have implemented that produces a VoiceXML interface to engineered for use by graphical web browsers. The serial information originally embedded in HTML documents. nature of voice interfaces requires different presentation We describe the transcoding process used to translate strategies than the high-bandwidth, parallel nature of HTML into VoiceXML, including transcoding features we graphical interfaces. Furthermore, there is semantic have designed to lead to highly usable VoiceXML code. information that is implicitly encoded in the page contents can be difficult to obtain. Keywords VoiceXML, transcoding The problem of converting HTML-to-VoiceXML has some similarities to the problem of displaying HTML on small INTRODUCTION screen devices. The small screen-size constrains the Telephone-accessible voice interfaces to information stored amount of data that can be presented “at once” in a similar on the web are becoming more prevalent. Services such as way that the serial nature of voice limits data presentation. TellMe and BeVocal provide voice access to news, sports, Research on web page presentation for small screen devices weather, financial and entertainment information using has explored ways to summarize information and to divide VoiceXML-based interfaces that are accessible from any web pages into logical and physical-layout “segments” that telephone. However, much of the web-based information can be used to control the presentation of large or complex that is currently available through voice interfaces is pages. However, the small screen devices still rely on the limited in scope and has been selected and processed by the visual scanning capabilities of humans as desktop web voice service providers to integrate into existing voice browsers do, even if the screen size is smaller. The information systems that have been crafted for usability. A HTML-to-VoiceXML problem has an extra restriction -- wide variety of existing web content exists in the form of the serial (and thus slow) nature of voice presentation. HTML web pages. In this paper we present the results of our research into within the WTP. Hori et al. [8] designed a system to make building a servlet to implement part of a transcoder that HTML documents suitable for small-screen devices. produces VoiceXML pages from HTML files with external Asakawa et al. [2] used the same external annotations annotations. The servlet relies on semantic tags that are method and organize visually separate sections of an added to an HTML file to drive the generation of a HTML page so they can be presented together in a voice VoiceXML interface with high usability. For our initial presentation. Both projects focused on using annotations to work, we focused on structured web pages, such as those mark the importance value of an HTML page subsections available in news sites, like CNN.com or My Yahoo! Our and to then reconstruct the page for browsing in a PDA or a assumptions for this work are not that users will abandon voice browser. Our approach relies on external annotations their graphical web browsers in favor of voice web for the first phase (Phase I) of transcoding. In Hori [8] and browsing, but rather that mobile users will be able to gain Asakawa [2] annotations are used to highlight sections of directed access to information stored on HTML web pages pages that are of interest for later processing. We use through a usable, transcoded voice interface. annotations to remove unneeded data and to insert our own RELATED WORK new tags. The result is an XML file with information for a The IBM WebSphere Transcoding Publisher (WTP) [12] is voice interface. Then in Phase II of the transcoding, we use a commercial product that supports an HTML-to- a servlet transcoder to translate this XML stream into VoiceXML transcoder. Transcoders can be plugged into a VoiceXML output. WTP server that can be configured as a proxy. Client-side The Aurora transcoding system [9] has some similarities browsers can use this proxy to obtain transcoded content. with our work. They adapt web pages based on semantic The proxy intercepts the HTTP from the client-side information and build an XML document with extracted browser, fetches the requested HTML document, information. Their intent is for the transcoded document to transcodes it, and forwards on the transcoded document to support navigation in Internet Explorer and in IBM the browser. In the case of the HTML-to-VoiceXML Homepage Reader. Their work is more geared towards transcoder, a voice browser could be configured to use the specific tasks that users may perform, such as interaction proxy to receive VoiceXML versions of HTML web pages. with an auction site. The semantic information used for the IBM has an HTML-to-VoiceXML transcoder for use with transcoding must be produced manually. WTP that splits the HTML into two main sections in the Speech Application Language Tags (SALT) [16] is a VoiceXML code it produces: a main content section and a project that is in its early stages. SALT tags are added to an listing of all the links on the page [7]. In addition, a menu HTML document so that users with a special browser can is added to the VoiceXML file to allow users to navigate interact with the Web using graphics and voice at the same between the two other sections, or to exit. time. SALT is different from VoiceXML in several ways. The main content section contains the text from the web First, it is intended to extend the existing web browser with page. The transcoder divides the main content into a voice interface, and not as an alternative interface style. subsections based on heading tags (e.g. <h1>, <h2>, etc.). Also, SALT takes a programming approach to adding voice The text between heading tags is used to create menus and to the web, while VoiceXML uses a document-based speech recognition grammar choices in the VoiceXML approach. SALT applications are composed of objects , code. Users can speak any of the prompted choices to listen triggers and events, while VoiceXML applications are built to the paragraphs following the heading tags. The link list by combining tags into one or more documents. section contains a listing of all the links available on the XHTML+Voice [19] has an approach that is similar to page. The text between the HTML link tags (e.g. <a></a>) SALT, with a focus on supporting multi-modal devices. In is used in the speech recognition grammar choices and as this approach, existing VoiceXML tags are integrated into the text to be read in the list of link choices. XHTML. In contrast, SALT tries to integrate new tags into This transcoder has a simple translation strategy that works HTML. for simple HTML documents with clearly structured Several Interfaces, Single Logic (Sisl) [3] is an architecture heading tags and for documents that the text between the and domain-specific language (DSL) for single service <a></a> tags is context independent. Unfortunately, most logic to support multiple user interfaces. It provides a web pages do not have these characteristics. For example, high-level abstraction of the user-system interaction. The if the transcoder translates a page that has no heading tags idea is to design the transaction to be provided in DSL and in it, the result is of very low usability. The resulting then to convert it into several interfaces (including, for VoiceXML code presents all the text on the page example, VoiceXML and HTML).

Load more