SmartWeb Handheld Interaction (V 1.1): General Interactions and Result Display for User-System Multimodal Dialogue

Question answering (QA) has become one of the fastest growing topics in computational linguistics and information access. In this context, SmartWeb (http://www.smartweb-projekt.de/) was a large-scale German research project that aimed to provide intuitive multimodal access to a rich selection of Web-based information services. In one of the main scenarios, the user interacts with a smartphone client interface and asks natural language questions to access the Semantic Web. The demonstrator systems were developed between 2004 and 2007 by partners from academia and industry. This document provides the interaction storyboard (the multimodal design of SmartWeb Handheld's interaction) and a description of the actual implementation. We decided to publish this technical document in a second version in the context of the THESEUS project (http://www.theseus-programm.de), since this SmartWeb document provided many suggestions for the THESEUS usability guidelines (http://www.dfki.de/~sonntag/interactiondesign.htm) and the implementation of the THESEUS TEXO demonstrators. THESEUS is the German flagship project on the Internet of Services, where the user can delegate complex tasks to dynamically composed semantic web services by utilizing multimodal interaction combining speech and multi-touch input on advanced smartphones. More information on SmartWeb's technical question-answering software architecture and the underlying multimodal dialogue system, which is further developed and commercialized in the THESEUS project, can be found in the book: Ontologies and Adaptivity in Dialogue for Question Answering.

Technisches Dokument Nr. 5

V 1.0 February 2007
V 1.1 December 2009

Daniel Sonntag Norbert Reithinger

DFKI GmbH Stuhlsatzenhausweg 3 66123 Saarbrücken

Tel.: (0681) 302-5254 Tel.: (0681) 302-5346 Fax: (0681) 302-5020

E-Mail: [email protected] [email protected]

This Technical Document belongs to Subproject 2: Distributed Multimodal Dialogue Component for the Mobile Device (Teilprojekt 2: Verteilte Multimodale Dialogkomponente für das mobile Endgerät). The research underlying this Technical Document was funded by the German Federal Ministry of Education and Research (BMBF) under grant 01 IMD 01. Responsibility for the content lies with the authors.

1 Introduction

The SmartWeb project lays the foundations for multimodal user interfaces for distributed and composable Semantic Web access on mobile devices. http://smartweb.dfki.de/ provides further information. This document describes the general interaction possibilities for the SmartWeb Handheld client, in particular the result display and the user-system multimodal dialogue. We concentrate particularly on the interaction possibilities of SmartWeb System Integration 1.0. The interaction design and development process was supported by iterative prototyping, close involvement of potential users, and informal user testing.

Since this is an internal and not a public document, we refrain from citing sources in the text. Instead, please refer to the references at the end of this report. This document describes past work within the SmartWeb project and is due to revision without notice. It should be read jointly with Technical Document 2 about the Interaction Storyboard and Technical Document 3 about the architecture of the dialogue components.

The latest revision of the following dialogue components can be found at the dedicated wiki pages within the SmartWeb Developer Portal at http://smartweb.dfki.de/sys/trac/wiki:

• Information Hub for Server-side Dialogue Processing,
• Graphical User Interface Application for the Handheld Dialogue Client,
• Reaction and Presentation Component for Server-side Dialogue Processing,
• Discourse Ontology,
• Server-side Multimodal Dialogue Manager,
• Server-side Media Fusion and Discourse Processing,
• Server-side Language Understanding,
• Adapted Knowledge Sources for Server-side Speech Recognition,
• Server-side Text Generation, and
• the Integrated Demonstrator.

The current version of the self-running SmartWeb presentation can be downloaded from http://smartweb.dfki.de/Intro_Demo/start.html.

1.1 General Introduction to Interactions

The personal device for the SmartWeb Handheld scenario is T-Mobile's PDA (personal digital assistant), the MDA 4 / MDA pro. This device offers several possibilities for interaction: the microphone, the stylus (for both mouse replacement and handwriting recognition), the Qwertz keypad, as well as several programmable buttons, mainly on the right side of the device. The MDA 3, on which the first storyboard was based, differs mainly in terms of the display resolution and the programmable button on the front side.

The goal of this project is to allow for interactions in the form of a combination of voice and graphical user interface in a generic setting. "Generic setting" means that the input modality mix can be reduced in a modality busy setting, e.g., when the phone is at the ear, or when the user's location is too loud. In general, "modality busy" means that any given modality is not available, not commutable, or not appropriate in a specific situation. In the scenario we model, the interaction is mainly provided by voice input and output, and interaction requires a more powerful client device with embedded speech recognition capabilities. In any case, much attention will be drawn to speech input, since the user interaction device is a phone and speech input is, therefore, most natural. For result presentation, different circumstances are expected. Many results are not naturally presented in an auditory manner, which means that we plan to exploit the graphical display as much as possible. Accordingly, all results will be presented on screen at least.

If multiple modes are used for input, multimodal discourse phenomena such as multimodal deixis, cross-modal reference, and multimodal turn-taking must be considered and modelled. The interaction is to be characterised as composite multimodal, and the challenge is to fuse modalities to composite input in order to represent the user's intention. Composite multimodal interaction does not only evoke new challenges on interaction design, but also provides new possibilities for the interpretation of user input, allowing for mutual disambiguation of modalities. Composite input is input received on multiple modalities at the same time and treated as a single, integrated compound input by downstream processes (see, e.g., W3C Multimodal Interaction Requirements, http://www.w3.org/TR/mmi-reqs/).
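As an illustration of the composite-input idea described above, the following sketch (hypothetical class and method names, not the actual SmartWeb fusion component) shows how events from several modalities that fall into one time window could be collected into a single compound input:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: a compound input object that collects events
// from several modalities (speech, pen gesture) arriving within one time
// window, so that fusion can treat them as a single user contribution.
public class CompositeInputSketch {

    record ModalityEvent(String modality, String content, long timestampMs) {}

    record CompositeInput(List<ModalityEvent> events) {}

    // Merge all events that lie within maxGapMs of the first event.
    static CompositeInput fuse(List<ModalityEvent> incoming, long maxGapMs) {
        List<ModalityEvent> window = new ArrayList<>();
        long start = incoming.get(0).timestampMs();   // assumes a non-empty list
        for (ModalityEvent e : incoming) {
            if (e.timestampMs() - start <= maxGapMs) {
                window.add(e);
            }
        }
        return new CompositeInput(window);
    }

    public static void main(String[] args) {
        List<ModalityEvent> events = List.of(
            new ModalityEvent("speech", "How many goals did this player score?", 0),
            new ModalityEvent("pen", "pointing@player:Aldair", 350));
        System.out.println(fuse(events, 2000));
    }
}
```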

SmartWeb's main Graphical User Interface (GUI) is shown in the picture to the upper right. We implemented the rendering surface according to the storyboard requirements explained in Technical Document 2; the rendering surface is roughly segmented into three sections, for input, output, and control, respectively. The input/editor field can be used to write questions. In case of active speech recognition, the best speech interpretation is shown in this input/editor field for visual inspection (user perceptual feedback) and editing possibilities. Section 2 comprises the output field and the media space for images, videos, and longer text documents. This input/output field should reflect the general SmartWeb Question Answering (QA) paradigm in the visual surface. QA can be defined as the task of finding answers to natural language questions by searching large document collections. Unlike information retrieval systems, question answering systems do not retrieve documents, but instead provide short, relevant answers located in small fragments of text. Accordingly, the input field is for posing a complex question, and the output field should display a short relevant answer to it. This answer can also be seen as a headline to more complex (multimedia) answers, when a greater amount of information should be displayed than a few words, or when multimedia content should be presented. For example, the question "Show me a picture of the 2006 World Cup's mascots" results in displaying the name Goleo in the answer field, followed by the image, both of which are relevant parts of the multimodal answer (see the graphic above).

Speech synthesis also plays a major role in presenting results. We decided to use speech synthesis both to notify the user that SmartWeb obtained an answer and to present content information about the answer concurrently. The output we generate for the output field seems to provide the most suitable information for synthesis. In some cases, the answer field information is a bit shorter than the synthesis. We display 'traffic condition', for example, but synthesise 'traffic condition between Aachen and Bonn'.

2 Interaction, Communication, and Synchronisation

Interaction in a multimodal system ties aspects of communication and synchronisation closely together. All interaction channels, most prominently the output channels, must be synchronised in order to render the multimodal information to be presented. Here we address this topic for the system results in particular.

2.1 Requirements

We will now describe some important communication and synchronisation considerations in the context of interactions. These take place in the dialogue until the result is presented by the system. Keep in mind that many communication aspects are dependent on presentational aspects:

• Display of unrecognised tokens in a different colour after recognition, i.e., out-of-vocabulary (OOV) tokens are visible in red text colour. In case of low confidence values, the selection of an OOV word activates a drop-down list of all OOV candidates for selection. The first candidate is the displayed red word token.

• Audio repetition of the paraphrased query is available, with the possibility of interrupting (editing) the audio output as well. Editing should be allowed via speech, pen, or keyboard. (The paraphrased query is displayed on screen anyhow.)

• Pull-down menu for word hypotheses in correction mode. A precondition is that a confusion set can either be produced by lexical knowledge of the dialogue system or by a built-in functionality of the speech recognition system. This assumes that the word hypothesis graph can be backed off to unigram probabilities (single word probabilities). Audio correction mode is also interesting in this regard, for example if the speech recognition system is adjustable to run for single words to be uttered. In SmartWeb we do not use dynamic ASR grammars; instead, short meta commands such as 'further', 'left', or 'as table view' are included in one general speech recognition grammar. For word hypotheses we pursue a graphical correction selection approach instead of audio correction.

• Incremental display: (1) for later incoming results, (2) for relevance feedback queries.

• Higher level selection modus, primarily for result lists: (1) quick pen list selection, (2) confirmations/approval (OK) also by onscreen touch on dynamically presented items such as images and table cells.

• Display of result confidences: either low-level hit/source size information or higher-level confidence display, at best with symbols. In any case, background statistics should be made available, e.g., as a link. For factoid questions in the open question answering scenario, best hits count most. This leads to the question of which (important) background information is to be provided for the user for result explanation and provenance.

• User can approve selections/results by voice. A challenging topic will be to allow for selection by voice in deictic utterances: "Show me this and that …" Dialogue-based relevance-feedback queries are often deictic in nature, such as: "Show me more like this." Fusion of multimodal input, where the user clicks on items first, is a related topic of interest.

• Different displays for different result types: reactions, intermediate results, final results, status information, background information. In addition (and orthogonally), the display depends on the information content itself to be rendered: the type of the question, the media of available resources, the cardinality of results, and the size of single result items (consider the different forms of semantic/syntactic web results, for example). We employ different graphical sources and presentation metaphors for different presentation requirements of the interaction; most of them are represented by different result sets of the interaction pattern branch of the discourse ontology.

• Synchronisation of speech synthesis and textual GUI results: speech synthesis belongs to the more obtrusive output channels. General guidelines recommend keeping acoustic messages short and simple. In the context of handheld devices, this guideline should be followed even more strictly, since the handheld device loudspeaker has very limited capabilities and the social and situational context during its use (confer the modality busy setting) often does not permit longer speech syntheses. Hence, the synthesis should correspond to a smaller fragment of text which is displayed concurrently and which can be seen on screen even after synthesis. In order to synchronise speech synthesis and GUI result presentation, we implement a notification framework that starts the synthesis when the output/answer field has been rendered (see the sketch below). In this way the speech synthesis component can be synchronised with the dialogue-system-internal processing. As it turns out, suitable timing parameters for the start of the synthesis depend heavily on the server machine the synthesis is processed on and on where the GUI is started. We optimised the synchronisation behaviour for the demonstrator system using DFKI's demonstration servers and the MDA4 handheld device.
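The render-then-synthesise notification idea from the last requirement can be sketched as a simple listener; all names and the example strings below are illustrative and do not reflect the actual SmartWeb implementation:

```java
// Minimal sketch of the render-then-synthesise notification idea.
// All class and method names are illustrative, not the SmartWeb API.
public class SynthesisNotificationSketch {

    interface RenderListener { void onAnswerFieldRendered(String displayedText); }

    static class AnswerField {
        private RenderListener listener;
        void setRenderListener(RenderListener l) { this.listener = l; }
        void render(String text) {
            System.out.println("GUI shows: " + text);
            if (listener != null) listener.onAnswerFieldRendered(text);
        }
    }

    static class SpeechSynthesizer {
        void speak(String text) { System.out.println("TTS speaks: " + text); }
    }

    public static void main(String[] args) {
        AnswerField field = new AnswerField();
        SpeechSynthesizer tts = new SpeechSynthesizer();
        // Synthesis starts only once the (shorter) display text is on screen;
        // the spoken text may be longer than the displayed fragment.
        field.setRenderListener(shown -> tts.speak(shown + " between Aachen and Bonn"));
        field.render("traffic condition");
    }
}
```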

2.2 Implementations

In this section we will describe the implementations of the correction possibilities the user should be able to test, and the implementation regarding the presentation of multimodal results.

2.2.1 Correction Possibilities

Probably the most important question concerning the user interaction model is how to correct invalid user input which stems from recognition errors or from natural language understanding errors. New interaction methods have to be defined in order to repair an invalid user input. This becomes even more serious in the context of composite multimodality, where the dialogue system must understand and represent the multimodal input. For SmartWeb, we decided on the following correction possibilities and directed them to the physical MDA user interface.

Correction of textual query: A textual query such as "Who was world champion in 1990?" can be edited directly in the editor text area which displays it after speech interpretation. When the editor field is clicked, the correction mode is activated and, e.g., the device keyboard, provided by the Windows Mobile operating system, appears. The handwriting recogniser is preferred because it allows intuitive word selection on screen. The user could then, for example, simply click on or underline an incorrect word.

For example, the query "Wer war Weltmeister 2002? / Who was world champion in 2002?" is shown below. If the user presses the editor button, the editor text field is enlarged. The user can switch from keyboard to transcriber before SmartWeb is started. (Tap the input panel icon at the bottom center of the screen and tap the input selector arrow. See also 'Edit text using Transcriber' in the Windows Mobile help pages.) The transcriber gestures, such as quick strokes of the stylus, and its options are thoroughly available within the SmartWeb GUI. Best practice is to change to the editor mode to increase the font size when using the transcriber to correct words.

Out-of-vocabulary (OOV) word correction: The look-and-feel can be described as follows. A click on an OOV word or word group in red font directly enables the correction mode of the word(s). The user can select other word hypotheses displayed in the pull-down menu or use the keyboard/handwriting recogniser instead. The figure on this page shows an example of out-of-vocabulary word correction.
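A minimal sketch of how the OOV pull-down described above could be filled, assuming a confusion set is already available from the recogniser or the lexicon (names and example words are invented):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the pull-down content for out-of-vocabulary (OOV) correction:
// the displayed (red) token comes first, followed by the remaining word
// hypotheses from the confusion set. Names are illustrative only.
public class OovCorrectionSketch {

    static List<String> pullDownEntries(String displayedToken, List<String> confusionSet) {
        List<String> entries = new ArrayList<>();
        entries.add(displayedToken);               // first candidate = shown token
        for (String hypothesis : confusionSet) {
            if (!hypothesis.equals(displayedToken)) {
                entries.add(hypothesis);           // alternatives from ASR / lexicon
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        System.out.println(pullDownEntries("Weltmeister",
            List.of("Weltmeister", "Weltreise", "Wertmeister")));
    }
}
```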

2.2.2 Result Presentation

In this section we will discuss the presentation of results stemming from the Semantic Web access mediated by the Semantic Mediator component. It was already mentioned that visual feedback will be provided on the screen at each dialogue stage, especially in order to provide feedback for speech input. Not only user feedback, but also search results will be presented on screen at least. In addition, other modalities can be selected, but depend on the context in which the system is used and the data's suitability for being presented in a special modality.

2.2.2.1 Perceptual Feedback Requirements and Implementation

Perceptual feedback allows the user to understand the current processing state of the system, in order to react accordingly. Although we do not use a lifelike character like Smartakus (see the SmartKom project), system activity and responsiveness can be expressed by liveliness of the presentation elements in terms of metaphors. The system states that are used as perceptual states are (1) Listening/Idle State, (2) Recording (green and red icon), (3) Understanding, (4) Query Processing (status bar), (5) Presentation Planning, and (6) Presenting (see also the sketch after this list). Visual feedback is provided on the screen at each prominent dialogue stage, especially in order to provide feedback for speech input.

1. Listening/Idle State: Green micro and pulsing SmartWeb logo. The pulsing SmartWeb logo indicates that the system is reactive towards speech, text input, and click events. When the logo is static, the system is busy. This can occur, for example, if larger XML result messages are being sent to the device. In this case, the user should wait for the logo to pulse again before addressing SmartWeb.

(The logo is also static while the progress bar is playing, but this is not concerned with processor load.)

2. Speech Recording/Recognition State: The following figure illustrates the different microphone states: recognition inactive, recognition ready, and recognition running.

3. Understanding: Semantic paraphrase and concept icons. Semantic paraphrases display the semantic interpretation of the user's utterance in ontology concept notation. Concept icons present feedback on question understanding and answer presentation in a language-independent, conceptual way. The movie icon corresponds to the 'Film (gefragt) / Movie (asked)' text. The sun icon, for example, complements textual weather forecast results and conveys weather condition information.

4. Query Processing: Progress bar with the following emblem.

5. Presentation Planning: Indicated as text output in the status bar (Resultat wird generiert / Results are being generated).

The click-down area of the inactive microphone is designed to be larger than its visual appearance. This way the user can activate the micro in push-to-talk mode more easily with his thumb, holding the device in his left hand.

6. Presenting: Answers are loaded into the answer field and the media screen, to be accessed by the control buttons and NaviLinks (explained further down). The end of query processing is indicated in the status bar; potential result filtering is also indicated in the status bar on-time (Recherchen abgeschlossen / Search terminated).
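The six perceptual states listed above can be summarised as a simple state-to-indicator mapping; the following sketch is illustrative only, and the indicator strings merely paraphrase the descriptions in this section:

```java
// Sketch of the six perceptual feedback states listed above, mapped to the
// visual indicator shown in the GUI. The mapping strings are illustrative.
public class PerceptualStateSketch {

    enum SystemState { LISTENING_IDLE, RECORDING, UNDERSTANDING,
                       QUERY_PROCESSING, PRESENTATION_PLANNING, PRESENTING }

    static String indicatorFor(SystemState state) {
        return switch (state) {
            case LISTENING_IDLE        -> "green microphone, pulsing SmartWeb logo";
            case RECORDING             -> "green/red recording icon";
            case UNDERSTANDING         -> "semantic paraphrase and concept icons";
            case QUERY_PROCESSING      -> "progress bar in the status bar";
            case PRESENTATION_PLANNING -> "status bar text: Resultat wird generiert";
            case PRESENTING            -> "answers loaded into the answer field";
        };
    }

    public static void main(String[] args) {
        for (SystemState s : SystemState.values()) {
            System.out.println(s + " -> " + indicatorFor(s));
        }
    }
}
```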

Additional perceptual feedback includes:

• Zoom in on selected items, implemented by the focus concept (illustrated in the three screenshots below). The focus concept is a metaphor for the indication of a specific entity in the user's visual focus. A blue frame indicates the selection of a single table cell, which can be a graphic, an image, or a text. This information is then transferred to the dialogue engine as the focus for further processing. It helps to find references for verbal deictic utterances such as "What is it?" Since the focus is set on one particular table cell, it is clear that the question refers to it. The focus frame is a blue transparent frame around the focus object which appears on the screen as an animation. The particularity is that the other visual content apart from the focus concept is still visible and buttons can still be pressed. The blue frame disappears if the focus is unset or a new query is being posed, in which case the focus is unset automatically. Users can be trained to use visual foci before they utter something. This enhances the reliability of input fusion enormously (see the sketch after this list).

Focus concept: Pressing a table cell sets a focus on it. If the focus is set to an image icon, the icon is focussed visually. If the focus is set to text, the text is also focussed visually. The focus concept represents a possible referent for input fusion.

• Explanation on selection, implemented by the status bar in normal mode and by synthesis in full-screen mode. If, for example, a POI is selected, the user gets information about that POI in the status bar. The figure at the top of this page provides an example.

• Transparent popup menus: under experimentation for textual POI information which does not fit into the status bar.

2.2.2.2 Portrait and Landscape Orientation

Portrait and landscape orientation of the same information content is shown in the two screenshots below. In portrait orientation the user normally holds the device in one hand and uses the stylus with the other. In landscape orientation the MDA pro can also be held in both hands. The buttons in the left and right margins can be pressed with the left or right thumb, respectively.

2.2.2.3 Multimodal Answer Generation and Result Display

The input of the generation module consists of instances of SWIntO (the SmartWeb Integrated Ontology) representing the search results. These results are then verbalized in different ways, e.g., as a heading, as a row in a table, as a picture caption title, or as text which is synthesized. Please refer to the graphic above for an example.

2.2.2.4 Result Display of Open Domain Question Answering

Open domain question answering retrieves answers from Internet pages and PDF documents. The front-end component of the question answering sub-system re-ranks the results from the separate information extraction components, which perform passage retrieval and entity extraction as well as layout-supported extraction of images and information objects. We present the five best answers to the user in the answer tab, in textual form. In addition, images can be retrieved and browsed by selecting them with a press of the image button. For each short answer, additional textual material is available when the text button is pressed; this accesses a generated answer document. Each generated text document comprises the following information: (1) the short answer and retrieval confidence value, (2) the source server where the answer was found, (3) the document frequency of the answer within the filtered set of retrieved documents, and (4) concordance information, i.e., the paragraph where the answer has been found. One example of this is provided in the figure below.
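The four-part answer document described above could be represented roughly as follows; the field names and example values are invented for illustration and are not the SWIntO vocabulary:

```java
import java.util.List;

// Sketch of the four-part structure of a generated answer document as
// described above; field and value names are illustrative only.
public class AnswerDocumentSketch {

    record AnswerDocument(String shortAnswer,
                          double retrievalConfidence,   // (1) answer + confidence
                          String sourceServer,          // (2) where it was found
                          int documentFrequency,        // (3) frequency in result set
                          String concordanceParagraph   // (4) surrounding paragraph
                          ) {}

    public static void main(String[] args) {
        List<AnswerDocument> topFive = List.of(
            new AnswerDocument("Brasilien", 0.87, "qa-server-1", 12,
                "... Brasilien gewann das Finale der WM 2002 gegen Deutschland ..."));
        topFive.forEach(System.out::println);
    }
}
```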

2.2.2.5 Listening to Audio

Audio material can be listened to by clicking the audio button (see image above). If the MP3 controller is activated in the preferences window, a panel with additional control buttons is displayed. A similar controller is available for video content. Note: Unfortunately, some Linux Flash players do not support the controller button functionality.

2.2.2.6 OnFocus / OffFocus Status Display

Two components determine the attentional state of the user: the OnView recogniser and the OnTalk recogniser. The task of the OnView recogniser is to determine whether the user is looking at the system or not. The OnView recogniser analyses a video signal captured by a video camera linked to the mobile device. For each frame, it then determines whether the user is in OnView or OffView mode. These two signals are combined into an OnFocus/OffFocus user state. In the OffFocus state, the speech input obtained from the speech recogniser is omitted in dialogue processing. The visual feedback on the screen is a blue animated microphone (see figure to the left).
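A rough sketch of combining the OnView and OnTalk signals into the OnFocus/OffFocus user state; the combination rule shown here is an assumption for illustration, since the exact decision logic is not specified in this document:

```java
// Sketch of combining the OnView and OnTalk signals into an OnFocus/OffFocus
// user state; the combination rule shown here is a plausible simplification.
public class AttentionStateSketch {

    enum UserState { ON_FOCUS, OFF_FOCUS }

    static UserState combine(boolean onView, boolean onTalk) {
        // Assumed rule: the user is attending to the system if either the
        // camera reports OnView or the speech is classified as directed (OnTalk).
        return (onView || onTalk) ? UserState.ON_FOCUS : UserState.OFF_FOCUS;
    }

    static void handleSpeechInput(String recognised, UserState state) {
        if (state == UserState.OFF_FOCUS) {
            System.out.println("OffFocus: dropping input '" + recognised + "'");
        } else {
            System.out.println("OnFocus: forwarding '" + recognised + "' to dialogue processing");
        }
    }

    public static void main(String[] args) {
        handleSpeechInput("Wer war Weltmeister 1990?", combine(true, true));
        handleSpeechInput("(side talk to a friend)", combine(false, false));
    }
}
```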

3 Interaction Metaphors Based on Storyboard

"Advances and breakthroughs in computer graphics have made visual media the basis of the modern user interface, and it is clear that graphics will play a dominant role in the way people communicate and interact with computers in the future. Indeed, as computers become more and more pervasive, and display sizes both increase and decrease, new and challenging problems arise for the effective use and generation of computer graphics.

"Recent advances in computer graphics have allowed AI researchers to integrate graphics in their systems, and on the other hand, many AI techniques have matured to the point of being easily used by non-specialists. These very techniques are likely to be the vehicle by which both principles from graphic design and the results of research into cognitive aspects of visual representations will be integrated in next generation graphical interfaces." (Smart Graphics Enterprise, www.smartgraphics.org)

Let us start with an example showing how we implemented a special interaction metaphor in the storyboard. With this, we mean the display of help or recommendation information with the focus concept. The "Empfehlung" (recommendation) tab is used to present recommendations to the user. Please refer to the figure at the top right of the next page. The left figure shows the storyboard of the help information design.

A click on the help or recommendation icon enlarges the icon in order to display the type of help or recommendation. In the following example, the boy indicates a general recommendation.

Special recommendation types will be designed according to the classification of the recommended system. A click on a recommendation (in the second column of the recommendation table) activates the recommendation. This means that the recommendation is copied into the editor field. In order to send the recommendation as a request, the user can click on the cursor button (circled in red in the figure at the bottom right of the previous page). This is similar to the process of typing normal questions into the system with a keyboard.

4 More Interaction Examples

For further clarification, we will report on some tested interaction sequences. Other interaction sequences and question patterns can be downloaded at: http://smartweb.dfki.de/sys/trac/wiki/SampleUtterances.

Please make sure your dialogue server is running and you are connected. See the image to the right for a screenshot of the welcome window.

4.1 Interaction Sequence Example - Knowledge Server

1. User: "When was world champion?"
2. System: "In the following 4 years: 1954 (in ), 1974 (in Germany), 1990 (in ), 2003 (in USA)."

3. User: "And Brazil?"
4. System: "In the following 5 years: 1958 (in Sweden), 1962 (in Chile), 1970 (in Mexico), 1994 (in USA), 2002 (in Japan)." + [team picture, MPEG7-annotated]
5. User: Pointing gesture on player Aldair + "How many goals did this player score?"

4.2 Interaction Sequence Example Knowledge Server

Answers from Test Release 0.7, 4 March 2007:

1. User: "When was Germany world champion?"
2. System: "In the following 4 years: 1954 (in Switzerland), 1974 (in Germany), 1990 (in Italy), 2003 (in USA)."
3. User: "And Italy?"
4. System: "In the following 4 years: 1934 (in Italy), 1938 (in ), 1982 (in ), 2006 (in Germany)."
5. User: Click on Pirlo, Andrea of the 2006 World Cup + "How many goals did this player score?"
6. System: "1 goal… during the finals…"

4.3 Interaction Sequence Example Web Services

7. User: "Where can I find Italian restaurants here?"
8. System: "Map" + Infos are displayed.
9. User: "And Greek ones?"
10. System: "Map" + Infos are displayed.
11. User: Clicks on Restaurant-POI + "How do I get here?"
12. System: "Route" + Calculated route is displayed.
13. User: "Show me ATMs."
14. System: Map + Restaurant POIs + ATM POIs are displayed.
15. User: Click on ATM-POI + "Give me more information about this."
16. System: "Bank of America" + further information, if available.

4.4 Interaction Sequence Example Knowledge Server

17. User: "Who was world champion in 1954?"
18. System: "Germany"
19. User: Looks at pictures (-> focus + deictic references are set) + "Who is the third person from the left?" ("from the top left" if there are several rows)
20. System: "Horst Eckel" + picture detail
21. User: "Go back!"
22. System: Shows the answer to the previous World Cup champion question again.
23. User: "Who is the player to the right of Horst Eckel?"
24. System: "Helmut Rahn."

4.5 Interaction Sequence Example Web Services

25. User: "What will the weather be like on Saturday?"
26. System: "The weather on Saturday will be..."
27. User: "And tomorrow?"
28. System: "The weather tomorrow will be..."

4.6 Interaction Sequence Example

29. User: "How do I get from Berlin to?"
30. System: Follow-up question...

4.7 Simple Factoid Questions

"Wer war 2002 Weltmeister? / Who was world champion in 2002?" – Brasilien / Brazil (see graphic below)

4.8 More Complex Questions with More Complex Answer Structures (Answers with MPEG-7 Annotated Image Results)

"Gegen wen spielte Frankreich bei der WM 1998? / Against which teams did France play at the World Cup in 1998?" – 8 Spiele / 8 games (see graphic below)
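The follow-up questions in these sequences ("And Brazil?", "And tomorrow?") are elliptical and must be resolved against the previous query. The sketch below is only an illustration of this idea with invented slot names; it is not the SmartWeb discourse model:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch (not the SmartWeb discourse model) of resolving an
// elliptical follow-up such as "And Brazil?" by copying the previous query
// frame and overwriting the slot that the fragment mentions.
public class FollowUpSketch {

    static Map<String, String> resolveFollowUp(Map<String, String> previousFrame,
                                               String slot, String newValue) {
        Map<String, String> resolved = new HashMap<>(previousFrame);
        resolved.put(slot, newValue);
        return resolved;
    }

    public static void main(String[] args) {
        Map<String, String> previous = Map.of(
            "intent", "world-champion-years", "team", "Germany");
        System.out.println(resolveFollowUp(previous, "team", "Brazil"));
    }
}
```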

4.9 Referencing Displayed Images

"Wann war diese Mannschaft Weltmeister? / When was this team world champion?" – In den folgenden 5 Jahren / In the following 5 years (see graphic below)

4.10 Referencing Mpeg-7 Sub-Annotations

[player click] + "Wie viele Tore hat dieser Spieler geschossen? / How many goals did this player score?" (see graphic below)

4.11 Referencing Mpeg-7 Sub-Annotations

[POI click] + "Wo gibt es Italiener in Saarbrücken? / Where are Italian restaurants in Saarbruecken?" (see graphic below)

4.12 Questions and Answers with Video Content

"Zeige mir ein Tor im Spiel Deutschland gegen Brasilien 2002 / Show me a goal in the 2002 match Germany against Brazil" – 6 Tore / 6 goals (see graphic below)

Fullscreen and Synthesis Mode: When the user changes to fullscreen mode, picture dragging, zooming, and sub-selecting are possible. Dragging means the user can touch the screen and drag the picture to display a special area of the picture. The picture can be zoomed in and out; the pointing gestures are interpreted according to the changing zoom factor. A special image part can be sub-selected by pointing on screen. If the sub-selection has a label attached to it, the label is synthesised in fullscreen mode, contrary to being displayed in the status/caption area in normal mode.

Video content can be accessed by the second media button on the bottom panel. When the user presses the video button, the video starts downloading and streaming.

Note: In order to enhance the playback quality, we tested reducing the resolution of the GUI screen while playing. In addition, the screen listeners are reduced to two buttons, the text button and the micro button. If your version actually reduces the resolution, which can be easily seen, please press the text button after watching the video in order to return to non-video mode. In order to pose a new question, press the micro button as usual. In any case, the video player is stopped, the resolution of the GUI screen is enhanced again, and all screen listeners are reactivated. (The video controller in the extra panel is still under development; please excuse the inconvenient use of the text button to return to normal mode if the resolution is changed while playing.)

Note: In some environments, and due to the download speed, the video can be displayed in its preferred size as soon as the download is complete. The streaming option is activated for all environments.

NaviLink structures: In the context of new result structures in semantic ontological form (proximity graphs, generated tables) as deeply nested information and knowledge structures, new presentation and interaction metaphors have been developed and implemented. The NaviLink structures, which follow a kind of automatic and dynamic hyperlink generation approach, have been designed for the SmartWeb mobile handheld scenario.

"Zeige mir die Maskottchen der WM / Show me the World Cup mascots" – 11 Maskottchen / 11 Mascots

The deeply structured results obtained from the Semantic Web access for this question can be seen in the screenshot of the OntoBroker output (http://smartweb.dfki.de/sys/svn/trunk/ontoserver/ontoserver-current-questions.html), presented at the bottom of this page. These structures serve as input for multimodal answer generation, hyperlink generation, and presentation (please refer to the figure at the top of this page).

When asking for a table as an additional inquiry ("Als Tabelle bitte / as table please"), this part of the proximity result graph is also turned into a table structure by the generation module. This table structure is presented to the user by an additional NaviLink flag "Tabelle/table". An extensive example of this is provided on the following page with a detailed explanation.
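The "as table please" conversion can be pictured as flattening a small part of the result graph into rows; the following sketch uses invented node and attribute names and is not the actual generation module:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of the "als Tabelle bitte / as table please" idea: a small part of
// a proximity result graph (here: mascot nodes with attributes) is flattened
// into table rows. Data and field names are invented for illustration.
public class GraphToTableSketch {

    record Node(String label, Map<String, String> attributes) {}

    static List<List<String>> toTable(List<Node> nodes, List<String> columns) {
        List<List<String>> rows = new ArrayList<>();
        for (Node n : nodes) {
            List<String> row = new ArrayList<>();
            row.add(n.label());
            for (String c : columns) {
                row.add(n.attributes().getOrDefault(c, ""));   // missing values stay empty
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Node> mascots = List.of(
            new Node("Goleo", Map.of("year", "2006")),
            new Node("Footix", Map.of("year", "1998")));
        toTable(mascots, List.of("year")).forEach(System.out::println);
    }
}
```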

(1) In response to the question "Welche WM Maskottchen gibt es? / Which World Cup mascots are there?" the system provides a list. If the user prefers the information to be presented in a different format, he can specify "als Tabelle bitte / as table please" or type the request into the keyboard provided on screen.

(2) In our example, the user chooses to also have a table of the results, which is provided in a second NaviLink tab. He can then choose to work with the results as presented initially, or switch to the provided table.

(3) When switching to the table, small icons of the mascots are provided, each of which may be clicked for a close-up. The information provided in the table is modified (e.g., the years are not included).

(4) In this screenshot, the user has chosen one of the mascots to look at more closely. Here we have an example of the focus concept. The focus is on the object in white with a transparent blue focus frame surrounding it.

(5) The user has chosen another mascot to examine. Again, everything except the object in focus is blue. In this combination of focus concept and table, it is possible to ask questions which refer only to the object focused on. More importantly, there is no need to make a specific reference to the object, since the focus concept has already highlighted it. Instead of asking "What year did the World Cup have Striker as a mascot?", the user can simply ask "Which year was this the mascot?"

(6) This screenshot shows an example of the focus concept as it pertains to text only. In this case, questions may be asked specifically about World Cup Willie without the usual necessity of providing his name directly.

Cross-modal reference onto dynamically generated presentation items (movie titles) for continued dialogue interaction: "Was kommt heute abend in Saarbrücken im Kino? / What's playing at the cinema in Saarbruecken tonight?" – 26 Filme / 26 movies – "Wo läuft dieser Film? / Where is this movie playing?" (see graphic above left)

"Zeige mir mehr Informationen zu diesem Film / Show me more information about this movie" (see graphic above right)

Partial queries are queries which cannot be processed at once because information is missing. These queries can only be processed by user interfaces and AI systems which either know how to supplement the query automatically or provide means for initiating a subdialogue to ask the user for the missing information. According to our storyboard, we rely on the dialogue manager's competence to initiate clarification requests. For example: "Wie komme ich von Saarbrücken / How do I get from Saarbruecken" – "Rückfrage: Wohin wollen Sie fahren / Follow-up request: Where do you want to go" – "Berlin". Please see below for an example of how we model clarification requests on the mobile user interface.
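A minimal sketch of the clarification-request behaviour for partial queries, assuming the query is represented as a set of filled slots and the route service requires both an origin and a destination (slot names are invented):

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Sketch of a partial-query check: if a required slot is missing, a follow-up
// request is generated instead of starting the search. Slot names and the
// example values are invented for illustration.
public class ClarificationSketch {

    static Optional<String> clarificationFor(Map<String, String> query,
                                             List<String> requiredSlots) {
        for (String slot : requiredSlots) {
            if (!query.containsKey(slot)) {
                return Optional.of("Follow-up request: please provide '" + slot + "'");
            }
        }
        return Optional.empty();   // the query is complete, the service can be called
    }

    public static void main(String[] args) {
        // "Wie komme ich von Saarbruecken ..." - the destination slot is missing.
        Map<String, String> partial = Map.of("intent", "route", "from", "Saarbruecken");
        System.out.println(clarificationFor(partial, List.of("from", "to"))
                           .orElse("query complete"));
    }
}
```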

5 Notes on GUI Elements and Implementation

Presentation elements fit their purpose best if they are not only informative, but also aesthetically satisfying; humans perceive better when the informative content is pleasing in addition to being functional. We deliver different visualisations for different data characteristics, mostly distinguished by the availability of ontological result representations, Mpeg-7 annotations, and cardinality (one multimodal result, a result list, or an ordered sequential instructive text such as a route description).

However, for the graphics and GUI surface we decided to locally play Flash movies on the PDA device as the user interface. Flash MX 2004 / Flash 8 was the latest version of Macromedia Flash in 2007. (In the context of the THESEUS project (2007-2011), we now use Flash 10 for the CoMET system, and for the MEDICO use case we have used Adobe Flex (AIR) technology.)

We have been mainly interested in developing interactive user interfaces quickly, and Flash is meant to provide us not only with the development tool, but also with software for easy deployment on the mobile PDA device. During development and GUI implementation, especially for the handheld scenario, it became evident that the provided Flash running environment for mobile devices is at a very prototypical development stage. Many aspects and API functions of the desktop environment cannot be used on the PDA for several special reasons, such as player version availability, execution speed, and compatibility on future (mobile) devices. For more information, please see http://www.dfki.de/~sonntag/SMARTWEB/Interaction_Study_Haptic_Mobile_Interfaces.html.

Due to the processor restrictions on interaction elements on the PDA device, the following interaction metaphors which were planned could not, or not yet, be integrated into the Handheld client distribution.

5.1 Wheel Metaphor

The wheel metaphor is used to generate a graphical display-selection-navigation element for long result lists, as displayed in the graphic below. Questions can be posed in natural language and assisted by pointing gestures on the screen. Natural language queries initiate an intelligent Semantic Web search. Results are presented in natural language accompanied by images, videos, text, and other media. The user can navigate through the result in an intuitive way.

We recognised that the thumb plays a significant role in modern society, becoming humans' dominant haptic interactor. This development should be reflected in the interface design for different platforms. Another problem is the reaction time of the GUI interface: the (touch) screen listeners operate within the time allotted by the Flash player plugin and Mobile Explorer, which does not guarantee direct responsiveness. We test all upcoming mobile Flash environments for improvements in this respect.

5.2 Interactive Semantic Navigation

We plan to display semantic relationships of results in a graph structure. This is not a must for presenting the desired information content, but it adds mostly navigational value to the presentation design. Since many of the multimodal results can be expressed in a graph-like structure, we experiment with (1) conversion of multimodal result(s) to a graph, and (2) automatic layouts of the derived graph structure meeting aesthetic criteria of visual presentations and elements.

Currently, several graph-layout platforms are available. We use a constraint-based general structure and layout optimiser.

The graph structure and layout are produced server-side according to the solution of the layout optimiser. The resulting graph, i.e., the graph structure and positional elements, will be transmitted to the device. The layouter should be started server-side on demand.

In the figure above, we present the graph storyboard (left), the SmartWeb-integrated graphic design prototype (middle), and the prototype of the automatic interactive graph implementation (right). The latter uses interactive fisheye views and menus for local navigation, and constraint-based layout information for initial presentation conditions and follow-up requests.
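The server-side layouter could hand the device a structure along the following lines; field names and the example content are assumptions for illustration, not the actual SmartWeb message format:

```java
import java.util.List;

// Sketch of the kind of pre-layouted graph data that could be sent to the
// device: nodes with server-computed positions plus edges. Field names are
// illustrative; the actual SmartWeb message format is not shown here.
public class LayoutedGraphSketch {

    record Node(String id, String label, int x, int y) {}
    record Edge(String fromId, String toId, String relation) {}
    record LayoutedGraph(List<Node> nodes, List<Edge> edges) {}

    public static void main(String[] args) {
        LayoutedGraph g = new LayoutedGraph(
            List.of(new Node("n1", "WM 1998", 40, 60),
                    new Node("n2", "Frankreich", 160, 60)),
            List.of(new Edge("n1", "n2", "winner")));
        System.out.println(g);
    }
}
```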

6 Topics for Further Research

• Answering Polarisation Questions (Yes/No questions).

• Correction: (1) one step back, (2) cancel, (3) inline editing: edit the textual transcription of the spoken input or choose other items (suggestions) from a pull-down menu filled by closed classes, for example named entity classes.

• How can transaction consistency be ensured during and after back-stepping? For example, the working memory must be updated accordingly.

• Results are gathered in different languages, in German and in English. How can results in different languages be displayed? And synchronically?

• Should the application look and feel like a native application? It may be easier to learn and use, but restrictions exist because of abstract commands, predefined icons, etc. For users used to the PDA, the behaviour is more predictable using native look and feel.

7 References

[1] Roman Longoria, ed. Designing Software for the Mobile Context. A Practitioner's Guide. Springer (2004).
[2] Vladimir Geroimenko and Chaomei Chen. Visualizing the Semantic Web. XML-based Internet and Information Visualization. Springer (2004).
[3] R. Amant and C. Healey. Usability Guidelines for Interactive Search in Direct Manipulation Systems. North Carolina State University, Proceedings IJCAI (2001).
[4] T. Rhyme and L. Trinish, eds. Visualization Viewpoints. Perception and Painting: A Search for Effective, Engaging Visualizations. IEEE (2002).
[5] R. Amant et al. A Visual Interface to a Music Database. North Carolina State University (2002).
[6] Nicholas Cravatta, ed. The Shrinking-Interface Paradox. EDN (2003).
[7] Jaime Carbonell et al. Vision Statement to Guide Research in Question & Answering and Text Summarization. CMU (2000).
[8] S. Salmon-Alt and L. Romary. Generating Referring Expressions in Multimodal Contexts. INLG (2000).
[9] S. Salmon-Alt. Interpreting Referring Expressions by Restructuring Context. ESSLLI (2000).
[10] N. Pfleger, J. Alexandersson, and T. Becker. A Robust and Generic Discourse Model for Multimodal Dialogue. IJCAI (2003).
[11] Elsa Pecourt and Norbert Reithinger. Multimodal Database Access on Handheld Devices. In: The Companion Volume to the Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona (2004).
[12] Norbert Reithinger, Gerd Herzog, and Alassane Ndiaye. Situated Multimodal Interaction in SmartKom. In: Computers & Graphics, 27(6):899-903 (2003).
[13] Norbert Pfleger. FADE - An Integrated Approach to Multimodal Fusion and Discourse Processing. In: Proceedings of the Doctoral Spotlight at ICMI 2005, Trento, Italy (2005).
[14] Ralf Engel. Robust and Efficient Semantic Parsing of Free Word Order Languages in Spoken Dialogue Systems. In: Proceedings of the 9th Conference on Speech Communication and Technology, Lisboa (2005).
[15] Sharon Oviatt. Ten Myths of Multimodal Interaction. In: Communications of the ACM, 42(11):74-81 (1999).
[16] Norbert Reithinger and Daniel Sonntag. An Integration Framework for a Mobile Multimodal Dialogue System Accessing the Semantic Web. In: Proceedings of Interspeech '05, Lisbon, Portugal (2005).
[17] Norbert Reithinger, Simon Bergweiler, Ralf Engel, Gerd Herzog, Norbert Pfleger, Massimo Romanelli, and Daniel Sonntag. A Look Under the Hood. Design and Development of the First SmartWeb System Demonstrator. In: Proceedings of the 7th International Conference on Multimodal Interfaces (ICMI 2005), Trento, Italy, October 04-06 (2005).
[18] Daniel Sonntag. Towards Interaction Ontologies for Mobile Devices Accessing the Semantic Web - Pattern Languages for Open Domain Information Providing Multimodal Dialogue Systems. In: Proceedings of the Workshop on Artificial Intelligence in Mobile Systems (AIMS) 2005 at MobileHCI, Salzburg (2005).
[19] Daniel Sonntag and Massimo Romanelli. A Multimodal Result Ontology for Integrated Semantic Web Dialogue Applications. In: Proceedings of the 5th Conference on Language Resources and Evaluation (LREC 2006), Genova, Italy, May 24-26 (2006).
[20] Daniel Sonntag, Ralf Engel, Gerd Herzog, Alexander Pfalzgraf, Norbert Pfleger, Massimo Romanelli, and Norbert Reithinger. SmartWeb Handheld - Multimodal Interaction with Ontological Knowledge Bases and Semantic Web Services. In: International Workshop on AI for Human Computing (AI4HC) in conjunction with IJCAI (2007).
[21] O. Stock et al. Alfresco: Enjoying the Combination of Natural Language Processing and Hypermedia for Information Exploration. In: Intelligent Multimedia Interfaces, ed. M.T. Maybury. Menlo Park (CA): AAAI Press/The MIT Press, 197-224 (1993).
[22] Daniel Sonntag. Ontologies and Adaptivity in Dialogue for Question Answering. Heidelberg, AKA Press/IOS Press (2010).

8 Appendix

User instructions for the self-running SmartWeb presentation, which can be downloaded at: http://smartweb.dfki.de/Intro_Demo/start.html.

The following SmartWeb Flash presentation has been developed to be applied in different scenarios:

1. Self-Running Presentation and Demo

The presentation consists of three sections: SmartWeb information with introductory texts, SmartWeb user interface functions, and the demonstration of the answer presentation and navigation possibilities which are given for the multimodal answers to questions. For the latter, real multimedia results, including videos, are available for three sample questions. The user has the possibility to drag the questions into the SmartWeb user interface in order to simulate the answering of the question. The rest of the inactive interface is self-explanatory and available in both German and English with the push of a language selection button.

Pressing the space bar brings up a menu offering settings for automatic operation. The automatic operation also allows users to navigate through the questions and answers of the simulated SmartWeb system. For automatic operation, a fast content renewal (similar to a slide show) can be started, an option which is available through the menu (by pressing the space bar). Automatic operation mode (slide show) is distinctive in that the interface remains responsive. This way, an interested visitor can directly concentrate on specific areas and test their functions interactively.

When viewed in full screen mode 16:9 (or by enlarging the width of the animation), SmartWeb logos are animated in the left and right margins with the goal of attracting attention. We have decided to forgo speech syntheses of the multimodal results and other sound effects for reasons such as noise levels at the exhibition stand.

2. Interactive Documentation for New SmartWeb Users

The SmartWeb presentation offers a great amount of the documentation of the SmartWeb user interface. It interactively shows the SmartWeb handheld's interaction design and its implementation on the MDA. Additional, more technical documents are supplied for the project-wide deployment and the research community. These documents describe technical details, including the connection establishment to the dialogue system and the local message controller.

3. The Project Frame for the Evaluation of the Dialogue System and the User Interface

In the future expansion of the presentation, the running SmartWeb interface will be integrated in order to complete real evaluations of the SmartWeb (and THESEUS) dialogue system in running dialogue sessions. The questionnaires conceived for this purpose can then be completed and saved electronically in the presentation. (This new functionality is extensively described in the book Ontologies and Adaptivity in Dialogue-based Question Answering.)

With respect to the evaluation of the reaction and presentation module of the dialogue manager (REAPR), the presentation should be expanded as follows. While testing the dialogue system, the user should be able to directly supply relevant feedback in order to provide empirical data for the adaptation process. (This new functionality is also described in the book Ontologies and Adaptivity in Dialogue-based Question Answering.)