
Evaluating Spoken Dialogue Processing for Time-Offset Interaction

David Traum, Kallirroi Georgila, Ron Artstein, Anton Leuski
USC Institute for Creative Technologies
12015 Waterfront Drive, Playa Vista CA 90094-2536, USA
{traum|kgeorgila|artstein|leuski}@ict.usc.edu

Abstract

This paper presents the first evaluation of a fully automated prototype system for time-offset interaction, that is, conversation between a live person and recordings of someone who is not temporally co-present. Speech recognition reaches word error rates as low as 5% with general-purpose language models and 19% with domain-specific models, and language understanding can identify appropriate direct responses to 60–66% of user utterances while keeping errors to 10–16% (the remainder being indirect or off-topic responses). This is sufficient to enable a natural flow and relatively open-ended conversations, with a collection of under 2000 recorded statements.

1 Introduction

Time-offset interaction allows real-time synchronous conversational interaction with a person who is not only physically absent, but also not engaged in the conversation at the same time. The basic premise of time-offset interaction is that when the topic of conversation is known, the participants' utterances are predictable to a large extent (Gandhe and Traum, 2010). Knowing what an interlocutor is likely to say, a speaker can record statements in advance; during conversation, a computer program selects recorded statements that are appropriate reactions to the interlocutor's utterances. The selection of statements can be done in a similar fashion to existing interactive systems with synthetic characters (Leuski and Traum, 2011).

In Artstein et al. (2014) we presented a proof of concept of time-offset interaction, which showed that given sufficiently interesting content, a reasonable interactive conversation could be demonstrated. However, that system had a very small amount of content, and would only really work if someone asked questions about a very limited set of topics. There is a big gap from this proof of concept to evidence that the technique can work more generally. One of the biggest questions is how much material needs to be recorded in order to support free-flowing conversation with naive interactors who do not know specifically what they can ask. This question was addressed, at least for one specific case, in Artstein et al. (2015). There we showed that an iterative development process involving two separate recording sessions, with Wizard of Oz testing in the middle, resulted in a body of material of around 2000 responses that could be used to answer over 95% of questions from the desired target audience. In contrast, the 1400 responses from the first recording session alone were sufficient to answer less than 70% of users' questions. Another question is whether current language processing technology is adequate to pick enough appropriate responses to carry on interesting and extended dialogues with a wide variety of interested interactors. The proof of concept worked extremely well, even when people phrased questions very differently from the training data. However, that system had very low perplexity, with fewer than 20 responses, rather than something two orders of magnitude bigger.

In this paper, we address the second question: whether time-offset interaction can be automatically supported at a scale that allows interaction with people who know only the general topic of discussion, not what specific content is available. In the next section, we review related work that is similar in spirit to time-offset interaction. In Section 3 we describe our materials, including the domain of interaction, the system architecture, the dialogue policy, and the collected training and test data. In Section 4, we describe our evaluation methodology, including evaluation of speech recognition and the classifier. In Section 5, we present our results, showing that over 70% of user utterances can be given a direct answer, and an even higher percentage can reach task success through a clarification process. We conclude with a discussion and future work in Section 6.
2 Related Work

The idea of time-offset interaction is not new. We see examples of it in science fiction and fantasy. For example, in the Hollywood movie "I, Robot", Detective Spooner (Will Smith) interviews a computer-driven hologram of the recently deceased Dr. Lanning (James Cromwell).

The first computer-based dialogue system that we are aware of that enabled a form of time-offset interaction with real people was installed at the Nixon Presidential Library in the late 1980s (Chabot, 1990). Visitors were able to select one of over 280 predefined questions on a computer screen and observe a video of Nixon answering that question, taken from television interviews or filmed specifically for the project. This system did not allow natural language input.

In the late 1990s Marinelli and Stevens came up with the idea of a "Synthetic Interview", where users can interact with a historical persona that was composed using clips of an actor playing that historical character and answering questions from the user (Marinelli and Stevens, 1998). "Ben Franklin's Ghost" is a system built on those ideas and was deployed in Philadelphia from 2005–2007 (Sloss and Watzman, 2005). This system had a book in which users could select questions, but, again, did not use unrestricted natural language input.

What we believe is novel with our New Dimensions in Testimony prototype is the ability to interact with a real person, not an actor playing a historical person, and also the evaluation of its ability to interact naturally, face to face, using speech.

3 Materials

3.1 Domain

Our initial domain for time-offset interaction is the experiences of a Holocaust survivor. Currently, an important aspect of Holocaust education in museums and classrooms is the opportunity to meet a survivor, hear their story firsthand, and interact with them. This direct contact and ability to ask questions literally brings the topic to life and motivates many toward further historical study, appreciation, and determination of tolerance for others. Unfortunately, due to the age of survivors, this opportunity will not be available far into the future. The New Dimensions in Testimony project (Maio et al., 2012) is an effort to preserve as much as possible of this kind of interaction.

The pilot subject is Pinchas Gutter, who has previously told his life story many times to diverse audiences. The most obvious topic of conversation is Pinchas' experiences during World War II, including the Nazi invasion of Poland, his time in the Warsaw Ghetto, his experiences in the concentration camps, and his liberation. But there are many other topics that people bring up with Pinchas, including his pre- and post-war life and family, his outlook on life, and his favorite songs and pastimes.

3.2 System architecture

The automatic system is built on top of components from the USC ICT Virtual Human Toolkit,[1] which is publicly available. Specifically, we use the AcquireSpeech tool for capturing the user's speech, the CMU PocketSphinx[2] and Google Chrome ASR[3] tools for converting the audio into text, NPCEditor (Leuski and Traum, 2011) for classifying the utterance text and selecting the appropriate response, and a video player to deliver the selected video response. The individual components run as separate applications on the user's machine and are linked together by ActiveMQ messaging[4]: an instance of the ActiveMQ broker runs on the machine, and each component connects to the server and sends and receives messages to and from the other components via the broker.

[1] http://vhtoolkit.ict.usc.edu
[2] http://cmusphinx.sourceforge.net
[3] https://www.google.com/intl/en/chrome/demos/speech.html
[4] http://activemq.apache.org

The system setup also includes the JLogger component for recording the messages, and the Launcher tool that controls starting and stopping of the individual tools. For example, the user can select between the PocketSphinx and Google ASR engines by checking the appropriate buttons in the Launcher interface. Figure 1 shows the overall system architecture. We show the data flow through the system as black lines. Gray arrows indicate the control messages from the Launcher interface. Solid arrows represent messages passed via ActiveMQ and dotted lines represent data going over TCP/IP.

[Figure 1: System architecture. The microphone, AcquireSpeech, PocketSphinx ASR, Google Chrome ASR client, NPCEditor, VideoPlayer, Logger, and Launcher components are connected via ActiveMQ messaging.]
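The paper does not specify the wire format of these messages, but the loose coupling through the broker can be illustrated with a small messaging sketch. The following is a minimal sketch assuming a local ActiveMQ broker reachable over STOMP (via the stomp.py package, version 8 or later); the topic name, message format, and response identifiers are illustrative assumptions, not the Virtual Human Toolkit's actual protocol.

```python
# Minimal sketch of two components exchanging messages through an ActiveMQ
# broker, in the spirit of the architecture described above.  The topic name,
# message format, and response IDs are illustrative assumptions only.
import stomp

BROKER = [("localhost", 61613)]   # default STOMP port of a local ActiveMQ broker
TOPIC = "/topic/dialogue"         # hypothetical shared topic for all components

class ClassifierStub(stomp.ConnectionListener):
    """Stands in for NPCEditor: reacts to ASR results posted on the topic."""

    def __init__(self, connection):
        self.connection = connection

    def on_message(self, frame):  # stomp.py 8.x listener signature
        kind, _, text = frame.body.partition(" ")
        if kind == "asr_result":
            # A real system would classify `text` and select a recorded clip;
            # here we just publish a placeholder response identifier.
            self.connection.send(destination=TOPIC, body="play_response clip_042")

conn = stomp.Connection(BROKER)
conn.set_listener("classifier", ClassifierStub(conn))
conn.connect(wait=True)
conn.subscribe(destination=TOPIC, id="1", ack="auto")

# An ASR client would publish its hypothesis the same way; the video player
# would subscribe and react to "play_response" messages.
conn.send(destination=TOPIC, body="asr_result where is lodz")
```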

While most of the system components already existed before the start of this project, the Google Chrome ASR client and VideoPlayer tools were developed in the course of this project. The Google Chrome ASR client is a web application that takes advantage of the Google Speech API available in the Chrome browser. The tool provides a push-to-talk interface control for acquiring the user's speech; it uses the API to send audio to Google's ASR servers, collect the recognition result, and broadcast it over the ActiveMQ messaging. We developed the VideoPlayer tool so that we can control the response playback via the same ActiveMQ messaging. VideoPlayer also implements custom transitions between clips. It has video adjustment controls so that we can modify the scale and position of the video image, and it automatically displays a loop of idle video clips while the system is in the resting or listening states.

While the system was developed to be cross-platform so that it can run both on OS X and Windows, we conducted all our testing and experiments on OS X. The system is packaged as a single OS X application that starts the Launcher interface and the rest of the system. This significantly simplifies distribution and installation of the system on different computers.

3.3 Speech recognition

Currently the system can work with two speech recognition engines, CMU PocketSphinx and Google Chrome ASR. For our experiments we also considered Apple Dictation.[5]

One major decision when selecting a speech recognizer is whether or not it allows for training domain-specific language models (LMs).[6] Purely domain-specific LMs cannot recognize out-of-domain words or utterances. On the other hand, general-purpose LMs do not perform well with domain-specific words or utterances. Unlike PocketSphinx, which supports trainable LMs, both Google Chrome ASR and Apple Dictation come with their own out-of-the-box LMs that cannot be modified.

[5] https://support.apple.com/en-us/HT202584
[6] While the acoustic models of a speech recognizer recognize individual sounds, the LM provides information about what the recognizer should expect to listen to and recognize. If a word or a sequence of words is not included in the LM, it will never be recognized.

Table 1 shows example outputs of all three recognizers (the PocketSphinx examples were obtained with a preliminary LM). As we can see, Google Chrome ASR and Apple Dictation with their general-purpose LMs perform well for utterances that are not domain-specific. On the other hand, PocketSphinx is clearly much better at recognizing domain-specific words, e.g., "Pinchas", "Majdanek", etc., but fails to recognize general-purpose utterances if they are not included in its LM. For example, the user input "what's your favorite restaurant" is misrecognized as "what's your favorite rest shot" because the word "restaurant" or the sequence "favorite restaurant" was not part of the LM's training data. Similarly, the user input "did you serve in the army" is misrecognized as "did you certain the army" because the word "serve" or the sequence "serve in the army" was not included in the LM's training data.

For training LMs for PocketSphinx we used the CMU Statistical Language Modeling toolkit (Clarkson and Rosenfeld, 1997) with back-off 3-grams. The CMU pronouncing dictionary v0.7a (Weide, 2008) was used as the main dictionary, with the addition of domain-dependent words such as names. We used the standard US English acoustic models that are included in PocketSphinx.

User Input                       | Google Chrome ASR Output        | Apple Dictation Output          | CMU PocketSphinx Output
hello pinchas                    | hello pinterest                 | hello princess                  | hello pinchas
where is lodz                    | where is lunch                  | where is lunch                  | where is lodz
were you in majdanek             | were you in my dannic           | were you in my donick           | were you in majdanek
were you in kristallnacht        | were you and krystal knox       | where you went kristallnacht    | where you when kristallnacht from
did you serve in the army        | did you serve in the army       | he served in the army           | did you certain the army
have you ever lived in israel    | have you ever lived in israel   | that ever lived in israel       | are you ever live in a israel
what's your favorite restaurant  | what's your favorite restaurant | what's your favorite restaurant | what's your favorite rest shot

Table 1: Examples of speech recognition outputs

3.4 Dialogue policy

As mentioned in Section 3.2, NPCEditor combines the functions of Natural Language Understanding (NLU) and Dialogue Management: understanding the utterance text and selecting an appropriate response. The NLU functionality is a classifier trained on linked question-response pairs, which identifies the most appropriate response to new (unseen) user input. The dialogue management logic is designed to deal with instances where the classifier cannot identify a good direct response.

During training, NPCEditor calculates a response threshold based on the classifier's confidence in the appropriateness of selected responses: this threshold finds an optimal balance between false positives (inappropriate responses above threshold) and false negatives (appropriate responses below threshold) in the training data. At runtime, if the confidence for a selected response falls below the predetermined threshold, that response is replaced with an "off-topic" utterance that asks the user to repeat the question or takes initiative and changes the topic (Leuski et al., 2006); such failure to return a direct response, also called non-understanding (Bohus and Rudnicky, 2005), is usually preferred over returning an inappropriate one (misunderstanding).

The current system uses a five-stage off-topic selection algorithm which is an extension of the one presented in Artstein et al. (2009). The first time Pinchas fails to understand an utterance, he will assume this is a speech recognition error and ask the user to repeat it. If the misunderstanding persists, Pinchas will say that he doesn't know (without asking for repetition), and the third time he will state that he cannot answer the user's utterance. In a severe misunderstanding that persists beyond three exchanges, Pinchas will suggest a new topic in the fourth turn, and if even this fails to bring the user to ask a question that Pinchas can understand, then in the fifth turn Pinchas will give a quick segue and launch into a story of his choice. If at any point Pinchas hears an utterance that he can understand (that is, if the classifier finds a response above threshold), Pinchas will answer it directly, and the off-topic state will reset to zero.

A separate component of the dialogue policy is designed to avoid repetition. Normally, Pinchas responds with the top-ranked response if it is above the threshold. However, if the top-ranked response has been recently used (within a 4-turn window) and a lower-ranked response is also above the threshold, Pinchas will respond with the lower-ranked response. If the only responses above threshold are among the recently used ones, then Pinchas will choose one of them, since repetition is considered preferable to responding with an off-topic or inappropriate statement.
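The combined thresholding, repetition-avoidance, and escalating off-topic behavior can be summarized in a short sketch. The Python code below is a minimal illustration under stated assumptions: classifier results arrive as (response, confidence) pairs sorted by confidence, the threshold has been computed during training, and the five escalating off-topic utterances are given. It is not NPCEditor's actual implementation, and all names are hypothetical.

```python
# Illustrative sketch of the response selection policy described above.
# `ranked` is a list of (response_id, confidence) pairs sorted by decreasing
# confidence; `threshold` comes from training; `off_topics` holds the five
# escalating off-topic utterances (repeat request, "don't know", "can't
# answer", topic suggestion, segue into a story).
RECENT_WINDOW = 4  # responses used within the last 4 turns count as "recent"

class DialoguePolicy:
    def __init__(self, threshold, off_topics):
        self.threshold = threshold
        self.off_topics = off_topics
        self.off_topic_state = 0   # consecutive non-understandings so far
        self.recent = []           # response IDs used in the last few turns

    def select(self, ranked):
        above = [r for r, conf in ranked if conf >= self.threshold]
        if not above:
            # Non-understanding: escalate through the five off-topic stages.
            stage = min(self.off_topic_state, len(self.off_topics) - 1)
            self.off_topic_state += 1
            return self.off_topics[stage]
        # Understanding: reset the off-topic state and avoid recent repeats.
        self.off_topic_state = 0
        fresh = [r for r in above if r not in self.recent]
        choice = fresh[0] if fresh else above[0]   # repeat only if unavoidable
        self.recent = (self.recent + [choice])[-RECENT_WINDOW:]
        return choice
```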
chas responds with the top-ranked response if Additional training data were collected in the it is above the threshold. However, if the top- various stages of user testing – the Wizard of ranked response has been recently used (within Oz testing between the first and second record- a 4-turn window) and a lower ranked response ing sessions, and fully automated system testing

Wizard of Oz testing took place in June and July 2014; participants sat in front of a screen that showed rough-cut video segments of Mr. Gutter's statements, selected by human operators in response to user utterances in real time. Since the Wizard of Oz testing took place prior to the second recording, wizards were only able to choose statements from the first recording. The user utterances were recorded, transcribed, and analyzed to form the basis for the elicitation script for the second recording. Subsequent to the second recording, these utterances were reannotated to identify appropriate responses from all of the recorded statements, and these reannotated question-response links form the Wizard of Oz portion of the training data.

Testing with the automated system was carried out starting in October 2014, following the second recording of survivor statements. Users spoke to the automated system, and their utterances were recorded, transcribed, and annotated with appropriate responses. These data are partitioned into two: the testing that took place in late 2014 was mostly internal, with team members, other institute staff, and visitors, while the testing from early 2015 was mostly external, conducted over 3 days at a local museum. We thus have 4 portions of training data, summarized in Table 2.

Data source           Questions   Links
Elicitation                1546    2147
Wizard of Oz               1753    3329
System testing 2014        1825    1990
System testing 2015        1823    1959
Total                      6947    9425

Table 2: Training data sets

Test data for evaluating the classifier performance were taken from the system testing in late 2014. We picked a set of 400 user utterances, collected during the last day of testing, which was conducted off-site and therefore consisted primarily of external test participants (these utterances are not counted in Table 2 above). We only included in-domain utterances for which an appropriate on-topic response was available. The evaluation therefore measures the ability of the system to identify an appropriate response when one is available, not its ability to identify instances where an on-topic response is unavailable. There is some overlap in the test questions, so the 400 instances contain only 341 unique question types, with the most frequent question ("What is your name?") occurring 5 times. We believe it is fair to include such overlap in the test set, since it gives higher weight to the more frequent questions. Also, while the text of overlapping questions is identical, each instance is associated with a unique audio file; these utterances may therefore yield different speech recognizer outputs, resulting in different outcomes.

The test set was specially annotated to serve as a test key. There is substantial overlap in content between the recorded survivor statements, so many user utterances can be addressed appropriately by more than one response. For training purposes it is sufficient to link each user utterance to some appropriate responses, but the test key must link each utterance to all appropriate responses. It is impractical to check each of the 400 test utterances against all 1726 possible responses, so instead we used the following procedure to identify responses that are likely to come up in response to specific test questions: we trained the system under different partitions of the training data and different training parameters, ran the test questions through each of the system versions, and from each system run we collected the responses that the system considered appropriate (that is, above threshold) for each question.
This resulted in a set of 3737 utterance-response pairs, ranging from 3 to 19 responses per utterance, which represent likely system outputs for future training configurations. All the responses retrieved by the system were rated for coherence on a scale of 1–4 (Table 3). The responses rated 3 or 4 were deemed appropriate for inclusion in the test key, a total of 1838 utterance-response pairs, ranging from 1 to 10 responses per utterance.

Code   Interpretation
4      Directly addresses the user question.
3      Indirectly addresses the user question, or contains additional irrelevant material.
2      Does not address the user question, but is on a related topic.
1      Irrelevant to the user question.

Table 3: Coherence rating for system responses

4 Method

4.1 Speech recognition

As mentioned above, neither the Google nor the Apple ASR allows for trainable LMs. For PocketSphinx we experimented with different domain-specific LMs, and below we report results with two of them: one trained on Wizard of Oz and system testing data (approx. 5000 utterances) collected until December 2014 (LM-ds), and another trained on additional data (approx. 6500 utterances) collected until January 2015 (LM-ds-add). The test set was the 400 utterances mentioned above. There was no overlap between the training and test data sets.

In order to evaluate the performance of the speech recognizers we use the standard word error rate (WER) metric:

    WER = (Substitutions + Deletions + Insertions) / (Length of the reference transcription)
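As a concrete illustration, WER can be computed from a standard edit-distance alignment between the reference transcription and the recognizer output. The sketch below is a generic implementation of that textbook computation, not the specific scoring tool used in our evaluation.

```python
# Word error rate via Levenshtein alignment over words: the minimum number of
# substitutions, deletions, and insertions needed to turn the reference into
# the hypothesis, divided by the reference length.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                            # deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j                            # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(sub,                 # substitution or match
                             dist[i - 1][j] + 1,  # deletion
                             dist[i][j - 1] + 1)  # insertion
    return dist[len(ref)][len(hyp)] / max(len(ref), 1)

# Example from Table 1: "majdanek" comes out as "my dannic", i.e. 2 edits
# over a 4-word reference, so the WER is 0.5.
print(word_error_rate("were you in majdanek", "were you in my dannic"))
```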
4.2 Classifier evaluation

Evaluation of the classifier is difficult, because it has to take into account the dialogue policy: the classifier typically returns the top-ranked response, but may return a lower-ranked response if it is above threshold and the higher-ranked responses were used recently. So while the classifier ranks all the available responses, anything below the top few will never be selected by the dialogue manager, rendering measures such as precision and recall quite irrelevant. An ideal evaluation should give the highest weight to the correctness of the top-ranked response, with rapidly decreasing weight to the next several responses, but it is difficult to determine what weights are appropriate. We therefore focus on the top answer, since in most cases the top answer is what will get served to the user.

The top answer can have one of three outcomes: it can be appropriate (good), inappropriate (bad), or below threshold, in which case an off-topic response is served. A good response is better than an off-topic, which is in turn better than a bad response. This makes it difficult to compare systems with different off-topic rates: how do two systems compare if one gives more good and bad responses than the other, but fewer off-topics? We therefore compare systems using error return plots, which show the error rate across all possible return rates (Artstein, 2011): for each system we calculate the number of errors at each return rate, and then plot the number of errors against the number of off-topics.

We used 6 combinations of the training data described in Section 3.5. The baseline is trained with only the elicitation questions, and represents the performance we might expect if we were to build a dialogue system based on the recording sessions alone, without collecting user question data (except to the extent that user questions influenced the second recording session). To this baseline we successively added training data from the Wizard of Oz testing, system testing 2014, and system testing 2015. Our final training sets include the elicitation questions and system testing 2014 (without Wizard of Oz data), and the same with system testing 2015 added.

All of the classifiers were trained in NPCEditor using the same options: text unigrams for the question language models, text unigrams plus IDs for the response language models, and F-score as the classifier scoring function during training.

We used 3 versions of the test utterances: the transcribed text, the output of Google ASR, and the output of PocketSphinx, and ran each version through each of the 6 classifiers, for a total of 18 configurations. For each testing configuration, we retrieved the top-ranked response for each utterance, together with the classifier confidence and a true/false indication of whether the response matched the answer key. The responses were ranked by the classifier confidence, and for each possible cutoff point (from returning zero off-topic responses to returning off-topic responses for all 400 utterances), we calculated the number of errors among the on-topic responses and plotted that against the number of off-topics. Each plot represents the error-return tradeoff for a particular testing configuration (see Section 5.2).
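The error-return tradeoff described in the last paragraph can be computed directly from the per-utterance results. The following is a minimal sketch under the assumption that each test utterance yields a classifier confidence and a correctness flag for its top-ranked response; it reproduces the plotting procedure in spirit, not the exact scripts behind Figures 2 and 3.

```python
# Error-return tradeoff: sort utterances by the confidence of their top-ranked
# response; treating the k lowest-confidence utterances as off-topics, count
# the errors among the remaining on-topic responses, for every k from 0 to N.
from typing import List, Tuple

def error_return_curve(results: List[Tuple[float, bool]]) -> List[Tuple[int, int]]:
    """results: (confidence, is_correct) for the top-ranked response of each
    test utterance.  Returns (off_topics, errors) points for plotting."""
    # Highest-confidence utterances are the last to be turned into off-topics.
    ordered = sorted(results, key=lambda r: r[0], reverse=True)
    curve = []
    for off_topics in range(len(ordered) + 1):
        on_topic = ordered[:len(ordered) - off_topics]  # responses actually returned
        errors = sum(1 for _, correct in on_topic if not correct)
        curve.append((off_topics, errors))
    return curve

# Toy usage: four test utterances with made-up confidences and correctness.
points = error_return_curve([(0.9, True), (0.7, False), (0.4, True), (0.2, False)])
print(points)  # [(0, 2), (1, 1), (2, 1), (3, 0), (4, 0)]
```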

5 Results

5.1 Speech recognition evaluation

Table 4 shows the WERs for the three different speech recognizers and the two different LMs. Note that we also experimented with interpolating domain-specific LMs with background LMs available from http://keithv.com/software. Interpolation did not help, but this is still an issue under investigation. Interpolation helped with speakers who had low WERs (smooth, easy-to-recognize speech) but hurt in the case of speakers with high WERs. In the latter case, having a background model meant that there were more choices for the speech recognizer to choose from, which instead of helping caused confusion.

Speech Recognizer   General LM   LM-ds     LM-ds-add
Google                   5.07%       —             —
Apple                    7.76%       —             —
PocketSphinx                 —   22.04%        19.39%

Table 4: Speech recognition results (WER). General LM stands for the general-purpose LM, LM-ds for the domain-specific LM trained with data collected until December 2014, and LM-ds-add for the domain-specific LM trained with additional data collected until January 2015.

We also noticed that PocketSphinx was less tolerant of environmental noise, which most of the time resulted in insertions and substitutions. For example, as we can see in Table 1, the user input "have you ever lived in israel" was misrecognized by PocketSphinx as "are you ever live in a israel". These misrecognitions do not necessarily confuse the classifier, but of course they often do.

5.2 Classifier evaluation

Classifier performance is best when training on all the data and testing on transcriptions rather than speech recognizer output. Figure 2 shows the effect of the amount of training data on classifier performance when tested on transcribed text (a similar effect is observed when testing on speech recognizer output). Lower curves represent better performance. As expected, performance improves with additional training data: training on the full set of data cuts error rates by about a third compared to training on the elicitation questions alone. Additional training data (both new questions and question-response links) are likely to improve performance even further.

[Figure 2: Tradeoff between errors and off-topics for various training sets, tested on transcribed text; one curve per training set (elicitation, elicitation-wizard, elicitation-system2014, elicitation-system2014-system2015, elicitation-wizard-system2014, elicitation-wizard-system2014-system2015)]

The effect of speech recognition on classifier performance is shown in Figure 3. Automatic speech recognition does impose a performance penalty compared to testing on transcriptions, but the penalty is not very large: classifier errors when testing with Google ASR are between 1 and 3 percentage points higher than with transcriptions, while PocketSphinx fares somewhat worse, with classifier errors about 5 to 8 percentage points higher than with transcriptions. At a 20% off-topic rate, the response error rates are 14% for transcriptions and 16% for Google ASR, meaning that almost two thirds of user utterances receive a direct appropriate response. At 30% off-topics, errors drop to 10–11%, and direct appropriate responses drop to just shy of 60%. Informal impressions from current testing at a museum (Section 6) suggest that these numbers are sufficient to enable a reasonable conversation flow.

[Figure 3: Tradeoff between errors and off-topics for different test sets (PocketSphinx ASR, Google ASR, transcriptions), trained on the full data]

6 Discussion

This paper has demonstrated that time-offset interaction with a real person is achievable with present-day spoken language processing technology. Not only are we able to collect a sufficiently large and varied set of statements to address user utterances (Artstein et al., 2015), we are also able to use speech recognition and language understanding technology to identify appropriate responses frequently enough to enable a natural interaction flow. Future work is needed in three areas: investigating the interaction quality of the dialogue system, improving the language processing, and generalizing the process to additional situations.

To investigate the interaction quality, we need to look at dialogues in context rather than as isolated utterances, and to collect user feedback. We are presently engaged in a joint testing, demonstration, and data collection effort that is intended to address these issues. The time-offset interaction system has been temporarily installed at the Illinois Holocaust Museum and Education Center in Skokie, Illinois, where visitors interact with the system as part of their museum experience (Isaacs, 2015). The system is set up in an auditorium and users talk to Pinchas in groups, in a setting that is similar to the in-person encounters with Holocaust survivors which also take place at the museum. Due to physical limitations of the exhibit space, interaction is mediated by museum docents: each user question is relayed by the docent into the microphone, and Pinchas responds to the docent's speech. An excerpt of a museum interaction is in the Appendix. Data and feedback from the museum installation will be used to evaluate the interaction quality, including user feedback as to the naturalness of the interaction and user satisfaction.

The ongoing testing also serves the purpose of data collection for improving system performance: Figure 2 shows that errors diminish with additional training data, and it appears that we have not yet reached the point of diminishing returns with about 7000 training utterances. We hope to collect an average of 10 training utterances per response, that is, about 17000 user utterances. Annotation is also incomplete: the test key has an average of 4.6 links per utterance, as opposed to an average of around 1.4 links per utterance in the training data. While complete linking is not necessary for classifier operation, improving the links will probably improve performance.

In addition to improving performance through improved data, there are also algorithmic improvements that can be made to the language processing components. One goal is to leverage the relative strengths of the general-purpose and domain-specific ASRs, e.g., through the classifier: past work has shown that language understanding can be improved by allowing the NLU to select from among several hypotheses provided by a single speech recognizer (Morbini et al., 2012), and we propose to try a similar method to utilize the outputs of separate speech recognizers (a possible selection scheme is sketched at the end of this section). Another idea is to combine or align the outputs of the speech recognizers (before they are forwarded to the classifier), taking into account information from the recognition confidence scores and lattices. This will potentially help in cases where different recognizers succeed in correctly recognizing different parts of the utterance.

Time-offset interaction has a large potential impact on preservation and education: people in the future will be able not only to see and listen to historical figures, but also to interact with them in conversation. Future research into time-offset interaction will need to generalize the development process, in order to enable efficient use of resources by identifying common user questions that are specific to the person, ones that are specific to the dialogue context or conversation topic, and ones that are of more general application.
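As an illustration of the first idea, one simple way to let the classifier arbitrate between recognizers is to classify every available hypothesis and keep the one whose best response scores highest. The sketch below makes several assumptions: a classify(text) function standing in for the NPCEditor classifier, one-best hypotheses arriving as plain strings from each recognizer, and highest-confidence selection as the arbitration rule. It is one possible approach, not the method proposed or evaluated in this paper.

```python
# Hypothetical hypothesis selection across recognizers: run each ASR output
# through the classifier and keep the response the classifier is most
# confident about.  `classify` stands in for the NPCEditor classifier and is
# assumed to return (response_id, confidence) for a text input.
from typing import Callable, Dict, Tuple

def pick_response(hypotheses: Dict[str, str],
                  classify: Callable[[str], Tuple[str, float]],
                  threshold: float) -> str:
    """hypotheses maps a recognizer name (e.g. 'google', 'pocketsphinx')
    to its 1-best output for the current user utterance."""
    best_response, best_conf = None, 0.0
    for recognizer, text in hypotheses.items():
        response, conf = classify(text)
        if conf > best_conf:
            best_response, best_conf = response, conf
    if best_response is None or best_conf < threshold:
        return "OFF_TOPIC"  # fall back to the off-topic policy of Section 3.4
    return best_response
```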
Acknowledgments

This work was made possible by generous donations from private foundations and individuals. We are extremely grateful to The Pears Foundation, Louis F. Smith, and two anonymous donors for their support. The work was supported in part by the U.S. Army; statements and opinions expressed do not necessarily reflect the position or the policy of the United States Government, and no official endorsement should be inferred. Heather Maio and Alesia Gainer spent long hours on data collection and system testing. The Los Angeles Museum of the Holocaust, the Museum of Tolerance, and New Roads School in Santa Monica offered their facilities for data collection. The USC Shoah Foundation provided financial and administrative support, and facilities. Finally, we owe special thanks to Pinchas Gutter for sharing his story, and for his tireless efforts to educate the world about the Holocaust.

References

Ron Artstein, Sudeep Gandhe, Jillian Gerten, Anton Leuski, and David Traum. 2009. Semi-formal evaluation of conversational characters. In Orna Grumberg, Michael Kaminski, Shmuel Katz, and Shuly Wintner, editors, Languages: From Formal to Natural. Essays Dedicated to Nissim Francez on the Occasion of His 65th Birthday, volume 5533 of Lecture Notes in Computer Science, pages 22–35. Springer, Heidelberg, May.

Ron Artstein, David Traum, Oleg Alexander, Anton Leuski, Andrew Jones, Kallirroi Georgila, Paul Debevec, William Swartout, Heather Maio, and Stephen Smith. 2014. Time-offset interaction with a Holocaust survivor. In Proceedings of IUI, pages 163–168, Haifa, Israel, February.

Ron Artstein, Anton Leuski, Heather Maio, Tomer Mor-Barak, Carla Gordon, and David Traum. 2015. How many utterances are needed to support time-offset interaction? In Proceedings of the Twenty-Eighth International Florida Artificial Intelligence Research Society Conference, pages 144–149, Hollywood, Florida, May. AAAI Press.

Ron Artstein. 2011. Error return plots. In Proceedings of SIGDIAL, pages 319–324, Portland, Oregon, June.

Dan Bohus and Alexander I. Rudnicky. 2005. Sorry, I didn't catch that! – An investigation of non-understanding errors and recovery strategies. In Proceedings of SIGDIAL, pages 128–143, Lisbon, Portugal, September.

Lucy Chabot. 1990. Nixon library technology lets visitors 'interview' him. Los Angeles Times, July 21.

Philip Clarkson and Ronald Rosenfeld. 1997. Statistical language modeling using the CMU-Cambridge toolkit. In Proceedings of Eurospeech, Rhodes, Greece, September.

Sudeep Gandhe and David Traum. 2010. I've said it before, and I'll say it again: An empirical investigation of the upper bound of the selection approach to dialogue. In Proceedings of SIGDIAL, pages 245–248, Tokyo, September.

Mike Isaacs. 2015. Holocaust Museum: Pilot program aims to preserve survivor voices for future generations. Chicago Tribune, May 19.

Anton Leuski and David Traum. 2011. NPCEditor: Creating virtual human dialogue using information retrieval techniques. AI Magazine, 32(2):42–56.

Anton Leuski, Ronakkumar Patel, David Traum, and Brandon Kennedy. 2006. Building effective question answering characters. In Proceedings of SIGDIAL, Sydney, Australia, July.

Heather Maio, David Traum, and Paul Debevec. 2012. New dimensions in testimony. PastForward, Summer:22–26.

Donald Marinelli and Scott Stevens. 1998. Synthetic interviews: The art of creating a 'dyad' between humans and machine-based characters. In Proceedings of the Sixth ACM International Conference on Multimedia: Technologies for Interactive Movies (MULTIMEDIA), pages 11–16, New York, NY, USA. ACM.

Fabrizio Morbini, Kartik Audhkhasi, Ron Artstein, Maarten Van Segbroeck, Kenji Sagae, Panayiotis Georgiou, David R. Traum, and Shri Narayanan. 2012. A reranking approach for recognition and classification of speech input in conversational dialogue systems. In Proceedings of the Fourth IEEE Workshop on Spoken Language Technology (SLT), pages 49–54, Miami Beach, Florida, December.

Eric Sloss and Anne Watzman. 2005. Carnegie Mellon's Entertainment Technology Center conjures up Benjamin Franklin's ghost. Press release, Carnegie Mellon Media Relations, June 28. http://www.cmu.edu/PR/releases05/050628_etc.html.

R. L. Weide. 2008. The CMU pronouncing dictionary.

A Dialogue sample

The following dialogue excerpt is taken from the beta testing in Illinois (see Section 6). Speech recognition output is shown in square brackets when it departs substantially from the spoken text.

User: Hello Pinchas, how are you? [ASR: Las Vegas how are you]

Pinchas: Can you just repeat that?

User: Can you hear me OK?

Pinchas: I can see you and I can hear you.

User: Hello Pinchas, can you hear me? [ASR: how thick is can you hear me]

Pinchas: I can hear you, yeah.

User: Pinchas, can you tell me how old you are? [ASR: Vegas can you tell me how old you are]

Pinchas: I was born in nineteen thirty two and we are now in two thousand and fourteen and you can make your own reckoning.

User: Can you tell me a little bit about your childhood?

Pinchas: My life before the war, I was born in Lodz in Poland, and I was eight years old when the war started. So the first eight years I lived in a home, a very religious one, a Hasidic home. I had a twin sister, my...

User: Have you ever met Hitler? [ASR: mermaid Hitler]

Pinchas: Please repeat that.

User: Did you ever meet Hitler?

Pinchas: Did I ever meet Hitler? Well, Hitler was a Führer of Germany and I was born in Poland and I was a little Jewish boy, so the chances of somebody like me meeting Hitler was impossible, because Hitler was...

User: What would you do if you met Hitler?

Pinchas: Geyn drerd arein, that's Yiddish and what it means go and bury yourself.
