
Evaluating Spoken Dialogue Processing for Time-Offset Interaction

David Traum, Kallirroi Georgila, Ron Artstein, Anton Leuski
USC Institute for Creative Technologies
12015 Waterfront Drive, Playa Vista CA 90094-2536, USA
{traum|kgeorgila|artstein|leuski}@ict.usc.edu

Abstract

This paper presents the first evaluation of a fully automated prototype system for time-offset interaction, that is, conversation between a live person and recordings of someone who is not temporally co-present. Speech recognition reaches word error rates as low as 5% with general-purpose language models and 19% with domain-specific models, and language understanding can identify appropriate direct responses to 60–66% of user utterances while keeping errors to 10–16% (the remainder being indirect or off-topic responses). This is sufficient to enable a natural flow and relatively open-ended conversations, with a collection of under 2000 recorded statements.

1 Introduction

Time-offset interaction allows real-time synchronous conversational interaction with a person who is not only physically absent, but also not engaged in the conversation at the same time. The basic premise of time-offset interaction is that when the topic of conversation is known, the participants' utterances are predictable to a large extent (Gandhe and Traum, 2010). Knowing what an interlocutor is likely to say, a speaker can record statements in advance; during conversation, a computer program selects recorded statements that are appropriate reactions to the interlocutor's utterances. The selection of statements can be done in a similar fashion to existing interactive systems with synthetic characters (Leuski and Traum, 2011).

In Artstein et al. (2014) we presented a proof of concept of time-offset interaction, which showed that given sufficiently interesting content, a reasonable interactive conversation could be demonstrated. However, that system had a very small amount of content, and would only really work if someone asked questions about a very limited set of topics. There is a big gap from this proof of concept to evidence that the technique can work more generally. One of the biggest questions is how much material needs to be recorded in order to support free-flowing conversation with naive interactors who do not know specifically what they can ask. This question was addressed, at least for one specific case, in Artstein et al. (2015). There we showed that an iterative development process involving two separate recording sessions, with Wizard of Oz testing in the middle, resulted in a body of material of around 2000 responses that could be used to answer over 95% of questions from the desired target audience. In contrast, the 1400 responses from the first recording session alone were sufficient to answer less than 70% of users' questions. Another question is whether current language processing technology is adequate to pick enough appropriate responses to carry on interesting and extended dialogues with a wide variety of interested interactors. The proof of concept worked extremely well, even when people phrased questions very differently from the training data. However, that system had very low perplexity, with fewer than 20 responses, rather than something two orders of magnitude bigger.

In this paper, we address the second question: whether time-offset interaction can be automatically supported at a scale that allows interaction with people who know only the general topic of discussion, not what specific content is available. In the next section, we review related work that is similar in spirit to time-offset interaction. In Section 3 we describe our materials, including the domain of interaction, the system architecture, the dialogue policy, and the collected training and test data. In Section 4, we describe our evaluation methodology, including evaluation of speech recognition and the classifier. In Section 5, we present our results, showing that over 70% of user utterances can be given a direct answer, and an even higher percentage can reach task success through a clarification process. We conclude with a discussion and future work in Section 6.
2 Related Work

The idea of time-offset interaction is not new. We see examples of it in science fiction and fantasy. For example, in the Hollywood movie "I, Robot", Detective Spooner (Will Smith) interviews a computer-driven hologram of the recently deceased Dr. Lanning (James Cromwell).

The first computer-based dialogue system that we are aware of that enabled a form of time-offset interaction with real people was installed at the Nixon Presidential Library in the late 1980s (Chabot, 1990). Visitors were able to select one of over 280 predefined questions on a computer screen and observe a video of Nixon answering that question, taken from television interviews or filmed specifically for the project. This system did not allow natural language input.

In the late 1990s Marinelli and Stevens came up with the idea of a "Synthetic Interview", where users can interact with a historical persona that was composed using clips of an actor playing that historical character and answering questions from the user (Marinelli and Stevens, 1998). "Ben Franklin's Ghost" is a system built on those ideas and was deployed in Philadelphia from 2005–2007 (Sloss and Watzman, 2005). This system had a book in which users could select questions, but, again, did not use unrestricted natural language input.

What we believe is novel with our New Dimensions in Testimony prototype is the ability to interact with a real person, not an actor playing a historical person, and also the evaluation of its ability to interact naturally, face to face, using speech.

3 Materials

3.1 Domain

Our initial domain for time-offset interaction is the experiences of a Holocaust survivor. Currently, an important aspect of Holocaust education in museums and classrooms is the opportunity to meet a survivor, hear their story firsthand, and interact with them. This direct contact and ability to ask questions literally brings the topic to life and motivates many toward further historical study, appreciation, and determination of tolerance for others. Unfortunately, due to the age of survivors, this opportunity will not be available far into the future. The New Dimensions in Testimony project (Maio et al., 2012) is an effort to preserve as much as possible of this kind of interaction.

The pilot subject is Pinchas Gutter, who has previously told his life story many times to diverse audiences. The most obvious topic of conversation is Pinchas' experiences during World War II, including the Nazi invasion of Poland, his time in the Warsaw Ghetto, his experiences in the concentration camps, and his liberation. But there are many other topics that people bring up with Pinchas, including his pre- and post-war life and family, his outlook on life, and his favorite songs and pastimes.

3.2 System architecture

The automatic system is built on top of components from the USC ICT Virtual Human Toolkit,[1] which is publicly available. Specifically, we use the AcquireSpeech tool for capturing the user's speech, the CMU PocketSphinx[2] and Google Chrome ASR[3] tools for converting the audio into text, NPCEditor (Leuski and Traum, 2011) for classifying the utterance text and selecting the appropriate response, and a video player to deliver the selected video response. The individual components run as separate applications on the user's machine and are linked together by ActiveMQ messaging[4]: an instance of the ActiveMQ broker runs on the machine, and each component connects to the server and sends and receives messages to and from the other components via the broker.

[1] http://vhtoolkit.ict.usc.edu
[2] http://cmusphinx.sourceforge.net
[3] https://www.google.com/intl/en/chrome/demos/speech.html
[4] http://activemq.apache.org

The system setup also includes the JLogger component for recording the messages, and the Launcher tool that controls starting and stopping of the individual tools. For example, the user can select between the PocketSphinx and Google ASR engines by checking the appropriate buttons in the Launcher interface. Figure 1 shows the overall system architecture. We show the data flow through the system as black lines. Gray arrows indicate the control messages from the Launcher interface. Solid arrows represent messages passed via ActiveMQ and dotted lines represent data going over TCP/IP.

[Figure 1: System architecture. The microphone, AcquireSpeech, PocketSphinx ASR, Google Chrome ASR client, NPCEditor, VideoPlayer, Logger, and Launcher components are connected via ActiveMQ messaging.]
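The paper does not specify the wire format of these messages, but the loose coupling through the broker can be illustrated with a small messaging sketch. The following is a minimal sketch assuming a local ActiveMQ broker reachable over STOMP (via the stomp.py package, version 8 or later); the topic name, message format, and response identifiers are illustrative assumptions, not the Virtual Human Toolkit's actual protocol.

```python
# Minimal sketch of two components exchanging messages through an ActiveMQ
# broker, in the spirit of the architecture described above.  The topic name,
# message format, and response IDs are illustrative assumptions only.
import stomp

BROKER = [("localhost", 61613)]   # default STOMP port of a local ActiveMQ broker
TOPIC = "/topic/dialogue"         # hypothetical shared topic for all components

class ClassifierStub(stomp.ConnectionListener):
    """Stands in for NPCEditor: reacts to ASR results posted on the topic."""

    def __init__(self, connection):
        self.connection = connection

    def on_message(self, frame):  # stomp.py 8.x listener signature
        kind, _, text = frame.body.partition(" ")
        if kind == "asr_result":
            # A real system would classify `text` and select a recorded clip;
            # here we just publish a placeholder response identifier.
            self.connection.send(destination=TOPIC, body="play_response clip_042")

conn = stomp.Connection(BROKER)
conn.set_listener("classifier", ClassifierStub(conn))
conn.connect(wait=True)
conn.subscribe(destination=TOPIC, id="1", ack="auto")

# An ASR client would publish its hypothesis the same way; the video player
# would subscribe and react to "play_response" messages.
conn.send(destination=TOPIC, body="asr_result where is lodz")
```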

While most of the system components already existed before the start of this project, the Google Chrome ASR client and VideoPlayer tools were developed in the course of this project. The Google Chrome ASR client is a web application that takes advantage of the Google Speech API available in the Chrome browser. The tool provides a push-to-talk interface control for acquiring the user's speech; it uses the API to send audio to Google's ASR servers, collect the recognition result, and broadcast it over the ActiveMQ messaging. We developed the VideoPlayer tool so that we can control the response playback via the same ActiveMQ messaging. VideoPlayer also implements custom transitions between clips. It has video adjustment controls so that we can modify the scale and position of the video image, and it automatically displays a loop of idle video clips while the system is in the resting or listening states.

While the system was developed to be cross-platform so that it can run both on OS X and Windows, we conducted all our testing and experiments on OS X. The system is packaged as a single OS X application that starts the Launcher interface and the rest of the system. This significantly simplifies distribution and installation of the system on different computers.

3.3 Speech recognition

Currently the system can work with two speech recognition engines, CMU PocketSphinx and Google Chrome ASR. For our experiments we also considered Apple Dictation.[5]

One major decision when selecting a speech recognizer is whether or not it allows for training domain-specific language models (LMs).[6] Purely domain-specific LMs cannot recognize out-of-domain words or utterances. On the other hand, general-purpose LMs do not perform well with domain-specific words or utterances. Unlike PocketSphinx, which supports trainable LMs, both Google Chrome ASR and Apple Dictation come with their own out-of-the-box LMs that cannot be modified.

[5] https://support.apple.com/en-us/HT202584
[6] While the acoustic models of a speech recognizer recognize individual sounds, the LM provides information about what the recognizer should expect to listen to and recognize. If a word or a sequence of words is not included in the LM, it will never be recognized.

Table 1 shows example outputs of all three recognizers (the PocketSphinx examples were obtained with a preliminary LM). As we can see, Google Chrome ASR and Apple Dictation with their general-purpose LMs perform well for utterances that are not domain-specific. On the other hand, PocketSphinx is clearly much better at recognizing domain-specific words, e.g., "Pinchas", "Majdanek", etc., but fails to recognize general-purpose utterances if they are not included in its LM. For example, the user input "what's your favorite restaurant" is misrecognized as "what's your favorite rest shot" because the word "restaurant" or the sequence "favorite restaurant" was not part of the LM's training data. Similarly, the user input "did you serve in the army" is misrecognized as "did you certain the army" because the word "serve" or the sequence "serve in the army" was not included in the LM's training data.

For training LMs for PocketSphinx we used the CMU Statistical Language Modeling toolkit (Clarkson and Rosenfeld, 1997) with back-off 3-grams. The CMU pronouncing dictionary v0.7a (Weide, 2008) was used as the main dictionary, with the addition of domain-dependent words such as names. We used the standard US English acoustic models that are included in PocketSphinx.

User Input                       | Google Chrome ASR Output        | Apple Dictation Output          | CMU PocketSphinx Output
hello pinchas                    | hello pinterest                 | hello princess                  | hello pinchas
where is lodz                    | where is lunch                  | where is lunch                  | where is lodz
were you in majdanek             | were you in my dannic           | were you in my donick           | were you in majdanek
were you in kristallnacht        | were you and krystal knox       | where you went kristallnacht    | where you when kristallnacht from
did you serve in the army        | did you serve in the army       | he served in the army           | did you certain the army
have you ever lived in israel    | have you ever lived in israel   | that ever lived in israel       | are you ever live in a israel
what's your favorite restaurant  | what's your favorite restaurant | what's your favorite restaurant | what's your favorite rest shot

Table 1: Examples of speech recognition outputs

3.4 Dialogue policy

As mentioned in Section 3.2, NPCEditor combines the functions of Natural Language Understanding (NLU) and Dialogue Management: understanding the utterance text and selecting an appropriate response. The NLU functionality is a classifier trained on linked question-response pairs, which identifies the most appropriate response to new (unseen) user input. The dialogue management logic is designed to deal with instances where the classifier cannot identify a good direct response.

During training, NPCEditor calculates a response threshold based on the classifier's confidence in the appropriateness of selected responses: this threshold finds an optimal balance between false positives (inappropriate responses above threshold) and false negatives (appropriate responses below threshold) in the training data. At runtime, if the confidence for a selected response falls below the predetermined threshold, that response is replaced with an "off-topic" utterance that asks the user to repeat the question or takes initiative and changes the topic (Leuski et al., 2006); such failure to return a direct response, also called non-understanding (Bohus and Rudnicky, 2005), is usually preferred over returning an inappropriate one (misunderstanding).

The current system uses a five-stage off-topic selection algorithm which is an extension of the one presented in Artstein et al. (2009). The first time Pinchas fails to understand an utterance, he will assume this is a speech recognition error and ask the user to repeat it. If the misunderstanding persists, Pinchas will say that he doesn't know (without asking for repetition), and the third time he will state that he cannot answer the user's utterance. In a severe misunderstanding that persists beyond three exchanges, Pinchas will suggest a new topic in the fourth turn, and if even this fails to bring the user to ask a question that Pinchas can understand, then in the fifth turn Pinchas will give a quick segue and launch into a story of his choice. If at any point Pinchas hears an utterance that he can understand (that is, if the classifier finds a response above threshold), Pinchas will answer it directly, and the off-topic state will reset to zero.

A separate component of the dialogue policy is designed to avoid repetition. Normally, Pinchas responds with the top-ranked response if it is above the threshold. However, if the top-ranked response has been recently used (within a 4-turn window) and a lower-ranked response is also above the threshold, Pinchas will respond with the lower-ranked response. If the only responses above threshold are among the recently used ones, then Pinchas will choose one of them, since repetition is considered preferable to responding with an off-topic or inappropriate statement.
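The combined thresholding, repetition-avoidance, and escalating off-topic behavior can be summarized in a short sketch. The Python code below is a minimal illustration under stated assumptions: classifier results arrive as (response, confidence) pairs sorted by confidence, the threshold has been computed during training, and the five escalating off-topic utterances are given. It is not NPCEditor's actual implementation, and all names are hypothetical.

```python
# Illustrative sketch of the response selection policy described above.
# `ranked` is a list of (response_id, confidence) pairs sorted by decreasing
# confidence; `threshold` comes from training; `off_topics` holds the five
# escalating off-topic utterances (repeat request, "don't know", "can't
# answer", topic suggestion, segue into a story).
RECENT_WINDOW = 4  # responses used within the last 4 turns count as "recent"

class DialoguePolicy:
    def __init__(self, threshold, off_topics):
        self.threshold = threshold
        self.off_topics = off_topics
        self.off_topic_state = 0   # consecutive non-understandings so far
        self.recent = []           # response IDs used in the last few turns

    def select(self, ranked):
        above = [r for r, conf in ranked if conf >= self.threshold]
        if not above:
            # Non-understanding: escalate through the five off-topic stages.
            stage = min(self.off_topic_state, len(self.off_topics) - 1)
            self.off_topic_state += 1
            return self.off_topics[stage]
        # Understanding: reset the off-topic state and avoid recent repeats.
        self.off_topic_state = 0
        fresh = [r for r in above if r not in self.recent]
        choice = fresh[0] if fresh else above[0]   # repeat only if unavoidable
        self.recent = (self.recent + [choice])[-RECENT_WINDOW:]
        return choice
```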
chas responds with the top-ranked response if Additional training data were collected in the it is above the threshold. However, if the top- various stages of user testing – the Wizard of ranked response has been recently used (within Oz testing between the first and second record- a 4-turn window) and a lower ranked response ing sessions, and fully automated system testing

Wizard of Oz testing took place in June and July 2014; participants sat in front of a screen that showed rough-cut video segments of Mr. Gutter's statements, selected by human operators in response to user utterances in real time. Since the Wizard of Oz testing took place prior to the second recording, wizards were only able to choose statements from the first recording. The user utterances were recorded, transcribed, and analyzed to form the basis for the elicitation script for the second recording. Subsequent to the second recording, these utterances were reannotated to identify appropriate responses from all of the recorded statements, and these reannotated question-response links form the Wizard of Oz portion of the training data.

Testing with the automated system was carried out starting in October 2014, following the second recording of survivor statements. Users spoke to the automated system, and their utterances were recorded, transcribed, and annotated with appropriate responses. These data are partitioned into two: the testing that took place in late 2014 was mostly internal, with team members, other institute staff, and visitors, while the testing from early 2015 was mostly external, conducted over 3 days at a local museum. We thus have 4 portions of training data, summarized in Table 2.

Data source           Questions   Links
Elicitation                1546    2147
Wizard of Oz               1753    3329
System testing 2014        1825    1990
System testing 2015        1823    1959
Total                      6947    9425

Table 2: Training data sets

Test data for evaluating the classifier performance were taken from the system testing in late 2014. We picked a set of 400 user utterances, collected during the last day of testing, which was conducted off-site and therefore consisted primarily of external test participants (these utterances are not counted in Table 2 above). We only included in-domain utterances for which an appropriate on-topic response was available. The evaluation therefore measures the ability of the system to identify an appropriate response when one is available, not its ability to identify instances where an on-topic response is unavailable. There is some overlap in the test questions, so the 400 instances contain only 341 unique question types, with the most frequent question ("What is your name?") occurring 5 times. We believe it is fair to include such overlap in the test set, since it gives higher weight to the more frequent questions. Also, while the text of overlapping questions is identical, each instance is associated with a unique audio file; these utterances may therefore yield different speech recognizer outputs, resulting in different outcomes.

The test set was specially annotated to serve as a test key. There is substantial overlap in content between the recorded survivor statements, so many user utterances can be addressed appropriately by more than one response. For training purposes it is sufficient to link each user utterance to some appropriate responses, but the test key must link each utterance to all appropriate responses. It is impractical to check each of the 400 test utterances against all 1726 possible responses, so instead we used the following procedure to identify responses that are likely to come up in response to specific test questions: we trained the system under different partitions of the training data and different training parameters, ran the test questions through each of the system versions, and from each system run we collected the responses that the system considered appropriate (that is, above threshold) for each question.
This resulted in a set of 3737 utterance-response pairs, ranging from 3 to 19 responses per utterance, which represent likely system outputs for future training configurations. All the responses retrieved by the system were rated for coherence on a scale of 1–4 (Table 3). The responses rated 3 or 4 were deemed appropriate for inclusion in the test key, a total of 1838 utterance-response pairs, ranging from 1 to 10 responses per utterance.

Code   Interpretation
4      Directly addresses the user question.
3      Indirectly addresses the user question, or contains additional irrelevant material.
2      Does not address the user question, but is on a related topic.
1      Irrelevant to the user question.

Table 3: Coherence rating for system responses

4 Method

4.1 Speech recognition

As mentioned above, neither the Google nor the Apple ASR allows for trainable LMs. For PocketSphinx we experimented with different domain-specific LMs, and below we report results with two of them: one trained on Wizard of Oz and system testing data (approx. 5000 utterances) collected until December 2014 (LM-ds), and another trained on additional data (approx. 6500 utterances) collected until January 2015 (LM-ds-add). The test set was the 400 utterances mentioned above. There was no overlap between the training and test data sets.

In order to evaluate the performance of the speech recognizers we use the standard word error rate (WER) metric:

    WER = (Substitutions + Deletions + Insertions) / (Length of the reference transcription)
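As a concrete illustration, WER can be computed from a standard edit-distance alignment between the reference transcription and the recognizer output. The sketch below is a generic implementation of that textbook computation, not the specific scoring tool used in our evaluation.

```python
# Word error rate via Levenshtein alignment over words: the minimum number of
# substitutions, deletions, and insertions needed to turn the reference into
# the hypothesis, divided by the reference length.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                            # deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j                            # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(sub,                 # substitution or match
                             dist[i - 1][j] + 1,  # deletion
                             dist[i][j - 1] + 1)  # insertion
    return dist[len(ref)][len(hyp)] / max(len(ref), 1)

# Example from Table 1: "majdanek" comes out as "my dannic", i.e. 2 edits
# over a 4-word reference, so the WER is 0.5.
print(word_error_rate("were you in majdanek", "were you in my dannic"))
```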
4.2 Classifier evaluation

Evaluation of the classifier is difficult, because it has to take into account the dialogue policy: the classifier typically returns the top-ranked response, but may return a lower-ranked response if it is above threshold and the higher-ranked responses were used recently. So while the classifier ranks all the available responses, anything below the top few will never be selected by the dialogue manager, rendering measures such as precision and recall quite irrelevant. An ideal evaluation should give the highest weight to the correctness of the top-ranked response, with rapidly decreasing weight to the next several responses, but it is difficult to determine what weights are appropriate. We therefore focus on the top answer, since in most cases the top answer is what will get served to the user.

The top answer can have one of three outcomes: it can be appropriate (good), inappropriate (bad), or below threshold, in which case an off-topic response is served. A good response is better than an off-topic, which is in turn better than a bad response. This makes it difficult to compare systems with different off-topic rates: how do two systems compare if one gives more good and bad responses than the other, but fewer off-topics? We therefore compare systems using error return plots, which show the error rate across all possible return rates (Artstein, 2011): for each system we calculate the number of errors at each return rate, and then plot the number of errors against the number of off-topics.

We used 6 combinations of the training data described in Section 3.5. The baseline is trained with only the elicitation questions, and represents the performance we might expect if we were to build a dialogue system based on the recording sessions alone, without collecting user question data (except to the extent that user questions influenced the second recording session). To this baseline we successively added training data from the Wizard of Oz testing, system testing 2014, and system testing 2015. Our final training sets include the elicitation questions and system testing 2014 (without Wizard of Oz data), and the same with system testing 2015 added.

All of the classifiers were trained in NPCEditor using the same options: text unigrams for the question language models, text unigrams plus IDs for the response language models, and F-score as the classifier scoring function during training.

We used 3 versions of the test utterances: the transcribed text, the output of Google ASR, and the output of PocketSphinx, and ran each version through each of the 6 classifiers, for a total of 18 configurations. For each testing configuration, we retrieved the top-ranked response for each utterance, together with the classifier confidence and a true/false indication of whether the response matched the answer key. The responses were ranked by the classifier confidence, and for each possible cutoff point (from returning zero off-topic responses to returning off-topic responses for all 400 utterances), we calculated the number of errors among the on-topic responses and plotted that against the number of off-topics. Each plot represents the error-return tradeoff for a particular testing configuration (see Section 5.2).
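The error-return tradeoff described in the last paragraph can be computed directly from the per-utterance results. The following is a minimal sketch under the assumption that each test utterance yields a classifier confidence and a correctness flag for its top-ranked response; it reproduces the plotting procedure in spirit, not the exact scripts behind Figures 2 and 3.

```python
# Error-return tradeoff: sort utterances by the confidence of their top-ranked
# response; treating the k lowest-confidence utterances as off-topics, count
# the errors among the remaining on-topic responses, for every k from 0 to N.
from typing import List, Tuple

def error_return_curve(results: List[Tuple[float, bool]]) -> List[Tuple[int, int]]:
    """results: (confidence, is_correct) for the top-ranked response of each
    test utterance.  Returns (off_topics, errors) points for plotting."""
    # Highest-confidence utterances are the last to be turned into off-topics.
    ordered = sorted(results, key=lambda r: r[0], reverse=True)
    curve = []
    for off_topics in range(len(ordered) + 1):
        on_topic = ordered[:len(ordered) - off_topics]  # responses actually returned
        errors = sum(1 for _, correct in on_topic if not correct)
        curve.append((off_topics, errors))
    return curve

# Toy usage: four test utterances with made-up confidences and correctness.
points = error_return_curve([(0.9, True), (0.7, False), (0.4, True), (0.2, False)])
print(points)  # [(0, 2), (1, 1), (2, 1), (3, 0), (4, 0)]
```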

5 Results

5.1 Speech recognition evaluation

Table 4 shows the WERs for the three different speech recognizers and the two different LMs. Note that we also experimented with interpolating domain-specific LMs with background LMs available from http://keithv.com/software. Interpolation did not help, but this is still an issue under investigation. Interpolation helped with speakers who had low WERs (smooth, easy-to-recognize speech) but hurt in the case of speakers with high WERs. In the latter case, having a background model meant that there were more choices for the speech recognizer to choose from, which instead of helping caused confusion.

Speech Recognizer   General LM   LM-ds     LM-ds-add
Google                   5.07%       —             —
Apple                    7.76%       —             —
PocketSphinx                 —   22.04%        19.39%

Table 4: Speech recognition results (WER). General LM stands for the general-purpose LM, LM-ds for the domain-specific LM trained with data collected until December 2014, and LM-ds-add for the domain-specific LM trained with additional data collected until January 2015.

We also noticed that PocketSphinx was less tolerant of environmental noise, which most of the time resulted in insertions and substitutions. For example, as we can see in Table 1, the user input "have you ever lived in israel" was misrecognized by PocketSphinx as "are you ever live in a israel". These misrecognitions do not necessarily confuse the classifier, but of course they often do.

5.2 Classifier evaluation

Classifier performance is best when training on all the data and testing on transcriptions rather than speech recognizer output. Figure 2 shows the effect of the amount of training data on classifier performance when tested on transcribed text (a similar effect is observed when testing on speech recognizer output). Lower curves represent better performance. As expected, performance improves with additional training data: training on the full set of data cuts error rates by about a third compared to training on the elicitation questions alone. Additional training data (both new questions and question-response links) are likely to improve performance even further.

[Figure 2: Tradeoff between errors and off-topics for various training sets, tested on transcribed text; one curve per training set (elicitation, elicitation-wizard, elicitation-system2014, elicitation-system2014-system2015, elicitation-wizard-system2014, elicitation-wizard-system2014-system2015)]

The effect of speech recognition on classifier performance is shown in Figure 3. Automatic speech recognition does impose a performance penalty compared to testing on transcriptions, but the penalty is not very large: classifier errors when testing with Google ASR are between 1 and 3 percentage points higher than with transcriptions, while PocketSphinx fares somewhat worse, with classifier errors about 5 to 8 percentage points higher than with transcriptions. At a 20% off-topic rate, the response error rates are 14% for transcriptions and 16% for Google ASR, meaning that almost two thirds of user utterances receive a direct appropriate response. At 30% off-topics, errors drop to 10–11%, and direct appropriate responses drop to just shy of 60%. Informal impressions from current testing at a museum (Section 6) suggest that these numbers are sufficient to enable a reasonable conversation flow.

[Figure 3: Tradeoff between errors and off-topics for different test sets (PocketSphinx ASR, Google ASR, transcriptions), trained on the full data]

6 Discussion

This paper has demonstrated that time-offset interaction with a real person is achievable with present-day spoken language processing technology. Not only are we able to collect a sufficiently large and varied set of statements to address user utterances (Artstein et al., 2015), we are also able to use speech recognition and language understanding technology to identify appropriate responses frequently enough to enable a natural interaction flow. Future work is needed in three areas: investigating the interaction quality of the dialogue system, improving the language processing, and generalizing the process to additional situations.

To investigate the interaction quality, we need to look at dialogues in context rather than as isolated utterances, and to collect user feedback. We are presently engaged in a joint testing, demonstration, and data collection effort that is intended to address these issues. The time-offset interaction system has been temporarily installed at the Illinois Holocaust Museum and Education Center in Skokie, Illinois, where visitors interact with the system as part of their museum experience (Isaacs, 2015). The system is set up in an auditorium and users talk to Pinchas in groups, in a setting that is similar to the in-person encounters with Holocaust survivors which also take place at the museum. Due to physical limitations of the exhibit space, interaction is mediated by museum docents: each user question is relayed by the docent into the microphone, and Pinchas responds to the docent's speech. An excerpt of a museum interaction is in the Appendix. Data and feedback from the museum installation will be used to evaluate the interaction quality, including user feedback as to the naturalness of the interaction and user satisfaction.

The ongoing testing also serves the purpose of data collection for improving system performance: Figure 2 shows that errors diminish with additional training data, and it appears that we have not yet reached the point of diminishing returns with about 7000 training utterances. We hope to collect an average of 10 training utterances per response, that is, about 17000 user utterances. Annotation is also incomplete: the test key has an average of 4.6 links per utterance, as opposed to an average of around 1.4 links per utterance in the training data. While complete linking is not necessary for classifier operation, improving the links will probably improve performance.

In addition to improving performance through improved data, there are also algorithmic improvements that can be made to the language processing components. One goal is to leverage the relative strengths of the general-purpose and domain-specific ASRs, e.g., through the classifier: past work has shown that language understanding can be improved by allowing the NLU to select from among several hypotheses provided by a single speech recognizer (Morbini et al., 2012), and we propose to try a similar method to utilize the outputs of separate speech recognizers (a possible selection scheme is sketched at the end of this section). Another idea is to combine or align the outputs of the speech recognizers (before they are forwarded to the classifier), taking into account information from the recognition confidence scores and lattices. This will potentially help in cases where different recognizers succeed in correctly recognizing different parts of the utterance.

Time-offset interaction has a large potential impact on preservation and education: people in the future will be able not only to see and listen to historical figures, but also to interact with them in conversation. Future research into time-offset interaction will need to generalize the development process, in order to enable efficient use of resources by identifying common user questions that are specific to the person, ones that are specific to the dialogue context or conversation topic, and ones that are of more general application.
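As an illustration of the first idea, one simple way to let the classifier arbitrate between recognizers is to classify every available hypothesis and keep the one whose best response scores highest. The sketch below makes several assumptions: a classify(text) function standing in for the NPCEditor classifier, one-best hypotheses arriving as plain strings from each recognizer, and highest-confidence selection as the arbitration rule. It is one possible approach, not the method proposed or evaluated in this paper.

```python
# Hypothetical hypothesis selection across recognizers: run each ASR output
# through the classifier and keep the response the classifier is most
# confident about.  `classify` stands in for the NPCEditor classifier and is
# assumed to return (response_id, confidence) for a text input.
from typing import Callable, Dict, Tuple

def pick_response(hypotheses: Dict[str, str],
                  classify: Callable[[str], Tuple[str, float]],
                  threshold: float) -> str:
    """hypotheses maps a recognizer name (e.g. 'google', 'pocketsphinx')
    to its 1-best output for the current user utterance."""
    best_response, best_conf = None, 0.0
    for recognizer, text in hypotheses.items():
        response, conf = classify(text)
        if conf > best_conf:
            best_response, best_conf = response, conf
    if best_response is None or best_conf < threshold:
        return "OFF_TOPIC"  # fall back to the off-topic policy of Section 3.4
    return best_response
```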
Acknowledgments

This work was made possible by generous donations from private foundations and individuals. We are extremely grateful to The Pears Foundation, Louis F. Smith, and two anonymous donors for their support. The work was supported in part by the U.S. Army; statements and opinions expressed do not necessarily reflect the position or the policy of the United States Government, and no official endorsement should be inferred. Heather Maio and Alesia Gainer spent long hours on data collection and system testing. The Los Angeles Museum of the Holocaust, the Museum of Tolerance, and New Roads School in Santa Monica offered their facilities for data collection. The USC Shoah Foundation provided financial and administrative support, and facilities. Finally, we owe special thanks to Pinchas Gutter for sharing his story, and for his tireless efforts to educate the world about the Holocaust.

References

Ron Artstein, Sudeep Gandhe, Jillian Gerten, Anton Leuski, and David Traum. 2009. Semi-formal evaluation of conversational characters. In Orna Grumberg, Michael Kaminski, Shmuel Katz, and Shuly Wintner, editors, Languages: From Formal to Natural. Essays Dedicated to Nissim Francez on the Occasion of His 65th Birthday, volume 5533 of Lecture Notes in Computer Science, pages 22–35. Springer, Heidelberg, May.

Ron Artstein, David Traum, Oleg Alexander, Anton Leuski, Andrew Jones, Kallirroi Georgila, Paul Debevec, William Swartout, Heather Maio, and Stephen Smith. 2014. Time-offset interaction with a Holocaust survivor. In Proceedings of IUI, pages 163–168, Haifa, Israel, February.

Ron Artstein, Anton Leuski, Heather Maio, Tomer Mor-Barak, Carla Gordon, and David Traum. 2015. How many utterances are needed to support time-offset interaction? In Proceedings of the Twenty-Eighth International Florida Artificial Intelligence Research Society Conference, pages 144–149, Hollywood, Florida, May. AAAI Press.

Ron Artstein. 2011. Error return plots. In Proceedings of SIGDIAL, pages 319–324, Portland, Oregon, June.

Dan Bohus and Alexander I. Rudnicky. 2005. Sorry, I didn't catch that! – An investigation of non-understanding errors and recovery strategies. In Proceedings of SIGDIAL, pages 128–143, Lisbon, Portugal, September.

Lucy Chabot. 1990. Nixon library technology lets visitors 'interview' him. Los Angeles Times, July 21.

Philip Clarkson and Ronald Rosenfeld. 1997. Statistical language modeling using the CMU-Cambridge toolkit. In Proceedings of Eurospeech, Rhodes, Greece, September.

Sudeep Gandhe and David Traum. 2010. I've said it before, and I'll say it again: An empirical investigation of the upper bound of the selection approach to dialogue. In Proceedings of SIGDIAL, pages 245–248, Tokyo, September.

Mike Isaacs. 2015. Holocaust Museum: Pilot program aims to preserve survivor voices for future generations. Chicago Tribune, May 19.

Anton Leuski and David Traum. 2011. NPCEditor: Creating virtual human dialogue using information retrieval techniques. AI Magazine, 32(2):42–56.

Anton Leuski, Ronakkumar Patel, David Traum, and Brandon Kennedy. 2006. Building effective question answering characters. In Proceedings of SIGDIAL, Sydney, Australia, July.

Heather Maio, David Traum, and Paul Debevec. 2012. New dimensions in testimony. PastForward, Summer:22–26.

Donald Marinelli and Scott Stevens. 1998. Synthetic interviews: The art of creating a 'dyad' between humans and machine-based characters. In Proceedings of the Sixth ACM International Conference on Multimedia: Technologies for Interactive Movies (MULTIMEDIA), pages 11–16, New York, NY, USA. ACM.

Fabrizio Morbini, Kartik Audhkhasi, Ron Artstein, Maarten Van Segbroeck, Kenji Sagae, Panayiotis Georgiou, David R. Traum, and Shri Narayanan. 2012. A reranking approach for recognition and classification of speech input in conversational dialogue systems. In Proceedings of the Fourth IEEE Workshop on Spoken Language Technology (SLT), pages 49–54, Miami Beach, Florida, December.

Eric Sloss and Anne Watzman. 2005. Carnegie Mellon's Entertainment Technology Center conjures up Benjamin Franklin's ghost. Press release, Carnegie Mellon Media Relations, June 28. http://www.cmu.edu/PR/releases05/050628_etc.html.

R. L. Weide. 2008. The CMU pronouncing dictionary.

A Dialogue sample

The following dialogue excerpt is taken from the beta testing in Illinois (see Section 6). Speech recognition output is shown in square brackets when it departs substantially from the spoken text.

User: Hello Pinchas, how are you? [ASR: Las Vegas how are you]

Pinchas: Can you just repeat that?

User: Can you hear me OK?

Pinchas: I can see you and I can hear you.

User: Hello Pinchas, can you hear me? [ASR: how thick is can you hear me]

Pinchas: I can hear you, yeah.

User: Pinchas, can you tell me how old you are? [ASR: Vegas can you tell me how old you are]

Pinchas: I was born in nineteen thirty two and we are now in two thousand and fourteen and you can make your own reckoning.

User: Can you tell me a little bit about your childhood?

Pinchas: My life before the war, I was born in Lodz in Poland, and I was eight years old when the war started. So the first eight years I lived in a home, a very religious one, a Hasidic home. I had a twin sister, my...

User: Have you ever met Hitler? [ASR: mermaid Hitler]

Pinchas: Please repeat that.

User: Did you ever meet Hitler?

Pinchas: Did I ever meet Hitler? Well, Hitler was a Führer of Germany and I was born in Poland and I was a little Jewish boy, so the chances of somebody like me meeting Hitler was impossible, because Hitler was...

User: What would you do if you met Hitler?

Pinchas: Geyn drerd arein, that's Yiddish and what it means go and bury yourself.
