
Using Paraphrases and Lexical Semantics to Improve the Accuracy and the Robustness of Supervised Models in Situated Dialogue Systems

Claire Gardent, CNRS/LORIA, Nancy ([email protected])
Lina M. Rojas Barahona, Université de Lorraine/LORIA, Nancy ([email protected])

Abstract

This paper explores to what extent lemmatisation, lexical resources, distributional semantics and paraphrases can increase the accuracy of supervised models for dialogue management. The results suggest that each of these factors can help improve performance but that the impact will vary depending on their combination and on the evaluation mode.

1 Introduction

One strand of work in dialog research targets the rapid prototyping of virtual humans capable of conducting a conversation with humans in the context of a virtual world. In particular, question answering (QA) characters can respond to a restricted set of topics after training on a set of dialogs whose utterances are annotated with dialogue acts (Leuski and Traum, 2008).

As argued in (Sagae et al., 2009), the size of the training corpus is a major factor in allowing QA characters that are both robust and accurate. In addition, the training corpus should arguably be of good quality in that (i) it should contain the various ways of expressing the same content (paraphrases) and (ii) the data should not be skewed. In sum, the ideal training data should be large (more data is better data); balanced (a similar amount of data for each class targeted by the classifier); and varied (it should encompass the largest possible number of paraphrases and synonyms for the utterances of each class).

In this paper, we explore different ways of improving and complementing the training data of a supervised QA character. We expand the size and the quality (less skewed data) of the training corpus using paraphrase generation techniques. We compare the performance obtained on lemmatised vs. non-lemmatised data. And we investigate how various resources (synonym dictionaries, WordNet, distributional neighbours) can be used to handle unseen words at run time.

2 Related work

Previous work on improving the robustness of supervised dialog systems includes detecting and handling out-of-domain utterances for generating feedback (Lane et al., 2004); using domain-restricted lexical semantics (Hardy et al., 2004); and work on manual data expansion (DeVault et al., 2011). Our work follows up on this research but provides a systematic investigation of how data expansion, lemmatisation and synonym handling impact the performance of a supervised QA engine.

3 Experimental Setup

We run our experiments on a dialog engine developed for a serious game called Mission Plastechnologie. In this game, the player must interact with different virtual humans through a sequence of 12 subdialogs, each of them occurring in a different part of the virtual world.

Training Data. The training corpus consists of around 1250 Human-Human dialogues which were manually annotated with dialog moves. As the following dialog excerpt illustrates, the dialogs are conducted in French and each dialog turn is manually annotated using a set of 28 dialog acts. For a more detailed presentation of the training corpus and of the annotation scheme, the reader is referred to (Rojas-Barahona et al., 2012a).

dialog: 01_dialogDirecteur - Tue Jun 14 11:04:23 2011
>M.Jasper: Bonjour, je suis M.Jasper le directeur. || greet
  (Hello, I am the director, Mr. Jasper.)
>M.Jasper: Qu’est-ce que je peux faire pour vous ? || ask(task(X))
  (What can I do for you?)
>Lucas: je dois sauver mon oncle || first_step
  (I must rescue my uncle)
>M.Jasper: Pour faire votre manette, il vous faut des plans. Allez voir dans le bureau d’études, ils devraient y être. || inform(do(first_step))
  (To build the joystick you will need the plans. You will find them in the Designing Office.)
>M.Jasper: Bonne Chance ! || quit
  (Good Luck!)

Dialog Systems. For our experiments, we use a hybrid dialog system similar to that described in (Rojas Barahona et al., 2012b; Rojas Barahona and Gardent, 2012). This system combines a classifier for interpreting the player's utterances with an information-state dialog manager which selects an appropriate system response based on the dialog move assigned by the classifier to the user turn. The classifier is a logistic regression classifier with L1 regularisation, trained with MALLET (McCallum, 2002); one classifier was trained for each subdialog in the game. The features used for training are the set of content words which are associated with a given dialog move and which remain after TF*IDF (Term Frequency * Inverse Document Frequency) filtering. Note that in this experiment, we do not use contextual features such as the dialog acts labeling the previous turns. There are two reasons for this. First, we want to focus on the impact of synonym handling, paraphrasing and lemmatisation on dialog management. Removing contextual features allows us to focus on how content features (content words) can be improved by these mechanisms. Second, when evaluating on the H-C corpus (see below), contextual features are often incorrect (because the system might incorrectly interpret and thus label a user turn). Excluding contextual features from training allows for a fair comparison between the H-H and the H-C evaluation.
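A minimal sketch of such a per-subdialog classifier (assuming scikit-learn in place of the MALLET setup described above; the helper name and the min_df threshold are illustrative assumptions, not part of this work):

    # Hypothetical analogue of the per-subdialog classifier; not the code used here.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def train_subdialog_classifier(utterances, dialog_moves):
        """Train one classifier for one subdialog.

        utterances   -- preprocessed user turns, e.g. ["je dois sauver mon oncle", ...]
        dialog_moves -- dialog-act labels, e.g. ["first_step", ...]
        """
        pipeline = make_pipeline(
            # TF*IDF weighting; min_df discards very rare terms, standing in
            # for the TF*IDF filtering of content words described above
            TfidfVectorizer(min_df=2),
            # L1-regularised logistic regression, as in the MALLET classifier
            LogisticRegression(penalty="l1", solver="liblinear"),
        )
        pipeline.fit(utterances, dialog_moves)
        return pipeline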
Test Data and Evaluation Metrics. We use accuracy (the number of correct classifications divided by the number of instances in the test set) to measure performance and we carry out two types of evaluation. On the one hand, we use 10-fold cross-validation on the EmoSpeech corpus (H-H data). On the other hand, we report accuracy on a corpus of 550 Human-Computer (H-C) dialogues obtained by having 22 subjects play the game against the QA character trained on the H-H corpus. As we shall see below, performance decreases in this second evaluation, suggesting that subjects produce different turns when playing with a computer than with a human, thereby inducing a weak out-of-domain effect and negatively impacting classification. Evaluation on the H-C corpus therefore gives a measure of how well the techniques explored help improve the dialog engine when it is used in a real-life setting.

Correspondingly, we use two different tests for measuring statistical significance. In the H-H evaluation, significance is computed using the Wilcoxon signed rank test because the data are dependent and are not assumed to be normally distributed. When building the test set we took care not to include paraphrases of utterances in the training partition (for each automatically generated paraphrase we keep track of the original utterance); however, utterances in both datasets might be produced by the same subject, since a subject completed 12 distinct dialogues during the game. Conversely, in the H-C evaluation, the training (H-H data) and test (H-C data) sets were collected under different conditions with different subjects, so significance was computed using the McNemar sign test (Dietterich, 1998).
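A minimal sketch of the two tests (using SciPy and statsmodels; all numbers below are illustrative placeholders, not results from this study):

    # Sketch of the two significance tests; the data are made-up placeholders.
    from scipy.stats import wilcoxon
    from statsmodels.stats.contingency_tables import mcnemar

    # H-H setting: paired per-fold accuracies of two systems (placeholder values)
    acc_baseline = [0.71, 0.68, 0.74, 0.70, 0.69, 0.72, 0.73, 0.70, 0.71, 0.69]
    acc_expanded = [0.74, 0.72, 0.75, 0.73, 0.71, 0.74, 0.76, 0.72, 0.74, 0.71]
    stat, p = wilcoxon(acc_baseline, acc_expanded)
    print("Wilcoxon signed rank test: p =", p)

    # H-C setting: 2x2 table of per-utterance correctness of two systems
    # (rows: system A correct/incorrect; columns: system B correct/incorrect)
    table = [[310, 45],
             [20, 175]]
    print("McNemar test: p =", mcnemar(table, exact=False).pvalue)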
4 Paraphrases, Synonyms and Lemmatisation

We explore three main ways of modifying the content features used for classification: lemmatising the training and the test data; augmenting the training data with automatically acquired paraphrases; and substituting unknown words with synonyms at run time.

Lemmatisation. We use the French version of TreeTagger (http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/) to lemmatise both the training and the test data. Lemmas without any filtering were used to train the classifiers. We then compare performance with and without lemmatisation. As we shall see, the lemma and the POS tag provided by TreeTagger are also used to look up synonym dictionaries and EuroWordNet when synonym handling is applied at run time.
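A minimal sketch of this step (assuming the treetaggerwrapper Python package and a local TreeTagger installation; neither is prescribed here):

    # Hypothetical lemmatisation helper; requires TreeTagger installed locally.
    import treetaggerwrapper

    # TAGLANG="fr" selects the French parameter file
    tagger = treetaggerwrapper.TreeTagger(TAGLANG="fr")

    def lemmatise(utterance):
        """Return (word, pos, lemma) triples for a French utterance."""
        raw = tagger.tag_text(utterance)
        return [(t.word, t.pos, t.lemma) for t in treetaggerwrapper.make_tags(raw)]

    # e.g. lemmatise("je dois sauver mon oncle")
    # could yield [('je', 'PRO:PER', 'je'), ('dois', 'VER:pres', 'devoir'), ...]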
Paraphrases. (DeVault et al., 2011) showed that enriching the training corpus with manually added paraphrases increases accuracy. Here we exploit automatically acquired paraphrases and use them not only to increase the size of the training corpus but also to better balance it. We proceed as follows. First, we generated paraphrases using a pivot machine translation approach whereby each user utterance in the training corpus (around 3610 utterances) was translated into some target language and back into French. Using six different pivot languages (English, Spanish, Italian, German, Chinese and Arabic), we generated around 38000 paraphrases. We used the Google Translate API for the translations.
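A minimal sketch of the pivot approach (assuming a generic translate(text, source, target) helper; the actual translation calls are not specified here):

    # Hypothetical pivot-based paraphrasing; translate() is a placeholder.
    PIVOTS = ["en", "es", "it", "de", "zh", "ar"]  # the six pivot languages

    def translate(text, source, target):
        """Placeholder for a machine translation call (e.g. a Google
        Translate API client); not implemented here."""
        raise NotImplementedError

    def pivot_paraphrases(utterance_fr):
        """Round-trip a French utterance through each pivot language."""
        paraphrases = set()
        for pivot in PIVOTS:
            pivoted = translate(utterance_fr, "fr", pivot)  # fr -> pivot
            back = translate(pivoted, pivot, "fr")          # pivot -> fr
            if back != utterance_fr:                        # keep only new variants
                paraphrases.add(back)
        return paraphrases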
Algorithm extendingDataWithParaphrases(trainingset ts)
1.  Let c be the set of categories in ts
2.  Let µ be the mean number of training instances per category
3.  Let σ be the standard deviation of training instances per category
4.  Let Npc be the number of paraphrases per category
5.  Let lp ← min_j Npc_j
6.  Repeat
7.    set i ← 0
8.    Let Ninst_ci be the number of instances in category ci
9.    d_i ← Ninst_ci − µ
10.   if d_i < σ then
11.     Ninst_ci ← lp
12.   else
13.     Ninst_ci ← lp / 2
14.   end if
15.   set i ← i + 1
16.   if i > |c| then
17.     terminate
18.   end

FIGURE 1: Algorithm for augmenting the training data with paraphrases.
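One possible reading of Figure 1 as runnable Python (a sketch; treating the computed value as the number of paraphrases each category receives is an interpretation, not something the figure states explicitly):

    # One possible reading of Figure 1; not the authors' code.
    import statistics

    def extend_with_paraphrases(counts, n_paraphrases):
        """Decide how many paraphrases each category receives.

        counts        -- dict: category -> number of training instances
        n_paraphrases -- dict: category -> number of available paraphrases
        """
        mu = statistics.mean(counts.values())
        sigma = statistics.stdev(counts.values())
        lp = min(n_paraphrases.values())  # smallest paraphrase count, as in line 5
        # Categories close to (or below) the mean get the full amount lp;
        # over-represented categories get only half of it.
        return {cat: lp if (n - mu) < sigma else lp // 2
                for cat, n in counts.items()}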
Category    Train Instances    Balanced Instances
greet       24                 86

Second, we filtered out paraphrases containing lexical translations with a very low frequency (< 0.001) across translations, i.e., lexical translations given by few translations and/or translation systems. We then preprocessed the paraphrases in the same way the utterances of the initial training corpus were preprocessed, i.e., utterances were unaccented and converted to lower case, stop words were removed, and the remaining words were filtered with TF*IDF.
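A minimal sketch of this preprocessing chain (the French stop-word list below is an assumed placeholder, not the list used in the experiments):

    # Sketch of the preprocessing applied to utterances and paraphrases.
    import unicodedata

    STOP_WORDS_FR = {"je", "le", "la", "les", "de", "des", "un", "une", "pour"}  # placeholder

    def unaccent(text):
        """Strip diacritics, e.g. 'études' -> 'etudes'."""
        nfkd = unicodedata.normalize("NFKD", text)
        return "".join(ch for ch in nfkd if not unicodedata.combining(ch))

    def preprocess(utterance):
        """Unaccent, lower-case and remove stop words; TF*IDF filtering
        of the remaining words follows as a separate step."""
        tokens = unaccent(utterance.lower()).split()
        return [t for t in tokens if t not in STOP_WORDS_FR]

    # preprocess("Je dois sauver mon oncle") -> ['dois', 'sauver', 'mon', 'oncle']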