CRF-based Disfluency Detection using Semantic Features for German to English Spoken Translation

Eunah Cho, Thanh-Le Ha, Alex Waibel

International Center for Advanced Communication Technologies (InterACT), Institute of Anthropomatics, Karlsruhe Institute of Technology
{eunah.cho|thanh-le.ha|alex.waibel}@kit.edu

Abstract

Disfluencies in speech pose severe difficulties for the machine translation of spontaneous speech. This paper presents our conditional random field (CRF)-based speech disfluency detection system, developed for German, to improve spoken language translation performance.
In order to detect speech disfluencies while taking the syntax and semantics of speech utterances into account, we carried out a CRF-based approach using information learned from word representations and from the phrase table used for machine translation. The word representations are obtained using recurrent neural networks, and the projected words are clustered using the k-means algorithm. Using the output of the model trained with the word representation and phrase table features, we achieve an improvement of 1.96 BLEU points on the lecture test set. By keeping or removing the human-annotated disfluencies, we show an upper and a lower bound of translation quality. In an oracle experiment we gain 3.16 BLEU points of improvement on the lecture test set, compared to the same set with all disfluencies.

1. Introduction

Natural language processing (NLP) tasks often suffer from speech disfluencies in spontaneous speech. In spontaneous speech, speakers occasionally talk with disfluencies such as repetitions, stuttering, or filler words. These speech disfluencies inhibit proper processing by subsequent applications, for example machine translation (MT) systems.
MT systems are generally trained on well-structured, cleanly written texts. The mismatch between this training data and the actual test data, in this case spontaneous speech, causes a performance drop. A system which reconstructs the non-fluent output of an automatic speech recognition (ASR) system into a proper form for subsequent applications will increase the performance of those applications.
A considerable number of works on this task, such as [1] and [2], focus on English from the point of view of ASR systems. One of our goals is to extend this work to German, and also to apply it to the MT task, in order to analyze the effect of speech disfluencies on MT.
In this work we aim to analyze and improve machine translation performance by detecting and removing the disfluencies in a preprocessing step before translation. For this we adopt a conditional random field (CRF)-based approach, in which the characteristics of disfluencies can be modeled using various features. In order to address the issues discussed above, we devised features learned from word representations and from the phrase table used in the MT process, in addition to lexical and language model features. The MT performance on the CRF-detected output is evaluated and compared to the result of an oracle experiment, in which the test data stripped of all annotated disfluencies is translated.
This paper is organized as follows. In Section 2, a brief overview of past research on speech disfluency detection is given. The annotated data used in this work is described in Section 3, followed by Section 4, which presents the CRF modeling technique with extended features from word representations and phrase table information. Section 5 describes our experimental setups and results, along with an analysis. Finally, Section 6 concludes our discussion.

1.1. Disfluencies in Spontaneous Speech

Filler words (e.g. "uh", "uhm") are a common disfluency in spontaneous speech. Discourse markers (e.g. "you know", "well" in English) are considered filler words as well. Another common disfluency is repetition, where speakers repeat their words. A repetition can either be an identical repetition, where speakers exactly repeat a word or phrase, or a rough repetition, where they correct themselves using similar words. Simplified examples of such repetitions from our disfluency-annotated lecture data, with English gloss translations, are shown in Table 1; the identical repetition is in the upper part and the rough repetition in the lower part.

Table 1: Repetitions in spontaneous speech
  Source   Das sind die Vorteile, die Sie die Sie haben.
  En.gls   These are the advantages, that you that you have.
  Source   Da gibt es da gab es nur eins.
  En.gls   There is there was only one.

Another type of speech disfluency, in which several speech fragments are dropped and new fragments are introduced, is the restart fragment. As presented in Table 2, the speaker starts a new way of forming the sentence after aborting the first several utterances. Although the example shown in this table depicts a case where the context is kept in the following new utterances, we occasionally encounter other cases in spontaneous speech where the previous context is abandoned and a new topic is discussed.

Table 2: Restart fragment in spontaneous speech
  Source       Das ist alles, was Sie das haben Sie alles gelernt, und jetzt können Sie...
  Engl. gloss  That is all, what you you have learned all of this, and now can you...

1.2. Motivation

Detecting obvious filler words and simple repetitions is more feasible for automatic modeling techniques than other sorts of disfluencies, using lexical patterns such as typical filler word tokens and repetitive part-of-speech (POS) tokens, as in previous work [2, 3]. Although this holds for obvious disfluencies (i.e. "uh", "uhm", identical repeated tokens, and so on), we are confronted with many other cases where it is hard to recognize or decide by automatic means whether a token is a disfluency. This issue persists even when the disfluency is a filler word or a repeated token. Table 3 contains a sentence from the annotated data which illustrates this issue for repetition. In the German source sentence, the word üblicherweise, meaning 'customarily', is annotated as a disfluency, as it was the speaker's intention to change the utterance into the next word, traditionell, which means 'traditionally'.

Table 3: Difficulty in detecting repetitions
  Source       Die Kommunikation zwischen Mensch und Maschine, die wir so üblicherweise traditionell immer sehen, ist die...
  Engl. gloss  The communication between man and machine, which we customarily traditionally always see, is the...

Discourse markers can be hard to capture, as they occasionally convey meaning in a sentence. Just as with English discourse markers such as "I mean", "actually", and "like", German discourse markers, as shown in Table 4, can sometimes be used as discourse markers and sometimes as normal tokens. The table shows that the German word nun means 'now' in the upper part, but in the lower part it is used as a discourse marker and does not need to be translated. In the lower row, the word nun appears with another discourse marker, ja, which can also mean 'yes' in English, depending on the context.

Table 4: Difficulty in detecting discourse markers
  Source     Sie sehen hier unseren Simultanübersetzer, der nun meinen Vortrag transkribiert.
  Reference  Here you see our simultaneous translator, which now transcribes my presentation.
  Source     An einer Universität haben wir ja nun viele Vorlesungen.
  Reference  In a university, we have many lectures.

These examples suggest that disfluency detection requires an analysis of syntax as well as semantics. Detecting restart fragments especially requires semantic labeling, as in some cases the restarted fragment does not carry the same content as the aborted utterances.

2. Related Work

In previous work, the disfluency detection problem has been addressed using a noisy channel approach [4]. In this work it is assumed that fluent text, free of any disfluencies, has passed through a noisy channel which adds disfluencies to the clean string. The authors use language model scores and five different models to retrieve the clean string, where the two factors are controlled by a weight. An in-depth analysis of disfluency removal using this system and of its effect is provided in [5]. They find that for the given news test set, an 8% improvement in BLEU [6] is achieved when the disfluencies are removed.
In another noisy channel approach [7], the disfluency detection problem is reformulated as a phrase-level statistical machine translation problem. Trained on 142K words of data, the translation system translates noisy tokens containing disfluencies into clean tokens. The clean data contains new tags for classes such as repair, repeat, and filled pause. Using this translation-model-based technique, they achieve their highest F-score of 97.6 for filled pauses and their lowest F-score of 40.1 for repairs.
The noisy channel approach is combined with a tree-adjoining grammar to model speech repairs in [1]. A syntactic parser is used to build a language model that improves the accuracy of repair detection. Same or similar words in roughly the same order, defined as a rough copy, are modeled using crossed word dependencies. Trained on the annotated Switchboard corpus, they achieve an F-score of up to 79.7.
The automatic annotation generated by [1] is one of the features used for modeling disfluencies in [2], where a CRF model is trained to detect speech disfluencies. In addition to the automatic identification by [1], they use lexical, language model, and parser information as features. The CRF model is trained, optimized, and tested on around 150K words of annotated data, where disfluencies are classified into three different classes. Following this work, the authors offer an insightful analysis of the syntax and semantics of manually reconstructed spontaneous speech [8].
Though most progress has focused on enhancing speech recognition performance via disfluency detection, the authors of [3] employ disfluency detection to improve machine translation. They train three different systems. The first combines hidden-event language models and knowledge-based rules. The second is a CRF model combining lexical and shallow syntactic features. The final system is a rule-based filler-detecting system. Five classes are used in this task. The test sets for measuring MT performance are generated by manually pulling out the sentences with disfluencies from all available sentences; thus, only the sentences containing disfluencies are selected and evaluated. There are two test sets built in this way, comprising 339 and 242 sentences out of 1,134 and 937 sentences respectively. Absolute improvements of 0.8 and 0.7 BLEU points are gained on the two selected test sets.
There are several notable differences between our disfluency detection system and previous work. Unlike [2], we deploy extended features from neural networks and a phrase table in order to capture more semantic aspects. Furthermore, in our work the CRF detection result is further processed and evaluated in an MT system. In [3], three systems are combined to detect disfluencies and are evaluated in an MT system; in contrast to their systems, we do not deploy any rule-based detection. Moreover, in our work the CRF-based disfluency detection is extended further using semantic features. Finally, instead of using only the affected 28% portion of the test data to evaluate MT performance, we use all our available data for evaluation, including unaffected, originally clean sentences. This aims at evaluating the performance under fairer conditions.

3. Data

For training and testing our CRF model for disfluency detection, we use in-house German lecture data from different speakers, which is transcribed, annotated, and translated into English.
Disfluencies are annotated manually at the word or phrase level. There are subcategories of annotation such as filler words, repetitions, deletions, partial words, and so on. These subcategories are very fine-grained, so we later re-classify them for the CRF tagging task according to our aims. Inspired by the classes defined in previous work [1, 2], we classified the annotations into three categories: filler, (rough)copy, and non-copy.
The class filler includes simple disfluencies such as uhm, uh, like, and you know in English. Source words are also classified as fillers if they are discourse words or do not necessarily convey meaning and are not required for correct grammar. Words or phrases are grouped into (rough)copy when the same or similar tokens reoccur, as shown in Tables 1 and 3 in bold. Words are tagged as non-copy when the speaker changes their mind about how or what to say, as shown in Table 2 in bold. Contrary to previous work [2], extreme cases of non-copy, in which the restarted fragments are considered to have new contexts after the aborted utterances, are not excluded from the modeling target but are also taken into account.
Compared to other work on English, we have a considerably smaller amount of annotated data for German. We gathered 61K manually annotated words from lecture data, with roughly 9% of the words marked as disfluencies. Detailed statistics of the annotated data are given in Table 5.

Table 5: Data statistics on classes of the annotation
  Class           Tokens   Percentage
  Filler           3,304        5.35%
  (rough)Copy      1,518        2.46%
  Non-copy           620        1.00%
  Non-disfluency  56,264       91.18%

In order to make use of all annotated data and to enable cross-validation, we divided the 61K words of annotated data, as well as their English translations, into three parts, such that each part has around 20K words on the German source side. For testing one part out of three, the other two parts, around 40K words, are used as training data for the CRF model.

4. Disfluency Detection using CRF

Introduced by [9], the CRF is a framework dedicated to labeling sequence data. A CRF models a hidden label sequence given an observed sequence. CRFs have been applied extensively to diverse NLP tasks, such as sentence segmentation [10], POS tagging [9], and shallow parsing [11], due to their advantage of representing long-range dependencies in the observations.
In this work we use the linear-chain CRF to detect speech disfluencies. By using bigram features we can model first-order dependencies between words and a disfluency. We used the GRMM package [12] implementation of the CRF model. The CRF model was trained using L-BFGS, with the default parameters of the toolkit.

4.1. Features

In this work we utilize lexical, language model, word representation, and phrase table features. The word representation and phrase table features are devised in order to capture more of the syntactic and semantic characteristics of speech disfluencies; they are described in detail below.

Table 6: Sample features on the lexical level
  Source       Da gibt es da gab es in uh gab es nur eins .
  Engl. gloss  There is there was in uh there was only one .
  Word         Da gibt es da gab es in uh gab es nur eins .
  POS          ADV VVFIN PPER ADV VVFIN PPER APPR ITJ VVFIN PPER ADV PIS $.
  Word-Dist    3 365 3 47 4 4 259 9 218 821 115 933 27
  POS-Dist     3 3 3 7 4 4 12 9 6 80 3 21 27
  Word-Patt    0000110000000
  POS-Patt     1110110000000
  Annotation   - RC RC RC RC RC RC FL - - - - -

Our lexical and language model features are based on those described in [2]. We extend the language model features on words and POS tags up to 4-grams. The parser information and JC-04 edit results used in [1] are not available in
German, and therefore not used in this work. Furthermore, we add two new pattern features on the lexical level.
In Table 6, several selected features are shown for the rough-repetition sentence from Table 1. The 'Word/POS-Dist' feature is the distance from a token to its next appearance; a low 'Word/POS-Dist' value therefore indicates that the token occurs again shortly thereafter. If two or more neighboring tokens have the same 'Word/POS-Dist', the 'Word/POS-Patt' feature of the corresponding tokens is set to 1. For example, the first three tokens have the same 'POS-Dist' value, so their 'POS-Patt' is 1. This feature enables us to efficiently detect blocks of repetition, where the same or roughly the same words are repeated.
We use a 1-of-k encoding for the features. Since binary features are supported better by the toolkit, we quantize the numeric features. The POS tags are generated automatically using [13].
With the features mentioned so far, we can find syntactic clues for disfluency detection. For example, POS tokens and their patterns can help to identify repetitive (rough)copy occurrences. However, as discussed earlier, the annotated data shows that in many cases a semantic level of information is required as well. In addition to the features above, we therefore devised a new strategy of including word embedding features derived from a recurrent neural network (RNN) as well as phrase table information.

4.2. Word Representation using RNN

Word representations have gained a great deal of attention in various NLP tasks. Word representations learned with RNNs in particular have proven able to capture meaningful syntactic and semantic regularities efficiently [14]. RNNs are similar to multilayer perceptrons, but an RNN has a backward-directed loop, in which the output of the hidden layer becomes additional input. This allows the network to capture a longer history than feed-forward n-gram models.
A word embedding is a distributed word representation in which words are represented as multi-dimensional vectors. Word vectors that relate to each other syntactically and semantically lie close to each other in the representation space; thus, words standing in certain semantic and syntactic relations have similar vector values. Conventionally, the word embeddings of a textual corpus are obtained using certain types of neural networks.
In the hope that word representations can offer insights into semantics and syntax, in this paper we use word embedding features learned from an RNN in the CRF model. We use RNNLM [15] with 100 dimensions for the word representations. In order to ensure appropriate coverage of the representation, we use the preprocessed training data of the MT system, which covers various domains such as news and lectures. This data consists of 462 million tokens, with 150K unique tokens.

4.2.1. Word Projection and Cosine Distance

Figure 1 depicts a 2-dimensional projection of the real-valued 100-dimensional vector representations from the RNN, in which word clusters can be observed. The visualization is obtained using t-Distributed Stochastic Neighbor Embedding [16]. Due to memory consumption, only the 10K most frequent words are projected.

Figure 1: Word projection of the training data, with word representations obtained with an RNN

Analyzing the details of this projection, we observe that words with the same syntactic role are projected close to each other. For example, the possessives corresponding to 'my', 'his', and 'our' in English are projected close to each other. This is consistent for other grammatical components of a sentence, such as personal pronouns or relative pronouns. We also observe clusters for dates, months, and times.
The projection also seems to convey semantic relations to some extent. Adjectives are projected according to their stem and occasionally their meaning; verbs are clustered with other verbs of the same tense or stem.
In order to compare the closeness of words numerically, we calculate their cosine similarity.

Table 7: Cosine similarity of words in word representations
  Word in German   Meaning in English   Cosine similarity
  schnell          fast, quick          1
  rasch            quick, rapid         0.8394
  bald             soon, shortly        0.6245
  effektiv         effective            0.6092
  zügig            efficient, speedy    0.6088
  wahrscheinlich   probable             1
  vermutlich       probably             0.9066
  möglicherweise   maybe, possibly      0.8938
  sicherlich       certainly            0.8937
  vielleicht       maybe, possibly      0.8827

Table 7 depicts two examples. For each bold-lettered head word, the four words with the highest cosine similarity are presented. Evidently, these four words share a high semantic closeness with the given word, which provides a quality feature for the task of disfluency detection. From this analysis, we conclude that RNNs can offer syntactic and semantic clues for disfluency detection.

4.2.2. Word Clustering

In order to use the word representation vectors more efficiently as features in the CRF model, we cluster the word representations with the k-means algorithm. Based on preliminary experiments, the number of clusters k is chosen to be 100. Every word of the RNN training data thus falls into one of the 100 clusters. For every word in the test data, we check whether the word has been observed in the word representations. If it has, the word is assigned the corresponding cluster code as a binary feature; if it has not, the cluster code 0 is assigned. The distance to the next identical cluster code and its repetition pattern are also used as CRF features, analogous to the word and POS features shown in Table 6.

4.3. Phrase Table Information

One of the common effects of disfluencies on the MT process is that the translation often contains repetitive words or phrases. When identical tokens in the source sentence are the cause, the source sentence can be corrected using lexical features. However, we often observe cases where two words that differ on the lexical level generate two identical translated words. Table 8 depicts one such example from our data.
In this example, the German word jetzt (Engl. gloss 'now') is annotated as a disfluency, followed by the word inzwischen (Engl. gloss 'meantime', 'now'). Translating this source sentence as it is generates a translation containing two identical tokens in a row in English. We expect to solve this problem by examining the meaning of the source words in a phrase table; thus, the target words for the given source words in a phrase table are examined.
An advantage of using phrase table information is that we can detect the semantic closeness of words or phrases in a source sentence independently of their syntactic roles. As shown in Table 7, word representations tend to group together words which are both syntactically and semantically closely related. Using the phrase table information, however, words which are only semantically related, but not necessarily syntactically related, can also be grouped together. Considering that many repetitions also involve different POS tags within a sentence, this phrase table feature is expected to capture such disfluencies.
In order to derive this feature, we examine the bilingual language model [17] tokens in the phrase table. A bilingual language model token consists of a target word and its aligned source words. Using this information, we count how often a given source word is aligned to a certain target word and list the three most frequent target words. We then compare the aligned target words of the current word and of the following word. If the same target word(s) appear in both lists, the current word is given the phrase table feature.
In our phrase table, the word jetzt in Table 8 is most frequently aligned to 'now', followed by 'currently' and 'just'; the following word inzwischen is most frequently aligned to 'now', followed by 'meantime' and 'meanwhile'. Thus, the phrase table feature indicates that the first word jetzt is aligned to the same target word as its following word.
An equivalent feature is introduced at the phrase level. As an example, consider consecutive source words f1, f2, and f3 in one phrase, aligned to a target token e1. If the next source token f4 is also aligned to the target token e1, the first three tokens f1, f2, and f3 are given the phrase-level phrase table feature. The coverage of the phrase-level feature extends up to three consecutive words as one phrase on the source side: the source tokens f1, f2, and f3 are examined as one phrase, and this can also be narrowed down to f1 and f2 only. The target token(s) aligned to the source phrase, consisting of up to f1, f2, and f3, are compared to the target token(s) aligned to the potentially repetitive phrase, which can consist of up to the next three tokens f4, f5, and f6. German source words with split compounds are also handled in this way.

5. Experiments

5.1. System Description

In this section we introduce the SMT system used in our experiments. The translation system is trained on 1.76 million sentences of German-English parallel data, including the European Parliament data and the News Commentary corpus.

Table 8: Necessity of using phrase table information for disfluency detection
  Source       Diese Vorlesungen sind natürlich jetzt inzwischen alle abgespeichert, die liegen auf unserem Server.
  Engl. gloss  These lectures are of course now meantime all stored, they lie on our server.
  MT output    This lecture series are, of course, now now all stored, which lie on our server.
  Reference    These lectures have of course all been saved in the meantime, they are on our server.
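The word-level phrase table feature of Section 4.3 can be sketched as follows. This is a minimal illustration under our own naming, assuming the per-source-word alignment counts have already been extracted from the bilingual language model tokens; the toy counts below mimic the jetzt/inzwischen example in Table 8 and are not taken from the paper's actual phrase table.

```python
from collections import Counter

def top_targets(align_counts, source_word, n=3):
    """The n target words most frequently aligned to source_word."""
    return {t for t, _ in Counter(align_counts.get(source_word, {})).most_common(n)}

def phrase_table_feature(tokens, align_counts):
    """Set the feature to 1 for a token that shares one of its three most
    frequently aligned target words with the immediately following token."""
    feats = []
    for i, word in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        shared = top_targets(align_counts, word) & top_targets(align_counts, nxt) if nxt else set()
        feats.append(1 if shared else 0)
    return feats

# Hypothetical alignment counts mimicking the Table 8 example:
counts = {
    "jetzt":      {"now": 52, "currently": 17, "just": 9},
    "inzwischen": {"now": 31, "meantime": 24, "meanwhile": 18},
}
print(phrase_table_feature(["jetzt", "inzwischen", "alle"], counts))  # [1, 0, 0]
```

Here jetzt receives the feature because 'now' appears among the top-3 aligned targets of both jetzt and inzwischen, flagging the pair as a candidate repetition despite their different surface forms.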

We also use the parallel TED data1 as in-domain data to adapt annotated words. The performance is evaluated on the joined our models to the lecture domain. Preprocessing which con- three sub-test sets. sists of text normalization, tokenization, and smartcasing is In the last experiment, all disfluency-annotated words are applied before the training. For the German side, compound removed manually. As all annotation marks are generated splitting and conversion of words written according to the manually, this experiment shows as an orcale experiment the old spelling conventions into the new form of spelling are maximum possible improvement we could achieve. applied additionally. All four experiments are conducted on manually tran- As development data, manual transcripts of lecture data scribed texts, in order to disambiguate the effects from errors collected internally at our university are used. The talks are of an ASR system. The experiments considers all available 14K parallel sentences from university classes and events. data, which is 61K words, or 3K sentences. In order to build the phrase table, we use the Moses pack- Table 9 depicts the results of our experiments. The scores age [18]. Using the SRILM Toolkit [19], a 4-gram language are reported as case-sensitive BLEU scores, including punc- model is trained on 462 million words from the English side tuation marks. of the data. A bilingual language model [17] is used to extend source word context. In order to address the different word Table 9: Influence of disfluency in speech translation orders between German and English, the POS-based reorder- System BLEU ing model as described in [20] is applied. This is further ex- Baseline 19.98 tended as described in [21] to cover long-range reorderings. + no uh 21.28 We use Minimum Error Rate Training (MERT) [22] for the CRF-Extended 21.94 optimization in the in-house phrase-based decoder [23]. Oracle 23.14 5.2. 
Results The result of the first experiment is presented as the Base- To investigate the impact of disfluencies in speech translation line system, where all disfluencies are kept in the source text. quality, we conduct four experiments. When we remove all uhs and uhms in the source text manu- In the first experiment, the whole data, including anno- ally, we gain 1.3 BLEU points. tated disfluencies, is passed through our statistical machine Apart from this, we use the output of the CRF-Extended translation (SMT) system. as an input to our machine translation system. Words tagged For the second experiment, we remove the obvious filler as disfluencies are all removed. The translation score us- words uh and uhm manually in order to study the impact of ing the CRF-Extended is almost 2 BLEU points better than the filler words which can be captured systematically. Al- translating the text with all disfluencies. Compared to the though there are a great number of other filler words, many second experiment where we remove uh and uhm, the per- of these filler words are not removed in this experiment, since formance is improved by around 0.7 BLEU points. As the they are not always disfluencies. BLEU score does not show a significant difference between In the third experiment, we use the output from the the CRF-Extended and CRF-Baseline, here only the CRF- CRF model trained with features from word representations Extended score is shown. An in-depth analysis of the impact and phrase table information, which will be noted as CRF- of the two systems will be given in the following chapter. Extended. We also translate the output from the CRF model trained without any word representation and phrase table fea- 5.3. Analysis tures. This will be denoted as CRF-Baseline. If the CRF models detect a token as either of the three classes, filler, The detection results for all models are given in Table 10. 
In (rough)copy, or non-copy, the word token is assumed to be a disfluency and is removed. The three classes are trained in the same model together. As mentioned previously, training and testing the CRF model is done with three-fold cross-validation. Thus, both of the CRF models are trained on around 40K annotated words, and tested on around 20K words.

1 http://www.ted.com

In total, there are 5,432 speech disfluencies annotated by human annotators, and among them 3,012 speech disfluencies are detected by the CRF-Extended. Compared to the case where only the obvious filler words are removed, 1,025 more speech disfluencies are detected and removed. Compared to the CRF-Baseline, where the features obtained from the word representations and phrase table information are not used, 103 more disfluencies are detected using the CRF-Extended, while a higher number of tokens is also falsely detected.

Table 10: Results of disfluency detection in tokens

  System         Correct   Wrong
  Baseline             0       0
  + no uh          1,987       0
  CRF-Baseline     2,909     489
  CRF-Extended     3,012     552
  Oracle           5,432       0

In order to analyze the difference between the translations produced by the CRF-Baseline and the CRF-Extended, we score the test set sentence by sentence and rank the sentences according to the difference in BLEU scores. Differences appear in 223 sentences.

One notable difference is that the CRF-Extended system detects a higher number of repetitions. Table 11 shows a sentence from the test set where a longer repeated phrase is captured using the CRF-Extended. Words which represent a disfluency are marked in bold letters. Both systems can catch the obvious filler word uh and the simple repetition als als. In addition to this detection, the CRF-Extended system captures the whole disfluency region, in spite of the considerably complicated sentence structure and repetitive patterns. In this sentence the repeated words appear with varying frequencies and at different distances to the next identical token. In order to detect such disfluencies, the correct phrase boundary needs to be recognized. As a result of this detection, the MT output using the CRF-Extended system is much more fluent than the one using the CRF-Baseline system.

Table 12 shows a sentence from the test set where the CRF-Extended system does not perform better than the CRF-Baseline system for the given reference. The only disfluency shown in the original sentence, der, marked with bold letters, is removed by both techniques. The CRF-Extended system additionally detects einen Umschwung as a disfluency. However, this deletion harms neither the structure nor the meaning of the sentence, as einen Umschwung means 'a turnaround' or 'a change', which conveys practically the same meaning as the following tokens. It is an interesting point that, using the semantic features, we could detect that einen Umschwung is semantically closely related to eine veränderte, despite their distance in tokens and their different syntactic roles in the sentence. This is an example that even though the CRF-Extended output does not match the human-generated annotation in this case, the CRF-Extended still provides a good criterion for detecting semantically related words.

The CRF-Extended system also performs better with regard to distinguishing between discourse markers and normal usages of the same words. 59% of the difference in correctly classified disfluencies between the CRF-Baseline and the CRF-Extended stems from filler words. The rest is achieved by detecting a higher number of correct repetitions.

6. Conclusions

In this paper, we presented a CRF-based disfluency detection technique with extended features from word representations and a phrase table. These features are designed to capture deeper semantic aspects of the tokens. Using the predicted results from the CRF model, we gain around 2 BLEU points on manual transcripts of lectures. From the detailed analysis, we show that usage of the extended features provides a good means to detect semantically related disfluencies. The oracle experiment suggests that the machine translation of spontaneous speech can be improved significantly by detecting more disfluencies correctly.
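The token counts in Table 10 can be turned into precision and recall figures; a minimal sketch in Python, assuming the 5,432 human-annotated disfluencies as the recall denominator and treating every detected token as either correct or wrong (our illustration, not part of the paper's pipeline):

```python
def detection_scores(correct, wrong, annotated_total):
    """Precision/recall/F1 from token-level detection counts.

    correct:         detected tokens that are true disfluencies
    wrong:           detected tokens that are not disfluencies
    annotated_total: all human-annotated disfluency tokens (5,432 here)
    """
    detected = correct + wrong
    precision = correct / detected if detected else 0.0
    recall = correct / annotated_total
    # F1 = 2PR / (P + R) simplifies to this form for count inputs
    f1 = 2 * correct / (detected + annotated_total)
    return precision, recall, f1

# Counts taken from Table 10
for name, c, w in [("CRF-Baseline", 2909, 489), ("CRF-Extended", 3012, 552)]:
    p, r, f1 = detection_scores(c, w, 5432)
    print(f"{name}: P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

This makes the trade-off described above explicit: the CRF-Extended gains recall (103 more correct detections) at the cost of a small precision drop (63 more false detections).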

Table 11: Syntactically complicated, long phrase with a disfluency captured using CRF-Extended

  Source        Man kann das natürlich sowohl als Links- als auch als als Links- als auch als Rechtshänder uh verwenden.
  Engl. gloss   You can this of course both as left- as also as as left- as also as right-handed uh use.
  CRF-Baseline  Man kann das natürlich sowohl als Links- als auch als Links- als auch als Rechtshänder verwenden.
  MT output     You can use this, of course, both as a left- as well as on the left- as well as a right-handed.
  CRF-Extended  Man kann das natürlich sowohl als Links- als auch als Rechtshänder verwenden.
  MT output     You can use this, of course, both as a left- as well as a right-handed.
  Reference     You can of course use this as left- as well as also as a right-handed person.
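Identical repetitions like the one in Table 11 can, in the simplest case, be found by string matching alone; a toy sketch (our illustration, not the paper's CRF model) that greedily removes the first copy of any immediately repeated token span. It shows why the phrase boundary matters: the longest matching span must be preferred, or only the inner "als als" would be caught:

```python
def remove_identical_repetitions(tokens):
    """Greedily drop the first copy of immediately repeated token spans.

    At each position i, the longest n with tokens[i:i+n] == tokens[i+n:i+2n]
    is sought; if found, the first copy is skipped as disfluent.
    Rough repetitions ("Da gibt es da gab es ...") are NOT caught this way;
    those require the semantic features used by the CRF-Extended.
    """
    out, i = [], 0
    while i < len(tokens):
        for n in range((len(tokens) - i) // 2, 0, -1):
            if tokens[i:i + n] == tokens[i + n:i + 2 * n]:
                i += n  # skip the first copy, keep the repair
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = "Das sind die Vorteile , die Sie die Sie haben .".split()
print(" ".join(remove_identical_repetitions(tokens)))
```

On the Table 11 source, preferring the longest span removes the whole five-token phrase "als Links- als auch als" in one step, matching the CRF-Extended output apart from the filler word.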

Table 12: Semantically related words detected using CRF-Extended

  Source        Die Ausrufung des totalen Kriegs markierte eigentlich einen Umschwung, der eine veränderte Form der Politik.
  Engl. gloss   The proclamation of total war marked actually a turnaround, of a change form of politics.
  CRF-Baseline  Die Ausrufung des totalen Kriegs markierte eigentlich einen Umschwung, eine veränderte Form der Politik.
  MT output     The proclamation of the total war was collared actually a turnaround, a changed form of politics.
  CRF-Extended  Die Ausrufung des totalen Kriegs markierte eigentlich eine veränderte Form der Politik.
  MT output     The proclamation of the total war was collared actually a changed form of politics.
  Reference     The call for total war in fact marked a turnaround, and a changed form of politics.

In future work, we would like to pursue the development of disfluency detection systems which take prosodic features into account, in order to apply them to automatic speech recognition output. Furthermore, integrating the disfluency detection system tightly into machine translation systems could improve the performance even more.
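The sentence-by-sentence scoring used earlier to rank translation differences such as those in Tables 11 and 12 can be approximated with a smoothed sentence-level BLEU; a minimal sketch, where the add-one smoothing is our simplification and not necessarily the scorer used in the paper:

```python
import math
from collections import Counter


def sentence_bleu(hyp, ref, max_n=4):
    """Sentence-level BLEU with add-one smoothed n-gram precisions."""
    hyp, ref = hyp.split(), ref.split()
    if not hyp:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        match = sum(min(c, r[g]) for g, c in h.items())
        total = max(len(hyp) - n + 1, 0)
        # add-one smoothing keeps the score defined when an order has no match
        log_prec += math.log((match + 1) / (total + 1))
    # standard brevity penalty for hypotheses shorter than the reference
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return bp * math.exp(log_prec / max_n)
```

Scoring each test-set sentence with both systems' outputs and sorting by the score difference then surfaces the 223 sentences where the translations diverge.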

7. Acknowledgements

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 287658.