Multilingual Disfluency Removal using NMT

Eunah Cho, Jan Niehues, Thanh-Le Ha, Alex Waibel

Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology, Karlsruhe, Germany

Abstract

In this paper, we investigate a multilingual approach to speech disfluency removal. A major challenge of this task comes from the costly nature of disfluency annotation. Motivated by the fact that speech disfluencies are commonly observed across different languages, we investigate the potential of multilingual disfluency modeling. We suggest that learning a joint representation of the disfluencies in multiple languages can be a promising solution to the data sparsity issue. In this work, we utilize a multilingual neural machine translation system, where a disfluent speech transcript is directly transformed into a cleaned-up text.

Disfluency removal experiments on English and German speech transcripts show that multilingual disfluency modeling outperforms the single-language systems. In a follow-up experiment, we show that the improvements also carry over to a downstream application that uses the disfluency-removed transcripts as input.

1. Introduction

The challenges posed by spoken language translation (SLT) have received a great deal of research interest lately. One of these challenges is the handling of speech disfluencies. Unlike written language, spoken language contains disfluencies such as repetitions, false starts, and stutters. Speech disfluencies not only hinder the readability of speech transcripts but also harm the performance of subsequent natural language processing (NLP) applications, including machine translation (MT).
One challenge of speech disfluency modeling is the costly nature of disfluency-annotated data. Most state-of-the-art techniques for disfluency detection rely on supervised data, whose annotation process is expensive. Moreover, disfluency-annotated data is often limited to a few languages, which poses an even bigger challenge when modeling this speech phenomenon for low-resourced languages.

Many speech disfluencies, such as stutters, false starts, and repetitions, occur across different languages. This research is therefore motivated by the idea that the data sparsity issue can be remedied if joint representations of speech disfluencies across languages are available. Recently, the characteristics of neural machine translation (NMT), which uses continuous representations of the input sentences instead of a language-specific, fixed format, have attracted a lot of attention for building models over multilingual input [1, 2]. The potential benefit of sharing semantic representations of the input is a strong advantage of multilingual NMT. From this assessment, we expect that multilingual disfluency learning can be a solution for a task with a data sparsity issue. Using a neural network (NN)-based system, we aim to obtain a joint representation of speech disfluencies across different languages.

NMT offers a powerful framework in which further speech-cleaning or reconstruction operations, such as reordering and replacement of words, are possible. As an initial work applying the NMT framework to this task, we establish the effectiveness of this framework on disfluency removal.

We use disfluency-annotated English meeting data and German lecture data, and build three separate disfluency removal systems: two single-language systems (one trained on the English data, another on the German data) and a multilingual system. In each system, the source side consists of disfluent transcripts; in the multilingual system, it contains both German and English disfluent transcripts. Note that the transcripts in the two languages do not form a parallel or comparable corpus; they are speech transcripts from independent sources. The target side consists of the clean transcripts. Each language therefore goes through a monolingual translation process in which its disfluencies are directly removed.

This work demonstrates that multilingual learning improves the overall disfluency removal performance over single-language learning. To evaluate the performance, disfluency-removed test sets are compared to manually cleaned transcripts. In addition, we conduct an extrinsic evaluation, where we measure the impact of disfluency removal on a subsequent application: once a disfluent test set is transformed into a clean one, it is translated into another language using an MT system. By evaluating the MT performance, we can measure the impact of multilingual disfluency removal in a downstream application.

This paper is organized as follows. In Section 2, a brief overview of past research on disfluency removal and multilingual learning is given. After a brief introduction to speech disfluencies in Section 3, the multilingual disfluency removal scheme is explained in Section 4, where preceding experiments on parameter optimization are also discussed. In Section 5, our experimental setups and results are given along with an analysis. Finally, Section 6 concludes our discussion.

2. Related Work

Including the early approaches using statistical language models [3], there has been extensive research on speech disfluencies. One promising direction was the noisy channel model [4, 5], where it is assumed that fluent text, without any disfluencies, passes through a noisy channel which introduces disfluencies. The noisy channel model was later extended with a tree-adjoining grammar [6]; in that work, the authors used a syntactic parser for building a language model. Sequential tagging has shown good performance as well. In [7], conditional random fields (CRFs) with lexical, language model, and parser features were used for the automatic identification of disfluencies. The authors in [8] compared different modeling techniques for this task, including CRFs and a maximum entropy (ME) model. Recent efforts have been made to evaluate different segmentation and disfluency removal techniques for conversational speech [9], where CRFs have also been used extensively.
Following the great success of neural networks on many NLP challenges, there has also been a considerable amount of research on disfluency detection using NNs. Word embeddings from recurrent neural networks (RNNs) have been used as features in a sequential tagging model for disfluency detection [10]. In [11], the authors investigated a combined approach of CRFs and feed-forward neural networks (FFNNs) for joint disfluency detection and punctuation insertion. The combination of the two modeling techniques yielded 0.6 to 0.8 BLEU [12] points of improvement over the individual models. One recent study [13] explored RNNs for disfluency detection and analyzed the effect of context on performance; the authors showed good incremental properties of RNNs with low latency. Bidirectional LSTMs, with engineered disfluency pattern-match features, have been used for this task in [14].

The idea behind multilingual speech processing was also supported by [15]. In [16], for example, the authors showed that a statistical approach to speech processing can be ported to other languages with minor parameter tuning. The authors in [2] used encoder-decoder networks for multi-task learning, improving performance on many-to-many tasks including machine translation, parsing, and image captioning. For the translation task, they used individual encoders for each language as well as additional decoders for the multilingual output. This scheme, however, does not benefit from the attention-based model. In [1], the authors introduced a one-to-many multilingual NMT system, where one source language can be translated into multiple languages; in this scheme, the attention mechanism is located in each target-language decoder. In [17], the authors introduced an attention-based NMT which accommodates a shared attention mechanism for multilingual translation; while supporting many-to-many translation tasks, they integrated a single attention mechanism into the NMT.

NNs for sequence-to-sequence learning [18] and the framework of encoder-decoder networks [19] have shown their effectiveness in sequence-to-sequence mapping. The drawback of this approach, the difficulty of learning from long sentences, was later addressed by adding an attention mechanism [20]. With the attention mechanism, which is now used in many state-of-the-art NMT systems, the system can learn which source tokens to attend to when predicting the next target words. NMT systems have achieved better performance than the traditional phrase-based machine translation (PBMT) approach trained on the same parallel data, as observed in recent machine translation campaigns [21, 22].

3. Speech Disfluency

Spoken language differs greatly from written language. It often contains self-repairs, stutters, ungrammatical expressions, and incomplete sentences or phrases.

Different disfluency categories are thoroughly studied in [23]. A common type of disfluency is the filler disfluency, which includes filled pauses as well as discourse markers. Typical filler words or sounds are uh, uhm, and their variants. Representative discourse markers are you know and well in English, and ja and nun in German, for example. Edit disfluencies, on the other hand, are often written as (reparandum) * <editing term> correction. The reparandum represents the aborted part of the utterance, and the interruption point * is the point where the speaker aborts the ongoing utterance and introduces the corrected utterance. The reparandum thus includes repetitions, false starts, etc. Before introducing the corrected utterance, speakers often use an editing term, e.g. sorry. This scheme has been the basis for disfluency analysis and modeling for many languages [24].
Example sentences from the English and German spontaneous data are shown in Table 1. For the German excerpt, an English gloss translation is also provided. The excerpts show exact and rough repetitions and filler words, marked with "+/ /+" and "< >" respectively. We can observe that the speech disfluencies in the two sentences occur in a syntactically comparable structure.

Table 1: Example sentences containing filler words and repetitions in English and German.

    English   right, +/ they don't /+ they don't actually go into ...
    German    +/ Es gab /+ es gibt da drei Prinzipien ...
              (+/ There was /+ there is three principles ...)

Compared to the aforementioned disfluency categories, the detection of false starts is a more challenging task. Table 2 shows example sentences chosen from the spontaneous data. False starts are marked with "-/ /-". The excerpts in both languages contain a false start, followed by an interruption point, after which the corrected utterance occurs.

Table 2: Example sentences containing false starts in English and German.

    English   -/ what if /- -/ the /- that's what you want to do?
    German    -/ Mit dem recht /- er würde wieder zurückgehen.
              (-/ With the right /- it would again go back.)

It is challenging to model disfluencies of this category, since the surface patterns of false starts and their corrections are very sparse and therefore hard to model without detecting an actual semantic break point. However, the similar patterns of speech disfluencies across different languages suggest that we could benefit from gathering their information in a common semantic space.
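To make the mapping from annotated to cleaned transcripts concrete, the following sketch derives a (disfluent, clean) sentence pair from a sentence marked up in the style of Tables 1 and 2. The markup conventions and the filler example are only illustrative; the actual annotation format of the corpora used in this work is described in [25, 26] and may differ.

    import re

    # Illustrative markup, following the conventions shown in Tables 1 and 2:
    #   +/ ... /+  repetitions,  -/ ... /-  false starts,  <...>  filler words.
    # This is only a sketch of how a (disfluent, clean) pair could be derived;
    # it is not the authors' actual corpus-preparation code.
    REPAIR = re.compile(r'[+-]/\s*(.*?)\s*/[+-]')   # +/ ... /+ and -/ ... /-
    FILLER = re.compile(r'<\s*(.*?)\s*>')           # <uh>, <you know>, ...

    def to_pair(annotated):
        """Return (disfluent source, clean target) from a marked-up sentence."""
        # Source side: keep the spoken words, drop only the markup symbols.
        source = FILLER.sub(r'\1', REPAIR.sub(r'\1', annotated))
        # Target side: drop the marked disfluent regions entirely.
        target = FILLER.sub(' ', REPAIR.sub(' ', annotated))
        collapse = lambda s: re.sub(r'\s+', ' ', s).strip()
        return collapse(source), collapse(target)

    if __name__ == '__main__':
        src, tgt = to_pair("+/ they don't /+ they don't actually go into <uh> details")
        print(src)  # they don't they don't actually go into uh details
        print(tgt)  # they don't actually go into details

Pairs produced in this way correspond to the source (disfluent) and target (clean) sides of the translation-style training data described in Section 4.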

4. Multilingual Disfluency Removal

This work is motivated by the idea that the above-mentioned characteristics of spoken language can be observed across different languages. We aim to develop a multilingual disfluency removal model which can capture the shared aspects of spontaneous speech in different languages.

4.1. Approach

As an initial work on multilingual disfluency removal, we investigate the performance gain induced by sharing joint representations of disfluencies across different languages. For this experiment, we used disfluency-annotated English meeting and German lecture corpora.

4.1.1. Data

The English data consists of meetings and on-line lectures as described in [25]. The German data includes university lectures as introduced in [26]. Speech disfluencies in each corpus are annotated by human annotators, following the same guidelines. The meeting corpus, however, includes an additional disfluency class interruption, due to the nature of multi-party conversation. In this work, we merged the disfluency class interruption and the non-copy class, which mainly covers false starts. A detailed description of these classes is given in [25]. The cleaned-up data set is prepared by removing disfluencies according to the manual annotation. Detailed statistics of the data are shown in Table 3.

Using the NMT framework, the source side represents the disfluent transcripts, while the target side represents the cleaned-up transcripts in the same language, which therefore contain fewer tokens. Note that the tokens here denote words and punctuation marks, counted before any operation for generating sub-words is applied.

Table 3: Data statistics in number of tokens.

               Train                  Test
               Disfluent   Clean      Disfluent   Clean
    English    97,547      85,761     17,046      13,818
    German     97,833      85,955     29,510      25,665

As shown in the table, the selected conversational speech transcripts contain a great deal of disfluencies, reaching a disfluency rate of around 12% for each training set (e.g. (97,547 - 85,761) / 97,547 ≈ 12% for English). The English test data consists purely of multi-party meeting data, which is extremely disfluent. The training data, which also includes web-based lecture data, therefore shows a lower disfluency rate than the test data.

Inspired by [5], the disfluency removal task in this work is treated as a translation process, where the disfluent speech is translated into clean speech. In the parallel data for NMT training, the source side contains disfluent English/German speech transcripts. Each line of the target side is the disfluency-cleaned version of the source sentence, whose disfluencies are annotated by human annotators.

A major issue in speech disfluency modeling is data sparsity. Thus, an investigation of how to use the available data efficiently is necessary. In this work, we applied a word-splitting algorithm to the extent that uncommon words are essentially represented in a letter-based form and only very common words are left unsplit. This promotes sharing parameters as much as possible. A detailed investigation of this is presented in the following section.

4.1.2. Neural Machine Translation Setup

All disfluency removal experiments are conducted using the NMT framework nematus¹, which is an attention-based encoder-decoder model for NMT. We generated the sub-word units using byte-pair encoding (BPE), as described in [27]. In order to obtain the smallest possible units for less common words, while allowing common words to be modified or added, we set the number of BPE merge operations to a low value of 150. A detailed analysis of this issue is given in Section 4.2.

¹ https://github.com/rsennrich/nematus
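As an illustration of this setup, a low-merge-count BPE model could be learned on the joint English and German disfluent training text and applied to each sentence roughly as follows. This is a minimal sketch assuming the subword-nmt package released with [27]; file names are placeholders, the exact function signatures may vary between versions, and the package marks split points with "@@" rather than the "+" used for display in Section 4.2.

    from subword_nmt.learn_bpe import learn_bpe
    from subword_nmt.apply_bpe import BPE

    # Learn only 150 merge operations (Section 4.1.2) on the concatenated
    # disfluent English and German training transcripts, so that only very
    # frequent words survive unsplit and rare words fall back to letters.
    with open('train.disfluent.joint.en-de') as infile, \
         open('bpe.codes', 'w') as outfile:
        learn_bpe(infile, outfile, num_symbols=150)

    # Apply the shared codes to any English or German sentence.
    with open('bpe.codes') as codes:
        bpe = BPE(codes)
    print(bpe.process_line("what if oh you the that's what you want to do ?"))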
Our preceding experiments showed that the disfluency removal model benefits from language-independent sub-word embeddings. Therefore, each sub-word shares its representation across the different languages, without a language-specific representation. A detailed description of the preceding experiments is given in Section 4.2.1.

When training the NMT systems, we accepted almost all sentence pairs, excluding only extraordinarily long sentences. The sentences in the different languages are then shuffled inside every minibatch. All NMT networks use empirically chosen 60-dimensional embeddings for each token, with dropout at every layer: a probability of 0.2 at the embedding and hidden layers and 0.1 at the disfluent and clean output layers. The systems are trained using gradient descent optimization with Adadelta [28] on minibatches of size 80.
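Collected in one place, these training settings amount to a configuration along the following lines. The option names below follow the style of the nematus training scripts, but they are assumptions for illustration, not the authors' actual configuration file.

    # Hyperparameters reported in Section 4.1.2, gathered as a training
    # configuration. Key names are illustrative (nematus-style), not the
    # authors' actual settings file; data paths are placeholders.
    config = dict(
        datasets=['train.disfluent.bpe', 'train.clean.bpe'],    # source = disfluent, target = clean
        valid_datasets=['valid.disfluent.bpe', 'valid.clean.bpe'],
        dictionaries=['vocab.joint.json', 'vocab.joint.json'],  # shared En+De sub-word vocabulary
        dim_word=60,              # token embedding size
        optimizer='adadelta',     # Adadelta [28]
        batch_size=80,
        maxlen=200,               # drop extraordinarily long sentences (value assumed)
        shuffle_each_epoch=True,  # English and German pairs mixed inside minibatches
        use_dropout=True,
        dropout_embedding=0.2,    # embedding layers
        dropout_hidden=0.2,       # hidden layers
        dropout_source=0.1,       # disfluent (input) tokens
        dropout_target=0.1,       # clean (output) tokens
    )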
4.2. Multilingual Learning

In order to share as many parameters as possible, we applied the BPE setting with a low number of merge operations on a joined corpus of English and German, so that only a few common words are treated as independent tokens; most letters are treated similarly.

When using the chosen BPE setting, the unsplit words are usually frequently used articles, prepositions, personal pronouns, and conjunctions, e.g. "the", "and", and "you" in English, and "die", "und", and "wir" in German. With such an extensive word-splitting mechanism, the English example sentence given in Table 2 is written as follows.

    w+hat i+f o+h you the th+at+'s w+hat you w+an+t to d+o ?

The sign "+" denotes a split point within a word and is followed by a white space during training and testing. While most less common words are split into smaller units in order to share parameters across languages, we also observed that identical sequences longer than two split units occurring in both languages were highly unlikely. In the example above, for instance, the sequence "i+ f" occurs only once in the German data, as an actual English word. The sequence "w+ an" occurs only a few times, as part of the verb wandeln (En. gloss: to stroll) and its conjugated forms.

With the empirically chosen splits, 42.4% of the tokens occurring in the training data are left unsplit. This keeps the generation process easily separable for each language.

4.2.1. Parameter Optimization

In this section, we analyze the performance of the preceding experiments. We evaluated different sub-word splitting sizes as well as word embedding sizes on the validation data before applying one configuration to the test data. In order to show that the approach of splitting less common words extensively while leaving very frequent words intact performs better, we also built, for example, a character-based multilingual disfluency removal system. Different degrees of sub-word operations were tried out, and the effectiveness of a language-specific representation was investigated as well.

All preceding experiments are evaluated in BLEU on the validation set. Thus, once the disfluent validation set is translated into a cleaned transcript, its quality is evaluated against the transcript whose disfluencies are all removed according to the human annotation. The validation set consists of 1,400 sentences from the training data, 700 sentences for each language.

The first investigation concerned the sub-word size. For this, we changed the sub-word operation settings in the multilingual disfluency removal system and compared the validation scores. Table 4 shows the results.

Table 4: Number of sub-word tokens in the training data and the disfluency removal performance.

    System   No. tokens   Dev
    Sys 1    971K         78.97
    Sys 2    498K         92.59
    Sys 3    465K         92.01
    Sys 4    372K         92.37

The first system is the character-based system, where each character is handled as a separate token. We can see that representing all words as characters does not yield good performance. Applying a splitting operation that yields 498K tokens in the training data, we achieved a better performance on the validation set. Decreasing the number of sub-word tokens in the training data further, however, does not result in a great difference between the systems. Thus, we chose the split operation setting of System 2 for all our experiments.

The second investigation concerned the language-specific representation. We analyzed whether a shared sub-word representation across multiple languages is more helpful, or whether a language-specific representation for each sub-word would yield better performance. For this, we applied a language-specific ID (either de or en) to each BPE-applied sub-word. The example sentences from Table 2 would therefore be written as "en w+ en hat en i+ en f . . ." for English, and "de m+ de it de d+ de em . . ." for German.
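A sketch of this contrastive variant is shown below; the exact form in which the IDs were attached is an assumption based on the example above.

    def add_lang_ids(subword_line, lang):
        """Place a language ID token ('en' or 'de') before every sub-word unit,
        as in the contrastive '+ LangID' systems of Table 5 (format assumed)."""
        return ' '.join(f'{lang} {tok}' for tok in subword_line.split())

    print(add_lang_ids("w+ hat i+ f", 'en'))   # en w+ en hat en i+ en f
    print(add_lang_ids("m+ it d+ em", 'de'))   # de m+ de it de d+ de em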
Table 5: Investigation of the effectiveness of a language-specific representation.

    System     Dev
    Sys 2      92.59
    + LangID   91.05
    Sys 3      92.01
    + LangID   91.75

As shown in Table 5, our experiments indicate that the model benefits from language-independent sub-word embeddings. Sharing this information across the different languages appears to be more helpful for the disfluency removal performance.

In this paper, we therefore used the configuration of System 2 for our multilingual disfluency learning setup. Thus, the vocabulary is shared between the two languages by applying the sub-word operation on the joint corpus of German and English. In the multilingual setup, a single system is used for disfluency removal in both languages, using the shared vocabulary. In the next section, we compare the performance of this multilingual system against the single-language disfluency removal systems.

5. Experiments and Results

In this section, we describe the experimental setups of the multilingual disfluency removal model.

5.1. Evaluation Setup

Before training, preprocessing is applied to both languages, including smart-casing and normalization. Details of the preprocessing can be found in [29]. In this work, we build two single-language disfluency removal systems (one for each language) and a multilingual disfluency removal system.

Each disfluent test set is transformed into clean data, in the same language, using the disfluency removal NMT described in Section 4.1.2. We apply both intrinsic and extrinsic evaluations, so that we can assess the performance of the disfluency removal itself as well as measure its impact on the overall translation process.

The performance of the disfluency removal is evaluated using BLEU [12], comparing the output to the cleaned-up manual transcript. While deploying NMT for the disfluency removal task directly promotes more fluent and natural output in the target language, it does not guarantee that the output sentence uses the same vocabulary as the source sentence, which makes an evaluation in terms of binary classification problematic. Therefore, in this work, we evaluate the similarity between the generated hypothesis and the reference using BLEU.
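For the intrinsic evaluation, this amounts to corpus-level BLEU between the automatically cleaned transcript and the manually cleaned reference, for example as follows. The paper does not state which BLEU implementation was used; sacrebleu serves as a stand-in here, and the file names are placeholders.

    import sacrebleu

    # Intrinsic evaluation: corpus-level BLEU of the automatically cleaned
    # transcript against the manually cleaned reference transcript.
    with open('test.cleaned.hyp') as h, open('test.cleaned.ref') as r:
        hypotheses = [line.strip() for line in h]
        references = [line.strip() for line in r]

    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    print(f'Disfluency removal BLEU: {bleu.score:.2f}')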
The disfluency-removed test data is then translated into another language: the German test data is translated into English, while the English test data is translated into French. The PBMT systems used for translation between the different languages are described below.

5.1.1. English→French MT

The system is built using parallel data from EPPS, NC, and TED talks². The parallel data sums up to 2.3M sentences. Of the two word-based language models (LMs), one is trained only on TED data in order to adapt the domain to spoken language translation; the other is trained on the French side of all parallel data. In addition to a bilingual LM [30], an additional LM trained on part-of-speech (POS) tags is used. The tags are generated by TreeTagger [31]. Short-range reordering is modeled using POS-based reordering [32]. The system is optimized on extra TED data using [33]. This baseline system is described in detail in [34].

² http://www.ted.com

5.1.2. German→English MT

The translation models are trained on the parallel data from the WMT 2016 evaluation, including the cleaned-up crawl data. The parallel data sums up to 3.8M sentences. Three word-based LMs are deployed in this system: a word-based 4-gram LM, a bilingual LM, and another LM built on data selected based on cross-entropy, as described in [34]. In addition, we used two word-class-based LMs, with 100 and 1,000 automatically generated clusters respectively, as shown in [35]. The different word order between German and English is modeled using long-range POS-based pre-reordering [36] and lexicalized reordering. The parameters are optimized on the test2014 set of the WMT evaluation campaign. The detailed description of the baseline system can be found in [29].
5.2. Results

Table 6 shows the performance of the multilingual NMT for disfluency removal. As described in Section 5.1, the performance is measured in BLEU.

Table 6: Disfluency removal performance.

    System                            English   German
    Baseline                          74.37     78.03
    + no uh                           76.82     84.90
    Single language NMT sys.          81.56     89.61
    Multilingual NMT sys.             83.57     90.75
    CRF-based single language sys.    78.78     -

In the baseline setup, we evaluate the test data with all its disfluencies kept, against the cleaned-up manual transcript. We provide another baseline, in which simply all trivial disfluencies, i.e. uh and uhm, are removed. Using the language-dependent NMT disfluency removal systems, we can see that the test data achieves an improved similarity to the manual transcript. Finally, when using the multilingual NMT for the task, we achieve the best performance for both languages.

We compare the performance to a CRF-based approach on the identical English meeting data, following the work in [25]. Even though the CRF-based model showed improvements compared to the baseline systems, both NMT-based systems largely outperformed the sequential tagging model. This improvement indicates the effectiveness of the NMT framework for this task.

In order to measure the impact of disfluency removal on a subsequent application, we translated the disfluency-removed test data into a different language and evaluated the performance in BLEU. Table 7 shows the results.

Table 7: Impact of disfluency removal in machine translation.

    System                            English   German
    Baseline                          17.08     21.58
    + no uh                           17.75     23.46
    Single language NMT sys.          19.36     24.34
    Multilingual NMT sys.             19.59     24.43
    CRF-based single language sys.    18.22     -
    Oracle                            21.38     25.22

As before, we provide two baseline scores. Simply removing trivial filler words already brings a big improvement over the first baseline. The multilingual disfluency removal system brings 1 to 1.8 BLEU points of improvement over the second baseline, exceeding the improvements achieved with the single-language disfluency removal NMT systems.

An additional comparison was made to measure the impact of the CRF-based single-language system on the subsequent MT performance. Consistent with the intrinsic evaluation, the NMT-based disfluency removal systems largely outperform the CRF-based system. In particular, the multilingual system brought about 1.4 BLEU points of improvement compared to the sequential tagging model.

The last row of the table shows oracle scores for the disfluency removal task. For this experiment, we removed all disfluencies in the English and German spontaneous speech according to the human annotation; the oracle sets are therefore identical to the references used in Table 6. For both language directions, the multilingual disfluency removal system comes closest to this upper bound.
6. Conclusion

In this paper, we presented an initial approach to modeling multilingual disfluency removal. Since one of the biggest challenges in disfluency modeling is the sparsity of annotated data, we propose to use disfluency-annotated data from different languages to learn their shared semantic representations. Inspired by recent successes, we used a multilingual neural MT framework in order to promote a joint representation of disfluencies across different languages.

The experiments on English and German spontaneous speech transcripts showed that allowing multilingual training data for disfluency removal indeed improves the removal performance. In addition, the multilingual disfluency removal system outperformed the single-language systems in the extrinsic evaluation on a subsequent application as well.

As an initial work on using the NMT framework for the multilingual disfluency removal task, we believe that this research can be extended to further tasks of spoken language processing, e.g. sentence reconstruction, reformulation, or stylistic change of the text.

7. Acknowledgements

This work was supported by the Carl-Zeiss-Stiftung. The research by Thanh-Le Ha was supported by the Ministry of Science, Research and the Arts Baden-Württemberg.
8. References

[1] D. Dong, H. Wu, W. He, D. Yu, and H. Wang, "Multi-task learning for multiple language translation," in Proceedings of the 53rd Annual Meeting of the ACL and the 7th International Joint Conference on Natural Language Processing, Beijing, China, 2015, pp. 1723-1732.

[2] M.-T. Luong, Q. V. Le, I. Sutskever, O. Vinyals, and L. Kaiser, "Multi-task sequence to sequence learning," in International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2016.

[3] P. A. Heeman and J. F. Allen, "Speech repairs, intonational phrases, and discourse markers: Modeling speakers' utterances in spoken dialogue," Computational Linguistics, vol. 25, no. 4, pp. 527-571, 1999.

[4] M. Honal and T. Schultz, "Correction of Disfluencies in Spontaneous Speech using a Noisy-Channel Approach," in Proceedings of the European Conference on Speech Communication and Technology (EUROSPEECH 2003), Geneva, Switzerland, 2003.

[5] S. Maskey, B. Zhou, and Y. Gao, "A Phrase-Level Machine Translation Approach for Disfluency Detection using Weighted Finite State Transducers," in Proceedings of the Ninth International Conference on Spoken Language Processing (INTERSPEECH 2006 - ICSLP), Pittsburgh, Pennsylvania, USA, 2006.

[6] M. Johnson and E. Charniak, "A TAG-based Noisy Channel Model of Speech Repairs," in Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL 2004), Barcelona, Spain, 2004.

[7] E. C. Fitzgerald, "Reconstructing spontaneous speech," Ph.D. dissertation, Johns Hopkins University, Baltimore, Maryland, USA, 2009.

[8] Y. Liu, E. Shriberg, A. Stolcke, D. Hillard, M. Ostendorf, and M. Harper, "Enriching speech recognition with automatic detection of sentence boundaries and disfluencies," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 5, pp. 1526-1540, 2006.

[9] H. Hassan, L. Schwartz, D. Hakkani-Tür, and G. Tur, "Segmentation and Disfluency Removal for Conversational Speech Translation," in Proceedings of the 15th Annual Conference of the International Speech Communication Association (INTERSPEECH 2014), Singapore, 2014.

[10] E. Cho, T.-L. Ha, and A. Waibel, "CRF-based Disfluency Detection using Semantic Features for German to English Spoken Language Translation," in Proceedings of the International Workshop for Spoken Language Translation (IWSLT 2013), Heidelberg, Germany, 2013.

[11] E. Cho, K. Kilgour, J. Niehues, and A. Waibel, "Combination of NN and CRF Models for Joint Detection of Punctuation and Disfluencies," in Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association (INTERSPEECH 2015), Dresden, Germany, 2015.

[12] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: A Method for Automatic Evaluation of Machine Translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), Philadelphia, Pennsylvania, USA, 2002, pp. 311-318.

[13] J. Hough and D. Schlangen, "Recurrent neural networks for incremental disfluency detection," in Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association (INTERSPEECH 2015), Dresden, Germany, 2015.

[14] V. Zayats, M. Ostendorf, and H. Hajishirzi, "Disfluency detection using a bidirectional LSTM," in Proceedings of the Seventeenth Annual Conference of the International Speech Communication Association (INTERSPEECH 2016), San Francisco, CA, USA, 2016.

[15] P. Fung and T. Schultz, "Multilingual spoken language processing," IEEE Signal Processing Magazine, vol. 25, no. 3, pp. 89-97, 2008.

[16] M. Honal and T. Schultz, "Automatic disfluency removal on recognized spontaneous speech - rapid adaptation to speaker-dependent disfluencies," in ICASSP, 2005.

[17] O. Firat, K. Cho, and Y. Bengio, "Multi-way, Multilingual Neural Machine Translation with a Shared Attention Mechanism," in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2016), San Diego, CA, USA, 2016.

[18] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104-3112.

[19] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), Doha, Qatar, 2014.

[20] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in International Conference on Learning Representations (ICLR), 2015.

[21] M. Cettolo, J. Niehues, S. Stüker, L. Bentivogli, and M. Federico, "Report on the 12th IWSLT Evaluation Campaign," in Proceedings of the Twelfth International Workshop on Spoken Language Translation (IWSLT 2015), Da Nang, Vietnam, 2015.

[22] O. Bojar, R. Chatterjee, C. Federmann, Y. Graham, B. Haddow, M. Huck, A. J. Yepes, P. Koehn, V. Logacheva, C. Monz, et al., "Findings of the 2016 Conference on Machine Translation (WMT16)," in Proceedings of the First Conference on Machine Translation (WMT 2016), Berlin, Germany, 2016.

[23] E. E. Shriberg, "Preliminaries to a theory of speech disfluencies," Ph.D. dissertation, University of California at Berkeley, Berkeley, California, USA, 1994.

[24] W. Wang, A. Stolcke, J. Yuan, and M. Liberman, "A cross-language study on automatic speech disfluency detection," in HLT-NAACL, 2013, pp. 703-708.

[25] E. Cho, J. Niehues, and A. Waibel, "Machine Translation of Multi-party Meetings: Segmentation and Disfluency Removal Strategies," in Proceedings of the International Workshop on Spoken Language Translation (IWSLT 2014), Lake Tahoe, California, USA, 2014.

[26] E. Cho, S. Fünfer, S. Stüker, and A. Waibel, "A Corpus of Spontaneous Speech in Lectures: The KIT Lecture Corpus for Spoken Language Processing and Translation," in Proceedings of the 9th Edition of the Language Resources and Evaluation Conference (LREC 2014), Reykjavik, Iceland, 2014.

[27] R. Sennrich, B. Haddow, and A. Birch, "Neural Machine Translation of Rare Words with Subword Units," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), Berlin, Germany, 2016.

[28] M. D. Zeiler, "ADADELTA: An Adaptive Learning Rate Method," CoRR, 2012.

[29] T.-L. Ha, E. Cho, J. Niehues, M. Mediani, M. Sperber, A. Allauzen, and A. Waibel, "The Karlsruhe Institute of Technology Systems for the News Translation Task in WMT 2016," in Proceedings of the First Conference on Machine Translation (WMT 2016), Berlin, Germany, 2016.

[30] J. Niehues, T. Herrmann, S. Vogel, and A. Waibel, "Wider Context by Using Bilingual Language Models in Machine Translation," in Proceedings of the Sixth Workshop on Statistical Machine Translation (WMT 2011), Edinburgh, Scotland, 2011.

[31] H. Schmid, "Probabilistic Part-of-Speech Tagging Using Decision Trees," in Proceedings of the International Conference on New Methods in Language Processing, Manchester, England, 1994.

[32] K. Rottmann and S. Vogel, "Word Reordering in Statistical Machine Translation with a POS-Based Distortion Model," in Proceedings of the International Conference on Theoretical and Methodological Issues in Machine Translation (TMI 2007), Skövde, Sweden, 2007.

[33] F. J. Och, "Minimum Error Rate Training in Statistical Machine Translation," in Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL 2003), Sapporo, Japan, 2003.

[34] T.-L. Ha, J. Niehues, T. Herrmann, M. Mediani, E. Cho, Y. Zhang, I. Slawik, and A. Waibel, "The KIT Translation Systems for IWSLT 2013," in Proceedings of the International Workshop on Spoken Language Translation (IWSLT 2013), Heidelberg, Germany, 2013.

[35] F. J. Och, "An Efficient Method for Determining Bilingual Word Classes," in Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics (EACL 1999), Bergen, Norway, 1999, pp. 71-76.

[36] J. Niehues and M. Kolss, "A POS-Based Model for Long-Range Reorderings in SMT," in Proceedings of the Workshop on Statistical Machine Translation (WMT 2009), Athens, Greece, 2009.