Segmentation Strategies for Streaming Speech Translation

Vivek Kumar Rangarajan Sridhar, John Chen, Srinivas Bangalore, Andrej Ljolje, Rathinavelu Chengalvarayan
AT&T Labs - Research, 180 Park Avenue, Florham Park, NJ 07932
vkumar,jchen,srini,alj,[email protected]

Abstract

The study presented in this work is a first effort at real-time speech translation of TED talks, a compendium of public talks with different speakers addressing a variety of topics. We address the goal of achieving a system that balances translation accuracy and latency. In order to improve ASR performance for our diverse data set, adaptation techniques such as constrained model adaptation and vocal tract length normalization are found to be useful. In order to improve machine translation (MT) performance, techniques that can be employed in real time, such as monotonic and partial translation retention, are found to be of use. We also experiment with inserting text segmenters of various types between ASR and MT in a series of real-time translation experiments. Among other results, our experiments demonstrate that a good segmentation is useful, and that a novel conjunction-based segmentation strategy improves translation quality nearly as much as other strategies such as comma-based segmentation. It was also found to be important to synchronize the various pipeline components in order to minimize latency.

1 Introduction

The quality of automatic speech-to-text and speech-to-speech (S2S) translation has improved so significantly over the last several decades that such systems are now widely deployed and used by an increasing number of consumers. Under the hood, the individual components that constitute a S2S system, such as automatic speech recognition (ASR), machine translation (MT) and text-to-speech synthesis (TTS), are still loosely coupled and typically trained on disparate data and domains. Nevertheless, the individual models as well as the pipeline have been optimized in several ways to achieve tasks such as high quality offline speech translation (Cohen, 2007; Kingsbury et al., 2011; Federico et al., 2011), on-demand web based speech and text translation, and low-latency real-time translation (Wahlster, 2000; Hamon et al., 2009; Bangalore et al., 2012). The design of a S2S translation system is highly dependent on the nature of the audio stimuli. For example, talks, lectures and audio broadcasts are typically long and require appropriate segmentation strategies to chunk the input signal to ensure high quality translation. In contrast, single utterance translation in several consumer applications (apps) is typically short and can be processed without the need for additional chunking. Another key parameter in designing a S2S translation system for any task is latency. In offline scenarios where high latencies are permitted, several adaptation strategies (speaker, language model, translation model), denser data structures (N-best lists, sausages, lattices) and rescoring procedures can be utilized to improve the quality of end-to-end translation. On the other hand, real-time speech-to-text or speech-to-speech translation demands the best possible accuracy at low latencies, such that communication is not hindered by potential delays in processing.

In this work, we focus on the speech translation of talks. We investigate the tradeoff between accuracy and latency for both offline and real-time translation of talks. In both these scenarios, appropriate segmentation of the audio signal, as well as of the ASR hypothesis that is fed into machine translation, is critical for maximizing the overall translation quality of the talk. Ideally, one would like to train the models on entire talks. However, such corpora are not available in large amounts. Hence, it is necessary to conform to appropriately sized segments that are similar to the sentence units used in training the language and translation models.


We propose several non-linguistic and linguistic segmentation strategies for the segmentation of text (reference or ASR hypotheses) for machine translation. We address the problem of latency in real-time translation as a function of the segmentation strategy; i.e., we ask the question "what is the segmentation strategy that maximizes the number of segments while still maximizing translation accuracy?".

2 Related Work

Speech translation of European Parliamentary speeches has been addressed as part of the TC-STAR project (Vilar et al., 2005; Fügen et al., 2006). The project focused primarily on offline translation of speeches. Simultaneous translation of lectures and speeches has been addressed in (Hamon et al., 2009; Fügen et al., 2007). However, that work focused on a single speaker in a limited domain. Offline speech translation of TED talks (http://www.ted.com) has been addressed through the IWSLT 2011 and 2012 evaluation tracks. The talks are from a variety of speakers with varying dialects and cover a range of topics. The study presented in this work is the first effort on real-time speech translation of TED talks. In comparison with previous work, we also present a systematic study of the accuracy versus latency tradeoff for both offline and real-time translation on the same dataset.

Various utterance segmentation strategies for offline machine translation of text and ASR output have been presented in (Cettolo and Federico, 2006; Rao et al., 2007; Matusov et al., 2007). The work in (Fügen et al., 2007; Fügen and Kolss, 2007) also examines the impact of segmentation on offline speech translation of talks. However, the real-time analysis in that work is presented only for speech recognition. In contrast with previous work, we tackle the latency issue in simultaneous translation of talks as a function of segmentation strategy and present some new linguistic and non-linguistic methodologies. We investigate the accuracy versus latency tradeoff across translation of reference text, utterance-segmented speech recognition output, and partial hypotheses.

3 Problem Formulation

The basic problem of text translation can be formulated as follows. Given a source (French) sentence $f = f_1^J = f_1, \ldots, f_J$, we aim to translate it into the target (English) sentence $\hat{e} = \hat{e}_1^I = \hat{e}_1, \ldots, \hat{e}_I$:

$$\hat{e}(f) = \arg\max_{e} \Pr(e|f) \quad (1)$$

If, as in talks, the source text (reference or ASR hypothesis) is very long, i.e., $J$ is large, we attempt to break the source string into shorter sequences $S = s_1 \cdots s_k \cdots s_{Q_s}$, where each sequence $s_k = [f_{j_k} f_{j_k+1} \cdots f_{j_{k+1}-1}]$ with $j_1 = 1$ and $j_{Q_s+1} = J + 1$. Let the translation of each foreign sequence $s_k$ be denoted by $t_k = [e_{i_k} e_{i_k+1} \cdots e_{i_{k+1}-1}]$ with $i_1 = 1$ and $i_{Q_s+1} = I' + 1$ (the segmented and unsegmented talk may not be equal in length, i.e., $I' \neq I$). The segmented sequences can be translated using a variety of techniques, such as independent chunk-wise translation or chunk-wise translation conditioned on history, as shown in Eqs. 2 and 3, respectively. In Eq. 3, $t_i^*$ denotes the best translation for source sequence $s_i$:

$$\hat{\hat{e}}(f) = \arg\max_{t_1} \Pr(t_1|s_1) \cdots \arg\max_{t_k} \Pr(t_k|s_k) \quad (2)$$

$$\hat{e}^*(f) = \arg\max_{t_1} \Pr(t_1|s_1) \; \arg\max_{t_2} \Pr(t_2|s_2, s_1, t_1^*) \cdots \arg\max_{t_k} \Pr(t_k|s_1, \ldots, s_k, t_1^*, \ldots, t_{k-1}^*) \quad (3)$$

Typically, the chunk-wise hypotheses of Eqs. 2 and 3 will be more accurate than $\hat{e}$ for long texts, as the models approximating $\Pr(e|f)$ are conventionally trained on short text segments. In Eqs. 2 and 3, the number of sequences $Q_s$ is inversely proportional to the time it takes to generate partial target hypotheses. Our main focus in this work is to obtain a segmentation $S$ such that the quality of translation is maximized with minimal latency. The formulation is very similar for automatic speech recognition, except that the foreign string $\check{f} = \check{f}_1^{\check{J}} = \check{f}_1, \ldots, \check{f}_{\check{J}}$ is obtained by decoding the input speech signal.
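To make the two decoding schemes concrete, the sketch below contrasts independent chunk-wise translation (Eq. 2) with chunk-wise translation conditioned on history (Eq. 3). The `translate` callable is a hypothetical stand-in for the MT decoder; this is our illustration, not the paper's implementation.

```python
from typing import Callable, List

# Hypothetical decoder interface: maps a source chunk s_k (plus optional
# source/target history) to its best translation t_k*.
Translate = Callable[[str, str, str], str]

def translate_chunks_independent(chunks: List[str], translate: Translate) -> str:
    """Eq. 2: every source sequence s_k is translated in isolation."""
    return " ".join(translate(s_k, "", "") for s_k in chunks)

def translate_chunks_with_history(chunks: List[str], translate: Translate) -> str:
    """Eq. 3: s_k is translated conditioned on s_1..s_{k-1} and on the
    best translations t_1*..t_{k-1}* committed so far."""
    src_hist: List[str] = []
    tgt_hist: List[str] = []
    for s_k in chunks:
        t_k = translate(s_k, " ".join(src_hist), " ".join(tgt_hist))
        src_hist.append(s_k)
        tgt_hist.append(t_k)  # committed output; cannot be retracted later
    return " ".join(tgt_hist)
```

The smaller the chunks (i.e., the larger $Q_s$), the earlier partial target text can be emitted, which is precisely the latency side of the tradeoff studied in the rest of the paper.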
Table 1: Statistics of the data used for training the speech translation models.

Model              | Language | Vocabulary | # words    | # sentences | Corpora
-------------------|----------|------------|------------|-------------|--------
Acoustic Model     | en       | 46899      | 2611144    | 148460      | 1119 TED talks
ASR Language Model | en       | 378915     | 3398460155 | 151923101   | Europarl, WMT11 Gigaword, WMT11 News crawl, WMT11 News-commentary, WMT11 UN, IWSLT11 TED training
MT                 | en       | 503765     | 76886659   | 7464857     | IWSLT11 TED training talks, Europarl, JRC-ACQUIS, Opensubtitles, Web data
MT                 | es       | 519354     | 83717810   | 7464857     | Spanish side of the same parallel corpora
Language Model     | es       | 519354     | 83717810   | 7464857     | Spanish side of parallel text

4 Data

In this work, we focus on the speech translation of TED talks, a compendium of public talks from several speakers covering a variety of topics. Over the past couple of years, the International Workshop on Spoken Language Translation (IWSLT) has been conducting evaluations of speech translation on TED talks for English-French. We leverage the IWSLT TED campaign by using identical development (dev2010) and test (tst2010) data. However, English-Spanish is our target language pair, as our internal projects cater mostly to this pair. As a result, we created parallel text for English-Spanish based on the reference English segments released as part of the evaluation (Cettolo et al., 2012).

We also harvested the audio data from the TED website for building an acoustic model. A total of 1308 talks in English were downloaded, out of which we used the 1119 talks recorded prior to December 2011. We split the stereo audio files and duplicated the data to account for any variations in the channels. The data for the language models was also restricted to that permitted in the IWSLT 2011 evaluation. The parallel text for building the English-Spanish translation model was obtained from several corpora: Europarl (Koehn, 2005), the JRC-Acquis corpus (Steinberger et al., 2006), the Opensubtitle corpus (Tiedemann and Lars Nygaard, 2004), Web crawling (Rangarajan Sridhar et al., 2011), as well as human translation of proprietary data. Table 1 summarizes the data used in building the models. It is important to note that the IWSLT evaluation on TED talks is completely offline. In this work, we perform the first investigation into the real-time translation of these talks.

5 Speech Translation Models

In this section, we describe the acoustic, language and translation models used in our experiments.

5.1 Acoustic and Language Model

We use the AT&T WATSON(SM) speech recognizer (Goffin et al., 2004). The speech recognition component consisted of a three-pass decoding approach utilizing two acoustic models. The models used three-state left-to-right HMMs representing just over 100 phonemes. The phonemes represented general English, spelled letters, and a head-body-tail representation for the eleven digits (with "zero" and "oh"). The pronunciation dictionary used the appropriate phoneme subset, depending on the type of the word. The models had 10.5k states and 27k HMMs, trained on just over 300k utterances, using both of the stereo channels. The baseline model training was initialized with several iterations of ML training, including two builds of context dependency trees, followed by three iterations of Minimum Phone Error (MPE) training.

Vocal Tract Length Normalization (VTLN) was applied in two different ways: one was estimated at the utterance level, and the other at the talk level. No speaker clustering was attempted in training, and the performance at test time was comparable for both approaches on the development set. Once the warps were estimated, after five iterations, the ML-trained model was updated using MPE training. Constrained model adaptation (CMA) was applied to the warped features, and the adapted features were recognized in the final pass with the VTLN model. All the passes used the same LM. For offline recognition, the warps and the CMA adaptation are performed at the talk level. For the real-time speech translation experiments, we used the VTLN model.
The English language model was built using the permissible data in the IWSLT 2011 evaluation. The texts were normalized using a variety of cleanup, number and spelling normalization techniques and filtered by restricting the vocabulary to the top 375000 types; i.e., any sentence containing a token outside the vocabulary was discarded. First, we removed extraneous characters beyond the ASCII range, followed by removal of punctuation. Subsequently, we normalized hyphenated words and removed words with more than 25 characters. The resultant text was normalized using a variety of number conversion routines, and each corpus was filtered by restricting the vocabulary to the top 150000 types; i.e., any sentence containing a token outside the vocabulary was discarded. The vocabulary from all the corpora was then consolidated, and another round of filtering to the top 375000 most frequent types was performed. The OOV rate on the TED dev2010 set is 1.1%. We used the AT&T FSM toolkit (Mohri et al., 1997) to train a language model (LM) for each component (corpus). Finally, the component language models were interpolated by minimizing the perplexity on the dev2010 set. The results are shown in Table 2.

Table 2: ASR word accuracies (%) on the IWSLT data sets. (Accuracies were computed with the standard NIST scoring package, as we did not have access to the IWSLT evaluation server, which may normalize and score differently.)

Model        | dev2010 | tst2010
-------------|---------|--------
Baseline MPE | 75.5    | 73.8
VTLN         | 78.8    | 77.4
CMA          | 80.5    | 80.0

5.2 Translation Model

We used the Moses toolkit (Koehn et al., 2007) for performing statistical machine translation. Minimum error rate training (MERT) was performed on the development set (dev2010) to optimize the feature weights of the log-linear model used in translation. During decoding, the unknown words were preserved in the hypotheses. The data used to train the model is summarized in Table 1.

We also used a finite-state implementation of translation without reordering. Reordering can pose a challenge in real-time S2S translation, as text-to-speech synthesis is monotonic and cannot retract already synthesized speech. While we do not address the text-to-speech synthesis of target text in this work, we perform this analysis as a precursor to future work. We represent the phrase translation table as a weighted finite-state transducer (FST) and the language model as a finite-state acceptor (FSA). The weight on the arcs of the FST is the dot product of the MERT weights with the translation scores. In addition, a word insertion penalty was applied to each word to penalize short hypotheses. The decoding process consists of composing all possible segmentations of an input sentence with the phrase table FST and the language model, followed by searching for the best path. Our FST-based translation is the equivalent of phrase-based translation in Moses without reordering. We present results using the independent chunk-wise strategy and chunk-wise translation conditioned on history in Table 3. The chunk-wise translation conditioned on history was performed using the continue-partial-translation option in Moses.

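As an illustration of this decoding step, the following sketch uses the OpenFst Python wrapper pynini as a stand-in for the AT&T finite-state tools used in the paper; the function name and the assumption that the phrase table and LM carry compatible symbol tables are ours, not the paper's.

```python
import pynini

def decode_monotonic(sentence: str,
                     phrase_fst: pynini.Fst,
                     lm_fsa: pynini.Fst) -> str:
    """Phrase-based translation without reordering, as FST composition."""
    # Linear acceptor over the source words; assumes the words are present
    # in the phrase table's input symbol table.
    src = pynini.accep(sentence, token_type=phrase_fst.input_symbols())
    # source o phrase table o LM: the phrase-table arcs carry the dot
    # product of MERT weights and translation scores (plus a word
    # insertion penalty); the LM acceptor rescores the output side.
    lattice = src @ phrase_fst @ lm_fsa
    # The cheapest path through the lattice is the best monotonic translation.
    best = pynini.shortestpath(lattice)
    return best.string(token_type=phrase_fst.output_symbols())
```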
6 Segmentation Strategies

The output of ASR for talks is a long string of words with no punctuation, capitalization or segmentation markers. In most offline ASR systems, the talk is first segmented into short utterance-like audio segments before being passed to the decoder. Prior work has shown that additional segmentation of the ASR hypotheses of these segments may be necessary to improve translation quality (Rao et al., 2007; Matusov et al., 2007). In a simultaneous speech translation system, one can neither find the optimal segmentation of the entire talk nor tolerate the high latencies associated with long segments. Consequently, it is necessary to decode the incoming audio incrementally as well as to segment the ASR hypotheses appropriately to maximize MT quality. We present a variety of linguistic and non-linguistic segmentation strategies for segmenting the source text input into MT. In our experiments, they are applied to different inputs: reference text, ASR 1-best hypotheses for manually segmented audio, and incremental ASR hypotheses from entire talks.
6.1 Non-linguistic segmentation

The simplest method is to segment the incoming text according to length in number of words. Such a procedure can destroy semantic context but has little to no overhead in additional processing. We experiment with segmenting the text according to word window sizes of length 4, 8, 11, and 15 (denoted as data sets win4, win8, win11 and win15, respectively, in Table 3). We also experiment with concatenating all of the text from one TED talk into a single chunk (complete talk).

A novel hold-output model was also developed in order to segment the input text. Given a pair of parallel sentences, the model segments the source sentence into minimally sized chunks such that crossing links, and links of one target word to many source words, in an optimal GIZA++ alignment (Och and Ney, 2003) occur only within individual chunks. The motivation behind this model is that if a segment s0 is input at time t0 to an incremental MT system, it can be translated right away without waiting for a segment si that is input at a later time ti, ti > t0. The hold-output model detects these kinds of segments given a sequence of English words that are input from left to right. A kernel-based SVM was used to develop this model. It tags a token t in the input with either the label HOLD, meaning to chunk it with the next token, or the label OUTPUT, meaning to output the chunk constructed from the maximal consecutive sequence of tokens preceding t that were all tagged as HOLD. The model considers a five-word and POS window around the target token t. Unigram, bigram, and trigram word and POS features based upon this window are used for classification. Training and development data for the model was derived from the English-Spanish TED data (see Table 1) after running it through GIZA++. Accuracy of the model on the development set was 66.62% F-measure for the HOLD label and 82.75% for the OUTPUT label.

6.2 Linguistic segmentation

Since MT models are trained on parallel text sentences, we investigate segmenting the source text into sentences. We also investigate segmenting the text further by predicting comma-separated chunks within sentences. These tasks are performed by training a kernel-based SVM (Haffner et al., 2003) on a subset of English TED data. This dataset contained 1029 human-transcribed talks consisting of about 103,000 sentences containing about 1.6 million words. Punctuation in this dataset was normalized as follows. Different kinds of sentence-ending punctuation were transformed into a uniform end-of-sentence marker. Double hyphens were transformed into commas. Commas already existing in the input were kept, while all other kinds of punctuation symbols were deleted. A part-of-speech (POS) tagger was applied to this input. For speed, a unigram POS tagger was implemented, trained on the Penn Treebank (Marcus et al., 1993) and using orthographic features to predict the POS of unknown words. The SVM-based punctuation classifier relies on a five-word and POS window in order to classify the target word: token t0 is classified given as input the window t-2 t-1 t0 t1 t2. Unigram, bigram, and trigram word and POS features based on this window were used for classification. Accuracy of the classifier on the development set was 60.51% F-measure for sentence end detection and 43.43% F-measure for comma detection. Subsequently, data sets pred-sent (sentences) and pred-punct (comma-separated chunks) were obtained. Corresponding to these, two other data sets, ref-sent and ref-punct, were obtained based upon the gold-standard punctuation in the reference.

Besides investigating the use of comma-separated segments, we investigated other linguistically motivated segments. These included conjunction-word based segments, which are separated at either conjunction (e.g., "and," "or") or sentence-ending word boundaries. Conjunctions were identified using the unigram POS tagger. F-measure performance for detecting conjunctions by the tagger on the development set was quite high, 99.35%.
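A minimal sketch of this conjunction-based segmenter follows; the `pos_tag` callable stands in for the unigram POS tagger, and the identifiers and the decision to keep each conjunction with the following segment are our illustrative assumptions, not details from the paper.

```python
from typing import Callable, Iterable, List

EOS_MARK = "<eos>"   # uniform end-of-sentence marker from Section 6.2
CONJ_TAGS = {"CC"}   # POS tags treated as conjunctions ("and", "or")

def conj_segments(tokens: Iterable[str],
                  pos_tag: Callable[[str], str]) -> List[List[str]]:
    """Split a token stream at conjunctions or sentence-ending boundaries
    (the conj-ref-eos / conj-pred-eos strategies)."""
    segments: List[List[str]] = []
    current: List[str] = []
    for tok in tokens:
        if tok == EOS_MARK:               # sentence boundary: close segment
            if current:
                segments.append(current)
            current = []
        elif pos_tag(tok) in CONJ_TAGS:   # conjunction: close the current
            if current:                   # segment and open a new one that
                segments.append(current)  # starts with the conjunction
            current = [tok]
        else:
            current.append(tok)
    if current:
        segments.append(current)
    return segments
```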
As an alternative, text chunking was performed within each sentence, with each chunk corresponding to one segment. Text chunks are non-recursive syntactic phrases in the input text. We investigated segmenting the source into text chunks using TreeTagger, a decision-tree based text chunker (Schmid, 1994). Initial sets of text chunks were created by using either gold-standard sentence boundaries or boundaries detected using the punctuation classifier, yielding the data sets chunk-ref-punct and chunk-pred-punct. Chunk types included NC (noun chunk), VC (verb chunk), PRT (particle), and ADVC (adverbial chunk).

Because these chunks may not provide sufficient context for translation, we also experimented with concatenating neighboring chunks of certain types to form larger chunks. Data sets lgchunk1 concatenate together neighboring chunk sequences of the form NC, VC or NC, ADVC, VC, intended to capture as single chunks instances of subject and verb. In addition to this, data sets lgchunk2 capture chunks such as PC (prepositional phrase) and VC followed by VC (control and raising verbs). Finally, data sets lgchunk3 capture as single chunks VC followed by NC and optionally followed by PRT (verb and its direct object).

Applying the conjunction segmenter after the aforementioned punctuation classifier in order to detect the ends of sentences yields the data set conj-pred-eos. Applying it on sentences derived from the gold-standard punctuation yields the data set conj-ref-eos. Finally, applying the hold-output model to sentences derived using the punctuation classifier produces the data set pred-hold. Obtaining English sentences tagged with HOLD and OUTPUT directly from the output of GIZA++ on English-Spanish sentences in the reference produces the data set ref-hold. For ASR input, the strategies containing the keyword ref simply mean that the ASR hypotheses are used in place of the gold reference text.

Table 3: BLEU scores at the talk level for reference text and ASR 1-best under various segmentation strategies. For each input, BLEU is reported for independent chunk-wise translation (FST and Moses) and for chunk-wise translation conditioned on history (Moses), along with the mean number of words per segment. The ASR 1-best was produced on the manually segmented audio chunks provided in the tst2010 set.

                                      |        Reference text             |          ASR 1-best
Type           | Strategy             | FST  | Moses | +history | #words/seg | FST  | Moses | +history | #words/seg
---------------|----------------------|------|-------|----------|------------|------|-------|----------|-----------
Non-linguistic | win4                 | 22.6 | 21.0  | 25.5     | 3.9±0.1    | 17.7 | 17.1  | 20.0     | 3.9±0.1
               | win8                 | 26.6 | 26.2  | 28.2     | 7.9±0.3    | 20.6 | 20.9  | 22.3     | 7.9±0.2
               | win11                | 27.2 | 27.4  | 29.2     | 10.9±0.3   | 21.5 | 21.8  | 23.1     | 10.9±0.4
               | win15                | 28.5 | 28.5  | 29.4     | 14.9±0.6   | 22.3 | 22.8  | 23.3     | 14.9±0.7
               | ref-hold             | 13.3 | 14.0  | 17.1     | 1.6±1.9    | 12.7 | 13.1  | 17.5     | 1.5±1.0
               | pred-hold            | 15.9 | 15.7  | 16.3     | 2.2±1.9    | 12.6 | 12.9  | 17.4     | 1.5±1.0
               | complete talk        | 23.8 | 23.9  | –        | 2504       | 18.8 | 19.2  | –        | 2515
Linguistic     | ref-sent             | 30.6 | 31.5  | 30.5     | 16.7±11.8  | 24.3 | 25.1  | 24.4     | 17.0±11.6
               | ref-punct            | 30.4 | 31.5  | 30.3     | 7.1±5.3    | 24.2 | 25.1  | 24.1     | 8.7±6.1
               | pred-punct           | 30.6 | 31.5  | 30.4     | 8.7±8.8    | 24.1 | 25.0  | 24.0     | 8.8±6.8
               | conj-ref-eos         | 30.5 | 31.5  | 30.2     | 11.2±7.5   | 24.1 | 24.9  | 24.0     | 11.5±7.7
               | conj-pred-eos        | 30.3 | 31.2  | 30.3     | 10.9±7.9   | 24.0 | 24.8  | 24.0     | 11.4±8.5
               | chunk-ref-punct      | 17.9 | 18.9  | 21.4     | 1.3±0.7    | 14.5 | 15.2  | 16.9     | 1.4±0.7
               | lgchunk1-ref-punct   | 21.0 | 21.8  | 25.1     | 1.7±1.0    | 16.9 | 17.4  | 19.6     | 1.8±1.0
               | lgchunk2-ref-punct   | 22.4 | 23.1  | 26.0     | 2.1±1.1    | 17.9 | 18.4  | 20.4     | 2.1±1.1
               | lgchunk3-ref-punct   | 24.3 | 25.1  | 27.4     | 2.5±1.7    | 19.2 | 19.9  | 21.3     | 2.5±1.7
               | chunk-pred-punct     | 17.9 | 18.9  | 21.4     | 1.3±0.7    | 14.5 | 15.1  | 16.9     | 1.4±0.7
               | lgchunk1-pred-punct  | 21.2 | 21.9  | 25.2     | 1.8±1.0    | 16.7 | 17.2  | 19.7     | 1.8±1.0
               | lgchunk2-pred-punct  | 22.6 | 23.1  | 26.0     | 2.1±1.2    | 17.7 | 18.3  | 20.5     | 2.1±1.2
               | lgchunk3-pred-punct  | 24.5 | 25.3  | 27.4     | 2.6±1.8    | 19.1 | 20.0  | 21.3     | 2.5±1.7

[Figure 1: Latencies and BLEU scores for the tst2010 set using incremental ASR decoding and translation. The plot reports BLEU for ASR+MT and for ASR+Punct Seg+MT, together with processing time per token (sec), as a function of the ASR timeout (0-6000 ms).]

We also performed real-time speech translation by using incremental speech recognition, i.e., the decoder returns partial hypotheses that, independent of the pruning during search, will not change in the future.
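Schematically, the real-time setup chains these components. In the sketch below, `partials` stands for the stream of finalized partial hypotheses produced by the recognizer at each ASR timeout, and `segment` and `translate` are hypothetical stand-ins for the pred-punct segmenter and the MT engine; the buffering policy shown is our illustration of the idea, not the paper's exact implementation.

```python
from typing import Callable, Iterable, Iterator, List

def streaming_translate(partials: Iterable[str],
                        segment: Callable[[str], List[str]],
                        translate: Callable[[str], str]) -> Iterator[str]:
    """Chain ASR partial hypotheses -> text segmenter -> MT.

    Completed segments are translated immediately, while the trailing
    (possibly still growing) segment stays buffered until more stable
    words arrive from the recognizer."""
    buffered: List[str] = []
    for hyp in partials:
        buffered.extend(hyp.split())
        segments = segment(" ".join(buffered))
        for seg in segments[:-1]:        # all but the last segment are final
            yield translate(seg)
        buffered = segments[-1].split() if segments else []
    if buffered:                         # flush the remainder at end of talk
        yield translate(" ".join(buffered))
```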
Figure 1 shows the plot for two scenarios: one in which the partial hypotheses are sent directly to machine translation, and another where the best segmentation strategy, pred-punct, is used to segment the partial output before sending it to MT. The plot shows the BLEU scores as a function of the ASR timeouts used to generate the partial hypotheses. Figure 1 also shows the average latency involved in incremental speech translation.

7 Discussion

The BLEU scores for the segmentation strategies over ASR hypotheses were computed at the talk level. Since the ASR hypotheses do not align with the reference source text, it is not feasible to evaluate the translation performance using the gold reference. While other studies have used an approximate edit distance algorithm for resegmentation of the hypotheses (Matusov et al., 2005), we simply concatenate all the segments and perform the evaluation at the talk level.

The hold segmentation strategy yields the poorest translation performance. The significant drop in BLEU score can be attributed to the relatively short segments (2-4 words) generated by the model. The scheme oversegments the text, and since the translation and language models are trained on sentence-like chunks, the performance is poor. For example, the input text "the sea" should be translated as "el mar", but instead the hold segmenter chunks it as "the·sea", which MT's chunk translation renders as "el·el mar". It will be interesting to increase the span of the hold strategy to subsume more contiguous sequences, and we plan to investigate this as part of future work.

The chunk segmentation strategy also yields quite poor translation performance. In general, it does not make the same kinds of errors that the hold strategy makes; for example, the input text "the sea" will be treated as one NC chunk by the chunk segmentation strategy, leading MT to translate it correctly as "el mar". The short chunk sizes of chunk lead to other kinds of errors. For example, the input text "we use" will be chunked into the NC "we" and the VC "use", which will be translated incorrectly as "nosotros·usar"; the infinitive "usar" is selected rather than the properly conjugated form "usamos". However, there is a marked improvement in translation accuracy with increasingly larger chunk sizes (lgchunk1, lgchunk2, and lgchunk3). Notably, lgchunk3 yields performance that approaches that of win8 with a chunk size that is one third of win8's.

The conj-pred-eos and pred-punct strategies work the best, and it can be seen that the average segment length (8-12 words) generated in both these schemes is very similar to that used for training the models. It is also about the average latency (4-5 seconds) that can be tolerated in cross-lingual communication, also known as the ear-voice span (Lederer, 1978). Non-linguistic segmentation using fixed-length word windows also performs well, especially for the longer windows. However, longer windows (win15) increase the latency, and any fixed-length window typically destroys the semantic context. It can also be seen from Table 3 that translating the complete talk is suboptimal in comparison with segmenting the text. This is primarily due to the bias in sentence-length distributions in the training data; training models on complete talks is likely to resolve this issue. Contrasting the use of reference segments as input to MT (ref-sent, ref-punct, conj-ref-eos) versus the use of predicted segments (pred-sent, pred-punct, conj-pred-eos, respectively), it is interesting to note that the MT accuracies never differed greatly between the two, despite the noise in the set of predicted segments.

The performance of real-time speech translation of TED talks is much lower than in the offline scenario. First, we use only a VTLN model, as performing CMA adaptation in a real-time scenario typically increases latency. Second, the ASR language model is trained on sentence-like units, and decoding the entire talk with this LM is not optimal. A language model trained on complete talks would be more appropriate for such a framework, and we are investigating this as part of current work.

Comparing the accuracies of the different speech translation strategies, Table 3 shows that pred-punct performs the best. When embedded in an incremental MT speech recognition system, Figure 1 shows that it is more accurate than the system that sends partial ASR hypotheses directly to MT. This advantage decreases, however, when the ASR timeout parameter is increased to more than five or six seconds.
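For reference, the talk-level scoring used throughout this section (concatenating every chunk translation of a talk before comparing against the full reference) might be sketched as follows; sacrebleu is used here purely as an illustrative scorer, whereas the paper used the standard NIST scoring package.

```python
import sacrebleu

def talk_level_bleu(chunk_translations, talk_references):
    """Score whole talks: join all chunk translations of each talk into a
    single hypothesis string, sidestepping the misalignment between ASR
    segments and reference sentences.

    chunk_translations: one list of chunk strings per talk
    talk_references:    one full reference translation per talk
    """
    hypotheses = [" ".join(chunks) for chunks in chunk_translations]
    return sacrebleu.corpus_bleu(hypotheses, [talk_references]).score
```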
In terms of latency, Figure 1 shows that the addition of the pred-punct segmenter to the incremental system introduces a significant delay. About one third of the increase in delay can be attributed to merely maintaining the two-word lookahead window that the segmenter's classifier needs to make decisions. This is significant because this kind of window has been used quite frequently in previous work on simultaneous translation (cf. (Fügen et al., 2007)), and yet, to our knowledge, the penalty associated with this configuration has never been mentioned. The remaining delay can be attributed to the long chunk sizes that the segmenter produces. An interesting aspect of the latency curve associated with the segmenter in Figure 1 is that there are two peaks, at ASR timeouts of 2,500 and 4,500 ms, and that the lowest latency is achieved at 3,000 ms rather than at a smaller value. This may be attributed to the fact that the system is a pipeline consisting of ASR, segmenter, and MT, and that 3,000 ms is roughly the length of time needed to recite comma-separated chunks. Consequently, the two latency peaks appear to correspond to ASR producing segments that are most divergent from the segments that the segmenter produces, leading to the most pipeline "stalls." Conversely, the lowest latency occurs when the timeout is set so that ASR's segments most resemble the segmenter's output to MT.

8 Conclusion

We investigated various approaches for incremental speech translation of TED talks, with the aim of producing a system with high MT accuracy and low latency. For acoustic modeling, we found that VTLN and CMA adaptation were useful for increasing the accuracy of ASR, leading to a word accuracy of 80% on the TED talks used in the IWSLT evaluation track. In our offline MT experiments, monotonic translation and retention of partial translations were found useful for increasing MT accuracy, with the latter being slightly more helpful. We experimented with several linguistic and non-linguistic strategies for text segmentation before translation. Our experiments indicate that a novel segmentation into conjunction-separated sentence chunks results in accuracies almost as high, and latencies almost as short, as comma-separated sentence chunks. They also indicate that significant noise in the detection of sentences and punctuation did not seriously impact the resulting MT accuracy. Experiments on real-time simultaneous speech translation using partial recognition hypotheses demonstrate that the introduction of a segmenter increases MT accuracy. They also showed that, in order to reduce latency, it is important for buffers in different pipeline components to be synchronized so as to minimize pipeline stalls. As part of future work, we plan to extend the framework presented in this work to perform speech-to-speech translation. We also plan to address the challenges involved in S2S translation across languages with very different word order.

Acknowledgments

We would like to thank Simon Byers for his help with organizing the TED talks data.

References

S. Bangalore, V. K. Rangarajan Sridhar, P. Kolan, L. Golipour, and A. Jimenez. 2012. Real-time incremental speech-to-speech translation of dialogs. In Proceedings of NAACL-HLT, June.

M. Cettolo and M. Federico. 2006. Text segmentation criteria for statistical machine translation. In Proceedings of the 5th International Conference on Advances in Natural Language Processing.

M. Cettolo, C. Girardi, and M. Federico. 2012. WIT3: Web Inventory of Transcribed and Translated Talks. In Proceedings of EAMT.

J. Cohen. 2007. The GALE project: A description and an update. In Proceedings of ASRU Workshop.

M. Federico, L. Bentivogli, M. Paul, and S. Stüker. 2011. Overview of the IWSLT 2011 evaluation campaign. In Proceedings of IWSLT.

C. Fügen and M. Kolss. 2007. The influence of utterance chunking on machine translation performance. In Proceedings of Interspeech.

C. Fügen, M. Kolss, D. Bernreuther, M. Paulik, S. Stüker, S. Vogel, and A. Waibel. 2006. Open domain speech recognition & translation: Lectures and speeches. In Proceedings of ICASSP.

C. Fügen, A. Waibel, and M. Kolss. 2007. Simultaneous translation of lectures and speeches. Machine Translation, 21:209–252.

V. Goffin, C. Allauzen, E. Bocchieri, D. Hakkani-Tür, A. Ljolje, and S. Parthasarathy. 2004. The AT&T Watson speech recognizer. Technical report, September.
P. Haffner, G. Tür, and J. Wright. 2003. Optimizing SVMs for complex call classification. In Proceedings of ICASSP'03.

O. Hamon, C. Fügen, D. Mostefa, V. Arranz, M. Kolss, A. Waibel, and K. Choukri. 2009. End-to-end evaluation in simultaneous translation. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), March.

B. Kingsbury, H. Soltau, G. Saon, S. Chu, Hong-Kwang Kuo, L. Mangu, S. Ravuri, N. Morgan, and A. Janin. 2011. The IBM 2009 GALE speech translation system. In Proceedings of ICASSP.

P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. J. Dyer, O. Bojar, A. Constantin, and E. Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of ACL.

P. Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit.

M. Lederer. 1978. Simultaneous interpretation: units of meaning and other features. In D. Gerver and H. W. Sinaiko, editors, Language Interpretation and Communication, pages 323–332. Plenum Press, New York.

M. Marcus, B. Santorini, and M. A. Marcinkiewicz. 1993. Building a large annotated corpus of English: the Penn treebank. Computational Linguistics, 19(2):313–330.

E. Matusov, G. Leusch, O. Bender, and H. Ney. 2005. Evaluating machine translation output with automatic sentence segmentation. In Proceedings of IWSLT.

E. Matusov, D. Hillard, M. Magimai-Doss, D. Hakkani-Tür, M. Ostendorf, and H. Ney. 2007. Improving speech translation with automatic boundary prediction. In Proceedings of Interspeech.

M. Mohri, F. Pereira, and M. Riley. 1997. AT&T general-purpose finite-state machine software tools. http://www.research.att.com/sw/tools/fsm/.

F. J. Och and H. Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

V. K. Rangarajan Sridhar, L. Barbosa, and S. Bangalore. 2011. A scalable approach to building a parallel corpus from the Web. In Proceedings of Interspeech.

S. Rao, I. Lane, and T. Schultz. 2007. Optimizing sentence segmentation for spoken language translation. In Proceedings of Interspeech.

H. Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing.

R. Steinberger, B. Pouliquen, A. Widiger, C. Ignat, T. Erjavec, and D. Tufis. 2006. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of LREC.

J. Tiedemann and L. Lars Nygaard. 2004. The OPUS corpus - parallel & free. In Proceedings of LREC.

D. Vilar, E. Matusov, S. Hasan, R. Zens, and H. Ney. 2005. Statistical machine translation of European parliamentary speeches. In Proceedings of MT Summit.

W. Wahlster, editor. 2000. Verbmobil: Foundations of Speech-to-Speech Translation. Springer.
