SimulSpeech: End-to-End Simultaneous Speech to Text Translation

Yi Ren∗ (Zhejiang University)   Jinglin Liu∗ (Zhejiang University)   Xu Tan∗ (Microsoft Research)
Chen Zhang∗ (Zhejiang University)   Tao Qin (Microsoft Research)   Zhou Zhao† (Zhejiang University)
Tie-Yan Liu (Microsoft Research)

∗ Equal contribution.   † Corresponding author.

Abstract

In this work, we develop SimulSpeech, an end-to-end simultaneous speech to text translation system which translates speech in the source language to text in the target language concurrently. SimulSpeech consists of a speech encoder, a speech segmenter and a text decoder, where 1) the segmenter builds upon the encoder and leverages a connectionist temporal classification (CTC) loss to split the input streaming speech in real time, and 2) the encoder-decoder attention adopts a wait-k strategy for simultaneous translation. SimulSpeech is more challenging than previous cascaded systems (with simultaneous automatic speech recognition (ASR) and simultaneous neural machine translation (NMT)). We introduce two novel knowledge distillation methods to ensure the performance: 1) attention-level knowledge distillation transfers the knowledge from the multiplication of the attention matrices of the simultaneous NMT and ASR models to help the training of the attention mechanism in SimulSpeech; 2) data-level knowledge distillation transfers the knowledge from the full-sentence NMT model and also reduces the complexity of the data distribution to help the optimization of SimulSpeech. Experiments on the MuST-C English-Spanish and English-German spoken language translation datasets show that SimulSpeech achieves reasonable BLEU scores and lower delay compared to full-sentence end-to-end speech to text translation (without simultaneous translation), and better performance than the two-stage cascaded simultaneous translation model in terms of both BLEU scores and translation delay.

1 Introduction

Simultaneous speech to text translation (Fügen et al., 2007; Oda et al., 2014; Dalvi et al., 2018), which translates source-language speech into target-language text concurrently, is of great importance to the real-time understanding of spoken lectures or conversations, and is now widely used in many scenarios including live video streaming and international conferences. However, it is widely considered one of the challenging tasks in the machine translation domain because simultaneous speech to text translation has to understand the speech and trade off translation accuracy against delay. Conventional approaches to simultaneous speech to text translation (Fügen et al., 2007; Oda et al., 2014; Dalvi et al., 2018) divide the translation process into two stages, simultaneous automatic speech recognition (ASR) (Rao et al., 2017) and simultaneous neural machine translation (NMT) (Gu et al., 2016), which cannot be optimized jointly, result in inferior accuracy, and also incur more translation delay due to the two stages.

In this paper, we move a step further to translate the source speech to target text simultaneously, and develop SimulSpeech, an end-to-end simultaneous speech to text translation system. The SimulSpeech model consists of 1) a speech encoder where each speech frame can only see its previous frames to simulate streaming speech inputs; 2) a text decoder where the encoder-decoder attention follows the wait-k strategy (Ma et al., 2018) to decide when to listen to the source speech and when to write the target text (see Figure 1); and 3) a speech segmenter that builds upon the encoder and leverages a CTC loss to detect word boundaries, which are used to decide when to stop listening according to the wait-k strategy.

2 Preliminaries

In this section, we briefly review some basic knowledge for simultaneous speech to text translation, including speech to text translation, simultaneous translation based on the wait-k strategy, and the CTC loss for segmentation.

Figure 1: The wait-k strategy for simultaneous speech to text translation.

Speech to Text Translation  Given a set of bilingual speech-text sentence pairs D = {(x, y) ∈ (X × Y)}, a speech to text machine translation model learns the parameters θ by minimizing the negative log-likelihood −\sum_{(x,y)\in D} \log P(y|x; \theta), where P(y|x; θ) is calculated with the chain rule P(y|x; \theta) = \prod_{t=1}^{T_y} P(y_t | y_{<t}, x; \theta).
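The wait-k strategy referenced throughout this paper follows Ma et al. (2018): the model first listens to k source segments, then alternates between writing one target token and listening to one more segment (see also Section 6.2). As a minimal, hedged sketch (not the authors' code), the snippet below assumes the standard wait-k schedule g(t) = min(k + t - 1, |x|) from Ma et al. (2018), applied here to the word-level speech segments produced by the segmenter.

```python
def wait_k_schedule(num_src_segments: int, k: int):
    """Yield (target step t, number of source segments visible) under wait-k.

    Assumes the schedule g(t) = min(k + t - 1, num_src_segments) from
    Ma et al. (2018); here "segments" are the word-level speech segments
    produced by the speech segmenter rather than source words.
    """
    t = 1
    while True:
        visible = min(k + t - 1, num_src_segments)
        yield t, visible
        # Once the whole source has been read, decoding degrades to ordinary
        # full-sentence translation until the decoder emits EOS.
        t += 1

# Example: with k=3 and 5 source segments, the decoder writes its first
# target token after listening to 3 segments, its second after 4, and so on.
sched = wait_k_schedule(num_src_segments=5, k=3)
for _, (t, visible) in zip(range(6), sched):
    print(f"target step {t}: {visible} source segments visible")
```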


Figure 2: (a) The model structure of SimulSpeech. (b) The training pipeline for the SimulSpeech model. The SimulSpeech model is shown in the purple box, and the auxiliary training techniques are in the other boxes.

CTC Loss for Segmentation  The connectionist temporal classification (CTC) loss (Graves et al., 2006) is widely used for alignment and segmentation. It maps the frame-level classification outputs of a speech sequence to a text sequence (with a different length from the speech sequence). For a text sequence y, CTC introduces a set of intermediate representation paths φ(y) called CTC paths, which have a many-to-one mapping to y, since multiple CTC paths can correspond to the same text sequence. For example, both the frame-level classification outputs (CTC paths) "HHE∅L∅LOO" and "∅HHEEL∅LO" are mapped to the text sequence "HELLO", where ∅ is the blank symbol. The likelihood of y can thus be evaluated as a sum of the probabilities of its CTC paths:

P(y|x) = \sum_{z \in \phi(y)} P(z|x),    (2)

where x is the utterance consisting of speech frames and z is one of the CTC paths.

3 The SimulSpeech Model

Similar to many sequence to sequence generation tasks, SimulSpeech adopts the encoder-decoder framework. As shown in Figure 2a, both the encoder and the decoder follow the basic network structure of Transformer (Vaswani et al., 2017a) for neural machine translation. SimulSpeech differs from Transformer in several aspects:

• To handle speech inputs, we employ a speech pre-net (Shen et al., 2018) to extract speech features, which consists of multiple convolutional layers with the same hidden size as Transformer.

• To enable simultaneous translation, we design different attention mechanisms for the encoder and decoder. The encoder adopts masked self-attention, which masks the future frames of a speech frame when encoding it and ensures that each speech frame can only see its previous frames, simulating real-time streaming inputs. The decoder adopts the wait-k strategy (Ma et al., 2018), as shown in Equation 1, which guarantees that each target token can only see the source segments allowed by the wait-k strategy (see the mask sketch after this list).

• As the wait-k strategy requires the source speech to be discrete segments, we introduce a speech segmenter to split a speech sequence into discrete segments, each representing a word or phrase. The segmenter takes the outputs of the speech encoder as inputs, passes them through multiple non-linear dense layers and then a softmax linear layer to predict a character for each frame. When a word boundary token (the space character in our case) is predicted by the segmenter, SimulSpeech knows a word has ended. Multiple consecutive word boundary tokens are merged into one boundary.
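The two attention constraints in the second bullet can be made concrete with boolean masks: a causal (lower-triangular) mask for the encoder self-attention over speech frames, and a wait-k mask for the encoder-decoder attention over source segments. The sketch below is a minimal illustration of this masking logic under the wait-k schedule of Ma et al. (2018), not the authors' implementation; for simplicity, the cross-attention mask is written at the segment level rather than over the frames inside each segment.

```python
import torch

def causal_frame_mask(num_frames: int) -> torch.Tensor:
    """Boolean self-attention mask for the speech encoder:
    frame i may attend only to frames <= i (True = allowed)."""
    return torch.tril(torch.ones(num_frames, num_frames)).bool()

def wait_k_cross_mask(num_tgt: int, num_src_segments: int, k: int) -> torch.Tensor:
    """Boolean encoder-decoder attention mask at the segment level:
    target step t (1-indexed) may attend only to the first
    min(k + t - 1, num_src_segments) source segments."""
    t = torch.arange(1, num_tgt + 1).unsqueeze(1)            # (T_tgt, 1)
    s = torch.arange(1, num_src_segments + 1).unsqueeze(0)   # (1, S_seg)
    visible = torch.clamp(k + t - 1, max=num_src_segments)   # (T_tgt, 1)
    return s <= visible                                       # (T_tgt, S_seg)

# Example: 3 speech frames for the encoder; 4 target steps, 5 segments, k = 2.
print(causal_frame_mask(3))
print(wait_k_cross_mask(num_tgt=4, num_src_segments=5, k=2))
```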


Figure 3: Details of attention-level knowledge distillation.

4 Training of SimulSpeech

The training of the SimulSpeech model is more difficult than that of an NMT model or an ASR model, since SimulSpeech involves multiple modalities (i.e., speech and text) and multiple languages. In this section, we discuss how to train the SimulSpeech model. As shown in Figure 2b, we introduce the CTC loss for the training of the speech segmenter, and attention-level and data-level knowledge distillation for the training of the overall SimulSpeech model. In SimulSpeech training, the training data are provided in the format of (source speech, source text, target text) tuples.

4.1 Training Segmenter with CTC Loss

In SimulSpeech, the speech segmenter is used to detect word boundaries, and the detected boundaries determine when to stop listening and switch to translation, which is critical for the performance of simultaneous translation. As it is hard to find frame-level labels to guide the output of the softmax linear layer in the speech segmenter, we leverage the connectionist temporal classification (CTC) loss to train it. According to Equation 2, the CTC loss is formulated as

L_{ctc} = -\sum_{(x,y) \in (X \times Y^{src})} \sum_{z \in \phi(y)} P(z|x),    (3)

where (X × Y^{src}) denotes the set of source speech and source text sequence pairs, and φ(y) denotes the set of CTC paths for y.

During inference, we simply use best path decoding (Graves et al., 2006) to decide the word boundaries without seeing subsequent speech frames, which is consistent with the masked self-attention in the speech encoder, i.e., the output of the segmenter at position i depends only on the inputs at positions preceding i.
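As an illustration of Section 4.1, the sketch below pairs a standard CTC loss over the segmenter's frame-level character logits with greedy best-path decoding that keeps only word-boundary (space) predictions and merges consecutive boundary frames. It is a hedged sketch of the described procedure rather than the authors' code; the vocabulary layout (blank index 0, space index 1) and the tensor shapes are assumptions.

```python
import torch
import torch.nn as nn

BLANK_ID, SPACE_ID = 0, 1  # assumed vocabulary layout: 0 = CTC blank, 1 = word boundary (space)

ctc_loss = nn.CTCLoss(blank=BLANK_ID, zero_infinity=True)

def segmenter_ctc_loss(log_probs, char_targets, input_lengths, target_lengths):
    """log_probs: (T_frames, batch, num_chars) log-softmax outputs of the segmenter."""
    return ctc_loss(log_probs, char_targets, input_lengths, target_lengths)

def boundary_frames(log_probs_single):
    """Greedy best-path decoding for one utterance (T_frames, num_chars):
    take the argmax character per frame, ignore blanks, and merge runs of
    word-boundary predictions into a single boundary."""
    best_path = log_probs_single.argmax(dim=-1).tolist()
    boundaries, prev = [], None
    for frame_idx, char_id in enumerate(best_path):
        if char_id == SPACE_ID and prev != SPACE_ID:
            boundaries.append(frame_idx)   # a word ends around this frame
        if char_id != BLANK_ID:
            prev = char_id
    return boundaries

# Toy usage with random logits for a 20-frame utterance and a 30-symbol charset.
log_probs = torch.randn(20, 30).log_softmax(dim=-1)
print(boundary_frames(log_probs))
```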

4.2 Attention-Level Knowledge Distillation

To better train the SimulSpeech model, we propose a novel attention-level knowledge distillation that is specially designed for speech to text translation. It transfers the knowledge from the multiplication of the attention weight matrices of the simultaneous ASR and NMT models into the attention of the SimulSpeech model. In order to obtain the attention weights of simultaneous ASR and NMT, we add auxiliary simultaneous ASR and NMT tasks which share the same encoder or decoder with the SimulSpeech model respectively, as shown in Figure 2b. The two auxiliary tasks both leverage a wait-k strategy similar to that used in the SimulSpeech model.

Denote the sequence lengths of the source speech, source text and target text as S_src, T_src and T_tgt respectively. Denote the attention weights of simultaneous ASR and NMT as A^{T_src × S_src} and A^{T_tgt × T_src} respectively. Ideally, the attention weights of SimulSpeech A^{T_tgt × S_src} should satisfy

A^{T_tgt × S_src} = A^{T_tgt × T_src} × A^{T_src × S_src}.    (4)

However, the attention weights are difficult to learn, and the attention weights of the SimulSpeech model are more difficult to learn than those of the simultaneous ASR and NMT models, since SimulSpeech is much more challenging. Therefore, we propose to distill the knowledge from the multiplication of the attention weights of the simultaneous ASR and NMT models, as shown in Figure 2b and Figure 3. We first multiply the attention matrix of simultaneous NMT by that of simultaneous ASR, and then binarize the resulting matrix with a threshold. We then match the attention weights predicted by the SimulSpeech model to the binarized attention matrix, with the loss function

L_{att-kd} = -B(A^{T_tgt × T_src} × A^{T_src × S_src}) × A^{T_tgt × S_src},    (5)

where B is the binarization operation, which sets an element of the matrix to 1 if it is above the threshold of 0.05, and to 0 otherwise.
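A minimal sketch of the attention-distillation target and loss in Equations 4 and 5: multiply the NMT and ASR attention matrices, binarize the product at the 0.05 threshold, and use the result to score the SimulSpeech attention. Reading the final product in Equation 5 as an element-wise product summed over all positions (i.e., rewarding SimulSpeech attention mass on the binarized teacher positions) is our assumption; the shapes and the threshold follow the text.

```python
import torch

THRESHOLD = 0.05  # binarization threshold from Section 4.2

def attention_kd_loss(nmt_attn, asr_attn, s2t_attn):
    """nmt_attn: (T_tgt, T_src), asr_attn: (T_src, S_src), s2t_attn: (T_tgt, S_src).

    Builds the teacher attention A^{T_tgt x S_src} = A^{T_tgt x T_src} @ A^{T_src x S_src},
    binarizes it at 0.05, and penalizes SimulSpeech attention mass that falls
    outside the binarized teacher positions (assumed element-wise reading of Eq. 5)."""
    teacher = nmt_attn @ asr_attn                 # (T_tgt, S_src)
    binarized = (teacher > THRESHOLD).float()     # B(.) in Eq. 5
    return -(binarized * s2t_attn).sum()

# Toy example with random attention distributions (rows sum to 1).
T_tgt, T_src, S_src = 4, 5, 12
nmt = torch.softmax(torch.randn(T_tgt, T_src), dim=-1)
asr = torch.softmax(torch.randn(T_src, S_src), dim=-1)
s2t = torch.softmax(torch.randn(T_tgt, S_src), dim=-1)
print(attention_kd_loss(nmt, asr, s2t))
```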

4.3 Data-Level Knowledge Distillation

Data-level knowledge distillation is widely used to help model training in various tasks and situations (Kim and Rush, 2016; Tan et al., 2019) and can boost the performance of a student model. In this work, we leverage knowledge distillation to transfer the knowledge from a full-sentence NMT teacher model to the SimulSpeech model. We first train a full-sentence NMT teacher model and then generate target text y′ given the source text y that is paired with the source speech x. Finally, we train the student SimulSpeech model with the generated target text y′, which is paired with the source speech x. The loss function is formulated as

L_{data-kd} = -\sum_{(x, y') \in (X \times Y'^{tgt})} \log P(y'|x),    (6)

where (X × Y'^{tgt}) denotes the set of speech-text sequence pairs in which the text is generated by the NMT teacher model.
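The data-level distillation above is sequence-level knowledge distillation in the spirit of Kim and Rush (2016): decode the training sources with the full-sentence NMT teacher and train the student on the synthetic targets. A hedged sketch of that data-preparation step is given below; the teacher.translate interface and the beam size are hypothetical placeholders for illustration, not the authors' API.

```python
def build_distilled_dataset(triples, teacher, beam_size=4):
    """triples: iterable of (source_speech, source_text, target_text) tuples.

    Replaces each ground-truth target with the full-sentence NMT teacher's
    translation of the source text, producing (speech, distilled_target)
    pairs for training SimulSpeech with L_data_kd (Eq. 6).
    `teacher.translate` is a hypothetical decoding interface.
    """
    distilled = []
    for speech, src_text, _tgt_text in triples:
        y_prime = teacher.translate(src_text, beam_size=beam_size)
        distilled.append((speech, y_prime))
    return distilled
```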

The total loss function to train the SimulSpeech model is

L = \lambda_1 L_{ctc} + \lambda_2 L_{att-kd} + \lambda_3 L_{data-kd},    (7)

where λ1, λ2, λ3 are hyperparameters that trade off the three losses.

5 Experiments and Results

In this section, we evaluate SimulSpeech on the MuST-C corpus (Di Gangi et al., 2019). First we describe the experimental settings and details, then we show the experiment results, and finally we conduct some analyses of our model.

5.1 Experiment Settings

Datasets  We use the MuST-C English-Spanish (En→Es) and English-German (En→De) speech translation corpora in our experiments. Both datasets contain audio clips in the source language together with the corresponding source-language transcripts and target-language translated text. The official data statistics and splits for the train/dev/test sets are shown in Table 1. For the speech data, we transform the raw audio into mel-spectrograms following Shen et al. (2018) with 50 ms frame size and 12.5 ms hop size. To simplify the model training, we remove some non-verbal annotations in the text, such as "(Laughing)" and "(Music)". All the sentences are first tokenized with the Moses tokenizer² and then segmented into subword symbols using Byte Pair Encoding (BPE) (Sennrich et al., 2016), except for the label to train the speech segmenter, where we use the character sequence of the source text. We learn the BPE merge operations across the source and target languages. We use the speech segmenter proposed in Section 3 to split the speech mel-spectrograms into segments, where each segment is regarded as a discrete token and represents a word or short phrase.

Task     Train            Dev           Test
En→Es    229703 (496h)    1316 (2.5h)   2502 (4h)
En→De    265625 (400h)    1423 (2.5h)   2641 (4h)

Table 1: The number of sentences and the duration of audio in the MuST-C dataset.

Model Configuration  We use the Transformer (Vaswani et al., 2017b) as the basic SimulSpeech model structure since it achieves state-of-the-art accuracy and has become a popular choice for recent NMT research. The model hidden size, number of heads, and numbers of encoder and decoder layers are set to 384, 4, 6 and 4 respectively. Considering that adjacent hidden states are closely related in speech tasks, we replace the feed-forward network in Transformer with a 2-layer 1D convolutional network (Gehring et al., 2017) with ReLU activation. Left padding is used in the 1D convolutional network on the target side (Ren et al., 2019) to avoid the output token seeing its subsequent tokens in the training stage. The kernel size and filter size of the 1D convolution are set to 1536 and 9 respectively. The pre-net (bottom left in Figure 2a) is a 3-layer convolutional network with left padding, whose output dimension is the same as the hidden size of the Transformer encoder. The decoder of the auxiliary ASR model and the encoder of the auxiliary NMT model, as well as the encoder and decoder of the NMT teacher model, share the same model structures described above.

Training and Inference  SimulSpeech is trained on 2 NVIDIA Tesla V100 GPUs with a total batch size of 64 sentences. We use the Adam optimizer with the default parameters (Kingma and Ba, 2014) and the learning rate schedule in Vaswani et al. (2017a). We train SimulSpeech with the auxiliary simultaneous ASR and NMT tasks by default. We set λ1, λ2, λ3 in Equation 7 to 1.0, 0.1 and 1.0 respectively, according to the validation performance. SimulSpeech is trained and tested with the same k unless otherwise stated. The translation quality is evaluated by tokenized case-sensitive BLEU (Papineni et al., 2002) with the multi-bleu.perl script³. Our code is based on tensor2tensor (Vaswani et al., 2018)⁴.

²https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl
³https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl
⁴https://github.com/tensorflow/tensor2tensor
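As a concrete but hedged reading of the model configuration above, the sketch below replaces the Transformer feed-forward sublayer with a 2-layer 1D convolutional block using ReLU and left (causal) padding on the target side. We read the reported sizes as an inner filter (channel) size of 1536 and a kernel size of 9; that reading, and the exact layer ordering, are assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvFFN(nn.Module):
    """2-layer 1D convolutional feed-forward block with ReLU and left padding,
    standing in for the Transformer FFN on the decoder (target) side.
    hidden=384 and filter_size=1536 follow the reported configuration;
    kernel_size=9 is our reading of the text."""
    def __init__(self, hidden=384, filter_size=1536, kernel_size=9):
        super().__init__()
        self.kernel_size = kernel_size
        self.conv1 = nn.Conv1d(hidden, filter_size, kernel_size)
        self.conv2 = nn.Conv1d(filter_size, hidden, kernel_size)

    def forward(self, x):              # x: (batch, time, hidden)
        y = x.transpose(1, 2)           # -> (batch, hidden, time)
        # Left padding keeps the convolution causal: position t never sees t+1, ...
        y = F.pad(y, (self.kernel_size - 1, 0))
        y = F.relu(self.conv1(y))
        y = F.pad(y, (self.kernel_size - 1, 0))
        y = self.conv2(y)
        return y.transpose(1, 2)        # back to (batch, time, hidden)

# Toy usage: a batch of 2 sequences, 10 steps, hidden size 384.
out = CausalConvFFN()(torch.randn(2, 10, 384))
print(out.shape)  # torch.Size([2, 10, 384])
```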

The Metric of Translation Delay  Many previous works focus on proposing metrics of translation delay for simultaneous text to text translation, such as average proportion (AP) (Cho and Esipova, 2016) and average lagging (AL) (Ma et al., 2018). The former calculates the mean absolute delay cost incurred by each target token, while the latter measures the degree to which the system is out of sync with the speaker. In this work, we extend the AP and AL metrics, which are originally calculated on word sequences, to speech sequences for the simultaneous speech to text translation task. Our extended AP is defined as follows:

AP(x, y) = \frac{1}{|x|_{time} |y|} \sum_{i=1}^{|y|} t(i),    (8)

where x and y are the source speech and target text, |x|_{time} is the total time duration of the source speech, |y| is the length of the target text, and t(i) is the real-time delay in terms of source speech when generating the i-th word in the target sequence, i.e., the duration of source speech listened to by the model before writing the i-th target token. Our extended AL is defined as follows:

AL(x, y) = \frac{1}{\tau(|x|^{seg})} \sum_{i=1}^{\tau(|x|^{seg})} \left( g(i) - \frac{i-1}{r} \right),    (9)

where |x|^{seg} is the number of speech segments, g(i) is the delay at step i, i.e., the number of source segments listened to by the model before writing the i-th target token, and \tau(|x|^{seg}) denotes the earliest timestep where the model has consumed the entire source sequence:

\tau(|x|^{seg}) = \arg\min_{t} \left( g(t) = |x|^{seg} \right),    (10)

and r = |y| / |x|^{seg} is the length ratio between the target and source sequences.

5.2 Experiment Results

Translation Accuracy  First, we evaluate the performance of the SimulSpeech model under different k. The BLEU scores for En-Es and En-De are shown in Table 2. We can see that the performance of our model does not drop a lot when k is small, compared to full-sentence translation (training with k=inf).

k             1      3      5      7      9      inf
En-Es         15.02  19.92  21.58  22.42  22.49  22.72
En-Es (FS)    3.25   7.18   10.52  13.33  15.32  22.72
En-De         10.73  15.52  16.90  17.46  17.87  18.29
En-De (FS)    2.58   6.89   9.65   11.70  13.15  18.29

Table 2: The BLEU scores of SimulSpeech on the test sets of the MuST-C En→Es and En→De datasets. FS denotes training with k=inf.

Translation Delay  We plot the translation quality (in terms of BLEU score) against the delay metrics (AP and AL) of our SimulSpeech model and a test-time wait-k model (trained with full-sentence translation but tested with wait-k, denoted as "train-full test-k") in Figures 4a and 4b. We can see that the BLEU scores increase as k increases, at the sacrifice of translation delay. The accuracy of the SimulSpeech model is always better than that of test-time wait-k, which demonstrates the effectiveness of SimulSpeech.

Figure 4: The translation quality against the latency metrics (AP and AL) on the En→Es dataset. (a) The translation quality against the latency in terms of AP. (b) The translation quality against the latency in terms of AL.
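The extended AP and AL metrics in Equations 8-10 can be computed directly from the per-token delays recorded during decoding. The sketch below is a straightforward transcription of those definitions and not the authors' evaluation script; t_i holds the seconds of source audio consumed before emitting each target token, and g_i the number of source segments consumed.

```python
def average_proportion(t_i, total_audio_seconds):
    """Eq. 8: AP = (1 / (|x|_time * |y|)) * sum_i t(i),
    where t_i[i] is the source-audio duration heard before writing token i."""
    return sum(t_i) / (total_audio_seconds * len(t_i))

def average_lagging(g_i, num_src_segments, tgt_len):
    """Eqs. 9-10: AL averaged up to tau, the first step whose delay covers the
    whole source; g_i[i] is the number of source segments heard before writing
    target token i (1-indexed in the formulas, list order here)."""
    r = tgt_len / num_src_segments
    # tau: earliest step t with g(t) reaching |x|^seg (fall back to the last step)
    tau = next((i + 1 for i, g in enumerate(g_i) if g >= num_src_segments), len(g_i))
    return sum(g_i[i - 1] - (i - 1) / r for i in range(1, tau + 1)) / tau

# Toy example: 5 target tokens over 4 source segments, wait-2-like delays.
print(average_proportion([1.0, 1.5, 2.0, 2.5, 3.0], total_audio_seconds=3.0))
print(average_lagging([2, 3, 4, 4, 4], num_src_segments=4, tgt_len=5))
```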
Comparison with Cascaded Models  Finally, we implement the cascaded simultaneous speech to text translation pipeline and compare the accuracy of SimulSpeech with it under the same translation delay by using the same k.

En (source)                   the first on here is the classic apple.
Es (target)                   la primera aquí es la clásica manzana.
ASR (wait-1)                  the first on here is the class sake apple.
ASR (wait-1) + NMT (wait-3)   pero la primera vez es una manzana motivo de clase.
SimulSpeech (wait-3)          la primera es una manzana clásica.

Figure 5: An example from the test set of the En→Es dataset, which demonstrates that SimulSpeech outperforms cascaded models under the same delay (the delay of wait-1 for ASR plus wait-3 for NMT is equal to the delay of wait-3 for SimulSpeech). In this case, the wait-1 ASR model in the cascaded method does not recognize the word "classic" correctly, which results in a wrong translation from the NMT model.

For the cascaded method, we try all possible combinations of wait-k ASR and wait-k NMT models and report the best one. The accuracy of the two methods is shown in Table 3. It can be seen that 1) SimulSpeech outperforms the cascaded method when k < 9, which covers most simultaneous translation scenarios, and 2) the cascaded model only outperforms SimulSpeech at larger k⁵. These results demonstrate the advantages of SimulSpeech specifically for the simultaneous translation scenario. We further plot the BLEU scores of the two methods in Figure 6. It can be seen that SimulSpeech with wait-3 can achieve the same BLEU score as the cascaded method under wait-5. To sum up, SimulSpeech achieves higher translation accuracy than the cascaded method under the same translation delay, and achieves lower translation delay at the same translation accuracy.

Model          k=1    k=3    k=5    k=7    k=9    k=inf
Cascaded       12.77  16.91  19.66  21.05  23.43  25.60
SimulSpeech    15.02  19.92  21.58  22.42  22.49  22.72

Table 3: The comparison between the two-stage cascaded method and SimulSpeech under different wait-k on the En→Es dataset.

5.3 Ablation Study

We evaluate the effectiveness of each component and show the results in Table 4. From the BLEU scores in Row 2 and Row 3, it can be seen that the translation accuracy under different wait-k can be boosted by adding the auxiliary tasks to the naive simultaneous speech to text translation model (denoted as Naive S2T).

Model                      k=1    k=5    k=9
Naive S2T                  9.02   14.90  15.90
+Aux                       12.98  19.41  20.39
+Aux+DataKD                13.77  20.98  21.52
+Aux+AttnKD                13.74  20.64  20.90
+Aux+DataKD+AttnKD
(SimulSpeech)              15.02  21.58  22.49

Table 4: The ablation studies on the En→Es dataset. The baseline model (Naive S2T) is the naive simultaneous speech to text translation model with the wait-k policy. We gradually add our techniques to it to evaluate their effectiveness.

The Effectiveness of Data-Level Knowledge Distillation  We further evaluate the effectiveness of data-level knowledge distillation (Row 4 vs. Row 3). The result shows that data-level knowledge distillation achieves a large accuracy improvement.

The Effectiveness of Attention-Level Knowledge Distillation  We further evaluate the effectiveness of attention-level knowledge distillation. We add attention-level knowledge distillation (Row 5 vs. Row 3) to the model and find that the accuracy is also improved. As a result, we combine all the techniques together (Row 6, SimulSpeech) and obtain the best BLEU scores across different wait-k, which demonstrates the effectiveness of all the techniques we propose for the training of SimulSpeech.

The Effectiveness of the Speech Segmenter  To evaluate the effectiveness of our segmenter, we compare the accuracy of the SimulSpeech model using our segmentation method against that using the ground-truth segmentation, where we extract the segmentation from the ground-truth speech and the corresponding transcripts using an alignment tool⁶ and regard it as the ground-truth segmentation. As shown in Table 5, the BLEU scores of SimulSpeech using our segmentation method are close to those using the ground-truth segmentation⁷, which demonstrates the effectiveness of our speech segmenter.
⁵In a typical simultaneous translation scenario, k should be as small as possible, otherwise a large delay is incurred.
⁶https://github.com/lowerquality/gentle
⁷Note that we cannot obtain the ground-truth segmentation during inference. Therefore, the accuracy gap in Table 5 is reasonable.

Method         k=1    k=3    k=5    k=7    k=9
Ground-Truth   18.04  22.61  23.76  23.36  23.14
Our Method     15.02  19.92  21.58  22.42  22.49

Table 5: The BLEU scores of SimulSpeech on En→Es using our speech segmentation method and the ground-truth segmentation.

Case Analysis  We further conduct case studies to demonstrate the advantages of our end-to-end translation over the previous cascaded models. As shown in Figure 5, the simultaneous ASR model makes a mistake which further affects the accuracy of the downstream simultaneous NMT model, while SimulSpeech does not suffer from this problem. As a result, SimulSpeech outperforms the cascaded models.

Figure 6: The comparison between SimulSpeech and the cascaded method in terms of translation accuracy and delay on the En→Es dataset.

6 Related Works

6.1 Speech to Text Translation

Speech to text translation has been a hot research topic in the field of artificial intelligence recently (Bérard et al., 2016; Weiss et al., 2017; Liu et al., 2019). Early works on speech to text translation rely on a two-stage method with cascaded ASR and NMT models. Bérard et al. (2016) proposed an end-to-end speech to text translation system which does not leverage source-language text during training or inference. Weiss et al. (2017) further leveraged an auxiliary ASR model with a shared encoder with the speech to text model, regarding it as a multi-task problem. Vila et al. (2018) applied the Transformer (Vaswani et al., 2017b) architecture to this task and achieved good accuracy. Bansal et al. (2018) explored speech to text translation in the low-resource setting where both data and computation are limited. Sperber et al. (2019) proposed a novel attention-passing model for end-to-end speech to text translation and achieved comparable accuracy to the cascaded models.

6.2 Simultaneous Translation

Simultaneous translation aims to translate sentences before they are finished (Fügen et al., 2007; Oda et al., 2014; Dalvi et al., 2018). A traditional speech to text simultaneous translation system usually first recognizes and segments the incoming speech stream based on an automatic speech recognition (ASR) system, and then translates it into text in the target language. Most of the previous works focus on the simultaneous machine translation part (Zheng et al., 2019): Gu et al. (2016) proposed a framework for simultaneous NMT in which an agent learns to make decisions on when to translate from interaction with a pre-trained NMT environment. Ma et al. (2018) introduced a very simple but effective wait-k strategy for simultaneous NMT based on a prefix-to-prefix framework, which predicts the next target word conditioned on the partial source sequence the model has seen, instead of the full source sequence. The wait-k strategy will wait for the first k source words and then start to generate a target word. After that, once receiving a new source word, the decoder generates a new target word, until there are no more source words, after which the translation degrades to full-sentence translation.

7 Conclusion

In this work, we developed SimulSpeech, an end-to-end simultaneous speech to text translation system that directly translates source speech into target text concurrently. SimulSpeech consists of a speech encoder, a speech segmenter, and a text decoder with a wait-k strategy for simultaneous translation. We further introduced several techniques including data-level and attention-level knowledge distillation to boost the accuracy of SimulSpeech. Experiments on the MuST-C spoken language translation datasets demonstrate the advantages of SimulSpeech in terms of both translation accuracy and delay.

For future work, we will design more flexible policies to achieve better translation quality and lower delay in simultaneous spoken language translation. We will also investigate simultaneous translation from speech in a source language to speech in a target language.

Acknowledgments

This work was supported in part by the National Key R&D Program of China (Grant No. 2018AAA0100603), Zhejiang Natural Science Foundation (LR19F020006), National Natural Science Foundation of China (Grant No. 61836002), National Natural Science Foundation of China (Grant No. U1611461), and National Natural Science Foundation of China (Grant No. 61751209). This work was also partially funded by Microsoft Research Asia.

References

Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, and Sharon Goldwater. 2018. Low-resource speech-to-text translation. arXiv preprint arXiv:1803.09164.

Alexandre Bérard, Olivier Pietquin, Christophe Servan, and Laurent Besacier. 2016. Listen and translate: A proof of concept for end-to-end speech-to-text translation. arXiv preprint arXiv:1612.01744.

Kyunghyun Cho and Masha Esipova. 2016. Can neural machine translation do simultaneous translation? arXiv preprint arXiv:1606.02012.

Fahim Dalvi, Nadir Durrani, Hassan Sajjad, and Stephan Vogel. 2018. Incremental decoding and training methods for simultaneous translation in neural machine translation. arXiv preprint arXiv:1806.03661.

Mattia Antonino Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019. MuST-C: a multilingual speech translation corpus. In NAACL-HLT, Minneapolis, MN, USA.

Christian Fügen, Alex Waibel, and Muntsin Kolss. 2007. Simultaneous translation of lectures and speeches. Machine Translation, 21(4):209–252.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In ICML, pages 1243–1252.

Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In ICML, pages 369–376.

Jiatao Gu, Graham Neubig, Kyunghyun Cho, and Victor O.K. Li. 2016. Learning to translate in real-time with neural machine translation. arXiv preprint arXiv:1610.00388.

Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Yuchen Liu, Hao Xiong, Zhongjun He, Jiajun Zhang, Hua Wu, Haifeng Wang, and Chengqing Zong. 2019. End-to-end speech translation with knowledge distillation. arXiv preprint arXiv:1904.08075.

Mingbo Ma, Liang Huang, Hao Xiong, Kaibo Liu, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, and Haifeng Wang. 2018. STACL: Simultaneous translation with integrated anticipation and controllable latency. arXiv preprint arXiv:1810.08398.

Yusuke Oda, Graham Neubig, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2014. Optimizing segmentation strategies for simultaneous speech translation. In ACL, pages 551–556.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL, pages 311–318.

Kanishka Rao, Haşim Sak, and Rohit Prabhavalkar. 2017. Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer. In ASRU, pages 193–199. IEEE.

Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2019. FastSpeech: Fast, robust and controllable text to speech. arXiv preprint arXiv:1905.09263.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In ACL.

Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al. 2018. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In ICASSP, pages 4779–4783. IEEE.

Matthias Sperber, Graham Neubig, Jan Niehues, and Alex Waibel. 2019. Attention-passing models for robust and data-efficient end-to-end speech translation. TACL, 7:313–325.

Xu Tan, Yi Ren, Di He, Tao Qin, Zhou Zhao, and Tie-Yan Liu. 2019. Multilingual neural machine translation with knowledge distillation. arXiv preprint arXiv:1902.10461.

Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan N. Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob Uszkoreit. 2018. Tensor2Tensor for neural machine translation. CoRR, abs/1803.07416.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017a. Attention is all you need. In NIPS, pages 6000–6010.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017b. Attention is all you need. In NIPS, pages 5998–6008.

Laura Cross Vila, Carlos Escolano, José A. R. Fonollosa, and Marta R. Costa-jussà. 2018. End-to-end speech translation with the Transformer. In IberSPEECH, pages 60–63.

Ron J. Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, and Zhifeng Chen. 2017. Sequence-to-sequence models can directly translate foreign speech. arXiv preprint arXiv:1703.08581.

Baigong Zheng, Renjie Zheng, Mingbo Ma, and Liang Huang. 2019. Simultaneous translation with flexible policy via restricted imitation learning. arXiv preprint arXiv:1906.01135.

Chunting Zhou, Graham Neubig, and Jiatao Gu. 2019. Understanding knowledge distillation in non-autoregressive machine translation. arXiv preprint arXiv:1911.02727.