SimulSpeech: End-to-End Simultaneous Speech to Text Translation

Yi Ren∗ (Zhejiang University)   Jinglin Liu∗ (Zhejiang University)   Xu Tan∗ (Microsoft Research)
Chen Zhang∗ (Zhejiang University)   Tao Qin (Microsoft Research)   Zhou Zhao† (Zhejiang University)
Tie-Yan Liu (Microsoft Research)

∗ Equal contribution.   † Corresponding author.

Abstract

In this work, we develop SimulSpeech, an end-to-end simultaneous speech to text translation system which translates speech in the source language to text in the target language concurrently. SimulSpeech consists of a speech encoder, a speech segmenter and a text decoder, where 1) the segmenter builds upon the encoder and leverages a connectionist temporal classification (CTC) loss to split the input streaming speech in real time, and 2) the encoder-decoder attention adopts a wait-k strategy for simultaneous translation. SimulSpeech is more challenging than previous cascaded systems (with simultaneous automatic speech recognition (ASR) and simultaneous neural machine translation (NMT)). We introduce two novel knowledge distillation methods to ensure the performance: 1) attention-level knowledge distillation transfers the knowledge from the multiplication of the attention matrices of the simultaneous NMT and ASR models to help the training of the attention mechanism in SimulSpeech; 2) data-level knowledge distillation transfers the knowledge from the full-sentence NMT model and also reduces the complexity of the data distribution to help the optimization of SimulSpeech. Experiments on the MuST-C English-Spanish and English-German spoken language translation datasets show that SimulSpeech achieves reasonable BLEU scores and lower delay compared to full-sentence end-to-end speech to text translation (without simultaneous translation), and better performance than the two-stage cascaded simultaneous translation model in terms of both BLEU scores and translation delay.

1 Introduction

Simultaneous speech to text translation (Fügen et al., 2007; Oda et al., 2014; Dalvi et al., 2018), which translates source-language speech into target-language text concurrently, is of great importance to the real-time understanding of spoken lectures or conversations, and is now widely used in many scenarios including live video streaming and international conferences. However, it is widely considered one of the challenging tasks in the machine translation domain because simultaneous speech to text translation has to understand the speech and trade off translation accuracy against delay. Conventional approaches to simultaneous speech to text translation (Fügen et al., 2007; Oda et al., 2014; Dalvi et al., 2018) divide the translation process into two stages, simultaneous automatic speech recognition (ASR) (Rao et al., 2017) and simultaneous neural machine translation (NMT) (Gu et al., 2016), which cannot be optimized jointly, result in inferior accuracy, and also incur more translation delay due to the two stages.

In this paper, we move a step further to translate the source speech to target text simultaneously, and develop SimulSpeech, an end-to-end simultaneous speech to text translation system. The SimulSpeech model consists of 1) a speech encoder where each speech frame can only see its previous frames to simulate streaming speech inputs; 2) a text decoder where the encoder-decoder attention follows the wait-k strategy (Ma et al., 2018) to decide when to listen to the source speech and when to write the target text (see Figure 1); and 3) a speech segmenter that builds upon the encoder and leverages a CTC loss to detect word boundaries, which are used to decide when to stop listening according to the wait-k strategy.

2 Preliminaries

In this section, we briefly review some basic knowledge for simultaneous speech to text translation, including speech to text translation, simultaneous translation based on the wait-k strategy, and the CTC loss for segmentation.

Figure 1: The wait-k strategy for simultaneous speech to text translation.

Speech to Text Translation  Given a set of bilingual speech-text sentence pairs D = {(x, y) ∈ (X × Y)}, a speech to text machine translation model learns the parameters θ by minimizing the negative log-likelihood −\sum_{(x,y)\in D} \log P(y|x; \theta), where P(y|x; θ) is calculated with the chain rule P(y|x; \theta) = \prod_{t=1}^{T_y} P(y_t | y_{<t}, x; \theta).
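The wait-k strategy referenced throughout this paper follows Ma et al. (2018): the model first listens to k source segments, then alternates between writing one target token and listening to one more segment (see also Section 6.2). As a minimal, hedged sketch (not the authors' code), the snippet below assumes the standard wait-k schedule g(t) = min(k + t - 1, |x|) from Ma et al. (2018), applied here to the word-level speech segments produced by the segmenter.

```python
def wait_k_schedule(num_src_segments: int, k: int):
    """Yield (target step t, number of source segments visible) under wait-k.

    Assumes the schedule g(t) = min(k + t - 1, num_src_segments) from
    Ma et al. (2018); here "segments" are the word-level speech segments
    produced by the speech segmenter rather than source words.
    """
    t = 1
    while True:
        visible = min(k + t - 1, num_src_segments)
        yield t, visible
        # Once the whole source has been read, decoding degrades to ordinary
        # full-sentence translation until the decoder emits EOS.
        t += 1

# Example: with k=3 and 5 source segments, the decoder writes its first
# target token after listening to 3 segments, its second after 4, and so on.
sched = wait_k_schedule(num_src_segments=5, k=3)
for _, (t, visible) in zip(range(6), sched):
    print(f"target step {t}: {visible} source segments visible")
```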


Figure 2: (a) The model structure of SimulSpeech. (b) The training pipeline for the SimulSpeech model. The SimulSpeech model is shown in the purple box, and the auxiliary training techniques are in the other boxes.

CTC Loss for Segmentation  The connectionist temporal classification (CTC) loss (Graves et al., 2006) is widely used for alignment and segmentation. It maps the frame-level classification outputs of a speech sequence to a text sequence (with a different length from the speech sequence). For a text sequence y, CTC introduces a set of intermediate representation paths φ(y) called CTC paths, which have a many-to-one mapping to y, since multiple CTC paths can correspond to the same text sequence. For example, both the frame-level classification outputs (CTC paths) "HHE∅L∅LOO" and "∅HHEEL∅LO" are mapped to the text sequence "HELLO", where ∅ is the blank symbol. The likelihood of y can thus be evaluated as a sum of the probabilities of its CTC paths:

P(y|x) = \sum_{z \in \phi(y)} P(z|x),    (2)

where x is the utterance consisting of speech frames and z is one of the CTC paths.

3 The SimulSpeech Model

Similar to many sequence to sequence generation tasks, SimulSpeech adopts the encoder-decoder framework. As shown in Figure 2a, both the encoder and the decoder follow the basic network structure of Transformer (Vaswani et al., 2017a) for neural machine translation. SimulSpeech differs from Transformer in several aspects:

• To handle speech inputs, we employ a speech pre-net (Shen et al., 2018) to extract speech features, which consists of multiple convolutional layers with the same hidden size as Transformer.

• To enable simultaneous translation, we design different attention mechanisms for the encoder and decoder. The encoder adopts masked self-attention, which masks the future frames of a speech frame when encoding it and ensures that each speech frame can only see its previous frames, simulating real-time streaming inputs. The decoder adopts the wait-k strategy (Ma et al., 2018), as shown in Equation 1, which guarantees that each target token can only see the source segments allowed by the wait-k strategy (see the mask sketch after this list).

• As the wait-k strategy requires the source speech to be discrete segments, we introduce a speech segmenter to split a speech sequence into discrete segments, each representing a word or phrase. The segmenter takes the outputs of the speech encoder as inputs, passes them through multiple non-linear dense layers and then a softmax linear layer to predict a character for each frame. When a word boundary token (the space character in our case) is predicted by the segmenter, SimulSpeech knows a word has ended. Multiple consecutive word boundary tokens are merged into one boundary.
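The two attention constraints in the second bullet can be made concrete with boolean masks: a causal (lower-triangular) mask for the encoder self-attention over speech frames, and a wait-k mask for the encoder-decoder attention over source segments. The sketch below is a minimal illustration of this masking logic under the wait-k schedule of Ma et al. (2018), not the authors' implementation; for simplicity, the cross-attention mask is written at the segment level rather than over the frames inside each segment.

```python
import torch

def causal_frame_mask(num_frames: int) -> torch.Tensor:
    """Boolean self-attention mask for the speech encoder:
    frame i may attend only to frames <= i (True = allowed)."""
    return torch.tril(torch.ones(num_frames, num_frames)).bool()

def wait_k_cross_mask(num_tgt: int, num_src_segments: int, k: int) -> torch.Tensor:
    """Boolean encoder-decoder attention mask at the segment level:
    target step t (1-indexed) may attend only to the first
    min(k + t - 1, num_src_segments) source segments."""
    t = torch.arange(1, num_tgt + 1).unsqueeze(1)            # (T_tgt, 1)
    s = torch.arange(1, num_src_segments + 1).unsqueeze(0)   # (1, S_seg)
    visible = torch.clamp(k + t - 1, max=num_src_segments)   # (T_tgt, 1)
    return s <= visible                                       # (T_tgt, S_seg)

# Example: 3 speech frames for the encoder; 4 target steps, 5 segments, k = 2.
print(causal_frame_mask(3))
print(wait_k_cross_mask(num_tgt=4, num_src_segments=5, k=2))
```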


Figure 3: Details of attention-level knowledge distillation.

4 Training of SimulSpeech

The training of the SimulSpeech model is more difficult than that of an NMT model or an ASR model, since SimulSpeech involves multiple modalities (i.e., speech and text) and multiple languages. In this section, we discuss how to train the SimulSpeech model. As shown in Figure 2b, we introduce the CTC loss for the training of the speech segmenter, and attention-level and data-level knowledge distillation for the training of the overall SimulSpeech model. In SimulSpeech training, the training data are provided in the format of (source speech, source text, target text) tuples.

4.1 Training Segmenter with CTC Loss

In SimulSpeech, the speech segmenter is used to detect word boundaries, and the detected boundaries determine when to stop listening and switch to translation, which is critical for the performance of simultaneous translation. As it is hard to find frame-level labels to guide the output of the softmax linear layer in the speech segmenter, we leverage the connectionist temporal classification (CTC) loss to train it. According to Equation 2, the CTC loss is formulated as

L_{ctc} = -\sum_{(x,y) \in (X \times Y^{src})} \sum_{z \in \phi(y)} P(z|x),    (3)

where (X × Y^{src}) denotes the set of source speech and source text sequence pairs, and φ(y) denotes the set of CTC paths for y.

During inference, we simply use best path decoding (Graves et al., 2006) to decide the word boundaries without seeing subsequent speech frames, which is consistent with the masked self-attention in the speech encoder, i.e., the output of the segmenter at position i depends only on the inputs at positions preceding i.
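As an illustration of Section 4.1, the sketch below pairs a standard CTC loss over the segmenter's frame-level character logits with greedy best-path decoding that keeps only word-boundary (space) predictions and merges consecutive boundary frames. It is a hedged sketch of the described procedure rather than the authors' code; the vocabulary layout (blank index 0, space index 1) and the tensor shapes are assumptions.

```python
import torch
import torch.nn as nn

BLANK_ID, SPACE_ID = 0, 1  # assumed vocabulary layout: 0 = CTC blank, 1 = word boundary (space)

ctc_loss = nn.CTCLoss(blank=BLANK_ID, zero_infinity=True)

def segmenter_ctc_loss(log_probs, char_targets, input_lengths, target_lengths):
    """log_probs: (T_frames, batch, num_chars) log-softmax outputs of the segmenter."""
    return ctc_loss(log_probs, char_targets, input_lengths, target_lengths)

def boundary_frames(log_probs_single):
    """Greedy best-path decoding for one utterance (T_frames, num_chars):
    take the argmax character per frame, ignore blanks, and merge runs of
    word-boundary predictions into a single boundary."""
    best_path = log_probs_single.argmax(dim=-1).tolist()
    boundaries, prev = [], None
    for frame_idx, char_id in enumerate(best_path):
        if char_id == SPACE_ID and prev != SPACE_ID:
            boundaries.append(frame_idx)   # a word ends around this frame
        if char_id != BLANK_ID:
            prev = char_id
    return boundaries

# Toy usage with random logits for a 20-frame utterance and a 30-symbol charset.
log_probs = torch.randn(20, 30).log_softmax(dim=-1)
print(boundary_frames(log_probs))
```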

4.2 Attention-Level Knowledge Distillation

To better train the SimulSpeech model, we propose a novel attention-level knowledge distillation that is specially designed for speech to text translation. It transfers the knowledge from the multiplication of the attention weight matrices of the simultaneous ASR and NMT models into the attention of the SimulSpeech model. In order to obtain the attention weights of simultaneous ASR and NMT, we add auxiliary simultaneous ASR and NMT tasks which share the same encoder or decoder with the SimulSpeech model respectively, as shown in Figure 2b. The two auxiliary tasks both leverage a wait-k strategy similar to that used in the SimulSpeech model.

Denote the sequence lengths of the source speech, source text and target text as S_src, T_src and T_tgt respectively. Denote the attention weights of simultaneous ASR and NMT as A^{T_src × S_src} and A^{T_tgt × T_src} respectively. Ideally, the attention weights of SimulSpeech A^{T_tgt × S_src} should satisfy

A^{T_tgt × S_src} = A^{T_tgt × T_src} × A^{T_src × S_src}.    (4)

However, the attention weights are difficult to learn, and the attention weights of the SimulSpeech model are more difficult to learn than those of the simultaneous ASR and NMT models, since SimulSpeech is much more challenging. Therefore, we propose to distill the knowledge from the multiplication of the attention weights of the simultaneous ASR and NMT models, as shown in Figure 2b and Figure 3. We first multiply the attention matrix of simultaneous NMT by that of simultaneous ASR, and then binarize the resulting matrix with a threshold. We then match the attention weights predicted by the SimulSpeech model to the binarized attention matrix, with the loss function

L_{att-kd} = -B(A^{T_tgt × T_src} × A^{T_src × S_src}) × A^{T_tgt × S_src},    (5)

where B is the binarization operation, which sets an element of the matrix to 1 if it is above the threshold of 0.05, and to 0 otherwise.
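A minimal sketch of the attention-distillation target and loss in Equations 4 and 5: multiply the NMT and ASR attention matrices, binarize the product at the 0.05 threshold, and use the result to score the SimulSpeech attention. Reading the final product in Equation 5 as an element-wise product summed over all positions (i.e., rewarding SimulSpeech attention mass on the binarized teacher positions) is our assumption; the shapes and the threshold follow the text.

```python
import torch

THRESHOLD = 0.05  # binarization threshold from Section 4.2

def attention_kd_loss(nmt_attn, asr_attn, s2t_attn):
    """nmt_attn: (T_tgt, T_src), asr_attn: (T_src, S_src), s2t_attn: (T_tgt, S_src).

    Builds the teacher attention A^{T_tgt x S_src} = A^{T_tgt x T_src} @ A^{T_src x S_src},
    binarizes it at 0.05, and penalizes SimulSpeech attention mass that falls
    outside the binarized teacher positions (assumed element-wise reading of Eq. 5)."""
    teacher = nmt_attn @ asr_attn                 # (T_tgt, S_src)
    binarized = (teacher > THRESHOLD).float()     # B(.) in Eq. 5
    return -(binarized * s2t_attn).sum()

# Toy example with random attention distributions (rows sum to 1).
T_tgt, T_src, S_src = 4, 5, 12
nmt = torch.softmax(torch.randn(T_tgt, T_src), dim=-1)
asr = torch.softmax(torch.randn(T_src, S_src), dim=-1)
s2t = torch.softmax(torch.randn(T_tgt, S_src), dim=-1)
print(attention_kd_loss(nmt, asr, s2t))
```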

4.3 Data-Level Knowledge Distillation

Data-level knowledge distillation is widely used to help model training in various tasks and situations (Kim and Rush, 2016; Tan et al., 2019) and can boost the performance of a student model. In this work, we leverage knowledge distillation to transfer the knowledge from a full-sentence NMT teacher model to the SimulSpeech model. We first train a full-sentence NMT teacher model and then generate target text y′ given the source text y that is paired with the source speech x. Finally, we train the student SimulSpeech model with the generated target text y′, which is paired with the source speech x. The loss function is formulated as

L_{data-kd} = -\sum_{(x, y') \in (X \times Y'^{tgt})} \log P(y'|x),    (6)

where (X × Y'^{tgt}) denotes the set of speech-text sequence pairs in which the text is generated by the NMT teacher model.
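The data-level distillation above is sequence-level knowledge distillation in the spirit of Kim and Rush (2016): decode the training sources with the full-sentence NMT teacher and train the student on the synthetic targets. A hedged sketch of that data-preparation step is given below; the teacher.translate interface and the beam size are hypothetical placeholders for illustration, not the authors' API.

```python
def build_distilled_dataset(triples, teacher, beam_size=4):
    """triples: iterable of (source_speech, source_text, target_text) tuples.

    Replaces each ground-truth target with the full-sentence NMT teacher's
    translation of the source text, producing (speech, distilled_target)
    pairs for training SimulSpeech with L_data_kd (Eq. 6).
    `teacher.translate` is a hypothetical decoding interface.
    """
    distilled = []
    for speech, src_text, _tgt_text in triples:
        y_prime = teacher.translate(src_text, beam_size=beam_size)
        distilled.append((speech, y_prime))
    return distilled
```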

The total loss function to train the SimulSpeech model is

L = \lambda_1 L_{ctc} + \lambda_2 L_{att-kd} + \lambda_3 L_{data-kd},    (7)

where λ1, λ2, λ3 are hyperparameters that trade off the three losses.

5 Experiments and Results

In this section, we evaluate SimulSpeech on the MuST-C corpus (Di Gangi et al., 2019). First we describe the experimental settings and details, then we show the experiment results, and finally we conduct some analyses of our model.

5.1 Experiment Settings

Datasets  We use the MuST-C English-Spanish (En→Es) and English-German (En→De) speech translation corpora in our experiments. Both datasets contain audio clips in the source language together with the corresponding source-language transcripts and target-language translated text. The official data statistics and splits for the train/dev/test sets are shown in Table 1. For the speech data, we transform the raw audio into mel-spectrograms following Shen et al. (2018) with 50 ms frame size and 12.5 ms hop size. To simplify the model training, we remove some non-verbal annotations in the text, such as "(Laughing)" and "(Music)". All the sentences are first tokenized with the Moses tokenizer² and then segmented into subword symbols using Byte Pair Encoding (BPE) (Sennrich et al., 2016), except for the label to train the speech segmenter, where we use the character sequence of the source text. We learn the BPE merge operations across the source and target languages. We use the speech segmenter proposed in Section 3 to split the speech mel-spectrograms into segments, where each segment is regarded as a discrete token and represents a word or short phrase.

Task     Train            Dev           Test
En→Es    229703 (496h)    1316 (2.5h)   2502 (4h)
En→De    265625 (400h)    1423 (2.5h)   2641 (4h)

Table 1: The number of sentences and the duration of audio in the MuST-C dataset.

Model Configuration  We use the Transformer (Vaswani et al., 2017b) as the basic SimulSpeech model structure since it achieves state-of-the-art accuracy and has become a popular choice for recent NMT research. The model hidden size, number of heads, and numbers of encoder and decoder layers are set to 384, 4, 6 and 4 respectively. Considering that adjacent hidden states are closely related in speech tasks, we replace the feed-forward network in Transformer with a 2-layer 1D convolutional network (Gehring et al., 2017) with ReLU activation. Left padding is used in the 1D convolutional network on the target side (Ren et al., 2019) to avoid the output token seeing its subsequent tokens in the training stage. The kernel size and filter size of the 1D convolution are set to 1536 and 9 respectively. The pre-net (bottom left in Figure 2a) is a 3-layer convolutional network with left padding, whose output dimension is the same as the hidden size of the Transformer encoder. The decoder of the auxiliary ASR model and the encoder of the auxiliary NMT model, as well as the encoder and decoder of the NMT teacher model, share the same model structures described above.

Training and Inference  SimulSpeech is trained on 2 NVIDIA Tesla V100 GPUs with a total batch size of 64 sentences. We use the Adam optimizer with the default parameters (Kingma and Ba, 2014) and the learning rate schedule in Vaswani et al. (2017a). We train SimulSpeech with the auxiliary simultaneous ASR and NMT tasks by default. We set λ1, λ2, λ3 in Equation 7 to 1.0, 0.1 and 1.0 respectively, according to the validation performance. SimulSpeech is trained and tested with the same k unless otherwise stated. The translation quality is evaluated by tokenized case-sensitive BLEU (Papineni et al., 2002) with the multi-bleu.perl script³. Our code is based on tensor2tensor (Vaswani et al., 2018)⁴.

²https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl
³https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl
⁴https://github.com/tensorflow/tensor2tensor
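As a concrete but hedged reading of the model configuration above, the sketch below replaces the Transformer feed-forward sublayer with a 2-layer 1D convolutional block using ReLU and left (causal) padding on the target side. We read the reported sizes as an inner filter (channel) size of 1536 and a kernel size of 9; that reading, and the exact layer ordering, are assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvFFN(nn.Module):
    """2-layer 1D convolutional feed-forward block with ReLU and left padding,
    standing in for the Transformer FFN on the decoder (target) side.
    hidden=384 and filter_size=1536 follow the reported configuration;
    kernel_size=9 is our reading of the text."""
    def __init__(self, hidden=384, filter_size=1536, kernel_size=9):
        super().__init__()
        self.kernel_size = kernel_size
        self.conv1 = nn.Conv1d(hidden, filter_size, kernel_size)
        self.conv2 = nn.Conv1d(filter_size, hidden, kernel_size)

    def forward(self, x):              # x: (batch, time, hidden)
        y = x.transpose(1, 2)           # -> (batch, hidden, time)
        # Left padding keeps the convolution causal: position t never sees t+1, ...
        y = F.pad(y, (self.kernel_size - 1, 0))
        y = F.relu(self.conv1(y))
        y = F.pad(y, (self.kernel_size - 1, 0))
        y = self.conv2(y)
        return y.transpose(1, 2)        # back to (batch, time, hidden)

# Toy usage: a batch of 2 sequences, 10 steps, hidden size 384.
out = CausalConvFFN()(torch.randn(2, 10, 384))
print(out.shape)  # torch.Size([2, 10, 384])
```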

The Metric of Translation Delay  Many previous works focus on proposing metrics of translation delay for simultaneous text to text translation, such as average proportion (AP) (Cho and Esipova, 2016) and average lagging (AL) (Ma et al., 2018). The former calculates the mean absolute delay cost incurred by each target token, while the latter measures the degree to which the system is out of sync with the speaker. In this work, we extend the AP and AL metrics, which are originally calculated on word sequences, to speech sequences for the simultaneous speech to text translation task. Our extended AP is defined as follows:

AP(x, y) = \frac{1}{|x|_{time} |y|} \sum_{i=1}^{|y|} t(i),    (8)

where x and y are the source speech and target text, |x|_{time} is the total time duration of the source speech, |y| is the length of the target text, and t(i) is the real-time delay in terms of source speech when generating the i-th word in the target sequence, i.e., the duration of source speech listened to by the model before writing the i-th target token. Our extended AL is defined as follows:

AL(x, y) = \frac{1}{\tau(|x|^{seg})} \sum_{i=1}^{\tau(|x|^{seg})} \left( g(i) - \frac{i-1}{r} \right),    (9)

where |x|^{seg} is the number of speech segments, g(i) is the delay at step i, i.e., the number of source segments listened to by the model before writing the i-th target token, and \tau(|x|^{seg}) denotes the earliest timestep where the model has consumed the entire source sequence:

\tau(|x|^{seg}) = \arg\min_{t} \left( g(t) = |x|^{seg} \right),    (10)

and r = |y| / |x|^{seg} is the length ratio between the target and source sequences.

5.2 Experiment Results

Translation Accuracy  First, we evaluate the performance of the SimulSpeech model under different k. The BLEU scores for En-Es and En-De are shown in Table 2. We can see that the performance of our model does not drop a lot when k is small, compared to full-sentence translation (training with k=inf).

k             1      3      5      7      9      inf
En-Es         15.02  19.92  21.58  22.42  22.49  22.72
En-Es (FS)    3.25   7.18   10.52  13.33  15.32  22.72
En-De         10.73  15.52  16.90  17.46  17.87  18.29
En-De (FS)    2.58   6.89   9.65   11.70  13.15  18.29

Table 2: The BLEU scores of SimulSpeech on the test sets of the MuST-C En→Es and En→De datasets. FS denotes training with k=inf.

Translation Delay  We plot the translation quality (in terms of BLEU score) against the delay metrics (AP and AL) of our SimulSpeech model and a test-time wait-k model (trained with full-sentence translation but tested with wait-k, denoted as "train-full test-k") in Figures 4a and 4b. We can see that the BLEU scores increase as k increases, at the sacrifice of translation delay. The accuracy of the SimulSpeech model is always better than that of test-time wait-k, which demonstrates the effectiveness of SimulSpeech.

Figure 4: The translation quality against the latency metrics (AP and AL) on the En→Es dataset. (a) The translation quality against the latency in terms of AP. (b) The translation quality against the latency in terms of AL.
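The extended AP and AL metrics in Equations 8-10 can be computed directly from the per-token delays recorded during decoding. The sketch below is a straightforward transcription of those definitions and not the authors' evaluation script; t_i holds the seconds of source audio consumed before emitting each target token, and g_i the number of source segments consumed.

```python
def average_proportion(t_i, total_audio_seconds):
    """Eq. 8: AP = (1 / (|x|_time * |y|)) * sum_i t(i),
    where t_i[i] is the source-audio duration heard before writing token i."""
    return sum(t_i) / (total_audio_seconds * len(t_i))

def average_lagging(g_i, num_src_segments, tgt_len):
    """Eqs. 9-10: AL averaged up to tau, the first step whose delay covers the
    whole source; g_i[i] is the number of source segments heard before writing
    target token i (1-indexed in the formulas, list order here)."""
    r = tgt_len / num_src_segments
    # tau: earliest step t with g(t) reaching |x|^seg (fall back to the last step)
    tau = next((i + 1 for i, g in enumerate(g_i) if g >= num_src_segments), len(g_i))
    return sum(g_i[i - 1] - (i - 1) / r for i in range(1, tau + 1)) / tau

# Toy example: 5 target tokens over 4 source segments, wait-2-like delays.
print(average_proportion([1.0, 1.5, 2.0, 2.5, 3.0], total_audio_seconds=3.0))
print(average_lagging([2, 3, 4, 4, 4], num_src_segments=4, tgt_len=5))
```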
Comparison with Cascaded Models  Finally, we implement the cascaded simultaneous speech to text translation pipeline and compare the accuracy of SimulSpeech with it under the same translation delay by using the same k.

En (source)                   the first on here is the classic apple.
Es (target)                   la primera aquí es la clásica manzana.
ASR (wait-1)                  the first on here is the class sake apple.
ASR (wait-1) + NMT (wait-3)   pero la primera vez es una manzana motivo de clase.
SimulSpeech (wait-3)          la primera es una manzana clásica.

Figure 5: An example from the test set of the En→Es dataset, which demonstrates that SimulSpeech outperforms cascaded models under the same delay (the delay of wait-1 for ASR plus wait-3 for NMT is equal to the delay of wait-3 for SimulSpeech). In this case, the wait-1 ASR model in the cascaded method does not recognize the word "classic" correctly, which results in a wrong translation from the NMT model.

For the cascaded method, we try all possible combinations of wait-k ASR and wait-k NMT models and report the best one. The accuracy of the two methods is shown in Table 3. It can be seen that 1) SimulSpeech outperforms the cascaded method when k < 9, which covers most simultaneous translation scenarios, and 2) the cascaded model only outperforms SimulSpeech at larger k⁵. These results demonstrate the advantages of SimulSpeech specifically for the simultaneous translation scenario. We further plot the BLEU scores of the two methods in Figure 6. It can be seen that SimulSpeech with wait-3 can achieve the same BLEU score as the cascaded method under wait-5. To sum up, SimulSpeech achieves higher translation accuracy than the cascaded method under the same translation delay, and achieves lower translation delay at the same translation accuracy.

Model          k=1    k=3    k=5    k=7    k=9    k=inf
Cascaded       12.77  16.91  19.66  21.05  23.43  25.60
SimulSpeech    15.02  19.92  21.58  22.42  22.49  22.72

Table 3: The comparison between the two-stage cascaded method and SimulSpeech under different wait-k on the En→Es dataset.

5.3 Ablation Study

We evaluate the effectiveness of each component and show the results in Table 4. From the BLEU scores in Row 2 and Row 3, it can be seen that the translation accuracy under different wait-k can be boosted by adding the auxiliary tasks to the naive simultaneous speech to text translation model (denoted as Naive S2T).

Model                      k=1    k=5    k=9
Naive S2T                  9.02   14.90  15.90
+Aux                       12.98  19.41  20.39
+Aux+DataKD                13.77  20.98  21.52
+Aux+AttnKD                13.74  20.64  20.90
+Aux+DataKD+AttnKD
(SimulSpeech)              15.02  21.58  22.49

Table 4: The ablation studies on the En→Es dataset. The baseline model (Naive S2T) is the naive simultaneous speech to text translation model with the wait-k policy. We gradually add our techniques to it to evaluate their effectiveness.

The Effectiveness of Data-Level Knowledge Distillation  We further evaluate the effectiveness of data-level knowledge distillation (Row 4 vs. Row 3). The result shows that data-level knowledge distillation achieves a large accuracy improvement.

The Effectiveness of Attention-Level Knowledge Distillation  We further evaluate the effectiveness of attention-level knowledge distillation. We add attention-level knowledge distillation (Row 5 vs. Row 3) to the model and find that the accuracy is also improved. As a result, we combine all the techniques together (Row 6, SimulSpeech) and obtain the best BLEU scores across different wait-k, which demonstrates the effectiveness of all the techniques we propose for the training of SimulSpeech.

The Effectiveness of the Speech Segmenter  To evaluate the effectiveness of our segmenter, we compare the accuracy of the SimulSpeech model using our segmentation method against that using the ground-truth segmentation, where we extract the segmentation from the ground-truth speech and the corresponding transcripts using an alignment tool⁶ and regard it as the ground-truth segmentation. As shown in Table 5, the BLEU scores of SimulSpeech using our segmentation method are close to those using the ground-truth segmentation⁷, which demonstrates the effectiveness of our speech segmenter.
⁵In a typical simultaneous translation scenario, k should be as small as possible, otherwise a large delay is incurred.
⁶https://github.com/lowerquality/gentle
⁷Note that we cannot obtain the ground-truth segmentation during inference. Therefore, the accuracy gap in Table 5 is reasonable.

Method         k=1    k=3    k=5    k=7    k=9
Ground-Truth   18.04  22.61  23.76  23.36  23.14
Our Method     15.02  19.92  21.58  22.42  22.49

Table 5: The BLEU scores of SimulSpeech on En→Es using our speech segmentation method and the ground-truth segmentation.

Case Analysis  We further conduct case studies to demonstrate the advantages of our end-to-end translation over the previous cascaded models. As shown in Figure 5, the simultaneous ASR model makes a mistake which further affects the accuracy of the downstream simultaneous NMT model, while SimulSpeech does not suffer from this problem. As a result, SimulSpeech outperforms the cascaded models.

Figure 6: The comparison between SimulSpeech and the cascaded method in terms of translation accuracy and delay on the En→Es dataset.

6 Related Works

6.1 Speech to Text Translation

Speech to text translation has been a hot research topic in the field of artificial intelligence recently (Bérard et al., 2016; Weiss et al., 2017; Liu et al., 2019). Early works on speech to text translation rely on a two-stage method with cascaded ASR and NMT models. Bérard et al. (2016) proposed an end-to-end speech to text translation system which does not leverage source-language text during training or inference. Weiss et al. (2017) further leveraged an auxiliary ASR model with a shared encoder with the speech to text model, regarding it as a multi-task problem. Vila et al. (2018) applied the Transformer (Vaswani et al., 2017b) architecture to this task and achieved good accuracy. Bansal et al. (2018) explored speech to text translation in the low-resource setting where both data and computation are limited. Sperber et al. (2019) proposed a novel attention-passing model for end-to-end speech to text translation and achieved comparable accuracy to the cascaded models.

6.2 Simultaneous Translation

Simultaneous translation aims to translate sentences before they are finished (Fügen et al., 2007; Oda et al., 2014; Dalvi et al., 2018). A traditional speech to text simultaneous translation system usually first recognizes and segments the incoming speech stream based on an automatic speech recognition (ASR) system, and then translates it into text in the target language. Most of the previous works focus on the simultaneous machine translation part (Zheng et al., 2019): Gu et al. (2016) proposed a framework for simultaneous NMT in which an agent learns to make decisions on when to translate from interaction with a pre-trained NMT environment. Ma et al. (2018) introduced a very simple but effective wait-k strategy for simultaneous NMT based on a prefix-to-prefix framework, which predicts the next target word conditioned on the partial source sequence the model has seen, instead of the full source sequence. The wait-k strategy will wait for the first k source words and then start to generate a target word. After that, once receiving a new source word, the decoder generates a new target word, until there are no more source words, after which the translation degrades to full-sentence translation.

7 Conclusion

In this work, we developed SimulSpeech, an end-to-end simultaneous speech to text translation system that directly translates source speech into target text concurrently. SimulSpeech consists of a speech encoder, a speech segmenter, and a text decoder with a wait-k strategy for simultaneous translation. We further introduced several techniques including data-level and attention-level knowledge distillation to boost the accuracy of SimulSpeech. Experiments on the MuST-C spoken language translation datasets demonstrate the advantages of SimulSpeech in terms of both translation accuracy and delay.

For future work, we will design more flexible policies to achieve better translation quality and lower delay in simultaneous spoken language translation. We will also investigate simultaneous translation from speech in a source language to speech in a target language.

Acknowledgments

This work was supported in part by the National Key R&D Program of China (Grant No. 2018AAA0100603), Zhejiang Natural Science Foundation (LR19F020006), National Natural Science Foundation of China (Grant No. 61836002), National Natural Science Foundation of China (Grant No. U1611461), and National Natural Science Foundation of China (Grant No. 61751209). This work was also partially funded by Microsoft Research Asia.

References

Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, and Sharon Goldwater. 2018. Low-resource speech-to-text translation. arXiv preprint arXiv:1803.09164.

Alexandre Bérard, Olivier Pietquin, Christophe Servan, and Laurent Besacier. 2016. Listen and translate: A proof of concept for end-to-end speech-to-text translation. arXiv preprint arXiv:1612.01744.

Kyunghyun Cho and Masha Esipova. 2016. Can neural machine translation do simultaneous translation? arXiv preprint arXiv:1606.02012.

Fahim Dalvi, Nadir Durrani, Hassan Sajjad, and Stephan Vogel. 2018. Incremental decoding and training methods for simultaneous translation in neural machine translation. arXiv preprint arXiv:1806.03661.

Mattia Antonino Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019. MuST-C: a multilingual speech translation corpus. In NAACL-HLT, Minneapolis, MN, USA.

Christian Fügen, Alex Waibel, and Muntsin Kolss. 2007. Simultaneous translation of lectures and speeches. Machine Translation, 21(4):209–252.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In ICML, pages 1243–1252.

Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In ICML, pages 369–376.

Jiatao Gu, Graham Neubig, Kyunghyun Cho, and Victor O.K. Li. 2016. Learning to translate in real-time with neural machine translation. arXiv preprint arXiv:1610.00388.

Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Yuchen Liu, Hao Xiong, Zhongjun He, Jiajun Zhang, Hua Wu, Haifeng Wang, and Chengqing Zong. 2019. End-to-end speech translation with knowledge distillation. arXiv preprint arXiv:1904.08075.

Mingbo Ma, Liang Huang, Hao Xiong, Kaibo Liu, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, and Haifeng Wang. 2018. STACL: Simultaneous translation with integrated anticipation and controllable latency. arXiv preprint arXiv:1810.08398.

Yusuke Oda, Graham Neubig, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2014. Optimizing segmentation strategies for simultaneous speech translation. In ACL, pages 551–556.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL, pages 311–318.

Kanishka Rao, Haşim Sak, and Rohit Prabhavalkar. 2017. Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer. In ASRU, pages 193–199. IEEE.

Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2019. FastSpeech: Fast, robust and controllable text to speech. arXiv preprint arXiv:1905.09263.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In ACL.

Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al. 2018. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In ICASSP, pages 4779–4783. IEEE.

Matthias Sperber, Graham Neubig, Jan Niehues, and Alex Waibel. 2019. Attention-passing models for robust and data-efficient end-to-end speech translation. TACL, 7:313–325.

Xu Tan, Yi Ren, Di He, Tao Qin, Zhou Zhao, and Tie-Yan Liu. 2019. Multilingual neural machine translation with knowledge distillation. arXiv preprint arXiv:1902.10461.

Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan N. Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob Uszkoreit. 2018. Tensor2Tensor for neural machine translation. CoRR, abs/1803.07416.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017a. Attention is all you need. In NIPS, pages 6000–6010.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017b. Attention is all you need. In NIPS, pages 5998–6008.

Laura Cross Vila, Carlos Escolano, José A. R. Fonollosa, and Marta R. Costa-jussà. 2018. End-to-end speech translation with the Transformer. In IberSPEECH, pages 60–63.

Ron J. Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, and Zhifeng Chen. 2017. Sequence-to-sequence models can directly translate foreign speech. arXiv preprint arXiv:1703.08581.

Baigong Zheng, Renjie Zheng, Mingbo Ma, and Liang Huang. 2019. Simultaneous translation with flexible policy via restricted imitation learning. arXiv preprint arXiv:1906.01135.

Chunting Zhou, Graham Neubig, and Jiatao Gu. 2019. Understanding knowledge distillation in non-autoregressive machine translation. arXiv preprint arXiv:1911.02727.