Worse WER, but Better BLEU? Leveraging Embedding as Intermediate in Multitask End-to-End Speech Translation

Shun-Po Chuang¹, Tzu-Wei Sung², Alexander H. Liu¹, and Hung-yi Lee¹

¹National Taiwan University, Taiwan   ²University of California San Diego, USA
{f04942141, b03902042, r07922013, hungyilee}@ntu.edu.tw

Abstract

Speech translation (ST) aims to learn transformations from speech in the source language to text in the target language. Previous works show that multitask learning improves ST performance, in which the recognition decoder generates the text of the source language, and the translation decoder obtains the final translations based on the output of the recognition decoder. Because whether the output of the recognition decoder has the correct semantics is more critical than its accuracy, we propose to improve the multitask ST model by utilizing word embedding as the intermediate.

1 Introduction

Speech translation (ST) has recently received increasing attention from the machine translation (MT) community. To learn the transformation between speech in the source language and text in the target language, conventional models pipeline an automatic speech recognition (ASR) model and a text-to-text MT model (Bérard et al., 2016). However, such pipeline systems suffer from error propagation. Previous works show that deep end-to-end models can outperform conventional pipeline systems given sufficient training data (Weiss et al., 2017; Inaguma et al., 2019; Sperber et al., 2019). Nevertheless, well-annotated bilingual data is expensive and hard to collect (Bansal et al., 2018a,b; Duong et al., 2016). Multitask learning plays an essential role in leveraging a large amount of monolingual data to improve representation learning in ST. Multitask ST models have two jointly learned decoding parts, namely the recognition and translation parts. The recognition part first decodes the speech of the source language into the text of the source language; based on the output of the recognition part, the translation part then generates the text in the target language. Variant multitask models have been explored (Anastasopoulos and Chiang, 2018), showing improvement in the low-resource scenario.

Although applying the text of the source language as the intermediate information in multitask end-to-end ST empirically yields improvement, we question whether this is the optimal solution. Even if the recognition part does not correctly transcribe the input speech into text, the final translation can still be correct as long as the output of the recognition part preserves sufficient semantic information for translation. Therefore, we explore leveraging word embedding as the intermediate level instead of text.

In this paper, we apply pre-trained word embedding as the intermediate level in the multitask ST model. We propose to constrain the hidden states of the decoder of the recognition part to be close to the pre-trained word embedding. Prior works on word embedding regression show improved results on MT (Jauregi Unanue et al., 2019; Kumar and Tsvetkov, 2018). Experimental results show that the proposed approach improves the ST model. Further analysis also shows that the constrained hidden states are approximately isospectral to the word embedding space, indicating that the decoder achieves speech-to-semantic mappings.

2 Multitask End-to-End ST Model

Our method is based on multitask learning for ST (Anastasopoulos and Chiang, 2018), including speech recognition in the source language and translation in the target language, as shown in Fig. 1(a). The input audio feature sequence is first encoded into the encoder hidden state sequence h = h_1, h_2, ..., h_T with length T by a pyramid encoder (Chan et al., 2015). To perform speech recognition in the source language, an attention mechanism and a decoder are employed to produce the source decoder state sequence ŝ = ŝ_1, ŝ_2, ..., ŝ_M, where M is the number of decoding steps in the source language. For each decoding step m, the probability P(ŷ_m) of predicting the token ŷ_m in the source language vocabulary is computed from the corresponding decoder state ŝ_m.
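For concreteness, the following is a minimal PyTorch sketch of a pyramidal BiLSTM encoder in the spirit of Chan et al. (2015). The layer sizes are illustrative assumptions, and the paper's actual stack also uses convolution layers (cf. Sec. 4.1); the point here is only the frame-concatenation trick that shrinks the time axis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidalEncoder(nn.Module):
    """Pyramidal BiLSTM: after each layer, adjacent frames are
    concatenated, halving the time axis; three halvings give the
    8x down-sampling mentioned in Sec. 4.1."""

    def __init__(self, input_dim=80, hidden_dim=256, num_layers=3):
        super().__init__()
        dims = [input_dim] + [hidden_dim * 4] * (num_layers - 1)  # 2 dirs x 2 frames
        self.layers = nn.ModuleList(
            nn.LSTM(d, hidden_dim, bidirectional=True, batch_first=True)
            for d in dims)

    def forward(self, x):                        # x: (batch, T, input_dim)
        for lstm in self.layers:
            x, _ = lstm(x)                       # (batch, T, 2 * hidden_dim)
            if x.size(1) % 2:                    # pad time axis to an even length
                x = F.pad(x, (0, 0, 0, 1))
            b, t, d = x.shape
            x = x.reshape(b, t // 2, d * 2)      # concatenate adjacent frames
        return x                                 # (batch, T / 8, 4 * hidden_dim)

h = PyramidalEncoder()(torch.randn(2, 240, 80))  # -> torch.Size([2, 30, 1024])
```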

To perform speech translation in the target language, both the source language decoder state sequence ŝ and the encoder state sequence h are attended and fed to the target language decoder as input. The hidden state of the target language decoder can then be used to derive the probability P(y_q) of predicting token y_q in the target language vocabulary at every decoding step q.

Given the ground-truth sequence ŷ = ŷ_1, ŷ_2, ..., ŷ_M in the source language and y = y_1, y_2, ..., y_Q with length Q in the target language, multitask ST can be trained by maximizing the log-likelihood in both domains. Formally, the objective function of multitask ST can be written as:

$$L_{ST} = \alpha L_{src} + \beta L_{tgt} = -\frac{\alpha}{M}\sum_{m}\log P(\hat{y}_m) - \frac{\beta}{Q}\sum_{q}\log P(y_q), \qquad (1)$$

where α and β are trade-off factors balancing the two tasks.
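As a sketch, Eq. (1) amounts to two length-normalized cross-entropy terms. The snippet below assumes the per-step logits are already computed; the values of alpha and beta are placeholders, since the paper does not report its settings here.

```python
import torch
import torch.nn.functional as F

def multitask_loss(src_logits, src_tokens, tgt_logits, tgt_tokens,
                   alpha=1.0, beta=1.0):
    """Eq. (1): L_ST = alpha * L_src + beta * L_tgt, where each term is a
    negative log-likelihood averaged over its own decoding steps."""
    l_src = F.cross_entropy(src_logits, src_tokens)  # (1/M) * sum_m -log P(y_m)
    l_tgt = F.cross_entropy(tgt_logits, tgt_tokens)  # (1/Q) * sum_q -log P(y_q)
    return alpha * l_src + beta * l_tgt

# src_logits: (M, |V_src|), tgt_logits: (Q, |V_tgt|), token ids as 1-D LongTensors
loss = multitask_loss(torch.randn(12, 8000), torch.randint(8000, (12,)),
                      torch.randn(15, 10000), torch.randint(10000, (15,)))
```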

Figure 1: (a) Multitask ST model. Dotted arrows indicate steps in the recognition part; solid arrows indicate steps in the translation part. (b) Directly learning word embedding via cosine distance. (c) Learning word embedding via the cosine softmax function. Both (b) and (c) replace the recognition part in (a).

3 Proposed Methods

We propose two ways to help the multitask end-to-end ST model capture the semantic relation between word tokens by leveraging the source language word embedding Ê = {ê_1, ê_2, ..., ê_|V|} as the intermediate level, where V is the vocabulary set of the recognition task and ê_v ∈ R^D is the embedding vector with dimension D for any word v ∈ V. We choose to reinforce the source language decoder state (embedding) ŝ, since it is later used in the translation task. To be more specific, we argue that the embedding generated by the source language decoder should be semantically correct in order to benefit the translation task. Given the pre-trained source language word embedding Ê, we propose to constrain the source decoder state ŝ_m at step m to be close to its corresponding word embedding ê_{ŷ_m}, with the two approaches detailed in the following sections.

3.1 Directly Learn Word Embedding

Since semantically related words are close in terms of cosine distance (Mikolov et al., 2018), a simple idea is to minimize the cosine distance (CD) between the source language decoder hidden state ŝ_m and the corresponding word embedding ê_{ŷ_m} at every decoding step m:

$$L_{CD} = \sum_{m}\Big[1 - \cos\big(f_\theta(\hat{s}_m), \hat{e}_{\hat{y}_m}\big)\Big] = \sum_{m}\Big[1 - \frac{f_\theta(\hat{s}_m)\cdot \hat{e}_{\hat{y}_m}}{\|f_\theta(\hat{s}_m)\|\,\|\hat{e}_{\hat{y}_m}\|}\Big], \qquad (2)$$

where f_θ(·) is a learnable linear projection that matches the dimensionality of the decoder state to that of the word embedding. With this design, the network architecture of the target language decoder is not limited by the dimension of the word embedding. Fig. 1(b) illustrates this approach. By replacing L_src in Eq. (1) with L_CD, semantic learning from word embedding for source language recognition can be achieved.
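A minimal PyTorch sketch of the CD objective in Eq. (2): f_θ is a single linear layer, and the 300-dimensional embedding size matches the FastText vectors used in Sec. 4.1; the other dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineDistanceLoss(nn.Module):
    """Eq. (2): pull the projected decoder state f_theta(s_m) toward the
    pre-trained embedding e_{y_m} of the reference token at each step."""

    def __init__(self, decoder_dim=1024, embed_dim=300):
        super().__init__()
        self.f_theta = nn.Linear(decoder_dim, embed_dim)  # learnable projection

    def forward(self, dec_states, ref_embeddings):
        # dec_states: (M, decoder_dim); ref_embeddings: (M, embed_dim), where
        # row m holds the pre-trained vector of the m-th reference token.
        sim = F.cosine_similarity(self.f_theta(dec_states), ref_embeddings, dim=-1)
        return (1.0 - sim).sum()  # summed over decoding steps, as in Eq. (2)

loss = CosineDistanceLoss()(torch.randn(12, 1024), torch.randn(12, 300))
```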

3.2 Learn Word Embedding via Probability

Ideally, using word embedding as the learning target by minimizing CD can effectively train the decoder to model the semantic relations in the embedding space. However, such an approach suffers from the hubness problem (Faruqui et al., 2016) of word embedding in practice (as discussed later in Sec. 4.5). To address this problem, we introduce the cosine softmax (CS) function (Liu et al., 2017a,b) to learn speech-to-semantic embedding mappings. Given the decoder hidden state ŝ_m and the word embedding Ê, the probability of the target word ŷ_m is defined as

$$P_{CS}(\hat{y}_m) = \frac{\exp\big(\cos(f_\theta(\hat{s}_m), \hat{e}_{\hat{y}_m})/\tau\big)}{\sum_{\hat{e}_v \in \hat{E}} \exp\big(\cos(f_\theta(\hat{s}_m), \hat{e}_v)/\tau\big)}, \qquad (3)$$

where cos(·) and f_θ(·) are as in Eq. (2), and τ is the temperature of the softmax function. Note that since the temperature τ re-scales the cosine similarity, the hubness problem can be mitigated by selecting a proper value for τ. Fig. 1(c) illustrates the approach. With the probability derived from the cosine softmax in Eq. (3), the objective function for the source language decoder can be written as

$$L_{CS} = -\sum_{m}\log P_{CS}(\hat{y}_m). \qquad (4)$$

By replacing L_src in Eq. (1) with L_CS, the decoder hidden state sequence ŝ is forced to contain the semantic information provided by the word embedding.
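The CS objective in Eqs. (3)-(4) is a cross-entropy over temperature-scaled cosine similarities. A sketch, assuming the projected states and the pre-trained embedding table are given; τ = 0.05 is an illustrative value, as the paper does not state the one it used.

```python
import torch
import torch.nn.functional as F

def cosine_softmax_loss(proj_states, embed_table, targets, tau=0.05):
    """Eqs. (3)-(4): cross-entropy over temperature-scaled cosine
    similarities between projected decoder states and every entry of
    the pre-trained embedding table."""
    # proj_states: (M, D), i.e. f_theta(s_m); embed_table: (|V|, D); targets: (M,)
    s = F.normalize(proj_states, dim=-1)   # unit-norm rows ...
    e = F.normalize(embed_table, dim=-1)   # ... so the dot product is a cosine
    logits = s @ e.t() / tau               # (M, |V|)
    return F.cross_entropy(logits, targets, reduction='sum')  # Eq. (4)

loss = cosine_softmax_loss(torch.randn(12, 300), torch.randn(8000, 300),
                           torch.randint(8000, (12,)))
```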
4 Experiments

4.1 Experimental Setup

We used the Fisher Spanish corpus (Graff et al., 2010) to perform Spanish speech to English text translation. We followed previous work (Inaguma et al., 2019) for the pre-processing steps, and the 40/160-hour train sets and the standard dev and test sets are used in the experiments. Byte-pair encoding (BPE) (Kudo and Richardson, 2018) was applied to the target transcriptions to form 10K subwords as the target of the translation part. Spanish word embeddings were obtained from FastText pre-trained on Wikipedia (Bojanowski et al., 2016), and 8,000 Spanish words were used in the recognition part.

The encoder is a 3-layer 512-dimensional bidirectional LSTM with additional convolution layers, yielding 8× down-sampling in time. The decoders are 1024-dimensional LSTMs; we used one layer in the recognition part and two layers in the translation part. The models were optimized using Adadelta with a weight decay rate of 10⁻⁶. Scheduled sampling with probability 0.8 was applied to the decoder in the translation part. Experiments ran for 1.5M steps, and models were selected by the highest BLEU against the four reference transcriptions per utterance in the dev set.

4.2 Speech Translation Evaluation

Baseline: We first built the single-task end-to-end model (SE) as a baseline for multitask learning, which obtained 34.50/34.51 BLEU on the dev and test sets respectively, comparable to Salesky et al. (2019). The multitask end-to-end model (ME) described in Sec. 2 is another baseline. With multitask learning applied, ME outperforms SE in all conditions.

Table 1: BLEU scores for models trained on different amounts of data.

            (a) 160 hours       (b) 40 hours
            dev      test       dev      test
    SE      34.50    34.51      17.41    15.44
    ME      35.35    35.49      23.30    20.40
    CD      33.06    33.65      23.53    20.87
    CS      35.84    36.32      23.54    21.72

High-resource: Column (a) in Table 1 shows the results when training on 160 hours of data. CD and CS denote the proposed methods from Sec. 3.1 and 3.2 respectively. We obtained mixed results when further applying pre-trained word embedding on top of ME: CD degraded performance, falling even below SE, while CS performed the best. These results show that directly learning word embedding via cosine distance is not a good strategy in the high-resource setting, whereas integrating similarity with the cosine softmax function significantly improves performance. We leave the discussion to Sec. 4.5.

Low-resource: We also experimented with the 40-hour training subset, shown in column (b) of Table 1. ME, CD, and CS all clearly outperform SE in the low-resource setting. Although CD degrades performance in the high-resource setting, it shows improvements in the low-resource scenario. CS consistently outperforms ME and CD across data sizes, showing it is robust in improving the ST task.

4.3 Analysis of Recognition Decoder Output

In this section, we analyze the hidden states produced by the different methods. For each word v in the corpus, we denote its word embedding ê_v as the pre-trained embedding and e_v as the predicted embedding. Note that because a single word v can be mapped from multiple audio segments, we take the average of all its predicted embeddings. We selected the top 500 most frequent words in the whole Fisher Spanish corpus, and tested on the test-set sentences containing only these words.

Eigenvector Similarity: To verify that our proposed methods constrain the hidden states to the word embedding space, we computed the eigenvector similarity between the predicted and pre-trained embedding spaces. The metric derives from Laplacian eigenvalues and represents how similar two spaces are: the lower the value, the more approximately isospectral the two spaces. Previous works showed that the metric correlates with the performance of the translation task (Søgaard et al., 2018; Chung et al., 2019). As shown in Table 2, the predicted embedding is more similar to the pre-trained embedding when models are trained on sufficient data (160 vs. 40 hours). CD is the most similar case among the three, and ME the most different. These results indicate that our proposals constrain the hidden states to the pre-trained embedding space.
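For reference, here is a NumPy sketch of an eigenvector similarity metric in the spirit of Søgaard et al. (2018): build a nearest-neighbour graph over each embedding space, take the Laplacian eigenvalues, and sum squared differences over the leading eigenvalues. The graph construction details (k-NN graph, n_neighbors = 10, the 90% energy cut-off) are our assumptions, not the paper's code.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def eigenvector_similarity(emb_a, emb_b, n_neighbors=10):
    """Lower value = the two embedding spaces are closer to isospectral."""
    def laplacian_eigenvalues(emb):
        adj = kneighbors_graph(emb, n_neighbors, mode='connectivity')
        adj = ((adj + adj.T) > 0).astype(float).toarray()  # symmetrize k-NN graph
        lap = np.diag(adj.sum(axis=1)) - adj               # unnormalized Laplacian
        return np.sort(np.linalg.eigvalsh(lap))

    ev_a = laplacian_eigenvalues(emb_a)
    ev_b = laplacian_eigenvalues(emb_b)

    def k_at_90_percent(ev):  # smallest k whose eigenvalues hold 90% of the mass
        return int(np.searchsorted(np.cumsum(ev) / ev.sum(), 0.9)) + 1

    k = min(k_at_90_percent(ev_a), k_at_90_percent(ev_b))
    return float(np.sum((ev_a[:k] - ev_b[:k]) ** 2))
```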

Table 2: Eigenvector similarity (lower is more similar).

            160 hours           40 hours
            dev      test       dev      test
    ME      16.50    18.58      13.80    15.09
    CD       2.60     3.44       3.95     3.63
    CS      11.55    13.76       8.62     9.80

Table 4: Word error rate (%) for models trained on different amounts of data.

            160 hours           40 hours
            dev      test       dev      test
    ME      43.13    38.57      53.42    54.70
    CS      50.15    44.43      57.63    57.21

Semantic Alignment: To further verify whether the predicted embedding is semantically aligned with the pre-trained embedding, we applied the Procrustes alignment method (Conneau et al., 2017; Lample et al., 2017) to learn a mapping between the predicted and pre-trained embeddings. The top 50 most frequent words were selected as the training dictionary, and we evaluated on the remaining 450 words with the cross-domain similarity local scaling (CSLS) method. Precision@k (P@k, k = 1, 5) is reported as the measurement. As shown in Table 3, CD performed the best and ME the worst. This experiment reinforces that our proposals can constrain the hidden states to a structure similar to the word embedding space.

Table 3: Precision@k of semantic alignment on the test set.

            160 hours           40 hours
            P@1      P@5        P@1      P@5
    ME       1.85     6.29       1.11     9.62
    CD      61.48    77.40      56.30    69.25
    CS      17.78    35.19      10.37    25.19
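A NumPy sketch of this alignment test: orthogonal Procrustes on the 50-word seed dictionary, then CSLS retrieval (Conneau et al., 2017) on the held-out words. This is a plain dense re-implementation under our own assumptions, not the original MUSE toolkit.

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal Procrustes: find W minimizing ||XW - Y||_F.
    X, Y: (n, D) paired rows, e.g. predicted vs. pre-trained embeddings."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def csls_precision_at_k(X, Y, W, k=1, n_nbrs=10):
    """Precision@k under CSLS retrieval: row i of X @ W should retrieve
    row i of Y.  CSLS(x, y) = 2 cos(x, y) - r_tgt(x) - r_src(y)."""
    sim = (X @ W) @ Y.T / (
        np.linalg.norm(X @ W, axis=1)[:, None] * np.linalg.norm(Y, axis=1)[None, :])
    # mean cosine to the n_nbrs nearest neighbours, in each direction
    r_src = np.sort(sim, axis=1)[:, -n_nbrs:].mean(axis=1, keepdims=True)
    r_tgt = np.sort(sim, axis=0)[-n_nbrs:, :].mean(axis=0, keepdims=True)
    csls = 2 * sim - r_src - r_tgt
    top_k = np.argsort(-csls, axis=1)[:, :k]
    return float(np.mean([i in top_k[i] for i in range(len(X))]))
```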
4.4 Speech Recognition Evaluation

We further analyzed the speech recognition results of ME and CS. To obtain the recognition results from Eq. (3), we simply take argmax_v P_CS(v). The word error rate (WER) of the source language recognition is reported in Table 4. Combining these with the results in Table 1, we see that CS has worse WER but higher BLEU than ME. We conclude that although leveraging word embedding instead of text at the intermediate level hurts speech recognition performance, which indicates that the WER of the recognition part does not fully determine the translation performance, the semantic information can still help multitask models generate better translations in terms of BLEU. We do not include the WER of CD in Table 4 because it is poor (>100%); interestingly, the BLEU of CD is still reasonable, which is further evidence that the WER of the intermediate level is not the key to translation performance.

4.5 Cosine Distance (CD) vs. Cosine Softmax (CS)

Based on the experimental results, we find that both proposals can map speech to the semantic space. When optimizing CS, BLEU consistently outperforms ME, which shows that utilizing semantic information truly helps ST. Directly minimizing the cosine distance makes the predicted embedding space closest to the pre-trained embedding space, but performs inconsistently on BLEU across data sizes. We infer that the imbalanced word-frequency distribution during training and the hubness problem (Faruqui et al., 2016) in the word embedding space leave the hidden states insufficiently discriminative for the target language decoder, while optimizing CS alleviates this issue.

5 Conclusions

Our proposals show that utilizing word embedding as the intermediate level helps the ST task, and that it is possible to map speech to the semantic space. We also observe that lower WER in source language recognition does not imply higher BLEU in target language translation.

This work is the first attempt to utilize word embedding in the ST task, and further techniques can be applied on top of this idea. For example, cross-lingual word embedding mapping methods could be incorporated into the ST model to shorten the distance between the MT and ST tasks.

References

Antonios Anastasopoulos and David Chiang. 2018. Tied multitask learning for neural speech translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 82–91, New Orleans, Louisiana. Association for Computational Linguistics.

Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, and Sharon Goldwater. 2018a. Low-resource speech-to-text translation. arXiv preprint arXiv:1803.09164.

Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, and Sharon Goldwater. 2018b. Pre-training on high-resource speech recognition improves low-resource speech-to-text translation. arXiv preprint arXiv:1809.01431.

Alexandre Bérard, Olivier Pietquin, Christophe Servan, and Laurent Besacier. 2016. Listen and translate: A proof of concept for end-to-end speech-to-text translation. arXiv preprint arXiv:1612.01744.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals. 2015. Listen, attend and spell. arXiv preprint arXiv:1508.01211.

Yu-An Chung, Wei-Hung Weng, Schrasing Tong, and James Glass. 2019. Towards unsupervised speech-to-text translation. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7170–7174. IEEE.

Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. Word translation without parallel data. arXiv preprint arXiv:1710.04087.

Long Duong, Antonios Anastasopoulos, David Chiang, Steven Bird, and Trevor Cohn. 2016. An attentional model for speech translation without transcription. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 949–959.

Manaal Faruqui, Yulia Tsvetkov, Pushpendre Rastogi, and Chris Dyer. 2016. Problems with evaluation of word embeddings using word similarity tasks. arXiv preprint arXiv:1605.02276.

David Graff, Shudong Huang, Ingrid Cartagena, Kevin Walker, and Christopher Cieri. 2010. Fisher Spanish speech (LDC2010S01). https://catalog.ldc.upenn.edu/LDC2010S01.

Hirofumi Inaguma, Kevin Duh, Tatsuya Kawahara, and Shinji Watanabe. 2019. Multilingual end-to-end speech translation. arXiv preprint arXiv:1910.00254.

Inigo Jauregi Unanue, Ehsan Zare Borzeshi, Nazanin Esmaili, and Massimo Piccardi. 2019. ReWE: Regressing word embeddings for regularization of neural machine translation systems. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 430–436, Minneapolis, Minnesota. Association for Computational Linguistics.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.

Sachin Kumar and Yulia Tsvetkov. 2018. Von Mises-Fisher loss for training sequence to sequence models with continuous outputs. arXiv preprint arXiv:1812.04616.

Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2017. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043.

Yu Liu, Hongyang Li, and Xiaogang Wang. 2017a. Learning deep features via congenerous cosine loss for person recognition. arXiv preprint arXiv:1702.06890.

Yu Liu, Hongyang Li, and Xiaogang Wang. 2017b. Rethinking feature discrimination and polymerization for large-scale recognition. arXiv preprint arXiv:1710.00870.

Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

Elizabeth Salesky, Matthias Sperber, and Alan W. Black. 2019. Exploring phoneme-level speech representations for end-to-end speech translation. arXiv preprint arXiv:1906.01199.

Anders Søgaard, Sebastian Ruder, and Ivan Vulić. 2018. On the limitations of unsupervised bilingual dictionary induction. arXiv preprint arXiv:1805.03620.

Matthias Sperber, Graham Neubig, Jan Niehues, and Alex Waibel. 2019. Attention-passing models for robust and data-efficient end-to-end speech translation. Transactions of the Association for Computational Linguistics, 7:313–325.

Ron J. Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, and Zhifeng Chen. 2017. Sequence-to-sequence models can directly translate foreign speech. arXiv preprint arXiv:1703.08581.

A Appendix

A.1 Single-task end-to-end model

One of our baseline models is the single-task end-to-end model, abbreviated as SE in the previous sections. SE was trained using the source language speech and the target language text. It shares the same architecture as the multitask model but without the source language text decoding (i.e., without the recognition part in Fig. 1(a)). Its objective function can be written as:

$$L_{SE} = L_{tgt} = -\sum_{q}\log P(y_q). \qquad (5)$$

Further details can be found in Anastasopoulos and Chiang (2018).

A.2 Using different Word Embeddings

Our proposed model benefits from publicly available pre-trained word embedding, which is easy to obtain yet probably comes from domains different from the test data; it can thus bring improvement to ST models in a simple plug-in manner. In Sec. 4.2, we used word embedding trained on Wikipedia. To examine the effect of using different word embeddings, we additionally provide results of ST models using word embeddings trained on the Fisher Spanish corpus (train and dev sets) in Table 5. We abbreviate the word embedding trained on Wikipedia as W-emb and the word embedding trained on the Fisher Spanish corpus as F-emb.

Table 5: BLEU scores using different pre-trained word embeddings.

            Word Embedding Source    160 hours
                                     dev      test
    ME      -                        35.35    35.49
    CD      Wikipedia                33.06    33.65
    CD      Fisher Spanish           33.33    33.80
    CS      Wikipedia                35.84    36.32
    CS      Fisher Spanish           36.45    35.81

With the CD/CS method, using F-emb obtained a 0.27/0.61 improvement over W-emb on the dev set; on the test set, CD gained 0.15 while CS dropped by 0.51. The improvements show that using word embeddings trained on a related-domain corpus helps performance. For CD, although using F-emb improves performance, it still under-performs ME, indicating that the choice of method is critical. For CS, there is a clear improvement on the dev set but not on the test set, which suggests that F-emb does help, while word embedding trained on rich data (W-emb) can provide additional information that generalizes better to the test set.

In general, whether using F-emb or W-emb as the training target, the experimental results are consistent with the discussion in Sec. 4.2.
