Worse WER, but Better BLEU? Leveraging Embedding as Intermediate in Multitask End-to-End Speech Translation

Shun-Po Chuang¹, Tzu-Wei Sung², Alexander H. Liu¹, and Hung-yi Lee¹

¹National Taiwan University, Taiwan   ²University of California San Diego, USA
{f04942141, b03902042, r07922013, hungyilee}@ntu.edu.tw

Abstract

Speech translation (ST) aims to learn transformations from speech in the source language to text in the target language. Previous works show that multitask learning improves ST performance, in which the recognition decoder generates the text of the source language, and the translation decoder obtains the final translations based on the output of the recognition decoder. Because whether the output of the recognition decoder has the correct semantics is more critical than its accuracy, we propose to improve the multitask ST model by utilizing word embedding as the intermediate.

1 Introduction

Speech translation (ST) has recently received increasing attention from the machine translation (MT) community. To learn the transformation between speech in the source language and text in the target language, conventional models pipeline an automatic speech recognition (ASR) model and a text-to-text MT model (Bérard et al., 2016). However, such pipeline systems suffer from error propagation. Previous works show that deep end-to-end models can outperform conventional pipeline systems given sufficient training data (Weiss et al., 2017; Inaguma et al., 2019; Sperber et al., 2019). Nevertheless, well-annotated bilingual data is expensive and hard to collect (Bansal et al., 2018a,b; Duong et al., 2016). Multitask learning plays an essential role in leveraging a large amount of monolingual data to improve representation learning in ST. Multitask ST models have two jointly learned decoding parts, namely the recognition and translation parts. The recognition part first decodes the speech of the source language into the text of the source language; based on the output of the recognition part, the translation part then generates the text in the target language. Variant multitask models have been explored (Anastasopoulos and Chiang, 2018), showing improvement in the low-resource scenario.

Although applying the text of the source language as the intermediate information in multitask end-to-end ST empirically yields improvement, we question whether this is the optimal solution. Even if the recognition part does not correctly transcribe the input speech into text, the final translation can still be correct as long as the output of the recognition part preserves sufficient semantic information for translation. Therefore, we explore leveraging word embedding as the intermediate level instead of text.

In this paper, we apply pre-trained word embedding as the intermediate level in the multitask ST model. We propose to constrain the hidden states of the decoder of the recognition part to be close to the pre-trained word embedding. Prior works on word embedding regression show improved results on MT (Jauregi Unanue et al., 2019; Kumar and Tsvetkov, 2018). Experimental results show that the proposed approach improves the ST model. Further analysis also shows that the constrained hidden states are approximately isospectral to the word embedding space, indicating that the decoder achieves speech-to-semantic mappings.

2 Multitask End-to-End ST Model

Our method is based on multitask learning for ST (Anastasopoulos and Chiang, 2018), including speech recognition in the source language and translation in the target language, as shown in Fig. 1(a). The input audio feature sequence is first encoded into the encoder hidden state sequence h = h_1, h_2, ..., h_T with length T by a pyramid encoder (Chan et al., 2015). To perform speech recognition in the source language, an attention mechanism and a decoder are employed to produce the source decoder state sequence ŝ = ŝ_1, ŝ_2, ..., ŝ_M, where M is the number of decoding steps in the source language. For each decoding step m, the probability P(ŷ_m) of predicting the token ŷ_m in the source language vocabulary is computed from the corresponding decoder state ŝ_m.
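For concreteness, the following is a minimal PyTorch sketch of a pyramidal BiLSTM encoder in the spirit of Chan et al. (2015). The layer sizes are illustrative assumptions, and the paper's actual stack also uses convolution layers (cf. Sec. 4.1); the point here is only the frame-concatenation trick that shrinks the time axis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidalEncoder(nn.Module):
    """Pyramidal BiLSTM: after each layer, adjacent frames are
    concatenated, halving the time axis; three halvings give the
    8x down-sampling mentioned in Sec. 4.1."""

    def __init__(self, input_dim=80, hidden_dim=256, num_layers=3):
        super().__init__()
        dims = [input_dim] + [hidden_dim * 4] * (num_layers - 1)  # 2 dirs x 2 frames
        self.layers = nn.ModuleList(
            nn.LSTM(d, hidden_dim, bidirectional=True, batch_first=True)
            for d in dims)

    def forward(self, x):                        # x: (batch, T, input_dim)
        for lstm in self.layers:
            x, _ = lstm(x)                       # (batch, T, 2 * hidden_dim)
            if x.size(1) % 2:                    # pad time axis to an even length
                x = F.pad(x, (0, 0, 0, 1))
            b, t, d = x.shape
            x = x.reshape(b, t // 2, d * 2)      # concatenate adjacent frames
        return x                                 # (batch, T / 8, 4 * hidden_dim)

h = PyramidalEncoder()(torch.randn(2, 240, 80))  # -> torch.Size([2, 30, 1024])
```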

To perform speech translation in the target language, both the source language decoder state sequence ŝ and the encoder state sequence h are attended and fed to the target language decoder as input. The hidden state of the target language decoder can then be used to derive the probability P(y_q) of predicting token y_q in the target language vocabulary at every decoding step q.

Given the ground-truth sequence ŷ = ŷ_1, ŷ_2, ..., ŷ_M in the source language and y = y_1, y_2, ..., y_Q with length Q in the target language, multitask ST can be trained by maximizing the log-likelihood in both domains. Formally, the objective function of multitask ST can be written as:

$$L_{ST} = \alpha L_{src} + \beta L_{tgt} = -\frac{\alpha}{M}\sum_{m}\log P(\hat{y}_m) - \frac{\beta}{Q}\sum_{q}\log P(y_q), \qquad (1)$$

where α and β are trade-off factors balancing the two tasks.
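As a sketch, Eq. (1) amounts to two length-normalized cross-entropy terms. The snippet below assumes the per-step logits are already computed; the values of alpha and beta are placeholders, since the paper does not report its settings here.

```python
import torch
import torch.nn.functional as F

def multitask_loss(src_logits, src_tokens, tgt_logits, tgt_tokens,
                   alpha=1.0, beta=1.0):
    """Eq. (1): L_ST = alpha * L_src + beta * L_tgt, where each term is a
    negative log-likelihood averaged over its own decoding steps."""
    l_src = F.cross_entropy(src_logits, src_tokens)  # (1/M) * sum_m -log P(y_m)
    l_tgt = F.cross_entropy(tgt_logits, tgt_tokens)  # (1/Q) * sum_q -log P(y_q)
    return alpha * l_src + beta * l_tgt

# src_logits: (M, |V_src|), tgt_logits: (Q, |V_tgt|), token ids as 1-D LongTensors
loss = multitask_loss(torch.randn(12, 8000), torch.randint(8000, (12,)),
                      torch.randn(15, 10000), torch.randint(10000, (15,)))
```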

Figure 1: (a) Multitask ST model. Dotted arrows indicate steps in the recognition part; solid arrows indicate steps in the translation part. (b) Directly learning word embedding via cosine distance. (c) Learning word embedding via the cosine softmax function. Both (b) and (c) replace the recognition part in (a).

3 Proposed Methods

We propose two ways to help the multitask end-to-end ST model capture the semantic relation between word tokens by leveraging the source language word embedding Ê = {ê_1, ê_2, ..., ê_|V|} as the intermediate level, where V is the vocabulary set of the recognition task and ê_v ∈ R^D is the embedding vector with dimension D for any word v ∈ V. We choose to reinforce the source language decoder state (embedding) ŝ, since it is later used in the translation task. To be more specific, we argue that the embedding generated by the source language decoder should be semantically correct in order to benefit the translation task. Given the pre-trained source language word embedding Ê, we propose to constrain the source decoder state ŝ_m at step m to be close to its corresponding word embedding ê_{ŷ_m}, with the two approaches detailed in the following sections.

3.1 Directly Learn Word Embedding

Since semantically related words are close in terms of cosine distance (Mikolov et al., 2018), a simple idea is to minimize the cosine distance (CD) between the source language decoder hidden state ŝ_m and the corresponding word embedding ê_{ŷ_m} at every decoding step m:

$$L_{CD} = \sum_{m}\Big[1 - \cos\big(f_\theta(\hat{s}_m), \hat{e}_{\hat{y}_m}\big)\Big] = \sum_{m}\Big[1 - \frac{f_\theta(\hat{s}_m)\cdot \hat{e}_{\hat{y}_m}}{\|f_\theta(\hat{s}_m)\|\,\|\hat{e}_{\hat{y}_m}\|}\Big], \qquad (2)$$

where f_θ(·) is a learnable linear projection that matches the dimensionality of the decoder state to that of the word embedding. With this design, the network architecture of the target language decoder is not limited by the dimension of the word embedding. Fig. 1(b) illustrates this approach. By replacing L_src in Eq. (1) with L_CD, semantic learning from word embedding for source language recognition can be achieved.
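A minimal PyTorch sketch of the CD objective in Eq. (2): f_θ is a single linear layer, and the 300-dimensional embedding size matches the FastText vectors used in Sec. 4.1; the other dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineDistanceLoss(nn.Module):
    """Eq. (2): pull the projected decoder state f_theta(s_m) toward the
    pre-trained embedding e_{y_m} of the reference token at each step."""

    def __init__(self, decoder_dim=1024, embed_dim=300):
        super().__init__()
        self.f_theta = nn.Linear(decoder_dim, embed_dim)  # learnable projection

    def forward(self, dec_states, ref_embeddings):
        # dec_states: (M, decoder_dim); ref_embeddings: (M, embed_dim), where
        # row m holds the pre-trained vector of the m-th reference token.
        sim = F.cosine_similarity(self.f_theta(dec_states), ref_embeddings, dim=-1)
        return (1.0 - sim).sum()  # summed over decoding steps, as in Eq. (2)

loss = CosineDistanceLoss()(torch.randn(12, 1024), torch.randn(12, 300))
```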

3.2 Learn Word Embedding via Probability

Ideally, using word embedding as the learning target by minimizing CD can effectively train the decoder to model the semantic relations in the embedding space. However, such an approach suffers from the hubness problem (Faruqui et al., 2016) of word embedding in practice (as discussed later in Sec. 4.5). To address this problem, we introduce the cosine softmax (CS) function (Liu et al., 2017a,b) to learn speech-to-semantic embedding mappings. Given the decoder hidden state ŝ_m and the word embedding Ê, the probability of the target word ŷ_m is defined as

$$P_{CS}(\hat{y}_m) = \frac{\exp\big(\cos(f_\theta(\hat{s}_m), \hat{e}_{\hat{y}_m})/\tau\big)}{\sum_{\hat{e}_v \in \hat{E}} \exp\big(\cos(f_\theta(\hat{s}_m), \hat{e}_v)/\tau\big)}, \qquad (3)$$

where cos(·) and f_θ(·) are as in Eq. (2), and τ is the temperature of the softmax function. Note that since the temperature τ re-scales the cosine similarity, the hubness problem can be mitigated by selecting a proper value for τ. Fig. 1(c) illustrates the approach. With the probability derived from the cosine softmax in Eq. (3), the objective function for the source language decoder can be written as

$$L_{CS} = -\sum_{m}\log P_{CS}(\hat{y}_m). \qquad (4)$$

By replacing L_src in Eq. (1) with L_CS, the decoder hidden state sequence ŝ is forced to contain the semantic information provided by the word embedding.
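The CS objective in Eqs. (3)-(4) is a cross-entropy over temperature-scaled cosine similarities. A sketch, assuming the projected states and the pre-trained embedding table are given; τ = 0.05 is an illustrative value, as the paper does not state the one it used.

```python
import torch
import torch.nn.functional as F

def cosine_softmax_loss(proj_states, embed_table, targets, tau=0.05):
    """Eqs. (3)-(4): cross-entropy over temperature-scaled cosine
    similarities between projected decoder states and every entry of
    the pre-trained embedding table."""
    # proj_states: (M, D), i.e. f_theta(s_m); embed_table: (|V|, D); targets: (M,)
    s = F.normalize(proj_states, dim=-1)   # unit-norm rows ...
    e = F.normalize(embed_table, dim=-1)   # ... so the dot product is a cosine
    logits = s @ e.t() / tau               # (M, |V|)
    return F.cross_entropy(logits, targets, reduction='sum')  # Eq. (4)

loss = cosine_softmax_loss(torch.randn(12, 300), torch.randn(8000, 300),
                           torch.randint(8000, (12,)))
```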
4 Experiments

4.1 Experimental Setup

We used the Fisher Spanish corpus (Graff et al., 2010) to perform Spanish speech to English text translation. We followed previous work (Inaguma et al., 2019) for the pre-processing steps, and the 40/160-hour train sets and the standard dev and test sets are used in the experiments. Byte-pair encoding (BPE) (Kudo and Richardson, 2018) was applied to the target transcriptions to form 10K subwords as the target of the translation part. Spanish word embeddings were obtained from FastText pre-trained on Wikipedia (Bojanowski et al., 2016), and 8,000 Spanish words were used in the recognition part.

The encoder is a 3-layer 512-dimensional bidirectional LSTM with additional convolution layers, yielding 8× down-sampling in time. The decoders are 1024-dimensional LSTMs; we used one layer in the recognition part and two layers in the translation part. The models were optimized using Adadelta with a weight decay rate of 10⁻⁶. Scheduled sampling with probability 0.8 was applied to the decoder in the translation part. Experiments ran for 1.5M steps, and models were selected by the highest BLEU against the four reference transcriptions per utterance in the dev set.

4.2 Speech Translation Evaluation

Baseline: We first built the single-task end-to-end model (SE) as a baseline for multitask learning, which obtained 34.50/34.51 BLEU on the dev and test sets respectively, comparable to Salesky et al. (2019). The multitask end-to-end model (ME) described in Sec. 2 is another baseline. With multitask learning applied, ME outperforms SE in all conditions.

Table 1: BLEU scores for models trained on different amounts of data.

            (a) 160 hours       (b) 40 hours
            dev      test       dev      test
    SE      34.50    34.51      17.41    15.44
    ME      35.35    35.49      23.30    20.40
    CD      33.06    33.65      23.53    20.87
    CS      35.84    36.32      23.54    21.72

High-resource: Column (a) in Table 1 shows the results when training on 160 hours of data. CD and CS denote the proposed methods from Sec. 3.1 and 3.2 respectively. We obtained mixed results when further applying pre-trained word embedding on top of ME: CD degraded performance, falling even below SE, while CS performed the best. These results show that directly learning word embedding via cosine distance is not a good strategy in the high-resource setting, whereas integrating similarity with the cosine softmax function significantly improves performance. We leave the discussion to Sec. 4.5.

Low-resource: We also experimented with the 40-hour training subset, shown in column (b) of Table 1. ME, CD, and CS all clearly outperform SE in the low-resource setting. Although CD degrades performance in the high-resource setting, it shows improvements in the low-resource scenario. CS consistently outperforms ME and CD across data sizes, showing it is robust in improving the ST task.

4.3 Analysis of Recognition Decoder Output

In this section, we analyze the hidden states produced by the different methods. For each word v in the corpus, we denote its word embedding ê_v as the pre-trained embedding and e_v as the predicted embedding. Note that because a single word v can be mapped from multiple audio segments, we take the average of all its predicted embeddings. We selected the top 500 most frequent words in the whole Fisher Spanish corpus, and tested on the test-set sentences containing only these words.

Eigenvector Similarity: To verify that our proposed methods constrain the hidden states to the word embedding space, we computed the eigenvector similarity between the predicted and pre-trained embedding spaces. The metric derives from Laplacian eigenvalues and represents how similar two spaces are: the lower the value, the more approximately isospectral the two spaces. Previous works showed that the metric correlates with the performance of the translation task (Søgaard et al., 2018; Chung et al., 2019). As shown in Table 2, the predicted embedding is more similar to the pre-trained embedding when models are trained on sufficient data (160 vs. 40 hours). CD is the most similar case among the three, and ME the most different. These results indicate that our proposals constrain the hidden states to the pre-trained embedding space.
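For reference, here is a NumPy sketch of an eigenvector similarity metric in the spirit of Søgaard et al. (2018): build a nearest-neighbour graph over each embedding space, take the Laplacian eigenvalues, and sum squared differences over the leading eigenvalues. The graph construction details (k-NN graph, n_neighbors = 10, the 90% energy cut-off) are our assumptions, not the paper's code.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def eigenvector_similarity(emb_a, emb_b, n_neighbors=10):
    """Lower value = the two embedding spaces are closer to isospectral."""
    def laplacian_eigenvalues(emb):
        adj = kneighbors_graph(emb, n_neighbors, mode='connectivity')
        adj = ((adj + adj.T) > 0).astype(float).toarray()  # symmetrize k-NN graph
        lap = np.diag(adj.sum(axis=1)) - adj               # unnormalized Laplacian
        return np.sort(np.linalg.eigvalsh(lap))

    ev_a = laplacian_eigenvalues(emb_a)
    ev_b = laplacian_eigenvalues(emb_b)

    def k_at_90_percent(ev):  # smallest k whose eigenvalues hold 90% of the mass
        return int(np.searchsorted(np.cumsum(ev) / ev.sum(), 0.9)) + 1

    k = min(k_at_90_percent(ev_a), k_at_90_percent(ev_b))
    return float(np.sum((ev_a[:k] - ev_b[:k]) ** 2))
```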

Table 2: Eigenvector similarity (lower is more similar).

            160 hours           40 hours
            dev      test       dev      test
    ME      16.50    18.58      13.80    15.09
    CD       2.60     3.44       3.95     3.63
    CS      11.55    13.76       8.62     9.80

Table 4: Word error rate (%) for models trained on different amounts of data.

            160 hours           40 hours
            dev      test       dev      test
    ME      43.13    38.57      53.42    54.70
    CS      50.15    44.43      57.63    57.21

Semantic Alignment: To further verify whether the predicted embedding is semantically aligned with the pre-trained embedding, we applied the Procrustes alignment method (Conneau et al., 2017; Lample et al., 2017) to learn a mapping between the predicted and pre-trained embeddings. The top 50 most frequent words were selected as the training dictionary, and we evaluated on the remaining 450 words with the cross-domain similarity local scaling (CSLS) method. Precision@k (P@k, k = 1, 5) is reported as the measurement. As shown in Table 3, CD performed the best and ME the worst. This experiment reinforces that our proposals can constrain the hidden states to a structure similar to the word embedding space.

Table 3: Precision@k of semantic alignment on the test set.

            160 hours           40 hours
            P@1      P@5        P@1      P@5
    ME       1.85     6.29       1.11     9.62
    CD      61.48    77.40      56.30    69.25
    CS      17.78    35.19      10.37    25.19
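A NumPy sketch of this alignment test: orthogonal Procrustes on the 50-word seed dictionary, then CSLS retrieval (Conneau et al., 2017) on the held-out words. This is a plain dense re-implementation under our own assumptions, not the original MUSE toolkit.

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal Procrustes: find W minimizing ||XW - Y||_F.
    X, Y: (n, D) paired rows, e.g. predicted vs. pre-trained embeddings."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def csls_precision_at_k(X, Y, W, k=1, n_nbrs=10):
    """Precision@k under CSLS retrieval: row i of X @ W should retrieve
    row i of Y.  CSLS(x, y) = 2 cos(x, y) - r_tgt(x) - r_src(y)."""
    sim = (X @ W) @ Y.T / (
        np.linalg.norm(X @ W, axis=1)[:, None] * np.linalg.norm(Y, axis=1)[None, :])
    # mean cosine to the n_nbrs nearest neighbours, in each direction
    r_src = np.sort(sim, axis=1)[:, -n_nbrs:].mean(axis=1, keepdims=True)
    r_tgt = np.sort(sim, axis=0)[-n_nbrs:, :].mean(axis=0, keepdims=True)
    csls = 2 * sim - r_src - r_tgt
    top_k = np.argsort(-csls, axis=1)[:, :k]
    return float(np.mean([i in top_k[i] for i in range(len(X))]))
```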
4.4 Speech Recognition Evaluation

We further analyzed the speech recognition results of ME and CS. To obtain the recognition results from Eq. (3), we simply take argmax_v P_CS(v). The word error rate (WER) of the source language recognition is reported in Table 4. Combining these with the results in Table 1, we see that CS has worse WER but higher BLEU than ME. We conclude that although leveraging word embedding instead of text at the intermediate level hurts speech recognition performance, which indicates that the WER of the recognition part does not fully determine the translation performance, the semantic information can still help multitask models generate better translations in terms of BLEU. We do not include the WER of CD in Table 4 because it is poor (>100%); interestingly, the BLEU of CD is still reasonable, which is further evidence that the WER of the intermediate level is not the key to translation performance.

4.5 Cosine Distance (CD) vs. Cosine Softmax (CS)

Based on the experimental results, we find that both proposals can map speech to the semantic space. When optimizing CS, BLEU consistently outperforms ME, which shows that utilizing semantic information truly helps ST. Directly minimizing the cosine distance makes the predicted embedding space closest to the pre-trained embedding space, but performs inconsistently on BLEU across data sizes. We infer that the imbalanced word-frequency distribution during training and the hubness problem (Faruqui et al., 2016) in the word embedding space leave the hidden states insufficiently discriminative for the target language decoder, while optimizing CS alleviates this issue.

5 Conclusions

Our proposals show that utilizing word embedding as the intermediate level helps the ST task, and that it is possible to map speech to the semantic space. We also observe that lower WER in source language recognition does not imply higher BLEU in target language translation.

This work is the first attempt to utilize word embedding in the ST task, and further techniques can be applied on top of this idea. For example, cross-lingual word embedding mapping methods could be incorporated into the ST model to shorten the distance between the MT and ST tasks.

References

Antonios Anastasopoulos and David Chiang. 2018. Tied multitask learning for neural speech translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 82–91, New Orleans, Louisiana. Association for Computational Linguistics.

Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, and Sharon Goldwater. 2018a. Low-resource speech-to-text translation. arXiv preprint arXiv:1803.09164.

Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, and Sharon Goldwater. 2018b. Pre-training on high-resource speech recognition improves low-resource speech-to-text translation. arXiv preprint arXiv:1809.01431.

Alexandre Bérard, Olivier Pietquin, Christophe Servan, and Laurent Besacier. 2016. Listen and translate: A proof of concept for end-to-end speech-to-text translation. arXiv preprint arXiv:1612.01744.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals. 2015. Listen, attend and spell. arXiv preprint arXiv:1508.01211.

Yu-An Chung, Wei-Hung Weng, Schrasing Tong, and James Glass. 2019. Towards unsupervised speech-to-text translation. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7170–7174. IEEE.

Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. Word translation without parallel data. arXiv preprint arXiv:1710.04087.

Long Duong, Antonios Anastasopoulos, David Chiang, Steven Bird, and Trevor Cohn. 2016. An attentional model for speech translation without transcription. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 949–959.

Manaal Faruqui, Yulia Tsvetkov, Pushpendre Rastogi, and Chris Dyer. 2016. Problems with evaluation of word embeddings using word similarity tasks. arXiv preprint arXiv:1605.02276.

David Graff, Shudong Huang, Ingrid Cartagena, Kevin Walker, and Christopher Cieri. 2010. Fisher Spanish speech (LDC2010S01). https://catalog.ldc.upenn.edu/LDC2010S01.

Hirofumi Inaguma, Kevin Duh, Tatsuya Kawahara, and Shinji Watanabe. 2019. Multilingual end-to-end speech translation. arXiv preprint arXiv:1910.00254.

Inigo Jauregi Unanue, Ehsan Zare Borzeshi, Nazanin Esmaili, and Massimo Piccardi. 2019. ReWE: Regressing word embeddings for regularization of neural machine translation systems. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 430–436, Minneapolis, Minnesota. Association for Computational Linguistics.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.

Sachin Kumar and Yulia Tsvetkov. 2018. Von Mises-Fisher loss for training sequence to sequence models with continuous outputs. arXiv preprint arXiv:1812.04616.

Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2017. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043.

Yu Liu, Hongyang Li, and Xiaogang Wang. 2017a. Learning deep features via congenerous cosine loss for person recognition. arXiv preprint arXiv:1702.06890.

Yu Liu, Hongyang Li, and Xiaogang Wang. 2017b. Rethinking feature discrimination and polymerization for large-scale recognition. arXiv preprint arXiv:1710.00870.

Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

Elizabeth Salesky, Matthias Sperber, and Alan W. Black. 2019. Exploring phoneme-level speech representations for end-to-end speech translation. arXiv preprint arXiv:1906.01199.

Anders Søgaard, Sebastian Ruder, and Ivan Vulić. 2018. On the limitations of unsupervised bilingual dictionary induction. arXiv preprint arXiv:1805.03620.

Matthias Sperber, Graham Neubig, Jan Niehues, and Alex Waibel. 2019. Attention-passing models for robust and data-efficient end-to-end speech translation. Transactions of the Association for Computational Linguistics, 7:313–325.

Ron J. Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, and Zhifeng Chen. 2017. Sequence-to-sequence models can directly translate foreign speech. arXiv preprint arXiv:1703.08581.

A Appendix

A.1 Single-task end-to-end model

One of our baseline models is the single-task end-to-end model, abbreviated as SE in the previous sections. SE was trained using the source language speech and the target language text. It shares the same architecture as the multitask model but without the source language text decoding (i.e., without the recognition part in Fig. 1(a)). Its objective function can be written as:

$$L_{SE} = L_{tgt} = -\sum_{q}\log P(y_q). \qquad (5)$$

Further details can be found in Anastasopoulos and Chiang (2018).

A.2 Using different Word Embeddings

Our proposed model benefits from publicly available pre-trained word embedding, which is easy to obtain yet probably comes from domains different from the test data; it can thus bring improvement to ST models in a simple plug-in manner. In Sec. 4.2, we used word embedding trained on Wikipedia. To examine the effect of using different word embeddings, we additionally provide results of ST models using word embeddings trained on the Fisher Spanish corpus (train and dev sets) in Table 5. We abbreviate the word embedding trained on Wikipedia as W-emb and the word embedding trained on the Fisher Spanish corpus as F-emb.

Table 5: BLEU scores using different pre-trained word embeddings.

            Word Embedding Source    160 hours
                                     dev      test
    ME      -                        35.35    35.49
    CD      Wikipedia                33.06    33.65
    CD      Fisher Spanish           33.33    33.80
    CS      Wikipedia                35.84    36.32
    CS      Fisher Spanish           36.45    35.81

With the CD/CS method, using F-emb obtained a 0.27/0.61 improvement over W-emb on the dev set; on the test set, CD gained 0.15 while CS dropped by 0.51. The improvements show that using word embeddings trained on a related-domain corpus helps performance. For CD, although using F-emb improves performance, it still under-performs ME, indicating that the choice of method is critical. For CS, there is a clear improvement on the dev set but not on the test set, which suggests that F-emb does help, while word embedding trained on rich data (W-emb) can provide additional information that generalizes better to the test set.

In general, whether using F-emb or W-emb as the training target, the experimental results are consistent with the discussion in Sec. 4.2.
