
Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-resource Speech Recognition

Cheng Yi, Shiyu Zhou, and Bo Xu, Member, IEEE

Abstract—End-to-end models have achieved impressive results on the task of automatic speech recognition (ASR). For low-resource ASR tasks, however, labeled data can hardly satisfy the demand of end-to-end models. Self-supervised acoustic pre-training has already shown impressive ASR performance, while the transcription is still inadequate for language modeling in end-to-end models. In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model. The fused model only needs to learn the transfer from speech to language during fine-tuning on limited labeled data. The length of the two modalities is matched by a monotonic attention mechanism without additional parameters. Besides, a fully connected layer is introduced for the hidden mapping between modalities. We further propose a scheduled fine-tuning strategy to preserve and utilize the text context modeling ability of the pre-trained linguistic encoder. Experiments show that we utilize the pre-trained modules effectively. Our model achieves better recognition performance on the CALLHOME corpus (15 hours) than other end-to-end models.

Index Terms—end-to-end modeling, low-resource ASR, pre-training, wav2vec, BERT

Cheng Yi is with the Institute of Automation, Chinese Academy of Sciences, China, and the University of Chinese Academy of Sciences, China (e-mail: [email protected]). Shiyu Zhou is with the Institute of Automation, Chinese Academy of Sciences, China (e-mail: [email protected]). Bo Xu is with the Institute of Automation, Chinese Academy of Sciences, China (e-mail: [email protected]).

I. INTRODUCTION

Pipeline methods decompose the task of automatic speech recognition (ASR) into three components to model: acoustics, pronunciation, and language [1]. This decomposition dramatically decreases the difficulty of ASR tasks, requiring much less labeled data to converge. With a self-supervised pre-trained acoustic model, the pipeline method can achieve impressive recognition accuracy with as few as 10 hours of transcribed speech [2]–[5]. However, it is criticized that the three components are combined by two fixed weights (pronunciation and language), which is inflexible [6].

On the contrary, end-to-end models integrate the three components into one and directly transform the input speech features into the output text. Among end-to-end models, the sequence-to-sequence (S2S) model, composed of an encoder and a decoder, is the dominant structure [7]–[10]. End-to-end modeling achieves better results than pipeline methods on most public datasets [9], [11]. Nevertheless, it requires at least hundreds of hours of transcribed speech for training. A deep neural network has an enormous parameter space to explore, which is hard to train with limited labeled data.

Pre-training can help the end-to-end model work well on the target ASR task under the low-resource condition [12]. Supervised pre-training, also known as supervised transfer learning [13], uses the knowledge learned from other tasks and applies it to the target one [8]. However, this solution requires sufficient and domain-similar labeled data, which is hard to satisfy. Another solution is to partly pre-train the end-to-end model with unlabeled data. For example, [14] pre-trains the acoustic encoder of a Transformer by masked predictive coding (MPC), and the model obtains further improvement over a strong ASR baseline. Unlike the encoder, the decoder of the S2S model cannot be separately pre-trained since it is conditioned on the acoustic representation. In other words, it is difficult to guarantee the consistency between pre-training and fine-tuning for the decoder.

In this work, we abandon the attempt to realize linguistic pre-training for the S2S model. Instead, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into a single end-to-end ASR model. The fused model has separately been exposed to adequate speech and text data, so it only needs to learn the transfer from speech to language during fine-tuning with limited labeled data. To bridge the length gap between the speech and language modalities, a monotonic attention mechanism without additional parameters is applied. Besides, a fully connected layer is introduced for the mapping between hidden states of the two modalities. Our model works in a non-autoregressive (NAR) way [15] due to the absence of a well-defined decoder structure. NAR models have a speed advantage [16] and can perform comparably to autoregressive ones [15]. Unlike in its self-supervised pre-training, the linguistic encoder is fed the acoustic representation during fine-tuning. This inconsistency can severely degrade the representation ability of the linguistic encoder. We help this module get along with the acoustic encoder by a scheduled fine-tuning strategy.

II. RELATED WORK

A lot of works propose methods to leverage text data for the end-to-end model. Deep fusion [17] and cold fusion [18], [19] integrate a pre-trained auto-regressive language model (LM) into a S2S model. In this setting, the S2S model is randomly initialized and still needs a lot of labeled data for training. Knowledge distillation [20] requires a seed end-to-end ASR model, and it is not as convenient as the pretrain-and-finetune paradigm.


Fig. 1. The structure of w2v-cif-bert. On the left part, a variant CIF mechanism converts h^AC to l without any additional module; on the right part, h^LM is not directly fed into BERT but is mixed with embedded labels in advance during training. Modules connected by dotted lines are ignored during inference. Numbers in "()" indicate the size of the hidden vectors in the model.

[10] builds a S2S model with a pre-trained acoustic encoder and a multilingual linguistic decoder. The decoder is part of a S2S model (mBART [21]) pre-trained on text data. Although this model achieves great results on the task of speech-to-text translation, it is not verified on ASR tasks. Besides, that work does not deal with the inconsistency we mentioned above.

[16] takes advantage of an NAR decoder to revise the greedy CTC outputs from the encoder, where low-confidence tokens are masked. The decoder is required to predict the tokens corresponding to those masked positions by taking both the unmasked context and the acoustic representation into account. [22] iteratively predicts masked tokens based on partial results. As mentioned above, however, these NAR decoders cannot be separately pre-trained since they rely on the acoustic representation.

III. METHODOLOGY

We propose an innovative end-to-end model called w2v-cif-bert, which consists of wav2vec2.0 (pre-trained on speech corpus), BERT (pre-trained on text corpus), and a CIF mechanism to connect the above two modules. The detailed realization of our model is demonstrated in Fig. 1. It is worth noting that only the fully connected layer in the middle (marked as red) does not participate in any pre-training.

A. Acoustic Encoder

We choose wav2vec2.0 as the acoustic encoder since it has been well verified [4], [10], [23]. Wav2vec2.0 is a pre-trained encoder that converts raw speech signals into acoustic representation [4]. During pre-training, it masks the speech input in the latent space and solves a contrastive task. Wav2vec2.0 can outperform previous semi-supervised methods simply through fine-tuning on transcribed speech with the CTC criterion [24]. Wav2vec2.0 is composed of a feature encoder and a context network. The feature encoder outputs frames with a stride of about 20 ms and a receptive field of 25 ms of audio. The context network is a stack of self-attention blocks for context modeling [25]. In this work, wav2vec2.0 is used to convert the speech x into the acoustic representation h^AC (the w2v encoder in our model), which is colored blue in Fig. 1.
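To make the data flow concrete, a minimal sketch of extracting h^AC with a pre-trained wav2vec2.0 model is given below. It uses the HuggingFace checkpoint "facebook/wav2vec2-base" purely as a stand-in for the fairseq wav2vec_small.pt checkpoint listed in Sec. IV-A; the checkpoint choice and variable names are assumptions for illustration, not the paper's exact implementation.

```python
# Hedged sketch: obtain the acoustic representation h_AC from raw audio.
import torch
from transformers import Wav2Vec2Model

w2v_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")  # stand-in checkpoint
w2v_encoder.eval()

waveform = torch.randn(1, 32000)  # 2 s of 16 kHz audio, batch size 1

with torch.no_grad():
    h_ac = w2v_encoder(waveform).last_hidden_state  # (1, T, 768), roughly one frame per 20 ms

print(h_ac.shape)  # the last hidden dimension is later split into values and a CIF weight
```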
B. Modality Adaptation

It is common to apply global attention to connect the acoustic and linguistic representations [25]. However, this mechanism is to blame for poor generalization to text length [11], which becomes worse under sample scarcity. Instead, we use the continuous integrate-and-fire (CIF) mechanism [7] to bridge the discrepant sequence lengths. CIF constrains a monotonic alignment between the acoustic and linguistic representations, and this reasonable assumption drastically decreases the difficulty of learning the alignment. In the original work [7], CIF uses a local module to assign the attention value to each input frame. To avoid introducing additional parameters, we instead regard the last dimension of h^AC as the raw scalar attention value (before the sigmoid operation), as demonstrated in the left part of Fig. 1. The normalized attention values α_t are accumulated along the time dimension T, and a linguistic representation l_u is emitted whenever the accumulated α_t surpasses 1.0. During training, the sum of attention values for one input sample is resized to the number of output tokens y (n* = |y|). The formalized operations in CIF are:

\alpha_t = \mathrm{sigmoid}(h_t^{AC}[-1])    (1)

\alpha'_t = \mathrm{resize}(\alpha_t \mid \alpha_{1:T}, n^*)    (2)

l_u = \sum_t \alpha'_t \cdot h_t^{AC}[:-1]    (3)

where h^AC represents the acoustic vectors with length T, and l represents the accumulated acoustic vectors with length U.

CIF introduces a quantity loss to supervise the encoder in generating the correct number of final modeling units:

\hat{n} = \sum_t^T \alpha_t    (4)

L_{qua} = \lVert n^* - \hat{n} \rVert_2    (5)

where n̂ represents the predicted decoding length. During inference, we add an extra rounding operation on n̂ to simulate the n* used between Eq. (1) and Eq. (2).

Based on the matched sequence length, the accumulated acoustic vector l is mapped into the linguistic vector h^LM by a randomly initialized fully connected layer (FC), realizing the modality adaptation.
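The following is a minimal, single-utterance sketch of this variant CIF, assuming h_ac has shape (T, 768) with the last dimension reserved for the raw weight. The function name, the simplification of firing at most once per frame, and the handling of the inference-time rounding are my own choices rather than the paper's exact code; the quantity loss of Eqs. (4)–(5) would then be computed from the unresized weights as |n* − Σ_t α_t|.

```python
import torch

def cif_parameter_free(h_ac: torch.Tensor, n_star: int = None, beta: float = 1.0):
    """Sketch of the variant CIF of Sec. III-B for one utterance.

    h_ac: (T, D) acoustic vectors; the last dimension is read as the raw
    attention scalar (Eq. 1), the remaining D-1 dimensions are integrated.
    n_star: number of target tokens during training; None at inference,
    where the rounded predicted length takes its place.
    """
    alpha = torch.sigmoid(h_ac[:, -1])      # Eq. (1): (T,)
    values = h_ac[:, :-1]                   # (T, D-1)

    if n_star is None:                      # inference: round the predicted length
        n_star = int(torch.round(alpha.sum()).item())
    alpha = alpha * n_star / alpha.sum()    # Eq. (2): resize so the weights sum to n_star

    outputs, acc, frame = [], 0.0, torch.zeros_like(values[0])
    for t in range(values.size(0)):         # Eq. (3): integrate and fire at beta = 1.0
        w = alpha[t]
        if acc + w < beta:                  # keep integrating the current token
            acc = acc + w
            frame = frame + w * values[t]
        else:                               # fire: emit l_u and start the next token
            used = beta - acc               # portion of w that completes the current token
            outputs.append(frame + used * values[t])
            acc = w - used                  # the spill starts the next accumulation
            frame = acc * values[t]
    return torch.stack(outputs) if outputs else values.new_zeros(0, values.size(1))
```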

C. Linguistic Encoder

We choose BERT as the linguistic encoder in our model. BERT is a masked LM: it applies the mask-predict criterion for self-supervised training and utilizes both left and right context on a huge amount of text data [26]. BERT has empirically shown impressive performance on various NLP tasks [26]–[28]. BERT is composed of three modules: an embedding table (embedding) that converts tokens to hidden vectors, a final fully connected layer (to_vocab_0) that converts hidden vectors to an output softmax over the vocabulary, and a Transformer encoder (BERT encoder) for bidirectional context modeling. These modules are colored green in Fig. 1. Pre-trained BERT can compensate for the lack of text data on low-resource ASR tasks. Specifically, the prior output distribution can accelerate the convergence, and BERT provides stronger linguistic context.

D. Additional Connections

We add two additional connections after stacking the three modules. Firstly, a high-way connection directly connects the acoustic representation to the final output. Secondly, an auxiliary CTC supervision (L_ctc) is attached to the acoustic encoder. Both connections allow the target supervision to affect the encoder more effectively [7], [29]. We use BERT's final fully connected layer (to_vocab_0) to initialize the new ones (to_vocab_1 and to_vocab_2). The final output of our model is:

\mathrm{logits} = \lambda_{AC} \cdot \mathrm{logits}_{AC} + \lambda_{LM} \cdot \mathrm{logits}_{LM}    (6)

where λ_AC and λ_LM are the weights for the outputs from the acoustic encoder and the linguistic encoder. The final output is supervised by the cross-entropy criterion (L_ce) [25]. The final loss during fine-tuning over labeled speech is:

L = L_{ce} + \mu_1 \cdot L_{qua} + \mu_2 \cdot L_{ctc}    (7)

where μ_1 and μ_2 are the weights for these losses, respectively.

E. Scheduled Modality Fusion

We notice a distinct mismatch between BERT as a text feature extractor during pre-training and BERT as a linguistic encoder in the ASR model during fine-tuning. BERT cannot process h^LM according to its pre-trained knowledge. It needs to greatly adjust the parameters at the bottom, which is significantly different from fine-tuning on NLP tasks. Worst of all, BERT's parameters cannot transfer from the bottom up at a minimum cost; on the contrary, BERT will undergo a massive top-down change following the back-propagation of gradients.

In response to the above mismatch, we randomly replace h^LM with embedded target tokens y along the linguistic length U at a scheduled gold rate p ∈ [0, 1], as demonstrated in the right part of Fig. 1. p decreases during fine-tuning so that BERT gets rid of the dependency on y. At the beginning of fine-tuning, BERT cannot understand the frames from h^LM and views them as masked input; it mainly utilizes the gold context (label embedding vectors) to predict. Due to the consistency with pre-training, BERT quickly converges to high prediction accuracy. Along with fine-tuning and the decrease of p, BERT gradually grasps the meaning of h^LM and predicts more accurately. During inference, p is set to 0 and BERT predicts with pure h^LM.
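A hedged sketch of this scheduled fusion is given below. The function names and the linear decay of the gold rate are my reading of the "start → end / steps" notation used in Table II, stated here as assumptions rather than the paper's exact schedule.

```python
import torch
from transformers import BertModel

def gold_rate(step: int, start: float = 0.9, end: float = 0.2, decay_steps: int = 4000) -> float:
    """Linearly decay the gold rate p from `start` to `end` over `decay_steps` updates."""
    return max(end, start - (start - end) * step / decay_steps)

def mix_with_gold(h_lm: torch.Tensor, y: torch.Tensor, bert: BertModel, p: float) -> torch.Tensor:
    """Replace each position of h_lm (U, 768) by the BERT embedding of the gold
    token y[u] with probability p; at inference p = 0 and h_lm passes unchanged."""
    gold_emb = bert.get_input_embeddings()(y)           # (U, 768) label embedding vectors
    use_gold = (torch.rand(y.size(0), 1) < p).float()   # per-position Bernoulli mask
    return use_gold * gold_emb + (1.0 - use_gold) * h_lm

# Usage: the mixed sequence is fed to the BERT encoder as input embeddings, e.g.
#   bert = BertModel.from_pretrained("bert-base-chinese")
#   mixed = mix_with_gold(h_lm, y, bert, gold_rate(step))
#   out = bert(inputs_embeds=mixed.unsqueeze(0)).last_hidden_state
```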
IV. EXPERIMENTS

A. Datasets and Experimental Settings

We focus on low-resource ASR and mainly experiment on the CALLHOME corpus [1]. CALLHOME is a multilingual corpus with less than 20 hours of transcribed speech for each language. In this work, we use CALLHOME Mandarin (MA, LDC96S15) and English (EN, LDC97S20). MA has 23,915 transcribed utterances (15.6 h) for training and 3,021 for testing. EN has 21,194 transcribed utterances (14.9 h) for training and 2,840 for testing. To compare with more works, we also test our model on a relatively large and popular corpus: HKUST [30]. The HKUST corpus (LDC2005S15, LDC2005T32) consists of a training set and a development set, which add up to about 178 hours. Both corpora are telephone conversational speech, which is much more realistic and harder than Librispeech [4].

We use the open-source wav2vec2.0¹ as the acoustic encoder, bert-base-uncased² as the linguistic encoder for English, and bert-base-chinese³ as the linguistic encoder for Chinese. We are free from pre-processing the transcripts since the built-in tokenizers of the BERT models automatically generate the modeling units. All the code and experiments are implemented using fairseq [31]. We keep most of the training settings of the wav2vec2.0 fine-tuning demonstration⁴. We optimize with Adam, warming up the learning rate for 8,000 steps to a peak of 4×10⁻⁵, holding it for 42,000 steps, and then decaying it exponentially. We only use a single GPU (TITAN Xp) for each experiment.

¹ https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small.pt
² https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-12_H-768_A-12.zip
³ https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip
⁴ https://github.com/pytorch/fairseq/tree/master/examples/wav2vec
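The tri-stage learning-rate schedule above can be written down compactly; the sketch below is only an illustration of that description, and the exponential decay constant is a placeholder since the paper does not state its value.

```python
def lr_at(step: int, peak: float = 4e-5, warmup: int = 8000, hold: int = 42000,
          decay: float = 0.999995) -> float:
    """Tri-stage schedule of Sec. IV-A: linear warm-up, constant hold, exponential decay.
    The decay constant is an assumption for illustration."""
    if step < warmup:
        return peak * step / warmup
    if step < warmup + hold:
        return peak
    return peak * decay ** (step - warmup - hold)
```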

Considering the NAR property, our model simply uses greedy search to generate the final results.
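For clarity, a minimal sketch of this greedy NAR decoding over the fused output of Eq. (6) is shown below; the function name is mine, and the default weights mirror the best setting reported in Table II.

```python
import torch

def greedy_decode(logits_ac: torch.Tensor, logits_lm: torch.Tensor,
                  lambda_ac: float = 1.0, lambda_lm: float = 0.2) -> torch.Tensor:
    """Fuse the acoustic and linguistic output distributions (Eq. 6) and take
    the per-position argmax; no beam search or autoregressive loop is needed."""
    logits = lambda_ac * logits_ac + lambda_lm * logits_lm  # (U, |V|)
    return logits.argmax(dim=-1)                            # (U,) predicted token ids
```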

B. Overall Results

In this section, we compare our model with other end-to-end modeling works. Transformer [8] conducts experiments on the low-resource tasks (MA and EN) by supervised transfer learning. w2v-ctc [32] also applies pre-trained wav2vec2.0 as the encoder and adds a randomly initialized linear projection on top of the encoder. w2v-ctc is optimized by minimizing a CTC loss, and it is one of the most concise end-to-end models [33]. In this work, we reproduce w2v-seq2seq [32], which is composed of pre-trained wav2vec2.0 and a Transformer decoder (with 1 or 4 blocks, randomly initialized) connected by cross-attention [25]. The w2v-seq2seq models apply the same modeling units (characters for Chinese and subwords for English) as [32]. We further implement cold fusion for w2v-seq2seq with a pre-trained Transformer LM (6 blocks) trained on a private Chinese text corpus (200M samples). The w2v-seq2seq models decode through beam search with a beam size of 50. All of these models are trained under the same setups as w2v-cif-bert.

TABLE II
ABLATIONS ON THE STRUCTURE OF W2V-CIF-BERT OVER CALLHOME MA. PERFORMANCE IS CER (%) ON THE TEST SET.

Description                Settings                      CER
w2v-cif-bert               µ1 = 0.2, µ2 = 1.0,           32.93
                           λLM = 0.2, λAC = 1.0,
                           no sharing of to_vocab,
                           gold rate 0.9 → 0.2/4000,
                           TH = 0.8
(1) quantity loss          µ1 = 0                        154.7
                           µ1 = 0.5                      33.05
(2) CTC loss               µ2 = 0                        35.77
                           µ2 = 2.0                      33.01
(3) LM weight              λLM = 0.0                     36.43
                           λLM = 0.4                     33.22
(4) acoustic weight        λAC = 0.0                     99.69
(5) share to_vocab         share 0,1                     33.00
                           share 0,2                     78.76
                           share 1,2                     35.10
(6) gold rate schedule     0.9 → 0.2/8000                34.13
                           0.9 → 0.2/2000                33.37
                           0.9 → 0.0/4000                34.92
                           0.2 → 0.2/∞                   33.26
                           0.0 → 0.0/∞                   35.95
(7) confidence threshold   TH = 1.0                      33.37
                           TH = 0.6                      35.51

TABLE I
COMPARISON ON CALLHOME AND HKUST. PERFORMANCE IS CER (%) FOR MA AND HKUST; WER (%) FOR EN.

Model                       MA(15h)   EN(15h)   HKUST(150h)
CIF [7]                     -         -         23.09
Transformer + MPC [14]      -         -         21.70
Transformer [8], [34]       37.62     33.77     26.60
w2v-ctc [32]                36.06     24.93     23.80
w2v-seq2seq
  decoder with 1 block      39.81     26.18     24.06
    + cold fusion           37.90     -         24.02
  decoder with 4 blocks     54.82     47.66     25.73
    + cold fusion           -         -         25.46
w2v-cif-bert                32.93     23.79     22.92

As demonstrated in Table I, our model achieves the best results on the low-resource tasks, showing a promising direction of fusing pre-trained acoustic and linguistic modules. On the task with relatively abundant labeled data, our model still achieves performance comparable to the SOTA. Considering that the MPC training [14] utilizes a rather large speech corpus (10,000 hours) that is similar to the target task, this benchmark is hard to approach. The w2v-seq2seq models are inferior to w2v-ctc, even with cold fusion. We think it is the randomly initialized decoder that causes the low performance under the low-resource condition. Compared with cold fusion, our method makes better use of the pre-trained LM.

C. Ablation Study

We explore different structures and hyper-parameters of w2v-cif-bert through ablations to find a reasonable setting. Some connections in Fig. 1 are shut off by setting the corresponding weights to 0. Ablations are conducted on CALLHOME-MA. Results are listed in Table II. We draw the following conclusions from the corresponding ablations: (1) The quantity loss is indispensable for the CIF mechanism, since a proper alignment between the acoustic and linguistic representations is hard to learn through other supervisions; (2) Adding the auxiliary CTC criterion greatly matters, as it helps the encoder learn the alignment; (3) BERT as the linguistic encoder makes an impressive contribution to the performance, showing our effective utilization of the pre-trained masked LM; (4) The acoustic high-way connection is indispensable for the convergence of our model, playing a similar role as the CTC criterion; (5) Using three separately updated to_vocab modules is the best option.

During fine-tuning, we apply the trick of mixing h^LM with embedded labels at a scheduled sampling rate p. We explore different schedules of p in ablation (6). "→" indicates the range of p, and the number after "/" is the number of decreasing steps. Directly fine-tuning without any mixing (0.0 → 0.0/∞) achieves rather poor performance, demonstrating the necessity of our proposed scheduled fusing trick. A reasonable schedule of the gold rate p during fine-tuning decreases it from a high value to a low one. Keeping p > 0 is another key point. We explain that it makes BERT work better by preserving some anchor tokens.

During inference, we also add some anchor tokens according to the confidence of the acoustic output logits_AC, which is similar to the iterative decoding of NAR models [16]. Tokens are determined when their posterior probabilities surpass a customized threshold TH (TH = 1.0 means no anchor tokens). Then the embedding vectors of these tokens are mixed with h^LM. As we can see in ablation (7), a proper confidence threshold, TH = 0.8 in our best result, contributes to better performance during inference.
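A hedged sketch of this inference-time anchoring follows; the function name and tensor shapes are assumptions, and the threshold default corresponds to the best TH reported in Table II.

```python
import torch
from transformers import BertModel

def add_anchor_tokens(h_lm: torch.Tensor, logits_ac: torch.Tensor,
                      bert: BertModel, th: float = 0.8) -> torch.Tensor:
    """Positions whose acoustic posterior exceeds TH keep the embedding of their
    greedy token instead of the projected acoustic vector; TH = 1.0 disables anchoring."""
    probs = torch.softmax(logits_ac, dim=-1)        # (U, |V|) posterior over the vocabulary
    conf, tokens = probs.max(dim=-1)                # per-position confidence and token id
    anchor = (conf > th).float().unsqueeze(-1)      # (U, 1) anchor mask
    anchor_emb = bert.get_input_embeddings()(tokens)
    return anchor * anchor_emb + (1.0 - anchor) * h_lm
```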

REFERENCES

[1] S. Zhou, Y. Zhao, S. Xu, B. Xu et al., "Multilingual recurrent neural networks with residual learning for low-resource speech recognition," in INTERSPEECH, 2017, pp. 704–708.
[2] A. Baevski and A. Mohamed, "Effectiveness of self-supervised pre-training for asr," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7694–7698.
[3] A. Baevski, S. Schneider, and M. Auli, "Vq-wav2vec: Self-supervised learning of discrete speech representations," in International Conference on Learning Representations (ICLR), 2020.
[4] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "Wav2vec 2.0: A framework for self-supervised learning of speech representations," Advances in Neural Information Processing Systems, vol. 33, 2020.
[5] A. Conneau, A. Baevski, R. Collobert, A. Mohamed, and M. Auli, "Unsupervised cross-lingual representation learning for speech recognition," arXiv preprint arXiv:2006.13979, 2020.
[6] V. Pratap, A. Hannun, Q. Xu, J. Cai, J. Kahn, G. Synnaeve, V. Liptchinsky, and R. Collobert, "wav2letter++: The fastest open-source speech recognition system," arXiv preprint arXiv:1812.07625, 2018.
[7] L. Dong and B. Xu, "Cif: Continuous integrate-and-fire for end-to-end speech recognition," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6079–6083.
[8] S. Zhou, S. Xu, and B. Xu, "Multilingual end-to-end speech recognition with a single transformer on low-resource languages," arXiv preprint arXiv:1806.05059, 2018.
[9] J. Li, X. Wang, Y. Li et al., "The speechtransformer for large-scale mandarin chinese speech recognition," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 7095–7099.
[10] C. Tran, C. Wang, Y. Tang, Y. Tang, J. Pino, and X. Li, "Cross-modal transfer learning for multilingual speech-to-text translation," arXiv preprint arXiv:2010.12829, 2020.
[11] L. Dong, C. Yi, J. Wang, S. Zhou, S. Xu, X. Jia, and B. Xu, "A comparison of label-synchronous and frame-synchronous end-to-end models for speech recognition," arXiv preprint arXiv:2005.10113, 2020.
[12] G. E. Hinton and R. R. Salakhutdinov, "A better way to pretrain deep boltzmann machines," in Advances in Neural Information Processing Systems, 2012, pp. 2447–2455.
[13] Y.-A. Chung, H.-Y. Lee, and J. Glass, "Supervised and unsupervised transfer learning for question answering," in Proceedings of NAACL-HLT, 2018, pp. 1585–1594.
[14] D. Jiang, X. Lei, W. Li, N. Luo, Y. Hu, W. Zou, and X. Li, "Improving transformer-based speech recognition using unsupervised pre-training," arXiv preprint arXiv:1910.09932, 2019.
[15] J. Gu, J. Bradbury, C. Xiong, V. O. Li, and R. Socher, "Non-autoregressive neural machine translation," in International Conference on Learning Representations (ICLR), 2018.
[16] Y. Higuchi, S. Watanabe, N. Chen, T. Ogawa, and T. Kobayashi, "Mask CTC: Non-autoregressive end-to-end ASR with CTC and mask predict," in Proc. Interspeech 2020, 2020, pp. 3655–3659.
[17] S. Toshniwal, A. Kannan, C. Chiu, Y. Wu, T. N. Sainath, and K. Livescu, "A comparison of techniques for language model integration in encoder-decoder speech recognition," in 2018 IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 369–375.
[18] A. Sriram, H. Jun, S. Satheesh, and A. Coates, "Cold fusion: Training seq2seq models together with language models," in Proc. Interspeech 2018, 2018, pp. 387–391.
[19] C. Shan, C. Weng, G. Wang, D. Su, M. Luo, D. Yu, and L. Xie, "Component fusion: Learning replaceable language model component for end-to-end speech recognition system," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5361–5635.
[20] A. H. Liu, H. Lee, and L. Lee, "Adversarial training of end-to-end speech recognition using a criticizing language model," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6176–6180.
[21] Y. Liu, J. Gu, N. Goyal, X. Li, S. Edunov, M. Ghazvininejad, M. Lewis, and L. Zettlemoyer, "Multilingual denoising pre-training for neural machine translation," arXiv preprint arXiv:2001.08210, 2020.
[22] N. Chen, S. Watanabe, J. Villalba, and N. Dehak, "Listen and fill in the missing letters: Non-autoregressive transformer for speech recognition," arXiv preprint arXiv:1911.04908, 2019.
[23] C. Yi, F. Wang, and B. Xu, "Ectc-docd: An end-to-end structure with ctc encoder and ocd decoder for speech recognition," in INTERSPEECH, 2019, pp. 4420–4424.
[24] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 369–376.
[25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[26] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," in NAACL-HLT (1), 2019.
[27] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, "Albert: A lite bert for self-supervised learning of language representations," in International Conference on Learning Representations (ICLR), 2020.
[28] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, "Xlnet: Generalized autoregressive pretraining for language understanding," in Advances in Neural Information Processing Systems, 2019, pp. 5753–5763.
[29] S. Kim, T. Hori, and S. Watanabe, "Joint ctc-attention based end-to-end speech recognition using multi-task learning," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 4835–4839.
[30] Y. Liu, P. Fung, Y. Yang, C. Cieri, S. Huang, and D. Graff, "Hkust/mts: A very large scale mandarin telephone speech corpus," in International Symposium on Chinese Spoken Language Processing. Springer, 2006, pp. 724–735.
[31] M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli, "fairseq: A fast, extensible toolkit for sequence modeling," in Proceedings of NAACL-HLT 2019: Demonstrations, 2019.
[32] C. Yi, J. Wang, N. Cheng, S. Zhou, and B. Xu, "Applying wav2vec2.0 to speech recognition in various low-resource languages," arXiv preprint arXiv:2012.12121, 2020.
[33] A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in International Conference on Machine Learning, 2014, pp. 1764–1772.
[34] S. Zhou, L. Dong, S. Xu, and B. Xu, "A comparison of modeling units in sequence-to-sequence speech recognition with the transformer on mandarin chinese," in International Conference on Neural Information Processing. Springer, 2018, pp. 210–220.