
Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-resource Speech Recognition

Cheng Yi, Shiyu Zhou, and Bo Xu, Member, IEEE

Abstract—End-to-end models have achieved impressive results on the task of automatic speech recognition (ASR). For low-resource ASR tasks, however, labeled data can hardly satisfy the demand of end-to-end models. Self-supervised acoustic pre-training has already shown impressive ASR performance, while the transcription is still inadequate for language modeling in end-to-end models. In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model. The fused model only needs to learn the transfer from speech to language during fine-tuning on limited labeled data. The length of the two modalities is matched by a monotonic attention mechanism without additional parameters. Besides, a fully connected layer is introduced for the hidden mapping between modalities. We further propose a scheduled fine-tuning strategy to preserve and utilize the text context modeling ability of the pre-trained linguistic encoder. Experiments show that we utilize the pre-trained modules effectively. Our model achieves better recognition performance on the CALLHOME corpus (15 hours) than other end-to-end models.

Index Terms—end-to-end modeling, low-resource ASR, pre-training, wav2vec, BERT

Cheng Yi is with the Institute of Automation, Chinese Academy of Sciences, China, and the University of Chinese Academy of Sciences, China (e-mail: [email protected]). Shiyu Zhou is with the Institute of Automation, Chinese Academy of Sciences, China (e-mail: [email protected]). Bo Xu is with the Institute of Automation, Chinese Academy of Sciences, China (e-mail: [email protected]).

I. INTRODUCTION

Pipeline methods decompose the task of automatic speech recognition (ASR) into three components to model: acoustics, pronunciation, and language [1]. This decomposition dramatically decreases the difficulty of ASR tasks, requiring much less labeled data to converge. With a self-supervised pre-trained acoustic model, the pipeline method can achieve impressive recognition accuracy with as few as 10 hours of transcribed speech [2]–[5]. However, it is criticized that the three components are combined by two fixed weights (pronunciation and language), which is inflexible [6].

On the contrary, end-to-end models integrate the three components into one and directly transform the input speech features into the output text. Among end-to-end models, the sequence-to-sequence (S2S) model, composed of an encoder and a decoder, is the dominant structure [7]–[10]. End-to-end modeling achieves better results than pipeline methods on most public datasets [9], [11]. Nevertheless, it requires at least hundreds of hours of transcribed speech for training. A deep neural network has an enormous parameter space to explore, which is hard to train with limited labeled data.

Pre-training can help the end-to-end model work well on the target ASR task under the low-resource condition [12]. Supervised pre-training, also known as supervised transfer learning [13], uses the knowledge learned from other tasks and applies it to the target one [8]. However, this solution requires sufficient and domain-similar labeled data, which is hard to satisfy. Another solution is to partly pre-train the end-to-end model with unlabeled data. For example, [14] pre-trains the acoustic encoder of a Transformer by masked predictive coding (MPC), and the model obtains further improvement over a strong ASR baseline. Unlike the encoder, the decoder of the S2S model cannot be separately pre-trained since it is conditioned on the acoustic representation. In other words, it is difficult to guarantee the consistency between pre-training and fine-tuning for the decoder.

In this work, we abandon the attempt to realize linguistic pre-training for the S2S model. Instead, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into a single end-to-end ASR model. The fused model has separately been exposed to adequate speech and text data, so it only needs to learn the transfer from speech to language during fine-tuning with limited labeled data. To bridge the length gap between the speech and language modalities, a monotonic attention mechanism without additional parameters is applied. Besides, a fully connected layer is introduced for the mapping between hidden states of the two modalities. Our model works in a non-autoregressive (NAR) way [15] due to the absence of a well-defined decoder structure. NAR models have a speed advantage [16] and can perform comparably to autoregressive ones [15]. Unlike in its self-supervised pre-training, the linguistic encoder is fed the acoustic representation during fine-tuning. This inconsistency can severely degrade the representation ability of the linguistic encoder. We help this module get along with the acoustic encoder by a scheduled fine-tuning strategy.

II. RELATED WORK

A lot of works propose methods to leverage text data for the end-to-end model. Deep fusion [17] and cold fusion [18], [19] integrate a pre-trained auto-regressive language model (LM) into a S2S model. In this setting, the S2S model is randomly initialized and still needs a lot of labeled data for training. Knowledge distillation [20] requires a seed end-to-end ASR model, and it is not as convenient as the pretrain-and-finetune paradigm.


Fig. 1. The structure of w2v-cif-bert. On the left part, a variant CIF mechanism converts h^AC to l without any additional module; on the right part, h^LM is not directly fed into BERT but is mixed with embedded labels in advance during training. Modules connected by dotted lines are ignored during inference. Numbers in "()" indicate the size of the hidden vectors in the model.

[10] builds a S2S model with a pre-trained acoustic encoder and a multilingual linguistic decoder. The decoder is part of a S2S model (mBART [21]) pre-trained on text data. Although this model achieves great results on the task of speech-to-text translation, it is not verified on ASR tasks. Besides, that work does not deal with the inconsistency we mentioned above.

[16] takes advantage of an NAR decoder to revise the greedy CTC outputs from the encoder, where low-confidence tokens are masked. The decoder is required to predict the tokens corresponding to those masked positions by taking both the unmasked context and the acoustic representation into account. [22] iteratively predicts masked tokens based on partial results. As mentioned above, however, these NAR decoders cannot be separately pre-trained since they rely on the acoustic representation.

III. METHODOLOGY

We propose an innovative end-to-end model called w2v-cif-bert, which consists of wav2vec2.0 (pre-trained on speech corpus), BERT (pre-trained on text corpus), and a CIF mechanism to connect the above two modules. The detailed realization of our model is demonstrated in Fig. 1. It is worth noting that only the fully connected layer in the middle (marked as red) does not participate in any pre-training.

A. Acoustic Encoder

We choose wav2vec2.0 as the acoustic encoder since it has been well verified [4], [10], [23]. Wav2vec2.0 is a pre-trained encoder that converts raw speech signals into acoustic representation [4]. During pre-training, it masks the speech input in the latent space and solves a contrastive task. Wav2vec2.0 can outperform previous semi-supervised methods simply through fine-tuning on transcribed speech with the CTC criterion [24]. Wav2vec2.0 is composed of a feature encoder and a context network. The feature encoder outputs frames with a stride of about 20 ms and a receptive field of 25 ms of audio. The context network is a stack of self-attention blocks for context modeling [25]. In this work, wav2vec2.0 is used to convert the speech x into the acoustic representation h^AC (the w2v encoder in our model), which is colored blue in Fig. 1.
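To make the data flow concrete, a minimal sketch of extracting h^AC with a pre-trained wav2vec2.0 model is given below. It uses the HuggingFace checkpoint "facebook/wav2vec2-base" purely as a stand-in for the fairseq wav2vec_small.pt checkpoint listed in Sec. IV-A; the checkpoint choice and variable names are assumptions for illustration, not the paper's exact implementation.

```python
# Hedged sketch: obtain the acoustic representation h_AC from raw audio.
import torch
from transformers import Wav2Vec2Model

w2v_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")  # stand-in checkpoint
w2v_encoder.eval()

waveform = torch.randn(1, 32000)  # 2 s of 16 kHz audio, batch size 1

with torch.no_grad():
    h_ac = w2v_encoder(waveform).last_hidden_state  # (1, T, 768), roughly one frame per 20 ms

print(h_ac.shape)  # the last hidden dimension is later split into values and a CIF weight
```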
B. Modality Adaptation

It is common to apply global attention to connect the acoustic and linguistic representations [25]. However, this mechanism is to blame for poor generalization to text length [11], which becomes worse under sample scarcity. Instead, we use the continuous integrate-and-fire (CIF) mechanism [7] to bridge the discrepant sequence lengths. CIF constrains a monotonic alignment between the acoustic and linguistic representations, and this reasonable assumption drastically decreases the difficulty of learning the alignment. In the original work [7], CIF uses a local module to assign the attention value to each input frame. To avoid introducing additional parameters, we instead regard the last dimension of h^AC as the raw scalar attention value (before the sigmoid operation), as demonstrated in the left part of Fig. 1. The normalized attention values α_t are accumulated along the time dimension T, and a linguistic representation l_u is emitted whenever the accumulated α_t surpasses 1.0. During training, the sum of attention values for one input sample is resized to the number of output tokens y (n* = |y|). The formalized operations in CIF are:

\alpha_t = \mathrm{sigmoid}(h_t^{AC}[-1])    (1)

\alpha'_t = \mathrm{resize}(\alpha_t \mid \alpha_{1:T}, n^*)    (2)

l_u = \sum_t \alpha'_t \cdot h_t^{AC}[:-1]    (3)

where h^AC represents the acoustic vectors with length T, and l represents the accumulated acoustic vectors with length U.

CIF introduces a quantity loss to supervise the encoder in generating the correct number of final modeling units:

\hat{n} = \sum_t^T \alpha_t    (4)

L_{qua} = \lVert n^* - \hat{n} \rVert_2    (5)

where n̂ represents the predicted decoding length. During inference, we add an extra rounding operation on n̂ to simulate the n* used between Eq. (1) and Eq. (2).

Based on the matched sequence length, the accumulated acoustic vector l is mapped into the linguistic vector h^LM by a randomly initialized fully connected layer (FC), realizing the modality adaptation.
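The following is a minimal, single-utterance sketch of this variant CIF, assuming h_ac has shape (T, 768) with the last dimension reserved for the raw weight. The function name, the simplification of firing at most once per frame, and the handling of the inference-time rounding are my own choices rather than the paper's exact code; the quantity loss of Eqs. (4)–(5) would then be computed from the unresized weights as |n* − Σ_t α_t|.

```python
import torch

def cif_parameter_free(h_ac: torch.Tensor, n_star: int = None, beta: float = 1.0):
    """Sketch of the variant CIF of Sec. III-B for one utterance.

    h_ac: (T, D) acoustic vectors; the last dimension is read as the raw
    attention scalar (Eq. 1), the remaining D-1 dimensions are integrated.
    n_star: number of target tokens during training; None at inference,
    where the rounded predicted length takes its place.
    """
    alpha = torch.sigmoid(h_ac[:, -1])      # Eq. (1): (T,)
    values = h_ac[:, :-1]                   # (T, D-1)

    if n_star is None:                      # inference: round the predicted length
        n_star = int(torch.round(alpha.sum()).item())
    alpha = alpha * n_star / alpha.sum()    # Eq. (2): resize so the weights sum to n_star

    outputs, acc, frame = [], 0.0, torch.zeros_like(values[0])
    for t in range(values.size(0)):         # Eq. (3): integrate and fire at beta = 1.0
        w = alpha[t]
        if acc + w < beta:                  # keep integrating the current token
            acc = acc + w
            frame = frame + w * values[t]
        else:                               # fire: emit l_u and start the next token
            used = beta - acc               # portion of w that completes the current token
            outputs.append(frame + used * values[t])
            acc = w - used                  # the spill starts the next accumulation
            frame = acc * values[t]
    return torch.stack(outputs) if outputs else values.new_zeros(0, values.size(1))
```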

C. Linguistic Encoder

We choose BERT as the linguistic encoder in our model. BERT is a masked LM: it applies the mask-predict criterion for self-supervised training and utilizes both left and right context on a huge amount of text data [26]. BERT has empirically shown impressive performance on various NLP tasks [26]–[28]. BERT is composed of three modules: an embedding table (embedding) that converts tokens to hidden vectors, a final fully connected layer (to_vocab_0) that converts hidden vectors to an output softmax over the vocabulary, and a Transformer encoder (BERT encoder) for bidirectional context modeling. These modules are colored green in Fig. 1. Pre-trained BERT can compensate for the lack of text data on low-resource ASR tasks. Specifically, the prior output distribution can accelerate the convergence, and BERT provides stronger linguistic context.

D. Additional Connections

We add two additional connections after stacking the three modules. Firstly, a high-way connection directly connects the acoustic representation to the final output. Secondly, an auxiliary CTC supervision (L_ctc) is attached to the acoustic encoder. Both connections allow the target supervision to affect the encoder more effectively [7], [29]. We use BERT's final fully connected layer (to_vocab_0) to initialize the new ones (to_vocab_1 and to_vocab_2). The final output of our model is:

\mathrm{logits} = \lambda_{AC} \cdot \mathrm{logits}_{AC} + \lambda_{LM} \cdot \mathrm{logits}_{LM}    (6)

where λ_AC and λ_LM are the weights for the outputs from the acoustic encoder and the linguistic encoder. The final output is supervised by the cross-entropy criterion (L_ce) [25]. The final loss during fine-tuning over labeled speech is:

L = L_{ce} + \mu_1 \cdot L_{qua} + \mu_2 \cdot L_{ctc}    (7)

where μ_1 and μ_2 are the weights for these losses, respectively.

E. Scheduled Modality Fusion

We notice a distinct mismatch between BERT as a text feature extractor during pre-training and BERT as a linguistic encoder in the ASR model during fine-tuning. BERT cannot process h^LM according to its pre-trained knowledge. It needs to greatly adjust the parameters at the bottom, which is significantly different from fine-tuning on NLP tasks. Worst of all, BERT's parameters cannot transfer from the bottom up at a minimum cost; on the contrary, BERT will undergo a massive top-down change following the back-propagation of gradients.

In response to the above mismatch, we randomly replace h^LM with embedded target tokens y along the linguistic length U at a scheduled gold rate p ∈ [0, 1], as demonstrated in the right part of Fig. 1. p decreases during fine-tuning so that BERT gets rid of the dependency on y. At the beginning of fine-tuning, BERT cannot understand the frames from h^LM and views them as masked input; it mainly utilizes the gold context (label embedding vectors) to predict. Due to the consistency with pre-training, BERT quickly converges to high prediction accuracy. Along with fine-tuning and the decrease of p, BERT gradually grasps the meaning of h^LM and predicts more accurately. During inference, p is set to 0 and BERT predicts with pure h^LM.
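A hedged sketch of this scheduled fusion is given below. The function names and the linear decay of the gold rate are my reading of the "start → end / steps" notation used in Table II, stated here as assumptions rather than the paper's exact schedule.

```python
import torch
from transformers import BertModel

def gold_rate(step: int, start: float = 0.9, end: float = 0.2, decay_steps: int = 4000) -> float:
    """Linearly decay the gold rate p from `start` to `end` over `decay_steps` updates."""
    return max(end, start - (start - end) * step / decay_steps)

def mix_with_gold(h_lm: torch.Tensor, y: torch.Tensor, bert: BertModel, p: float) -> torch.Tensor:
    """Replace each position of h_lm (U, 768) by the BERT embedding of the gold
    token y[u] with probability p; at inference p = 0 and h_lm passes unchanged."""
    gold_emb = bert.get_input_embeddings()(y)           # (U, 768) label embedding vectors
    use_gold = (torch.rand(y.size(0), 1) < p).float()   # per-position Bernoulli mask
    return use_gold * gold_emb + (1.0 - use_gold) * h_lm

# Usage: the mixed sequence is fed to the BERT encoder as input embeddings, e.g.
#   bert = BertModel.from_pretrained("bert-base-chinese")
#   mixed = mix_with_gold(h_lm, y, bert, gold_rate(step))
#   out = bert(inputs_embeds=mixed.unsqueeze(0)).last_hidden_state
```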
IV. EXPERIMENTS

A. Datasets and Experimental Settings

We focus on low-resource ASR and mainly experiment on the CALLHOME corpus [1]. CALLHOME is a multilingual corpus with less than 20 hours of transcribed speech for each language. In this work, we use CALLHOME Mandarin (MA, LDC96S15) and English (EN, LDC97S20). MA has 23,915 transcribed utterances (15.6 h) for training and 3,021 for testing. EN has 21,194 transcribed utterances (14.9 h) for training and 2,840 for testing. To compare with more works, we also test our model on a relatively large and popular corpus: HKUST [30]. The HKUST corpus (LDC2005S15, LDC2005T32) consists of a training set and a development set, which add up to about 178 hours. Both corpora are telephone conversational speech, which is much more realistic and harder than Librispeech [4].

We use the open-source wav2vec2.0¹ as the acoustic encoder, bert-base-uncased² as the linguistic encoder for English, and bert-base-chinese³ as the linguistic encoder for Chinese. We are free from pre-processing the transcripts since the built-in tokenizers of the BERT models automatically generate the modeling units. All the code and experiments are implemented using fairseq [31]. We keep most of the training settings of the wav2vec2.0 fine-tuning demonstration⁴. We optimize with Adam, warming up the learning rate for 8,000 steps to a peak of 4×10⁻⁵, holding it for 42,000 steps, and then decaying it exponentially. We only use a single GPU (TITAN Xp) for each experiment.

¹ https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small.pt
² https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-12_H-768_A-12.zip
³ https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip
⁴ https://github.com/pytorch/fairseq/tree/master/examples/wav2vec
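The tri-stage learning-rate schedule above can be written down compactly; the sketch below is only an illustration of that description, and the exponential decay constant is a placeholder since the paper does not state its value.

```python
def lr_at(step: int, peak: float = 4e-5, warmup: int = 8000, hold: int = 42000,
          decay: float = 0.999995) -> float:
    """Tri-stage schedule of Sec. IV-A: linear warm-up, constant hold, exponential decay.
    The decay constant is an assumption for illustration."""
    if step < warmup:
        return peak * step / warmup
    if step < warmup + hold:
        return peak
    return peak * decay ** (step - warmup - hold)
```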

Considering the NAR property, our model simply uses greedy search to generate the final results.
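For clarity, a minimal sketch of this greedy NAR decoding over the fused output of Eq. (6) is shown below; the function name is mine, and the default weights mirror the best setting reported in Table II.

```python
import torch

def greedy_decode(logits_ac: torch.Tensor, logits_lm: torch.Tensor,
                  lambda_ac: float = 1.0, lambda_lm: float = 0.2) -> torch.Tensor:
    """Fuse the acoustic and linguistic output distributions (Eq. 6) and take
    the per-position argmax; no beam search or autoregressive loop is needed."""
    logits = lambda_ac * logits_ac + lambda_lm * logits_lm  # (U, |V|)
    return logits.argmax(dim=-1)                            # (U,) predicted token ids
```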

B. Overall Results

In this section, we compare our model with other end-to-end modeling works. Transformer [8] conducts experiments on the low-resource tasks (MA and EN) by supervised transfer learning. w2v-ctc [32] also applies pre-trained wav2vec2.0 as the encoder and adds a randomly initialized linear projection on top of the encoder. w2v-ctc is optimized by minimizing a CTC loss, and it is one of the most concise end-to-end models [33]. In this work, we reproduce w2v-seq2seq [32], which is composed of pre-trained wav2vec2.0 and a Transformer decoder (with 1 or 4 blocks, randomly initialized) connected by cross-attention [25]. The w2v-seq2seq models apply the same modeling units (characters for Chinese and subwords for English) as [32]. We further implement cold fusion for w2v-seq2seq with a pre-trained Transformer LM (6 blocks) trained on a private Chinese text corpus (200M samples). The w2v-seq2seq models decode through beam search with a beam size of 50. All of these models are trained under the same setups as w2v-cif-bert.

TABLE II
ABLATIONS ON THE STRUCTURE OF W2V-CIF-BERT OVER CALLHOME MA. PERFORMANCE IS CER (%) ON THE TEST SET.

Description                Settings                      CER
w2v-cif-bert               µ1 = 0.2, µ2 = 1.0,           32.93
                           λLM = 0.2, λAC = 1.0,
                           no sharing of to_vocab,
                           gold rate 0.9 → 0.2/4000,
                           TH = 0.8
(1) quantity loss          µ1 = 0                        154.7
                           µ1 = 0.5                      33.05
(2) CTC loss               µ2 = 0                        35.77
                           µ2 = 2.0                      33.01
(3) LM weight              λLM = 0.0                     36.43
                           λLM = 0.4                     33.22
(4) acoustic weight        λAC = 0.0                     99.69
(5) share to_vocab         share 0,1                     33.00
                           share 0,2                     78.76
                           share 1,2                     35.10
(6) gold rate schedule     0.9 → 0.2/8000                34.13
                           0.9 → 0.2/2000                33.37
                           0.9 → 0.0/4000                34.92
                           0.2 → 0.2/∞                   33.26
                           0.0 → 0.0/∞                   35.95
(7) confidence threshold   TH = 1.0                      33.37
                           TH = 0.6                      35.51

TABLE I
COMPARISON ON CALLHOME AND HKUST. PERFORMANCE IS CER (%) FOR MA AND HKUST; WER (%) FOR EN.

Model                       MA(15h)   EN(15h)   HKUST(150h)
CIF [7]                     -         -         23.09
Transformer + MPC [14]      -         -         21.70
Transformer [8], [34]       37.62     33.77     26.60
w2v-ctc [32]                36.06     24.93     23.80
w2v-seq2seq
  decoder with 1 block      39.81     26.18     24.06
    + cold fusion           37.90     -         24.02
  decoder with 4 blocks     54.82     47.66     25.73
    + cold fusion           -         -         25.46
w2v-cif-bert                32.93     23.79     22.92

As demonstrated in Table I, our model achieves the best results on the low-resource tasks, showing a promising direction of fusing pre-trained acoustic and linguistic modules. On the task with relatively abundant labeled data, our model still achieves performance comparable to the SOTA. Considering that the MPC training [14] utilizes a rather large speech corpus (10,000 hours) that is similar to the target task, this benchmark is hard to approach. The w2v-seq2seq models are inferior to w2v-ctc, even with cold fusion. We think it is the randomly initialized decoder that causes the low performance under the low-resource condition. Compared with cold fusion, our method makes better use of the pre-trained LM.

C. Ablation Study

We explore different structures and hyper-parameters of w2v-cif-bert through ablations to find a reasonable setting. Some connections in Fig. 1 are shut off by setting the corresponding weights to 0. Ablations are conducted on CALLHOME-MA. Results are listed in Table II. We draw the following conclusions from the corresponding ablations: (1) The quantity loss is indispensable for the CIF mechanism, since a proper alignment between the acoustic and linguistic representations is hard to learn through other supervisions; (2) Adding the auxiliary CTC criterion greatly matters, as it helps the encoder learn the alignment; (3) BERT as the linguistic encoder makes an impressive contribution to the performance, showing our effective utilization of the pre-trained masked LM; (4) The acoustic high-way connection is indispensable for the convergence of our model, playing a similar role as the CTC criterion; (5) Using three separately updated to_vocab modules is the best option.

During fine-tuning, we apply the trick of mixing h^LM with embedded labels at a scheduled sampling rate p. We explore different schedules of p in ablation (6). "→" indicates the range of p, and the number after "/" is the number of decreasing steps. Directly fine-tuning without any mixing (0.0 → 0.0/∞) achieves rather poor performance, demonstrating the necessity of our proposed scheduled fusing trick. A reasonable schedule of the gold rate p during fine-tuning decreases it from a high value to a low one. Keeping p > 0 is another key point. We explain that it makes BERT work better by preserving some anchor tokens.

During inference, we also add some anchor tokens according to the confidence of the acoustic output logits_AC, which is similar to the iterative decoding of NAR models [16]. Tokens are determined when their posterior probabilities surpass a customized threshold TH (TH = 1.0 means no anchor tokens). Then the embedding vectors of these tokens are mixed with h^LM. As we can see in ablation (7), a proper confidence threshold, TH = 0.8 in our best result, contributes to better performance during inference.
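A hedged sketch of this inference-time anchoring follows; the function name and tensor shapes are assumptions, and the threshold default corresponds to the best TH reported in Table II.

```python
import torch
from transformers import BertModel

def add_anchor_tokens(h_lm: torch.Tensor, logits_ac: torch.Tensor,
                      bert: BertModel, th: float = 0.8) -> torch.Tensor:
    """Positions whose acoustic posterior exceeds TH keep the embedding of their
    greedy token instead of the projected acoustic vector; TH = 1.0 disables anchoring."""
    probs = torch.softmax(logits_ac, dim=-1)        # (U, |V|) posterior over the vocabulary
    conf, tokens = probs.max(dim=-1)                # per-position confidence and token id
    anchor = (conf > th).float().unsqueeze(-1)      # (U, 1) anchor mask
    anchor_emb = bert.get_input_embeddings()(tokens)
    return anchor * anchor_emb + (1.0 - anchor) * h_lm
```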

REFERENCES

[1] S. Zhou, Y. Zhao, S. Xu, B. Xu et al., "Multilingual recurrent neural networks with residual learning for low-resource speech recognition," in INTERSPEECH, 2017, pp. 704–708.
[2] A. Baevski and A. Mohamed, "Effectiveness of self-supervised pre-training for asr," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7694–7698.
[3] A. Baevski, S. Schneider, and M. Auli, "Vq-wav2vec: Self-supervised learning of discrete speech representations," in International Conference on Learning Representations (ICLR), 2020.
[4] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "Wav2vec 2.0: A framework for self-supervised learning of speech representations," Advances in Neural Information Processing Systems, vol. 33, 2020.
[5] A. Conneau, A. Baevski, R. Collobert, A. Mohamed, and M. Auli, "Unsupervised cross-lingual representation learning for speech recognition," arXiv preprint arXiv:2006.13979, 2020.
[6] V. Pratap, A. Hannun, Q. Xu, J. Cai, J. Kahn, G. Synnaeve, V. Liptchinsky, and R. Collobert, "wav2letter++: The fastest open-source speech recognition system," arXiv preprint arXiv:1812.07625, 2018.
[7] L. Dong and B. Xu, "Cif: Continuous integrate-and-fire for end-to-end speech recognition," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6079–6083.
[8] S. Zhou, S. Xu, and B. Xu, "Multilingual end-to-end speech recognition with a single transformer on low-resource languages," arXiv preprint arXiv:1806.05059, 2018.
[9] J. Li, X. Wang, Y. Li et al., "The speechtransformer for large-scale mandarin chinese speech recognition," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 7095–7099.
[10] C. Tran, C. Wang, Y. Tang, Y. Tang, J. Pino, and X. Li, "Cross-modal transfer learning for multilingual speech-to-text translation," arXiv preprint arXiv:2010.12829, 2020.
[11] L. Dong, C. Yi, J. Wang, S. Zhou, S. Xu, X. Jia, and B. Xu, "A comparison of label-synchronous and frame-synchronous end-to-end models for speech recognition," arXiv preprint arXiv:2005.10113, 2020.
[12] G. E. Hinton and R. R. Salakhutdinov, "A better way to pretrain deep boltzmann machines," in Advances in Neural Information Processing Systems, 2012, pp. 2447–2455.
[13] Y.-A. Chung, H.-Y. Lee, and J. Glass, "Supervised and unsupervised transfer learning for question answering," in Proceedings of NAACL-HLT, 2018, pp. 1585–1594.
[14] D. Jiang, X. Lei, W. Li, N. Luo, Y. Hu, W. Zou, and X. Li, "Improving transformer-based speech recognition using unsupervised pre-training," arXiv preprint arXiv:1910.09932, 2019.
[15] J. Gu, J. Bradbury, C. Xiong, V. O. Li, and R. Socher, "Non-autoregressive neural machine translation," in International Conference on Learning Representations (ICLR), 2018.
[16] Y. Higuchi, S. Watanabe, N. Chen, T. Ogawa, and T. Kobayashi, "Mask CTC: Non-autoregressive end-to-end ASR with CTC and mask predict," in Proc. Interspeech 2020, 2020, pp. 3655–3659.
[17] S. Toshniwal, A. Kannan, C. Chiu, Y. Wu, T. N. Sainath, and K. Livescu, "A comparison of techniques for language model integration in encoder-decoder speech recognition," in 2018 IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 369–375.
[18] A. Sriram, H. Jun, S. Satheesh, and A. Coates, "Cold fusion: Training seq2seq models together with language models," in Proc. Interspeech 2018, 2018, pp. 387–391.
[19] C. Shan, C. Weng, G. Wang, D. Su, M. Luo, D. Yu, and L. Xie, "Component fusion: Learning replaceable language model component for end-to-end speech recognition system," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5361–5635.
[20] A. H. Liu, H. Lee, and L. Lee, "Adversarial training of end-to-end speech recognition using a criticizing language model," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6176–6180.
[21] Y. Liu, J. Gu, N. Goyal, X. Li, S. Edunov, M. Ghazvininejad, M. Lewis, and L. Zettlemoyer, "Multilingual denoising pre-training for neural machine translation," arXiv preprint arXiv:2001.08210, 2020.
[22] N. Chen, S. Watanabe, J. Villalba, and N. Dehak, "Listen and fill in the missing letters: Non-autoregressive transformer for speech recognition," arXiv preprint arXiv:1911.04908, 2019.
[23] C. Yi, F. Wang, and B. Xu, "Ectc-docd: An end-to-end structure with ctc encoder and ocd decoder for speech recognition," in INTERSPEECH, 2019, pp. 4420–4424.
[24] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 369–376.
[25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[26] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," in NAACL-HLT (1), 2019.
[27] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, "Albert: A lite bert for self-supervised learning of language representations," in International Conference on Learning Representations (ICLR), 2020.
[28] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, "Xlnet: Generalized autoregressive pretraining for language understanding," in Advances in Neural Information Processing Systems, 2019, pp. 5753–5763.
[29] S. Kim, T. Hori, and S. Watanabe, "Joint ctc-attention based end-to-end speech recognition using multi-task learning," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 4835–4839.
[30] Y. Liu, P. Fung, Y. Yang, C. Cieri, S. Huang, and D. Graff, "Hkust/mts: A very large scale mandarin telephone speech corpus," in International Symposium on Chinese Spoken Language Processing. Springer, 2006, pp. 724–735.
[31] M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli, "fairseq: A fast, extensible toolkit for sequence modeling," in Proceedings of NAACL-HLT 2019: Demonstrations, 2019.
[32] C. Yi, J. Wang, N. Cheng, S. Zhou, and B. Xu, "Applying wav2vec2.0 to speech recognition in various low-resource languages," arXiv preprint arXiv:2012.12121, 2020.
[33] A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in International Conference on Machine Learning, 2014, pp. 1764–1772.
[34] S. Zhou, L. Dong, S. Xu, and B. Xu, "A comparison of modeling units in sequence-to-sequence speech recognition with the transformer on mandarin chinese," in International Conference on Neural Information Processing. Springer, 2018, pp. 210–220.