Speech Recognition by Simply Fine-Tuning BERT
Wen-Chin Huang 1,2, Chia-Hua Wu 2, Shang-Bao Luo 2, Kuan-Yu Chen 3, Hsin-Min Wang 2, Tomoki Toda 1
1 Nagoya University, Japan   2 Academia Sinica, Taiwan   3 National Taiwan University of Science and Technology, Taiwan

ABSTRACT

We propose a simple method for automatic speech recognition (ASR) by fine-tuning BERT, a language model (LM) trained on large-scale unlabeled text data that can generate rich contextual representations. Our assumption is that, given a history context sequence, a powerful LM can narrow the range of possible choices, so that the speech signal is needed only as a simple clue. Hence, compared to conventional ASR systems that train a powerful acoustic model (AM) from scratch, we believe that speech recognition is possible by simply fine-tuning a BERT model. As an initial study, we demonstrate the effectiveness of the proposed idea on the AISHELL dataset and show that stacking a very simple AM on top of BERT can yield reasonable performance.

Index Terms— speech recognition, BERT, language model

1. INTRODUCTION

Conventional automatic speech recognition (ASR) systems consist of multiple separately optimized modules, including an acoustic model (AM), a language model (LM), and a lexicon. In recent years, end-to-end (E2E) ASR models have attracted much attention, owing to the belief that jointly optimizing a single model avoids not only task-specific engineering but also error propagation. Current mainstream E2E approaches include connectionist temporal classification (CTC) [1], neural transducers [2], and attention-based sequence-to-sequence (seq2seq) learning [3].

LMs play an essential role in ASR. Even E2E models that implicitly integrate an LM into their optimization can benefit from LM fusion. It is therefore worth asking: how can we make use of the full power of LMs?

Let us consider a situation in which we are in the middle of transcribing a speech utterance: we have already correctly recognized a sequence of history words, and we want to determine the next word being said. From a probabilistic point of view, a strong LM can generate a list of candidates, each of which is highly likely to be the next word. The list may be so short that only one answer is left. As a result, we need few or even no clues from the speech signal to correctly recognize the next word.

There has been rapid development of LMs in the field of natural language processing (NLP), and one of the most epoch-making approaches is BERT [4]. Its success comes from a framework in which a pretraining stage is followed by a task-specific fine-tuning stage. Thanks to the un-/self-supervised objectives adopted in pretraining, large-scale unlabeled datasets can be used for training, making it possible to learn enriched language representations that are powerful on various NLP tasks. BERT and its variants have created a dominant paradigm in NLP in the past year [5, 6, 7].

In this work, we propose a novel approach to ASR, which is to simply fine-tune a pretrained BERT model. Our method, which we call BERT-ASR, formulates ASR as a classification problem, where the objective is to correctly classify the next word given the acoustic speech signal and the history words. We show that even an AM that simply averages the frame-based acoustic features corresponding to a word can be applied to BERT-ASR to correctly transcribe speech to a certain extent, and that the performance can be further boosted by using a more complex model.

2. BERT

BERT [4], which stands for Bidirectional Encoder Representations from Transformers, is a pretraining method with an LM objective built on a Transformer encoder architecture [8]. The full power of BERT is released only through a pretraining-fine-tuning framework, where the model is first trained on a large-scale unlabeled text dataset, and then all or some of its parameters are fine-tuned on a labeled dataset for the downstream task.

The original usage of BERT mainly focused on NLP tasks, ranging from token-level to sequence-level classification, including question answering [9, 10], document summarization [11, 12], information retrieval [13, 14], and machine translation [15, 16], just to name a few. There have also been attempts to combine BERT with ASR, including rescoring [17, 18] and generating soft labels for training [19]. In this section, we review the fundamentals of BERT.

2.1. Model architecture and input representations

BERT adopts a multi-layer Transformer [8] encoder, where each layer contains a multi-head self-attention sublayer followed by a position-wise fully connected feedforward network. Each layer is equipped with residual connections and layer normalization.

The input representation of BERT was designed to handle a variety of downstream tasks, as visualized in Figure 1. First, a token embedding is assigned to each token in the vocabulary. Some special tokens were added in the original BERT, including a classification token ([CLS]), which is prepended to every sequence and whose final hidden state is used as the aggregate sequence representation for classification tasks, and a separation token ([SEP]) for separating two sentences. Second, a segment embedding is added to every token to indicate whether it belongs to sentence A or B. Finally, a learned positional embedding is added so that the model is aware of the relative or absolute position of each token. The input representation of every token is then constructed by summing the corresponding token, segment, and position embeddings.

Fig. 1: The input representation of the original BERT and the proposed BERT-ASR.

2.2. Training and fine-tuning

Two self-supervised objectives were used for pretraining BERT. The first one is masked language modeling (MLM), a denoising objective that asks the model to reconstruct randomly masked input tokens based on context information. Specifically, 15% of the input tokens are first chosen. Then, each chosen token is (1) replaced with [MASK] 80% of the time, (2) replaced with a random token 10% of the time, or (3) kept unchanged 10% of the time. The second objective is next-sentence prediction, which asks the model to judge whether two input sentences are consecutive.

During fine-tuning, depending on the downstream task, minimal task-specific parameters are introduced so that fine-tuning can be cheap in terms of data and training effort.
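For concreteness, the 15% / 80-10-10 corruption rule above can be sketched in a few lines of Python. This is only a minimal illustration of the rule, not the original pretraining code; the mask-token id and vocabulary size are placeholder assumptions.

```python
import random

MASK_ID = 103        # assumed id of [MASK]; depends on the actual vocabulary
VOCAB_SIZE = 21128   # assumed vocabulary size (placeholder)

def apply_mlm_masking(token_ids, mask_prob=0.15):
    """Corrupt a token sequence with the MLM rule; return (inputs, labels).

    labels[i] == -100 marks positions that were not chosen and thus
    contribute nothing to the reconstruction loss.
    """
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() >= mask_prob:       # only 15% of tokens are chosen
            continue
        labels[i] = tok                        # the model must reconstruct this token
        r = random.random()
        if r < 0.8:                            # (1) 80%: replace with [MASK]
            inputs[i] = MASK_ID
        elif r < 0.9:                          # (2) 10%: replace with a random token
            inputs[i] = random.randrange(VOCAB_SIZE)
        # (3) remaining 10%: keep the token unchanged
    return inputs, labels
```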
3. PROPOSED METHOD

In this section, we explain how we fine-tune a pretrained BERT to formulate an LM, and then further extend it to consume acoustic speech signals to achieve ASR.

Assume we have an ASR training dataset containing N speech utterances: D_{ASR} = \{\langle X^{(i)}, y^{(i)} \rangle\}_{i=1}^{N}, with each y = (y_1, \ldots, y_T) being the transcription consisting of T tokens, and each X = (x_1, \ldots, x_{T'}) denoting a sequence of T' input acoustic feature frames. The acoustic features are of dimension d, i.e., x_t \in \mathbb{R}^{d}, and the tokens are drawn from a vocabulary of size V.

3.1. Training a probabilistic LM with BERT

We first show how we formulate a probabilistic LM using BERT, which we will refer to as BERT-LM. The probability of observing a symbol sequence y can be formulated as:

    P(y) = P(y_1, \ldots, y_T) = \prod_{t=1}^{T} P(y_t \mid y_1, \ldots, y_{t-1}).    (1)

The decoding (or scoring) of a given symbol sequence then becomes an iterative process that calculates all the terms in the product, as illustrated in Figure 2. At the t-th time step, the BERT model takes a sequence of previously decoded symbols and the [CLS] token as input, i.e., ([CLS], y_1, \ldots, y_{t-1}). Then, the final hidden state corresponding to [CLS] is fed into a linear classifier, which outputs the probability distribution P(y_t \mid y_1, \ldots, y_{t-1}).

Fig. 2: Illustration of the decoding process of the proposed BERT-ASR.

To train the model, recall that the training dataset contains N utterances; each transcription is expanded into per-step training samples following the rule in Equation (2):

    (y_1, \ldots, y_T) \rightarrow \{ ([CLS]),\ ([CLS], y_1),\ ([CLS], y_1, y_2),\ \ldots,\ ([CLS], y_1, \ldots, y_{T-1}) \}.    (2)

The training of the BERT-LM then becomes simply minimizing the following cross-entropy objective:

    L_{LM} = -\sum_{i=1}^{N} \sum_{t=1}^{T} \log P(y_t^{(i)} \mid [CLS], y_1^{(i)}, \ldots, y_{t-1}^{(i)}).    (3)
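As a rough illustration of how Equations (1)-(3) can be realized, the sketch below expands a token sequence into the prefixes of Equation (2) and accumulates the per-step log-probabilities of Equation (1) with a [CLS]-based classifier head. It assumes the Hugging Face `transformers` package and an off-the-shelf Chinese BERT checkpoint; the checkpoint name, the randomly initialized classifier, and the helper names are illustrative assumptions, not the authors' released implementation.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-chinese")
classifier = torch.nn.Linear(bert.config.hidden_size, tokenizer.vocab_size)

def prefix_samples(token_ids):
    """Eq. (2): one ([CLS], y_1, ..., y_{t-1}) prefix per target position t."""
    cls = tokenizer.cls_token_id
    return [[cls] + token_ids[:t] for t in range(len(token_ids))]

def sequence_log_prob(token_ids):
    """Eq. (1): sum over t of log P(y_t | [CLS], y_1, ..., y_{t-1})."""
    total = 0.0
    for t, prefix in enumerate(prefix_samples(token_ids)):
        ids = torch.tensor([prefix])
        cls_state = bert(input_ids=ids).last_hidden_state[:, 0]    # final [CLS] hidden state
        log_probs = torch.log_softmax(classifier(cls_state), dim=-1)
        total = total + log_probs[0, token_ids[t]]                 # pick the observed token
    return total

# Training (Eq. (3)) would minimize the negative of this quantity summed over all
# utterances, updating the classifier and (all or some of) the BERT parameters.
```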
3.2. BERT-ASR

We now introduce the proposed BERT-ASR. Since the time resolutions of text and speech are at completely different scales, for the model described in Section 3.1 to be capable of taking acoustic features as input, we first assume that we know the nonlinear alignment between an acoustic feature sequence and the corresponding text, as depicted in Figure 3. Specifically, for a pair of training transcription and acoustic feature sequence \langle (x_1, \ldots, x_{T'}), (y_1, \ldots, y_T) \rangle, we denote F_i as the segment of features corresponding to y_i: F_i = (x_{t_{i-1}+1}, \ldots, x_{t_i}) \in \mathbb{R}^{(t_i - t_{i-1}) \times d}, which spans frames t_{i-1}+1 to t_i, and we set t_0 = 0. Thus, the T' acoustic frames can be segmented into T groups, X = (F_1, \ldots, F_T), and a new dataset containing segmented acoustic feature and text pairs can be obtained: D_{seg} = \{\langle F^{(i)}, y^{(i)} \rangle\}_{i=1}^{N}.

We can then augment the BERT-LM into the BERT-ASR by injecting the acoustic information extracted from D_{seg} into the BERT-LM. Specifically, as depicted in Figure 1, an acoustic encoder, which will be described later, consumes the raw acoustic feature segments to generate acoustic embeddings. These are summed with the three types of embeddings in the original BERT, and the result is then fed into BERT.
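A minimal sketch of this injection step is given below, using the frame-averaging acoustic model mentioned in the introduction as the acoustic encoder. It again assumes the Hugging Face `transformers` BERT; the class and function names are illustrative, and the sketch simply pairs one acoustic segment with each input position, whereas the actual mapping of segments to positions follows the paper's Figure 1.

```python
import torch
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-chinese")   # assumed checkpoint
FEAT_DIM = 80                                            # assumed acoustic feature dimension

class AvgAcousticEncoder(torch.nn.Module):
    """Averages the frames of each segment F_i and projects to BERT's hidden size."""

    def __init__(self, feat_dim, hidden_size):
        super().__init__()
        self.proj = torch.nn.Linear(feat_dim, hidden_size)

    def forward(self, segments):
        # segments: list of tensors, each of shape (num_frames_i, feat_dim)
        pooled = torch.stack([seg.mean(dim=0) for seg in segments])   # (T, feat_dim)
        return self.proj(pooled)                                      # (T, hidden_size)

acoustic_encoder = AvgAcousticEncoder(FEAT_DIM, bert.config.hidden_size)

def bert_asr_cls_state(prefix_ids, segments):
    """One decoding step: text history plus aligned acoustic segments -> [CLS] state.

    Assumes len(segments) == len(prefix_ids), i.e., one acoustic segment per
    input position (an illustrative simplification of the alignment in Fig. 1).
    """
    ids = torch.tensor([prefix_ids])                                  # ([CLS], y_1, ..., y_{t-1})
    token_embeds = bert.embeddings.word_embeddings(ids)               # (1, t, hidden)
    acoustic_embeds = acoustic_encoder(segments).unsqueeze(0)         # (1, t, hidden)
    # Summing here and letting BERT add its segment/position embeddings internally
    # realizes the "token + segment + position + acoustic" sum described above.
    outputs = bert(inputs_embeds=token_embeds + acoustic_embeds)
    return outputs.last_hidden_state[:, 0]   # fed into the same linear classifier as BERT-LM
```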