SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding

Yu-An Chung¹*, Chenguang Zhu²*, Michael Zeng²
¹MIT Computer Science and Artificial Intelligence Laboratory
²Microsoft Cognitive Services Group
[email protected], {chezhu,nzeng}@microsoft.com

Abstract

Spoken language understanding (SLU) requires a model to analyze input acoustic signal to understand its linguistic content and make predictions. To boost the models' performance, various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text. However, the inherent disparities between the two modalities necessitate a mutual analysis. In this paper, we propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules. Besides conducting a self-supervised masked language modeling task on the two individual modules using unpaired speech and text, SPLAT aligns representations from the two modules in a shared latent space using a small amount of paired speech and text. Thus, during fine-tuning, the speech module alone can produce representations carrying both acoustic information and contextual semantic knowledge of an input acoustic signal. Experimental results verify the effectiveness of our approach on various SLU tasks. For example, SPLAT improves the previous state-of-the-art performance on the Spoken SQuAD dataset by more than 10%.

1 Introduction

Spoken language understanding (SLU) tackles the problem of comprehending audio signals and making predictions related to the content. SLU has been widely employed in various areas such as intent understanding (Tur and De Mori, 2011; Bhargava et al., 2013; Ravuri and Stolcke, 2015; Lugosch et al., 2019), question answering (Lee et al., 2018; Chuang et al., 2020), and sentiment analysis (Zadeh et al., 2018). Early approaches leverage a two-step pipeline: use automatic speech recognition (ASR) to transcribe input audio into text, and then employ language understanding models to produce results. However, such a cascaded system has several drawbacks. First, the transcription produced by the ASR module often contains errors, which adversely affects the language understanding module's prediction accuracy. Second, even if the transcription is perfect, the rich prosodic information of speech (e.g., tempo, pitch, and intonation) is inevitably lost after ASR. In comparison, humans often leverage this information to better understand and disambiguate the content. Therefore, there has been a rising trend of end-to-end approaches that retain information from audio signals to carry out the understanding task (Serdyuk et al., 2018; Chen et al., 2018; Haghani et al., 2018).

While end-to-end SLU methods are effective, they often suffer from a shortage of labeled training data, especially when the target task is in a novel domain. One solution is to leverage self-supervised training as is done in pre-trained language models. Examples like BERT (Devlin et al., 2019), GPT (Radford et al., 2018), and RoBERTa (Liu et al., 2019) are first pre-trained on large-scale unannotated text in a self-supervised fashion to learn rich textual representations before being fine-tuned on downstream tasks with a modest amount of labeled data. Borrowing this idea, several pre-training methods have been proposed for speech, e.g., wav2vec (Schneider et al., 2019; Baevski et al., 2020a), contrastive predictive coding (Oord et al., 2018; Rivière et al., 2020), autoregressive predictive coding (Chung et al., 2019a, 2020; Chung and Glass, 2020b), and DeCoAR (Ling et al., 2020; Ling and Liu, 2020), to capture contextual representations from unlabeled speech data. Nevertheless, these methods leverage only acoustic data and mainly focus on modeling the acoustic information during pre-training. As a result, the produced representations may not be optimal for language understanding tasks.

To solve these problems, we propose a novel SPeech-LAnguage joint pre-Training framework, SPLAT.

* Equal contribution. The work was done when Yu-An Chung was interning at Microsoft.

Figure 1: Overview of SPLAT. First, the speech and language modules are separately pre-trained using speech and text data via masked language modeling (MLM). In practice, we directly employ the BERTBASE model released by Devlin et al. (2019) to be the language module. Then, by leveraging a small amount of paired speech and text data, either a sequence-level alignment loss L_seq or a token-level alignment loss L_tok is applied to align the representations from both modules in a shared latent space (only L_seq is shown here). During alignment, the language module is kept frozen and only the speech module is updated. Before aligning the two modules, there is an optional step to update the BERTBASE-initialized language module via MLM using the text portion from the paired data. This optional step aims to adapt the language module to the speech domain to facilitate later alignment. After pre-training, the language module is discarded and only the speech module is used in downstream tasks.

SPLAT contains a speech module and a language module for multi-modal understanding. The speech module is a Transformer encoder trained from scratch and the language module is initialized from BERT. Both modules leverage large-scale unannotated data for pre-training via masked language modeling. In the speech module, each frame is seen as a token and is replaced with a zero vector with a certain probability. For each masked frame, we minimize the L1-distance between the predicted frame and the original frame.

Then, to make the speech module aware of the contextual information extracted from the language module, we design an alignment loss to align the representations from both modules in a shared latent semantic space. In detail, we propose two alignment methods, a sequence-level one and a token-level one, that leverage a small amount of paired speech and text to minimize the disparity between the acoustic representations from the speech module and the textual representations from the language module. In this way, the speech representations will carry not only the acoustic information but also the contextual knowledge from the text. After this alignment, when text input is absent during fine-tuning, the speech module alone can produce representations that bridge the speech input and the language understanding output.

We conduct extensive evaluations on several downstream SLU tasks, including Fluent Speech Commands for intent detection, Switchboard for dialog act classification, CMU-MOSEI for spoken sentiment analysis, and Spoken SQuAD for spoken question answering. SPLAT achieves superior results on all datasets. For example, SPLAT improves the previous state-of-the-art performance on the Spoken SQuAD dataset by more than 10%. Furthermore, we show that SPLAT can perform well even given just a tiny portion of the labeled training data in downstream tasks.

2 Related Work

Spoken language understanding  In recent years, due to its flexibility and effectiveness, end-to-end spoken language understanding (SLU) has been proposed and applied to various tasks (Qian et al., 2017; Serdyuk et al., 2018; Lugosch et al., 2019). For instance, Qian et al. (2017) use an auto-encoder to initialize the SLU model. Lugosch et al. (2019) pre-train the model to recognize words and phonemes, and then fine-tune it on downstream tasks. Chen et al. (2018) pre-train the model to categorize graphemes, and the logits are fed into the classifier. In most of these approaches, the model pre-training requires annotated speech, e.g., words or phonemes corresponding to audio signals. As a result, the massive unlabeled speech data cannot be utilized by these models.

Self-supervised pre-training for language  Pre-trained models have achieved great success in both the language and speech domains. In language, BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), UniLM (Dong et al., 2019), and BART (Lewis et al., 2020) have been successfully applied to natural language inference (Zhang et al., 2020b), question answering (Zhu et al., 2018), and summarization (Zhu et al., 2019). These pre-trained models leverage self-supervised tasks such as masked language modeling (MLM), next sentence prediction, and de-noising autoencoding.

Self-supervised pre-training for speech  In speech, wav2vec (Schneider et al., 2019) leverages contrastive learning to produce contextual representations for audio input; vq-wav2vec (Baevski et al., 2020a) and wav2vec 2.0 (Baevski et al., 2020b) further propose to discretize the original continuous audio signals in order to enable more efficient MLM training with Transformer (Vaswani et al., 2017). Pre-trained speech models have been applied to ASR (Ling et al., 2020; Chung and Glass, 2020a; Baevski et al., 2020b), phoneme recognition (Song et al., 2020; Liu et al., 2020a), speech translation (Nguyen et al., 2020; Chung et al., 2019c), and speech synthesis (Chung et al., 2019b), to name a few.

Nevertheless, an SLU model must incorporate both acoustic and language understanding capabilities to project speech signals to semantic outputs. Thus, a pre-trained model for SLU needs to address tasks beyond a single modality.

Speech and language joint pre-training  Recently, SLU applications have prompted joint pre-training on both speech and text data. SpeechBERT (Chuang et al., 2020) applies MLM to pairs of audio and transcripts. However, there are several crucial differences compared to our work. First, SpeechBERT contains a phonetic-semantic embedding module that requires forced alignment to first segment speech into word segments. Second, both the pre-training and fine-tuning phases of SpeechBERT require both speech and text input, since it is designed for a specific spoken question answering task. However, many SLU tasks only take speech as input, which does not align with the design of SpeechBERT. In contrast, our model can learn to align acoustic and textual representations using just (a small amount of) paired data during pre-training, and only needs speech input for downstream tasks.

Denisov and Vu (2020) propose to align speech and language embeddings in a method similar to ours. However, there are several key differences. First, Denisov and Vu (2020) employ the encoder of a pre-trained ASR model, which already requires plenty of annotated speech to obtain. Our model, on the other hand, conducts self-supervised learning to pre-train the speech module using unannotated speech. Secondly, besides sequence-level alignment, we propose a token-level alignment method, which is suitable for token-level downstream tasks. Last but not least, our model uses much less paired speech and text for alignment (10 hours) than Denisov and Vu (2020) (1,453 hours), yet still largely outperforms their method on intent detection and dialog act classification.

3 Method

In this section we present SPLAT, a framework for learning joint contextual representations of speech and language. The model consists of a speech module and a language module that share a similar architecture and learning algorithm. The pre-training of SPLAT is divided into two steps. First, we individually pre-train the speech and language modules using unannotated speech and text, respectively. Then, we leverage a simple yet effective alignment task that uses only a small amount of paired speech and text data to align the representations from both modules in a shared latent semantic space, such that the information learned by the language module is transferred to the speech module. After pre-training, the language module is discarded and only the speech module is used in downstream tasks.

Below we formally describe the procedures for pre-training the speech (§3.1) and language modules (§3.2), and the alignment loss (§3.3) for aligning the representations from the two modules. Figure 1 provides an overview of the pre-training procedures of SPLAT.

Speech "! 0.7 0.6 0.4 0.4 0.3 ∑) idf () max ,-../0(.*, ()) "" 0.5 0.4 0.5 0.7 0.3 * ℒ&'( = − ∑) idf () "# 0.6 0.9 0.4 0.4 0.6 “What day is today?” Language "$ 0.5 0.3 0.9 0.5 0.2 "! "" "# "$ Pairwise cosine similarity Maximum similarity Compute loss

Figure 2: Token-level alignment between speech and language modules. (s_1, ..., s_5) are the output embeddings of the speech module and (t_1, ..., t_4) are those of the language module.

3.1 Speech module pre-training

The goal of this module is to leverage unlabeled speech data to learn representations that capture meaningful acoustic information about speech utterances such as their phonetic content and speaker characteristics. Formally, the input to the speech module is an 80-dimensional log Mel spectrogram, (x_1, ..., x_n), where x_i ∈ R^80, 1 ≤ i ≤ n. The speech module, which is implemented as a Transformer architecture, then produces hidden representations (s_1, ..., s_n) and predictions (x̂_1, ..., x̂_n), where s_i ∈ R^768 and x̂_i ∈ R^80.

To boost its capacity for contextual understanding, we borrow the idea of masked language modeling (MLM) (Devlin et al., 2019; Liu et al., 2020c; Wang et al., 2020; Liu et al., 2020b). Specifically, each audio frame x_i is replaced with a zero vector with a probability of 15%. The corresponding output x̂_i is trained to be close to the original frame x_i via minimizing their L1-distance. Additionally, since consecutive frames are highly correlated, it is possible that the model simply utilizes the local smoothness of speech signals for reconstructing a single frame and thus fails to capture useful information. To avoid this issue, when a frame x_i is selected to be masked, its following three frames x_{i+1}, x_{i+2}, and x_{i+3} are also masked, and the model is asked to reconstruct all these masked frames.

Furthermore, following SpecAugment (Park et al., 2019), the input features (x_1, ..., x_n) can be seen as comprising two dimensions: time, i.e., the subscript i, and channel, i.e., the elements in each x_i. While conventional MLM masks along certain time steps, the input signals can also be masked along the channel dimension. In other words, each column vector [x_{1,j}, ..., x_{n,j}] for 1 ≤ j ≤ 80 has a 15% chance of being masked, i.e., replaced with a zero vector. This channel masking is combined with temporal masking to reinforce the model's capability to utilize contextual information from both time and channel, and to reduce the impact of co-adaptation between acoustic frames. The final pre-training objective for the speech module is to reconstruct the entire input sequence from the altered version of it:

    L_{sp} = \sum_{i=1}^{n} \| x_i - \hat{x}_i \|_1    (1)

We use the speech portion of the train-clean-360 subset from the LibriSpeech corpus (Panayotov et al., 2015) to pre-train the speech module, i.e., to minimize L_sp. This subset contains 360 hours of read speech produced by 921 speakers. We follow the standard Kaldi setting, using a frame size of 25ms and a time shift of 10ms for generating the 80-dimensional log Mel spectrograms. The spectrograms are normalized to zero mean and unit variance per speaker.
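To make the masking scheme concrete, below is a minimal PyTorch-style sketch of the speech-module pre-training objective described in this subsection: time masking that also covers the three following frames, channel masking over the 80 Mel bins, and the L1 reconstruction loss of Equation 1. The function names, the `speech_module(x_alt)` interface returning hidden states and frame predictions, and the batch-mean normalization are our own illustrative assumptions, not the authors' released implementation.

```python
import torch

def mask_spectrogram(x, time_p=0.15, chan_p=0.15, span=3):
    """x: (batch, n_frames, 80) log Mel spectrogram.
    Returns the altered input and the boolean mask of time-masked frames."""
    x_alt = x.clone()
    batch, n, d = x.shape

    # Time masking: each frame starts a masked span with probability `time_p`;
    # the `span` following frames are masked as well.
    start = torch.rand(batch, n, device=x.device) < time_p
    time_mask = start.clone()
    for k in range(1, span + 1):
        time_mask[:, k:] |= start[:, :-k]
    x_alt[time_mask] = 0.0

    # Channel masking: each of the 80 Mel channels is zeroed out with probability `chan_p`.
    chan_mask = torch.rand(batch, d, device=x.device) < chan_p
    x_alt = x_alt * (~chan_mask).unsqueeze(1).float()

    return x_alt, time_mask

def speech_pretrain_loss(speech_module, x):
    """L1 reconstruction of the original frames from the altered input
    (Equation 1, here additionally averaged over frames and the batch)."""
    x_alt, _ = mask_spectrogram(x)
    s, x_hat = speech_module(x_alt)   # hidden states (batch, n, 768), predictions (batch, n, 80)
    return (x_hat - x).abs().sum(dim=-1).mean()
```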
3.2 Language module pre-training

The language module aims to offer contextual understanding for text input. We directly employ the BERTBASE model released by Devlin et al. (2019), which is pre-trained on a large text corpus with the MLM task and contains rich textual representations, as the language module. We denote the cross-entropy loss for the language MLM task as L_text. Given input token embeddings (y_1, ..., y_m), where y_1 corresponds to the [CLS] token, the module produces contextual representations (t_1, ..., t_m), where t_j ∈ R^768, 1 ≤ j ≤ m.

3.3 Aligning speech and language representations

The input to most SLU tasks consists of only audio signals, but the model is required to conduct semantic understanding, which can be best handled when textual information is present. Therefore, we propose to align the pre-trained speech and language representations in a shared semantic latent space. Suppose a pair of speech and text data consists of an acoustic feature sequence (x_1, ..., x_n) and its transcript (y_1, ..., y_m). The speech and language modules separately produce the output representations (s_1, ..., s_n) and (t_1, ..., t_m). We then propose two methods to align the embeddings from the modules: sequence-level and token-level alignment.

Sequence-level alignment  For sequence-level alignment, we treat the first embeddings from the two output representations, i.e., s_1 and t_1, as the sequence-level representations of their respective sequences, and minimize their L1-distance:

    L_{seq} = \| s_1 - t_1 \|_1    (2)

Since our goal is to transfer the textual knowledge contained in the language module to the speech module, we only update the speech module to minimize L_seq and keep the language module fixed. After pre-training, when the transcript is absent in downstream tasks, the first output embedding of the speech module s_1 will still be close to its corresponding text embedding t_1 from the language module, as if the transcript were given. It follows that s_1 can then be used to predict the property of the whole audio input, e.g., for intent classification.

Token-level alignment  To achieve a finer level of alignment, each audio feature should be compared with each text token. Although forced alignment (Gorman et al., 2011) can establish this correspondence between audio signals and individual words, it requires a pre-trained ASR system. Here we propose a method that automatically aligns audio features with textual tokens. Inspired by BERTScore (Zhang et al., 2020a), for each output text embedding t_j, we first compute its cosine similarity with each output acoustic embedding s_i, and select the acoustic feature with the highest similarity. Then, the alignment is performed by maximizing the sum of these maximum similarities over all tokens, weighted by each token's inverse document frequency (idf) to reduce the impact of common words:

    L_{tok} = - \frac{\sum_{j=1}^{m} \mathrm{idf}(t_j) \, \max_i \mathrm{cossim}(s_i, t_j)}{\sum_{j=1}^{m} \mathrm{idf}(t_j)}    (3)

The token-level alignment loss is illustrated in Figure 2. Same as for L_seq, when minimizing L_tok, the language module is kept fixed and only the speech module is updated.

To minimize the alignment loss, we randomly sample 10 hours of audio paired with its transcripts from the train-clean-360 subset, of which the speech portion is used to pre-train the speech module (§3.1). In practice, before minimizing the alignment loss, we find it beneficial to train (i.e., minimize L_text for) the language module initialized with BERTBASE on the 10-hour LibriSpeech transcripts with the MLM task. This step allows the model to adapt to the speech domain and facilitates the following alignment task.

We summarize the complete procedure of pre-training SPLAT in Algorithm 1. After pre-training, the language module is discarded and only the speech module is used in downstream tasks.
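For reference, the two alignment objectives can be written compactly as follows. This is a sketch under our own assumptions about tensor shapes and about how the idf statistics are supplied; it is not the authors' code. Detaching the language-module outputs mirrors the paper's choice of keeping the language module frozen so that only the speech module is updated.

```python
import torch
import torch.nn.functional as F

def sequence_alignment_loss(s, t):
    """Equation 2: L1 distance between the first speech embedding s_1 and the
    first ([CLS]) text embedding t_1, averaged over the batch.
    s: (batch, n, 768) speech-module outputs; t: (batch, m, 768) language-module outputs."""
    return (s[:, 0] - t[:, 0].detach()).abs().sum(dim=-1).mean()

def token_alignment_loss(s, t, idf_weights):
    """Equation 3: for every text token, find the most similar speech frame by
    cosine similarity, then average these maxima weighted by the token idf.
    idf_weights: (batch, m) inverse document frequencies of the text tokens."""
    s_norm = F.normalize(s, dim=-1)                    # (batch, n, 768)
    t_norm = F.normalize(t.detach(), dim=-1)           # (batch, m, 768); language module stays frozen
    sim = torch.bmm(t_norm, s_norm.transpose(1, 2))    # (batch, m, n) pairwise cosine similarities
    max_sim = sim.max(dim=-1).values                   # (batch, m) best speech frame per token
    weighted = (idf_weights * max_sim).sum(dim=-1) / idf_weights.sum(dim=-1)
    return -weighted.mean()                            # negative because similarity is maximized
```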
Algorithm 1  Pre-training SPLAT

Input: An unlabeled speech corpus X = {x^(p)}_{p=1}^{N}, an unlabeled text corpus Y = {y^(q)}_{q=1}^{M}, and a paired speech-text corpus Z = {(x^(k), y^(k))}_{k=1}^{K}, where K ≪ N, M.
1: Use X to train the speech module by minimizing L_sp (Equation 1).
2: Use Y to train the language module by minimizing L_text (we directly employ BERTBASE from Devlin et al. (2019) for this step).
3: Use {y^(k)}_{k=1}^{K} from Z to train the language module by minimizing L_text.
4: Use Z to align the two modules by minimizing L_seq (Equation 2) or L_tok (Equation 3).
5: Discard the language module.
Output: The final speech module.

4 Experiment Setup

4.1 Baselines

We include a number of strong baselines from recent literature for each downstream task (Lugosch et al., 2019; Duran and Battle, 2018; Ghosal et al., 2018; Chuang et al., 2020). We also compare with another speech-language joint pre-training framework (Denisov and Vu, 2020). For each baseline, the reported performance is achieved by a system that uses similar or larger amounts of data than our model.

To verify the effectiveness of each component in SPLAT, we experiment with the following variants of it, differing in whether to pre-train the model, whether to use the language module, and which alignment task to apply. Table 1 summarizes the considered model variants.

Table 1: Variants of SPLAT. A ✗ indicates that the variant does not incorporate this step during pre-training. The step numbers correspond to those listed in Algorithm 1.

Model variant | Step 1. Pre-train speech module | Step 2. Pre-train language module | Step 3. Adapt language module before alignment | Step 4. Type of alignment loss
SPLAT-Scratch | ✗ | ✗ | ✗ | ✗
SPLAT-Speech | ✓ | ✗ | ✗ | ✗
SPLAT-Seq | ✓ | ✓ | ✗ | L_seq
SPLAT-Seq-MLM | ✓ | ✓ | ✓ | L_seq
SPLAT-Tok | ✓ | ✓ | ✗ | L_tok
SPLAT-Tok-MLM | ✓ | ✓ | ✓ | L_tok

Table 2: Summary of SLU datasets. For the rows of Train, Validation, and Test, the numbers indicate the number of utterances in the split.

Task | Intent detection | Dialog act classification | Spoken sentiment analysis | Spoken question answering
Dataset | FSC | SwBD | CMU-MOSEI | Spoken SQuAD
Num. of classes | 31 | 42 | 7 | -
Train/val/test | 23.1k/3.1k/3.8k | 97.8k/8.6k/2.5k | 16.2k/1.8k/4.6k | 35.1k/2.0k/5.4k

• SPLAT-Scratch: No pre-training is conducted at all. The speech module is trained from scratch on downstream tasks.

• SPLAT-Speech: Only the speech module is pre-trained. The language module and the alignment loss are not incorporated.

• SPLAT-Seq: SPLAT with the sequence-level alignment loss L_seq, but the language module is not trained on LibriSpeech transcripts with MLM before alignment.

• SPLAT-Seq-MLM: SPLAT with the sequence-level alignment loss L_seq, and the language module is trained on LibriSpeech transcripts with MLM before alignment.

• SPLAT-Tok: SPLAT with the token-level alignment loss L_tok, but the language module is not trained on LibriSpeech transcripts with MLM before alignment.

• SPLAT-Tok-MLM: SPLAT with the token-level alignment loss L_tok, and the language module is trained on LibriSpeech transcripts with MLM before alignment.

The speech module of SPLAT is a 3-layer Transformer encoder where each layer has a hidden size of 768 and 12 self-attention heads. The language module is directly initialized from the pre-trained BERTBASE released by Devlin et al. (2019).
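For concreteness, a configuration along the following lines would match the stated speech-module architecture (3 Transformer layers, hidden size 768, 12 self-attention heads, 80-dimensional input frames). This is a hypothetical sketch: the feed-forward size, dropout, and the omission of positional encodings are our simplifications, not details given in the paper.

```python
import torch.nn as nn

class SpeechModule(nn.Module):
    """3-layer Transformer encoder over 80-dim log Mel frames."""
    def __init__(self, n_mels=80, d_model=768, n_heads=12, n_layers=3, d_ff=3072, dropout=0.1):
        super().__init__()
        self.input_proj = nn.Linear(n_mels, d_model)           # project frames into the model dimension
        layer = nn.TransformerEncoderLayer(d_model, n_heads, d_ff, dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.recon_head = nn.Linear(d_model, n_mels)            # predicts x̂_i for the reconstruction loss

    def forward(self, x):                                       # x: (batch, n_frames, 80)
        s = self.encoder(self.input_proj(x))                    # hidden representations s_i in R^768
        return s, self.recon_head(s)                            # (s_1..s_n), (x̂_1..x̂_n)
```

This interface (hidden states plus frame predictions) is the same one assumed by the pre-training loss sketch in §3.1.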

4.2 Downstream SLU Tasks

We evaluate our model on four different SLU applications: intent detection, dialog act classification, spoken sentiment analysis, and spoken question answering. The first three are multi-class classification tasks, and the last one is a span prediction problem, which will be described in more detail below. Table 2 summarizes the dataset used for each application. For all datasets, we use 80-dimensional log Mel spectrograms as input acoustic features, as in the pre-training stage.

Intent detection  We use the Fluent Speech Commands corpus (FSC) (Lugosch et al., 2019) for intent detection, where the goal is to correctly predict the intent of an input utterance. In this dataset, each utterance is annotated with three slots: action, object, and location, where each slot can take one of multiple values. The combination of slot values is defined as the intent of the utterance, and there are 31 unique intents in total. In this work we follow the original paper and formulate intent detection as a simple 31-class classification task.

Dialog act classification  We use the NXT-format Switchboard corpus (SwDA) (Calhoun et al., 2010), a dialog corpus of 2-speaker conversations. The goal is to correctly classify an input utterance into one of the 42 dialog acts.

Spoken sentiment analysis  We use the CMU-MOSEI dataset (Zadeh et al., 2018), where each utterance is annotated with a sentiment score on a [-3, 3] Likert scale: [-3: highly negative, -2: negative, -1: weakly negative, 0: neutral, +1: weakly positive, +2: positive, +3: highly positive]. We treat the task as a 7-class classification problem, and we only use the audio signals in the input data.

For the above three tasks, during fine-tuning, an MLP network with one hidden layer of 512 units is appended on top of the speech module. It converts the output representation of the first frame, i.e., s_1, into a class prediction. Both the pre-trained speech module and the randomly initialized MLP are fine-tuned on the training set for 10 epochs with a batch size of 64 and a fixed learning rate of 3e-4. We compute classification accuracy after each training epoch and pick the best-performing checkpoint on the validation set to report results on the test set.
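A sketch of this classification fine-tuning setup is given below: a one-hidden-layer MLP with 512 units on top of the first output frame s_1, trained jointly with the speech module for 10 epochs with batch size 64 and learning rate 3e-4. The optimizer choice (Adam), the data-loader interface, and the module names are our assumptions; the paper does not specify them.

```python
import torch
import torch.nn as nn

class SLUClassifier(nn.Module):
    """Pre-trained speech module plus an MLP head over the first output frame s_1."""
    def __init__(self, speech_module, n_classes, d_model=768, d_hidden=512):
        super().__init__()
        self.speech_module = speech_module
        self.head = nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                                  nn.Linear(d_hidden, n_classes))

    def forward(self, x):                      # x: (batch, n_frames, 80)
        s, _ = self.speech_module(x)           # discard the reconstruction predictions
        return self.head(s[:, 0])              # classify from s_1

def finetune(model, train_loader, epochs=10, lr=3e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)   # optimizer choice is an assumption
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in train_loader:              # y: intent / dialog act / sentiment class ids
            optimizer.zero_grad()
            criterion(model(x), y).backward()
            optimizer.step()
```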

Spoken question answering  We use the Spoken SQuAD dataset (Li et al., 2018), which is augmented from SQuAD (Rajpurkar et al., 2016) for spoken question answering.¹ The model is given an article in the form of speech and a question in the form of text. The goal is to predict a time span in the spoken article that answers the question. In other words, the model outputs an audio segment extracted from the spoken article as the answer. The model is evaluated by the Audio Overlapping Score (AOS) (Li et al., 2018): the greater the overlap between the predicted span and the ground-truth answer span, the higher the score.

¹ Li et al. (2018) used Google text-to-speech to generate the spoken version of the articles in SQuAD.

During fine-tuning, given a spoken article and a question in text form, the pre-trained speech module extracts audio representations of the article and passes them to a randomly initialized 3-layer Transformer encoder along with the tokenized textual question as input. The Transformer then uses the self-attention mechanism to implicitly align elements of the input audio and textual features. For each time step of the audio input, the Transformer is trained to predict whether it is the start of the span with a simple logistic regression. A separate classifier is used for predicting the end of the span.
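The span-prediction head can be sketched as follows: a randomly initialized 3-layer Transformer encoder consumes the concatenated audio representations and question token embeddings, and two independent per-time-step classifiers score each audio position as the span start or end. The concatenation scheme and module names are our own assumptions about one plausible realization, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class SpokenQAHead(nn.Module):
    """Fuses audio representations with question token embeddings and scores
    each audio time step as a potential answer-span start or end."""
    def __init__(self, d_model=768, n_heads=12, n_layers=3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, n_layers)   # randomly initialized
        self.start_clf = nn.Linear(d_model, 1)                 # per-step logistic regression (start)
        self.end_clf = nn.Linear(d_model, 1)                   # per-step logistic regression (end)

    def forward(self, audio_repr, question_emb):
        # audio_repr: (batch, n_audio, 768) from the pre-trained speech module
        # question_emb: (batch, n_tokens, 768) embeddings of the tokenized question
        fused = self.fusion(torch.cat([audio_repr, question_emb], dim=1))
        audio_part = fused[:, :audio_repr.size(1)]             # keep only the audio positions
        start_logits = self.start_clf(audio_part).squeeze(-1)  # (batch, n_audio)
        end_logits = self.end_clf(audio_part).squeeze(-1)
        return start_logits, end_logits
```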

Table 3: Results on all downstream datasets. All numbers of our models are an average of three runs, of which the variances are negligibly small and not included. The metric is classification accuracy for FSC, SwBD, and CMU-MOSEI. The metric for Spoken SQuAD is the Audio Overlapping Score (AOS).

Model | FSC | SwBD | CMU-MOSEI | Spoken SQuAD
Ours:
SPLAT-Scratch | 97.6 | 65.8 | 68.8 | 30.4
SPLAT-Speech | 99.5 | 67.5 | 69.0 | 57.7
SPLAT-Seq | 99.5 | 74.6 | 72.5 | 62.7
SPLAT-Seq-MLM | 99.5 | 76.3 | 74.7 | 65.9
SPLAT-Tok | 99.2 | 71.2 | 70.4 | 58.0
SPLAT-Tok-MLM | 99.2 | 72.7 | 71.2 | 63.8
SPLAT-Seq-MLM 1-hour | 99.5 | 75.8 | 65.3 | 65.3
Baselines:
Lugosch et al. (2019) | 98.8 | - | - | -
Duran and Battle (2018) | - | 75.5 | - | -
Ghosal et al. (2018) | - | - | 75.9 | -
Chuang et al. (2020) | - | - | - | 59.7
Denisov and Vu (2020) | 95.5 | 60.2 | - | -

5 Results and Analysis

5.1 Main results

Table 3 shows the performance of models on all four downstream tasks. Each number from our model is an average over three runs. Based on the results, we make the following observations.

Firstly, compared with SPLAT-Scratch, all pre-trained models achieve superior results, especially the gain of more than 30% on Spoken SQuAD, proving the effectiveness of pre-training.

Figure 3: Performance on downstream tasks with varying training data sizes (one panel per dataset: FSC, MOSEI, Spoken SQuAD, SwBD; accuracy versus training data size in %, comparing SPLAT-Seq-MLM, SPLAT-Speech, and SPLAT-Scratch). All numbers are an average of three runs, of which variances are negligibly small and not included.

Secondly, the inclusion of the language module and the alignment task during pre-training is very beneficial. For instance, on CMU-MOSEI, SPLAT-Seq-MLM outperforms SPLAT-Speech by 5.7%, and outperforms several baseline systems from recent literature. We argue that as SLU tasks require the model to interpret acoustic signals and their underlying semantics, the language module will guide the speech module towards a mutual understanding of both modalities via our alignment task.

Thirdly, updating the language module using MLM during pre-training is helpful. Although the language module has been initialized with BERT, adaptation to the speech domain can help with semantic understanding in the downstream task.

Types of alignment  Comparing SPLAT-Seq against SPLAT-Tok, we find that sequence-level alignment outperforms token-level alignment on all four tasks, although the latter is supposed to learn more fine-grained multi-modal representations. We leave the investigation of the reasons for this phenomenon and more advanced token-level alignment approaches for future work.

Low-resource scenario  We experiment with a version of SPLAT that uses only 1 hour of transcribed speech randomly sampled from the LibriSpeech train-clean-360 subset for aligning the speech and language modules, denoted as SPLAT-Seq-MLM 1-hour. The language module of SPLAT-Seq-MLM 1-hour, after being initialized with BERTBASE, is trained on the 1-hour LibriSpeech transcripts before minimizing the alignment loss. It achieves comparable results with the best variant SPLAT-Seq-MLM: the same accuracy on FSC, 0.5% less on SwBD, and 0.6% less on Spoken SQuAD. This shows that with a small amount of labeled speech data, our pre-training framework can achieve good results on downstream tasks.

5.2 Robustness to the size of downstream training data

As human labeling is time-consuming and labor-intensive, the amount of labeled training data for downstream tasks is often small and insufficient. In this section, we show that with effective pre-training, the model becomes less dependent on the amount of downstream labeled data.

We randomly sample 50%, 10%, 5%, and 1% of the training data in the downstream tasks, and evaluate the performance of different variants of SPLAT when fine-tuned on the sampled data. Figure 3 shows the performance on all four downstream tasks with varying training data sizes. We observe that among the variants, SPLAT-Seq-MLM is the least sensitive to training data size. For instance, on FSC, with only 10% of the training data, its accuracy drops by only 0.4 points. In comparison, both SPLAT-Scratch and SPLAT-Speech drop by about 10 points, and the gaps are in general larger when the size of the training data further shrinks. Therefore, our proposed joint pre-training of speech and language modules can help the model quickly adapt to downstream tasks given a modest amount of training data.

5.3 The geometry of the speech latent space before and after alignment

So far we have empirically demonstrated the effectiveness of SPLAT for learning multi-modal speech-language representations that are useful in various SLU tasks. Here we further show that our sequence-level alignment loss (Equation 2) can help project two speech utterances that have similar textual embeddings to nearby points in the speech latent space.

Recall that we use the embedding of the first token/feature to represent an utterance and conduct sequence-level alignment (Equation 2). Suppose t_1^(p) and s_1^(p) correspond to the textual and speech embeddings of the first utterance produced by SPLAT, and t_1^(q) and s_1^(q) correspond to the embeddings of the second utterance. Then, if t_1^(p) ≈ t_1^(q), our SPLAT model trained with the sequence-level alignment loss will produce s_1^(p) ≈ s_1^(q).

We use the dev-clean subset from the LibriSpeech corpus for the analysis. First, we compute the average pairwise cosine similarity between the speech embeddings of all utterances:

    S_{avg} = \frac{1}{K(K-1)/2} \sum_{p=2}^{K} \sum_{q=1}^{p-1} \mathrm{cossim}(s_1^{(p)}, s_1^{(q)}),    (4)

where K is the number of utterances in dev-clean.

Next, for each utterance with its speech and textual embeddings denoted as s_1^(p) and t_1^(p) respectively, we first use t_1^(p) to retrieve the utterance with the most similar textual embedding t_1^(q*), i.e., q* = argmax_{1 ≤ q ≤ K, q ≠ p} cossim(t_1^(p), t_1^(q)). We then compute the cosine similarity between s_1^(p) and s_1^(q*) and take the average of this value over all utterances in dev-clean:

    S_{closest} = \frac{1}{K} \sum_{p=1}^{K} \mathrm{cossim}(s_1^{(p)}, s_1^{(q^*)}).    (5)

We show the S_avg and S_closest of the embeddings produced by SPLAT-Speech, SPLAT-Seq, and SPLAT-Seq-MLM in Table 4.

Table 4: Average cosine similarity between all pairs of speech embeddings (S_avg), and the average cosine similarity between a speech embedding s_1^(p) and that of the utterance whose textual embedding is closest to the corresponding textual embedding t_1^(p) (S_closest).

Model | S_avg | S_closest
SPLAT-Speech | 0.136 | 0.238
SPLAT-Seq | 0.144 | 0.781
SPLAT-Seq-MLM | 0.148 | 0.829
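The two statistics are straightforward to compute from the per-utterance embeddings. The sketch below is our own illustration of Equations 4 and 5, assuming the first-position speech embeddings s_1 and textual embeddings t_1 of all K utterances have already been stacked into two matrices.

```python
import torch
import torch.nn.functional as F

def latent_space_stats(s1, t1):
    """s1, t1: (K, 768) first-position speech / text embeddings of K utterances.
    Returns (S_avg, S_closest) as defined in Equations 4 and 5."""
    K = s1.size(0)
    s_sim = F.normalize(s1, dim=-1) @ F.normalize(s1, dim=-1).T   # (K, K) speech-speech cosines
    t_sim = F.normalize(t1, dim=-1) @ F.normalize(t1, dim=-1).T   # (K, K) text-text cosines

    # Equation 4: mean speech-embedding similarity over all pairs p != q.
    off_diag = ~torch.eye(K, dtype=torch.bool)
    s_avg = s_sim[off_diag].mean()

    # Equation 5: for each utterance, find its textual nearest neighbour (excluding itself)
    # and average the speech-embedding similarity to that neighbour.
    t_sim.fill_diagonal_(float("-inf"))
    q_star = t_sim.argmax(dim=-1)                                 # (K,)
    s_closest = s_sim[torch.arange(K), q_star].mean()

    return s_avg.item(), s_closest.item()
```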
We see that S_avg is approximately the same for all model variants. However, S_closest, the average similarity between the speech embeddings of two linguistically similar utterances, increases from 0.238 to 0.781 after aligning the speech and language modules, and further increases to 0.829 after adapting the language module on LibriSpeech transcripts with MLM before the alignment. Overall, SPLAT makes a pair of semantically similar utterances have much closer speech embeddings, compared with random pairs of utterances. These results demonstrate that via a cross-modal alignment loss as simple as Equation 2, SPLAT can effectively transfer knowledge from the language module to the speech module to capture both acoustic and linguistic information of speech utterances.

6 Conclusions

Spoken language understanding (SLU) tasks require an understanding of the input audio signal and its underlying semantics. In this paper, we present a novel speech-language joint pre-training framework, SPLAT, to carry out both speech and language understanding tasks during pre-training. Besides self-supervised training of the speech and language modules, we propose two methods to align the semantic representations from both modules using a modest amount of labeled speech data. The speech module can quickly adapt to downstream tasks and achieve superior results on various SLU datasets including intent detection, dialog act classification, spoken sentiment analysis, and spoken question answering. This joint pre-training also makes the model less sensitive to the amount of labeled training data in downstream domains. For future work, we plan to integrate automatic speech recognition and natural language generation into our framework to achieve good results on spoken language generation tasks.

References

Alexei Baevski, Steffen Schneider, and Michael Auli. 2020a. vq-wav2vec: Self-supervised learning of discrete speech representations. In ICLR.

Alexei Baevski, Yuhao Zhou, Abdel-rahman Mohamed, and Michael Auli. 2020b. wav2vec 2.0: A framework for self-supervised learning of speech representations. In NeurIPS.

Aditya Bhargava, Asli Celikyilmaz, Dilek Hakkani-Tür, and Ruhi Sarikaya. 2013. Easy contextual intent prediction and slot detection. In ICASSP.

Sasha Calhoun, Jean Carletta, Jason Brenier, Neil Mayo, Dan Jurafsky, Mark Steedman, and David Beaver. 2010. The NXT-format Switchboard Corpus: A rich resource for investigating the syntax, semantics, pragmatics and prosody of dialogue. Language Resources and Evaluation, 44(4):387–419.
Yuan-Ping Chen, Ryan Price, and Srinivas Bangalore. 2018. Spoken language understanding without speech recognition. In ICASSP.

Yung-Sung Chuang, Chi-Liang Liu, and Hung-Yi Lee. 2020. SpeechBERT: Cross-modal pre-trained language model for end-to-end spoken question answering. In Interspeech.

Yu-An Chung and James Glass. 2020a. Generative pre-training for speech with autoregressive predictive coding. In ICASSP.

Yu-An Chung and James Glass. 2020b. Improved speech representations with multi-target autoregressive predictive coding. In ACL.

Yu-An Chung, Wei-Ning Hsu, Hao Tang, and James Glass. 2019a. An unsupervised autoregressive model for speech representation learning. In Interspeech.

Yu-An Chung, Hao Tang, and James Glass. 2020. Vector-quantized autoregressive predictive coding. In Interspeech.

Yu-An Chung, Yuxuan Wang, Wei-Ning Hsu, Yu Zhang, and RJ Skerry-Ryan. 2019b. Semi-supervised training for improving data efficiency in end-to-end speech synthesis. In ICASSP.

Yu-An Chung, Wei-Hung Weng, Schrasing Tong, and James Glass. 2019c. Towards unsupervised speech-to-text translation. In ICASSP.

Pavel Denisov and Ngoc Thang Vu. 2020. Pretrained semantic speech embeddings for end-to-end spoken language understanding via cross-modal teacher-student learning. In Interspeech.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In NeurIPS.

Nathan Duran and Steve Battle. 2018. Probabilistic word association for dialogue act classification with recurrent neural networks. In EANN.

Deepanway Ghosal, Md Shad Akhtar, Dushyant Chauhan, Soujanya Poria, Asif Ekbal, and Pushpak Bhattacharyya. 2018. Contextual inter-modal attention for multi-modal sentiment analysis. In EMNLP.

Kyle Gorman, Jonathan Howell, and Michael Wagner. 2011. Prosodylab-aligner: A tool for forced alignment of laboratory speech. Canadian Acoustics, 39(3):192–193.

Parisa Haghani, Arun Narayanan, Michiel Bacchiani, Galen Chuang, Neeraj Gaur, Pedro Moreno, Rohit Prabhavalkar, Zhongdi Qu, and Austin Waters. 2018. From audio to semantics: Approaches to end-to-end spoken language understanding. In SLT.

Chia-Hsuan Lee, Shang-Ming Wang, Huan-Cheng Chang, and Hung-Yi Lee. 2018. ODSQA: Open-domain spoken question answering dataset. In SLT.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In ACL.

Chia-Hsuan Li, Szu-Lin Wu, Chi-Liang Liu, and Hung-Yi Lee. 2018. Spoken SQuAD: A study of mitigating the impact of speech recognition errors on listening comprehension. In Interspeech.

Shaoshi Ling and Yuzong Liu. 2020. DeCoAR 2.0: Deep contextualized acoustic representations with vector quantization. arXiv preprint arXiv:2012.06659.

Shaoshi Ling, Yuzong Liu, Julian Salazar, and Katrin Kirchhoff. 2020. Deep contextualized acoustic representations for semi-supervised speech recognition. In ICASSP.

Alexander H. Liu, Yu-An Chung, and James Glass. 2020a. Non-autoregressive predictive coding for learning speech representations from local dependencies. arXiv preprint arXiv:2011.00406.

Andy T. Liu, Shang-Wen Li, and Hung-Yi Lee. 2020b. TERA: Self-supervised learning of transformer encoder representation for speech. arXiv preprint arXiv:2007.06028.

Andy T. Liu, Shu-Wen Yang, Po-Han Chi, Po-Chun Hsu, and Hung-Yi Lee. 2020c. Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders. In ICASSP.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Loren Lugosch, Mirco Ravanelli, Patrick Ignoto, Vikrant Singh Tomar, and Yoshua Bengio. 2019. Speech model pre-training for end-to-end spoken language understanding. In Interspeech.

Ha Nguyen, Fethi Bougares, Natalia Tomashenko, Yannick Estève, and Laurent Besacier. 2020. Investigating self-supervised pre-training for end-to-end speech translation. In Interspeech.

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. LibriSpeech: An ASR corpus based on public domain audio books. In ICASSP.

Daniel Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin Cubuk, and Quoc Le. 2019. SpecAugment: A simple data augmentation method for automatic speech recognition. In Interspeech.

Yao Qian, Rutuja Ubale, Vikram Ramanaryanan, Patrick Lange, David Suendermann-Oeft, Keelan Evanini, and Eugene Tsuprun. 2017. Exploring ASR-free end-to-end modeling to improve spoken language understanding in a cloud-based dialog system. In ASRU.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical report, OpenAI.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP.

Suman Ravuri and Andreas Stolcke. 2015. Recurrent neural network and LSTM models for lexical utterance classification. In Interspeech.

Morgane Rivière, Armand Joulin, Pierre-Emmanuel Mazaré, and Emmanuel Dupoux. 2020. Unsupervised pretraining transfers well across languages. In ICASSP.

Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. 2019. wav2vec: Unsupervised pre-training for speech recognition. In Interspeech.

Dmitriy Serdyuk, Yongqiang Wang, Christian Fuegen, Anuj Kumar, Baiyang Liu, and Yoshua Bengio. 2018. Towards end-to-end spoken language understanding. In ICASSP.

Xingchen Song, Guangsen Wang, Zhiyong Wu, Yiheng Huang, Dan Su, Dong Yu, and Helen Meng. 2020. Speech-XLNet: Unsupervised acoustic model pre-training for self-attention networks. In Interspeech.

Gokhan Tur and Renato De Mori. 2011. Spoken language understanding: Systems for extracting semantic information from speech. John Wiley & Sons.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.

Weiran Wang, Qingming Tang, and Karen Livescu. 2020. Unsupervised pre-training of bidirectional speech encoders via masked reconstruction. In ICASSP.

AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In ACL.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Weinberger, and Yoav Artzi. 2020a. BERTScore: Evaluating text generation with BERT. In ICLR.

Zhuosheng Zhang, Yuwei Wu, Hai Zhao, Zuchao Li, Shuailiang Zhang, Xi Zhou, and Xiang Zhou. 2020b. Semantics-aware BERT for language understanding. In AAAI.

Chenguang Zhu, Ziyi Yang, Robert Gmyr, Michael Zeng, and Xuedong Huang. 2019. Make lead bias in your favor: A simple and effective method for news summarization. arXiv preprint arXiv:1912.11602.

Chenguang Zhu, Michael Zeng, and Xuedong Huang. 2018. SDNet: Contextualized attention-based deep network for conversational question answering. arXiv preprint arXiv:1812.03593.