Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 5 January 2021 doi:10.20944/preprints202101.0081.v1

What all do audio transformer models hear? Probing Acoustic Representations for Language Delivery and its Structure

Jui Shah∗,1 Yaman Kumar Singla∗,1,2,3 Changyou Chen3 Rajiv Ratn Shah1 1IIIT-Delhi 2Adobe 3State University of New York at Buffalo [email protected],[email protected],[email protected],[email protected]

Abstract

In recent times, BERT based transformer models have become an inseparable part of the 'tech stack' of text processing models. Similar progress is being observed in the speech domain, with a multitude of models obtaining state-of-the-art results by using audio transformer models to encode speech. This begs the question of what these audio transformer models are learning. Moreover, although the standard methodology is to choose the last-layer embeddings for any downstream task, is that the optimal choice? We try to answer these questions for two recent audio transformer models, Mockingjay and wave2vec2.0. We compare them on a comprehensive set of language delivery and structure features including audio, fluency and pronunciation features. Additionally, we probe the audio models' understanding of textual surface, syntax, and semantic features and compare them to BERT. We do this over exhaustive settings for native, non-native, synthetic, read and spontaneous speech datasets.

* Authors (listed in alphabetical order) contributed equally.

1 Introduction

Since the advent of transformers in the computational field in 2017 (Vaswani et al., 2017), they have gained great attention across various domains for a wide variety of tasks. Tay et al. (2020) survey prominent transformer models, which have now become a formidable force in the tech stack in Natural Language Processing (NLP), Computer Vision and Reinforcement Learning. Their inherent property to facilitate parallel training makes it easier to train models on large datasets. These pre-trained models are then fine-tuned on a variety of user-specific downstream tasks, achieving state-of-the-art results. At the same time, recent research has started focusing on interpreting what these models learn that helps such a wide variety of downstream tasks. Besides, as more and more applications start relying on such models, it becomes all the more important to explain what these embeddings capture, to check for potential flaws and biases that can affect a large number of applications. To this end, different research studies started probing language model embeddings for particular linguistic properties of interest. Belinkov et al. (2017) probed for part-of-speech language understanding, Hewitt and Manning (2019) for syntax, Peters et al. (2018) for morphology, Zhang et al. (2020) for scales and numbers, etc. However, progress in the audio domain has been very limited, with only a few works (Raj et al., 2019; Alishahi et al., 2017; Belinkov and Glass, 2017).

Transformers have predominantly addressed the discrete data domain. Hence, the NLP and vision fields oversaw a tremendous amount of work on transformer-based modelling. Speech, being in the continuous domain, lagged behind. As one of the first models for this problem, vq-wav2vec (Baevski et al., 2019) proposed a 2-stage pipeline. It discretizes input speech into a K-way quantized embedding space. This is similar to tokens in NLP tasks. The embeddings are then extracted from a BERT-based model. However, this technique does not capture the context representation and dependencies across the time domain essential for continuous speech. Wav2vec2.0 (Baevski et al., 2020) addresses this issue by designing three subunits - the feature encoder, the transformer, and the quantization module (discussed in Section 5.2). These units convert the input audio to latent space embeddings via a contrastive task. This task involves selecting the correct quantized latent representation of the masked time steps from a distractor set.

© 2021 by the author(s). Distributed under a Creative Commons CC BY license.

Mockingjay (Liu et al., 2020) and AudioALBERT (Chi et al., 2020) are other such transformer models. These are modified versions of BERT for the audio domain. They do not have an inbuilt feature extractor module. Hence, the former takes as input 160-dim mel features and the latter takes in 160-dim fbank features. They share the same architecture, with the difference being that AudioALBERT has shared parameters across the 12 encoder units, while for Mockingjay they are different.

These audio transformers have been applied over many diverse downstream speech-language processing (SLP) tasks with state-of-the-art results. Tasks such as phoneme classification (Graves and Schmidhuber, 2005), speaker recognition (Tian et al., 2020), automatic scoring (Grover et al., 2020), and sentiment classification (Tang et al., 2020) have shown promising results even with pre-trained transformers. This also begs the question as to what these transformer models are able to learn during the pretraining phase for the various evaluation tasks. The sentiment of the above inquiry is also conveyed by Prof. Ray Mooney's quip that the meaning of a whole sentence cannot be captured by a $&!#* vector (Conneau et al., 2018; Mooney, 2014).

In this paper, we make the following contributions. (1) We propose a detailed analysis of what the two recent transformer-based semi-supervised audio encoder models, Mockingjay and wav2vec2.0, learn. We do this by implementing post hoc probing on the embeddings extracted from each of the intermediate units of the transformer models. We probe those embeddings using an extensive number of features (46 in total), each categorized by the linguistic property they probe. We do this for text-based, audio-based, vocabulary-based, fluency-based, and suprasegmental pronunciation-based features. The results help us lay out a map of the layers where a particular feature or category of features is learnt, while also providing a metric of comparison between the two models. This measures what the models are learning on various linguistic tasks, which can then inform downstream tasks using these models. (2) We test the models for their representative effectiveness on different types of speech settings: native-read, native-spontaneous, and non-native-read. We find that, for the most part, the native-spontaneous and non-native speech settings follow the result patterns of the native-read dataset, albeit with a worse performance. We find that, in general, the type of speaker matters less than the type of speech. Therefore, for both the models, we observe that non-native read speech performs better than spontaneous speech in general. (3) Additionally, we identify the role of the feature extractor module in wav2vec2.0, which enables it to process raw input audio sampled at 16 kHz without any preprocessing. We find that the subsequent layers of the feature encoder encode all features into increasingly dense and informative representation vectors without any "intelligent processing" on them. (4) We compare the performance of the representations obtained by the audio models and BERT on text features. This is the first work to check the representative capacity of audio representations for the text captured by audio. We find that despite having no text-specific error metrics, the audio models are able to encode text well and are comparable to BERT on several parameters. We find that the dataset used to pre-train the audio models has a significant effect on the downstream performance. Surprisingly, while both wave2vec2.0 and Mockingjay outperform BERT on LibriSpeech (the dataset they were trained on), they underperform in other settings. Additionally, both models seem to learn surface-level text features (such as the number of nouns and pronouns) comparably to BERT.

We release our code, datasets and tools used to perform the experiments and inferences. To the best of our knowledge, this is the first attempt towards interpreting audio transformer models. The conclusion points out that the transformers are able to learn a holistic range of features, which enables them to perform with great accuracy on various downstream tasks even while training solely on unlabeled speech.

2 Problem Definition and Data

Given the latent representations of a model's intermediate layers, we define the problem of probing that model for the knowledge of different linguistic features as a regression task. With the intermediate-layer embeddings as input, the probing model is trained to map them to normalized feature values which are extracted from the data. The probe is a 3-layer fully connected neural network with the hidden layer having a ReLU activation and dropout to avoid over-fitting. The model dimensions are (768, 128, 1) for all the intermediate layers of the transformers and (512, 128, 1) for the feature extractor. The Adam optimizer with a learning rate of 0.0001 is used. We compare the representative capacity of different embeddings on the basis of the loss values reported by the prober. Further, we take a randomly-initialized vector as a baseline to compare against all the 'intelligent' embeddings.
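As an illustration of this setup, the sketch below extracts one time-averaged embedding per intermediate layer of a wav2vec 2.0 checkpoint and fits the 3-layer probe described above. It is a minimal sketch only: it assumes the HuggingFace transformers implementation of wav2vec 2.0 (the paper does not prescribe a toolkit), the dropout rate, batching and epoch count are illustrative choices, and `waveform_16khz`, `embeddings` and `targets` stand in for the real data and extracted feature values.

```python
# Minimal probing sketch. Assumptions: HuggingFace wav2vec 2.0 checkpoint,
# illustrative dropout/epochs; data variables are placeholders.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

device = "cuda" if torch.cuda.is_available() else "cpu"
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").to(device).eval()

def layerwise_embeddings(waveform_16khz: torch.Tensor):
    """Return one time-averaged 768-dim vector per layer for a raw 16 kHz waveform."""
    with torch.no_grad():
        out = encoder(waveform_16khz.unsqueeze(0).to(device), output_hidden_states=True)
    # hidden_states[0] is the projected feature-encoder output; 1..12 are the blocks.
    return [h.mean(dim=1).squeeze(0).cpu() for h in out.hidden_states]

class Probe(nn.Module):
    """3-layer probe with dims (768, 128, 1), ReLU and dropout, as in Section 2."""
    def __init__(self, in_dim=768, hidden=128, p_drop=0.2):  # dropout rate assumed
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Dropout(p_drop), nn.Linear(hidden, 1))
    def forward(self, x):
        return self.net(x).squeeze(-1)

def train_probe(embeddings: torch.Tensor, targets: torch.Tensor, epochs: int = 50):
    """Regress normalized feature values from fixed embeddings with an MSE loss."""
    probe = Probe(in_dim=embeddings.shape[1])
    opt = torch.optim.Adam(probe.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(embeddings), targets)
        loss.backward()
        opt.step()
    return probe, loss.item()
```

The same probe is reused unchanged for every layer, so differences in the reported MSE can be attributed to the embeddings rather than to the probe's capacity.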

We test the Mockingjay and wav2vec2.0 models on the following linguistic features: audio features (§3.1), fluency features (§3.2), and pronunciation features (§3.3). Since spoken language can be considered a combination of spoken content and language delivery, we compare representations from both models with BERT embeddings (Devlin et al., 2019) as well. We compare them on the basis of text features (including syntactic, semantic and surface-level features) (§3.4). For comparing on text features, we perform two kinds of experiments: probing the models on text features extracted from the original transcripts of all the audio datasets we have considered, and probing on text features extracted from Wikipedia articles. For the latter experiments, we convert Wikipedia articles to speech using the Google text-to-speech model (Durette and Contributors, 2020) and feed the generated audio to the audio transformer models.

Since the pre-trained audio transformer models are used in a variety of domains, we tested them on: the LibriSpeech dataset, which comprises read speech by native English speakers (Panayotov et al., 2015); spontaneous native English speech, using the Mozilla Common Voice dataset (Ardila et al., 2019); and speakers with English as their second language (L2 English speakers), using L2-Arctic (Zhao et al., 2018).

In addition, we explore the role of the feature extractor in wav2vec2.0. We probe the multiple convolutional layers within the model to find out what the extractor learns that helps the model learn from the raw input audio. We also include the post-projection layer, which sits after the feature encoder but before the transformer and maps the dimension of the extracted features (512) to that of the transformer (768).

3 Features Probed

We assess the models on their knowledge of language delivery and content features. This set of features assesses both what was spoken and how it was spoken. Next, we describe each of the individual features considered.

3.1 Audio Features

Acoustic analysis of speech includes temporal and spectral analysis of audio waveforms. Hence, we measure the following features in this category: total duration, zero-crossing rate, energy entropy, spectral centroid, mean pitch, local jitter, local shimmer, and voiced to unvoiced ratio. Total duration is a characteristic feature of the audio length that tells us about the temporal shape. The temporal feature zero crossing rate measures the rate at which a signal moves from a positive to a negative value or vice-versa. It is widely used as a key feature in speech recognition and music information retrieval (Neumayer and Rauber, 2007; Simonetta et al., 2019). Energy features of audio are an important component that characterizes audio signals. We use energy entropy and the standard deviation of energy (std dev) to evaluate the energy. Spectral centroid is used to characterise the spectrum by its centre of mass. Additionally, to estimate the quality of speech as perceived by the ear, we measure mean pitch. We also probe for frequency instability (localJitter), amplitude instability (localShimmer), and the voiced to unvoiced ratio.
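For concreteness, a few of the temporal and spectral measures above can be computed directly from a waveform as sketched below. This is a simplified illustration in plain NumPy, not the exact extraction pipeline used for the paper (which is not specified here); the frame size and the single-FFT centroid are illustrative choices, and pitch, jitter, shimmer and the voiced/unvoiced ratio would typically come from a Praat-style tool instead.

```python
import numpy as np

def audio_features(x: np.ndarray, sr: int = 16000) -> dict:
    """Toy versions of a few Section 3.1 features for a mono waveform x."""
    total_duration = len(x) / sr
    # Zero-crossing rate: fraction of consecutive samples that change sign.
    zcr = np.mean(np.abs(np.diff(np.sign(x))) > 0)
    # Energy entropy: entropy of normalized short-frame energies (20 ms frames assumed).
    frame = sr // 50
    n = len(x) // frame
    energies = np.array([np.sum(x[i * frame:(i + 1) * frame] ** 2) for i in range(n)])
    p = energies / (energies.sum() + 1e-12)
    energy_entropy = -np.sum(p * np.log2(p + 1e-12))
    # Spectral centroid: magnitude-weighted mean frequency of the spectrum.
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    spectral_centroid = np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12)
    return {"total_duration": total_duration, "zero_crossing_rate": zcr,
            "energy_entropy": energy_entropy, "spectral_centroid": spectral_centroid}
```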

3.2 Fluency Features

The key features of fluency are: rate of speech, pauses, and length of runs between pauses. For measuring the rate of speech, we measure the speech rate (number of words per second in the total response duration) (speaking rate) and the articulation rate (number of words per second in the total articulation time, i.e., the resulting duration after subtracting the time of silences and filled pauses from the total response duration) (articulation rate) (Wood, 2001). Apart from these rates, pauses in speech are the second most observable feature indicating disfluency (Igras-Cybulska et al., 2016). Therefore, we measure the duration, location and frequency of pauses as prototypical features. For this, we measure the number of filled pauses per second (filled pause rate) and the silence deviation (absolute difference from the mean of silence durations), which along with the total duration of the audio helps to indicate the length of runs between the pauses (Möhle, 1984). This also serves as an important indicator of fluency. Other such features include the total number of silences (general silence), mean duration of silences (mean silence), average silence per word (SilenceRate1), average silence per second (SilenceRate2) and the number of long silences per word (longpfreq).

Furthermore, conversational fillers are a major source of disfluency. Sounds like uh, um, okay, you know, etc. are used by speakers to bring naturalness and fluency to their speech. The extent of fillers is an important feature to check for speech fluency. We use the average number of syllables in a word (average syllables in word), the number of words with more than 2 syllables (wordsyll2) and the repetition frequency (repetition freq) for measuring this.

3.3 Pronunciation Features

The intelligibility, perceived comprehensibility, and accentedness of speech are impacted by phonemic errors (Derwing and Munro, 1997). Segmental pronunciation is judged based on the amount of listener effort, with lower being better. Hence, we probe the models for the following pronunciation characteristic features - the percentage, standard deviation, duration and Normalized Pairwise Variability Index (PVI) for vowels (vowelPercentage, vowelDurationSD, vowelSDNorm, vowelPVINorm), consonants (consonantPercentage, consonantDurationSD, consonantSDNorm, consonantPVINorm), and syllables (syllableDurationSD, syllableSDNorm, syllablePVINorm). Moreover, we also study the presence of stress in the speech with the characteristic features of stressed syllable distance mean (stressDistanceSyllMean) and stress distance mean (stressDistanceMean).

3.4 Text Features

We further divide the text features into three categories: Surface-level, Syntax-level, and Semantic-level text features (Conneau et al., 2018; Jawahar et al., 2019).

Surface-Level: The lexical diversity of spoken speech is an important metric to evaluate its quality (Read et al., 2006; Kumar et al., 2019). Surface features can be inferred by simply looking at the content spoken. It is basically a measure of how many different words are used in speech, such as total nouns, pronouns, adverbs, adjectives, verbs, conjunctions, and determiners, along with the unique word count and the average word complexity (Word Complexity).

Syntax-Level: Syntax is the key component of the grammatical structure of a sentence, which in turn is a key component of communicative competence (Canale and Swain, 1980). We use the depth of the syntax tree constructed from the sentences spoken in each sound clip as a feature to evaluate the syntax content (Conneau et al., 2018; Jawahar et al., 2019).

Semantic-Level: The relationship between the words spoken and our comprehension of that spoken content falls into the domain of semantics. To produce meaning in a sentence, it is almost necessary for it to have a subject and a direct object that the subject addresses. The number of subjects and the number of direct objects are hence in our set of features to evaluate the spoken content (Conneau et al., 2018; Jawahar et al., 2019).
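A sketch of how such text features can be pulled from a transcript is given below. It uses spaCy purely as an illustration (the paper does not state the extraction toolkit), approximates the syntax-tree depth with the dependency-parse depth, and uses `nsubj`/`dobj` counts to stand in for the subject and direct-object features; the `en_core_web_sm` model is an assumed dependency.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed model; any English pipeline would do

def text_features(transcript: str) -> dict:
    """Toy surface/syntax/semantic counts for one transcript (Section 3.4)."""
    doc = nlp(transcript)

    def depth(token):
        # Depth of a token in the dependency tree (root has depth 0).
        d = 0
        while token.head is not token:
            token, d = token.head, d + 1
        return d

    pos_counts = {pos: sum(t.pos_ == pos for t in doc)
                  for pos in ("NOUN", "PRON", "ADV", "ADJ", "VERB", "CCONJ", "DET")}
    return {
        **pos_counts,                                                  # surface
        "unique_words": len({t.text.lower() for t in doc if t.is_alpha}),
        "tree_depth": max((depth(t) for t in doc), default=0),         # syntax
        "num_subjects": sum(t.dep_ == "nsubj" for t in doc),           # semantics
        "num_direct_objects": sum(t.dep_ == "dobj" for t in doc),
    }
```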

Figure 1: Performance of wave2vec2.0 on: (a) Fluency features (b) Pronunciation features (c) Surface features (g) Audio features; performance of Mockingjay on: (d) Fluency features (e) Pronunciation features (f) Surface features (h) Audio features. The graphs are stacked area charts with the x-axis being the layers of the model and the y-axis the relative performance of each layer with respect to the minimum loss for each feature ((loss - min loss) * 100% / min loss). Hence, the higher the value, the higher the loss and the lower the performance. Different colours symbolize the curves followed by the different features across the transformers for Audio features (total duration, stdev energy, mean pitch, voiced to unvoiced ratio, zero crossing rate, energy entropy, spectral centroid, localJitter, localShimmer), Fluency features (filled pause rate, general silence, mean silence, silence abs deviation, SilenceRate1, SilenceRate2, speaking rate, articulation rate, longpfreq, average syllables in words, wordsyll2, repetition freq), Pronunciation features (StressedSyllPercent, StressDistanceSyllMean, StressDistanceMean, vowelPercentage, consonantPercentage, vowelDurationSD, consonantDurationSD, syllableDurationSD, vowelSDNorm, consonantSDNorm, syllableSDNorm, vowelPVINorm, consonantPVINorm, syllablePVINorm) and Text features (Total adjectives, Total adverbs, Total nouns, Total verbs, Total pronoun, Total conjunction, Total determiners, Unique Word count, Word Complexity).

4 Results and Discussion

We discuss the results obtained with mean squared error (MSE) as the comparison metric. The regression task of predicting the normalized features from time-averaged latent embeddings makes MSE a suitable metric of evaluation. By comparing the losses, we identify the learning patterns of the two representational-learning-based audio transformer models, wave2vec2.0 and Mockingjay, for different audio and text features. We also draw out a comparison between the two. Furthermore, we study the role of the feature encoder module in wav2vec2.0 and include in it the post-projection layer, which transforms the 512 dimensions to the encoder input dimension of 768.

We first present the results for read speech of native English speakers on LibriSpeech. Then, we conduct experiments to see how the different layers of the models pretrained on native English speakers behave in various settings. For this, we then present the results for spontaneous speech of native English speakers using the Mozilla Common Voice dataset and non-native read speech using the CMU L2-Arctic dataset.

4.1 Audio Features

Native Read Speech: Figures 1(a) and (e) show the results obtained for audio features probed on wav2vec2.0 and Mockingjay respectively for the Librispeech dataset (see Tables 2 and 6 of the Appendix for the loss values). It can be seen that the lowest loss for these features is obtained in the initial two layers for wav2vec2.0, whereas for Mockingjay it is in the final layers. We can see a clear ascent in the losses as we traverse the table for wav2vec2.0 from left to right, i.e., from the lower layers to the higher layers. This suggests that as we go deeper into the 12-block transformer model, the audio features are diluted by wav2vec2.0. Mockingjay, on the other hand, follows a negative slope (i.e., decreasing losses) as we traverse through the layers. Hence, the audio features are best captured in the final layers of the Mockingjay model.

When we compare the minimum losses across both models, the average learning of these features for wave2vec is better than that of Mockingjay by 28.59%. Even when we compare the final-layer embedding performance, wav2vec2.0 performs better than Mockingjay by 24.53%. This is interesting given that the final layer of wav2vec2.0 contains the most diluted version of the learnt features and Mockingjay has its best version (in the final layers).

Native Spontaneous Speech: Comparing the performance of native read and spontaneous speech (Figure 2; see Tables 18 and 20 of the Appendix for the loss values), we observe that wav2vec2.0 performs better than Mockingjay for both. Wav2vec2.0, on average, performs better by 41.69% when compared across the best performing layers and by 51.12% when end-layer losses are compared. The pattern of the best performing layer also remains the same as in the case of native read speech. Mockingjay learns these features best in the second-last (11th) layer for native spontaneous speech, while it was the last two layers for native read speech. For wav2vec2.0, native read speech was best captured in the initial 2 layers, but for spontaneous speech the layers are a bit more spread out across the initial half of the transformer model. We also observe that the loss values on native spontaneous speech are higher than the ones for the native read and non-native read corpora.

Non-native Speech: When tested on L2 speakers (Figure 2; see Tables 10 and 14 of the Appendix for the loss values), wav2vec2.0 outperforms Mockingjay by 9.53% and 12.51% on minimum and end-layer loss respectively.

Additionally, similar to the case of native read speech, Mockingjay learns the audio features best in the final layers (the 11th and 12th layers). As for wav2vec2.0, the layers learning the audio features are spread out, with the initial half of the model learning them more accurately than the later half.

Feature Extractor: The feature extractor of wav2vec2.0 shows a considerable decrease in loss with each passing layer for the audio features. Hence, we can infer that the feature extractor extracts the audio-based features to a great extent. Moreover, it is this propagation of high-quality learnt features to the transformer module which makes it possible for the minimum loss in the wav2vec transformer module to appear in the initial 2 layers itself. In comparison, Mockingjay needs to go deeper to extract the same features.

4.2 Fluency features

Native Read Speech: For fluency-based features on native read speech, we see that, similar to the audio features, wave2vec2.0 performs better than Mockingjay (Figures 1(b) and (f); see Tables 3 and 7 of the Appendix for the loss values). While the fluency features are not layer-specific but are spread across the model for Mockingjay, they tend to show the best performance in the middle layers for wav2vec2.0. When comparing the final-layer embeddings for both models, wav2vec2.0 performs better than Mockingjay by 12.23%. The performance gap increases fourfold to 42.37% when compared on the minimum losses (among all those observed for the intermediate layers) learnt by both models.

Non-native Speech: For the L2-Arctic dataset (Figure 2; see Tables 11 and 15 of the Appendix for the loss values), the learning of fluency features is concentrated in the middle layers for wav2vec2.0. Moreover, here we see a definite pattern that Mockingjay is learning better in the final layers, compared to the lack of any pattern observed in the case of Librispeech. Overall, wav2vec2.0 outperforms Mockingjay by 5.06% on the minimum-loss layers but by 105.62% on the final layers. Thus, wave2vec2.0 heavily outperforms Mockingjay in non-native speech settings.

Feature Extractor: Results from the feature extractor module of wav2vec2.0 show the minimum loss on the post-projection layer, i.e., the layer giving input to the transformer, located after the feature extractor. The fluency features are extracted in a better way as we go deeper in the extractor. This decreasing trend is carried on in the transformer block as well (as explained above).

4.3 Pronunciation features

Native Read Speech: Figures 1(c) and (g) show the results for probing pronunciation features on wav2vec2.0 and Mockingjay with the Librispeech data (see Tables 4 and 8 of the Appendix for the loss values). These features are learnt best by the last layers in Mockingjay. Wav2vec2.0 learns these features the most in the 6th to 8th of its 12 layers. Mockingjay performs better on pronunciation-based features than wav2vec2.0 by 30.4% in the final-layer embeddings. Comparing the minimum-loss layers for both models, the difference is 16.19% in favor of Mockingjay.

Non-native Speech: Mockingjay follows the same pattern for the L2-Arctic dataset as for the Librispeech dataset. It learns these features better in the last layers. However, for wav2vec2.0, the layers learning each of these pronunciation features are more spread out across the initial layers of the second half of the model. Wav2vec2.0 outperforms Mockingjay, but the differences here are reduced to 8.9% in the end layer and 2.20% in the best performing layer. This pattern follows the non-native speech performance of wave2vec2.0 and Mockingjay seen with the audio and fluency features. Here too, the performance difference between wave2vec2.0 and Mockingjay widens when compared to the native speech scenario.

Feature Extractor: While the slope of the loss graph is negative for the extractor module of wav2vec2.0, the minimum loss observed is after the last convolutional unit of the extractor and not in the post-projection layer. However, here too we observe a similar pattern as for the audio and fluency features, showing that the feature extractor module is indeed performing the way it is supposed to.

4.4 Text-based features

As discussed in Section 3.4, the quality of spoken content can be broken down into three subcategories of text-based features. We evaluate the models' learning in these categories to identify which aspects of the conveyed text these audio transformers learn. This helps us to evaluate the speech content knowledge of the speech transformer models.

Figure 2: Performance of (a) Audio features (b) Fluency features (c) Pronunciation features (d) Surface features across all datasets by both Mockingjay and wave2vec2.0. The graphs show the performance of each feature of a particular category (on the x-axis) with respect to the performance of random embeddings on the L2-Arctic data features (loss * 100 / l2 random loss) on the y-axis. All the features for a particular category follow the same order and labels as those of Figure 1, with the exception of Pronunciation, which has one fewer feature (the first one - StressedSyllPercent).

Here, apart from comparing Mockingjay and wave2vec2.0's performance, we also compare them with the text transformer model, BERT. We do this to check how the content knowledge of the speech transformer models compares to that of the text ones.

Surface-level features: When compared on LibriSpeech, vocabulary-based features are learnt better by wav2vec2.0 than by Mockingjay, by 7.83% (Figures 1(d) and (h); see Tables 5 and 9 of the Appendix for the loss values). These features are learnt best in the intermediate layers in wav2vec2.0 and the initial layers in Mockingjay. From the results, we observe that the text understanding of both models becomes increasingly diffused as we go towards the later layers. Wav2vec2.0 outperforms Mockingjay by 9.70% in the final layer. For the L2-Arctic data, the difference widens to 10.23% on the end layers (Figure 2). However, the difference reduces to just 2.49% for native spontaneous speech.

Semantic features: Surprisingly, despite performing worse on surface features, Mockingjay performs better than wav2vec2.0 in this setting, by 16.83% (Figure 2). Additionally, while wav2vec2.0 outperforms Mockingjay by 9.35% on minimum-layer loss for L2 speech, Mockingjay performs better by 3.17% on the end layer.

Syntax features: In this setting as well, Mockingjay performs better than wav2vec2.0 on the end layer, by 21.5% (Figure 2). But for the L2-Arctic data, both perform almost similarly, with a difference of 2.8% in favor of wave2vec2.0 on the last layer.

Feature Extractor: The pattern observed in the feature extractor module for these surface-level features is the same as that of the audio features, with minimum losses seen in the post-projection layer. However, a noticeable fact is that the value of the minimum loss in this layer is less than that of the minimum loss in the transformer module of wav2vec2.0. This gives some intuition for the better performance of Mockingjay over wav2vec2.0, since the transformer is unable to capture the features or unlearns the presented vocabulary features.

4.4.1 Comparison with BERT

When we compare the performance of the audio-transformer models with BERT (Table 1) on native read speech, we observe that, on average, both wave2vec2.0 and Mockingjay perform better than BERT: by 48.77% and 41.54% on surface features, 56.90% and 73.57% on syntactic features, and 36.35% and 53.25% on semantic features, respectively. These results are surprising since none of the speech transformer models was trained with text objective functions. We hypothesize that this could be due to differences in the training sets of the three models. LibriSpeech is the train-set for both the speech transformer models, whereas Wikipedia is the train-set for BERT. To confirm this, we test the performance of the three models on text features extracted from the Wikipedia and native spontaneous speech datasets. These datasets provide us with a comprehensive comparison. While on one hand Wikipedia is the train-set for BERT and the text features from Wikipedia articles are very different from LibriSpeech, on the other, the native spontaneous speech dataset can be considered out-of-domain for both the speech transformer models and BERT.
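The BERT side of this comparison can be probed in exactly the same way as the audio models; a minimal sketch is shown below. It assumes the HuggingFace bert-base-uncased checkpoint and mean-pools the token vectors of each layer, a pooling choice made here for illustration rather than stated in the paper.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def bert_layer_embeddings(transcript: str):
    """One mean-pooled 768-dim vector per BERT layer for a transcript."""
    enc = tokenizer(transcript, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = bert(**enc, output_hidden_states=True)
    # hidden_states[0] is the embedding layer; 1..12 are the encoder blocks.
    return [h.mean(dim=1).squeeze(0) for h in out.hidden_states]
```

These per-layer vectors can then be fed to the same regression probe used for the audio embeddings, keeping the comparison apples-to-apples.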

For the first part, we convert 2000 random sentences from Wikipedia articles to speech using Google's text-to-speech API (Durette and Contributors, 2020). We made sure that the audios constructed had lengths similar to those of LibriSpeech. The audios so obtained were then passed through both the speech transformer models and the layers were then probed. On this synthetic dataset, for the semantic features, BERT outperforms both the models by more than 50% when compared on the minimum loss across all the layers. However, by the end layers, both models learn the features well and the performance difference between BERT and the audio-transformer models reduces greatly (0.04% and 0.92% difference for surface features, 4.41% and 7.68% for syntax and 7.58% and 15.56% for semantic features). These results are motivating, since they mean that the audio transformer models' embeddings do not just capture audio, fluency and pronunciation features but also textual features to a large extent.

For the second part of the experiments, we use the CMU L2-Arctic dataset. Table 1 presents the results for all the experiments. Here the results are the most different from the previous ones. While the performance difference between BERT and the audio models is still comparable for surface features, for the syntax and semantic features BERT outperforms both the models by more than 15%. This result, when compared with the Wikipedia TTS and native read speech results, implies that the audio models capture text features for native speakers in 'cleaner settings' but are not able to work in not-so-controlled environments. Therefore, in a general setting, BERT text embeddings combined with audio embeddings can capture all the speech features adequately.
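The Wikipedia-to-speech step described above can be reproduced roughly as follows. The gTTS package is the one cited (Durette and Contributors, 2020); the resampling to 16 kHz mono via pydub is an assumed post-processing step, added because the audio transformers expect 16 kHz input, and it requires ffmpeg to be available.

```python
from gtts import gTTS
from pydub import AudioSegment  # assumed post-processing dependency (needs ffmpeg)

def sentence_to_wav(sentence: str, out_path: str, sr: int = 16000) -> str:
    """Synthesize one sentence with Google TTS and convert it to 16 kHz mono WAV."""
    mp3_path = out_path + ".mp3"
    gTTS(sentence, lang="en").save(mp3_path)               # cited gTTS package
    audio = AudioSegment.from_mp3(mp3_path)
    audio.set_frame_rate(sr).set_channels(1).export(out_path, format="wav")
    return out_path
```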

Dataset | Model Comparison | Surface | Syntax | Semantics
Native Read Speech | Wave2Vec2.0 | 48.77%, 40.91% | 56.90%, 67.15% | 36.35%, 52.73%
Native Read Speech | Mockingjay | 41.54%, 36.73% | 73.57%, 74.21% | 53.25%, 60.68%
Non-native Read Speech | Wave2Vec2.0 | -28.72%, -2.51% | -59.05%, -30.33% | -20.49%, -17.83%
Non-native Read Speech | Mockingjay | -44.37%, -12.75% | -79.72%, -33.97% | -31.39%, -14.07%
Wikipedia TTS | Wave2Vec2.0 | -0.56%, -0.04% | -34.55%, 4.41% | -52.36%, -7.58%
Wikipedia TTS | Mockingjay | 0.73%, 0.92% | -47.70%, 7.68% | -63.18%, -15.56%

Table 1: Comparison of the performance of BERT with wave2vec2.0 and Mockingjay on text features. The two values per cell indicate the relative minimum loss across all the model layers and the relative end-layer loss when compared with the corresponding values for BERT. The values shown are an average across all features of a particular category, with the relative performance calculated as (bert loss - model loss) * 100% / bert loss.
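The cell values of Table 1 follow directly from the per-category average probe losses via the caption's formula; the small helper below restates it, with `bert_loss` and `model_loss` as placeholder inputs.

```python
def relative_performance(bert_loss: float, model_loss: float) -> float:
    """Relative gain (%) of an audio model over BERT; positive means a lower loss than BERT."""
    return (bert_loss - model_loss) * 100.0 / bert_loss

# Example: relative_performance(0.010, 0.006) == 40.0 (model loss is 40% below BERT's)
```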

5 Related Work

5.1 Audio Probing

In the domain of speech processing, probes have been carried out on feature vectors (Raj et al., 2019), neural networks like RNNs (Alishahi et al., 2017), end-to-end ASR systems (Belinkov and Glass, 2017) and audio-visual models (Drexler and Glass, 2017). In Raj et al. (2019), probing on x-vectors trained solely to predict the speaker label revealed that they also contain incidental information about the transcription, channel, or meta-information about the utterance. Probing Music Information Retrieval (MIR) predictions through Local Interpretable Model-Agnostic Explanations (LIME) by using AudioLIME (Haunschmid et al., 2020) helped interpret MIR for the first time. They demonstrated that the proposed AudioLIME produces listenable explanations that create trustworthy predictions for music tagging systems. Nagamine et al. (2015) analyse a DNN for phoneme recognition, both at single-node and population levels. Further research on interpreting the role of the non-linear activations of the nodes of a sigmoid DNN built for a phoneme recognition task is done in Nagamine et al. (2016). Research has also been done to address why LSTMs work well as a sequence model for statistical parametric speech synthesis (Wu and King, 2016). Several other studies have been conducted to interpret the correlation between audio and image structures for audio-visual tasks (Drexler and Glass, 2017; Harwath and Glass, 2017). Even for deep ASR models, efforts have been made to comprehend the hidden and learned representations (Belinkov and Glass, 2017; Elloumi et al., 2018). However, the probing of representation-learning audio transformers is as yet unexplored.

5.2 Wav2vec2.0

Wav2vec2.0 (Baevski et al., 2020) is a transformer architecture which learns latent speech representations jointly via a defined contrastive task over their quantization. (We use the terms wave2vec and wave2vec2.0 interchangeably in the text.) It is comprised of 3 major components - the feature encoder, the transformer and the quantization module. The feature encoder consists of a multi-layer convolutional network which converts the raw input audio X to latent representations Z1, Z2, ..., ZT, which are fed into the transformer to build the representations C1, C2, ..., Cn. The targets qt in the self-supervised objective are built by passing the output of the feature encoder to the quantizer. The training is similar to that of BERT (Devlin et al., 2019). Certain time-steps in the latent feature representation are masked, and the contrastive task requires finding the correct quantized latent audio representation for those masked steps from a set of distractors. The model is pretrained on unlabeled Librispeech data (Panayotov et al., 2015) and then finetuned on the TIMIT (Garofolo et al., 1993) dataset.

It achieves a 1.8/3.3 WER on the clean/noisy test sets in experiments using all the labeled data of Librispeech, and 5.2/8.6 WER on the clean/noisy test sets of Librispeech using just ten minutes of labeled data. The authors claim that even when lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100-hour subset while using 100 times less labeled data.

All our experiments are based on the wav2vec 2.0 base model, in which the feature encoder contains 7 blocks having temporal convolutions of 512 channels with strides (5, 2, 2, 2, 2, 2, 2) and kernel widths (10, 3, 3, 3, 3, 2, 2) respectively, and there are 12 transformer blocks with a model dimension of 768, inner dimension (FFN) 3,072 and 8 attention heads.

5.3 Mockingjay

Mockingjay (Liu et al., 2020) is a bidirectional transformer model which allows representation learning by joint conditioning on past and future frames. It accepts as input 160-dimension log-Mel spectral features and has outperformed other approaches on phoneme classification, speaker recognition and sentiment discrimination accuracy on a spoken content dataset by 35.2%, 28.0% and 6.4% respectively. The authors claim the model is capable of improving supervised training in real-world scenarios with low-resource transcribed speech, showing that the model outperforms other existing methods while training on 0.1% of the transcribed speech as opposed to their 100%.

For our experiments, we use the MelBase-libri model. The architecture comprises 12 encoder layers; each unit has the same output dimension of 768 and comprises sub-layers which include a feed-forward layer of size 3072 and 12 self-attention heads. We probe each of the 12 transformer blocks of both models and the feature encoder of wav2vec2.0 to check whether they learn the features of audio, fluency, suprasegmental pronunciation and text.
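To make the contrastive objective of Section 5.2 concrete, the sketch below scores a context vector against its true quantized target and a set of distractors using cosine similarity and a temperature, in the spirit of the wav2vec 2.0 loss. It is a simplified single-time-step illustration, not the authors' implementation: the temperature value is arbitrary and the diversity loss over the quantizer codebooks is omitted.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(c_t, q_t, distractors, temperature=0.1):
    """
    c_t:         (d,)   context vector from the transformer at a masked step
    q_t:         (d,)   true quantized latent for that step
    distractors: (K, d) quantized latents sampled from other masked steps
    """
    candidates = torch.cat([q_t.unsqueeze(0), distractors], dim=0)    # (K+1, d)
    sims = F.cosine_similarity(c_t.unsqueeze(0), candidates, dim=-1)  # (K+1,)
    logits = (sims / temperature).unsqueeze(0)                        # (1, K+1)
    target = torch.zeros(1, dtype=torch.long)   # index 0 is the true latent
    return F.cross_entropy(logits, target)

# Shape-only example with random vectors:
loss = contrastive_loss(torch.randn(256), torch.randn(256), torch.randn(100, 256))
```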

6 Conclusion

Transformer models across multiple domains such as natural language processing and computer vision now form an unavoidable part of the tech stack. Audio transformers that exploit representational learning to train on unlabeled speech have recently been suggested. It is observed that when these pre-trained models are fine-tuned on labelled speech, they perform highly efficiently on different downstream tasks such as phoneme recognition, character prediction, etc. In this paper, we interpret two such models, wav2vec2.0 and Mockingjay, to understand what each of their units focuses on learning in their architecture. We show that the models are capable of significantly capturing a wide range of characteristics such as audio, fluency, suprasegmental pronunciation, and text-based characteristics (further divided into semantic, syntax, and surface characteristics). For each category of characteristics, we identify a learning pattern for each framework, and conclude which model and which layer of that model is better to choose for a specific category of feature for downstream tasks. We find that wave2vec2.0 outperforms Mockingjay on audio and fluency features but underperforms on pronunciation features. Further, we compare text-BERT with the audio models on text features and find that the audio models surprisingly outperform BERT in cleaner, controlled settings (native read speech and synthetic speech in a native voice) but are not able to perform in uncontrolled environments such as spontaneous speech or even non-native speech. We show our results on a variety of settings including native, non-native, read, spontaneous and synthetic speech datasets.

References

Afra Alishahi, Marie Barking, and Grzegorz Chrupała. 2017. Encoding of phonology in a recurrent neural model of grounded speech. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 368–378.

Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M Tyers, and Gregor Weber. 2019. Common voice: A massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670.

Alexei Baevski, Steffen Schneider, and Michael Auli. 2019. vq-wav2vec: Self-supervised learning of discrete speech representations. In International Conference on Learning Representations.

Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477.

Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017. What do neural machine translation models learn about morphology? arXiv preprint arXiv:1704.03471.

Yonatan Belinkov and James Glass. 2017. Analyzing hidden representations in end-to-end automatic speech recognition systems. In Advances in Neural Information Processing Systems, pages 2441–2451.

Michael Canale and Merrill Swain. 1980. Theoretical bases of communicative approaches to second language teaching and testing. Applied Linguistics, 1(1):1–47.

Po-Han Chi, Pei-Hung Chung, Tsung-Han Wu, Chun-Cheng Hsieh, Shang-Wen Li, and Hung-yi Lee. 2020. Audio ALBERT: A lite BERT for self-supervised learning of audio representation. arXiv preprint arXiv:2005.08575.

Alexis Conneau, Germán Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single vector: Probing sentence embeddings for linguistic properties. arXiv preprint arXiv:1805.01070.

Tracey M Derwing and Murray J Munro. 1997. Accent, comprehensibility and intelligibility: Evidence from four L1s. Studies in Second Language Acquisition, 19(1):1–16.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Jennifer Drexler and James Glass. 2017. Analysis of audio-visual features for unsupervised speech recognition. In Grounded Language Understanding Workshop.

Pierre Nicolas Durette and Contributors. 2020. Google text to speech model. https://pypi.org/project/gTTS/.

Zied Elloumi, Laurent Besacier, Olivier Galibert, and Benjamin Lecouteux. 2018. Analyzing learned representations of a deep ASR performance prediction model. In BlackboxNLP Workshop, EMNLP 2018.

John S Garofolo, Lori F Lamel, William M Fisher, Jonathan G Fiscus, and David S Pallett. 1993. DARPA acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. STIN, 93:27403.

Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602–610.

Manraj Singh Grover, Yaman Kumar, Sumit Sarin, Payman Vafaee, Mika Hama, and Rajiv Ratn Shah. 2020. Multi-modal automated speech scoring using attention fusion. arXiv preprint arXiv:2005.08182.

David Harwath and James Glass. 2017. Learning word-like units from joint audio-visual analysis. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 506–517.

Verena Haunschmid, Ethan Manilow, and Gerhard Widmer. 2020. audioLIME: Listenable explanations using source separation. arXiv preprint arXiv:2008.00582.

John Hewitt and Christopher D Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138.

Magdalena Igras-Cybulska, Bartosz Ziółko, Piotr Żelasko, and Marcin Witkowski. 2016. Structure of pauses in speech in the context of speaker verification and classification of speech type. EURASIP Journal on Audio, Speech, and Music Processing, 2016(1):18.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3651–3657, Florence, Italy. Association for Computational Linguistics.

Yaman Kumar, Swati Aggarwal, Debanjan Mahata, Rajiv Ratn Shah, Ponnurangam Kumaraguru, and Roger Zimmermann. 2019. Get it scored using AutoSAS - an automated system for scoring short answers. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 9662–9669.

Andy T Liu, Shu-wen Yang, Po-Han Chi, Po-chun Hsu, and Hung-yi Lee. 2020. Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6419–6423. IEEE.

Dorothea Möhle. 1984. A comparison of the second language speech production of different native speakers. Second Language Productions, 26:49.

Ray Mooney. 2014. You can't cram the meaning of a whole %&!$# sentence into a single $&!#* vector! https://www.cs.utexas.edu/~mooney/cramming.html.

Tasha Nagamine, Michael L Seltzer, and Nima Mesgarani. 2015. Exploring how deep neural networks form phonemic categories. In Sixteenth Annual Conference of the International Speech Communication Association.

Tasha Nagamine, Michael L Seltzer, and Nima Mesgarani. 2016. On the role of nonlinear transformations in deep neural network acoustic models. In Interspeech, pages 803–807.

Robert Neumayer and Andreas Rauber. 2007. Integration of text and audio features for genre classification in music information retrieval. In European Conference on Information Retrieval, pages 724–727. Springer.

Desh Raj, David Snyder, Daniel Povey, and Sanjeev Khudanpur. 2019. Probing the information encoded in x-vectors. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 726–733. IEEE.

John Read, Paul Nation, et al. 2006. An investigation of the lexical dimension of the IELTS speaking test. IELTS Research Reports, 6:207–231.

Federico Simonetta, Stavros Ntalampiras, and Federico Avanzini. 2019. Multimodal music information processing and retrieval: Survey and future challenges. In 2019 International Workshop on Multilayer Music Representation and Processing (MMRP), pages 10–18. IEEE.

Hao Tang, Donghong Ji, Chenliang Li, and Qiji Zhou. 2020. Dependency graph enhanced dual-transformer structure for aspect-based sentiment classification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6578–6588.

Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2020. Efficient transformers: A survey.

Zhengkun Tian, Jiangyan Yi, Ye Bai, Jianhua Tao, Shuai Zhang, and Zhengqi Wen. 2020. Synchronous transformers for end-to-end speech recognition. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7884–7888. IEEE.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

David Wood. 2001. In search of fluency: What is it and how can we teach it? Canadian Modern Language Review, 57(4):573–589.

Zhizheng Wu and Simon King. 2016. Investigating gated recurrent networks for speech synthesis. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5140–5144. IEEE.

Xikun Zhang, Deepak Ramachandran, Ian Tenney, Yanai Elazar, and Dan Roth. 2020. Do language embeddings capture scales? arXiv preprint arXiv:2010.05345.

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210. IEEE.

Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Guanlong Zhao, Sinem Sonsaat, Alif Silpachai, Ivana Lucic, Evgeny Chukharev-Hudilainen, John Levis, and Ricardo Gutierrez-Osuna. 2018. L2-ARCTIC: A non-native English speech corpus. In Proc. Interspeech, pages 2783–2787.

Audio | 0th layer | 1st layer | 2nd layer | 3rd layer | 4th layer | 5th layer | 6th layer | 7th layer | 8th layer | 9th layer | 10th layer | 11th layer
total duration 0.001000995591 0.001052616374 0.001098873668 0.0009882489029 0.0008932195598 0.001036955389 0.0009853900297 0.00120506479 0.001173329026 0.001198166304 0.001597253794 0.001742114511
stdev energy 0.002306532136 0.002380363624 0.002714319057 0.003584466838 0.00434291883 0.004559425422 0.004828923042 0.00466799069 0.004243278617 0.003855743693 0.004065452462 0.004142789993
mean pitch 0.001811751374 0.001820682815 0.002083195027 0.002789952571 0.00337429036 0.003992556246 0.003757578298 0.004210399257 0.003714940709 0.00260877705 0.003510799603 0.003896026832
voiced to unvoiced ratio 0.00223168878 0.002014200847 0.001971397127 0.002092458664 0.00224987847 0.002456693423 0.002439862854 0.002375320708 0.002359549305 0.001862517037 0.002476975146 0.002676547201
zero crossing rate 0.004400080311 0.004151555697 0.004536247542 0.005637626985 0.006736799315 0.007006201082 0.008680683359 0.007472786243 0.006308502927 0.006545616986 0.007058227951 0.007532593417
energy entropy 0.004003145208 0.003851908827 0.004065196172 0.004143885762 0.004913005053 0.004165694571 0.004565524077 0.004139930451 0.004200284971 0.004021279313 0.00492770492 0.00593517518
spectral centroid 0.0003349381924 0.0003348789101 0.0003349927652 0.0003349526704 0.0003349562986 0.0003349558744 0.0003350423341 0.0003352064503 0.0003349179998 0.0003349463851 0.0003350426625 0.0003350045659
localJitter 0.002260860691 0.001933041139 0.002130917682 0.002110966278 0.002272641959 0.002445947651 0.00263751574 0.002834154751 0.002378949972 0.001995716744 0.002267349635 0.002488864739
localShimmer 0.003058716087 0.00305748341 0.003449485296 0.003517129689 0.004046810921 0.004389293538 0.004901103827 0.00447761415 0.004076987467 0.003549565677 0.004036656624 0.00396044435

Table 2: Results (MSE) for audio features on wav2vec2.0 for native read speech corpus (Librispeech)

Fluency | 0th layer | 1st layer | 2nd layer | 3rd layer | 4th layer | 5th layer | 6th layer | 7th layer | 8th layer | 9th layer | 10th layer | 11th layer
filled pause rate 0.0008765513876 0.0009303065029 0.0008303899183 0.0009168531955 0.0008311720524 0.0008384440081 0.0008098078354 0.0007824271691 0.000794454555 0.0008019173777 0.0008153088127 0.0008293819283
general silence 0.001895778607 0.00180526613 0.0019404146 0.001794739728 0.001683760422 0.001923729788 0.00203110742 0.001936706927 0.001979602653 0.002097983946 0.002721516396 0.002111513689
mean silence 0.001807076578 0.001907845938 0.001890766051 0.00195921772 0.001820755066 0.001723015355 0.001787426496 0.001845146334 0.001906279179 0.001886272296 0.002394365884 0.002327627502
silence abs deviation 0.0009753903965 0.001266490641 0.001489715954 0.001221203025 0.001371391982 0.001315605468 0.00158018127 0.001617509273 0.001493427956 0.001483640437 0.001869316284 0.001599149844
SilenceRate1 0.005096142174 0.004996513698 0.005074212019 0.004758236756 0.004217072683 0.003675997698 0.00383899521 0.004035006432 0.004022787603 0.00436030715 0.004927401867 0.005626966203
SilenceRate2 0.00516018685 0.005519745856 0.005247613608 0.004932789173 0.005084989392 0.004940673442 0.00484461307 0.005111577654 0.005173505513 0.005739863698 0.005895325256 0.006605277493
speaking rate 0.013042829 0.01278397844 0.01249306561 0.01023916275 0.007733273967 0.006184341667 0.005215829595 0.005029336262 0.005487236958 0.006622599568 0.009678859003 0.01164027247
articulation rate 0.01682406134 0.01586561525 0.0147932579 0.01239362625 0.008916872238 0.00737402545 0.006135157085 0.006320548084 0.006001218728 0.007957721356 0.01058859492 0.01172303029
longpfreq 0.001641537146 0.001979431651 0.00177438887 0.00198161383 0.001731098501 0.001646489692 0.001797643877 0.001868383728 0.001697789939 0.001848310577 0.001862609234 0.001994732835
average syllables in words 0.01831333714 0.0182145931 0.01801639705 0.01556206624 0.01166978505 0.008615066344 0.00665224039 0.006485827068 0.007123317853 0.01045775345 0.01383351046 0.01486869956
wordsyll2 0.01010911856 0.01072634095 0.0113568104 0.009438352727 0.00750566223 0.006293300612 0.005057610628 0.00555890078 0.005261181218 0.005987681059 0.008059176219 0.007614122738
repetition freq 0.01558559961 0.01603584763 0.01725062927 0.01613746001 0.01541195257 0.01444718251 0.013343901 0.01352003816 0.01331795615 0.01555349831 0.01470347635 0.01371891228

Table 3: Results (MSE) for fluency features on wav2vec2.0 for native read speech corpus (Librispeech)

Appendices

A Details

For LibriSpeech, we take the default 'train-clean-100' set for training the probing model and the test-clean set for testing it. For the L2 dataset, we take 500 audios of 4 speakers each from the CMU L2-Arctic dataset for training. Testing is done on 50 audios of each of those speakers. The 4 speakers are selected in such a way that there is 1 male and 1 female speaker with their L1 language as Hindi and Spanish. For Mozilla Common Voice, we use a subset of 2000 random audios for training and 200 for testing.

For evaluation, similar to previous work on text probing (Jawahar et al., 2019), we evaluate the performance of the embeddings on the mean-square error for each encoder unit of both the transformer models.

B Detailed Results

Pronunciation | 0th layer | 1st layer | 2nd layer | 3rd layer | 4th layer | 5th layer | 6th layer | 7th layer | 8th layer | 9th layer | 10th layer | 11th layer
StressedSyllPercent 0.01890501628 0.01956094468 0.02011212084 0.01804626824 0.02036499647 0.002052113536 0.001922684484 0.00194589348 0.001740619761 0.002095186832 0.002428855659 0.002546980326
StressDistanceSyllMean 0.009509657533 0.009830785929 0.009026296587 0.00883039541 0.008502671769 0.008189299609 0.007759167168 0.007837277387 0.008493753841 0.007519809215 0.009220257061 0.008767722483
StressDistanceMean 0.01203977135 0.01253746532 0.01293151754 0.0134521405 0.01042984904 0.01061420965 0.01109855297 0.01077771859 0.01051226946 0.01190284735 0.0119963279 0.01198325226
vowelPercentage 0.007321653116 0.007084223609 0.006277669612 0.005988899252 0.005835871794 0.005384963073 0.005204501774 0.005007754601 0.004806478696 0.005449700467 0.00639412457 0.006526244781
consonantPercentage 0.005969989639 0.006529447789 0.007061653695 0.005323137319 0.005464023667 0.004764846733 0.004961187878 0.005012049382 0.004823568541 0.004472242492 0.005979049553 0.006222096301
vowelDurationSD 0.002791774429 0.002813237081 0.002516468629 0.002352531096 0.00215940007 0.002075787549 0.001875098799 0.001893849925 0.001865638044 0.001997030019 0.002378806411 0.002422903978
consonantDurationSD 0.001644715121 0.001361352002 0.001378395709 0.001351638796 0.001318852895 0.00127390065 0.001253660388 0.001391661375 0.001345806587 0.001281701203 0.001411203498 0.00154006419
syllableDurationSD 0.005843084217 0.005443600054 0.005412509827 0.004812982093 0.004506568165 0.003986704464 0.003907546667 0.003954436508 0.004033446013 0.004483370001 0.005114432883 0.005242870329
vowelSDNorm 0.003603924935 0.003876086222 0.003826305281 0.003511226036 0.003457750157 0.003276613134 0.003310506462 0.003345796281 0.00323037547 0.003448930384 0.003970837289 0.003586144238
consonantSDNorm 0.002453842467 0.0025985138 0.002593293276 0.002413579199 0.00239902707 0.002300432955 0.00227182392 0.002299094841 0.002388701117 0.002295757807 0.002519879923 0.00246978861
syllableSDNorm 0.006878603268 0.006955421732 0.007363689281 0.006514014327 0.005890450824 0.005689484998 0.005542829456 0.00561948376 0.005712793922 0.006977381009 0.007220585786 0.007268757285
vowelPVINorm 0.008082672033 0.008476856275 0.008792739705 0.008540074177 0.007293748419 0.007871793747 0.007359564478 0.008042923709 0.007456988762 0.008196797107 0.008580631534 0.008054205767
consonantPVINorm 0.007040520918 0.007458248667 0.007527642767 0.006909237687 0.006418862915 0.006852352593 0.006777341343 0.006804729463 0.007341054301 0.006959356205 0.007224888278 0.007434453143
syllablePVINorm 0.01270211817 0.01366898045 0.01309619151 0.01206185004 0.01207675386 0.01115063976 0.01108818391 0.01142518361 0.01109130918 0.01258012968 0.01311898603 0.01304500855

Table 4: Results (MSE) for pronunciation features on wav2vec2.0 for native read speech corpus (LibriSpeech)

Vocabulary regularize 0th layer 1st layer 2nd layer 3rd layer 4rth layer 5th layer 6th layer 7th layer 8th layer 9th layer 10th layer 11th layer Total adjectives 0.008716142488 0.008636044114 0.008344734763 0.009196954348 0.00920171865 0.00906208174 0.00767217035 0.008421085408 0.007952393079 0.009323148797 0.009411014633 0.009365854858 Total adverbs 0.01164788316 0.01136059262 0.01114977867 0.01089577453 0.009900781466 0.00900951438 0.008756045924 0.008709974339 0.00956855771 0.01077304163 0.01070268154 0.01076030249 Total nouns 0.004878784877 0.005831316795 0.005285657415 0.004013603487 0.004325847282 0.004438708249 0.004539393746 0.004013435025 0.003849645621 0.004391862111 0.004873595781 0.00544428236 Total verbs 0.00974838165 0.008873088062 0.009064525829 0.008776794947 0.007376144148 0.00679188574 0.008080296239 0.007943765053 0.007044202868 0.00904671477 0.009181167671 0.008955392854 Total pronoun 0.002277963775 0.00225066888 0.002279728323 0.002273580031 0.002317937319 0.0021001447 0.002364496845 0.002101460836 0.001888682704 0.0022107569 0.002240079234 0.002303594478 Total conjunction 0.004890780031 0.004881876443 0.005034004704 0.00492905184 0.004821617171 0.004451050328 0.004238064357 0.004200762846 0.004504997002 0.004869517252 0.004884660971 0.004949440463 Total determiners 0.001953601612 0.001965513096 0.001956661394 0.001929882795 0.001953829346 0.002218593656 0.002319763435 0.001853604799 0.001930536822 0.00194624582 0.001955567774 0.001953491279 Unique Word count 0.005403100472 0.004279057349 0.006433682515 0.007045152681 0.004074146837 0.005392239981 0.004464366672 0.004927875533 0.007097396162 0.005556035332 0.005560704814 0.005695657893 Word Complexity 0.01100121507 0.01072912385 0.01059699809 0.01025654061 0.009527121818 0.009293687084 0.008829665164 0.009122303688 0.009117919263 0.009887773664 0.01030740389 0.01051470681

Table 5: Results (MSE) for vocabulary features on wav2vec2.0 for native read speech corpus (LibriSpeech)

Audio 1st layer 2nd layer 3rd layer 4rth layer 5th layer 6th layer 7th layer 8th layer 9th layer 10th layer 11th layer 12th layer total duration 0.000600071184 0.001194973024 0.002856690456 0.003293383617 0.003333996603 0.003412567622 0.003016885431 0.004304298778 0.003942841525 0.005396684544 0.005147869638 0.005897252446 stdev energy 0.00723100334 0.005820590444 0.008878210752 0.006919622558 0.00687860724 0.006747412762 0.006155843421 0.006520227762 0.005613004126 0.005844482009 0.005424464209 0.005705473624 mean pitch 0.002863807692 0.005313305857 0.009792676723 0.01058728287 0.009882777699 0.01016859527 0.009742434661 0.005038250845 0.003442760592 0.001832193298 0.0009265430398 0.001271571409 voiced to unvoiced ratio 0.004340692376 0.003959067798 0.005516976871 0.00553126727 0.005118835903 0.005269143517 0.00604222624 0.004361297257 0.002816091147 0.002328551165 0.002052913026 0.001833819867 zero crossing rate 0.011891957 0.01371445914 0.01484485415 0.01525584922 0.01248049667 0.01397136767 0.01256750925 0.01301585488 0.01143560999 0.007881070712 0.008747631008 0.00927834384 energy entropy 0.005735883236 0.005620545311 0.006332269671 0.006844853533 0.006356723115 0.006517795374 0.006331183471 0.006070982313 0.005918655801 0.006350301985 0.006840320168 0.006620443672 spectral centroid 0.0003346101311 0.0003346023506 0.0003346405374 0.0003347533921 0.0003345458164 0.000334732861 0.000334659656 0.000335508554 0.0003477160163 0.000334919021 0.0003347013857 0.0003349827391 localJitter 0.002801504119 0.002951995575 0.003129463739 0.003784385363 0.003684017212 0.003826605739 0.003191981708 0.003100164045 0.002830896401 0.002180713756 0.002076462643 0.0019662347 localShimmer 0.006433841691 0.006934908596 0.007137632993 0.007962424604 0.007199847233 0.00775156746 0.007866452774 0.007676613417 0.00645506422 0.005965135014 0.0054241555 0.005134716676

Table 6: Results (MSE) for audio features on Mockingjay for native read speech corpus (LibriSpeech)

Fluency 1st layer 2nd layer 3rd layer 4rth layer 5th layer 6th layer 7th layer 8th layer 9th layer 10th layer 11th layer 12th layer filled pause rate 0.0007896804519 0.0007753406579 0.0007756551652 0.0007743559126 0.0007731516167 0.0007686477696 0.0007754823766 0.0007801994088 0.0007851040799 0.0008353136358 0.0008146844383 0.0007885025868 general silence 0.003559262719 0.003243917261 0.002682329097 0.003353263319 0.003066778417 0.002264005409 0.002346107626 0.002401035414 0.003543556918 0.004335656832 0.004375657099 0.004599670043 mean silence 0.002123982004 0.002660733886 0.002410833628 0.002158139771 0.00176415953 0.002073197064 0.001832566403 0.002312672945 0.001882997119 0.003345401533 0.001941584076 0.001739722123 silence absolute deviation 0.003095026406 0.002298633902 0.001742826631 0.001615163456 0.00154309816 0.001675014342 0.001372048238 0.002061204291 0.001954002011 0.001749570452 0.001508719356 0.002431766388 SilenceRate1 0.00518266935 0.005170917284 0.005297491523 0.005822366144 0.005220723398 0.005045141669 0.005097903655 0.004941406869 0.004532841564 0.004755333236 0.004788232997 0.0041621159 SilenceRate2 0.005450757446 0.00487912989 0.005217385559 0.005187273815 0.005730677094 0.0054971075 0.005416637539 0.004746003002 0.005169486386 0.005023079466 0.006061529806 0.006017881528 speaking rate 0.01280338028 0.0127671412 0.01403781692 0.01425169368 0.01424678066 0.01481147138 0.01395719384 0.01526570594 0.01208603786 0.01258988822 0.01105901185 0.01113718211 articulation rate 0.01670899878 0.01673615793 0.01822332645 0.01893648004 0.01875100032 0.01826913523 0.01799609967 0.01945555888 0.01551157328 0.01481122948 0.01279541887 0.01321809932 longpfreq 0.001566089048 0.001712909641 0.001707112282 0.001574314056 0.001660662408 0.00160382734 0.001672355608 0.001607164656 0.00157055707 0.001813219451 0.002025115399 0.001800152604 average syllables in words 0.01432082053 0.01373734453 0.01423033558 0.01529792868 0.01519453918 0.01470058786 0.01448966154 0.01417792526 0.01428191168 0.01510727035 0.01540889251 0.01430874493 wordsyll2 0.01001236552 0.01024585443 0.01244183527 0.0127584636 0.01183449561 0.01163715513 0.01168768375 0.0118791085 0.0110287354 0.01243767997 0.01239873538 0.01199892545 repetition freq 0.01128155206 0.01127694702 0.01144730346 0.01167238078 0.01150220612 0.01149450501 0.01144426619 0.01142765812 0.01173432472 0.01203966654 0.01292588468 0.01227801511

Table 7: Results (MSE) for fluency features on Mockingjay for native read speech corpus (LibriSpeech)

Suprasegmental Pronounciation 1st layer 2nd layer 3rd layer 4rth layer 5th layer 6th layer 7th layer 8th layer 9th layer 10th layer 11th layer 12th layer StressedSyllPercent 0.01677603851 0.01677690267 0.01713654871 0.01702071686 0.01675987583 0.01747081832 0.01663555327 0.01688359415 0.01691505183 0.01596088778 0.01695140029 0.01643004819 StressDistanceSyllMean 0.002738108792 0.005020415113 0.008836323638 0.0117448189 0.0099045551 0.009777836222 0.00939542788 0.004853118264 0.003611929217 0.001711291598 0.0009603263773 0.0007985681893 StressDistanceMean 0.01044764089 0.01186773043 0.01470363532 0.01604556385 0.01557287461 0.01560927067 0.01430394109 0.01183633506 0.01037475399 0.007646403158 0.005007342329 0.004816823865 vowelPercentage 0.006114930849 0.006362555379 0.007021313219 0.007936547978 0.007495822035 0.007167205806 0.006865992613 0.005512591575 0.005455678074 0.00481291271 0.004648090234 0.004672305081 consonantPercentage 0.005198686002 0.005548506413 0.006473757052 0.00613964942 0.00595527057 0.005664949475 0.005455310576 0.005207239287 0.005086207487 0.004533730105 0.004377456692 0.004634832106 vowelDurationSD 0.002703244139 0.002683064792 0.002921476773 0.002859975682 0.0030806755 0.002907091215 0.002782790701 0.002687664197 0.002456641901 0.002193780611 0.002132015797 0.002287623415 consonantDurationSD 0.001127580104 0.001202319546 0.001373357087 0.001322325721 0.001312409267 0.001374819467 0.001263774466 0.001206966207 0.001088689489 0.001026378124 0.001073778125 0.0009591924034 syllableDurationSD 0.004989195368 0.005502979374 0.005784374275 0.005818281663 0.005727545156 0.005974870626 0.005903584651 0.005595582375 0.004834336158 0.004513104287 0.004303701421 0.00430450747 vowelSDNorm 0.003145973789 0.002862659282 0.002959128833 0.002968847753 0.003009786252 0.002889282253 0.00294655775 0.002856702744 0.002926424974 0.002931777451 0.002936682594 0.00287968754 consonantSDNorm 0.001863017581 0.002026477828 0.0018358383 0.001914912937 0.001887930823 0.001875639279 0.001864685231 0.001870850405 0.00193912884 0.001928171101 0.001947828614 0.001850899233 syllableSDNorm 0.00598585168 0.005941390994 0.005993520023 0.005847209069 0.005853014254 0.00590510027 0.00600419214 0.00583421369 0.005788959916 0.005879298497 0.00582767144 0.00572723638 vowelPVINorm 0.006571093213 0.006465900116 0.006615874495 0.006876691473 0.006913420618 0.006888030944 0.006578881004 0.006727194324 0.006547975926 0.006447218204 0.006748358005 0.006355829911 consonantPVINorm 0.005579530263 0.005593982887 0.005785705544 0.005956501091 0.005809107271 0.005824047072 0.005664172926 0.005536649725 0.005373282505 0.005286989802 0.005390780173 0.005677257282 syllablePVINorm 0.01068316305 0.01074624356 0.01102650851 0.01067957145 0.01056679301 0.01083935643 0.01079975097 0.01071715446 0.01055280899 0.01065924382 0.0110233876 0.01060212574

Table 8: Results (MSE) for pronunciation features on Mockingjay for native read speech corpus (LibriSpeech)

Vocabulary 1st layer 2nd layer 3rd layer 4rth layer 5th layer 6th layer 7th layer 8th layer 9th layer 10th layer 11th layer 12th layer Total adjectives 0.007570225304 0.00882034699 0.009862320957 0.01103550483 0.01032327913 0.01129100272 0.009432048004 0.009538001196 0.01074250833 0.0114706295 0.01251287625 0.01034303427 Total adverbs 0.01099107328 0.01133155755 0.01131501089 0.01153728418 0.01149840418 0.0114583776 0.01138668751 0.01135164256 0.01159329082 0.01265736261 0.01334193369 0.01246985235 Total nouns 0.004010554653 0.004432963924 0.006881253289 0.007136661454 0.006792365692 0.006114054424 0.005584657589 0.006672900855 0.00726183225 0.007767768879 0.00868267008 0.006561221521 Total verbs 0.008358126713 0.009058740265 0.01034887386 0.01068432699 0.01019642499 0.01044021179 0.009786931247 0.00914714026 0.012780693 0.01314768836 0.01465605013 0.01154828264 Total pronoun 0.002238017096 0.002216677495 0.002210232125 0.002230531061 0.002242119138 0.002281428065 0.002264127223 0.002303299702 0.00225891565 0.002373602284 0.002524744261 0.002345227557 Total conjunction 0.004934738222 0.004901665396 0.004965355505 0.005017159723 0.005184474139 0.005050440986 0.004963267165 0.00500932937 0.005076171444 0.005169624677 0.005550043994 0.005081462702 Total determiners 0.001940148071 0.001973972214 0.001943869767 0.001952394604 0.001968247053 0.001976830932 0.001941036299 0.001941974822 0.00197671487 0.002005681096 0.002023148284 0.001973492972 Unique Word count 0.002199487996 0.005995549278 0.006729647211 0.005918513972 0.006723115879 0.007532181416 0.005242927665 0.005412766723 0.006634656724 0.008468982434 0.01162688466 0.005578399092 Word Complexity 0.01112771632 0.01140624251 0.01153482002 0.01159091894 0.01167012734 0.01132541804 0.01152729342 0.01143041 0.01146659636 0.01201426225 0.01146897922 0.0113627403

Table 9: Results (MSE) for vocabulary features on Mockingjay for native read speech corpus (LibriSpeech)

Audio 1st layer 2nd layer 3rd layer 4rth layer 5th layer 6th layer 7th layer 8th layer 9th layer 10th layer 11th layer 12th layer total duration 0.003036852584 0.002240838744 0.002386579194 0.002948529764 0.002987011009 0.002997231913 0.002993126693 0.003980276745 0.003188442568 0.003082025403 0.004429940143 0.005752840747 stdev energy 0.01324670535 0.01077789761 0.01322383268 0.01168904504 0.01125114423 0.01118140825 0.01116425438 0.01122983216 0.01251280088 0.01179641799 0.01188905848 0.01145454932 mean pitch 0.004493347796 0.003569331866 0.003842852411 0.004897000657 0.005505153375 0.004684224744 0.005197434331 0.005699304544 0.005733258858 0.004194434231 0.008189400288 0.006506099881 voiced to unvoiced ratio 0.002073641644 0.002023948256 0.001661073744 0.002073183115 0.002288006682 0.001632213221 0.001988362484 0.001961196856 0.001981582041 0.001903942197 0.002232590702 0.002125300737 zero crossing rate 0.01051946283 0.007791790176 0.006900655476 0.007061863494 0.00820783994 0.00678975414 0.007180293173 0.006586508343 0.006429331852 0.006368979636 0.01055886017 0.009978544089 energy entropy 0.0131660124 0.01051870928 0.008691528516 0.01035972697 0.01041408811 0.01094031447 0.01078624351 0.01015224294 0.01352512007 0.009773955154 0.01058837599 0.01159235612 spectral centroid 0.000004050640549 0.000002884069191 0.0000026139057 0.00000268593849 0.000002993052206 0.000003158745487 0.000002540464208 0.000003608490735 0.000004299552994 0.000002586428523 0.000002568550171 0.000002869018177 localJitter 0.008843279339 0.00763015628 0.01008887258 0.007923614721 0.0070344993 0.008898885073 0.007445504953 0.01006001376 0.00802474246 0.007730700765 0.00864250852 0.008793759531 localShimmer 0.005665909947 0.004940269417 0.005272669217 0.004515491684 0.006484832943 0.004504353798 0.004658858808 0.005152755834 0.004648094578 0.005181036245 0.006094569912 0.005418662421

Table 10: Results (MSE) for audio features on wav2vec2.0 for non-native read speech corpus (L2 Arctic)

Fluency 1st layer 2nd layer 3rd layer 4rth layer 5th layer 6th layer 7th layer 8th layer 9th layer 10th layer 11th layer 12th layer filled pause rate 0.000006387970224 0.000008682296089 0.000004741688422 0.000004237409567 0.000009187543289 0.000003579812709 0.000005561210126 0.00002653749623 0.00002342882729 0.000003693411822 0.000003357185108 0.000003749156653 general silence 0.01033555951 0.009149169464 0.009780546845 0.009448548738 0.01021406095 0.009556660139 0.01046188957 0.01013046589 0.009625767908 0.01069982609 0.00914006391 0.01085762649 mean silence 0.008366614439 0.008180335582 0.008528918032 0.007217843789 0.007962025364 0.008569399553 0.007367689096 0.008397009404 0.0086400927 0.008453370616 0.009267107046 0.00882008465 silence abs deviation 0.008834311295 0.008650845385 0.008100582414 0.008430736154 0.008082393993 0.008105375897 0.008495559228 0.007867558302 0.007671361145 0.008465820279 0.009120660437 0.0108708525 SilenceRate1 0.01044123743 0.009092348424 0.01043301035 0.009221869468 0.009357026617 0.008964186501 0.009186406205 0.009737917622 0.01007781563 0.009489678463 0.008924994621 0.01044440624 SilenceRate2 0.01916863316 0.01771614741 0.01913049109 0.018712528 0.01860373252 0.01792320075 0.01849100455 0.01928141526 0.01825185886 0.01875508536 0.01799868099 0.02265250758 speaking rate 0.00949531612 0.008706728431 0.009149666883 0.009288570987 0.009673628871 0.008421046499 0.007932631021 0.008018199961 0.00788175189 0.009372725227 0.01003070456 0.009355750841 articulation rate 0.01460807176 0.012276978 0.01199722544 0.01293632749 0.01212447962 0.01100245681 0.01102734852 0.01178566118 0.01273003023 0.01180807437 0.01229814885 0.0123860972 longpfreq 0.006085115024 0.005659941734 0.005149693695 0.005310365257 0.004731439031 0.005081345847 0.005215093074 0.005338007252 0.00520273901 0.004949189748 0.0055988443 0.005908655898 average syllables in words 0.0405414943 0.03934107353 0.03889618546 0.04012435788 0.03491881224 0.03203875843 0.02692308332 0.03189995042 0.03070087745 0.03460050468 0.04644216529 0.04072065185 wordsyll2 0.02971014255 0.02958088167 0.02909421088 0.02779293592 0.02733638405 0.02679802803 0.02158009075 0.02380347881 0.02561642531 0.02566025422 0.03042140132 0.02845798602 repetition freq 0.02574293891 0.02627080771 0.02602259791 0.02620179685 0.02627396395 0.02551717195 0.02575151809 0.02705670945 0.02671655863 0.02590812334 0.02628673194 0.02628793326

Table 11: Results (MSE) for fluency features on wav2vec2.0 for non-native read speech corpus (L2 Arctic)

Pronunciation 1 2 3 4 5 6 7 8 9 10 11 12 StressDistanceSyllMean 0.01085782733 0.01098628002 0.01108407769 0.01070625137 0.01044733789 0.01092516488 0.01057964372 0.01062937802 0.01096083754 0.01096456235 0.01082172927 0.01086185439 StressDistanceMean 0.01445041269 0.01437040954 0.01474399634 0.01455219575 0.01444723178 0.01438453577 0.01380486132 0.01465175479 0.01415742156 0.01446328933 0.0146644549 0.01456196221 vowelPercentage 0.006582178908 0.006036077598 0.005798679384 0.005019396605 0.004906005752 0.005742765319 0.005376293419 0.005847743266 0.004815154536 0.005328370126 0.006170894301 0.006101583732 consonantPercentage 0.01077557184 0.009301757647 0.009350718316 0.01172386174 0.00765887745 0.006690978593 0.008678450289 0.008014754067 0.00881095758 0.008519494114 0.009191255784 0.009435561027 vowelDurationSD 0.004968164154 0.005062242745 0.004466884003 0.004497627784 0.004240898993 0.004156724155 0.004320067699 0.004319318428 0.004305789569 0.004623777541 0.004952676432 0.00479170822 consonantDurationSD 0.008936564392 0.008598281277 0.008308606219 0.008175608065 0.008195857606 0.007876085461 0.008254148679 0.008291035084 0.008223764793 0.008239658054 0.009342117536 0.008895579444 syllableDurationSD 0.01806393282 0.01700667742 0.0167047632 0.01594318009 0.01699094383 0.01574212378 0.01560875624 0.01609173706 0.01600048516 0.01620207649 0.01767428638 0.01906834934 vowelSDNorm 0.007290093544 0.007268939415 0.007572882747 0.007195812177 0.007125734843 0.007204871667 0.007109662866 0.007452009186 0.0071058745 0.007387514968 0.00728065025 0.007137284653 consonantSDNorm 0.01111690383 0.01012459115 0.01039347629 0.01005310843 0.01032912051 0.01007985877 0.01034796039 0.01050947246 0.01000415971 0.01125236326 0.01140357022 0.01113252911 syllableSDNorm 0.01593478455 0.016013385 0.01563592597 0.01565452532 0.01731780043 0.01532025746 0.01483609854 0.0163971165 0.01779906823 0.0157345589 0.01658006541 0.01767856745 vowelPVINorm 0.006011102844 0.006212722731 0.006066495108 0.006022844786 0.0059080979 0.005946239177 0.00579974942 0.005853436331 0.005861750356 0.005970207649 0.006059811336 0.005955741749 consonantPVINorm 0.009807191844 0.009824664954 0.01023335258 0.01036442773 0.01015554178 0.01186972196 0.009981847411 0.01027775468 0.01083973459 0.011682993 0.01066331915 0.01051632762 syllablePVINorm 0.01513100113 0.01469861179 0.01460090845 0.01424850078 0.01509495692 0.01560843725 0.01386009991 0.01513577809 0.01471565868 0.01435015978 0.01558518122 0.01490329087

Table 12: Results (MSE) for pronunciation features on wav2vec2.0 for non-native read speech corpus (L2 Arctic)

Vocab 1st layer 2nd layer 3rd layer 4rth layer 5th layer 6th layer 7th layer 8th layer 9th layer 10th layer 11th layer 12th layer Total adjectives 0.05510421979 0.05573282852 0.05543830605 0.05820005871 0.05378823381 0.05328800051 0.05165909671 0.05101566702 0.05514225681 0.05501988631 0.05822923791 0.05865896362 Total adverbs 0.03235472971 0.03103184496 0.03050799792 0.03141402481 0.03414797832 0.03062025662 0.03362480616 0.03124593999 0.03223135553 0.03249981824 0.03078615433 0.03073787498 Total nouns 0.03152909146 0.03221436911 0.03405263043 0.03293163967 0.03294914251 0.02952474792 0.02835238334 0.02928062923 0.02950680222 0.03049105247 0.03218764593 0.03326532311 Total verbs 0.03806369809 0.03730369597 0.04160129284 0.03769314576 0.03674953763 0.04077345616 0.03550568766 0.03743405782 0.04529775482 0.04024061532 0.03756253321 0.03765887944 Total pronoun 0.0195913671 0.01954425637 0.01955983322 0.01959607485 0.01956247168 0.01944526418 0.01969910536 0.01944636298 0.01950609594 0.01949140182 0.01948452475 0.01949405056 Total conjunction 0.04257824983 0.04249800371 0.04201713096 0.04132180095 0.04235767698 0.04219216089 0.04165918617 0.04138203177 0.04112746427 0.04187836853 0.04245932476 0.04278077157 Total determiners 0.01977967616 0.01961318789 0.01991113196 0.01972002446 0.01972667769 0.01958241395 0.01964575983 0.0195345228 0.01957967543 0.01961906192 0.01960968918 0.01977558393 Unique Word count 0.01632767888 0.01526919263 0.01442965304 0.0142056617 0.01618741767 0.01649467498 0.01319559305 0.01332996183 0.01482716115 0.01525202137 0.01619794752 0.01812350579 Word Complexity 0.02357972535 0.02507438627 0.02657050341 0.02360681218 0.02386869462 0.02275726452 0.02455551883 0.02325487976 0.02408732581 0.02473021066 0.02585488513 0.02602586518

Table 13: Results (MSE) for text features on wav2vec2.0 for non-native read speech corpus (L2 Arctic)

Audio 1st layer 2nd layer 3rd layer 4rth layer 5th layer 6th layer 7th layer 8th layer 9th layer 10th layer 11th layer 12th layer total duration 0.01132428695 0.003762919523 0.007238992203 0.005671930936 0.005220973059 0.007031149497 0.007505373695 0.007055257454 0.007160907835 0.007566914968 0.01002525633 0.01267732305 stdev energy 0.01531330077 0.01533933898 0.01649358362 0.01449291551 0.01402955807 0.01536183221 0.01397247866 0.01796714901 0.01486330595 0.01347379434 0.01117071068 0.01251225659 mean pitch 0.01631877178 0.01786401 0.01384232861 0.01427588393 0.01120626363 0.01479962637 0.01793363005 0.01136428547 0.008506216093 0.003964858149 0.00336777112 0.003792742136 voiced to unvoiced ratio 0.002840575782 0.002697371658 0.002844036818 0.002683747683 0.002754297038 0.002925922295 0.002919951457 0.002638006071 0.002201766288 0.001792214941 0.001610400597 0.00221409082 zero crossing rate 0.01198278218 0.013095671 0.01180463814 0.0120515795 0.01135187312 0.01377190836 0.01748469921 0.01238682109 0.007931431331 0.00732755264 0.00490843802 0.005564741017 energy entropy 0.01426082695 0.01395716063 0.01138648094 0.01175751104 0.01330351103 0.01249528354 0.01563216641 0.01412510058 0.01682605393 0.01212163778 0.01392596306 0.01502481167 spectral centroid 0.000002846928862 0.000002708621608 0.000002917653246 0.000002864165578 0.000004297931043 0.000002537697294 0.000004529206696 0.000002645481248 0.000002706540701 0.000003055551443 0.000002719192597 0.000004625777187 localJitter 0.009555001024 0.009915405168 0.008387510554 0.009048920991 0.008969662653 0.01148685269 0.01161788982 0.0077060057 0.007508060421 0.007772406808 0.007588351459 0.007486868943 localShimmer 0.007763135906 0.007365633079 0.00610032398 0.006087114938 0.006471823685 0.006133956514 0.007093345793 0.00578951809 0.006633051749 0.005442358034 0.006894810167 0.004808022677

Table 14: Results (MSE) for audio features on Mockingjay for non-native read speech corpus (L2 Arctic)

Fluency 1st layer 2nd layer 3rd layer 4rth layer 5th layer 6th layer 7th layer 8th layer 9th layer 10th layer 11th layer 12th layer filled pause rate 0.000001473276996 0.000005419860981 0.000001650465328 0.000008519413545 0.000003878072607 0.00001318502424 0.00000008692807352 0.00001343803189 0.00007752478416 0.00000156222934 0.00000356914358 0.00005134366334 general silence 0.01307881676 0.01167528975 0.01133589357 0.01092097086 0.01077687113 0.01092365747 0.01254632489 0.01162321885 0.01090970089 0.00954652106 0.0110049385 0.01233310859 mean silence 0.009038383626 0.008801689111 0.00912950239 0.009512880408 0.01012747188 0.01160653425 0.01098129752 0.008231421116 0.007322015957 0.00804670488 0.0072769857 0.007886753264 silence abs deviation 0.008988055153 0.007935119196 0.009433308074 0.009916727547 0.008995566402 0.009183798126 0.009436651227 0.008632040293 0.007513496565 0.008630425092 0.008548152859 0.009036964244 SilenceRate1 0.01136621292 0.0108004546 0.01101192452 0.009303176583 0.009495197318 0.01058382713 0.00993428986 0.009442205917 0.009309233593 0.00890018279 0.008661307729 0.009029231575 SilenceRate2 0.02076327719 0.01990882985 0.01952633015 0.01843555537 0.02110262895 0.01852757613 0.01941679605 0.01836522243 0.0172794501 0.01710937857 0.0166202417 0.01818893651 speaking rate 0.01057500348 0.01003365801 0.009616637147 0.009627113545 0.009733180997 0.01042946507 0.01385399793 0.01038978347 0.01006254321 0.01157778177 0.0101515701 0.009589242748 articulation rate 0.01566377477 0.01457197335 0.01536387335 0.01410481012 0.01377338301 0.01519435582 0.01773659911 0.01638551305 0.01530340419 0.01374970851 0.01359579013 0.01501365605 longpfreq 0.007634478426 0.006733958909 0.009045465163 0.006315528741 0.005608732161 0.00631258586 0.006255921864 0.005071465668 0.004950666884 0.004762317562 0.004740796861 0.004900868777 average syllables in words 0.04781052576 0.05374880648 0.04171374348 0.04345499262 0.04466745407 0.04727407451 0.04362590181 0.04413596202 0.0430084352 0.04317014443 0.04380146089 0.04577109174 wordsyll2 0.03708758571 0.03631960768 0.03530161093 0.03477106527 0.03723229604 0.03800885004 0.03685256715 0.03744263533 0.03709040197 0.03557814656 0.03657478211 0.03585507681 repetition freq 0.02641030373 0.02625820395 0.02641122354 0.02624297484 0.02617873088 0.02633850854 0.02649016891 0.02648242403 0.02648564996 0.02651871265 0.02619171225 0.02630454237

Table 15: Results (MSE) for fluency features on Mockingjay for non-native read speech corpus (L2 Arctic)

Pronunciation 1 2 3 4 5 6 7 8 9 10 11 12 StressDistanceSyllMean 0.01104542613 0.01109904291 0.01132480895 0.01087026261 0.01099540125 0.01082515577 0.01099589881 0.01100831576 0.01102862626 0.01115663156 0.01083237715 0.01120610025 StressDistanceMean 0.01508909442 0.01540528173 0.01499919857 0.01580319327 0.01475674539 0.01507330306 0.01513319981 0.01513321 0.01508071928 0.01489617973 0.01515311937 0.01584713657 vowelPercentage 0.007385191339 0.007231984856 0.00714794423 0.006742874537 0.009473037671 0.007246873266 0.007359432344 0.006418323083 0.006181449067 0.005328299893 0.005268095936 0.005667836572 consonantPercentage 0.0116318778 0.01213936938 0.0109606066 0.01166343992 0.01047406819 0.01305005633 0.01413893024 0.01115384658 0.008864534174 0.008217352218 0.008636426617 0.01102221575 vowelDurationSD 0.005576002021 0.005648573514 0.005741291856 0.005688154593 0.005604476273 0.005606742182 0.005670487645 0.005817736235 0.005528790984 0.005221120835 0.004916082459 0.005037117742 consonantDurationSD 0.009748503712 0.009874701213 0.009595141269 0.009227518199 0.009424982592 0.009572179407 0.009823950086 0.009433136911 0.009011294895 0.008617355423 0.00828695312 0.008858853208 syllableDurationSD 0.02132959553 0.02184669811 0.0226484616 0.01974255433 0.02013796378 0.02047414778 0.02192453187 0.02083005298 0.02065268392 0.0189069598 0.01776821627 0.01812854238 vowelSDNorm 0.007599164413 0.007843354775 0.008135076697 0.007658485044 0.007601849988 0.007556601614 0.00763172515 0.007809241658 0.007583258576 0.007363074176 0.007511993508 0.007514772523 consonantSDNorm 0.01214739807 0.0120310838 0.01166347967 0.01236910295 0.01162965308 0.01164295335 0.01195547213 0.01212879421 0.01196931571 0.01097453874 0.01135074416 0.01105961095 syllableSDNorm 0.01804111848 0.01637122238 0.01882830924 0.01927251511 0.01621654996 0.0166021328 0.01686714405 0.0163813078 0.01657909281 0.01710862262 0.01746782886 0.01690434325 vowelPVINorm 0.006215734333 0.006449337514 0.007165993656 0.006114570327 0.006628356752 0.006099774583 0.006222376875 0.006118055118 0.006215811826 0.006102456314 0.006066908159 0.006612337197 consonantPVINorm 0.0116556712 0.01048152507 0.01176013518 0.01188734875 0.01042246125 0.01100122118 0.01046942329 0.01050718765 0.01059977782 0.01176892219 0.01010445988 0.0102762624 syllablePVINorm 0.01475997474 0.01477713179 0.01559573645 0.01460650209 0.01563738917 0.01487725143 0.01486211243 0.01462639155 0.01491031254 0.01470857668 0.01541867249 0.014618768

Table 16: Results (MSE) for pronunciation features on Mockingjay for non-native read speech corpus (L2 Arctic)

Vocabulary 1st layer 2nd layer 3rd layer 4rth layer 5th layer 6th layer 7th layer 8th layer 9th layer 10th layer 11th layer 12th layer Total adjectives 0.06318404862 0.0619247938 0.06079716269 0.06014132744 0.05987823091 0.06053751686 0.06247047087 0.06301413201 0.06361835457 0.06178066896 0.06281169399 0.06382546757 Total adverbs 0.03117171396 0.03095706324 0.03091227592 0.03186724566 0.03075210676 0.03163696857 0.03176016577 0.03020573131 0.0308514338 0.03167641853 0.03124024759 0.03138128394 Total nouns 0.03479309525 0.04124192357 0.03509293774 0.03460912932 0.03484343636 0.03937797862 0.03621976189 0.03965186853 0.03527331659 0.03480436596 0.03614236555 0.03960483099 Total verbs 0.03815241929 0.03843911825 0.04098537793 0.03909517882 0.04191215109 0.04264500828 0.03921201085 0.03855232573 0.0391547778 0.03854201282 0.04359669904 0.03981245254 Total pronoun 0.01971188699 0.01985237799 0.01996660004 0.01965335094 0.01971121361 0.01976920775 0.01964397728 0.0196416389 0.01967384263 0.01966713911 0.01971822402 0.01979745081 Total conjunction 0.04272169798 0.04139294662 0.04453893254 0.04275823407 0.04250406794 0.04203422813 0.04204185542 0.04190527645 0.04221825341 0.04224755 0.0422208323 0.04247788975 Total determiners 0.01973337315 0.01982656053 0.01975389895 0.01970792552 0.0198162249 0.01972706823 0.01980878641 0.01978879803 0.01979176333 0.01953582744 0.01963054884 0.01960785383 Unique Word count 0.01989839225 0.01728327798 0.01783175585 0.02423742574 0.01881541414 0.01975514188 0.02222050837 0.0226465011 0.02065547339 0.02429926023 0.02616072579 0.02887803155 Word Complexity 0.02514394797 0.0261101335 0.02470550073 0.02725880817 0.02493061146 0.02470394819 0.02668094082 0.02487191537 0.02470169555 0.02530130615 0.02433745363 0.02536928852

Table 17: Results (MSE) for text features on Mockingjay for non-native read speech corpus (L2 Arctic)

Audio reg 1st layer 2nd layer 3rd layer 4rth layer 5th layer 6th layer 7th layer 8th layer 9th layer 10th layer 11th layer 12th layer total duration 0.004848900229 0.003863915258 0.004287122943 0.003783825339 0.004344520906 0.004427644402 0.003811237183 0.005109543088 0.004882294667 0.005763219731 0.009990349682 0.00750392505 stdev energy 0.01730283908 0.01681536258 0.01796875486 0.01966112687 0.0205387239 0.02083439084 0.02079950793 0.02027682245 0.01946188919 0.01971326701 0.02184660291 0.01947816552 mean pitch 0.01006024988 0.00658550357 0.008036226643 0.00811293786 0.008285121024 0.008348004637 0.01001664103 0.01012975076 0.009512276668 0.009188184194 0.01351123816 0.0119256139 voiced to unvoiced ratio 0.006834375191 0.007376795598 0.005870861794 0.005814311666 0.00700179955 0.006451495571 0.007265555558 0.005396942508 0.006471980731 0.007256321426 0.008058287228 0.00715184344 zero crossing rate 0.01357175479 0.0140099222 0.013133127 0.0136683096 0.01394808329 0.01504515634 0.01491236375 0.01575922416 0.01368447107 0.01354429307 0.01973247031 0.01492608689 energy entropy 0.01286934208 0.01425085582 0.01343616879 0.01308097535 0.01387268033 0.01239496646 0.01511617563 0.01359821082 0.01616405567 0.01966480175 0.015758514 0.01739258232 spectral centroid 0.004403675623 0.004427087809 0.004423378077 0.004408777549 0.004420320531 0.004409092315 0.004418122369 0.004420023754 0.004420583009 0.004421105874 0.004423533266 0.004407367195 localJitter 0.01373034504 0.01404961373 0.01389421042 0.01494240126 0.01455549524 0.01497129501 0.01477446334 0.01573872289 0.0148343247 0.01415938999 0.01562866786 0.01418735225 localShimmer 0.01205946784 0.009315373126 0.01028803862 0.009537074303 0.00927095069 0.0105189144 0.01024419213 0.01027528666 0.009515543941 0.009711281336 0.01138879615 0.01044949956

Table 18: Results (MSE) for audio features on wav2vec2.0 for native spontaneous speech corpus (Mozilla Common Voice)

Vocabulary reg 1st layer 2nd layer 3rd layer 4rth layer 5th layer 6th layer 7th layer 8th layer 9th layer 10th layer 11th layer 12th layer Total adjectives 0.0335636095 0.0355528849 0.03562800051 0.03371269054 0.035960629 0.03484367332 0.03336296315 0.03454088316 0.03774241945 0.03567314315 0.03737013864 0.03309152118 Total adverbs 0.03499237064 0.03490227723 0.03548898718 0.03591213537 0.03465381191 0.03488443067 0.03478685717 0.0349435478 0.03489453385 0.03488539124 0.03506322218 0.03446789372 Total nouns 0.0259992195 0.02575678079 0.02825899267 0.02286530921 0.02753577135 0.02649220177 0.02576713354 0.02951983639 0.02885208712 0.02460092158 0.02772971892 0.02685838834 Total proper nouns 0.001246035635 0.001240899417 0.00125103115 0.001239943418 0.001239440379 0.001239651024 0.001245050603 0.001241447194 0.001279094276 0.001251123873 0.001287215026 0.001280808881 Total verbs 0.02174248396 0.02238550591 0.02164992172 0.0214138707 0.02250983884 0.02167924295 0.02164178445 0.02179472569 0.02206596364 0.02152242078 0.0216113466 0.02106053148 Total pronoun 0.004966920314 0.004971733874 0.004971720508 0.004968270775 0.004969505559 0.004966416064 0.004993177728 0.004969222783 0.0049912574 0.004994949394 0.004968574727 0.005062977071 Total conjunction 0.01850148618 0.01841563206 0.01848532752 0.01843472013 0.01846578636 0.01846183381 0.01852863046 0.01848158558 0.01848480125 0.01836849423 0.01870337746 0.01845905571 Total determiners 0.001353849711 0.001253704095 0.001272171462 0.001249276101 0.001262623841 0.001303429942 0.001362165212 0.001254933033 0.001293262048 0.001445716578 0.001389417068 0.001452194368 Unique Word count 0.01845888857 0.02003891249 0.02003383619 0.01776118224 0.02201419969 0.01871178851 0.0180176312 0.02021435048 0.01904716594 0.01893145189 0.02137563037 0.01735364781 Word Complexity 0.01760485511 0.01667454042 0.01630606359 0.01949920931 0.01730683897 0.01996059248 0.01737334921 0.02020322766 0.01521792993 0.01757990552 0.01608085374 0.01588515035

Table 19: Results (MSE) for text features on wav2vec2.0 for native spontaneous speech corpus (Mozilla Common Voice)

Audio 1st layer 2nd layer 3rd layer 4rth layer 5th layer 6th layer 7th layer 8th layer 9th layer 10th layer 11th layer 12th layer total duration 0.01969611883 0.009559035356 0.009083062274 0.01035598879 0.009994780165 0.01041355546 0.01413635177 0.01290650319 0.01739108976 0.0174465089 0.02315019196 0.03310960901 stdev energy 0.02301282557 0.02356082611 0.02357562409 0.02312384893 0.02322197109 0.02277250205 0.02332409925 0.02315600874 0.02222899431 0.02034480139 0.02071422557 0.0225063406 mean pitch 0.03195826549 0.03202319787 0.03432047275 0.03161315384 0.03527633403 0.03387834162 0.03807346721 0.0301615608 0.02604102958 0.01752459431 0.01540516296 0.01320302476 voiced to unvoiced ratio 0.01132498675 0.01224456425 0.01150825352 0.01050957408 0.010532844 0.01125972421 0.01182617455 0.009867347706 0.008246490254 0.006430362119 0.005966189793 0.007566473671 zero crossing rate 0.02514514977 0.02344684766 0.02265917967 0.02391377785 0.02312772603 0.0256152895 0.02416192244 0.02181344027 0.0199218134 0.01959072123 0.01791414976 0.01906959729 energy entropy 0.02260997071 0.01864840583 0.01924748623 0.02238052367 0.02471348779 0.02809989892 0.02233257516 0.02344543182 0.02040420553 0.01593418513 0.02460551242 0.02125200622 spectral centroid 0.00441998788 0.004420464559 0.004415645452 0.00442057248 0.004414127048 0.004421131345 0.004420333751 0.004423290144 0.004420883487 0.004419840753 0.004398830114 0.004422171867 localJitter 0.01830855997 0.01807919782 0.01801409873 0.01824081742 0.01715655285 0.01833400694 0.01812983634 0.01789327138 0.01664063823 0.01590755723 0.0144273038 0.01582481371 localShimmer 0.01804036728 0.0166693069 0.01632145589 0.01672439724 0.01701765074 0.01663488258 0.01663078698 0.01690814134 0.0151106039 0.01350417914 0.01234748409 0.01306185737

Table 20: Results (MSE) for audio features on Mockingjay for native spontaneous speech corpus (Mozilla Common Voice)

Vocabulary 1st layer 2nd layer 3rd layer 4rth layer 5th layer 6th layer 7th layer 8th layer 9th layer 10th layer 11th layer 12th layer Total adjectives 0.03241210244 0.03482287048 0.03529522366 0.03642654975 0.03582160011 0.03406383728 0.03420949581 0.03422122207 0.04153839558 0.03681910709 0.03511459393 0.03497248174 Total adverbs 0.03484643463 0.03475470728 0.03587847323 0.03477450888 0.03488850375 0.03495373018 0.03477768324 0.03487565293 0.03494904503 0.034743899 0.03476430007 0.0357219648 Total nouns 0.0260823038 0.02506542004 0.0257070428 0.02820721339 0.02674152729 0.02730540183 0.02673217732 0.02770532215 0.02663394189 0.02477734541 0.030839408 0.02581936672 Total proper nouns 0.00124667931 0.001262687502 0.001248209042 0.001258744441 0.001256570356 0.001245394034 0.00124379553 0.001278992117 0.001281759997 0.001244656718 0.001248538351 0.001296415104 Total verbs 0.02235103763 0.0212275423 0.02146956801 0.02512176048 0.02319622885 0.02142171765 0.02319601713 0.02195797096 0.02216082649 0.02169235418 0.02133443839 0.02674580057 Total pronoun 0.004979578775 0.004974892093 0.004955408927 0.005024264471 0.004964500056 0.005003028063 0.004994292048 0.004982513006 0.004984716703 0.004986542166 0.004986964387 0.004979221869 Total conjunction 0.01847687076 0.01896668912 0.01868157543 0.01854897153 0.01851183634 0.0184303229 0.0185904104 0.01844198175 0.0185097485 0.0186029227 0.01838761507 0.01878459163 Total determiners 0.001272624752 0.001256711454 0.001250373971 0.00125551682 0.001327768201 0.001259583735 0.001240219521 0.001260014356 0.001349371417 0.001254533257 0.001293317415 0.001350737932 Unique Word count 0.01862358005 0.01894741878 0.01866685513 0.018967376 0.01915162541 0.01874813055 0.01848215867 0.02065582885 0.01985518488 0.02016355156 0.01870237352 0.0186963021 Word Complexity 0.0153121312 0.01548550613 0.01703385729 0.01954327202 0.01886643626 0.0173613293 0.01806525091 0.01726932331 0.01595747218 0.01565404409 0.01712179528 0.01802478775

Table 21: Results (MSE) for text features on Mockingjay for native spontaneous speech corpus (Mozilla Common Voice)