How Do BERT Embeddings Organize Linguistic Knowledge?
Giovanni Puccetti†, Alessio Miaschi⋆, Felice Dell'Orletta
†Scuola Normale Superiore, Pisa
⋆Department of Computer Science, University of Pisa
Istituto di Linguistica Computazionale "Antonio Zampolli", Pisa
ItaliaNLP Lab – www.italianlp.it
[email protected], [email protected], [email protected]

Abstract

Several studies have investigated the linguistic information implicitly encoded in Neural Language Models. Most of these works focused on quantifying the amount and type of information available within their internal representations and across their layers. In line with this scenario, we proposed a different study, based on Lasso regression, aimed at understanding how the information encoded by BERT sentence-level representations is arranged within its hidden units. Using a suite of several probing tasks, we showed the existence of a relationship between the implicit knowledge learned by the model and the number of individual units involved in the encoding of this competence. Moreover, we found that it is possible to identify groups of hidden units more relevant for specific linguistic properties.

1 Introduction

The rise of contextualized word representations (Peters et al., 2018; Devlin et al., 2019) has led to significant improvements in several (if not all) NLP tasks. The main drawback of these approaches, despite their outstanding performance, is the lack of interpretability. In fact, high-dimensional representations do not allow for any insight into the type of linguistic properties encoded in these models. Therefore, this implicit knowledge can only be determined a posteriori, by designing tasks that require a specific linguistic skill to be tackled (Linzen and Baroni, 2020) or by investigating to what extent certain information is encoded within contextualized internal representations, e.g. by defining probing classifiers trained to predict a variety of language phenomena (Conneau et al., 2018a; Hewitt and Manning, 2019; Tenney et al., 2019a).

In line with this latter approach, and with recent works aimed at investigating how information is arranged within neural models' representations (Baan et al., 2019; Dalvi et al., 2019; Lakretz et al., 2019), we proposed an in-depth investigation aimed at understanding how the information encoded by BERT is arranged within its internal representation. In particular, we defined two research questions, aimed at: (i) investigating the relationship between the sentence-level linguistic knowledge encoded in a pre-trained version of BERT and the number of individual units involved in the encoding of such knowledge; (ii) understanding how these sentence-level properties are organized within the internal representations of BERT, identifying groups of units more relevant for specific linguistic tasks. We defined a suite of probing tasks based on a variable selection approach, in order to identify which units in the internal representations of BERT are involved in the encoding of similar linguistic properties. Specifically, we relied on a wide range of linguistic tasks, which have been shown to successfully model different typologies of sentence complexity (Brunato et al., 2020), from very simple features (such as sentence length) to more complex properties related to the morphosyntactic and syntactic structure of a sentence (such as the distribution of specific dependency relations).

The paper is organized as follows. In Sec. 2 we present related work, then we describe our approach (Sec. 3), with a focus on the model and the data used for the experiments (Sec. 3.1) and the set of probing tasks (Sec. 3.2). Experiments and results are discussed in Sec. 4 and 5. To conclude, we summarize the main findings of our work in Sec. 6.
2 Related work

In the last few years, a number of works have explored the inner mechanisms and the linguistic knowledge implicitly encoded in Neural Language Models (NLMs) (Belinkov and Glass, 2019). The most common approach is based on the development of probes, i.e. supervised models trained to predict simple linguistic properties using the contextual word/sentence embeddings of a pre-trained model as training features (Conneau et al., 2018b; Zhang and Bowman, 2018; Miaschi et al., 2020). These studies demonstrated that NLMs are able to encode a wide range of linguistic information in a hierarchical manner (Blevins et al., 2018; Jawahar et al., 2019; Tenney et al., 2019b) and even to support the extraction of dependency parse trees (Hewitt and Manning, 2019). For instance, Liu et al. (2019) quantified differences in the transferability of individual layers between different models, showing that higher layers of RNNs (ELMo) are more task-specific (less general), while Transformer layers (BERT) do not exhibit this increase in task-specificity.

Other works also investigated the importance of individual neurons within model representations (Qian et al., 2016; Bau et al., 2019; Baan et al., 2019). Dalvi et al. (2019) proposed two methods, Linguistic Correlations Analysis and Cross-model Correlation Analysis, to study whether specific dimensions learned by end-to-end neural models are responsible for specific properties. For instance, they showed that open-class categories such as verbs and locations are much more distributed across the network compared to closed-class categories (e.g. coordinating conjunctions), and also that the model recognizes a hierarchy of linguistic properties and distributes neurons based on it. Lakretz et al. (2019), instead, proposed a detailed study of the inner mechanism of number tracking in LSTMs at the single-neuron level, showing that long-distance number information (from the subject to the verb) is largely managed by two specific units.

Differently from these latter works, our aim was to combine previous approaches based on probes and on the study of individual units, in order to propose an in-depth investigation of the organization of linguistic competence within NLM contextualized representations.

3 Approach

To study how the information used by BERT to implicitly encode linguistic properties is arranged within its internal representations, we relied on a variable selection approach based on Lasso regression (Tibshirani, 1996), which aims at keeping as few non-zero coefficients as possible when solving specific regression tasks. Our aim was to identify which weights within sentence-level BERT internal representations can be set to zero, in order to understand the relationship between hidden units and linguistic competence and whether the information needed to perform similar linguistic tasks is encoded in similar positions. We relied on a suite of 68 sentence-level probing tasks, each of which corresponds to a specific linguistic feature capturing characteristics of a sentence at a different level of granularity. In particular, we defined a Lasso regression model that takes as input layer-wise BERT representations for each sentence of a gold standard Universal Dependencies (UD) (Nivre et al., 2016) English dataset and predicts the actual value of a given sentence-level feature. Lasso regression consists in adding an L1 penalty to the ordinary least squares loss; one of its most relevant parameters is therefore λ, which tunes how much the L1 penalty contributes to the loss function. We performed a grid search with cross-validation for each feature-layer pair, in order to identify the best-suited value of λ for each task. Specifically, our goal was to find the value that yields the best performance with as few non-zero coefficients as possible.
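To make the selection step concrete, the following is a minimal sketch of a single feature-layer regression: Lasso minimizes the squared error plus λ times the L1 norm of the coefficients, and λ is chosen by cross-validated grid search. The sketch assumes scikit-learn (not named in the paper), which exposes λ as `alpha`; the λ grid, fold count, scoring metric and placeholder data are illustrative assumptions, not settings reported by the authors.

```python
# Sketch of one feature-layer probe: Lasso regression with a cross-validated
# grid search over lambda (scikit-learn's `alpha`). Grid, folds and data are
# illustrative placeholders, not values taken from the paper.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

def fit_sparse_probe(X, y):
    """X: (n_sentences, 768) sentence representations from one BERT layer.
    y: gold values of one sentence-level feature (e.g. sentence length)."""
    grid = GridSearchCV(
        estimator=Lasso(max_iter=10_000),
        param_grid={"alpha": np.logspace(-4, 1, 20)},  # candidate lambdas (assumed grid)
        scoring="neg_mean_squared_error",
        cv=5,  # assumed number of folds
    )
    grid.fit(X, y)
    best = grid.best_estimator_
    # The L1 penalty drives most coefficients to exactly zero: the surviving
    # non-zero weights point at the hidden units involved in encoding the feature.
    active_units = np.flatnonzero(best.coef_)
    return best, active_units

# Random placeholders standing in for real representations and feature values.
X = np.random.randn(200, 768)
y = np.random.randn(200)
model, active_units = fit_sparse_probe(X, y)
print(f"lambda={model.alpha:.4f}, non-zero units={active_units.size}")
```

In the spirit of the paper's first research question, the number and position of the surviving non-zero coefficients are the quantities the analysis reasons about: comparable prediction quality with fewer active units suggests a more localized encoding of the corresponding feature.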
3.1 Model and data

We used a pre-trained version of BERT (BERT-base uncased, 12 layers). In order to obtain the representations for our sentence-level tasks, we experimented with the activation of the first input token ([CLS]) and with the mean of all the word embeddings of each sentence (Mean-pooling); see the sketch after the feature table below.

With regard to the data used for the regression experiments, we relied on the Universal Dependencies (UD) English dataset. The dataset includes three UD English treebanks: UD English-ParTUT, a conversion of a multilingual parallel treebank consisting of a variety of text genres, including talks, legal texts and Wikipedia articles (Sanguinetti and Bosco, 2015); the Universal Dependencies version of the GUM corpus (Zeldes, 2017); and the English Web Treebank (EWT), a gold standard universal dependencies corpus for English (Silveira et al., 2014). Overall, the final dataset consists of 23,943 sentences.

3.2 Linguistic features

As already mentioned, we defined a suite of probing tasks relying on a wide set of sentence-level linguistic features automatically extracted from the parsed sentences in the UD dataset. The set of

Level of Annotation | Linguistic Feature | Label
Raw Text    | Raw Text Properties |
            |   Sentence Length | sent_length
            |   Word Length | char_per_tok
Vocabulary  | Vocabulary Richness |
            |   Type/Token Ratio for words and lemmas | ttr_form, ttr_lemma
POS tagging | Morphosyntactic information |
            |   Distribution of UD and language-specific POS | upos_dist_*, xpos_dist_*
            |   Lexical density | lexical_density
            | Inflectional morphology |
            |   Inflectional morphology of lexical verbs and auxiliaries | xpos_VB-VBD-VBP-VBZ, aux_*
            | Verbal Predicate Structure |
            |   Distribution
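As an illustration of the sentence representations described in Sec. 3.1, the sketch below extracts both the [CLS] activation and the mean-pooled embedding, layer by layer, from BERT-base uncased. The use of the HuggingFace transformers library, and the inclusion of special tokens in the mean, are assumptions of this sketch rather than details given in the paper.

```python
# Sketch of the two sentence-representation strategies from Sec. 3.1:
# the [CLS] activation and the mean of all word embeddings (Mean-pooling),
# taken layer by layer from bert-base-uncased. Tooling (HuggingFace
# transformers) and special-token handling are assumptions of this sketch.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def sentence_representations(sentence, layer):
    """Return ([CLS], mean-pooled) vectors for one sentence at one layer
    (layer 0 is the embedding output, layers 1-12 the Transformer layers)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    hidden = outputs.hidden_states[layer]             # (1, seq_len, 768)
    cls_vec = hidden[0, 0]                            # activation of the [CLS] token
    mask = inputs["attention_mask"][0].unsqueeze(-1)  # ignore padding if present
    mean_vec = (hidden[0] * mask).sum(0) / mask.sum() # mean over all input tokens
    return cls_vec, mean_vec

cls_vec, mean_vec = sentence_representations("The cat sat on the mat.", layer=12)
print(cls_vec.shape, mean_vec.shape)  # torch.Size([768]) torch.Size([768])
```

At a high level, feeding either per-layer vector as the input matrix of the Lasso sketch in Sec. 3 reproduces the feature-layer probing pipeline described above.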