Do Attention Heads in BERT Track Syntactic Dependencies?

Phu Mon Htut∗1  Jason Phang∗1  Shikha Bordia∗†2  Samuel R. Bowman1,2,3

1Center for Data Science, New York University, 60 Fifth Avenue, New York, NY 10011
2Dept. of Computer Science, New York University, 60 Fifth Avenue, New York, NY 10011
3Dept. of Linguistics, New York University, 10 Washington Place, New York, NY 10003

∗ Equal contribution. † Currently working at Verisk Analytics; this work was completed when the author was at New York University.

Abstract

We investigate the extent to which individual attention heads in pretrained transformer language models, such as BERT and RoBERTa, implicitly capture syntactic dependency relations. We employ two methods—taking the maximum attention weight and computing the maximum spanning tree—to extract implicit dependency relations from the attention weights of each layer/head, and compare them to the ground-truth Universal Dependency (UD) trees. We show that, for some UD relation types, there exist heads that can recover the dependency type significantly better than baselines on parsed English text, suggesting that some self-attention heads act as a proxy for syntactic structure. We also analyze BERT fine-tuned on two datasets—the syntax-oriented CoLA and the semantics-oriented MNLI—to investigate whether fine-tuning affects the patterns of their self-attention, but we do not observe substantial differences in the overall dependency relations extracted using our methods. Our results suggest that these models have some specialist attention heads that track individual dependency types, but no generalist head that performs holistic parsing significantly better than a trivial baseline, and that analyzing attention weights directly may not reveal much of the syntactic knowledge that BERT-style models are known to learn.

1 Introduction

Pretrained Transformer models such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) have shown stellar performance on language understanding tasks, significantly improving the state of the art on tasks such as dependency parsing (Zhou et al., 2019) and question answering (Rajpurkar et al., 2016), and have attained top positions on transfer learning benchmarks such as GLUE (Wang et al., 2019b) and SuperGLUE (Wang et al., 2019a). As these models become a staple component of many NLP tasks, it is crucial to understand what kind of knowledge they learn, and why and when they perform well. To that end, researchers have investigated the linguistic knowledge that these models learn by analyzing BERT directly (Goldberg, 2018; Lin et al., 2019) or by training probing classifiers on the contextualized embeddings or attention heads of BERT (Tenney et al., 2019b,a; Hewitt and Manning, 2019).

BERT and RoBERTa, as Transformer models (Vaswani et al., 2017), compute the hidden representation of all the attention heads at each layer for each token by attending to all the token representations in the preceding layer. In this work, we investigate the hypothesis that BERT-style models use at least some of their attention heads to track syntactic dependency relationships between words. We use two dependency relation extraction methods to extract dependency relations from each self-attention head of BERT and RoBERTa. The first method—maximum attention weight (MAX)—designates the word with the highest incoming attention weight as the parent, and is meant to identify specialist heads that track specific dependencies like obj (in the style of Clark et al., 2019). The second—maximum spanning tree (MST)—computes a maximum spanning tree over the attention matrix, and is meant to identify generalist heads that can form complete, syntactically informative dependency trees. We analyze the extracted dependency relations and trees to investigate whether the attention heads of these models track syntactic dependencies significantly better than chance or baselines, and what type of dependency relations they learn best. In contrast to probing models (Adi et al., 2017; Conneau et al., 2018), our methods require no further training.

In prior work, Clark et al. (2019) find that some heads of BERT exhibit the behavior of some dependency relation types, though they do not perform well at all types of relations in general. We are able to replicate their results on BERT using our MAX method. In addition, we also perform a similar analysis on RoBERTa and on BERT models fine-tuned on natural language understanding tasks.

Our experiments suggest that there are particular attention heads of BERT and RoBERTa that encode certain dependency relation types, such as nsubj and obj, with substantially higher accuracy than our baselines—a randomly initialized Transformer and relative positional baselines. We find that fine-tuning BERT on the syntax-oriented CoLA does not significantly impact the accuracy of extracted dependency relations. However, when fine-tuned on the semantics-oriented MNLI dataset, we see improvements in accuracy for longer-distance clausal relations and a slight loss in accuracy for shorter-distance relations. Overall, while BERT and RoBERTa obtain non-trivial accuracy for some dependency types such as nsubj, obj, and conj when we analyze individual heads, their performance still leaves much to be desired. Similarly, when we use the MST method to extract full trees from specific heads, BERT and RoBERTa fail to meaningfully outperform our baselines. Although the attention heads of BERT and RoBERTa capture several specific dependency relation types somewhat well, they do not reflect the full extent of the significant amount of syntactic knowledge that these models are known to learn.
2 Related Work

Previous work has proposed methods for extracting dependency relations and trees from the attention heads of transformer-based neural machine translation (NMT) models. In their preliminary work, Mareček and Rosa (2018) aggregate the attention weights across the self-attention layers and heads to form a single attention weight matrix. Using this matrix, they propose methods to extract constituency and (undirected) dependency trees by recursively splitting the matrix and by constructing maximum spanning trees, respectively. In contrast, Raganato and Tiedemann (2018) train Transformer-based machine translation models on different language pairs and extract maximum spanning trees from the attention weights of the encoder for each layer and head individually. They find that the best dependency score is not significantly higher than that of a right-branching tree baseline. Voita et al. (2019) identify the most confident attention heads of the Transformer NMT encoder based on a heuristic of the concentration of attention weights on a single token, and find that these heads mostly attend to relative positions, syntactic relations, and rare words.

Additionally, researchers have investigated the syntactic knowledge that BERT learns by analyzing its contextualized embeddings (Warstadt et al., 2019a) and attention heads (Clark et al., 2019). Goldberg (2018) analyzes the contextualized embeddings of BERT by computing language model surprisal on subject-verb agreement and shows that BERT learns significant knowledge of syntax. Tenney et al. (2019b) introduce a probing classifier for evaluating syntactic knowledge in BERT and show that BERT encodes syntax more than semantics. Hewitt and Manning (2019) train a structural probing model that maps the hidden representations of each token to an inner-product space that corresponds to syntax tree distance. They show that the learned spaces of strong models such as BERT and ELMo (Peters et al., 2018) are better for reconstructing dependency trees than baselines. Clark et al. (2019) train a probing classifier on the attention heads of BERT and show that BERT's attention heads capture substantial syntactic information. While there has been prior work on the analysis of the attention heads of BERT, we believe we are the first to analyze the dependency relations learned by the attention heads of fine-tuned BERT models and RoBERTa.

3 Methods

3.1 Models

BERT (Devlin et al., 2019) is a Transformer-based masked language model, pretrained on BooksCorpus (Zhu et al., 2015) and English Wikipedia, that has attained stellar performance on a variety of downstream NLP tasks. RoBERTa (Liu et al., 2019) adds several refinements to BERT while using the same model architecture and capacity, including a longer training schedule over more data, and shows significant improvements over BERT on a wide range of NLP tasks. We run our experiments on the pretrained large versions of both BERT (cased and uncased) and RoBERTa, which consist of 24 self-attention layers with 16 heads per layer. For a given dataset, we feed each input sentence through the respective model and capture the attention weights for each individual head and layer.

Phang et al. (2018) report performance gains on the GLUE benchmark from supplementing pretrained BERT with further training on data-rich supervised tasks such as the Multi-Genre Natural Language Inference dataset (MNLI; Williams et al., 2018). Although these fine-tuned BERT models may learn different aspects of language and show different performance from BERT on the GLUE benchmark, comparatively little previous work has investigated the syntax learned by these fine-tuned models (Warstadt et al., 2019a). We run experiments on the uncased BERT-large model fine-tuned on the Corpus of Linguistic Acceptability (CoLA; Warstadt et al., 2019b) and on MNLI, to investigate the impact of fine-tuning on a syntax-related task (CoLA) and a semantics-related task (MNLI) on the structure of attention weights and the resulting extracted dependency relations. We refer to these fine-tuned models as CoLA-BERT and MNLI-BERT, respectively.
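For concreteness, the snippet below shows one way to capture per-layer, per-head attention weights for a single sentence. It is an illustrative sketch rather than the exact pipeline used in our experiments, and it assumes the HuggingFace transformers library and PyTorch, which the paper itself does not prescribe; the model name and example sentence are placeholders.

```python
# Sketch: capturing per-layer, per-head attention weights for one sentence.
# Assumes the HuggingFace `transformers` library (an assumption, not a
# requirement of the paper).

import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-large-cased"  # or "bert-large-uncased", "roberta-large"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_attentions=True)
model.eval()

sentence = "The cat sat on the mat ."  # placeholder example
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer (24 for the large
# models), each of shape (batch, num_heads=16, seq_len, seq_len), and each
# row of every attention matrix sums to 1.
attn = torch.stack(outputs.attentions).squeeze(1)  # (layers, heads, T, T)
print(attn.shape)
```

The special tokens and wordpiece columns/rows of these matrices are handled as described in Section 3.2 before any relations are extracted.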
3.2 Analysis Methods

We aim to test the hypothesis that the attention heads of BERT learn syntactic relations implicitly, and that the self-attention between two words corresponds to their dependency relation. We use two methods for extracting dependency relations from the attention heads of Transformer-based models. Both methods operate on the attention weight matrix W ∈ (0, 1)^{T×T} for a given head at a given layer, where T is the number of tokens in the sequence, and the rows and columns correspond to the attending and attended tokens respectively (such that each row sums to 1).

Method 1: Maximum Attention Weights (MAX) Given a token A in a sentence, a self-attention mechanism is designed to assign high attention weights to tokens that have some kind of relationship with token A (Vaswani et al., 2017). Therefore, for a given token A, the token B that receives the highest attention weight from A should be related to A. Our aim is to investigate whether this relation maps to a Universal Dependency relation. We assign a relation (w_i, w_j) between words w_i and w_j if j = argmax W[i], for each row i of the attention matrix W (each row corresponds to a word). Based on this simple strategy, we extract relations for all sentences in our evaluation datasets. This method is similar to that of Clark et al. (2019), and attempts to recover individual arcs between words; the relations extracted using this method need not form a valid tree, or even be fully connected, and the resulting edge directions may or may not match the canonical directions. Hence, we evaluate the resulting arcs individually and ignore their direction. After extracting dependency relations from all heads at all layers, we take the maximum UUAS over all relation types.

Method 2: Maximum Spanning Tree (MST) We also want to investigate whether there are attention heads of BERT that can form complete, syntactically informative parse trees. To extract complete valid dependency trees from the attention weights for a given layer and head, we follow the approach of Raganato and Tiedemann (2018) and treat the matrix of token-to-token attention weights as a complete weighted directed graph, with the edges pointing from the output token to each attended token. As in Raganato and Tiedemann, we take the root of the gold dependency tree as the starting node and apply the Chu-Liu-Edmonds algorithm (Chu, 1965; Edmonds, 1967) to compute the maximum spanning tree. (Using the gold root as the starting point in MST may artificially improve our results slightly, but this bias is applied evenly across all the models we compare.) The resulting tree is a valid directed dependency tree, though we follow Hewitt and Manning (2019) in evaluating it as undirected, for easier comparison with our MAX method.

Following Voita et al. (2019), we exclude the sentence demarcation tokens ([CLS], [SEP], <s>, </s>) from the attention matrices. This allows us to focus on inter-word attention. Where the tokenization of our parsed corpus does not match the model's tokenization, we merge the non-matching tokens until the tokenizations are mutually compatible, and sum the attention weights for the corresponding columns and rows. We then apply either of the two extraction methods to the attention matrix. During evaluation, when comparing against the gold dependencies, we handle the subtokens within the merged tokens by setting all subtokens except the first to depend on the first subtoken. This approach is largely similar to that of Hewitt and Manning (2019). We use the English Parallel Universal Dependencies (PUD) treebank from the CoNLL 2017 shared task (Zeman et al., 2018) as the gold standard for our evaluation.
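To make the two extraction procedures concrete, the sketch below applies both to a single head's attention matrix. It is an illustrative sketch, not our released code: it assumes a numpy attention matrix with special tokens already removed and wordpieces already merged, it uses networkx's Chu-Liu/Edmonds implementation (maximum_spanning_arborescence) rather than a hand-rolled one, and it fixes the gold root by deleting that node's incoming edges, which is one standard way to force a root.

```python
# Sketch of the MAX and MST extraction methods for one head at one layer.
# `attn` is a T x T numpy array (rows = attending tokens, columns = attended
# tokens, each row summing to 1), with demarcation tokens removed and
# wordpieces merged into words beforehand.

import numpy as np
import networkx as nx


def max_method_arcs(attn):
    """MAX method: relate each word i to the word j = argmax W[i].

    Arcs are returned undirected, since direction is ignored at evaluation
    time. Self-loops are dropped here (an assumption; the paper does not
    discuss them explicitly)."""
    arcs = set()
    for i, row in enumerate(attn):
        j = int(np.argmax(row))
        if i != j:
            arcs.add(frozenset((i, j)))
    return arcs


def mst_method_arcs(attn, gold_root):
    """MST method: maximum spanning arborescence over the attention graph,
    rooted at the gold root, then evaluated as an undirected tree."""
    graph = nx.DiGraph()
    num_tokens = attn.shape[0]
    for i in range(num_tokens):
        for j in range(num_tokens):
            # Edges point from the attending (output) token to the attended
            # token; removing all edges into the gold root forces it to be
            # the root of the arborescence.
            if i != j and j != gold_root:
                graph.add_edge(i, j, weight=float(attn[i, j]))
    tree = nx.maximum_spanning_arborescence(graph)  # Chu-Liu/Edmonds
    return {frozenset(edge) for edge in tree.edges()}
```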
Baselines Many dependency relations tend to occur in specific positions relative to the parent word. For example, amod (adjectival modifier) mostly occurs before a noun; as an example, Figure 1 shows the distribution of relative positions for four major UD relations in our data. Following Voita et al. (2019), we compute the most common positional offset between a parent and child word for a given dependency relation, and formulate a baseline based on that most common relative positional offset. For MST, as we also want to evaluate the quality of the entire tree, we use a right-branching dependency tree as a baseline. Additionally, we use a BERT-large model with randomly initialized weights (which we refer to as random BERT), as previous work has shown that randomly initialized sentence encoders perform surprisingly well on a suite of NLP tasks (Zhang and Bowman, 2018; Wieting and Kiela, 2019).

Figure 1: The distribution of the relative positions of nsubj, obj, advmod, amod in PUD.
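The two trivial baselines are simple enough to state in a few lines. The sketch below is an illustration under our reading of the setup (offsets counted on the evaluation treebank, arcs treated as undirected), not a description of the exact implementation; the function and argument names are placeholders.

```python
# Sketch of the relative positional offset baseline and the right-branching
# tree baseline. Arcs are undirected, matching the evaluation.

from collections import Counter, defaultdict


def most_common_offsets(gold_relations):
    """gold_relations: iterable of (parent_index, child_index, relation_label).
    Returns, per relation label, the most common (child - parent) offset."""
    counts = defaultdict(Counter)
    for parent, child, label in gold_relations:
        counts[label][child - parent] += 1
    return {label: c.most_common(1)[0][0] for label, c in counts.items()}


def positional_baseline_arc(child_index, label, offsets, sentence_length):
    """Predict the head of the word at `child_index` for relation `label`
    at the most common offset (head = child - offset, since offsets are
    stored as child - parent)."""
    head = child_index - offsets[label]
    if 0 <= head < sentence_length and head != child_index:
        return frozenset((child_index, head))
    return None


def right_branching_tree(sentence_length):
    """Right-branching baseline: undirected edges link each word to the next."""
    return {frozenset((i, i + 1)) for i in range(sentence_length - 1)}
```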

4 Results

Figure 2 and Table 1 report the accuracy for the most frequent relation types in the dataset, using relations extracted with the MAX method. We also include results for the rarer long-distance advcl and csubj dependency types, as they show that MNLI-BERT has a tendency to track clausal dependencies better than BERT, CoLA-BERT, and RoBERTa. The non-random models outperform random BERT substantially for all dependency types. They also outperform the relative position baselines for some relation types: by a large margin for nsubj and obj, but only slightly for advmod and amod. These results are consistent with the findings of Clark et al. (2019). Moreover, we do not observe substantial differences in accuracy from fine-tuning on CoLA. Both BERT and CoLA-BERT have similar or slightly better performance than MNLI-BERT, except for clausal dependencies such as advcl (adverbial clause modifier) and csubj (clausal subject), where MNLI-BERT outperforms BERT and CoLA-BERT by more than 5 absolute points in accuracy. This suggests that fine-tuning on a semantics-oriented task encourages effective long-distance dependencies, although it slightly degrades performance on other, shorter-distance dependency types.

Figure 2: Undirected dependency accuracies by type based on our MAX method.

Figure 3 shows the accuracy for the nsubj, obj, advmod, and amod relations extracted with the MST method. As with the MAX method, we choose the best accuracy for each relation type. We observe that the models outperform the baselines by a large margin for nsubj and obj. However, the models do not outperform the positional baseline for advmod and amod. Surprisingly, RoBERTa performs worse than the BERT models in all categories when the MAX method is used, but it outperforms all other models when the MST method is used.

Figure 3: Undirected dependency accuracies by type based on the MST method.

Model         nsubj  obj    advmod amod   case   det    obl    nmod   punct  aux    conj   cc     mark   advcl  csubj
BERT (cased)  56.8   77.4   59.6   82.7   83.8   93.2   31.0   51.7   40.2   80.0   45.0   73.5   72.1   29.4   55.6
BERT          60.7   86.4   62.6   84.0   88.8   96.3   32.1   66.7   40.5   81.0   48.9   67.4   72.1   26.6   48.1
CoLA-BERT     59.7   84.2   63.8   86.4   89.5   96.2   33.8   66.2   41.1   82.2   50.6   68.5   70.8   28.0   51.9
MNLI-BERT     59.5   83.0   59.7   81.7   87.9   95.3   32.4   63.3   41.3   78.6   50.5   65.5   68.5   34.5   63.0
RoBERTa       50.2   69.3   58.5   79.3   75.6   74.4   26.2   47.4   37.4   75.7   44.6   69.0   63.1   23.2   44.4
Positional    40.4   40.5   57.6   78.7   38.7   56.7   24.0   35.4   18.6   55.5   27.8   43.4   53.7   10.23  25.9
Random-BERT   16.8   12.9   11.8   11.1   13.7   12.6   13.5   13.8   12.6   16.3   18.9   20.9   12.8   13.3   22.2

Table 1: Highest accuracy for the most frequent dependency types, based on our MAX method. Bold marks the highest accuracy for each dependency type; italics mark accuracies that outperform our trivial baselines.

Figure 4 shows the maximum undirected unlabeled attachment score (UUAS) across each layer. The trained models achieve significantly higher UUAS than random BERT. Although the trained models perform better than the right-branching baseline in most cases, the performance gap is not substantial. Given that the MST method uses the root of the gold trees, whereas the right-branching baseline does not, this implies that the attention weights in the different layers/heads of the BERT models do not appear to correspond to complete, syntactically informative parse trees.

Figure 4: Maximum UUAS across layers of dependency trees extracted based on the MST method on PUD.

Overall, the results of both analysis methods suggest that, although some attention heads of BERT capture specific dependency relation types, they do not reflect the full extent of the significant amount of syntactic knowledge that BERT and RoBERTa are known to learn, as shown in previous syntactic probing work (Tenney et al., 2019b; Hewitt and Manning, 2019). Additionally, we find that fine-tuning on the semantics-oriented MNLI dataset improves accuracy on long-distance dependencies while slightly degrading performance on other dependency types. The overall improvements of BERT and the fine-tuned BERTs over the non-random baselines are not substantial, and fine-tuning on CoLA and MNLI also does not have a large impact on UUAS.
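As a reference point for how the numbers behind Figure 4 can be computed, the sketch below scores extracted undirected edge sets against gold PUD arcs and takes a per-layer maximum over heads. It is an illustrative sketch, not our experiment code: the `extract_arcs` argument stands in for either extraction method (for MST, the gold root can be bound with functools.partial), and preprocessing of the attention matrices is assumed to have happened upstream.

```python
# Sketch: per-layer maximum UUAS over heads, given precomputed attention
# tensors and gold undirected dependency arcs for each sentence.

import numpy as np


def uuas(predicted_arcs, gold_arcs):
    """Undirected unlabeled attachment score: fraction of gold arcs recovered."""
    return len(predicted_arcs & gold_arcs) / len(gold_arcs)


def max_uuas_per_layer(attentions, gold_arc_sets, extract_arcs):
    """attentions: list over sentences of arrays of shape (layers, heads, T, T);
    gold_arc_sets: list over sentences of sets of frozenset({i, j}) gold arcs;
    extract_arcs: callable mapping a T x T attention matrix to a set of arcs.
    Returns the best mean UUAS over heads for each layer."""
    n_layers, n_heads = attentions[0].shape[:2]
    scores = np.zeros((n_layers, n_heads))
    for layer in range(n_layers):
        for head in range(n_heads):
            per_sentence = [
                uuas(extract_arcs(attn[layer, head]), gold)
                for attn, gold in zip(attentions, gold_arc_sets)
            ]
            scores[layer, head] = float(np.mean(per_sentence))
    return scores.max(axis=1)  # best head per layer
```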

5 Conclusion

In this work, we investigate whether the attention heads of BERT and RoBERTa track syntactic dependencies implicitly, by extracting and analyzing the dependency relations from the attention heads at all layers. We use two simple dependency relation extraction methods that require no additional training, and observe that there are certain specialist attention heads of the models that track specific dependency types, but neither of our analysis methods supports the existence of generalist heads that can perform holistic parsing. Furthermore, we observe that fine-tuning on CoLA and MNLI does not significantly change the overall pattern of self-attention within the frame of our analysis, despite their being tuned for starkly different downstream tasks.

Acknowledgments

This project grew out of a class project for the Spring 2019 NYU Linguistics seminar Linguistic Knowledge in Reusable Sentence Encoders. We are grateful to the department for making this seminar possible.

This material is based upon work supported by the National Science Foundation under Grant No. 1850208. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. This project has also benefited from financial support to SB by Samsung Research under the project Improving Deep Learning using Latent Structure and from the donation of a Titan V GPU by NVIDIA Corporation.

References

Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.

Yoeng-Jin Chu. 1965. On the shortest arborescence of a directed graph. Scientia Sinica, 14:1396–1400.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy. Association for Computational Linguistics.

Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2126–2136, Melbourne, Australia. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Jack Edmonds. 1967. Optimum branchings. Journal of Research of the National Bureau of Standards, 71B:233–240.

Yoav Goldberg. 2018. Assessing BERT's syntactic abilities. Unpublished manuscript available on arXiv.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota. Association for Computational Linguistics.

Yongjie Lin, Yi Chern Tan, and Robert Frank. 2019. Open sesame: Getting inside BERT's linguistic knowledge. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 241–253, Florence, Italy. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized pretraining approach. Unpublished manuscript available on arXiv.

David Mareček and Rudolf Rosa. 2018. Extracting syntactic trees from transformer encoder self-attentions. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 347–349, Brussels, Belgium. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).

Jason Phang, Thibault Févry, and Samuel R. Bowman. 2018. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. Unpublished manuscript available on arXiv.

Alessandro Raganato and Jörg Tiedemann. 2018. An analysis of encoder representations in transformer-based machine translation. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 287–297, Brussels, Belgium. Association for Computational Linguistics.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019a. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Sam Bowman, Dipanjan Das, and Ellie Pavlick. 2019b. What do you learn from context? Probing for sentence structure in contextualized word representations. In International Conference on Learning Representations.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Neural Information Processing Systems.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797–5808, Florence, Italy. Association for Computational Linguistics.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019a. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Annual Conference on Neural Information Processing Systems, abs/1905.00537.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations.

Alex Warstadt, Yu Cao, Ioana Grosu, Wei Peng, Hagen Blix, Yining Nie, Anna Alsop, Shikha Bordia, Haokun Liu, Alicia Parrish, Sheng-Fu Wang, Jason Phang, Anhad Mohananey, Phu Mon Htut, Paloma Jeretic, and Samuel R. Bowman. 2019a. Investigating BERT's knowledge of language: Five analysis methods with NPIs. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2870–2880, Hong Kong, China. Association for Computational Linguistics.

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2019b. Neural network acceptability judgments. TACL, 7:625–641.

John Wieting and Douwe Kiela. 2019. No training required: Exploring random encoders for sentence classification. In International Conference on Learning Representations.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.

Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov. 2018. CoNLL 2018 shared task: Multilingual parsing from raw text to Universal Dependencies. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Brussels, Belgium, October 31 - November 1, 2018, pages 1–21.

Kelly W. Zhang and Samuel R. Bowman. 2018. Language modeling teaches you more syntax than translation does: Lessons learned through auxiliary task analysis. Unpublished manuscript available on arXiv.

Junru Zhou, Zhuosheng Zhang, and Hai Zhao. 2019. LIMIT-BERT: Linguistic informed multi-task BERT. Unpublished manuscript available on arXiv.

Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. Unpublished manuscript available on arXiv.