Universal Dependencies According to BERT: Both More Specific and More General
Tomasz Limisiewicz and David Mareček and Rudolf Rosa
Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics
Charles University, Prague, Czech Republic
{limisiewicz,rosa,marecek}@ufal.mff.cuni.cz

Abstract

This work focuses on analyzing the form and extent of syntactic abstraction captured by BERT by extracting labeled dependency trees from self-attentions. Previous work showed that individual BERT heads tend to encode particular dependency relation types. We extend these findings by explicitly comparing BERT relations to Universal Dependencies (UD) annotations, showing that they often do not match one-to-one. We suggest a method for relation identification and syntactic tree construction. Our approach produces significantly more consistent dependency trees than previous work, showing that it better explains the syntactic abstractions in BERT. At the same time, it can be successfully applied with only a minimal amount of supervision and generalizes well across languages.

1 Introduction and Related Work

In recent years, systems based on the Transformer architecture achieved state-of-the-art results in language modeling (Devlin et al., 2019) and machine translation (Vaswani et al., 2017). Additionally, the contextual embeddings obtained from the intermediate representations of the model brought improvements in various NLP tasks. Multiple recent works try to analyze such latent representations (Linzen et al., 2019), observe syntactic properties in some Transformer self-attention heads, and extract syntactic trees from the attention matrices (Raganato and Tiedemann, 2018; Mareček and Rosa, 2019; Clark et al., 2019; Jawahar et al., 2019).

In our work, we focus on the comparative analysis of the syntactic structure, examining how the BERT self-attention weights correspond to Universal Dependencies (UD) syntax (Nivre et al., 2016). We confirm the findings of Vig and Belinkov (2019) and Voita et al. (2019) that in Transformer-based systems particular heads tend to capture specific dependency relation types (e.g., in one head the attention at the predicate is usually focused on the nominal subject).

We extend the understanding of syntax in BERT by examining the ways in which it systematically diverges from the standard annotation (UD). We attempt to bridge the gap between them in three ways:

• We modify the UD annotation of three linguistic phenomena to better match the BERT syntax (§3)
• We introduce a head ensemble method, combining multiple heads which capture the same dependency relation label (§4)
• We observe and analyze multipurpose heads, containing multiple syntactic functions (§7)

Finally, we apply our observations to improve the method of extracting dependency trees from attention (§5), and analyze the results both in a monolingual and a multilingual setting (§6).

Our method crucially differs from probing (Belinkov et al., 2017; Hewitt and Manning, 2019; Chi et al., 2020; Kulmizev et al., 2020). We do not use treebank data to train a parser; rather, we extract dependency relations directly from selected attention heads. We only employ syntactically annotated data to select the heads; however, this means estimating relatively few parameters, and only a small amount of data is sufficient for that purpose (§6.1).

2 Models and Data

We analyze the uncased base BERT model for English, which we will refer to as enBERT, and the uncased multilingual BERT model, mBERT, for English, German, French, Czech, Finnish, Indonesian, Turkish, Korean, and Japanese.[1]

[1] Pretrained models are available at https://github.com/google-research/bert
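For concreteness, attention matrices of the kind analyzed throughout the paper can be obtained directly from a pretrained model. The following sketch is illustrative only and is not the authors' pipeline (which builds on the code shared by Clark et al. (2019), mentioned below); it assumes the HuggingFace transformers package and omits the wordpiece-to-word aggregation that a comparison with UD tokens additionally requires.

```python
# Illustrative sketch (not the authors' pipeline): obtain per-head self-attention
# matrices from an uncased BERT model via the HuggingFace transformers package.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

sentence = "a cat is an animal"
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer, each of shape
# (batch, num_heads, seq_len, seq_len); seq_len includes [CLS] and [SEP].
attentions = torch.stack(outputs.attentions).squeeze(1)  # (layers, heads, seq, seq)
print(attentions.shape)                                  # e.g. torch.Size([12, 12, 7, 7])

# Wordpiece-level attention would still have to be aggregated to word level
# before it can be compared with UD tokens (not shown here).
```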
The code shared by Clark et al. (2019)[2] substantially helped us in extracting attention weights from BERT.

To find syntactic heads, we use: 1000 EuroParl multi-parallel sentences (Koehn, 2004) for five European languages, automatically annotated with UDPipe UD 2.0 models (Straka and Straková, 2017); the Google Universal Dependency Treebanks (GSD) for Indonesian, Korean, and Japanese (McDonald et al., 2013); and the UD Turkish Treebank (IMST-UD) (Sulubacak et al., 2016). We further use the PUD treebanks from the CoNLL 2017 Shared Task for the evaluation of mBERT in all languages (Nivre et al., 2017).[3]

[2] https://github.com/clarkkev/attention-analysis
[3] The mentioned treebanks are available at the Universal Dependencies web page https://universaldependencies.org

3 Adapting UD to BERT

Since the explicit dependency structure is not used in BERT training, the syntactic dependencies captured in latent layers are expected to diverge from the annotation guidelines. After initial experiments, we have observed that some of the differences are systematic (see Table 1).

UD                                                          | Modified                                    | Example
Copula attaches to a noun                                   | Copula is a root[4]                         | "a cat is an animal"
Expletive is not a subject                                  | Expletive is treated as a subject           | "there is a spoon"
In multiple coordination, all conjuncts attach to the first conjunct | Conjunct attaches to a previous one | "apples, oranges and pears"

Table 1: Comparison of the original Universal Dependencies annotations and our modification.

[4] Certain dependents of the original root (e.g., the subject, auxiliaries) are rehanged and attached to the new root, the copula verb.

Based on these observations, we modify the UD annotations in our experiments to better fit the BERT syntax, using UDApi[5] (Popel et al., 2017). The main motivation of our approach is to get trees similar to the structures emerging from BERT, which we have observed in a qualitative analysis of attention weights. We note that for copulas and coordinations, BERT syntax resembles Surface-syntactic UD (SUD) (Gerdes et al., 2018). Nevertheless, we decided to use our custom modification, since some systematic divergences between SUD and the latent representation occur as well. It is not our intention to compare the two annotation guidelines. A comprehensive comparison between extracting UD and extracting SUD trees from BERT was performed by Kulmizev et al. (2020). However, they used a probing approach, which is noticeably different from our setting.

[5] https://udapi.github.io
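To make the copula row of Table 1 concrete, the following sketch applies the modification to a toy sentence. It is a schematic illustration over a plain (id, form, head, deprel) list, not the authors' UDApi implementation; the function name and the exact set of rehanged dependents are our assumptions based on the table and footnote 4.

```python
# Schematic sketch of the copula modification from Table 1 (our own minimal
# representation, not the authors' UDApi code). Each token is
# (id, form, head, deprel); heads are 1-based, 0 marks the root.
def rehang_copula(tokens):
    tokens = [list(t) for t in tokens]
    by_id = {t[0]: t for t in tokens}
    for tok in tokens:
        tid, form, head, deprel = tok
        if deprel == "cop":
            old_root = by_id[head]            # the nominal predicate, e.g. "animal"
            # The copula takes over the head and deprel of the old root,
            # which in turn becomes the copula's object.
            tok[2], tok[3] = old_root[2], old_root[3]
            old_root[2], old_root[3] = tid, "obj"
            # Footnote 4: some dependents of the original root (e.g. the subject,
            # auxiliaries) are rehanged and attached to the new root, the copula.
            for other in tokens:
                if other[2] == old_root[0] and other[3] in {"nsubj", "csubj", "aux"}:
                    other[2] = tid
    return [tuple(t) for t in tokens]

# "a cat is an animal" with original UD heads/deprels:
ud = [(1, "a", 2, "det"), (2, "cat", 5, "nsubj"), (3, "is", 5, "cop"),
      (4, "an", 5, "det"), (5, "animal", 0, "root")]
for tok in rehang_copula(ud):
    print(tok)
# e.g. (3, 'is', 0, 'root') and (5, 'animal', 3, 'obj') after the change
```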
4 Head Ensemble

In line with Clark et al. (2019) and other studies (Voita et al., 2019; Vig and Belinkov, 2019), we have noticed that a specific syntactic relation type can often be found in a specific head. Additionally, we observe that a single head often captures only a specific aspect or subtype of one UD relation type, motivating us to combine multiple heads to cover the full relation.

Figure 1 shows the attention weights of two syntactic heads (right columns) and their average (left column). In the top row (purple), both heads identify the parent noun for an adjectival modifier: Head 9 in Layer 3 does so if their distance is two positions or less, Head 10 in Layer 7 if they are further away (as in "a stable, green economy"). Similarly, for the object to predicate relation (blue, bottom row), Head 9 in Layer 7 and Head 8 in Layer 3 capture pairs with shorter and longer positional distances, respectively.

Figure 1: Examples of two enBERT attention heads covering the same relation label and their average. Gold relations are marked by red letters.

4.1 Dependency Accuracy of Heads

To quantify the amount of syntactic information conveyed by a self-attention head A for a dependency relation label l in a specific direction d (for instance, predicate → subject), we compute:

\[ \mathrm{DepAcc}_{l,d,A} = \frac{\left|\{(i, j) \in E_{l,d} : j = \arg\max A[i]\}\right|}{\left|E_{l,d}\right|} \]

where E_{l,d} is the set of all dependency tree edges with the label l and the direction d; i.e., in the dependent to parent direction (abbreviated d2p), the first element of the tuple, i, is the dependent of the relation and the second element, j, is the governor. A[i] is the i-th row of the attention matrix A.

In this article, when we say that a head with attention matrix A is syntactic for a relation type l, we mean that its DepAcc_{l,d,A} is high in one of the directions (parent to dependent, p2d, or dependent to parent, d2p).

4.2 Method

Having observed that some heads convey only partial information about a UD relation, we propose a method to connect the knowledge of multiple heads. Our objective is to find, for each directed relation, a set of heads such that their attention weights after averaging have a high dependency accuracy.

The algorithm is straightforward: we define the maximum number N of heads in the subset; we sort the heads based on their DepAcc on a development set; starting from the most syntactic one, we check whether including the head's attention matrix in the average would increase DepAcc; if it does, the head is added to the ensemble. When there are already N heads in the ensemble, no further heads are added, and we keep the subset with the highest DepAcc of the averaged attention matrices. We set N to be 4, as allowing larger ensembles does not improve the results significantly.

5 Dependency Tree Construction

To extract dependency trees from the self-attention weights, we use a method similar to Raganato and Tiedemann (2018), which employs a maximum spanning tree algorithm (Edmonds, 1966) and uses gold information about the root of the syntax tree. We use the following steps to construct a labeled dependency tree:

1. For each non-clausal UD relation label, ensembles of syntactic heads are selected as described in Section 4, and the attention matrices in each ensemble are averaged. Hence, we obtain two matrices for each label (one for each direction: "dependent to parent" and "parent to dependent").

2. The "dependent to parent" matrix is transposed and averaged with the "parent to dependent" matrix (sketched below).
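The DepAcc measure (§4.1) and the greedy selection (§4.2) can be written down compactly. The sketch below is our reading of the procedure described above; the function names are ours, and details such as tie handling and the treatment of the final subset may differ from the authors' implementation.

```python
import numpy as np

def dep_acc(A, edges):
    """DepAcc_{l,d,A}: fraction of gold edges (i, j) for which the attention
    row of token i puts its maximum weight on token j."""
    if not edges:
        return 0.0
    hits = sum(1 for i, j in edges if int(np.argmax(A[i])) == j)
    return hits / len(edges)

def select_ensemble(heads, edges, max_heads=4):
    """Greedy head-ensemble selection for one relation label and direction.

    `heads` maps a (layer, head) id to its attention matrix. Heads are sorted
    by individual DepAcc and added only if they improve the DepAcc of the
    averaged matrix, up to `max_heads` heads (N = 4 in the paper).
    """
    ranked = sorted(heads.items(), key=lambda kv: dep_acc(kv[1], edges), reverse=True)
    ensemble, best = [], 0.0
    for head_id, A in ranked:
        if len(ensemble) >= max_heads:
            break
        candidate = np.mean([heads[h] for h in ensemble] + [A], axis=0)
        score = dep_acc(candidate, edges)
        if score > best:
            ensemble.append(head_id)
            best = score
    return ensemble, best
```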
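Steps 1 and 2 above, together with the maximum spanning tree extraction, might look roughly as follows. This is a simplified sketch, not the paper's implementation: it ignores labels and the remaining steps of the procedure, assumes the attending token indexes the rows of each attention matrix, and uses networkx's implementation of the Edmonds arborescence algorithm (our choice of library), with the gold root enforced by deleting its incoming edges.

```python
import numpy as np
import networkx as nx

def combined_matrix(d2p_heads, p2d_heads):
    """Steps 1-2: average each ensemble, then merge both directions by
    transposing the 'dependent to parent' matrix and averaging it with the
    'parent to dependent' one. Resulting rows index parents, columns dependents."""
    d2p = np.mean(d2p_heads, axis=0)   # rows: dependents attending to parents
    p2d = np.mean(p2d_heads, axis=0)   # rows: parents attending to dependents
    return (d2p.T + p2d) / 2.0

def extract_tree(weights, root):
    """Maximum spanning arborescence over the combined weights, with the gold
    root fixed by removing all edges that point into it."""
    n = weights.shape[0]
    G = nx.DiGraph()
    for parent in range(n):
        for child in range(n):
            if parent != child and child != root:
                G.add_edge(parent, child, weight=float(weights[parent, child]))
    tree = nx.maximum_spanning_arborescence(G)
    return sorted(tree.edges())  # (parent, child) pairs
```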