Do Neural Language Models Show Preferences for Syntactic Formalisms?

Artur Kulmizev
Uppsala University
[email protected]

Vinit Ravishankar
University of Oslo
[email protected]

Mostafa Abdou
University of Copenhagen
[email protected]

Joakim Nivre
Uppsala University
[email protected]

Abstract

Recent work on the interpretability of deep neural language models has concluded that many properties of natural language are encoded in their representational spaces. However, such studies often suffer from limited scope by focusing on a single language and a single linguistic formalism. In this study, we aim to investigate the extent to which the semblance of syntactic structure captured by language models adheres to a surface-syntactic or deep syntactic style of analysis, and whether the patterns are consistent across different languages. We apply a probe for extracting directed dependency trees to BERT and ELMo models trained on 13 different languages, probing for two different syntactic annotation styles: Universal Dependencies (UD), prioritizing deep syntactic relations, and Surface-Syntactic Universal Dependencies (SUD), focusing on surface structure. We find that both models exhibit a preference for UD over SUD — with interesting variations across languages and layers — and that the strength of this preference is correlated with differences in tree shape.

1 Introduction

Recent work on interpretability in NLP has led to the consensus that deep neural language models trained on large, unannotated datasets manage to encode various aspects of syntax as a byproduct of the training objective. Probing approaches applied to models like ELMo (Peters et al., 2018a) and BERT (Devlin et al., 2019) have demonstrated that one can decode various linguistic properties such as part-of-speech categories, dependency relations, and named-entity types directly from the internal hidden states of a pretrained model (Tenney et al., 2019a,b; Peters et al., 2018b). Another line of work has tried to tie cognitive measurements or theories of human linguistic processing to the machinations of language models, often establishing strong parallels between the two (Prasad et al., 2019; Abnar et al., 2019; Gauthier and Levy, 2019).

As is the case for NLP in general, English has served as the de facto testing ground for much of this work, with other languages often appearing as an afterthought. However, despite its ubiquity in the NLP literature, English is generally considered to be atypical across many typological dimensions. Furthermore, the tendency of interpreting NLP models with respect to existing, canonical datasets often comes with the danger of conflating the theory-driven annotation therein with scientific fact. One can observe this to an extent with the Universal Dependencies (UD) project (Nivre et al., 2016), which aims to collect syntactic annotation for a large number of languages. Many interpretability studies have taken UD as a basis for training and evaluating probes, but often fail to mention that UD, like all annotation schemes, is built upon specific theoretical assumptions, which may not be universally accepted.

Our research questions start from these concerns. When probing language models for syntactic dependency structure, is UD — with its emphasis on syntactic relations between content words — really the best fit? Or is the representational structure of such models better explained by a scheme that is more oriented towards surface structure, such as the recently proposed Surface-Syntactic Universal Dependencies (SUD) (Gerdes et al., 2018)? And are these patterns consistent across typologically different languages? To explore these questions, we fit the structural probe of Hewitt and Manning (2019) on pretrained BERT and ELMo representations, supervised by UD/SUD treebanks for 13 languages, and extract directed dependency trees.
We then conduct an extensive error analysis of the resulting probed parses, in an attempt to qualify our findings. Our main contributions are the following:

1. A simple algorithm for deriving directed trees from the disjoint distance and depth probes introduced by Hewitt and Manning (2019).
2. A multilingual analysis of the probe's performance across 13 different treebanks.
3. An analysis showing that the syntactic information encoded by BERT and ELMo fits UD better than SUD for most languages.

2 Related Work

There has been a considerable amount of recent work attempting to understand what aspects of natural language pre-trained encoders learn. The classic formulation of these probing experiments is in the form of diagnostic classification (Ettinger et al., 2016; Belinkov et al., 2017; Hupkes et al., 2018; Conneau et al., 2018), which attempts to unearth underlying linguistic properties by fitting relatively underparameterised linear models over representations generated by an encoder. These methods have also faced recent critique, for example, concerning the lack of transparency in the classifiers' ability to extract meaningful information, as opposed to learning it. Alternative paradigms for interpretability have therefore been proposed, such as correlation-based methods (Raghu et al., 2017; Saphra and Lopez, 2018; Kornblith et al., 2019; Chrupała and Alishahi, 2019). However, this critique does not invalidate diagnostic classification: indeed, more recent work (Hewitt and Liang, 2019) describes methods to show the empirical validity of certain probes, via control tasks.

Among probing studies specifically pertinent to our paper, Blevins et al. (2018) demonstrate that deep RNNs are capable of encoding syntax given a variety of pre-training tasks, including language modeling. Peters et al. (2018b) demonstrate that, regardless of encoder (recurrent, convolutional, or self-attentive), biLM-based pre-training results in similar high-quality representations that implicitly encode a variety of linguistic phenomena, layer by layer. Similarly, Tenney et al. (2019a) employ the 'edge probing' approach of Tenney et al. (2019b) to demonstrate that BERT implicitly learns the 'classical NLP pipeline', with lower-level linguistic tasks encoded in lower layers and more complex phenomena in higher layers, and dependency syntax in layers 5–6. Finally, Hewitt and Manning (2019) describe a syntactic probe for extracting aspects of dependency syntax from pre-trained representations, which we describe in Section 4.

3 Aspects of Syntax

Syntax studies how natural language encodes meaning using expressive devices such as word order, case marking and agreement. Some approaches emphasize the formal side and primarily try to account for the distribution of linguistic forms. Other frameworks focus on the functional side to capture the interface to semantics. And some theories use multiple representations to account for both perspectives, such as c-structure and f-structure in LFG (Kaplan and Bresnan, 1982; Bresnan, 2000) or surface-syntactic and deep syntactic representations in Meaning-Text Theory (Mel'čuk, 1988).

When asking whether neural language models learn syntax, it is therefore relevant to ask which aspects of syntax we are concerned with. This is especially important if we probe the models by trying to extract syntactic representations, since these representations may be based on different theoretical perspectives. As a first step in this direction, we explore two different dependency-based syntactic representations, for which annotations are available in multiple languages. The first is Universal Dependencies (UD) (Nivre et al., 2016), a framework for cross-linguistically consistent morpho-syntactic annotation, which prioritizes direct grammatical relations between content words. These relations tend to be more parallel across languages that use different surface features to encode the relations. The second is Surface-Syntactic Universal Dependencies (SUD) (Gerdes et al., 2018), a recently proposed alternative to UD, which gives more prominence to function words in order to capture variations in surface structure across languages.

Figure 1 contrasts the two frameworks by showing how they annotate an English sentence. While the two annotations agree on most syntactic relations (in black), including the analysis of core grammatical relations like subject (nsubj1) and object (obj), they differ in the analysis of auxiliaries and prepositional phrases. The UD annotation (in blue) treats the main verb chased as the root of the clause, while the SUD annotation (in red) assigns this role to the auxiliary has. The UD annotation has a direct oblique relation between chased and room, treating the preposition from as a case marker, while the SUD annotation has an oblique relation between chased and from, analyzing room as the object of from.

1 UD uses the nsubj relation, for nominal subject, while SUD uses a more general subj relation.


Figure 1: Simplified UD and SUD annotation for an English sentence.
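To make the contrast concrete, the differing attachments described above can be restated as a small data sketch. This is an illustration of the prose, not an excerpt from either treebank; the head words and labels follow the description of Figure 1.

```python
# Head and relation for the words whose analysis differs between UD and SUD in the
# Figure 1 example "the dog has chased the cat from the room". The remaining
# relations (subject, object of "chased", determiners) are shared by both schemes.
ud_heads = {
    "chased": ("ROOT", "root"),    # main verb is the root of the clause
    "has":    ("chased", "aux"),   # auxiliary attaches to the main verb
    "from":   ("room", "case"),    # preposition treated as a case marker on the noun
    "room":   ("chased", "obl"),   # direct oblique relation to the main verb
}
sud_heads = {
    "has":    ("ROOT", "root"),       # auxiliary is the root
    "chased": ("has", "comp:aux"),    # main verb depends on the auxiliary
    "from":   ("chased", "obl"),      # oblique relation targets the preposition
    "room":   ("from", "obj"),        # noun becomes the object of the preposition
}
```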

The purpose of the UD style of annotation is to increase the probability of the root and oblique relations being parallel in other languages that use morphology (or nothing at all) to encode the information expressed by auxiliaries and adpositions. SUD is instead designed to bring out differences in surface structure in such cases. The different treatment of function words affects not only adpositions (prepositions and postpositions) and auxiliaries (including copulas), but also subordinating conjunctions and infinitive markers. Because of these systematic differences, dependency trees in UD tend to have longer average dependency length and smaller height than in SUD.2

2 The height of a tree is the length of the longest path from the root to a leaf (sometimes referred to as depth).

4 Probing Model

To conduct our experiments, we make use of the structural probe proposed by Hewitt and Manning (2019), which is made up of two complementary components — distance and depth. The former is an intuitive proxy for the notion of two words being connected by a dependency: any two words w_i, w_j in a tree T are neighbors if their respective distance in the tree amounts to d_T(w_i, w_j) = 1. This metric can theoretically be applied to the vector space of any pretrained neural language model sentence encoding, which outputs a set of vectors S = h_1, ..., h_n for a sentence. In practice, however, the distance between any two vectors {h_i, h_j} ∈ S will not be directly comparable to their distance in a corresponding syntactic tree T, because the model does not encode syntax in isolation. To resolve this, Hewitt and Manning (2019) propose to learn a linear transformation matrix B, such that d_B(h_i, h_j) extracts the distance between any two words w_i, w_j in a parse tree. For an annotated corpus of L sentences, the distance probe can be learned via gradient descent as follows:

\min_B \sum_{l=1}^{L} \frac{1}{|n^l|^2} \sum_{i,j} | d_{T^l}(w_i^l, w_j^l) - d_B(h_i^l, h_j^l)^2 |

where |n^l| is the length of sentence l, normalized by the number |n^l|^2 of word pairs, and d_{T^l}(w_i^l, w_j^l) is the distance of words w_i^l and w_j^l in the gold tree.

While the distance probe can predict which words enter into dependencies with one another, it is insufficient for predicting which word is the head. To resolve this, Hewitt and Manning (2019) employ a separate probe for tree depth,3 where they make a similar assumption as they do for distance: a given squared vector L2 norm ||h_i||^2 is analogous to w_i's depth in a tree T. A linear transformation matrix B can therefore be learned in a similar way:

\min_B \sum_{l=1}^{L} \frac{1}{n^l} \sum_{i} ( ||w_i^l|| - ||B h_i^l||^2 )

where ||w_i^l|| is the depth of w_i^l in the gold tree.

3 The depth of a node is the length of the path from the root.
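For concreteness, the two objectives above can be sketched in a few lines of PyTorch. This is an illustrative sketch rather than the authors' implementation; the class and function names are ours, and the L1 penalty mirrors Hewitt and Manning (2019).

```python
# A minimal sketch of the distance and depth probes: a single matrix B maps encoder
# states into a space where squared distances / squared norms approximate tree
# distances / depths.
import torch
import torch.nn as nn


class StructuralProbe(nn.Module):
    def __init__(self, hidden_dim: int, rank: int):
        super().__init__()
        self.B = nn.Parameter(torch.randn(hidden_dim, rank) * 0.01)

    def distances(self, h: torch.Tensor) -> torch.Tensor:
        # h: (seq_len, hidden_dim) -> pairwise squared distances d_B(h_i, h_j)^2
        t = h @ self.B                          # (seq_len, rank)
        diff = t.unsqueeze(1) - t.unsqueeze(0)  # (seq_len, seq_len, rank)
        return (diff ** 2).sum(-1)

    def depths(self, h: torch.Tensor) -> torch.Tensor:
        # squared norms ||B h_i||^2, used as predicted depths
        t = h @ self.B
        return (t ** 2).sum(-1)


def probe_losses(probe, h, gold_dist, gold_depth):
    # gold_dist: (n, n) tree distances; gold_depth: (n,) node depths in the gold tree
    n = h.size(0)
    dist_loss = (probe.distances(h) - gold_dist).abs().sum() / (n * n)
    depth_loss = (probe.depths(h) - gold_depth).abs().sum() / n
    return dist_loss, depth_loss
```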
To be able to score probed trees (against UD and SUD gold trees) using the standard metric of unlabeled attachment score (UAS), we need to derive a rooted directed dependency tree from the information provided by the distance and depth probes. Algorithm 1 outlines a simple method to retrieve a well-formed tree with the help of the Chu-Liu-Edmonds maximum spanning tree algorithm (Chu and Liu, 1965; McDonald et al., 2005). Essentially, in a sentence S = w_1 ... w_n, for every pair of nodes (w_i, w_j) with an estimated distance of d between them, if w_i has smaller depth than w_j, we set the weight of the arc (w_i, w_j) to −d; otherwise, we set the weight to −∞. This is effectively a mapping from distances to scores, with larger distances resulting in lower arc scores from the parent to the child, and infinitely low scores from the child to the parent. We also add a pseudo-root w_0 (essential for decoding), which has a single arc pointing to the shallowest node (weighted 0). We use the AllenNLP (Gardner et al., 2018) implementation of the Chu-Liu/Edmonds' algorithm.

Algorithm 1: Invoke CLE for sentence S = w_{1,n}, given distance matrix E and depth vector D
  procedure InvokeCLE(E, D)
    N ← |S| + 1
    M ← Init(shape=(N, N), value=−∞)
    for (w_i, w_j) ∈ E do
      if D(w_i) < D(w_j) then
        M(w_i, w_j) ← −E(w_i, w_j)
    root ← argmin_i D(w_i)
    M(0, w_root) ← 0
    return CLE(M)
  end procedure

| Language | Code | Treebank  | # Sents  | %ADP | %AUX | ContRel UD/SUD | DepLen UD/SUD | Height UD/SUD |
| Arabic   | arb  | PADT      | 6075     | 15   | 1    | 37/24          | 4.17/3.92     | 7.20/9.82     |
| Chinese  | cmn  | GSD       | 3997     | 5    | 3    | 37/30          | 3.72/3.74     | 4.30/6.56     |
| English  | eng  | EWT       | 12543    | 8    | 6    | 20/12          | 3.13/2.94     | 3.48/5.11     |
| Basque   | eus  | BDT       | 5396     | 2    | 13   | 34/25          | 2.99/2.90     | 3.49/4.18     |
| Finnish  | fin  | TDT       | 12217    | 2    | 7    | 35/30          | 2.98/2.91     | 3.42/4.22     |
| Hebrew   | heb  | HTB       | 5241     | 14   | 2    | 28/14          | 3.76/3.53     | 5.07/7.30     |
| Hindi    | hin  | HDTB      | 13304    | 22   | 9    | 26/10          | 3.44/3.05     | 4.25/7.41     |
| Italian  | ita  | ISDT      | 13121    | 14   | 5    | 21/8           | 3.30/3.12     | 4.21/6.28     |
| Japanese | jap  | GSD       | 7125     | 25   | 14   | 31/10          | 2.49/2.08     | 4.40/8.18     |
| Korean   | kor  | GSD       | 4400     | 2    | 0    | 58/57          | 2.20/2.17     | 3.86/4.07     |
| Russian  | rus  | SynTagRus | 48814    | 10   | 1    | 31/22          | 3.28/3.13     | 4.21/5.24     |
| Swedish  | swe  | Talbanken | 4303     | 12   | 5    | 29/17          | 3.14/2.98     | 3.50/5.02     |
| Turkish  | tur  | IMST      | 3664     | 3    | 2    | 33/30          | 2.21/2.12     | 3.01/3.37     |
| Average  | -    | -         | 10784.62 | 12   | 5    | 32/22          | 3.14/3.00     | 4.20/5.91     |

Table 1: Treebank statistics: number of sentences (# Sents) and percentage of adpositions (ADP) and auxiliaries (AUX). Comparison of UD and SUD: percentage of direct relations involving only nouns and/or verbs (ContRel); average dependency length (DepLen) and average tree height (Height). Language codes are ISO 639-3.
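Algorithm 1 above can be realized compactly with an off-the-shelf Edmonds implementation. The sketch below is not the AllenNLP-based routine the authors mention; it uses networkx instead, and the tie-breaking on equal depths is an addition that the algorithm itself leaves open.

```python
# Decode a rooted tree from the probed distance matrix E and depth vector D.
import numpy as np
import networkx as nx


def decode_tree(E: np.ndarray, D: np.ndarray) -> dict:
    """E[i, j]: probed distance between words i and j (0-indexed). D[i]: probed depth
    of word i. Returns {dependent: head}, with words 1-indexed and 0 as the pseudo-root."""
    n = len(D)
    G = nx.DiGraph()
    G.add_nodes_from(range(n + 1))  # node 0 is the pseudo-root, words are 1..n
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Arc from the shallower word to the deeper one, scored -distance; the
            # reverse arc is simply omitted (the -inf case of Algorithm 1). Ties in
            # depth are broken by index here.
            if (D[i], i) < (D[j], j):
                G.add_edge(i + 1, j + 1, weight=-float(E[i, j]))
    root = int(np.argmin(D)) + 1
    G.add_edge(0, root, weight=0.0)  # single arc from pseudo-root to the shallowest word
    tree = nx.maximum_spanning_arborescence(G)  # Chu-Liu/Edmonds step
    return {dep: head for head, dep in tree.edges()}
```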

5 Experimental Design

In order to evaluate the extent to which a given model's representational space fits either annotation framework, we fit the structural probe on the model, layer by layer, using UD and SUD treebanks for supervision, and compute UAS over each treebank's test set as a proxy for a given layer's goodness-of-fit.

Language and Treebank Selection  We reuse the sample of Kulmizev et al. (2019), which comprises 13 languages from different language families, with different morphological complexity, and with different scripts. We use treebanks from UD v2.4 (Nivre et al., 2019) and their conversions into SUD.4 Table 1 shows background statistics for the treebanks, including the percentage of adpositions (ADP) and auxiliaries (AUX), two important function word categories that are treated differently by UD and SUD. A direct comparison of the UD and SUD representations shows that, as expected, UD has a higher percentage of relations directly connecting nouns and verbs (ContRel), higher average dependency length (DepLen) and lower average tree height (Height). However, the magnitude of the difference varies greatly across languages.5

4 https://surfacesyntacticud.github.io/data/
5 For Chinese, UD actually has slightly lower average dependency length than SUD.

Models  We evaluate two pretrained language models: BERT (Devlin et al., 2019) and ELMo (Peters et al., 2018a). For BERT, we use the pre-trained multilingual-bert-cased model6 provided by Google. The model is trained on the concatenation of WikiDumps for the top 104 languages with the largest Wikipedias and features a 12-layer Transformer with 768 hidden units and 12 self-attention heads. For ELMo, we make use of the pretrained monolingual models made available by Che et al. (2018). These models are trained on 20 million words randomly sampled from the concatenation of WikiDump and CommonCrawl datasets for 44 different languages, including our 13 languages. Each model features a character-based word embedding layer, as well as 2 bi-LSTM layers, each of which is 1024-dimensions wide.

6 https://github.com/google-research/bert

Though we fit the probe on all layers of each model separately, we also learn a weighted average over each full model:

\mathrm{model}_i = \sum_{j=0}^{L} s_j h_{i,j}

where s_j is a learned parameter, h_{i,j} is the encoding of word i at layer j, and L is the number of layers. We surmise that, in addition to visualizing the probes' fit across layers, this approach will give us a more general notion of how well either model aligns with the respective frameworks. We refer to this representation as the 13th BERT layer and the 3rd ELMo layer. When determining the dimensionality of the transformation matrix (i.e. probe rank), we defer to each respective encoder's hidden layer sizes. However, preliminary experiments indicated that probing accuracy was stable across ranks of decreasing sizes.
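The layer-weighted average defined above can be sketched as a small module. This is an illustration, not the authors' code; in particular, softmax-normalizing the learned scalars is an assumption on our part, mirroring common scalar-mix implementations such as AllenNLP's.

```python
# A learned weighted average over layer representations: model_i = sum_j s_j * h_{i,j}.
import torch
import torch.nn as nn


class ScalarMix(nn.Module):
    def __init__(self, num_layers: int):
        super().__init__()
        self.scalars = nn.Parameter(torch.zeros(num_layers))  # s_j, learned jointly with the probe

    def forward(self, layer_states: torch.Tensor) -> torch.Tensor:
        # layer_states: (num_layers, sequence_length, hidden_dim)
        weights = torch.softmax(self.scalars, dim=0)           # normalize the learned s_j
        return torch.einsum("l,lsh->sh", weights, layer_states)
```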
It is important to note that by probe we henceforth refer to the algorithm that combines both distance and depth probes to return a valid tree. One could argue that, per recent insights in the interpretability literature (e.g. Hewitt and Liang, 2019), this model is too expressive in that it combines supervision from two different sources. We do not consider this a problem, as the two probes are trained separately and offer views into two different abstract properties of the dependency tree. As such, we do not optimize for UAS directly.

6 Results and Discussion

Figure 2 displays the UAS after fitting the structural probes on BERT and ELMo, per language and layer. What is perhaps most noticeable is that, while BERT can achieve accuracies upwards of 79 UAS on some languages, ELMo fares consistently worse, maxing out at 65 for Hindi at layer 2. The most likely explanation for this is that the ELMo models are smaller than the multilingual BERT's 12-layer Transformer-based architecture, which was trained on orders of magnitude more data (albeit multilingually).

Figure 2: Probe results per model, layer, and language. First two rows depict UAS per layer and language for BERT and ELMo, with average performance and error over UD/SUD in the 3rd column. Bottom two rows depict the difference in UAS across UD (+) and SUD (−) per model.

In general, we find that the probing performance is stable across languages, where layers 7–8 fare the best for BERT and layer 2 for ELMo.7 This contrasts with prior observations (Tenney et al., 2019a), as the syntactic 'center of gravity' is placed higher in each model's hierarchy. However, computing a weighted average over layers tends to produce the best overall performance for each model, indicating that the probe can benefit from information encoded across various layers.

7 It is important to note that layer 0 for ELMo is the non-recurrent embedding layer which contains no contextual information.

Once we compare the averaged results across syntactic representations, a preference for UD emerges, starting in layer 3 in BERT and layer 2 in ELMo. We observe the max difference in favor of UD in layer 7 for BERT, where the probe performs 3 UAS points better than SUD, and in the weighted average (layer 13), with 4 UAS points. The difference for the 13th BERT and 3rd ELMo layers is statistically significant at p ≤ 0.05 (Wilcoxon signed ranks test). A further look at differences across languages reveals that, while most languages tend to overwhelmingly prefer UD, there are some that do not: Basque, Turkish, and, to a lesser extent, Finnish. Furthermore, the preference towards SUD in these languages tends to be most pronounced in the first four and last two layers of BERT. However, in the layers where we tend to observe the highest UAS overall (7–8), this is minimized for Basque/Turkish and almost eliminated for Finnish. Indeed, we see the strongest preferences for UD in these layers overall, where Italian and Japanese are overwhelmingly pro-UD, to the order of 10+ UAS points.

6.1 Controlling for Treebank Size

Overall, we note that some languages consistently achieve higher accuracy, like Russian with 71/69 UAS for UD/SUD for BERT, while others fare poorly, like Turkish (52/43) and Chinese (51/46). In the case of these languages, one can observe an obvious relation to the size of our reference treebanks, where Russian is by far the largest and Turkish and Chinese are the smallest. To test the extent to which training set size affects probing accuracy, we trained our probe on the same treebanks, truncated to the size of the smallest one — Turkish, with 3664 sentences. Though we did observe a decline in accuracy in the largest treebanks (e.g. Russian, Finnish, and English) for some layers, the difference in aggregate was minimal. Furthermore, the magnitude of the difference in UD and SUD probing accuracy was almost identical to that of the probes trained on full treebanks, speaking to the validity of our findings. We refer the reader to Appendix A for these results.
6.2 Connection to Supervised Parsing

Given that our findings seem to generally favor UD, another question we might ask is: are SUD treebanks simply harder to parse? This may seem like a straightforward hypothesis, given SUD's tendency to produce higher trees in aggregate, which may affect parsing accuracy — even in the fully supervised case. To test this, we trained UD and SUD parsers using the UDify model (Kondratyuk and Straka, 2019), which employs a biaffine attention decoder (Dozat and Manning, 2016) after fine-tuning BERT representations (similar to our 13th layer). The results showed a slightly higher average UAS for UD (89.9 vs. 89.6) and a slightly higher LAS for SUD (86.8 vs. 86.5). Neither difference is statistically significant (Wilcoxon signed ranks test), which seems to rule out an alternative explanation in terms of learnability. We include the full range of results in Appendix B.

In addition to this, we tested how well each framework's probing accuracy related to supervised UAS across languages. We computed this measure by taking the Pearson correlation of each BERT probe's layer accuracy (per-language) with its respective framework accuracy. All correlations proved to be significant at p ≤ 0.05, with the exception of UD and SUD at layer 1. Figure 3 displays these results. Here, we observe that probing accuracies correlate more strongly with supervised UAS for UD than for SUD. We can interpret this to mean that the rate at which trees are decoded by the UD probe is more indicative of how well they can be parsed given a full view of their structure, rather than vice versa. Although correlation is an indirect measure here, we can still accept it to be in support of our general findings.

Figure 3: Pearson correlation between UD/SUD probing accuracy and supervised UAS, per layer.
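For concreteness, the per-layer correlation described above can be computed with scipy; the variable names in the sketch below are hypothetical, and the nested-dictionary layout is an assumption about how the scores might be stored.

```python
# For each layer, correlate per-language probing UAS with supervised UAS
# for the same framework; pearsonr returns (rho, p-value).
from scipy.stats import pearsonr


def layerwise_correlation(probe_uas_by_layer, supervised_uas):
    """probe_uas_by_layer: {layer: {language: UAS}}; supervised_uas: {language: UAS}."""
    results = {}
    for layer, per_lang in probe_uas_by_layer.items():
        langs = sorted(per_lang)
        rho, p = pearsonr([per_lang[l] for l in langs],
                          [supervised_uas[l] for l in langs])
        results[layer] = (rho, p)
    return results
```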
6.3 Parts of Speech

In order to gain a better understanding of these probing patterns, we move on to an error analysis over the dev sets of each treebank, as fit by the averaged models. Figure 4 shows probe accuracy for different models (BERT/ELMo) and syntactic representations (UD/SUD) when attaching words of specific part-of-speech categories to their heads. The general pattern is that we observe higher accuracy for UD for both models on all categories, the only exceptions being a slightly higher accuracy for both models on PRON and for ELMo on VERB and X.8 However, the differences are generally greater for function words, in particular ADP, AUX, SCONJ, PART and DET. In some respects, this is completely expected given the different treatment of these words in UD and SUD, and we can use the case of adpositions (ADP) to illustrate this. In UD, the preposition from in a phrase like from the room is simply attached to the noun room, which is in general a short relation that is easy to identify. In SUD, the relation between the preposition and the noun is reversed, and the preposition now has to be attached to whatever the entire phrase modifies, which often means that difficult attachment ambiguities need to be resolved. However, exactly the same ambiguities need to be resolved for nominal words (NOUN, PRON, PROPN) in the UD representation, but there is no corresponding drop in accuracy for these classes in UD (except very marginally for PRON). Similar remarks can be made for other function word categories, in particular AUX, SCONJ and PART. It thus seems that the UD strategy of always connecting content words directly to other content words, instead of sometimes having these relations mediated by function words, results in higher accuracy overall when applying the probe to the representations learned by BERT and ELMo.

8 The X category is unspecified and extremely rare.

Figure 4: UAS accuracy for the average models (BERT 13, ELMo 3) on incoming dependencies of different part-of-speech categories.

The behavior of different part-of-speech classes can also explain some of the differences observed across languages. In particular, as can be seen in Table 1, most of the languages that show a clear preference for UD — Hebrew, Hindi, Italian and Japanese — are all characterized by a high proportion of adpositions. Conversely, the three languages that exhibit the opposite trend — Basque, Finnish and Turkish — have a very low proportion of adpositions. The only language that does not fit this pattern is Chinese, which has a low percentage of adpositions but nevertheless shows a clear preference for UD. Finally, it is worth noting that Korean shows no clear preference for either representation despite having a very low proportion of adpositions (as well as other function words), but this is due to the more coarse-grained word segmentation of the Korean treebank, which partly incorporates function words into content word chunks.9

9 This is reflected also in the exceptionally high proportion of direct content word relations; cf. Table 1.
6.4 Sentence and Tree Properties

Figure 5 depicts probing accuracy across different sentence lengths, dependency lengths, and distances to root. It is apparent that, despite the absolute differences between models, the relative differences between representations are strikingly consistent in favor of UD. For example, while the probe shows identical accuracy for the two representations for sentences of length 1–10, SUD decays more rapidly with increasing sentence length. Furthermore, while the SUD probe is slightly more accurate at detecting sentence roots and their immediate dependencies, we observe a consistent advantage for dependencies of length 2+, until dropping off for the longest length bin of 10+. Though Table 1 indicates that UD dependencies are slightly longer than those of SUD, this factor does not appear to influence the probe, as there are no significant correlations between differences in average dependency length and differences in UAS.

Figure 5: UAS across sentence length bins (top); F1 across varying dependency lengths (middle); F1 across varying distances to root (bottom).

We observe a similar curve for varying distances to root, where the SUD probe performs slightly better than UD at the shortest distance, but decays faster for nodes higher in the tree. In general, UD trees have lower height than SUD (see Table 1), which implies that tree height could be a major factor at play here. To verify this, we conducted a Pearson correlation test between the average increase in height from UD to SUD and the difference of the UD/SUD probe UAS per language. This test returned ρ = 0.82, p < 0.001, indicating that height is indeed crucial in accurately decoding trees across the two formalisms. In an attempt to visualize how this may play out across languages, we plotted the per-sentence difference in probing accuracy between UD/SUD as a function of the difference in height of the respective gold UD/SUD trees. Figure 6 depicts these results for BERT, where the x-axis indicates how many nodes higher a SUD tree is with respect to its reference UD tree.

It is apparent from Figure 6 that the preference for UD can be largely explained via its lower tree height. If we first examine Korean, the segmentation of which results in the smallest difference in height overall, we observe a distribution that is roughly centered around zero on both axes. If we instead refer to the UD-preferring languages (Chinese, Hebrew, Hindi, Italian, and Japanese), we notice a strong skew of distributions towards the top right of the plot. This indicates (i) that the trees in these samples are higher for SUD and (ii) that the corresponding sentences are easier to decode in UD. By contrast, for the SUD-preferring languages (Basque, Finnish, and Turkish), we observe narrow distributions centered around 0 (similar to that of Korean), indicating minimal variation in tree height between UD and SUD. What these languages have in common is an agglutinative morphology, which means that they rely more on morphological inflection to indicate relationships between content words, rather than separate function words. Sentences in these languages are therefore less susceptible to variations in tree height, by mere virtue of being shorter and possessing fewer relations that are likely to be a better fit for UD, like those concerning adpositions. We speculate that it is this inherent property that explains the layerwise preference for SUD (though a general indifference in aggregate), allowing for some language-specific properties, like the crucial role of auxiliaries in Basque, to be easier to probe for in SUD. Conversely, with this in mind, it becomes easy to motivate the high preference for UD across some languages, given that they are not agglutinating and make heavy use of function words. If we take the probe to be a proper decoding of a model's representational space, the encoding of syntactic structure according to an SUD-style analysis then becomes inherently more difficult, as the model is required to attend to hierarchy between words higher in the tree. Interestingly, however, this does not seem to correspond to an increased difficulty in the case of supervised parsing, as observed earlier.

Figure 6: Differences in the BERT probe's UAS (UD +, SUD −) as a function of tree height per number of nodes (higher SUD tree +, higher UD tree −), with smoothed means and 95% confidence ellipses (as implemented in ggplot2).
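The tree-shape statistics that drive this analysis can be recomputed from a list of heads with a few lines. The helper below is a sketch, not the authors' code; height follows the definition in footnote 2, while treating dependency length as the absolute difference in linear position between a word and its head is our assumption, since the paper does not spell out that definition.

```python
# heads[i] is the 1-indexed head of word i+1, with 0 marking the root
# (the usual CoNLL-U convention).
def tree_height(heads):
    def depth(i):  # i is a 1-indexed word position
        return 0 if heads[i - 1] == 0 else 1 + depth(heads[i - 1])
    return max(depth(i) for i in range(1, len(heads) + 1))


def avg_dependency_length(heads):
    lengths = [abs(i - h) for i, h in enumerate(heads, start=1) if h != 0]
    return sum(lengths) / len(lengths) if lengths else 0.0
```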
7 Conclusion and Future Work

We have investigated the extent to which the syntactic structure captured by neural language models aligns with different styles of analysis, using UD treebanks and their SUD conversions as proxies. We have extended the structural probe of Hewitt and Manning (2019) to extract directed, rooted trees and fit it on pretrained BERT and ELMo representations for 13 languages. Ultimately, we observed a better overall fit for the UD-style formalism across models, layers, and languages, with some notable exceptions. For example, while the Chinese, Hebrew, Hindi, Italian, and Japanese models proved to be overwhelmingly better-fit for UD, Basque aligned more with SUD, and Finnish, Korean and Turkish did not exhibit a clear preference. Furthermore, an error analysis revealed that, when attaching words of various part-of-speech tags to their heads, UD fared better across the vast majority of categories, most notably adpositions and determiners. Related to this, we found a strong correlation between differences in average tree height and the tendency to prefer one framework over the other. This suggested a tradeoff between morphological complexity — where differences in tree height between UD and SUD are minimal and probing accuracy similar — and a high proportion of function words — where SUD trees are significantly higher and probing accuracy favors UD.

For future work, besides seeking a deeper understanding of the interplay of linguistic factors and tree shape, we want to explore probes that combine the distance and depth assumptions into a single transformation, rather than learning separate probes and combining them post-hoc, as well as methods for alleviating treebank supervision altogether. Lastly, given recent criticisms of probing approaches in NLP, it will be vital to revisit the insights produced here within a non-probing framework, for example, using Representational Similarity Analysis (RSA) (Chrupała and Alishahi, 2019) over symbolic representations from treebanks and their encoded representations.

Acknowledgements

We want to thank Miryam De Lhoneux, Paola Merlo, Sara Stymne, and Dan Zeman and the ACL reviewers and area chairs for valuable feedback on preliminary versions of this paper. We acknowledge the computational resources provided by CSC in Helsinki and Sigma2 in Oslo through NeIC-NLPL (www.nlpl.eu).

References

Samira Abnar, Lisa Beinborn, Rochelle Choenni, and Willem Zuidema. 2019. Blackbox meets Blackbox: Representational similarity & stability analysis of neural language models and brains. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 191–203, Florence, Italy. Association for Computational Linguistics.

Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017. What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 861–872, Vancouver, Canada. Association for Computational Linguistics.

Timothy Dozat and Christopher D Manning. 2016. Deep biaffine attention for neural dependency parsing. arXiv preprint arXiv:1611.01734.

Allyson Ettinger, Ahmed Elgohary, and Philip Resnik. 2016. Probing for semantic evidence of composition by means of simple classification tasks. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 134–139, Berlin, Germany. Association for Computational Linguistics.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 1–6.
Terra Blevins, Omer Levy, and Luke Zettlemoyer. 2018. Deep RNNs encode soft hierarchical syntax. In Pro- Jon Gauthier and Roger Levy. 2019. Linking artificial ceedings of the 56th Annual Meeting of the Associa- and human neural representations of language. In tion for Computational Linguistics (Volume 2: Short Proceedings of the 2019 Conference on Empirical Papers), pages 14–19, Melbourne, Australia. Asso- Methods in Natural Language Processing and the ciation for Computational Linguistics. 9th International Joint Conference on Natural Lan- guage Processing (EMNLP-IJCNLP), pages 529– Joan Bresnan. 2000. Lexical-Functional Syntax. 539, Hong Kong, China. Association for Computa- Blackwell. tional Linguistics. Wanxiang Che, Yijia Liu, Yuxuan Wang, Bo Zheng, and Ting Liu. 2018. Towards better UD parsing: Kim Gerdes, Bruno Guillaume, Sylvain Kahane, and Deep contextualized word embeddings, ensemble, Guy Perrier. 2018. SUD or surface-syntactic uni- and treebank concatenation. In Proceedings of the versal dependencies: An annotation scheme near- CoNLL 2018 Shared Task: Multilingual Parsing isomorphic to UD. In Proceedings of the Second from Raw Text to Universal Dependencies, pages 55– Workshop on Universal Dependencies (UDW 2018), 64. pages 66–74. Grzegorz Chrupała and Afra Alishahi. 2019. Corre- John Hewitt and Percy Liang. 2019. Designing and lating neural and symbolic representations of lan- interpreting probes with control tasks. In Proceed- guage. In Proceedings of the 57th Annual Meet- ings of the 2019 Conference on Empirical Methods ing of the Association for Computational Linguis- in Natural Language Processing and the 9th Inter- tics, pages 2952–2962, Florence, Italy. Association national Joint Conference on Natural Language Pro- for Computational Linguistics. cessing (EMNLP-IJCNLP), pages 2733–2743, Hong Kong, China. Association for Computational Lin- Y. J. Chu and T. H. Liu. 1965. On the shortest arbores- guistics. cence of a directed graph. Science Sinica, 14:1396– 1400. John Hewitt and Christopher D Manning. 2019. A Alexis Conneau, German Kruszewski, Guillaume Lam- structural probe for finding syntax in word represen- Proceedings of the 2019 Conference of ple, Loc Barrault, and Marco Baroni. 2018. What tations. In the North American Chapter of the Association for you can cram into a single $&!#* vector: Probing Computational Linguistics: Human Language Tech- sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the As- nologies, Volume 1 (Long and Short Papers), pages sociation for Computational Linguistics (Volume 1: 4129–4138. Long Papers), pages 2126–2136, Melbourne, Aus- tralia. Association for Computational Linguistics. Dieuwke Hupkes, Sara Veldhoen, and Willem Zuidema. 2018. Visualisation and ’diagnostic classifiers’ re- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and veal how recurrent and recursive neural networks Kristina Toutanova. 2019. BERT: Pre-training of process hierarchical structure. Journal of Artificial deep bidirectional transformers for language under- Intelligence Research, 61:907–926. standing. In Proceedings of the 2019 Conference of the North American Chapter of the Association Ron Kaplan and Joan Bresnan. 1982. Lexical- for Computational Linguistics: Human Language Functional Grammar: A formal system for grammat- Technologies, Volume 1 (Long and Short Papers), ical representation. In Joan Bresnan, editor, The pages 4171–4186, Minneapolis, Minnesota. 
Associ- Mental Representation of Grammatical Relations, ation for Computational Linguistics. pages 173–281. MIT Press. Dan Kondratyuk and Milan Straka. 2019. 75 lan- Ginter, Iakes Goenaga, Koldo Gojenola, Memduh guages, 1 model: Parsing Universal Dependencies Gokırmak,¨ Yoav Goldberg, Xavier Gomez´ Guino- universally. In Proceedings of the 2019 Confer- vart, Berta Gonzalez´ Saavedra, Matias Grioni, Nor- ence on Empirical Methods in Natural Language munds Gruz¯ ¯ıtis, Bruno Guillaume, Celine´ Guillot- Processing and the 9th International Joint Confer- Barbance, Nizar Habash, Jan Hajic,ˇ Jan Hajicˇ jr., ence on Natural Language Processing (EMNLP- Linh Ha` My,˜ Na-Rae Han, Kim Harris, Dag IJCNLP), pages 2779–2795, Hong Kong, China. As- Haug, Johannes Heinecke, Felix Hennig, Barbora sociation for Computational Linguistics. Hladka,´ Jaroslava Hlava´covˇ a,´ Florinel Hociung, Pet- ter Hohle, Jena Hwang, Takumi Ikeda, Radu Ion, Simon Kornblith, Mohammad Norouzi, Honglak Lee, Elena Irimia, O. laj´ ´ıde´ Ishola, Toma´sˇ Jel´ınek, An- and Geoffrey Hinton. 2019. Similarity of neural net- ders Johannsen, Fredrik Jørgensen, Huner¨ Kas¸ıkara, work representations revisited. arXiv:1905.00414 Andre Kaasen, Sylvain Kahane, Hiroshi Kanayama, [cs, q-bio, stat]. ArXiv: 1905.00414. Jenna Kanerva, Boris Katz, Tolga Kayadelen, Jes- sica Kenney, Vaclava´ Kettnerova,´ Jesse Kirchner, Artur Kulmizev, Miryam de Lhoneux, Johannes Arne Kohn,¨ Kamil Kopacewicz, Natalia Kotsyba, Gontrum, Elena Fano, and Joakim Nivre. 2019. Jolanta Kovalevskaite,˙ Simon Krek, Sookyoung Deep contextualized word embeddings in transition- Kwak, Veronika Laippala, Lorenzo Lambertino, Lu- based and graph-based dependency parsing - a tale cia Lam, Tatiana Lando, Septina Dian Larasati, of two parsers revisited. In Proceedings of the Alexei Lavrentiev, John Lee, Phng Leˆ H`ong,ˆ 2019 Conference on Empirical Methods in Natu- Alessandro Lenci, Saran Lertpradit, Herman Le- ral Language Processing and the 9th International ung, Cheuk Ying Li, Josie Li, Keying Li, Kyung- Joint Conference on Natural Language Processing Tae Lim, Yuan Li, Nikola Ljubesiˇ c,´ Olga Logi- (EMNLP-IJCNLP), pages 2755–2768, Hong Kong, nova, Olga Lyashevskaya, Teresa Lynn, Vivien China. Association for Computational Linguistics. Macketanz, Aibek Makazhanov, Michael Mandl, Christopher Manning, Ruli Manurung, Cat˘ alina˘ Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Mar˘ anduc,˘ David Marecek,ˇ Katrin Marheinecke, Jan Hajic.ˇ 2005. Non-projective dependency pars- Hector´ Mart´ınez Alonso, Andre´ Martins, Jan ing using spanning tree algorithms. In HLT-EMNLP, Masek,ˇ Yuji Matsumoto, Ryan McDonald, Sarah pages 523–530. McGuinness, Gustavo Mendonc¸a, Niko Miekka, Margarita Misirpashayeva, Anna Missila,¨ Cat˘ alin˘ Igor Mel’cuk.ˇ 1988. Dependency Syntax: Theory and Mititelu, Yusuke Miyao, Simonetta Montemagni, Practice. State University of New York Press. Amir More, Laura Moreno Romero, Keiko Sophie Mori, Tomohiko Morioka, Shinsuke Mori, Shigeki Joakim Nivre, Mitchell Abrams, Zeljkoˇ Agic,´ Lars Moro, Bjartur Mortensen, Bohdan Moskalevskyi, Ahrenberg, Gabriele˙ Aleksandraviciˇ ut¯ e,˙ Lene An- Kadri Muischnek, Yugo Murawaki, Kaili Mu¨urisep,¨ tonsen, Katya Aplonova, Maria Jesus Aranz- Pinkey Nainwani, Juan Ignacio Navarro Horniacek,˜ abe, Gashaw Arutie, Masayuki Asahara, Luma Anna Nedoluzhko, Gunta Nespore-Bˇ erzkalne,¯ Lng Ateyah, Mohammed Attia, Aitziber Atutxa, Lies- Nguy˜enˆ Thi., Huy`enˆ Nguy˜enˆ Thi. 
Minh, Yoshi- beth Augustinus, Elena Badmaeva, Miguel Balles- hiro Nikaido, Vitaly Nikolaev, Rattima Nitisaroj, teros, Esha Banerjee, Sebastian Bank, Verginica Hanna Nurmi, Stina Ojala, Adedayo´ . Olu´okun,` Mai Barbu Mititelu, Victoria Basmov, John Bauer, San- Omura, Petya Osenova, Robert Ostling,¨ Lilja Øvre- dra Bellato, Kepa Bengoetxea, Yevgeni Berzak, lid, Niko Partanen, Elena Pascual, Marco Passarotti, Irshad Ahmad Bhat, Riyaz Ahmad Bhat, Erica Agnieszka Patejuk, Guilherme Paulino-Passos, An- Biagetti, Eckhard Bick, Agne˙ Bielinskiene,˙ Ro- gelika Peljak-Łapinska,´ Siyao Peng, Cenel-Augusto gier Blokland, Victoria Bobicev, Lo¨ıc Boizou, Perez, Guy Perrier, Daria Petrova, Slav Petrov, Emanuel Borges Volker,¨ Carl Borstell,¨ Cristina Jussi Piitulainen, Tommi A Pirinen, Emily Pitler, Bosco, Gosse Bouma, Sam Bowman, Adriane Barbara Plank, Thierry Poibeau, Martin Popel, Boyd, Kristina Brokaite,˙ Aljoscha Burchardt, Marie Lauma Pretkalnin¸a, Sophie Prevost,´ Prokopis Proko- Candito, Bernard Caron, Gauthier Caron, Guls¸en¨ pidis, Adam Przepiorkowski,´ Tiina Puolakainen, Cebiroglu˘ Eryigit,˘ Flavio Massimiliano Cecchini, Sampo Pyysalo, Andriela Ra¨abis,¨ Alexandre Rade- Giuseppe G. A. Celano, Slavom´ır Cˇ epl´ o,¨ Savas maker, Loganathan Ramasamy, Taraka Rama, Car- Cetin, Fabricio Chalub, Jinho Choi, Yongseok Cho, los Ramisch, Vinit Ravishankar, Livy Real, Siva Jayeol Chun, Silvie Cinkova,´ Aurelie´ Collomb, Reddy, Georg Rehm, Michael Rießler, Erika C¸agrı˘ C¸ oltekin,¨ Miriam Connor, Marine Courtin, Rimkute,˙ Larissa Rinaldi, Laura Rituma, Luisa Elizabeth Davidson, Marie-Catherine de Marneffe, Rocha, Mykhailo Romanenko, Rudolf Rosa, Da- Valeria de Paiva, Arantza Diaz de Ilarraza, Carly vide Rovati, Valentin Roca, Olga Rudina, Jack Dickerson, Bamba Dione, Peter Dirix, Kaja Do- Rueter, Shoval Sadde, Benoˆıt Sagot, Shadi Saleh, brovoljc, Timothy Dozat, Kira Droganova, Puneet Alessio Salomoni, Tanja Samardziˇ c,´ Stephanie Sam- Dwivedi, Hanne Eckhoff, Marhaba Eli, Ali Elkahky, son, Manuela Sanguinetti, Dage Sarg,¨ Baiba Saul¯ıte, Binyam Ephrem, Tomazˇ Erjavec, Aline Etienne, Yanin Sawanakunanon, Nathan Schneider, Sebastian Richard´ Farkas, Hector Fernandez Alcalde, Jennifer Schuster, Djame´ Seddah, Wolfgang Seeker, Mojgan Foster, Claudia´ Freitas, Kazunori Fujita, Katar´ına Seraji, Mo Shen, Atsuko Shimada, Hiroyuki Shirasu, Gajdosovˇ a,´ Daniel Galbraith, Marcos Garcia, Moa Muh Shohibussirri, Dmitry Sichinava, Natalia Sil- Gardenfors,¨ Sebastian Garza, Kim Gerdes, Filip veira, Maria Simi, Radu Simionescu, Katalin Simko,´ tor canonical correlation analysis for deep learning Maria´ Simkovˇ a,´ Kiril Simov, Aaron Smith, Isabela dynamics and interpretability. In Advances in Neu- Soares-Bastos, Carolyn Spadine, Antonio Stella, ral Information Processing Systems, pages 6076– Milan Straka, Jana Strnadova,´ Alane Suhr, Umut 6085. Sulubacak, Shingo Suzuki, Zsolt Szant´ o,´ Dima Taji, Yuta Takahashi, Fabio Tamburini, Takaaki Naomi Saphra and Adam Lopez. 2018. Understanding Tanaka, Isabelle Tellier, Guillaume Thomas, Li- learning dynamics of language models with SVCCA. isi Torga, Trond Trosterud, Anna Trukhina, Reut arXiv preprint arXiv:1811.00225. Tsarfaty, Francis Tyers, Sumire Uematsu, Zdenkaˇ Uresovˇ a,´ Larraitz Uria, Hans Uszkoreit, Sowmya Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019a. Vajjala, Daniel van Niekerk, Gertjan van No- BERT rediscovers the classical NLP pipeline. 
In ord, Viktor Varga, Eric Villemonte de la Clerg- Proceedings of the 57th Annual Meeting of the Asso- erie, Veronika Vincze, Lars Wallin, Abigail Walsh, ciation for Computational Linguistics, pages 4593– Jing Xian Wang, Jonathan North Washington, 4601. ´ Maximilan Wendt, Seyi Williams, Mats Wiren, Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Christian Wittern, Tsegay Woldemariam, Tak-sum Adam Poliak, R. Thomas McCoy, Najoung Kim, ´ Wong, Alina Wroblewska, Mary Yako, Naoki Ya- Benjamin Van Durme, Samuel R. Bowman, Di- mazaki, Chunxiao Yan, Koichi Yasuoka, Marat M. panjan Das, and Ellie Pavlick. 2019b. What do ˇ ˇ ´ Yavrumyan, Zhuoran Yu, Zdenek Zabokrtsky, Amir you learn from context? probing for sentence Zeldes, Daniel Zeman, Manying Zhang, and structure in contextualized word representations. Hanzhi Zhu. 2019. Universal Dependencies 2.4. arXiv:1905.06316 [cs]. ArXiv: 1905.06316. LINDAT/CLARIAH-CZ digital library at the Insti- tute of Formal and Applied Linguistics (UFAL),´ Fac- ulty of Mathematics and Physics, Charles Univer- sity. Joakim Nivre, Marie-Catherine de Marneffe, Filip Gin- ter, Yoav Goldberg, Jan Hajic,ˇ Christopher D. Man- ning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth In- ternational Conference on Language Resources and Evaluation (LREC’16), pages 1659–1666, Portoroz,ˇ Slovenia. European Language Resources Associa- tion (ELRA). Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep contextualized word rep- resentations. In Proceedings of the 2018 Confer- ence of the North American Chapter of the Associ- ation for Computational Linguistics: Human Lan- guage Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics. Matthew Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018b. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of the 2018 Conference on Em- pirical Methods in Natural Language Processing, pages 1499–1509, Brussels, Belgium. Association for Computational Linguistics. Grusha Prasad, Marten van Schijndel, and Tal Linzen. 2019. Using priming to uncover the organization of syntactic representations in neural language models. In Proceedings of the 23rd Conference on Computa- tional Natural Language Learning (CoNLL), pages 66–76, Hong Kong, China. Association for Compu- tational Linguistics. Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. 2017. SVCCA: Singular vec- A Controlling for Treebank Size

Figure 7: Probe results per framework, layer, and language, when trained on 3664 sentences. First row depicts UAS per layer and language for BERT, with average performance and error over UD/SUD in the 3rd column. Bottom two rows depict the difference in UAS across UD (+) and SUD (−).

Figure 8: Difference in UAS across the UD probes trained on full data (+) and 3664 sentences (−).

Figure 9: Difference in UAS across the SUD probes trained on full data (+) and 3664 sentences (−).

B Connection to Supervised Parsing

Figure 10: Supervised UDify UAS, UD and SUD, for all languages.

Figure 11: Supervised UDify LAS, UD and SUD, for all languages.