A Matter of Framing: The Impact of Linguistic Formalism on Probing Results

Ilia Kuznetsov and Iryna Gurevych
Ubiquitous Knowledge Processing Lab (UKP-TUDA)
Department of Computer Science, Technische Universität Darmstadt
http://www.ukp.tu-darmstadt.de/

Abstract

Deep pre-trained contextualized encoders like BERT (Devlin et al., 2019) demonstrate remarkable performance on a range of downstream tasks. A recent line of research in probing investigates the linguistic knowledge implicitly learned by these models during pre-training. While most work in probing operates on the task level, linguistic tasks are rarely uniform and can be represented in a variety of formalisms. Any linguistics-based probing study thereby inevitably commits to the formalism used to annotate the underlying data. Can the choice of formalism affect probing results? To investigate, we conduct an in-depth cross-formalism layer probing study in role semantics. We find linguistically meaningful differences in the encoding of semantic role- and proto-role information by BERT depending on the formalism, and demonstrate that layer probing can detect subtle differences between the implementations of the same linguistic formalism. Our results suggest that linguistic formalism is an important dimension in probing studies and should be investigated along with the commonly used cross-task and cross-lingual experimental settings.

Figure 1: Intra-sentence similarity by layer L of the multilingual BERT-base for the sentence "We asked John to leave." (L = 0, 8, 11). Functional tokens are similar in L = 0; syntactic groups emerge at higher levels.

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 171–182, November 16–20, 2020. © 2020 Association for Computational Linguistics

1 Introduction

The emergence of deep pre-trained contextualized encoders has had a major impact on the field of natural language processing. Boosted by the availability of general-purpose frameworks like AllenNLP (Gardner et al., 2018) and Transformers (Wolf et al., 2019), pre-trained models like ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) have caused a shift towards simple architectures where a strong pre-trained encoder is paired with a shallow downstream model, often outperforming the intricate task-specific architectures of the past.

The versatility of pre-trained representations implies that they encode some aspects of general linguistic knowledge (Reif et al., 2019). Indeed, even an informal inspection of layer-wise intra-sentence similarities (Fig. 1) suggests that these models capture elements of linguistic structure, and those differ depending on the layer of the model. A grounded investigation of these regularities allows us to interpret the model's behaviour, design better pre-trained encoders and inform downstream model development. Such investigation is the main goal of probing, and recent studies confirm that BERT implicitly captures many lexical and grammatical aspects of language use (Rogers et al., 2020).

Most probing studies use linguistics as a theoretical scaffolding and operate on a task level. However, there often exist multiple ways to represent the same linguistic phenomenon: for example, English dependency syntax can be encoded using a variety of formalisms, incl. Universal (Schuster and Manning, 2016), Stanford (de Marneffe and Manning, 2008) and CoNLL-2009 dependencies (Hajič et al., 2009), all using different label sets and syntactic head attachment rules. Any probing study inevitably commits to the specific theoretical framework used to produce the underlying data. The differences between linguistic formalisms, however, can be substantial.

Can these differences affect the probing results? This question is intriguing for several reasons. Linguistic formalisms are well-documented, and if the choice of formalism indeed has an effect on probing, cross-formalism comparison will yield new insights into the linguistic knowledge obtained by contextualized encoders during pre-training. If, alternatively, the probing results remain stable despite substantial differences between formalisms, this prompts further scrutiny of what the pre-trained encoders in fact encode. Finally, on the reverse side, cross-formalism probing might be used as a tool to empirically compare the formalisms and their language-specific implementations. To the best of our knowledge we are the first to explicitly address the influence of formalism on probing.

Ideally, the task chosen for a cross-formalism study should be encoded in multiple formalisms using the same textual data to rule out the influence of the domain and text type. While many linguistic corpora contain several layers of linguistic information, having the same textual data annotated with multiple formalisms for the same task is rare. We focus on role semantics – a family of shallow semantic formalisms at the interface between syntax and propositional semantics that assign roles to the participants of natural language utterances, determining who did what to whom, where, when etc. Decades of research in theoretical linguistics have produced a range of role-semantic frameworks that have been operationalized in NLP: syntax-driven PropBank (Palmer et al., 2005), coarse-grained VerbNet (Kipper-Schuler, 2005), fine-grained FrameNet (Baker et al., 1998), and, recently, decompositional Semantic Proto-Roles (SPR) (Reisinger et al., 2015; White et al., 2016). The SemLink project (Bonial et al., 2013) offers parallel annotation for PropBank, VerbNet and FrameNet for English. This allows us to isolate the variable of our study: apart from the role-semantic labels, the underlying data and conditions for the three formalisms are identical. SR3de (Mújdricza-Maydt et al., 2016) provides compatible annotation in three formalisms for German, enabling cross-lingual validation of our results. Combined, these factors make role semantics an ideal target for our cross-formalism probing study.

A solid body of evidence suggests that encoders like BERT capture syntactic and lexical-semantic properties, but only few studies have considered probing for predicate-level semantics (Tenney et al., 2019b; Kovaleva et al., 2019). To the best of our knowledge we are the first to conduct a cross-formalism probing study on role semantics, thereby contributing to the line of research on how and whether pre-trained BERT encodes higher-level semantic phenomena.

Contributions. This work studies the effect of the linguistic formalism on probing results. We conduct cross-formalism experiments on PropBank, VerbNet and FrameNet role prediction in English and German, and show that the formalism can affect probing results in a linguistically meaningful way; in addition, we demonstrate that layer probing can detect subtle differences between implementations of the same formalism in different languages. On the technical side, we advance the recently introduced edge and layer probing framework (Tenney et al., 2019b); in particular, we introduce anchor tasks – an analytical tool inspired by feature-based systems that allows deeper qualitative insights into the pre-trained models' behaviour. Finally, advancing the current knowledge about the encoding of predicate semantics in BERT, we perform a fine-grained semantic proto-role probing study and demonstrate that semantic proto-role properties can be extracted from pre-trained BERT, contrary to the existing reports. Our results suggest that along with task and language, linguistic formalism is an important dimension to be accounted for in probing research.

2 Related Work

2.1 BERT as Encoder

BERT is a Transformer (Vaswani et al., 2017) encoder pre-trained by jointly optimizing two unsupervised objectives: masked language model and next sentence prediction. It uses WordPiece (WP, Wu et al. (2016)) subword tokens along with positional embeddings as input, and gradually constructs sentence representations by applying token-level self-attention pooling over a stack of layers L. The result of BERT encoding is a layer-wise representation of the input wordpiece tokens, with higher layers representing higher-level abstractions over the input sequence. Thanks to the joint pre-training objective, BERT can encode tokens and sentences in a unified fashion: the encoding of a sentence or a sentence pair is stored in a special token [CLS]. To facilitate multilingual experiments, we use the multilingual BERT-base (mBERT) published by Devlin et al. (2019). Although several recent encoders have outperformed BERT on benchmarks (Liu et al., 2019; Lan et al., 2019; Raffel et al., 2019), we use the original BERT architecture, since it allows us to inherit the probing methodology and to build upon the related findings.
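The layer-wise wordpiece representations described above can be inspected directly. Below is a minimal sketch (not the authors' code) using the HuggingFace Transformers API; to keep it self-contained, it instantiates a tiny randomly initialized BERT instead of downloading the pre-trained multilingual model, so all sizes and input ids are illustrative only.

```python
import torch
from transformers import BertConfig, BertModel

# Tiny randomly initialized BERT stand-in. The paper probes the pre-trained
# multilingual BERT-base, which would instead be loaded via
# BertModel.from_pretrained("bert-base-multilingual-cased").
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=4,
                    num_attention_heads=4, intermediate_size=64)
model = BertModel(config).eval()

wordpiece_ids = torch.tensor([[2, 17, 41, 5]])  # one sentence, 4 wordpieces
with torch.no_grad():
    out = model(wordpiece_ids, output_hidden_states=True)

# out.hidden_states holds the embedding layer (l = 0) plus one tensor per
# Transformer layer, each of shape [batch, wordpieces, hidden_size] -- the
# layered representation that the probes below mix over.
layers = out.hidden_states
```

A frozen encoder, as used throughout the paper, additionally requires `requires_grad_(False)` on the model parameters so that only the probe is trained.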

2.2 Probing

Due to space limitations we omit high-level discussions on benchmarking (Wang et al., 2018) and sentence-level probing (Conneau et al., 2018a), and focus on the recent findings related to the representation of linguistic structure in BERT. Surface-level information generally tends to be represented in the lower layers of deep encoders, while higher layers store hierarchical and semantic information (Belinkov et al., 2017; Lin et al., 2019). Tenney et al. (2019a) show that the abstraction strategy applied by the English pre-trained BERT encoder follows the order of the classical NLP pipeline. Strengthening the claim about the linguistic capabilities of BERT, Hewitt and Manning (2019) demonstrate that BERT implicitly learns syntax, and Reif et al. (2019) show that it encodes fine-grained lexical-semantic distinctions. Rogers et al. (2020) provide a comprehensive overview of BERT's properties discovered to date.

While recent results indicate that BERT successfully represents lexical-semantic and grammatical information, the evidence of its high-level semantic capabilities is inconclusive. Tenney et al. (2019a) show that English PropBank semantics can be extracted from the encoder and follows syntax in the layer structure. However, out of all formalisms PropBank is most closely tied to syntax, and the results on proto-role and relation probing do not follow the same pattern. Kovaleva et al. (2019) identify two attention heads in BERT responsible for FrameNet relations; however, they find that disabling them in a fine-tuning evaluation on the GLUE benchmark (Wang et al., 2018) does not result in decreased performance.

Although we are not aware of any systematic studies dedicated to the effect of formalism on probing results, the evidence of such effects is scattered across the related work: for example, the aforementioned results in Tenney et al. (2019a) show a difference in layer utilization between constituent- and dependency-based syntactic probes and semantic role and proto-role probes. It is not clear whether this effect is due to the differences in the underlying datasets and task architecture or the formalism per se.

Our probing methodology builds upon the edge and layer probing framework. The encoding produced by a frozen BERT model can be seen as a layer-wise snapshot that reflects how the model has constructed the high-level abstractions. Tenney et al. (2019b) introduce the edge probing task design: a simple classifier is tasked with predicting a linguistic property given a pair of spans encoded using a frozen pre-trained model. Tenney et al. (2019a) use edge probing to analyse the layer utilization of a pre-trained BERT model via scalar mixing weights (Peters et al., 2018) learned during training. We revisit this framework in Section 3.

2.3 Role Semantics

We now turn to the object of our investigation: role semantics. For further discussion, consider the following synthetic example:

a. [John]Ag gave [Mary]Rc a [book]Th.
b. [Mary]Rc was given a [book]Th by [John]Ag.

Despite surface-level differences, the sentences express the same meaning, suggesting an underlying semantic representation in which these sentences are equivalent. One such representation is offered by role semantics – a shallow predicate-semantic formalism closely related to syntax. In terms of role semantics, Mary, book and John are semantic arguments of the predicate give, and are assigned roles from a pre-defined inventory, for example, Agent, Recipient and Theme.

Semantic roles and their properties have received extensive attention in linguistics (Fillmore, 1968; Levin and Rappaport Hovav, 2005; Dowty, 1991) and are considered a universal feature of human language. The size and organization of the role and predicate inventory are subject to debate, giving rise to a variety of role-semantic formalisms.

PropBank assumes a predicate-independent labeling scheme where predicates are distinguished by their sense (get.01), and semantic arguments are labeled with generic numbered core (Arg0-5) and modifier (e.g. AM-TMP) roles. Core roles are not tied to specific definitions, but an effort has been made to keep the role assignments consistent for similar verbs; Arg0 and Arg1 correspond to the Proto-Agent and Proto-Patient roles as per Dowty (1991). The semantic interpretation of core roles depends on the predicate sense.

VerbNet follows a different categorization scheme. Motivated by the regularities in verb behavior, Levin (1993) has introduced the grouping of verbs into intersective classes (ILC). This methodology has been adopted by VerbNet: for example, the VerbNet class get-13.5.1 would include the verbs earn, fetch, gain etc. A verb in VerbNet can belong to several classes corresponding to different senses; each class is associated with a set of roles and licensed syntactic transformations. Unlike PropBank, VerbNet uses a set of approx. 30 thematic roles that have universal definitions and are shared among predicates, e.g. Agent, Beneficiary, Instrument.

FrameNet takes a meaning-driven stance on the role encoding by modeling it in terms of frame semantics: predicates are grouped into frames (e.g. Commerce_buy), which specify role-like slots to be filled. FrameNet offers fine-grained frame distinctions, and roles in FrameNet are frame-specific, e.g. Buyer, Seller and Money. The resource accompanies each frame with a description of the situation and its core and peripheral participants.

SPR follows the work of Dowty (1991) and discards the notion of categorical semantic roles in favor of feature bundles. Instead of a fixed role label, each argument is assessed via an 11-dimensional cardinal feature set including Proto-Agent and Proto-Patient properties like volitional, sentient, destroyed, etc. The feature-based approach eliminates some of the theoretical issues associated with categorical role inventories and allows for more flexible modeling of role semantics.

Each of the role labeling formalisms offers certain advantages and disadvantages (Giuglea and Moschitti, 2006; Mújdricza-Maydt et al., 2016). While being close to syntax and thereby easier to predict, PropBank doesn't contribute much semantics to the representation. On the opposite side of the spectrum, FrameNet offers rich predicate-semantic representations for verbs and nouns, but suffers from high granularity and coverage gaps (Hartmann et al., 2017). VerbNet takes a middle ground by following grammatical criteria while still encoding coarse-grained semantics, but only focuses on verbs and core (not modifier) roles. SPR avoids the granularity-generalization trade-off of the categorical inventories, but is yet to find its way into practical NLP applications.

3 Probing Methodology

We take the edge probing setup by Tenney et al. (2019b) as our starting point. Edge probing aims to predict a label given a pair of contextualized span or token encodings. More formally, we encode a WP-tokenized sentence [wp_1, wp_2, ... wp_k] with a frozen pre-trained model, producing contextual embeddings [e_1, e_2, ... e_k], each of which is a layered representation over L = {l_0, l_1, ... l_m} layers, with the encoding at layer l_n for the wordpiece wp_i further denoted as e_i^n. A trainable scalar mix is applied to the layered representation to produce the final encoding, given the per-layer mixing weights {a_0, a_1, ... a_m} and a scaling parameter γ:

    e_i = γ · Σ_{l=0}^{m} softmax(a)_l · e_i^l

Given the source src and target tgt wordpieces encoded as e_src and e_tgt, our goal is to predict the label y.

Due to its task-agnostic architecture, edge probing can be applied to a wide variety of unary (by omitting tgt) and binary labeling tasks in a unified manner, facilitating cross-task comparison. The original setup has several limitations that we address in our implementation.

Regression tasks. The original edge probing setup only considers classification tasks. Many language phenomena – including positional information and semantic proto-roles – are naturally modeled as regression. We extend the architecture by Tenney et al. (2019b) and support both classification and regression: the former achieved via softmax, the latter via direct linear regression to the target value.

Flat model. To decrease the model's own expressive power (Hewitt and Liang, 2019), we keep the number of parameters in our probing model as low as possible. While Tenney et al. (2019b) utilize pooled self-attentional span representations and a projection layer to enable cross-model comparison, we directly feed the wordpiece encoding into the classifier, using the first wordpiece of a word. To further increase the selectivity of the model, we directly project the source and target wordpiece representations into the label space, as opposed to the two-layer MLP classifier used in the original setup.

Separate scalar mixes. To enable fine-grained analysis of probing results, we train and analyze separate scalar mixes for source and target wordpieces, motivated by the fact that the classifier might utilize different aspects of their representation for prediction. (Tenney et al. (2019b, Appendix C) also use separate projections in the background, but do not investigate the differences between the learned projections.) Indeed, we find that the mixing weights learned for source and target wordpieces might show substantial – and linguistically meaningful – variation. Combined with the regression-based objective, separating the scalar mixes allows us to scrutinize layer utilization patterns for semantic proto-roles.

Sentence-level probes. Utilizing the BERT-specific sentence representation [CLS] allows us to incorporate the sentence-level natural language inference (NLI) probe into our kit.

Anchor tasks. We employ two analytical tools from the original layer probing setup. Mixing weight plotting compares layer utilization among tasks by visually aligning the respective learned weight distributions transformed via a softmax function. Layer center-of-gravity is used as a summary statistic for a task's layer utilization. While the distribution of mixing weights along the layers allows us to estimate the order in which information is processed during encoding, it doesn't allow us to directly assess the similarity between the layer utilization of the probing tasks.

Tenney et al. (2019a) have demonstrated that the order in which linguistic information is stored in BERT mirrors the traditional NLP pipeline. A prominent property of NLP pipelines is their use of low-level features to predict downstream phenomena. In the context of layer probing, probing tasks can be seen as end-to-end feature extractors. Following this intuition, we define two groups of probing tasks: target tasks – the main tasks under investigation, and anchor tasks – a set of related tasks that serve as a basis for qualitative comparison between the targets. The softmax transformation of the scalar mixing weights allows us to treat them as probability distributions: the higher the mixing weight of a layer, the more likely the probe is to utilize information from this layer during prediction. We use Kullback-Leibler divergence to compare target tasks (e.g. role labeling in different formalisms) in terms of their similarity to lower-level anchor tasks (e.g. dependency relation and lemma). Note that the notion of anchor task is contextual: the same task can serve as a target and as an anchor, depending on the focus of the study.

4 Setup

4.1 Source data

For German we use the SR3de corpus (Mújdricza-Maydt et al., 2016) that contains parallel PropBank, FrameNet and VerbNet annotations for verbal predicates. For English, SemLink (Bonial et al., 2013) provides mappings from the original PropBank corpus annotations to the corresponding FrameNet and VerbNet senses and semantic roles. We use these mappings to enrich the CoNLL-2009 (Hajič et al., 2009) dependency role labeling data – also based on the original PropBank – with roles in all three formalisms via a semi-automatic token alignment procedure. The resulting corpus is substantially smaller than the original, but still an order of magnitude larger than SR3de (Table 1). Both corpora are richly annotated with linguistic phenomena on word level, including part-of-speech, lemma and syntactic dependencies. The XNLI probe is sourced from the corresponding development split of the XNLI (Conneau et al., 2018b) dataset. The SPR probing tasks are constructed from the original data by Reisinger et al. (2015).

            tok     sent   pred   arg
CoNLL+SL    312.2K  11.3K  13.3K  23.9K
SR3de       62.6K   2.8K   2.9K   5.5K

Table 1: Statistics for CoNLL+SemLink (English) and SR3de (German), only core roles.

              type    en      de
*token.ix     unary   208.9K  46.9K
ttype [v]     unary   177.2K  34.0K
lex.unit [v]  unary   187.6K  35.7K
pos           unary   312.2K  62.6K
deprel        binary  300.9K  59.8K
role          binary  23.9K   5.5K
*spr          binary  9.7K    -
xnli          unary   2.5K    2.5K

Table 2: Probing task statistics. Tasks marked with [v] use a most frequent label vocabulary. Here and further, tasks marked with * are regression tasks.

4.2 Probing tasks

Our probing kit spans a wide range of probing tasks, from primitive surface-level tasks mostly utilized as anchors to high-level semantic tasks that aim to provide a representational upper bound to predicate semantics.
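The scalar mix of Section 3 is easy to state concretely. Below is a minimal pure-Python sketch (not the authors' implementation, which uses AllenNLP's trainable scalar mix), applied to one wordpiece with toy layer encodings; in the real setup the weights a are learned jointly with the probe.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of raw weights."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scalar_mix(layer_encodings, a, gamma=1.0):
    """e_i = gamma * sum_l softmax(a)_l * e_i^l for one wordpiece i,
    given its per-layer encodings e_i^0 .. e_i^m and raw weights a."""
    weights = softmax(a)
    dim = len(layer_encodings[0])
    return [gamma * sum(w * e[d] for w, e in zip(weights, layer_encodings))
            for d in range(dim)]

# Toy example: a 2-dimensional wordpiece encoding over m + 1 = 3 layers.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = scalar_mix(layers, a=[0.0, 0.0, 0.0])  # uniform mix = layer average
```

After training, inspecting softmax(a) per task (and, in our setup, separately for src and tgt) is exactly what the layer utilization analysis below operates on.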

We follow the training, test and development splits from the original SR3de, CoNLL-2009 and SPR data. The XNLI task is sourced from the development set and only used for scalar mix analysis. To reduce the number of labels in some of the probing tasks, we collect frequency statistics over the corresponding training sets and only consider up to 250 most frequent labels. Below we define the tasks in order of their complexity; Table 2 provides the probing task statistics, Table 3 compares the categorical role labeling formalisms in terms of granularity, and Table 4 provides examples. We evaluate classification performance using Accuracy, while regression tasks are scored via R².

language   en   de
PropBank   5    10
VerbNet    23   29
FrameNet   189  300

Table 3: # of role probe labels by formalism.

Token type (ttype) predicts the type of a word. This requires contextual processing, since a word might consist of several wordpieces.

Token position (token.ix) predicts the linear position of a word, cast as a regression task over the first 20 words in the sentence. Again, the task is non-trivial, since it requires the words to be assembled from the wordpieces.

Part-of-speech (pos) predicts the language-specific part-of-speech tag for the given token.

Lexical unit (lex.unit) predicts the lemma and POS of the given word – a common input representation for the entries in lexical resources. We extract coarse POS tags by using the first character of the language-specific POS tag.

Dependency relation (deprel) predicts the dependency relation between the parent src and dependent tgt tokens.

Semantic role (role.[frm]) predicts the semantic role given a predicate src and an argument tgt token in one of the three role labeling formalisms: PropBank pb, VerbNet vn and FrameNet fn. Note that we only probe for the role label; the model has no access to the verb sense information from the data.

Semantic proto-role (spr.[prop]) is a set of eleven regression tasks predicting the values of the proto-role properties as defined in Reisinger et al. (2015), given a predicate src and an argument tgt.

XNLI is a sentence-level NLI task directly sourced from the corresponding dataset. Given two sentences, the goal is to determine whether an entailment or a contradiction relationship holds between them. We use NLI to investigate the layer utilization of mBERT for high-level semantic tasks. We extract the sentence pair representation via the [CLS] token and treat it as a unary probing task.

task       input                    label
token.ix   I [saw] a cat.           2
ttype      I [saw] a cat.           saw
lex.unit   I [saw] a cat.           see.V
pos        I [saw] a cat.           VBD
deprel     [I]tgt [saw]src a cat.   SBJ
role.vn    [I]tgt [saw]src a cat.   Experiencer
spr.vltn   [I]tgt [saw]src a cat.   2

Table 4: Word-level probing task examples for English. vltn corresponds to the SPR property volition.

5 Results

Our probing framework is implemented using AllenNLP (code available at https://github.com/UKPLab/emnlp2020-formalism-probing). We train the probes for 20 epochs using the Adam optimizer with default parameters and a batch size of 32. Due to the frozen encoder and flat model architecture, the total runtime of the main experiments is under 8 hours on a single Tesla V100 GPU. In addition to pre-trained mBERT, we report baseline performance using a frozen untrained mBERT model obtained by randomizing the encoder weights post-initialization, as in Jawahar et al. (2019).

5.1 General Trends

While absolute performance is secondary to our analysis, we report the probing task scores on the respective development sets in Table 5. We observe that grammatical tasks score high, while core role labeling lags behind – in line with the findings of Tenney et al. (2019a), although our results are not directly comparable due to the differences in datasets and formalisms. We observe lower scores for German role labeling, which we attribute to the lack of training data. Surprisingly, as we show below, this doesn't prevent the edge probe from learning to locate relevant role-semantic information in mBERT's layers.
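For reference, the R² score used for the regression probes here and in Section 5.5 is the standard coefficient of determination; a minimal stdlib sketch (toy values, not taken from the paper):

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    R² = 1 for perfect predictions; R² = 0 for always predicting the mean
    of y_true; it can go negative for predictions worse than the mean.
    Assumes y_true is not constant (SS_tot > 0).
    """
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot

# Perfect predictions on a toy target give R² = 1.
assert r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]) == 1.0
```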

task       en           de
*token.ix  0.95 (0.93)  0.92 (0.87)
ttype      1.00 (0.92)  1.00 (0.48)
lex.unit   1.00 (0.75)  1.00 (0.33)
pos        0.97 (0.40)  0.97 (0.26)
deprel     0.95 (0.42)  0.95 (0.41)
role.fn    0.92 (0.18)  0.59 (0.10)
role.pb    0.96 (0.67)  0.71 (0.49)
role.vn    0.94 (0.47)  0.73 (0.30)

Table 5: Best dev score for word-level tasks over 20 epochs; Acc for classification, R² for regression; baseline in parentheses.

Figure 2: Layer probing results: scalar mixing weight distributions over layers for en and de; per-task center-of-gravity in brackets (en/de): *token.ix [3.13/4.61], ttype [4.8/5.2], lex.unit [4.99/5.09], pos [5.29/5.75], deprel src [5.36/6.01], deprel tgt [5.57/5.99], role.pb src [5.47/5.18], role.vn src [4.28/5.24], role.fn src [4.36/5.13], role.pb tgt [6.11/6.12], role.vn tgt [6.05/6.06], role.fn tgt [6.16/5.75], xnli [6.28/6.15].

The untrained mBERT baseline expectedly underperforms; however, we note good baseline results on surface-level tasks for English, which we attribute to memorizing token identity and position: although the weights are set randomly, the frozen encoder still associates each wordpiece input with a fixed random vector. We have confirmed this assumption by scalar mix analysis of the untrained mBERT baseline: in our experiments the baseline probes for both English and German attended almost exclusively to the first few layers of the encoder, independent of the task. For brevity, here and further we do not examine baseline mixing weights and only report the scores.

Our main probing results mirror the findings of Tenney et al. (2019a) about the sequential processing order in BERT. We observe that the layer utilization among tasks (Fig. 2) generally aligns for English and German, echoing the recent findings on mBERT's multilingual capacity (Pires et al., 2019; Kondratyuk and Straka, 2019), although we note that in terms of center-of-gravity mBERT tends to utilize deeper layers for German probes. Basic word-level tasks are indeed processed early by the model, and XNLI probes focus on deeper levels, suggesting that the representation of higher-level semantic phenomena follows the encoding of syntax and predicate semantics.

5.2 The Effect of Formalism

Using separate scalar mixes for source and target tokens allows us to explore the cross-formalism encoding of role semantics by mBERT in detail. For both English and German role labeling, the probe's layer utilization drastically differs for predicate and argument tokens. While the argument representation (role.* tgt) mostly focuses on the same layers as the dependency relation probe, the layer utilization of the predicates (role.* src) is affected by the chosen formalism. In English, PropBank predicate token mixing weights emphasize the same layers as dependency parsing – in line with the previously published results. However, the probes for VerbNet and FrameNet predicates (role.vn src and role.fn src) utilize the layers associated with ttype and lex.unit that contain lexical information. Coupled with the fact that both VerbNet and FrameNet assign semantic roles based on lexical-semantic predicate groupings (frames in FrameNet and verb classes in VerbNet), this suggests that the lower layers of mBERT implicitly encode predicate sense information; moreover, sense encoding for VerbNet utilizes deeper layers of the model associated with syntax, in line with VerbNet's predicate classification strategy. This finding confirms that the formalism can indeed have linguistically meaningful effects on probing results.

Figure 3: Anchor task analysis of SRL formalisms: similarity of the role probe scalar mixes (pb/vn/fn, src and tgt) to the *token.ix, ttype, lex.unit, pos and deprel anchor probes for en and de; darker color corresponds to higher similarity.

5.3 Anchor Tasks in the Pipeline

We now use the scalar mixes of the role labeling probes as target tasks, and lower-level probes as anchor tasks, to qualitatively explore the differences between how our role probes learn to represent predicates and semantic arguments (Fig. 3).
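The comparison underlying Fig. 3 reduces to Kullback-Leibler divergence between softmax-normalized mixing weight distributions (Section 3). A small pure-Python sketch with hypothetical weights (4 layers for brevity; the actual mixes span all mBERT layers plus the embedding layer, and the weight values below are illustrative, not from the paper):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) between two layer distributions, in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical learned mixing weights: a role probe vs. two anchor probes.
role_fn_src = softmax([0.1, 0.9, 0.4, 0.2])
ttype_anchor = softmax([0.2, 1.0, 0.3, 0.1])   # lexical anchor
deprel_anchor = softmax([0.0, 0.2, 0.5, 1.0])  # syntactic anchor

# Lower divergence -> the target probe utilizes layers more like that anchor.
closer_to_lexical = kl(role_fn_src, ttype_anchor) < kl(role_fn_src, deprel_anchor)
```

With these toy weights, the FrameNet predicate probe comes out closer to the lexical anchor, mirroring the qualitative pattern in Fig. 3.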

The results reveal a distinctive pattern that confirms our previous observations: while the VerbNet and FrameNet predicate (src) layer utilization is similar to the scalar mixes learned for ttype and lex.unit, the learned argument representations (tgt) and the PropBank predicate attend to the layers associated with the dependency relation and POS probes. Aside from the PropBank predicate encoding, which we address below, the pattern reproduces for English and German. This aligns with the traditional separation of the semantic role labeling task into predicate disambiguation followed by semantic argument identification and labeling, along with the feature sets employed for these tasks (Björkelund et al., 2009). Note that the observation about the pipeline-like task processing within the BERT encoders thereby holds, albeit on a sub-task level.

5.4 Formalism Implementations

Both layer and anchor task analysis reveal a prominent discrepancy between the English and German role probing results: while the PropBank predicate layer utilization for English mostly relies on syntactic information, German PropBank predicates behave similarly to VerbNet and FrameNet. The lack of systematic cross-lingual differences in layer utilization for the other probing tasks (apart from the general tendency to use deeper layers in German reported in 5.1) allows us to rule out the effect of purely typological features such as word order and case marking as a likely cause.

The difference in the number of role labels for English and German PropBank, however, points at possible qualitative differences in the labeling schemes (Table 3). The data for English stems from the token-level alignment in SemLink that maps the original PropBank roles to VerbNet and FrameNet. Role annotations for German have a different lineage: they originate from the FrameNet-annotated SALSA corpus (Burchardt et al., 2006), semi-automatically converted to PropBank style for the CoNLL-2009 shared task (Hajič et al., 2009) and enriched with VerbNet labels in SR3de (Mújdricza-Maydt et al., 2016). As a result, while English PropBank labels are assigned in a predicate-independent manner, German PropBank, while following the same numbered labeling scheme, keeps this scheme consistent within the frame. We assume that this incentivizes the probe to learn semantic verb groupings, which reflects in our probing results. The ability of the probe to detect subtle differences between formalism implementations constitutes a new use case for probing, and a promising direction for future studies.

5.5 Encoding of Proto-Roles

We now turn to the probing results for the decompositional semantic proto-role labeling tasks. Unlike Tenney et al. (2019b), who used a multi-label classification probe, we treat SPR properties as separate regression tasks. The results in Table 6 show that the performance varies by property, with some of the properties attaining reasonably high R² scores despite the simplicity of the probe architecture and the small dataset size. We observe that properties associated with the Proto-Agent tend to perform better. The untrained mBERT baseline performs poorly, which we attribute to the lack of data and the fine-grained semantic nature of the task.

property                 R²
(A) *instigation         0.68 (0.21)
(A) *volition            0.75 (0.11)
(A) *awareness           0.78 (0.09)
(A) *sentient            0.83 (0.07)
(A) *change.of.location  0.49 (0.04)
(A) *exists.as.physical  0.63 (0.03)
(P) *created             0.22 (0.01)
(P) *destroyed           0.11 (0.00)
(P) *changes.possession  0.26 (-0.01)
(P) *change.of.state     0.37 (0.01)
(P) *stationary          0.39 (0.05)

Table 6: Best dev R² for proto-role probing tasks over 20 epochs; A – Proto-Agent, P – Proto-Patient; baseline in parentheses.

Our fine-grained, property-level task design allows for more detailed insights into the layer utilization by the SPR probes (Fig. 4). The results indicate that while the layer utilization on the predicate side (src) shows no clear preference for particular layers (similar to the results obtained by Tenney et al. (2019a)), some of the proto-role features follow the pattern seen in the categorical role labeling and dependency parsing tasks for the argument tokens (tgt). With few exceptions, we observe that the properties displaying that behavior

are Proto-Agent properties; moreover, a close examination of the results on syntactic preference by Reisinger et al. (2015, p. 483) reveals that these properties are also the ones with a strong preference for the subject position, including the outlier case of stationary, which in their data behaves like a Proto-Agent property. The correspondence is not strict, and we leave an in-depth investigation of the reasons behind these discrepancies for follow-up work.

Figure 4: Layer utilization for SPR properties, on the predicate (src) and argument (tgt) side.

6 Conclusion

We have demonstrated that the choice of linguistic formalism can have substantial, linguistically meaningful effects on role-semantic probing results. We have shown how probing classifiers can be used to detect discrepancies between formalism implementations, and we have presented evidence of semantic proto-role encoding in the pre-trained mBERT model. Our refined implementation of the edge probing framework, coupled with the anchor task methodology, enabled new insights into the processing of predicate-semantic information within mBERT. Our findings suggest that linguistic formalism is an important factor to be accounted for in probing studies.

This prompts several recommendations for follow-up probing studies. First, the formalism and the implementation used to prepare the linguistic material underlying a probing study should always be explicitly specified. Second, if possible, results on multiple formalisations of the same task should be reported and validated for several languages. Finally, assembling corpora with parallel cross-formalism annotations would facilitate further research on the effect of formalism in probing.

While our work illustrates the impact of formalism using a single task and a single probing framework, the influence of linguistic formalism per se is likely to be present for any probing setup that builds upon linguistic material. An investigation of how, whether, and why formalisms and their implementations affect probing results for tasks beyond role labeling and for frameworks beyond edge probing constitutes an exciting avenue for future research.

Acknowledgments

This work has been funded by the LOEWE initiative (Hesse, Germany) within the emergenCITY center.

References

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1, pages 86–90, Montreal, Quebec, Canada. Association for Computational Linguistics.

Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017. What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 861–872. Association for Computational Linguistics.

Anders Björkelund, Love Hafdell, and Pierre Nugues. 2009. Multilingual semantic role labeling. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task, pages 43–48, Boulder, Colorado. Association for Computational Linguistics.

Claire Bonial, Kevin Stowe, and Martha Palmer. 2013. Renewing and revising SemLink. In Proceedings of the 2nd Workshop on Linked Data in Linguistics (LDL-2013): Representing and linking lexicons, terminologies and other language data, pages 9–17. Association for Computational Linguistics.

Aljoscha Burchardt, Katrin Erk, Anette Frank, Andrea Kowalski, Sebastian Padó, and Manfred Pinkal. 2006. The SALSA corpus: a German corpus resource for lexical semantics. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06), Genoa, Italy. European Language Resources Association (ELRA).

Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018a. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2126–2136. Association for Computational Linguistics.

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018b. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485, Brussels, Belgium. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

David Dowty. 1991. Thematic proto-roles and argument selection. Language, 67(3):547–619.

Charles J. Fillmore. 1968. The Case for Case. In Emmon Bach and Robert T. Harms, editors, Universals in Linguistic Theory, pages 1–88. Holt, Rinehart and Winston, New York.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 1–6, Melbourne, Australia. Association for Computational Linguistics.

Ana-Maria Giuglea and Alessandro Moschitti. 2006. Semantic role labeling via FrameNet, VerbNet and PropBank. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 929–936, Sydney, Australia. Association for Computational Linguistics.

Jan Hajič, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štěpánek, Pavel Straňák, Mihai Surdeanu, Nianwen Xue, and Yi Zhang. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task, pages 1–18, Boulder, Colorado. Association for Computational Linguistics.

Silvana Hartmann, Ilia Kuznetsov, Teresa Martin, and Iryna Gurevych. 2017. Out-of-domain FrameNet semantic role labeling. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 471–482, Valencia, Spain. Association for Computational Linguistics.

John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2733–2743, Hong Kong, China. Association for Computational Linguistics.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota. Association for Computational Linguistics.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3651–3657, Florence, Italy. Association for Computational Linguistics.

Karen Kipper-Schuler. 2005. VerbNet: A broad-coverage, comprehensive verb lexicon. Ph.D. thesis, University of Pennsylvania.

Dan Kondratyuk and Milan Straka. 2019. 75 languages, 1 model: Parsing universal dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2779–2795, Hong Kong, China. Association for Computational Linguistics.

Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4365–4374, Hong Kong, China. Association for Computational Linguistics.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv:1909.11942.

Beth Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press.

Beth Levin and Malka Rappaport Hovav. 2005. Argument Realization. Research Surveys in Linguistics. Cambridge University Press.

Yongjie Lin, Yi Chern Tan, and Robert Frank. 2019. Open sesame: Getting inside BERT's linguistic knowledge. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 241–253, Florence, Italy. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692.

Marie-Catherine de Marneffe and Christopher D. Manning. 2008. The Stanford typed dependencies representation. In Coling 2008: Proceedings of the workshop on Cross-Framework and Cross-Domain Parser Evaluation, pages 1–8, Manchester, UK. Coling 2008 Organizing Committee.

Éva Mújdricza-Maydt, Silvana Hartmann, Iryna Gurevych, and Anette Frank. 2016. Combining semantic annotation of word sense & semantic roles: A novel annotation scheme for VerbNet roles on German language data. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association (ELRA).

Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy. Association for Computational Linguistics.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv:1910.10683.

Emily Reif, Ann Yuan, Martin Wattenberg, Fernanda B. Viégas, Andy Coenen, Adam Pearce, and Been Kim. 2019. Visualizing and measuring the geometry of BERT. In Advances in Neural Information Processing Systems 32, pages 8594–8603. Curran Associates, Inc.

Drew Reisinger, Rachel Rudinger, Francis Ferraro, Craig Harman, Kyle Rawlins, and Benjamin Van Durme. 2015. Semantic proto-roles. Transactions of the Association for Computational Linguistics, 3:475–488.

Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. arXiv:2002.12327.

Sebastian Schuster and Christopher D. Manning. 2016. Enhanced English universal dependencies: An improved representation for natural language understanding tasks. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 2371–2378, Portorož, Slovenia. European Language Resources Association (ELRA).

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019a. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, and Ellie Pavlick. 2019b. What do you learn from context? Probing for sentence structure in contextualized word representations. In 7th International Conference on Learning Representations (ICLR 2019), pages 1–17.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.

Aaron Steven White, Drew Reisinger, Keisuke Sakaguchi, Tim Vieira, Sheng Zhang, Rachel Rudinger, Kyle Rawlins, and Benjamin Van Durme. 2016. Universal decompositional semantics on universal dependencies. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1713–1723, Austin, Texas. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv:1910.03771.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv:1609.08144.