A Matter of Framing: The Impact of Linguistic Formalism on Probing Results

Ilia Kuznetsov and Iryna Gurevych
Ubiquitous Knowledge Processing Lab (UKP-TUDA)
Department of Computer Science, Technische Universität Darmstadt
http://www.ukp.tu-darmstadt.de/

Abstract

Deep pre-trained contextualized encoders like BERT (Devlin et al., 2019) demonstrate remarkable performance on a range of downstream tasks. A recent line of research in probing investigates the linguistic knowledge implicitly learned by these models during pre-training. While most work in probing operates on the task level, linguistic tasks are rarely uniform and can be represented in a variety of formalisms. Any linguistics-based probing study thereby inevitably commits to the formalism used to annotate the underlying data. Can the choice of formalism affect probing results? To investigate, we conduct an in-depth cross-formalism layer probing study in role semantics. We find linguistically meaningful differences in the encoding of semantic role- and proto-role information by BERT depending on the formalism, and demonstrate that layer probing can detect subtle differences between the implementations of the same linguistic formalism. Our results suggest that linguistic formalism is an important dimension in probing studies and should be investigated along with the commonly used cross-task and cross-lingual experimental settings.

Figure 1: Intra-sentence similarity by layer L of the multilingual BERT-base for the sentence "We asked John to leave." (L = 0, 8, 11). Functional tokens are similar in L = 0; syntactic groups emerge at higher levels.

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 171–182, November 16–20, 2020. © 2020 Association for Computational Linguistics

1 Introduction

The emergence of deep pre-trained contextualized encoders has had a major impact on the field of natural language processing. Boosted by the availability of general-purpose frameworks like AllenNLP (Gardner et al., 2018) and Transformers (Wolf et al., 2019), pre-trained models like ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) have caused a shift towards simple architectures where a strong pre-trained encoder is paired with a shallow downstream model, often outperforming the intricate task-specific architectures of the past.

The versatility of pre-trained representations implies that they encode some aspects of general linguistic knowledge (Reif et al., 2019). Indeed, even an informal inspection of layer-wise intra-sentence similarities (Fig. 1) suggests that these models capture elements of linguistic structure, and those differ depending on the layer of the model. A grounded investigation of these regularities allows us to interpret the model's behaviour, design better pre-trained encoders and inform downstream model development. Such investigation is the main goal of probing, and recent studies confirm that BERT implicitly captures many lexical and grammatical aspects of language use (Rogers et al., 2020).

Most probing studies use linguistics as a theoretical scaffolding and operate on a task level. However, there often exist multiple ways to represent the same linguistic phenomenon: for example, English dependency syntax can be encoded using a variety of formalisms, incl. Universal (Schuster and Manning, 2016), Stanford (de Marneffe and Manning, 2008) and CoNLL-2009 dependencies (Hajič et al., 2009), all using different label sets and syntactic head attachment rules. Any probing study inevitably commits to the specific theoretical framework used to produce the underlying data. The differences between linguistic formalisms, however, can be substantial.

Can these differences affect the probing results? This question is intriguing for several reasons. Linguistic formalisms are well-documented, and if the choice of formalism indeed has an effect on probing, cross-formalism comparison will yield new insights into the linguistic knowledge obtained by contextualized encoders during pre-training. If, alternatively, the probing results remain stable despite substantial differences between formalisms, this prompts further scrutiny of what the pre-trained encoders in fact encode. Finally, on the reverse side, cross-formalism probing might be used as a tool to empirically compare the formalisms and their language-specific implementations. To the best of our knowledge we are the first to explicitly address the influence of formalism on probing.

Ideally, the task chosen for a cross-formalism study should be encoded in multiple formalisms using the same textual data to rule out the influence of the domain and text type. While many linguistic corpora contain several layers of linguistic information, having the same textual data annotated with multiple formalisms for the same task is rare. We focus on role semantics – a family of shallow semantic formalisms at the interface between syntax and propositional semantics that assign roles to the participants of natural language utterances, determining who did what to whom, where, when etc. Decades of research in theoretical linguistics have produced a range of role-semantic frameworks that have been operationalized in NLP: syntax-driven PropBank (Palmer et al., 2005), coarse-grained VerbNet (Kipper-Schuler, 2005), fine-grained FrameNet (Baker et al., 1998), and, recently, decompositional Semantic Proto-Roles (SPR) (Reisinger et al., 2015; White et al., 2016). The SemLink project (Bonial et al., 2013) offers parallel annotation for PropBank, VerbNet and FrameNet for English. This allows us to isolate the variable of our study: apart from the role-semantic labels, the underlying data and conditions for the three formalisms are identical. SR3de (Mújdricza-Maydt et al., 2016) provides compatible annotation in three formalisms for German, enabling cross-lingual validation of our results. Combined, these factors make role semantics an ideal target for our cross-formalism probing study.

A solid body of evidence suggests that encoders like BERT capture syntactic and lexical-semantic properties, but only few studies have considered probing for predicate-level semantics (Tenney et al., 2019b; Kovaleva et al., 2019). To the best of our knowledge we are the first to conduct a cross-formalism probing study on role semantics, thereby contributing to the line of research on how and whether pre-trained BERT encodes higher-level semantic phenomena.

Contributions. This work studies the effect of the linguistic formalism on probing results. We conduct cross-formalism experiments on PropBank, VerbNet and FrameNet role prediction in English and German, and show that the formalism can affect probing results in a linguistically meaningful way; in addition, we demonstrate that layer probing can detect subtle differences between implementations of the same formalism in different languages. On the technical side, we advance the recently introduced edge and layer probing framework (Tenney et al., 2019b); in particular, we introduce anchor tasks – an analytical tool inspired by feature-based systems that allows deeper qualitative insights into the pre-trained models' behaviour. Finally, advancing the current knowledge about the encoding of predicate semantics in BERT, we perform a fine-grained semantic proto-role probing study and demonstrate that semantic proto-role properties can be extracted from pre-trained BERT, contrary to the existing reports. Our results suggest that along with task and language, linguistic formalism is an important dimension to be accounted for in probing research.

2 Related Work

2.1 BERT as Encoder

BERT is a Transformer (Vaswani et al., 2017) encoder pre-trained by jointly optimizing two unsupervised objectives: masked language model and next sentence prediction. It uses WordPiece (WP, Wu et al. (2016)) subword tokens along with positional embeddings as input, and gradually constructs sentence representations by applying token-level self-attention pooling over a stack of layers L. The result of BERT encoding is a layer-wise representation of the input wordpiece tokens, with higher layers representing higher-level abstractions over the input sequence. Thanks to the joint pre-training objective, BERT can encode tokens and sentences in a unified fashion: the encoding of a sentence or a sentence pair is stored in a special token [CLS]. To facilitate multilingual experiments, we use the multilingual BERT-base (mBERT) published by Devlin et al. (2019). Although several recent encoders have outperformed BERT on benchmarks (Liu et al., 2019; Lan et al., 2019; Raffel et al., 2019), we use the original BERT architecture, since it allows us to inherit the probing methodology and to build upon the related findings.
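The layer-wise wordpiece representations described above can be inspected directly. Below is a minimal sketch (not the authors' code) using the HuggingFace Transformers API; to keep it self-contained, it instantiates a tiny randomly initialized BERT instead of downloading the pre-trained multilingual model, so all sizes and input ids are illustrative only.

```python
import torch
from transformers import BertConfig, BertModel

# Tiny randomly initialized BERT stand-in. The paper probes the pre-trained
# multilingual BERT-base, which would instead be loaded via
# BertModel.from_pretrained("bert-base-multilingual-cased").
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=4,
                    num_attention_heads=4, intermediate_size=64)
model = BertModel(config).eval()

wordpiece_ids = torch.tensor([[2, 17, 41, 5]])  # one sentence, 4 wordpieces
with torch.no_grad():
    out = model(wordpiece_ids, output_hidden_states=True)

# out.hidden_states holds the embedding layer (l = 0) plus one tensor per
# Transformer layer, each of shape [batch, wordpieces, hidden_size] -- the
# layered representation that the probes below mix over.
layers = out.hidden_states
```

A frozen encoder, as used throughout the paper, additionally requires `requires_grad_(False)` on the model parameters so that only the probe is trained.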

2.2 Probing

Due to space limitations we omit high-level discussions on benchmarking (Wang et al., 2018) and sentence-level probing (Conneau et al., 2018a), and focus on the recent findings related to the representation of linguistic structure in BERT. Surface-level information generally tends to be represented in the lower layers of deep encoders, while higher layers store hierarchical and semantic information (Belinkov et al., 2017; Lin et al., 2019). Tenney et al. (2019a) show that the abstraction strategy applied by the English pre-trained BERT encoder follows the order of the classical NLP pipeline. Strengthening the claim about the linguistic capabilities of BERT, Hewitt and Manning (2019) demonstrate that BERT implicitly learns syntax, and Reif et al. (2019) show that it encodes fine-grained lexical-semantic distinctions. Rogers et al. (2020) provide a comprehensive overview of BERT's properties discovered to date.

While recent results indicate that BERT successfully represents lexical-semantic and grammatical information, the evidence of its high-level semantic capabilities is inconclusive. Tenney et al. (2019a) show that English PropBank semantics can be extracted from the encoder and follows syntax in the layer structure. However, out of all formalisms PropBank is most closely tied to syntax, and the results on proto-role and relation probing do not follow the same pattern. Kovaleva et al. (2019) identify two attention heads in BERT responsible for FrameNet relations; however, they find that disabling them in a fine-tuning evaluation on the GLUE benchmark (Wang et al., 2018) does not result in decreased performance.

Although we are not aware of any systematic studies dedicated to the effect of formalism on probing results, the evidence of such effects is scattered across the related work: for example, the aforementioned results in Tenney et al. (2019a) show a difference in layer utilization between constituent- and dependency-based syntactic probes and semantic role and proto-role probes. It is not clear whether this effect is due to the differences in the underlying datasets and task architecture or the formalism per se.

Our probing methodology builds upon the edge and layer probing framework. The encoding produced by a frozen BERT model can be seen as a layer-wise snapshot that reflects how the model has constructed the high-level abstractions. Tenney et al. (2019b) introduce the edge probing task design: a simple classifier is tasked with predicting a linguistic property given a pair of spans encoded using a frozen pre-trained model. Tenney et al. (2019a) use edge probing to analyse the layer utilization of a pre-trained BERT model via scalar mixing weights (Peters et al., 2018) learned during training. We revisit this framework in Section 3.

2.3 Role Semantics

We now turn to the object of our investigation: role semantics. For further discussion, consider the following synthetic example:

a. [John]Ag gave [Mary]Rc a [book]Th.
b. [Mary]Rc was given a [book]Th by [John]Ag.

Despite surface-level differences, the sentences express the same meaning, suggesting an underlying semantic representation in which these sentences are equivalent. One such representation is offered by role semantics – a shallow predicate-semantic formalism closely related to syntax. In terms of role semantics, Mary, book and John are semantic arguments of the predicate give, and are assigned roles from a pre-defined inventory, for example, Agent, Recipient and Theme.

Semantic roles and their properties have received extensive attention in linguistics (Fillmore, 1968; Levin and Rappaport Hovav, 2005; Dowty, 1991) and are considered a universal feature of human language. The size and organization of the role and predicate inventory are subject to debate, giving rise to a variety of role-semantic formalisms.

PropBank assumes a predicate-independent labeling scheme where predicates are distinguished by their sense (get.01), and semantic arguments are labeled with generic numbered core (Arg0-5) and modifier (e.g. AM-TMP) roles. Core roles are not tied to specific definitions, but an effort has been made to keep the role assignments consistent for similar verbs; Arg0 and Arg1 correspond to the Proto-Agent and Proto-Patient roles as per Dowty (1991). The semantic interpretation of core roles depends on the predicate sense.

VerbNet follows a different categorization scheme. Motivated by the regularities in verb behavior, Levin (1993) has introduced the grouping of verbs into intersective classes (ILC). This methodology has been adopted by VerbNet: for example, the VerbNet class get-13.5.1 would include the verbs earn, fetch, gain etc. A verb in VerbNet can belong to several classes corresponding to different senses; each class is associated with a set of roles and licensed syntactic transformations. Unlike PropBank, VerbNet uses a set of approx. 30 thematic roles that have universal definitions and are shared among predicates, e.g. Agent, Beneficiary, Instrument.

FrameNet takes a meaning-driven stance on the role encoding by modeling it in terms of frame semantics: predicates are grouped into frames (e.g. Commerce_buy), which specify role-like slots to be filled. FrameNet offers fine-grained frame distinctions, and roles in FrameNet are frame-specific, e.g. Buyer, Seller and Money. The resource accompanies each frame with a description of the situation and its core and peripheral participants.

SPR follows the work of Dowty (1991) and discards the notion of categorical semantic roles in favor of feature bundles. Instead of a fixed role label, each argument is assessed via an 11-dimensional cardinal feature set including Proto-Agent and Proto-Patient properties like volitional, sentient, destroyed, etc. The feature-based approach eliminates some of the theoretical issues associated with categorical role inventories and allows for more flexible modeling of role semantics.

Each of the role labeling formalisms offers certain advantages and disadvantages (Giuglea and Moschitti, 2006; Mújdricza-Maydt et al., 2016). While being close to syntax and thereby easier to predict, PropBank doesn't contribute much semantics to the representation. On the opposite side of the spectrum, FrameNet offers rich predicate-semantic representations for verbs and nouns, but suffers from high granularity and coverage gaps (Hartmann et al., 2017). VerbNet takes a middle ground by following grammatical criteria while still encoding coarse-grained semantics, but only focuses on verbs and core (not modifier) roles. SPR avoids the granularity-generalization trade-off of the categorical inventories, but is yet to find its way into practical NLP applications.

3 Probing Methodology

We take the edge probing setup by Tenney et al. (2019b) as our starting point. Edge probing aims to predict a label given a pair of contextualized span or token encodings. More formally, we encode a WP-tokenized sentence [wp_1, wp_2, ... wp_k] with a frozen pre-trained model, producing contextual embeddings [e_1, e_2, ... e_k], each of which is a layered representation over L = {l_0, l_1, ... l_m} layers, with the encoding at layer l_n for the wordpiece wp_i further denoted as e_i^n. A trainable scalar mix is applied to the layered representation to produce the final encoding, given the per-layer mixing weights {a_0, a_1, ... a_m} and a scaling parameter γ:

    e_i = γ · Σ_{l=0}^{m} softmax(a)_l · e_i^l

Given the source src and target tgt wordpieces encoded as e_src and e_tgt, our goal is to predict the label y.

Due to its task-agnostic architecture, edge probing can be applied to a wide variety of unary (by omitting tgt) and binary labeling tasks in a unified manner, facilitating cross-task comparison. The original setup has several limitations that we address in our implementation.

Regression tasks. The original edge probing setup only considers classification tasks. Many language phenomena – including positional information and semantic proto-roles – are naturally modeled as regression. We extend the architecture by Tenney et al. (2019b) and support both classification and regression: the former achieved via softmax, the latter via direct linear regression to the target value.

Flat model. To decrease the model's own expressive power (Hewitt and Liang, 2019), we keep the number of parameters in our probing model as low as possible. While Tenney et al. (2019b) utilize pooled self-attentional span representations and a projection layer to enable cross-model comparison, we directly feed the wordpiece encoding into the classifier, using the first wordpiece of a word. To further increase the selectivity of the model, we directly project the source and target wordpiece representations into the label space, as opposed to the two-layer MLP classifier used in the original setup.

Separate scalar mixes. To enable fine-grained analysis of probing results, we train and analyze separate scalar mixes for source and target wordpieces, motivated by the fact that the classifier might utilize different aspects of their representation for prediction. (Tenney et al. (2019b, Appendix C) also use separate projections in the background, but do not investigate the differences between the learned projections.) Indeed, we find that the mixing weights learned for source and target wordpieces might show substantial – and linguistically meaningful – variation. Combined with the regression-based objective, separating the scalar mixes allows us to scrutinize layer utilization patterns for semantic proto-roles.

Sentence-level probes. Utilizing the BERT-specific sentence representation [CLS] allows us to incorporate the sentence-level natural language inference (NLI) probe into our kit.

Anchor tasks. We employ two analytical tools from the original layer probing setup. Mixing weight plotting compares layer utilization among tasks by visually aligning the respective learned weight distributions transformed via a softmax function. Layer center-of-gravity is used as a summary statistic for a task's layer utilization. While the distribution of mixing weights along the layers allows us to estimate the order in which information is processed during encoding, it doesn't allow us to directly assess the similarity between the layer utilization of the probing tasks.

Tenney et al. (2019a) have demonstrated that the order in which linguistic information is stored in BERT mirrors the traditional NLP pipeline. A prominent property of NLP pipelines is their use of low-level features to predict downstream phenomena. In the context of layer probing, probing tasks can be seen as end-to-end feature extractors. Following this intuition, we define two groups of probing tasks: target tasks – the main tasks under investigation, and anchor tasks – a set of related tasks that serve as a basis for qualitative comparison between the targets. The softmax transformation of the scalar mixing weights allows us to treat them as probability distributions: the higher the mixing weight of a layer, the more likely the probe is to utilize information from this layer during prediction. We use Kullback-Leibler divergence to compare target tasks (e.g. role labeling in different formalisms) in terms of their similarity to lower-level anchor tasks (e.g. dependency relation and lemma). Note that the notion of anchor task is contextual: the same task can serve as a target and as an anchor, depending on the focus of the study.

4 Setup

4.1 Source data

For German we use the SR3de corpus (Mújdricza-Maydt et al., 2016) that contains parallel PropBank, FrameNet and VerbNet annotations for verbal predicates. For English, SemLink (Bonial et al., 2013) provides mappings from the original PropBank corpus annotations to the corresponding FrameNet and VerbNet senses and semantic roles. We use these mappings to enrich the CoNLL-2009 (Hajič et al., 2009) dependency role labeling data – also based on the original PropBank – with roles in all three formalisms via a semi-automatic token alignment procedure. The resulting corpus is substantially smaller than the original, but still an order of magnitude larger than SR3de (Table 1). Both corpora are richly annotated with linguistic phenomena on word level, including part-of-speech, lemma and syntactic dependencies. The XNLI probe is sourced from the corresponding development split of the XNLI (Conneau et al., 2018b) dataset. The SPR probing tasks are constructed from the original data by Reisinger et al. (2015).

            tok     sent   pred   arg
CoNLL+SL    312.2K  11.3K  13.3K  23.9K
SR3de       62.6K   2.8K   2.9K   5.5K

Table 1: Statistics for CoNLL+SemLink (English) and SR3de (German), only core roles.

              type    en      de
*token.ix     unary   208.9K  46.9K
ttype [v]     unary   177.2K  34.0K
lex.unit [v]  unary   187.6K  35.7K
pos           unary   312.2K  62.6K
deprel        binary  300.9K  59.8K
role          binary  23.9K   5.5K
*spr          binary  9.7K    -
xnli          unary   2.5K    2.5K

Table 2: Probing task statistics. Tasks marked with [v] use a most frequent label vocabulary. Here and further, tasks marked with * are regression tasks.

4.2 Probing tasks

Our probing kit spans a wide range of probing tasks, from primitive surface-level tasks mostly utilized as anchors to high-level semantic tasks that aim to provide a representational upper bound to predicate semantics.
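The scalar mix of Section 3 is easy to state concretely. Below is a minimal pure-Python sketch (not the authors' implementation, which uses AllenNLP's trainable scalar mix), applied to one wordpiece with toy layer encodings; in the real setup the weights a are learned jointly with the probe.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of raw weights."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scalar_mix(layer_encodings, a, gamma=1.0):
    """e_i = gamma * sum_l softmax(a)_l * e_i^l for one wordpiece i,
    given its per-layer encodings e_i^0 .. e_i^m and raw weights a."""
    weights = softmax(a)
    dim = len(layer_encodings[0])
    return [gamma * sum(w * e[d] for w, e in zip(weights, layer_encodings))
            for d in range(dim)]

# Toy example: a 2-dimensional wordpiece encoding over m + 1 = 3 layers.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = scalar_mix(layers, a=[0.0, 0.0, 0.0])  # uniform mix = layer average
```

After training, inspecting softmax(a) per task (and, in our setup, separately for src and tgt) is exactly what the layer utilization analysis below operates on.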

We follow the training, test and development splits from the original SR3de, CoNLL-2009 and SPR data. The XNLI task is sourced from the development set and only used for scalar mix analysis. To reduce the number of labels in some of the probing tasks, we collect frequency statistics over the corresponding training sets and only consider up to 250 most frequent labels. Below we define the tasks in order of their complexity; Table 2 provides the probing task statistics, Table 3 compares the categorical role labeling formalisms in terms of granularity, and Table 4 provides examples. We evaluate classification performance using Accuracy, while regression tasks are scored via R².

language   en   de
PropBank   5    10
VerbNet    23   29
FrameNet   189  300

Table 3: # of role probe labels by formalism.

Token type (ttype) predicts the type of a word. This requires contextual processing, since a word might consist of several wordpieces.

Token position (token.ix) predicts the linear position of a word, cast as a regression task over the first 20 words in the sentence. Again, the task is non-trivial, since it requires the words to be assembled from the wordpieces.

Part-of-speech (pos) predicts the language-specific part-of-speech tag for the given token.

Lexical unit (lex.unit) predicts the lemma and POS of the given word – a common input representation for the entries in lexical resources. We extract coarse POS tags by using the first character of the language-specific POS tag.

Dependency relation (deprel) predicts the dependency relation between the parent src and dependent tgt tokens.

Semantic role (role.[frm]) predicts the semantic role given a predicate src and an argument tgt token in one of the three role labeling formalisms: PropBank pb, VerbNet vn and FrameNet fn. Note that we only probe for the role label; the model has no access to the verb sense information from the data.

Semantic proto-role (spr.[prop]) is a set of eleven regression tasks predicting the values of the proto-role properties as defined in Reisinger et al. (2015), given a predicate src and an argument tgt.

XNLI is a sentence-level NLI task directly sourced from the corresponding dataset. Given two sentences, the goal is to determine whether an entailment or a contradiction relationship holds between them. We use NLI to investigate the layer utilization of mBERT for high-level semantic tasks. We extract the sentence pair representation via the [CLS] token and treat it as a unary probing task.

task       input                    label
token.ix   I [saw] a cat.           2
ttype      I [saw] a cat.           saw
lex.unit   I [saw] a cat.           see.V
pos        I [saw] a cat.           VBD
deprel     [I]tgt [saw]src a cat.   SBJ
role.vn    [I]tgt [saw]src a cat.   Experiencer
spr.vltn   [I]tgt [saw]src a cat.   2

Table 4: Word-level probing task examples for English. vltn corresponds to the SPR property volition.

5 Results

Our probing framework is implemented using AllenNLP (code available at https://github.com/UKPLab/emnlp2020-formalism-probing). We train the probes for 20 epochs using the Adam optimizer with default parameters and a batch size of 32. Due to the frozen encoder and flat model architecture, the total runtime of the main experiments is under 8 hours on a single Tesla V100 GPU. In addition to pre-trained mBERT, we report baseline performance using a frozen untrained mBERT model obtained by randomizing the encoder weights post-initialization, as in Jawahar et al. (2019).

5.1 General Trends

While absolute performance is secondary to our analysis, we report the probing task scores on the respective development sets in Table 5. We observe that grammatical tasks score high, while core role labeling lags behind – in line with the findings of Tenney et al. (2019a), although our results are not directly comparable due to the differences in datasets and formalisms. We observe lower scores for German role labeling, which we attribute to the lack of training data. Surprisingly, as we show below, this doesn't prevent the edge probe from learning to locate relevant role-semantic information in mBERT's layers.
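For reference, the R² score used for the regression probes here and in Section 5.5 is the standard coefficient of determination; a minimal stdlib sketch (toy values, not taken from the paper):

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    R² = 1 for perfect predictions; R² = 0 for always predicting the mean
    of y_true; it can go negative for predictions worse than the mean.
    Assumes y_true is not constant (SS_tot > 0).
    """
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot

# Perfect predictions on a toy target give R² = 1.
assert r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]) == 1.0
```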

task       en           de
*token.ix  0.95 (0.93)  0.92 (0.87)
ttype      1.00 (0.92)  1.00 (0.48)
lex.unit   1.00 (0.75)  1.00 (0.33)
pos        0.97 (0.40)  0.97 (0.26)
deprel     0.95 (0.42)  0.95 (0.41)
role.fn    0.92 (0.18)  0.59 (0.10)
role.pb    0.96 (0.67)  0.71 (0.49)
role.vn    0.94 (0.47)  0.73 (0.30)

Table 5: Best dev score for word-level tasks over 20 epochs; Acc for classification, R² for regression; baseline in parentheses.

Figure 2: Layer probing results: scalar mixing weight distributions over layers for en and de; per-task center-of-gravity in brackets (en/de): *token.ix [3.13/4.61], ttype [4.8/5.2], lex.unit [4.99/5.09], pos [5.29/5.75], deprel src [5.36/6.01], deprel tgt [5.57/5.99], role.pb src [5.47/5.18], role.vn src [4.28/5.24], role.fn src [4.36/5.13], role.pb tgt [6.11/6.12], role.vn tgt [6.05/6.06], role.fn tgt [6.16/5.75], xnli [6.28/6.15].

The untrained mBERT baseline expectedly underperforms; however, we note good baseline results on surface-level tasks for English, which we attribute to memorizing token identity and position: although the weights are set randomly, the frozen encoder still associates each wordpiece input with a fixed random vector. We have confirmed this assumption by scalar mix analysis of the untrained mBERT baseline: in our experiments the baseline probes for both English and German attended almost exclusively to the first few layers of the encoder, independent of the task. For brevity, here and further we do not examine baseline mixing weights and only report the scores.

Our main probing results mirror the findings of Tenney et al. (2019a) about the sequential processing order in BERT. We observe that the layer utilization among tasks (Fig. 2) generally aligns for English and German, echoing the recent findings on mBERT's multilingual capacity (Pires et al., 2019; Kondratyuk and Straka, 2019), although we note that in terms of center-of-gravity mBERT tends to utilize deeper layers for German probes. Basic word-level tasks are indeed processed early by the model, and XNLI probes focus on deeper levels, suggesting that the representation of higher-level semantic phenomena follows the encoding of syntax and predicate semantics.

5.2 The Effect of Formalism

Using separate scalar mixes for source and target tokens allows us to explore the cross-formalism encoding of role semantics by mBERT in detail. For both English and German role labeling, the probe's layer utilization drastically differs for predicate and argument tokens. While the argument representation (role.* tgt) mostly focuses on the same layers as the dependency relation probe, the layer utilization of the predicates (role.* src) is affected by the chosen formalism. In English, PropBank predicate token mixing weights emphasize the same layers as dependency parsing – in line with the previously published results. However, the probes for VerbNet and FrameNet predicates (role.vn src and role.fn src) utilize the layers associated with ttype and lex.unit that contain lexical information. Coupled with the fact that both VerbNet and FrameNet assign semantic roles based on lexical-semantic predicate groupings (frames in FrameNet and verb classes in VerbNet), this suggests that the lower layers of mBERT implicitly encode predicate sense information; moreover, sense encoding for VerbNet utilizes deeper layers of the model associated with syntax, in line with VerbNet's predicate classification strategy. This finding confirms that the formalism can indeed have linguistically meaningful effects on probing results.

Figure 3: Anchor task analysis of SRL formalisms: similarity of the role probe scalar mixes (pb/vn/fn, src and tgt) to the *token.ix, ttype, lex.unit, pos and deprel anchor probes for en and de; darker color corresponds to higher similarity.

5.3 Anchor Tasks in the Pipeline

We now use the scalar mixes of the role labeling probes as target tasks, and lower-level probes as anchor tasks, to qualitatively explore the differences between how our role probes learn to represent predicates and semantic arguments (Fig. 3).
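The comparison underlying Fig. 3 reduces to Kullback-Leibler divergence between softmax-normalized mixing weight distributions (Section 3). A small pure-Python sketch with hypothetical weights (4 layers for brevity; the actual mixes span all mBERT layers plus the embedding layer, and the weight values below are illustrative, not from the paper):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) between two layer distributions, in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical learned mixing weights: a role probe vs. two anchor probes.
role_fn_src = softmax([0.1, 0.9, 0.4, 0.2])
ttype_anchor = softmax([0.2, 1.0, 0.3, 0.1])   # lexical anchor
deprel_anchor = softmax([0.0, 0.2, 0.5, 1.0])  # syntactic anchor

# Lower divergence -> the target probe utilizes layers more like that anchor.
closer_to_lexical = kl(role_fn_src, ttype_anchor) < kl(role_fn_src, deprel_anchor)
```

With these toy weights, the FrameNet predicate probe comes out closer to the lexical anchor, mirroring the qualitative pattern in Fig. 3.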

The results reveal a distinctive pattern that confirms our previous observations: while the VerbNet and FrameNet predicate (src) layer utilization is similar to the scalar mixes learned for ttype and lex.unit, the learned argument representations (tgt) and the PropBank predicate attend to the layers associated with the dependency relation and POS probes. Aside from the PropBank predicate encoding, which we address below, the pattern reproduces for English and German. This aligns with the traditional separation of the semantic role labeling task into predicate disambiguation followed by semantic argument identification and labeling, along with the feature sets employed for these tasks (Björkelund et al., 2009). Note that the observation about the pipeline-like task processing within the BERT encoders thereby holds, albeit on a sub-task level.

5.4 Formalism Implementations

Both layer and anchor task analysis reveal a prominent discrepancy between the English and German role probing results: while the PropBank predicate layer utilization for English mostly relies on syntactic information, German PropBank predicates behave similarly to VerbNet and FrameNet. The lack of systematic cross-lingual differences in layer utilization for the other probing tasks (apart from the general tendency to use deeper layers in German reported in 5.1) allows us to rule out the effect of purely typological features such as word order and case marking as a likely cause.

The difference in the number of role labels for English and German PropBank, however, points at possible qualitative differences in the labeling schemes (Table 3). The data for English stems from the token-level alignment in SemLink that maps the original PropBank roles to VerbNet and FrameNet. Role annotations for German have a different lineage: they originate from the FrameNet-annotated SALSA corpus (Burchardt et al., 2006), semi-automatically converted to PropBank style for the CoNLL-2009 shared task (Hajič et al., 2009) and enriched with VerbNet labels in SR3de (Mújdricza-Maydt et al., 2016). As a result, while English PropBank labels are assigned in a predicate-independent manner, German PropBank, while following the same numbered labeling scheme, keeps this scheme consistent within the frame. We assume that this incentivizes the probe to learn semantic verb groupings, which reflects in our probing results. The ability of the probe to detect subtle differences between formalism implementations constitutes a new use case for probing, and a promising direction for future studies.

5.5 Encoding of Proto-Roles

We now turn to the probing results for the decompositional semantic proto-role labeling tasks. Unlike Tenney et al. (2019b), who used a multi-label classification probe, we treat SPR properties as separate regression tasks. The results in Table 6 show that the performance varies by property, with some of the properties attaining reasonably high R² scores despite the simplicity of the probe architecture and the small dataset size. We observe that properties associated with the Proto-Agent tend to perform better. The untrained mBERT baseline performs poorly, which we attribute to the lack of data and the fine-grained semantic nature of the task.

property                 R²
(A) *instigation         0.68 (0.21)
(A) *volition            0.75 (0.11)
(A) *awareness           0.78 (0.09)
(A) *sentient            0.83 (0.07)
(A) *change.of.location  0.49 (0.04)
(A) *exists.as.physical  0.63 (0.03)
(P) *created             0.22 (0.01)
(P) *destroyed           0.11 (0.00)
(P) *changes.possession  0.26 (-0.01)
(P) *change.of.state     0.37 (0.01)
(P) *stationary          0.39 (0.05)

Table 6: Best dev R² for proto-role probing tasks over 20 epochs; A – Proto-Agent, P – Proto-Patient; baseline in parentheses.

Our fine-grained, property-level task design allows for more detailed insights into the layer utilization by the SPR probes (Fig. 4). The results indicate that while the layer utilization on the predicate side (src) shows no clear preference for particular layers (similar to the results obtained by Tenney et al. (2019a)), some of the proto-role features follow the pattern seen in the categorical role labeling and dependency parsing tasks for the argument tokens (tgt). With few exceptions, we observe that the properties displaying that behavior

are Proto-Agent properties; moreover, a close examination of the results on syntactic preference by Reisinger et al. (2015, p. 483) reveals that these properties are also the ones with a strong preference for the subject position, including the outlier case of stationary, which in their data behaves like a Proto-Agent property. The correspondence is not strict, and we leave an in-depth investigation of the reasons behind these discrepancies for follow-up work.

Figure 4: Layer utilization for SPR properties, on the predicate (src) and argument (tgt) side.

6 Conclusion

We have demonstrated that the choice of linguistic formalism can have substantial, linguistically meaningful effects on role-semantic probing results. We have shown how probing classifiers can be used to detect discrepancies between formalism implementations, and we have presented evidence of semantic proto-role encoding in the pre-trained mBERT model. Our refined implementation of the edge probing framework, coupled with the anchor task methodology, enabled new insights into the processing of predicate-semantic information within mBERT. Our findings suggest that linguistic formalism is an important factor to be accounted for in probing studies.

This prompts several recommendations for follow-up probing studies. First, the formalism and the implementation used to prepare the linguistic material underlying a probing study should always be explicitly specified. Second, if possible, results on multiple formalisations of the same task should be reported and validated for several languages. Finally, assembling corpora with parallel cross-formalism annotations would facilitate further research on the effect of formalism in probing.

While our work illustrates the impact of formalism using a single task and a single probing framework, the influence of linguistic formalism per se is likely to be present for any probing setup that builds upon linguistic material. An investigation of how, whether, and why formalisms and their implementations affect probing results for tasks beyond role labeling and for frameworks beyond edge probing constitutes an exciting avenue for future research.

Acknowledgments

This work has been funded by the LOEWE initiative (Hesse, Germany) within the emergenCITY center.

References

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1, pages 86–90, Montreal, Quebec, Canada. Association for Computational Linguistics.

Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017. What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 861–872. Association for Computational Linguistics.

Anders Björkelund, Love Hafdell, and Pierre Nugues. 2009. Multilingual semantic role labeling. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task, pages 43–48, Boulder, Colorado. Association for Computational Linguistics.

Claire Bonial, Kevin Stowe, and Martha Palmer. 2013. Renewing and revising SemLink. In Proceedings of the 2nd Workshop on Linked Data in Linguistics (LDL-2013): Representing and linking lexicons, terminologies and other language data, pages 9–17. Association for Computational Linguistics.

Aljoscha Burchardt, Katrin Erk, Anette Frank, Andrea Kowalski, Sebastian Padó, and Manfred Pinkal. 2006. The SALSA corpus: a German corpus resource for lexical semantics. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06), Genoa, Italy. European Language Resources Association (ELRA).

Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018a. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2126–2136. Association for Computational Linguistics.

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018b. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485, Brussels, Belgium. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

David Dowty. 1991. Thematic proto-roles and argument selection. Language, 67(3):547–619.

Charles J. Fillmore. 1968. The Case for Case. In Emmon Bach and Robert T. Harms, editors, Universals in Linguistic Theory, pages 1–88. Holt, Rinehart and Winston, New York.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 1–6, Melbourne, Australia. Association for Computational Linguistics.

Ana-Maria Giuglea and Alessandro Moschitti. 2006. Semantic role labeling via FrameNet, VerbNet and PropBank. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 929–936, Sydney, Australia. Association for Computational Linguistics.

Jan Hajič, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štěpánek, Pavel Straňák, Mihai Surdeanu, Nianwen Xue, and Yi Zhang. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task, pages 1–18, Boulder, Colorado. Association for Computational Linguistics.

Silvana Hartmann, Ilia Kuznetsov, Teresa Martin, and Iryna Gurevych. 2017. Out-of-domain FrameNet semantic role labeling. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 471–482, Valencia, Spain. Association for Computational Linguistics.

John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2733–2743, Hong Kong, China. Association for Computational Linguistics.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota. Association for Computational Linguistics.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3651–3657, Florence, Italy. Association for Computational Linguistics.

Karen Kipper-Schuler. 2005. VerbNet: A broad-coverage, comprehensive verb lexicon. Ph.D. thesis, University of Pennsylvania.

Dan Kondratyuk and Milan Straka. 2019. 75 languages, 1 model: Parsing universal dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2779–2795, Hong Kong, China. Association for Computational Linguistics.

Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4365–4374, Hong Kong, China. Association for Computational Linguistics.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv:1909.11942.

Beth Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press.

Beth Levin and Malka Rappaport Hovav. 2005. Argument Realization. Research Surveys in Linguistics. Cambridge University Press.

Yongjie Lin, Yi Chern Tan, and Robert Frank. 2019. Open sesame: Getting inside BERT's linguistic knowledge. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 241–253, Florence, Italy. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692.

Marie-Catherine de Marneffe and Christopher D. Manning. 2008. The Stanford typed dependencies representation. In Coling 2008: Proceedings of the workshop on Cross-Framework and Cross-Domain Parser Evaluation, pages 1–8, Manchester, UK. Coling 2008 Organizing Committee.

Éva Mújdricza-Maydt, Silvana Hartmann, Iryna Gurevych, and Anette Frank. 2016. Combining semantic annotation of word sense & semantic roles: A novel annotation scheme for VerbNet roles on German language data. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association (ELRA).

Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy. Association for Computational Linguistics.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv:1910.10683.

Emily Reif, Ann Yuan, Martin Wattenberg, Fernanda B. Viégas, Andy Coenen, Adam Pearce, and Been Kim. 2019. Visualizing and measuring the geometry of BERT. In Advances in Neural Information Processing Systems 32, pages 8594–8603. Curran Associates, Inc.

Drew Reisinger, Rachel Rudinger, Francis Ferraro, Craig Harman, Kyle Rawlins, and Benjamin Van Durme. 2015. Semantic proto-roles. Transactions of the Association for Computational Linguistics, 3:475–488.

Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. arXiv:2002.12327.

Sebastian Schuster and Christopher D. Manning. 2016. Enhanced English universal dependencies: An improved representation for natural language understanding tasks. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 2371–2378, Portorož, Slovenia. European Language Resources Association (ELRA).

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019a. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, and Ellie Pavlick. 2019b. What do you learn from context? Probing for sentence structure in contextualized word representations. In 7th International Conference on Learning Representations (ICLR 2019), pages 1–17.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.

Aaron Steven White, Drew Reisinger, Keisuke Sakaguchi, Tim Vieira, Sheng Zhang, Rachel Rudinger, Kyle Rawlins, and Benjamin Van Durme. 2016. Universal decompositional semantics on universal dependencies. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1713–1723, Austin, Texas. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv:1910.03771.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv:1609.08144.