A Matter of Framing: The Impact of Linguistic Formalism on Probing Results

Ilia Kuznetsov and Iryna Gurevych
Ubiquitous Knowledge Processing Lab (UKP-TUDA)
Department of Computer Science, Technische Universität Darmstadt
http://www.ukp.tu-darmstadt.de/

arXiv:2004.14999v1 [cs.CL] 30 Apr 2020

Abstract

Deep pre-trained contextualized encoders like BERT (Devlin et al., 2019) demonstrate remarkable performance on a range of downstream tasks. A recent line of research in probing investigates the linguistic knowledge implicitly learned by these models during pre-training. While most work in probing operates on the task level, linguistic tasks are rarely uniform and can be represented in a variety of formalisms. Any linguistics-based probing study thereby inevitably commits to the formalism used to annotate the underlying data. Can the choice of formalism affect probing results? To investigate, we conduct an in-depth cross-formalism layer probing study in role semantics. We find linguistically meaningful differences in the encoding of semantic role- and proto-role information by BERT depending on the formalism, and demonstrate that layer probing can detect subtle differences between the implementations of the same linguistic formalism. Our results suggest that linguistic formalism is an important dimension in probing studies, along with the commonly used cross-task and cross-lingual experimental settings.

[Figure 1: Intra-sentence similarity by layer L of the multilingual BERT-base for the sentence "We asked John to leave.", shown for L = 0, L = 8 and L = 11. Functional words are similar in L = 0; syntactic groups emerge at higher levels.]

1 Introduction

The emergence of deep pre-trained contextualized encoders has had a major impact on the field of natural language processing. Boosted by the availability of general-purpose frameworks like AllenNLP (Gardner et al., 2018) and Transformers (Wolf et al., 2019), pre-trained models like ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) have caused a shift towards simple architectures where a strong pre-trained encoder is paired with a shallow downstream model, often outperforming the intricate task-specific architectures of the past.

The versatility of pre-trained representations implies that they encode some aspects of general linguistic knowledge (Reif et al., 2019). Indeed, even an informal inspection of layer-wise intra-sentence similarities (Fig. 1) suggests that these models capture elements of linguistic structure, and that those differ depending on the layer of the model. A grounded investigation of these regularities allows us to interpret the model's behavior, design better pre-trained encoders and inform downstream model development. Such investigation is the main subject of probing, and recent studies confirm that BERT implicitly captures many aspects of language use, lexical semantics and grammar (Rogers et al., 2020).

Most probing studies use linguistics as a theoretical scaffolding and operate on the task level. However, there often exist multiple ways to represent the same linguistic task: for example, English dependency syntax can be encoded using a variety of formalisms, incl. Universal (Schuster and Manning, 2016), Stanford (de Marneffe and Manning, 2008) and CoNLL-2009 dependencies (Hajič et al., 2009), all using different label sets and syntactic head attachment rules. Any probing study inevitably commits to the specific theoretical framework used to produce the underlying data. The differences between linguistic formalisms, however, can be substantial.

Can these differences affect the probing results? This question is intriguing for several reasons. Linguistic formalisms are well documented, and if the choice of formalism indeed has an effect on probing, cross-formalism comparison will yield new insights into the linguistic knowledge obtained by contextualized encoders during pre-training. If, alternatively, the probing results remain stable despite substantial differences between formalisms, this prompts further scrutiny of what the pre-trained encoders in fact encode. Finally, on the reverse side, cross-formalism probing might be used as a tool to empirically compare formalisms and their language-specific implementations. To the best of our knowledge, we are the first to explicitly address the influence of formalism on probing.

Ideally, the task chosen for a cross-formalism study should be encoded in multiple formalisms over the same textual data, to rule out the influence of domain and text type. While many linguistic corpora contain several layers of linguistic information, having the same textual data annotated with multiple formalisms for the same task is rare. We focus on role semantics – a family of shallow semantic formalisms at the interface between syntax and propositional semantics that assign roles to the participants of natural language utterances, determining who did what to whom, where, when etc. Decades of research in theoretical linguistics have produced a range of role-semantic frameworks that have been operationalized in NLP: syntax-driven PropBank (Palmer et al., 2005), coarse-grained VerbNet (Kipper-Schuler, 2005), fine-grained FrameNet (Baker et al., 1998) and, recently, decompositional Semantic Proto-Roles (SPR) (Reisinger et al., 2015; White et al., 2016). The SemLink project (Bonial et al., 2013) offers parallel annotation for PropBank, VerbNet and FrameNet for English. This allows us to isolate the object of our study: apart from the role-semantic labels, the underlying data and conditions for the three formalisms are identical. SR3DE (Mújdricza-Maydt et al., 2016) provides compatible annotation in the three formalisms for German, enabling cross-lingual validation of our results. Combined, these factors make role semantics an ideal target for a cross-formalism probing study.

Contributions. This work studies the effect of the linguistic formalism on probing results. We conduct cross-formalism experiments on PropBank, VerbNet and FrameNet role prediction in English and German, and show that the formalism can affect probing results in a linguistically meaningful way; in addition, we demonstrate that layer probing can detect subtle differences between implementations of the same formalism in different languages. On the technical side, we advance the recently introduced edge and layer probing framework (Tenney et al., 2019b); in particular, we introduce anchor tasks – an analytical tool inspired by feature-based systems that allows deeper qualitative insights into the pre-trained models' behavior. Finally, advancing the current knowledge about the encoding of predicate semantics in BERT, we perform a fine-grained semantic proto-role probing study and demonstrate that semantic proto-role properties can be extracted from pre-trained BERT, contrary to the existing reports, contributing to the line of research on how and whether pre-trained BERT encodes higher-level semantic phenomena. Our results suggest that along with task and language, linguistic formalism is an important dimension to be accounted for in probing research.

2 Related Work

2.1 BERT as Encoder

BERT is a Transformer (Vaswani et al., 2017) encoder pre-trained by jointly optimizing two unsupervised objectives: masked language modeling and next sentence prediction. It uses WordPiece (WP, Wu et al. (2016)) subword tokens along with positional embeddings as input, and gradually constructs sentence representations by applying token-level self-attention pooling over a stack of layers L. The result of BERT encoding is a layer-wise representation of the input wordpiece tokens, with higher layers representing higher-level abstractions over the input sequence. Thanks to the joint pre-training objective, BERT can encode words and sentences in a unified fashion: the encoding of a sentence or a sentence pair is stored in a special token [CLS].

To facilitate multilingual experiments, we use the multilingual BERT-base (mBERT) published by Devlin et al. (2019). Although several recent encoders have outperformed BERT on benchmarks (Liu et al., 2019; Lan et al., 2019; Raffel et al., 2019), we use the original BERT architecture, since it allows us to inherit the probing methodology and to build upon the related findings.

2.2 Probing

Due to space limitations we omit high-level discussions on benchmarking (Wang et al., 2018) and sentence-level probing (Conneau et al., 2018a), and focus on the recent findings related to the representation of linguistic structure in BERT. Surface-level information generally tends to be represented in the lower layers of deep encoders, while higher layers store hierarchical and semantic information (Belinkov et al., 2017; Lin et al., 2019). Tenney et al. (2019a) show that the abstraction strategy applied by the English pre-trained BERT encoder follows the order of the classical NLP pipeline. Strengthening the claim about the linguistic capabilities of BERT, Hewitt and Manning (2019) demonstrate that BERT implicitly learns syntax, and Reif et al. (2019) show that it encodes fine-grained lexical-semantic distinctions. Rogers et al. (2020) provide a comprehensive overview of BERT's properties discovered to date.

While recent results indicate that BERT successfully represents lexical-semantic and grammatical information, the evidence of its high-level semantic capabilities is inconclusive. Tenney et al. (2019a) show that English PropBank semantics can be extracted from the encoder and follows syntax in the layer structure. However, out of all formalisms PropBank is most closely tied to syntax, and the results on proto-role and relation probing do not follow the same pattern. Kovaleva et al. (2019) identify two attention heads in BERT responsible for FrameNet relations; however, they find that disabling them in a fine-tuning evaluation on the GLUE (Wang et al., 2018) benchmark does not result in decreased performance.

Although we are not aware of any systematic studies dedicated to the effect of formalism on probing results, the evidence of such effects is scattered across the related work: for example, the aforementioned results in Tenney et al. (2019a) show a difference in layer utilization between constituency- and dependency-based syntactic probes and semantic role and proto-role probes. It is not clear whether this effect is due to the differences in the underlying datasets and task architecture, or the formalism per se.

Our probing methodology builds upon the edge and layer probing framework. The encoding produced by a frozen BERT model can be seen as a layer-wise snapshot that reflects how the model has constructed the high-level abstractions. Tenney et al. (2019b) introduce the edge probing task design: a simple classifier is tasked with predicting a linguistic property given a pair of spans encoded using a frozen pre-trained model. Tenney et al. (2019a) use edge probing to analyze the layer utilization of a pre-trained BERT model via scalar mixing weights (Peters et al., 2018) learned during training. We revisit this framework in Section 3.

2.3 Role Semantics

We now turn to the object of our investigation: role semantics. For further discussion, consider the following synthetic example:

a. [John]Ag gave [Mary]Rc a [book]Th.
b. [Mary]Rc was given a [book]Th by [John]Ag.

Despite surface-level differences, the sentences express the same meaning, suggesting an underlying semantic representation in which these sentences are equivalent. One such representation is offered by role semantics – a shallow predicate-semantic formalism closely related to syntax. In terms of role semantics, Mary, book and John are semantic arguments of the predicate give, and are assigned roles from a pre-defined inventory, for example, Agent, Recipient and Theme.

Semantic roles and their properties have received extensive attention in linguistics (Fillmore, 1968; Levin and Rappaport Hovav, 2005; Dowty, 1991) and are considered a universal feature of human language. The size and organization of the role and predicate inventory are subject to debate, giving rise to a variety of role-semantic formalisms.

PropBank assumes a predicate-independent labeling scheme where predicates are distinguished by their sense (get.01), and semantic arguments are labeled with generic numbered core (Arg0-5) and modifier (e.g. AM-TMP) roles. Core roles are not tied to specific definitions, but an effort has been made to keep the role assignments consistent for similar verbs; Arg0 and Arg1 correspond to the Proto-Agent and Proto-Patient roles as per Dowty (1991). The semantic interpretation of core roles depends on the predicate sense.

VerbNet follows a different categorization scheme. Motivated by the regularities in verb behavior, Levin (1993) has introduced the grouping of verbs into intersective Levin classes (ILC). This methodology has been adopted by VerbNet: for example, the VerbNet class get-13.5.1 would include the verbs earn, fetch, gain etc. A verb in VerbNet can belong to several classes corresponding to different senses; each class is associated with a set of roles and licensed syntactic transformations. Unlike PropBank, VerbNet uses a set of approx. 30 thematic roles that have universal definitions and are shared among predicates, e.g. Agent, Beneficiary, Instrument.

FrameNet takes a meaning-driven stance on role encoding by modeling it in terms of frame semantics: predicates are grouped into frames (e.g. Commerce_buy), which specify role-like slots to be filled. FrameNet offers fine-grained frame distinctions, and roles in FrameNet are frame-specific, e.g. Buyer, Seller and Money. The resource accompanies each frame with a description of the situation and its core and peripheral participants.

SPR follows the work of Dowty (1991) and discards the notion of categorical semantic roles in favor of feature bundles. Instead of a fixed role label, each argument is assessed via an 11-dimensional cardinal feature set including Proto-Agent and Proto-Patient properties like volitional, sentient, destroyed, etc. The feature-based approach eliminates some of the theoretical issues associated with categorical role inventories and allows for more flexible modeling of role semantics.

Each of the role labeling formalisms offers certain advantages and disadvantages (Giuglea and Moschitti, 2006; Mújdricza-Maydt et al., 2016). While being close to syntax and thereby easier to predict, PropBank does not contribute much semantics to the representation. On the opposite side of the spectrum, FrameNet offers rich predicate-semantic representations for verbs and nouns, but suffers from high granularity and coverage gaps (Hartmann et al., 2017). VerbNet takes a middle ground by following grammatical criteria while still encoding coarse-grained semantics, but only focuses on verbs and core (not modifier) roles. SPR avoids the granularity-generalization trade-off of the categorical inventories, but is yet to find its way into practical NLP applications.

3 Probing Methodology

We take the edge probing setup by Tenney et al. (2019b) as our starting point. Edge probing aims to predict a label given a pair of contextualized span or word encodings. More formally, we encode a WP-tokenized sentence [wp_1, wp_2, ... wp_k] with a frozen pre-trained model, producing contextual embeddings [e_1, e_2, ... e_k], each of which is a layered representation over L = {l_0, l_1, ... l_m} layers, with the encoding of wordpiece wp_i at layer l_n further denoted as e_i^n. A trainable scalar mix is applied to the layered representation to produce the final encoding, given the per-layer mixing weights {a_0, a_1, ... a_m} and a scaling parameter γ:

    e_i = γ · Σ_{l=0}^{m} softmax(a)_l · e_i^l

Given the source src and target tgt wordpieces encoded as e_src and e_tgt, our goal is to predict the label y.

Due to its task-agnostic architecture, edge probing can be applied to a wide variety of unary (by omitting tgt) and binary labeling tasks in a unified manner, facilitating cross-task comparison. The original setup has several limitations that we address in our implementation.

Regression tasks. The original edge probing setup only considers classification tasks. Many language phenomena – including positional information and semantic proto-roles – are naturally modeled as regression, and we extend the original model by supporting both classification and regression: the former achieved via softmax, the latter via direct linear regression to the target value.

Flat model. To decrease the model's own expressive power (Hewitt and Liang, 2019), we keep the number of parameters in our probing model as low as possible. While Tenney et al. (2019b) utilize pooled self-attentional span representations and a projection layer to enable cross-model comparison, we directly feed the wordpiece encoding into the classifier, using the first wordpiece of a word. To further increase the selectivity of the model, we directly project the source and target wordpiece representations into the label space, as opposed to the two-layer MLP classifier used in the original setup.

Separate scalar mixes. To enable fine-grained analysis of probing results, we train and analyze separate scalar mixes for source and target wordpieces, motivated by the fact that the classifier might utilize different aspects of their representations for prediction.¹ Indeed, we find that the mixing weights learned for source and target wordpieces might show substantial – and linguistically meaningful – variation.

Sentence-level probes. Utilizing the BERT-specific sentence representation [CLS] allows us to incorporate a sentence-level natural language inference (NLI) probe into our kit.

¹ The original work (Tenney et al., 2019b, Appendix C) also uses separate projections for source and target tokens, but does not investigate the differences between the learned projections.

Table 1: Statistics for CoNLL+SemLink (English) and SR3de (German), core roles only.

            tok      sent    pred    arg
  CoNLL+SL  312.2K   11.3K   13.3K   23.9K
  SR3de     62.6K    2.8K    2.9K    5.5K
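The scalar mix described in Section 3 reduces to a softmax over the per-layer weights followed by a weighted sum. A minimal sketch (our own illustrative code, not the authors' implementation; names are ours):

```python
import math

def scalar_mix(layer_encodings, weights, gamma):
    """Collapse a layered wordpiece representation [e^0 .. e^m] into a
    single vector: e_i = gamma * sum_l softmax(a)_l * e_i^l."""
    exps = [math.exp(a) for a in weights]
    z = sum(exps)
    probs = [e / z for e in exps]  # softmax over the layer weights a_0..a_m
    dim = len(layer_encodings[0])
    mixed = [0.0] * dim
    for p, layer in zip(probs, layer_encodings):
        for j in range(dim):
            mixed[j] += gamma * p * layer[j]
    return mixed
```

With uniform weights and γ = 1, this simply averages the layers; during probe training, the weights and γ would be learned jointly with the classifier.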

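The flat probe of Section 3 is a single linear projection from the concatenated source/target encodings into the label space, with a softmax head for classification and a raw linear output for regression. A sketch under our own naming (not the released code):

```python
import math
import random

class FlatProbe:
    """Minimal flat probing head: one linear projection from the
    concatenated source/target wordpiece encodings to the label space."""

    def __init__(self, in_dim, n_labels, seed=0):
        rng = random.Random(seed)
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                  for _ in range(n_labels)]
        self.b = [0.0] * n_labels

    def logits(self, e_src, e_tgt):
        x = e_src + e_tgt  # concatenate source and target encodings
        return [sum(wi * xi for wi, xi in zip(row, x)) + bi
                for row, bi in zip(self.w, self.b)]

    def classify(self, e_src, e_tgt):
        z = self.logits(e_src, e_tgt)
        m = max(z)
        exps = [math.exp(v - m) for v in z]
        s = sum(exps)
        return [v / s for v in exps]  # softmax over label logits

    def regress(self, e_src, e_tgt):
        # regression probe: direct linear mapping to the target value
        return self.logits(e_src, e_tgt)[0]
```

For unary tasks, e_tgt would simply be omitted (an empty list here); the weights would be trained with cross-entropy or mean squared error, respectively.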
Anchor tasks. We employ two analytical tools from the original layer probing setup. Mixing weight plotting compares layer utilization among tasks by visually aligning the respective learned weight distributions transformed via a softmax function. Layer center-of-gravity is used as a summary statistic for a task's layer utilization.

While the distribution of mixing weights along the layers allows us to estimate the order in which information is processed during encoding, it does not allow us to directly assess the similarity between the layer utilization of the probing tasks. Tenney et al. (2019a) have demonstrated that the order in which linguistic information is stored in BERT mirrors the traditional NLP pipeline. A prominent property of NLP pipelines is their use of low-level features to predict downstream phenomena. In the context of layer probing, probing tasks can be seen as end-to-end feature extractors. Following this intuition, we define two groups of probing tasks: target tasks – the main tasks under investigation, and anchor tasks – a set of related tasks that serve as a basis for qualitative comparison between the targets. The softmax transformation of the scalar mixing weights allows us to treat them as probability distributions: the higher the mixing weight of a layer, the more likely the probe is to utilize information from this layer during prediction. We use Kullback-Leibler divergence to compare target tasks (e.g. role labeling in different formalisms) in terms of their similarity to lower-level anchor tasks (e.g. dependency relation and lemma). Note that the notion of anchor task is contextual: the same task can serve as a target and as an anchor, depending on the focus of the study.

4 Setup

4.1 Source data

For German we use the SR3de corpus (Mújdricza-Maydt et al., 2016), which contains parallel PropBank, FrameNet and VerbNet annotations for verbal predicates. For English, SemLink (Bonial et al., 2013) provides mappings from the original PropBank corpus annotations to the corresponding FrameNet and VerbNet senses and semantic roles. We use these mappings to enrich the CoNLL-2009 (Hajič et al., 2009) dependency role labeling data – also based on the original PropBank – with roles in all three formalisms via a semi-automatic token alignment procedure. The resulting corpus is substantially smaller than the original, but still an order of magnitude larger than SR3de (Table 1). Both corpora are richly annotated with linguistic phenomena on the word level, including part-of-speech, lemma and syntactic dependencies. The XNLI probe is sourced from the corresponding development split of the XNLI (Conneau et al., 2018b) dataset. The SPR probing tasks are extracted from the original data by Reisinger et al. (2015).

Table 2: Probing task statistics. Tasks marked with [v] use a most frequent label vocabulary. Here and further, tasks marked with * are regression tasks.

  task          type     en       de
  *token.ix     unary    208.9K   46.9K
  ttype [v]     unary    177.2K   34.0K
  lex.unit [v]  unary    187.6K   35.7K
  pos           unary    312.2K   62.6K
  deprel        binary   300.9K   59.8K
  role          binary   23.9K    5.5K
  *spr          binary   9.7K     -
  xnli          unary    2.5K     2.5K

Table 3: Number of role probe labels by formalism.

  language   en    de
  PropBank   5     10
  VerbNet    23    29
  FrameNet   189   300

4.2 Probing tasks

Our probing kit spans a wide range of probing tasks, ranging from primitive surface-level tasks mostly utilized as anchors to high-level semantic tasks that aim to provide a representational upper bound to predicate semantics. We follow the training, test and development splits from the original SR3de, CoNLL-2009 and SPR data. The XNLI task is sourced from the development set and only used for scalar mix analysis. To reduce the number of labels in some of the probing tasks, we collect frequency statistics over the corresponding training sets and only consider up to 250 most frequent labels. Below we define the tasks in order of their complexity; Table 2 provides the probing task statistics, Table 3 compares the categorical role labeling formalisms in terms of granularity, and Table 4 provides examples. We evaluate classification performance using Accuracy, while regression tasks are scored via Mean Squared Error.

Token type (ttype) predicts the type of a word. This requires contextual processing since a word might consist of several wordpieces.

Token position (token.ix) predicts the linear position of a word, cast as a regression task over the first 20 words in the sentence. Again, the task is non-trivial since it requires the words to be assembled from the wordpieces.

Part-of-speech (pos) predicts the language-specific part-of-speech tag for the given token.

Lexical unit (lex.unit) predicts the lemma and POS of the given word – a common input representation for the entries in lexical resources. We extract coarse POS tags by using the first character of the language-specific POS tag.

Dependency relation (deprel) predicts the dependency relation between the parent src and dependent tgt tokens.

Semantic role (role.[frm]) predicts the semantic role given a predicate src and an argument tgt token in one of the three role labeling formalisms: PropBank pb, VerbNet vn and FrameNet fn. Note that we only probe for the role label, and the model has no access to the verb sense information from the data.

Semantic proto-role (spr.[prop]) is a set of eleven regression tasks predicting the values of the proto-role properties as defined in Reisinger et al. (2015), given a predicate src and an argument tgt.

XNLI is a sentence-level NLI task directly sourced from the corresponding dataset. Given two sentences, the goal is to determine whether an entailment or a contradiction relationship holds between them. We use NLI to investigate the layer utilization of mBERT for high-level semantic tasks. We extract the sentence pair representation via the [CLS] token and treat it as a unary probing task.

Table 4: Word-level probing task examples for English. vltn corresponds to the volition SPR property.

  task      input                      label
  token.ix  I [saw] a cat.          →  2
  ttype     I [saw] a cat.          →  saw
  lex.unit  I [saw] a cat.          →  see.V
  pos       I [saw] a cat.          →  VBD
  deprel    [I]tgt [saw]src a cat.  →  SBJ
  role.vn   [I]tgt [saw]src a cat.  →  Experiencer
  spr.vltn  [I]tgt [saw]src a cat.  →  2

5 Results

Our models are implemented using AllenNLP.² We train the probes for 20 epochs using the Adam optimizer with default parameters and a batch size of 32. Due to the frozen encoder and flat model architecture, the total runtime of the experiments is under 8 hours on a single Tesla V100 GPU.

² We make the code publicly available.

5.1 General Trends

While absolute performance is secondary to our analysis, we report the probing task scores on the respective development sets in Table 5. We observe that grammatical tasks score high, while core role labeling lags behind – in line with the findings of Tenney et al. (2019a).³ We observe lower scores for German role labeling, which we attribute to the lack of training data. Surprisingly, as we show below, this does not prevent the edge probe from learning to locate relevant role-semantic information in mBERT's layers.

³ Our results are not directly comparable due to the differences in datasets and formalisms.

Table 5: Best dev score for word-level tasks over 20 epochs (Acc for classification, MSE for regression).

  task        en     de        task      en     de
  *token.ix   1.71   2.48      deprel    0.95   0.95
  ttype       1.00   1.00      role.fn   0.92   0.59
  lex.unit    1.00   1.00      role.pb   0.96   0.71
  pos         0.97   0.97      role.vn   0.94   0.73

Our results mirror the findings of Tenney et al. (2019a) about the sequential processing order in BERT.
[Figure 2: Layer probing results. Center-of-gravity per task (en / de): *token.ix 3.13 / 4.61, ttype 4.8 / 5.2, lex.unit 4.99 / 5.09, pos 5.29 / 5.75, deprel src 5.36 / 6.01, deprel tgt 5.57 / 5.99, role.pb src 5.47 / 5.18, role.vn src 4.28 / 5.24, role.fn src 4.36 / 5.13, role.pb tgt 6.11 / 6.12, role.vn tgt 6.05 / 6.06, role.fn tgt 6.16 / 5.75, xnli 6.28 / 6.15.]

[Figure 3: Anchor task analysis of SRL formalisms.]

We observe that the layer utilization among tasks (Fig. 2) aligns for English and German⁴, although we note that in terms of center-of-gravity mBERT tends to utilize deeper layers for German probes. Basic word-level tasks are indeed processed early by the model, and XNLI probes focus on deeper levels, suggesting that the representation of higher-level semantic phenomena follows the encoding of syntax and predicate semantics.

⁴ Echoing the recent findings on mBERT's multilingual capacity (Pires et al., 2019; Kondratyuk and Straka, 2019).

5.2 The Effect of Formalism

Using separate scalar mixes for source and target tokens allows us to explore the cross-formalism encoding of role semantics by mBERT in detail. The role labeling probe's layer utilization drastically differs for predicate and argument tokens. While the argument representation role*tgt mostly focuses on the same layers as the dependency probe, the layer utilization of the predicates role*src is affected by the chosen formalism. PropBank predicate token mixing weights emphasize the same layers as dependency parsing – in line with the previously published results. However, the probes for VerbNet and FrameNet predicates (role.vn src and role.fn src) utilize the layers associated with ttype and lex.unit that contain lexical information. Coupled with the fact that both VerbNet and FrameNet assign semantic roles based on lexical-semantic predicate groupings (frames in FrameNet and verb classes in VerbNet), this suggests that the lower layers of mBERT implicitly encode predicate sense information; moreover, sense encoding for VerbNet utilizes deeper layers of the model associated with syntax, in line with VerbNet's predicate classification strategy. This finding confirms that the formalism can indeed have linguistically meaningful effects on probing results.

5.3 Anchor Tasks in the Pipeline

We now use the scalar mixes of the role labeling probes as target tasks, and lower-level probes as anchor tasks, to qualitatively explore the differences between how our role probes learn to represent predicates and semantic arguments⁵ (Fig. 3). The results reveal a distinctive pattern: while the predicate layer utilization src is similar to the scalar mixes learned for ttype and lex.unit, the learned argument representations tgt attend to the layers associated with the dependency relation and POS probes, and the pattern reproduces for English and German. This aligns with the traditional separation of the task into predicate disambiguation followed by semantic argument identification and labeling, along with the feature sets employed for these tasks (Björkelund et al., 2009). Note that the observation about the pipeline-like task processing within the BERT encoders thereby holds, albeit on a sub-task level.

⁵ Darker color corresponds to higher similarity.

5.4 Formalism Implementations

Both layer and anchor task analysis reveal a prominent discrepancy between English and German role probing results: while the PropBank predicate layer utilization for English mostly relies on syntactic information, German PropBank predicates behave similarly to VerbNet and FrameNet. The difference in the number of role labels for English and German PropBank (Table 3) points at possible qualitative differences in the labeling schemes. The data for English stems from the token-level alignment in SemLink that maps the original PropBank roles to VerbNet and FrameNet. Role annotations for German have a different lineage: they originate from the FrameNet-annotated SALSA corpus (Burchardt et al., 2006), semi-automatically converted to PropBank style for the CoNLL-2009 shared task (Hajič et al., 2009), and enriched with VerbNet labels in SR3de (Mújdricza-Maydt et al., 2016). As a result, while English PropBank labels are assigned in a predicate-independent manner, German PropBank, following the same numbered labeling scheme, keeps this scheme consistent within the frame. We assume that this incentivizes the probe to learn semantic verb groupings, which is reflected in our probing results. The ability of the probe to detect subtle differences between formalism implementations constitutes a new use case for probing, and a promising direction for future studies.

5.5 Encoding of Proto-Roles

We now turn to the probing results for the decompositional semantic proto-role labeling tasks. Unlike Tenney et al. (2019b), who used a multi-label classification probe, we treat SPR properties as separate regression tasks. The results in Table 6 show that the performance varies by property, with some of the properties attaining reasonably low MSE scores despite the simplicity of the probe architecture and the small dataset size. We do not observe a clear performance trend depending on whether the property is associated with Proto-Agent or Proto-Patient.

Table 6: Best dev MSE for proto-role probing tasks over 20 epochs. A - Proto-Agent, P - Proto-Patient.

  property                  MSE
  (A) *instigation          1.12
  (A) *volition             0.87
  (A) *awareness            0.79
  (A) *sentient             0.47
  (A) *change.of.location   0.66
  (A) *exists.as.physical   1.29
  (P) *created              0.87
  (P) *destroyed            0.61
  (P) *changes.possession   0.83
  (P) *change.of.state      1.24
  (P) *stationary           0.77

[Figure 4: Layer utilization for SPR properties.]

Our fine-grained, property-level task design allows for more detailed insights into the layer utilization by the SPR probes (Fig. 4). The results indicate that while the layer utilization on the predicate side (src) shows no clear preference for particular layers (similar to the results obtained by Tenney et al. (2019a)), some of the proto-role features follow the pattern seen in the categorical role labeling and dependency parsing tasks for the argument tokens tgt. With few exceptions, we observe that the properties displaying this behavior are Proto-Agent properties; moreover, a close examination of the results on syntactic preference by Reisinger et al. (2015, p. 483) reveals that these properties are also the ones with a strong preference for the subject position, including the outlier case of stationary, which in their data behaves like a Proto-Agent property. The correspondence is not strict, and we leave an in-depth investigation of the reasons behind these discrepancies for follow-up work.

6 Conclusion

We have demonstrated that the choice of linguistic formalism can have substantial, linguistically meaningful effects on role-semantic probing results. We have shown how probing classifiers can be used to detect discrepancies between formalism implementations, and presented evidence of semantic proto-role encoding in the pre-trained mBERT model. Our refined implementation of the edge probing framework, coupled with the anchor task methodology, enabled new insights into the processing of predicate-semantic information within mBERT. Our findings show that linguistic formalism is an important factor to be accounted for in probing studies. While our work illustrates this point using a single task and a single probing framework, the influence of linguistic formalism per se is likely to be present for any probing setup that builds upon linguistic material. An investigation of how, whether, and why formalisms affect probing results for tasks beyond role labeling and for frameworks beyond edge probing constitutes an exciting avenue for future research.

References

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1, pages 86-90, Montreal, Quebec, Canada. Association for Computational Linguistics.

Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017. What do neural machine translation models learn about morphology? In ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers), 1:861-872.

Anders Björkelund, Love Hafdell, and Pierre Nugues. 2009. Multilingual semantic role labeling. In Proceedings of the Thirteenth Conference on Computational

David Dowty. 1991. Thematic proto-roles and argument selection. Language, 76(3):474-496.

Charles J. Fillmore. 1968. The Case for Case. In Emmon Bach and Robert T. Harms, editors, Universals in Linguistic Theory, pages 1-88. Holt, Rinehart and Winston, New York.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 1-6, Melbourne, Australia. Association for Computational Linguistics.

Ana-Maria Giuglea and Alessandro Moschitti. 2006. Semantic role labeling via FrameNet, VerbNet and PropBank.
Claire Bonial, Kevin Stowe, and Martha Palmer. 2013. Renewing and revising SemLink. In Proceedings of the 2nd Workshop on Linked Data in Linguistics (LDL-2013): Representing and linking lexicons, terminologies and other language data, pages 9–17. Association for Computational Linguistics.

Aljoscha Burchardt, Katrin Erk, Anette Frank, Andrea Kowalski, Sebastian Padó, and Manfred Pinkal. 2006. The SALSA corpus: a German corpus resource for lexical semantics. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06), Genoa, Italy. European Language Resources Association (ELRA).

Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018a. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers), 1:2126–2136.

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018b. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485, Brussels, Belgium. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Jan Hajič, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štěpánek, Pavel Straňák, Mihai Surdeanu, Nianwen Xue, and Yi Zhang. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task, pages 1–18, Boulder, Colorado. Association for Computational Linguistics.

Silvana Hartmann, Ilia Kuznetsov, Teresa Martin, and Iryna Gurevych. 2017. Out-of-domain FrameNet semantic role labeling. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 471–482, Valencia, Spain. Association for Computational Linguistics.

John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2733–2743, Hong Kong, China. Association for Computational Linguistics.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota. Association for Computational Linguistics.

Karen Kipper-Schuler. 2005. VerbNet: A broad-coverage, comprehensive verb lexicon. Ph.D. thesis, University of Pennsylvania.

Dan Kondratyuk and Milan Straka. 2019. 75 languages, 1 model: Parsing Universal Dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2779–2795, Hong Kong, China. Association for Computational Linguistics.

Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4365–4374, Hong Kong, China. Association for Computational Linguistics.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv:1909.11942.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy. Association for Computational Linguistics.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv:1910.10683.

Beth Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press.

Beth Levin and Malka Rappaport Hovav. 2005. Argument Realization. Research Surveys in Linguistics. Cambridge University Press.

Emily Reif, Ann Yuan, Martin Wattenberg, Fernanda B. Viegas, Andy Coenen, Adam Pearce, and Been Kim. 2019. Visualizing and measuring the geometry of BERT. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8594–8603. Curran Associates, Inc.

Yongjie Lin, Yi Chern Tan, and Robert Frank. 2019. Open sesame: Getting inside BERT's linguistic knowledge. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 241–253, Florence, Italy. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke S. Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692.

Marie-Catherine de Marneffe and Christopher D. Manning. 2008. The Stanford typed dependencies representation. In Coling 2008: Proceedings of the workshop on Cross-Framework and Cross-Domain Parser Evaluation, pages 1–8, Manchester, UK. Coling 2008 Organizing Committee.

Éva Mújdricza-Maydt, Silvana Hartmann, Iryna Gurevych, and Anette Frank. 2016. Combining semantic annotation of word sense & semantic roles: A novel annotation scheme for VerbNet roles on German language data. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association (ELRA).

Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.

Drew Reisinger, Rachel Rudinger, Francis Ferraro, Craig Harman, Kyle Rawlins, and Benjamin Van Durme. 2015. Semantic proto-roles. Transactions of the Association for Computational Linguistics, 3:475–488.

Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. arXiv:2002.12327.

Sebastian Schuster and Christopher D. Manning. 2016. Enhanced English universal dependencies: An improved representation for natural language understanding tasks. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 2371–2378, Portorož, Slovenia. European Language Resources Association (ELRA).

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019a. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, and Ellie Pavlick. 2019b. What do you learn from context? Probing for sentence structure in contextualized word representations. In 7th International Conference on Learning Representations, ICLR 2019, pages 1–17.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.

Aaron Steven White, Drew Reisinger, Keisuke Sakaguchi, Tim Vieira, Sheng Zhang, Rachel Rudinger, Kyle Rawlins, and Benjamin Van Durme. 2016. Universal decompositional semantics on universal dependencies. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1713–1723, Austin, Texas. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv:1910.03771.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv:1609.08144.
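For readers unfamiliar with the probing machinery discussed in Sections 5.4 and 5.5, the following sketch illustrates the two ingredients in miniature: a softmax-normalized scalar mix over frozen encoder layers (the source of the "layer utilization" profiles in Fig. 4), feeding a linear regression probe trained with MSE on a single SPR property. This is not the paper's edge probing implementation: all data is synthetic (`layers` stands in for mBERT activations, `y` for ratings of one hypothetical proto-role property, planted in layer 8 so that utilization concentrates there), and plain gradient descent replaces the actual training setup.

```python
import numpy as np

rng = np.random.default_rng(0)
L, N, d = 13, 200, 32                       # layers, tokens, hidden size
layers = rng.normal(size=(L, N, d))         # synthetic frozen activations

# Synthetic property ratings that depend on layer 8 only, so a
# successful probe should concentrate its mixing weights there.
true_w = rng.normal(size=d) / np.sqrt(d)
y = layers[8] @ true_w + 0.1 * rng.normal(size=N)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

s = np.zeros(L)     # mixing logits, one scalar per layer
w = np.zeros(d)     # linear probe weights
b = 0.0             # probe bias
lr = 0.05
history = []        # MSE per training step

for _ in range(500):
    a = softmax(s)                           # layer mixing weights
    h = np.einsum("l,lnd->nd", a, layers)    # mixed token representations
    err = h @ w + b - y
    history.append(np.mean(err ** 2))
    # Plain gradient descent on the MSE objective.
    gw = 2 * h.T @ err / N
    gb = 2 * err.mean()
    gh = 2 * np.outer(err, w) / N            # dMSE/dh
    ga = np.einsum("nd,lnd->l", gh, layers)  # dMSE/da
    gs = a * (ga - a @ ga)                   # softmax backward
    w -= lr * gw
    b -= lr * gb
    s -= lr * gs

# The trained mixing weights softmax(s) form the per-layer
# "utilization" profile of the kind inspected in Fig. 4.
```

The learned scalar mix mirrors the layer-weighting scheme used in edge probing (Tenney et al., 2019b): because the encoder stays frozen and only the mixing logits and probe are trained, the weights reveal which layers carry the information the probe needs.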