Deep Subjecthood: Higher-Order Grammatical Features in Multilingual BERT

Isabel Papadimitriou (Stanford University, [email protected])
Ethan A. Chi (Stanford University, [email protected])
Richard Futrell (University of California, Irvine, [email protected])
Kyle Mahowald (University of California, Santa Barbara, [email protected])

Abstract

We investigate how Multilingual BERT (mBERT) encodes grammar by examining how the high-order grammatical feature of morphosyntactic alignment (how different languages define what counts as a "subject") is manifested across the embedding spaces of different languages. To understand if and how morphosyntactic alignment affects contextual embedding spaces, we train classifiers to recover the subjecthood of mBERT embeddings in transitive sentences (which do not contain overt information about morphosyntactic alignment) and then evaluate them zero-shot on intransitive sentences (where subjecthood classification depends on alignment), within and across languages. We find that the resulting classifier distributions reflect the morphosyntactic alignment of their training languages. Our results demonstrate that mBERT representations are influenced by high-level grammatical features that are not manifested in any single input sentence, and that this is robust across languages. Further examining the characteristics that our classifiers rely on, we find that features such as passive voice, animacy, and case strongly correlate with classification decisions, suggesting that mBERT does not encode subjecthood purely syntactically, but that its encoding of subjecthood is continuous and dependent on semantic and discourse factors, as is proposed in much of the functional literature. Together, these results provide insight into how grammatical features manifest in contextual embedding spaces, at a level of abstraction not covered by previous work.

Figure 1: Top: Illustration of the difference between alignment systems. A (for agent) is the notation used for the transitive subject, and O for the transitive object: "The lawyer chased the dog." S denotes the intransitive subject: "The lawyer laughed." The blue circle indicates which roles are marked as "subject" in each system. Bottom: Illustration of the training and test process. We train a classifier to distinguish A from O arguments using the BERT contextual embeddings, and test the classifier's behavior on intransitive subjects (S). The resulting distribution reveals to what extent morphosyntactic alignment (above) affects model behavior.

1 Introduction

Our goal is to understand whether, and how, large pretrained models encode abstract features of the grammars of languages.¹ To do so, we analyze the notion of subjecthood in Multilingual BERT (mBERT) across diverse languages with different morphosyntactic alignments. Alignment (how each language defines what classifies as a "subject") is a feature of the grammar of a language, rather than of any single word or sentence, letting us analyze mBERT's representation of language-specific high-order grammatical properties.

¹ We release the code to reproduce our experiments here: https://github.com/toizzy/deep-subjecthood


Recent work has demonstrated that transformer models of language, such as BERT (Devlin et al., 2019), encode sentences in structurally meaningful ways (Manning et al., 2020; Rogers et al., 2020; Kovaleva et al., 2019; Linzen et al., 2016; Gulordava et al., 2018; Futrell et al., 2019; Wilcox et al., 2018). In Multilingual BERT, previous work has demonstrated surprising levels of multilingual and cross-lingual understanding (Pires et al., 2019; Wu and Dredze, 2019; Libovický et al., 2019; Chi et al., 2020), with some notable limitations (Mueller et al., 2020). However, these studies still leave an open question: are higher-order abstract grammatical features, such as morphosyntactic alignment, which are not realized in any one sentence, accessible to deep neural models? And how are these allegedly discrete features represented in a continuous embedding space? Our goal is to answer these questions by examining grammatical subjecthood across typologically diverse languages. In doing so, we complicate the traditional notion of the grammatical subject as a discrete category and provide evidence for a richer, probabilistic characterization of subjecthood.

For 24 languages, we train small classifiers to distinguish the mBERT embeddings of nouns that are subjects of transitive sentences from nouns that are objects. We then test these classifiers on out-of-domain examples within and across languages. We go beyond standard probing methods (which rely on classifier accuracy to make claims about embedding spaces) by (a) testing the classifiers out-of-domain to gain insights about the shape and characteristics of the subjecthood classification boundary and (b) testing for awareness of morphosyntactic alignment, which is a feature of the grammar rather than of the classifier inputs.

Our main experiments are as follows. In Experiment 1, we test our subjecthood classifiers on out-of-domain intransitive subjects (subjects of verbs which do not have objects, like "The man slept") in their training language. Whereas in English and many other languages we think of intransitive subjects as grammatical subjects, some languages have a different morphosyntactic alignment system and treat intransitive subjects more like objects (Dixon, 1979; Du Bois, 1987). We find evidence that a language's alignment is represented in mBERT's embeddings. In Experiment 2, we perform successful zero-shot cross-linguistic transfer of our subjecthood classifiers, finding that higher-order features of the grammar of each language are represented in a way that is parallel across languages. In Experiment 3, we characterize the basis for these classifier decisions by studying how they vary as a function of linguistic features like animacy, case, and the passive construction.

Taken together, the results of these experiments suggest that mBERT represents subjecthood and objecthood robustly and probabilistically. Its representation is general enough that it can transfer across languages, but also language-specific enough that it learns language-specific abstract grammatical features.

2 Background: Morphosyntactic alignment

In transitive sentences, languages need a way of distinguishing which noun is the transitive subject (called A, for agent) and which noun is the transitive object (O). In English, this distinction is marked by word order: "The dog [A] chased the lawyer [O]" means something different than "The lawyer [A] chased the dog [O]". In other languages, this distinction is marked by a morphological feature: case. Case markings, usually affixes, are attached to nouns to indicate their role in the sentence, and as such in these languages word order is often much freer than in English.

Apart from A and O, there is also a third grammatical role: intransitive subjects (S). In sentences like "The lawyer laughed", there is no ambiguity as to who is doing the action. As such, cased languages usually do not reserve a third case to mark S nouns, and use either the A case or the O case. Languages that mark S nouns in the same way as A nouns are said to follow a Nominative–Accusative system, where the nominative case is for A and S, and the accusative case is for O.² Languages that mark S nouns like O nouns follow an Ergative–Absolutive system, where the ergative case is used to mark A nouns, and the absolutive case marks S and O nouns. For example, the Basque language follows this system. A visualization of the two case systems is shown in Figure 1.

² English pronouns follow a Nominative–Accusative system. For example, the pronoun "she" is nominative and is used both for A and S (as in "she laughed"). The pronoun "her" is accusative and is used only for O.
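To make the two alignment systems concrete, the toy sketch below (our illustration, not part of the paper's released code) maps each core grammatical role to the case it receives under each system:

```python
# Toy illustration of the two alignment systems described above.
ALIGNMENT = {
    # A and S share the nominative; O takes the accusative.
    "nominative-accusative": {"A": "NOM", "S": "NOM", "O": "ACC"},
    # A takes the ergative; S and O share the absolutive (e.g., Basque).
    "ergative-absolutive": {"A": "ERG", "S": "ABS", "O": "ABS"},
}

# In a nominative-accusative language, S is marked like A ...
assert ALIGNMENT["nominative-accusative"]["S"] == ALIGNMENT["nominative-accusative"]["A"]
# ... while in an ergative-absolutive language, S is marked like O.
assert ALIGNMENT["ergative-absolutive"]["S"] == ALIGNMENT["ergative-absolutive"]["O"]
```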

The question of whether a language follows a nominative-accusative or an ergative-absolutive system is called morphosyntactic alignment. Morphosyntactic alignment is a high-order grammatical feature of a language, which is not usually inferable from looking at just one sentence, but from the system with which different sentences are encoded. As such, examining the way that individual contextual embeddings express morphosyntactic alignment gets at the question of how mBERT encodes abstract features of grammar. This is a question that is not answered by work that looks at the contextual encoding of features that are realized in sentences, like case or sentence structure.

Figure 2: Results of Experiment 1: the behavior of subjecthood classifiers across mBERT layers (x-axis). For each layer, the proportion of the time that the classifier predicts arguments to be A, separated by grammatical role. In higher layers, A and O are reliably classified correctly, and S is mostly classified as A. When the source language is Basque (ergative) or Hindi or Urdu (split-ergative), S is less likely to pattern with A. The figure is ordered by how close the S line is to A, and ergative and split-ergative languages are highlighted with a gray box.

3 Methods

Our primary method involves training classifiers to predict subjecthood from mBERT contextual embeddings, and examining the decisions of these classifiers within and across languages. We train a classifier to distinguish A from O in the mBERT embeddings of one language, and we examine its performance on S embeddings in its training language, and on A, S, and O mBERT embeddings in other languages.

Data: To train a subjecthood classifier for one language, we use a balanced dataset of 1,012 transitive subject (A) mBERT embeddings and 1,012 transitive object (O) mBERT embeddings. We test our classifiers on test datasets of A, S, and O embeddings. Our data points are extracted from the Universal Dependencies treebanks (Nivre et al., 2016): we use the dependency parse information to determine whether each noun is an A or an O, and if it is either, we pass the whole sentence through mBERT and take the contextual embedding corresponding to the noun. We run experiments on 24 languages; specifically, all the languages that are both in the mBERT training set³ and have Universal Dependencies treebanks with at least 1,012 A occurrences and 1,012 O occurrences.⁴

³ https://github.com/google-research/bert/blob/master/multilingual.md
⁴ Our datasets for all languages are the same size. We have set them all to be the size of the largest balanced A-O dataset we can extract from the Basque UD corpus, since Basque is one of the only represented ergative languages and we wanted it to meet our cutoff.
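The extraction step can be sketched as follows, assuming the HuggingFace transformers library and the bert-base-multilingual-cased checkpoint; the paper's released pipeline may differ in details such as how wordpieces are pooled into a single noun embedding:

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased",
                                  output_hidden_states=True)
model.eval()

def noun_embedding(words, noun_index, layer):
    """Contextual embedding of words[noun_index] at one of mBERT's 13 layers
    (0 = the embedding layer), mean-pooling over the noun's wordpieces."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states[layer]  # (1, seq_len, 768)
    pieces = [i for i, w in enumerate(enc.word_ids()) if w == noun_index]
    return hidden[0, pieces].mean(dim=0)  # (768,)

# e.g. the embedding of the A noun "lawyer" at layer 8:
vec = noun_embedding(["The", "lawyer", "chased", "the", "dog"], 1, layer=8)
```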

Labeling: Since UD treebanks are not labeled for sentence role (A, S and O), we extract these labels using the dependency graph annotations. We only include nouns and proper nouns, leaving pronouns for future work.⁵ We label a noun token as:

• O if it has a verb as a head and its dependency arc is either dobj or iobj.
• A if it has a verb as a head and its dependency arc is nsubj and it has a sibling O.
• S if it has a verb as a head and its dependency arc is nsubj and it has no sibling O.

⁵ For an example of how pronouns complicate how subjecthood is defined, see Fox (1987).
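A sketch of these labeling rules over a single parsed sentence, assuming tokens are dicts with id, head, deprel, and upos fields (as produced by, e.g., the conllu package); note that UD v2 treebanks use obj where v1 used dobj:

```python
OBJ_ARCS = {"dobj", "obj", "iobj"}  # UD v1 and v2 object relations

def role(token, sentence):
    """Return 'A', 'O', 'S', or None for one token under the rules above."""
    if token["upos"] not in ("NOUN", "PROPN"):
        return None  # pronouns etc. are left for future work
    head = next((t for t in sentence if t["id"] == token["head"]), None)
    if head is None or head["upos"] != "VERB":
        return None
    if token["deprel"] in OBJ_ARCS:
        return "O"
    if token["deprel"] == "nsubj":  # nsubj:pass (passives) is excluded here
        sibling_o = any(t["head"] == token["head"] and t["deprel"] in OBJ_ARCS
                        for t in sentence if t is not token)
        return "A" if sibling_o else "S"
    return None
```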

Finally, we exclude the subjects of passive constructions (where the object of an action is made the grammatical subject) to analyze separately, as including these examples would confound grammatical subjecthood with semantic agency. We also exclude the siblings of expletives (e.g., "There are many goats"), as these are grammatical objects which appear without subjects as the only argument of the verb, and we also exclude the children of auxiliaries ("The goat can swim"), looking only at the arguments of verbs.

Because we use word embeddings and are limited by the Universal Dependencies annotation scheme, there are some cross-linguistic differences in how arguments are handled. For instance, our system is not able to handle null subjects or null objects, even though those are prominent parts of many languages.

Figure 3: Accuracy of A-O classifiers for every language, by mBERT layer. For all languages, accuracy is highest in layers 7-10.

Figure 4: Distribution of layer 10 classifier probabilities for S nouns in the test set. When trained on non-ergative languages, the classifiers mostly predict S nouns to be A. When trained on ergative and split-ergative languages, the classifier predictions for S are much more spread out (towards being classified as O), suggesting that the ergative nature of the languages is expressed in the contextual embeddings of the A and O nouns, influencing the classifier.

Classifiers: For each language, and for each mBERT layer ℓ, we train a classifier to classify mBERT contextual embeddings drawn from layer ℓ as A or O. The classifiers are all two-layer perceptrons with one hidden layer of size 64. We train each classifier for 20 epochs on a dataset of the layer-ℓ contextual embeddings of 1,012 A nouns and 1,012 O nouns. In total, we train 24 languages × 13 mBERT layers = 312 classifiers.
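The classifier itself is small; a minimal PyTorch sketch matching the stated architecture (a two-layer perceptron with hidden size 64, trained for 20 epochs) follows, where the optimizer and learning rate are our assumptions rather than details given in this section:

```python
import torch
import torch.nn as nn

class RoleClassifier(nn.Module):
    """Two-layer perceptron over a 768-d mBERT embedding: logits for O vs. A."""
    def __init__(self, dim=768, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2))

    def forward(self, x):
        return self.net(x)

def train_classifier(embs, labels, epochs=20, lr=1e-3):
    """embs: (2024, 768) tensor of A and O embeddings; labels: 0 = O, 1 = A."""
    clf = RoleClassifier()
    opt = torch.optim.Adam(clf.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(clf(embs), labels)
        loss.backward()
        opt.step()
    return clf
```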

4 Experiment 1: Subjecthood in mBERT

In our first experiment, we train a classifier to predict the grammatical role of a noun in context from its mBERT contextual embedding, and examine its behavior on intransitive subjects (S), which are out-of-domain.

This experimental setup lets us ask two questions about subjecthood encoding in mBERT. Firstly, do contextual word embeddings reliably encode subjecthood information? Secondly, how do our classifiers act when given S arguments (intransitive subjects), which crucially do not appear in the training data? If S arguments are mostly classified as A, that would suggest mBERT is learning a nominative-accusative system, where A and S pattern together. If S patterns with O, that would suggest it has an ergative-absolutive system. If S patterns differently in different languages, that would suggest that it learns a language-specific morphosyntactic system and expresses it in the encoding of nouns in transitive sentences (which are unaffected by alignment), so that the A-O classifiers can pick it up.

4.1 Results

Our results show that the classifiers can reliably perform A-O classification of contextual embeddings with relatively high accuracy, especially in the higher layers of mBERT. As shown in Figure 3, performance peaks at around mBERT layers 7-10, where for the majority of languages classifier accuracy surpasses 90%. This is consistent with previous work showing that syntactic information is most well represented in BERT's later middle layers (Rogers et al., 2020; Hewitt and Manning, 2019; Jawahar et al., 2019; Liu et al., 2019). For the rest of this paper, we will focus mainly on the behavior of the classifiers in the high-performance higher layers to assess the properties in these highly contextual spaces that define subjecthood within and across languages.

Performance across layers on the test sets of all 24 languages is shown in Figure 2. When we break the classifiers' behavior down across roles, we see that S nouns mostly pattern with A, though they are consistently less likely to be classed as A than transitive A nouns.

The separation between the A and the S lines is not constant across languages: it is largest for Basque, which is an ergative language, and Hindi and Urdu, which have a split-ergativity system (De Hoop and Narasimhan, 2005). This difference is highlighted in Figure 4, where we show the classifiers' probabilities of classifying S nouns as A across the test sets of Basque, Hindi and Urdu versus the test sets for all other 21 languages. In Figure 5, we plot the log odds ratio of classifying S as A versus classifying S as O, and show that for ergative languages this ratio is significantly lower. The fact that classifiers trained on ergative and split-ergative languages are more likely to classify S nouns as O indicates that the ergativity of the language is encoded in the A and O embeddings that the classifiers are trained on.

Figure 5: For layer 10, the log odds ratio of S:A relative to O:A, by source language. This is a measure of how close S is to A, relative to O. The ergative languages skew lower than the others, although some other languages (like Finnish and Estonian) also skew low.

Note, however, that the A-O classifiers for the ergative languages do not deduce a fully ergative system for classifying S nouns, but a greater skew towards classifying S as O than in nominative languages. This suggests that, even though properties of ergativity are encoded in mBERT space, the prominence of nominative training languages has influenced the contextual space to be biased towards encoding a nominative subjecthood system. The difficulty of training the classifier in Basque seems consistent with Ravfogel et al. (2019)'s finding that learning agreement is harder in Basque than in English.
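One plausible formalization of the Figure 5 quantity (the exact normalization is not spelled out in the prose, so treat this as our reading) is the log odds that S nouns are labeled A, relative to the log odds that O nouns are:

```python
import math

def log_odds(p, eps=1e-6):
    p = min(max(p, eps), 1 - eps)  # clamp away from 0 and 1
    return math.log(p / (1 - p))

def s_skew(p_s_as_a, p_o_as_a):
    """How close S sits to A, relative to O, in log-odds space."""
    return log_odds(p_s_as_a) - log_odds(p_o_as_a)

print(s_skew(0.90, 0.05))  # nominative-style: S patterns strongly with A
print(s_skew(0.60, 0.05))  # ergative-style skew: S drifts towards O
```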

2526 ● ● 1.00 ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ●● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.75 ● ● ● ● ● ● ● ● 2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.50 density 1 Accuracy 0.25 0 0.00 0.25 0.50 0.75 1.00 0.00 Proportion of S nouns labelled A for each destination language Urdu Farsi Hindi Polish Czech Slovak French Finnish Latvian English Basque Hebrew Serbian Slovene Russian Spanish German Chinese Croatian Estonian

Japanese Non ergative languages Norwegian Indonesian Source Language Basque, Hindi and Urdu

Japanese Chinese Spanish Latin Figure 7: Classifiers trained on ergative languages are Latvian English Farsi more likely to label S nouns in other languages as O. Basque Estonian accuracy For BERT layer 8, the proportion of S nouns in each Finnish Czech 0.9 Slovak destination language test set labeled as A for the classi- Russian 0.8 Hindi 0.7 Urdu fiers trained on (1) ergative and split-ergative languages Slovene 0.6 Destination Language Polish (blue) or (2) the rest of the languages. German Hebrew Norwegian Indonesian Croatian French Serbian

Urdu Hindi Farsi Latin Polish Czech French Slovak Finnish Basque EnglishLatvian Serbian Croatian HebrewGerman Slovene Russian Estonian SpanishChinese The average accuracy across all source-destination Japanese IndonesianNorwegian Source Language pairs for a high-performing mBERT layer (layer Figure 6: Experiment 2 Results: Cross-lingual transfer 10) is 82.61%, and there are several pairs for accuracies (accuracies shown are for BERT layer 10). which zero-shot transfer of the sentence role clas- Top: For each classifier trained to distinguish A and O sifier yields accuracies above 90%. The consis- nouns in a source language (labeled on the x-axis), we tent success of zero-shot transfer across different plot the accuracy that classifier achieves when tested source and destination pairs indicates that mBERT zero-shot on all other languages. Zero-shot transfer has parallel, interlingual ways of encoding gram- is surprisingly successful across languages, indicating that subjecthood is encoded cross-lingually in mBERT. matical and semantic relations like subjecthood. Each black point represents the accuracy of a classifier We would expect there to be some extent of joint tested on a particular destination language, and the red learning in mBERT: different languages wouldn’t points represent the within-language accuracy. exist totally independently in the contextual em- Bottom: Analytical performance of classifiers for ev- bedding space, both due to mBERT’s multilingual ery language pair. The x-axis sorted by average transfer training texts and to successful regularization. It accuracy, so that the source whose classifier performs is nevertheless surprising that zero-shot transfer of the best on average is on the left. Despite the general subjecthood classification between languages is so English bias that mBERT often exhibits, in our experi- ments English is neither a standout source nor destina- successful out of the box, and that for all clas- tion. sifiers, within-language accuracy (the red dots in Figure 6) is not an outlier compared to transfer ac- curacies. Our results show not just that there is serving the test behavior of classifiers on S nouns mutual entanglement between the contextual em- in other languages, we can ask: is morphosyntac- bedding spaces of many languages, but that syn- tic alignment expressed in cross-lingually gener- tactic and semantic information in these spaces is alizable and parallel ways in mBERT contextual organized in largely parallel, transferable ways. embeddings? If a classifier trained to distinguish We can then look at how S is classified: does Basque A from O is more likely to classify English the subjecthood of S, and the degree of ergativ- S nouns as O, this means that information about ity within each language that we saw expressed in morphosyntactic alignment is encoded specifically Experiment 1 generalize across languages? Clas- enough to represent each language’s alignment, sifiers trained on ergative languages are signifi- but in a space that generalizes across languages. cantly more likely to classify S nouns in other lan- guages as O, as illustrated in Figure 7 (the source 5.1 Results language’s case system is a significant predictor Zero-shot transfer of subjecthood classification is of the probability of S being an agent, in a mixed effective across languages, as shown in Figure 6. effect regression with a random intercept for lan-
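The transfer evaluation reduces to running a source-language classifier on destination-language embeddings. A sketch reusing the classifier sketched above (how the destination embeddings and 0/1 labels are loaded is left abstract):

```python
import torch

def transfer_accuracy(clf, dest_embs, dest_labels):
    """A-O accuracy of a source-trained classifier on a destination language."""
    with torch.no_grad():
        preds = clf(dest_embs).argmax(dim=-1)
    return (preds == dest_labels).float().mean().item()

def prop_s_labeled_a(clf, dest_s_embs):
    """Proportion of destination-language S nouns the classifier calls A."""
    with torch.no_grad():
        preds = clf(dest_s_embs).argmax(dim=-1)
    return (preds == 1).float().mean().item()  # class 1 = A, as above
```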

We can then look at how S is classified: do the subjecthood of S, and the degree of ergativity within each language that we saw expressed in Experiment 1, generalize across languages? Classifiers trained on ergative languages are significantly more likely to classify S nouns in other languages as O, as illustrated in Figure 7 (the source language's case system is a significant predictor of the probability of S being an agent, in a mixed effect regression with a random intercept for language; β = .11, t = 2.63, p < .05). Our results show that the ergative nature of these languages is encoded in the contextual embeddings of transitive nouns (where ergativity is not realized), and that this encoding of ergativity transfers coherently across languages.
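The mixed-effects analysis can be sketched with statsmodels; the toy rows and column names below are illustrative stand-ins for the real per-pair measurements (one row per source-destination pair, with a random intercept for language):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({  # illustrative values only, not the paper's data
    "p_s_agent":       [0.91, 0.88, 0.62, 0.58, 0.86, 0.90, 0.55, 0.60],
    "ergative_source": [0, 0, 1, 1, 0, 0, 1, 1],
    "language":        ["en", "fr", "en", "fr", "de", "es", "de", "es"],
})
model = smf.mixedlm("p_s_agent ~ ergative_source", df, groups=df["language"])
print(model.fit().summary())  # fixed effect plays the role of the reported beta
```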

6 Experiment 3: Syntactic and semantic factors of continuous subjecthood

To explore the nature of mBERT's underlying representation of grammatical role, we ask which arguments are most likely to be classified as subjects or objects. This is of particular interest when the classifier gets it wrong: what kinds of subjects get erroneously classified as objects?

The functional linguistics literature offers insight into these questions. It has been frequently claimed that grammatical subjecthood is actually a multi-factor, probabilistic concept (Keenan and Comrie, 1977; Comrie, 1981; Croft, 2001; Hopper and Thompson, 1980) that cannot always be pinned down as a discrete category. Some subjects are more subject-y than others. Comrie (1988) argues that a subject can be thought of as the intersection of that which is high in agency (subjects do things) and topicality (subjects are the topics of sentences). Thus, in English, a prototypical subject is something like "He kicked the ball," since in such a sentence the pronoun "he" is a clear agent and the topic of the sentence. But in a sentence like "The lake, which Jack Frost visited, froze," the subject is still "lake," and yet it is less subject-y: it is not the clear topic of the sentence and it is not an agent.

A probabilistic notion of grammatical role lends itself naturally to the continuous embedding spaces of computational models. So, in a series of experiments, we explored what factors in mBERT contextual embedding space predict subjecthood. In these experiments, we examine how the decisions and probabilities of the A-O classifiers from Experiment 2 relate to other linguistic features known to contribute to the degree of subjecthood. In particular, we look at whether nouns appear in passive constructions, as well as the animacy and case of nouns. In seeing how passives, animacy, and case interact with our subjecthood classifiers, we can assess if mBERT's representation of subjecthood in continuous space is consistent with functional analyses, and better understand the continuous space in which mBERT encodes syntactic and semantic relations.

We choose these three factors as they are well-studied in the functional literature, as well as readily available to extract from UD corpora. Passive subjects are marked with a separate dependency arc label, the animacy of nouns is annotated directly in some UD treebanks, and in case-marked languages, nouns are annotated with their case. Future work on a more complete examination of the functional nature of contextual embeddings would include other factors not readily available in UD, like the discourse and information structure (topicality) of nouns in context.

6.1 Results

The first area that we look at is passive constructions. In passive constructions such as "The lawyer was chased by a cat", the grammatical subject is not the main actor or agent in the sentence. As such, while a purely syntactic analysis of subjecthood would classify passive subjects (S-passive) as subjects, an understanding of subjecthood as continuous and reliant on agency would be more prone to classify passive subjects as objects. As shown in Figure 8, subjecthood classifiers across languages are ambivalent about how they classify passive subjects, even in layers where they have the acuity to successfully separate A and S from O. This indicates that the classifiers do not learn a purely syntactic separation of A and O: the subjecthood encoding that they learn from mBERT space is largely dependent on semantic information.

Figure 8: Passive subjects are hard to classify. The distribution of average classifier probabilities in layer 10 for all source-destination language pairs, separated by role. While the layer 10 classifier separates A and S from O, passive subjects remain largely ambiguous in their classification. These plots indicate that, in mBERT space, the grammatical subjects of passive constructions are less subject-y.
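The Figure 8 distributions can be assembled by averaging classifier P(A) per role for every source-destination pair and then grouping by role; a pandas sketch with illustrative rows:

```python
import pandas as pd

rows = [  # illustrative per-pair averages of classifier P(A)
    {"src": "en", "dst": "de", "role": "A", "p_a": 0.93},
    {"src": "en", "dst": "de", "role": "S", "p_a": 0.85},
    {"src": "en", "dst": "de", "role": "S-passive", "p_a": 0.52},
    {"src": "en", "dst": "de", "role": "O", "p_a": 0.07},
]
df = pd.DataFrame(rows)
print(df.groupby("role")["p_a"].mean())  # passives hover near 0.5: ambiguous
```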


Figure 9: The influence of animacy on classification (within and across languages). For a high-performing layer (layer 10), the average probability of classifiers in all languages classifying nouns in languages with animacy distinctions as A. For all three roles, animates are more likely to be classified as agents. The labels are two-letter codes for the languages.

Figure 10: Average probability of being an agent, in layer 10, with 95% confidence intervals, for Finnish and Basque broken up by case.

We also find that animacy is a strong predictor of subjecthood. Our results presented in Figure 9 demonstrate that, when we break the results down by role, animacy is a significant factor in determining the probability of being classified as A. Classifiers in all languages, when zero-shot evaluated on a corpus marked for animacy, are more likely to classify animate nouns as A than inanimate nouns. For layer 10, a mixed effect regression predicting each destination language's probability of assigning an argument to being an agent shows that both role and animacy are significant predictors (with a main effect of animacy corresponding to a 16% increase in the probability of being an agent, p < .01). These results indicate that, in learning to separate A from O, the classifiers did not learn a purely syntactic separation of the space (though it is possible to distinguish A and O using only strictly structural syntactic features). Instead, we see that subjecthood information is entangled with semantic notions such as animacy, giving credence to the hypothesis that subjecthood in BERT space is encoded in a way concordant with the multi-factor manner proposed by Croft, Comrie, and others.

Lastly, we find that classifier probabilities also vary with case, even when we control for sentence role. As demonstrated in Figure 10, across grammatical roles, classifiers are significantly more likely to classify nouns as A if they are in more agentive cases (nominative and ergative). In a mixed effect regression predicting layer 10 probability of being an agent based on role and whether the case is agentive (nominative/ergative), there was a 15% increase associated with being nominative/ergative across categories (t = 2.74, p < .05).

7 Discussion

Our experimental results constitute a way to begin understanding how general knowledge of grammar is manifested in contextual embedding spaces, and how discrete categories like subjecthood are reconciled in continuous embedding spaces. While most previous work analyzing large contextual models focuses on extracting their analysis of features or structures present in specific inputs, we focus on morphosyntactic alignment, a feature of grammars that is not explicitly realized in any one sentence.

We find that, when tested out of domain, classifiers trained to predict transitive subjecthood in mBERT contextual space robustly demonstrate decisions which reflect (a) the morphosyntactic alignment of their training language and (b) a continuous encoding of subjecthood influenced by semantic properties.

There has been much recent work pointing out the limitations of the probing methodology for analyzing embedding spaces (Voita and Titov, 2020; Pimentel et al., 2020; Hewitt and Liang, 2019), a methodology that is very similar to ours. The main limitation pointed out in this literature is that the power of classifiers is a confounding variable: we can't know if a classifier's encoding of a feature is due to the feature being encoded in BERT space, or to the classifier figuring out the feature from surface encoding.

In this paper, we address these issues by proposing two ways to use classifiers to analyze embedding spaces that go beyond probing, and avoid the limitations of arguments based only around the accuracy of probes. Firstly, our results rely on testing the classifiers on out-of-domain zero-shot transfer: both to S arguments and to different languages. As such, we focus on linguistically defining the type of classification boundary which our classifiers learn from mBERT space, rather than their accuracy, and in using transfer we avoid many of the limitations of probing, as argued in Papadimitriou and Jurafsky (2020). Secondly, we examine a feature (morphosyntactic alignment) which is not inferable from the classifiers' training data, which consists only of transitive sentences. We are asking if mBERT contextual space is organized in a way that encodes the effects of morphosyntactic alignment for tokens that do not themselves express alignment. Especially in the cross-lingual case, a classifier would not be able to spuriously deduce this from the surface form, whatever its power.

A limitation of our experimental setup is that both our Universal Dependencies training data and the set of mBERT training languages are heavily weighted towards nominative-accusative languages. As such, we see a clear nominative-accusative bias in mBERT, and our results are somewhat noisy as we only have one ergative-absolutive language and two semi-ergative languages.

Future work should examine the effects of balanced joint training between nominative-accusative and ergative-absolutive languages on the contextual embedding of subjecthood. And we hope that future work will continue to ask not just if deep neural models of language represent discrete linguistic features, but how they represent them probabilistically.

Acknowledgments

We thank Dan Jurafsky, Jesse Mu, and Ben Newman for helpful comments. This work was supported by NSF grant #1947307 and a gift from the NVIDIA corporation to R.F., and an NSF Graduate Research Fellowship to I.P.

References

Ethan A. Chi, John Hewitt, and Christopher D. Manning. 2020. Finding universal grammatical relations in multilingual BERT. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5564–5577, Online. Association for Computational Linguistics.

Bernard Comrie. 1981. Language Universals and Linguistic Typology, 1st edition. University of Chicago Press, Chicago.

Bernard Comrie. 1988. Linguistic typology. Annual Review of Anthropology, 17:145–159.

William Croft. 2001. Radical Construction Grammar: Syntactic Theory in Typological Perspective. Oxford University Press.

Helen De Hoop and Bhuvana Narasimhan. 2005. Differential case-marking in Hindi. In Competition and Variation in Natural Languages, pages 321–345. Elsevier.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Robert M. W. Dixon. 1979. Ergativity. Language, pages 59–138.

John W. Du Bois. 1987. The discourse basis of ergativity. Language, pages 805–855.

Barbara A. Fox. 1987. The noun phrase accessibility hierarchy reinterpreted: Subject primacy or the absolutive hypothesis? Language, pages 856–870.

Richard Futrell, Ethan Wilcox, Takashi Morita, Peng Qian, Miguel Ballesteros, and Roger Levy. 2019. Neural language models as psycholinguistic subjects: Representations of syntactic state. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 32–42, Minneapolis, Minnesota. Association for Computational Linguistics.

Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1195–1205, New Orleans, Louisiana. Association for Computational Linguistics.

John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2733–2743.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota. Association for Computational Linguistics.

Paul J. Hopper and Sandra A. Thompson. 1980. Transitivity in grammar and discourse. Language, pages 251–299.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3651–3657, Florence, Italy. Association for Computational Linguistics.

Edward L. Keenan and Bernard Comrie. 1977. Noun phrase accessibility and universal grammar. Linguistic Inquiry, 8(1):63–99.

Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4365–4374, Hong Kong, China. Association for Computational Linguistics.

Jindřich Libovický, Rudolf Rosa, and Alexander Fraser. 2019. How language-neutral is multilingual BERT? arXiv preprint arXiv:1911.03310.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. 2019. Linguistic knowledge and transferability of contextual representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1073–1094, Minneapolis, Minnesota. Association for Computational Linguistics.

Christopher D. Manning, Kevin Clark, John Hewitt, Urvashi Khandelwal, and Omer Levy. 2020. Emergent linguistic structure in artificial neural networks trained by self-supervision. Proceedings of the National Academy of Sciences.

Aaron Mueller, Garrett Nicolai, Panayiota Petrou-Zeniou, Natalia Talmina, and Tal Linzen. 2020. Cross-linguistic syntactic evaluation of word prediction models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5523–5539.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1659–1666.

Isabel Papadimitriou and Dan Jurafsky. 2020. Learning Music Helps You Read: Using transfer to study linguistic structure in language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6829–6839.

Tiago Pimentel, Josef Valvoda, Rowan Hall Maudslay, Ran Zmigrod, Adina Williams, and Ryan Cotterell. 2020. Information-theoretic probing for linguistic structure. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4609–4622.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy. Association for Computational Linguistics.

Shauli Ravfogel, Yoav Goldberg, and Tal Linzen. 2019. Studying the inductive biases of RNNs with synthetic variations of natural languages. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3532–3542, Minneapolis, Minnesota. Association for Computational Linguistics.

Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842–866.

Elena Voita and Ivan Titov. 2020. Information-theoretic probing with minimum description length. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 183–196.

Ethan Wilcox, Roger Levy, Takashi Morita, and Richard Futrell. 2018. What do RNN language models learn about filler–gap dependencies? In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 211–221, Brussels, Belgium. Association for Computational Linguistics.

Shijie Wu and Mark Dredze. 2019. Beto, Bentz, Becas: The surprising cross-lingual effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 833–844.