Finding Universal Grammatical Relations in Multilingual BERT

Ethan A. Chi, John Hewitt, and Christopher D. Manning
Department of Computer Science, Stanford University
{ethanchi,johnhew,manning}@cs.stanford.edu

Abstract

Recent work has found evidence that Multilingual BERT (mBERT), a transformer-based multilingual masked language model, is capable of zero-shot cross-lingual transfer, suggesting that some aspects of its representations are shared cross-lingually. To better understand this overlap, we extend recent work on finding syntactic trees in neural networks' internal representations to the multilingual setting. We show that subspaces of mBERT representations recover syntactic tree distances in languages other than English, and that these subspaces are approximately shared across languages. Motivated by these results, we present an unsupervised analysis method that provides evidence that mBERT learns representations of syntactic dependency labels, in the form of clusters which largely agree with the Universal Dependencies taxonomy. This evidence suggests that even without explicit supervision, multilingual masked language models learn certain linguistic universals.

[Figure 1: t-SNE visualization of head-dependent dependency pairs belonging to selected dependencies in English and French, projected into a syntactic subspace of Multilingual BERT, as learned on English syntax trees. Colors correspond to gold UD dependency type labels. Although neither mBERT nor our probe was ever trained on UD dependency labels, English and French dependencies exhibit cross-lingual clustering that largely agrees with UD dependency labels.]

1 Introduction

Past work (Liu et al., 2019; Tenney et al., 2019a,b) has found that masked language models such as BERT (Devlin et al., 2019) learn a surprising amount of linguistic structure, despite a lack of direct linguistic supervision. Recently, large multilingual masked language models such as Multilingual BERT (mBERT) and XLM (Conneau and Lample, 2019; Conneau et al., 2019) have shown strong cross-lingual performance on tasks like XNLI (Lample and Conneau, 2019; Williams et al., 2018) and dependency parsing (Wu and Dredze, 2019). Much previous analysis has been motivated by a desire to explain why BERT-like models perform so well on downstream applications in the monolingual setting, which begs the question: what properties of these models make them so cross-lingually effective?

In this paper, we examine the extent to which Multilingual BERT learns a cross-lingual representation of syntactic structure. We extend probing methodology, in which a simple supervised model is used to predict linguistic properties from a model's representations. In a key departure from past work, we not only evaluate a probe's performance (on recreating dependency tree structure), but also use the probe as a window into understanding aspects of the representation that the probe was not trained on (i.e., dependency labels; Figure 1). In particular, we use the structural probing method of Hewitt and Manning (2019), which probes for syntactic trees by finding a linear transformation under which two words' distance in their dependency parse is approximated by the squared distance between their model representation vectors. After evaluating whether such transformations recover syntactic tree distances across languages in mBERT, we turn to analyzing the transformed vector representations themselves.

We interpret the linear transformation of the structural probe as defining a syntactic subspace (Figure 2), which intuitively may focus on syntactic aspects of the mBERT representations. Since the subspace is optimized to recreate syntactic tree distances, it has no supervision about edge labels (such as adjectival modifier or noun subject). This allows us to unsupervisedly analyze how representations of head-dependent pairs in syntactic trees cluster, and to qualitatively discuss how these clusters relate to linguistic notions of grammatical relations.

[Figure 2: The structural probe recovers syntax by finding a syntactic subspace in which all syntactic trees' distances are approximately encoded as squared L2 distance (Hewitt and Manning, 2019).]

We make the following contributions:

• We find that structural probes extract considerably more syntax from mBERT than baselines in 10 languages, extending the structural probe result to a multilingual setting.

• We demonstrate that mBERT represents some syntactic features in syntactic subspaces that overlap between languages. We find that structural probes trained on one language can recover syntax in other languages (zero-shot), demonstrating that the syntactic subspace found for each language picks up on features that BERT uses across languages.

• Representing a dependency by the difference of the head and dependent vectors in the syntactic space, we show that mBERT represents dependency clusters that largely overlap with the dependency taxonomy of Universal Dependencies (UD) (Nivre et al., 2020); see Figure 1. Our method allows for fine-grained analysis of the distinctions made by mBERT that disagree with UD, one way of moving past probing's limitation of detecting only linguistic properties we have training data for rather than properties inherent to the model.

Our analysis sheds light on the cross-lingual properties of Multilingual BERT, through both zero-shot cross-lingual structural probe experiments and novel unsupervised dependency label discovery experiments which treat the probe's syntactic subspace as an object of study. We find evidence that mBERT induces universal grammatical relations without any explicit supervision, which largely agree with the dependency labels of Universal Dependencies.[1]

[1] Code for reproducing our experiments is available here: https://github.com/ethanachi/multilingual-probing-visualization

2 Methodology

We present a brief overview of Hewitt and Manning (2019)'s structural probe, closely following their derivation. The method represents each dependency tree T as a distance metric, where the distance between two words d_T(w_i, w_j) is the number of edges in the path between them in T. It attempts to find a single linear transformation of the model's representation vector space under which squared L2 distance recreates tree distance in any sentence. Formally, let h^ℓ_{1:n} be a sequence of n representations produced by a model from a sequence of n words w^ℓ_{1:n} composing sentence ℓ. Given a matrix B ∈ R^{k×m} which specifies the probe parameters, we define a squared distance metric d_B as the squared L2 distance after transformation by B:

    d_B(h^ℓ_i, h^ℓ_j) = ||B h^ℓ_i − B h^ℓ_j||_2^2

We optimize to find a B that recreates the tree distance d_{T^ℓ} between all pairs of words (w^ℓ_i, w^ℓ_j) in all sentences s^ℓ in the training set of a parsed corpus. Specifically, we optimize by gradient descent:

    arg min_B  Σ_ℓ (1 / |s^ℓ|^2) Σ_{i,j} | d_{T^ℓ}(w^ℓ_i, w^ℓ_j) − d_B(h^ℓ_i, h^ℓ_j) |

For more details, see Hewitt and Manning (2019).

Departing from prior work, we view the probe-transformed word vectors Bh themselves—not just the distances between them—as objects of study. The rows of B are a basis that defines a subspace of R^m, which we call the syntactic subspace, and which may focus only on parts of the original BERT representations. A vector Bh corresponds to a point in that space; the value of each dimension equals the dot product of h with one of the basis vectors.[2]

[2] For ease of notation, we will discuss vectors Bh as being in the syntactic subspace, despite being in R^k.
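To make the objective concrete, the following is a minimal PyTorch sketch of the probe's distance function and loss. It is an illustrative re-implementation of the equations above, not the released code of footnote 1; the class and variable names are ours, and the initialization scale is arbitrary.

```python
import torch


class StructuralProbe(torch.nn.Module):
    """Linear probe B mapping model representations into a rank-k
    syntactic subspace, following Hewitt and Manning (2019)."""

    def __init__(self, model_dim: int = 768, probe_rank: int = 128):
        super().__init__()
        # B in R^{k x m}; its rows span the syntactic subspace.
        self.B = torch.nn.Parameter(0.05 * torch.randn(probe_rank, model_dim))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        """h: (n, model_dim) representations of one sentence.
        Returns the (n, n) matrix of squared L2 distances d_B(h_i, h_j)."""
        transformed = h @ self.B.T                            # (n, k)
        diffs = transformed[:, None, :] - transformed[None, :, :]
        return (diffs ** 2).sum(dim=-1)


def probe_loss(pred: torch.Tensor, tree_dist: torch.Tensor) -> torch.Tensor:
    """L1 difference between predicted squared distances and gold tree
    distances, normalized by squared sentence length, as in the objective."""
    n = pred.shape[0]
    return torch.abs(pred - tree_dist).sum() / (n ** 2)
```

Training then amounts to minimizing probe_loss by gradient descent over all sentences of a parsed corpus, holding the mBERT parameters fixed.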
2.1 Experimental Settings

These settings apply to all experiments using the structural probe throughout this paper.

Data Multilingual BERT is pretrained on corpora in 104 languages; however, we probe the performance of the model in 11 languages (Arabic, Chinese, Czech, English, Farsi, Finnish, French, German, Indonesian, Latvian, and Spanish).[3][4] Specifically, we probe the model on trees encoded in the Universal Dependencies v2 formalism (Nivre et al., 2020).

[3] When we refer to all languages, we refer to all languages in this set, not all languages that mBERT trains on.
[4] This list is not typologically representative of all human languages. However, we are constrained by the languages for which both large UD datasets and mBERT's pretraining are available. Nevertheless, we try to achieve a reasonable spread over language families, while also having some pairs of close languages for comparison.

Model In all our experiments, we investigate the 110M-parameter pre-trained weights of the BERT-Base, Multilingual Cased model.[5]

[5] https://github.com/google-research/bert

Baselines We use the following baselines:[6]

• MBERTRAND: A model with the same parametrization as mBERT but no training. Specifically, all of the contextual attention layers are reinitialized from a normal distribution with the same mean and variance as the original parameters. However, the subword embeddings and positional encoding layers remain unchanged. As randomly initialized ELMo layers are a surprisingly competitive baseline for syntactic parsing (Conneau et al., 2018), we expect this to be the case for BERT as well. In our experiments, we find that this baseline performs approximately equally across layers, so we always draw from Layer 7.

• LINEAR: All sentences are given an exclusively left-to-right chain dependency analysis.

[6] We omit a baseline that uses uncontextualized word embeddings because Hewitt and Manning (2019) found it to be a weak baseline compared to the two we use.

Evaluation To evaluate transfer accuracy, we use both of the evaluation metrics of Hewitt and Manning (2019). That is, we report the Spearman correlation between predicted and true word-pair distances (DSpr.).[7] We also construct an undirected minimum spanning tree from said distances, and evaluate this tree on undirected, unlabeled attachment score (UUAS), the percentage of undirected edges placed correctly when compared to the gold tree.

[7] Following Hewitt and Manning (2019), we evaluate only sentences of lengths 5 to 50, first averaging correlations for word pairs in sentences of a specific length, and then averaging across sentence lengths.
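Both metrics can be computed directly from a matrix of predicted pairwise distances. Below is a minimal sketch using scipy; the helper signatures are ours, and the per-sentence-length averaging described in footnote 7 is omitted for brevity.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.stats import spearmanr


def uuas(pred_dist: np.ndarray, gold_edges: list) -> float:
    """Fraction of gold (undirected) tree edges recovered by the minimum
    spanning tree over the (n, n) matrix of predicted pairwise distances."""
    mst = minimum_spanning_tree(pred_dist).toarray()
    pred = {frozenset(edge) for edge in zip(*np.nonzero(mst))}
    gold = {frozenset(edge) for edge in gold_edges}
    return len(pred & gold) / len(gold)


def dspr(pred_dist: np.ndarray, gold_dist: np.ndarray) -> float:
    """Spearman correlation between predicted and true distances from each
    word to all others, averaged over words in the sentence."""
    rhos = [spearmanr(p, g).correlation for p, g in zip(pred_dist, gold_dist)]
    return float(np.mean(rhos))
```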

3 Does mBERT Build a Syntactic Subspace for Each Language?

We first investigate whether mBERT builds syntactic subspaces, potentially private to each language, for a subset of the languages it was trained on; this is a prerequisite for the existence of a shared, cross-lingual syntactic subspace. Specifically, we train the structural probe to recover tree distances in each of our eleven languages. We experiment with training syntactic probes of various ranks, as well as on embeddings from all 12 layers of mBERT.

3.1 Results

We find that the syntactic probe recovers syntactic trees across all the languages we investigate, achieving on average an improvement of 22 points UUAS and 0.175 DSpr. over both baselines (Table 1, section IN-LANG).[8]

Additionally, the probe achieves significantly higher UUAS (on average, 9.3 points better on absolute performance and 6.7 points better on improvement over baseline) on Western European languages.[9] Such languages have been shown to have better performance on recent shared task results on multilingual parsing (e.g. Zeman et al., 2018). However, we do not find a large improvement when evaluated on DSpr. (0.041 DSpr. absolute, -0.013 relative).

We find that across all languages we examine, the structural probe most effectively recovers tree structure from the 7th or 8th mBERT layer (Figure 4). Furthermore, increasing the probe maximum rank beyond approximately 64 or 128 gives no further gains, implying that the syntactic subspace is a small part of the overall mBERT representation, which has dimension 768 (Figure 3).

These results closely correspond to the results found by Hewitt and Manning (2019) for an equivalently sized monolingual English model trained and evaluated on the Penn Treebank (Marcus et al., 1993), suggesting that mBERT behaves similarly to monolingual BERT in representing syntax.

[8] Throughout this paper, we report improvement over the stronger of our two baselines per-language.
[9] Here, we define Western European as Czech, English, French, German, and Spanish.

Structural Probe Results: Undirected Unlabeled Attachment Score (UUAS)

             Arabic  Czech  German  English  Spanish  Farsi  Finnish  French  Indonesian  Latvian  Chinese  Average
LINEAR         57.1   45.4    42.8     41.5     44.6   52.6     50.1    46.4        55.2     47.0     44.2     47.9
MBERTRAND      49.8   57.3    55.2     57.4     55.3   43.2     54.9    61.2        53.2     53.0     41.1     52.9
IN-LANG        72.8   83.7    83.4     80.1     79.4   70.7     76.3    81.3        74.4     77.1     66.3     76.8
  ∆BASELINE    15.7   26.4    28.1     22.6     24.1   18.0     21.4    20.1        19.1     24.1     22.1     22.0
SINGLETRAN     68.6   74.7    70.8     65.4     75.8   61.3     69.8    74.3        69.0     73.2     51.1     68.5
  ∆BASELINE    11.5   17.4    15.6      8.0     20.4    8.7     14.9    13.1        13.8     20.2      6.9     13.7
HOLDOUT        70.4   77.8    75.1     68.9     75.5   63.3     70.7    76.4        70.8     73.7     51.3     70.4
  ∆BASELINE    13.3   20.5    19.8     11.5     20.1   10.7     15.8    15.2        15.6     20.7      7.1     15.5
ALLLANGS       72.0   82.5    79.6     75.9     77.6   68.2     73.0    80.3        73.1     75.1     57.8     74.1
  ∆BASELINE    14.9   25.2    24.4     18.5     22.2   15.6     18.1    19.1        17.9     22.1     13.7     19.2

Structural Probe Results: Distance Spearman Correlation (DSpr.)

             Arabic  Czech  German  English  Spanish  Farsi  Finnish  French  Indonesian  Latvian  Chinese  Average
LINEAR         .573   .570    .533     .567     .589   .489     .564    .598        .578     .543     .493     .554
MBERTRAND      .657   .658    .672     .659     .693   .611     .621    .710        .656     .608     .590     .649
IN-LANG        .822   .845    .846     .817     .859   .813     .812    .864        .807     .798     .777     .824
  ∆BASELINE    .165   .187    .174     .158     .166   .202     .191    .154        .151     .190     .187     .175
SINGLETRAN     .774   .801    .807     .773     .838   .732     .787    .836        .772     .771     .655     .777
  ∆BASELINE    .117   .143    .135     .114     .145   .121     .166    .126        .117     .163     .064     .128
HOLDOUT        .779   .821    .824     .788     .838   .744     .792    .840        .776     .775     .664     .786
  ∆BASELINE    .122   .163    .152     .129     .146   .133     .171    .130        .121     .166     .074     .137
ALLLANGS       .795   .839    .836     .806     .848   .777     .802    .853        .789     .783     .717     .804
  ∆BASELINE    .138   .181    .165     .147     .155   .165     .181    .143        .134     .174     .127     .156

Table 1: Performance (in UUAS and DSpr.) of the structural probe trained on the following cross-lingual sources of data: the evaluation language (IN-LANG); the single other best language (SINGLETRAN); all other languages (HOLDOUT); and all languages, including the evaluation language (ALLLANGS). Note that all improvements over baseline (∆BASELINE) are reported against the stronger of our two baselines per-language.

[Figure 3: Parse distance tree reconstruction accuracy (UUAS) for selected languages (en, fr, de, es, fi, zh) at layer 7 when the linear transformation is constrained to varying maximum dimensionality (probe maximum rank 1 through 512).]

[Figure 4: Parse distance tree reconstruction accuracy (UUAS) on layers 1–12 for selected languages (en, fi, zh), with probe maximum rank 128.]

4 Cross-Lingual Probing

4.1 Transfer Experiments

We now evaluate the extent to which Multilingual BERT's syntactic subspaces are similar across languages. To do this, we evaluate the performance of a structural probe when evaluated on a language unseen at training time. If a probe trained to predict syntax from representations in language i also predicts syntax in language j, this is evidence that mBERT's syntactic subspace for language i also encodes syntax in language j, and thus that syntax is encoded similarly between the two languages.

Specifically, we evaluate the performance of the structural probe in the following contexts:

• Direct transfer, where we train on language i and evaluate on language j.

• Hold-one-out transfer, where we train on all languages other than j and evaluate on language j.

4.2 Joint Syntactic Subspace

Building off these cross-lingual transfer experiments, we investigate whether there exists a single joint syntactic subspace that encodes syntax in all languages, and if so, the degree to which it does so. To do so, we train a probe on the concatenation of data from all languages, evaluating it on the concatenation of validation data from all languages.

4.3 Results

We find that mBERT's syntactic subspaces are transferable across all of the languages we examine. Specifically, transfer from the best source language (chosen post hoc per-language) achieves on average an improvement of 14 points UUAS and 0.128 DSpr. over the best baseline (Table 1, section SINGLETRAN).[10] Additionally, our results demonstrate the existence of a cross-lingual syntactic subspace; on average, a holdout subspace trained on all languages but the evaluation language achieves an improvement of 16 points UUAS and 0.137 DSpr. over baseline, while a joint ALLLANGS subspace trained on a concatenation of data from all source languages achieves an improvement of 19 points UUAS and 0.156 DSpr. (Table 1, sections HOLDOUT, ALLLANGS).

Furthermore, for most languages, syntactic information embedded in the post hoc best cross-lingual subspace accounts for 62.3% of the total possible improvement in UUAS (73.1% DSpr.) in recovering syntactic trees over the baseline (as represented by in-language supervision). Holdout transfer represents on average 70.5% of improvement in UUAS (79% DSpr.) over the best baseline, while evaluating on a joint syntactic subspace accounts for 88% of improvement in UUAS (89% DSpr.). These results demonstrate the degree to which the cross-lingual syntactic space represents syntax cross-lingually.

[10] For full results, consult Appendix Table 3.

4.4 Subspace Similarity

Our experiments above attempt to evaluate syntactic overlap through zero-shot evaluation of structural probes. In an effort to measure more directly the degree to which the syntactic subspaces of mBERT overlap, we calculate the average principal angle[11] between the subspaces parametrized by each language we evaluate, to test the hypothesis that syntactic subspaces which are closer in angle have closer syntactic properties (Table 4).

We evaluate this hypothesis by asking whether closer subspaces (as measured by lower average principal angle) correlate with better cross-lingual transfer performance. For each language i, we first compute an ordering of all other languages j by increasing probing transfer performance trained on j and evaluated on i. We then compute the Spearman correlation between this ordering and the ordering given by decreasing subspace angle. Averaged across all languages, the Spearman correlation is 0.78 with UUAS, and 0.82 with DSpr., showing that transfer probe performance is substantially correlated with subspace similarity.

[11] https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.subspace_angles.html
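Concretely, this analysis needs only the learned probe matrices. The sketch below is illustrative (function names and input conventions are ours): it computes the mean principal angle between two probes' row spaces, and the Spearman correlation between angle-based and transfer-based orderings of source languages for one target language.

```python
import numpy as np
from scipy.linalg import subspace_angles
from scipy.stats import spearmanr


def mean_principal_angle(B_i: np.ndarray, B_j: np.ndarray) -> float:
    """Average principal angle between the subspaces spanned by the rows
    of two probe matrices B_i and B_j (each of shape k x m)."""
    # scipy expects each subspace as the column space of its argument.
    return float(np.mean(subspace_angles(B_i.T, B_j.T)))


def angle_vs_transfer(angles: list, transfer_scores: list) -> float:
    """Spearman correlation between the ordering of source languages by
    decreasing subspace angle and by increasing transfer performance:
    a high value means closer subspaces transfer better."""
    return spearmanr(-np.asarray(angles), np.asarray(transfer_scores)).correlation
```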
4.5 Extrapolation Testing

To get a finer-grained understanding of how syntax is shared cross-lingually, we aim to understand whether less common syntactic features are embedded in the same cross-lingual space as syntactic features common to all languages. To this end, we examine two syntactic relations—prenominal and postnominal adjectives—which appear in some of our languages but not others. We train syntactic probes to learn a subspace on languages that primarily use only one ordering (i.e. the majority class covers more than 95% of all adjectives), then evaluate their UUAS score solely on adjectives of the other ordering. Specifically, we evaluate on French, which has a mix (69.8% prenominal) of both orderings, in the hope that evaluating both orderings in the same language may help correct for biases in pairwise language similarity. Since the evaluation ordering is out-of-domain for the probe, predicting evaluation-order dependencies successfully suggests that the learned subspace is capable of generalizing between both kinds of adjectives.

We find that for both categories of languages, accuracy does not differ significantly on either prenominal or postnominal adjectives. Specifically, for both primarily-prenominal and primarily-postnominal training languages, postnominal adjectives score on average approximately 2 points better than prenominal adjectives (Table 2).

Language                          Prenom.  Postnom.  % data prenom.
de                                  0.932     0.900          100.0%
zh                                  0.801     0.826          100.0%
lv                                  0.752     0.811           99.7%
en                                  0.906     0.898           99.1%
fi                                  0.834     0.840           98.5%
cz                                  0.830     0.894           95.4%
fa                                  0.873     0.882            9.6%
id                                  0.891     0.893            4.9%
ar                                  0.834     0.870            0.1%
Average (primarily prenominal)      0.843     0.862
Average (primarily postnominal)     0.866     0.881

Table 2: Performance of syntactic spaces trained on various languages on recovering prenominal and postnominal French noun–adjective edges.

5 mBERT Dependency Clusters Capture Universal Grammatical Relations

5.1 Methodology

Given the previous evidence that mBERT shares syntactic representations cross-lingually, we aim to more qualitatively examine the nature of syntactic dependencies in syntactic subspaces. Let D be a dataset of parsed sentences, and let the linear transformation B ∈ R^{k×m} define a k-dimensional syntactic subspace. For every non-root word, and hence every syntactic dependency in D (since every word is a dependent of some other word or an added ROOT symbol), we calculate the k-dimensional head-dependent vector between the head and the dependent after projection by B. Specifically, for all head-dependent pairs (w_head, w_dep), we compute v_diff = B(h_head − h_dep). We then visualize all differences over all sentences in two dimensions using t-SNE (van der Maaten and Hinton, 2008); a sketch of this computation is given below.

5.2 Experiments

As with multilingual probing, one can visualize head-dependent vectors in several ways; we present the following experiments:

• dependencies from one language, projected into a different language's space (Figure 1)

• dependencies from one language, projected into a holdout syntactic space trained on all other languages (Figure 5)

• dependencies from all languages, projected into a joint syntactic space trained on all languages (Figure 6)

For all these experiments, we project into 32-dimensional syntactic spaces.[12] Additionally, we expose a web interface for visualization in our GitHub repository.[13]

[12] We reduce the dimensionality of the subspaces here as compared to our previous experiments to match t-SNE suggestions and more aggressively filter non-syntactic information.
[13] https://github.com/ethanachi/multilingual-probing-visualization/blob/master/visualization.md

[Figure 5: t-SNE visualization of syntactic differences in Spanish projected into a holdout subspace (learned by a probe trained to recover syntax trees in languages other than Spanish). Despite never seeing a Spanish sentence during probe training, the subspace captures a surprisingly fine-grained view of Spanish dependencies.]
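As promised in Section 5.1, a minimal sketch of the head-dependent vector computation follows. The input format (word-aligned mBERT vectors and gold head indices per sentence) is an assumption of this sketch, and the commented t-SNE call mirrors our visualization step.

```python
import numpy as np
from sklearn.manifold import TSNE


def head_dependent_vectors(B: np.ndarray, sentences: list) -> np.ndarray:
    """Collect v_diff = B(h_head - h_dep) for every dependency.

    B: (k, m) probe matrix.  sentences: list of (vectors, heads) pairs,
    where vectors is an (n, m) array of word representations and heads[i]
    is the index of word i's head (-1 for the sentence root, whose head
    is the added ROOT symbol and which is therefore skipped)."""
    diffs = []
    for vectors, heads in sentences:
        for dep, head in enumerate(heads):
            if head < 0:                     # skip the root attachment
                continue
            diffs.append(B @ (vectors[head] - vectors[dep]))
    return np.stack(diffs)


# 2-D layout for plotting, e.g.:
# points = TSNE(n_components=2, perplexity=30).fit_transform(v_diffs)
```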

[Figure 6 insets: example sentences (trimmed for clarity); heads in bold, dependents in bold italic.
(b) Postnominal adjectives: fr "Le gaz développe ses applications domestiques." id "Film lain yang menerima penghargaan istimewa." fa (Farsi example in Perso-Arabic script)
(c) Genitives: en "The assortment of customers adds entertainment." es "Con la recuperación de la democracia y las libertades" lv "Svešiniece piecēlās, atvadījās no vecā vīra"
(j) Definite articles: en "The value of the highest bid" fr "Merak est une ville d'Indonésie sur la côte occidentale." de "Selbst mitten in der Woche war das Lokal gut besucht."]

5.3 Results

When projected into a syntactic subspace determined by a structural probe, we find that difference vectors separate into clusters reflecting linguistic characteristics of the dependencies. The cluster identities largely overlap with (but do not exactly agree with) dependency labels as defined by Universal Dependencies (Figure 6). Additionally, the clusters found by mBERT are highly multilingual. When dependencies from several languages are projected into the same syntactic subspace, whether trained monolingually or cross-lingually, we find that dependencies of the same label share the same cluster (e.g. Figure 1, which presents both English and French syntactic difference vectors projected into an English subspace).

[Figure 6: t-SNE visualization of 100,000 syntactic difference vectors projected into the cross-lingual syntactic subspace of Multilingual BERT. We exclude punct and visualize the top 11 dependencies remaining, which are collectively responsible for 79.36% of the dependencies in our dataset. Clusters of interest are highlighted in yellow; linguistically interesting clusters are labeled.]

5.4 Finer-Grained Analysis

Visualizing syntactic differences in the syntactic space provides a surprisingly nuanced view of the native distinctions made by mBERT. In Figure 6, these differences are colored by gold UD dependency labels. A brief summary is as follows:

Adjectives Universal Dependencies categorizes all adjectival noun modifiers under the amod relation. However, we find that mBERT splits adjectives into two groups: prenominal adjectives in cluster (b) (e.g., Chinese 獨特的地理) and postnominal adjectives in cluster (u) (e.g., French applications domestiques).

Nominal arguments mBERT maintains the UD distinction between subject (nsubj) and object (obj). Indirect objects (iobj) cluster with direct objects. Interestingly, mBERT generally groups adjunct arguments (obl) with nsubj if near the beginning of a sentence and with obj otherwise.

Relative clauses In the languages in our dataset, there are two major ways of forming relative clauses. Relative pronouns (e.g., English "the man who is hungry") are classed by Universal Dependencies as being an nsubj dependent, while subordinating markers (e.g., English "I know that she saw me") are classed as the dependent of a mark relation. However, mBERT groups both of these relations together, clustering them distinctly from most nsubj and mark relations.

Negatives Negative adverbial modifiers (e.g. English not, Chinese 不, and the Farsi negative particle) are not clustered with other adverbial syntactic relations (advmod), but form their own group (h).[14]

[14] Stanford Dependencies and Universal Dependencies v1 had a separate neg dependency, but it was eliminated in UDv2.

Determiners The linguistic category of determiners (det) is split into definite articles (i), indefinite articles (e), possessives (f), and demonstratives (g). Sentence-initial definite articles (k) cluster separately from other definite articles (j).

Expletive subjects Just as in UD, with the separate relation expl, expletive subjects, or third-person pronouns with no syntactic meaning (e.g. English "It is cold", French "Il faudrait", Indonesian "Yang menjadi masalah kemudian"), cluster separately (k) from other nsubj relations (small cluster in the bottom left).

Overall, mBERT draws slightly different distinctions from Universal Dependencies. Although some are more fine-grained than UD, others appear to be more influenced by word order, separating relations that most linguists would group together. Still others are valid linguistic distinctions not distinguished by the UD standard.

5.5 Discussion

Previous work has found that it is possible to recover dependency labels from mBERT embeddings, in the form of very high accuracy on dependency label probes (Liu et al., 2019; Tenney et al., 2019b). However, although we know that dependency label probes are able to use supervision to map from mBERT's representations to UD dependency labels, this does not provide full insight into the nature of (or existence of) latent dependency label structure in mBERT. By contrast, in the structural probe, B is optimized such that ||v_diff||_2^2 ≈ 1, but no supervision as to dependency label is given. The contribution of our method is thus to provide a view into mBERT's "own" dependency label representation. In Appendix A, Figure 8, we provide a similar visualization as applied to MBERTRAND, finding much less cluster coherence.

5.6 Probing as a window into representations

Our head-dependent vector visualization uses a supervised probe, but its objects of study are properties of the representation other than those relating to the probe supervision signal. Because the probe never sees supervision on the task we visualize for, the visualized behavior cannot be the result of the probe memorizing the task, a problem in probing methodology (Hewitt and Liang, 2019). Instead, it is an example of using probe supervision to focus in on aspects that may be drowned out in the original representation. However, the probe's linear transformation may not pick up on aspects that are of causal influence to the model.

6 Related Work

Cross-lingual embedding alignment Lample et al. (2018) find that independently trained monolingual embedding spaces in ELMo are isometric under rotation. Similarly, Schuster et al. (2019) and Wang et al. (2019) geometrically align contextualized word embeddings trained independently. Wu et al. (2019) find that cross-lingual transfer in mBERT is possible even without shared vocabulary tokens, which they attribute to this isometricity. In concurrent work, Cao et al. (2020) demonstrate that mBERT embeddings of similar words in similar sentences across languages are approximately aligned already, suggesting that mBERT also aligns semantics across languages. K et al. (2020) demonstrate that strong cross-lingual transfer is possible without any word piece overlap at all.

Analysis with the structural probe In a monolingual study, Reif et al. (2019) also use the structural probe of Hewitt and Manning (2019) as a tool for understanding the syntax of BERT. They plot the words of individual sentences in a 2-dimensional PCA projection of the structural probe distances, for a geometric visualization of individual syntax trees. Further, they find that distances in the mBERT space separate clusters of word senses for the same word type.

Understanding representations Pires et al. (2019) find that cross-lingual BERT representations share a common subspace representing useful linguistic information. Libovický et al. (2019) find that mBERT representations are composed of a language-specific component and a language-neutral component. Both Libovický et al. (2019) and Kudugunta et al. (2019) perform SVCCA on LM representations extracted from mBERT and a massively multilingual transformer-based NMT model, finding language family-like clusters.

Li and Eisner (2019) present a study in syntactically motivated dimensionality reduction; they find that after being passed through an information bottleneck and dimensionality reduction via t-SNE, ELMo representations cluster naturally by UD part of speech tags. Unlike our syntactic dimensionality reduction process, the information bottleneck is directly supervised on POS tags, whereas our process receives no linguistic supervision other than unlabeled tree structure. In addition, the reduction process, a feed-forward neural network, is more complex than our linear transformation.

Singh et al. (2019) evaluate the similarity of mBERT representations using Canonical Correlation Analysis (CCA), finding that overlap among subword tokens accounts for much of the representational similarity of mBERT. However, they analyze cross-lingual overlap across all components of the mBERT representation, whereas we evaluate solely the overlap of syntactic subspaces. Since syntactic subspaces are at most a small part of the total BERT space, their results are not necessarily contradictory with ours. In concurrent work, Michael et al. (2020) also extend probing methodology, extracting latent ontologies from contextual representations without direct supervision.

7 Discussion

Language models trained on large amounts of text have been shown to develop surprising emergent properties; of particular interest is the emergence of non-trivial, easily accessible linguistic properties seemingly far removed from the training objective. For example, it would be a reasonable strategy for mBERT to share little representation space between languages, effectively learning a private model for each language and avoiding destructive interference. Instead, our transfer experiments provide evidence that at a syntactic level, mBERT shares portions of its representation space between languages. Perhaps more surprisingly, we find evidence for fine-grained, cross-lingual syntactic distinctions in these representations. Even though our method for identifying these distinctions lacks dependency label supervision, we still identify that mBERT has a cross-linguistic clustering of grammatical relations that qualitatively overlaps considerably with the Universal Dependencies formalism.

The UUAS metric We note that the UUAS metric alone is insufficient for evaluating the accuracy of the structural probe. While the probe is optimized to directly recreate parse distances, that is, to achieve d_B(h^ℓ_i, h^ℓ_j) ≈ d_{T^ℓ}(w^ℓ_i, w^ℓ_j), a perfect UUAS score under the minimum spanning tree construction can be achieved by ensuring that d_B(h^ℓ_i, h^ℓ_j) is small if there is an edge between w^ℓ_i and w^ℓ_j, and large otherwise, instead of accurately recreating distances between words connected by longer paths. By evaluating Spearman correlation between all pairs of words, one directly evaluates the extent to which the ordering of words j by distance to each word i is correctly predicted, a key notion of the geometric interpretation of the structural probe. See Maudslay et al. (2020) for further discussion.

Limitations Our methods are unable to tease apart, for all pairs of languages, whether transfer performance is caused by subword overlap (Singh et al., 2019) or by a more fundamental sharing of parameters, though we do note that language pairs with minimal subword overlap do exhibit non-zero transfer, both in our experiments and in others (K et al., 2020). Moreover, while we quantitatively evaluate cross-lingual transfer in recovering dependency distances, we only conduct a qualitative study in the unsupervised emergence of dependency labels via t-SNE. Future work could extend this analysis to include quantitative results on the extent of agreement with UD. We acknowledge as well issues in interpreting t-SNE plots (Wattenberg et al., 2016), and include multiple plots with various hyperparameter settings to hedge against this confounder in Figure 11.

Future work should explore other multilingual models like XLM and XLM-RoBERTa (Lample and Conneau, 2019) and attempt to come to an understanding of the extent to which the properties we've discovered have causal implications for the decisions made by the model, a claim our methods cannot support.

8 Acknowledgements

We would like to thank Erik Jones, Sebastian Schuster, and Chris Donahue for helpful feedback and suggestions. We would also like to thank the anonymous reviewers and area chair Adina Williams for their helpful comments on our draft.

References
Steven Cao, Nikita Kitaev, and Dan Klein. 2020. Multilingual alignment of contextual word representations. In International Conference on Learning Representations.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.

Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 2126–2136.

Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 7059–7069. Curran Associates, Inc.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186.

John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL).

Karthikeyan K, Zihan Wang, Stephen Mayhew, and Dan Roth. 2020. Cross-lingual ability of multilingual BERT: An empirical study. In International Conference on Learning Representations.

Sneha Kudugunta, Ankur Bapna, Isaac Caswell, and Orhan Firat. 2019. Investigating multilingual NMT representations at scale. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.

Guillaume Lample, Alexis Conneau, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word translation without parallel data. In International Conference on Learning Representations.

Xiang Lisa Li and Jason Eisner. 2019. Specializing word embeddings (for parsing) by information bottleneck. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).

Jindřich Libovický, Rudolf Rosa, and Alexander Fraser. 2019. How language-neutral is multilingual BERT? arXiv preprint arXiv:1911.03310.

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. 2019. Linguistic knowledge and transferability of contextual representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Rowan Hall Maudslay, Josef Valvoda, Tiago Pimentel, Adina Williams, and Ryan Cotterell. 2020. A tale of a probe and a parser. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Julian Michael, Jan A. Botha, and Ian Tenney. 2020. Asking without telling: Exploring latent ontologies in contextual representations. arXiv preprint arXiv:2004.14513.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, and Daniel Zeman. 2020. Universal Dependencies v2: An evergrowing multilingual treebank collection. In Proceedings of the Twelfth International Conference on Language Resources and Evaluation.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Emily Reif, Ann Yuan, Martin Wattenberg, Fernanda B. Viégas, Andy Coenen, Adam Pearce, and Been Kim. 2019. Visualizing and measuring the geometry of BERT. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8592–8600.

Tal Schuster, Ori Ram, Regina Barzilay, and Amir Globerson. 2019. Cross-lingual alignment of contextual word embeddings, with applications to zero-shot dependency parsing. arXiv preprint arXiv:1902.09492.

Jasdeep Singh, Bryan McCann, Richard Socher, and Caiming Xiong. 2019. BERT is not an interlingua and the bias of tokenization. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019).

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019a. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Sam Bowman, Dipanjan Das, and Ellie Pavlick. 2019b. What do you learn from context? Probing for sentence structure in contextualized word representations. In International Conference on Learning Representations.

L.J.P. van der Maaten and G.E. Hinton. 2008. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.

Yuxuan Wang, Wanxiang Che, Jiang Guo, Yijia Liu, and Ting Liu. 2019. Cross-lingual BERT transformation for zero-shot dependency parsing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).

Martin Wattenberg, Fernanda Viégas, and Ian Johnson. 2016. How to use t-SNE effectively. Distill, 1(10):e2.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1112–1122. Association for Computational Linguistics.

Shijie Wu, Alexis Conneau, Haoran Li, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Emerging cross-lingual structure in pretrained language models. arXiv preprint arXiv:1911.01464.

Shijie Wu and Mark Dredze. 2019. Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 833–844.

Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov. 2018. CoNLL 2018 shared task: Multilingual parsing from raw text to universal dependencies. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–21.

A Additional Syntactic Difference Visualizations

A.1 Visualization of All Relations

In our t-SNE visualization of syntactic difference vectors projected into the cross-lingual syntactic subspace of Multilingual BERT (Figure 6), we only visualize the top 11 relations, excluding punct. This represents 79.36% of the dependencies in our dataset. In Figure 7, we visualize all 36 relations in the dataset.

[Figure 7: t-SNE visualization of dependency head-dependent pairs projected into the cross-lingual syntactic subspace of Multilingual BERT. Colors correspond to gold UD dependency type labels, which are unlabeled given that there are 43 in this visualization.]

A.2 Visualization with Randomly-Initialized Baseline

In Figure 8, we present a visualization akin to Figure 1; however, both the head-dependency representations and the syntactic subspace are derived from MBERTRAND. Clusters around the edges of the figure are primarily type-based (e.g. one cluster for the word "for" and another for "pour"), and there is insignificant overlap between clusters with parallel syntactic functions from different languages.

[Figure 8: t-SNE visualization of head-dependent dependency pairs belonging to selected dependencies in English and French, projected into a syntactic subspace of MBERTRAND, as learned on English syntax trees. Colors correspond to gold UD dependency type labels.]

B Alternative Dimensionality Reduction Strategies

In an effort to confirm the level of clarity of the clusters of dependency types which emerge from syntactic difference vectors, we examine simpler strategies for dimensionality reduction.

B.1 PCA for Visualization Reduction

We project difference vectors as previously into a 32-dimensional syntactic subspace. However, we visualize in 2 dimensions using PCA instead of t-SNE. There are no significant trends evident (Figure 9).

[Figure 9: Syntactic difference vectors visualized after dimensionality reduction with PCA, instead of t-SNE, colored by UD dependency types. There are no significant trends evident.]

B.2 PCA for Dimensionality Reduction

Instead of projecting difference vectors into our syntactic subspace, we first reduce them to a 32-dimensional representation using PCA,[15] then reduce to 2 dimensions using t-SNE as previously. We find that projected under PCA, syntactic difference vectors still cluster into major groups, and major trends are still evident (Figure 10). In addition, many finer-grained distinctions are still apparent (e.g. the division between common nouns and pronouns). However, in some cases, the clusters are motivated less by syntax and more by semantics or language identities. For example:

• The nsubj and obj clusters overlap, unlike our syntactically-projected visualization, where there is clearer separation.

• Postnominal adjectives, which form a single coherent cluster under our original visualization scheme, are split into several different clusters, each primarily composed of words from one specific language.

• There are several small monolingual clusters without any common syntactic meaning, mainly composed of languages parsed more poorly by BERT (i.e. Chinese, Arabic, Farsi, Indonesian).

[15] This is of equal dimensionality to our syntactic subspace.

[Figure 10: t-SNE visualization of syntactic differences in all languages we study, projected to 32 dimensions using PCA.]
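The two alternative pipelines in this appendix differ only in which projection precedes the 2-D layout. A minimal sketch (function names are ours):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def pca_visualization(subspace_diffs: np.ndarray) -> np.ndarray:
    """B.1: project 32-dimensional syntactic difference vectors
    straight to 2-D with PCA (no t-SNE)."""
    return PCA(n_components=2).fit_transform(subspace_diffs)


def pca_then_tsne(raw_diffs: np.ndarray) -> np.ndarray:
    """B.2: skip the syntactic subspace; reduce raw difference vectors to
    32 dimensions with PCA, then to 2-D with t-SNE."""
    reduced = PCA(n_components=32).fit_transform(raw_diffs)
    return TSNE(n_components=2).fit_transform(reduced)
```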

C Additional Experiment Settings

C.1 Pairwise Transfer

We present full pairwise transfer results in Table 3. Each experiment was run 3 times with different random seeds; experiment settings with a range in UUAS greater than 2 points are labeled with an asterisk (*).

Structural Probe Results: Undirected Unlabeled Attachment Score (UUAS)

Tgt \ Src     ar     cz     de     en     es     fa     fi     fr     id     lv     zh  linear   rand  holdout   all
ar          72.7   68.6   66.6   65.3   67.5   64.0   60.8   68.1   65.3   60.1   53.4    57.1   49.8     70.4  72.0
cz         57.5*   83.6   74.7   72.6   71.1   63.5   68.9   71.5   62.4   71.0   58.0    45.4   57.3     77.8  82.5
de          49.3   70.2   83.5   70.8   68.2   58.7   61.1   70.6  56.9*   62.0  52.0*    42.8   55.2     75.1  79.6
en          47.2   61.2   65.0   79.8   63.9   50.8   55.3   65.4   54.5   54.0   50.5    41.5   57.4     68.9  75.9
es          52.0   67.2   69.8   69.4   79.7   56.9   56.8   75.8   61.0   55.6   49.2    44.6   55.3     75.5  77.6
fa          51.7   61.3   60.3   57.0   57.8   70.8   53.7   59.7   56.5   53.1   49.7    52.6   43.2     63.3  68.2
fi          55.5   69.8   68.4   66.6   66.0   60.2   76.5   66.0   61.2   68.2   59.2    50.1   54.9     70.7  73.0
fr         50.8*   67.8   73.0   70.0   74.3   56.9   55.9   84.0   60.9   55.1   49.6    46.4   61.2     76.4  80.3
id          57.1   66.3   67.4   63.6   67.0   61.0   59.2   69.0   74.8   57.5   54.6    55.2   53.2     70.8  73.1
lv         56.9*   73.2   69.2   69.1   67.0   61.5   70.8   66.7   61.1   77.0   60.7    47.0   53.0     73.7  75.1
zh         41.2*   49.7   49.6   51.1   47.3  42.7*   48.1   47.9  44.5*   47.2   65.7    44.2   41.1     51.3  57.8

Structural Probe Results: Distance Spearman Correlation (DSpr.)

Tgt \ Src     ar     cz     de     en     es     fa     fi     fr     id     lv     zh  linear   rand  holdout   all
ar          .822   .772   .746   .744   .774   .730   .723   .770   .750   .722   .640    .573   .657     .779  .795
cz          .730   .845   .799   .781   .801   .741   .782   .796   .745   .791   .656    .570   .658     .821  .839
de          .690   .807   .846   .792   .792   .736   .767   .796   .723   .765  .652*    .533   .672     .824  .836
en          .687   .765   .764   .817   .770   .696   .732   .773   .720   .725   .655    .567   .659     .788  .806
es          .745   .821   .812   .806   .859   .741   .775   .838   .777   .774   .669    .589   .693     .838  .848
fa          .661   .732   .724   .706   .705   .813   .683   .714   .686   .684   .629    .489   .611     .744  .777
fi         .682*   .787   .771   .756   .764   .712   .812   .762   .715   .781   .658    .564   .621     .792  .802
fr         .731*   .810   .816   .806   .836   .738   .767   .864   .776   .760   .674    .598   .710     .840  .853
id          .715   .757   .752   .739   .765   .718   .714   .772   .807   .704   .657    .578   .656     .776  .789
lv          .681   .771   .746   .737   .745   .699   .763   .740   .698   .798   .644    .543   .608     .775  .783
zh         .538*   .655   .644   .644   .633  .593*   .652   .638  .584*   .639   .777    .493   .590     .664  .717

Table 3: Performance (in UUAS and DSpr.) on transfer between all language pairs in our dataset. All runs were repeated 3 times; runs for which the range in performance exceeded 2 points (for UUAS) or 0.02 (for DSpr.) are marked with an asterisk (*).

        ar     cz     de     en     es     fa     fi     fr     id     lv     zh
ar   0.000  1.044  1.048  1.049  1.015  1.046  1.058  1.022  1.031  1.059  1.076
cz   1.044  0.000  0.982  1.017  0.970  1.064  1.021  1.007  1.053  1.011  1.083
de   1.048  0.982  0.000  1.005  0.973  1.044  1.017  0.971  1.022  1.029  1.065
en   1.049  1.017  1.005  0.000  0.983  1.051  1.033  0.994  1.035  1.040  1.060
es   1.015  0.970  0.973  0.983  0.000  1.038  1.023  0.936  1.010  1.024  1.065
fa   1.046  1.064  1.044  1.051  1.038  0.000  1.060  1.028  1.040  1.063  1.069
fi   1.058  1.021  1.017  1.033  1.023  1.060  0.000  1.020  1.042  1.011  1.058
fr   1.022  1.007  0.971  0.994  0.936  1.028  1.020  0.000  0.993  1.028  1.041
id   1.031  1.053  1.022  1.035  1.010  1.040  1.042  0.993  0.000  1.051  1.052
lv   1.059  1.011  1.029  1.040  1.024  1.063  1.011  1.028  1.051  0.000  1.068
zh   1.076  1.083  1.065  1.060  1.065  1.069  1.058  1.041  1.052  1.068  0.000

Table 4: Subspace angle overlap as evaluated by the pairwise mean principal angle between subspaces.

C.2 Subspace Overlap

Table 4 presents the average principal angle between the subspaces parametrized by each language we evaluate. Table 5 contains the per-language Spearman correlation between the ordering given by (negative) subspace angle and structural probe transfer accuracy, reported both on UUAS and DSpr.

D Data Sources

We use the following UD corpora in our experiments: Arabic-PADT, Chinese-GSD, Czech-PDT, English-EWT, Finnish-TDT, French-GSD, German-GSD, Indonesian-GSD, Latvian-LVTB, Persian-Seraji, and Spanish-Ancora.

E t-SNE Reproducibility

Previous work (Wattenberg et al., 2016) has investigated issues in the interpretability of t-SNE plots. Given the qualitative nature of our experiments, to avoid this confounder, we include multiple plots with various settings of the perplexity hyperparameter in Figure 11.

Language                    ar    cz    de    en    es    fa    fi    fr    id    lv    zh
Spearman Correl. (UUAS)   0.88  0.85  0.87  0.91  0.91  0.48  0.85  0.89  0.71  0.90  0.41
Spearman Correl. (DSpr.)  0.95  0.96  0.95  0.96  0.97  0.50  0.90  0.93  0.72  0.94  0.23

Table 5: The Spearman correlation between two orderings of all languages for each language i. The first ordering of languages is given by (negative) subspace angle between the B matrix of language i and that of all languages. The second ordering is given by the structural probe transfer accuracy from all languages (including i) to i. This is repeated for each of the two structural probe evaluation metrics.

[Figure 11: t-SNE visualization of head-dependent dependency pairs belonging to selected dependencies in English and French, projected into a syntactic subspace of Multilingual BERT, as learned on English syntax trees. Colors correspond to gold UD dependency type labels, as in Figure 1, varying the perplexity (PPL) t-SNE hyperparameter. From left to right, figures correspond to PPL 5, 10, 30, 50, spanning the range of PPL suggested by van der Maaten and Hinton (2008). Cross-lingual dependency label clusters are exhibited across all four figures.]