
Measuring the Similarity of Grammatical Gender Systems by Comparing Partitions

Arya D. McCarthy, Adina Williams, Shijia Liu, David Yarowsky, Ryan Cotterell
Johns Hopkins University, Facebook AI Research, ETH Zurich

Abstract

A grammatical gender system divides a lexicon into a small number of relatively fixed grammatical categories. How similar are these gender systems across languages? To quantify the similarity, we define gender systems extensionally, thereby reducing the problem of comparisons between languages' gender systems to cluster evaluation. We borrow a rich inventory of statistical tools for cluster evaluation from the field of community detection (Driver and Kroeber, 1932; Cattell, 1945), which enable us to craft novel information-theoretic metrics for measuring similarity between gender systems. We first validate our metrics, then use them to measure gender system similarity in 20 languages. Finally, we ask whether our gender system similarities alone are sufficient to reconstruct historical relationships between languages. Towards this end, we make phylogenetic predictions on the popular, but thorny, problem from historical linguistics of inducing a phylogenetic tree over extant Indo-European languages. Languages on the same branch of our phylogenetic tree are notably similar, whereas languages from separate branches are no more similar than chance.

Figure 1: Two gender systems partitioning N = 6 concepts: (a) German, K = 3; (b) Spanish, K = 2. German (a) has three communities: Obst (fruit) and Gras (grass) are neuter, Mond (moon) and Baum (tree) are masculine, Blume (flower) and Sonne (sun) are feminine. Spanish (b) has two communities: fruta (fruit), luna (moon), and flor (flower) are feminine, and césped (grass), árbol (tree), and sol (sun) are masculine.

1 Introduction

As many as half the world's languages carve nouns up into classes (Corbett, 2013). In these languages, nouns are subdivided into gender categories, which together comprise the language's grammatical gender system. A gender system tends to use a small, fixed number of categories with fixed usage across speakers. Such categories, like 'feminine', can be defined extensionally,¹ and are reflected by agreement with other items within the noun phrase (i.e., concord). Gender exhaustively divides up the language's nouns; that is, the union of gender categories is the entire nominal lexicon. Taken this way, a gender system can be viewed as a partition of the lexicon into communities of same-gendered nouns. Given this, a lexical typologist might naturally wish to ask: how similar are two languages' gender systems?

¹ When we talk about the extension of a gender system, we refer to the set of nouns that belong to each gender. This stands in contrast to the intension of that gender system, which would be the governing dynamics that gave rise to the particular partitions observed. See §3.

Using modern statistical and information-theoretic tools from the community detection literature, we offer the first cluster evaluation (Jardine et al., 1971) perspective on grammatical gender, and quantify the overlap of gender systems. We can compare the pairwise overlap of partitions of gender systems using a rich literature of measures, such as mutual information and several variants (Meilă, 2003; Vinh et al., 2010; McCarthy et al., 2019a), which we survey and [...]. Individual partitions of nouns can also be framed as members of distributions over partitions: for instance, the distribution consisting of all partitions of N items, or of all partitions of N items into K gender clusters, as in Figure 1. For example, Spanish is bi-gendered (with masculine and feminine): a lexicon of nouns (N = 1000) and their genders would come from a distribution over partitions of N = 1000 items into K = 2 clusters.
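To make the partition view concrete, the toy systems of Figure 1 can be written down as data. The sketch below is only an illustration: the dictionary layout, the gender abbreviations, and the helper names `communities` and `overlap_counts` are assumptions made here, not part of the paper. English glosses stand in for language-independent concept indices.

```python
from collections import Counter

# Gender of each concept's noun, transcribed from Figure 1.
# English glosses serve as language-independent concept indices.
GERMAN = {"fruit": "neut", "grass": "neut", "moon": "masc",
          "tree": "masc", "flower": "fem", "sun": "fem"}
SPANISH = {"fruit": "fem", "moon": "fem", "flower": "fem",
           "grass": "masc", "tree": "masc", "sun": "masc"}


def communities(system):
    """Return the partition as a dict: gender label -> set of concepts."""
    parts = {}
    for concept, gender in system.items():
        parts.setdefault(gender, set()).add(concept)
    return parts


def overlap_counts(system_a, system_b):
    """Count the concepts shared by each pair of communities, i.e. |A ∩ B|."""
    shared = system_a.keys() & system_b.keys()
    return Counter((system_a[c], system_b[c]) for c in shared)


if __name__ == "__main__":
    print({g: sorted(m) for g, m in communities(GERMAN).items()})
    print(overlap_counts(GERMAN, SPANISH))
```

The overlap counts are the only input that partition-comparison measures need; the gender labels themselves serve purely as community identifiers.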

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 5664–5675, November 16–20, 2020. © 2020 Association for Computational Linguistics

The same lexicon translated into German, a tri-gendered language, would come from a distribution of N = 1000 items partitioned into K = 3 clusters. Indeed, the fact that languages need different numbers of gender clusters makes this problem non-trivial. From this, we can compare the similarity to what we would expect for the same lexica if nouns were randomly supplied with gender specifications. That way, we can distinguish meaningful relationships from noise.

Armed with the first way to quantify community-wise similarity of gender systems, we ask: Do gender system similarities reflect linguistic phylogeny, or something else, like areal effects? Across 20 languages, we find that our pairwise overlap results measurably align with standard pairwise phylogenetic relationships. Zooming in on Indo-European, we find that we can recast pairwise similarities into an accurate phylogenetic tree, simply by measuring distance between gender systems and performing hierarchical agglomerative clustering (see §6.2).

The primary contribution of this work is a novel metric for lexical typology that measures the pairwise similarity of gender systems. We operationalize gender systems as partitions over a shared set of nouns (§3). We design and evaluate our measurements of gender system similarity under this formulation (§4), drawing on insights from community detection. Then we recover robust phylogenetic relationships between pairs of gender systems by applying these to 20 gendered languages (§6) and find that similarity between Slavic and Romance gender systems does not exceed chance levels. Finally, we show that our quantification of gender system similarity allows us to construct phylogenetic trees that closely resemble those posited for Indo-European in historical linguistics (e.g., Pagel et al. 2000; Gray and Atkinson 2003; Serva and Petroni 2008).

2 Background: Grammatical Gender

Grammatical gender is a highly fixed classification system for nouns. Native speakers rarely make errors in gender recall, which might tentatively argue against tremendous arbitrary variation (Corbett, 1991). Some regularity can surely be found in the associations between gender and various features of the noun, such as orthographic or phonological form, or semantics. With respect to form-based regularities, Cucerzan and Yarowsky (2003a) devise a system for inferring noun gender (masculine or feminine) from contextual clues and character representations, even in inflected forms of the noun. Nastase and Popescu (2009) also find that phonological form can lead to predictability of gender in two three-gender systems. With respect to semantics, Williams et al. (2019) quantify the relationship between the gender on inanimate nouns and their distributional word vectors.

We can't rely on form. Using phonological or orthographic form to derive gender is fraught with complications: particular to our study, epicene nouns (i.e., words that can appear in multiple genders) can pose issues. In German, only gender concord (for instance, on the definite article) can disambiguate the gender of some nouns; the same word-form Band means "volume" when masculine, but "ribbon" when neuter and "band, musical group" when feminine.

Another complication with determining gender from the phonological or orthographic form of the noun is that correspondences between form and gender are rarely absolute. For example, even though nouns ending in -e are usually 'feminine' in German, this is not universally the case; for example, Affe and Löwe are masculine. To sidestep these complications, we abstract away from particular word forms and observe the objective consequences of gender over sets of cross-lingual concepts, i.e., indices, not word forms, and instead compare those across gender systems (see Figure 1).

Which gender systems are likely to be similar? Several accounts highlight similarities between the gender systems of phylogenetically related languages (Fodor, 1959; Ibrahim, 2014) and argue that these similarities are likely to be at least partially due to historical relations between communities and socio-political factors governing language use. Given this, can we recover phylogenetic similarities across gender systems using our methods? If so, this should provide validation that we are indeed measuring at least some of the genuine similarity that exists between gender systems.

3 Gender Systems as Partitions

Any concept can be related to its referents either intensionally or extensionally. While linguistic research has historically sought to uncover the rules for associating a noun with gender in terms of surface features or semantics (see Corbett 1991 for an overview), we take an extensional approach. That is, we treat a gender category in a language solely as the set of words it covers. This

maps directly to the notion of a community in the network science task of community detection: A community is defined by membership, not by other arbitrary properties, just as a gender here is defined by the union of all nouns it subsumes, not by its phonological realization or contributions to semantics. The disjoint set of communities forms a partition of the set of nouns: Each noun is a member of one and only one cluster.

Although some epicene nouns are present in our investigated languages (see §2), these are very rare. We thus make the simplifying modeling assumption of identifying each word with only a single gender (in our case, the most frequent). This assumption is necessary for our reduction of gender system comparison to clustering evaluation. Without it, we would be forced, for words like German der/die/das Band, to consider overlapping or "fuzzy" partitions, which, although an intriguing option, will be left for future work.

Notation. A language's gender system is a partition, named in sans serif (e.g., $\mathsf{A}$). A gender system $\mathsf{A}$ has K components called gender classes (i.e., communities, e.g., $\{A_{\text{MSC}}, A_{\text{FEM}}, \ldots\}$); these are in turn sets whose members are items drawn from a finite base set $\mathcal{A} \subseteq \mathcal{L}$, where $\mathcal{A}$ is a sub-lexicon selected from the full lexicon $\mathcal{L}$. In our case, $\mathcal{A}$ holds all inanimate concepts in our data (see §5). We use Ω to name the set of all partitions of $N = |\mathcal{A}|$ items (in our case, inanimate nouns) into K communities. When comparing two languages' respective gender systems, we will use the letters $\mathsf{A}$ and $\mathsf{B}$.

4 Comparing Partitions

A partition groups items into a set of disjoint categories. We could compare any two gender systems (i.e., partitions) which organize the same nouns by determining how similar their gender labelings are. A first pass at quantifying the similarity of two gender partitions would be to measure simple overlap. We could ask: What fraction of $\mathcal{A}$ agrees in gender across languages? That is, for each noun in our multilingual vocabulary, do both languages lexicalize it with the same gender? This is an easily interpretable, accuracy-like measure, bounded by 0 and 1. Still, it has no capacity for comparing systems with different numbers of categories; the measure would be handicapped when comparing two-gender systems to three-gender ones.

Comparing systems with different numbers of categories, though, is a well-known problem in the field of community detection. While this looks insurmountable from the gender perspective, where gender categories refer to something we recognize, in community detection the labels themselves are meaningless: there is no notion of a so-called "Cluster 2". The field has circumvented issues arising from comparing systems differing in number of categories by introducing information-theoretic measures to compare partitions. Cluster evaluation functions in community detection are, by and large, based on information-theoretic concepts.

We define a gender system $\mathsf{A}$'s entropy as:

\[ H(\mathsf{A}) \overset{\text{def}}{=} -\sum_{A \in \mathsf{A}} \frac{|A|}{N} \log \frac{|A|}{N} \tag{1} \]

where we observe the standard convention that $0 \log 0 \overset{\text{def}}{=} 0$. How is this notion of entropy for partitions related to the entropy of a probability distribution? These are connected through maximum-likelihood estimation (MLE). In our case, the maximum-likelihood estimate of the probability that an inanimate noun is located in a given gender class turns out to be the size of that class divided by N, e.g., we have $p_{\text{MLE}}(\text{MSC}) = |A_{\text{MSC}}|/N$. Recall that the Shannon entropy of a distribution p is defined as

\[ H(p) \overset{\text{def}}{=} -\sum_{a \in \mathsf{A}} p(a) \log p(a) \tag{2} \]

We have equality between Eq. 1 and Eq. 2 when we plug the definition of $p_{\text{MLE}}$ into Eq. 2, which is why Eq. 1 is considered the entropy of a partition.

4.1 Mutual information (MI)

Mutual information is a workhorse of quantifying similarity between two probability distributions, measuring how much information (in bits) is shared between two random variables. Now we consider the case of the similarity between two partitions. If we have two partitions $\mathsf{A}$ and $\mathsf{B}$, we may generalize the entropy of a single partition to the mutual information between two partitions as follows:

\[ \begin{aligned} I(\mathsf{A};\mathsf{B}) &\overset{\text{def}}{=} \sum_{A \in \mathsf{A}} \sum_{B \in \mathsf{B}} \frac{|A \cap B|}{N} \log \frac{N\,|A \cap B|}{|A|\,|B|} \\ &= \sum_{a \in \mathsf{A}} \sum_{b \in \mathsf{B}} p_{\text{MLE}}(a, b) \log \frac{p_{\text{MLE}}(a, b)}{p_{\text{MLE}}(a)\, p_{\text{MLE}}(b)} \end{aligned} \tag{3} \]

As the equality above shows, we find, again, that Eq. 3 has an interpretation as the standard definition of probabilistic mutual information applied to

the maximum-likelihood estimate of the joint partition membership distribution. To foreshadow future discussion, we note that the mutual information between any two clusterings on N items is bounded below by 0 and above by log N. Beyond its interpretation as shared information, mutual information gives little in terms of interpretability: It has no consistent reference points, beyond that the minimum possible MI is zero. Therefore, several variants of MI are preferred in community detection.

Normalization. Furthermore, MI is often normalized to increase its interpretability, as:

\[ \mathrm{NMI}(\mathsf{A}, \mathsf{B}) \overset{\text{def}}{=} \frac{I(\mathsf{A}; \mathsf{B})}{\sqrt{H(\mathsf{A})\, H(\mathsf{B})}} \tag{4} \]

While our denominator is the geometric mean, any generalized mean of the partitions' entropies can be used as a bound to normalize MI (Yang et al., 2016). As we divide bits by bits (or nats by nats), normalized mutual information (NMI) is unitless, unlike entropy and MI. It expresses the amount of revealed information as a percentage. Unfortunately, NMI has both theoretical and empirical flaws (Peel et al., 2017; McCarthy, 2017; McCarthy et al., 2019b); namely, it suffers from the finite-size effect: the baseline rises as N increases. (Recall that MI is bounded above by log N.) The reward for guessing even the trivial partition into singleton clusters rises, making the measure, like vanilla mutual information (as in Eq. 3), difficult to interpret. Because of these flaws, we exclude NMI in favor of the following MI-based measures that are both more interpretable and more pertinent.

4.2 Adjusted mutual information (AMI)

Spurious correlations between two gender systems can mislead the results, showing a higher-than-deserved agreement. We select a measure which adjusts for these chance clusterings: the adjusted mutual information (AMI; Vinh et al., 2010). We employ a recent variant (Gates and Ahn, 2017; McCarthy et al., 2019b):

\[ \mathrm{AMI}(\mathsf{A}, \mathsf{B}) \overset{\text{def}}{=} \frac{I(\mathsf{A}; \mathsf{B}) - \mathbb{E}\left[I(\mathsf{A}'; \mathsf{B}')\right]}{\max I(\mathsf{A}'; \mathsf{B}') - \mathbb{E}\left[I(\mathsf{A}'; \mathsf{B}')\right]} \tag{5} \]

where the expectation is taken under the uniform distribution over Ω, all clusterings on N items with $K_{\mathsf{A}}$ and $K_{\mathsf{B}}$ clusters (Gates and Ahn, 2017). The maximum is also taken over Ω. This distinguishes it from the textbook form of AMI, where the expectation is over a subset of Ω: only those partitions whose community sizes match those of the arguments. As we have subtracted the mean, the expected numerator is centered at 0; the denominator serves to re-normalize the measure. The measure thus compares the mutual information for the observed pair of gender systems to all others within their family. Using AMI also lends some beneficial properties in cluster evaluation:

Remark 1. AMI has a fixed maximum score of 1.0 for exactly matching gender systems.

Remark 2. The mathematical expectation of AMI is 0, so spurious correlations are not rewarded.

4.3 Variation of Information (VI)

Unlike MI and AMI, Variation of Information (Meilă, 2003) is a distance (metric), meaning each language becomes a point in this metric space, whose underlying set is all possible partitions of N items. VI is useful because it satisfies the triangle inequality (Meilă, 2007). Additionally, as a metric, it guarantees identity of indiscernibles: if two partitions are at a distance 0, then they are identical. VI is defined as

\[ \mathrm{VI}(\mathsf{A}, \mathsf{B}) \overset{\text{def}}{=} H(\mathsf{A} \mid \mathsf{B}) + H(\mathsf{B} \mid \mathsf{A}) \tag{6} \]

and is the sum of two conditional entropies. It can also be normalized by dividing by the joint entropy, $H(\mathsf{A}, \mathsf{B})$. (This measure would be topologically equivalent to Eq. 6.) We do not adjust VI for chance. This would deprive it of its metric property, because of the subtraction in the numerator.

5 Data

Swadesh lists & NorthEuraLex. Our starting point is Swadesh lists (Buck, 1949; Swadesh, 1950, 1952, 1955, 1971/2006): concept-aligned minimal inventories of common, "core" or "basic" terminology thought to be "frequent, universal, and resistant to change over time" (Kaplan, 2017). For our purposes, concept-aligned sources are appealing, because they ensure a consistently present base set $\mathcal{A}$ across all our languages, maximizing comparability. We also use the NorthEuraLex dataset (Dellert and Jäger, 2017), essentially an extended Swadesh list covering 1016 concepts, to further validate our findings on the original Swadesh lists. Because grammatical gender on animate nouns has the added complication that it generally matches

"natural" gender (or expressed preference) of living creatures across languages (Corbett, 1991; Romaine, 1997; Kramer, 2015), we omit animate nouns to remove semantic confounds from our investigation of cross-lingual gender assignments. We now take the base set $\mathcal{A}$ from the larger concept list in a broader swath of languages. We have 69 inanimate nouns in the Swadesh lists and 387 in NorthEuraLex.

Gender labels. We choose a corpus-based approach to identifying a word's gender. We study the gendered languages available in Universal Dependencies v2.3² (Nivre et al., 2018), resulting in a sample of 20 (Hebrew, Greek, Hindi, Lithuanian, Latvian, Polish, Croatian, Slovak, Ukrainian, Russian, Slovenian, Bulgarian, Swedish, Danish, Romanian, French, Catalan, Italian, Spanish, Portuguese). This sample is somewhat skewed based on family, with all but one language (Hebrew) belonging to Indo-European. All are members of the Standard Average European Sprachbund (Whorf, 1997; Haspelmath, 2001), except Hebrew, Hindi, and Greek, which are the only representatives of their groups. Why Indo-European? First, we needed aligned concept lists with gender and [...] annotations in languages which possess a gender system. Second, it is natural to test unsupervised methods on a sample with a known ground truth. Indo-European phylogeny, while not without its debates, is relatively well studied, making it a strong testbed for verifying our methods. Future work can enable greater linguistic diversity by scraping annotated dictionaries.

² German and [...] were excluded because of complications arising through alignment to annotated dictionaries.

Gender labels are drawn from the MarMoT contextual morphological tagger (Müller et al., 2013) trained on Universal Dependencies corpora (Nivre et al., 2018) in each language and applied to Wikipedia in that language. In the case of epicene words and [...], we select the consensus gender (Cucerzan and Yarowsky, 2003b) for the character sequence, i.e., its most frequent gender label. We fill gaps manually using bilingual English-target language dictionaries. When multiple words are given to express a concept in a language, we select the most frequent.

6 Experiments

We apply each measure to the gender systems from our Swadesh lists, then validate our results on NorthEuraLex. We apply validation to ensure that the measures are picking up robust similarities as opposed to just reflecting properties of particular word lists. (See github.com/aryamccarthy/gender-partitions.) We then reconstruct phylogenetic trees of the languages involved. The trees show high agreement with ground truth, compared to random baselines.

6.1 Similarity measures

We apply the three evaluation measures (§4) to the partitions computed for our languages over the common conceptual lexicon. Figure 2 shows the pairwise scores for languages' gender systems (on the Swadesh list) as partitions. The rows and columns have been reordered according to a "ground truth" of pairwise distances (Serva and Petroni, 2008), for reasons we will explain in the next subsection.³ Regardless of measure, a few clusters emerge along the diagonal. The (Balto-)Slavic branch (i.e., Polish, Croatian, Slovak, Ukrainian, Russian, Slovenian, and Bulgarian) is present at the top left, and the Romance branch (i.e., French, Catalan, Italian, Spanish, and Portuguese) appears at the bottom right. Outside of these blocks, AMI shows us that the similarity of gender systems is no better than a chance relationship; at the whole-lexicon level, influence from the common Indo-European root is absent.

³ Selecting a ground truth hierarchy of languages is a contentious and sometimes political matter; even well-accepted trees suffer from criticism (Ringe et al., 2002; Gray and Atkinson, 2003; Greenhill, 2011; Pereltsvaig and Lewis, 2015).

We also apply our measures to the wider swath of languages and larger aligned inventories of NorthEuraLex. The Romance languages again form a block, as do the Balto-Slavic languages. Figure 3 shows similar separation into families for both MI (a) and AMI (c), though this is less pronounced for Variation of Information (b). Variation of Information shows some surprising associations not present in AMI, such as associating Hebrew and Slovene highly with the Romance block.

Romanian deserves particular note: It is a Romance language but has been geographically isolated from its family for over a millennium, instead sharing membership in the Balkan Sprachbund with Greek and Bulgarian. As such, we may ask whether its phylogeny or its areal effects are reflected in the gender similarity metrics. While Romanian differs from other Romance languages in many ways (Dinu and Dinu, 2005;

[Figure 2 image: three heatmaps, (a) Mutual information, (b) Variation of Information, (c) Adjusted Mutual Information. Rows and columns in each panel: pl, hr, sk, uk, ru, sl, bg, ro, fr, ca, it, es, pt.]

Figure 2: Heatmaps of pairwise scores on the inanimate Swadesh list under each pairwise similarity measure, grouped by Levenshtein-distance ground-truth phylogenetic trees (Serva and Petroni, 2008). Appendix A gives language codes.
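Each cell of such a heatmap is one pairwise score between two languages' labelings of the same concepts. As a minimal sketch (not the authors' released code), the entropy of Eq. 1, the mutual information of Eq. 3, the normalization of Eq. 4, and the Variation of Information of Eq. 6 can be computed from two aligned label lists. The function names and the toy label lists, which transcribe Figure 1, are assumptions made for illustration; natural logarithms (nats) are used, and AMI is left out because its expectation over Ω needs more machinery.

```python
import math
from collections import Counter


def entropy(labels):
    """Partition entropy (Eq. 1) in nats; empty classes never appear."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (Eq. 3) between two labelings of the same items."""
    n = len(a)
    size_a, size_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))  # |A ∩ B| for every non-empty intersection
    return sum((n_ab / n) * math.log(n * n_ab / (size_a[x] * size_b[y]))
               for (x, y), n_ab in joint.items())


def nmi(a, b):
    """MI normalized by the geometric mean of the entropies (Eq. 4)."""
    denom = math.sqrt(entropy(a) * entropy(b))
    return mutual_information(a, b) / denom if denom > 0 else 0.0


def variation_of_information(a, b):
    """VI (Eq. 6), using H(A|B) + H(B|A) = H(A) + H(B) - 2 I(A;B)."""
    return entropy(a) + entropy(b) - 2 * mutual_information(a, b)


# Figure 1, concept order: fruit, grass, moon, tree, flower, sun.
german = ["n", "n", "m", "m", "f", "f"]
spanish = ["f", "m", "f", "m", "f", "m"]

if __name__ == "__main__":
    print(mutual_information(german, spanish))
    print(variation_of_information(german, spanish))
```

On the six concepts of Figure 1, every German class splits evenly across the two Spanish classes, so the mutual information is 0 and VI equals H(A) + H(B) = log 6. Intersections that are empty simply do not occur in the joint counter, which implements the 0 log 0 = 0 convention.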

[Figure 3 image: three heatmaps, (a) Mutual information, (b) Variation of Information, (c) Adjusted Mutual Information. Rows and columns in each panel: he, el, hi, lt, lv, pl, hr, sk, uk, ru, sl, bg, sv, da, ro, fr, ca, it, es, pt.]

Figure 3: Heatmaps of pairwise scores on inanimate NorthEuraLex under each pairwise similarity measure, grouped by Levenshtein-distance ground-truth phylogenetic trees (Serva and Petroni, 2008).

Dobrovie-Sorin, 2011) (e.g., it possesses three genders instead of two⁴), it is still more similar to its phylogenetically related Romance relatives than to Balto-Slavic languages. This is easiest to discern in the Variation of Information plot: weak connections surface between Romanian and both Slovene and Ukrainian, but the majority of the Balto-Slavic languages are quite distant from it.

⁴ This claim can be debated (Bateman and Polinsky, 2010): The neuter gender manifests as masculine when singular and feminine when plural (Corbett, 1991).

6.2 Phylogeny

Inspired by the findings in the previous section (especially the high similarity among Romance languages), we further validate our measure, asking whether the resulting similarities reflect known phylogenetic ground truth, namely the developmental history of Indo-European languages. Obviously, there are many more facets to languages' relatedness than their gender systems, so it is interesting to find signal this strong from a single category. Rabinovich et al. (2017) cluster languages based on simple features of their translations into a common target language to craft phylogenetic trees. We take a similar approach, asking whether the pairwise similarities of gender systems are enough to reveal phylogenetic truth or some other relationship. We create phylogenetic trees through agglomerative hierarchical clustering, using both VI and one minus the AMI as distance measures. We use the weighted pair group method of averages (Sokal and Michener, 1958; Müllner, 2011) as implemented in the SciPy library (Jones et al., 2001).

The resulting trees ("dendrograms") can be visualized showing the sequence of cluster formations during hierarchical clustering (Figure 4 and Figure 5). In a dendrogram, any ordering of the leaves maintains fidelity to the computed tree structure, so long as the topology is still correct. We choose to improve upon this by optimally ordering the leaves, swapping subtrees to convey similarity both within and across subtrees (Bar-Joseph et al., 2001). On the whole, our dendrograms recover known phylogenetic relationships between the languages we consider; this serves to largely validate our measures as having uncovered some meaningful similarity between the languages' gender systems. Indeed, in every case, we reconstruct the subtree of Romance languages with high fidelity. The only difference is that on NorthEuraLex, Catalan is more similar to Portuguese and Spanish than Italian is. In all trees, Romanian is grouped with the Romance languages, matching its ancestry. The Balto-Slavic subtree is less clear. MI and AMI recover similarities between Russian and Ukrainian (Eastern Slavic), Slovak and Polish (Western Slavic), and Croatian and Bulgarian (South Slavic) fairly well. Further, the Slavic and Baltic languages are properly joined to form a Balto-Slavic group. We take this as validation of our method.

[Figure 4 image: three dendrograms, (a) Mutual information, (b) Variation of Information, (c) Adjusted Mutual Information. Leaf order: (a) uk, ru, sl, pl, hr, sk, bg, it, pt, es, ca, fr, ro; (b) ro, fr, ca, es, pt, it, uk, sk, hr, pl, sl, ru, bg; (c) uk, ru, sk, hr, pl, sl, bg, it, pt, es, ca, fr, ro.]

Figure 4: Phylogenies for inanimate Swadesh under each similarity measure. Colors label levels of similarity, with green being most similar, followed by red, then blue (e.g., blue is >70% of max value).

[Figure 5 image: three dendrograms, (a) Mutual information, (b) Variation of Information, (c) Adjusted Mutual Information. Leaf order: (a) he, el, lv, lt, uk, ru, pl, sk, sl, bg, hr, ro, pt, es, ca, it, fr, hi, da, sv; (b) el, ro, pt, es, ca, it, fr, he, hi, da, sv, lt, lv, ru, uk, pl, sk, sl, bg, hr; (c) hi, da, sv, lt, lv, fr, it, ca, es, pt, ro, uk, hr, bg, sk, pl, ru, sl, el, he.]

Figure 5: Phylogenies for inanimate NorthEuraLex under each similarity measure. Colors label levels of similarity, with green being most similar, followed by red, cyan, and dark blue (e.g., dark blue is >70% of max value).

When measuring with Variation of Information, though, things go awry. While it correctly pairs Russian and Ukrainian and recreates the same Romance subtree as the other measures, there are some major discrepancies. Hebrew, the only non-Indo-European language, is found to be closer to the Romance languages than to the Balto-Slavic cluster. Hindi's closeness to others is similarly exaggerated. In fact, everything seems to be close for VI, except Greek! As the other measures better capture the phylogeny, we suggest that similarity measured with Variation of Information is ill suited to our main task.

6.3 Quantitative Evaluation

Our proposals to measure similarity of gender systems give rise to dendrograms that resemble phylogenetic trees. But how much so? We answer this by measuring the similarity to the ground truth tree. To measure the similarity of two trees $T_1$ and $T_2$, we use Rabinovich et al. (2017)'s extension of the L2 norm to leaf pair distance. Here, we sum the number of edges on a path between two nodes to get their distance d. We then compute the total distance as the sum of squared differences: $\sum_{i \neq j} \left(d_{T_1}(\ell_i, \ell_j) - d_{T_2}(\ell_i, \ell_j)\right)^2$, where each $\ell_i$ identifies one language (or leaf).

We show that the tree built from any of our three measures is significantly more like the ground truth (from Serva and Petroni, 2008) than chance by comparing the computed trees to 1000 randomly generated trees on the same set of languages. (We report mean and standard deviation of distance from the ground truth. We use Rabinovich et al. (2017)'s unweighted distance.) For each combination of dataset and measure, we use McNemar's test for significance and find p < 0.0001.
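The unweighted leaf-pair distance described in §6.3 can be sketched as follows. This is an illustrative reading of the formula, not the code the authors used: trees are represented here as nested tuples with string leaves (a hypothetical encoding), each unordered leaf pair is counted once (summing over ordered pairs i ≠ j would double the value), and the two four-language toy trees are invented for the example.

```python
from itertools import combinations


def leaf_paths(tree, prefix=()):
    """Map each leaf to the sequence of child indices leading to it from the root."""
    if isinstance(tree, str):
        return {tree: prefix}
    paths = {}
    for i, child in enumerate(tree):
        paths.update(leaf_paths(child, prefix + (i,)))
    return paths


def edge_distance(path_a, path_b):
    """Number of edges on the path between two leaves of the same tree."""
    common = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        common += 1  # shared ancestors below the root
    return len(path_a) + len(path_b) - 2 * common


def tree_distance(t1, t2):
    """Sum of squared differences of leaf-pair edge distances (unordered pairs)."""
    p1, p2 = leaf_paths(t1), leaf_paths(t2)
    if p1.keys() != p2.keys():
        raise ValueError("trees must share the same leaves")
    total = 0
    for a, b in combinations(sorted(p1), 2):
        diff = edge_distance(p1[a], p1[b]) - edge_distance(p2[a], p2[b])
        total += diff ** 2
    return total


# Hypothetical toy trees over four language codes.
gold = (("es", "pt"), ("ru", "uk"))
guess = (("es", "ru"), ("pt", "uk"))

if __name__ == "__main__":
    print(tree_distance(gold, gold))   # identical trees
    print(tree_distance(gold, guess))  # siblings swapped across branches
```

In the toy example, four of the six leaf pairs change their distance by two edges, so the sketch returns 16 for the mismatched tree and 0 for the identical one.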

7 Related Work

There is a baffling dearth of work on quantifying similarity of gender systems. There is, however, ample work on characterizing intensional gender systems, i.e., sets of grammatical rules, that can be divided (Corbett, 1991) into sets of rules based on [...] (Tucker et al., 1977; Gregersen, 1967; Wald, 1975; Plank, 1986, i.a.) and on [...] (Bidot, 1925; Tucker et al., 1977; Newman, 1979; Hayward and Corbett, 1988; Marchese, 1988). Intensional approaches, particularly those with typological leanings, contribute very fine-grained research on particular pairwise similarities for particular languages and dialects. Although we cannot survey these in detail here, we would love for our measures to contribute findings that can complement these approaches.

Relatedly, other recent works have investigated grammatical gender and other types of noun classification systems with information-theoretic tools. For example, Williams et al. (2020b) use mutual information to quantify the strength of the relationships between declension class, gender, distributional semantics, and orthographic form, respectively, in several languages. Williams et al. (2020a), which is arguably closest to this work, measure the strength of semantic relationships between inanimate nouns and the verbs or adjectives that take those nouns as arguments; that work can be seen as comparing the similarity of nouns clustered by their gender with the same nouns clustered by the adjectives that modify them or the verbs that take them as arguments.

Although we adopt information-theoretic measures here, there are two other major classes of cluster evaluation measures: set-matching measures, and pair-counting measures, which tally which pairs of items are in the same or different communities. One popular set-matching measure in information retrieval, purity (Manning et al., 2008), is asymmetric and biased by the size and number of communities (Danon et al., 2005). Its symmetric form, the F-measure (Artiles et al., 2007), has clear bounds but gives no indication of average-case performance.

The adjusted Rand index (ARI; Hubert and Arabie, 1985) is the preeminent pair-counting measure. It is related to AMI, adjusting the Rand index in the same way that AMI adjusts MI. ARI also computes an expectation, which can be computed over the proper distribution (Gates and Ahn, 2017), but it is empirically better suited to large, balanced clusters. In our case of small and uneven clusters, AMI should be preferred (Romano et al., 2016).

We can only survey a representative handful of the numerous cluster evaluation measures in the limited space we have here. See McCarthy et al. (2019b) for an outline of desiderata for comparing partitions, as well as a general class of appropriate measures, and for further motivation for AMI using a different null model: languages have a fixed number of gender classes, so we select one over N items with K communities, rather than an arbitrary number of communities.

Table 1: Distances of generated trees from gold tree.

Dataset        Measure   Score   St. Dev.
Swadesh        MI          344   -
               VI          312   -
               AMI         344   -
               Random     1184   133.4
NorthEuraLex   MI         1231   -
               VI         1164   -
               AMI        1548   -
               Random     2531   209.6

8 Conclusion

We have presented a clean method for comparing grammatical gender systems across languages: By defining gender classes extensionally, we reduced the problem to cluster evaluation from community detection. We validate three metrics by recovering known phylogenetic relationships in our languages, with measurable success. Separate Indo-European branches are no more similar than chance.

We emphasize that our methods are not specifically tailored to gender systems. One could apply them more broadly to other aspects of the lexicon, e.g., to Indo-European [...] classes, Bantu noun classes, or diachronic time slices of a single language's gender system, data permitting. A related challenge is East and Southeast Asian classifier systems, which associate nouns with classifiers based largely on the semantic properties of the nouns (Kuo and Sera, 2009; Zhan and Levy, 2018; Liu et al., 2019). They display more idiolectal variation, and often more than one classifier can accompany a given noun (Hu, 1993), unlike for gender (where this is rare). We note that we could further extend our measures to fuzzy partitions, which remain less explored in community detection, but are a promising avenue for future work.

Acknowledgments

We thank Tongfei Chen for comments on the Slavic languages, Jean-Gabriel Young for suggesting that we consider Variation of Information, and Johannes Bjerva for providing us with code to compute the tree distance. We also thank Tiago Pimentel for his help with proofreading. Finally, we would like to thank Eleanor Chodroff for providing useful insights during the formulation of the problem.

References

Javier Artiles, Julio Gonzalo, and Satoshi Sekine. 2007. The SemEval-2007 WePS evaluation: Establishing a benchmark for the Web people search task. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 64–69. Association for Computational Linguistics.

Ziv Bar-Joseph, David K. Gifford, and Tommi S. Jaakkola. 2001. Fast optimal leaf ordering for hierarchical clustering. Bioinformatics, 17:S22–S29.

Nicoleta Bateman and Maria Polinsky. 2010. Romanian as a two-gender language. Hypothesis A/Hypothesis B: Linguistic Explorations in Honor of David M. Perlmutter.

Emile Bidot. 1925. La clef du genre des substantifs français: méthode dispensant d'avoir recours au dictionnaire. Imprimerie Nouvelle.

Carl Buck. 1949. A Dictionary of Selected Synonyms in the Principal Indo-European Languages. University of Chicago Press.

Raymond B. Cattell. 1945. The description of personality: Principles and findings in a factor analysis. The American Journal of Psychology, 58(1):69–90.

Greville G. Corbett. 1991. Gender. Cambridge University Press, Cambridge.

Greville G. Corbett. 2013. Number of genders. In Matthew S. Dryer and Martin Haspelmath, editors, The World Atlas of Language Structures Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Silviu Cucerzan and David Yarowsky. 2003a. Minimally supervised induction of grammatical gender. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics.

Silviu Cucerzan and David Yarowsky. 2003b. Minimally supervised induction of grammatical gender. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 40–47.

Leon Danon, Albert Díaz-Guilera, Jordi Duch, and Alex Arenas. 2005. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008–P09008.

Johannes Dellert and Gerhard Jäger. 2017. NorthEuraLex. Version 0.9.

Anca Dinu and Liviu P. Dinu. 2005. On the syllabic similarities of Romance languages. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 785–788. Springer.

Carmen Dobrovie-Sorin. 2011. The Syntax of Romanian: Comparative studies in Romance, volume 40. Walter de Gruyter.

Harold Edson Driver and Alfred Louis Kroeber. 1932. Quantitative expression of cultural relationships, volume 31. University of California Press.

Istvan Fodor. 1959. The origin of grammatical gender. Lingua, 8:186–214.

Alexander J. Gates and Yong-Yeol Ahn. 2017. The impact of random models on clustering similarity. Journal of Machine Learning Research, 18(87):1–28.

Russell D. Gray and Quentin D. Atkinson. 2003. Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature, 426:435.

Simon J. Greenhill. 2011. Levenshtein distances fail to identify language relationships accurately. Computational Linguistics, 37(4):689–698.

Edgar A. Gregersen. 1967. Prefix and pronoun in Bantu. Published at the Waverly Press by Indiana University, Bloomington.

Martin Haspelmath. 2001. The European linguistic area: Standard average European. In Language typology and language universals (Handbücher zur Sprach- und Kommunikationswissenschaft), pages 1492–1510. de Gruyter.

Richard J. Hayward and Greville G. Corbett. 1988. Resolution rules in Qafar. Linguistics, 26:259–279.

Qian Hu. 1993. The Acquisition of Chinese Classifiers by Young Mandarin-speaking Children. Ph.D. thesis, Boston University.

Lawrence Hubert and Phipps Arabie. 1985. Comparing partitions. Journal of Classification, 2(1):193–218.

Muhammad Hasan Ibrahim. 2014. Grammatical gender: Its origin and development, volume 166. Walter de Gruyter.

N. Jardine, P. H. P. S. N. Jardine, and R. Sibson. 1971. Mathematical Taxonomy. Wiley Series in Probability and Mathematical Statistics. Wiley.

Eric Jones, Travis Oliphant, Pearu Peterson, et al. 2001. SciPy: Open source scientific tools for Python.

Judith Kaplan. 2017. From lexicostatistics to lexomics: Basic vocabulary and the study of language prehistory. Osiris, 32(1):202–223.

Ruth T. Kramer. 2015. The Morphosyntax of Gender, volume 58. Oxford University Press.

Jenny Y. Kuo and Maria D. Sera. 2009. Classifier effects on human categorization: the role of shape classifiers in Mandarin Chinese. Journal of East Asian Linguistics, 18:1–19.

Shijia Liu, Hongyuan Mei, Adina Williams, and Ryan Cotterell. 2019. On the idiosyncrasies of the Mandarin Chinese classifier system. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4100–4106, Minneapolis, Minnesota. Association for Computational Linguistics.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.

Lynell Marchese. 1988. Noun classes and agreement systems in Kru: A historical approach. Agreement in Natural Language: Approaches, Theories, Descriptions. Stanford: Center for the Study of Language and Information, pages 323–341.

Arya D. McCarthy. 2017. Gridlock in networks: The leximin method for hierarchical community detection. Master's thesis, Southern Methodist University.

Arya D. McCarthy, Tongfei Chen, and Seth Ebner. 2019a. An exact no free lunch theorem for community detection. In Complex Networks and Their Applications VIII, pages 176–187, Lisbon, Portugal. Springer International Publishing.

Arya D. McCarthy, Tongfei Chen, Rachel Rudinger, and David W. Matula. 2019b. Metrics matter in community detection. In Complex Networks and Their Applications VIII, pages 164–175, Lisbon, Portugal. Springer International Publishing.

Marina Meilă. 2003. Comparing clusterings by the variation of information. In Learning Theory and Kernel Machines, pages 173–187, Berlin, Heidelberg. Springer Berlin Heidelberg.

Marina Meilă. 2007. Comparing clusterings—an information based distance. Journal of Multivariate Analysis, 98(5):873–895.

Thomas Müller, Helmut Schmid, and Hinrich Schütze. 2013. Efficient higher-order CRFs for morphological tagging. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 322–332, Seattle, Washington, USA. Association for Computational Linguistics.

Daniel Müllner. 2011. Modern hierarchical, agglomerative clustering algorithms. arXiv preprint arXiv:1109.2378.

Vivi Nastase and Marius Popescu. 2009. What's in a name? In some languages, grammatical gender. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 1368–1377. Association for Computational Linguistics.

Paul Newman. 1979. Explaining Hausa feminines. Studies in African Linguistics.

Joakim Nivre, Mitchell Abrams, Željko Agić, Lars Ahrenberg, Lene Antonsen, Katya Aplonova, Maria Jesus Aranzabe, Gashaw Arutie, Masayuki Asahara, Luma Ateyah, et al. 2018. Universal dependencies 2.3. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

M. Pagel, C. Renfrew, A. McMahon, and L. Trask. 2000. Time depth in historical linguistics. C. Renfrew, A. McMahon, and L. Trask, editors, pages 189–207.

Leto Peel, Daniel B. Larremore, and Aaron Clauset. 2017. The ground truth about metadata and community detection in networks. Science Advances, 3(5).

A. Pereltsvaig and M. W. Lewis. 2015. The Indo-European Controversy. Cambridge University Press.

Frans Plank. 1986. Paradigm size, morphological typology, and universal economy. Folia Linguistica, 20(1-2):29–48.

Ella Rabinovich, Noam Ordan, and Shuly Wintner. 2017. Found in translation: Reconstructing phylogenetic language trees from translations. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 530–540. Association for Computational Linguistics.

Don Ringe, Tandy Warnow, and Ann Taylor. 2002. Indo-European and computational cladistics. Transactions of the Philological Society, 100(1):59–129.

Suzanne Romaine. 1997. Gender, grammar, and the space in between. Pragmatics and Beyond: New Series, pages 51–76.

Simone Romano, Nguyen Xuan Vinh, James Bailey, and Karin Verspoor. 2016. Adjusting for chance clustering comparison measures. Journal of Machine Learning Research, 17(1):4635–4666.

Maurizio Serva and Fabio Petroni. 2008. Indo-European languages tree by Levenshtein distance. EPL (Europhysics Letters), 81(6):68005.

Robert Reuven Sokal and Charles Duncan Michener. 1958. A Statistical Method for Evaluating Systematic Relationships. University of Kansas science bulletin. University of Kansas.

Morris Swadesh. 1950. Salish internal relationships. International Journal of American Linguistics, 16(4):157–167.

Morris Swadesh. 1952. Lexico-statistic dating of prehistoric ethnic contacts: with special reference to North American Indians and Eskimos. Proceedings of the American philosophical society, 96(4):452–463.

Morris Swadesh. 1955. Towards greater accuracy in lexicostatistic dating. International Journal of American Linguistics, 21(2):121–137.

Morris Swadesh. 1971/2006. The origin and diversification of language. Chicago: Aldine.

G. R. Tucker, W. E. Lambert, and A. Rigault. 1977. The French speaker's skill with grammatical gender: an example of rule-governed behavior. Janua Linguarum: Series didactica. Mouton.

Nguyen Xuan Vinh, Julien Epps, and James Bailey. 2010. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11:2837–2854.

Benji Wald. 1975. Animate concord in Northeast Coastal Bantu: Its linguistic and social implications as a case of grammatical convergence. Studies in African linguistics, 6(3):267–314.

Benjamin Lee Whorf. 1997. The Relation of Habitual Thought and Behavior to Language, pages 443–463. Macmillan Education UK, London.

Adina Williams, Damian Blasi, Lawrence Wolf-Sonkin, Hanna Wallach, and Ryan Cotterell. 2019. Quantifying the semantic core of gender systems. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5733–5738, Hong Kong, China. Association for Computational Linguistics.

Adina Williams, Ryan Cotterell, Lawrence Wolf-Sonkin, Damián Blasi, and Hanna Wallach. 2020a. On the relationships between the grammatical genders of inanimate nouns and their co-occurring adjectives and verbs. Transactions of the Association for Computational Linguistics.

Adina Williams, Tiago Pimentel, Hagen Blix, Arya D. McCarthy, Eleanor Chodroff, and Ryan Cotterell. 2020b. Predicting declension class from form and meaning. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6682–6695, Online. Association for Computational Linguistics.

Zhao Yang, René Algesheimer, and Claudio J. Tessone. 2016. A comparative analysis of community detection algorithms on artificial networks. Scientific Reports, 6(1):30750.

Meilin Zhan and Roger Levy. 2018. Comparing theories of speaker choice using a model of classifier production in Mandarin Chinese. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1997–2005, New Orleans, Louisiana. Association for Computational Linguistics.

A Languages

While there are over 70 languages in the Universal Dependencies treebanks, only a select handful possess grammatical gender. We use 20 languages in the Universal Dependencies corpora that have gender and are also present in our concept lists. Below find their ISO 639-1 codes (used in the paper to conserve space), their ISO 639-3 codes (widely preferred), their major family (in the case of Hebrew) or subfamily (in the case of our Indo-European languages), and the number of grammatical genders they have:

Language     ISO 639-1   ISO 639-3   (Sub-)Family    Genders
Bulgarian    bg          bul         Balto-Slavic    3
Catalan      ca          cat         Romance         2
Danish       da          dan         Germanic        2
Greek        el          ell         Hellenic        3
Spanish      es          spa         Romance         2
French       fr          fra         Romance         2
Hebrew       he          heb         Semitic         2
Hindi        hi          hin         Indo-Iranian    2
Croatian     hr          hrv         Balto-Slavic    3
Italian      it          ita         Romance         2
Lithuanian   lt          lit         Balto-Slavic    2
Latvian      lv          lav         Balto-Slavic    2
Polish       pl          pol         Balto-Slavic    3
Portuguese   pt          por         Romance         2
Romanian     ro          ron         Romance         3
Russian      ru          rus         Balto-Slavic    3
Slovak       sk          slk         Balto-Slavic    3
Slovene      sl          slv         Balto-Slavic    3
Swedish      sv          swe         Germanic        2
Ukrainian    uk          ukr         Balto-Slavic    3

Table 2: Languages, with their subfamilies and ISO codes, used in this study.
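To make the extensional view concrete, the following is a minimal sketch (not the code used in our experiments) that encodes the two toy gender systems of Figure 1 as label sequences over the same six concepts and compares them with mutual information, variation of information, and purity. The function names and the single-letter gender labels are illustrative assumptions; AMI is omitted because it additionally requires an expectation under a null model.

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy (in nats) of a partition given as one label per item."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information (in nats) between two partitions of the same items."""
    n = len(a)
    size_a, size_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))  # overlap sizes between communities
    return sum((c / n) * log(c * n / (size_a[x] * size_b[y]))
               for (x, y), c in joint.items())

def variation_of_information(a, b):
    """VI = H(a) + H(b) - 2 * MI(a, b); zero when the partitions coincide."""
    return entropy(a) + entropy(b) - 2 * mutual_information(a, b)

def purity(a, b):
    """Share of items lying in the largest b-overlap of their a-community.

    Purity is asymmetric: purity(a, b) generally differs from purity(b, a).
    """
    best = Counter()
    for (x, _), c in Counter(zip(a, b)).items():
        best[x] = max(best[x], c)
    return sum(best.values()) / len(a)

# Concepts in a fixed order: fruit, grass, moon, tree, flower, sun (Figure 1).
# Label names are arbitrary; only the grouping matters.
german = ["n", "n", "m", "m", "f", "f"]
spanish = ["f", "m", "f", "m", "f", "m"]

mi = mutual_information(german, spanish)
vi = variation_of_information(german, spanish)
print(f"MI = {mi:.4f}, VI = {vi:.4f}")
print(f"purity(de, es) = {purity(german, spanish):.4f}")
print(f"purity(es, de) = {purity(spanish, german):.4f}")
```

In this toy example every German community is split evenly across the two Spanish communities, so the mutual information is zero and the variation of information reduces to the sum of the two entropies. The two purity values differ, which illustrates the asymmetry noted in Section 7.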
