Measuring the Similarity of Grammatical Gender Systems by Comparing Partitions
Arya D. McCarthy, Adina Williams, Shijia Liu, David Yarowsky, Ryan Cotterell
Johns Hopkins University, Facebook AI Research, ETH Zurich

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 5664–5675, November 16–20, 2020. © 2020 Association for Computational Linguistics

Abstract

A grammatical gender system divides a lexicon into a small number of relatively fixed grammatical categories. How similar are these gender systems across languages? To quantify the similarity, we define gender systems extensionally, thereby reducing the problem of comparisons between languages' gender systems to cluster evaluation. We borrow a rich inventory of statistical tools for cluster evaluation from the field of community detection (Driver and Kroeber, 1932; Cattell, 1945) that enable us to craft novel information-theoretic metrics for measuring similarity between gender systems. We first validate our metrics, then use them to measure gender system similarity in 20 languages. Finally, we ask whether our gender system similarities alone are sufficient to reconstruct historical relationships between languages. Towards this end, we make phylogenetic predictions on the popular, but thorny, problem from historical linguistics of inducing a phylogenetic tree over extant Indo-European languages. Languages on the same branch of our phylogenetic tree are notably similar, whereas languages from separate branches are no more similar than chance.

Figure 1: Two gender systems partitioning N = 6 concepts: (a) German, K = 3; (b) Spanish, K = 2. German (a) has three communities: Obst (fruit) and Gras (grass) are neuter, Mond (moon) and Baum (tree) are masculine, Blume (flower) and Sonne (sun) are feminine. Spanish (b) has two communities: fruta (fruit), luna (moon), and flor (flower) are feminine, and césped (grass), árbol (tree), and sol (sun) are masculine.

1 Introduction

As many as half the world's languages carve nouns up into classes (Corbett, 2013). In these languages, nouns are subdivided into gender categories, which together comprise the language's grammatical gender system. A gender system tends to use a small, fixed number of categories with fixed usage across speakers. Such categories, like 'feminine', can be defined extensionally [1], and are reflected by agreement with other words within the noun phrase (i.e., concord). Gender exhaustively divides up the language's nouns; that is, the union of gender categories is the entire nominal lexicon. Taken this way, a gender system can be viewed as a partition of the lexicon into communities of same-gendered nouns. Given this, a lexical typologist might naturally wish to ask: how similar are two languages' gender systems?

[1] When we talk about the extension of a gender system, we refer to the set of nouns that belong to each gender. This stands in contrast to the intension of that gender system, which would be the governing dynamics that gave rise to the particular partitions observed. See §3.

Using modern statistical and information-theoretic tools from the community detection literature, we offer the first cluster evaluation (Jardine et al., 1971) perspective on grammatical gender, and quantify the overlap of gender systems. We can compare the pairwise overlap of partitions of gender systems using a rich literature of measures, such as mutual information and several variants (Meilă, 2003; Vinh et al., 2010; McCarthy et al., 2019a), which we survey and contrast. Individual partitions of lexicons can also be framed as members of distributions over partitions: for instance, the distribution consisting of all partitions of N items, or of all partitions of N items into K gender clusters, as in Figure 1. For example, Spanish is bi-gendered (with masculine and feminine): a lexicon of Spanish nouns (N = 1000) and their genders would come from a distribution over partitions of N = 1000 items into K = 2 clusters. The same lexicon translated into German, a tri-gendered language, would come from a distribution of N = 1000 items partitioned into K = 3 clusters. Indeed, the fact that languages need different numbers of gender clusters makes this problem non-trivial. From this, we can compare the similarity to what we would expect for the same lexica if nouns were randomly supplied with gender specifications. That way, we can distinguish meaningful relationships from noise.

Armed with the first way to quantify community-wise similarity of gender systems, we ask: Do gender system similarities reflect linguistic phylogeny, or something else, like areal effects? Across 20 languages, we find that our pairwise overlap results measurably align with standard pairwise phylogenetic relationships. Zooming in on Indo-European, we find that we can recast pairwise similarities into an accurate phylogenetic tree, simply by measuring distance between gender systems and performing hierarchical agglomerative clustering (see §6.2).

The primary contribution of this work is a novel metric for lexical typology that measures the pairwise similarity of gender systems. We operationalize gender systems as partitions over a shared set of nouns (§3). We design and evaluate our measurements of gender system similarity under this formulation (§4), drawing on insights from community detection. Then we recover robust phylogenetic relationships between pairs of gender systems by applying these to 20 gendered languages (§6) and find that similarity between Slavic and Romance gender systems does not exceed chance levels. Finally, we show that our quantification of gender system similarity allows us to construct phylogenetic trees that closely resemble those posited for Indo-European in historical linguistics (e.g., Pagel et al. 2000; Gray and Atkinson 2003; Serva and Petroni 2008).

2 Background: Grammatical Gender

Grammatical gender is a highly fixed classification system for nouns. Native speakers rarely make errors in gender recall, which might tentatively argue against tremendous arbitrary variation (Corbett, 1991). Some regularity can surely be found in the associations between gender and various features of the noun, such as orthographic or phonological form, or semantics. With respect to form-based regularities, Cucerzan and Yarowsky (2003a) devise a system for inferring noun gender (masculine or feminine) from contextual clues and character representations, even in inflected forms of the noun. Nastase and Popescu (2009) also find that phonological form can lead to predictability of gender in two three-gender systems. With respect to word semantics, Williams et al. (2019) quantify the relationship between the gender on inanimate nouns and their distributional word vectors.

We can't rely on form. Using phonological or orthographic form to derive gender is fraught with complications: particular to our study, epicene nouns (i.e., words that can appear in multiple genders) can pose issues. In German, only gender concord on the definite article and adjectives can disambiguate the gender of some nouns; the same word-form Band means "volume" when masculine, but "ribbon" when neuter and "band, musical group" when feminine. Another complication with determining gender from the phonological or orthographic form of the noun is that correspondences between form and gender are rarely absolute. For example, even though nouns ending in -e are usually 'feminine' in German, this is not universally the case: Affe and Löwe, for example, are masculine. To sidestep these complications, we abstract away from particular word forms and observe the objective consequences of gender over sets of cross-lingual concepts, i.e., indices not word forms, and instead compare those across gender systems (see Figure 1).

Which gender systems are likely to be similar? Several accounts highlight similarities between the gender systems of phylogenetically-related languages (Fodor, 1959; Ibrahim, 2014) and argue that they are likely to be at least partially due to historical relations between communities and socio-political factors governing language use. Given this, can we recover phylogenetic similarities across gender systems using our methods? If so, this should provide validation that we are indeed measuring at least some of the genuine similarity that exists between gender systems.

3 Gender Systems as Partitions

Any concept can be related to its referents either intensionally or extensionally. While linguistic research has historically sought to uncover the rules for associating a noun with gender in terms of surface features or semantics (see Corbett 1991 for an overview), we take an extensional approach. That is, we treat a gender category in a language solely as the set of words it covers. This maps directly to the notion of a community in the network science task of community detection: A community is defined by membership, not by other arbitrary properties, just as a gender here is defined by the union of all nouns it subsumes, not by its phonological realization or contributions to semantics. The disjoint set of communities forms a partition of the set of nouns: Each noun is a member of one and only one cluster.

Although some epicene nouns are present in our investigated languages (see §2), these are very rare. We thus make the simplifying modeling assumption of identifying each word with only a single gender (in our case, the most frequent).

[...] categories, though, is a well-known problem in the field of community detection. While this looks insurmountable from the gender perspective, where gender categories refer to something we recognize, in community detection, the labels themselves are meaningless: there's no notion of a so-called "Cluster 2". The field has circumvented issues arising from comparing systems differing in number of categories by introducing information-theoretic measures to compare partitions. Cluster evaluation functions in community detection are, by and large, based on information-theoretic concepts.

We define a gender system A's entropy as:
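The extensional view of §3 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `most_frequent_gender` and `to_partition`, the gender labels "m", "f", "n", and the concept indexing are all assumptions made here for illustration. The data encodes only the six concepts of Figure 1, and the observed-gender list for an epicene noun is hypothetical.

```python
from collections import Counter, defaultdict


def most_frequent_gender(observed):
    """Collapse the genders observed for one noun to its most frequent one.

    This mirrors the simplifying assumption that each noun carries a
    single gender; the input list is hypothetical corpus evidence.
    """
    return Counter(observed).most_common(1)[0][0]


def to_partition(assignments):
    """Turn a {concept index: gender label} map into a partition.

    The result is a set of frozensets, so the gender labels themselves
    are discarded: only co-membership of concepts is kept.
    """
    blocks = defaultdict(set)
    for concept, gender in assignments.items():
        blocks[gender].add(concept)
    return {frozenset(block) for block in blocks.values()}


# Concepts of Figure 1, indexed 0..5:
# 0 fruit, 1 grass, 2 moon, 3 tree, 4 flower, 5 sun
german = {0: "n", 1: "n", 2: "m", 3: "m", 4: "f", 5: "f"}   # K = 3
spanish = {0: "f", 1: "m", 2: "f", 3: "m", 4: "f", 5: "m"}  # K = 2

german_partition = to_partition(german)
spanish_partition = to_partition(spanish)

# Renaming the labels leaves the partition unchanged, which is the sense
# in which cluster labels carry no meaning.
relabeled = {c: {"n": "A", "m": "B", "f": "C"}[g] for c, g in german.items()}
same_after_relabeling = to_partition(relabeled) == german_partition
```

Because every concept index receives exactly one label, the blocks are disjoint and their union is the whole concept set, which is what makes the two systems comparable even though they differ in the number of blocks.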