

Predicting Declension Class from Form and Meaning

Adina Williams (Facebook AI Research), Tiago Pimentel (University of Cambridge), Arya D. McCarthy (Johns Hopkins University), Hagen Blix (New York University), Eleanor Chodroff (University of York), Ryan Cotterell (University of Cambridge, ETH Zürich)

[email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract

The lexica of many natural languages are divided into several declension classes with characteristic morphological properties. Class membership is far from deterministic, but the phonological form of a noun and its meaning can often provide imperfect clues. Here, we investigate the strength of those clues. More specifically, we operationalize "strength" by measuring how much information, in bits, we can glean about declension class from knowing the form and meaning of nouns. We know that form and meaning are often also indicative of grammatical gender—which, as we quantitatively verify, can itself share information with declension class—so we also control for gender. We find for two Indo-European languages (Czech and German) that form and meaning share a significant amount of information with class (and contribute additional information beyond gender). The three-way interaction between class, form, and meaning (given gender) is also significant. Our study is important for two reasons: First, we introduce a new method that provides additional quantitative support for a classic linguistic finding that form and meaning are relevant for the classification of nouns into declension classes. Second, we show not only that individual declension classes vary in the strength of their clues within a language, but also that the variations between classes vary across languages. The code is publicly available at https://github.com/rycolab/declension-mi.

Figure 1: The conditional entropies (H) and mutual information quantities (MI) of form (W), meaning (V), and declension class (C), given gender (G), in German and Czech.

1 Introduction

To an English speaker learning German, it may come as a surprise that one cannot necessarily predict the plural form of a noun from its singular. This is because pluralizing nouns in English is relatively simple: Usually we merely add an -s to the end (e.g., cat ↦ cats). Of course, not all English nouns follow such a simple rule (e.g., child ↦ children, sheep ↦ sheep, etc.), but those that do not are few in number. Compared to English, German has many common morphological rules for inflecting nouns. For example, some plurals are formed by adding a suffix to the singular: Insekt 'insect' ↦ Insekt-en, Hund 'dog' ↦ Hund-e, Radio 'radio' ↦ Radio-s. For others, the plural is formed by changing a stem vowel:¹ Mutter 'mother' ↦ Mütter, or Nagel 'nail' ↦ Nägel. Some others form plurals with both suffixation and vowel change: Haus 'house' ↦ Häus-er and Koch 'chef' ↦ Köch-e. Still others, like Esel 'donkey', have the same form in plural and singular. The problem only worsens when we consider other inflectional morphology, such as case.

¹ This vowel change, umlaut, corresponds to fronting.

Disparate plural formation and case rules of the kind described above split nouns into declension classes. To know a noun's declension class is to know which morphological form it takes in which context (e.g., Benveniste 1935; Wurzel 1989; Nübling 2008; Ackerman et al. 2009; Ackerman and Malouf 2013; Beniamine and Bonami 2016; Bonami and Beniamine 2016). But this begs the question: What clues can we use to predict the class for a noun?

In some languages, predicting declension class is argued to be easier if we know the noun's phonological form (Aronoff, 1992; Dressler and Thornton, 1996) or its lexical semantics (Carstairs-McCarthy, 1994; Corbett and Fraser, 2000). However, semantic and phonological clues are, at best, only very imperfect hints as to class (Wurzel, 1989; Harris, 1991, 1992; Aronoff, 1992; Halle and Marantz, 1994; Corbett and Fraser, 2000; Aronoff, 2007). Given this, we quantify how much information a noun's form and meaning share with its class, and determine whether that amount of information is uniform across classes.

To do this, we measure the mutual information (Cover and Thomas, 2012) both between declension class and meaning (i.e., a distributional semantic vector) and between declension class and form (i.e., the orthographic form), as in Figure 1. We select two Indo-European languages (Czech and German) that have declension classes. We find that form and meaning both share significant amounts of information, in bits, with declension class in both languages. We further find that form clues are stronger than meaning clues: for form, we uncover a relatively large effect of 0.5–0.8 bits, while, for lexical semantics, a moderate one of 0.3–0.5 bits. We also measure the three-way interaction between form, meaning, and class, finding that phonology and semantics contribute overlapping information about class. Finally, we analyze individual inflection classes and uncover that the amount of information they share with form and meaning is not uniform across classes or languages.

2 Declension Classes in Language

The morphological behavior of declension classes is quite complex. Although various factors are undoubtedly relevant, we focus on phonological and lexical semantic ones here. We have ample reason to suspect that phonological factors might affect class predictability. In the most basic sense, the form of inflectional suffixes is often altered based on the identity of the final segment of the stem. For example, the English plural suffix is spelled as -s after most consonants, as in cats, but as -es if it appears after an s, sh, z, ch, etc., as in 'mosses', 'rushes', 'quizzes', 'beaches'. Often, differences such as these in the spelling of plural affixes or declension class affixes are due to phonological rules that are noisily realized in orthography; there could also be regularities between form and class that do not correspond to phonological rules but still have an effect. For example, statistical regularities over phonological segments in continuous speech guide first-language acquisition (Maye et al., 2002), even over non-adjacent segments (Newport and Aslin, 2004). Statistical relationships have also been uncovered between the sounds in a word and the word's syntactic category (Farmer et al., 2006; Monaghan et al., 2007; Sharpe and Marantz, 2017), and between the orthographic form of a word and its argument structure valence (Williams, 2018). Thus, we expect the form of a noun to provide clues to declension class.

Semantic factors, too, are often relevant for determining certain types of morphologically relevant classes, such as grammatical gender, which is known to be related to declension class. It has been claimed that there are only two types of gender systems: semantic systems (where only semantic information is required) and formal systems (where semantic information as well as morphological and phonological factors are relevant) (Corbett and Fraser, 2000, 294). Moreover, a large typological survey, Qian et al. (2016), finds that meaning-sensitive grammatical properties, such as gender and animacy, can be decoded well from distributional word representations for some languages, but less well for others. These examples suggest that it is worth investigating whether noun semantics provides clues about declension class.

Lastly, form and meaning might interact with one another, as in the case of phonaesthemes, where the sounds of words provide non-arbitrary clues about their meanings (Sapir, 1929; Wertheimer, 1958; Holland and Wertheimer, 1964; Maurer et al., 2006; Monaghan et al., 2014; D'Onofrio, 2014; Dingemanse et al., 2015; Dingemanse, 2018; Pimentel et al., 2019). Therefore, we check whether form and meaning together share information with declension class.

2.1 Orthography as a proxy for phonology?

We motivate an investigation into the relationship between the form of a word and its declension class by appealing, at least partly, to phonological motivations. However, we make the simplifying assumption that phonological information is adequately captured by orthographic word forms—i.e., strings of written symbols, also known as graphemes. In general, one should question this assumption (Vachek, 1945; Luelsdorff, 1987; Sproat, 2000, 2012; Neef et al., 2012). For the particular languages we investigate here—Czech and German—it is less problematic, as they have fairly "transparent" mappings between spelling and pronunciation (Matějček, 1998; Miles, 2000; Caravolas and Volín, 2001), which enables them to achieve higher performance on grapheme-to-phoneme conversion than English and other "opaque" orthographic systems (Schlippe et al., 2012). These studies suggest that we are justified in taking orthography as a proxy for phonological form. Nonetheless, to mitigate against any phonological information being inaccurately represented in the orthographic form (e.g., vowel lengthening in German), several of our authors, who are fluent reader–annotators of our languages, checked our classes for any unexpected phonological variations. We exhibit examples in §3.

2.2 Distributional Lexical Semantics

We adopt a distributional approach to lexical semantics (Harris 1954; Mitchell and Lapata 2010; Turney and Pantel 2010; Bernardi et al. 2015; Clark 2015; inter alia) that relies on pretrained word embeddings. We do this for multiple reasons: First, distributional semantic approaches to creating word vectors, such as WORD2VEC (Mikolov et al., 2013), have been shown to do well at extracting lexical features such as animacy and taxonomic information (Rubinstein et al., 2015) and can also recognize semantic anomaly (Vecchi et al., 2011). Second, the distributional approach to lexical meaning yields a straightforward procedure for extracting "meaning" from text corpora at scale.

2.3 Controlling for grammatical gender?

Grammatical gender has been found to interact with lexical semantics (Schwichtenberg and Schiller, 2004; Williams et al., 2019, 2020), and can often be determined from form (Brooks et al., 1993; Dobrin, 1998; Frigo and McDonald, 1998; Starreveld and La Heij, 2004). This means that it cannot be ignored in the present study. While the precise nature of the relationship between declension class and gender is far from clear, it is well established that the two should be distinguished (Aronoff 1992; Wiese 2000; Kürschner and Nübling 2011; inter alia). We first measure the amount of information shared between gender and class, according to the methods described in §4, to verify that the predicted relationship exists. We then verify that gender and class overlap in information in German and Czech to a high degree, but that we cannot reduce one to the other (see Table 3 and §6). We proceed to control for gender, and subsequently measure how much additional information form and meaning provide about declension class.

3 Data

For our study, we need orthographic forms of nouns, their associated word vectors, and their declension classes. Orthographic forms can be found in any large text corpus or dictionary. We isolate noun lexemes (i.e., lemma or syntactic category–specific representations of words) by language. We select Czech nouns from UniMorph (Kirov et al., 2018) and German nouns from CELEX2 (Baayen et al., 1995). For lexical semantics, we trained 300-dimensional WORD2VEC vectors on language-specific Wikipedia.²

² We use the GENSIM toolkit (Řehůřek and Sojka, 2010).

We select the nominative singular form as the donor for both orthographic and lexical semantic representations because it is the citation form in Czech and German. It is also usually the stem for the rest of the morphological paradigm. We restrict our investigation to monomorphemic lexemes because: (i) one stem can take several affixes, which would multiply its contribution to the results, and (ii) certain affixes come with their own class.³

³ Since these require special treatment, they are set aside.

Compared to form and meaning, declension class is a bit harder to come by, because it requires linguistic annotation. We associated lexemes with their classes on a by-language basis by relying on annotations from fluent speaker–linguists, either for class determination (for Czech) or for verifying existing dictionary information (for German). For Czech, declension classes were derived by an edit distance heuristic over affix forms, which grouped lemmata into subclasses if they received the same inflectional affixes (i.e., they constituted a morphological paradigm). If orthographic differences between two sets of suffixes in the lemma form could be accounted for by positing a phonological rule, then the two sets were collapsed into a single set; for example, in the "feminine -a" declension class, we collapsed forms for which the dative singular suffix surfaces as -e following a coronal continuant consonant (figurka:figurce 'figurine.DAT.SG'), as -i following a palatal nasal (piraňa:piraňi 'piranha.DAT.SG'), and as -ě following all other consonants (kráva:krávě 'cow.DAT.SG'). As for meaning, descriptively, gender is roughly a superset of declension classes in Czech; among the masculine classes, animacy is a critical semantic factor, whereas form seems to matter more for feminine and neuter classes.
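The paper's footnote above notes that the vectors were trained with the GENSIM toolkit. As a concrete illustration, here is a minimal sketch of such a training run; the corpus file name, tokenization, and every hyperparameter other than the 300-dimensional vector size are our own illustrative assumptions, not the paper's settings.

```python
# Minimal sketch of training the lexical-semantic vectors described above.
# The 300-dimensional size matches the paper; the corpus path and the
# remaining hyperparameters are illustrative assumptions.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# One pre-tokenized sentence per line, e.g. extracted from a Wikipedia dump.
sentences = LineSentence("cs_wikipedia_tokenized.txt")

model = Word2Vec(
    sentences,
    vector_size=300,   # dimensionality used in the paper
    window=5,
    min_count=5,
    workers=4,
)
model.save("cs_word2vec.model")

# A noun's vector, later paired with its orthographic form and class:
v = model.wv["kočka"]  # 'cat'
```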

            Original   Final   Training   Validation   Test   Average Length   # Classes
    Czech   3011       2672    2138       267          267    6.26             13
    German  4216       3684    2948       368          368    5.87             16

Table 1: Number of words in the dataset. Counts per language are listed both before and after preprocessing, along with the train–validation–test split, average stem length, and number of classes. Since we use 10-fold cross-validation, all instances are included in the test set at some point, and are used to estimate the cross-entropies in §5.

For German, nouns came morphologically parsed and lemmatized, as well as coded for class, in CELEX2. We also use CELEX2 to isolate monomorphemic noun lexemes and bin them into classes; however, CELEX2 declension classes are more fine-grained than traditional descriptions of declension class—mappings between CELEX2 classes and traditional linguistic descriptions of declension class (Alexiadou and Müller, 2008) are provided in Table 4 in the Appendix. The CELEX2 declension class identifier scheme has multiple subparts. Each declension class identifier includes: (i) the number prefix ('S' for singular, or 'P' for plural); (ii) the morphological form identifier—zero refers to paradigmatically missing forms (e.g., plural is zero for singularia tantum nouns), and other numbers refer to a form identifier of particular morphological processes (e.g., genitive applies an additional suffix for singular masculine nouns, but never for feminines); and (iii) an optional 'u' identifier, which refers to vowel umlaut, if present. More details of the German preprocessing steps are in the Appendix.

After associating nouns with forms, meanings, and classes, we perform exclusions: Because frequency affects class entropy (Parker and Sims, 2015), we removed all classes with fewer than 20 lexemes.⁴ We subsequently removed all lexemes which did not appear in our WORD2VEC models trained on Wikipedia dumps. The final tally for Czech yields 2672 nouns in 13 declension classes, and the final tally for German yields 3684 nouns in 16 declension classes, which can be broken into 3 types of singular and 7 types of plural. Table 4 in the Appendix provides final lexeme counts by declension class.

⁴ We ran another version of our models that included all the original classes and observed no notable differences.

The remaining lexemes were split into 10 folds: one for testing, another for validation, and the remaining eight for training. Table 1 shows the train–validation–test splits, average length of nouns, and number of declension classes, by language.

4 Methods

Notation. We define each lexeme in a language as a triple. Specifically, the i-th triple consists of an orthographic word form w_i, a distributional semantic vector v_i that encodes the lexeme's semantics, and a declension class c_i. We assume these triples follow an (unknown) probability distribution p(w, v, c)—which can be marginalized to obtain p(c), for example. We take the space of word forms to be the Kleene closure over a language's alphabet Σ; thus, we have w_i ∈ Σ*. Our distributional semantic space is a high-dimensional real vector space ℝ^d, where v_i ∈ ℝ^d. The space of declension classes is language-specific and contains as many elements as the language has classes, i.e., C = {1, ..., K}, where c_i ∈ C. For each noun, a gender g_i from a language-specific space of genders G is associated with the lexeme. In both Czech and German, G contains three genders: feminine, masculine, and neuter. We also consider four random variables: a Σ*-valued random variable W, an ℝ^d-valued random variable V, a C-valued random variable C, and a G-valued random variable G.

Bipartite Mutual Information. Bipartite MI (or, simply, MI) is a symmetric quantity that measures how much information (in bits) two random variables share. In the case of C (declension class) and W (orthographic form), we have

    MI(C; W) = H(C) − H(C | W)    (1)

As can be seen, MI is the difference between an unconditional and a conditional entropy. The unconditional entropy is defined as

    H(C) = − ∑_{c ∈ C} p(c) log p(c)    (2)

and the conditional entropy is defined as

    H(C | W) = − ∑_{c ∈ C} ∑_{w ∈ Σ*} p(c, w) log p(c | w)    (3)

The mutual information MI(C; W) naturally encodes how much the orthographic word form tells us about its corresponding lexeme's declension class. Likewise, to measure the interaction between declension class and lexical semantics, we also consider the bipartite mutual information MI(C; V).

Tripartite Mutual Information. To consider the interaction between three random variables at once, we need to generalize MI to three variables. One can calculate tripartite MI as follows:

    MI(C; W; V) = MI(C; W) − MI(C; W | V)    (4)

As can be seen, tripartite MI is the difference between a bipartite MI and a conditional bipartite MI. The conditional bipartite MI is defined as

    MI(C; W | V) = H(C | V) − H(C | W, V)    (5)

Essentially, Equation 4 is the difference between how much C and W interact and how much they interact after "controlling" for the meaning V.⁵

⁵ We emphasize here the subtle, but important, typographic distinction between MI(C; W; V) and MI(C; W, V)—the difference in notation lies in the comma replacing the semicolon. While the first (tripartite MI) measures the amount of (redundant) information shared by the three variables, the second (bipartite) measures the (total) information that class shares with either the form or the lexical semantics.

Controlling for Gender. Working with mutual information also gives us a natural way to control for quantities that we know influence meaning and form. We do this by considering conditional MI, both bipartite and tripartite. These are defined as follows:

    MI(C; W | G) = H(C | G) − H(C | W, G)    (6a)
    MI(C; W; V | G) = MI(C; W | G) − MI(C; W | V, G)    (6b)

Estimating these quantities tells us how much C and W (and, in the case of tripartite MI, V also) interact after we take G (the grammatical gender) out of the picture. Figure 1 provides a graphical summary of this section up to this point.

Normalization. To further contextualize our results, we consider two normalization schemes for MI. Normalizing renders MI estimates across languages more directly comparable (Gates et al., 2019). We consider the normalized mutual information, i.e., the fraction of the unconditional entropy accounted for by the mutual information:

    NMI(C; W) = MI(C; W) / min{H(C), H(W)}    (7)

This yields the percentage of the entropy that the mutual information accounts for—a more interpretable notion of the predictability between class and form or meaning. In practice, H(C) ≪ H(W) in most cases, and our normalized mutual information is then termed the uncertainty coefficient (Theil, 1970):

    U(C | W) = MI(C; W) / H(C)    (8)
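Once the conditional entropies are estimated (§5), every quantity above reduces to additions and subtractions. A small sketch follows, with function and variable names of our own choosing; plugging in the Czech entropies from Table 2 recovers that table's Czech row up to rounding.

```python
# How the quantities in Eqs. (4)-(8) reduce to entropy differences, given
# gender G. The four conditional entropies are assumed to be estimated
# elsewhere (see Section 5); the names here are ours.
def mi_quantities(h_c_g, h_c_wg, h_c_vg, h_c_wvg):
    mi_c_w = h_c_g - h_c_wg            # MI(C; W | G), Eq. (6a)
    mi_c_v = h_c_g - h_c_vg            # MI(C; V | G)
    mi_c_w_given_v = h_c_vg - h_c_wvg  # MI(C; W | V, G), cf. Eq. (5)
    mi_tri = mi_c_w - mi_c_w_given_v   # MI(C; W; V | G), Eq. (6b)
    u_c_w = mi_c_w / h_c_g             # uncertainty coefficient, Eq. (8)
    return mi_c_w, mi_c_v, mi_tri, u_c_w

# E.g., the Czech entropies of Table 2 (results differ only by rounding):
print(mi_quantities(h_c_g=1.35, h_c_wg=0.56, h_c_vg=0.82, h_c_wvg=0.37))
```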

5 Computation and Approximation

In order to estimate the mutual information quantities of interest per §4, we need to estimate a variety of entropies. We derive our mutual information estimates from a corpus D = {(v_i, w_i, c_i)}_{i=1}^N.

5.1 Plug-in Estimation of Entropy

The most straightforward quantity to estimate is H(C). Given a corpus, we may use plug-in estimation: We compute the empirical distribution over declension classes from D. Then, we plug that empirical distribution into the formula for entropy in Equation 2. This estimator is biased (Paninski, 2003), but is a suitable choice given that we have only a few declension classes and a large amount of data. Future work will explore whether the choice of estimator (Miller, 1955; Hutter, 2001; Archer et al., 2013, 2014) could affect the conclusions of studies such as this one.
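A minimal sketch of the plug-in estimators just described, assuming class and gender labels are given as plain Python lists; the function names are ours.

```python
from collections import Counter
import math

# Plug-in estimate of H(C), Eq. (2), from a list of class labels.
# Biased (Paninski, 2003), but adequate for few classes and much data.
def plugin_entropy(labels):
    n = len(labels)
    return -sum((k / n) * math.log2(k / n)
                for k in Counter(labels).values())

# Plug-in estimate of H(C | G): the per-gender entropies averaged
# under the empirical gender distribution p(g).
def plugin_conditional_entropy(labels, genders):
    n = len(labels)
    return sum(
        (m / n) * plugin_entropy(
            [c for c, g in zip(labels, genders) if g == gender])
        for gender, m in Counter(genders).items())
```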
5.2 Model-based Estimation of Entropy

In contrast, estimating H(C | W) is non-trivial. We cannot simply apply plug-in estimation, because we cannot compute the infinite sum over Σ* that is required. Instead, we follow previous work (Brown et al., 1992; Pimentel et al., 2019) in using the cross-entropy upper bound to approximate H(C | W) with a model. More formally, for any probability distribution q(c | w), we have

    H(C | W) ≤ H_q(C | W) = − ∑_{c ∈ C} ∑_{w ∈ Σ*} p(c, w) log q(c | w)    (9)

To circumvent the need for infinite sums, we use a held-out sample D̃ = {(ṽ_i, w̃_i, c̃_i)}_{i=1}^M, disjoint from D, to approximate the true cross-entropy H_q(C | W) with the following quantity:

    Ĥ_q(C | W) = − (1/M) ∑_{i=1}^M log q(c̃_i | w̃_i)    (10)

where we assume the held-out data is distributed according to the true distribution p. We note that Ĥ_q(C | W) → H_q(C | W) as M → ∞. While the exposition above focuses on learning a distribution q(c | w) over classes given forms to approximate H(C | W), the same methodology can be used to estimate all necessary conditional entropies.

Form and gender: q(c | w, g). We train one LSTM classifier (Hochreiter and Schmidhuber, 1996) per language. The last hidden state of the LSTM is fed into a linear layer and then a softmax non-linearity to obtain probability distributions over declension classes. To condition our model on gender, we embed each gender and feed it into the LSTM's initial hidden state.

Meaning and gender: q(c | v, g). We trained a simple multilayer perceptron (MLP) classifier to predict the declension class from the WORD2VEC representation. When conditioning on gender, we again embed each gender class, concatenating these embeddings with the WORD2VEC ones before feeding the result into the MLP.

Form, meaning, and gender: q(c | w, v, g). We again trained two LSTM classifiers, but this time also conditioned on meaning (i.e., WORD2VEC). Before training, we reduce the dimensionality of the WORD2VEC embeddings from 300 to k dimensions by running PCA on each language's embeddings. We then linearly transformed them to match the hidden size of the LSTMs and fed them in. To also condition on gender, we followed the same procedure, but used half of each LSTM's initial hidden state for each vector (i.e., WORD2VEC and one-hot gender embeddings).

Optimization. We trained all classifiers using Adam (Kingma and Ba, 2015); the code was implemented in PyTorch. Hyperparameters—number of training epochs, hidden sizes, PCA compression dimension (k), and number of layers—were optimized using Bayesian optimization with a Gaussian process prior (Snoek et al., 2012). We explored a maximum of 50 models for each experiment, maximizing the expected improvement on the validation set.
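A PyTorch sketch of the q(c | w, g) classifier described above: a character-level LSTM whose initial hidden state is a gender embedding, followed by a linear layer and softmax over classes. The sizes shown are illustrative; the actual hyperparameters were chosen by Bayesian optimization.

```python
import torch
import torch.nn as nn

# Sketch of q(c | w, g): characters run through an LSTM whose initial
# hidden state encodes gender; the last hidden state feeds a linear
# layer and a (log-)softmax over declension classes.
class FormGenderClassifier(nn.Module):
    def __init__(self, n_chars, n_genders, n_classes, hidden=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, hidden)
        self.gender_emb = nn.Embedding(n_genders, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, chars, gender):
        # chars: (batch, seq_len) character ids; gender: (batch,) ids
        h0 = self.gender_emb(gender).unsqueeze(0)   # (1, batch, hidden)
        c0 = torch.zeros_like(h0)
        _, (h_n, _) = self.lstm(self.char_emb(chars), (h0, c0))
        return torch.log_softmax(self.out(h_n[-1]), dim=-1)
```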
5.3 An Empirical Lower Bound on MI

With our empirical approximations of the desired entropy measures, we can calculate approximations of the desired MI values, e.g.,

    MI(C; W | G) ≈ Ĥ(C | G) − Ĥ_q(C | W, G)    (11)

where Ĥ(C | G) is the plug-in estimate of the entropy. Such an approximation, though, is not ideal, since we do not know whether the true MI is approximated from above or below. However, since we use a plug-in estimator for Ĥ(C | G), which underestimates entropy, and since H_q(C | W, G) is estimated with a cross-entropy upper bound, we have

    MI(C; W | G) = H(C | G) − H(C | W, G)
                 ≳ H(C | G) − H_q(C | W, G)
                 ≳ Ĥ(C | G) − Ĥ_q(C | W, G)

We note that these are expected lower bounds, i.e., they are exact when taking an expectation under the true distribution p. We cannot make a similar statement about tripartite MI, though, since it is computed as the difference of two lower-bound approximations of true mutual information quantities.

6 Results

Our main experimental results are presented in Table 2. We find that both form and lexical semantics significantly interact with declension class in both Czech and German (each p < 0.01).⁶ We observe that our estimates of MI(C; W | G) are larger (0.5–0.8 bits) than our estimates of MI(C; V | G) (0.3–0.5 bits). We also observe that the MI estimates in Czech are higher than in German. However, we caution that the unnormalized estimates for the two languages are not fully comparable, because they hail from models trained on different amounts of data. The tripartite MI estimates between class, form, and meaning were relatively small (0.2–0.35 bits) for both languages. We interpret this finding as showing that much of the information contributed by form is not redundant with information contributed by meaning—although a substantial amount is.

⁶ All results in this section were significant for both languages, according to a Welch (1947) t-test, which yielded p < 0.01 after Benjamini and Hochberg (1995)'s correction. Welch's t-test differs from Student (1908)'s t-test in that the latter assumes equal variances and the former does not, making it preferable (see Delacre et al. 2017).
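A sketch of how Equations (10) and (11) combine in practice: average the held-out negative log-likelihood of a trained classifier to get Ĥ_q, then subtract it from the plug-in estimate of H(C | G). The batch format is an assumption of ours, matching the classifier sketched above.

```python
import math
import torch

# Held-out cross-entropy estimate, Eq. (10), in bits. `model` returns
# log q(c | w, g) in nats (as with log_softmax); `heldout` is assumed
# to yield (chars, gender, class_id) batches.
def heldout_cross_entropy(model, heldout):
    total_nats, n = 0.0, 0
    with torch.no_grad():
        for chars, gender, c in heldout:
            log_q = model(chars, gender)                       # (batch, K)
            total_nats -= log_q.gather(1, c.unsqueeze(1)).sum().item()
            n += c.numel()
    return total_nats / n / math.log(2)                        # nats -> bits

# Expected lower bound on MI(C; W | G), Eq. (11), using the plug-in
# estimator sketched in Section 5.1:
# mi_lb = plugin_conditional_entropy(classes, genders) \
#         - heldout_cross_entropy(model, heldout_batches)
```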

6687 Form & Declension Class (LSTM) Meaning & Declension Class (MLP) H(C | G)HQ(C | W, G) MI(C; W | G) U(C | W, G) H(C | G)HQ(C | V,G) MI(C; V | G) U(C | V,G) Czech 1.35 0.56 0.79 58.8% 1.35 0.82 0.53 39.4% German 2.17 1.60 0.57 26.4% 2.17 1.88 0.29 13.6% Both (Form and Meaning) & Declension Class Tripartite MI (LSTM) H(C | G)HQ(C | W, V, G) MI(C; W, V | G) U(C | W, V, G) MI(C; W | G) MI(C; W | V,G) MI(C; W ; V | G) U(C | W ; V,G) Czech 1.35 0.37 0.98 72.6% 0.79 0.44 0.35 25.9% German 2.17 1.50 0.67 30.8% 0.57 0.37 0.20 9.2%

Table 2: MI between form and class (first block), meaning and class (second block), both form and meaning and class (third block), and tripartite MI (fourth block). All values are calculated given gender, and all are significant (see footnote 6).
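Footnote 6 specifies Welch's t-test with Benjamini–Hochberg correction. Below is a generic sketch using SciPy and statsmodels; what exactly is compared (we assume per-fold MI estimates against a null baseline) is our assumption, as the paper does not spell out the test inputs here.

```python
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

# Significance testing in the style of footnote 6: Welch's t-test
# (equal_var=False) per quantity, then Benjamini-Hochberg correction.
# The pairing of per-fold estimates with a null baseline is an
# assumption of ours, not the paper's stated procedure.
def significance(per_fold_estimates, per_fold_nulls):
    pvals = [ttest_ind(est, null, equal_var=False).pvalue
             for est, null in zip(per_fold_estimates, per_fold_nulls)]
    reject, corrected, _, _ = multipletests(pvals, method="fdr_bh")
    return reject, corrected
```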

             H(C)   H(C | G)   MI(C; G)   U(C | G)
    Czech    2.75   1.35       1.40       50.8%
    German   2.88   2.17       0.71       24.6%

Table 3: MI between class and gender, MI(C; G): H(C) is class entropy, H(C | G) is class entropy given gender, and U(C | G) is the uncertainty coefficient.

As a final sanity check, we measure the mutual information between class and gender, MI(C; G) (see Table 3). For both languages, the mutual information between declension class and gender is significant. Our MI estimates range from approximately 3/4 of a bit in German up to 1.4 bits in Czech, which respectively amount to nearly 25% and nearly 51% of the unconditional entropy. Like the quantities discussed in §4, this MI was estimated using simple plug-in estimation. Remember, if class were entirely reducible to gender, the conditional entropy of class given gender would be zero. This is not the case: Although the conditional entropy of class given gender is lower for Czech (1.35 bits) than for German (2.17 bits), in neither case is declension class informationally equivalent to the language's grammatical gender system.

7 Discussion and Analysis

Next, we ask whether individual declension classes differ in how idiosyncratic they are; e.g., does any one declension class share less information with form than the others? To address this, we qualitatively inspect per-class half-pointwise mutual information in Figures 2a–2b. See Table 5 in the Appendix for the five highest and lowest surprisal examples per model. Several qualitative trends were observed: (i) classes show a decent amount of variability, (ii) the unconditional entropy for each class is inversely proportional to the class's size, (iii) half-pointwise MI is higher on average for Czech than for German, and (iv) classes that have high MI(C = c; V | G) usually have high MI(C = c; W | G) (with a few notable exceptions we discuss below).

Czech. In general, declension classes associated with masculine nouns (g = MSC) have smaller MI(C = c; W | G) than classes associated with feminine (g = FEM) and neuter (g = NEU) nouns of a comparable size—the exception being 'special, masculine, pl -ata'. This class ends exclusively in -e or -ě, which might contribute to its higher MI(C = c; W | G). That MI(C = c; W | G) is high for feminine and neuter classes suggests that the overall MI(C; W | G) results might be largely driven by these classes, which predominantly end in vowels. We also note that the high MI(C = c; W | G) for feminine 'pl -e' might be driven by the many Latin or Greek loanwords present in this class.

With respect to meaning, masculine declension classes can reflect degrees of animacy: 'animate1' contains nouns referring mostly to humans and a few animals (kocour 'tomcat', čolek 'newt'), 'animate2' contains nouns referring mostly to animals and a few humans (syn 'son', křesťan 'Christian'), 'inanimate1' contains many plants, staple foods (chléb 'bread', ocet 'vinegar'), and meaningful places (domov 'home', kostel 'church'), and 'inanimate2' contains many basic inanimate nouns (kámen 'stone'). Of these masculine classes, 'inanimate1' has a lower MI(C = c; V | G) than its class size alone might lead us to predict. Feminine and neuter classes show no clear pattern, although the neuter classes '-ení' and '-o' have comparatively high MI(C = c; V | G).

For MI(C = c; V; W | G), we observe that 'masculine, inanimate1' is the smallest quantity, followed by most other masculine classes (e.g., masculine animate classes with -ové or -i plurals) for which MI(C = c; W | G) was also low. Among non-masculine classes, we observe that feminine 'pl -i' and the neuter classes '-o' and '-ení' show higher tripartite MI; the latter two classes have relatively high MI across the board.

[Figure 2: per-class bar plots for (a) Czech and (b) German, with panels for Tripartite, Both, Form, and Meaning MI (0–6 bits) alongside H(C = c | G), and declension classes on the x-axis; see caption below.]

Figure 2: Pointwise MI for declension classes. MI for each random variable X ∈ {V, W, {V, W}, {V; W}} is plotted for classes increasing in size (towards the right): MI(C = c; V | G) (bottom), MI(C = c; W | G) (bottom middle), MI(C = c; V, W | G) (top middle), and tripartite MI(C = c; V; W | G) (top).

German. MI(C = c; W | G) for classes containing words with umlautable vowels (i.e., S3/P1u, S1/P1u) or loan words (i.e., S3/loan) tends to be high; in the former case, our models seem able to separate umlautable from non-umlautable vowels, and in the latter case, loan word orthography from native orthography. MI(C = c; V | G) quantities are roughly equivalent across classes of different sizes, with the exception of three classes: S1/P4, S3/P1, and S1/P3. S1/P4 consists of highly semantically variable nouns, ranging from relational noun lexemes (e.g., Glied 'member', Weib 'wife', Bild 'picture') to masses (e.g., Reis 'rice'), which perhaps explains its relatively high MI(C = c; V | G). For S1/P3 and S3/P1, MI(C = c; V | G) is low, and we observe that both declension classes idiosyncratically group clusters of semantically similar nouns: S1/P3 contains "exotic" birds (Papagei 'parrot', Pfau 'peacock'), but also nouns ending in -or (Traktor 'tractor', Pastor 'pastor'), whereas S3/P1 contains very few nouns, such as names of months (März 'March', Mai 'May') and names of mythological beasts (e.g., Sphinx, Alp).

Tripartite MI is fairly idiosyncratic in German: The lowest quantity comes from the smallest class, S1/P2u. S1/P3, a class with low MI(C = c; V | G) from above, also has low tripartite MI. We speculate that S1/P3 could be a sort of "catch-all" class with no clear regularities. The highest tripartite MI comes from S1/P4, which also had high MI(C = c; V | G). The existence of significant tripartite MI results suggests that submorphemic meaning-bearing units, or phonaesthemes, might be present. Taking inspiration from Pimentel et al. (2019), which aims to automatically discover such units, we observe that many words in S1/P4 contain the letters {d, e, g, i, l}, often in identically ordered orthographic sequences, such as Bild, Biest, Feld, Geld, Glied, Kind, Leib, Lied, Schild, Viech, Weib, etc.

While these letters are common in German orthography, their noticeable presence suggests that further elucidation of declension classes in the context of phonaesthemes could be warranted.

8 Conclusion

We adduce new evidence that declension class membership is neither wholly idiosyncratic nor fully deterministic based on form or meaning in Czech and German. We quantify mutual information and find estimates which range from 0.2 bits to nearly one bit. Despite their relatively small magnitudes, our estimates of mutual information between class and form accounted for between 25% and 60% of the class entropy, even after relevant controls, and MI between class and meaning accounted for between 13% and nearly 40%. We analyze results per class, and find that classes vary in how much information they share with meaning and form. We also observe that classes that have high MI(C = c; V | G) often have high MI(C = c; W | G), with a few noted exceptions that have specific orthographic (e.g., German umlauted plurals) or semantic (e.g., Czech masculine animacy) properties. In sum, this paper has proposed a new information-theoretic method for quantifying the strength of morphological relationships, and applied it to declension class. We verify and build on existing linguistic findings, by showing that the mutual information quantities between declension class, orthographic form, and lexical semantics are statistically significant.

Acknowledgments

Thanks to Guy Tabachnik for discussions on Czech phonology, to Jacob Eisenstein for useful questions about irregularity, and to Andrea Sims and Jeff Parker for advice on citation forms. Thanks to Ana Paula Seraphim for helping beautify Figure 1.

References Patricia J. Brooks, Martin D. S. Braine, Lisa Catalano, Farrell Ackerman, James P. Blevins, and Robert Mal- Ruth E. Brody, and Vicki Sudhalter. 1993. Acquisi- ouf. 2009. Parts and wholes: Implicative patterns in tion of gender-like noun subclasses in an artificial inflectional paradigms. Analogy in Grammar: Form language: The contribution of phonological mark- and Acquisition, pages 54–82. ers to learning. Journal of Memory and Language, 32(1):76–95. Farrell Ackerman and Robert Malouf. 2013. Morpho- logical organization: The low conditional entropy Peter F. Brown, Stephen A. Della Pietra, Vincent J. conjecture. Language, pages 429–464. Della Pietra, Jennifer C. Lai, and Robert L. Mercer. 1992. An estimate of an upper bound for the entropy Artemis Alexiadou and Gereon Muller.¨ 2008. Class of English. Computational Linguistics, 18(1):31– Features as Probes. In Asaf Bachrach and Andrew 40.

6690 Marketa´ Caravolas and Jan Vol´ın. 2001. Phonolog- Morris Halle and Alec Marantz. 1994. Some key fea- ical spelling errors among dyslexic children learn- tures of distributed morphology. MIT Working Pa- ing a transparent orthography: the case of Czech. pers in Linguistics, 21(275):88. Dyslexia, 7(4):229–245. James W. Harris. 1991. The exponence of gender in Andrew Carstairs-McCarthy. 1994. Inflection classes, Spanish. Linguistic Inquiry, 22(1):27–62. gender, and the principle of contrast. Language, pages 737–788. James W. Harris. 1992. The form classes of Span- ish substantives. In Yearbook of Morphology 1991, Stephen Clark. 2015. Vector space models of lexical pages 65–88. Springer. meaning. In Shalom Lappin and Chris Fox, editors, Word The Handbook of Contemporary Semantic Theory, Zellig S. Harris. 1954. Distributional structure. , pages 493–523. Wiley-Blackwell. 10(2-3):146–162. Sepp Hochreiter and Jurgen¨ Schmidhuber. 1996. Greville G. Corbett and Norman M. Fraser. 2000. Gen- LSTM can solve hard long time lag problems. In der assignment: a typology and a model. Systems of Advances in Neural Information Processing Systems, Nominal Classification, 4:293–325. volume 9, pages 473–479. MIT Press.

Thomas M Cover and Joy A Thomas. 2012. Elements Morris K. Holland and Michael Wertheimer. 1964. of Information Theory. John Wiley & Sons, Inc. Some physiognomic aspects of naming, or, maluma and takete revisited. Perceptual and Motor Skills, Marie Delacre, Lakens, and Christophe Leys. 19(1):111–117. 2017. Why psychologists should by default use Welchs t-test instead of Students t-test. Interna- Marcus Hutter. 2001. Distribution of mutual informa- tional Review of Social Psychology, 30(1). tion. In Advances in Neural Information Processing Systems, volume 14, pages 399–406. MIT Press. Mark Dingemanse. 2018. Redrawing the margins of language: Lessons from research on ideophones. Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Glossa, 3(1). method for stochastic optimization. In 3rd Interna- tional Conference on Learning Representations. Mark Dingemanse, Damian´ E. Blasi, Gary Lupyan, Morten H. Christiansen, and Padraic Monaghan. Christo Kirov, Ryan Cotterell, John Sylak-Glassman, 2015. Arbitrariness, iconicity, and systematicity in Geraldine´ Walther, Ekaterina Vylomova, Patrick language. Trends in Cognitive Sciences, 19(10):603 Xia, Manaal Faruqui, Sabrina J. Mielke, Arya Mc- – 615. Carthy, Sandra Kubler,¨ David Yarowsky, Jason Eis- ner, and Mans Hulden. 2018. UniMorph 2.0: Uni- Lise M. Dobrin. 1998. The morphosyntactic reality versal morphology. In Proceedings of the Eleventh of phonological form. In Yearbook of Morphology International Conference on Language Resources 1997, pages 59–81. Springer. and Evaluation (LREC 2018), Miyazaki, Japan. Eu- ropean Language Resources Association (ELRA). Annette D’Onofrio. 2014. Phonetic detail and dimen- sionality in sound–shape correspondences: Refining Sebastian Kurschner¨ and Damaris Nubling.¨ 2011. The the bouba–kiki paradigm. Language and Speech, interaction of gender and declension in Germanic 57(3):367–393. languages. Folia Linguistica, 45(2):355–388.

Wolfgang U. Dressler and Anna M. Thornton. 1996. Philip A. Luelsdorff. 1987. Orthography and phonol- Italian nominal inflection. Wiener Linguistische ogy. John Benjamins Publishing. Gazette, 55(57):1–26. Z Matejˇ cek.ˇ 1998. Reading in Czech. Part i: Tests of reading in a phonetically highly consistent spelling Thomas A. Farmer, Morten H. Christiansen, and system. Dyslexia, 4(3):145–154. Padraic Monaghan. 2006. Phonological typicality influences on-line sentence comprehension. Pro- Daphne Maurer, Thanujeni Pathman, and Catherine J. ceedings of the National Academy of Sciences, Mondloch. 2006. The shape of boubas: Sound– 103(32):12203–12208. shape correspondences in toddlers and adults. De- velopmental Science, 9(3):316–322. Lenore Frigo and Janet L. McDonald. 1998. Properties of phonological markers that affect the acquisition Jessica Maye, Janet F. Werker, and LouAnn Gerken. of gender-like subclasses. Journal of Memory and 2002. Infant sensitivity to distributional informa- Language, 39(2):218–245. tion can affect phonetic discrimination. Cognition, 82(3):101–111. Alexander J. Gates, Ian B. Wood, William P. Hetrick, and Yong-Yeol Ahn. 2019. Element-centric cluster- Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey ing unifies overlaps and hierarchy. Sci- Dean. 2013. Efficient estimation of word represen- entific Reports, 9(8574). tations in vector space. CoRR, abs/1301.3781.

6691 Elaine Miles. 2000. Dyslexia may show a different face Challenges for NLP Frameworks, pages 45–50, Val- in different languages. Dyslexia, 6(3):193–201. letta, Malta. ELRA. http://is.muni.cz/ publication/884893/en. George Miller. 1955. Note on the bias of informa- tion estimates. In Information Theory in Psychol- Dana Rubinstein, Effi Levi, Roy Schwartz, and Ari ogy: Problems and Methods, pages 95–100. Rappoport. 2015. How well do distributional mod- els capture different types of semantic knowledge? Jeff Mitchell and Mirella Lapata. 2010. Composition In Proceedings of the 53rd Annual Meeting of the in distributional models of semantics. Cognitive sci- Association for Computational Linguistics and the ence, 34(8):1388–1429. 7th International Joint Conference on Natural Lan- guage Processing (Volume 2: Short Papers), pages Padraic Monaghan, Morten H. Christiansen, and 726–730, Beijing, China. Association for Computa- Nick Chater. 2007. The phonological-distributional tional Linguistics. coherence hypothesis: Cross-linguistic evidence Cognitive Psychology in language acquisition. , Edward Sapir. 1929. A study in phonetic symbolism. 55(4):259–305. Journal of Experimental Psychology, 12(3):225. Padraic Monaghan, Richard C. Shillcock, Morten H. Christiansen, and Simon Kirby. 2014. How ar- T. Schlippe, S. Ochs, and T. Schultz. 2012. Grapheme- bitrary is language? Philosophical Transac- to-phoneme model generation for Indo-European tions of the Royal Society B: Biological Sciences, languages. In 2012 IEEE International Confer- 369:20130299. ence on Acoustics, Speech and Signal Processing (ICASSP), pages 4801–4804. Martin Neef, Anneke Neijt, and Richard Sproat. 2012. The Relation of Writing to Spoken Language, vol- Beate Schwichtenberg and Niels O. Schiller. 2004. Se- ume 460. De Gruyter. mantic gender assignment regularities in German. Brain and Language, 90(1-3):326–337. Elissa L. Newport and Richard N. Aslin. 2004. Learn- ing at a distance I. Statistical learning of non- Victoria Sharpe and Alec Marantz. 2017. Revisiting adjacent dependencies. Cognitive Psychology, form typicality of nouns and . The Mental Lex- 48(2):127 – 162. icon, 12(2):159–180.

Damaris Nubling.¨ 2008. Was tun mit Flexionsklassen? Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Deklinationsklassen und ihr Wandel im Deutschen 2012. Practical Bayesian optimization of machine und seinen Dialekten. Zeitschrift fur¨ Dialektologie learning algorithms. In F. Pereira, C. J. C. Burges, und Linguistik, pages 282–330. L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages Liam Paninski. 2003. Estimation of entropy and mu- 2951–2959. tual information. Neural Computation, 15(6):1191– 1253. Richard Sproat. 2000. A Computational Theory of Writing Systems. Cambridge University Press. Jeff Parker and Andrea D. Sims. 2015. On the inter- action of implicative structure and type frequency in Richard Sproat. 2012. The consistency of the ortho- The 1st International Quan- inflectional systems. In graphically relevant level in Dutch. The Relation of titative Morphology Meeting . Writing to Spoken Language, pages 35–46. Tiago Pimentel, Arya D. McCarthy, Damian´ Blasi, Brian Roark, and Ryan Cotterell. 2019. Meaning Peter Starreveld and Wido La Heij. 2004. Phonologi- to form: Measuring systematicity as information. In cal facilitation of grammatical gender retrieval. Lan- Proceedings of the 57th Annual Meeting of the Asso- guage and Cognitive Processes, 19(6):677–711. ciation for Computational Linguistics, pages 1751– 1764, Florence, Italy. Association for Computational Student. 1908. The probable error of a mean. Linguistics. Biometrika, pages 1–25.

Peng Qian, Xipeng Qiu, and Xuanjing Huang. 2016. Henri Theil. 1970. On the estimation of relationships Investigating language universal and specific prop- involving qualitative variables. American Journal of erties in word embeddings. In Proceedings of the Sociology, 76(1):103–154. 54th Annual Meeting of the Association for Compu- tational Linguistics (Volume 1: Long Papers), pages Peter D. Turney and Patrick Pantel. 2010. From fre- 1478–1488, Berlin, Germany. Association for Com- quency to meaning: Vector space models of se- putational Linguistics. mantics. Journal of artificial intelligence research, 37:141–188. Radim Rehˇ u˚rekˇ and Petr Sojka. 2010. Software Frame- work for Topic Modelling with Large Corpora. In Josef Vachek. 1945. Some remarks on writing and pho- Proceedings of the LREC 2010 Workshop on New netic transcription. Acta Linguistica, 5(1):86–93.

6692 Eva Maria Vecchi, Marco Baroni, and Roberto Zampar- elli. 2011. (Linear) maps of the impossible: Captur- ing semantic anomalies in distributional space. In Proceedings of the Workshop on Distributional Se- mantics and Compositionality, pages 1–9. Associa- tion for Computational Linguistics. Bernard L. Welch. 1947. The generalization of “Stu- dent’s” problem when several different population variances are involved. Biometrika, 34(1/2):28–35. Michael Wertheimer. 1958. The relation between the sound of a word and its meaning. The American Journal of Psychology, 71(2):412–415. Bernd Wiese. 2000. Warum flexionsklassen? uber¨ die deutsche substantivdeklination. In Rolf Thieroff, Matthias Tamrat, Nanna Fuhrhop, and Oliver Teuber, editors, Deutsche Grammatik in Theorie und Praxis. De Gruyter. Adina Williams. 2018. Representing Relationality: MEG Studies on Argument Structure. Ph.D. thesis, New York University. Adina Williams, Damian´ Blasi, Lawrence Wolf- Sonkin, Hanna Wallach, and Ryan Cotterell. 2019. Quantifying the semantic core of gender systems. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Lan- guage Processing (EMNLP-IJCNLP), pages 5733– 5738, Hong Kong, China. Association for Computa- tional Linguistics.

Adina Williams, Ryan Cotterell, Lawrence Wolf- Sonkin, Damian´ Blasi, and Hanna Wallach. 2020. On the relationships between the grammatical gen- ders of inanimate nouns and their co-occurring ad- jectives and verbs. Transactions of the Association for Computational Linguistics.

Wolfgang Ullrich Wurzel. 1989. Inflectional Morphol- ogy and Naturalness, volume 9. Springer Science & Business Media.

A Further Notes on Preprocessing

The breakdown of our declension classes is given in Table 4. We first discuss more details of our preprocessing and linguistic analysis for Czech, and then for German.

Czech. The Czech classes were initially derived from an edit-distance heuristic between nouns. A fluent speaker–linguist then identified major noun classes by grouping together nouns with shared suffixes in the surface (orthographic) form. If the differences between two sets of suffixes in the surface form could then be accounted for by positing a basic phonological rule—for example, vowel shortening in monosyllabic words—then the two sets were collapsed.

Among masculine nouns, four large classes were identified that seemed to range from "very animate" to "very inanimate." The morphological divisions between these classes were very systematic, but there was substantial overlap: DAT.SG and LOC.SG differentiated 'animate1' from 'animate2', 'inanimate1', and 'inanimate2'; ACC.SG, NOM.PL, and VOC.PL differentiated 'animate2' from 'inanimate1' and 'inanimate2'; and GEN.SG differentiated 'inanimate1' from 'inanimate2' (see Figure 3). Further subdivisions were made within the two animate classes for the apparently idiosyncratic nominative plural suffix, and within the 'inanimate2' class, where nouns took either -u or -e as the genitive singular suffix. This division may have once reflected a final palatal on nouns taking -e in the genitive singular case, but this distinction has since been lost. All nouns in the 'inanimate2' "soft" class end in coronal consonants, whereas nouns in the 'inanimate1' "hard" class have a variety of final consonants.

Among feminine nouns, the 'feminine -a' class contained all feminine words that ended in -a in the nominative singular form. (Note that there exist masculine nouns ending in -a, but these did not pattern with the 'feminine -a' class.) The 'feminine pl -e' class contained feminine nouns ending in -e, -ě, or a consonant, and, as the name suggests, had the suffix -e in the nominative plural form. The 'feminine pl -i' class contained feminine nouns ending in a consonant that had the suffix -i in the nominative plural form. No feminine nouns ended in a dorsal consonant. Among neuter nouns, all words ended in a vowel.

[Figure 3: Czech paradigm for masculine nouns.]
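A rough sketch of the suffix-signature grouping idea behind the Czech heuristic: lemmata whose paradigms attach the same affix set to a shared stem fall into the same candidate class. The data format, the longest-common-prefix stemming, and the omission of the phonological collapsing step are all simplifications of ours, not the paper's procedure.

```python
from collections import defaultdict
from os.path import commonprefix

# Group lemmata into candidate declension classes by the set of
# suffixes their paradigms attach to a shared stem. The subsequent
# collapsing of groups related by phonological rules (described in
# the text above) is not shown here.
def suffix_signature(forms):
    # forms: all inflected forms of one lemma, e.g. from UniMorph
    stem = commonprefix(forms)
    return tuple(sorted(f[len(stem):] for f in forms))

def group_by_signature(paradigms):
    # paradigms: dict mapping lemma -> list of inflected forms
    classes = defaultdict(list)
    for lemma, forms in paradigms.items():
        classes[suffix_signature(forms)].append(lemma)
    return classes
```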
German. After extracting declension classes from CELEX2, we made some additional preprocessing decisions for German, usually based on orthographic or other considerations. For example, we combined the classes S1 with S4, P1 with P7, and P6 with P3, because the difference between each member of any of these pairs lies solely in spelling (a final consonant is doubled in the spelling when GEN.SG -(e)s or PL -(e)n is attached).

Whether a given singular, say S1, becomes inflected as P1 or P2—or, for that matter, as the corresponding umlauted versions of these plural classes—is phonologically conditioned (Alexiadou and Müller, 2008). If the stem ends in a trochee whose second syllable consists of schwa plus /n/, /l/, or /r/, the schwa is not realized, i.e., the noun gets P2; otherwise it gets P1. For this phonological reason, we also chose to collapse P1 and P2.

We also collapsed all loan classes (i.e., those with P8–P10) under one plural class 'Loan'. This choice resulted in merging loans with Greek plurals (like P9, Myth-os / Myth-en) with those with Latin plurals (like P8, Maxim-um / Maxim-a, and P10, Trauma / Trauma-ta). This choice might have unintended consequences on the results, as the orthographies of Latin and Greek loans differ substantially from each other, as well as from native German orthography, and might be affecting our measure of higher form-based MI for the S1/loan and S3/loan classes in the main text. One could reasonably make a different choice and instead remove these examples from consideration, as we did for classes with fewer than 20 lemmata.

B Some prototypical examples

To explore which examples across classes might be most prototypical, we sampled the five highest and lowest surprisal examples per model.

Czech
    class                           #      gender
    masculine, inanimate2           823    MSC
    feminine, -a                    818    FEM
    feminine, pl -e                 275    FEM
    neuter, -o                      149    NEUT
    neuter, -ení                    133    NEUT
    masculine, animate2, pl -i      130    MSC
    masculine, animate1, pl -i      112    MSC
    feminine, pl -i                 80     FEM
    masculine, animate1, pl -ové    55     MSC
    masculine, inanimate1           32     MSC
    special, masculine, pl -ata     26     MSC
    neuter, -e/-ě/-í                21     NEUT
    masculine, animate1, pl -é      18     MSC
    Total                           2672

German
    class      #      classic class       gender(s)
    S1/P1      1157   Decl I              MSC, NEUT
    S3/P3      1105   Decl VI             FEM
    S1/P0      264    Singularia Tantum   MSC, NEUT, FEM
    S1/P5      256    "default -s PL"     MSC, NEUT, FEM
    S3/P0      184    Singularia Tantum   MSC, NEUT, FEM
    S1/P1u     154    Decl II             MSC
    S2/P3      151    Decl V              MSC
    S1/P3      70     Decl IV             MSC, NEUT
    S3/loan    67     Loanwords           MSC, NEUT, FEM
    S3/P1      51     Decl VIII           FEM
    S1/P4u     51     Decl III            MSC, NEUT
    S3/P5      49     "default -s PL"     MSC, NEUT, FEM
    S1/loan    41     Loanwords           MSC, NEUT
    S3/P1u     35     Decl VII            FEM
    S1/P4      25     Decl III            MSC, NEUT
    S1/P2u     24     Decl II             MSC, phon.
    Total      3684

Table 4: Declension classes. 'class' refers to the declension class identifier, '#' refers to the number of lexemes in each declension class, and 'gender' refers to the gender(s) present in each class. German declension classes came from CELEX2, for which 'S' refers to a noun's singular form, 'P' refers to its plural, and 'classic class' refers to the conception of class from the Brockhaus Wahrig Wörterbuch.

Czech
    stem          class                           H(C | W, V)
    pavouk        masculine, animate2, pl -i      11.54
    investor      masculine, animate2, pl -i      10.93
    vůl           masculine, animate2, pl -i      10.78
    dlaždič       masculine, animate1, pl -ové    10.01
    opylovač      masculine, animate2, pl -i      9.21
    optika        feminine, -a                    2.2×10⁻⁴
    kritika       feminine, -a                    2.2×10⁻⁴
    pahorkatina   feminine, -a                    2.1×10⁻⁴
    kachna        feminine, -a                    2.1×10⁻⁴
    matematika    feminine, -a                    2.1×10⁻⁴

German
    stem       class            H(C | W, V)
    Balance    FEM, ?, S1P5     13.16
    Hertz      NEUT, ?, S3P0    13.05
    Schmack    MSC, 6, S3P3     12.17
    See        FEM, 6, S3P3     12.12
    Reling     FEM, ?, S3P5     11.81
    Glocke     FEM, 6, S3P3     5.7×10⁻³
    Schale     FEM, 6, S3P3     5.7×10⁻³
    Schnecke   FEM, 6, S3P3     5.6×10⁻³
    Zeche      FEM, 6, S3P3     5.6×10⁻³
    Parzelle   FEM, 6, S3P3     4.8×10⁻³

Table 5: Five highest and lowest surprisal examples given form and meaning (WORD2VEC), by language.

The results are in Table 5. We observe that the lowest surprisal forms generally come from a single class for each language: 'feminine, -a' for Czech and S3/P3 for German. These two classes were among the largest, having lower class entropy, and both contained feminine nouns. Forms with higher surprisal generally came from several smaller classes, and were predominately masculine. This sample size is small, however, so it remains to be investigated whether this tendency in our data belies a genuine statistically significant relationship between gender, class size, and surprisal.
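Given per-lexeme surprisals −log q(c_i | w_i, v_i) from the held-out folds, a table like Table 5 can be read off by sorting; a trivial sketch, with names of our own choosing.

```python
# Rank (stem, surprisal) pairs and return the k highest- and
# k lowest-surprisal examples, as in Table 5.
def extremes(surprisals, k=5):
    ranked = sorted(surprisals, key=lambda item: item[1])
    return ranked[-k:][::-1], ranked[:k]   # (highest, lowest)

highest, lowest = extremes(
    [("pavouk", 11.54), ("vůl", 10.78), ("optika", 2.2e-4)], k=1)
```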
