<<

What's in a name? In some , grammatical

Vivi Nastase Marius Popescu EML Research gGmbH Department of and Computer Science Heidelberg, Germany University of Bucharest, Bucharest, Romania [email protected] [email protected]

Abstract within a single family: travel – mascu- line in Spanish (el viage) and Italian (il viaggio), This paper presents an investigation of the but feminine in Portuguese (a viagem). relation between and their gender in groups in a lan- two gendered languages: German and Ro- guage into distinct classes. There are languages manian. Gender is an issue that has long whose nouns are grouped into more or less than preoccupied linguists and baffled language three classes. English for example has none, and learners. verify the hypothesis that makes no distinction based on gender, although gender is dictated by the general sound did have three and some patterns of a language, and that goes traces remain (e.g. blonde, blond). beyond suffixes or endings. Exper- Linguists assume several sources for gender: () imental results on German and Romanian a first set of nouns which have natural gender and nouns show strong support for this hypoth- which have associated matching grammatical gen- esis, as gender prediction can be done with der; (ii) nouns that resemble (somehow) the nouns high accuracy based on the form of the in the first set, and acquire their grammatical gen- words. der through this resemblance. Italian and Roma- nian, for example, have strong and reliable phono- 1 Introduction logical correlates (Vigliocco et al., 2004b, for Ital- For speakers of a language whose nouns have no ian). (Doca, 2000, for Romanian). In Romanian gender (such as ), making the leap the majority of feminine nouns end in a˘ or e. Some to a language that does (such as German), does rules exists for German as well (Schumann, 2006), not come easy. With no or few rules or heuris- for example nouns ending in -tat,¨ -ung, -e, -enz, tics to guide him, the language learner will try to -ur, -keit, -in tend to be feminine. Also, when draw on the “obvious” parallel between grammat- specific morphological processes apply, there are ical and natural gender, and will be immediately rules that dictate the gender of the newly formed baffled to learn that girl – Madchen¨ – is neuter in word. This process explains why Frau (woman) is German. Furthermore, one may refer to the same feminine in German, while Fraulein¨ (little woman, using words with different gender: car can miss) is neuter – Fraulein¨ = Frau + lein. The ex- be called (das) Auto (neuter) or (der) Wagen (mas- isting rules have exceptions, and there are numer- culine). Imagine that after hard work, the speaker ous nouns in the language which are not derived, has mastered gender in German, and now wishes and such suffixes do not apply. to proceed with a Romance language, for example Words are names used to refer to concepts. The Italian or Spanish. is now confronted with the fact that the same concept can be referred to using task of relearning to assign gender in these new names that have different gender – as is the case languages, made more complex by the fact that for car in German – indicates that at least in some gender does not match across languages: e.g. sun cases, grammatical gender is in the name and not – feminine in German (die Sonne), but masculine the concept. We test this hypothesis – that the gen- in Spanish (el sol), Italian (il sole) and French (le der of a is in its word form, and that this goes soleil); moon – masculine in German (der Mond), beyond word endings – using noun gender data but feminine in Spanish (la luna), Italian (la luna) for German and Romanian. Both Romanian and and French (la lune). Gender doesn’t even match German have 3 genders: masculine, feminine and

1368 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 1368–1377, Singapore, 6-7 August 2009. c 2009 ACL and AFNLP neuter. The models built using machine learning larski (2007) presents a more detailed overview algorithms classify test nouns into gender classes of currents and ideas about the origin of gender. based on their form with high accuracy. These re- Unterbeck (1999) contains a collection of papers sults support the hypothesis that in gendered lan- that investigate grammatical gender in several lan- guages, the word form is a strong clue for gender. guages, aspects of gender acquisition and its rela- This supplements the situation when some con- tion with and . cepts have natural gender that matches their gram- There may be several reasons for the polemic matical gender: it allows for an explanation where between these two sides. One may come from the there is no such match, either directly perceived, categorization process, the other from the relation or induced through literary devices. between word form and its meaning. Let us take The present research has both theoretical and them each in turn, and see how influenced practical benefits. From a theoretical point of gender. view, it contributes to research on and Grammatical gender separates the nouns in a gender, in particular in going a step further in un- language into disjoint classes. As such, it is a cat- derstating the link between the two. From a practi- egorization process. The traditional – classical – cal perspective, such a connection between gender theory of categorization and concepts viewed cat- and sounds could be exploited in advertising, in egories and concepts as defined in terms of a set particular in product naming, to build names that of common properties that all its members should fit a product, and which are appealing to the de- share. Recent theories of concepts have changed, sired customers. Studies have shown that espe- and view concepts (and categories) not necessar- cially in the absence of meaning, the form of a ily as “monolithic” and defined through rules, but word can be used to generate specific associations rather as clusters of members that may resemble and stimulate the imagination of prospective cus- each other along different dimensions (Margolis tomers (Sells and Gonzales, 2003; Bedgley, 2002; and Laurence, 1999). Botton et al., 2002). In most linguistic circles, the principle of ar- bitrariness of the association between form and 2 Gender meaning, formalized by de Saussure (1916) has What is the origin of grammatical gender and how been largely taken for granted. It seems however, does it relate to natural gender? Opinions are split. that it is hard to accept such an arbitrary relation, Historically, there were two main, opposite, views: as there have always been contestants of this prin- (i) there is a semantic explanation, and natural ciple, some more categorical than others (Jakob- gender motivated the (ii) the relationship son, 1937; Jespersen, 1922; Firth, 1951). It is pos- between natural and grammatical gender is arbi- sible that the correlation we perceive between the trary. word form and the meaning is something that has Grimm (1890) considered that grammatical arisen after the word was coined in a language, be- gender is an extension of natural gender brought ing the result of what Firth called “phonetic habit” on by imagination. Each gender is associated through “an attunement of the nervous system”, with particular or other attributes, and in and that we have come to prefer, or select, cer- some cases (such as for sun and moon) the assign- tain word forms as more appropriate to the con- ment of gender is based on personification. Brug- cept they name – “There is no denying that there mann (1889) and Bloomfield (1933) took the po- are words which we feel instinctively to be ade- sition that the mapping of nouns into genders is quate to express the ideas they stand for. ... Sound arbitrary, and other phenomena – such as deriva- symbolism, we may say, makes some words more tions, personification – are secondary to the estab- fit to survive” (Jespersen, 1922). lished agreement. Support for this second view These two principles relate to the discussion comes also from language acquisition: children on gender in the following manner: First of all, learn a gendered language do not have a nat- the categories determined by grammatical gen- ural gender attribute that they try to match onto der need not be homogeneous, and their mem- the newly acquired words, but learn these in a sep- bers need not all respect the same member- arate process. Any match or mapping between ship criterion. This frees us from imposing a natural and grammatical gender is done after the matching between natural and grammatical gen- natural gender “” is acquired itself. Ki- der where no such relation is obvious or pos-

1369 sible through literary devices (personification, (2004b) test cognitive aspects of grammatical gen- , metonymy). Nouns belonging to the der of Italian nouns referring to animals. same gender category may resemble each other Cucerzan and Yarowsky (2003) present a boot- because of semantic considerations, lexical deriva- strapping process to predict gender for nouns in tions, internal structure, perceived associations context. They show that context gives accurate and so on. Second, the fact that we allow for the clues to gender (in particular through , possibility that the surface form of a word may quantifiers, adjectives), but when the context is not encode certain word characteristics or attributes, useful, the algorithm can fall back successfully on allows us to hypothesize that there is a surface, the word form. Cucerzan and Yarowsky model phonological, similarity between words grouped the word form for predicting gender using suffix within the same gender category, that can supple- trie models. When a new word is encountered, the ment other resemblance criteria in the gender cat- word is mapped onto the trie starting from the last egory (Zubin and Kopcke,¨ 1986). letter, and it is assigned the gender that has the Zubin and Kopcke¨ (1981), Zubin and Kopcke¨ highest probability based on the path it matches in (1986) have studied the relation between seman- the trie. In context nouns appear with various in- tic characteristics and word form with respect to flections – for number and case in particular. Such gender for . Their study was mo- morphological derivations are gender specific, and tivated by two observations: Zipf (1935) showed as such are strong indicators for gender. that word length is inversely correlated with fre- The hypothesis tested here is that gender comes quency of usage, and Brown (1958) proposed that from the general sound of the language, and is dis- in choosing a name for a given object we are more tributed throughout the word. For this, the data likely to use a term corresponding to a “basic” used should not contain nouns with “tell tale” in- level concept. For example, chair, dog, apple flections. The data will therefore consist of nouns would correspond to the basic level, while furni- in the singular form, . Some nouns ture, animal, fruit and recliner, collie, braeburn are derived from , or adjectives, or apple correspond to a more general or a more other nouns through morphological derivations. specific level, respectively. Their study of gen- These derivations are regular and are identifiable der relative to these levels have shown that basic through a rather small number of regular suffixes. level terms have masculine, feminine, and rarely These suffixes (when they are indicative of gen- neuter genders, while the more undifferentiated der) and word endings will be used as baselines categories at the superordinate level are almost ex- to compare the accuracy of prediction on the full clusively neuter. word with the ending fragment. In psycholinguistic research, Friederici and Ja- cobsen (1999) adopt the position that a lexical 3 Data entry consists of two levels: form and seman- We test our gender-language sounds connection tic and grammatical properties to study the influ- through two languages from different language ence of gender priming – both from a form and families. German will be the representative of the semantic perspective – on language comprehen- , and Romanian for the Ro- sion. Vigliocco et al. (2004a) study gender prim- mance ones. We first collect data in the two lan- ing for German word production. While this re- guages, and then represent them through various search studies the influence of the word form on features – letters, pronunciation, phonetic features. the production of nouns with the same or different grammatical gender, there is no study of the rela- 3.1 Noun collections tion between word forms and their corresponding German data For German we collect nouns and gender. their grammatical gender from a German-English In recent studies we have found on the rela- , part of the BEOLINGUS multi-lingual tion between word form and its associated gender, dictionary1. In the first step we collected the Ger- the only phonological component of a word that man nouns and their gender from this dictionary. is considered indicative is the ending. Spalek et In step 2, we filter out compounds. The reason al. (2008) experiment on French nouns, and test for this step is that a German noun compound will whether a noun’s ending is a strong clue for gen- der for native speakers of French. Vigliocco et al. 1http://dict.tu-chemnitz.de/

1370 have the gender of its , regardless of its nom- To determine whether the connection between a inal modifiers. For the lack of a freely available word form and gender goes beyond this superfi- tool to detect and split noun compounds, we resort cial rule, we generate a dataset in which the nouns to the following algorithm: are stripped of their final letter, and their represen- tation is built based on this reduced form. 1. initialize the list of nouns LN to the empty list; Table 1 shows the data collected and the distri- bution in the three classes. 2. take each noun n in the dictionary D, and German Romanian (a) if n L such that n is an end sub- ∃ i ∈ N masc. 565 32.64% 7338 15.14% string of ni, then add n to LN and re- fem. 665 38.42% 27187 56.08% move ni from LN ; neut. 501 28.94% 13952 28.78% (b) if n L such that n is a end sub- ∃ i ∈ N i total 1731 48477 string of n, skip n; Essentially, we remove from the data all nouns Table 1: Data statistics that include another noun as the end part (which Because for Romanian the dataset is rather is the head position in German noun compounds). large, we can afford to perform undersampling This does not filter examples that have suffixes to balance our classes, and have a more straight- added to form the feminine version of a masculine forward evaluation. We generate a perfectly bal- noun, for example: (der) Lehrer – (die) Lehrerin anced dataset by undersampling the feminine and (teacher). The suffixes are used in one of the base- the neuter classes down to the level of the mascu- lines for with our learning method. line class. We work then with a dataset of 22014 We obtain noun pronunciation information from instances, equally distributed among the three gen- the Bavarian Archive for Speech Signals2. We fil- ders. ter again our list LN to keep nouns for which we have pronunciation information. This allows us to 3.2 Data representation compare the learning results when letter or pro- nunciation information is used. For each word in our collections we produce three After collecting the nouns and their pronunci- types of representation: letters, and ation, we map the pronunciation onto lower level phonological features. Table 2 shows examples phonetic features, following the IPA encoding of for each of these representations. The letter and sounds for the . The mapping representations are self-explanatory. We between sounds and IPA features was manually obtain the pronunciation corresponding to each encoded following IPA tables. word from a pronunciation dictionary, as men- tioned in Section 3.1, which maps a word onto a Romanian data We extract singular nomina- sequence of phonemes (phones). For Romanian tive forms of nouns from the Romanian lexical we have no such resource, but me make without database (Barbu, 2008). The resource contains since in most part the pronunciation matches the the proper word spelling, including and letter representation3. special characters. Because of this and the fact that there is a straightforward mapping between German spelling and pronunciation in Romanian, we can letter abend (m) a b e n d use the entire data extracted from the dictionary phoneme a: b @ n d in our experiments, without special pronunciation . Following the example for the Ger- Romanian man language, we encode each sound through letter seara(f)˘ sear a˘ lower level phonological features using IPA guide- lines. Table 2: Data representation in terms of letters and As in Italian, in Romanian there are strong phonemes for the German and Romanian forms of phonological cues for nouns, especially those hav- the word evening. For Romanian, the letter and ing the feminine gender: they end in a˘ and e. phoneme representation is the same. 2http://www.phonetik.uni-muenchen.de/ 3The exceptions are the diphthongs and a few groups of Bas/ letters: ce, ci, che, chi, oa, and the letter x.

1371 Phonemes, the building blocks of the phonetic The words are represented as strings with bound- representation, can be further described in terms aries marked with a special character (’#’). The of phonological features – “configurations” of high dimensional representation generated by the the vocal tract (e.g tongue and lips position), string kernel is used to find a hyperplane that sep- and acoustic characteristics (e.g. manner of arates instances of different classes. In this section air flow). We use IPA standards for mapping we present in detail the kernel we use. phones in German and Romanian onto these Kernel-based learning algorithms work by em- phonological features. We manually construct bedding the data into a feature space (a Hilbert a map between phones and features, and then space), and searching for linear relations in that automatically binarize this representation and space. The embedding is performed implicitly, use it to generate a representation for each that is by specifying the inner product between phone in each word in the data. For the word each pair of points rather than by giving their co- abend (de) / seara (ro) (evening) in Figure 2, the ordinates explicitly. phonological feature representation for German is: Given an input set (the space of examples), and an embedding vectorX space (feature space), F 0000100000001000010000000001 let φ : be an embedding map called fea- 0001000100000000000010000000 ture map.X → F 0000100000000100000000010001 A kernel is a function k, such that for all x,z ∈ 1000000100000010000000000000 , k(x,z) =< φ(x), φ(z) >, where < .,. > 1000000100000000000010000000, Xdenotes the inner product in . In the case of binary classificationF problems, with the feature base: kernel-based learning algorithms look for a dis- criminant function, a function that assigns +1 to < alveolar, approximant, back, bilabial, cen- examples belonging to one class and 1 to exam- tral, close, closemid, consonant, fricative, front, ples belonging to the other class. This− function glottal, labiodental, long, mid, nasal, nearclose, will be a linear function in the space , that means nearopen, open, openmid, palatal, plosive, it will have the form: F postalveolar, rounded, short, unrounded, uvular, f(x)= sign(< w,φ(x) > +b), velar, vowel >. for some weight vector w. The kernel can be For Romanian, the phonological feature base exploited whenever the weight vector can be ex- is: pressed as a linear combination of the training n points, αiφ(xi), implying that f can be ex- < accented, affricate, approximant, back, bi- i=1 labial, central, close, consonant, dental, fricative, pressed asP follows: front, glottal, labiodental, mid, nasal, open, n plosive, postalveolar, rounded, trill, unrounded, f(x)= sign( αik(xi,x)+ b) i=1 velar, voiced, voiceless, vowel >, . X Various kernel methods differ in the way in and the phonological feature representation which they find the vector w (or equivalently the of the word changes accordingly. vector α). Support Vector Machines (SVM) try to find the vector w that define the hyperplane that 4 Kernel Methods and String Kernels maximum separate the images in of the train- F Our hypothesis that the gender is in the name is ing examples belonging to the two classes. Math- equivalent to proposing that there are sequences of ematically SVMs choose the w and b that satisfy letters/sounds/phonological features that are more the following optimization criterion: common among nouns that share the same gender n 1 2 or that can distinguish between nouns under differ- min [1 yi(< w,φ(xi) > +b)]+ + ν w w,b n − || || ent genders. To determine whether that is the case, Xi=1 we use a string kernel, which for a given string (se- where y is the label (+1/ 1) of the training ex- i − quence) generates a representation that consists of ample xi, ν a regularization parameter and [x]+ = all its substrings of length less than a parameter l. max(x, 0).

1372 Kernel Ridge Regression (KRR) selects the vec- into account all substrings of length less than p it tor w that simultaneously has small empirical er- will be obtained a kernel that is called the blended ror and small norm in Reproducing Kernel Hilbert spectrum kernel: Space generated by kernel . The resulting mini- k p mization problem is: p k1(s, t)= kq(s, t) n q=1 1 2 2 X min (yi < w,φ(xi) >) + λ w w The blended spectrum kernel will be the ker- n i=1 − || || X nel that we will use in with SVM and where again y is the label (+1/ 1) of the training i − KRR. More precisely we will use a normalized example xi, and λ a regularization parameter. De- version of the kernel to allow a fair comparison tails about SVM and KRR can be found in (Taylor of strings of different length: and Cristianini, 2004). What is important is that p above optimization problems are solved in such a ˆp k1(s, t) k1(s, t)= way that the coordinates of the embedded points kp(s, s)kp(t,t) are not needed, only their pairwise inner products 1 1 q which in turn are given by the kernel function k. 5 Experiments and Results SVM and KRR produce binary classifiers and gender classification is a multi-class classification We performed 10-fold cross-validation learning problem. There are a lot of approaches for com- experiments with kernel ridge regression and the bining binary classifiers to solve multi-class prob- string kernel (KRR-SK) presented in Section 4. lems. We used one-vs-all scheme. For arguments We used several baselines to compare the results in favor of one-vs-all see (Rifkin and Klautau, of the experiments against: 2004). BL-R Gender is assigned following the distribu- The kernel function offers to the kernel methods tion of genders in the data. the power to naturally handle input data that are not in the form of numerical vectors, for example BL-M Gender is assigned following the majority strings. The kernel function captures the intuitive class (only for German, for Romanian we use notion of similarity between objects in a specific balanced data). domain and can be any function defined on the respective domain that is symmetric and positive BL-S Gender is assigned based on suffix-gender definite. For strings, a lot of such kernel functions relation found in the literature. We use the exist with many applications in computational bi- following mappings: ology and computational linguistics (Taylor and German (Schumann, 2006): • Cristianini, 2004). feminine -ade, -age, -anz, -e, -ei, -enz, Perhaps one of the most natural ways to mea- -ette, -heit, -keit, -ik, -in, -ine, -ion, - sure the similarity of two strings is to count how itis, -ive, -schaft, -tat,¨ -tur, -ung, -ur; many substrings of length the two strings have p masculine -ant, -er, -ich, -ismus, -ling; in common. This give rise to the p-spectrum ker- neuter -chen, -ist, -lein, -ment, -nis, -o, nel. Formally, for two strings over an alphabet Σ, -tel, -um. s, t Σ∗, the p-spectrum kernel is defined as: ∈ In our data set the most dominant gen- kp(s, t)= numv(s)numv(t) der is feminine, therefore we assign this v Σp gender to all nouns that do not match X∈ any of the previous suffixes. Table 4 where num (s) is the number of occurrences of v shows a few suffixes for each gender, string v as a substring in s 4 The feature map de- and an example noun. fined by this kernel associate to each string a vec- Romanian: in Romanian the word end- tor of dimension Σ p containing the histogram of | | • ing is a strong clue for gender, especially frequencies of all its substrings of length p. Taking for feminine nouns: the vast majority 4Note that the notion of substring requires contiguity. See end in either -e or -a˘ (Doca, 2000). We (Taylor and Cristianini, 2004) for discussion about the am- biguity between the terms ”substring” and ”subsequence” design a heuristic that assigns the gen- across different traditions: biology, computer science. der “preferred” by the last letter – the

1373 Method Accuracy masc. F-score fem. F-score neut. F-score

German BL-R 33.79 BL-M 38.42 BL-S 51.35 40.83 62.42 26.69 KRR-SK 72.36 3 64.88 5 84.34 4 64.44 7 ± ± ± ± KRR-SKnoWB 66.91 58.77 79.19 58.26

Romanian BL-R 33.3 BL-S 74.38 60.65 97.96 63.93 KRR-SK 78.83 0.8 68.74 0.9 98.05 0.2 69.38 2 KRR-SK no last letter 65.73 ± 0.6 56.11 ± 1 85.00 ± 0.5 55.05 ± 1 ± ± ± ± KRR-SKnoWB 77.36 67.54 96.75 67.39 Table 3: 10-fold cross-validation results – accuracy and f-scores percentages ( variation over the 10 runs) – for gender learning using string kernels ±

German Romanian gender suffix example gender ending example Prec. fem. -e Ecke (corner) fem. -a˘ masa˘ (table) 98.04 -heit Freiheit (freedom) -e paineˆ (bread) 97.89 -ie Komodie¨ (comedy) masc. -g sociolog (sociologist) 72.77 masc. -er Fahrer (driver) -r nor (cloud) 66.89 -ich Rettich (radish) -n domn (gentleman) 58.45 -ling Fruhling¨ (spring - season) neut. -m algoritm (algorithm) 90.95 neut. -chen Madchen¨ (girl) -s vers (verse) 66.97 -nis Verstandnis¨ (understanding) -t eveniment (event) 51.02 -o Auto (car) Table 5: Word-ending precision on classifying gender and examples for Romanian Table 4: Gender assigning rules and examples for German tion may be that in both the languages we consid- ered, there is an (almost) one-to-one mapping be- majority gender of all nouns ending in tween letters and their pronunciation, making thus the respective letter – based on analy- the pronunciation-based representation unneces- sis of our data. In Table 5 we include sary. As such, the letter level captures the interest- some of the letter endings with an exam- ing commonalities, without the need to go down ple noun, and a percentage that shows to the phoneme-level. the precision of the ending in classify- We performed experiments for Romanian when ing the noun in the gender indicated in the last letter of the word is removed. The reason the table. for this batch of experiments is to further test the hypothesis that gender is more deeply encoded in a The results of our experiments are presented word form than just the word ending. For both lan- in Table 3, in terms of overall accuracy, and f- guages we observe statistically significant higher score for each gender. The performance presented performance than all baselines. For Romanian, corresponds to the letter-based representation of the last letter heuristic gives a very high baseline, words. It is interesting to note that this represen- confirming that Romanian has strong phonologi- tation performed overall better than the phoneme cal cues for gender in the ending. Had the word or phonological feature-based ones. An explana- ending been the only clue to the word’s gender,

1374 Romanian Romanian 100 80 75 word without last letters baseline 80 70 65 60 60 55

40 accuracy 50 45 20 40 classification accuracy word length percentages 35 0 30 0 5 10 15 20 25 0 1 2 3 4 5 6 7 # of letters considered from end # of letters cut from end

German German 100 75 word without last letters 70 baseline 80 65

60 60 55

40 accuracy 50 45 20 classification accuracy 40 word length percentages 0 35 0 2 4 6 8 10 12 14 16 18 20 0 1 2 3 4 5 6 # of letters considered from end # of letters cut from end

Figure 1: Gender prediction based on the last N letters, and based on the word minus the last N letters once it is removed the performance on recogniz- the baseline changes accordingly. 94.07% of the ing gender should be close to the random assign- words have a length of at most 12 letters in the ment. This is not the case, and the improvement Romanian dataset, and 96.07% in the German one. over the random baseline is 32% points. It is inter- Because gender prediction can be done with accu- esting to notice that when cutting off the last letter racy higher than the random baseline even after 6 the class for which the gender assignment heuris- letters are cut from the ending of the word indicate tic was clearest – the feminine class with -a˘ and that for more than 94% of the words considered, -e endings – the performance remains very high – gender clues are spread over more than the second 85% F-score. half of the word. Again, we remind the reader that To further test where the gender indicators are the word forms are in nominative case, with no located, we performed two more sets of experi- case or number inflections (which are strong indi- ments: (i) classify words in their corresponding cators of gender in both Romanian and German). gender class using the word minus the last N let- Except for lines KRR SKnoWB, the results ters; (ii) classify words based on the last N let- in Table 3 are obtained through− experiments con- ters. The results of these experiments in terms of ducted on words containing word boundary mark- accuracy are presented in Figure 1. When con- ers, as indicated in Section 4. Because of these sidering only the last N letters the performance is markers, word starting or word ending substrings high for both German and Romanian, as expected are distinct from all the others, and information if the gender indicators are concentrated at the end about their position in the original word is thus of the word. It is interesting though to notice the preserved. To further explore the idea that gender results of classification based on the word without indicators are not located only in word endings, the last N letters. The prediction accuracy mono- we ran classification experiments for German and tonically decreases, but remains above the base- Romanian when the word representation does not line until more than 6 letters are cut. Because as contain word boundary markers. This means that letters are cut some words completely disappear, the substrings generated by the string kernel have

1375 no position information. The results of these ex- very tempting to try to assign grammatical gen- periments are presented in rows KRR SKnoWB der based on perceived or guessed natural gender in Table 3. The accuracy is slightly lower− than the types. This does not work out well, and it only best results obtained when word boundaries are serves to confuse the learner even more, when he marked and the entire word form is used. How- finds out that nouns expressing concepts with clear ever, they are well above all the baselines consid- feminine or masculine natural gender will have the ered, without no information about word endings. opposite or a neutral grammatical gender, or that For both German and Romanian, the gender that one concept can be referred to through names that was learned best was feminine. For German part have different grammatical genders. Going with of this effect is due to the fact that the feminine the flow of the language seems to be a better idea, class is more numerous in the data. For Roma- and allow the sound of a word to dictate the gen- nian the data was perfectly balanced, so there is no der. such bias. Neuter and masculine nouns have lower In this paper we have investigated the hypothe- learning performance. For Romanian, a contri- sis that gender is encoded in the word form, and bution to this effect is the fact that neuter nouns this encoding is more than just the word endings behave as masculine nouns in their singular form as it is commonly believed. The results obtained (take the same articles, inflections, derivations), show that gender assignment based on word form but as feminine in the , and our data consists analysis can be done with high accuracy – 72.36% of nouns in singular form. It would seem that from for German, and 78.83% for Romanian. Existing an orthographic point of view, neuter and mascu- gender assignment rules based on word endings line nouns are closer to each other than to feminine have lower accuracy. We have further strength- nouns. ened the point by conducting experiments on Ro- From the reviewed related work, the one that manian nouns without tell-tale word endings. The uses the word form to determine gender is accuracy remains high, with remarkably high per- Cucerzan and Yarowsky (2003) for Romanian. formance in terms of F-score for the feminine There are two important differences with respect class (85%). This leads us to believe that gen- to the approach presented here. First, they con- der information is somehow redundantly coded in sider words in context, which are inflected for a word. We plan to look closer at cases where number and case. Number and case inflections we obtain different predictions based on the word are reflected in suffixes that are gender specific. ending and the full form of the word, and use The words considered here are in singular form, boosting to learn weights for classifiers based on nominative case – as such, with no inflections. different parts of the word to see whether we can Second, Cucerzan and Yarowsky consider two further improve the results. classes: feminine vs. masculine and neuter. Mas- As we have underlined before, word form simi- culine and neuter nouns are harder to distinguish, larity between words under the same gender is one as in singular form neuter nouns behave like mas- criterion for gender assignment. It would be in- culine nouns in Romanian. While the datasets and teresting to verify whether gender recognition can word forms used by Cucerzan and Yarowsky are be boosted by using lexical resources that capture different than the one used here, the reader may the of the words, such as WordNets or be curious how well the word form distinguishes knowledge extracted from Wikipedia, and verify between feminine and the other two classes in whether similarities from a semantic point of view the experimental set-up used here. On the full5 are also responsible for gender assignments in var- Romanian dataset described in Section 3, a two ious languages. class classification gives 99.17% accuracy. When predicting gender for all words in their dataset, Cucerzan and Yarowsky obtain 98.25% accuracy. References Ana-Maria Barbu. 2008. Romanian lexical data 6 Conclusion bases: Inflected and syllabic forms dictionar- ies. In Proceedings of the Sixth International When a speaker of a tries to Language Resources and Evaluation (LREC’08). learn a language with grammatical gender, it is http://www.lrec-conf.org/proceedings/lrec2008/.

5By “full” we mean the dataset before balancing the Sharon Bedgley. 2002. Strawberry is classes 48,477 instances (see Table 1). no blackberry: Building brands us-

1376 ing sound. http://online.wsj.com/article/ Katharina Spalek, Julie Franck, Herbert Schriefers, and 0,,SB1030310730179474675.djm,00.html. Ulrich Frauenfelder. 2008. Phonological regulari- ties and grammatical gender retrieval in spoken word Leonard Bloomfield. 1933. Language. Holt, Reinhart recognition and word production. Journal of Psy- & Winston, New York. cholinguistic Research, 37(6):419–442.

Marcel Botton, Jean-Jack Cegarra, and Beatrice Fer- John S. Taylor and Nello Cristianini. 2004. Kernel rari. 2002. Il nome della marca: creazione e strate- Methods for Pattern Analysis. Cambridge Univer- gia di naming, 3rd edition. Guerini e Associati. sity Press, New York, NY, USA.

Roger Brown. 1958. Words and Things. The Free Barbara Unterbeck, editor. 1999. Gender in Press, New York. and Cognition. Approaches to Gender. Trends in Karl Brugmann. 1889. Das Nominalgeschlecht Linguistics. Studies and Monographs. 124. Mouton in den indogermanischen Sprachen. In Inter- de Gruyter. nationale Zeitschrift fur¨ allgemenine Sprachwis- Gabriela Vigliocco, David Vinson, Peter Indefrey, senschaft, pages 100–109. Willem Levelt, and Frauke Hellwig. 2004a. Role of S. Cucerzan and D. Yarowsky. 2003. Minimally super- grammatical gender and semantics in german word vised induction of grammatical gender. In Proceed- production. Journal of Experimental Psychology: ings of HLT-NAACL 2003, pages 40–47. Learning, Memory and Cognition, 30(2):483–497. Ferdinand de Saussure. 1916. Cours de linguistique Gabriela Vigliocco, David Vinson, and Federica Pa- gen´ erale´ . Harrassowitz, Wiesbaden. ganelli. 2004b. Grammatical gender and meaning. In Proc. of the 26th Meeting of the Cognitive Science Gheorghe Doca Doca. 2000. . Vol. Society. II: Morpho-Syntactic and Lexical Structures. Ars Docendi, Bucharest, Romania. George Zipf. 1935. The Psychobiology of Language. Addison-Wesley. John Rupert Firth. 1951. Modes and meaning. In Papers in linguistics 1934-1951. Oxford University David Zubin and Klaus-Michael Kopcke.¨ 1981. Gen- Press, London. der: A less than arbitrary . In R. Hendrick, C. Masek, and M. F. Miller, editors, Angela Friederici and Thomas Jacobsen. 1999. Pro- Papers from the seventh regional meeting, pages cessing grammatical gender during language com- 439–449. Chicago Linguistic Society, Chicago. prehension. Journal of Psychological Research, 28(5):467–484. David Zubin and Klaus-Michael Kopcke.¨ 1986. Gen- der and folk taxonomy: The indexical relation be- Jacob Grimm. 1890. Deutsche Grammatik. tween grammatical and lexical categorization. In C. Craig, editor, Noun classes and categorization, Roman Jakobson. 1937. Lectures on Sound and Mean- pages 139–180. Benjamins, Philadelphia. ing. MIT Press, Cambridge, MA. Otto Jespersen. 1922. Language - its Nature, Devel- opment and Origin. George Allen & Unwim Ltd., London. Marcin Kilarski. 2007. On grammatical gender as an arbitrary and redundant category. In Douglas Kil- bee, editor, History of Linguistics 2005: Selected pa- pers from the 10th International Conference on the History of Language Sciences (ICHOLS X), pages 24–36. John Benjamins, Amsterdam. Eric Margolis and Stephen Laurence, editors. 1999. Concepts: Core Readings. MIT Press. Ryan Rifkin and Aldebaro Klautau. 2004. In de- fense of one-vs-all classification. Journal of Ma- chine Learning Research, 5(January):101–141. Johannes Schumann. 2006. Mittelstufe Deutsch. Max Hueber Verlag. Peter Sells and Sierra Gonzales. 2003. The language of advertising. http://www.stanford.edu/class/linguist34/; in particular unit 8: ˜/Unit 08/blackberry.htm.

1377