Interpreting Sequence-To-Sequence Models for Russian Inflectional Morphology

Total Page:16

File Type:pdf, Size:1020Kb

Interpreting Sequence-To-Sequence Models for Russian Inflectional Morphology Interpreting Sequence-to-Sequence Models for Russian Inflectional Morphology David L. King Andrea D. Sims Micha Elsner The Ohio State University king.2138, sims.120, elsner.14 @osu.edu { } Abstract Our broader linguistic contribution is to recon- nect the inflection task to the descriptive literature Morphological inflection, as an engineering task in NLP, has seen a rise in the use of neu- on morphological systems. Neural models for in- ral sequence-to-sequence models (Kann and flection are now being applied as cognitive models Schutze¨ , 2016; Cotterell et al., 2018; Aha- of human learning in a variety of settings (Mal- roni and Goldberg, 2017). While these out- ouf, 2017; Silfverberg and Hulden, 2018; Kirov perform traditional systems based on edit rule and Cotterell, 2018, and others). They are appeal- induction, it is hard to interpret what they ing cognitive models partly because of their high are learning in linguistic terms. We propose performance on benchmark tasks (Cotterell et al., a new method of analyzing morphological sequence-to-sequence models which groups 2016, and subsq.), and also because they make errors into linguistically meaningful classes, few assumptions about the morphological system making what the model learns more transpar- they are trying to model, dispensing with overly ent. As a case study, we analyze a seq2seq restrictive notions of segmentable morphemes and model on Russian, finding that semantic and discrete inflection classes. But while these con- lexically conditioned allomorphy (e.g. inani- structs are theoretically troublesome, they are still mate nouns like ZAVOD ‘factory’ and animates important for describing many commonly-studied like OTEC ‘father’ have different, animacy- languages; without them, it is relatively difficult conditioned accusative forms) are responsible for its relatively low accuracy. Augmenting to discover what a particular model has and has the model with word embeddings as a proxy not learned about a morphological system. This for lexical semantics leads to significant im- is often the key question which prevents us from provements in predicted wordform accuracy. using a general-purpose neural network system as a cognitive model (Gulordava et al., 2018). 1 Introduction Our error analysis allows us to understand more Neural sequence-to-sequence models excel at clearly how the sequence-to-sequence model di- learning inflectional paradigms from incomplete verges from human behavior, giving us new infor- input (Table 1 shows an example inflection prob- mation about its suitability as a cognitive model of lem.) These models, originally borrowed from the language learner. neural machine translation (Bahdanau et al., As a case study, we apply our error anal- 2014), read in a series of input tokens (e.g. char- ysis technique to Russian, one of the lowest- acters, words) and output, or translate, them as performing languages in SIGMORPHON 2016. another series. Although these models have be- We find a large class of errors in which the come adept at mapping input to output sequences, model incorrectly selects among lexically- or like all neural models, they are relatively uninter- semantically-conditioned allomorphs. Russian pretable. We present a novel error analysis tech- has semantically-conditioned allomorphy in nouns nique, based on previous systems for learning to and adjectives, and lexically-conditioned allomor- inflect which relied on edit rule induction (Durrett phy (inflection classes) in nouns and verbs (Tim- and DeNero, 2013). By using this to interpret the berlake, 2004); Section 3 gives a brief introduction output of a neural model, we can group errors into to the relevant phenomena. While these facts are linguistically salient classes such as producing the commonly known to linguists, their importance wrong case form or incorrect inflection class. to modeling the inflection task has not previously Source Features Target Faruqui et al. (2016) introduced the use ABASˇ pos=N, ABASˇ of attention-based neural sequence-to-sequence case=NOM, learning for the inflection task, building on models num=SG from machine translation (Bahdanau et al., 2014). JATAGAN pos=N, JATAGANAMI Their model treats input as a linear series where case=INS, grammatical features and characters are encoded num=PL as one-hot embeddings and passed to a bidirec- tional encoder LSTM; output for each paradigm Table 1: An example inflection problem: the task is to cell is produced by a separate decoder. Kann map the Source and Features to the correct, fully in- flected Target. and Schutze¨ (2016) extended Faruqui et al.’s ar- chitecture by using ensembling and by using a single decoder, shared across all output paradigm been pointed out. Section 4 shows that these phe- cells, to account for data sparsity. Later systems nomena account for most of Russian’s increased (Aharoni and Goldberg, 2017; Kann and Schutze¨ , difficulty relative to the other languages. In Sec- 2017) have made changes to the input represen- tion 6, we provide lexical-semantic information to tation and the architecture, for instance incorpo- the model, decreasing errors due to semantic con- rating variants of hard attention and autoencoding. ditioning of nouns by 64% and of verbs by 88%. From a theoretical standpoint, all these models are “a-morphous” (Anderson, 1992) or “inferential- realizational” (Stump, 2001)— rather than assume 2 Background a concatenative process which stitches discrete morphemes together into surface word forms, they The inflection task described above is an instance learn a flexible, generalizable transduction, ei- of the paradigm cell filling problem (Ackerman ther between a stem and surface form (Anderson, et al., 2009), and models a situation which both 1992; Stump, 2001), or between pairs of surface computational and human learners face. For hu- forms (Albright, 2002; Blevins, 2006). mans, the PCFP is closely related to the “wug test” (Berko, 1958): given some previously un- Some older learning-based inflection systems, seen word, how does a speaker produce a different such as Durrett and DeNero (2013), exploit se- inflected form? As Lignos and Yang (2016) and quence alignment across strings. Alignment-based Blevins et al. (2017) point out, the same Zipfian systems essentially treat morphology as concate- distribution that makes other NLP tasks (e.g. MT) nation. While they do not perform full-scale mor- difficult is also at play in morphology, namely that phological analysis (since they do not account for no corpus will ever exist that has every wordform phonological alternations), in languages which are from every lexeme. For theoretical morphologists, mostly concatenative, they do tend to isolate affix- the difficulty of the PCFP on average is a mea- like units as sequences of adjacent insertions or sure of the learnability of a morphological system, deletions. This property has been criticized in the with implications for language typology (Acker- neural literature (Faruqui et al., 2016) since it rep- man et al., 2009; Ackerman and Malouf, 2013; Al- resents processes like vowel harmony by enumer- bright, 2002; Bonami and Beniamine, 2016; Sims ating large sets of surface allomorphs, making the and Parker, 2016). learning problem harder. We agree with these crit- icisms from the modeling standpoint, but we ex- Ackerman et al.’s (2009) formulation of the ploit the interpretability of the technique in our PCFP relies on a simple concatenative model in analysis of model results. which words are divided into stems and affixes, and in which each affix is treated as a discrete Our study of Russian concludes that value. Cotterell et al. (2018) points out that this semantically- and lexically-conditioned allo- model is ill-suited to dealing with phenomena morphy constitutes a problem for current neural like phonological alterations or stem suppletion. reinflection models. This is because such models Newer models (Silfverberg and Hulden, 2018; are trained to map input to output character Malouf, 2017; Cotterell et al., 2018) use sequence- sequences; they do not typically have access to-sequence inflection models to avoid these short- to information about what the words they are comings. inflecting mean. We show that, by providing word embeddings as meaning representations, Case Singular Plural we can reduce this source of error and bring Nominative ,-’,-J,-IJ -”, -I,-II ; Russian closer to the other languages studied in Accusative N or G SIGMORPHON 2016. Genitive -A,-JA,-IJA -OV,-EJ, Recently the NLP community has also pushed -EV,-IEV for greater transparency with neural models (xci, Dative -U,-JU,-IJU -AM,-JAM, 2017; ana, 2019). Wilcox et al. (2018) showed -IJAM that RNNs learn hierarchical structure in sentences Instrumental -OM,-EM, -AMI,-JAMI, like island constraints. Faruqui et al. demon- -IEM -IJAMI strated that RNNs can automatically learn which Locative -E,-II -AX,-JAX, vowel pairs participate in vowel harmony alterna- -IJAX tion. Our error analysis allows us to interpret what neural models are learning, reconnecting inflec- Table 2: An example of class IA, showing the effect 2 tion tasks to linguistic intuitions by generalizing of animacy in the orthography across the singular and over error classes. plural accusative forms, where N or G indicate where syncretism occurs in the accusative form based on ani- 3 Russian Inflectional Morphology macy. We select Russian as our language of analysis be- cause it was among the three worst-performing gular and plural and in classes IB, II, and III ac- languages in the SIGMORPHON 2016 shared cusative plurals, the accusative exhibits syncretism task, falling 4+ percentage points behind the other with either the genitive (for animates) or the nom- languages. Problems with the design of the Navajo inative (for inanimates). In the case of the ani- and Maltese datasets may have been the source mate noun STUDENT (‘student’), for example, the of the problems with those languages,1 but this nominative singular form is student and the ac- cannot explain the Russian results. The discrep- cusative singular and genitive singular forms are ancy hints at some linguistic property which dis- both studenta.
Recommended publications
  • The Intramorphological Meanings of Thematic Vowels in Italian Verbs
    Sede Amministrativa: Università degli Studi di Padova Dipartimento di Studi Linguistici e Letterari SCUOLA DI DOTTORATO DI RICERCA IN: Scienze Linguistiche, Filologiche e Letterarie INDIRIZZO: Linguistica CICLO: XXIV The Intramorphological Meanings of Thematic Vowels in Italian Verbs Direttore della Scuola: Ch.ma Prof. Rosanna Benacchio Coordinatore d’indirizzo: Ch.mo Prof. Gianluigi Borgato Supervisore: Ch.ma Prof. Laura Vanelli Dottoranda: Martina Da Tos Abstract Italian verbs are traditionally classified into three major classes called ‗conjugations‘. Membership of a verb in one of the conjugations rests on the phonological content of the vowel occurring after the verbal root in some (but not all) word forms of the paradigm. This vowel is called ‗thematic vowel‘. The main feature that has been attributed to thematic vowels throughout morphological literature is that they do not behave as classical Saussurean signs in that lack any meaning whatsoever. This work develops the claim that the thematic vowels of Italian verbs are, in fact, Saussurean signs in that they can be attributed a ‗meaning‘ (‗signatum‘), or even more than one (‗signata‘). But the meanings that will be appealed to are somehow different from those which have traditionally been attributed to other morphological units, be they stems or endings: in particular, these meanings would not be relevant to the interpretation of a word form; rather, they would be relevant at the ‗purely morphological‘ (‗morphomic‘, in Aronoff‘s (1994) terms) level of linguistic analysis. They are thus labelled ‗intramorphological‘, remarking that they serve nothing but the morphological machinery of the language. The recognition of ‗intramorphological signata‘ for linguistic signs strongly supports the claim about the autonomy of morphology within the grammar.
    [Show full text]
  • An Analysis of Semantic and Syntactic Negation in Oshiwambo and English
    JULACE: Journal of University of Namibia Language Centre Volume 4, No. 2, 2019 (ISSN 2026-8297) An analysis of semantic and syntactic negation in Oshiwambo and English Immanuel N. Shatepa1 Namibia University of Science and Technology Abstract Negation in languages has been documented since the 1970s and 1980s. This paper attempts to explain the negation structures in semantic and syntactic structures of Oshiwambo and English languages. These two languages have two complete negation structures and how they function to achieve negation is far from being similar. The focus of the paper was on the analysis of the sentential negation and how negative particles are used in English and Oshiwambo, a Bantu language. It analyzes and compares the use of full negatives, affixes and quasi negative words to achieve negation in English and Oshiwambo language. The Oshiwambo and English texts/contents were purposely sampled and content analysis was performed accordingly. The analysis shows that Bantu languages share a common rule of negation which is the use of a pre- initial prefix while the rules to changing negative imperative to interrogative or declarative are different between English and Oshiwambo. Keywords: pre-initial, negation, marker, imperative, declarative, integrative, syntactic, semantic, Bantu Introduction A number of studies on negation of English and Bantu languages have been conducted by linguists. Studies by Kim and Sag (1995); Neba and Tanda (2005) and Weir (2013) show that negation differs from language to language. They further state highlighted that since every language has its own morphological and semantic components to express negation, one rule fits all simply does not work.
    [Show full text]
  • Classifiers: a Typology of Noun Categorization Edward J
    Western Washington University Western CEDAR Modern & Classical Languages Humanities 3-2002 Review of: Classifiers: A Typology of Noun Categorization Edward J. Vajda Western Washington University, [email protected] Follow this and additional works at: https://cedar.wwu.edu/mcl_facpubs Part of the Modern Languages Commons Recommended Citation Vajda, Edward J., "Review of: Classifiers: A Typology of Noun Categorization" (2002). Modern & Classical Languages. 35. https://cedar.wwu.edu/mcl_facpubs/35 This Book Review is brought to you for free and open access by the Humanities at Western CEDAR. It has been accepted for inclusion in Modern & Classical Languages by an authorized administrator of Western CEDAR. For more information, please contact [email protected]. J. Linguistics38 (2002), I37-172. ? 2002 CambridgeUniversity Press Printedin the United Kingdom REVIEWS J. Linguistics 38 (2002). DOI: Io.IOI7/So022226702211378 ? 2002 Cambridge University Press Alexandra Y. Aikhenvald, Classifiers: a typology of noun categorization devices.Oxford: OxfordUniversity Press, 2000. Pp. xxvi+ 535. Reviewedby EDWARDJ. VAJDA,Western Washington University This book offers a multifaceted,cross-linguistic survey of all types of grammaticaldevices used to categorizenouns. It representsan ambitious expansion beyond earlier studies dealing with individual aspects of this phenomenon, notably Corbett's (I99I) landmark monograph on noun classes(genders), Dixon's importantessay (I982) distinguishingnoun classes fromclassifiers, and Greenberg's(I972) seminalpaper on numeralclassifiers. Aikhenvald'sClassifiers exceeds them all in the number of languages it examines and in its breadth of typological inquiry. The full gamut of morphologicalpatterns used to classify nouns (or, more accurately,the referentsof nouns)is consideredholistically, with an eye towardcategorizing the categorizationdevices themselvesin terms of a comprehensiveframe- work.
    [Show full text]
  • Morphological Classes and Gender in Ɓəna-Yungur
    Morphological classes and gender in əna-Yungur Mark van de Velde, Dmitry Idiatov To cite this version: Mark van de Velde, Dmitry Idiatov. Morphological classes and gender in əna-Yungur. Shigeki Kaji. Proceedings of the 8th World Congress of African Linguistics, Research Institute for Languages and Cultures of Asia and Africa, Tokyo University of Foreign Studies, pp.53-65, 2017. halshs-01484016 HAL Id: halshs-01484016 https://halshs.archives-ouvertes.fr/halshs-01484016 Submitted on 6 Mar 2017 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Morphological classes and gender in Ɓə́ná-Yungur Mark Van de Velde1,2 & Dmitry Idiatov1,2 1Llacan (UMR 8135 CNRS – USPC/Inalco), Paris, France 2Research Centre for Nigerian Languages, KWASU, Malete, Nigeria Abstract This paper provides an analysis of the gender system of Ɓə́ná-Yungur (glottocode: bena1260), distinguishing noun classes proper, defined as agreement classes, from morphological classes, defined in terms of number marking on nouns. The gender system is typologically unusual in its symmetry and simplicity. Ɓə́ná-Yungur has three noun classes in the singular and the same three classes in the plural. All logically possible singular-plural pairings are attested, except one.
    [Show full text]
  • 0 German Noun Class As a Nominal Protection Device Richard Futrell
    German Noun Class as a Nominal Protection Device Richard Futrell Undergraduate Honors Thesis Stanford University, Department of Linguistics May 2010 ____________________________________ Faculty advisor: Dan Jurafsky ____________________________________ Second reader: Michael Ramscar 0 Gretchen: Wilhelm, where is the turnip? Wilhelm: She has gone to the kitchen. Gretchen: Where is the accomplished and beautiful English maiden? Wilhelm: It has gone to the opera. - Mark Twain, ―The Awful German Language‖ 1. Introduction Grammatical gender, also known as noun class, afflicts about half of the world's languages. Speakers of these languages must mark each noun for its membership in a certain noun class, and must similarly mark elements such as adjectives or verbs that agree with the noun. In over half of these cases, the choice of gender for a noun has no comprehensive systematic relationship to the meaning of the nouns (Corbett 2008), posing a significant obstacle for L2 learners (Harley 1979, Tucker et al. 1968: 312). Some have taken grammatical gender to be superfluous: for instance, Maratsos (1979: 235) calls the existence of such a system ―excellent testimony to the occasional nonsensibleness of the species.‖ A number of researchers have nonetheless defended noun class against the accusation of uselessness, and I will take up that cause in this paper. I will examine the gender system of Standard German with an eye toward detecting function. I claim that noun class serves as a sort of ‗Nominal Protection Device,‘ alleviating the linguistic difficulties inherent in nouns by reducing uncertainty about nouns. Noun class markers help language users predict nouns in a number of ways: they predict the form of the noun, they predict the semantics of the noun, and they predict 1 which discourse referent a pronoun points to in reference tracking (in the sense of Barlow 1992).
    [Show full text]
  • The Noun Class System Of
    UNIVERSITY OF YAOUNDE !! <f:.,;+ e.>, " PA, FACULTY OF LETTERS DEPARTMENT OF 'AFRICIAN AND SOCIAL SCIENCES LANGUAGES AND LINGUISTICS THE NOUN CLASS SYSTEM OF A Dissertation Presented in Partial Fulfilment of the Requirements for the Award of a Post-Graduate Diploma (Maitrise) in Linguistics BY Irene Swiri ASOBO 8. A. Modern Lefters Supervised by Dr. CarL EBOBISSE (Char@ de cours) September 1989 i Dedicated to my parents, bro-them and sisters, with all my love. C. ACKIIOKLEGEMENT I must acknowledge special indebtedness to my supervisor Dr. Cor1 EBOBISSE for his invaluable contribu- tion ot the realisation of this dissertation. His inde- fatigable patience criticisms and unswerving dcvotion were encouraging especially when I was doubting and discouraged. Llithout his potient guidance this work would not have been acheived. My heortfelt'gratitu8e to fioffessor B.S. Chumbow who assisted and advised me during the writing of this work. My special thanks also goes to Dr. Chia Emmanuel otic? 011 my lecturers who were a source of unwmering support to me. Great thankfulness to my parents Prince snii Mrs V.T. ASOBO for the moral and financial ai& they showered on me. I will like to gratefully 3cknowledge Evclyne Monikang for the wonderful and ucfziling encouragomcnt she gnvc me. She was always a pillar to 1em on. Sincere thanks to 311 my classmates whose camoroderie wc?s a11 T neef-ed to spur me on. All. my friends especially Walters Abie who was always ready to help, Po-po who never stopped to say go on and Dora Mbola for being there when I needed her.
    [Show full text]
  • Ablaut and the Latin Verb
    Ablaut and the Latin Verb Aspects of Morphophonological Change Inaugural-Dissertation zur Erlangung des Doktorgrades der Philosophie an der Ludwig-Maximilians-Universität München vorgelegt von Ville Leppänen aus Tampere, Finnland München 2019 Parentibus Erstgutachter: Prof. Dr. Olav Hackstein (München) Zweitgutachter: Prof. Dr. Gerhard Meiser (Halle) Datum der mündlichen Prüfung: 17. Mai 2018 ii Contents Acknowledgements .................................................................................................................. vii List of abbreviations and symbols ........................................................................................... viii 1. Introduction ............................................................................................................................ 1 1.1. Scope, aim, theory, data, and method ............................................................................. 2 1.2. Previous research ............................................................................................................. 7 1.3. Terminology and definitions ......................................................................................... 12 1.4. Ablaut ............................................................................................................................ 14 2. Verb forms and formations .................................................................................................. 17 2.1. Verb systems overview ................................................................................................
    [Show full text]
  • 4.1 Inflection
    4.1 Inflection Within a lexeme-based theory of morphology, the difference between derivation and inflection is very simple. Derivation gives you new lexemes, and inflection gives you the forms of a lexeme that are determined by syntactic environment (cf. 2.1.2). But what exactly does this mean? Is there really a need for such a distinction? This section explores the answers to these questions, and in the process, goes deeper into the relation between morphology and syntax. 4.1.1 Inflection vs. derivation The first question we can ask about the distinction between inflection and derivation is whether there is any formal basis for distinguishing the two: can we tell them apart because they do different things to words? One generalization that has been made is that derivational affixes tend to occur closer to the root or stem than inflectional affixes. For example, (1) shows that the English third person singular present inflectional suffix -s occurs outside of derivational suffixes like the deadjectival -ize, and the plural ending -s follows derivational affixes including the deverbal -al: (1) a. popular-ize-s commercial-ize-s b. upheav-al-s arriv-al-s Similarly, Japanese derivational suffixes like passive -rare or causative -sase precede inflectional suffixes marking tense and aspect:1 (2) a. tabe-ru tabe-ta eat- IMP eat- PERF INFLECTION 113 ‘eats’ ‘ate’ b. tabe-rare- ru tabe-rare- ta eat - PASS-IMP eat- PASS-PERF ‘is eaten’ ‘was eaten’ c. tabe-sase- ru tabe-sase- ta eat- CAUS-IMP eat- CAUS-PERF ‘makes eat’ ‘made eat’ It is also the case that inflectional morphology does not change the meaning or grammatical category of the word that it applies to.
    [Show full text]
  • Indo-European Linguistics: an Introduction Indo-European Linguistics an Introduction
    This page intentionally left blank Indo-European Linguistics The Indo-European language family comprises several hun- dred languages and dialects, including most of those spoken in Europe, and south, south-west and central Asia. Spoken by an estimated 3 billion people, it has the largest number of native speakers in the world today. This textbook provides an accessible introduction to the study of the Indo-European proto-language. It clearly sets out the methods for relating the languages to one another, presents an engaging discussion of the current debates and controversies concerning their clas- sification, and offers sample problems and suggestions for how to solve them. Complete with a comprehensive glossary, almost 100 tables in which language data and examples are clearly laid out, suggestions for further reading, discussion points and a range of exercises, this text will be an essential toolkit for all those studying historical linguistics, language typology and the Indo-European proto-language for the first time. james clackson is Senior Lecturer in the Faculty of Classics, University of Cambridge, and is Fellow and Direc- tor of Studies, Jesus College, University of Cambridge. His previous books include The Linguistic Relationship between Armenian and Greek (1994) and Indo-European Word For- mation (co-edited with Birgit Anette Olson, 2004). CAMBRIDGE TEXTBOOKS IN LINGUISTICS General editors: p. austin, j. bresnan, b. comrie, s. crain, w. dressler, c. ewen, r. lass, d. lightfoot, k. rice, i. roberts, s. romaine, n. v. smith Indo-European Linguistics An Introduction In this series: j. allwood, l.-g. anderson and o.¨ dahl Logic in Linguistics d.
    [Show full text]
  • 1 Noun Classes and Classifiers, Semantics of Alexandra Y
    1 Noun classes and classifiers, semantics of Alexandra Y. Aikhenvald Research Centre for Linguistic Typology, La Trobe University, Melbourne Abstract Almost all languages have some grammatical means for the linguistic categorization of noun referents. Noun categorization devices range from the lexical numeral classifiers of South-East Asia to the highly grammaticalized noun classes and genders in African and Indo-European languages. Further noun categorization devices include noun classifiers, classifiers in possessive constructions, verbal classifiers, and two rare types: locative and deictic classifiers. Classifiers and noun classes provide a unique insight into how the world is categorized through language in terms of universal semantic parameters involving humanness, animacy, sex, shape, form, consistency, orientation in space, and the functional properties of referents. ABBREVIATIONS: ABS - absolutive; CL - classifier; ERG - ergative; FEM - feminine; LOC – locative; MASC - masculine; SG – singular 2 KEY WORDS: noun classes, genders, classifiers, possessive constructions, shape, form, function, social status, metaphorical extension 3 Almost all languages have some grammatical means for the linguistic categorization of nouns and nominals. The continuum of noun categorization devices covers a range of devices from the lexical numeral classifiers of South-East Asia to the highly grammaticalized gender agreement classes of Indo-European languages. They have a similar semantic basis, and one can develop from the other. They provide a unique insight into how people categorize the world through their language in terms of universal semantic parameters involving humanness, animacy, sex, shape, form, consistency, and functional properties. Noun categorization devices are morphemes which occur in surface structures under specifiable conditions, and denote some salient perceived or imputed characteristics of the entity to which an associated noun refers (Allan 1977: 285).
    [Show full text]
  • Staged Approach for Grammatical Gender Identification of Nouns Using Association Rule
    Staged Approach for Grammatical Gender Identification of Nouns using Association Rule Mining and Classification Shilpa Desai1, Jyoti Pawar1, and Pushpak Bhattacharyya2 1 Department of Computer Science and Technology Goa University, Goa - India [email protected], [email protected] 2 Department of Computer Science and Engineering IIT-Bombay, Mumbai - India [email protected] Abstract. In some languages, gender is a grammatical property of the noun. Grammatical gender identification enhances machine translation of such languages. This paper reports a three staged approach for gram- matical gender identification that makes use of word and morphological features only. A Morphological Analyzer is used to extract the morpho- logical features. In stage one, association rule mining is used to obtain grammatical gender identification rules. Classification is used at the sec- ond stage to identify grammatical gender for nouns that are not covered by grammatical gender identification rules obtained in stage one. The third stage combines the results of the two stages to identify the gender. The staged approach has a better precision, recall and F-score compared to machine learning classifiers used on complete data set. The approach was tested on Konkani nouns extracted from the Konkani WordNet and an F-Score 0.84 was obtained. 1 Introduction Gender is a grammatical property of nouns in many languages3 including Indian language such as Sanskrit, Hindi, Gujarati, Marathi and Konkani. In such lan- guages adjectives and verbs in a sentence agree with the gender of the noun. For example translation of \He is a good boy" and \She is a good girl" in Hindi is \vaha eka achchhaa laDakaa haai 4" and "vaha eka achchhii laDakii haai", respectively.
    [Show full text]
  • A Tutorial on Lexical Classes
    A tutorial on lexical classes Ricardo Bermúdez-Otero University of Manchester INTRODUCTION Goals of the tutorial §1 Our brief: L to address the key issues arising from exponence phenomena in which the lexicon is divided into arbitrary subsets, notably (a) inflectional classes and (b) cophonologies. §2 Our focus: L to explore the advantages and drawbacks of theories that model such phenomena by means of diacritic features. Overview §3 The discussion will touch upon the following questions: • Architectural issues How must the grammar be organized in order to guarantee the syntactic inertness of inflectional class features? Can syntactic features like gender be sensitive to inflectional class membership, and can inflectional class membership be determined by phonological properties? • Perspectives on inflectional class features (1): denial To what extent can inflectional class features be eliminated by (a) storing complex items in the lexicon and (b) invoking replacive mechanisms of exponence (‘overwriting’)? • Perspectives on inflectional class features (2): strengthening Instead of eliminating inflectional class features, do we rather need a more elaborate theory (e.g. one based on feature decomposition) capable of accounting for a wider range of phenomena (e.g. cross-class syncretism)? • Class features in phonology: cophonologies Can patterns of overlap between cophonologies be captured through the decomposition of cophonology diacritics, just like syncretisms between inflectional classes? Can derived environment effects in phonology be analysed in terms of the percolation of cophonology diacritics within morphological structures? http://www.bermudez-otero.com/WOTM4.pdf Workshop on Theoretical Morphology 4, Großbothen, 20 June 2008 2 ARCHITECTURAL ISSUES Basic properties of inflectional classes §4 Inflectional classes are traditionally defined by two properties: • arbitrariness and • syntactic inertness.
    [Show full text]