Finite-State Morphological Analysis for Marathi

Vinit Ravishankar
Faculty of ICT, University of Malta
Msida MSD 2080, Malta

Francis M. Tyers
School of Linguistics, Higher School of Economics
Moscow, Russia

Proceedings of the 13th International Conference on Finite State Methods and Natural Language Processing, pages 50–55, Umeå, Sweden, 4–6 September 2017. ©2017 Association for Computational Linguistics. https://doi.org/10.18653/v1/W17-4006

Abstract

This paper describes the development of free/open-source morphological descriptions for Marathi, an Indo-Aryan language spoken in the state of Maharashtra in India. We describe the conversion and usage of an existing Latin-based lexicon for our Devanagari-based analyser, taking into account the distinction between full vowels and diacritics, which the Latin transliteration does not adequately capture. Marathi displays elements of both fusional and agglutinative morphology, which gives us different ways to potentially treat the morphology; philosophically, we approach our analyser by treating the morphological system as a three-layer affixing system. We use the lttoolbox lexicon formalism for describing the finite-state transducer, and attempt to work within a morphological framework that would allow for some consistency across Indo-Aryan languages, enabling machine translation across language pairs. An evaluation of our finite-state transducer shows that the coverage is adequate, over 80% on two corpora, and the precision is good (over 97%).

1 Introduction

This paper describes the development of free/open-source morphological descriptions of Marathi, an Indo-Aryan language spoken in the state of Maharashtra in India. Morphological descriptions are computational models of a language's morphology, and are used to output morphological analyses from word forms and vice versa.

In section 2, the paper gives an overview of Marathi morphology, and discusses some of the grammatical decisions we made during development of the analyser. Section 3 is a literature review of previous work done in the field. Section 4 describes the methodology we followed whilst working on the analyser and the formalisms we have used, and describes our lexicon. We continue with section 6, describing the evaluation metrics we have chosen, and how well our analyser performs on them. Section 7 describes potential future work we could do.

2 Marathi

Marathi is an Indo-Aryan language spoken primarily in the west Indian state of Maharashtra, and had approximately 62 million speakers as of 2003 (Pandharipande, 2003). Despite being an Indo-European language, Marathi has borrowed several features, such as clusivity and certain retroflex consonants (such as the retroflex lateral flap), that are either absent or relatively uncommon in other Indo-Aryan languages.

Whilst Marathi retains some fusional morphological aspects of its proto-language, Sanskrit, it displays morphological agglutination within many contexts. Our analysis broadly follows the perspective of Masica (1993). They consider the split morphological system to be a form of morphological "layering", with a primary layer consisting mainly of inherited fusional elements (the "oblique" case), a secondary agglutinative layer, and a tertiary postpositional layer. These layers are, to a certain extent, universal amongst Indo-Aryan languages: they differ largely in the conditions under which they occur, and in language-specific variations. A brief, specific definition of the layers in Marathi would, therefore, look like:

1. The "oblique" case; complex morphophonemic changes in the lemma, e.g. मुलगा mulagā "boy" → मुला mulā

2. Agglutinative suffixes, similar to traditional cases that mark noun functions, like the nominative or the genitive.

3. Postpositions; morphologically and semantically complex elements. These can attach to an (optional) oblique genitive suffix in layer 2.

Certain particles, such as the emphasis particle -च -c, or particles like -ही -hī and -सुद्धा -suddhā "as well as", are quite common, and can attach as a suffix to most words, with the exception of conjunctions. Verbs, along with the optional negation particle, decline for tense, aspect and mood, and have adjectival and adverbial derivations. (1) is an example with two of the three case layers and two suffix particles.

(1) (to)  ghar-ā-māge-hī                   ge-l-ā-c
    (he)  house-obl-behind.post-too.ptcl   go-pfv-3msg-foc
    "He definitely went behind the house too"

3 Prior work

There have been a number of efforts to develop morphological analysers for Marathi over the years. While morphological analysis for Marathi is fairly well studied, one downside of previous work is that the software and lexicons are not freely available. Dixit et al. (2005) present a spellchecker for the language based on a lexicon of 13,000 root words and morphological rules. Their evaluation of spell-checking accuracy showed that out of 10,648 words classified as correctly spelt, only 0.45% were actually false positives. The morphological analyser of Bapat et al. (2010) is based on a word–paradigm approach modelled with a finite-state transducer and contains a lexicon of 24,035. They evaluate 21,096 unique word forms from a corpus and find that 97.18% receive all and only the correct morphological analyses; it is worth noting, however, that their dictionary was created to specifically fit their evaluation corpus. Another analyser based on finite-state technology is described by Dabre et al. (2012); evaluated against a gold standard of 1,341 words, it achieved an accuracy of 72.18%. The size of the lexicon was not specified. Gawade et al. (2013) also use a finite-state transducer to model Marathi morphology, although their paper does not evaluate its effectiveness.

Resources like BabelNet [1], generated by statistically machine translating WordNet ontologies, do not appear to be very useful: whilst BabelNet does contain some Marathi nouns, common verbs are all absent.

4 Development

We initially worked on the open word classes, many of which could be successfully scraped from the resources of the Language Technologies Research Centre (LTRC), at the International Institute of Information Technology, Hyderabad [2]. As the lexicon was in WX notation [3], a transliteration script was used, along with standard UNIX command line utilities, to extract and convert noun paradigms. Adjective paradigms were fairly trivial to convert; a significant number of adjectives do not inflect at all, and most others are very regular. Words were then scraped (extracted) from the lexicon and assigned to their respective paradigms.

Verbal declensions were stored with a different method; separate files existed, not for separate paradigms, but for separate word forms. Each file had a set of words, declined to match the particular form described by the file. The lexicon, however, was similar to the nominal lexicon, in that verbs were assigned a particular verb paradigm. Rather than merge these multiple files into a single paradigm, we created our own verbal paradigms, with Dhongade and Wali (2009) and Masica (1993) as references. The verb list, however, primarily consists of entries from the LTRC lexicon.

[1] http://babelnet.org/
[2] Available from the LTRC website at http://ltrc.iiit.ac.in/showfile.php?filename=onlineServices/morph/index.htm
[3] WX is an ASCII-based transliteration scheme for Indian languages; the name derives from the use of 'w' and 'x' for dental stops.

4.1 Formalisms

For the finite-state transducer we employ the lttoolbox formalism, an XML-based format used in the Apertium project (Forcada et al., 2011). This formalism is widely used for encoding language data, with Apertium having over 40 language pairs for machine translation. Although we could have used an FST toolkit like HFST (Lindén et al., 2011) or Foma (Hulden, 2009), with separate layers for processing morphophonology and morphotactics, the lack of significant morphophonological processes relevant to Marathi orthography made lttoolbox a perfectly adequate choice.

4.2 Lexicon

The main source of lexical material for our analyser is an existing morphological analyser published by the Language Technologies Research Centre (LTRC) at IIIT Hyderabad. Unlike other work on Marathi, the lexicon is available under the free/open-source GPL licence. The source lexicon (see example in Figure 3) is composed of a dictionary table containing six columns. All text in Marathi is written in a Latin-based transliteration scheme.

The paradigms in the LTRC lexicon are essentially lists of different forms of a word; words are assigned paradigms based on their conformance to the inflection of the paradigm word. One of the biggest problems with this is the inefficient noun paradigm system: each paradigm lists forms that include bound postpositional morphemes (including adjectival postpositions). This is quite unnecessary, as postpositions (layer 3) are largely regular, and attach to the oblique case (layer 1) with an optional clitic (layer 2). This results in 968 forms per paradigm, where four would suffice: the singular and plural nominative and oblique. There were other minor problems, such as the inclusion of plural forms for uncountable nouns or abstract nouns.

4.3 Paradigms

The Apertium paradigm system essentially functions using finite-state transducers, defined in XML. Paradigms are expressed as an input side (within '<l></l>' tags), and a corresponding output side (within '<r></r>' tags); the transducer is made to re-

^ठेचा/ठेचा<n><m><sg><nom>$
^मरची/मरची<n><f><sg><nom>$
^,/,<cm>$
^चच/चच<n><f><sg><nom>$
^व/व<cnjcoo>$
^मीठापासून/मीठ<n><nt><sg><obl>+पासून<post><adv>$

^तयार करणे/तयार करणे<vblex><inf>$
^तयार/तयार<adv> करणे/करणे<vblex><inf>$

Figure 2: Example analyses for the light verb construction तयार करणे tayār karṇe "to prepare". The first analysis is what we use; the second is an intended addition.
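The example analyses above are written in the Apertium stream format, where each unit has the shape ^surface/reading$ and a reading is a lemma followed by tags, with "+" joining the morphs of a single word. As an illustration of how the layered analysis can be read off such output, the following is a minimal Python sketch, not part of the analyser itself. The names Morph, parse_reading and parse_stream are hypothetical, and the sketch assumes no escaped characters and no "/" or "+" inside a lemma.

```python
import re
from dataclasses import dataclass

# Hypothetical helper type for illustration: one lemma with its tag sequence.
@dataclass
class Morph:
    lemma: str
    tags: list

# One lexical unit: everything between '^' and '$' (assumes no escaping).
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")
# One tag: the text between '<' and '>'.
TAG = re.compile(r"<([^<>]+)>")

def parse_reading(reading):
    """Split one reading, e.g. 'मीठ<n><nt><sg><obl>+पासून<post><adv>', into morphs."""
    morphs = []
    for part in reading.split("+"):
        lemma = part.split("<", 1)[0]  # text before the first tag
        morphs.append(Morph(lemma, TAG.findall(part)))
    return morphs

def parse_stream(text):
    """Return a (surface form, list of readings) pair for each ^...$ unit."""
    units = []
    for body in LEXICAL_UNIT.findall(text):
        surface, *readings = body.split("/")
        units.append((surface, [parse_reading(r) for r in readings]))
    return units

if __name__ == "__main__":
    sample = "^व/व<cnjcoo>$ ^मीठापासून/मीठ<n><nt><sg><obl>+पासून<post><adv>$"
    for surface, readings in parse_stream(sample):
        print(surface, readings)
```

For मीठापासून, the sketch yields two morphs: the noun मीठ carrying the obl tag (layer 1) and the postposition पासून (layer 3), which mirrors the layered treatment described in section 2.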
