Morphological Analysis of the Dravidian Language Family

Arun Kumar (Universitat Oberta de Catalunya), Ryan Cotterell (Johns Hopkins University), Lluís Padró (Universitat Politècnica de Catalunya), Antoni Oliver (Universitat Oberta de Catalunya)

Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 217–222, Valencia, Spain, April 3-7, 2017. © 2017 Association for Computational Linguistics

Abstract

The Dravidian family is one of the most widely spoken sets of languages in the world, yet there are very few annotated resources available to NLP researchers. To remedy this, we create DravMorph, a corpus annotated for morphological segmentation and part-of-speech. Also, we exploit novel features and higher-order models to achieve promising results on these corpora on both tasks, beating techniques proposed in the literature by as much as 4 points in segmentation F1.

1 Introduction

The Dravidian languages comprise one of the world's major language families and are spoken by over 300 million people in southern India (see Figure 1). Despite their prevalence, they remain low resource with respect to language technology. We annotate new data and develop new models for the most commonly spoken Dravidian languages: Kannada, Malayalam, Tamil and Telugu.

[Figure 1: map of India showing the Indo-Aryan region in the north and the Telugu, Kannada, Tamil and Malayalam regions in the south.]
Figure 1: The Dravidian languages are spoken natively in southern India, whereas languages belonging to the Indo-Aryan family, a subbranch of the larger Indo-European family, are spoken in the north.

We focus on the computational processing of Dravidian morphology, a critical issue since the family exhibits rich agglutinative inflectional morphology as well as highly productive compounding. For example, Dravidian nouns are typically inflected for gender, number and case, in addition to taking various postpositions. Consider the Malayalam word agniparvvatattinṟeyeāppam (അഗ്നിപർവ്വതത്തിന്റെയോപ്പം), which is composed of the compound noun stem agni+parvvatam (fire+mountain) and the following suffixes: tta (inflectional increment), inṟe (genitive case marker), ye (inflectional increment) and oppam (postposition). These combine to give the meaning of the English phrase "with a volcano." This complexity makes morphological analysis obligatory for the Dravidian languages.

We make three primary contributions: (i) We release DravMorph, a corpus annotated for morphological segmentation and part-of-speech (POS) as an open-source resource, encouraging future work on Dravidian languages; (ii) We show that a combination of higher-order models and linguistically motivated features yields state-of-the-art accuracy on the task of morphological segmentation on the corpus; (iii) We show that training POS taggers that use the output of our segmenters as features significantly improves a state-of-the-art tagger.

2 DravMorph

A primary contribution of this work is the release of DravMorph,[1] a corrected corpus for both morphological segmentation and POS in the four most widely spoken Dravidian languages: Kannada, Malayalam, Tamil and Telugu. The corpus contains 4034-8600 annotated sentences and 3593-4730 segmented types per language. The full statistics are listed in Table 2. To the best of our knowledge, this is the most comprehensive annotated corpus of the Dravidian languages.

[1] The morphological analyzers and the code for correcting the corpus are available at https://github.com/Malkitti/Corpusandcodes

All the newly annotated corpora are based on Wikipedia text in the respective languages (see Table 1). To speed up annotation, we first ran closed-source rule-based morphological analyzers and POS taggers produced by the government of India and Indian universities. We remark that the existence of such rule-based tools does not diminish the utility of the annotated corpus: our ultimate goal is the adoption of modern statistical methods for Dravidian NLP, which requires annotated data. To ensure a gold standard corpus, we then hand-corrected the resulting output. Additionally, we standardized the POS tagging schemes across languages, using the IIIT-H POS tagset (Bharati et al., 2006), which has 23 tags. Furthermore, we calculated inter-annotator agreement of two annotators for morphological labels, and all datasets have Cohen's κ (Cohen, 1968) > 0.80.

Lang  POS          Segmentation  Wiki Dump
Ka    ILMT/IIIT-H  ILMT/IIIT-H   2015-02-09
Ma    ILMT/AM      ILMT/AM       2015-05-08
Ta    ILMT/AM      ILMT/AM       2015-05-09
Te    ILMT/AM      ILMT/UoH      2015-02-03

Table 1: The origin of the rule-based analyzers and taggers. ILMT stands for the Indian Language Machine Translation Project, AM for Amrita University, IIIT-H for the International Institute of Information Technology, Hyderabad, and UoH for the University of Hyderabad.

      POS Tagging              Segmentation
Lang  # Sentences  # Tokens    # Types
Ka    8600         31364       3593
Ma    4034         34300       4730
Ta    4550         32400       4445
Te    5679         30625       4183

Table 2: Per-language breakdown of the size of the POS portion and the morphological segmentation portion of DravMorph. All train / dev / test splits used in the experiments will be released with the corpus.

3 Morphological Segmentation

We first examine the task of morphological segmentation in the Dravidian languages. The task entails breaking a word up into its constituent morphs. For example, the English word joblessness can be segmented as job+less+ness. When processing morphologically rich languages, this helps reduce the sparsity created by the higher OOV rate due to productive morphology, and, empirically, has been shown to be beneficial in a diverse variety of downstream tasks, e.g., machine translation (Clifton and Sarkar, 2011), speech [...] and unsupervised approaches have been successful, but, when annotated data is available, supervised approaches typically greatly outperform unsupervised approaches (Ruokolainen et al., 2013). In light of this, we adopt a fully supervised model here.

We apply semi-Markov Conditional Random Fields (S-CRFs) to the problem of morphological segmentation (Sarawagi and Cohen, 2004; Cotterell et al., 2015). S-CRFs have the ability to jointly model both a segmentation and a labeling. For example, consider the Malayalam word kūṭṭukāranmāruṭeyēāppam (കൂട്ടുകാരന്മാരുടെയോപ്പം) (with (male) friends):

\[
\underbrace{\text{kūṭṭukāranmāruṭeyēāppam}}_{w}
\;\xRightarrow{\text{labeled segmentation}}\;
\underbrace{[_{\text{stem}}\ \text{kūṭṭukāran}]}_{s_1,\,\ell_1}\;
\underbrace{[_{\text{suf}}\ \text{mār}]}_{s_2,\,\ell_2}\;
\underbrace{[_{\text{suf}}\ \text{uṭe}]}_{s_3,\,\ell_3}\;
\underbrace{[_{\text{suf}}\ \text{yēāppam}]}_{s_4,\,\ell_4}
\]

An S-CRF models this transformation as

\[
p_\theta(s, \ell \mid w) = \frac{1}{Z_\theta(w)} \exp\left( \sum_{i=1}^{|s|} \theta^\top f(s_i, \ell_i, \ell_{i-1}) \right),
\]

where $s$ is a segmentation, $\ell$ a labeling, $\theta \in \mathbb{R}^d$ is the parameter vector, $f$ is a feature function and the partition function $Z_\theta(w)$ ensures the distribution is normalized. Note that each $\ell_i$ is taken from a set of labels $L$. In this work, we take $L = \{\text{prefix}, \text{stem}, \text{suffix}\}$.

As an extension to the standard S-CRF model, we allow for higher-order segment interactions (Nguyen et al., 2011).
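To make the S-CRF distribution concrete, the following is a minimal Python sketch of the first-order model only; it does not cover the higher-order extension. It computes log Z_θ(w) with the standard semi-Markov forward recursion and then the log-probability of one labeled segmentation. The feature templates in `features`, the `START` label, the `max_len` cap and the toy weights are all assumptions made for illustration and are not taken from the paper.

```python
import math

LABELS = ("prefix", "stem", "suffix")   # the label set L
START = "<s>"                           # hypothetical label before the first segment


def features(segment, label, prev_label):
    """Toy f(s_i, l_i, l_{i-1}); these two templates are illustrative assumptions."""
    return [f"seg={segment}|lab={label}", f"prev={prev_label}|lab={label}"]


def score(theta, segment, label, prev_label):
    """theta^T f(s_i, l_i, l_{i-1}) for sparse indicator features."""
    return sum(theta.get(name, 0.0) for name in features(segment, label, prev_label))


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(word, theta, max_len=None):
    """log Z_theta(w), summing over every labeled segmentation of `word`."""
    n = len(word)
    max_len = max_len or n
    # alpha[j][label]: log-sum of scores of all labeled segmentations of
    # word[:j] whose last segment carries `label`.
    alpha = [dict() for _ in range(n + 1)]
    alpha[0][START] = 0.0
    for j in range(1, n + 1):
        for label in LABELS:
            terms = []
            for i in range(max(0, j - max_len), j):
                seg = word[i:j]
                for prev, a in alpha[i].items():
                    terms.append(a + score(theta, seg, label, prev))
            alpha[j][label] = logsumexp(terms)
    return logsumexp(list(alpha[n].values()))


def log_prob(word, segments, labels, theta):
    """log p_theta(s, l | w) for one labeled segmentation."""
    if "".join(segments) != word or len(segments) != len(labels):
        raise ValueError("segments must concatenate to the word, one label each")
    total, prev = 0.0, START
    for seg, lab in zip(segments, labels):
        total += score(theta, seg, lab, prev)
        prev = lab
    return total - log_partition(word, theta)


# Toy weights (assumed, not learned) favouring job+less+ness.
theta = {
    "seg=job|lab=stem": 2.0,
    "seg=less|lab=suffix": 1.5,
    "seg=ness|lab=suffix": 1.5,
    "prev=stem|lab=suffix": 0.5,
}
lp = log_prob("joblessness", ["job", "less", "ness"],
              ["stem", "suffix", "suffix"], theta)
print(math.exp(lp))
```

With all weights at zero, every labeled segmentation is equally likely, which gives a simple sanity check on the recursion: a word of n characters has 3·4^(n-1) labeled segmentations under this label set. In practice θ would be learned from the annotated corpus rather than set by hand.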
