Issues and Challenges in Developing Statistical POS Taggers for Sambalpuri

Pitambar , Atul Kr. Ojha and Girish Nath Jha

Jawaharlal Nehru University Centre for Linguistics, Special Centre for Studies {pitambarbehera2, shashwatup9k & girishjha}@.com

Abstract Low-density languages are also known as lesser-known, poorly-described, less-resourced, minority or less-computerized language because they have fewer resources available. Collecting and annotating a voluminous corpus for these languages prove to be quite daunting. For developing any NLP application for a low-density language, one needs to have an annotated corpus and a standard scheme for annotation. Because of their non-standard usage in text and other linguistic nuances, they pose significant challenges that are of linguistic and technical in nature. The present paper highlights some of the underlying issues and challenges in developing statistical POS taggers applying SVM and CRF++ for Sambalpuri, a less-resourced Eastern Indo- language. A corpus of approximately 121k is collected from the web and converted into Unicode encoding. The whole corpus is annotated under the BIS (Bureau of Indian Standards) annotation scheme devised for Odia under the ILCI (Indian Languages Corpora Initiative) Corpora Project. Both the taggers are trained and tested with approximately 80k and 13k respectively. The SVM tagger provides 83% accuracy while the CRF++ has 71.56% which is less in comparison to the former. Keywords: low-density language, parts of speech tagger, SVM, CRF++, Sambalpuri, Eastern IA language

1. Introduction sub-division of district. In comparison to its sister languages such as Maithili, Awadhi, , Bengali, Low-density languages have fewer resources in terms of Assamese, Odia and many others, Sambalpuri has not the availability of voluminous corpus (McEnry et al., gained much attention; neither from linguists nor from 2000) for NLP applications. The unavailability of a the government. It is really quite obvious to affirm that it corpus for a low-density language proves to bear adverse shares the genetic affinity with the Indo-Aryan language impacts on its future NLP development. As rightly family by observing some of its linguistic features pointed out by Ostler (1993), languages that lack active (Kushal, 2015). Although it has 75-76% lexical similarity participation in the electronic media are doomed to be with Standard Oriya (Mathai & Kelsall, 2013), it is endangered. Most of these languages are either dialects or syntactically a distinct language (Tripathy, 1984, pp. 49). languages with no government recognition. As a result, the situations of these languages in in 3. Salient Linguistic Features DQGLQ,QGLFODQJXDJHVLQSDUWLFXODUDUHµUHODWLYHO\EOHDN¶ (McEnry et al., 2000). Although is a land of more Less-described languages have some of the most than 6000 languages with five prominent diverse interesting linguistic features that are typical and language families (Abbi, 2001) only 22 are scheduled and language-specific. Some of the features like the rest are fighting for their survival. agglutination, classifiers, serial verbs, multi-words, This paper is concerned with demonstrating the compounds etc. account for the less accuracy of any of issues and challenges in developing statistical parts of the statistical NLP applications. Some of them are speech taggers for a low-resource language, Sambalpuri. discussed vividly below in four sub-sections: The paper has broadly three objectives. Firstly, it agglutination, classifiers, reduplication and compounds. highlights the issues in corpus collection with regard to non-uniform orthographic language standards and non- 3.1. Agglutination Unicode encodings of the written text. Secondly, it also In an agglutinative language words are made of a linear attempts to bring out the issues in annotation having sequence of distinct morphemes each of which without any guideline. Finally, it demonstrates the corresponds to a definite meaning (SIL International, challenges in developing statistical POS taggers for 2004). In Odia, the categories VXFK DV ³suffixes, Sambalpuri owing to the typical, unobserved and postpositions, and case endings agglutinate with the language-specific linguistic nuances. verbs, nouns, adverbs or pronouns´ (Padhy & Mohanty, 2013; Jena et al., 2011). Similar is the case with 2. Sambalpuri: A Low-Density Eastern IA . Language For instance, Sambalpuri (ISO 639-3) is an Eastern Indo-Aryan (IA) NހܤܭEܤU-ke µto eat¶ language is also known as Dom, Kosali, Koshal, Koshali, - peoples ¶¶ Uµܧ OݜN 1 Western Odia . It is spoken in the ten districts of western bܤKܤU-ke µto outside¶ and south-western which comprises , ¶U-Qݜµfrom meܥP Bolangir, Kalahandi, Sonepur, , , In the examples instantiated above, all the case endings or Sundargarh, Deogarh, , ; and U Qݜ DJJOXWLQDWH ZLWK WKHLU KHDGܧ PDUNHUV NH categories verb, noun and pronoun. /ke/ which is 1 https://www.ethnologue.com/language/spv

349 equivalent to the English infinitive and preposition is label is decided on the basis of the category of the head. alternating here with both verb and spatial noun /NހܤܭEܤU Here the head is a nominal element and hence the and /bܤKܤUUHVSHFWLYHO\ judgment goes in favor of the category of the label of the head word. 3.2. Classifiers For example, Classifier is one of the most prominent phenomena in VܧGࡧ ?--SܧWࡧހܧ?1B11 VܧWࡧSܧWࡧހܧ\1B11µJRRGSDWK¶ Indian languages; especially in the Tibeto-Burman So, in the above example, the decision whether to languages and . Besides, in some of annotate the word as an adjective or noun goes for the the EIA languages like Bengali (Bhattacharya, 1999), right-headedness feature of Sambalpuri. This feature is Odia (Neukom, 2003 and Behera, 2015), Bhojpuri typical to most of the IA languages and the ZRUG (Shukla, 1981), Marathi (Baskaran et al., 2008) and so VܧWࡧSܧWࡧހܧis labelled as a noun. on, it is a dominant linguistic feature in Sambalpuri as well. The classifiers mainly occur either as proper 4. Methodology classifiers, attached to numerals or to the quantity word This section deals with (a) the total corpus collected in NHWࡧ e/ µKRZ PDQ\ VRPH¶ RU DV LQGHILQLWH PDUNHUV LQ four major domains, (b) the BIS annotation guideline 1HXNRP DV ݚH adapted for Sambalpuri, (c) size of the corpus for combination with the suffix /-H´ QH, ݚܼ HWF in Sambalpuri. One of the training, testing and development stages and (d) featuresܧހܱ HܩѺܧހN ܤݚ rarely observed phenomena of Indian languages found in selection for SVM and CRF++ POS taggers. Sambalpuri is that classifiers also occur with post positions. 4.1. Corpus Size .¶µOLNHPHܤH-ݚހUOHNܥFor instance, /P The tabulated data (see table 1) demonstrates the total corpus size collected for developing the Sambalpuri POS 3.3. Reduplication taggers. The whole corpus size comprises of five ³,WLVWKHUHSHWLWLRQRIDVHJPHQWDV\OODEOHRUVRPHSDUW domains, viz. literature, sports, tourism, entertainment, or whole of a lexical or phrasal unit leading to a semantic and miscellaneous. The highest corpus size is registered or graPPDWLFDO PRGLILFDWLRQ´ 3DQGH\ 2007). There are in the domain of entertainment i.e., approximately 40k two types of reduplication: partial and total. In total while the µmiscellaneous¶ section accounts for the lowest reduplication, the whole part of the base is reduplicated number of data. and in the partial reduplication, some part is reduplicated (Abbi, 1992). In the following instances, the first two are Domains Tokens fully reduplicated while the rest of the following are Literature 30, 344 partial. In the partial reduplication, the final /- Sports 21, 121 Qܤ of both the words are reduplicated like in the first Tourism 26, 767 example whereas the final example contains the Entertainment 40, 554 reduplications of the initial syllables Kݜ-/. Miscellaneous 2, 424 For instance, Total 1, 21, 210 oܼNoܼN µVKLQLQJ¶ Gࡧ ހܼUHGࡧ ހܼUH µVORZO\¶ µNQRZQ¶ Table 1. Total Corpus Size Domain-wise ܤVݜQܤQܧܱ ¶µDEXVLQJ ܼܩKݜܤܩKݜ When a sequence of verbs occurs in a chronology, they 4.2. Corpus Annotation are called serial verbs (Jha et al., 2014) and some of them The whole Sambalpuri corpus is annotated using the are reduplicative in nature. In the below-instantiated ILCIANN2 (Kumar et al., 2012) following the BIS-ILCI example, it is quite confusing as to how to annotate the tagset (see table 2) devised for since there is a is no tagset available for it. The BIS tagset is a ܼܩݜܧ verbal occurrences. Because the initial verb /Gࡧ non-finite verb followed by a verbal reduplication which hierarchical set designed by the POS Standardization is behaving like a manner adverb modifying the finite Committee appointed by the Department of Information following verb /SܧOHܼܳܧOܤ. The issue here is how to and Technology, Government of India. It has a total annotate the verbal reduplication. number of 11 categorical labels at the top level and 39 For example, fine-grained labels for the annotation. The tagset is framed keeping in view both the fineness and coarseness ܤOܧOHܼܳܧSܼܩݜܧ Gࡧܼܩݜܧ UܼGࡧܧUNܤVHP he V-Nonfinite V-reduplication V-Finite or flat and hierarchical structures in view. The table . ³+HZHQWDZD\EHDWLQJ ´ below contains the nomenclatures of all the categories in 3.4. Compounds the second column, annotation labels in the third and categorical IPA examples in the fourth. Compound or Sandhi is one of the most productive linguistic phenomena which is quite typical in most of the Sl. Category Annotation Examples of ZRUOGV¶ODQJXDJHV in general and in Indian languages in No. Labels Sambalpuri in IPA particular. There are three basic types of compounds: 1 Noun N vowel, consonant and visarga. In the following instance, 1.1 Common N_NN SԥWࡧԥUSܼWࡧԥOEހܤEQܤ the first word is an adjective and the second is a noun, but PݜQݜV when get combined they comprise a nominal category. Since Sambalpuri a head final language, the annotation 2 http://sanskrit.jnu.ac.in/ilciann/index.jsp

350 1.2 Proper N_NNP UܤPKܼPܤܿԥM, other symbols (#, [, {, ܳԥƾܤGࡧ ހԥUPHKHU %, $, <, >, (, ), *, @, ) (.- etcµ¶³´ OԥM, 11.3 Punctuation RD_PUNCܤEܼVݝԥEܼGࡧ M .VԥPEԥOSݜU etc 11.4 Unknown RD_UNK Tags that are left 1.3 Verbal N_NNV SԥܩހܤSԥKԥUܤܩܭܳܤ Ѻ undecided ܤUݚܤFEܤQ 11.5 Echo word RD_ECH h -ph - ܤݚܧNܧJܤ ܧ JܤHSԥre, EܩܤFhܧNHSܳܤ Spatial & N_NST 1.4 ch HWFܤݚܧ HWFܧtemporal SݜUE 2 Pronoun PR Table 2. BIS POS Tagset Adapted for Sambalpuri .QVHѺ etcܧSܤPersonal PR_PRP PݜܼWݜܼ 2.1 2.2 Reflexive PR_PRF QܼܱH etc. 2.3 Relative PR_PRL ܱܤKܤUܱܤKܤѺNԥU 4.3. Data Size for the Taggers ܱHQPܤQNԥU The tabulated data (see table 3) represents the different WࡧUHGࡧ ݜKH etc. data sets applied to develop the statistical taggers. TheܼހReciprocal PR_PRC QܼܱԥUE 2.4 2.5 Wh-word PR_PRQ NܼHNܤKܤUNHQPܤQH total number of training data used for developing the etc. taggers amounts to around 80k. Initially, the tagger is 2.6 Indefinite PR_PRI ܧQMܧUܧNܧX݅ܧVܼNHKܼ trained with around 50k with manually annotated data etc. and later, the development set consists of 28k which was 3 Demonstrati DM automatically tagged and manually validated. After the ve training period, the testing is conducted with a set of 3.1 Deictic DM_DMD ܼVHܼܳXݏܤNԥ, approximately 13k corpus size tokens. VHܳXݏܤNԥ etc. 3.2 Relative DM_DMR ܱHQܳXݏܤNԥܱܤKܤU Data sets Tokens Training 80, 288 3.3 Wh-word DM_DMQ Nܤ݅ܤNHQԥU NHQܳXݏܤNԥU etc. Testing 12, 791 QVܼHWFܭѺNܤIndefinite DM_DMI ݜQܼ 3.4 4 Verb V Table 3. Training, Development and Testing Data Sets Main V_VM Nh etc. 4.4. Developing POS Taggersܭ GࡧݏMain V_VM VݜGࡧ ԥݜ .4.1 1 h Two statistical taggers are developed for Sambalpuri; the 4.1. Non-finite V_VNF k ܤܼNԥUܼQܤFܼ QܤFܼ, first one is trained with SVM (Joachims, 1999; Giménez 2 etc. & Màrquez, 2006) and the second is with CRF++ (Kudo, 4.1. Infinitive V_VINF kh rke, kh ܤܭEܤ ܤܼEܤU 2013). So far as the former is concerned, learning phase 3 r etc. OܤܼܳQܤFEܤ contains medium verbose (-V 2) and the mode of learning 4.1. Gerund V_VNG khܤܼWࡧ hܼEܤr, khܤXWࡧ hܼEܤr 5 etc. and tagging is set to left-right-left (LRL). The rest of the features like sliding window, feature set, feature filtering, UNԥUܼܤAuxiliary V_VAUX ݜFܼWࡧGࡧ ԥUN .4.2 1 Wࡧ hܼEܤr etc. model compression, C parameter tuning, Dictionary h repairing and so on are set to the default mode. On the 5 Adjective JJ b ԥOXWࡧWࡧԥPVXѺGࡧ ԥU etc. 6 Adverb RB other hand, the latter is trained with the unigram method. 7 Postposition PSP VܤƾHOHNހHOܤܼܳHWF 5. Issues and Challenges 8 Conjunction CC This section is divided into three major sub-sections: 8.1 Coordinator CC_CCD NܤѺKܭOܤܱHNܼNܤUԥQ -ݜܱԥGࡧ ܼ etc. corpus-related, annotation-related and taggerܤ 8.2 Subordinator CC-CCS ܱܧGࡧ ܼWࡧHEHܱHWࡧHEܭOܭ related issues. VHWࡧHEܭOܭ ܱH EܧOܼ etc. 8.3 Quotative CCS_UT ܤܤUHKԥܭOܥKԥܳܥ 5.1. Corpus-related Issues ܤܳMܤѺHWF The issues pertaining to the corpus collection are vividly 9 Particles RP discussed: corpora collection, unavailability of Unicode h 9.1 Default RP_RPD PܧGࡧ MܧKÕWѺ ࡧܧHWF encoding, non-standard usage of the language, different h .HѺ etc. writing conventions and -like constructionsܩܧ k ,ܤClassifier RP_CL ܳݜݚHGࡧ ݜܼݚ 9.2 .etc ܥKܥKܤܭKKԥܤInterjection RP_INJ ݝ 9.3 h Wࡧܼ, k ub, EԥKݜWࡧ ܱԥEԥU 5.1.1. Corpus Collectionܧ Intensifier RP_INTF 9.4 etc. A number of corpora are developed for various languages .Ѻ etcܤQݜKHQܼQܼKܼܤNegation RP_NEG Q 9.5 10 Quantifiers QT like English and some European languages. Considering Vܼ ݚܼNH the situations in non-scheduled (lesser-known) IndianܭHEܩݜހGeneral QT_QTF Wࡧ 10.1 Gࡧ ݜ etc. languages it is quite unfavorable in comparison to theܤܩݜܳ 10.2 Cardinal QT_QTC HNGࡧ ܥWࡧܼQFܤU etc. scheduled languages since some of the Indian institutions have either worked on or are presently developing ܤWࡧܼVUܤGࡧ ݜVUܤOܭOrdinal QT_QTO 3ԥK 10.3 etc. language resources and technologies for the latter 11 Residuals RD languages only. Because of the indifference of the 11.1 Foreign RD_RDF languages of the other government towards the lesser-known languages, the words scripts except Odia former are getting disempowered gradually. The 11.2 Symbol RD_SYM mathematical and institutions and projects that have worked for the corpus

351 collection in scheduled Indian languages are IIIT- öǜĂljĀĐ N_NN, öŢöȡǣ N_NN , CIIL-Mysore, ILCI-JNU, and TDIL. öĚǜûĉ N_NN öĚřĉ N_NNP 5.1.2. Unavailability of Unicode Encoding This non-standard usage of the words creates issues Since low-density languages are less-resourceful or with during both manual and automatic annotation since their no resource the software available for them are also less POS labels vary with the varying conventions. in number. This leads to the non-Unicode encodings which is not favorable for the development of NLP 5.1.5. Hindi-like Constructions applications. The whole corpus has been converted into Sambalpuri is more like Hindi than Odia which accounts UTF-8 encodings using Akruti Text Converter3. There for the fact that the western region, where it is spoken, is are different linguistic issues in the corpus itself such as situated just adjacent to Chattisgarh and where non-standard usage, non-uniform Orthographic forms and the influence of Hindi is largely felt. In the examples ݝԥܱݜGࡧ DQGNHDUHSRVWSRVLWLRQVDVܤHindi-like constructions. instantiated below /b used in Hindi while the Hindi-like indefinite and 5.1.3. Non-standard Usage reflexive pronouns are also used. Sambalpuri is not a scheduled Indian language and is For instance, ݝԥܱݜGࡧ  PSPܤwritten and spoken with varying standards in different /b regions of the western and south-. For KԥUHNPR_PRI /ke/ PSP example: Sambalpuri, Bargadia (spoken in Bargarh), ԥSQܤPR_PRF ԥSQܤU PR_PRF Bolangiri/a (spoken in Bolangir district), Sundargadi/ia (spoken in Sundargarh), Deogarhia (spoken in Deogarh 5.2. Human Annotation-related Issues region) etc. There are some dialectical variations among One of the prominent challenges is that which pertains to the people of Sambalpuri speaking track. The table (see the annotation of the corpus. For annotation of a table 4) demonstrates dialectal variations of Sambalpuri voluminous corpus and to maintain consistency, one ZLWKUHIHUHQFHWRQHJDWLYHPRUSKHPHµQR¶DGYHUEVµQRZ¶ needs a standard tagset. Owing to the fact that a large number of languages like Sambalpuri being less- DQGµWKLV ZD\¶. Lexical similarity within the varieties of Sambalpuri is considerably high which ranges from 90 to described or less-studied, it is quite daunting to devise a tagset. If one adopts and adheres to the tagset devised for 95 percent (Mathai & Kelsall, 2013). This similarity a language of close proximity, then they may either matrix was made by comparing Bargarhi, Bolangiri and compromise with the saliency of the linguistic data or Jharsuguda varieties with Sambalpuri. may end up filling different slots for labels and not researching by delving deep into some interesting Variety of Negative Adverb Adverb structures. For instance, there are large numbers of Sambalpuri Morpheme [ԌhľG૧H] [Ԍľ˅H] homophonous words that can neither be included in the [nľԌ੻@µQR¶ µQRZ¶ µWKLVZD\¶ reduplicated nor can they be labelled as echo. Bargarh n he/n he h e/ ࡱ Ԍ Ԍ ľG૧H͑Fф͑Q Ԍľ˅ ԌSфľOH 5.2.1. Reduplicated Expressions Bolangir n n Ԍ੻ ͑Nф͑ Generally, in Indian languages the reduplicated Deogarh expressions follow the meaningful word. Contrastingly, Kalahandi n n b in Sambalpuri many of the reduplicated parts precede the Ԍ੻ ͑Nф͑ Ԍ ľߺH meaningful words (see section 3.3). For instance, in the Sambalpur n he/n he Ԍ ࡱ conjunct verb (adjective + finite verb) FހܼFހܼ LV WKH Sundargarh meaningless reduplicated part which is preceding the ԌС̸˅Ԍ PHDQLQJIXOSDUWEܼFހܼµVFDWWHUHG¶ For example, Table 4. Dialectal Variations in Sambalpuri (Adapted FހܼFހܼ\RD_ECH EܼFހܼ\JJ KHܼFހܧQµKDYH got VFDWWHUHG¶ from Patel) Similarly, in the following verbal reduplication, the meaningless part is preceding the verbal part. 5.1.4. Different Orthographic Conventions For example, V_VM_VF µhas\ܤA large number of words in Sambalpuri has different NݜWࡧ \RD_ECH NݜWࡧHܼ\V_VM GHO Orthographic conventions; especially the ligatures. In tickled¶ Sambalpuri, there are several writing conventions used These kinds of constructions pose significant linguistic for a given word form because of the non-uniform usage challenges for the human annotators as to how to label of language. them and so is for the statistical tagger. For instance, in the following examples two forms are used for one word with two of them having different POS 5.2.2. Verb-less Constructions labels with the change of form. In Sambalpuri and many sister languages such as Odia, N_NN, DM_DMQ öĚĠśç öçĚĠý Bengali, Assamese (Masica, 1993) verb-less constructions or covertly present verbs are commonly used. These constructions are used with adjectives in 3https://22bc339da9ca3e2462414546a715752e4c2c5e0d. place of verbs. Therefore, the tagger also labels some of googledrive.com/host/0B5rBGd680WZFemVLa3RxY0pr these adjectives as finite verbs because of the annotation eE0/AkrutiUnicode of these constructions in the training data. 352 For example, conjunction, coordinating conjunction-general quantifier, .U\JJ deictic-interrogative demonstrative and so onܤV\N_NN ݜVܤQYܤEݜU\N_NN FܤWࡧ \N_NNP EܧVݝܤs (ݜ(CC_CCD or QT_QTFܤ U\RD_ECH .\RD_PUNC ³6DVZDW %DEX¶V FDQYDVV LVܤSܼV quite large.´ For example, ´DQGP\IDWKHU,³ܤSܤUEܥݜѺ \&&B&&'PܤU\ RD_ECH/ is the PݜܼܤU\-- SܼVܤQ WKH DERYH H[DPSOH ݜV, U³, QHHGVRPH PRUHܤGࡧ ԥUNܤQܤހݜ?47B47)Nܤreduplicated adjectival phrase which satisfies the need of PԥWࡧH the verb. IRRG´ In the above examples, the first one suggests that the ݜLVDFRRUGLQDWLQJFRQMXQFWLRQFRRUGLQDWLQJWZRܤOnomatopoeic Words ZRUG .5.2.3 Onomatopoeic words are the imitation of a sound noun phrases while the second one states that it is a associated phonetically with its describing referent. These general quantifier used as a pre-modifier. following expressions are parts of the multi words Three-label Sets: This section contains the ambiguous words having three labels. The most commonly because individually these words do not have meaning, ambiguous tags are most-expectedly adjective-temporal but when combined they are manner adverbs. As per the nouns-finite, negative-main-finite verb and so on. ILCI guideline, if we annotate the first sound as noun and EܤKܤU (V_VM or N_NST or JJ) the following words as echo-words (RD_ECH), we are For instance, ԥUݜ\1B11 ³&RPH RXW RI WKHހܳ U\V_VM_VFܤKܤmissing relevant linguistic information. E For instance, KRXVH´ b ԥUݜ\1B11³+HZHQWDZD\IURPހܳ--\UܤKܤEܤOܤѺµORXGO\¶ VHSԥOܭހѺEܭހ ܱހܥܱހܥµKHDYLO\¶ WKHIURQWURRP´ Gࡧ ހܧSܧµJDVSLQJ¶ Ѻ Ѻ EܤKܤU-ke\1B167ܤV\9B90B9)³&RPHWRRXWVLGH´ EހܥEހܥµEDUN¶ In the first instance, the word form /EܤKܤU/ has three different POS labels. The first one is annotated as a finite 5.2.4. Agglutination of Classifiers with Postpositions main verb as the sentence is an imperative sentence and Agglutination (see section 3.2) is one of the common the covert subject is the second person pronominal. In the features in Odia (Behera, 2015) and Sambalpuri along second example, it is labelled as an adjective as it with some IA languages like Bengali and Marathi modifies the following noun whereas the third one is a (Baskaran et al., 2008). In Sambalpuri, one of the peculiar spatial noun as it refers to a location. constructions with agglutination is that the classifiers and More than Three-label Sets: The words having more postpositions agglutinate with each other which is also than three labels are encapsulated in this part. For rarely found in the most agglutinating Dravidian instance, main-auxiliary-nonfinite-finite verbs, unknown- languages. Here, to annotate these constructions as interjection-default particle and so on. classifiers (RP_CL) or postpositions (PSP) is quite kԥrܼ (V_VAUX or V_VM or V_VM_VF or difficult. V_VM_VNF) For instance, For instance, VOH\V_VM_VFܤ;9B90NԥUܼ\9B9$8\ܼܤހµas-&/¶ NܤU-ݚܼܳܤE/ VOH\V_VM_VFܤP\1B11NԥUܼ\9B90ܤµOLNH-&/¶ kܤHݚހOHN Similarly, in the example below, it is quite difficult to NԥUܼWࡧހܼOܤ\V_VM_VF decide the annotation labels for both the human and NހܤܼNԥUܼ\9B90B91)ܤVOH\V_VM_VF automatic annotation. The reason is the complexity in The verbal word form /kԥrܼ/ has more than three labels in ,the corpus and which is rightly so. It can be used as main ܤdeciding the head label of the words.7KHZRUGGࡧ ݜܼ-ݚ comprises of two components, a cardinal and a classifier auxiliary, finite and non-finite verbs as instantiated in the morpheme. Both of these categories have separate labels above examples. in the BIS scheme for Sambalpuri. Therefore, if one annotates the word as cardinal, they are compromising Classes of Label Sets with the other label or linguistic information. Ambiguity For instance, 2 Sets: CC_CCD_CC_CCS, CC_CCD_QT_QTF, RP_CL) DM_DMD_DM_DMQ\ܤGࡧ ݜܼ\47B47&ݚ ܤGࡧ ݜܼ-ݚ 3 Sets: JJ_N_NST_V_VM_VF, 5.3. Automatic Annotation-related Issues RD_ECH_V_VM_V_VM_VNF, These issues are mostly pertained to tagger-related RP_NEG_V_VM_V_VM_VF ambiguities and some other linguistic errors. More than V_VAUX_V_VM_V_VM_VF_V_VM_VNF, 3 Sets RP_INJ_RP_NEG_V_VM_V_VM_VNF, 5.3.1. Ambiguity Issues RD_UNK_RP_INJ_RP_RPD_V_VM_V_VM_VNF The data (see table 5) represented below demonstrates that there are different types of ambiguous sets of classes Table 5. Ambiguity Classes and their accuracy rates. All the ambiguity classes are divided into 244 classes and they are generated 6. Results and Discussion automatically by the SVM tool. In spite of the different issues and challenges, statistical Two-label Sets: This section includes the ambiguous taggers developed for Odia achieves accuracies of 94% words with two conflicting labels. The most commonly and 89% by SVM and CRF++ respectively (Behera, ambiguous tags are coordinating-subordinating 2015). The results (SVM 83% and CRF++ 71.56%) for 353 Sambalpuri are comparatively lesser than Odia which Baskaran, S., Bali, K., Bhattacharya, T., Bhattacharyya, accounts for the fact that Sambalpuri has no standardized P., & Jha, G. N. (2008). A Common Parts-of-speech orthographic convention and hence different regional Tagset Framework for Indian Languages. In: LREC varieties use the language in their own ways. 2008. The first and foremost point to emphasize for a low- Behera, P. (2015). Odia Parts of Speech Tagging density language is the large-scale writing on the social Corpora: Suitability of Statistical Models. M. Phil. media using its own script with a soul objective of Thesis. Delhi: Centre for Linguistics, Jawaharlal Nehru developing language resources. If there is less availability University. of the corpus, then one can also take the assistance of Bhattacharya, T. (1999). The Structure of the Bangla DP. mathematical modelling to achieve a higher accuracy Doctoral Dissertation, London: University College. rate. So far as the tagset-related issues are concerned, one Giménez, J. and Màrquez, L. (2006). Technical Manual can take the labels already used by a closely-related v1.3. Barcelona: Universitat Politecnica de Catalunya. language spoken in the region for annotation job by Jena, I., Chaudhury, S., Chaudhry, H., & Sharma, D. M. incorporating it on their convenience. With regard to (2011). Developing Oriya Morphological Analyzer issues in annotation, one can take the exemplary labels Using Lt-toolbox. In: Information Systems for Indian from different tagsets developed for Indian languages. Languages. pp. 124-129. Berlin Heidelberg: Springer. For example, we can incorporate WRB label from the Jha, G. N., Hellan, L., Beermann, D., , S., Behera, IIIT-Hyderabad for the interrogative adverb. The P. and Banerjee, E. (2014). Indian Languages on the reduplicative expressions need to be considered seriously TypeCraft Platform± The Case of Hindi and Odia, as they are the most vital parts of the language and they Iceland. In: LREC. behave quite differently in Sambalpuri. Therefore, it can Joachims, T. (1999). Making Large Scale SVM Learning be averred that labels for reduplication (RD_REDP), Practical. Universität Dortmund. possessive pronouns (PR_POS) and demonstratives Kudo, T. (2013). CRF++: Yet another CRF (DM_POS), interrogative adverbs (WRB) can be toolkit. Retrieved from: introduced. For handling agglutination, a stemmer or a http://crfpp.sourceforge.net/ptojrcts/crfpp/. Access lemmatizer could be used with statistical POS taggers. date: July 10, 2015. For punctuations, fine-grained labels should be Kumar, R., Kaushik, S., Nainwani, P., Banerjee, E., incorporated based on their functions in a given context Hadke, S., & Jha, G. N. (2012, March). Using the ILCI as they can be used as coordinators, section headers, list Annotation Tool for POS Annotation: A Case of Hindi. item markers and so on. Finally, the standardization of In: 13th International Conference on Intelligent Text the language would help solve many of the issues by Processing and Computational Linguistics (CICLing providing consistency in both the human and statistical 2012), , India. annotations. Kushal, G. (2015). Case and Agreement in Sambalpuri. M.Phil. Thesis. Delhi: Centre for Linguistics, 7. Conclusion Jawaharlal Nehru University. In this paper, we have discussed about different issues Masica, C. P. (1993). The Indo-Aryan Languages. and challenges in terms of both corpus collection, Cambridge University Press. annotation and tagger-related issues in detail for a less- Mathai, E. K. & Kelsall, J. (2013). Sambalpuri of Orissa, resourced language, i.e. Sambalpuri. The results (SVM India: A Brief Sociolinguistic Survey. SIL 83% and CRF++ 71.56%) of the statistical taggers for International. Sambalpuri in the present study would not only prove to McEnery, T., Baker, P., & Burnard, L. (2000). Corpus be beneficial for its own future NLP research and Resources and Minority Language Engineering. development but also would be advantageous for any In: LREC. other morphologically-rich less-resourced language from Mitkov, R. (2005). The Oxford Handbook of around the world. At a later stage, we can further use a Computational Linguistics. Oxford University Press. lemmatizer or stemmer to handle the agglutination issue Neukom, L., & Patnaik, M. (2003). A Grammar of Oriya. and incorporate some of the solutions proposed in the Zürich: Seminar für Allgemeine Sprachwissenschaft research. Furthermore, these POS taggers could be der University. potentially used for developing morph analyzer, chunker, Ostler, N. (1999). Language technology and the Smaller parser and hopefully for enhancing the accuracy of Language. Elra Newsletter, 4(2). machine translation. For its future development, an online Patel, Kunjabana. (undated). A Sambalpuri Phonetic lexical dictionary using Language Explorer, Lexique Pro Reader. Sambalpur: Menaka Prakashani. & Toolbox can also be prepared. Shukla, S. (1981), Bhojpuri Grammar. Washington, D. C.: Georgetown University Press. References Tripathy, B. (1984). Sambalpuri Semantics. Graduate Thesis. Sambalpur: . Abbi, A. (1992). Reduplication in South Asian Languages: An Areal, Typological, and Historical Study. India: Allied Publishers Pvt. Ltd. Abbi, A. (2001). A Manual of Linguistic Fieldwork and Structures of Indian Languages (Vol. 17). Lincom Europa.

354