arXiv:2006.16212v1 [cs.CL] 29 Jun 2020

Towards the Study of Morphological Processing of the Tangkhul Language

Mirinso Shadang, Navanath Saharia, Thoudam Doren Singh
Indian Institute of Information Technology Manipur
Imphal, India - 795002
{mirinso, nsaharia}@iiitmanipur.ac.in, [email protected]

Abstract

There is little or no work on natural language processing of the Tangkhul language. The current work is a humble beginning of morphological processing of this language using an unsupervised approach. We use a small corpus collected from different sources: text books, short stories and articles on other topics. Based on the experiments carried out, the morpheme identification task using Morfessor gives reasonable and interesting output despite the small corpus.

1 Introduction

The name Tangkhul (also known as Hao or Ihao, especially in older literature) refers to an ethnic group that lives in the Kamjong and Ukhrul districts of Manipur State, India, and in contiguous parts of Nagaland (another state of India) and Burma. The Tangkhuls are quite diversified, linguistically and culturally. The speech varieties of most Tangkhul villages are not mutually intelligible with those of the neighbouring villages, although the similarities are large enough to facilitate the rapid learning of one another's languages. However, it is clear that the Tangkhul languages are closely related to one another and form a distinct subgroup within the Tibeto-Burman family.

Tangkhul is an ethnic group whose language varies from village to village. It has long been noted that Tangkhul is a group of languages rather than a single language (Brown, 1837); however, almost all of the available descriptions of Tangkhul languages have concentrated on the language of a single town, the variety that has come to serve as a lingua franca for the whole Tangkhul tribe. Even though the Tangkhul language serves as a common language for the whole Tangkhul region, the accent of speech (the way of speaking Tangkhul) varies from one geographical area to another. The Tangkhul region is mainly divided into four major parts: East (zingsho), West (zingtun), North (ato) and South (azay). There are more than 100 villages in the Tangkhul region (Ukhrul district and Kamjong district). Since all villages of the Tangkhul diaspora have their own native language, the pronunciation of the common (Tangkhul) language spoken in the various villages varies widely.

The most obvious working definition of the Tangkhul language is the family of Tibeto-Burman languages spoken by members of the Tangkhul tribe. This definition is complicated by a number of factors and is clearly inadequate, since some of the languages spoken as a mother tongue by ethnic Tangkhuls (as will be seen) are not members of the family being discussed here, and because it is possible that there are members of other Naga tribes speaking languages that belong in the Tangkhul group. Thus, while ethnicity can be taken as neither a necessary nor a sufficient criterion for membership in the Tangkhul family, it is nevertheless a useful starting point for a discussion of this group of languages.

Morphological processing of a language is the beginning of NLP work on that particular language. There are reports on morphological processing work for many other major Indian languages. Tangkhul, however, falls in the under-privileged and resource-constrained category, with little or no work on natural language processing to date. In this light, we attempt to collect a small corpus and take the initial steps towards morphological processing of this language. The morphological analysis of this language is found to be slightly complex, the language being agglutinative (Mortensen, 2003).

2 Related Work

The Tibeto-Burman languages still lack basic language processing tools of the required quality. Among the reported work on Tibeto-Burman languages, the main contributions are on Bodo (Sarmah, 2004), Mizo (Pakray et al., 2015), Kok-borok (Debbarma et al., 2012) and Manipuri (Singh and Bandyopadhyay, 2008). Almost all the reported work used a priori knowledge of the language, either in the form of a dictionary or in the form of labelled data prepared for training. As Tangkhul is a resource-poor language, we preferred an unsupervised learning approach.

2.1 Language morphology and syntax

Tangkhul has two distinct tones, high and low, mainly in utterances associated with the letters t, d, r, m and f. For example:

1. High (Kajuiya): Chanhan, Kasho (open), Khaikao (dry fish), Khalen (trap)
2. Low (Khanema): kachang (month), kachui (high), mashit (closeness), ashee (blood)

The adjective in Tangkhul displays a hybrid morphology, as it always occurs semantically somewhere between verbs and nouns.

Like other morphologically rich neighbouring languages (Majumder et al., 2007; Saharia et al., 2014), Tangkhul has a strong tendency to form sequences of affixes that are prepended or appended to the root.

In fact, many of the prefixes found in the Tangkhul language behave morphologically as if they are part of the root. While it is possible to assign independent historical origins to these elements, from a synchronic standpoint the factors that identify them are phonological and not morphological. As will be seen, it is often not possible to assign a consistent meaning or grammatical function to these prefixes, which will be referred to here as lexical prefixes (Mortensen, 2003).

2.2 Word Formation

The formation of words in a language is one of the topics that still attracts researchers the most. The two main classes of word formation are inflection and derivation. Derivation is again divided into two classes: affixation and compounding.

Compounding is divided into three classes: endocentric compound, exocentric compound and co-ordinate compound. The endocentric compound is again divided into right-headed and left-headed compounds. A right-headed compound may have constituents such as 'noun + noun' and 'noun + derived noun', but 'noun + noun' is used more often than the other in Tangkhul (Ahum, 1997).

There are three types of verb roots in Tangkhul: simple, complex and compound. A simple root is the irreducible 'core element' obtained by dropping all the affixes. A root may be monosyllabic or bisyllabic, for example lei (exist/have), vao (shout) and malai (forget). Although Tangkhul verbs form a closed class, a few verbs are derived from 'nominal' roots, for example /pha/ and /ra/. Table 1 lists a few case markers of the Tangkhul language.

There are two broad types of modifiers: adjectivals and adverbials. Adjectives, as a word class, are quite different from nouns and verbs. In the case of the Tangkhul language, the exact relationship between adjectives on the one hand, and other categories like nouns, verbs and adverbs on the other, has been one of the disputed issues in linguistics. In Tangkhul there is no distinction between verbs and adjectives in the sense that both are derived from roots, and function as adjectives or verbs with (a) appropriate affixation and (b) appropriate occurrence in a sentence (Ahum, 1997). Expressives (which are plentiful in the language and which most often have adverbial and adjectival functions) can be compounded with a number of roots to form compound adjectivals. In the process of compounding, expressives tend to behave like intensifiers or modifiers. The following are some of the most commonly used adjectives formed by compounding roots and expressives.

2.3 Compounding and reduplication

There are some compound modifiers in Tangkhul which are further reduplicated to denote a modified or completely changed meaning. In the process of reduplication, the last syllable of the compound root is partially reduplicated by replacing its initial consonant. The following are some of the most commonly used reduplicated compound adjectives, of the form Root + Root + Reduplicate:

• them-reak-sek = them (skill)- reak (pretend)-

Proceeding of Regional International Conference on Natural Language Processing (regICON) 2017, 3rd and 4th November 2017, IIIT Senapati, Manipur, India

Case marker              Suffix      Example
Locative case marker     -li, -wui   Manipur-li (in Manipur)
Genitive case marker     -chi        Avi-chi (his)
Nominative case marker   -na         khipa-na (who)

Table 1: A few case markers with examples

sek (redu)
  Approximate English meaning: pretending to be very skillful or learned

• Khon-zar-tar = Khon (sound)- zar (dense)- tar (redu)
  Approximate English meaning: very noisy.

3 Corpus preparation and preprocessing

As the language is very rarely found on the web, we manually collected and typed a few texts for our experiment. The current version of the corpus contains 21713 words, of which 7362 are unique (including inflected forms). The articles or segments of articles in the corpus can be categorised into biography (4), short story (6), essay (11), drama (2) and letter (1), with an average of approximately 904 words per article. The ten most frequent words in the corpus are listed in Table 2.

Sr No.   Word    Meaning   Frequency
1        eina    with      708
2        hi      this      367
3        chi     that      358
4        kala    and       336
5        da                184
6        haowa             130
7        kaji    that      122
8        ãkha    one       122
9        ãwui    his       120
10       chili   there     110

Table 2: Ten most frequent words in the corpus.

For our experiment, we use Morfessor (Creutz and Lagus, 2005) version 1.0 [1], an unsupervised, language-independent morphology learning package, to discover the regularities behind the word formation process. This learning approach discovers the primitive morphological units, such as root/base/stem [2], suffix and prefix, of the utterances of a language. As it is based on the words or utterances of the language and their frequency, most of the time the tool discovers a prefix (or sequence of prefixes), a stem and a suffix (or sequence of suffixes).

During the experiment, we found that a word may have a maximum of seven different morphemes, mostly a sequence of affixes. The morpheme frequency, with examples, is given in Table 3. We also found that the corpus contains a good number of words borrowed from English and neighbouring languages.

The following examples, segmented using Morfessor, show the morphological richness of the language.

maphaning (not thinking) = ma (no) + phaning (thinking)
maphaninsali = ma + pha + nin + sa + li
mapãmsangmara = ma + pã + m + sa + ng + mara
khamashashãli = kha + ma + sha + sh + ãli
kakhararkhangazai = ka + kharar + kha + nga + zai
ãchãhonthangcham = ã + chã + hon + thang + cham

The following are a few examples that Morfessor segmented correctly.

a + cham + ãram
ã + lung + th + ung + li
ã + ng + ã + ng + nao + li
a + ri + shang + li

Table 4 lists a few words incorrectly segmented by Morfessor. Interestingly, during segmentation, based on the evidence in the corpus, the tool also splits English words that are used as loan words in Tangkhul (Table 5).

In the context of loan words, another interesting example we found is acid + pa˜ + va. Though the root (the first morpheme) is a valid loan word, the context actually indicated the root ashipava˜ (meaning wife).

[1] http://www.http://morpho.aalto.fi/projects/morpho; Access date: 30 August, 2017.
[2] Though there are differences between the concepts of root, base and stem, for this experiment we use these terms interchangeably to indicate the stem.
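The segmentations quoted in Section 3 can be checked mechanically. The following is a minimal sketch, assuming segmentations are written as '+'-delimited strings and that exactly one morpheme per word is the stem. The helper names `split_morphemes` and `affix_count` are hypothetical and are not part of Morfessor.

```python
# Illustrative helpers; none of these names come from Morfessor itself.

def split_morphemes(segmentation):
    """Split a '+'-delimited segmentation string into a list of morphemes."""
    return [part.strip() for part in segmentation.split("+") if part.strip()]


def affix_count(segmentation):
    """Number of affixes, assuming exactly one morpheme is the stem."""
    return max(len(split_morphemes(segmentation)) - 1, 0)


# Word -> segmentation, copied from the examples quoted in Section 3.
examples = {
    "maphaning": "ma + phaning",
    "maphaninsali": "ma + pha + nin + sa + li",
    "kakhararkhangazai": "ka + kharar + kha + nga + zai",
}

for word, segmentation in examples.items():
    morphemes = split_morphemes(segmentation)
    # Sanity check: the morphemes must concatenate back to the surface word.
    assert "".join(morphemes) == word
    print(word, len(morphemes), affix_count(segmentation))
```

Counting affixes this way, over every unique word, is one plausible route to a distribution like the one in Table 3. The paper does not state how its counts were produced.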

Affixes per word   Unique words   Example
None               1772           khana
One                3600           advocate+la
Two                1487           ã+thingreira+wui
Three              410            kajui+kha+nem
Four               78             ã+lung+th+ung+li
Five               12             la+ng+da+ng+lã+na
Six                3              kha+nga+p+eo+bing+li+la

Table 3: Affix frequency distribution in the corpus
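As a quick consistency check, the counts in Table 3 can be summed and compared with the number of unique words reported in Section 3. This is a small sketch using only the figures printed in the paper. The variable names are illustrative, and the mean is derived here rather than reported by the authors.

```python
# Counts copied from Table 3: number of affixes -> number of unique words.
affix_distribution = {0: 1772, 1: 3600, 2: 1487, 3: 410, 4: 78, 5: 12, 6: 3}

# Unique word count reported in Section 3.
UNIQUE_WORDS = 7362

total = sum(affix_distribution.values())

# Percentage of unique words in each affix-count bucket.
shares = {n: 100 * count / total for n, count in affix_distribution.items()}

# Mean number of affixes per unique word (derived, not stated in the paper).
mean_affixes = sum(n * count for n, count in affix_distribution.items()) / total

print("total words in table:", total)
print("matches reported unique words:", total == UNIQUE_WORDS)
for n, share in shares.items():
    print(f"{n} affixes: {share:.1f}%")
print(f"mean affixes per word: {mean_affixes:.2f}")
```

The table total equals the 7362 unique words, so every unique word in the corpus falls into exactly one row.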

Wrong (by Morfessor)      Correct
ã + kã + khare + wui      ã + kã + kha + re + wui
ã + mathen + pai + ra     ã + ma + then + pai + ra
ã + ngasãm + khuiya       ã + nga + sãm + khui + ya

Table 4: Incorrect segmentation of words

Segmentation (by Morfessor)   Judgement
acqui + red                   Wrong
activ + ities                 Wrong
activ + ity                   Wrong
administra + tion + wui       Wrong

Table 5: Segmentation of loan words

4 Conclusion

The demand for localization through electronic content has made the globe a smaller place, and any digital divide in this regard has to be reduced through the inclusion of more naturally occurring languages in the information technology revolution. Towards this direction, the present task is a step to bring Tangkhul into the language technology revolution. In the present work, we reported the morphological processing of Tangkhul using an unsupervised approach. We found the result of the experiment to be reasonably good given the size of the corpus used in the work. Our future direction includes more experiments that incorporate language-specific rules and semi-supervised approaches.

References

Victor Ahum. 1997. Tangkhul-Naga Grammar: A Study of Word Formation. Ph.D. thesis.

Mathias Creutz and Krista Lagus. 2005. Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0. Technical report.

Khumbar Debbarma, Braja Gopal Patra, Swapan Debbarma, Lalita Kumari, and Bipul Shyam Purkayastha. 2012. Morphological analysis of Kok-borok for universal networking language dictionary. In Recent Advances in Information Technology (RAIT), 2012 1st International Conference on. IEEE, pages 474–477.

Prasenjit Majumder, Mandar Mitra, Swapan K Parui, Gobinda Kole, Pabitra Mitra, and Kalyankumar Datta. 2007. YASS: Yet another suffix stripper. ACM Transactions on Information Systems (TOIS) 25(4):18.

David Mortensen. 2003. Comparative Tangkhul. University of California, Berkeley.

Partha Pakray, Arunagshu Pal, Goutam Majumder, and Alexander Gelbukh. 2015. Resource building and parts-of-speech (POS) tagging for the Mizo language. In Artificial Intelligence (MICAI), 2015 Fourteenth Mexican International Conference on. IEEE, pages 3–7.

Navanath Saharia, Utpal Sharma, and Jugal Kalita. 2014. Stemming resource-poor Indian languages. ACM Transactions on Asian Language Information Processing (TALIP) 13(3):14.

Priyankoo Sarmah. 2004. Some aspects of the tonal phonology of Bodo. Ph.D. thesis, English and Foreign Languages University, Hyderabad.

Thoudam Doren Singh and Sivaji Bandyopadhyay. 2008. Morphology driven Manipuri POS tagger. In IJCNLP. pages 91–98.
