ISSN 2007-9737

Identification of POS Tag for Khasi Language based on Hidden Markov Model POS Tagger

Sunita Warjri1, Partha Pakray2, Saralin Lyngdoh1, Arnab Kumar Maji1

1 North-Eastern Hill University, , ,

2 National Institute of Technology, Silchar, , India

{sunitawarjri, parthapakray, saralyngdoh, arnab.maji}@gmail.com

Abstract. Computational Linguistic (CL) becomes combining the technology of artificial intelligent and an essential and important amenity in the present computer science. The most important part of any scenarios, as many different technologies are involved NLP task is the issue of understanding the natural in making machines to understand human languages. language. Application of NLP helps machines to Khasi is the language which is spoken in Meghalaya, learn, read and understand the human language, India. Many Indian languages have been researched in by simulating the human ability of understanding different fields of Natural Language Processing (NLP), the language and by combining the technology of whereas Khasi lacks substantial research from the NLP perspectives. Therefore, in this paper, taking POS computational linguistics, computer science and tagging as one of the key aspects of NLP, we present artificial intelligence. POS tagger based on Hidden Markov Model (HMM) for Khasi language. In this present preliminary stage The most basic and important starting level of of building NLP system for Khasi, with the analyses of NLP for any language is POS tagging. POS the categories and structures of the words is started. in language processing is the aspect that deals Therefore, we have designed specific POS tagsets to with the identification of grammatical class of each categories Khasi words and vocabularies. Then, the word in a given sentence. POS is used in many POS system based on HMM is trained by using Khasi fields of NLP such as Semantic Disambiguation, words which have been tagged manually using the Phrase identification (chunking), Named Entity designed tagsets. As ambiguity is one of the main Recognition, Information Extraction, Parsing, etc. challenges in POS tagging in Khasi, we anticipated difficulties in tagging. However, by running with the first [2, 17]. The task of creating POS Tagger involves few sets of data in the experimental data by using the many stages such as: building tagsets, creating HMM tagger we found out that the result yielded by this dictionary, considering the rules of the context and model is 76.70% of accurate. also checking the inflexions, dependent anomalies of the particular language. Keywords. Natural language processing (NLP), computational linguistic, part of speech (POS), POS Though, substantial work has already been tagger, hidden Markov model (HMM). carried out in different fields of NLP for Indian Language, Khasi lacks such study from the NLP 1 Introduction perspective. Other Indian languages like , Bengali, Assamese, Manipuri, Marathi, Tamil, etc. Natural Language Processing (NLP) deals with have already been employed in Computational the inter-relation and inter-communication between linguistic. In this paper we present POS tagger the computer and natural human language by for Khasi language. Khasi Language is an official

Computación y Sistemas, Vol. 23, No. 3, 2019, pp. 795–802 doi: 10.13053/CyS-23-3-3248 ISSN 2007-9737

796 Sunita Warjri, Partha Pakray, Saralin Lyngdoh, Arnab Kumar Maji language of the state Meghalaya, in North East consisting of prefixes, suffixes and root words and India [16]. also information with respect to the text content. The name ‘Khasi’ classify both the tribe and Part-of-speech (POS) Tagger for as well as the language. Khasi Language is Language using supervised learning based on spoken in the Khasi Hills district of the state Support Vector Machine (SVM)as discussed in Meghalaya, India. Also, this language is spoken [1]. For the tagger, tag-set consisting 29 in the border area of Assam -Meghalaya as tags was developed. The corpus with dataset well as India- border. Khasi is a consisting 180,000 words, the system achieved part of Mon-Khmer family which is branch of 94% of accuracy. Austro-Asiatic, Southeast of Asia. There have Part-of-Speech Tagging for been very limited research works in computational discussed in [8] using Training data of 576 words linguistics with regard to Khasi Language. with tag-set of 9 tags. Tokenization, Morphological To perform research work on NLP there is a analysis and Disambiguation process have been need of corpus in Khasi Language. Dataset or carried out for POS tagging. The author Corpus is basically a large and extensive collection concluded with the attained accuracy of 78.82% by of texts or words, which are used for analysis of the system. any Natural Language. Corpus is an essential language based on rule based, component for any Natural Language Processing Conditional Random Field (CRF) and Support research. Therefore, in this paper we describe Vector Machines (SVM) for Part of Speech (POS) tagger for Khasi Language based on supervised Tagger discussed in [9]. A tagged dataset of system HMM trained model. 42,537 words with 26 tag were used. The POS The paper is organized as follows: Section 2 taggers methods attains the accuracies of 69% describes related works on POS Tagging; Section for rule based, 81.67% CRF based and 84.46% 3 describes Methodology used in HMM based POS for SVM. Tagger system; Section 4 describes experimental In [10] has discussed POS Tagging and results; Section 5 Conclusions and some future Chunking using Conditional Random Fields and perspectives. Transformation. The POS tagger accuracy achieved for CRF and TBL of about 77.37% for 2 Literature Review Telugu, 78.66% for Hindi, and 76.08% for Bengali and the chunker performance accuracy of 79.15% In this section, the existing relevant works in the for Telugu, 80.97% for Hindi and 82.74% for POS tagging of different languages are presented. Bengali respectively. There are several existing research works on POS Part-of-Speech (POS) tagger for Bengali lan- tagging on different languages such as Indian, guage in [3] has been reported. Tagging English, German, Spanish, etc. whereby many based on Hidden Markov Model (HMM) and researchers have also proposed different methods Maximum Entropy (ME) stochastic taggers has for POS tagging and show the achieved results. been discussed. Study is conducted to improve A Hidden Markov Model (HMM) based Part the efficiency and performance of the tagger by of Speech (POS) Tagger for Hindi language as using a morphological analyzer. An accuracy discussed in [5]. Indian Language (IL) POS tag of 76.80% has been reported. The author set have been employed for the system. In the reported to achieved good performance with the experimental result the HHM POS tagger acquires suffix information and morphological restriction accuracy of 92%. on the grammatical categories for the supervised A POS tagging for Manipuri language demon- learning model. strated 69% of accuracy by using morphology In the paper [4] the POS tagger based on SVM driven POS tagger in [12]. This POS tagger and HMM for Bengali has been proposed. The uses 10917 unique words and three dictionaries author show the result as the accuracy of 79.6%

Computación y Sistemas, Vol. 23, No. 3, 2019, pp. 795–802 doi: 10.13053/CyS-23-3-3248 ISSN 2007-9737

Identification797 of POS Tag for Khasi Language based on Hidden Markov Model POS Tagger for the manually checked corpus consisting 0.128 Table 1. POS tagset for Khasi Language [15] million words. No. Tag Description The result of developed POS taggers were given 1 PPN Proper nouns as the accuracies of 85.56% for HMM, and 91.23% 2 CLN Collective nouns for SVM. 3 CMN Common nouns Part of Speech tagging for Assamese was 4 MTN Material nouns 5 ABN Abstract nouns reported in paper [11]. The system was design 6 RFP Reflexive Pronoun based on Hidden Markov Model approach (HMM). 7 EM Emphatic Pronoun With tagsets consisting 172 tags and corpus 8 RLP Relative Pronouns consisting 10000 words which were manually 9 INP Interrogative Pronouns tagged for training the system. The accuracy of 10 DMP Demonstrative Pronouns 87% was achieved by the HMM POS tagger. 11 POP Possessive Pronoun For khasi language the author in [13] had 12 CAV Causative Verb 13 TRV Transitive verb introduce tagsets consisting 61 tags. In the paper 14 ITV Intransitive verb [14] the author had also introduce Morphological 15 DTV Ditransitive verb Analyzer for Khasi language. Using morphotactic 16 ADJ Adjective rules the author had used dictionary consisting 17 CMA Comparative Adjective marker 8000 words. The analyzing system use based 18 SPA Superlative Adjective marker word of the word classes with grammatical 19 AD Adverb relationship of subject verb object. For deriving the 20 ADT Adverb of Time morphotactic rules the prefixes, infixes and suffixes 21 ADM Adverb of Manner of the words had been made used. 22 ADP Adverb of Place 23 ADF Adverb of frequency 24 ADD Adverb of degree 3 Methodology for HMM POS tagger 25 IN Preposition 26 1PSG 1st Person singular common gender In this paper, the POS tagger for Khasi language 27 1PPG 1st Person plural common based on Hidden Markov Model (HMM) as gender supervised learning has been employed. In the 28 2PG 2ndPerson singular/plural subsection below, we present the discussion about common gender the methods that have been carried out in this work 29 2PF 2ndPerson singular/plural for building the supervised POS Tagger: Feminine gender 30 2PM 2ndPerson singular/plural Masculine gender 3.1 Tag Sets 31 3PSF 3rd Person singular Feminine gender Tag is the label that is used to describe a 32 3PSM 3rd Person singular Masculine grammatical class of information (these are: gender nouns, verbs, pronouns, prepositions, and so on). 33 3PPG 3rd Person plural common Gender For example: NN could represent the Noun class, 34 3PSG 3rd Person singular common Gender JJ the Adjective class, PRP the Pronoun class, etc. 35 VPT Verb, present tense Each language has a different pattern, frequency, 36 VPP Verb, present progressive participle and speaking style. Thus, the grammatical class 37 VST Verb, past tense is also different for different languages. Therefore, 38 VSP Verb, past perfective participle for this work we have designed tagset consisting 39 VFT Verb, future tense 54 tags for identifying the grammatical class or 40 Mod Modalities Part-of-Speech (POS) of Khasi Language [15] as 41 Neg Negation listed in the Table 1 below. 42 CLF Classifier

Computación y Sistemas, Vol. 23, No. 3, 2019, pp. 795–802 doi: 10.13053/CyS-23-3-3248 ISSN 2007-9737

798 Sunita Warjri, Partha Pakray, Saralin Lyngdoh, Arnab Kumar Maji

No. Tag Description 3.3 POS Identification based on HMM for Khasi 43 COC Coordinating conjunction Language 44 SUC Subordinating conjunction 45 CRC Correlative conjunction This subsection presents the steps for POS tagger 46 CN Cardinal Number based on Hidden Markov Model (HMM) that has 47 ON Ordinal number been carried out in this work. Part-of-speech 48 QNT Quantifiers 49 CO Copula tagging is a sequence classification problem. The 50 InP Infinitive Participle main objective of this supervised machine learning 51 PaV Passive Voice HMM trained model is to give the most i.e. the 52 COM Complementizer maximum probable tag y as outputs for the given 53 FR Foreign words word x, as shown in equation (1): 54 SYM Symbols f(x) = arg max P (y|x). (1) y∈Y

For more details regarding the designed tag-sets The following are the elements consist in the HMM of Khasi language can be found in [15] respectively. POS tagger system and the steps followed for calculating the efficiency of the tagger: 1. A finite set of words, W = {w1, w2, ..., wn}. 2. A finite tags, T = {t1, t2..., tn}. 3.2 Data Sets for Corpus Building 3. n is the number in length.

Therefore, from our objective, to find the optimal In linguistics, the corpus is the large number of data tags sequence tˆn we have equation (2) and (3): or written texts, which is collected for analyzing in computational linguistic. For this work, the corpus tˆn = arg max P (tn|wn), (2) consists of the collected words and each word tn consists with its corresponding tags. It is found that tˆn = arg max P (tn)P (wn|tn). (3) no such Khasi corpus is available till date. tn NOTE: Prior probability - P (tn) and Likelihood Therefore, in this work the Khasi corpus has probability - P (wn|tn). been built. The corpus consists of Khasi language based on context, by collecting the written text 4.Two properties of assumptions are considered in from online Khasi newspaper. These raw texts are HMM based POS taggers, shown in equation (4) then tag appropriately to each word by using the and (5): designed tagset respectively. Tagging to words at this stage has been done manually. Therefore, we were able to create 7,500 words in our data set. i.The probability of bi-gram assumption: n P (t ) = P (t1)P (t2|t1)P (t3|t2)P (t4|t3). . . P (tn|t(n−1)), The corpus has been built painstakingly with n care so that it can efficiently handle the problem Y ≈ P (t |t ). (4) of Ambiguity and also orthography. In this work, i (i−1 i=1 we aim to achieve standard Khasi corpus for POS tagging. Therefore, the dataset have also been ii.The assumption of Likelihood probability: built under the observation and validation of a n n linguistic expert from North Eastern Hill University, P (w |t ) = P (w1|t1)P (w2|t2)P (w3|t3). . . P (wn|tn),

Department of Linguistic, Shillong, Meghalaya, n India. Then, this dataset has been used for training Y ≈ P (wi|ti). (5) and testing the tagger. i=1

Computación y Sistemas, Vol. 23, No. 3, 2019, pp. 795–802 doi: 10.13053/CyS-23-3-3248 ISSN 2007-9737

Identification799 of POS Tag for Khasi Language based on Hidden Markov Model POS Tagger

5.The POS Tagger use equation (6) for estimating Table 2. Manually tagged dataset for trainning the most probable sequence of tag. n n n Tag Khasi Words (t ) = arg maxtn P (t |w ): 3PSF Ka n Y CMN Kynhun ≈ arg max P (ti|t(i − 1))P (wi|ti). (6) tn COM ba i=1 VST la TRV phah 6. Tag Transition probabilities P (t |t ) defining i i−1 EM da the probability of going from tag t to tag t , as i−1 i 3PSM u shown in equation (7). CMN Myntri Count(t , t ) ADJ Rangbah P (t |t ) = i−1 i (7) i i−1 Count(t ) 3PSF ka i−1 PPN Punjab 7. Word Emission probabilities P (wi|ti) defining 3PSM u the probability of emitting word wi in tag ti shown FR Capt. in equation (8): PPN Amanrinder PPN Singh Count(ti, wi) P (wi|ti) = . (8) IN hapoh Count(ti) 3PSF ka ABN jing¨ıalam For more details on supervised learning system POP jong based on HMM POS tagger and its assumption 3PSM u considered can be found in [7] respectively. CMN Myntri 3PSF ka 4 Experimental Results CMN tnad FR Water In the subsection below, we present brief discussion on the experimental work conducted 4.2 Discussion on Some Challenges based on the corpus, with brief discussion on the result achieved and its analysis. During the collection of data and creating the dataset, we encountered some challenges; two 4.1 Corpus main challenges in building the dataset for Khasi discussed in this paper are orthography and As discussed in the methodology, the corpus has ambiguity. In ’Orthography’, the major problem been manually designed and check by the linguistic is the spelling consistency. Spellings in Khasi expert. In this work dataset or corpus of around have not been fully standardized. Different authors of 7,500 words has been used for training and 312 spell differently for the same words and that many words for testing the HMM based POS Tagger. The words that are spelt alike have different meanings corpus of Khasi language for this research has and are pronounced differently in different context. been collected from the online Khasi newspaper Whereas ambiguities are found in categorizing from [6], and the data collected comprise of the words that are spelt and pronounced alike but differ political news and news had been used in in categories when they are used in a sentence. the corpus. Some sample snap of the dataset that Assigning labels or tags to each word of a given are manually tagged using the respective tagset is sentence is a difficult task, because there are shown in Table 2. The right hand side represent words that represent more than one grammatical the words and the left hand side represent its part of speech. The challenge of POS tagging is corresponding tags. the ’Ambiguity’.

Computación y Sistemas, Vol. 23, No. 3, 2019, pp. 795–802 doi: 10.13053/CyS-23-3-3248 ISSN 2007-9737

800 Sunita Warjri, Partha Pakray, Saralin Lyngdoh, Arnab Kumar Maji

Some other structural are also there, but are kept 4.3 Result out of this paper as they are not within the scope of the paper. Using this HMM POS tagger based on the su- Keeping in mind these problems, we tried pervised learning method for the Khasi language, to consider by checking and accounting these with the corpus of 7,500 words the system yield problems wisely and built the dataset accordingly. 76.70% of accuracy as a performance. Due to The assumption that is considered for the word the unavailability of dataset or corpus and we had categories is based on the root category and to make our own dataset, the accuracy of the prefix information. Therefore accordingly, we have tagging can be improved further by creating more designed the data sets to account these problems. data in the corpus. The Table 4 below show the Below are some of the examples cited, based on comparison result with the other language using the challenges which are mentioned above: HMM approach.

For example: Orthography problem is as 4.4 Result Analysis follows. It is found that the orthography is very complex A part from the correct tagged words, in the problem in Khasi language, as in some context experimental result some errors have also been words are different compare with the other; like the analyses. It is found that some words are tagged word ia in some context it is written as ¨ıa. There incorrectly with the tag that does not belong to are many such words that are spell and written the respective word. As we have tag manually the different by different people, some more of those Khasi words with respect to the content of context, words are like: therefore some words are tagged wrongly by the ¨ıadei, ia dei, Khamtam, Kham tam, eiei, ei ei, system due to ambiguity problem. ¨ıatreilang, ia trei lang, watla, wat la, and so on.

For example: Ambiguity problem is as follows. For example, the word: The Table 3 below show some of the ambiguous words of Khasi language: bha is tagged as ADJ or AD. Namar is tagged as COC or SUC. Table 3. Some of the Khasi ambiguous words Hapdeng is tagged as IN or ADP.

Khasi word Meaning POS class In the result it is found that for the words: ia it is Kot book noun tagged 12 times as InP and it is tagged 2 times as Kot reach verb IN, bah is tagged as CMN 3 times and 1 time as Kam work noun ADJ, dang is tagged 1 time as AD and 1 time as Kam pace verb VPP. Due to the present of ambiguous words in the lum hill noun annotated corpus it reduces the result accuracy. lum collect verb Therefore, with more annotated data there is high bah sir noun chance that the system will improve the result. bah enough adjective mar as soon as adverb mar material noun 5 Conclusion and Future Works tam pick verb tam over adjective As very limited works have been done for Khasi kham more adjective language in NLP till date, and tagging from the kham hold verb semantic and technical problems cited above have not been discussed at all, this work culminated as Therefore according to our need of the data we paper that addressed these issues from a larger have been considered and solve the problem. perspective of NLP.The problems and issues being

Computación y Sistemas, Vol. 23, No. 3, 2019, pp. 795–802 doi: 10.13053/CyS-23-3-3248 ISSN 2007-9737

Identification801 of POS Tag for Khasi Language based on Hidden Markov Model POS Tagger

Table 4. Comparison with others HMM POS tagger

Sl. Language Paper Title Approach corpus training Accuracy no. Used HMM based pos Hidden Markov 24 tags 1 Hindi 92%. tagger for Hindi Model 3,58,288 words Unigram 77.38%, Development of Marathi Unigram, Bigram, Bigram 90.30%, Part of Speech Tagger 26 lexical tags 2 Marathi Trigram and Trigram 91.46% Using Statistical 1, 95,647 words HMM Methods and Approach HMM 93.82% Automatic Part-of-Speech Tagging for Bengali: Maximum Entropy (ME) An Approach for 3 Bengali and 45,000 words 76.8% Morphologically Rich Languages HMM Methods in a Poor Resource Scenario Approach Web-based HMM Bengali News Corpus SVM 1.23% 4 Bengali and 0.128 million words for Lexicon Development HMM 85.56% SVM Methods and POS Tagging Part of Speech 5 Tagger for Assamese HMM Method 10,000 words 87% Assamese Text Identification of POS Tag for Khasi Language based 6 on Hidden Markov Model Khasi HMM Method 7,500 words 76.70% POS Tagger (Our proposed method) raised in this paper does not solve all the problems References encounter in the POS tagging of Khasi, therefore, there is a future scope of accounting the other 1. Antony, P., Mohan, S. P., & Soman, K. (2010). problems in future research. SVM based part of speech tagger for Malayalam. Recent Trends in Information, Telecommunication Therefore, we aim to improve the result of the and Computing (ITC), 2010 International Confer- ence on, IEEE, pp. 339–341. system by introducing more data in the corpus, for both training and testing. We also aim to develop 2. Chowdhury, G. G. (2003). Natural language some syntactic rules for Khasi language and to processing. Annual review of information science employ them in POS tagger for good results. This and technology, Vol. 37, No. 1, pp. 51–89. will help us to evaluate good performance of the 3. Dandapat, S., Sarkar, S., & Basu, A. (2007). POS tagger of Khasi language. Automatic part-of-speech tagging for Bengali: An approach for morphologically rich languages in a poor resource scenario. Proceedings of the 45th Annual Meeting of the ACL on Interactive Acknowledgement Poster and Demonstration Sessions, Association for Computational Linguistics, pp. 221–224. 4. Ekbal, A. & Bandyopadhyay, S. (2008). Web- Authors would like to acknowledge the Centre for based Bengali news corpus for lexicon development Natural Language Processing, National Institute and POS tagging. Polibits, Vol. 37, pp. 21–30. of Technology Silchar, 788010, Assam, India for 5. Joshi, N., Darbari, H., & Mathur, I. (2013). HMM this work. based POS tagger for Hindi. Proceeding of 2013

Computación y Sistemas, Vol. 23, No. 3, 2019, pp. 795–802 doi: 10.13053/CyS-23-3-3248 ISSN 2007-9737

802 Sunita Warjri, Partha Pakray, Saralin Lyngdoh, Arnab Kumar Maji

International Conference on Artificial Intelligence, 12. Singh, T. D. & Bandyopadhyay, S. (2008). Mor- Soft Computing (AISC-2013), pp. 341–349. phology driven Manipuri POS tagger. Proceedings 6. Mawphor (2017). Mawphor. [Online; accessed 07- of the IJCNLP-08 Workshop on NLP for less Nov-2017]. privileged languages, pp. 91–97. 7. Pakray, P., Majumder, G., & Pathak, A. (2018). An 13. Tham, M. J. (2012). Design considerations for HMM based pos tagger for POS tagging of code- developing a parts-of-speech tagset for Khasi. mixed Indian social media text. Annual Convention Emerging Trends and Applications in Computer of the Computer Society of India, Springer, pp. 495– Science (NCETACS), 2012 3rd National Conference 504. on, IEEE, pp. 277–280. 8. Patil, H., Patil, A., & Pawar, B. (2014). Part-of- 14. Tham, M. J. (2013). Preliminary investigation of Speech tagger for Marathi language using limited a morphological analyzer and generator for Khasi. training corpora. IJCA Proceedings on National Emerging Trends and Applications in Computer Sci- Conference on Recent Advances in Information ence (ICETACS), 2013 1st International Conference Technology NCRAIT (4), Citeseer, pp. 33–37. on, IEEE, pp. 256–259. 9. Patra, B. G., Debbarma, K., Das, D., & 15. Warjri, S., Pakray, P., Lyngdoh, S., & Kumar Maji, Bandyopadhyay, S. (2012). Part of Speech (POS) A. (2018). Khasi language as dominant Part- tagger for Kokborok. Proceedings of COLING 2012: of-Speech (POS) ascendant in nlp. International Posters, pp. 923–932. Journal of Computational Intelligence & IoT, Vol. 1, No. 1, pp. 109–115. 10. PVS, A. & Karthik, G. (2007). Part-of-speech 16. Wikipedia contributors (2018). Khasi language — tagging and chunking using conditional random Wikipedia, the free encyclopedia. [Online; accessed fields and transformation based learning. Shallow 02-Feb-2018]. Parsing for South Asian Languages, Vol. 21, pp. 21–24. 17. Wilks, Y. & Stevenson, M. (1998). The grammar of sense: Using 6 part-of-speech tags as a first 11. Saharia, N., Das, D., Sharma, U., & Kalita, J. step in semantic disambiguation. Natural Language (2009). Part of speech tagger for Assamese text. Engineering, Vol. 4, No. 2, pp. 135–143. Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, Association for Computational Lin- Article received on 30/01/2019; accepted on 20/03/2019. guistics, pp. 33–36. Corresponding author is Sunita Warjri.

Computación y Sistemas, Vol. 23, No. 3, 2019, pp. 795–802 doi: 10.13053/CyS-23-3-3248