Identification of POS Tag for Khasi

ISSN 2007-9737

Identiﬁcation of POS Tag for Khasi Language based on Hidden Markov Model POS Tagger

Sunita Warjri1, Partha Pakray2, Saralin Lyngdoh1, Arnab Kumar Maji1

1 North-Eastern Hill University, Shillong, Meghalaya, India

2 National Institute of Technology, Silchar, Assam, India

{sunitawarjri, parthapakray, saralyngdoh, arnab.maji}@gmail.com

Abstract. Computational Linguistic (CL) becomes combining the technology of artificial intelligent and an essential and important amenity in the present computer science. The most important part of any scenarios, as many different technologies are involved NLP task is the issue of understanding the natural in making machines to understand human languages. language. Application of NLP helps machines to Khasi is the language which is spoken in Meghalaya, learn, read and understand the human language, India. Many Indian languages have been researched in by simulating the human ability of understanding different fields of Natural Language Processing (NLP), the language and by combining the technology of whereas Khasi lacks substantial research from the NLP perspectives. Therefore, in this paper, taking POS computational linguistics, computer science and tagging as one of the key aspects of NLP, we present artificial intelligence. POS tagger based on Hidden Markov Model (HMM) for Khasi language. In this present preliminary stage The most basic and important starting level of of building NLP system for Khasi, with the analyses of NLP for any language is POS tagging. POS the categories and structures of the words is started. in language processing is the aspect that deals Therefore, we have designed specific POS tagsets to with the identification of grammatical class of each categories Khasi words and vocabularies. Then, the word in a given sentence. POS is used in many POS system based on HMM is trained by using Khasi fields of NLP such as Semantic Disambiguation, words which have been tagged manually using the Phrase identification (chunking), Named Entity designed tagsets. As ambiguity is one of the main Recognition, Information Extraction, Parsing, etc. challenges in POS tagging in Khasi, we anticipated difficulties in tagging. However, by running with the first [2, 17]. The task of creating POS Tagger involves few sets of data in the experimental data by using the many stages such as: building tagsets, creating HMM tagger we found out that the result yielded by this dictionary, considering the rules of the context and model is 76.70% of accurate. also checking the inflexions, dependent anomalies of the particular language. Keywords. Natural language processing (NLP), computational linguistic, part of speech (POS), POS Though, substantial work has already been tagger, hidden Markov model (HMM). carried out in different fields of NLP for Indian Language, Khasi lacks such study from the NLP 1 Introduction perspective. Other Indian languages like Hindi, Bengali, Assamese, Manipuri, Marathi, Tamil, etc. Natural Language Processing (NLP) deals with have already been employed in Computational the inter-relation and inter-communication between linguistic. In this paper we present POS tagger the computer and natural human language by for Khasi language. Khasi Language is an official