International Journal of Recent Technology and Engineering (IJRTE) ISSN: 2277-3878, Volume-7 Issue-4, November 2018

Rule Based Parts of Speech Tagger for Chhattisgarhi Language

Vikas Pandey, M.V Padmavati, Ramesh Kumar

 Example 1: हमन दनु ो बैऱगाड़ी म रायपुर जाबⴂ Abstract: There is an increasing demand for machine translation systems for various regional languages of . WORDS हमन दनु ो बैऱगाड़ी म रायपुर जाबⴂ Chhattisgarhi being the language of the young state requires automatic languages translating system. Various types of TAGS PRP N N PP N VM natural language processing (NLP) tools are required for Table (a): Chhattisgarhi words and its tags taken from developing Chhattisgarhi to machine translation (MT) Chhattisgarhi tag set. system. In this paper, we are presenting rule based parts of speech tagger for Chhattisgarhi language. Parts of Speech tagging is a There are various approaches for POS tagging: Rule based procedure in which each word of sentence is assigned a tag from approach, Statistical approach and Hybrid approach [2, 3]. tag set. The Parts of Speech tagger is based on rule base which is Accuracy factor is the most important factor in deciding the formed by taken into consideration the grammatical structure of performance of POS tagger [2]. Chhattisgarhi language. The system is constructed over corpus The Rule Based POS tagging approach is based on size of 40,000 words with tag set consists of 30 different parts of speech tags. The corpus is taken from various Chhattisgarhi grammar rules that are framed by observing the grammatical stories. The system achieves an accuracy of 78%. structure of any language. These rules can be written in form Index Terms: Chhattisgarhi, Machine Translation, Natural of production grammar rules. Example: Language Processing, Parts of Speech tagger, Rule Based System. “A proper noun is always followed by a noun” as in the Table (a) हमन (Pronoun) is followed by दनु ो (Noun) I. INTRODUCTION There are some limitations of rule based approach; the Most of the regional languages are low resources language. main limitation is the formation of rule base. In this a rule is Some Indian languages are called low resource language as formulated for each condition [2, 3]. grammatical rules and literary work related to these languages The Statistical Based POS tagging approach is based on is not present in public domain. Pre-processing task like POS two important factors. These are: Frequency and probability tagging is a challenging task for these languages. In POS of occurrence of any word .In this approach most frequently tagging process a specific grammar class which is called as used tag for a specific word in the annotated training dataset is tag is assigned to a word in the sentence from tag set. Tag set used to tag that word in the un annotated dataset .The is a collection of grammar class which consist of English limitation of this system is that some sequences of tags can abbreviations like N(Noun),VM(Verb), PP(Preposition) come up for sentences that are not correct according to the etc.[1]. Parts of Speech (POS) tagging is a process of grammar rules of a certain language [3]. identifying the suitable class of tag for a word from a given tag In Hybrid Approach the probability theory of statistical set. It is very important task of pre-processing activity in method is used to train the corpus and then the set of machine translation. Machine translation systems take a production rules are applied on the testing corpus for tagging source language and convert it into target language. Various of testing corpus [2, 3]. POS tagging process is broadly tools are required in machine translation systems like classified into two models: Supervised Model and tokenize, POS tagger, morphological analyzer and parser. Unsupervised Model [4]. Classification of POS Tagging is POS tagger comes under pre-processing phase of machine shown in Figure 1. translation system. Most of the regional languages are low resources language. Some Indian languages are called low resource language as grammatical rules and literary work related to these languages is not present in public domain. Pre-processing task like POS tagging is a challenging task for these languages. In POS tagging process a specific grammar class which is called as tag to a word in the sentence from tag set. Tag set is a collection of grammar class which consist of English abbreviations like NN (Noun), VM (Verb), PP (Preposition) etc.[2].

Revised Version Manuscript Received on 30 November, 2018. Vikas Pandey, Dept. of Information Technology, Institute of Figure 1: Classification of POS Tagging Technology, , India Dr. M.V Padmavati, Dept. of Computer Science and Engg., Bhilai Institute of Technology, Durg, India Dr. Ramesh Kumar, Dept. of Computer Science and Engg., Bhilai Institute of Technology, Durg, India

Published By: Blue Eyes Intelligence Engineering 192 & Sciences Publication

Rule Based Parts of Speech Tagger for Chhattisgarhi Language

II. LITERATURE SURVEY Rule 2: If word is relative pronoun then there is high probability that next word will be noun. Research has already been done in morphologically rich For Example:- Indian languages like Hindi, Bengali, Telugu, Marathi, Tamil, , Gujarati, , , Odia and Punjabi. इही घर हरे जे ऱा तोखन बनाये हे । There are some low resource languages in India like Awadhi, In above example and is relative pronoun and Magahi, Nimadi, Bhojpuri, and Chhattisgarhi for which इही जेऱा घर machine translation tools have not been developed yet. and तोखन is noun. A POS tagger was developed using conditional random Rule 3: If word is reflexive pronoun then there is high field for In this system contextual probability that next word will be noun. information of the words has been used to search different For Example:- POS tags for various tokenized words. The system was evaluated over a corpus of 72,341 words with 26 different ओ अऩन धनी सन चऱ ददस। POS tags and system achieved the accuracy of 90.3% [5]. In above example अऩन is reflexive pronoun and धनी is A POS tagger was developed using Hidden Markov Model noun. for Hindi, which uses a Naïve stemmer as a pre-processor Rule 4: If word is personal pronoun then there is high based on longest suffix matching algorithm to achieve probability that next word will be noun. accuracy of 93.12% [6]. For Example:- A POS tagger was developed using Hidden Markov Model for Assamese. Unknown words were tagged using simple ऐ मोर ऩुतक हे। morphological analysis .The system was evaluated over a In above example मोर is personal pronoun and ऩुतक is corpus of 10,000 words with 172 different POS tags and noun system achieved the accuracy of 87% [7]. Rule 5: If current word is post position then there is high A POS tagger was developed using Hidden Markov Model probability that previous word will be noun. for Hindi .They uses Indian language POS tag set to develop For Example:- this tagger and achieved the accuracy of 92% [8]. धनीराम रायऩुर म रथे। III. METHODOLOGY In above example रायऩुरis noun and मᴂ is post position. This system is developed using rule based approach and Rule 6: If current word is verb then there is probability that tagging will be done by the help of POS tag made in previous word will be noun. consultation with Chhattisgarhi linguistics expert. The system For Example:- mainly works in two steps-firstly the sentences are spitted into words by the help of line splitter program and input words are तोरण नहाये बर गे हे । found in the database; if it is present then rules are applied to In above example तोरण is noun and नहाय े is verb. tag to assign if a proper tag and if it is not found then UNK tag Rule 7: If word is noun then there is probability that next or will be assigned to it. previous word will be noun. 3.1 Algorithm For Example:- The algorithm used for rule based part of speech tagger for सुररवात बबऱासऩुर म ऩढ़ते । Chhattisgarhi language is as follows: In above example सररवात and बबऱासऩर both are noun. 1. Chhattisgarhi sentence is tokenized by the help of line ु ु

splitter program. 3.2.2 Demonstrative Identification Rules 2. The Tokenized Chhattisgarhi words are normalized. Rule 1: If current word is pronoun in database and next 3. Search the numbers and tag them by using regular word is also pronoun, then first word will be demonstrative. expression. For Example:- 4. Abbreviations are searched using regular expression.

5. In the database all the input words are searched and ते कोन हरस । tag the word according to appropriate tag. In above example current word is कोन and next word is 6. UNK tag will be given to the unknown words. हरस and both are pronoun so त े is demonstrative. 7. The tagged words are then displayed. Rule 2: If current word is noun in database and next word 3.2 Rules that are Applied to Identify Different Tags is verb, then previous word will be demonstrative. 3.2.1 Noun Identification Rules For Example: - Rule 1: If word is adjective then there is high probability ओ घर चऱ ददस । that next word will be noun. For Example:- असऱी बात ये हरे। In above example असऱी is adjective बात is noun.

Published By: Blue Eyes Intelligence Engineering 193 & Sciences Publication International Journal of Recent Technology and Engineering (IJRTE) ISSN: 2277-3878, Volume-7 Issue-4, November 2018

In above example current word is घर which is noun and next word is चऱ which is a verb, so ओ is demonstrative. 3.2.3 Proper Noun Identification Rules Rule 1: If current word is not tagged and next word is tagged as proper noun, then there is high probability that current word will be proper noun. For Example: - तोरण, बऱदाऊ In above example तोरण and बऱदाऊ are proper noun. Rule 2: If current word is name and next word is surname Figure 2: POS Tagging of Chhattisgarhi sentence then we tag current and second word as single proper name. For Example: - V. CONCLUSION AND FUTURE WORK

सुररवात राम साहू will be tagged as सुररवातराम साहू In this paper Part of Speech tagger for Chhattisgarhi language by the help of using rule based technique has been where „ ‟ is proper noun. सुररवातराम discussed. The sentence is broken in to tokens by line splitter 3.2.4 Adjective Identification Rules program The tokenized words are search in the database and Rule 1: There is more chance that a word before a verb is then appropriate rules are applied to tag them. The system is adjective. constructed over corpus size of 40,000 words with tag set सचचन तᴂदऱु कर बदढ़या खेऱथे। consists of 30 different parts of speech tags. The test data is taken from various Chhattisgarhi stories and system achieves In above example is a adjective and is a verb बदढ़या खेऱथे an accuracy of 78%. 3.2.5 Verb Identification Rules The main limitation of rule based part of speech tagger is Rule 1: If current word is not tagged and next word tagged that it completely depends on rule. If rule is not present for a as an auxiliary verb, then there is high probability that current word then it will not be tagger by the system due to this word will be main verb. accuracy of system will decrease. For Example:- In future there is need to shift towards neural network based ओ हा खाना खात ररदहस । system so that test data can be automatically gets trained and by increasing the size of corpus the accuracy of system will In above example खात is main verb andररदहसis auxiliary also increase. verb.

IV. RESULTS AND DISCUSSION REFERENCES 1. Agrawal, R., Ambati, B., & Singh, A.Singh.(2012). A GUI to Detect In order to test the system, few lines of a Chhattisgarhi story and Correct Errors in Hindi Dependency Treebank. In Proc.of Eighth titled „ ‟ is taken a input sentence : International Conference on Language Resources and Evaluation, मⴂगरा Istanbul, Turkey, 1907-1911. Input Chhattisgarhi Sentence 2. Hasan, M., F.,Uzzaman, N., & Khan, M.(2006 ). Comparison of Different POS Tagging Techniques (n- grams, HMM and Brill‟s आज सु셁配ती हे न तउन े ऩाके ददयना के अजं ोर ह रइऩुर भर मं बगरे Tagger) for Bangla. International Conference on Systems, Computing Scienc and Software Engineering (SCS2 06) of International Joint हवय। Conferences on Computer, Information, and Systems Sciences, and Output Chhattisgarhi Sentence Engineering. 3. Kumawat, D., & Jain,V. (2015). POS Tagging Approaches: A Chhattisgarhi Words Tagging Comparison. International Journal of Computer Applications ,vol 118,32-38 आज सु셁配ती 4. Antony, P.,J.(2011). Parts Of Speech Tagging for Indian Languages: A Literature Survey. International Journal of Computer Applications, Vol हे न 34, 22-29. 5. Ekbal, A., Haque, R, & Bandhopadha, S. ( 2007). Bengali part of तउने speech tagging using Conditional Random Field. In Proc. of SPSAL2007. 131-136. आज सु셁配ती हे न पाके 6. Shrivastav, M., & Bhattacharyya, P. ( 2008). Hindi POS Tagger Using Naive Stemming: Harnessing Morphological Information without तउने पाके ददयना के ददयना के अं Extensive Linguistic Knowledge. In Proc.of ICON, 1-6 7. Sharia, N., Das, D., Sharma U. , Kalita, J. (2009). Part of Speech जोर अंजोर ह रइपुर भर मं Tagger for Assamese Text . In Proc. of the ACL-IJCNLP, 33-36. 8. Joshi, N., Darbari, H., & Mathur, I. (2013). HMM Based POS Tagger रइपुर बगरे हवय। For Hindi. In Proc.of CCSIT, SIPP, AISC, PDCTA, 341-349 भरमं

X>बगरेहवय

_VM>। Table1: Chhattisgarhi words and corresponding tags

Published By: Blue Eyes Intelligence Engineering 194 & Sciences Publication