Rule Based Parts of Speech Tagger for Chhattisgarhi Language

International Journal of Recent Technology and Engineering (IJRTE) ISSN: 2277-3878, Volume-7 Issue-4, November 2018 Rule Based Parts of Speech Tagger for Chhattisgarhi Language Vikas Pandey, M.V Padmavati, Ramesh Kumar Example 1: हमन दनु ो बैऱगाड़ी म रायपुर जाबⴂ Abstract: There is an increasing demand for machine translation systems for various regional languages of India. WORDS हमन दनु ो बैऱगाड़ी म रायपुर जाबⴂ Chhattisgarhi being the language of the young Chhattisgarh state requires automatic languages translating system. Various types of TAGS PRP N N PP N VM natural language processing (NLP) tools are required for Table (a): Chhattisgarhi words and its tags taken from developing Chhattisgarhi to Hindi machine translation (MT) Chhattisgarhi tag set. system. In this paper, we are presenting rule based parts of speech tagger for Chhattisgarhi language. Parts of Speech tagging is a There are various approaches for POS tagging: Rule based procedure in which each word of sentence is assigned a tag from approach, Statistical approach and Hybrid approach [2, 3]. tag set. The Parts of Speech tagger is based on rule base which is Accuracy factor is the most important factor in deciding the formed by taken into consideration the grammatical structure of performance of POS tagger [2]. Chhattisgarhi language. The system is constructed over corpus The Rule Based POS tagging approach is based on size of 40,000 words with tag set consists of 30 different parts of speech tags. The corpus is taken from various Chhattisgarhi grammar rules that are framed by observing the grammatical stories. The system achieves an accuracy of 78%. structure of any language. These rules can be written in form Index Terms: Chhattisgarhi, Machine Translation, Natural of production grammar rules. Example: Language Processing, Parts of Speech tagger, Rule Based System. “A proper noun is always followed by a noun” as in the Table (a) हमन (Pronoun) is followed by दनु ो (Noun) I. INTRODUCTION There are some limitations of rule based approach; the Most of the regional languages are low resources language. main limitation is the formation of rule base. In this a rule is Some Indian languages are called low resource language as formulated for each condition [2, 3]. grammatical rules and literary work related to these languages The Statistical Based POS tagging approach is based on is not present in public domain. Pre-processing task like POS two important factors. These are: Frequency and probability tagging is a challenging task for these languages. In POS of occurrence of any word .In this approach most frequently tagging process a specific grammar class which is called as used tag for a specific word in the annotated training dataset is tag is assigned to a word in the sentence from tag set. Tag set used to tag that word in the un annotated dataset .The is a collection of grammar class which consist of English limitation of this system is that some sequences of tags can abbreviations like N(Noun),VM(Verb), PP(Preposition) come up for sentences that are not correct according to the etc.[1]. Parts of Speech (POS) tagging is a process of grammar rules of a certain language [3]. identifying the suitable class of tag for a word from a given tag In Hybrid Approach the probability theory of statistical set. It is very important task of pre-processing activity in method is used to train the corpus and then the set of machine translation. Machine translation systems take a production rules are applied on the testing corpus for tagging source language and convert it into target language. Various of testing corpus [2, 3]. POS tagging process is broadly tools are required in machine translation systems like classified into two models: Supervised Model and tokenize, POS tagger, morphological analyzer and parser. Unsupervised Model [4]. Classification of POS Tagging is POS tagger comes under pre-processing phase of machine shown in Figure 1. translation system. Most of the regional languages are low resources language. Some Indian languages are called low resource language as grammatical rules and literary work related to these languages is not present in public domain. Pre-processing task like POS tagging is a challenging task for these languages. In POS tagging process a specific grammar class which is called as tag to a word in the sentence from tag set. Tag set is a collection of grammar class which consist of English abbreviations like NN (Noun), VM (Verb), PP (Preposition) etc.[2]. Revised Version Manuscript Received on 30 November, 2018. Vikas Pandey, Dept. of Information Technology, Bhilai Institute of Figure 1: Classification of POS Tagging Technology, Durg, India Dr. M.V Padmavati, Dept. of Computer Science and Engg., Bhilai Institute of Technology, Durg, India Dr. Ramesh Kumar, Dept. of Computer Science and Engg., Bhilai Institute of Technology, Durg, India Published By: Blue Eyes Intelligence Engineering 192 & Sciences Publication Rule Based Parts of Speech Tagger for Chhattisgarhi Language II. LITERATURE SURVEY Rule 2: If word is relative pronoun then there is high probability that next word will be noun. Research has already been done in morphologically rich For Example:- Indian languages like Hindi, Bengali, Telugu, Marathi, Tamil, Urdu, Gujarati, Kannada, Malayalam, Odia and Punjabi. इही घर हरे जे ऱा तोखन बनाये हे । There are some low resource languages in India like Awadhi, In above example and is relative pronoun and Magahi, Nimadi, Bhojpuri, and Chhattisgarhi for which इही जेऱा घर machine translation tools have not been developed yet. and तोखन is noun. A POS tagger was developed using conditional random Rule 3: If word is reflexive pronoun then there is high field for Bengali language In this system contextual probability that next word will be noun. information of the words has been used to search different For Example:- POS tags for various tokenized words. The system was evaluated over a corpus of 72,341 words with 26 different ओ अऩन धनी सन चऱ ददस। POS tags and system achieved the accuracy of 90.3% [5]. In above example अऩन is reflexive pronoun and धनी is A POS tagger was developed using Hidden Markov Model noun. for Hindi, which uses a Naïve stemmer as a pre-processor Rule 4: If word is personal pronoun then there is high based on longest suffix matching algorithm to achieve probability that next word will be noun. accuracy of 93.12% [6]. For Example:- A POS tagger was developed using Hidden Markov Model for Assamese. Unknown words were tagged using simple ऐ मोर ऩुतक हे। morphological analysis .The system was evaluated over a In above example मोर is personal pronoun and ऩुतक is corpus of 10,000 words with 172 different POS tags and noun system achieved the accuracy of 87% [7]. Rule 5: If current word is post position then there is high A POS tagger was developed using Hidden Markov Model probability that previous word will be noun. for Hindi .They uses Indian language POS tag set to develop For Example:- this tagger and achieved the accuracy of 92% [8]. धनीराम रायऩुर म रथे। III. METHODOLOGY In above example रायऩुरis noun and मᴂ is post position. This system is developed using rule based approach and Rule 6: If current word is verb then there is probability that tagging will be done by the help of POS tag made in previous word will be noun. consultation with Chhattisgarhi linguistics expert. The system For Example:- mainly works in two steps-firstly the sentences are spitted into words by the help of line splitter program and input words are तोरण नहाये बर गे हे । found in the database; if it is present then rules are applied to In above example तोरण is noun and नहाय े is verb. tag to assign if a proper tag and if it is not found then UNK tag Rule 7: If word is noun then there is probability that next or will be assigned to it. previous word will be noun. 3.1 Algorithm For Example:- The algorithm used for rule based part of speech tagger for सुररवात बबऱासऩुर म ऩढ़ते । Chhattisgarhi language is as follows: In above example सररवात and बबऱासऩर both are noun. 1. Chhattisgarhi sentence is tokenized by the help of line ु ु splitter program. 3.2.2 Demonstrative Identification Rules 2. The Tokenized Chhattisgarhi words are normalized. Rule 1: If current word is pronoun in database and next 3. Search the numbers and tag them by using regular word is also pronoun, then first word will be demonstrative. expression. For Example:- 4. Abbreviations are searched using regular expression. 5. In the database all the input words are searched and ते कोन हरस । tag the word according to appropriate tag. In above example current word is कोन and next word is 6. UNK tag will be given to the unknown words. हरस and both are pronoun so त े is demonstrative. 7. The tagged words are then displayed. Rule 2: If current word is noun in database and next word 3.2 Rules that are Applied to Identify Different Tags is verb, then previous word will be demonstrative. 3.2.1 Noun Identification Rules For Example: - Rule 1: If word is adjective then there is high probability ओ घर चऱ ददस । that next word will be noun. For Example:- असऱी बात ये हरे। In above example असऱी is adjective बात is noun. Published By: Blue Eyes Intelligence Engineering 193 & Sciences Publication International Journal of Recent Technology and Engineering (IJRTE) ISSN: 2277-3878, Volume-7 Issue-4, November 2018 In above example current word is घर which is noun and next word is चऱ which is a verb, so ओ is demonstrative.

Load more