A Rule Based Approach for Lemmatisation of Punjabi Text Documents


A Rule Based Approach for Lemmatisation of Punjabi Text Documents

Dr. Rajeev Puri
Assistant Professor, Dept. of Comp. Sc., DAV College, Jalandhar
[email protected]

Keywords: Punjabi lemmatiser, stemming, information retrieval

ABSTRACT

Information retrieval is a non-trivial task, especially when the information is available in highly inflectional languages such as Hindi and Punjabi. The huge corpus needs to be reduced significantly in order to speed up the information retrieval process and to generate accurate results. The pre-processing phase of the information retrieval process deals with reducing the size of the corpus. Many pre-processing steps, such as stop word removal, stemming and lemmatisation, need to be performed to improve the accuracy of the information retrieval task. Whereas the stemming process converts the inflectional forms of a word by stripping a prefix or postfix, lemmatisation is the process of generating the normalised dictionary word that has the same or a related meaning as the given word. Lemmatisation provides more accurate results than a stemmer. However, Punjabi being an inflectional language makes the task of lemmatisation more complex: behind each Punjabi word there are many inflectional forms as well as synonyms which need to be dealt with. The Punjabi lemmatiser presented in this paper is an attempt to obtain the lemma of a word by using a rule-based approach.

1. Introduction

Lemmatisation is the process of grouping together the different inflected forms of a word so that they can be analysed as a single item. It is used for finding the normalised form of a word, also called its dictionary form, which is known as the lemma. A lexeme is the group of all forms (the headwords of a dictionary) that share the same meaning, and the lemma is the word chosen as the base form that represents the lexeme.

Research Cell : An International Journal of Engineering Sciences, Special Issue March 2018, Vol. 27, UGC Approved Journal (S.No. 63019), ISSN: 2229-6913 (Print), ISSN: 2320-0332 (Online), Web Presence: http://www.ijoes.vidyapublications.com. © 2018 Vidya Publications. Authors are responsible for any plagiarism issues.
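As a minimal illustration of this grouping of inflected forms under a lemma (sketched in Python with an invented English toy lexicon, since a real system would use a full dictionary of the target language):

```python
# Toy illustration: a lemmatiser maps every inflected form of a word
# to one normalised dictionary form (the lemma). The lexicon below is
# invented for demonstration only.
LEXICON = {
    "go": ["go", "goes", "going", "went", "gone"],
    "good": ["good", "better", "best"],
}

# Invert the lexicon: inflected form -> lemma.
FORM_TO_LEMMA = {
    form: lemma for lemma, forms in LEXICON.items() for form in forms
}

def lemmatise(word: str) -> str:
    """Return the lemma of a word, or the word itself if it is unknown."""
    return FORM_TO_LEMMA.get(word, word)

print(lemmatise("went"))    # -> go
print(lemmatise("better"))  # -> good
```

Here all five forms of "go" are analysed as a single item, which is exactly the reduction that shrinks index files in an information retrieval system.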
Because of the inflectional nature of Punjabi, the task of lemmatisation is more complex than stemming. Both stemming and lemmatisation are important pre-processing steps in the information retrieval task; they are also used in many text-mining applications and play an important role in Natural Language Processing and other linguistic fields. They help search engines find keywords and help reduce the size of index files. Building a lemmatiser requires a proper understanding of the language and a dictionary of that particular language. Indian languages are highly inflectional in nature, and Punjabi is no exception. Various inflectional forms of Punjabi words are shown in Table 1 below.

Table 1: Sample inflectional forms of Punjabi text

Punjabi Word | Inflectional forms
ਮuਡਾ (Boy) | ਮuਡ&, ਮuਿਡਆ, ਮuਿਡਆਂ, ਮuਿਡਓ
ਕਰ (Do) | ਕਰਿਦਆਂ, ਕਰਦੀਆਂ, ਕਰ/, ਕਰ0, ਕਰਦਾ, ਕਰ1, ਕਰਦ0
ਿਠਹਰ (Stop) | ਠਿਹਰਾ, ਠਿਹਰ&, ਠਿਹਿਰਆ, ਠਿਹਰਾਅ
ਰ#5 (Color) | ਰ#5ਣਾ, ਰਗਣ, ਰਗੀਨ, ਰ#5ੀਨੀ, ਰ#5ਾਉਣਾ, ਰ#5ਵਾਉਣਾ, ਰ#5ਾਈ, ਰ#5ਵਾਈ, ਰ#5ਲਾ, ਰ#5ੀਲਾ

2. Literature Survey

Taghva et al. [6] developed a stemmer for the Farsi language. Their system uses a deterministic finite automaton to implement the stemmer, and the minimum length of the stem is limited to three characters. Sarkar et al. [10] proposed a rule-based stemmer for Bengali, which is also a resource-poor language; POS-specific rules are applied to find the stem of a word. Kraaij et al. [4] developed a suffix stripper for Dutch using Porter's algorithm, extending the original algorithm to handle prefixes and infixes. Lovins [1] proposed a stemming algorithm for English, developed for use in library information transfer systems. Bijal et al. [18] discussed different stemming algorithms for Indian and non-Indian languages and presented a comparative study of them. Ganguly et al. [16] developed two rule-based stemmers, for Bengali and Hindi. Gupta et al. [11] developed a Punjabi-language stemmer for nouns and proper names: first the stem of the Punjabi word is generated, and then the stem is checked against a Punjabi noun and proper-name dictionary. Hafer et al. [2] developed word segmentation by the letter-successor-varieties method, using statistical properties of the corpus to decide whether a word should be divided; the approach is based on Z. S. Harris's process. Majumder et al. [9] developed YASS (Yet Another Suffix Stripper), a stemmer based on a clustering approach, where a cluster is a class consisting of a root word and its related word forms. Puri et al. [19] developed a Punjabi stemmer using the Punjabi WordNet database, in which a suffix-removal approach with suffix-stripping rules is used to find the stem of a Punjabi word. El-Shishtawy et al. [13] developed an accurate Arabic root-based lemmatiser for information retrieval purposes; it is the first lemmatiser for the Arabic language based on a non-statistical approach. Mishra et al. [14] developed MAULIK, an effective stemmer for Hindi based on a hybrid approach combining brute force and suffix removal; this stemmer targets the Devanagari script.

3. Research Gap

Punjabi is an Indo-Aryan language spoken by more than 100 million native speakers worldwide, making it the 10th most widely spoken language in the world. The popularity and online presence of the Punjabi language are increasing day by day, as more and more Punjabi corpora are generated and published over the internet. A system is therefore required for efficient searching of Punjabi documents over the web, and that searching should be possible in the native speaker's own language.
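Many of the surveyed stemmers share one core mechanism: strip the longest matching suffix, subject to a minimum stem length. A sketch of that mechanism in Python (the suffix list and length threshold below are invented for illustration and are not taken from any of the cited systems):

```python
# Longest-match suffix stripping with a minimum stem length: the core
# mechanism behind many rule-based stemmers surveyed above. The English
# suffix list below is an invented example, not from any cited system.
SUFFIXES = sorted(["ing", "edly", "ed", "ly", "s"], key=len, reverse=True)
MIN_STEM_LEN = 3  # do not strip if the remaining stem would be too short

def stem(word: str) -> str:
    for suffix in SUFFIXES:  # longest suffixes are tried first
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word

print(stem("jumpedly"))  # -> jump ("edly" is the longest matching suffix)
print(stem("sing"))      # -> sing (the length guard prevents over-stripping)
```

The minimum-stem-length guard is what keeps such a stemmer from reducing short words to meaningless fragments, which is why systems such as the Farsi stemmer above impose it explicitly.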
For a faster and more efficient information retrieval system in Punjabi, some pre-processing is required before indexing the documents. This pre-processing includes stop word removal, stemming and a sophisticated lemmatiser for reducing words to their base forms, called headwords. Various lemmatisation and stemming systems are available for different languages, but very little work has been done for Punjabi because of its inflectional nature: for every word in Punjabi there are a number of inflectional forms. An attempt has therefore been made here to develop a lemmatiser for Punjabi that generates an accurate lemma for most Punjabi words.

4. Proposed System

The proposed system uses a rule-based approach to find the normalised form of a word, which helps improve the results of the information retrieval task. The system works with a synonym-replacement module that uses pre-defined rules to obtain the lemma of the word. For example:

Word: ਿਬਮਾਰ
Synonym words: ਰ05ੀ, ਮਰੀਜ, ਮਰੀਜ਼, ਰ05ਗ?@ਤ

The word ਿਬਮਾਰ, with length 15, is the input word for the system. The list of synonyms is collected from the database, and the system picks the shortest word from the list, ਰ05ੀ, with length 12. After that, the rules from the rule list are applied to the synonym word to obtain the accurate lemma as output. Here, the 'B' => "" suffix-stripping rule is applied to the word ਰ05ੀ (12) and it becomes ਰ05 (9); this is the normalised word, a valid word available in the dictionary, and it is displayed as the output. The proposed system uses a number of databases and lists to accomplish this task; they are described in the following sections.

5. Synonyms Database

Synonyms are words that have the same meaning. A synonym database is prepared with an augmented field containing the length of each word for quick comparison. The database is indexed on the word and sorted by word length in ascending order for quick retrieval. For example, the word "ਮੂਰਖ" (Fool) has 26 distinct synonyms: "ਜਾਹਲ", "ਉਜੱਡ", "ਜੜ-", "ਗਵਾਰ", "ਭੁਚੱ", "ਘੋਗਾ", "ਬੇਸ਼ਮਝ", "ਮੂੜ-", "ਅਬੁੱਧ", "ਨਾਸਮਝ", "ਘੁੱਗੂ", "ਘੁੱਘੂ", "ਝੁਡੱੂ", "ਬੇਵਕੂਫ", "ਅਿਗਆਨੀ", "ਮੱਤਹੀਣ", "ਲੱਲ-ਾ", "ਅੰਞਾਣਾ", "ਜੜ_ਮੱਤ", "ਮੂੜਮ-ੱਤ", "ਬੁੱਧੀਹੀਣ", "ਿਨਰਬੁੱਧੀ", "ਲਘੂ_ਬੁੱਧੀ", "ਅਲਪ_ਬੁੱਧੀ", "ਘੱਟ_ਬੁੱਧੀ". Out of all these distinct synonyms, "ਜਾਹਲ" has the minimum length of 12 characters and hence is chosen as the replacement for all its peer words. The synonym-replacement module has an inherent problem of incorrectly replacing a named entity with its synonym; for example, the name "C+ਾਸ਼

A list of words sharing common suffixes is prepared, and the frequency of each valid suffix is calculated:

Suffix | Example words
ਕਾਰੀ | ਕਾਰਜਕਾਰੀ, ਿਹਤਕਾਰੀ, ਲਾਭਕਾਰੀ
ਣਗ& | ਚਲਣਗ&, ਰuਕਣਗ&, 1ਖਣਗ&
L(ਂ | ਹ#=/(ਂ, ਆਗ/(ਂ
ਗਾ | ਆਵ&5ਾ, ਜਾਵ&5ਾ, ਕਰ&5ਾ, ਭਰ&ਗਾ
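The pipeline of Sections 4 and 5 can be sketched as follows (Python; the synonym database, rule table and dictionary below are invented English stand-ins, since the paper's Punjabi resources are not reproducible here):

```python
# Sketch of the proposed lemmatiser: (1) fetch the word's synonyms,
# (2) pick the shortest candidate, mimicking the length-sorted synonym
# database, (3) apply suffix-stripping rules, and (4) accept a result
# only if it is a valid dictionary word. All data is an invented stand-in.
SYNONYMS = {"ill": ["sickly", "unwell", "diseased"]}
RULES = [("ly", ""), ("ed", "")]          # suffix -> replacement
DICTIONARY = {"sick", "ill", "unwell"}

def lemmatise(word: str) -> str:
    # Steps 1-2: the word itself is also a candidate for replacement.
    candidates = SYNONYMS.get(word, []) + [word]
    shortest = min(candidates, key=len)
    # Steps 3-4: apply each rule; keep the first dictionary-valid result.
    for suffix, repl in RULES:
        if shortest.endswith(suffix):
            stripped = shortest[: len(shortest) - len(suffix)] + repl
            if stripped in DICTIONARY:
                return stripped
    return shortest if shortest in DICTIONARY else word

print(lemmatise("sickly"))  # -> sick (rule "ly" => "" yields a valid word)
print(lemmatise("ill"))     # -> ill (already the shortest, already valid)
```

The dictionary check in step 4 mirrors the ਰ05ੀ => ਰ05 example above, where the stripped form is accepted only because it is a valid word in the dictionary; it is also where a safeguard against replacing named entities would have to be added.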