<<

216

A Rule based approach for lemmatisation of Punjabi text Documents

Dr. Rajeev Puri

Assistant Professor, Dept of Comp. Sc, DAV College, Jalandhar. [email protected]

Keywords: Punjabi lemmatiser, , information retrieval

ABSTRACT related meaning for the given . The Information retrieval is a trivial task lemmatisation provides accurate results specially when the information is then a stemmer. However Punjabi being available in highly inflectional languages an inflectional language makes the task such as Hindi and Punjabi. The huge of lemmatisation more complex. Behind corpus needs to be reduced significantly each Punjabi world there are many so as to speed up the information inflectional forms as well as synonyms retrieval process as well as for generating which need to be dealt with. The Punjabi the accurate results. The pre-processing lemmatiser presented in this text is an phase of the information retrieval process attempt to obtain the lemma of the word deals with reducing the size of corpus. by using the rule-based approach. Many pre-processing steps such as stop 1. Introduction word removal, stemming and Lemmatization is process of grouping lemmatisation need to be performed for together the different inflected forms of a improving the accuracy of information word so that they can be analyzed as a retrieval task. Where the stemming single item. It is used for finding out the process converts the inflectional forms of normalized form of the word or the word by stripping the prefix or postfix, on dictionary form of the word which is called the other hand the lemmatisation is the lemma. is a group of all the process of generating normalised forms (called headwords of a dictionary) dictionary word that has the same or that have the same meaning, and lemma

Research Cell : An International Journal of Engineering Sciences, Special Issue March 2018, Vol. 27 UGC Approved Journal (S.No.63019) ISSN: 2229-6913 (Print), ISSN: 2320-0332 (Online) -, Web Presence: http://www.ijoes.vidyapublications.com © 2018 Vidya Publications. Authors are responsible for any plagiarism issues.

217 is a word that is chosen as a base form ਿਠਹਰ (Stop) ਠਿਹਰਾ, ਠਿਹਰ, ਠਿਹਿਰਆ, that represents lexeme. Because of the ਠਿਹਰਾਅ inflectional nature of Punjabi, it makes the task of Lemmatization more complex than ਰਗ (Color) ਰਗਣਾ, ਰਗਣ, ਰਗੀਨ, ਰਗੀਨੀ, Stemming. Both Stemming and ਰਗਾਉਣਾ, ਰਗਵਾਉਣਾ, ਰਗਾਈ, Lemmatization are the important ਰਗਵਾਈ, ਰਗਲਾ, ਰਗੀਲਾ preprocessing steps in Information

Retrieval task. It is also used in many 2. Literature Survey applications of . It plays an Taghva et al. [6] have developed important role in Natural Language Stemmer for Farsi language. Their Processing and many other system uses Deterministic Finite fields. It helps the search engines in automata for implementation of stemmer. finding out the keywords and helps in The minimum length of the stem is limited reducing the size of index files. For to three . Sarkar et al. [10] building a Lemmatizer one requires proposed a Rule Based Stemmer for proper understanding of the language Bengali language which is also a and dictionary of that particular language. resource poor language. The POS Indian languages are highly inflectional in specific rules are applied to find out the nature. Punjabi is also highly inflectional stem of the word. Kraai et al. [4] have language. Various inflectional forms of developed Suffix Stripper for Dutch by Punjabi words are shown in table 1 using Porter’s Algorithm. They extend the below. original Porter’s Algorithm to handle Table 1: Sample Inflectional forms of prefixes and infixes. Lovins. [1] proposed Punjabi Text a Stemming Algorithm for English. The Punjabi Word Inflectional forms Algorithm has been developed to be used ਮuਡਾ (Boy) ਮuਡ, ਮuਿਡਆ, ਮuਿਡਆਂ , ਮuਿਡਓ in Library Information Transfer Systems. ਕਰ (Do) ਕਰਿਦਆਂ, ਕਰਦੀਆਂ, ਕਰ, ਕਰ, Bijal et al. [18] discussed different stemming Algorithms for non-Indian and ਕਰਦਾ, ਕਰ, ਕਰਦ Indian Languages. They presented a

Research Cell : An International Journal of Engineering Sciences, Special Issue March 2018, Vol. 27 UGC Approved Journal (S.No.63019) ISSN: 2229-6913 (Print), ISSN: 2320-0332 (Online) -, Web Presence: http://www.ijoes.vidyapublications.com © 2018 Vidya Publications. Authors are responsible for any plagiarism issues.

218 comparative study of various Stemming on non-statistical approach. Mishra et al. Algorithms. Ganguly et al. [16] have [14] have developed MAULIK: an developed two rule based stemmers for effective Stemmer for Hindi based on Bengali and Hindi. Gupta et al. [11] have Hybrid Approach. The Hybrid Approach is developed Punjabi Language stemmer for combination of Brute Force and Suffix Nouns and Proper Names. First they Removal Approach. This Stemmer is generate the stem of the Punjabi word based on Devanagari Script. and after that stem is checked in Punjabi 3. Research Gap Noun and Proper Name Dictionary. Hafer Punjabi is an indo-aryan Language. It is et al. [2] have developed word spoken by more than 100 million native segmentation by letter successor people worldwide. It is the 10th most varieties method. They used Statistical widely spoken Language in the world. properties of the corpus to decide The popularity and online presence of whether word should be divided or not. Punjabi language is increasing day by The approach is based upon Z.S Harris day as more and more Punjabi corpus Process. Majumder et al. [9] have are generated and published day by day developed a Stemmer YASS (yet another over internet. So a system is required for suffix stripper) which is based on efficient searching of Punjabi documents clustering based approach. Cluster is a over the web. The searching should be in class of root word and their related word native’s own language. For faster and forms. Puri et al. [19] have developed a efficient information retrieval system in Punjabi Stemmer using Punjabi WordNet Punjabi, some pre-processing is required Database. In this paper Suffix Removal before indexing the documents. This pre- Approach is used with suffix stripping processing includes removal, rules for finding out the stem for Punjabi stemming and a sophisticated Language. EI-Shishtawy et al. [13] have Lemmatizer for reducing the words to developed an accurate Arabic Root- their base forms called headwords. Based Lemmatizer for Information Various Lemmatization and Stemming Retrieval Purpose. This is the first systems are available for different Lemmatizer for Arabic Language based languages, but very less work has been Research Cell : An International Journal of Engineering Sciences, Special Issue March 2018, Vol. 27 UGC Approved Journal (S.No.63019) ISSN: 2229-6913 (Print), ISSN: 2320-0332 (Online) -, Web Presence: http://www.ijoes.vidyapublications.com © 2018 Vidya Publications. Authors are responsible for any plagiarism issues.

219 done for Punjabi because of its will be applied on the word ਰਗੀ (12) and it inflectional nature. For every word in becomes ਰਗ (9), and that is the Punjabi, there are number of inflectional normalized word, also a valid word forms. So, attempt here has been made available in the dictionary and displayed to develop a Lemmatizer for Punjabi as an output on the output screen. The which will generate accurate lemma for proposed system uses a number of most of the Punjabi words. databases and lists for accomplishing the 4. Proposed System task. These databases and lists are A proposed system uses rule-based mentioned in the coming sections. approach to find out the normalized form 5. Synonyms Database of the word that helps in improving the Synonyms are the words that have the results of Information Retrieval task. The same meaning. A synonym database is system works with the Synonym prepared with augmented field containing replacement module that uses pre- the length of the word for quick defined rules to get the lemma of the comparison. The database is indexed on word. the word and sorted according to the For e.g:- Word:- ਿਬਮਾਰ length of each word in ascending order Synonym Words:- , ਰਗੀ, ਮਰੀਜ, ਮਰੀਜ਼, for quick retreival. For example, the word

ਰਗਗਸਤ “ਮੂਰਖ” (Fool) is having 26 distinct The word ਿਬਮਾਰ has length 15, is the input synonyms as "ਜਾਹਲ", "ਉਜੱਡ", "ਜੜ", "ਗਵਾਰ", word for the system. The list of synonym is collected from the database and then "ਭੁਚੱ ", "ਘੋਗਾ", "ਬੇਸ਼ਮਝ", "ਮੂੜ", "ਅਬੁੱਧ", "ਨਾਸਮਝ", system picks the shortest length word "ਘੁੱਗੂ", "ਘੁੱਘੂ", "ਝੁਡੱ ੂ", "ਬੇਵਕੂਫ", "ਅਿਗਆਨੀ", from the list that is ਰਗੀ (12) with length 12. After that the rules from the rule list "ਮੱਤਹੀਣ", "ਲੱਲਾ", "ਅੰਞਾਣਾ", "ਜੜ_ਮੱਤ", "ਮੂੜਮ ੱਤ", are applied on the synonym word to get "ਬੁੱਧੀਹੀਣ", "ਿਨਰਬੁੱਧੀ", "ਲਘੂ_ਬੁੱਧੀ", "ਅਲਪ_ਬੁੱਧੀ", the accurate lemma word as an output. "ਘੱਟ_ਬੁੱਧੀ". Here, the ‘’ => “” rule of suffix stripping Research Cell : An International Journal of Engineering Sciences, Special Issue March 2018, Vol. 27 UGC Approved Journal (S.No.63019) ISSN: 2229-6913 (Print), ISSN: 2320-0332 (Online) -, Web Presence: http://www.ijoes.vidyapublications.com © 2018 Vidya Publications. Authors are responsible for any plagiarism issues.

220

Out of all these distinct synonyms, the ਕਾਰੀ ਕਾਰਜਕਾਰੀ , ਿਹਤਕਾਰੀ , ਲਾਭਕਾਰੀ synonym “ਜਾਹਲ” has the minimum length ਣਗ ਚਲਣਗ , ਰuਕਣਗ , ਖਣਗ of 12 characters and hence is chosen as ਆਂ ਹਜਆਂ , ਆਗਆਂ a replacement of all other peer words. The synonym replacement module has ਗਾ ਆਵਗਾ , ਜਾਵਗਾ, ਕਰਗਾ , ਭਰਗਾ some inherent problems of incorrectly replacing a named entity with it’s The list of words having the common synonym. For example, the name “ਕਾਸ਼ suffixes is prepared and frequency of valid suffixes is calculated. The suffix with ਿਸਘ ਬਾਦਲ” can be incorrectly transformed maximum occurrences is stripped first, to “ਚਾਨਣ ਸ਼ਰ ਮਘ” . To avoid this type of followed by suffix with largest length. The problems, a named entity recognition is suffix stripping also needs to exclude required. named entities from stripping. 6. Suffix Rules list 7. Named Entities database Suffix are the inflectional endings of the Named entities sometimes create words in any language. The need of the problems in regular lemmatization Suffix Rule List arises in order to remove process. The rules defined for the inflectional endings of the word. Rules lemmatization may sometimes give a can either remove or add a character at false positive and lemmatize a named the end of the word in order to get a valid entity. As mentioned in the proposed head word. It can either be the inflection system section, the word ‘ਜਸ਼ੀ’ becomes of input word or the synonym word. Rules ‘ਜਸ਼’ after applying the ‘’ => “” rule and are applied on input words by using the regular expressions. Some of the suffixes ‘ਸ਼ਰਮਾ’ becomes ‘ਸ਼ਰਮ’ after appling the used in the system are shown in Table 2 ‘’=> “” rule, where both ‘ਜਸ਼’ and ‘ਸ਼ਰਮ’ below – are valid dictionary words, but with an Table 2: Sample Suffix list for stripping altogether different meaning. So for Suffix Words avoiding these kind of problems with the ਦuਕਾਨ , ਮਕਾਨ , ਸ਼ਮਸ਼ਾਨ named entities, it is required to ignore the

Research Cell : An International Journal of Engineering Sciences, Special Issue March 2018, Vol. 27 UGC Approved Journal (S.No.63019) ISSN: 2229-6913 (Print), ISSN: 2320-0332 (Online) -, Web Presence: http://www.ijoes.vidyapublications.com © 2018 Vidya Publications. Authors are responsible for any plagiarism issues.

221 named entities so that only the words that Recall and F-Score is calculated based need lemmatization should be on it. The result from 10 random articles lemmatized. is shown in Table 3. 8. Dictionary Database Dictionary database is the collection of all Table 3: Result from 10 random articles the valid words. The need of the from a list of 50 articles used for testing. dictionary arises due to check on the Sr. Words TP FP FN P R F output of the System. If the output word is 1 332 76 0 6 81 93 87 2 380 117 0 7 76 94 84 available in the dictionary then the output 3 498 124 0 0 80 100 89 is considered as a valid lemma. The 4 366 120 7 10 75 92 83 bigger is the size of the dictionary, the 5 347 117 6 10 75 92 83 better it will be in terms of accuracy of the 6 493 138 4 0 78 100 88 7 357 114 7 3 76 97 85 system. 8 385 123 0 0 76 100 86

9. Algorithm 9 421 101 0 0 81 100 89 The lemmatization algorithm designed is 10 400 104 8 12 79 90 84 mentioned as below - For each word Wi, Based on the precision and recall values, if (Wi) found in {NAMED ENTITIES} THEN Ignore Wi and Continue the average precision, average recall and If (Wi) can be stemmed Stem Wi average f-score is calculated. The results Check in {Dictionary} if (found) of averages is shown in Table 4. replace Wi with dictionary word. else proceed to next step. Table 4: Average Precision, Recall and F- Endif If(Wi) found in {Synonyms Database} Score. replace Wi with synonym of min length. Average Precision 78 Endif Average Recall 95 Output Wi as Lemmatised word EndFor Average F-Score 86

10. Results and Discussions

The system is tested on 50 articles from a The graphical representation of results of popular newspaper. The performance of the system is shown in Fig. 1. The graph the system is measured using Precision- shows the precision, recall and F- Research Cell : An International Journal of Engineering Sciences, Special Issue March 2018, Vol. 27 UGC Approved Journal (S.No.63019) ISSN: 2229-6913 (Print), ISSN: 2320-0332 (Online) -, Web Presence: http://www.ijoes.vidyapublications.com © 2018 Vidya Publications. Authors are responsible for any plagiarism issues.

222 measure of 50 articles lemmatised using 11. Conclusion the algorithm above. In this paper, the design of rule based Punjabi Lemmatiser is discussed. The Fig. 1 Results from Lemmatisation rule lists have been generated manually algorithm. for the development of the system. Firstly, the system finds out the synonym of the word if any, and after that applies the rules for generating the lemma of the word. Various databases of Punjabi words have been used to improve the performance of the system. The system can be beneficial for other applications of Natural Language Processing. The performance of the system can be Fig 2 shows the graphical results of improved by adding more words into the average precision, recall and f-score database. In future the system can be values. improved by handling the Context and POS (Part of Speech) of the words. Fig 2. Average Precision, Recall and F- Score. References 1. Lovins, J. B. (1968). Development of a stemming algorithm (p. 65). Cambridge: MIT Information Processing Group, Electronic Systems Laboratory. 2. Hafer, M. A., & Weiss, S. F. (1974). Word segmentation by letter successor varieties. Information storage and retrieval, 10(11), 371-385.

3. Porter, M. F. (1980). An algorithm for

suffix stripping. Program, 14(3), 130-137. Research Cell : An International Journal of Engineering Sciences, Special Issue March 2018, Vol. 27 UGC Approved Journal (S.No.63019) ISSN: 2229-6913 (Print), ISSN: 2320-0332 (Online) -, Web Presence: http://www.ijoes.vidyapublications.com © 2018 Vidya Publications. Authors are responsible for any plagiarism issues.

223

4. Kraaij, W., & Pohlmann, R. (1995). 11. Gupta, V., & Lehal, G. S. (2011, Evaluation of a Dutch stemming November). Punjabi Language Stemmer algorithm. The New Review of Document for nouns and proper names. In and Text Management, 1, 25-43. proceedings of the 2nd Workshop on 5. Ramanathan, A., & Rao, D. D. (2003, South and Southeast Asian Natural April). A lightweight stemmer for Hindi. In Language Processing (WSSANLP) the Proceedings of EACL. IJCNLP (pp. 35-39). 6. Taghva, K., Beckley, R., & Sadeh, M. 12. Sharma, D. (2012). Stemming (2005, April). A Stemming Algorithm for algorithms: a comparative study and their the Farsi Language. In ITCC (1) (pp. 158- analysis. International Journal of Applied 162). Information Systems, 4(3), 7-12. 7. Al-Shalabi, R., Kannan, G., Hilat, I., 13. El-Shishtawy, T., & El-Ghannam, F. Ababneh, A., & Al-Zubi, A. (2005). (2012). An accurate arabic root- based Experiments with the successor variety lemmatizer for information retrieval algorithm using the cutoff and entropy purposes. arXiv preprint arXiv:1203.3584. methods. Information Technology 14. Upendra, M., & Chandra, P. (2012). Journal, 4(1), 55-62. MAULIK: An Effective Stemmer for Hindi 8. Willett, P. (2006). The Porter stemming Lanuage. International Journal of algorithm: then and now. Program,40(3), Computer Science and Engineering, 219-223. IJCSE, 4(5), 711-717. 9. Majumder, P., Mitra, M., Parui, S. K., 15. Saharia, N., Sharma, U., & Kalita, J. Kole, G., Mitra, P., & Datta, K. (2007). (2012, August). Analysis and evaluation YASS: Yet another suffix stripper. ACM of stemming algorithms: a case study with transactions on information systems Assamese. InProceedings of the (TOIS), 25(4), 18. International Conference on Advances in 10. Sarkar, S., & Bandyopadhyay, S. Computing, Communications and (2008, January). Design of a Rule- based Informatics (pp. 842-846). ACM. Stemmer for Natural Language Text in 16. Ganguly, D., Leveling, J., & Jones, G. Bengali. In IJCNLP (pp. 65-72). J. (2013, December). DCU@ Morpheme Extraction Task of FIRE-2012: Rule- Research Cell : An International Journal of Engineering Sciences, Special Issue March 2018, Vol. 27 UGC Approved Journal (S.No.63019) ISSN: 2229-6913 (Print), ISSN: 2320-0332 (Online) -, Web Presence: http://www.ijoes.vidyapublications.com © 2018 Vidya Publications. Authors are responsible for any plagiarism issues.

224 based Stemmers for Bengali and Hindi. In Post-Proceedings of the 4th and 5th Workshops of the Forum for Information Retrieval Evaluation (p. 12). ACM. 17. Paul, S., Joshi, N., & Mathur, I. (2013). Development of a hindi lemmatizer.arXiv, preprint arXiv:1305.6211. 18. Bijal, D., & Sanket, S. (2014). Overview of Stemming Algorithms for Indian and Non-Indian Languages. arXiv preprint arXiv:1404.2878. 19. Puri, R., Bedi, R. P. S., & Goyal, V. (2015). Punjabi stemmer using punjabi database. Indian Journal of Science and Technology, 8(27). 20. Aarti, D. V. S. Punjabi Language Characteristics and Role of Thesaurus in Natural Language processing.

Research Cell : An International Journal of Engineering Sciences, Special Issue March 2018, Vol. 27 UGC Approved Journal (S.No.63019) ISSN: 2229-6913 (Print), ISSN: 2320-0332 (Online) -, Web Presence: http://www.ijoes.vidyapublications.com © 2018 Vidya Publications. Authors are responsible for any plagiarism issues.