Analyzing the Punjabi Language Stemmers: A Critical Approach


Harjit Singh
APS Neighbourhood Campus, Punjabi University, Patiala, India

International Semantic Intelligence Conference, February 25–27, 2021, New Delhi, India
EMAIL: [email protected] (H. Singh)
ORCID: 0000-0002-3215-4798 (H. Singh)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

Abstract

Stemming is a procedure used to reduce inflected words by removing affixes from them. It is widely used in Natural Language Processing systems developed for various languages such as English and Hindi. Punjabi is a low-resource language spoken in the northern regions of India and Pakistan. In Pakistan, the Shahmukhi script is used to write Punjabi, while in India the Gurmukhi script is used. Various approaches are used for stemming natural language words, such as the Brute Force approach, the Rule-based approach and the Statistical approach. For stemming Gurmukhi-scripted Punjabi, either the Brute Force approach, the Rule-based approach, or a combination of the two has been used till now. This paper presents a comparative analysis of the stemmers developed so far for Gurmukhi-scripted Punjabi words, based on their methodology and accuracy. It will motivate further research into developing more efficient stemmers for Punjabi, and it will help researchers compare and understand the approaches used and the problems faced in adopting each approach.

Keywords

Stemming, Natural Language Processing, Information Retrieval, Suffix Stripping, Punjabi Language Processing

1. Introduction

A procedure adopted to reduce inflected words by removing any affixes attached to them is called stemming. Stemming is a very common and crucial task in Natural Language Processing (NLP) applications. Many morphologically similar words can be stemmed to the same stem word after removing suffixes or prefixes from them. This improves the efficiency of further processes adopted for information retrieval, and it improves the effectiveness of information retrieval by helping to identify redundant information, remove irrelevant information from retrieval results, and optimize retrieval by placing highly ranked information on top [1].

Various approaches are used for stemming natural language words, such as the Brute Force approach, the Rule-based approach and the Statistical approach. In the brute force approach, a table of inflected words along with their stem words is maintained. To stem a word, that word is searched in the table for a match. If a match occurs with some word, the related stem word is picked from the table. The approach is easy to implement, and its accuracy depends on the table of words available in the database. In the rule-based approach, the affixes are removed using fixed rules. The input words are checked for a group of characters at the end or beginning of each word; that set of characters (the affix) is removed to get the stem word. In some words, an alternative set of characters is then attached to the word to complete it. In the statistical approach, a distance measure is calculated between pairs of words. If the words are morphologically similar, they will have a low distance measure; for morphologically unrelated words, the distance measure will be high. Using this measure, the words of a document are grouped into clusters, so that morphologically similar words end up in the same cluster. Then the central word of each cluster is taken as the stem word [2].

In Gurmukhi-scripted Punjabi stemming, either the Brute Force approach, the Rule-based approach, or a combination of the two has been used till now. This paper analyses all those approaches and compares them with each other based on methodology and accuracy. It will motivate further research into developing more efficient stemmers for Punjabi, and it will help researchers compare and understand the approaches used and the problems faced in adopting each approach.

1.1. Punjabi Language

The Punjabi language belongs to the Indo-Aryan group of languages. Its word order is Subject-Object-Verb, and it is spoken by almost 100 million people residing in India, Pakistan and many other countries; collectively they may be called the Punjabi community. Table 1 shows the number of Punjabi speakers in some countries.

Table 1
Punjabi Speakers in Some Countries

Country     Punjabi Speakers    Source
Pakistan    99,774,008          www.cia.gov
India       33,124,726          censusindia.gov.in
Canada      543,495             www12.statcan.gc.ca
UK          273,000             www.ons.gov.uk
US          253,740             www.census.gov
UAE         191,000             ethnologue.com

Punjabi is basically written in two different scripts: Perso-Arabic and Gurmukhi. Perso-Arabic is related to the Persian script and is also called Shahmukhi; Gurmukhi is derived from Landa, an old script of the Indian sub-continent.

2. Literature Review

Jain et al. [3] used an NLP technique to find the root word in Hindi. The Hindi inflected word was stemmed to remove the affix from the end of the word. The stemmer was tested using 800 inflected words. They used prefix, suffix and root word lists to remove the affixes and validate the root word. The stemmer provided a hit ratio of 0.93.

Yadav et al. [4] proposed a method for automatic summarization of text documents. They generated a network of related sentences based on semantic and lexical relations; two sentences are related to each other with a certain strength that represents the number of relationships between the two. Gros et al. [5] performed an analysis to test the effect of anaphora resolution on the performance of information retrieval. They claimed a positive effect on the results; in this way, the quality of information retrieval results can be significantly improved.

Agrawal et al. [6] proposed a translation tool to translate Sanskrit text to Hindi. They used stemming to preprocess the Sanskrit words before converting them to Hindi; it was a major phase in the development of the translation tool. Any affixes on input Sanskrit words were removed before using them to generate the equivalent Hindi text. A Sanskrit corpus was used to stem the words by searching for them in the corpus and validating them.

Bhadwal et al. [7] developed a Hindi-Sanskrit bilingual translation system. The system was able to translate Hindi text to Sanskrit and vice versa using a number of modules. In the Sanskrit-to-Hindi translation flow, they used stemming as a major step before syntax analysis. The system was developed in Java. Bhadwal et al. [8] proposed a Hindi-to-Sanskrit machine translation system. The system includes a number of NLP modules such as transliteration, POS tagging and root verb extraction. In root verb extraction, the algorithm finds the root verb before mapping it to the equivalent Sanskrit word.

Although stemming has positive effects in various NLP applications, Wahbeh et al. [9] claimed negative effects on text classification in Arabic. They used stemming as a preprocessing module in Arabic text classification and found that the results were negatively affected when stemming was used.

The review of literature shows that stemming is used for a variety of NLP applications. This motivated the present analysis of the stemmers developed for Punjabi; only a handful of research papers were found for this low-resource Asian language. Developing a stemmer does not require deep knowledge of Punjabi and the Gurmukhi script, because the function of a stemmer is just to remove any affixes attached to a word; therefore, only a brief introduction to Punjabi and the Gurmukhi script is enough. This paper discusses the stemmers developed for Gurmukhi-scripted Punjabi and the methodologies used by researchers to stem inflected words.
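The statistical approach described in the introduction can be illustrated with a short sketch. The paper does not prescribe a particular distance measure or clustering method, so this example assumes Levenshtein edit distance and a simple greedy clustering, with English placeholder words standing in for Punjabi ones:

```python
def levenshtein(a, b):
    # Dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cluster_by_distance(words, threshold=3):
    # Greedy clustering: a word joins the first cluster whose
    # representative (first member) is within the distance threshold.
    clusters = []
    for w in words:
        for cluster in clusters:
            if levenshtein(w, cluster[0]) <= threshold:
                cluster.append(w)
                break
        else:
            clusters.append([w])
    return clusters

def central_word(cluster):
    # The central word minimises the summed distance to the others;
    # it is taken as the cluster's stem.
    return min(cluster, key=lambda w: sum(levenshtein(w, o) for o in cluster))

words = ["playing", "played", "player", "running", "runner"]
for cluster in cluster_by_distance(words):
    print(central_word(cluster), "<-", cluster)
```

A real Punjabi stemmer would operate on Gurmukhi strings, and the threshold and distance measure would need tuning for the language.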
In this paper, Gurmukhi-scripted Punjabi will henceforth be called Punjabi. The significance of this analysis is that it discloses the gap in NLP research for Punjabi. It will motivate further research into developing more efficient stemmers for Punjabi, using new ideas or by extending existing approaches.

3. Stemming Linguistics in Punjabi

In linguistics, an affix is either a suffix or a prefix. In Punjabi, both types of affixes are used to form proper words for sentences, but in most cases a prefix changes the meaning of the word. So, in NLP applications, only suffixes are considered for removal. Table 2 shows some inflected words with their stem words. The table shows how Punjabi words are handled linguistically: some characters are removed and, if required, an alternative set of characters is substituted to complete the word.

Table 2
Some Punjabi Inflected and Root Words

Inflected Word    Stem Word    Suffix Removal and Substitution
ਵਿਵਿਆਰਥੀਆਂ        ਵਿਵਿਆਰਥੀ      ਆਂ
ਬੱਸ拓              ਬੱਸ           ਾ ਾਂ
ਕੱਤੇ               ਕੱਤ           ਾੇ ਾ
ਹਿ ਿ拓             ਹਿ ਿ拓
ਘਰ⸂               ਘਰ           ਾ ਾਂ
ਮ ੜ               ਮ ੜ           ਾ

4.1. Earlier Stemmer for Punjabi

Kumar et al. [10] proposed a Punjabi stemmer based on a combination of the brute force approach and the suffix stripping technique. The researchers divided the system into three units: an Input unit, a Processing unit and an Output unit. The Input unit takes a Punjabi word, and that word is sent to the Processing unit, which searches for the word in a database of Punjabi inflected words and their stem words. If the word is found in the database, the corresponding stem word is taken as the output stem word. If no match is found in the database, the suffix removal technique is used to stem the word: the word ending is searched in a list of possible Punjabi suffixes, and if the ending matches any suffix in the list, that ending (suffix) is removed to get the stem word. The steps are shown in Algorithm 1 and graphically represented in Figure 1.
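The lookup-then-strip flow described above can be sketched as follows. This is a minimal illustration, not the authors' Algorithm 1 (which is not reproduced in this excerpt): the table and suffix list are toy English stand-ins for the database of Punjabi inflected/stem word pairs and the Gurmukhi suffix list.

```python
def stem(word, stem_table, suffixes):
    """Brute-force lookup first; fall back to suffix stripping."""
    # Step 1: search the inflected-word -> stem table (brute force).
    if word in stem_table:
        return stem_table[word]
    # Step 2: strip the longest matching suffix from the suffix list.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[:-len(suffix)]
    return word  # no table hit, no matching suffix: word is its own stem

# Toy stand-ins (hypothetical); the real system stores Gurmukhi entries.
stem_table = {"went": "go"}
suffixes = ["ing", "ed", "s"]

print(stem("went", stem_table, suffixes))     # table hit -> "go"
print(stem("playing", stem_table, suffixes))  # suffix stripped -> "play"
print(stem("cup", stem_table, suffixes))      # unchanged -> "cup"
```

Trying the longest suffixes first mirrors the usual suffix-stripping convention, so that e.g. "ing" is removed before the bare "s" rule can fire.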