Implemented Stemming Algorithms for Six Ethiopian Languages
International Journal of Advanced Science and Technology Vol. 29, No. 7, (2020), pp. 2532-2536

Wubetu Barud Demilie
Department of Information Technology, Wachemo University, Hossana, Ethiopia, P.O. Box 667

Abstract
Text stemming is an exploratory process of removing suffixes, infixes and sometimes prefixes from words to arrive at the base word. It is one of the pipeline stages of most Natural Language Processing Applications (NLPAs) and is commonly used in natural language processing (NLP) and text mining. The main problem in developing a stemmer is identifying and removing every kind of affix, since the six languages I have selected for analysis differ in their characteristics, sentence structures and grammatical rules. This paper analyzes the approaches that different researchers have implemented for the selected languages. I discuss the types of stemming approaches, give an overview of the available and most widely used stemmers for the selected languages, briefly compare these stemmers and their evaluation results, and analyze the available stemmers experimentally. Based on this analysis and experiment, I draw conclusions and recommend a stemming approach for the languages.

Keywords: Affix, Analysis, Approach, Ethiopian Language, Natural Language Processing Application, Stemming Approaches.

I. INTRODUCTION
In information retrieval, stems support natural language processing. For example, an information retrieval system can use stems to index documents by topic and to expand a query to obtain more accurate and precise results. All information retrieval applications aim to improve recall and precision.
Stemming is a recall-enhancing method that can help even the simplest Boolean retrieval systems. It is a preprocessing step in text mining applications and a very common requirement of natural language processing functions. In fact, it is important in most information retrieval applications. Different stemming approaches, which differ in performance and accuracy, have been implemented for the languages covered in this study. All stemming algorithms are language dependent, and the morphological forms of the six selected Ethiopian languages differ greatly, including in their structure. “In English language, a word might change into inflectional or derivational forms”. For example, ‘walking’, ‘walked’ and ‘walks’ are inflectional forms that can be mapped to the root word ‘walk’. On the other hand, the words ‘doing’ and ‘does’ have the root ‘do’; however, the past participle of ‘do’ is the irregular form ‘done’.

The selected languages are agglutinative, with rich morphological structures. Their words are formed by adding suffixes, infixes and prefixes to a root word. Some words are formed by attaching a combination of two or three affixes to a root. Any stemmer considered in this study has to cater for these rich structures and language-specific composition rules. Hence, this paper analyzes the implemented stemming approaches in order to identify the best approach to recommend for the selected, morphologically rich languages. In this paper, I discuss different stemming approaches for the Amharic, Afan Oromo, Tigrinya, Wolaita, Kambaata and Awngi languages.

ISSN: 2005-4238 IJAST, Copyright ⓒ 2020 SERSC

Generally, this paper describes the different types of stemming approaches, which behave differently on corpora of different sizes, and gives a comparative analysis of the approaches on the basis of stem production, efficiency and effectiveness in information retrieval systems.

II. STEMMING APPROACHES
According to [1], there are two stemming approaches. The first is context-free stemming, whose main objective is to identify affixes and remove them. The second is lemmatization. Lemmatization requires the developer to have good knowledge of the language and its grammatical rules. It also requires a dictionary lookup; therefore, it is more complex than stemming. However, lemmatization is expected to give more accurate and precise results. For example, the word ‘better’ has the lemma ‘good’. Such words cannot be handled by basic stemming approaches unless they use a dictionary lookup table. Different types of stemming approaches are available for different languages, and they differ in performance and accuracy. Four stemming approaches are discussed in this paper: rule based, successor variety, hybrid and longest match.

II.I. Rule Based Approach
The rule based approach has been implemented by different researchers and is composed of two parts: a rule-based light stemmer and a pattern-based infix remover. The rule-based light stemmer removes prefixes and suffixes from the word according to specific rules. The pattern-based infix remover removes infixes from the word according to specific patterns. This combination is referred to here as the rule based approach [3][4][5][6][7][8].

II.II. Successor Variety Approach
According to [9], successor variety is one of the stemming approaches used in natural language processing applications, especially in information retrieval systems.
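As a minimal sketch of the counting step behind this approach, the Python snippet below counts, for each prefix of a word, the distinct characters that follow that prefix in a small toy word list. The function names and the toy word list are assumptions made for illustration and are not taken from any of the stemmers surveyed here; the sketch also ignores end-of-word markers, which some formulations count as a successor.

```python
def successor_variety(prefix, corpus):
    """Count the distinct characters that follow `prefix` in the corpus words."""
    followers = {
        word[len(prefix)]
        for word in corpus
        if word.startswith(prefix) and len(word) > len(prefix)
    }
    return len(followers)


def successor_varieties(word, corpus):
    """Return (prefix, successor variety) pairs for every prefix of `word`."""
    return [
        (word[:i], successor_variety(word[:i], corpus))
        for i in range(1, len(word) + 1)
    ]


# Toy word list (illustrative only).
corpus = ["back", "beach", "body", "backward", "boy"]
print(successor_varieties("battle", corpus))
# [('b', 3), ('ba', 1), ('bat', 0), ('batt', 0), ('battl', 0), ('battle', 0)]
```

On such a tiny word list the counts only fall; a usable stemmer would need a large corpus in which the counts rise again at morpheme boundaries.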
In this approach, the successor variety of a string is the number of different characters that follow the string in the words of a corpus (the body of text). A successor variety stemmer does not require the preparation of suffix lists and removal rules, and hence can be adapted to a changing text collection. According to [10], the successor variety approach uses the frequencies of letter sequences in a body of text as the basis of stemming.

Consider, for example, a body of text consisting of the words back, beach, body, backward and boy. To determine the successor varieties for ‘battle’, the following process is used. The first letter of ‘battle’ is ‘b’. In the text body, ‘b’ is followed by three distinct characters: ‘a’, ‘e’ and ‘o’. Thus, the successor variety of ‘b’ is three. The next successor variety for ‘battle’ is one, since only ‘c’ follows ‘ba’ in the text. When this process is carried out on a large body of text, the successor variety of the substrings of a term decreases as more characters are added until a segment boundary is reached. At this point, the successor variety sharply increases. This information is used to identify stems.

II.III. A hybrid Approach
According to [11] and [12], hybrid approaches use two or more of the approaches in combination. A simple example is a suffix tree approach that first consults a lookup table using a brute force approach. However, instead of trying to store the entire set of relations between words in a given language, the lookup table is kept small and is used only to store a small number of ‘frequent exceptions’ such as ‘ran => run’.

II.IV. Longest Match Approach
According to [13], the longest match approach removes the longest possible suffix.
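A minimal sketch of this idea in Python is given here, assuming a small hypothetical English suffix list and a hypothetical minimum stem length; neither is taken from the stemmers surveyed in this paper, which rely on language-specific affix lists.

```python
# Hypothetical suffix list, for illustration only.
SUFFIXES = ["ness", "ful", "fulness"]


def strip_longest_suffix(word, suffixes=SUFFIXES, min_stem_len=2):
    """Remove the longest listed suffix that `word` ends with, if any."""
    # Try longer suffixes first so that the longest match wins.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no listed suffix applies


print(strip_longest_suffix("fruitfulness"))  # fruit
print(strip_longest_suffix("kindness"))      # kind
```

Sorting by length is the simplest way to guarantee the longest match; it does not address the cost of enumerating every affix combination for a morphologically rich language.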
For example, if the word ‘fruitfulness’ is considered, the suffixes in the word are ‘ness’, ‘ful’ and ‘fulness’. Therefore, the approach removes ‘fulness’ from the word. Compared to other methods, the drawbacks of the longest match approach are that it requires generating all possible combinations of affixes, needs more processing and storage space, and must handle changes in the form of an affix during concatenation.

III. ANALYSIS OF STEMMING APPROACHES FOR THE SELECTED LANGUAGES
I discuss the four most popular stemming approaches that researchers have used for the selected Ethiopian languages and analyze each of them. Each stemmer was developed for a specific language; nevertheless, some observations can be made. The following table summarizes the analysis of the approaches for the selected Ethiopian languages.

Table 1: Summary of implemented algorithms for six Ethiopian languages

No.  Language    Primary Researcher                    Conflation Technique                                  Sensitive in Context?  Error Rate  Accuracy
1.   Amharic     Nega Alemayehu                        Rule Based (Iterative)                                Yes                    4.01%       95.90%
2.   Amharic     Atelach Alemu and Lars Asker          Affix removal & Dictionary Based                      No                     25%         75%
3.   Amharic     Genet Mezemir                         Successor Variety Approach (peak and plateau method)  Yes                    28.2%       71.8%
4.   Afan Oromo  Mekonnen Wakshum                      Rule Based (Longest Match)                            Yes                    7.48%       92.52%
5.   Afan Oromo  Debela Tesfaye                        Rule Based (Iterative)                                Yes                    4.27%       94.84%
6.   Kambaata    Jonathan Samuel                       Rule Based (Longest Match)                            Yes                    3.13%       96.87%
7.   Ge’ez       Abebe Belay and Yibeltal Chanie       Rule Based                                            Yes                    17.58%      82.42%
8.   Wolaita     Lemma Lessa                           Longest Match                                         Yes                    4.01%       95.9%
9.   Afan Oromo  Debela Tesfaye                        A hybrid Approach                                     Yes                    4.27%       95.73%
10.  Kambaata    Jonathan Samuel & Solomon Teferra     Rule Based                                            Yes                    4.01%       95.9%
11.  Silt’e      Muzeyn Kedir                          Longest Match                                         Yes                    14.28%      85.71%
12.  Tigrinya    Omer Osman and Yoshiki Mikami         Hybrid Approach                                       Yes                    10.7%       89.3%
13.  Awngi       Tsegaye Misikir                       Longest Match                                         Yes                    8.59%       91.41%
14.  Wolaita     Girma Yohannis Bade and Hussien Seid  Longest Match                                         Yes                    8.16%       91.84%
15.