International Journal of Advanced Science and Technology Vol. 29, No. 7, (2020), pp. 2532-2536

Implemented Stemming Algorithms for Six Ethiopian Languages

Wubetu Barud Demilie
Department of Information Technology, Wachemo University, Hossana, P.O. Box 667

Abstract: Text stemming is the process of removing suffixes, infixes and sometimes prefixes from words to arrive at the base word. It is one of the pipeline stages of most Natural Language Processing Applications (NLPAs) and is commonly used in natural language processing (NLP) and text mining. The main difficulty in developing a stemmer is identifying and removing every kind of affix, since the six languages selected for analysis have different characteristics, sentence structures and grammatical rules. This paper analyzes the approaches that researchers have implemented for the selected languages. I discuss the types of stemming approaches, give an overview of the available and most widely used stemmers for the selected languages, briefly compare these stemmers and their evaluation results, and analyze the available stemmers experimentally. Based on this analysis and experiment, I draw conclusions and make recommendations on stemmers for these languages.

Keywords: Affix, Analysis, Approach, Ethiopian Language, Natural Language Processing Application, Stemming Approaches.

I. INTRODUCTION

In information retrieval, stems support downstream natural language processing. For example, an information retrieval system can use stems to index documents by topic and to expand a query so that it returns more accurate and precise results. Information retrieval applications aim to improve both recall and precision, and stemming is a recall-enhancing method that is helpful even for the simplest Boolean retrieval systems. Stemming is a preprocessing step in text mining applications and a very common requirement of natural language processing functions; it is important in most information retrieval applications. Different stemming approaches, which differ in performance and accuracy, have been implemented for the languages covered in this study. All stemming algorithms are language dependent, and the morphological forms of the six selected Ethiopian languages differ considerably, including in their structure. A word might change into inflectional or derivational forms. For example, ‘walking’, ‘walked’ and ‘walks’ are inflectional forms that can be mapped to the root word ‘walk’. On the other hand, ‘doing’ and ‘does’ share the root ‘do’, but the past forms of ‘do’ are the irregular ‘did’ and ‘done’, which suffix removal alone cannot map back to ‘do’. The selected languages are agglutinative, with rich morphological structures. Their words are formed by adding suffixes, infixes and prefixes to a root word, and some words are composed by appending a combination of two or three affixes to the root. With such rich structures and unique composition rules, any stemmer used in the study has to cater for them.

Hence, this paper explores the implemented stemming approaches and identifies the best, recommended approach for the selected morphologically rich languages. It discusses different stemming approaches for six Ethiopian languages, including Afan Oromo, Tigrinya, Wolaita, Kambaata and Awngi.

ISSN: 2005-4238 IJAST, Copyright ⓒ 2020 SERSC

Generally, this paper describes the different types of stemming approaches, which behave differently on corpora of different sizes, and compares them on the basis of stem production, efficiency and effectiveness in information retrieval systems.

II. STEMMING APPROACHES

According to [1], there are two stemming approaches. The first is context-free stemming, whose main objective is to identify affixes and remove them. The second is lemmatization. Lemmatization requires the developer to have good knowledge of the language and its grammatical rules, and it also requires a dictionary lookup; it is therefore more complex than stemming, but it is expected to give more accurate and precise results. For example, the word ‘better’ has the lemma ‘good’. Words of this type cannot be handled by basic stemming approaches unless they use a dictionary lookup table. The stemming approaches available for different languages differ in performance and accuracy. Four stemming approaches are discussed in this paper: rule based, successor variety, hybrid and longest match.

II.I. Rule Based Approach

The rule based approach has been implemented by different researchers and is composed of two parts: a rule-based light stemmer and a pattern-based infix remover. The rule-based light stemmer removes prefixes and suffixes from the word according to specific rules. The pattern-based infix remover removes infixes from the word according to specific patterns. This combination is referred to here as the rule based approach [3][4][5][6][7][8].

II.II. Successor Variety Approach

According to [9], successor variety is one of the stemming approaches used in natural language processing applications, especially in information retrieval systems. In this approach, the successor variety of a string is the number of different characters that follow the string in the words of a corpus (the body of text). The successor variety of the substrings of a term decreases as more characters are added, until a segment boundary is reached.
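As a rough illustration of this definition, the following Python sketch counts the distinct characters that follow each prefix of a word. The function name `successor_variety` and the tiny five-word corpus are assumptions made for illustration only; the sketch also ignores end-of-word markers, which some formulations count as a successor, and it is not taken from any of the stemmers cited here.

```python
def successor_variety(prefix, corpus):
    """Count the distinct characters that follow `prefix` in the corpus words."""
    followers = {
        word[len(prefix)]
        for word in corpus
        # Only words that start with the prefix and continue past it contribute.
        if word.startswith(prefix) and len(word) > len(prefix)
    }
    return len(followers)


# Illustrative corpus (an assumption, far smaller than any real text collection).
corpus = ["back", "beach", "body", "backward", "boy"]

word = "battle"
# Successor variety of every prefix of the word, shortest prefix first.
varieties = [
    (word[:i], successor_variety(word[:i], corpus))
    for i in range(1, len(word) + 1)
]
print(varieties)
```

With this corpus the prefix ‘b’ has a successor variety of three and ‘ba’ has a successor variety of one. A real stemmer would compute these counts over a large corpus and look for the point where the count rises sharply to place the segment boundary.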
A successor variety stemmer does not require suffix lists or removal rules to be prepared, and hence can be adapted to changing text collections. According to [10], the successor variety approach uses the frequencies of letter sequences in a body of text as the basis of stemming. In less formal terms, the successor variety of a string is the number of different characters that follow it in words in some body of text. Consider, for example, a body of text consisting of the words back, beach, body, backward and boy. To determine the successor varieties for ‘battle’, the following process would be used. The first letter of ‘battle’ is ‘b’, which is followed in the text body by three characters: ‘a’, ‘e’ and ‘o’. Thus, the successor variety of ‘b’ is three. The next successor variety for ‘battle’ is one, since only ‘c’ follows ‘ba’ in the text. When this process is carried out using a large body of text, the successor variety of the substrings of a term decreases as more characters are added, until a segment boundary is reached. At this point, the successor variety sharply increases. This information is used to identify stems.

II.III. A Hybrid Approach

According to [11] and [12], hybrid approaches use two or more of the approaches in combination. A simple example is a suffix tree approach that first consults a lookup table using a brute force approach. However, instead of trying to store the entire set of relations between words in a given language, the lookup table is kept small and is only used to store a small number of ‘frequent exceptions’ such as ‘ran => run’.

II.IV. Longest Match Approach

According to [13], the longest match approach removes the longest possible suffix. For example, in the word ‘fruitfulness’ the candidate suffixes are ‘ness’, ‘ful’ and ‘fulness’.

Therefore, the approach removes ‘fulness’ from the word. Compared with other methods, the drawbacks of the longest match approach are that all possible combinations of affixes must be generated, that this requires additional processing and storage space, and that affixes may change form during concatenation.

III. ANALYSIS OF STEMMING APPROACHES FOR THE SELECTED LANGUAGES

I have discussed the four most popular stemming approaches that different researchers have used for the selected Ethiopian languages, and I have analyzed each approach that was developed for these languages. Each stemmer was purpose-built for a specific language; nevertheless, some general observations can be made. The following table summarizes the analysis of the approaches for the selected Ethiopian languages.

Table 1: Summary of implemented algorithms for six Ethiopian languages

| No. | Language | Primary Researcher | Conflation Technique | Sensitive in Context? | Error Rate | Accuracy |
|-----|----------|--------------------|----------------------|-----------------------|------------|----------|
| 1 | Amharic | Nega Alemayehu | Rule Based (Iterative) | Yes | 4.01% | 95.90% |
| 2 | Amharic | Atelach Alemu and Lars Asker | Affix removal & Dictionary Based | No | 25% | 75% |
| 3 | Amharic | Genet Mezemir | Successor Variety Approach (peak and plateau method) | Yes | 28.2% | 71.8% |
| 4 | Afan Oromo | Mekonnen Wakshum | Rule Based (Longest Match) | Yes | 7.48% | 92.52% |
| 5 | Afan Oromo | Debela Tesfaye | Rule Based (Iterative) | Yes | 4.27% | 94.84% |
| 6 | Kambaata | Jonathan Samuel | Rule Based (Longest Match) | Yes | 3.13% | 96.87% |
| 7 | Ge’ez | Abebe Belay and Yibeltal Chanie | Rule Based | Yes | 17.58% | 82.42% |
| 8 | Wolaita | Lemma Lessa | Longest Match | Yes | 4.01% | 95.9% |
| 9 | Afan Oromo | Debela Tesfaye | A Hybrid Approach | Yes | 4.27% | 95.73% |
| 10 | Kambaata | Jonathan Samuel & Solomon Teferra | Rule Based | Yes | 4.01% | 95.9% |
| 11 | Silt’e | Muzeyn Kedir | Longest Match | Yes | 14.28% | 85.71% |
| 12 | Tigrinya | Omer Osman and Yoshiki Mikami | Hybrid Approach | Yes | 10.7% | 89.3% |
| 13 | Awngi | Tsegaye Misikir | Longest Match | Yes | 8.59% | 91.41% |
| 14 | Wolaita | Girma Yohannis Bade and Hussien Seid | Longest Match | Yes | 8.16% | 91.84% |
| 15 | Tigrigna | Yonas Fisseha | Longest Match | Yes | 13.89% | 86.1% |
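Several of the stemmers in Table 1 rely on the longest match approach of Section II.IV. The Python sketch below shows the core idea under simplifying assumptions: the function name `longest_match_stem`, the three-entry English suffix list and the `min_stem_len` guard are all hypothetical choices made for illustration, and none of them is taken from the cited stemmers, which use language-specific affix lists and additional rules.

```python
def longest_match_stem(word, suffixes, min_stem_len=3):
    """Strip the longest suffix from `suffixes` that matches the end of `word`.

    `min_stem_len` is an illustrative guard (an assumption) that stops the
    stemmer from reducing a word to an implausibly short stem.
    """
    # Try longer suffixes first so that the longest matching one wins.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: len(word) - len(suffix)]
    # No suffix matched: return the word unchanged.
    return word


# Illustrative suffix list based on the 'fruitfulness' example in Section II.IV.
SUFFIXES = ["ness", "ful", "fulness"]

print(longest_match_stem("fruitfulness", SUFFIXES))
```

Because ‘fulness’ is tried before ‘ness’, the sketch strips the longer suffix and leaves ‘fruit’. The storage cost mentioned in Section II.IV shows up here as the need to list every affix combination, such as ‘fulness’, explicitly.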

IV. CONCLUSION

None of the stemming approaches used by the researchers produces a 100% accurate result. Nevertheless, all the implemented stemming approaches are useful in natural language processing applications, including information retrieval systems. The main difference between the stemming approaches is whether they are rule-based or linguistic. A rule-based approach might not always produce a correct result, and the stems it produces may not always be valid words of the language. Linguistic approaches are based on a lexicon; hence, words that are not included in the lexicon are not stemmed accurately. No stemmer developed so far meets the requirements of all six Ethiopian languages. In general, the researchers have concluded and argued the following main points [3][4][5][6][12][13][15][16][17][18]:

• All the stemmers have to be tested on large amounts of text (corpora) to verify their actual performance.
• Before developing any kind of text stemmer for Ethiopian languages, the structure and grammatical rules of the language should be clearly identified, which requires a linguistic expert.
• Detailed analysis of the morphology of the selected languages shows that they are morphologically rich. Affixation and related processes such as suffixation, infixation, reduplication, blending, compounding and concatenation of suffixes generate many morphological variants and complicate word formation. Conflating the words of each language manually is therefore very tedious and extremely difficult. For this reason, an automated conflation procedure such as a stemmer is very important for natural language processing applications.
• All natural language processing applications need a standard and balanced corpus drawn from different sources and genres. Hence, preparing such a corpus for each selected language is another research opportunity for all Ethiopian languages.
• By incorporating all necessary elements, a stemmer can also be used as a component in other computational tools for the language under consideration, such as a morphological analyzer, parser, spell checker, thesaurus, word frequency counter or information retrieval system.
• The effectiveness of the stemmer should be increased with little or no decrease in efficiency.
• Further study of the morphology of the languages can increase the accuracy of the stemmer with little or no increase in computational time.
• Finally, the stemmer should be evaluated on a large text collection gathered from different sources, because a large sample represents the characteristics of the language better than a small one. In this way, the accuracy of the stemmer can be better checked and improved for all the selected languages.

REFERENCES

1. A. Ismailov, M. M. A. Jalil, Z. Abdullah, and N. H. A. Rahim, “A Comparative Study of Stemming Algorithms for use with the Uzbek Language,” no. December, 2016, doi: 10.1109/ICCOINS.2016.7783180.
2. M. Haroon, “Comparative Analysis of Stemming Algorithms for Web Text Mining,” no. September, pp. 20–25, 2018, doi: 10.5815/ijmecs.2018.09.03.
3. D. Tesfaye, “Designing a Rule Based Stemmer for Afaan Oromo Text,” no. 1, pp. 1–11.
4. A. Abeshu, “Analysis of Rule Based Approach for Afan Oromo Automatic Morphological Synthesizer,” vol. 7522, no. 4, pp. 94–97, 2013.
5. J. Samuel and S. Teferra, “Designing a Rule Based Stemming Algorithm for Kambaata Language Text,” no. 9, pp. 41–54, 2018.
6. A. B. Adege and Y. C. Manie, “Designing a Stemmer for Ge’ez Text Using Rule Based Approach,” vol. 8, no. 1, pp. 1574–1578, 2017.
7. M. Y. Al-nashashibi, D. Neagu, and A. A. Yaghi, “Stemming Techniques for Arabic Words: A Comparative Study,” no. I, pp. 270–276, 2010.
8. “Amharic light stemmer,” no. ii, pp. 1–10, 2020.
9. “An Experiment Using Successor Variety,” 2009.
10. D. Sharma and M. E. Cse, “Stemming Algorithms: A Comparative Study and their Analysis,” vol. 4, no. 3, pp. 7–12, 2012.
11. D. Tesfaye, “Designing a Stemmer for Afaan Oromo Text: A Hybrid Approach,” School of Graduate Studies, Faculty of Informatics, Department of Information Science, Addis Ababa University, 2010.
12. O. Osman and Y. Mikami, “Stemming Tigrinya Words for Information Retrieval,” vol. 1, no. December, pp. 345–352, 2012.
13. “Designing a Stemming Algorithm for Silt’e,” School of Graduate Studies, Department of Information Science, 2012.
14. T. Misikir, thesis submitted to the School of Graduate Studies of Addis Ababa University in partial fulfillment of the requirement for the degree of Master of Science in Information Science, 2013.
15. G. Y. Bade and H. Seid, “Development of Longest-Match Based Stemmer for Texts of Wolaita Language,” vol. 4, no. 3, pp. 79–83, 2018, doi: 10.11648/j.ijdst.20180403.11.
16. G. Y. Bade, “Prototype Development for Stemmer of Wolaita Language,” SK International Journal of Multidisciplinary Research Hub, vol. 5, no. 12, pp. 7–18, 2018.
17. Y. Fisseha, “Development of Stemming Algorithm for Tigrigna Text,” no. June, 2011.
18. A. A. Argaw and L. Asker, “An Amharic Stemmer: Reducing Words to their Citation Forms,” no. June, pp. 104–110, 2007.
