Dictionary Based Spelling Corrector System: the Case of Six Ethiopian Languages
International Journal of Aquatic Science
ISSN: 2008-8019
Vol 12, Issue 02, 2021

Wubetu Barud Demilie
Department of Information Technology, Wachemo University, Hossana, Ethiopia, P.O. Box: 667

ABSTRACT
A dictionary-based spelling corrector is a system that identifies which natural language is being processed and switches to the appropriate spelling corrector for the language the user is interested in. Spelling correctors check text for spelling mistakes of any kind and rely heavily on the words in the lexicon. Some words have very few similarly spelled neighbors, so the correct word can be recovered even after several errors. Other words have many similarly spelled neighbors, so a single error can make correction difficult or impossible. A dictionary-based model is used to detect and correct diverse classes of spelling errors. The main features of the proposed model are that it offers suggestions for detected errors and applies the correction automatically using the first suggestion. Furthermore, the proposed model is evaluated using dictionary-based data sets for all languages selected for the study. This work is based on a dictionary-based model that detects and corrects errors for six Ethiopian languages: Amharic, Afan Oromo, Tigrinya, Hadiyyisa, Kambatissa, and Awngi. The corpora were collected from balanced sources, including newspapers covering economic, political, social, and related topics. Finally, after evaluating the proposed model, precision, recall, and F-measure were calculated for each language.

Keywords: Dictionary, Error Correction, Error Detection, Suggestion, Spelling Corrector

1. Introduction
Language is a medium of communication that helps human beings exchange ideas and information. Spelling corrector systems check text for spelling errors of any kind. The working principles of a spelling corrector, including error detection, are described in (Demilie, 2020b) and (Demilie, 2020a): words from the dictionary are suggested to the user, who chooses the word that was intended. Spelling corrector systems are used in various Natural Language Processing Applications (NLPAs), including part-of-speech taggers (Demilie, 2019)(TEKLAY, 2010) and grammar checkers (Tesfaye, 2011). In this paper, the researcher has designed, implemented, and evaluated an end-to-end system that performs spelling correction and auto-correction for six Ethiopian languages.

2. Literature Review
There has been extensive research on the problem of spelling error correction, but no widely used common benchmark exists. Each publication uses its own benchmark and compares against a (usually small) subset of methods.

Kumar et al. (2018) concluded that the performance of a spelling corrector can be improved by using an n-gram model, and that the model can be used for many languages. They suggested that “n-grams can be used in two ways, either without a dictionary or together with a dictionary”. According to their conclusions, the performance of a spelling corrector that uses n-grams without a dictionary is limited; its main advantage is its simplicity, since it does not require any dictionary. When n-grams are used together with a dictionary, they define the distance between words, and the words are always checked against the dictionary. Finally, they concluded that using both models together improves the performance of the spelling corrector.
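
The combined use of n-grams and a dictionary described above can be illustrated with a minimal sketch. It is not the method of Kumar et al. (2018) or of this paper: the choice of character bigrams, the Jaccard distance, the function names, and the toy lexicon are all assumptions made for illustration only.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of a word, padded with '#' on both sides."""
    padded = "#" + word + "#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_distance(a, b, n=2):
    """Jaccard distance between n-gram sets: 0.0 for identical sets, 1.0 for disjoint sets."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:  # both words too short to produce any n-gram
        return 0.0
    return 1.0 - len(grams_a & grams_b) / len(union)


def suggest(word, lexicon, n=2, limit=3):
    """Check the word against the lexicon first; otherwise rank entries by n-gram distance."""
    if word in lexicon:
        return [word]
    # Ties are broken alphabetically so the ranking is deterministic.
    ranked = sorted(lexicon, key=lambda entry: (ngram_distance(word, entry, n), entry))
    return ranked[:limit]


# Hypothetical toy lexicon; a real system would load one lexicon file per language.
toy_lexicon = {"letter", "later", "latter", "ladder", "water"}
print(suggest("leter", toy_lexicon))
```

With this toy lexicon, the entry closest to the input "leter" is "letter", so an auto-correction step that takes the first suggestion would choose it. A word already present in the lexicon is returned unchanged.
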

Atawy (2018) developed a language-independent spelling corrector based on n-gram techniques for detecting and correcting spelling errors. The researcher concluded that the "n-gram model provides correction and suggestions by selecting the most suitable suggestions from a list of corrective suggestions based on lexical resources and n-gram statistics." The researcher reported an overall performance of 93%.

According to (Demilie, 2020b), the main features of the proposed model are that it offers suggestions for detected errors and applies the correction automatically using the first suggestion. That work reported precision (86.6%, 85.3%, 83.9%, 82.8%, and 84.7%), recall (84.7%, 81.9%, 82.4%, 81.6%, and 81.9%), and F-measure (85.65%, 83.6%, 83.15%, 82.2%, and 83.3%) for Amharic, Afan Oromo, Tigrinya, Hadiyyisa, and Awngi, respectively. The spelling correction implemented here uses dictionary lookup, the state-of-the-art algorithm that different researchers have recommended for correcting spelling mistakes, and it is applied to all languages (Demilie, 2020b).

3. Significance of the Study
Learning to spell helps to cement the connection between letters and their sounds, and learning high-frequency words to mastery level improves both reading and writing. The more deeply and thoroughly a learner knows a word, the more likely he or she is to recognize it, spell it, define it, and use it properly in speech and writing (Spelling, n.d.). Many researchers in the area have developed spelling correctors for foreign and Ethiopian languages.
Among these researchers, and Ethiopian researchers in particular, no one except (Demilie, 2020b) has developed a spelling corrector that covers more than three Ethiopian languages within one system (Gezmu et al., 2014)(Ganfure & Midekso, 2014)(Jeldu & Mehta, 2018). This study provides an option that prompts system users to select the language. The researcher acknowledges those who have carried out studies on Ethiopian languages, including grammatical rules, word formation, sentence structure, and other related concepts for foreign and Ethiopian languages (TEKLAY, 2010)(University), n.d.)(Tamirat, n.d.)(Hadiyya (Hadiyyisa) Language Orthography - Alphabet and Writing - Themes on the Hadiya People of Ethiopia, n.d.)(Kambaata language - Wikipedia, n.d.)(Samuel et al., 2018)(Misikir, 2013).

4. Methodology
There are many methodologies for identifying and correcting spelling errors in written texts. For this study, the researcher used a dictionary-based method, which looks up and matches input strings in a dictionary, a lexicon, a corpus, or a combination of lexicons and corpora. The datasets (lexicon files) for the six Ethiopian languages were collected from different genres to obtain balanced corpora and lexicons, with the help of linguistic experts for each language. For spelling error detection and correction, exact string matching is used: if a string or word is not present in the chosen lexicon or corpus, it is considered a misspelled or invalid word. At this stage, the researcher assumes that all words in the lexicon or corpus are morphologically complete, i.e. all inflected forms are included in the dictionary. Attention is focused on reducing dictionary search time through efficient dictionary-lookup and/or pattern-matching techniques, dictionary-partitioning schemes, and morphological-processing techniques.
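
The exact-match detection rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the `\w+` tokenizer, and the toy lexicon are hypothetical, and a Python `set` (a hash-based container) stands in for the lexicon.

```python
import re


def load_lexicon(words):
    """Build a hash-based lexicon (a Python set) from an iterable of word forms."""
    return {w.strip() for w in words if w.strip()}


def tokenize(text):
    """Split text into word tokens; \\w+ is an assumed, simplistic tokenizer."""
    return re.findall(r"\w+", text)


def detect_errors(text, lexicon):
    """Return, in order, every token that has no exact match in the lexicon."""
    # Matching is exact and case-sensitive; every inflected form is assumed
    # to be listed in the lexicon, as the methodology states.
    return [token for token in tokenize(text) if token not in lexicon]


# Hypothetical toy lexicon standing in for one language's lexicon file.
lexicon = load_lexicon(["the", "cat", "sat ", "on", "mat"])
print(detect_errors("the cat szt on the mat", lexicon))
```

Each membership test on the set takes constant time on average, so detection cost grows with the length of the input text rather than with the size of the lexicon.
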
The most significant dictionary-lookup techniques are hashing, binary search trees, and finite-state automata. Of these approaches, the researcher used hashing, since it is a well-known and efficient dictionary lookup strategy.

5. Result and Discussion
The evaluation aims to measure the performance of the selected approach and to demonstrate its easy portability to all six Ethiopian languages. To obtain better performance than the work of (Demilie, 2020b), the researcher collected corpora larger than those used in (Demilie, 2020b), drawn from balanced sources and supported by a detailed linguistic analysis of each language. After the preprocessing stages of the study, all languages were evaluated on the collected corpora. The evaluation was carried out for each language in turn, using test data whose words are in that language's dictionary file list: first Amharic, then Afan Oromo, Tigrinya, Hadiyyisa, Kambatissa, and finally Awngi. To evaluate the spelling error detection capability of the selected approach for all six languages, precision, recall, and F-measure were used as metrics. The relative positions of the correct spellings in the list of plausible suggestions were used to evaluate spelling error correction.

6. Test Data
The study used spelling-error test corpora prepared manually with the help of linguistic experts for each language to evaluate performance.
For the study, the researcher has used a test corpus that has been