Survey of Arabic Checker Techniques
Total Page:16
File Type:pdf, Size:1020Kb
SUST Journal of Engineering and Computer Sciences (JECS), Vol. 21, No. 1, 2020 Survey of Arabic Checker Techniques Ahmed Abdalrhman Saty1, Karim Bouzoubaa2, Aouragh Si Lhoussain3 1 Sudan University of Science and Technology, Sudan 2,3 Mohammed V University in Rabat, Morocco [email protected] Received:28/09/2019 Accepted:06/11/2019 ABSTRACT- It is known that the importance of spell checking, which increases with the expanding of technologies, using the Internet and the local dialects, in addition to non-awareness of linguistic language. So, this importance increases with the Arabic language, which has many complexities and specificities that differ from other languages. This paper explains these specificities and presents the existing works based on techniques categories that are used, as well as explores these techniques. Besides, it gives directions for future work. Keywords: spell checking, rule-based, morphology, n-gram, radix-search tree, levenshtein distance, jaro- winkler distance المستخلص - من المعمهم أىمية التدقيق اﻹمﻻئي، والتي تزداد مع تهسع التقنيات، استخدام اﻻنترنت والميجات المحمية، اضافة إلى عدم اﻹلمام بقهاعد المغة. وبالتالي تزداد أىميتو أكثر مع المغة العربية ندبة ﻷنيا تحتهي عمى بعض التعقيدات والخصائص التي ت ميزىا عن المغات اﻻخرى. ىذه الهرقة تدتعرض بعض خصائص المغة العربية، كما تقهم بعرض اﻷعمال المهجهدة في ىذا المجال بناء عمى التقنيات المدتخدمة ومن ثم شرحيا. عﻻوة الى ذلك تعطي بعض اﻻتجاىات لﻷعمال المدتقبمية. INTRODUCTION specificity of many linguistic phenomenon that With the increased usage of computers and increase the probability of user mistakes such as smart devices in the processing of various multiplicity of local dialects and the non- languages, comes the need for correcting errors awareness of Arabic linguistic rules. introduced at different stages. Texts of any An Arabic spell checker behaves exactly the language can be generated from different same as an English one. For example, for the alzalad), the checker detects it as an) "الزلد" sources either by humans as document typing text and emailing software, or by machine such as erroneous word and suggests a list of close ,azzad) ”الزاد ,الولد ,الصلد ,الزبد ,الزلط“ optical character recognition (OCR) and words such as machine translation (MT). These produced texts alwalad, assalad, azzabad, azzlad). may have typing mistakes that need to be spell In the context of Arabic spell checking, many checked and corrected. Spell checking approaches and methods have been studied. constitutes one of the major areas in the field of Multiple systems with different designs already Natural Languages Processing (NLP) and has exist. Some of them exploit dictionaries while been the subject of different research studies others use morphological analysis [2]. A lot of since 1960 [1]. Spell checking mainly consists of them use also similarity among words [3, 4], and a verifying that some typed words are not few use the context [5] or mix between these accepted in the used language and suggests a list techniques [6, 7]. of close words to the erroneous word. This paper surveys the existing Arabic spell Accordingly, numerous approaches have been checkers with broad coverage of their explored to correct spelling errors in texts using advantages and disadvantages and consequently NLP tools and resources. sheds light on new opportunities in order to Some languages, such as English, developed improve these existing works. The remainder of advanced detection and spell checking systems. the paper is organized as follows. Section 2 For the case of Arabic, such systems are double- explains the specificities of the Arabic language needed with the rapid growth of the Arabic needed in the context of spell checking. Section digital content and users (it is reported that, for 3 introduces the classification of errors, explains 2017, 43.8% of the whole Arabic populations the meaning of datasets with their types are Internet users [37]) and because of the alongside used techniques in existing works, and 34 SUST Journal of Engineering and Computer Sciences (JECS), Vol. 21, No. 1, 2020 presents our notices on these works. Finally, we dialect-pronunciation instead of the original conclude the paper and make propositions about letter. new ideas for improving the existing works for According to the above Arabic language future work. specificities (alphabet and lexicon), the origins Arabic Language Uniqueness of typing errors include: Nowadays more than 400 million people in the 1. Changing of shape according to the position Middle East and North Africa speak the Arabic of letters in the word. language [38]. Arabic is also used as a religious 2. Similarity of pronunciation and shape of language in the Islamic Word. Therefore, it is some letters (see Figure1). learned by various levels of proficiency, as a 3. The use of dialects. venerated, liturgical language[8] by many Muslims mainly in Asia (e.g., Pakistan, Malaysia, China) and Africa (e.g., Senegal) [39]. Arabic has its own alphabet, its own lexicon, its own morphology, and its syntactic rules. The alphabet is 28 letters and the morphology is very rich. As a Semitic language, Arabic is derivative Figure1: Arabic Keyboard and has a flexible syntax allowing, for example, [9, 11] both verbal and nominal sentences . Whatever the origins are, to assist Arabic users However, it is the alphabet and lexicon during their typing, many Arabic spell-checking specificities of the Arabic language that have a approaches and systems have been developed. In direct impact of producing errors while typing the next section, we review these works. and that necessitate the use of spell checking Arabic Spell-checking Issues systems. In this section, we review the most important The alphabet is written from right to left in a works about Arabic spell checking. Firstly, we cursive style. Some letters have specific rules for introduce the classification of errors. Secondly, writing leading their shapes depending on their we explain the meaning of datasets used in the context of spell checking with their different ”ك“ position in the word. For example, the letter is written at the beginning and middle of the kinds. Thirdly, we focus on the most used word different than the last. The hamza letter is approaches and techniques which are in turn written with 5 possible shapes depending on its divided into five categories, which are: rule- position in the word as well as the diacritic of based approach, distance similarity techniques, the previous and the next letter. These rules are techniques exploiting morphological analysis, not accurately known by all users causing them techniques relying on phonetics and finally to make typing errors. hybrid ones, which combined more than one of In addition, some letters have very close the existing techniques. Classification of errors .("ض" ,”د“) pronunciations such as Consequently, people who are typing on There are two main types of typing errors; the computers are interchangeably using one letter first one is isolated-words and the second one is instead of the other thinking that is the right way context-sensitive [12, 13]. The first type detects to spell the corresponding word. that the misspelling words do not exist in the On the other hand, and due to its derivative lexicon. To deal with this type, the researchers aspect, the Arabic lexicon is extremely large introduced many analyzing and statistical studies. (about 12 million entries). However, Arabic Based on the user knowledge, the misspelling citizens do not make use of this entire lexicon errors can be divided into three categories[11]: the and have over the history "squeezed-minimized" first one is the typographic error; where the user the Arabic language and use a reduced lexicon is familiar with how to write the word but (s)he that does not exceed almost 20,000 entries. makes the typing error. The second category is Nowadays, from region to region Arabic citizens the cognitive error; where the user is not familiar use their own dialects but with a change in the with how to write the words, due to the non- pronunciation of some letters. For instance, awareness of the Arabic rules. The third category is the phonetic error that takes place (“ق”) Egyptians replace the pronunciation of with when there is a replacement of letters, due to the (“ق“) while Sudanese replace (“أ”) with In these cases, proximity of the sound. In general, the typing .(“ز“) with (“ذ“) and replace (“غ”) users type the word using the letter with its errors (approximately 80%) occur due to one of 35 SUST Journal of Engineering and Computer Sciences (JECS), Vol. 21, No. 1, 2020 the following reasons [14]: letter insertion, letter approach is considerably used to handle deletion, letter substitution and transposition of common spelling and typographic errors. For two adjacent letters. example, authors of [19] proposed a system that The second type is context-sensitive and is also has a mechanism for automatic correction of called real-words error. In this type, the written common errors in Arabic based on rules such as word is correct but its position is incorrect and the dealing of hamza errors since there is a taa ,”ذ“ and zah”د“ leads to a wrong meaning [5, 15, 16]. Few Arabic confusion between the dah The mentioned errors .”ي“ and yaa ”ة“ spell-checking researchers addressed real-word marbuta errors. Among these works, the researchers of [5] are treated by applying regular expressions and proposed a spell checker with a large corpus word replacement list. Moreover, the works of [7, collected from three topics (sport, health, and 20, 21] captured also various kinds of common economics), as well as 28 confusion sets, that errors. It is also noted that the use of rule-based were collected from commonly confused words. techniques gives more satisfactory corrections. Later on, the authors of [16] proposed a system In addition, the rule-based approach can also be that deals with context errors by applying n- used to rank the candidate words by aggregating gram and machine learning instead of predefined the probabilities of applied rules [12].