Computational Methods to Vocalize Arabic Texts

Hani Safadi 1, Dr. Oumayma Dakkak 2, Dr. Nada Ghneim 2
1 Faculty of Informatics Engineering, University of Damascus, SYRIA
2 Higher Institute of Applied Science and Technology, Damascus, SYRIA

Abstract

The Arabic language has two kinds of vowels: long vowels, which are written as normal letters, and short vowels, which are written as diacritical marks above or below letters. Short vowels are normally omitted in Arabic texts because readers can fill them in and infer the meaning from their knowledge of the language and the context in which the words appear. However, with the widespread use of computers in linguistic applications, Arabic texts need to be supplied with short vowels before they can be analyzed. Search engines, text-to-speech engines, and text mining tools are just some examples of applications that need Arabic texts to be vocalized before processing.

In this paper, we present a new method to supply those vowels. The approach emphasizes unsupervised machine learning methods, because public Arabic corpora are not available. Arabic's rich morphology and diverse orthography present serious challenges for this approach. An algorithm is developed and a system is implemented in Java.

The techniques presented in this work can be applied to similar Semitic or other languages that have the same problem.

1. Introduction

There are two types of vowels in Arabic: long vowels /A/, /w/, /y/ (we use the Buckwalter Arabic transliteration table [6]; see Appendix A), although /w/ and /y/ can sometimes be used as consonants, and short vowels /a/ (Fatha), /u/ (Damma), /i/ (Kasra). These short vowels are part of the word and are written as additional marks above or below letters. In addition to the short vowels, there are other marks: F (Tanween-Fateh, pronounced /an/), N (Tanween-Damm, pronounced /un/ or /on/), K (Tanween-Kasir, pronounced /in/ or /en/), o (Sukun, which means that the consonant is not followed by a vowel), and ~ (Shadda, which means a duplication of the consonant). All these marks are usually not written because an Arabic reader can guess them, based on knowledge of the language and on the context.

For non-Arabic speakers this problem might seem too abstract. To illustrate it we will use a hypothetical version of English, which we will call AEng. In AEng, of the five vowels /a/, /e/, /i/, /o/ and /u/, we only write /i/ and /u/ as normal letters, while we omit /a/, /o/ and /e/. Now try to read the following sentence in this language:

"If yu cn rd this thn w cn omt vwls"

If you figured out the meaning of this sentence, then you are convinced that some vowels can be omitted. If not, you will adapt to this as you use AEng in your daily life! The original sentence in English is:

"If you can read this, then we can omit vowels"

It is a little bit similar to chatting language. Note that this is only an example, and does not need to be fully representative.

From this example, we see how ambiguous a sentence is without the short vowels (at least for computers). For example, the word Elm in Arabic could be:

Eilom    Science
Ealam    Flag
Ealim    to know

There are also other possibilities. In AEng we have the word bg, which could be read as beg or bag.

This example makes it clear that a computer needs the short vowels in order to understand the text. Each letter has one or two short vowels associated with it.

The short vowels on all characters but the last one differentiate the word from other words. The short vowel on the last letter, however, determines the syntactic position of the word in the sentence (like Nominative, Accusative, and Dative in German). Although the last vowels add other information to the word, this study will not deal with them. In fact, restoring the last vowels of words needs deeper syntactic analysis, and the information they add is not so essential. For example:

EilomF
EilomN
EilomK
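The ambiguity described above can be made concrete with a minimal sketch. It assumes words are given in Buckwalter transliteration and treats only the marks named in the text (a, u, i, F, N, K, o, ~) as diacritics; the class and method names are hypothetical and are not taken from the authors' system. Removing the marks maps every vocalized form in the examples back to the same written word Elm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class VocalizationAmbiguity {
    // Buckwalter symbols for the marks listed in the text (assumption: no others occur):
    // a, u, i (short vowels), F, N, K (tanween), o (sukun), ~ (shadda).
    private static final String DIACRITICS = "auiFNKo~";

    /** Removes every diacritic symbol from a Buckwalter-transliterated word. */
    static String stripDiacritics(String vocalized) {
        StringBuilder sb = new StringBuilder();
        for (char c : vocalized.toCharArray()) {
            if (DIACRITICS.indexOf(c) < 0) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Groups vocalized forms by the unvocalized form they collapse to. */
    static Map<String, List<String>> groupByUnvocalized(List<String> forms) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String form : forms) {
            groups.computeIfAbsent(stripDiacritics(form), k -> new ArrayList<>()).add(form);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> forms =
            Arrays.asList("Eilom", "Ealam", "Ealim", "EilomF", "EilomN", "EilomK");
        // All six forms end up under the single key "Elm".
        System.out.println(groupByUnvocalized(forms));
    }
}
```

Vocalization is the inverse of stripDiacritics: given Elm, a system has to choose one of the forms that collapse to it.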
These marks are written only when strictly necessary, in cases where the word is so ambiguous without them that it cannot be distinguished from other words. Even then, only one or two discriminating letters are vocalized.

2. Previous Attempts

Despite the abundance of computational studies of Arabic, Arabic vocalization has not been studied enough. Sakhr [1] has a commercial system for Arabic vocalization. Unfortunately, the system is completely closed and no information about it is available.

Y. Gal [2] uses a hidden Markov model (HMM) trained on vocalized Arabic texts, with the text of the Holy Quran as the training corpus. It achieves 85% correct vocalization when used with text from the same corpus.

R. Nelken and S. Shieber [3] use weighted finite-state transducers trained on the LDC corpus [4] of M. Maamouri et al. The reported results show 93% correct vocalization.

3. Our Methodology

While the past two approaches are useful attempts to solve the problem, both have the same drawback: they tackle the problem with a top-down approach, building a model and training it with a corpus, although [3] did some morphological analysis. The problem with this approach is that it is highly dependent on the corpus. For example, the Quranic texts used in [2] are not a good representative of modern Arabic, and the newspaper archives in the LDC corpus [4] used in [3] do not cover all the topics in the language.

Our work uses a bottom-up approach, in which we do a complete linguistic analysis of Arabic texts. Our system works in four stages:

1. Parsing the text.
2. Analyzing the text morphologically.
3. Part of speech tagging of the text.
4. Applying linguistic heuristic rules.

We will explain each of these steps, and show how it participates in the vocalization process.

1. Parsing: In this step, the text is split into phrases and each phrase is split into words. The process is simple; it uses regular expressions to parse the text.

2. Morphological Analysis: In this step, each word is considered alone and passed to the morphological analyzer. For each word, the morphological analyzer provides all the possible vocalizations that can be added to the word to generate words in the language. This stage also gives the part of speech tag of each possibility.
   We used the Buckwalter Arabic morphological analyzer [5], which uses a concatenative methodology to analyze words. It considers each word as a combination of prefix, stem, and suffix. It has a dictionary of prefixes, stems, and suffixes, and two compatibility tables: a prefix-stem table, which tells which prefixes can come with which stems, and a stem-suffix table, which tells which stems can come with which suffixes.
   All the possible decompositions of a word into prefix-stem-suffix are considered (as long as the stem length is not zero). For each decomposition, the prefix, stem and suffix are checked to see whether they are contained in the dictionaries. If so, the compatibility between them is checked; if they are compatible, this decomposition is considered a possible analysis of the word.

3. Part of speech tagging: After morphologically analyzing each word, we get several possibilities for the word; each possibility has a corresponding vocalization and part of speech. To choose the correct vocalization, we must choose the correct part of speech (POS). We built a POS tagger for Arabic, using unsupervised transformation-based learning methods [7, 8]. We used unsupervised methods because we did not have a tagged Arabic corpus. The only tagged Arabic corpus is the LDC Arabic corpus [4], which we could not afford to buy. The tagger is trained on texts collected from the Internet covering multiple disciplines [9].
   The generated rules are then examined manually. For some words, part of speech tagging cannot resolve the ambiguity; therefore we need an additional level of processing.

4. Heuristic rules of disambiguation: These rules were set by language experts. The rules specify the choice of a certain part of speech with regard to the word and its context. For example, one of these rules states: "if the word length is less than 3 letters, and one of its parts of speech is preposition, then choose it".

5. Finally, if the ambiguity in the word still remains after all these levels, a random choice among the candidate part of speech tags is made.

4. Implementation

We implemented the system using the Java(TM) programming language. The implementation itself is composed of the same parts described previously, plus some additional tools and graphical user interfaces.

The implementation allows the user to use each part alone, or to use all of them together.

5. Results

Unfortunately, because of the lack of vocalized Arabic texts, we did not have the opportunity to test our system thoroughly. However, we did some empirical tests. We vocalized large Arabic texts automatically and gave the results to experts to evaluate. The evaluation shows 80-90% correct vocalization.

6. Future Works

Although our system uses a detailed analysis of texts, there are still some missing steps of analysis; adding these steps will improve the performance of our system. These steps may include:

1. Syntactic Analysis: Our current system cannot restore vowels on the last character of each

[4] Mohamed Maamouri, Ann Bies, Tim Buckwalter, and Wigdan Mekki, “The Penn Arabic Treebank: Building a Large-Scale Annotated Arabic Corpus”, Paper presented at the NEMLAR International Conference on Arabic Language Resources and Tools, Cairo, Sept. 22-23, 2004.

[5] Buckwalter T., “Buckwalter Arabic Morphological Analyzer Version 1.0”.
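As an illustration of the morphological analysis stage in Section 3, the following sketch enumerates every prefix-stem-suffix split with a non-empty stem, keeps the splits whose three parts occur in the dictionaries, and then applies the two compatibility tables. It is an assumption-laden toy and not the code of the Buckwalter analyzer [5]: the class name, the tiny lexicon, and the encoding of each table as a set of "left|right" strings are hypothetical, and the vocalization and part of speech attached to each dictionary entry are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SegmentationSketch {
    // Hypothetical toy lexicon in Buckwalter transliteration (illustration only).
    static final Set<String> PREFIXES = new HashSet<>(Arrays.asList("", "w"));
    static final Set<String> STEMS = new HashSet<>(Arrays.asList("ktb"));
    static final Set<String> SUFFIXES = new HashSet<>(Arrays.asList("", "t"));
    // Hypothetical compatibility tables, each entry encoded as "left|right".
    static final Set<String> PREFIX_STEM = new HashSet<>(Arrays.asList("|ktb", "w|ktb"));
    static final Set<String> STEM_SUFFIX = new HashSet<>(Arrays.asList("ktb|", "ktb|t"));

    /** Returns every {prefix, stem, suffix} split of the word with a non-empty stem. */
    static List<String[]> allSplits(String word) {
        List<String[]> splits = new ArrayList<>();
        for (int p = 0; p < word.length(); p++) {          // stem starts at p
            for (int s = p + 1; s <= word.length(); s++) { // stem ends before s
                splits.add(new String[] {
                    word.substring(0, p), word.substring(p, s), word.substring(s)});
            }
        }
        return splits;
    }

    /** Keeps the splits found in the dictionaries and allowed by both tables. */
    static List<String[]> analyze(String word) {
        List<String[]> analyses = new ArrayList<>();
        for (String[] c : allSplits(word)) {
            boolean inDictionaries = PREFIXES.contains(c[0])
                && STEMS.contains(c[1]) && SUFFIXES.contains(c[2]);
            boolean compatible = PREFIX_STEM.contains(c[0] + "|" + c[1])
                && STEM_SUFFIX.contains(c[1] + "|" + c[2]);
            if (inDictionaries && compatible) {
                analyses.add(c);
            }
        }
        return analyses;
    }

    public static void main(String[] args) {
        for (String[] a : analyze("wktbt")) {
            System.out.println(String.join(" + ", a));
        }
    }
}
```

A word of n letters yields n(n+1)/2 candidate splits, so the dictionary lookups and the table checks do the real filtering; with the toy lexicon above, only w + ktb + t survives for the input wktbt.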
