
Arabic diacritization using weighted finite-state transducers


Citation: Rani Nelken and Stuart M. Shieber. Arabic diacritization using weighted finite-state transducers. In Proceedings of the 2005 ACL Workshop on Computational Approaches to Semitic Languages, pages 79–86, Ann Arbor, Michigan, June 2005.

Published Version http://www.aclweb.org/anthology-new/W/W05/W05-0711.pdf

Citable link http://nrs.harvard.edu/urn-3:HUL.InstRepos:2252610

Arabic Diacritization Using Weighted Finite-State Transducers

Rani Nelken and Stuart M. Shieber Division of Engineering and Applied Sciences Harvard University 33 Oxford St. Cambridge, MA 02138 {nelken,shieber}@deas.harvard.edu

Abstract

Arabic is usually written without short vowels and additional diacritic marks, which are nevertheless important for several applications. We present a novel algorithm for restoring these symbols, using a cascade of probabilistic finite-state transducers trained on the Arabic treebank, integrating a word-based language model, a letter-based language model, and an extremely simple morphological model. This combination of probabilistic methods and simple linguistic information yields high levels of accuracy.

Introduction

Most semitic languages in both ancient and contemporary times are usually written without short vowels and other diacritic marks, often leading to potential ambiguity. While such ambiguity only rarely impedes proficient speakers, it can certainly be a source of confusion for beginning readers and people with learning disabilities (Abu-Rabia, 1999). The problem becomes even more acute when people are required to actively generate diacritized script, as used for example in poetry or children's books. Diacritization is even more problematic for computational systems, adding another level of ambiguity to both analysis and generation of text. For example, full vocalization is required for text-to-speech applications, and has been shown to improve speech-recognition perplexity and error rate (Kirchoff et al., 2002).

We present a system for Arabic diacritization in Modern Standard Arabic (MSA) using weighted finite-state transducers. (We distinguish between vocalization, the restoration of vowel symbols, and diacritization, which additionally includes the restoration of a richer system of diacritic marks, as explained below.) The system is constructed using standardly available finite-state tools, and encodes only minimal morphological knowledge, yet achieves very high levels of accuracy. While the methods described in this paper are applicable to additional semitic languages, the choice of MSA was motivated by the availability of the Arabic Treebank, distributed by the Linguistic Data Consortium (LDC), a sizable electronic corpus of diacritized text, which we could use for training and testing. Such resources are rare for other semitic languages.

This paper is structured as follows. In Section 1, we describe the task, including a brief introduction to Arabic diacritization, and the corpus we used. In Section 2, we describe the design of our system, which consists of a trigram word-based language model augmented by two extensions: an extremely simple morphological analyzer and a letter-based language model, which are used to address the data sparsity problem. In Section 3, we report the system's experimental evaluation. We review related work in Section 4, and close with conclusions and directions for future research in Section 5.

1 The task

The Arabic vowel system consists of 3 short vowels and 3 long vowels, as summarized in Table 1. (Throughout the paper we follow Buckwalter's transliteration of Arabic into 7-bit ASCII characters (2002a); the Arabic-script columns of the tables are not reproduced here.)

                         Vocalized    Unvocalized   Pronounced
  Short vowels           u            -             /u/
                         a            -             /a/
                         i            -             /i/
  Long vowels            uw           w             /u:/
                         aA (1)(2)    A             /a:/
                         aY           Y (3)         /a:/
                         iy           y             /i:/
  Doubled case endings   N            -             /un/
                         F            -             /an/
                         AF           A             /an/
                         K            -             /in/

(1) aA may also appear as A even in vocalized text.
(2) In some lexical items, aA is written as '; in which case it is dropped in undiacritized text.
(3) Y and y can appear interchangeably in the corpus (Buckwalter, 2004).

Table 1: Arabic vowels

In diacritized text, the glottal stop can be written in any of the transliterated forms >, <, |, ', &, {, }.

Table 2: Arabic glottal stop

Short vowels are written as symbols either above or below the letter in diacritized text, and are dropped altogether in undiacritized text. Long vowels are written as a combination of a short vowel symbol followed by a vowel letter; in undiacritized text, the short vowel symbol is dropped. Arabic also uses vowels at the end of words to mark case distinctions, which include both the short vowels and an additional doubled form ("tanween").

Diacritized text also contains two syllabification marks: the shadda (trans.: ∼), denoting doubling of the preceding consonant, and the sukun (trans.: o), denoting the lack of a vowel.

The Arabic glottal stop ("hamza") deserves special mention, as it can appear in several different forms in diacritized text, enumerated in Table 2. In undiacritized text, it appears either as A or as one of the forms in the table.

We used the LDC's Arabic Treebank of diacritized news stories (Part 2). The corpus consists of 501 news stories collected from Al-Hayat, for a total of 144,199 words. In addition to diacritization, the corpus contains several types of annotation. After being transliterated, the text was morphologically analyzed using the Buckwalter morphological analyzer (2002b). Buckwalter's system provides a list of all possible analyses of each word, including a full diacritization and a part-of-speech (POS) tag. From this candidate list, annotators chose a single analysis. Afterwards, clitics, which are prevalent in Arabic, were separated and marked, and a full parse tree was manually generated. For our purposes, we have stripped all the POS tagging, retaining two versions of each file, diacritized and undiacritized, which we use for both training and evaluation, as explained below.

An example sentence fragment in Arabic is given in Figure 1 in three forms: undiacritized, diacritized without case endings, and diacritized with case endings.

Figure 1: Example sentence fragment (Arabic script, not reproduced here)

The transliteration and translation are given in Figure 2. We follow precisely the form that appears in the Arabic treebank. (The diacritization of this example is, strictly speaking, incomplete with respect to the determiner Al and the letter immediately following it. For instance, in Alduwal, the d should actually have been doubled, yielding Ald∼uwal. The treebank consistently does not diacritize Al, and we adhere to its conventions in both training and testing.)

rd Al>myn AlEAm ljAmEp Aldwl AlErbyp Emrw mwsY
rad∼a Al>amiyn AlEAm∼ lijAmiEap Alduwal AlEarabiy∼ap Eamorw muwsaY
rad∼a Al>amiynu AlEAm∼u lijAmiEapi Alduwali AlEarabiy∼api Eamorw muwsaY
"Arab League Secretary General Amr Mussa replied. . ."

Figure 2: Transliteration and translation of the sentence fragment
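To make the relation between the two transliterated forms concrete, the mapping of Figure 2 can be sketched as a plain string transformation over the Buckwalter transliteration (with shadda written as ~ rather than the ∼ rendered above). This is only an illustrative sketch of the diacritic-drop direction, not the paper's transducer implementation; the dropped symbols follow Table 1, plus the single hamza normalization ({ to A) discussed in Section 2:

    import re

    # Symbols dropped in undiacritized text (Buckwalter transliteration):
    # short vowels a/i/u, sukun o, shadda ~, and the tanween marks F/N/K.
    DIACRITICS = re.compile(r"[aiuoFNK~]")

    def strip_diacritics(word):
        """Sketch of diacritic drop on a single transliterated word."""
        word = DIACRITICS.sub("", word)
        return word.replace("{", "A")  # one hamza variant surfaces as A

    text = "rad~a Al>amiyn AlEAm~ lijAmiEap Alduwal AlEarabiy~ap Eamorw muwsaY"
    print(" ".join(strip_diacritics(w) for w in text.split()))
    # -> rd Al>myn AlEAm ljAmEp Aldwl AlErbyp Emrw mwsY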

2 System design

To restore diacritization, we have created a generative probabilistic model of the process of losing diacritics, expressed as a finite-state transducer. The model transduces fully diacritized Arabic text, weighted according to a language model, into undiacritized text. To restore diacritics, we use Viterbi decoding, a standard algorithm for efficiently computing the best path through an automaton, to reconstruct the maximum-likelihood diacritized word sequence that would generate a given undiacritized word sequence.

The model is constructed as a composition of several weighted finite-state transducers (Pereira and Riley, 1997). Transducers are extremely well suited for this task, as their closure under composition allows complex models to be efficiently and elegantly constructed from modular implementations of simpler models. The system is implemented using the AT&T FSM and GRM libraries (Mohri et al., 2000; Allauzen et al., 2003), which provide a collection of useful tools for constructing weighted finite-state transducers implementing language models.

2.1 Basic model

Our cascade consists of the following transducers, illustrated in Figure 3.

Figure 3: Basic model (the cascade LM, SP, DD, UNK)

Language model (LM) A standard trigram language model of Arabic diacritized words. For smoothing purposes, we use Katz backoff (Katz, 1987). We learn weights for the model from a training set of diacritized words. This is the only component of the basic model for which we learn such weights. During decoding, these weights are utilized to choose the most probable word sequence that could have generated the undiacritized text. The model also includes special symbols for unknown words, <unk>, and for numbers, <num>, as explained below.

Spelling (SP) A spelling transducer that transduces a word into its component letters. This is a technical necessity, since the language model operates on word tokens and the following components operate on letter tokens. For instance, the single token Al>amiyn is transduced to the sequence of tokens A, l, >, a, m, i, y, n.

Diacritic drop (DD) A transducer for dropping vowels and other diacritics. The transducer simply replaces all short vowel symbols and syllabification marks with the empty string, ε. In addition, this transducer also handles the multiple forms of the glottal stop (see Section 1). Rather than encoding any morphological rules on when the glottal stop receives each form, we merely encode the generic availability of these various alternatives, as transductions. For instance, DD includes the option of transducing { to A, without any information on when such a transduction should take place.

Unknowns (UNK) Due to data sparsity, a test input may include words that did not appear in the training data, and will thus be unknown to the language model. To handle such words, we add a transducer, UNK, that transduces <unk> into a stochastic sequence of arbitrary letters. During decoding, the letter sequence is fixed, and since it has no possible diacritization in the model, Viterbi decoding would choose <unk> as its most likely generator.

UNK serves a similar purpose in handling numbers: it transduces <num> to a stochastically generated sequence of digits. In the training data, we replace all numbers with <num>. On encountering a number in a test input, the decoding algorithm replaces the number with <num>. As a post-processing step, we replace all occurrences of <unk> and <num> with the original input word or number.
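Stripped of the FST machinery, the decoding story of this basic model can be pictured as Viterbi search over per-word candidate lists: the candidate table inverts diacritic drop over the training vocabulary, unknown words back off to <unk>, and <unk>/<num> are restored afterwards. The sketch below is a conceptual stand-in under simplifying assumptions, using a bigram score with crude add-one smoothing instead of the paper's Katz-backoff trigram, and a beam instead of full Viterbi:

    import re
    from collections import defaultdict
    from math import log

    def strip_diacritics(w):                   # as in the Section 1 sketch
        return re.sub(r"[aiuoFNK~]", "", w).replace("{", "A")

    def train(sentences):
        """Index diacritized training sentences (lists of transliterated words)."""
        cands, bi, ctx = defaultdict(set), defaultdict(int), defaultdict(int)
        for sent in sentences:
            prev = "<s>"
            for w in sent:
                cands[strip_diacritics(w)].add(w)   # inverse of diacritic drop
                bi[(prev, w)] += 1
                ctx[prev] += 1
                prev = w
        return cands, bi, ctx

    def decode(undiac, cands, bi, ctx, beam=10):
        """Most probable diacritized word sequence for an undiacritized input."""
        V = len(ctx) + 1                       # crude vocabulary size for add-one smoothing
        def logp(p, c):                        # stands in for the Katz-backoff trigram
            return log((bi.get((p, c), 0) + 1.0) / (ctx.get(p, 0) + V))
        hyps = [("<s>", 0.0, [])]              # (previous word, log score, output so far)
        for u in undiac:
            options = cands.get(u, {"<unk>"})  # unknown words back off to <unk>
            hyps = [(c, s + logp(p, c), out + [c])
                    for p, s, out in hyps for c in options]
            hyps = sorted(hyps, key=lambda h: -h[1])[:beam]
        best = max(hyps, key=lambda h: h[1])[2]
        # Post-processing, as in the paper: <unk> is replaced by the original token.
        return [u if c == "<unk>" else c for c, u in zip(best, undiac)]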
2.2 Handling clitics

Arabic contains numerous clitics, which are appended to words either as prefixes or as suffixes, including the determiner, conjunctions, some prepositions, and pronouns. Clitics pose an important challenge for an n-gram model, since the same word with a different clitic combination appears to the model as a separate token. We thus augment our basic model with a transducer for handling clitics.

Handling clitics using a rule-based approach is a non-trivial undertaking (Buckwalter, 2002b). In addition to cases of potential ambiguity between letters belonging to a clitic and letters belonging to a word, clitics might be iteratively appended, but only in some combinations and some orderings. Buckwalter maintains a dictionary not only of all prefixes, stems, and suffixes, but also keeps a separate entry for each allowed combination of diacritized clitics. Since, unlike Buckwalter's system, we are interested just in the most probable clitic separation rather than the full set of analyses, we implement only a very simple transducer, and rely on the probabilistic model to handle such ambiguities and complexities.

From a generative perspective, we assume that the hypothetical original text from which the model starts is not only diacritized, but also has clitics separated. We augment the model with a transducer, Clitic Concatenation (CC), which non-deterministically concatenates clitics to words. CC scans the letter stream; on encountering a potential prefix, CC can non-deterministically append it to the following word, merely by transducing the space following it to ε. This is done iteratively for each prefix. After concatenating prefixes, CC can non-deterministically decide that it has reached the main word, which it copies. Finally, it concatenates suffixes symmetrically to prefixes. For instance, on encountering the letter string w Al >myn, CC might drop the spaces to generate wAl>myn ("and the secretary").

The transducer implementation of CC consists of three components, depicted schematically in Figure 4. The first component iteratively appends prefixes. For each of a fixed set of prefixes, it has a set of states for identifying the prefix and dropping the trailing space. A non-deterministic jump moves the transducer to the middle component, which implements the identity function on letters, copying the putative main word to the output. Finally, CC can non-deterministically jump to the final component, which appends suffixes by dropping the preceding space.

Figure 4: Clitic concatenation (the components PRE, MID, SUF)

By design, CC provides a very simple model of Arabic clitics. It maintains just a list of possible prefixes and suffixes, but encodes no information about stems or possible clitic orderings, potentially allowing many ungrammatical combinations. We rely on the probabilistic language model to assign such combinations very low probabilities. (The only special case of multiple prefix combinations that we explicitly encode is the combination l+Al (to + the), which becomes ll by dropping the A.)

We use the following list of (undiacritized) clitics (cf. Diab et al. (2004), who use the same set with the omission of s and ny):

prefixes: b (by/with), l (to), k (as), w (and), f (and), Al (the), s (future);

suffixes: y (my/mine), ny (me), nA (our/ours), k (your/yours), kmA (your/yours masc. dual), km (your/yours masc. pl.), knA (your/yours fem. dual), kn (your/yours fem. pl.), h (him/his), hA (her/hers), hmA (their/theirs masc. dual), hnA (their/theirs fem. dual), hm (their/theirs masc. pl.), hn (their/theirs fem. pl.).

We integrate CC into the cascade by composing it after DD and before UNK. Thus, clitics appear in their undiacritized form. Our model now assumes that the diacritized input text has clitics separated. This requires two changes to our method. First, training must now be performed on text in which clitics are separated. This is straightforward, since clitics are tagged in the corpus. Second, in the undiacritized test data, we keep clitics intact. Running Viterbi decoding on the augmented model would then not only diacritize the input, but also separate clitics. To generate grammatical Arabic, we reconnect the clitics as a post-processing step, using a greedy strategy of connecting each prefix to the following word, and each suffix to the preceding word. (The only case that requires special attention is ka, which can be either a prefix (meaning "as") or a suffix (meaning "your/yours" masc.). The greedy strategy always chooses the suffix meaning; we correct it by comparison with the input text.)

While our handling of clitics helps overcome data sparsity, there is also a potential cost for decoding. Clitics, which are, intuitively speaking, less informative than regular words, are now treated as lexical items of "equal stature". For instance, a bigram model may include the collocation Al>amiyn AlEAm∼ (the secretary general). Once clitics are separated, this becomes Al >amiyn Al EAm∼. A bigram model would no longer retain the connection between the main words >amiyn and EAm∼, but only between each of them and the determiner Al, which is potentially less informative.
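The greedy reconnection step just described can be sketched directly. Assuming the decoder's output is a token list in which clitics are separate words, each prefix is glued to the following word and each suffix to the preceding one; checking suffixes first reproduces the greedy suffix reading of the ambiguous k, which the paper then corrects against the input text (a step omitted here):

    import re

    strip = lambda w: re.sub(r"[aiuoFNK~]", "", w).replace("{", "A")

    PREFIXES = {"b", "l", "k", "w", "f", "Al", "s"}
    SUFFIXES = {"y", "ny", "nA", "k", "kmA", "km", "knA", "kn",
                "h", "hA", "hmA", "hnA", "hm", "hn"}

    def reattach(tokens):
        """Greedy clitic reconnection on decoded, clitic-separated tokens."""
        out, pending = [], ""
        for tok in tokens:
            bare = strip(tok)                 # undiacritized form of the token
            if bare in SUFFIXES and out and not pending:
                out[-1] += tok                # suffix joins the preceding word
            elif bare in PREFIXES:
                pending += tok                # prefixes accumulate: wa + sa + ...
            else:
                out.append(pending + tok)     # main word closes the pending prefixes
                pending = ""
        if pending:
            out.append(pending)               # trailing prefixes (degenerate input)
        return out

    print(reattach(["wa", "sa", "takuwnu"]))  # -> ['wasatakuwnu'] ("and it shall be")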
Figure 5 shows an example transduction through the word-based model, where for illustration purposes, we assume that Aldwl is an unknown word.

Diacritized                  rad∼a Al >amiyn Al EAm∼ li jAmiEap <unk>
LM (weighted, but otherwise unchanged)
SP (change in token resolution from words to letters)
DD (diacritics dropped)      rd Al >myn Al EAm l jAmEp
CC (clitics concatenated)    rd Al>myn AlEAm ljAmEp
UNK (<unk> becomes Aldwl)    rd Al>myn AlEAm ljAmEp Aldwl

Figure 5: Example transduction

2.3 Letter model for unknown words

To diacritize unknown words, we trained a letter-based 4-gram language model of Arabic words, LLM, on the letter sequences of words in the training set. Composing LLM with the vowel-drop transducer, DD, yields a probabilistic generative model of Arabic letter and diacritization patterns, including for words that were never encountered in training.

In principle, we could use the letter model as an alternative model of the full text, but we found it more effective to use it selectively, only on unknown words. Thus, after running the word-based language model, we extract all the words tagged as <unk> and run the letter-based model on them. Here is an example transduction:

Diacritized                  Alduwal
LLM (weighted)
DD (diacritics dropped)      Aldwl

We chose not to apply any special clitic handling for the letter-based model. To see why, consider the alternative model that would include CC. Since LLM is unaware of word tokens, there is no pressure on the decoding algorithm to split the clitics from the word, and clitics may therefore be incorrectly vocalized.
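This unknown-word fallback also has a simple string-level picture: propose candidate diacritizations for an unknown undiacritized word by optionally inserting a short vowel (or sukun) after each letter, and rank the candidates with a letter 4-gram model trained on diacritized words. The sketch below stands in for the LLM and DD composition; the beam, the add-one-style smoothing, and the omission of shadda and hamza alternations are simplifications:

    from collections import defaultdict
    from math import log

    def train_letter_lm(words, n=4):
        """Letter n-gram counts over diacritized training words."""
        grams, ctx = defaultdict(int), defaultdict(int)
        for w in words:
            s = "^" * (n - 1) + w + "$"
            for i in range(n - 1, len(s)):
                grams[s[i - n + 1:i + 1]] += 1
                ctx[s[i - n + 1:i]] += 1
        return grams, ctx

    def score(word, grams, ctx, n=4):
        """Add-one-smoothed log probability; 40 is a rough alphabet size."""
        s = "^" * (n - 1) + word + "$"
        return sum(log((grams.get(s[i - n + 1:i + 1], 0) + 1.0)
                       / (ctx.get(s[i - n + 1:i], 0) + 40.0))
                   for i in range(n - 1, len(s)))

    def diacritize_unknown(bare, grams, ctx, beam=50):
        """Optionally insert a short vowel or sukun after each input letter.
        Partial hypotheses are scored as if complete; adequate for a sketch."""
        hyps = [""]
        for letter in bare:
            hyps = [h + letter + v for h in hyps for v in ("", "a", "u", "i", "o")]
            hyps.sort(key=lambda h: -score(h, grams, ctx))
            hyps = hyps[:beam]
        return hyps[0]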

3 Experiments

We randomly split the set of news articles in each of the two parts of the Arabic treebank into a training and a held-out testing set, of sizes 90% and 10% respectively. We trained both the word-based and the letter-based language models on the diacritized version of the training set. We then ran Viterbi decoding on the undiacritized version of the testing set, which consists of a total of over 14,000 words. As a baseline, we used a unigram word model without clitic handling, constructed using the same transducer technology. We ran two batches of experiments: one in which case endings were stripped throughout the training and testing data and we did not attempt to restore them, and one in which case markings were included. Results are reported in Table 3. For each model, we report two measures: the word error rate (WER), and the diacritization error rate (DER), i.e., the proportion of incorrectly restored diacritics.

  Model                               without case         with case
                                      WER      DER         WER      DER
  Baseline                            15.48%   17.33%      30.39%   24.03%
  3-gram word                         14.64%   16.9%       28.42%   23.34%
  3-gram word + CC                    8.49%    9.32%       24.22%   15.36%
  3-gram word + CC + 4-gram letter    7.33%    6.35%       23.61%   12.79%

Table 3: Results on the Al-Hayat corpus

Surprisingly, a trigram word-based language model yields only a modest improvement over the baseline unigram model. The addition of a clitic connection model and a letter-based language model leads to a marked improvement in both WER and DER. This trend is repeated for both variants of the task, with or without case endings. Including case information naturally yields proportionally worse accuracy; since case markings encode higher-order grammatical information, they would require a more powerful grammatical model than is offered by finite-state methods. To illustrate the system's performance, here are some decodings made by the different versions of the model.

Basic model

• Versions of the word "that" such as an∼a, which may all appear as An in undiacritized text, are often confused. As Buckwalter (2004) notes, the corpus itself is sometimes inconsistent about the use of an∼a.

• Several of the third-person possessive pronoun clitics can appear either with a u or an i; for instance, the third-person singular masculine possessive can appear as either hu or hi. The correct form depends on the preceding letter and vowel (including the case vowels). Part of the tradeoff of treating clitics as independent lexical items is that the word-based model is ignorant of the letter preceding a suffix clitic.

Clitic model

• wstkwn was correctly decoded to wa sa takuwnu, which after post-processing becomes wasatakuwnu ("and [it] shall be").

Letter model

• AstfzAz was correctly decoded to {isotifozAz ("instigation"). This example is interesting, since a morphological analysis would deterministically predict this diacritization. The probabilistic letter model was able to decode it correctly even though it has no explicit encoding of such knowledge.

• Non-Arabic names are obviously problematic for the model. For instance, bwrtlAnd was incorrectly decoded to buwrotilAnoda rather than buwrotlAnod (Portland), though note that some of the diacritics were correctly restored. Al-Onaizan and Knight (2002) proposed a transducer for modeling the Arabic spelling of such names for the purpose of translating from Arabic. Such a model could be seamlessly integrated into our architecture for improved accuracy.

4 Related work

Gal (2002) constructed an HMM-based bigram model for restoring vowels (but not additional diacritics) in Arabic and Hebrew. For Arabic, the model was applied to the Qur'an, a corpus of about 90,000 words, achieving 14% WER. The word-based language model component of our system is very similar to Gal's HMM; the very flexible framework of transducers allows us to easily enhance the model with our simple but effective morphology handler and letter-based language model.

Several commercial tools are available for Arabic diacritization, which unfortunately we did not have access to.

Vergyri and Kirchhoff (2004) evaluated one (unspecified) system on two MSA texts, reporting a 9% DER without case information, and 28% DER with case endings.

Kirchoff et al. (2002) focus on vocalizing transcripts of oral conversational Arabic. Since conversational Arabic is much more free-flowing, and prone to dialect and speaker differences, diacritization of such transcripts proves much more difficult. Kirchoff et al. started from a unigram model, and augmented it with the following heuristic: for each unknown word, search for the closest known unvocalized word according to Levenshtein distance, and apply whatever transformation that word undergoes, yielding 16.5% WER. Our letter-based model provides an alternative method of generalizing diacritization from known words to unknown ones.

Vergyri and Kirchhoff (2004) also handled conversational Arabic, showing that some of the complexity inherent in vocalizing such text can be offset by combining information from the acoustic signal with morphological and contextual information. They treat the latter problem as an unsupervised tagging problem, where each word is assigned a tag representing one of its possible diacritizations according to Buckwalter's morphological analyzer (2002b). They use Expectation Maximization (EM) to train a trigram model of tag sequences. Their evaluation shows that the combined model yields a significant improvement over the acoustic model alone.

5 Conclusions and future directions

We have presented an effective probabilistic finite-state architecture for Arabic diacritization. The modular design of the system, based on a composition of simple and compact transducers, allows us to achieve high levels of accuracy while encoding extremely limited morphological knowledge. In particular, while our system is aware of the existence of Arabic clitics, it has no explicit knowledge of how they can be combined; such patterns are automatically learned from the training data. Likewise, while the system is aware of the different orthographic variants of the glottal stop, it encodes no explicit rules to predict their distribution.

The main resource that our method relies on is the existence of sufficient quantities of diacritized text. Since semitic languages are typically written without vowels, it is rare to find sizable collections of diacritized text in digital form. The alternative is to diacritize text using a combination of manual annotation and computational tools. This is precisely the process that was followed in the compilation of the Arabic treebank, and similar efforts are now underway for Hebrew (Wintner and Yona, 2003).

In contrast to morphological analyzers, which usually provide only an unranked list of all possible analyses, our method provides the most probable analysis, and, with a trivial extension, could provide a ranked n-best list. Reducing and ranking the possible analyses may help simplify the annotator's job. The burden of requiring large quantities of diacritized text could be assuaged by iterative bootstrapping: training the system and manually correcting its output on corpora of increasing size.

As another future direction, we note that occasionally one may find a vowel or two even in otherwise undiacritized text fragments. This is especially true for extremely short text fragments, where ambiguity is undesirable, as in banners or advertisements.
This raises an interesting optimization problem: what is the least number of vowel symbols required to ensure an unambiguous reading, and where should they be placed? Assuming that the errors of the probabilistic model are indicative of the types of errors a human reader might make, we can use this model to predict where disambiguating vowels would be most informative. A simple change to the model described in this paper would make vowel drop optional rather than obligatory. Such a model would then be able to generate not only fully unvocalized text, but also partially vocalized variants of it. The optimization problem would then become one of finding the partially diacritized text with the minimal number of vowels that would be least ambiguous.
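Making vowel drop optional instead of obligatory is, at the string level, a move from deleting all diacritics to enumerating subsets of them to keep. A toy sketch of generating the partially vocalized variants such a model would weigh (selecting a minimally ambiguous variant is the open problem described above):

    from itertools import combinations

    DIACRITICS = set("aiuoFNK~")

    def partial_vocalizations(word, keep=1):
        """Yield variants of `word` retaining exactly `keep` diacritics."""
        slots = [i for i, ch in enumerate(word) if ch in DIACRITICS]
        for kept in combinations(slots, keep):
            yield "".join(ch for i, ch in enumerate(word)
                          if ch not in DIACRITICS or i in kept)

    print(sorted(partial_vocalizations("rad~a")))
    # -> ['rad', 'rda', 'rd~']  (each variant keeps one of a, ~, a)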

Acknowledgments

We thank Ya'akov Gal for his comments on a previous version of this paper. This work was supported in part by grant IIS-0329089 from the National Science Foundation.

References

Salim Abu-Rabia. 1999. The effect of Arabic vowels on the reading comprehension of second- and sixth-grade native Arab children. Journal of Psycholinguistic Research, 28(1):93–101, January.

Yaser Al-Onaizan and Kevin Knight. 2002. Machine transliteration of names in Arabic texts. In Proceedings of the Workshop on Computational Approaches to Semitic Languages, pages 34–46, Philadelphia, July. Association for Computational Linguistics.

Cyril Allauzen, Mehryar Mohri, and Brian Roark. 2003. Generalized algorithms for constructing statistical language models. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL 2003), pages 40–47.

Tim Buckwalter. 2002a. Arabic transliteration table. http://www.qamus.org/transliteration.htm.

Tim Buckwalter. 2002b. Buckwalter Arabic morphological analyzer version 1.0. Linguistic Data Consortium, catalog number LDC2002L49, ISBN 1-58563-257-0.

Tim Buckwalter. 2004. Issues in Arabic orthography and morphology analysis. In Ali Farghaly and Karine Megerdoomian, editors, COLING 2004 Computational Approaches to Arabic Script-based Languages, pages 31–34, Geneva, Switzerland, August. COLING.

Mona Diab, Kadri Hacioglu, and Daniel Jurafsky. 2004. Automatic tagging of Arabic text: From raw text to base phrase chunks. In Susan Dumais, Daniel Marcu, and Salim Roukos, editors, HLT-NAACL 2004: Short Papers, pages 149–152, Boston, Massachusetts, May. Association for Computational Linguistics.

Ya'akov Gal. 2002. An HMM approach to vowel restoration in Arabic and Hebrew. In Proceedings of the Workshop on Computational Approaches to Semitic Languages, pages 27–33, Philadelphia, July. Association for Computational Linguistics.

Slava M. Katz. 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(3):400–401, March.

Katrin Kirchoff, Jeff Bilmes, Sourin Das, Nicolae Duta, Melissa Egan, Gang Ji, Feng He, John Henderson, Daben Liu, Mohamed Noamany, Pat Schone, Richard Schwartz, and Dimitra Vergyri. 2002. Novel approaches to Arabic speech recognition: Report from the 2002 Johns Hopkins summer workshop. Technical report, Johns Hopkins University.

Mehryar Mohri, Fernando Pereira, and Michael Riley. 2000. The design principles of a weighted finite-state transducer library. Theoretical Computer Science, 231(1):17–32.

Fernando C. N. Pereira and Michael Riley. 1997. Speech recognition by composition of weighted finite automata. In Emmanuel Roche and Yves Schabes, editors, Finite-State Devices for Natural Language Processing. MIT Press, Cambridge, MA.

Dimitra Vergyri and Katrin Kirchhoff. 2004. Automatic diacritization of Arabic for acoustic modeling in speech recognition. In Ali Farghaly and Karine Megerdoomian, editors, COLING 2004 Computational Approaches to Arabic Script-based Languages, pages 66–73, Geneva, Switzerland, August. COLING.

Shuly Wintner and Shlomo Yona. 2003. Resources for processing Hebrew. In Proceedings of the MT Summit IX Workshop on Machine Translation for Semitic Languages, New Orleans, September.