Automatic diacritization of Tunisian dialect text using Recurrent Neural Network

Abir Masmoudi, MIRACL Laboratory, University of Sfax, [email protected]
Mariem Ellouze Khmekhem, MIRACL Laboratory, University of Sfax, [email protected]
Lamia Hadrich Belguith, MIRACL Laboratory, University of Sfax, [email protected]

Abstract

The absence of diacritical marks in Arabic texts generally leads to morphological, syntactic and semantic ambiguities. This is even more blatant for under-resourced languages, such as the Tunisian dialect, which suffer from the unavailability of basic tools and linguistic resources, like a sufficient amount of corpora, multilingual dictionaries, and morphological and syntactic analyzers. Thus, processing this language faces greater challenges due to the lack of these resources. The automatic diacritization of MSA text is one of the various complex problems that can be solved by deep neural networks today. Since the Tunisian dialect is an under-resourced variety of Arabic close to MSA, and as there is a lot of resemblance between the two languages, we propose to investigate a recurrent neural network (RNN) for this dialect diacritization problem. This model is compared to our previous CRF and SMT models (24) on the same dialect corpus. We show experimentally that our model achieves better outcomes (DER of 10.72%) than the CRF (DER of 20.25%) and SMT (DER of 33.15%) models.

1 Introduction

Modern Standard Arabic (MSA) as well as Arabic dialects are usually written without diacritics (24). It is easy for native readers to infer the correct pronunciation of undiacritized words, not only from the context but also from their grammatical and lexical knowledge. However, this is not the case for children, new learners and non-native speakers, as they do not have a good mastery of the rich language derivations. Moreover, the absence of diacritical marks leads to ambiguity that affects the performance of NLP tools and tasks, and it generally brings considerable ambiguity at the data-processing level for NLP applications. Hence, automatic diacritization has been shown to be useful for a variety of NLP applications, such as Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and Statistical Machine Translation (SMT) (24).

In this paper, we present our method for automatic dialectal Arabic diacritization. Both previous experience and prior work have shown that a Recurrent Neural Network (RNN) can give better results for an MSA diacritization system than other approaches, such as the lexical language model (16) or a hybrid approach combining statistical and rule-based techniques (28). For instance, the authors of (1) demonstrated in their study that the RNN gave the lowest DER compared to the other MSA diacritization works.

Based on the strong similarity between MSA and the Tunisian dialect, we decided to benefit from these advantages by testing the RNN performance on the automatic diacritization of Arabic dialects. To the best of our knowledge, this is the first work that investigates an RNN for the diacritization of the Tunisian dialect.

In this respect, we performed the task of restoring diacritical marks without taking into account any previous morphological or contextual analysis. Moreover, we examined different aspects of the proposed model with various training options. The latter include the choice of the transcription network (long short-term memory (LSTM) networks, bidirectional LSTM (B-LSTM)) and the impact of RNN sizes. The size of the neural network is a function of the number of hidden layers. Our goal is to choose the most pertinent layers for the Tunisian dialect based on the final findings provided by various experiments.

This model is compared to our previous CRF and SMT models (24) using the same training and testing corpus.

The remainder of this paper is structured as follows: in Section 2 we describe the linguistic background of the Tunisian dialect, illustrate the complexity of the diacritization task with examples, and present the main levels of diacritization. Section 3 introduces our proposed model and experiments. Section 4 provides an exhaustive experimental evaluation that illustrates the efficiency and accuracy of our proposed method. Section 5 summarizes the key findings of the present work and highlights the major directions for future research.

2 Linguistic background

2.1 Tunisian dialect

The language situation in Tunisia is "polyglossic", where distinct language varieties, such as the normative language (MSA) and the usual language (the Tunisian dialect), coexist (24).

As an official language, MSA is extensively present in multiple contexts, namely education, business, arts and literature, and social and legal written documents. However, the Tunisian dialect is the current mother tongue and the spoken language of all Tunisians, whatever their origins and social belongings. For this reason, this dialect occupies a prominent linguistic position in Tunisia.

Another salient feature of the Tunisian dialect is that it is strongly influenced by other foreign languages. In fact, it is the outcome of the interaction between Berber, Arabic and many other languages such as French, Italian, Turkish and Spanish. The manifestation of this interaction is obvious in the introduction of borrowed words. As a result, the lexical register of the Tunisian dialect is more open and contains a very rich vocabulary.

The Tunisian dialect has other specific aspects. Indeed, since this dialect is spoken rather than written or taught at school, there are neither grammatical, nor orthographical, nor syntactic rules to be followed.

2.2 Challenges in the absence of diacritization in Tunisian dialect

The absence of diacritic signs in Tunisian dialect texts often increases the morphological, syntactic and semantic ambiguity of the text. Some of these ambiguities are presented as follows:

• Morphological ambiguity: The absence of the diacritical marks poses an important problem for the association of grammatical information with the undiacritized word (24). For example, the word /lEb/ admits several possible readings that correspond to different solutions at the grammatical labeling level: we can find the plural noun "toys" and the verb "play" in the 3rd person masculine singular of the passive accomplished form.

• Syntactic ambiguity: The ambiguities in the association of grammatical information with undiacritized words pose difficulties for syntactic analysis (24). For example, an undiacritized expression can admit two different diacritization forms that are both syntactically acceptable.

  – We find the diacritization form [The boy wrote on the desk], whose syntactic structure corresponds to a verbal sentence.

  – In addition, we find the diacritization form [The boy's books are on the desk], whose syntactic structure corresponds to a nominal sentence.

• Semantic ambiguity: The different diacritizations of a word can have different meanings, even if they belong to the same grammatical category. For example, among the possible diacritizations of the word /qryt/ we find the following:

  – /qryt/ [I read]

  – /qaryt/ [I taught]

  These two words have the same grammatical category (verb), but they have two different meanings.

2.3 Diacritization level

The use of diacritic symbols is in several instances quite crucial in order to disambiguate homographic words. Indeed, the level of diacritization refers to the number of diacritical marks that must be present on a word to avoid text ambiguity for human readers. There are four possible levels of diacritization.

• Full Diacritization: this is the case where each consonant is followed by a diacritic. This type of diacritization is used mainly in classical Arabic, especially in religion-related books and educational writings.

• Half Diacritization: the objective of this category is to add the diacritics of a word except on the letters that depend on the syntactic analysis of the word. Often, it is the penultimate letter that depends on the syntactic analysis of a word, but this is not always the case due to the use of suffixes.

• Partial Diacritization: this is the case of adding only lexical vowels. The latter can be defined as the vowels with which the meaning of a word changes. The goal of marking these vowels is to remove ambiguity from the meaning of words.

• No Diacritization: this level is completely underspecified. The script is subject to ambiguity, especially with homographs (4).

3 Methodology and experiment step

In recent years, the RNN has received a lot of interest in many NLP sequence transcription problems, such as speech recognition, handwriting recognition and diacritics restoration. We therefore selected the RNN to evaluate its performance on the diacritization of the Tunisian dialect. In this work, we adopted the full diacritization level, at which all diacritics should be specified in a word.

3.1 Recurrent neural networks

An RNN maps a sequence of input observations to a sequence of output labels. The mapping is defined by activation weights and a non-linear activation function, as in a standard MLP. However, recurrent connections allow access to past activations. For an input sequence x_1^T, the RNN computes the hidden sequence h_1^T and the output sequence y_1^T by performing the following operations for t = 1 to T (13):

    h_t = H(W_{xh} x_t + W_{hh} h_{t-1} + b_h)    (1)

    y_t = W_{hy} h_t + b_y    (2)

where H is the hidden layer activation function, W_{xh} is the weight matrix between the input and the hidden layer, W_{hh} is the recurrent weight matrix between the hidden layer and itself, W_{hy} is the weight matrix between the hidden and output layers, and b_h and b_y are the bias vectors of the hidden and output layers, respectively. In a standard RNN, H is usually an element-wise application of the sigmoid function. Such a network is usually trained using back-propagation through time (BPTT) (27).

• Long short-term memory (LSTM)

An alternative RNN called Long Short-Term Memory (LSTM) replaces the conventional neuron with a so-called memory cell controlled by input, output and forget gates, in order to overcome the vanishing gradient problem of traditional RNNs (12). In this case, H can be described by the following composite function (13):

    i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)    (3)

    f_t = σ(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)    (4)

    c_t = f_t c_{t-1} + i_t tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)    (5)

    o_t = σ(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)    (6)

    h_t = o_t tanh(c_t)    (7)

where σ is the sigmoid function, and i, f, o and c correspond to the input, forget and output gates and the cell activation vectors, respectively.

• Bidirectional long short-term memory (B-LSTM)

A B-LSTM processes the input sequence in both directions with two sub-layers in order to account for the full input context. These two sub-layers compute forward and backward hidden sequences \overrightarrow{h} and \overleftarrow{h} respectively, which are then combined to compute the output sequence y as follows (13):

    \overrightarrow{h}_t = H(W_{x\overrightarrow{h}} x_t + W_{\overrightarrow{h}\overrightarrow{h}} \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}})    (8)

    \overleftarrow{h}_t = H(W_{x\overleftarrow{h}} x_t + W_{\overleftarrow{h}\overleftarrow{h}} \overleftarrow{h}_{t+1} + b_{\overleftarrow{h}})    (9)

    y_t = W_{\overrightarrow{h}y} \overrightarrow{h}_t + W_{\overleftarrow{h}y} \overleftarrow{h}_t + b_y    (10)
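To make these recurrences concrete, the short NumPy sketch below implements equations (1)-(7) for a single time step. It is only an illustration, not the authors' code: the dimensions, the parameter container p and the treatment of the peephole weights W_{ci}, W_{cf}, W_{co} as diagonal (element-wise) connections are our own assumptions. A B-LSTM simply runs two such cells over the sequence, one forward and one backward, and combines their outputs as in equations (8)-(10).

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    # Vanilla RNN step, equations (1)-(2); H is taken to be the element-wise sigmoid.
    h_t = sigmoid(W_xh @ x_t + W_hh @ h_prev + b_h)   # hidden state
    y_t = W_hy @ h_t + b_y                            # unnormalized output
    return h_t, y_t

def lstm_step(x_t, h_prev, c_prev, p):
    # LSTM memory-cell step, equations (3)-(7); p is a dict of weight matrices and biases.
    # The cell-to-gate (peephole) weights are assumed diagonal, hence element-wise products.
    i_t = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["W_ci"] * c_prev + p["b_i"])  # input gate
    f_t = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["W_cf"] * c_prev + p["b_f"])  # forget gate
    c_t = f_t * c_prev + i_t * np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])  # cell state
    o_t = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["W_co"] * c_t + p["b_o"])     # output gate
    h_t = o_t * np.tanh(c_t)                                                             # hidden output
    return h_t, c_t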

3.2 Model architecture

In our diacritization task, the basic idea is to attribute a corresponding diacritical label to each character. Hence, we apply the RNN to model our sequence data, where a sequence of characters constitutes the input and the probability distribution over diacritics forms the output. Our RNN structure is shown schematically in Figure 1.

Figure 1: Architecture of our neural network

This can be expressed formally as follows. Given W = (w_1 ... w_T), w indicates a sequence of characters, where each character w_t is related to a label l_t. In this respect, a label may stand for 0, 1 or more diacritics. Furthermore, a real-valued vector x_w is the representation of each character w of the alphabet.

Our neural network consists of 3 layers, namely an input layer, a hidden layer and an output layer. Each layer fulfils a particular purpose. In what follows, we explain the role of each layer.

• Input layer: This level consists in mapping the letter sequence w to a vector sequence x. We checked and prepared the data of our corpus. Since a diacritical mark such as the "shadda" can be combined with another diacritic, each character in the corpus has a label corresponding to 0, 1 or 2 diacritics. The character embedding input, which is initialized randomly and updated during training, means that each character in the input sentence is represented by a vector of d real numbers. It is worth pointing out that a linear projection can be added after the input layer to learn a new representation of this embedding.

• Hidden layer: This layer consists in mapping the vector sequence x to a hidden sequence h. Several types of hidden layers have been tried in order to obtain the best performance and the best result in the automatic diacritization of the Tunisian dialect. Hence, these experiments were based on configurations ranging from one LSTM layer to multiple B-LSTM layers.

• Output layer: This last layer focuses on mapping each hidden vector h_t to a probability distribution over labels l. In this layer, we use a softmax activation function to produce a probability distribution over the output alphabet at each time step:

    P(l | w_t) = exp(y_t[l]) / Σ_{l'} exp(y_t[l'])    (11)

where y_t = W_{hy} h_t + b_y and y_t[l] is the l-th element of y_t.
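As an illustration of how such a character-level architecture can be assembled, here is a minimal Keras sketch. It is not the authors' implementation: the vocabulary size, number of label classes and embedding dimension are placeholders of our own choosing, and the three stacked B-LSTM layers of 250 units correspond to the best configuration reported later in Section 4.3.

import tensorflow as tf

VOCAB_SIZE = 60   # number of distinct input characters (placeholder value)
NUM_LABELS = 9    # number of diacritic label classes, including "no diacritic" (placeholder value)
EMBED_DIM = 50    # dimension d of the character embeddings (placeholder value)

model = tf.keras.Sequential([
    # Input layer: map every character to a d-dimensional embedding learned during training.
    tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=EMBED_DIM, mask_zero=True),
    # Hidden layers: three stacked bidirectional LSTM layers of 250 units each.
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(250, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(250, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(250, return_sequences=True)),
    # Output layer: softmax over the diacritic labels at every character position (equation 11).
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(NUM_LABELS, activation="softmax")),
])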

3.3 Experiments

As mentioned above, in order to train the RNN to achieve high accuracy, we ran our experiments with several training options. These options include the number of layers in the hidden part of the network. The aim of these experiments is to determine which options give the optimal accuracy. Indeed, we tested several types of hidden layers, ranging from one LSTM layer to multiple B-LSTM layers.

The network is trained using a gradient descent optimizer with a learning rate of 0.0003 and a mini-batch size of 200. For dropout, a rate of 0.2 is used both on the embedded inputs and after each type of hidden layer, either LSTM or B-LSTM. Weights are randomly initialized from a normal distribution with zero mean and 0.1 standard deviation, and the weights are updated after every batch. The loss function is the cross-entropy loss summed over all outputs. The GPU used is an Nvidia GTX 580, which has 16 streaming multiprocessors and 1.5 GB of memory.
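In Keras, these training options could be written roughly as below. This is only a sketch under our own assumptions: the model argument is assumed to be built as in the previous listing, the number of epochs is a placeholder, and the exact loss reduction (sum versus mean over positions) only rescales the gradients.

import tensorflow as tf

def recurrent_block(units=250):
    # One B-LSTM layer with weights drawn from N(0, 0.1), followed by dropout at rate 0.2,
    # matching the initialization and dropout choices described above; a Dropout(0.2)
    # layer would likewise follow the character embedding.
    init = tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.1)
    return [
        tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(units, return_sequences=True,
                                 kernel_initializer=init, recurrent_initializer=init)),
        tf.keras.layers.Dropout(0.2),
    ]

def compile_and_train(model, x_train, y_train):
    # Gradient descent with a learning rate of 0.0003; cross-entropy over the diacritic
    # label at every character position.
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.0003),
                  loss="sparse_categorical_crossentropy")
    # Mini-batches of 200 sequences; the weights are updated after every batch.
    model.fit(x_train, y_train, batch_size=200, epochs=10)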

4 Results and discussion

4.1 Evaluation Metric

In order to measure our model performance, an evaluation metric known as the Diacritic Error Rate (DER) is generally used. DER indicates how many letters have had their diacritics incorrectly restored. The DER can be calculated as follows (24):

    DER = (1 - |T_S| / |T_G|) * 100    (12)

where |T_S| is the number of letters correctly assigned by the system, and |T_G| is the number of diacritized letters in the gold standard texts.
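Read literally, equation (12) is a per-letter error count. A small helper of our own, assuming the system output and the gold standard are already aligned letter by letter and that every gold letter carries a diacritic label, would look like this:

def diacritic_error_rate(system_letters, gold_letters):
    # DER = (1 - |T_S| / |T_G|) * 100, where |T_S| is the number of letters whose
    # diacritics the system restored correctly and |T_G| is the number of
    # diacritized letters in the gold standard (equation 12).
    assert len(system_letters) == len(gold_letters)
    correct = sum(1 for s, g in zip(system_letters, gold_letters) if s == g)
    return (1 - correct / len(gold_letters)) * 100

# Toy example with hypothetical transliterated letter+diacritic pairs:
gold = ["q+a", "r+0", "y+0", "t+u"]
hyp = ["q+0", "r+0", "y+0", "t+u"]
print(diacritic_error_rate(hyp, gold))   # 25.0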

4.2 Datasets

This section shows a breakdown of the different sizes of our data sets, which were gathered from various sources. So far, we have used four existing types of corpora in our work.

• We made use of our TARIC corpus (Tunisian Arabic Railway Interaction Corpus) (24). The latter collected information about the Tunisian dialect used in a railway station. This corpus was recorded in the ticket offices of the Tunis railway station. We recorded conversations in which there was a request for information about train schedules, fares, bookings, etc. This corpus consists of several dialogues; each dialogue is a complete interaction between a clerk and a client. All the words are written using the Arabic alphabet with diacritics. The diacritics indicate how the word is pronounced; the same word can have more than one pronunciation.

• The second corpus is called STAC (Spoken Tunisian Arabic Corpus) (35). This corpus is a representation of spontaneous discourse in the Tunisian dialect. This dialect corpus of transcribed discourse deals with multiple themes, such as social affairs, health, religion, etc.

• We utilized another type of corpus that is the result of a conversion tool from texts written in Latin script (also called Arabizi) into Arabic script following the CODA convention. The Arabizi corpus is collected from social media such as Facebook and Twitter and from SMS messaging (22).

• In order to alleviate the lack of resources for the Tunisian dialect, we also chose to gather corpora from blog sites written in this dialect using the Arabic alphabet (24). (For more details see (24).)

As mentioned above, the Tunisian dialect differs from MSA and it does not have a standard spelling, because there are no academies of Arabic dialects. Thus, to obtain coherent learning data, it is necessary to adopt a standard spelling. Indeed, there are words with many written forms. For example, the word /reservation/ can be written correctly in four different ways.

In this work, the CODA spelling transcription guidelines (the Tunisian dialect writing convention) (36) were adopted. CODA is a conventionalized orthography for Dialectal Arabic. In CODA, every word has a single orthographic representation. It uses MSA-consistent and MSA-inspired orthographic decisions (rules, exceptions and ad hoc choices). CODA also preserves dialectal morphology and dialectal syntax. CODA is easily learnable and readable. CODA has been designed for the Egyptian dialect (11) as well as the Tunisian dialect (36) and the Palestinian Levantine dialect (20). For a full presentation of CODA and an explanation of its choices, see (11) and (36).

The normalization step is essential because it is a key point for the other steps of our method. Among the normalizations of Tunisian dialect words we have:

• The number "sixteen" is written in a single standard form.

• To express the future, we must follow the form: future particle /bAš/ + verb.

• To express negation, we must follow the form: /mA/ + verb.

Let us recall that the Tunisian dialect is characterized by the presence of foreign words from French, English, Spanish, Italian, etc. To transcribe these words, the Arabic alphabet is used, and each foreign word is given a unique form, for example the French words "retour" [Back] and "train".

At the end of this step, we obtain a standardized corpus. Figure 2 represents a corpus extract before the normalization step, and Figure 3 represents a corpus extract after the normalization step.

Figure 2: Corpus extract before the normalisation step

Figure 3: Corpus extract after the normalisation step

Since there are no automatic diacritization tools for the Tunisian dialect, and because the MSA tools are unable to handle this dialect due to the differences between MSA and the Tunisian dialect, we were obliged to diacritize the corpus manually.

Below we provide the most important charac- teristics in Table 1.

Table 1: Characteristics of our corpus.

            # statements    # words
TARIC       21,102          71,684
STAC        4,862           42,388
Arabizi     3,461           31,250
Blogs       1,582           27,544
Total       31,007          172,866

We aimed to carefully create training, development and test sets from the available datasets in order to judge our diacritization models. We randomly selected 23,255 sentences for training, 1,550 for development and 6,202 for testing.
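Such a random split can be reproduced with a simple shuffle-and-slice over the sentence list. The sketch below uses the sizes reported above, but the authors' actual sampling procedure is not described in the paper.

import random

def split_corpus(sentences, n_train=23255, n_dev=1550, n_test=6202, seed=0):
    # Randomly partition the diacritized sentences into train/dev/test sets.
    assert len(sentences) >= n_train + n_dev + n_test
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:n_train + n_dev + n_test]
    return train, dev, test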

Table 2 reports some quantitative information for the datasets.

Table 2: Tunisian dialect diacritization corpus statistics.

                Train       Dev       Test
# Statements    23,255      1,550     6,202
# Words         129,649     8,643     34,574
# Letters       648,247     43,216    172,867

4.3 Results

In this section, we present the evaluation outcome of our diacritization models, using DER as the evaluation metric. The adopted RNN has from 1 to 4 hidden layers, each with 250 neurons. This number was chosen after different experiments: a smaller number of neurons (fewer than 250) hurts the accuracy rate, and a greater number does not improve it significantly. Table 3 gives an overview of the RNN models' outcomes in terms of diacritic error rate (DER).

Table 3: Diacritization Error Rate summary for the Tunisian dialect RNN models.

Model              DER
LSTM               13.86%
B-LSTM             12.31%
2-layer B-LSTM     11.53%
3-layer B-LSTM     10.72%
4-layer B-LSTM     10.83%

Table 3 shows the effect of using LSTM and B-LSTM models, and of the number of hidden layers, on the DER. According to this table, the results show a DER of 13.86% for LSTM and 12.31% for B-LSTM. Based on these results, we observe an improvement of 1.55% in DER of the B-LSTM model as compared to the LSTM model. This means that B-LSTM performs better than LSTM.

Moreover, we noticed that increasing the number of B-LSTM layers from one hidden layer to three layers improves accuracy. We chose the 3-layer B-LSTM because its accuracy is close to that of the 4-layer B-LSTM while it is faster: the training time rises from 3.52 to 6.78 hours when the number of layers goes from 3 to 4, and the testing time increases from 3.65 to 5.41 minutes. Hence, the 3-layer B-LSTM configuration was adopted.

To conclude, the 3-layer B-LSTM model achieved the best results.

4.4 Error Analysis

In order to reveal the weaknesses of our automatic diacritic restoration RNN models, we analyzed the errors, which are mainly due to the following reasons:

• Errors due to the presence of foreign words in our corpus.

• Errors on words with prefixes, suffixes or both. It is hard to diacritize these complex words correctly, as the inflection diacritical mark is related to the last letter of the stem rather than to the last letter of the suffix.

• Form/spelling diacritization errors, caused by "shadda" (consonant doubling) or "tanween" (nunation). We observed that restoring the "shadda" is harder than restoring the other diacritics.

• Errors due to missing and incorrect short vowels (i.e. lexical diacritics).

We manually checked 150 error samples of our RNN model output. Figure 4 shows an example of 4 sample sequences that contain errors; the erroneous words are underlined and in red.

Figure 4: Sample sequences with errors

In about 21% of the samples, we remarked that the absence of "shadda" in some words can lead to a semantic ambiguity of the verb. For instance, sample 1 shows that the target verb /qaryt/ [I taught] is output as /qryt/ [I read]. These two diacritized words have the same grammatical category (verb), but they have two different meanings.

Diacritization errors in the test samples themselves cause about 4% of the errors. For example, in sample 2, the "fatha" in one word was mistakenly entered after the last letter rather than the first letter.

Errors on words with prefixes, suffixes or both are significant, appearing in about 41% of the samples. As another illustration, the error word in sample 3 has both the prefix "la" and the pronoun suffix "haA".

We also noticed a significant fraction of error words (34%) due to the presence of foreign words in our corpus, as in sample 4.

4.5 Comparison with State-of-the-art Systems

In this section, we compare our proposed RNN model with two other models, namely an SMT model and a discriminative model that treats diacritization as a sequence classification task based on CRFs (24). These two models were realized in our previous works in order to carry out diacritic restoration for the Tunisian dialect. To achieve this comparison, we employed the same dialectal corpus and evaluation metrics.

Concerning the first model, we regarded the diacritization problem as a simplified phrase-based SMT task. The source language is the undiacritized text while the target language is the diacritized text. The basic idea of SMT is to automatically analyze existing pairs of human-produced sentences, called a parallel corpus, in order to build a translation model. The alignment from words without diacritics to words with diacritics is a monotonic mapping. To do this, we employed Moses (21), an SMT tool. Word alignment was done with GIZA++ (25). We implemented a 5-gram language model using the SRILM toolkit (31), and we decoded using Moses (21).
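For readers unfamiliar with this toolchain, the calls below sketch the language-model and decoding steps with SRILM and Moses. The file names are hypothetical, and the word-alignment and phrase-extraction steps (run with GIZA++ through the Moses training scripts) are only summarized in a comment rather than spelled out; this is not the authors' actual pipeline.

import subprocess

# Build a 5-gram language model over the diacritized (target) side with SRILM.
subprocess.run(["ngram-count", "-order", "5",
                "-text", "train.diacritized.txt",
                "-lm", "lm.arpa"], check=True)

# Word alignment with GIZA++ and phrase-table extraction are normally driven by the
# Moses training script (train-model.perl); its options are omitted here.

# Decode: "translate" undiacritized input into diacritized output with the Moses decoder,
# using the moses.ini configuration produced by training.
with open("test.undiacritized.txt") as fin, open("test.hyp.txt", "w") as fout:
    subprocess.run(["moses", "-f", "moses.ini"], stdin=fin, stdout=fout, check=True)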

In the second model, we obtained the restoration of diacritical marks by focusing on diacritization based on grammatical information. We intended to build dependency relations between words and POS tags and to observe their effects on word diacritization. In fact, we proposed to scrutinize the integration of grammatical information (POS) into the diacritization with the aid of Conditional Random Fields (CRF) (24).

Table 4 reviews the accuracy results for automatic diacritic restoration from our previously published research on the Tunisian dialect, together with our RNN model.

Table 4: Diacritization results of related work (CRF and SMT models) and our RNN model.

Model              DER
3-layer B-LSTM     10.72%
CRF                20.25%
SMT                33.15%

As depicted in Table 4, our RNN model (3-layer B-LSTM) provides the best results (DER of 10.72%) compared to both the SMT (DER of 33.15%) and CRF (DER of 20.25%) models.

5 Conclusion

The absence of short vowels gives rise to a great ambiguity which influences the results of NLP applications. An outcome of this study was the development of an RNN diacritic restoration model for the Tunisian dialect. To the best of our knowledge, this is the first work that deals with the problem of Tunisian dialect diacritization using an RNN.

In order to choose the best configuration of the RNN network, we carried out several preliminary experiments with different training options. These options concern the hidden layer, where we tested the impact of the change of the neural network size and topology on its performance. Several types of hidden layers were tested, ranging from one LSTM layer to multiple B-LSTM layers. The best accuracy is obtained when using the 3-layer B-LSTM model (DER of 10.72%). We compared our RNN diacritization model with two major models, namely the SMT and CRF models (24). These two models were realized in our previous works in order to carry out diacritic restoration for the Tunisian dialect. During this comparison, we employed the same dialectal corpus and evaluation metrics. A DER improvement of about 9.53% was achieved by the RNN model over the best reported CRF model.

We have two future plans for the diacritization of the Tunisian dialect. The first plan consists in developing a rule-based diacritizer system and integrating it into our RNN model in order to improve the outcomes. The second plan focuses on providing the morphological analysis of words to the RNN in order to achieve higher accuracy. The presence of a significant fraction of errors in complex words that contain prefixes, suffixes, or both opens up new perspectives for future research.

References

(1) Abandah, G., Graves, A., Al-Shagoor, B., Arabiyat, A., Jamour, F. and Al-Taee, M. 2015. Automatic diacritization of Arabic text using recurrent neural networks. International Journal on Document Analysis and Recognition (IJDAR).
(2) Alotaibi, Y. A., Meftah, A. H. and Selouani, S. A. 2013. Diacritization, Automatic Segmentation and Labeling for Speech. Digital Signal Processing and Signal Processing Education Meeting (DSP/SPE).
(3) Alqudah, S., Abandah, G. and Arabiyat, A. 2017. Investigating Hybrid Approaches for Arabic Text Diacritization with Recurrent Neural Networks. 2017 IEEE Jordan Conference on Applied Electrical Engineering and Computing Technologies.
(4) Al-Badrashiny, M., Hawwari, A. and Diab, M. 2017. A Layered Language Model based Hybrid Approach to Automatic Full Diacritization of Arabic. Third Arabic Natural Language Processing Workshop.
(5) Ameur, M., Moulahoum, Y. and Guessoum, A. 2015. Restoration of Arabic Diacritics Using a Multilevel Statistical Model. IFIP International Federation for Information Processing.
(6) Ayman, A. Z., Elmahdy, M., Husni, H. and Al Jaam, J. 2016. Automatic diacritics restoration for Arabic text. International Journal of Computing & Information Science.
(7) Azmi, A. and Almajed, R. 2015. A survey of automatic Arabic diacritization techniques. Natural Language Engineering, 21, pages 477-495.
(8) Belinkov, Y. and Glass, J. 2015. Arabic diacritization with recurrent neural networks. Conference on Empirical Methods in Natural Language Processing.
(9) Bouamor, H., Zaghouani, W., Diab, M., Obeid, O., Kemal, O., Ghoneim, M. and Hawwari, A. 2015. A pilot study on Arabic multi-genre corpus diacritization annotation. The Second Workshop on Arabic Natural Language Processing.
(10) Diab, M., Ghoneim, M. and Habash, N. 2007. Arabic Diacritization in the Context of Statistical Machine Translation. In Proceedings of MT Summit, Denmark.
(11) Diab, M., Habash, N. and Rambow, O. 2012. Conventional Orthography for Dialectal Arabic. In Proceedings of the Language Resources and Evaluation Conference, Istanbul.
(12) Gers, F. 2001. Long short-term memory in recurrent neural networks. Ph.D. dissertation, Department of Computer Science, Swiss Federal Institute of Technology, Lausanne (EPFL), Switzerland.
(13) Graves, A., Mohamed, A. and Hinton, G. 2013. Speech recognition with deep recurrent neural networks. IEEE International Conference on Acoustics, Speech, and Signal Processing, Canada.
(14) Fashwan, A. and Alansary, S. 2017. SHAKKIL: an automatic diacritization system for modern standard Arabic texts. The Third Arabic Natural Language Processing Workshop (WANLP).
(15) Habash, N., Shahrour, A. and Al-Khalil, M. 2016. Exploiting Arabic Diacritization for High Quality Automatic Annotation. The Tenth International Conference on Language Resources and Evaluation (LREC 2016).
(16) Habash, N. and Rambow, O. 2007. Arabic diacritization through full morphological tagging. The Conference of the North American Chapter of the Association for Computational Linguistics.
(17) Hamed, O. and Zesch, T. 2017. A Survey and Comparative Study of Arabic Diacritization Tools. Journal of Language Technology and Computational Linguistics, volume 32, number 1.
(18) Harrat, S., Abbas, M., Meftouh, K. and Smaili, K. 2013. Diacritics restoration for Arabic dialect texts. 14th Annual Conference of the International Speech Communication Association.
(19) Hifny, Y. 2012. Higher order n-gram language models for Arabic diacritics restoration. In Proceedings of the 12th Conference on Language Engineering.
(20) Jarrar, M., Habash, N., Akra, D. and Zalmout, N. 2014. Building a Corpus for Palestinian Arabic: a Preliminary Study. In Proceedings of the Arabic Natural Language Processing Workshop, EMNLP, Doha.
(21) Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M. and Bertoldi, N. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. ACL 2007, demonstration session.
(22) Masmoudi, A., Habash, N., Khmekhem, M., Esteve, Y. and Belguith, L. 2015. Arabic Transliteration of Romanized Tunisian Dialect Text: A Preliminary Investigation. Computational Linguistics and Intelligent Text Processing, 16th International Conference, CICLing 2015, Cairo, Egypt, pages 608-619.
(23) Masmoudi, A., Bougares, F., Ellouze, M., Esteve, Y. and Belguith, L. 2018. Automatic speech recognition system for Tunisian dialect. Language Resources and Evaluation 52(1): 249-267.
(24) Masmoudi, A., Mdhaffar, S., Sellami, R. and Belguith, L. 2018. Automatic Diacritics Restoration for Tunisian Dialect. TALLIP 2018: Transactions on Asian and Low-Resource Language Information Processing.
(25) Och, F. and Ney, H. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19-52.
(26) Rashwan, M., Al Sallab, A., Raafat, H. and Rafea, A. 2015. Deep Learning Framework with Confused Sub-Set Resolution Architecture for Automatic Arabic Diacritization. IEEE/ACM Transactions on Audio, Speech, and Language Processing.
(27) Rumelhart, D., Hinton, G. and Williams, R. 1986. Learning representations by back-propagating errors. Nature, no. 323, pp. 533-536.
(28) Said, A., El-Sharqwi, M., Chalabi, A. and Kamal, E. 2013. A hybrid approach for Arabic diacritization. Application of Natural Language to Information Systems, pp. 53-64.
(29) Shaalan, K., Abo Bakr, M. and Ziedan, I. 2009. A hybrid approach for building Arabic diacritizer. In Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages.
(30) Shaalan, K., Abo Bakr, H. and Ziedan, I. 2008. A statistical method for adding case ending diacritics for Arabic text. In Proceedings of the Language Engineering Conference.
(31) Stolcke, A. 2002. SRILM: An Extensible Language Modeling Toolkit. Proceedings of ICSLP.
(32) Zaghouani, W., Bouamor, H., Hawwari, A., Diab, M., Obeid, O., Ghoneim, M., Alqahtani, S. and Oflazer, K. 2016. Guidelines and framework for a large-scale Arabic diacritized corpus. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016).
(33) Zitouni, I., Sorensen, J. and Sarikaya, R. 2006. Maximum entropy based restoration of Arabic diacritics. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics.
(34) Zitouni, I. and Sarikaya, R. 2009. Arabic Diacritic Restoration Approach Based on Maximum Entropy Models. Journal of Computer Speech and Language.
(35) Zribi, I., Ellouze, M., Belguith, L. H. and Blache, P. 2015. Spoken Tunisian Arabic Corpus "STAC": Transcription and Annotation. Res. Comput. Sci. 90.
(36) Zribi, I., Boujelben, R., Masmoudi, A., Ellouze, M., Belguith, L. and Habash, N. 2014. A Conventional Orthography for Tunisian Arabic. LREC 2014, Reykjavik, Iceland.
