arXiv:1911.03531v1 [cs.CL] 8 Nov 2019

Neural Arabic Text Diacritization: State of the Art Results and a Novel Approach for Machine Translation

Ali Fadel, Ibraheem Tuffaha, Bara’ Al-Jawarneh, and Mahmoud Al-Ayyoub
Jordan University of Science and Technology, Irbid, Jordan
aliosm1997, bro.t.1996, baraaaljawarneh, [email protected]

Abstract

In this work, we present several deep learning models for the automatic diacritization of Arabic text. Our models are built using two main approaches, viz. Feed-Forward Neural Network (FFNN) and Recurrent Neural Network (RNN), with several enhancements such as 100-hot encoding, embeddings, Conditional Random Field (CRF) and Block-Normalized Gradient (BNG). The models are tested on the only freely available benchmark dataset and the results show that our models are either better or on par with other models, which require language-dependent post-processing steps, unlike ours. Moreover, we show that diacritics in Arabic can be used to enhance the models of NLP tasks such as Machine Translation (MT) by proposing the Translation over Diacritization (ToD) approach.

1 Introduction

In Arabic and many other languages, diacritics are added to the characters of a word (as short vowels) in order to convey certain information about the meaning of the word as a whole and its place within the sentence. Arabic Text Diacritization (ATD) is an important problem with various applications such as text to speech (TTS). At the same time, this problem is very challenging even for native speakers of Arabic, due to the many subtle issues in determining the correct diacritic for each character from the list shown in Figure 2 and the lack of practice for many native speakers. Thus, the need to build automatic Arabic text diacritizers is high (Zitouni and Sarikaya, 2009).

The meaning of a sentence is greatly influenced by the diacritization, which is determined by the context of the sentence, as shown in the following example:

... كلم أحمد
Buckwalter Transliteration: klm >Hmd ...
Incomplete sentence without diacritization.

كَلَّمَ أَحْمَدٌ صَدِيقَهُ
Buckwalter Transliteration: kal~ama >aHomadN Sadiyqahu
Translation: Ahmad talked to his friend.

كَلَمَ أَحْمَدٌ عَدُوَّهُ
Buckwalter Transliteration: kalama >aHomadN Eaduw~ahu
Translation: Ahmad wounded his enemy.

The letters كلم “klm” manifest as two different words when given two different diacritizations. As shown in this example, كَلَّمَ “kal~ama” in the first sentence is the verb ‘talked’ in English, while كَلَمَ “kalama” in the second sentence is the verb ‘wounded’ in English.

To formulate the problem formally: given a sequence of characters representing an Arabic sentence S, find the correct diacritic class (from Figure 2) for each Arabic character S_i in S.

Despite the problem’s importance, it has received limited attention. One of the reasons for this is the scarcity of freely available resources for this problem. To address this issue, the Tashkeela Corpus [1] (Zerrouki and Balla, 2017) has been released to the community. Unfortunately, there are many problems with the use of this corpus for benchmarking purposes. A very recent study (Fadel et al., 2019) discussed these issues in detail and provided a cleaned version of the dataset with a pre-defined split into training, testing and validation sets. In this work, we use this dataset and provide yet another extension of it with a larger training set and a new testing set to circumvent the issue that some of the existing systems have already been trained on the entire Tashkeela Corpus.

[1] https://sourceforge.net/projects/tashkeela

According to (Fadel et al., 2019), existing approaches to ATD are split into two groups: traditional rule-based approaches and machine learning based approaches. The former was the main approach of many researchers such as (Zitouni and Sarikaya, 2009; Pasha et al., 2014; Darwish et al., 2017), while the latter has started to receive attention only recently (Belinkov and Glass, 2015; Abandah et al., 2015; Barqawi and Zerrouki, 2017; Mubarak et al., 2019). Based on the extensive experiments of (Fadel et al., 2019), deep learning approaches (aka neural approaches) are superior to non-neural approaches, especially when large training data is available. In this work, we present several neural ATD models and compare their performance with the state of the art (SOTA) approaches to show that our models are either on par with the SOTA approaches or even better. Finally, we present a novel way to utilize diacritization in order to enhance the accuracy of Machine Translation (MT) models in what we call the Translation over Diacritization (ToD) approach.

The rest of the paper is organized as follows. The following section discusses the dataset proposed by (Fadel et al., 2019). Sections 3 and 4 discuss our two main approaches: Feed-Forward Neural Network (FFNN) and Recurrent Neural Network (RNN), respectively. Section 5 briefly discusses the related work and presents a comparison with the SOTA approaches, while Section 6 describes our novel approach to integrate diacritization into translation tasks. The paper concludes in Section 7 with final remarks and future directions of this work.

2 Dataset

The dataset of (Fadel et al., 2019) (which is an adaptation of the Tashkeela Corpus) consists of about 2.3M words spread over 55K lines. Basic statistics about this dataset’s size, content and diacritics usage are given in Table 1.

Table 1: Statistics about the size, content and diacritics usage of (Fadel et al., 2019)’s dataset

                        Train     Valid     Test
Words Count             2,103K    102K      107K
Lines Count             50K       2.5K      2.5K
Avg Chars/Word          3.97      3.97      3.97
Avg Words/Line          42.06     40.97     42.89
0 Diacritics (%)        17.78     17.75     17.80
1 Diacritic (%)         77.17     77.19     77.22
2 Diacritics (%)        5.03      5.05      4.97
Error Diacritics (%)    0         0         0

Among the resources provided with this dataset are new definitions of the Diacritic Error Rate (DER), which is “the percentage of misclassified Arabic characters regardless of whether the character has 0, 1 or 2 diacritics”, and the Word Error Rate (WER), which is “the percentage of Arabic words which have at least one misclassified Arabic character”. [2] The redefinition of these measures is to exclude counting irrelevant characters such as numbers and punctuation, which were included in (Zitouni and Sarikaya, 2009)’s original definitions of DER and WER. It is worth mentioning that DER/WER are computed in four different ways in the literature, depending on whether the last character of each word (referred to as case ending) is counted or not and whether the characters with no diacritization are counted or not.

[2] DER/WER are computed with diacritization stat.py

3 The Feed-Forward Neural Network (FFNN) Approach

This is our first approach and we present three models based on it. In this approach, we consider diacritizing each character as an independent problem. To do so, the model takes a 100-dimensional vector as an input representing features for a single character in the sentence. The first 50 elements in the vector represent the 50 non-diacritic characters before the current character and the last 50 elements represent the 50 non-diacritic characters after it, including the current character.

For example, for the sentence ‘ذهب علي’, the vector related to the character ‘ب’ is as shown in Figure 1. As the figure shows, there are two characters before the character ‘ب’ and four after it (including the whitespace). The special token ‘<PAD>’ is used as a filler when there are no characters to feed. Note that the dataset contains 73 unique characters (without the diacritics), which are mapped to unique integer values from 0 to 74 after sorting them based on their Unicode representations, including the special padding and unknown (‘<UNK>’) tokens.

Figure 1: Vector representation of a FFNN example.

Each example belongs to one of the 15 classes under consideration, which are shown in Figure 2. The model outputs probabilities for each class. Using a Softmax output unit, the class with maximum probability is considered as the correct output.

Figure 2: The 15 classes under consideration.

The numbers of training, validation and testing examples obtained by converting the dataset into examples as described earlier are 9,017K, 488K and 488K, respectively.

Basic Model. The basic model consists of 17 hidden layers of different sizes. The activation function used in all layers is Rectified Linear Unit (ReLU) and the number of trainable parameters is about 1.5M. For more details see Appendix A.

[...] model is trained with the same configurations as the 100-hot model and the training time is about 2.5 hours only.

Results and Analysis. Although the idea of diacritizing each character independently is counter-intuitive, the results of the FFNN models on the test set (shown in Table 2) are very promising, with the embeddings model having an obvious advantage over the basic and 100-hot models and performing much better than Mishkal, the best rule-based diacritization system among the systems reviewed by (Fadel et al., 2019) (Mishkal DER: 13.78% vs FFNN Embeddings model DER: 4.06%). However, these models are still imperfect. More detailed error analysis of these models is available in Appendix A.

4 The Recurrent Neural Network (RNN) Approach

Since RNN models usually need huge data to train on and learn high-level linguistic abstractions, we prepare an external training dataset following the guidelines of (Fadel et al., 2019). The extra training dataset is extracted from the Classical Ara-
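To make the DER and WER definitions quoted in Section 2 concrete, the following is a minimal sketch of how the two rates could be computed. It is not the script mentioned in the footnote. The function name `der_wer` and the input layout are assumptions made for illustration: each word is given as a list of (gold, predicted) diacritic labels, one pair per Arabic character, with numbers and punctuation already filtered out upstream.

```python
def der_wer(words):
    """Return (DER, WER) in percent.

    `words` is a list of words; each word is a list of (gold, predicted)
    diacritic-class pairs, one pair per Arabic character. Non-Arabic
    characters are assumed to have been removed before calling this.
    """
    total_chars = wrong_chars = 0
    total_words = wrong_words = 0
    for word in words:
        if not word:
            continue  # skip words with no Arabic characters
        errors = sum(1 for gold, pred in word if gold != pred)
        total_chars += len(word)
        wrong_chars += errors
        total_words += 1
        if errors:
            wrong_words += 1  # one wrong character makes the whole word wrong
    der = 100.0 * wrong_chars / total_chars if total_chars else 0.0
    wer = 100.0 * wrong_words / total_words if total_words else 0.0
    return der, wer
```

This sketch counts every character, including case endings and characters with no diacritics, so it corresponds to only one of the four counting variants mentioned in Section 2.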

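The 100-dimensional FFNN input described in Section 3 can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the helper names `build_vocab` and `encode_position` are hypothetical, the special tokens are assumed to take ids 0 and 1, and padding is assumed to go on the outer side of each half of the window.

```python
PAD, UNK = "<PAD>", "<UNK>"
WINDOW = 50  # characters on each side, as described in Section 3


def build_vocab(chars):
    # Assumption: special tokens come first, then the characters in
    # Unicode code-point order (Python sorts strings by code point).
    return {tok: i for i, tok in enumerate([PAD, UNK] + sorted(set(chars)))}


def encode_position(sentence, i, vocab, window=WINDOW):
    """Encode the undiacritized character at index i as 2 * window integers."""
    before = list(sentence[max(0, i - window):i])
    after = list(sentence[i:i + window])  # includes the current character
    # Assumption: pad on the left of the left context and on the right
    # of the right context.
    before = [PAD] * (window - len(before)) + before
    after = after + [PAD] * (window - len(after))
    return [vocab.get(ch, vocab[UNK]) for ch in before + after]


# The example sentence from Section 3, written with escapes:
# dhal, heh, beh, space, ain, lam, yeh.
sentence = "\u0630\u0647\u0628 \u0639\u0644\u064a"
vocab = build_vocab(sentence)
vector = encode_position(sentence, 2, vocab)  # index 2 is the letter beh
```

Here `vector` has 100 entries: 48 padding ids, the two preceding letters, the current letter followed by the four characters after it, and 45 padding ids. In the paper the vocabulary is built from the whole dataset (73 characters plus the two special tokens), not from a single sentence as in this sketch.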