Automatic Diacritization of Tunisian Dialect Text Using Recurrent Neural Network

Abir Masmoudi, Mariem Ellouze Khmekhem, Lamia Hadrich Belguith
MIRACL Laboratory, University of Sfax

Abstract

The absence of diacritical marks in Arabic texts generally leads to morphological, syntactic and semantic ambiguities. This is even more pronounced for under-resourced languages such as the Tunisian dialect, which lacks basic tools and linguistic resources: sufficiently large corpora, multilingual dictionaries, and morphological and syntactic analyzers. Processing this language is therefore more challenging. The automatic diacritization of MSA text is one of the complex problems that deep neural networks can solve today. Since the Tunisian dialect is under-resourced and closely resembles MSA, we propose investigating a recurrent neural network (RNN) for diacritizing this dialect. This model is compared to our previous CRF and SMT models (24) on the same dialect corpus. We show experimentally that our model achieves better results (DER of 10.72%) than the CRF model (DER of 20.25%) and the SMT model (DER of 33.15%).

1 Introduction

Modern Standard Arabic (MSA) as well as Arabic dialects are usually written without diacritics (24). Native readers easily infer the correct pronunciation of undiacritized words, not only from the context but also from their grammatical and lexical knowledge. However, this is not the case for children, new learners and non-native speakers, as they do not have a good mastery of the rich derivations of the language. Moreover, the absence of diacritical marks leads to ambiguity that affects the performance of NLP tools and tasks, and this ambiguity can be considerable at the data-processing level of NLP applications. Accordingly, automatic diacritization has been shown to be useful for a variety of NLP applications, such as Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and Statistical Machine Translation (SMT) (24).

In this paper, we present our method for the automatic diacritization of dialectal Arabic. Previous experiments and studies have shown that a Recurrent Neural Network (RNN) can give better results for MSA diacritization than other approaches, such as the lexical language model (16) and a hybrid approach combining statistical and rule-based techniques (28). For instance, the authors of (1) demonstrated in their study that the RNN gave the lowest diacritic error rate (DER) compared to other MSA diacritization works.

Given the strong similarity between MSA and the Tunisian dialect, we decided to take advantage of these results by testing the performance of the RNN in the automatic diacritization of the Tunisian dialect. To the best of our knowledge, this is the first work that investigates the RNN for the diacritization of the Tunisian dialect.

In this respect, we performed the task of restoring diacritical marks without relying on any prior morphological or contextual analysis. Moreover, we examined different aspects of the proposed model with various training options. These include the choice of transcription network (long short-term memory (LSTM) networks, bidirectional LSTM (B-LSTM)) and the impact of RNN sizes. The size of the neural network is a function of the number of hidden layers.
Our goal is to select the most suitable number of layers for the Tunisian dialect, based on the findings of various experiments.

This model is compared to our previous CRF and SMT models (24) using the same training and test corpus.

The remainder of this paper is structured as follows. Section 2 describes the linguistic background of the Tunisian dialect, illustrates the complexity of the diacritization task with examples and presents the main levels of diacritization. Section 3 introduces our proposed model and experiments. Section 4 provides an exhaustive experimental evaluation that illustrates the efficiency and accuracy of our proposed method. Section 5 summarizes the key findings of the present work and highlights the major directions for future research.

Proceedings of Recent Advances in Natural Language Processing, pages 730–739, Varna, Bulgaria, Sep 2–4, 2019. https://doi.org/10.26615/978-954-452-056-4_085

2 Linguistic background

2.1 Tunisian dialect

The language situation in Tunisia is "polyglossic": distinct language varieties, such as the normative language (MSA) and the everyday language (the Tunisian dialect), coexist (24).

As an official language, MSA is extensively present in multiple contexts, namely education, business, arts and literature, and social and legal written documents. However, the Tunisian dialect is the current mother tongue and the spoken language of all Tunisians, whatever their origins and social backgrounds. For this reason, the dialect holds a prominent linguistic position in Tunisia.

Another salient feature of the Tunisian dialect is that it is strongly influenced by foreign languages. In fact, it is the outcome of the interaction between Berber, Arabic and many other languages such as French, Italian, Turkish and Spanish. This interaction is evident in the introduction of borrowed words. As a result, the lexical register of the Tunisian dialect seems to be more open and contains a very rich vocabulary.

The Tunisian dialect has other specific aspects. Indeed, since this dialect is spoken rather than written or taught at school, there are no grammatical, orthographic or syntactic rules to follow.

2.2 Challenges in the absence of diacritization in Tunisian dialect

The absence of diacritic signs in Tunisian dialect texts often increases morphological, syntactic and semantic ambiguity. Some of these ambiguities are presented below:

• Morphological ambiguity: The absence of diacritical marks poses an important problem when associating grammatical information with an undiacritized word (24). For example, the word لعب /lEb/ admits several possible readings that correspond to different grammatical labels. We can find the plural noun "toys" and the verb "play" in the third person masculine singular of the passive accomplished aspect.

• Syntactic ambiguity: The ambiguities in the grammatical information associated with undiacritized words in turn cause difficulties for syntactic analysis (24). For example, a single undiacritized expression can admit two different diacritization forms that are both syntactically acceptable.

– One diacritization form, glossed [The boy wrote on the desk], has a syntactic structure that corresponds to a verbal sentence.

– The other diacritization form, glossed [The boy's books are on the desk], has a syntactic structure that corresponds to a nominal sentence.

• Semantic ambiguity: The different diacritizations of a word can have different meanings, even if they belong to the same grammatical category. For example, among the possible diacritizations of the word قريت /qryt/ we find the following:

– /qryt/ [I read]

– /qaryt/ [I taught].

These two diacritized words have the same grammatical category (verb) but two different meanings.

2.3 Diacritization level

The use of diacritic symbols is in several instances crucial to disambiguate homographic words. The level of diacritization refers to the number of diacritical marks placed on a word to avoid text ambiguity for human readers. There are four levels of possible diacritization.

• Full Diacritization: each consonant is followed by a diacritic. This type of diacritization is used mainly in classical Arabic, especially in religion-related books and educational writings.

• Half Diacritization: diacritics are added to every letter of a word except the letters that depend on the syntactic analysis of the word. Often it is the penultimate letter that depends on the syntactic analysis, but this is not always the case because of suffixes.

• Partial Diacritization: only lexical vowels are added. These can be defined as the vowels that change the meaning of a word. The goal of marking these vowels is to remove ambiguity from the meaning of words.

[...] output sequence $y_1^T$ by performing the following operations for $t = 1$ to $T$ (13):

$h_t = H(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$ (1)

$y_t = W_{hy} h_t + b_y$ (2)

where $H$ is the hidden layer activation function, $W_{xh}$ is the weight matrix between the input and the hidden layer, $W_{hh}$ is the recurrent weight matrix between the hidden layer and itself, $W_{hy}$ is the weight matrix between the hidden and output layers, and $b_h$ and $b_y$ are the bias vectors of the hidden and output layers, respectively. In a standard RNN, $H$ is usually an element-wise application of the sigmoid function. Such a network is usually trained using back-propagation through time (BPTT) (27).

• Long short-term memory: LSTM

An alternative RNN called Long Short-Term Memory (LSTM) replaces the conventional neuron with a so-called memory cell controlled by input, output and forget gates, in order to overcome the vanishing gradient problem of traditional RNNs (12). In this case, $H$ can [...]
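To make the recurrence in equations (1) and (2) concrete, the following is a minimal sketch of the forward pass of a standard RNN with a sigmoid hidden activation. It is not the authors' implementation: the function names (`rnn_forward`, `matvec`, `random_matrix`), the zero initial hidden state, the toy dimensions and the random untrained weights are all assumptions made for illustration, and plain Python lists are used only to keep the example free of dependencies.

```python
import math
import random


def sigmoid(z):
    """Logistic sigmoid, used here as the element-wise activation H."""
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
    """Apply equations (1) and (2) for t = 1..T and return (hs, ys).

    xs is a list of T input vectors. The initial hidden state h_0 is
    assumed to be all zeros (an assumption, not stated in the text).
    """
    h = [0.0] * len(b_h)
    hs, ys = [], []
    for x in xs:
        # Equation (1): h_t = H(W_xh x_t + W_hh h_{t-1} + b_h)
        pre = [a + r + b
               for a, r, b in zip(matvec(W_xh, x), matvec(W_hh, h), b_h)]
        h = [sigmoid(p) for p in pre]
        # Equation (2): y_t = W_hy h_t + b_y
        y = [o + b for o, b in zip(matvec(W_hy, h), b_y)]
        hs.append(h)
        ys.append(y)
    return hs, ys


def random_matrix(rows, cols, rng):
    """Small random weights; illustrative only, not trained values."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Hypothetical toy sizes: 3-dimensional inputs, 4 hidden units, 2 outputs, T = 5.
rng = random.Random(0)
n_in, n_hid, n_out, T = 3, 4, 2, 5
W_xh = random_matrix(n_hid, n_in, rng)
W_hh = random_matrix(n_hid, n_hid, rng)
W_hy = random_matrix(n_out, n_hid, rng)
b_h = [0.0] * n_hid
b_y = [0.0] * n_out
xs = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(T)]

hs, ys = rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y)
print(len(hs), len(hs[0]), len(ys), len(ys[0]))  # prints "5 4 5 2"
```

Each hidden vector depends on the previous one through `W_hh`, which is what lets the output at position t reflect the characters seen before it. In a diacritization setting the inputs would presumably be character encodings and the outputs scores over diacritic labels, but that mapping is not specified in this part of the text.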
