
A Multitask Learning Approach for Diacritic Restoration

Sawsan Alqahtani1,2, Ajay Mishra1, and Mona Diab2∗
1AWS, Amazon AI   2The George Washington University
[email protected], [email protected], [email protected]

Abstract

In many languages like Arabic, diacritics are used to specify pronunciations as well as meanings. Such diacritics are often omitted in written text, increasing the number of possible pronunciations and meanings for a word. This results in more ambiguous text, making computational processing of such text more difficult. Diacritic restoration is the task of restoring missing diacritics in the written text. Most state-of-the-art diacritic restoration models are built on character level information, which helps generalize the model to unseen data but presumably loses useful information at the word level. Thus, to compensate for this loss, we investigate the use of multitask learning to jointly optimize diacritic restoration with related NLP problems, namely word segmentation, part-of-speech tagging, and syntactic diacritization. We use Arabic as a case study since it has sufficient data resources for the tasks that we consider in our joint modeling. Our joint models significantly outperform the baselines and are comparable to state-of-the-art models that are more complex, relying on morphological analyzers and/or a lot more data (e.g. dialectal data).

arXiv:2006.04016v1 [cs.CL] 7 Jun 2020

1 Introduction

In contrast to English, some vowels in languages such as Arabic and Hebrew are not part of the alphabet; diacritics are used for vowel specification.1 In addition to marking vowels, diacritics can also represent other features such as case marking and phonological gemination in Arabic. Not including diacritics in the written text in such languages increases the number of possible meanings as well as pronunciations. Humans rely on the surrounding context and their prior knowledge to infer the meanings and/or pronunciations of words. Computational models, on the other hand, are inherently limited in dealing with missing diacritics, which pose a challenge due to the increased ambiguity.

Diacritic restoration (or diacritization) is the process of restoring these missing diacritics for every character in the written text. It specifies pronunciation and can be viewed as a relaxed variant of word sense disambiguation. For example, the Arabic word علم Elm2 can mean "flag" or "knowledge", but the meaning as well as the pronunciation is specified once the word is diacritized (Ealamu means "flag" while Eilomo means "knowledge"). As an illustrative example in English, if we omit the vowels in the word pn, the word can be read as pan, pin, pun, or pen; each of these variants has a different pronunciation and meaning if it composes a valid word in the language.

State-of-the-art diacritic restoration models have reached decent performance over the years using recurrent or convolutional neural networks, in terms of accuracy (Zalmout and Habash, 2017; Alqahtani et al., 2019; Orife, 2018) and/or efficiency (Alqahtani et al., 2019; Orife, 2018); yet there is still room for further improvement. Most of these models are built on character level information, which helps generalize the model to unseen data but presumably loses some useful information at the word level. Since word level resources are insufficient to be relied upon for training diacritic restoration models, we integrate additional linguistic information that considers word morphology as well as word relationships within a sentence to partially compensate for this loss.

∗The work was conducted while the author was with AWS, Amazon AI.
1Diacritics are marks that are added above, below, or in between the letters to compose a new letter or characterize the letter with a different sound (Wells, 2000).
2We use Buckwalter transliteration encoding: http://www.qamus.org/transliteration.htm.
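The ambiguity above can be made concrete with a toy helper (not from the paper; the diacritic symbol set follows the Buckwalter transliteration used throughout): stripping the diacritic symbols collapses distinct words onto a single ambiguous surface form, which is exactly what a restoration model must undo.

```python
# Toy illustration (illustrative code, not the authors' implementation):
# removing Buckwalter diacritic symbols collapses distinct words.
DIACRITICS = set("aiuoKFN~")  # short vowels, sukun, nunation, gemination

def strip_diacritics(word: str) -> str:
    """Remove Buckwalter diacritic symbols, keeping only base letters."""
    return "".join(ch for ch in word if ch not in DIACRITICS)

# Both "flag" (Ealamu) and "knowledge" (Eilomo) reduce to the same
# undiacritized string Elm, which a restoration model must disambiguate.
print(strip_diacritics("Ealamu"))  # Elm
print(strip_diacritics("Eilomo"))  # Elm
```

Restoration is the inverse, one-to-many direction of this mapping, which is why it is framed below as per-character prediction over a fixed diacritic label set.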
In this paper, we improve the performance of diacritic restoration by building a multitask learning model (i.e. joint modeling). Multitask learning refers to models that learn more than one task at the same time, and it has recently been shown to provide good solutions for a number of NLP tasks (Hashimoto et al., 2016; Kendall et al., 2018). The use of a multitask learning approach provides an end-to-end solution, in contrast to generating the linguistic features for diacritic restoration in a preprocessing step. In addition, it alleviates the reliance on other computational and/or data resources to generate these features. Furthermore, the proposed model is flexible in that a task can be added or removed depending on data availability. This makes the model adaptable to other languages and dialects.

We consider the following auxiliary tasks to boost the performance of diacritic restoration: word segmentation, part-of-speech (POS) tagging, and syntactic diacritization. We use Arabic as a case study for our approach since it has sufficient data resources for the tasks that we consider in our joint modeling.3

The contributions of this paper are twofold:

1. We investigate the benefits of automatically learning related tasks to boost the performance of diacritic restoration;

2. In doing so, we devise a state-of-the-art model for Arabic diacritic restoration as well as a framework for improving diacritic restoration in other languages that include diacritics.

2 Diacritization and Auxiliary Tasks

We formulate the problem of (full) diacritic restoration (DIAC) as follows: given a sequence of characters, we identify the diacritic corresponding to each character in that sequence from the following set of diacritics: {a, u, i, o, K, F, N, ∼, ∼a, ∼u, ∼i, ∼F, ∼K, ∼N}. We additionally consider three auxiliary tasks: syntactic diacritization, POS tagging, and word segmentation. Two of these operate at the word level (syntactic diacritization and POS tagging), while the remaining tasks (diacritic restoration and word segmentation) operate at the character level. This helps diacritic restoration utilize information from both character and word levels, bridging the gap between the two.

Syntactic Diacritization (SYN): This refers to the task of retrieving the diacritics related to the syntactic position of each word in the sentence, which is a sub-task of full diacritic restoration. Arabic is a templatic language in which words comprise roots and patterns, and patterns are typically reflective of diacritic distributions. Verb patterns are more or less predictable, whereas nouns tend to be more complex. Arabic diacritics can be divided into lexical and inflectional (or syntactic) diacritics. Lexical diacritics change the meanings of words as well as their pronunciations, and their distribution is bound by patterns/templates. In contrast, inflectional diacritics are related to the syntactic positions of words in the sentence and are added to the last letter of the main morphemes of words (word finally), changing their pronunciations.4 Inflectional diacritics are also affected by a word's root (e.g. weak roots) and semantic or morphological properties (e.g. with the same grammatical case, masculine and feminine plurals take different diacritics).

Thus, the same word can be assigned a different syntactic diacritic reflecting syntactic case, i.e. depending on its relations to the remaining words in the sentence (e.g. subject or object). For example, the diacritized variants Ealama and Ealamu, which both mean "flag", have the corresponding syntactic diacritics a and u, respectively. That being said, the main trigger for accurate syntactic prediction is the relationships between words, capturing semantic and, most importantly, syntactic information.

Because Arabic has a unique set of diacritics, this study formulates syntactic diacritization in the following way: each word in the input is tagged with a single diacritic representing its syntactic position in the sentence.5 The set of diacritics in syntactic diacritization is the same as the set of diacritics for full diacritic restoration. Other languages that include diacritics can include syntax-related diacritics, but in a different manner and with different complexity compared to Arabic.

Word Segmentation (SEG): This refers to the process of separating affixes from the main unit of the word. Word segmentation is commonly used as a preprocessing step for different NLP applications, and its usefulness is apparent in morphologically rich languages. For example, the undiacritized word وهم whm might be diacritized as waham∼a "and concerned" or waham "illusion", where the first diacritized word consists of two segments "wa ham∼a" while the second is composed of one word. Word segmentation can be formulated in the following way: each character in the input is tagged following the IOB tagging scheme (B: beginning of a segment; I: inside a segment; O: out of the segment) (Diab et al., 2004).

Part-of-Speech Tagging (POS): This refers to the task of determining the syntactic role of a word (i.e. its part of speech) within a sentence. POS tags are highly correlated with diacritics (both syntactic and lexical): knowing one helps determine or reduce the possible choices of the other. For instance, the word كتب ktb in the sentence ktb [someone] means "books" if we know it to be a noun, whereas the word would be either katab "someone wrote" or kat∼ab "made someone write" if it is known to be a verb. POS tagging can be formulated in the following way: each word in the input is assigned a POS tag from the Universal Dependencies tagset (Taji et al., 2017).6

3 Approach

We built a diacritic restoration joint model and studied the extent to which sharing information can improve diacritic restoration performance. Our joint model is motivated by the recent success of the hierarchical modeling proposed in Hashimoto et al. (2016), such that information learned from an auxiliary task is passed as input to the diacritic restoration related layers.7

3.1 Input Representation

Since our joint model may involve both character and word level tasks, we began our investigation by asking the following question: how can information be integrated between these two levels? Starting from randomly initialized character embeddings as well as a pretrained set of embeddings for words, we follow two approaches (Figure 1 visually illustrates the two approaches with an example).

Figure 1: An example of embedding vectors for the word cat and its individual characters c, a, and t. (i) A character-based representation for the word cat from its individual characters; (ii) a concatenation of the word embedding with each of its individual characters.

(1) Character-Based Representation: We pass information learned by character level tasks into word level tasks by composing a word representation from the word's characters. We first concatenate the individual embeddings of the characters in that word, and then apply a Bidirectional Long Short-Term Memory (BiLSTM) layer to generate denser vectors.8 This helps represent morphology and word composition in the model.

(2) Word-To-Character Representation: To pass information learned by word level tasks into character level tasks, we concatenate each word with each of its composing characters at each pass, similar to what is described in Watson et al. (2018). This helps distinguish the individual characters based on the surrounding context, implicitly capturing additional semantic and syntactic information.

Figure 2: The diacritic restoration joint model. All Char Embed entities refer to the same randomly initialized character embedding learned during the training process. Pretrained embeddings refer to fixed word embeddings obtained from fastText (Bojanowski et al., 2017). (i) shows the input representation for the CharToWord and WordToChar embeddings, which is the same as in Figure 1; (ii) represents the diacritic restoration joint model; output labels from each task are concatenated with the WordToChar embedding and optionally with the SEG hidden states.

3Other languages that include diacritics lack such resources; however, the same multitask learning framework can be applied if data resources become available.
4Diacritics that are added due to passivization are also syntactic in nature but are not considered in our syntactic diacritization task. That said, they are still considered in the full diacritic restoration model.
5Combinations of diacritics are possible, but we combine valid possibilities together as one single unit in our model. For example, the diacritics ∼ and a are combined to form an additional diacritic ∼a.
6Refer to https://universaldependencies.org/. This tagset is chosen because it includes the essential POS tags in the language, and it is unified across different languages, which makes it suitable for investigating more languages in the future.
7We also experimented with learning tasks sharing some layers and then diverging to task-specific layers. However, this did not improve the performance compared to the diacritic restoration model without any additional task.
8We also evaluated the use of a feedforward layer and a unidirectional Long Short-Term Memory (LSTM) layer, but a BiLSTM layer yielded better results.
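The two representations of Section 3.1 can be sketched with toy list-based vectors (illustrative code, not the authors' implementation: a simple element-wise average stands in for the learned BiLSTM encoder, and the vectors are hand-picked rather than learned):

```python
# Illustrative sketch of the two input representations with toy vectors.

def char_to_word(char_vecs):
    """(1) Character-based word representation: combine the character
    vectors of a word into one word-level vector (element-wise average
    here, as a stand-in for the BiLSTM over character embeddings)."""
    dim = len(char_vecs[0])
    return [sum(v[i] for v in char_vecs) / len(char_vecs) for i in range(dim)]

def word_to_char(word_vec, char_vecs):
    """(2) WordToChar: concatenate the word vector with each of its
    character vectors, giving every character word-level context."""
    return [list(word_vec) + list(c) for c in char_vecs]

# Toy 2-d character vectors for the word "cat" and a 3-d word vector.
chars = {"c": [1.0, 0.0], "a": [0.0, 1.0], "t": [1.0, 1.0]}
cat_chars = [chars[ch] for ch in "cat"]
cat_word = [0.5, 0.2, 0.1]

assert len(char_to_word(cat_chars)) == 2        # one word-level vector
# Each character now carries a (3 + 2)-dimensional contextual vector.
assert [len(v) for v in word_to_char(cat_word, cat_chars)] == [5, 5, 5]
```

The same character (e.g. t) thus receives a different WordToChar vector in cat than in cats, which is the disambiguation effect discussed in Section 5.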

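The joint objective described in the next subsection combines per-task losses and normalizes by the number of tasks; a minimal sketch (task names and loss values are illustrative, not the authors' code):

```python
# Sketch of the combined multitask objective: per-task losses are
# summed and normalized by the number of tasks in the model.

def combined_loss(task_losses: dict) -> float:
    """Average the losses of all tasks currently in the joint model."""
    return sum(task_losses.values()) / len(task_losses)

losses = {"DIAC": 0.9, "SEG": 0.2, "POS": 0.5, "SYN": 0.4}
print(round(combined_loss(losses), 3))  # 0.5
# Omitting a task (e.g. SEG) simply drops its component:
del losses["SEG"]
print(round(combined_loss(losses), 3))  # 0.6
```

This makes adding or removing an auxiliary task a matter of adding or dropping one loss term, which is the flexibility the paper emphasizes.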
3.2 The Joint Model

For all architectures, the main component is a BiLSTM (Hochreiter and Schmidhuber, 1997; Schuster and Paliwal, 1997), which preserves the temporal order of the sequence and has been shown to provide state-of-the-art performance in terms of accuracy (Zalmout and Habash, 2017; Alqahtani et al., 2019). After representing characters through random initialization and representing words using pretrained embeddings obtained from fastText (Bojanowski et al., 2017), the learning process for each batch runs as follows:

1. We extract the two additional input representations described in Section 3.1;

2. We apply a BiLSTM to each of the different tasks separately to obtain their corresponding outputs;

3. We pass all outputs from all tasks as well as the WordToChar embedding vectors as input to the diacritic restoration model and obtain our diacritic outputs.

Figure 2 illustrates the diacritic restoration joint model. As can be seen, SYN as well as POS tagging are trained on top of the CharToWord representation, which is basically the concatenation of the pretrained embedding for each word with the character-based representation described in Figure 1. SEG is trained separately on top of the character embeddings. We pass the outputs of all these tasks along with the WordToChar representation to train the BiLSTM diacritic restoration model. Omitting a task is rather easy: we just remove the related components for that task to yield the appropriate model. We optionally pass the last hidden layer for SEG along with the remaining input to the diacritic restoration model.9

9Passing the last hidden layer for POS tagging and/or SYN did not improve the performance; the pretrained embeddings are sufficient to capture important linguistic signals.

4 Experimental Setups

Dataset: We use the Arabic Treebank (ATB) dataset, parts 1, 2, and 3, and follow the same data division as Diab et al. (2013). Table 1 illustrates the data statistics. For word based tasks, we segment each sentence into space tokenized words. For character based tasks, we additionally insert a special boundary symbol between these words, and then each word is further segmented into its characters, similar to Alqahtani et al. (2019). We pass each word through the model along with a specific number of previous and future words (+/- 10 words).

Train   | Test   | Dev    | OOV
502,938 | 63,168 | 63,126 | 7.3%

Table 1: Number of words and out-of-vocabulary (OOV) rate for Arabic. OOV rate indicates the percentage of undiacritized words in the test set that have not been observed during training.

Parameter Settings: For all tasks, we use 250 hidden units in each direction (500 units in both directions combined) and an embedding size of 300. We use 3 hidden layers for all tasks except SEG, for which we use only one layer. We use the Adam optimizer with a learning rate of 0.001. We use 20 epochs, a batch size of 16, a hidden dropout of 0.3, and an embedding dropout of 0.5. We initialize the embeddings with a uniform distribution [-0.1, 0.1] and the hidden layers with a normal distribution. The loss scores for all considered tasks are combined and then normalized by the number of tasks in the model.

Evaluation Metrics: We use accuracy for all tasks except diacritic restoration. For diacritic restoration, the two most typically used metrics are Word Error Rate (WER) and Diacritic Error Rate (DER), the percentages of incorrectly diacritized words and characters, respectively. In order to approximate errors in the syntactic diacritics, we use Last Diacritic Error Rate (LER), the percentage of words that have an incorrect diacritic in the last position of the word. To evaluate the models' ability to generalize beyond observed data, we compute WER on OOV (out-of-vocabulary) words.10

10Words that appear in the test dataset but have not been observed in the training dataset.

Significance Testing: We ran each experiment three times and report the mean score.11 We used the t-test with p = 0.05 to evaluate whether the difference between each model's performance and that of the diacritic restoration baseline is significant (Dror et al., 2018).

11A higher number of runs would provide more robust conclusions about the models' performance. We only considered the minimum acceptable number of runs per experiment due to limited computational resources.

5 Results and Analysis

Table 2 shows the performance of the joint diacritic restoration models when different tasks are considered.

Task                       | WER           | DER  | LER/Lex   | OOV WER
Zalmout and Habash (2017)  | 8.21          | -    | -         | 20.2
Zalmout and Habash (2019a) | 7.50          | -    | -         | -
Alqahtani and Diab (2019a) | 7.6           | 2.7  | -         | 32.1
BASE (Char)                | 8.51 (±0.01)  | 2.80 | 5.20/5.54 | 34.56
BASE (WordToChar)          | 8.09 (±0.05)  | 2.73 | 5.00/5.30 | 32.10
DIAC+SEG                   | 8.35 (±0.02)  | 2.82 | 5.20/5.46 | 33.97
DIAC+SYN                   | 7.70* (±0.02) | 2.60 | 4.72/5.08 | 30.94
DIAC+POS                   | 7.86* (±0.14) | 2.65 | 4.72/5.20 | 32.28
DIAC+SEG+SYN               | 7.70* (±0.05) | 2.59 | 4.65/5.03 | 31.33
DIAC+SEG+POS               | 7.73* (±0.08) | 2.62 | 4.73/5.01 | 31.31
DIAC+SYN+POS               | 7.72* (±0.06) | 2.61 | 4.62/5.06 | 31.05
ALL                        | 7.51* (±0.09) | 2.54 | 4.54/4.91 | 31.07

Table 2: Performance of the joint diacritic restoration model when different related tasks are considered. Bold numbers represent the best score per column. Almost all models improve over BASE (Char). * denotes statistically significant improvements compared to the baselines. Lex refers to the percentage of words that have incorrect lexical diacritics only, excluding syntactic diacritics.

When we consider WordToChar as input to the diacritic restoration model, we observe statistically significant improvements for all evaluation metrics. This is justified by the ability of word embeddings to capture syntactic and semantic information at the sentence level. The same character is disambiguated in terms of the surrounding context as well as the word it appears in (e.g. the character t in the word cat would be represented slightly differently than t in a related word cats or even a different word table). We consider both the character based model and the WordToChar based model as our baselines (BASE). We use the WordToChar representation rather than characters for all remaining models that jointly learn more than one task.

For all experiments, we observe improvements compared to both baselines across all evaluation metrics. Furthermore, all models except DIAC+SEG outperform the WordToChar diacritic restoration model in terms of WER, showing the benefits of considering the output distributions of the other tasks. Despite leveraging tasks focused on syntax (SYN/POS) or morpheme boundaries (SEG), the improvements extend to lexical diacritics as well. Thus, the proposed joint diacritic restoration model is also helpful in settings beyond word final syntactic related diacritics. The best performance is achieved when we consider all auxiliary tasks within the diacritic restoration model.

Impact of Auxiliary Tasks: We discuss the impact of adding each investigated task on the performance of the diacritic restoration model.

Word segmentation (DIAC+SEG): When morpheme boundaries as well as diacritics are learned jointly, the WER performance is slightly reduced on all and OOV words. This reduction is attributed mostly to lexical diacritics. As Arabic exhibits a non-concatenative fusional morphology, reducing its complexity to a segmentation task might inherently obscure the morphological processes for each form.

Observing only a slight improvement is surprising; we believe that this is due to our experimental setup and does not negate the importance of having morphemes that assign the appropriate diacritics. We speculate that the reason is that we do not capture the interaction between morphemes as an entity, losing some level of morphological information.

For instance, the words waham∼a versus wahum for the undiacritized word whm (bold letters refer to consonants, distinguishing them from diacritics) would benefit from morpheme boundary identification to tease apart wa from hum in the second variant (wahum), emphasizing that these are two words. On the other hand, it adds an additional layer of ambiguity for other cases like the morpheme ktb in the diacritized variants kataba, kutubu, sayakotubo - note that the underlined segment has the same consonants across the variants - in which identifying morphemes increased the number of possible diacritic variants without learning the interactions between adjacent morphemes.

Furthermore, we found inconsistencies in the dataset for morphemes, which might cause the drop in performance when we only consider SEG. When we consider all tasks together, these inconsistencies are reduced because of the combined information from different linguistic signals.

Syntactic diacritization (DIAC+SYN): By enforcing inflectional diacritics through an additional focused layer within the diacritic restoration model, we observe improvements in WER compared to the baselines. We notice improvements on syntactic related diacritics (LER score), which is expected given the nature of syntactic diacritization, in which the model learns the underlying syntactic structure to assign the appropriate syntactic diacritic to each word. Improvements also extend to lexical diacritics, because word relationships are captured while learning syntactic diacritics, where BiLSTM modeling for words is integrated.

POS tagging (DIAC+POS): When we jointly train POS tagging with full diacritic restoration, we notice improvements compared to both baselines. Compared to syntactic diacritization, we obtain similar findings across all evaluation metrics except for WER on OOV words, where POS tagging drops. Including POS tagging within diacritic restoration also captures important information about the words; the idea of POS tagging is to learn the underlying syntax of the sentence. In comparison to syntactic diacritization, it involves different types of information, like passivization, which could be essential in learning correct diacritics.

Ablation Analysis: Incorporating all the auxiliary tasks under study within the diacritic restoration model (ALL) provides the best performance across all measures except WER on OOV words, for which the best performance was given by DIAC+SYN. We discuss the impact of removing one task at a time from ALL and examine whether its exclusion significantly impacts the performance.

Excluding SEG from the process drops the performance of diacritic restoration. This shows that even though SEG did not help greatly when it was combined solely with diacritic restoration, the combination of SEG and the other word based tasks filled in the gaps that were missing from just identifying morpheme boundaries. Excluding either POS tagging or syntactic diacritization also hurts the performance, which shows that these tasks complement each other and, taken together, improve the performance of the diacritic restoration model.

Input Representation:

Impact of output labels: Table 3 shows the different models when we do not pass the labels of the investigated tasks (the input is only the WordToChar representation) against the same models when we do. We notice a drop in performance across all models. Note that all models - even when we do not consider the labels - perform better than the baselines. This also supports the benefits of the WordToChar representation.

Tasks        | With Labels | Without Labels
DIAC+SYN     | 7.70        | 7.99
DIAC+POS     | 7.86        | 7.93
DIAC+SEG+SYN | 7.70        | 7.93
DIAC+SEG+POS | 7.73        | 7.99
DIAC+SYN+POS | 7.72        | 7.97
ALL          | 7.51        | 7.91

Table 3: WER performance when we do not consider the output labels for the investigated tasks. Bold numbers represent the best score per row.

Last hidden layer of SEG: Identifying morpheme boundaries did not increase accuracy as we expected. Therefore, we examined whether information learned by the BiLSTM layer would help us learn morpheme interactions, by passing the output of the last BiLSTM layer to the diacritic restoration model along with the segmentation labels. We did not observe any improvements towards predicting accurate diacritics when we pass this information. For ALL, the WER score increased by 0.22%. Thus, it is sufficient to only utilize the segment labels for diacritic restoration.

Passive and active verbs: Passivization in Arabic is denoted through diacritics, and missing such a diacritic can cause ambiguity in some cases (Hermena et al., 2015; Diab et al., 2007). To examine its impact, we further divide verbs in the POS tagset into passive and active, increasing the tagset size by one. Table 4 shows the diacritic restoration performance with and without considering passivization. We notice improvements, in some combinations of tasks, across all evaluation metrics compared to the pure POS tagging, showing its importance in diacritic restoration models.

Task         | With Pass | Without Pass
DIAC+POS     | 7.65      | 7.86
DIAC+SEG+POS | 7.65      | 7.73
DIAC+SYN+POS | 7.78      | 7.72
ALL          | 7.62      | 7.51

Table 4: WER performance for different diacritic restoration models when passivization is considered. Bold numbers represent the best score per row.

Level of linguistic information: The joint diacritic restoration models were built empirically and tested against the development set. We noticed that, to improve the performance, soft parameter sharing in a hierarchical fashion performs better for diacritic restoration. We experimented with building a joint model that learns segmentation and diacritics through hard parameter sharing. To learn segmentation with diacritic restoration, we shared the embedding layer between the two tasks as well as some or all BiLSTM layers. We got a WER on all words of 8.53∼9.35, showing no improvements compared to character based diacritic restoration. To learn word based tasks with diacritic restoration, we pass the WordToChar representation to the diacritic restoration model and/or the CharToWord representation to the word based tasks. The best that we could get for both tasks is 8.23%∼9.6%; no statistically significant improvements were found. This shows the importance of the hierarchical structure for appropriate diacritic assignment.

Qualitative analysis: We compared random errors that are correct in DIAC (character-based diacritic restoration) with ALL, in which we consider all investigated tasks. Although ALL provides accurate results for more words, it introduces errors in other words that have been correctly diacritized by DIAC. The patterns of such words are not clear. We did not find a particular category that occurs in one model but not the other; rather, the types and quantity of errors differ in each of these categories.

State-of-the-art Comparison: Table 2 also shows the performance of the state-of-the-art models. The ALL model surpasses the performance of Zalmout and Habash (2017). However, Zalmout and Habash (2017)'s model performs significantly better on OOV words. Zalmout and Habash (2019a) provides comparable performance to the ALL model. The difference between their work and that in Zalmout and Habash (2017) is the use of a joint model to learn morphological features other than diacritics (i.e. features at the word level), rather than learning these features individually. Zalmout and Habash (2019a) obtained an additional boost in performance (0.3% improvement over ours) when they add a dialectal variant of Arabic in the learning process, sharing information between both languages.

Alqahtani and Diab (2019a) provides comparable performance to ALL and better performance on some task combinations in terms of WER on all and OOV words. The difference between their model and our BASE model is the addition of a CRF (Conditional Random Fields) layer, which incorporates dependencies in the output space at the cost of the model's computational efficiency (memory and speed).

Zalmout and Habash (2019b) provides the current state-of-the-art performance, building a morphological disambiguation framework for Arabic similar to Zalmout and Habash (2017, 2019a). They reported their scores on the development set, which was not used for tuning. On the development set, they obtained 93.9%, which significantly outperforms our best model (ALL) by 1.4%. Our approach is similar to Zalmout and Habash (2019b): we both follow the WordToChar as well as CharToWord input representations discussed in Section 3.1, regardless of the specifics, and we both consider the morphological outputs as features in our diacritic restoration model. In Zalmout and Habash (2019b), the morphological feature space that is considered is larger, making use of all morphological features in Arabic. Furthermore, Zalmout and Habash (2019b) use sequence-to-sequence modeling rather than sequence classification as we do. Unlike Zalmout and Habash (2019b), our model is more flexible, allowing additional tasks to be added when sufficient resources are available.

We believe that neither the underlying architecture nor the consideration of all possible features was the crucial factor that led to the significant reduction in WER; rather, the use of morphological analyzers is crucial for such significant improvement. As a matter of fact, in Zalmout and Habash (2019b), the performance significantly drops to 7.2 when they, similar to our approach, take the highest probability value as a solution. Thus, we believe that the use of morphological analyzers enforces valid word composition in the language and filters out invalid words (a side effect of using characters as input representation). This also justifies the significant improvement on OOV words obtained by Zalmout and Habash (2017). Thus, we believe that a global knowledge of words and internal constraints within words are captured.

Auxiliary tasks: We compared the base models of the auxiliary tasks to the state of the art (SOTA). For SEG, our BiLSTM model has comparable performance to that in Zalmout and Habash (2017) (SEG yields 99.88% F1 compared to SOTA 99.6%). For POS, we use a shallower tagset (16 tags compared to ∼70) than typically used in previous models, hence we do not have a valid comparison set. For SYN, we compare our results with Hifny (2018), which uses a hybrid network of BiLSTM and Maximum Entropy to solve syntactic diacritization. SYN yields results comparable to SOTA (our model achieves 94.22 vs. SOTA 94.70).

6 Related Work

The problem of diacritization has been addressed using classical machine learning approaches (e.g. Maximum Entropy and Support Vector Machines) (Zitouni and Sarikaya, 2009; Pasha et al., 2014) or neural approaches for different languages that include diacritics, such as Arabic, Vietnamese, and Yoruba. Neural approaches yield state-of-the-art performance for diacritic restoration by using Bidirectional LSTMs or temporal convolutional networks (Zalmout and Habash, 2017; Orife, 2018; Alqahtani et al., 2019; Alqahtani and Diab, 2019a).

Arabic syntactic diacritization has been consistently reported to be difficult, degrading the performance of full diacritic restoration (Zitouni et al., 2006; Habash et al., 2007; Said et al., 2013; Shaalan et al., 2009; Shahrour et al., 2015; Darwish et al., 2017). To improve the performance of syntactic diacritization, or of full diacritic restoration in general, previous studies followed different approaches. Some studies separate lexical from syntactic diacritization (Shaalan et al., 2009; Darwish et al., 2017). Other studies consider additional linguistic features such as POS tags and word segmentation (i.e. tokens or morphemes) (Ananthakrishnan et al., 2005; Zitouni et al., 2006; Zitouni and Sarikaya, 2009; Shaalan et al., 2009).

Hifny (2018) addresses syntactic diacritization by building a BiLSTM model whose input embeddings are augmented with manually generated features of context, POS tags, and word segments. Rashwan et al. (2015) use a deep belief network to build a diacritization model for Arabic that focuses on improving syntactic diacritization, building sub-classifiers based on the analysis of a confusion matrix and POS tags.

Regarding the incorporation of linguistic features into the model, previous studies have used morphological features either in a preprocessing step or in a ranking step for building diacritic restoration models. As a preprocessing step, the words are converted to their constituents (e.g. morphemes, lemmas, or n-grams) and then diacritic restoration models are built on top of that (Ananthakrishnan et al., 2005; Alqahtani and Diab, 2019b). Ananthakrishnan et al. (2005) use POS tags to improve diacritic restoration at the syntax level, assuming that POS tags are known at inference time.

As a ranking procedure, all possible analyses of words are generated and then the most probable analysis is chosen (Pasha et al., 2014; Zalmout and Habash, 2017, 2019a,b). Zalmout and Habash (2017) develop a morphological disambiguation model to determine Arabic morphological features including diacritization. They train the model using a BiLSTM and consult an LSTM-based language model as well as other morphological features to rank and score the output analyses. A similar methodology can be found in Pasha et al. (2014), but using Support Vector Machines. This methodology shows better performance on out-of-vocabulary (OOV) words compared to pure character models.

7 Discussion & Conclusion

We present a diacritic restoration joint model that considers the output distributions of different related tasks to improve the performance of diacritic restoration. Our results show statistically significant improvements across all evaluation metrics.

References

Sawsan Alqahtani, Ajay Mishra, and Mona Diab. 2019. Convolutional neural networks for diacritic restoration. In EMNLP.

Sankaranarayanan Ananthakrishnan, Shrikanth Narayanan, and Srinivas Bangalore. 2005. Automatic diacritization of Arabic transcripts for automatic speech recognition. In Proceedings of the 4th International Conference on Natural Language Processing, pages 47–54.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics.

Kareem Darwish, Hamdy Mubarak, and Ahmed Abdelali. 2017. Arabic diacritization: Stats, rules, and hacks. In Proceedings of the Third Arabic Natural Language Processing Workshop, pages 9–17.

Mona Diab, Mahmoud Ghoneim, and Nizar Habash. 2007. Arabic diacritization in the context of statistical machine translation. In Proceedings of MT-Summit.

Mona Diab, Nizar Habash, Owen Rambow, and Ryan Roth. 2013. LDC Arabic treebanks and associated corpora: Data divisions manual. arXiv preprint arXiv:1309.5652.

Mona Diab, Kadri Hacioglu, and Daniel Jurafsky. 2004. Automatic tagging of Arabic text: From raw text to base phrase chunks. In Proceedings of HLT-NAACL 2004: Short papers. Association for Computational Linguistics.
This shows the importance of considering Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Re- additional linguistic information at morphological ichart. 2018. The hitchhikers guide to testing statis- and/or sentence levels. Including semantic informa- tical significance in natural language processing. In tion through pretrained word embeddings within Proceedings of the 56th Annual Meeting of the As- the diacritic restoration model also helped boosting sociation for Computational Linguistics, volume 1, pages 1383–1392. the diacritic restoration performance. Although we apply our joint model on Arabic, this model pro- Nizar Habash, Ryan Gabbard, Owen Rambow, Seth vides a framework for other languages that include Kulick, and Mitch Marcus. 2007. Determining case in arabic: Learning complex linguistic behavior re- diacritics whenever resources become available. quires complex linguistic features. In Proceedings Although we observed improvements in terms of of the 2007 Joint Conference on Empirical Methods generalizing beyond observed data when using the in Natural Language Processing and Computational proposed linguistic features, the OOV performance Natural Language Learning (EMNLP-CoNLL). is still an issue for diacritic restoration. Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsu- ruoka, and Richard Socher. 2016. A joint many-task model: Growing a neural network for multiple nlp References tasks. arXiv preprint arXiv:1611.01587. Ehab W Hermena, Denis Drieghe, Sam Hellmuth, and Sawsan Alqahtani and Mona Diab. 2019a. Investigat- Simon P Liversedge. 2015. Processing of arabic di- ing input and output units in diacritic restoration. In acritical marks: Phonological–syntactic disambigua- 2019 18th IEEE International Conference On Ma- tion of homographic verbs and visual crowding ef- chine Learning And Applications (ICMLA). IEEE. fects. Journal of Experimental Psychology: Human Perception and Performance, 41(2):494. Sawsan Alqahtani and Mona Diab. 2019b. 
Investigat- ing input and output units in diacritic restoration. In Yasser Hifny. 2018. Hybrid lstm/maxent networks for 2019 18th IEEE International Conference on Ma- arabic syntactic diacritics restoration. IEEE Signal chine Learning and Applications (ICMLA). Processing Letters, 25(10):1515–1519. Sepp Hochreiter and Jurgen¨ Schmidhuber. 1997. Nasser Zalmout and Nizar Habash. 2017. Don’t throw Long short-term memory. Neural computation, those morphological analyzers away just yet: Neu- 9(8):1735–1780. ral morphological disambiguation for arabic. In Pro- ceedings of the 2017 Conference on Empirical Meth- Alex Kendall, Yarin Gal, and Roberto Cipolla. 2018. ods in Natural Language Processing, pages 704– Multi-task learning using uncertainty to weigh 713. losses for scene geometry and semantics. In Pro- ceedings of the IEEE Conference on Vi- Nasser Zalmout and Nizar Habash. 2019a. Adversarial sion and , pages 7482–7491. multitask learning for joint multi-feature and multi- dialect morphological modeling. arXiv preprint Iroro Orife. 2018. Attentive sequence-to-sequence arXiv:1910.12702. learning for diacritic restoration of yor\ub\’a lan- guage text. arXiv preprint arXiv:1804.00832. Nasser Zalmout and Nizar Habash. 2019b. Joint dia- critization, lemmatization, normalization, and fine- Arfath Pasha, Mohamed Al-Badrashiny, Mona T Diab, grained morphological tagging. arXiv preprint Ahmed El Kholy, Ramy Eskander, Nizar Habash, arXiv:1910.02267. Manoj Pooleery, Owen Rambow, and Ryan Roth. Imed Zitouni and Ruhi Sarikaya. 2009. Arabic diacritic 2014. Madamira: A fast, comprehensive tool for restoration approach based on maximum entropy morphological analysis and disambiguation of ara- models. Computer Speech & Language, 23(3):257– bic. In LREC, volume 14, pages 1094–1101. 276. Mohsen AA Rashwan, Ahmad A Al Sallab, Hazem M Imed Zitouni, Jeffrey S Sorensen, and Ruhi Sarikaya. Raafat, and Ahmed Rafea. 2015. Deep learn- 2006. 
Maximum entropy based restoration of ara- ing framework with confused sub-set resolution bic diacritics. In Proceedings of the 21st Interna- architecture for automatic arabic diacritization. tional Conference on Computational Linguistics and IEEE/ACM Transactions on Audio, Speech and Lan- the 44th annual meeting of the Association for Com- guage Processing (TASLP), 23(3):505–516. putational Linguistics, pages 577–584. Association for Computational Linguistics. Ahmed Said, Mohamed El-Sharqwi, Achraf Chalabi, and Eslam Kamal. 2013. A hybrid approach for ara- bic diacritization. In International Conference on Application of Natural Language to Information Sys- tems, pages 53–64. Springer.

Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.

Khaled Shaalan, Hitham M Abo Bakr, and Ibrahim Ziedan. 2009. A hybrid approach for building arabic diacritizer. In Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages, pages 27–35. Association for Computational Linguistics.

Anas Shahrour, Salam Khalifa, and Nizar Habash. 2015. Improving arabic diacritization through syntactic analysis. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1309–1315.

Dima Taji, Nizar Habash, and Daniel Zeman. 2017. Universal dependencies for arabic. In Proceedings of the Third Arabic Natural Language Processing Workshop, pages 166–176.

Daniel Watson, Nasser Zalmout, and Nizar Habash. 2018. Utilizing character and word embeddings for text normalization with sequence-to-sequence models. arXiv preprint arXiv:1809.01534.

JC Wells. 2000. Orthographic diacritics and multilingual computing. Language Problems and Language Planning, 24(3):249–272.