
A Multitask Learning Approach for Diacritic Restoration

Sawsan Alqahtani1,2, Ajay Mishra1, and Mona Diab2∗
1AWS, Amazon AI   2The George Washington University
[email protected], [email protected], [email protected]

Abstract

In many languages like Arabic, diacritics are used to specify pronunciations as well as meanings. Such diacritics are often omitted in written text, increasing the number of possible pronunciations and meanings for a word. This results in more ambiguous text, making computational processing of such text more difficult. Diacritic restoration is the task of restoring missing diacritics in the written text. Most state-of-the-art diacritic restoration models are built on character level information, which helps generalize the model to unseen data but presumably loses useful information at the word level. Thus, to compensate for this loss, we investigate the use of multitask learning to jointly optimize diacritic restoration with related NLP problems, namely word segmentation, part-of-speech tagging, and syntactic diacritization. We use Arabic as a case study since it has sufficient data resources for the tasks that we consider in our joint modeling. Our joint models significantly outperform the baselines and are comparable to state-of-the-art models that are more complex, relying on morphological analyzers and/or a lot more data (e.g. dialectal data).

arXiv:2006.04016v1 [cs.CL] 7 Jun 2020

1 Introduction

In contrast to English, some vowels in languages such as Arabic and Hebrew are not part of the alphabet; diacritics are used for vowel specification.1 In addition to marking vowels, diacritics can also represent other features such as case marking and phonological gemination in Arabic. Not including diacritics in the written text in such languages increases the number of possible meanings as well as pronunciations. Humans rely on the surrounding context and their prior knowledge to infer the meanings and/or pronunciations of words. Computational models, on the other hand, are inherently limited in dealing with missing diacritics, which pose a challenge due to the increased ambiguity.

Diacritic restoration (or diacritization) is the process of restoring these missing diacritics for every character in the written text. It specifies pronunciation and can be viewed as a relaxed variant of word sense disambiguation. For example, the Arabic word علم Elm2 can mean "flag" or "knowledge", but the meaning as well as the pronunciation is specified once the word is diacritized (Ealamu means "flag" while Eilomo means "knowledge"). As an illustrative example in English, if we omit the vowels in the word pn, the word can be read as pan, pin, pun, or pen; each of these variants has a different pronunciation and meaning if it composes a valid word in the language.

State-of-the-art diacritic restoration models have reached decent performance over the years using recurrent or convolutional neural networks, in terms of accuracy (Zalmout and Habash, 2017; Alqahtani et al., 2019; Orife, 2018) and/or efficiency (Alqahtani et al., 2019; Orife, 2018); yet there is still room for further improvement. Most of these models are built on character level information, which helps generalize the model to unseen data but presumably loses some useful information at the word level. Since word level resources are insufficient to be relied upon for training diacritic restoration models, we integrate additional linguistic information that considers word morphology as well as word relationships within a sentence to partially compensate for this loss.

∗The work was conducted while the author was with AWS, Amazon AI.
1Diacritics are marks that are added above, below, or in between the letters to compose a new letter or characterize the letter with a different sound (Wells, 2000).
2We use Buckwalter transliteration encoding: http://www.qamus.org/transliteration.htm.
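The ambiguity above can be made concrete with a toy helper (not from the paper; the diacritic symbol set follows the Buckwalter transliteration used throughout): stripping the diacritic symbols collapses distinct words onto a single ambiguous surface form, which is exactly what a restoration model must undo.

```python
# Toy illustration (illustrative code, not the authors' implementation):
# removing Buckwalter diacritic symbols collapses distinct words.
DIACRITICS = set("aiuoKFN~")  # short vowels, sukun, nunation, gemination

def strip_diacritics(word: str) -> str:
    """Remove Buckwalter diacritic symbols, keeping only base letters."""
    return "".join(ch for ch in word if ch not in DIACRITICS)

# Both "flag" (Ealamu) and "knowledge" (Eilomo) reduce to the same
# undiacritized string Elm, which a restoration model must disambiguate.
print(strip_diacritics("Ealamu"))  # Elm
print(strip_diacritics("Eilomo"))  # Elm
```

Restoration is the inverse, one-to-many direction of this mapping, which is why it is framed below as per-character prediction over a fixed diacritic label set.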
In this paper, we improve the performance of diacritic restoration by building a multitask learning model (i.e. joint modeling). Multitask learning refers to models that learn more than one task at the same time, and it has recently been shown to provide good solutions for a number of NLP tasks (Hashimoto et al., 2016; Kendall et al., 2018). The use of a multitask learning approach provides an end-to-end solution, in contrast to generating the linguistic features for diacritic restoration in a preprocessing step. In addition, it alleviates the reliance on other computational and/or data resources to generate these features. Furthermore, the proposed model is flexible in that a task can be added or removed depending on data availability. This makes the model adaptable to other languages and dialects.

We consider the following auxiliary tasks to boost the performance of diacritic restoration: word segmentation, part-of-speech (POS) tagging, and syntactic diacritization. We use Arabic as a case study for our approach since it has sufficient data resources for the tasks that we consider in our joint modeling.3

The contributions of this paper are twofold:

1. We investigate the benefits of automatically learning related tasks to boost the performance of diacritic restoration;

2. In doing so, we devise a state-of-the-art model for Arabic diacritic restoration as well as a framework for improving diacritic restoration in other languages that include diacritics.

2 Diacritization and Auxiliary Tasks

We formulate the problem of (full) diacritic restoration (DIAC) as follows: given a sequence of characters, we identify the diacritic corresponding to each character in that sequence from the following set of diacritics: {a, u, i, o, K, F, N, ∼, ∼a, ∼u, ∼i, ∼F, ∼K, ∼N}. We additionally consider three auxiliary tasks: syntactic diacritization, POS tagging, and word segmentation. Two of these operate at the word level (syntactic diacritization and POS tagging), while the remaining tasks (diacritic restoration and word segmentation) operate at the character level. This helps diacritic restoration utilize information from both character and word levels, bridging the gap between the two.

Syntactic Diacritization (SYN): This refers to the task of retrieving the diacritics related to the syntactic position of each word in the sentence, which is a sub-task of full diacritic restoration. Arabic is a templatic language in which words comprise roots and patterns, and patterns are typically reflective of diacritic distributions. Verb patterns are more or less predictable, whereas nouns tend to be more complex. Arabic diacritics can be divided into lexical and inflectional (or syntactic) diacritics. Lexical diacritics change the meanings of words as well as their pronunciations, and their distribution is bound by patterns/templates. In contrast, inflectional diacritics are related to the syntactic positions of words in the sentence and are added to the last letter of the main morphemes of words (word finally), changing their pronunciations.4 Inflectional diacritics are also affected by a word's root (e.g. weak roots) and semantic or morphological properties (e.g. with the same grammatical case, masculine and feminine plurals take different diacritics).

Thus, the same word can be assigned a different syntactic diacritic reflecting syntactic case, i.e. depending on its relations to the remaining words in the sentence (e.g. subject or object). For example, the diacritized variants Ealama and Ealamu, which both mean "flag", have the corresponding syntactic diacritics a and u, respectively. That being said, the main trigger for accurate syntactic prediction is the relationships between words, capturing semantic and, most importantly, syntactic information.

Because Arabic has a unique set of diacritics, this study formulates syntactic diacritization in the following way: each word in the input is tagged with a single diacritic representing its syntactic position in the sentence.5 The set of diacritics in syntactic diacritization is the same as the set of diacritics for full diacritic restoration. Other languages that include diacritics can include syntax-related diacritics, but in a different manner and with different complexity compared to Arabic.

Word Segmentation (SEG): This refers to the process of separating affixes from the main unit of the word. Word segmentation is commonly used as a preprocessing step for different NLP applications, and its usefulness is apparent in morphologically rich languages. For example, the undiacritized word وهم whm might be diacritized as waham∼a "and concerned" or waham "illusion", where the first diacritized word consists of two segments "wa ham∼a" while the second is composed of one word. Word segmentation can be formulated in the following way: each character in the input is tagged following the IOB tagging scheme (B: beginning of a segment; I: inside a segment; O: out of the segment) (Diab et al., 2004).

Part-of-Speech Tagging (POS): This refers to the task of determining the syntactic role of a word (i.e. its part of speech) within a sentence. POS tags are highly correlated with diacritics (both syntactic and lexical): knowing one helps determine or reduce the possible choices of the other. For instance, the word كتب ktb in the sentence ktb [someone] means "books" if we know it to be a noun, whereas the word would be either katab "someone wrote" or kat∼ab "made someone write" if it is known to be a verb. POS tagging can be formulated in the following way: each word in the input is assigned a POS tag from the Universal Dependencies tagset (Taji et al., 2017).6

3 Approach

We built a diacritic restoration joint model and studied the extent to which sharing information can improve diacritic restoration performance. Our joint model is motivated by the recent success of the hierarchical modeling proposed in Hashimoto et al. (2016), such that information learned from an auxiliary task is passed as input to the diacritic restoration related layers.7

3.1 Input Representation

Since our joint model may involve both character and word level tasks, we began our investigation by asking the following question: how can information be integrated between these two levels? Starting from randomly initialized character embeddings as well as a pretrained set of embeddings for words, we follow two approaches (Figure 1 visually illustrates the two approaches with an example).

Figure 1: An example of embedding vectors for the word cat and its individual characters c, a, and t. (i) A character-based representation for the word cat from its individual characters; (ii) a concatenation of the word embedding with each of its individual characters.

(1) Character-Based Representation: We pass information learned by character level tasks into word level tasks by composing a word representation from the word's characters. We first concatenate the individual embeddings of the characters in that word, and then apply a Bidirectional Long Short-Term Memory (BiLSTM) layer to generate denser vectors.8 This helps represent morphology and word composition in the model.

(2) Word-To-Character Representation: To pass information learned by word level tasks into character level tasks, we concatenate each word with each of its composing characters at each pass, similar to what is described in Watson et al. (2018). This helps distinguish the individual characters based on the surrounding context, implicitly capturing additional semantic and syntactic information.

Figure 2: The diacritic restoration joint model. All Char Embed entities refer to the same randomly initialized character embedding learned during the training process. Pretrained embeddings refer to fixed word embeddings obtained from fastText (Bojanowski et al., 2017). (i) shows the input representation for the CharToWord and WordToChar embeddings, which is the same as in Figure 1; (ii) represents the diacritic restoration joint model; output labels from each task are concatenated with the WordToChar embedding and optionally with the SEG hidden states.

3Other languages that include diacritics lack such resources; however, the same multitask learning framework can be applied if data resources become available.
4Diacritics that are added due to passivization are also syntactic in nature but are not considered in our syntactic diacritization task. That said, they are still considered in the full diacritic restoration model.
5Combinations of diacritics are possible, but we combine valid possibilities together as one single unit in our model. For example, the diacritics ∼ and a are combined to form an additional diacritic ∼a.
6Refer to https://universaldependencies.org/. This tagset is chosen because it includes the essential POS tags in the language, and it is unified across different languages, which makes it suitable for investigating more languages in the future.
7We also experimented with learning tasks sharing some layers and then diverging to task-specific layers. However, this did not improve the performance compared to the diacritic restoration model without any additional task.
8We also evaluated the use of a feedforward layer and a unidirectional Long Short-Term Memory (LSTM) layer, but a BiLSTM layer yielded better results.
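The two representations of Section 3.1 can be sketched with toy list-based vectors (illustrative code, not the authors' implementation: a simple element-wise average stands in for the learned BiLSTM encoder, and the vectors are hand-picked rather than learned):

```python
# Illustrative sketch of the two input representations with toy vectors.

def char_to_word(char_vecs):
    """(1) Character-based word representation: combine the character
    vectors of a word into one word-level vector (element-wise average
    here, as a stand-in for the BiLSTM over character embeddings)."""
    dim = len(char_vecs[0])
    return [sum(v[i] for v in char_vecs) / len(char_vecs) for i in range(dim)]

def word_to_char(word_vec, char_vecs):
    """(2) WordToChar: concatenate the word vector with each of its
    character vectors, giving every character word-level context."""
    return [list(word_vec) + list(c) for c in char_vecs]

# Toy 2-d character vectors for the word "cat" and a 3-d word vector.
chars = {"c": [1.0, 0.0], "a": [0.0, 1.0], "t": [1.0, 1.0]}
cat_chars = [chars[ch] for ch in "cat"]
cat_word = [0.5, 0.2, 0.1]

assert len(char_to_word(cat_chars)) == 2        # one word-level vector
# Each character now carries a (3 + 2)-dimensional contextual vector.
assert [len(v) for v in word_to_char(cat_word, cat_chars)] == [5, 5, 5]
```

The same character (e.g. t) thus receives a different WordToChar vector in cat than in cats, which is the disambiguation effect discussed in Section 5.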

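The joint objective described in the next subsection combines per-task losses and normalizes by the number of tasks; a minimal sketch (task names and loss values are illustrative, not the authors' code):

```python
# Sketch of the combined multitask objective: per-task losses are
# summed and normalized by the number of tasks in the model.

def combined_loss(task_losses: dict) -> float:
    """Average the losses of all tasks currently in the joint model."""
    return sum(task_losses.values()) / len(task_losses)

losses = {"DIAC": 0.9, "SEG": 0.2, "POS": 0.5, "SYN": 0.4}
print(round(combined_loss(losses), 3))  # 0.5
# Omitting a task (e.g. SEG) simply drops its component:
del losses["SEG"]
print(round(combined_loss(losses), 3))  # 0.6
```

This makes adding or removing an auxiliary task a matter of adding or dropping one loss term, which is the flexibility the paper emphasizes.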
3.2 The Joint Model

For all architectures, the main component is a BiLSTM (Hochreiter and Schmidhuber, 1997; Schuster and Paliwal, 1997), which preserves the temporal order of the sequence and has been shown to provide state-of-the-art performance in terms of accuracy (Zalmout and Habash, 2017; Alqahtani et al., 2019). After representing characters through random initialization and representing words using pretrained embeddings obtained from fastText (Bojanowski et al., 2017), the learning process for each batch runs as follows:

1. We extract the two additional input representations described in Section 3.1;

2. We apply a BiLSTM to each of the different tasks separately to obtain their corresponding outputs;

3. We pass all outputs from all tasks as well as the WordToChar embedding vectors as input to the diacritic restoration model and obtain our diacritic outputs.

Figure 2 illustrates the diacritic restoration joint model. As can be seen, SYN as well as POS tagging are trained on top of the CharToWord representation, which is basically the concatenation of the pretrained embedding for each word with the character-based representation described in Figure 1. SEG is trained separately on top of the character embeddings. We pass the outputs of all these tasks along with the WordToChar representation to train the BiLSTM diacritic restoration model. Omitting a task is rather easy: we just remove the related components for that task to yield the appropriate model. We optionally pass the last hidden layer for SEG along with the remaining input to the diacritic restoration model.9

9Passing the last hidden layer for POS tagging and/or SYN did not improve the performance; the pretrained embeddings are sufficient to capture important linguistic signals.

4 Experimental Setups

Dataset: We use the Arabic Treebank (ATB) dataset, parts 1, 2, and 3, and follow the same data division as Diab et al. (2013). Table 1 illustrates the data statistics. For word based tasks, we segment each sentence into space tokenized words. For character based tasks, we additionally insert a special boundary symbol between these words, and then each word is further segmented into its characters, similar to Alqahtani et al. (2019). We pass each word through the model along with a specific number of previous and future words (+/- 10 words).

Train   | Test   | Dev    | OOV
502,938 | 63,168 | 63,126 | 7.3%

Table 1: Number of words and out-of-vocabulary (OOV) rate for Arabic. OOV rate indicates the percentage of undiacritized words in the test set that have not been observed during training.

Parameter Settings: For all tasks, we use 250 hidden units in each direction (500 units in both directions combined) and an embedding size of 300. We use 3 hidden layers for all tasks except SEG, for which we use only one layer. We use the Adam optimizer with a learning rate of 0.001. We use 20 epochs, a batch size of 16, a hidden dropout of 0.3, and an embedding dropout of 0.5. We initialize the embeddings with a uniform distribution [-0.1, 0.1] and the hidden layers with a normal distribution. The loss scores for all considered tasks are combined and then normalized by the number of tasks in the model.

Evaluation Metrics: We use accuracy for all tasks except diacritic restoration. For diacritic restoration, the two most typically used metrics are Word Error Rate (WER) and Diacritic Error Rate (DER), the percentages of incorrectly diacritized words and characters, respectively. In order to approximate errors in the syntactic diacritics, we use Last Diacritic Error Rate (LER), the percentage of words that have an incorrect diacritic in the last position of the word. To evaluate the models' ability to generalize beyond observed data, we compute WER on OOV (out-of-vocabulary) words.10

10Words that appear in the test dataset but have not been observed in the training dataset.

Significance Testing: We ran each experiment three times and report the mean score.11 We used the t-test with p = 0.05 to evaluate whether the difference between each model's performance and that of the diacritic restoration baseline is significant (Dror et al., 2018).

11A higher number of runs would provide more robust conclusions about the models' performance. We only considered the minimum acceptable number of runs per experiment due to limited computational resources.

5 Results and Analysis

Table 2 shows the performance of the joint diacritic restoration models when different tasks are considered.

Task                       | WER           | DER  | LER/Lex   | OOV WER
Zalmout and Habash (2017)  | 8.21          | -    | -         | 20.2
Zalmout and Habash (2019a) | 7.50          | -    | -         | -
Alqahtani and Diab (2019a) | 7.6           | 2.7  | -         | 32.1
BASE (Char)                | 8.51 (±0.01)  | 2.80 | 5.20/5.54 | 34.56
BASE (WordToChar)          | 8.09 (±0.05)  | 2.73 | 5.00/5.30 | 32.10
DIAC+SEG                   | 8.35 (±0.02)  | 2.82 | 5.20/5.46 | 33.97
DIAC+SYN                   | 7.70* (±0.02) | 2.60 | 4.72/5.08 | 30.94
DIAC+POS                   | 7.86* (±0.14) | 2.65 | 4.72/5.20 | 32.28
DIAC+SEG+SYN               | 7.70* (±0.05) | 2.59 | 4.65/5.03 | 31.33
DIAC+SEG+POS               | 7.73* (±0.08) | 2.62 | 4.73/5.01 | 31.31
DIAC+SYN+POS               | 7.72* (±0.06) | 2.61 | 4.62/5.06 | 31.05
ALL                        | 7.51* (±0.09) | 2.54 | 4.54/4.91 | 31.07

Table 2: Performance of the joint diacritic restoration model when different related tasks are considered. Bold numbers represent the best score per column. Almost all models improve over BASE (Char). * denotes statistically significant improvements compared to the baselines. Lex refers to the percentage of words that have incorrect lexical diacritics only, excluding syntactic diacritics.

When we consider WordToChar as input to the diacritic restoration model, we observe statistically significant improvements for all evaluation metrics. This is justified by the ability of word embeddings to capture syntactic and semantic information at the sentence level. The same character is disambiguated in terms of the surrounding context as well as the word it appears in (e.g. the character t in the word cat would be represented slightly differently than t in a related word cats or even a different word table). We consider both the character based model and the WordToChar based model as our baselines (BASE). We use the WordToChar representation rather than characters for all remaining models that jointly learn more than one task.

For all experiments, we observe improvements compared to both baselines across all evaluation metrics. Furthermore, all models except DIAC+SEG outperform the WordToChar diacritic restoration model in terms of WER, showing the benefits of considering the output distributions of the other tasks. Despite leveraging tasks focused on syntax (SYN/POS) or morpheme boundaries (SEG), the improvements extend to lexical diacritics as well. Thus, the proposed joint diacritic restoration model is also helpful in settings beyond word final syntactic related diacritics. The best performance is achieved when we consider all auxiliary tasks within the diacritic restoration model.

Impact of Auxiliary Tasks: We discuss the impact of adding each investigated task on the performance of the diacritic restoration model.

Word segmentation (DIAC+SEG): When morpheme boundaries as well as diacritics are learned jointly, the WER performance is slightly reduced on all and OOV words. This reduction is attributed mostly to lexical diacritics. As Arabic exhibits a non-concatenative fusional morphology, reducing its complexity to a segmentation task might inherently obscure the morphological processes for each form.

Observing only a slight improvement is surprising; we believe that this is due to our experimental setup and does not negate the importance of having morphemes that assign the appropriate diacritics. We speculate that the reason is that we do not capture the interaction between morphemes as an entity, losing some level of morphological information.

For instance, the words waham∼a versus wahum for the undiacritized word whm (bold letters refer to consonants, distinguishing them from diacritics) would benefit from morpheme boundary identification to tease apart wa from hum in the second variant (wahum), emphasizing that these are two words. On the other hand, it adds an additional layer of ambiguity for other cases like the morpheme ktb in the diacritized variants kataba, kutubu, sayakotubo - note that the underlined segment has the same consonants across the variants - in which identifying morphemes increased the number of possible diacritic variants without learning the interactions between adjacent morphemes.

Furthermore, we found inconsistencies in the dataset for morphemes, which might cause the drop in performance when we only consider SEG. When we consider all tasks together, these inconsistencies are reduced because of the combined information from different linguistic signals.

Syntactic diacritization (DIAC+SYN): By enforcing inflectional diacritics through an additional focused layer within the diacritic restoration model, we observe improvements in WER compared to the baselines. We notice improvements on syntactic related diacritics (LER score), which is expected given the nature of syntactic diacritization, in which the model learns the underlying syntactic structure to assign the appropriate syntactic diacritic to each word. Improvements also extend to lexical diacritics, because word relationships are captured while learning syntactic diacritics, where BiLSTM modeling for words is integrated.

POS tagging (DIAC+POS): When we jointly train POS tagging with full diacritic restoration, we notice improvements compared to both baselines. Compared to syntactic diacritization, we obtain similar findings across all evaluation metrics except for WER on OOV words, where POS tagging drops. Including POS tagging within diacritic restoration also captures important information about the words; the idea of POS tagging is to learn the underlying syntax of the sentence. In comparison to syntactic diacritization, it involves different types of information, like passivization, which could be essential in learning correct diacritics.

Ablation Analysis: Incorporating all the auxiliary tasks under study within the diacritic restoration model (ALL) provides the best performance across all measures except WER on OOV words, for which the best performance was given by DIAC+SYN. We discuss the impact of removing one task at a time from ALL and examine whether its exclusion significantly impacts the performance.

Excluding SEG from the process drops the performance of diacritic restoration. This shows that even though SEG did not help greatly when it was combined solely with diacritic restoration, the combination of SEG and the other word based tasks filled in the gaps that were missing from just identifying morpheme boundaries. Excluding either POS tagging or syntactic diacritization also hurts the performance, which shows that these tasks complement each other and, taken together, improve the performance of the diacritic restoration model.

Input Representation:

Impact of output labels: Table 3 shows the different models when we do not pass the labels of the investigated tasks (the input is only the WordToChar representation) against the same models when we do. We notice a drop in performance across all models. Note that all models - even when we do not consider the labels - perform better than the baselines. This also supports the benefits of the WordToChar representation.

Tasks        | With Labels | Without Labels
DIAC+SYN     | 7.70        | 7.99
DIAC+POS     | 7.86        | 7.93
DIAC+SEG+SYN | 7.70        | 7.93
DIAC+SEG+POS | 7.73        | 7.99
DIAC+SYN+POS | 7.72        | 7.97
ALL          | 7.51        | 7.91

Table 3: WER performance when we do not consider the output labels for the investigated tasks. Bold numbers represent the best score per row.

Last hidden layer of SEG: Identifying morpheme boundaries did not increase accuracy as we expected. Therefore, we examined whether information learned by the BiLSTM layer would help us learn morpheme interactions, by passing the output of the last BiLSTM layer to the diacritic restoration model along with the segmentation labels. We did not observe any improvements towards predicting accurate diacritics when we pass this information. For ALL, the WER score increased by 0.22%. Thus, it is sufficient to only utilize the segment labels for diacritic restoration.

Passive and active verbs: Passivization in Arabic is denoted through diacritics, and missing such a diacritic can cause ambiguity in some cases (Hermena et al., 2015; Diab et al., 2007). To examine its impact, we further divide verbs in the POS tagset into passive and active, increasing the tagset size by one. Table 4 shows the diacritic restoration performance with and without considering passivization. We notice improvements, in some combinations of tasks, across all evaluation metrics compared to the pure POS tagging, showing its importance in diacritic restoration models.

Task         | With Pass | Without Pass
DIAC+POS     | 7.65      | 7.86
DIAC+SEG+POS | 7.65      | 7.73
DIAC+SYN+POS | 7.78      | 7.72
ALL          | 7.62      | 7.51

Table 4: WER performance for different diacritic restoration models when passivization is considered. Bold numbers represent the best score per row.

Level of linguistic information: The joint diacritic restoration models were built empirically and tested against the development set. We noticed that, to improve the performance, soft parameter sharing in a hierarchical fashion performs better for diacritic restoration. We experimented with building a joint model that learns segmentation and diacritics through hard parameter sharing. To learn segmentation with diacritic restoration, we shared the embedding layer between the two tasks as well as some or all BiLSTM layers. We got a WER on all words of 8.53∼9.35, showing no improvements compared to character based diacritic restoration. To learn word based tasks with diacritic restoration, we pass the WordToChar representation to the diacritic restoration model and/or the CharToWord representation to the word based tasks. The best that we could get for both tasks is 8.23%∼9.6%; no statistically significant improvements were found. This shows the importance of the hierarchical structure for appropriate diacritic assignment.

Qualitative analysis: We compared random errors that are correct in DIAC (character-based diacritic restoration) with ALL, in which we consider all investigated tasks. Although ALL provides accurate results for more words, it introduces errors in other words that have been correctly diacritized by DIAC. The patterns of such words are not clear. We did not find a particular category that occurs in one model but not the other; rather, the types and quantity of errors differ in each of these categories.

State-of-the-art Comparison: Table 2 also shows the performance of the state-of-the-art models. The ALL model surpasses the performance of Zalmout and Habash (2017). However, Zalmout and Habash (2017)'s model performs significantly better on OOV words. Zalmout and Habash (2019a) provides comparable performance to the ALL model. The difference between their work and that in Zalmout and Habash (2017) is the use of a joint model to learn morphological features other than diacritics (i.e. features at the word level), rather than learning these features individually. Zalmout and Habash (2019a) obtained an additional boost in performance (0.3% improvement over ours) when they add a dialectal variant of Arabic in the learning process, sharing information between both languages.

Alqahtani and Diab (2019a) provides comparable performance to ALL and better performance on some task combinations in terms of WER on all and OOV words. The difference between their model and our BASE model is the addition of a CRF (Conditional Random Fields) layer, which incorporates dependencies in the output space at the cost of the model's computational efficiency (memory and speed).

Zalmout and Habash (2019b) provides the current state-of-the-art performance, building a morphological disambiguation framework for Arabic similar to Zalmout and Habash (2017, 2019a). They reported their scores on the development set, which was not used for tuning. On the development set, they obtained 93.9%, which significantly outperforms our best model (ALL) by 1.4%. Our approach is similar to Zalmout and Habash (2019b): we both follow the WordToChar as well as CharToWord input representations discussed in Section 3.1, regardless of the specifics, and we both consider the morphological outputs as features in our diacritic restoration model. In Zalmout and Habash (2019b), the morphological feature space that is considered is larger, making use of all morphological features in Arabic. Furthermore, Zalmout and Habash (2019b) use sequence-to-sequence modeling rather than sequence classification as we do. Unlike Zalmout and Habash (2019b), our model is more flexible, allowing additional tasks to be added when sufficient resources are available.

We believe that neither the underlying architecture nor the consideration of all possible features was the crucial factor that led to the significant reduction in WER; rather, the use of morphological analyzers is crucial for such significant improvement. As a matter of fact, in Zalmout and Habash (2019b), the performance significantly drops to 7.2 when they, similar to our approach, take the highest probability value as a solution. Thus, we believe that the use of morphological analyzers enforces valid word composition in the language and filters out invalid words (a side effect of using characters as input representation). This also justifies the significant improvement on OOV words obtained by Zalmout and Habash (2017). Thus, we believe that a global knowledge of words and internal constraints within words are captured.

Auxiliary tasks: We compared the base models of the auxiliary tasks to the state of the art (SOTA). For SEG, our BiLSTM model has comparable performance to that in Zalmout and Habash (2017) (SEG yields 99.88% F1 compared to SOTA 99.6%). For POS, we use a shallower tagset (16 tags compared to ∼70) than typically used in previous models, hence we do not have a valid comparison set. For SYN, we compare our results with Hifny (2018), which uses a hybrid network of BiLSTM and Maximum Entropy to solve syntactic diacritization. SYN yields results comparable to SOTA (our model achieves 94.22 vs. SOTA 94.70).

6 Related Work

The problem of diacritization has been addressed using classical machine learning approaches (e.g. Maximum Entropy and Support Vector Machines) (Zitouni and Sarikaya, 2009; Pasha et al., 2014) or neural approaches for different languages that include diacritics, such as Arabic, Vietnamese, and Yoruba. Neural approaches yield state-of-the-art performance for diacritic restoration by using Bidirectional LSTMs or temporal convolutional networks (Zalmout and Habash, 2017; Orife, 2018; Alqahtani et al., 2019; Alqahtani and Diab, 2019a).

Arabic syntactic diacritization has been consistently reported to be difficult, degrading the performance of full diacritic restoration (Zitouni et al., 2006; Habash et al., 2007; Said et al., 2013; Shaalan et al., 2009; Shahrour et al., 2015; Darwish et al., 2017). To improve the performance of syntactic diacritization, or of full diacritic restoration in general, previous studies followed different approaches. Some studies separate lexical from syntactic diacritization (Shaalan et al., 2009; Darwish et al., 2017). Other studies consider additional linguistic features such as POS tags and word segmentation (i.e. tokens or morphemes) (Ananthakrishnan et al., 2005; Zitouni et al., 2006; Zitouni and Sarikaya, 2009; Shaalan et al., 2009).

Hifny (2018) addresses syntactic diacritization by building a BiLSTM model whose input embeddings are augmented with manually generated features of context, POS tags, and word segments. Rashwan et al. (2015) use a deep belief network to build a diacritization model for Arabic that focuses on improving syntactic diacritization, building sub-classifiers based on the analysis of a confusion matrix and POS tags.

Regarding the incorporation of linguistic features into the model, previous studies have used morphological features either in a preprocessing step or in a ranking step for building diacritic restoration models. As a preprocessing step, the words are converted to their constituents (e.g. morphemes, lemmas, or n-grams) and then diacritic restoration models are built on top of that (Ananthakrishnan et al., 2005; Alqahtani and Diab, 2019b). Ananthakrishnan et al. (2005) use POS tags to improve diacritic restoration at the syntax level, assuming that POS tags are known at inference time.

As a ranking procedure, all possible analyses of words are generated and then the most probable analysis is chosen (Pasha et al., 2014; Zalmout and Habash, 2017, 2019a,b). Zalmout and Habash (2017) develop a morphological disambiguation model to determine Arabic morphological features including diacritization. They train the model using a BiLSTM and consult an LSTM-based language model as well as other morphological features to rank and score the output analyses. A similar methodology can be found in Pasha et al. (2014), but using Support Vector Machines. This methodology shows better performance on out-of-vocabulary (OOV) words compared to pure character models.

7 Discussion & Conclusion

We present a diacritic restoration joint model that considers the output distributions of different related tasks to improve the performance of diacritic restoration. Our results show statistically significant improvements across all evaluation metrics.

References

Sawsan Alqahtani, Ajay Mishra, and Mona Diab. 2019. Convolutional neural networks for diacritic restoration. In EMNLP.

Sankaranarayanan Ananthakrishnan, Shrikanth Narayanan, and Srinivas Bangalore. 2005. Automatic diacritization of Arabic transcripts for automatic speech recognition. In Proceedings of the 4th International Conference on Natural Language Processing, pages 47–54.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics.

Kareem Darwish, Hamdy Mubarak, and Ahmed Abdelali. 2017. Arabic diacritization: Stats, rules, and hacks. In Proceedings of the Third Arabic Natural Language Processing Workshop, pages 9–17.

Mona Diab, Mahmoud Ghoneim, and Nizar Habash. 2007. Arabic diacritization in the context of statistical machine translation. In Proceedings of MT-Summit.

Mona Diab, Nizar Habash, Owen Rambow, and Ryan Roth. 2013. LDC Arabic treebanks and associated corpora: Data divisions manual. arXiv preprint arXiv:1309.5652.

Mona Diab, Kadri Hacioglu, and Daniel Jurafsky. 2004. Automatic tagging of Arabic text: From raw text to base phrase chunks. In Proceedings of HLT-NAACL 2004: Short papers. Association for Computational Linguistics.
This shows the importance of considering Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Re- additional linguistic information at morphological ichart. 2018. The hitchhikers guide to testing statis- and/or sentence levels. Including semantic informa- tical significance in natural language processing. In tion through pretrained word embeddings within Proceedings of the 56th Annual Meeting of the As- the diacritic restoration model also helped boosting sociation for Computational Linguistics, volume 1, pages 1383–1392. the diacritic restoration performance. Although we apply our joint model on Arabic, this model pro- Nizar Habash, Ryan Gabbard, Owen Rambow, Seth vides a framework for other languages that include Kulick, and Mitch Marcus. 2007. Determining case in arabic: Learning complex linguistic behavior re- diacritics whenever resources become available. quires complex linguistic features. In Proceedings Although we observed improvements in terms of of the 2007 Joint Conference on Empirical Methods generalizing beyond observed data when using the in Natural Language Processing and Computational proposed linguistic features, the OOV performance Natural Language Learning (EMNLP-CoNLL). is still an issue for diacritic restoration. Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsu- ruoka, and Richard Socher. 2016. A joint many-task model: Growing a neural network for multiple nlp References tasks. arXiv preprint arXiv:1611.01587. Ehab W Hermena, Denis Drieghe, Sam Hellmuth, and Sawsan Alqahtani and Mona Diab. 2019a. Investigat- Simon P Liversedge. 2015. Processing of arabic di- ing input and output units in diacritic restoration. In acritical marks: Phonological–syntactic disambigua- 2019 18th IEEE International Conference On Ma- tion of homographic verbs and visual crowding ef- chine Learning And Applications (ICMLA). IEEE. fects. Journal of Experimental Psychology: Human Perception and Performance, 41(2):494. Sawsan Alqahtani and Mona Diab. 2019b. 
Investigat- ing input and output units in diacritic restoration. In Yasser Hifny. 2018. Hybrid lstm/maxent networks for 2019 18th IEEE International Conference on Ma- arabic syntactic diacritics restoration. IEEE Signal chine Learning and Applications (ICMLA). Processing Letters, 25(10):1515–1519. Sepp Hochreiter and Jurgen¨ Schmidhuber. 1997. Nasser Zalmout and Nizar Habash. 2017. Don’t throw Long short-term memory. Neural computation, those morphological analyzers away just yet: Neu- 9(8):1735–1780. ral morphological disambiguation for arabic. In Pro- ceedings of the 2017 Conference on Empirical Meth- Alex Kendall, Yarin Gal, and Roberto Cipolla. 2018. ods in Natural Language Processing, pages 704– Multi-task learning using uncertainty to weigh 713. losses for scene geometry and semantics. In Pro- ceedings of the IEEE Conference on Vi- Nasser Zalmout and Nizar Habash. 2019a. Adversarial sion and , pages 7482–7491. multitask learning for joint multi-feature and multi- dialect morphological modeling. arXiv preprint Iroro Orife. 2018. Attentive sequence-to-sequence arXiv:1910.12702. learning for diacritic restoration of yor\ub\’a lan- guage text. arXiv preprint arXiv:1804.00832. Nasser Zalmout and Nizar Habash. 2019b. Joint dia- critization, lemmatization, normalization, and fine- Arfath Pasha, Mohamed Al-Badrashiny, Mona T Diab, grained morphological tagging. arXiv preprint Ahmed El Kholy, Ramy Eskander, Nizar Habash, arXiv:1910.02267. Manoj Pooleery, Owen Rambow, and Ryan Roth. Imed Zitouni and Ruhi Sarikaya. 2009. Arabic diacritic 2014. Madamira: A fast, comprehensive tool for restoration approach based on maximum entropy morphological analysis and disambiguation of ara- models. Computer Speech & Language, 23(3):257– bic. In LREC, volume 14, pages 1094–1101. 276. Mohsen AA Rashwan, Ahmad A Al Sallab, Hazem M Imed Zitouni, Jeffrey S Sorensen, and Ruhi Sarikaya. Raafat, and Ahmed Rafea. 2015. Deep learn- 2006. 
Maximum entropy based restoration of ara- ing framework with confused sub-set resolution bic diacritics. In Proceedings of the 21st Interna- architecture for automatic arabic diacritization. tional Conference on Computational Linguistics and IEEE/ACM Transactions on Audio, Speech and Lan- the 44th annual meeting of the Association for Com- guage Processing (TASLP), 23(3):505–516. putational Linguistics, pages 577–584. Association for Computational Linguistics. Ahmed Said, Mohamed El-Sharqwi, Achraf Chalabi, and Eslam Kamal. 2013. A hybrid approach for ara- bic diacritization. In International Conference on Application of Natural Language to Information Sys- tems, pages 53–64. Springer.

Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.

Khaled Shaalan, Hitham M Abo Bakr, and Ibrahim Ziedan. 2009. A hybrid approach for building arabic diacritizer. In Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages, pages 27–35. Association for Computational Linguistics.

Anas Shahrour, Salam Khalifa, and Nizar Habash. 2015. Improving arabic diacritization through syntactic analysis. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1309–1315.

Dima Taji, Nizar Habash, and Daniel Zeman. 2017. Universal dependencies for arabic. In Proceedings of the Third Arabic Natural Language Processing Workshop, pages 166–176.

Daniel Watson, Nasser Zalmout, and Nizar Habash. 2018. Utilizing character and word embeddings for text normalization with sequence-to-sequence models. arXiv preprint arXiv:1809.01534.

JC Wells. 2000. Orthographic diacritics and multilingual computing. Language Problems and Language Planning, 24(3):249–272.