
Homograph Disambiguation Through Selective Diacritic Restoration

Sawsan Alqahtani,1,2 Hanan Aldarmaki,1 Mona Diab1,3
1The George Washington University
2Princess Nourah Bint Abdul Rahman University
3AWS, Amazon AI

Abstract

Lexical ambiguity, a challenging phenomenon in all natural languages, is particularly prevalent for languages whose diacritics tend to be omitted in writing, such as Arabic. Omitting diacritics leads to an increase in the number of homographs: different words with the same spelling. Diacritic restoration could theoretically help disambiguate these words, but in practice, the increase in overall sparsity leads to performance degradation in NLP applications. In this paper, we propose approaches for automatically marking a subset of words for diacritic restoration, which leads to selective homograph disambiguation. Compared to full or no diacritic restoration, these approaches yield selectively-diacritized datasets that balance sparsity and lexical disambiguation. We evaluate the various selection strategies extrinsically on several downstream applications: neural machine translation, part-of-speech tagging, and semantic textual similarity. Our experiments on Arabic show promising results, where our devised strategies for selective diacritization lead to a more balanced and consistent performance in downstream applications.

1 Introduction

Lexical ambiguity, an inherent phenomenon in natural languages, refers to words or phrases that can have multiple meanings. In written text, lexical ambiguity can be roughly characterized into two categories: polysemy and homonymy. A polysemous word has multiple senses that express different but related meanings (e.g. 'head' as an anatomical body part, or as a person in charge), whereas homonyms are different words that happen to have the same spelling (e.g. 'bass' as an instrument vs. a fish) (Löbner, 2013). Homographs are words that have the same spelling but may have different pronunciation and meaning.

A diacritic is a mark that is added above, below, or within letters to indicate pronunciation, vowels, or other functions. For languages that use diacritical marks, such as Arabic or Hebrew, the orthography is typically under-specified for such marks, i.e. the diacritics are omitted. This phenomenon exacerbates the lexical ambiguity problem since it increases the rate of homographs. For example, without considering context, the undiacritized Arabic word ktb may refer to any of the following diacritized variants:1 katab "wrote", kutub "books", or kutib "was written".

1 We adopt the Buckwalter transliteration encoding into Latin script for rendering Arabic text: http://www.qamus.org/transliteration.htm.

As an illustrative example in English, dropping the vowels in a word such as pan yields the under-specified token pn, which can be mapped to pin, pan, pun, or pen. It should be noted that even after fully specifying words with their relevant diacritics, homonyms such as "bass" are still ambiguous; likewise in Arabic, the fully-specified word bayot can either mean "verse" or "house".

In this paper, we devise strategies to automatically identify and disambiguate a subset of the homographs that result from omitting diacritics. While context is often sufficient for determining the meaning of ambiguous words, explicitly restoring missing diacritics should provide valuable additional information for homograph disambiguation. This process, diacritization, would render the resulting text comparable to that of languages whose words are orthographically fully specified, such as English.

Past studies have focused on developing models for automatic diacritic restoration that can be used as a pre-processing step for various applications such as text-to-speech (Ungurean et al., 2008) and reading comprehension (Hermena et al., 2015). In theory, restoring all diacritics should also help improve the performance of NLP applications such as machine translation.
However, in practice, full diacritic restoration results in increased sparsity and out-of-vocabulary words, which leads to degradation in performance (Diab et al., 2007; Alqahtani et al., 2016). The main objective of this work is to find a sweet spot between zero and full diacritization in order to reduce lexical ambiguity without increasing sparsity. We propose selective diacritization, a process of restoring diacritics to a subset of the words in a sentence sufficient to disambiguate homographs without significantly increasing sparsity. Selective diacritization can be viewed as a relaxed variant of word sense disambiguation since only homographs that arise from missing diacritics are disambiguated.2

2 Identifying empirically successful selective diacritization strategies can help discover optimal diacritization schemes; however, this direction is currently beyond the scope of this work.

Intrinsically evaluating the quality of a devised selective diacritization scheme against a gold standard is challenging since it is difficult to obtain a dataset that exhibits consistent selective diacritization with reliable inter-annotator agreement (Zaghouani et al., 2016b; Bouamor et al., 2015), thereby necessitating an empirical automatic investigation. Hence, in this work, we evaluate the proposed selective diacritization schemes extrinsically on various semantic and syntactic downstream NLP applications: Semantic Textual Similarity (STS), Neural Machine Translation (NMT), and Part-of-Speech (POS) tagging. We compare our selective strategies against two baselines: full diacritization and zero diacritization, applied to all the words in the text. We use Modern Standard Arabic (MSA) as a case study.3

3 The proposed methodologies can be applied to other languages where diacritics are omitted.

Our approach is summarized as follows: we start with full diacritic restoration of a large corpus, then apply different unsupervised methods to identify the words that are ambiguous when undiacritized. This results in a dictionary where each word is assigned an ambiguity label (ambiguous vs. unambiguous). Selectively-diacritized datasets can then be constructed by restoring the full diacritics only to the words that are identified as ambiguous.

The contribution of this paper is threefold:

1. We introduce automatic selective diacritization as a viable step in lexical disambiguation and provide an encouraging baseline for future developments towards optimal diacritization. Section 2 describes existing work towards optimal diacritization and how it differs from our approach;

2. We propose several unsupervised data-driven methods for the automatic identification of ambiguous words;

3. We evaluate and analyze the impact of partial sense disambiguation (i.e. selective diacritic restoration of identified homographs) in downstream applications for MSA.
2 Related Work

We are concerned mainly with studies that target word disambiguation through the restoration of diacritics/accents. Homograph disambiguation through accents has been explored previously in several studies with the use of different rule-based and machine-learning approaches for languages such as Arabic, Spanish, Igbo, and Vietnamese (Ezeani et al., 2017; Nguyen et al., 2012; Nivre et al., 2017; Said et al., 2013; Tufiş and Chiţu, 1999).

Bouamor et al. (2015) conducted a pilot study where they asked human annotators to add the minimum number of diacritics sufficient to disambiguate homographs. However, attempts to provide human annotation for selective diacritization resulted in low inter-annotator agreement due to the annotators' subjectivity and different linguistic understanding of the words and contexts (Bouamor et al., 2015; Zaghouani et al., 2016b). To address this issue, Zaghouani et al. (2016b) used a morphological disambiguation tool, MADAMIRA (Pasha et al., 2014), to identify candidate words that may need disambiguation. A word was considered ambiguous if MADAMIRA generated multiple high-scoring diacritic alternatives, and human annotators were asked to select from these alternatives or manually edit the diacritics if none of the options was deemed correct. This resulted in a significant increase in inter-annotator agreement. Our work differs in two aspects: first, we develop automatic methods for ambiguity detection based on word usage; second, we restore the diacritics for all occurrences of these ambiguous words, whereas in (Zaghouani et al., 2016b) the same word may be tagged as ambiguous in some cases but not in others depending on context, which makes it harder to generalize to new datasets.

Yarowsky (1994) developed an accent restoration algorithm for Spanish and French that specifies the accent patterns for ambiguous words (i.e. words with multiple accent patterns). Our intuition differs from that of Yarowsky (1994) in two ways. First, they added diacritics to all words that have more than one diacritic pattern, while we add the diacritics for only a subset of candidate words. Second, they used context for adding diacritics, while we use context to isolate words that require diacritics, for which we apply an off-the-shelf diacritic restoration model.

Rather than restoring all diacritics in the written text, the idea of adding only the diacritics sufficient to resolve lexical ambiguity was initially introduced in (Diab et al., 2007). They developed several linguistically-based partial schemes and evaluated their methods in Statistical Machine Translation. They found that fully diacritizing texts led to performance degradation due to sparseness, while no diacritization increased the lexical ambiguity rate.
Similar results were found in (Alqahtani et al., 2016), where several other basic diacritic patterns were investigated. Although the impact of diacritics on machine translation was promising, the developed partial schemes did not show significant improvements over the non-diacritized and fully-diacritized baselines.

Alnefaie and Azmi (2017) introduced a partial diacritization scheme for MSA based on the output of a morphological analyzer in addition to WordNet (Black et al., 2006), and Alqahtani et al. (2018) created a lexical resource that assigns an ambiguity label to each word, where a word is considered ambiguous if it has more than one diacritic possibility, with and without considering its part-of-speech tag. However, neither (Alnefaie and Azmi, 2017) nor (Alqahtani et al., 2018) evaluated their methods empirically to demonstrate their effectiveness for NLP applications. Hanai and Glass (2014) similarly developed three linguistically-based partial diacritic schemes for automatic speech recognition and found statistically significant improvement over the baseline. However, their work is focused on improving word pronunciations, whereas we focus on word sense disambiguation. Ezeani et al. (2017) discussed the impact of adding accents to each and every word in the Igbo language, potentially increasing the performance for machine translation and word sense disambiguation.

All of the aforementioned approaches either apply full diacritics to all words whenever appropriate or derive partial diacritic schemes based on linguistic understanding; crucially, these partial diacritic schemes are applied to all words in a sentence.4 Our devised strategies differ in that we apply full diacritization to a select set of tokens in the text. Our work is related to these previous studies in the sense that we reduce the search space of candidate words that could benefit from full or partial diacritization without increasing sparsity. Furthermore, the novelty of this work lies in utilizing automatic unsupervised methods to identify such words.

4 For instance, the undiacritized sentence bEd ywm "after a day" would be rendered baEod yawom when fully diacritized, bEod ywom under the SUK scheme of (Diab et al., 2007; Alqahtani et al., 2016) when partially diacritized, and baEod ywm when selectively diacritized.

3 Approach

3.1 Selective Diacritization

Selective diacritization is the process of restoring diacritics to a subset of words in a text corpus. Manually annotating words in a dataset with binary ambiguity labels (ambiguous vs. unambiguous) is challenging due to the difficulty in defining the ambiguous words that would benefit from diacritics (Zaghouani et al., 2016b). Therefore, we propose several techniques to automatically identify ambiguous words for selective diacritization.

Since it is common to use distributed word vector representations in downstream tasks, we define ambiguity in terms of distributional similarity among diacritized word variants. Our intuition is that variants with low distributional similarity are more likely to benefit from diacritization to disambiguate their meanings and tease apart their context variations. On the other hand, word variants with highly similar contexts tend to have very similar distributional representations, which results in unnecessary redundancy and sparsity if all variants are kept.

Based on this definition, we developed several context-based approaches to identify candidate ambiguous word types and generate a set of dictionaries with ambiguity labels (AmbigDict), where each word is marked as either ambiguous or unambiguous. The proposed approaches can be classified by the type of tokens used to create the AmbigDict: diacritized (AmbigDict-DIAC) or undiacritized (AmbigDict-UNDIAC). For example, an entry in AmbigDict-UNDIAC would be "Elm": ambiguous or "ktb": unambiguous, whereas an entry in AmbigDict-DIAC would be "Ealam": ambiguous or "kutub": unambiguous.
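To make the distributional-similarity intuition concrete, the following is a minimal illustrative sketch and not one of the AmbigDict schemes evaluated below; it assumes pre-trained embedding vectors are available for the diacritized variants of a single undiacritized form, and the similarity threshold of 0.5 and the toy vectors are placeholders.

```python
# Illustrative sketch of the distributional-similarity intuition; the
# threshold and the toy vectors are placeholders, not values from the paper.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def variants_look_ambiguous(variant_vectors, threshold=0.5):
    """Return True if any pair of diacritized variants of the same
    undiacritized form has embeddings with low cosine similarity."""
    for i in range(len(variant_vectors)):
        for j in range(i + 1, len(variant_vectors)):
            if cosine(variant_vectors[i], variant_vectors[j]) < threshold:
                return True
    return False

# Toy vectors standing in for embeddings of two variants such as
# Ealam "flag" and Ealima "learned":
print(variants_look_ambiguous([np.array([1.0, 0.1]), np.array([0.1, 1.0])]))
```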
3.2 AmbigDict-UNDIAC Generation

We explore two methods for creating ambiguity dictionaries from undiacritized text: using a morphological analyzer, and unsupervised sense induction.

Multiple Morphological Variants (MULTI): The number of diacritic alternatives for a word can be a clue to determine whether a word is ambiguous due to missing diacritics (Alqahtani et al., 2018). In this approach, context is not considered; rather, we rely on the output of a morphological analyzer applied to the text. We leverage the morphological analyzer component of MADAMIRA (Pasha et al., 2014) to generate all possible valid diacritic variants of a word, whether these variants are present in the corpus or not. If an undiacritized word has more than one possible diacritic variant, we consider it ambiguous. We use this context-independent approach as a baseline.
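A minimal sketch of the MULTI criterion, assuming the analyzer output is exposed through a generic analyze callable; the callable and the toy analyses below are hypothetical stand-ins, not the actual MADAMIRA interface.

```python
def build_multi_dict(undiacritized_vocab, analyze):
    """MULTI sketch: label an undiacritized word ambiguous if the
    morphological analyzer returns more than one valid diacritized
    variant for it; otherwise label it unambiguous."""
    ambig_dict = {}
    for word in undiacritized_vocab:
        variants = set(analyze(word))
        ambig_dict[word] = "ambiguous" if len(variants) > 1 else "unambiguous"
    return ambig_dict

# Toy stand-in for analyzer output (not real MADAMIRA output):
toy_analyses = {"ktb": {"katab", "kutub", "kutib"}, "w": {"wa"}}
print(build_multi_dict(toy_analyses, lambda w: toy_analyses[w]))
# {'ktb': 'ambiguous', 'w': 'unambiguous'}
```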
Sense Induction Based Approach (SENSE): Selective diacritization is related to word sense disambiguation; however, we only target disambiguation through diacritic restoration. Techniques used in automatic word sense induction can therefore be used as a basis for identifying ambiguous words. Using undiacritized text, we apply an off-the-shelf system for word sense induction developed by Pelevina et al. (2017), which uses the Chinese Whispers algorithm (Biemann, 2006) to identify senses from a graph constructed by computing word similarities (highest cosine similarities) using word as well as context embeddings. We apply the first three steps described in Pelevina et al. (2017) but do not use the generated sense-based embeddings; we only use the system to identify the words with multiple senses. We set the three parameters as follows: the graph size N to 200, the inventory granularity n to 400, and the minimum number of clusters (senses) k to 5.5 A word type is deemed ambiguous if it appears in more than one cluster.

5 We tuned these parameters empirically.
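A minimal sketch of the SENSE criterion, assuming the induced sense inventory has already been parsed into a mapping from each undiacritized word to its set of induced sense IDs; the output format of the sense induction system itself is not reproduced here.

```python
def build_sense_dict(sense_inventory, min_senses=2):
    """SENSE sketch: a word type is labeled ambiguous if the word sense
    induction system assigned it at least `min_senses` sense clusters."""
    return {
        word: "ambiguous" if len(senses) >= min_senses else "unambiguous"
        for word, senses in sense_inventory.items()
    }

# Toy sense inventory (illustrative only):
print(build_sense_dict({"Elm": {0, 1}, "ktb": {3}}))
# {'Elm': 'ambiguous', 'ktb': 'unambiguous'}
```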
3.3 AmbigDict-DIAC Generation

We explore clustering-based and translation-based methods to create ambiguity dictionaries from diacritized text.

Clustering-based Approaches (CL): Similar in spirit to SENSE, we apply unsupervised clustering to our corpora to induce an AmbigDict. However, unlike SENSE, we apply clustering to diacritized data. Our intuition is that dissimilar words are likely to occur in different contexts, and therefore likely to fall in different clusters. Therefore, we tag words as ambiguous if diacritized variants of the same underlying undiacritized form appear in different clusters.

As a preprocessing step, we apply a full contextualized diacritization tool to the underlying corpora. We leverage the MADAMIRA tool (Pasha et al., 2014) to produce fully diacritized text (for every token in the data) covering both types of diacritic restoration: lexical and syntactic. The latter covers syntactic case and mood diacritics. In this study, we are only concerned with lexical ambiguity; moreover, MADAMIRA has a very high diacritic error rate in syntactic diacritic restoration (15%) compared to lexical diacritic restoration (3.5%). Hence, we drop the predicted word-final syntactic diacritics, resulting in a diacritization scheme similar to the partial scheme in (Diab et al., 2007; Alqahtani et al., 2016), namely FULL-CM. In FULL-CM, every token is fully lexically diacritized (e.g. the fully diacritized words Ealama and Ealamu differ only in their syntactic diacritics and are both mapped to Ealam "flag" in FULL-CM).

Given this diacritized corpus, we apply three different standard clustering approaches: Brown clustering6 (Brown et al., 1992) (CL-BR), K-means7 (Kanungo et al., 2002) (CL-KM), and Gaussian Mixtures via Expectation Maximization8 (Dempster et al., 1977) (CL-EM). We tune the number of clusters for the downstream tasks; in particular, we empirically investigate the performance on the development set of each downstream task for different numbers of clusters.

6 https://github.com/percyliang/brown-cluster
7 We use scikit-learn version 0.18.1, with the value 1 for both random_state and n_init and the default values for the remaining parameters.
8 We use scikit-learn version 0.18.1, with the following parameters: max_iter=1000, random_state=1, and covariance_type=spherical.
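A minimal sketch of the clustering criterion, shown here with scikit-learn's K-means (the CL-KM variant); the diacritic set used for stripping and the number of clusters are assumptions, and the embeddings are taken as a precomputed dict from diacritized word types to vectors.

```python
from collections import defaultdict
import numpy as np
from sklearn.cluster import KMeans

DIACRITICS = set("aiuo~FNK")  # assumed Buckwalter diacritic symbols

def strip_diacritics(word):
    return "".join(ch for ch in word if ch not in DIACRITICS)

def build_cl_dict(diac_vectors, n_clusters=500):
    """CL sketch (K-means variant): cluster the embeddings of diacritized
    word types; a diacritized entry is labeled ambiguous if variants of
    its undiacritized form fall into more than one cluster."""
    words = list(diac_vectors)
    vectors = np.array([diac_vectors[w] for w in words])
    labels = KMeans(n_clusters=n_clusters, n_init=1, random_state=1).fit_predict(vectors)
    base_clusters = defaultdict(set)
    for word, label in zip(words, labels):
        base_clusters[strip_diacritics(word)].add(int(label))
    return {word: "ambiguous"
            if len(base_clusters[strip_diacritics(word)]) > 1 else "unambiguous"
            for word in words}
```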
Translation-based Approaches (TR): Translation can be used as a basis for word sense induction (Diab and Resnik, 2002; Ng et al., 2003) since words across different languages tend to have disparate senses. Following a similar intuition, we use English translations from a parallel corpus as a trigger to divide the set of diacritic possibilities of a word into multiple subsets. The intuition here is that homographs worth disambiguating are those that are likely to be translated differently. We leverage an MSA-English parallel corpus, where the MSA side is diacritized in the FULL-CM scheme using MADAMIRA (the same preprocessing step used for CL above). In this approach, diacritized variants that share the same English translations are considered unambiguous, whereas those that are typically translated to different English words are considered ambiguous. To that end, we first align the sentences at the token level and generate word-level translations using fast-align (Dyer et al., 2013), which is a log-linear reparameterization of IBM Model 2 (Brown et al., 1993). If a word shares any translation with its diacritized variant in the top N most likely translations, we consider it unambiguous; otherwise, the word is tagged as ambiguous (e.g. Ealam "flag" and Ealima "learned" are tagged as ambiguous since they do not share top translations). We tune N to include 1, 5, 10, and all translations.
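A minimal sketch of the TR criterion, assuming the fast-align output has already been aggregated into a ranked list of top English translations per diacritized Arabic word (that aggregation step is not shown), and reusing a strip_diacritics helper like the one in the previous sketch.

```python
from collections import defaultdict

def build_tr_dict(top_translations, strip_diacritics, n=5):
    """TR sketch: group diacritized variants by their undiacritized form;
    a variant is labeled unambiguous if it shares at least one top-n
    English translation with another variant (or has no other variants),
    and ambiguous otherwise."""
    groups = defaultdict(list)
    for word in top_translations:
        groups[strip_diacritics(word)].append(word)
    ambig_dict = {}
    for variants in groups.values():
        for w in variants:
            shares = any(set(top_translations[w][:n]) & set(top_translations[v][:n])
                         for v in variants if v != w)
            ambig_dict[w] = "unambiguous" if shares or len(variants) == 1 else "ambiguous"
    return ambig_dict

# Toy translation lists (illustrative only):
toy = {"Ealam": ["flag", "banner"], "Ealima": ["knew", "learned"]}
print(build_tr_dict(toy, lambda w: "".join(c for c in w if c not in "aiuo~")))
# {'Ealam': 'ambiguous', 'Ealima': 'ambiguous'}
```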

4 Evaluation

Once we have generated the two variants of AmbigDict (AmbigDict-UNDIAC and AmbigDict-DIAC), we evaluate their efficacy extrinsically on downstream applications. For all downstream applications, training and test data are preprocessed using MADAMIRA (Pasha et al., 2014) with the FULL-CM diacritization scheme, where we only keep lexical diacritics.9 The data is then filtered based on the AmbigDict of choice; namely, only word tokens in the text deemed ambiguous according to each AmbigDict maintain their full diacritics (as generated by MADAMIRA), while the unambiguous words are kept undiacritized.

9 Full diacritics are included except inflectional diacritics that reflect the syntactic positions of words within sentences but do not alter meaning.
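This filtering step can be summarized by the following sketch, assuming a FULL-CM-diacritized token sequence, an AmbigDict keyed by undiacritized forms (an AmbigDict-DIAC variant would look up the diacritized token instead), and a strip_diacritics helper as in the earlier sketches.

```python
def selectively_diacritize(fullcm_tokens, ambig_dict, strip_diacritics):
    """Keep full lexical diacritics only on tokens whose undiacritized
    form is labeled ambiguous; strip the diacritics everywhere else."""
    output = []
    for token in fullcm_tokens:
        base = strip_diacritics(token)
        if ambig_dict.get(base) == "ambiguous":
            output.append(token)   # keep the MADAMIRA-generated diacritics
        else:
            output.append(base)    # fall back to the undiacritized form
    return output

# e.g. ["katab", "Ealam"] with {"ktb": "unambiguous", "Elm": "ambiguous"}
# yields ["ktb", "Ealam"].
```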
4.1 Datasets

For MULTI, SENSE, and CL, we use a combination of four Modern Standard Arabic (MSA) datasets that vary in genre and domain and add up to ~50M tokens: Gigaword 5th edition, distributed by the Linguistic Data Consortium (LDC), a Wikipedia dump from 2016, the Corpus of Contemporary Arabic (CCA) (Zaghouani et al., 2016a; Al-Sulaiti and Atwell, 2006), and the LDC Arabic Treebank (ATB).10 For TR, we use an Arabic-English parallel dataset which includes ~60M tokens and is created from 53 LDC catalogs. For data cleaning, we replace e-mails and URLs with a unified token and use the SPLIT tool (Al-Badrashiny et al., 2016) to clean UTF-8 characters (e.g. Latin and Chinese), remove diacritics in the original data, and separate punctuation, symbols, and numbers in the text, replacing them with separate unified tokens. We split long sentences (more than 150 words) by punctuation and then remove all sentences that are still longer than 150 words. We use the D3 tokenization style (i.e. all affixes are separated) (Pasha et al., 2014) for Arabic without normalizing characters. For English, we lowercase all characters and use TreeTagger (Schmid, 1999) for tokenization. We use SkipGram word embeddings (Mikolov et al., 2013), where applicable.

10 Parts 1, 2, 3, 5, 6, 7, 10, 11, and 12.

4.2 Extrinsic Evaluation

We evaluate the effectiveness of the proposed approaches using three applications: Semantic Textual Similarity (STS), Neural Machine Translation (NMT), and Part-of-Speech (POS) tagging. We use different significance testing methods appropriate for each application, with p = 0.05.

4.2.1 Semantic Textual Similarity (STS)

STS is a benchmark evaluation task (Cer et al., 2017), where the objective is to predict the similarity score between a pair of sentences. Performance is typically evaluated using the Pearson correlation coefficient against human judgments. We use the Williams test (Graham and Baldwin, 2014) for significance testing. We experiment with an unsupervised system based on matrix factorization developed by (Guo and Diab, 2012; Guo et al., 2014), which generates sentence embeddings from a word-sentence co-occurrence matrix; sentence pairs are then compared using cosine similarity. We use a dimension size of 700. To train the model, we use the Arabic dataset released for SemEval-2017 Task 1 (Cer et al., 2017). Since the training dataset is small, we augment it by randomly selecting sentences (~1,655,922) from the dataset described in Section 4.1, where the chosen sentences have to satisfy the following conditions: the number of words lies between 5 and 150, and the minimum frequency of each word is 2. We apply these conditions to the diacritized data since it suffers more from sparseness, and then use the undiacritized correspondents of the selected sentences in the undiacritized setting.

4.2.2 Neural Machine Translation (NMT)

We build a BiLSTM-LSTM encoder-decoder machine translation system as described in (Bahdanau et al., 2014) using OpenNMT (Klein et al., 2014). We use 300 as the input embedding size for both source and target vectors, 500 hidden units, and 0.3 dropout. We initialize words with embeddings trained using FastText (Bojanowski et al., 2017) on the selectively-diacritized dataset described in Section 4.1. We train the model using SGD with a max gradient norm of 1 and a learning rate decay of 0.5. We use the Web Inventory of Transcribed and Translated Talks (WIT), which is made available for IWSLT 2016 (Mauro et al., 2012). We use BLEU (Papineni et al., 2002) for evaluation, and bootstrap re-sampling and approximate randomization for significance testing (Clark et al., 2011).

4.2.3 POS Tagging

POS tagging is the task of determining the syntactic role of a word (i.e. part of speech) within a sentence. We use a BiLSTM-CRF architecture to train a POS tagger using the implementation provided by (Reimers and Gurevych, 2017), with 300 as the dimension size, initialized using the same embeddings we use in NMT. We use ATB parts 1, 2, and 3 to train the models with Universal Dependencies POS tags, version 2 (Nivre et al., 2016). We use word-level accuracy for evaluation, and the t-test (Fisher, 1935; Dror et al., 2018) for significance testing.

4.3 Automatic Diacritization

For generating the various AmbigDict approaches, we use either fully diacritized versions of the datasets, without case and mood related diacritics,11 or undiacritized versions. Since it is expensive to obtain large human-annotated diacritized datasets, we use the morphological analysis and disambiguation tool MADAMIRA, version 2.1 (2016 release) (Pasha et al., 2014).

11 The FULL-CM diacritization scheme, where we only keep lexical diacritics.

4.4 AmbigDict

Table 1 shows the number of identified ambiguous words using each approach. Note that the total vocabulary sizes vary due to either different datasets (e.g. for TR) or different preprocessing (e.g. MULTI is based on undiacritized text). For a given corpus, the number of ambiguous words identified by MULTI can be viewed as an estimate of the upper bound on ambiguous words due to diacritics. In MULTI, words that have no valid analysis generated by MADAMIRA are filtered out; this resulted in a significant drop in the number of types since the dataset includes noisy and infrequent instances.

Dictionary            Types      % Ambig Words
AmbigDict-UNDIAC
  MULTI               168,384    33.82
  SENSE               467,953    8.50
AmbigDict-DIAC
  CL                  497,222    8.70 - 8.98
  TR                   36,533    27.58

Table 1: Vocabulary size and percentage of ambiguous entries in AmbigDict-DIAC and AmbigDict-UNDIAC.

4.5 Results and Analysis

Dictionary            STS      NMT     POS
NONE                  0.608    27.1    97.99%
FULL-CM               0.593    26.8    98.06%
AmbigDict-UNDIAC
  MULTI               0.591    27.0    98.11%*
  SENSE               0.598    27.1    97.97%
AmbigDict-DIAC
  CL-BR               0.601    27.1    98.09%
  CL-KM               0.608    27.2    98.05%
  CL-EM               0.617*   27.1    98.05%
  TR                  0.616*   27.3*   97.94%

Table 2: Performance with selectively-diacritized datasets in downstream applications. Bold numbers indicate approaches with higher performance than the best performing baseline. * refers to approaches with statistically-significant performance gains against the best performing baseline.

Table 2 shows the performance of all strategies in the downstream tasks. Comparing the baselines NONE and FULL-CM, we observe that applications that require semantic understanding (STS and NMT) show better performance when the dataset is undiacritized, whereas POS tagging yields better performance with the fully diacritized dataset. The differences in performance between the baselines are significant across all tasks. In all tasks, at least one of the selective diacritization schemes leads to performance gains compared to both baselines. However, the choice of best performing selective diacritization scheme varies across tasks. In general, AmbigDict-DIAC approaches provide more promising results on semantic-related applications.

The TR and CL-EM approaches yield the highest performance in two of the applications (STS and NMT), while MULTI and CL-BR achieve the highest performance in POS tagging. Incidentally, MULTI has the highest rate of ambiguous words, which leads to more disambiguation through diacritization. This is consistent with the observation that diacritization is useful for syntactic tasks like POS tagging, as observed through the baselines. In all other tasks, all selective diacritization schemes performed significantly higher than full diacritization.

Homograph Evaluation: We compared the performance of the various schemes on subsets of the test sets that include homographs, which are identified from the FULL-CM version of the training datasets. For STS and NMT evaluation, we kept only the test sentences that contain at least one homograph. For POS word-level evaluation, we only considered the homographs. Table 3 shows homograph performance across applications. The performance on these subsets follows the same trend as the overall results illustrated in Table 2, except for POS tagging, where FULL-CM achieved comparable performance to the selective schemes. Note, however, that almost all schemes achieved higher POS tagging accuracy than NONE in these subsets, and almost all schemes achieved comparable or higher performance than FULL-CM in STS and NMT, with TR significantly outperforming the rest of the schemes as well as the baselines. This illustrates the usefulness of selective diacritization for balancing homograph disambiguation and sparsity compared to full or no diacritization.

Dictionary            STS      NMT     POS
NONE                  0.590    27.4    98.26%
FULL-CM               0.575    27.0    98.70%
AmbigDict-UNDIAC
  MULTI               0.574    27.2    98.65%
  SENSE               0.581    27.3    98.37%
AmbigDict-DIAC
  CL-BR               0.584    27.4    98.59%
  CL-KM               0.591    27.5    98.52%
  CL-EM               0.60*    27.4    98.47%
  TR                  0.597*   27.6*   98.22%

Table 3: Performance of selectively-diacritized datasets on homographs. Bold numbers indicate approaches with higher performance than the best performing baseline. * refers to approaches with statistically-significant performance gains against the best performing baseline.

Frequent POS Tag Performance: POS tagging labels each word in the sentence, as opposed to NMT and STS, which are evaluated at the sentence level. Thus, we compared the best performing scheme (MULTI) and the baselines in terms of their per-tag performance on the four most frequent tags: verbs, nouns, adjectives, and adverbs. Table 4 shows the results of the baselines and MULTI. For verbs and nouns, MULTI has better performance than both baselines, followed by FULL-CM. For adjectives and adverbs, NONE followed by MULTI have better performance than FULL-CM. While FULL-CM has overall higher accuracy, these results indicate that selective diacritization is a better approach for the most frequent tags, possibly due to reduced sparsity compared with FULL-CM.

Scheme     Verb      Noun      Adj       Adv
MULTI      95.98%    97.63%    94.43%    97.05%
NONE       95.08%    97.45%    94.71%    98.08%
FULL-CM    95.87%    97.56%    94.40%    96.79%

Table 4: POS tagging performance per most frequent tag. Bold scores indicate the highest score in a column.

OOV Performance: We measured the POS tagging performance on Out-of-Vocabulary (OOV) words to measure the effect of sparsity on performance. We consider a word OOV if it does not occur in the fully-diacritized training set. FULL-CM achieved 87.43% tag accuracy, while NONE achieved 87.56%. Using the MULTI scheme, the POS tagging accuracy on OOV words was 87.51%, which falls between the two baselines, as expected.

The results above indicate that using a selective diacritization scheme like MULTI can achieve a desirable balance between disambiguation and sparsity, such that better performance can be achieved in the frequent cases without increasing sparsity and OOV rates.

4.6 Properties of Ambiguity Dictionaries

Clustering-Based Ambiguity: While the MULTI, TR, and SENSE approaches have intuitive justifications, the clustering approaches are based entirely on distributional features. We analyzed some of the clustering results qualitatively to shed light on the types of words that are deemed ambiguous through clustering. While the various clustering approaches resulted in different labeling, their overall statistics and patterns were highly similar. Using a random subset of words from these CL dictionaries, we extracted the examples shown in Table 5, which illustrates some of the most common types of ambiguity. Note that the detected words are either semantically ambiguous (e.g. derivations or distinct lemmas) or syntactically ambiguous (e.g. part-of-speech).

Type               Example
part-of-speech     $ak "doubt" (noun) vs. $ak~ "doubted" (verb)
action direction   >a*okur "remember" vs. >u*ak~ir "remind"
number             $uyuwEiy~ayon "communists" vs. $uyuwEiy~iyn "communists"

Table 5: Examples of ambiguous word pairs detected by the clustering approaches.

Diacritic Pattern Complexity: We investigated whether there are regular diacritic patterns among the words that were considered ambiguous by CL and TR. Both approaches are data-driven, and we applied them on different corpora, so we investigated their degree of agreement. To do so, we abstracted the diacritic patterns of words in the vocabulary by converting all characters other than diacritics to a unified token "C"; we then collected statistics on the patterns of word pairs that are deemed ambiguous vs. unambiguous. For example, the ambiguous pair "katab" and "kutib" has the pattern CaCaC-CuCiC. For the CL methods, the number of unique diacritic patterns of unambiguous word pairs (i.e. pairs falling in the same cluster) was between 197 and 219, whereas the number of patterns of ambiguous pairs was between 813 and 872. The majority of patterns between unambiguous words also occurred between ambiguous words. For TR, while most patterns were labeled unambiguous, around 300 patterns were always labeled ambiguous. We did not find overarching semantic or syntactic rules that consistently explain the ambiguity tags. However, a number of patterns (~20) were always tagged as ambiguous by both the TR and CL approaches. Table 6 shows a sample of these patterns with examples.

Pattern Pair            Example
CaC~aC / CuCiC          Ear~aD "make wider" vs. EuriD "has been shown"
CaCiCaCoC / CaCiCiCC    ba$iEayon "ugly" (dual) vs. ba$iEiyn "ugly" (plural)

Table 6: Examples of consistent diacritic patterns of ambiguous words between the CL and TR approaches.
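A minimal sketch of the pattern abstraction described above; the set of Buckwalter lexical diacritic symbols used here is an assumption.

```python
DIACRITICS = set("aiuo~")  # assumed Buckwalter lexical diacritic symbols

def diacritic_pattern(word):
    """Map every non-diacritic character to the unified token 'C',
    keeping the diacritics in place."""
    return "".join(ch if ch in DIACRITICS else "C" for ch in word)

print(diacritic_pattern("katab"), diacritic_pattern("kutib"))  # CaCaC CuCiC
```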
5 Discussion & Conclusion

We investigated selective diacritization as a viable technique for reducing lexical ambiguity, using Arabic as a case study. To our knowledge, this is the first work that shows encouraging results with automatic selective diacritization schemes in which the devised approaches are evaluated on several downstream applications. Our findings demonstrate that partial diacritization achieves a balance between homograph disambiguation and sparsity effects; the performance using selective diacritization always approached the best of both extremes in each application, and sometimes surpassed the performance of both baselines, which is consistent with our intuition of balancing sparsity and disambiguation for improving overall performance.

While the increase in performance was not consistent across all tasks, the results provide empirical evidence of the viability of automatic partial diacritization, especially since manual efforts in this vein have been rather challenging. We believe that the approaches described in this paper could help advance the efforts towards optimal diacritization schemes, which are currently mostly based on linguistic features. We analyzed some patterns that were recognized as ambiguous using our best-performing schemes and showed some consistencies in the diacritic patterns, although the results were not conclusive. We believe that a deeper analysis of these patterns may help shed light on the lexical ambiguity phenomenon, in addition to allowing further improvements in selective diacritization.

References

Mohamed Al-Badrashiny, Arfath Pasha, Mona Diab, Nizar Habash, Owen Rambow, Wael Salloum, and Ramy Eskander. 2016. SPLIT: Smart preprocessing (quasi) language independent tool. In International Conference on Language Resources and Evaluation (LREC).

Latifa Al-Sulaiti and Eric Atwell. 2006. The design of a corpus of contemporary Arabic. International Journal of Corpus Linguistics, 11(2):135-171.

Rehab Alnefaie and Aqil M. Azmi. 2017. Automatic minimal diacritization of Arabic texts. In 3rd International Conference on Arabic Computational Linguistics (ACLing).

Sawsan Alqahtani, Mona Diab, and Wajdi Zaghouani. 2018. ARLEX: A large scale comprehensive lexical inventory for Modern Standard Arabic. In OSACT 3: The 3rd Workshop on Open-Source Arabic Corpora and Processing Tools.

Sawsan Alqahtani, Mahmoud Ghoneim, and Mona Diab. 2016. Investigating the impact of various partial diacritization schemes on Arabic-English statistical machine translation. In International Association for Machine Translation in the Americas (AMTA).

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.

Chris Biemann. 2006. Chinese Whispers: An efficient graph clustering algorithm and its application to natural language processing problems. In Proceedings of the First Workshop on Graph Based Methods for Natural Language Processing. Association for Computational Linguistics.

William Black, Sabri Elkateb, Horacio Rodriguez, Musa Alkhalifa, Piek Vossen, Adam Pease, and Christiane Fellbaum. 2006. Introducing the Arabic WordNet project. In Proceedings of the Third International WordNet Conference.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics.

Houda Bouamor, Wajdi Zaghouani, Mona Diab, Ossama Obeid, Kemal Oflazer, Mahmoud Ghoneim, and Abdelati Hawwari. 2015. A pilot study on Arabic multi-genre corpus diacritization. In Proceedings of the Second Workshop on Arabic Natural Language Processing.

Peter F. Brown, Peter V. Desouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467-479.

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263-311.

Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 Task 1: Semantic textual similarity - multilingual and cross-lingual focused evaluation. In SemEval workshop at ACL.

Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 176-181.

Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39:1-38.

Mona Diab, Mahmoud Ghoneim, and Nizar Habash. 2007. Arabic diacritization in the context of statistical machine translation. In Proceedings of MT-Summit.

Mona Diab and Philip Resnik. 2002. An unsupervised method for word sense tagging using parallel corpora. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 255-262.

Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. 2018. The hitchhiker's guide to testing statistical significance in natural language processing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, volume 1, pages 1383-1392.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644-648.

Ignatius Ezeani, Mark Hepple, and Ikechukwu Onyenwe. 2017. Lexical disambiguation of Igbo using diacritic restoration. In Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications, pages 53-60.

Ronald Aylmer Fisher. 1935. The Design of Experiments. Oliver and Boyd.

Yvette Graham and Timothy Baldwin. 2014. Testing for significance of increased correlation with human judgment. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 172-176.

Weiwei Guo and Mona Diab. 2012. Modeling sentences in the latent space. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 864-872.

Weiwei Guo, Wei Liu, and Mona Diab. 2014. Fast tweet retrieval with compact binary codes. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 486-496.

Tuka Al Hanai and James Glass. 2014. Lexical modeling for Arabic ASR: A systematic approach. In Fifteenth Annual Conference of the International Speech Communication Association.

Ehab W. Hermena, Denis Drieghe, Sam Hellmuth, and Simon P. Liversedge. 2015. Processing of Arabic diacritical marks: Phonological-syntactic disambiguation of homographic verbs and visual crowding effects. Journal of Experimental Psychology: Human Perception and Performance, 41(2).

Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. 2002. An efficient k-means clustering algorithm: Analysis and implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 881-892.

G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush. 2014. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations.

Sebastian Löbner. 2013. Understanding Semantics. Routledge.

Cettolo Mauro, Girardi Christian, and Federico Marcello. 2012. WIT3: Web inventory of transcribed and translated talks. In Conference of the European Association for Machine Translation, pages 261-268.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space.

Hwee Tou Ng, Bin Wang, and Yee Seng Chan. 2003. Exploiting parallel texts for word sense disambiguation: An empirical study. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, pages 455-462.

Minh Trung Nguyen, Quoc Nhan Nguyen, and Hong Phuong Nguyen. 2012. Vietnamese diacritics restoration as sequential tagging. In IEEE RIVF International Conference on Computing and Communication Technologies, Research, Innovation, and Vision for the Future, pages 1-6.

Joakim Nivre, Marie-Catherine De Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan T. McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC).

Joakim Nivre, Marie-Catherine De Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2017. A Bambara tonalization system for word sense disambiguation using differential coding, segmentation and edit operation filtering. In Proceedings of the Eighth International Joint Conference on Natural Language Processing, volume 1, pages 694-703.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311-318.

Arfath Pasha, Mohamed Al-Badrashiny, Mona T. Diab, Ahmed El Kholy, Ramy Eskander, Nizar Habash, Manoj Pooleery, Owen Rambow, and Ryan Roth. 2014. MADAMIRA: A fast, comprehensive tool for morphological analysis and disambiguation of Arabic. In LREC, volume 14, pages 1094-1101.

Maria Pelevina, Nikolay Arefyev, Chris Biemann, and Alexander Panchenko. 2017. Making sense of word embeddings. In Proceedings of the 1st Workshop on Representation Learning for NLP. Association for Computational Linguistics.

Nils Reimers and Iryna Gurevych. 2017. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 338-348.

Ahmed Said, Mohamed El-Sharqwi, Achraf Chalabi, and Eslam Kamal. 2013. A hybrid approach for Arabic diacritization. In International Conference on Application of Natural Language to Information Systems, pages 53-64.

Helmut Schmid. 1999. Improvements in part-of-speech tagging with an application to German. In Natural Language Processing Using Very Large Corpora, pages 13-25.

Dan Tufiş and Adrian Chiţu. 1999. Automatic diacritics insertion in Romanian texts. In Proceedings of the International Conference on Computational Lexicography COMPLEX, volume 99, pages 185-194.

Cătălin Ungurean, Dragoş Burileanu, Vladimir Popescu, Cristian Negrescu, and Aurelian Dervis. 2008. Automatic diacritic restoration for a TTS-based e-mail reader application. UPB Scientific Bulletin, Series C, 70(4):3-12.

David Yarowsky. 1994. Decision lists for lexical ambiguity resolution: Application to accent restoration in Spanish and French. In Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics, pages 88-95.

Wajdi Zaghouani, Houda Bouamor, Abdelati Hawwari, Mona T. Diab, Ossama Obeid, Mahmoud Ghoneim, Sawsan Alqahtani, and Kemal Oflazer. 2016a. Guidelines and framework for a large scale Arabic diacritized corpus. In The Tenth International Conference on Language Resources and Evaluation (LREC), pages 3637-3643.

Wajdi Zaghouani, Abdelati Hawwari, Sawsan Alqahtani, Houda Bouamor, Mahmoud Ghoneim, Mona Diab, and Kemal Oflazer. 2016b. Using ambiguity detection to streamline linguistic annotation. In Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC), pages 127-136.