Morphological Analysis and Disambiguation for Gulf Arabic: the Interplay Between Resources and Methods
Total Page:16
File Type:pdf, Size:1020Kb
Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 3895–3904 Marseille, 11–16 May 2020 c European Language Resources Association (ELRA), licensed under CC-BY-NC Morphological Analysis and Disambiguation for Gulf Arabic: The Interplay between Resources and Methods Salam Khalifa, Nasser Zalmout, Nizar Habash Computational Approaches to Modeling Language (CAMeL) Lab New York University Abu Dhabi {salamkhalifa, nasser.zalmout, nizar.habash}@nyu.edu Abstract In this paper we present the first full morphological analysis and disambiguation system for Gulf Arabic. We use an existing state-of-the- art morphological disambiguation system to investigate the effects of different data sizes and different combinations of morphological analyzers for Modern Standard Arabic, Egyptian Arabic, and Gulf Arabic. We find that in very low settings, morphological analyzers help boost the performance of the full morphological disambiguation task. However, as the size of resources increase, the value of the morphological analyzers decreases. Keywords: Gulf Arabic, Morphology, Disambiguation 1. Introduction ing the most probable morphological analysis in context for a given word. This task is achieved by either ranking Despite the many advances in the field in Natural Language the output of a morphological analyzer or through an end- Processing (NLP) for Arabic, many dialectal Arabic vari- to-end system that generates a single answer. eties are lagging behind. Modern Standard Arabic (MSA), In this paper, we focus on Gulf Arabic (GLF), a morpho- the official language of the Arab world, is well studied in logically rich and resource poor Arabic dialect. We aim to NLP and has an abundance of resources including corpora benchmark full morphological analysis and disambiguation and tools. On the other hand, most Arabic dialects are for GLF using state-of-the-art approaches for the first time. considered under-resourced, with the exception of Egyp- As part of this work, we investigate the relationship tian Arabic (EGY). between the size of the training data and the different One of the main challenges for Arabic NLP is data spar- available analyzers with respect to the disambiguation sity due to morphological richness and the lack of orthog- model. raphy standards. To address such challenges, recent NLP models use deep learning architectures that have access to The rest of the paper is structured as follows. We present character and subword-level information. Such models are related work in Section 2, and briefly discuss the linguis- well equipped to model some aspects of morphology im- tic background of GLF and its challenges in Section 3. We plicitly as part of an end-to-end system without requiring present the morphological analyzer creation process in Sec- explicit feature engineering. However, these models are tion 4. In Sections 5 and 6, we present our experimental very data-intensive, and do not scale down well in the case setup and evaluation, respectively. We conclude and out- of low-resource languages. Moreover, some morphological line future work in Section 7. behaviors can be complicated and irregular at times. This makes morphology more challenging to capture implicitly 2. Related Work while modeling other tasks. This further highlights the im- portance of explicit morphological modeling for languages In the past two decades, there have been a lot of efforts on that are morphologically-rich and low-resource in particu- morphological modeling for Arabic, as it proved to be use- lar. ful in a number downstream NLP tasks (Sadat and Habash, Morphological analyzers are dictionary-like resources 2006; El Kholy and Habash, 2012; Halabi, 2016; Baly et that provide all the potential analyses for a given word al., 2017). In this section we review early efforts on mor- out-of-context. Ideal morphological analyzers are expected phological modeling of MSA and dialectal Arabic. We then to return all possible analyses of a given word (modeling present the latest neural-based contributions with special ambiguity), along with covering all the different inflected interest in Arabic. forms of a word lemma (modeling richness). The quality of Modern Standard Arabic Morphological Modeling the morphological analyzers varies drastically based on the One of the earliest morphological tagging systems for Ara- method used to create them. Higher quality analyzers are bic was presented by Khoja (2001); it was based on a cor- carefully built using existing linguistic dictionaries, and are pus of 50,000 words. Later, the LDC released the Penn manually checked. On the other end of the spectrum, mor- Arabic Treebank (PATB) (Maamouri et al., 2004), which phological analyzers created automatically (e.g., extracted was substantially larger, and supported many further efforts from annotated text or seed dictionaries) can have a lower on Arabic morphological modeling. The PATB relied on quality depending on the quantity and quality of data used the existence of the Buckwalter Morphological Analyzer or the methods employed in creating them. (BAMA) (Buckwalter, 2004) and its later version the Stan- Morphological disambiguation is the process of provid- dard Arabic Morphological Analyzer (SAMA) (Maamouri 3895 et al., 2010). Among such efforts, are MADAMIRA (Pasha So far, the mentioned efforts regarding disambiguation suf- et al., 2014) and it predecessors MADA (Habash and Ram- fer from two main issues. First, they require explicit fea- bow, 2005; Roth et al., 2008; Habash et al., 2009; Habash et ture engineering which can lead to over fitting and not be- al., 2013) and AMIRA (Diab et al., 2004; Diab et al., 2007). ing able to generalize to new dialects. Second, those sys- MADAMIRA uses a morphological analyzer and SVMs- tems rely heavily on pre-existing morphological analyzers based taggers for different features, along with n-gram lan- to generate the final answers. This reliance limits the sys- guage models for lemmatization and diacritization. More tem’s performance to the quality of the analyzer, especially recent efforts include Farasa (Abdelali et al., 2016; Dar- when generating open class features such as lemmas. In this wish and Mubarak, 2016) and YAMAMA (Khalifa et al., work we use state-of-the-art neural architectures that have 2016b), which are out-of-context taggers/segmenters that the ability to model morphologically rich and complex lan- use different techniques to achieve reasonable performance guages such as Arabic and its different verities. and fast running time. Neural-based contributions for Arabic Morphology Adaptation of MSA Tools and Resources for Dialectal Neural-based contributions for Arabic NLP are relatively Arabic A number of approaches attempt to exploit lin- scarce and specific to individual tasks rather than full guistic similarity between MSA and Arabic dialects to build morphological disambiguation. Among the contributions dialect tools using existing MSA resources (Duh and Kirch- that utilize morphological structures to enhance the hoff, 2005; Habash and Rambow, 2006; Zribi et al., 2013; neural models in different NLP tasks, we note Guzmán Salloum and Habash, 2014; Hamdi et al., 2015; Albogamy et al. (2016) for machine translation, and Abandah et and Ramsay, 2015; Eskander et al., 2016). MAGEAD is al. (2015) for diacritization. Shen et al. (2016) applied a morphological analyzer that models Arabic dialects to- their Bi-LSTM morphological disambiguation model gether with MSA using a common multi-tape finite-state- on MSA, but did not present any improvements over machine framework (Habash and Rambow, 2006). Zribi the state-of-the-art. Heigold et al. (2016) developed et al. (2013) adapt an MSA analyzer, Al-Khalil (Boudlal character-based neural models for morphological tagging et al., 2010), to Tunisian Arabic, where they modify the for 14 different languages, including Arabic, using the derivation patterns and add Tunisian-specific roots and pat- UD treebank. Samih et al. (2017a) used a Bi-LSTM-CRF terns. Eskander et al. (2016), on the other hand, presented architecture and pre-trained character embeddings for the a paradigm completion approach to generate morphological segmentation of EGY tweets. They then build up on this analyzers for low-resource dialects using morphologically approach using a similar architecture for segmentation in annotated corpora. They then make use of available MSA multiple dialects, through combining the training datasets and EGY analyzers as backoff. for the different dialects, and train a unified segmentation model. They report the results using both an SVM-Rank Dialect-Specific Contributions Al-Sabbagh and Girju and Bi-LSTM-CRF models (Samih et al., 2017b). Darwish (2012) presented a POS annotated data set and tagger for et al. (2017) use Bi-LSTM models to train a POS tagger, EGY. Habash et al. (2012) presented CALIMA, a mor- and compared it against SVM-based model. The SVM phological analyzer for EGY, which was built by extend- model in their system outperformed the neural model, ing the Egyptian Colloquial Arabic Lexicon (ECAL) (Ki- even with incorporating pre-trained embeddings. Other lany et al., 2002). The LDC has also released the EGY notable contributions include the work of Inoue et al. treebank (ARZATB) (Maamouri et al., 2012). The tree- (2017), who used multi-task learning to model fine-grained bank is morphologically annotated in a similar style to the POS tags, using the individual morphosyntactic features. PATB. The aforementioned MADAMIRA and YAMAMA More recently, Zalmout and Habash (2017) presented systems were extended to EGY using CALIMA (Habash the first neural based full morphological disambigua- et al., 2012) and ARZATB (Maamouri et al., 2012). Jarrar tion system for Arabic. Alharbi et al. (2018) also use et al. (2014) released the Curras Corpus of Palestinian Ara- a Bi-LSTM model for GLF POS tagging, with good results. bic, also annotated in the PATB morphology style. Darwish et al. (2018) used a CRF model for a multi-dialect POS In this work, we introduce the first morphological analy- tagging, using a small annotated Twitter corpus for several sis and disambiguation system for GLF.