Arabic NLG Language Functions
Wael Abed and Ehud Reiter
Dept of Computing Science, University of Aberdeen, Aberdeen AB24 3UE, UK

Proceedings of The 13th International Conference on Natural Language Generation, pages 7–14, Dublin, Ireland, 15-18 December, 2020. © 2020 Association for Computational Linguistics

Abstract

The Arabic language has very limited support from NLG researchers. In this paper, we explain the challenges of the core grammar, provide a lexical resource, and implement the first language functions for the Arabic language. We carried out a human evaluation of our functions in generating sentences from the NADA Corpus.

1 Introduction

The Arabic language is a member of the Semitic language family and the fifth most spoken language in the world. Around 330 million people, distributed across 25 countries, speak it as their mother tongue. It is mainly classified into two types, Classical Arabic and Modern Standard Arabic (MSA) (Al-Radaideh, 2020), in addition to regional varieties (dialects). As far as we know, there are no pipeline NLG systems or libraries that support an Arabic surface realiser at the time of writing this paper. Two factors might be the reasons: the complexity of the Arabic language and limited resources. An example that reflects the second cause is that even the Google n-gram viewer does not support the Arabic language (Alsmadi and Zarour, 2018).

Therefore, in this paper, we present six simple morphological MSA natural language functions (inflectNoun, buildNP, inflectVerb, inflectAdjective, pronouns and numToWords). These functions can be considered a good start towards building an Arabic surface realiser for use in a pipeline NLG system, as they can correctly generate and inflect different parts of speech, including the generation of a noun phrase.

We implemented these functions from scratch using a rule-based approach. Furthermore, a lexicon containing around 9000 lemmata was collected. Open-source versions of the lexicon and code, together with a web interface, are available online on GitHub.¹

¹ https://github.com/WaelMohammedAbed/Natural-Language-Generation-for-the-Arabic-Language

2 Related Works

As no NLG system has been built for Arabic, we investigated similar approaches that target the challenges of the language. Green and DeNero (2012) presented a class-based agreement model that handles word agreement in order to generate correctly inflected words in a translation task. Khoufi et al. (2016) dealt with un-vocalisation and agglutination in Arabic: they built a statistical model that extracts grammatical rules from a treebank corpus and uses these rules for parsing.

Furthermore, Hebrew, another member of the Semitic group, also lacks NLG resources. One attempt to generate noun compounds in Hebrew goes back to 1998 (Netzer and Elhadad, 1998). The authors used a rule-based approach and listed the difficulties that arise between English and Hebrew in terms of the syntactic, semantic and lexical constraints on generating noun compounds, including a special feature of Hebrew called "Smixut".

For other languages, researchers have used different realiser technologies, such as handcrafted rule-based grammars (Gatt and Reiter, 2009), statistical approaches (Gardent and Perez-Beltrachini, 2017) and human-generated templates (Gatt and Krahmer, 2018). However, SimpleNLG has been considered the most popular method since it was presented by Gatt and Reiter (2009). It is a rule-based realiser that uses three main steps (syntactic, morphological and orthographic realisation) to produce correctly inflected, high-quality output. It has been adapted to many languages including (in chronological order) French (Vaudry and Lapalme, 2013), Italian (Mazzei et al., 2016), Spanish (Ramos-Soto et al., 2017), Galician (Cascallar-Fuentes et al., 2018), Dutch (de Jong and Theune, 2018), Mandarin (Chen et al., 2018) and German (Braun et al., 2019).

Another popular approach is to provide grammar functions for use in a template-based NLG system. One example is RosaeNLG², an open-source, template-based NLG library that offers simpler functions than SimpleNLG and has been used for English, French, German, Italian and Spanish. Similar approaches with fewer options can be found in Core-NLG³ and xSpin⁴.

Our approach is more similar to RosaeNLG than to SimpleNLG, as we provide language functions and a noun phrase function that give correct grammatical inflections and agreements. Due to the huge morphological difference between Semitic and non-Semitic languages, we could adopt neither SimpleNLG nor RosaeNLG. Therefore, as a start, we found it convenient to build our functions from scratch, following the same approach as RosaeNLG.

3 Arabic language challenges

The Arabic language has an alphabet of 28 letters, is written from right to left, and has no letter capitalisation. It has a very rich morphology, measured by the number of different forms for its nouns, verbs and particles (Al-Radaideh, 2020).

3.1 Nouns

Arabic nouns are inflected depending on gender, number, and grammatical case (nominative, genitive, and accusative).

Arabic assigns gender to living and non-living things (e.g. "car" (سيارة) is treated as feminine, while "goal" (هدف) is a masculine noun) (Shamsan and Attayib, 2015). Moreover, unlike many languages, Arabic has a dual form in addition to the singular and plural forms. This form is used to express a noun that refers to two entities (things or people), and can be generated by adding one of these suffixes ("ان" -aan, "ين" -ayn, "تان" -taan, "تين" -tayn) to the singular form. The type of suffix used depends on the case and the gender of the noun.

Regarding the plural form, there are two different types: sound plural and broken plural. The first can be inflected using steps similar to those used for the dual form, with a different set of suffixes ("ون" -uun, "ين" -iin, "ات" -aat). Some cases can only be identified by the "harakãt" (حركات), the short vowel marks on the last consonant letter. The short vowel marks are neglected here since, for MSA, according to many studies, "Short Vowels Do Not Contribute to the Quality of Reading" (Abu-Rabia, 2019).

The broken plural, in contrast, has different inflections. This type has more than 30 patterns, which involve internal and external manipulation of the sounds and vowel letters in the word, and there are no firm rules (Dawdy-Hesterberg and Pierrehumbert, 2014). Table 1 illustrates all these states for dual, sound plural and broken plural.

Table 1: Examples of Arabic noun forms with different case and gender

Form | Example | Change
Singular feminine noun (car) | سيارة |
Dual and nominative | سيارتان | change the last letter (ة) to (ت), then add the suffix (ان) -aan
Plural and genitive | سيارات | remove the last letter (ة), then add the suffix (ات) -aat
Singular masculine noun (defender) | مدافع |
Dual and genitive | مدافعين | only add the suffix (ين) -ayn
Plural and nominative | مدافعون | only add the suffix (ون) -uun

Singular | Broken plural | Changes
باب - door | أبواب - doors | add the prefix "ah" (أ) and the infix "w" (و)

Note: in the examples above there are other changes in the short vowel marks that are not mentioned because they are not covered here.

3.2 Numerals

The challenges of the Arabic numerals are their relation with the noun's gender, case, definiteness, word order and count. In addition, they do not follow the rules in some cases (Al-Bataineh and Branigan, 2020).
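To make the gender and word-order dependencies concrete, the following is a minimal sketch for the simplex numerals (1-10) only, based on the behaviour this section attributes to that group: the numerals 1 and 2 agree with the counted noun and follow it, while 3-10 take the opposite gender. The names `simplex_agreement` and `NumeralAgreement` are hypothetical and are not taken from the paper's numToWords function. The assumption that 3-10 precede the noun is inferred from the statement that 1 and 2 are the only numerals placed after it. Case and definiteness are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NumeralAgreement:
    """Agreement features a numeral word would need (illustrative only)."""
    numeral_gender: str        # "M" or "F"
    numeral_after_noun: bool   # True if the numeral follows the counted noun


def simplex_agreement(count: int, noun_gender: str) -> NumeralAgreement:
    """Return gender and position of a simplex numeral (1-10).

    Assumption: only gender and word order are modelled; case and
    definiteness, which also matter in Arabic, are left out.
    """
    if not 1 <= count <= 10:
        raise ValueError("only simplex numerals (1-10) are covered")
    if noun_gender not in ("M", "F"):
        raise ValueError("noun_gender must be 'M' or 'F'")

    if count <= 2:
        # Singular numerals (1 and 2): agree with the noun and follow it.
        return NumeralAgreement(noun_gender, True)

    # Added numerals (3-10): opposite gender; assumed to precede the noun.
    opposite = "F" if noun_gender == "M" else "M"
    return NumeralAgreement(opposite, False)


if __name__ == "__main__":
    print(simplex_agreement(1, "F"))
    print(simplex_agreement(5, "M"))
```

A realiser could call such a helper before choosing the surface form of the numeral word, so that the lexicon lookup receives the gender already resolved.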
² https://rosaenlg.org/rosaenlg/1.16.4/about/compare.html
³ https://github.com/societe-generale/core-nlg
⁴ https://xspin.it/plus

These dependencies and exceptions make Arabic numerals a complex problem compared with their counterparts in the English language. Arabic numerals (ordinal and cardinal) can be classified mainly into three groups (Al-Bataineh and Branigan, 2020; Al Easa, 1997).

Table 2: Examples of numeral inflection

Arabic example | Word by word to English | English translation
هدف واحد | goal one(M) | one goal
سيارة واحدة | car one(F) | one car
سيارتين اثنتين | cars(dual) two(F) | two cars
واحد وعشرون هدف | one(M) and twenty(M-plural) goal(M-singular) | twenty-one goals

Simplex numerals (1-10): according to Al Easa (1997), this group has two subgroups: singular numerals (1 and 2) and added numerals (3-10). Singular numerals agree with the counted noun in all aspects. However, they are the only group of Arabic numerals that comes after the noun in the word order. For the second subgroup, the numeral takes the opposite gender to that of the noun (see Table 2).

Compound numerals (11-19): This group has two major differences. The first is that each numeral consists of two words, one for each digit (units and tens).

3.3 Adjectives

In MSA, adjectives are commonly used after the noun, although they can be used before it as well. They agree with the noun in gender, case, definiteness and number (Shamsan and Attayib, 2015; Idrissi et al., 2019). If the adjective describes a non-human plural noun, it takes the singular feminine form (Shamsan and Attayib, 2015). However, over time this exception has started to disappear and full agreement between noun and adjective is used (Idrissi et al., 2019). Example: الرجال الحكماء, the-men (M-plural-definite-nominative) the-wise (M-plural-definite-nominative), "The wise men".

3.4 Verbs

Arabic verbs make up a large part of the sentence due to their morphological richness. There are 19 forms for verb inflections, and for each form, the verb agrees with the subject in terms of number, person and gender. Moreover, it can be inflected using four moods, two voices and three tenses (Shaalan et al., 2015).

3.5 Pronouns

Arabic pronouns can be categorised into Explicit and