A Lexicon of Arabic Verbs Constructed on the Basis of Semitic Taxonomy and Using Finite-State Transducers Alexis Amid Neme
Total Page:16
File Type:pdf, Size:1020Kb
A lexicon of Arabic verbs constructed on the basis of Semitic taxonomy and using finite-state transducers Alexis Amid Neme To cite this version: Alexis Amid Neme. A lexicon of Arabic verbs constructed on the basis of Semitic taxonomy and using finite-state transducers. WoLeR 2011 at ESSLLI International Workshop on Lexical Resources, Aug 2011, Ljubliana, Slovenia. halshs-01186723 HAL Id: halshs-01186723 https://halshs.archives-ouvertes.fr/halshs-01186723 Submitted on 14 Jan 2016 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. A lexicon of Arabic verbs constructed on the basis of Semitic taxonomy and using finite-state transducers Alexis Amid Neme Laboratoire d'informatique Gaspard-Monge – LIGM Université Paris-Est, 77454 Marne-la-Vallée Cedex 2, France. http://infolingu.univ-mlv.fr E-mail: [email protected] Abstract We describe a lexicon of Arabic verbs constructed on the basis of Semitic patterns and used in a resource-based method of morphological annotation of written Arabic text. The annotated output is a graph of morphemes with accurate linguistic information. An enhanced FST implementation for Semitic languages was created. This system is adapted also for generating inflected forms. The language resources can be easily updated. The lexicon is constituted of 15 400 verbal entries. We propose an inflectional taxonomy that increases the lexicon readability and maintainability for Arabic speakers and linguists. Traditional grammar defines inflectional verbal classes by using verbal pattern-classes and root-classes, related to the nature of each of the triliteral root-consonants. Verbal pattern-classes are clearly defined but root-classes are complex. In our taxonomy, traditional pattern-classes are reused and root-classes are simply redefined. Our taxonomy provides a straightforward encoding scheme for inflectional variations and orthographic adjustments due to assimilation and agglutination. We have tested and evaluated our resource against 10 000 diacriticized verb occurrences in the Nemlar corpus and compared it to Buckwalter resources. The lexical coverage is 99.9 % and a laptop needs two minutes in order to generate and compress the inflected lexicon of 2.5 million forms into 4 Megabytes. form larger morphological or syntactic units. For 1. Introduction root-and pattern morphology, a stem derives from a root Arabic morphology can be described by many formal and a particular pattern and subsequently undergoes representations. However, Semitic morphology or affixations. root-and-pattern morphology (Kiraz, 2004) is a natural At the operational level, the lexical representation of the representation for Arabic 1 . The root represents a concatenative model is entirely concatenative in order to morphemic abstraction, usually for a verb a sequence of compel with the [prefix][stem][suffix] representation. three consonants, like ktb. A pattern is a template of However, these representations imply a manual stem characters surrounding the root consonants, and in which precompilation based on a root-and-pattern representation. the slots for the root consonants are shown by indices. The The concatenative models are generally composed of combination of a root with a pattern produces a surface three components: lexicon, rewrite rules, and form. For example, kataba and yakotubu are represented morphotactics. The lexicon consists of multiple sublexica, by the root ktb and the patterns 1a2a3a or ya1o2u3u. generally prefix, stem, and suffix. The rewrite rules map Root-and-pattern morphology is standard in Arabic and is the multiple lexical representations to a surface learned in grammar text books. Arabic linguists use representation. The morphotactics component aims with root-and-pattern representation in order to list verbal a subjacent representation to generate or to parse the entries and related inflected forms. On the other hand, surface form [prefix][stem][suffix] and performs FSTs have shown their simplicity and efficiency in alternation rules at morpheme boundaries such as deletion, inflectional morphology for western languages. Computer epenthesis, and assimilation. scientists appoint FSTs as standard devices for inflection. Any formal representation that is not adapted to Semitic Various formal representations for Arabic morphology morphology will be rejected by the majority of have been created by computer scientists to avoid Arabic-speaking linguists. When linguists work in a root-and-pattern representation. The point that motivated newly created formalism, they continue to work with this trend is that FSTs formalism would not be fitted for root-and-pattern representation on paper and Semitic morphology since FSTs are concatenative subsequently, they unfold their descriptions for a specific whereas Semitic morphology is not. In concatenative formalism. Their contribution for updating and correcting representation, the root-and-pattern representation is lexical resources is complex and time-consuming, and replaced by a stem- or lexeme-based representation. For therefore error-prone. these formalisms, a stem is a basic morpheme that Our approach resorts to classical techniques of lexicon undergoes affixations with other morphemes in order to compression and lookup in an inflected full-form dictionary that includes orthographic variations related to 1 We would like to thank Eric Laporte and Sébastien Paumier for helpful discussions, morpheme agglutination. The formalization of all contributions and for the adaptation of Unitex to Arabic. Unitex is an open source multilingual possible verbal tokens requires complex and corpus processor. More than 12 European languages, Korean and Thai with their linguistic interdependent rules. For these issues, we define a resources are operational in Unitex. http://www-igm.univ-mlv.fr/~unitex taxonomy for Arabic verbs composed of 460 inflectional classes. We demonstrate that FSTs are compatible with qr> qara> PV-> qara>/VERB_PERFECT root-and-pattern representation. Our taxonomy encodes qr| qara| PV-| qara|/VERB_PERFECT simultaneously in the lexical representation three qr& qara& PV_w qara&/VERB_PERFECT variations at the surface level: qr> qora> IV qora>/VERB_IMPERFECT - inflectional classes of a lemma; qr> qora> IV_wn qora>/VERB_IMPERFECT - inflectional subclasses related to morphophonemic qr| qora| IV-| qora|/VERB_IMPERFECT assimilation; qr& qora& IV_wn qora&/VERB_IMPERFECT - orthographic adjustments related to the agglutination of qr} qora} IV_yn qora}/VERB_IMPERFECT a pronoun. qr> qora> IV_Pass yuqora>/VERB_IMPERFECT In our orthographic representation, we use a fully Table 1. BAMA stem lexicon using Buckwalter diacriticized lexicon and we take advantage of the clear transliteration. A list of stems related to the boundary, already defined in traditional grammar, lemma-identifier qara>-a_1 "to read". The 9 stems are between verbal inflection and verbal agglutination to related to the orthographic variants of the 3rd root describe these two levels independently. In order to consonant, here glottal stop (hamza), depending on the satisfy both computer scientists and Arabic linguists, we next inflectional suffix and the existence of an have created in Unitex an enhanced version of FSTs agglutinated pronoun. adapted to root-and-pattern representation. The Buckwalter representation for the Arabic lexicon is In Section 2, we outline the state-of-the-art approaches to not fitted for generation but only for text analysis. In Arabic morphological annotation. Section 3 describes the ElixirFM (http://elixir-fm.sourceforge.net/), Smrz (2007) methodology and particularly the inflectional verbal adapted the Buckwalter resources for generation and the taxonomy. Section 4 describes agglutination as morpheme project is implemented in Haskell, a functional combinatorics. Section 5 reports the construction of the programming language. In the ALMORGEANA project, lexicon. Section 6 reports the evaluation of the lexicon. A Habash (2004) proposed also a version of Buckwalter conclusion and perspectives are presented in Section 7. resources adapted to generation and analysis. Below an example lilkutubi ―books‖ : 2. State of the Art lilkutubi [kitAb-1 POS: N l+ Al+ +PL +GEN] Several morphological annotators of Arabic are available. li_l_kutub-i The Buckwalter Arabic Morphological Analyzer (BAMA) [lemma-ID NOUN PREP+DET+ (plustem) + Genitive] is one of the best Arabic morphological analysers and is available as open source. The BAMA uses a concatenative Although the lexicon is an open linguistic resource, the lexicon-driven approach where morphotactics and procedure for updating it is complex. For instance, adding orthographic adjustment rules are partially applied into the a new verb is an intricate operation. First, the A and C lexicon itself instead of being specified in terms of general lexica are composed of 561 and 989 entries. Although rules that interact to realize the output (Buckwalter, 2002). the two disjoint sets of inflectional and agglutination suffix morphemes are clearly defined in Arabic, the The BAMA