Bootstrapping a Wordnet for an Arabic Dialect from Other Wordnets and Dictionary Resources
Total Page:16
File Type:pdf, Size:1020Kb
Bootstrapping a WordNet for an Arabic Dialect from Other WordNets and Dictionary Resources Violetta Cavalli-Sforza, Hind Saddiki Karim Bouzoubaa, Lahsen Abouenour School of Science and Engineering Mohammadia School of Engineering Al Akhawayn University Mohammed V University-Agdal Ifrane, Morocco Rabat, Morocco [email protected], [email protected] [email protected], [email protected] Mohamed Maamouri, Emily Goshey Linguistic Data Consortium University of Pennsylvania Philadelphia, U.S.A. [email protected], [email protected] Abstract — We describe an experiment in developing a first While WNs are interesting language resources on their version of WordNet for Iraqi Arabic starting from Arabic own, allowing the user to explore the relationship of words to WordNet (for Modern Standard Arabic), Princeton WordNet each other, they are also useful in a number of language (for English) and a bidirectional English-Iraqi Arabic dictionary. processing tasks requiring an understanding of the meaning of The resulting initial version of the target WordNet so-constructed language. Such tasks include information retrieval [5], word was made available to human experts in Iraqi Arabic for sense disambiguation [6], automatic text classification [7], correction and evaluation. automatic text summarization [8], question answering [9] and machine translation [10], among others (e.g. [11], [12]). Keywords—language resources; WordNet; automatic creation; However, the construction of a WN requires significant human bootstrapping; dialect; Arabic resources even to obtain a relatively basic subset of the full I. INTRODUCTION AND MOTIVATION WN for a language. An alternative to building a WN from scratch is to leverage existing resources, including existing A WordNet (WN) is a lexical database of a given language WNs if they exist for similar languages, to provide a “first that focuses on open-class words: nouns, verbs, adjectives, draft” of a WN resource, which can then be modified and adverbs and adverbials (the latter being mostly nouns, augmented automatically and/or manually by linguists (native adjectives and participles used in an adverbial role, e.g. speakers of the language). ‘willingly’). Words belonging to these parts of speech are grouped into cognitive synonyms (synsets), each expressing a After the development of the English WordNet and its distinct concept. Synsets are interlinked by means of success when used in the context of many NLP tasks, many conceptual-semantic and lexical relations such as hyponymy languages, including Arabic, went on to build their own and antonymy. The first WordNet was built for the English WordNets. However, to our knowledge, no WordNet has been language (named Princeton WordNet) 1. The WordNet for developed as a dialect of the WordNet of its corresponding Modern Standard Arabic (MSA), followed many years later. language, and therefore no process and methodology has been The first Arabic WordNet was released in 2007 2 [1][2][3] and proposed and tested for the creation of dialect WordNets. followed the development process of English WordNet and The work described herein aimed at developing a general Euro WordNet 3 [4]. It utilized the Suggested Upper Merged process for creation of a basic WordNet for Arabic dialects by Ontology 4 as an interlingua to link Arabic WN to previously using Arabic WN (AWN), Princeton WN (PWN), and an developed WNs. The most recent version of AWN is AWN English-Arabic dictionary for a dialect of Arabic. The rationale 2.0.1 (released in March 2009). To our knowledge no other is as follows. Dialects of Arabic can be presumed to be similar open source WNs for the Arabic language or its dialects have enough in their conceptual organization to Modern Standard been developed to date. Arabic that AWN provides a useful starting point for building a WN for dialects. AWN is linked at the level of synsets to PWN. Using these links, it is possible to use English word This research was made possible through funding received by LDC, information to consult an English-Arabic dialect dictionary. University of Pennsylvania, USA, subcontracted to Al Akhawayn University, Our project specifically targeted the Iraqi dialect because of the Ifrane, Morocco. availability of an English-Iraqi bidirectional computational 1 http://wordnet.princeton.edu/ dictionary that was at a more advanced stage of completion 2 http://www.globalwordnet.org/AWN/ 3 http://www.illc.uva.nl/EuroWordNet/ than dictionaries for other dialects. 4 http://www.ontologyportal.org/ 10 th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA 2013), May 27-30, 2013, Fes/Ifrane, Morocco The paper is organized as follows. We begin by presenting 161,705 links. Although AWN 2.0.1 is larger than 2.0, the resources we used and the pre-processing we needed to correcting some of the problems present in the older version as perform on them before they could be effectively used. We well as adding synsets and links that simplify some tasks, the then explain and justify the automatic process we followed in choice of the older version does not create serious issues for developing the initial version of the Iraqi Arabic WordNet this work. We return to this subject in the last section of the (IAWN) resource and the raw results obtained through this paper. AWN 2.0 is distributed with a browser application from automatic processing. Successively, we briefly describe the which it can be downloaded in either XML format and as a set user interfaces we built to browse the resources used, and in of CSV files. particular the tool for exploring and modifying the IAWN. Following that, we present the work performed by human AWN synsets belong to one of 5 parts of speech: noun annotators (linguists) and relate some of their reactions to the (6,438), verb (2,536), adjective (456), adjective satellite (158), IAWN resource and associated tools. We conclude with and adverb (110). The following are present in AWN 2.0: summary of the work performed and some recommendations • 9,698 equivalent links connect AWN synsets to their for future work. corresponding PWN synsets. There are approximately as many equivalent links as there are AWN synsets, II. RESOURCES USED AND PRE -PROCESSING though 10 AWN synsets (currency-related) are The construction of the initial IAWN employed AWN 2.0, equivalenced to 5 PWN synsets. PWN 2.0 (to which AWN 2.0 is linked), and a • computationally accessible Iraqi-English dictionary [13].5 94,841 hyponym (of ) links connect PWN hyponym synsets to their PWN hypernym(s) and concern verb A. Princeton WordNet (PWN) and noun synsets only. Many fewer are actually used in Much has been written about PWN [14] and detailed AWN because the AWN-PWN equivalence links only statistics about the contents of different versions are available involve less than 9% of the PWN links and therefore online.6 PWN 2.0, though smaller than 3.1, is still much larger many of these links are necessarily ignored or than AWN 2.0, containing 115,424 synsets of which most are collapsed. nouns (79,689), followed by adjectives (18,563), verbs • 22,196 similar links also connect PWN synsets to (13,508), and adverbs (3,664). Noun and verb synsets are PWN synsets, occurring strictly between adjective and organized in hierarchies connected by hyponymy/hypernym adjective satellite synsets; they are symmetric, so there relationships (a hyponym is more specific than its hypernym) are really 11,098 pairs of similar synsets. and are also related to other synsets in a variety of ways. Adjective synsets are related to other adjective synsets with • 1,067 has_instance links connect an AWN synset to similarity links: adjective synsets are similar to adjective another AWN synset representing a named entity. satellite synsets and vice versa. Adjective synsets are also • connected via specific words to antonym words and via a 7,993 antonym links and 7,920 pertainym links are pertainym link to related nouns. The term ‘adjective satellite’ links between PWN words and not between synsets. acquires its meaning from the following: “Two opposite 'head' For building IAWN, the most interesting links were the senses work as binary poles, while 'satellite' synonyms equivalent links and hyponym links, since they defined the 7 connect to each of the heads via synonymy relations”. hierarchical structure of AWN indirectly through the PWN Adverbs have an even simpler organization: adverb synsets hyponym links. Has_instance links also contribute to the are not connected to other synsets directly but via their hierarchy but, in AWN 2.0, almost ¾ of these links point to contained words to the adjective from which they derive. instances that do not actually exist as synsets; of the remaining The main information needed from PWN for the IAWN ones, the target instance synsets were also linked to other construction task was the synset ID, the part of speech, the synsets via hyponym links over ¾ of the time. Since there is no English gloss describing and giving examples of use of the mapping between AWN and PWN at the level of words, concept represented by the synset, the prototype word of the antonym and pertainym links could not really be used. Some synset (the primary synonym, most representative of the of these problems were rectified in AWN 2.0.1. We also did concept), and other synonym words when present. not use similar links (although we could have) because they related adjectives only. In AWN, adjective (and adjective B. Arabic WordNet (AWN) satellite) synsets exist, but are not connected to any hierarchy. AWN 2.0 was released in January of 2008; it contains C. Iraqi-English Dictionary (IED) 9,698 synsets, corresponding to 21,813 MSA words, and 6 different link types, totaling 143,715 links. A later version of We received the IED [13] as a set of distinct XML files for AWN, 2.0.1, is also available and contains 11,269 synsets, the English-to-Iraqi (E-to-I) directions, containing 10,741 corresponding to 23,841 words, and 22 link types, totaling English lexical entries, and Iraqi-to-English (I-to-E) direction, containing 17,457 Iraqi lexical entries .