Extending Wordnet.PT to Portuguese Varieties
Total Page:16
File Type:pdf, Size:1020Kb
WordNet.PT global – Extending WordNet.PT to Portuguese varieties Palmira Marrafa 1, Raquel Amaro 2 and Sara Mendes 2 Group for the Computation of Lexical and Grammatical Knowledge, Center of Linguistics of the University of Lisbon Avenida Professor Gama Pinto, 2 1649-003 Lisboa, Portugal [email protected] 2{ramaro,sara.mendes}@clul.ul.pt starting point for the specification of a fragment of Abstract the Portuguese lexicon, in the first phase of the project (1999-2003), consisted in the selection of a This paper reports the results of the set of semantic domains covering concepts with WordNet.PT project, an extension of global high productivity in daily life communication. The WordNet.PT to all Portuguese varieties. encoding of language-internal relations followed a Profiting from a theoretical model of high mixed top-down/bottom-up strategy for the level explanatory adequacy and from a extension of small local nets (Marrafa 2002). Such convenient and flexible development tool, work firstly focused on nouns, but has since then WordNet.PT achieves a rich and multi- global been extended to all the main POS, a work which purpose lexical resource, suitable for has resulted both in refining information contrastive studies and for a vast range of specifications and increasing WordNet.PT language-based applications covering all coverage (Amaro et al. 2006; Marrafa et al. 2006; Portuguese varieties. Amaro 2009; Mendes 2009). Relational lexica, and wordnets in particular, 1 Introduction play a leading role in machine lexical knowledge representation. Hence, providing Portuguese with WordNet.PT is being built since July 1999, at the such a rich linguistic resource, and particularly Center of Linguistics of the University of Lisbon Portuguese varieties not often considered in lexical as a project developed by the Group for the resources, is crucial, not only to researchers Computation of Lexical and Grammatical working in contrastive studies or with the so-called Knowledge (CLG). non-standard varieties, but also to the general WordNet.PT is being developed within the public, as the database is made available for general approach of EuroWordNet (Vossen 1998, consultation in the WWW through an intuitive and 1999). Therefore, like each wordnet in EWN, perspicuous web interface. Such work is also WordNet.PT has a general conceptual architecture particularly relevant as the resulting database can structured along the lines of the Princeton be extensively used in a vast range of language- WordNet (Miller et al. 1990; Fellbaum 1998). based applications, able to cover, this way, all For early strategic reasons concerning Portuguese varieties. applications, this project is being carried out on the This paper depicts the work developed and the basis of manual work, assuring the accuracy and results achieved under the scope of the reliability of its results. WordNet.PT global project, funded by Instituto Aiming at using the Portuguese WordNet in Camões, which, as mentioned above, aims at language learning applications, among others, the extending WordNet.PT to Portuguese varieties. 70 Proceedings of EMNLP 2011, Conference on Empirical Methods in Natural Language Processing, pages 70–74, Edinburgh, Scotland, UK, July 27–31, 2011. c 2011 Association for Computational Linguistics 2 The Data pinpoint the expressions used for denoting the aforementioned 10 000 concepts. Informants were Portuguese is spoken in all five continents by over selected by Instituto Camões among undergrad 250 million speakers, according to recent studies, students in Portuguese studies and supervised by and is the official language of 8 countries: Angola, Portuguese lectors in each local university. Besides Brazil, Cape Verde, East Timor, Guinea Bissau, the European Portuguese variety, which is already Mozambique, Portugal, and Sao Tome e Principe. encoded in WordNet.PT, specifications for six Being spoken in geographically distant regions and other Portuguese varieties were integrated in the by very different communities, both in terms of database 2 : Angolan Portuguese, Brazilian size and culture, Portuguese is naturally expected Portuguese, Cape Verdean Portuguese, East to show variation. Despite this, regional varieties Timorese Portuguese, Mozambican Portuguese and are far from being equally provided with linguistic Sao Tome e Principe Portuguese. For each resources representing their specificities, as most concept, several lexicalizations were identified and research work is focused either on the Brazilian or 1 both the marked and unmarked expressions the European varieties . regarding usage information 3 were considered and In the work depicted here we aim at contributing identified. to reverse this situation, considering that this kind of resource is particularly adequate to achieve this 2.1 Data selection goal, since it allows for representing lexical As mentioned above, our approach for enriching variation in a very straightforward way: concepts WordNet.PT with lexicalizations from all are the basic unit in wordnets, defined by a set of Portuguese varieties consisted in extracting 10 000 lexical conceptual relations with other concepts, concepts from WordNet.PT and associating them and represented by the set of lexical expressions to the lexical expressions which denote them in (tendentially all) that denote them. We have to each variety. antecipate the possibility of different varieties showing distinct lexicalization patterns, proper domain nouns verbs adjectives total particularly some lexical gaps or lexicalizations of nouns more specific concepts. This is straightforwardly art 422 14 83 0 519 dealt with in the WordNet model: once a system of clothes 467 62 74 0 603 relevant tags has been implemented in the database communication 314 151 106 82 653 in order to identify lexical expressions with regard education 536 37 30 82 685 food 1131 130 115 0 1376 to their corresponding varieties, lexical gaps are geography 281 0 166 200 647 simply encoded by not associating the tag of the health 1159 92 175 0 1426 variety at stake to any of the variants in the synset; housing 595 28 46 0 669 specific lexicalizations, on the other hand, are human activities 641 0 0 0 641 added to the network as a new node and associated human relations 620 189 100 0 909 to the variety tag at stake. living things 1597 113 119 1 1830 sports 480 34 23 2 539 Our approach consisted in extracting 10 000 transportation 659 562 67 30 659 concepts from WordNet.PT, and associating them all domains 7893 802 1022 284 10001 with the lexical expressions that denote them in domain overlap 10,36% 12,54% 4,22% 22,62% 10,35% each Portuguese variety considered in this project. Table 1: concepts extended to Portuguese varieties In order to accomplish this, we consulted native speakers from each of these varieties, resident in their original communities, and asked them to 2 All Portuguese varieties spoken in countries where Portuguese is the official language were considered. However, 1 Official standardized versions of East Timorese and African for the time being, data from Guinean Portuguese are not yet varieties of Portuguese essentially correspond to that of encoded in the WordNet.PT global database due to difficulties in European Portuguese. Moreover, these varieties are not maintaining a regular contact with the native speakers provided with dedicated lexical resources, such as dictionaries consulted. Despite this, we still hope to be able to include this or large-scale corpora. Being so, speakers in these regions variety in the database at some point in the future. generally use European Portuguese lexical resources, which 3 Informants were provided with a limited inventory of usage only exceptionally cover lexical variants specific to these markers: slang; vulgar; informal; humerous; popular; unusual; varieties. regional; technical; old-fashioned. 71 The semantic domain approach initially used in innovative research results in WordNet.PT. In developing WordNet.PT, provided us with a order to do so, this computational tool has been natural starting point for the selection of data to be developed to straightforwardly allow for updates considered in this project. The table above presents and improvements. In the specific case of the task the distribution, per POS and semantic domain 4, of addressed in this paper, extending the coverage of the WordNet.PT concepts extended to non- the WordNet.PT database to lexicalizations of European Portuguese varieties. different Portuguese varieties involved the design of additional features regarding the identification 2.2 Data implementation of Portuguese varieties and variety-dependent Once the data described above were presented to usage label encoding. the native speakers consulted and their input 2.3 The results organized, all the information obtained was incorporated in the database. Encoding the data obtained in WordNet.PT global This way, for a concept like bus (public extends a relevant fragment of WordNet.PT to transportation which has regular pre-established Portuguese varieties other than European stops at short intervals, typically operating within Portuguese. This way, researchers are provided cities), for instance, the following lexicalizations with a crucial database for developing contrastive were obtained: autocarro , machibombo , studies on the lexicon of different Portuguese machimbombo , ônibus , and microônibus . varieties or research on a specific Portuguese Autocarro was found to be the more common variety, just to mention a possible