Methodological Aspects of Developing and Managing an Etymological Lexical Resource: Introducing Etymdb 2.0 Clémentine Fourrier, Benoît Sagot
Total Page:16
File Type:pdf, Size:1020Kb
Methodological Aspects of Developing and Managing an Etymological Lexical Resource: Introducing EtymDB 2.0 Clémentine Fourrier, Benoît Sagot To cite this version: Clémentine Fourrier, Benoît Sagot. Methodological Aspects of Developing and Managing an Ety- mological Lexical Resource: Introducing EtymDB 2.0. LREC 2020 - 12th Language Resources and Evaluation Conference, May 2020, Marseille, France. hal-02678100 HAL Id: hal-02678100 https://hal.inria.fr/hal-02678100 Submitted on 31 May 2020 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Methodological Aspects of Developing and Managing an Etymological Lexical Resource: Introducing EtymDB 2.0 Clémentine Fourrier Benoît Sagot Inria {clementine.fourrier, benoit.sagot}@inria.fr Abstract Diachronic lexical information was mostly used in its natural field, historical linguistics, until recently, when promising but not yet conclusive applications to low resource languages machine translation started extending its usage to NLP. There is therefore a new need for fine-grained, large-coverage and accurate etymological lexical resources. In this paper, we propose a set of guidelines to generate such resources, for each step of the life-cycle of an etymological lexicon: creation, update, evaluation, dissemination, and exploitation. To illustrate the guidelines, we introduce EtymDB 2.0, an etymological database automatically generated from the Wiktionary, which contains 1.8 million lexemes, linked by more than 700,000 fine-grained etymological relations, across 2,536 living and dead languages. We also introduce use cases for which EtymDB 2.0 could represent a key resource, such as phylogenetic tree generation, low resource machine translation and medieval languages study. Keywords: Etymological lexicon, Lexical Resource Development, Language Resource Life-cycle, Methodology 1. Introduction guage studies, machine translation, and describe how we Most available electronic lexical resources are synchronic used EtymDB 2.0 to extract a global language phylogenetic 2 (formalising a language as it is or was at a specific, al- tree. though sometimes broad, period of time). Until recently, Our main contributions are therefore twofold: method- the few diachronic resources available were mostly used for ological proposals for the development of etymological computational historical linguistics tasks (List et al., 2018; lexical resources, and the new EtymDB 2.0 etymological 3 Carling et al., 2018). However, over the last few years, database. diachronic resource use has found its way to more gen- eral applications, notably for tasks involving low resource 2. State of the Art languages, from machine translation using diachronic lan- 2.1. Existing Etymological Databases guage relations (Nguyen and Chiang, 2017) or cognates Though a number of individual cognacy datasets can be and loan words sets (Grönroos et al., 2018) to bilingual found online, they often vary wildly in usability, reliabil- lexicons generation for low resource languages (Nasution ity and scope. The following etymological databases com- et al., 2017). bine large scope, usable data format, and generally reliable Yet only a small number of multilingual etymological sources. Interestingly, a majority of these large scale elec- lexicons are available (de Melo, 2014; Sagot, 2017; Panta- tronic etymological databases have been generated from leo et al., 2017; Batsuren et al., 2019), most extracted from a version of the Wiktionary, an online collaborative dic- the etymological information found in the Wiktionary.1 tionary, which contains large scale structured information, There is still room for improvement regarding the qual- most of the time sourced from already existing and pub- ity, richness, lexical or language coverage and etymolo- lished etymological works. gical granularity (differentiation between inheritance, bor- EtymWordNet is an older etymological database extracted rowing, cognacy) of such resources. from the 2013 version of the Wiktionary (de Melo, 2014). It The work described in this paper has two parallel objec- makes a difference between cognacy4 and generic “etymo- tives. We investigate the methodological challenges under- logical origin” relations, but goes no further. It also lying the development and use of an etymological database does not systematically differentiate between glosses.5 based on the Wiktionary. At the same time, we describe It contains 473,433 general etymological relations and how we addressed these challenges in the case of our own 538,588 cognacy relations, and was the reference point for etymological database, EtymDB. After a description of EtymDB 1.0 (which was considerably more granular). existing diachronic lexicons, we go through the life-cycle of an etymological resource in five steps: creation, up- 2 date, evaluation, dissemination and exploitation. We use A language phylogenetic tree is a speculative rooted acyclic as a case study our previous work on the development of graph displaying evolutionary relationships between languages. 3EtymDB 2.0 is available, as its previous version, under a CC- EtymDB’s initial version, EtymDB 1.0 (Sagot, 2017), as BY-SA free resource licence. well as the development of a new version, EtymDB 2.0, its 4Given two languages with a common ancestor, two words are evaluation, dissemination and exploitation. To illustrate the said to be cognates (in the strictest sense) if they are an evolution latter, we discuss possible applications in low resource lan- of the same word from said ancestor, called their proto-form. 5Here, the gloss of a word refers its meaning expressed as its 1http://en.wiktionary.org English translation CogNet is an automatically extracted cognate database mon (a word located, in time and space, in relation to other based on wordnets (Batsuren et al., 2019). It uses a loose words), and relational units are represented as etymological definition of cognacy, and therefore is actually a database links, between an etymological source and target, typed by containing both cognates and loanwords. It has the lowest etymological classes. Bowers and Romary (2017) extended granularity, but the most lexemes, with 3 million “cognate this model to standardise and integrate it in the Text Encod- pairs” across 338 languages. ing Initiative (TEI), most notably to include etymological EtymDB 1.0 is the previous version of our etymological as well as lexical creation processes as relation types: stan- database (Sagot, 2017). It makes a difference between in- dard inheritance, borrowings, metaphors, metonymy, com- heritance, borrowing, and cognacy, and contains 1 million position and grammaticalisation. distinct lexemes linked by half a million distinct relations. In 2019, the Lexical Markup Format (LMF), the official However, it has a few shortcomings, most notably in its ISO standard for NLP and digital lexicons (which provides management of duplicates. As mentioned in the introduc- guidelines to model and encode lexical information) has tion, both its update and the guidelines we followed for said been extended in (Romary et al., 2019) to introduce man- extension are the topic of the current paper. agement of etymological information; the format chosen EtyTree is a graphical etymological dictionary (Pantaleo et for its serialisation was the TEI. al., 2017). It provides information about direct inheritance relations, descendants and compounding, but does not pro- 3. Creating an Etymological Lexicon vide cognacy information. Its authors present it more as From Available Datasets being a tool to visualise the Wiktionary as it is than a new As this section describes, in part, EtymDB 1.0, which has database in itself, as little extra cleaning and filtering was already been introduced by Sagot (2017), only the relevant done on the extracted words. We used this etymological re- information for the current article will be summarised here. source as a reference to which we compared EtymDB 2.0, since it was the closest in terms of methods and data source. 3.1. Defining Goals For the sake of completeness, it should be noted that CoBL, To define goals for a new resource, it is important to know the Cognacy in Basic Lexicon database (Anderson et al., to both what is available and the limitations of existing works. be published) will probably be a good reference for future When EtymDB 1.0 was first created, the only existing re- etymology databases. It is a descendant and upgrade of source among those we described in Section 2.1. was Etym- Michael Dunn’s Indo-European Lexical Cognacy Database, WordNet (de Melo, 2014). As described earlier, it did not and though it only concerns itself with cognacy, its sources differentiate between glosses nor had a fine relation granu- have been handpicked. It is sadly not yet available to the larity. As such, the initial goal of EtymDB was to provide a public, and as such we were not able to use it as a reference large scale formalised etymological database at the lexeme point for this paper. level, which differentiated glosses, and contained a finer level of granularity