English WordNet 2019 – An Open-Source WordNet for English

John P. McCrae, Data Science Institute, Insight Centre for Data Analytics, National University of Ireland Galway
Alexandre Rademaker, IBM Research and FGV/EMAp, Brazil
Francis Bond, Nanyang Technological University
Ewa Rudnicka, Wroclaw University of Technology
Christiane Fellbaum, Princeton University

Abstract

We describe the release of a new wordnet for English based on the Princeton WordNet, but now developed under an open-source model. In particular, this version of WordNet, which we call English WordNet 2019 and which has been developed by multiple people around the world through GitHub, fixes many errors in previous wordnets for English. We give some details of the changes that have been made in this version and offer some perspectives on likely future changes as this project continues to evolve.

1 Introduction

WordNet (Miller, 1995; Fellbaum, 1998) is one of the most widely used language resources in natural language processing and continues to find usage in a wide variety of applications, including sentiment analysis (Wang et al., 2018), natural language generation (Juraska et al., 2018) and textual entailment (Silva et al., 2018). However, there has been only one update since version 3.0 was released in 2006, in spite of its wide use and the interest in the data. In the meantime, a number of other wordnet teams working with the WordNet data have proposed modifications or extensions to its latest release. These two facts have provided the chief motivation for our present initiative, namely developing an open-source WordNet for English on the basis of Princeton WordNet (to be released under the name English WordNet 2019).

In order to allow for meaningful comparisons of performance on tasks using WordNet as a component, it is important to maintain a single wordnet (or very few wordnets) as a standard and reference.

One of the core issues preventing further development of the original WordNet model has been the question of how to ensure the resource maintains its quality. The Princeton WordNet team has followed a model that requires an expert lexicographer to review and implement all changes. In this paper, we discuss the development of Open English WordNet, which instead follows a methodology of quality assurance based on those typically used for open-source projects, especially those connected to the Linux operating system. In particular, we can consider this to be an application of Linus's Law ("given enough eyeballs, all bugs are shallow") to the development of WordNet, similar to other open-source oriented projects such as OpenWordNetPT (Paiva et al., 2012) and the recently announced Global FrameNet project¹. Still, we will do our best to have new data and proposed changes verified by expert lexicographers or developers whenever possible.

¹ https://www.globalframenet.org/

We have implemented this in terms of a new 'fork' of Princeton WordNet, and have released a new version of WordNet that fixes many (mostly trivial) errors, such as spelling mistakes, and thus improves the quality of the resource. We take inspiration from other forks, such as the MariaDB fork of MySQL, and aim to make this a 'drop-in' replacement for Princeton WordNet. This is achieved by ensuring that the data is available in a wide range of formats, including those used by Princeton to publish the resource and standards promoted by the Global WordNet Association, so that existing projects can use these changes without updates to their workflows. In particular, we continue to follow the basic conception of Princeton WordNet and do not introduce changes that would fundamentally affect the nature of the wordnet. Instead, our focus for this release is on fixing more minor errors, and for future releases we plan to extend this principally to adding new synsets and relations, using the existing structure as a guide. As an open-source project, we expect that the community will create synsets that reflect their views, and that this may in the long run lead to more significant divergences from the Princeton WordNet model.

Moreover, we also present a new website and project that allows the resources to be queried at http://en-word.net, which presents the most recent changes in a dynamic manner as they are updated on the GitHub website. To indicate that this is clearly a new version of WordNet, we have termed this version the 2019 edition of English WordNet, and we provide a clear and auditable list of the changes that have been made, such that it would be possible for the Princeton WordNet to use these changes in any future versions they make.

This paper is structured as follows: first, we present some other efforts to extend the Princeton WordNet for English, and then we describe the kinds of changes that we have made for this release. We then provide a brief discussion of the open issues that will be handled in the next version and how they may be handled. Finally, we briefly describe the release and the implementation of the user interface.

2 Background

Princeton WordNet (Miller, 1995; Fellbaum, 2010) is the first wordnet for English; however, it is not the only one that has been developed for this language. Moreover, during the development of several wordnets for other languages, significant changes and/or additions were made to the underlying structure and content of the English section of the wordnet. In at least one case, namely the development of the Polish wordnet, plWordNet, the additions to the underlying English wordnet have been so numerous that they were released as a new wordnet, enWordnet (Rudnicka et al., 2015; Maziarz et al., 2016). These involved the addition of new lemmas (over 11k), lexical units (over 11k) and synsets (7.5k). The latter were linked to WordNet 3.1 synsets via the hyponymy relation. Still, no alterations to the original WordNet synsets or relations were made within this project. Currently, enWordnet is only available as part of the plWordNet project and does not constitute a 'drop-in' replacement for Princeton WordNet.

Some projects have attempted to expand Princeton WordNet with new terminology in other directions. For example, the Colloquial WordNet project (McCrae et al., 2017) has been working on adding new terms that are used in social media, and this is available using the same GWC formats (McCrae et al., 2019) as this work; a similar project called SlangNet (Dhuliawala et al., 2016) seems to be unavailable now. There have also been a number of attempts to extend WordNet in terms of the kinds of annotation that it contains, such as the addition of sentiment and emotion information (Strapparava et al., 2004) or combining it with an upper-level ontology (Niles and Pease, 2003).

Another significant direction has been the automatic extension of WordNet, and several projects have been published based on extending WordNet with information from other resources, especially Wiktionary and Wikipedia. One of the most prominent of such resources is BabelNet (Navigli and Ponzetto, 2012), which combines multiple machine-learning-based methods that have been shown to have a precision of up to 89.7%. A similar effort was carried out by the UKP group and led to the Uby resource (Gurevych et al., 2012), whose authors report similar levels of accuracy in the mapping. While such automatically constructed resources may be valuable for a large number of applications, they cannot replace WordNet for applications that require a gold standard lexicon or very high precision. Further, many of these resources have taken WordNet as is, and have often repeated the same design and frequently copied many of the minor errors into their own resources.

3 The Open English WordNet Project

The Open English WordNet Project² takes the form of a single Git repository, published on GitHub, and consisting for the most part of a collection of XML files describing the synsets and lexical entries in the resource. These XML files represent each of the lexicographer file sections of the original resource, and a simple script is provided to stitch them together into a single XML file. The XML files are compliant with the GWC LMF model (McCrae et al., 2019)³, which is itself based partially on the LMF model (Francopoulo et al., 2006) and in particular the WordNet (a.k.a. Kyoto) LMF variant (Soria et al., 2009). Due to its basis on LMF, a particular challenge was that the entire wordnet should be represented as a single XML document. However, due to the relative verbosity of the LMF format, the final data ended up as 97 MB, exceeding the upload limits of GitHub, so instead the single XML file was divided by lexicographer sections. Even so, this creates several very large files (over 10 MB), and this has resulted in some challenges for those working on the project⁴, which may be solved by the adoption of a less verbose format.

² https://github.com/globalwordnet/english-wordnet
³ https://globalwordnet.github.io/schemas/

[...] et al., 2014) as well as in the WNDB formats used for previous versions of WordNet, allowing English WordNet to be a 'drop-in' replacement for Princeton WordNet. Furthermore, this update is used to populate the searchable frontend, which is available at http://en-word.net/.

4 Scope of Changes

One of the first major classes of errors that we attempted to fix was simple spelling errors that occur particularly in the definitions and the examples