OdeNet: Compiling a German Wordnet from other Resources
Melanie Siegel Francis Bond Darmstadt University School of Humanities of Applied Sciences Nanyang Technological University [email protected] [email protected]
Abstract 1998)). The OMW data (Bond and Foster, 2013) was made by matching multiple linked wordnets The Princeton WordNet for the English to Wiktionary (Wikimedia, 2013) and the Uni- language has been used worldwide in NLP code Common Locale Data Repository (Unicode, projects for many years. With the OMW 2012). The OpenThesaurus is a large resource, initiative, wordnets for different languages generated and updated by the crowd. The PWN of the world are being linked via identi- resource is a well-developed resource for English fiers. The parallel development and link- concepts. It includes many relations between the ing allows new multilingual application concepts and is linked to resources for multi- perspectives. The development of a word- ple languages. The synsets from the OMW data net for the German language is also in have an estimated accuracy of 90%. We call our this context. To save development time, new resource “OdeNet”, from “Offenes deutsches existing resources were combined and re- Wordnet - open German wordnet”. The first ver- compiled. The result was then evaluated sion of OdeNet was automatically compiled. We and improved. In a relatively short time also describe the efforts to extend and correct the a resource was created that can be used entries. in projects and continuously improved and extended. 2 Related Work 1 Introduction In the Open Multilingual Wordnet initiative (Bond and Paik, 2012; Bond et al., 2015), wordnets for The goal of this initiative is to have a German re- several languages were developed and linked. source in a multilingual wordnet initiative, where A manually well-designed wordnet resource for the concepts ( synsets) of the languages are linked, German is GermaNet (Hamp and Feldweg, 1997). and where the resources are under an open-source GermaNet was developed over 20 years now and license, being included in the NLTK language pro- is very stable and precise. The problem is that it is cessing package ((Bird et al., 2009)) via the Wn not under an open-source license and is therefore package ((Goodman and Bond, 2021)). not broadly used in language technology applica- Wordnet resources are largely used in NLP tions. Further, the restricted license makes it im- projects all over the world. Our idea is to cre- possible to include GermaNet in the Open Multi- ate a German resource that starts from a crowd- lingual Wordnet initiative. This is the reason why developed thesaurus; is open; and included in the we decided to build up a new resource. In order NLTK package. Then it can be further developed not to violate the license terms, we do not use any- by researchers while using the resource for their thing from GermaNet in OdeNet. NLP projects. Vossen(1998, p11) describes two basic ap- For the first version, we combined existing re- proaches to develop new wordnet resources: In sources: The OpenThesaurus German synonym the first case (expand), existing PWN synsets are 1 2 lexicon, the Open Multilingual Wordnet (OMW: taken and lexical entries added for the specific Bond and Foster, 2013) the English resource, the language. In the second case (merge), language- Princeton WordNet of English (PWN: Fellbaum, specific resources are built and then linked to the 1https://www.openthesaurus.de/ PWN. 2http://compling.hss.ntu.edu.sg/omw/ An example of expand is the Japanese wordnet (Isahara et al., 2008). It is based on translations “to place, to relocate, to publish - embarrassed”. of PWN to Japanese. The Japanese wordnet is Other POS ambiguities are not relevant for this not fully automatically built: most translations are work because they refer to finer POS distributions manually checked. The authors found that there than we can provide at the moment (particles - are differences between concept structures in En- prepositions, demonstrative pronouns - articles). glish and Japanese, such that several synsets could In the area of multi-word lexemes we are con- not be translated. cerned with support verb constructions, such as The Russian wordnet (Alexeyevsky and Tem- Abschied nehmen “to say goodbye” or in Rech- chenko, 2016) is an example of the merge ap- nung stellen “to invoice”. In addition, there are proach. It is based on a monolingual dictionary idioms such as das geht auf keine Kuhhaut “it beg- and the word definitions in these. The idea is gars description”. Especially for idioms it is diffi- that definitions contain hypernyms of the defined cult to automatically determine the syntactic cate- words, often in the form of WORD:HYPERNYM gory. . . . , and that this information can be used to set up However, complex nouns are not realized - as in hierarchical structures in the wordnet. English - by means of multi-word expressions, but The approach of the OdeNet initiative is merge. with compounds. Nominal compounds are very We use an existing synonym dictionary and try to productive in German. They can be very long, link the synsets to PWN. like the well-known example Donaudampfschiff- Braslavski et al.(2016) describe the creation of fahrtskapitansm¨ utze¨ “Danube steamship captain’s a large thesaurus for Russian by means of crowd cap”. They can constantly be newly created. Au- sourcing. The data is directly collected in a word- tomatic extraction and analysis from text data is net style, but synsets are not linked to the OMW. complex because there are ambiguities here too. The basic data for OdeNet is also generated in In the case of regular German compounds, there a crowd sourcing style, in the OpenThesaurus is a hyponymy relationship between the head and project. The OpenThesaurus project (Naber, the compound. For example, Wassereis is an ice 2004) is a crowd initiative to set up a German that consists of water, while Eiswasser is water synonym lexicon. The version we downloaded that is ice-cold. Different relations can exist to the in April 2017 has about 120,000 lexical entries in modifier. The regularity of the hyponymy relation- about 36,000 synsets. ship to the head of German compounds is used to add relations to OdeNet.
2.1 German 3 Process of Creating OdeNet The establishment of an ontology for the lexical The first version of OdeNet was completely auto- information of a language requires an in-depth matically created by compilation from OpenThe- study of ambiguities and multi-word lexemes. In saurus. In the following, manual corrections were German, compounds are also an issue. There made in the domains of project management and are many examples of lexical ambiguities in Ger- business reports. German definitions were intro- man, such as Mutter “mother, nut” or umfahren duced, relations were corrected and supplemented “bypass, to knock over”. These are in many and CILI links (links to the multilingual concepts cases (especially in the case of homonyms) not in OMW) were added. Then we worked on the parallel to English ambiguities, which makes the syntactic categories. The main focus was on cor- translation more difficult (for the purpose of link- recting the POS tags of multi-word lexemes.The ing in OMW). In most cases, ambiguities remain next step was the annotation of basic German within a syntactic category (POS). The capitaliza- words3. We annotated all lexical entries (except tion of German nouns prevents ambiguities be- for function words) of this list with tween nouns and other syntactic categories, as is often the case in English (e.g. change “money” dc:type="basic_German" or “transform”). Morpho-syntactic ambiguities, We then added missing entries and corrected which occur frequently in German, are not rel- synsets manually. Then, we implemented an anal- evant for OdeNet because only lemmata are in- 3as listed in http://pcai056.informatik. cluded. There are some words that can be used uni-leipzig.de/downloads/etc/legacy/ both as verbs and adjectives, such as verlegen Papers/top1000de.txt ysis of German nominal compounds and used this (id="de-39-n",pwn="in-05890249-n"), information for the addition of hypernym rela- tions. We could link 19,845 German synsets to synsets in the PWN, about 55 % of the German synsets. 3.1 Linking OpenThesaurus Synsets with the Synsets that could not be linked were often Multilingual Wordnet multi-word expressions and metaphorical, such The OpenThesaurus data can be downloaded as as: es kann Gott weiß was passieren; fur¨ nichts txt. The text file contains one synset per line, such garantieren konnen;¨ mit allem rechnen mussen¨ that the lexical items in each synset are divided by “God knows what can happen; can’t guarantee semicolons, e.g.: anything; have to count on everything”. The link gives direct access to the definition in PWN, Mobilitat;Unabh¨ angigkeit;Beweglichkeit¨ such that we could copy these into OdeNet in dc:description. Thus, we have an English The target of the transfer process of this synset definition as long as German definitions are still is to have three lexical entries and a synset entry. missing. The synset relations in PWN link to En- The format is described in Bond et al.(2016). We glish synsets. We searched for German synsets start with the synset: with the ili that links to the target of a relation
Table 2: Precision of 90 randomly chosen lexical entries ing English synsets. We checked if the definitions between the German synsets - parallel to the rela- are correct (and therefore the synsets are correctly tions in the other wordnets. linked). 55 of the 90 cases had a link to an English The Open Multilingual Wordnet initiative is a synset, and therefore a definition. In 45 of the 55 great chance to get highly linked and standard- cases (82%), these definitions were correct. ized language resources for multiple languages. There were 41 cases, where relations on the The standardization makes it possible to include lexical or the synset level were assigned (34%). these resources in NLP packages, such as NLTK 12 of these cases had wrongly assigned relations or spaCy. (39%). In 5 of these cases, the link to PWN was We have shown that it is possible using NLP also wrong, and the relation was taken over from techniques to combine language resources such as the English synset. In one case, the relation from the OpenThesaurus and the English PWN to gain the English synset was wrong, while the relation a new resource in this standardized multilingual that was automatically added by the compound context, with a reasonable precision. analysis was correct. The next correction step will The next step will be to further work on the have to address the linking. quality of OdeNet. We have already started to im- We have annotated the entries with a default plement methods that allow the semi-manual cor- confidence of 0.6, with entries that have been man- rection and extension: ually validated given a confidence of 1.0 and those from the extended OMW a confidence of 0.85. • A tool for adding more hyponym relations in case of compounds that shows the user dif- Release ferent synsets for a compound head and asks The wordnet is released through GitHub, as a com- which one to set the relation to. It then adds pressed tar file containing the wordnet itself, its the relation to OdeNet automatically. license (CC-BY-SA 4.0)7 and canonical citation.8 • A tool that shows the user all information for This can be loaded directly into the Wn Python a word and gives her multiple possibilities to library (Goodman and Bond, 2021), which allows correct and extend it. easy use: either on its own or linked to other word- nets through CILI. • A tool that allows to search for a word in PWN and give the corresponding CILI(s) and 6 Discussion and Future Plans allows the user to add the CILI to OdeNet. It has been possible to set up a wordnet for the Ger- Further, it will again be compared to the En- man language in a couple of years. We have bene- glish PWN, such that cases where linked synsets fited both from OpenThesaurus and the knowledge differ in their POS assignment will be further in- in the OMW. In this way, we were able to build vestigated. Another source of problems is multi- a very large resource, with the synsets being cre- word lexemes, where we will have to search for ated manually in the OpenThesaurus project, and better POS tagging methods. therefore very precise. We have used NLP tech- A problem is that the automatic translation niques to add more information, namely POS and method added links of CILIs from multiple Ger- the relations to OMW and CILI. We have used the man synsets, which is undesirable. We need to fo- knowledge in the OMW to supplement relations cus on better linking quality. One approach could 7https://creativecommons.org/licenses/ be to look at the cases where the POS of German by-sa/4.0/ and English differs. 8https://github.com/ hdaSprachtechnologie/odenet/releases/ We started to work on the basic German words, tag/v1.3 adding and correcting information. This will be a valuable information source for simplified lan- rative Interlingual Index. In Proceedings of the guage projects. Global WordNet Conference, volume 2016. Through the Wn library (Goodman and Bond, S. Brants, S. Dipper, P. Eisenberg, S. Hansen- 2021) the resource will be available to NLTK, such Schirra, E. Konig,¨ W. Lezius, C. Rohrer, that it can be used in NLP projects. The open- G. Smith, and H. Uszkoreit. 2004. TIGER: Lin- source idea will help to let researchers working guistic interpretation of a German corpus. Re- on German language further improve and expand search on language and computation, 2(4):597– OdeNet. We ourselves plan to use it in informa- 620. tion extraction in the business domain and senti- P. Braslavski, D. Ustalov, M.l Mukhin, and ment analysis projects. By this approach, we will Y. Kiselev. 2016. Yarn: Spinning-in-progress. add synsets from the business domain and senti- In Proceedings of the 8th Global WordNet Con- ment polarity for many words. ference, GWC 2016, pages 58–65. We will add a user interface to make crowd de- velopment possible, in order to extend and correct Christiane Fellbaum, editor. 1998. WordNet: An OdeNet. Electronic Lexical Database. MIT Press, Cam- We would also like to tag some German texts. bridge, MA. The resource is available on GitHub under an Michael Wayne Goodman and Francis Bond. open-source license: https://github.com/ 2021. Intrinsically interlingual: The Wn Python hdaSprachtechnologie/odenet. library for wordnets. In 11th International Global Wordnet Conference (GWC2021). (to Acknowledgments appear). We would like to thank Michael Wayne Goodman B. Hamp and H. Feldweg. 1997. GermaNet - a for his help in preparing the GitHub release. lexical-semantic net for German. In Proceed- ings of ACL workshop Automatic Information References Extraction and Building of Lexical Semantic Resources for NLP Applications. Madrid. Daniil Alexeyevsky and Anastasiya V. Tem- chenko. 2016. WSD in monolingual dictionar- Hitoshi Isahara, Francis Bond, Kiyotaka Uchi- ies for Russian WordNet. In Proceedings of the moto, Masao Utiyama, and Kyoko Kanzaki. Eighth Global WordNet Conference. Bucharest, 2008. Development of the Japanese wordnet. Romania. In Sixth International conference on Language Resources and Evaluation (LREC 2008). Mar- Steven Bird, Ewan Klein, and Edward Loper. rakech. 2009. Natural language processing with Python: Analyzing text with the natural lan- Daniel Naber. 2004. Openthesaurus: Build- guage toolkit. O’Reilly Media, Inc., Beijing. ing a thesaurus with a web commu- nity. Retrieved January, 3:2005. URL Francis Bond, Luis Morgado Da Costa, and https://www.openthesaurus.de/ Tuan Anh Le. 2015. Imi—a multilingual se- download/openthesaurus.pdf. mantic annotation environment. Proceedings of ACL-IJCNLP 2015 System Demonstrations, Inc. Unicode. 2012. Unicode, inc. license agree- pages 7–12. ment - data files and software. URL http:// www.unicode.org/copyright.html. Francis Bond and Ryan Foster. 2013. Link- Piek Vossen, editor. 1998. Euro WordNet. Kluwer. ing and extending an open multilingual word- net. In Proceedings of the 51st Annual Meet- Wikimedia. 2013. List of wiktionaries. (accessed ing of the Association for Computational Lin- on 2013-09-12). guistics, pages 1352–1362. Sofia. URL http: //aclweb.org/anthology/P13-1133. Francis Bond and Kyonghee Paik. 2012. A survey of wordnets and their licenses. Small, 8(4):5. Francis Bond, Piek Vossen, John P McCrae, and Christiane Fellbaum. 2016. CILI: The Collabo-