OdeNet: Compiling a German Wordnet from Other Resources

Melanie Siegel, Darmstadt University of Applied Sciences
Francis Bond, School of Humanities, Nanyang Technological University

Abstract

The Princeton WordNet for the English language has been used worldwide in NLP projects for many years. With the OMW initiative, wordnets for different languages of the world are being linked via identifiers. The parallel development and linking allows new multilingual application perspectives. The development of a wordnet for the German language is also set in this context. To save development time, existing resources were combined and recompiled. The result was then evaluated and improved. In a relatively short time a resource was created that can be used in projects and continuously improved and extended.

1 Introduction

The goal of this initiative is to have a German resource in a multilingual wordnet initiative, where the concepts (synsets) of the languages are linked, and where the resources are under an open-source license, being included in the NLTK language processing package (Bird et al., 2009) via the Wn package (Goodman and Bond, 2021).

Wordnet resources are widely used in NLP projects all over the world. Our idea is to create a German resource that starts from a crowd-developed thesaurus, is open, and is included in the NLTK package. It can then be further developed by researchers while they use the resource for their NLP projects.

For the first version, we combined existing resources: the OpenThesaurus German synonym lexicon,[1] the Open Multilingual Wordnet[2] (OMW: Bond and Foster, 2013), and the English resource, the Princeton WordNet of English (PWN: Fellbaum, 1998). The OMW data (Bond and Foster, 2013) was made by matching multiple linked wordnets to Wiktionary (Wikimedia, 2013) and the Unicode Common Locale Data Repository (Unicode, 2012). The OpenThesaurus is a large resource, generated and updated by the crowd. The PWN is a well-developed resource for English concepts. It includes many relations between the concepts and is linked to resources for multiple languages. The synsets from the OMW data have an estimated accuracy of 90%. We call our new resource “OdeNet”, from “Offenes deutsches Wordnet - open German wordnet”. The first version of OdeNet was automatically compiled. We also describe the efforts to extend and correct the entries.

[1] https://www.openthesaurus.de/
[2] http://compling.hss.ntu.edu.sg/omw/

2 Related Work

In the Open Multilingual Wordnet initiative (Bond and Paik, 2012; Bond et al., 2015), wordnets for several languages were developed and linked.

A manually well-designed wordnet resource for German is GermaNet (Hamp and Feldweg, 1997). GermaNet has been developed for over 20 years and is very stable and precise. The problem is that it is not under an open-source license and is therefore not broadly used in language technology applications. Further, the restricted license makes it impossible to include GermaNet in the Open Multilingual Wordnet initiative. This is the reason why we decided to build up a new resource. In order not to violate the license terms, we do not use anything from GermaNet in OdeNet.

Vossen (1998, p. 11) describes two basic approaches to developing new wordnet resources. In the first case (expand), existing PWN synsets are taken and lexical entries are added for the specific language. In the second case (merge), language-specific resources are built and then linked to the PWN.

An example of expand is the Japanese wordnet (Isahara et al., 2008). It is based on translations of PWN to Japanese. The Japanese wordnet is not fully automatically built: most translations are manually checked. The authors found that there are differences between concept structures in English and Japanese, such that several synsets could not be translated.

The Russian wordnet (Alexeyevsky and Temchenko, 2016) is an example of the merge approach. It is based on a monolingual dictionary and the word definitions in it. The idea is that definitions contain hypernyms of the defined words, often in the form of WORD:HYPERNYM ..., and that this information can be used to set up hierarchical structures in the wordnet.

The approach of the OdeNet initiative is merge. We use an existing synonym dictionary and try to link the synsets to PWN.

Braslavski et al. (2016) describe the creation of a large thesaurus for Russian by means of crowdsourcing. The data is directly collected in a wordnet style, but synsets are not linked to the OMW. The basic data for OdeNet is also generated in a crowdsourcing style, in the OpenThesaurus project. The OpenThesaurus project (Naber, 2004) is a crowd initiative to set up a German synonym lexicon. The version we downloaded in April 2017 has about 120,000 lexical entries in about 36,000 synsets.

2.1 German

The establishment of an ontology for the lexical information of a language requires an in-depth study of ambiguities and multi-word lexemes. In German, compounds are also an issue. There are many examples of lexical ambiguities in German, such as Mutter “mother, nut” or umfahren “bypass, to knock over”. These are in many cases (especially in the case of homonyms) not parallel to English ambiguities, which makes the translation more difficult (for the purpose of linking in OMW). In most cases, ambiguities remain within a syntactic category (POS). The capitalization of German nouns prevents ambiguities between nouns and other syntactic categories, which are frequent in English (e.g. change “money” or “transform”). Morpho-syntactic ambiguities, which occur frequently in German, are not relevant for OdeNet because only lemmata are included. There are some words that can be used both as verbs and adjectives, such as verlegen “to place, to relocate, to publish - embarrassed”. Other POS ambiguities are not relevant for this work because they refer to finer POS distinctions than we can provide at the moment (particles - prepositions, demonstrative pronouns - articles).

In the area of multi-word lexemes we are concerned with support verb constructions, such as Abschied nehmen “to say goodbye” or in Rechnung stellen “to invoice”. In addition, there are idioms such as das geht auf keine Kuhhaut “it beggars description”. Especially for idioms it is difficult to automatically determine the syntactic category.

However, complex nouns are not realized by means of multi-word expressions, as in English, but with compounds. Nominal compounds are very productive in German. They can be very long, like the well-known example Donaudampfschifffahrtskapitänsmütze “Danube steamship captain's cap”. They can constantly be newly created. Automatic extraction and analysis from text data is complex because there are ambiguities here too. In the case of regular German compounds, there is a hyponymy relationship between the head and the compound. For example, Wassereis is an ice that consists of water, while Eiswasser is water that is ice-cold. Different relations can exist to the modifier. The regularity of the hyponymy relationship to the head of German compounds is used to add relations to OdeNet.

3 Process of Creating OdeNet

The first version of OdeNet was created fully automatically by compilation from OpenThesaurus. Subsequently, manual corrections were made in the domains of project management and business reports. German definitions were introduced, relations were corrected and supplemented, and CILI links (links to the multilingual concepts in OMW) were added. Then we worked on the syntactic categories. The main focus was on correcting the POS tags of multi-word lexemes. The next step was the annotation of basic German words.[3] We annotated all lexical entries of this list (except for function words) with

    dc:type="basic_German"

We then added missing entries and corrected synsets manually. Then, we implemented an analysis of German nominal compounds and used this information for the addition of hypernym relations.

    (id="de-39-n",pwn="in-05890249-n"),

[3] as listed in http://pcai056.informatik.uni-leipzig.de/downloads/etc/legacy/Papers/top1000de.txt
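The annotation of basic German words described in Section 3 can be illustrated with a minimal sketch. It is not the authors' implementation: the miniature XML lexicon, the Dublin Core namespace URI, the placement of the attribute on the LexicalEntry element, and the approximation of "not a function word" as "POS is n, v or a" are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed Dublin Core namespace URI; the real lexicon file may declare another one.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Hypothetical miniature lexicon; element and attribute names are assumptions.
SAMPLE = """<Lexicon xmlns:dc="http://purl.org/dc/elements/1.1/" id="odenet-sample">
  <LexicalEntry id="w1"><Lemma writtenForm="Haus" partOfSpeech="n"/></LexicalEntry>
  <LexicalEntry id="w2"><Lemma writtenForm="und" partOfSpeech="c"/></LexicalEntry>
  <LexicalEntry id="w3"><Lemma writtenForm="Stulle" partOfSpeech="n"/></LexicalEntry>
</Lexicon>"""

# Assumption: function words are excluded by keeping only nouns, verbs, adjectives.
CONTENT_POS = {"n", "v", "a"}


def annotate_basic(root, basic_words):
    """Mark content-word entries whose lemma is in basic_words; return the count."""
    count = 0
    for entry in root.iter("LexicalEntry"):
        lemma = entry.find("Lemma")
        if lemma is None:
            continue
        in_list = lemma.get("writtenForm") in basic_words
        is_content_word = lemma.get("partOfSpeech") in CONTENT_POS
        if in_list and is_content_word:
            # Namespaced attribute, serialized as dc:type="basic_German".
            entry.set(f"{{{DC_NS}}}type", "basic_German")
            count += 1
    return count


root = ET.fromstring(SAMPLE)
# "und" is in the frequency list but is skipped because it is a function word.
n = annotate_basic(root, {"Haus", "und"})
print(n)
print(ET.tostring(root, encoding="unicode"))
```

In practice the word set would be read from the frequency list cited in the footnote, and the lexicon would be parsed from the released XML file rather than from an inline string.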
3.1 Linking OpenThesaurus Synsets with the Multilingual Wordnet

The OpenThesaurus data can be downloaded as txt. The text file contains one synset per line, such that the lexical items in each synset are divided by semicolons, e.g.:

    Mobilität;Unabhängigkeit;Beweglichkeit

The target of the transfer process of this synset is to have three lexical entries and a synset entry. The format is described in Bond et al. (2016). We start with the synset: [...]

The synset has a unique synset ID, a link to the [...]tion, and relations to other synsets.

The first task is to find POS information. POS information is not included in the OpenThesaurus [...] TextBlob for POS annotation.[4] OdeNet just uses [...] to these. In the case of multi-word expressions, [...] which is the head word in most cases.

The second task is to find an English synset that can be linked. We translate the words in the synset [...]tionary has the advantage that the translation is [...]ing the other words in the synset. Using the NLTK wordnet API, we then search for synsets with these English words in the PWN and access their synset ID.

We could link 19,845 German synsets to synsets in the PWN, about 55% of the German synsets. Synsets that could not be linked were often multi-word expressions and metaphorical, such as: es kann Gott weiß was passieren; für nichts garantieren können; mit allem rechnen müssen “God knows what can happen; can't guarantee anything; have to count on everything”. The link gives direct access to the definition in PWN, such that we could copy these into OdeNet in dc:description. Thus, we have an English definition as long as German definitions are still missing. The synset relations in PWN link to English synsets. We searched for German synsets with the ili that links to the target of a relation [...]

These are the lexical entries for the words in the [...]

The lexical entries in a synset belong to one sense with the same sense ID. Further senses for lexical entries come from other synsets in the OpenThesaurus. Each lexical entry has a unique word ID, a lemma, and a part of speech (POS).

When a synset gets its ID and link to PWN, all words in the synset are added to a tuple with this ID, as for example:

    ("Beweglichkeit", "odenet-8203-n", "in-05003850-n"),
    ("Beweglichkeit", "odenet-9784-n", "in-04773351-n"),
    ("Beweglichkeit", "odenet-11420-n", "in-04875728-n"),
    ("Backlogged", "odenet-19087-n", "in-05003850-n"),

The sense relations (antonym and pertainym) are again taken from the PWN and linked back to German.

[4] https://textblob.readthedocs.io/en/dev/
[5] https://translate.google.de/

4 Corrections and Extensions

4.1 POS Corrections

In a first evaluation, we found that POS information in OdeNet was only correct in 77% of the cases. With many multi-word expressions in OdeNet, standard procedures for POS assignment do not seem to be sufficient. The basic idea for corrections was that a synset should in principle contain only lexical items of the same syntactic category. Therefore, we extracted all synsets containing lexical items with different POS information and manually corrected them. The evaluation showed an increase of correct POS to 90%. The next idea was to look at endings of lexemes. In German, words ending in -ung, -heit, and -keit are always nouns, while words ending in -lich are adjectives. Further, nouns are capitalized. We used this information to automatically correct further POS assignments. The evaluation showed an increase of correct POS to 93.3%.

4.2 Using German Compounds for Hypernym Relations

Regular German nominal compounds have a hypernym relation to their head, as explained above. A large part of the German compounds are regular and many synsets contain compounds. We decided to make use of these facts in order to add relations to OdeNet.

The idea is to use the regularity of German compounds to automatically generate hypernym relations for OdeNet. For this purpose, we have implemented a compound analysis tool that recognizes the head of the compound. Using this tool, we then analyzed all lexical items that are not multi-word expressions in OdeNet and extracted compounds and their heads.

The basis for the compound analysis is a list of nouns extracted from the TIGER treebank (Brants et al., 2004). If the word to analyze consists of fewer than three letters, it is not a compound. If there are hyphens in the word (such as Lehr-Lern-Forschung, teaching-learning-research), the compound is split at these.

Using the pyphen module,[6] we split the compound into syllables. If the word to analyze consists of only one syllable (as in the case of Stuhl “chair”), it is not a compound. If the word consists of two noun components with one syllable each, as in the case of Haustür “front door”, then both components are searched for in the TIGER lexicon. If they exist as entries, then the result of the analysis is a list with both components, such as ([Haus],[Tür]). If the two syllables do not exist as words, then an attempt is made to delete a linking element from the first syllable and then look it up again. This is e.g. the case with Wirtshaus “pub”, consisting of Wirt + s + Haus. If there are more than two syllables, different combinations of syllables are tested, as in the case of Herstellungskosten “production costs”, until the word can be split into parts that can be found in the noun list:

    ("Herstellungskosten")
    SYLLABLES: ['Her', 'stel', 'lungs', 'kos', 'ten']
    SYLLABLE COMBINATIONS: ['Herstel', 'Stellungs', 'Lungskos', 'Kosten', 'Herstellungs', 'Stellungskos', 'Lungskosten', 'Herstellungskos', 'Stellungskosten']
    COMPONENTS: ['Herstellungs', 'Kosten']

If the analysis with syllables does not lead to a result, we look up all combinations of n-grams in the word, considering linking (Fugen) elements.

We ran our compound analyzer on all lexical entries that are not multi-word entries and could identify 3,630 compounds. If the head has a single sense in OdeNet, we added a hypernym relation to that synset and a hyponym relation backwards. Using synsets instead of lexical entries results in relations not only between single words, but also between groups of words. For example, because of the analysis of the word Butterbrot “sandwich” as consisting of Butter “butter” and Brot “bread”, we added a hyponym relation between the synsets 11770-n ['Brotlaib', 'Wecken', 'Brot'] and 10073-n ['Knifte', 'belegtes Brot', 'Scheibe', 'Butterbrot', 'Schnitte', 'Bemme', 'Stulle'].

There are some exceptions to the hypernym relation of compound and compound head. In some cases, the compound is a synonym of its head, as in the case of Fachterminus “technical term” and Terminus “term”. In these cases, both appear in the same synset and could therefore be automatically excluded.

More complicated are negations in compounds. A Nichtraucher “non-smoker” is not a hyponym of Raucher “smoker”, but an antonym. On the other hand, Nichteisenmetall “non-ferrous metal” is a kind of Metall “metal”. Thus, we manually checked all compounds with negations. Another problem is expressions with Pseudo “pseudo” or Schein “phantom”. Is a pseudo-documentation a documentation? Is a Scheinschwangerschaft “phantom pregnancy” a pregnancy? We decided not to treat these as hyponyms. The compound analysis found 19,115 nominal compounds in OdeNet. In 12,132 cases, the found head was ambiguous between multiple senses and did not get a relation entry. In 1,810 cases, there was no entry for the head in OdeNet, such that these were also ignored.

For all hypernym relations that we added, we added the backward hyponym relation as well. 10,346 relations were added to the OdeNet synsets by this method. OdeNet contains around 35,000 synsets, such that we could add information for 29% of all synsets.

For the evaluation we randomly extracted 100 compounds from OdeNet. The compound analysis found 83 of these. Only one of the 83 analyzed compounds got a wrong analysis: Blockdiagramm (block diagram) was analyzed as ['Block', 'Dia', 'Gramm'] (block - slide - grams). This analysis is syntactically fine, but semantically nonsense. Thus, the precision of the compound analysis is very high (0.99), while the recall is moderate (0.83). For our purpose, extending OdeNet, precision is highly important, while a moderate recall is fine.

The 100 entries had 41 hypernym relation entries that originated from compound analyses. One of the relation entries was wrong: in the case of Fleischsaft “meat juice”, the compound analysis was correct (['Fleisch', 'Saft']), but the hypernym relation led to the synset ['Strom', 'Saft', 'Elektrizität'] (electricity). The German word Saft is ambiguous between juice and electricity, but had only the electricity entry in OdeNet, which is wrong. If there was more than one sense for a word, no hypernym relation was added, to avoid such errors.

Therefore, for 100 synsets that had compounds, we could add 40 good hypernym relations by this method, and one wrong relation, which is a precision of 98%.

[6] https://pyphen.org/

5 Current State of OdeNet

The resulting wordnet resource (v1.3) contains about 120,000 lexical entries in about 36,000 synsets. About 20,000 of these synsets are linked to synsets in the English PWN and then to the multilingual CILI numbers. There are 2,664 antonym relations and 1,053 pertainym relations linking lexical entities. The number of synset relations can be seen in Table 1.

    Synset relation       Number
    hypernym relations     9,907
    hyponym relations     10,101
    member holonyms           84
    part holonyms            647
    member meronyms           74
    part meronyms            282

Table 1: Number of synset relations

To evaluate precision, we randomly chose 90 lexical entries, 30 each with POS “n”, “v” and “a”, and evaluated them manually, see Table 2.

    Tested        Correct   Comment
    POS           93%       many multi-word lexemes
    DEFINITIONS   82%       in cases of errors, POS of the English words are often different
    RELATIONS     61%       in cases of errors, definitions are also wrong

Table 2: Precision of 90 randomly chosen lexical entries

The POS information was correct in 93.3% of the cases. In 5 of the 6 wrong POS assignments, the lemma was a multi-word lexeme, such as nicht unumstößlich “not unalterable”. POS tagging of multi-word lexemes needs more sophisticated procedures than the ones we used here, as standard POS taggers do not tag multi-word expressions. A good part of this problem could be solved with POS corrections in synsets that had lexical items with different POS. The linked English synsets could also give a hint that there might be a problem, as they have POS assigned, which often would be the same for German. A further attempt to improve OdeNet could therefore be to search for cases where the synsets are linked and the POS tags of the English and German synsets do not match.

The German synsets that are linked to English ones contain the definitions from the corresponding English synsets. We checked if the definitions are correct (and therefore the synsets are correctly linked). 55 of the 90 cases had a link to an English synset, and therefore a definition. In 45 of the 55 cases (82%), these definitions were correct.

There were 41 cases where relations on the lexical or the synset level were assigned (34%). 12 of these cases had wrongly assigned relations (39%). In 5 of these cases, the link to PWN was also wrong, and the relation was taken over from the English synset. In one case, the relation from the English synset was wrong, while the relation that was automatically added by the compound analysis was correct. The next correction step will have to address the linking.

We have annotated the entries with a default confidence of 0.6, with entries that have been manually validated given a confidence of 1.0 and those from the extended OMW a confidence of 0.85.

Release

The wordnet is released through GitHub, as a compressed tar file containing the wordnet itself, its license (CC-BY-SA 4.0)[7] and canonical citation.[8] This can be loaded directly into the Wn Python library (Goodman and Bond, 2021), which allows easy use: either on its own or linked to other wordnets through CILI.

[7] https://creativecommons.org/licenses/by-sa/4.0/
[8] https://github.com/hdaSprachtechnologie/odenet/releases/tag/v1.3

6 Discussion and Future Plans

It has been possible to set up a wordnet for the German language in a couple of years. We have benefited both from OpenThesaurus and the knowledge in the OMW. In this way, we were able to build a very large resource, with the synsets being created manually in the OpenThesaurus project, and therefore very precise. We have used NLP techniques to add more information, namely POS and the relations to OMW and CILI. We have used the knowledge in the OMW to supplement relations between the German synsets, parallel to the relations in the other wordnets.

The Open Multilingual Wordnet initiative is a great chance to get highly linked and standardized language resources for multiple languages. The standardization makes it possible to include these resources in NLP packages, such as NLTK or spaCy.

We have shown that it is possible, using NLP techniques, to combine language resources such as the OpenThesaurus and the English PWN to gain a new resource in this standardized multilingual context, with a reasonable precision.

The next step will be to further work on the quality of OdeNet. We have already started to implement methods that allow semi-manual correction and extension:

• A tool for adding more hyponym relations in case of compounds that shows the user different synsets for a compound head and asks which one to set the relation to. It then adds the relation to OdeNet automatically.

• A tool that shows the user all information for a word and gives her multiple possibilities to correct and extend it.

• A tool that allows searching for a word in PWN, gives the corresponding CILI(s), and allows the user to add the CILI to OdeNet.

Further, OdeNet will again be compared to the English PWN, such that cases where linked synsets differ in their POS assignment will be further investigated. Another source of problems is multi-word lexemes, where we will have to search for better POS tagging methods.

A problem is that the automatic translation method added links of CILIs from multiple German synsets, which is undesirable. We need to focus on better linking quality. One approach could be to look at the cases where the POS of German and English differs.

We started to work on the basic German words, adding and correcting information. This will be a valuable information source for simplified language projects.

Through the Wn library (Goodman and Bond, 2021) the resource will be available to NLTK, such that it can be used in NLP projects. The open-source idea will help researchers working on the German language to further improve and expand OdeNet. We ourselves plan to use it in information extraction in the business domain and in sentiment analysis projects. In this way, we will add synsets from the business domain and sentiment polarity for many words.

We will add a user interface to make crowd development possible, in order to extend and correct OdeNet.

We would also like to tag some German texts.

The resource is available on GitHub under an open-source license: https://github.com/hdaSprachtechnologie/odenet.

Acknowledgments

We would like to thank Michael Wayne Goodman for his help in preparing the GitHub release.

References

Daniil Alexeyevsky and Anastasiya V. Temchenko. 2016. WSD in monolingual dictionaries for Russian WordNet. In Proceedings of the Eighth Global WordNet Conference. Bucharest, Romania.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural language processing with Python: Analyzing text with the natural language toolkit. O'Reilly Media, Inc., Beijing.

Francis Bond, Luis Morgado Da Costa, and Tuan Anh Le. 2015. Imi - a multilingual semantic annotation environment. Proceedings of ACL-IJCNLP 2015 System Demonstrations, pages 7–12.

Francis Bond and Ryan Foster. 2013. Linking and extending an open multilingual wordnet. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 1352–1362. Sofia. URL http://aclweb.org/anthology/P13-1133.

Francis Bond and Kyonghee Paik. 2012. A survey of wordnets and their licenses. Small, 8(4):5.

Francis Bond, Piek Vossen, John P McCrae, and Christiane Fellbaum. 2016. CILI: The Collaborative Interlingual Index. In Proceedings of the Global WordNet Conference, volume 2016.

S. Brants, S. Dipper, P. Eisenberg, S. Hansen-Schirra, E. König, W. Lezius, C. Rohrer, G. Smith, and H. Uszkoreit. 2004. TIGER: Linguistic interpretation of a German corpus. Research on language and computation, 2(4):597–620.

P. Braslavski, D. Ustalov, M. Mukhin, and Y. Kiselev. 2016. Yarn: Spinning-in-progress. In Proceedings of the 8th Global WordNet Conference, GWC 2016, pages 58–65.

Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

Michael Wayne Goodman and Francis Bond. 2021. Intrinsically interlingual: The Wn Python library for wordnets. In 11th International Global Wordnet Conference (GWC2021). (to appear).

B. Hamp and H. Feldweg. 1997. GermaNet - a lexical-semantic net for German. In Proceedings of ACL workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications. Madrid.

Hitoshi Isahara, Francis Bond, Kiyotaka Uchimoto, Masao Utiyama, and Kyoko Kanzaki. 2008. Development of the Japanese wordnet. In Sixth International conference on Language Resources and Evaluation (LREC 2008). Marrakech.

Daniel Naber. 2004. Openthesaurus: Building a thesaurus with a web community. Retrieved January, 3:2005. URL https://www.openthesaurus.de/download/openthesaurus.pdf.

Inc. Unicode. 2012. Unicode, inc. license agreement - data files and software. URL http://www.unicode.org/copyright.html.

Piek Vossen, editor. 1998. Euro WordNet. Kluwer.

Wikimedia. 2013. List of wiktionaries. (accessed on 2013-09-12).