Standardizing Wordnets in the ISO Standard LMF: Wordnet-LMF for Germanet
Total Page:16
File Type:pdf, Size:1020Kb
Standardizing Wordnets in the ISO Standard LMF: Wordnet-LMF for GermaNet Verena Henrich Erhard Hinrichs Universtiy of Tübingen Universtiy of Tübingen Department of Linguistics Department of Linguistics verena.henrich@uni- erhard.hinrichs@uni- tuebingen.de tuebingen.de tions, (annotated) text corpora, and dictionar- Abstract ies. It is fair to say that it has become common practice among developers of new linguistic It has been recognized for quite some resources to consult TEI guidelines and ISO time that sustainable data formats play standards in order to develop standard- an important role in the development conformant encoding schemes that serve as an and curation of linguistic resources. interchange format and that can be docu- The purpose of this paper is to show mented and validated by Document Type how GermaNet, the German version of Definitions (DTD) and XML schemata. the Princeton WordNet, can be con- However, for resources that were developed verted to the Lexical Markup Frame- prior to or largely in parallel with the emerging work (LMF), a published ISO standard acceptance of markup languages and of emerg- (ISO-24613) for encoding lexical re- ing encoding standards, the situation is far sources. The conversion builds on more heterogeneous. A wide variety of legacy Wordnet-LMF, which has been pro- formats exists, many of which have persisted posed in the context of the EU due to existing user communities and the KYOTO project as an LMF format for availability of tools that can process only such wordnets. The present paper proposes a idiosyncratic formats. The development of number of crucial modifications and a wordnets for a large number of languages is a set of extensions to Wordnet-LMF that typical example of a type of linguistic re- are needed for conversion of wordnets source, where legacy formats still persist as a in general and for conversion of Ger- de facto standard. WordNet 1.6 is encoded in maNet in particular. the data format of lexicographer files4 that was designed for the English Princeton WordNet 1 Introduction (Fellbaum, 1998). It is a plain-text format for It has been recognized for quite some time that storing wordnet data and allows lexicographers sustainable data formats play an important role to encode lexical and conceptual relations in the development and curation of linguistic among lexical units and synsets by use of spe- resources. As witnessed by the success of the cial-purpose diacritics. There exist numerous guidelines of the Text Encoding Initiative 1 tools that can process WordNet 1.6 lexicogra- (TEI) and of published standards issued by the pher files to extract relevant information or to International Standards Organization 2 (ISO), transform the data into other special-purpose markup languages such as XML3 (short for: formats such as Prolog-fact databases. Even Extensible Markup Language) have become tough still widely used for the reasons just lingua francas for encoding linguistic resources mentioned, the complexity of the format itself of different types, including phonetic transcrip- has a number of undesirable consequences. As Henrich and Hinrichs (2010) have pointed out, 1 See http://www.tei-c.org 2 See http://www.iso.org 4 See http://wordnet.princeton.edu/man/lexnames.5 3 See http://www.w3.org/TR/REC-xml/ WN.html 456 Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 456–464, Beijing, August 2010 the editing of lexicographer files is highly er- wordnets for a steadily increasing ror-prone and time-consuming in actual lexi- number of languages. cographic development. Moreover, format 3. To show how these open issues can be validation of the data as well as development resolved in a customized version of of new tools for data visualization and data Wordnet-LMF suitable for GermaNet. extraction become increasingly difficult since they cannot be based on generic state-of-the- The remainder of this paper is structured as art tools, that are, for example, available for follows: section 2 provides a general introduc- XML-based encodings. tion to GermaNet. Details about the adapted For exactly these reasons, XML-based inter- XML format used for GermaNet up until now change formats have been proposed in recent are provided in section 3. Section 4 introduces years also for wordnets. One of the first, if not the challenge of how to represent a wordnet in the first, example is the XML format for Ger- the Lexical Markup Framework. As one possi- maNet5, a wordnet for German (Lemnitzer and bility, Wordnet-LMF is regarded. Issues that Kunze, 2002; Henrich and Hinrichs, 2010). An arise during the conversion of GermaNet into even more recent development along these Wordnet-LMF lead to a modified version of lines is the specification of Wordnet-LMF (see Wordnet-LMF. Finally, section 5 concludes Soria et al., 2009), an instantiation of the Lexi- with a comparison of the two representation cal Markup Framework6 (LMF, (Francopoulo formats. et al., 2006)) customized for wordnets. Since LMF is an ISO standard (ISO-24613), 2 GermaNet it is a particularly attractive candidate for en- GermaNet is a lexical semantic network that is coding wordnets. Everything else being equal, modeled after the Princeton WordNet for Eng- ISO standards have a high chance of being lish. It partitions the lexical space into a set of adopted by a wide user community and of be- concepts that are interlinked by semantic rela- ing recognized as an interchange format.7 Such tions. A semantic concept is modeled by a syn- agreed-upon interchange formats are a crucial set. A synset is a set of words (called lexical prerequisite for interoperable linguistic re- units) where all the words are taken to have sources in the context of web services and of (almost) the same meaning. Thus a synset is a processing pipelines for linguistic resources. set-representation of the semantic relation of The purpose of this paper is threefold: synonymy, which means that it consists of a 1. To compare and contrast the GermaNet list of lexical units and a paraphrase (repre- XML initially proposed by Lemnitzer sented as a string). The lexical units in turn and Kunze (2002) with the Wordnet- have frames (which specify the syntactic va- LMF. This comparison is instructive lence of the lexical unit) and examples. The list since it reveals two completely differ- of lexical units for a synset is never empty, but ent conceptions of representing seman- any of the other properties may be. tic knowledge at the lexical level. There are two types of semantic relations in GermaNet: conceptual and lexical relations. 2. To point out a number of open issues Conceptual relations hold between two seman- that need to be resolved if Wordnet- tic concepts, i.e. synsets. They include rela- LMF is to be adopted widely among tions such as hyperonymy, part-whole rela- tions, entailment, or causation. Lexical rela- tions hold between two individual lexical units. 5 See http://www.sfs.uni-tuebingen.de/GermaNet/ Antonymy, a pair of opposites, is an example 6 See http://www.lexicalmarkupframework.org of a lexical relation. 7 An anonymous reviewer raised the question why OWL GermaNet covers the three word categories is not a good candidate for encoding wordnets. On this of adjectives, nouns, and verbs, each of which issue, we agree with the assessment of Soria et al. (2009) who point out that “[…] RDF and OWL are conceptual is hierarchically structured in terms of the hy- repositories representation formats that are not designed peronymy relation of synsets. to represent polysemy and store linguistic properties of words and word meanings.” 457 Figure 1. Structure of the XML synset files. semantic relation of synonymy. Further prop- 3 Current GermaNet XML Format erties of a synset (e.g., the word category or a describing paraphrase) and a lexical unit (e.g., The structure of the XML files closely follows a sense number or the orthographical form the internal structure of GermaNet, which (orthForm)) are encoded appropriately. means that the file structure mirrors the under- Figure 1 describes the underlying XML lying relational organization of the data. There structure. Each box in the figure stands for an are two DTDs that jointly describe the XML- element in the XML files, and the properties in encoded GermaNet. One DTD represents all each box (listed underneath the wavy line) rep- synsets with their lexical units and their attrib- resent the attributes of an XML element. This utes (see subsection 3.1). The other DTD rep- means, for example, that a synset element has resents all relations, both conceptual and lexi- the attributes of an id and a category.10 cal relations (see subsection 3.2). Figure 2 shows an example of a synset with The GermaNet XML format was initially two lexical units (lexUnit elements) and a developed by Kunze and Lemnitzer (2002), but paraphrase. The lexUnit elements in turn con- modifications of the GermaNet data itself led tain several attributes and an orthographical to an adopted XML format, which is presented form (the orthForm element), e.g., leuchten here.8 (German verb for: to shine). The first of the 3.1 XML Synset Files two lexical units even has a frame and an ex- ample. The XML files that represent all synsets and lexical units of GermaNet are organized <synset id="s58377" category="verben"> around the three word categories currently in- <lexUnit id="l82207" cluded in GermaNet: nouns, adjectives, and sense="1" verbs (altogether 54 synset files since the se- namedEntity="no" mantic space for each word category is divided artificial="no" styleMarking="no"> into a number of semantic subfields). <orthForm>leuchten</orthForm> The structure of each of these files is illus- <frame>NN</frame> trated in Figure 19.