Standardizing Lexical-Semantic Resources – Fleshing out the Abstract Standard LMF

Judith Eckle-Kohler‡, Iryna Gurevych†‡

‡ Ubiquitous Knowledge Processing Lab (UKP-TUDA), Department of Computer Science, Technische Universität Darmstadt
† Ubiquitous Knowledge Processing Lab (UKP-DIPF), German Institute for Educational Research and Educational Information
www.ukp.tu-darmstadt.de

Proceedings of KONVENS 2012 (SFLR 2012 workshop), Vienna, September 21, 2012

Abstract

This paper describes the application of the Lexical Markup Framework (LMF) for standardizing lexical-semantic resources in the context of NLP. More specifically, we highlight the question of how lexical-semantic resources can be made semantically interoperable by means of LMF and ISOCat. The LMF model UBY-LMF, an instantiation of LMF specifically for NLP, serves as an example to illustrate the path towards semantic interoperability of lexical resources.

1 Introduction

Lexical-semantic resources (LSRs) are used in major NLP tasks, such as word sense disambiguation, semantic role labeling and information extraction. In recent years, reusing and merging LSRs has gained significance, mainly because LSRs are expensive to build. Standardization of LSRs plays an important role in this context, because it facilitates integration and merging of LSRs and makes reuse of LSRs easy. NLP systems that are built according to standards can simply plug in standardized LSRs and are thus able to easily switch between different standardized LSRs. In other words, standardizing LSRs makes them interoperable.

Two aspects of interoperability are to be distinguished in NLP: syntactic interoperability and semantic interoperability (Ide and Pustejovsky, 2011). While NLP systems can perform the same kind of processing with syntactically interoperable LSRs, there is no guarantee that the results can be interpreted the same way. Two syntactically interoperable LSRs might use the same term to denote different meanings. Semantically interoperable LSRs, on the other hand, use terms that share a common definition of their meaning. Consequently, NLP systems that switch between semantically interoperable LSRs can perform the same kind of processing and the results produced can still be interpreted the same way.

In this paper, we focus on the question of how to achieve semantic interoperability by means of the ISO 24613:2008 LMF (Francopoulo et al., 2006) and ISOCat [1]. The comprehensive LMF lexicon model UBY-LMF (Eckle-Kohler et al., 2012) serves as an example to show how the abstract LMF standard is to be fleshed out and instantiated in order to make LSRs semantically interoperable for NLP purposes.

UBY-LMF covers very heterogeneous LSRs in two languages, English and German, and has been used to standardize a range of LSRs resulting in the large-scale LSR UBY (Gurevych et al., 2012), see http://www.ukp.tu-darmstadt.de/uby/. UBY currently contains ten resources in two languages: English WordNet (Fellbaum, 1998), Wiktionary [2], Wikipedia [3], FrameNet (Baker et al., 1998), and VerbNet (Kipper et al., 2008); German Wiktionary, Wikipedia, GermaNet (Kunze and Lemnitzer, 2002) and IMSlex-Subcat (Eckle-Kohler, 1999); and the English and German entries of OmegaWiki [4].

[1] http://www.isocat.org/
[2] http://www.wiktionary.org/
[3] http://www.wikipedia.org/
[4] http://www.omegawiki.org/

2 LMF and semantic interoperability

First, we give an overview of the LMF standard and briefly describe how to use it. We put a special focus on the question of how LSRs can be made semantically interoperable by means of LMF.

LMF – an abstract standard. LMF defines a meta-model of lexical resources, covering both NLP lexicons and machine readable dictionaries. The standard specifies this meta-model in the Unified Modeling Language (UML) by providing a set of UML diagrams. UML packages are used to organize the meta-model, and each diagram given in the standard corresponds to a UML package. LMF defines a mandatory core package and a number of extension packages for different types of resources, e.g., morphological resources or wordnets. The core package models a lexicon in the traditional headword-based fashion, i.e., organized by lexical entries. Each lexical entry is defined as the pairing of one to many forms and zero to many senses.

Instantiating LMF. The abstract meta-model given by the LMF standard is not immediately usable as a format for encoding (i.e., converting) an existing LSR (Tokunaga et al., 2009). It has to be instantiated first, i.e., a full-fledged lexicon model has to be developed by choosing LMF classes and by specifying suitable attributes for these LMF classes.

According to the standard, developing a lexicon model involves

1. selecting classes from the UML packages,

2. defining attributes for these classes and

3. linking the attributes and other linguistic terms introduced (e.g., attribute values) to standardized descriptions of their meaning.

Selecting a combination of LMF classes from the LMF core package and from the extension packages establishes the structure of a lexicon model. While the LMF core package models a lexicon in terms of lexical entries, the LMF extensions provide classes for different types of lexicon organization, e.g., covering the synset-based organization of wordnets or the semantic frame-based organization of FrameNet.

Fixing the structure of a lexicon model by choosing a set of classes contributes to syntactic interoperability of LSRs, as it fixes the high-level organization of lexical knowledge in an LSR, e.g., whether synonymy is encoded by grouping senses into synsets (using the Synset class) or by specifying sense relations (using the SenseRelation class), which connect synonymous senses.

Defining attributes for the LMF classes and specifying the attribute values is far more challenging than choosing from a given set of classes, because the standard gives only a few examples of attributes and leaves the specification of attributes to the user in order to allow maximum flexibility.

Finally, the attributes and values have to be linked to a description of their meaning in an ISO 12620:2009 compliant Data Category Registry (DCR), see Broeder et al. (2010). ISOCat is the implementation of the ISO 12620:2009 DCR providing descriptions of terms used in language resources.

These descriptions in ISOCat are standardized, i.e., they comply with a predefined format and provide some mandatory information types, including a unique administrative identifier (e.g., partOfSpeech) and a unique and persistent identifier (PID, e.g., http://www.isocat.org/datcat/DC-396) which can be used to link to the descriptions. The standardized descriptions of terms are called Data Categories (DCs).

Semantic Interoperability. Connecting the linguistic terms used for attributes and their values in a lexicon model with their meaning defined externally in ISOCat contributes to semantic interoperability of LSRs (see also Windhouwer and Wright (2012)). The definitions of DCs in ISOCat constitute an interlingua that can be used to map idiosyncratically used linguistic terms to a set of reference definitions (Chiarcos, 2010). Different LSRs that share a common definition of their linguistic vocabulary are said to be semantically interoperable (Ide and Pustejovsky, 2010).

Consider as an example the LexicalEntry class of two different lexicon models A and B. Lexicon model A could have an attribute partOfSpeech (POS), while lexicon model B could have an attribute pos. Linking both attributes to the meaning "A category assigned to a word based on its grammatical and semantic properties." given in ISOCat (http://www.isocat.org/datcat/DC-396) makes the two lexicon models semantically interoperable with respect to the POS attribute.

A human can look up the meaning of a term occurring in a lexicon model by following the link to the ISOCat DC and consulting its description in ISOCat. Linking the attributes and their values to ISOCat DCs results in a so-called Data Category Selection.

It is important to stress that the notion of "semantic interoperability" in the context of LMF has a limited scope: it only refers to the meaning of the linguistic vocabulary used in an LMF lexicon model, not to the meaning of the lexemes listed in an LSR.

3 UBY-LMF – Instantiating LMF

Since only a fleshed-out LMF lexicon model, i.e., an instantiation of the LMF standard, can be used for actually standardizing LSRs, LMF-compliant LSRs are not necessarily interoperable, neither syntactically nor semantically.

Therefore it is important to develop a single, comprehensive instantiation of LMF, which can immediately be used for standardizing LSRs.

[...] 2012) and (Gurevych et al., 2012) for detailed information on UBY-LMF and the corresponding large-scale LSR UBY.

UBY-LMF is represented by a DTD which can be used to automatically convert any given resource into the corresponding XML format. Converters for ten LSRs to UBY-LMF format are publicly available on Google Code, see http://code.google.com/p/uby/.

3.2 UBY-LMF attributes

In UBY-LMF, the definition of attributes for the LMF classes was guided by two requirements that we identified as important in the context of NLP: (i) comprehensiveness and (ii) extensibility (Eckle-Kohler et al., 2012).

Comprehensiveness implies that the model should be able to represent all the lexical information present in a wide range of LSRs, because NLP applications usually require different types of lexical knowledge and it is difficult to decide in advance which type of lexical information will be useful for a particular NLP application.

Extensibility is also crucial in the NLP domain, because UBY-LMF should be applicable across languages (Gurevych et al., 2012; Eckle-Kohler and Gurevych, 2012) and should also be able to adopt automatically extracted lexical-semantic knowledge [...]
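
To make the LexicalEntry example from Section 2 (lexicon models A and B with the attributes partOfSpeech and pos) concrete, the following Python fragment is a minimal sketch of a Data Category Selection. It assumes that a selection can be represented as a mapping from (class, attribute) pairs to ISOCat PIDs. The names LexiconModel, dc_selection and shared_meaning are hypothetical and are not part of LMF, ISOCat or UBY-LMF; only the PID is taken from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# PID quoted in the text for the part-of-speech Data Category.
POS_DC = "http://www.isocat.org/datcat/DC-396"


@dataclass
class LexiconModel:
    """Hypothetical stand-in for an LMF lexicon model.

    dc_selection maps (LMF class, attribute name) to the PID of the
    ISOCat Data Category that defines the attribute's meaning.
    """
    name: str
    dc_selection: Dict[Tuple[str, str], str] = field(default_factory=dict)


def shared_meaning(model_a, attr_a, model_b, attr_b, lmf_class="LexicalEntry"):
    """Return True if both attributes are linked to the same Data Category."""
    pid_a = model_a.dc_selection.get((lmf_class, attr_a))
    pid_b = model_b.dc_selection.get((lmf_class, attr_b))
    # An attribute without a link has no externally defined meaning,
    # so it cannot be shown to agree with anything.
    return pid_a is not None and pid_a == pid_b


# The two models name the attribute differently but link it to the same DC.
model_a = LexiconModel("A", {("LexicalEntry", "partOfSpeech"): POS_DC})
model_b = LexiconModel("B", {("LexicalEntry", "pos"): POS_DC})

print(shared_meaning(model_a, "partOfSpeech", model_b, "pos"))  # True
```

In this sketch, the comparison uses only the PIDs and never the attribute names, which mirrors the limited scope noted above: the check concerns the linguistic vocabulary of the models, not the lexemes they list.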
