A Large, Interlinked, Syntactically-Rich Lexical Resource for Ontologies

Semantic Web 0 (0) 1
IOS Press

lemonUby - a large, interlinked, syntactically-rich lexical resource for ontologies

Judith Eckle-Kohler a,*, John Philip McCrae b and Christian Chiarcos c

a Ubiquitous Knowledge Processing (UKP) Lab, Department of Computer Science, Technische Universität Darmstadt and Information Center for Education, German Institute for International Educational Research, Germany, http://www.ukp.tu-darmstadt.de
b Cognitive Interaction Technology (CITEC), Semantic Computing Group, Universität Bielefeld, Germany, http://www.sc.cit-ec.uni-bielefeld.de
c Applied Computational Linguistics (ACoLi), Department of Computer Science and Mathematics, Goethe-University Frankfurt am Main, Germany, http://acoli.cs.uni-frankfurt.de

* Corresponding author, http://www.ukp.tu-darmstadt.de

1570-0844/0-1900/$27.50 © 0 – IOS Press and the authors. All rights reserved

Abstract. We introduce lemonUby, a new lexical resource integrated in the Semantic Web which is the result of converting data extracted from the existing large-scale linked lexical resource UBY to the lemon lexicon model. The following data from UBY were converted: WordNet, FrameNet, VerbNet, English and German Wiktionary, the English and German entries of OmegaWiki, as well as links between pairs of these lexicons at the word sense level (links between VerbNet and FrameNet, VerbNet and WordNet, WordNet and FrameNet, WordNet and Wiktionary, WordNet and German OmegaWiki). We linked lemonUby to other lexical resources and linguistic terminology repositories in the Linguistic Linked Open Data cloud and outline possible applications of this new dataset.

Keywords: Lexicon model, lemon, UBY-LMF, UBY, OLiA, ISOcat, WordNet, VerbNet, FrameNet, Wiktionary, OmegaWiki

1. Introduction

Recently, the language resource community has begun to explore the opportunities offered by the Semantic Web, led by the formation of the Linguistic Linked Open Data (LLOD) cloud and an increasing interest in making use of Linked Open Data principles in the context of Natural Language Processing (NLP) and Linguistics [7]. The use of RDF supports data integration and offers a large body of tools for accessing this data. Furthermore, the linked data approach gives rise to novel research questions in the context of language resources and their application.

For lexical resources, data integration has been a focus of interest for many years, resulting in numerous mappings and linkings of lexica, as well as standards for representing lexical resources, such as the ISO 24613:2008 Lexical Markup Framework (LMF) [13]. In this context, the LLOD cloud can be considered as a new data integration platform, enabling linkings not only between lexical resources, but also between lexical resources and other language resources.

We extend the LLOD cloud by a new lexical resource called lemonUby¹, which is the result of converting data extracted from the existing large-scale linked lexical resource UBY [14]² to the lemon lexicon model. UBY has been developed independently from Semantic Web technology. It is LMF based, and a subset of the LMF-compliant UBY lexicons is pairwise linked at the word sense level. The lemon lexicon model has been developed for lexical resource integration on the Semantic Web [18]. This lexicon model serves as a common interchange format for lexical resources on the Semantic Web and has been designed to represent and share lexical resources that are linked to ontologies, i.e., ontology lexica. Making use of a lexicon interchange format, such as lemon, is not only important for data integration, but also for the reuse of lexicons.

While many lexical resources have already been included in the LLOD cloud, e.g., [3,19,23,8,20], the LLOD cloud is still missing a large-scale lexical resource rich in lexical information on verbs, including aspects such as syntactic behaviour and semantic roles of a verb's arguments. Such information is crucial for lexicalizing relational knowledge, e.g., the relation like(Experiencer, Theme) can be lexicalized syntactically with a verb as in "NP likes NP".

The new resource lemonUby addresses this gap: Along with resources for word-level semantics (WordNet [12], English and German Wiktionary,³ and the English and German entries of OmegaWiki⁴), we converted two syntactically rich resources from UBY to the lemon format: FrameNet [2] and VerbNet [15]. For further data integration, we established links between lemonUby and other language resources in the LLOD cloud.

¹ http://www.lemon-model.net/lexica/uby/
² http://www.ukp.tu-darmstadt.de/uby/
³ http://www.wiktionary.org
⁴ http://www.omegawiki.org

2. Representing lexical-semantic resources as Linked Data: Lemon

There has been significant work towards integrating lexical resources using RDF and Semantic Web principles [6], and many resources are already available as Linked Data. Yet, representing lexical resources in RDF does not per se make them semantically interoperable. Consider, for instance, existing conversions of WordNet and FrameNet [26,22], where a simple mapping to RDF is provided, and augmented with OWL semantics so that reasoning could be applied to the structure of the resource. However, the formats chosen for the RDF versions of WordNet and FrameNet are specific to the underlying data models of WordNet and FrameNet. Although these lexicons are complementary resources [1], it is difficult (i) to link them in this form on the Semantic Web, and (ii) to use them as interchangeable modules in NLP applications.

In order to overcome this difficulty, the lemon model [18] was proposed as a common interchange format for lexical resources on the Semantic Web. lemon has its historical roots in LMF and thus allows easy conversion from LMF-like, non-linked data resources. It links to data categories in annotation terminology repositories, and most of all, it realises a separation of lexicon and ontology layers, so that lemon lexica can be linked to existing ontologies in the linked data cloud.

This core model is illustrated in Fig. 1, which defines the basic elements used by all lexica published as linked data. In addition to this, there are a number of modules used to model linguistic description, syntax, morphology and relationships between lexica.⁵

[Figure: diagram of the lemon core. Elements: Lexicon (language:String), LexicalEntry (Word, Phrase, Part), LexicalForm (writtenRep:String), LexicalSense, Ontology. Relations: entry, form, canonicalForm, otherForm, abstractForm, sense, isSenseOf, reference, isReferenceOf, prefRef, altRef, hiddenRef.]
Fig. 1. The core of the lemon model

lemon has been used as a basis for integrating the data of the English Wiktionary with the RDF version of WordNet [19]. lemon's similarity to the WordNet model made this conversion straightforward, with only the need for a slight change in modelling to accommodate inflectional variants of lexical entries.

⁵ More detail of the model and descriptions of the modules can be found at http://lemon-model.net

3. Large-scale integration of lexical-semantic resources: UBY and UBY-LMF

UBY is both a network of interlinked lexical-semantic resources and a project on continuous integration and linking of lexical resources for NLP applications. It is motivated by the observation that an essential requirement in NLP is the availability of a wide range of lexical resources that can be used for many different NLP tasks. In a continuous process, such resources are integrated into UBY by means of (i) making them interoperable and (ii) linking them to other resources in UBY at the sense level.

In UBY, interoperability is achieved by standardizing lexical resources according to UBY-LMF [9,10], a lexicon model which is an instantiation of LMF, specifically designed for NLP. The lexicon model UBY-LMF has been developed to fully cover a wide range of heterogeneous lexical resources without information loss, which resulted in a fine-grained model of lexical information types (documented by data categories from ISOcat,⁶ the implementation of the ISO 12620:2009 Data Category Registry) and was accompanied by an extension of the ISO standard LMF by a few elements. The extensibility of UBY-LMF was a primary design principle in order to enable the integration of further (in particular automatically acquired) lexical resources.

The mapping from UBY-LMF to lemon is motivated by an increase in interoperability with the Semantic Web and its resources, thereby making it available to a new group of potential users and novel applications. Beyond this, mapping UBY-LMF to lemon is an in- [...] maintained data categories in ISOcat.⁷ As the mapping of UBY-LMF to lemon preserves this linking, lemonUby is linked to ISOcat as well. The content of ISOcat is also available as Linked Data [27], and therefore provides a possible and direct way to interconnect lemonUby with other LLOD resources at the level of linguistic data categories.

However, ISOcat is not a formal ontology, but only a semistructured collection of terms, and while it serves as a repository of definitions, it does not provide a formal data model that can be applied to a resource: ISOcat contains doublets created by different data providers, and such superficially similar categories may actually have incompatible definitions, e.g., gerundive [DC-1294] is an "adjective formed from a verb" (excluding verbal nouns), whereas gerundive [DC-2243] is a "non-finite form (...) other than the infinitive" (including verbal nouns). Hierarchical relations between ISOcat terms are possible, but not obligatory, and when compared with a full-fledged ontology, ISOcat terms that represent superconcepts for a bundle of features (e.g., ActiveVoice [DC-3064] dcif:isA VoiceProperty [DC-3551]) do not distinguish relational and categorial aspects: VoiceProperty could be either a property that assigns ActiveVoice to a particular unit of annotation, or a concept that defines the range of such a relation.
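The two readings of VoiceProperty can be made concrete with a small sketch. The following Python fragment is only an illustration under stated assumptions, not part of lemonUby or ISOcat: triples are plain string tuples, the unit identifier `token_1` and the relation name `hasVoice` are hypothetical, and `rdf:type` and `rdfs:range` are used in their usual RDF Schema sense.

```python
from typing import List, Set, Tuple

# A triple is modelled as (subject, predicate, object) using plain strings.
# Only ActiveVoice and VoiceProperty come from the text; "hasVoice" and
# "token_1" are hypothetical names introduced for this illustration.
Triple = Tuple[str, str, str]


def relational_reading(unit: str) -> List[Triple]:
    """Reading 1: VoiceProperty is itself the property that assigns
    ActiveVoice to a unit of annotation."""
    return [(unit, "VoiceProperty", "ActiveVoice")]


def categorial_reading(unit: str, relation: str = "hasVoice") -> List[Triple]:
    """Reading 2: VoiceProperty is a concept that defines the range of a
    separate relation (here the hypothetical 'hasVoice')."""
    return [
        ("ActiveVoice", "rdf:type", "VoiceProperty"),
        (relation, "rdfs:range", "VoiceProperty"),
        (unit, relation, "ActiveVoice"),
    ]


def predicates(triples: List[Triple]) -> Set[str]:
    """Collect the predicates used in a list of triples."""
    return {predicate for _, predicate, _ in triples}
```

In the first reading VoiceProperty appears in predicate position, in the second only in object position. A formal ontology has to commit to one of the two, whereas the ISOcat statement `ActiveVoice dcif:isA VoiceProperty` is compatible with both.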
