Computational Linguistics in Practice: WebLicht and GermaNet
Two Projects at the Department of Linguistics at the University of Tübingen


Verena Henrich
University of Tübingen, Department of Linguistics
November 30, 2010

Who I am: Verena Henrich
• 2009: Master in Computer Science at h_da
  - Lecture about Natural Language Processing (NLP)
  - Two semesters in Iceland
  - Master's thesis on an NLP topic
• Since 2009: Researcher at the Department of General and Computational Linguistics at the University of Tübingen
  - First task: development of an editor for the German wordnet (GermaNet)
  - Further project that I will introduce today: WebLicht
  - PhD plans: word sense disambiguation with GermaNet

GermaNet – A German Wordnet

GermaNet: A German Wordnet
• GermaNet is a lexical resource covering the German base vocabulary
• It is a lexical semantic network
• It belongs to the family of wordnets modeled after the Princeton WordNet for English
• GermaNet is divided into 3 word categories:
  - Adjectives
  - Nouns
  - Verbs
• Words are ordered according to their meaning

GermaNet: Lexical Units
• Word meanings are represented by lexical units
• A lexical unit specifies one form and one meaning (i.e. reading) of a word
• Examples:
  - “Bank” has 2 readings
    - Reading 1: [Bank, {Sitzbank}] (bench)
    - Reading 2: [Bank, {Geldinstitut}] (financial institution)
  - “Leiter” has 3 readings
    - Reading 1: [Leiter, {Steiggerät}] (ladder)
    - Reading 2: [Leiter, {Verantwortlicher, Anführer}] (leader)
    - Reading 3: [Leiter, {stromleitender Stoff}] (electric conductor)
• Lexical units are grouped into semantic concepts according to their meaning

GermaNet: Synsets
• Semantic concepts are represented by synsets
• A synset is a set of (near-)synonymous words

GermaNet: Synset Examples
• Verb examples:
  [rennen, laufen, sprinten, spurten] (to run)
  [klingeln, bimmeln, schellen, gongen, läuten] (to ring)
• Adjective examples:
  [stark, kräftig] (strong/powerful)
  [eckig, kantig, zackig] (square-shaped/jagged)
  [ausgeprägt, hervorstechend, markant] (distinctive)
• Noun examples:
  [Witz, Scherz, Jux, Ulk, Spaß, Schabernack, Gag] (joke)
  [Substantiv, Hauptwort, Nomen] (noun)
  [Textil, Gewebe, Webware, Stoff] (cloth/material)

GermaNet: Synsets
• Each lexical unit belongs to exactly one synset
• A literal, however, can belong to many synsets
  [Chip, Kartoffelchip] (potato crisp)
  [Chip, Mikrochip] (computer chip)
  [Kohle, Geld, Kies, Knete, Moneten] (money)
  [Kohle, Kohlegestein] (coal)
  [Golf, VW Golf] (car)
  [Golf] (Küstengebiet) (gulf)
  [Golf, Golfspiel] (golf)
  [gehen, laufen] (to walk)
  [gehen, funktionieren] (to work)
• A synset has an average of 1.37 lexical units

GermaNet: Relations
• In GermaNet, there are two types of semantic relations
  - Lexical relations are established between lexical units
    - Synonymy
    - Antonymy
    - Pertainymy
  - Conceptual relations are established between synsets
    - Hypernymy and hyponymy
    - Part-whole relations (meronymy and holonymy)
    - Entailment
    - Causation
    - Association

GermaNet: Lexical Relations
• Lexical relations hold between two lexical units
  - Synonymy
  - Antonymy
  - Pertainymy

GermaNet: Conceptual Relations
• Conceptual relations hold between two synsets
  - Hypernymy and hyponymy
  - Part-whole relations (meronymy and holonymy)
  - Entailment
  - Causation
  - Association
• GermaNet is hierarchically structured in terms of the hypernymy-hyponymy relation of synsets
• Part-whole relations are conceptual relations

GermaNet: Readings for “unterhalten”
1. (v) [unterhalten, pflegen] (to cultivate) -- über etwas verfügen
   • [unterhalten] -- NN.AN.Pp -- Sie unterhalten gute Beziehungen zu ihren Nachbarn.
   • [pflegen]
   Hypernyms: [haben, besitzen]
2. (v) [unterhalten] (to keep oneself amused) -- sich auf angenehme Weise die Zeit vertreiben
   • [unterhalten] -- NN.AR.BM -- Sie hat sich blendend unterhalten. (NN.AR.BM)
   Hypernyms: [vergnügen]
3. (v) [unterhalten] (to entertain) -- für Zerstreuung/Zeitvertreib sorgen
   • [unterhalten] -- NN.AN.Bs -- Er unterhielt seine Gäste mit Musik. (NN.AN.Bs)
   Hypernyms: [vergnügen, amüsieren]
4. (v) [unterhalten] (to maintain sth.) -- etw. halten/einrichten/betreiben und dafür aufkommen
   • [unterhalten] -- NN.AN -- Er unterhält einen Reitstall. (NN.AN)
   Hypernyms: [führen]
   Hyponyms: [instandhalten] [bewirtschaften]
5. (v) [unterhalten] (to talk) -- ein Gespräch führen
   • [unterhalten] -- NN.AR.Pp.Bo -- Er unterhielt sich den ganzen Abend über seine Prüfungen. (NN.AR.Pp) -- Er unterhielt sich nur mit mir. (NN.AR.Bo)
   Hypernyms: [austauschen]
   Hyponyms: [klönen] [labern] [palavern] [philosophieren] [plauschen] [plaudern, schwatzen, schnattern]
6. (v) [unterhalten, alimentieren] (to support sb.) -- für jmds. Lebensunterhalt aufkommen
   • [unterhalten] -- NN.AN -- Er unterhält eine sieben-köpfige Familie. (NN.AN)
   • [alimentieren]
   Hypernyms: [ernähren, nähren]

GermaNet: Purpose
• GermaNet development started in 1997 at the Department of Linguistics at the University of Tübingen
• Developed to serve as an electronic lexicographic reference database for German word senses
• Primarily intended to serve as a resource for word sense disambiguation, which is crucial for natural language applications such as
  - Information retrieval
  - Construction of language technology tools
  - Annotation of corpora
  - Machine translation

GermaNet: Size
• Number of lexical units: 84,600
  - Adjectives: 8,100 lexical units
  - Nouns: 64,100 lexical units
  - Verbs: 12,300 lexical units
• Number of synsets: 61,700
  - Adjectives: 5,600 synsets
  - Nouns: 46,900 synsets
  - Verbs: 9,200 synsets
• 84,600 literals (1.10 readings per literal)
• Lexical relations: 3,500
• Conceptual relations: 73,700

Tools for GermaNet
• Application Programming Interfaces
  - Java API
  - Perl API
• Web application: http://weblicht.sfs.uni-tuebingen.de:8080/gnet/
• Web service: as part of WebLicht
• GermaNet-Explorer: visualisation tool (developed at the University of Dortmund)
• GernEdiT: GermaNet editing tool

GermaNet: Data Formats
• Former:
  - Lexicographer files: complex legacy format
• Now:
  - Relational database
• Export formats:
  - Proprietary XML format: distribution format
  - Lexical Markup Framework: XML, ISO standard
  - Princeton WordNet format

GermaNet: Lexicographer Files
(*** Nüsse ***)
{Nuss, Nuß*o, Nusskern, ?festes_Nahrungsmittel,@ nomen.Pflanze:Nuss,@ ('der essbare Kern einer Nuss')}
{Haselnuss, Haselnuß*o, Haselnusskern, Haselnußkern*o, Nuss,@ nomen.Pflanze:Haselstrauch,#}
{Kokosnuss, Kokosnuß*o, Nuss,@ nomen.Pflanze:Kokospalme,#}
{Betelnuss, Betelnuß*o, Nuss,@ Genussmittel,@}
{Erdnuss, Erdnuß*o, Erdnusskern, Erdnußkern*o, Nuss,@ nomen.Pflanze:Erdnusspflanze,#}
{Cashewkern, Cashewnuss, Cashewnuß*o, Nuss,@ nomen.Pflanze:Acajubaum,#}
...
• Lexicographer files have three main shortcomings:
  1. No visualization → difficult to insert new items
  2. Complex data format → syntax errors and semantic inconsistencies
  3. No versioning → impossible to track changes

GernEdiT – The GermaNet Editing Tool
• Developed to overcome the shortcomings of the lexicographer files
  1. No visualization → graphical tool (search and browse GermaNet)
  2. Complex data format → user-friendly tool (with internal consistency checks)
  3. No versioning → editing history

GermaNet: Links & References
• GermaNet homepage: http://www.sfs.uni-tuebingen.de/GermaNet/
• GermaNet web application: http://weblicht.sfs.uni-tuebingen.de:8080/gnet/
• Verena Henrich and Erhard Hinrichs: GernEdiT - The GermaNet Editing Tool. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC 2010), Valletta, Malta, 2010. http://www.lrec-conf.org/proceedings/lrec2010/pdf/264_Paper.pdf
• Verena Henrich and Erhard Hinrichs: GernEdiT: A Graphical Tool for GermaNet Development. In Proceedings of the ACL 2010 System Demonstrations, Uppsala, Sweden, 2010. http://www.aclweb.org/anthology/P10-4004
• Fellbaum, C. (ed.): WordNet – An Electronic Lexical Database. The MIT Press, 1998.
• Princeton WordNet homepage: http://wordnet.princeton.edu/
• Princeton WordNet web application: http://wordnetweb.princeton.edu/perl/webwn

WebLicht – Web-Based Linguistic Chaining Tool

WebLicht: Motivation
• Many linguistic resources (corpora, dictionaries, …) and tools (tokenizer, tagger, parser, …) are available
• Most of them are implemented to run on local machines
  - This can be inconvenient, time-consuming, and error-prone because a user has to install all necessary tools
• Requirement: avoid “download-first” paradigm
  - One possible solution: make tools and resources available on the web

WebLicht:
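The GermaNet data model described in the slides above (a lexical unit belongs to exactly one synset, while a literal such as “Golf” or “Kohle” can occur in several synsets) can be illustrated with a small sketch. This is not the GermaNet Java or Perl API: the class and function names (`LexicalUnit`, `Synset`, `build_synsets`, `readings_by_literal`, `average_units_per_synset`) are hypothetical and chosen only for illustration. The example synsets and the size figures are taken from the slides.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LexicalUnit:
    """One form paired with one reading; it belongs to exactly one synset."""
    literal: str
    synset_id: int


@dataclass
class Synset:
    """A set of (near-)synonymous lexical units sharing one meaning."""
    synset_id: int
    gloss: str
    units: list = field(default_factory=list)


def build_synsets(raw):
    """Turn (gloss, [literals]) pairs into Synset objects (hypothetical helper)."""
    synsets = []
    for synset_id, (gloss, literals) in enumerate(raw):
        synset = Synset(synset_id, gloss)
        # Every literal gets its own lexical unit, tied to this synset only.
        synset.units = [LexicalUnit(literal, synset_id) for literal in literals]
        synsets.append(synset)
    return synsets


def readings_by_literal(synsets):
    """Map each literal to the glosses of all synsets it occurs in."""
    index = defaultdict(list)
    for synset in synsets:
        for unit in synset.units:
            index[unit.literal].append(synset.gloss)
    return dict(index)


def average_units_per_synset(n_lexical_units, n_synsets):
    """Average synset size, rounded to two decimals."""
    return round(n_lexical_units / n_synsets, 2)


# Synset examples from the "GermaNet: Synsets" slide.
RAW = [
    ("car", ["Golf", "VW Golf"]),
    ("gulf", ["Golf"]),
    ("golf", ["Golf", "Golfspiel"]),
    ("money", ["Kohle", "Geld", "Kies", "Knete", "Moneten"]),
    ("coal", ["Kohle", "Kohlegestein"]),
]

synsets = build_synsets(RAW)
index = readings_by_literal(synsets)

print(index["Golf"])   # ['car', 'gulf', 'golf']
print(index["Kohle"])  # ['money', 'coal']

# Figures from the "GermaNet: Size" slide: 84,600 lexical units, 61,700 synsets.
print(average_units_per_synset(84_600, 61_700))  # 1.37
```

The literal “Golf” yields three lexical units, one per synset, which is how a single word form can carry several readings while each lexical unit still has exactly one synset. The last line reproduces the average of 1.37 lexical units per synset from the rounded totals on the size slide.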