A Comparative Evaluation of Word Sense Disambiguation Algorithms for German
Total Page:16
File Type:pdf, Size:1020Kb
A Comparative Evaluation of Word Sense Disambiguation Algorithms for German Verena Henrich, Erhard Hinrichs University of Tubingen,¨ Department of Linguistics Wilhelmstr. 19, 72074 Tubingen,¨ Germany fverena.henrich,[email protected] Abstract The present paper explores a wide range of word sense disambiguation (WSD) algorithms for German. These WSD algorithms are based on a suite of semantic relatedness measures, including path-based, information-content-based, and gloss-based methods. Since the individual algorithms produce diverse results in terms of precision and thus complement each other well in terms of coverage, a set of combined algorithms is investigated and compared in performance to the individual algorithms. Among the single algorithms considered, a word overlap method derived from the Lesk algorithm that uses Wiktionary glosses and GermaNet lexical fields yields the best F-score of 56.36. This result is outperformed by a combined WSD algorithm that uses weighted majority voting and obtains an F-score of 63.59. The WSD experiments utilize the German wordnet GermaNet as a sense inventory as well as WebCAGe (short for: Web-Harvested Corpus Annotated with GermaNet Senses), a newly constructed, sense-annotated corpus for this language. The WSD experiments also confirm that WSD performance is lower for words with fine-grained sense distinctions compared to words with coarse-grained senses. Keywords: Word sense disambiguation, German, combined classifiers 1. Introduction methods, which have previously been shown to per- Word sense disambiguation (WSD) has been a very active form well for English WSD by Pedersen et al. (2005). area of research in computational linguistics. Most of the The remainder of this paper is structured as follows: Sec- work has focused on English. One of the factors that has tion 2 provides a brief overview of the German wordnet re- hampered WSD research for other languages has been the source GermaNet. Related work is discussed in Section 3. lack of appropriate resources, particularly in the form of Section 4 introduces the methods and algorithms for WSD sense-annotated corpus data. The present paper focuses on that are used in this paper. The performance of the different WSD for German and utilizes the German wordnet Ger- WSD algorithms is evaluated in Section 5. Finally, there maNet (Kunze and Lemnitzer, 2002; Henrich and Hinrichs, are concluding remarks and an outlook to future work in 2010) as a sense inventory as well as WebCAGe (Hen- Section 6. rich et al., 2012) (short for: Web-Harvested Corpus Anno- tated with GermaNet Senses), a newly constructed, sense- 2. GermaNet Resource annotated corpus for this language. GermaNet (Kunze and Lemnitzer, 2002; Henrich and Hin- Due to the lack of sense-annotated corpora for German richs, 2010) is a lexical semantic network that is modeled prior to the construction of WebCAGe, there has been rel- after the Princeton WordNet for English (Fellbaum, 1998). atively little research on WSD for this language.1 The pur- It partitions the lexical space into a set of concepts that are pose of this paper is to help close this gap. More specifi- interlinked by semantic relations. A semantic concept is cally, it has the following three goals: represented as a synset, i.e., as a set of words whose indi- vidual members (referred to as lexical units) are taken to be 1. To apply a much wider range of WSD algorithms to (near) synonyms. Thus, a synset is a set-representation of German since the range of methods that has thus far the semantic relation of synonymy. There are two types of been applied to German is rather limited. semantic relations in GermaNet. Conceptual relations, e.g., 2. To study the combination of WSD methods in either hypernymy, part-whole relations, entailment, or causation, majority or weighted majority voting schemes as well hold between two semantic concepts, i.e. synsets. Lexical as in a Borda count setup. relations, e.g., antonymy, hold between two individual lexi- cal units. GermaNet covers the three word categories of ad- 3. To study the effect of linking the GermaNet knowl- jectives, nouns, and verbs, each of which is hierarchically edge base with other web-based lexical resources such structured in terms of the hypernymy relation of synsets. 2 as the German version of Wiktionary (Henrich et al., GermaNet’s version 6.0 covers 93407 lexical units, which 2011). This mapping was performed for the purpose are grouped into 69594 synsets. of including Wiktionary sense definitions into Ger- Sense definitions are a crucial component for wordnets. maNet and is particularly relevant for word overlap Until recently, GermaNet included sense definitions only for a small number of lexical units. However, compre- 1For more discussion, see Section 3. below. hensive sense definitions are badly needed in order to en- 2http://www.wiktionary.org/ hance its usability for a wide variety of NLP applications, 576 including word sense disambiguation. In order to add sense ods: i) bilingual disambiguation based on parallel corpus of descriptions to GermaNet, a semi-automatic mapping was English-German medical scientific abstracts obtained from performed that enriches GermaNet’s lexical units with Wik- the Springer Link web site3, and ii) collocational disam- tionary sense definitions. This mapping is the result of a biguation as introduced by Yarowsky (1995) which uses two-stage process: an automatic alignment, which achieves multi-word expressions and collocations from UMLS as 93.8% accuracy, followed by a manual post-correction – seed examples for Yarowsky’s algorithm. see Henrich et al. (2011). The studies of Steffen et al. (2003) and Widdows et al. (2003) both focus on the medical domain and are thus 3. Related Work domain-dependent. A recent study that performs WSD on Word sense disambiguation has been a widely studied nat- a domain-independent German corpus is that of Broscheit ural language processing task in recent years. It goes be- et al. (2010). It uses GermaNet as a knowledge base and yond the scope of the present paper to provide a compre- the graph-based algorithm Personalized PageRank (PPR) of hensive account of the state of the art in WSD. See Agirre Agirre and Soroa (2009) as well as a voting approach that and Edmonds (2006) as well as Navigli (2009) for a more combines PPR with the unsupervised algorithms of Lapata in-depth survey and discussion of WSD methods. In this and Keller (2007) and McCarthy et al. (2004) for determin- section, we will focus on existing work on word sense dis- ing the most frequent sense. The best results reported by ambiguation for German, so as to be able to relate the re- Broscheit et al. (2010) are not obtained by the voting ap- search reported in the present paper to previous research proach but rather by the PPR algorithm alone. In order to on this language. All research on word sense disambigua- compare the performance of the algorithms described in the tion work for German (Steffen et al., 2003; Widdows et al., present paper with the results of Broscheit et al. (2010), the 2003; Broscheit et al., 2010) has focussed on the lexical Personalized PageRank algorithm is included in the set of sample task for WSD, i.e., the disambiguation of a fixed experiments described in Section 5. set of polysemous target words. The number of ambiguous target words used in these studies has been rather small, 4. Word Sense Disambiguation Algorithms ranging from 24 (Widdows et al., 2003) to 40 (Steffen et 4.1. Semantic Relatedness Measures al., 2003; Broscheit et al., 2010) nouns. By contrast, the In order to be able to apply a wide range of WSD algo- all-words WSD task consists of disambiguating each and rithms to German, we have reimplemented the same suite of every word that appears in a test corpus. If a lexical token semantic relatedness algorithms for German that were pre- contained in the corpus is not ambiguous, this is of course viously used by Pedersen et al. (2005) for English WSD4. trivial. For ambiguous words, the assumption is that the Following their terminology, these relatedness algorithms surrounding corpus context will aid in identifying the cor- can be grouped into path-based, information-content-based, rect sense of the given target word. In recent work, Navigli and gloss-based. The algorithms used in the present work et al. (2007; 2010) have successfully applied graph-based in each of these categories are summarized in the following disambiguation methods to this all-words disambiguation list. See Budanitsky and Hirst (2006) for a detailed descrip- task. We agree that the all-words disambigation task is ul- tion. timately more meaningful and more realistic from a cogni- tive perspective, relating computational approaches to lan- 4.1.1. Path-Based Measures guage processing by humans. However, the all-words task The following path-based measures all use the GermaNet has higher prerequisites in terms of sense-annotated cor- graph-structure and compute the shortest path between two pora that do not currently hold for German. Nevertheless, concepts contained in the graph. the present study significantly goes beyond the previous studies on German word sense disambiguation in that a to- • lch: Similarity between two concepts is computed tal of 1499 nouns are included in the lexical sample dis- as the negative logarithm of the length of the short- ambiguation task compared to the much smaller samples est path between the concepts (limited to hyper- used by Steffen et al. (2003), Widdows et al. (2003), and nymy/hyponymy relations only) over the path length Broscheit et al. (2010). of the overall depth of the wordnet – as introduced by The earliest German WSD studies that we are aware of are Leacock and Chodorow (1998).