Accessing GermaNet Data and Computing Semantic Relatedness

Iryna Gurevych and Hendrik Niederlich
EML Research gGmbH
Schloss-Wolfsbrunnenweg 33
69118 Heidelberg, Germany

http://www.eml-research.de/~gurevych

Abstract

We present an API developed to access GermaNet, a lexical semantic database for German represented in XML. The API provides a set of software functions for parsing and retrieving information from GermaNet. Then, we present a case study which builds upon the GermaNet API and implements an application for computing semantic relatedness according to five different metrics. The package can, again, serve as a software library to be deployed in natural language processing applications. A graphical user interface allows users to interactively experiment with the system.

1 Motivation

The knowledge encoded in WordNet (Fellbaum, 1998) has proved valuable in many natural language processing (NLP) applications. One particular way to integrate semantic knowledge into applications is to compute semantic relatedness of WordNet concepts. This can be used, e.g., to perform word sense disambiguation (Patwardhan et al., 2003), to find predominant word senses in untagged text (McCarthy et al., 2004), to automatically generate spoken dialogue summaries (Gurevych & Strube, 2004), and to perform spelling correction (Hirst & Budanitsky, 2005).

Extensive research concerning the integration of semantic knowledge into NLP for the English language has arguably been fostered by the emergence of the WordNet::Similarity package (Pedersen et al., 2004).[1] In its turn, the development of WordNet based semantic similarity software has been facilitated by the availability of tools to easily retrieve data from WordNet, e.g. WordNet::QueryData[2] and jwnl.[3]

[1] http://www.d.umn.edu/~tpederse/similarity.html
[2] http://search.cpan.org/dist/WordNet-QueryData
[3] http://sourceforge.net/projects/jwordnet

Research integrating semantic knowledge into NLP for languages other than English is scarce. On the one hand, there are fewer computational knowledge resources, such as dictionaries, that are broad enough in coverage to be integrated in robust NLP applications. On the other hand, there is little off-the-shelf software that allows one to develop applications utilizing semantic knowledge from scratch. While WordNet counterparts do exist for many languages, e.g. GermaNet (Kunze & Lemnitzer, 2002) and EuroWordNet (Vossen, 1999), they differ from WordNet in certain design aspects. For example, GermaNet features non-lexicalized, so-called artificial concepts that are non-existent in WordNet. Also, the adjectives are structured hierarchically, which is not the case in WordNet. These and other structural differences led to divergences in the data model. Therefore, WordNet based implementations are not applicable to GermaNet. Also, there is generally a lack of experimental evidence concerning the portability of, e.g., WordNet based semantic similarity metrics to other languages and their sensitivity to specific factors, such as network structure, language, etc. Thus, for a researcher who wants to build a semantic relatedness application for a language other than English, it is difficult to assess the effort and challenges involved.

Departing from that, we present an API which allows users to parse and retrieve data from GermaNet. Though it was developed following the guidelines for creating WordNet, GermaNet features a couple of divergent design decisions, such as the use of non-lexicalized concepts, the association relation between synsets, and the small number of textual definitions of word senses. Furthermore, we build an application accessing the knowledge in GermaNet and computing semantic relatedness of GermaNet word senses according to five different metrics. Three of these metrics have been adapted from experiments on English with WordNet, while the remaining two are based on automatically generated definitions of word senses and were developed in the context of work with GermaNet.

2 GermaNet API

The API for accessing GermaNet has to provide functions similar to the API developed for WordNet. We evaluated the C-library distributed together with GermaNet V4.0 and the XML encoded version of GermaNet (Lemnitzer & Kunze, 2002). As we wanted the code to be portable across platforms, we built upon the latter. The XML version of GermaNet is parsed with the help of the Apache Xerces parser (http://xml.apache.org/) to create a JAVA object representing GermaNet. For stemming the words, we use the functionality provided by the Porter stemmer for the German language, freely available from http://snowball.tartarus.org/german/stemmer.html. Thus, the GermaNet object exists in two versions: the original one, where the information can be accessed using words, and the stemmed one, where the information can be accessed using word stems.

We implemented a range of JAVA based methods for querying the data. These methods are organized around the notions of word sense and synset. On the word sense (WS) level, we have the following methods: getAntonyms() retrieves all antonyms of a given WS; getArtificial() indicates whether a WS is an artificial concept; getGrapheme() gets a graphemic representation of a WS; getParticipleOf() retrieves the WS of the verb that the word sense is a participle of; getPartOfSpeech() gets the part of speech associated with a WS; getPertonym() gives the WS that the word sense is derived from; getProperName() indicates whether the WS is a proper name; getSense() yields the sense number of a WS in GermaNet; getStyle() indicates if the WS is stylistically marked; getSynset() returns the corresponding synset; toString() yields a string representing a WS.

On the synset level, the following information can be accessed: getAssociations() returns all associations; getCausations() gets the effects that a given synset is a cause of; getEntailments() yields synsets that entail a given synset; getHolonyms(), getHyponyms(), getHypernyms(), and getMeronyms() return a list of holonyms, hyponyms, immediate hypernyms, and meronyms, respectively; getPartOfSpeech() returns the part of speech associated with the word senses of a synset; getWordSenses() returns all word senses constituting the synset; toString() yields a string representation of a synset.

The metrics of semantic relatedness are designed to employ this API. They are implemented as classes which use the API methods on an instance of the GermaNet object.
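As an illustration of how client code might drive such an API, the following sketch loads the XML data into a GermaNet object and inspects the synsets of a query word. It presumes the library's classes are on the classpath; the class names GermaNet, Synset and WordSense, the constructor argument, and the lookup method getSynsets() are assumptions made for this sketch, and only the accessors listed above are taken from the description in this section.

    import java.util.List;

    public class GermaNetApiDemo {
        public static void main(String[] args) throws Exception {
            // Parse the XML release into an in-memory object
            // (constructor and path argument assumed for this sketch).
            GermaNet germaNet = new GermaNet("/path/to/germanet-xml/");

            // Hypothetical lookup: all synsets containing the word "Becher".
            List<Synset> synsets = germaNet.getSynsets("Becher");

            for (Synset synset : synsets) {
                // Accessors below correspond to those described in Section 2.
                System.out.println("Synset: " + synset.toString());
                System.out.println("  POS:  " + synset.getPartOfSpeech());
                for (Synset hypernym : synset.getHypernyms()) {
                    System.out.println("  hypernym: " + hypernym.toString());
                }
                for (WordSense sense : synset.getWordSenses()) {
                    System.out.println("  sense " + sense.getSense()
                            + ": " + sense.getGrapheme());
                }
            }
        }
    }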
3 Semantic Relatedness Software

In GermaNet, nouns, verbs and adjectives are structured within hierarchies of is-a relations.[4] GermaNet also contains information on additional lexical and semantic relations, e.g. hypernymy, meronymy, antonymy, etc. (Kunze & Lemnitzer, 2002). A semantic relatedness metric specifies to what degree the meanings of two words are related to each other. For example, the meanings of Glas (Engl. glass) and Becher (Engl. cup) will typically be classified as closely related to each other, while the relation between Glas and Juwel (Engl. gem) is more distant. RelatednessComparator is a class which takes two words as input and returns a numeric value indicating the semantic relatedness of the two words. The semantic relatedness metrics have been implemented as descendants of this class.

[4] As mentioned before, GermaNet abandoned the cluster approach taken in WordNet to group adjectives. Instead, a hierarchical structuring based on the work by Hundsnurscher & Splett (1982) applies, as is the case with nouns and verbs.

Three of the metrics for computing semantic relatedness are information content based (Resnik, 1995; Jiang & Conrath, 1997; Lin, 1998) and are also implemented in the WordNet::Similarity package. However, some aspects in the normalization of their results and the task definition according to which the evaluation is conducted have been changed (Gurevych & Niederlich, 2005). These metrics are implemented as classes derived from InformationBasedComparator, which is in its turn derived from the class PathBasedComparator. They make use of both the GermaNet hierarchy and statistical corpus evidence, i.e. information content.

We implemented a set of utilities for computing the information content of German word senses from German corpora according to the method by Resnik (1995). The TreeTagger (Schmid, 1997) is employed to compile a part-of-speech tagged word frequency list. The information content values of GermaNet synsets are saved in a text file called an information content map. We experimented with different configurations of the system, one of which involved stemming of the corpora while the other did not involve any morphological processing. Contrary to our intuition, there was almost no difference between the information content maps arising from the two system configurations, with and without morphological processing. Therefore, the use of stemming in computing the information content of German synsets seems to be unjustified.
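To make the information content machinery concrete, here is a minimal, self-contained sketch of a Resnik-style score, assuming the usual definitions IC(c) = -log p(c) and rel(c1, c2) = IC(lcs(c1, c2)). The class name, the map layout, and the idea of passing in the lowest common subsumer's identifier are illustrative assumptions, not the interface of the InformationBasedComparator classes.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative Resnik-style relatedness: the information content of the
    // lowest common subsumer (LCS) of two synsets, looked up in an
    // information content map such as the text file described above.
    public class ResnikSketch {

        // Synset identifier -> information content value.
        private final Map<String, Double> icMap = new HashMap<>();

        public ResnikSketch(Map<String, Double> informationContentMap) {
            this.icMap.putAll(informationContentMap);
        }

        // IC(c) = -log p(c), with p(c) estimated from corpus frequencies
        // (e.g. from the TreeTagger-based frequency list).
        public static double informationContent(long conceptFreq, long totalFreq) {
            return -Math.log((double) conceptFreq / (double) totalFreq);
        }

        // rel(c1, c2) = IC(lcs(c1, c2)); the caller determines the LCS by
        // walking the hypernym relation upwards from both synsets.
        public double relatedness(String lcsSynsetId) {
            return icMap.getOrDefault(lcsSynsetId, 0.0);
        }
    }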
The remaining two metrics of semantic relatedness are based on the Lesk algorithm (Lesk, 1986). The Lesk algorithm computes the number of overlaps in the definitions of words, which are sometimes extended with the definitions of words related to the given word senses (Patwardhan et al., 2003). This algorithm for computing semantic relatedness is very attractive: it is conceptually simple and does not require the additional effort of corpus analysis needed by the information content based metrics. However, a straightforward adaptation of the Lesk metric to GermaNet turned out to be impossible. Textual definitions of word senses in GermaNet are fairly short and small in number. In contrast to WordNet, GermaNet cannot be employed as a machine-readable dictionary, but is primarily a conceptual network. In order to deal with this, we developed a novel methodology which generates definitions of word senses automatically from GermaNet using the GermaNet API. Examples of such automatically generated definitions can be found in Gurevych & Niederlich (2005). The method is implemented in the class PseudoGlossGenerator of our software, which automatically generates glosses on the basis of the conceptual hierarchy.

Two metrics of semantic relatedness are, then, based on the application of the Lesk algorithm to definitions generated automatically according to two system configurations. The generated definitions can be tailored to the task at hand according to a set of parameters defining which related concepts have to be included in the final definition. Experiments carried out to determine the most effective parameters for generating the definitions and employing those to compute semantic relatedness are described in Gurevych (2005). Gurevych & Niederlich (2005) present a description of the evaluation procedure for the five implemented semantic relatedness metrics against a human Gold Standard, as well as the evaluation results.
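The pseudo-gloss idea can be pictured with a small self-contained sketch: a gloss is assembled from the graphemes of a synset's related concepts (e.g. a few hypernyms), and the Lesk score is the number of word types two such glosses share. The class and method names below are illustrative and do not reproduce the PseudoGlossGenerator interface of our software.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class LeskSketch {

        // Assemble a pseudo-gloss from the first word sense of each related synset.
        public static String pseudoGloss(List<String> relatedFirstSenses) {
            return String.join(" ", relatedFirstSenses);
        }

        // Lesk score: number of word types shared by two glosses.
        public static int overlap(String glossA, String glossB) {
            Set<String> a = new HashSet<>(Arrays.asList(glossA.toLowerCase().split("\\s+")));
            Set<String> b = new HashSet<>(Arrays.asList(glossB.toLowerCase().split("\\s+")));
            a.retainAll(b);
            return a.size();
        }

        public static void main(String[] args) {
            // Toy glosses built from hand-picked hypernym graphemes.
            String glas = pseudoGloss(Arrays.asList("Gefäß", "Behälter", "Objekt"));
            String becher = pseudoGloss(Arrays.asList("Trinkgefäß", "Gefäß", "Behälter"));
            System.out.println("overlap = " + overlap(glas, becher)); // prints "overlap = 2"
        }
    }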
4 Graphical User Interface

We developed a graphical user interface to interactively experiment with the software for computing semantic relatedness. The system runs on a standard Linux or Windows machine. Upon initialization, we configured the system to load an information content map computed from the German taz corpus.[5] The information content values encoded therein are employed by the information content based metrics. For the Lesk based metrics, the two best configurations for generating definitions of word senses are offered via the GUI: one including three hypernyms of a word sense, and the other one including all related synsets (two iterations) except hyponyms. The representation of synsets in a generated definition is constituted by one (the first) of their word senses.

[5] www.taz.de

The user of the GUI can enter two words together with their part of speech and specify one of the five metrics. Then, the system displays the corresponding word stems, the possible word senses according to GermaNet, the definitions generated for these word senses, and their information content values. Furthermore, possible combinations of word senses for the two words are created and returned together with various diagnostic information specific to each of the metrics. This may be, e.g., word overlaps in definitions for the Lesk based metrics, or lowest common subsumers and their respective information content values, depending on what is appropriate. Finally, the best word sense combination for the two words is determined and compactly displayed together with a semantic relatedness score. The interface allows the user to add notes to the results by directly editing the data shown in the GUI and to save the detailed analysis in a text file for off-line inspection. The process of user-system interaction is summarized in Figure 1.

[Figure 1: The concept of user-system interaction.]

5 Conclusions

We presented software implementing an API to GermaNet and a case study built with this API: a package that computes five semantic relatedness metrics. We revised the metrics and in some cases redesigned them for the German language and GermaNet, as the latter differs from WordNet in a number of respects. The set of software functions resulting from our work is implemented in a JAVA library and can be used to build NLP applications with GermaNet or to integrate GermaNet based semantic relatedness metrics into NLP systems. Also, we provide a graphical user interface which allows users to interactively experiment with the system and study the performance of the different metrics.

Acknowledgments

This work has been funded by the Klaus Tschira Foundation. We thank Michael Strube for his valuable comments concerning this work.
References

Fellbaum, Christiane (Ed.) (1998). WordNet: An Electronic Lexical Database. Cambridge, Mass.: MIT Press.

Gurevych, Iryna (2005). Using the structure of a conceptual network in computing semantic relatedness. Submitted.

Gurevych, Iryna & Hendrik Niederlich (2005). Computing semantic relatedness of GermaNet concepts. In Bernhard Fisseni, Hans-Christian Schmitz, Bernhard Schröder & Petra Wagner (Eds.), Sprachtechnologie, mobile Kommunikation und linguistische Ressourcen: Proceedings of Workshop "Applications of GermaNet II" at GLDV'2005, pp. 462–474. Peter Lang.

Gurevych, Iryna & Michael Strube (2004). Semantic similarity applied to spoken dialogue summarization. In Proceedings of the 20th International Conference on Computational Linguistics, Geneva, Switzerland, 23–27 August 2004, pp. 764–770.

Hirst, Graeme & Alexander Budanitsky (2005). Correcting real-word spelling errors by restoring lexical cohesion. Natural Language Engineering, 11(1):87–111.

Hundsnurscher, F. & J. Splett (1982). Semantik der Adjektive im Deutschen: Analyse der semantischen Relationen. Westdeutscher Verlag.

Jiang, Jay J. & David W. Conrath (1997). Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the 10th International Conference on Research in Computational Linguistics (ROCLING), Taipei, Taiwan.

Kunze, Claudia & Lothar Lemnitzer (2002). GermaNet – representation, visualization, application. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Las Palmas, Canary Islands, Spain, 29–31 May, pp. 1485–1491.

Lemnitzer, Lothar & Claudia Kunze (2002). Adapting GermaNet for the Web. In Proceedings of the First Global WordNet Conference, Central Institute of Indian Languages, Mysore, India, pp. 174–181.

Lesk, Michael (1986). Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the 5th Annual International Conference on Systems Documentation, Toronto, Ontario, Canada, June, pp. 24–26.

Lin, Dekang (1998). An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning, San Francisco, Cal., pp. 296–304.

McCarthy, Diana, Rob Koeling, Julie Weeds & John Carroll (2004). Finding predominant senses in untagged text. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, 21–26 July 2004, pp. 280–287.

Patwardhan, Siddharth, Satanjeev Banerjee & Ted Pedersen (2003). Using measures of semantic relatedness for word sense disambiguation. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, Mexico City, Mexico, pp. 241–257.

Pedersen, Ted, Siddharth Patwardhan & Jason Michelizzi (2004). WordNet::Similarity – Measuring the relatedness of concepts. In Demonstrations of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, Boston, Mass., 2–7 May 2004, pp. 267–270.

Resnik, Phil (1995). Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, Montréal, Canada, 20–25 August 1995, Vol. 1, pp. 448–453.

Schmid, Helmut (1997). Probabilistic part-of-speech tagging using decision trees. In Daniel Jones & Harold Somers (Eds.), New Methods in Language Processing, Studies in Computational Linguistics, pp. 154–164. London, UK: UCL Press.

Vossen, Piek (1999). EuroWordNet: A Multilingual Database with Lexical-Semantic Networks. Dordrecht: Kluwer Academic Publishers.