GermaNet - a Lexical-Semantic Net for German

Birgit Hamp and Helmut Feldweg
Seminar für Sprachwissenschaft, University of Tübingen, Germany
email: {hamp, feldweg}@sfs.nphil.uni-tuebingen.de

Abstract

We present the lexical-semantic net for German "GermaNet", which integrates conceptual ontological information with lexical semantics, within and across word classes. It is compatible with the Princeton WordNet but integrates principle-based modifications on the constructional and organizational level as well as on the level of lexical and conceptual relations. GermaNet includes a new treatment of regular polysemy, artificial concepts and of particle verbs. It furthermore encodes cross-classification and basic syntactic information, constituting an interesting tool in exploring the interaction of syntax and semantics. The development of such a large-scale resource is particularly important as German up to now lacks basic online tools for the semantic exploration of very large corpora.

1 Introduction

GermaNet is a broad-coverage lexical-semantic net for German which currently contains some 16,000 words and aims at modeling at least the base vocabulary of German. It can be thought of as an online ontology in which meanings associated with words (so-called synsets) are grouped according to their semantic relatedness. The basic framework of GermaNet is similar to the Princeton WordNet (Miller et al., 1993), guaranteeing maximal compatibility. Nevertheless some principle-based modifications have been applied. GermaNet is built from scratch, which means that it is neither a translation of the English WordNet nor is it based on a single dictionary or thesaurus. The development of a German wordnet has the advantage that the applications developed for English using WordNet as a resource can be used for German with only minor modifications. This affects for example information extraction, automatic sense disambiguation and intelligent document retrieval. Furthermore, GermaNet can serve as a training source for statistical methods in natural language processing (NLP) and it makes future integration of German in multilingual resources such as EuroWordNet (Bloksma et al., 1996) possible.

This paper gives an overview of the resource situation, followed by sections on the coverage of the net and the basic relations used for linkage of lexical and conceptual items. The main part of the paper is concerned with the construction principles of GermaNet and particular features of each of the word classes.

2 Resources and Modeling Methods

In English a variety of large-scale online linguistic resources are available. The application of these resources is essential for various NLP tasks in reducing time, effort and error rate, as well as guaranteeing a broader and more domain-independent coverage. The resources are typically put to use for the creation of consistent and large lexical databases for parsing and machine translation as well as for the treatment of lexical, syntactic and semantic ambiguity. Furthermore, linguistic resources are becoming increasingly important as training and evaluation material for statistical methods.

In German, however, not many large-scale monolingual resources are publicly available which can aid the building of a semantic net. The particular resource situation for German makes it necessary to rely to a large extent on manual labour for the creation process of a wordnet, based on monolingual general and specialist dictionaries and literature, as well as comparisons with the English WordNet. However, we take a strongly

corpus-based approach by determining the base vocabulary modeled in GermaNet by lemmatized frequency lists from text corpora¹. This list is further tuned by using other available sources such as the CELEX German database. Clustering methods, which in principle can apply to large corpora without requiring any further information in order to give similar words as output, proved to be interesting but not helpful for the construction of the core net. Selectional restrictions of verbs for nouns will, however, be automatically extracted by clustering methods. We use the Princeton WordNet technology for the database format, database compilation, as well as the Princeton WordNet interface, applying extensions only where necessary. This results in maximal compatibility.

¹ We have access to a large tagged and lemmatized online corpus of 60,000,000 words, comprising the ECI-corpus (1994) (Frankfurter Rundschau, Donau-Kurier, VDI Nachrichten) and the Tübinger NewsKorpus, consisting of texts collected in Tübingen from electronic newsgroups.

3 Implementation

3.1 Coverage

GermaNet shares the basic database division into the four word classes noun, adjective, verb, and adverb with WordNet, although adverbs are not implemented in the current working phase.

For each of the word classes the semantic space is divided into some 15 semantic fields. The purpose of this division is mainly of an organizational nature: it allows the work to be split into packages. Naturally, the semantic fields are closely related to major nodes in the [...]. However, they do not have to agree completely with the net's top-level ontology, since a lexicographer can always include relations across these fields and the division into fields is normally not shown to the user by the interface software.

GermaNet only implements lemmas. We assume that inflected forms are mapped to base forms by an external morphological analyzer (which might be integrated into an interface to GermaNet). In general, proper names and abbreviations are not integrated, even though the lexicographer may do so for important and frequent cases. Frequency counts from text corpora serve as a guideline for the inclusion of lemmas. In the current version of the database multi-word expressions are only covered occasionally for proper names (Olympische Spiele) and terminological expressions (weißes Blutkörperchen). Derivatives and a large number of highly frequent German compounds are coded manually, making frequent use of cross-classification. An implementation of a more suitable rule-based classification of derivatives and the unlimited number of semantically transparent compounds fails due to the lack of algorithms for their sound semantic classification. The amount of polysemy is kept to a minimum in GermaNet: an additional sense of a word is only introduced if it conflicts with the coordinates of other senses of the word in the network. When in doubt, GermaNet refers to the degree of polysemy given in standard monolingual print dictionaries. Additionally, GermaNet makes use of systematic cross-classification.

3.2 Relations

Two basic types of relations can be distinguished: lexical relations which hold between different lexical realizations of concepts, and conceptual relations which hold between different concepts in all their particular realizations.

Synonymy and antonymy are bidirectional lexical relations holding for all word classes. All other relations (except for the 'pertains to' relation) are conceptual relations. An example of synonymy is the pair torkeln and taumeln, which both express the concept of the same particular lurching motion. An example of antonymy is the pair of adjectives kalt (cold) and warm (warm). These two relations are implemented and interpreted in GermaNet as in WordNet.

The relation pertains to relates denominal adjectives with their nominal base (finanziell 'financial' with Finanzen 'finances'), deverbal nominalizations with their verbal base (Entdeckung 'discovery' with entdecken 'discover') and deadjectival nominalizations with their respective adjectival base (Müdigkeit 'tiredness' with müde 'tired'). This pointer is semantic and not morphological in nature because different morphological realizations can be used to denote derivations from different meanings of the same lemma (e.g. konventionell is related to Konvention (Regeln des Umgangs) (social rule), while konventional is related to Konvention (juristischer Text) (agreement)).

The relation of hyponymy ('is-a') holds for all word classes and is implemented in GermaNet as in WordNet, so for example Rotkehlchen (robin) is a hyponym of Vogel (bird).

Meronymy ('has-a'), the part-whole relation, holds only for nouns and is subdivided into three relations in WordNet (component-relation, member-relation, stuff-relation). GermaNet, however, currently assumes only one basic meronymy relation. An example of meronymy is Arm (arm) standing in the meronymy relation to Körper (body).
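The split between lexical and conceptual relations can be pictured with a small data model. The following is only a minimal sketch under our own assumptions: the names `Synset`, `antonyms`, `pertains_to` and `is_a` are hypothetical and do not reflect the actual GermaNet or WordNet database format. Conceptual relations are stored as pointers between synsets, lexical relations as pointers between individual lemmas.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based equality keeps synsets hashable
class Synset:
    """A concept: its synonymous lemmas plus conceptual pointers."""
    lemmas: frozenset
    hypernyms: list = field(default_factory=list)  # hyponymy ('is-a')
    holonyms: list = field(default_factory=list)   # the single meronymy relation


# Lexical relations hold between word forms, not between concepts.
antonyms = {}     # lemma -> lemma, stored in both directions
pertains_to = {}  # derived lemma -> base lemma


def add_antonymy(a, b):
    """Antonymy is bidirectional, so record both directions."""
    antonyms[a] = b
    antonyms[b] = a


def is_a(synset, ancestor):
    """Follow hypernym pointers transitively."""
    stack, seen = list(synset.hypernyms), set()
    while stack:
        node = stack.pop()
        if node is ancestor:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(node.hypernyms)
    return False


# Examples taken from the text above.
lurch = Synset(frozenset({"torkeln", "taumeln"}))      # synonymy
vogel = Synset(frozenset({"Vogel"}))
rotkehlchen = Synset(frozenset({"Rotkehlchen"}), hypernyms=[vogel])
koerper = Synset(frozenset({"Körper"}))
arm = Synset(frozenset({"Arm"}), holonyms=[koerper])   # meronymy
add_antonymy("kalt", "warm")
pertains_to["finanziell"] = "Finanzen"
```

Synonymy needs no pointer of its own in this sketch: two lemmas are synonyms exactly when they share a synset.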

For verbs, WordNet makes the assumption that the relation of entailment holds in two different situations: (i) in cases of 'temporal inclusion' of two events as in schnarchen (snoring) entailing schlafen (sleeping); (ii) in cases without temporal inclusion as in what Fellbaum (1993, 19) calls 'backward presupposition', holding between gelingen (succeed) and versuchen (try). However, these two cases are quite distinct from each other, justifying their separation into two different relations in GermaNet. The relation of entailment is kept for the case of backward presupposition. Following a suggestion made in EuroWordNet (Alonge, 1996, 43), we distinguish temporal inclusion by its characteristic that the first event is always a subevent of the second, and thus the relation is called subevent relation.

The cause relation in WordNet is restricted to hold between verbs. We extend its coverage to account for resultative verbs by connecting the verb to its adjectival resultative state. For example öffnen (to open) causes offen (open).

Selectional restrictions, giving information about typical nominal arguments for verbs and adjectives, are additionally implemented. They do not exist in WordNet even though their existence is claimed to be important to fully characterize a verb's lexical behavior (Fellbaum, 1993, 28). These selectional properties will be generated automatically by clustering methods once a sense-tagged corpus with GermaNet classes is available.

Another additional pointer is created to account for regular polysemy in an elegant and efficient way, marking potential regular polysemy at a very high level and thus avoiding duplication of entries and time-consuming work (cf. section 5.1).

As opposed to WordNet, connectivity between word classes is a strong point of GermaNet. This is achieved in different ways: the cross-class relations ('pertains to') of WordNet are used more frequently. Certain WordNet relations are modified to cross word classes (verbs are allowed to 'cause' adjectives) and new cross-class relations are introduced (e.g. 'selectional restrictions'). Cross-class relations are particularly important as the expression of one concept is often not restricted to a single word class.

Additionally, the final version will contain examples for each concept which are to be automatically extracted from the corpus.

4 Guiding Principles

Some of the guiding principles of the GermaNet ontology creation are different from WordNet and are therefore now explained.

4.1 Artificial Concepts

WordNet does contain artificial concepts, that is non-lexicalized concepts. However, they are neither marked nor put to systematic use nor even exactly defined. In contrast, GermaNet enforces the systematic usage of artificial concepts and explicitly marks them by a "?". Thus they can be cut out on the interface level if the user wishes so. We encode two different sorts of artificial concepts: (i) lexical gaps which are of a conceptual nature, meaning that they can be expected to be expressed in other languages (see figure 2) and (ii) proper artificial concepts (see figure 3).² Advantages of artificial concepts are the avoidance of unmotivated co-hyponyms and a systematic structuring of the data. See the following examples:

In figure 1 noble man is a co-hyponym to the other three hyponyms of human, even though the first three are related to a certain education and noble man refers to a state a person is in from birth on. This intuition is modeled in figure 2 with the additional artificial concept ?educated human.

[Figure 1: Without Artificial Concepts]

[Figure 2: Lexical Gaps]

² Note that these are not notationally distinguished up to now; this still needs to be added.

In figure 3, all concepts except for the leaves are proper artificial concepts. That is, one would not expect any language to explicitly verbalize the concept of for example manner of motion verbs which specify the specific instrument used. Nevertheless such a structuring is important because

it captures semantic intuitions every speaker of German has and it groups verbs according to their semantic relatedness.

4.2 Cross-Classification

Contrary to WordNet, GermaNet enforces the use of cross-classification whenever two conflicting hierarchies apply. This becomes important for example in the classification of animals, where folk and specialized biological hierarchy compete on a large scale. By cross-classifying between these two hierarchies the [...] becomes more accessible and integrates different semantic components which are essential to the meaning of the concepts. For example, in figure 4 the concept of a cat is shown to biologically be a vertebrate, and a pet in the folk hierarchy, whereas a whale is only a vertebrate and not a pet.

[Figure 4: Cross-Classification]

The concept of cross-classification is of great importance in the verbal domain as well, where most concepts have several meaning components according to which they could be classified. However, relevant information would be lost if only one particular aspect was chosen with respect to hyponymy. Verbs of sound for example form a distinct semantic class (Levin et al., in press), the members of which differ with respect to additional verb classes with which they cross-classify, in English as in German. According to Levin (in press, 7), some can be used as verbs of motion accompanied by sound (A train rumbled across the loop-line bridge.), others as verbs of introducing direct speech (Annabel squeaked, "Why can't you stay with us?") or verbs expressing the causation of the emission of a sound (He crackled the newspaper, folding it carelessly). Systematic cross-classification makes it possible to capture this fine-grained distinction easily and in a principle-based way.

5 Individual Word Classes

5.1 Nouns

With respect to nouns the treatment of regular polysemy in GermaNet deserves special attention.

A number of proposals have been made for the representation of regular polysemy in the lexicon. It is generally agreed that a pure sense enumeration approach is not sufficient. Instead, the different senses of a regularly polysemous word need to be treated in a more principle-based manner (see for example Pustejovsky (1996)).

GermaNet is facing the problem that lexical entries are integrated in an ontology with strict inheritance rules. This implies that any notion of regular polysemy must obey the rules of inheritance. It furthermore prohibits joint polysemous entries with dependencies from applying for only one aspect of a polysemous entry.

A familiar type of regular polysemy is the "organization - building it occupies" polysemy. GermaNet lists synonyms along with each concept. Therefore it is not possible to merge such a type of polysemy into one concept and use cross-classification to point to both institution and building as in figure 5. This is only possible if all synonyms of both senses and all their dependent nodes in the hierarchy share the same regular polysemy, which is hardly ever the case.

[Figure 5: Regular Polysemy as Cross-Classification. Recoverable node labels: artifact, organization, facility, institution.]

To allow for regular polysemy, GermaNet introduces a special bidirectional relator which is placed at the top concepts for which the regular polysemy holds (cf. figure 6).

In figure 6 the entry bank1 (a financial institution that accepts deposits and channels the money into lending activities) may have the synonyms depository financial institution, banking concern,

[Figure 3: Proper Artificial Concepts. Diagram not reproduced; recoverable node labels: ?manner of motion (root), ?iterative, ?speed, ?instrument, ?sound, general.]

[Diagram for Figure 6 not reproduced; recoverable labels: organization, regular polysemy pointer, and the synset depository financial institution, bank1, banking concern, banking company.]

Figure 6: Regular Polysemy Pointer

banking company, which are not synonyms of bank2 (a building in which commercial banking is transacted). In addition, bank1 may have hyponyms such as credit union, agent bank, commercial bank, full service bank, which do not share the regular polysemy of bank1 and bank2.

Statistically frequent cases of regular polysemy are manually and explicitly encoded in the net. This is necessary because they often really are two separate concepts (as in pork, pig) and each sense may have different synonyms (pork meat is only a synonym of pork). However, the polysemy pointer additionally allows the recognition of statistically infrequent uses of a word sense created by regular polysemy. So for example the sentence I had crocodile for lunch is very infrequent in that crocodile is not commonly perceived as meat but only as animal. Nevertheless we know that a regular polysemy exists between meat and animal. Therefore we can reconstruct via the regular polysemy pointer that the meat sense is referred to in this particular sentence even though it is not explicitly encoded. Thus the pointer can be conceived of as an implementation of a simple default via which the net can account for language productivity and regularity in an effective manner.

5.2 Adjectives

Adjectives in GermaNet are modeled in a taxonomical manner making heavy use of the hyponymy relation, which is very different from the satellite approach taken in WordNet. Our approach avoids the rather fuzzy concept of indirect antonyms introduced by WordNet. Additionally we do not introduce artificial antonyms as WordNet does (pregnant, unpregnant). The taxonomical classes follow (Hundsnurscher and Splett, 1982) with an additional class for pertainyms³.

³ Adjectives pertaining to a noun from which they derive their meaning (financial, finances).

5.3 Verbs

Syntactic frames and particle verbs deserve special attention in the verbal domain. The frames used in GermaNet differ from those in WordNet, and particle verbs as such are not treated in WordNet at all.

Each verb sense is linked to one or more syntactic frames which are encoded on a lexical rather than on a conceptual level. The frames used in GermaNet are based on the complementation codes provided by CELEX (Burnage, 1995). The notation in GermaNet differs from the CELEX database in providing a notation for the subject and a complementation code for obligatory reflexive phrases. GermaNet provides frames for verb senses, rather than for lemmas, implying a full disambiguation of the CELEX complementation codes for GermaNet.

Syntactic information in GermaNet differs from that given in WordNet in several ways. It marks expletive subjects and reflexives explicitly, encodes case information, which is especially important in German, distinguishes between different realizations of prepositional and adverbial phrases and marks to-infinitival as well as pure infinitival complements explicitly.

Particles pose a particular problem in German. They are very productive, which would lead to an explosion of entries if each particle verb was explicitly encoded. Some particles establish a regular semantic pattern which cannot be accounted for by a simple enumeration approach, whereas others are very irregular and ambiguous. We therefore propose a mixed approach, treating irregular particle verbs by enumeration and regular particle verbs in a compositional manner. Composition can be thought of as a default which can be overwritten by explicit entries in the database. We assume a morphological component such as GERTWOL (1996) to apply before the compositional process starts. Composition itself is implemented as follows, relying on a separate lexicon for particles. The particle lexicon is hierarchically structured and lists selectional restrictions with respect to the base verb selected. An example of the hierarchical structure is given in figure 7 (without selectional restrictions for the sake of simplicity), where heraus- is a hyponym of her- and aus-. Selectional restrictions for particles include Aktionsart, a particular semantic verb field, deictic orientation and directional orientation of the base verb.

The evaluation of a particle verb takes the following steps. First, GermaNet is searched for an explicit entry of the particle verb. If no such entry exists the verb is morphologically analyzed and its semantics is compositionally determined. For example the particle verb herauslaufen in figure 7 is a hyponym of laufen (walk) as well as of heraus-. Criteria for a compositional treatment are separability, productivity and a regular semantics of the particle (see Fleischer and Barz (1992), Stiebels (1994), Stegmann (1996)).

6 Conclusion

A wordnet for German has been described which, compared with the Princeton WordNet, integrates principle-based modifications and extensions on the constructional and organizational level as well as on the level of lexical and conceptual relations. Innovative features of GermaNet are a new treatment of regular polysemy and of particle verbs, as well as a principle-based encoding of cross-classification and artificial concepts. As compatibility with the Princeton WordNet and EuroWordNet is a major construction criterion of GermaNet, German can now, finally, be integrated into multilingual large-scale projects based on ontological and conceptual information. This constitutes an important step towards the design of truly multilingual tools applicable in key areas such as information retrieval and intelligent Internet search engines.

References

A. Alonge. 1996. Definition of the links and subsets for verbs. Deliverable D006, WP4.1, EuroWordNet, LE2-4003.

L. Bloksma, P. Diez-Orzas, and P. Vossen. 1996. User Requirements and Functional Specification of the EuroWordNet Project. Deliverable D001, WP1, EuroWordNet, LE2-4003.

G. Burnage. 1995. The CELEX Lexical Database, Release 2. CELEX - Centre for Lexical Information; Max Planck Institute for Psycholinguistics, The Netherlands.

C. Fellbaum. 1993. A Semantic Network of English Verbs. In G.A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. Miller, editors, Five Papers on WordNet. August. Revised version.

[Figure 7: Particle Verbs. Diagram not reproduced; it shows a Verb Database and a Particle Database, with the entry herauslaufen (motion towards the deictic centre out of s.th. with manner of motion: laufen).]

Wolfgang Fleischer and Irmhild Barz. 1992. Wortbildung der deutschen Gegenwartssprache. Max Niemeyer Verlag, Tübingen.

GERTWOL. 1996. German morphological analyser. http://www.lingsoft.fi/doc/gertwol/.

Franz Hundsnurscher and Jochen Splett. 1982. Semantik der Adjektive des Deutschen: Analyse der semantischen Relationen. Westdeutscher Verlag, Opladen.

European Corpus Initiative. 1994. European Corpus Initiative Multilingual Corpus.

B. Levin, G. Song, and B.T.S. Atkins. in press. Making sense of corpus data: A case study of verbs of sound. International Journal of Corpus Linguistics, 41 pages.

G.A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. Miller. 1993. Five Papers on WordNet. Technical report, Cognitive Science Laboratory, Princeton University, August. Revised version.

James Pustejovsky. 1996. The Generative Lexicon. MIT Press.

Rosmary Stegmann. 1996. Semantic Analysis and Classification of Verbs of Direction. MA-Thesis. Seminar für Sprachwissenschaft, Universität Tübingen.

Barbara Stiebels. 1994. Lexikalische Argumente und Adjunkte. Zum semantischen Beitrag von verbalen Präfixen und Partikeln. Dissertation. Philosophische Fakultät, Universität Düsseldorf.
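The evaluation steps for particle verbs given in section 5.3 (explicit lookup first, compositional analysis as a default) can be sketched roughly as follows. This is an illustration under our own assumptions, not the GermaNet implementation: the dictionaries, the function names and the explicit entry for aufhören are hypothetical, the prefix-matching function merely stands in for a morphological analyzer such as GERTWOL, and selectional restrictions are left out.

```python
# Hypothetical toy data; not the actual GermaNet database format.
EXPLICIT_VERBS = {
    # Irregular particle verbs are enumerated (entry assumed for illustration).
    "aufhören": ["beenden"],
}
BASE_VERBS = {"laufen"}
# Particle lexicon: particle -> its hypernym particles (as in figure 7).
PARTICLES = {"heraus": ("her", "aus"), "her": (), "aus": ()}


def split_particle_verb(lemma):
    """Stand-in for morphological analysis: longest known particle prefix
    whose remainder is a known base verb."""
    for particle in sorted(PARTICLES, key=len, reverse=True):
        base = lemma[len(particle):]
        if lemma.startswith(particle) and base in BASE_VERBS:
            return particle, base
    return None


def evaluate_particle_verb(lemma):
    """Step 1: explicit entry. Step 2: compositional default."""
    if lemma in EXPLICIT_VERBS:
        return {"source": "explicit", "hypernyms": list(EXPLICIT_VERBS[lemma])}
    analysis = split_particle_verb(lemma)
    if analysis is None:
        return None  # neither enumerated nor analyzable
    particle, base = analysis
    # The composed verb is a hyponym of the base verb and of the particle.
    return {"source": "compositional", "hypernyms": [base, particle + "-"]}


result = evaluate_particle_verb("herauslaufen")
```

Because the explicit lookup runs first, an enumerated entry always overrides the compositional default, which mirrors the mixed approach described in the text.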

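The regular polysemy pointer of section 5.1 acts as a default: a reading that is not explicitly encoded can still be reconstructed when an ancestor concept carries the pointer. The sketch below illustrates this idea only. The toy hierarchy (including the intermediate node reptile), the pointer list and all function names are our own assumptions rather than GermaNet data or code.

```python
# Toy hierarchy (child -> parent); entries are illustrative assumptions.
HYPERNYM = {
    "crocodile": "reptile",
    "reptile": "animal",
    "pig": "animal",
    "pork": "meat",
}
# Bidirectional regular polysemy pointers, placed at top concepts.
POLYSEMY_POINTERS = [("animal", "meat")]


def ancestors(concept):
    """All concepts reachable by following hypernym links upwards."""
    chain = []
    while concept in HYPERNYM:
        concept = HYPERNYM[concept]
        chain.append(concept)
    return chain


def pointer_partners(concept):
    """Concepts linked to `concept` by a regular polysemy pointer."""
    partners = []
    for a, b in POLYSEMY_POINTERS:
        if concept == a:
            partners.append(b)
        elif concept == b:
            partners.append(a)
    return partners


def default_readings(word):
    """Readings licensed only by a pointer on some ancestor of `word`."""
    encoded = ancestors(word)
    derived = []
    for concept in encoded:
        for partner in pointer_partners(concept):
            if partner not in encoded and partner not in derived:
                derived.append(partner)
    return derived


crocodile_readings = default_readings("crocodile")
```

In this sketch crocodile is encoded only under animal, yet the pointer between animal and meat makes the meat reading of "I had crocodile for lunch" recoverable without a separate entry.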