<<

Lexical Discrimination with the Italian Version of WORDNET

Alessandro Artale, Bernardo Magnini and Carlo Strapparava IRST, 1-38050 Povo TN, Italy e-mail: {artalelmagninilstrappa}@irst. itc. it

Abstract for lexical discrimination, with Italian WOttDNET. Se- lectional restrictions provide explicit semantic informa- We present a prototype of the Italian version of tion that the verb supplies about its arguments [Jackend- WORDNET, a general computational lexical re- off, 1990], and should be fully integrated into the verb's source. Some relevant extensions are discussed argument structure. to make it usable for : in particular we Although selectional restrictions are different in dif- add verbal selectional restrictions to make lexi- ferent domains [Basili et al., 1996] we are interested in cal discrimination effective. Italian WORDNET finding common invariants across sublanguages. It is our has been coupled with a parser and a number of intention to build a very general instrument that can be experiments have been performed to individu- afterwards tuned to particular domains by identifying ate the methodology with the best trade-off be- more specific uses. The main motivation is to have both tween disambiguation rate and precision. Re- a robust and a computational efficient natural language sults confirm intuitive hypothesis on the role of system. On one hand, robustness is emphasized because selectional restrictions and show evidences for sentences that are syntactically correct, but which are a WORDNET-Iike organization of lexical senses. not successfully analyzed in the specific application do- main, can have a valid linguistic meaning. On the other 1 Introduction hand, we are able to filter the sentence meanings on a linguistic basis. This phase discards the unplausible lec- WORDNET is a thesaurus for the English language based tures pruning the search space by looking for compatibil- on psycholinguistics principles and developed at the ity semantic relations. This kind of discrimination can Princeton University by George Miller [Miller, 1990]. be realized with computationally effective algorithms by It has been conceived as a computational resource, so exploiting the lexical taxonomy of WORDNET, postpon- improving some of the drawbacks of traditional dictio- ing more complex and expensive computations to the naries, such as the circularity of the definitions and the domanin specific analysis. ambiguity of sense references. Lemmas (about 130,000 The paper is structured as follow. Section 2 de- for version 1.5) are organized in synonyms classes (about scribes the Italian prototype of WORDNET; while sec- 100,000 synsets). tion 3 shows how selectional restrictions has been added The more evident problem with WORDNET is that to verb senses. Section 4 shows how Italian WOttDNET it is a lexical knowledge base for English, and so it has been coupled with the parser, both for describing is not usable for other languages. Here we present lexical senses and as a repository for selectional restric- the efforts made in the development of the Italian ver- tions. Section 5 reports a number of experiments that sion of WoRDNET [Magnini and Strapparava, 1994; has been performed to individuate the methodology de- Magnini et al., 1994], a project started at IltSW about sign with the best trade-off between disambiguation rate one year ago in the context of ILEX [Delmonte et al., and precision. Finally section 6 provides some conclusive 1996] a more general project aiming at the realization of remarks. a computational dictionary for Italian 1. A second problem with WORDNET is that it needs some important extensions to make it usable for effective 2 The Italian WORDNET Prototype parsing. In particular, parsing requires a powerful mech- The Italian version of WORDNET is based on the as- anism for lexical discrimination, in order to select the sumption that a large part of the conceptual relations appropriate lexical readings for each in the input defined for English (about 72,000 ISA relations and sentence. In this paper we also explore the integration 5,600 PART-OF relations) can be shared with Italian. of "selectwnal res~mctwng', a traditional technique used WORDNET can be described as a lexical matrix with 1The ILEX consortium includes the two dimensions: the lexical relations, which hold among Department of the University of Torino, the University of and so are language specific, and the conceptual Venezia, and the brench of the University of Torino at relations, which hold among senses and that, at least Vercelli. in part, we consider independent from a particular lan-

32 Wo~s P

Meanings

Lonymy ;ynsS) HASPAR'r\

Pohsemy

Figure h Multilingual lexical matrix

guage. The Italian version of WORDNET aims at the 3 Adding Selectional Restrictions to realization of a multilingual lexical matrix through the Verbs addition of a third dimension relative to the language. Figure 1 shows the three dimensions of the matrix: (a) A number of steps have been followed to add selectional words in a language, indicated by YY~ ; (b) meanings, in- restrictions to Italian WORDNET. First, Italian verb dicated by .A4~; () languages, indicated by £k. From senses were extracted from a paper version of an Italian an abstract point of view, to develop the multilingual dictionary and checked against a corpus of genereric Ital- matrix it is necessary to re-map the Italian lexical forms ian texts. Each verb sense has been then coupled with with corresponding meanings (.A4,), building the set of one or more English WORDNET synsets 2. This phase synsets for Italian (making explicit the values for the has been performed manually with the help of a graph- intersections £~).I The result will be a complete redefi- ical interface (see figure 2) that includes four integrated nition of the lexical relations, while for the semantic re- working tools: (i) a bilingual dictionary with more than lations, those originally defined for English will be used 30,000 lemmas; (ii) a graph that allows the visualiza- as much as possible. tion of the coupling with the English WORDNET; (iii) the bilingual WoRDNET, that behaves exactly like the An implementation of the Multilingual lexical matrix English version with the additional possibility to browse has been realized which allows a complete integration the Italian ; (iv) finally, the working with the English version and the availability of all the cards allow the insertion, modification and check of the translations for the Italian lemmas. The architecture is data for a synset. The result of this phase is the ex- easily extendable to other languages. The integration tension of the English WordNet with the Italian synsets. with the computational lexicon ILEX is under develop- Figure 3 shows the correspondence between English and ment: it will make the access to other levels of lexical in- Italian synsets for the verb Scrivere (Write). formation, such as morphological classes, syntactic cate- The next step is the definition of the sense subgate- gories and sub-categorization frames available. The Ital- gorization frame. This includes both syntactic informa- ian version of WORDNET, in December 1996, included tion (i.e., argumental positions, prepositions on indirect about 10,000 lemmas (7,000 nouns, 700 verbs, 1,500 objects, category type) and semantic information, such adjectives, 600 adverbs). as thematic roles and selectional restrictions. Syntactic information are associated to single verbs, while seman- Till now, data acquisition has been mostly manual, tic information are associated to the whole synset, i.e., with the help of a graphical interface; however a basic semantic participants are shared among all the verbs be- goal of the project is the experimentation of techniques longing to the synset. for the (semi)automatic acquisition of data. Algorithms We built selectional restrictions using the synsets of for the resolution of the ambiguities in the coupling with the noun hyerarchy. Two different possibilities for defin- the English WORDNET have been developed. Versions ing selectional restrictions are considered: automatically created are then tested against manually acquired data, with the aim of incrementally improve the 1. Selectional restrictions obtained from the frames precision level. A final manual check is performed for all currently provided by WoRDNET. the data automatically acquired. It is also foreseen the use of corpora to extract contextual information to be ~As for figurative uses, they can also be coupled with used during the disambiguation process. WoRDNET provided that an appropriate synset do exist.

33 !• en$$ Z in¥1aremsrld~e $cnveM spedlre ~eeilse 3 8c~vsrepubbl~csr8

Se.se 4

Se"le 5 Sc~Wm ~dlge~ comporm

Figure 2: The Italian WORDNET interface.

2. Selectional restrictions obtained from the whole In the second hypothesis selectional restrictions are WORDNET noun hierarchy. taken from the whole noun hierarchy. As an example, fig- ure 4 illustrates the senses for the Italian verb Scrivere As far as the first hypothesis is concerned, WoRDNET (Write) found in Italian WORDNET. For each sense we describes all the English verbs resorting to a set of 35 report a conventional name - which unambiguously iden- different syntactic frames, which in turn include only tifies the synset - and the argumental positions admit- two restrictions, that is Something and Somebody. For ted for that sense, with the indication of the selectional example, the frames provided for the verb Write in the restrictions. The appropriate combination of synsets for synset {Publish, Write} are: an argumental position has to be both enough general to preserve all the human readings, and enough restricted Somebody... s for discriminating among different senses of both verb Somebody... s Something and noun. Founding the appropriate selectional restrictions re- The problem arising in using these two restrictions vealed itself difficult and time consuming. The process is that they are completely uncorrelated to the noun required a deep search into the WOI~DNET noun hier- synsets, then, they have to be matched with the proper archy. In order to achieve a good trade-off between dis- synsets in the noun hierarchy. The concept Somebody in- crimination power and precision level we adopted an em- cludes not only the synset Person but also all the synsets pirical process with successive steps of refinement. We denoting group of people that could hold the agent the- started with general selectional restrictions and then we matic role. We defined Somebody using the following validate them against experimental results. This itera- boolean combination of synsets: tire process ended with complex selectional restritions Somebody for verbs, as the figure 4 shows. Person V People V People-Multitude V The WORDNET verb taxonomy is based on the ~ro- (Social-Group A ~(Society V Subculture V ponymy relation, which is defined as the co-occurrence of both lexical implication and temporal co-extension be- Political-System V Moiety V Clan)) tween two verbs. We would note that, every time a tro- Something is defined as ~he complement of Somebody. ponymy relation between two verbs holds, an ISA rela-

34 Synset Label Italian Synset English Synset Write {scrivere redigerecomporre} {write compose pen indite} Write-Music {scriverecomporre} {compose write write_music} Writs-Communicate {scriverecomunlcare_per..iscritto} {write communicate_by.writingexpress.by_writing} Write-Publish {scrivere pubblicare} {publish write} Write-Send {inviare mandare scrivere spedire} {mail write post send}

Figure 3: Correspondences between Italian and English synsets for the verb 'scrsrere' (write).

WoRDNET Synset [ Subject Object Indlrect-Object Write Somebody Written-MateriaiV ---- Symbolic-Repres V SayingV Correspondence V Sentencer Message V Message-ContentV Code V SymbolV Dater Lan~..mge--Unit V PropertyV Address-SpeechV Print-Media Write-Music Person Music -- Write-Communicate Somebody (Written-4{aterialA ~Section)V Somebody Symbolic-Repres V SayingV Sentence V NameV Message V Message-ContentV Code V Date V Property Write-Publish Somebody Written-MateriaiV Print..-MediaVPublishing-House (Print-MediaA ~Section) Write-Send Somebody CorrespondenceV MessageV Somebody Letter-Missive

Figure 4: Lexical entries for Scriveze (Writs). tion between the correspondent selectional restrictions cessors (e.g., morphological analyzers). In particular, we holds, too. integrated the hierarchical structure of WORDNET as an external KB, while an ISA function uses the WORDNET 4 Coupling WORDNET and a TFS Parser hierarchy in order to check subsumption relationships between WoRDNET synsets. In this section we describe the architecture we used for checking WORDNET usability in parsing. Italian WORD- The grammar is written adopting a HPSG-like style, NET has been used in two different phases of the linguis- and each rule is regarded as Typed Feature Structure tic analysis. On a first phase, we use Italian WORD- (TFS). For the current experiment the grammar cover- NET as a lexicon repository to carry on lexical analysis. age is limited to very simple verbal sentences formed by During the semantic analysis Italian WORDNET is used a subject, a main verb together with its internal argu- as a kind of Knowledge Base (KB) exploiting the struc- ments and, possibly, an adjunct phrase. Observe that, tural relationships among synsets. In particular, we used the syntactic analysis does not take into account the pp- the supertype/subtype-like hierarchy of synsets during attachment case. We excluded the possibility to capture the parsing process in order to discard unplausible con- these complex nominal phrases. Indeed, the object of the stituents on a semantic base. experiment is to disambiguate among WORDNET senses of both verbs and nouns on the basis of the lexical se- The parser used is a CYK chart parser embedded in mantic restrictions for the arguments of the verb and the the GEPPETTO environment [Ciravegna ¢t 1996], and al., lexical semantic associated to the noun. coupled with a proper unification algorithm. GEPPETTO is based on a Typed Feature Logic [-Carpenter, 1992] for A condition for using WORDNET coupled with the the specification of linguistic data. The GEPPETTO envi- GEPPETTO environment is to bring it in a format ef- ronment allows to edit and debug grammars and lexica, fectively usable. The exploited idea was to rebuild the linking linguistic data to a parser and/or a generator, in- WORDNET hierarchy in CLOS, the object-oriented part tegrating various form of KBs, and using specialized pro- of COMMON LISP. The advantages of this approach is

35 Word WoBDNET SynsetLabe~ Regina (Queen) queen-Insect,Queen-Regnant,Queen-Wi~e, Queen-Card,Queen-Chess Art~colo (Article) Article-Artefact,Article-Clause, Article-Grawmar, Article-Document Lettera (Letter) Letter-Missive, Letter-Alphabet Libro (Book) Book-Publication, Book-Section,Book-Object

Figure 5: Lexical entries for nouns. the possibdity to implement a fast and flexible access We experiment the two hypotheses on selectional restric- to the synsets hierarchy and, in particular, an efficient tions presented in section 3, i.e., the one with general ISA functionality as required for the semantic checking WoRDNET frames and the other with more refined se- during the parsing. The arguments to ISA function may lectional restrictions. be a complex boolean combination of synsets (e.g., see As an example, figure 6 shows the output of the parser selectional restrictions in figure 4). for the sentence "La regzna scrtsse ~ma leltera a Gzo- The parser controls the overall processing. Whenever vanng' ("The queen wrote a letter to John"). As a con- it tries to build a (partially recognized) constituent it venction, we decided to describe internal arguments with incrementally verifies the admissibility of the semantic the symbol '/', while a '//' denotes a verbal adjunct. part of such a constituent, using the WORDNET hierar- This sentence was selected because it produces an high chy. In particular, whenever a noun is associated with number of lectures (40) among the test suite sentences. a verbal argument the ISA function is triggered to check This is due to both the verb sense ambiguity (write has whether the synset of the noun is subsumed by the selec- five senses) and to the noun ambiguitms (queen has five tional restriction of the corresponding verbal argument. senses, and letter two). Note that the parser excludes Due to the large number of analyses, it is useful to dis- the sense Write-Publish since the indirect object must card unplausible constituents as soon as possible to cut be introduced by the Italian prepositions "su" or "per~' the search space. This has been obtained interliving the (in English "on" or "fort'), while in this example we have syntactic and semantic processes: as soon as the seman- the preposition "a" ("to"). tic test fails the constituent is rejected. Let us first consider the results obtained in the second experimental setting, which best approximate the human 5 Experiments and Results judgment. Out of the eight interpretations accepted, In this section we describe the empirical results obtained two are implausible for a human reader. This caused by by coupling a WoRDNET based lexicon with a parser In the contemporary presence of the sense Letter-Missive our intention, the experiment would bring evidence for and of the proper noun John as, respectively, patient and the following aspects: beneficiary of the write verb sense. Note that, each of these senses are, per se, valid arguments since they sat- . Plausibility of WoRDNBT senses for describing a isfy the selectional restrictions. lexical entries; In the first experimental setting, the presence of e Usability of WoRDNET for carrying out lexical dis- weaker selectional restrictions (just ~omebody, Some- crimination. thzng) ymlds more spurious readings. As a matter of The experiment has been carried out on 60 sentences fact, the more evzdent problem is that many times ar- with 1201 different lectures, and formed by using seven gumental positions are not properly filled. For ex- verbs (wr~te, eat, smell, corrode, buy, receive, assocza~e) ample, that "a queen-Regnant could ~/rlte-Music an coupled with fifty common nouns and two proper nouns. Letter-Missive" (i.e., a kind of correspondence) is one In the general experimental setting a sentence is given to of the allowed readings. the parser in a situation characterized by multiple lexi- Figure 7 reports the quantitative results of the ex- cal entries for each single word (one for each WoRDNET periment. They are preliminary since they have been sense). The analyses produced by the parser are com- obtained on a limited number of sentences (60). The pared with the set of interpretations given by a human. figure shows, for each experimental setting, the num- As far as nouns are concerned, a lexlcal entry includes ber of total readings produced by the parser, the dis- all the senses found in Italian WORDNET. Some of the crlmination rate (i.e., the rate of the lectures rejected: nouns used in the experiment are shown in figure 5. As (1201- z)/1201), and the precision (i.e., the rate of cor- for verbs, we started from the Italian WoRDNET senses rect lectures: 122/z). These results have to be inter- and then we faced to the problem of mdividuatmg the preted considering that the focus of the experiment is proper selectional restrictions for each argumental posi- on selectional restrictions, which of course is just one tion of the verb subcategorization frame as seen before. among the various kinds of information occurring dur- So we build a small number of lexical entries, by means ing lexical discrimination. It is worth mentioning here, of which we composed the sentences of the experiment. among the others, some crucial information sources: (i)

36 Sentence La regina scrisse una lettera a Giovanni (THE QUEEN WROTE A LETTER TO JOHN)

Restrictions No semantic discrimination Number of lectures 40

Restrictions Discrimination with WoRDNET Frames (I exp. setting) Number of lectures 16

Restrictions Discrimination with WORDNET Full Hierarchy (II exp. setting) Number of lectures 8 Lectures (1, 2, 3, 4, 5, 6) Queen-Wife/Write/Letter-Alphabet//John (~) Queen-Regnant/Write/Letter-Alphabet//John (8)

Human Judgment Number of lectures 6 Lectures Queen, Regnant/Wfite-Comunicate/Letter-Missive/John (1) Queen-Wife/Write-Comunicate/Letter- Missive/John (2) Queen-Regnant/Writ e-Send/Letter-Missive/John (3) Queen-Wife/Wnte-Send/Letter-Missive/John (4) Queen-Regnant /Write /Let ter- Missive/ / John (5) Queen-Wife/Write/Letter- Missive//John (6)

Figure 6: An example of sentence

Experimental Setting ~ of lectures [ Discrimination Rate Precision Without discrimination 1201 0% 10% Discrimination with WoRDNET Frames 688 43% 18% Discrimination with the WoRDNET Full Hierarchy 164 86% 74% Human Judgment 122 90% 100%

Figure 7: Quantitative results obtained on 60 sentences

world knowledge: e.g., it is very strange to Write an they are more detailed. Article-Clause on a Newspaper-Periodic; (ii) aspec- tual properties of the verb: e.g., it is very difficult to Some general suggestions can be drawn in order to interpret La regina sta scmvendo un artscolo sul gzonale individuate a trade-of between the effort necessary for (The queen ss wmtmg an article on the newspaper) with describing selectional restrictions and the lexical disam- the Write-Publish sense, because publishing is a cul- biguation obtained. Although the definition of detailed minative process. selectional restrictions was highly time comsuming, our experience shown that this approach obtains good results 6 Conclusions both in the discrimination rate and in the precision. In this paper we presented the approach underlying the The experiment also brings evidence for a WORDNET Italian WoRDNET, a general computational lexical re- like sense organization. In fact, different selectional re- source. A prototype has been realized which implements strictions apply to different senses allowing to discrimi- a multilingual lexical matrix. Data acquisition has been nate among different readings. However, an important mostly manual with the help of a graphical interface. drawback in WoRDNET is the lack of relations among In light of the concrete use of the Italian WORDNET we related senses of the same word. This is particularly propose the integration of selectional restrictions into the crucial for the logzcal pohsemy cases [Pustejovsky, 1995], verbal taxonomy. An empirical verification has been per- when a sense can be generated from another in a pre- formed which confirms the intuitive hypothesis that se- dictable way, and, in general, to treat the so called "verb lectional restrictions crucially affect lexical disambigua- mutability effect" as discussed in [Gentner and France, tion and that the discrimination rate improves as far as 1988].

37 References [Basili et al., 1996] It. Basili, M.T. Pazienza, and P. Ve- lardi. Integrating general-purpose and corpus-based verb classification. Computational Linguislies, 22(4), 1996. [Carpenter, 1992] B. Carpenter. The logic of typed fea- ture Structures. Cambridge University Press, Cam- bridge, Massachusetts, 1992. [Ciravegna e~ al., 1996] F. Ciravegna, A. Lavelli, D. Pe- trelli, and F. Pianesi. The GEPPETTO environment, Version 2.0.b. User Manual. Technical report, IRST, 1996. [Delmonte et al., 1996] R. Delmonte, G. Ferrari, A. Goy, L. Lesmo, B. Magnini, E. Pianta, O. Stock, and C. Strapparava. ILEX - un dizionario computazionale dell'italiano. In Proc. of 5 ~h Convegno Nazionale della Associazione Italiana per l'Intelligenza Artifi- eiale, Napoli, September 1996. [Gentner and France, 1988] D. Gentner and I.M. Fran- ce. The verb mutability effect: Studies of the combi- natorial semantics of nouns and verbs. In S. Small, G.W. Cottrell, and M.K. Tanenhaus, editors, Lezicai Ambiguity Resolution. Morgan Kaufman, San Mateo, California, 1988. [Jackendoff, 1990] Ray Jackendoff. Semantic Structures. Current Studies in Linguistics. The MIT Press, Cam- bridge, Massachusetts/London, England, 1990. [Magnini and Strapparava, 1994] B. Magnini and C. Strapparava. Costruzione di una base di conoscenza lessicale per l'italiano basata su WordNet. In Proc. of the 28 ~h International Congress of the Sociel~i Lin- guistica Italiana; Palermo, Italy, Ottobre 1994. [Magnini el al., 1994] B. Magnini, C. Strapparava, F. Ciravegna, and E. Pianta. Multilingual lexical knowledge bases: Applied WordNet prospects. In The Future of Dictionary - Workshop sponsored by Rank Xerox European Research Centre and ESPRIT Project Acquilex II, Grenoble, France, October 1994. [Miller, 1990] G. A. Miller. WordNet: "An on-line lexi- cal database", International Journal of Lexicography (special issue), 3(4):235-312, 1990. [Pustejovsky, 1995] J. Pustejovsky. The Generative - icon. The MIT Press, Cambridge, Massachusetts, 1995.

38