User Requirements and Functional Specification

of

the EuroWordNet project

Version 5, Final
October, 1996

Laura Bloksma£ Pedro Luis Díez-Orzas$ Piek Vossen£

Deliverable D001, WP1, EuroWordNet, LE2-4003
£ Computer Centrum Letteren, University of Amsterdam
$ Novell Linguistic Development, Antwerp

Identification number: LE-4003-D-001
Type: Document
Title: User requirements and functional specification of EuroWordNet
Status: Final
Deliverable: D001
Work Package: WP1
Task: T1
Period covered: March - June 1996
Date: October, 1996
Version: 5
Number of pages: 66
Authors: Laura Bloksma, Pedro Díez-Orzas, Piek Vossen
WP/Task responsible: Novell
Project contact point: Piek Vossen, Computer Centrum Letteren, University of Amsterdam, Spuistraat 134, 1012 VB Amsterdam, The Netherlands, tel. +31 20 525 4624, fax. +31 20 525 4429, e-mail: [email protected], http://www.let.uva.nl/CCL/EuroWordNet.html
EC project officer: Jose Soler
Status: Public
Actual distribution: Project Consortium, The EuroWordNet User-Group, The EuroWordNet WWW page
Supplementary notes:
Key words: Lexical semantic databases, Information Retrieval, Language Engineering

Abstract

In this document the general design of the EuroWordNet database is described based on the user-requirements and the technical state of the art for building multilingual semantic resources. The user-requirements are discussed from two different perspectives: the actual use of the resource in a multilingual information retrieval system developed by Novell Linguistic Development and the potential use of the resource by a diverse group of institutes and companies in Europe, constituted by the EuroWordNet user-group. The purpose of the latter group is to create a wider awareness of the use of this type of resources and to establish cooperation with other groups that build such resources to develop standards and make resources compatible.

The usage in an Information Retrieval Engine by Novell is taken as a starting point for the functional specification. In addition to the direct requirements of the user, the functional specification is based on the design of the Princeton WordNet1.5, the structure and content of the resources that will be used, the quality of the extraction tools and the limitations set by the project’s budget and time frame. Deviations from the Princeton WordNet1.5 design are due to:

• new aspects such as the multilinguality.
• inadequacy of the Princeton WordNet for Information Retrieval use.
• different structure of the available resources.
• quality of the tools for extracting information from these resources.
• the need to achieve maximal compatibility across the different resources.
• the copyright limitations of the results.
• the possibility to customize the resource by specific users.

In addition to the design of the data structure, a data viewer is described which makes it possible to view and compare the wordnets, and to export selections to a plain text format.

Status of the abstract: Complete
Received on:
Recipient’s catalogue number:

Executive Summary

Crucial for the semantic processing of information stored in the form of Natural Language is the availability of large generic lexicons with semantic information. The need for such resources is apparent when accessing large amounts of relatively unstructured information which is stored in various formats, covering different languages and cultures. A user has to anticipate that the information may be expressed using a variety of words or expressions, even in different languages. With a semantic database, semantically-related words can be grouped automatically, increasing the effectiveness of a search.

For English there is such a generic semantic resource: the WordNet database developed by George Miller and his research group at Princeton University (Miller et al. 1993). For other European languages, however, such databases with basic semantic relations do not exist or are not available, let alone a multilingual database in which several of these resources are combined. The aim of the EuroWordNet-project is to develop this multilingual database with basic semantic relations between words for several European languages (Dutch, Italian and Spanish). The EuroWordNet database will as much as possible be built from available existing resources and databases with semantic information developed in various projects. The use of the database will be demonstrated in an information retrieval environment. The expectation is that such a multilingual resource will improve the recall of documents in a meaningful way, not only of documents in each of the relevant languages, but also across these languages.

In this document we outline the user-requirements and the functional specification on which the design of the database will be based. EC-projects funded in the fourth framework (1995-1999) should follow three pre-defined stages:

Stage I: User Requirements and Functional Specification
Stage II: Development of the Demonstrator and Verification
Stage III: Demonstration

The main focus of the EuroWordNet project will be on Stage II (the building and the verification of the wordnets), with minimal work parts for Stage I and Stage III. Still, the project does not start from a detailed specification of the user-needs and market exploration. The two major reasons for this are that:

· semantic databases are still a novelty in linguistic technology, despite their potential value and use. Consequently, there are no studies and reports available on the need and use of such resources that could be used to describe the user-requirements.
· the use of linguistic technology is cross-sectional by nature: it could be integrated in a variety of Telematics applications that involve the processing of information. The range of applications and user-types makes it more difficult to discover the user-requirements with respect to this type of resource.

Since the budget of the project does not allow for extensive market research (by interviews and questionnaires), we have followed a more minimalist approach to account for the user-needs. The user-needs are addressed from two different perspectives:

· the actual demonstration of the resource in a multilingual information retrieval system developed by Novell Linguistic Development, which is a partner in the project.
· the potential use of the resource by a diverse group of institutes and companies in Europe, constituted by the EuroWordNet user-group.

The purpose of the latter group is to create a wider awareness of the use of this type of resources and to establish cooperation with other groups that build such resources to develop standards and make resources compatible. Software developers and telematic-users have not had much opportunity to experiment or gain experience with applications based on the semantic processing of information. We expect that clarity on the user-needs will arise from the availability of the multilingual database and the possibility for people to work with it, rather than the other way around.

To limit the scope of the work, the use of the database will primarily be designed from a single application-perspective: information retrieval. It is important to realize that Novell is not an end-user but a developer that has already built an Information Retrieval system that incorporates a semantic database, which is tested with WordNet1.5. Given this specific application and their experience, the requirements of Novell will therefore be relatively specific, which one may not expect from a general information retrieval perspective. Using an existing application of a major software company as a starting point, however, has the following advantages:

· it ensures a realistic demonstration of the resource which can convince people of the feasibility and usefulness of semantically-based technology, addressing real needs rather than theoretical solutions.
· users will not only see a complex data structure but also a useful and conceivable effect in a user-friendly interface.
· an existing application will require realistic features and lead to a realistic verification of the data.

The usage in an Information Retrieval Engine by Novell is thus taken as a starting point for the functional specification. In addition to the direct requirements of the user, the functional specification is based on the type of information that is covered by the Princeton WordNet1.5. This basically limits the scope of the data to the feasible and most important relations, about which there is a major consensus. Deviations from the Princeton WordNet1.5 design are due to:

· new aspects such as the multilinguality of the database.
· inadequacy of the Princeton wordnet for Information Retrieval use, which follows from some of the user-requirements.
· the nature of the information stored in the Machine Readable Dictionaries (MRDs) from which the EuroWordNet results will be derived. Some differences of the MRDs are more advantageous than the Princeton WordNet and others make it too complicated to convert to the Princeton structure.
· the possibility to (semi-)automatically extract this information from the available resources, given the existing technology, time and manpower.
· the need to achieve maximal compatibility across the different resources.
· not all the information from the MRDs can be made available because of copyright claims.
· the possibility for users to customize the database for their specific application without having to speak all the languages.

Finally, the document describes a data viewer which will be developed. The viewer will enable users to view and compare the wordnets from a bilingual perspective, and to export selections to a plain text format.

Table of Contents

1 Introduction

2 The User-requirements of the EuroWordNet project

2.1 Information Retrieval
2.1.1 Applications and functionality
2.1.2. Differences with respect to other resources
2.1.3. Recall vs. Precision
2.1.4. EuroWordNet contributions to Information Retrieval
2.1.5. Evaluation of improvements
2.1.6 EuroWordNet user-requirements
2.1.6.1. Data requirements
2.1.6.2. Architectural requirements
2.1.6.4. Interface requirements
2.1.6.5. Tools requirements
2.1.6.6. Product Integration Requirements
2.2 The EuroWordNet User-Group

3 The functional specification of EuroWordNet

3.1 The specification of the EuroWordNet data
3.1.1 The coverage of the vocabulary
3.1.2 The structure of the entries
3.1.3 The language internal relations
3.1.3.1 The language internal relations in the Princeton WordNet
3.1.3.2 Unifying nouns and verbs
3.1.3.3 Conjunction and disjunction of relations
3.1.3.4 Synonymy
3.1.3.5 Subtypes and Supertypes of meronymy
3.1.3.6 Differentiation of verb relations
3.1.4 The multilingual architecture
3.1.5 The top-concepts
3.1.6 Domains
3.1.7 Instances
3.1.8 Overview of the EuroWordNet relations
3.2 The EuroWordNet Database
3.2.1 Interface requirements
3.2.2 Exchange format

References

1 Introduction

The aim of the EuroWordNet-project [1] is to develop a multilingual database with basic semantic relations between words for several European languages (Dutch, Italian and Spanish). Such a database is freely available for English: the WordNet database developed by George Miller and his research group at Princeton University (Miller et al. 1993). It consists of semantic relations between English word meanings (so-called synsets) which can be accessed as a kind of thesaurus in which words with related meanings are grouped together. For example, a noun like “car” is linked to, among others, all words that have a hyponymy or isa relation or a meronymy or hasa relation with it, and a verb like “drive” to, among others, all words that have a hyponymy or an entailment relation with it [2]:

[Figure: simplified fragment of the WordNet network around the noun “car” and the verb “drive”. Nodes: object, vehicle, car, seat, wheel, engine, taxi, squad car, stockcar, traffic, tailback, go, drive, steer, turn, race. Edge labels: isa, hasa, entails.]
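The kind of network described above can be sketched as a small labelled graph. The Python fragment below is only an illustration: the class name `SemanticNetwork`, its methods, and the particular links entered (for example that a taxi is a kind of car) are assumptions made for this sketch, not the EuroWordNet data model, which is specified in section 3.

```python
from collections import defaultdict


class SemanticNetwork:
    """Toy store of labelled relations between word meanings (hypothetical)."""

    def __init__(self):
        # relation name -> source word -> set of target words
        self._links = defaultdict(lambda: defaultdict(set))

    def add(self, source, relation, target):
        self._links[relation][source].add(target)

    def targets(self, source, relation):
        """Words directly linked from `source` by `relation`."""
        return set(self._links[relation][source])

    def sources(self, target, relation):
        """Words that point to `target` via `relation` (e.g. its hyponyms)."""
        return {s for s, ts in self._links[relation].items() if target in ts}

    def ancestors(self, source, relation="isa"):
        """All words reachable by following `relation` transitively."""
        seen = set()
        stack = [source]
        while stack:
            for parent in self.targets(stack.pop(), relation):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


# Illustrative links only; they are assumptions, not project data.
net = SemanticNetwork()
for src, rel, tgt in [
    ("car", "isa", "vehicle"), ("vehicle", "isa", "object"),
    ("taxi", "isa", "car"), ("squad car", "isa", "car"), ("stockcar", "isa", "car"),
    ("car", "hasa", "wheel"), ("car", "hasa", "engine"), ("car", "hasa", "seat"),
    ("drive", "isa", "go"),
]:
    net.add(src, rel, tgt)

print(sorted(net.ancestors("taxi")))       # more general words above "taxi"
print(sorted(net.sources("car", "isa")))   # kinds of car
print(sorted(net.targets("car", "hasa")))  # parts of a car
```

Storing each relation type separately keeps hyponymy and meronymy lookups independent, which is convenient when further relation types are added later.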

The European wordnets will as much as possible be built from available existing resources and databases with semantic information developed in various projects. This will not only be more cost-effective but will also make it possible to combine information from independently created resources, making the ultimate database more consistent and reliable, while keeping the richness and diversity of the vocabularies of the different languages. The wordnets will be stored in a central lexical database system and the word meanings will be linked to synsets in the Princeton WordNet. Furthermore, we will merge the major concepts and words in the individual wordnets to form a common language-independent ontology (an ontology is the set of semantic relations between concepts). This will guarantee compatibility and maximise the control over the data across the different wordnets while language-dependent differences can be maintained in the individual wordnets.

[1] This document was created on the basis of discussion with the partners of the EuroWordNet project. The EuroWordNet project is funded by the EC (LE2-4003) and is a joint enterprise of the University of Amsterdam (co-ordinator of the project), the University of Sheffield, the Istituto di Linguistica Computazionale del CNR (Pisa), the Fundación Universidad-Empresa (a cooperation of Universities of Barcelona and Madrid) and Novell Linguistic Development in Antwerp. The duration is 3 years, the start date 1 March 1996.

[2] Here a simplified example is given. In practice, different subtypes of “isa” and “hasa” relations are distinguished as well as various other types of relations. These will be discussed in section 3.

LE 4003 EuroWordNet October, 1996

In this document we will describe the functional specification of the database and the user needs that it is based on. EC-projects funded in the fourth framework (1995-1999) should follow three predefined stages:

Stage I: User Requirements and Functional Specification
Stage II: Development of the Demonstrator and Verification
Stage III: Demonstration

The main focus of the EuroWordNet project will be on Stage II (the building and the verification of the wordnets), with minimal work parts for Stage I and Stage III. The status of the User Requirements and Functional Specification can likewise be described as follows.

The project does not start from a detailed specification of the user-needs and market exploration. The two major reasons for this are that:

· semantic databases are still a novelty in linguistic technology, despite their potential value and use.
· the use of linguistic technology is cross-sectional by nature: it could be integrated in a variety of Telematics applications that involve the processing of information.

Software developers and telematic-users have not had much opportunity to experiment or gain experience with applications based on the semantic processing of information. Semantic resources are hardly available (or only for English), and most certainly there are no multilingual resources with sufficiently extensive basic semantic information for the most common words. In this respect, we expect that clarity on the user-needs will arise from the availability of such a resource and the possibility for people to work with it, rather than the other way around. The development of a multilingual wordnet will clarify the emerging use of future applications and will contribute to the development of standards in this area.

To limit the scope of the work, the use of the database will primarily be designed from a single application-perspective: information retrieval. Information retrieval (IR) is not only seen as the most direct usage of a multilingual wordnet (taking least effort to full product development) but it also directly relies on the information that is being provided. Poor performance in information retrieval tasks can relatively easily be translated into specific deficiencies in the resource. This primary usage will be tested and demonstrated by Novell Linguistic Development. It is important to realize that Novell is not an end-user but a developer that has already built an Information Retrieval system that incorporates a semantic database, which is tested with WordNet1.5. Given this specific application and their experience, the requirements of Novell will therefore be relatively specific, which one may not expect from a general information retrieval perspective.

This does not mean, however, that the resource is developed for information retrieval purposes only (the specific requirements do not exclude a wider usage of the results). In addition to the direct scope of the project, we have therefore established a European User-Group of wordnet-builders and users that cover a wider range of languages and applications. The members of the User-Group have the possibility to give feedback on early releases of the project results (including sample, databases, documentation, definition of standards and data formats) which will be taken into account in the incremental building of the resources. Feedback of the User-Group will result in a first investigation of the market and user-needs with respect to semantically-based applications.

This document is then structured in two major sections:

2 The User-requirements of the EuroWordNet project
3 The Functional Specification of the EuroWordNet results

Section 2 is further subdivided into a section that specifically deals with the use of semantic databases in information retrieval and a section which describes the role of the User-Group from a more general, wider perspective. Given the user-requirements, the functional specification of the resource will be outlined in section 3, where the following additional criteria play a role:

· the type of information that is covered in WordNet1.5
· the nature of the information stored in the Machine Readable Dictionaries (MRDs) from which the EuroWordNet results will be derived
· the possibility to (semi-)automatically extract this information from the available resources, given the existing technology, time and manpower.

Even though, in principle, the design of WordNet1.5 is followed, the resources are in some cases too different (in a positive or negative sense) or the content/structuring in WordNet1.5 is too inadequate for the use in information retrieval. In other cases, such as the multilinguality or some specific user-requirements, the EuroWordNet results will cover completely new aspects with respect to WordNet1.5.


2 The User-requirements of the EuroWordNet project

2.1 Information Retrieval

The possibility of advanced, interactive multimedia access to world-wide electronic services not only opens up many new possibilities and developments in the economic and educational area, but it also makes it possible for people who are restricted in their geographic mobility to take part in it. It goes without saying that the amount and detail of information being presented to such users is growing explosively. While access and speed are being improved, the user is more and more faced with the problem of how to deal with the massive amount of information and where to find what is needed in this huge network of services. Regardless of format (structured/unstructured) or medium (textual, visual, sound), all this information has to be labelled, classified and systematized in a particular way, in order to be retrieved by the user with a reasonable effort.

The easiest, most direct and flexible way to access any kind of information is by means of natural language interfaces. Such interfaces, however, have to be able to match queries in common, general language to (expert-)classifications and document titles, or to summarize and even translate documents. Natural Language Processing (NLP) has to support Information Retrieval (IR) in two main areas:

· creating user friendly interfaces to access any kind of information

· improving retrieval in free texts

These general requirements can be further specified in terms of the following technical (NLP-based) innovations which have recently been considered in the IR-community:

· the necessity of expanding the query with minimal user intervention,

· the necessity of improving precision,

· the incorporation of relevance ranking techniques to structure the output,

· the convenience of a natural language interface to avoid boolean operators.

To achieve this, NLP-techniques and resources have to be developed in parallel with IR applications. Improvements in Information Retrieval depend on solutions for difficult problems in the Natural Language Processing area. Resolving these problems often not only involves practical solutions, but also new theoretical approaches: the industry needs to create new applications to quickly answer the new challenges, even when the scientific state of the art and resources are not ready for it. This can also be said of knowledge representation and semantics in relation to the current standards in Information Retrieval Systems (IRS): to handle linguistic contents and not only formats. The situation is even worse with respect to multilingual IRSs.

In this respect, we expect that a semantic resource such as EuroWordNet will contribute to the new Information Retrieval standard in several very important ways:

· The EuroWordNet resource will help to answer some Information Retrieval needs, like query and recall enhancement.

· As a resource in several languages, EuroWordNet will contribute to the creation of Information Retrieval Systems for some European languages other than English. [3]

· Since there is almost [4] a complete absence in the market of an interlingual semantic resource to search across several languages, EuroWordNet will help to establish new markets and new IR standards due to its multilingual character.

The innovation of the last point in particular, a Multilingual IRS, is enormous: users can retrieve important documents in other languages without knowing those languages, something impossible until now. The possibility of searching information in free text in several languages at the same time will increase the productivity and the accuracy of any environment where a large number of documents is used every day. This implies that the end users of this new generation of Information Retrieval products will range from specialized and professional users to practically any type of user. The implications of this resource for the user community will be extremely significant, because this type of semantic database can produce a qualitative jump for information retrieval products. The ability to do multilingual searches will be a major improvement, and it will provide new services and open new market possibilities.

2.1.1 Applications and functionality

The development and testing of an IRS with all the performance factors goes beyond the scope of this project. The user-requirements of a full IRS include aspects like flexibility, on-line help, visual representation of the results, allowing for different retrieval techniques and criteria (such as thesaurus or fixed-indexing systems, automatic key-word browsing, information on authors, publishers, institutions, reviews, date, etc.). The aim of this project is not to develop an IRS, but to provide a generic basic resource that could be included in such a broader IRS. Given the current state of the art in IR, the availability of generic resources like WordNet is typically expected to help non-expert users when retrieval by indexing is problematic because:

· The indexing system does not cover the desired aspect or facet that a user is looking for.
· The words chosen by the user are not included in the indexing key words.
· The user speaks another language.

[3] For some languages there is no IRS, and for others the quality is inferior to that of the English IRSs.

[4] Today there are people who claim to have multilingual Information Retrieval Systems, but these are not really what we understand by a “multilingual” IRS: an information retrieval system that can handle contents of several languages. For example, a multilingual IRS would allow a non-English-speaking user to retrieve English documents from a query that originally was made in another language. The existing multilingual IRSs can deal with different languages but cannot go from one language to another. They are normally based on a different technology (automatic generation of knowledge bases from documents instead of semantic resources) and the multilingual aspect is provided by the capability of handling different character sets for different languages.


Once EuroWordNet is available, it will be possible to automatically enhance the query and to improve the recall of queries via semantically linked variants in any of the four languages treated by EuroWordNet and across all these languages. More specifically, the resource will be used for the following IR tasks:

· Automatic query enhancement.
· Semantic indexing of documents.
· Improvement of recall.
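As a rough illustration of automatic query enhancement, the sketch below expands a query word with its synonyms in the same language and, through a shared concept identifier, with equivalent words in other languages. Everything in it is assumed for the example: the concept identifiers (`c_car`, `c_taxi`), the tiny word lists, and the function names `concepts_for` and `expand_query`. The actual multilingual architecture is specified in section 3.1.4.

```python
# Hypothetical data: each concept id maps to its words per language.
CONCEPTS = {
    "c_car": {
        "en": ["car", "automobile"],
        "nl": ["auto", "wagen"],
        "es": ["coche", "automóvil"],
        "it": ["automobile", "macchina"],
    },
    "c_taxi": {
        "en": ["taxi", "cab"],
        "nl": ["taxi"],
        "es": ["taxi"],
        "it": ["taxi"],
    },
}

# Hypothetical hyponymy links between concepts (child -> parent).
ISA = {"c_taxi": "c_car"}


def concepts_for(word, lang):
    """Concept ids whose word list for `lang` contains `word`."""
    return [cid for cid, words in CONCEPTS.items() if word in words.get(lang, [])]


def expand_query(word, lang, target_langs, with_hyponyms=False):
    """Return {language: sorted list of search terms} for one query word."""
    cids = set(concepts_for(word, lang))
    if with_hyponyms:
        # Add the direct hyponym concepts of every matched concept.
        cids |= {child for child, parent in ISA.items() if parent in cids}
    expansion = {}
    for target in target_langs:
        terms = set()
        for cid in cids:
            terms.update(CONCEPTS[cid].get(target, []))
        expansion[target] = sorted(terms)
    return expansion


print(expand_query("car", "en", ["en", "nl"]))
print(expand_query("coche", "es", ["en"], with_hyponyms=True))
```

A query word that has several meanings would match several concepts in this scheme, so every meaning would be expanded unless the user or a disambiguation step selects one; section 2.1.3 discusses the effect of this on recall and precision.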

The EuroWordNet resource can also be extensively used in other NLP and AI applications (with, in certain cases, some extensions). For instance, the Natural Language Interface standard can make use of semantic networks in order to obtain a better performance in natural language understanding by considering the lexical semantic relationships in the analysis of the sentences or texts. In Artificial Intelligence such resources have been used to perform reasoning operations such as inferencing or abstraction. Also as a writing tool, the resource is a qualitative jump since it joins the traditional dictionary and thesaurus into a single reference work (like the ConceptNet 1.0 in WordPerfect 7.0). For other applications see section 2.2.

2.1.2. Differences with respect to other resources

Concerning semantic networks and knowledge bases for Information Retrieval and search engines, two main approaches have been considered: one that uses a fully coded semantic network (normally linguistically oriented), the other that uses built-in knowledge bases from a given set of documents.

As products that build knowledge bases from a set of documents, we can mention:

· Software Scientific products, which use multilingual text searching and a Concept Engine that derives a general-purpose thesaurus, a topic tree or a knowledge base from a set of documents;

· EpilepsySpace products (University of California), which also have a so-called concept-based search tool. It contains a self-organizing map and a searchable concept space or thesaurus, both generated automatically, but only for English.

As examples of the first approach, we can mention among many others:

· ConSearch (ReadWare),

· RetrievalWare (Excalibur), which uses monolingual knowledge bases or semantic networks.

EuroWordNet is different from these technologies because of its multilinguality, its flexibility, its coverage and its non-specific application approach. Furthermore, the project starts from dictionaries and semantic criteria, which is crucial because:

· the semantic links are verified and thoroughly checked to avoid incoherence and to obtain much better disambiguation scenarios;

· the different resources can be linked through interlingual relationships and correspondences, necessary to build an Interlingual Information Retrieval System.

This is exactly the kind of application which Novell Linguistic Development wants to develop and for which the results of EuroWordNet would be the correct resource [5].

2.1.3. Recall vs. Precision

Today there is an academic discussion about the effect of using resources such as WordNet in Information Retrieval Systems for English. The expectation is that using a semantic database to enhance and expand queries gives much better recall than not using one, but does not improve precision.

In spite of this, there are some reports that show poor recall when WordNet is used for query enhancement. For example, Smeaton, Kelledy and O'Donnell (1995) report on retrieval experiments at Dublin City University with WordNet 1.5 (run called DCU952), where the rates of recall and precision are lower than expected. On the other hand, the results obtained by the EXCALIBUR TREC-4 System (Nelson, 1995) appear to be rather different. Nelson [6] shows two query construction or expansion tests.

First query construction test:

Original topic ⇒ Pre-Processing ⇒ Choose Meanings ⇒ Expansion ⇒ Choose Expansions ⇒ Weight Terms

Second query construction test:

Original topic ⇒ Add Terms ⇒ Group Terms ⇒ Modified Topic ⇒ Pre-Processing ⇒ Choose Meanings ⇒ Expansion ⇒ Choose Expansions ⇒ Weight Terms

[5] This type of resource can even improve the automatic generation of knowledge bases by performing coherence and consistency checks.

[6] Nelson does not explicitly mention in his paper the use of WordNet 1.5 as the semantic network resource, but from comparing WordNet with the semantic network in RetrievalWare 5.0, which has been used for this experiment, it follows that they are using the WordNet resource with some modifications and expansions (cf. the meanings in RetrievalWare 5.0, which carry WN1.5 glosses). The expansion also seems to involve the number of meanings per lexical item, leading to more polysemy, which is seen as a problem by the DCU experiment.

In the second test they added the possibility of structuring the query in advance. The recall results of the second test seem to be quite good (4500 relevant documents retrieved, against fewer than 500 for DCU952). The results reported on precision are less satisfying. However, these are not directly caused by the semantic network but depend on other technologies such as weighting and relevance ranking techniques. The difference between the two sets of results seems to be explained by the use of different query expansion engines. This aspect has to be considered in the Demonstration stage of the EuroWordNet project.

Our own experience reveals that the current usage of WordNet 1.5 as a resource for Information Retrieval gives a spectacular increase in recall (see Table A1), but can also decrease precision as a side effect. The effect of the query expansion is different for recall and for precision. Whereas recall is the same with and without meaning selection, this is not the case for precision [7].

Nevertheless, improvement of the precision goes beyond the scope of the EuroWordNet project. It is obvious that other engines are necessary on the document side to improve precision, such as mapping the expanded and disambiguated query onto the hits that have been found in the texts (as the Excalibur experiment shows). These engines still need semantic network resources to accomplish other tasks (like word meaning disambiguation in texts), but also syntactic analysis, statistical information at the word-sense level, etc.

Table A1: Results of Access String vs. Word Meaning Investigation

Total No. of Files: 3125. Total No. of Relevant Documents: 30

Case  Description of Query                       No. of     No. of     Precision  Recall
No.                                              Documents  Relevant
                                                 Retrieved  Documents
1     Original query                             42         15         35.71%     50.00%
2     Original query with synonyms for all       56         23         41.07%     76.67%
      meanings of the query words
3     Original query with synonyms based on      49         23         46.94%     76.67%
      the correct word meanings
4     Original query with synonyms and           390        25         6.41%      83.33%
      hyponyms for all meanings of the
      query words
5     Original query with synonyms and           163        25         15.34%     83.33%
      hyponyms based on the correct word
      meanings

7 The identical recall figures for disambiguated and non-disambiguated queries are due to the fact that the documents with which the queries are matched are not disambiguated. This fact not only explains the equality but also makes clear that the results for a disambiguated document will be even higher, or that the user does not have to disambiguate the document if she/he thinks that these figures are high enough.
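The precision and recall figures in Table A1 follow directly from the counts: precision is the number of relevant documents retrieved divided by the number of documents retrieved, and recall is the number of relevant documents retrieved divided by the total number of relevant documents (30). The following minimal Python sketch recomputes them; the function and variable names are illustrative only and do not belong to any Novell tool.

```python
# Recompute precision and recall for the five cases of Table A1.
# All names below are illustrative assumptions, not part of an existing system.

TOTAL_RELEVANT = 30  # relevant documents in the 3125-file collection

# case number -> (documents retrieved, relevant documents retrieved)
CASES = {
    1: (42, 15),
    2: (56, 23),
    3: (49, 23),
    4: (390, 25),
    5: (163, 25),
}


def precision_recall(retrieved, relevant_retrieved, total_relevant=TOTAL_RELEVANT):
    """Return (precision, recall) as percentages rounded to two decimals."""
    precision = 100.0 * relevant_retrieved / retrieved
    recall = 100.0 * relevant_retrieved / total_relevant
    return round(precision, 2), round(recall, 2)


RESULTS = {case: precision_recall(*counts) for case, counts in CASES.items()}

if __name__ == "__main__":
    for case, (precision, recall) in RESULTS.items():
        print(f"case {case}: precision {precision:.2f}%  recall {recall:.2f}%")
```

Comparing cases 2 and 3, or 4 and 5, shows the pattern discussed above: meaning selection leaves recall unchanged but raises precision.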

2.1.4. EuroWordNet contributions to Information Retrieval

Novell will validate EuroWordNet for their Information Retrieval System (IRS). The Novell LTD Information Retrieval SDK is a product comprising the following basic components:

• Indexing of documents
• Morphology for query enhancement
• ConceptNet 1.0 for semantic enhancement of the query
• Fast search system
• Indexation
• Relevancy ranking of the results and Precision enhancer
• Topic Identifier
• Natural Language Interface

The Information Retrieval Solution from Novell LTD is a full-text indexing IRS which does not use key words. Instead the Semantic Database is available to the user to enhance the query in order to search for all possible related terms. The ConceptNet 1.0 is part of the engine of QuickFinder®. The design and functionality of the ConceptNet 1.0 is intended as a multilingual system. Novell is currently developing such a multilingual system (mainly for English, Spanish, French, German, Dutch and Italian). At this point, the English ConceptNet® 1.0 has been developed using WordNet1.5 as the main data source within a more sophisticated semantic database structure (or network) with different object types, semantic relationships, properties and features. The use of QuickFinder®, as an open system in PerfectOffice® 7.0 is not constrained to any particular type of documents or knowledge area. QuickFinder® runs on the main platforms: Windows, Macintosh and Unix. For our product Novell has already tested and verified WordNet1.5 inside our ConceptNet® 1.0 with satisfactory results in the recall in both manual and automatic expansion. Novell will use these quality, quantity and performance factors to evaluate EuroWordNet in a multilingual context.

The contributions that Novell expects from EuroWordNet in IR are basically two:

• Firstly, from a monolingual perspective, the improvement of the RECALL.

• Secondly, from a multilingual perspective, the possibility of performing MULTILINGUAL RETRIEVAL with a monolingual query.

There is also a third possibility: having such a resource may open new avenues of research and development, such as indexing documents, abstracting and classifying documents, etc.8

Nevertheless, the step forward that EuroWordNet implies for Information Retrieval applications does not solve all the problems, but it provides desirable improvements and opens new development possibilities.

8 Other additional values are: - Ease of adapting the resources to the specific applications.

2.1.5. Evaluation of improvements

The verification will involve reviewing the resources with respect to the user needs and the functional specification. It will at least address:

· the number of senses and words delivered,
· the kind of information stored (frequent meanings, central vocabulary, etc.),
· the richness of information stored (number of lexical items per synset, etc.),
· the documentation of the information,
· the explicitness of the criteria that define the relations,
· the quality of the relations (easy to understand by the user, how the definition is encoded by the links etc.)
· integrability with other lexical resources.

Inspection will be done by statistical measuring and examining diverse samples of data.

Given the specific use of the wordnets in an information retrieval application, different test-sets and methods for evaluation will be developed. The test situations will involve different sets of queries (in four languages) applied to a variety of corpora (in four languages), illustrating the recall effect of expanding the keywords using the WordNets. The tests will be differentiated for different areas of the WordNets and will make use of different types of links. Furthermore, we will investigate to what extent the general vocabulary is complementary to terminology-based retrieval, whether it helps non-expert users dealing with terminology based text classifications and to what extent different information retrieval tasks have any effect on these.

Specific retrieval situations will be set up in which traditional thesaurus-based and fixed-indexing systems do not give satisfactory results, while we expect that keyword expansion using the WordNets will lead to a better recall. The performance of the tests will then be used as a measurement of the additional functionality and quality of the WordNets. Finally, the queries will be designed to also elicit expected problems in using the WordNets, such as the decrease in precision, the problem of lexical ambiguity, etc.

Novell will apply the following approaches and criteria:

(i) Approaches:

Monolingual and multilingual application tests. General vocabulary and terminology-based tests.

- User-friendliness with regard to non-expert users: will retrieval be more flexible because non-expert users can use their own words for making queries for documents linked to a domain-specific thesaurus?
- This will include both coverage of the general vocabulary and density and precision of the relations.

(ii) Main IR Criteria:

Monolingual recall. Multilingual recall. User involvement in query expansion.

(iii) Possible criteria:

· Integration with precision technologies (the rate of precision enhancement does not only depend on the resource that is used, but also, and basically, on the methods and algorithms that are applied).
· Integration with Word Sense Disambiguation technologies (different links that help to disambiguate senses by means of general (non ad hoc) algorithms).
· Integration with relevance ranking and document classification technologies.

Different tests will be used to evaluate these criteria. Some of the recall tests are:

· Apply a set of queries in different languages to both multilingual and monolingual corpora WITHOUT EUROWORDNET.

· Apply a set of queries in different languages to both multilingual and monolingual corpora WITH MORPHOLOGY EXPANSION AND WITHOUT EUROWORDNET.

· Apply a set of queries in different languages to both multilingual and monolingual corpora WITHOUT MORPHOLOGY EXPANSION AND WITH EUROWORDNET.

· Apply a set of queries in different languages to both multilingual and monolingual corpora WITH MORPHOLOGY EXPANSION AND WITH EUROWORDNET.

· Alternatively, other methods will be applied: with pre-processing/without pre- processing; with meaning selection/without meaning selection; with weights/without weights; etc.

· Degree of ease of use and access for non-expert users (different criteria, see above)

· A regression test will be developed to check different versions or updates.
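The four recall tests listed above form a two-by-two design: morphology expansion on or off, combined with EuroWordNet expansion on or off. A minimal Python sketch of how such a test matrix could be enumerated and scored is given below. The `search` callable, its keyword arguments and the micro-averaged recall measure are assumptions made for illustration; they do not describe the interface of the Novell Information Retrieval System.

```python
from itertools import product

# The four combinations of (morphology expansion, EuroWordNet expansion).
CONFIGS = [
    {"morphology_expansion": morph, "eurowordnet_expansion": ewn}
    for morph, ewn in product((False, True), repeat=2)
]


def run_recall_tests(search, queries, relevant):
    """Return recall per configuration.

    search   -- hypothetical callable: search(query, morphology_expansion=...,
                eurowordnet_expansion=...) -> set of document ids
    queries  -- list of query strings
    relevant -- dict mapping each query to its set of relevant document ids
    Recall is micro-averaged over all queries (an assumption of this sketch).
    """
    total_relevant = sum(len(relevant[q]) for q in queries)
    results = {}
    for config in CONFIGS:
        found = sum(len(search(q, **config) & relevant[q]) for q in queries)
        key = (config["morphology_expansion"], config["eurowordnet_expansion"])
        results[key] = found / total_relevant
    return results
```

Because the same query set and corpus are used in every configuration, differences in recall can be attributed to the expansion step rather than to the test data. The same matrix can serve as a regression test when it is rerun on each new version of the resource.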

All the tests will be performed using the Novell Information Retrieval System and its specific query enhancement module. It is necessary to make a distinction between the EuroWordNet resource and the Information Retrieval system and capabilities with respect to the results of the verification. For instance, the correct handling of multiword expressions is very important for the Information Retrieval test. Without multiword handling some critical tasks can be affected:

• Multi-word (mw) matching. Matching a mw in text with a mw in a dictionary (like the wordnets), and vice versa, is not a straightforward task. In order to match mw forms properly it is necessary to accomplish two other tasks: generation and rooting.

• Multi-word forms generation. In order to generate the right forms of a mw, we need to know which components have to be inflected and which don’t. There are many different possibilities:

divorce lawyer, only “lawyer” needs inflection
governor general, both words need inflection
kick the bucket, only “kick” needs inflection
make a remark, all three words need inflection

• Multi-word rooting. In this case we mirror the problems of the former paragraph9.

EuroWordNet cannot provide all these functionalities, but it can provide some multiword data (lexical items for certain meanings that are multiword expressions) insofar as they are available in the resources.
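To illustrate the generation task described above, the following Python sketch stores, for each multiword entry, a flag per component saying whether that component takes inflection. The class, the function names and the two toy inflection rules are hypothetical; a real system would call a full morphological component instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MultiwordEntry:
    """A multiword expression with one inflection flag per component (hypothetical format)."""
    components: tuple  # e.g. ("divorce", "lawyer")
    inflects: tuple    # e.g. (False, True): only "lawyer" is inflected


def generate_form(entry, inflect):
    """Apply `inflect` (a callable word -> inflected word) to the flagged components only."""
    if len(entry.components) != len(entry.inflects):
        raise ValueError("one inflection flag is needed per component")
    return " ".join(
        inflect(word) if flag else word
        for word, flag in zip(entry.components, entry.inflects)
    )


# Toy inflection rules, good enough for the regular words used here.
def naive_plural(word):
    return word + "s"


def naive_past(word):
    return word + "ed"


DIVORCE_LAWYER = MultiwordEntry(("divorce", "lawyer"), (False, True))
KICK_THE_BUCKET = MultiwordEntry(("kick", "the", "bucket"), (True, False, False))

if __name__ == "__main__":
    print(generate_form(DIVORCE_LAWYER, naive_plural))
    print(generate_form(KICK_THE_BUCKET, naive_past))
```

Rooting would run in the opposite direction: the same flags indicate which components of a form found in text must be reduced to their base form before the dictionary lookup.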

2.1.6 EuroWordNet user-requirements

Novell is not an end-user and does not act as such in the EuroWordNet Project. Novell is a developer user with the objective of building a specialized application using a specific resource and database. This application and the supporting database already exist (incorporating WordNet1.5) and there is considerable experience in using it. Consequently, the requirements also need to be as specific as possible. We can group them as follows:

• Data requirements
• Architectural requirements
• Transportation format requirements
• Interface requirements
• Maintenance requirements
• Product integration requirements

2.1.6.1. Data requirements

The data requirements can be grouped as follows:

Lexicon coverage
Lexical item level
Meaning level
Relationship level
Attribute/label level
Quality of semantic representations

Verification at these levels will be performed from both monolingual and multilingual perspectives, as well as with regard to functionality and language independence (see section 3, Functional Specification).

9 The multiword handling problem was raised by Antonio Sánchez Valderrábanos (Novell LTD) in an Internal Report as a comment to this document.

1. Monolingual data requirements

1.1. Coverage

1.1.1. Number of words and distribution per part of speech

• It has to contain a sufficient number of words and meanings to be useful in a product (min. 40.000 word-meanings and 20.000 words).

• The part of speech distribution has to be around 75% nouns and 25% verbs.

• The coverage of the vocabulary has to reflect corpus frequencies.

• It must be possible to handle multiword expressions.

1.1.2. General vocabulary and domain-specific vocabulary

• It has to contain at least 75% of the general vocabulary or central lexicon for each language resource.

• It has to contain a proportion of sub-language to show the capability to vertically enhance the resources. This also includes the possibility to add user-specific instances of concepts to the general resources (car registration numbers, names of employees, etc.).

1.2. Spelling of Lexical items and glosses

• The lexical items and the glosses need to be correctly spelled. If there is more than one spelling for a given language, one standard will be followed, e.g. USA English vs. UK English.

1.3. Meanings

• Different kinds of concepts will be defined as different kinds of objects, e.g. a domain-concept (sport, tool, artefact) will be one type of object in the database and a semantic-meaning another type. (See the Novell ConceptNet® 1.0 structure).

• It has to show the most relevant ambiguity for the high-frequency words by means of the links and formal features.

• Every meaning needs to have a gloss that explains the sense. This gloss will be stored in the Interlingual Index (see section 3, Functional Specification).

• A distinction should be made between two different record types: meanings, synsets or classes (being denotational objects) and instances or references (being concrete referential objects of extra-linguistic objects).

1.4. Density and precision of the relations

· The average of different links per meaning needs to be 3 to ensure the tri-dimensional character of the network. One link has to be a monolingual link (normally hyperonym, except for the tops); the second will be a link from the individual wordnet to the interlingua; the third should be a link to amplify the context of the meaning, like for example Domain, and it can be placed in the interlingua to be used for all the languages.

· The average of links (the same or different link type) per meaning has to be 4 to ensure a minimal description of the meaning in terms of links. This means that 40.000 meanings minimally have 160.000 links. Those four links will be the three mentioned above plus another monolingual link (e.g. all synsets, except for the bottoms, must have more than one hyponym link to get a good hierarchy shape).

· The relationships should preferably include the following aspects:

- relations within a Part of Speech: (hammer - tool, handle - door, walk - move, feed - eat, ...)
- relations across Parts of Speech: (organisation - organise, eat - food, ...)

· Attributes are allowed in the links to differentiate behaviours; e.g. meronym-member and meronym-part will be two different links, conjunctive hyperonym and disjunctive hyperonym can be the same link with two different attributes (see section 3, Functional Specification).

· As a rule of thumb, we can state that two meanings cannot share more than one link type, i.e. cat cannot be both a hyponym and a holonym of animal. Such rules could be used for consistency checking and quality assurance. However, the actual constraints on the allowed combinations of rules between pairs of synsets will ultimately depend on the definition of the links and the developed criteria. For instance, if you allow telic relationships between two nouns like ‘transmission’ and ‘transmitee’ you might also want to link them using the ‘derived from’ link. The actual definitions of the links and the constraints are crucial for the quality assurance of the project and need to be worked out.
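The density figures and the rule of thumb above lend themselves to automatic checking. The following Python sketch assumes that links are available as (source meaning, link type, target) triples; this representation and all names in it are illustrative assumptions, not the EuroWordNet format.

```python
from collections import defaultdict

MIN_AVG_LINK_TYPES = 3   # average number of different link types per meaning
MIN_AVG_LINKS = 4        # average number of links per meaning


def link_statistics(meanings, links):
    """Return (average links per meaning, average distinct link types per meaning).

    links is an iterable of (source, link_type, target) triples.
    """
    per_meaning = defaultdict(list)
    for source, link_type, _target in links:
        per_meaning[source].append(link_type)
    count = len(meanings)
    avg_links = sum(len(per_meaning[m]) for m in meanings) / count
    avg_types = sum(len(set(per_meaning[m])) for m in meanings) / count
    return avg_links, avg_types


def pairs_sharing_several_link_types(links):
    """Return the (source, target) pairs connected by more than one link type."""
    types_per_pair = defaultdict(set)
    for source, link_type, target in links:
        types_per_pair[(source, target)].add(link_type)
    return {pair: types for pair, types in types_per_pair.items() if len(types) > 1}


# Toy data with link type names invented for the example.
EXAMPLE_MEANINGS = ["cat", "animal"]
EXAMPLE_LINKS = [
    ("cat", "hyponym_of", "animal"),
    ("cat", "holonym_of", "animal"),   # violates the rule of thumb
    ("cat", "eq_link", "ili:0001"),
    ("cat", "domain", "zoology"),
    ("animal", "eq_link", "ili:0002"),
]
```

With an average of 4 links per meaning, 40.000 meanings give the minimum of 160.000 links stated above. Pairs reported by the second function are only candidates for inspection, since the final constraints depend on the link definitions.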

1.4.1 Hierarchical links (Organization)

The relationships need to consider at least the following issues:

· All the relationships should be bi-directional, that is: if word sense A is related to word sense B then word sense B is related to word sense A. However, this does not imply that also the semantic implication is bidirectional (this is only the case for genuine synonymy).

· All the meanings need to be attached to a global backbone (normally hyperonym/hyponym). No islands are allowed.

· All the relationship types and subtypes need to be clearly defined by the constraints and characteristics of applicability.

• The network has to be organizable as different layers. Different types of links should not be mixed, e.g. a synonym link cannot be used for genuine synonyms and coordinates.

• Links between different kinds of objects will be well differentiated and defined.

• The project should define which Badly Formed Net Structures (bad hierarchy shapes) are not allowed and define what checks can detect them, e.g. circularity and side-loops.

• It has to produce a reasonable number of tops for both nouns and verbs so that maintenance and expansion of the network is not too complicated. The circa 500 tops for verbs in WordNet 1.5 exceed the number of tops that can be handled (see Díez-Orzas 1996).

• The number of items per level in the network should not make it impossible to use and navigate through it (e.g. the first level contains 15 synsets, the second level 300 synsets, third level 10.000; it is practically impossible for the end-user to navigate from 300 to 10.000 items).
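Some of the structural requirements above (tops, islands, circularity) can be verified mechanically. The Python sketch below assumes the hierarchy is given as a mapping from each synset to the set of its hyperonyms, and it treats an island as a synset with neither hyperonyms nor hyponyms. Both the representation and this definition of an island are assumptions of the sketch; side-loops are not covered.

```python
def find_tops(hyperonyms):
    """Synsets without any hyperonym."""
    return {synset for synset, parents in hyperonyms.items() if not parents}


def find_islands(hyperonyms):
    """Tops that are nobody's hyperonym, i.e. synsets detached from the backbone."""
    has_hyponym = {parent for parents in hyperonyms.values() for parent in parents}
    return {synset for synset in find_tops(hyperonyms) if synset not in has_hyponym}


def find_cycle(hyperonyms):
    """Return one circular hyperonym chain as a list, or None if there is none.

    Depth-first search with three states; recursive, so intended for small
    examples rather than a full wordnet.
    """
    unvisited, in_progress, done = 0, 1, 2
    state = {synset: unvisited for synset in hyperonyms}

    def visit(node, path):
        state[node] = in_progress
        path.append(node)
        for parent in hyperonyms.get(node, ()):
            if state.get(parent, unvisited) == in_progress:
                return path[path.index(parent):] + [parent]
            if state.get(parent, unvisited) == unvisited:
                found = visit(parent, path)
                if found:
                    return found
        state[node] = done
        path.pop()
        return None

    for synset in hyperonyms:
        if state[synset] == unvisited:
            found = visit(synset, [])
            if found:
                return found
    return None
```

The number of tops returned by `find_tops` can be compared directly with the limit the project considers manageable.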

1.4.2 Non hierarchical links (Richness)

• It has to contain at least an average of 1.8 of synonymy and 1.3 of polysemy (figures from WN1.5): i.e. there must be sufficient variants in the synset and there must be a sufficient number of well-motivated meanings.

• Some rules and policies need to be established to formalize lexical items and access strings: i.e. accessing ‘State’ and ‘state’ has to result in different lists of senses. Especially diacritics have to be handled properly.
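The synonymy and polysemy figures can be measured on any wordnet that lists its word senses. The sketch below reads synonymy as the average number of variants per synset and polysemy as the average number of meanings per word; this reading, the (word, synset) pair format and the function names are assumptions. Access strings are compared exactly, so ‘State’ and ‘state’ yield different sense lists.

```python
def synonymy_and_polysemy(variants):
    """Return (average variants per synset, average meanings per word).

    variants is an iterable of (word, synset_id) pairs, one per word sense.
    """
    pairs = set(variants)
    words = {word for word, _ in pairs}
    synsets = {synset for _, synset in pairs}
    return len(pairs) / len(synsets), len(pairs) / len(words)


def senses_of(variants, access_string):
    """Synsets reachable from an access string, using exact (case-sensitive) matching."""
    return sorted(synset for word, synset in set(variants) if word == access_string)


# Toy data invented for the example.
EXAMPLE_VARIANTS = [
    ("state", "s1"), ("nation", "s1"), ("country", "s1"),
    ("state", "s2"),
    ("State", "s3"),
    ("country", "s4"),
]
```

On the toy data both ratios are 1.5; the requirement above would compare the measured values with 1.8 and 1.3.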

1.5. Features and properties assigned to the meanings

• All the information to recognize, use and maintain the semantic network has to be stored at the following three levels: Lexical item, Meaning, Lexical item in a given Meaning (synset). Consequently, it has to be possible to get access to a word with all its meanings, to a meaning regardless of the associated words, and to any specific combination of both.

• The information stored should address at least:

- Unique identification number (one per meaning)
- Part of Speech (POS)
- Sense number (starting from 1 for each Part of Speech)
- Usage label (formal, argot,...)
- Status (on going work status)
- Sub-language label (General, Geographical Name, etc.)
- Centrality of variants (what is the neutral word)
- Corpus frequencies of words

· The data structure has to take into account possible extensions with morphological and grammatical information.

· The data structure has to take into account possible extensions with conceptual/semantic features.
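The three storage levels named above (lexical item, meaning, lexical item in a given meaning) and the three access paths can be sketched as follows. The class and field names are hypothetical and chosen only to mirror the list of stored information; this is not the EuroWordNet or ConceptNet record format.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class LexicalItem:                 # level 1: the word itself
    form: str
    corpus_frequency: int = 0


@dataclass
class Meaning:                     # level 2: the meaning (synset)
    meaning_id: str                # unique identification number
    pos: str                       # part of speech
    gloss: str = ""
    sublanguage: str = "General"
    status: str = "in progress"


@dataclass
class Variant:                     # level 3: a lexical item in a given meaning
    form: str
    meaning_id: str
    sense_number: int
    usage_label: Optional[str] = None
    is_central: bool = False       # the neutral word of the synset


@dataclass
class Wordnet:
    items: Dict[str, LexicalItem] = field(default_factory=dict)
    meanings: Dict[str, Meaning] = field(default_factory=dict)
    variants: Dict[Tuple[str, str], Variant] = field(default_factory=dict)

    def add(self, item, meaning, variant):
        """Store a word sense; existing items and meanings are kept as they are."""
        self.items.setdefault(item.form, item)
        self.meanings.setdefault(meaning.meaning_id, meaning)
        self.variants[(variant.form, variant.meaning_id)] = variant

    def meanings_of(self, form):
        """A word with all its meanings."""
        return [self.meanings[m] for (f, m) in self.variants if f == form]

    def words_of(self, meaning_id):
        """A meaning regardless of the associated words."""
        return [self.items[f] for (f, m) in self.variants if m == meaning_id]

    def variant(self, form, meaning_id):
        """One specific combination of word and meaning, or None."""
        return self.variants.get((form, meaning_id))
```

Keeping the variant as a separate record leaves room for the extensions mentioned above: morphological information would attach to the lexical item, semantic features to the meaning.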

1.6. Quality of semantic representation

· The meanings need to be recognized and understood by looking at the synsets, relationships and gloss.

· The representation needs to be meaningful for query expansion in Information Retrieval.

2. Multilingual data requirements

· The different WordNets have to be connected with multilingual links.

2.1. Coverage of the vocabulary and overlap between the different languages

· The covered vocabulary of the different languages needs to have the highest possible degree of overlap and all the languages will be 100% linked to some interlingua.

2.2. Lexical items

· A priori there are no language-independent meta-words. Only words that belong to the language community are included in the network.

2.3 Meanings: the multilingual relations

· The records that constitute the interlingua have to be fixed. No renumbering is allowed, in order to keep maximal consistency. Changes will involve the relations to these records (the index) or adding new record numbers that have not been used.

2.4. Relationships

2.4.1. Number and types of links

· The interlingua has to be fully automatic and easy to maintain.

· The links between the different languages have to show the degree of equivalence (narrower, broader, equal).

2.4.1.1. Hierarchical links (Organization)

· The interlingua could have links to structured top-concept and to structured domain information.

2.5. Features and properties assigned to the meanings

· A priori no features are expected at the multilingual level, but special information could be stored in the records of this level.

2.6. Quality of the translations

· The motivation of the interlingual links needs to be easily accessible and understandable.

2.1.6.2. Architectural requirements

In the prior paragraph some requirements have been given that relate to the multilingual character of EuroWordNet. Additionally, the multilinguality has the following requirements:

· the language-independent module is a superset of the language-specific WordNets.
· unique identifiers are needed to take into account language reference.
· the final result should enable the use of the complete EuroWordNet, any single language or any pair of languages in an IR application.

Furthermore, the data structure has to support the following processes:

• Traversal of links
• Inheritance of properties and links
• Equivalence links (viewing and comparison)
• Monolingual and multilingual inferences (by means of set theory properties)
• Interlingual mapping and correspondence
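As an illustration of interlingual mapping for multilingual retrieval with a monolingual query, the Python sketch below links synsets of each language to fixed interlingua records and expands a word of one language into the variants of another through shared records. The identifiers, the data layout, the relation labels and the toy vocabulary are assumptions made for the example.

```python
# Toy data: synset id -> variants, and (language, synset id) -> equivalence links.
SYNSETS = {
    "en-dog-1": {"dog", "domestic dog"},
    "nl-hond-1": {"hond"},
    "es-perro-1": {"perro"},
    "es-cachorro-1": {"cachorro"},
}

# Each link names a fixed interlingua record and the degree of equivalence.
EQ_LINKS = {
    ("en", "en-dog-1"): {("ili-0001", "equal")},
    ("nl", "nl-hond-1"): {("ili-0001", "equal")},
    ("es", "es-perro-1"): {("ili-0001", "equal")},
    ("es", "es-cachorro-1"): {("ili-0001", "narrower")},
}


def ili_records(language, word, allowed=("equal",)):
    """Interlingua records reachable from a word of the given language."""
    records = set()
    for (lang, synset), links in EQ_LINKS.items():
        if lang == language and word in SYNSETS[synset]:
            records |= {ili for ili, relation in links if relation in allowed}
    return records


def expand_across_languages(word, source, target, allowed=("equal",)):
    """Variants in the target language that share an interlingua record with the word."""
    wanted = ili_records(source, word, allowed)
    variants = set()
    for (lang, synset), links in EQ_LINKS.items():
        if lang != target:
            continue
        reachable = {ili for ili, relation in links if relation in allowed}
        if wanted & reachable:          # set intersection: a shared record exists
            variants |= SYNSETS[synset]
    return variants
```

Since the interlingua records are never renumbered, adding a language only adds entries to the link table. Restricting `allowed` to equal links favours precision, while admitting narrower or broader links widens the expansion.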

2.1.6.3. Exchange format requirements

It should be possible to easily transfer the data from one format to another format. The data must be extendible and it must be compatible with general data standards on lexical resources as developed by TEI, EAGLES and PAROLE.

2.1.6.4. Interface requirements

The interface will show all the public information stored in EuroWordNet.

· it can simultaneously show wordnets for two languages
· it can traverse relations between words in these wordnets and across wordnets
· it will be possible to derive statistics about the relations in the wordnets
· it will contain a module that checks redundancies in the information to measure consistency of data
· it will be possible to query the wordnets and make specific selections
· it will contain an option to export selections of data to a flat file format.

2.1.6.5. Tools requirements

A tool kit for extending and maintaining the results by the users should be available (for market conditions, see the Novell ConceptNet® 1.0 Toolkit).

2.1.6.6. Product Integration Requirements

The resource has to be integrated with a reasonable effort. No structural changes should be needed to integrate the resource in the product. No data changes should be needed (except for the extensions required by the user). It should also be easy to adapt the resources to specific applications: i.e. tailoring the top-ontology or adding on a specific terminology.

Other formal requirements:

· Different types of objects have to be well differentiated, for instance, a domain (categorizing world knowledge), a meaning (linguistic knowledge) or the instance of a meaning (object from the real world).

· All the synsets need a gloss (English gloss for all the languages) and glosses should not exceed 254 characters to ensure maximal portability.

Other contents requirements:

· Hierarchies that can be easily understood by a non-expert end-user. This means that the links of the network have to be transparent for the average user in order to allow them to navigate through the semantic network and eventually to customize the semantic network.

· Explicit criteria need to be defined for the relationships.

Documentation:

· Accurate and clear documentation needs to be provided to users of the EuroWordNet Database.

2.2 The EuroWordNet User-Group

By means of the EuroWordNet User-Group, we hope to create a wider awareness of the project results and to pave the way for the extension of the resources to other languages, larger vocabularies and other types of applications. The User-Group currently comprises members of a diverse range of European institutes and companies (libraries, universities, software developers, publishers), interested in language learning, language-generation, machine translation, parsing, language understanding, information retrieval, electronic libraries, publishing and the production of wordnets in other languages:

Publishers (application area)
· Van Dale Lexicografie B.V. (NL) (provider of data): (electronic) dictionaries, language generation tools, learning tools.
· Bibliograf (ES) (provider of data): (electronic) dictionaries.
· Garzanti (IT): (electronic) dictionaries.
· Cambridge Language Services (UK): electronic dictionaries, language learning tools

Software Developers
· SENA Athens (GR): information retrieval
· CapVolmac, Utrecht (NL): authoring tools, Grammar checkers
· INCYTA Barcelona (ES): machine Translation
· Novell Linguistic Development, Antwerp (BE): information retrieval, authoring tools, natural language interfaces
· LOGOS (IT): technical translations, desktop publishing, technical writing.
· EBSCO (ES): products for automated library systems, publishing of reference databases, retrieval systems for citation and full text databases, document delivery
· BERTIN (FR): information retrieval in textual databases, concept-based indexing
· DATAMAT (IT): information retrieval, document processing

Non-profit users
· RKD, National Institute for Art-Historical Documentation (NL): information retrieval, electronic libraries
· Autonomous University of Madrid (ES): machine translation, corpus linguistics, electronic dictionaries.
· University Alfonso X El Sabio (ES): machine translation, corpus linguistics, electronic dictionaries.
· VPRO (NL), Broadcasting Company: multimedia databases

Builders
· University of Heidelberg (DE): German wordnet
· University of Tuebingen (DE): German wordnet
· University of Athens (GR): Greek wordnet
· University of Goetheborg (SE): Swedish wordnet
· University of Euskal Herriko (ES): Basque wordnet
· University of Tartu, Estonia (EE): Estonian, Latvian and Lithuanian wordnet
· University of Princeton (USA): American wordnet

The relevance of EuroWordNet for publishers and other wordnet builders is clear. Many publishers are in the process of setting up databases with lexical information rather than producing electronic books and some of them are already involved in developing semantically organized lexicons. The publishers will also be contacted to develop an exploitation and maintenance plan for the results. Those publishers and institutes that are already involved in building wordnets and ontologies (funded in various schemes) will be involved in agreeing standards, sharing experiences, findings and evaluation criteria. For this purpose we will set up a specific workshop.

The developers and end-users form a more diverse group. Below are some more-specific statements on the relevance of the project to the software developers and non-profit organizations in the User-Group:

EBSCO offers services that include a wide range of databases, from technical, specific ones (medicine, chemistry, economy) to general ones (databases with general interest magazines). Though a few databases have their own thesaurus, in general they do not have a thesaurus, and EBSCO does not use any generic thesaurus on every database. EuroWordNet would provide EBSCO

· the chance to permit queries in Spanish (currently queries can only be made in English, which is a problem for many Spanish users).
· a complete general-purpose multilingual thesaurus that would upgrade the EBSCO query interface from boolean search to conceptual search.

INCYTA is a company working in the development and marketing of language pairs for the METAL machine translation system. They also participate in other linguistic engineering projects (cf. MULTEXT (LRE-II), PP-TELELANG (MLAP), AVENTINUS (LE, not started yet)). Moreover, they give translation services for technical documents. The company currently has 19 people (including computational linguists, lexicographers, computational scientists and engineers). Their main interest in the results of EuroWordNet is to explore the adequacy of the semantic thesaurus for:

· enhancing the bilingual transfer phase of the METAL system by using semantic links between bilingual terms.

· Providing a means to store multilingual terminology having semantic keys as the linking (and accessing) point: i.e., terms in different languages used for the same concept should have the same semantic description, which could be used to index these terms.

Cap Volmac. The Advanced Technology Service of Cap Volmac in Utrecht develops grammar checkers and translation systems. Dictionaries (including bilinguals) are one component of these services. The current lexical database uses semantic relations to a limited extent, which will possibly be extended in the future. The EuroWordNet database could be a good candidate for such an extension.

Logos, an international translation company, is very active in the terminology field. Logos has created a multilingual dictionary of 3.5 million terms which can be consulted in up to 31 languages. Last year they donated this dictionary to the Internet community free at http://www.logos.it/. Although there are many dictionaries available on the Internet, this is the first of its kind. The pages consist of terms and related translations, and even include a built-in search engine which gives access to all pages with the translated terms available on the Web. Included is access to all other dictionaries on the Web and also the new Logos Word Exchange. In addition to the help of all Internet users, Logos has an internal team of terminologists dedicated solely to this project.

Datamat - Ingegneria dei Sistemi PsA - is an Italian system integration company which actively works in the areas of information retrieval and document processing. Furthermore, Datamat owns Fulcrum Technologies Inc., a Canadian producer of Full/text and Search Tools, currently the world-leader on the information retrieval products market. Datamat believes that the results of EuroWordNet could have interesting impacts on some of their programs.

SENA S.A. is an SME specializing in Office Automation Software Development and Electronic Information Services. It is the developer of DIKAIO, a Legal DataBase distributed on CD-ROM. DIKAIO provides a huge volume of data but minimal access and retrieval facilities; it therefore needs to be improved in a number of ways with respect to the facilities provided. The particular issue where the results of EuroWordNet could be utilised by SENA S.A. is the provision of an improved document retrieval facility based on semantic matching instead of the purely textual retrieval system currently provided. SENA can be seen as representative for many new SMEs which are expected to exploit the new possibilities of NLP technology. Finally, it is also involved in a number of RTD projects in the NLP field, with the intention of developing Office Automation Software for commercialization, such as MABLE (Telematics - LE) and DIALOGOS (National RTD), both of whose results are add-ons for commercial Word Processing systems.

Bertin is an engineering firm with 470 employees for which Industrial technological development constitutes the main line of business. Some of their recent work for leading edge customers (e.g. in the nuclear industry) involves information retrieval in textual databases, for which they have developed concept-based indexation tools. For these tools Bertin needs conceptual thesauri as developed in EuroWordNet.

RKD. The RKD is one of the largest documentation and research centres in the world on the history of art. Its collection comprises some 4,5 million reproductions, 2 million press clippings and a library of 400,000 volumes. The Automation Advisory department of the RKD advises Dutch museums in matters of documentation and automation. Because of the nature of its collection and because of its special role for Dutch museums, the RKD is highly interested in any new developments in the field of indexing and retrieval of structured and non-structured documentary information.

VPRO is an information processing company oriented towards the future. It is specifically interested in the use of automatic information processing systems that support the services of VPRO, especially with respect to sharing data and information in the production of broadcasting programs by groups of people. VPRO extensively uses many different information resources with very different access systems. Therefore, the VPRO is specifically interested in tools that provide a unified way to retrieve information from any of these resources. EuroWordNet could be used in a retrieval system with Natural Language(s) as the common, shared language.

Feedback of the different groups will be integrated for the following aspects of the project:

· end-user's feedback on the available applications that incorporate the results
· developer's feedback on possible future applications
· exploitation and maintenance of the results
· licensing of the results
· extending the results
· standardizing semantic resources
· compatibility of lexical resources (e.g. with the Parole lexicons or terminology)
· usefulness of the tools and methods used in the project for developing wordnets
· resources and time needed for adapting and integrating the results

The aim of the project is not to develop an application but to develop a generic resource that could be used for a wide range of Telematics applications. In this respect it is difficult to get precise feedback from end-users in the project. Furthermore, technology based on semantic resources is relatively new so that also developers can only give preliminary responses. Therefore we will plan some user-forums and workshops to stimulate the communication and discussion. The results of this will be published and as much as possible integrated in the broader initiative to create user-clusters for the current EC-projects.


3 The functional specification of EuroWordNet

The functional specification of the project can be defined in terms of:

· the data.
· the database in which the data are stored.

The focus of the project will be on producing the linguistic data or lingware. The specification for the data will be given in the next section. The EuroWordNet Database is needed for viewing, comparing and exporting selections of these data. The database should enable users to gain insight in the (complex) results, especially clarifying the multilingual dimension. The EuroWordNet Database is further described in section 3.2. It is important to note that it is not the aim of the project to develop the tools and databases for building wordnets. The EuroWordNet database can therefore not be used to create, edit or remove entries from the database or to extend and update the results. However, most such tools already exist in some form or another for different platforms. In the project these are collected, minimally extended and made available as separate packages by the different sites. The project will compile a document that summarizes the methods, the tools and resources, their performance and the availability in terms of hardware, software and licensing conditions. Future builders of wordnets can use this document to determine the approach that suits their specific situation best, and future users can select the tools for extending or customizing the results for their applications.

3.1 The specification of the EuroWordNet data

The design of a semantic database is faced with two fundamental questions:

· what kind of semantic information is stored: linguistic knowledge or world-knowledge.
· how is this information stored: as a language internal system or using some meta-language.

Many semantic databases are not explicit with respect to these distinctions and it is often very difficult to keep them apart. In the case of traditional dictionaries you might expect to find `linguistic knowledge' telling you what semantic properties determine the exact usage of a word but dictionary definitions very often also contain fragments of world-knowledge. In the case of conceptual networks that try to capture our common-sense knowledge and reasoning you might expect to only find `world knowledge' but what is defined are often the same words. Even when some abstract features are used, these features are at some level only meaningful because they are English words (see footnote 10). Taking a pragmatic and practical point of view is not very helpful either since relatively simple linguistic tasks (such as spelling checking) or straight-forward information-technology applications (such as keyword retrieval) may require very sophisticated semantic data and inferencing mechanisms to have a high success rate.

10 Also the Princeton WordNet reflects a mixture of principles. On the one hand they claim to have built a `mental lexicon' reflecting psycholinguistic assumptions rather than a `linguistic lexicon', on the other hand, they use linguistic tests to define relations and distinctions. Furthermore, the Princeton WordNet partially avoids the issue about the relation between words and concepts by implementing the relations as a language-internal system.

Given the current state of the art in NLP, a general-purpose semantic lexicon for language understanding, language generation and MT is not (yet) feasible (Lenat and Guha 1990, 1994, Davidson 1994, Pustejovsky and Bergler 1992, Briscoe et al 1993, Calzolari and Guo 1994, Alberto and Bennet 1995). However, the aim of this project is not to develop full semantic lexicons that make use of complex representation languages and inferencing mechanisms but to limit the information and mechanisms to those basic semantic relations between words which are well understood. The development of such a database is feasible and worthwhile for the following reasons:

· regardless of the variety of paradigms, all theories of semantics, ranging from Roget's thesaurus, via cognitive theories and lexical semantics to formal knowledge representation systems, distinguish notions such as hyponymy, meronymy, synonymy, and some clear tests are available to determine such relations between words (cf. Collins and Quillian 1969, Berlin 1972, Miller and Johnson-Laird 1976, Brachman 1979, Mel'cuk 1984, Tversky and Hemenway 1984, Brachman and Schmolze 1985, Brachman and Levesque 1985, Cruse 1986, Tversky 1986, Pollard and Sag 1987, Winston et al 1987, Chaffin et al 1988, Miller et al. 1993, Jackendoff 1992, Daelemans et al 1992, Copestake 1993).
· the resulting database can be the backbone of any system of the future based on semantic technology and it will already be of tremendous help in more global applications such as information retrieval.
· the representation of these basic relations does not rely on complex knowledge representation formalisms and is easily convertible to any other kind of implementation.

Our approach is to take one step at a time but to take the fundamental steps first. The limited scope of the semantic information guarantees feasibility, as is illustrated by the existence of the Princeton WordNet in which exactly these basic relations are stored. Because compatibility and familiarity with the Princeton WordNet is more important than the actual theoretically-motivated position with respect to the above issues, the Princeton WordNet is taken as the starting point for this project with a few additional changes (described below).

A totally new aspect of the EuroWordNet database is however its multilinguality. Although the different wordnets will be developed as independent language-internal systems they will also be linked to some intermediary (English) wordnet. In addition to this, we will merge the major concepts and words in the individual wordnets to form a common language-independent top-ontology. The resulting multilingual architecture has the following advantages:

· it will be possible to use the database for multilingual retrieval.
· the different wordnets can be compared and checked cross-linguistically.
· the wordnets will be more compatible while language-dependent differences can be maintained in the individual wordnets.
· the database can be tailored to a user's needs by modifying the top-concepts (e.g. by adding semantic features) without having to know the separate languages.
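The first advantage, multilingual retrieval through the intermediary English wordnet, can be illustrated with a minimal Python sketch. The link table, the synset identifiers such as `cat#1`, and the function names are hypothetical and chosen only for illustration; the sketch also assumes one sense per word, which the real data will not guarantee.

```python
from collections import defaultdict

# Hypothetical equivalence links: (language, word) -> English synset identifier.
EQUIVALENCE_LINKS = {
    ("nl", "kat"): "cat#1",
    ("it", "gatto"): "cat#1",
    ("es", "gato"): "cat#1",
    ("nl", "hond"): "dog#1",
    ("it", "cane"): "dog#1",
}


def build_reverse_index(links):
    """Group the words of every language under the English synset they link to."""
    index = defaultdict(lambda: defaultdict(set))
    for (language, word), english_synset in links.items():
        index[english_synset][language].add(word)
    return index


def translate_query_term(word, source_language, target_language, links):
    """Return target-language words that share an English synset with the query word."""
    english_synset = links.get((source_language, word))
    if english_synset is None:
        return set()  # the word is not linked to the intermediary wordnet
    reverse_index = build_reverse_index(links)
    return set(reverse_index[english_synset].get(target_language, set()))


italian_terms = translate_query_term("kat", "nl", "it", EQUIVALENCE_LINKS)
```

Because every wordnet is linked only to the English wordnet, adding a language requires one new set of links rather than links to every other language.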

Another addition with respect to the Princeton WordNet is the use of domains. Domains can be seen as another way of grouping concepts in larger schemes or scripts, e.g. “tennis”, “politics”, “education”. These domains may comprise a diverse range of concepts which will not be grouped together from a logical point of view where only subsumption relations are considered. For example, the domain “tennis” would include concepts like “game”, “tennis ball”, “tennis racket”, “set”, “tie-break”, “single play” etc. Domains are very useful for information retrieval applications and for publishers to achieve more coherence and to isolate subvocabulary. Since the domains can be organized as a semantic hierarchy as well they are considered as a separate type of objects in the database between which relations can be stored.

Finally, there will be a possibility to include instances of concepts. This is particularly relevant for topographical data such as place names (cities, countries, etc.) and for customizing the database to a specific user or client. Since instances are related to concepts they introduce yet another type of objects in the database.

Both domains and instances are already present in the semantic database developed by Novell (the ConceptNet, Díez-Orzas 1995, Cuypers and Díez-Orzas 1996) and have been used for information retrieval purposes. They can be seen as user-specific information types which can be added to the generic resources. It is not the aim of the project as such to extract this kind of information (unless to distinguish the subvocabulary) and these information types have not been foreseen in the Technical Annex of the project. Because the domain and instance information is needed by the users the data types will be part of the general design and architecture of the data structure. However, we will only provide the information to the extent that it is available and easily extractable from the MRDs and the allocated resources in the project allow such extraction. If not, we will manually code only a limited number of domain labels and instances in the database to illustrate the possible use. Future users can then customize the EuroWordNet data by adding the domain and instance information for their application.

It is important to realize that the functional specification of the EuroWordNet data should be seen as the a-priori design of the results. The relations and the coverage are described having in mind what is required for the specific users, given the state of the art in semantics (represented by WordNet1.5), the quality of the resources and what is feasible given the available resources, tools and time. When relations and coverage are discussed it will not mean that all these relations will be filled for every lexical item that is supposed to be in the database. In principle only those relations will be expressed between lexical items which are linguistically salient and which are extractable from the given resources. The resources as such differ considerably in structure and content. We therefore cannot expect that the richness of the results is the same for every wordnet. The general rule will be: if anticipated information is present in a given resource and if it is easily extractable (by semi-automatic means) then it will be stored in the result. Only where crucial links in the network are missing or misrepresented will we add or change information by hand. The design of the data will thus specify all the slots required by the users, more or less covered in the diversity of resources and considered extractable. In this respect the design will provide maximum flexibility to store semantic information without deviating too much from the original WordNet1.5 structure (or allowing for easy conversion to this structure) and without necessarily making too many commitments for the building of the resources. In practice, the filling of these slots thus depends very much on the quality of each resource, the extraction techniques and the limits of the project. However, future extensions of the project can still profit from the open and flexible design.
Furthermore, during the project it may turn out that some of the designed data types and relations are not practical or will hardly occur in the final results. To some extent relations may be added (see footnote 11) or may not be expressed, but the idea of the functional specification here is that all potential problems, aspects and relations are as much as possible anticipated.

11 Adding relations will not be a problem for the database. Only when relations are re-interpreted or when new database objects have to be introduced we are faced with major design problems, especially since interfaces will depend on the object types in the database.

In the next sections the EuroWordNet results will then be further described in terms of:

· coverage of the vocabulary: 3.1.1
· the structuring of the entries: 3.1.2
· the language internal relations: 3.1.3
· the multilingual relations: 3.1.4
· the unified top-concepts: 3.1.5
· domain relations: 3.1.6
· instances of concepts: 3.1.7
· overview of the relations: 3.1.8


3.1.1 The coverage of the vocabulary

The EuroWordNet data will be restricted to nouns and verbs in English, Dutch, Italian and Spanish. We aim at a total set of 50,000 senses, correlating with about 20,000 most frequent words in the languages. The selection will have the following characteristics:

· there should be maximal overlap of the covered concepts across the different wordnets.
· the covered subset has to be generic: all frequent words of the language with their most frequent and common senses should be present.
· every parent concept that is needed to define a more specific concept should be present so that the introduction of new items does not require the addition of top-concepts.
· the subset should reflect language-specific lexicalization patterns.
· some subvocabulary will be added to demonstrate the possibility to augment the data with domain specific vocabularies and to be able to perform information retrieval tests with the integrated result.

The actual selection will take place in two phases. The first subset will be based on the defining vocabulary of each dictionary or resource from which the wordnets will be derived. This has the advantage that all words needed to link other words in the lexicon will be present in the selection of the wordnets. This avoids the technical complication that a selected sense is linked to a word which is not present in the database, so that the missing word has to be added first to be able to proceed. Since the defining vocabulary is probably not fully covered by the used dictionaries either, missing defining words have to be added at the beginning.

Furthermore, the words at the more general levels of the hierarchy are expected to be more difficult to define from a linguistic perspective. These words often have many vaguely distinguished meanings with a rather special linguistic usage. By linking all the words at the top first, most of the problematic cases will be handled. Any extension of the vocabulary in the wordnets will then involve the linking of more specific words to well-defined and delineated concepts in the wordnets; in other words we do not expect that extensions will introduce new hierarchical tops. Words belonging to the outer-shell of the language are also expected to be less linguistically complex (although they may have a technical meaning).

Since the total set of defining words is rather large (about 80,000 words in a normal-size dictionary) a more-specific selection will be made within the super-set of defining words. First, the most frequent defining words are selected (selection I) and an initial top-ontology is created for these words. Their basic senses have to be selected and, if necessary, senses may have to be added to reflect the role of these words in defining the other words. Another criterion for the first selection will be the fact that words correspond with tops in the Princeton WordNet. This is important to ensure compatibility of the top-level across the wordnets. Next, the selection is extended (selection II) where the main criteria are:

· frequency as a definition word
· occurrence as an entry in a monolingual dictionary
· being a defining word of the first selection
· being related to the first selection by any of the defined semantic relations

Finally, other defining words are added (selection III) which have not yet been covered, aiming at a total set of about 20,000 to 30,000 senses. The main criteria will be the same for all sites but additional criteria (such as presence of words in the bilingual dictionary; presence in available lists of basic words in the languages) and the actual method may vary from site to site, depending on the available resources and tools. Here, only a global strategy is described to clarify the nature of the selection and the coverage.

[Figure: selection of the vocabulary in three steps. Top: Selection I, the frequent defining words (Word 1, Word 2, Word 3). Below: Selection II, words related to the top level. Bottom: Selection III, selection of the remainder of the defining vocabulary.]

In the second building phase the subset will be extended where the results and user-feedback from the verification phase will be integrated. More specifically, the subset will be extended along the following lines:

· to increase the overlap across the wordnets that have been built separately.
· to include words that frequently occur in corpora but are not part of the defining vocabulary so that genericity of the wordnets is guaranteed.
· to deepen the hierarchies from a top-down perspective so that language-specific lexicalisation patterns are to some extent reflected or illustrated.
· to add one or two subvocabularies to demonstrate linking phenomena and domain effects between the general vocabulary and the domain specific words and meanings.

After linking the first subset each site will have a list of WordNet entries to which their entries have been linked:

[Figure: Measuring overlap between wordnets via the bilingual dictionaries. Panel labels: Dutch, Italian, English, WordNet 1.5. Regions: overlap in WordNet1.5 entries; WordNet1.5 entries uniquely linked via the Italian wordnet; WordNet1.5 entries uniquely linked via the Dutch wordnet; extension.]

These lists will be exchanged and compared. Those WordNet entries that are not present in a site's list but are present in the lists of the other sites will be used to generate the first extension (via the bilingual dictionaries). To achieve sufficient overlap and compatibility we will also make use of the consistency-checking and wordnet-comparison mechanisms that will be implemented in the EuroWordNet database (see section 3.2). Extreme differences across wordnets (e.g. in lexical density) or incompatibility of redundant relations will be inspected.
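The exchange and comparison of lists described above amounts to simple set operations. The following Python sketch is only an illustration under assumed data: the site names and the entry identifiers are hypothetical, and the real procedure would additionally pass through the bilingual dictionaries.

```python
def shared_entries(site_lists):
    """WordNet entries that every site has linked (the overlap)."""
    return set.intersection(*site_lists.values())


def first_extension(site_lists):
    """For each site: entries linked by other sites but missing from its own list."""
    extensions = {}
    for site, own_entries in site_lists.items():
        linked_elsewhere = set().union(
            *(entries for other, entries in site_lists.items() if other != site)
        )
        extensions[site] = linked_elsewhere - own_entries
    return extensions


# Hypothetical lists of WordNet1.5 entries linked by two sites.
site_lists = {
    "dutch": {"cat#1", "dog#1", "tree#1"},
    "italian": {"cat#1", "house#1"},
}

overlap = shared_entries(site_lists)
extension_candidates = first_extension(site_lists)
```

In this toy example the overlap is the single entry `cat#1`, and each site receives the entries that only the other site has linked as candidates for its first extension.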

To achieve sufficient genericity of the wordnets frequency lists will be extracted from a diverse range of corpora. The range and type of corpora will be determined in co-operation with Parole to ensure compatibility of the projects. Other criteria will be the word length, morphological complexity and the degree of polysemy. However, we expect that the defining vocabulary will mostly coincide with the more frequent words in daily language use. The extension from the corpora is probably minimal.

It is more difficult to determine the frequency of the senses of the words that are included. In principle we will exclude obscure and rare senses on the basis of labelling in the dictionaries from which the wordnets are derived and by inspection. These senses may be added later on when they can be linked to other senses as specific variants (coded for register, dialect, as grammatical variants, etc.) of present concepts.

Whereas the two previous strategies will lead to the construction of the hierarchy in a bottom-up fashion (selected words are linked to more general levels), for the third extension the hierarchy will be traversed top-down so that `missing siblings' of nodes in the hierarchy can be added (e.g. “cat” is linked to “animal” but “pet” is not included in the subset). Using this method more complete lexicalisation patterns of concepts in a particular language will be covered (which is not guaranteed by the above strategies). These lexicalisations include language-specific phenomena and different types of variants (possibly also the less frequent and basic senses of the frequent words that have been omitted at the beginning). In addition to words expressing a concept we will investigate the possibility to include multi-word units, typical phrases and expressions linked to concepts. This will however be very limited.
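The top-down traversal for `missing siblings' can be sketched in Python as follows. This is a minimal illustration, not the project's procedure: the function name and the toy hierarchy are hypothetical, and the hierarchy is assumed to be available as a mapping from a word to its more specific words in the source dictionary.

```python
from collections import deque


def missing_hyponyms(hyponyms, tops, selected):
    """Walk the hierarchy top-down and report (parent, child) pairs where the
    parent is in the selected subset but the child is not."""
    missing = []
    seen = set()
    queue = deque(tops)
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # a node can be reached along several paths
        seen.add(node)
        for child in hyponyms.get(node, []):
            if node in selected and child not in selected:
                missing.append((node, child))
            queue.append(child)
    return missing


# Toy hierarchy from the example in the text: "cat" is linked to "animal"
# and is in the subset, but "pet" has not been selected.
hyponyms = {"animal": ["cat", "pet"], "pet": ["cat"]}
selected = {"animal", "cat"}

candidates = missing_hyponyms(hyponyms, tops=["animal"], selected=selected)
```

Here the traversal reports the pair ("animal", "pet"), so “pet” becomes a candidate for the third extension.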

Finally, some subvocabulary will be added (if necessary from the test-corpora). Information about the domain will be stored and the effort needed for this extension will be recorded. The latter information is important for measuring the time and effort of extensions for other users. Obviously, at some point the subset will be restricted by the time and resources available in the project. The size of the total subset of nouns we are aiming at will be around 35,000 senses, the verb subset around 15,000 senses.

3.1.2 The structure of the entries

The entry structure of the Princeton WordNet is rather different from the traditional organisation of entries in dictionaries according to historic principles and/or the part-of-speech distinction. In WordNet entries are organised around the notion of synsets. Each synset comprises one or more word senses which are considered to be identical in meaning, together with a gloss which defines that meaning, e.g.: file (sense 2), data file (sense 1) -- {a set of related records kept together}.

This means that `file' in sense 2 is identical in meaning to `data file' in sense 1 and that the meaning is `a set of related records kept together'. As a result of this organisation WordNet1.5 contains only 91,591 synsets but 126,520 words or lexical items, where words are defined as access keys to the database (Díez-Orzas 1996).
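The difference between the number of synsets and the number of access keys can be illustrated with a small Python sketch. The synset identifiers and the dictionary layout are assumptions made for illustration; the variants are taken from the examples quoted in this document.

```python
from collections import defaultdict

# Hypothetical synset identifiers mapped to the variants they contain.
SYNSETS = {
    "file#2": ["file", "data file"],
    "man#1": ["man", "adult male"],
    "man#5": ["homo", "man", "human being", "human"],
}


def build_word_index(synsets):
    """Map each word (access key) to the synsets in which it occurs as a variant."""
    index = defaultdict(list)
    for synset_id, variants in synsets.items():
        for word in variants:
            index[word].append(synset_id)
    return dict(index)


word_index = build_word_index(SYNSETS)
```

In this toy data three synsets yield seven access keys, and the key `man` leads to two synsets, one per sense.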

The EuroWordNet entries will be structured in much the same way. This will mean that the traditional dictionary structure has to be converted to the synset structure. This will be done in two steps:

1) determine which senses will function as synsets (see the above section on the coverage).

2) include synonymous senses as variants in the synset.

The definition of synonymy will be further discussed in the next section. Note that the information of several senses that are included in the synset may be merged.

A practical aspect is that during the building of the wordnets the variants in the synset have to contain information about the original source from which they have been derived. Each variant will therefore have an extra field that uniquely identifies the original senses that correlate with it. This information on the dictionary source senses will however not be public (except for the references to WordNet synsets) and it will not be included in the project results. Only the publishers and the builders of the wordnets will have access to this information. Another aspect is that most ‘traditional’ information from the original resources, such as definitions and examples, will be left out for copyright reasons. However, since the unique references to the original source information are available to the publishers and the builders of the wordnets it will be possible to obtain the extended information from the publishers that originally provided the material.

Each synset is related to other synsets by one or more predefined relations. Finally, each synset will be linked to at least one English synset containing an English gloss. The overall structure of an entry will thus be as follows (a more precise and final structure will be defined in the work packages WP3.1, WP4.1 and WP6.1 where the links, the coverage and the database architecture are worked out).

Overall Synset Structure in EuroWordNet

Synset ID (obligatory)

Variants (obligatory)

Variant      Usage Label   Central      Corpus Info   Source code      Status
             (optional)    (optional)   (optional)    (non-public)     (optional)
file         ---           Yes          Corpus        WN1.5 sense 1    revised
                                        Corpus        sense...
data file    formal        No           Corpus        WN1.5 sense 1    not-revised

Language Internal Relations

Relation Type (obligatory)   Related Synset (obligatory, non-zero)
Hyperonym
Hyponym
Holonym
Meronym

Multilingual Relations

Equivalence Type             Related English Synset
Eq-Synonym
Eq-Hyponym
Eq-Holonym

As is indicated by the top and middle row of the table, the fields with the synset record number, the variants, and the relations are all obligatory: i.e. every entry in every wordnet has to provide (some) information for these fields (where at least one relation has to be expressed). In addition, each variant may have a usage label and a centrality label, which are both optional. The usage labels indicate register, dialect and regional restrictions of a variant. The optional field central explicitly codes whether a word is considered to be peripheral or central. Criteria to determine the centrality will be further defined, possibly on the basis of the corpus frequencies, generality, word length, morphological complexity, polysemy, etc. Corpus frequency of each variant is also specified separately (again optionally). It can be differentiated for different types of subcorpora, which may be relevant for domain-specific subvocabulary which only occurs in a specific corpus (e.g. a single computer manual). This information is needed to guarantee the genericity of the resource and to be able to achieve agreement with the Parole project (LE2 4017). The Status field is used to indicate whether a variant is still under construction or has already been processed. The optionality of these fields is necessary because the resources from which the wordnets will be derived cannot all provide this information. The source information is obligatory but non-public as discussed above.

Finally, note that part-of-speech information is implicit in the wordnets. Further grammatical information for each variant can be derived from the morpho-syntactic lexicons that will be built in the Parole project or from the providers of the original information.
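The obligatory and optional fields described in this section can be summarised as a record type. The Python sketch below is one possible reading of the description, not the final structure (which is left to WP3.1, WP4.1 and WP6.1); all class, field and function names are assumptions, and the identifiers in the example are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Variant:
    word: str
    source_sense: str                       # obligatory but non-public
    usage_label: Optional[str] = None       # optional: register, dialect, region
    central: Optional[bool] = None          # optional: central vs. peripheral
    corpus_frequency: Optional[int] = None  # optional
    status: Optional[str] = None            # optional: e.g. "revised"


@dataclass
class Relation:
    relation_type: str
    related_synset: str


@dataclass
class Synset:
    synset_id: str
    variants: List[Variant] = field(default_factory=list)
    internal_relations: List[Relation] = field(default_factory=list)
    equivalence_relations: List[Relation] = field(default_factory=list)


def missing_obligatory_fields(synset: Synset) -> List[str]:
    """List the obligatory parts of an entry that are still empty."""
    missing = []
    if not synset.synset_id:
        missing.append("synset_id")
    if not synset.variants:
        missing.append("variants")
    if any(not variant.source_sense for variant in synset.variants):
        missing.append("source_sense")
    if not synset.internal_relations:
        missing.append("internal_relations")   # at least one relation is required
    if not synset.equivalence_relations:
        missing.append("equivalence_relations")  # at least one English synset
    return missing


def public_variants(synset: Synset) -> List[dict]:
    """Export the variants without the non-public source field."""
    return [
        {"word": v.word, "usage_label": v.usage_label,
         "central": v.central, "status": v.status}
        for v in synset.variants
    ]


# Example entry; the synset identifiers are hypothetical.
file_synset = Synset(
    synset_id="example-1",
    variants=[
        Variant("file", "WN1.5 sense 1", central=True, status="revised"),
        Variant("data file", "WN1.5 sense 1", usage_label="formal",
                central=False, status="not-revised"),
    ],
    internal_relations=[Relation("hyperonym", "example-2")],
    equivalence_relations=[Relation("eq_synonym", "english-example-1")],
)
```

Keeping the source field on the variant but excluding it from the exported view mirrors the requirement that the source information is obligatory during building but not part of the public results.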

3.1.3 The language internal relations

The language internal relations are mostly based on the relations in WordNet1.5. Some major changes are proposed which will be discussed later on. The motivation for deviating from the Princeton wordnet is mostly twofold:

• the information in WordNet is not adequately structured given the user-requirements,
• the information in the MRDs from which the wordnets will be derived is stored in a different way.

From an information retrieval point of view (and also with respect to other applications) some choices in the Princeton WordNet are problematic. Since the MRDs sometimes give the possibility to structure the information differently the opportunity is taken to improve the design with respect to the requirements.

Furthermore, information is sometimes structured differently in traditional dictionaries. In that case restructuring the dictionary information to the WordNet structure will involve more work than sticking to the solutions in the dictionary. When these solutions are favourable it makes sense to adapt the design. The rationale behind the changes is to save labour rather than cause extra work: changes are made to make particular information easier to use or to encode, not to create more problems in building the resources. The idea is to have maximum flexibility in the design without committing ourselves to deriving maximally complex structures.

Nevertheless, we have tried to keep the structure maximally compatible with the structure of the information in WordNet1.5. Most changes, therefore, can be seen as additions which can easily be left out to make the results WordNet1.5 compatible. First, we will summarise the relations as they can be found in WordNet1.5.

3.1.3.1 The language internal relations in the Princeton WordNet

Not all the relations described in the documentation of WordNet1.5 (Miller et al 1993) also occur in the database. The documentation gives the theoretical background for the architecture of their database rather than a description of the actual relations. For example, in the papers 7 types of meronymy relations are mentioned while in practice only three occur in the database (Díez-Orzas 1996). Similarly, for verbs a detailed hierarchy of entailment relations is described while in practice mostly hyponymy relations are present with incidental entailment and causal links. To describe the content of WordNet 1.5 we therefore use the description of the WN1.5 database by Díez-Orzas (1996). Most of the information in the next section is from that report.

The following relations are found in the Princeton WordNet Database for nouns and verbs:

Relations/ Interface code Description / directions categories examples: synonyms other members of the synset which are equal in A <--> B meaning to the target: i.e. belong to the synset. nouns 10 senses of man Sense 1 Synonyms of A • man, adult male

Sense 5 • homo, man, human being, human verbs senses of stop Sense 2 Synonyms of A • stop,halt,come to a halt, stop, moving (the car stopped) Sense 5 • stop,halt (stop a car) antonyms synsets which are opposite in meaning to the target A <--> B synset. nouns sense1 man, adult male Antonyms of A • => woman, adult female no antonym of ‘man’ sense 5 verbs sense 2 stop,halt,come to a halt, stop, moving Antonyms of A • => start, go, get going sense 5 stop stop,halt (stop a car) • => start, start up, set in motion

LE 4003 EuroWordNet October, 1996 38 hyperonyms synsets of which the target synset is a particular A --> B kind.

nouns Sense 1 man, adult male -- (an adult male person (as opposed to a woman); "there were two women and six men on the bus") · => male, male person -- (a person who belongs A is a kind of __ · to the sex that cannot have babies) verbs Sense 1 run, move by running -- (move fast by using one© s To A is one way feet, with one foot off the ground at any given time) to _ · · => travel rapidly, speed, hurry, zip · => travel, go, move, locomote -- (change location; move, travel, or proceed; "How fast does your new car go?" "We travelled from Rome to Naples by bus"; "The policemen went from door to door looking for the suspect") Also See-> run around, run away hyponyms synsets which are particular kinds of the target B --> A synset nouns Sense 2 weather, atmospheric condition, elements · => cold weather, cold snap, cold wave, cold ·__ is a kind of A spell -- (a period of unusually cold weather) · => fair weather, sunshine, temperateness => hot weather, heat wave, hot spell -- (a period of unusually hot verbs Sense 1 Particular ways run, move by running -- (move fast by using one© s to __ · feet, with one foot off the ground at any given time) · => trot, jog, clip -- (run at a moderately swift pace) · => scurry, scamper, skitter, scuttle, move rapidly · => run bases -- (run around the bases, in baseball) · => streak -- (run naked in a public place) · => run, run a football, make a run, run with the ball -- (in football) · => outrun, run faster than -- (as in a race) · => jog, go jogging -- (run for exercise) · => sprint, dash · => lope, run easily · => rush, run with the ball -- (in football) · => hare -- (run quickly; "He hared down the hill")

LE 4003 EuroWordNet October, 1996 39 nouns only holonyms A is a part of __ · synsets of which the target synset is a part A --> B part Sense 2 flower, bloom, blossom -- (reproductive organ of angiosperm plants esp. one having showy or colorful parts) • PART OF: angiosperm, flowering plant -- (plants having seeds in a closed ovary) member Sense 5 homo, man, human being, human -- (any living or extinct member of the family Hominidae) • MEMBER OF: genus Homo -- (type genus of the family Hominidae) substance Sense 1 glass -- (a brittle transparent solid with irregular atomic structure) • SUBSTANCE OF: glassware, glasswork -- (articles made of glass) • SUBSTANCE OF: plate glass, sheet of glass - - (glass formed into a thin sheet) meronyms Parts of A synsets which are parts of the target synset B --> A has part · Sense 2 flower, bloom, blossom -- (reproductive organ of angiosperm plants esp. one having showy or colorful parts) • HAS PART: stamen -- (the male reproductive organ of a flower) • HAS PART: pistil, gynoecium -- (the female ovule-bearing part of a flower composed of ovary and style and stigma) • HAS PART: carpel -- (a simple pistil or one element of a compound pistil) • HAS PART: ovary -- (the organ that bears the ovules of a flower) • HAS PART: floral leaf -- (a modified leaf that is part of a flower) • HAS PART: perianth, floral envelope -- (collective term for the outer parts of a flower consisting of the calyx and corolla and enclosing the stamen and carpel) has member Sense 1 womankind -- (women as distinguished from men) • HAS MEMBER: womanhood, woman -- (women as a class; "it© s an insult to American womanhood"; "woman is the glory of creation")

LE 4003 EuroWordNet October, 1996 40 has substance Sense 1 glassware, glasswork -- (articles made of glass) · HAS SUBSTANCE: glass -- (a brittle transparent solid with irregular atomic structure)

verbs only

entailments: synsets which are entailed by the target synset (A --> B); What does A entail doing?

Sense 1
walk, go on foot, foot, leg it, hoof, hoof it -- (use one's feet to advance; advance by steps)
   => step, take a step

things caused: synsets which are caused by the target synset (A --> B); What does A cause?

Sense 1
kill -- (cause to die)
   => die, pip out, decease, perish, go, exit, pass away, expire -- (pass from physical life and lose all bodily attributes and functions necessary to sustain life; "She died from cancer"; "The children perished in the fire"; "The patient went peacefully")

Some of the specific relations have not been listed here. Troponymy is represented as hyponymy, as is done in the database; the morphological relations such as Pertain and Derived from will be implemented differently as will be discussed below.

3.1.3.2 Unifying nouns and verbs

The Princeton WordNet uses a rigid distinction between nouns and verbs. The major reason for keeping nouns and verbs separate is their different syntactic role in English. Furthermore, it is suggested that the responses of people to interviews parallel the rigid noun/verb distinction as well: nouns are related to nouns and verbs are related to verbs. The separation, however, leads to the very undesirable situation that very similar synsets (which in some cases even have exactly the same names for the tops) are totally unrelated only because they differ in part of speech (POS). There are, for example, two tops for "change", one for verbs and one for nouns. Likewise, a noun such as "adornment" and the verb "adorn" will in no way be linked to each other:

Examples from WordNet1.5

Sense 2
adornment -- (the action of decorating yourself with something colourful and interesting)
   => decoration -- (the act of decorating something (in the hope of making it more attractive))
       => change of state -- (the act of changing something into something different in essential characteristics)
           => change -- (the act of changing something; "the change of government had no impact on the economy"; "his change on abortion cost him the election")
               => action -- (something done (usually as opposed to something said); "there were stories of murders and other unnatural actions")
                   => act, human action, human activity -- (something that people do or cause to happen)

Sense 1
decorate, adorn, grace, ornament, embellish, beautify
   => change, alter -- (cause to change; make different; cause a transformation; "The advent of the automobile may have altered the growth pattern of the city"; "The discussion has changed my thinking about the issue")

The noun "adornment" is linked to the nominal concepts "change of state" and "change", which are further linked to "action" and "act". The verb "adorn", however, is directly linked to the verbal top synset "change".

According to our estimates about 30% of all the noun senses refer to an event, state or relation. Many of these could have synonymy or hyponymy relations with verbs or adjectives, or should at least be related to the top-concepts of the verbs. Currently, none of these words are related in any way in WN1.5. Other reasons for interlinking the noun and verb networks are:

• In other languages it is either very difficult to distinguish nouns and verbs or the distinction does not even exist (Lyons 1977).
• From an information retrieval point of view the same information can be coded in an NP or in a sentence. By unifying higher-order nouns and verbs in the same ontology it will be possible to match expressions with very different syntactic structures but comparable content (see the Sift project, LRE 62030).
• By merging verbs and abstract nouns we can more easily link mismatches across languages that involve a part-of-speech shift. Dutch nouns such as "afsluiting" and "gehuil" are translated with the English verbs "close" and "cry", respectively. In the combined hierarchy we can directly link "afsluiting" to "close" and "gehuil" to "cry".

Rather than semantically distinguishing nouns and verbs we will make a distinction between first-order and higher-order entities, where nouns can refer to first-order entities (concrete, physical things) and both nouns and verbs can refer to higher-order entities (properties of things, relations, acts, activities, processes, states, etc.). This kind of information is rather easily extractable from the dictionaries because it is always described using a few rather typical definition structures, e.g.:

     Entry Word    Definition (LDOCE 1978)
(a)  adornment     the act of adorning
     ambiguity     the condition of being ambiguous
(b)  apiculture    the keeping of bees, esp. for profit

In the Sift (LRE 62030) and Acquilex projects (BRA 3030, BRA 7315) these nouns have been linked as hyponyms to the main noun in (a): "act", "condition", and to the main verb/adjective of the of-complement of the main noun: "adorn", "ambiguous". In the case of structure (b) the syntactic head is already a verb, "keep", to which it can be linked as a hyponym. In those cases where there is a systematic morphological relation between the defined noun and the linked verb or adjective (adornment/adorn) there may also be a synonymy relation between them.

Strictly speaking, it is not necessary to add relations to the Princeton WordNet architecture but only to give up some restrictions. Nouns and verbs could be members of the same synset and the same relations could be applied to verbs and higher-order nouns. However, for practical reasons we will introduce the following new relations to specifically express synonymy and hyponymy relations across nouns and verbs:

noun-to-verb-hyperonym:   apiculture ----> keep
verb-to-noun-hyponym:     keep ----> apiculture
noun-to-verb-synonym:     adornment ----> adorn
verb-to-noun-synonym:     adorn ----> adornment

Many systems crucially depend on the part-of-speech distinction in WordNet (and many other lexicons). Giving it up completely would be too drastic a change.
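The four relations above can be sketched as a small lookup table that records, for each relation, the part of speech expected at either end and the relation that holds in the opposite direction. This is only an illustration: the names `CROSS_POS` and `add_link`, the `(word, pos)` pairs and the "n"/"v" tags are assumptions made for the sketch, not part of the EuroWordNet specification.

```python
# Hypothetical encoding of the four cross-part-of-speech relations listed above.
# Each relation maps to (source POS, target POS, inverse relation).
CROSS_POS = {
    "noun-to-verb-hyperonym": ("n", "v", "verb-to-noun-hyponym"),
    "verb-to-noun-hyponym": ("v", "n", "noun-to-verb-hyperonym"),
    "noun-to-verb-synonym": ("n", "v", "verb-to-noun-synonym"),
    "verb-to-noun-synonym": ("v", "n", "noun-to-verb-synonym"),
}


def add_link(links, source, relation, target):
    """Add a cross-POS link and its inverse; source and target are (word, pos) pairs."""
    src_pos, tgt_pos, inverse = CROSS_POS[relation]
    if source[1] != src_pos or target[1] != tgt_pos:
        raise ValueError(f"{relation} cannot link {source} to {target}")
    links.add((source, relation, target))
    links.add((target, inverse, source))


links = set()
add_link(links, ("apiculture", "n"), "noun-to-verb-hyperonym", ("keep", "v"))
add_link(links, ("adornment", "n"), "noun-to-verb-synonym", ("adorn", "v"))
```

Keeping the part of speech on every entry means that systems relying on the noun/verb distinction can still filter on it, while the explicit relation names carry the cross-category link.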

3.1.3.3 Conjunction and disjunction of relations

In the current WordNet database both conjunction and disjunction of relations occur. Multiple meronyms linked to the same synset are automatically taken as conjunctives:

• all the parts together constitute the holonym car,

whereas multiple hyponyms of a synset are treated as disjunctives:

• all kinds of animals are in principle disjunctive members of the same type.

We also find multiple hyperonyms in WordNet1.5, though less systematically:

Examples from WordNet1.5

Sense 1
piano, pianoforte, forte-piano
   => stringed instrument -- (a musical instrument in which taut strings provide the source of sound)
   => percussion instrument, percussive instrument -- (a musical instrument in which the sound is produced by one object striking another)

Sense 1
spoon
   => cutlery -- (implements for cutting and eating food)
   => container -- (something that holds things, especially for transport or storage)

In these examples, both classifications may apply simultaneously and can therefore be seen as conjunctive hyperonyms. In EuroWordNet we expect to make a much more systematic use of conjunctive hyponymy. Especially when information from different sources or different senses is merged, multiple classifications of the same synset may result. Conjunctive classes may also be created in a more generic, top-down fashion, since some classes combine in a systematic way, e.g. form and function, or change and motion.

Conjunctiveness and disjunctiveness of the meronymy and hyponymy relations are implicit. Multiple hyponyms always form a disjunctive set, multiple meronyms and hyperonyms always form a conjunctive set. There are however situations in which multiple hyperonyms, meronyms and holonyms could be disjunctively related to the same synset and multiple holonyms conjunctively to the same synset. Because of the implicitness of the conjunction and disjunction in WordNet1.5 this cannot be expressed by listing multiple synsets (which would cause ambiguity). To start with disjunctive hyperonyms: there is a considerable group of nouns and verbs that do not refer to one type of thing or event but can be applied to a whole range of things or events (Vossen 1995), as illustrated by the following example from LDOCE:

     Entry Word    Definition (LDOCE 1978)
(a)  arrival       a person or thing that arrives or has arrived
     puzzler       a person or thing that puzzles
     threat        a person, thing or idea regarded as a possible danger
(b)  buzzer        a thing that buzzes
     stiffener     a thing that stiffens
     attraction    something which attracts
     decoration    something that decorates

These `predicative' nouns are either linked to a coordinated list of definition heads to which they can be applied or to so-called void heads. Whereas in the case of conjunctive hyperonyms several classifications apply simultaneously, in these cases the classifications are disjunctive. The way the range of denotation is specified in structures such as (a) is very similar to the representation of selectional restrictions for verb arguments. All kinds of typicality phenomena can be found here which may be relevant for information retrieval as well. A word like "puzzler" will match more strongly with "person" than a word like "buzzer", even though both nouns are strictly said to be applicable to any "thing". If we look at the way these nouns have been defined in WN1.5 then we see that these words are either defined by void classes or that senses have been split:

Examples from WordNet1.5

Sense 1
menace, threat -- (something that is a source of danger)
   => danger -- (a cause of pain or injury or loss; "he feared the dangers of travelling by air")
       => causal agent, cause, causal agency -- (any entity that causes events to happen)
           => entity -- (something having concrete existence; living or non living)

Sense 4
terror, scourge, threat -- (a person who inspires fear or dread)
   => person, individual, someone, mortal, human, soul -- (a human being)

Here we see that sense 1 of "threat" is linked to a non-restrictive class "danger" which can be any causal agent (see also the head of the gloss: "something"), whereas sense 4 expresses the same concept restricted to "person". By distinguishing different synsets it is suggested that these are also distinct concepts or senses. This is of course highly doubtful. As the more generic sense already suggests, anything can be a "threat", which means that we could need any number of synsets to express this range.

Unfortunately, polysemy does not always imply disjunction. In the following example the senses of the noun "movement" do not reflect different events but merely different aspects of the same event, namely the fact that it is an event and the fact that it is a change:

Examples from WordNet1.5

Sense 2
movement, motion -- (a natural event that involves a change in the position or location of something)
   => happening, occurrence, natural event -- (an event that happens)
       => event -- (something that happens at a given place and time)

Sense 3
change of location, motion, movement, move -- (the act of changing your location from one place to another)
   => change -- (the act of changing something; "the change of government had no impact on the economy"; "his change on abortion cost him the election")
       => action -- (something done (usually as opposed to something said); "there were stories of murders and other unnatural actions")
           => act, human action, human activity -- (something that people do or cause to happen)

These synsets could as easily have been represented as single synsets with multiple conjunctive hyperonyms: "change" and "event". Again we are dealing with a single sense which is split over different synsets for unclear reasons.

Also disjunctive holonyms are distinguished as different synsets in WordNet1.5:

Examples from WordNet1.5

Sense 1
door -- (a swinging or sliding barrier that will close the entrance to a room or building; "he knocked on the door"; "he slammed the door as he left")
   PART OF: doorway, door, entree, entry, portal, room access -- (the space in a wall through which you enter or leave a room or building; the space that a door can close; "he stuck his head in the doorway")

Sense 2
doorway, door, entree, entry, portal, room access -- (the space in a wall through which you enter or leave a room or building; the space that a door can close; "he stuck his head in the doorway")
   PART OF: wall -- (a partition with a height and length greater than its thickness; used to divide or enclose or support)

Sense 6
door -- (a swinging or sliding barrier that will close off access into a car; "she forgot to lock the doors of her car")
   PART OF: car, auto, automobile, machine, motorcar -- (4-wheeled; usually propelled by an internal

The different holonyms of which "door" can be a PART are here listed as different synsets (sense 1 and 6 of "door"), suggesting that these reflect different concepts. Again, there could be any number of holonyms which belong to distinct synsets and that have a "door" as a part. According to the principle that is applied here these would all result in different synsets for "door".

Disjunctive meronyms also occur in WordNet1.5, as in the next example where "obolus" and "carat" represent units of measure for different types of objects and therefore constitute disjunctive meronyms of "g":

Examples from WordNet1.5

Sense 1
gram, gramme, gm, g
   HAS PART: obolus
   HAS PART: carat -- (a unit of weight for precious stones = 200 mg)

In the following example of the synset for "airplane" the choice between the disjunctive meronyms "propeller" and "jet engine" seems to be avoided by omitting them altogether from the list of parts. Note that both parts have been included as alternatives in the gloss:

Examples from WordNet1.5

Sense 1
airplane, aeroplane, plane -- (an aircraft that has a fixed wing and is powered by propellers or jets)
   HAS PART: accelerator, throttle, throttle valve -- (a valve that regulates the supply of fuel to the engine)
   HAS PART: accelerator, accelerator pedal, gas pedal, gas, hand throttle, gun -- (a pedal or hand-operated lever that controls the throttle; "he stepped on the gas")
   HAS PART: cowl, cowling, hood -- (metal part that covers the engine)
   HAS PART: escape hatch -- (a means of escape in an emergency)
   HAS PART: fuselage -- (the central portion of an airplane that is designed to accommodate the crew and passengers (or cargo))
   HAS PART: landing gear -- (an undercarriage that supports the weight of the plane when it is on the ground)
   HAS PART: pod, fuel pod -- (a detachable container of fuel on an airplane)
   HAS PART: radome, radar dome -- (a housing for a radar antenna; transparent to radio waves)
   HAS PART: windshield, windscreen -- (transparent (as of glass) to protect occupants of a vehicle)
   HAS PART: wing, wings

In addition to the missing "propeller" and "jet engine" disjuncts, this example clearly shows that the conjunctive reading is the default interpretation for multiple meronyms. Examples of conjunctive holonyms have not yet been found either, but one can imagine that members can simultaneously belong to different collections or groups, and places can belong to multiple areas.

Summarising we can say that, in addition to disjunction of genus words in definitions, dictionaries and WordNet1.5 either link a word to a higher synset which generalises over these instances or distinguish different senses for each member of the disjunctive set. The disadvantage of the former solution is that valuable information may be lost and we will get very large collections of synsets that are linked to void synsets. The disadvantage of the latter solution is that we will get an increase of polysemy, while many of the disjunctive sets are open-ended. This will again cause problems for sense-disambiguation tasks and for linking synsets across languages. Especially in the case of parts that may belong to multiple holonyms (like the above example "door"), representing disjunction as multiple synsets or abstracting from the disjunctive members ("?container" for "door") seems to be a bad solution.

Alternatively, we may also adopt a more principled solution. In general we can have the following situation in which synsets are linked as mixed conjunctive and disjunctive sets. What makes the problem complex is that disjunctive and conjunctive sets can again be disjuncted or conjuncted (although in practice there will not be many levels). In the next figure the components of "airplane" form a conjunctive set but "propeller" and "jet engine" form a disjunctive set within the conjunction. The way the relations are marked in this picture therefore does not make clear what the scope of each disjunction or set is. Only marking the relations for disjunction or conjunction is therefore not enough.

[Figure: two network diagrams. Panel "Conjunctive and disjunctive hyperonyms": nodes plant, animal; albino, pet, mammal; dog, cat, horse. Panel "Conjunctive and disjunctive meronyms and holonyms": nodes room, car, aeroplane; wall, window, door, propeller, jet engine; bricks, plates, wood. Legend: conjunctive relation; disjunctive relation; conjunctive and disjunctive relation.]

The most elegant solution to the phenomena would be to allow relations between disjunctive or conjunctive sets of synsets (rather than between individual synsets) where the scope and nesting of the set is indicated. However, this would require a completely new architecture of the database allowing for a new data object: sets of synsets. A simpler solution would be to mark lists of synsets with an index indicating the scope of disjunctive or conjunctive relations between related synsets. In many cases a single synset is already linked to multiple synsets. The combination of disjunction and conjunction for the above examples could be displayed as follows, where an index label marks the disjunction ("d1", "d2", "d3", etc.) and conjunction ("c1", "c2", etc.):

{airplane
   HAS PART: c1 door
   HAS PART: c2d1 jet engine
   HAS PART: c2d2 propeller }

{dog
   HYPERONYM: c1 mammal
   HYPERONYM: c1 pet }

{door
   PART OF: d1 car
   PART OF: d2 room
   PART OF: d3 entrance}

{albino
   HYPERONYM: d1 plant
   HYPERONYM: d1 animal }

In these examples a single synset (surrounded by the curly brackets) is linked to a set of other synsets. The order of the labels indicates the nesting: in the case of "airplane" the PART "door" (c1) is conjuncted with another PART (c2) which consists of the disjunction of "jet engine" and "propeller". We do not expect that there will be many combinations of disjunction and conjunction, and certainly not many levels of combinations, so that the indexes will remain fairly simple. Note that disjunctiveness and conjunctiveness are not a new relation type but are merely operators applied to sets of synsets linked to another synset. The visual interface to the data could make use of the labelling by distinguishing the relations with different colours. Using these indexes the architecture of the database does not have to be changed and we still have the possibility to maintain distinct synsets for disjuncts or conjuncts that are too different. Another advantage is that the information can be seen as optional. If information on disjunction or conjunction is not available or extractable from one of the resources, the specifications are still compatible with the specifications that do have multiple conjuncted or disjuncted links. Consequently, the different sites are not obliged to code this information by hand when the interpretation is too obscure.
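To make the nesting of the index labels concrete, the following sketch turns a list of labelled links into a nested structure. It rests on one reading of the notation that the text does not fully pin down: each letter names the operator of a set ("c" for conjunction, "d" for disjunction), each number names a slot in that set, and links sharing a slot number (as in the "dog" example) are simply members of the same set. The function names and the tuple representation are hypothetical.

```python
import re

OPERATORS = {"c": "and", "d": "or"}


def parse_label(label):
    """Split an index label such as 'c2d1' into [('c', 2), ('d', 1)]."""
    parts = re.findall(r"([cd])(\d+)", label)
    if not parts or "".join(op + num for op, num in parts) != label:
        raise ValueError(f"malformed index label: {label!r}")
    return [(op, int(num)) for op, num in parts]


def build(entries):
    """Nest (path, synset) entries into an ('and' | 'or', members) tuple."""
    operators = {path[0][0] for path, _ in entries}
    if len(operators) != 1:
        raise ValueError("conjunction and disjunction mixed at one level")
    slots = {}
    for path, synset in entries:
        slots.setdefault(path[0][1], []).append((path[1:], synset))
    members = []
    for index in sorted(slots):
        nested = [(rest, synset) for rest, synset in slots[index] if rest]
        # Links whose label ends at this level are direct members of the set.
        members.extend(synset for rest, synset in slots[index] if not rest)
        if nested:
            members.append(build(nested))
    return (OPERATORS[operators.pop()], members)


def scope(labelled_links):
    """Interpret the (label, synset) links attached to one synset."""
    return build([(parse_label(label), synset) for label, synset in labelled_links])


airplane = scope([("c1", "door"), ("c2d1", "jet engine"), ("c2d2", "propeller")])
door = scope([("d1", "car"), ("d2", "room"), ("d3", "entrance")])
```

Under these assumptions `airplane` comes out as a conjunction of "door" with a disjunction of "jet engine" and "propeller", and `door` as a three-way disjunction of holonyms. Links without any label would need a default, which this sketch leaves out.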

Finally, in addition to the disjunctive class of hyperonyms it is desirable to store the verb-argument relations between nouns and verbs. In the above example "threat" can refer to the agent or force of the verb "threat", but in principle there may be a word for any manner of involvement or role in some event or state: "food" as a patient of "eat", "medicine" as an instrument of "cure", etc. By combining the typical activities, functions or events associated with a noun (their TELIC-role, Pustejovsky 1991) with such specification of the roles,12 it is possible to express the different involvements of the nouns in the event:

danger N     --agent-of-->          threat V
             <--done-by--
medicine N   --instrument-for-->    cure V
             <--done-with--
food N       --patient-of-->        eat V
             <--done-to--
key N        --instrument-for-->    open V
             <--done-with--
             --instrument-for-->    close V
             <--done-with--
spoon N      --instrument-for-->    eat V
             <--done-with--
fork N       --instrument-for-->    eat V
             <--done-with--
store N      --location-where-->    store V
             <--done at--

It is important to realize here that we are not coding any grammatical information in EuroWordNet. We therefore cannot encode this information by linking the nouns to the syntactic argument structure of the verbs. In the case of the verb "open" and the noun "key", for example, this means that the syntactic realization of the instrument can either be as the Subject or as a with-PP complement in English. Such correlations will have to be made in an extension of the project, possibly by combining the EuroWordNet results with syntactic lexicons generated in the Parole project. The advantage of the purely semantic link that is proposed here is that it can be transferred from one language to another, whereas the syntactic correlation, which is language-specific, cannot.

12 Interestingly enough, such roles are already reflected by some synsets in WordNet1.5. In the above example we see that "threat" is linked to "danger", which has as a hyperonym: causal agent, cause, causal agency -- (any entity that causes events to happen).

The event-involvement or TELIC-role of many of these nouns is often the only sensible way of classifying them. However, it should not be the case that all possible roles and involvements are coded for all nouns. The general rule is that links are established only when there is a strong association between two synsets and the information can be extracted from the resources. Furthermore, the general relations noun-to-verb-telic and verb-to-noun-telic can be used for those cases in which the specific involvement cannot be determined or is not clear enough, while the association with the verb is crucial for the classification of the noun.
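The rule just described, a specific role where one can be determined and the general telic relations otherwise, can be sketched as follows. The role labels are taken from the examples above (with "done at" written as "done-at" for uniformity); the names `ROLE_INVERSE`, `GENERAL_TELIC` and `role_links` are hypothetical and only serve the illustration.

```python
# Role labels as in the examples above, each paired with its inverse.
ROLE_INVERSE = {
    "agent-of": "done-by",
    "patient-of": "done-to",
    "instrument-for": "done-with",
    "location-where": "done-at",
}
# General relations for cases where the specific involvement is unclear.
GENERAL_TELIC = ("noun-to-verb-telic", "verb-to-noun-telic")


def role_links(noun, verb, role=None):
    """Return the noun-to-verb link and its inverse.

    With role=None the general telic relations are used instead of a
    specific role.
    """
    if role is None:
        forward, backward = GENERAL_TELIC
    elif role in ROLE_INVERSE:
        forward, backward = role, ROLE_INVERSE[role]
    else:
        raise ValueError(f"unknown role: {role!r}")
    return [(noun, forward, verb), (verb, backward, noun)]


specific = role_links("key", "open", "instrument-for")
general = role_links("medicine", "cure")
```

Because both links are purely semantic, nothing in this representation refers to subjects or with-PPs, which is what allows the links to be carried over between languages.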

3.1.3.4 Synonymy

One of the advantages of the synset structure is that equivalence of meaning is explicitly encoded in the entry structure. However, what is still implicit is the fine-grainedness of the synset distinctions. In principle it is possible that each synset correlates with a unique sense in a traditional dictionary. In that case the number of synsets will equal the number of senses. In practice we see that in the Princeton WordNet (just as in traditional dictionaries) many synsets are distinguished for grammatical, morphological or quasi-semantic reasons while the covered concept is more or less the same. For example, WordNet1.5 distinguishes 4 different senses of `people' which reflect different aspects or different specialisations of the single concept:

Examples from WordNet1.5

Sense 1
people -- (any group of persons (men or women or children) collectively; "old people"; "there were at least 200 people in the audience")
   => group, grouping -- (any number of entities (members) considered as a unit)

Sense 2
citizenry, people -- (the body of citizens of a state or country; "the Spanish people")
   => group, grouping -- (any number of entities (members) considered as a unit)

Sense 3
people -- (members of a family line; "his people have been farmers for generations"; "are your people still alive?")
   => family, family line, folk, kinfolk, kinsfolk, sept, phratry -- (people descended from a common ancestor; "his family had lived in Masachusetts since the Mayflower")

Sense 4
multitude, masses, mass, hoi polloi, people -- (the common people generally; "separate the warriors from the mass"; "power to the people")
   => group, grouping -- (any number of entities (members) considered as a unit)

We can make the following comments on the senses that are distinguished:

• Sense 1 is to be understood as any group of persons.
• Sense 2 of `people' is put in a synset with `citizenry'. The concept of `citizenry' can only be understood if the adjective or PP of `people' denotes something geographical: the people of Spain. Strictly speaking, it is therefore a hyponym of `people' sense 1. But in the WN 1.5 document this is explained as a restriction in WN: `people' is not allowed to be a hyponym of `people' and is linked to the same hyperonym as sense 1: "group, grouping".
• Sense 3 looks like a different meaning, but the examples show that the possessive pronouns `his' and `your' trigger the meaning of `family' 13, which can be seen as a more specialized use. In sense 4 something similar occurs.

Senses 1, 2 and 4 are all hyponyms of `group, grouping', which is probably a signal to look in more detail at this sense distinction. This seems to be confirmed by the hyponyms of the different senses (here only a summary of all the hyponyms is given):

13 Sense 3 of `people' is also dubious for another reason. It is already clear why this sense of `people' is not a hyponym of `people' sense 1. But why is this concept of `people' not in a synset with its current hyperonym (family, family line, etc.), which is the case with sense 4 of `people'? This solution introduces an extra level in the hierarchy without an obvious reason.

kind of people in WordNet1.5

Sense 1
people -- (any group of persons (men or women or children) collectively; "old people"; "there were at least 200 people in the audience")
   ... skipped...
   => folk, common people -- (the majority of a people who determine the group character and preserve its customs and form of civilization from one generation to the next)
   => caste -- (a social class separated from others by distinctions of hereditary rank or profession or wealth)
   ... skipped...
   => nation, nationality, land, country, a people -- (the people of a nation or country or a community of persons bound by a common heritage)
       => tribe, federation of tribes -- (as of American Indians)
       => British, British people, the British, Brits -- (the people of Great Britain)
       => English, English people, the English -- (the people of England)
       => Irish, Irish people, the Irish -- (people of Ireland or of Irish extraction)
   ... skipped...
   => population -- (the people who inhabit a territory or state)
   ... skipped...

Sense 2
citizenry, people -- (the body of citizens of a state or country; "the Spanish people")
   => electorate -- (the body of enfranchised citizens; those qualified to vote)

Surprisingly, `population' and `nation, nationality, land, country, a people' and all its hyponyms are kind of people sense 1 and not sense 2. Therefore sense 2 is left with only one dubious hyponym `electorate'. Apparently there is a strong tendency to relate hyponyms to the first sense when senses come very close, even when the glosses of the other senses seem to fit the hyponyms better.

In general, what can be observed is that different aspects and more specific uses of a concept are split over different senses. In principle these senses can be merged using disjunction or conjunction of relations. Different aspects of a concept (people belonging to a nation or forming a cultural group) can be merged as disjunctive hyperonyms, different specializations as disjunctive hyponyms.

In yet other cases senses are split on a more systematic basis. As argued by Fellbaum (Miller et al. 1993) the transitive and intransitive uses of "eat" are deliberately distinguished as separate synsets, as is illustrated by the example patterns for sense 1 and 2:

Examples from WordNet1.5

Sense 1
eat, take in -- (take in solid food)
   => consume, ingest, take, have -- (serve oneself to, or consume regularly; "Have another bowl of chicken soup!" "I don't take sugar in my coffee")
Also See-> eat at, eat into
*> Somebody ----s something

Sense 2
eat -- (eat a meal; take a meal)
   => consume, ingest, take, have -- (serve oneself to, or consume regularly; "Have another bowl of chicken soup!" "I don't take sugar in my coffee")
Also See-> eat in, eat out, eat up
*> Somebody ----s

The motivation for this distinction is that hyponyms of "eat" fall into two distinct classes: strict-transitive (e.g. "wash down", "nibble", "dunk", "smack") and strict-intransitive verbs (e.g. "gobble up", "dine", "brunch"). However, many of the hyponyms in the database are not strict (in)transitives at all. For example, "gobble up" can be used both transitively and intransitively, as confirmed by Collins Co-Build (1987), which quotes the following vivid example: "truck drivers gobbling up hot dogs with dripping mustard". The difference between transitive and intransitive uses seems to depend on whether the object is implied or not. Even though this may be a semantic difference, it still is not excluded by linking all the hyponyms to a general sense of "eat" which can be realized as either transitive or intransitive (depending on whether the object is implied by the context or is considered irrelevant for the communication). Such pragmatic and syntactic considerations clearly go beyond the scope of the project. If considered at all, this type of information should be coded separately from the general classification by the synset (possibly in an extension of EuroWordNet).

Whereas the above cases might be merged into single synsets, other syntactic-semantic alternations may still be kept separate because they typically reflect the semantic distinctions that are reflected as relations in the wordnets. Causative and inchoative alternations of verbs as described by Levin (1993) correlate with the "cause" relation discussed above. Similarly, count-mass distinctions of nouns correlate with the portion-substance relation. Whenever a grammatical variation correlates with one of the defined semantic relations in the database it can definitely be used to distinguish separate synsets.

Finally, note that by expressing disjunction and conjunction in a more systematic way we will also be able to merge many senses which reflect either a disjunction or a conjunction of classes. The above example of "door" in WordNet1.5 can be restructured as a single synset with a disjunctive range of holonyms, whereas the two senses of "movement" could be combined as a single synset with a conjunctive range. All the discussed cases will result in variants in the synset that have a list of source-indications relating them to separate, unique senses in the original resource. When these resources are combined with traditional resources, more fine-grained structuring of the entries can thus be recovered.

3.1.3.5 Subtypes and Supertypes of meronymy

Even though the WordNet documentation discusses various subtypes of meronymy relations, these have not been implemented in the database. However, since in some cases such more specific relations can be easily extracted from some of the definition patterns and may be very useful for information retrieval purposes, they will be included in the database. Frequent definition phrases such as "made of", "consists of" or various locative constructions can be treated as specific types of meronymy (Winston et al 1987, Chaffin et al 1988). In addition to the three subtypes mentioned above, we can also distinguish a location-meronymy and a made-of-meronymy relation:

location-meronym: center     location-holonym: city
made-of-meronym: glass       made-of-holonym: window

Nevertheless, in yet other cases the resources give just enough information to decide automatically that we are dealing with a meronymy relation, but not to decide on the subtype of meronymy relation. We therefore also need a general relation "meronym-holonym" for all undecidable cases. If time is left these can be differentiated manually. Adding these relations will not change the current structure and can be seen as a relatively simple addition.

This hierarchical structuring of meronymy relations is also supported by the psycho-linguistic findings of Winston et al (1987) and Chaffin et al (1988). They claim that meronymy relations are not just primitive links but also have properties themselves.
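The fallback from a specific subtype to the general relation can be sketched as follows. This is only an illustration, not the project's extraction procedure: the cue phrases in `CUE_TO_SUBTYPE`, the hyphenated labels and the function names are assumptions made for the example; only the relation names are taken from the text.

```python
# Minimal sketch of a two-level meronymy hierarchy (hypothetical names).
GENERAL = "meronym"
SUBTYPES = {
    "member-meronym",
    "substance-meronym",
    "part-meronym",
    "made-of-meronym",
    "location-meronym",
}

# Assumed cue phrases; a real extractor would use the definition
# patterns of each resource.
CUE_TO_SUBTYPE = {
    "made of": "made-of-meronym",
    "member of": "member-meronym",
}


def classify(cue):
    """Return the specific subtype for a cue phrase, else the general relation."""
    return CUE_TO_SUBTYPE.get(cue, GENERAL)


def subsumes(general, specific):
    """True if a query for `general` should also match links typed `specific`."""
    return general == specific or (general == GENERAL and specific in SUBTYPES)
```

A query for the general relation then matches every subtype, while a query for a subtype does not match undifferentiated links.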

3.1.3.6 Differentiation of verb relations

As the Novell overview of the WN1.5 database (Díez-Orzas 1996) shows, the verb hierarchy has 573 tops and few levels. A shallow and diverse hierarchy typically results from linking too many synsets to relatively abstract synsets. Ideal classifications show a balance in the maximum number of items they classify and the minimal number of categories that are needed (Rosch 1977). A badly shaped hierarchy is less useful for information retrieval because query expansions will be very crude.

The fact that many verb synsets are linked to a large number of tops, exhibiting a shallow hierarchy structure, suggests that other relations are more important or useful for classifying verbs. This is illustrated by the verb 'kill' which is a top in WN1.5 but is more specifically related via the cause relation with the synset {die, pip out, etc.}. However, the non-hyponymy-relations cause and entailment do not occur very frequently in WN1.5 (204 and 435 times respectively, Díez-Orzas 1996) so these cannot compensate for the bad hierarchy structure.

In EuroWordNet we will therefore investigate to what extent the verb relations can be differentiated on the basis of very frequent and salient patterns in the resources that are available. The following definitions of verbs in LDOCE, for example, illustrate that particular frequent matrix-verbs express relations between verbs as well:

Entry Word      Definition (LDOCE 1978)
treat 4 (2)     to try to cure by medical means
dismiss (2)     to allow to go
suspend (4)     to prevent from taking part in a team, ...

Strictly speaking the matrix verbs (“try”, “allow” and “prevent”, set in italics in the original) are the syntactic heads of the definitions. However, semantically they form rather shallow hierarchies (because of their frequency and the diversity of the defined verbs), while the complement verb (“cure”, “go”, “take part”, underlined in the original) intuitively has the strongest association with the defined verbs. Furthermore, the matrix verbs (see footnote 14) seem to express a less strict or non-factive causation relation (Lyons 1977). They are different from causative verbs in that the factivity of the complement is not necessarily implied. Whereas factivity of “kill” implies the factivity of “die”, this cannot (or to a lesser extent) be said of the pairs “treat-cure”, “dismiss-go”, etc.

14 Not all matrix-verbs that frequently occur as definition head also express a relation between two events:

     Entry Word        Definition (LDOCE 1978)
a    blow up 2 (5)     (of bad weather) to start blowing; arrive
     initiate          to start something working
b    play through      to continue playing while other players in front wait
     persist (2)       to continue to exist
c    shut up           to stop talking
     hinder 1 (1)      to stop (someone from doing something)

Here “start”, “continue” and “stop” express aspectual information of the defined verb. Nevertheless, the defined verb can still be seen as a hyponym of the complement of the matrix verb. Aspectuality can therefore be seen as a property of the defined verb, whereas the complement provides a good hyperonym.

From an information retrieval point of view the strong relation between the complement and the defined verb makes sense. Even though the non-factivity does not necessarily imply the truth of the complement event, it nevertheless makes it relevant and also probable. In practice, the cause-relation can then be differentiated as factive-cause and non-factive-cause (without further distinguishing between deontic and epistemic relations):

kill    ---factive-causes-->     die
dismiss ---non-factive-causes--> to go
suspend ---non-factive-causes--> not take part
search  ---non-factive-causes--> find
treat   ---non-factive-causes--> cure

If, however, the relations cannot consistently be defined or extracted and if they do not occur systematically we may also use a more general relation like ASSOCIATION (as is used by Van Dale in their synonym dictionary) to store typical verb associations or use an optional label to express factivity of causal relations. In the latter case, this will result in a similar structure as for conjunction and disjunction of relations. The general relation is then “cause” and the label factive or non-factive can be added when this information is explicitly extractable.
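The option of a general cause relation with an optional factivity label can be sketched as below. The class and function names are hypothetical, and the unlabelled “dismiss” link is included only to show the fallback case; it is not a claim about what the resources provide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CauseLink:
    source: str
    target: str
    # True = factive, False = non-factive, None = label not extractable
    factive: Optional[bool] = None

    def label(self) -> str:
        if self.factive is None:
            return "causes"
        return "factive-causes" if self.factive else "non-factive-causes"


LINKS = [
    CauseLink("kill", "die", factive=True),
    CauseLink("treat", "cure", factive=False),
    CauseLink("search", "find", factive=False),
    CauseLink("dismiss", "go"),  # stored without a label, for illustration
]


def expand(verb, factive_only=False):
    """Return the caused verbs for `verb`, optionally only the factive ones."""
    return [
        link.target
        for link in LINKS
        if link.source == verb and (link.factive is True or not factive_only)
    ]
```

A retrieval engine could use the broad expansion by default and restrict it to factive links when precision matters more than recall.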

The actual definition of the relations and the criteria will be further defined in work packages WP31 and WP41. During these work packages it may very well be the case that some relations are added or removed. The general argumentation for introducing a relation type is more or less as follows:

(i) the hyponymy relation is not very satisfying
(ii) there is another verb (often the complement of a matrix verb) which has a strong association with the defined verb but none of the existing relations can be used to relate the defined verb to the associated verb
(iii) the association can be generalized as a relation between two verbs or nouns
(iv) the relation applies to a substantial proportion of the vocabulary
(v) the relation can easily be extracted from the available resources
(vi) the relation is relevant for information retrieval

Neither differentiating links nor leaving out a relation will be a problem for the general design of the database.

With the above changes the total set of language-internal relations that will initially be used in the EuroWordNet project is as follows (where some relations have been duplicated to reflect the different directions of the relation):

Overview of all the language-internal relations in EuroWordNet

Main Relation   Subtype                   Disjuncts   Conjuncts   POS

Hyponymy        hyperonym                 +           +           N, V
                noun-to-verb-hyperonym    +           +           N
                hyponym                   +           -           N, V
                noun-to-verb-hyponym      +           -           V
Meronymy        holonym                   +           +           N
                member holonym            +           +           N
                substance holonym         +           +           N
                part holonym              +           +           N
                made-of holonym           +           +           N
                location holonym          +           +           N
                meronym                   +           +           N
                member meronym            +           +           N
                substance meronym         +           +           N
                part meronym              +           +           N
                made-of meronym           +           +           N
                location meronym          +           +           N
Synonymy        synonym                   -           -           N, V
                verb-to-noun-synonym      -           -           V
                noun-to-verb-synonym      -           -           N
Telic           noun-to-verb-telic        +           +           N
                verb-to-noun-telic        +           +           V
                agent-of                  +           +           N
                patient-of                +           +           N
                instrument-for            +           +           N
                location-where            +           +           N
                done-by                   +           +           V
                done-to                   +           +           V
                done-with                 +           +           V
                done-at                   +           +           V
Entailment      is-entailed-by            +           +           V
                entails                   +           +           V
Cause           is-caused-by              +           +           V
                causes                    +           +           V
Antonymy        antonym                   -           -           N, V
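The overview can be read as a small schema against which individual links are checked. The sketch below encodes a few rows only; the function name, the `grouping` encoding and the reading of the POS column as the part of speech of the synset that carries the link are assumptions for illustration.

```python
# A few rows of the overview, keyed by relation subtype.
# Each value: (disjuncts allowed, conjuncts allowed, parts of speech).
RELATION_TABLE = {
    "hyperonym": (True, True, {"N", "V"}),
    "hyponym": (True, False, {"N", "V"}),
    "part holonym": (True, True, {"N"}),
    "synonym": (False, False, {"N", "V"}),
    "causes": (True, True, {"V"}),
    "antonym": (False, False, {"N", "V"}),
}


def check_link(subtype, pos, grouping=None):
    """Return a list of problems for one link; an empty list means it fits.

    `grouping` is None, "disjunction" or "conjunction" (hypothetical encoding).
    """
    disjuncts_ok, conjuncts_ok, allowed_pos = RELATION_TABLE[subtype]
    problems = []
    if pos not in allowed_pos:
        problems.append(f"{subtype} is not defined for part of speech {pos}")
    if grouping == "disjunction" and not disjuncts_ok:
        problems.append(f"{subtype} does not take disjuncts")
    if grouping == "conjunction" and not conjuncts_ok:
        problems.append(f"{subtype} does not take conjuncts")
    return problems
```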

Typically, the conjunction/disjunction distinction and the part-of-speech distinction involve changes to the Princeton WordNet which are relevant for information retrieval applications and which deal with phenomena which are treated differently in dictionaries. The solutions given here are not fundamentally different from the structure of WordNet1.5 (in that they merely involve additions rather than changes), while they give us much more flexibility to treat these phenomena in a more natural way (see footnote 15). Something similar can be said for the differentiation (and generalization) of the meronymy, cause and synonymy relations. It provides the flexibility to code more detailed information when it is extractable, but also to store more global links when it is not available.

3.1.4 The multilingual architecture

There are two fundamental MT approaches that can be used as the basic design of the multilingual database:

· inter-lingua model: several languages are linked to one intermediary representation.
· trans-lingua model: each language has a unique translation relation to each of the other languages.

Even though the trans-lingua model comes closer to the way human translation might work, the inter-lingua model is preferable for practical reasons:

· good bilingual dictionaries exist for many English:Foreign language pairs but are not always available for pairs of non-English languages.
· English is the language that is most common and best mastered by all partners.
· the Princeton wordnet is in English and there are no wordnets available for the other languages.
· the number of relations and the database design can be simplified: instead of having many to many relations across languages, each language has a single relation to English.

A major drawback of the inter-lingua model is that transitivity of the relations cannot be guaranteed: a Dutch and Spanish word that have an equivalence link to the same interlingua concept are not automatically useful Dutch-Spanish translation pairs. Since this project has to yield concrete results in a limited time the interlingua model is nevertheless chosen because of its practical benefits. Within the inter-lingua model there are again two design possibilities:

· the English WordNet as a language specific system will function as the inter-lingua as well.
· a copy of the English WordNet will function as the inter-lingua and the English WordNet will be linked to the intermediary vocabulary as well.

At first sight the first option seems simpler and the second option leads to the introduction of an extra level. However, the second option will give us much more flexibility to maintain and control the database and to be able to modify and change the English wordnet without having an effect on the other wordnets:

· a change of the intermediary database will not have an effect on the English wordnet.
· a change of the English wordnet will not have an effect on the intermediary database.

15 If no solution is provided the extraction of the relations will be complicated by having to make an unintuitive or unsatisfying choice. For example, whenever the extraction of relations results in multiple (disjuncted or conjuncted) synsets these have to be stored some way or another (or only one of them has to be selected). By storing multiple synsets, however, the implicit structure of the Princeton WordNet will interpret the list by default either as a conjunction or a disjunction. By making this explicit it is still possible to fill in a (the same) default interpretation, or, when the resources provide the information, to derive the information explicitly. Finally, since the information is stored as a feature of the links it can easily be omitted when the results are converted to another format.

Having a language-neutral intermediary level will make it easier to represent language specific properties. A language specific concept will be a concept that is linked to only a subset (three or fewer) of the languages (assuming that the linking failure is not due to the extraction process or incompleteness of the resources).

One danger for a language-independent intermediary level is that there is no control on the status of this system when changes are made on the basis of distinctions in any of the languages. If all language-specific concepts are automatically added the result may be a list of concepts that includes a variety of distinctions and concept levels (instigated by the variety of cultures and languages) that is no longer understandable by anybody.

Another danger is that two sites might independently add a new concept to the intermediary network, neither knowing that the other is adding the same concept. To limit this danger we will design a policy for changing the intermediary system. Changes will be made by one partner after notifying the others and simultaneously updating the wordnets. Added records have to be glossed in English.

Since the development of a truly language-independent network goes beyond the scope and resources of this project we will make no theoretical claims about the intermediary list of concepts. To avoid any confusion the intermediary representation will be called the Inter-Lingual-Index (ILI), which is nothing but a list of records with glosses that form the superset of all the concepts occurring in the separate wordnets. Each wordnet (including the English wordnet) will be linked to these records (via the English language). The ILI will function as a practical device without having any status as a cognitive or linguistic knowledge representation system. Similarly, no claims will be made about the transitivity of the relation of each wordnet to the Index. The multilingual database will not be designed for automatic machine translation purposes, although it could be used as a starting point for developing such a translation resource.

How can we deal with lexical mismatches across languages within such a system? Lexical mismatches across languages are often distinguished on a morpho-syntactic basis. Words are said to be equivalent or name the same things across languages but differ in formal properties. The organisation of the entries as synsets will to a large extent abstract from these differences. Morpho-syntactic variants within a language and across languages will be joined in the same synset. When it comes to genuine semantic mismatches we can roughly distinguish two major classes:

1) Denotational Gaps

(a) klunen (to walk with skates...) <--> walk
    citroenjenever (kind of gin) <--> gin
    paranimf (a person who assists someone who presents his/her thesis) <--> person

2) Linguistic Gaps

Differences in specificity
(b) dedo <--> finger, toe
    hoofd (head of a human) <--> head
    kop (head of an animal) <--> head

Differences in conceptualization
(c) droogpruimen (to eat without drinking) <--> eat
    bluswater (fire extinguishing water) <--> water
    theewater (water used for making tea) <--> water
    koffiewater (water used for making coffee) <--> water

In the case of a genuine denotational gap there is an object or event in one culture which is totally unknown in the other culture. For example, the Dutch verb “klunen” refers to an event where skaters have to cross a stretch of land to go from one ice-track to another ice-track. For that purpose they walk with their skates on. The closest English concept will be “walk” which is however a parent concept equivalent to the more general Dutch verb “lopen”.

Linguistic gaps are different because the concepts do exist in the other culture but they happen to be lexicalised in one language and not in the other language. A more subtle difference can be made between differences in specificity and differences in conceptualization. Whereas “dedo” and “head” are simply more general than “finger/toe” and “hoofd/kop” respectively, the examples in (c) all name things and events which are temporary points of view or perspectives applying to the same types of things. We cannot say that “bluswater”, “theewater” and “koffiewater” are more specific kinds of “water”, rather they name a particular perspective or conceptualization of “water”. Although the distinction between these types of semantic mismatches may be very important for information retrieval it will be very difficult to keep them apart or to automatically extract them from the dictionaries. In the EuroWordNet database we will therefore make no distinction between these types of mismatches. This means that according to the EuroWordNet database the lexicalization in one language can either be more specific (vary in hyponymy level), more inclusive (vary in meronymy level) or more explicit in cause or entailment than in another language. No claims will be made about the cultural implications or world-knowledge structure behind these mismatches.

The ILI can be further structured in two different ways to deal with these different mismatches of the wordnets with the ILI concepts:

· no relations between the index records will be stored but languages may have different types of equivalence relations with the records. In this case the ILI will be a simple list without internal structure but the translation relations will be complex.
· the translation relation is simple but concepts in the ILI may have the same semantic relations just as the synsets in the wordnets.

If the ILI is unstructured this will mean that a synset that is unique in a language will have to be linked to a concept that is shared by the other languages. In the beginning the ILI may consist of the list of synsets of WordNet1.5. In those cases where there is a simple equivalence relation with a WordNet1.5 synset this can be directly expressed. If there is no direct equivalent it has to be linked to the WordNet1.5 synset that comes closest to the language-specific synset. In most cases this will be a hyperonym of the unique concept but it may also be a hyponym, meronym or any other synset. For expressing the relations with the closest synset, any of the language internal relations may be needed. Each of the above relations may thus have a bilingual equivalent. This approach has been worked out in the Acquilex-project (BRA 3030, BRA 7315) where formal representations have been developed for various kinds of equivalence relations, so-called tlinks (Copestake et al 1995). In the following picture Spanish “dedo” and Italian “dito” each have a hyperonym-tlink with “finger” and “toe”. Dutch “hoofd” and “kop” have a hyponym-tlink with “head”.

Unstructured Inter Lingual Index & Complex equivalence relations

[Figure not reproduced. It shows the English (EN: toe 1, finger 1, head 1), Italian (IT: dito 1), Dutch (DU: hoofd 1, kop 1) and Spanish (ES: dedo 1) synsets linked to the ILI records toe 1 {part of foot}, finger 1 {part of hand}, finger 1 or toe 1, head 1 {part of body}, head 2 {head1 of a human} and head 3 {head1 of an animal}. Link types in the legend: normal equivalent, hyponym equivalent, hyperonym equivalent.]

In this example we see that extra concepts have been added as well for the disjunction of “finger” and “toe” and for the more restricted uses of “head of animals” and “head of humans”. Obviously, the language-specific synsets will have a normal-tlink with the added non-English synsets. Strictly speaking it is not necessary to add these concepts to the ILI. All information about the relative position of the language-specific synsets across the wordnets can be derived from the tlinks. However, since the language specific wordnets will not have any glosses the ILI can be used to explain the added concept to the other wordnet builders. This is necessary to make sure that synsets that occur in two languages but are not lexicalised in English can somehow be matched when the wordnets are compared. Furthermore, other wordnet builders are also free to use this synset for linking. Clearly, in this approach there is a large number of complex cross-linguistic relations or tlinks while the ILI is a simple list. The advantage of this approach is that the ILI can remain an unstructured list of concepts.
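The unstructured variant can be sketched as a flat record list plus typed tlinks. The data layout and the function name below are assumptions for illustration; the synsets, glosses and link types follow the example in the text. Synsets of two languages that share an ILI record are returned as candidates only, since transitivity of the links is not guaranteed.

```python
from collections import defaultdict

# Flat ILI: record -> English gloss (no relations between records).
ILI = {
    "finger 1": "part of hand",
    "toe 1": "part of foot",
    "head 1": "part of body",
}

# Complex equivalence relations: (language, synset, tlink type, ILI record).
TLINKS = [
    ("EN", "finger 1", "normal", "finger 1"),
    ("EN", "toe 1", "normal", "toe 1"),
    ("EN", "head 1", "normal", "head 1"),
    ("ES", "dedo 1", "hyperonym", "finger 1"),
    ("ES", "dedo 1", "hyperonym", "toe 1"),
    ("IT", "dito 1", "hyperonym", "finger 1"),
    ("IT", "dito 1", "hyperonym", "toe 1"),
    ("DU", "hoofd 1", "hyponym", "head 1"),
    ("DU", "kop 1", "hyponym", "head 1"),
]


def candidates(lang, synset, target_lang):
    """List target-language synsets that share an ILI record with `synset`.

    Each result is (target synset, tlink of the source synset,
    tlink of the target synset, ILI record, gloss).
    """
    by_record = defaultdict(list)
    for l, s, rel, rec in TLINKS:
        by_record[rec].append((l, s, rel))
    found = set()
    for l, s, rel, rec in TLINKS:
        if (l, s) == (lang, synset):
            for l2, s2, rel2 in by_record[rec]:
                if l2 == target_lang:
                    found.add((s2, rel, rel2, rec, ILI[rec]))
    return sorted(found)
```

Comparing the two tlink types of a candidate pair indicates how close the match is: two hyperonym links to the same record, as for “dedo” and “dito”, suggest a closer correspondence than a hyponym link paired with a normal link.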

If we allow relations between the ILI concepts we can add a new concept for any language specific synset that is not present in the ILI. By linking the newly added concepts to the existing ILI-concepts (which are linked to the other languages) the relation with the other wordnets can be established indirectly. The advantage is that there can be a single and simple equivalence link for all the synsets with an ILI concept. The drawback is that the ILI needs to be structured.

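Structured Inter Lingual Index with Simple Equivalence Relations

[Figure not reproduced. It shows the same English (EN: toe 1, finger 1, head 1), Italian (IT: dito 1), Dutch (DU: hoofd 1, kop 1) and Spanish (ES: dedo 1) synsets, each with a simple equivalence link to one ILI record. The ILI records finger 1 or toe 1, toe 1 {part of foot}, finger 1 {part of hand}, head 1 {part of body}, head 2 {head1 of a human} and head 3 {head1 of an animal} are linked to each other.]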

Here we clearly see that the equivalence relations are simpler and fewer in number, but now the ILI concepts themselves are internally linked. The drawback is that we have to augment the structure of the ILI and cannot avoid making claims about it.

For practical reasons, we choose the first option, in which the ILI is unstructured and the equivalence relations may be complex. Complex multilingual relations only have to be considered site by site and there will be no need to communicate about concepts and relations from a many to many perspective. Furthermore, it will mean that future extensions of the database can take place without re-discussing the ILI structure. The ILI can then be seen as a fund of concepts which you can use in any way to establish a relation to the other wordnets. The status of this relation is then just practical.

The policy for changing the ILI will globally be as follows. A site that cannot find a proper equivalent among the ILI concepts (and has to derive a complex equivalence-relation) will generate an English gloss for the new ILI concept, e.g.:

equivalence relation:   "klunen" hyponym-tlink "walk"
new ILI record:         1234560 {way-of-walking}
new GLOSS:              {to walk on skates}

In some cases, the gloss could be generated from the complex equivalence-relation in a systematic way. One partner in the project will make sure that added glosses and concepts are properly structured and will from time to time update the ILI structure and distribute it.
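A minimal sketch of such an update step is given below. The class name, the sequential numbering and the duplicate test on identical glosses are assumptions made for the example; the actual policy relies on one partner checking the additions.

```python
class InterLingualIndex:
    """Flat list of ILI records, each with a label and an English gloss."""

    def __init__(self, first_id=1):
        self.records = {}          # record id -> (label, gloss)
        self._next_id = first_id

    def add(self, label, gloss):
        """Add a record and return its id; reuse an id for a repeated gloss."""
        gloss = gloss.strip()
        if not gloss:
            raise ValueError("every added ILI record needs an English gloss")
        # Naive duplicate check (assumption): same gloss, ignoring case.
        for record_id, (_, existing) in self.records.items():
            if existing.lower() == gloss.lower():
                return record_id
        record_id = self._next_id
        self._next_id += 1
        self.records[record_id] = (label, gloss)
        return record_id
```

A second site proposing the same gloss would then receive the existing record instead of creating a parallel one. Glosses that differ in wording but describe the same concept would still need the manual check described above.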


3.1.5 The top-concepts

Whereas the ILI will be an unstructured list, the top-concepts of each wordnet will be interlinked in a more systematic way. As suggested by the discussion on the coverage of the vocabulary we will put a lot of effort into defining the most frequent synsets in the wordnets. After the first subset is finished the wordnets will be loaded in the EuroWordNet database where the different tops can be viewed and compared. Using the Novell Concept Toolkit the tops can be visualised and edited. Furthermore, the criteria that define the relations can be used to verify the structure. Each site will likewise produce a systematised top-level of synsets for their own language with English glosses. The glossed top-synsets will then be merged in a separate workpackage. This will be done in much the same way as the synsets have been linked to the ILI, except that there will be consensus about the internal relations of these top-concepts. In the following example the Dutch synset “voorwerp” and the Italian synset “oggetto” (which may be top synsets in their wordnets) are linked to the same ILI record “object”, which in its turn is linked to a Top-Concept “Object”:

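Top-ontology

[Figure not reproduced. In the top-ontology, Concrete stands above Object and Substance. The Dutch synset “voorwerp” and the Italian synset “oggetto” are both linked to the ILI record “object”, which is linked to the top-concept Object.]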

Together, the top-concepts will form a language-neutral ontology with explicit opposition relations, reflecting the most salient and fundamental semantic distinctions. The ontology will give users control over the most basic semantic distinctions and classes that occur in all the wordnets. The structuring of the language-independent top-ontology will also be based on other ontologies that are available or will be developed by others. The ontologies considered so far include Acquilex, Sift, Komet (Bateman 1990) and Princeton WordNet (Miller et al 1993). Since the merging of top-concepts will be done manually and by means of discussions the number of structured concepts has to be limited.

The top-concepts are linked to ILI records which, in their turn, are linked to the language-specific top-synsets. In this way the top-ontology has to be specified only once and the information is distributed to the synsets of each language via the ILI records. By customizing the top-concepts (e.g. by adding semantic features or properties that can be inherited) the users can adapt the wordnets in all the related languages without having to access the wordnets.
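How a feature attached to a top-concept reaches the synsets of every language can be sketched as follows. The dictionaries, the function name and the feature names are hypothetical; the synsets and concepts are those of the “voorwerp”/“oggetto” example.

```python
# Top-ontology fragment: top-concept -> parent top-concept.
TOP_CONCEPT_PARENT = {"Object": "Concrete", "Substance": "Concrete"}

# ILI record -> top-concept, and (language, synset) -> ILI record.
ILI_TO_TOP = {"object": "Object"}
SYNSET_TO_ILI = {("DU", "voorwerp"): "object", ("IT", "oggetto"): "object"}

# Features stored once, at the top-concepts (feature names are assumptions).
FEATURES = {"Concrete": {"concrete"}}


def features_of(lang, synset):
    """Collect the features a synset inherits via its ILI record."""
    top = ILI_TO_TOP.get(SYNSET_TO_ILI.get((lang, synset)))
    collected = set()
    while top is not None:
        collected |= FEATURES.get(top, set())
        top = TOP_CONCEPT_PARENT.get(top)
    return collected
```

Adding a feature to `FEATURES["Object"]` changes the result for both the Dutch and the Italian synset without touching either wordnet.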

3.1.6 Domains

As suggested above, domains provide a very different way of relating concepts. Instead of a subsumption relation, domains relate concepts on the basis of scripts or topics, e.g. “sports”, “military”, “hospital”. Domain labels can be organised as a hierarchy (“water sports”, “field sports”, etc.), forming ever more specific groups of concepts that are likely to co-occur and that may form a coherent topic, story, scheme or script. Hierarchies of domains can be structured using the same hyponymy relation that is used for synsets and the top-concepts.

The purpose of domain information is twofold. First of all, it can directly be used in information retrieval (and also in language-learning tools and dictionary publishing) to group concepts in a different way, based on scripts rather than classification. This will have a very different retrieval effect. Secondly, using domains we can separate the generic vocabulary from the domain-specific vocabularies. This is important to control the ambiguity problem in Natural Language Processing (only those meanings have to be looked at which are unmarked for any domain or marked for the domain that is considered) and to customize the general resource for specific applications (with the possibility to undo or neglect the additions).

Since the domains are language-neutral as well they are also stored at the ILI concepts (just as the language-independent top-concepts) with a belongs-to relation. The resources for the wordnets differ considerably with respect to the availability of domain information. We therefore cannot expect that this type of information is massively extracted in this project. However, since it often involves language-independent information it only has to be provided once. Via the ILI records it can then be accessed from any wordnet (and the other way around) regardless of the domain differentiation in the original source. Furthermore, a user who wants to customize the wordnets does not have to store its domain information in every separate wordnet but only for the relevant ILI records. The tlink-relations from the ILI records to the wordnet synsets will then match the domain information with the language-specific wordnets.
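The indirection through the ILI can be illustrated with a small lookup. The domain labels and Dutch synsets are taken from the example figure of this chapter, but the assignment of ILI records to domains and all names in the code are assumptions made for the sketch.

```python
# Domain hierarchy: subdomain -> superdomain.
DOMAIN_PARENT = {"aviation": "traffic", "navigation": "traffic"}

# belongs-to relation, stored once at the ILI records (assumed assignment).
BELONGS_TO = {"aircraft": "aviation", "vessel": "navigation", "vehicle": "traffic"}

# Equivalence links from language-specific synsets to ILI records.
EQUIVALENT = {
    ("DU", "luchtvaartuig"): "aircraft",
    ("DU", "vaartuig"): "vessel",
    ("DU", "vervoermiddel"): "vehicle",
}


def in_domain(domain, query):
    """True if `domain` equals `query` or is one of its subdomains."""
    while domain is not None:
        if domain == query:
            return True
        domain = DOMAIN_PARENT.get(domain)
    return False


def synsets_for_domain(lang, query):
    """Return the synsets of one language whose ILI record falls in a domain."""
    return sorted(
        synset
        for (l, synset), record in EQUIVALENT.items()
        if l == lang and record in BELONGS_TO and in_domain(BELONGS_TO[record], query)
    )
```

The same function can serve ambiguity control: a sense is kept only if its ILI record is unmarked or falls in the domain under consideration.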

3.1.7 Instances

Instances are separate data types or objects that have an instance-link with synsets. Instances are distinguished from meanings to express the semantic difference between denotational lexical items and referential lexical items. Typical instances are cities and countries, e.g. Amsterdam, Antwerp, Barcelona, Pisa, Sheffield, etc. This information cannot be derived from the available dictionaries but the object types have to be distinguished so that users can customize the general and generic wordnet resources for their application (for example car-registration numbers, club members, etc.). In the design some of the semantic relations between synsets can also occur between instances.

As language-neutral information instances could directly be linked to the ILI concepts, but this design would cause some problems for user customization and maintenance, for example to store instances in different languages. In order to solve this problem, instances will be linked in the individual wordnet to their classes (with Belong-to-class link/Has-instances links). Another link will be used to relate the instances with ILI concepts, in the same way that meanings are linked to ILI concepts (see picture below Architecture Overview: Lang. Dependent/Independent Object Types). The instances will be proper nouns. Although domains and instances are distinguished in the design, there is no guarantee that they can be easily derived from the given resources. In that case they will be minimally added by hand to illustrate the possibilities of customization and extension.

3.1.8 Overview of the EuroWordNet relations

The next figures give an overview of all the objects in the database with some of the possible relations between them. The first figure shows the multilingual architecture, distinguishing the objects as language-specific and language-independent types. All language-independent pieces of information are as much as possible stored at the ILI concepts. This will make maintenance and customization much easier and more flexible. Changes in the domain, the top-concepts and at the instance-level do not have to be specified for each wordnet but only for the ILI-records. The wordnets can now be seen as language-specific lexicalization patterns that give access to these bits of information.

Furthermore, it makes clear that the main focus of the project is on constructing the individual language-specific wordnets and not on providing general world-knowledge. At most, bits and pieces of world-knowledge will be provided by means of the top-concepts, the domains and the instances that are linked to the ILI records. As such, it is not the goal of the project to extensively provide this information for any thinkable concept. They are distinguished here because:

· they can, to a limited extent, be derived from some of the resources,
· they can be added/removed by a user without considering the wordnets,
· they can be shared by all the wordnets.

By allowing for these data objects in the design the results can more easily be extended in the future, even though we might not be able to provide all the information in this project.

The second figure gives an overview of the different types of relations that may hold between the object types. In the third figure, an example is given for some Dutch synsets, illustrating how they can be linked to each other and to ILI-records, the domains and the top-concepts. This is how the information in the Dutch wordnet may be structured. A user who wants to add instance information has to add these to the ILI-records and not to the Dutch wordnets. It is therefore not necessary to speak Dutch in order to customize the wordnet.


[Figure not reproduced: example for some Dutch synsets. Recoverable labels: top concepts (T-C): T, abstract, concrete, relation, property, event, object, substance, animate, inanimate, instrument, with [opposite] links; DOMAIN: [superdomain] traffic, [subdomains] aviation, navigation; ILI records: vehicle, vessel, aircraft, {aeroplane, airplane, plane}, {helicopter, chopper}, wings, landing gear, rotor; Dutch synsets: vervoermiddel, vaartuig, luchtvaartuig, vliegtuig, helikopter, vliegmachine, toestel, wentelwiek, vleugels, hefschroef, landingsgestel; link labels: [hyp(er)/onyms], [synonyms], [holonyms/meronyms].]

3.2 The EuroWordNet Database

The general EuroWordNet Database will as much as possible be derived from the existing Novell ConceptBase (Díez-Orzas 1995, Cuypers and Díez-Orzas 1996). Whereas the Novell Database can be used for interactively developing and modifying wordnets, the EuroWordNet database can only be used for viewing the data from a multilingual perspective and for exporting the wordnets or selections of the wordnets. The resulting EuroWordNet database will be freely available but the Novell ConceptBase will be available under commercial licensing conditions. The EuroWordNet database will operate on compressed indexes generated by the Novell Wordnet Database and it will allow for the following operations:

• viewing wordnets for two languages simultaneously
• traversing relations between words within these wordnets and across wordnets
• viewing and browsing the language-independent top-ontology of basic semantic concepts
• deriving statistics about the relations in the wordnets
• checking redundancies in the information, by means of a dedicated module, to measure the consistency of the data
• querying the wordnets and making specific selections
• exporting selections of data to a plain ANSI file

3.2.1 Interface requirements

The interface will show all the public information stored in EuroWordNet.

1. Technical requirements

• Platform: Windows 95 or Windows NT
• 32-bit application
• ANSI character set, which contains all the needed diacritics for the different European languages. If some tools or processes need to handle other character sets (ASCII) a special system of codes will be necessary.

2. Input

• A compressed FLAIM DataBase (generated by the Novell ConceptBase)

3. Process

• Decompression for viewing
• Global viewer and hierarchy browsing
• Selections and queries
• Navigational services
• Export services to ANSI
• Statistical information

4. Output

· Graphic: display of links and full record information
· Export: ANSI files in the transportation format

5. Modules

· Storage

- Compressed database

· Graphic Viewer

- Tree-like display showing types and subtypes of links, sources and destinations
- Display of at least two languages to allow comparisons
- Display of monolingual and multilingual connections
- Display of two languages synchronized by the interlingua
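As an illustration of what "two languages synchronized by the interlingua" could mean in practice, the following minimal Python sketch looks up a word in one wordnet and returns the synsets of the other wordnet that share the same interlingua record. The record identifiers, the dictionary layout and the function name are assumptions made for this example only; they are not part of the EuroWordNet or Novell design.

```python
# Hypothetical interlingua (ILI) record identifiers; the words are the
# Dutch and English aircraft terms used elsewhere in this document.
DUTCH = {
    "ili:aircraft": ["luchtvaartuig"],
    "ili:aeroplane": ["vliegtuig", "vliegmachine"],
    "ili:helicopter": ["helikopter", "wentelwiek"],
}
ENGLISH = {
    "ili:aircraft": ["aircraft"],
    "ili:aeroplane": ["aeroplane", "airplane", "plane"],
    "ili:helicopter": ["helicopter", "chopper"],
}


def synchronized(word, source, target):
    """Return {ILI record: (source synset, target synset)} for every
    source synset that contains `word`."""
    return {
        ili: (synset, target.get(ili, []))
        for ili, synset in source.items()
        if word in synset
    }


print(synchronized("vliegtuig", DUTCH, ENGLISH))
```

Because both wordnets are keyed on the same records, the same function serves either direction, for example `synchronized("chopper", ENGLISH, DUTCH)`.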

· Export service

- Selection of information to export:
  - by means of individual selections
  - by means of a query using formal parameters (absence/presence of a feature or link)
  - by means of a query using data structure parameters (inclusion in hierarchies, connection with an element, etc.)
- Storage of queries
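The two kinds of query parameters can be sketched as follows. This is only an illustrative Python example under assumed names: the `Record` class, the link-type labels and the `select` function are hypothetical and do not describe the actual query interface. Formal parameters test for the presence or absence of a link type; the data structure parameter tests inclusion in a hyperonym hierarchy.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    word: str
    links: dict = field(default_factory=dict)  # link type -> list of target words


def in_hierarchy(record, top, by_word):
    """True if `top` is the record itself or reachable via hyperonym links."""
    seen, stack = set(), [record.word]
    while stack:
        word = stack.pop()
        if word == top:
            return True
        if word in seen or word not in by_word:
            continue
        seen.add(word)
        stack.extend(by_word[word].links.get("hyperonym", []))
    return False


def select(records, has_link=None, lacks_link=None, under=None):
    """Return the words of all records that satisfy every given parameter."""
    by_word = {r.word: r for r in records}
    selected = []
    for r in records:
        if has_link and has_link not in r.links:
            continue
        if lacks_link and lacks_link in r.links:
            continue
        if under and not in_hierarchy(r, under, by_word):
            continue
        selected.append(r.word)
    return selected


RECORDS = [
    Record("vehicle"),
    Record("vessel", {"hyperonym": ["vehicle"]}),
    Record("aircraft", {"hyperonym": ["vehicle"]}),
    Record("aeroplane", {"hyperonym": ["aircraft"], "meronym": ["wings", "landing gear"]}),
    Record("helicopter", {"hyperonym": ["aircraft"], "meronym": ["rotor"]}),
]

print(select(RECORDS, has_link="meronym"))
print(select(RECORDS, under="aircraft", lacks_link="meronym"))
```

A stored query would then simply be a saved set of these parameters that can be re-applied to a later version of the data.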

· Consistency checking

A specific module can check the consistency of data added to the system. This module will make use of redundancies in the information, such as for example:

(i) if two word senses in different wordnets have an equivalence link (e.g. "horse" in English and "paard" in Dutch) then the same relations should be found to equivalent word senses in each wordnet (e.g. "horse" has a hyperonym "animal" which is equivalent to "dier" which is also the hyperonym of "paard" in Dutch). The addition of more and more separately-built wordnets will lead to an incremental strengthening of relations that are consistent.

(ii) if two word senses have a hyponymy relation there must be other senses which are co-hyponyms of the more specific word sense.

(iii) if a word sense is the meronym (a part) or holonym (the whole) of another word sense the relation should also be reversible.
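Checks (i) and (iii) can be sketched in a few lines of Python. The representation below (a dictionary from word to relation sets, plus a dictionary of equivalence links) and both function names are assumptions chosen for illustration; the actual module works on the compressed database and may be organized quite differently. Check (ii) is not covered by this sketch.

```python
# Assumed toy representation: word -> {relation name: set of related words}.
english = {"horse": {"hyperonym": {"animal"}}, "animal": {}}
dutch = {"paard": {"hyperonym": {"dier"}}, "dier": {}}
equivalents = {"horse": "paard", "animal": "dier"}  # English -> Dutch links


def unmirrored_relations(source, target, equiv, relation="hyperonym"):
    """Check (i): list (word, related) pairs in `source` whose equivalents
    are not connected by the same relation in `target`."""
    problems = []
    for word, relations in source.items():
        for related in relations.get(relation, ()):
            if word not in equiv or related not in equiv:
                continue  # no equivalence link, so nothing to compare
            target_related = target.get(equiv[word], {}).get(relation, set())
            if equiv[related] not in target_related:
                problems.append((word, related))
    return sorted(problems)


def irreversible_relations(wordnet, relation="meronym", inverse="holonym"):
    """Check (iii): list (whole, part) pairs that lack the inverse link."""
    problems = []
    for whole, relations in wordnet.items():
        for part in relations.get(relation, ()):
            if whole not in wordnet.get(part, {}).get(inverse, set()):
                problems.append((whole, part))
    return sorted(problems)


parts = {
    "aeroplane": {"meronym": {"wings", "landing gear"}},
    "wings": {"holonym": {"aeroplane"}},
    "landing gear": {},  # inverse link deliberately left out
}

print(unmirrored_relations(english, dutch, equivalents))
print(irreversible_relations(parts))
```

In this toy data the horse/paard pair passes check (i), while the missing holonym link from "landing gear" back to "aeroplane" is reported by check (iii).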

In the future, the consistency checking mechanism can be used to provide a measure of the consistency of new wordnets with the existing results. If a wordnet is added to the four wordnets (or to any number of combined wordnets), the comparison will show what proportion parallels the existing wordnets, what proportion is in conflict and what proportion is different but not necessarily in conflict. This mechanism can be used for developing a future standard for the quality of this kind of semantic resource.
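The three proportions could be computed along the following lines. The text does not define when a relation counts as "in conflict", so this Python sketch makes a loud assumption: a relation is parallel if the same triple exists in the reference wordnets, in conflict if the reference contains the same relation in the opposite direction, and different otherwise. The triple layout and the function name are likewise hypothetical, and the words are assumed to have been mapped to a common vocabulary beforehand.

```python
from collections import Counter


def compare(new_triples, reference_triples):
    """Return the share of new (word, relation, word) triples that are
    parallel to, in conflict with, or merely different from the reference."""
    reference = set(reference_triples)
    counts = Counter()
    for source, relation, target in new_triples:
        if (source, relation, target) in reference:
            counts["parallel"] += 1
        elif (target, relation, source) in reference:
            counts["conflict"] += 1  # assumed definition: reversed direction
        else:
            counts["different"] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: counts[key] / total for key in ("parallel", "conflict", "different")}


reference = [("horse", "hyperonym", "animal"), ("dog", "hyperonym", "animal")]
new = [
    ("horse", "hyperonym", "animal"),  # parallel
    ("dog", "hyperonym", "animal"),    # parallel
    ("animal", "hyperonym", "dog"),    # reversed, counted as conflict
    ("horse", "hyperonym", "mammal"),  # not in the reference, counted as different
]

print(compare(new, reference))
```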

3.2.2 Exchange format

There are two main possibilities for defining an exchange format in ANSI for a relational database. One is a flat-file format and the other is a set of files that shows, at least partially, the relational model of the database (this is the method employed by WN1.5). The usefulness of the format depends on its purpose. For instance, if the data has to be loaded into some kind of engine to display the contents (like the WN1.5 interface), the second formalization is the appropriate one. If, on the other hand, the format is only needed for export/import services, the flat file is good enough, in spite of its disadvantages with respect to space requirements and redundancy.

In addition, the flat-file format will be easily convertible to any desirable format (including the WN1.5 format) and to the most common standards such as SGML, DDML, HTML, etc. Furthermore, it will be possible to compress it and load it into the EuroWordNet database so that large and complex data structures can be viewed, selected and exported. For these reasons the flat-file format is preferable.
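To make the trade-off concrete, the Python sketch below writes relations to a flat file with one self-contained line per relation and reads them back. The tab-separated layout and the column order are purely hypothetical, since the actual transportation format is not specified here, and "ANSI" is assumed to mean the Windows-1252 code page. The repetition of the language and source word on every line is the redundancy mentioned above; in exchange, each line can be converted to another format on its own.

```python
import csv
import io

# Hypothetical column order: language, word, relation, target.
ROWS = [
    ("nl", "vliegtuig", "hyperonym", "luchtvaartuig"),
    ("en", "aeroplane", "hyperonym", "aircraft"),
    ("en", "aeroplane", "meronym", "wings"),
    ("en", "aeroplane", "meronym", "landing gear"),
]


def export_flat(rows):
    """Serialize relations as tab-separated lines encoded in Windows-1252
    (assumed here to be the intended "ANSI" character set)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue().encode("cp1252")


def import_flat(data):
    """Parse the bytes produced by export_flat back into tuples."""
    reader = csv.reader(io.StringIO(data.decode("cp1252")), delimiter="\t")
    return [tuple(row) for row in reader]


data = export_flat(ROWS)
print(data.decode("cp1252"))
```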

References

Alberto, P. and P. Bennet (eds) 1995 Lexical issues in Machine Translation, Studies in Machine Translation and Natural Language Processing, Vol. 8, EC, Luxembourg.
Bateman, J. A. 1990 Upper modeling: organizing knowledge for natural language processing, in "5th. International Workshop on Natural Language Generation, 3-6 June 1990", Pittsburgh, PA. Organized by Kathleen R. McKeown (Columbia University), Johanna D. Moore (University of Pittsburgh) and Sergei Nirenburg (Carnegie Mellon University).
Beckwith, R. and G.A. Miller 1990 "Implementing a Lexical Network", in: International Journal of Lexicography, Vol 3, No.4 (winter 1990), 302-312.
Berlin, B. 1972 "Speculations on the growth of ethnobotanical nomenclature", in: Language in Society, 51-86.
Brachman, R. 1979 "On the Epistemological Status of Semantic Networks", in: N. Findler (ed.) Associative Networks: Representation and Use of Knowledge by Computers. New York: Academic Press: 3-50.
Brachman, R. and J. Schmolze 1985 "An Overview of the KL-ONE Knowledge Representation System", in: Cognitive Science, Volume 9: 171-216.
Brachman, R. and H. Levesque 1985 Readings in Knowledge Representation. California: Morgan Kaufmann Publishers Inc.
Briscoe, E. J., A. Copestake and V. de Paiva (eds.) 1993 Default Inheritance in Unification Based Approaches to the Lexicon, Cambridge UK: Cambridge University Press.
Calzolari and C. Guo (eds) 1994 Proceedings of the post-coling94 international workshop on directions of lexical research, August 15-17, Beijing.
Chaffin, R., D. Hermann and M. Winston 1988 "An empirical taxonomy of part-whole relations: effects of part-whole relation type on relation identification", in: Language and Cognitive Processes, 3. Utrecht: VNU Science Press: 17-48.
Collins, A.M. and M.R. Quillian 1969 "Retrieval time from semantic memory", in: Journal of Verbal Learning and Verbal Behavior, 8. New York: Academic Press: 240-248.
Copestake, A., T. Briscoe, P. Vossen, A. Ageno, I. Castellon, F. Ribas, G. Rigau, H. Rodriguez, A. Sanmiotou 1994 Acquisition of Lexical Translation Relations from MRDs, (September 94), CCL Amsterdam, UPC - WP 2.2, Acquilex2 WP No 40.
Copestake, A. 1993 The compleat LKB. Esprit BRA-7315 Acquilex2 Deliverable, 3.1., Cambridge.
Cruse, D.A. 1986 Lexical semantics. Cambridge: Cambridge University Press.
Cuypers, I. and P. L. Díez-Orzas 1996 Manual of the Novell ConceptNet MS, Internal Report, Novell Belgium N.V.
Daelemans, W., K. de Smedt and G. Gazdar 1992 "Inheritance in natural language processing", in: Computational Linguistics, 18, 2. Cambridge MA: MIT Press: 205-219.
Davidson, C. 1994 Common Sense and the Computer, New Scientist, April 2, 1994.
Díez-Orzas, P.L. 1995 The Novell ConceptNet, Internal Report, Novell Belgium N.V.
Díez-Orzas, P. L. 1996 WordNet1.5 Contents and Statistics, Novell LTD Internal Report, March 1996, Antwerp.
Fellbaum, C. 1990 "English Verbs as a Semantic Net", in: International Journal of Lexicography, Vol 3, No.4 (winter 1990), 278-301.
Gross, D. and K.J. Miller 1990 "Adjectives in Wordnet", in: International Journal of Lexicography, Vol 3, No.4 (winter 1990), 265-277.
Jackendoff, R. 1992 "Parts and boundaries", in: B. Levin and S. Pinker (eds.) Lexical & conceptual semantics. Cambridge MA: Blackwell: 9-45.
Lenat, D.B. and R.V. Guha 1990 Building Large Knowledge Based Systems, Addison Wesley, Reading Mass.
Lenat, D.B. and R.V. Guha 1994 Enabling Agents to Work Together, Communications of the ACM, July 1994, Vol. 37, No. 7.
Mel'cuk, Igor 1984 Dictionnaire explicatif et combinatoire. Recherches lexico-sémantiques. Les Presses de l'Université de Montréal.
Miller, G.A. and P.N. Johnson-Laird 1976 Language and perception. Cambridge MA: Harvard University Press.
Miller, G.A., R. Beckwith, C. Fellbaum, D. Gross and K. Miller 1993 "Introduction to WordNet: An On-line Lexical Database", Princeton University.
Nelson, P.E. 1995 "The Excalibur TREC-4 System, Preparations, and Results", in TREC-4 Proceedings, National Institute of Standards and Technology, Gaithersburg, Md., USA.

Pollard, C. and I. Sag 1987 An information-based approach to syntax and semantics: Volume 1 fundamentals. CSLI Lecture Notes 13. Stanford CA.
Pustejovsky, J. 1991 "The generative lexicon", in: Computational Linguistics, 17, 4. Cambridge MA: MIT Press: 409-442.
Pustejovsky, J. and S. Bergler (eds.) 1992 Lexical semantics and knowledge representation: Proceedings of the First SIGLEX Workshop, Berkeley, USA, June 1991. Lecture Notes in Artificial Intelligence nr. 627. New York/Berlin: Springer Verlag.
Rosch, E. 1977 "Classification of real world objects: origins and representation in cognition", in: P.N. Johnson-Laird and P.C. Wason (eds.) Thinking: readings in cognitive science. Cambridge: Cambridge University Press: 212-222.
Smeaton, A.F., F. Kenelly and R. O'Donnell 1995 "TREC-4 Experiments at Dublin City University: Thresholding Posting Lists, Query Expansion with WordNet and POS Tagging of Spanish", in TREC-4 Proceedings, National Institute of Standards and Technology, Gaithersburg, Md., USA.

Tversky, B. and K. Hemenway 1984 "Objects, parts and categories", in: Journal of Experimental Psychology, General, 113. Washington: American Psychological Association: 169-193.
Tversky, B. 1986 "Components and categorization", in: C. Craig (ed.) Noun classes and categorization. Amsterdam/Philadelphia: Benjamins: 63-75.
Vossen, P. 1995 Grammatical and conceptual individuation in the lexicon, PhD Thesis, Universiteit van Amsterdam, Ifott 15.
Winston, M.E., R. Chaffin and D.J. Hermann 1987 "A taxonomy of part-whole relations", in: Cognitive Science, 11. Norwood NJ: Ablex Publ. Corp.: 417-444.
