User Requirements and Functional Specification of the EuroWordNet project

Version 5, Final
October, 1996

Laura Bloksma£
Pedro Luis Díez-Orzas$
Piek Vossen£

Deliverable D001, WP1, EuroWordNet, LE2-4003

£ Computer Centrum Letteren, University of Amsterdam
$ Novell Linguistic Development, Antwerp

Identification number: LE-4003-D-001
Type: Document
Title: User requirements and functional specification of EuroWordNet
Status: Final
Deliverable: D001
Work Package: WP1
Task: T1
Period covered: March - June 1996
Date: October, 1996
Version: 5
Number of pages: 66
Authors: Laura Bloksma, Pedro Díez-Orzas, Piek Vossen
WP/Task responsible: Novell
Project contact point:
  Piek Vossen
  Computer Centrum Letteren
  University of Amsterdam
  Spuistraat 134
  1012 VB Amsterdam
  The Netherlands
  tel. +31 20 525 4624
  fax. +31 20 525 4429
  e-mail: [email protected]
  http://www.let.uva.nl/CCL/EuroWordNet.html
EC project officer: Jose Soler
Status: Public
Actual distribution:
  Project Consortium
  The EuroWordNet User-Group
  The EuroWordNet WWW page
Supplementary notes:
Key words: Lexical semantic databases, Information Retrieval, Language Engineering

Abstract

In this document the general design of the EuroWordNet database is described, based on the user-requirements and the technical state of the art for building multilingual semantic resources. The user-requirements are discussed from two different perspectives: the actual use of the resource in a multilingual information retrieval system developed by Novell Linguistic Development, and the potential use of the resource by a diverse group of institutes and companies in Europe, constituted by the EuroWordNet user-group. The purpose of the latter group is to create a wider awareness of the use of this type of resource and to establish cooperation with other groups that build such resources, in order to develop standards and make resources compatible. The usage in an Information Retrieval Engine by Novell is taken as a starting point for the functional specification.
In addition to the direct requirements of the user, the functional specification is based on the design of the Princeton WordNet1.5, the structure and content of the resources that will be used, the quality of the extraction tools and the limitations set by the project’s budget and time frame. Deviations from the Princeton WordNet1.5 design are due to:
• new aspects such as multilinguality.
• inadequacy of the Princeton wordnet for Information Retrieval use.
• the different structure of the available resources.
• the quality of the tools for extracting information from these resources.
• the need to achieve maximal compatibility across the different resources.
• the copyright limitations on the results.
• the possibility for specific users to customize the resource.
In addition to the design of the data structure, a data viewer is described which enables users to view and compare the wordnets, and to export selections to a plain text format.

Status of the abstract: Complete
Received on:
Recipient’s catalogue number:

Executive Summary

Crucial for the semantic processing of information stored in the form of Natural Language is the availability of large generic lexicons with semantic information. The need for such resources is apparent when accessing large amounts of relatively unstructured information which is stored in various formats, covering different languages and cultures. A user has to anticipate that the information may be expressed using a variety of words or expressions, even in different languages. With a semantic database, semantically-related words can be grouped automatically, increasing the effectiveness of a search. For English there is such a generic semantic resource: the WordNet database developed by George Miller and his research group at Princeton University (Miller et al. 1993).
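The grouping of semantically-related words mentioned above can be illustrated with a minimal sketch. It is not taken from WordNet, EuroWordNet or the Novell system: the synonym sets, the documents and the function names `expand_query` and `search` are all hypothetical, chosen only to show how expanding a query with related words can retrieve documents that a literal match would miss.

```python
# Hypothetical, hand-made synonym sets; NOT data from WordNet or EuroWordNet.
SYNSETS = [
    {"car", "automobile", "motorcar"},
    {"buy", "purchase"},
]

# Hypothetical document collection used only for this illustration.
DOCS = [
    "how to purchase a used automobile",
    "car maintenance tips",
    "gardening in spring",
]


def expand_query(terms, synsets):
    """Return the query terms plus every word that shares a synonym set with one of them."""
    expanded = set(terms)
    for term in terms:
        for synset in synsets:
            if term in synset:
                expanded |= synset
    return expanded


def search(documents, terms):
    """Return the indices of documents that contain at least one of the terms."""
    wanted = set(terms)
    hits = []
    for index, text in enumerate(documents):
        words = set(text.lower().split())
        if words & wanted:
            hits.append(index)
    return hits


query = ["buy", "car"]
plain_hits = search(DOCS, query)                             # literal matching only
expanded_hits = search(DOCS, expand_query(query, SYNSETS))   # with related words added
print(plain_hits, expanded_hits)
```

In this toy collection the literal query finds only the second document, while the expanded query also finds the first, which uses "purchase" and "automobile" instead of the query words.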
For other European languages, however, such databases with basic semantic relations do not exist or are not available, let alone a multilingual database in which several of these resources are combined. The aim of the EuroWordNet project is to develop this multilingual database with basic semantic relations between words for several European languages (Dutch, Italian and Spanish). The EuroWordNet database will, as far as possible, be built from available existing resources and databases with semantic information developed in various projects. The use of the database will be demonstrated in an information retrieval environment. The expectation is that such a multilingual resource will improve the recall of documents in a meaningful way, not only of documents in each of the relevant languages, but also across these languages.

In this document we outline the user-requirements and the functional specification on which the design of the database will be based. EC-projects funded in the fourth framework (1995-1999) should follow three pre-defined stages:
Stage I: User Requirements and Functional Specification
Stage II: Development of the Demonstrator and Verification
Stage III: Demonstration
The main focus of the EuroWordNet project will be on Stage II (the building and the verification of the wordnets), with minimal work parts for Stage I and Stage III. Still, the project does not start from a detailed specification of the user-needs and market exploration. The two major reasons for this are that:
· semantic databases are still a novelty in linguistic technology, despite their potential value and use. Consequently, there are no studies and reports available on the need and use of such resources that could be used to describe the user-requirements.
· the use of linguistic technology is cross-sectional by nature: it could be integrated in a variety of Telematics applications that involve the processing of information.
The range of applications and user-types makes it more difficult to discover the user-requirements with respect to this type of resource.

Since the budget of the project does not allow for extensive market research (by interviews and questionnaires), we have followed a more minimalist approach to account for the user-needs. The user-needs are addressed from two different perspectives:
· the actual demonstration of the resource in a multilingual information retrieval system developed by Novell Linguistic Development, which is a partner in the project.
· the potential use of the resource by a diverse group of institutes and companies in Europe, constituted by the EuroWordNet user-group. The purpose of the latter group is to create a wider awareness of the use of this type of resource and to establish cooperation with other groups that build such resources, in order to develop standards and make resources compatible.

Software developers and telematics users have not had much opportunity to experiment or gain experience with applications based on the semantic processing of information. We expect that clarity on the user-needs will arise from the availability of the multilingual database and the possibility for people to work with it, rather than the other way around.

To limit the scope of the work, the use of the database will primarily be designed from a single application perspective: information retrieval. It is important to realize that Novell is not an end-user but a developer that has already built an Information Retrieval system that incorporates a semantic database, which is tested with WordNet1.5. Given this specific application and their experience, the requirements of Novell will therefore be relatively specific, which one may not expect from a general information retrieval perspective.
Using an existing application of a major software company as a starting point, however, has the following advantages:
· it ensures a realistic demonstration of the resource which can convince people of the feasibility and usefulness of semantically-based technology, addressing real needs rather than theoretical solutions.
· users will not only see a complex data structure but also a useful and conceivable effect in a user-friendly interface.
· an existing application will require realistic features and lead to a realistic verification of the data.

The usage in an Information Retrieval Engine by Novell is thus taken as a starting point for the functional specification. In addition to the direct requirements of the user, the functional specification is based on the type of information that is covered by the Princeton WordNet1.5. This basically limits the scope of the data to the feasible and most important relations, about which there is a major consensus. Deviations from the Princeton WordNet1.5 design are due to:
· new aspects such as the multilinguality of the database.
· inadequacy of the Princeton wordnet for Information Retrieval use, which follows from some of the user-requirements.
· the nature of the information stored in the Machine Readable Dictionaries (MRDs) from which the EuroWordNet results will be derived. Some differences in the MRDs are advantageous compared with the Princeton WordNet, while others make conversion to the Princeton structure too complicated.
· the possibility to (semi-)automatically extract this information from the available resources, given the existing technology, time and manpower.
· the need to achieve maximal compatibility across the different resources.
· copyright claims, because of which not all the information from the MRDs can be made available.
· the possibility for users to customize the database for their specific application without having to speak all the languages.

Finally, the document describes a data viewer which will be developed.
The viewer will enable users to view and