LEXICOGRAPHY IN AN INTERLINGUAL ONTOLOGY CANADIAN UNDERGRADUATE JOURNAL OF COGNITIVE SCIENCE 2004 1

Lexicography in an Interlingual Ontology: An Introduction to EuroWordNet

Peter Jansen, University of Waterloo

EuroWordNet is a multilingual lexical database constructed in the wake of WordNet. Its ontological structure is examined, from the language-dependent layers, analogous to individual WordNets, through the semantic space of the interlingual index, to the abstract framework of the top-level ontologies. The semantic nature of the interlingual index is examined as it applies to Gruber's principles for the design of ontologies. Benefits of EuroWordNet's design are highlighted.

WordNet was originally proposed by Miller (1990) in an experiment to test an implementation of a model of lexical organization. Up to that point, ontological databases had been particularly small. WordNet was intended as a test of ontology design on a scale far larger than any existing lexical database, progressively incorporating the English lexicon into a large semantic database.

The lexicon is the name given to a linguistic resource that contains our knowledge of words, including semantic data for each word or concept expressed. Concepts can be more than a single word: they can include compounds, such as 'high school'; collocations, such as 'best friend'; idiomatic phrases, as in 'keep in touch' or 'being under the weather'; and, finally, phrasal verbs, such as 'taking back' a book to the library or 'putting on' your shoes before you go to school. Compounds, collocations, idiomatic phrases, and phrasal verbs extend the idea of storing words in the lexicon to storing conceptual information that may not have a lexical representation using a single word. WordNet currently represents a large portion of the English lexicon, consisting of over 115,000 concepts.

WordNet's organizational units are these concepts. WordNet does not contain units smaller than a word, such as phonemic or morphemic information, or larger units consisting of multiple concepts, such as frames (Minsky, 1975), scripts (Schank and Abelson, 1977), and schemas (Rumelhart, 1980). WordNet's structure resembles that of both a dictionary and a thesaurus, having qualities of both. A dictionary contains semantic and syntactic information about single words, and is organized by word. A thesaurus contains semantically related words, and is organized by the general concept that a collection of words represents. WordNet contains synsets, each of which consists of a set of words or short word constructs (as discussed previously) that represent a specific concept. These synsets form the underlying structure of WordNet. In essence, synsets are the reason WordNet is a semantically organized dictionary (Fellbaum, 1998).

Initially, WordNet was designed to contain only synsets and pointers between the synsets, called relations. As development progressed, definitions and example sentences were included with the concepts to help contrast related synsets. While WordNet is a lexical database, its definitions sometimes include encyclopedic knowledge to help define the concepts it represents (Fellbaum, 1998).

Synsets describe a collection of lexical concepts that are semantically 'identical'. A synset may consist of only a single element, or it may have many elements all describing the same concept. Each element in a particular synset's list is synonymous with all other elements in that synset. For example, the synset {search, lookup} represents the concept of checking to see if something has a specific property. In this context, 'search' and 'lookup' are semantically equivalent. Where a single word has multiple meanings (a polysemous word), multiple separate and potentially unrelated synsets will contain the same word.

Synsets are interconnected by relations. Relations in WordNet express simple relationships between synsets. These relations include subclass and superclass relationships (hyponymy and hypernymy), part-of / has-a relationships (meronymy and holonymy), and the antonymy, or opposite, relationship. The concept network can be traversed using these relations, and from any one synset a set of relations opens a meaningful path to be explored, allowing simple inferencing to take place.

WordNet consists of four distinct semantic networks, one each for nouns, verbs, adjectives, and adverbs (Fellbaum, 1998). This separation simplifies the network design, as each word class has different semantic relations. For instance, verbs have a relation called troponymy (Fellbaum, 1998), which expresses a particular manner of doing something. Both nouns and verbs can be organized hierarchically. Unique beginners are noun synsets at the top of a lexicon's hierarchy (Miller, 1998). These include abstract concepts such as abstraction, possession, processes, and states. These unique beginners can serve as a conceptual base for building the hierarchy from the most abstract concepts towards less abstract, specific concepts and instantiations.

EuroWordNet

WordNet was designed to represent English words and lexical concepts. The EuroWordNet project (Vossen, Díez-Orzas, & Peters, 1997; EuroWordNet, 2001), completed in 1999, set out to create a multilingual lexical database relating conceptual information among a number of European languages, and to establish a common framework that would allow new languages to be incorporated. At its completion, EuroWordNet combined the Czech, Dutch, Estonian, Italian, French, German, and Spanish languages, and, since the project's end, WordNets for a number of additional languages have been developed to its specification, including Swedish and Russian (EuroWordNet, 2001).

The EuroWordNet team examined a number of designs for their multilingual database (Vossen et al., 1997). One of the more expansive approaches considered was to map concepts in one language directly to concepts in each of the other languages. Under this approach, a multilingual database of three languages would need six different interlingual conceptual mappings (one from each language to each other language). For instance, a potential set of mappings might be English to French, French to English, English to German, German to English, French to German, and German to French. The effort required to add new languages in this system becomes extremely large as the number of languages increases. The potential advantage of this method, however, is that the mappings would be tailored to each pair of languages, which may make them more precise.
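The growth in effort under the pairwise approach can be made concrete with a small sketch. The Python below is illustrative only: the function name `directed_mappings` and the placeholder language labels are assumptions introduced here, not part of EuroWordNet or its specification. It enumerates one mapping for every ordered pair of distinct languages, which gives n(n - 1) mappings for n languages.

```python
from itertools import permutations


def directed_mappings(languages):
    """Return every ordered (source, target) pair of distinct languages.

    Hypothetical helper for illustration; each pair stands for one
    interlingual conceptual mapping in the pairwise design.
    """
    return list(permutations(languages, 2))


# The three-language example from the text yields six mappings.
three = ["English", "French", "German"]
for source, target in directed_mappings(three):
    print(f"{source} to {target}")

# The count grows as n * (n - 1) when languages are added.
for n in (3, 7, 10):
    labels = [f"language_{i}" for i in range(n)]  # placeholder names
    count = len(directed_mappings(labels))
    print(f"{n} languages require {count} directed mappings")
```

For the seven languages combined at EuroWordNet's completion, a pairwise design of this kind would call for 42 mappings, and each newly added language would need a mapping to and from every language already present.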

The actual design used by the EuroWordNet team requires fewer computational resources, though it brings its own advantages and disadvantages. The database is organized into three main layers: the language-dependent layer, the language-independent layer, and the top-layer and domain ontologies. The language-dependent layer consists of a WordNet structured similarly to the English WordNet, containing the concepts for one specific language. Each language-dependent layer is in essence a WordNet of its own for a specific language. These multiple WordNets are then connected to a language-independent lexical database. This database, called the interlingual index (ILI), is a WordNet of its own, but unlike a language-dependent WordNet, its synsets link to the synsets of the language-dependent WordNets. The synsets contained in the ILI represent language-independent concepts, free of the lexical constraints of any one language. In this way, the concepts represented in different languages are cross-lingually linked together, and a concept specified in any one language can be translated into any other language connected to the ILI.

The synsets were developed hierarchically between languages by first identifying common 'base concepts', or concepts that were common to all languages, and beginning the database development from these base concepts. Thirty representative synsets were selected by all language-specific developers, of which 24 are noun synsets and six are verb synsets. In situations where the language-specific developers identified more base concepts, the concepts were further abstracted to the common set of base concepts. In instances where a base concept is not lexically represented in a language, a close representation is used.

The base concepts are organized into a top-level ontology where the base concepts are hierarchically extended to include closely related hyponyms. The base concepts are divided into two categories in the top-level ontology: high-order entities and first-order entities. High-order entities are abstract concepts and include events, processes, relations, properties, and states. First-order entities are material objects and perceivable quantities. The top layer of EuroWordNet also contains the domain hierarchy ontology, which allows synsets in the interlingual index to be mapped directly to categorical descriptions, for instance, animal, vertebrate, invertebrate, plant, or clothing. The top-level ontology labels and the domain labels have equivalence relations to synsets in the ILI. This design feature is useful in instances where language-independent but domain-specific ontologies designed for a specific task may be required. Linking to a domain ontology may also help select more generic (further away) or more specific (closer) concepts in interlingual mappings (Vossen et al., 1997, p. 2).

The ILI contains six different relations specific to the layer's development (Vossen et al., 1997, pp. 3-7). These relations are useful in situations where languages do not map well to each other. Some languages have concepts which are not lexicalized in others. For instance, the English word 'head' can refer to any head, but in Dutch there are different words to express either 'human head' or 'animal head' (Vossen et al., 1997, p. 4). This situation represents one of these ILI relationships, HAS_EQ_HYPERONYM, which applies when a concept exists in one language which is more specific than an existing synset in the ILI. Other relations include HAS_EQ_HYPONYM, where a concept is too general for an existing synset and is mapped to a more specific synset, and HAS_EQ_SYNONYM, where concepts in the ILI are synonymous or identical to each other.

A number of desiderata were introduced by Gruber (1993) to help guide the development of ontologies and to serve in evaluating them (Gomez-Perez et al., 2003). These guiding principles, which we shall examine as they apply to EuroWordNet, include coherence, clarity, extendibility, minimal encoding bias, and minimal ontological commitment.

The principle of coherence states that inferences created through the use of the ontology should not lead to contradictions. A contradiction means that the ontology contains incoherent information. The possible sources of contradiction in EuroWordNet could include situations where closely related concepts are independently categorized, or categorized by different developers, and, as a result, synsets in the ILI may actually have both hypernym and hyponym relations to another specific synset. This type of error has been minimized at the higher levels of the ontology by using a common set of base concepts to develop each language-dependent WordNet. Automated searches for synsets that have both subclass-of and superclass-of relations to another synset could be used to find such issues; then either the user or automated inferencing (perhaps selecting the most common hierarchical derivation found between the languages) could correct the incoherence.

Clarity is the principle that terms should be effectively communicated. In terms of the structure of individual WordNets, definitions that help differentiate semantically similar synsets should be clear. The top-level ontology should also clearly express each base concept. Due to the highly abstract nature of these concepts, the base concepts may be best illustrated through elaborating subordinate nodes, perhaps through multiple levels. Clarity would not seem to apply to the interlingual index, as the concepts it contains are purely conceptual and must be interpreted into a language in order to be linguistically perceived.

Extendibility is the guiding principle behind the design of EuroWordNet. The specifications allow additional languages to be mapped into EuroWordNet's structure with a minimum of effort. The principle of extendibility states that an ontology should be able to support the addition of hyponyms to existing concepts without modifying pre-existing concepts. The use of a common base-concept ontology developed through examining commonalities between multiple languages suggests that violations of the extendibility principle should be kept to a minimum, and would most likely occur during an active revision of the top-level ontologies while additional languages are being added. The dynamic, network-like nature of synsets should allow complete extendibility beyond the top-level ontology.

The principle of minimal encoding bias states that concepts should be defined at a 'knowledge level' and should not be dependent on a symbolic level of encoding. This principle is reflected in the use of the common top-level ontology, with its common base concepts, in the development of the language-dependent WordNets. In this way, concepts in all languages are built upon this highly abstract layer, which should minimize the bias that could exist if the language-dependent ontologies were built upon unique top-level ontologies.

Finally, the notion of minimal ontological commitment signifies minimizing the specificity of information that could exist in different formats. This is an especially important consideration in a cross-cultural, interlingual database. Examples could include measurements such as dates, spans of time, distances, and intensities. The synset nature of EuroWordNet elegantly expresses the spirit behind this principle by expressing information semantically. Places where encoding biases may occur could include the synset definitions in each language, which may state each language's specific method of interpreting some concept such as measure or quantity.

EuroWordNet attempts to incorporate a large portion of the semantic lexicon of multiple European languages in a common framework. The design of this framework is flexible enough to allow the relatively easy addition of new languages, and scales tractably both in terms of the computational resources required to process the lexical database and the work required to create new linguistic databases and connect them with EuroWordNet. The semantic nature of synsets embodies many of Gruber's (1993) principles of ontological development and, combined with systems for semantic disambiguation, could form an impressive interlingual translation system. While the project was officially completed in 1999, the specification continues to be used, and individual WordNets for nearly three times the number of languages originally supported have been developed and can be linked to EuroWordNet's interlingual index.

References

EuroWordNet. (2001). http://www.illc.uva.nl/EuroWordNet

Fellbaum, C. (1998). WordNet: An Electronic Lexical Database (pp. 1-12). Cambridge, MA: MIT Press.

Gomez-Perez, A., Corcho, O., & Fernandez-Lopez, M. (2003). Ontological Engineering: With Examples from the Areas of Knowledge Management, E-Commerce, and the Semantic Web. Springer Verlag.

Gruber, T. (1993). Towards Principles for the Design of Ontologies Used for Knowledge Sharing. Technical Report KSL93-04, Stanford University, Knowledge Systems Laboratory.

Miller, G. A. (1990). WordNet: An On-line Lexical Database. International Journal of Lexicography, 3, 235-312.

Miller, G. A. (1998). Nouns in WordNet. In C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database (pp. 23-46). Cambridge, MA: MIT Press.

Minsky, M. (1975). A Framework for Representing Knowledge. In P. H. Winston (Ed.), The Psychology of Computer Vision (pp. 211-277). New York: McGraw-Hill.

Rumelhart, D. E. (1980). Schemata: The Building Blocks of Cognition. In R. J. Spiro, B. Bruce, & W. F. Brewer (Eds.), Theoretical Issues in Reading Comprehension. Hillsdale, NJ: Erlbaum.

Schank, R. C., & Abelson, R. P. (1977). Scripts, Plans, Goals, and Understanding: An Inquiry Into Human Knowledge Structures. Hillsdale, NJ: Lawrence Erlbaum.

Vossen, P., Díez-Orzas, P., & Peters, W. (1997). The Multilingual Design of EuroWordNet. In P. Vossen, N. Calzolari, G. Adriaens, A. Sanfilippo, & Y. Wilks (Eds.), Proceedings of the ACL/EACL-97 Workshop on Automatic Information Extraction and Building of Lexical Semantic Resources for Natural Language Processing Applications, Madrid, July 1997.

WWW.SFU.CA/COGNITIVE-SCIENCE/JOURNAL