Universality and Basic Level Concepts
Rebecca Green
College of Information Studies, University of Maryland, USA

Carol A. Bean
National Library of Medicine, USA

Michele Hudon
School of Library and Information Sciences, University of Montreal, CANADA

Abstract: This paper examines whether a concept's hierarchical level affects the likelihood of its universality across schemes for knowledge representation and knowledge organization. Empirical data on equivalents are drawn from a bilingual thesaurus, a pair of biomedical vocabularies, and two ontologies. Conceptual equivalence across resources occurs significantly more often at the basic level than at subordinate or superordinate levels. Attempts to integrate knowledge representation or knowledge organization tools should concentrate on establishing equivalences at the basic level.

1. Rationale

The degree of success attainable in the integration of multiple knowledge representation systems or knowledge organization schemes is constrained by limitations on the universality of human conceptual systems. For example, human languages do not all lexicalize the same set of concepts; nor do they structure (quasi-)equivalent concepts in the same relational patterns (Riesthuis, 2001). As a consequence, even multilingual thesauri designed from the outset from the perspective of multiple languages may routinely include situations where corresponding terms are not truly equivalent (Hudon, 1997, 2001). Intuitively, where inexactness and partialness in equivalence mappings across knowledge representation schemes and knowledge organization schemes exist, a more difficult retrieval scenario arises than where equivalence mappings reflect full and exact conceptual matches. The question we address in this paper is whether a concept's hierarchical level affects the likelihood of its universality/full equivalence across schemes for knowledge representation and knowledge organization.
Cognitive science research has shown that one particular hierarchical level, called the basic level, enjoys a privileged status (Brown, 1958; Rosch et al., 1976). Our underlying hypothesis is that concepts at the basic level (e.g., apple, shoe, chair) are more likely to match across knowledge representation schemes and knowledge organization schemes than concepts at the superordinate (e.g., fruit, footwear, furniture) or subordinate (e.g., Granny Smith, sneaker, recliner) levels. This hypothesis is consistent with ethnobiological data showing that folk classifications of flora are more likely to agree at the basic level than at superordinate or subordinate levels (Berlin, 1992). The study reported here, which is only preliminary in nature, investigates the validity of the hypothesis that basic level concepts are more universal than either more general or more specific concepts in three contexts: across languages (e.g., between corresponding terms of the two languages of a bilingual thesaurus in the social sciences), across vocabularies (e.g., between terms of different medical vocabularies mapping to the same concept within the Unified Medical Language System® [UMLS]; see related work by Bodenreider & Bean, 2001), and across ontologies (e.g., between most nearly equivalent nodes of ThoughtTreasure and WordNet; see related work by Hovy, 2002).

2. Basic Level: Privileged Level within a Hierarchy

The world is filled with a tremendously large number of individual concrete entities. When we refer to any specific one of those entities, we usually do so using a label that names a class the entity is a member of. We seldom bother, for instance, to refer to an automobile by its vehicle identification number (VIN), which would enable us to name it uniquely, but, given a neutral context, typically refer to it as a car.
Of course, the class of cars is not the only class that could be used in referring to a specific automobile; both more specific (e.g., sedan) and more general (e.g., vehicle) classes/names are also available. The hierarchical level typically chosen when we use language to refer to entities turns out not to be random. A variety of processes by which we interact with objects in our world converge on this same level, which is therefore dubbed the basic level (Lakoff, 1987, 46-47).

Several of these converging processes are linguistic in nature. In addition to the reference process just mentioned, names for basic level categories tend to be shorter than the names of superordinate and subordinate categories. Words for basic level categories tend to enter the language earlier and to be learned by children earlier than words naming more general and more specific classes.

Other processes that privilege this basic level of the hierarchy concern perception, function, and knowledge organization. On the perceptual level, the basic level is the highest level in a hierarchy where humans can normally form a single mental image of a class and where they can identify the class by its average shape. As to function, the basic level tends to be the highest level in a hierarchy where humans interact with entities with a relatively constant motor program. As to knowledge organization, when people are asked to list all that they know about a certain category, the biggest increase in number of statements, over and above what one can say about the next-most-general superordinate class, comes at the basic level, thus implying that more information is stored that is specific to the basic level than to any other hierarchical level.

The basic level concept arose in the context of classifying physical objects.
It is not always clear how to apply some of the converging processes noted above, especially those related to perception and function, to abstract concepts, processes, events, and so forth. We will assume, however, that the basic level concept does apply to these less concrete contexts and that the several linguistic criteria that tend to converge for concrete entities will likewise converge for non-concrete entities.

3. Equivalence across Knowledge Organization and Knowledge Representation Schemes

Our goal is to investigate whether the privileges of the basic level phenomenon extend to knowledge organization and knowledge representation schemes. We start by taking a random sample of (usually) a dozen terms or nodes from each of three types of schemes (a bilingual thesaurus, a metathesaurus to which a number of vocabularies have been mapped, and a set of ontologies), expanding each term to include all hierarchically related terms/nodes. For each term in the source scheme (both the randomly selected terms/nodes and those identified through hierarchical expansion), we identify the term/node in the corresponding target scheme that is the closest equivalent. Each of the two tools being compared serves as the source scheme for half of the investigation and as the target scheme for the other half. We then analyze these mappings for degree of equivalence, taking into account the semantic scope of the terms at each terminus of the mapping, as well as the hierarchical placement of these terms within their home vocabulary/ontology. We also use available linguistic information (term length, date term entered language, etc.) to determine the basic level within each hierarchy. A χ² goodness-of-fit test is then applied to the results of the analysis to determine if there is a significant difference between the degree of equivalence in mappings (that is, the universality) of subject terms at basic and non-basic levels.
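
As a rough illustration of the kind of χ² goodness-of-fit computation described above, the following Python sketch compares the number of equivalent mappings observed at basic and non-basic levels with the number expected if equivalence were independent of level. The counts, the two-way split into basic and non-basic, and all function and variable names are hypothetical, chosen only for illustration; they are not the study's data, and the exact way the test was set up in the study is not specified here.

```python
import math


def chi_square_statistic(observed, expected):
    """Pearson statistic: sum over categories of (O - E)^2 / E."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def p_value_one_df(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


# HYPOTHETICAL counts, invented for illustration; not the study's data.
terms_examined = {"basic": 40, "non_basic": 60}     # terms mapped at each level
equivalent_found = {"basic": 30, "non_basic": 20}   # mappings judged equivalent

levels = list(terms_examined)
total_terms = sum(terms_examined.values())
total_equivalent = sum(equivalent_found.values())

# Null hypothesis: equivalents are spread across levels in proportion
# to the number of terms examined at each level.
observed = [equivalent_found[level] for level in levels]
expected = [total_equivalent * terms_examined[level] / total_terms for level in levels]

statistic = chi_square_statistic(observed, expected)
p_value = p_value_one_df(statistic)
print(f"chi-square = {statistic:.3f}, p = {p_value:.4f}")
```

With these invented counts the statistic is about 8.33, which exceeds 3.84, the critical value at the 0.05 level for one degree of freedom, so the null hypothesis of level-independent equivalence would be rejected. With more than two categories the degrees of freedom change, and the one-degree-of-freedom p-value helper used here would no longer apply.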
In this analysis, the degree of equivalence between a concept in the source scheme and its nearest concept in the target scheme is characterized according to five categories: Equivalent or nearly equivalent, More general/specific, Semantic type mismatch, Not a good match, and Missing (from target).

3.1. Equivalence across Languages: Canadian Literacy Thesaurus/Thesaurus canadien d'alphabétisation Case Study

The Canadian Literacy Thesaurus / Thesaurus canadien d'alphabétisation (CLT/TCA, 1996) is a bilingual list of subject terms (in French and English) relating to the field of adult literacy. The French portion of the thesaurus includes 1950 descriptors, the English 1890. The two lists of terms were developed and structured independently, then later reconciled. Thus, we would expect to find a large percentage of exact equivalences across matching terms on the two sides of the thesaurus. However, despite the reconciliation, not every term in one language has an equivalent term in the other language. Nor are "equivalent" terms always fully equivalent. Because of the degree of specialization represented by the thesaurus, the hierarchies tend to be shallow, often only two levels deep. Where this is true, it is clearly not the case that subordinate, basic-level, and superordinate terms are all present. Indeed, it is not even always the case that basic-level terms are present. Because of this data sparsity and the relatively small size of the vocabularies involved, a random sample comprising 1% of the descriptors from each language was drawn, rather than randomly selecting six descriptors from each vocabulary. Also, presumably because of the reconciliation process that the thesaurus underwent, some number of the "exact matches" that occur are fabricated: the descriptor in one language is generated on the basis of a word-for-word translation of the descriptor from the other language, although the phrase is not attested in standard terminological resources.
Such exact matches have been categorized as forced equivalences