
Rebecca Green College of Information Studies, University of Maryland, USA

Carol A. Bean National Library of Medicine, USA

Michele Hudon School of Library and Information Sciences, University of Montreal, CANADA

Universality and Basic Level Concepts

Abstract: This paper examines whether a concept's hierarchical level affects the likelihood of its universality across schemes for knowledge representation and knowledge organization. Empirical data on equivalents are drawn from a bilingual thesaurus, a pair of biomedical vocabularies, and two ontologies. Conceptual equivalence across resources occurs significantly more often at the basic level than at subordinate or superordinate levels. Attempts to integrate knowledge representation or knowledge organization tools should concentrate on establishing equivalences at the basic level.

1. Rationale

The degree of success attainable in the integration of multiple knowledge representation systems or knowledge organization schemes is constrained by limitations on the universality of human conceptual systems. For example, human languages do not all lexicalize the same set of concepts; nor do they structure (quasi-)equivalent concepts in the same relational patterns (Riesthuis, 2001). As a consequence, even multilingual thesauri designed from the outset from the perspective of multiple languages may routinely include situations where corresponding terms are not truly equivalent (Hudon, 1997, 2001). Intuitively, where equivalence mappings across knowledge representation schemes and knowledge organization schemes are inexact or partial, a more difficult retrieval scenario arises than where the mappings reflect full and exact conceptual matches.

The question we address in this paper is whether a concept's hierarchical level affects the likelihood of its universality/full equivalence across schemes for knowledge representation and knowledge organization. Cognitive science research has shown that one particular hierarchical level, called the basic level, enjoys a privileged status (Brown, 1958; Rosch et al., 1976). Our underlying hypothesis is that concepts at the basic level (e.g., apple, shoe, chair) are more likely to match across knowledge representation schemes and knowledge organization schemes than concepts at the superordinate (e.g., fruit, footwear, furniture) or subordinate (e.g., Granny Smith, sneaker, recliner) levels. This hypothesis is consistent with ethnobiological data showing that folk classifications of flora are more likely to agree at the basic level than at superordinate or subordinate levels (Berlin, 1992).

The study reported here, which is only preliminary in nature, investigates the validity of the hypothesis that basic level concepts are more universal than either more general or more specific concepts in three contexts: across languages (e.g., between corresponding terms of the two languages of a bilingual thesaurus in the social sciences), across vocabularies (e.g., between terms of different medical vocabularies mapping to the same concept within the Unified Medical Language System® [UMLS]; see related work by Bodenreider & Bean, 2001), and across ontologies (e.g., between most nearly equivalent nodes of ThoughtTreasure and WordNet; see related work by Hovy, 2002).

2. Basic Level: Privileged Level within a Hierarchy

The world is filled with a tremendously large number of individual concrete entities. When we refer to any specific one of those entities, we usually do so using a label that names a class the entity is a member of. We seldom bother, for instance, to refer to an automobile by its vehicle identification number (VIN), which would enable us to name it uniquely, but, given a neutral context, typically refer to it as a car. Of course, the class of cars is not the only class that could be used in referring to a specific automobile; both more specific (e.g., sedan) and more general (e.g., vehicle) classes/names are also available. The hierarchical level typically chosen when we use language to refer to entities turns out not to be random. A variety of processes by which we interact with objects in our world converge on this same level, which is therefore dubbed the basic level (Lakoff, 1987, 46-47).

Several of these converging processes are linguistic in nature. In addition to the reference process just mentioned, names for basic level categories tend to be shorter than are the names of superordinate and subordinate categories. Words for basic level categories tend to enter the language earlier and to be learned by children earlier than words naming more general and more specific classes.

Other processes that privilege this basic level of the hierarchy concern perception, function, and knowledge organization. On the perceptual level, the basic level is the highest level in a hierarchy where humans can normally form a single mental image of a class and where they can identify the class by its average shape. As to function, the basic level tends to be the highest level in a hierarchy where humans interact with entities with a relatively constant motor program. As to knowledge organization, when people are asked to list all that they know about a certain category, the biggest increase in number of statements, over and above what one can say about the next-most-general superordinate class, comes at the basic level, thus implying that more information is stored that is specific to the basic level than to any other hierarchical level.

The basic level concept arose in the context of classifying physical objects. It is not always clear how to apply some of the converging processes noted above, especially those related to perception and function, to abstract concepts, processes, events, and so forth. We will assume, however, that the basic level concept does apply to these less concrete contexts and that the several linguistic criteria that tend to converge for concrete entities will likewise converge for non-concrete entities.
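One of the linguistic criteria described in this section, name length, is simple enough to sketch in code. The following is only an illustrative sketch, not the procedure used in the study: the function name `shortest_names` is hypothetical, and the hierarchy chains are the example terms already given in the text. It picks the shortest name in each chain and reports ties rather than resolving them.

```python
def shortest_names(chain):
    """Return every name in the chain that has the minimum character length."""
    shortest = min(len(name) for name in chain)
    return [name for name in chain if len(name) == shortest]


# Chains ordered superordinate -> basic -> subordinate (examples from the text).
CHAINS = [
    ["furniture", "chair", "recliner"],
    ["footwear", "shoe", "sneaker"],
    ["vehicle", "car", "sedan"],
    ["fruit", "apple", "Granny Smith"],
]

if __name__ == "__main__":
    for chain in CHAINS:
        # A single result agrees with the basic level term; a tie means
        # name length alone cannot decide and other criteria are needed.
        print(" > ".join(chain), "->", shortest_names(chain))
```

For three of the four chains the shortest name is the basic level term (chair, shoe, car). For the fourth, fruit and apple tie, which illustrates why a single criterion is not sufficient and why the convergence of several criteria matters.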

3. Equivalence across Knowledge Organization and Knowledge Representation Schemes

Our goal is to investigate whether the privileges of the basic level phenomenon extend to knowledge organization and knowledge representation schemes. We start by taking a random sample of (usually) a dozen terms or nodes from each of three types of schemes (a bilingual thesaurus, a metathesaurus to which a number of vocabularies have been mapped, and a set of ontologies), expanding each term to include all hierarchically related terms/nodes. Each of the two tools being compared is the source scheme for half of the investigation and the target scheme for the other half. For each term in the source scheme (including both the randomly selected terms/nodes and the terms/nodes identified through hierarchical expansion), we identify the term/node in the corresponding target scheme that is the closest equivalent. We then analyze these mappings for degree of equivalence, taking into account the semantic scope of the terms at each terminus of the mapping, as well as the hierarchical placement of these terms within their home vocabulary/ontology. We also use available linguistic information (term length, date term entered language, etc.) to determine the basic level within each hierarchy. A χ²/goodness-of-fit test is then applied to the results of the analysis to determine if there is a significant difference between the degree of equivalence in mappings (that is, the universality) of subject terms at basic and non-basic levels. In this analysis, the degree of equivalence between a concept in the source scheme and its nearest concept in the target scheme is characterized according to five categories: Equivalent or nearly equivalent, More general/specific, Semantic type mismatch, Not a good match, and Missing (from target).

3.1. Equivalence across Languages: Canadian Literacy Thesaurus/Thesaurus canadien d'alphabétisation Case Study

The Canadian Literacy Thesaurus / Thesaurus canadien d'alphabétisation (CLT/TCA, 1996) is a bilingual list of subject terms (in French and English) relating to the field of adult literacy. The French portion of the thesaurus includes 1950 descriptors, the English 1890. The two lists of terms were developed and structured independently, then later reconciled. Thus, we would expect to find a large percentage of exact equivalences across matching terms on the two sides of the thesaurus. However, despite the reconciliation, not every term in one language has an equivalent term in the other language. Nor are "equivalent" terms always fully equivalent.

Because of the degree of specialization represented by the thesaurus, the hierarchies tend to be shallow, often only two levels deep. Where this is true, it is clearly not the case that subordinate, basic-level, and superordinate terms are all present. Indeed, it is not even always the case that basic-level terms are present. Because of this data sparsity and the relatively small size of the vocabularies involved, a random sample comprising 1% of the descriptors from each language was drawn, rather than randomly selecting six descriptors from each vocabulary.

Also, presumably because of the reconciliation process that the thesaurus underwent, some number of the "exact matches" that occur are fabricated: the descriptor in one language is generated on the basis of a word-for-word translation of the descriptor from the other language, although the phrase is not attested in standard terminological resources. Such exact matches have been categorized as forced equivalences and are conceptually closer to missing terms than to equivalent terms.

The chart below summarizes degree of equivalence for unique subordinate, basic level, and superordinate terms as compared with their most nearly equivalent concepts across the language divide of the Canadian Literacy Thesaurus / Thesaurus canadien d'alphabétisation.

Degree of equivalence        Subordinate-level concepts   Basic-level concepts   Superordinate-level concepts
Equivalent (or nearly so)    18                           22                     20
More general/specific        7                            2                      1
Semantic type mismatch       1                            0                      0
Not a good match             4                            0                      0
Forced equivalence           6                            0                      1
Missing from target          7                            0                      0

3.2. Equivalence across Terminologies: Unified Medical Language System Case Study

The Metathesaurus® of the Unified Medical Language System (UMLS) is a tool that, in its current (2002AA) version, links 2.1 million terms from over 60 biomedical vocabularies to a set of 776,940 concepts. The mapping of each individual vocabulary to these concepts generates, in effect, mappings between any two of the included vocabularies. This study is based on mappings between anatomical terms from two widely used, comprehensive, multiaxial, hierarchical medical vocabularies: MeSH® (2002) and SNOMED International® (1998). MeSH (Medical Subject Headings) is maintained by the U.S. National Library of Medicine, and the 2002 version contains over 122,000 main headings (19,000) and supplementary concept records (103,500). SNOMED (Systematized Nomenclature of Medicine) is maintained by the College of American Pathologists; SNOMED International (1998) contains over 156,000 concepts.

The six hierarchies selected from MeSH as source have as basic level terms tooth, uterus, arm/hand/finger (see note 1), gland, pancreas, and blood vessel (see note 2); the six selected from SNOMED as source have as basic level terms abdomen, eye, bone, nose, cavity, and bone marrow. It should be noted that hierarchies within anatomy are often based on the part-whole relationship rather than on the subsumption (taxonomic) relationship. Consequently, the basic level terms just indicated, which are based on taxonomic hierarchies, do not always appear in the hierarchies given in MeSH and SNOMED, but have been identified using WordNet (see section 3.3).

The chart below summarizes degree of equivalence for unique subordinate, basic level, and superordinate concepts as compared to the cross-thesaural most nearly equivalent concepts for MeSH and SNOMED.

Degree of equivalence        Subordinate-level concepts   Basic-level concepts   Superordinate-level concepts
Equivalent (or nearly so)    17                           10                     8
More general/specific        3                            0                      7
Semantic type mismatch       0                            0                      0
Not a good match             0                            0                      0
Missing from target          2                            0                      3

3.3. Equivalence across Ontologies: ThoughtTreasure and WordNet Case Study

ThoughtTreasure (http://www.signiform.com/tt/htm/tt.htm) describes itself as "a comprehensive platform for natural language processing . . . and commonsense reasoning." Its knowledge base includes a lexicon of 25,000 hierarchically organized concepts, to which 55,000 English and French words have been mapped. WordNet (http://www.cogsci.princeton.edu/~wn) is an English language lexical database built around sets of synonymous word senses (synsets), each of which corresponds to a node in a tree structure. WordNet organizes 190,000 word senses into 110,000 synsets; separate tree structures exist for nouns, verbs, adjectives, and adverbs.

The ontologies of such other systems as Cyc, Sensus, and Ontosaurus are admittedly more typical of knowledge representation systems, but the linguistic basis of both ThoughtTreasure and WordNet permits the identification of the basic level within their hierarchies using the lexical criteria outlined in section 2, which otherwise would not be possible. They also have the very practical advantage of being freely available in their entirety.

The six randomly selected hierarchies based on ThoughtTreasure as source have as basic level terms trial, wife, phone number, column, resistor, and art; the six based on WordNet as source have as basic level terms blood vessel (see note 2), word, training, official, building, and store.

The following chart summarizes degree of equivalence for unique subordinate, basic level, and superordinate concepts as compared to the most nearly equivalent concepts identified across ThoughtTreasure and WordNet.

Degree of equivalence        Subordinate-level concepts   Basic-level concepts   Superordinate-level concepts
Equivalent (or nearly so)    6                            10                     11
More general/specific        2                            2                      11
Semantic type mismatch       1                            0                      3
Not a good match             0                            0                      4
Missing from target          7                            0                      10

4. Discussion and Analysis

The cumulative results from the three case studies are captured in the chart below (the Forced equivalence category from the first case study has been folded into the Missing from target category here):

Degree of equivalence        Subordinate-level concepts   Basic-level concepts   Superordinate-level concepts
Equivalent (or nearly so)    41                           42                     39
More general/specific        12                           4                      19
Semantic type mismatch       2                            0                      3
Not a good match             4                            0                      4
Missing from target          22                           0                      14
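The cumulative counts can be checked by summing the three case-study charts cell by cell. The sketch below is only a bookkeeping aid under stated assumptions: the variable and function names (`CLT`, `UMLS`, `ONTOLOGIES`, `combine`) are hypothetical, the counts are transcribed from the charts in sections 3.1 to 3.3, and Forced equivalence is folded into Missing from target as described above.

```python
CATEGORIES = [
    "Equivalent (or nearly so)",
    "More general/specific",
    "Semantic type mismatch",
    "Not a good match",
    "Missing from target",
]

# Each tuple is (subordinate, basic, superordinate), transcribed from the charts.
CLT = {
    "Equivalent (or nearly so)": (18, 22, 20),
    "More general/specific": (7, 2, 1),
    "Semantic type mismatch": (1, 0, 0),
    "Not a good match": (4, 0, 0),
    "Forced equivalence": (6, 0, 1),
    "Missing from target": (7, 0, 0),
}
UMLS = {
    "Equivalent (or nearly so)": (17, 10, 8),
    "More general/specific": (3, 0, 7),
    "Semantic type mismatch": (0, 0, 0),
    "Not a good match": (0, 0, 0),
    "Missing from target": (2, 0, 3),
}
ONTOLOGIES = {
    "Equivalent (or nearly so)": (6, 10, 11),
    "More general/specific": (2, 2, 11),
    "Semantic type mismatch": (1, 0, 3),
    "Not a good match": (0, 0, 4),
    "Missing from target": (7, 0, 10),
}


def combine(tables):
    """Sum the charts cell by cell, folding forced equivalences into missing."""
    totals = {category: [0, 0, 0] for category in CATEGORIES}
    for table in tables:
        for category, counts in table.items():
            if category == "Forced equivalence":
                category = "Missing from target"
            for level, count in enumerate(counts):
                totals[category][level] += count
    return {category: tuple(counts) for category, counts in totals.items()}


if __name__ == "__main__":
    for category, counts in combine([CLT, UMLS, ONTOLOGIES]).items():
        print(f"{category:28}{counts}")
```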

Statistical (χ²/goodness-of-fit) tests were applied to these results to investigate the basic hypothesis that conceptual equivalence is more likely to occur in knowledge organization and representation tools at the basic level than at other levels. The hypothesis is specifically confirmed as follows:

1. Across the three case studies combined, basic level terms are more likely to have equivalents in other knowledge organization and representation tools than are non-basic level terms (subordinate and superordinate terms combined) (χ² = 25.24047, df = 1, α = .001).
2. Across the three case studies combined, basic level terms are more likely to have equivalents in other knowledge organization and representation tools than are subordinate terms (χ² = 21.45006, df = 1, α = .001).
3. Across the three case studies combined, basic level terms are more likely to have equivalents in other knowledge organization and representation tools than are superordinate terms (χ² = 22.41596, df = 1, α = .001).
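The text does not spell out how the tables behind these tests were constructed. The sketch below assumes that each comparison collapses the five categories into "Equivalent (or nearly so)" versus all other categories and applies a Pearson chi-square without continuity correction to the resulting 2×2 table; under that assumption the three reported statistics are reproduced from the cumulative counts. The function and constant names are hypothetical, and 10.828 is the standard critical value for df = 1 at the .001 level.

```python
def pearson_chi_square_2x2(a, b, c, d):
    """Pearson chi-square, no continuity correction, for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# (equivalent, not equivalent) per level, from the cumulative chart.
# "Not equivalent" is the sum of the four remaining categories (an assumption).
BASIC = (42, 4)
SUBORDINATE = (41, 12 + 2 + 4 + 22)
SUPERORDINATE = (39, 19 + 3 + 4 + 14)
NON_BASIC = (
    SUBORDINATE[0] + SUPERORDINATE[0],
    SUBORDINATE[1] + SUPERORDINATE[1],
)

CRITICAL_DF1_ALPHA_001 = 10.828  # chi-square critical value, df = 1, alpha = .001

COMPARISONS = {
    "basic vs. non-basic": NON_BASIC,
    "basic vs. subordinate": SUBORDINATE,
    "basic vs. superordinate": SUPERORDINATE,
}

if __name__ == "__main__":
    for label, other in COMPARISONS.items():
        statistic = pearson_chi_square_2x2(*BASIC, *other)
        significant = statistic > CRITICAL_DF1_ALPHA_001
        print(f"{label}: chi-square = {statistic:.5f}, significant = {significant}")
```

Because every statistic is well above the critical value, the conclusion does not hinge on the exact choice of test under this reading; a continuity-corrected version would give somewhat smaller values.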

5. Conclusion

The confirmation of our hypothesis that full equivalence between tools occurs more often at the basic level than at either superordinate or subordinate levels for both knowledge representation schemes and knowledge organization schemes should illuminate future efforts to build universal tools of subject or semantic access. Specifically, it suggests that crosswalks between such tools should focus on mappings at the basic level, without attempting to impose a comprehensive mapping at all hierarchical levels.

Notes

1. The use of a part hierarchy results in three basic level terms appearing there.
2. In this case the lexical criteria are mixed, some pointing to vessel as the basic level and others to blood vessel.

References

Berlin, Brent. (1992). Ethnobiological Classification: Principles of Categorization of Plants and Animals in Traditional Societies. Princeton: Princeton University Press.
Bodenreider, Olivier & Carol A. Bean. (2001). Relationships among knowledge structures: Vocabulary integration within a subject domain. In C. A. Bean & R. Green (Eds.), Relationships in the Organization of Knowledge, 81-98. Dordrecht: Kluwer Academic Publishers.
Brown, Roger. (1958). How shall a thing be called? Psychological Review 65: 14-21.
[CLT/TCA] Canadian Literacy Thesaurus / Thesaurus canadien d'alphabétisation (2nd ed.). (1996). Montreal: Canadian Literacy Thesaurus Coalition. Available for searching: .
Hovy, Eduard. (2002). Comparing sets of semantic relations in ontologies. In R. Green, C. A. Bean, & S. H. Myaeng (Eds.), The Semantics of Relationships: An Interdisciplinary Perspective, 91-110. Dordrecht: Kluwer Academic Publishers.
Hudon, Michele. (1997). Multilingual thesaurus construction: Integrating the views of different cultures in one gateway to knowledge and concepts. Knowledge Organization 24/2: 84-91.
Hudon, Michele. (2001). Relationships in multilingual thesauri. In C. A. Bean & R. Green (Eds.), Relationships in the Organization of Knowledge, 67-80. Dordrecht: Kluwer Academic Publishers.
Lakoff, George. (1987). Women, Fire, and Dangerous Things: What Categories Reveal about the Mind. Chicago: University of Chicago Press.
[MeSH] Medical Subject Headings. (2002). Bethesda, MD: National Library of Medicine.
Riesthuis, Gerhard J. A. (2001). Information languages and multilingual subject access. Subject Retrieval in a Networked Environment: Papers presented at an IFLA Satellite Meeting, Dublin, Ohio, USA, 14-16 August 2001.
Rosch, Eleanor, Carolyn Mervis, Wayne Gray, David Johnson, & Penny Boyes-Braem. (1976). Basic objects in natural categories. Cognitive Psychology 8: 382-439.
[SNOMED] Cote, Roger A. (Ed.). (1998). Systematized Nomenclature of Human and Veterinary Medicine: SNOMED International (version 3.5). Northfield, IL: College of American Pathologists; Schaumburg, IL: American Veterinary Medical Association.