Artificial Intelligence in Medicine (2007) 39, 183-195
http://www.intl.elsevierhealth.com/journals/aiim
Investigating subsumption in SNOMED CT: An exploration into large description logic-based biomedical terminologies
Olivier Bodenreider a,*, Barry Smith b,c, Anand Kumar b, Anita Burgun d

a U.S. National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
b Institute for Formal Ontology and Medical Information Science, Saarland University, Germany
c Department of Philosophy, University at Buffalo, New York, USA
d EA 3888 Laboratoire d'Informatique Médicale, Université de Rennes I, France
Received 12 October 2004; received in revised form 10 December 2006; accepted 11 December 2006
KEYWORDS
Biomedical ontologies; SNOMED CT; Description logics; Ontological analysis

Summary

Objective: Formalisms based on one or other flavor of description logic (DL) are sometimes put forward as helping to ensure that terminologies and controlled vocabularies comply with sound ontological principles. The objective of this paper is to study the degree to which one DL-based biomedical terminology (SNOMED CT) does indeed comply with such principles.

Materials and methods: We defined seven ontological principles (for example: each class must have at least one parent, each class must differ from its parent) and examined the properties of SNOMED CT classes with respect to these principles.

Results: Our major results are: 31% of these classes have a single child; 27% have multiple parents; 51% do not exhibit any differentiae between the description of the parent and that of the child.

Conclusions: The applications of these principles to quality assurance for ontologies are discussed and suggestions are made for dealing with the phenomenon of multiple inheritance. The advantages and limitations of our approach are also discussed.

© 2006 Elsevier B.V. All rights reserved.
* Corresponding author. Tel.: +1 301 435 3246; fax: +1 301 480 3035. E-mail address: [email protected] (O. Bodenreider).
0933-3657/$ - see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.artmed.2006.12.003

1. Introduction

Biomedical terminologies and ontologies are increasingly taking advantage of description logic (DL)-based formalisms in representing knowledge. GALEN¹ and SNOMED Clinical Terms® (in what follows SNOMED CT)² were both developed in a native DL formalism. Several other groups have worked at converting existing terminologies into terminologies with a DL formalism, including the UMLS® Metathesaurus® [1-3] and Semantic Network [4], the Medical Subject Headings (MeSH) [5], the Gene Ontology™ [6] and the National Cancer Institute Thesaurus [7]. The Web Ontology Language (OWL) plug-in developed for the ontology editor Protégé now also allows developers of frame-based resources to export their ontologies into DL formalism.

¹ http://www.opengalen.org/ (accessed: 10 December 2006).
² http://www.snomed.org/ (accessed: 10 December 2006).

The validation of an ontology by a DL-based classifier serves to ensure compliance with certain rules of classification (e.g., absence of terminological cycles), and it also brings other benefits in terms of coherence checking and query optimization [8,9]. However, neither a DL formalism nor the use of a classifier can ensure compliance with all principles of a sound ontology [10].

The objective of this paper is to study the degree to which one DL-based biomedical terminology complies with a basic set of ontological principles. We selected SNOMED CT as the target for this evaluation because it is the most comprehensive biomedical terminology recently developed in native DL formalism. Another reason for our choice is that SNOMED CT is now available as part of the UMLS³ at no charge for UMLS licensees in the U.S. It is therefore likely to become widely used in medical information systems.

³ http://umlsks.nlm.nih.gov/ (accessed: 10 December 2006).

The paper is organized as follows. We first define a limited number of basic ontological principles with which biomedical ontologies are expected to be compliant. (These are in effect principles of good classification.) We then give a brief description of SNOMED CT, present the methods used to test the compliance of SNOMED CT with these principles, and summarize our results. Finally, we discuss the application of this method to quality assurance in ontologies and terminologies in general, laying special emphasis on the role of creating partitions in ontologies. The advantages and limitations of our approach are also discussed.

2. Background

2.1. Terms, classes, and instances

We shall refer to the nodes in SNOMED CT not as concepts but rather on the one hand as terms (where we are interested in the hierarchy itself, as a syntactic structure), and on the other hand as classes (where we are interested in the biological entities to which these terms refer). It is classes, not concepts, which stand in IS A, PART OF and similar relations in biomedical ontologies. Classes have instances. In the biomedical domain, instances are generally represented in health information systems (e.g., electronic patient records) or in reports of biomedical experiments (e.g., in the form of microarray data), while biomedical terminologies and ontologies are focused on what is general, on classes and their relations.

2.2. Relations among classes

The possible relations of class A to class B which are relevant to our purposes here are defined in Table 1. A is the root of a given taxonomy if and only if every class in the taxonomy is a child of A; conversely, A is a leaf of a given taxonomy if and only if A has no children.

2.3. Principles of classification

Scientific classification has evolved from Aristotle to Linnaeus to the large and varied classifications of modern times. Along the way, classification principles were elaborated. One such principle, resulting from the use of a unique fundamentum divisionis or single classificatory principle in differentiating the species of each successive genus, is that subclasses be mutually exclusive and jointly exhaustive [11]. Some other highly general organization and classification principles, which we believe rest on a wide consensus among those working on terminologies in biomedicine and elsewhere [12], are:

- Each hierarchy must have a single root.
- Each class (except for the root) must have at least one parent.

Table 1 Definition of the relations between classes A and B

Relation | Definition
A = B | A and B are the same entity (i.e., they have the same definition, and thus also the same family of instances at any given time)
A IS A B | A and B are classes and all instances of A are instances of B
A is a child of B | A IS A B; A ≠ B; and if A IS A C and C IS A B then A = C or C = B
A and B are siblings | There is some C of which A and B are both children and A ≠ B
A is a parent of B | B is a child of A
C is a differentia of A with respect to B | A IS A B; A ≠ B; and instances of A are marked out within the wider class B by the fact that they exemplify C
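As an illustration only, the relations of Table 1 can be sketched in Python under a simplifying assumption: each class is modelled extensionally, as the set of its instances at a given time. The class names, instance identifiers and function names below are hypothetical and are not drawn from SNOMED CT. Distinct names are assumed to denote distinct classes with distinct sets of instances.

```python
# Hypothetical toy taxonomy: each class is modelled as the set of its instances.
CLASSES = {
    "Disease": frozenset({"p1", "p2", "p3", "p4"}),
    "Infection": frozenset({"p1", "p2", "p3"}),
    "Viral infection": frozenset({"p1", "p2"}),
    "Bacterial infection": frozenset({"p3"}),
}


def is_a(a, b):
    """A IS A B: all instances of A are instances of B."""
    return CLASSES[a] <= CLASSES[b]


def is_child(a, b):
    """A is a child of B: A IS A B, A != B, and any C with
    A IS A C and C IS A B is either A or B (no class in between)."""
    if a == b or not is_a(a, b):
        return False
    return all(c in (a, b) for c in CLASSES if is_a(a, c) and is_a(c, b))


def is_parent(a, b):
    """A is a parent of B: B is a child of A."""
    return is_child(b, a)


def are_siblings(a, b):
    """A and B are siblings: A != B and some C has both as children."""
    return a != b and any(is_child(a, c) and is_child(b, c) for c in CLASSES)


def leaves():
    """A is a leaf if and only if A has no children."""
    return [c for c in CLASSES if not any(is_child(d, c) for d in CLASSES)]


if __name__ == "__main__":
    print(is_a("Viral infection", "Disease"))
    print(is_child("Viral infection", "Disease"))
    print(are_siblings("Viral infection", "Bacterial infection"))
    print(leaves())
```

In this sketch "Viral infection" IS A "Disease" without being its child, because "Infection" lies between the two. A real terminology would record IS A links explicitly rather than derive them from sets of instances, so the extensional model is a convenience for the example and not a description of how SNOMED CT is built.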