Text Mining and Ontologies in Biomedicine: Making Sense of Raw Text
Irena Spasic, Sophia Ananiadou, John McNaught and Anand Kumar
Date received (in revised form): 7th June 2005

Abstract
The volume of biomedical literature is increasing at such a rate that it is becoming difficult to locate, retrieve and manage the reported information without text mining, which aims to automatically distill information, extract facts, discover implicit links and generate hypotheses relevant to user needs. Ontologies, as conceptual models, provide the necessary framework for semantic representation of textual information. The principal link between text and an ontology is terminology, which maps terms to domain-specific concepts. This paper summarises different approaches in which ontologies have been used for text-mining applications in biomedicine.

Keywords: text mining, ontology, terminology, information extraction, information retrieval

Irena Spasic is a postdoctoral research associate in the School of Chemistry and the Manchester Interdisciplinary Biocentre at the University of Manchester. Her research interests include biomedical text mining, machine learning and bioinformatics.

Sophia Ananiadou is co-director of the UK National Centre for Text Mining and a Reader in Computer Science at the University of Salford. Her research interests are in the areas of computational terminology and biomedical text mining.

John McNaught is a Lecturer in the School of Informatics at the University of Manchester and an Associate Director of the UK National Centre for Text Mining. His research interests include information extraction and computational lexicography.

Anand Kumar is an Alexander von Humboldt research fellow in the Faculty of Medicine at the University of Leipzig and a member of the Institute for Formal Ontology and Medical Information Science at Saarland University in Saarbrücken. His research interests include medical and biomedical knowledge representation, data models and ontologies.

Irena Spasic, School of Chemistry, The University of Manchester, Sackville Street, PO Box 88, Manchester M60 1QD, UK
Tel: +44 (0)161 306 4414
Fax: +44 (0)161 306 4556
E-mail: [email protected]

© HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 6. NO 3. 239–251. SEPTEMBER 2005

INTRODUCTION
Text is the predominant medium for information exchange among experts [1]. The volume of biomedical literature is increasing at such a rate that it is difficult to efficiently locate, retrieve and manage relevant information without the use of text-mining (TM) applications. In order to share the vast amounts of biomedical knowledge effectively, textual evidence needs to be linked to ontologies as the main repositories of formally represented knowledge. Ontologies are conceptual models that aim to support consistent and unambiguous knowledge sharing and that provide a framework for knowledge integration [2]. An ontology links concept labels to their interpretations, ie specifications of their meanings including concept definitions and relations to other concepts [3]. Apart from relations such as is-a and part-of, generally present in almost any domain, ontologies also model domain-specific relations; eg has-location, clinically-associated-with and has-manifestation are relations specific to the biomedical domain. Therefore, ontologies reflect the structure of the domain and constrain the potential interpretations of terms. As such, ontologies can be used to support automatic semantic interpretation of textual information (Figure 1), and thus provide a basis for sophisticated TM.

Figure 1: Ontologies provide machine-readable descriptions of biomedical concepts and their relations. Linking domain-specific terms, ie textual representation of these concepts, to their descriptions in the ontologies provides a platform for semantic interpretation of textual information. An explicit semantic layer supported by the use of ontologies allows text to be mined for interpretable information about biomedical concepts as opposed to simple correlations discovered by mining textual data using statistical information about co-occurrences between targeted classes of biomedical terms. The knowledge extracted from text using advanced TM can then be curated and used to update the content of biomedical ontologies, which currently lag behind in their attempts to keep abreast of new knowledge owing to its rapid expansion

Table 1 lists some popular biomedical ontologies. Many such ontologies exhibit differing degrees of overlap, exhaustivity and specificity and indeed differing views over conceptual space. Therefore, TM applications that rely on multiple ontologies also need to include methods for mapping between such ontologies [4]. These methods, together with other biomedical applications (including TM) that rely on the use of ontologies, would benefit from a standard ontology language (eg using standard initiatives such as RDF [5] and OWL [6]). Still, even when a single standardised ontology is used, it is not always straightforward to link textual information with an ontology owing to the inherent properties of language. Two major obstacles are: (1) inconsistent and imprecise practice in the naming of biomedical concepts (terminology) [7], and (2) incomplete ontologies as a result of rapid knowledge expansion. Nonetheless, a comprehensive body of knowledge is currently stored in biomedical ontologies, which can be utilised in numerous ways by TM applications. Moreover, the results of TM can be curated and used to facilitate update of biomedical ontologies (Figure 1).

Table 1: Selected generic biomedical ontologies

Name      URL
UMLS      http://www.nlm.nih.gov/research/umls/
SNOMED    http://www.snomed.org/snomedct/
GENIA     http://www-tsujii.is.s.u-tokyo.ac.jp/genia/
GALEN     http://www.opengalen.org/about.html
TaO       http://imgproj.cs.man.ac.uk/tambis/
GO        http://www.geneontology.org/

OBO (Open Biomedical Ontologies) provides a more comprehensive list of ontologies and is available at http://obo.sourceforge.net/

In this paper the focus is only on the former aspect of the relation between text mining and ontologies, ie the problems, existing practice and prospects of using ontologies for different TM applications are reviewed. The section ‘Terminology’ focuses on the problem of linking text to ontologies. The section ‘Text mining’ provides an introduction to TM and discusses two of its principal tasks: information retrieval and information extraction. The ways in which ontologies can be used to support these applications are discussed separately in the following sections: ‘Information retrieval’ and ‘Information extraction’. The latter section is divided into three subsections. The first subsection deals with named entity recognition as a key step in information extraction. The following two subsections discuss information extraction systems depending on the degree to which they rely on the use of ontologies. Since many TM applications resort to the use of machine learning methods as a way of tackling the complexity of both natural language and biomedical knowledge, it is explained how ontologies can be used for this purpose in the section ‘Machine learning’. The conclusion completes the paper.

TERMINOLOGY
The principal link between text and an ontology is a terminology, which aims to map concepts to terms (Figure 2). A term is defined as a textual realisation of a specialised concept, eg gene, protein, disease. The introduction [...] ability of a natural language to express a single concept in a number of ways. For example, in biomedicine there are many synonyms for proteins, enzymes, genes, etc. Having six or seven synonyms for a single concept is not unusual in this domain [10]. The probability of two experts using the same term to refer to the same concept is less than 20 per cent [11]. In addition, biomedicine includes pharmacology, where numerous trademark names refer to the same compound (eg Advil, Brufen, Motrin, Nuprin and Nurofen all refer to ibuprofen).

Figure 2: Conceptual relations reflect the connections between the concepts denoted by the given terms. These relations may be general relations commonly found in every domain (e.g. is-a, part-of or similarity relation) or they can be confined to a specific domain (e.g. activation of receptors by hormones). Conceptual relations are encoded in ontologies. Term ambiguity and term variation represent specialisation of general lexical relations, namely synonymy and homonymy. These relations exist on the lexical level and do not describe the relations between the underlying concepts

Term ambiguity occurs when the same term is used to refer to multiple concepts. Ambiguity is an inherent feature of natural language. Words typically have multiple dictionary entries and the meaning of a word can be altered by its context. Sublanguages, as the languages confined to specialised domains [12], provide a context which generally reduces the level of ambiguity. However, biomedicine encompasses a plethora of subdomains, which is an additional cause for the high level of ambiguity in biomedical terminology. For example, the term promoter refers to a ‘binding site in a DNA chain at which RNA polymerase binds to initiate transcription of messenger RNA by one or more nearby structural genes’ in biology, while in chemistry it denotes a ‘substance that in very small amounts is able to increase the activity of