A Survey of Current Work in Biomedical Text Mining
Total Page:16
File Type:pdf, Size:1020Kb
Aaron M. Cohen, MD A survey of current work in is a postdoctoral fellow in the medical informatics programme at OHSU. Dr Cohen works in biomedical text mining the area of text mining, focusing on issues and Aaron M. Cohen and William R. Hersh applications important to Date received (in revised form): 25th October 2004 biomedical researchers. He was chairman of the W3C working group that produced Abstract version 2 of the Synchronized The volume of published biomedical research, and therefore the underlying biomedical Multimedia Integration knowledge base, is expanding at an increasing rate. Among the tools that can aid researchers in Language (SMIL 2.0). coping with this information overload are text mining and knowledge extraction. Significant William Hersh, MD progress has been made in applying text mining to named entity recognition, text classification, is Professor and Chair of the Department of Medical terminology extraction, relationship extraction and hypothesis generation. Several research Informatics & Clinical groups are constructing integrated flexible text-mining systems intended for multiple uses. The Epidemiology in the School of major challenge of biomedical text mining over the next 5–10 years is to make these systems Medicine at Oregon Health & useful to biomedical researchers. This will require enhanced access to full text, better Science University (OHSU) in Portland, Oregon. Dr Hersh’s understanding of the feature space of biomedical literature, better methods for measuring the research focuses on the usefulness of systems to users, and continued cooperation with the biomedical research development and evaluation of community to ensure that their needs are addressed. information retrieval systems for biomedical practitioners and researchers. INTRODUCTION: arising from different research specialties. BACKGROUND AND With the recent sequencing of the human Keywords: text-mining, PURPOSE genome, the addition of detailed genetic bioinformatics, natural The volume of published biomedical information to biomedical research makes language processing research, and therefore the underlying the situation even more complicated, biomedical knowledge base, is since genetics may play a role in almost all expanding at an increasing rate. While areas of health and disease and it is likely scientific information in general has that many connections between different been growing exponentially for several branches of medicine may be based on centuries,1 the absolute numbers specific related genomic mechanisms. to modern medicine are very The goal of biomedical research is to impressive. The MEDLINE 2004 discover knowledge and put it to practical database contains over 12.5 million use in the forms of diagnosis, prevention records, and the database is currently and treatment. Clearly with the current growing at the rate of 500,000 new rate of growth in published biomedical citations each year.2 With such research, it becomes increasingly likely explosive growth, it is extremely that important connections between challenging to keep up to date with all individual elements of biomedical of the new discoveries and theories knowledge that could lead toward even within one’s own field of practical use are not being recognised Aaron Michael Cohen, biomedical research. because there is no individual in a position Postdoctoral Fellow, Biomedical research is divided into to make the necessary connections. Department of Medical Informatics highly specialised fields and subfields, Methods must be established to aid and Clinical Epidemiology, School of Medicine, with poor communication between researchers and physicians in making Oregon Health & Science disciplines.3 While this may be a necessary more efficient use of the existing research University, pre-condition for the complex and and helping them take this research to 3181 S.W. Sam Jackson Park Road, Portland, OR 97239–309, USA detailed research that biomedical science the next step along the path to practical requires, it also tends to narrow the application. While manual curation and Tel: +1 503 494 0046 Fax: +1 503 494 4551 perspective, impeding the establishment indexing can be an aid to researchers E-mail: [email protected] of connections between discoveries searching for appropriate literature, a & HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 6. NO 1. 57–71. MARCH 2005 5 7 Cohen and Hersh recent study of the information content management methods to the vast of MEDLINE records by Kostoff et al.4 amount of biomedical knowledge that found a significant amount of conceptual exists in the literature as well as the free information present only in the abstract text fields of biomedical databases. field and missing from the MeSH terms. This paper surveys the state of the art in This is not surprising since the biomedical text mining over the past 18– MEDLINE indexers and the MeSH 24 months. The next section covers vocabulary, while broadly based, cannot current active areas of research, including be expected to represent all of the the specific problems that are being concepts of interest for all potential users. addressed and the approaches used. This is Clearly, the full text of biomedical followed by an examination of the current literature contains a wealth of issues and future challenges of biomedical information important to users that may text mining. not be completely captured by reviewers and curators. CURRENT AREAS OF Text mining and knowledge extraction RESEARCH The goal of biomedical are ways to aid researchers in coping with While other authors have proposed text mining is to shift information overload. Text mining is categorisations based on stages of the burden of differentiated from both information information extraction of increasing information overload retrieval (IR) and text summarisation (TS) sophistication,9 here recent work is from the researcher to the computer in that while IR and TS focus on the grouped pragmatically with separate larger units of text such as documents, categories for each distinct type of text- text mining operates at a finer level of mining task. This is because current work granularity and examines the relationships centres around several common text- between specific kinds of information mining themes. contained both within and between documents. Text mining is also Named entity recognition differentiated from full-blown natural At first glace, the task of named entity language processing (NLP) in that NLP recognition (NER) appears attempts to understand the meaning of straightforward. The goal is to identify, text as a whole, while text mining and within a collection of text, all of the knowledge extraction concentrate on instances of a name for a specific type of solving a specific problem in a specific thing: for example, all of the drug names domain identified a priori (possibly using within a collection of journal articles, or some NLP techniques in the process). For all of the gene names and symbols within Recognising biological example, text mining can aid database a collection of MEDLINE abstracts. entities in text allows curators by selecting articles most likely to Hansich and de Bruijn and coworkers9,10 for further extraction of 5,6 relationships and other contain information of interest, or believed that solving this problem would information by potential new treatments for migraine allow more complex text-mining tasks to identifying the key may be determined by looking for be addressed. The idea is that recognising concepts of interest pharmacological substances that are biological entities in text allows for associated with biological processes further extraction of relationships and associated with migraine.7,8 other information by identifying the key The goal of biomedical text mining is concepts of interest and allowing those therefore to allow researchers to identify concepts to be represented in some needed information more efficiently, consistent, normalised form. uncover relationships obscured by the This task has been challenging for sheer volume of available information, several reasons. First, there does not exist and in general shift the burden of a complete dictionary for most types of information overload from the researcher biological named entities, so simple text- to the computer by applying matching algorithms do not suffice. In algorithmic, statistical and data addition, the same word or phrase can 58 & HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 6. NO 1. 57–71. MARCH 2005 Current work in biomedical text mining refer to a different thing depending upon extending the Brill POS tagger11,16,17 to context (eg ferritin can be a biological include gene and protein names as a tag substance or a laboratory test). type with the system trained on 7,000 Conversely, many biological entities have hand-tagged sentences from biomedical several names (eg PTEN and MMAC1 text. AbGene then applies manually refer the same gene). Biological entities generated post-processing rules based on may also have multi-word names (eg lexical-statistical characteristics that help carotid artery), so the problem is further identify the context in which gene additionally complicated by the need to names are used and eliminate false determine name boundaries and resolve positives and negatives. The system overlap of candidate names. achieved a precision of 85.7 per cent at a Because of the potential utility and recall of 66.7 per cent. complexity of the problem, NER has In contrast to the tagging approach